Unable to scrape with Selenium in Python because the page loads indefinitely

Problem description:

I am trying to extract the contents of some news articles. Some of the URLs require logging in to access the full content, so I decided to use Selenium to automate the login. However, I am not able to extract any content because the first URL takes forever to load and never reaches the point where the actual text extraction is done. It ends up throwing a timeout exception.

Here is my code:

import time

from fake_useragent import UserAgent
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from webdriver_manager.chrome import ChromeDriverManager

for url in url_list:
    chrome_options = Options()
    ua = UserAgent()
    userAgent = ua.random
    # Use a random user agent for each browser session
    chrome_options.add_argument(f'user-agent={userAgent}')
    driver = webdriver.Chrome(ChromeDriverManager().install(), options=chrome_options)
    driver.get(url)
    time.sleep(5)
    frame = driver.find_elements_by_xpath('//iframe[@id="wallIframe"]')
    # Some articles require going through a paywall and some don't
    if len(frame) == 0:
        text_elements = driver.find_elements_by_xpath('//section[@id="main-content"]//article//p')
        text = " ".join(x.text for x in text_elements)
    else:
        text = log_in(frame)
    driver.quit()

Although the code never reaches it, here is my log_in method:

def log_in(frame):
    # Uses the module-level driver, username and password
    driver.switch_to.frame(frame[0])
    driver.find_element_by_id("PAYWALL_V2_SIGN_IN").click()
    time.sleep(2)
    driver.find_elements_by_id("username")[0].send_keys(username)
    time.sleep(2)
    driver.find_elements_by_xpath('//button[text()="Continue"]')[0].click()
    time.sleep(1)
    driver.find_elements_by_id("password")[0].send_keys(password)
    time.sleep(1)
    # Note: click() returns None, so element is None here
    element = driver.find_elements_by_xpath('//button[@type="submit"]')[0].click()
    time.sleep(1)
    text = parse_text(element)
    # Return the text so the caller's "text = log_in(frame)" receives it
    return text

How can I get around this?

Instead of manually setting the timeout with time.sleep, you should use WebDriverWait along with expected_conditions; this way the action to be done on your element will be performed only when a certain condition is satisfied (for example if the element is visible or if the element is clickable).
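For intuition, an explicit wait is essentially a polling loop with a deadline: it re-checks a condition at short intervals and proceeds as soon as the condition holds, instead of always sleeping for a fixed time. The sketch below is a simplified stand-in written with the standard library only; it is not Selenium's implementation, and the names wait_until, condition and poll_interval are hypothetical.

```python
import time


def wait_until(condition, timeout=30.0, poll_interval=0.5):
    """Call condition() repeatedly until it returns a truthy value.

    Returns that value, or raises TimeoutError once `timeout` seconds
    have passed without the condition being met.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            # Condition met: return immediately, no extra sleeping
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout} seconds")
        time.sleep(poll_interval)
```

Compared with time.sleep(5), a loop like this returns as soon as the element shows up and fails with a clear error if it never does.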

from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
from selenium.common.exceptions import TimeoutException

try:
    # Wait up to 30 seconds for the paywall iframe to appear in the DOM
    frame = WebDriverWait(driver, 30).until(EC.presence_of_element_located((By.XPATH, '//iframe[@id="wallIframe"]')))
except TimeoutException:
    print("Element not found.")