I'm trying to crawl bloomberg.com and find the links to all English news articles. The problem with the code below is that it does find a lot of articles from the first page, but then it just goes into a loop where it returns nothing, only occasionally coming back with a result.
from collections import deque
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.firefox.options import Options

visited = set()
to_crawl = deque()
to_crawl.append("https://www.bloomberg.com")

def crawl_link(input_url):
    options = Options()
    options.add_argument('--headless')
    browser = webdriver.Firefox(options=options)
    browser.get(input_url)
    elems = browser.find_elements(by=By.XPATH, value="//a[@href]")
    for elem in elems:
        # retrieve each href link and save it to the url_element variable
        url_element = elem.get_attribute("href")
        if url_element not in visited:
            to_crawl.append(url_element)
            visited.add(url_element)
        # save news articles
        if 'www.bloomberg.com/news/articles' in url_element:
            print(str(url_element))
            with open("result.txt", "a") as outf:
                outf.write(str(url_element) + "\n")
    browser.close()

while len(to_crawl):
    url_to_crawl = to_crawl.pop()
    crawl_link(url_to_crawl)
I've tried using a queue and then a stack, but the behavior is the same either way. I can't seem to accomplish what I'm after.
How do you crawl a site like this to scrape news article URLs?
Answer (from a uj5u.com user):
The approach you're using should work fine, but after running it myself I noticed a few things that were causing it to either hang or throw errors.
I made some adjustments and added inline comments to explain why.
from collections import deque
from selenium.common.exceptions import StaleElementReferenceException
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.firefox.options import Options

base = "https://www.bloomberg.com"
article = base + "/news/articles"

visited = set()

# this is so there aren't multiple entries
# for the same article in the `result.txt` file
articles = set()

to_crawl = deque()
to_crawl.append(base)

def crawl_link(input_url):
    options = Options()
    options.add_argument('--headless')
    browser = webdriver.Firefox(options=options)
    print(input_url)
    browser.get(input_url)
    elems = browser.find_elements(by=By.XPATH, value="//a[@href]")
    # this was the issue: before, this line came just after
    # `to_crawl.append()`, which prematurely added links to the
    # visited set, so those links were skipped over without
    # being crawled
    visited.add(input_url)
    for elem in elems:
        # an element can go stale if the page changes mid-loop,
        # so catch the error instead of crashing
        try:
            url_element = elem.get_attribute("href")
        except StaleElementReferenceException as err:
            print(err)
            continue
        # checks to make sure links aren't being crawled more than once
        # and that all the links are in the proper domain
        if base in url_element and all(url_element not in i for i in [visited, to_crawl]):
            to_crawl.append(url_element)
        # this checks that the link matches the article url pattern
        # and ensures no article link is recorded multiple times
        if article in url_element and url_element not in articles:
            articles.add(url_element)
            print(str(url_element))
            with open("result.txt", "a") as outf:
                outf.write(str(url_element) + "\n")
    browser.quit()  # guarantees the browser closes completely

while len(to_crawl):
    # popleft makes the deque a FIFO instead of a LIFO.
    # A queue would achieve the same thing.
    url_to_crawl = to_crawl.popleft()
    crawl_link(url_to_crawl)
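To make the queue-versus-stack point concrete: the only difference between the two is which end of the deque you pop from. Here is a minimal, self-contained illustration (the "page" strings are placeholders, not real crawl output):

from collections import deque

# hypothetical frontier, in the order the pages were discovered
frontier = ["page1", "page2", "page3"]

# FIFO (queue, breadth-first): crawls the oldest discoveries first
fifo = deque(frontier)
print([fifo.popleft() for _ in range(len(fifo))])  # ['page1', 'page2', 'page3']

# LIFO (stack, depth-first): crawls the newest discoveries first
lifo = deque(frontier)
print([lifo.pop() for _ in range(len(lifo))])      # ['page3', 'page2', 'page1']

For a site crawl like this, FIFO tends to stay near the homepage (where most article links live) instead of chasing one deep chain of links.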
運行 90 秒后,這是result.txt
https://gist.github.com/alexpdev/b7545970c4e3002b1372e26651301a23的輸出
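One further tweak worth considering, beyond what the answer above requires: launching a fresh headless Firefox for every URL is expensive. A sketch that reuses a single browser instance for the whole crawl (this restructuring is my assumption, not part of the original answer) might look like:

from collections import deque
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.firefox.options import Options

def crawl(start_url):
    options = Options()
    options.add_argument('--headless')
    browser = webdriver.Firefox(options=options)  # one browser for the whole crawl
    visited = set()
    to_crawl = deque([start_url])
    try:
        while to_crawl:
            url = to_crawl.popleft()
            visited.add(url)
            browser.get(url)
            for elem in browser.find_elements(by=By.XPATH, value="//a[@href]"):
                href = elem.get_attribute("href")
                if href and href.startswith(start_url) and href not in visited and href not in to_crawl:
                    to_crawl.append(href)
                    if "/news/articles" in href:
                        print(href)  # or append to result.txt as above

    finally:
        browser.quit()  # always shut the browser down, even on errors

crawl("https://www.bloomberg.com")

The same stale-element handling from the answer would still apply in real use; it's omitted here to keep the sketch short.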
Tags: Python python-3.x selenium web web-crawler