I'm reading a book called "Learn Python by Building Data Science Applications", and one of its chapters covers web scraping, which I fully admit I haven't played with before. I've reached the part that discusses unordered lists and how to use them, and my code is producing an error that doesn't make sense to me:
Traceback (most recent call last):
  File "/Users/gillian/100-days-of-code/Learn-Python-by-Building-Data-Science-Applications/Chapter07/wiki2.py", line 77, in <module>
    list_element = front.find_next_siblings("div", "div-col columns column-width")[0].ul
IndexError: list index out of range
My first thought was that the page no longer has unordered lists, but I checked, and... it does. My interpretation of the error is that it isn't returning a list, but I can't figure out how to test that, and I fully admit that recursion makes my head spin; it's not my strongest area.
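One way to test it is to inspect what find_next_siblings actually returns for one header. A minimal diagnostic sketch (mine, not from the book), assuming the fronts list built in the full script below:

# Look at the div siblings of the first front header without any class filter.
siblings = fronts[0].find_next_siblings("div")
print(len(siblings))          # an empty list here would explain the IndexError on [0]
for div in siblings[:3]:
    print(div.get("class"))   # the real class attribute, to compare against the filter string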
My full code is attached (including the notes I made, hence the large number of comments):
'''scrapes list of WWII battles'''
import requests as rq
base_url = 'https://en.wikipedia.org/wiki/List_of_World_War_II_battles'
response = rq.get(base_url)
'''access the raw content of a page with response.content'''
from bs4 import BeautifulSoup
soup = BeautifulSoup(response.content, 'html.parser')
def get_dom(url):
    response = rq.get(url)
    response.raise_for_status()
    return BeautifulSoup(response.content, 'html.parser')
'''3 ways to search for an element:
1. find
2. find_all
3. select
for 1 and 2 you pass an object type and attributes, maybe,
a recursive argument defines if the search should be recursive
First method retrieves first occurrence
Second method will always return a list with all elements
select will return a list and expects you to pass a single CSS selector string
this makes select easier to use, sometimes
(see the equivalence sketch after this script)
'''
content = soup.select('div#mw-content-text > div.mw-parser-output', limit=1)[0]
'''
collect corresponding elements for each front, which are all h2 headers
all fronts are sections - each with a title in h2 but hierarchically the titles are not nested within the sections
last title is citations and notes
one way is to just drop the last element or we can use a CSS Selector trick, which is to specify :not(:last-of-type) but that is less readable
'''
fronts = content.select('div.mw-parser-output>h2')[:-1]
for el in fronts:
    print(el.text[:-6])
'''getting the corresponding ul lists for each header
bs4 has a find_next_siblings method that works like find_all except that it will look in the document after each element
to get this all simultaneously, we'll need to use recursion
'''
def dictify(ul, level=0):
    result = dict()
    for li in ul.find_all("li", recursive=False):
        text = li.stripped_strings
        key = next(text)
        try:
            time = next(text).replace(':', '').strip()
        except StopIteration:
            time = None
        ul, link = li.find("ul"), li.find('a')
        if link:
            link = _abs_link(link.get('href'))
        r = {'url': link,
             'time': time,
             'level': level}
        if ul:
            r['children'] = dictify(ul, level=(level + 1))
        result[key] = r
    return result
theaters = {}
for front in fronts:
    list_element = front.find_next_siblings("div", "div-col columns column-width")[0].ul
    theaters[front.text[:-6]] = dictify(list_element)
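As an aside on the find / find_all / select notes in the comments above, here is a minimal equivalence sketch (my illustration, not part of the original script), reusing the soup object defined above:

# All three calls below locate the same content div on the page;
# find() returns one Tag, while find_all() and select() return lists.
via_find = soup.find('div', id='mw-content-text')
via_find_all = soup.find_all('div', id='mw-content-text', limit=1)[0]
via_select = soup.select('div#mw-content-text', limit=1)[0]
assert via_find is via_find_all is via_select  # same Tag object from the same tree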
If anyone has any input on how I can go about solving this, I would be grateful. Thank you.
A uj5u.com user replied:
The error means that .find_next_siblings found nothing. Try changing it to front.find_next_siblings("div", "div-col"). Also, _abs_link() isn't defined anywhere, so I removed it:
"""scrapes list of WWII battles"""
import requests as rq
base_url = "https://en.wikipedia.org/wiki/List_of_World_War_II_battles"
response = rq.get(base_url)
"""access the raw content of a page with response.content"""
from bs4 import BeautifulSoup
soup = BeautifulSoup(response.content, "html.parser")
def get_dom(url):
    response = rq.get(url)
    response.raise_for_status()
    return BeautifulSoup(response.content, "html.parser")
"""3 ways to search for an element:
1. find
2. find_all
3. select
for 1 and 2 you pass an object type and attributes, maybe,
a recursive argument defines if the search should be recursive
First method retrieves first occurrence
Second method will always return a list with all elements
select will return a list and expects you to pass a single CSS selector string
this makes select easier to use, sometimes
"""
content = soup.select("div#mw-content-text > div.mw-parser-output", limit=1)[0]
"""
collect corresponding elements for each front, which are all h2 headers
all fronts are sections - each with a title in h2 but hierarchically the titles are not nested within the sections
last title is citations and notes
one way is to just drop the last element or we can use a CSS Selector trick, which is to specify :not(:last-of-type) but that is less readable
"""
fronts = content.select("div.mw-parser-output>h2")[:-1]
for el in fronts:
    print(el.text[:-6])
"""getting the corresponding ul lists for each header
bs4 has a find_next_siblings method that works like find_all except that it will look in the document after each element
to get this all simultaneously, we'll need to use recursion
"""
def dictify(ul, level=0):
    result = dict()
    for li in ul.find_all("li", recursive=False):
        text = li.stripped_strings
        key = next(text)
        try:
            time = next(text).replace(":", "").strip()
        except StopIteration:
            time = None
        ul, link = li.find("ul"), li.find("a")
        if link:
            link = link.get("href")
        r = {"url": link, "time": time, "level": level}
        if ul:
            r["children"] = dictify(ul, level=(level + 1))
        result[key] = r
    return result
theaters = {}
for front in fronts:
    list_element = front.find_next_siblings("div", "div-col")[0].ul
    theaters[front.text[:-6]] = dictify(list_element)
print(theaters)
This prints:
{
    "African Front": {
        "North African campaign": {
            "url": "/wiki/North_African_campaign",
            "time": "June 1940 - May 1943",
            "level": 0,
            "children": {
                "Western Desert campaign": {
                    "url": "/wiki/Western_Desert_campaign",
                    "time": "June 1940 – February 1943",
                    "level": 1,
                    "children": {
                        "Italian invasion of Egypt": {
                            "url": "/wiki/Italian_invasion_of_Egypt",
                            "time": "September 1940",
                            "level": 2,
                        },
                        "Operation Compass": {
                            "url": "/wiki/Operation_Compass",
                            "time": "December 1940 – February 1941",
                            "level": 2,
                            "children": {
                                "Battle of Nibeiwa": {
                                    "url": "/wiki/Battle_of_Nibeiwa",
                                    "time": "December 1940",
                                    "level": 3,
                                },
...and so on.
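Note that with _abs_link() removed, the "url" values are site-relative (e.g. /wiki/North_African_campaign). If you want absolute links, here is a minimal sketch of what that helper might have been meant to do (my assumption about its intent, using the standard library's urljoin):

from urllib.parse import urljoin

def _abs_link(href, base="https://en.wikipedia.org"):
    # Resolve a site-relative href like "/wiki/Operation_Compass" against
    # the Wikipedia host; already-absolute hrefs pass through unchanged.
    return urljoin(base, href)

print(_abs_link("/wiki/Operation_Compass"))
# -> https://en.wikipedia.org/wiki/Operation_Compass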