I'm reading a book called "Learn Python by Building Data Science Applications", and one of its chapters covers web scraping, which I fully admit I haven't played with before. I've reached the part that discusses unordered lists and how to use them, and my code is producing an error that doesn't make sense to me:
Traceback (most recent call last):
  File "/Users/gillian/100-days-of-code/Learn-Python-by-Building-Data-Science-Applications/Chapter07/wiki2.py", line 77, in <module>
    list_element = front.find_next_siblings("div", "div-col columns column-width")[0].ul
IndexError: list index out of range
My first thought was that the page no longer has unordered lists, but I checked, and... it does. My interpretation of the error is that it isn't returning a list, but I can't figure out how to test that, and I fully admit that recursion makes my head spin; it's not my strongest area.
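One way to test it is to inspect what find_next_siblings actually returns for one header. A minimal diagnostic sketch (mine, not from the book), assuming the fronts list built in the full script below:

# Look at the div siblings of the first front header without any class filter.
siblings = fronts[0].find_next_siblings("div")
print(len(siblings))          # an empty list here would explain the IndexError on [0]
for div in siblings[:3]:
    print(div.get("class"))   # the real class attribute, to compare against the filter string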
My full code is attached (including the notes I made, hence the large number of comments):
'''scrapes list of WWII battles'''
import requests as rq
base_url = 'https://en.wikipedia.org/wiki/List_of_World_War_II_battles'
response = rq.get(base_url)
'''access the raw content of a page with response.content'''
from bs4 import BeautifulSoup
soup = BeautifulSoup(response.content, 'html.parser')
def get_dom(url):
    response = rq.get(url)
    response.raise_for_status()
    return BeautifulSoup(response.content, 'html.parser')
'''3 ways to search for an element:
1. find
2. find_all
3. select
for 1 and 2 you pass an object type and attributes, maybe,
a recursive argument defines if the search should be recursive
First method retrieves first occurrence
Second method will always return a list with all elements
select will return a list and expects you to pass a single CSS selector string
this makes select easier to use, sometimes
(see the equivalence sketch after this script)
'''
content = soup.select('div#mw-content-text > div.mw-parser-output', limit=1)[0]
'''
collect corresponding elements for each front, which are all h2 headers
all fronts are sections - each with a title in h2 but hierarchically the titles are not nested within the sections
last title is citations and notes
one way is to just drop the last element or we can use a CSS Selector trick, which is to specify :not(:last-of-type) but that is less readable
'''
fronts = content.select('div.mw-parser-output>h2')[:-1]
for el in fronts:
    print(el.text[:-6])
'''getting the corresponding ul lists for each header
bs4 has a find_next_siblings method that works like find_all except that it will look in the document after each element
to get this all simultaneously, we'll need to use recursion
'''
def dictify(ul, level=0):
    result = dict()
    for li in ul.find_all("li", recursive=False):
        text = li.stripped_strings
        key = next(text)
        try:
            time = next(text).replace(':', '').strip()
        except StopIteration:
            time = None
        ul, link = li.find("ul"), li.find('a')
        if link:
            link = _abs_link(link.get('href'))
        r = {'url': link,
             'time': time,
             'level': level}
        if ul:
            r['children'] = dictify(ul, level=(level + 1))
        result[key] = r
    return result
theaters = {}
for front in fronts:
    list_element = front.find_next_siblings("div", "div-col columns column-width")[0].ul
    theaters[front.text[:-6]] = dictify(list_element)
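As an aside on the find / find_all / select notes in the comments above, here is a minimal equivalence sketch (my illustration, not part of the original script), reusing the soup object defined above:

# All three calls below locate the same content div on the page;
# find() returns one Tag, while find_all() and select() return lists.
via_find = soup.find('div', id='mw-content-text')
via_find_all = soup.find_all('div', id='mw-content-text', limit=1)[0]
via_select = soup.select('div#mw-content-text', limit=1)[0]
assert via_find is via_find_all is via_select  # same Tag object from the same tree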
If anyone has any input on how I can go about solving this, I would be grateful. Thank you.
A uj5u.com user replied:
The error means that .find_next_siblings found nothing. Try changing it to front.find_next_siblings("div", "div-col"). Also, _abs_link() isn't defined anywhere, so I removed it:
"""scrapes list of WWII battles"""
import requests as rq
base_url = "https://en.wikipedia.org/wiki/List_of_World_War_II_battles"
response = rq.get(base_url)
"""access the raw content of a page with response.content"""
from bs4 import BeautifulSoup
soup = BeautifulSoup(response.content, "html.parser")
def get_dom(url):
    response = rq.get(url)
    response.raise_for_status()
    return BeautifulSoup(response.content, "html.parser")
"""3 ways to search for an element:
1. find
2. find_all
3. select
for 1 and 2 you pass an object type and attributes, maybe,
a recursive argument defines if the search should be recursive
First method retrieves first occurrence
Second method will always return a list with all elements
select will return a list and expects you to pass a single CSS selector string
this makes select easier to use, sometimes
"""
content = soup.select("div#mw-content-text > div.mw-parser-output", limit=1)[0]
"""
collect corresponding elements for each front, which are all h2 headers
all fronts are sections - each with a title in h2 but hierarchically the titles are not nested within the sections
last title is citations and notes
one way is to just drop the last element or we can use a CSS Selector trick, which is to specify :not(:last-of-type) but that is less readable
"""
fronts = content.select("div.mw-parser-output>h2")[:-1]
for el in fronts:
    print(el.text[:-6])
"""getting the corresponding ul lists for each header
bs4 has a find_next_siblings method that works like find_all except that it will look in the document after each element
to get this all simultaneously, we'll need to use recursion
"""
def dictify(ul, level=0):
    result = dict()
    for li in ul.find_all("li", recursive=False):
        text = li.stripped_strings
        key = next(text)
        try:
            time = next(text).replace(":", "").strip()
        except StopIteration:
            time = None
        ul, link = li.find("ul"), li.find("a")
        if link:
            link = link.get("href")
        r = {"url": link, "time": time, "level": level}
        if ul:
            r["children"] = dictify(ul, level=(level + 1))
        result[key] = r
    return result
theaters = {}
for front in fronts:
    list_element = front.find_next_siblings("div", "div-col")[0].ul
    theaters[front.text[:-6]] = dictify(list_element)
print(theaters)
This prints:
{
    "African Front": {
        "North African campaign": {
            "url": "/wiki/North_African_campaign",
            "time": "June 1940 - May 1943",
            "level": 0,
            "children": {
                "Western Desert campaign": {
                    "url": "/wiki/Western_Desert_campaign",
                    "time": "June 1940 – February 1943",
                    "level": 1,
                    "children": {
                        "Italian invasion of Egypt": {
                            "url": "/wiki/Italian_invasion_of_Egypt",
                            "time": "September 1940",
                            "level": 2,
                        },
                        "Operation Compass": {
                            "url": "/wiki/Operation_Compass",
                            "time": "December 1940 – February 1941",
                            "level": 2,
                            "children": {
                                "Battle of Nibeiwa": {
                                    "url": "/wiki/Battle_of_Nibeiwa",
                                    "time": "December 1940",
                                    "level": 3,
                                },
...and so on.
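Note that with _abs_link() removed, the "url" values are site-relative (e.g. /wiki/North_African_campaign). If you want absolute links, here is a minimal sketch of what that helper might have been meant to do (my assumption about its intent, using the standard library's urljoin):

from urllib.parse import urljoin

def _abs_link(href, base="https://en.wikipedia.org"):
    # Resolve a site-relative href like "/wiki/Operation_Compass" against
    # the Wikipedia host; already-absolute hrefs pass through unchanged.
    return urljoin(base, href)

print(_abs_link("/wiki/Operation_Compass"))
# -> https://en.wikipedia.org/wiki/Operation_Compass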