I want to build a DataFrame of universities with their abbreviations and page links.
My code:
abb_url = 'https://en.wikipedia.org/wiki/List_of_colloquial_names_for_universities_and_colleges_in_the_United_States'
abb_html = requests.get(abb_url).content
abb_df_list = pd.read_html(abb_html)
Current result:
ValueError: No tables found
Expected result:
df =
| | university_full_name | uni_abb | uni_url |
|---|---|---|---|
| 0 | Albert Einstein College of Medicine | AECOM | https://en.wikipedia.org/wiki/Albert_Einstein_College_of_Medicine |
uj5u.com user reply:
That's a tricky page you've got there...
First, there really are no tables on it. Second, some institutions have no links, some have redirect links, and some use the same abbreviation for multiple institutions.
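You can confirm the "no tables" part without the network: pd.read_html only parses <table> elements, so any HTML that keeps its data in <ul>/<li> lists raises the same error. A minimal sketch (the one-item list below is made up for illustration):

```python
from io import StringIO

import pandas as pd

# pd.read_html only looks for <table> elements; this page keeps its
# data in <ul>/<li> lists, so pandas finds nothing to parse.
html_without_tables = "<ul><li>AECOM – Albert Einstein College of Medicine</li></ul>"

try:
    pd.read_html(StringIO(html_without_tables))
except ValueError as err:
    error_message = str(err)

print(error_message)  # e.g. "No tables found", as in the question
```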
So you need to bring out the heavy artillery: xpath...
import pandas as pd
import requests
from lxml import html as lh
url = "https://en.wikipedia.org/wiki/List_of_colloquial_names_for_universities_and_colleges_in_the_United_States"
response = requests.get(url)
doc = lh.fromstring(response.text)
rows = []
for uni in doc.xpath('//h2[./span[@class="mw-headline"]]//following-sibling::ul//li'):
    info = uni.text.split(' – ')
    abb = info[0]
    # for those w/ no links
    if not uni.xpath('.//a'):
        rows.append((abb, " ", info[1]))
    # now to account for those using the same abbreviation for multiple schools
    for a in uni.xpath('.//a'):
        dat = a.xpath('./@*')
        # for those with redirects
        if len(dat) == 3:
            del dat[1]
        link = f"https://en.wikipedia.org{dat[0]}"
        rows.append((abb, link, dat[1]))

# and now, at last, to the dataframe
cols = ['abb', 'url', 'full name']
df = pd.DataFrame(rows, columns=cols)
df
Output:
abb url full name
0 AECOM https://en.wikipedia.org/wiki/Albert_Einstein_... Albert Einstein College of Medicine
1 AFA https://en.wikipedia.org/wiki/United_States_Ai... United States Air Force Academy
etc.
Note: you can rearrange the order of the DataFrame's columns if you like.
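Reordering is just indexing with a list of column labels; a minimal sketch on a toy one-row frame (the row is taken from the output above):

```python
import pandas as pd

# A toy frame with the same columns the code above produces.
df = pd.DataFrame(
    [("AECOM",
      "https://en.wikipedia.org/wiki/Albert_Einstein_College_of_Medicine",
      "Albert Einstein College of Medicine")],
    columns=["abb", "url", "full name"],
)

# Indexing with a list of labels returns the columns in that order.
df = df[["full name", "abb", "url"]]
print(df.columns.tolist())  # ['full name', 'abb', 'url']
```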
uj5u.com user reply:
Select and iterate over only the expected <li> elements and extract their information, but note that one university has no <a> (SUI – State University of Iowa), so that case should be handled with an if-statement, as in the example:
for e in soup.select('h2 ul li'):
    data.append({
        'abb': e.text.split(' – ')[0],
        'full_name': e.text.split(' – ')[-1],
        'url': 'https://en.wikipedia.org' + e.a.get('href') if e.a else None
    })
Example
from bs4 import BeautifulSoup
import requests
import pandas as pd
url = "https://en.wikipedia.org/wiki/List_of_colloquial_names_for_universities_and_colleges_in_the_United_States"
response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')

data = []
for e in soup.select('h2 ul li'):
    data.append({
        'abb': e.text.split(' – ')[0],
        'full_name': e.text.split(' – ')[-1],
        'url': 'https://en.wikipedia.org' + e.a.get('href') if e.a else None
    })

pd.DataFrame(data)
Output:
| | abb | full_name | url |
|---|---|---|---|
| 0 | AECOM | Albert Einstein College of Medicine | https://en.wikipedia.org/wiki/Albert_Einstein_College_of_Medicine |
| 1 | AFA | United States Air Force Academy | https://en.wikipedia.org/wiki/United_States_Air_Force_Academy |
| 2 | Annapolis | United States Naval Academy | https://en.wikipedia.org/wiki/United_States_Naval_Academy |
| 3 | A&M | Texas A&M University, and others; see A&M | https://en.wikipedia.org/wiki/Texas_A&M_University |
| 4 | A&M-CC or A&M–Corpus Christi | Texas A&M University–Corpus Christi | https://en.wikipedia.org/wiki/Texas_A&M_University–Corpus_Christi |
...
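Since the SUI row ends up with url set to None, you can find such rows afterwards with isna(); a minimal sketch on toy rows mirroring the output above:

```python
import pandas as pd

# Toy rows mirroring the output above; SUI's <li> has no <a>, so url is None.
df = pd.DataFrame([
    {"abb": "AFA", "full_name": "United States Air Force Academy",
     "url": "https://en.wikipedia.org/wiki/United_States_Air_Force_Academy"},
    {"abb": "SUI", "full_name": "State University of Iowa", "url": None},
])

# isna() flags the rows whose <li> carried no link.
missing = df[df["url"].isna()]
print(missing["abb"].tolist())  # ['SUI']
```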
uj5u.com user reply:
There are no tables on this page, only lists. So the goal is to iterate over the <ul> and <li> tags, skipping the ones you are not interested in (the first and those after the 26th). You can extract a university's abbreviation this way:
uni_abb = li.text.strip().replace(' - ', ' – ').replace(' — ', ' – ').split(' – ')[0]
And to get the url, you have to access the 'href' and 'title' attributes inside the <a> tag:
for a in li.find_all('a', href=True):
    title = a['title']
    url = f"https://en.wikipedia.org{a['href']}"
Accumulate the extracted information into a list, and finally create the DataFrame by assigning appropriate column names.
Here is the full code, in which I use BeautifulSoup:
import requests
import pandas as pd
from bs4 import BeautifulSoup
abb_url = 'https://en.wikipedia.org/wiki/List_of_colloquial_names_for_universities_and_colleges_in_the_United_States'
abb_html = requests.get(abb_url).content
soup = BeautifulSoup(abb_html, 'html.parser')
l = []
for ul in soup.find_all("ul")[1:26]:
    for li in ul.find_all("li"):
        uni_abb = li.text.strip().replace(' - ', ' – ').replace(' — ', ' – ').split(' – ')[0]
        for a in li.find_all('a', href=True):
            l.append((a['title'], uni_abb, f"https://en.wikipedia.org{a['href']}"))

df = pd.DataFrame(l, columns=['university_full_name', 'uni_abb', 'uni_url'])
Result:
university_full_name uni_abb uni_url
0 Albert Einstein College of Medicine AECOM https://en.wikipedia.org/wiki/Albert_Einstein...
1 United States Air Force Academy AFA https://en.wikipedia.org/wiki/United_States_A...
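One detail worth noting: concatenating a base and an href in an f-string only yields a clean URL when exactly one of the two supplies the slash between them (a base ending in / plus an href starting with / produces //). urllib.parse.urljoin from the standard library handles the joining regardless; a minimal sketch:

```python
from urllib.parse import urljoin

base = "https://en.wikipedia.org/"
href = "/wiki/Albert_Einstein_College_of_Medicine"

# urljoin collapses the duplicate slash that f"{base}{href}" would create.
url = urljoin(base, href)
print(url)  # https://en.wikipedia.org/wiki/Albert_Einstein_College_of_Medicine
```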
Tags: python pandas dataframe url python-requests-html