I want to build a DataFrame of universities with their abbreviations and page links.
My code:
abb_url = 'https://en.wikipedia.org/wiki/List_of_colloquial_names_for_universities_and_colleges_in_the_United_States'
abb_html = requests.get(abb_url).content
abb_df_list = pd.read_html(abb_html)
Current result:
ValueError: No tables found
Expected result:
df =
| | university_full_name | uni_abb | uni_url |
|---|---|---|---|
| 0 | Albert Einstein College of Medicine | AECOM | https://en.wikipedia.org/wiki/Albert_Einstein_College_of_Medicine |
uj5u.com user reply:
That's a tricky page you've got there...
First, there really are no tables on it. Second, some institutions have no links, some have redirect links, and some use the same abbreviation for multiple institutions.
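You can confirm the "no tables" part without the network: pd.read_html only parses <table> elements, so any HTML that keeps its data in <ul>/<li> lists raises the same error. A minimal sketch (the one-item list below is made up for illustration):

```python
from io import StringIO

import pandas as pd

# pd.read_html only looks for <table> elements; this page keeps its
# data in <ul>/<li> lists, so pandas finds nothing to parse.
html_without_tables = "<ul><li>AECOM – Albert Einstein College of Medicine</li></ul>"

try:
    pd.read_html(StringIO(html_without_tables))
except ValueError as err:
    error_message = str(err)

print(error_message)  # e.g. "No tables found", as in the question
```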
So you need to bring out the heavy artillery: xpath...
import pandas as pd
import requests
from lxml import html as lh
url = "https://en.wikipedia.org/wiki/List_of_colloquial_names_for_universities_and_colleges_in_the_United_States"
response = requests.get(url)
doc = lh.fromstring(response.text)
rows = []
for uni in doc.xpath('//h2[./span[@class="mw-headline"]]//following-sibling::ul//li'):
    info = uni.text.split(' – ')
    abb = info[0]
    # for those w/ no links
    if not uni.xpath('.//a'):
        rows.append((abb, " ", info[1]))
    # now to account for those using the same abbreviation for multiple schools
    for a in uni.xpath('.//a'):
        dat = a.xpath('./@*')
        # for those with redirects
        if len(dat) == 3:
            del dat[1]
        link = f"https://en.wikipedia.org{dat[0]}"
        rows.append((abb, link, dat[1]))

# and now, at last, to the dataframe
cols = ['abb', 'url', 'full name']
df = pd.DataFrame(rows, columns=cols)
df
Output:
abb url full name
0 AECOM https://en.wikipedia.org/wiki/Albert_Einstein_... Albert Einstein College of Medicine
1 AFA https://en.wikipedia.org/wiki/United_States_Ai... United States Air Force Academy
etc.
Note: you can rearrange the order of the DataFrame's columns if you like.
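Reordering is just indexing with a list of column labels; a minimal sketch on a toy one-row frame (the row is taken from the output above):

```python
import pandas as pd

# A toy frame with the same columns the code above produces.
df = pd.DataFrame(
    [("AECOM",
      "https://en.wikipedia.org/wiki/Albert_Einstein_College_of_Medicine",
      "Albert Einstein College of Medicine")],
    columns=["abb", "url", "full name"],
)

# Indexing with a list of labels returns the columns in that order.
df = df[["full name", "abb", "url"]]
print(df.columns.tolist())  # ['full name', 'abb', 'url']
```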
uj5u.com user reply:
Select and iterate over only the expected <li> elements and extract their information, but note that one university has no <a> (SUI – State University of Iowa), so that case should be handled with an if-statement, as in the example:
for e in soup.select('h2 ul li'):
    data.append({
        'abb': e.text.split(' – ')[0],
        'full_name': e.text.split(' – ')[-1],
        'url': 'https://en.wikipedia.org' + e.a.get('href') if e.a else None
    })
Example
from bs4 import BeautifulSoup
import requests
import pandas as pd
url = "https://en.wikipedia.org/wiki/List_of_colloquial_names_for_universities_and_colleges_in_the_United_States"
response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')

data = []
for e in soup.select('h2 ul li'):
    data.append({
        'abb': e.text.split(' – ')[0],
        'full_name': e.text.split(' – ')[-1],
        'url': 'https://en.wikipedia.org' + e.a.get('href') if e.a else None
    })

pd.DataFrame(data)
Output:
| | abb | full_name | url |
|---|---|---|---|
| 0 | AECOM | Albert Einstein College of Medicine | https://en.wikipedia.org/wiki/Albert_Einstein_College_of_Medicine |
| 1 | AFA | United States Air Force Academy | https://en.wikipedia.org/wiki/United_States_Air_Force_Academy |
| 2 | Annapolis | United States Naval Academy | https://en.wikipedia.org/wiki/United_States_Naval_Academy |
| 3 | A&M | Texas A&M University, and others; see A&M | https://en.wikipedia.org/wiki/Texas_A&M_University |
| 4 | A&M-CC or A&M–Corpus Christi | Texas A&M University–Corpus Christi | https://en.wikipedia.org/wiki/Texas_A&M_University–Corpus_Christi |
...
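Since the SUI row ends up with url set to None, you can find such rows afterwards with isna(); a minimal sketch on toy rows mirroring the output above:

```python
import pandas as pd

# Toy rows mirroring the output above; SUI's <li> has no <a>, so url is None.
df = pd.DataFrame([
    {"abb": "AFA", "full_name": "United States Air Force Academy",
     "url": "https://en.wikipedia.org/wiki/United_States_Air_Force_Academy"},
    {"abb": "SUI", "full_name": "State University of Iowa", "url": None},
])

# isna() flags the rows whose <li> carried no link.
missing = df[df["url"].isna()]
print(missing["abb"].tolist())  # ['SUI']
```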
uj5u.com user reply:
There are no tables on this page, only lists. So the goal is to iterate over the <ul> and <li> tags, skipping the ones you are not interested in (the first and those after the 26th). You can extract a university's abbreviation this way:
uni_abb = li.text.strip().replace(' - ', ' – ').replace(' — ', ' – ').split(' – ')[0]
And to get the url, you have to access the 'href' and 'title' attributes inside the <a> tag:
for a in li.find_all('a', href=True):
    title = a['title']
    url = f"https://en.wikipedia.org{a['href']}"
Accumulate the extracted information into a list, and finally create the DataFrame by assigning appropriate column names.
Here is the full code, in which I use BeautifulSoup:
import requests
import pandas as pd
from bs4 import BeautifulSoup
abb_url = 'https://en.wikipedia.org/wiki/List_of_colloquial_names_for_universities_and_colleges_in_the_United_States'
abb_html = requests.get(abb_url).content
soup = BeautifulSoup(abb_html, 'html.parser')
l = []
for ul in soup.find_all("ul")[1:26]:
    for li in ul.find_all("li"):
        uni_abb = li.text.strip().replace(' - ', ' – ').replace(' — ', ' – ').split(' – ')[0]
        for a in li.find_all('a', href=True):
            l.append((a['title'], uni_abb, f"https://en.wikipedia.org{a['href']}"))

df = pd.DataFrame(l, columns=['university_full_name', 'uni_abb', 'uni_url'])
Result:
university_full_name uni_abb uni_url
0 Albert Einstein College of Medicine AECOM https://en.wikipedia.org/wiki/Albert_Einstein...
1 United States Air Force Academy AFA https://en.wikipedia.org/wiki/United_States_A...
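One detail worth noting: concatenating a base and an href in an f-string only yields a clean URL when exactly one of the two supplies the slash between them (a base ending in / plus an href starting with / produces //). urllib.parse.urljoin from the standard library handles the joining regardless; a minimal sketch:

```python
from urllib.parse import urljoin

base = "https://en.wikipedia.org/"
href = "/wiki/Albert_Einstein_College_of_Medicine"

# urljoin collapses the duplicate slash that f"{base}{href}" would create.
url = urljoin(base, href)
print(url)  # https://en.wikipedia.org/wiki/Albert_Einstein_College_of_Medicine
```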
Tags: python pandas dataframe url python-requests-html