Scraping Douban Top 250 Book Data

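Implementation steps
1. Project structure
2. Fetching the page data
3. Extracting key information from the page
4. Saving the data

1. Project structure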

2. Fetching the page data
The target URL is https://book.douban.com/top250

import requests
from bs4 import BeautifulSoup


def get_page(url):
    """Fetch the page at url, parse it, and return the parsed document."""
    # Send a browser-like User-Agent; sites often reject the default one from requests
    headers = {
        'User-Agent':'Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) '
                     'AppleWebKit/537.36 (KHTML, like Gecko) Chrome/114.0.0.0 '
                     'Mobile Safari/537.36 Edg/114.0.1823.43'
    }
    resp=requests.get(url,headers=headers)
    soup=BeautifulSoup(resp.text,'html.parser')
    return soup

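3. Extracting key information from the page
Take the parsed document returned by the previous step and extract each book's cover image, title, author, price, rating, and summary.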

from geturlcocument.get_document import get_page
import re
# Result lists, one per column
pictures=[]
names=[]
authors=[]
prices=[]
scores=[]
sums=[]
def get_single():
    # One URL per results page (25 books per page, 10 pages)
    urls = [f"https://book.douban.com/top250?start={num}" for num in range(0,250,25)]
    for url in urls:
        # Fetch and parse the page. get_page is assumed here to be the module
        # that defines get_page(); the project layout is not shown.
        text = get_page.get_page(url)
        # All book rows on the page
        all_tr = text.find_all(name="tr", attrs={"class": "item"})
        # Process each book row
        for tr in all_tr:
            # Fields: cover image, title, author, price, rating, summary
            # Cover image
            picture = tr.find(name="img")
            picture = picture.get('src')
            # print(picture)
            # Title (strip all whitespace)
            div = tr.find(name='div', attrs={'class': 'pl2'})
            name = div.find('a').text
            name = re.sub(r'\s+', '', name)
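            # Publication line; the author is its first '/'-separated field
            info = tr.find(name='p', attrs={'class': 'pl'}).text
            author = info.split('/')[0].strip()
            # Price is the last field of the same line, with the currency suffix removed
            price = info.split('/')[-1].strip()
            price = re.sub(r'元', '', price)
            # Rating
            score = tr.find(name='span', attrs={'class': 'rating_nums'}).text
            # Summary: find() returns None when a book has no summary,
            # so .text raises AttributeError
            try:
                summary = tr.find(name='span', attrs={'class': 'inq'}).text
            except AttributeError:
                summary = ''
            pictures.append(picture)
            names.append(name)
            authors.append(author)
            prices.append(price)
            scores.append(score)
            sums.append(summary)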
    data = {"picture": pictures,
        "name": names,
        "author": authors,
        "price": prices,
        "score": scores,
        "sum": sums
    }
    return data

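The collected data is stored in a dictionary and returned. The re library cleans up the extracted strings, and exception handling covers rows where a field is missing.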
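To make the string handling concrete, here is a minimal sketch of how the publication line could be split into an author and a price, and how a missing element could be turned into an empty string. The helper names `split_info` and `text_or_empty` and the sample line are assumptions for illustration; the real text comes from each row of the page and may not always follow this layout.

```python
import re


def split_info(info):
    """Split an 'author / ... / price' line into (author, price).

    Hypothetical helper: assumes fields are separated by '/',
    with the author first and the price last.
    """
    parts = [part.strip() for part in info.split('/')]
    author = parts[0]
    # Drop the currency suffix so that only the number remains
    price = re.sub(r'元', '', parts[-1])
    return author, price


def text_or_empty(node):
    """Return node.text, or '' when the element was not found (node is None)."""
    try:
        return node.text
    except AttributeError:
        return ''


# Sample line in the layout the page appears to use (assumed)
sample = "[美] 卡勒德·胡赛尼 / 李继宏 / 上海人民出版社 / 2006-5 / 29.00元"
author, price = split_info(sample)
print(author)
print(price)
```

Both the author and the price are read from the full line here. Once the line has been cut down to its first field, the price can no longer be recovered from it.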
4. Saving the data
Take the returned dictionary and load it into a pandas DataFrame.

from geturlcocument.get_single_docuemnt import get_single
import pandas as pd
# Get the data as a dictionary
data=get_single.get_single()
# Store the data in a pandas DataFrame and write it to a CSV file
df=pd.DataFrame(data)
df.to_csv('./books.csv',encoding='utf-8')
print('ending of data')
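For comparison, the same dictionary of equal-length lists could be turned into CSV with only the standard library. This is an alternative sketch, not the approach used above; the function name `dict_to_csv` and the sample data are hypothetical.

```python
import csv
import io


def dict_to_csv(data):
    """Serialize a dict of equal-length lists to CSV text, one row per book."""
    columns = list(data)
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator='\n')
    writer.writerow(columns)
    # zip pairs up the i-th entry of every list into one row
    writer.writerows(zip(*(data[col] for col in columns)))
    return buffer.getvalue()


# Made-up miniature of the dictionary returned by the extraction step
sample = {"name": ["Book A", "Book B"], "score": ["9.4", "9.1"]}
csv_text = dict_to_csv(sample)
print(csv_text)
```

Unlike `DataFrame.to_csv` with its default settings, this version does not write an index column.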

The project is complete!
