在B站爬取坤坤的視頻，并獲取視頻基本資訊-有解無憂

在B站有許多坤坤的視頻，作為一名ikun，讓我們寫個爬蟲研究一下視頻的視頻的名字、鏈接、觀看次數、彈幕、發布時間以及作者，我們用selenium來實作這個爬蟲，由于要獲取的資料比較多，我們寫幾個函式來實作這個爬蟲，

先倒入需要用到的庫，包括selenium, time ,BeautifulSoup ,ChromeDriverManager，

打開及搜索函式：在這段代碼中，我們匯入了 `webdriver` 和 `ChromeDriverManager` 模塊，以便使用 ChromeDriver 控制 Chrome 瀏覽器，以及自動下載和安裝最新版本的 ChromeDriver，同時，我們還匯入了 `time` 模塊，以便在代碼中添加延遲，以便頁面加載完成，最后，我們還匯入了 `BeautifulSoup` 模塊，以便從網頁中提取資訊，

    from selenium import webdriver
    from webdriver_manager.chrome import ChromeDriverManager
    import time
    from bs4 import BeautifulSoup

先創建一個 ChromeDriver 實體，打開B站，在這里我們使用ChromeDriverManager().install()方法，他可以自動下載對應版本的chrome，防止因為Chrome版本不正確而報錯，（PS：直接用webdriver.Chrome()會使用已下載Chrome游覽器，他的版本可能與webdriver需要的版本不匹配這樣打開chrome的時候會出現閃退報錯）

    url="嗶哩嗶哩 (゜-゜)つロ 干杯~-bilibili"
    browser = webdriver.Chrome(ChromeDriverManager().install()) #創建一個 ChromeDriver 實體
    browser.set_window_size(1400, 900) #設定頁面大小
    browser.get(url) #訪問B站

我們可以用xpath方法來獲取目標操作的位置，chrome瀏覽器能夠自動提取xpath鏈接，我們使用 `browser.window_handles` 方法獲取當前視窗中的所有句柄，并切換到第二個視窗，以便獲取搜索結果頁面，

在打開B站首頁的時候，會出現一個提示登陸的頁面干擾我們，因此我們最好點擊一下首頁，重繪一下，

    first_page = browser.find_element_by_xpath("//*[@id='i_cecream']/div[2]/div[1]/div[1]/ul[1]/li[1]/a/span") #取得首頁按鈕的的位置
    first_page.click() #點擊首頁進行重繪

然后，我們模擬用戶在網站上搜索蔡徐坤籃球相關的視頻；點擊搜索欄，輸入"蔡徐坤籃球"，并進行搜索；最后定位到新頁面，

    input = browser.find_element_by_xpath("//*[@id='nav-searchform']/div[1]/input") #取得搜索欄的位置
    input.send_keys('蔡徐坤 籃球') #輸入內容
    submit = browser.find_element_by_xpath("//*[@id='nav-searchform']/div[2]") #取得搜索按鈕的位置  
    submit.click() #點擊搜索
    all_h = browser.window_handles #取得新頁面的代碼
    browser.switch_to.window(all_h[1]) #取得新頁面的代碼

打開及搜索函式的完整代碼：

def search(url="嗶哩嗶哩 (゜-゜)つロ 干杯~-bilibili"):
    try:
        browser = webdriver.Chrome(ChromeDriverManager().install())
        browser.set_window_size(1400, 900)
        browser.get(url)
        first_page = browser.find_element_by_xpath("//*[@id='i_cecream']/div[2]/div[1]/div[1]/ul[1]/li[1]/a/span")
        first_page.click()
        input = browser.find_element_by_xpath("//*[@id='nav-searchform']/div[1]/input")
        input.send_keys('蔡徐坤 籃球')
        submit = browser.find_element_by_xpath("//*[@id='nav-searchform']/div[2]")
        submit.click()
        all_h = browser.window_handles
        browser.switch_to.window(all_h[1])
        return browser
    except TimeoutError:
        return search(url)

翻頁函式：該函式使用 `find_element_by_xpath` 方法查找“下一頁”按鈕，并使用 `click` 方法單擊該按鈕，在單擊“下一頁”按鈕之后，該函式使用 `browser.window_handles` 方法獲取所有視窗句柄，并使用`browser.switch_to.window` 方法將焦點切換到搜索結果頁面，以便我們可以繼續提取視頻資訊，最后回傳頁面代碼資訊，如果沒有找到“下一頁”按鈕，該函式會遞回呼叫自身來查找“下一頁”按鈕，直到找到為止，

這里有一點需要注意，在取得下一頁按鈕的時候，最好利用“下一頁”的文本進行匹配，如果根據瀏覽器提供的位置資訊匹配，在某些頁面下會出現位置不匹配而導致程式例外，

`next = browser.find_element_by_xpath("//button[text()='下一頁']")

這段代碼較為簡單，直接給出完整代碼，

    def next_page(browser): #獲取下一頁
      try:
          next = browser.find_element_by_xpath("//button[text()='下一頁']") #下一頁按鈕
          next.click() #點擊按鈕
          time.sleep(1) #聽一秒等待重繪
          all_h1 = browser.window_handles #取得新頁面的代碼
          browser.switch_to.window(all_h1[1]) #取得新頁面的代碼
          return browser 
      except:
          next_page(browser)

在這個函式中，我們首先使用 `browser.page_source` 方法獲取當前頁面的 HTML 代碼，然后使用 `BeautifulSoup` 類決議 HTML 代碼，以提取視頻資訊，接著，我們使用 `xlwt` 庫將視頻資訊寫入到 Excel 檔案中，

我們還需要一個函式來提取頁面資訊，

先用BeautifulSoup決議頁面，用find方法找到\<class\_='video-list'>標簽，這里包含我們需要的所有視頻，然后用find\_all方法找到每個視頻的標簽\<class\_='bili-video-card'>,同時find\_all會幫助我們形成一個串列存盤資料，也就是我用到的videos：

    html = browser.page_source
    soup = BeautifulSoup(html, "lxml")
    videos = soup.find(class_='video-list').find_all(class_='bili-video-card')

遍歷視頻資訊串列，提取資訊（視頻的名字、鏈接、觀看次數、彈幕、發布時間以及作者），這里建議多嘗試，多列印幾次確保資訊準確：

            for item in videos:
                item_readers = item.find("span", class_="bili-video-card__stats--item").span.string
                item_danmu = item.find_all("span", class_="bili-video-card__stats--item")[1].span.string
                item_time = item.find("span", class_="bili-video-card__stats__duration").string
                item_title = item.find("h3").get("title")
                item_author = item.find("span", class_="bili-video-card__info--author").string
                item_link = item.find("a").get("href")[2:]
                print(item_title)

最后一步是儲存資訊：

sheet.write(n, 0, n)

sheet.write(n, 1, item_title)

sheet.write(n, 2, item_link)

sheet.write(n, 3, item_readers)

sheet.write(n, 4, item_danmu)

sheet.write(n, 5, item_time)

sheet.write(n, 6, item_author)

獲取資訊和儲存函式的完整代碼：

    def save_to_excel(browser,sheet,n,wb):
        try:
            html = browser.page_source
            soup = BeautifulSoup(html, "lxml")
            videos = soup.find(class_='video-list').find_all(class_='bili-video-card')

            for item in videos:
                item_readers = item.find("span", class_="bili-video-card__stats--item").span.string
                item_danmu = item.find_all("span", class_="bili-video-card__stats--item")[1].span.string
                item_time = item.find("span", class_="bili-video-card__stats__duration").string
                item_title = item.find("h3").get("title")
                item_author = item.find("span", class_="bili-video-card__info--author").string
                item_link = item.find("a").get("href")[2:]
                print(item_title)

                sheet.write(n, 0, n)
                sheet.write(n, 1, item_title)
                sheet.write(n, 2, item_link)
                sheet.write(n, 3, item_readers)
                sheet.write(n, 4, item_danmu)
                sheet.write(n, 5, item_time)
                sheet.write(n, 6, item_author)
                n += 1
            wb.save("cxk1.xls")
            print(n)
            return n
        except TimeoutError:
            save_to_excel(browser,sheet,n,wb) #存盤函式

最后寫一個主程式呼叫三個函式，在主程式中,我們要記得先設定一下存盤資料用到的EXCEL;

import xlwt

#設定excel
wb = xlwt.Workbook(encoding="utf-8",style_compression=0)
sheet = wb.add_sheet("b站爬去",cell_overwrite_ok=True)
sheet.write(0,0,"序號")
sheet.write(0,1,"名稱")
sheet.write(0,2,"鏈接")
sheet.write(0,3,"觀看次數")
sheet.write(0,4,"彈幕")
sheet.write(0,5,"時間")
sheet.write(0,6,"作者")

主程式完整代碼：

在呼叫主程式時，當我們進入收索資訊的第一頁時，可以先爬取第一資訊，然后利用一個for回圈取得其他頁面資訊，

#設定excel
wb = xlwt.Workbook(encoding="utf-8",style_compression=0)
sheet = wb.add_sheet("b站爬去",cell_overwrite_ok=True)
sheet.write(0,0,"序號")
sheet.write(0,1,"名稱")
sheet.write(0,2,"鏈接")
sheet.write(0,3,"觀看次數")
sheet.write(0,4,"彈幕")
sheet.write(0,5,"時間")
sheet.write(0,6,"作者")

n = 1
browser = search("嗶哩嗶哩 (゜-゜)つロ 干杯~-bilibili") #進入B站，并完成搜索
all_h = browser.window_handles
browser.switch_to.window(all_h[1]) #獲取頁面代碼

total = browser.find_element_by_xpath("//*[@id='i_cecream']/div/div[2]/div[2]/div/div/div/div[2]/div/div/button[9]").text #獲取總頁數，給for回圈做準備
# n = save_to_excel(browser,sheet,n,wb) #呼叫函式存盤頁面資料
# browser = next_page(browser) #翻頁
for i in range(1,int(total)+1):

    time.sleep(3) #等待加載
    n = save_to_excel(browser,sheet,n,wb)  # 呼叫函式存盤頁面資料
    browser = next_page(browser)  # 翻頁

完整代碼地址：github.com/zhangaynami…

轉載請註明出處，本文鏈接：https://www.uj5u.com/houduan/554334.html

標籤：Python

上一篇：Python第三方庫批量下載到本地，并離線批量安裝第三方庫

下一篇：返回列表