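I am scraping pages from the Italian website that publishes new laws (Gazzetta Ufficiale), in order to save the final page that contains the text of each law.
I have a loop that builds the list of pages to download. Below is a complete working code example that shows the problem I am running into (the example has no loop; I simply run two "get" calls).
What is the best way to handle the rare pages that do not show the "Visualizza" (Show) button and instead display the required full text directly?
Hopefully the code is self-explanatory and well commented. Thanks in advance, and a super happy 2022!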
import time
from selenium import webdriver
from selenium.webdriver.support.ui import Select
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
driver = webdriver.Chrome("/Users/bob/Documents/work/scraper/scrape_gu/chromedriver")
# showing the "normal" behaviour
driver.get(
    "https://www.gazzettaufficiale.it/atto/vediMenuHTML?atto.dataPubblicazioneGazzetta=2021-01-02&atto.codiceRedazionale=20A07300&tipoSerie=serie_generale&tipoVigenza=originario"
)
# this page has a "Visualizza" button, find it and click it.
bottoni = WebDriverWait(driver, 10).until(
    EC.visibility_of_all_elements_located(
        (By.XPATH, '//*[@id="corpo_export"]/div/input[1]')
    )
)
time.sleep(5) # just to see the "normal" result with the "Visualizza" button
bottoni[0].click() # now click it and this shows the desired final webpage
time.sleep(5) # just to see the "normal" desired result
# but unfortunately some pages directly get to the end result WITHOUT the "Visualizza" button.
# as an example see the following get
# showing the problematic behaviour
driver.get(
    "https://www.gazzettaufficiale.it/atto/vediMenuHTML?atto.dataPubblicazioneGazzetta=2021-01-02&atto.codiceRedazionale=20A07249&tipoSerie=serie_generale&tipoVigenza=originario"
)  # get a law page
time.sleep(5)  # as you can see we are now on the final desired full page WITHOUT the Visualizza button
# hence the following code, identical to that above will fail and timeout
bottoni = WebDriverWait(driver, 10).until(
    EC.visibility_of_all_elements_located(
        (By.XPATH, '//*[@id="corpo_export"]/div/input[1]')
    )
)
time.sleep(5) # just to see the result
bottoni[0].click() # and this shows the desired final webpage
# and the program abends with the following message
#   File "/Users/bob/Documents/work/scraper/scrape_gu/temp.py", line 33, in <module>
#     bottoni = WebDriverWait(driver, 10).until(
#   File "/Users/bob/opt/miniconda3/envs/scraping/lib/python3.8/site-packages/selenium/webdriver/support/wait.py", line 80, in until
#     raise TimeoutException(message, screen, stacktrace)
# selenium.common.exceptions.TimeoutException: Message:
Answer from a uj5u.com user:
Catch the exception with a try/except block. If there is no button, extract the text directly (see: Handling Exceptions).
...
from selenium.common.exceptions import TimeoutException  # needed by the except clause below

urls = [
    'https://www.gazzettaufficiale.it/atto/vediMenuHTML?atto.dataPubblicazioneGazzetta=2021-01-02&atto.codiceRedazionale=20A07300&tipoSerie=serie_generale&tipoVigenza=originario',
    'https://www.gazzettaufficiale.it/atto/vediMenuHTML?atto.dataPubblicazioneGazzetta=2021-01-02&atto.codiceRedazionale=20A07249&tipoSerie=serie_generale&tipoVigenza=originario'
]
data = []
for url in urls:
    driver.get(url)
    try:
        # short wait: the button is either there almost immediately or not at all
        bottoni = WebDriverWait(driver, 1).until(
            EC.element_to_be_clickable(
                (By.XPATH, '//input[@value="Visualizza"]')
            )
        )
        bottoni.click()
    except TimeoutException:
        # no button: the page already shows the full text
        print('no bottoni -')
    finally:
        # in both cases the law text is now on the current page
        data.append(driver.find_element(By.XPATH, '//body').text)

# close the browser only after every URL has been processed
driver.close()
print(data)
...
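Answer from a uj5u.com user:
First of all, using Selenium for this task is overkill.
You could do the same thing with requests or aiohttp plus beautifulsoup, and it would be faster and easier to code.
Now, back to your question: there are a few solutions.
The simplest are: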
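- Catch the timeout exception: if the button is not found, parse the law text directly.
- Check whether the button exists before clicking it or parsing the page (shown here in the Java bindings' syntax; the Python method is find_elements):
!driver.findElements(By.id("corpo_export")).isEmpty()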
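The presence check in the list above can be sketched in Python roughly as follows. This is only a sketch under some assumptions: the helper names `click_if_present` and `page_text` are made up for illustration, the XPath for the button is borrowed from the other answer, and `driver` is assumed to be an already created Selenium WebDriver. It relies on `find_elements` returning an empty list when nothing matches, rather than raising.

```python
# Locator as a (strategy, value) pair. "xpath" is the string that
# selenium's By.XPATH constant holds, so By.XPATH works here as well.
VISUALIZZA = ("xpath", '//input[@value="Visualizza"]')


def click_if_present(driver, locator=VISUALIZZA):
    """Click the first element matching locator, if there is one.

    Returns True when a click happened and False when nothing matched.
    find_elements returns an empty list instead of raising, so no
    try/except is needed.
    """
    matches = driver.find_elements(*locator)
    if not matches:
        return False
    matches[0].click()
    return True


def page_text(driver):
    """Return the body text, clicking through the button page if needed."""
    click_if_present(driver)
    return driver.find_element("xpath", "//body").text
```

Inside the download loop this would be called as `data.append(page_text(driver))` right after `driver.get(url)`. Note that `find_elements` does not wait: if the button is rendered with a delay, this check may miss it, in which case the short `WebDriverWait` with a caught timeout is the safer choice.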
Then again, it would be easier to drop Selenium and use beautifulsoup.
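Without a browser, the first step is still to tell the two kinds of page apart. The sketch below does that on raw HTML, using the standard library's `html.parser` in place of beautifulsoup so that it runs without extra installs. It assumes the "Visualizza" button is present in the HTML the server sends (not injected later by JavaScript), which the thread does not confirm. The names `ButtonFinder` and `has_button` are hypothetical.

```python
from html.parser import HTMLParser


class ButtonFinder(HTMLParser):
    """Records whether the page has an <input> whose value equals label."""

    def __init__(self, label):
        super().__init__()
        self.label = label
        self.found = False

    def handle_starttag(self, tag, attrs):
        # Tag names arrive lower-cased; attrs is a list of (name, value) pairs.
        if tag == "input" and dict(attrs).get("value") == self.label:
            self.found = True


def has_button(html, label="Visualizza"):
    """Return True if the HTML contains an input button with that label."""
    finder = ButtonFinder(label)
    finder.feed(html)
    return finder.found
```

The HTML string could come from `requests.get(url).text`. How to follow the button without a browser depends on the form markup around it, which is not shown here, so that step is left out.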