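I want to use Scrapy to scrape multiple URLs stored in a CSV file. My code runs (no errors are shown), but it only scrapes the last URL instead of all of them. Here is my code. Please tell me what I'm doing wrong. I want to scrape all the URLs and save the scraped text together. I have already tried many suggestions found on StackOverflow. My code: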
import scrapy
from scrapy import Request
from ..items import personalprojectItem

class ArticleSpider(scrapy.Spider):
    name = 'articles'
    with open('C:\\Users\\Admin\\Documents\\Bhavya\\input_urls.csv') as file:
        for line in file:
            start_urls = line

    def start_requests(self):
        request = Request(url=self.start_urls)
        yield request

    def parse(self, response):
        item = personalprojectItem()
        article = response.css('div p::text').extract()
        item['article'] = article
        yield item
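Answer:
Below is a minimal example of how to load a list of URLs from a file in a Scrapy project.
We have a text file named urls_list.txt in the Scrapy project folder, containing the following links: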
https://www.theguardian.com/technology/2022/nov/18/elon-musk-twitter-engineers-workers-mass-resignation
https://www.theguardian.com/world/2022/nov/18/iranian-protesters-set-fire-to-ayatollah-khomeinis-ancestral-home
https://www.theguardian.com/world/2022/nov/18/canada-safari-park-shooting-animals-two-charged
The spider code looks like this (again, a minimal example):
import scrapy


class GuardianSpider(scrapy.Spider):
    name = 'guardian'
    allowed_domains = ['theguardian.com']
    # One URL per line; strip trailing newlines and skip blank lines.
    with open('urls_list.txt', 'r') as f:
        start_urls = [line.strip() for line in f if line.strip()]

    def parse(self, response):
        # Headline and standfirst (summary paragraph) of each article.
        title = response.xpath('//h1/text()').get()
        header = response.xpath('//div[@data-gu-name="standfirst"]//p/text()').get()
        yield {
            'title': title,
            'header': header
        }
If we run the spider with scrapy crawl guardian -o guardian_news.json, we get a JSON file like this:
[
{"title": "Elon Musk summons Twitter engineers amid mass resignations and puts up poll on Trump ban", "header": "Reports show nearly 1,200 workers left company after demand for \u2018long hours at high intensity\u2019, while Musk starts poll on whether to reinstate Donald Trump"},
{"title": "Iranian protesters set fire to Ayatollah Khomeini\u2019s ancestral home", "header": "Social media images show what is now a museum commemorating the Islamic Republic founder ablaze as protests continue"},
{"title": "Two Canadian men charged with shooting animals at safari park", "header": "Mathieu Godard and Jeremiah Mathias-Polson accused of breaking into Parc Omega in Quebec and killing three wild boar and an elk"}
]
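As for why the spider in the question fetched only the last URL: the loop in the class body rebinds start_urls on every line, so only the final line survives, and it ends up as a single string rather than a list. A small standalone sketch of the difference follows. It does not need Scrapy, and the example.com URLs and the in-memory file are placeholders of mine, not taken from the question:

```python
import io

# Stand-in for the CSV file: one URL per line (placeholder URLs).
fake_file = io.StringIO(
    "https://example.com/a\n"
    "https://example.com/b\n"
    "https://example.com/c\n"
)

# What the question's loop does: each pass overwrites the previous value.
for line in fake_file:
    overwritten = line

# Collecting instead: strip the newline and keep every non-empty line.
fake_file.seek(0)
collected = [line.strip() for line in fake_file if line.strip()]

print(repr(overwritten))  # a single string: the last line, newline included
print(collected)          # a list holding every URL
```

Building a list, as in the second half, is what lets Scrapy schedule one request per URL.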
The Scrapy documentation can be found here: https://docs.scrapy.org/en/latest/
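For a CSV file like the one in the question, Python's standard csv module can do the reading. The helper below is only a hypothetical sketch: the name read_urls, the assumption that the URL sits in the first column, and the rule that rows not starting with "http" are treated as headers and skipped are all my own, since the question does not show the file's layout.

```python
import csv


def read_urls(path, column=0):
    """Return the URLs found in one column of a CSV file (hypothetical helper)."""
    urls = []
    with open(path, newline='', encoding='utf-8') as f:
        for row in csv.reader(f):
            # Skip blank rows and rows too short to contain the column.
            if len(row) <= column:
                continue
            value = row[column].strip()
            # Assumption: anything not starting with "http" is a header or junk.
            if value.startswith('http'):
                urls.append(value)
    return urls
```

In a spider, the result could then be assigned as start_urls = read_urls('input_urls.csv'), or looped over inside start_requests with one yield Request(url=url) per URL.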