Spider scrapes only the last URL, not all of them

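I want to use Scrapy to scrape multiple URLs stored in a CSV file. My code runs (no errors are shown), but it scrapes only the last URL rather than all of them. Please tell me what I am doing wrong. I want to scrape all the URLs and save the scraped text together. I have already tried many of the suggestions I found on Stack Overflow. My code: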

import scrapy
from scrapy import Request
from ..items import personalprojectItem


class ArticleSpider(scrapy.Spider):
    name = 'articles'
    with open('C:\\Users\\Admin\\Documents\\Bhavya\\input_urls.csv') as file:
        for line in file:
            start_urls = line

            def start_requests(self):
                request = Request(url=self.start_urls)
                yield request

        def parse(self, response):
            item = personalprojectItem()
            article = response.css('div p::text').extract()
            item['article'] = article
            yield item

Answer from a uj5u.com user:

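Below is a minimal example of how to load a list of URLs from a file in a Scrapy project.

In the Scrapy project folder we have a text file, urls_list.txt, containing the following links: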

https://www.theguardian.com/technology/2022/nov/18/elon-musk-twitter-engineers-workers-mass-resignation
https://www.theguardian.com/world/2022/nov/18/iranian-protesters-set-fire-to-ayatollah-khomeinis-ancestral-home
https://www.theguardian.com/world/2022/nov/18/canada-safari-park-shooting-animals-two-charged

The spider code looks like this (again, a minimal example):

import scrapy


class GuardianSpider(scrapy.Spider):
    name = 'guardian'
    allowed_domains = ['theguardian.com']
    # One URL per line; strip the trailing newline and skip blank lines.
    with open('urls_list.txt') as f:
        start_urls = [line.strip() for line in f if line.strip()]

    def parse(self, response):
        # Headline and standfirst (the summary paragraph under the headline).
        title = response.xpath('//h1/text()').get()
        header = response.xpath('//div[@data-gu-name="standfirst"]//p/text()').get()
        yield {
            'title': title,
            'header': header
        }

If we run the spider with scrapy crawl guardian -o guardian_news.json, we get a JSON file like this:

[
{"title": "Elon Musk summons Twitter engineers amid mass resignations and puts up poll on Trump ban", "header": "Reports show nearly 1,200 workers left company after demand for \u2018long hours at high intensity\u2019, while Musk starts poll on whether to reinstate Donald Trump"},
{"title": "Iranian protesters set fire to Ayatollah Khomeini\u2019s ancestral home", "header": "Social media images show what is now a museum commemorating the Islamic Republic founder ablaze as protests continue"},
{"title": "Two Canadian men charged with shooting animals at safari park", "header": "Mathieu Godard and Jeremiah Mathias-Polson accused of breaking into Parc Omega in Quebec and killing three wild boar and an elk"}
]
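
As for why the spider in the question fetches only one page: the for loop in the class body rebinds start_urls on every iteration, so once the class has been created only the last line of the file is left, and start_requests builds a single Request from it. The sketch below reproduces that effect without Scrapy. The class names and the example.com URLs are made up for illustration.

```python
# Hypothetical stand-in for the lines of the input file.
LINES = [
    "https://example.com/a\n",
    "https://example.com/b\n",
    "https://example.com/c\n",
]


class Overwritten:
    # Mirrors the question: each pass rebinds the same class attribute.
    for line in LINES:
        start_urls = line


class Collected:
    # Collects every line instead, dropping the trailing newline.
    start_urls = [line.strip() for line in LINES]


print(repr(Overwritten.start_urls))  # only the last line survives
print(Collected.start_urls)          # all three URLs
```

Collecting the lines into a list, as the guardian spider does, is what lets Scrapy see every URL.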

The Scrapy documentation can be found here: https://docs.scrapy.org/en/latest/
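
To apply the same pattern to the CSV file from the question, the URLs can be read with the standard csv module before they are assigned to start_urls. This is a minimal sketch, assuming one URL per row in the first column, no header row, and UTF-8 text. The helper name load_urls and its column parameter are illustrative and not part of Scrapy.

```python
import csv


def load_urls(path, column=0):
    """Return the non-empty, whitespace-stripped URLs from one CSV column."""
    urls = []
    # newline='' is what the csv module documentation recommends for files.
    # utf-8-sig (an assumption about the input) also tolerates a byte-order
    # mark, which spreadsheet exports sometimes add.
    with open(path, newline='', encoding='utf-8-sig') as handle:
        for row in csv.reader(handle):
            # Skip blank rows and rows too short to contain the column.
            if len(row) > column and row[column].strip():
                urls.append(row[column].strip())
    return urls
```

In the spider, a class-level start_urls = load_urls('input_urls.csv') would then give Scrapy the whole list, and by default it issues one request per entry in start_urls, so no custom start_requests is needed.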

