I'm using Scrapy, in particular its CrawlSpider class, to scrape web links that contain certain keywords. I have a pretty long start_urls list whose entries come from an SQLite database that is connected to a Django project. I want to save the scraped web links in this database.


I have two Django models, one for the start urls such as http://example.com and one for the scraped web links such as http://example.com/website1, http://example.com/website2 etc. All scraped web links are subsites of one of the start urls in the start_urls list.


The web links model has a many-to-one relation to the start url model, i.e. the web links model has a ForeignKey to the start urls model. In order to save my scraped web links properly to the database, I need to tell the CrawlSpider's parse_item() method which start url the scraped web link belongs to. How can I do that? Scrapy's DjangoItem class does not help in this respect, as I still have to define the start url explicitly.

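For reference, here is a minimal sketch of the two models described above; the names (StartUrl, WebLink and the url fields) are just placeholders, not my actual code:

from django.db import models

class StartUrl(models.Model):
    # one row per entry in the spider's start_urls list
    url = models.URLField()

class WebLink(models.Model):
    # a scraped link, e.g. http://example.com/website1
    url = models.URLField()
    # many-to-one: each scraped link belongs to exactly one start url
    start_url = models.ForeignKey(StartUrl, on_delete=models.CASCADE)

With models like these, start_urls can be filled with something like [s.url for s in StartUrl.objects.all()]; the open question is which StartUrl to attach to each WebLink when saving.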

In other words, how can I pass the currently used start url to the parse_item() method, so that I can save it together with the appropriate scraped web links to the database? Any ideas? Thanks in advance!


3 Solutions

#1


By default you cannot access the original start url.


But you can override the make_requests_from_url method and put the start url into the request's meta. Then you can extract it from there in parse (and if you yield subsequent requests in that parse method, don't forget to forward the start url in them).



I haven't worked with CrawlSpider, and maybe what Maxim suggests will work for you, but keep in mind that response.url contains the URL after any redirections.

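To illustrate that caveat with a small, untested sketch: after a redirect, response.url holds the final URL, while Scrapy's RedirectMiddleware records the originally requested URLs under the redirect_urls meta key (the callback name here is just an example):

def parse_item(self, response):
    final_url = response.url  # URL after any redirects
    # RedirectMiddleware stores the URLs the request went through:
    original_url = response.meta.get('redirect_urls', [response.url])[0]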

Here is an example of how I would do it, but it's just an example (adapted from the Scrapy tutorial) and has not been tested:


from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.http import Request
from scrapy.item import Item, Field
from scrapy.selector import HtmlXPathSelector


class MyItem(Item):
    # Placeholder item with the fields used below; in your case this would be
    # the DjangoItem for the web links model.
    id = Field()
    name = Field()
    description = Field()
    start_url = Field()


class MySpider(CrawlSpider):
    name = 'example.com'
    allowed_domains = ['example.com']
    start_urls = ['http://www.example.com']

    rules = (
        # Extract links matching 'category.php' (but not 'subsection.php') and
        # follow links from them (no callback means follow=True by default).
        Rule(SgmlLinkExtractor(allow=(r'category\.php', ), deny=(r'subsection\.php', ))),

        # Extract links matching 'item.php' and parse them with parse_item.
        Rule(SgmlLinkExtractor(allow=(r'item\.php', )), callback='parse_item'),
    )

    def parse(self, response):
        # Normally you should not override parse in a CrawlSpider, because the
        # CrawlSpider uses it to implement its rule logic. Here we deliberately
        # wrap it: we delegate to CrawlSpider.parse and only add the start_url
        # key to the meta of every request it yields, so the start url is
        # forwarded down the whole crawl.
        for request_or_item in CrawlSpider.parse(self, response):
            if isinstance(request_or_item, Request):
                request_or_item.meta['start_url'] = response.meta['start_url']
            yield request_or_item

    def make_requests_from_url(self, url):
        """Receive a URL and return a Request object to scrape.

        This method builds the initial requests for start_requests(); here it
        also records the start url in the request meta so it travels with the
        crawl.
        """
        return Request(url, dont_filter=True, meta={'start_url': url})

    def parse_item(self, response):
        self.log('Hi, this is an item page! %s' % response.url)

        hxs = HtmlXPathSelector(response)
        item = MyItem()
        item['id'] = hxs.select('//td[@id="item_id"]/text()').re(r'ID: (\d+)')
        item['name'] = hxs.select('//td[@id="item_name"]/text()').extract()
        item['description'] = hxs.select('//td[@id="item_description"]/text()').extract()
        item['start_url'] = response.meta['start_url']
        return item
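To connect this back to the Django side of the question, here is a hedged sketch of an item pipeline that could save each scraped link under its start url. It assumes the item also carries the scraped page's own URL (e.g. item['url'] = response.url set in parse_item) and that the Django app and models are called myapp, StartUrl and WebLink; those names are assumptions, not something defined above:

from myapp.models import StartUrl, WebLink  # hypothetical Django app and models

class SaveLinkPipeline(object):
    def process_item(self, item, spider):
        # look up (or create) the start url this link belongs to
        start, _ = StartUrl.objects.get_or_create(url=item['start_url'])
        # save the scraped link with its many-to-one relation
        WebLink.objects.get_or_create(url=item['url'], start_url=start)
        return item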

Ask if you have any questions. By the way, with PyDev's 'Go to definition' feature you can look into the Scrapy sources and see what parameters Request, make_requests_from_url, and other classes and methods expect. Getting into the code helps and saves you time, even though it might seem difficult at the beginning.

