Scrapy: after scraping a link, how to go on and scrape the content that link points to

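When scraping web data, probably the most common problem is this: once you have extracted a link, how do you keep going inside the same spider and scrape the page that link points to? When I looked, I could not find this covered in the official Scrapy documentation, which was really frustrating.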

Fortunately, I found this article online: http://www.hulufei.com/post/Some-Experiences-Of-Using-Scrapy

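It explains things fairly clearly, but it leaves out one case: what if I have scraped a group of links rather than a single link? Take the top-ten hot topics on Nanjing University's Lily BBS (南大小百合) as an example: http://bbs.nju.edu.cn/bbstop10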

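The page lists ten posts. After scraping each post's title and author, I need to follow each post's link and scrape its content. How? The short answer is to yield one Request per link, attach the partly filled item to it through meta, and finish the item in the callback. I won't explain much more than that; here is the code: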

# -*- coding: utf-8 -*-
from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector
from scrapy.utils.url import urljoin_rfc
from scrapy.http import Request
from datacrawler.items import bbsItem


class bbsSpider(BaseSpider):
    name = "bbs"
    allowed_domains = ["bbs.nju.edu.cn"]
    start_urls = ["http://bbs.nju.edu.cn/bbstop10"]

    def parseContent(self, content):
        #content = content.encode('utf8')
        authorIndex = content.index(unicode('信区', 'gbk'))
        author = content[4:authorIndex-2]
        boardIndex = content.index(unicode('标  题', 'gbk'))
        board = content[authorIndex+4:boardIndex-2]
        timeIndex = content.index(unicode('南京大学小百合站 (', 'gbk'))
        time = content[timeIndex+10:timeIndex+34]
        content = content[timeIndex+38:]
        return (author, board, time, content)

    def parse2(self, response):
        hxs = HtmlXPathSelector(response)
        item = response.meta['item']
        items = []
        content = hxs.select('/html/body/center/table[1]//tr[2]/td/textarea/text()').extract()[0]
        parseTuple = self.parseContent(content)
        item['author'] = parseTuple[0]
        item['board'] = parseTuple[1]
        item['time'] = parseTuple[2]
        item['content'] = parseTuple[3]
        return item

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        items = []
        title = hxs.select('/html/body/center/table/tr[position()>1]/td[3]/a/text()').extract()
        url = hxs.select('/html/body/center/table/tr[position()>1]/td[3]/a/@href').extract()
        for i in range(0, 10):
            item = bbsItem()
            item['link'] = urljoin_rfc('http://bbs.nju.edu.cn/', url[i])
            item['title'] = title[i][:-1]
            items.append(item)
        for item in items:
            yield Request(item['link'], meta={'item': item}, callback=self.parse2)

