This is 崔斯特's nineteenth original article.

Let's start with a Naruto track to settle the nerves (｡・`ω´･)

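I first came across Rules in the Scrapy documentation, but I did not understand what they meant. Later I saw Rules used in other people's examples, so I spent a good deal of time figuring them out.

Explanation:
A Rule defines how links are extracted. In the example this explanation was written for, two rules correspond to the paginated list pages and the detail pages respectively; the key point is that restrict_xpaths limits link extraction to a specific part of the page, so only those links are crawled next.

In my own words: first, Rules make pagination easy; second, they can crawl second-level pages, which amounts to opening each detail page and collecting its content. So with Rules, crawling pages in bulk becomes convenient.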

Official documentation

CrawlSpider example

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor


class MySpider(CrawlSpider):
    name = 'example.com'
    allowed_domains = ['example.com']
    start_urls = ['http://www.example.com']

    rules = (
        # Extract links matching 'category.php' (but not matching 'subsection.php')
        # and follow links from them (since no callback means follow=True by default).
        Rule(LinkExtractor(allow=(r'category\.php', ), deny=(r'subsection\.php', ))),
        # Extract links matching 'item.php' and parse them with the spider's method parse_item
        Rule(LinkExtractor(allow=(r'item\.php', )), callback='parse_item'),
    )

    def parse_item(self, response):
        self.logger.info('Hi, this is an item page! %s', response.url)
        # A bare scrapy.Item() declares no fields, so item['id'] = ... would raise KeyError.
        # A plain dict is used here instead.
        item = {}
        item['id'] = response.xpath('//td[@id="item_id"]/text()').re(r'ID: (\d+)')
        item['name'] = response.xpath('//td[@id="item_name"]/text()').extract()
        item['description'] = response.xpath('//td[@id="item_description"]/text()').extract()
        return item

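This spider starts crawling from the home page of example.com, collects category links and item links, and applies parse_item to the latter. For each item response, it extracts some data from the HTML with XPath and fills the item with it.

In practice

To understand this better, let's look at how Rules are used in real cases.

Douban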


rules = [Rule(LinkExtractor(allow=(r'https://movie.douban.com/top250\?start=\d+.*'))),
        Rule(LinkExtractor(allow=(r'https://movie.douban.com/subject/\d+')),
            callback='parse_item', follow=False)
]

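If you have used Django, you will notice that these rules look a lot like Django's URL routing (I have forgotten all my Django -_-!). In fact, they are simply regular expressions.

r'https://movie.douban.com/top250\?start=\d+.*' matches the pagination links.

https://movie.douban.com/subject/\d+ matches the links to individual movies.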
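To see what the two patterns accept, here is a minimal sketch using Python's re module. The sample URLs are illustrative assumptions, not taken from a real crawl, and classify is a hypothetical helper. It uses search rather than a full match, which is how LinkExtractor applies allow patterns.

```python
import re

# The two patterns from the Douban rules above.
PAGE_PATTERN = re.compile(r'https://movie.douban.com/top250\?start=\d+.*')
DETAIL_PATTERN = re.compile(r'https://movie.douban.com/subject/\d+')


def classify(url):
    """Hypothetical helper: report which rule, if any, would pick up this URL."""
    if PAGE_PATTERN.search(url):
        return 'list page'
    if DETAIL_PATTERN.search(url):
        return 'detail page'
    return 'ignored'


# Illustrative URLs (assumed shapes).
for url in [
    'https://movie.douban.com/top250?start=25&filter=',
    'https://movie.douban.com/subject/1292052/',
    'https://movie.douban.com/top250',
]:
    print(url, '->', classify(url))
```

The bare top250 URL is ignored by both rules because it carries no start= query parameter.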

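Lianjia

A crawler usually needs to pick up further links from a page and then crawl downwards layer by layer. Scrapy provides the LinkExtractor class for extracting links from pages. LinkExtractor is meant to be used inside the CrawlSpider class; compared with Spider, CrawlSpider mainly adds rules, where you can define extraction rules. Look at the following example, which crawls links on Lianjia: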

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
class LianjiaSpider(CrawlSpider):
    name = "lianjia"
    allowed_domains = ["lianjia.com"]
    start_urls = [
        "http://bj.lianjia.com/ershoufang/"
    ]
    rules = [
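        # Match the regular expression and handle the pagination pages.
        # Page numbers are digits, so the pattern uses \d+ (\s+ would only match whitespace).
        Rule(LinkExtractor(allow=(r'http://bj.lianjia.com/ershoufang/pg\d+/?$',)), callback='parse_item'),
        # Match the regular expression, add the results to the URL list, and set a request preprocessing function
        # Rule(FangLinkExtractor(allow=('http://www.lianjia.com/client/', )), follow=True, process_request='add_cookie')
    ]
    def parse_item(self, response):
        # Handle the response here, just like the parse method used before
        pass

In the same way, r'http://bj.lianjia.com/ershoufang/pg\d+/?$' matches the next-page links.

You can also use r'https://bj.lianjia.com/ershoufang/\d+.html' to match the detail page links.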
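A quick check with Python's re module shows why the pagination pattern needs \d+ rather than \s+: \s only matches whitespace, so it can never match a page number. This is a minimal sketch, and the sample URLs (including the listing ID) are made-up assumptions about what the links look like.

```python
import re

# \s+ expects whitespace after "pg", so it cannot match "pg2".
whitespace_pattern = re.compile(r'http://bj.lianjia.com/ershoufang/pg\s+$')
# \d+ expects digits; the optional trailing slash is an assumption about the URL shape.
digit_pattern = re.compile(r'http://bj.lianjia.com/ershoufang/pg\d+/?$')
# Detail page pattern from the text.
detail_pattern = re.compile(r'https://bj.lianjia.com/ershoufang/\d+.html')

page_url = 'http://bj.lianjia.com/ershoufang/pg2/'                    # illustrative
detail_url = 'https://bj.lianjia.com/ershoufang/101102030405.html'    # made-up ID

print(bool(whitespace_pattern.search(page_url)))  # False
print(bool(digit_pattern.search(page_url)))       # True
print(bool(detail_pattern.search(detail_url)))    # True
```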

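Learning the parameters

Rule object

A Rule object takes the following parameters:

link_extractor: the link extraction rule
callback: the callback invoked with the response of each request built from the links that link_extractor extracts
cb_kwargs: extra keyword arguments that are passed on to the callback
follow: whether links should in turn be extracted from the pages fetched through this rule (boolean). If False, the fetched pages are not processed for further links. The default is True when no callback is given, and False otherwise
process_links: a callback that receives the list of links extracted from each response, typically used for filtering
process_request: a preprocessing hook for each request (adding headers, cookies and so on)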

LinkExtractor

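Commonly used LinkExtractor parameters:

allow: extract links matching the regular expression(s)
deny: exclude links matching the regular expression(s) (takes precedence over allow)
allow_domains: allowed domains (str or list)
deny_domains: excluded domains (str or list)
restrict_xpaths: only extract links inside the regions selected by the XPath (str or list)
restrict_css: only extract links inside the regions selected by the CSS selector (str or list)
tags: tags to extract links from, a and area by default (str or list)
attrs: attributes to read links from, href by default (list)
unique: whether extracted links are de-duplicated (boolean)
process_value: a function applied to each extracted value (it runs before the allow filter)

See the official documentation for a detailed description of LinkExtractor's parameters.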

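Note: when writing CrawlSpider rules, avoid using parse as a callback, because CrawlSpider uses the parse method itself to implement its logic. If you override the parse method, the CrawlSpider will no longer work.

Finally, a silly mistake of my own. I have a habit with Scrapy: after creating a project, I cd straight into the directory and run the genspider command, and then...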

D:\Backup\桌面
λ scrapy startproject example
New Scrapy project 'example', using template directory 'c:\\users\\administrator\\appdata\\local\\programs\\python\\python36\\lib\\site-packages\\scrapy\\templates\\project', created in:
    D:\Backup\桌面\example
You can start your first spider with:
    cd example
    scrapy genspider example example.com
D:\Backup\桌面
λ cd example
D:\Backup\桌面\example
λ scrapy genspider em example.com
Created spider 'em' using template 'basic' in module:
  example.spiders.em

And then my em.py looked like this:

# -*- coding: utf-8 -*-
import scrapy
class EmSpider(scrapy.Spider):
    name = 'em'
    allowed_domains = ['example.com']
    start_urls = ['http://example.com/']
    def parse(self, response):
        pass

Note that Rules cannot be used at this point, because the base class is wrong. It should be

class EmSpider(CrawlSpider):

rather than class EmSpider(scrapy.Spider):

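Let's keep at it together!!!

The next part will probably cover what each Scrapy component does, along with this legendary diagram

Scrapy Learning Examples (3): Crawling Pages in Batches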