这是崔斯特的第十七篇原创文章

上一篇写的是采集虎嗅网首页的新闻数据，有朋友对我说，采集多页试试看。后来研究下，虎嗅网首页是POST加载，Form Data中携带参数，所以只需要带上一个循环就好了。这是我最初的想法，先让我们看看Scrapy中
如何采集无限滚动页面？

先举个栗子，采集网站是quotes

分析网页

下拉时，会发现更多新的请求，观察这些请求，返回的都是json数据，也就是我们所需的，再看看他们的不同，也就是参数的改变，完整链接是：

1
2
3

http://spidyquotes.herokuapp.com/api/quotes?page=2
http://spidyquotes.herokuapp.com/api/quotes?page=3
http://spidyquotes.herokuapp.com/api/quotes?page=4

这就很清晰了。

返回的是json，我们需要解析，然后提取数据，那我们如何知道最多有多少条json呢，文件已经告诉我们了：

has_next:true

创建项目

scrapy startproject quote
cd quote
scrapy genspider spiderquote http://spidyquotes.herokuapp.com/scroll

定义Item

查看网站，采集text、author和tags这三个

import scrapy
class QuoteItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    text = scrapy.Field()
    author = scrapy.Field()
    tag = scrapy.Field()

编写spider


# -*- coding: utf-8 -*-
import scrapy
import json
class SpiderquoteSpider(scrapy.Spider):
    name = 'spiderquote'
    quotes_base_url = 'http://spidyquotes.herokuapp.com/api/quotes?page=%s'
    start_urls = [quotes_base_url % 1]
    download_delay = 1.5
    def parse(self, response):
        data = json.loads(response.body)
        for item in data.get('quotes', []):
            yield {
                'text': item.get('text'),
                'author': item.get('author', {}).get('name'),
                'tags': item.get('tags'),
            }
        if data['has_next']:
            next_page = data['page'] + 1
            yield scrapy.Request(self.quotes_base_url % next_page)

运行爬虫，然后就可以看到结果了。

应用到虎嗅网

那么如何应用到虎嗅网呢？首先还是要去分析网页。

虎嗅网的参数有3个：

1
2
3

huxiu_hash_code:13a3a353c52d424e1e263dda4d594e59
page:3
last_dateline:1512026700

我们知道page就是翻页页码，huxiu_hash_code是一个不变的字符，last_dateline看起来像unix时间戳，验证确实如此。这个时间戳有必要带上吗，我想验证试试看。

在postman中测试，不带上last_dateline也是可以返回数据，并且这个json中已经告诉我们一共有多少页：

"total_page": 1654

在主函数中我们可以依葫芦画瓢


# -*- coding: utf-8 -*-
import scrapy
from huxiu.items import HuxiuItem
import json
from lxml import etree
class HuxiuSpider(scrapy.Spider):
    name = 'HuXiu'
    def start_requests(self):
        url = 'https://www.huxiu.com/v2_action/article_list'
        for i in range(1, 10):
        # FormRequest 是Scrapy发送POST请求的方法
            yield scrapy.FormRequest(
                url = url,
                formdata = {"huxiu_hash_code" : "13a3a353c52d424e1e263dda4d594e59", "page" : str(i)},
                callback = self.parse
            )
    def parse(self, response):
        item = HuxiuItem()
        data = json.loads(response.text)
        s = etree.HTML(data['data'])
        item['title'] = s.xpath('//a[@class="transition msubstr-row2"]/text()')
        item['link'] = s.xpath('//a[@class="transition msubstr-row2"]/@href')
        item['author'] = s.xpath('//span[@class="author-name"]/text()')
        item['introduction'] = s.xpath('//div[@class="mob-sub"]/text()')
        yield item

输出的数据有点难看，是一段一段的。。

因为data['data']是一段html文件，所以这里选择的是xpath，不清楚这里是否直接使用Scrapy的xpath解析工具，如果可以，欢迎在评论中告诉我。

本篇收获

Scrapy采集动态网站：分析网页
使用Scrapy模拟post请求方法，文档在这
刘亦菲好漂亮

待做事宜

完善文件保存与解析
全站抓取大概用了3分钟，速度有点慢

若想评论，先翻长城