
Scrapy Learning Examples (Part 1)

This is 崔斯特's fifteenth original article.


Hello, I'm back. From now on I'll post here, recording my learning journey.

Raise my head to sell bamboo rats; lower my head and whimper.

I'll be recording my experience learning Scrapy, and more importantly my understanding of it. Let's get started; the first step, of course, is to create a project!

I chose to scrape the news list on the Huxiu (huxiu.com) homepage.

1. Create the project

F:\Python\huxiu>scrapy startproject huxiu
New Scrapy project 'huxiu', using template directory 'c:\\users\\administrator\\appdata\\local\\programs\\python\\python36\\lib\\site-packages\\scrapy\\templates\\project', created in:
    F:\Python\huxiu\huxiu

You can start your first spider with:
    cd huxiu
    scrapy genspider example example.com
F:\Python\huxiu>cd huxiu
F:\Python\huxiu\huxiu>scrapy genspider huxiu huxiu.com
Cannot create a spider with the same name as your project
F:\Python\huxiu\huxiu>scrapy genspider HuXiu huxiu.com
Created spider 'HuXiu' using template 'basic' in module:
huxiu.spiders.HuXiu

Remember: a spider cannot share its name with the project, hence 'HuXiu' instead of 'huxiu'.

2. Define the Item

In items.py, create a class that inherits from scrapy.Item and declare its fields as scrapy.Field attributes.

import scrapy


class HuxiuItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    title = scrapy.Field()          # title
    link = scrapy.Field()           # link
    author = scrapy.Field()         # author
    introduction = scrapy.Field()   # summary
    time = scrapy.Field()           # publish time
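An aside that is not from the original post: a scrapy.Item behaves like a dict, which is why the spiders below can fill it with item['title'] = .... A quick sketch:

from huxiu.items import HuxiuItem

item = HuxiuItem()
item['title'] = 'some headline'   # assign fields like dict keys
print(dict(item))                 # {'title': 'some headline'}
# item['foo'] = 1 would raise KeyError: only declared Fields are allowed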

3. Write the Spider

The code is fairly self-explanatory.

Write the following in huxiu/spiders/HuXiu.py:

# -*- coding: utf-8 -*-
import scrapy
from huxiu.items import HuxiuItem


class HuxiuSpider(scrapy.Spider):
    name = 'HuXiu'
    allowed_domains = ['huxiu.com']
    start_urls = ['http://huxiu.com/']

    def parse(self, response):
        # Each entry in the homepage news feed sits in a mob-ctt block
        for s in response.xpath('//div[@class="mod-info-flow"]/div/div[@class="mob-ctt"]'):
            item = HuxiuItem()
            item['title'] = s.xpath('h2/a/text()')[0].extract()
            item['link'] = s.xpath('h2/a/@href')[0].extract()
            # Turn the relative link into an absolute URL (used in section 4)
            url = response.urljoin(item['link'])
            item['author'] = s.xpath('div/a/span/text()')[0].extract()
            item['introduction'] = s.xpath('div[2]/text()')[0].extract()
            item['time'] = s.xpath('div/span/text()')[0].extract()
            print(item)

Run the crawl from a terminal:

scrapy crawl HuXiu

Partial output: each item prints as a dict of the scraped fields.
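A tip that is not in the original post but is standard Scrapy practice: scrapy shell is the quickest way to debug XPath expressions like the ones above interactively.

scrapy shell 'http://huxiu.com/'

# Inside the shell, `response` is already bound to the fetched page:
response.xpath('//div[@class="mod-info-flow"]/div/div[@class="mob-ctt"]/h2/a/text()').extract()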

4. Deep crawling

Ha, I'm borrowing 造数's term here. It really just means following each news link and scraping the article's detail page.

# -*- coding: utf-8 -*-
import scrapy
from huxiu.items import HuxiuItem


class HuxiuSpider(scrapy.Spider):
    name = 'HuXiu'
    allowed_domains = ['huxiu.com']
    start_urls = ['http://huxiu.com/']

    def parse(self, response):
        for s in response.xpath('//div[@class="mod-info-flow"]/div/div[@class="mob-ctt"]'):
            item = HuxiuItem()
            item['title'] = s.xpath('h2/a/text()')[0].extract()
            item['link'] = s.xpath('h2/a/@href')[0].extract()
            url = response.urljoin(item['link'])
            item['author'] = s.xpath('div/a/span/text()')[0].extract()
            item['introduction'] = s.xpath('div[2]/text()')[0].extract()
            item['time'] = s.xpath('div/span/text()')[0].extract()
            # print(item)
            # Follow the absolute URL into the article's detail page
            yield scrapy.Request(url, callback=self.parse_article)

    def parse_article(self, response):
        item = HuxiuItem()
        detail = response.xpath('//div[@class="article-wrap"]')
        item['title'] = detail.xpath('h1/text()')[0].extract().strip()
        item['link'] = response.url
        item['author'] = detail.xpath('div[@class="article-author"]/span/a/text()')[0].extract()
        item['time'] = detail.xpath('div[@class="article-author"]/div[@class="column-link-box"]/span/text()')[0].extract()
        print(item)
        # string(.) concatenates the text of every tag inside the body div
        word = detail.xpath('div[5]')
        print(word[0].xpath('string(.)').extract()[0])
        yield item

Output: the item dict, followed by the article body text.
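Now that parse_article yields the items, they can be persisted instead of just printed. Scrapy's built-in feed exports cover the simple case (scrapy crawl HuXiu -o items.json); for more control, an item pipeline works too. Below is a minimal sketch following the standard pattern from the Scrapy docs; the JsonWriterPipeline name and the huxiu_items.jl filename are my own choices, not from the original post.

# pipelines.py
import json


class JsonWriterPipeline(object):
    """Append every scraped item to a JSON-lines file."""

    def open_spider(self, spider):
        self.file = open('huxiu_items.jl', 'w', encoding='utf-8')

    def close_spider(self, spider):
        self.file.close()

    def process_item(self, item, spider):
        # ensure_ascii=False keeps the Chinese text readable in the file
        self.file.write(json.dumps(dict(item), ensure_ascii=False) + '\n')
        return item

Enable it in settings.py with ITEM_PIPELINES = {'huxiu.pipelines.JsonWriterPipeline': 300}.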

One note on extracting the text spread across multiple tags with XPath: I referred to 解决:xpath取出指定多标签内所有文字text ("how to extract all text inside multiple tags with XPath"). Printing the full article body this way still hits occasional errors, so goose is worth trying as an alternative.
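For reference, a sketch of the two usual ways to flatten all descendant text out of one node (detail and the div[5] index are taken from the spider above):

body = detail.xpath('div[5]')
# 1) XPath string(.) flattens the element into one string:
text_a = body.xpath('string(.)').extract()[0]
# 2) .//text() collects the individual text nodes, joined in Python:
text_b = ''.join(body.xpath('.//text()').extract())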

Python-Goose - Article Extractor

>>> from goose import Goose
>>> from goose.text import StopWordsChinese
>>> url = 'http://www.bbc.co.uk/zhongwen/simp/chinese_news/2012/12/121210_hongkong_politics.shtml'
>>> g = Goose({'stopwords_class': StopWordsChinese})
>>> article = g.extract(url=url)
>>> print(article.cleaned_text[:150])
香港行政长官梁振英在各方压力下就其大宅的违章建筑(僭建)问题到立法会接受质询,并向香港民众道歉。
梁振英在星期二(1210日)的答问大会开始之际在其演说中道歉,但强调他在违章建筑问题上没有隐瞒的意图和动机。
一些亲北京阵营议员欢迎梁振英道歉,且认为应能获得香港民众接受,但这些议员也质问梁振英有
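Note that the snippet above dates from the Python 2 era of goose; on the Python 3.6 used in this post, the maintained goose3 fork is the drop-in equivalent. A sketch, assuming pip install goose3:

from goose3 import Goose
from goose3.text import StopWordsChinese

url = 'http://www.bbc.co.uk/zhongwen/simp/chinese_news/2012/12/121210_hongkong_politics.shtml'
# StopWordsChinese switches tokenization to Chinese stop words
g = Goose({'stopwords_class': StopWordsChinese})
article = g.extract(url=url)
print(article.cleaned_text[:150])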
