Python2-Scrapy Learning (Part 3)

Continuing with Scrapy; this time the focus is on how to store the scraped data.
[Figure: crawled data as stored in MongoDB (mongo数据.png)]

Python2-Scrapy Learning (Part 2) covered how to extract data with XPath; the next step is saving that data with MongoDB or MySQL.

Scrapy data storage

Crawling FreeBuf search results

The data crawled in the previous chapter came from the news feed on the FreeBuf front page; this time we crawl the results returned by FreeBuf's search. P.S. the previous chapter accessed the site over HTTP; this time we go directly over HTTPS, so there is no need to worry about cookies.
Since the search endpoint returns JSON, XPath is not needed; parsing the JSON directly is enough to extract the data. Here is the spider code:

# -*- coding: utf-8 -*-

import scrapy
import time
import json

from freebuf.items import FreebufItem


class FreebufSpider(scrapy.Spider):
    name = 'freebuf'
    allowed_domains = ['freebuf.com']
    # Spider-level settings: enable only the MongoDB pipeline for this spider
    custom_settings = {
        'ITEM_PIPELINES': {
            'freebuf.pipelines.MongodbPipeline': 300,
        }
    }

    def start_requests(self):
        urls = [
            u'https://search.freebuf.com/search/find/?year=0&score=0&articleType=0&origin=0&tabType=1&content=攻击&page=1',
        ]
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        data = json.loads(response.body_as_unicode())
        if data["data"]["total"] != u'0':
            # Build the URL of the next result page by bumping the page parameter
            page = int(response.url[response.url.find('page=') + 5:]) + 1
            next_page = response.url[:response.url.find('page=') + 5] + str(page)
            for i in data["data"]["list"]:
                # Only keep articles published today; once an older article
                # shows up, stop paginating
                if i["time"] == time.strftime("%Y-%m-%d").decode('utf-8'):
                    item = FreebufItem()
                    item['source'] = 'freebuf'
                    item['title'] = i["title"]
                    item['url'] = i['url']
                    item['content'] = i['content']
                    item['time'] = i['time']
                    item['author'] = i['name']
                else:
                    next_page = None
                    continue
                yield item
            if next_page is not None:
                next_page = response.urljoin(next_page)
                yield scrapy.Request(next_page, callback=self.parse)
        self.log("Freebuf spider: %s" % response.url)

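The spider imports FreebufItem from freebuf/items.py. The original items.py is not shown in this post; below is a minimal sketch of what it presumably contains, declaring exactly the fields assigned in parse() above:

# -*- coding: utf-8 -*-
# freebuf/items.py -- sketch only, field list inferred from the spider above

import scrapy


class FreebufItem(scrapy.Item):
    source = scrapy.Field()   # fixed to 'freebuf' by the spider
    title = scrapy.Field()
    url = scrapy.Field()
    content = scrapy.Field()
    time = scrapy.Field()     # publication date string, e.g. '2018-01-01'
    author = scrapy.Field()   # taken from the JSON field 'name'
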
custom_settings holds spider-specific settings and takes precedence over the project settings. Here it tells Scrapy to process items with the MongodbPipeline class from pipelines.py, with an order value of 300 (the smaller the number, the higher the priority).
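
For example, if MysqlPipeline were enabled as well (a hypothetical configuration, not used in this project), the numbers would determine the order in which each item passes through the pipelines:

custom_settings = {
    'ITEM_PIPELINES': {
        # lower number = higher priority: items hit MongodbPipeline first,
        # then the (hypothetical) MysqlPipeline
        'freebuf.pipelines.MongodbPipeline': 300,
        'freebuf.pipelines.MysqlPipeline': 400,
    }
}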

Data processing

In pipelines.py I wrote two classes, MongodbPipeline and MysqlPipeline, which store the data into MongoDB and MySQL respectively. The MongoDB one has been tested and stores data without issues; the MySQL one has not been tested yet.

# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://doc.scrapy.org/en/latest/topics/item-pipeline.html

import pymongo
import pymysql

from scrapy import log
from scrapy.conf import settings
from twisted.enterprise import adbapi


class MongodbPipeline(object):
    def __init__(self):
        self.mongo_host = settings["MONGODB_HOST"]
        self.mongo_port = settings["MONGODB_PORT"]
        self.mongo_db = settings["MONGODB_DBNAME"]

    def open_spider(self, spider):
        self.client = pymongo.MongoClient(host=self.mongo_host, port=self.mongo_port)
        self.db = self.client[self.mongo_db]

    def close_spider(self, spider):
        self.client.close()

    def process_item(self, item, spider):
        # Each item goes into a collection named after its 'source' field
        info = dict(item)
        self.db[item['source']].insert_one(info)
        return item


class MysqlPipeline(object):
    def __init__(self):
        dbparms = dict(
            host=settings['MYSQL_HOST'],
            port=settings['MYSQL_PORT'],
            db=settings['MYSQL_DBNAME'],   # pymysql expects 'db', not 'dbname'
            user=settings['MYSQL_USER'],
            passwd=settings['MYSQL_PASSWORD'],
            charset='utf8',
            cursorclass=pymysql.cursors.DictCursor,
            use_unicode=True,
        )
        # Asynchronous connection pool so inserts do not block the reactor
        self.dbpool = adbapi.ConnectionPool("pymysql", **dbparms)

    def process_item(self, item, spider):
        query = self.dbpool.runInteraction(self.do_insert, item)
        log.msg("MySQL connect")
        query.addErrback(self.handle_error, item, spider)
        query.addBoth(lambda _: item)
        return query

    def handle_error(self, failure, item, spider):
        print failure

    def do_insert(self, cursor, item):
        cursor.execute(
            "insert into freebuf (title, url, content, time, author, source) "
            "values (%s, %s, %s, %s, %s, %s)",
            (item['title'], item['url'], item['content'], item['time'],
             item['author'], item['source']))

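The do_insert method above assumes a freebuf table already exists in MySQL. A minimal one-off sketch for creating it with pymysql (the column types are my own assumption, not taken from the original project):

# -*- coding: utf-8 -*-
# One-off table setup -- sketch only; column types are assumptions

import pymysql

conn = pymysql.connect(host='localhost', port=3306, user='root',
                       passwd='root', db='freebuf', charset='utf8')
try:
    with conn.cursor() as cursor:
        cursor.execute("""
            CREATE TABLE IF NOT EXISTS freebuf (
                id      INT AUTO_INCREMENT PRIMARY KEY,
                title   VARCHAR(255),
                url     VARCHAR(255),
                content TEXT,
                time    VARCHAR(32),
                author  VARCHAR(64),
                source  VARCHAR(32)
            ) DEFAULT CHARSET=utf8
        """)
    conn.commit()
finally:
    conn.close()
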
Database configuration

The project's settings.py needs to define the database host, port, username, password, and so on; simply append the configuration at the end of the file.

# MongoDB Config
MONGODB_HOST = 'localhost'
MONGODB_PORT = 27017
MONGODB_DBNAME = 'freebuf'

# MySQL Config
# MYSQL_HOST = 'localhost'
# MYSQL_PORT = 3306
# MYSQL_DBNAME = 'freebuf'
# MYSQL_USER = 'root'
# MYSQL_PASSWORD = 'root'

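After running scrapy crawl freebuf, a quick way to check that the items really landed in MongoDB is a short pymongo script (a sketch; the collection name is 'freebuf' because the spider sets item['source'] = 'freebuf'):

# -*- coding: utf-8 -*-
# Quick check of the stored data -- sketch only

import pymongo

client = pymongo.MongoClient(host='localhost', port=27017)
db = client['freebuf']                  # MONGODB_DBNAME from settings.py
print db['freebuf'].count()             # number of stored articles
for doc in db['freebuf'].find().limit(3):
    print doc['title'], doc['url']
client.close()
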
Summary

  1. Both the project and the spider itself can be configured, and spider-level settings take precedence over project-level ones. Full precedence order: Command line options (most precedence) > Settings per-spider > Project settings module > Default settings per-command > Default global settings (least precedence);
  2. response.urljoin builds the absolute URL of a page to crawl, and yielding a Request for it adds the page to the crawl queue; as long as the parsing rules are correct, the required pages keep getting queued (see the sketch after this list);
  3. Most Scrapy requests can be chained through callback, and using yield for flow control works great;
  4. items.py feels underused so far; it probably has uses I am not yet aware of.
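
As mentioned in point 2, response.urljoin resolves a (possibly relative) URL against response.url before the new Request is scheduled. A minimal sketch (the relative path here is illustrative only, not one of FreeBuf's real endpoints):

def parse(self, response):
    # '?page=2' is resolved against response.url into an absolute URL
    next_page = response.urljoin('?page=2')
    # Scrapy's default duplicate filter keeps the same URL from being
    # scheduled twice, so the queue does not grow without bound
    yield scrapy.Request(next_page, callback=self.parse)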