Python2-Scrapy Learning (Part 3)

Continuing with Scrapy; this time we look at how to store the scraped data.
[Figure: scraped data stored in MongoDB (mongo数据.png)]

Python2-Scrapy Learning (Part 2) covered how to extract data with XPath; next, the data will be saved with MongoDB or MySQL.

Scrapy Data Storage

Crawling Freebuf Search Results

The previous chapter crawled the news items on the Freebuf front page; this time we crawl the data returned by Freebuf's search. P.S.: the previous chapter accessed the site over HTTP, while this time the requests go directly over HTTPS, so cookies are not a concern.
Since the search endpoint returns JSON, there is no need for XPath; parsing the response with the json module is enough to get the data. Here is the spider code:

# -*- coding: utf-8 -*-
import scrapy
import time
import json

from freebuf.items import FreebufItem


class FreebufSpider(scrapy.Spider):
    name = 'freebuf'
    allowed_domains = ['freebuf.com']
    custom_settings = {
        'ITEM_PIPELINES': {
            'freebuf.pipelines.MongodbPipeline': 300,
        }
    }

    def start_requests(self):
        urls = [
            u'https://search.freebuf.com/search/find/?year=0&score=0&articleType=0&origin=0&tabType=1&content=攻击&page=1',
        ]
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        data = json.loads(response.body_as_unicode())
        next_page = None  # stays None when the search returns no results
        if data["data"]["total"] != u'0':
            page = int(response.url[response.url.find('page=') + 5:]) + 1
            next_page = response.url[:response.url.find('page=') + 5] + str(page)
            for i in data["data"]["list"]:
                if i["time"] == time.strftime("%Y-%m-%d").decode('utf-8'):
                    item = FreebufItem()
                    item['source'] = 'freebuf'
                    item['title'] = i["title"]
                    item['url'] = i['url']
                    item['content'] = i['content']
                    item['time'] = i['time']
                    item['author'] = i['name']
                    yield item
                else:
                    # result is older than today: stop paging
                    next_page = None
                    continue
        if next_page is not None:
            next_page = response.urljoin(next_page)
            yield scrapy.Request(next_page, callback=self.parse)
        self.log("Freebuf spider: %s" % response.url)

custom_settings holds per-spider settings, which take precedence over the project-level settings. Here it tells Scrapy to run items through the MongodbPipeline class in pipelines.py, with an order value of 300 (the lower the number, the higher the priority, i.e. the earlier that pipeline runs).
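If the pipeline should apply to every spider in the project rather than this one only, the same mapping can instead be placed in settings.py, and the per-spider custom_settings becomes unnecessary:

# settings.py -- project-wide equivalent of the spider's custom_settings
ITEM_PIPELINES = {
    'freebuf.pipelines.MongodbPipeline': 300,
}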

Data Processing

In pipelines.py I wrote two classes, MongodbPipeline and MysqlPipeline, which store items into MongoDB and MySQL respectively. The MongoDB one has been tested and stores data without problems; the MySQL one has not been tested yet.

# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://doc.scrapy.org/en/latest/topics/item-pipeline.html

import pymongo
import pymysql

from scrapy import log
from scrapy.conf import settings
from twisted.enterprise import adbapi


class MongodbPipeline(object):
    def __init__(self):
        self.mongo_host = settings["MONGODB_HOST"]
        self.mongo_port = settings["MONGODB_PORT"]
        self.mongo_db = settings["MONGODB_DBNAME"]

    def open_spider(self, spider):
        self.client = pymongo.MongoClient(host=self.mongo_host, port=self.mongo_port)
        self.db = self.client[self.mongo_db]

    def close_spider(self, spider):
        self.client.close()

    def process_item(self, item, spider):
        info = dict(item)
        # one collection per source, e.g. db['freebuf']
        self.db[item['source']].insert_one(info)
        return item


class MysqlPipeline(object):
    def __init__(self):
        dbparms = dict(
            host = settings['MYSQL_HOST'],
            port = settings['MYSQL_PORT'],
            db = settings['MYSQL_DBNAME'],
            user = settings['MYSQL_USER'],
            passwd = settings['MYSQL_PASSWORD'],
            charset = 'utf8',
            cursorclass = pymysql.cursors.DictCursor,
            use_unicode = True,
        )
        self.dbpool = adbapi.ConnectionPool("pymysql", **dbparms)

    def process_item(self, item, spider):
        query = self.dbpool.runInteraction(self.do_insert, item, spider)
        log.msg("MySQL connect")
        query.addErrback(self.handle_error, item, spider)
        query.addBoth(lambda _: item)
        return query

    def handle_error(self, failure, item, spider):
        print failure

    def do_insert(self, cursor, item, spider):
        cursor.execute(
            "insert into freebuf (title, url, content, time, author, source) values (%s, %s, %s, %s, %s, %s)",
            (item['title'], item['url'], item['content'], item['time'], item['author'], item['source']))
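do_insert() assumes a freebuf table with those six columns already exists in MySQL. The post does not give its schema, so the column types in the following one-off pymysql script are only guesses:

# -*- coding: utf-8 -*-
# hypothetical helper: create the `freebuf` table that MysqlPipeline writes to
# (column names come from do_insert(); the types are assumptions)
import pymysql

conn = pymysql.connect(host='localhost', port=3306, user='root',
                       passwd='root', db='freebuf', charset='utf8')
try:
    with conn.cursor() as cursor:
        cursor.execute("""
            CREATE TABLE IF NOT EXISTS freebuf (
                id      INT AUTO_INCREMENT PRIMARY KEY,
                title   VARCHAR(255),
                url     VARCHAR(255),
                content TEXT,
                time    VARCHAR(32),
                author  VARCHAR(128),
                source  VARCHAR(32)
            ) DEFAULT CHARSET=utf8
        """)
    conn.commit()
finally:
    conn.close()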

Database Configuration

The database host, port, username, password and so on need to be defined in the project's settings.py; just append the configuration at the end of the file.

# MongoDB Config
MONGODB_HOST = 'localhost'
MONGODB_PORT = 27017
MONGODB_DBNAME = 'freebuf'

# MySQL Config
# MYSQL_HOST = 'localhost'
# MYSQL_PORT = 3306
# MYSQL_DBNAME = 'freebuf'
# MYSQL_USER = 'root'
# MYSQL_PASSWORD = 'root'
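After a crawl, a few lines of pymongo are enough to confirm that documents were actually written (assuming the MONGODB_* values above; the collection name comes from item['source'], i.e. 'freebuf'):

# quick check that the pipeline stored documents in MongoDB
import pymongo

client = pymongo.MongoClient(host='localhost', port=27017)
db = client['freebuf']            # MONGODB_DBNAME
print db['freebuf'].count()       # collection named after item['source']
print db['freebuf'].find_one()
client.close()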

Summary

  1. Both the project and the spider itself can be configured, and per-spider settings take precedence over project settings. Priority order: command line options (most precedence) > settings per spider > project settings module > default settings per command > default global settings (least precedence);
  2. response.urljoin builds the absolute URL of the page to crawl next, and yielding a new Request keeps feeding such pages into the crawl queue as long as the parsing rules are correct;
  3. Most Scrapy requests can take a callback, and driving everything with yield gives wonderfully simple flow control;
  4. It feels like items.py is used rather little here; it probably has uses I am not aware of yet.