Python2-Scrapy Learning (Part 3)

Continuing with Scrapy; this time we look at how to store the scraped data.
[Figure: scraped data stored in MongoDB (mongo数据.png)]

Python2-Scrapy Learning (Part 2) covered how to extract data with XPath; next, the data will be saved with MongoDB or MySQL.

Scrapy Data Storage

Crawling Freebuf Search Results

The previous chapter crawled the news items on the Freebuf front page; this time we crawl the data returned by Freebuf's search. P.S.: the previous chapter accessed the site over HTTP, while this time the requests go directly over HTTPS, so cookies are not a concern.
Since the search endpoint returns JSON, there is no need for XPath; parsing the response with the json module is enough to get the data. Here is the spider code:

# -*- coding: utf-8 -*-
import scrapy
import time
import json

from freebuf.items import FreebufItem


class FreebufSpider(scrapy.Spider):
    name = 'freebuf'
    allowed_domains = ['freebuf.com']
    custom_settings = {
        'ITEM_PIPELINES': {
            'freebuf.pipelines.MongodbPipeline': 300,
        }
    }

    def start_requests(self):
        urls = [
            u'https://search.freebuf.com/search/find/?year=0&score=0&articleType=0&origin=0&tabType=1&content=攻击&page=1',
        ]
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        data = json.loads(response.body_as_unicode())
        next_page = None  # stays None when the search returns no results
        if data["data"]["total"] != u'0':
            page = int(response.url[response.url.find('page=') + 5:]) + 1
            next_page = response.url[:response.url.find('page=') + 5] + str(page)
            for i in data["data"]["list"]:
                if i["time"] == time.strftime("%Y-%m-%d").decode('utf-8'):
                    item = FreebufItem()
                    item['source'] = 'freebuf'
                    item['title'] = i["title"]
                    item['url'] = i['url']
                    item['content'] = i['content']
                    item['time'] = i['time']
                    item['author'] = i['name']
                    yield item
                else:
                    # result is older than today: stop paging
                    next_page = None
                    continue
        if next_page is not None:
            next_page = response.urljoin(next_page)
            yield scrapy.Request(next_page, callback=self.parse)
        self.log("Freebuf spider: %s" % response.url)

custom_settings holds per-spider settings, which take precedence over the project-level settings. Here it tells Scrapy to run items through the MongodbPipeline class in pipelines.py, with an order value of 300 (the lower the number, the higher the priority, i.e. the earlier that pipeline runs).
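If the pipeline should apply to every spider in the project rather than this one only, the same mapping can instead be placed in settings.py, and the per-spider custom_settings becomes unnecessary:

# settings.py -- project-wide equivalent of the spider's custom_settings
ITEM_PIPELINES = {
    'freebuf.pipelines.MongodbPipeline': 300,
}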

Data Processing

In pipelines.py I wrote two classes, MongodbPipeline and MysqlPipeline, which store items into MongoDB and MySQL respectively. The MongoDB one has been tested and stores data without problems; the MySQL one has not been tested yet.

# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://doc.scrapy.org/en/latest/topics/item-pipeline.html

import pymongo
import pymysql

from scrapy import log
from scrapy.conf import settings
from twisted.enterprise import adbapi


class MongodbPipeline(object):
    def __init__(self):
        self.mongo_host = settings["MONGODB_HOST"]
        self.mongo_port = settings["MONGODB_PORT"]
        self.mongo_db = settings["MONGODB_DBNAME"]

    def open_spider(self, spider):
        self.client = pymongo.MongoClient(host=self.mongo_host, port=self.mongo_port)
        self.db = self.client[self.mongo_db]

    def close_spider(self, spider):
        self.client.close()

    def process_item(self, item, spider):
        info = dict(item)
        # one collection per source, e.g. db['freebuf']
        self.db[item['source']].insert_one(info)
        return item


class MysqlPipeline(object):
    def __init__(self):
        dbparms = dict(
            host = settings['MYSQL_HOST'],
            port = settings['MYSQL_PORT'],
            db = settings['MYSQL_DBNAME'],
            user = settings['MYSQL_USER'],
            passwd = settings['MYSQL_PASSWORD'],
            charset = 'utf8',
            cursorclass = pymysql.cursors.DictCursor,
            use_unicode = True,
        )
        self.dbpool = adbapi.ConnectionPool("pymysql", **dbparms)

    def process_item(self, item, spider):
        query = self.dbpool.runInteraction(self.do_insert, item, spider)
        log.msg("MySQL connect")
        query.addErrback(self.handle_error, item, spider)
        query.addBoth(lambda _: item)
        return query

    def handle_error(self, failure, item, spider):
        print failure

    def do_insert(self, cursor, item, spider):
        cursor.execute(
            "insert into freebuf (title, url, content, time, author, source) values (%s, %s, %s, %s, %s, %s)",
            (item['title'], item['url'], item['content'], item['time'], item['author'], item['source']))
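do_insert() assumes a freebuf table with those six columns already exists in MySQL. The post does not give its schema, so the column types in the following one-off pymysql script are only guesses:

# -*- coding: utf-8 -*-
# hypothetical helper: create the `freebuf` table that MysqlPipeline writes to
# (column names come from do_insert(); the types are assumptions)
import pymysql

conn = pymysql.connect(host='localhost', port=3306, user='root',
                       passwd='root', db='freebuf', charset='utf8')
try:
    with conn.cursor() as cursor:
        cursor.execute("""
            CREATE TABLE IF NOT EXISTS freebuf (
                id      INT AUTO_INCREMENT PRIMARY KEY,
                title   VARCHAR(255),
                url     VARCHAR(255),
                content TEXT,
                time    VARCHAR(32),
                author  VARCHAR(128),
                source  VARCHAR(32)
            ) DEFAULT CHARSET=utf8
        """)
    conn.commit()
finally:
    conn.close()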

Database Configuration

The database host, port, username, password and so on need to be defined in the project's settings.py; just append the configuration at the end of the file.

# MongoDB Config
MONGODB_HOST = 'localhost'
MONGODB_PORT = 27017
MONGODB_DBNAME = 'freebuf'

# MySQL Config
# MYSQL_HOST = 'localhost'
# MYSQL_PORT = 3306
# MYSQL_DBNAME = 'freebuf'
# MYSQL_USER = 'root'
# MYSQL_PASSWORD = 'root'
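After a crawl, a few lines of pymongo are enough to confirm that documents were actually written (assuming the MONGODB_* values above; the collection name comes from item['source'], i.e. 'freebuf'):

# quick check that the pipeline stored documents in MongoDB
import pymongo

client = pymongo.MongoClient(host='localhost', port=27017)
db = client['freebuf']            # MONGODB_DBNAME
print db['freebuf'].count()       # collection named after item['source']
print db['freebuf'].find_one()
client.close()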

Summary

  1. Both the project and the spider itself can be configured, and per-spider settings take precedence over project settings. Priority order: command line options (most precedence) > settings per spider > project settings module > default settings per command > default global settings (least precedence);
  2. response.urljoin builds the absolute URL of the page to crawl next, and yielding a new Request keeps feeding such pages into the crawl queue as long as the parsing rules are correct;
  3. Most Scrapy requests can take a callback, and driving everything with yield gives wonderfully simple flow control;
  4. It feels like items.py is used rather little here; it probably has uses I am not aware of yet.