Python-Scrapy Learning (4)

This wraps up my Scrapy learning series for now; the figure below shows the overall framework of the project.

[Figure: project framework diagram]

Python2-Scrapy Learning (3) covered how to store the scraped data. This post covers how to render JS with selenium and how to send email notifications.

Rendering JS with selenium

When crawling seebug, a direct request never reaches the data page. The actual flow when visiting seebug turns out to be: visit seebug → run the JS → set the cookie fields → request again → data returned successfully.

[Figure: JS parsing]

[Figure: Connect success]

A bit of research shows that Scrapy can use Splash for JavaScript rendering, but according to the official site it has to run alongside Docker. Then I remembered that selenium can do the rendering; however, scrapy-selenium requires Python >= 3.6 and I am on 2.7, so I drive selenium directly instead. PS: Chrome cannot be used under CentOS 6.

Installation

Installing selenium with pip is all that is needed. The PhantomJS project has been discontinued, and recent selenium releases no longer support it, warning: UserWarning: Selenium support for PhantomJS has been deprecated, please use headless versions of Chrome or Firefox instead. So, following the official docs, you need to download a driver for a real browser (Chrome, Edge, Firefox, or Safari). I downloaded ChromeDriver and moved the chromedriver binary into /usr/local/bin.

pip install selenium
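
Before wiring selenium into Scrapy, it is worth a quick standalone check that the driver actually starts. A minimal sketch, assuming chromedriver is on the PATH (the URL is only an example, and the chrome_options keyword matches the selenium version used in this post):

# Standalone check: start headless Chrome, load a page, print its title.
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

options = Options()
options.add_argument('--headless')
options.add_argument('--disable-gpu')

driver = webdriver.Chrome(chrome_options=options)  # finds chromedriver in /usr/local/bin via PATH
driver.get('https://www.seebug.org')               # any URL works for this check
print(driver.title)                                # a printed title means the driver is wired up
driver.quit()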

Using middlewares

After installing, I wanted to call selenium the Scrapy way (as scrapy-selenium does) rather than driving selenium directly inside the spider. Reading the Scrapy docs shows that the selenium usage can be wrapped in a class in middlewares.py and enabled through settings.py or the spider's custom_settings.

middlewares.py

import time
import logging

from selenium import webdriver
from scrapy.http import HtmlResponse
from scrapy.exceptions import IgnoreRequest
from selenium.webdriver.chrome.options import Options


class SeleniumMiddleware(object):
    def __init__(self):
        self.chrome_options = Options()
        self.chrome_options.add_argument('--headless')
        self.chrome_options.add_argument('--disable-gpu')
        self.driver = webdriver.Chrome(chrome_options=self.chrome_options)
        # self.driver = webdriver.Chrome()

    def process_request(self, request, spider):
        # Let Chrome run the JS and pick up the cookies, then hand the
        # rendered page back to Scrapy as a normal HtmlResponse.
        self.driver.get(request.url)
        time.sleep(3)
        try:
            body = self.driver.page_source
            return HtmlResponse(self.driver.current_url, body=body, encoding='utf-8', request=request)
        except Exception as e:
            # Timeout on WebDriverWait
            logging.error(e)
            raise IgnoreRequest
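
One thing the middleware above leaves out is shutting the browser down: the headless Chrome keeps running after the crawl finishes. A minimal sketch of hooking cleanup into Scrapy's signals (the spider_closed helper name is my own choice):

from scrapy import signals

class SeleniumMiddleware(object):
    # __init__ and process_request as above ...

    @classmethod
    def from_crawler(cls, crawler):
        middleware = cls()
        # quit the browser when the spider closes, so no headless Chrome lingers
        crawler.signals.connect(middleware.spider_closed, signal=signals.spider_closed)
        return middleware

    def spider_closed(self, spider):
        self.driver.quit()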

seebug.py

import scrapy
import time

from zjyd.items import ZjydItem
from scrapy.utils.project import get_project_settings

settings = get_project_settings()


class SeebugSpider(scrapy.Spider):
    name = 'seebug'
    allowed_domains = ['seebug.org']
    custom_settings = {
        'DOWNLOADER_MIDDLEWARES': {
            'zjyd.middlewares.SeleniumMiddleware': 723,
        },
    }

    def start_requests(self):
        keywords = list(settings['KEYWORDS'])
        for i in keywords:
            yield scrapy.Request(url=('https://www.seebug.org/search/?keywords=%s&category=&page=1' % str(i)), callback=self.parse)

    def parse(self, response):
        item = ZjydItem()
        if response.xpath("//table[@class='table sebug-table table-vul-list']/tbody/tr"):
            # Build the next page URL by bumping the page= parameter.
            page = int(response.url[response.url.find('page=') + 5:]) + 1
            next_page = response.url[:response.url.find('page=') + 5] + str(page)
        else:
            next_page = None
        for i in response.xpath("//table[@class='table sebug-table table-vul-list']/tbody/tr"):
            # Only keep entries published today.
            if i.xpath("td[@class='text-center datetime hidden-sm hidden-xs']/text()").extract_first().strip()[:-6] == time.strftime("%Y-%m-%d").decode('utf-8'):
                item['source'] = 'seebug'
                item['title'] = i.xpath("td[@class='vul-title-wrapper']/a[@class='vul-title']/text()").extract_first()
                item['time'] = i.xpath("td[@class='text-center datetime hidden-sm hidden-xs']/text()").extract_first().strip()
                item['url'] = u'https://www.seebug.org' + i.xpath("td[@class='vul-title-wrapper']/a[@class='vul-title']/@href").extract_first()
                item['content'] = i.xpath("td[@class='vul-title-wrapper']/a[@class='vul-title']/text()").extract_first()
                item['author'] = u'null'
            else:
                # Reached older entries: stop paginating.
                next_page = None
                continue
            if item:
                yield item
            else:
                self.log('%s is none!' % (response.url))
        if next_page is not None:
            next_page = response.urljoin(next_page)
            yield scrapy.Request(next_page, callback=self.parse)
        self.log("seebug spider:%s" % (response.url))

Or in settings.py:

DOWNLOADER_MIDDLEWARES = {
    'zjyd.middlewares.ZjydDownloaderMiddleware': 543,
    # enable the selenium middleware project-wide instead of per spider
    'zjyd.middlewares.SeleniumMiddleware': 723,
    'scrapy.downloadermiddlewares.retry.RetryMiddleware': 90,
    'scrapy_proxies.RandomProxy': 100,
    'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware': 110,
}
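
The seebug spider above also reads a KEYWORDS list from the project settings; the actual keywords are not shown in this post, so the values below are placeholders:

# settings.py (excerpt) -- the keyword list the seebug spider iterates over.
# Placeholder values; replace with the real search keywords.
KEYWORDS = ['struts2', 'weblogic', 'thinkphp']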

With the configuration above, the seebug pages are rendered, parsed, and their data stored.
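
One caveat: as written, the middleware sends every request through Chrome, so enabling it project-wide also renders sites that do not need JS at all. A minimal sketch of a per-request switch (the use_selenium meta key is my own naming, not a Scrapy feature):

# Drop-in variant of process_request: only render requests the spider flags.
def process_request(self, request, spider):
    if not request.meta.get('use_selenium'):
        return None  # returning None lets Scrapy continue with the normal downloader
    self.driver.get(request.url)
    time.sleep(3)
    return HtmlResponse(self.driver.current_url, body=self.driver.page_source,
                        encoding='utf-8', request=request)

# In the spider, flag only the requests that need JS rendering:
# yield scrapy.Request(url, meta={'use_selenium': True}, callback=self.parse)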

Project integration

The project now contains several spiders (freebuf, seebug, and so on); scrapy list shows every spider registered in the project.

scrapy list

[Figure: spider integration]

Each spider in the project lives in its own file; they are not all grouped as classes in a single file. So a script is needed to run them all. The official docs describe running spiders from a script through the API, and here I start every spider with process.crawl.

from scrapy.crawler import CrawlerProcess
# from scrapy import spiderloader
from scrapy.utils.project import get_project_settings

settings = get_project_settings()


def main():
    process = CrawlerProcess(settings)
    process.crawl('freebuf', domain='freebuf.com')
    process.crawl('vulbox', domain='vulbox.com')
    process.crawl('anquanke', domain='anquanke.com')
    process.crawl('bugbank', domain='bugbank.com')
    process.crawl('seebug', domain='seebug.com')
    process.crawl('cnvd', domain='cnvd.org.cn')
    # spider_loader = spiderloader.SpiderLoader.from_settings(settings)
    # spiders = spider_loader.list()
    # classes = [spider_loader.load(name) for name in spiders]
    # for i in classes:
    #     process.crawl(i)

    process.start()


if __name__ == '__main__':
    main()

With the script above, the freebuf, vulbox, and all the other spiders in the project run in one go.
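
The commented-out lines in the script already hint at a variant that discovers the spiders automatically instead of naming each one; a minimal sketch of that version:

# Run every spider registered in the project, discovered via SpiderLoader.
from scrapy import spiderloader
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings


def main():
    settings = get_project_settings()
    process = CrawlerProcess(settings)
    spider_loader = spiderloader.SpiderLoader.from_settings(settings)
    for name in spider_loader.list():   # same names that scrapy list prints
        process.crawl(spider_loader.load(name))
    process.start()


if __name__ == '__main__':
    main()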

Email notification

All the data collected by the spiders is stored in MongoDB, so reading the records for the current date from the database is enough to build the email notification.

# -*- coding: utf-8 -*-

import time
import pymongo
import smtplib
from email.header import Header
from email.mime.text import MIMEText
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings
from scrapy import spiderloader

settings = get_project_settings()


def send_mail():
    mongo_client = pymongo.MongoClient(host=settings["MONGODB_HOST"], port=settings["MONGODB_PORT"])
    mongo_db = mongo_client[settings["MONGODB_DBNAME"]]
    # Only look at records whose 'time' field is today's date.
    mongo_query = {'time': time.strftime("%Y-%m-%d").decode('utf-8')}
    result = "邮件更新提醒:\n"  # "Update notification:"

    # Each spider writes into a MongoDB collection named after itself.
    spider_loader = spiderloader.SpiderLoader.from_settings(settings)
    spiders = spider_loader.list()
    for i in spiders:
        mongo_col = mongo_db[i]
        if mongo_col.find(mongo_query).sort("ts", pymongo.ASCENDING).count() != 0:
            result += '%s 有更新,请注意查收!\n' % (i)  # "has updates, please check"
        else:
            result += '%s无更新!\n' % (i)  # "no updates"

    sender = settings["MAIL_SENDER"]
    receivers = settings["MAIL_RECEIVERS"]

    message = MIMEText(result, 'plain', 'utf-8')
    message['From'] = Header("hywell", 'utf-8')
    message['To'] = receivers
    subject = "信息收集爬虫"  # "Information-gathering spiders"
    message['Subject'] = Header(subject, 'utf-8')

    smtpObj = smtplib.SMTP()
    smtpObj.connect(settings["MAIL_HOST"], 25)
    smtpObj.login(settings["MAIL_USER"], settings["MAIL_PASSWORD"])
    smtpObj.sendmail(sender, receivers, message.as_string())

    mongo_client.close()


def main():
    # Run every spider in the project, then send the notification mail.
    process = CrawlerProcess(settings)
    spider_loader = spiderloader.SpiderLoader.from_settings(settings)
    spiders = spider_loader.list()
    classes = [spider_loader.load(name) for name in spiders]
    for i in classes:
        process.crawl(i)

    process.start()
    send_mail()


if __name__ == "__main__":
    main()

Summary

  1. Python 2.7 is getting old: many newer modules no longer support it, so future work really should move to Python 3. After all, the main appeal of Python is being able to import modules;
  2. When sending mail, pay attention to the mail server's settings, for example whether a separate app password has to be configured and whether the sender information is validated against the account;
  3. The latest Chrome no longer supports CentOS 6. I managed to install Chrome and get ChromeDriver running through various workarounds, but at runtime it kept hanging at [urllib3.connectionpool] DEBUG: Starting new HTTP connection (1): 127.0.0.1:1269, so I switched to Firefox (see the sketch after this list).
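
For point 3, the change needed in the middleware is small; a minimal sketch, assuming geckodriver is installed and on the PATH (only __init__ changes, and depending on the selenium version the keyword may be firefox_options instead of options):

from selenium import webdriver
from selenium.webdriver.firefox.options import Options


class SeleniumMiddleware(object):
    def __init__(self):
        options = Options()
        options.add_argument('--headless')
        self.driver = webdriver.Firefox(options=options)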

Full code

The full code has been uploaded to my GitHub. If you are interested, head over to GitHub and take a look! Code