Python Crawler Performance (the asyncio module --- high-performance crawlers)

 

From: https://www.cnblogs.com/bravexz/p/7741633.html

Applying the asyncio module in crawlers (high-performance crawlers): https://www.cnblogs.com/morgana/p/8495555.html

Asynchronous programming in Python with asyncio (a million concurrent connections): https://www.cnblogs.com/shenh/p/9090586.html

Understanding asynchronous programming in Python in depth (part 1): https://blog.csdn.net/catwan/article/details/84975893

https://mp.weixin.qq.com/s?__biz=MjM5OTA1MDUyMA==&mid=2655439072&idx=3&sn=07ca0046b92998ea216958afa5baff8f

requests + asyncio: https://github.com/wangy8961/python3-concurrency-pics-02

asyncio, a high-concurrency Python module: https://www.jianshu.com/p/9ea1198beb49

aiohttp official documentation: https://docs.aiohttp.org/en/latest/

Keywords: Python asynchronous programming, asyncio, requests

 

When writing a crawler, most of the performance cost is in IO: in single-process, single-thread mode each URL request inevitably blocks while waiting for the response, which slows the whole job down.
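To see where the time goes, time a single blocking request and compare the total wall-clock time with the network wait that requests itself reports (a minimal sketch; the URL is only an example):

import time
import requests

start = time.perf_counter()
response = requests.get('http://www.bing.com')
total = time.perf_counter() - start

# response.elapsed is the time from sending the request until the response
# headers arrived, i.e. the time spent waiting on the network
print('total: %.3fs  waiting on IO: %.3fs' % (total, response.elapsed.total_seconds()))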

 

 

Synchronous execution

 

Example code:

import requests

def fetch_async(url=None):
    response = requests.get(url)
    return response

url_list = ['http://www.github.com', 'http://www.bing.com']

for url in url_list:
    fetch_async(url)

 

 

Multi-threaded execution

 

Example code:

from concurrent.futures import ThreadPoolExecutor
import requests

def fetch_async(url):
    response = requests.get(url)
    return response

url_list = ['http://www.github.com', 'http://www.bing.com']

pool = ThreadPoolExecutor(5)
for url in url_list:
    pool.submit(fetch_async, url)
pool.shutdown(wait=True)

 

 

Multi-threading + callbacks

 

Example code:

# -*- coding: utf-8 -*-
from concurrent.futures import ThreadPoolExecutor
import requests

def fetch_async(url):
    response = requests.get(url)
    return response

def callback(future):
    print(future.result())

url_list = ['http://www.github.com', 'http://www.bing.com']

pool = ThreadPoolExecutor(5)
for url in url_list:
    v = pool.submit(fetch_async, url)
    v.add_done_callback(callback)
pool.shutdown(wait=True)

 

 

Multi-process execution

 

Example code:

# -*- coding: utf-8 -*-
from concurrent.futures import ProcessPoolExecutor
import requests

def fetch_async(url):
    response = requests.get(url)
    return response

url_list = ['http://www.github.com', 'http://www.bing.com']

if __name__ == '__main__':  # required for process pools on platforms that spawn (e.g. Windows)
    pool = ProcessPoolExecutor(5)
    for url in url_list:
        pool.submit(fetch_async, url)
    pool.shutdown(wait=True)

 

 

Multi-processing + callbacks

 

Example code:

# -*- coding: utf-8 -*-
from concurrent.futures import ProcessPoolExecutor
import requests

def fetch_async(url):
    response = requests.get(url)
    return response

def callback(future):
    print(future.result())

url_list = ['http://www.github.com', 'http://www.bing.com']

if __name__ == '__main__':  # required for process pools on platforms that spawn (e.g. Windows)
    pool = ProcessPoolExecutor(5)
    for url in url_list:
        v = pool.submit(fetch_async, url)
        v.add_done_callback(callback)
    pool.shutdown(wait=True)

 

All of the code above improves request throughput. The drawback of multi-threading and multi-processing is that threads and processes sit idle while blocked on IO, so asynchronous IO is the preferred approach:

 

 

Asynchronous IO

 

An introduction to using asynchronous coroutines in Python: https://blog.csdn.net/freeking101/article/details/88119858

Python asynchronous IO (asyncio) coroutines: https://www.cnblogs.com/ssyfj/p/9219360.html

 

 

Because of the GIL (global interpreter lock), Python cannot exploit multiple cores, and its performance has long been criticized for it. In IO-bound network programming, however, asynchronous processing can be hundreds or thousands of times more efficient than synchronous processing, which makes up for that shortcoming; the recent microservice framework japronto, for example, can handle on the order of a million requests per second.

Another strength of Python is its extremely rich ecosystem of third-party libraries, which are very convenient to use. asyncio entered the standard library in Python 3.4 and was never added to Python 2.x (Python 3.x is the future, after all!); Python 3.5 then added the async/await syntax.

 

Before learning asyncio, first get the concepts of synchronous vs. asynchronous straight:

  • Synchronous means tasks run in order: the first task executes, and if it blocks, execution waits until it completes before the second task starts, and so on.
  • Asynchronous is the opposite: after dispatching a task, the caller does not wait for its result but moves straight on to the next task; the result is delivered later via state, notification, or a callback.
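A minimal sketch of the difference, using sleep to stand in for a one-second IO wait (and the same pre-3.10 event-loop API used throughout this article): three synchronous waits add up, while three asynchronous waits overlap:

import time
import asyncio

async def io_task(n):
    await asyncio.sleep(1)   # stands in for a 1-second IO wait
    return n

# synchronous: the waits add up (about 3 seconds)
start = time.perf_counter()
for i in range(3):
    time.sleep(1)
print('sync:  %.1fs' % (time.perf_counter() - start))

# asynchronous: the waits overlap (about 1 second)
start = time.perf_counter()
loop = asyncio.get_event_loop()
loop.run_until_complete(asyncio.gather(*(io_task(i) for i in range(3))))
print('async: %.1fs' % (time.perf_counter() - start))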

 

The calling steps:

  • 1. Adding the async keyword to a function, or decorating it with asyncio.coroutine, turns it into an asynchronous function (a coroutine).
  • 2. Each thread has one event loop; the main thread creates it when asyncio.get_event_loop() is first called.
  • 3. Bundle the tasks with asyncio.gather(*args) and hand the bundle to the event loop.
  • 4. Pass the asynchronous work to the loop's run_until_complete method; the event loop schedules the coroutines and, as the name says, does not return until they have all finished.

 

asyncio example 1

# -*- coding: utf-8 -*-
import asyncio

@asyncio.coroutine
def func1():
    print('before...func1......')
    yield from asyncio.sleep(5)
    print('end...func1......')

tasks = [func1(), func1()]

loop = asyncio.get_event_loop()
loop.run_until_complete(asyncio.gather(*tasks))
loop.close()
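For comparison, the same example in the async/await syntax added in Python 3.5 (@asyncio.coroutine / yield from is the older spelling and has been deprecated and removed in recent Python versions):

# -*- coding: utf-8 -*-
import asyncio

async def func1():
    print('before...func1......')
    await asyncio.sleep(5)   # await replaces yield from
    print('end...func1......')

tasks = [func1(), func1()]

loop = asyncio.get_event_loop()
loop.run_until_complete(asyncio.gather(*tasks))
loop.close()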

 

asyncio example 2

# -*- coding: utf-8 -*-
import asyncio

@asyncio.coroutine
def fetch_async(host, url='/'):
    print(host, url)
    reader, writer = yield from asyncio.open_connection(host, 80)

    request_header_content = """GET %s HTTP/1.0\r\nHost: %s\r\n\r\n""" % (url, host,)
    request_header_content = bytes(request_header_content, encoding='utf-8')

    writer.write(request_header_content)
    yield from writer.drain()
    text = yield from reader.read()
    print(host, url, text)
    writer.close()

tasks = [
    fetch_async('www.cnblogs.com', '/wupeiqi/'),
    fetch_async('dig.chouti.com', '/pic/show?nid=4073644713430508&lid=10273091')
]

loop = asyncio.get_event_loop()
results = loop.run_until_complete(asyncio.gather(*tasks))
loop.close()

 

 

asyncio + aiohttp

Reference: https://www.cnblogs.com/zhanghongfeng/p/8662265.html

Writing a crawler with aiohttp: https://luca-notebook.readthedocs.io/zh_CN/latest/c01/用aiohttp写爬虫.html

 

aiohttp

  What if you need concurrent HTTP requests? requests is the usual choice, but it is a synchronous library; for asynchronous requests you need aiohttp. Import the ClientSession class (from aiohttp import ClientSession), first create a session object, then use the session to open pages. A session supports all the usual operations: post, get, put, head, and so on.

Example:

import asyncio
from aiohttp import ClientSession

tasks = []
test_url = "https://www.baidu.com/{}"

async def hello(url):
    async with ClientSession() as session:
        async with session.get(url) as response:
            response = await response.read()
            print(response)

if __name__ == '__main__':
    loop = asyncio.get_event_loop()
    loop.run_until_complete(hello(test_url))

First, the async def keyword marks the function as asynchronous, and await is placed before every operation that must be waited on; response.read() waits for the response body and is the IO-heavy step. The HTTP request itself is issued through the ClientSession object.
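The session object supports the other HTTP verbs the same way; a minimal POST sketch (httpbin.org is used here only as an assumed test endpoint):

import asyncio
from aiohttp import ClientSession

async def post_demo():
    async with ClientSession() as session:
        # same pattern as get(); post/put/head etc. are coroutines too
        async with session.post('https://httpbin.org/post', data={'k': 'v'}) as response:
            print(response.status)
            print(await response.text())

loop = asyncio.get_event_loop()
loop.run_until_complete(post_demo())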

 

Asynchronous access to multiple URLs

What if we need to request several URLs? Synchronously you would just add a for loop, but the asynchronous version is not quite that simple: building on the code above, each hello() call has to be wrapped in an asyncio Future object, and the list of Futures is then handed to the event loop as the tasks to run.

import time
import asyncio
from aiohttp import ClientSession

tasks = []
test_url = "https://www.baidu.com/{}"

async def hello(url):
    async with ClientSession() as session:
        async with session.get(url) as response:
            response = await response.read()
            # print(response)
            print('Hello World:%s' % time.time())

def run():
    for i in range(5):
        task = asyncio.ensure_future(hello(test_url.format(i)))
        tasks.append(task)

if __name__ == '__main__':
    loop = asyncio.get_event_loop()
    run()
    loop.run_until_complete(asyncio.wait(tasks))

 

Collecting HTTP responses

The code above shows how to request different URLs asynchronously, but it only fires the requests. To collect the responses into a list and then save them to disk or print them, use asyncio.gather(*tasks), which gathers all the return values, as the following example shows.

import datetime
import asyncio
from aiohttp import ClientSession

tasks = []
test_url = "https://www.baidu.com/{}"

async def hello(url):
    async with ClientSession() as session:
        async with session.get(url) as response:
            # print(response)
            print(f'Hello World : {datetime.datetime.now().replace(microsecond=0)}')
            return await response.read()

def run():
    for i in range(5):
        task = asyncio.ensure_future(hello(test_url.format(i)))
        tasks.append(task)
    result = loop.run_until_complete(asyncio.gather(*tasks))
    print(result)

if __name__ == '__main__':
    loop = asyncio.get_event_loop()
    run()

Once concurrency reaches about 2000, the program fails with: ValueError: too many file descriptors in select(). On the face of it, the error says that select(), which Python is calling, has a cap on the number of open files; the cap is really an operating-system limit. Linux defaults to 1024 open files and Windows to 509; beyond that the program raises this error.

There are three ways around this problem:

  • 1. Limit the concurrency (don't enqueue every task at once, or cap the number of in-flight requests)
  • 2. Use callbacks
  • 3. Raise the operating system's open-file limit; the default lives in a system configuration file, and the exact steps are not covered here

If you would rather not touch the system defaults, I recommend capping the concurrency; a cap of 500 gives good throughput.

# coding:utf-8
import time
import asyncio
import aiohttp

test_url = 'https://www.baidu.com/'

async def hello(url, semaphore):
    async with semaphore:
        async with aiohttp.ClientSession() as session:
            async with session.get(url) as response:
                print(f'status:{response.status}')
                return await response.read()

async def run():
    semaphore = asyncio.Semaphore(500)  # cap the concurrency at 500
    to_get = [hello(test_url.format(), semaphore) for _ in range(1000)]  # 1000 tasks in total
    await asyncio.wait(to_get)

if __name__ == '__main__':
    # now = lambda: time.time()
    loop = asyncio.get_event_loop()
    loop.run_until_complete(run())
    loop.close()

Example code:

# -*- coding: utf-8 -*-
import aiohttp
import asyncio

@asyncio.coroutine
def fetch_async(url):
    print(url)
    # the request call is an IO-blocking step
    # response = yield from aiohttp.request('GET', url)
    response = yield from aiohttp.ClientSession().get(url)
    print(response.status)
    print(url, response)
    # data = yield from response.read()
    return response

tasks = [
    # fetch_async('http://www.google.com/'),
    fetch_async('http://www.chouti.com/')
]

event_loop = asyncio.get_event_loop()
results = event_loop.run_until_complete(asyncio.gather(*tasks))
event_loop.close()

 

Two ways to cap coroutine concurrency in Python 3

1. A TCPConnector connection pool

import asyncio
import aiohttp

CONCURRENT_REQUESTS = 0

async def aio_http_get(url, session):
    global CONCURRENT_REQUESTS
    async with session.get(url) as response:
        CONCURRENT_REQUESTS += 1
        html = await response.text()
        print(f'[{CONCURRENT_REQUESTS}] : {response.status}')
        return html

def main():
    urls = ['http://www.baidu.com' for _ in range(1000)]
    loop = asyncio.get_event_loop()
    # cap the number of simultaneous connections; the default is 100, limit=0 means unlimited
    connector = aiohttp.TCPConnector(limit=10)
    session = aiohttp.ClientSession(connector=connector, loop=loop)
    loop.run_until_complete(asyncio.gather(*(aio_http_get(url, session=session) for url in urls)))
    loop.close()

if __name__ == "__main__":
    main()

2. A Semaphore

import asyncio
from aiohttp import ClientSession, TCPConnector

async def async_spider(sem, url):
    """the asynchronous task"""
    async with sem:
        print('Getting data on url', url)
        async with ClientSession() as session:
            async with session.get(url) as response:
                html = await response.text()
                return html

def parse_html(task):
    print(f'Status:{task.result()}')

async def task_manager():
    """the asynchronous task manager"""
    tasks = []
    sem = asyncio.Semaphore(10)  # cap the concurrency
    url_list = ['http://www.baidu.com' for _ in range(100)]
    for url in url_list:
        task = asyncio.create_task(async_spider(sem, url))
        task.add_done_callback(parse_html)
        tasks.append(task)
    await asyncio.gather(*tasks)

if __name__ == '__main__':
    print('Task start! It is working...')
    loop = asyncio.get_event_loop()
    loop.run_until_complete(task_manager())
    print('Finished!')

Example code 2:


import os
import sys
import aiohttp
import asyncio

sys.path.append(os.getcwd())
sys.path.append("..")
sys.path.append(os.path.abspath("../../"))

CONCURRENT_REQUESTS = 20
CONCURRENT_REQUESTS_actual = 0

async def hello(sem, url):
    async with sem:
        async with aiohttp.ClientSession() as session:
            async with session.get('http://www.baidu.com', allow_redirects=False) as response:
                r = await response.read()
                print(f'[{url}] : http://www.baidu.com   {response.status}')

def main():
    loop = asyncio.get_event_loop()
    tasks = []
    # cap the coroutine concurrency
    sem = asyncio.Semaphore(CONCURRENT_REQUESTS)
    for i in range(100000):
        task = asyncio.ensure_future(hello(sem, i))
        tasks.append(task)
    feature = asyncio.ensure_future(asyncio.gather(*tasks))
    loop.run_until_complete(feature)

if __name__ == "__main__":
    main()

 

asyncio + requests

# -*- coding: utf-8 -*-
import asyncio
import requests

@asyncio.coroutine
def fetch_async(func, *args):
    loop = asyncio.get_event_loop()
    future = loop.run_in_executor(None, func, *args)
    response = yield from future
    print(response.url, response.content)

tasks = [
    fetch_async(requests.get, 'http://www.cnblogs.com/wupeiqi/'),
    fetch_async(requests.get, 'http://dig.chouti.com/pic/show?nid=4073644713430508&lid=10273091')
]

loop = asyncio.get_event_loop()
results = loop.run_until_complete(asyncio.gather(*tasks))
loop.close()

 

gevent + requests

# -*- coding: utf-8 -*-
import gevent
import requests
from gevent import monkey

monkey.patch_all()

def fetch_async(method, url, req_kwargs):
    print(method, url, req_kwargs)
    response = requests.request(method=method, url=url, **req_kwargs)
    print(response.url, response.content)

# ##### send the requests #####
gevent.joinall([
    gevent.spawn(fetch_async, method='get', url='https://www.python.org/', req_kwargs={}),
    gevent.spawn(fetch_async, method='get', url='https://www.yahoo.com/', req_kwargs={}),
    gevent.spawn(fetch_async, method='get', url='https://github.com/', req_kwargs={}),
])

# ##### send the requests (a coroutine pool caps the number of coroutines) #####
# from gevent.pool import Pool
# pool = Pool(None)
# gevent.joinall([
#     pool.spawn(fetch_async, method='get', url='https://www.python.org/', req_kwargs={}),
#     pool.spawn(fetch_async, method='get', url='https://www.yahoo.com/', req_kwargs={}),
#     pool.spawn(fetch_async, method='get', url='https://www.github.com/', req_kwargs={}),
# ])

 

grequests

# -*- coding: utf-8 -*-
import grequests

request_list = [
    grequests.get('http://httpbin.org/delay/1', timeout=0.001),
    grequests.get('http://fakedomain/'),
    grequests.get('http://httpbin.org/status/500')
]

# ##### execute and collect the responses #####
response_list = grequests.map(request_list)
print(response_list)

# ##### execute and collect the responses (with exception handling) #####
# def exception_handler(request, exception):
#     print(request, exception)
#     print("Request failed")

# response_list = grequests.map(request_list, exception_handler=exception_handler)
# print(response_list)

 

Twisted example

from twisted.web.client import getPage, defer
from twisted.internet import reactor

def all_done(arg):
    reactor.stop()

def callback(contents):
    print(contents)

deferred_list = []

url_list = ['http://www.bing.com', 'http://www.baidu.com', ]
for url in url_list:
    deferred = getPage(bytes(url, encoding='utf8'))
    deferred.addCallback(callback)
    deferred_list.append(deferred)

dlist = defer.DeferredList(deferred_list)
dlist.addBoth(all_done)

reactor.run()

 

Tornado example

# -*- coding: utf-8 -*-
from tornado.httpclient import AsyncHTTPClient
from tornado.httpclient import HTTPRequest
from tornado import ioloop

def handle_response(response):
    """
    Handle the response body. A counter needs to be maintained here to stop
    the IO loop by calling ioloop.IOLoop.current().stop().
    :param response:
    :return:
    """
    if response.error:
        print("Error:", response.error)
    else:
        print(response.body)

def func():
    url_list = [
        'http://www.baidu.com',
        'http://www.bing.com',
    ]
    for url in url_list:
        print(url)
        http_client = AsyncHTTPClient()
        http_client.fetch(HTTPRequest(url), handle_response)

ioloop.IOLoop.current().add_callback(func)
ioloop.IOLoop.current().start()
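The docstring above says a counter has to be maintained to stop the IO loop, but does not show it; here is a sketch of that bookkeeping, in the same (pre-Tornado-6) callback style, with the same url_list assumed:

# -*- coding: utf-8 -*-
from tornado.httpclient import AsyncHTTPClient, HTTPRequest
from tornado import ioloop

url_list = ['http://www.baidu.com', 'http://www.bing.com']
remaining = len(url_list)          # how many responses are still outstanding

def handle_response(response):
    global remaining
    if response.error:
        print("Error:", response.error)
    else:
        print(len(response.body))
    remaining -= 1
    if remaining == 0:             # last response arrived: stop the IO loop
        ioloop.IOLoop.current().stop()

def func():
    http_client = AsyncHTTPClient()
    for url in url_list:
        http_client.fetch(HTTPRequest(url), handle_response)

ioloop.IOLoop.current().add_callback(func)
ioloop.IOLoop.current().start()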

 

More Twisted

# -*- coding: utf-8 -*-
from twisted.internet import reactor
from twisted.web.client import getPage
import urllib.parse

def one_done(arg):
    print(arg)
    reactor.stop()

post_data = urllib.parse.urlencode({'check_data': 'adf'})
post_data = bytes(post_data, encoding='utf8')
headers = {b'Content-Type': b'application/x-www-form-urlencoded'}
response = getPage(bytes('http://dig.chouti.com/login', encoding='utf8'),
                   method=bytes('POST', encoding='utf8'),
                   postdata=post_data,
                   cookies={},
                   headers=headers)
response.addBoth(one_done)

reactor.run()

 

All of the above are asynchronous-IO request modules, built in or third-party, that are easy to use and greatly improve throughput.

Under the hood, asynchronous IO boils down to non-blocking sockets plus IO multiplexing.
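Before the fuller module below, here is a minimal sketch of that combination: one non-blocking socket watched by select, speaking plain HTTP (single host, port 80 assumed):

import select
import socket

sock = socket.socket()
sock.setblocking(False)                  # connect/recv no longer block
try:
    sock.connect(('www.baidu.com', 80))  # returns immediately; connect finishes in the background
except BlockingIOError:
    pass

request = b'GET / HTTP/1.0\r\nHost: www.baidu.com\r\n\r\n'
sent = False
chunks = []

while True:
    # readable means data arrived; writable means the connection is established
    r, w, _ = select.select([sock], [] if sent else [sock], [], 5)
    if not r and not w:                  # timed out
        break
    if w and not sent:                   # connected: send the request
        sock.sendall(request)
        sent = True
    if r:                                # response data (or EOF) is ready
        data = sock.recv(8096)
        if not data:
            break
        chunks.append(data)

sock.close()
print(b''.join(chunks)[:150])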

 

 

Non-blocking sockets + IO multiplexing

 

The ultimate hand-rolled asynchronous IO module (select / poll / epoll)

import select
import socket
import time

class AsyncTimeoutException(TimeoutError):
    """request-timeout exception"""

    def __init__(self, msg):
        self.msg = msg
        super(AsyncTimeoutException, self).__init__(msg)

class HttpContext(object):
    """wraps the basic data of a request and its response"""

    def __init__(self, sock, host, port, method, url, data, callback, timeout=5):
        """
        sock: the client socket for this request
        host: the host being requested
        port: the port being requested
        method: the HTTP method
        url: the URL being requested
        data: the request-body data
        callback: the callback to run when the request completes
        timeout: the request timeout
        """
        self.sock = sock
        self.callback = callback
        self.host = host
        self.port = port
        self.method = method
        self.url = url
        self.data = data

        self.timeout = timeout

        self.__start_time = time.time()
        self.__buffer = []

    def is_timeout(self):
        """has this request already timed out?"""
        current_time = time.time()
        if (self.__start_time + self.timeout) < current_time:
            return True

    def fileno(self):
        """the file descriptor of the request socket, used by select"""
        return self.sock.fileno()

    def write(self, data):
        """append response data to the buffer"""
        self.__buffer.append(data)

    def finish(self, exc=None):
        """the response is complete; run the request's callback"""
        if not exc:
            response = b''.join(self.__buffer)
            self.callback(self, response, exc)
        else:
            self.callback(self, None, exc)

    def send_request_data(self):
        content = """%s %s HTTP/1.0\r\nHost: %s\r\n\r\n%s""" % (
            self.method.upper(), self.url, self.host, self.data,)
        return content.encode(encoding='utf8')

class AsyncRequest(object):
    def __init__(self):
        self.fds = []
        self.connections = []

    def add_request(self, host, port, method, url, data, callback, timeout):
        """create a request"""
        client = socket.socket()
        client.setblocking(False)
        try:
            client.connect((host, port))
        except BlockingIOError as e:
            pass
            # print('connection request has been sent to the remote host')
        req = HttpContext(client, host, port, method, url, data, callback, timeout)
        self.connections.append(req)
        self.fds.append(req)

    def check_conn_timeout(self):
        """check all requests and terminate any that have timed out"""
        timeout_list = []
        for context in self.connections:
            if context.is_timeout():
                timeout_list.append(context)
        for context in timeout_list:
            context.finish(AsyncTimeoutException('request timed out'))
            self.fds.remove(context)
            self.connections.remove(context)

    def running(self):
        """the event loop: watch the request sockets for readiness and act on it"""
        while True:
            r, w, e = select.select(self.fds, self.connections, self.fds, 0.05)

            if not self.fds:
                return

            for context in r:
                sock = context.sock
                while True:
                    try:
                        data = sock.recv(8096)
                        if not data:
                            self.fds.remove(context)
                            context.finish()
                            break
                        else:
                            context.write(data)
                    except BlockingIOError as e:
                        break
                    except TimeoutError as e:
                        self.fds.remove(context)
                        self.connections.remove(context)
                        context.finish(e)
                        break

            for context in w:
                # connected to the remote server; start sending the request data
                if context in self.fds:
                    data = context.send_request_data()
                    context.sock.sendall(data)
                    self.connections.remove(context)

            self.check_conn_timeout()

if __name__ == '__main__':
    def callback_func(context, response, ex):
        """
        :param context: the HttpContext object wrapping the request details
        :param response: the response body
        :param ex: the exception, if one occurred (None otherwise)
        :return:
        """
        print(context, response, ex)

    obj = AsyncRequest()
    url_list = [
        {'host': 'www.google.com', 'port': 80, 'method': 'GET', 'url': '/', 'data': '', 'timeout': 5,
         'callback': callback_func},
        {'host': 'www.baidu.com', 'port': 80, 'method': 'GET', 'url': '/', 'data': '', 'timeout': 5,
         'callback': callback_func},
        {'host': 'www.bing.com', 'port': 80, 'method': 'GET', 'url': '/', 'data': '', 'timeout': 5,
         'callback': callback_func},
    ]
    for item in url_list:
        print(item)
        obj.add_request(**item)

    obj.running()

 

 

Scrapy

 

Scrapy is an application framework for crawling websites and extracting structured data. It can be used for data mining, information processing, storing historical data, and a range of similar programs.
It was originally designed for page scraping (more precisely, web scraping), but it can also fetch data returned by APIs (such as Amazon Associates Web Services) or serve as a general-purpose web crawler. Scrapy is widely used for data mining, monitoring, and automated testing.

Scrapy uses the Twisted asynchronous networking library to handle network communication. The overall architecture is roughly made up of the components below.

Scrapy's main components:

  • Engine (Scrapy)
    Handles the data flow of the whole system and triggers events (the core of the framework)
  • Scheduler
    Accepts requests from the engine, pushes them onto a queue, and hands them back when the engine asks again. Think of it as a priority queue of URLs (the addresses of pages to crawl) that decides what to fetch next while weeding out duplicate URLs
  • Downloader
    Downloads page content and returns it to the spiders (the downloader is built on Twisted, an efficient asynchronous model)
  • Spiders
    The spiders do the real work: they extract the information they need, the so-called items, from specific pages. Users can also extract links from them so Scrapy keeps crawling further pages
  • Item Pipeline
    Processes the items the spiders extract from pages. Its main jobs are persisting items, validating them, and stripping out unwanted data. Once a page has been parsed by a spider, its items are sent to the pipeline and pass through several processing steps in a fixed order.
  • Downloader Middlewares
    A framework hook between the Scrapy engine and the downloader, mainly processing the requests and responses passing between them.
  • Spider Middlewares
    A framework hook between the Scrapy engine and the spiders, mainly processing the spiders' response input and request output.
  • Scheduler Middlewares
    Middleware between the Scrapy engine and the scheduler, handling the requests and responses sent between them.

The Scrapy run flow is roughly:

  1. The engine takes a link (URL) from the scheduler for the next crawl
  2. The engine wraps the URL in a Request and passes it to the downloader
  3. The downloader fetches the resource and wraps it in a Response
  4. The spider parses the Response
  5. Parsed items are handed to the item pipeline for further processing
  6. Parsed links (URLs) are handed back to the scheduler to await crawling

Steps 4 to 6 are sketched in code right below.
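A compact sketch of how steps 4 to 6 look in spider code (quotes.toscrape.com is a public scraping sandbox, used here as an assumed example; fuller spiders follow later in this article):

import scrapy
from scrapy.http import Request

class QuotesSpider(scrapy.Spider):
    name = 'quotes_demo'
    start_urls = ['http://quotes.toscrape.com/']

    def parse(self, response):
        # steps 4-5: parse the Response and hand items (a plain dict works) to the pipeline
        for text in response.xpath('//div[@class="quote"]/span[@class="text"]/text()').extract():
            yield {'text': text}
        # step 6: hand parsed links back to the scheduler as new Requests
        next_page = response.xpath('//li[@class="next"]/a/@href').extract_first()
        if next_page:
            yield Request(response.urljoin(next_page), callback=self.parse)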

 

I. Installing Scrapy

 

Linux
      pip3 install scrapy

Windows

  • a.  pip3 install wheel
  • b.  Download Twisted: http://www.lfd.uci.edu/~gohlke/pythonlibs/#twisted
  • c.  In the download directory, run: pip3 install Twisted‑17.1.0‑cp35‑cp35m‑win_amd64.whl
  • d.  pip3 install scrapy
  • e.  Download and install pywin32: https://sourceforge.net/projects/pywin32/files/

 

II. Basic usage of Scrapy

 

1. Basic commands

1. scrapy startproject <project_name>
   - create a project in the current directory (similar to Django)

2. scrapy genspider [-t template] <name> <domain>
   - create a spider, e.g.:
     scrapy genspider -t basic oldboy oldboy.com
     scrapy genspider -t xmlfeed autohome autohome.com.cn
   PS:
     list all templates:  scrapy genspider -l
     show a template:     scrapy genspider -d <template_name>

3. scrapy list
   - list the project's spiders

4. scrapy crawl <spider_name>
   - run a single spider

 

2. Project structure and spiders in brief

project_name/
    scrapy.cfg
    project_name/
        __init__.py
        items.py
        pipelines.py
        settings.py
        spiders/
            __init__.py
            spider1.py
            spider2.py
            spider3.py

What the files are for:

  • scrapy.cfg   the project's master configuration (the real crawler settings live in settings.py)
  • items.py    data-storage templates that structure the data, like Django's Model
  • pipelines   data-processing behavior, e.g. persisting the structured data
  • settings.py configuration such as recursion depth, concurrency, download delay, and so on
  • spiders     the spider directory, where you create files and write crawling rules

Note: spider files are conventionally named after the site's domain.

import scrapy

class XiaoHuarSpider(scrapy.spiders.Spider):
    name = "xiaohuar"                      # the spider's name *****
    allowed_domains = ["xueshengmai.com"]  # allowed domains
    start_urls = [
        "http://www.xueshengmai.com/hua/",  # start URL
    ]

    def parse(self, response):
        # callback invoked once the start URL has been fetched
        pass

About the encoding problem on Windows:

import io
import sys
sys.stdout = io.TextIOWrapper(sys.stdout.buffer, encoding='gb18030')

 

3. Starting a project

# -*- coding: utf-8 -*-
import scrapy
from scrapy.selector import Selector
from scrapy.http.request import Request

class DigSpider(scrapy.Spider):
    # the spider's name, used to launch it from the command line
    name = "dig"

    # allowed domains
    allowed_domains = ["chouti.com"]

    # start URLs
    start_urls = [
        'http://dig.chouti.com/',
    ]

    has_request_set = {}

    def parse(self, response):
        print(response.url)

        hxs = Selector(response=response)
        page_list = hxs.xpath(r'//div[@id="dig_lcpage"]//a[re:test(@href, "/all/hot/recent/\d+")]/@href').extract()
        for page in page_list:
            page_url = 'http://dig.chouti.com%s' % page
            key = self.md5(page_url)
            if key in self.has_request_set:
                pass
            else:
                self.has_request_set[key] = page_url
                obj = Request(url=page_url, method='GET', callback=self.parse)
                yield obj

    @staticmethod
    def md5(val):
        import hashlib
        ha = hashlib.md5()
        ha.update(bytes(val, encoding='utf-8'))
        key = ha.hexdigest()
        return key

if __name__ == '__main__':
    from scrapy import cmdline
    cmdline.execute('scrapy crawl dig'.split())

To run this spider, cd into the project directory in a terminal and run: scrapy crawl dig --nolog

The key points in the code above:

  • Request is a class that wraps a user request; yielding one from a callback means "keep crawling this URL"
  • Selector structures the HTML and provides selector functionality (HtmlXpathSelector has been deprecated)

 

4. Scrapy selectors

# -*- coding: utf-8 -*-
from scrapy.selector import Selector
from scrapy.http import HtmlResponse

html = """<!DOCTYPE html>
<html>
    <head lang="en">
        <meta charset="UTF-8">
        <title></title>
    </head>
    <body>
        <ul>
            <li class="item-"><a id='i1' href="link.html">first item</a></li>
            <li class="item-0"><a id='i2' href="llink.html">first item</a></li>
            <li class="item-1"><a href="llink2.html">second item<span>vv</span></a></li>
        </ul>
        <div><a href="llink2.html">second item</a></div>
    </body>
</html>
"""

response = HtmlResponse(url='http://example.com', body=html, encoding='utf-8')

hxs = Selector(response=response).xpath('//a')
print(hxs)
hxs = Selector(response=response).xpath('//a[2]')
print(hxs)
hxs = Selector(response=response).xpath('//a[@id]')
print(hxs)
hxs = Selector(response=response).xpath('//a[@id="i1"]')
print(hxs)
hxs = Selector(response=response).xpath('//a[@href="link.html"][@id="i1"]')
print(hxs)
hxs = Selector(response=response).xpath('//a[contains(@href, "link")]')
print(hxs)
hxs = Selector(response=response).xpath('//a[starts-with(@href, "link")]')
print(hxs)
hxs = Selector(response=response).xpath(r'//a[re:test(@id, "i\d+")]')
print(hxs)
hxs = Selector(response=response).xpath(r'//a[re:test(@id, "i\d+")]/text()').extract()
print(hxs)
hxs = Selector(response=response).xpath(r'//a[re:test(@id, "i\d+")]/@href').extract()
print(hxs)
hxs = Selector(response=response).xpath('/html/body/ul/li/a/@href').extract()
print(hxs)
hxs = Selector(response=response).xpath('//body/ul/li/a/@href').extract_first()
print(hxs)

ul_list = Selector(response=response).xpath('//body/ul/li')
for item in ul_list:
    v = item.xpath('./a/span')
    # or
    # v = item.xpath('a/span')
    # or
    # v = item.xpath('*/a/span')
    print(v)

Logging in to Chouti automatically and upvoting

# -*- coding: utf-8 -*-
import scrapy
from scrapy.selector import Selector
from scrapy.http.request import Request
from scrapy.http.cookies import CookieJar
from scrapy import FormRequest

class ChouTiSpider(scrapy.Spider):
    # the spider's name, used to launch it from the command line
    name = "chouti"
    # allowed domains
    allowed_domains = ["chouti.com"]

    cookie_dict = {}
    has_request_set = {}

    def start_requests(self):
        url = 'http://dig.chouti.com/'
        # return [Request(url=url, callback=self.login)]
        yield Request(url=url, callback=self.login)

    def login(self, response):
        cookie_jar = CookieJar()
        cookie_jar.extract_cookies(response, response.request)
        for k, v in cookie_jar._cookies.items():
            for i, j in v.items():
                for m, n in j.items():
                    self.cookie_dict[m] = n.value

        req = Request(
            url='http://dig.chouti.com/login',
            method='POST',
            headers={'Content-Type': 'application/x-www-form-urlencoded; charset=UTF-8'},
            body='phone=8615131255089&password=pppppppp&oneMonth=1',
            cookies=self.cookie_dict,
            callback=self.check_login
        )
        yield req

    def check_login(self, response):
        req = Request(
            url='http://dig.chouti.com/',
            method='GET',
            callback=self.show,
            cookies=self.cookie_dict,
            dont_filter=True
        )
        yield req

    def show(self, response):
        # print(response)
        hxs = Selector(response=response)
        news_list = hxs.xpath('//div[@id="content-list"]/div[@class="item"]')
        for new in news_list:
            # temp = new.xpath('div/div[@class="part2"]/@share-linkid').extract()
            link_id = new.xpath('*/div[@class="part2"]/@share-linkid').extract_first()
            yield Request(
                url='http://dig.chouti.com/link/vote?linksId=%s' % (link_id,),
                method='POST',
                cookies=self.cookie_dict,
                callback=self.do_favor
            )

        page_list = hxs.xpath(r'//div[@id="dig_lcpage"]//a[re:test(@href, "/all/hot/recent/\d+")]/@href').extract()
        for page in page_list:
            page_url = 'http://dig.chouti.com%s' % page
            import hashlib
            hash = hashlib.md5()
            hash.update(bytes(page_url, encoding='utf-8'))
            key = hash.hexdigest()
            if key in self.has_request_set:
                pass
            else:
                self.has_request_set[key] = page_url
                yield Request(
                    url=page_url,
                    method='GET',
                    callback=self.show
                )

    def do_favor(self, response):
        print(response.text)

if __name__ == '__main__':
    from scrapy import cmdline
    cmdline.execute('scrapy crawl chouti'.split())

Note: set DEPTH_LIMIT = 1 in settings.py to cap the "recursion" depth.

 

5. Formatting the data

The examples above only do simple processing, directly in the parse method. To do more with the scraped data, use Scrapy's items to structure it and hand everything to the pipelines for processing.

spiders/xiaohuar.py

import scrapy
from scrapy.selector import Selector
from scrapy.http.request import Request
from scrapy.http.cookies import CookieJar
from scrapy import FormRequest

from ..items import XiaoHuarItem

class XiaoHuarSpider(scrapy.Spider):
    # the spider's name, used to launch it from the command line
    name = "xiaohuar"
    # allowed domains
    allowed_domains = ["xueshengmai.com"]

    start_urls = [
        "http://www.xueshengmai.com/hua/",
    ]

    # custom_settings = {
    #     'ITEM_PIPELINES':{
    #         'spider1.pipelines.JsonPipeline': 100
    #     }
    # }
    has_request_set = {}

    def parse(self, response):
        # analyse the page
        # find the content matching our rules (pictures) and save it
        # find all the a tags, then visit them, level by level
        hxs = Selector(response=response)

        items = hxs.xpath('//div[@class="item_list infinite_scroll"]/div')
        for item in items:
            src = item.select('.//div[@class="img"]/a/img/@src').extract_first()
            name = item.select('.//div[@class="img"]/span/text()').extract_first()
            school = item.select('.//div[@class="img"]/div[@class="btns"]/a/text()').extract_first()
            url = "http://www.xiaohuar.com%s" % src
            obj = XiaoHuarItem(name=name, school=school, url=url)
            yield obj

        urls = hxs.xpath(r'//a[re:test(@href, "http://www.xueshengmai.com/list-1-\d+.html")]/@href')
        for url in urls:
            key = self.md5(url)
            if key in self.has_request_set:
                pass
            else:
                self.has_request_set[key] = url
                req = Request(url=url, method='GET', callback=self.parse)
                yield req

    @staticmethod
    def md5(val):
        import hashlib
        ha = hashlib.md5()
        ha.update(bytes(val, encoding='utf-8'))
        key = ha.hexdigest()
        return key

items

import scrapy

class XiaoHuarItem(scrapy.Item):
    name = scrapy.Field()
    school = scrapy.Field()
    url = scrapy.Field()

 

An improvement (no item needed; yield a plain Python dict):

You can do without the item and simply yield a plain Python dict, because a Scrapy item is itself essentially a Python dict.

import scrapy
from scrapy.selector import Selector
from scrapy.http.request import Request
from scrapy.http.cookies import CookieJar
from scrapy import FormRequest

class XiaoHuarSpider(scrapy.Spider):
    # the spider's name, used to launch it from the command line
    name = "xiaohuar"
    # allowed domains
    allowed_domains = ["xueshengmai.com"]

    start_urls = [
        "http://www.xueshengmai.com/hua/",
    ]

    # custom_settings = {
    #     'ITEM_PIPELINES':{
    #         'spider1.pipelines.JsonPipeline': 100
    #     }
    # }
    has_request_set = {}

    def parse(self, response):
        # analyse the page
        # find the content matching our rules (pictures) and save it
        # find all the a tags, then visit them, level by level
        hxs = Selector(response=response)

        items = hxs.xpath('//div[@class="item_list infinite_scroll"]/div')
        for item in items:
            src = item.select('.//div[@class="img"]/a/img/@src').extract_first()
            name = item.select('.//div[@class="img"]/span/text()').extract_first()
            school = item.select('.//div[@class="img"]/div[@class="btns"]/a/text()').extract_first()
            url = "http://www.xiaohuar.com%s" % src
            obj = dict(name=name, school=school, url=url)
            yield obj

        urls = hxs.xpath(r'//a[re:test(@href, "http://www.xueshengmai.com/list-1-\d+.html")]/@href')
        for url in urls:
            key = self.md5(url)
            if key in self.has_request_set:
                pass
            else:
                self.has_request_set[key] = url
                req = Request(url=url, method='GET', callback=self.parse)
                yield req

    @staticmethod
    def md5(val):
        import hashlib
        ha = hashlib.md5()
        ha.update(bytes(val, encoding='utf-8'))
        key = ha.hexdigest()
        return key

pipelines

import json
import os
import requests

class JsonPipeline(object):
    def __init__(self):
        self.file = open('xiaohua.txt', 'w')

    def process_item(self, item, spider):
        v = json.dumps(dict(item), ensure_ascii=False)
        self.file.write(v)
        self.file.write('\n')
        self.file.flush()
        return item

class FilePipeline(object):
    def __init__(self):
        if not os.path.exists('imgs'):
            os.makedirs('imgs')

    def process_item(self, item, spider):
        response = requests.get(item['url'], stream=True)
        file_name = '%s_%s.jpg' % (item['name'], item['school'])
        with open(os.path.join('imgs', file_name), mode='wb') as f:
            f.write(response.content)
        return item

settings

ITEM_PIPELINES = {
    'spider1.pipelines.JsonPipeline': 100,
    'spider1.pipelines.FilePipeline': 300,
}
# The integer after each entry fixes the order in which items flow through the
# pipelines, lowest number first; by convention the numbers stay within 0-1000.

Pipelines can do much more, for example:

A custom pipeline

from scrapy.exceptions import DropItem

class CustomPipeline(object):
    def __init__(self, v):
        self.value = v

    def process_item(self, item, spider):
        # process and persist the item

        # returning the item lets later pipelines keep processing it
        return item

        # to drop the item so no later pipeline sees it:
        # raise DropItem()

    @classmethod
    def from_crawler(cls, crawler):
        """
        called at start-up to create the pipeline object
        :param crawler:
        :return:
        """
        val = crawler.settings.getint('MMMM')
        return cls(val)

    def open_spider(self, spider):
        """
        called when the spider starts
        :param spider:
        :return:
        """
        print('000000')

    def close_spider(self, spider):
        """
        called when the spider closes
        :param spider:
        :return:
        """
        print('111111')

 

6. Middleware

Spider middleware

class SpiderMiddleware(object):
    def process_spider_input(self, response, spider):
        """
        called once the download has finished, before the response is handed to parse
        :param response:
        :param spider:
        :return:
        """
        pass

    def process_spider_output(self, response, result, spider):
        """
        called when the spider has finished processing and returns
        :param response:
        :param result:
        :param spider:
        :return: must return an iterable containing Request or Item objects
        """
        return result

    def process_spider_exception(self, response, exception, spider):
        """
        called on exceptions
        :param response:
        :param exception:
        :param spider:
        :return: None to pass the exception on to later middleware; an iterable
                 of Response or Item objects to hand to the scheduler or pipeline
        """
        return None

    def process_start_requests(self, start_requests, spider):
        """
        called when the spider starts
        :param start_requests:
        :param spider:
        :return: an iterable containing Request objects
        """
        return start_requests

Downloader middleware

class DownMiddleware1(object):
    def process_request(self, request, spider):
        """
        called for every request, through each downloader middleware, before the download
        :param request:
        :param spider:
        :return: None to continue to the next middleware and download;
                 a Response object to stop process_request and start process_response;
                 a Request object to stop the middleware chain and reschedule the request;
                 raise IgnoreRequest to stop process_request and start process_exception
        """
        pass

    def process_response(self, request, response, spider):
        """
        called on the way back, after the download
        :param request:
        :param response:
        :param spider:
        :return: a Response object is passed on to the other middlewares' process_response;
                 a Request object stops the middleware chain and reschedules the download;
                 raise IgnoreRequest to invoke Request.errback
        """
        print('response1')
        return response

    def process_exception(self, request, exception, spider):
        """
        called when the download handler or a process_request (downloader middleware) raises
        :param request:
        :param exception:
        :param spider:
        :return: None to pass the exception to later middleware;
                 a Response object stops later process_exception calls;
                 a Request object stops the middleware chain and reschedules the download
        """
        return None

 

7. Custom commands

  • Create a directory (any name, e.g. commands) at the same level as spiders
  • Create crawlall.py inside it (the file name becomes the command name)

crawlall.py

from scrapy.commands import ScrapyCommand
from scrapy.utils.project import get_project_settings

class Command(ScrapyCommand):
    requires_project = True

    def syntax(self):
        return '[options]'

    def short_desc(self):
        return 'Runs all of the spiders'

    def run(self, args, opts):
        spider_list = self.crawler_process.spiders.list()
        for name in spider_list:
            self.crawler_process.crawl(name, **opts.__dict__)
        self.crawler_process.start()

  • Add COMMANDS_MODULE = '<project_name>.<directory_name>' to settings.py
  • Run the command from the project directory: scrapy crawlall

 

8. Custom extensions (using signals)

A custom extension uses signals to register a chosen action at a chosen point.

A custom extension

from scrapy import signals

class MyExtension(object):
    def __init__(self, value):
        self.value = value

    @classmethod
    def from_crawler(cls, crawler):
        val = crawler.settings.getint('MMMM')
        ext = cls(val)

        crawler.signals.connect(ext.spider_opened, signal=signals.spider_opened)
        crawler.signals.connect(ext.spider_closed, signal=signals.spider_closed)

        return ext

    def spider_opened(self, spider):
        print('open')

    def spider_closed(self, spider):
        print('close')

 

9. Avoiding duplicate visits

By default Scrapy deduplicates requests with scrapy.dupefilter.RFPDupeFilter; the related settings are:

DUPEFILTER_CLASS = 'scrapy.dupefilter.RFPDupeFilter'
DUPEFILTER_DEBUG = False
JOBDIR = "the directory for the seen-requests log, e.g. /root/"  # the final path becomes /root/requests.seen

Custom URL deduplication

class RepeatUrl:
    def __init__(self):
        self.visited_url = set()

    @classmethod
    def from_settings(cls, settings):
        """
        called at initialization
        :param settings:
        :return:
        """
        return cls()

    def request_seen(self, request):
        """
        has this request been seen before?
        :param request:
        :return: True if already visited, False otherwise
        """
        if request.url in self.visited_url:
            return True
        self.visited_url.add(request.url)
        return False

    def open(self):
        """
        called when crawling starts
        :return:
        """
        print('open replication')

    def close(self, reason):
        """
        called when crawling ends
        :param reason:
        :return:
        """
        print('close replication')

    def log(self, request, spider):
        """
        log a duplicate
        :param request:
        :param spider:
        :return:
        """
        print('repeat', request.url)

 

10. Other settings

# -*- coding: utf-8 -*-

# Scrapy settings for step8_king project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
#     http://doc.scrapy.org/en/latest/topics/settings.html
#     http://scrapy.readthedocs.org/en/latest/topics/downloader-middleware.html
#     http://scrapy.readthedocs.org/en/latest/topics/spider-middleware.html

# 1. bot name
BOT_NAME = 'step8_king'

# 2. where the spiders live
SPIDER_MODULES = ['step8_king.spiders']
NEWSPIDER_MODULE = 'step8_king.spiders'

# Crawl responsibly by identifying yourself (and your website) on the user-agent
# 3. the client User-Agent header
# USER_AGENT = 'step8_king (+http://www.yourdomain.com)'

# Obey robots.txt rules
# 4. robots.txt handling
# ROBOTSTXT_OBEY = False

# Configure maximum concurrent requests performed by Scrapy (default: 16)
# 5. number of concurrent requests
# CONCURRENT_REQUESTS = 4

# Configure a delay for requests for the same website (default: 0)
# See http://scrapy.readthedocs.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
# 6. download delay, in seconds
# DOWNLOAD_DELAY = 2

# The download delay setting will honor only one of:
# 7. per-domain concurrency; the download delay also applies per domain
# CONCURRENT_REQUESTS_PER_DOMAIN = 2
# per-IP concurrency; when set, CONCURRENT_REQUESTS_PER_DOMAIN is ignored
# and the download delay applies per IP instead
# CONCURRENT_REQUESTS_PER_IP = 3

# Disable cookies (enabled by default)
# 8. whether cookies are supported (handled through a cookiejar)
# COOKIES_ENABLED = True
# COOKIES_DEBUG = True

# Disable Telnet Console (enabled by default)
# 9. the Telnet console inspects and drives the running crawler:
#    telnet to the host/port, then issue commands
# TELNETCONSOLE_ENABLED = True
# TELNETCONSOLE_HOST = '127.0.0.1'
# TELNETCONSOLE_PORT = [6023,]

# 10. default request headers
# Override the default request headers:
# DEFAULT_REQUEST_HEADERS = {
#     'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
#     'Accept-Language': 'en',
# }

# Configure item pipelines
# See http://scrapy.readthedocs.org/en/latest/topics/item-pipeline.html
# 11. the pipelines that process items
# ITEM_PIPELINES = {
#    'step8_king.pipelines.JsonPipeline': 700,
#    'step8_king.pipelines.FilePipeline': 500,
# }

# 12. custom extensions, driven by signals
# Enable or disable extensions
# See http://scrapy.readthedocs.org/en/latest/topics/extensions.html
# EXTENSIONS = {
#     # 'step8_king.extensions.MyExtension': 500,
# }

# 13. maximum crawl depth; the current depth is available via meta; 0 means unlimited
# DEPTH_LIMIT = 3

# 14. crawl order: 0 means depth-first/LIFO (the default); 1 means breadth-first/FIFO

# last in, first out: depth-first
# DEPTH_PRIORITY = 0
# SCHEDULER_DISK_QUEUE = 'scrapy.squeue.PickleLifoDiskQueue'
# SCHEDULER_MEMORY_QUEUE = 'scrapy.squeue.LifoMemoryQueue'
# first in, first out: breadth-first
# DEPTH_PRIORITY = 1
# SCHEDULER_DISK_QUEUE = 'scrapy.squeue.PickleFifoDiskQueue'
# SCHEDULER_MEMORY_QUEUE = 'scrapy.squeue.FifoMemoryQueue'

# 15. the scheduler queue
# SCHEDULER = 'scrapy.core.scheduler.Scheduler'
# from scrapy.core.scheduler import Scheduler

# 16. URL deduplication
# DUPEFILTER_CLASS = 'step8_king.duplication.RepeatUrl'

# Enable and configure the AutoThrottle extension (disabled by default)
# See http://doc.scrapy.org/en/latest/topics/autothrottle.html
"""
17. the auto-throttle algorithm
    from scrapy.contrib.throttle import AutoThrottle
    auto-throttle settings:
    1. get the minimum delay          DOWNLOAD_DELAY
    2. get the maximum delay          AUTOTHROTTLE_MAX_DELAY
    3. set the initial download delay AUTOTHROTTLE_START_DELAY
    4. when a request completes, take its "connection" time as latency,
       i.e. the time between connecting and receiving the response headers
    5. used in the computation:       AUTOTHROTTLE_TARGET_CONCURRENCY
       target_delay = latency / self.target_concurrency
       new_delay = (slot.delay + target_delay) / 2.0  # slot.delay is the previous delay
       new_delay = max(target_delay, new_delay)
       new_delay = min(max(self.mindelay, new_delay), self.maxdelay)
       slot.delay = new_delay
"""

# enable auto-throttling
# AUTOTHROTTLE_ENABLED = True
# The initial download delay
# AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
# AUTOTHROTTLE_MAX_DELAY = 10
# The average number of requests Scrapy should be sending in parallel to each remote server
# AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
# AUTOTHROTTLE_DEBUG = True

# Enable and configure HTTP caching (disabled by default)
# See http://scrapy.readthedocs.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
"""
18. enable the cache
    caches requests/responses that have already been made so they can be reused later
    from scrapy.downloadermiddlewares.httpcache import HttpCacheMiddleware
    from scrapy.extensions.httpcache import DummyPolicy
    from scrapy.extensions.httpcache import FilesystemCacheStorage
"""
# whether the cache policy is enabled
# HTTPCACHE_ENABLED = True

# cache policy: cache every request; later requests are served straight from the cache
# HTTPCACHE_POLICY = "scrapy.extensions.httpcache.DummyPolicy"
# cache policy: cache according to the HTTP response headers (Cache-Control, Last-Modified, ...)
# HTTPCACHE_POLICY = "scrapy.extensions.httpcache.RFC2616Policy"

# cache expiry time
# HTTPCACHE_EXPIRATION_SECS = 0
# cache directory
# HTTPCACHE_DIR = 'httpcache'
# HTTP status codes that are never cached
# HTTPCACHE_IGNORE_HTTP_CODES = []
# the cache storage backend
# HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'

"""
19. proxies, configured via environment variables
    from scrapy.contrib.downloadermiddleware.httpproxy import HttpProxyMiddleware

    option 1: the built-in middleware reads os.environ
        {
            http_proxy:  http://root:woshiniba@192.168.11.11:9999/
            https_proxy: http://192.168.11.11:9999/
        }
    option 2: a custom downloader middleware

        def to_bytes(text, encoding=None, errors='strict'):
            if isinstance(text, bytes):
                return text
            if not isinstance(text, six.string_types):
                raise TypeError('to_bytes must receive a unicode, str or bytes '
                                'object, got %s' % type(text).__name__)
            if encoding is None:
                encoding = 'utf-8'
            return text.encode(encoding, errors)

        class ProxyMiddleware(object):
            def process_request(self, request, spider):
                PROXIES = [
                    {'ip_port': '111.11.228.75:80', 'user_pass': ''},
                    {'ip_port': '120.198.243.22:80', 'user_pass': ''},
                    {'ip_port': '111.8.60.9:8123', 'user_pass': ''},
                    {'ip_port': '101.71.27.120:80', 'user_pass': ''},
                    {'ip_port': '122.96.59.104:80', 'user_pass': ''},
                    {'ip_port': '122.224.249.122:8088', 'user_pass': ''},
                ]
                proxy = random.choice(PROXIES)
                if proxy['user_pass'] is not None:
                    request.meta['proxy'] = to_bytes("http://%s" % proxy['ip_port'])
                    encoded_user_pass = base64.encodestring(to_bytes(proxy['user_pass']))
                    request.headers['Proxy-Authorization'] = to_bytes('Basic ' + encoded_user_pass)
                    print("**************ProxyMiddleware have pass************" + proxy['ip_port'])
                else:
                    print("**************ProxyMiddleware no pass************" + proxy['ip_port'])
                    request.meta['proxy'] = to_bytes("http://%s" % proxy['ip_port'])

        DOWNLOADER_MIDDLEWARES = {
            'step8_king.middlewares.ProxyMiddleware': 500,
        }
"""

"""
20. HTTPS access
    there are two cases:
    1. the target site uses a trusted certificate (supported by default)
        DOWNLOADER_HTTPCLIENTFACTORY = "scrapy.core.downloader.webclient.ScrapyHTTPClientFactory"
        DOWNLOADER_CLIENTCONTEXTFACTORY = "scrapy.core.downloader.contextfactory.ScrapyClientContextFactory"
    2. the target site uses a custom certificate
        DOWNLOADER_HTTPCLIENTFACTORY = "scrapy.core.downloader.webclient.ScrapyHTTPClientFactory"
        DOWNLOADER_CLIENTCONTEXTFACTORY = "step8_king.https.MySSLFactory"

        # https.py
        from scrapy.core.downloader.contextfactory import ScrapyClientContextFactory
        from twisted.internet.ssl import (optionsForClientTLS, CertificateOptions, PrivateCertificate)

        class MySSLFactory(ScrapyClientContextFactory):
            def getCertificateOptions(self):
                from OpenSSL import crypto
                v1 = crypto.load_privatekey(crypto.FILETYPE_PEM, open('/Users/wupeiqi/client.key.unsecure', mode='r').read())
                v2 = crypto.load_certificate(crypto.FILETYPE_PEM, open('/Users/wupeiqi/client.pem', mode='r').read())
                return CertificateOptions(
                    privateKey=v1,   # a pKey object
                    certificate=v2,  # an X509 object
                    verify=False,
                    method=getattr(self, 'method', getattr(self, '_ssl_method', None))
                )

    other related classes:
        scrapy.core.downloader.handlers.http.HttpDownloadHandler
        scrapy.core.downloader.webclient.ScrapyHTTPClientFactory
        scrapy.core.downloader.contextfactory.ScrapyClientContextFactory
    related settings:
        DOWNLOADER_HTTPCLIENTFACTORY
        DOWNLOADER_CLIENTCONTEXTFACTORY
"""

"""
21. spider middleware
    (see the SpiderMiddleware skeleton in section 6)

    built-in spider middleware:
        'scrapy.contrib.spidermiddleware.httperror.HttpErrorMiddleware': 50,
        'scrapy.contrib.spidermiddleware.offsite.OffsiteMiddleware': 500,
        'scrapy.contrib.spidermiddleware.referer.RefererMiddleware': 700,
        'scrapy.contrib.spidermiddleware.urllength.UrlLengthMiddleware': 800,
        'scrapy.contrib.spidermiddleware.depth.DepthMiddleware': 900,
"""
# from scrapy.contrib.spidermiddleware.referer import RefererMiddleware
# Enable or disable spider middlewares
# See http://scrapy.readthedocs.org/en/latest/topics/spider-middleware.html
SPIDER_MIDDLEWARES = {
    # 'step8_king.middlewares.SpiderMiddleware': 543,
}

"""
22. downloader middleware
    (see the DownMiddleware1 skeleton in section 6)

    default downloader middleware:
        {
            'scrapy.contrib.downloadermiddleware.robotstxt.RobotsTxtMiddleware': 100,
            'scrapy.contrib.downloadermiddleware.httpauth.HttpAuthMiddleware': 300,
            'scrapy.contrib.downloadermiddleware.downloadtimeout.DownloadTimeoutMiddleware': 350,
            'scrapy.contrib.downloadermiddleware.useragent.UserAgentMiddleware': 400,
            'scrapy.contrib.downloadermiddleware.retry.RetryMiddleware': 500,
            'scrapy.contrib.downloadermiddleware.defaultheaders.DefaultHeadersMiddleware': 550,
            'scrapy.contrib.downloadermiddleware.redirect.MetaRefreshMiddleware': 580,
            'scrapy.contrib.downloadermiddleware.httpcompression.HttpCompressionMiddleware': 590,
            'scrapy.contrib.downloadermiddleware.redirect.RedirectMiddleware': 600,
            'scrapy.contrib.downloadermiddleware.cookies.CookiesMiddleware': 700,
            'scrapy.contrib.downloadermiddleware.httpproxy.HttpProxyMiddleware': 750,
            'scrapy.contrib.downloadermiddleware.chunked.ChunkedTransferMiddleware': 830,
            'scrapy.contrib.downloadermiddleware.stats.DownloaderStats': 850,
            'scrapy.contrib.downloadermiddleware.httpcache.HttpCacheMiddleware': 900,
        }
"""
# from scrapy.contrib.downloadermiddleware.httpauth import HttpAuthMiddleware
# Enable or disable downloader middlewares
# See http://scrapy.readthedocs.org/en/latest/topics/downloader-middleware.html
# DOWNLOADER_MIDDLEWARES = {
#    'step8_king.middlewares.DownMiddleware1': 100,
#    'step8_king.middlewares.DownMiddleware2': 500,
# }

 

11. TinyScrapy

A reference version

#!/usr/bin/env python
# -*- coding:utf-8 -*-
import types
from queue import Queue
from twisted.internet import defer
from twisted.web.client import getPage
from twisted.internet import reactor

class Request(object):
    def __init__(self, url, callback):
        self.url = url
        self.callback = callback
        self.priority = 0

class HttpResponse(object):
    def __init__(self, content, request):
        self.content = content
        self.request = request

class ChouTiSpider(object):
    def start_requests(self):
        url_list = ['http://www.cnblogs.com/', 'http://www.bing.com']
        for url in url_list:
            yield Request(url=url, callback=self.parse)

    def parse(self, response):
        print(response.request.url)
        # yield Request(url="http://www.baidu.com", callback=self.parse)

Q = Queue()

class CallLaterOnce(object):
    def __init__(self, func, *a, **kw):
        self._func = func
        self._a = a
        self._kw = kw
        self._call = None

    def schedule(self, delay=0):
        if self._call is None:
            self._call = reactor.callLater(delay, self)

    def cancel(self):
        if self._call:
            self._call.cancel()

    def __call__(self):
        self._call = None
        return self._func(*self._a, **self._kw)

class Engine(object):
    def __init__(self):
        self.nextcall = None
        self.crawlling = []
        self.max = 5
        self._closewait = None

    def get_response(self, content, request):
        response = HttpResponse(content, request)
        gen = request.callback(response)
        if isinstance(gen, types.GeneratorType):
            for req in gen:
                req.priority = request.priority + 1
                Q.put(req)

    def rm_crawlling(self, response, d):
        self.crawlling.remove(d)

    def _next_request(self, spider):
        if Q.qsize() == 0 and len(self.crawlling) == 0:
            self._closewait.callback(None)

        if len(self.crawlling) >= 5:
            return
        while len(self.crawlling) < 5:
            try:
                req = Q.get(block=False)
            except Exception as e:
                req = None
            if not req:
                return
            d = getPage(req.url.encode('utf-8'))
            self.crawlling.append(d)
            d.addCallback(self.get_response, req)
            d.addCallback(self.rm_crawlling, d)
            d.addCallback(lambda _: self.nextcall.schedule())

    @defer.inlineCallbacks
    def crawl(self):
        spider = ChouTiSpider()
        start_requests = iter(spider.start_requests())
        flag = True
        while flag:
            try:
                req = next(start_requests)
                Q.put(req)
            except StopIteration as e:
                flag = False

        self.nextcall = CallLaterOnce(self._next_request, spider)
        self.nextcall.schedule()

        self._closewait = defer.Deferred()
        yield self._closewait

    @defer.inlineCallbacks
    def pp(self):
        yield self.crawl()

_active = set()
obj = Engine()
d = obj.crawl()
_active.add(d)

li = defer.DeferredList(_active)
li.addBoth(lambda _, *a, **kw: reactor.stop())

reactor.run()

Note: see https://scrapy-chs.readthedocs.io/zh_CN/latest/index.html for more documentation.

 

 

 

 
