Python Frameworks: Scrapy Crawler (Part 1)

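When writing a crawler, most of the running time is spent on I/O requests. In single-process, single-threaded mode, every URL request blocks until its response arrives, so the requests as a whole become slow.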

1. Synchronous execution

import requests


def fetch_async(url):
    response = requests.get(url)
    return response


url_list = ['http://www.github.com', 'http://www.bing.com']

# Each request starts only after the previous one has returned
for url in url_list:
    fetch_async(url)
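To see why the sequential version is slow, a minimal sketch can replace the network call with `time.sleep`. The `fake_fetch` helper and the 0.1-second delay are assumptions made for illustration and are not part of the original code; the only point is that blocking waits add up.

```python
import time


def fake_fetch(url, delay=0.1):
    """Stand-in for requests.get (hypothetical helper): blocks for `delay` seconds."""
    time.sleep(delay)
    return url


url_list = ['http://www.github.com', 'http://www.bing.com']

start = time.perf_counter()
# Each call waits for the previous one to finish before it starts.
results = [fake_fetch(url) for url in url_list]
elapsed = time.perf_counter() - start

# With two URLs and a 0.1 s delay each, the total is roughly the sum of the delays.
print(results, round(elapsed, 2))
```

The total time grows linearly with the number of URLs, which is the cost the following approaches try to avoid.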

2. Multithreaded execution

from concurrent.futures import ThreadPoolExecutor
import requests


def fetch_async(url):
    response = requests.get(url)
    return response


url_list = ['http://www.github.com', 'http://www.bing.com']
pool = ThreadPoolExecutor(5)
for url in url_list:
    pool.submit(fetch_async, url)
pool.shutdown(wait=True)

3. Multithreading with a callback

from concurrent.futures import ThreadPoolExecutor
import requests


def fetch_async(url):
    response = requests.get(url)
    return response


def callback(future):
    # Called with the finished Future once the request completes
    print(future.result())


url_list = ['http://www.github.com', 'http://www.bing.com']
pool = ThreadPoolExecutor(5)
for url in url_list:
    v = pool.submit(fetch_async, url)
    v.add_done_callback(callback)
pool.shutdown(wait=True)
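As an alternative to registering callbacks, results can be collected in the submitting thread with `concurrent.futures.as_completed`. The sketch below is one possible arrangement, not the original author's code; `fake_fetch` is a hypothetical stand-in for `requests.get` so that the example runs without network access.

```python
import time
from concurrent.futures import ThreadPoolExecutor, as_completed


def fake_fetch(url, delay=0.1):
    """Stand-in for requests.get (hypothetical): blocks, then returns the URL."""
    time.sleep(delay)
    return url


url_list = ['http://www.github.com', 'http://www.bing.com']

collected = []
with ThreadPoolExecutor(5) as pool:
    # Map each future back to the URL it was submitted for.
    future_to_url = {pool.submit(fake_fetch, url): url for url in url_list}
    for future in as_completed(future_to_url):
        # future.result() re-raises any exception raised inside the worker.
        collected.append((future_to_url[future], future.result()))

print(sorted(collected))
```

Using the executor as a context manager calls `shutdown(wait=True)` on exit, so no explicit shutdown is needed.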

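4. Multiprocess execution

from concurrent.futures import ProcessPoolExecutor
import requests


def fetch_async(url):
    response = requests.get(url)
    return response


url_list = ['http://www.github.com', 'http://www.bing.com']

# The guard is required on platforms that start worker processes by
# re-importing this module (for example Windows)
if __name__ == '__main__':
    pool = ProcessPoolExecutor(5)
    for url in url_list:
        pool.submit(fetch_async, url)
    pool.shutdown(wait=True)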

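5. Multiprocessing with a callback

from concurrent.futures import ProcessPoolExecutor
import requests


def fetch_async(url):
    response = requests.get(url)
    return response


def callback(future):
    # Runs in the parent process once the worker has returned its result
    print(future.result())


url_list = ['http://www.github.com', 'http://www.bing.com']

# The guard is required on platforms that start worker processes by
# re-importing this module (for example Windows)
if __name__ == '__main__':
    pool = ProcessPoolExecutor(5)
    for url in url_list:
        v = pool.submit(fetch_async, url)
        v.add_done_callback(callback)
    pool.shutdown(wait=True)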

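All of the approaches above improve request throughput. The drawback of multithreading and multiprocessing is that a thread or process sits idle while its I/O is blocked, which wastes it, so asynchronous I/O is the preferred choice: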

1. asyncio, example 1

import asyncio


# async/await replaces the generator-based @asyncio.coroutine style,
# which was removed in Python 3.11
async def func1():
    print('before...func1......')
    await asyncio.sleep(5)
    print('end...func1......')


async def main():
    await asyncio.gather(func1(), func1())


asyncio.run(main())
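To make the contrast with threads concrete, the following sketch runs 1000 simulated waits on a single thread. It assumes that `asyncio.sleep` is an acceptable stand-in for a network wait; `wait_for_io` and the count of 1000 are illustrative choices, not taken from the original.

```python
import asyncio
import time


async def wait_for_io(i, delay=0.1):
    """Simulated I/O wait (assumption: asyncio.sleep stands in for a network call)."""
    await asyncio.sleep(delay)
    return i


async def main(n=1000):
    # All n coroutines wait concurrently on one thread.
    return await asyncio.gather(*(wait_for_io(i) for i in range(n)))


start = time.perf_counter()
results = asyncio.run(main())
elapsed = time.perf_counter() - start

print(len(results), round(elapsed, 2))
```

Run sequentially, the same waits would take about 100 seconds; here no thread or process is parked per request while it waits.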

2. asyncio, example 2

import asyncio


async def fetch_async(host, url='/'):
    print(host, url)
    reader, writer = await asyncio.open_connection(host, 80)

    # Build a minimal HTTP/1.0 request by hand
    request_header_content = """GET %s HTTP/1.0\r\nHost: %s\r\n\r\n""" % (url, host,)
    request_header_content = bytes(request_header_content, encoding='utf-8')

    writer.write(request_header_content)
    await writer.drain()
    # HTTP/1.0 closes the connection after the response, so read() returns everything
    text = await reader.read()
    print(host, url, text)
    writer.close()


async def main():
    return await asyncio.gather(
        fetch_async('www.cnblogs.com', '/wupeiqi/'),
        fetch_async('dig.chouti.com', '/pic/show?nid=4073644713430508&lid=10273091'),
    )


results = asyncio.run(main())
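The bytes read in the example above contain the status line, the headers, and the body in one piece. A minimal sketch of splitting them apart might look like this; `split_http_response` is a hypothetical helper, and the sample bytes are hand-written for illustration rather than captured from a real server.

```python
def split_http_response(raw):
    """Split raw HTTP response bytes into (status_line, headers, body)."""
    # The header section ends at the first blank line.
    head, _, body = raw.partition(b'\r\n\r\n')
    lines = head.decode('iso-8859-1').split('\r\n')
    status_line = lines[0]
    headers = {}
    for line in lines[1:]:
        name, _, value = line.partition(':')
        headers[name.strip().lower()] = value.strip()
    return status_line, headers, body


# Hand-written sample bytes, not a captured response.
sample = b'HTTP/1.0 200 OK\r\nContent-Type: text/html\r\n\r\n<html></html>'
status_line, headers, body = split_http_response(sample)
print(status_line, headers, body)
```

This ignores details such as repeated header names and chunked bodies, which is one reason the later examples rely on HTTP libraries instead of raw sockets.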

3. asyncio + aiohttp

import aiohttp
import asyncio


async def fetch_async(session, url):
    print(url)
    async with session.get(url) as response:
        # data = await response.read()
        # print(url, data)
        print(url, response)


async def main():
    # One ClientSession is shared by all requests
    async with aiohttp.ClientSession() as session:
        return await asyncio.gather(
            fetch_async(session, 'http://www.google.com/'),
            fetch_async(session, 'http://www.chouti.com/'),
        )


results = asyncio.run(main())

4. asyncio + requests

import asyncio
import requests


async def fetch_async(func, *args):
    loop = asyncio.get_running_loop()
    # requests is blocking, so run it in the default thread pool executor
    future = loop.run_in_executor(None, func, *args)
    response = await future
    print(response.url, response.content)


async def main():
    return await asyncio.gather(
        fetch_async(requests.get, 'http://www.cnblogs.com/wupeiqi/'),
        fetch_async(requests.get, 'http://dig.chouti.com/pic/show?nid=4073644713430508&lid=10273091'),
    )


results = asyncio.run(main())
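On Python 3.9 and later, `asyncio.to_thread` offers a shorter way to push a blocking call onto a worker thread. The sketch below assumes Python 3.9+ and uses a hypothetical `blocking_get` in place of `requests.get` so that it runs offline.

```python
import asyncio
import time


def blocking_get(url, delay=0.1):
    """Stand-in for requests.get (hypothetical): blocks the calling thread."""
    time.sleep(delay)
    return 'response for ' + url


async def main():
    urls = ['http://www.cnblogs.com/wupeiqi/', 'http://www.bing.com']
    # asyncio.to_thread runs each blocking call in the default thread pool.
    return await asyncio.gather(*(asyncio.to_thread(blocking_get, u) for u in urls))


responses = asyncio.run(main())
print(responses)
```

`asyncio.gather` returns results in the order the awaitables were passed in, regardless of which thread finishes first.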

5. gevent + requests

from gevent import monkey
monkey.patch_all()  # patch the standard library before requests is imported

import gevent
import requests


def fetch_async(method, url, req_kwargs):
    print(method, url, req_kwargs)
    response = requests.request(method=method, url=url, **req_kwargs)
    print(response.url, response.content)


# ##### Send the requests #####
gevent.joinall([
    gevent.spawn(fetch_async, method='get', url='https://www.python.org/', req_kwargs={}),
    gevent.spawn(fetch_async, method='get', url='https://www.yahoo.com/', req_kwargs={}),
    gevent.spawn(fetch_async, method='get', url='https://github.com/', req_kwargs={}),
])

# ##### Send the requests (a greenlet pool caps the number of concurrent greenlets) #####
# from gevent.pool import Pool
# pool = Pool(None)
# gevent.joinall([
#     pool.spawn(fetch_async, method='get', url='https://www.python.org/', req_kwargs={}),
#     pool.spawn(fetch_async, method='get', url='https://www.yahoo.com/', req_kwargs={}),
#     pool.spawn(fetch_async, method='get', url='https://www.github.com/', req_kwargs={}),
# ])
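The commented-out pool variant above is meant to cap how many greenlets run at once. The same idea can be sketched with `asyncio.Semaphore`; note that this swaps gevent for asyncio so the example runs with the standard library only. `MAX_CONCURRENT`, `fake_fetch`, and the `running`/`peak` counters are assumptions added for illustration.

```python
import asyncio

MAX_CONCURRENT = 2  # hypothetical limit chosen for the example
running = 0
peak = 0


async def fake_fetch(url, limiter):
    """Simulated request (asyncio.sleep stands in for the network call)."""
    global running, peak
    async with limiter:  # at most MAX_CONCURRENT coroutines pass at a time
        running += 1
        peak = max(peak, running)
        await asyncio.sleep(0.05)
        running -= 1
    return url


async def main():
    limiter = asyncio.Semaphore(MAX_CONCURRENT)
    urls = ['https://www.python.org/', 'https://www.yahoo.com/', 'https://github.com/']
    return await asyncio.gather(*(fake_fetch(u, limiter) for u in urls))


results = asyncio.run(main())
print(results, 'peak concurrency:', peak)
```

The third request only starts after one of the first two has left the `async with` block, so the peak never exceeds the limit.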

6. grequests

import grequests

request_list = [
    grequests.get('http://httpbin.org/delay/1', timeout=0.001),
    grequests.get('http://fakedomain/'),
    grequests.get('http://httpbin.org/status/500'),
]

# ##### Send the requests and collect the list of responses #####
# response_list = grequests.map(request_list)
# print(response_list)

# ##### Send the requests and collect the list of responses (handling exceptions) #####
# def exception_handler(request, exception):
#     print(request, exception)
#     print("Request failed")

# response_list = grequests.map(request_list, exception_handler=exception_handler)
# print(response_list)

7. Twisted example

from twisted.internet import defer, reactor
# getPage is deprecated in newer Twisted releases; twisted.web.client.Agent is the replacement
from twisted.web.client import getPage


def all_done(arg):
    # Stop the reactor once every request has finished
    reactor.stop()


def callback(contents):
    print(contents)


deferred_list = []
url_list = ['http://www.bing.com', 'http://www.baidu.com', ]
for url in url_list:
    deferred = getPage(bytes(url, encoding='utf8'))
    deferred.addCallback(callback)
    deferred_list.append(deferred)

dlist = defer.DeferredList(deferred_list)
dlist.addBoth(all_done)
reactor.run()

8. tornado

from tornado.httpclient import AsyncHTTPClient
from tornado.httpclient import HTTPRequest
from tornado import ioloop


def handle_response(future):
    """
    Handle the result of a fetch. To stop the I/O loop, keep a counter of
    outstanding requests and call ioloop.IOLoop.current().stop() when it
    reaches zero.
    :param future: Future returned by AsyncHTTPClient.fetch
    :return:
    """
    try:
        response = future.result()
    except Exception as error:
        print("Error:", error)
    else:
        print(response.body)


def func():
    url_list = [
        'http://www.baidu.com',
        'http://www.bing.com',
    ]
    for url in url_list:
        print(url)
        http_client = AsyncHTTPClient()
        # fetch() returns a Future; its callback argument was removed in Tornado 6.0
        http_client.fetch(HTTPRequest(url)).add_done_callback(handle_response)


ioloop.IOLoop.current().add_callback(func)
ioloop.IOLoop.current().start()
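The docstring of `handle_response` mentions keeping a counter in order to stop the I/O loop. A library-independent sketch of that counter could look like the following. `PendingCounter` is a hypothetical helper; in the Tornado example, `on_empty` would be `ioloop.IOLoop.current().stop`, while here a plain list records the call so the sketch runs without Tornado.

```python
class PendingCounter:
    """Counts outstanding requests and calls `on_empty` when the last one finishes.

    Hypothetical helper, not part of Tornado.
    """

    def __init__(self, total, on_empty):
        self.remaining = total
        self.on_empty = on_empty

    def done(self):
        self.remaining -= 1
        if self.remaining == 0:
            self.on_empty()


stopped = []
counter = PendingCounter(2, on_empty=lambda: stopped.append(True))
counter.done()  # first response handled, the loop would keep running
counter.done()  # second response handled, on_empty fires
print(stopped)
```

Each response handler would call `counter.done()` after processing its response, whether the request succeeded or failed.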

More Twisted

from twisted.internet import reactor
from twisted.web.client import getPage
import urllib.parse


def one_done(arg):
    print(arg)
    reactor.stop()


# Send a form-encoded POST request
post_data = urllib.parse.urlencode({'check_data': 'adf'})
post_data = bytes(post_data, encoding='utf8')
headers = {b'Content-Type': b'application/x-www-form-urlencoded'}
response = getPage(bytes('http://dig.chouti.com/login', encoding='utf8'),
                   method=bytes('POST', encoding='utf8'),
                   postdata=post_data,
                   cookies={},
                   headers=headers)
response.addBoth(one_done)

reactor.run()
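
All of the above are asynchronous I/O request modules provided by Python's standard library or by third-party packages. They are easy to use and greatly improve efficiency. Under the hood, an asynchronous I/O request comes down to a non-blocking socket plus I/O multiplexing: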

Asynchronous I/O

import select
import socket
import time


class AsyncTimeoutException(TimeoutError):
    """Raised when a request times out."""

    def __init__(self, msg):
        self.msg = msg
        super(AsyncTimeoutException, self).__init__(msg)


class HttpContext(object):
    """Holds the basic data of a request and its response."""

    def __init__(self, sock, host, port, method, url, data, callback, timeout=5):
        """
        sock: client socket object used for the request
        host: host name to request
        port: port to request
        method: request method
        url: URL to request
        data: data sent in the request body
        callback: function called when the request completes
        timeout: request timeout in seconds
        """
        self.sock = sock
        self.callback = callback
        self.host = host
        self.port = port
        self.method = method
        self.url = url
        self.data = data

        self.timeout = timeout

        self.__start_time = time.time()
        self.__buffer = []

    def is_timeout(self):
        """Return True if the request has timed out."""
        current_time = time.time()
        return (self.__start_time + self.timeout) < current_time

    def fileno(self):
        """File descriptor of the request socket, used by select."""
        return self.sock.fileno()

    def write(self, data):
        """Append response data to the buffer."""
        self.__buffer.append(data)

    def finish(self, exc=None):
        """Called when the response is complete (or failed); runs the request callback."""
        if not exc:
            response = b''.join(self.__buffer)
            self.callback(self, response, exc)
        else:
            self.callback(self, None, exc)

    def send_request_data(self):
        content = """%s %s HTTP/1.0\r\nHost: %s\r\n\r\n%s""" % (
            self.method.upper(), self.url, self.host, self.data,)

        return content.encode(encoding='utf8')


class AsyncRequest(object):
    def __init__(self):
        self.fds = []
        self.connections = []

    def add_request(self, host, port, method, url, data, callback, timeout):
        """Create a new request."""
        client = socket.socket()
        client.setblocking(False)
        try:
            client.connect((host, port))
        except BlockingIOError as e:
            pass
            # print('connection request sent to the remote host')
        req = HttpContext(client, host, port, method, url, data, callback, timeout)
        self.connections.append(req)
        self.fds.append(req)

    def check_conn_timeout(self):
        """Check all pending connections and abort any that have timed out."""
        timeout_list = []
        for context in self.connections:
            if context.is_timeout():
                timeout_list.append(context)
        for context in timeout_list:
            context.finish(AsyncTimeoutException('request timed out'))
            self.fds.remove(context)
            self.connections.remove(context)

    def running(self):
        """Event loop: detects which request sockets are ready and acts on them."""
        while True:
            # Checked before select so that select is never called with empty lists
            if not self.fds:
                return

            r, w, e = select.select(self.fds, self.connections, self.fds, 0.05)

            for context in r:
                sock = context.sock
                while True:
                    try:
                        data = sock.recv(8096)
                        if not data:
                            # The server closed the connection: the response is complete
                            self.fds.remove(context)
                            context.finish()
                            break
                        else:
                            context.write(data)
                    except BlockingIOError as e:
                        # No more data for now; go back to select
                        break
                    except TimeoutError as e:
                        self.fds.remove(context)
                        if context in self.connections:
                            self.connections.remove(context)
                        context.finish(e)
                        break

            for context in w:
                # Connected to the remote server; start sending the request data
                if context in self.fds:
                    data = context.send_request_data()
                    context.sock.sendall(data)
                    self.connections.remove(context)

            self.check_conn_timeout()


if __name__ == '__main__':
    def callback_func(context, response, ex):
        """
        :param context: HttpContext object wrapping the request information
        :param response: content of the response
        :param ex: the exception object if an error occurred, otherwise None
        :return:
        """
        print(context, response, ex)

    obj = AsyncRequest()
    url_list = [
        {'host': 'www.google.com', 'port': 80, 'method': 'GET', 'url': '/', 'data': '', 'timeout': 5,
         'callback': callback_func},
        {'host': 'www.baidu.com', 'port': 80, 'method': 'GET', 'url': '/', 'data': '', 'timeout': 5,
         'callback': callback_func},
        {'host': 'www.bing.com', 'port': 80, 'method': 'GET', 'url': '/', 'data': '', 'timeout': 5,
         'callback': callback_func},
    ]
    for item in url_list:
        print(item)
        obj.add_request(**item)

    obj.running()
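The two ingredients, a non-blocking socket and I/O multiplexing, can also be seen in isolation. The sketch below uses the standard library's higher-level `selectors` module instead of `select.select`, and a local `socket.socketpair()` in place of a remote server so that it runs offline; both choices are assumptions made for illustration.

```python
import selectors
import socket

# A connected pair of local sockets stands in for the client and the remote server.
client, server = socket.socketpair()
client.setblocking(False)
server.setblocking(False)

sel = selectors.DefaultSelector()
sel.register(client, selectors.EVENT_READ)

# Nothing has been sent yet, so a non-blocking recv raises instead of waiting.
try:
    client.recv(1024)
    would_block = False
except BlockingIOError:
    would_block = True

server.sendall(b'hello')

# select() reports the socket only once data is ready to be read.
received = b''
for key, events in sel.select(timeout=1):
    received = key.fileobj.recv(1024)

sel.unregister(client)
sel.close()
client.close()
server.close()
print(would_block, received)
```

An event loop such as the `running` method above repeats this cycle for many sockets at once: ask which sockets are ready, then read or write only those.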

This article is a repost.
