Using aiohttp

Introduction

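aiohttp is an asynchronous HTTP networking library built on asyncio that provides both a server and a client. The server side lets us build a server that handles requests asynchronously and returns responses, similar to web frameworks such as Django, Flask, and Tornado. The client side sends requests, much like using requests to issue an HTTP request and get a response, except that requests issues synchronous network requests while aiohttp is asynchronous.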

Official documentation: Welcome to AIOHTTP (aiohttp 3.9.5 documentation)

Basic example

import aiohttp
import asyncio
import nest_asyncio

# Only needed where an event loop is already running (e.g. Jupyter)
nest_asyncio.apply()

async def fetch(session, url):
    # Send a GET request and return the body text and the status code
    async with session.get(url) as response:
        return await response.text(), response.status

async def main():
    async with aiohttp.ClientSession() as session:
        html, status = await fetch(session, 'https://cuiqingcai.com')
        print(f'html: {html[:100]}...')
        print(f'status:{status}')

if __name__ == '__main__':
    loop = asyncio.get_event_loop()
    loop.run_until_complete(main())

html: <!DOCTYPE html>
<html lang="zh-CN"><head><meta charset="UTF-8"><meta name="viewport" content...
status:200

Here we successfully obtained the page source and the response status code 200, which completes a basic HTTP request. Of course, requests could do the same.

How an aiohttp request is defined:

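Besides aiohttp itself, we also import asyncio. Asynchronous scraping runs as coroutines, and coroutines can only execute on the event loop that asyncio provides. In addition to the event loop, asyncio also offers many basic asynchronous operations.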

Every asynchronous function must be declared with async in front of def.

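with ... as must likewise be prefixed with async. In Python, a with ... as statement uses a context manager, which acquires and releases resources for us automatically. Inside an asynchronous function, async with uses a context manager that supports asynchronous operation.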
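To make the idea of an asynchronous context manager concrete, here is a minimal sketch using only the standard library. DemoResource is a hypothetical class invented for illustration (it is not part of aiohttp); it only records when it is entered and exited.

```python
import asyncio


class DemoResource:
    """Hypothetical resource standing in for something like a session."""

    def __init__(self):
        self.events = []

    async def __aenter__(self):
        # Runs when the `async with` block is entered; it may await
        await asyncio.sleep(0)
        self.events.append('acquired')
        return self

    async def __aexit__(self, exc_type, exc, tb):
        # Runs when the block exits, even if an exception was raised
        await asyncio.sleep(0)
        self.events.append('released')
        return False  # do not swallow exceptions


async def use_resource():
    resource = DemoResource()
    async with resource as r:
        r.events.append('working')
    return resource.events


if __name__ == '__main__':
    print(asyncio.run(use_resource()))
```

The returned list shows the order of events: the resource is acquired before the body runs and released right after it.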

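Operations that return a coroutine object must be preceded by await. For example, the API reference shows that the text method of response returns a coroutine object, so it needs await. The status code, on the other hand, is a plain number, so no await is needed. In practice, check the official documentation for the return type of each call and then decide whether to add await.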
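The await rule can be checked with a small standard-library sketch. FakeResponse is a hypothetical stand-in written for this illustration, not aiohttp's response class; it has one plain attribute and one coroutine method.

```python
import asyncio
import inspect


class FakeResponse:
    """Hypothetical stand-in for a response object (not aiohttp's class)."""

    status = 200  # plain attribute: read it directly, no await

    async def text(self):
        # Coroutine method: calling it only creates a coroutine object
        await asyncio.sleep(0)
        return '<html>...</html>'


async def inspect_response():
    response = FakeResponse()
    pending = response.text()                 # not awaited yet
    was_coroutine = inspect.iscoroutine(pending)
    body = await pending                      # awaiting yields the real value
    return was_coroutine, body, response.status


if __name__ == '__main__':
    print(asyncio.run(inspect_response()))
```

Calling text() without await yields only a coroutine object; the actual string appears once it is awaited, while status is available immediately.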

Finally, with the scraping function defined, main is what actually calls fetch. To run it we must start an event loop, which comes from asyncio, and then call run_until_complete.

Note: in Python 3.7 and later, asyncio.run(main()) can replace the final start-up step. There is no need to declare the event loop explicitly, because run creates and starts one internally.

import aiohttp
import asyncio
import nest_asyncio

# Only needed where an event loop is already running (e.g. Jupyter)
nest_asyncio.apply()

async def fetch(session, url):
    async with session.get(url) as response:
        return await response.text(), response.status

async def main():
    async with aiohttp.ClientSession() as session:
        html, status = await fetch(session, 'https://cuiqingcai.com')
        print(f'html: {html[:100]}...')
        print(f'status:{status}')

# asyncio.run creates the event loop, runs main(), and closes the loop
asyncio.run(main())

html: <!DOCTYPE html>
<html lang="zh-CN"><head><meta charset="UTF-8"><meta name="viewport" content...
status:200

The result is the same.

URL parameters

To set URL parameters, pass a dictionary as the params argument.

import aiohttp
import asyncio

async def main():
    params = {'name': 'germey', 'age': 25}
    async with aiohttp.ClientSession() as session:
        async with session.get('http://www.httpbin.org/get', params=params) as response:
            print(await response.text())

if __name__ == '__main__':
    asyncio.run(main())

{"args": {"age": "25", "name": "germey"}, "headers": {"Accept": "*/*", "Accept-Encoding": "gzip, deflate, br", "Host": "www.httpbin.org", "User-Agent": "Python/3.11 aiohttp/3.9.3", "X-Amzn-Trace-Id": "Root=1-66a21748-480d5a6e579d1a3b3cd1a710"}, "origin": "117.152.24.110", "url": "http://www.httpbin.org/get?name=germey&age=25"
}

As shown, the URL actually requested is http://www.httpbin.org/get?name=germey&age=25, whose query parameters correspond to the contents of params.
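The mapping from a dictionary to a query string can be reproduced with the standard library. This is only an illustration of the encoding, not aiohttp's internal implementation; build_url is a hypothetical helper.

```python
from urllib.parse import urlencode


def build_url(base, params):
    # Encode the dict as key=value pairs joined by '&' and append it
    return f'{base}?{urlencode(params)}'


url = build_url('http://www.httpbin.org/get', {'name': 'germey', 'age': 25})
print(url)
```

Non-string values such as the integer 25 are converted to text during encoding, which is why httpbin reports "age" as the string "25".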

Other request types

aiohttp also supports other request types, such as POST, PUT, and DELETE.

session.post('http://www.httpbin.org/post', data=b'data')

session.put('http://www.httpbin.org/put', data=b'data')

session.delete('http://www.httpbin.org/delete')

POST requests

For a POST form submission, the Content-Type request header is application/x-www-form-urlencoded.

import aiohttp
import asyncio

async def main():
    data = {'name': 'germey', 'age': 25}
    async with aiohttp.ClientSession() as session:
        # data= sends the dict as a URL-encoded form body
        async with session.post('https://www.httpbin.org/post', data=data) as response:
            print(await response.text())

if __name__ == '__main__':
    asyncio.run(main())

{"args": {}, "data": "", "files": {}, "form": {"age": "25", "name": "germey"}, "headers": {"Accept": "*/*", "Accept-Encoding": "gzip, deflate, br", "Content-Length": "18", "Content-Type": "application/x-www-form-urlencoded", "Host": "www.httpbin.org", "User-Agent": "Python/3.11 aiohttp/3.9.3", "X-Amzn-Trace-Id": "Root=1-66a21972-0acc966c70ce51551219c086"}, "json": null, "origin": "117.152.24.110", "url": "https://www.httpbin.org/post"
}

To submit JSON data with POST, where the Content-Type request header is application/json, simply change the data argument of post to json.

import aiohttp
import asyncio

async def main():
    data = {'name': 'germey', 'age': 25}
    async with aiohttp.ClientSession() as session:
        # json= serializes the dict as a JSON body
        async with session.post('https://www.httpbin.org/post', json=data) as response:
            print(await response.text())

if __name__ == '__main__':
    asyncio.run(main())

{"args": {}, "data": "{\"name\": \"germey\", \"age\": 25}", "files": {}, "form": {}, "headers": {"Accept": "*/*", "Accept-Encoding": "gzip, deflate, br", "Content-Length": "29", "Content-Type": "application/json", "Host": "www.httpbin.org", "User-Agent": "Python/3.11 aiohttp/3.9.3", "X-Amzn-Trace-Id": "Root=1-66a21abd-5bd0bcc5362a212746b15be4"}, "json": {"age": 25, "name": "germey"}, "origin": "117.152.24.110", "url": "https://www.httpbin.org/post"
}
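The difference between the two POST bodies can be illustrated with the standard library alone. This sketch is not how aiohttp serializes requests internally; it only shows what a form-encoded body and a JSON body look like for the same dictionary.

```python
import json
from urllib.parse import urlencode

data = {'name': 'germey', 'age': 25}

# Form submission: key=value pairs, as in application/x-www-form-urlencoded
form_body = urlencode(data).encode()

# JSON submission: the dict serialized as a JSON document
json_body = json.dumps(data).encode()

print(form_body, len(form_body))
print(json_body, len(json_body))
```

The byte lengths printed here, 18 and 29, coincide with the Content-Length headers in the two httpbin responses shown above.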

Responses

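For a response, the following attributes and methods give us the status code, the response headers, the response body, the binary response body, and the response body parsed as JSON.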

import aiohttp
import asyncio

async def main():
    data = {'name': 'germey', 'age': 25}
    async with aiohttp.ClientSession() as session:
        async with session.post('https://www.httpbin.org/post', data=data) as response:
            print('status:', response.status)
            print('headers:', response.headers)
            print('body:', await response.text())
            print('bytes:', await response.read())
            print('json:', await response.json())

if __name__ == '__main__':
    asyncio.run(main())

status: 200
headers: <CIMultiDictProxy('Date': 'Thu, 25 Jul 2024 09:35:42 GMT', 'Content-Type': 'application/json', 'Content-Length': '516', 'Connection': 'keep-alive', 'Server': 'gunicorn/19.9.0', 'Access-Control-Allow-Origin': '*', 'Access-Control-Allow-Credentials': 'true')>
body: {"args": {}, "data": "", "files": {}, "form": {"age": "25", "name": "germey"}, "headers": {"Accept": "*/*", "Accept-Encoding": "gzip, deflate, br", "Content-Length": "18", "Content-Type": "application/x-www-form-urlencoded", "Host": "www.httpbin.org", "User-Agent": "Python/3.11 aiohttp/3.9.3", "X-Amzn-Trace-Id": "Root=1-66a21c6d-0bd06f745165ce7917c83621"}, "json": null, "origin": "117.152.24.110", "url": "https://www.httpbin.org/post"
}
bytes: b'{\n  "args": {}, \n  "data": "", \n  "files": {}, \n  "form": {\n    "age": "25", \n    "name": "germey"\n  }, \n  "headers": {\n    "Accept": "*/*", \n    "Accept-Encoding": "gzip, deflate, br", \n    "Content-Length": "18", \n    "Content-Type": "application/x-www-form-urlencoded", \n    "Host": "www.httpbin.org", \n    "User-Agent": "Python/3.11 aiohttp/3.9.3", \n    "X-Amzn-Trace-Id": "Root=1-66a21c6d-0bd06f745165ce7917c83621"\n  }, \n  "json": null, \n  "origin": "117.152.24.110", \n  "url": "https://www.httpbin.org/post"\n}\n'
json: {'args': {}, 'data': '', 'files': {}, 'form': {'age': '25', 'name': 'germey'}, 'headers': {'Accept': '*/*', 'Accept-Encoding': 'gzip, deflate, br', 'Content-Length': '18', 'Content-Type': 'application/x-www-form-urlencoded', 'Host': 'www.httpbin.org', 'User-Agent': 'Python/3.11 aiohttp/3.9.3', 'X-Amzn-Trace-Id': 'Root=1-66a21c6d-0bd06f745165ce7917c83621'}, 'json': None, 'origin': '117.152.24.110', 'url': 'https://www.httpbin.org/post'}

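Notice that some of these need await and others do not. The rule is that if a call returns a coroutine object (for example, a method defined with async), it must be preceded by await. See the aiohttp API reference for details: https://docs.aiohttp.org/en/stable/client_reference.html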

Timeout settings

We can set a timeout with a ClientTimeout object. For example, a 1-second timeout can be set like this:

import aiohttp
import asyncio

async def main():
    # total=1 limits the whole request to 1 second
    timeout = aiohttp.ClientTimeout(total=1)
    async with aiohttp.ClientSession(timeout=timeout) as session:
        async with session.get('https://www.httpbin.org/get') as response:
            print('status:', response.status)

if __name__ == '__main__':
    asyncio.run(main())

If the response arrives within 1 second, the result is:

status: 200

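If the request does not complete within 1 second, a timeout exception of type asyncio.TimeoutError is raised, which can be caught and handled.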
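One way to handle the failure case is to catch the timeout exception and fall back to a default. The sketch below runs without network access: slow_request is a hypothetical coroutine that simulates a request with asyncio.sleep, and asyncio.wait_for stands in for the ClientTimeout limit. With aiohttp the same try/except would presumably wrap the session.get block instead.

```python
import asyncio


async def slow_request(delay):
    # Hypothetical stand-in for a network request taking `delay` seconds
    await asyncio.sleep(delay)
    return 200


async def fetch_with_timeout(delay, timeout):
    try:
        # Cancel the request if it does not finish within `timeout` seconds
        return await asyncio.wait_for(slow_request(delay), timeout=timeout)
    except asyncio.TimeoutError:
        return None  # signal that the request timed out


if __name__ == '__main__':
    print('fast:', asyncio.run(fetch_with_timeout(0.01, 1)))
    print('slow:', asyncio.run(fetch_with_timeout(1, 0.01)))
```

Returning None on timeout is just one design choice; retrying or logging the failed URL would fit in the same except branch.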

Limiting concurrency

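aiohttp can support very high concurrency; tens of thousands, a hundred thousand, or even a million concurrent requests are achievable. Under that load, however, the target site may be unable to respond in a short time, and there is a real risk of taking it down almost instantly, so we need to limit the scraping concurrency.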

Typically, asyncio's Semaphore can be used to control the level of concurrency.

import asyncio
import aiohttp

CONCURRENCY = 5
URL = 'https://www.baidu.com'

# On Python versions before 3.10, create the semaphore inside main() instead
semaphore = asyncio.Semaphore(CONCURRENCY)
session = None

async def scrape_api():
    # At most CONCURRENCY coroutines can be inside this block at once
    async with semaphore:
        print('scraping', URL)
        async with session.get(URL) as response:
            await asyncio.sleep(1)
            return await response.text()

async def main():
    global session
    session = aiohttp.ClientSession()
    try:
        scrape_index_tasks = [asyncio.ensure_future(scrape_api()) for _ in range(10000)]
        await asyncio.gather(*scrape_index_tasks)
    finally:
        # Close the session so its connections are released
        await session.close()

if __name__ == '__main__':
    asyncio.run(main())

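Here we set CONCURRENCY (the maximum scraping concurrency) to 5 and the target URL to Baidu. We then create a semaphore object with Semaphore and assign it to semaphore, which is what controls the maximum concurrency. The semaphore is used directly inside the scraping function, as the context manager of an async with statement. The semaphore thereby limits the number of coroutines that can be scraping at the same time to the value of CONCURRENCY.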

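In main, we create 10000 tasks and pass them to gather to run. Without any limit, all 10000 tasks would execute at once, which is a very large amount of concurrency. With the semaphore in place, the number of tasks running simultaneously is capped at 5, which throttles aiohttp's request rate.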
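The effect of the semaphore can be verified without touching the network. The following sketch replaces the real request with asyncio.sleep and records the highest number of workers that were inside the semaphore at the same time; worker, run_tasks, and the state dictionary are hypothetical names used only for this illustration.

```python
import asyncio

CONCURRENCY = 5


async def worker(semaphore, state):
    async with semaphore:
        # Track how many workers are inside the semaphore right now
        state['running'] += 1
        state['peak'] = max(state['peak'], state['running'])
        await asyncio.sleep(0.01)  # simulated request
        state['running'] -= 1


async def run_tasks(n_tasks):
    # Created inside the coroutine so it belongs to the running event loop
    semaphore = asyncio.Semaphore(CONCURRENCY)
    state = {'running': 0, 'peak': 0}
    await asyncio.gather(*(worker(semaphore, state) for _ in range(n_tasks)))
    return state['peak']


if __name__ == '__main__':
    print('peak concurrency:', asyncio.run(run_tasks(50)))
```

Even though 50 tasks are scheduled together, the recorded peak never exceeds CONCURRENCY, which is the behavior the scraping example relies on.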