Python Web Scraping Tutorial, Part 2: requests Is the Handiest HTTP Request Tool

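Introduction

The first step of any scraper is the network request, so a good HTTP library matters a lot. requests is exactly such a library. Python does ship with the built-in urllib module for network requests, but urllib is cumbersome to use and lacks many practical high-level features, so this tutorial recommends requests directly. It is a third-party Python library built on top of urllib3, and it makes working with URL resources very convenient. requests is not only handy for writing scrapers; it is also a staple of everyday development, for example when calling backend APIs, uploading files, or querying data services over HTTP.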

Installing requests

pip install requests

Installing dependencies inside a virtual environment is recommended; see the article 【Creating a Python virtual environment with venv】.

First Step

With requests, fetching the Baidu home page takes only a few lines of code:

import requests
response = requests.get('http://www.baidu.com')
# If the returned text contains garbled Chinese characters, set the encoding on the response object
# response.encoding = 'utf-8'
print(response.status_code)
print(response.text)
# Output
200
<!DOCTYPE html>
<!--STATUS OK--><html> <head><meta http-equiv=content-type content=text/html;charset=utf-8><meta http-equiv=X-UA-Compatible content=IE=Edge><meta content=always name=referrer><link rel=stylesheet type=text/css href=https://ss1.bdstatic.com/5eN1bjq8AAUYm2zgoY3K/r/www/cache/bdorz/baidu.min.css><title>百度一下，你就知道</title></head>.......

As you can see, requesting a web page takes a single line of code. It really is that simple, in line with the library's design philosophy: Requests is an elegant and simple HTTP library for Python, built for human beings.

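Next, consider searching for a keyword, where the keyword is passed in as a parameter. A Baidu search link looks like this:
https://www.baidu.com/s?wd=爬虫
The keyword is appended to the URL as a query parameter. In code, we pass the parameter and fetch the search results like this: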

import requests
params = {'wd': '爬虫'}
# The search endpoint is /s; requests URL-encodes params and appends them to the URL
response = requests.get('https://www.baidu.com/s', params=params)
response.encoding = 'utf-8'
print(response.url)
print(response.text)
# Output
https://www.baidu.com/s?wd=%E7%88%AC%E8%99%AB

The response text is the HTML of the search results page.
That covers the basic usage; the sections below cover the more advanced features of requests.
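The URL encoding seen in the search example can be checked without sending anything. The sketch below builds a request object and prepares it locally; the `pn` paging parameter is an assumption added only to show how several parameters are joined.

```python
import requests

# Build the request without sending it, so no network access is needed.
request = requests.Request(
    'GET',
    'https://www.baidu.com/s',
    params={'wd': '爬虫', 'pn': 10},  # 'pn' is an assumed paging parameter
)
prepared = request.prepare()

# The non-ASCII keyword is UTF-8 percent-encoded and the parameters
# are joined with '&' in the order they appear in the dict.
print(prepared.url)
```

Preparing a request this way is a convenient check before pointing a scraper at a real site, because it shows the exact URL that would be requested.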

Advanced Usage

Request Methods

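requests supports most HTTP request methods:
1. get: the HTTP GET request, used to fetch the content at a given URL
2. post: the HTTP POST request, used to create resources, submit forms, upload files, and so on
3. put: the HTTP PUT request, used to replace a resource as a whole
4. delete: the HTTP DELETE request, used to delete a resource
5. head: the HTTP HEAD request, used to fetch only the response headers
6. patch: the HTTP PATCH request, used to update part of a resource
7. options: the HTTP OPTIONS request, used to ask which methods an endpoint supports, for example in a cross-origin (CORS) preflight check
These are the standard HTTP methods as used in RESTful APIs, and requests keeps the same semantics. Which one to use depends on what the target endpoint or page supports. In the screenshot below the request method is GET, so that link should be requested with get.
[Figure: screenshot of a request whose method is GET]
In most cases the four methods below are enough.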

requests.get(url='')
requests.post(url='')
requests.put(url='')
requests.delete(url='')

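Request Parameters

All of the request methods above are thin wrappers around requests.request: each one fills in the matching method and passes every other parameter through. So whether you call requests.get, requests.post, or requests.request directly, the parameters described below are the same.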
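Parameter details:
method: the HTTP method, such as GET or POST (filled in for you by requests.get, requests.post, and so on)
url: the URL to request
params: query-string parameters appended to the URL
data: the request body, as a dict (form-encoded), a string, bytes, or a file-like object
json: a Python object that is serialized to JSON and sent as the request body
headers: a dict of request headers
cookies: cookies to send with the request
files: files to upload as multipart/form-data
auth: authentication credentials, for example a (username, password) tuple
timeout: how many seconds to wait for the server, as a single number or a (connect, read) tuple
allow_redirects: whether to follow redirects
proxies: a dict mapping protocol to proxy URL
verify: whether to verify the server's TLS certificate, or the path to a CA bundle
stream: whether to defer downloading the response body
cert: the path to a client certificate, or a (cert, key) tuple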
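One way to see that the helper functions simply forward to requests.request is to swap that function for a recorder and look at what it receives. This is only an illustrative sketch: the URL is a placeholder that is never contacted, and patching `requests.api.request` relies on how the helpers are currently defined in the library.

```python
from unittest import mock

import requests

url = 'https://example.com/api'  # placeholder URL, never actually contacted

# Replace the function that get/post delegate to with a recorder,
# so the calls are captured instead of being sent over the network.
with mock.patch('requests.api.request') as fake_request:
    requests.get(url, params={'wd': '爬虫'}, timeout=5)
    requests.post(url, json={'name': 'demo'}, timeout=5)

# Each helper passed its own method name plus the untouched keyword arguments.
for call in fake_request.call_args_list:
    print(call.args, call.kwargs)
```

The recorded calls start with the method name and the URL, followed by the same keyword arguments that were given to get and post.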

The sections below cover several common scenarios.

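Setting Headers

Because this is a scraper, a request without suitable headers is easily identified as coming from a program and rejected by the target site. A sample headers value: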

headers = {
    "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/123.0.0.0 Safari/537.36",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.7",
    "Accept-Encoding": "gzip, deflate, br, zstd",
    "Accept-Language": "zh-CN,zh;q=0.9,en;q=0.8,ja;q=0.7,zh-TW;q=0.6,nb;q=0.5,fr;q=0.4,en-US;q=0.3",
    "Cache-Control": "max-age=0",
    "Connection": "keep-alive",
    "Sec-Ch-Ua": '"Google Chrome";v="123", "Not:A-Brand";v="8", "Chromium";v="123"',
    "Content-Type": "application/json",
}

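Adjust these per site, for example the Content-Type or Cookie values. Advertise br or zstd in Accept-Encoding only if the matching decoder packages are installed; otherwise the response body may come back undecoded.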
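To see why a scraper without custom headers is easy to spot, it helps to compare what requests sends by default with what is sent after overriding the User-Agent. The sketch below prepares a request through a Session without sending it; the target URL is reused from the earlier search example.

```python
import requests

# Headers that requests sends when none are supplied.
defaults = requests.utils.default_headers()
print(defaults['User-Agent'])  # identifies the client as python-requests

custom = {
    "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) "
                  "AppleWebKit/537.36 (KHTML, like Gecko) "
                  "Chrome/123.0.0.0 Safari/537.36",
}

# prepare_request merges the session defaults with the per-request headers;
# nothing is sent over the network here.
session = requests.Session()
prepared = session.prepare_request(
    requests.Request('GET', 'https://www.baidu.com/s', headers=custom)
)
print(prepared.headers['User-Agent'])       # the browser-like value wins
print(prepared.headers['Accept-Encoding'])  # untouched defaults are kept
```

Headers passed per request override the defaults with the same name, while defaults that were not overridden are still sent.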

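The Content-Type Field

The request headers include a Content-Type field, which the client uses to tell the server what kind of data it is actually sending, such as a file, plain text, a JSON string, or a form. Five content types come up most often with requests:

application/x-www-form-urlencoded: form data encoded as key/value pairs. This is what requests uses by default when data is a dict
multipart/form-data: can carry files as well as ordinary parameters
text/xml: XML. The body must be valid XML
application/json: JSON. The body must be valid JSON
text/plain: plain text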
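For the text/xml and text/plain cases, requests does not pick a Content-Type on its own when the body is passed as a string or bytes, so the header has to be set by hand. A minimal sketch, assuming a placeholder URL and a made-up XML payload, prepared locally without sending:

```python
import requests

url = 'https://example.com/api'  # placeholder URL, nothing is sent
xml_body = '<?xml version="1.0" encoding="UTF-8"?><order><id>42</id></order>'

# Raw bytes as data: requests sends them as-is and sets no Content-Type.
without_header = requests.Request(
    'POST', url, data=xml_body.encode('utf-8')
).prepare()

# Same body, with the content type declared explicitly.
with_header = requests.Request(
    'POST',
    url,
    data=xml_body.encode('utf-8'),
    headers={'Content-Type': 'text/xml; charset=utf-8'},
).prepare()

print(without_header.headers.get('Content-Type'))
print(with_header.headers['Content-Type'])
```

The same pattern applies to text/plain: pass the text as data and declare the type in headers.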

GET Requests

import requests
headers = {
    "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/123.0.0.0 Safari/537.36",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.7",
    "Accept-Encoding": "gzip, deflate, br, zstd",
    "Accept-Language": "zh-CN,zh;q=0.9,en;q=0.8,ja;q=0.7,zh-TW;q=0.6,nb;q=0.5,fr;q=0.4,en-US;q=0.3",
    "Cache-Control": "max-age=0",
    "Connection": "keep-alive",
    "Sec-Ch-Ua": '"Google Chrome";v="123", "Not:A-Brand";v="8", "Chromium";v="123"',
}
params = {'wd': '爬虫'}
response = requests.get('https://www.baidu.com/s', params=params, headers=headers)

POST Requests

1. Simulating a login request (form-encoded by default):

requests.post('https://accounts.douban.com/login', data={'form_email': 'abc@example.com', 'form_password': '123456'})

By default requests encodes a dict passed as data using application/x-www-form-urlencoded. To send JSON instead, pass the json parameter directly:

requests.post('https://accounts.douban.com/login', json={'form_email': 'abc@example.com', 'form_password': '123456'})

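This is for demonstration only; the real Douban login may not accept data in this form. Although json takes a dict, requests serializes it to a JSON string internally.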
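The difference between data and json is easiest to see by preparing both requests and printing what would go on the wire. The sketch below uses a placeholder login URL and sends nothing.

```python
import json

import requests

url = 'https://example.com/login'  # placeholder URL, nothing is sent
payload = {'form_email': 'abc@example.com', 'form_password': '123456'}

form_request = requests.Request('POST', url, data=payload).prepare()
json_request = requests.Request('POST', url, json=payload).prepare()

# data=dict: key/value pairs, URL-encoded
print(form_request.headers['Content-Type'])
print(form_request.body)

# json=dict: serialized to a JSON document
print(json_request.headers['Content-Type'])
print(json.loads(json_request.body))
```

In both cases requests sets the Content-Type header itself, so it only needs to be set manually when the server expects something different.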

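2. Uploading a file

url = 'https://example.com/upload'  # hypothetical upload endpoint; replace with the real one
# Open the file in binary mode; the with block closes it once the request is done
with open('report.xls', 'rb') as f:
    r = requests.post(url, files={'file': f})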
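A file upload can also carry an explicit filename, a content type, and ordinary form fields. The following sketch uses an in-memory file and a placeholder URL, and only prepares the request so the multipart body can be inspected; the field names `file` and `owner` are assumptions.

```python
import io

import requests

url = 'https://example.com/upload'  # placeholder URL, nothing is sent

# An in-memory stand-in for a real file opened with open(path, 'rb').
fake_file = io.BytesIO(b'id,name\n1,demo\n')

# (filename, file object, content type) gives full control over the file part.
files = {'file': ('report.csv', fake_file, 'text/csv')}
fields = {'owner': 'abc@example.com'}  # extra form fields sent alongside the file

prepared = requests.Request('POST', url, data=fields, files=files).prepare()

print(prepared.headers['Content-Type'])  # multipart/form-data with a boundary
print(prepared.body.decode('utf-8'))     # the encoded parts
```

When files is present, the values in data become additional parts of the same multipart body instead of a URL-encoded form.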

PUT, DELETE, and the other methods are used in much the same way.

Using a Proxy

import requests
headers = {
    "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/123.0.0.0 Safari/537.36",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.7",
    "Accept-Encoding": "gzip, deflate, br, zstd",
    "Accept-Language": "zh-CN,zh;q=0.9,en;q=0.8,ja;q=0.7,zh-TW;q=0.6,nb;q=0.5,fr;q=0.4,en-US;q=0.3",
    "Cache-Control": "max-age=0",
    "Connection": "keep-alive",
    "Sec-Ch-Ua": '"Google Chrome";v="123", "Not:A-Brand";v="8", "Chromium";v="123"',
}
# Both http and https traffic go through the same local proxy
proxy = {
    "http": "http://127.0.0.1:7890",
    "https": "http://127.0.0.1:7890",
}
params = {'wd': '爬虫'}
response = requests.get('https://www.baidu.com/s', params=params, headers=headers, proxies=proxy)
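The keys of the proxies dict decide which proxy a given URL uses. The sketch below calls `select_proxy`, a helper in requests.utils, to show the lookup without sending a request; the per-host entry and the port 7891 are assumptions added for illustration.

```python
from requests.utils import select_proxy

proxies = {
    'http': 'http://127.0.0.1:7890',
    'https': 'http://127.0.0.1:7890',
    # Assumed extra entry: a scheme://host key applies to that host only.
    'https://example.com': 'http://127.0.0.1:7891',
}

# The URL scheme picks the entry, unless a more specific host entry exists.
print(select_proxy('http://www.baidu.com/', proxies))
print(select_proxy('https://www.baidu.com/s', proxies))
print(select_proxy('https://example.com/page', proxies))
```

Note that the key is the scheme of the target URL, not of the proxy, which is why an https key can point at an http:// proxy address.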

The Response Object

Every request needs a detailed, accurate result. A requests call returns a requests.Response object, which offers a rich set of attributes and methods.

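Reading the Response Body

response.content returns the body as raw bytes
response.text returns the body decoded to a string
response.json() parses a JSON body and returns the resulting Python object
For API endpoints, calling json() directly is recommended; if you do not know the format of the response, use text.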
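The three accessors can be compared side by side against a throwaway local server. Everything about the server below (the handler, the /demo path, the JSON payload) is made up for the demonstration; only the three response accessors come from the text above.

```python
import json
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

import requests


class DemoHandler(BaseHTTPRequestHandler):
    """Throwaway local endpoint that always answers with a small JSON body."""

    def do_GET(self):
        payload = {'keyword': '爬虫', 'total': 2}
        body = json.dumps(payload, ensure_ascii=False).encode('utf-8')
        self.send_response(200)
        self.send_header('Content-Type', 'application/json; charset=utf-8')
        self.send_header('Content-Length', str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        pass  # keep the demo output quiet


server = HTTPServer(('127.0.0.1', 0), DemoHandler)  # port 0: any free port
threading.Thread(target=server.serve_forever, daemon=True).start()

session = requests.Session()
session.trust_env = False  # ignore proxy environment variables for this local demo
response = session.get(f'http://127.0.0.1:{server.server_port}/demo', timeout=5)
server.shutdown()
server.server_close()

print(type(response.content))    # raw bytes
print(response.text)             # bytes decoded using the declared charset
print(response.json()['total'])  # parsed JSON, here a dict
```

Because json() returns ordinary Python objects, the result can be indexed directly, whereas text would still need to be parsed.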

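Status: status_code and ok

status_code is the standard HTTP response status code, and ok indicates whether the request succeeded. Here success means a status code below 400: the ok property returns True when status_code is less than 400 and False otherwise.
General meaning of status codes:

Informational responses (100–199)
Successful responses (200–299)
Redirection messages (300–399)
Client error responses (400–499)
Server error responses (500–599)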
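The relationship between status_code, ok, and raise_for_status() can be tried out on bare Response objects, with no server involved. This is a sketch for illustration: the `describe` helper is hypothetical, and real responses come back from a request rather than being built by hand.

```python
import requests


def describe(status_code):
    """Return (ok, error message or None) for a response with this status."""
    response = requests.Response()
    response.status_code = status_code
    try:
        # Raises HTTPError for 4xx and 5xx, does nothing otherwise.
        response.raise_for_status()
    except requests.HTTPError as exc:
        return response.ok, str(exc)
    return response.ok, None


for code in (200, 301, 404, 500):
    print(code, describe(code))
```

Redirect codes such as 301 still count as ok, so code that needs a strict 2xx check should compare status_code directly.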

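Headers and Cookies

Endpoints that require login may need the cookies issued after authentication, or certain special header fields, so read those values from the response through its headers and cookies attributes.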

import requests
url = "http://127.0.0.1:8090/demo/5"
res = requests.get(url)
print(f"header: {res.headers}")
print(f"cookies: {res.cookies}")
# Output
header: {'Server': 'Werkzeug/2.3.6 Python/3.9.6', 'Date': 'Tue, 13 Jun 2023 13:27:13 GMT', 'Content-Type': 'application/json', 'Content-Length': '85', 'Connection': 'close'}
cookies: <RequestsCookieJar[]>
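For the login scenario, a requests.Session keeps the cookies it receives and sends them back on later requests. The sketch below runs against a made-up local server: the /login and /profile paths and the session_id cookie are assumptions, not part of any real site.

```python
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

import requests


class LoginHandler(BaseHTTPRequestHandler):
    """Hypothetical site: /login issues a cookie, other paths echo it back."""

    def do_GET(self):
        self.send_response(200)
        if self.path == '/login':
            self.send_header('Set-Cookie', 'session_id=abc123; Path=/')
            body = b'logged in'
        else:
            body = self.headers.get('Cookie', 'no cookie').encode('utf-8')
        self.send_header('Content-Type', 'text/plain; charset=utf-8')
        self.send_header('Content-Length', str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        pass  # keep the demo output quiet


server = HTTPServer(('127.0.0.1', 0), LoginHandler)  # port 0: any free port
threading.Thread(target=server.serve_forever, daemon=True).start()
base = f'http://127.0.0.1:{server.server_port}'

session = requests.Session()
session.trust_env = False  # ignore proxy environment variables for this local demo
login = session.get(f'{base}/login', timeout=5)
profile = session.get(f'{base}/profile', timeout=5)  # cookie is sent automatically
server.shutdown()
server.server_close()

print(login.headers['content-type'])    # header names are case-insensitive
print(login.cookies.get('session_id'))  # cookie set by this response
print(profile.text)                     # what the server saw in the Cookie header
```

With plain requests.get calls the cookie would have to be passed along by hand through the cookies parameter; the session does this bookkeeping for you.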

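Common Request Exceptions

Network requests can fail in many ways, all the more so for HTTP requests that hit complex backend APIs, so catching errors matters. Handling exceptions sensibly makes the code far more robust. requests provides a range of exception classes, including: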

class RequestException(IOError): pass
class InvalidJSONError(RequestException): pass
class JSONDecodeError(InvalidJSONError, CompatJSONDecodeError): pass
class HTTPError(RequestException): pass
class ConnectionError(RequestException): pass
class ProxyError(ConnectionError): pass
class SSLError(ConnectionError): pass
class Timeout(RequestException): pass
class ConnectTimeout(ConnectionError, Timeout): pass
class ReadTimeout(Timeout): pass
class URLRequired(RequestException): pass
class TooManyRedirects(RequestException): pass
class MissingSchema(RequestException, ValueError): pass
class InvalidSchema(RequestException, ValueError): pass
class InvalidURL(RequestException, ValueError): pass
class InvalidHeader(RequestException, ValueError): pass
class InvalidProxyURL(InvalidURL): pass
class ChunkedEncodingError(RequestException): pass
class ContentDecodingError(RequestException, BaseHTTPError): pass
class StreamConsumedError(RequestException, TypeError): pass
class RetryError(RequestException): pass
class UnrewindableBodyError(RequestException): pass
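Since every class above derives from RequestException, a scraper can catch the specific cases it cares about first and fall back to the base class last. The `fetch` helper below is a hypothetical sketch of that ordering; the labels it returns and the 3-second timeout are arbitrary choices.

```python
import socket

import requests


def fetch(url):
    """Return a short label describing how the request ended."""
    try:
        response = requests.get(url, timeout=3)
        response.raise_for_status()  # turn 4xx/5xx into HTTPError
    except requests.exceptions.Timeout:
        # Listed before ConnectionError because ConnectTimeout inherits from both.
        return 'timeout'
    except requests.exceptions.ConnectionError:
        return 'connection error'
    except requests.exceptions.HTTPError as exc:
        return f'http error {exc.response.status_code}'
    except requests.exceptions.RequestException as exc:
        # Catch-all for everything else requests can raise.
        return f'request failed: {type(exc).__name__}'
    return 'ok'


# Find a local port with nothing listening on it, to provoke a refused connection.
with socket.socket() as probe:
    probe.bind(('127.0.0.1', 0))
    closed_port = probe.getsockname()[1]

print(fetch('www.baidu.com'))                    # URL without a scheme
print(fetch('ftp://example.com/file'))           # scheme requests cannot handle
print(fetch(f'http://127.0.0.1:{closed_port}/')) # nothing is listening here
```

The first two calls fail before any connection is attempted, which shows that some of these exceptions signal mistakes in the request itself rather than network problems.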