目录
- HTTP请求
- HTTP响应
- 获得页面响应
- 伪装用户访问
- 打包数据
- 爬取豆瓣top250
HTTP请求
HTTP:HypertextTransferProtcol 超文本传输协议
1、请求行
POST/user/info?new_user=true HTTP/1.1
#资源了路径user/info 查询参数new_user=true 协议版本HTTP/1.1
2、请求头
Host:www.example.com
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; ×64)
#host指主机域名
User-Agent:curl/7.77.0
#告知服务器客户端的相关信息
Accept:*/*
#客户端想接受的响应数据是什么类型
3、请求体
{"username":"刘威","email":"liuwei@hotmail.com"}
HTTP响应
# 状态行
HTTP/1.1 200 OK
# 响应头
Date:Fri,27Jan 2023 02:10:50 GMT
Content-Type:text/html;charset=utf-8
# 响应体
<!DOCTYPE html><head><title>首页</title></head><body><h1>hello world!</h1></body>
</html>
获得页面响应
pip install requests
import requests
head = { "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; ×64)" }
response=requests.get("http://books.toscrape.com")
if response.ok:print(response.text)
else:print("error")
伪装用户访问
import requests
headers={"User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/94.0.4606.71 Safari/537.36 Core/1.94.184.400 QQBrowser/11.3.5190.400"
}
response=requests.get("https://movie.douban.com/top250",headers=headers)
print(response.text)
打包数据
pip install bs4
from bs4 import BeautifulSoup
import requestscontent=requests.get("https://movie.douban.com/top250").text
# 传入BeautifulSoup的构造函数里
# 解析器
soup=BeautifulSoup(content,"html.parser")
# 能根据标签、属性等找出所有符合要求的元素
all_prices=soup.findAll("span",attrs={"class","title"})
for price in all_prices:print(price.string) #会把标签包围的文字返回给我们
爬取豆瓣top250
from bs4 import BeautifulSoup
import requests
# 伪装用户访问
headers={"User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/94.0.4606.71 Safari/537.36 Core/1.94.184.400 QQBrowser/11.3.5190.400"
}
# 根据url格式进行自动翻页
for start_num in range(0,250,25): response=requests.get(f"https://movie.douban.com/top250?start={start_num}",headers=headers) #我们就可以用f字符串去格式化html=response.text #打包htmlsoup=BeautifulSoup(html,"html.parser") #用html方式解析all_title=soup.findAll("span",attrs={"class":"title"}) #限制特定条件for title in all_title: #遍历所需内容title_string=title.stringif "/" not in title_string: #限制内容显示print(title_string)