环境准备
在开始之前,你需要确保你的Python环境已经安装了以下库:
requests
:用于发送HTTP请求。BeautifulSoup
:用于解析HTML文档。
如果你还没有安装这些库,可以通过以下命令安装:
pip install requests beautifulsoup4
豆瓣数据抓取步骤
import requests
from bs4 import BeautifulSoupurl = 'https://movie.douban.com/top250'
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4515.107 Safari/537.36'
}
response = requests.get(url, headers=headers)
response.encoding = response.apparent_encoding
soup = BeautifulSoup(response.text, 'html.parser')
movies = soup.find_all('div', class_='item') # 根据实际的HTML结构来定位数据
data = []
for movie in movies:title = movie.find('span', class_='title').textrating = movie.find('span', class_='rating_num').textlink = 'https://movie.douban.com' + movie.find('a')['href']item = {'title': title, 'rating': rating, 'link': link}print(item)data.append(item)
抓取结果