Python Web Scraping in Practice: Collecting E-commerce Data with Proxy IPs

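Contents

1. Introduction to E-commerce Data
2. Scraping Target
3. Proxy IP Recommendation
4. Preparation
- 4.1 Installing Modules
- 4.2 Obtaining a Proxy IP
5. Scraper Code in Practice
- 5.1 Analyzing the Page
  - 5.1.1 Getting the Cookie
  - 5.1.2 Keyword Analysis
  - 5.1.3 Pagination Analysis
  - 5.1.4 Data Extraction Analysis
- 5.2 Sending the Request
- 5.3 Extracting the Data
- 5.4 Saving the Data
- 5.5 Complete Source Code
- 5.6 Data Analysis
6. Summary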

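1. Introduction to E-commerce Data

● E-commerce data plays an important role in understanding user behavior, optimizing marketing strategy, and improving conversion rates.

● By analyzing user data, a business can identify its target users and aim its ads and promotions precisely, which raises ad conversion rates and return on investment.

● E-commerce data also supports personalized recommendations, campaign optimization, supply chain management, and similar scenarios, helping businesses improve user experience and operational efficiency.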

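2. Scraping Target

The target this time is JD.com. The code takes a keyword, pages through the search results, and collects the following fields for each product: title, price, comment count, shop name, product link, shop link, and image link.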

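3. Proxy IP Recommendation

E-commerce sites hold a huge volume of data. To collect it safely and quickly, I use proxy IPs from Bright Data (亮数据). In my experience the quality is good, and a free trial is available:
Bright Data proxy IP free trial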

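4. Preparation

4.1 Installing Modules

Python: 3.10

Editor: PyCharm

Third-party modules to install yourself:

pip install requests  # fetch web pages
pip install lxml  # extract data from web pages
pip install pandas  # write the Excel file
pip install openpyxl  # Excel engine that pandas needs to write .xlsx files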

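4.2 Obtaining a Proxy IP

1. Register a free Bright Data account: Bright Data proxy IP free trial

2. Open the proxy IP products page.

3. Dynamic IPs, static IPs, datacenter IPs, and mobile proxy IPs are available. I chose datacenter IPs here.

4. Configure the channel: you can set the IP type (shared/dedicated), the number of IPs, the source country of the IPs, and so on.

5. Once configuration is done, the host, username, and password are displayed. We will add them to the code to obtain an IP.

6. In the code below, only the host, username, and password you just obtained need to be changed; it then returns the proxy IP: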

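import re  # regular expressions, for extracting strings
import pandas as pd  # pandas, for writing the Excel file
import requests  # basic Python HTTP library for scraping
from lxml import etree  # converts a page into Element objects
import time  # sleep between requests to avoid crawling too fast


def get_proxies():
    """Build the proxies dict for requests from the Bright Data credentials."""
    host = 'your host'  # host
    user_name = 'your username'  # username
    password = 'your password'  # password
    # Combine the three values into your personal proxy URL
    proxy_url = f'http://{user_name}:{password}@{host}'
    return {'http': proxy_url, 'https': proxy_url}


def get_ip():
    """Return the exit IP of the Bright Data proxy."""
    url = "http://lumtest.com/myip.json"  # default test endpoint (no need to change)
    # Send the request through the proxy
    response = requests.get(url, proxies=get_proxies(), timeout=10)
    # print('Proxy IP details:', response.text)
    response_dict = response.json()  # parse the JSON body; safer than eval()
    ip = response_dict['ip']
    # print('IP:', ip)
    return ip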
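The test endpoint returns JSON, and JSON is not always a valid Python literal (it may contain true, false, or null), so parsing it with the json module is safer than eval. A minimal sketch with a made-up sample body: only the ip field comes from the code above, while the other fields and the extract_ip helper are hypothetical.

```python
import json

# Hypothetical sample body; the real response may contain different fields.
sample_body = '{"ip": "203.0.113.7", "mobile": false, "asn": null}'


def extract_ip(body: str) -> str:
    """Parse a JSON response body and return its 'ip' field."""
    data = json.loads(body)  # handles true/false/null, which eval() would reject
    return data['ip']


print(extract_ip(sample_body))
```

With requests, calling response.json() does the same parsing directly on the response object.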

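5. Scraper Code in Practice

5.1 Analyzing the Page

5.1.1 Getting the Cookie

JD.com currently shows search data only to logged-in users, so we need the cookie from a logged-in session (for example, copied from the request headers shown in the browser's developer tools).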
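A cookie copied from the browser is a single "name=value; name=value" string. This article sends it as-is in a cookie header; as an alternative, it can be split into a dict for the cookies parameter of requests. A minimal sketch, where parse_cookie_string and the sample cookie names are hypothetical:

```python
def parse_cookie_string(cookie_str: str) -> dict:
    """Split a 'k1=v1; k2=v2' browser cookie string into a dict."""
    cookies = {}
    for pair in cookie_str.split(';'):
        pair = pair.strip()
        if not pair or '=' not in pair:
            continue  # skip empty or malformed fragments
        name, value = pair.split('=', 1)  # a value may itself contain '='
        cookies[name.strip()] = value.strip()
    return cookies


# Made-up cookie, for illustration only
print(parse_cookie_string('token=abc123; session=x=y; flag=1'))
```

The resulting dict could then be passed as requests.get(url, cookies=cookies).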

5.1.2 Keyword Analysis

We only need to pass the search term in the keyword parameter of the URL.
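Non-ASCII keywords appear percent-encoded (as UTF-8 bytes) in the address bar. requests normally handles this when it prepares the URL, but the encoding can be reproduced explicitly with the standard library. A small sketch, where encode_keyword is a hypothetical helper name:

```python
from urllib.parse import quote, unquote


def encode_keyword(keyword: str) -> str:
    """Percent-encode a search keyword as UTF-8 for use in a URL."""
    return quote(keyword)


encoded = encode_keyword('Python书籍')
print(encoded)           # percent-encoded form
print(unquote(encoded))  # decodes back to the original keyword
```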

5.1.3 Pagination Analysis

Page 1:

https://search.jd.com/Search?keyword=Python%E4%B9%A6%E7%B1%8&page=1

Page 2:

https://search.jd.com/Search?keyword=Python%E4%B9%A6%E7%B1%8&page=2

As you can see, the page parameter controls pagination.
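Since only the page parameter changes between pages, the URLs for a whole crawl can be generated up front. A minimal sketch, assuming the page parameter simply counts up from 1 as in the two URLs above; iter_page_urls is a hypothetical helper name:

```python
from urllib.parse import urlencode


def iter_page_urls(keyword: str, page_num: int):
    """Yield the search URL for pages 1..page_num."""
    for page in range(1, page_num + 1):
        # urlencode percent-encodes the keyword and joins the parameters with '&'
        query = urlencode({'keyword': keyword, 'page': page})
        yield f'https://search.jd.com/Search?{query}'


for url in iter_page_urls('手机', 3):
    print(url)
```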

5.1.4 Data Extraction Analysis

First, each product we need sits under its own li tag.

Each li tag contains all the fields we need.

The data checks out, so we can start writing code.
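The one-li-per-product layout maps naturally onto XPath: select every li, then query each one with a relative path. A minimal sketch on made-up, heavily simplified markup; the class names are the ones used in this article's extraction code, while the sample HTML and the parse_items helper are hypothetical.

```python
from lxml import etree

# Simplified, made-up markup; the real page is far larger.
SAMPLE_HTML = """
<ul class="gl-warp clearfix">
  <li><div class="p-price"><strong><i>1999.00</i></strong></div>
      <div class="p-name p-name-type-2"><a href="//item.example/1"><em>Phone <font>A</font></em></a></div></li>
  <li><div class="p-price"><strong><i>2999.00</i></strong></div>
      <div class="p-name p-name-type-2"><a href="//item.example/2"><em>Phone B</em></a></div></li>
</ul>
"""


def parse_items(html_str):
    """Return (title, price) pairs, one per li element."""
    root = etree.HTML(html_str)
    items = []
    for li in root.xpath("//ul[@class='gl-warp clearfix']/li"):
        # The leading '.' keeps the query relative to the current li
        title = ''.join(li.xpath(".//div[@class='p-name p-name-type-2']/a/em//text()"))
        price = li.xpath(".//div[@class='p-price']/strong/i/text()")
        items.append((title, price[0] if price else None))
    return items


print(parse_items(SAMPLE_HTML))
```

Joining all text nodes under em matters because the title can be split across nested tags, as in the first sample item.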

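5.2 Sending the Request

1. Set the keyword and page count, then build the URL for each page:

def main():
    keyword = '手机'  # search keyword ("mobile phone")
    page_num = 10  # number of pages to scrape
    for page in range(1, page_num + 1):
        url = f'https://search.jd.com/Search?keyword={keyword}&page={page}'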

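2. Fetch the page source. Note: the code below needs your proxy credentials (host, username, and password; see 4.2) and your own cookie (see 5.1.1).

import re  # regular expressions, for extracting strings
import pandas as pd  # pandas, for writing the Excel file
import requests  # basic Python HTTP library for scraping
from lxml import etree  # converts a page into Element objects
import time  # sleep between requests to avoid crawling too fast


def get_proxies():
    """Build the proxies dict for requests from the Bright Data credentials."""
    host = ''  # host
    user_name = ''  # username
    password = ''  # password
    # Combine the three values into your personal proxy URL
    proxy_url = f'http://{user_name}:{password}@{host}'
    return {'http': proxy_url, 'https': proxy_url}


def get_ip():
    """Return the exit IP of the Bright Data proxy."""
    url = "http://lumtest.com/myip.json"  # default test endpoint (no need to change)
    response = requests.get(url, proxies=get_proxies(), timeout=10)
    # print('Proxy IP details:', response.text)
    response_dict = response.json()  # parse the JSON body; safer than eval()
    ip = response_dict['ip']
    # print('IP:', ip)
    return ip


def get_html_str(url):
    """Send the request and return the page source."""
    # Request headers that mimic a browser (your own logged-in cookie is required here)
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/95.0.4638.69 Safari/537.36',
        'cookie': ''
    }
    # Route the request through the proxy (fill in your own credentials as in 4.2;
    # the author's have expired). requests expects a dict of proxy URLs, not a bare IP.
    proxies = get_proxies()
    # Send the request with the headers and the proxy
    response = requests.get(url, headers=headers, proxies=proxies, timeout=10)
    # Decode the page source
    html_str = response.content.decode()
    return html_str


def main():
    keyword = '手机'  # search keyword ("mobile phone")
    page_num = 1  # number of pages to scrape
    for page in range(1, page_num + 1):
        url = f'https://search.jd.com/Search?keyword={keyword}&page={page}'
        print(url)
        html_str = get_html_str(url)
        print(html_str)


if __name__ == '__main__':
    main()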

3. A successful run prints the source of each requested page.

5.3 Extracting the Data

The code below extracts the product fields: title, price, comment count (this one fails to extract; feel free to experiment with it yourself), shop name, product link, shop link, and image link:

from lxml import etree


def first(li, path, prefix=''):
    """Return prefix + the first XPath match under li, or None if nothing matches."""
    result = li.xpath(path)
    return prefix + result[0] if result else None


def get_data(html_str, page, data_list):
    """Extract the product data and append it to data_list."""
    # Convert the HTML string into an etree object so it can be queried with XPath
    html_data = etree.HTML(html_str)
    # Select all li tags, one per product
    li_list = html_data.xpath("//ul[@class='gl-warp clearfix']/li")
    # print(len(li_list))  # number of li tags found
    for li in li_list:
        # Title (may be split across several text nodes, so join them)
        title = ''.join(li.xpath(".//div[@class='p-name p-name-type-2']/a/em//text()"))
        # Product link
        goods_url = first(li, ".//div[@class='p-name p-name-type-2']/a/@href", 'https:')
        # Price
        price = first(li, ".//div[@class='p-price']/strong/i/text()")
        # Comment count (known issue: this XPath currently returns nothing)
        comment_num = first(li, ".//div[@class='p-commit']/strong/a/text()")
        # Shop name
        shop = first(li, ".//div[@class='p-shop']/span/a/text()")
        # Shop link
        shop_url = first(li, ".//div[@class='p-shop']/span[@class='J_im_icon']/a[@class='curr-shop hd-shopname']/@href", 'https:')
        # Image link, with the '.avif' suffix removed
        img_url = first(li, ".//div[@class='p-img']/a/img/@data-lazy-img", 'https:')
        if img_url:
            img_url = img_url.replace('.avif', '')
        # Keys: page, title, price, comment count, shop name, shop link, product link, image link
        row = {'页码': page, '标题': title, '价格': price, '评论数': comment_num,
               '店铺名': shop, '店铺链接': shop_url, '商品链接': goods_url, '图片链接': img_url}
        print(row)
        data_list.append(row)

A successful run prints one record per product.

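5.4 Saving the Data

Write the collected data to Excel:

def to_excel(data_list):
    """Write the data to Excel."""
    df = pd.DataFrame(data_list)
    df = df.drop_duplicates()  # drop duplicate rows (returns a new DataFrame, so assign it)
    df.to_excel('京东采集数据集.xlsx')  # file name means "JD collected dataset"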
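One detail worth checking when deduplicating: by default, pandas' drop_duplicates returns a new DataFrame rather than modifying the original, so its result has to be assigned. A minimal sketch on made-up rows; the dedupe helper and the sample records are hypothetical.

```python
import pandas as pd

# Made-up records, for illustration only
rows = [
    {'title': 'Phone A', 'price': '1999.00'},
    {'title': 'Phone A', 'price': '1999.00'},  # exact duplicate of the first row
    {'title': 'Phone B', 'price': '2999.00'},
]


def dedupe(records):
    """Build a DataFrame from records and return it without duplicate rows."""
    df = pd.DataFrame(records)
    # drop_duplicates returns a new DataFrame; reset_index renumbers the rows
    return df.drop_duplicates().reset_index(drop=True)


df = dedupe(rows)
print(len(rows), '->', len(df))
```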

5.5 Complete Source Code

The complete code below needs your proxy credentials (host, username, and password; see 4.2) and your own cookie (see 5.1.1). You can change the keyword and the number of pages to scrape:

import re  # regular expressions, for extracting strings
import pandas as pd  # pandas, for writing the Excel file
import requests  # basic Python HTTP library for scraping
from lxml import etree  # converts a page into Element objects
import time  # sleep between requests to avoid crawling too fast


def get_proxies():
    """Build the proxies dict for requests from the Bright Data credentials."""
    host = 'your host'  # host
    user_name = 'your username'  # username
    password = 'your password'  # password
    # Combine the three values into your personal proxy URL
    proxy_url = f'http://{user_name}:{password}@{host}'
    return {'http': proxy_url, 'https': proxy_url}


def get_ip():
    """Return the exit IP of the Bright Data proxy."""
    url = "http://lumtest.com/myip.json"  # default test endpoint (no need to change)
    response = requests.get(url, proxies=get_proxies(), timeout=10)
    # print('Proxy IP details:', response.text)
    response_dict = response.json()  # parse the JSON body; safer than eval()
    ip = response_dict['ip']
    # print('IP:', ip)
    return ip


def get_html_str(url):
    """Send the request and return the page source."""
    # Request headers that mimic a browser (your own logged-in cookie is required here)
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/95.0.4638.69 Safari/537.36',
        'cookie': 'your JD.com login cookie'
    }
    # Route the request through the proxy (fill in your own credentials as in 4.2;
    # the author's have expired). requests expects a dict of proxy URLs, not a bare IP.
    proxies = get_proxies()
    # Send the request with the headers and the proxy
    response = requests.get(url, headers=headers, proxies=proxies, timeout=10)
    # Decode the page source
    html_str = response.content.decode()
    return html_str


def first(li, path, prefix=''):
    """Return prefix + the first XPath match under li, or None if nothing matches."""
    result = li.xpath(path)
    return prefix + result[0] if result else None


def get_data(html_str, page, data_list):
    """Extract the product data and append it to data_list."""
    # Convert the HTML string into an etree object so it can be queried with XPath
    html_data = etree.HTML(html_str)
    # Select all li tags, one per product
    li_list = html_data.xpath("//ul[@class='gl-warp clearfix']/li")
    # print(len(li_list))  # number of li tags found
    for li in li_list:
        # Title (may be split across several text nodes, so join them)
        title = ''.join(li.xpath(".//div[@class='p-name p-name-type-2']/a/em//text()"))
        # Product link
        goods_url = first(li, ".//div[@class='p-name p-name-type-2']/a/@href", 'https:')
        # Price
        price = first(li, ".//div[@class='p-price']/strong/i/text()")
        # Comment count (known issue: this XPath currently returns nothing)
        comment_num = first(li, ".//div[@class='p-commit']/strong/a/text()")
        # Shop name
        shop = first(li, ".//div[@class='p-shop']/span/a/text()")
        # Shop link
        shop_url = first(li, ".//div[@class='p-shop']/span[@class='J_im_icon']/a[@class='curr-shop hd-shopname']/@href", 'https:')
        # Image link, with the '.avif' suffix removed
        img_url = first(li, ".//div[@class='p-img']/a/img/@data-lazy-img", 'https:')
        if img_url:
            img_url = img_url.replace('.avif', '')
        # Keys: page, title, price, comment count, shop name, shop link, product link, image link
        row = {'页码': page, '标题': title, '价格': price, '评论数': comment_num,
               '店铺名': shop, '店铺链接': shop_url, '商品链接': goods_url, '图片链接': img_url}
        print(row)
        data_list.append(row)


def to_excel(data_list):
    """Write the data to Excel."""
    df = pd.DataFrame(data_list)
    df = df.drop_duplicates()  # drop duplicate rows (returns a new DataFrame, so assign it)
    df.to_excel('京东采集数据集.xlsx')  # file name means "JD collected dataset"


def main():
    # 1. Set the keyword and the number of pages to scrape
    keyword = '手机'  # search keyword ("mobile phone")
    page_num = 10  # number of pages to scrape
    data_list = []  # empty list that collects the rows
    for page in range(1, page_num + 1):
        url = f'https://search.jd.com/Search?keyword={keyword}&page={page}'
        print(url)
        # 2. Fetch the source of the given page
        html_str = get_html_str(url)
        print(html_str)
        # 3. Extract the data
        get_data(html_str, page, data_list)
        time.sleep(1)
    # 4. Write to Excel
    to_excel(data_list)


if __name__ == '__main__':
    main()

Run results: