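1. Name of the topic-focused web crawler: Tiantian Fund (天天基金网) crawler analysis
2. Content crawled and data characteristics: the crawler visits the Tiantian Fund website, collects the relevant information (fund names, the previous day's unit net value and cumulative net value), saves it, and then uses it for visual analysis.
3. Design overview (implementation approach and technical difficulties):
First, request the pages with requests (through the HTMLSession from requests_html).
Next, parse the page content with lxml's etree and filter the data with etree xpath expressions.
Finally, save the data with file operations.
Difficulties: crawling the site and filtering the data.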
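The parse-and-filter step described above can be tried offline before touching the real site. The following is only a minimal sketch under stated assumptions: it uses the standard library's xml.etree.ElementTree as a stand-in for lxml (so it only accepts well-formed markup, unlike etree.HTML), and SAMPLE, its table contents, and extract_rows are hypothetical names and made-up values, not data from the site.

```python
import xml.etree.ElementTree as ET

# Hypothetical, made-up table that mimics a "name / unit value / cumulative value" layout.
SAMPLE = """
<table><tbody>
  <tr><td>000001</td><td><a>Fund A</a></td><td>1.2340</td><td>3.4560</td></tr>
  <tr><td>000002</td><td><a>Fund B</a></td><td>0.9870</td><td>1.5000</td></tr>
</tbody></table>
"""


def extract_rows(markup):
    """Return (name, unit_value, cumulative_value) tuples, one per table row."""
    root = ET.fromstring(markup.strip())
    rows = []
    # Walk row by row so the three fields of one fund always stay together.
    for tr in root.findall(".//tbody/tr"):
        name = tr.find("td[2]/a").text      # positional predicates are 1-based
        unit_value = tr.find("td[3]").text
        cumulative_value = tr.find("td[4]").text
        rows.append((name, unit_value, cumulative_value))
    return rows


if __name__ == "__main__":
    for row in extract_rows(SAMPLE):
        print(row)
```

Extracting per row, rather than building three independent lists, is a design choice of this sketch: it avoids rows drifting out of alignment when one cell is missing.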
1. Data crawling and collection
"""Large User-Agent list"""
USER_AGENT_LIST = [
    'Mozilla/5.0 (Windows NT 6.2; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/60.0.3112.90 Safari/537.36',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3451.0 Safari/537.36',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.9; rv:57.0) Gecko/20100101 Firefox/57.0',
    'Mozilla/5.0 (Windows NT 5.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/28.0.1500.71 Safari/537.36',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.2999.0 Safari/537.36',
    'Mozilla/5.0 (Windows NT 6.3; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/53.0.2785.70 Safari/537.36',
    'Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10.4; en-US; rv:1.9.2.2) Gecko/20100316 Firefox/3.6.2',
    'Mozilla/5.0 (Windows NT 5.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/44.0.2403.155 Safari/537.36 OPR/31.0.1889.174',
    'Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.1; Trident/4.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; .NET CLR 1.1.4322; MS-RTC LM 8; InfoPath.2; Tablet PC 2.0)',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.100 Safari/537.36',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.106 Safari/537.36 OPR/55.0.2994.61',
    'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.1 (KHTML, like Gecko) Chrome/14.0.814.0 Safari/535.1',
    'Mozilla/5.0 (Macintosh; U; PPC Mac OS X; ja-jp) AppleWebKit/418.9.1 (KHTML, like Gecko) Safari/419.3',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/43.0.2357.134 Safari/537.36',
    'Mozilla/5.0 (compatible; MSIE 10.0; Windows NT 6.1; Trident/6.0; Touch; MASMJS)',
    'Mozilla/5.0 (X11; Linux i686) AppleWebKit/535.21 (KHTML, like Gecko) Chrome/19.0.1041.0 Safari/535.21',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3497.100 Safari/537.36',
    'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4093.3 Safari/537.36',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_5) AppleWebKit/537.36 (KHTML, like Gecko; compatible; Swurl) Chrome/77.0.3865.120 Safari/537.36',
    'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.88 Safari/537.36',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.88 Safari/537.36',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.131 Safari/537.36',
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4086.0 Safari/537.36',
    'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:75.0) Gecko/20100101 Firefox/75.0',
    'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) coc_coc_browser/91.0.146 Chrome/85.0.4183.146 Safari/537.36',
    'Mozilla/5.0 (Windows; U; Windows NT 5.2; en-US) AppleWebKit/537.36 (KHTML, like Gecko) Safari/537.36 VivoBrowser/8.4.72.0 Chrome/62.0.3202.84',
    'Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.101 Safari/537.36',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.88 Safari/537.36 Edg/87.0.664.60',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.16; rv:83.0) Gecko/20100101 Firefox/83.0',
    'Mozilla/5.0 (X11; CrOS x86_64 13505.63.0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.88 Safari/537.36',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.9; rv:68.0) Gecko/20100101 Firefox/68.0',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.101 Safari/537.36',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.88 Safari/537.36',
    'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.198 Safari/537.36 OPR/72.0.3815.400',
    'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.101 Safari/537.36',
]
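The list is used by picking one entry at random for each request. A minimal sketch of that idea follows; SAMPLE_USER_AGENTS and build_headers are hypothetical names introduced for illustration, and the optional rng parameter is an assumption added so the choice can be made reproducible.

```python
import random

# Hypothetical two-entry pool; in the crawler the full USER_AGENT_LIST would be passed instead.
SAMPLE_USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4086.0 Safari/537.36",
    "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:75.0) Gecko/20100101 Firefox/75.0",
]


def build_headers(user_agents, rng=random):
    """Build request headers with a randomly chosen User-Agent string."""
    if not user_agents:
        raise ValueError("user_agents must not be empty")
    return {"User-Agent": rng.choice(user_agents)}


if __name__ == "__main__":
    # A seeded random.Random makes the pick repeatable, which helps when debugging.
    print(build_headers(SAMPLE_USER_AGENTS, rng=random.Random(0)))
```

Rotating the User-Agent only changes one request header; it does not by itself make a crawler polite, so request rates should still be kept low.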
2. Cleaning and processing the data
import random

from lxml import etree
from requests_html import HTMLSession

session = HTMLSession()


class DFSpider(object):
    # parse_save_excel and parse_img_four_func are defined in the later sections.

    def __init__(self):
        # Starting request URL
        self.start_url = 'http://fund.eastmoney.com/fund.html'
        # URL of the second data page
        self.next_url = 'http://fund.eastmoney.com/HBJJ_pjsyl.html'

    def parse_start_url(self):
        """Request the first page and parse the response."""
        # Request headers: pick a random User-Agent with the random module
        headers = {'User-Agent': random.choice(USER_AGENT_LIST)}
        # Send the request and get the response bytes
        response = session.get(self.start_url, headers=headers).content
        # Convert the bytes into an element tree that supports xpath
        start_page = etree.HTML(response)
        # Go on to fetch the second page
        self.parse_next_url_response(start_page)

    def parse_next_url_response(self, start_page):
        """Request the second data page and parse the response."""
        headers = {'User-Agent': random.choice(USER_AGENT_LIST)}
        # Send the request to the second page and get the response bytes
        response = session.get(self.next_url, headers=headers).content
        next_page = etree.HTML(response)
        # Extract the data from both pages
        self.parse_response_data(start_page, next_page)

    def parse_response_data(self, start_page, next_page):
        """Extract names and net values from both parsed pages."""
        # Fund names
        name_list_1 = start_page.xpath('//tbody/tr/td[5]/nobr/a[1]/text()')
        name_list_2 = next_page.xpath('//tbody/tr/td[5]/nobr/a[1]/text()')
        name_list = name_list_1 + name_list_2
        # Previous day's unit net value
        num_1_list_data_1 = start_page.xpath('//tbody/tr/td[6]/text()')
        num_1_list_data_2 = next_page.xpath('//tr/td[6]/span/text()')
        num_1_list = num_1_list_data_1 + num_1_list_data_2
        # Previous day's cumulative net value
        num_2_list_data_1 = start_page.xpath('//tbody/tr/td[7]/text()')
        num_2_list_data_2 = next_page.xpath('//tr/td[7]/text()')
        num_2_list = num_2_list_data_1 + num_2_list_data_2
        # Process the three lists
        self.for_parse_three_list(name_list, num_1_list, num_2_list)

    def for_parse_three_list(self, name_list, num_1_list, num_2_list):
        """Save every row, then start the analysis.

        :param name_list: fund names
        :param num_1_list: previous day's unit net values
        :param num_2_list: previous day's cumulative net values
        """
        for a, b, c in zip(name_list, num_1_list, num_2_list):
            # The dict key is used as the sheet name in the workbook
            dict_data = {'股票数据': [a, b, c]}
            self.parse_save_excel(dict_data)
            print(f'{a} ---- collected!')
        # Collection finished; generate the charts
        self.parse_random_data(name_list, num_1_list, num_2_list)

    @staticmethod
    def _is_number(text):
        """Return True if text can be converted to float."""
        try:
            float(text)
            return True
        except (TypeError, ValueError):
            return False

    def parse_random_data(self, name_list, num_1_list, num_2_list):
        """Randomly pick up to 15 rows for analysis."""
        # Only rows present in all three lists with numeric net values can be plotted
        row_count = min(len(name_list), len(num_1_list), len(num_2_list))
        valid = [i for i in range(row_count)
                 if self._is_number(num_1_list[i]) and self._is_number(num_2_list[i])]
        if not valid:
            print('No numeric rows to analyse.')
            return
        # Distinct random row indexes
        index_list = random.sample(valid, min(15, len(valid)))
        # Generate the four charts from the sampled rows
        self.parse_img_four_func(index_list, name_list, num_1_list, num_2_list)
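The cleaning step comes down to two things: turning scraped strings into numbers and sampling only rows that survived the conversion. The sketch below shows this in isolation; to_float and sample_valid_indices are hypothetical helper names, and the placeholder strings such as "---" are an assumption about what a non-numeric cell might look like, not something confirmed from the site.

```python
import random


def to_float(text):
    """Convert a scraped cell to float, or return None if it is not numeric."""
    try:
        return float(text.strip())
    except (AttributeError, ValueError):
        return None


def sample_valid_indices(unit_values, cumulative_values, k, rng=random):
    """Pick up to k distinct row indexes whose two values are both numeric."""
    # Indexes beyond the shorter list have no partner value, so they are ignored.
    row_count = min(len(unit_values), len(cumulative_values))
    valid = [
        i for i in range(row_count)
        if to_float(unit_values[i]) is not None
        and to_float(cumulative_values[i]) is not None
    ]
    return rng.sample(valid, min(k, len(valid)))


if __name__ == "__main__":
    units = ["1.2340", "---", "0.9870", "2.0000"]
    totals = ["3.4560", "1.0000", "", "2.5000"]
    print(sorted(sample_valid_indices(units, totals, 15)))
```

Sampling from the valid indexes, instead of drawing random numbers from a fixed range, means the result can never point past the end of a list or repeat a row.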
4. Data analysis and visualization (for example: bar chart, histogram, scatter plot, box plot, distribution plot)
import numpy as np
from matplotlib import pyplot as plt


class DFSpider(object):
    # ... methods from the previous section ...

    def parse_img_four_func(self, index_list, name_list, num_1_list, num_2_list):
        """Generate four charts from the sampled rows.

        :param index_list: indexes of the randomly sampled rows
        :param name_list: fund names
        :param num_1_list: previous day's unit net values
        :param num_2_list: previous day's cumulative net values
        """
        title_list = []  # names
        qy_num_1 = []    # unit net values
        qy_num_2 = []    # cumulative net values
        for index_num in index_list:
            title_list.append(name_list[index_num])
            # The scraped values are strings; convert them before plotting
            qy_num_1.append(float(num_1_list[index_num]))
            qy_num_2.append(float(num_2_list[index_num]))

        # Chart 1: line chart of both net values
        # These two settings let matplotlib display Chinese text and minus signs
        plt.rcParams['font.sans-serif'] = ['SimHei']
        plt.rcParams['axes.unicode_minus'] = False
        plt.figure()
        # plot arguments: x values, y values, marker/line style, colour, alpha, line width, label
        plt.plot(title_list, qy_num_2, 'o-', color='#4169E1', alpha=0.8, linewidth=1, label='Cumulative net value')
        plt.plot(title_list, qy_num_1, 'o-', color='#69e141', alpha=0.8, linewidth=1, label='Unit net value')
        # Without legend(), the label= arguments above are never displayed
        plt.legend(loc="upper right")
        plt.xticks(rotation=270)
        plt.xlabel('Fund')
        plt.ylabel('Net value')
        plt.savefig('根据净值生成折线图.png')
        plt.show()

        # Chart 2: pie chart of the unit net values
        addr_dict_key = title_list
        addr_dict_value = qy_num_1
        plt.rcParams['font.sans-serif'] = ['Microsoft YaHei']
        plt.rcParams['axes.unicode_minus'] = False
        plt.figure()
        plt.pie(addr_dict_value, labels=addr_dict_key, autopct='%1.1f%%')
        plt.title('Unit net value comparison')
        plt.savefig('单位净值对比-饼图')
        plt.show()

        # Chart 3: scatter plot of the cumulative net values
        plt.rcParams['font.sans-serif'] = ['SimHei']
        plt.rcParams['axes.unicode_minus'] = False
        plt.figure()
        production = title_list  # y axis: fund names
        tem = qy_num_2           # x axis: cumulative net values
        colors = np.random.rand(len(tem))  # one random colour value per point
        plt.scatter(tem, production, s=200, c=colors)  # marker size 200
        plt.xlabel('Cumulative net value')
        plt.xticks(rotation=270)
        plt.ylabel('Fund')
        plt.savefig('净值散点图.png')
        plt.show()

        # Chart 4: grouped bar chart of both net values
        plt.rcParams['font.sans-serif'] = ['SimHei']
        plt.rcParams['axes.unicode_minus'] = False
        plt.figure()
        name_list = title_list
        num_list = qy_num_1  # unit net values
        width = 0.4          # bar width; two bars per fund fit inside one tick interval
        index = np.arange(len(name_list))
        plt.bar(index, num_list, width, color='steelblue', tick_label=name_list, label='Unit net value')
        plt.bar(index + width, qy_num_2, width, color='red', hatch='\\', label='Cumulative net value')
        for a, b in zip(index, num_list):  # print each value above its bar
            plt.text(a, b, '%.2f' % b, ha='center', va='bottom', fontsize=7)
        plt.xticks(rotation=270)
        plt.title('Net value bar chart')
        plt.ylabel('Net value')
        plt.legend()
        plt.savefig('净值-柱状图', bbox_inches='tight')
        plt.show()
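In the grouped bar chart, each series has to be shifted sideways so that the bars of one fund sit next to each other. The arithmetic can be checked without drawing anything. This is a minimal sketch, assuming the bars of each group should be centred on the integer tick positions; grouped_bar_positions is a hypothetical helper, not a matplotlib function.

```python
def grouped_bar_positions(n_groups, n_series, width):
    """Return one list of x positions per series, centred on ticks 0..n_groups-1."""
    if n_series * width > 1:
        raise ValueError("bars of neighbouring groups would overlap")
    group_span = n_series * width
    positions = []
    for series in range(n_series):
        # Shift from the tick to the centre of this series' bar.
        offset = -group_span / 2 + width / 2 + series * width
        positions.append([group + offset for group in range(n_groups)])
    return positions


if __name__ == "__main__":
    unit_x, cumulative_x = grouped_bar_positions(n_groups=3, n_series=2, width=0.25)
    print(unit_x)
    print(cumulative_x)
```

Each returned list could then be passed as the x argument of one plt.bar call, with the tick labels placed at the integer positions.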
5. All of the code above combined into the complete program
import os
import random

import numpy as np
import xlrd
import xlwt
from lxml import etree
from matplotlib import pyplot as plt
from requests_html import HTMLSession
from xlutils.copy import copy

"""Large User-Agent list"""
USER_AGENT_LIST = [
    'Mozilla/5.0 (Windows NT 6.2; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/60.0.3112.90 Safari/537.36',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3451.0 Safari/537.36',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.9; rv:57.0) Gecko/20100101 Firefox/57.0',
    'Mozilla/5.0 (Windows NT 5.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/28.0.1500.71 Safari/537.36',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.2999.0 Safari/537.36',
    'Mozilla/5.0 (Windows NT 6.3; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/53.0.2785.70 Safari/537.36',
    'Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10.4; en-US; rv:1.9.2.2) Gecko/20100316 Firefox/3.6.2',
    'Mozilla/5.0 (Windows NT 5.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/44.0.2403.155 Safari/537.36 OPR/31.0.1889.174',
    'Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.1; Trident/4.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; .NET CLR 1.1.4322; MS-RTC LM 8; InfoPath.2; Tablet PC 2.0)',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.100 Safari/537.36',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.106 Safari/537.36 OPR/55.0.2994.61',
    'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.1 (KHTML, like Gecko) Chrome/14.0.814.0 Safari/535.1',
    'Mozilla/5.0 (Macintosh; U; PPC Mac OS X; ja-jp) AppleWebKit/418.9.1 (KHTML, like Gecko) Safari/419.3',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/43.0.2357.134 Safari/537.36',
    'Mozilla/5.0 (compatible; MSIE 10.0; Windows NT 6.1; Trident/6.0; Touch; MASMJS)',
    'Mozilla/5.0 (X11; Linux i686) AppleWebKit/535.21 (KHTML, like Gecko) Chrome/19.0.1041.0 Safari/535.21',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3497.100 Safari/537.36',
    'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4093.3 Safari/537.36',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_5) AppleWebKit/537.36 (KHTML, like Gecko; compatible; Swurl) Chrome/77.0.3865.120 Safari/537.36',
    'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.88 Safari/537.36',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.88 Safari/537.36',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.131 Safari/537.36',
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4086.0 Safari/537.36',
    'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:75.0) Gecko/20100101 Firefox/75.0',
    'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) coc_coc_browser/91.0.146 Chrome/85.0.4183.146 Safari/537.36',
    'Mozilla/5.0 (Windows; U; Windows NT 5.2; en-US) AppleWebKit/537.36 (KHTML, like Gecko) Safari/537.36 VivoBrowser/8.4.72.0 Chrome/62.0.3202.84',
    'Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.101 Safari/537.36',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.88 Safari/537.36 Edg/87.0.664.60',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.16; rv:83.0) Gecko/20100101 Firefox/83.0',
    'Mozilla/5.0 (X11; CrOS x86_64 13505.63.0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.88 Safari/537.36',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.9; rv:68.0) Gecko/20100101 Firefox/68.0',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.101 Safari/537.36',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.88 Safari/537.36',
    'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.198 Safari/537.36 OPR/72.0.3815.400',
    'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.101 Safari/537.36',
]

session = HTMLSession()


class DFSpider(object):

    def __init__(self):
        # Starting request URL
        self.start_url = 'http://fund.eastmoney.com/fund.html'
        # URL of the second data page
        self.next_url = 'http://fund.eastmoney.com/HBJJ_pjsyl.html'

    def parse_start_url(self):
        """Request the first page and parse the response."""
        # Request headers: pick a random User-Agent with the random module
        headers = {'User-Agent': random.choice(USER_AGENT_LIST)}
        # Send the request and get the response bytes
        response = session.get(self.start_url, headers=headers).content
        # Convert the bytes into an element tree that supports xpath
        start_page = etree.HTML(response)
        # Go on to fetch the second page
        self.parse_next_url_response(start_page)

    def parse_next_url_response(self, start_page):
        """Request the second data page and parse the response."""
        headers = {'User-Agent': random.choice(USER_AGENT_LIST)}
        # Send the request to the second page and get the response bytes
        response = session.get(self.next_url, headers=headers).content
        next_page = etree.HTML(response)
        # Extract the data from both pages
        self.parse_response_data(start_page, next_page)

    def parse_response_data(self, start_page, next_page):
        """Extract names and net values from both parsed pages."""
        # Fund names
        name_list_1 = start_page.xpath('//tbody/tr/td[5]/nobr/a[1]/text()')
        name_list_2 = next_page.xpath('//tbody/tr/td[5]/nobr/a[1]/text()')
        name_list = name_list_1 + name_list_2
        # Previous day's unit net value
        num_1_list_data_1 = start_page.xpath('//tbody/tr/td[6]/text()')
        num_1_list_data_2 = next_page.xpath('//tr/td[6]/span/text()')
        num_1_list = num_1_list_data_1 + num_1_list_data_2
        # Previous day's cumulative net value
        num_2_list_data_1 = start_page.xpath('//tbody/tr/td[7]/text()')
        num_2_list_data_2 = next_page.xpath('//tr/td[7]/text()')
        num_2_list = num_2_list_data_1 + num_2_list_data_2
        # Process the three lists
        self.for_parse_three_list(name_list, num_1_list, num_2_list)

    def for_parse_three_list(self, name_list, num_1_list, num_2_list):
        """Save every row, then start the analysis.

        :param name_list: fund names
        :param num_1_list: previous day's unit net values
        :param num_2_list: previous day's cumulative net values
        """
        for a, b, c in zip(name_list, num_1_list, num_2_list):
            # The dict key is used as the sheet name in the workbook
            dict_data = {'股票数据': [a, b, c]}
            self.parse_save_excel(dict_data)
            print(f'{a} ---- collected!')
        # Collection finished; generate the charts
        self.parse_random_data(name_list, num_1_list, num_2_list)

    @staticmethod
    def _is_number(text):
        """Return True if text can be converted to float."""
        try:
            float(text)
            return True
        except (TypeError, ValueError):
            return False

    def parse_random_data(self, name_list, num_1_list, num_2_list):
        """Randomly pick up to 15 rows for analysis."""
        # Only rows present in all three lists with numeric net values can be plotted
        row_count = min(len(name_list), len(num_1_list), len(num_2_list))
        valid = [i for i in range(row_count)
                 if self._is_number(num_1_list[i]) and self._is_number(num_2_list[i])]
        if not valid:
            print('No numeric rows to analyse.')
            return
        # Distinct random row indexes
        index_list = random.sample(valid, min(15, len(valid)))
        # Generate the four charts from the sampled rows
        self.parse_img_four_func(index_list, name_list, num_1_list, num_2_list)

    def parse_img_four_func(self, index_list, name_list, num_1_list, num_2_list):
        """Generate four charts from the sampled rows.

        :param index_list: indexes of the randomly sampled rows
        :param name_list: fund names
        :param num_1_list: previous day's unit net values
        :param num_2_list: previous day's cumulative net values
        """
        title_list = []  # names
        qy_num_1 = []    # unit net values
        qy_num_2 = []    # cumulative net values
        for index_num in index_list:
            title_list.append(name_list[index_num])
            # The scraped values are strings; convert them before plotting
            qy_num_1.append(float(num_1_list[index_num]))
            qy_num_2.append(float(num_2_list[index_num]))

        # Chart 1: line chart of both net values
        # These two settings let matplotlib display Chinese text and minus signs
        plt.rcParams['font.sans-serif'] = ['SimHei']
        plt.rcParams['axes.unicode_minus'] = False
        plt.figure()
        # plot arguments: x values, y values, marker/line style, colour, alpha, line width, label
        plt.plot(title_list, qy_num_2, 'o-', color='#4169E1', alpha=0.8, linewidth=1, label='Cumulative net value')
        plt.plot(title_list, qy_num_1, 'o-', color='#69e141', alpha=0.8, linewidth=1, label='Unit net value')
        # Without legend(), the label= arguments above are never displayed
        plt.legend(loc="upper right")
        plt.xticks(rotation=270)
        plt.xlabel('Fund')
        plt.ylabel('Net value')
        plt.savefig('根据净值生成折线图.png')
        plt.show()

        # Chart 2: pie chart of the unit net values
        addr_dict_key = title_list
        addr_dict_value = qy_num_1
        plt.rcParams['font.sans-serif'] = ['Microsoft YaHei']
        plt.rcParams['axes.unicode_minus'] = False
        plt.figure()
        plt.pie(addr_dict_value, labels=addr_dict_key, autopct='%1.1f%%')
        plt.title('Unit net value comparison')
        plt.savefig('单位净值对比-饼图')
        plt.show()

        # Chart 3: scatter plot of the cumulative net values
        plt.rcParams['font.sans-serif'] = ['SimHei']
        plt.rcParams['axes.unicode_minus'] = False
        plt.figure()
        production = title_list  # y axis: fund names
        tem = qy_num_2           # x axis: cumulative net values
        colors = np.random.rand(len(tem))  # one random colour value per point
        plt.scatter(tem, production, s=200, c=colors)  # marker size 200
        plt.xlabel('Cumulative net value')
        plt.xticks(rotation=270)
        plt.ylabel('Fund')
        plt.savefig('净值散点图.png')
        plt.show()

        # Chart 4: grouped bar chart of both net values
        plt.rcParams['font.sans-serif'] = ['SimHei']
        plt.rcParams['axes.unicode_minus'] = False
        plt.figure()
        name_list = title_list
        num_list = qy_num_1  # unit net values
        width = 0.4          # bar width; two bars per fund fit inside one tick interval
        index = np.arange(len(name_list))
        plt.bar(index, num_list, width, color='steelblue', tick_label=name_list, label='Unit net value')
        plt.bar(index + width, qy_num_2, width, color='red', hatch='\\', label='Cumulative net value')
        for a, b in zip(index, num_list):  # print each value above its bar
            plt.text(a, b, '%.2f' % b, ha='center', va='bottom', fontsize=7)
        plt.xticks(rotation=270)
        plt.title('Net value bar chart')
        plt.ylabel('Net value')
        plt.legend()
        plt.savefig('净值-柱状图', bbox_inches='tight')
        plt.show()

    def parse_save_excel(self, data_dict):
        """Append one row of data to the Excel file."""
        # Create the output folder if it does not exist yet
        os_path_1 = os.getcwd() + '/数据/'
        if not os.path.exists(os_path_1):
            os.mkdir(os_path_1)
        os_path = os_path_1 + '股票数据.xls'
        if not os.path.exists(os_path):
            # Create a new workbook (that is, a new Excel file)
            workbook = xlwt.Workbook(encoding='utf-8')
            # Create a new sheet
            worksheet1 = workbook.add_sheet("股票数据", cell_overwrite_ok=True)
            # Header row: name, previous day's unit net value, previous day's cumulative net value
            excel_data_1 = ('股票名称', '昨日单位净值', '昨日累计净值')
            for i in range(0, len(excel_data_1)):
                worksheet1.col(i).width = 2560 * 3
                # Arguments: row, column, content
                worksheet1.write(0, i, excel_data_1[i])
            workbook.save(os_path)
        # If the workbook exists, append the row to the matching sheet
        if os.path.exists(os_path):
            # Open the workbook
            workbook = xlrd.open_workbook(os_path)
            # Names of all sheets in the workbook
            sheets = workbook.sheet_names()
            for i in range(len(sheets)):
                for name in data_dict.keys():
                    worksheet = workbook.sheet_by_name(sheets[i])
                    # Compare each sheet name with the data key
                    if worksheet.name == name:
                        # Number of rows already in the sheet
                        rows_old = worksheet.nrows
                        # Copy the xlrd workbook into a writable xlwt workbook
                        new_workbook = copy(workbook)
                        # Get the i-th sheet of the copied workbook
                        new_worksheet = new_workbook.get_sheet(i)
                        for num in range(0, len(data_dict[name])):
                            new_worksheet.write(rows_old, num, data_dict[name][num])
                        new_workbook.save(os_path)

    def run(self):
        """Entry point."""
        self.parse_start_url()


if __name__ == '__main__':
    d = DFSpider()
    d.run()
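parse_save_excel reopens, copies and re-saves the workbook once per row. If a plain-text file is acceptable, all rows can be written in one pass. The sketch below swaps in the standard library's csv module, which is not what the program above uses; HEADER, write_rows and the sample row are hypothetical names and made-up values for illustration.

```python
import csv
import io

# Hypothetical column names for the three collected fields.
HEADER = ("name", "unit_net_value", "cumulative_net_value")


def write_rows(stream, rows):
    """Write the header and all rows to an open text stream; return the row count."""
    writer = csv.writer(stream)
    writer.writerow(HEADER)
    writer.writerows(rows)
    return len(rows)


if __name__ == "__main__":
    # io.StringIO stands in for a file opened with open(path, "w", newline="", encoding="utf-8").
    buffer = io.StringIO()
    write_rows(buffer, [("Fund A", "1.2340", "3.4560")])
    print(buffer.getvalue())
```

Writing to a stream rather than a path keeps the function easy to test; the caller decides whether the target is a real file or an in-memory buffer.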
For project practice only; do not use it commercially.
That's all for today.