In this article we take scraping historical weather data as an example and briefly walk through two ways of collecting data:
1. For simple or relatively small data needs, scrape with requests (or Selenium) plus BeautifulSoup.
2. When a larger volume of data is needed, the Scrapy framework is recommended; Scrapy issues requests asynchronously, so it collects data very efficiently.
Below we use the site http://www.tianqihoubao.com/lishi/ as the target to introduce both approaches:
1. Collect the weather data with requests + BeautifulSoup and store it in MySQL
Approach:
The weather data we need is served from URLs such as http://www.tianqihoubao.com/lishi/beijing/month/201101.html. Looking at the URL, only two parts vary: the city name and the year-month. Since every year contains the same 12 months, we can build the month list with months = list(range(1, 13)) and treat the city name and year as variables, which is enough to construct the full list of URLs to scrape. We then iterate over that list, request each URL, and parse the response to extract the data.
That is the overall plan; the first step is to construct the URL list.
def get_url(cityname, start_year, end_year):
    years = list(range(start_year, end_year))
    months = list(range(1, 13))
    suburl = 'http://www.tianqihoubao.com/lishi/'
    urllist = []
    for year in years:
        for month in months:
            if month < 10:
                url = suburl + cityname + '/month/' + str(year) + (str(0) + str(month)) + '.html'
            else:
                url = suburl + cityname + '/month/' + str(year) + str(month) + '.html'
            urllist.append(url.strip())
    return urllist
This function gives us the list of URLs that need to be scraped.
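As a quick sanity check (a hypothetical call, not part of the original script), the list produced for Beijing over 2016-2018 would look like this:

urls = get_url('beijing', 2016, 2019)   # range(2016, 2019) covers 2016-2018
print(len(urls))   # 36 URLs: 3 years x 12 months
print(urls[0])     # http://www.tianqihoubao.com/lishi/beijing/month/201601.html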
Note that the function takes cityname, the name of the city whose data we want; these names have to be prepared by hand. Assuming the list of city names has already been built and stored in a MySQL database, we query the database for the names, iterate over them, and pass each city name together with the start and end years to the function above.
def get_cityid(db_conn, db_cur, url):
    suburl = url.split('/')
    sql = 'select cityid from city where cityname = %s'
    db_cur.execute(sql, suburl[4])
    cityid = db_cur.fetchone()
    idlist = list(cityid)
    return idlist[0]
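The main function further down also calls a get_cityname helper, which is not listed in the article; a minimal sketch of it and of the database connection, assuming a pymysql connection and a city table with a cityname column (all names here are assumptions), might look like this:

import pymysql

def get_db():
    # hypothetical connection settings; adjust host/user/password/db to your own environment
    db_conn = pymysql.connect(host='localhost', user='root', password='******',
                              db='weather', charset='utf8')
    return db_conn, db_conn.cursor()

def get_cityname(db_conn, db_cur):
    # fetch every city name so the main loop can build URLs for each city
    db_cur.execute('select cityname from city')
    return [row[0] for row in db_cur.fetchall()]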
With the city code, the start and end years, and the generated URL list in hand, the next step is of course to request each URL and parse the HTML it returns. Here we parse the page source with BeautifulSoup; the code is as follows:
def parse_html_bs(db_conn, db_cur, url):
    proxy = get_proxy()
    proxies = {
        'http': 'http://' + proxy,
        'https': 'https://' + proxy,
    }
    headers = {
        'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/52.0.2743.116 Safari/537.36',
        'Connection': 'close'
    }

    # fetch the HTML source of the weather page
    weather_data = requests.get(url=url, headers=headers, proxies=proxies).text
    weather_data_new = weather_data.replace('\n', '').replace('\r', '').replace(' ', '')
    soup = BeautifulSoup(weather_data_new, 'lxml')
    table = soup.find_all(['td'])
    # look up the city id
    cityid = get_cityid(db_conn, db_cur, url)
    listall = []
    for t in list(table):
        ts = t.string
        listall.append(ts)
    # split the flat cell list into rows of four cells (one row per day)
    n = 4
    sublist = [listall[i:i+n] for i in range(0, len(listall), n)]
    # drop the header row
    sublist.remove(sublist[0])
    flist = []
    # split the max/min temperature field for easier analysis later, and prepend the city code
    for sub in sublist:
        sub2 = sub[2].split('/')
        sub.remove(sub[2])
        sub.insert(2, sub2[0])
        sub.insert(3, sub2[1])
        sub.insert(0, cityid)  # prepend the city code
        flist.append(sub)
    return flist
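parse_html_bs also relies on a get_proxy() helper that is not shown in the article; a minimal sketch, assuming a local proxy-pool service like the one used by the Scrapy middleware later (http://localhost:5000/random/ returning an 'ip:port' string), could be:

import requests

def get_proxy():
    # hypothetical proxy-pool endpoint that returns one usable 'ip:port' per request
    return requests.get('http://localhost:5000/random/').text.strip()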
Finally, in the main function we loop over the URL list and store the parsed results in the MySQL database.
if __name__ == '__main__':
    citylist = get_cityname(db_conn, db_cur)
    for city in citylist:
        urllist = get_url(city, 2016, 2019)
        for url in urllist:
            time.sleep(1)
            flist = parse_html_bs(db_conn, db_cur, url)
            for li in flist:
                tool.dyn_insert_sql('weather', tuple(li), db_conn, db_cur)
            time.sleep(1)
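The tool.dyn_insert_sql helper used above comes from the author's own toolkit and is not listed here; a rough sketch of such a dynamic insert, assuming the weather table has exactly as many columns as the tuple has values, might be:

def dyn_insert_sql(table_name, data, db_conn, db_cur):
    # build an INSERT statement with one %s placeholder per value in the tuple
    placeholders = ','.join(['%s'] * len(data))
    sql = 'insert into %s values (%s)' % (table_name, placeholders)
    db_cur.execute(sql, data)
    db_conn.commit()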
With that, we have a complete program that scrapes historical weather data with requests + BeautifulSoup and stores it in MySQL. The full code is at: https://gitee.com/liangxinbin/Scrpay/blob/master/weatherData.py
2. Collect the weather data with the Scrapy framework and store it in MongoDB
1) Define the data structure we want to scrape by editing the project's items.py file
class WeatherItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    cityname = Field()   # city name
    data = Field()       # date
    tq = Field()         # weather
    maxtemp = Field()    # maximum temperature
    mintemp = Field()    # minimum temperature
    fengli = Field()     # wind force
2) Modify the downloader middleware to pick a random User-Agent and proxy IP for each request
class RandomUserAgentMiddleware():
    def __init__(self, UA):
        self.user_agents = UA

    @classmethod
    def from_crawler(cls, crawler):
        return cls(UA=crawler.settings.get('MY_USER_AGENT'))  # MY_USER_AGENT is configured in settings and read via this class method

    def process_request(self, request, spider):
        request.headers['User-Agent'] = random.choice(self.user_agents)  # pick a random User-Agent

    def process_response(self, request, response, spider):
        return response


class ProxyMiddleware():
    def __init__(self):
        ipproxy = requests.get('http://localhost:5000/random/')  # fetch a random usable proxy from the proxy pool
        self.random_ip = 'http://' + ipproxy.text

    def process_request(self, request, spider):
        print(self.random_ip)
        request.meta['proxy'] = self.random_ip

    def process_response(self, request, response, spider):
        return response
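For these two middlewares to take effect they must be registered in settings.py, together with the MY_USER_AGENT list read by from_crawler; a sketch, assuming the project package is called scrapymodel and the classes live in middlewares.py (the module path is an assumption), might be:

# settings.py (excerpt, assumed module path)
MY_USER_AGENT = [
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/52.0.2743.116 Safari/537.36',
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.77 Safari/537.36',
]

DOWNLOADER_MIDDLEWARES = {
    'scrapymodel.middlewares.RandomUserAgentMiddleware': 543,
    'scrapymodel.middlewares.ProxyMiddleware': 544,
}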
3) Modify the pipeline file to process the items returned by the spider
import pymongo


class MongoPipeline(object):

    def __init__(self, mongo_url, mongo_db, collection):
        self.mongo_url = mongo_url
        self.mongo_db = mongo_db
        self.collection = collection

    @classmethod
    def from_crawler(cls, crawler):
        return cls(
            mongo_url=crawler.settings.get('MONGO_URL'),   # MONGO_URL, MONGO_DB and COLLECTION are configured in settings and read via this class method
            mongo_db=crawler.settings.get('MONGO_DB'),
            collection=crawler.settings.get('COLLECTION')
        )

    def open_spider(self, spider):
        self.client = pymongo.MongoClient(self.mongo_url)
        self.db = self.client[self.mongo_db]

    def process_item(self, item, spider):
        # name = item.__class__.collection
        name = self.collection
        self.db[name].insert(dict(item))  # insert the record into MongoDB
        return item

    def close_spider(self, spider):
        self.client.close()
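Likewise, the MongoDB settings read by from_crawler and the pipeline itself need to be declared in settings.py; a sketch, assuming the class lives in scrapymodel/pipelines.py and a local MongoDB instance (both are assumptions), could be:

# settings.py (excerpt, assumed paths and values)
MONGO_URL = 'localhost'
MONGO_DB = 'weather'
COLLECTION = 'weather'

ITEM_PIPELINES = {
    'scrapymodel.pipelines.MongoPipeline': 300,
}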
4) Finally, and most importantly, write the spider that parses the data. The code comes first, then the explanation.
# -*- coding: utf-8 -*-
import scrapy
from bs4 import BeautifulSoup
from scrapy import Request
from lxml import etree
from scrapymodel.items import WeatherItem


class WeatherSpider(scrapy.Spider):
    name = 'weather'   # name of the spider; it must be unique within the project
    # allowed_domains = ['tianqihoubao']
    start_urls = ['http://www.tianqihoubao.com/lishi/']   # starting link; by default its response is passed to parse

    # parse http://www.tianqihoubao.com/lishi/ and extract links of the form http://www.tianqihoubao.com/lishi/beijing.html
    def parse(self, response):
        soup = BeautifulSoup(response.text, 'lxml')
        citylists = soup.find_all(name='div', class_='citychk')
        for citys in citylists:
            for city in citys.find_all(name='dd'):
                url = 'http://www.tianqihoubao.com' + city.a['href']
                yield Request(url=url, callback=self.parse_citylist)  # new Request scheduled by the framework; its response is handled by parse_citylist

    # parse http://www.tianqihoubao.com/lishi/beijing.html and extract links of the form http://www.tianqihoubao.com/lishi/tianjin/month/201811.html
    def parse_citylist(self, response):
        soup = BeautifulSoup(response.text, 'lxml')
        monthlist = soup.find_all(name='div', class_='wdetail')
        for months in monthlist:
            for month in months.find_all(name='li'):
                if month.text.endswith("季度:"):   # skip the quarterly summary links
                    continue
                else:
                    url = month.a['href']
                    url = 'http://www.tianqihoubao.com' + url
                    yield Request(url=url, callback=self.parse_weather)  # the response is handled by parse_weather

    # parse the weather data with XPath and hand the results to the pipeline
    def parse_weather(self, response):
        # take the city name from the URL
        url = response.url
        cityname = url.split('/')[4]

        weather_html = etree.HTML(response.text)
        table = weather_html.xpath('//table//tr//td//text()')
        # collect all of the day-related cells into one flat list
        listall = []
        for t in table:
            if t.strip() == '':
                continue
            # strip spaces and \r\n from each cell
            t1 = t.replace(' ', '')
            t2 = t1.replace('\r\n', '')
            listall.append(t2.strip())
        # split the month's data into rows of four cells, one row per day, so it can be written to the database
        n = 4
        sublist = [listall[i:i + n] for i in range(0, len(listall), n)]
        # drop the header row
        sublist.remove(sublist[0])
        # split the max/min temperature field for easier analysis later, and prepend the city name
        for sub in sublist:
            sub2 = sub[2].split('/')
            sub.remove(sub[2])
            sub.insert(2, sub2[0])
            sub.insert(3, sub2[1])
            sub.insert(0, cityname)

            Weather = WeatherItem()   # use the item structure defined in items.py

            Weather['cityname'] = sub[0]
            Weather['data'] = sub[1]
            Weather['tq'] = sub[2]
            Weather['maxtemp'] = sub[3]
            Weather['mintemp'] = sub[4]
            Weather['fengli'] = sub[5]
            yield Weather
Run the project (for example with scrapy crawl weather) and the data will be collected. That completes the weather data scraping project.
Full project code:
https://gitee.com/liangxinbin/Scrpay/tree/master/scrapymodel