本文为
霾大:scrapy_splash 爬取 js 加载网页初体验zhuanlan.zhihu.com的补充
在上面的文章中我们仅仅是初步完成了 scrapy_splash 的简单使用
接下来我们将介绍如何是使得 splash 在 render.html (默认)访问网页时也能动态调整其请求头等(代理等同理)
往常来说,我们设置 scrapy 的随机请求头是在中间件处,沿着这个思路,同理我们亦可以沿着这个思路设置,使得爬虫解析与反爬手段分离。
步骤
- 首先在 settings 文件放入一批 UA
USER_AGENTS = ['Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_6_8; en-us) AppleWebKit/534.50 (KHTML, like Gecko) Version/5.1 Safari/534.50','Mozilla/5.0 (Windows; U; Windows NT 6.1; en-us) AppleWebKit/534.50 (KHTML, like Gecko) Version/5.1 Safari/534.50','Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Trident/5.0','Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.0; Trident/4.0)','Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.0)','Mozilla/5.0 (Windows NT 6.1; rv:2.0.1) Gecko/20100101 Firefox/4.0.1','Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/45.0.2454.101 Safari/537.36','Opera/9.80 (Macintosh; Intel Mac OS X 10.6.8; U; en) Presto/2.8.131 Version/11.11','Opera/9.80 (Windows NT 6.1; U; en) Presto/2.8.131 Version/11.11','Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; The World)','Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/30.0.1599.101 Safari/537.36','Mozilla/5.0 (Windows NT 6.1; WOW64; Trident/7.0; rv:11.0) like Gecko'
]
2. 修改 middlewares 文件
import random
from scrapy_test.settings import USER_AGENTSclass RandomUA(object):def process_request(self, request, spider):ua = random.choice(USER_AGENTS)request.headers.setdefault('User-Agent', ua)
3. 在 settings 文件启用我们刚定义的中间件
DOWNLOADER_MIDDLEWARES = {# 'scrapy_test.middlewares.ScrapyTestDownloaderMiddleware': 543,'scrapy_test.middlewares.RandomUA': 543,
}
运行结果及解析
推荐阅读
- 霾大:scrapy_splash 爬取 js 加载网页初体验
代码传送门
LZC6244/scrapy_splash_testgithub.com原创文章,转载请保留或注明出处!