How to fetch a web page: http://blog.csdn.net/chenguolinblog/article/details/45024643
A GitHub project on simulated login: https://github.com/xchaoinfo/fuck-login
A summary of simulated login for Python crawlers: http://blog.csdn.net/churximi/article/details/50917322
How to simulate a user login with the Python Requests library: https://segmentfault.com/q/1010000002421773
Python crawler practice, simulated login: https://www.2cto.com/kf/201401/275152.html
Simulating a Zhihu login with Python: http://blog.csdn.net/Marksinoberg/article/details/69569353
Tips for writing a simple web crawler in Python: http://armsword.com/2014/03/31/python-in-crawler
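I had never really worked on simulated login, mainly because I was not familiar with the flow. After reading a lot of material online I have a rough picture, so I am trying it out on Douban first.
The key to simulated login is finding the real URL the login form submits to, then POSTing the form data with the cookies attached. Once the login succeeds, the same session can request other pages and fetch their content.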
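A minimal sketch of that first step, finding where a form really submits to. It uses only the standard library so it runs without extra packages. The `FormParser` class, the `extract_form` helper, and the sample HTML in the usage note are hypothetical names made up for illustration; they are not Douban's actual markup.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class FormParser(HTMLParser):
    """Collect the action, method and input fields of the first <form>."""

    def __init__(self):
        super().__init__()
        self.action = None
        self.method = "get"
        self.fields = {}
        self._in_form = False
        self._done = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "form" and not self._done:
            self._in_form = True
            self.action = attrs.get("action") or ""
            self.method = (attrs.get("method") or "get").lower()
        elif tag == "input" and self._in_form and attrs.get("name"):
            # Hidden inputs often carry values the server expects back
            self.fields[attrs["name"]] = attrs.get("value") or ""

    def handle_endtag(self, tag):
        if tag == "form" and self._in_form:
            self._in_form = False
            self._done = True


def extract_form(html, page_url):
    """Return (submit_url, method, fields) for the first form, or None."""
    parser = FormParser()
    parser.feed(html)
    if parser.action is None:
        return None
    # A relative or empty action is resolved against the page URL
    return urljoin(page_url, parser.action), parser.method, parser.fields
```

With Requests, one would then fill the account fields into `fields` and call `session.post(submit_url, data=fields)` on a `requests.Session`, which keeps the cookies for later requests.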
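As long as a request correctly reproduces four elements (method, URL, headers, body), practically any content the browser can see can be fetched, and all four are visible in the browser under Inspect Element > Network!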
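To make the four elements concrete, here is a small sketch that assembles them into a request object without sending anything. It uses the standard library's `urllib.request.Request`; the `build_request` helper, the example.com URL, and the field values are assumptions for illustration only.

```python
from urllib.parse import urlencode
from urllib.request import Request


def build_request(method, url, headers, form):
    """Assemble a request from the four elements; nothing is sent."""
    # The body of a form POST is the URL-encoded key/value pairs
    body = urlencode(form).encode("utf-8") if form else None
    return Request(url, data=body, headers=headers, method=method)


req = build_request(
    method="POST",
    url="https://www.example.com/accounts/login",
    headers={"User-Agent": "Mozilla/5.0", "Referer": "https://www.example.com"},
    form={"form_email": "user@example.com", "form_password": "secret"},
)

# Compare these against what the browser's Network panel shows
print(req.get_method(), req.full_url)
print(req.header_items())
print(req.data)
```

Copying the values from the Network panel into these four slots is the whole trick; with Requests the equivalent is `session.request(method, url, headers=headers, data=form)`.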
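As for captchas, for now I save the captcha image and type the answer in by hand; later I will look into recognizing captchas automatically with Python.
The captcha saved as a local image was not very legible in my tests (I will fix that when I have time). Opening the captcha URL in a browser shows it clearly.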
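One thing worth checking when a saved captcha looks wrong (an assumption on my part, not something confirmed for this case) is whether the file was opened in binary mode. Image data is raw bytes, and writing it through a text-mode file can alter line-ending bytes on some platforms. A minimal sketch, where `save_captcha` and the stand-in bytes are hypothetical:

```python
import os
import tempfile


def save_captcha(image_bytes, path):
    """Write raw image bytes in binary mode and return the byte count."""
    with open(path, "wb") as f:
        return f.write(image_bytes)


# Stand-in data: the PNG signature contains \r\n and \n, exactly the bytes
# that text mode may rewrite.
sample = b"\x89PNG\r\n\x1a\n" + bytes(range(256))
path = os.path.join(tempfile.gettempdir(), "captcha_demo.png")
save_captcha(sample, path)

# Reading the file back in binary mode should give identical bytes
with open(path, "rb") as f:
    assert f.read() == sample
```

In the login flow, `image_bytes` would be the `.content` of the response fetched from the captcha URL.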
What the code does: log in to Douban and post a short status.
# -*- coding: utf-8 -*-
import re

import requests
from bs4 import BeautifulSoup


class DouBan(object):
    def __init__(self):
        self.__username = "豆瓣帐号"  # Douban account
        self.__password = "豆瓣密码"  # Douban password
        self.__main_url = "https://www.douban.com"
        self.__login_url = "https://www.douban.com/accounts/login"
        self.__proxies = {
            "http": "http://172.17.18.80:8080",
            "https": "https://172.17.18.80:8080",
        }
        self.__headers = {
            "Host": "www.douban.com",
            "Origin": self.__main_url,
            "Referer": self.__main_url,
            "Upgrade-Insecure-Requests": "1",
            "User-Agent": "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 "
                          "(KHTML, like Gecko) Chrome/56.0.2924.87 Safari/537.36",
        }
        self.__data = {
            "source": "index_nav",
            "redir": "https://www.douban.com",
            "form_email": self.__username,
            "form_password": self.__password,
            "login": u"登录",
        }
        # The session keeps cookies across requests
        self.__session = requests.session()
        self.__session.headers.update(self.__headers)
        self.__session.proxies = self.__proxies

    def login(self):
        r = self.__session.post(self.__login_url, data=self.__data)
        if r.status_code == 200:
            html = r.text
            soup = BeautifulSoup(html, "lxml")
            # find() returns None when the page has no captcha image
            captcha_img = soup.find("img", id="captcha_image")
            if captcha_img:
                # A captcha is present
                captcha_address = captcha_img["src"]
                print(captcha_address)
                # Extract the captcha ID with a regular expression
                re_captcha_id = r'<input type="hidden" name="captcha-id" value="(.*?)"/'
                # findall() returns a list; the form needs the ID string itself
                captcha_id = re.findall(re_captcha_id, html)[0]
                print(captcha_id)
                # Save the image locally; binary mode keeps the bytes intact
                with open("captcha.jpg", "wb") as f:
                    f.write(requests.get(captcha_address, proxies=self.__proxies).content)
                captcha = input("please input the captcha:")
                self.__data["captcha-solution"] = captcha
                self.__data["captcha-id"] = captcha_id
                r = self.__session.post(self.__login_url, data=self.__data)
                if r.status_code == 200:
                    print("login success")
                    data = {"ck": "NBJ2", "comment": "模拟登录"}
                    r = self.__session.post(self.__main_url, data=data)
                    print(r.status_code)
            else:
                print("no captcha required for login")
                # Without a captcha, the logic is the same as the steps
                # that follow captcha entry above; code omitted here
        else:
            print("login fail", r.status_code)


if __name__ == "__main__":
    t = DouBan()
    t.login()
Log in to the Douban account and you can see that the status "模拟登录" ("simulated login") has been posted.