Python web scraping


I have recently been reading "Web Scraping with Python", using it to get familiar with programming in Python 2.7.
The site the book mainly uses, http://example.webscraping.com/, has changed in places, so some of the book's code can no longer be followed verbatim; I adjusted it slightly.
The main functionality is to download pages from the site, scrape the data fields of interest, and save them to a CSV file.
The third-party library lxml needs to be installed beforehand.
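A quick way to confirm that lxml (and the cssselect package its CSS selectors rely on) is usable is to run a tiny selector against a hand-written table row. The HTML fragment below is an invented stand-in that only mimics the id/class pattern used by the scrape callback further down:

# sanity check for lxml + cssselect; the fragment is made up for illustration
import lxml.html

sample = '<table><tr id="places_area__row"><td class="w2p_fw">244,820</td></tr></table>'
tree = lxml.html.fromstring(sample)
# select the value cell the same way ScrapeCallback does below
print tree.cssselect('tr#places_area__row > td.w2p_fw')[0].text_content()  # expected output: 244,820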
The full code is below.

link_crawler.py

#!/usr/bin/env python
# -*- coding: utf-8 -*-
import re
import urlparse
import urllib2
import time
from datetime import datetime
import robotparser


def link_crawler(seed_url, link_regex=None, delay=5, max_depth=-1, max_urls=-1,
                 headers=None, user_agent='wswp', proxy=None, num_retries=1,
                 scrape_callback=None):
    """Crawl from the given seed URL following links matched by link_regex"""
    # the queue of URL's that still need to be crawled
    crawl_queue = [seed_url]
    # the URL's that have been seen and at what depth
    seen = {seed_url: 0}
    # track how many URL's have been downloaded
    num_urls = 0
    # http://example.webscraping.com no longer serves a robots.txt, so rp has nothing to enforce
    rp = get_robots(seed_url)
    throttle = Throttle(delay)
    headers = headers or {}
    if user_agent:
        headers['User-agent'] = user_agent

    while crawl_queue:
        url = crawl_queue.pop()
        depth = seen[url]
        # check url passes robots.txt restrictions
        # if rp read no robots.txt content, nothing is reported as unfetchable
        if rp.can_fetch(user_agent, url):
            throttle.wait(url)
            html = download(url, headers, proxy=proxy, num_retries=num_retries)
            links = []
            if scrape_callback:
                links.extend(scrape_callback(url, html) or [])

            if depth != max_depth:
                # can still crawl further
                if link_regex:
                    # filter for links matching our regular expression
                    links.extend(link for link in get_links(html) if re.match(link_regex, link))

                for link in links:
                    link = normalize(seed_url, link)
                    # check whether already crawled this link
                    if link not in seen:
                        seen[link] = depth + 1
                        # check link is within same domain
                        if same_domain(seed_url, link):
                            # success! add this new link to queue
                            crawl_queue.append(link)

            # check whether have reached downloaded maximum
            num_urls += 1
            if num_urls == max_urls:
                break
        else:
            print 'Blocked by robots.txt:', url


class Throttle:
    """Throttle downloading by sleeping between requests to same domain"""
    def __init__(self, delay):
        # amount of delay between downloads for each domain
        self.delay = delay
        # timestamp of when a domain was last accessed
        self.domains = {}

    def wait(self, url):
        """Delay if have accessed this domain recently"""
        domain = urlparse.urlsplit(url).netloc
        last_accessed = self.domains.get(domain)
        if self.delay > 0 and last_accessed is not None:
            sleep_secs = self.delay - (datetime.now() - last_accessed).seconds
            if sleep_secs > 0:
                time.sleep(sleep_secs)
        self.domains[domain] = datetime.now()


def download(url, headers, proxy, num_retries, data=None):
    print 'Downloading:', url
    request = urllib2.Request(url, data, headers)
    opener = urllib2.build_opener()
    if proxy:
        proxy_params = {urlparse.urlparse(url).scheme: proxy}
        opener.add_handler(urllib2.ProxyHandler(proxy_params))
    try:
        response = opener.open(request)
        html = response.read()
        code = response.code
    except urllib2.URLError as e:
        print 'Download error:', e.reason
        html = ''
        if hasattr(e, 'code'):
            code = e.code
            if num_retries > 0 and 500 <= code < 600:
                # retry 5XX HTTP errors
                html = download(url, headers, proxy, num_retries - 1, data)
        else:
            code = None
    return html


def normalize(seed_url, link):
    """Normalize this URL by removing hash and adding domain"""
    link, _ = urlparse.urldefrag(link)  # remove hash to avoid duplicates
    return urlparse.urljoin(seed_url, link)


def same_domain(url1, url2):
    """Return True if both URL's belong to same domain"""
    return urlparse.urlparse(url1).netloc == urlparse.urlparse(url2).netloc


def get_robots(url):
    """Initialize robots parser for this domain"""
    rp = robotparser.RobotFileParser()
    rp.set_url(urlparse.urljoin(url, '/robots.txt'))
    rp.read()
    return rp


def get_links(html):
    """Return a list of links from html"""
    # a regular expression to extract all links from the webpage
    webpage_regex = re.compile('<a[^>]+href=["\'](.*?)["\']', re.IGNORECASE)
    # list of all links from the webpage
    return webpage_regex.findall(html)


if __name__ == '__main__':
    link_crawler('http://example.webscraping.com', '[^\?]*/(index|view)', max_depth=5, num_retries=1)
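A quick illustration of how the link_regex filter behaves. The sample paths only mimic the relative links the book's site used (e.g. /index/1 and /view/Afghanistan-1) and may not match the site's current layout:

import re

link_regex = '[^\?]*/(index|view)'
for link in ['/index/1', '/view/Afghanistan-1', '/user/register?_next=/index']:
    print link, bool(re.match(link_regex, link))
# The first two match and would be queued; the register link does not,
# because [^\?]* cannot cross the '?' that starts the query string.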

scrape_callback.py

#!/usr/bin/env python
# -*- coding: utf-8 -*-
import csv
import re
import urlparse
import lxml.html
from link_crawler import link_crawler


class ScrapeCallback:
    def __init__(self):
        self.writer = csv.writer(open('countries.csv', 'w'))
        self.fields = ('area', 'population', 'iso', 'country', 'capital',
                       'continent', 'tld', 'currency_code', 'currency_name',
                       'phone', 'postal_code_format', 'postal_code_regex',
                       'languages', 'neighbours')
        self.writer.writerow(self.fields)

    def __call__(self, url, html):
        if re.search('/view/', url):
            tree = lxml.html.fromstring(html)
            row = []
            for field in self.fields:
                row.append(tree.cssselect('table > tr#places_{}__row > td.w2p_fw'.format(field))[0].text_content())
            self.writer.writerow(row)


if __name__ == '__main__':
    link_crawler('http://example.webscraping.com/', '[^\?]*/(index|view)', max_depth=5, scrape_callback=ScrapeCallback())
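After a crawl finishes, the output can be spot-checked by reading the CSV back. This is a minimal sketch that assumes countries.csv was produced by the run above:

import csv

with open('countries.csv') as f:
    for record in csv.DictReader(f):
        # print two of the scraped fields for each country row
        print record['country'], record['population']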

Reposted from: https://my.oschina.net/elleneye/blog/1615795
