Python web scraping


I have recently been reading Web Scraping with Python and am using it to get familiar with programming in Python 2.7.
The site the book mainly relies on, http://example.webscraping.com/, has changed somewhat, so the code printed in the book can no longer be used exactly as-is; I adjusted it slightly.
The main functionality is to download pages from the site, scrape the data fields of interest, and save them to a CSV file.
The third-party library lxml needs to be installed beforehand.
The full code is below.
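
Before diving into the scripts, here is a minimal sketch (not from the book) of what the lxml extraction step does on its own. It assumes lxml is installed (newer lxml releases also need the separate cssselect package, e.g. pip install lxml cssselect), and the HTML snippet is a made-up stand-in that mimics the id/class pattern of a country page on the example site:

# -*- coding: utf-8 -*-
# Minimal sketch (illustrative only): pull one field out of a page with lxml's CSS selectors.
# sample_html below is hypothetical; real pages on http://example.webscraping.com/ use the
# same tr#places_<field>__row / td.w2p_fw pattern targeted by the scraper further down.
import lxml.html

sample_html = '''
<table>
  <tr id="places_area__row"><td class="w2p_fw">244,820 square kilometres</td></tr>
</table>
'''

tree = lxml.html.fromstring(sample_html)
area = tree.cssselect('tr#places_area__row > td.w2p_fw')[0].text_content()
print area  # Python 2 print statement, matching the scripts below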

link_crawler.py

#!/usr/bin/env python
# -*- coding: utf-8 -*-
import re
import urlparse
import urllib2
import time
from datetime import datetime
import robotparser


def link_crawler(seed_url, link_regex=None, delay=5, max_depth=-1, max_urls=-1, headers=None, user_agent='wswp', proxy=None, num_retries=1, scrape_callback=None):
    """Crawl from the given seed URL following links matched by link_regex"""
    # the queue of URL's that still need to be crawled
    crawl_queue = [seed_url]
    # the URL's that have been seen and at what depth
    seen = {seed_url: 0}
    # track how many URL's have been downloaded
    num_urls = 0
    # http://example.webscraping.com no longer serves a robots.txt, so rp has nothing to enforce
    rp = get_robots(seed_url)
    throttle = Throttle(delay)
    headers = headers or {}
    if user_agent:
        headers['User-agent'] = user_agent

    while crawl_queue:
        url = crawl_queue.pop()
        depth = seen[url]
        # check url passes robots.txt restrictions
        # if rp read no robots.txt content, nothing is disallowed and can_fetch returns True
        if rp.can_fetch(user_agent, url):
            throttle.wait(url)
            html = download(url, headers, proxy=proxy, num_retries=num_retries)
            links = []
            if scrape_callback:
                links.extend(scrape_callback(url, html) or [])

            if depth != max_depth:
                # can still crawl further
                if link_regex:
                    # filter for links matching our regular expression
                    links.extend(link for link in get_links(html) if re.match(link_regex, link))

                for link in links:
                    link = normalize(seed_url, link)
                    # check whether already crawled this link
                    if link not in seen:
                        seen[link] = depth + 1
                        # check link is within same domain
                        if same_domain(seed_url, link):
                            # success! add this new link to queue
                            crawl_queue.append(link)

            # check whether have reached downloaded maximum
            num_urls += 1
            if num_urls == max_urls:
                break
        else:
            print 'Blocked by robots.txt:', url


class Throttle:
    """Throttle downloading by sleeping between requests to same domain"""
    def __init__(self, delay):
        # amount of delay between downloads for each domain
        self.delay = delay
        # timestamp of when a domain was last accessed
        self.domains = {}

    def wait(self, url):
        """Delay if have accessed this domain recently"""
        domain = urlparse.urlsplit(url).netloc
        last_accessed = self.domains.get(domain)
        if self.delay > 0 and last_accessed is not None:
            sleep_secs = self.delay - (datetime.now() - last_accessed).seconds
            if sleep_secs > 0:
                time.sleep(sleep_secs)
        self.domains[domain] = datetime.now()


def download(url, headers, proxy, num_retries, data=None):
    print 'Downloading:', url
    request = urllib2.Request(url, data, headers)
    opener = urllib2.build_opener()
    if proxy:
        proxy_params = {urlparse.urlparse(url).scheme: proxy}
        opener.add_handler(urllib2.ProxyHandler(proxy_params))
    try:
        response = opener.open(request)
        html = response.read()
        code = response.code
    except urllib2.URLError as e:
        print 'Download error:', e.reason
        html = ''
        if hasattr(e, 'code'):
            code = e.code
            if num_retries > 0 and 500 <= code < 600:
                # retry 5XX HTTP errors
                html = download(url, headers, proxy, num_retries-1, data)
        else:
            code = None
    return html


def normalize(seed_url, link):
    """Normalize this URL by removing hash and adding domain"""
    link, _ = urlparse.urldefrag(link)  # remove hash to avoid duplicates
    return urlparse.urljoin(seed_url, link)


def same_domain(url1, url2):
    """Return True if both URL's belong to same domain"""
    return urlparse.urlparse(url1).netloc == urlparse.urlparse(url2).netloc


def get_robots(url):
    """Initialize robots parser for this domain"""
    rp = robotparser.RobotFileParser()
    rp.set_url(urlparse.urljoin(url, '/robots.txt'))
    rp.read()
    return rp


def get_links(html):
    """Return a list of links from html"""
    # a regular expression to extract all links from the webpage
    webpage_regex = re.compile('<a[^>]+href=["\'](.*?)["\']', re.IGNORECASE)
    # list of all links from the webpage
    return webpage_regex.findall(html)


if __name__ == '__main__':
    link_crawler('http://example.webscraping.com', '[^\?]*/(index|view)', max_depth=5, num_retries=1)
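
As a quick way to exercise the module in isolation (a sketch, not part of the original post; the parameter values are only illustrative), its functions can be imported and called directly, assuming the code above is saved as link_crawler.py:

# Sketch: drive the crawler module above by hand.
from link_crawler import download, link_crawler

# Fetch a single page with an explicit User-Agent header; the third argument is the
# proxy (None here), the fourth is the number of retries allowed for 5XX errors.
html = download('http://example.webscraping.com/', {'User-agent': 'wswp'}, None, 1)
print len(html)

# Crawl index and view pages, but stop after two levels of depth and at most 10 downloaded URLs.
link_crawler('http://example.webscraping.com', '[^\?]*/(index|view)', max_depth=2, max_urls=10)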

scrape_callback.py

#!/usr/bin/env python
# -*- coding: utf-8 -*-
import csv
import re
import urlparse
import lxml.html
from link_crawler import link_crawler


class ScrapeCallback:
    def __init__(self):
        self.writer = csv.writer(open('countries.csv', 'w'))
        self.fields = ('area', 'population', 'iso', 'country', 'capital', 'continent', 'tld', 'currency_code', 'currency_name', 'phone', 'postal_code_format', 'postal_code_regex', 'languages', 'neighbours')
        self.writer.writerow(self.fields)

    def __call__(self, url, html):
        if re.search('/view/', url):
            tree = lxml.html.fromstring(html)
            row = []
            for field in self.fields:
                row.append(tree.cssselect('table > tr#places_{}__row > td.w2p_fw'.format(field))[0].text_content())
            self.writer.writerow(row)


if __name__ == '__main__':
    link_crawler('http://example.webscraping.com/', '[^\?]*/(index|view)', max_depth=5, scrape_callback=ScrapeCallback())
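
Once the crawl has finished, the rows collected by ScrapeCallback can be checked with the standard csv module (a minimal sketch, assuming countries.csv was written by the run above):

# Sketch: read back the CSV produced by ScrapeCallback and print the header plus the first few rows.
import csv

with open('countries.csv') as f:
    reader = csv.reader(f)
    print reader.next()          # header row (Python 2: reader.next())
    for i, row in enumerate(reader):
        print row
        if i >= 2:               # only show the first three data rows
            break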

Reposted from: https://my.oschina.net/elleneye/blog/1615795
