Web Crawlers -- 20. [Scrapy-Redis in Practice] A Distributed Crawler for Fang.com -- Code Implementation

Table of Contents

  • I. Case Overview
  • II. Creating the Project
  • III. settings.py Configuration
  • IV. Full Code
  • V. Deployment
    • 1. Generating requirements.txt on Windows
    • 2. Connecting to the Ubuntu Server with Xshell and Installing Dependencies
    • 3. Modifying Part of the Code
    • 4. Uploading the Code to the Server and Running It

I. Case Overview

We will crawl housing data from Fang.com (https://www1.fang.com/).

The source code has been pushed to GitHub.


II. Creating the Project

Open a Windows terminal and change into the directory where the project will be stored:

scrapy startproject fang

cd fang\

scrapy genspider sfw "fang.com"

The resulting project structure is described below.
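For reference (the original screenshot is omitted here), the layout generated by scrapy startproject plus scrapy genspider should look roughly like this; start.py is a small launcher script added by hand later (see section IV):

fang/
├── scrapy.cfg
└── fang/
    ├── __init__.py
    ├── items.py
    ├── middlewares.py
    ├── pipelines.py
    ├── settings.py
    ├── start.py        # added manually later
    └── spiders/
        ├── __init__.py
        └── sfw.py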

III. settings.py Configuration

# Obey robots.txt rules
ROBOTSTXT_OBEY = False

DOWNLOAD_DELAY = 3

DEFAULT_REQUEST_HEADERS = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'en',
}

DOWNLOADER_MIDDLEWARES = {
    'fang.middlewares.UserAgentDownloadMiddleware': 543,
}

ITEM_PIPELINES = {
    'fang.pipelines.FangPipeline': 300,
}

IV. Full Code

settings.py:

# -*- coding: utf-8 -*-

# Scrapy settings for fang project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
#     https://docs.scrapy.org/en/latest/topics/settings.html
#     https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
#     https://docs.scrapy.org/en/latest/topics/spider-middleware.html

BOT_NAME = 'fang'

SPIDER_MODULES = ['fang.spiders']
NEWSPIDER_MODULE = 'fang.spiders'

# Crawl responsibly by identifying yourself (and your website) on the user-agent
#USER_AGENT = 'fang (+http://www.yourdomain.com)'

# Obey robots.txt rules
ROBOTSTXT_OBEY = False

# Configure maximum concurrent requests performed by Scrapy (default: 16)
#CONCURRENT_REQUESTS = 32

# Configure a delay for requests for the same website (default: 0)
# See https://docs.scrapy.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
DOWNLOAD_DELAY = 3
# The download delay setting will honor only one of:
#CONCURRENT_REQUESTS_PER_DOMAIN = 16
#CONCURRENT_REQUESTS_PER_IP = 16

# Disable cookies (enabled by default)
#COOKIES_ENABLED = False

# Disable Telnet Console (enabled by default)
#TELNETCONSOLE_ENABLED = False

# Override the default request headers:
DEFAULT_REQUEST_HEADERS = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'en',
}

# Enable or disable spider middlewares
# See https://docs.scrapy.org/en/latest/topics/spider-middleware.html
#SPIDER_MIDDLEWARES = {
#    'fang.middlewares.FangSpiderMiddleware': 543,
#}

# Enable or disable downloader middlewares
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
DOWNLOADER_MIDDLEWARES = {
    'fang.middlewares.UserAgentDownloadMiddleware': 543,
}

# Enable or disable extensions
# See https://docs.scrapy.org/en/latest/topics/extensions.html
#EXTENSIONS = {
#    'scrapy.extensions.telnet.TelnetConsole': None,
#}

# Configure item pipelines
# See https://docs.scrapy.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
    'fang.pipelines.FangPipeline': 300,
}

# Enable and configure the AutoThrottle extension (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/autothrottle.html
#AUTOTHROTTLE_ENABLED = True
# The initial download delay
#AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
#AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
#AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
#AUTOTHROTTLE_DEBUG = False

# Enable and configure HTTP caching (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
#HTTPCACHE_ENABLED = True
#HTTPCACHE_EXPIRATION_SECS = 0
#HTTPCACHE_DIR = 'httpcache'
#HTTPCACHE_IGNORE_HTTP_CODES = []
#HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'

items.py:

# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/items.html

import scrapy


class NewHouseItem(scrapy.Item):
    # province
    province = scrapy.Field()
    # city
    city = scrapy.Field()
    # name of the residential development
    name = scrapy.Field()
    # price
    price = scrapy.Field()
    # number of bedrooms, as a list
    rooms = scrapy.Field()
    # floor area
    area = scrapy.Field()
    # address
    address = scrapy.Field()
    # administrative district
    district = scrapy.Field()
    # whether the property is on sale
    sale = scrapy.Field()
    # URL of the detail page on Fang.com
    origin_url = scrapy.Field()


class ESFHouseItem(scrapy.Item):
    # province
    province = scrapy.Field()
    # city
    city = scrapy.Field()
    # name of the residential development
    name = scrapy.Field()
    # layout, e.g. "3室2厅" (rooms and halls)
    rooms = scrapy.Field()
    # floor
    floor = scrapy.Field()
    # orientation
    toward = scrapy.Field()
    # year built
    year = scrapy.Field()
    # address
    address = scrapy.Field()
    # building area
    area = scrapy.Field()
    # total price
    price = scrapy.Field()
    # unit price
    unit = scrapy.Field()
    # original URL
    origin_url = scrapy.Field()

pipelines.py:

# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html
from scrapy.exporters import JsonLinesItemExporter

from fang.items import NewHouseItem, ESFHouseItem


class FangPipeline(object):
    def __init__(self):
        self.newhouse_fp = open('newhouse.json', 'wb')
        self.esfhouse_fp = open('esfhouse.json', 'wb')
        self.newhouse_exporter = JsonLinesItemExporter(self.newhouse_fp, ensure_ascii=False)
        self.esfhouse_exporter = JsonLinesItemExporter(self.esfhouse_fp, ensure_ascii=False)

    def process_item(self, item, spider):
        # Write each item to the file that matches its type
        if isinstance(item, NewHouseItem):
            self.newhouse_exporter.export_item(item)
        elif isinstance(item, ESFHouseItem):
            self.esfhouse_exporter.export_item(item)
        return item

    def close_spider(self, spider):
        self.newhouse_fp.close()
        self.esfhouse_fp.close()

sfw.py:

# -*- coding: utf-8 -*-
import re

import scrapy

from fang.items import NewHouseItem, ESFHouseItem


class SfwSpider(scrapy.Spider):
    name = 'sfw'
    allowed_domains = ['fang.com']
    start_urls = ['https://www.fang.com/SoufunFamily.htm']

    def parse(self, response):
        trs = response.xpath("//div[@class='outCont']//tr")
        province = None
        for tr in trs:
            tds = tr.xpath(".//td[not(@class)]")
            province_td = tds[0]
            province_text = province_td.xpath(".//text()").get()
            province_text = re.sub(r"\s", "", province_text)
            if province_text:
                province = province_text
            if province == "其它":
                continue
            city_id = tds[1]
            city_links = city_id.xpath(".//a")
            for city_link in city_links:
                city = city_link.xpath(".//text()").get()
                city_url = city_link.xpath(".//@href").get()
                # print("Province:", province)
                # print("City:", city)
                # print("City link:", city_url)
                # Build the new-house URL for this city
                url_module = city_url.split("//")
                scheme = url_module[0]
                domain_all = url_module[1].split("fang")
                domain_0 = domain_all[0]
                domain_1 = domain_all[1]
                if "bj." in domain_0:
                    newhouse_url = "https://newhouse.fang.com/house/s/"
                    esf_url = "https://esf.fang.com/"
                else:
                    newhouse_url = scheme + "//" + domain_0 + "newhouse.fang" + domain_1 + "house/s/"
                    # Build the second-hand (esf) URL for this city
                    esf_url = scheme + "//" + domain_0 + "esf.fang" + domain_1
                # print("City: %s%s" % (province, city))
                # print("New-house link: %s" % newhouse_url)
                # print("Second-hand link: %s" % esf_url)
                # yield scrapy.Request(url=newhouse_url, callback=self.parse_newhouse,
                #                      meta={"info": (province, city)})
                yield scrapy.Request(url=esf_url, callback=self.parse_esf,
                                     meta={"info": (province, city)}, dont_filter=True)
            #     break
            # break

    def parse_newhouse(self, response):
        province, city = response.meta.get('info')
        lis = response.xpath("//div[contains(@class,'nl_con')]/ul/li")
        for li in lis:
            # name of the development
            name = li.xpath(".//div[@class='nlcd_name']/a/text()").get()
            if name is not None:
                name = name.strip()
            # house type: number of bedrooms
            house_type_list = li.xpath(".//div[contains(@class,'house_type')]/a/text()").getall()
            house_type_list = list(map(lambda x: re.sub(r"\s", "", x), house_type_list))
            rooms = list(filter(lambda x: x.endswith("居"), house_type_list))
            # floor area
            area = "".join(li.xpath(".//div[contains(@class,'house_type')]/text()").getall())
            area = re.sub(r"\s|/|-", "", area)
            # address
            address = li.xpath(".//div[@class='address']/a/@title").get()
            # administrative district, e.g. 海淀 / 朝阳
            district_text = "".join(li.xpath(".//div[@class='address']/a//text()").getall())
            district = None
            district_match = re.search(r".*\[(.+)\].*", district_text)
            if district_match:
                district = district_match.group(1)
            # whether the property is on sale
            sale = li.xpath(".//div[contains(@class,'fangyuan')]/span/text()").get()
            # price
            price = "".join(li.xpath(".//div[@class='nhouse_price']//text()").getall())
            price = re.sub(r"\s|广告", "", price)
            # detail page URL
            origin_url = li.xpath(".//div[@class='nlcd_name']/a/@href").get()
            item = NewHouseItem(
                name=name,
                rooms=rooms,
                area=area,
                address=address,
                district=district,
                sale=sale,
                price=price,
                origin_url=origin_url,
                province=province,
                city=city,
            )
            yield item
        next_url = response.xpath(".//div[@class='page']//a[@class='next']/@href").get()
        if next_url:
            yield scrapy.Request(url=response.urljoin(next_url), callback=self.parse_newhouse,
                                 meta={"info": (province, city)})

    def parse_esf(self, response):
        # province and city come from the request meta
        province, city = response.meta.get('info')
        dls = response.xpath("//div[@class='shop_list shop_list_4']/dl")
        for dl in dls:
            item = ESFHouseItem(province=province, city=city)
            # name of the development
            name = dl.xpath(".//p[@class='add_shop']/a/text()").get()
            if name is not None:
                item['name'] = name.strip()
            # combined info: rooms / floor / orientation / year / area
            infos = dl.xpath(".//p[@class='tel_shop']/text()").getall()
            infos = list(map(lambda x: re.sub(r"\s", "", x), infos))
            for info in infos:
                if "厅" in info:
                    item['rooms'] = info
                elif '层' in info:
                    item['floor'] = info
                elif '向' in info:
                    item['toward'] = info
                elif '年' in info:
                    item['year'] = info
                elif '㎡' in info:
                    item['area'] = info
            # address
            address = dl.xpath(".//p[@class='add_shop']/span/text()").get()
            if address is not None:
                item['address'] = address
            # total price
            price = "".join(dl.xpath("./dd[@class='price_right']/span[1]/b/text()").getall())
            if price:
                item['price'] = price
            # unit price
            unit = dl.xpath("./dd[@class='price_right']/span[2]/text()").get()
            if unit is not None:
                item['unit'] = unit
            # original detail URL
            detail_url = dl.xpath(".//h4[@class='clearfix']/a/@href").get()
            if detail_url is not None:
                item['origin_url'] = response.urljoin(detail_url)
            yield item
        next_url = response.xpath(".//div[@class='page_al']/p/a/@href").get()
        if next_url:
            yield scrapy.Request(url=response.urljoin(next_url), callback=self.parse_esf,
                                 meta={"info": (province, city)})

middlewares.py:

# -*- coding: utf-8 -*-

# Define here the models for your spider middleware
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/spider-middleware.html
import random


class UserAgentDownloadMiddleware(object):
    # Downloader middleware that sets a random User-Agent header on every request
    USER_AGENTS = [
        'Mozilla/5.0 (Windows; U; Windows NT 6.1; rv:2.2) Gecko/20110201',
        'Mozilla/5.0 (Windows; U; Windows NT 5.1; pl; rv:1.9.2.3) Gecko/20100401 Lightningquail/3.6.3',
        'Mozilla/5.0 (X11; ; Linux i686; rv:1.9.2.20) Gecko/20110805',
        'Mozilla/5.0 (Windows; U; Windows NT 6.0; en-US; rv:1.9.1b3) Gecko/20090305',
        'Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.0.14) Gecko/2009091010',
        'Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.0.10) Gecko/2009042523',
    ]

    def process_request(self, request, spider):
        user_agent = random.choice(self.USER_AGENTS)
        request.headers['User-Agent'] = user_agent

start.py:

from scrapy import cmdline

cmdline.execute("scrapy crawl sfw".split())

At this point, running start.py in the Windows development environment crawls the data normally.

V. Deployment

1. Generating requirements.txt on Windows

Open cmder and first activate the virtual environment:

cd C:\Users\fxd\.virtualenvs\sipder_env
.\Scripts\activate


Then change into the project directory and run the following command to generate requirements.txt:

pip freeze > requirements.txt

2. Connecting to the Ubuntu Server with Xshell and Installing Dependencies

If openssh is not installed on the server, install it first:

sudo apt-get install openssh-server

Connect to the Ubuntu server, change into the directory of the virtual environment, and run:

source ./bin/activate

Once inside the virtual environment, run:

rz

to upload requirements.txt, and then run:

pip install -r requirements.txt

to install the project's dependencies.

Then install scrapy-redis:

pip install scrapy-redis

3. Modifying Part of the Code

To turn a Scrapy project into a Scrapy-Redis project, only three changes are needed:
(1) Change the spider's base class from scrapy.Spider to scrapy_redis.spiders.RedisSpider (or from scrapy.CrawlSpider to scrapy_redis.spiders.RedisCrawlSpider).
(2) Delete start_urls from the spider and add a redis_key = "***" attribute. This key is what later controls starting the crawler from Redis: the crawler's first URL is pushed into Redis under this key (see the sketch after the settings below).
(3) Add the following to the settings file:

# Scrapy-Redis settings
# Make sure requests are stored in Redis
SCHEDULER = "scrapy_redis.scheduler.Scheduler"

# Make sure all spiders share the same dedup fingerprints
DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"

# Use the Redis pipeline as an item pipeline
ITEM_PIPELINES = {
    'scrapy_redis.pipelines.RedisPipeline': 300,
}

# Keep the queues scrapy-redis uses in Redis instead of clearing them,
# so the crawl can be paused and resumed
SCHEDULER_PERSIST = True

# Redis connection settings
REDIS_HOST = '172.20.10.2'
REDIS_PORT = 6379
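With changes (1) and (2) applied, the top of sfw.py might look like the minimal sketch below; the redis_key value matches the lpush command used in step 4, and the parsing callbacks stay exactly as shown in section IV:

# -*- coding: utf-8 -*-
# Minimal sketch of the Scrapy-Redis version of the spider: only the base
# class and the start-URL mechanism change.
from scrapy_redis.spiders import RedisSpider


class SfwSpider(RedisSpider):
    name = 'sfw'
    allowed_domains = ['fang.com']
    # start_urls is removed; the first URL is pushed into Redis under this key
    # (step 4: lpush fang:start_urls https://www.fang.com/SoufunFamily.htm)
    redis_key = "fang:start_urls"

    # parse, parse_newhouse and parse_esf are unchanged from section IV.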

4. Uploading the Code to the Server and Running It

Compress the project files, upload the archive in Xshell with the rz command, and extract it on the server.
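Assuming the project was packed as a zip archive (the name fang.zip is just an example, and unzip may need to be installed via apt first), the commands on the server might look like:

rz
unzip fang.zip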

Run the crawler:
(1) On the crawler server, change into the directory containing sfw.py, then run scrapy runspider [spider file]:

scrapy runspider sfw.py

(2) On the Redis server (Windows), start the Redis service:

redis-server redis.windows.conf

If this fails with an error, run the following commands in order:

redis-cli.exe
shutdown
exit
redis-server.exe redis.windows.conf

(3) Then open another Windows terminal:

redis-cli

and push the starting URL:

lpush fang:start_urls https://www.fang.com/SoufunFamily.htm

The crawl then starts.

Open RedisDesktopManager to inspect the saved data.


Repeat the same steps on the other crawler server.
The project is complete!
