Python Crawler Study Notes (13): CrawlSpider

Contents

1. Introduction to CrawlSpider

2. Usage

(1) The link extractor

(2) Example usage

(3) Extracting the links

(4) Notes

3. How it works

4. MySQL

5. Steps for using pymysql

6. Writing data to the database

(1) Settings parameters

(2) Pipeline configuration

7. CrawlSpider case study: writing dushu.com data to the database

(1) Case analysis

(2) Project structure

(3) items.py

(4) middlewares.py

(5) pipelines.py

(6) settings.py

(7) read.py


1. Introduction to CrawlSpider

  • Inherits from scrapy.Spider
  • A CrawlSpider can define rules: while parsing the HTML content it extracts the links that match those rules and then sends requests to them
  • So whenever you need to follow links, that is, after crawling a page you need to extract links from it and crawl those as well, CrawlSpider is a very good fit

2. Usage

(1) The link extractor

                The link extractor is where you write the rules that pick out the links you want:

        scrapy.linkextractors.LinkExtractor(
                        allow = (),                # regular expression: extract links matching the pattern
                        deny = (),                 # (not used here) regular expression: do not extract links matching the pattern
                        allow_domains = (),        # (not used here) allowed domains
                        deny_domains = (),         # (not used here) excluded domains
                        restrict_xpaths = (),      # XPath: extract links found under the matching nodes
                        restrict_css = ()          # CSS selector: extract links found under the matching nodes
                        )

(2) Example usage

                Regex:  links1 = LinkExtractor(allow=r'list_23_\d+\.html')

                XPath:  links2 = LinkExtractor(restrict_xpaths=r'//div[@class="x"]')

                CSS:    links3 = LinkExtractor(restrict_css='.x')

(3) Extracting the links

                link.extract_links(response)
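A quick way to try an extractor is the Scrapy shell. The sketch below assumes the dushu.com listing page used in the case study later in this post; any page and pattern will do:

# started with: scrapy shell "https://www.dushu.com/book/1188_1.html"
from scrapy.linkextractors import LinkExtractor

link = LinkExtractor(allow=r'/book/1188_\d+\.html')   # same pattern as the case study
for l in link.extract_links(response):                # response is provided by the shell
    print(l.url, l.text)                              # each result is a Link object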

(4) Notes

        [Note 1] In a Rule, callback must be the function name as a string: callback='parse_item'

        [Note 2] In a basic spider, when you send a new request yourself, the callback is written as callback=self.parse_item

        [Note 3] follow=True controls whether to keep following links, i.e. the extracted pages are scanned with the same link rules again (see the sketch below)
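Putting these notes together, a minimal CrawlSpider rule setup looks roughly like this (the domain, start URL and regex are placeholders borrowed from the examples above):

from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class ExampleSpider(CrawlSpider):
    name = 'example'
    allowed_domains = ['www.example.com']                      # placeholder domain
    start_urls = ['https://www.example.com/list_23_1.html']    # placeholder start page

    rules = (
        # callback is a string; follow=True means matched pages are scanned for more links
        Rule(LinkExtractor(allow=r'list_23_\d+\.html'), callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        pass   # parse each listing page here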

3. How it works
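In outline: Scrapy downloads the start URLs first; every response is checked against the spider's rules, each rule's LinkExtractor pulls the matching links out of the page, requests for those links are scheduled, their responses are handed to the rule's callback, and with follow=True the newly fetched pages are scanned with the same rules again, so the crawl keeps expanding until no new matching links are found.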

4. MySQL

(1) Download (https://dev.mysql.com/downloads/windows/installer/5.7.html)

(2) Install (https://jingyan.baidu.com/album/d7130635f1c77d13fdf475df.html)

5. Steps for using pymysql

1. pip install pymysql

2. pymysql.connect(host, port, user, password, db, charset)

3. conn.cursor()

4. cursor.execute()
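Written out, the four steps look roughly like this (the connection parameters are placeholders; point them at your own MySQL instance):

import pymysql

# step 2: open the connection (placeholder credentials)
conn = pymysql.connect(host='localhost', port=3306, user='root',
                       password='1234', db='test', charset='utf8')
# step 3: get a cursor from the connection
cursor = conn.cursor()
# step 4: execute a statement; INSERT/UPDATE statements also need conn.commit()
cursor.execute('select version()')
print(cursor.fetchone())

cursor.close()
conn.close()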

6. Writing data to the database

(1) Settings parameters

                DB_HOST = '192.168.231.128'

                DB_PORT = 3306

                DB_USER = 'root'

                DB_PASSWORD = '1234'

                DB_NAME = 'test'

                DB_CHARSET = 'utf8'

(2) Pipeline configuration

from scrapy.utils.project import get_project_settings
import pymysql


class MysqlPipeline(object):

    # __init__ plays the same role open_spider would:
    # it reads the connection parameters from settings
    def __init__(self):
        settings = get_project_settings()
        self.host = settings['DB_HOST']
        self.port = settings['DB_PORT']
        self.user = settings['DB_USER']
        self.pwd = settings['DB_PASSWORD']
        self.name = settings['DB_NAME']
        self.charset = settings['DB_CHARSET']
        self.connect()

    # connect to the database and get a cursor object
    def connect(self):
        self.conn = pymysql.connect(host=self.host, port=self.port, user=self.user,
                                    password=self.pwd, db=self.name, charset=self.charset)
        self.cursor = self.conn.cursor()

    def process_item(self, item, spider):
        # two equivalent ways to build the statement; the uncommented one is the one used
        # sql = 'insert into book(image_url, book_name, author, info) values("%s", "%s", "%s", "%s")' % (
        #     item['image_url'], item['book_name'], item['author'], item['info'])
        sql = 'insert into book(image_url, book_name, author, info) values ("{}", "{}", "{}", "{}")'.format(
            item['image_url'], item['book_name'], item['author'], item['info'])
        # execute the SQL statement and commit
        self.cursor.execute(sql)
        self.conn.commit()
        return item

    def close_spider(self, spider):
        self.cursor.close()
        self.conn.close()
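For this pipeline to actually run it also has to be enabled in settings.py, and the book table referenced by the insert statement has to exist in the test database beforehand. A minimal sketch of the settings entry, assuming a project package called myproject (use your own project name):

ITEM_PIPELINES = {
    # the dotted path is a placeholder; it must point at the MysqlPipeline class above
    'myproject.pipelines.MysqlPipeline': 300,
}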

7. CrawlSpider case study: writing dushu.com data to the database

(1) Case analysis

1. Create the project:        scrapy startproject <project name>

2. Change into the spiders directory:        cd <project name>\<project name>\spiders

3. Create the crawl-template spider:        scrapy genspider -t crawl read www.dushu.com

4. items

5. spiders

6. settings

7. pipelines

        save the data to a local file

        save the data to a MySQL database (the target table must exist first; see the sketch after this list)
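The MysqlPipeline in this project inserts into a table called book with name and src columns, so that table needs to exist in the spider01 database before the spider runs. A one-off sketch with pymysql, using the connection parameters from settings.py below (the column sizes are assumptions):

import pymysql

conn = pymysql.connect(host='192.168.231.130', port=3306, user='root',
                       password='1234', db='spider01', charset='utf8')
cursor = conn.cursor()
cursor.execute('''
    create table if not exists book(
        id int primary key auto_increment,
        name varchar(255),
        src varchar(255)
    )
''')
conn.commit()
cursor.close()
conn.close()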

(2) Project structure

(3) items.py

# Define here the models for your scraped items
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/items.html

import scrapy


class ScrapyReadbook101Item(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    name = scrapy.Field()
    src = scrapy.Field()

(4) middlewares.py

# Define here the models for your spider middleware
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/spider-middleware.html

from scrapy import signals

# useful for handling different item types with a single interface
from itemadapter import is_item, ItemAdapter


class ScrapyReadbook101SpiderMiddleware:
    # Not all methods need to be defined. If a method is not defined,
    # scrapy acts as if the spider middleware does not modify the
    # passed objects.

    @classmethod
    def from_crawler(cls, crawler):
        # This method is used by Scrapy to create your spiders.
        s = cls()
        crawler.signals.connect(s.spider_opened, signal=signals.spider_opened)
        return s

    def process_spider_input(self, response, spider):
        # Called for each response that goes through the spider
        # middleware and into the spider.
        # Should return None or raise an exception.
        return None

    def process_spider_output(self, response, result, spider):
        # Called with the results returned from the Spider, after
        # it has processed the response.
        # Must return an iterable of Request, or item objects.
        for i in result:
            yield i

    def process_spider_exception(self, response, exception, spider):
        # Called when a spider or process_spider_input() method
        # (from other spider middleware) raises an exception.
        # Should return either None or an iterable of Request or item objects.
        pass

    def process_start_requests(self, start_requests, spider):
        # Called with the start requests of the spider, and works
        # similarly to the process_spider_output() method, except
        # that it doesn't have a response associated.
        # Must return only requests (not items).
        for r in start_requests:
            yield r

    def spider_opened(self, spider):
        spider.logger.info('Spider opened: %s' % spider.name)


class ScrapyReadbook101DownloaderMiddleware:
    # Not all methods need to be defined. If a method is not defined,
    # scrapy acts as if the downloader middleware does not modify the
    # passed objects.

    @classmethod
    def from_crawler(cls, crawler):
        # This method is used by Scrapy to create your spiders.
        s = cls()
        crawler.signals.connect(s.spider_opened, signal=signals.spider_opened)
        return s

    def process_request(self, request, spider):
        # Called for each request that goes through the downloader
        # middleware.
        # Must either:
        # - return None: continue processing this request
        # - or return a Response object
        # - or return a Request object
        # - or raise IgnoreRequest: process_exception() methods of
        #   installed downloader middleware will be called
        return None

    def process_response(self, request, response, spider):
        # Called with the response returned from the downloader.
        # Must either;
        # - return a Response object
        # - return a Request object
        # - or raise IgnoreRequest
        return response

    def process_exception(self, request, exception, spider):
        # Called when a download handler or a process_request()
        # (from other downloader middleware) raises an exception.
        # Must either:
        # - return None: continue processing this exception
        # - return a Response object: stops process_exception() chain
        # - return a Request object: stops process_exception() chain
        pass

    def spider_opened(self, spider):
        spider.logger.info('Spider opened: %s' % spider.name)

(5) pipelines.py

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html

# useful for handling different item types with a single interface
from itemadapter import ItemAdapter


class ScrapyReadbook101Pipeline:
    def open_spider(self, spider):
        self.fp = open('book.json', 'w', encoding='utf-8')

    def process_item(self, item, spider):
        self.fp.write(str(item))
        return item

    def close_spider(self, spider):
        self.fp.close()


# load the settings file
from scrapy.utils.project import get_project_settings
import pymysql


class MysqlPipeline:
    def open_spider(self, spider):
        settings = get_project_settings()
        self.host = settings['DB_HOST']
        self.port = settings['DB_PORT']
        self.user = settings['DB_USER']
        self.password = settings['DB_PASSWORD']
        self.name = settings['DB_NAME']
        self.charset = settings['DB_CHARSET']
        self.connect()

    def connect(self):
        self.conn = pymysql.connect(host=self.host, port=self.port, user=self.user,
                                    password=self.password, db=self.name, charset=self.charset)
        self.cursor = self.conn.cursor()

    def process_item(self, item, spider):
        sql = 'insert into book(name, src) values("{}", "{}")'.format(item['name'], item['src'])
        # execute the SQL statement
        self.cursor.execute(sql)
        # commit
        self.conn.commit()
        return item

    def close_spider(self, spider):
        self.cursor.close()
        self.conn.close()

(6) settings.py

# Scrapy settings for scrapy_readbook_101 project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
#     https://docs.scrapy.org/en/latest/topics/settings.html
#     https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
#     https://docs.scrapy.org/en/latest/topics/spider-middleware.html

BOT_NAME = 'scrapy_readbook_101'

SPIDER_MODULES = ['scrapy_readbook_101.spiders']
NEWSPIDER_MODULE = 'scrapy_readbook_101.spiders'

# Crawl responsibly by identifying yourself (and your website) on the user-agent
#USER_AGENT = 'scrapy_readbook_101 (+http://www.yourdomain.com)'

# Obey robots.txt rules
ROBOTSTXT_OBEY = True

# Configure maximum concurrent requests performed by Scrapy (default: 16)
#CONCURRENT_REQUESTS = 32

# Configure a delay for requests for the same website (default: 0)
# See https://docs.scrapy.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
#DOWNLOAD_DELAY = 3
# The download delay setting will honor only one of:
#CONCURRENT_REQUESTS_PER_DOMAIN = 16
#CONCURRENT_REQUESTS_PER_IP = 16

# Disable cookies (enabled by default)
#COOKIES_ENABLED = False

# Disable Telnet Console (enabled by default)
#TELNETCONSOLE_ENABLED = False

# Override the default request headers:
#DEFAULT_REQUEST_HEADERS = {
#   'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
#   'Accept-Language': 'en',
#}

# Enable or disable spider middlewares
# See https://docs.scrapy.org/en/latest/topics/spider-middleware.html
#SPIDER_MIDDLEWARES = {
#    'scrapy_readbook_101.middlewares.ScrapyReadbook101SpiderMiddleware': 543,
#}

# Enable or disable downloader middlewares
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
#DOWNLOADER_MIDDLEWARES = {
#    'scrapy_readbook_101.middlewares.ScrapyReadbook101DownloaderMiddleware': 543,
#}

# Enable or disable extensions
# See https://docs.scrapy.org/en/latest/topics/extensions.html
#EXTENSIONS = {
#    'scrapy.extensions.telnet.TelnetConsole': None,
#}

# Database connection parameters; pay attention to the port and the charset
DB_HOST = '192.168.231.130'
# the port number must be an integer
DB_PORT = 3306
DB_USER = 'root'
DB_PASSWORD = '1234'
DB_NAME = 'spider01'
# write the charset as utf8, without the dash
DB_CHARSET = 'utf8'

# Configure item pipelines
# See https://docs.scrapy.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
    'scrapy_readbook_101.pipelines.ScrapyReadbook101Pipeline': 300,
    # MysqlPipeline
    'scrapy_readbook_101.pipelines.MysqlPipeline': 301,
}

# Enable and configure the AutoThrottle extension (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/autothrottle.html
#AUTOTHROTTLE_ENABLED = True
# The initial download delay
#AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
#AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
#AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
#AUTOTHROTTLE_DEBUG = False

# Enable and configure HTTP caching (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
#HTTPCACHE_ENABLED = True
#HTTPCACHE_EXPIRATION_SECS = 0
#HTTPCACHE_DIR = 'httpcache'
#HTTPCACHE_IGNORE_HTTP_CODES = []
#HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'

(7) read.py

import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule

from scrapy_readbook_101.items import ScrapyReadbook101Item


class ReadSpider(CrawlSpider):
    name = 'read'
    allowed_domains = ['www.dushu.com']
    start_urls = ['https://www.dushu.com/book/1188_1.html']

    rules = (
        Rule(LinkExtractor(allow=r'/book/1188_\d+\.html'),
             callback='parse_item',
             follow=True),
    )

    def parse_item(self, response):
        img_list = response.xpath('//div[@class="bookslist"]//img')
        for img in img_list:
            # the lazy-loaded cover URL lives in data-original, the book name in alt
            src = img.xpath('./@data-original').extract_first()
            name = img.xpath('./@alt').extract_first()

            book = ScrapyReadbook101Item(name=name, src=src)
            yield book
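With both pipelines enabled in settings.py, the spider is started from the project root with scrapy crawl read; the scraped book names and cover URLs then end up both in book.json and in the book table of the spider01 database.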

 
