Python Web Crawler Study Notes (13): CrawlSpider

Table of Contents

1. Introduction to CrawlSpider

2. Usage

(1) The link extractor

(2) Example usage

(3) Extracting links

(4) Notes

3. How it works

4. MySQL

5. Steps for using pymysql

6. Saving data to the database

(1) Connection parameters in settings

(2) Pipeline configuration

7. CrawlSpider case study: saving dushu.com data to the database

(1) Case analysis

(2) Project structure

(3) items.py

(4) middlewares.py

(5) pipelines.py

(6) settings.py

(7) read.py


1. Introduction to CrawlSpider

  • Inherits from scrapy.Spider.
  • A CrawlSpider can define rules: while parsing the HTML, it extracts the links that match those rules and then sends requests to them.
  • So when you need to follow links, that is, after crawling a page you must extract its links and crawl those as well, CrawlSpider is a very good fit (a minimal sketch follows this list).
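
A minimal sketch of the idea (the domain, URL pattern, and names here are placeholders, not from these notes; a complete, real example appears in section 7):

import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule

class DemoSpider(CrawlSpider):
    name = 'demo'
    allowed_domains = ['example.com']
    start_urls = ['https://example.com/list_1.html']

    # each Rule pairs a LinkExtractor with a callback; follow=True keeps
    # applying the rules to every page the spider reaches
    rules = (
        Rule(LinkExtractor(allow=r'list_\d+\.html'), callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        # parse one listing page here
        pass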

2. Usage

(1) The link extractor

The link extractor is where you write the rules that pick out the target links:

scrapy.linkextractors.LinkExtractor(
    allow = (),            # regular expression: extract links that match the regex
    deny = (),             # (not used here) regular expression: skip links that match the regex
    allow_domains = (),    # (not used here) allowed domains
    deny_domains = (),     # (not used here) disallowed domains
    restrict_xpaths = (),  # XPath: extract links under nodes matching the XPath rule
    restrict_css = ()      # extract links under nodes matching the CSS selector
)

(2) Example usage

Regex:   links1 = LinkExtractor(allow=r'list_23_\d+\.html')

XPath:   links2 = LinkExtractor(restrict_xpaths=r'//div[@class="x"]')

CSS:     links3 = LinkExtractor(restrict_css='.x')

(3) Extracting links

link.extract_links(response)
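
One way to try an extractor interactively is the Scrapy shell, where response is already defined. A sketch (the URL reuses the case study's start page; the pattern is just an illustration):

# in a terminal:  scrapy shell https://www.dushu.com/book/1188_1.html
from scrapy.linkextractors import LinkExtractor

link = LinkExtractor(allow=r'/book/1188_\d+\.html')
# extract_links() returns a list of Link objects, each with .url and .text
for l in link.extract_links(response):
    print(l.url)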

(4)注意事项
(4) Notes

        【注1】callback只能写函数名字符串, callback='parse_item'

        【注2】在基本的spider中,如果重新发送请求,那里的callback写的是 callback=self.parse_item

        【注3】follow=true 是否跟进 就是按照提取连接规则进行提取
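
Putting the notes together, a rules definition might look like this (a sketch reusing the regex from the example above):

rules = (
    # callback is the method *name* as a string; follow=True re-applies the
    # extractor rules to every page the spider fetches
    Rule(LinkExtractor(allow=r'list_23_\d+\.html'),
         callback='parse_item',
         follow=True),
)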

3. How it works

In short: the spider requests start_urls; each response is scanned by the rules' link extractors; every matching link becomes a new request, whose response goes to the rule's callback; and with follow=True the rules are applied to those pages again, so the crawl keeps expanding.

4. MySQL

(1) Download: https://dev.mysql.com/downloads/windows/installer/5.7.html

(2) Install (illustrated guide): https://jingyan.baidu.com/album/d7130635f1c77d13fdf475df.html

5. Steps for using pymysql

1. pip install pymysql

2. pymysql.connect(host, port, user, password, db, charset)

3. conn.cursor()

4. cursor.execute() (a combined sketch follows)
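
Combining the four steps, a minimal sketch (the connection parameters are placeholders, and the book table and its columns are assumptions, they must exist on your server):

import pymysql

# placeholder connection parameters; adjust to your own MySQL server
conn = pymysql.connect(host='localhost', port=3306, user='root',
                       password='1234', db='test', charset='utf8')
cursor = conn.cursor()
# a parameterized insert; pymysql fills in the %s placeholders safely
cursor.execute('insert into book(name, src) values (%s, %s)',
               ('example title', 'http://example.com/cover.jpg'))
conn.commit()
cursor.close()
conn.close()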

6. Saving data to the database

(1) Connection parameters in settings

DB_HOST = '192.168.231.128'
DB_PORT = 3306
DB_USER = 'root'
DB_PASSWORD = '1234'
DB_NAME = 'test'
DB_CHARSET = 'utf8'

(2) Pipeline configuration

from scrapy.utils.project import get_project_settings
import pymysql


class MysqlPipeline(object):
    # __init__ plays the same role as open_spider would:
    # it reads the connection parameters from settings
    def __init__(self):
        settings = get_project_settings()
        self.host = settings['DB_HOST']
        self.port = settings['DB_PORT']
        self.user = settings['DB_USER']
        self.pwd = settings['DB_PASSWORD']
        self.name = settings['DB_NAME']
        self.charset = settings['DB_CHARSET']
        self.connect()

    # connect to the database and get a cursor object
    def connect(self):
        self.conn = pymysql.connect(host=self.host, port=self.port,
                                    user=self.user, password=self.pwd,
                                    db=self.name, charset=self.charset)
        self.cursor = self.conn.cursor()

    def process_item(self, item, spider):
        # a parameterized query; safer than building the SQL string with
        # "%s" % ... or str.format(), which breaks on quotes in the data
        sql = 'insert into book(image_url, book_name, author, info) values (%s, %s, %s, %s)'
        # execute the sql statement
        self.cursor.execute(sql, (item['image_url'], item['book_name'],
                                  item['author'], item['info']))
        self.conn.commit()
        return item

    def close_spider(self, spider):
        # close the cursor before the connection
        self.cursor.close()
        self.conn.close()
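
A pipeline only takes effect after it is registered in settings.py. A sketch (the module path 'myproject' is hypothetical; it must match your own project name):

ITEM_PIPELINES = {
    # hypothetical project name; the case study below uses scrapy_readbook_101
    'myproject.pipelines.MysqlPipeline': 300,
}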

7. CrawlSpider case study: saving dushu.com data to the database

(1) Case analysis

1. Create the project:        scrapy startproject <project name>

2. Change into the spiders directory:        cd <project name>\<project name>\spiders

3. Generate the crawl spider:        scrapy genspider -t crawl read www.dushu.com

4. items

5. spiders

6. settings

7. pipelines

        save the data to a local file

        save the data to a MySQL database (see the table sketch below)
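
The MySQL pipeline in this case inserts into a book table, which must exist beforehand. A sketch that creates it with pymysql (the column names follow the pipeline; the types, lengths, and connection parameters are assumptions):

import pymysql

conn = pymysql.connect(host='192.168.231.130', port=3306, user='root',
                       password='1234', db='spider01', charset='utf8')
cursor = conn.cursor()
cursor.execute(
    'create table if not exists book('
    'id int primary key auto_increment,'
    'name varchar(255),'   # book title
    'src varchar(512))'    # cover image URL
)
cursor.close()
conn.close()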

(2) Project structure
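
The commands above generate the usual Scrapy layout (shown here as a sketch, with the project name used in this case):

scrapy_readbook_101/
    scrapy.cfg
    scrapy_readbook_101/
        __init__.py
        items.py
        middlewares.py
        pipelines.py
        settings.py
        spiders/
            __init__.py
            read.py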

(3) items.py

# Define here the models for your scraped items
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/items.html

import scrapy


class ScrapyReadbook101Item(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    name = scrapy.Field()
    src = scrapy.Field()

(4) middlewares.py

# Define here the models for your spider middleware
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/spider-middleware.html

from scrapy import signals

# useful for handling different item types with a single interface
from itemadapter import is_item, ItemAdapter


class ScrapyReadbook101SpiderMiddleware:
    # Not all methods need to be defined. If a method is not defined,
    # scrapy acts as if the spider middleware does not modify the
    # passed objects.

    @classmethod
    def from_crawler(cls, crawler):
        # This method is used by Scrapy to create your spiders.
        s = cls()
        crawler.signals.connect(s.spider_opened, signal=signals.spider_opened)
        return s

    def process_spider_input(self, response, spider):
        # Called for each response that goes through the spider
        # middleware and into the spider.
        # Should return None or raise an exception.
        return None

    def process_spider_output(self, response, result, spider):
        # Called with the results returned from the Spider, after
        # it has processed the response.
        # Must return an iterable of Request, or item objects.
        for i in result:
            yield i

    def process_spider_exception(self, response, exception, spider):
        # Called when a spider or process_spider_input() method
        # (from other spider middleware) raises an exception.
        # Should return either None or an iterable of Request or item objects.
        pass

    def process_start_requests(self, start_requests, spider):
        # Called with the start requests of the spider, and works
        # similarly to the process_spider_output() method, except
        # that it doesn't have a response associated.
        # Must return only requests (not items).
        for r in start_requests:
            yield r

    def spider_opened(self, spider):
        spider.logger.info('Spider opened: %s' % spider.name)


class ScrapyReadbook101DownloaderMiddleware:
    # Not all methods need to be defined. If a method is not defined,
    # scrapy acts as if the downloader middleware does not modify the
    # passed objects.

    @classmethod
    def from_crawler(cls, crawler):
        # This method is used by Scrapy to create your spiders.
        s = cls()
        crawler.signals.connect(s.spider_opened, signal=signals.spider_opened)
        return s

    def process_request(self, request, spider):
        # Called for each request that goes through the downloader
        # middleware.
        # Must either:
        # - return None: continue processing this request
        # - or return a Response object
        # - or return a Request object
        # - or raise IgnoreRequest: process_exception() methods of
        #   installed downloader middleware will be called
        return None

    def process_response(self, request, response, spider):
        # Called with the response returned from the downloader.
        # Must either:
        # - return a Response object
        # - return a Request object
        # - or raise IgnoreRequest
        return response

    def process_exception(self, request, exception, spider):
        # Called when a download handler or a process_request()
        # (from other downloader middleware) raises an exception.
        # Must either:
        # - return None: continue processing this exception
        # - return a Response object: stops process_exception() chain
        # - return a Request object: stops process_exception() chain
        pass

    def spider_opened(self, spider):
        spider.logger.info('Spider opened: %s' % spider.name)

(5) pipelines.py

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html

# useful for handling different item types with a single interface
from itemadapter import ItemAdapter

# load the settings file
from scrapy.utils.project import get_project_settings
import pymysql


class ScrapyReadbook101Pipeline:
    def open_spider(self, spider):
        self.fp = open('book.json', 'w', encoding='utf-8')

    def process_item(self, item, spider):
        self.fp.write(str(item))
        return item

    def close_spider(self, spider):
        self.fp.close()


class MysqlPipeline:
    def open_spider(self, spider):
        settings = get_project_settings()
        self.host = settings['DB_HOST']
        self.port = settings['DB_PORT']
        self.user = settings['DB_USER']
        self.password = settings['DB_PASSWORD']
        self.name = settings['DB_NAME']
        self.charset = settings['DB_CHARSET']
        self.connect()

    def connect(self):
        self.conn = pymysql.connect(host=self.host,
                                    port=self.port,
                                    user=self.user,
                                    password=self.password,
                                    db=self.name,
                                    charset=self.charset)
        self.cursor = self.conn.cursor()

    def process_item(self, item, spider):
        # parameterized query: no manual quoting of the values
        sql = 'insert into book(name, src) values (%s, %s)'
        # execute the sql statement
        self.cursor.execute(sql, (item['name'], item['src']))
        # commit
        self.conn.commit()
        return item

    def close_spider(self, spider):
        self.cursor.close()
        self.conn.close()

(6) settings.py

# Scrapy settings for scrapy_readbook_101 project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
#     https://docs.scrapy.org/en/latest/topics/settings.html
#     https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
#     https://docs.scrapy.org/en/latest/topics/spider-middleware.html

BOT_NAME = 'scrapy_readbook_101'

SPIDER_MODULES = ['scrapy_readbook_101.spiders']
NEWSPIDER_MODULE = 'scrapy_readbook_101.spiders'

# Crawl responsibly by identifying yourself (and your website) on the user-agent
#USER_AGENT = 'scrapy_readbook_101 (+http://www.yourdomain.com)'

# Obey robots.txt rules
ROBOTSTXT_OBEY = True

# Configure maximum concurrent requests performed by Scrapy (default: 16)
#CONCURRENT_REQUESTS = 32

# Configure a delay for requests for the same website (default: 0)
# See https://docs.scrapy.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
#DOWNLOAD_DELAY = 3
# The download delay setting will honor only one of:
#CONCURRENT_REQUESTS_PER_DOMAIN = 16
#CONCURRENT_REQUESTS_PER_IP = 16

# Disable cookies (enabled by default)
#COOKIES_ENABLED = False

# Disable Telnet Console (enabled by default)
#TELNETCONSOLE_ENABLED = False

# Override the default request headers:
#DEFAULT_REQUEST_HEADERS = {
#   'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
#   'Accept-Language': 'en',
#}

# Enable or disable spider middlewares
# See https://docs.scrapy.org/en/latest/topics/spider-middleware.html
#SPIDER_MIDDLEWARES = {
#    'scrapy_readbook_101.middlewares.ScrapyReadbook101SpiderMiddleware': 543,
#}

# Enable or disable downloader middlewares
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
#DOWNLOADER_MIDDLEWARES = {
#    'scrapy_readbook_101.middlewares.ScrapyReadbook101DownloaderMiddleware': 543,
#}

# Enable or disable extensions
# See https://docs.scrapy.org/en/latest/topics/extensions.html
#EXTENSIONS = {
#    'scrapy.extensions.telnet.TelnetConsole': None,
#}

# watch the port number and the charset in these parameters
DB_HOST = '192.168.231.130'
# the port number must be an integer
DB_PORT = 3306
DB_USER = 'root'
DB_PASSWORD = '1234'
DB_NAME = 'spider01'
# write 'utf8', not 'utf-8': the hyphen is not allowed
DB_CHARSET = 'utf8'

# Configure item pipelines
# See https://docs.scrapy.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
    'scrapy_readbook_101.pipelines.ScrapyReadbook101Pipeline': 300,
    # MysqlPipeline
    'scrapy_readbook_101.pipelines.MysqlPipeline': 301,
}

# Enable and configure the AutoThrottle extension (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/autothrottle.html
#AUTOTHROTTLE_ENABLED = True
# The initial download delay
#AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
#AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
#AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
#AUTOTHROTTLE_DEBUG = False

# Enable and configure HTTP caching (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
#HTTPCACHE_ENABLED = True
#HTTPCACHE_EXPIRATION_SECS = 0
#HTTPCACHE_DIR = 'httpcache'
#HTTPCACHE_IGNORE_HTTP_CODES = []
#HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'

(7) read.py

import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule

from scrapy_readbook_101.items import ScrapyReadbook101Item


class ReadSpider(CrawlSpider):
    name = 'read'
    allowed_domains = ['www.dushu.com']
    start_urls = ['https://www.dushu.com/book/1188_1.html']

    rules = (
        Rule(LinkExtractor(allow=r'/book/1188_\d+\.html'),
             callback='parse_item',
             follow=True),
    )

    def parse_item(self, response):
        img_list = response.xpath('//div[@class="bookslist"]//img')
        for img in img_list:
            # the book title is in the alt attribute; the lazily-loaded
            # cover URL lives in data-original
            name = img.xpath('./@alt').extract_first()
            src = img.xpath('./@data-original').extract_first()
            book = ScrapyReadbook101Item(name=name, src=src)
            yield book
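
To run the case from the project root (both pipelines must be enabled in ITEM_PIPELINES, as in the settings.py above):

scrapy crawl read

book.json should fill with the scraped items, and the same rows should land in the MySQL book table.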

 
