Scraping news data from the China section of 欧洲时报 (European Times) with Scrapy

Table of Contents

  • 1. Creating the project
  • 2. Writing the spider
  • 3. Pipeline storage
  • 4. The settings file

1. Creating the project

The command to create a Scrapy project: scrapy startproject <project_name>
Example:
scrapy startproject myspider

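The command generates a project skeleton. The layout below is the standard structure Scrapy creates (a sketch; minor details vary by Scrapy version):

myspider/
    scrapy.cfg            # deploy configuration
    myspider/
        __init__.py
        items.py          # item definitions
        middlewares.py
        pipelines.py      # item pipelines
        settings.py       # project settings
        spiders/
            __init__.py   # spider modules live here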

To generate a spider inside the project: scrapy genspider <spider_name> <allowed_domain>
cd myspider
scrapy genspider itcast itcast.cn
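genspider creates a new module under spiders/. For the example above, the generated file would contain roughly this skeleton (a sketch; the exact template depends on the Scrapy version):

import scrapy

class ItcastSpider(scrapy.Spider):
    name = 'itcast'
    allowed_domains = ['itcast.cn']
    start_urls = ['http://itcast.cn/']

    def parse(self, response):
        # parsing logic goes here
        pass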

2. Writing the spider

import time
import scrapy
from ..items import ChinanewsItem

s = 1  # current listing-page number, shared across parse() calls

class NewsSpider(scrapy.Spider):
    name = 'news'
    allowed_domains = []
    start_urls = ['https://cms.offshoremedia.net/front/list/latest?pageNum=1&pageSize=15&siteId=694841922577108992&channelId=780811183157682176']

    def parse(self, response):
        global s
        res = response.json()
        # each entry in info.list carries the static URL of one article's detail page
        for i in res["info"]["list"]:
            try:
                newurl = i["contentStaticPage"]
                yield scrapy.Request(url=newurl, callback=self.datas,
                                     cb_kwargs={'newurl': newurl}, dont_filter=True)
            except Exception as e:
                print(f"Failed to process URL {i['contentStaticPage']}: {str(e)}")
        # advance the page number and feed the next listing page back into parse()
        s = s + 1
        if s < 1000:
            print(s)
            url = f'https://cms.offshoremedia.net/front/list/latest?pageNum={str(s)}&pageSize=15&siteId=694841922577108992&channelId=780811183157682176'
            yield scrapy.Request(url=url, callback=self.parse)

    def datas(self, response, newurl):
        item = ChinanewsItem()
        id = newurl.split('/')[-1].split('.')[0]   # article id from the file name
        clas = newurl.split('/')[5]                # channel/category segment of the URL
        title = response.xpath(f'//*[@id="{id}"]/text()')[0]
        title = title.get()
        timee = response.xpath('/html/body/div[1]/div[2]/div/div[1]/div[1]/div[1]/div[3]/span[1]/i/text()')[0]
        now = int(timee.get())                     # millisecond timestamp
        timeArray = time.localtime(now / 1000)
        otherStyleTime = time.strftime("%Y-%m-%d", timeArray)
        Released = "发布时间:" + otherStyleTime     # "发布时间" = publication date
        imgurl = response.xpath('/html/body/div[1]/div[2]/div/div[1]/div[1]/div[1]//img/@src')
        imgurl = imgurl.getall()
        if imgurl == []:
            imgurl = "无图片"  # no images
        # the <b> tags inside the body hold the image-source captions
        Imageannotations = response.xpath('/html/body/div[1]/div[2]/div/div[1]/div[1]/div[1]/div[4]/div/p/b/text()')
        Imageannotations = Imageannotations.getall()
        Imageannotationstrue = []
        for i in Imageannotations:
            if "图片来源" in i:  # keep only captions marked "image source"
                Imageannotationstrue.append(i)
        if Imageannotationstrue == []:
            Imageannotationstrue = "无图片注释"  # no image captions
        texts = response.xpath('/html/body/div[1]/div[2]/div/div[1]/div[1]/div[1]/div[4]/div/p[@style="text-indent:2em;"]//text()')
        texts = texts.getall()
        text = [t for t in texts if t.strip()]     # drop whitespace-only nodes
        if len(text) > 1:
            summary = text[0]                      # the first paragraph doubles as the summary
            del text[0]
            body = ""
            for i in text:
                body = body + '\n' + i
        else:
            summary = []
            body = []
        if body != []:
            item["list"] = [id, clas, title, otherStyleTime, Released, str(imgurl),
                            str(Imageannotationstrue), summary, body, newurl]
            yield item
        else:
            # fallback for pages whose body text sits in <span> tags instead of
            # indented <p> tags; the metadata fields are re-extracted the same way
            id = newurl.split('/')[-1].split('.')[0]
            clas = newurl.split('/')[5]
            title = response.xpath(f'//*[@id="{id}"]/text()')[0]
            title = title.get()
            timee = response.xpath('/html/body/div[1]/div[2]/div/div[1]/div[1]/div[1]/div[3]/span[1]/i/text()')[0]
            now = int(timee.get())
            timeArray = time.localtime(now / 1000)
            otherStyleTime = time.strftime("%Y-%m-%d", timeArray)
            Released = "发布时间:" + otherStyleTime
            imgurl = response.xpath('/html/body/div[1]/div[2]/div/div[1]/div[1]/div[1]//img/@src')
            imgurl = imgurl.getall()
            if imgurl == []:
                imgurl = "无图片"
            Imageannotations = response.xpath('/html/body/div[1]/div[2]/div/div[1]/div[1]/div[1]/div[4]/div/p/b/text()')
            Imageannotations = Imageannotations.getall()
            Imageannotationstrue = []
            for i in Imageannotations:
                if "图片来源" in i:
                    Imageannotationstrue.append(i)
            if Imageannotationstrue == []:
                Imageannotationstrue = "无图片注释"
            text = response.xpath('/html/body/div[1]/div[2]/div/div[1]/div[1]/div[1]/div[4]/div/p/span/text()')
            text = text.getall()
            try:
                summary = response.xpath('/html/body/div[1]/div[2]/div/div[1]/div[1]/div[1]/div[4]/p/text()')[0]
                summary = summary.get()
            except Exception:
                summary = []
            item["list"] = [id, clas, title, otherStyleTime, Released, str(imgurl),
                            str(Imageannotationstrue), summary, str(text), newurl]
            yield item

The start URL is 'https://cms.offshoremedia.net/front/list/latest?pageNum=1&pageSize=15&siteId=694841922577108992&channelId=780811183157682176'.
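Judging from the parsing code, the endpoint returns JSON shaped roughly as follows (a sketch inferred from the fields the spider reads; values are illustrative placeholders):

{
    "info": {
        "list": [
            {"contentStaticPage": "https://www.oushinet.com/<channel>/.../<article-id>.html"},
            ...
        ]
    }
}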
1. The listing endpoint returns, for each article on the page, the information needed to reach its detail page; the JSON response is handed to the parse() function.
2. parse() extracts each article's detail-page URL, sends a request for it, and passes the result on to datas():

yield scrapy.Request(url=newurl, callback=self.datas, cb_kwargs={'newurl': newurl}, dont_filter=True)

3. datas() extracts the fields with XPath, fills a ChinanewsItem, and submits it to the pipeline (see the items.py sketch after this list).

4. s = s + 1 advances the page number to implement automatic pagination; yield scrapy.Request(url=url, callback=self.parse) feeds the next listing page back into parse(), repeating the cycle until s reaches 1000.
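The spider imports ChinanewsItem from items.py, which is not shown in the post. A minimal definition matching the item["list"] usage above would be (a sketch, assuming a single field):

import scrapy

class ChinanewsItem(scrapy.Item):
    # one field holding the whole record as a list:
    # [id, clas, title, otherStyleTime, Released, imgurl, Imageannotations, summary, body, url]
    list = scrapy.Field()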

3. Pipeline storage

A traditional synchronous database call would block the Scrapy crawler while it waits for the operation to finish. adbapi lets Scrapy talk to the database asynchronously: while a database operation is in flight, Scrapy keeps working on other tasks, such as fetching more pages or parsing data. This asynchronous approach markedly improves throughput when the project has to persist large volumes of items.

import logging
from twisted.enterprise import adbapi
import pymysql

class ChinanewsPipeline2:
    def __init__(self):
        # asynchronous connection pool backed by the pymysql driver
        self.dbpool = adbapi.ConnectionPool(
            'pymysql',
            host='127.0.0.1',
            user='root',
            password='root',
            port=3306,
            database='news',
        )

    def process_item(self, item, spider):
        # schedule the insert on a pool thread; Scrapy itself is not blocked
        self.dbpool.runInteraction(self._do_insert, item["list"])
        return item

    def _do_insert(self, txn, row):
        # txn is a transaction cursor used to execute the SQL statement
        sql = """INSERT INTO untitled
                     (id, clas, title, otherStyleTime, Released, imgurl,
                      Imageannotations, summary, body, url)
                 VALUES (%s, %s, %s, %s, %s, %s, %s, %s, %s, %s)"""
        try:
            txn.execute(sql, tuple(row))
        except pymysql.err.IntegrityError:
            # duplicate primary key: log and skip the record
            logging.warning(f"Duplicate entry for id {row[0]}, skipping...")

4. The settings file

# Scrapy settings for chinanews project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
#     https://docs.scrapy.org/en/latest/topics/settings.html
#     https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
#     https://docs.scrapy.org/en/latest/topics/spider-middleware.html

BOT_NAME = 'chinanews'

LOG_LEVEL = 'ERROR'

SPIDER_MODULES = ['chinanews.spiders']
NEWSPIDER_MODULE = 'chinanews.spiders'

# Crawl responsibly by identifying yourself (and your website) on the user-agent
#USER_AGENT = 'chinanews (+http://www.yourdomain.com)'

# Obey robots.txt rules
ROBOTSTXT_OBEY = False

# Configure maximum concurrent requests performed by Scrapy (default: 16)
CONCURRENT_REQUESTS = 100

# Configure a delay for requests for the same website (default: 0)
# See https://docs.scrapy.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
#DOWNLOAD_DELAY = 3
# The download delay setting will honor only one of:
CONCURRENT_REQUESTS_PER_DOMAIN = 150
CONCURRENT_REQUESTS_PER_IP = 150

# Disable cookies (enabled by default)
#COOKIES_ENABLED = False

# Disable Telnet Console (enabled by default)
#TELNETCONSOLE_ENABLED = False

# Override the default request headers:
DEFAULT_REQUEST_HEADERS = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/126.0.0.0 Safari/537.36 Edg/126.0.0.0',
    'Origin': 'https://www.oushinet.com',
    'Referer': 'https://www.oushinet.com/',
    'Content-Type': 'application/json;charset=UTF-8',
}

# Enable or disable spider middlewares
# See https://docs.scrapy.org/en/latest/topics/spider-middleware.html
#SPIDER_MIDDLEWARES = {
#    'chinanews.middlewares.ChinanewsSpiderMiddleware': 543,
#}

# Enable or disable downloader middlewares
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
#DOWNLOADER_MIDDLEWARES = {
#    'chinanews.middlewares.ChinanewsDownloaderMiddleware': 543,
#}

# Enable or disable extensions
# See https://docs.scrapy.org/en/latest/topics/extensions.html
#EXTENSIONS = {
#    'scrapy.extensions.telnet.TelnetConsole': None,
#}

# Configure item pipelines
# See https://docs.scrapy.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
    #'chinanews.pipelines.ChinanewsPipeline': 300,
    'chinanews.pipelines.ChinanewsPipeline2': 200,
}

# Enable and configure the AutoThrottle extension (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/autothrottle.html
#AUTOTHROTTLE_ENABLED = True
# The initial download delay
#AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
#AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
#AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
#AUTOTHROTTLE_DEBUG = False

# Enable and configure HTTP caching (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
#HTTPCACHE_ENABLED = True
#HTTPCACHE_EXPIRATION_SECS = 0
#HTTPCACHE_DIR = 'httpcache'
#HTTPCACHE_IGNORE_HTTP_CODES = []
#HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'
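With the spider, pipeline, and settings in place, the crawl is started from the project root, using the spider name 'news' defined above:

scrapy crawl news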
