Scraping the latest China-section news from the European Times (oushinet.com) with Scrapy

Contents

  • 1. Creating the project
  • 2. Writing the spider
  • 3. Storing items with a pipeline
  • 4. The settings file

1. Creating the project

Command to create a Scrapy project: scrapy startproject <project_name>
Example:
scrapy startproject myspider


Command to generate a spider: scrapy genspider <spider_name> <allowed_domain>

cd myspider
scrapy genspider itcast itcast.cn
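
For reference, startproject generates the standard Scrapy layout (shown here for the myspider example above; the actual project in this article is named chinanews):

myspider/
    scrapy.cfg            # deploy configuration
    myspider/
        __init__.py
        items.py          # item definitions (ChinanewsItem goes here)
        middlewares.py
        pipelines.py      # item pipelines (ChinanewsPipeline2 goes here)
        settings.py       # project settings
        spiders/
            __init__.py   # spider modules live in this package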

2. Writing the spider

import time

import scrapy

from ..items import ChinanewsItem

s = 1  # current page number of the list API (used for pagination)


class NewsSpider(scrapy.Spider):
    name = 'news'
    allowed_domains = []
    start_urls = [
        'https://cms.offshoremedia.net/front/list/latest?pageNum=1&pageSize=15'
        '&siteId=694841922577108992&channelId=780811183157682176'
    ]

    def parse(self, response):
        global s
        res = response.json()
        # Each list entry carries the static URL of one article's detail page.
        for i in res["info"]["list"]:
            try:
                newurl = i["contentStaticPage"]
                yield scrapy.Request(url=newurl, callback=self.datas,
                                     cb_kwargs={'newurl': newurl},
                                     dont_filter=True)
            except Exception as e:
                print(f"Failed to process URL {i['contentStaticPage']}: {str(e)}")
        # Advance to the next list page, up to page 1000.
        s = s + 1
        if s < 1000:
            print(s)
            url = (f'https://cms.offshoremedia.net/front/list/latest?pageNum={s}'
                   f'&pageSize=15&siteId=694841922577108992&channelId=780811183157682176')
            yield scrapy.Request(url=url, callback=self.parse)

    def datas(self, response, newurl):
        item = ChinanewsItem()
        id = newurl.split('/')[-1].split('.')[0]   # article id: file name without extension
        clas = newurl.split('/')[5]                # category: sixth path segment
        title = response.xpath(f'//*[@id="{id}"]/text()')[0].get()
        # The publish time is a millisecond timestamp inside an <i> tag.
        timee = response.xpath('/html/body/div[1]/div[2]/div/div[1]/div[1]/div[1]/div[3]/span[1]/i/text()')[0]
        now = int(timee.get())
        otherStyleTime = time.strftime("%Y-%m-%d", time.localtime(now / 1000))
        Released = "发布时间:" + otherStyleTime
        imgurl = response.xpath('/html/body/div[1]/div[2]/div/div[1]/div[1]/div[1]//img/@src').getall()
        if imgurl == []:
            imgurl = "无图片"
        # The <b> tags carry the image-source captions.
        Imageannotations = response.xpath(
            '/html/body/div[1]/div[2]/div/div[1]/div[1]/div[1]/div[4]/div/p/b/text()').getall()
        Imageannotationstrue = [i for i in Imageannotations if "图片来源" in i]
        if Imageannotationstrue == []:
            Imageannotationstrue = "无图片注释"
        texts = response.xpath(
            '/html/body/div[1]/div[2]/div/div[1]/div[1]/div[1]/div[4]/div/p[@style="text-indent:2em;"]//text()').getall()
        text = [t for t in texts if t.strip()]
        if len(text) > 1:
            # The first indented paragraph is the summary; the rest form the body.
            summary = text[0]
            del text[0]
            body = ""
            for i in text:
                body = body + '\n' + i
        else:
            summary = []
            body = []
        if body != []:
            item["list"] = [id, clas, title, otherStyleTime, Released, str(imgurl),
                            str(Imageannotationstrue), summary, body, newurl]
            yield item
        else:
            # Fallback layout: the body text sits in <span> tags instead of
            # indented <p> tags; id, clas, title etc. computed above still apply.
            text = response.xpath(
                '/html/body/div[1]/div[2]/div/div[1]/div[1]/div[1]/div[4]/div/p/span/text()').getall()
            try:
                summary = response.xpath(
                    '/html/body/div[1]/div[2]/div/div[1]/div[1]/div[1]/div[4]/p/text()')[0].get()
            except Exception:
                summary = []
            item["list"] = [id, clas, title, otherStyleTime, Released, str(imgurl),
                            str(Imageannotationstrue), summary, str(text), newurl]
            yield item
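
To make the id / clas slicing in datas() concrete, here it is on a made-up detail URL (hypothetical, for illustration only; the real contentStaticPage path layout may differ):

# Hypothetical URL, for illustration only.
newurl = 'https://www.oushinet.com/static/content/china/chinanews/2024-07-17/123456789.html'
newurl.split('/')[-1].split('.')[0]  # id   -> '123456789' (file name without extension)
newurl.split('/')[5]                 # clas -> 'china' (sixth path segment)

The spider imports ChinanewsItem from items.py. The article never shows that file; a minimal sketch consistent with the single item["list"] field the spider assigns would be:

import scrapy

class ChinanewsItem(scrapy.Item):
    # One field holding the whole row, in the column order the pipeline inserts.
    list = scrapy.Field()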

The start URL is 'https://cms.offshoremedia.net/front/list/latest?pageNum=1&pageSize=15&siteId=694841922577108992&channelId=780811183157682176'.
1. The response is JSON that lists, for each news item on the current page, the information needed to reach its detail page; parse() receives this data (a sketch of the response shape follows this list).
2. parse() extracts each article's detail URL from that list, sends a request for it, and hands the response to datas():

yield scrapy.Request(url=newurl, callback=self.datas, cb_kwargs={'newurl': newurl}, dont_filter=True)

3. datas() extracts the fields with XPath and submits the item to the pipeline.

4. s = s + 1 advances the page number in the list URL (automatic pagination), and yield scrapy.Request(url=url, callback=self.parse) feeds the next page back into parse(), continuing the loop until page 1000.
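
For reference, a minimal sketch of the list-API response shape, inferred from the fields the spider reads (only info, list, and contentStaticPage come from the code; real payloads carry additional fields, elided here):

{
    "info": {
        "list": [
            {"contentStaticPage": "https://www.oushinet.com/.../<article-page>.html", ...},
            ...
        ]
    }
}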

3. Storing items with a pipeline

Traditional synchronous database calls can block the Scrapy crawl while it waits for each operation to finish. adbapi lets Scrapy talk to the database asynchronously: queries are executed in a thread pool, so the crawler keeps downloading and parsing pages while inserts are still in flight. This markedly improves throughput when a large volume of items has to be persisted.

import pymysql
from twisted.enterprise import adbapi


class ChinanewsPipeline2:
    def __init__(self):
        # Thread-pool-backed connection pool; each interaction runs off the
        # reactor thread, so inserts do not block the crawl.
        self.dbpool = adbapi.ConnectionPool(
            'pymysql',
            host='127.0.0.1',
            user='root',
            password='root',
            port=3306,
            database='news',
        )

    def process_item(self, item, spider):
        self.dbpool.runInteraction(self._do_insert, item["list"])
        return item

    def _do_insert(self, txn, row):
        sql = """INSERT INTO untitled
                     (id, clas, title, otherStyleTime, Released, imgurl,
                      Imageannotations, summary, body, url)
                 VALUES (%s, %s, %s, %s, %s, %s, %s, %s, %s, %s)"""
        # txn is a transaction cursor used to execute SQL inside the pool thread.
        try:
            txn.execute(sql, tuple(row))
        except pymysql.err.IntegrityError:
            # Duplicate key: log and skip the row.
            print(f"Duplicate entry for id {row[0]}, skipping...")
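
The INSERT targets a table named untitled. Only the column names come from the pipeline; the types and the primary key below are assumptions. A primary key on id is what raises the IntegrityError the pipeline catches for duplicate articles. A minimal one-off setup script, assuming that schema:

import pymysql

# Assumed schema: column names match the pipeline's INSERT; types are guesses.
conn = pymysql.connect(host='127.0.0.1', user='root', password='root',
                       port=3306, database='news')
with conn.cursor() as cur:
    cur.execute("""
        CREATE TABLE IF NOT EXISTS untitled (
            id               VARCHAR(64) PRIMARY KEY,  -- article id parsed from the URL
            clas             VARCHAR(64),              -- category segment of the URL
            title            TEXT,
            otherStyleTime   VARCHAR(16),              -- publish date, YYYY-MM-DD
            Released         TEXT,                     -- '发布时间:' + date
            imgurl           TEXT,
            Imageannotations TEXT,
            summary          TEXT,
            body             TEXT,
            url              TEXT
        )
    """)
conn.commit()
conn.close()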

4. The settings file

# Scrapy settings for chinanews project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
#     https://docs.scrapy.org/en/latest/topics/settings.html
#     https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
#     https://docs.scrapy.org/en/latest/topics/spider-middleware.html

BOT_NAME = 'chinanews'

LOG_LEVEL = 'ERROR'

SPIDER_MODULES = ['chinanews.spiders']
NEWSPIDER_MODULE = 'chinanews.spiders'

# Crawl responsibly by identifying yourself (and your website) on the user-agent
#USER_AGENT = 'chinanews (+http://www.yourdomain.com)'

# Obey robots.txt rules
ROBOTSTXT_OBEY = False

# Configure maximum concurrent requests performed by Scrapy (default: 16)
CONCURRENT_REQUESTS = 100

# Configure a delay for requests for the same website (default: 0)
# See https://docs.scrapy.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
#DOWNLOAD_DELAY = 3
# The download delay setting will honor only one of:
CONCURRENT_REQUESTS_PER_DOMAIN = 150
CONCURRENT_REQUESTS_PER_IP = 150

# Disable cookies (enabled by default)
#COOKIES_ENABLED = False

# Disable Telnet Console (enabled by default)
#TELNETCONSOLE_ENABLED = False

# Override the default request headers:
DEFAULT_REQUEST_HEADERS = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/126.0.0.0 Safari/537.36 Edg/126.0.0.0',
    'Origin': 'https://www.oushinet.com',
    'Referer': 'https://www.oushinet.com/',
    'Content-Type': 'application/json;charset=UTF-8',
}

# Enable or disable spider middlewares
# See https://docs.scrapy.org/en/latest/topics/spider-middleware.html
#SPIDER_MIDDLEWARES = {
#    'chinanews.middlewares.ChinanewsSpiderMiddleware': 543,
#}

# Enable or disable downloader middlewares
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
#DOWNLOADER_MIDDLEWARES = {
#    'chinanews.middlewares.ChinanewsDownloaderMiddleware': 543,
#}

# Enable or disable extensions
# See https://docs.scrapy.org/en/latest/topics/extensions.html
#EXTENSIONS = {
#    'scrapy.extensions.telnet.TelnetConsole': None,
#}

# Configure item pipelines
# See https://docs.scrapy.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
    #'chinanews.pipelines.ChinanewsPipeline': 300,
    'chinanews.pipelines.ChinanewsPipeline2': 200,
}

# Enable and configure the AutoThrottle extension (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/autothrottle.html
#AUTOTHROTTLE_ENABLED = True
# The initial download delay
#AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
#AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
#AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
#AUTOTHROTTLE_DEBUG = False

# Enable and configure HTTP caching (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
#HTTPCACHE_ENABLED = True
#HTTPCACHE_EXPIRATION_SECS = 0
#HTTPCACHE_DIR = 'httpcache'
#HTTPCACHE_IGNORE_HTTP_CODES = []
#HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'
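
With the spider, pipeline, and settings in place, the crawl is started from the project root using the spider's name:

scrapy crawl news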
