Web Scraping from Beginner to Mastery — Framework Series 14 (PySpider Architecture Overview and Usage Guide)

Official documentation
Sample Code:

```python
from pyspider.libs.base_handler import *


class Handler(BaseHandler):
    crawl_config = {}

    # minutes=24 * 60: re-crawl once every day
    @every(minutes=24 * 60)
    def on_start(self):
        self.crawl('http://scrapy.org/', callback=self.index_page)

    # age: validity period of the task
    @config(age=10 * 24 * 60 * 60)
    def index_page(self, response):
        for each in response.doc('a[href^="http"]').items():
            self.crawl(each.attr.href, callback=self.detail_page)

    def detail_page(self, response):
        return {
            "url": response.url,
            "title": response.doc('title').text(),
        }
```

1 Architecture

[Figure: pyspider architecture — scheduler, fetcher, processor, monitor & webui]
scheduler: the scheduler
fetcher: the fetcher, which requests web pages
processor: processes the fetched data; URLs it extracts can be fed back to the scheduler
monitor & webui: the monitor and the web UI

scheduler

Maintains the request queue; by default it is stored in a local database.

fetcher

It supports two fetch modes: plain HTTP requests, and JavaScript-rendered requests (via PhantomJS).

processor

Processes the fetched data and wraps extracted URLs into new requests, which can be sent back to the scheduler.

2 Command Line

Global Config

global options work for all subcommands.

```
Usage: pyspider [OPTIONS] COMMAND [ARGS]...

  A powerful spider system in python.

Options:
  -c, --config FILENAME    a json file with default values for subcommands.
                           {"webui": {"port":5001}}
  --logging-config TEXT    logging config file for built-in python logging
                           module  [default: pyspider/pyspider/logging.conf]
  --debug                  debug mode
  --queue-maxsize INTEGER  maxsize of queue
  --taskdb TEXT            database url for taskdb, default: sqlite
  --projectdb TEXT         database url for projectdb, default: sqlite
  --resultdb TEXT          database url for resultdb, default: sqlite
  --message-queue TEXT     connection url to message queue, default: builtin
                           multiprocessing.Queue
  --amqp-url TEXT          [deprecated] amqp url for rabbitmq. please use
                           --message-queue instead.
  --beanstalk TEXT         [deprecated] beanstalk config for beanstalk queue.
                           please use --message-queue instead.
  --phantomjs-proxy TEXT   phantomjs proxy ip:port
  --data-path TEXT         data dir path
  --version                Show the version and exit.
  --help                   Show this message and exit.
```

config

Configuration options applied when pyspider starts.

{"taskdb": "mysql+taskdb://username:password@host:port/taskdb","projectdb": "mysql+projectdb://username:password@host:port/projectdb","resultdb": "mysql+resultdb://username:password@host:port/resultdb","message_queue": "amqp://username:password@host:port/%2F","webui": {"username": "some_name","password": "some_passwd","need-auth": true}
}

When a project starts, the three databases above are created under the project's data directory.
webui: sets the username and password required to access the web interface.
With need-auth enabled, the browser pops up a login prompt when you open the web UI.
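As a usage sketch (the file name config.json is an assumption, not from the original): save the JSON above to a file and start pyspider with `pyspider -c config.json all`; the values in the file are then used as defaults for all subcommands.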

all

Runs all of the components.

```
Usage: pyspider all [OPTIONS]

  Run all the components in subprocess or thread

Options:
  --fetcher-num INTEGER         instance num of fetcher
  --processor-num INTEGER       instance num of processor
  --result-worker-num INTEGER   instance num of result worker
  --run-in [subprocess|thread]  run each components in thread or subprocess.
                                always using thread for windows.
  --help                        Show this message and exit.
```


one

Runs all of the components in a single process.

```
Usage: pyspider one [OPTIONS] [SCRIPTS]...

  One mode not only means all-in-one, it runs every thing in one process
  over tornado.ioloop, for debug purpose

Options:
  -i, --interactive  enable interactive mode, you can choose crawl url.
  --phantomjs        enable phantomjs, will spawn a subprocess for phantomjs
  --help             Show this message and exit.
```

bench

Benchmark testing.

```
Usage: pyspider bench [OPTIONS]

  Run Benchmark test. In bench mode, in-memory sqlite database is used
  instead of on-disk sqlite database.

Options:
  --fetcher-num INTEGER         instance num of fetcher
  --processor-num INTEGER       instance num of processor
  --result-worker-num INTEGER   instance num of result worker
  --run-in [subprocess|thread]  run each components in thread or subprocess.
                                always using thread for windows.
  --total INTEGER               total url in test page
  --show INTEGER                show how many urls in a page
  --help                        Show this message and exit.
```

Each of the following components can also be started individually:

scheduler

```
Usage: pyspider scheduler [OPTIONS]

  Run Scheduler, only one scheduler is allowed.

Options:
  --xmlrpc / --no-xmlrpc
  --xmlrpc-host TEXT
  --xmlrpc-port INTEGER
  --inqueue-limit INTEGER  size limit of task queue for each project, tasks
                           will been ignored when overflow
  --delete-time INTEGER    delete time before marked as delete
  --active-tasks INTEGER   active log size
  --loop-limit INTEGER     maximum number of tasks due with in a loop
  --scheduler-cls TEXT     scheduler class to be used.
  --help                   Show this message and exit.
```

phantomjs

```
Usage: run.py phantomjs [OPTIONS] [ARGS]...

  Run phantomjs fetcher if phantomjs is installed.

Options:
  --phantomjs-path TEXT  phantomjs path
  --port INTEGER         phantomjs port
  --auto-restart TEXT    auto restart phantomjs if crashed
  --help                 Show this message and exit.
```

fetcher

```
Usage: pyspider fetcher [OPTIONS]

  Run Fetcher.

Options:
  --xmlrpc / --no-xmlrpc
  --xmlrpc-host TEXT
  --xmlrpc-port INTEGER
  --poolsize INTEGER      max simultaneous fetches
  --proxy TEXT            proxy host:port
  --user-agent TEXT       user agent
  --timeout TEXT          default fetch timeout
  --fetcher-cls TEXT      Fetcher class to be used.
  --help                  Show this message and exit.
```

processor

```
Usage: pyspider processor [OPTIONS]

  Run Processor.

Options:
  --processor-cls TEXT  Processor class to be used.
  --help                Show this message and exit.
```

result_worker

```
Usage: pyspider result_worker [OPTIONS]

  Run result worker.

Options:
  --result-cls TEXT  ResultWorker class to be used.
  --help             Show this message and exit.
```

webui

```
Usage: pyspider webui [OPTIONS]

  Run WebUI

Options:
  --host TEXT            webui bind to host
  --port INTEGER         webui bind to port
  --cdn TEXT             js/css cdn server
  --scheduler-rpc TEXT   xmlrpc path of scheduler
  --fetcher-rpc TEXT     xmlrpc path of fetcher
  --max-rate FLOAT       max rate for each project
  --max-burst FLOAT      max burst for each project
  --username TEXT        username of lock-ed projects
  --password TEXT        password of lock-ed projects
  --need-auth            need username and password
  --webui-instance TEXT  webui Flask Application instance to be used.
  --help                 Show this message and exit.
```


3 API Reference

self.crawl(url, **kwargs)

self.crawl is the main interface to tell pyspider which url(s) should be crawled.
This is the function that issues crawl requests.

Parameters:

url

the url or url list to be crawled.
The page(s) to request.

callback

the method to parse the response. default: __call__
The callback function.

```python
def on_start(self):
    self.crawl('http://scrapy.org/', callback=self.index_page)
```

the following parameters are optional

age

the period of validity of the task. The page would be regarded as not modified during the period. default: -1(never recrawl)
Validity period of the request, in seconds; the page is re-crawled only after it has expired.

```python
@config(age=10 * 24 * 60 * 60)
def index_page(self, response):
    ...
```

Every page parsed by the callback index_page would be regarded as not changed within 10 days. If you submit the task within 10 days since it was last crawled, it would be discarded.

priority

the priority of task to be scheduled, higher the better. default: 0
Specifies the crawl priority.

```python
def index_page(self):
    self.crawl('http://www.example.org/page2.html', callback=self.index_page)
    self.crawl('http://www.example.org/233.html', callback=self.detail_page,
               priority=1)
```

The page 233.html would be crawled before page2.html. This parameter can be used to do a BFS and reduce the number of tasks in the queue (which may cost more memory resources).

exetime

the executed time of task in unix timestamp. default: 0(immediately)
Specifies when the task is executed.

```python
import time

def on_start(self):
    self.crawl('http://www.example.org/', callback=self.callback,
               exetime=time.time() + 30 * 60)
```

The page would be crawled 30 minutes later.

retries

retry times while failed. default: 3
Number of retries after a failure; the default is 3 and the maximum is 10.
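A minimal sketch (the URL and callback are placeholders) raising the retry count for a flaky page:

```python
def on_start(self):
    self.crawl('http://www.example.org/flaky-page', callback=self.callback,
               retries=5)  # retry up to 5 times instead of the default 3
```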

itag

a marker from frontier page to reveal the potential modification of the task. It will be compared to its last value, recrawl when it’s changed. default: None
A marker value used to detect changes.

```python
def index_page(self, response):
    for item in response.doc('.item').items():
        self.crawl(item.find('a').attr.url, callback=self.detail_page,
                   itag=item.find('.update-time').text())
```

In the sample, .update-time is used as itag. If it’s not changed, the request would be discarded.
If it has not changed, the page is not re-crawled.

Or you can use itag with Handler.crawl_config to specify the script version if you want to restart all of the tasks.

```python
class Handler(BaseHandler):
    crawl_config = {
        'itag': 'v223'
    }
```

Change the value of itag after you modified the script and click run button again. It doesn’t matter if not set before.

auto_recrawl

when enabled, task would be recrawled every age time. default: False
Automatic re-crawl: when enabled, the task is re-crawled every time age expires.

```python
def on_start(self):
    self.crawl('http://www.example.org/', callback=self.callback,
               age=5*60*60, auto_recrawl=True)
```

The page would be re-crawled every 5 hours (its age).

method

HTTP method to use. default: GET
The HTTP request method.

params

dictionary of URL parameters to append to the URL.
Query parameters for GET requests.

```python
def on_start(self):
    self.crawl('http://httpbin.org/get', callback=self.callback,
               params={'a': 123, 'b': 'c'})
    self.crawl('http://httpbin.org/get?a=123&b=c', callback=self.callback)
```

The two requests are the same.

data

the body to attach to the request. If a dictionary is provided, form-encoding will take place.
The body data for POST requests.

```python
def on_start(self):
    self.crawl('http://httpbin.org/post', callback=self.callback,
               method='POST', data={'a': 123, 'b': 'c'})
```

files

dictionary of {field: {filename: 'content'}} files to multipart upload.
The files to upload.
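A minimal sketch of the files format described above; the field name, file name, and content below are illustrative placeholders:

```python
def on_start(self):
    self.crawl('http://httpbin.org/post', callback=self.callback,
               method='POST',
               files={'upload': {'report.txt': 'content of the file'}})
```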

user_agent

the User-Agent of the request

headers

dictionary of headers to send.

cookies

dictionary of cookies to attach to this request.
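As a sketch, user_agent, headers, and cookies can be combined in a single call (all values below are illustrative):

```python
def on_start(self):
    self.crawl('http://httpbin.org/get', callback=self.callback,
               user_agent='Mozilla/5.0 (compatible; my-spider)',
               headers={'Referer': 'http://www.example.org/'},
               cookies={'session_id': 'abc123'})
```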

connect_timeout

timeout for initial connection in seconds. default: 20

timeout

maximum time in seconds to fetch the page. default: 120
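A sketch raising both timeouts for a slow host (the URL and values are arbitrary):

```python
def on_start(self):
    self.crawl('http://www.example.org/big-page', callback=self.callback,
               connect_timeout=50, timeout=200)
```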

allow_redirects

follow 30x redirects. default: True
Controls whether redirects such as 302 are followed.

validate_cert

For HTTPS requests, validate the server’s certificate? default: True
Whether to validate the server's HTTPS certificate; set it to False to ignore certificate errors.
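A sketch for a host with a self-signed certificate (the URL is a placeholder):

```python
def on_start(self):
    self.crawl('https://self-signed.example.org/', callback=self.callback,
               validate_cert=False)  # skip HTTPS certificate verification
```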

proxy

proxy server of username:password@hostname:port to use, only http proxy is supported currently.

```python
class Handler(BaseHandler):
    crawl_config = {
        'proxy': 'localhost:8080'
    }
```

Handler.crawl_config can be used with proxy to set a proxy for the whole project.
Sets the proxy.

etag

use HTTP Etag mechanism to pass the process if the content of the page is not changed. default: True
Uses the page's ETag to skip processing when the page has not changed.

last_modified

use HTTP Last-Modified header mechanism to pass the process if the content of the page is not changed. default: True
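Both etag and last_modified default to True; a sketch disabling them for a server whose headers are unreliable (the URL is a placeholder):

```python
def on_start(self):
    self.crawl('http://www.example.org/', callback=self.callback,
               etag=False, last_modified=False)  # always fetch the full page
```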

fetch_type

set to js to enable JavaScript fetcher. default: None
Set to 'js' to fetch with the JavaScript engine and work with the rendered page.

js_script

JavaScript to run before or after the page is loaded, wrapped in a function like function() { document.write("binux"); }.
A script executed against the fetched page.

```python
def on_start(self):
    self.crawl('http://www.example.org/', callback=self.callback,
               fetch_type='js', js_script='''
               function() {
                   window.scrollTo(0, document.body.scrollHeight);
                   return 123;
               }
               ''')
```

The script would scroll the page to bottom. The value returned in function could be captured via Response.js_script_result.

js_run_at

run JavaScript specified via js_script at document-start or document-end. default: document-end
Whether the script runs at the start or at the end of the document (default: document-end).

js_viewport_width/js_viewport_height

set the size of the viewport for the JavaScript fetcher of the layout process.
The size of the JavaScript fetcher's viewport.

load_images

load images when JavaScript fetcher enabled. default: False
Whether to load images when fetching with the JavaScript engine.
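A sketch enabling image loading together with the JavaScript fetcher (the URL is a placeholder):

```python
def on_start(self):
    self.crawl('http://www.example.org/gallery', callback=self.callback,
               fetch_type='js', load_images=True)
```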

save

an object passed to the callback method, which can be accessed via response.save.
Used to pass variables between callback methods.

```python
def on_start(self):
    self.crawl('http://www.example.org/', callback=self.callback,
               save={'a': 123})

def callback(self, response):
    return response.save['a']
```

123 would be returned in callback

taskid

unique id to identify the task, default is the MD5 check code of the URL, can be overridden by method def get_taskid(self, task)
A unique identifier for the task, used to deduplicate messages in the request queue.

```python
import json
from pyspider.libs.utils import md5string

def get_taskid(self, task):
    return md5string(task['url'] + json.dumps(task['fetch'].get('data', '')))
```

Only the url is md5-ed as the taskid by default; the code above adds the data of a POST request as part of the taskid.

force_update

force update task params even if the task is in ACTIVE status.
Forces the task parameters to be updated.

cancel

cancel a task; it should be used with force_update to cancel an active task. To cancel an auto_recrawl task, you should set auto_recrawl=False as well.
Cancels a task.
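A sketch, following the rule above, that cancels a previously scheduled auto-recrawl task (the URL is a placeholder):

```python
def on_start(self):
    self.crawl('http://www.example.org/', callback=self.callback,
               force_update=True,   # required to touch a task in ACTIVE status
               cancel=True,
               auto_recrawl=False)  # also stop the auto_recrawl cycle
```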

@config(**kwargs)

default parameters of self.crawl when use the decorated method as callback. For example:
Default parameters can be passed through @config.

```python
@config(age=15 * 60)
def index_page(self, response):
    self.crawl('http://www.example.org/list-1.html', callback=self.index_page)
    self.crawl('http://www.example.org/product-233', callback=self.detail_page)

@config(age=10 * 24 * 60 * 60)
def detail_page(self, response):
    return {...}
```

The age of list-1.html is 15 minutes, while the age of product-233 is 10 days: because the callback of product-233 is detail_page, it shares the config of detail_page.

Handler.crawl_config = {}

default parameters of self.crawl for the whole project. The parameters in crawl_config for scheduler (priority, retries, exetime, age, itag, force_update, auto_recrawl, cancel) will be joined when the task created, the parameters for fetcher and processor will be joined when executed. You can use this mechanism to change the fetch config (e.g. cookies) afterwards.
Put shared settings here to make them apply to the whole project.

```python
class Handler(BaseHandler):
    crawl_config = {
        'headers': {
            'User-Agent': 'GoogleBot',
        }
    }

    ...
```

Response

Response.url

final URL.

Response.text

Content of response, in unicode.
Returns the page source as text.
if Response.encoding is None and chardet module is available, encoding of content will be guessed.

Response.content

Content of response, in bytes.
Returns the page source as bytes.

Response.doc

A PyQuery object of the response's content. Links are made absolute by default.
Parses the page using the PyQuery library.
Refer to the documentation of PyQuery: https://pythonhosted.org/pyquery/

It’s important that I will repeat, refer to the documentation of PyQuery: https://pythonhosted.org/pyquery/
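A sketch of typical Response.doc usage in a callback (the selectors are illustrative):

```python
def detail_page(self, response):
    return {
        'title': response.doc('title').text(),
        'links': [a.attr.href for a in response.doc('a[href]').items()],
    }
```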

Response.etree

An lxml object of the response's content.
Returns an lxml object.

Response.json

The JSON-encoded content of the response, if any.
The response parsed as JSON.

Response.status_code

The HTTP status code, e.g. 200 or 302.

Response.orig_url

If there is any redirection during the request, here is the url you just submitted via self.crawl.
If redirected, this holds the original URL.

Response.headers

A case insensitive dict holds the headers of response.

Response.cookies

Response.error

Error message when the fetch failed.

Response.time

Time used during fetching.

Response.ok

True if status_code is 200 and no error.
Indicates whether the request succeeded.

Response.encoding

Encoding of Response.content.
Used to specify the encoding.
If Response.encoding is None, encoding will be guessed by header or content or chardet(if available).
Setting the encoding manually overrides the guessed encoding.

Response.save

The object saved by self.crawl API
Carries the parameters passed between callbacks via self.crawl(save=...).

Response.js_script_result

content returned by JS script
Holds the value returned by the JS script.

Response.raise_for_status()

Raise HTTPError if status code is not 200 or Response.error exists.
Raises an error for failed responses.

self.send_message

Sends a message to a project; it is received by that project's on_message callback. Commonly used to emit several results from a single response.
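A minimal sketch of this pattern (the selector and field names are illustrative): each message is delivered to on_message, and giving every message its own url makes each one a separate result:

```python
def detail_page(self, response):
    for i, each in enumerate(response.doc('.product').items()):
        self.send_message(self.project_name, {
            'name': each.text(),
        }, url='%s#%s' % (response.url, i))  # unique url per message

def on_message(self, project, msg):
    return msg
```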

@every(minutes=0, seconds=0)

Frequently used for scheduled, periodic crawls.
The decorated method will be called every given number of minutes or seconds.

```python
@every(minutes=24 * 60)
def on_start(self):
    for url in urllist:
        self.crawl(url, callback=self.index_page)
```

Runs once a day.
The urls would be restarted every 24 hours. Note that, if age is also used and the period is longer than @every, the crawl request would be discarded as it's regarded as not changed:

```python
@every(minutes=24 * 60)
def on_start(self):
    self.crawl('http://www.example.org/', callback=self.index_page)

@config(age=10 * 24 * 60 * 60)
def index_page(self):
    ...
```

Even though the crawl request is triggered every day, it is discarded and the page is only actually re-crawled every 10 days.

4 Frequently Asked Questions

Unreadable Code (乱码) Returned from Phantomjs

Phantomjs doesn't support gzip; don't set the Accept-Encoding header to gzip.
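One way to make sure gzip is never requested (the identity value is my assumption, not from the original text):

```python
class Handler(BaseHandler):
    crawl_config = {
        'headers': {
            # ask for an uncompressed response; PhantomJS cannot decode gzip
            'Accept-Encoding': 'identity',
        }
    }
```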

How to Delete a Project?

Set group to delete and status to STOP, then wait 24 hours. You can change the time before a project is deleted via scheduler.DELETE_TIME.

How to Restart a Project?

Why
This happens after you modify a script and want to crawl everything again with a new strategy, but the age of the URLs has not expired, so the scheduler discards all of the new requests.
Solution
1. Create a new project.
2. Use an itag within Handler.crawl_config to specify the version of your script.

What does the progress bar mean on the dashboard?

When you move the mouse over the progress bar, you can see an explanation.
For 5m, 1h and 1d, the numbers are the events triggered in the last 5 minutes, 1 hour and 1 day. For the all progress bar, they are the number of total tasks in the corresponding status.
Only the tasks in DEBUG/RUNNING status will show the progress.
