爬虫入门到精通_框架篇14(PySpider架构概述及用法详解)

官方文档
Sample Code:

from pyspider.libs.base_handler import *class Handler(BaseHandler):crawl_config = {}# minutes=24 * 60:每隔一天重新爬取@every(minutes=24 * 60)def on_start(self):self.crawl('http://scrapy.org/', callback=self.index_page)# age:过期时间@config(age=10 * 24 * 60 * 60)def index_page(self, response):for each in response.doc('a[href^="http"]').items():self.crawl(each.attr.href, callback=self.detail_page)def detail_page(self, response):return {"url": response.url,"title": response.doc('title').text(),}

1 架构(Architecture)

在这里插入图片描述
scheduler:调度器
feter:请求器,请求网页
processor:用来处理数据,处理的url可以重新调用调度器
monitor&webui:监视器与webUI

scheduler

维护request队列,默认是存储在本地的数据库

feter

在这里插入图片描述
可以用两种模式:一种是直接请求,一种是js的请求。

processor

处理数据,将url重新封装起来,可以重新调用调度器。

2 Command Line

Global Config

global options work for all subcommands.

Usage: pyspider [OPTIONS] COMMAND [ARGS]...A powerful spider system in python.Options:-c, --config FILENAME    a json file with default values for subcommands.{“webui”: {“port”:5001}}--logging-config TEXT    logging config file for built-in python loggingmodule  [default: pyspider/pyspider/logging.conf]--debug                  debug mode--queue-maxsize INTEGER  maxsize of queue--taskdb TEXT            database url for taskdb, default: sqlite--projectdb TEXT         database url for projectdb, default: sqlite--resultdb TEXT          database url for resultdb, default: sqlite--message-queue TEXT     connection url to message queue, default: builtinmultiprocessing.Queue--amqp-url TEXT          [deprecated] amqp url for rabbitmq. please use--message-queue instead.--beanstalk TEXT         [deprecated] beanstalk config for beanstalk queue.please use --message-queue instead.--phantomjs-proxy TEXT   phantomjs proxy ip:port--data-path TEXT         data dir path--version                Show the version and exit.--help                   Show this message and exit.

config

pyspider启动的时候的一些配置选项

{"taskdb": "mysql+taskdb://username:password@host:port/taskdb","projectdb": "mysql+projectdb://username:password@host:port/projectdb","resultdb": "mysql+resultdb://username:password@host:port/resultdb","message_queue": "amqp://username:password@host:port/%2F","webui": {"username": "some_name","password": "some_passwd","need-auth": true}
}

项目启动时,会生成上述3个db,在当前项目的data目录下
在这里插入图片描述
webui:指定好界面访问的用户名密码
在这里插入图片描述
在这里插入图片描述
此时弹出:
在这里插入图片描述

all

运行所有的组件

Usage: pyspider all [OPTIONS]Run all the components in subprocess or threadOptions:--fetcher-num INTEGER         instance num of fetcher--processor-num INTEGER       instance num of processor--result-worker-num INTEGER   instance num of result worker--run-in [subprocess|thread]  run each components in thread or subprocess.always using thread for windows.--help                        Show this message and exit.

在这里插入图片描述

one

在一个进程下运行所有的组件

Usage: pyspider one [OPTIONS] [SCRIPTS]...One mode not only means all-in-one, it runs every thing in one processover tornado.ioloop, for debug purposeOptions:-i, --interactive  enable interactive mode, you can choose crawl url.--phantomjs        enable phantomjs, will spawn a subprocess for phantomjs--help             Show this message and exit.

bench

请求测试

Usage: pyspider bench [OPTIONS]Run Benchmark test. In bench mode, in-memory sqlite database is usedinstead of on-disk sqlite database.Options:--fetcher-num INTEGER         instance num of fetcher--processor-num INTEGER       instance num of processor--result-worker-num INTEGER   instance num of result worker--run-in [subprocess|thread]  run each components in thread or subprocess.always using thread for windows.--total INTEGER               total url in test page--show INTEGER                show how many urls in a page--help                        Show this message and exit.

已下组件都是单独开启

scheduler

Usage: pyspider scheduler [OPTIONS]Run Scheduler, only one scheduler is allowed.Options:--xmlrpc / --no-xmlrpc--xmlrpc-host TEXT--xmlrpc-port INTEGER--inqueue-limit INTEGER  size limit of task queue for each project, taskswill been ignored when overflow--delete-time INTEGER    delete time before marked as delete--active-tasks INTEGER   active log size--loop-limit INTEGER     maximum number of tasks due with in a loop--scheduler-cls TEXT     scheduler class to be used.--help                   Show this message and exit.

phantomjs

Usage: run.py phantomjs [OPTIONS] [ARGS]...Run phantomjs fetcher if phantomjs is installed.Options:--phantomjs-path TEXT  phantomjs path--port INTEGER         phantomjs port--auto-restart TEXT    auto restart phantomjs if crashed--help                 Show this message and exit.

fetcher

Usage: pyspider fetcher [OPTIONS]Run Fetcher.Options:--xmlrpc / --no-xmlrpc--xmlrpc-host TEXT--xmlrpc-port INTEGER--poolsize INTEGER      max simultaneous fetches--proxy TEXT            proxy host:port--user-agent TEXT       user agent--timeout TEXT          default fetch timeout--fetcher-cls TEXT      Fetcher class to be used.--help                  Show this message and exit.

processor

Usage: pyspider processor [OPTIONS]Run Processor.Options:--processor-cls TEXT  Processor class to be used.--help                Show this message and exit.

result_worker

Usage: pyspider result_worker [OPTIONS]Run result worker.Options:--result-cls TEXT  ResultWorker class to be used.--help             Show this message and exit.

webui

Usage: pyspider webui [OPTIONS]Run WebUIOptions:--host TEXT            webui bind to host--port INTEGER         webui bind to host--cdn TEXT             js/css cdn server--scheduler-rpc TEXT   xmlrpc path of scheduler--fetcher-rpc TEXT     xmlrpc path of fetcher--max-rate FLOAT       max rate for each project--max-burst FLOAT      max burst for each project--username TEXT        username of lock -ed projects--password TEXT        password of lock -ed projects--need-auth            need username and password--webui-instance TEXT  webui Flask Application instance to be used.--help                 Show this message and exit.

在这里插入图片描述
在这里插入图片描述

3 API Reference

self.crawl(url, **kwargs)

self.crawl is the main interface to tell pyspider which url(s) should be crawled.
发起请求的函数

Parameters:

url

the url or url list to be crawled.
请求的网页

callback

the method to parse the response. _default: call _
回调函数

def on_start(self):self.crawl('http://scrapy.org/', callback=self.index_page)

the following parameters are optional

age

the period of validity of the task. The page would be regarded as not modified during the period. default: -1(never recrawl)
请求的过期时间,单位秒,过期了才重新请求

@config(age=10 * 24 * 60 * 60)
def index_page(self, response):...

Every pages parsed by the callback index_page would be regarded not changed within 10 days. If you submit the task within 10 days since last crawled it would be discarded.

priority

the priority of task to be scheduled, higher the better. default: 0
指定爬取优先级.

def index_page(self):self.crawl('http://www.example.org/page2.html', callback=self.index_page)self.crawl('http://www.example.org/233.html', callback=self.detail_page,priority=1)

The page 233.html would be crawled before page2.html. Use this parameter can do a BFS and reduce the number of tasks in queue(which may cost more memory resources).

exetime

the executed time of task in unix timestamp. default: 0(immediately)
指定执行时间

import time
def on_start(self):self.crawl('http://www.example.org/', callback=self.callback,exetime=time.time()+30*60)

The page would be crawled 30 minutes later.

retries

retry times while failed. default: 3
失败之后的重试次数,默认时3,最大10.

itag

a marker from frontier page to reveal the potential modification of the task. It will be compared to its last value, recrawl when it’s changed. default: None
标识符,

def index_page(self, response):for item in response.doc('.item').items():self.crawl(item.find('a').attr.url, callback=self.detail_page,itag=item.find('.update-time').text())

In the sample, .update-time is used as itag. If it’s not changed, the request would be discarded.
如果没有改变,不进行重新爬取

Or you can use itag with Handler.crawl_config to specify the script version if you want to restart all of the tasks.

class Handler(BaseHandler):crawl_config = {'itag': 'v223'}

Change the value of itag after you modified the script and click run button again. It doesn’t matter if not set before.

auto_recrawl

when enabled, task would be recrawled every age time. default: False
自动重爬,如果开启,age过期后自动重爬。

def on_start(self):self.crawl('http://www.example.org/', callback=self.callback,age=5*60*60, auto_recrawl=True)

The page would be restarted every age 5 hours.

method

HTTP method to use. default: GET
HTTP的请求方法。

params

dictionary of URL parameters to append to the URL.
get请求的一些参数.

def on_start(self):self.crawl('http://httpbin.org/get', callback=self.callback,params={'a': 123, 'b': 'c'})self.crawl('http://httpbin.org/get?a=123&b=c', callback=self.callback)

The two requests are the same.

data

the body to attach to the request. If a dictionary is provided, form-encoding will take place.
post请求的一些data.

def on_start(self):self.crawl('http://httpbin.org/post', callback=self.callback,method='POST', data={'a': 123, 'b': 'c'})

files

dictionary of {field: {filename: ‘content’}} files to multipart upload.
上传的文件.

user_agent

the User-Agent of the request

headers

dictionary of headers to send.

cookies

dictionary of cookies to attach to this request.

connect_timeout

timeout for initial connection in seconds. default: 20

timeout

maximum time in seconds to fetch the page. default: 120

allow_redirects

follow 30x redirect default: True
一些网页的302的跳转是否重定向设置.

validate_cert

For HTTPS requests, validate the server’s certificate? default: True
HTTPS的证书请求忽略.

proxy

proxy server of username:password@hostname:port to use, only http proxy is supported currently.

class Handler(BaseHandler):crawl_config = {'proxy': 'localhost:8080'}

Handler.crawl_config can be used with proxy to set a proxy for whole project.
设置代理.

etag

use HTTP Etag mechanism to pass the process if the content of the page is not changed. default: True
判断页面更新爬取情况.

last_modified

use HTTP Last-Modified header mechanism to pass the process if the content of the page is not changed. default: True

fetch_type

set to js to enable JavaScript fetcher. default: None
设置为js请求.渲染查找后的页面.

js_script

JavaScript run before or after page loaded, should been wrapped by a function like function() { document.write(“binux”); }.
网页爬取后,执行脚本.

def on_start(self):self.crawl('http://www.example.org/', callback=self.callback,fetch_type='js', js_script='''function() {window.scrollTo(0,document.body.scrollHeight);return 123;}''')

The script would scroll the page to bottom. The value returned in function could be captured via Response.js_script_result.

js_run_at

run JavaScript specified via js_script at document-start or document-end. default: document-end
把脚本加到当我document前面还是后面.

js_viewport_width/js_viewport_height

set the size of the viewport for the JavaScript fetcher of the layout process.
JS视窗的大小.

load_images

load images when JavaScript fetcher enabled. default: False
加载时是否加载图片.

save

a object pass to the callback method, can be visit via response.save.
用来在多个函数直接传递变量的参数

def on_start(self):self.crawl('http://www.example.org/', callback=self.callback,save={'a': 123})def callback(self, response):return response.save['a']

123 would be returned in callback

taskid

unique id to identify the task, default is the MD5 check code of the URL, can be overridden by method def get_taskid(self, task)
指定task的唯一标识码,ruquest队列的消息去重.

import json
from pyspider.libs.utils import md5string
def get_taskid(self, task):return md5string(task['url']+json.dumps(task['fetch'].get('data', '')))

Only url is md5 -ed as taskid by default, the code above add data of POST request as part of taskid.

force_update

force update task params even if the task is in ACTIVE status.
强制更新.

cancel

cancel a task, should be used with force_update to cancel a active task. To cancel an auto_recrawl task, you should set auto_recrawl=False as well.
取消任务.

@config(**kwargs)

default parameters of self.crawl when use the decorated method as callback. For example:
config里可以传递参数.

@config(age=15*60)
def index_page(self, response):self.crawl('http://www.example.org/list-1.html', callback=self.index_page)self.crawl('http://www.example.org/product-233', callback=self.detail_page)@config(age=10*24*60*60)
def detail_page(self, response):return {...}

age of list-1.html is 15min while the age of product-233.html is 10days. Because the callback of product-233.html is detail_page, means it’s a detail_page so it shares the config of detail_page.

Handler.crawl_config = {}

default parameters of self.crawl for the whole project. The parameters in crawl_config for scheduler (priority, retries, exetime, age, itag, force_update, auto_recrawl, cancel) will be joined when the task created, the parameters for fetcher and processor will be joined when executed. You can use this mechanism to change the fetch config (e.g. cookies) afterwards.
将一些配置放入其中,使其全局生效.

class Handler(BaseHandler):crawl_config = {'headers': {'User-Agent': 'GoogleBot',}}...

Response

Response.url

final URL.

Response.text

Content of response, in unicode.
返回网页源代码.
if Response.encoding is None and chardet module is available, encoding of content will be guessed.

Response.content

Content of response, in bytes.
返回网页源代码,二进制.

Response.doc

A PyQuery object of the response’s content. Links have made as absolute by default.
网页解析,调用 PyQuery的解析库.
Refer to the documentation of PyQuery: https://pythonhosted.org/pyquery/

It’s important that I will repeat, refer to the documentation of PyQuery: https://pythonhosted.org/pyquery/

Response.etree

A lxml object of the response’s content.
返回lxml对象

Response.json

The JSON-encoded content of the response, if any.
转换为Json格式.

Response.status_code

状态码:200,300等等.

Response.orig_url

If there is any redirection during the request, here is the url you just submit via self.crawl.
有重定向,就显示最原始的url.

Response.headers

A case insensitive dict holds the headers of response.

Response.cookies

Response.error

Messages when fetch error

Response.time

Time used during fetching.

Response.ok

True if status_code is 200 and no error.
判断请求是否OK.

Response.encoding

Encoding of Response.content.
用来指定编码
If Response.encoding is None, encoding will be guessed by header or content or chardet(if available).
Set encoding of content manually will overwrite the guessed encoding.

Response.save

The object saved by self.crawl API
指定不同函数直接传递参数.

Response.js_script_result

content returned by JS script
接受JS script的结果.

Response.raise_for_status()

Raise HTTPError if status code is not 200 or Response.error exists.
抛出一些错误.

self.send_message

部署过程中,消除队列的配置.

@every(minutes=0, seconds=0)

定时爬取中用的比较多.
method will been called every minutes or seconds

@every(minutes=24 * 60)
def on_start(self):for url in urllist:self.crawl(url, callback=self.index_page)

一天执行一次
The urls would be restarted every 24 hours. Note that, if age is also used and the period is longer then @every, the crawl request would be discarded as it’s regarded as not changed:

@every(minutes=24 * 60)
def on_start(self):self.crawl('http://www.example.org/', callback=self.index_page)@config(age=10 * 24 * 60 * 60)
def index_page(self):...

Even though the crawl request triggered every day, but it’s discard and only restarted every 10 days.

4 Frequently Asked Questions

Unreadable Code (乱码) Returned from Phantomjs

Phantomjs doesn’t support gzip, don’t set Accept-Encoding header with gzip.

How to Delete a Project?

set group to delete and status to STOP then wait 24 hours. You can change the time before a project deleted via
scheduler.DELETE_TIME.
status to STOP:
在这里插入图片描述
group to delete:
在这里插入图片描述

How to Restart a Project?

Why
It happens after you modified a script, and wants to crawl everything again with new strategy. But as the age of urls are not expired. Scheduler will discard all of the new requests.
Solution
1.Create a new project.
2.Using a itag within Handler.crawl_config to specify the version of your script.

What does the progress bar mean on the dashboard?

When mouse move onto the progress bar, you can see the explaintions.
For 5m, 1h, 1d the number are the events triggered in 5m, 1h, 1d. For all progress bar, they are the number of total tasks in correspond status.
Only the tasks in DEBUG/RUNNING status will show the progress.
在这里插入图片描述

本文来自互联网用户投稿,该文观点仅代表作者本人,不代表本站立场。本站仅提供信息存储空间服务,不拥有所有权,不承担相关法律责任。如若转载,请注明出处:http://www.mzph.cn/news/740521.shtml

如若内容造成侵权/违法违规/事实不符,请联系多彩编程网进行投诉反馈email:809451989@qq.com,一经查实,立即删除!

相关文章

【C语言】Leetcode 66. 加一

文章目录 题目思路代码呈现 题目 链接: link 思路 题目的意思是把一个数以数组的形式输入,然后在这个书的末尾加一,应该是要考察,如果数加一后位数发生变化,比如9991成1000,3位数变成4位数怎么处理的。 并且我们可以…

uniapp 微信小程序和h5处理文件(pdf)下载+保存到本地+预览功能

uniapp实现微信小程序下载资源功能和h5有很大的不同,后台需返回blob文件流 1.微信小程序实现下载资源功能 步骤1:下载文件 uni.downloadFile({url:url,//调接口返回urlsuccess:(res)>{uni.hideLoading();if(res.statusCode200){var tempFilePath …

python调用clickhouse

(作者:陈玓玏) 使用clickhouse-driver包,先通过pip install clickhouse-driver安装包,再通过以下代码执行sql。 from clickhouse_driver import Client client Client(host10.43.234.214, port9000, userclickhou…

curl c++ 实现HTTP GET和POST请求

环境配置 curl //DV2020T环境下此步骤可省略 https://curl.se/download/ 笔者安装为7.85.0版本 ./configure --without-ssl make sudo make install sudo rm /usr/local/lib/curl 系统也有curl库,为防止冲突,删去编译好的curl库。 对以json数据的解析使…

主流开发语言和开发环境、程序员如何选择职业赛道?

🌟 前言 欢迎来到我的技术小宇宙!🌌 这里不仅是我记录技术点滴的后花园,也是我分享学习心得和项目经验的乐园。📚 无论你是技术小白还是资深大牛,这里总有一些内容能触动你的好奇心。🔍 &#x…

C++开发基础——IO操作与文件流

一,基础概念 C的IO操作是基于字节流,并且IO操作与设备无关,同一种IO操作可以在不同类型的设备上使用。 C的流是指流入/流出程序的字节序列,在输入操作中数据从外部设备(键盘,文件,网络等)流入程序&#x…

apisix lua插件使用redis

引入 local redis require("resty.redis") local red redis:new() local redis_config { host "redis_v1", port "6379", pass "123456", db "0" } local function conn_redis() local ok, err red:connect(re…

安卓Java面试题 81- 90

81. 简述Android数字签名?Android系统要求所有的应用必须被证书进行数字签名之后才能进行安装。Android系统通过该证书来确认应用的作者,该证书是不需要权威机构认证的,一般情况下应用都是用开发者的自签名证书,该证书是确保应用程序和应用程序作者之间建立信任关系,而不是…

12 list的使用

文档介绍 文档介绍 1.list是可以在常数范围内的任意位置进行插入和删除的序列式容器,并且该容器可以前后双向迭代 2.list的底层是带头双向链表循环结构,双向链表中每个元素存储在互不相关的独立节点中,在节点中通过指针指向其前一个元素和…

大话设计模式 :UML类图 原版部分

目录 原书部分总结各符号继承关系实现接口关联关系聚合关系组合关系依赖关系 原书部分 总结 各符号 ‘’ 表示public ‘-’ 表示private ‘#’表示protected 棒棒糖表示法 类内实现的接口 用棒棒糖的形状在外部具体实现 继承关系 实现接口 关联关系 聚合关系 组合关系 依赖…

Redis核心数据结构之压缩列表(二)

压缩列表 压缩列表节点的构成 encoding 节点的encoding属性记录了节点的content属性所保存数据的类型及长度: 1.一字节、两字节或者五字节长,值得最高位为00、01或者10的是字节数组编码:这种编码表示节点的content属性保存着字节数组,数组的长度由编…

SwiftUI中的Sheet推出新页面

SwiftUI中的Sheet推出新页面 记录一下SwiftUI如何从下往上推出新的页面 import SwiftUIstruct SheetBootCamp: View {State var showSheet falsevar body: some View {ZStack{Color.green.ignoresSafeArea()Button(action: {showSheet.toggle()}, label: {/*START_MENU_TOKEN…

Linux chattr命令教程:如何改变文件或目录的属性(附案例详解和注意事项)

Linux chattr命令介绍 chattr命令是change file attributes on a Linux file system的缩写,主要用于改变文件或目录的属性。这个命令允许管理员控制谁可以修改文件或目录,或者在什么情况下可以修改。 Linux chattr命令适用的Linux版本 chattr命令在大…

攀拓(PAT)2024年春季 甲级题解

A-1 Braille Recognition 两层循环遍历&#xff0c;用数组计数 #include <bits/stdc.h> using namespace std; int a[10]; int main() {int n,m;cin>>n>>m;string s[110];for(int i0;i<n;i)cin>>s[i];for(int i0;i<n-2;i) {for(int j0;j<m-…

小迪安全39WEB 攻防-通用漏洞CSRFSSRF协议玩法内网探针漏洞利用

#知识点&#xff1a; 逻辑漏洞 1、CSRF-原理&危害&探针&利用等 2、SSRF-原理&危害&探针&利用等 3、CSRF&SSRF-黑盒下漏洞探针点 #详细点&#xff1a; CSRF 全称&#xff1a;Cross-site request forgery&#xff0c;即&#xff0c;跨站请求…

ThingsBoard开源物联网平台介绍

1. Thingsboard 简介 ThingsBoard是一个基于Java的开源物联网平台&#xff0c;旨在实现物联网项目的快速开发、管理和扩展。它使用行业标准的物联网协议&#xff08;MQTT、CoAP和HTTP&#xff09;实现设备连接&#xff0c;并支持云和本地部署。ThingsBoard结合了可扩展性、容错…

Springboot+vue的疫情居家办公系统(有报告)。Javaee项目,springboot vue前后端分离项目。

演示视频&#xff1a; Springbootvue的疫情居家办公系统&#xff08;有报告&#xff09;。Javaee项目&#xff0c;springboot vue前后端分离项目。 项目介绍&#xff1a; 采用M&#xff08;model&#xff09;V&#xff08;view&#xff09;C&#xff08;controller&#xff09…

上海计算机学会 2023年11月月赛 丙组T5 推箱子(数学 思维 排序)

第五题&#xff1a;T5推箱子 标签&#xff1a;排序、数学、思维题意&#xff1a;给定 t t t组数据&#xff0c;每组数据给定长度为 n n n的字符串&#xff0c; 表示箱子&#xff0c; _ \_ _表示空格&#xff0c;求把箱子都推到一起&#xff08;即两两箱子之间没有空格&#…

Ubuntu18.04 安装搜狗输入法

一. 概述 自己的Ubuntu 18.04系统配置中文搜狗输入法&#xff0c;安装步骤&#xff0c;亲测可用 二. 安装步骤 2.1 确认系统版本和CPU架构 查看Ubuntu系统版本号&#xff0c;通过命令 lsb_release -a wuubuntume:~$ lsb_release -a No LSB modules are available. Distr…

安装Android Studio遇到Unable to access Android SDK add-on list的错误

第一次安装android studio的时候&#xff0c;提示&#xff1a;unable to access Android sdk add-on list 解决办法 这个错误一般是android studoi代理没有设置导致的&#xff0c;需要在setting里面设置&#xff1a; 点击Android Studio - Preferences&#xff0c;在 Appeara…