Retrieving Webpages Through Python Programming

The internet, and the World Wide Web (WWW) in particular, is probably the most prominent source of information today. Most of that information is retrievable through HTTP. HTTP was originally invented to share pages of hypertext (hence the name Hypertext Transfer Protocol), which eventually started the WWW.

An HTTP retrieval takes place every time we request a web page through our devices. The exciting part is that we can perform these operations programmatically to automate the retrieval and processing of information.

This article is an excerpt from the book Python Automation Cookbook, Second Edition by Jaime Buelta, a comprehensive and updated edition that helps you develop a sharp understanding of the fundamentals required to automate business processes through real-world tasks, such as developing your first web scraping application, analyzing information to generate spreadsheet reports with graphs, and communicating through automatically generated emails.

In this article, we will learn how to use Python to fetch web pages over HTTP. Python has an HTTP client in its standard library. Further, the fantastic requests module makes obtaining web pages very convenient.


Interacting with forms

Forms are a common element of web pages. They are a way of sending values to a web page, for example, to create a new comment on a blog post or to submit a purchase.

Browsers present forms so that you can input values and send them in a single action by pressing the submit button or its equivalent. We'll see how to perform this action programmatically in this recipe.


Getting ready

We'll work against the test server at https://httpbin.org/forms/post, which lets us submit a test form and sends back the submitted information.

The following is an example form to order a pizza:

Figure 1: Rendered form

You can fill in the form manually and see it return the information in JSON format, including extra information such as the browser being used.

The following is the form as rendered in the browser once it has been filled in:

Figure 2: Filled-in form

The following screenshot shows the JSON content that the server returns after the form is submitted:

Figure 3: Returned JSON content

We need to analyze the HTML to see the accepted data for the form. The source code is as follows:

Figure 4: Source code

Check the names of the inputs: custname, custtel, custemail, size (a radio option), topping (a multiselection checkbox), delivery (time), and comments.

How to do it…

1. Import the requests, BeautifulSoup, and re modules:

>>> import requests
>>> from bs4 import BeautifulSoup
>>> import re

2. Retrieve the form page, parse it, and print the input fields. Check that the posting URL is /post (not /forms/post):

>>> response = requests.get('https://httpbin.org/forms/post')
>>> page = BeautifulSoup(response.text, 'html.parser')
>>> form = page.find('form')
>>> {field.get('name') for field in form.find_all(re.compile('input|textarea'))}
{'delivery', 'topping', 'size', 'custemail', 'comments', 'custtel', 'custname'}

3. Note that textarea is a valid input and is defined in the HTML alongside the input elements. Prepare the data to be posted as a dictionary. Check that the values are as defined in the form:

>>> data = {'custname': "Sean O'Connell",
...         'custtel': '123-456-789',
...         'custemail': 'sean@oconnell.ie',
...         'size': 'small',
...         'topping': ['bacon', 'onion'],
...         'delivery': '20:30',
...         'comments': ''}

4. Post the values and check that the response is the same as the one returned in the browser:

>>> response = requests.post('https://httpbin.org/post', data)
>>> response
<Response [200]>
>>> response.json()
{'args': {},
 'data': '',
 'files': {},
 'form': {'comments': '',
          'custemail': 'sean@oconnell.ie',
          'custname': "Sean O'Connell",
          'custtel': '123-456-789',
          'delivery': '20:30',
          'size': 'small',
          'topping': ['bacon', 'onion']},
 'headers': {'Accept': '*/*',
             'Accept-Encoding': 'gzip, deflate',
             'Connection': 'close',
             'Content-Length': '140',
             'Content-Type': 'application/x-www-form-urlencoded',
             'Host': 'httpbin.org',
             'User-Agent': 'python-requests/2.22.0'},
 'json': None,
 'origin': '89.100.17.159',
 'url': 'https://httpbin.org/post'}
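Step 2 asks us to check that the form posts to /post rather than /forms/post, and that check can be sketched as follows. This is a minimal illustration, not the recipe's own code: it uses the standard library's html.parser instead of BeautifulSoup so that it runs without a network connection. The SAMPLE_FORM markup is an assumption, a shortened stand-in for the real page, and FormInspector and inspect_form are hypothetical names.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

# Assumed markup: a shortened stand-in for the page served at
# https://httpbin.org/forms/post, not a verbatim copy of it.
SAMPLE_FORM = """
<form method="post" action="/post">
  <input name="custname">
  <input type="radio" name="size" value="small">
  <textarea name="comments"></textarea>
</form>
"""


class FormInspector(HTMLParser):
    """Record the first form's action and method, plus all named fields."""

    def __init__(self):
        super().__init__()
        self.action = None
        self.method = None
        self.fields = set()

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == 'form' and self.action is None:
            self.action = attrs.get('action', '')
            self.method = attrs.get('method', 'get').lower()
        elif tag in ('input', 'textarea') and 'name' in attrs:
            self.fields.add(attrs['name'])


def inspect_form(html, page_url):
    """Return (absolute posting URL, method, field names) for a form page."""
    inspector = FormInspector()
    inspector.feed(html)
    # The action is resolved relative to the page that served the form.
    post_url = urljoin(page_url, inspector.action)
    return post_url, inspector.method, inspector.fields


post_url, method, fields = inspect_form(
    SAMPLE_FORM, 'https://httpbin.org/forms/post')
print(post_url)  # https://httpbin.org/post
print(method)    # post
```

With the BeautifulSoup object from step 2, the same attributes should be readable as form.get('action') and form.get('method').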

How it works…

The requests module encodes the data and sends it in the configured format. By default, it sends POST data in the application/x-www-form-urlencoded format.
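To make that format concrete, here is a small sketch that builds the equivalent request body with the standard library's urllib.parse.urlencode. It illustrates the application/x-www-form-urlencoded format only and makes no claim about how requests works internally; the variable names are chosen for this example.

```python
from urllib.parse import urlencode

# The same values used in step 3 of the recipe.
data = {
    'custname': "Sean O'Connell",
    'custtel': '123-456-789',
    'custemail': 'sean@oconnell.ie',
    'size': 'small',
    'topping': ['bacon', 'onion'],
    'delivery': '20:30',
    'comments': '',
}

# doseq=True expands the list into one topping=... pair per value.
# Spaces become '+', and characters such as ' @ : are percent-encoded.
body = urlencode(data, doseq=True)
print(body)
print(len(body))  # 140
```

The body is 140 characters long, which agrees with the Content-Length header echoed back in the response from step 4.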

The key aspect here is to respect the format of the form and the values it accepts. Incorrect values can make the server return an error, typically a 400 error, which indicates a problem with the client's request.
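One way to reduce the chance of such an error is to validate the payload against the values the form accepts before posting it. The following is a minimal sketch under stated assumptions: the ALLOWED mapping and the validate helper are hypothetical, and the listed options are assumed for illustration. In practice, they would be read from the value attributes in the form's HTML.

```python
# Assumed options for the choice fields; read the real ones from the form.
ALLOWED = {
    'size': {'small', 'medium', 'large'},
    'topping': {'bacon', 'cheese', 'onion', 'mushroom'},
}


def validate(data, allowed=ALLOWED):
    """Return a list of problems found in data; an empty list means OK."""
    problems = []
    for field, options in allowed.items():
        value = data.get(field, [])
        # Treat single values and lists of values uniformly.
        values = value if isinstance(value, list) else [value]
        for item in values:
            if item not in options:
                problems.append(f'{field}: {item!r} is not an accepted value')
    return problems


good = {'size': 'small', 'topping': ['bacon', 'onion']}
bad = {'size': 'huge', 'topping': ['bacon', 'pineapple']}

print(validate(good))  # []
print(validate(bad))
```

Checking on the client side does not replace the server's own validation, but it surfaces mistakes before a request is sent.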


There's more…

Other than following the format of forms and inputting valid values, the main problem when working with forms is the variety of mechanisms used to prevent spam and abusive behavior. Sites will often require that you download a form before submitting it, typically by embedding a token in the form, to guard against duplicate submissions or Cross-Site Request Forgery (CSRF).

To obtain the specific token, you first need to download the form, as shown in the recipe, obtain the value of the CSRF token, and submit it along with the rest of the form data. Note that the token can have different names; this is just an example:

>>> form.find(attrs={'name': 'token'}).get('value')
'ABCEDF12345'
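Putting those steps together, the following sketch extracts hidden fields from a downloaded form and merges them into the data to be submitted. The markup, the field name token, and the helper names are assumptions made for this example. The inputs listed for the recipe's test form do not include such a field, and real sites choose their own names. The standard library's html.parser is used so that the sketch runs on its own.

```python
from html.parser import HTMLParser

# Hypothetical form that carries a hidden CSRF token field.
SAMPLE_FORM = """
<form method="post" action="/post">
  <input type="hidden" name="token" value="ABCEDF12345">
  <input name="custname">
</form>
"""


class HiddenFieldParser(HTMLParser):
    """Collect the name/value pairs of hidden inputs."""

    def __init__(self):
        super().__init__()
        self.hidden = {}

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if (tag == 'input' and attrs.get('type') == 'hidden'
                and 'name' in attrs):
            self.hidden[attrs['name']] = attrs.get('value', '')


def add_hidden_fields(html, data):
    """Return a copy of data extended with the form's hidden fields."""
    parser = HiddenFieldParser()
    parser.feed(html)
    # Values supplied by the caller take precedence over hidden defaults.
    return {**parser.hidden, **data}


payload = add_hidden_fields(SAMPLE_FORM, {'custname': "Sean O'Connell"})
print(payload['token'])  # ABCEDF12345
```

Because sites commonly tie the token to a session cookie, the payload would normally be posted through the same session that downloaded the form, for example a requests.Session object.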

In this article, we learned how to retrieve a web form, parse it, print its input fields, and submit data to it from Python. We also explored the role and application of the requests, Beautiful Soup, and re modules.

About the Author

Jaime Buelta has been a full-time Python developer since 2010 and is a regular speaker at PyCon Ireland. He has been a professional programmer for over two decades, with exposure to many different technologies throughout his career. He has developed software for a variety of fields and industries, including aerospace, networking and communications, industrial SCADA systems, video game online services, and financial services.


Translated from: https://medium.com/@ODSC/retrieving-webpages-through-python-programming-8f3bae8518a5
