A Guide to Migrating and Automating Chrome Web Scrapers Within Azure

Web scraping is becoming an increasingly desirable skill for many data-science jobs as more companies migrate their processes to the cloud.

As someone who first became interested in data science after scraping my university’s course evaluation catalogue, I can say that this skill single-handedly allowed me to land two internships during my undergraduate program.

Although the approach has its critics, many people use a Chrome WebDriver and the Selenium module to scrape data from websites. While this tool can be very helpful locally, it is difficult to turn these scripts into recurring tasks that can be deployed on large-scale infrastructure. In this article, I am going to guide you through porting your Selenium web scraper to Azure using a virtual machine, and show you how to set up the scraping as a daily recurring task.

Step 1: Setting up the Azure Virtual Machine (VM)

After you have logged into Azure, make your way over to the Virtual Machines section. I won’t walk through every detail of creating the VM, but I will note a few settings that are important for enabling appropriate access between services.

Since I am most familiar with Windows, I used a Windows 10 Pro image for my virtual machine; however, I would imagine that this process could be repeated with other images as well.

Under “Select inbound ports”, make sure to include the HTTPS (443) option so that the automation task has access. If you miss this step, Step 4 of this guide covers the port settings in more detail.

Step 2: Install Python, Chrome, Chromedriver & Required Dependencies

Next, we need to connect to the VM. If you are using a Windows image, you can use RDP (Remote Desktop Protocol) to get access, or, provided an SSH server is enabled on the VM, a client such as PuTTY to SSH into the machine.

We are going to set up our working environment here so that Python and Chrome can get up and running. Install your required version of Python, the latest version of Chrome, and the Chromedriver release that matches that Chrome version. Make note of where these files are saved, as you will need the paths later on.

If you want less maintenance down the road, rename the Chrome Update folder so that Chrome doesn't update automatically, which would otherwise require you to download a newer version of Chromedriver each time. Instructions for doing so can be found here.
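Since the concern here is Chrome and Chromedriver drifting apart, a small check can catch a mismatch before a scheduled run fails. The following is a minimal sketch, assuming you copy in the two version strings yourself (for example from Chrome's "About" page and from running chromedriver.exe with --version); the helper names and the sample version strings are made up for illustration.

```python
import re


def major_version(version_text):
    """Return the leading major version number found in a version string."""
    match = re.search(r"(\d+)\.\d+", version_text)
    if match is None:
        raise ValueError(f"no version number found in {version_text!r}")
    return int(match.group(1))


def versions_match(chrome_version, driver_version):
    """True when Chrome and Chromedriver share the same major version."""
    return major_version(chrome_version) == major_version(driver_version)


# Sample strings for illustration only; paste in your own.
chrome = "Google Chrome 85.0.4183.102"
driver = "ChromeDriver 85.0.4183.87"
print(versions_match(chrome, driver))
```

Comparing only the major version is a deliberate simplification: that is the part of the version number the two programs are expected to share.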

Step 3: Python Script

For the sake of simplicity, we are going to use a basic Python script that just loads Stack Overflow. Obviously, this could easily be done with the requests library; however, since many scrapers require JavaScript interactivity with the web page, I’ll assume that your script is longer and more complex.

Let's call the following script scrape.py:

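from selenium import webdriver

# Location of the Chromedriver executable noted in Step 2
DRIVER_PATH = "/path/to/chromedriver.exe"

def scrape():
    # executable_path is accepted by older Selenium releases;
    # newer ones expect a Service object instead
    driver = webdriver.Chrome(executable_path=DRIVER_PATH)
    driver.get('https://stackoverflow.com/')
    # Close the browser so Chrome processes don't pile up across runs
    driver.quit()

if __name__ == "__main__":
    scrape()
    print("Script Executed Correctly.")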

First, make sure that the correct output is printed so you know your script works locally within the VM.
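When the script is later triggered from outside the VM, there may be no interactive desktop session for Chrome to draw a window in, so it can help to test a headless run locally as well. The sketch below is an assumption layered on top of the script above, not part of the original: build_chrome_args and make_driver are hypothetical helper names, and make_driver assumes a Selenium 4 release where the driver path is passed through a Service object.

```python
def build_chrome_args(headless=True, window_size=(1920, 1080)):
    """Return the Chrome command-line switches for a scraping session."""
    width, height = window_size
    args = [f"--window-size={width},{height}"]
    if headless:
        # Run Chrome without opening a visible browser window.
        args.append("--headless")
    return args


def make_driver(driver_path, headless=True):
    """Create a Chrome driver with the switches above.

    Selenium is imported inside the function so that build_chrome_args
    can be used on a machine without Selenium installed.
    """
    from selenium import webdriver
    from selenium.webdriver.chrome.service import Service

    options = webdriver.ChromeOptions()
    for arg in build_chrome_args(headless=headless):
        options.add_argument(arg)
    # Selenium 4 style; older releases take executable_path=driver_path.
    return webdriver.Chrome(service=Service(driver_path), options=options)


print(build_chrome_args())
```

Inside scrape(), the driver would then come from make_driver(DRIVER_PATH) rather than from a direct webdriver.Chrome(...) call.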

It is also important to note that storage within these virtual machines is expensive, so try to use a database or an equivalent service to store your data outside of the VM if at all possible. For the majority of my uses, Python’s pyodbc module works incredibly well for getting the data I want stored outside of the VM. However, the right choice will likely vary from case to case.
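As a rough illustration of pushing scraped rows out to a database, here is a minimal sketch written against Python's generic database interface. It uses the built-in sqlite3 module as a stand-in so it runs anywhere; the scraped_pages table, its columns, and the save_rows helper are hypothetical. With pyodbc, the same function would be handed a connection from pyodbc.connect(...) using your own connection string, since both modules use "?" placeholders.

```python
import sqlite3


def save_rows(conn, rows):
    """Insert (url, title) tuples through a DB-API connection.

    Returns the number of rows handed to the database.
    """
    rows = list(rows)
    cursor = conn.cursor()
    cursor.executemany(
        "INSERT INTO scraped_pages (url, title) VALUES (?, ?)", rows
    )
    conn.commit()
    return len(rows)


# Stand-in database: an in-memory SQLite file instead of a remote server.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE scraped_pages (url TEXT, title TEXT)")

saved = save_rows(conn, [("https://stackoverflow.com/", "Stack Overflow")])
count = conn.execute("SELECT COUNT(*) FROM scraped_pages").fetchone()[0]
print(saved, count)
```

Parameterized inserts like this keep scraped text from being interpreted as SQL, which matters when the page content is outside your control.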

Step 3: PowerShell Script

Next, set up a PowerShell script that runs your Python code. This script is how Azure will communicate with the scripts inside your VM. Again, for simplicity, my PowerShell script here does just enough to outline the basic structure.

Let's call this script ps-scrape.ps1:

Write-Output "Script Started."
\path\to\python.exe \path\to\scrape.py
Write-Output "Script Ending."

Now, test this by running it locally on your VM. It should print the following:

Script Started.
Script Executed Correctly.
Script Ending.
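The output above only tells you something if a failed scrape looks different from a successful one. One way to make that explicit, sketched here under the assumption that you are free to restructure scrape.py, is to wrap the scraping function so that the success line is printed only when it finishes and an exit code reports the result. The run helper is a hypothetical name, and scrape is an empty placeholder for the real function.

```python
import traceback


def run(task):
    """Run task() and return a process exit code: 0 on success, 1 on error."""
    try:
        task()
    except Exception:
        # Print the traceback so it appears in the PowerShell output.
        traceback.print_exc()
        return 1
    print("Script Executed Correctly.")
    return 0


def scrape():
    """Placeholder for the real scraping function in scrape.py."""


exit_code = run(scrape)
# In scrape.py, pass the code on with sys.exit(exit_code) (after import sys).
```

On the PowerShell side, the exit code of the most recent native program is available as $LASTEXITCODE, so ps-scrape.ps1 could check it after the python.exe line.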

Step 4: Azure PowerShell Runbook

Now that your Powershell script runs locally on your VM, it is time to do the same thing from outside your VM.

Within Azure, open up the Automation Account resource. Under Process Automation, click on Runbooks and create a Runbook. The Runbook type should be PowerShell (not PowerShell Workflow or Graphical PowerShell Workflow).

Keep in mind that you will likely need to import the required modules into your Automation Account for the following to run correctly. To do this, go to the Automation Account you created; under Shared Resources, you should see Modules. Make sure to add the AzureRM.Compute module and any other modules you may need.

Let's call the following Runbook RunbookScrape:

$connectionName = "AzureRunAsConnection"
try
{
    # Get the connection "AzureRunAsConnection"
    $servicePrincipalConnection = Get-AutomationConnection -Name $connectionName

    "Logging in to Azure..."
    Add-AzureRmAccount `
        -ServicePrincipal `
        -TenantId $servicePrincipalConnection.TenantId `
        -ApplicationId $servicePrincipalConnection.ApplicationId `
        -CertificateThumbprint $servicePrincipalConnection.CertificateThumbprint
}
catch {
    if (!$servicePrincipalConnection)
    {
        $ErrorMessage = "Connection $connectionName not found."
        throw $ErrorMessage
    } else {
        Write-Error -Message $_.Exception
        throw $_.Exception
    }
}

$rgname = "YourResourceGroupName"
$vmname = "YourVirtualMachineName"

# Write a one-line script that calls ps-scrape.ps1 by its path on the VM
$ScriptToRun = "vm\path\to\script\ps-scrape.ps1"
Out-File -InputObject $ScriptToRun -FilePath ScriptToRun.ps1

# Run that one-line script inside the VM and print what it returns
$run = Invoke-AzureRmVMRunCommand -ResourceGroupName $rgname -Name $vmname -CommandId 'RunPowerShellScript' -ScriptPath ScriptToRun.ps1
Write-Output $run.Value[0]

# Clean up the temporary script file
Remove-Item -Path ScriptToRun.ps1

The placeholder values (YourResourceGroupName, YourVirtualMachineName, and the path assigned to $ScriptToRun) indicate where you will need to change the code to work for your system.

If the script ran correctly but you don’t see any output, don’t worry. It just means you need to update the VM network settings to allow outbound traffic through port 443. To do this, go to the Virtual Machine and, under Settings, click the Networking button. You should see several tabs under the Network Interface. Click on Outbound port rules and set up a new rule that looks like this.

[Image: outbound port rule settings]

Try running the Runbook again and you should see the same output as you saw from within the VM!

Step 5: Runbook Automation

Now comes the task of automating your Runbook. Within Azure, open up the Logic App resource. Under Development Tools, you should see the Logic app designer. All that is required is to link the blocks together so that Azure starts up the VM, runs the Runbook, and then shuts down the VM. You can see what this looks like in the following image.

[Image: Logic app designer flow that starts the VM, runs the Runbook, and shuts down the VM]

Boom! You’re done. Your Python Selenium web scraper will now run within the Azure Virtual Machine on a scheduled, recurring basis.

Translated from: https://medium.com/swlh/guide-to-migrating-automating-chrome-web-scrapers-within-azure-909a4203476a
