Web scraping is becoming an increasingly desirable skill for many data-science-related jobs as more companies slowly migrate their processes to the cloud.
I originally became interested in data science after scraping my university’s course evaluation catalogue, and this skill single-handedly allowed me to land two internships during my undergrad program.
Although the approach is sometimes disputed, many people use a Chrome WebDriver and the Selenium module to scrape data from websites. While this tool can be very helpful locally, it is difficult to turn these scripts into recurring tasks that can be deployed on large-scale infrastructure. In this article, I am going to guide you through porting your Selenium web scraper over to Azure on a virtual machine, and show you how to set the scraping up as a daily recurring task.
Step 1: Setting up the Azure Virtual Machine (VM)
After you have logged into Azure, you’re going to want to make your way over to the Virtual Machines directory. While I won’t walk through every detail behind creating the VM, I will note some specifications that are important to set in order to enable appropriate access between services.
Since I am most familiar with Windows, I used a Windows 10 Pro image for my virtual machine; however, I would imagine that this process could be repeated for other images as well.
For “Select inbound ports”, make sure to include the HTTPS (443) option to allow the automation task access. If you miss this step, we will cover it in more detail in Step 5 of this guide.
Step 2: Install Python, Chrome, Chromedriver & Required Dependencies
Next, we are going to want to load up the VM. If you are using a Windows image, you can use RDP (Remote Desktop Protocol) to get access, or you can use software like PuTTY to connect over SSH as well.
We are going to set up our working environment here so that Python and Chrome are up and running. So, make sure to install your required version of Python as well as the latest versions of Chrome and Chromedriver. Make note of where these files are saved, as you will need them later on.
If you want less maintenance down the road, make sure to rename the Chrome Update folder so Chrome doesn't automatically update, which would require you to download a newer version of Chromedriver. Instructions for doing so can be found here.
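If you'd rather script that rename than do it by hand, a minimal sketch like the following works under the assumption that Chrome's updater lives in its default location; the folder path below is an assumption, so adjust it to match your install (and run the script as administrator).

import os

# Assumed default location of the Chrome updater on 64-bit Windows; adjust as needed.
UPDATE_DIR = r"C:\Program Files (x86)\Google\Update"

if os.path.isdir(UPDATE_DIR):
    # Renaming the folder stops automatic updates, which would otherwise break
    # the match between your installed Chrome and your Chromedriver version.
    os.rename(UPDATE_DIR, UPDATE_DIR + "_disabled")
    print("Chrome Update folder renamed.")
else:
    print("Chrome Update folder not found; nothing to rename.")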
Step 3: Python Script
For the sake of simplicity, we are going to use just a basic Python script that loads up Stack Overflow. Obviously, this could easily be done using the requests library; however, since many scrapers require JavaScript interactivity with the web page, I’ll assume that your script is longer and more complex.
Let’s call the following script scrape.py
from selenium import webdriver

DRIVER_PATH = "/path/to/chromedriver.exe"

def scrape():
    driver = webdriver.Chrome(executable_path=DRIVER_PATH)
    driver.get('https://stackoverflow.com/')

if __name__ == "__main__":
    scrape()
    print("Script Executed Correctly.")
You’re going to want to first make sure that the correct output is printed so you know your script works locally within the VM.
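Once that works, you may also want the browser to run headless, since nothing needs to render on screen when the task runs unattended, and to close Chrome when the scrape finishes so processes don't pile up between runs. A minimal sketch of that variant (using the same placeholder DRIVER_PATH as above) could look like this; note that older Selenium 3 releases expect the keyword chrome_options instead of options.

from selenium import webdriver
from selenium.webdriver.chrome.options import Options

DRIVER_PATH = "/path/to/chromedriver.exe"  # same placeholder path as above

def scrape():
    options = Options()
    options.add_argument("--headless")            # no visible browser window on the VM
    options.add_argument("--window-size=1920,1080")
    driver = webdriver.Chrome(executable_path=DRIVER_PATH, options=options)
    try:
        driver.get("https://stackoverflow.com/")
        print(driver.title)                       # quick sanity check that the page loaded
    finally:
        driver.quit()                             # always release the Chrome process

if __name__ == "__main__":
    scrape()
    print("Script Executed Correctly.")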
It is also important to note that storage within these virtual machines is expensive. So, try to utilize a database or equivalent to store your data outside of the VM if at all possible. For the majority of my uses, Python’s pyodbc module works incredibly well for getting the data I want stored outside of the VM. However, this will likely change on a case-by-case basis.
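As a rough illustration of that pattern, a sketch along these lines pushes scraped rows into a SQL database with pyodbc; the connection string, table name, and columns are all placeholders you would swap for your own, and the table is assumed to already exist.

import pyodbc

# Placeholder connection details -- substitute your own server, database, and credentials.
CONN_STR = (
    "DRIVER={ODBC Driver 17 for SQL Server};"
    "SERVER=your-server.database.windows.net;"
    "DATABASE=your-database;"
    "UID=your-user;PWD=your-password"
)

def save_rows(rows):
    # rows is assumed to be a list of (url, title) tuples produced by your scraper.
    conn = pyodbc.connect(CONN_STR)
    try:
        cursor = conn.cursor()
        cursor.executemany(
            "INSERT INTO scraped_pages (url, title) VALUES (?, ?)",  # hypothetical table
            rows,
        )
        conn.commit()
    finally:
        conn.close()

if __name__ == "__main__":
    save_rows([("https://stackoverflow.com/", "Stack Overflow")])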
Step 4: PowerShell Script
Next, you’re going to want to set up a PowerShell script that runs your Python code. This script is how Azure will communicate with any internal scripts you have within your VM. Again, for simplicity, my PowerShell script here will use some basic functionality just to outline the basic structure.
Let’s call this script ps-scrape.ps1
Write-Output "Script Started."
\path\to\python.exe \path\to\scrape.py
Write-Output "Script Ending."
Now, give this a test run locally on your VM. It should print out the following results:
Script Started.
Script Executed Correctly.
Script Ending.
Step 5: Azure PowerShell Runbook
Now that your PowerShell script runs locally on your VM, it is time to do the same thing from outside your VM.
Within Azure, open up the Automation Account Resource. Under Process Automation, click on Runbooks and Create a Runbook. The Runbook type should be PowerShell (not PowerShell workflow or Graphical PowerShell Workflow).
Keep in mind that you will likely need to import the required modules into your Automation Account to allow the following to run correctly. To do this, go over to the Automation Account you created; under Shared Resources, you should see Modules. Make sure to add the AzureRM.Compute module and any other modules you may need.
Let’s call the following Runbook RunbookScrape
$connectionName = "AzureRunAsConnection"
try
{
    # Get the connection "AzureRunAsConnection"
    $servicePrincipalConnection = Get-AutomationConnection -Name $connectionName

    "Logging in to Azure..."
    Add-AzureRmAccount `
        -ServicePrincipal `
        -TenantId $servicePrincipalConnection.TenantId `
        -ApplicationId $servicePrincipalConnection.ApplicationId `
        -CertificateThumbprint $servicePrincipalConnection.CertificateThumbprint
}
catch {
    if (!$servicePrincipalConnection)
    {
        $ErrorMessage = "Connection $connectionName not found."
        throw $ErrorMessage
    } else {
        Write-Error -Message $_.Exception
        throw $_.Exception
    }
}

$rgname = "YourResourceGroupName"
$vmname = "YourVirtualMachineName"
$ScriptToRun = "vm\path\to\script\ps-scrape.ps1"

# Write the path of the script on the VM into a small local wrapper script,
# then have the VM execute that wrapper through the Run Command feature.
Out-File -InputObject $ScriptToRun -FilePath ScriptToRun.ps1
$run = Invoke-AzureRmVMRunCommand -ResourceGroupName $rgname -Name $vmname -CommandId 'RunPowerShellScript' -ScriptPath ScriptToRun.ps1

# Show the output returned from the VM, then clean up the temporary wrapper.
Write-Output $run.Value[0]
Remove-Item -Path ScriptToRun.ps1
The placeholder values (YourResourceGroupName, YourVirtualMachineName, and the path to ps-scrape.ps1) indicate where you will need to change the code to work for your system.
If the script ran correctly but you don’t see any output, don’t worry. It just means you need to update the VM network settings to allow outbound traffic through port 443. This can be done by going to the virtual machine, where under Settings you will see the Networking button. Go there and you should see several tabs under the Network Interface. Click on the outbound port rules and set up a new rule that allows outbound traffic on port 443.
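If you want a quick way to confirm outbound HTTPS from inside the VM before re-running the Runbook, a throwaway check like this (standard library only; the URL is just an example endpoint) is one option:

import urllib.request

try:
    # Any public HTTPS site works here; stackoverflow.com is simply the site this guide scrapes.
    with urllib.request.urlopen("https://stackoverflow.com/", timeout=10) as response:
        print("Outbound HTTPS (port 443) looks open, status:", response.status)
except Exception as exc:
    print("Outbound HTTPS check failed:", exc)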
Try running the Runbook again and you should see the same output as you saw from within the VM!
Step 6: Runbook Automation
Now comes the task of automating your Runbook. Within Azure, open up the Logic App resource. Under Development Tools, you should see the Logic app designer. All that is required is to link the blocks together so that Azure starts up the VM, runs the Runbook, and then shuts down the VM. You can see what this looks like in the following image.
Boom! You’re done. Your Python Selenium web scraper will now run within the Azure virtual machine on a scheduled, recurring basis.
Translated from: https://medium.com/swlh/guide-to-migrating-automating-chrome-web-scrapers-within-azure-909a4203476a