python自动化数据报告_如何:使用Python将实时数据自动化到您的网站

python自动化数据报告

This tutorial will be helpful for people who have a website that hosts live data on a cloud service but are unsure how to completely automate the updating of the live data so the website becomes hassle free. For example: I host a website that shows Texas COVID case counts by county in an interactive dashboard, but everyday I had to run a script to download the excel file from the Texas COVID website, clean the data, update the pandas data frame that was used to create the dashboard, upload the updated data to the cloud service I was using, and reload my website. This was annoying, so I used the steps in this tutorial to show how my live data website is now totally automated.

对于拥有在云服务上托管实时数据的网站但不确定如何完全自动化实时数据更新的网站的人来说,本教程将非常有用。 例如:我托管了一个网站 ,该网站在交互式仪表板上显示按县分类的德克萨斯州COVID病例数,但是每天我必须运行脚本以从德克萨斯州COVID网站下载excel文件,清理数据,更新以前的熊猫数据框。用于创建仪表板,将更新的数据上传到我正在使用的云服务,然后重新加载我的网站。 这很烦人,所以我使用了本教程中的步骤来展示我的实时数据网站现在是如何完全自动化的。

I will only be going over how to do this using the cloud service pythonanywhere, but these steps can be transferred to other cloud services. Another thing to note is that I am new to building and maintaining websites so please feel free to correct me or give me constructive feedback on this tutorial. I will be assuming that you have basic knowledge of python, selenium for web scraping, bash commands, and you have your own website. Lets go through the steps of automating live data to your website:

我将只讨论如何使用pythonanywhere的云服务来执行此操作,但是这些步骤可以转移到其他云服务。 要注意的另一件事是,我是网站建设和维护的新手,请随时纠正我或对本教程给我有建设性的反馈。 我假设您具有python的基本知识,用于网络抓取的Selenium,bash命令,并且您拥有自己的网站。 让我们完成将实时数据自动化到您的网站的步骤:

  1. web scraping with selenium using a cloud service

    使用云服务使用Selenium进行Web抓取
  2. converting downloaded data in a .part file to .xlsx file

    将.part文件中的下载数据转换为.xlsx文件
  3. re-loading your website using the os python package

    使用os python软件包重新加载您的网站
  4. scheduling a python script to run every day in pythonanywhere

    安排python脚本每天在pythonanywhere中运行

I will not be going through some of the code I will be showing because I use much of the same code from my last tutorial on how to create and automate an interactive dashboard using python found here. Lets get started!

我将不会看过将要显示的一些代码,因为我使用了上一篇教程中的许多相同代码,它们是关于如何使用此处找到的python创建和自动化交互式仪表板的。 让我们开始吧!

  1. web scraping with selenium using a cloud service

    使用云服务使用Selenium进行Web抓取

So in your cloud service of choice (mine being pythonanywhere), open up a python3.7 console. I will be showing the code in chunks but all the code can be combined into one script which is what I have done. Also, all the file paths in the code you will have to change to your own for the code to work.

因此,在您选择的云服务(我的网站是pythonanywhere)中,打开一个python3.7控制台。 我将分块显示代码,但是所有代码都可以组合成一个脚本,这就是我所做的。 同样,您必须将代码中的所有文件路径更改为自己的路径,代码才能正常工作。

from pyvirtualdisplay import Display
from selenium import webdriver
import time
from selenium.webdriver.chrome.options import Optionswith Display():
# we can now start Firefox and it will run inside the virtual display
browser = webdriver.Firefox()# these options allow selenium to download files
options = Options()
options.add_experimental_option("browser.download.folderList",2)
options.add_experimental_option("browser.download.manager.showWhenStarting", False)
options.add_experimental_option("browser.helperApps.neverAsk.saveToDisk", "application/octet-stream,application/vnd.ms-excel")# put the rest of our selenium code in a try/finally
# to make sure we always clean up at the end
try:
browser.get('https://www.dshs.texas.gov/coronavirus/additionaldata/')# initialize an object to the location on the html page and click on it to download
link = browser.find_element_by_xpath('/html/body/form/div[4]/div/div[3]/div[2]/div/div/ul[1]/li[1]/a')
link.click()# Wait for 30 seconds to allow chrome to download file
time.sleep(30)print(browser.title)
finally:
browser.quit()

In the chunk of code above, I open up a Firefox browser within pythonanywhere using their pyvirtualdisplay library. No new browser will pop on your computer since its running on the cloud. This means you should test out the script on your own computer without the display() function because error handling will be difficult within the cloud server. Then I download an .xlsx file from the Texas COVID website and it saves it in my /tmp file within pythonanywhere. To access the /tmp file, just click on the first “/” of the files tab that proceeds the home file button. This is all done within a try/finally blocks, so after the script runs, we close the browser so we do not use any more cpu time on the server. Another thing to note is that pythonanywhere only supports one version of selenium: 2.53.6. You can downgrade to this version of selenium using the following bash command:

在上面的代码中,我使用pyanytdisplaydisplay库在pythonanywhere中打开了Firefox浏览器。 自从它在云上运行以来,没有新的浏览器会在您的计算机上弹出。 这意味着您应该在没有display()函数的情况下在自己的计算机上测试脚本,因为在云服务器中错误处理将很困难。 然后,我从Texas COVID网站下载.xlsx文件,并将其保存在pythonanywhere中的/ tmp文件中。 要访问/ tmp文件,只需单击文件选项卡的第一个“ /”,然后单击主文件按钮即可。 这都是在try / finally块中完成的,因此在脚本运行之后,我们关闭浏览器,因此我们不再在服务器上使用更多的CPU时间。 要注意的另一件事是pythonanywhere仅支持一个Selenium版本: 2.53.6。 您可以使用以下bash命令降级到该版本的Selenium:

pip3.7 install --user selenium==2.53.6

2. converting downloaded data in a .part file to .xlsx file

2. 将.part文件中的下载数据转换为.xlsx文件

import shutil
import glob
import os# locating most recent .xlsx downloaded file
list_of_files = glob.glob('/tmp/*.xlsx.part')
latest_file = max(list_of_files, key=os.path.getmtime)
print(latest_file)# we need to locate the old .xlsx file(s) in the dir we want to store the new xlsx file in
list_of_files = glob.glob('/home/tsbloxsom/mysite/get_data/*.xlsx')
print(list_of_files)# need to delete old xlsx file(s) so if we download new xlsx file with same name we do not get an error while moving it
for file in list_of_files:
print("deleting old xlsx file:", file)
os.remove(file)# move new data into data dir
shutil.move("{}".format(latest_file), "/home/tsbloxsom/mysite/get_data/covid_dirty_data.xlsx")

When you download .xlsx files in pythonanywhere, they are stored as .xlsx.part files. After some research, these .part files are caused when you stop a download from completing. These .part files cannot be opened with typical tools but there is a easy trick around this problem. In the above code, I automate moving the new data and deleting the old data in my cloud directories. The part to notice is that when I move the .xlsx.part file, I save it as a .xlsx file. This converts it magically, and when you open this new .xlsx file, it has all the live data which means that my script did download the complete .xlsx file but pythonanywhere adds a .part to the file which is weird but hey it works.

当您在pythonanywhere中下载.xlsx文件时,它们将存储为.xlsx.part文件。 经过研究,这些.part文件是在您停止下载完成时引起的。 这些.part文件无法使用典型工具打开,但是可以解决此问题。 在上面的代码中,我自动移动了新数据并删除了云目录中的旧数据。 需要注意的部分是,当我移动.xlsx.part文件时,我将其另存为.xlsx文件。 这会神奇地进行转换,当您打开这个新的.xlsx文件时,它具有所有实时数据,这意味着我的脚本确实下载了完整的.xlsx文件,但是pythonanywhere向该文件中添加了.part很奇怪,但嘿,它起作用了。

3. re-loading your website using the os python package

3.使用os python软件包重新加载您的网站

import pandas as pd
import relist_of_files = glob.glob('/home/tsbloxsom/mysite/get_data/*.xlsx')
latest_file = max(list_of_files, key=os.path.getctime)
print(latest_file)df = pd.read_excel("{}".format(latest_file),header=None)# print out latest COVID data datetime and notes
date = re.findall("- [0-9]+/[0-9]+/[0-9]+ .+", df.iloc[0, 0])
print("COVID cases latest update:", date[0][2:])
print(df.iloc[1, 0])
#print(str(df.iloc[262:266, 0]).lstrip().rstrip())#drop non-data rows
df2 = df.drop([0, 1, 258, 260, 261, 262, 263, 264, 265, 266, 267])# clean column names
df2.iloc[0,:] = df2.iloc[0,:].apply(lambda x: x.replace("\r", ""))
df2.iloc[0,:] = df2.iloc[0,:].apply(lambda x: x.replace("\n", ""))
df2.columns = df2.iloc[0]
clean_df = df2.drop(df2.index[0])
clean_df = clean_df.set_index("County Name")clean_df.to_csv("/home/tsbloxsom/mysite/get_data/Texas county COVID cases data clean.csv")df = pd.read_csv("Texas county COVID cases data clean.csv")# convert df into time series where rows are each date and clean up
df_t = df.T
df_t.columns = df_t.iloc[0]
df_t = df_t.iloc[1:]
df_t = df_t.iloc[:,:-2]# next lets convert the index to a date time, must clean up dates first
def clean_index(s):
s = s.replace("*","")
s = s[-5:]
s = s + "-2020"
#print(s)
return sdf_t.index = df_t.index.map(clean_index)df_t.index = pd.to_datetime(df_t.index)# initalize df with three columns: Date, Case Count, and County
anderson = df_t.T.iloc[0,:]ts = anderson.to_frame().reset_index()ts["County"] = "Anderson"
ts = ts.rename(columns = {"Anderson": "Case Count", "index": "Date"})# This while loop adds all counties to the above ts so we can input it into plotly
x = 1
while x < 254:
new_ts = df_t.T.iloc[x,:]
new_ts = new_ts.to_frame().reset_index()
new_ts["County"] = new_ts.columns[1]
new_ts = new_ts.rename(columns = {new_ts.columns[1]: "Case Count", "index": "Date"})
ts = pd.concat([ts, new_ts])
x += 1ts.to_csv("/home/tsbloxsom/mysite/data/time_series_plotly.csv")time.sleep(5)#reload website with updated data
os.utime('/var/www/tsbloxsom_pythonanywhere_com_wsgi.py')

Most of the above code I explained in my last post which deals with cleaning excel files using pandas for inputting into a plotly dashboard. The most important line for this tutorial is the very last one. The os.utime function shows access and modify times of a file or python script. But when you call the function on your Web Server Gateway Interface (WSGI) file it will reload your website!

我在上一篇文章中解释了上面的大多数代码,其中涉及使用熊猫清理excel文件并输入到绘图仪表板。 本教程最重要的一行是最后一行。 os.utime函数显示文件或python脚本的访问和修改时间。 但是,当您在Web服务器网关接口(WSGI)文件上调用该函数时,它将重新加载您的网站!

4. scheduling a python script to run every day in pythonanywhere

4.计划每天在pythonanywhere中运行的python脚本

Image for post
Image by yours truly
真正的形象

Now for the easy part! After you combine the above code into one .py file, you can make it run every day or hour using pythonanywhere’s Task tab. All you do is copy and paste the bash command, with the full directory path, you would use to run the .py file into the bar in the image above and hit the create button! Now you should test the .py file using a bash console first to see if it runs correctly. But now you have a fully automated data scraping script that your website can use to have daily or hourly updated data displayed without you having to push one button!

现在简单一点! 将以上代码组合成一个.py文件后,您可以使用pythonanywhere的“任务”标签使它每天或每小时运行一次。 您要做的就是复制并粘贴带有完整目录路径的bash命令,您将使用该命令将.py文件运行到上图中的栏中,然后单击“创建”按钮! 现在,您应该首先使用bash控制台测试.py文件,以查看其是否正常运行。 但是现在您有了一个全自动的数据抓取脚本,您的网站可以使用它来显示每日或每小时的更新数据,而无需按一个按钮!

If you have any questions or critiques please feel free to say so in the comments and if you want to follow me on LinkedIn you can!

如果您有任何疑问或批评,请随时在评论中说,如果您想在LinkedIn上关注我,可以!

翻译自: https://towardsdatascience.com/how-to-automate-live-data-to-your-website-with-python-f22b76699674

python自动化数据报告

本文来自互联网用户投稿,该文观点仅代表作者本人,不代表本站立场。本站仅提供信息存储空间服务,不拥有所有权,不承担相关法律责任。如若转载,请注明出处:http://www.mzph.cn/news/392076.shtml

如若内容造成侵权/违法违规/事实不符,请联系多彩编程网进行投诉反馈email:809451989@qq.com,一经查实,立即删除!

相关文章

android intent参数是上次的结果,【Android】7.0 Intent向下一个活动传递数据、返回数据给上一个活动...

1.0 可以利用Intent吧数据传递给上一个活动&#xff0c;新建一个叫“hellotest01”的项目。新建活动FirstActivity&#xff0c;勾选“Generate Layout File”和“Launcher Activity”。image修改AndroidMainifest.xml中的内容&#xff1a;android:name".FirstActivity&quo…

学习深度学习需要哪些知识_您想了解的有关深度学习的所有知识

学习深度学习需要哪些知识有关深层学习的FAU讲义 (FAU LECTURE NOTES ON DEEP LEARNING) Corona was a huge challenge for many of us and affected our lives in a variety of ways. I have been teaching a class on Deep Learning at Friedrich-Alexander-University Erlan…

html5--3.16 button元素

html5--3.16 button元素 学习要点 掌握button元素的使用button元素 用来建立一个按钮从功能上来说&#xff0c;与input元素建立的按钮相同button元素是双标签&#xff0c;其内部可以配置图片与文字&#xff0c;进行更复杂的样式设计不仅可以在表单中使用&#xff0c;还可以在其…

如何注册鸿蒙id,鸿蒙系统真机调试证书 和 设备ID获取

鸿蒙系统真机调试创建项目创建项目创建应用创建鸿蒙应用(注意&#xff0c;测试阶段需要发邮件申请即可)关联应用项目进入关联 添加引用准备调试使用的 p12 和证书请求 csr使用以下命令// 别名"test"可以修改&#xff0c;但必须前后一致&#xff0c;密码请自行修改key…

读zepto核心源码学习JS笔记(3)--zepto.init()

上篇已经讲解了zepto.init()的几种情况,这篇就继续记录这几种情况下的具体分析. 1. 首先是第一种情况,selector为空 既然是反向分析,那我们先看看这句话的代码; if (!selector) return zepto.Z() 这里的返回值为zepto.Z();那我们继续往上找zepto.Z()函数 zepto.Z function(dom…

置信区间估计 预测区间估计_估计,预测和预测

置信区间估计 预测区间估计Estimation implies finding the optimal parameter using historical data whereas prediction uses the data to compute the random value of the unseen data.估计意味着使用历史数据找到最佳参数&#xff0c;而预测则使用该数据来计算未见数据的…

鸿蒙系统还会推出吗,华为明年所有自研设备都升级鸿蒙系统,还会推出基于鸿蒙系统的新机...

不负期许&#xff0c;华为鸿蒙OS手机版如期而至。今日(12月15日)&#xff0c;鸿蒙OS 2.0手机开发者Beta版本正式上线&#xff0c;支持运行安卓应用&#xff0c;P40、Mate 30系列可申请公测。国内媒体报道称&#xff0c;华为消费者业务软件部副总裁杨海松表示&#xff0c;按照目…

C#中将DLL文件打包到EXE文件

1&#xff1a;在工程目录增加dll目录&#xff0c;然后将dll文件复制到此目录&#xff0c;例如&#xff1a; 2&#xff1a;增加引用&#xff0c;定位到工程的dll目录&#xff0c;选中要增加的dll文件 3&#xff1a;修改dll文件夹下面的dll文件属性 选中嵌入式资源&#xff0c;不…

Oracle VM Virtual Box的安装

安装Oracle VM Virtual Box安装扩展插件 选择"管理""全局设定" 在设置对话框中&#xff0c;选择"扩展" 选择"添加包" 找到"Oracle_VM_VirtualBox_Extension_Pack-4.1.18-78361"&#xff0c;点击"打开" 5&#x…

python 移动平均线_Python中的SMA(短期移动平均线)

python 移动平均线With the evolution of technology rapidly evolving, so do strategies in the stock market. In this post, I’ll go over how I created a SMA(Short Moving Average) strategy.随着技术的飞速发展&#xff0c;股票市场的策略也在不断发展。 在本文中&…

angular中的href=unsafe:我该怎么摆脱你的溺爱!!

解决方法&#xff1a;angular.module加入下面这行&#xff1a;&#xff08;依据Angular changes urls to “unsafe:” in extension page&#xff09; .config(function($compileProvider){//注:有些版本的angularjs为$compileProvider.urlSanitizationWhitelist(/^\s*(https?…

android view gesturedetector,如何在Android中利用 GestureDetector进行手势检测

如何在Android中利用 GestureDetector进行手势检测发布时间&#xff1a;2020-11-26 16:15:21来源&#xff1a;亿速云阅读&#xff1a;92作者&#xff1a;Leah今天就跟大家聊聊有关如何在Android中利用 GestureDetector进行手势检测&#xff0c;可能很多人都不太了解&#xff0c…

Java系列笔记(4) - JVM监控与调优【转】

Java系列笔记(4) - JVM监控与调优【转】 目录 参数设置收集器搭配启动内存分配监控工具和方法调优方法调优实例 光说不练假把式&#xff0c;学习Java GC机制的目的是为了实用&#xff0c;也就是为了在JVM出现问题时分析原因并解决之。通过学习&#xff0c;我觉得JVM监控与调…

地图 c-suite_C-Suite的模型

地图 c-suiteWe’ve all seen a great picture capture an audience of stakeholders.我们所有人都看到了吸引利益相关者听众的美好画面。 Let’s just all notice that the lady in the front right is not captivated by the image on the board (Photo by Christina wocin…

动态链接库.so和静态链接库.a的区别

静态链接库&#xff1a; •扩展名&#xff1a;.a  •编译行为&#xff1a;在编译的时候&#xff0c;将函数库直接整合到执行程序中&#xff08;所以利用静态库编译生成的文档会更大&#xff09; •独立执行的状态&#xff1a;编译成功的可执行文件可以独立运行&#xff0c;不…

华为鸿蒙系统封闭,谷歌正式“除名”华为!“亲儿子”荣耀表示:暂不考虑,鸿蒙OS处境尴尬...

我们都知道&#xff0c;目前智能手机最常用操作系统就是IOS和安卓&#xff0c;占据手机系统超过99%的市场份额。由于IOS系统的封闭性&#xff0c;国内手机厂商基本上都是使用谷歌的开源安卓系统。当然华为也不例外&#xff0c;一直使用的都是安卓系统。可以说&#xff0c;安卓系…

使用vue-cli脚手架搭建简单项目框架

1.首先已经安装了node,最好版本6以上。 2.安装淘宝镜像 大家都知道国内直接使用 npm 的官方镜像是非常慢的&#xff0c;这里推荐使用淘宝 NPM 镜像。这样就可以直接使用cnpm了。 npm install -g cnpm --registryhttps://registry.npm.taobao.org如果过程出差&#xff0c;是否安…

sap中泰国有预扣税设置吗_泰国餐厅密度细分:带有K-means聚类的python

sap中泰国有预扣税设置吗Hi! I am Tung, and this is my first stories for my weekend project. What inspired this project is that I have studied to become data scientist for almost two years now mostly from Youtube, coding sites and of course, Medium ,but my l…

Java—简单的注册页面

根据所提供的界面&#xff0c;编写 register.html 文件 源代码 empty.jsp <% page contentType"text/html;charsetUTF-8" language"java" %> <html> <head><title>error</title> </head> <body> <H1><…

【深度学习系列】用PaddlePaddle和Tensorflow实现经典CNN网络AlexNet

上周我们用PaddlePaddle和Tensorflow实现了图像分类&#xff0c;分别用自己手写的一个简单的CNN网络simple_cnn和LeNet-5的CNN网络识别cifar-10数据集。在上周的实验表现中&#xff0c;经过200次迭代后的LeNet-5的准确率为60%左右&#xff0c;这个结果差强人意&#xff0c;毕竟…