python的power bi转换基础

I’ve been having a great time playing around with Power BI, one of the most incredible things in the tool is the array of possibilities you have to transform your data.

我在玩Power BI方面玩得很开心,该工具中最令人难以置信的事情之一就是您必须转换数据的一系列可能性。

You can perform your transformations directly in your SQL query, use PowerQuery, DAX, R, Python or just by using their buttons and drop-boxes.

您可以直接在SQL查询中执行转换,可以使用PowerQuery,DAX,R,Python或仅通过其按钮和下拉框进行转换。

PBI gives us a lot of choices, but as much as you can load your entire database and figure your way out just with DAX, knowing a little bit o SQL can make things so much easier. Understanding the possibilities, where each of them excels, and where do we feel comfortable, is essential to master the tool.

PBI给我们提供了许多选择,但是尽您可以加载整个数据库并仅使用DAX来解决问题,知道一点点SQL可以使事情变得如此简单。 掌握各种可能性,每种方法的优势以及我们感到舒适的地方,对于掌握该工具至关重要。

In this article, I’ll go through the basics of using Python to transform your data for building visualizations in Power BI.

在本文中,我将介绍使用Python转换数据以在Power BI中构建可视化的基础知识。

勘探 (Exploration)

For the following example, I’ll use Jupyter Lab for exploring the dataset and designing the transformations.

对于以下示例,我将使用Jupyter Lab探索数据集并设计转换。

The dataset I’ll use is the COVID-19 Data Repository by the Center for Systems Science and Engineering (CSSE) at Johns Hopkins University.

我将使用的数据集是约翰霍普金斯大学系统科学与工程中心(CSSE)的COVID-19数据存储库 。

import pandas as pdgit = 'https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_confirmed_global.csv'dataset = pd.read_csv(git)dataset
Image for post
Data frame
数据框

OK, so we loaded the dataset to a Pandas data frame, the same format we’ll receive it when performing the transformation in PBI.

好的,因此我们将数据集加载到了Pandas数据框中,与在PBI中执行转换时会收到的格式相同。

The first thing that called my attention in this dataset was its arrangement. The dates are spread through the columns, and that’s not a very friendly format for building visualizations in PBI.

在这个数据集中引起我注意的第一件事是它的排列。 日期分布在各列中,对于在PBI中构建可视化而言,这不是一种非常友好的格式。

Another noticeable thing is the amount of NaNs in the Province/State column. Let’s get a better look at the missing values with MissingNo.

另一个值得注意的事情是“省/州”列中的NaN数量。 让我们用MissingNo 更好地查看缺失值 。

import missingno as msnomsno.matrix(dataset)
Image for post
Missing values matrix
缺失值矩阵

Alright, mostly, our dataset is complete, but the Province/State column does have lots of missing values.

好吧,大多数情况下,我们的数据集是完整的,但是“省/州”列确实有很多缺失值。

While exploring, we can also check for typos and mismatching fields. There are lots of methods for doing so. I’ll use Difflib for illustrating.

在探索期间,我们还可以检查拼写错误和不匹配的字段。 有很多这样做的方法。 我将使用Difflib进行说明。

from difflib import SequenceMatcher# empty lists for assembling the data frame
diff_labels = []
diff_vals = []# for every country name check every other country name
for i in dataset['Country/Region'].unique():
for j in dataset['Country/Region'].unique():
if i != j:
diff_labels.append(i + ' - ' + j)
diff_vals.append(SequenceMatcher(None, i, j).ratio() )# assemble the data frame
diff_df = pd.DataFrame(diff_labels)
diff_df.columns = ['labels']
diff_df['vals'] = diff_vals# sort values by similarity ratio
diff_df.sort_values('vals', ascending=False)[:50]
Image for post
Country names similarity
国名相似

From what I can see, most of them are just similar, so this field is already clean.

从我的看到,它们大多数都是相似的,因此该字段已经很干净了。

As much as we could also check Provinces/ States, I guess I can pick typos from the names of countries, but not from provinces or states.

尽我们所能检查省/州,我想我可以从国家/地区名称中选择错别字,但不能从省或州中选择错别字。

目标 (Goal)

Whatever it is your exploration analysis, you’ll probably come up with a new design for the data you want to visualize.

无论您的勘探分析是什么,您都可能会想出想要可视化数据的新设计。

Something that’ll make your life easier when building the charts, and my idea here is to separate this dataset into three tables, like so:

可以简化构建图表时的工作,我的想法是将数据集分成三个表,如下所示:

One table will hold Location, with Province/State, Country/Region, Latitude, and Longitude.

一张桌子将保存位置,省/州,国家/地区,纬度和经度。

One will hold the data for countries, with the date, number of confirmed and number of new cases.

一个将保存国家/地区的数据,以及日期,确诊数量和新病例数量。

And the last one will hold data for the provinces, also with the date, number of confirmed and number of new cases.

最后一个将保存各省的数据,以及日期,确诊数和新病例数。

Here’s what I’m looking for as the final result:

这是最终结果:

Image for post
Tables and Relationships
表和关系

Are there better ways of arranging this dataset? — Most definitely, yes. But I think this is a good way of illustrating a goal for the dataset we want to achieve.

有更好的方法来安排此数据集吗? —绝对是的。 但是我认为这是说明我们要实现的数据集目标的好方法。

Python脚本 (Python Scripts)

Cool, we did a little exploration and came up with an idea of what we want to build. Now we can design the transformations.

太酷了,我们进行了一些探索,并提出了我们想要构建的构想。 现在我们可以设计转换。

Location is the easiest. We only need to select the columns we want.

位置最简单。 我们只需要选择所需的列。

cols = ['Province/State', 'Country/Region', 'Lat', 'Long']
location = dataset[cols]location

To get this to Power BI, we’ll need a new data source, and since we’re bringing it from a GitHub raw CSV, we can choose ‘web.’

要将其发送到Power BI,我们将需要一个新的数据源,并且由于我们是从GitHub原始CSV中获取数据,因此我们可以选择“网络”。

Image for post
PBI Get Data
PBI获取数据

Now we can add the URL for the CSV and click go till we have our new source.

现在,我们可以添加CSV的URL,然后单击“转到”,直到获得新的源。

https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_confirmed_global.csv
Image for post
PBI Get Data -> From Web
PBI获取数据->从Web

After you finish loading your dataset, you can go to ‘Transform data’, select the table we just imported, and go to the ‘Transform’ tab.

加载完数据集后,可以转到“转换数据”,选择我们刚刚导入的表,然后转到“转换”选项卡。

First, we’ll promote the first row to Headers.

首先,我们将第一行提升为Headers。

Image for post
PBI -> Transform Data -> Use first row as Headers
PBI->转换数据->使用第一行作为标题

Then on the same tab, we can select ‘Run Python script’.

然后在同一标签上,我们可以选择“运行Python脚本”。

Image for post
PBI -> Transform Data -> Run Python Script
PBI->转换数据->运行Python脚本

Here we’ll use the script we just wrote in Jupyter and press OK. Then we can choose the location Table we just made.

在这里,我们将使用刚刚在Jupyter中编写的脚本,然后按OK。 然后,我们可以选择刚才创建的位置表。

Image for post
New tables
新表
Image for post
Location table
位置表

Excellent, it’s arguably way more comfortable to do that with PBI only, but now we know how to use this transformation, and we can add some complexity.

太好了,可以说仅使用PBI可以更轻松地完成此操作,但是现在我们知道了如何使用此转换,并且可以增加一些复杂性。

Let’s make the Province Time-Series transformations in Jupyter.

让我们在Jupyter中进行省时间序列转换。

增加复杂性 (Add Complexity)

We’ll drop the columns we don’t need, set the new index, and stack the dates in a single column.

我们将删除不需要的列,设置新索引,并将日期堆叠在单个列中。

# drop lat and long
Time_Series_P = dataset.drop(['Lat', 'Long'], axis=1)# set country and province as index
Time_Series_P.set_index(['Province/State', 'Country/Region'], inplace=True)# stack date columns
Time_Series_P = Time_Series_P.stack()
Time_Series_P
Image for post
Stacked data frame
堆叠数据框

Next, we can convert that series back to a data frame, reset the index, and rename the columns.

接下来,我们可以将该系列转换回数据框,重置索引,然后重命名列。

Time_Series_P = Time_Series_P.to_frame(name='Confirmed')
Time_Series_P.reset_index(inplace=True)col_names = ['Province/State', 'Country/Region',
'Date', 'Confirmed']
Time_Series_P.columns = col_namesTime_Series_P
Image for post
Transformed data frame
转换后的数据帧

Cool, we already have the rows/ columns figured out. But I still want to add a column with ‘new cases’.

太酷了,我们已经弄清楚了行/列。 但是我仍然想添加一列“新案例”。

For that, we’ll need to sort our values by province and date. Then we’ll go through each row checking if it has the same name as the one before it. If it does, we should calculate the difference between those values. If not, we should use the amount in that row.

为此,我们需要按省和日期对值进行排序。 然后,我们将遍历每一行,检查其名称是否与之前的名称相同。 如果是这样,我们应该计算这些值之间的差。 如果没有,我们应该使用该行中的金额。

Time_Series_P['Date'] = pd.to_datetime(Time_Series_P['Date'])
Time_Series_P['Date'] = Time_Series_P['Date'].dt.strftime('%Y/%m/%d')Time_Series_P.sort_values(['Province/State', 'Date'], inplace=True)c = ''
new_cases = []
for index, value in Time_Series_P.iterrows():
if c != value['Province/State']:
c = value['Province/State']
val = value['Confirmed']
new_cases.append(val)
else:
new_cases.append(value['Confirmed'] - val)
val = value['Confirmed']
Time_Series_P['new_cases'] = new_cases
Image for post
The transformed data frame with new cases column
具有新案例列的转换后的数据框

I guess that’s enough. We transformed the dataset, and we have it exactly how we wanted. Now we can pack all this code in a single script and try it.

我想就足够了。 我们转换了数据集,并得到了我们想要的。 现在,我们可以将所有这些代码打包在一个脚本中并尝试。

Time_Series_P = dataset.drop(['Lat', 'Long'], axis=1).set_index(['Province/State', 'Country/Region']).stack()
Time_Series_P = Time_Series_P.to_frame(name='Confirmed').reset_index()
Time_Series_P.columns = ['Province/State', 'Country/Region', 'Date', 'Confirmed']
Time_Series_P.dropna(inplace=True)Time_Series_P['Date'] = pd.to_datetime(Time_Series_P['Date'])
Time_Series_P['Date'] = Time_Series_P['Date'].dt.strftime('%Y/%m/%d')Time_Series_P.sort_values(['Province/State', 'Date'], inplace=True)c = ''
new_cases = []
for index, value in Time_Series_P.iterrows():
if c != value['Province/State']:
c = value['Province/State']
val = value['Confirmed']
new_cases.append(val)
else:
new_cases.append(value['Confirmed'] - val)
val = value['Confirmed']
Time_Series_P['new_cases'] = new_cases
Time_Series_P[155:170]
Image for post
Final data frame
最终数据帧

We already know how to get this to PBI. Let’s duplicate our last source, and change the python script in it, like so:

我们已经知道如何将其用于PBI。 让我们复制最后一个源,并在其中更改python脚本,如下所示:

Image for post
PBI, run Python script
PBI,运行Python脚本

I don’t know how to create relationships in PBI with composite keys, so for connecting Location to Time_Series_P, I’ve used DAX to build a calculated column concatenating province and country.

我不知道如何使用复合键在PBI中创建关系,因此,为了将Location连接到Time_Series_P,我使用了DAX来构建计算得出的连接省和国家/地区的列。

loc_id = CONCATENATE(Time_Series_Province[Province/State], Time_Series_Province[Country/Region])

That’s it! You can also use similar logic to create the country table.

而已! 您也可以使用类似的逻辑来创建国家/地区表。

Time_Series_C = dataset.drop(['Lat', 'Long', 'Province/State',], axis=1).set_index(['Country/Region']).stack()
Time_Series_C = Time_Series_C.to_frame(name='Confirmed').reset_index()
Time_Series_C.columns = ['Country/Region', 'Date', 'Confirmed']
Time_Series_C = Time_Series_C.groupby(['Country/Region', 'Date']).sum().reset_index()Time_Series_C['Date'] = pd.to_datetime(Time_Series_C['Date'])
Time_Series_C['Date'] = Time_Series_C['Date'].dt.strftime('%Y/%m/%d')Time_Series_C.sort_values(['Country/Region', 'Date'], inplace=True)c = ''
new_cases = []
for index, value in Time_Series_C.iterrows():
if c != value['Country/Region']:
c = value['Country/Region']
val = value['Confirmed']
new_cases.append(val)
else:
new_cases.append(value['Confirmed'] - val)
val = value['Confirmed']
Time_Series_C['new_cases'] = new_cases
Time_Series_C

I guess that gives us an excellent idea of how to use Python transformations in PBI.

我想这给了我们一个很好的想法,如何在PBI中使用Python转换。

结论 (Conclusion)

Having options and knowing how to use them is always a good thing; all of those transformations could have been done with PBI only. For example, it’s way easier to turn all those columns with dates to rows by selecting them and clicking ‘Unpivot columns’ in the transformation tab.

有选择并知道如何使用它们总是一件好事。 所有这些转换只能通过PBI完成。 例如,通过选择所有日期日期列将其转换为行,然后在转换选项卡中单击“取消透视列”,将变得更加容易。

But there may be times where you find yourself lost in the tool, or you need more control over the operation, and many cases where Python may have that library to implement the solution you were seeking.

但是有时您可能会发现自己迷失在该工具中,或者需要对操作进行更多控制,并且在许多情况下,Python可能具有该库来实现您要寻找的解决方案。

All said and done — it’s time to design your visualization.

总而言之,这是设计可视化的时候了。

Image for post
Example of a visualization built with the transformed data
使用转换后的数据构建的可视化示例

Thanks for reading my article. I hope you enjoyed it.

感谢您阅读我的文章。 我希望你喜欢它。

翻译自: https://medium.com/python-in-plain-english/basics-of-power-bi-transformations-with-python-c6df52cb21d7

本文来自互联网用户投稿,该文观点仅代表作者本人,不代表本站立场。本站仅提供信息存储空间服务,不拥有所有权,不承担相关法律责任。如若转载,请注明出处:http://www.mzph.cn/news/389791.shtml

如若内容造成侵权/违法违规/事实不符,请联系多彩编程网进行投诉反馈email:809451989@qq.com,一经查实,立即删除!

相关文章

您是六个主要数据角色中的哪一个

When you were growing up, did you ever play the name game? The modern data organization has something similar, and it’s called the “Bad Data Blame Game.” Unlike the name game, however, the Bad Data Blame Game is played when data downtime strikes and no…

自定义按钮动态变化_新闻价值的变化定义

自定义按钮动态变化I read Bari Weiss’ resignation letter from the New York Times with some perplexity. In particular, I found her claim that she “was hired with the goal of bringing in voices that would not otherwise appear in your pages” a bit strange: …

Linux记录-TCP状态以及(TIME_WAIT/CLOSE_WAIT)分析(转载)

1.TCP握手定理 2.TCP状态 l CLOSED:初始状态,表示TCP连接是“关闭着的”或“未打开的”。 l LISTEN :表示服务器端的某个SOCKET处于监听状态,可以接受客户端的连接。 l SYN_RCVD :表示服务器接收到了来自客户端请求…

算法 从 数中选出_算法可以选出胜出的nba幻想选秀吗

算法 从 数中选出Note from Towards Data Science’s editors: While we allow independent authors to publish articles in accordance with our rules and guidelines, we do not endorse each author’s contribution. You should not rely on an author’s works without …

django-rest-framework第一次使用使用常见问题

2019独角兽企业重金招聘Python工程师标准>>> 记录在第一次使用django-rest-framework框架使用时遇到的问题,为了便于理解在这里创建了Person和Grade这两个model from django.db import models class Person(models.Model):SHIRT_SIZES ((S, Small),(M, …

插入脚注把脚注标注删掉_地狱司机不应该只是英国电影历史数据中的脚注,这说明了为什么...

插入脚注把脚注标注删掉Cowritten by Andie Yam由安迪(Andie Yam)撰写 Hell Drivers”, 1957地狱司机 》电影海报 Data visualization is a great way to celebrate our favorite pieces of art as well as reveal connections and ideas that were previously invisible. Mor…

贝叶斯统计 传统统计_统计贝叶斯如何补充常客

贝叶斯统计 传统统计For many years, academics have been using so-called frequentist statistics to evaluate whether experimental manipulations have significant effects.多年以来,学者们一直在使用所谓的常客统计学来评估实验操作是否具有significant效果。…

saltstack二

配置管理 haproxy的安装部署 haproxy各版本安装包下载路径https://www.haproxy.org/download/1.6/src/,跳转地址为http,改为https即可 创建相关目录 # 创建配置目录 [rootlinux-node1 ~]# mkdir /srv/salt/prod/pkg/ [rootlinux-node1 ~]# mkdir /srv/sa…

319. 灯泡开关

319. 灯泡开关 初始时有 n 个灯泡处于关闭状态。第一轮,你将会打开所有灯泡。接下来的第二轮,你将会每两个灯泡关闭一个。 第三轮,你每三个灯泡就切换一个灯泡的开关(即,打开变关闭,关闭变打开&#xff0…

因为你的电脑安装了即点即用_即你所爱

因为你的电脑安装了即点即用Data visualization is a great way to celebrate our favorite pieces of art as well as reveal connections and ideas that were previously invisible. More importantly, it’s a fun way to connect things we love — visualizing data and …

2074. 反转偶数长度组的节点

2074. 反转偶数长度组的节点 给你一个链表的头节点 head 。 链表中的节点 按顺序 划分成若干 非空 组,这些非空组的长度构成一个自然数序列(1, 2, 3, 4, …)。一个组的 长度 就是组中分配到的节点数目。换句话说: 节点 1 分配给…

团队管理新思考_需要一个新的空间来思考讨论和行动

团队管理新思考andrew wong安德鲁黄 Follow跟随 Sep 4 九月4 There is a need for a new space to think, discuss, and act. This need are being felt by the majority of AI / ML / Data Product Managers out there. They are exhausted by the ever increasing data volum…

2075. 解码斜向换位密码

2075. 解码斜向换位密码 字符串 originalText 使用 斜向换位密码 ,经由 行数固定 为 rows 的矩阵辅助,加密得到一个字符串 encodedText 。 originalText 先按从左上到右下的方式放置到矩阵中。 先填充蓝色单元格,接着是红色单元格&#xff…

微服务实战(六):落地微服务架构到直销系统(事件存储)

在CQRS架构中,一个比较重要的内容就是当命令处理器从命令队列中接收到相关的命令数据后,通过调用领域对象逻辑,然后将当前事件的对象数据持久化到事件存储中。主要的用途是能够快速持久化对象此次的状态,另外也可以通过未来最终一…

时间序列数据的多元回归_清理和理解多元时间序列数据

时间序列数据的多元回归No matter what kind of data science project one is assigned to, making sense of the dataset and cleaning it always critical for success. The first step is to understand the data using exploratory data analysis (EDA)as it helps us crea…

vue-cli搭建项目的目录结构及说明

vue-cli基于webpack搭建项目的目录结构 build文件夹 ├── build // 项目构建的(webpack)相关代码 │ ├── build.js // 生产环境构建代码(在npm run build的时候会用到这个文件夹)│ ├── check-versions.js // 检查node&am…

391. 完美矩形

391. 完美矩形 给你一个数组 rectangles ,其中 rectangles[i] [xi, yi, ai, bi] 表示一个坐标轴平行的矩形。这个矩形的左下顶点是 (xi, yi) ,右上顶点是 (ai, bi) 。 如果所有矩形一起精确覆盖了某个矩形区域,则返回 true ;否…

bigquery 教程_bigquery挑战实验室教程从数据中获取见解

bigquery 教程This medium article focusses on the detailed walkthrough of the steps I took to solve the challenge lab of the Insights from Data with BigQuery Skill Badge on the Google Cloud Platform (Qwiklabs). I got access to this lab in the Google Cloud R…

学习linux系统到底有没捷径?

2019独角兽企业重金招聘Python工程师标准>>> 说起linux操作系,可能对于很多不了解的人来说,第一个想到的就是类似于黑客帝国中的黑框框以及一串串不知所云的代码,总之这些感觉都可以总结成为一个字,那就是——酷&#…

wxpython实现界面跳转

wxPython实现Frame之间的跳转/更新的一种方法 wxPython是Python中重要的GUI框架,下面通过自己的方法实现模拟类似PC版微信登录,并跳转到主界面(朋友圈)的流程。 (一)项目目录 【说明】 icon : 保存项目使用…