Level up your Kaplan-Meier curves with Tableau

In a previous article, I showed how we can create the Kaplan-Meier curves using Python. As much as I love Python and writing code, there might be some alternative approaches with their unique set of benefits. Enter Tableau!

Tableau is a business intelligence tool used for creating elegant and interactive visualizations on top of data coming from a vast number of sources (you would be surprised how many distinct ones there are!). Put even more briefly, Tableau is used for building dashboards.

So why would a data scientist be interested in using Tableau instead of Python? When creating a Notebook/report with the results of a survival analysis exercise in Python, the reader will always be limited to:

  • what the creator of the visualization had in mind,
  • what data was available at the moment of creating the report.

In other words, there is little freedom for the reader to explore alternative angles. What is more, if someone in the company stumbles upon the report a few years later, the only way to bring the analysis up to date is to find the data scientist and have them rerun the Notebook and generate another report. Definitely not the best situation.

This is where a solution based on Tableau (or other business intelligence tools such as PowerBI, Looker, etc.) shines. As the visualizations are built directly on top of a data source, the visualization will be updated together with the data. Less work for the data scientist!

Another benefit is the ability to include filters, so the readers can play around and explore different subsets of the data. From experience, this is a feature often used by product owners, who want to dive deep into the details without constantly coming to the data person with a request for yet another filter or feature. Another win :)

Lastly, by using such tools, the analysts democratize the access to the data and analyses, as basically anyone in the company can access the dashboard and try to answer their own questions or verify their hypotheses.

After this introduction, let’s jump right into re-creating the very same Kaplan-Meier curves we created in the previous article. Once again, we use the Telco Churn dataset, which requires close to no extra preparation before the analysis. Please refer to that article if you need a refresher on the Kaplan-Meier estimator, as we will not cover theory this time. Also, we assume some basic knowledge of Tableau.

Note: Tableau is commercial software and requires a license. You can get access to a 14-day trial by following the instructions here.

Approach #1: Easy mode

The first approach is dubbed easy, as it will favor speed and simplicity, while at the same time introducing some shortcomings. First, we load the data from a text file (available here).

To carry out the survival analysis in Tableau, we will need the following variables:

  • time-to-event: expressed as the number of time periods (for example, days or months) elapsed between joining the sample and the event of interest or censoring.
  • event-of-interest: expressed as a binary variable, where 1 indicates that the event happened and 0 otherwise.
  • additional categorical variables: used for filtering and/or grouping.

The tenure variable does not require any preparation, as it already expresses the number of months since signing up for the services of the Telco company. But the Churn variable is expressed as a yes/no string, so we need to encode it to binary using a calculated field:

[Image: calculated field encoding Churn as a binary variable]

To create this field, right-click on the Churn variable in the variable selector on the left (Data tab), select Create -> Calculated Field.
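
For comparison, the same yes/no to 1/0 encoding can be sketched in plain Python. This is only an illustration of the logic, not the Tableau calculated field itself; the helper name `encode_churn` and the sample labels are made up for the example.

```python
def encode_churn(value: str) -> int:
    """Map a Yes/No churn label to 1/0 (anything other than 'yes' becomes 0)."""
    return 1 if value.strip().lower() == "yes" else 0


# Hypothetical sample of the Churn column
labels = ["Yes", "No", "No", "Yes"]
churn_binary = [encode_churn(v) for v in labels]
print(churn_binary)
```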

As the next step, we create a new calculated field, d_i, which represents the number of events that occur over time:

[Image: definition of the d_i calculated field]

The names we used for the variables correspond to the elements you can find in the formula for the Kaplan-Meier estimator.
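
As a quick refresher, the standard form of the Kaplan-Meier estimator is shown below, where $d_i$ is the number of events at time $t_i$ and $n_i$ is the number of observations still at risk just before $t_i$:

```latex
\hat{S}(t) = \prod_{i:\, t_i \le t} \left( 1 - \frac{d_i}{n_i} \right)
```

The ratio $d_i / n_i$ is the hazard at time $t_i$, so each factor is the probability of surviving that period given survival up to it.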

The next variable we create will be the denominator used for calculating the hazard function at a given time. It represents the number of observations still at risk at that time, that is, those that have not experienced the event or been censored in an earlier period:

[Image: definition of the calculated field for the denominator]

The Number of Records variable is a helper variable used for, as you might have guessed, counting the observations. For that purpose, newer versions of Tableau create a variable based on the name of the data source. However, you can easily create this variable manually by creating a calculated field and placing 1 in the field’s definition. Lastly, we define the Kaplan-Meier curve as:

[Image: definition of the Kaplan-Meier Curve calculated field]

Here, the probability of survival is defined as 1 - hazard function.
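
To sanity-check the numbers outside of Tableau, the same chain of calculations (events d_i, observations at risk n_i, and the running product of 1 - d_i / n_i) can be sketched in plain Python. This is a minimal illustration under my own assumptions, not a copy of the calculated fields: the function name `kaplan_meier` and the toy `tenure` and `churn` values are invented for the example.

```python
from collections import Counter


def kaplan_meier(durations, events):
    """Return a list of (t, d_i, n_i, S(t)) for each distinct duration, ascending."""
    # Number of observations leaving the risk set at each time (event or censoring)
    exits = Counter(durations)
    # Number of events (churned customers) at each time
    deaths = Counter(t for t, e in zip(durations, events) if e == 1)

    at_risk = len(durations)  # everyone is at risk at the start
    survival = 1.0
    rows = []
    for t in sorted(exits):
        d_i = deaths.get(t, 0)
        n_i = at_risk
        survival *= 1 - d_i / n_i  # running product of (1 - hazard)
        rows.append((t, d_i, n_i, survival))
        at_risk -= exits[t]  # drop everyone who left at time t
    return rows


# Toy data: tenure in months and the binary churn flag (made up)
tenure = [1, 1, 2, 3, 3, 3, 5]
churn = [1, 0, 1, 1, 1, 0, 0]

km_rows = kaplan_meier(tenure, churn)
for t, d_i, n_i, s in km_rows:
    print(f"t={t}  d_i={d_i}  n_i={n_i}  S(t)={s:.4f}")
```

In Tableau, the running product is typically expressed with table calculations rather than an explicit loop, but the quantities involved are the same.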

All the building blocks are ready. Now, we place tenure on the x-axis and the Kaplan-Meier Curve on the y-axis, format the curve as a percentage, add a title, and place the PaymentMethod variable on Color. This way, we create the following visualization:

[Image: Kaplan-Meier curves by PaymentMethod created in Tableau]

The result is very similar to what we obtained last time using lifelines:

[Image: Kaplan-Meier curves by PaymentMethod created with lifelines]

Some quick observations:

  • the survival curves obtained in Tableau are more or less straight, without the characteristic step structure,
  • there are no confidence intervals, as their calculation is not that simple in Tableau.

Using Tableau, we can easily add some additional filters to the visualization, such as the cohort date, age, or any of the available categorical variables.

Approach #2: Normal Mode

In this approach, we will focus on recreating the characteristic step-like shape of the Kaplan-Meier curves. This approach is dubbed the normal mode, as it requires a bit more preparation.

For the additional data preprocessing, we need to complete two steps. First, add a column called link to the CSV file with the Telco Customer Churn data. The column should be populated with a ‘link’ string. As a matter of fact, this string can be arbitrary, just like the column name. What matters is consistency between the two files, which will become clear in a second. The second step is to create a new CSV file (we called it blending.csv), which contains the following:

link, set
link, 1
link, 2

Yep, that’s pretty much it. For your convenience, I stored both files on my GitHub.

Armed with the two files, we load them into Tableau and left join the tables using the link variable. You can see that in the following image.

[Image: left join of the two tables on the link variable]
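
The effect of this join is easy to see in a small sketch: because every row carries the same link value, each customer row is matched with both rows of blending.csv and therefore appears twice, once with set = 1 and once with set = 2. The snippet below uses plain Python and two made-up customer rows; only the link and set columns come from the article.

```python
import csv
import io

# Two hypothetical customer rows with the extra `link` column added
customers_csv = """customerID,tenure,Churn,link
A,1,Yes,link
B,3,No,link
"""

# Contents of blending.csv
blending_csv = """link,set
link,1
link,2
"""

customers = list(csv.DictReader(io.StringIO(customers_csv)))
blending = list(csv.DictReader(io.StringIO(blending_csv)))

# Join on `link`; every customer matches both blending rows,
# so a left join and an inner join give the same result here.
joined = [
    {**c, "set": b["set"]}
    for c in customers
    for b in blending
    if c["link"] == b["link"]
]

for row in joined:
    print(row["customerID"], row["tenure"], row["Churn"], row["set"])
```

Having two copies of every point is what later makes it possible to draw both the horizontal and the vertical part of each step.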

As this is the “normal mode”, we will combine a few steps at the same time and create a calculated field called Kaplan-Meier Dots:

[Image: definition of the Kaplan-Meier Dots calculated field]

You can easily recognize the contents of this field from the “easy mode”. This time, we have put everything into one field. Next comes the new part. We define the Kaplan-Meier Curve as:

[Image: definition of the Kaplan-Meier Curve calculated field]

This convoluted formula will enable us to obtain the step-like shape of the curves. Lastly, we need one more helper variable:

[Image: definition of the Index calculated field]

When doing so, please click on the Default Table Calculation and specify to compute the results along tenure.
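
The general idea behind the step shape can be sketched in Python: every point of the curve is duplicated, one copy keeps the previous survival value and the other takes the current one, and the copies are then connected in order. This is a sketch of the technique under my own assumptions rather than the exact Tableau formula; the function name `step_path`, the toy points, and the choice of which copy comes first are all hypothetical.

```python
def step_path(points, start=1.0):
    """Turn [(t, S(t)), ...] sorted by t into the vertices of a step curve."""
    path = []
    previous = start  # survival starts at 100%
    for t, s in points:
        path.append((t, previous))  # first copy: horizontal segment reaches t at the old level
        path.append((t, s))         # second copy: vertical drop to the new level
        previous = s
    return path


# Toy Kaplan-Meier points (made up)
points = [(1, 0.9), (2, 0.7), (4, 0.7)]
path = step_path(points)
print(path)
```

Drawing a line through the vertices in this order gives the familiar staircase; in the dashboard, the duplicated rows and an ordering field play the role of the two copies and the loop order.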

Finally, we have all the building blocks to create the curves. The setup is similar to the “easy mode”, except that we place Index on Path and set on Detail. To recreate the curves from Python, we once again use PaymentMethod as the Color.

[Image: step-shaped Kaplan-Meier curves by PaymentMethod created in Tableau]

In the picture above, we accurately recreated the curves we previously obtained using the lifelines library in Python. This definitely required a bit more work but can pay off in the end.

We can additionally use the Kaplan-Meier Dots to visualize the events as they happen along the curve. In this case, I believe this would simply clutter the visualization. It would be more suitable for a smaller dataset.

We can further improve the dashboard by adding some filters/splits and then share it with our colleagues via the company’s reporting portal (in this case, an instance of Tableau Server).

Conclusion

In this article, I explained the potential benefits of using business intelligence tools such as Tableau for survival analysis and showed how to create dashboards with the Kaplan-Meier curves.

As is often the case, nothing comes for free and there are also some disadvantages to this approach:

  • Calculating the confidence intervals is definitely harder and takes considerable effort.
  • In Tableau, there is no simple way to carry out the log-rank test to compare different survival curves (unless we use R from Tableau, but this might be an idea for a future article).
  • If new features are added to the data, for example, a new customer segmentation or another category for each observation, an analyst still needs to do some work to add them to an already existing dashboard. However, this rarely happens and usually requires little extra work.
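
To give a sense of why the confidence intervals take more effort, one common choice is Greenwood's formula for the variance of the estimator, which needs an additional running sum on top of the running product used for the curve itself (using the same $d_i$ and $n_i$ as before):

```latex
\widehat{\mathrm{Var}}\left[\hat{S}(t)\right]
  = \hat{S}(t)^2 \sum_{i:\, t_i \le t} \frac{d_i}{n_i \,(n_i - d_i)}
```

An approximate 95% interval is then $\hat{S}(t) \pm 1.96 \sqrt{\widehat{\mathrm{Var}}[\hat{S}(t)]}$. Libraries may use a transformed variant of this interval, so the bands will not necessarily match a library's output exactly.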

I hope you enjoyed this alternative approach to visualizing the Kaplan-Meier curves. As always, any constructive feedback is welcome. You can reach out to me on Twitter or in the comments.

Translated from: https://towardsdatascience.com/level-up-your-kaplan-meier-curves-with-tableau-bc4a10ec6a15
