Speed Up Your Data Cleaning and Preprocessing with klib

TL;DR: The klib package provides a number of very easily applicable functions with sensible default values that can be used on virtually any DataFrame to assess data quality, gain insight, perform cleaning operations and create visualizations, resulting in a Pandas DataFrame that is much lighter and more convenient to work with.

Over the past couple of months I’ve implemented a range of functions which I frequently use for virtually any data analysis and preprocessing task, irrespective of the dataset or ultimate goal.

These functions require nothing but a Pandas DataFrame of any size and any datatypes and can be accessed through simple one-line calls to gain insight into your data, clean up your DataFrames and visualize relationships between features. It is up to you whether you stick to the sensible, yet sometimes conservative, default parameters or customize the experience by adjusting them to your needs.

This package is not meant to provide an Auto-ML style API. Rather, it is a collection of functions which you can — and probably should — call every time you start working on a new project or dataset. Not only for your own understanding of what you are dealing with, but also to produce plots you can show to supervisors, customers or anyone else looking to get a higher-level representation and explanation of the data.

Installation Instructions

Install klib using pip:

pip install --upgrade klib

Alternatively, to install with conda run:

conda install -c conda-forge klib

What follows is a workflow and set of best practices which I repeatedly apply when facing new datasets.

Quick Outline

  • Assessing Data Quality
  • Data Cleaning
  • Visualizing Relationships

The data used in this guide is a slightly truncated version of the NFL Dataset found on Kaggle. You can download it here or use any data of your own to follow along.

Assessing the Data Quality

Determining data quality before starting to work on a dataset is crucial. A quick way to achieve that is to use klib's missing value visualization, which can be called as easily as follows:

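A minimal sketch of that call, assuming the dataset has been read into a DataFrame and that the plotting function is exposed as klib.missingval_plot() (the function name is not shown in the original post, so verify it against the klib documentation):

import klib
import pandas as pd

df = pd.read_csv('NFL_DATASET.csv')  # placeholder path for the truncated Kaggle NFL data
klib.missingval_plot(df)  # assumed name of klib's missing value visualization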

[Figure: Default representation of missing values]

This single plot already shows us a number of important things. Firstly, we can identify columns where all or most of the values are missing. These are candidates for dropping, while those with fewer missing values might benefit from imputation.

Secondly, we can often see patterns of missing rows stretching across many features. We might want to eliminate those rows first before thinking about dropping potentially relevant features.

And lastly, the additional statistics at the top and the right side give us valuable information regarding thresholds we can use for dropping rows or columns with many missing values. In our example we can see that if we drop rows with more than 30 missing values, we only lose a few entries. At the same time, if we eliminate columns with more than 80% of their values missing, the four most affected columns are removed.

A quick note on performance: Despite going through about 2 million entries with 66 features each, the plot takes only seconds to create.

Data Cleaning

With this insight, we can go ahead and start cleaning the data. With klib this is as simple as calling klib.data_cleaning(), which performs the following operations:

  • cleaning the column names: This unifies the column names by formatting them, splitting CamelCase into camel_case, removing special characters as well as leading and trailing whitespace, and formatting all column names as lowercase_and_underscore_separated. It also checks for and fixes duplicate column names, which you sometimes get when reading data from a file.

  • dropping empty and virtually empty columns: You can use the parameters drop_threshold_cols and drop_threshold_rows to adjust the dropping to your needs. The default is to drop columns and rows with more than 90% of the values missing (a call using these parameters is sketched after this list).

  • dropping single-valued columns: As the name states, this removes columns in which every cell contains the same value. This comes in handy when a column such as "year" is included while you are only looking at a single year. Other examples are "download_date" or indicator variables which are identical for all entries.

  • dropping duplicate rows: This is a straightforward drop of entirely duplicate rows. If you are dealing with data where duplicates add value, consider setting drop_duplicates=False.

  • Lastly, and often most importantly, especially for memory reduction and therefore for speeding up the subsequent steps in your workflow, klib.data_cleaning() also optimizes the datatypes, as we can see below.

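Putting the pieces together, the cleaning call might look like the following sketch. The parameter names are the ones mentioned above; the values shown simply restate the defaults described in the text, and show controls the verbosity of the summary such as the one printed below:

import klib

df_cleaned = klib.data_cleaning(
    df,
    drop_threshold_cols=0.9,  # drop columns with more than 90% of values missing
    drop_threshold_rows=0.9,  # drop rows with more than 90% of values missing
    drop_duplicates=True,     # drop entirely duplicated rows
    show='changes'            # print a summary of the changes, as shown below
)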

Shape of cleaned data: (183337, 62) - Remaining NAs: 1754608
Changes:
Dropped rows: 123
of which 123 duplicates. (Rows: [22257, 25347, 26631, 30310, 33558, 35164, 35777, ..., 182935, 182942, 183058, 183368, 183369])
Dropped columns: 4
of which 1 single valued. (Column: ['play_attempted'])
Dropped missing values: 523377
Reduced memory by at least: 63.69 MB (-68.94%)

You can change the verbosity of the output using the parameter show=None, show='changes' or show='all'. Please note that the memory reduction indicates a very conservative value (i.e. less reduction than is actually achieved), as it only performs a shallow memory check. A deep memory analysis slows down the function for larger datasets, but if you are curious about the "true" reduction in size you can use the df.info() method as shown below.

df.info(memory_usage='deep')
dtypes: float64(25), int64(20), object(21)
memory usage: 256.7 MB

As we can see, pandas assigns 64 bits of storage for each float and int. Additionally, 21 columns are of type "object", which is a rather inefficient way to store data. After data cleaning, the memory usage drops to only 58.4 MB, a reduction of almost 80%! This is achieved by converting, where possible, float64 to float32, and int64 to int8. Also, the dtypes string and category are utilized. The available parameters such as convert_dtypes, category, cat_threshold and many more allow you to tune the function to your needs.

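For instance, a sketch of tuning the dtype conversion through those parameters (the parameter names are the ones mentioned above; the values here are illustrative, not necessarily klib's defaults):

df_cleaned = klib.data_cleaning(df, convert_dtypes=True, category=True, cat_threshold=0.05)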

df_cleaned.info(memory_usage='deep')
dtypes: category(17), float32(25), int8(19), string(1)
memory usage: 58.4 MB

Lastly, we take a look at the column names, which were actually quite well formatted in the original dataset already. However, after the cleaning process, you can rely on lowercase and underscore-connected column names. While attribute access is not always advisable if you want to avoid ambiguity, this now allows you to use df.yards_gained instead of df["Yards.Gained"], which can be really useful for quick lookups or when exploring the data for the first time.

Some column name examples:
Yards.Gained --> yards_gained
PlayAttempted --> play_attempted
Challenge.Replay --> challenge_replay

Ultimately, and to sum it all up: we find that not only have the column names been neatly formatted and unified, but also that the features have been converted to more efficient datatypes. With the relatively mild default settings, only 123 rows and 4 columns, of which one column was single-valued, have been eliminated. This leaves us with a lightweight DataFrame of shape (183337, 62) and 58 MB memory usage.

Correlation Plots

Once the initial data cleaning is done, it makes sense to take a look at the relationships between the features. For this we employ the function klib.corr_plot(). Setting the split parameter to "pos", "neg", "high" or "low", and optionally combining each setting with a threshold, allows us to dig deeper and highlight the most important aspects.

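A sketch of those calls on the cleaned DataFrame, using the split values quoted above (all other arguments are left at their defaults):

klib.corr_plot(df_cleaned, split='pos')  # only positive correlations
klib.corr_plot(df_cleaned, split='neg')  # only negative correlations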

[Figure: Correlation plots showing high and low correlations]

At a glance, we can identify a number of interesting relations. Similarly, we can easily zoom in on correlations above any given threshold, let’s say |0.5|. Not only does this allow us to spot features which might be causing trouble later on in our analysis, it also shows us that there are quite a few highly negatively correlated features in our data. Given sufficient domain expertise, this can be a great starting point for some feature engineering!

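Zooming in on strong correlations is the same call with a different split setting combined with a threshold, as described above; a sketch, with the threshold taken from the |0.5| mentioned in the text:

klib.corr_plot(df_cleaned, split='high', threshold=0.5)  # only correlations with an absolute value above 0.5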

[Figure: Plot of high absolute correlations]

Further, using the same function, we can take a look at the correlations between features and a chosen target. The target column can be supplied as a column name of the current DataFrame, as a separate pd.Series, a np.ndarray or simply as a list.

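A sketch of such a call; 'touchdown' is only a hypothetical column name standing in for whatever label you are interested in, and the keyword is assumed to be target:

klib.corr_plot(df_cleaned, target='touchdown')  # correlations of all features with a chosen target column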

[Figure: Plot of correlations with the target / label]

Just as before, it is possible to use a wide range of parameters for customization, such as removing annotations, changing the correlation method or changing the colormap to match your preferred style or corporate identity.

Categorical Data

In a last step in this guide, we take a quick look at the capabilities for visualizing categorical columns. The function klib.cat_plot() allows us to display the top and/or bottom values of each column by frequency. This gives us an idea of the distribution of values in the dataset, which is very helpful when considering combining less frequent values into a separate category before applying one-hot encoding or similar functions. In this example we can see that for the column "play_type" roughly 75% of all entries are made up of the three most frequent values. Further, we can immediately see that "Pass" and "Run" are by far the most frequent values (75k and 55k). Conversely, the plot also shows us that "desc" is made up of 170384 unique strings.

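A sketch of that call; the top and bottom keywords are assumed to control how many of the most and least frequent values are displayed, and the counts here are illustrative:

klib.cat_plot(df_cleaned, top=4, bottom=4)  # show the 4 most and 4 least frequent values per categorical column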

[Figure: Categorical data plot]

The klib package includes many more helpful functions for data analysis and cleaning, not to mention some customized sklearn pipelines, which you can easily stack together using a FeatureUnion and then use in GridSearchCV or similar. So if you intend to take a shortcut, simply call klib.data_cleaning() and plug the resulting DataFrame into that pipeline. Likely, you will already get a very decent result!

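How that wiring could look, sketched with generic scikit-learn components since the klib-provided pipelines themselves are not shown here (the transformers, estimator and parameter grid are placeholders):

from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# stack several (placeholder) feature transformers side by side
features = FeatureUnion([
    ('pca', PCA(n_components=5)),
    ('kbest', SelectKBest(f_classif, k=5)),
])

pipe = Pipeline([
    ('features', features),
    ('clf', RandomForestClassifier()),
])

search = GridSearchCV(pipe, param_grid={'clf__n_estimators': [100, 300]}, cv=3)
# X, y would be the (numeric) features and target taken from df_cleaned
# search.fit(X, y)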

Conclusion

All of these functions make for very convenient data cleaning and visualization and come with many more features and settings than described here. They are by no means a one-size-fits-all solution, but they should be very helpful in your data preparation process. klib also includes various other functions, most notably pool_duplicate_subsets(), to pool subsets of the data across different features as a means of dimensionality reduction, dist_plot(), to visualize distributions of numerical features, as well as mv_col_handling(), which provides a sophisticated 3-step process attempting to identify any remaining information in columns with many missing values, instead of simply dropping them right away.

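Those functions follow the same one-line pattern as the rest of the package; a minimal sketch, assuming each takes the DataFrame as its first argument and using default settings throughout:

klib.dist_plot(df_cleaned)               # distribution plots for the numerical features
klib.mv_col_handling(df_cleaned)         # 3-step handling of columns with many missing values
klib.pool_duplicate_subsets(df_cleaned)  # pool duplicate subsets across features (dimensionality reduction)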

Note: Please let me know what you would like to see next and which functions you feel are missing, either in the comments below or by opening an issue on GitHub. Also let me know if you would like to see some examples on the handling of missing values, subset pooling or the customized sklearn pipelines.

Translated from: https://towardsdatascience.com/speed-up-your-data-cleaning-and-preprocessing-with-klib-97191d320f80
