An Extensive Step-by-Step Guide for Data Preparation


Table of Contents

  1. Introduction
  2. What is Data Preparation
  3. Exploratory Data Analysis (EDA)
  4. Data Preprocessing
  5. Data Splitting

Introduction

Before we get into this, I want to make it clear that there is no rigid process when it comes to data preparation. How you prepare one set of data will most likely be different from how you prepare another set of data. Therefore this guide aims to provide an overarching guide that you can refer to when preparing any particular set of data.

Before we get into the guide, I should probably go over what Data Preparation is…

What is Data Preparation?

Data preparation is the step after data collection in the machine learning life cycle and it’s the process of cleaning and transforming the raw data you collected. By doing so, you’ll have a much easier time when it comes to analyzing and modeling your data.

There are three main parts to data preparation that I’ll go over in this article:


  1. Exploratory Data Analysis (EDA)
  2. Data preprocessing
  3. Data splitting

1. Exploratory Data Analysis (EDA)

Exploratory data analysis, or EDA for short, is exactly what it sounds like: exploring your data. In this step, you’re simply getting an understanding of the data you’re working with. In the real world, datasets are rarely as clean or intuitive as Kaggle datasets.

The more you explore and understand the data you’re working with, the easier it’ll be when it comes to data preprocessing.

Below is a list of things that you should consider in this step:

Feature and Target Variables

Determine what the feature (input) variables are and what the target variable is. Don’t worry about determining what the final input variables are, but make sure you can identify both types of variables.

Data Types

Figure out what type of data you’re working with. Is it categorical, numerical, or neither? This is especially important for the target variable, as its data type will narrow down which machine learning models you can use. Pandas tools like df.describe() and df.dtypes are useful here.

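
As a quick sketch of this step, the pandas calls mentioned above look like the following (the tiny DataFrame is a made-up stand-in for a real dataset):

```python
import pandas as pd

# A small hypothetical dataset with mixed types
df = pd.DataFrame({
    "age": [23, 45, 31],                     # numerical
    "car_colour": ["red", "green", "blue"],  # categorical
    "income": [40000.0, 85000.0, 62000.0],   # numerical
})

print(df.dtypes)      # the data type of each column
print(df.describe())  # summary statistics for the numerical columns
```

Categorical columns show up with an object (or category) dtype, which is a quick signal that they will need encoding later.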

Check for Outliers

An outlier is a data point that differs significantly from other observations. In this step, you’ll want to identify outliers and try to understand why they’re in the data. Depending on why they’re in the data, you may decide to remove them from the dataset or keep them. There are a couple of ways to identify outliers:

  1. Z-score/standard deviations: if we know that 99.7% of the data in a dataset lies within three standard deviations, then we can calculate the size of one standard deviation, multiply it by 3, and identify the data points outside of this range. Likewise, we can calculate the z-score of a given point, and if it’s beyond +/- 3, then it’s an outlier. Note that there are a few contingencies to consider when using this method: the data must be normally distributed, it’s not applicable to small datasets, and the presence of too many outliers can throw off the z-scores.

  2. Interquartile Range (IQR): the IQR, the concept used to build boxplots, can also be used to identify outliers. The IQR is equal to the difference between the 3rd quartile and the 1st quartile. A point is then flagged as an outlier if it is less than Q1 - 1.5*IQR or greater than Q3 + 1.5*IQR. For normally distributed data, these fences sit at approximately 2.698 standard deviations.

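
Both detection methods can be sketched with NumPy; the normal sample and the planted outlier value of 150 below are illustrative assumptions:

```python
import numpy as np

# Synthetic data: 1,000 roughly normal points plus one planted outlier
rng = np.random.default_rng(0)
data = np.append(rng.normal(50, 5, 1000), [150.0])

# Method 1: z-score -- flag points more than 3 standard deviations from the mean
z_scores = (data - data.mean()) / data.std()
z_outliers = data[np.abs(z_scores) > 3]

# Method 2: IQR -- flag points outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]
q1, q3 = np.percentile(data, [25, 75])
iqr = q3 - q1
iqr_outliers = data[(data < q1 - 1.5 * iqr) | (data > q3 + 1.5 * iqr)]
```

Both methods flag the planted value of 150. Note that the z-score method is itself sensitive to the outliers it is hunting, since extreme values inflate the mean and standard deviation.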

Ask Questions

There’s no doubt that you’ll have questions about the data you’re working with, especially for a dataset outside of your domain knowledge. For example, when Kaggle held a competition on NFL analytics and injuries, I had to do some research to understand what the different positions were and what function they served for the team.

2. Data Preprocessing

Once you understand your data, a majority of your time as a data scientist is spent on this step: data preprocessing. This is when you manipulate the data so that it can be modeled properly. Like I said before, there is no universal way to go about this. However, there are a number of essential things you should consider, which we’ll go through below.

Feature Imputation

Feature Imputation is the process of filling missing values. This is important because most machine learning models don’t work when there are missing data in the dataset.

One of the main reasons that I wanted to write this guide is specifically for this step. Many articles say that you should default to filling missing values with the mean or simply remove the row, and this is not necessarily true.

Ideally, you want to choose the method that makes the most sense. For example, if you were modeling people’s age and income, it wouldn’t make sense for a 14-year-old to be making the national average salary.

All things considered, there are a number of ways you can deal with missing values:

  • Single value imputation: replacing missing values with the mean, median, or mode of a column

  • Multiple value imputation: modeling features that have missing data and imputing missing data with what your model finds.

  • K-Nearest neighbor: filling data with a value from another similar sample

  • Deleting the row: this isn’t an imputation technique, but it tends to be okay when you have an extremely large sample size and can afford to lose a few rows.

  • Others include: random imputation, moving window, most frequent, etc…

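
A few of the options above can be sketched with scikit-learn’s imputers (the tiny age/income table is made up, and the choice of the median and k=2 neighbors is arbitrary):

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer, SimpleImputer

# Hypothetical data with missing values in both columns
df = pd.DataFrame({"age": [14, 30, np.nan, 45],
                   "income": [0, 52000, 61000, np.nan]})

# Single value imputation: fill each missing value with the column median
median_filled = pd.DataFrame(
    SimpleImputer(strategy="median").fit_transform(df), columns=df.columns)

# K-nearest neighbors: fill using values from the most similar rows
knn_filled = pd.DataFrame(
    KNNImputer(n_neighbors=2).fit_transform(df), columns=df.columns)

# Deleting rows with any missing value
dropped = df.dropna()
```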

Feature Encoding

Feature encoding is the process of turning values (i.e. strings) into numbers. This is because most machine learning models require all values to be numeric.

There are a few ways that you can go about this:

  1. Label Encoding: Label encoding simply converts a feature’s non-numerical values into numerical values, whether the feature is ordinal or not. For example, if a feature called car_colour had distinct values of red, green, and blue, then label encoding would convert these values to 1, 2, and 3 respectively. Be wary when using this method because while some ML models will be able to make sense of the encoding, others won’t.

  2. One Hot Encoding (a.k.a. get_dummies): one hot encoding works by creating a binary feature (1, 0) for each non-numerical value of a given feature. Reusing the example above, if we had a feature called car_colour, then one hot encoding would create three features called car_colour_red, car_colour_green, and car_colour_blue, each holding a 1 or 0 to indicate whether the row has that value.

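
Both encodings can be sketched in pandas, reusing the hypothetical car_colour feature from the example above:

```python
import pandas as pd

df = pd.DataFrame({"car_colour": ["red", "green", "blue", "red"]})

# Label encoding: map each distinct value to an integer code
# (pandas assigns codes alphabetically: blue=0, green=1, red=2)
df["car_colour_label"] = df["car_colour"].astype("category").cat.codes

# One hot encoding: one binary column per distinct value
dummies = pd.get_dummies(df["car_colour"], prefix="car_colour")
```

scikit-learn’s LabelEncoder and OneHotEncoder do the same jobs when you need the encoding as a reusable transformer in a pipeline.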

Feature Normalization

When numerical values are on different scales (e.g. height in centimeters and weight in pounds), most machine learning algorithms don’t perform well. The k-nearest neighbors algorithm is a prime example of one that struggles with features on different scales. Normalizing or standardizing the data can help with this problem.

  • Feature normalization rescales the values so that they’re within a range of [0, 1].

  • Feature standardization rescales the data to have a mean of 0 and a standard deviation of one.

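
Both rescalings are simple enough to sketch by hand with NumPy (scikit-learn’s MinMaxScaler and StandardScaler wrap the same arithmetic); the heights below are made-up values:

```python
import numpy as np

heights_cm = np.array([150.0, 165.0, 180.0, 195.0])

# Normalization: rescale the values into the range [0, 1]
normalized = (heights_cm - heights_cm.min()) / (heights_cm.max() - heights_cm.min())

# Standardization: rescale to mean 0 and standard deviation 1
standardized = (heights_cm - heights_cm.mean()) / heights_cm.std()
```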

Feature Engineering

Feature engineering is the process of transforming raw data into features that better represent the underlying problem that one is trying to solve. There’s no specific way to go about this step but here are some things that you can consider:

  • Converting a DateTime variable to extract just the day of the week, the month of the year, etc…

  • Creating bins or buckets for a variable. (eg. for a height variable, can have 100–149cm, 150–199cm, 200–249cm, etc.)

  • Combining multiple features and/or values to create a new one. For example, one of the most accurate models for the Titanic challenge engineered a new variable called “Is_women_or_child”, which was True if the person was a woman or a child and False otherwise.

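
The three ideas above can be sketched in pandas on a made-up table (the column names and the Titanic-style flag are illustrative assumptions):

```python
import pandas as pd

df = pd.DataFrame({
    "signup": pd.to_datetime(["2020-01-15", "2020-06-03"]),
    "height_cm": [148, 202],
    "sex": ["female", "male"],
    "age": [34, 9],
})

# Extract parts of a DateTime variable
df["signup_month"] = df["signup"].dt.month
df["signup_dayofweek"] = df["signup"].dt.dayofweek

# Bin a numerical variable into buckets
df["height_bucket"] = pd.cut(df["height_cm"], bins=[100, 150, 200, 250],
                             labels=["100-149", "150-199", "200-249"])

# Combine features/values into a new one (the Titanic-style flag)
df["is_woman_or_child"] = (df["sex"] == "female") | (df["age"] < 16)
```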

Feature Selection

Next is feature selection: choosing the most relevant/valuable features of your dataset. There are a few methods I like to use that can help with selecting features:

  • Feature importance: some algorithms, like random forests or XGBoost, allow you to determine which features were the most “important” in predicting the target variable’s value. By quickly creating one of these models and computing feature importances, you’ll get a sense of which variables are more useful than others.

  • Dimensionality reduction: One of the most common dimensionality reduction techniques, Principal Component Analysis (PCA) takes a large number of features and uses linear algebra to reduce them to fewer features.

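
A sketch of both techniques with scikit-learn; make_classification stands in for a real dataset, and the hyperparameters are arbitrary:

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier

# Synthetic data standing in for a real dataset
X, y = make_classification(n_samples=200, n_features=10,
                           n_informative=3, random_state=0)

# Feature importance from a quick random forest
forest = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)
importances = forest.feature_importances_  # one score per feature, summing to 1

# Dimensionality reduction: keep enough components for 95% of the variance
X_reduced = PCA(n_components=0.95).fit_transform(X)
```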

Dealing with Data Imbalances

One other thing that you’ll want to consider is data imbalances. For example, if there are 5,000 examples of one class (eg. not fraudulent) but only 50 examples for another class (eg. fraudulent), then you’ll want to consider one of a few things:

  • Collecting more data — this always works in your favor but is usually not possible or too expensive.

  • You can oversample or undersample the data using the imbalanced-learn package (part of scikit-learn-contrib).

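
imbalanced-learn ships ready-made samplers (RandomOverSampler, RandomUnderSampler), but the core idea of random oversampling can also be sketched with plain scikit-learn utilities; the fraud counts below are made up:

```python
import pandas as pd
from sklearn.utils import resample

# Made-up imbalanced dataset: 1,000 legitimate rows, 50 fraudulent ones
df = pd.DataFrame({"amount": range(1050),
                   "fraudulent": [0] * 1000 + [1] * 50})

majority = df[df["fraudulent"] == 0]
minority = df[df["fraudulent"] == 1]

# Naive random oversampling: resample the minority class with replacement
# until it matches the majority class in size
minority_up = resample(minority, replace=True,
                       n_samples=len(majority), random_state=0)
balanced = pd.concat([majority, minority_up])
```

One caveat: only resample the training set, never the validation or test sets, otherwise duplicated rows leak into your evaluation.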

3. Data Splitting

Last comes splitting your data. I’m just going to give a very generic, generally agreed-upon framework that you can use here.

Typically you’ll want to split your data into three sets:

  1. Training Set (70–80%): this is what the model learns on

  2. Validation Set (10–15%): the model’s hyperparameters are tuned on this set

  3. Test set (10–15%): finally, the model’s final performance is evaluated on this. If you’ve prepared the data correctly, the results from the test set should give a good indication of how the model will perform in the real world.

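
One common way to get a 70/15/15 split is to call scikit-learn’s train_test_split twice, as sketched below on synthetic data:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, random_state=0)

# First carve off the 70% training set...
X_train, X_rest, y_train, y_rest = train_test_split(
    X, y, test_size=0.30, random_state=0)

# ...then split the remaining 30% evenly into validation and test sets
X_val, X_test, y_val, y_test = train_test_split(
    X_rest, y_rest, test_size=0.50, random_state=0)
```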

Thanks for Reading!

I hope you’ve learned a thing or two from this. By reading this, you should now have a general framework in mind when it comes to data preparation. There are many things to consider, but having resources like this to remind you is always helpful.

If you follow these steps and keep these things in mind, you’ll definitely have your data better prepared, and you’ll ultimately be able to develop a more accurate model!

Terence Shin

  • Check out my free data science resource with new material every week!

  • If you enjoyed this, follow me on Medium for more

  • Let’s connect on LinkedIn

Source: https://towardsdatascience.com/an-extensive-step-by-step-guide-for-data-preparation-aee4a109051d
