An Extensive Step-by-Step Guide for Data Preparation


Table of Contents

  1. Introduction
  2. What is Data Preparation
  3. Exploratory Data Analysis (EDA)
  4. Data Preprocessing
  5. Data Splitting

Introduction

Before we get into this, I want to make it clear that there is no rigid process when it comes to data preparation. How you prepare one set of data will most likely be different from how you prepare another set of data. Therefore this guide aims to provide an overarching guide that you can refer to when preparing any particular set of data.

Before we get into the guide, I should probably go over what Data Preparation is…

What is Data Preparation?

Data preparation is the step after data collection in the machine learning life cycle and it’s the process of cleaning and transforming the raw data you collected. By doing so, you’ll have a much easier time when it comes to analyzing and modeling your data.

There are three main parts to data preparation that I’ll go over in this article:


  1. Exploratory Data Analysis (EDA)
  2. Data preprocessing
  3. Data splitting

1. Exploratory Data Analysis (EDA)

Exploratory data analysis, or EDA for short, is exactly what it sounds like: exploring your data. In this step, you’re simply getting an understanding of the data you’re working with. In the real world, datasets are rarely as clean or intuitive as Kaggle datasets.

The more you explore and understand the data you’re working with, the easier it’ll be when it comes to data preprocessing.

Below is a list of things that you should consider in this step:

Feature and Target Variables

Determine what the feature (input) variables are and what the target variable is. Don’t worry about determining what the final input variables are, but make sure you can identify both types of variables.

Data Types

Figure out what type of data you’re working with. Is it categorical, numerical, or neither? This is especially important for the target variable, as its data type will narrow down which machine learning models you can use. Pandas tools like df.describe() and df.dtypes are useful here.

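
As a quick sketch of this step, the pandas calls mentioned above look like the following (the tiny DataFrame is a made-up stand-in for a real dataset):

```python
import pandas as pd

# A small hypothetical dataset with mixed types
df = pd.DataFrame({
    "age": [23, 45, 31],                     # numerical
    "car_colour": ["red", "green", "blue"],  # categorical
    "income": [40000.0, 85000.0, 62000.0],   # numerical
})

print(df.dtypes)      # the data type of each column
print(df.describe())  # summary statistics for the numerical columns
```

Categorical columns show up with an object (or category) dtype, which is a quick signal that they will need encoding later.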

Check for Outliers

An outlier is a data point that differs significantly from other observations. In this step, you’ll want to identify outliers and try to understand why they’re in the data. Depending on why they’re in the data, you may decide to remove them from the dataset or keep them. There are a couple of ways to identify outliers:

  1. Z-score/standard deviations: if we know that 99.7% of the data in a dataset lies within three standard deviations, then we can calculate the size of one standard deviation, multiply it by 3, and identify the data points outside of this range. Likewise, we can calculate the z-score of a given point, and if it’s beyond +/- 3, then it’s an outlier. Note that there are a few contingencies to consider when using this method: the data must be normally distributed, it’s not applicable to small datasets, and the presence of too many outliers can throw off the z-scores.

  2. Interquartile Range (IQR): the IQR, the concept used to build boxplots, can also be used to identify outliers. The IQR is equal to the difference between the 3rd quartile and the 1st quartile. A point is then flagged as an outlier if it is less than Q1 - 1.5*IQR or greater than Q3 + 1.5*IQR. For normally distributed data, these fences sit at approximately 2.698 standard deviations.

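
Both detection methods can be sketched with NumPy; the normal sample and the planted outlier value of 150 below are illustrative assumptions:

```python
import numpy as np

# Synthetic data: 1,000 roughly normal points plus one planted outlier
rng = np.random.default_rng(0)
data = np.append(rng.normal(50, 5, 1000), [150.0])

# Method 1: z-score -- flag points more than 3 standard deviations from the mean
z_scores = (data - data.mean()) / data.std()
z_outliers = data[np.abs(z_scores) > 3]

# Method 2: IQR -- flag points outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]
q1, q3 = np.percentile(data, [25, 75])
iqr = q3 - q1
iqr_outliers = data[(data < q1 - 1.5 * iqr) | (data > q3 + 1.5 * iqr)]
```

Both methods flag the planted value of 150. Note that the z-score method is itself sensitive to the outliers it is hunting, since extreme values inflate the mean and standard deviation.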

Ask Questions

There’s no doubt that you’ll have questions about the data you’re working with, especially for a dataset outside of your domain knowledge. For example, when Kaggle held a competition on NFL analytics and injuries, I had to do some research to understand what the different positions were and what function they served for the team.

2. Data Preprocessing

Once you understand your data, a majority of your time as a data scientist is spent on this step: data preprocessing. This is when you manipulate the data so that it can be modeled properly. Like I said before, there is no universal way to go about this. However, there are a number of essential things you should consider, which we’ll go through below.

Feature Imputation

Feature Imputation is the process of filling missing values. This is important because most machine learning models don’t work when there are missing data in the dataset.

One of the main reasons that I wanted to write this guide is specifically for this step. Many articles say that you should default to filling missing values with the mean or simply remove the row, and this is not necessarily true.

Ideally, you want to choose the method that makes the most sense. For example, if you were modeling people’s age and income, it wouldn’t make sense for a 14-year-old to be making the national average salary.

All things considered, there are a number of ways you can deal with missing values:

  • Single value imputation: replacing missing values with the mean, median, or mode of a column

  • Multiple value imputation: modeling features that have missing data and imputing missing data with what your model finds.

  • K-Nearest neighbor: filling data with a value from another similar sample

  • Deleting the row: this isn’t an imputation technique, but it tends to be okay when you have an extremely large sample size and can afford to lose a few rows.

  • Others include: random imputation, moving window, most frequent, etc…

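
A few of the options above can be sketched with scikit-learn’s imputers (the tiny age/income table is made up, and the choice of the median and k=2 neighbors is arbitrary):

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer, SimpleImputer

# Hypothetical data with missing values in both columns
df = pd.DataFrame({"age": [14, 30, np.nan, 45],
                   "income": [0, 52000, 61000, np.nan]})

# Single value imputation: fill each missing value with the column median
median_filled = pd.DataFrame(
    SimpleImputer(strategy="median").fit_transform(df), columns=df.columns)

# K-nearest neighbors: fill using values from the most similar rows
knn_filled = pd.DataFrame(
    KNNImputer(n_neighbors=2).fit_transform(df), columns=df.columns)

# Deleting rows with any missing value
dropped = df.dropna()
```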

Feature Encoding

Feature encoding is the process of turning values (i.e. strings) into numbers. This is because most machine learning models require all values to be numeric.

There are a few ways that you can go about this:

  1. Label Encoding: Label encoding simply converts a feature’s non-numerical values into numerical values, whether the feature is ordinal or not. For example, if a feature called car_colour had distinct values of red, green, and blue, then label encoding would convert these values to 1, 2, and 3 respectively. Be wary when using this method because while some ML models will be able to make sense of the encoding, others won’t.

  2. One Hot Encoding (a.k.a. get_dummies): one hot encoding works by creating a binary feature (1, 0) for each non-numerical value of a given feature. Reusing the example above, if we had a feature called car_colour, then one hot encoding would create three features called car_colour_red, car_colour_green, and car_colour_blue, each holding a 1 or 0 to indicate whether the row has that value.

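
Both encodings can be sketched in pandas, reusing the hypothetical car_colour feature from the example above:

```python
import pandas as pd

df = pd.DataFrame({"car_colour": ["red", "green", "blue", "red"]})

# Label encoding: map each distinct value to an integer code
# (pandas assigns codes alphabetically: blue=0, green=1, red=2)
df["car_colour_label"] = df["car_colour"].astype("category").cat.codes

# One hot encoding: one binary column per distinct value
dummies = pd.get_dummies(df["car_colour"], prefix="car_colour")
```

scikit-learn’s LabelEncoder and OneHotEncoder do the same jobs when you need the encoding as a reusable transformer in a pipeline.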

Feature Normalization

When numerical values are on different scales (e.g. height in centimeters and weight in pounds), most machine learning algorithms don’t perform well. The k-nearest neighbors algorithm is a prime example of one that struggles with features on different scales. Normalizing or standardizing the data can help with this problem.

  • Feature normalization rescales the values so that they’re within a range of [0, 1].

  • Feature standardization rescales the data to have a mean of 0 and a standard deviation of one.

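
Both rescalings are simple enough to sketch by hand with NumPy (scikit-learn’s MinMaxScaler and StandardScaler wrap the same arithmetic); the heights below are made-up values:

```python
import numpy as np

heights_cm = np.array([150.0, 165.0, 180.0, 195.0])

# Normalization: rescale the values into the range [0, 1]
normalized = (heights_cm - heights_cm.min()) / (heights_cm.max() - heights_cm.min())

# Standardization: rescale to mean 0 and standard deviation 1
standardized = (heights_cm - heights_cm.mean()) / heights_cm.std()
```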

Feature Engineering

Feature engineering is the process of transforming raw data into features that better represent the underlying problem that one is trying to solve. There’s no specific way to go about this step but here are some things that you can consider:

  • Converting a DateTime variable to extract just the day of the week, the month of the year, etc…

  • Creating bins or buckets for a variable. (eg. for a height variable, can have 100–149cm, 150–199cm, 200–249cm, etc.)

  • Combining multiple features and/or values to create a new one. For example, one of the most accurate models for the Titanic challenge engineered a new variable called “Is_women_or_child”, which was True if the person was a woman or a child and False otherwise.

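
The three ideas above can be sketched in pandas on a made-up table (the column names and the Titanic-style flag are illustrative assumptions):

```python
import pandas as pd

df = pd.DataFrame({
    "signup": pd.to_datetime(["2020-01-15", "2020-06-03"]),
    "height_cm": [148, 202],
    "sex": ["female", "male"],
    "age": [34, 9],
})

# Extract parts of a DateTime variable
df["signup_month"] = df["signup"].dt.month
df["signup_dayofweek"] = df["signup"].dt.dayofweek

# Bin a numerical variable into buckets
df["height_bucket"] = pd.cut(df["height_cm"], bins=[100, 150, 200, 250],
                             labels=["100-149", "150-199", "200-249"])

# Combine features/values into a new one (the Titanic-style flag)
df["is_woman_or_child"] = (df["sex"] == "female") | (df["age"] < 16)
```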

Feature Selection

Next is feature selection: choosing the most relevant/valuable features of your dataset. There are a few methods I like to use that can help with selecting features:

  • Feature importance: some algorithms, like random forests or XGBoost, allow you to determine which features were the most “important” in predicting the target variable’s value. By quickly creating one of these models and computing feature importances, you’ll get a sense of which variables are more useful than others.

  • Dimensionality reduction: One of the most common dimensionality reduction techniques, Principal Component Analysis (PCA) takes a large number of features and uses linear algebra to reduce them to fewer features.

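
A sketch of both techniques with scikit-learn; make_classification stands in for a real dataset, and the hyperparameters are arbitrary:

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier

# Synthetic data standing in for a real dataset
X, y = make_classification(n_samples=200, n_features=10,
                           n_informative=3, random_state=0)

# Feature importance from a quick random forest
forest = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)
importances = forest.feature_importances_  # one score per feature, summing to 1

# Dimensionality reduction: keep enough components for 95% of the variance
X_reduced = PCA(n_components=0.95).fit_transform(X)
```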

Dealing with Data Imbalances

One other thing that you’ll want to consider is data imbalances. For example, if there are 5,000 examples of one class (eg. not fraudulent) but only 50 examples for another class (eg. fraudulent), then you’ll want to consider one of a few things:

  • Collecting more data — this always works in your favor but is usually not possible or too expensive.

  • You can oversample or undersample the data using the imbalanced-learn package (part of scikit-learn-contrib).

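
imbalanced-learn ships ready-made samplers (RandomOverSampler, RandomUnderSampler), but the core idea of random oversampling can also be sketched with plain scikit-learn utilities; the fraud counts below are made up:

```python
import pandas as pd
from sklearn.utils import resample

# Made-up imbalanced dataset: 1,000 legitimate rows, 50 fraudulent ones
df = pd.DataFrame({"amount": range(1050),
                   "fraudulent": [0] * 1000 + [1] * 50})

majority = df[df["fraudulent"] == 0]
minority = df[df["fraudulent"] == 1]

# Naive random oversampling: resample the minority class with replacement
# until it matches the majority class in size
minority_up = resample(minority, replace=True,
                       n_samples=len(majority), random_state=0)
balanced = pd.concat([majority, minority_up])
```

One caveat: only resample the training set, never the validation or test sets, otherwise duplicated rows leak into your evaluation.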

3. Data Splitting

Last comes splitting your data. I’m just going to give a very generic, generally agreed-upon framework that you can use here.

Typically you’ll want to split your data into three sets:

  1. Training Set (70–80%): this is what the model learns on

  2. Validation Set (10–15%): the model’s hyperparameters are tuned on this set

  3. Test set (10–15%): finally, the model’s final performance is evaluated on this. If you’ve prepared the data correctly, the results from the test set should give a good indication of how the model will perform in the real world.

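
One common way to get a 70/15/15 split is to call scikit-learn’s train_test_split twice, as sketched below on synthetic data:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, random_state=0)

# First carve off the 70% training set...
X_train, X_rest, y_train, y_rest = train_test_split(
    X, y, test_size=0.30, random_state=0)

# ...then split the remaining 30% evenly into validation and test sets
X_val, X_test, y_val, y_test = train_test_split(
    X_rest, y_rest, test_size=0.50, random_state=0)
```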

Thanks for Reading!

I hope you’ve learned a thing or two from this. By reading this, you should now have a general framework in mind when it comes to data preparation. There are many things to consider, but having resources like this to remind you is always helpful.

If you follow these steps and keep these things in mind, you’ll definitely have your data better prepared, and you’ll ultimately be able to develop a more accurate model!

Terence Shin

  • Check out my free data science resource with new material every week!

  • If you enjoyed this, follow me on Medium for more

  • Let’s connect on LinkedIn

Source: https://towardsdatascience.com/an-extensive-step-by-step-guide-for-data-preparation-aee4a109051d
