5 Concepts Every Data Scientist Should Know

Opinion

Table of Contents

  1. Introduction
  2. Multicollinearity
  3. One-Hot Encoding
  4. Sampling
  5. Error Metrics
  6. Storytelling
  7. Summary

Introduction

I have written about common skills that Data Scientists can expect to use in their professional careers, so now I want to highlight some key Data Science concepts that are worth knowing and applying later. Some of these you may already know and some you may not; either way, my goal is to explain why each one matters in professional practice. Multicollinearity, one-hot encoding, undersampling and oversampling, error metrics, and, lastly, storytelling are the concepts I think of first when I picture a professional Data Scientist's day-to-day work. The last one is perhaps as much a skill as a concept, but I still want to highlight its importance to your everyday work as a Data Scientist. I expand on all of these concepts below.

Multicollinearity

Photo by The Creative Exchange on Unsplash [2].

Although the word is somewhat long and hard to say, multicollinearity is simple once you break it down: multi means many, and collinearity means linearly related. Multicollinearity is the situation in which two or more explanatory variables in a regression model are highly correlated with one another and therefore explain similar information. This raises concern for a few reasons.

For some modeling techniques, it can cause overfitting and, ultimately, a decline in model performance.

The data becomes redundant, and not every feature or attribute is needed in your model. There are a few ways to find out which features contribute to multicollinearity and are candidates for removal:

  • variance inflation factor (VIF)
  • correlation matrices

Both techniques are commonly used by Data Scientists. Correlation matrices and plots, usually visualized as a heatmap of some sort, are the more familiar of the two, while VIF is lesser known.
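As a rough illustration of the correlation-matrix approach, the sketch below builds a Pearson correlation matrix with NumPy and lists the feature pairs whose absolute correlation exceeds a cutoff. The feature names (`sqft`, `rooms`, `age`), the synthetic data, and the 0.8 threshold are all assumptions made for the example, not values from the article.

```python
import numpy as np

# Hypothetical toy data: `rooms` is built to track `sqft` closely,
# while `age` is generated independently of both.
rng = np.random.default_rng(0)
sqft = rng.normal(1500, 300, size=200)
rooms = sqft / 400 + rng.normal(0, 0.2, size=200)
age = rng.normal(30, 10, size=200)

X = np.column_stack([sqft, rooms, age])
names = ["sqft", "rooms", "age"]

# Pearson correlation matrix; rowvar=False treats each column as a variable.
corr = np.corrcoef(X, rowvar=False)

# Collect feature pairs whose absolute correlation exceeds the chosen cutoff.
threshold = 0.8  # arbitrary cutoff for this sketch
high_pairs = [
    (names[i], names[j], round(float(corr[i, j]), 3))
    for i in range(len(names))
    for j in range(i + 1, len(names))
    if abs(corr[i, j]) > threshold
]
print(high_pairs)
```

The same matrix is what a heatmap would display; the list of flagged pairs is simply a text version of its brightest off-diagonal cells.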

The higher a feature's VIF value, the more of that feature is already explained by the other features, and the less useful it is to your regression model.
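To make the VIF idea concrete: the VIF of feature j is commonly defined as 1 / (1 - R_j²), where R_j² comes from regressing feature j on all the other features. The sketch below computes it with plain NumPy least squares. The helper name `vif` and the toy data are hypothetical, and the often-quoted cutoffs of roughly 5 to 10 are conventions rather than hard rules.

```python
import numpy as np

def vif(X):
    """Return one variance inflation factor per column of X (rows are observations)."""
    X = np.asarray(X, dtype=float)
    n_rows, n_cols = X.shape
    factors = []
    for j in range(n_cols):
        target = X[:, j]
        others = np.delete(X, j, axis=1)
        # Regress column j on the remaining columns plus an intercept.
        design = np.column_stack([np.ones(n_rows), others])
        coef, *_ = np.linalg.lstsq(design, target, rcond=None)
        residuals = target - design @ coef
        r_squared = 1.0 - residuals.var() / target.var()
        factors.append(1.0 / (1.0 - r_squared))
    return np.array(factors)

# Hypothetical toy data: `rooms` is nearly a linear function of `sqft`.
rng = np.random.default_rng(0)
sqft = rng.normal(1500, 300, size=200)
rooms = sqft / 400 + rng.normal(0, 0.2, size=200)
age = rng.normal(30, 10, size=200)

vifs = vif(np.column_stack([sqft, rooms, age]))
for name, value in zip(["sqft", "rooms", "age"], vifs):
    print(f"{name}: VIF = {value:.2f}")
```

In this setup the two related features should show a clearly elevated VIF while the independent one stays close to 1, which is the pattern that suggests dropping or combining one of the related features.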

A great, simple resource for VIF is [3]:

One-Hot Encoding

One-hot encoding is a form of feature transformation that represents categorical features numerically. Categorical features hold text values; one-hot encoding pivots that information so that each distinct value becomes its own feature, and each row is marked with a 0 or a 1 for that feature. For example, for the categorical variable gender, one-hot encoding replaces the single gender column with one column per value (gender before, male/female after):

Before and after one-hot encoding. Screenshot by Author [4].

This transformation is useful when your data is not purely numerical and you need a numerical representation of text/categorical features.
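As a minimal sketch of the transformation described above, the following pure-Python function turns a list of category labels into 0/1 columns. The function name `one_hot` and the sample `gender` values are illustrative assumptions and are not taken from the original screenshot; in practice, libraries such as pandas (`pd.get_dummies`) provide the same operation.

```python
def one_hot(values):
    """One-hot encode a list of categorical values.

    Returns (categories, rows): the sorted distinct categories, and for each
    input value a list of 0/1 flags with exactly one 1 in the matching column.
    """
    categories = sorted(set(values))
    rows = [[1 if value == category else 0 for category in categories]
            for value in values]
    return categories, rows

# Hypothetical sample column.
gender = ["male", "female", "female", "male"]
categories, encoded = one_hot(gender)

print(categories)  # ['female', 'male']
for original, row in zip(gender, encoded):
    print(original, row)
```

Each row ends up with a single 1, which is where the name "one-hot" comes from.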

Sampling

When you do not have enough data, oversampling may be suggested as a form of compensation. Say you are working on a classification problem and you have a minority class, as in the example below:

class_1 = 100 rows
class_2 = 1000 rows
class_3 = 1100 rows

As you can see, class_1 has far fewer rows than the other classes, which means your dataset is imbalanced; class_1 is referred to as the minority class. There are several oversampling techniques. One of them is SMOTE [5], which stands for Synthetic Minority Over-sampling Technique. SMOTE works by finding the k nearest neighbors of a minority-class sample and creating synthetic samples between that sample and its neighbors. Similar techniques take the reverse approach and undersample the larger classes instead.
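The neighbor-based idea behind SMOTE can be sketched in a few lines of pure Python. This is a simplified, SMOTE-style illustration and not the reference implementation from [5] or from any library: the function name `smote_like`, the choice of `k=3`, and the five two-dimensional sample points are all assumptions made for the example.

```python
import math
import random

def smote_like(minority, n_new, k=3, seed=0):
    """Create n_new synthetic points, each placed on the line segment between
    a random minority sample and one of its k nearest minority neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest neighbours of `base` among the other minority points.
        neighbours = sorted(
            (point for point in minority if point is not base),
            key=lambda point: math.dist(point, base),
        )[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic.append(
            tuple(b + gap * (n - b) for b, n in zip(base, neighbour))
        )
    return synthetic

# Hypothetical minority class with two numeric features per row.
class_1 = [(1.0, 2.0), (1.5, 1.8), (1.2, 2.4), (0.9, 2.1), (1.4, 2.2)]
new_points = smote_like(class_1, n_new=10)
balanced_class_1 = class_1 + new_points
print(len(balanced_class_1))  # 15
```

Because every synthetic point is an interpolation between two real minority rows, the new rows stay inside the region the minority class already occupies rather than being exact copies.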

These techniques are beneficial when you have outliers in your classes, or even in regression data, and you want to ensure that your sample is the best representation of the data your model will run on in the future.
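Random undersampling, the simplest form of the reverse approach, can be sketched as follows, assuming the class sizes from the example above. The helper name `undersample` and the use of integer row IDs as stand-ins for real rows are assumptions for illustration.

```python
import random

def undersample(rows_by_class, seed=0):
    """Randomly drop rows from every class until all match the smallest class."""
    rng = random.Random(seed)
    target = min(len(rows) for rows in rows_by_class.values())
    return {
        label: rng.sample(rows, target)
        for label, rows in rows_by_class.items()
    }

# Row IDs stand in for real rows; sizes mirror the class counts above.
data = {
    "class_1": list(range(100)),
    "class_2": list(range(1000)),
    "class_3": list(range(1100)),
}
balanced = undersample(data)
print({label: len(rows) for label, rows in balanced.items()})
# {'class_1': 100, 'class_2': 100, 'class_3': 100}
```

The trade-off is that undersampling discards real rows from the larger classes, whereas oversampling keeps everything and adds rows to the smaller one.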

Error Metrics

Plenty of error metrics are used for both classification and regression models in Data Science. According to sklearn [6], here are some that you can use specifically for regression models:

  • metrics.explained_variance_score
  • metrics.max_error
  • metrics.mean_absolute_error
  • metrics.mean_squared_error
  • metrics.mean_squared_log_error
  • metrics.median_absolute_error
  • metrics.r2_score
  • metrics.mean_poisson_deviance
  • metrics.mean_gamma_deviance

The two most popular error metrics for regression are MSE and its square root, RMSE:

MSE: the concept is → mean squared error regression loss (sklearn)

RMSE: the concept is → root mean squared error, the square root of the MSE, expressed in the same units as the target
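The definitions can be written out directly. The sketch below implements MAE, MSE, and RMSE in pure Python so the arithmetic is visible; the function names mirror the metric names but are local helpers written for this example, not the sklearn functions themselves, and the sample values are made up.

```python
import math

def mean_absolute_error(y_true, y_pred):
    """Average of the absolute differences |true - predicted|."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def mean_squared_error(y_true, y_pred):
    """Average of the squared differences (true - predicted)^2."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def root_mean_squared_error(y_true, y_pred):
    """Square root of the MSE, in the same units as the target."""
    return math.sqrt(mean_squared_error(y_true, y_pred))

# Hypothetical targets and predictions; the errors are 1, 0, -2, -1.
y_true = [3.0, 5.0, 2.0, 7.0]
y_pred = [2.0, 5.0, 4.0, 8.0]

mae = mean_absolute_error(y_true, y_pred)       # (1 + 0 + 2 + 1) / 4 = 1.0
mse = mean_squared_error(y_true, y_pred)        # (1 + 0 + 4 + 1) / 4 = 1.5
rmse = root_mean_squared_error(y_true, y_pred)  # sqrt(1.5)
print(mae, mse, round(rmse, 4))
```

Squaring penalizes the single error of 2 more heavily than the two errors of 1, which is why MSE and RMSE are more sensitive to large misses than MAE.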

For classification, you can expect to evaluate your model’s performance with accuracy and AUC (Area Under the Curve), which usually refers to the area under the ROC curve.
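Both classification metrics are short enough to sketch by hand. Accuracy is the share of correct predictions; AUC, read here as the area under the ROC curve, equals the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one. The helper names, the 0.5 decision threshold, and the four sample scores are assumptions for illustration.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def roc_auc(y_true, scores):
    """ROC AUC via pairwise comparison: the share of (positive, negative)
    pairs in which the positive is scored higher; ties count as half."""
    positives = [s for t, s in zip(y_true, scores) if t == 1]
    negatives = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in positives for n in negatives)
    return wins / (len(positives) * len(negatives))

# Hypothetical binary labels and model scores.
y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
y_pred = [1 if s >= 0.5 else 0 for s in scores]  # assumed 0.5 threshold

print(accuracy(y_true, y_pred))  # 0.75
print(roc_auc(y_true, scores))   # 0.75
```

Accuracy depends on the chosen threshold, while AUC is computed from the raw scores and summarizes ranking quality across all thresholds.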

Storytelling

Photo by Nong Vang on Unsplash [7].

I want to add one concept that is unique within Data Science: storytelling. I cannot stress enough how important it is. You can see it as a concept or as a skill, but the label is not important; what matters is how well you articulate your problem-solving techniques in a business setting. A lot of Data Scientists focus solely on model accuracy and then fail to understand the entire business process. That process includes:

  • what is the business?
  • what is the problem?
  • why do we need Data Science?
  • what is the goal of Data Science here?
  • when will we get usable results?
  • how can we apply our results?
  • what is the impact of our results?
  • how do we share our results and overall process?

As you can see, none of these points concerns the model itself or an improvement in accuracy. The focus is on how you will use data to solve your company's problems. It is beneficial to get acquainted with the stakeholders and non-technical coworkers you will ultimately be working with. You will also work with Product Managers, who will assess the problem alongside you, and with Data Engineers, who will help collect the data before you even run a base model. At the end of the modeling process, you will share your results with key individuals, who will usually want to see the impact in some type of visual representation (Tableau, Google Slides deck, etc.), so being able to present and communicate is beneficial as well.

Summary

There are plenty of key concepts that Data Scientists, as well as Machine Learning Engineers, should know. The five discussed in this article were:

  • Multicollinearity
  • One-hot encoding
  • Sampling
  • Error metrics
  • Storytelling

Please feel free to comment below with any Data Science concepts that you focus on daily, or that you think others should know about. Thank you for reading my article. I hope you enjoyed it!

Below are some references and links that can provide more information on the topics discussed in this article.

I also want to highlight two other stories I have written that are related to this article, [8] and [9]:

These two articles highlight key skills and projects that you will need to know or become familiar with, and that you can expect to eventually employ as a professional Data Scientist.

Source: https://towardsdatascience.com/5-concepts-every-data-scientist-should-know-16c74d080a83
