How to Get Started with Any Kind of Data? - Part 1

Start with Data

My data science journey began with a student job in the Advanced Analytics department of one of the biggest automotive manufacturers in Germany. I was naïve and still doing my master's.

I was excited about this job because my specialization at the time was Digitalization, and I wanted to get the hang of how it really works. I had studied programming too, but not Python. My colleagues were all really smart: PhDs, mathematicians and physicists. Their understanding of analytics was way beyond what I could gain by merely reading books!

For the first few days, the variety of projects, tasks and analyses bewildered me. But you know what was more bewildering? Questions like: What is analytics? Why do it? What are all these files with so much data? What do all those numbers in the results say? What does an analytics project look like? What do they mean when they say they are analyzing data?

Overwhelming!

I spent days understanding analytics and the job itself. I devoured books and online courses that taught Python, statistics, data science and more. Gradually, I developed an understanding of the subjects and successfully completed my thesis in the same department too.

I have explained the data analytics recipe for you below. I hope you can use it as a guide even if your ingredients change with the application.

The most important step in making any project successful is a clear start. No matter how big or small your project is, if you do not have the ingredients in the required form and the right tools, even a master chef's recipe will not guarantee a delicious meal in the end.

Let’s start with the ingredients before starting with the preparation.

Ingredients:

1. The Problem

Have you ever gotten irrelevant results after searching for something on Google? What do you do then? You rephrase and refine the keywords and search again. Similarly, having the 'why' of your analysis clear from the beginning helps you interpret your results better.

After you get all the data that you need, the next step is to understand and define the problem statement. The pain points of the business case need to be addressed here. Your aim must align with the business strategy of your company so that the analysis proves fruitful for the stakeholders.

Consider the store location example from The Data section. As a result of your analysis, each prospective location gets a score. Suppose your management's strategy is to finance the project only when the new location yields more than $100,000 in profit and lies in a new city with a population of at least 5,000. You then have clear criteria for narrowing down the analysis results in line with the vision of your company.
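As a rough illustration of how such business criteria narrow down scored locations, here is a minimal Python sketch. The candidate cities, field names and numbers are invented for the example; only the two thresholds come from the text.

```python
# Hypothetical candidate locations; names, scores and figures are made up.
candidates = [
    {"city": "A", "expected_profit": 120_000, "population": 8_000, "score": 0.82},
    {"city": "B", "expected_profit": 95_000, "population": 12_000, "score": 0.91},
    {"city": "C", "expected_profit": 150_000, "population": 4_000, "score": 0.77},
    {"city": "D", "expected_profit": 110_000, "population": 5_000, "score": 0.69},
]

MIN_PROFIT = 100_000    # finance only if the profit exceeds this amount
MIN_POPULATION = 5_000  # minimum population of the new city


def meets_strategy(location):
    """Check one location against the management criteria."""
    return (location["expected_profit"] > MIN_PROFIT
            and location["population"] >= MIN_POPULATION)


# Keep only the locations that satisfy the strategy, best score first.
shortlist = sorted(
    (loc for loc in candidates if meets_strategy(loc)),
    key=lambda loc: loc["score"],
    reverse=True,
)
print([loc["city"] for loc in shortlist])
```

Note that the highest-scoring city is not automatically on the shortlist: the criteria are applied before the ranking.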

2. The Data

For any kind of data analysis, getting the data is indispensable. Data can be acquired from various relevant sources, so it may come in diverse types and formats. Your job is to cut and crush it according to its type so that it is usable for your recipe.

In a tabular representation of data, each column is a data field and each row is a record. Each record may be labelled uniquely with an ID.

For example, to predict the next location for opening a new store, you may have to use yearly sales data, sales data for existing store locations, the population density of the locations, the total number of households, census data and land area. If your company sells pet products, you need the number of households with pets. If your company sells children's products, you need the number of households with children under 15.

The most common input file types are .csv (comma-separated values), .xlsx (Excel workbook) and .txt (plain text). Excel files consume more memory while the data is being imported. CSV files, by contrast, load faster and consume less memory.
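To make the tabular view concrete (columns as fields, rows as records, an ID per record), here is a small sketch using Python's standard csv module. The file contents, field names and values are hypothetical; an in-memory string stands in for a real .csv file.

```python
import csv
import io

# Stand-in for the contents of a .csv file; fields and values are hypothetical.
raw_csv = """store_id,city,yearly_sales
S001,Munich,250000
S002,Stuttgart,180000
S003,Hamburg,310000
"""

# DictReader treats the first line as the field names (columns)
# and yields one dict per record (row).
reader = csv.DictReader(io.StringIO(raw_csv))
records = list(reader)
fields = reader.fieldnames

# Index the records by their unique ID for direct lookup.
records_by_id = {row["store_id"]: row for row in records}

print(fields)
print(records_by_id["S002"]["city"])
```

All values read this way are strings, so numeric fields such as yearly_sales still need converting before any calculation.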

Regardless of the file type, you have to clean each of the files and then blend them all into one file for the analysis.
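Blending usually means joining the sources on a shared key. The sketch below joins two hypothetical sources on a city name using plain Python; the data and field names are assumptions, and an analytics tool or a data-frame library would offer the same operation as a join or merge.

```python
# Two hypothetical sources keyed by the same city name.
sales = [
    {"city": "Munich", "yearly_sales": 250000},
    {"city": "Hamburg", "yearly_sales": 310000},
]
census = [
    {"city": "Munich", "households": 820000},
    {"city": "Hamburg", "households": 1000000},
    {"city": "Bremen", "households": 310000},
]

# Index one source by the key, then attach its fields to the other.
# Rows without a partner in both sources are left out (an inner join).
census_by_city = {row["city"]: row for row in census}
blended = [
    {**row, **census_by_city[row["city"]]}
    for row in sales
    if row["city"] in census_by_city
]

for row in blended:
    print(row)
```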

3. The Software

The software used for the analysis can be selected depending on the kind of results you want and on your knowledge of programming languages like Python or R. Those who prefer not to program can simply use any modular analytics software. In such a tool, you just drag and drop the required functions, and you are good to go with well-structured results and presentations.

Popular 'no-code' analytics software includes:

  • Tableau: Data Visualization and Reporting

  • DataRobot: Automated Machine Learning Platform

  • RapidMiner: Useful for the entire life cycle from prediction to deployment

  • Alteryx: Advanced Analytics Platform

  • MLBase: Open Source

  • Trifacta: Free

For these, you simply need to go to their site, create an account and download the software (some may only offer trial versions for a limited period).

After that, just upload your data file for analysis and run it. You will have your results by the time you finish reading this article.

Popular IDEs for statistical computing:

  • PyCharm (Python)

  • Spyder (Anaconda Python distribution)

  • RStudio (R)

You can also directly start your data analytics projects online, without downloading or installing anything!

  • Google Colab

  • Microsoft Azure Notebooks

Preparation:

Different types of data come in different formats. Data from disparate sources usually requires cleansing, enriching and proper consolidation into one usable form for downstream processing. The technical terms generally used are data cleaning, feature selection, data transforms, feature engineering and dimensionality reduction.

Data cleaning and preparation is the most time-consuming task in the entire analysis process.

The first thing to do with any file is to check that the given path is correct and that the file opens without errors. Load the data in the software of your choice. Now, look inside.
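In Python, these first steps might look like the following sketch. The helper names load_csv and peek are hypothetical, and the file is assumed to be a UTF-8 encoded CSV with a header row.

```python
import csv
from pathlib import Path


def load_csv(path):
    """Return the rows of a CSV file as a list of dicts, failing early on a bad path."""
    file_path = Path(path)
    if not file_path.is_file():
        raise FileNotFoundError(f"No such file: {file_path}")
    with file_path.open(newline="", encoding="utf-8") as handle:
        return list(csv.DictReader(handle))


def peek(rows, n=5):
    """Print the first n rows to get a first look inside the data."""
    for row in rows[:n]:
        print(row)
```

Calling peek(load_csv("your_file.csv")) either raises a clear error about the path or shows the first few records, which is often enough to spot an unexpected delimiter or a missing header.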

One way to look at the data is the Field Summary tool in Alteryx, which provides a summary of the data for all fields. The summary is shown below:

[Image by Priyanka Mane, from Alteryx software]

Analyze and interpret the data using statistical tools (e.g. finding correlations, trends and outliers). However, the data might have missing values, typing errors or heterogeneous date formats; these must first be identified and fixed for better results.
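As one example of such a statistical tool, the sketch below computes the Pearson correlation coefficient between two numeric fields. The formula is the standard one; the two series are invented values standing in for population density and sales.

```python
import math
import statistics


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long numeric lists."""
    mean_x, mean_y = statistics.fmean(xs), statistics.fmean(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical values: population density and yearly sales for five locations.
density = [120, 340, 560, 780, 900]
sales = [110, 205, 330, 410, 480]

r = pearson(density, sales)
print(round(r, 3))
```

A value close to 1 or -1 suggests a strong linear relationship, while a value near 0 suggests none; it says nothing about cause and effect.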

[Image by Priyanka Mane, from Alteryx software]

· Variables

Categorical variables take values or labels that belong to a fixed number of categories. A nominal variable has categories with no intrinsic ordering; gender recorded with the two categories male and female is an example. An ordinal variable has a clear ordering; temperature recorded in three ordered categories (low, medium and high) is an example. Such variables are encoded using different techniques for easier analysis.

Quantitative variables represent measurements and counts. They are of two types: continuous (may take any value within an interval) and discrete (countable).

[Image: Types of Data: Numerical and Categorical]

You may have to deal with the following challenges while preparing the data:

  • Null Values / Missing data

Null values are shown in the data as NaN, or "Not-a-Number". NaN is a special floating-point value that does not represent a valid number. In Python, use math.isnan() to check whether a value is NaN, or pandas.isna() when working with pandas. In Alteryx, the values are shown as [Null] after running the workflow. They can be filtered using IsNull in the Formula tool. Additionally, the Summarize tool helps you count nulls.
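A short Python sketch of such a check, using only the standard library. The sample values are made up, and is_missing is a hypothetical helper that treats both None and NaN as missing.

```python
import math

values = [3.5, float("nan"), 7.0, None, 2.25]


def is_missing(value):
    """True for None and for float NaN."""
    return value is None or (isinstance(value, float) and math.isnan(value))


missing_positions = [i for i, v in enumerate(values) if is_missing(v)]
print(missing_positions)

# NaN is the only float that is not equal to itself, so == cannot find it.
nan = float("nan")
print(nan == nan)
```

This is why an explicit check such as math.isnan() is needed: comparing a value with NaN using == never matches.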

In statistics, when no data value is stored for an observation in the dataset, it is termed missing data or a missing value. Rubin described three mechanisms for the occurrence of missing data: missing completely at random (MCAR), missing at random (MAR) and missing not at random (MNAR).

The process of assigning substitute values to missing values is called imputation. If a small portion (up to 5%) of the data is missing, the values can be imputed using a method like the mean, median or mode, which uses the other values in the same column. More principled methods include the multiple-imputation (MI) method, the full information maximum likelihood (FIML) method and the expectation-maximization (EM) method.
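The simple column-wise strategies can be sketched in a few lines of Python. The impute helper and the sample column are hypothetical, missing entries are assumed to be stored as None, and the principled methods (MI, FIML, EM) are not covered here.

```python
import statistics


def impute(column, strategy="mean"):
    """Replace None entries with the mean, median or mode of the observed values."""
    observed = [v for v in column if v is not None]
    fill = {
        "mean": statistics.fmean,
        "median": statistics.median,
        "mode": statistics.mode,
    }[strategy](observed)
    return [fill if v is None else v for v in column]


ages = [20, None, 25, 45, 30]
print(impute(ages, "mean"))
print(impute(ages, "median"))
```

The median is less sensitive to extreme values than the mean, and the mode is the usual choice for categorical columns.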

It is advisable to delete a field if more than 10% of its values are missing, as it may add statistical bias to the results.
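A minimal sketch of that rule in Python, assuming the data is a list of dicts with None marking a missing value. The helper names and the sample records are made up; the 10% threshold is the one from the text.

```python
def missing_share(rows, field):
    """Fraction of rows in which the field is None."""
    return sum(row[field] is None for row in rows) / len(rows)


def drop_sparse_fields(rows, threshold=0.10):
    """Remove every field whose missing share exceeds the threshold."""
    fields = rows[0].keys()
    keep = [f for f in fields if missing_share(rows, f) <= threshold]
    return [{f: row[f] for f in keep} for row in rows]


# Ten hypothetical records: 'pets' is missing in 3 of 10 rows, 'sales' in none.
rows = [{"sales": 100 + i, "pets": None if i < 3 else i} for i in range(10)]
cleaned = drop_sparse_fields(rows)

print(missing_share(rows, "pets"))
print(list(cleaned[0].keys()))
```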

In Alteryx, the Field Summary tool shows the percentage of missing records for each data field.

  • Heterogeneous Data

Fields like age, currency and date have a huge potential for errors due to non-uniform formats. For example, age can be written as 30 years, 30.2 Years or simply 30. The thousands separator and decimal separator for currency vary by country and region. Make sure that these columns have a uniform format.
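The following sketch shows one way to bring such entries into a uniform numeric format in Python. The helpers parse_age and parse_amount are hypothetical, and the separators have to be known per source; they are not detected automatically.

```python
import re


def parse_age(text):
    """Extract the number from entries such as '30 years', '30.2 Years' or '30'."""
    match = re.search(r"\d+(?:\.\d+)?", str(text))
    return float(match.group()) if match else None


def parse_amount(text, thousands=",", decimal="."):
    """Convert a formatted amount to float, given the separators used by the source."""
    cleaned = str(text).replace(thousands, "").replace(decimal, ".")
    return float(cleaned)


ages = [parse_age(v) for v in ["30 years", "30.2 Years", "30"]]
us_amount = parse_amount("1,234.56")
de_amount = parse_amount("1.234,56", thousands=".", decimal=",")

print(ages)
print(us_amount, de_amount)
```

Both amounts end up as the same number even though the two sources use opposite separators.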

[Image by Micha Sager from Pixabay]

  • Outliers

Once the dataset is cleaned, it is time to run another preprocessing pass over it. Outliers are unusual values in the dataset that may cause statistical errors in your calculations. They lie abnormally far from the other values in a dataset and can severely distort your output values.

[Image: Statistics by Jim]

Scatterplots help immensely when you need to quickly identify outliers in your data. Simply plot each predictor variable against the target variable to visualize their relationship.
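Alongside the visual check, outliers can also be flagged numerically. The sketch below uses the interquartile range (IQR) rule, which is a different technique from the scatterplots described above and is swapped in here only as a common complement. The sample values and the multiplier k = 1.5 are assumptions.

```python
import statistics


def iqr_outliers(values, k=1.5):
    """Return the values lying more than k * IQR outside the quartiles."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    spread = q3 - q1
    low, high = q1 - k * spread, q3 + k * spread
    return [v for v in values if v < low or v > high]


# Hypothetical sales figures with one value far from the rest.
sales = [12, 14, 15, 15, 16, 17, 18, 19, 95]
outliers = iqr_outliers(sales)
print(outliers)
```

Flagged values still need a human decision: an outlier may be a typing error, or a genuine and important observation.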

That was all for part 1. Check out part 2 for the analysis and presentation phases of a data science project. Stay tuned!

Translated from: https://towardsdatascience.com/how-to-get-started-with-any-kind-of-data-part-1-c1746c66bc2d
