Learn These 3 Basic Data Concepts Before Stressing About Coding Languages or Tools

I got an Instagram DM the other day that really got me thinking. The sender explained that they were a data analyst by trade and had years of experience. But they also said that they felt their technical skills were slightly lacking, as they had never heard of many of the terms mentioned on my page. They were looking forward to expanding their skill set by learning more technical tools (SQL, Python, R, etc.).

As I thought about how to advise this person further, I realized that they were in the perfect position to make the transition that they desired. Why? They had already mastered the data skills and data mindset that are crucial to being successful in the field of data.

I (and so many others) worry about mastering every technical tool or product that is out there. I worry about only having experience with Microsoft products (SQL Server, Excel, Power BI), and feel that I need to broaden my horizons to be a better data analyst. I constantly see data scientists questioning and debating online about whether Python or R is better in their line of work.

But, speaking with my new Instagram friend helped me realize that these worries and debates are quite silly. Tools and programming languages are constantly evolving and changing, coming and going. But you know what is here to stay? The core concepts. Every tool or language that is ever built will always fall back on these core concepts.

If you understand how to take a data set, manipulate it, and present it in a way that provides genuine insight (or at least invites more questions that you didn’t have before… because that happens!!), you are on the right path to succeed as some sort of data professional.

This base understanding of data is so powerful. You can take this understanding, and combine it with any technical tool of your choice. Then, you can group and filter data for business reporting and KPI monitoring, conduct statistical tests to answer questions about data, predict future data, or even generate AI models to use data to help guide business action. And you can do all these things with huge data sets containing millions and millions of rows!

OK, I know I keep selling you on this idea, so let me cut to the chase. If you understand data concepts and how to apply them, you can easily implement these concepts with any technical tool or product of your choice.

But don’t worry, I’m not just here to sell you on this and then head out. I’m going to talk about 3 basic data skills that I use daily as a data analyst, from a general perspective. NO TECHNICAL TERMS OR CODE INVOLVED. If you begin to master these (and other) data concepts, it is EASY PEASY LEMON SQUEEZY to take them and apply them with any tool. I even have a serious life hack at the end of the article that will help you further flex your new data knowledge in any tool you’ve been wanting to master. Stick with me, I got you!

#1. Filtering Data

The first data concept that is crucial in the data world is filtering data. Honestly, filtering data is a super simple concept and one that we as human beings do on a daily basis. Take this example. If you are going to get McDonald’s, you should probably ask your 3 roomies if they want some (because you don’t wanna be that roommate). But, before you go ask your roomies if they want chicken nugs, you remember that 2 out of your 3 roomies don’t even like McDonald’s, so you only end up asking one. Basically, you “filtered out” your two roommates from your “data set” based on some “attribute”, which is whether or not they like McDonald’s.

Filtering data as a data analyst or data scientist works the exact same way. If you are conducting an analysis on female customers, you will need to use whatever tool you have at your disposal to filter out the non-female customers. If you are trying to build a model that helps recommend skincare for adults, you would want to filter out any data for non-adult patients.

Long story short, filtering data is just taking away all of the undesired data from whatever data set you have, until you are left with whatever data you need for your analysis.
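
For the curious, here is what that idea can look like once you do pick a tool. This is only a minimal sketch in plain Python, not the one right way to do it: the `customers` records, their field names, the `filter_rows` helper, and the age cutoff of 18 for "adult" are all made up for illustration.

```python
# Hypothetical customer records; names and fields are invented for this sketch.
customers = [
    {"name": "Ana", "gender": "female", "age": 34},
    {"name": "Ben", "gender": "male", "age": 41},
    {"name": "Cleo", "gender": "female", "age": 16},
    {"name": "Dev", "gender": "nonbinary", "age": 29},
]


def filter_rows(rows, predicate):
    """Keep only the rows for which predicate(row) is true."""
    return [row for row in rows if predicate(row)]


# Filter out the non-female customers.
female_customers = filter_rows(customers, lambda row: row["gender"] == "female")

# Filter out the non-adults (assuming "adult" means 18 or older).
adults = filter_rows(customers, lambda row: row["age"] >= 18)

print([row["name"] for row in female_customers])
print([row["name"] for row in adults])
```

The concept is the same everywhere: state the attribute you care about, then keep only the rows that match. In SQL that condition would typically live in a `WHERE` clause.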

#2. Data Type Conversion

Another commonly used data skill is data type conversion. Data types are certain categories that data can fall into when it is stored in a spreadsheet, software, or database. Some common examples of data types are:

  • Strings (ex: “Hello, this is a string.”)
  • Integers (ex: 400)
  • Decimals (ex: 400.17)
  • Booleans (ex: TRUE)

When we are working with a data set, we want to make sure that each data attribute is stored as the correct data type.

We would not want to store the integer 123 as a string. If we store 123 as a string, the spreadsheet, software, or database may not be able to perform the necessary operations on it. The computer would get confused. If we tell the computer that we have a string (“123”), but later we want to add that “123” to something, the computer is going to say “HOLD UP A SECOND. You taught me that ‘123’ was a STRING, which is basically a word. Ya can’t add words, crazy person! You can only add numbers!!!!”

Sorry the hypothetical computer got so aggressive there, but you get the point. In order to ensure that we can perform proper operations on our data down the road, we want to absolutely make sure that it is represented as the right type.
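
To make the hypothetical computer's complaint concrete, here is a small sketch in Python, assuming a value that arrived as the string "123". Python happens to refuse to add a string and a number outright; other tools may behave differently (some quietly convert for you), so treat this as one illustration only. The helper name `to_int_or_none` is invented for this example.

```python
raw_value = "123"  # looks like a number, but is stored as a string

# Adding a number to a string fails in Python.
complaint = None
try:
    raw_value + 1
except TypeError as error:
    complaint = str(error)

# Convert to the right type first, then the arithmetic works.
as_integer = int(raw_value)
total = as_integer + 1


def to_int_or_none(text):
    """Convert text to an integer, or return None if it is not a whole number."""
    try:
        return int(text)
    except ValueError:
        return None


print(complaint)
print(total)
print(to_int_or_none("400"), to_int_or_none("four hundred"))
```

Returning `None` for values that cannot be converted is just one possible design choice; depending on the analysis, you might prefer to stop with an error so bad data does not slip through unnoticed.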

#3. Aggregating Data

The final concept that I want to touch on *for now* is aggregating data. Aggregating data is so so so SO powerful. It can take a big giant text file full of rows and columns of data and turn it into a summary value or a summary table that is much more meaningful and pleasing to the eye.

Notice how I kept saying the word summary up there? It’s probably the best way to explain an aggregation, because aggregations take multiple rows of data and summarize them into a smaller number of rows.
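
If you want to see "many rows in, fewer rows out" in action, here is a rough sketch in plain Python. The `sales` rows, the `region` and `amount` fields, and the numbers are all invented for illustration; the point is only that five detail rows collapse into three summary rows.

```python
from collections import defaultdict

# Hypothetical detail rows: one row per sale.
sales = [
    {"region": "North", "amount": 120},
    {"region": "South", "amount": 80},
    {"region": "North", "amount": 50},
    {"region": "South", "amount": 30},
    {"region": "West", "amount": 200},
]

# Group the rows by region and add up the amounts within each group.
totals = defaultdict(int)
for row in sales:
    totals[row["region"]] += row["amount"]

# One summary row per region, sorted by region name.
summary_table = [
    {"region": region, "total": total} for region, total in sorted(totals.items())
]

for row in summary_table:
    print(row)
```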

[Image courtesy of SQLiteTutorial.Net]

If you have a data set that contains numbers that it makes sense to add (such as quantities or sales), one of the simplest ways to aggregate that data is to sum it up. In the example below, I took a data set that contained the number of coffees I drank each day. I applied an aggregation to it by summing it, which created a summary view of my data on the right. This summary shows that I drank a total of 4 coffees (in this data set at least).

[Image: my daily coffee data set, with the summed total shown on the right]
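
The same coffee sum can be sketched in a couple of lines of Python. The daily values below are invented and simply chosen to add up to the same total of 4; they are not the rows from the original image.

```python
# Hypothetical daily coffee log; the values are made up to total 4.
coffee_log = [
    {"day": "Monday", "coffees": 1},
    {"day": "Tuesday", "coffees": 2},
    {"day": "Wednesday", "coffees": 1},
]

# Aggregate: collapse every row into a single summary value.
total_coffees = sum(row["coffees"] for row in coffee_log)

print(total_coffees)
```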

There are many other aggregate operations that are pretty intuitive, even for those who are new to the data world. Each of these operations answers a question that tells us more about our data set. Some examples of other simple aggregate operations are:

  • Count (how many records are there?)
  • Maximum (what’s the biggest observation?)
  • Minimum (what’s the smallest observation?)
  • Average (what do I tend to observe?)
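
Each of the operations above maps onto something very small in most tools. As one hedged example, here they are in Python using only built-ins and the standard `statistics` module; the `coffees_per_day` values are invented for illustration.

```python
from statistics import mean

# Hypothetical observations: coffees drunk on each of three days.
coffees_per_day = [1, 2, 1]

summary = {
    "count": len(coffees_per_day),     # how many records are there?
    "maximum": max(coffees_per_day),   # what's the biggest observation?
    "minimum": min(coffees_per_day),   # what's the smallest observation?
    "average": mean(coffees_per_day),  # what do I tend to observe?
}

for name, value in summary.items():
    print(name, value)
```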

OK coooOooOol.. so what’s next?

I know I promised you a life hack earlier, so don’t worry — I didn’t forget. Now that you have got a firmer grasp on some of the most crucial steps in a data professional’s workflow, you can take them and apply them with any technical tool of your choice, even if you are a newbie. How? With our best friend, our ultimate savior, GOOGLE!

Whenever I want to practice any of my skills with some tool, and I need a refresher on how to execute it properly, I will Google in this format:

[insert data skill] in [insert technical tool]

I swear to you, any time I Google in this format, I always end up finding great documentation, blog posts, or other resources (such as Stack Overflow) that direct my thoughts toward the solution.

So, did you find aggregating data interesting? And are you wanting to better your SQL skills? Then I would recommend reviewing and working on:

aggregating data in SQL

Are you basically a pro at filtering data in Python, but now you would like to try it out in R? Try my life hack and Google:

filtering data in R

Take it from the girl who overwhelmed herself for months before pursuing her data career dreams. Learn the concepts first. Worry about the tech to get it done later. Technology is always evolving, but the foundations aren’t.

Originally published at https://datadreamer.io on August 7, 2020.

Translated from: https://towardsdatascience.com/learn-these-3-basic-data-concepts-before-stressing-about-coding-languages-or-tools-e599896e6d4
