Learn These 3 Basic Data Concepts Before Stressing About Coding Languages or Tools

I got an Instagram DM the other day that really got me thinking. This person explained that they were a data analyst by trade, and had years of experience. But, they also said that they felt that their technical skills were slightly lacking, as they had never heard of many of the terms mentioned on my page. This person mentioned that they were looking forward to expanding their skill set by learning more technical tools (SQL, Python, R, etc.).

As I thought about how to advise this person further, I realized that this person was in the perfect position to make the transition that they desired. Why? They had already mastered the data skills and data mindset that are crucial to being successful in the field of data.

I (and so many others) worry about mastering every technical tool or product that is out there. I worry about only having experience with Microsoft products (SQL Server, Excel, Power BI), and feel that I need to broaden my horizons to be a better data analyst. I constantly see data scientists questioning and debating online about whether Python or R is better in their line of work.

But, speaking with my new Instagram friend helped me realize that these worries and debates are quite silly. Tools and programming languages are constantly evolving and changing, coming and going. But you know what is here to stay? The core concepts. Every tool or language that is ever built will always fall back on these core concepts.

If you understand how to take a data set, manipulate it, and present it in a way that provides genuine insight (or at least invites more questions that you didn’t have before… because that happens!!), you are on the right path to succeed as some sort of data professional.

This base understanding of data is so powerful. You can take this understanding, and combine it with any technical tool of your choice. Then, you can group and filter data for business reporting and KPI monitoring, conduct statistical tests to answer questions about data, predict future data, or even generate AI models to use data to help guide business action. And you can do all these things with huge data sets containing millions and millions of rows!

OK I know I’m selling you and selling you on this idea, so let me cut to the chase. If you understand data concepts and how to apply them, you can easily implement these concepts with any technical tool or product of your choice.

But don’t worry, I’m not just here to sell you on this and then head out. I’m going to talk about 3 basic data skills that I use daily as a data analyst, from a general perspective. NO TECHNICAL TERMS OR CODE INVOLVED. If you begin to master these (and other) data concepts, it is EASY PEASY LEMON SQUEEZY to take them and apply them with any tool. I even have a serious life hack at the end of the article that will help you further flex your new data knowledge in any tool you’ve been wanting to master. Stick with me, I got you!

#1. Filtering Data

The first data concept that is crucial in the data world is filtering data. Honestly, filtering data is a super simple concept and one that we as human beings do on a daily basis. Take this example. If you are going to get McDonald’s, you should probably ask your 3 roomies if they want some (because you don’t wanna be that roommate). But, before you go ask your roomies if they want chicken nugs, you remember that 2 out of your 3 roomies don’t even like McDonald’s, so you only end up asking one. Basically, you “filtered out” your two roommates from your “data set” based on some “attribute”, which is whether or not they like McDonald’s.

Filtering data as a data analyst or data scientist works the exact same way. If you are conducting an analysis on female customers, you will need to use whatever tool you have at your disposal to filter out the non-female customers. If you are trying to build a model that helps recommend skincare for adults, you would want to filter out any data for non-adult patients.

Long story short, filtering data is just taking away all of the undesired data from whatever data set you have, until you are left with whatever data you need for your analysis.
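The idea itself needs no code, but for readers curious how it looks in one tool, here is a minimal sketch in plain Python. The customer records, field names, and the age cutoff of 18 for "adult" are all made up for illustration; they are assumptions, not something from a real data set.

```python
# Hypothetical customer records; names, fields, and values are made up.
customers = [
    {"name": "Ana", "gender": "female", "age": 34},
    {"name": "Ben", "gender": "male", "age": 41},
    {"name": "Chloe", "gender": "female", "age": 16},
    {"name": "Dev", "gender": "nonbinary", "age": 29},
]

# Filtering: keep only the rows whose attribute matches the condition.
female_customers = [c for c in customers if c["gender"] == "female"]

# Filters can be stacked: female customers who are also adults
# (18 is an assumed cutoff, used here only as an example).
adult_female_customers = [c for c in female_customers if c["age"] >= 18]

print([c["name"] for c in female_customers])
print([c["name"] for c in adult_female_customers])
```

Whatever the tool, the shape is the same: a data set goes in, a condition on some attribute is applied, and a smaller data set comes out.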

#2. Data Type Conversion

Another commonly used data skill is data type conversion. Data types are certain categories that data can fall into when it is stored in a spreadsheet, software, or database. Some common examples of data types are:

  • Strings (ex: “Hello, this is a string.”)
  • Integers (ex: 400)
  • Decimals (ex: 400.17)
  • Booleans (ex: TRUE)

When we are working with a data set, we want to make sure that each data attribute is stored as the correct data type.

We would not want to store the integer 123 as a string. If we did, the spreadsheet, software, or database would not be able to perform necessary operations on it. The computer would get confused. If we tell the computer that we have a string (“123”), but later we want to add that “123” to something, the computer is going to say: “HOLD UP A SECOND. You taught me that ‘123’ was a STRING, which is basically a word. Ya can’t add words, crazy person! You can only add numbers!!!!”

Sorry the hypothetical computer got so aggressive there, but you get the point. In order to ensure that we can perform proper operations on our data down the road, we want to absolutely make sure that it is represented as the right type.
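Python happens to react much like that hypothetical computer, so it makes a handy illustration. This is only a sketch of the concept in one tool; the variable names are made up, and other tools may handle the same situation differently.

```python
stored_value = "123"  # a number that was stored as a string

# Adding a number to a string fails: Python raises a TypeError.
try:
    total = stored_value + 1
except TypeError:
    total = None  # the addition did not happen

# Converting to the right type first makes the arithmetic work.
converted_value = int(stored_value)
total = converted_value + 1

print(total)
```

The fix is the conversion step: once the value is an integer, the addition goes through.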

#3. Aggregating Data

The final concept that I want to touch on *for now* is aggregating data. Aggregating data is so so so SO powerful. It can take a big giant text file of rows and columns of data and turn it into a summary value or a summary table that is much more meaningful and pleasing to the eye.

Notice how I kept saying the word summary up there? It’s probably the best way to explain an aggregation, because aggregations take multiple rows of data and summarize them into a smaller number of rows.

[Image courtesy of SQLiteTutorial.Net]

If you have a data set that contains numbers that make sense to add together (such as quantities or sales), one of the simplest ways to aggregate that data is to sum it up. In the example below, I took a data set that contained the number of coffees I drank each day. I applied an aggregation to it by summing it, which created a summary view of my data on the right. This summary shows that I drank a total of 4 coffees (in this data set at least).

[Image: the daily coffee data set on the left and its summed total on the right]
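The same summing step can be sketched in a few lines of Python. The day names and daily counts are hypothetical, chosen only so that they add up to the total of 4 mentioned above.

```python
# Hypothetical daily counts; only the total of 4 comes from the article.
coffees_per_day = {"Monday": 1, "Tuesday": 2, "Wednesday": 1}

# Aggregation: several rows go in, one summary value comes out.
total_coffees = sum(coffees_per_day.values())

print(total_coffees)
```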

There are many other aggregate operations that are pretty intuitive, even for those who are new to the data world. Each of these operations answers some question that tells us more about our data set. Some examples of other simple aggregate operations are:

  • Count (how many records are there?)
  • Maximum (what’s the biggest observation?)
  • Minimum (what’s the smallest observation?)
  • Average (what do I tend to observe?)
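In Python, each of these four operations takes a single line using built-in functions. A minimal sketch, with made-up daily coffee counts standing in for a real data set:

```python
# Hypothetical observations: coffees per day over four days.
coffees = [1, 2, 1, 0]

record_count = len(coffees)            # Count: how many records are there?
largest = max(coffees)                 # Maximum: the biggest observation
smallest = min(coffees)                # Minimum: the smallest observation
average = sum(coffees) / len(coffees)  # Average: what do I tend to observe?

print(record_count, largest, smallest, average)
```

Each line answers one of the questions in the list, and each one collapses the whole data set into a single summary value.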

OK coooOooOol.. so what’s next?

I know I promised you a life hack earlier, so don’t worry — I didn’t forget. Now that you have got a firmer grasp on some of the most crucial steps in a data professional’s workflow, you can take them and apply them with any technical tool of your choice, even if you are a newbie. How? With our best friend, our ultimate savior, GOOGLE!

Whenever I want to practice any of my skills with some tool, and I need a refresher on how to execute it properly, I will Google in this format:

[insert data skill] in [insert technical tool]

I swear to you, any time I Google in this format, I always end up finding great documentation, blog posts, or other resources (such as Stack Overflow) that direct my thoughts toward the solution.

So, did you find aggregating data interesting? And do you want to improve your SQL skills? Then I would recommend Googling and working through:

aggregating data in SQL

Are you basically a pro at filtering data in Python, but now you would like to try it out in R? Try my life hack and Google:

filtering data in R

Take it from the girl who overwhelmed herself for months before pursuing her data career dreams. Learn the concepts first. Worry about the tech to get it done later. Technology is always evolving, but the foundations aren’t.

Originally published at https://datadreamer.io on August 7, 2020.

Translated from: https://towardsdatascience.com/learn-these-3-basic-data-concepts-before-stressing-about-coding-languages-or-tools-e599896e6d4
