An introduction to aggregates in R: a powerful tool for playing with data

by Satyam Singh Chauhan

Data Visualization is not just about colors and graphs. It’s about exploring the data and visualizing the right thing.

When playing with data, one of the handiest and most powerful tools is the aggregate. An aggregate is a type of transformation applied to a data set: the values are split into groups, and each group is reduced to a single summary value.

We have 11 aggregate functions available to us:

  • avg

    The average of all numeric values in each group is calculated and returned.

  • count

    The function count returns the total number of items in each group.

  • first

    The first value of each group is returned by the function first.

  • last

    The last value of each group is returned by the function last.

  • max

    The maximum value of each group is returned by the function max.

    It is also very helpful for identifying outliers.

  • median

    The median of all numeric values in each group is returned by the function median.

  • min

    The minimum value of each group is returned by the function min.

    It is also very helpful for identifying outliers.

  • mode

    The mode (the most frequent value) of all numeric values in each group is returned by the function mode.

  • rms

    The root mean square (rms) of all numeric values in each group is returned by the function rms.

  • stddev

    The standard deviation of all numeric values in each group is returned by the function stddev.

  • sum

    The sum of all numeric values in each group is returned by the function sum.
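To make the list above concrete, here is a minimal base R sketch that computes each of the 11 summaries by hand for the Petal.Length column of the built-in iris data, grouped by species. These are stand-ins written for illustration, not Plotly's own implementation: the helpers stat_mode and summarise_group are hypothetical names, the tie-breaking rule for the mode is an assumption, and sd() uses the sample (n - 1) formula, which may or may not match the setting Plotly applies.

```r
# Values to summarise: one petal length per flower
x <- iris$Petal.Length

# Hypothetical helper: most frequent value (the first one wins on ties)
stat_mode <- function(v) {
  u <- unique(v)
  u[which.max(tabulate(match(v, u)))]
}

# Hypothetical helper: the 11 summaries for one group of values
summarise_group <- function(v) {
  c(
    avg    = mean(v),
    count  = length(v),
    first  = v[1],
    last   = v[length(v)],
    max    = max(v),
    median = median(v),
    min    = min(v),
    mode   = stat_mode(v),
    rms    = sqrt(mean(v^2)),
    stddev = sd(v),            # sample standard deviation (n - 1)
    sum    = sum(v)
  )
}

# Split the values by species and summarise each group:
# rows are aggregate functions, columns are species
by_species <- sapply(split(x, iris$Species), summarise_group)
print(round(by_species, 3))
```

Each column of by_species is what a grouped aggregate would reduce that species to, one row per function.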

Basic Examples

Basic scatter plot using an aggregate function: sum

# Include the library
library(plotly)

# Store the graph in one variable to make it easier to manipulate.
p <- plot_ly(
  type = 'scatter',
  # The text below sums Petal.Length per species, so y is Petal.Length
  y = iris$Petal.Length,
  x = iris$Species,
  mode = 'markers',
  marker = list(
    size = 15,
    color = 'green',
    opacity = 0.8
  ),
  transforms = list(
    list(
      type = 'aggregate',
      groups = iris$Species,
      aggregations = list(
        list(target = 'y', func = 'sum', enabled = TRUE)
      )
    )
  )
)

# Display the graph
p

What does this mean?

The function sum, as mentioned above, calculates the sum of each group. Here the groups are the species. This code uses the iris data set, which consists of three different species: setosa, versicolor, and virginica. For each species there are 50 observations in the data set. The data set is built into R and can be loaded directly.

R ships two versions of this data, "iris" and "iris3". The data set used in this article is "iris", a data frame with a Species column. "iris3" stores the same measurements as a three-dimensional array, so the code here would need adapting before it could run on it.
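A quick way to confirm the shape of the data described above is to inspect it in base R. This is only a small sketch; species_counts and iris3_dim are names introduced here for illustration.

```r
# iris is a data frame: 150 rows, four measurements plus Species
str(iris)

# Number of observations per species
species_counts <- table(iris$Species)
print(species_counts)

# iris3 holds the same measurements as a 3-dimensional array
iris3_dim <- dim(iris3)
print(iris3_dim)
```

Because iris3 is an array rather than a data frame, it has no Species column to group by, which is why the examples in this article use iris.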

What does this code do exactly?

This code uses the function sum to calculate the sum of all the Petal.Length values in each group. The calculated sums are then plotted, with Species on the x-axis and the summation on the y-axis.

From this graph, we get the impression that setosa has the smallest petals, because its sum is the smallest, but that is not conclusive evidence, since a sum also depends on how many observations each group contains. For stronger evidence we can use the function avg.
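The per-species sums can be cross-checked without any plotting. The following is a minimal base R sketch, assuming the summed quantity is Petal.Length as the text describes; petal_sums and smallest are names introduced here for illustration.

```r
# Sum of Petal.Length within each species
petal_sums <- tapply(iris$Petal.Length, iris$Species, sum)
print(petal_sums)

# Species with the smallest total petal length
smallest <- names(which.min(petal_sums))
print(smallest)
```

Since every species has the same number of observations in iris, the ordering of the sums here matches the ordering of the averages, but that would not hold for groups of unequal size.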

The function sum is useful for almost any data set. For example, one of the best places to use it is a population data set: in world population data, we can group the countries by continent and find the total population of each continent.
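The population idea can be sketched in base R with aggregate(). The data frame below is entirely made up for illustration: the country and continent names are placeholders and the numbers are not real statistics.

```r
# Made-up illustration data: placeholder names and numbers, not real statistics
world <- data.frame(
  country    = c("Country A", "Country B", "Country C", "Country D"),
  continent  = c("Continent X", "Continent X", "Continent Y", "Continent Y"),
  population = c(10, 20, 5, 15)
)

# Group the countries by continent and sum the population of each group
continent_totals <- aggregate(population ~ continent, data = world, FUN = sum)
print(continent_totals)
```

The same grouping column could be passed as the groups argument of an aggregate transform to get the totals directly in a plot.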

Most used function: avg

# Include the library
library(plotly)

# Store the graph in one variable to make it easier to manipulate.
q <- plot_ly(
  type = 'bar',
  # Ratio of petal length to petal width, one value per flower
  y = iris$Petal.Length / iris$Petal.Width,
  x = iris$Species,
  color = iris$Species,
  transforms = list(
    list(
      type = 'aggregate',
      groups = iris$Species,
      aggregations = list(
        list(target = 'y', func = 'avg', enabled = TRUE)
      )
    )
  )
)

# Display the graph
q

What does this mean?

The iris data set contains two columns for petals, Petal.Width and Petal.Length. They can be used to calculate the average ratio of Petal.Length to Petal.Width.

What does this code do exactly?

For each observation, the ratio of Petal.Length to Petal.Width is calculated, and then the average of these values is plotted for each species. As we can observe from this bar plot, setosa has the largest ratio, close to 7, which shows that a setosa petal is on average about 7 times as long as it is wide. On the other hand, virginica has the smallest ratio: its petals are nearly 3 times as long as they are wide.
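The order of operations matters here: the ratio is taken per observation first and averaged afterwards. A minimal base R sketch of that calculation, next to the alternative of dividing the average length by the average width, shows that the two are not the same number. The names avg_of_ratios and ratio_of_avgs are introduced here for illustration.

```r
# Per-observation ratio, then the average within each species
ratio <- iris$Petal.Length / iris$Petal.Width
avg_of_ratios <- tapply(ratio, iris$Species, mean)

# Alternative: average length divided by average width within each species
ratio_of_avgs <- tapply(iris$Petal.Length, iris$Species, mean) /
  tapply(iris$Petal.Width, iris$Species, mean)

print(round(avg_of_ratios, 2))
print(round(ratio_of_avgs, 2))
```

The first vector corresponds to what the bar plot computes, since the division happens before the aggregate transform is applied.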

This function is very flexible, especially when it is used wisely. For example, with another data set such as population data, we can calculate the average birth-to-death ratio for each country.

Let's use all the functions in one graph. We are going to draw a scatter plot with one value per category, and add a dropdown button to the graph from which we can select the desired function. This makes our work easier and gets the results quicker.

Aggregation of all functions: all functions in one graph

# Include the library
library(plotly)

# Read the available aggregate function names from the plotly schema.
s <- schema()
agg <- s$transforms$aggregate$attributes$aggregations$items$aggregation$func$values

# Build one dropdown button per aggregate function.
l <- list()
for (i in seq_along(agg)) {
  ll <- list(
    method = "restyle",
    args = list('transforms[0].aggregations[0].func', agg[i]),
    label = agg[i]
  )
  l[[i]] <- ll
}

# Store the graph in one variable to make it easier to manipulate.
p <- plot_ly(
  type = 'scatter',
  x = iris$Species,
  y = iris$Sepal.Length / iris$Sepal.Width,
  mode = 'markers',
  marker = list(
    size = 20,
    color = 'orange',
    opacity = 0.8
  ),
  transforms = list(
    list(
      type = 'aggregate',
      groups = iris$Species,
      aggregations = list(
        list(target = 'y', func = 'avg', enabled = TRUE)
      )
    )
  )
) %>%
  layout(
    title = '<b>Plotly Aggregations by Satyam Chauhan</b><br>use dropdown to change aggregation<br><b>Sepal ratio of Length to Width</b>',
    xaxis = list(title = 'Species'),
    yaxis = list(title = 'Sepal ratio: Length/Width'),
    updatemenus = list(
      list(
        x = 0.2,
        y = 1.2,
        xref = 'paper',
        yref = 'paper',
        yanchor = 'top',
        buttons = l
      )
    )
  )

# Display the graph
p

What does this mean?

We make a list in which all the aggregation function names are stored, and turn each name into a dropdown button. We use these buttons to experiment with all the aggregate functions.

A few of the graphs with different examples are shown below.

What does this code do exactly?

First, the list described above is created, holding all the functions. After the list is made, the y-axis is set to the ratio of Sepal.Length to Sepal.Width and the x-axis is set to Species.
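The button list can also be built with lapply instead of a for loop. The sketch below is one possible variant: to stay self-contained it hard-codes the 11 function names from the list earlier in this article rather than reading them from the plotly schema, and make_button is a hypothetical helper name.

```r
# Function names taken from the list earlier in this article
agg <- c("avg", "count", "first", "last", "max", "median",
         "min", "mode", "rms", "stddev", "sum")

# Hypothetical helper: one dropdown button that switches the aggregate function
make_button <- function(func_name) {
  list(
    method = "restyle",
    args   = list("transforms[0].aggregations[0].func", func_name),
    label  = func_name
  )
}

# One button per function, in the same order as agg
buttons <- lapply(agg, make_button)
str(buttons[[1]])
```

The resulting list has the same structure as the one built in the loop, so it could be passed as the buttons entry of updatemenus in the same way.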

After the ratio is calculated, the aggregate transform is applied, with func = 'avg' as the initial setting only. When we run this code and select the function 'mode', we get Fig. 3 (above), which shows that the mode of setosa is the smallest of the three, at around 1.4. The mode tells us that the ratio 1.4 occurs most often, or that this value is the most likely to be sampled. The highest mode belongs to versicolor, at close to 2.2.

In Fig. 4 above, the change in the ratio of Sepal.Length to Sepal.Width is plotted, and the results are very different from the rest of the graphs. The changes for setosa and virginica are nearly the same and positive, while the change for versicolor is negative and roughly three times as large in magnitude.

On the other hand, the right figure shows the rms value of each species. We can easily see that versicolor and virginica have almost the same value, which is significantly greater than the rms value of setosa.
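The change and rms comparisons described above can be approximated in base R. This is a sketch under a stated assumption: 'change' is taken to mean the last value minus the first value within each group, which is how the figure is read here rather than something the article states. The names sepal_ratio, change and rms are introduced for illustration.

```r
# Ratio of sepal length to sepal width, one value per flower
sepal_ratio <- iris$Sepal.Length / iris$Sepal.Width

# Assumed definition of 'change': last value minus first value in each group
change <- tapply(sepal_ratio, iris$Species, function(v) v[length(v)] - v[1])

# Root mean square of the ratio within each group
rms <- tapply(sepal_ratio, iris$Species, function(v) sqrt(mean(v^2)))

print(round(change, 3))
print(round(rms, 3))
```

Note that a change computed this way depends only on the first and last rows of each group, so it reflects the row order of the data set more than any property of the species.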

Conclusion

Aggregation functions are among the most powerful tools a developer can ask for. They can reveal patterns and results that you wouldn't expect. To analyse data visually, you have to play with it, and to do that you need to manipulate and transform it. Aggregation functions do that for you, and they are among the most widely used transforms. This article is just a start. You can certainly explore more and apply more. That's what explorers do.

Translated from: https://www.freecodecamp.org/news/aggregates-in-r-one-of-the-most-powerful-tool-you-can-ask-for-4dd14eafff1f/
