Advanced NumPy for Data Science

We will cover some advanced NumPy concepts, specifically the functions and methods needed to work on a real-world dataset. The concepts covered here are enough to start your journey with data.

To follow along, you should know the basic concepts of NumPy. If you do not, I suggest reading my article “NumPy-The very basics!” first. You can find a link to it at the end of this article.

Contents

  1. Universal Functions

  2. Aggregation

  3. Broadcasting

  4. Masking

  5. Fancy Indexing

  6. Array Sorting

What are Universal Functions in NumPy?

Often we have to loop over an array to perform simple computations such as addition, subtraction, or division on each element. Because these operations are repeated for every element, the computation time grows as the data gets larger. Thankfully, NumPy makes this faster through vectorized operations, generally implemented through NumPy’s universal functions (ufuncs). Let’s understand this with an example.

Suppose we have an array of random integers between 1 and 10 and would like to get the square of each element. With plain Python, we would do this:

Numpy universal functions
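The original screenshot is not reproduced here, so the following is only a minimal sketch of the loop-based approach described above. The seed, the array size, and the variable names are assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(seed=0)   # seeded so the run is repeatable
x = rng.integers(1, 11, size=5)       # random integers from 1 to 10 inclusive

# Plain-Python approach: visit every element and square it one at a time.
squares = np.empty(len(x), dtype=x.dtype)
for i in range(len(x)):
    squares[i] = x[i] ** 2

print(x)
print(squares)
```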

This takes a lot of time to write and compute, especially for larger arrays in a real dataset. Let’s see how ufuncs make it simpler on both counts.

Numpy universal functions
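A minimal sketch of the vectorized version, assuming a small hand-written array in place of the random one (the values are illustrative, not taken from the article):

```python
import numpy as np

x = np.array([3, 7, 1, 9, 4])   # illustrative values
squares = x ** 2                # the ufunc squares every element at once

print(squares)                  # [ 9 49  1 81 16]
print(squares.dtype == x.dtype) # True: the integer dtype is retained
```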

Simply performing an operation on the array applies it to each element of the array. Notice that it also retains the dtype. Ufunc operations are extremely flexible: we can also perform operations between two arrays.

Numpy universal functions
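As a rough illustration of operations between two arrays, here is a short sketch with two made-up arrays of the same shape (names and values are assumptions):

```python
import numpy as np

a = np.array([1, 2, 3, 4])
b = np.array([10, 20, 30, 40])

total = a + b      # element-wise addition
product = a * b    # element-wise multiplication
ratio = b / a      # element-wise true division (result is float)

print(total)       # [11 22 33 44]
print(product)     # [ 10  40  90 160]
print(ratio)       # [10. 10. 10. 10.]
```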

All these arithmetic operations are wrappers around NumPy built-in functions. For example, the + operator is a wrapper for the np.add function.

Numpy universal functions

Below is a summary table of the arithmetic operations in NumPy.

Numpy universal functions
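Since the original table image is not available, the sketch below lists the standard arithmetic operators next to the NumPy ufuncs they wrap and checks that each pair gives the same result. The array and the `pairs` dictionary are illustrative assumptions; the ufunc names are NumPy's own.

```python
import numpy as np

x = np.arange(1, 6)   # [1 2 3 4 5]

# Each entry maps "operator  ufunc" to (operator result, ufunc result).
pairs = {
    "+  np.add":          (x + 2,  np.add(x, 2)),
    "-  np.subtract":     (x - 2,  np.subtract(x, 2)),
    "-x np.negative":     (-x,     np.negative(x)),
    "*  np.multiply":     (x * 2,  np.multiply(x, 2)),
    "/  np.divide":       (x / 2,  np.divide(x, 2)),
    "// np.floor_divide": (x // 2, np.floor_divide(x, 2)),
    "** np.power":        (x ** 2, np.power(x, 2)),
    "%  np.mod":          (x % 2,  np.mod(x, 2)),
}

for name, (via_operator, via_ufunc) in pairs.items():
    print(name, np.array_equal(via_operator, via_ufunc))
```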

Some of the most useful functions NumPy provides are the trigonometric, logarithmic, and exponential functions. As data scientists, we should be familiar with them; they come in handy when working on real datasets.
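A brief sketch of a few of these functions, using made-up input values (the arrays are assumptions; the functions are standard NumPy ufuncs):

```python
import numpy as np

theta = np.array([0.0, np.pi / 2, np.pi])
sines = np.sin(theta)       # approximately [0, 1, 0]
cosines = np.cos(theta)     # approximately [1, 0, -1]

v = np.array([1.0, 10.0, 100.0])
natural_log = np.log(v)     # natural logarithm
log10 = np.log10(v)         # base-10 logarithm
roundtrip = np.log(np.exp(v))   # exp and log undo each other

print(sines, cosines)
print(natural_log, log10)
```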


Aggregation

As a data analyst or data scientist, your very first step is to explore and understand the data. One way to do that is to compute summary statistics. Although the most common statistics for summarizing data are the mean and standard deviation, other aggregates such as the sum, product, median, maximum, and minimum are also useful.

Let us understand with an example that computes the sum, min, and max.

Numpy aggregation

For most NumPy aggregates, a shorthand syntax is to use methods of the array object instead of functions. The above operations can also be performed as shown below, with no difference computationally.

Numpy aggregation
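A small sketch comparing the function form and the method form on an illustrative array (the values are assumptions, not the ones from the original screenshots):

```python
import numpy as np

x = np.array([4, 8, 15, 16, 23, 42])

# Function form
print(np.sum(x), np.min(x), np.max(x))   # 108 4 42

# Method form: same results, shorter to write
print(x.sum(), x.min(), x.max())         # 108 4 42
```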

IMPORTANT: Difference between Python aggregate functions and NumPy aggregate functions

One question you may raise is why we should use NumPy aggregate functions when Python already has built-in equivalents (sum(), min(), max(), etc.). The NumPy functions are much faster, but more importantly, they are aware of dimensions. The Python built-ins behave differently on multidimensional arrays.

Suppose we would like to get the sum of all the elements in an array of shape 2x5. For clarity, we will use a simple array of the numbers 0 to 9.

Numpy aggregation
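The comparison described above can be sketched as follows (variable names are assumptions):

```python
import numpy as np

x = np.arange(10).reshape(2, 5)   # [[0 1 2 3 4], [5 6 7 8 9]]

builtin_total = sum(x)      # iterates over rows and adds them together
numpy_total = np.sum(x)     # adds every element

print(builtin_total)        # [ 5  7  9 11 13]
print(numpy_total)          # 45
```

The built-in sum() treats the 2-d array as a sequence of rows, so it returns one value per column rather than a single total.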

We expected the output to be 45 (0+1+2+3+4+5+6+7+8+9), but the built-in sum() adds the rows together and returns one value per column instead of a single total. Results like this can be costly when summarizing data. Hence, always make sure you use the NumPy version of an aggregate function when working on arrays.

Multidimensional aggregates

One common type of operation is aggregation along rows or columns. Since NumPy functions are aware of dimensions, this is easy to do, for example finding the minimum value in each row or each column. The functions take an additional argument that specifies the axis along which to aggregate.

Suppose we have a table of marks obtained by students, where each row is a student and each column represents a different subject. We wish to find the minimum and maximum marks in each subject and the total marks scored by each student. Use ‘axis=0’ for a column-wise operation and ‘axis=1’ for a row-wise operation. The result will be a 1-d array.

Numpy aggregation
Numpy aggregation
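A minimal sketch of the marks example, assuming a made-up table of four students and three subjects (the numbers are illustrative only):

```python
import numpy as np

# Rows are students, columns are subjects (hypothetical marks).
marks = np.array([
    [72, 85, 90],
    [60, 95, 78],
    [88, 70, 65],
    [91, 82, 77],
])

min_per_subject = marks.min(axis=0)     # collapse rows: one value per column
max_per_subject = marks.max(axis=0)
total_per_student = marks.sum(axis=1)   # collapse columns: one value per row

print(min_per_subject)     # [60 70 65]
print(max_per_subject)     # [91 95 90]
print(total_per_student)   # [247 233 223 250]
```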

Other aggregation functions in NumPy

np.prod, np.mean, np.std, np.var, np.argmin (find the index of the minimum value), np.argmax (find the index of the maximum value), np.median, np.percentile (compute rank-based statistics of elements).
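A quick sketch of several of these functions on an illustrative array (the values are assumptions):

```python
import numpy as np

x = np.array([3, 1, 4, 1, 5, 9, 2, 6])

print(np.prod(x))             # 6480
print(np.mean(x))             # 3.875
print(np.median(x))           # 3.5
print(np.argmin(x))           # 1  (index of the first minimum)
print(np.argmax(x))           # 5
print(np.percentile(x, 50))   # 3.5, the same as the median
print(np.std(x), np.var(x))   # standard deviation and variance
```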

Broadcasting

We have already seen NumPy universal functions at the very beginning. Broadcasting is another way of applying ufuncs, this time on arrays of different sizes. Broadcasting is simply a set of rules that NumPy applies to perform ufuncs on arrays of different sizes.

Consider adding two arrays of shape 3x3 and 1x3. We can think of this operation as the smaller array being stretched, or broadcast, to match the shape of the larger array. This stretching does not actually take place in memory; it is only a mental model.

Numpy broadcasting
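A short sketch of the 3x3 plus 1x3 case, with illustrative values (the arrays are assumptions):

```python
import numpy as np

a = np.arange(9).reshape(3, 3)   # shape (3, 3)
b = np.array([[10, 20, 30]])     # shape (1, 3)

result = a + b   # b behaves as if its single row were repeated three times

print(result)
# [[10 21 32]
#  [13 24 35]
#  [16 27 38]]
```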

Things get more confusing when both arrays need to be broadcast.

Numpy broadcasting
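When both arrays are broadcast, a column and a row combine into a full grid. A minimal sketch with made-up values:

```python
import numpy as np

col = np.array([[10], [20], [30]])   # shape (3, 1)
row = np.array([1, 2, 3])            # shape (3,)

grid = col + row   # both operands are stretched to shape (3, 3)

print(grid)
# [[11 12 13]
#  [21 22 23]
#  [31 32 33]]
```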

Jake VanderPlas, author of the Python Data Science Handbook, has provided an excellent visualization to explain this process. The light-colored boxes represent the stretched values.

Numpy broadcasting
Source: Python Data Science Handbook

3 Rules for Broadcasting

The above is an intuitive picture of the process. We will now explore the formal rules with examples.

Example 1:

m = np.arange(3).reshape((3, 1))
n = np.arange(3)
m.shape => (3, 1)
n.shape => (3,)

By rule 1, if the two arrays differ in their number of dimensions, the shape of the one with fewer dimensions is padded with ‘1’ on its left side.
m.shape => (3, 1)
n.shape => (1, 3)

By rule 2, if the shapes of the two arrays still do not match, each array whose size in a dimension is equal to 1 is stretched in that dimension to match the other array.
m.shape => (3, 3)
n.shape => (3, 3)

To stress rule 2: we can stretch an array along a dimension only if its size in that dimension is 1. We cannot do this for any size other than 1. Let’s see an example where a dimension that would need stretching has a size other than 1 when rule 2 is applied.

Example 2:

m = np.arange(6).reshape((3, 2))
n = np.arange(3)
m.shape => (3, 2)
n.shape => (3,)

By rule 1,
m.shape => (3, 2)
n.shape => (1, 3)

By rule 2,
m.shape => (3, 2)
n.shape => (3, 3)

By rule 3, if the sizes disagree in any dimension and neither of them is 1, an error is raised. Here the last dimension is 2 for m and 3 for n, so the operation fails.
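The two examples can be checked directly. This is only a sketch: it assumes the second example's m is built from six values so that it can actually be reshaped to (3, 2), and the variable names are illustrative.

```python
import numpy as np

n = np.arange(3)                        # shape (3,)

# Example 1: (3, 1) and (3,) broadcast to (3, 3).
m1 = np.arange(3).reshape((3, 1))
ok_shape = (m1 + n).shape
print(ok_shape)                         # (3, 3)

# Example 2: (3, 2) and (3,) cannot be broadcast (rule 3).
m2 = np.arange(6).reshape((3, 2))
try:
    m2 + n
    error_message = None
except ValueError as exc:
    error_message = str(exc)
print(error_message)                    # NumPy's broadcast error text
```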

Masking

Masking is a technique used extensively in data processing. It allows us to extract, count, modify, or otherwise manipulate values in an array based on certain criteria. These criteria are specified using comparison operators and boolean operators.

Suppose we have a two-dimensional array of shape (3, 4) and would like to get the subset of its values that are less than 5.

Numpy masking
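A minimal sketch of this masking step, assuming an illustrative (3, 4) array (the values are not the ones from the original screenshot):

```python
import numpy as np

x = np.array([[5, 0, 3, 3],
              [7, 9, 3, 5],
              [2, 4, 7, 6]])

mask = x < 5        # boolean array with the same shape as x
subset = x[mask]    # 1-d array of the values where mask is True

print(mask)
print(subset)                    # [0 3 3 3 2 4]
print(np.count_nonzero(mask))    # 6 values are less than 5
```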

Let’s break it down.

We used the comparison operator ‘<’ on array x. As we already know, this applies an element-wise ufunc (np.less()) to the array. As a result, we get an array of boolean values: True if the element at the corresponding position is less than 5, else False.

Numpy masking

When we write x[x<5], the boolean array returned above is applied to the original array x, returning the elements at the positions where the mask is True, that is, the values less than 5. In the same way, we can use any of the comparison operators. We can even combine two conditions, say x[(x>3) & (x<6)] to get values between 3 and 6, as long as the result of the expression is boolean. Notice that here we use the bitwise operator ‘&’ rather than the keyword ‘and’.

REMEMBER

The keywords ‘and’ and ‘or’ perform a single boolean operation on the entire array, while the bitwise operators ‘&’ and ‘|’ perform element-wise boolean operations on the elements of an array. Always use the bitwise operators when masking.
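The difference can be seen in a short sketch (the array and variable names are assumptions). With a multi-element array, the keyword form raises a ValueError because the truth value of a whole array is ambiguous:

```python
import numpy as np

x = np.arange(10)

between = x[(x > 3) & (x < 6)]   # element-wise AND
outside = x[(x < 2) | (x > 7)]   # element-wise OR

print(between)   # [4 5]
print(outside)   # [0 1 8 9]

# The keyword 'and' tries to evaluate each whole array as one boolean.
try:
    x[(x > 3) and (x < 6)]
    keyword_error = None
except ValueError as exc:
    keyword_error = str(exc)
print(keyword_error)
```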

Fancy indexing

Fancy indexing is similar to the normal indexing we already know. The only difference is that we pass an array of indices. This advanced form of indexing allows quick access to, and modification of, complicated subsets of an array.

Suppose we want to access the elements at indices 2, 5, and 9 of an array. The old-school method would be [x[2], x[5], x[9]]. This can be simplified using fancy indexing.

Numpy indexing
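A minimal sketch with an illustrative array (values and names are assumptions):

```python
import numpy as np

x = np.arange(10) * 10      # [ 0 10 20 ... 90]

old_school = [x[2], x[5], x[9]]
idx = [2, 5, 9]
picked = x[idx]             # fancy indexing: pass the list of indices

print(picked)               # [20 50 90]
```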

Likewise, we can fancy-index a two-dimensional array. Let’s see the fancy-indexing equivalent of x[0, 2], x[1, 3], and x[2, 1].

Numpy indexing
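In two dimensions, one index array is passed for the rows and one for the columns, and they are paired up element by element. A sketch with an assumed 3x4 array:

```python
import numpy as np

x = np.arange(12).reshape(3, 4)
# [[ 0  1  2  3]
#  [ 4  5  6  7]
#  [ 8  9 10 11]]

rows = np.array([0, 1, 2])
cols = np.array([2, 3, 1])
picked = x[rows, cols]      # x[0, 2], x[1, 3], x[2, 1]

print(picked)               # [2 7 9]
```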

This can be further simplified if either the row or the column index is constant. Let’s say we would like to get the values at x[2, 1], x[2, 3], and x[2, 4]. In the figure below, the yellow highlight marks the row index and the blue highlight marks the column indices. Similarly, we can also modify values through fancy indexing by using the assignment operator ‘=’.

Numpy indexing
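A sketch of the constant-row case and of assignment through fancy indexing, assuming a 3x5 array so that column 4 exists (the array is illustrative):

```python
import numpy as np

x = np.arange(15).reshape(3, 5)
# Row 2 is [10 11 12 13 14]

picked = x[2, [1, 3, 4]]    # fixed row 2, columns 1, 3 and 4
print(picked)               # [11 13 14]

# Assignment through fancy indexing modifies the selected positions.
y = x.copy()                # copy so the original x stays intact
y[2, [1, 3, 4]] = 0
print(y[2])                 # [10  0 12  0  0]
```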

Array sorting

np.sort is more efficient on NumPy arrays than Python’s built-in sort function. Additionally, np.sort is aware of dimensions. Let’s see a few flavors of the NumPy sorting functions.

Numpy sorting

Notice that when we use the method sort(), it alters array x itself, meaning the original order of x is lost. This is called in-place sorting.
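The sorting flavors discussed above can be sketched as follows. The arrays are illustrative assumptions; np.sort, np.argsort, and the sort() method are standard NumPy.

```python
import numpy as np

x = np.array([2, 1, 4, 3, 5])

sorted_copy = np.sort(x)    # returns a sorted copy, x is unchanged
order = np.argsort(x)       # indices that would sort x
print(sorted_copy)          # [1 2 3 4 5]
print(order)                # [1 0 3 2 4]
print(x)                    # [2 1 4 3 5]

table = np.array([[9, 1, 5],
                  [3, 7, 2]])
by_column = np.sort(table, axis=0)   # sort within each column
by_row = np.sort(table, axis=1)      # sort within each row
print(by_column)            # [[3 1 2], [9 7 5]]
print(by_row)               # [[1 5 9], [2 3 7]]

x.sort()                    # in-place: x itself is now sorted
print(x)                    # [1 2 3 4 5]
```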


Although these are not the only concepts in NumPy, I have tried to cover the critical, must-know ones. This is more than enough to get started with data science. Since Python is open source, functions are added and deprecated regularly, so always keep an eye on NumPy’s official documentation. I will also keep this content updated as and when required.

If you have difficulty understanding these concepts, try reading the article below first.

Let’s connect

Translated from: https://medium.com/analytics-vidhya/advanced-numpy-218584c60c63
