Advanced NumPy for Data Science

We will cover some of the advanced concepts of NumPy, specifically the functions and methods required to work on a real dataset. The concepts covered here are more than enough to start your journey with data.

Before going ahead, you should know the basic concepts of NumPy. If you do not, I suggest you read my article “NumPy-The very basics!” first. You can find a link to it at the end of this article.

Contents

Universal Functions
Aggregation
Broadcasting
Masking
Fancy Indexing
Array Sorting

What are Universal Functions in NumPy?

Most of the time we have to loop over an array to perform simple computations such as addition, subtraction, or division on each element. Since these operations are repeated for every element, the time taken to compute grows with the size of the data. Thankfully, NumPy makes this faster by using vectorized operations, generally implemented through NumPy’s universal functions (ufuncs). Let’s understand with an example.

Suppose we have an array of random integers between 1 and 10 and would like to get the square of each element of the array. With plain Python, what we would do is:
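A minimal sketch of the loop-based approach described here. The array size and the random seed are arbitrary choices made for illustration, not values from the original example.

```python
import numpy as np

rng = np.random.default_rng(seed=0)   # seed chosen arbitrarily
x = rng.integers(1, 11, size=5)       # the upper bound is exclusive, so values are 1..10

# Square each element one at a time with an explicit Python loop.
squares = np.empty(len(x), dtype=x.dtype)
for i in range(len(x)):
    squares[i] = x[i] ** 2

print(x)
print(squares)
```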

This takes a lot of time to write and to compute, especially for the larger arrays in a real dataset. Let’s see how ufuncs make it simpler on both counts.

Simply performing an operation on the array applies it to each element within the array. Notice that it also retains the dtype. Ufunc operations are extremely flexible: we can also perform operations between two arrays.
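The vectorized version might look like the sketch below. The array values are made up for illustration.

```python
import numpy as np

x = np.array([3, 7, 1, 9, 4])
y = np.array([10, 20, 30, 40, 50])

squares = x ** 2    # applied to every element, no explicit loop
total = x + y       # element-wise operation between two arrays

print(squares, squares.dtype)
print(total)
```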

All these arithmetic operators are wrappers around NumPy built-in functions. For example, the + operator is a wrapper for the add function.

Below is a summary table of the arithmetic operations in NumPy.
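As a rough stand-in for the operator-to-function mapping, the sketch below pairs each arithmetic operator with the ufunc it wraps and checks that both give the same result. The sample array is an arbitrary choice.

```python
import numpy as np

x = np.arange(1, 6)   # [1, 2, 3, 4, 5]

# Each entry holds (result via operator, result via the equivalent ufunc).
pairs = {
    "+   np.add":          (x + 2,  np.add(x, 2)),
    "-   np.subtract":     (x - 2,  np.subtract(x, 2)),
    "-x  np.negative":     (-x,     np.negative(x)),
    "*   np.multiply":     (x * 2,  np.multiply(x, 2)),
    "/   np.divide":       (x / 2,  np.divide(x, 2)),
    "//  np.floor_divide": (x // 2, np.floor_divide(x, 2)),
    "**  np.power":        (x ** 2, np.power(x, 2)),
    "%   np.mod":          (x % 2,  np.mod(x, 2)),
}

for name, (via_operator, via_ufunc) in pairs.items():
    print(name, via_operator, np.array_equal(via_operator, via_ufunc))
```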

Some of the most useful functions provided by NumPy are the trigonometric, logarithmic, and exponential functions. As data scientists, we should be aware of them. They come in handy while working on real datasets.
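A short sketch of a few of these functions, using small hand-picked inputs so the results are easy to check by eye.

```python
import numpy as np

theta = np.linspace(0, np.pi, 3)   # 0, pi/2, pi
sines = np.sin(theta)
cosines = np.cos(theta)

x = np.array([1.0, 2.0, 4.0])
exps = np.exp(x)       # e ** x
logs = np.log(x)       # natural logarithm
logs2 = np.log2(x)     # base-2 logarithm
logs10 = np.log10(x)   # base-10 logarithm

print(sines, cosines)
print(exps, logs, logs2, logs10)
```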

Aggregation

As a data analyst or data scientist, the very first step is to explore and understand the data. One way to do that is to compute summary statistics. Although the most common statistics used to summarize data are the mean and standard deviation, other aggregates such as the sum, product, median, maximum, and minimum are also useful.

Let us understand with an example by computing the sum, min, and max.

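A minimal sketch of these three aggregates using the function form. The array values are illustrative.

```python
import numpy as np

x = np.array([4, 8, 15, 16, 23, 42])

total = np.sum(x)
smallest = np.min(x)
largest = np.max(x)

print(total, smallest, largest)
```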

For most NumPy aggregates, a shorthand syntax is to use the methods of the array object instead of the functions. The operation above can also be performed this way, and there is no difference computationally.
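The method form might be sketched as follows, again on an illustrative array.

```python
import numpy as np

x = np.array([4, 8, 15, 16, 23, 42])

# Methods on the array object give the same results as the np.* functions.
total = x.sum()
smallest = x.min()
largest = x.max()

print(total, smallest, largest)
```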

IMPORTANT: Difference between Python aggregate functions and NumPy aggregate functions

One question you might raise is why we should use NumPy aggregate functions when equivalents are already built into Python (sum(), min(), max(), etc.). Of course, one difference is that the NumPy functions are much faster, but more importantly they are aware of dimensions. The Python functions behave differently on multidimensional arrays.

Suppose we would like to get the sum of all the elements in an array of size 2x5. For better understanding, we will take a simple array of the numbers from 0 to 9.
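A sketch of the comparison. Python’s built-in sum iterates over the first axis, so on a 2-D array it adds the rows together instead of adding every element.

```python
import numpy as np

x = np.arange(10).reshape(2, 5)

builtin_total = sum(x)     # adds row 0 and row 1 element-wise
numpy_total = np.sum(x)    # adds every element of the array

print(builtin_total)   # an array with one value per column
print(numpy_total)     # a single number
```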

We were expecting the output to be 45 (0+1+2+3+4+5+6+7+8+9), but the result is unexpected. Results like this can be costly when summarizing data. Hence, always make sure you use the NumPy version of an aggregate function when working with arrays.

Multidimensional aggregates

One common type of operation is aggregation along rows or columns. Since NumPy functions are aware of dimensions, this is easy to do, for example to find the minimum value in each row or each column. The functions take an additional argument that specifies the axis along which we wish to perform the aggregation.

Suppose we have a table of marks obtained by students, where each row represents a student and each column represents a different subject. We wish to find the minimum and maximum marks in each subject and the total marks scored by each student. Use ‘axis=0’ to specify a column-wise operation and ‘axis=1’ for a row-wise one. The result will be a 1-D array.
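A sketch with a made-up marks table (four students, three subjects). The numbers are purely illustrative.

```python
import numpy as np

# Rows are students, columns are subjects.
marks = np.array([
    [78, 65, 90],
    [55, 80, 72],
    [92, 88, 61],
    [67, 49, 85],
])

min_per_subject = marks.min(axis=0)     # collapse the rows: one value per column
max_per_subject = marks.max(axis=0)
total_per_student = marks.sum(axis=1)   # collapse the columns: one value per row

print(min_per_subject)
print(max_per_subject)
print(total_per_student)
```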

Other aggregation functions in NumPy

np.prod, np.mean, np.std, np.var, np.argmin (index of the minimum value), np.argmax (index of the maximum value), np.median, np.percentile (rank-based statistics of the elements).
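A brief sketch of a few of these on an illustrative array.

```python
import numpy as np

x = np.array([4, 8, 15, 16, 23, 42])

product = np.prod(x)
mean = np.mean(x)
idx_min = np.argmin(x)        # position of the smallest value
idx_max = np.argmax(x)        # position of the largest value
median = np.median(x)
p50 = np.percentile(x, 50)    # the 50th percentile equals the median

print(product, mean, idx_min, idx_max, median, p50)
```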

Broadcasting

We have already seen NumPy universal functions at the very beginning. Broadcasting is another means of applying ufuncs, but on arrays of different sizes. Broadcasting is nothing but a set of rules applied by NumPy to perform ufuncs on arrays of different sizes.

Consider adding two arrays of size 3x3 and 1x3. We can think of this operation as the smaller array being stretched, or broadcast, to match the size of the larger array. This stretching does not actually take place in memory; it is just a mental model for better understanding.
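A minimal sketch of adding a 3x3 array and a 1x3 array. The values are chosen only to make the stretching visible.

```python
import numpy as np

a = np.ones((3, 3), dtype=int)      # shape (3, 3)
b = np.arange(3).reshape(1, 3)      # shape (1, 3)

# b behaves as if its single row were repeated three times.
result = a + b

print(result)
print(result.shape)
```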

Confusion and complication increase when both arrays need to be broadcast.
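A sketch of a case where both operands are stretched: a column of shape (3, 1) and a row of shape (3,). The names col and row are illustrative.

```python
import numpy as np

col = np.arange(3).reshape(3, 1)   # shape (3, 1)
row = np.arange(3)                 # shape (3,)

# col is stretched across columns and row is stretched across rows.
grid = col + row

print(grid)
print(grid.shape)
```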

Jake VanderPlas, author of the Python Data Science Handbook, has provided an excellent visualization to explain this process. The light-colored boxes represent the stretched values.

3 Rules for Broadcasting

The above is an intuitive picture to aid understanding. We will now explore the formal rules with examples.

Example 1:


m = np.arange(3).reshape((3, 1))
n = np.arange(3)

m.shape = (3, 1)
n.shape = (3,)

By rule 1, if the two arrays differ in their number of dimensions, the shape of the array with fewer dimensions is padded with 1 on its left side.

m.shape => (3, 1)
n.shape => (1, 3)

By rule 2, if the shapes of the two arrays still do not match in some dimension, the array whose size in that dimension is 1 is stretched to match the other array.

m.shape => (3, 3)
n.shape => (3, 3)

To stress rule 2: we can stretch an array along a dimension only if its size in that dimension is 1. We cannot do this for a dimension of any other size. Let’s see an example where the dimension in question is different from 1 when rule 2 is applied.

Example 2:


m = np.arange(6).reshape((3, 2))
n = np.arange(3)

m.shape = (3, 2)
n.shape = (3,)

By rule 1,

m.shape => (3, 2)
n.shape => (1, 3)

By rule 2, the first dimension of n is stretched to match m.

m.shape => (3, 2)
n.shape => (3, 3)

By rule 3, if the shapes of the two arrays disagree in any dimension and neither size is 1, an error is raised. Here the second dimensions are 2 and 3, so m + n fails.
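The two examples can be checked with a sketch like the one below. For Example 2 it uses np.arange(6), since a (3, 2) array needs six elements; the names m2, ok_shape, and raised are illustrative.

```python
import numpy as np

n = np.arange(3)                      # shape (3,)

# Example 1: shapes (3, 1) and (3,) are compatible.
m = np.arange(3).reshape((3, 1))
ok_shape = (m + n).shape

# Example 2: shapes (3, 2) and (3,) are not compatible.
m2 = np.arange(6).reshape((3, 2))
try:
    m2 + n
    raised = False
except ValueError as err:
    raised = True
    print("Broadcast failed:", err)

print(ok_shape, raised)
```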

Masking

Masking is a method used extensively in data processing. It allows us to extract, count, modify, or otherwise manipulate values in an array based on certain criteria. These criteria are specified using comparison operators and boolean operators.

Suppose we have a two-dimensional array of size (3, 4) and we would like to get the subset of the array whose values are less than 5.
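A minimal sketch, assuming the array simply holds the numbers 0 to 11.

```python
import numpy as np

x = np.arange(12).reshape(3, 4)

subset = x[x < 5]   # keeps only the values below 5, returned as a 1-D array

print(x)
print(subset)
```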

Let’s break it down

We used the comparison operator ‘<’ on array x. As we already know, this applies an element-wise ufunc (np.less()) to the array. As a result, we get an array of boolean values: True if the element at the corresponding position is less than 5, otherwise False.
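The intermediate boolean array can be inspected with a sketch like this one, using the same illustrative array of 0 to 11.

```python
import numpy as np

x = np.arange(12).reshape(3, 4)

mask = x < 5                 # element-wise comparison
same = np.less(x, 5)         # the ufunc behind the '<' operator

print(mask)
print(mask.dtype, np.array_equal(mask, same))
```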

When we write x[x<5], the boolean array returned above is applied to the original array x, returning the elements at the positions where the mask is True, that is, the values less than 5. In the same way we can use any of the comparison operators available in Python. We can even combine two conditions, say x[(x>3) & (x<6)], to get the values between 3 and 6, as long as the result of the combined operation is boolean. Notice that here we use the bitwise operator ‘&’ rather than the keyword ‘and’.
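A sketch of combining conditions, counting matches, and modifying values through a mask. The copy y is introduced here only so that x stays unchanged.

```python
import numpy as np

x = np.arange(12).reshape(3, 4)

between = x[(x > 3) & (x < 6)]        # values strictly between 3 and 6
count_small = np.count_nonzero(x < 5) # how many values are below 5

y = x.copy()
y[y < 5] = 0                          # overwrite the masked positions

print(between, count_small)
print(y)
```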

REMEMBER

The keywords ‘and’ and ‘or’ perform a single boolean operation on the entire array, while the bitwise operators ‘&’ and ‘|’ perform multiple boolean operations, one on each element of the array. Always use the bitwise operators while masking.
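The difference can be seen in a small sketch: ‘&’ works element by element, whereas ‘and’ asks for the truth value of a whole array, which NumPy refuses to guess for arrays with more than one element.

```python
import numpy as np

x = np.arange(12).reshape(3, 4)

elementwise = (x > 3) & (x < 6)   # one boolean result per element

try:
    (x > 3) and (x < 6)           # needs a single True/False for the whole array
    raised = False
except ValueError as err:
    raised = True
    print("Keyword 'and' failed:", err)

print(elementwise.sum(), raised)
```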

Fancy indexing

Fancy indexing is similar to the normal indexing we already know. The only difference is that here we pass an array of indices. This advanced form of indexing allows quick access to, and modification of, complicated subsets of an array.

Suppose we want to access the elements at indices 2, 5, and 9 of an array. The old-school method would be [x[2], x[5], x[9]]. This can be simplified using fancy indexing.
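A minimal sketch on an illustrative 1-D array of multiples of 10.

```python
import numpy as np

x = np.arange(10) * 10     # [0, 10, 20, ..., 90]

old_school = [x[2], x[5], x[9]]
idx = [2, 5, 9]
picked = x[idx]            # fancy indexing: pass the list of indices at once

print(old_school)
print(picked)
```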

Likewise, we can apply fancy indexing to a two-dimensional array. Let’s see the equivalent of x[0, 2], x[1, 3], and x[2, 1] in fancy indexing.
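A sketch assuming a 3x5 array holding the numbers 0 to 14. The row indices and column indices are passed as two separate lists and paired up position by position.

```python
import numpy as np

x = np.arange(15).reshape(3, 5)

rows = [0, 1, 2]
cols = [2, 3, 1]
picked = x[rows, cols]   # pairs up as x[0, 2], x[1, 3], x[2, 1]

print(x)
print(picked)
```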

This can be simplified further if either the row or the column value is constant. Let’s say we would like to get the values at x[2, 1], x[2, 3], and x[2, 4]. The yellow highlight below marks the row value and the blue highlight marks the column values. Similarly, we can also modify values through fancy indexing by using the assignment operator ‘=’.
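A sketch of both ideas on the same illustrative 3x5 array: a constant row index combined with a list of column indices, followed by an assignment through the same index.

```python
import numpy as np

x = np.arange(15).reshape(3, 5)

picked = x[2, [1, 3, 4]]   # row 2 is constant; columns 1, 3, and 4 vary

x[2, [1, 3, 4]] = 0        # modify the same positions in place

print(picked)
print(x)
```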

Array sorting

For NumPy arrays, np.sort is a more efficient sorting function than Python’s built-in sorting functions. Additionally, np.sort is aware of dimensions. Let’s see a few flavors of the NumPy sorting functions.
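A sketch of a few flavors: np.sort returns a sorted copy, np.argsort returns the indices that would sort the array, and the axis argument sorts along rows or columns. The arrays are illustrative.

```python
import numpy as np

x = np.array([3, 1, 4, 1, 5, 9, 2, 6])

sorted_copy = np.sort(x)    # x itself is left untouched
order = np.argsort(x)       # indices that would sort x

m = np.array([[9, 1, 5],
              [3, 7, 2]])
by_column = np.sort(m, axis=0)   # sort within each column
by_row = np.sort(m, axis=1)      # sort within each row

print(sorted_copy, order)
print(by_column)
print(by_row)
```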

Notice that when we use the method sort(), it alters array x itself, meaning the original order of array x is lost. This is called in-place sorting.
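A minimal sketch of in-place sorting. The copy named original is kept here only to show that x has changed.

```python
import numpy as np

x = np.array([3, 1, 4, 1, 5, 9, 2, 6])
original = x.copy()

returned = x.sort()   # sorts x in place and returns None

print(returned)
print(x)
print(original)
```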

Thank you for reading. Photo by Kelly Sikkema on Unsplash.

Although these are not the only concepts in NumPy, I have tried to cover the critical, must-know ones. This is more than enough for getting started with data science. Since Python and NumPy are open source, functions are added and deprecated regularly, so always keep an eye on NumPy’s official documentation. I will also make sure I keep updating this content as and when required.

If you have difficulty understanding these concepts, try reading the article below first.

Let’s connect
