Fluent Pandas

Let’s uncover the practical details of Pandas’ Series, DataFrame, and Panel

Note to the Readers: Paying attention to comments in examples would be more helpful than going through the theory itself.

· Series (1D data structure: Column-vector of DataTable)
· DataFrame (2D data structure: Table)
· Panel (3D data structure)

Pandas is a column-oriented data analysis API. It’s a great tool for handling and analyzing input data, and many ML frameworks support pandas data structures as inputs.

Pandas Data Structures

Refer to Intro to Data Structures in the Pandas docs.

The primary data structures in pandas are implemented as two classes: DataFrame and Series.

Import numpy and pandas into your namespace:

import numpy as np
import pandas as pd
import matplotlib as mpl
np.__version__
pd.__version__
mpl.__version__

Series (1D data structure: Column-vector of DataTable)

CREATING SERIES

Series is a one-dimensional array of elements with (possibly non-unique) labels, capable of holding any data type. The axis labels are collectively referred to as the index. The general way to create a Series is to call:

pd.Series(data, index=index)

Here, data can be a NumPy ndarray, a Python dict, or a scalar value (like 5). The passed index is a list of axis labels.

Note: pandas supports non-unique index values. If an operation that does not support duplicate index values is attempted, an exception will be raised at that time.

If data is a list or ndarray (preferred way):

If data is an ndarray or list, then index must be of the same length as data. If no index is passed, one will be created having values [0, 1, 2, ... len(data) - 1].

If data is a scalar value:

If data is a scalar value, an index must be provided. The value will be repeated to match the length of index.

If data is dict:

If data is a dict and an index is passed, the values in data corresponding to the labels in the index will be pulled out; otherwise, an index will be constructed from the sorted keys of the dict, if possible.
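
The three construction paths above can be sketched as follows (the values and labels are illustrative):

```python
import numpy as np
import pandas as pd

# From an ndarray/list: index (if given) must match the data length
s1 = pd.Series(np.array([10, 20, 30]), index=['a', 'b', 'c'])

# No index passed: a default RangeIndex 0..len(data)-1 is created
s2 = pd.Series([10, 20, 30])

# From a scalar: an index is mandatory, and the value is repeated to its length
s3 = pd.Series(5.0, index=['a', 'b', 'c'])

# From a dict: keys become the index; a passed index pulls out matching
# values and fills the rest with NaN
s4 = pd.Series({'a': 1.0, 'b': 2.0}, index=['b', 'c'])
```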

SERIES IS LIKE NDARRAY AND DICT COMBINED

Series acts very similarly to an ndarray and is a valid argument to most NumPy functions. However, operations like slicing also slice the index. Series can be passed to most NumPy methods expecting a 1D ndarray.

A key difference between Series and ndarray is the automatic alignment of data based on labels during Series operations. Thus, you can write computations without considering whether the Series objects involved have the same labels. For example,
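
A small sketch of label alignment, with a deliberately duplicated label 'b' (the values are made up):

```python
import pandas as pd

s = pd.Series([1.0, 2.0, 3.0, 4.0], index=['a', 'b', 'b', 'c'])

# duplicated label: s['b'] returns a Series, not a scalar
s['b']

# the two slices are aligned on their labels before adding; labels that
# appear on only one side ('a' in s[:-1], 'c' in s[1:]) come out as NaN
u = s[1:] + s[:-1]
```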

The result of an operation between unaligned Series objects will have the union of the indexes involved. If a label is not found in one Series or the other, the result will be marked as missing (NaN).

Also note that in the above example, the index 'b' was duplicated, so s['b'] returns pandas.core.series.Series.

Series is also like a fixed-size dict on which you can get and set values by index label. If a label is not present when reading a value, a KeyError exception is raised. Using the get method, a missing label will return None or a specified default.

SERIES NAME ATTRIBUTE

Series can also have a name attribute (like DataFrame can have a columns attribute). This is important, as a DataFrame can be seen as a dict of Series objects.

The Series name will be assigned automatically in many cases, in particular when taking 1D slices of a DataFrame.

For example,

d = {'one' : [1., 2., 3., 4.], 'two' : [4., 3., 2., 1.]}
d = pd.DataFrame(d)
d
# one two
# 0 1.0 4.0
# 1 2.0 3.0
# 2 3.0 2.0
# 3 4.0 1.0
type(d) #=> pandas.core.frame.DataFrame
d.columns #=> Index(['one', 'two'], dtype='object')
d.index #=> RangeIndex(start=0, stop=4, step=1)

s = d['one']
s
# 0 1.0
# 1 2.0
# 2 3.0
# 3 4.0
# Name: one, dtype: float64
type(s) #=> pandas.core.series.Series
s.name #=> 'one'
s.index #=> RangeIndex(start=0, stop=4, step=1)

You can rename a Series with the pandas.Series.rename() method or by assigning a new name to the name attribute.

s = pd.Series(np.random.randn(5), name='something')
id(s) #=> 4331187280
s.name #=> 'something'

s.name = 'new_name'
id(s) #=> 4331187280
s.name #=> 'new_name'

# note: without inplace=True, rename() returns a renamed copy instead
s.rename("yet_another_name", inplace=True)
id(s) #=> 4331187280
s.name #=> 'yet_another_name'

COMPLEX TRANSFORMATIONS ON SERIES USING SERIES.APPLY

NumPy is a popular toolkit for scientific computing. Pandas’ Series can be used as argument to most NumPy functions.

For complex single-column transformations, you can use Series.apply. Like Python’s map function, Series.apply accepts as an argument a lambda function, which is applied to each value.
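
A short sketch of Series.apply (the population numbers are made up):

```python
import pandas as pd

population = pd.Series([852469, 1015785, 485199])

# apply an element-wise lambda, like Python's built-in map
big_city = population.apply(lambda val: val > 1000000)
```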

DataFrame (2D data structure: Table)

Refer: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.html

DataFrame is a 2D labeled data structure with columns of potentially different types. You can think of it like a spreadsheet or a SQL table, or a dict of Series objects. Like Series, DataFrame accepts many different kinds of inputs.

Along with the data, you can optionally pass index (row labels) and columns (column labels) arguments. Note that the index can have non-unique elements (like that of Series). Similarly, column names can also be non-unique.

If you pass an index and/or columns, you are guaranteeing the index and/or columns of the resulting DataFrame. Thus, a dict of Series plus a specific index will discard all data not matching the passed index (similar to passing a dict as data to Series).

If axis labels (index) are not passed, they will be constructed from the input data based on common sense rules.

CREATING DATAFRAME

From a dict of ndarrays/lists:

The ndarrays/lists must all be the same length. If an index is passed, it must also be the same length as the data ndarrays. If no index is passed, the implicit index will be range(n), where n is the array length.

For example,
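
A minimal sketch with an illustrative dict of lists:

```python
import pandas as pd

d = {'one': [1., 2., 3., 4.], 'two': [4., 3., 2., 1.]}

df_default = pd.DataFrame(d)                              # implicit RangeIndex 0..3
df_labeled = pd.DataFrame(d, index=['a', 'b', 'c', 'd'])  # index must match the data length
```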

From a dict of Series (preferred way):

The resulting index will be the union of the indexes of the various Series (each Series may be of a different length and may have a different index). If there are nested dicts, they will first be converted to Series. If no columns are passed, the columns will be the sorted list of dict keys.

The row and column labels can be accessed respectively by accessing the index and columns attributes.
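A small sketch of the union behavior and the two attributes (the values are illustrative):

```python
import pandas as pd

d = {
    'one': pd.Series([1., 2., 3.], index=['a', 'b', 'c']),
    'two': pd.Series([1., 2., 3., 4.], index=['a', 'b', 'c', 'd']),
}
df = pd.DataFrame(d)
# the resulting index is the union of the Series indexes; 'one' has no
# value at 'd', so that cell becomes NaN

df.index    # row labels: Index(['a', 'b', 'c', 'd'], dtype='object')
df.columns  # column labels: Index(['one', 'two'], dtype='object')
```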

From a list of dicts:

For example,

data2 = [{'a': 1, 'b': 2}, {'a': 5, 'b': 10, 'c': 20}]

pd.DataFrame(data2)
#    a   b     c
# 0  1   2   NaN
# 1  5  10  20.0

pd.DataFrame(data2, index=['first', 'second'])
#         a   b     c
# first   1   2   NaN
# second  5  10  20.0

pd.DataFrame(data2, columns=['a', 'b'])
#    a   b
# 0  1   2
# 1  5  10

From a Series:

The result will be a DataFrame with the same index as the input Series, and with one column whose name is the original name of the Series (only if no other column name is provided).

For example,

s = pd.Series([1., 2., 3.], index=['a', 'b', 'c'])
type(s) #=> pandas.core.series.Series

df2 = pd.DataFrame(s)
df2
#      0
# a  1.0
# b  2.0
# c  3.0

type(df2) #=> pandas.core.frame.DataFrame
df2.columns #=> RangeIndex(start=0, stop=1, step=1)
df2.index #=> Index(['a', 'b', 'c'], dtype='object')

From a Flat File

Using pandas.read_csv (preferred way):

You can read CSV files into a DataFrame using pandas.read_csv() method. Refer to the official docs for its signature.

For example,
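
A minimal sketch; an in-memory StringIO buffer stands in here for a CSV file on disk, and the column names are made up:

```python
import io
import pandas as pd

# pd.read_csv accepts a file path or any file-like object
csv_data = io.StringIO("city,population\nSacramento,485199\nSan Francisco,852469\n")
df = pd.read_csv(csv_data)
# column names come from the header row; the index defaults to 0..n-1
```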

CONSOLE DISPLAY AND SUMMARY

Some helpful methods and attributes:
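
A short sketch of the usual inspection helpers, on a made-up frame:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(12).reshape(4, 3), columns=['a', 'b', 'c'])

df.head(2)              # first n rows (default 5)
df.tail(2)              # last n rows
stats = df.describe()   # count/mean/std/min/quartiles/max per numeric column
arr = df.values         # plain ndarray, axis labels stripped
```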

Wide DataFrames will be printed across multiple rows by default. You can change how much to print on a single row by setting the display.width option. You can adjust the max width of the individual columns by setting display.max_colwidth.

pd.set_option('display.width', 40)     # default is 80
pd.set_option('display.max_colwidth', 30)

You can also disable this feature via the expand_frame_repr option. This will print the table in one block.

INDEXING ROWS AND SELECTING COLUMNS

The basics of DataFrame indexing and selecting are as follows:

Operation                        Syntax            Result
Select column                    df[col]           Series
Select columns                   df[[col1, col2]]  DataFrame
Select row by label              df.loc[label]     Series
Select row by integer location   df.iloc[loc]      Series
Slice rows                       df[5:10]          DataFrame
Select rows by boolean vector    df[bool_vec]      DataFrame

For example,

d = {
    'one' : pd.Series([1., 2., 3., 4.], index=['a', 'b', 'c', 'a']),
    'two' : pd.Series(['A', 'B', 'C', 'D'], index=['a', 'b', 'c', 'a'])
}
df = pd.DataFrame(d)
df
#    one two
# a  1.0   A
# b  2.0   B
# c  3.0   C
# a  4.0   D

type(df['one']) #=> pandas.core.series.Series
df['one']
# a    1.0
# b    2.0
# c    3.0
# a    4.0
# Name: one, dtype: float64

type(df[['one']]) #=> pandas.core.frame.DataFrame
df[['one']]
#    one
# a  1.0
# b  2.0
# c  3.0
# a  4.0

type(df[['one', 'two']]) #=> pandas.core.frame.DataFrame
df[['one', 'two']]
#    one two
# a  1.0   A
# b  2.0   B
# c  3.0   C
# a  4.0   D

type(df.loc['a']) #=> pandas.core.frame.DataFrame (duplicate label 'a')
df.loc['a']
#    one two
# a  1.0   A
# a  4.0   D

type(df.loc['b']) #=> pandas.core.series.Series
df.loc['b']
# one    2.0
# two      B
# Name: b, dtype: object

type(df.loc[['a', 'c']]) #=> pandas.core.frame.DataFrame
df.loc[['a', 'c']]
#    one two
# a  1.0   A
# a  4.0   D
# c  3.0   C

type(df.iloc[0]) #=> pandas.core.series.Series
df.iloc[0]
# one    1.0
# two      A
# Name: a, dtype: object

df.iloc[1:3]
#    one two
# b  2.0   B
# c  3.0   C

df.iloc[[1, 2]]
#    one two
# b  2.0   B
# c  3.0   C

df.iloc[[1, 0, 1, 0]]
#    one two
# b  2.0   B
# a  1.0   A
# b  2.0   B
# a  1.0   A

df.iloc[[True, False, True, False]]
#    one two
# a  1.0   A
# c  3.0   C

COLUMN ADDITION AND DELETION

You can treat a DataFrame semantically like a dict of like-indexed Series objects. Getting, setting, and deleting columns works with the same syntax as the analogous dict operations.

When inserting a Series that does not have the same index as the DataFrame, it will be conformed to the DataFrame’s index: only the values whose index labels match the DataFrame’s existing index are added, and missing labels get NaN as the value.

When inserting a column with a scalar value, it will naturally be propagated to fill the column.

When you insert an ndarray or list of the same length as the DataFrame, it just uses the existing index of the DataFrame. But try not to use ndarrays or lists directly with DataFrames; instead, you can first convert them to Series as follows:

df['yet_another_col'] = array_of_same_length_as_df
# is same as
df['yet_another_col'] = pd.Series(array_of_same_length_as_df, index=df.index)

For example,
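
A small self-contained sketch of adding and deleting columns (the frame and column names are made up):

```python
import pandas as pd

df = pd.DataFrame({'one': pd.Series([1., 2., 3.], index=['a', 'b', 'c'])})

df['flag'] = df['one'] > 2                      # derived column
df['const'] = 5                                 # scalar: propagated to every row
df['partial'] = pd.Series([10.], index=['a'])   # conformed to df's index -> NaN for b, c
df.insert(1, 'bar', df['one'])                  # insert at a specific position
del df['const']                                 # delete like a dict key
```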

By default, columns get inserted at the end. The insert() method is available to insert at a particular location in the columns.

Columns can be deleted using del, like keys of a dict.

DATA ALIGNMENT AND ARITHMETIC

Arithmetic between DataFrame objects:

Data between DataFrame objects automatically align on both the columns and the index (row labels). Again, the resulting object will have the union of the column and row labels. For example,
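
A minimal sketch of this alignment, using two small made-up frames:

```python
import pandas as pd

df1 = pd.DataFrame({'A': [1., 2.], 'B': [3., 4.]}, index=[0, 1])
df2 = pd.DataFrame({'B': [10., 20.], 'C': [30., 40.]}, index=[1, 2])

total = df1 + df2
# the result has the union of columns (A, B, C) and rows (0, 1, 2);
# any cell missing on either side comes out as NaN
```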

Important: You might like to try the above example with duplicate column names and index values in each individual data frame.

Boolean operators (for example, df1 & df2) work as well.

Arithmetic between DataFrame and Series:

When doing an operation between a DataFrame and a Series, the default behavior is to broadcast the Series row-wise to match rows in the DataFrame, and then the arithmetic is performed. For example,
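
A short sketch of row-wise broadcasting (the values are illustrative):

```python
import pandas as pd

df = pd.DataFrame({'A': [1., 2., 3.], 'B': [10., 20., 30.]})

diff = df - df.iloc[0]
# df.iloc[0] is a Series; it is matched on column labels and subtracted
# from every row of the DataFrame
```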

In the special case of working with time series data, where the DataFrame index also contains dates, the broadcasting will be column-wise.

Here pd.date_range() is used to create fixed frequency DatetimeIndex, which is then used as index (rather than default index of 0, 1, 2, ...) for a DataFrame.

For explicit control over the matching and broadcasting behavior, see the section on flexible binary operations.

Arithmetic between DataFrame and scalars:

Operations with scalars are just as you would expect: broadcasted to each cell (that is, to all columns and rows).

DATAFRAME METHODS AND FUNCTIONS

Evaluating strings describing operations using the eval() method

Note: Prefer the assign() method.

The eval() evaluates a string describing operations on DataFrame columns. It operates on columns only, not specific rows or elements. This allows eval() to run arbitrary code, which can make you vulnerable to code injection if you pass user input into this function.

df = pd.DataFrame({'A': range(1, 6), 'B': range(10, 0, -2)})
df
#    A   B
# 0  1  10
# 1  2   8
# 2  3   6
# 3  4   4
# 4  5   2

df.eval('2*A + B')
# 0    12
# 1    12
# 2    12
# 3    12
# 4    12
# dtype: int64

Assignment is allowed, though by default the original DataFrame is not modified. Use inplace=True to modify the original DataFrame. For example,

df.eval('C = A + 2*B', inplace=True)
df
#    A   B   C
# 0  1  10  21
# 1  2   8  18
# 2  3   6  15
# 3  4   4  12
# 4  5   2   9

Assigning new columns to the copies in method chains — assign() method

Inspired by dplyr’s mutate verb, DataFrame has an assign() method that allows you to easily create new columns that are potentially derived from existing columns.

The assign() method always returns a copy of data, leaving the original DataFrame untouched.

Note: Also check pipe() method.

df2 = df.assign(one_ratio = df['one']/df['out_of'])
df2
#    one  two  one_trunc  out_of  const  one_ratio
# a  1.0  1.0        1.0     100      1       0.01
# b  2.0  2.0        2.0     100      1       0.02
# c  3.0  3.0        NaN     100      1       0.03
# d  NaN  4.0        NaN     100      1        NaN

id(df) #=> 4436438040
id(df2) #=> 4566906360

Above was an example of inserting a precomputed value. We can also pass in a function of one argument to be evaluated on the DataFrame being assigned to.

df3 = df.assign(one_ratio = lambda x: (x['one']/x['out_of']))
df3
#    one  two  one_trunc  out_of  const  one_ratio
# a  1.0  1.0        1.0     100      1       0.01
# b  2.0  2.0        2.0     100      1       0.02
# c  3.0  3.0        NaN     100      1       0.03
# d  NaN  4.0        NaN     100      1        NaN

id(df) #=> 4436438040
id(df3) #=> 4514692848

This way, you can remove a dependency by not having to use the name of the DataFrame.

Appending rows with append() method

The append() method appends the rows of other_data_frame to the end of the current DataFrame, returning a new object. Columns not in the current DataFrame are added as new columns.

Its most useful syntax is:

<data_frame>.append(other_data_frame, ignore_index=False)

Here,

  • other_data_frame: Data to be appended, in the form of a DataFrame, a Series/dict-like object, or a list of these.

  • ignore_index: By default it is False. If it is True, the index labels of other_data_frame are ignored.

Note: Also check the concat() function. (append() was deprecated in pandas 1.4 and removed in pandas 2.0; concat() is its replacement.)

For example,

df = pd.DataFrame([[1, 2], [3, 4]], columns=list('AB'))
df
# A B
# 0 1 2
# 1 3 4
df2 = pd.DataFrame([[5, 6], [7, 8]], columns=list('AB'))
df2
# A B
# 0 5 6
# 1 7 8
df.append(df2)
# A B
# 0 1 2
# 1 3 4
# 0 5 6
# 1 7 8
df.append(df2, ignore_index=True)
# A B
# 0 1 2
# 1 3 4
# 2 5 6
# 3 7 8

The drop() method

Note: Prefer del as stated in the Column Addition and Deletion section, and indexing + re-assignment for keeping specific rows.

The drop() function removes rows or columns by specifying label names and corresponding axis, or by specifying directly index or column names. When using a multi-index, labels on different levels can be removed by specifying the level.

The values attribute and copy() method

The values attribute

The values attribute returns a NumPy representation of a DataFrame’s data. Only the values in the DataFrame are returned; the axis labels are removed. A DataFrame with mixed-type columns (e.g. str/object, int64, float32) results in an ndarray of the broadest type that accommodates these mixed types.

Check Console Display section for an example.

The copy() method

The copy() method makes a copy of the DataFrame object’s indices and data, as by default deep is True. So, modifications to the data or indices of the copy will not be reflected in the original object.

If deep=False, a new object will be created without copying the calling object’s data or index (only references to the data and index are copied). Any changes to the data of the original will be reflected in the shallow copy (and vice versa).

Its syntax is:

df.copy(deep=True)

Transposing using the T attribute or the transpose() method

Refer to the Arithmetic, matrix multiplication, and comparison operations section.

To transpose, you can call the transpose() method, or you can use the attribute T, which is an accessor to the transpose() method.

The result is a DataFrame reflecting the original DataFrame over its main diagonal, writing rows as columns and vice versa. Transposing a DataFrame with mixed dtypes results in a homogeneous DataFrame with the object dtype. In such a case, a copy of the data is always made.

For example,
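
A small sketch with a made-up mixed-dtype frame:

```python
import pandas as pd

df = pd.DataFrame({'one': [1., 2.], 'two': ['A', 'B']}, index=['a', 'b'])

dft = df.T  # same as df.transpose()
# rows become columns: dft has index ['one', 'two'] and columns ['a', 'b'];
# the mixed dtypes collapse to a single object dtype
```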

Sorting (sort_values(), sort_index()), Grouping (groupby()), and Filtering (filter())

The sort_values() method

A DataFrame can be sorted by a column (or by multiple columns) using the sort_values() method.

For example,
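
A short sketch on a made-up frame (note how the index labels travel with their rows):

```python
import pandas as pd

df = pd.DataFrame({'col1': ['A', 'A', 'B'], 'col2': [2, 1, 9]}, index=[2, 0, 1])

by_vals = df.sort_values(by=['col1', 'col2'])      # sort by several columns
desc = df.sort_values(by='col2', ascending=False)  # descending order
```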

The sort_index() method

The sort_index() method can be used to sort by index.

For example,
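
A minimal sketch on a made-up frame with out-of-order index labels:

```python
import pandas as pd

df = pd.DataFrame({'val': [10, 20, 30]}, index=['c', 'a', 'b'])

sorted_df = df.sort_index()                 # rows reordered by index label
desc_df = df.sort_index(ascending=False)    # reverse label order
```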

The groupby() method

The groupby() method is used to group by a function, label, or a list of labels.

For example,
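
A short sketch of grouping by a column label and aggregating (the data is made up):

```python
import pandas as pd

df = pd.DataFrame({
    'team': ['x', 'y', 'x', 'y'],
    'points': [1, 2, 3, 4],
})

sums = df.groupby('team').sum()
# one row per group label: x -> 1 + 3, y -> 2 + 4
```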

The filter() method

The filter() method returns subset of rows or columns of DataFrame according to labels in the specified index. Note that this method does not filter a DataFrame on its contents, the filter is applied to the labels of the index, or to the column names.

You can use the items, like, and regex parameters, but note that they are enforced to be mutually exclusive. The parameter axis defaults to the info axis that is used when indexing with [].

For example,
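
A small sketch of the three mutually exclusive parameters (the frame and its 'mouse'/'rabbit' labels are made up):

```python
import pandas as pd

df = pd.DataFrame({'one': [1, 2], 'two': [3, 4], 'three': [5, 6]},
                  index=['mouse', 'rabbit'])

a = df.filter(items=['one', 'three'])   # keep columns by exact name
b = df.filter(regex='e$', axis=1)       # column names ending in 'e'
c = df.filter(like='bbi', axis=0)       # row labels containing 'bbi'
```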

Melting and Pivoting using melt() and pivot() methods

The idea of melt() is to keep a few given columns as id-columns and convert the rest of the columns (called variable-columns) into variable and value, where variable tells you the original column name and value is the corresponding value in the original column.

If n variable-columns are melted, the information from each row of the original format is spread across n rows.

The idea of pivot() is to do just the reverse.

For example,
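
A round-trip sketch on a made-up wide frame with one id-column and two variable-columns:

```python
import pandas as pd

wide = pd.DataFrame({'name': ['tom', 'ann'], 'height': [180, 165], 'weight': [80, 60]})

long = wide.melt(id_vars=['name'], value_vars=['height', 'weight'])
# 'name' is kept as the id-column; the 2 variable-columns turn each
# original row into 2 rows of (variable, value) pairs

back = long.pivot(index='name', columns='variable', values='value')
# pivot() reverses the reshaping
```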

Piping (chaining) Functions using pipe() method

Suppose you want to apply a function to a DataFrame, Series, or groupby object, then apply another function to its output, and so on. One way would be to perform this operation in a “sandwich”-like fashion:

Note: Also check assign() method.

df = foo3(foo2(foo1(df, arg1=1), arg2=2), arg3=3)

In the long run, this notation becomes fairly messy. What you want to do here is use pipe(). Pipe can be thought of as function chaining. This is how you would perform the same task as before with pipe():

(df.pipe(foo1, arg1=1)
   .pipe(foo2, arg2=2)
   .pipe(foo3, arg3=3))

This is a cleaner way that helps keep track of the order in which the functions and their corresponding arguments are applied.

Rolling Windows using rolling() method

Use DataFrame.rolling() for rolling window calculation.
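
A minimal sketch (the Series and window size are illustrative; rolling() works the same on DataFrame columns):

```python
import pandas as pd

s = pd.Series([1, 2, 3, 4, 5])

windowed = s.rolling(window=2).sum()
# each entry is the sum of the current and previous value; the first
# entry is NaN because a full window of 2 is not yet available
```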

Other DataFrame Methods

Refer Methods section in pd.DataFrame.

Refer Computational Tools User Guide.

Refer the categorical listing at Pandas API.

APPLYING FUNCTIONS

The apply() method: apply on columns/rows

The apply() method applies the given function along an axis of the DataFrame (by default on columns).

Its most useful form is:

df.apply(func, axis=0, args=(), **kwds)

Here:

  • func: The function to apply to each column or row. Note that it can be a element-wise function (in which case axis=0 or axis=1 doesn’t make any difference) or an aggregate function.

  • axis: Its value can be 0 (default, column) or 1. 0 means applying the function to each column, and 1 means applying it to each row. Note that this axis is similar to how axes are defined in NumPy: for a 2D ndarray, 0 means columns.

  • args: It is a tuple and represents the positional arguments to pass to func in addition to the array/series.

  • **kwds: It represents the additional keyword arguments to pass as keywords arguments to func.

It returns a Series or a DataFrame.

For example,
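
A sketch covering both axes and the args parameter (the frame is made up):

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3], 'b': [10, 20, 30]})

col_max = df.apply(lambda col: col.max())            # aggregate per column (axis=0)
row_sum = df.apply(lambda row: row.sum(), axis=1)    # aggregate per row (axis=1)
shifted = df.apply(lambda col, off: col + off, args=(100,))  # extra positional arg
```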

The applymap() method: apply element-wise

The applymap() method applies the given function element-wise. So, the given func must accept and return a scalar, and it is applied to every element of a DataFrame.

Its general syntax is:

df.applymap(func)

For example,
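
A minimal sketch (the frame is made up; in recent pandas releases applymap is deprecated in favor of DataFrame.map, but it behaves the same):

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2], 'b': [3, 4]})

squared = df.applymap(lambda x: x ** 2)   # element-wise
vectorized = df ** 2                      # same result, usually much faster
```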

When you need to apply a function element-wise, you might like to check first if a vectorized version is available. A vectorized version of func often exists and will be much faster. You could square each number element-wise using df.applymap(lambda x: x**2), but the vectorized version df**2 is better.

WORKING WITH MISSING DATA

Refer SciKit-Learn’s Data Cleaning section.

Refer Missing Data Guide and API Reference for Missing Data Handling: dropna, fillna, replace, interpolate.

Also check Data Cleaning section of The tf.feature_column API on other options.

Also go through https://www.analyticsvidhya.com/blog/2016/01/12-pandas-techniques-python-data-manipulation/

NORMALIZING DATA

One way is to compute df / df.iloc[0], which is particularly useful when analyzing stock prices over a period of time for multiple companies.
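
A small sketch with made-up prices for two hypothetical tickers:

```python
import pandas as pd

prices = pd.DataFrame({'AAA': [10., 11., 12.], 'BBB': [100., 90., 120.]})

normalized = prices / prices.iloc[0]
# every column now starts at 1.0, so relative growth is directly comparable
```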

THE CONCAT() FUNCTION

The concat() function performs concatenation operations along an axis while performing optional set logic (union or intersection) of the indexes (if any) on the other axis.

The default axis of concatenation is axis=0, but you can choose to concatenate data frames sideways by choosing axis=1.

Note: Also check append() method.

For example,

df1 = pd.DataFrame(
    {
        'A': ['A0', 'A1', 'A2', 'A3'],
        'B': ['B0', 'B1', 'B2', 'B3'],
        'C': ['C0', 'C1', 'C2', 'C3'],
        'D': ['D0', 'D1', 'D2', 'D3']
    }, index=[0, 1, 2, 3]
)

df2 = pd.DataFrame(
    {
        'A': ['A4', 'A5', 'A6', 'A7'],
        'B': ['B4', 'B5', 'B6', 'B7'],
        'C': ['C4', 'C5', 'C6', 'C7'],
        'D': ['D4', 'D5', 'D6', 'D7']
    }, index=[4, 5, 6, 7]
)

df3 = pd.DataFrame(
    {
        'A': ['A8', 'A9', 'A10', 'A11'],
        'B': ['B8', 'B9', 'B10', 'B11'],
        'C': ['C8', 'C9', 'C10', 'C11'],
        'D': ['D8', 'D9', 'D10', 'D11']
    }, index=[1, 2, 3, 4]
)

frames = [df1, df2, df3]

df4 = pd.concat(frames)
df4
#      A    B    C    D
# 0   A0   B0   C0   D0
# 1   A1   B1   C1   D1
# 2   A2   B2   C2   D2
# 3   A3   B3   C3   D3
# 4   A4   B4   C4   D4
# 5   A5   B5   C5   D5
# 6   A6   B6   C6   D6
# 7   A7   B7   C7   D7
# 1   A8   B8   C8   D8
# 2   A9   B9   C9   D9
# 3  A10  B10  C10  D10
# 4  A11  B11  C11  D11

df5 = pd.concat(frames, ignore_index=True)
df5
#       A    B    C    D
# 0    A0   B0   C0   D0
# 1    A1   B1   C1   D1
# 2    A2   B2   C2   D2
# 3    A3   B3   C3   D3
# 4    A4   B4   C4   D4
# 5    A5   B5   C5   D5
# 6    A6   B6   C6   D6
# 7    A7   B7   C7   D7
# 8    A8   B8   C8   D8
# 9    A9   B9   C9   D9
# 10  A10  B10  C10  D10
# 11  A11  B11  C11  D11

df5 = pd.concat(frames, keys=['s1', 's2', 's3'])
df5
#         A    B    C    D
# s1 0   A0   B0   C0   D0
#    1   A1   B1   C1   D1
#    2   A2   B2   C2   D2
#    3   A3   B3   C3   D3
# s2 4   A4   B4   C4   D4
#    5   A5   B5   C5   D5
#    6   A6   B6   C6   D6
#    7   A7   B7   C7   D7
# s3 1   A8   B8   C8   D8
#    2   A9   B9   C9   D9
#    3  A10  B10  C10  D10
#    4  A11  B11  C11  D11

df5.index
# MultiIndex(levels=[['s1', 's2', 's3'], [0, 1, 2, 3, 4, 5, 6, 7]],
#            labels=[[0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 2], [0, 1, 2, 3, 4, 5, 6, 7, 1, 2, 3, 4]])

Like its sibling function on ndarrays, numpy.concatenate(), pandas.concat() takes a list or dict of homogeneously-typed objects and concatenates them with some configurable handling of “what to do with the other axes”.

MERGING AND JOINING USING MERGE() AND JOIN() FUNCTIONS

Refer to the Merge, Join, and Concatenate official guide.

The merge() function merges DataFrame objects by performing a database-style join operation by columns or indexes.

The join() function joins columns with another DataFrame, either on the index or on a key column.

BINARY DUMMY VARIABLES FOR CATEGORICAL VARIABLES USING GET_DUMMIES() FUNCTION

Converting a categorical variable into a “dummy” DataFrame can be done using get_dummies():

df = pd.DataFrame({'char': list('bbacab'), 'data1': range(6)})
df
#   char  data1
# 0    b      0
# 1    b      1
# 2    a      2
# 3    c      3
# 4    a      4
# 5    b      5

dummies = pd.get_dummies(df['char'], prefix='key')
dummies
#    key_a  key_b  key_c
# 0      0      1      0
# 1      0      1      0
# 2      1      0      0
# 3      0      0      1
# 4      1      0      0
# 5      0      1      0

PLOTTING DATAFRAME USING PLOT() FUNCTION

The plot() function makes plots of a DataFrame using matplotlib/pylab.

Panel (3D data structure)

Panel is a container for 3D data. The term panel data is derived from econometrics and is partially responsible for the name: pan(el)-da(ta)-s.


The 3D structure of a Panel is much less common for many types of data analysis than the 1D Series or the 2D DataFrame. Oftentimes, one can simply use a multi-index DataFrame to work with higher-dimensional data. Refer to Deprecate Panel.

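A minimal sketch of representing 3D (item × date × field) data with a MultiIndex instead of a Panel (the item names, dates, and columns here are hypothetical):

```python
import numpy as np
import pandas as pd

# Two "panel items", each observed on three dates, with two fields
idx = pd.MultiIndex.from_product(
    [['item1', 'item2'], pd.date_range('2020-01-01', periods=3)],
    names=['item', 'date'])
df = pd.DataFrame(np.arange(12).reshape(6, 2),
                  index=idx, columns=['open', 'close'])

# Select one "panel item" with .xs(), recovering an ordinary 2D frame
item1 = df.xs('item1', level='item')
```

Cross-sections via .xs() (or .loc with an index tuple) give back 2D frames, which is how the old Panel item access maps onto a MultiIndex DataFrame.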

Here are some related interesting stories that you might find helpful:


  • Fluent NumPy


  • Distributed Data Processing with Apache Spark


  • Apache Cassandra — Distributed Row-Partitioned Database for Structured and Semi-Structured Data


  • The Why and How of MapReduce


Translated from: https://medium.com/analytics-vidhya/fluent-pandas-22473fa3c30d


