Fluent Pandas
Let's uncover the practical details of Pandas' Series, DataFrame, and Panel.

Note to the readers: Paying attention to the comments in the examples would be more helpful than going through the theory itself.

· Series (1D data structure: Column-vector of DataTable)
· DataFrame (2D data structure: Table)
· Panel (3D data structure)

Pandas is a column-oriented data analysis API. It's a great tool for handling and analyzing input data, and many ML frameworks support pandas data structures as inputs.
Pandas Data Structures
Refer Intro to Data Structures on Pandas docs.

The primary data structures in pandas are implemented as two classes: DataFrame and Series.
Import numpy and pandas into your namespace:

import numpy as np
import pandas as pd
import matplotlib as mpl

np.__version__
pd.__version__
mpl.__version__
Series (1D data structure: Column-vector of DataTable)
CREATING SERIES

Series is a one-dimensional array having elements with non-unique labels (index), and is capable of holding any data type. The axis labels are collectively referred to as the index. The general way to create a Series is to call:
pd.Series(data, index=index)
Here, data can be a NumPy ndarray, a Python dict, or a scalar value (like 5). The passed index is a list of axis labels.

Note: pandas supports non-unique index values. If an operation that does not support duplicate index values is attempted, an exception will be raised at that time.
If data is a list or ndarray (preferred way):
If data is an ndarray or list, then index must be of the same length as data. If no index is passed, one will be created having values [0, 1, 2, ..., len(data) - 1].
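To make this concrete, here is a minimal sketch of both cases (the labels and values are made up for illustration):

```python
import numpy as np
import pandas as pd

# From a list with an explicit index (labels need not be unique):
s = pd.Series([10, 20, 30], index=['a', 'b', 'a'])

# From an ndarray with no index passed: labels default to 0 .. len(data)-1
s2 = pd.Series(np.arange(3.0))
```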
If data is a scalar value:

If data is a scalar value, an index must be provided. The value will be repeated to match the length of index.
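A quick sketch of the scalar case (the labels are illustrative):

```python
import pandas as pd

# The scalar 5.0 is repeated once per index label:
s = pd.Series(5.0, index=['a', 'b', 'c'])
s
# a    5.0
# b    5.0
# c    5.0
# dtype: float64
```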
If data is a dict:

If data is a dict, and
- if index is passed, the values in data corresponding to the labels in the index will be pulled out; otherwise,
- an index will be constructed from the sorted keys of the dict, if possible.
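A sketch of both dict cases (note that recent pandas versions build the index from the dict's insertion order rather than sorting the keys):

```python
import pandas as pd

d = {'b': 1, 'a': 0, 'c': 2}

# No index passed: the index is built from the dict's keys
s = pd.Series(d)

# Index passed: values are pulled out by label; labels missing from the dict get NaN
s2 = pd.Series(d, index=['b', 'c', 'd', 'a'])
```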
SERIES IS LIKE NDARRAY AND DICT COMBINED
Series acts very similarly to an ndarray, and is a valid argument to most NumPy functions. However, operations like slicing also slice the index. Series can be passed to most NumPy methods expecting a 1D ndarray.

A key difference between Series and ndarray is the automatic alignment of the data based on labels during Series operations. Thus, you can write computations without giving consideration to whether the Series objects involved have some non-unique labels.
The result of an operation between unaligned Series will have the union of the indexes involved. If a label is not found in one series or the other, the result will be marked as missing (NaN).
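A small sketch of both behaviors, using a Series with the duplicated label 'b' (values are illustrative):

```python
import pandas as pd

s = pd.Series([1.0, 2.0, 3.0, 4.0], index=['a', 'b', 'b', 'c'])
t = pd.Series([10.0, 20.0, 30.0], index=['a', 'b', 'd'])

# Label lookup on a duplicated label returns a Series, not a scalar:
s['b']
# b    2.0
# b    3.0
# dtype: float64

# Arithmetic aligns on labels; 'c' and 'd' exist on one side only, so they become NaN:
result = s + t
```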
Also note that when an index label such as 'b' is duplicated, s['b'] returns a pandas.core.series.Series rather than a scalar.
Series is also like a fixed-size dict on which you can get and set values by index label. If a label is not contained while reading a value, a KeyError exception is raised. Using the get method, a missing label will return None or a specified default.
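For instance:

```python
import pandas as pd

s = pd.Series({'a': 1, 'b': 2})

s['a']        #=> 1
s.get('z')    #=> None
s.get('z', 0) #=> 0 (the supplied default)
# s['z'] would raise a KeyError
```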
SERIES' NAME ATTRIBUTE
Series can also have a name attribute (like DataFrame can have a columns attribute). This is important as a DataFrame can be seen as a dict of Series objects.

The Series' name will be assigned automatically in many cases, in particular when taking 1D slices of a DataFrame.
For example,
d = {'one' : [1., 2., 3., 4.], 'two' : [4., 3., 2., 1.]}
d = pd.DataFrame(d)
d
#    one  two
# 0  1.0  4.0
# 1  2.0  3.0
# 2  3.0  2.0
# 3  4.0  1.0

type(d)   #=> pandas.core.frame.DataFrame
d.columns #=> Index(['one', 'two'], dtype='object')
d.index   #=> RangeIndex(start=0, stop=4, step=1)

s = d['one']
s
# 0    1.0
# 1    2.0
# 2    3.0
# 3    4.0
# Name: one, dtype: float64

type(s) #=> pandas.core.series.Series
s.name  #=> 'one'
s.index #=> RangeIndex(start=0, stop=4, step=1)
You can rename a Series with the pandas.Series.rename() method or by just assigning a new name to the name attribute.
s = pd.Series(np.random.randn(5), name='something')
id(s)  #=> 4331187280
s.name #=> 'something'

s.name = 'new_name'
id(s)  #=> 4331187280 (same object)
s.name #=> 'new_name'

# rename() returns a new Series unless inplace=True is passed
s.rename("yet_another_name", inplace=True)
id(s)  #=> 4331187280
s.name #=> 'yet_another_name'
COMPLEX TRANSFORMATIONS ON SERIES USING SERIES.APPLY
NumPy is a popular toolkit for scientific computing. Pandas' Series can be used as an argument to most NumPy functions.

For complex single-column transformations, you can use Series.apply. Like Python's map function, Series.apply accepts as an argument a lambda function which is applied to each value.
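A minimal sketch of Series.apply with a lambda:

```python
import pandas as pd

s = pd.Series([1, 2, 3, 4])
squared = s.apply(lambda x: x ** 2)
# 0     1
# 1     4
# 2     9
# 3    16
# dtype: int64
```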
DataFrame (2D data structure: Table)

Refer: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.html
DataFrame is a 2D labeled data structure with columns of potentially different types. You can think of it like a spreadsheet or a SQL table, or a dict of Series objects. Like Series, DataFrame accepts many different kinds of inputs.
Along with the data, you can optionally pass index (row labels) and columns (column labels) arguments. Note that index can have non-unique elements (like that of Series). Similarly, column names can also be non-unique.
If you pass an index and/or columns, you are guaranteeing the index and/or columns of the resulting DataFrame. Thus, a dict of Series plus a specific index will discard all data not matching up to the passed index (similar to passing a dict as data to Series).
If axis labels (index) are not passed, they will be constructed from the input data based on common sense rules.
CREATING DATAFRAME

From a dict of ndarrays/lists:
The ndarrays/lists must all be the same length. If an index is passed, it must clearly also be the same length as that of the data ndarrays. If no index is passed, the implicit index will be range(n), where n is the array length.
For example,
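A sketch with made-up values:

```python
import pandas as pd

d = {'one': [1., 2., 3., 4.], 'two': [4., 3., 2., 1.]}

df1 = pd.DataFrame(d)                              # implicit index: 0, 1, 2, 3
df2 = pd.DataFrame(d, index=['a', 'b', 'c', 'd'])  # explicit index of the same length
```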
From a dict of Series (preferred way):
The resultant index will be the union of the indexes of the various Series (each Series may be of a different length and may have a different index). If there are nested dicts, these will first be converted to Series. If no columns are passed, the columns will be the list of dict keys. For example,
The row and column labels can be accessed respectively by accessing the index and columns attributes.
From a list of dicts:

For example,
data2 = [{'a': 1, 'b': 2}, {'a': 5, 'b': 10, 'c': 20}]

pd.DataFrame(data2)
#    a   b     c
# 0  1   2   NaN
# 1  5  10  20.0

pd.DataFrame(data2, index=['first', 'second'])
#         a   b     c
# first   1   2   NaN
# second  5  10  20.0

pd.DataFrame(data2, columns=['a', 'b'])
#    a   b
# 0  1   2
# 1  5  10
From a Series:

The result will be a DataFrame with the same index as the input Series, and with one column whose name is the original name of the Series (only if no other column name is provided).
For example,
s = pd.Series([1., 2., 3.], index=['a', 'b', 'c'])
type(s) #=> pandas.core.series.Series

df2 = pd.DataFrame(s)
df2
#      0
# a  1.0
# b  2.0
# c  3.0

type(df2)   #=> pandas.core.frame.DataFrame
df2.columns #=> RangeIndex(start=0, stop=1, step=1)
df2.index   #=> Index(['a', 'b', 'c'], dtype='object')
From a Flat File

The pandas.read_csv (preferred way):

You can read CSV files into a DataFrame using the pandas.read_csv() method. Refer to the official docs for its signature.
For example,
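A self-contained sketch (the file name data.csv and its contents are made up; real usage would point at an existing file):

```python
import pandas as pd

# Write a tiny CSV so the example can run on its own:
with open('data.csv', 'w') as f:
    f.write('name,score\nalice,90\nbob,85\n')

df = pd.read_csv('data.csv')  # the first row becomes the header by default
df.shape          #=> (2, 2)
list(df.columns)  #=> ['name', 'score']
```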
CONSOLE DISPLAY AND SUMMARY
Some helpful methods and attributes:
Wide DataFrames will be printed (print) across multiple rows by default. You can change how much to print on a single row by setting the display.width option. You can adjust the max width of the individual columns by setting display.max_colwidth.
pd.set_option('display.width', 40) # default is 80
pd.set_option('display.max_colwidth', 30)
You can also turn off the wrapping behavior via the expand_frame_repr option. This will print the table in one block.
INDEXING ROWS AND SELECTING COLUMNS

The basics of DataFrame indexing and selecting are as follows:
For example,
d = {
    'one' : pd.Series([1., 2., 3., 4.], index=['a', 'b', 'c', 'a']),
    'two' : pd.Series(['A', 'B', 'C', 'D'], index=['a', 'b', 'c', 'a'])
}

df = pd.DataFrame(d)
df
#    one two
# a  1.0   A
# b  2.0   B
# c  3.0   C
# a  4.0   D

type(df['one']) #=> pandas.core.series.Series
df['one']
# a    1.0
# b    2.0
# c    3.0
# a    4.0
# Name: one, dtype: float64

type(df[['one']]) #=> pandas.core.frame.DataFrame
df[['one']]
#    one
# a  1.0
# b  2.0
# c  3.0
# a  4.0

type(df[['one', 'two']]) #=> pandas.core.frame.DataFrame
df[['one', 'two']]
#    one two
# a  1.0   A
# b  2.0   B
# c  3.0   C
# a  4.0   D

type(df.loc['a']) #=> pandas.core.frame.DataFrame
df.loc['a']
#    one two
# a  1.0   A
# a  4.0   D

type(df.loc['b']) #=> pandas.core.series.Series
df.loc['b']
# one    2
# two    B
# Name: b, dtype: object

type(df.loc[['a', 'c']]) #=> pandas.core.frame.DataFrame
df.loc[['a', 'c']]
#    one two
# a  1.0   A
# a  4.0   D
# c  3.0   C

type(df.iloc[0]) #=> pandas.core.series.Series
df.iloc[0]
# one    1
# two    A
# Name: a, dtype: object

df.iloc[1:3]
#    one two
# b  2.0   B
# c  3.0   C

df.iloc[[1, 2]]
#    one two
# b  2.0   B
# c  3.0   C

df.iloc[[1, 0, 1, 0]]
#    one two
# b  2.0   B
# a  1.0   A
# b  2.0   B
# a  1.0   A

df.iloc[[True, False, True, False]]
#    one two
# a  1.0   A
# c  3.0   C
COLUMN ADDITION AND DELETION

You can treat a DataFrame semantically like a dict of like-indexed Series objects. Getting, setting, and deleting columns works with the same syntax as the analogous dict operations.
When inserting a Series that doesn't have the same index as the DataFrame, it will be conformed to the DataFrame's index. That is, only values with index matching the DataFrame's existing index will be added, and missing index entries will get NaN as their value.
When inserting a columns with scalar value, it will naturally be propagated to fill the column.
When you insert an ndarray or list of the same length (as that of the DataFrame to which it is inserted), it just uses the existing index of the DataFrame. But try not to use ndarrays or lists directly with DataFrames; instead you can first convert them to Series as follows:

df['yet_another_col'] = array_of_same_length_as_df

# is same as

df['yet_another_col'] = pd.Series(array_of_same_length_as_df, index=df.index)
For example,
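A sketch of the dict-like syntax (column names and values are illustrative):

```python
import pandas as pd

df = pd.DataFrame({'one': pd.Series([1., 2., 3.], index=['a', 'b', 'c'])})

df['flag'] = df['one'] > 2   # derived boolean column
df['const'] = 5              # a scalar is broadcast to every row
df['two'] = pd.Series([10., 20.], index=['a', 'c'])  # aligned on index; 'b' gets NaN

del df['const']              # delete a column, dict-style
```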
By default, columns get inserted at the end. The insert() method is available to insert at a particular location in the columns.

Columns can be deleted using del, like keys of a dict.
DATA ALIGNMENT AND ARITHMETIC

Arithmetic between DataFrame objects:

Data between DataFrame objects automatically aligns on both the columns and the index (row labels). Again, the resulting object will have the union of the column and row labels. For example,
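A sketch of the alignment (shapes chosen so the union is visible):

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame(np.ones((3, 2)), columns=['A', 'B'])       # rows 0-2
df2 = pd.DataFrame(np.ones((2, 3)), columns=['A', 'B', 'C'])  # rows 0-1

result = df1 + df2
# Rows and columns are the union; cells present on only one side become NaN
```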
Important: You might like to try the above example with duplicate column names and index values in each individual data frame.
Boolean operators (for example, df1 & df2) work as well.
Arithmetic between DataFrame and Series:

When doing an operation between a DataFrame and a Series, the default behavior is to broadcast the Series row-wise to match rows in the DataFrame, and then the arithmetic is performed. For example,
In the special case of working with time series data, where the DataFrame index also contains dates, the broadcasting will be column-wise.
Here pd.date_range() is used to create a fixed-frequency DatetimeIndex, which is then used as the index (rather than the default index of 0, 1, 2, ...) for a DataFrame.
For explicit control over the matching and broadcasting behavior, see the section on flexible binary operations.
Arithmetic between DataFrame and scalars:

Operations with scalars are just as you would expect: broadcast to each cell (that is, to all columns and rows).
DATAFRAME METHODS AND FUNCTIONS
Evaluating a string describing operations using the eval() method

Note: Rather use the assign() method.
The eval() method evaluates a string describing operations on DataFrame columns. It operates on columns only, not specific rows or elements. This allows eval() to run arbitrary code, which can make you vulnerable to code injection if you pass user input to this function.
df = pd.DataFrame({'A': range(1, 6), 'B': range(10, 0, -2)})
df
#    A   B
# 0  1  10
# 1  2   8
# 2  3   6
# 3  4   4
# 4  5   2

df.eval('2*A + B')
# 0    12
# 1    12
# 2    12
# 3    12
# 4    12
# dtype: int64
Assignment is allowed, though by default the original DataFrame is not modified. Use inplace=True to modify the original DataFrame. For example,
df.eval('C = A + 2*B', inplace=True)
df
#    A   B   C
# 0  1  10  21
# 1  2   8  18
# 2  3   6  15
# 3  4   4  12
# 4  5   2   9
Assigning new columns to copies in method chains — the assign() method

Inspired by dplyr's mutate verb, DataFrame has an assign() method that allows you to easily create new columns that are potentially derived from existing columns.
The assign() method always returns a copy of the data, leaving the original DataFrame untouched.

Note: Also check the pipe() method.
df2 = df.assign(one_ratio = df['one']/df['out_of'])
df2
#    one  two  one_trunc  out_of  const  one_ratio
# a  1.0  1.0        1.0     100      1       0.01
# b  2.0  2.0        2.0     100      1       0.02
# c  3.0  3.0        NaN     100      1       0.03
# d  NaN  4.0        NaN     100      1        NaN

id(df)  #=> 4436438040
id(df2) #=> 4566906360
Above was an example of inserting a precomputed value. We can also pass in a function of one argument to be evaluated on the DataFrame being assigned to.
df3 = df.assign(one_ratio = lambda x: (x['one']/x['out_of']))
df3
#    one  two  one_trunc  out_of  const  one_ratio
# a  1.0  1.0        1.0     100      1       0.01
# b  2.0  2.0        2.0     100      1       0.02
# c  3.0  3.0        NaN     100      1       0.03
# d  NaN  4.0        NaN     100      1        NaN

id(df)  #=> 4436438040
id(df3) #=> 4514692848
This way you can remove a dependency by not having to use the name of the DataFrame.
Appending rows with the append() method

The append() method appends the rows of the other_data_frame DataFrame to the end of the current DataFrame, returning a new object. The columns not in the current DataFrame are added as new columns.
Its most useful syntax is:
它最有用的语法是:
<data_frame>.append(other_data_frame, ignore_index=False)
Here,
· other_data_frame: Data to be appended, in the form of a DataFrame or Series/dict-like object, or a list of these.
· ignore_index: By default it is False. If it is True, then the index labels of other_data_frame are ignored.
Note: Also check the concat() function.
For example,
df = pd.DataFrame([[1, 2], [3, 4]], columns=list('AB'))
df
# A B
# 0 1 2
# 1 3 4
df2 = pd.DataFrame([[5, 6], [7, 8]], columns=list('AB'))
df2
# A B
# 0 5 6
# 1 7 8
df.append(df2)
# A B
# 0 1 2
# 1 3 4
# 0 5 6
# 1 7 8
df.append(df2, ignore_index=True)
# A B
# 0 1 2
# 1 3 4
# 2 5 6
# 3 7 8
The drop() method

Note: Rather use del as stated in the Column Addition and Deletion section, and indexing + re-assignment for keeping specific rows.
The drop() method removes rows or columns by specifying label names and the corresponding axis, or by specifying index or column names directly. When using a multi-index, labels on different levels can be removed by specifying the level.
The values attribute and the copy() method

The values attribute
The values attribute returns a NumPy representation of a DataFrame's data. Only the values in the DataFrame will be returned; the axes labels will be removed. A DataFrame with mixed-type columns (e.g. str/object, int64, float32) results in an ndarray of the broadest type that accommodates these mixed types.
Check Console Display section for an example.
The copy() method

The copy() method makes a copy of the DataFrame object's indices and data, as by default deep is True. So, modifications to the data or indices of the copy will not be reflected in the original object.
If deep=False, a new object will be created without copying the calling object's data or index (only references to the data and index are copied). Any changes to the data of the original will be reflected in the shallow copy (and vice versa).
Its syntax is:
df.copy(deep=True)
Transposing using the T attribute or transpose() method

Refer section Arithmetic, matrix multiplication, and comparison operations.
To transpose, you can call the transpose() method, or you can use the T attribute, which is an accessor to the transpose() method.

The result is a DataFrame that is a reflection of the original DataFrame over its main diagonal, obtained by writing rows as columns and vice-versa. Transposing a DataFrame with mixed dtypes will result in a homogeneous DataFrame with the object dtype. In such a case, a copy of the data is always made.
For example,
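A small sketch:

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2], 'B': [3, 4]}, index=['x', 'y'])

t = df.T   # same as df.transpose()
t
#    x  y
# A  1  2
# B  3  4
```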
Sorting (sort_values(), sort_index()), Grouping (groupby()), and Filtering (filter())

The sort_values() method
A DataFrame can be sorted by a column (or by multiple columns) using the sort_values() method.
For example,
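A sketch sorting by one column and then by multiple columns:

```python
import pandas as pd

df = pd.DataFrame({'col1': ['B', 'A', 'B', 'A'], 'col2': [2, 1, 9, 8]})

by_col2 = df.sort_values(by='col2', ascending=False)
by_both = df.sort_values(by=['col1', 'col2'])  # col1 first, ties broken by col2
```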
The sort_index() method

The sort_index() method can be used to sort by index.
For example,
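A sketch:

```python
import pandas as pd

s = pd.Series([1, 2, 3], index=['c', 'a', 'b'])
sorted_s = s.sort_index()
# a    2
# b    3
# c    1
# dtype: int64
```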
The groupby() method

The groupby() method is used to group by a function, label, or a list of labels.
For example,
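A sketch grouping by a label column and aggregating:

```python
import pandas as pd

df = pd.DataFrame({'key': ['a', 'b', 'a', 'b'], 'val': [1, 2, 3, 4]})
totals = df.groupby('key')['val'].sum()
# key
# a    4
# b    6
# Name: val, dtype: int64
```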
The filter() method

The filter() method returns a subset of rows or columns of a DataFrame according to labels in the specified index. Note that this method does not filter a DataFrame on its contents; the filter is applied to the labels of the index, or to the column names.
You can use the items, like, and regex parameters, but note that they are enforced to be mutually exclusive. The parameter axis defaults to the info axis that is used when indexing with [].
For example,
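A sketch exercising items, regex, and like (the labels are made up):

```python
import pandas as pd

df = pd.DataFrame({'one': [1, 2], 'two': [3, 4], 'three': [5, 6]},
                  index=['mouse', 'rabbit'])

by_items = df.filter(items=['one', 'three'])   # columns by exact name
by_regex = df.filter(regex='e$', axis=1)       # column names ending in 'e'
by_like = df.filter(like='bbi', axis=0)        # row labels containing 'bbi'
```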
Melting and pivoting using the melt() and pivot() methods

The idea of melt() is to keep a few given columns as id-columns and convert the rest of the columns (called variable-columns) into variable and value, where the variable tells you the original column's name and the value is the corresponding value in the original column.
If there are n variable-columns which are melted, then the information from each row of the original format is spread across n rows.
The idea of pivot() is to do just the reverse.
For example,
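A round-trip sketch: melt spreads each original row across the variable-columns, and pivot reverses it (column names are illustrative):

```python
import pandas as pd

wide = pd.DataFrame({'id': ['x', 'y'], 'A': [1, 2], 'B': [3, 4]})

long_df = wide.melt(id_vars=['id'])  # A and B become variable/value rows
#   id variable  value
# 0  x        A      1
# 1  y        A      2
# 2  x        B      3
# 3  y        B      4

back = long_df.pivot(index='id', columns='variable', values='value')
```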
Piping (chaining) functions using the pipe() method

Suppose you want to apply a function to a DataFrame, Series, or groupby object, then apply another function to its output, and so on. One way would be to perform this operation in a "sandwich"-like fashion:
Note: Also check the assign() method.
df = foo3(foo2(foo1(df, arg1=1), arg2=2), arg3=3)
In the long run, this notation becomes fairly messy. What you want to do here is to use pipe(). Pipe can be thought of as function chaining. This is how you would perform the same task as before with pipe():

(df.pipe(foo1, arg1=1)
   .pipe(foo2, arg2=2)
   .pipe(foo3, arg3=3))

This is a cleaner way that helps keep track of the order in which the functions and their corresponding arguments are applied.
Rolling windows using the rolling() method

Use DataFrame.rolling() for rolling window calculations.
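A sketch of a 3-period rolling mean (the first two windows are incomplete, hence NaN):

```python
import pandas as pd

s = pd.Series([1, 2, 3, 4, 5])
r = s.rolling(window=3).mean()
# 0    NaN
# 1    NaN
# 2    2.0
# 3    3.0
# 4    4.0
# dtype: float64
```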
Other DataFrame methods

Refer the Methods section in pd.DataFrame.

Refer the Computational Tools User Guide.

Refer the categorical listing at the Pandas API.
APPLYING FUNCTIONS

The apply() method: apply on columns/rows

The apply() method applies the given function along an axis (by default on columns) of the DataFrame.
Its most useful form is:
df.apply(func, axis=0, args=(), **kwds)
Here:
· func: The function to apply to each column or row. Note that it can be an element-wise function (in which case axis=0 or axis=1 doesn't make any difference) or an aggregate function.
· axis: Its value can be 0 (default, column) or 1. 0 means applying the function to each column, and 1 means applying the function to each row. Note that this axis is similar to how axes are defined in NumPy; for a 2D ndarray, 0 means column.
· args: It is a tuple and represents the positional arguments to pass to func in addition to the array/series.
· **kwds: It represents the additional keyword arguments to pass as keyword arguments to func.
It returns a Series or a DataFrame.
For example,
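A sketch using the built-in sum as an aggregate function:

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3], 'B': [10, 20, 30]})

col_sums = df.apply(sum)          # axis=0 (default): one result per column
row_sums = df.apply(sum, axis=1)  # axis=1: one result per row
```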
The applymap() method: apply element-wise

The applymap() method applies the given function element-wise. So, the given func must accept and return a scalar for every element of a DataFrame.
Its general syntax is:
df.applymap(func)
For example,
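A sketch (note that newer pandas versions rename applymap() to DataFrame.map()):

```python
import pandas as pd

df = pd.DataFrame({'A': [1.0, 2.5], 'B': [3.25, 4.0]})

out = df.applymap(lambda x: len(str(x)))  # length of the printed value, element-wise
#    A  B
# 0  3  4
# 1  3  3
```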
When you need to apply a function element-wise, you might like to check first if there is a vectorized version available. A vectorized version of func often exists, and it will be much faster. You could square each number element-wise using df.applymap(lambda x: x**2), but the vectorized version df**2 is better.
WORKING WITH MISSING DATA
Refer SciKit-Learn’s Data Cleaning section.
Refer the Missing Data Guide and the API Reference for Missing Data Handling: dropna, fillna, replace, interpolate.

Also check the Data Cleaning section of the tf.feature_column API for other options.
Also go through https://www.analyticsvidhya.com/blog/2016/01/12-pandas-techniques-python-data-manipulation/
NORMALIZING DATA
One way is to perform df / df.iloc[0], which is particularly useful while analyzing stock prices over a period of time for multiple companies.
THE CONCAT() FUNCTION

The concat() function performs concatenation operations along an axis while performing optional set logic (union or intersection) of the indexes (if any) on the other axis.
The default axis of concatenation is axis=0, but you can choose to concatenate data frames sideways by choosing axis=1.
Note: Also check the append() method.
For example,
df1 = pd.DataFrame(
    {
        'A': ['A0', 'A1', 'A2', 'A3'],
        'B': ['B0', 'B1', 'B2', 'B3'],
        'C': ['C0', 'C1', 'C2', 'C3'],
        'D': ['D0', 'D1', 'D2', 'D3']
    }, index=[0, 1, 2, 3]
)

df2 = pd.DataFrame(
    {
        'A': ['A4', 'A5', 'A6', 'A7'],
        'B': ['B4', 'B5', 'B6', 'B7'],
        'C': ['C4', 'C5', 'C6', 'C7'],
        'D': ['D4', 'D5', 'D6', 'D7']
    }, index=[4, 5, 6, 7]
)

df3 = pd.DataFrame(
    {
        'A': ['A8', 'A9', 'A10', 'A11'],
        'B': ['B8', 'B9', 'B10', 'B11'],
        'C': ['C8', 'C9', 'C10', 'C11'],
        'D': ['D8', 'D9', 'D10', 'D11']
    }, index=[1, 2, 3, 4]
)

frames = [df1, df2, df3]

df4 = pd.concat(frames)
df4
#      A    B    C    D
# 0   A0   B0   C0   D0
# 1   A1   B1   C1   D1
# 2   A2   B2   C2   D2
# 3   A3   B3   C3   D3
# 4   A4   B4   C4   D4
# 5   A5   B5   C5   D5
# 6   A6   B6   C6   D6
# 7   A7   B7   C7   D7
# 1   A8   B8   C8   D8
# 2   A9   B9   C9   D9
# 3  A10  B10  C10  D10
# 4  A11  B11  C11  D11

df5 = pd.concat(frames, ignore_index=True)
df5
#       A    B    C    D
# 0    A0   B0   C0   D0
# 1    A1   B1   C1   D1
# 2    A2   B2   C2   D2
# 3    A3   B3   C3   D3
# 4    A4   B4   C4   D4
# 5    A5   B5   C5   D5
# 6    A6   B6   C6   D6
# 7    A7   B7   C7   D7
# 8    A8   B8   C8   D8
# 9    A9   B9   C9   D9
# 10  A10  B10  C10  D10
# 11  A11  B11  C11  D11

df5 = pd.concat(frames, keys=['s1', 's2', 's3'])
df5
#         A    B    C    D
# s1 0   A0   B0   C0   D0
#    1   A1   B1   C1   D1
#    2   A2   B2   C2   D2
#    3   A3   B3   C3   D3
# s2 4   A4   B4   C4   D4
#    5   A5   B5   C5   D5
#    6   A6   B6   C6   D6
#    7   A7   B7   C7   D7
# s3 1   A8   B8   C8   D8
#    2   A9   B9   C9   D9
#    3  A10  B10  C10  D10
#    4  A11  B11  C11  D11

df5.index
# MultiIndex(levels=[['s1', 's2', 's3'], [0, 1, 2, 3, 4, 5, 6, 7]],
#            labels=[[0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 2], [0, 1, 2, 3, 4, 5, 6, 7, 1, 2, 3, 4]])
Like its sibling function on ndarrays, numpy.concatenate(), pandas.concat() takes a list or dict of homogeneously-typed objects and concatenates them with some configurable handling of "what to do with the other axes".
MERGING AND JOINING USING MERGE() AND JOIN() FUNCTIONS

Refer the Merge, Join, and Concatenate official guide.
The merge() function merges DataFrame objects by performing a database-style join operation by columns or indexes.

The join() method joins columns with another DataFrame either on the index or on a key column.
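A sketch of a database-style join on a key column (the names are made up):

```python
import pandas as pd

left = pd.DataFrame({'key': ['K0', 'K1'], 'A': [1, 2]})
right = pd.DataFrame({'key': ['K0', 'K2'], 'B': [3, 4]})

inner = pd.merge(left, right, on='key')              # only matching keys survive
outer = pd.merge(left, right, on='key', how='outer') # union of keys; gaps are NaN
```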
BINARY DUMMY VARIABLES FOR CATEGORICAL VARIABLES USING GET_DUMMIES() FUNCTION

Converting a categorical variable into a "dummy" DataFrame can be done using get_dummies():
df = pd.DataFrame({'char': list('bbacab'), 'data1': range(6)})
df
#   char  data1
# 0    b      0
# 1    b      1
# 2    a      2
# 3    c      3
# 4    a      4
# 5    b      5

dummies = pd.get_dummies(df['char'], prefix='key')
dummies
#    key_a  key_b  key_c
# 0      0      1      0
# 1      0      1      0
# 2      1      0      0
# 3      0      0      1
# 4      1      0      0
# 5      0      1      0
PLOTTING DATAFRAME USING PLOT() FUNCTION

The plot() function makes plots of a DataFrame using matplotlib/pylab.
Panel (3D data structure)
Panel is a container for 3D data. The term panel data is derived from econometrics and is partially responsible for the name: pan(el)-da(ta)-s.
The 3D structure of a Panel is much less common for many types of data analysis than the 1D of the Series or the 2D of the DataFrame. Oftentimes, one can simply use a multi-index DataFrame for easily working with higher-dimensional data. Refer Deprecate Panel.
Here are some related interesting stories that you might find helpful:

Fluent NumPy
Distributed Data Processing with Apache Spark
Apache Cassandra — Distributed Row-Partitioned Database for Structured and Semi-Structured Data
The Why and How of MapReduce
Translated from: https://medium.com/analytics-vidhya/fluent-pandas-22473fa3c30d