A popular Python library used by those working with data is pandas, an easy and flexible data manipulation and analysis library. There are a myriad of awesome methods and functions in pandas, some of which are probably less well-known than others. With this in mind, here are 5 tips on how some of these less common (at least in my opinion) methods and functions can be useful, in case you don’t already know them ✨:
Do any of these look interesting or unfamiliar to you? 💭 I hope the answer is ‘yes’ or at least ‘maybe’. In this post, I will go through each of them, explain what I mean and illustrate how they are useful.
0. Python setup 🔧
I assume the reader (👀 yes, you!) has:
◾️ access to and familiarity with Python, including installing packages, defining functions and other basic tasks
◾️ working knowledge of pandas, including basic data manipulation.
If you are new to Python, this is a good place to get started. If you haven’t used pandas before, this is a good resource to check out.
I have used and tested the scripts in Python 3.7.1 in Jupyter Notebook. Let’s make sure you have the right tools before we dive in.
⬜️ Ensure the required packages, pandas and seaborn, are installed
We will use the following powerful third party packages:
pandas: Data analysis library and
seaborn: Visualisation library (to import a toy dataset).
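If either package is missing, a quick install along these lines should work (a sketch assuming a standard pip-based setup; conda users would use conda install instead):

# Install the required packages from a terminal (or prefix with ! in a notebook cell)
pip install pandas seaborn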
1. Data 📦
We will use seaborn’s dataset on tips to exemplify my tips. (Did you get it? … my pun… 😆):
# Import packages
import pandas as pd
import seaborn as sns

# Import data
df = sns.load_dataset('tips')
print(f"{df.shape[0]} rows and {df.shape[1]} columns")
df.head()
Details about this dataset including data dictionary can be found here (this source is actually for R, but it appears to be referring to the same underlying dataset). I have quoted their data description below for quick access:
“One waiter recorded information about each tip he received over a period of a few months working in one restaurant.”
2. Tips 🌟
📍 Tip #1: Filter with query()
Let’s start with my favourite tip! Say we wanted to filter the data down to records where the tip was more than $6 and the total bill was $30 or more. One common way to accomplish this is to use:
df.loc[(df['tip']>6) & (df['total_bill']>=30)]
This does the job, but don’t you think it’s a little too verbose: each condition requires a reference to the dataframe, plus parentheses around each condition when there are multiple of them. Now, let me show you how we could achieve the same outcome with more elegant code using query():
df.query("tip>6 & total_bill>=30")
You see how clean, simple and readable this looks? We are no longer repeatedly typing df or overloading the expression with brackets and parentheses. With fewer keystrokes, code is quicker to write and less prone to mistakes. A few more tips on query():
# reference global variable name with @
median_tip = df['tip'].median()
display(df.query("tip>@median_tip").head())
# wrap column name containing . with backtick: `
df.rename(columns={'total_bill':'total.bill'}, inplace=True)
display(df.query("`total.bill`<20").head())
df.rename(columns={'total.bill':'total_bill'}, inplace=True)
# wrap string condition with single quotes (this is what I like)
display(df.query("day=='Sat'").head())
# could also do it the other way around (i.e. 'day=="Sat"')
📍 Tip #2: Show multiple dataframes with display()
I have already given this away in the previous code, so you could probably guess what this one is about. Assume we wanted to inspect both head and tail of the df in one cell of Jupyter Notebook. If we run the following code, it will only show the tail:
df.head()
df.tail()
We can get around this with display():
display(df.head())
display(df.tail())
In the last line, display() is redundant but it is there for consistency. It works the same way if we take out display() from the last line:
display(df.head())
df.tail()
📍 Tip #3a: Use a list of booleans when sorting by multiple columns
I have two tips on sorting. The first one is for sorting by multiple columns.
Have you ever had to sort your data by multiple columns in different directions? Here is an example of what I mean: sort the data by total bill in ascending order and break ties by tip amount in descending order.
Before I knew tip #3a, I would create an interim column to flip the scale of either total bill or tip, so that all the relevant columns sort in the same direction, and then sort (I have flipped tip in this example):
df['rev_tip'] = -df['tip']
df.sort_values(by=['total_bill', 'rev_tip'], ascending=True).head()
This works, but it is not a very elegant way to tackle the task. Let’s delete rev_tip with del df['rev_tip']. Instead, we can pass a list of booleans to indicate the sort order for each column:
df.sort_values(by=['total_bill', 'tip'], ascending=[True, False]).head()
Not only do we avoid creating an extra column, but the code also looks cleaner and more readable.
It’s also possible to use the numerical representation of booleans. That is, if we change the argument to ascending=[1, 0], it will give us the same output.
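For instance, this sketch should produce the same ordering as the earlier call:

# 1 behaves like True (ascending) and 0 like False (descending)
df.sort_values(by=['total_bill', 'tip'], ascending=[1, 0]).head()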
📍 Tip #3b: Use nsmallest() or nlargest()
This second tip comes in handy if you ever need to quickly inspect the records that have the smallest or largest values in a particular column. Using nsmallest(), we can check out the 5 records with the smallest total bill like this:
df.nsmallest(5, 'total_bill')
This is a short form for:
df.sort_values(by='total_bill').head()
Similarly, the outputs of these two lines are identical:
display(df.nlargest(5, 'total_bill'))
display(df.sort_values(by='total_bill', ascending=False).head())
📍 Tip #4: Customise describe()
Any pandas user is probably familiar with df.describe(). This shows summary stats for the numerical columns, but we can get more out of it by specifying its arguments.
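For reference, the plain call with no arguments summarises only the numeric columns (total_bill, tip and size in our data):

# Default behaviour: summary statistics for numerical columns only
df.describe()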
Firstly, let’s check out the column types:
df.info()
In our dataframe, we have numerical and categorical columns. Let’s see summary stats for all columns by adding include='all':
df.describe(include='all')
This is cool, but a little messy. Let’s show the summary stats for each column type separately with the following script:
display(df.describe(include=['category'])) # categorical types
display(df.describe(include=['number'])) # numerical types
Do you like this better? If we had both string and categorical columns and wished to display the summary stats for both in one table, we could use either include=['category', 'object'] or exclude=['number']. If you are curious to learn more, check out the documentation.
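As a minimal sketch (for this dataset both calls should return the same table, since its only non-numeric columns are categorical):

# Summary stats for categorical (and any string) columns in one table
display(df.describe(include=['category', 'object']))
# Equivalently here: everything except the numeric columns
display(df.describe(exclude=['number']))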
📍 Tip #5: Update default display settings
This last tip is probably more well-known than the rest. Let’s see some examples of useful display settings to change.
Firstly, we can check the current default limits on the maximum number of columns and rows displayed with the code below:
print(f"{pd.options.display.max_columns} columns")
print(f"{pd.options.display.max_rows} rows")
This means that if we try to display a dataframe with more than 20 columns, we only get to see the first 10 and the final 10 (20 columns shown in total), while the rest are truncated to an ellipsis. The same logic applies to rows. Often, we may want to see more than these maximums. If we want to change this behaviour, we can do so like this:
pd.options.display.max_columns = None
pd.options.display.max_rows = None
Here, we are asking pandas to display every row and column without any limit. This may or may not be a good idea depending on how big your dataframe is. We can also set these options to a number of our choice:
pd.options.display.max_columns = 50
pd.options.display.max_rows = 100
Secondly, depending on the scale of the numerical variables you are working with, you may sometimes encounter scientific notation for very large or very small numbers in pandas. If you find 1200 and 0.012 easier to read than 1.2e3 and 1.2e-2, you are likely to find this line of code handy:
pd.options.display.float_format = '{:.4f}'.format # 4 decimal places
This ensures that you will see real numbers instead of scientific notation.
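As a quick illustration with a hypothetical throwaway frame (demo is not part of the tips data), values render as plain decimals once the format is set:

# Hypothetical example: large and small values formatted as plain decimals
demo = pd.DataFrame({'value': [1200, 0.012]})
pd.options.display.float_format = '{:.4f}'.format
display(demo)  # the value column shows 1200.0000 and 0.0120

# To restore the default float rendering later:
pd.reset_option('display.float_format')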
If you are curious to learn more about other options to customise, check out the documentation.
Voila❕ These were my current top tips for a pandas user!
Thank you for reading my post. Hope you find my tips useful ✂️. If you are interested in learning more about pandas, here is a link to my other post: ◼️ How to transform variables in a pandas DataFrame
Bye for now 🏃💨
Translated from: https://towardsdatascience.com/5-tips-for-pandas-users-e73681d16d17