5 Tips for pandas Users

pandas is a popular Python library among people working with data: an easy and flexible data manipulation and analysis library. It has a myriad of awesome methods and functions, some of which are probably less well-known than others. With this in mind, here are 5 tips on how these less common (at least in my opinion) methods and functions may be useful, in case you don’t already know them ✨:

1. Filter with query()
2. Show multiple dataframes with display()
3. Sort smarter with boolean lists, nsmallest() and nlargest()
4. Customise describe()
5. Update default display settings

Do any of these look interesting or unfamiliar to you? 💭 I hope the answer is ‘yes’ or at least ‘maybe’. In this post, I will go through each of them, explain what I mean and illustrate how they are useful.

0. Python setup 🔧

I assume the reader (👀 yes, you!) has:

  • access to and familiarity with Python, including installing packages, defining functions and other basic tasks, and

  • working knowledge of pandas, including basic data manipulation.

If you are new to Python, this is a good place to get started. If you haven’t used pandas before, this is a good resource to check out.

I have used and tested the scripts in Python 3.7.1 in Jupyter Notebook. Let’s make sure you have the right tools before we dive in.

⬜️ Ensure required packages are installed: pandas and seaborn

We will use the following powerful third party packages:

  • pandas: Data analysis library and

  • seaborn: Visualisation library (to import a toy dataset).

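If either package is missing, a common way to get them from inside Jupyter is the pip magic below (a sketch assuming a standard pip setup; adapt for conda or your own environment):

%pip install pandas seaborn
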
1. Data 📦

We will use seaborn’s dataset on tips to exemplify my tips. (Did you get it? … my pun… 😆):

# Import packages
import pandas as pd
import seaborn as sns

# Import data
df = sns.load_dataset('tips')
print(f"{df.shape[0]} rows and {df.shape[1]} columns")
df.head()

Details about this dataset, including the data dictionary, can be found here (this source is actually for R, but it appears to refer to the same underlying dataset). I have quoted their data description below for quick access:

“One waiter recorded information about each tip he received over a period of a few months working in one restaurant.”

2. Tips 🌟

📍 Tip #1: Filter with query()

Let’s start with my favourite tip! Say we wanted to filter the data down to the records where the tip was more than $6 and the total bill was $30 or more. One common way to accomplish this is:

df.loc[(df['tip']>6) & (df['total_bill']>=30)]

This does the job, but don’t you think it’s a little too verbose? Each condition needs a reference to the dataframe, and multiple conditions each need wrapping parentheses. Now, let me show you how we could achieve the same outcome with more elegant code using query():

df.query("tip>6 & total_bill>=30")

You see how clean, simple and readable this looks? We are no longer repeatedly typing df or overloading the expression with brackets and parentheses. With fewer keystrokes, code is quicker to write and less prone to mistakes. A few more tips on query():

# reference global variable name with @
median_tip = df['tip'].median()
display(df.query("tip>@median_tip").head())

# wrap column name containing . with backtick: `
df.rename(columns={'total_bill':'total.bill'}, inplace=True)
display(df.query("`total.bill`<20").head())
df.rename(columns={'total.bill':'total_bill'}, inplace=True)

# wrap string condition with single quotes (this is what I like)
display(df.query("day=='Sat'").head())
# could also do it the other way around (i.e. 'day=="Sat"')

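A couple of side notes on query() that I find handy (a sketch; behaviour as of the pandas versions I have used): it also understands the Python keywords and, or and in, so conditions can read even more like plain English:

# membership test with 'in'
display(df.query("day in ['Sat', 'Sun']").head())
# 'and' behaves like '&' inside query strings
display(df.query("tip>6 and total_bill>=30"))
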
📍 Tip #2: Show multiple dataframes with display()

I have already given this away in the previous code, so you could probably guess what this one is about. Assume we wanted to inspect both the head and the tail of df in one Jupyter Notebook cell. If we run the following code, it will only show the tail:

df.head()
df.tail()

We can get around this with display():

display(df.head())
display(df.tail())

In the last line, display() is redundant, but it is there for consistency. The code works the same way if we take display() out of the last line:

display(df.head())
df.tail()

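One caveat worth knowing (based on how IPython normally behaves, so treat this as a hedged note): display() is injected into the namespace by Jupyter/IPython, so in a plain Python script you would import it explicitly:

# outside a notebook, import display explicitly
from IPython.display import display
display(df.head())
display(df.tail())
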
📍 Tip #3a: Use a list of booleans when sorting by multiple columns

I have two tips on sorting. The first one is about sorting by multiple columns.

Have you ever had to sort your data by multiple columns in different directions? Here is an example of what I mean: sort the data by total bill in ascending order, and break ties by the amount of tip in descending order.

Before I knew tip #3a, I would create an interim column to flip the scale of either total bill or tip, so that all the relevant columns sort in the same direction (I have flipped tip in this example):

df['rev_tip'] = -df['tip']
df.sort_values(by=['total_bill', 'rev_tip'], ascending=True).head()

This workaround does the job, but it’s not a very elegant way to tackle the task. Let’s delete rev_tip with del df['rev_tip']. Instead, we can pass a list of booleans to indicate the sort order for each column:

df.sort_values(by=['total_bill', 'tip'], ascending=[True, False]).head()

Not only does this avoid creating an extra column, the code also looks cleaner and more readable.

It’s also possible to use the numerical representation of booleans. That is, if we change it to ascending=[1, 0], we get the same output.

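For instance, the line below should give the same ordering as the boolean version above (a quick sketch of the equivalence):

# 1 acts as True (ascending), 0 as False (descending)
df.sort_values(by=['total_bill', 'tip'], ascending=[1, 0]).head()
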
📍 Tip #3b: Use nsmallest() or nlargest()

This second tip comes in handy if you ever need to quickly inspect the records that have the smallest or largest values in a particular column. Using nsmallest(), we can check out the 5 records with the smallest total bill like this:

df.nsmallest(5, 'total_bill')

This is a short form for:

df.sort_values(by='total_bill').head()

Similarly, the outputs of these two lines are identical:

display(df.nlargest(5, 'total_bill'))
display(df.sort_values(by='total_bill', ascending=False).head())

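One extra worth knowing if ties matter to you (hedged; check the documentation for your pandas version): both methods accept a keep argument that controls what happens to rows tied at the cutoff:

# keep='all' returns every row tied with the 5th-smallest value,
# so the result may contain more than 5 rows
df.nsmallest(5, 'total_bill', keep='all')
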
📍 Tip #4: Customise describe()

Any pandas user is probably familiar with df.describe(). This shows summary stats for numerical columns. But we can get more than that by specifying its arguments.

Firstly, let’s check out the column types:

df.info()

In our dataframe, we have numerical and categorical columns. Let’s see summary stats for all columns by adding include='all':

df.describe(include='all')

This is cool but a little messy. Let’s show the summary stats by column type separately with the following script:

display(df.describe(include=['category'])) # categorical types
display(df.describe(include=['number'])) # numerical types

Do you like this better? If we had both string and categorical columns and wished to display the summary stats for both in one table, we could use either include=['category', 'object'] or exclude=['number']. If you are curious to learn more, check out the documentation.

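Another argument I find handy is percentiles, which swaps out the default 25th/50th/75th quantiles for ones of your choice (a small sketch):

# show the 5th, 50th and 95th percentiles instead of the defaults
df.describe(percentiles=[0.05, 0.5, 0.95])
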
📍 Tip #5: Update default display settings

This last tip is probably more well-known than the rest. Let’s see some examples of useful display settings to change.

Firstly, we can check out the current default limits for the maximum number of columns and rows to be displayed with the code below:

print(f"{pd.options.display.max_columns} columns")
print(f"{pd.options.display.max_rows} rows")
(Output not shown; current pandas version: 1.0.3)

This means that if we try to display a dataframe with more than 20 columns, we only get to see the first 10 and the final 10 (20 columns shown in total), while the rest are truncated into three dots. The same logic applies to rows. Often, we may want to see more than these maximums. If we want to change this behaviour, we can do so like this:

pd.options.display.max_columns = None
pd.options.display.max_rows = None

Here, we are asking pandas to display every row and column without any limit. This may or may not be a good idea depending on how big your dataframe is. We can also set these options to a number of our choice:

pd.options.display.max_columns = 50
pd.options.display.max_rows = 100

Secondly, depending on the scale of the numerical variables you are working with, you may sometimes encounter scientific notation for very large or very small numbers in pandas. If you find 1200 and 0.012 easier to read than 1.2e3 and 1.2e-2, you are likely to find this line of code handy:

pd.options.display.float_format = '{:.4f}'.format  # 4 decimal places

This ensures that you will see plain decimal numbers instead of scientific notation.

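If you would rather not change settings globally, pandas also offers a context manager and a reset helper (a sketch; the names below are from the pandas API as I know it, so double-check against the documentation):

# apply a setting temporarily, just for this block
with pd.option_context('display.max_rows', None):
    display(df)

# revert a single option back to its default
pd.reset_option('display.float_format')
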
If you are curious to learn more about other options to customise, check out the documentation.

Voilà❕ These were my current top tips for pandas users!

Thank you for reading my post. Hope you find my tips useful ✂️. If you are interested to learn more about pandas, here is a link to my other post: ◼️ How to transform variables in a pandas DataFrame

Bye for now 🏃💨

Translated from: https://towardsdatascience.com/5-tips-for-pandas-users-e73681d16d17
