5 Essential Pandas Tricks You Didn't Know About


I've been using pandas for years, and each time I feel I'm typing too much, I google it and usually find a new pandas trick! I learned about these functions recently, and I deem them essential because of their ease of use.

1. The between function

I've been using the between function in SQL for years, but I only discovered it in pandas recently.

Let's say we have a DataFrame with prices, and we would like to filter prices between 2 and 4.

import pandas as pd

df = pd.DataFrame({'price': [1.99, 3, 5, 0.5, 3.5, 5.5, 3.9]})

With the between function, you can reduce this filter:

df[(df.price >= 2) & (df.price <= 4)]

To this:

df[df.price.between(2, 4)]

It might not seem like much, but those parentheses are annoying when writing many filters. The filter with the between function is also more readable.

The between function matches a closed interval: left <= series <= right.
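In recent pandas versions (1.3+), between also takes an inclusive argument ('both', 'left', 'right', or 'neither') that controls which endpoints are included. A small sketch, reusing the prices above with hypothetical bounds of 3 and 5:

```python
import pandas as pd

df = pd.DataFrame({'price': [1.99, 3, 5, 0.5, 3.5, 5.5, 3.9]})

# Default: closed interval, 3 <= price <= 5
both = df[df.price.between(3, 5)]

# Half-open interval, 3 <= price < 5 (string values require pandas >= 1.3)
left = df[df.price.between(3, 5, inclusive='left')]

print(both.price.tolist())  # [3.0, 5.0, 3.5, 3.9]
print(left.price.tolist())  # [3.0, 3.5, 3.9]
```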

2. Fix the order of the rows with the reindex function

The reindex function conforms a Series or a DataFrame to a new index. I resort to reindex when making reports with columns that have a predefined order.

Let's add T-shirt sizes to our DataFrame. The goal of the analysis is to calculate the mean price for each size:

df = pd.DataFrame({'price': [1.99, 3, 5], 'size': ['medium', 'large', 'small']})
df_avg = df.groupby('size').price.mean()
df_avg

Sizes appear in a random order in the table above. They should be ordered: small, medium, large. As sizes are strings, we cannot use the sort_values function. Here the reindex function comes to the rescue:

df_avg.reindex(['small', 'medium', 'large'])
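reindex also accepts a fill_value argument, which replaces the NaN produced for labels that have no data. A minimal sketch; the 'x-large' size is a hypothetical label not present in the data:

```python
import pandas as pd

df = pd.DataFrame({'price': [1.99, 3, 5], 'size': ['medium', 'large', 'small']})
df_avg = df.groupby('size').price.mean()

# Labels missing from the data ('x-large') would normally become NaN;
# fill_value substitutes a default instead
ordered = df_avg.reindex(['small', 'medium', 'large', 'x-large'], fill_value=0)
print(ordered.tolist())  # [5.0, 1.99, 3.0, 0.0]
```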

3. Describe on steroids

The describe function is an essential tool for Exploratory Data Analysis. It shows basic summary statistics for all columns in a DataFrame.

df.price.describe()

What if we would like to calculate 10 quantiles instead of 3?

import numpy as np

df.price.describe(percentiles=np.arange(0, 1, 0.1))

The describe function takes a percentiles argument. We can generate the percentiles with NumPy's arange function to avoid typing each one by hand.

This feature becomes really useful when combined with groupby:

df.groupby('size').describe(percentiles=np.arange(0, 1, 0.1))
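As a self-contained sketch: note that np.arange(0, 1, 0.1) stops before 1.0, so the deciles run from 0% to 90%; describe always reports count, mean, std, min, max, and the 50th percentile regardless:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'price': [1.99, 3, 5, 0.5, 3.5, 5.5, 3.9]})

# Index: count, mean, std, min, ten percentiles (0% .. 90%), max
stats = df.price.describe(percentiles=np.arange(0, 1, 0.1))
print(stats.index.tolist())
```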

4. Text search with regex

Our T-shirt dataset has 3 sizes. Let's say we would like to filter the small and medium sizes. A cumbersome way of filtering is:

df[(df['size'] == 'small') | (df['size'] == 'medium')]

This is bad because we usually combine it with other filters, which makes the expression unreadable. Is there a better way?

pandas string columns have an str accessor, which implements many functions that simplify string manipulation. One of them is the contains function, which supports searching with regular expressions.

df[df['size'].str.contains('small|medium')]

The filter with the contains function is more readable, and easier to extend and combine with other filters.
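str.contains also takes case and na arguments, which come in handy when the column mixes letter cases or has missing values. A small sketch on the T-shirt data:

```python
import pandas as pd

df = pd.DataFrame({'price': [1.99, 3, 5], 'size': ['medium', 'large', 'small']})

# case=False makes the pattern case-insensitive;
# na=False treats missing values as non-matches instead of propagating NaN
mask = df['size'].str.contains('small|medium', case=False, na=False)
print(df[mask]['size'].tolist())  # ['medium', 'small']
```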

5. Bigger than memory datasets with pandas

pandas cannot even read datasets bigger than main memory: it throws a MemoryError, or the Jupyter kernel crashes. But to process a big dataset you don't need Dask or Vaex. You just need some ingenuity. Sounds too good to be true?

In case you've missed it, I have also written an article about handling bigger-than-main-memory datasets with Dask and Vaex.

When doing an analysis you usually don't need all rows or all columns in the dataset.

In case you don't need all rows, you can read the dataset in chunks and filter out unnecessary rows to reduce memory usage:

iter_csv = pd.read_csv('dataset.csv', iterator=True, chunksize=1000)
df = pd.concat([chunk[chunk['field'] > constant] for chunk in iter_csv])

Reading a dataset in chunks is slower than reading it all at once. I would recommend using this approach only with bigger-than-memory datasets.

In case you don't need all columns, you can specify the required columns with the usecols argument when reading a dataset:

df = pd.read_csv('dataset.csv', usecols=['col1', 'col2'])

The great thing about these two approaches is that you can combine them.
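A minimal sketch of combining the two, using an in-memory CSV as a stand-in for a big file on disk (the 'field' and 'value' column names are hypothetical):

```python
import io

import pandas as pd

# Small in-memory stand-in for a large on-disk CSV file
csv_data = io.StringIO("field,value\n1,a\n5,b\n2,c\n7,d\n")

# Read only the needed columns, in chunks, filtering rows as we go
iter_csv = pd.read_csv(csv_data, usecols=['field', 'value'],
                       iterator=True, chunksize=2)
df = pd.concat(chunk[chunk['field'] > 2] for chunk in iter_csv)
print(df['field'].tolist())  # [5, 7]
```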

Before you go

These are a few links that might interest you:

- Your First Machine Learning Model in the Cloud
- AI for Healthcare
- Parallels Desktop 50% off
- School of Autonomous Systems
- Data Science Nanodegree Program
- 5 lesser-known pandas tricks
- How NOT to write pandas code

Translated from: https://towardsdatascience.com/5-essential-pandas-tricks-you-didnt-know-about-2d1a5b6f2e7

