熊猫数据集_对熊猫数据框使用逻辑比较

熊猫数据集

P (tPYTHON)

Logical comparisons are used everywhere.

逻辑比较随处可见

The Pandas library gives you a lot of different ways that you can compare a DataFrame or Series to other Pandas objects, lists, scalar values, and more. The traditional comparison operators (<, >, <=, >=, ==, !=) can be used to compare a DataFrame to another set of values.

Pandas库为您提供了许多不同的方式,您可以将DataFrame或Series与其他Pandas对象,列表,标量值等进行比较。 传统的比较运算符( <, >, <=, >=, ==, != )可用于将DataFrame与另一组值进行比较。

However, you can also use wrappers for more flexibility in your logical comparison operations. These wrappers allow you to specify the axis for comparison, so you can choose to perform the comparison at the row or column level. Also, if you are working with a MultiIndex, you may specify which index you want to work with.

但是,还可以使用包装器在逻辑比较操作中提供更大的灵活性。 这些包装器允许您指定要进行比较的 ,因此您可以选择在行或列级别执行比较。 另外,如果您使用的是MultiIndex,则可以指定要使用的索引

In this piece, we’ll first take a quick look at logical comparisons with the standard operators. After that, we’ll go through five different examples of how you can use these logical comparison wrappers to process and better understand your data.

在本文中,我们将首先快速了解与标准运算符的逻辑比较。 之后,我们将介绍五个不同的示例,说明如何使用这些逻辑比较包装器来处理和更好地理解您的数据。

The data used in this piece is sourced from Yahoo Finance. We’ll be using a subset of Tesla stock price data. Run the code below if you want to follow along. (And if you’re curious as to the function I used to get the data scroll to the very bottom and click on the first link.)

本文中使用的数据来自Yahoo Finance。 我们将使用特斯拉股价数据的子集。 如果要继续,请运行下面的代码。 (如果您对我用来使数据滚动到最底部并单击第一个链接的功能感到好奇)。

import pandas as pd# fixed data so sample data will stay the same
df = pd.read_html("https://finance.yahoo.com/quote/TSLA/history?period1=1277942400&period2=1594857600&interval=1d&filter=history&frequency=1d")[0]df = df.head(10) # only work with the first 10 points
Image for post
Tesla stock data from Yahoo Finance雅虎财经的特斯拉股票数据

与熊猫的逻辑比较 (Logical Comparisons With Pandas)

The wrappers available for use are:

可用的包装器有:

  • eq (equivalent to ==) — equals to

    eq (等于== )—等于

  • ne (equivalent to !=) — not equals to

    ne (等于!= )-不等于

  • le (equivalent to <=) — less than or equals to

    le (等于<= )-小于或等于

  • lt (equivalent to <) — less than

    lt (等于< )-小于

  • ge (equivalent to >=) — greater than or equals to

    ge (等于>= )-大于或等于

  • gt (equivalent to >) — greater than

    gt (等于> )-大于

Before we dive into the wrappers, let’s quickly review how to perform a logical comparison in Pandas.

在深入探讨包装之前,让我们快速回顾一下如何在Pandas中进行逻辑比较。

With the regular comparison operators, a basic example of comparing a DataFrame column to an integer would look like this:

使用常规比较运算符,将DataFrame列与整数进行比较的基本示例如下所示:

old = df['Open'] >= 270

Here, we’re looking to see whether each value in the “Open” column is greater than or equal to the fixed integer “270”. However, if you try to run this, at first it won’t work.

在这里,我们正在查看“ Open”列中的每个值是否大于或等于固定整数“ 270”。 但是,如果尝试运行此命令,则一开始它将无法工作。

You’ll most likely see this:

您很可能会看到以下内容:

TypeError: '>=' not supported between instances of 'str' and 'int'

TypeError: '>=' not supported between instances of 'str' and 'int'

This is important to take care of now because when you use both the regular comparison operators and the wrappers, you’ll need to make sure that you are actually able to compare the two elements. Remember to do something like the following in your pre-processing, not just for these exercises, but in general when you’re analyzing data:

这一点现在很重要,因为当您同时使用常规比较运算符和包装器时,需要确保您确实能够比较这两个元素。 请记住,在预处理过程中,不仅要针对这些练习,而且在分析数据时通常要执行以下操作:

df = df.astype({"Open":'float',
"High":'float',
"Low":'float',
"Close*":'float',
"Adj Close**":'float',
"Volume":'float'})

Now, if you run the original comparison again, you’ll get this series back:

现在,如果再次运行原始比较,您将获得以下系列:

Image for post
Easy logical comparison example
简单的逻辑比较示例

You can see that the operation returns a series of Boolean values. If you check the original DataFrame, you’ll see that there should be a corresponding “True” or “False” for each row where the value was greater than or equal to (>=) 270 or not.

您可以看到该操作返回了一系列布尔值。 如果检查原始DataFrame,您会发现值大于或等于( >= )270的每一行都应该有一个对应的“ True”或“ False”。

Now, let’s dive into how you can do the same and more with the wrappers.

现在,让我们深入研究如何使用包装器做同样的事情。

1.比较两列的不平等 (1. Comparing two columns for inequality)

In the data set, you’ll see that there is a “Close*” column and an “Adj Close**” column. The Adjusted Close price is altered to reflect potential dividends and splits, whereas the Close price is only adjusted for splits. To see if these events may have happened, we can do a basic test to see if values in the two columns are not equal.

在数据集中,您将看到有一个“ Close *”列和一个“ Adj Close **”列。 调整后的收盘价被更改以反映潜在的股息和分割,而收盘价仅针对分割进行调整。 要查看是否可能发生了这些事件,我们可以进行基本测试以查看两列中的值是否不相等。

To do so, we run the following:

为此,我们运行以下命令:

# is the adj close different from the close?
df['Close Comparison'] = df['Adj Close**'].ne(df['Close*'])
Image for post
Results of column inequality comparison
列不相等比较的结果

Here, all we did is call the .ne() function on the “Adj Close**” column and pass “Close*”, the column we want to compare, as an argument to the function.

在这里,我们.ne()在“ Adj Close **”列上调用.ne()函数,并传递“ Close *”(我们要比较的列)作为该函数的参数。

If we take a look at the resulting DataFrame, you’ll see that we‘ve created a new column “Close Comparison” that will show “True” if the two original Close columns are different and “False” if they are the same. In this case, you can see that the values for “Close*” and “Adj Close**” on every row are the same, so the “Close Comparison” only has “False” values. Technically, this would mean that we could remove the “Adj Close**” column, at least for this subset of data, since it only contains duplicate values to the “Close*” column.

如果我们看一下生成的DataFrame,您会看到我们创建了一个新列“ Close Compare”,如果两个原始的Close列不同,则显示“ True”,如果相同,则显示“ False”。 在这种情况下,您可以看到每行上“ Close *”和“ Adj Close **”的值相同,因此“ Close Compare”只有“ False”值。 从技术上讲,这意味着我们至少可以删除此数据子集的“ Adj Close **”列,因为它仅包含“ Close *”列的重复值。

2.检查一列是否大于另一列 (2. Checking if one column is greater than another)

We’d often like to see whether a stock’s price increased by the end of the day. One way to do this would be to see a “True” value if the “Close*” price was greater than the “Open” price or “False” otherwise.

我们经常想看看一天结束时股票的价格是否上涨了。 一种方法是,如果“收盘价”大于“开盘价”,则查看“真”值,否则查看“假”价。

To implement this, we run the following:

为了实现这一点,我们运行以下命令:

# is the close greater than the open?
df['Bool Price Increase'] = df['Close*'].gt(df['Open'])
Image for post
Results of column greater than comparison
列结果大于比较结果

Here, we see that the “Close*” price at the end of the day was higher than the “Open” price at the beginning of the day 4/10 times in the first two weeks of July 2020. This might not be that informative because it’s such a small sample, but if you were to extend this to months or even years of data, it could indicate the overall trend of the stock (up or down).

在这里,我们看到,在2020年7月的前两周,一天结束时的“收盘价”比一天开始时的“开盘价”高出4/10倍。因为这是一个很小的样本,但是如果您将其扩展到数月甚至数年的数据,则可能表明存量的总体趋势(上升或下降)。

3.检查列是否大于标量值 (3. Checking if a column is greater than a scalar value)

So far, we’ve just been comparing columns to one another. You can also use the logical operators to compare values in a column to a scalar value like an integer. For example, let’s say that if the volume traded per day is greater than or equal to 100 million, we’ll call it a “High Volume” day.

到目前为止,我们只是在相互比较列。 您还可以使用逻辑运算符将列中的值与标量值(例如整数)进行比较。 例如,假设每天的交易量大于或等于1亿,我们将其称为“高交易量”日。

To do so, we run the following:

为此,我们运行以下命令:

# was the volume greater than 100m?
df['High Volume'] = df['Volume'].ge(100000000)
Image for post
Result of column and scalar greater than comparison
列和标量的结果大于比较结果

Instead of passing a column to the logical comparison function, this time we simply have to pass our scalar value “100000000”.

这次我们不必将列传递给逻辑比较函数,而只需传递标量值“ 100000000”。

Now, we can see that on 5/10 days the volume was greater than or equal to 100 million.

现在,我们可以看到5/10天的交易量大于或等于1亿。

4.检查列是否大于自身 (4. Checking if a column is greater than itself)

Earlier, we compared if the “Open” and “Close*” value in each row were different. It would be cool if instead, we compared the value of a column to the preceding value, to track an increase or decrease over time. Doing this means we can check if the “Close*” value for July 15 was greater than the value for July 14.

之前,我们比较了每行中的“打开”和“关闭*”值是否不同。 相反,如果我们将列的值与先前的值进行比较,以跟踪随时间的增加或减少,那将很酷。 这样做意味着我们可以检查7月15日的“ Close *”值是否大于7月14日的值。

To do so, we run the following:

为此,我们运行以下命令:

# was the close greater than yesterday's close?
df['Close (t-1)'] = df['Close*'].shift(-1)
df['Bool Over Time Increase'] = df['Close*'].gt(df['Close*'].shift(-1))
Image for post
Result of column greater than comparison
列结果大于比较结果

For illustration purposes, I included the “Close (t-1)” column so you can compare each row directly. In practice, you don’t need to add an entirely new column, as all we’re doing is passing the “Close*” column again into the logical operator, but we’re also calling shift(-1) on it to move all the values “up by one”.

为了便于说明,我在“ Close(t-1)”列中添加了一个标题,以便您可以直接比较每一行。 实际上,您不需要添加全新的列,因为我们要做的只是将“ Close *”列再次传递到逻辑运算符中,但是我们还对其调用了shift(-1)来进行移动所有值“加一”。

What’s going on here is basically subtracting one from the index, so the value for July 14 moves “up”, which lets us compare it to the real value on July 15. As a result, you can see that on 7/10 days the “Close*” value was greater than the “Close*” value on the day before.

这里发生的基本上是从索引中减去1,因此7月14日的值“向上”移动,这使我们可以将其与7月15日的实际值进行比较。结果,您可以看到在7/10天“关闭*”值大于前一天的“关闭*”值。

5.比较列与列表 (5. Comparing a column to a list)

As a final exercise, let’s say that we developed a model to predict the stock prices for 10 days. We’ll store those predictions in a list, then compare the both the “Open” and “Close*” values of each day to the list values.

作为最后的练习,假设我们开发了一个模型来预测10天的股价。 我们将这些预测存储在列表中,然后将每天的“打开”和“关闭*”值与列表值进行比较。

To do so, we run the following:

为此,我们运行以下命令:

# did the open and close price match the predictions?
predictions = [309.2, 303.36, 300, 489, 391, 445, 402.84, 274.32, 410, 223.93]
df2 = df[['Open','Close*']].eq(predictions, axis='index')
Image for post
Result of column and list comparison
列和列表比较的结果

Here, we’ve compared our generated list of predictions for the daily stock prices and compared it to the “Close*” column. To do so, we pass “predictions” into the eq() function and set axis='index'. By default, the comparison wrappers have axis='columns', but in this case, we actually want to work with each row in each column.

在这里,我们比较了生成的每日股票价格预测列表,并将其与“收盘价*”列进行了比较。 为此,我们将“预测”传递给eq()函数并设置axis='index' 。 默认情况下,比较包装器具有axis='columns' ,但是在这种情况下,我们实际上要处理每一列中的每一行。

What this means is Pandas will compare “309.2”, which is the first element in the list, to the first values of “Open” and “Close*”. Then it will move on to the second value in the list and the second values of the DataFrame and so on. Remember that the index of a list and a DataFrame both start at 0, so you would look at “308.6” and “309.2” respectively for the first DataFrame column values (scroll back up if you want to double-check the results).

这意味着熊猫将把列表中的第一个元素“ 309.2”与“打开”和“关闭*”的第一个值进行比较。 然后它将移至列表中的第二个值和DataFrame的第二个值,依此类推。 请记住,列表的索引和DataFrame的索引都从0开始,因此对于第一个DataFrame列值,您将分别查看“ 308.6”和“ 309.2”(如果要仔细检查结果,请向上滚动)。

Based on these arbitrary predictions, you can see that there were no matches between the “Open” column values and the list of predictions. There were 4/10 matches between the “Close*” column values and the list of predictions.

根据这些任意的预测,您可以看到“ Open”列值和预测列表之间没有匹配项。 “ Close *”列值和预测列表之间有4/10个匹配项。

I hope you found this very basic introduction to logical comparisons in Pandas using the wrappers useful. Remember to only compare data that can be compared (i.e. don’t try to compare a string to a float) and manually double-check the results to make sure your calculations are producing the intended results.

我希望您发现使用包装程序对熊猫进行逻辑比较非常基础的介绍很有用。 请记住仅比较可以比较的数据(即不要尝试将字符串与浮点数进行比较),并手动仔细检查结果以确保您的计算产生了预期的结果。

Go forth and compare!

继续比较吧!

More by me:- 2 Easy Ways to Get Tables From a Website
- Top 4 Repositories on GitHub to Learn Pandas
- An Introduction to the Cohort Analysis With Tableau
- How to Quickly Create and Unpack Lists with Pandas
- Learning to Forecast With Tableau in 5 Minutes Or Less

翻译自: https://towardsdatascience.com/using-logical-comparisons-with-pandas-dataframes-3520eb73ae63

熊猫数据集

本文来自互联网用户投稿,该文观点仅代表作者本人,不代表本站立场。本站仅提供信息存储空间服务,不拥有所有权,不承担相关法律责任。如若转载,请注明出处:http://www.mzph.cn/news/389631.shtml

如若内容造成侵权/违法违规/事实不符,请联系多彩编程网进行投诉反馈email:809451989@qq.com,一经查实,立即删除!

相关文章

决策树之前要不要处理缺失值_不要使用这样的决策树

决策树之前要不要处理缺失值As one of the most popular classic machine learning algorithm, the Decision Tree is much more intuitive than the others for its explainability. In one of my previous article, I have introduced the basic idea and mechanism of a Dec…

gl3520 gl3510_带有gl gl本机的跨平台地理空间可视化

gl3520 gl3510Editor’s note: Today’s post is by Ib Green, CTO, and Ilija Puaca, Founding Engineer, both at Unfolded, an “open core” company that builds products and services on the open source deck.gl / vis.gl technology stack, and is also a major contr…

uiautomator +python 安卓UI自动化尝试

使用方法基本说明&#xff1a;https://www.cnblogs.com/mliangchen/p/5114149.html&#xff0c;https://blog.csdn.net/Eugene_3972/article/details/76629066 环境准备&#xff1a;https://www.cnblogs.com/keeptheminutes/p/7083816.html 简单实例 1.自动化安装与卸载 &#…

power bi中的切片器_在Power Bi中显示选定的切片器

power bi中的切片器Just recently, while presenting my session: “Magnificent 7 — Simple tricks to boost your Power BI Development” at the New Stars of Data conference, one of the questions I’ve received was:就在最近&#xff0c;在“新数据之星”会议上介绍我…

5939. 半径为 k 的子数组平均值

5939. 半径为 k 的子数组平均值 给你一个下标从 0 开始的数组 nums &#xff0c;数组中有 n 个整数&#xff0c;另给你一个整数 k 。 半径为 k 的子数组平均值 是指&#xff1a;nums 中一个以下标 i 为 中心 且 半径 为 k 的子数组中所有元素的平均值&#xff0c;即下标在 i …

数据库逻辑删除的sql语句_通过数据库的眼睛查询sql的逻辑流程

数据库逻辑删除的sql语句Structured Query Language (SQL) is famously known as the romance language of data. Even thinking of extracting the single correct answer from terabytes of relational data seems a little overwhelming. So understanding the logical flow…

数据挖掘流程_数据流挖掘

数据挖掘流程1-简介 (1- Introduction) The fact that the pace of technological change is at its peak, Silicon Valley is also introducing new challenges that need to be tackled via new and efficient ways. Continuous research is being carried out to improve th…

北门外的小吃街才是我的大学食堂

学校北门外的那些小吃摊&#xff0c;陪我度过了漫长的大学四年。 细数下来&#xff0c;我最怀念的是…… &#xff08;1&#xff09;烤鸡翅 吸引指数&#xff1a;★★★★★ 必杀技&#xff1a;酥流油 烤鸡翅有蜂蜜味、香辣味、孜然味……最爱店家独创的秘制鸡翅。鸡翅的外皮被…

[LeetCode]最长公共前缀(Longest Common Prefix)

题目描述 编写一个函数来查找字符串数组中的最长公共前缀。如果不存在公共前缀&#xff0c;返回空字符串 ""。 示例 1:输入: ["flower","flow","flight"]输出: "fl"示例 2:输入: ["dog","racecar",&quo…

spark的流失计算模型_使用spark对sparkify的流失预测

spark的流失计算模型Churn prediction, namely predicting clients who might want to turn down the service, is one of the most common business applications of machine learning. It is especially important for those companies providing streaming services. In thi…

区块链开发公司谈区块链与大数据的关系

在过去的两千多年的时间长河中&#xff0c;数字一直指引着我们去探索很多未知的科学世界。到目前为止&#xff0c;随着网络和信息技术的发展&#xff0c;一切与人类活动相关的活动&#xff0c;都直接或者间接的连入了互联网之中&#xff0c;一个全新的数字化的世界展现在我们的…

Jupyter Notebook的15个技巧和窍门,可简化您的编码体验

Jupyter Notebook is a browser bases REPL (read eval print loop) built on IPython and other open-source libraries, it allows us to run interactive python code on the browser.Jupyter Notebook是基于IPL和其他开源库构建的基于REPL(读取评估打印循环)的浏览器&#…

bi数据分析师_BI工程师和数据分析师的5个格式塔原则

bi数据分析师Image by Author图片作者 将美丽融入数据 (Putting the Beauty in Data) Have you ever been ravished by Vizzes on Tableau Public that look like only magic could be in play to display so much data in such a pleasing way?您是否曾经被Tableau Public上的…

BSOJ 2423 -- 【PA2014】Final Zarowki

Description 有n个房间和n盏灯&#xff0c;你需要在每个房间里放入一盏灯。每盏灯都有一定功率&#xff0c;每间房间都需要不少于一定功率的灯泡才可以完全照亮。 你可以去附近的商店换新灯泡&#xff0c;商店里所有正整数功率的灯泡都有售。但由于背包空间有限&#xff0c;你…

WPF绑定资源文件错误(error in binding resource string with a view in wpf)

报错&#xff1a;无法将“***Properties.Resources.***”StaticExtension 值解析为枚举、静态字段或静态属性 解决办法&#xff1a;尝试右键单击在Visual Studio解决方案资源管理器的资源文件&#xff0c;并选择属性选项&#xff0c;然后设置自定义工具属性 PublicResXFile cod…

因果推论第六章

因果推论 (Causal Inference) This is the sixth post on the series we work our way through “Causal Inference In Statistics” a nice Primer co-authored by Judea Pearl himself.这是本系列的第六篇文章&#xff0c;我们将通过Judea Pearl本人与他人合着的《引诱统计学…

如何优化网站加载时间

一、背景 我们要监测网站的加载情况&#xff0c;可以使用 window.performance 来简单的检测。 window.performance 是W3C性能小组引入的新的API&#xff0c;目前IE9以上的浏览器都支持。一个performance对象的完整结构如下图所示&#xff1a; memory字段代表JavaScript对内存的…

熊猫数据集_处理熊猫数据框中的列表值

熊猫数据集Have you ever dealt with a dataset that required you to work with list values? If so, you will understand how painful this can be. If you have not, you better prepare for it.您是否曾经处理过需要使用列表值的数据集&#xff1f; 如果是这样&#xff0…

旋转变换(一)旋转矩阵

1. 简介 计算机图形学中的应用非常广泛的变换是一种称为仿射变换的特殊变换&#xff0c;在仿射变换中的基本变换包括平移、旋转、缩放、剪切这几种。本文以及接下来的几篇文章重点介绍一下关于旋转的变换&#xff0c;包括二维旋转变换、三维旋转变换以及它的一些表达方式&#…

数据预处理 泰坦尼克号_了解泰坦尼克号数据集的数据预处理

数据预处理 泰坦尼克号什么是数据预处理&#xff1f; (What is Data Pre-Processing?) We know from my last blog that data preprocessing is a data mining technique that involves transforming raw data into an understandable format. Real-world data is often incom…