Pandas is one of the most widely used Python libraries among both data scientists and data engineers. Today, I want to share some Python tips for running qualification checks between two DataFrames.
Note that I used the word qualification rather than identical. An identity check is easy, but a qualification check is looser: it is based on business logic, which makes it harder to implement.
Don't reinvent the wheel
In version 1.1.0, released on July 28, 2020 (eight days before this was written), Pandas introduced the built-in DataFrame.compare function. All the following steps build on it.
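As a quick illustration of what DataFrame.compare returns, here is a minimal sketch. The frames `old` and `new` and their columns are made up for this example and do not come from the original article.

```python
import pandas as pd

# Two hypothetical frames with identical labels; only one value differs.
old = pd.DataFrame({"order_id": [1, 2, 3], "amount": [10.0, 20.0, 30.0]})
new = pd.DataFrame({"order_id": [1, 2, 3], "amount": [10.0, 25.0, 30.0]})

# compare() keeps only the rows and columns where the frames differ.
# Each differing column is split into a "self" and an "other" sub-column.
diff = old.compare(new)
print(diff)

# Comparing a frame with itself yields an empty result.
no_diff = old.compare(old)
print(no_diff.empty)
```

An empty result is therefore a convenient signal that two identically-labeled frames hold the same values.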
Tip: if you are using the Anaconda distribution, upgrade Pandas from the command line to version 1.1.0 or later before continuing.
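To confirm that the installed version is recent enough, one simple check (my own sketch, not part of the original article) is to print the version and test whether DataFrame.compare exists:

```python
import pandas as pd

# Print the installed Pandas version.
print(pd.__version__)

# DataFrame.compare was added in 1.1.0, so its presence is a simple proxy
# for "the installed version is recent enough".
has_compare = hasattr(pd.DataFrame, "compare")
print(has_compare)
```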
Low-hanging fruit
Always check the number of columns in the two frames first. In some cases, this simple check alone will spot issues.
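A minimal sketch of this first check follows. The helper name `same_column_count` and the sample frames are hypothetical.

```python
import pandas as pd

def same_column_count(df1, df2):
    """Return True when both frames have the same number of columns."""
    # shape is (number_of_rows, number_of_columns)
    return df1.shape[1] == df2.shape[1]

# Hypothetical sample data: the second frame has one extra column.
former = pd.DataFrame({"order_id": [1, 2], "amount": [10.0, 20.0]})
latest = pd.DataFrame({"order_id": [1, 2], "amount": [10.0, 20.0], "country": ["FR", "DE"]})

columns_match = same_column_count(former, latest)
print(columns_match)
```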
In certain scenarios, such as an enrichment change, the two frames can legitimately have a different number of columns. The definition of qualification could then be: every column of the former frame has the same values in both data frames. We therefore identify the columns to check and save them in the variable Columns for later use.
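One way to collect the columns to check is sketched below. It assumes the enriched frame keeps the former column names unchanged. The frames are made up, and the variable is spelled `columns` in lowercase to follow Python naming conventions.

```python
import pandas as pd

# Hypothetical "former" frame and an enriched version with one extra column.
former = pd.DataFrame({"user_id": [1, 2], "order_id": [10, 11], "amount": [5.0, 7.5]})
enriched = former.assign(country=["FR", "DE"])

# Columns of the former frame that still exist after enrichment:
# these are the ones whose values must match.
columns = [c for c in former.columns if c in enriched.columns]

# Former columns that disappeared are worth reporting before comparing values.
missing = [c for c in former.columns if c not in enriched.columns]

print(columns, missing)
```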
Keys to unlock
In real applications, we have various ids that identify a record, such as user_id, order_id, etc. To make a unique query, we may need a combination of these keys. Ultimately, we want to verify that records with the same keys have the same column values.
The first step is to compose the key combination. This is where DataFrame.apply shines. We can use df.apply(lambda x: func(x), axis=1) to apply any row-wise transformation. With axis=1, we tell Pandas to apply the function row by row; with axis=0, it is applied column by column.
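The key composition could look like the sketch below. The id columns, the "_" separator, and the list name `key_parts` are assumptions made for illustration. Only the name keyColumn comes from the article.

```python
import pandas as pd

# Hypothetical records identified by the pair (user_id, order_id).
df = pd.DataFrame({
    "user_id": ["u1", "u2"],
    "order_id": [100, 101],
    "amount": [9.9, 5.0],
})
key_parts = ["user_id", "order_id"]

# Cast the id columns to strings, then join them row by row (axis=1)
# into a single composite key.
df["keyColumn"] = df[key_parts].astype(str).apply(
    lambda row: "_".join(row), axis=1
)
print(df["keyColumn"].tolist())
```

Casting to strings first keeps the join from failing on numeric ids. Choose a separator that cannot appear inside the ids themselves, otherwise two different key combinations could produce the same string.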
Handle the ValueError
For the new DataFrame.compare function, the following error is the most confusing one. Let me try to explain.
ValueError: Can only compare identically-labeled DataFrame objects
This error is raised when the two data frames are not identically labeled, that is, when their shape, row index, or column order differs. In other words, DataFrame.compare on its own only works for identity checks, not for qualification checks.
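The error can be reproduced with two small frames. The helper `compare_error` and the sample data below are hypothetical. The exact wording of the message varies slightly between Pandas versions, so only the exception type is relied upon here.

```python
import pandas as pd

def compare_error(df1, df2):
    """Return the ValueError message raised by compare(), or None if it succeeds."""
    try:
        df1.compare(df2)
    except ValueError as exc:
        return str(exc)
    return None

left = pd.DataFrame({"order_id": [1, 2], "amount": [10.0, 20.0]})

# Case 1: different shape (one extra row).
longer = pd.DataFrame({"order_id": [1, 2, 3], "amount": [10.0, 20.0, 30.0]})
shape_message = compare_error(left, longer)

# Case 2: same data, but the columns are in a different order.
reordered = left[["amount", "order_id"]]
order_message = compare_error(left, reordered)

print(shape_message)
print(order_message)
```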
The way to solve the issue: using the keyColumn created earlier, compare the subsets of the two DataFrames that share the same keyColumn value, and repeat this for each keyColumn value.
If the subsets for a keyColumn value have different dimensions in the two DataFrames, report the issue and skip the check for that value.
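Putting these steps together, a per-key check might be sketched as follows. This is one possible reading of the approach, not the author's original code. The function name `qualify`, the issue labels, and the sample frames are hypothetical. Resetting the row index of each subset is my own addition, so that compare sees identically labeled frames.

```python
import pandas as pd

def qualify(df1, df2, columns, key="keyColumn"):
    """Compare df1 and df2 key by key on the given columns.

    Returns (passed, issues), where issues is a list of (key_value, reason).
    """
    issues = []
    # Visit every key value that appears in either frame.
    for value in pd.unique(pd.concat([df1[key], df2[key]])):
        # Same columns in the same order, fresh 0..n-1 row index on both sides.
        sub1 = df1.loc[df1[key] == value, columns].reset_index(drop=True)
        sub2 = df2.loc[df2[key] == value, columns].reset_index(drop=True)
        if sub1.shape != sub2.shape:
            # Different dimensions: report and skip, compare() would raise.
            issues.append((value, "dimension mismatch"))
            continue
        if not sub1.compare(sub2).empty:
            issues.append((value, "value mismatch"))
    return len(issues) == 0, issues

# Hypothetical data: key "b" is duplicated in df2, key "c" has a changed value.
df1 = pd.DataFrame({"keyColumn": ["a", "b", "c"], "amount": [1.0, 2.0, 3.0]})
df2 = pd.DataFrame({
    "keyColumn": ["a", "b", "b", "c"],
    "amount": [1.0, 2.0, 2.0, 4.0],
    "country": ["FR", "DE", "DE", "IT"],
})

passed, issues = qualify(df1, df2, columns=["amount"])
print(passed, issues)
```

When a key maps to several rows, this sketch compares them in their existing order, so sort both frames consistently beforehand if row order is not meaningful.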
Takeaway
Use DataFrame.compare, introduced in Pandas 1.1.0, to do robust DataFrame qualification checks. To deal with the ValueError, we use keyColumn to run a check on each pair of sub-DataFrames and combine the results into the final decision.
Translated from: https://towardsdatascience.com/robust-2-dataframes-verification-with-pandas-1-1-0-af22f328e622