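Polars is a data-processing library for Python. For an introduction, see the official website, or the CSDN blog post "Pandas有了平替Polars" ("Pandas now has a drop-in alternative: Polars").
Basic Polars operations
1. Series and DataFrame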
import polars as pl

# Create a Polars DataFrame
data = {
    "A": [1, 2, 3, 4, 5],
    "B": ["a", "b", "c", "d", "e"],
    "C": [True, False, True, False, True],
    "D": [1.1, 2.2, 3.3, 4.4, 5.5],
}
df = pl.DataFrame(data)

# Create a Polars Series
series = pl.Series("E", [6, 7, 8, 9, 10])

# View the first few rows of the DataFrame
print(df.head())

# Add a new column (with_columns replaces the older with_column)
df = df.with_columns((pl.col("A") * 2).alias("A_doubled"))

# Select specific columns
selected_cols = df.select(["A", "B"])
print(selected_cols)

# Filter rows where C is True
filtered_df = df.filter(pl.col("C"))
print(filtered_df)

# Sort in descending order (recent Polars versions use descending= instead of reverse=)
sorted_df = df.sort("D", descending=True)
print(sorted_df)

# Arithmetic between a Series and a DataFrame column
new_series = series + df["A"]
print(new_series)

# Add the Series to the DataFrame as column F
df_with_series = df.with_columns(new_series.rename("F"))
print(df_with_series)
For comparison, here is Pandas code that does much the same:
import pandas as pd

# Create a Pandas DataFrame
data = {
    "A": [1, 2, 3, 4, 5],
    "B": ["a", "b", "c", "d", "e"],
    "C": [True, False, True, False, True],
    "D": [1.1, 2.2, 3.3, 4.4, 5.5],
}
df = pd.DataFrame(data)

# Create a Pandas Series
series = pd.Series([6, 7, 8, 9, 10], name="E")

# View the first few rows of the DataFrame
print(df.head())

# Add a new column
df["A_doubled"] = df["A"] * 2

# Select specific columns
selected_cols = df[["A", "B"]]
print(selected_cols)

# Filter rows where C is True
filtered_df = df[df["C"]]
print(filtered_df)

# Sort in descending order
sorted_df = df.sort_values("D", ascending=False)
print(sorted_df)

# Arithmetic between a Series and a DataFrame column
new_series = series + df["A"]
print(new_series)

# Add the Series to the DataFrame as column F
df_with_series = df.assign(F=new_series)
print(df_with_series)
2. Expressions
Expressions in Polars let you perform complex computations and transformations on DataFrame columns. Consider the following example:
import polars as pl

# Create a Polars DataFrame
data = {
    "A": [1, 2, 3, 4, 5],
    "B": ["a", "b", "c", "d", "e"],
    "C": [True, False, True, False, True],
    "D": [1.1, 2.2, 3.3, 4.4, 5.5],
}
df = pl.DataFrame(data)

# Build a new column with an expression
expr = (
    pl.when(pl.col("C"))
    .then(pl.col("A") * 2)
    .otherwise(pl.col("A") / 2)
)
df = df.with_columns(expr.alias("New_Column"))

# View the new DataFrame
print(df)

# Filter on the new column
filtered_df = df.filter(pl.col("New_Column") > 3)
print(filtered_df)

# Aggregate the new column (group_by is the current name of the older groupby)
aggregated_df = df.group_by("B").agg(pl.col("New_Column").sum().alias("Sum"))
print(aggregated_df)
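The example above uses a when expression to create a new column based on the value of column C: where C is True, the value of column A is multiplied by 2; otherwise it is divided by 2. The new column is then filtered and aggregated.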
In Pandas the same logic is less direct. One common route, as far as my limited understanding goes, is to pair it with NumPy:
import pandas as pd
import numpy as np

# Create a Pandas DataFrame
data = {
    "A": [1, 2, 3, 4, 5],
    "B": ["a", "b", "c", "d", "e"],
    "C": [True, False, True, False, True],
    "D": [1.1, 2.2, 3.3, 4.4, 5.5],
}
df = pd.DataFrame(data)

# Build the new column with NumPy's conditional function np.where
df["New_Column"] = np.where(df["C"], df["A"] * 2, df["A"] / 2)

# View the new DataFrame
print(df)

# Filter on the new column
filtered_df = df[df["New_Column"] > 3]
print(filtered_df)

# Aggregate the new column
aggregated_df = df.groupby("B")["New_Column"].sum().reset_index(name="Sum")
print(aggregated_df)
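For what it is worth, the same column can also be built in Pandas without NumPy, using `Series.where`. This is only a sketch of one alternative, not a claim about which style is preferable:

```python
import pandas as pd

df = pd.DataFrame({
    "A": [1, 2, 3, 4, 5],
    "C": [True, False, True, False, True],
})

# Series.where keeps the original value where the condition is True
# and takes the replacement value where it is False.
df["New_Column"] = (df["A"] * 2).where(df["C"], df["A"] / 2)
print(df)
```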
3. Joining (join)
In Polars, DataFrames can be combined with the join method.
import polars as pl

# Create two Polars DataFrames
data1 = {
    "A": [1, 2, 3],
    "B": ["a", "b", "c"],
}
df1 = pl.DataFrame(data1)

data2 = {
    "C": [4, 5, 6],
    "D": ["d", "e", "f"],
}
df2 = pl.DataFrame(data2)

# Inner join on A == C
# Note: with this data no value of A equals any value of C, so the result has no rows
joined_df = df1.join(df2, left_on="A", right_on="C", how="inner")

# View the joined DataFrame
print(joined_df)
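Pandas does have a join method, but it matches on the index by default. For joining on columns, merge is the usual choice, and it looks much the same: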
import pandas as pd

# Create two Pandas DataFrames
data1 = {
    "A": [1, 2, 3],
    "B": ["a", "b", "c"],
}
df1 = pd.DataFrame(data1)

data2 = {
    "C": [3, 4, 5],
    "D": ["c", "d", "e"],
}
df2 = pd.DataFrame(data2)

# Inner join on A == C using merge
joined_df = df1.merge(df2, left_on="A", right_on="C", how="inner")

# View the joined DataFrame
print(joined_df)
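As a side note, Pandas `DataFrame.join` exists as well, but it matches on the index rather than on columns. A minimal sketch of the difference, reusing the two frames above (the names `merged` and `joined` are illustrative only):

```python
import pandas as pd

df1 = pd.DataFrame({"A": [1, 2, 3], "B": ["a", "b", "c"]})
df2 = pd.DataFrame({"C": [3, 4, 5], "D": ["c", "d", "e"]})

# merge matches on column values and keeps both key columns.
merged = df1.merge(df2, left_on="A", right_on="C", how="inner")

# join matches on the index, so the key columns are moved into the index first.
joined = df1.set_index("A").join(df2.set_index("C"), how="inner")

print(merged)
print(joined)
```

Both calls keep only the key 3 here; the main visible difference is that `joined` carries the key in its index instead of in columns A and C.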