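pandas offers several ways to iterate over rows, such as iterrows(), itertuples(), and apply, which is convenient.
polars has very powerful column operations, described in detail on its official site. Because the Arrow format underlying polars is columnar, row operations are inefficient, and the project does not recommend manipulating data row by row. Still, some scenarios call for row traversal.
How can rows be traversed in polars? This post tries an approach that does not use apply.
Scenario: polars reads a CSV file of historical stock prices containing basic market data. How can the loaded data be traversed row by row quickly? This comes up often in market-data-driven strategy backtesting.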
I. Initial approach
1. Overall plan
(1) csv => DataFrame
(2) DataFrame => into_struct, which yields a StructChunked
(3) StructChunked => iterate over its rows to build the bars
2. The Bar type
There are two options for the design of the Bar type:
(1) A value-type Bar
#[allow(dead_code)]
struct Bar {
    code: String,
    date: String,
    open: f32,
    high: f32,
    close: f32,
    low: f32,
    volume: f32,
    amount: f32,
    is_fq: bool,
}
(2) A Bar with reference fields
#[allow(dead_code)]
struct Bar2<'a> {
    code: &'a str,
    date: &'a str,
    open: f32,
    high: f32,
    close: f32,
    low: f32,
    volume: f32,
    amount: f32,
    is_fq: bool,
}
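The difference between the two designs can be shown without polars. The following is a minimal std-only sketch: `OwnedBar`, `BorrowedBar`, `parse_borrowed`, and `to_owned_bar` are hypothetical names, and the "code,close" line format is an assumption made only for illustration. The borrowed form points into the source buffer and cannot outlive it; the owned form copies the text and can.

```rust
/// Owned variant: the text field is a separate heap-allocated String.
#[derive(Debug, Clone, PartialEq)]
struct OwnedBar {
    code: String,
    close: f32,
}

/// Borrowed variant: `code` points into the source buffer, so no allocation.
#[derive(Debug, Clone, Copy, PartialEq)]
struct BorrowedBar<'a> {
    code: &'a str,
    close: f32,
}

/// Parse a hypothetical "code,close" line into the borrowed form.
/// Returns None if the comma is missing or the price is not a number.
fn parse_borrowed(line: &str) -> Option<BorrowedBar<'_>> {
    let (code, close) = line.split_once(',')?;
    let close = close.trim().parse::<f32>().ok()?;
    Some(BorrowedBar { code, close })
}

/// Copy the text out when the bar must outlive the source buffer.
fn to_owned_bar(bar: &BorrowedBar<'_>) -> OwnedBar {
    OwnedBar {
        code: bar.code.to_string(),
        close: bar.close,
    }
}
```

In a backtest, the borrowed form fits when the loaded data stays alive for the whole run; the owned form fits when bars are stored or sent elsewhere after the source is dropped.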
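II. toml
Note: polars is strict about feature flags. Every feature the program uses must be enabled explicitly, otherwise the code fails to compile. The polars documentation often does not spell this out, which makes it a pitfall.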
[package]
name = "my_duckdb"
version = "0.1.0"
edition = "2021"
# See more keys and their definitions at https://doc.rust-lang.org/cargo/reference/manifest.html
[dependencies]
polars = { version = "*", features = ["lazy","dtype-struct"] }
Note: the features list must include "dtype-struct", which the struct conversion used below depends on.
III. main.rs
Based on the design above, the complete code is as follows:
// API names below follow the polars release this post was written against.
use polars::prelude::*;
use std::time::Instant;

// Value-type bar: owns its strings.
#[allow(dead_code)]
struct Bar {
    code: String,
    date: String,
    open: f32,
    high: f32,
    close: f32,
    low: f32,
    volume: f32,
    amount: f32,
    is_fq: bool,
}

// Reference-type bar: borrows its strings from the row data.
#[allow(dead_code)]
struct Bar2<'a> {
    code: &'a str,
    date: &'a str,
    open: f32,
    high: f32,
    close: f32,
    low: f32,
    volume: f32,
    amount: f32,
    is_fq: bool,
}

fn main() {
    let time0 = Instant::now();
    // test2.csv: about 640,000 rows
    let csv = "test2.csv";
    let df = polars_lazy_read_csv(csv);
    println!("read raw csv cost time : {:?} seconds", time0.elapsed().as_secs_f32());

    let time1 = Instant::now();
    let rows = df.into_struct("bars");
    println!("dataframe => structs cost time : {:?} seconds", time1.elapsed().as_secs_f32());

    let time2 = Instant::now();
    let bars = get_vec_bars(&rows);
    println!("dataframe => bars cost time : {:?} seconds", time2.elapsed().as_secs_f32());

    let time3 = Instant::now();
    let bar2s = get_vec_bar2s(&rows);
    println!("dataframe => bar2s cost time : {:?} seconds", time3.elapsed().as_secs_f32());

    println!("bars length :{:?}", bars.len());
    println!("bar2s length:{:?}", bar2s.len());
}

// Build an owned Bar from one row; every string field is copied.
fn get_bar(row: &[AnyValue]) -> Bar {
    // A non-string value falls back to an empty string.
    let mut new_code = "";
    if let &AnyValue::Utf8(value) = row.get(0).unwrap() {
        new_code = value;
    }
    let mut new_date = "";
    if let &AnyValue::Utf8(v) = row.get(1).unwrap() {
        new_date = v;
    }
    let open = row[2].extract::<f32>().unwrap();
    let high = row[3].extract::<f32>().unwrap();
    let close = row[4].extract::<f32>().unwrap();
    let low = row[5].extract::<f32>().unwrap();
    let volume = row[6].extract::<f32>().unwrap();
    let amount = row[7].extract::<f32>().unwrap();
    // A non-boolean value falls back to false.
    let mut is_fq = false;
    if let &AnyValue::Boolean(b) = row.get(8).unwrap() {
        is_fq = b;
    }
    Bar {
        code: String::from(new_code),
        date: String::from(new_date),
        open,
        high,
        close,
        low,
        volume,
        amount,
        is_fq,
    }
}

// Build a borrowed Bar2 from one row; string fields point into the row data.
fn get_bar2<'a>(row: &'a [AnyValue]) -> Bar2<'a> {
    let mut new_code = "";
    if let &AnyValue::Utf8(value) = row.get(0).unwrap() {
        new_code = value;
    }
    let mut new_date = "";
    if let &AnyValue::Utf8(v) = row.get(1).unwrap() {
        new_date = v;
    }
    let open = row[2].extract::<f32>().unwrap();
    let high = row[3].extract::<f32>().unwrap();
    let close = row[4].extract::<f32>().unwrap();
    let low = row[5].extract::<f32>().unwrap();
    let volume = row[6].extract::<f32>().unwrap();
    let amount = row[7].extract::<f32>().unwrap();
    let mut is_fq = false;
    if let &AnyValue::Boolean(b) = row.get(8).unwrap() {
        is_fq = b;
    }
    Bar2 {
        code: new_code,
        date: new_date,
        open,
        high,
        close,
        low,
        volume,
        amount,
        is_fq,
    }
}

fn get_vec_bars(data: &StructChunked) -> Vec<Bar> {
    let mut bars = Vec::new();
    for row in data {
        bars.push(get_bar(row));
    }
    bars
}

fn get_vec_bar2s(data: &StructChunked) -> Vec<Bar2<'_>> {
    let mut bars = Vec::new();
    for row in data {
        bars.push(get_bar2(row));
    }
    bars
}

// Read the CSV lazily and collect it into a DataFrame.
fn polars_lazy_read_csv(filepath: &str) -> DataFrame {
    let polars_lazy_csv_time = Instant::now();
    let p = LazyCsvReader::new(filepath)
        .has_header(true)
        .finish()
        .unwrap();
    let df = p.collect().expect("error to dataframe!");
    println!("polars lazy csv rows and columns: {:?}", df.shape());
    println!("polars lazy csv read time: {:?} seconds!", polars_lazy_csv_time.elapsed().as_secs_f32());
    df
}
IV. Output and comparison
The test file is a CSV with about 640,000 rows and 9 columns, which has to be traversed and converted to a Vec<Bar>.
1. The output is as follows:
polars lazy csv rows and columns: (640710, 9)
polars lazy csv read time: 0.058484446 seconds!
read raw csv cost time : 0.058487203 seconds
dataframe => structs cost time : 2.8842e-5 seconds
dataframe => bars cost time : 0.131985 seconds
dataframe => bar2s cost time : 0.10357016 seconds
bars length :640710
bar2s length:640710
Overall, the DataFrame-to-struct step is cheap (about 29 microseconds here). Most of the time goes into converting the StructChunked into bars.
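2. Value-type Bar versus reference-type Bar
The output shows that the reference-type Bar is faster: about 0.104 s versus 0.132 s in this single run, roughly 20% less time. The likely reason is that it avoids the heap allocations needed to copy the two string fields of every row.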
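The percentage can be checked from the two timings printed in the output. The sketch below is illustrative only: `timed` and `relative_saving` are hypothetical helper names, not part of the program above. `timed` wraps the `Instant::now()` / `elapsed()` pattern the program already uses, and `relative_saving` does the arithmetic.

```rust
use std::time::{Duration, Instant};

/// Run `f` once and return its result together with the elapsed wall-clock time.
fn timed<T>(f: impl FnOnce() -> T) -> (T, Duration) {
    let start = Instant::now();
    let value = f();
    (value, start.elapsed())
}

/// Fraction of `old_secs` that is saved when the work takes `new_secs` instead.
fn relative_saving(old_secs: f64, new_secs: f64) -> f64 {
    (old_secs - new_secs) / old_secs
}
```

With the figures above, `relative_saving(0.131985, 0.10357016)` is about 0.215, so "roughly 20%" is a fair summary. Because each number comes from a single run, repeating the measurement several times would give a more reliable comparison.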
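V. Other notes
So far I have not found a pandas-style row iteration API in polars. I will keep following this.
In addition, converting the DataFrame to bars is not very efficient, and I hope to find a faster alternative.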
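One direction that might be worth trying is to skip the per-cell `AnyValue` step and instead walk the typed columns in lockstep. The sketch below models that idea with plain slices and calls no polars API: `BarView` and `rows_from_columns` are hypothetical names, and only three of the nine fields are shown. Whether the equivalent polars code is actually faster has not been measured here.

```rust
/// Row view that borrows its text field from the column buffer.
#[derive(Debug, PartialEq)]
struct BarView<'a> {
    code: &'a str,
    open: f32,
    close: f32,
}

/// Build row views by zipping equally long columns together.
/// Returns None if the column lengths differ.
fn rows_from_columns<'a>(
    codes: &'a [String],
    opens: &[f32],
    closes: &[f32],
) -> Option<Vec<BarView<'a>>> {
    if codes.len() != opens.len() || codes.len() != closes.len() {
        return None;
    }
    let rows = codes
        .iter()
        .zip(opens)
        .zip(closes)
        .map(|((code, &open), &close)| BarView {
            code: code.as_str(),
            open,
            close,
        })
        .collect();
    Some(rows)
}
```

Each column is read sequentially, which matches the columnar layout, and the numeric fields are copied without any type check per cell.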