arfs: an all-in-one feature selection package (Python)

1. Introduction

arfs documentation: https://arfs.readthedocs.io/en/latest/Introduction.html
The author's own introduction, quoted from the documentation:

All relevant feature selection means trying to find all features carrying information usable for prediction, rather than finding a possibly compact subset of features on which some particular model has a minimal error. This might include redundant predictors. All relevant feature selection is model agnostic in the sense that it doesn’t optimize a scoring function for a specific model but rather tries to select all the predictors which are related to the response. This package implements 3 different methods (Leshy is an evolution of Boruta, BoostAGroota is an evolution of BoostARoota and GrootCV is a new one). They are sklearn compatible. See hereunder for details about those methods. You can use any sklearn compatible estimator with Leshy and BoostAGroota but I recommend lightGBM. It’s fast, accurate and has SHAP values builtin.
It also provides a module for performing preprocessing and perform basic feature selection (autobinning, remove columns with too many missing values, zero variance, high-cardinality, highly correlated, etc.).
Moreover, as an alternative to the all relevant problem, the ARFS package provides a MRmr feature selection which, theoretically, returns a subset of the predictors selected by an arfs method. ARFS also provides a LASSO feature selection which works especially well for (G)LMs and GAMs. You can combine Lasso with the TreeDiscretizer for introducing non-linearities into linear models and perform feature selection. Please note that one limitation of the lasso is that it treats the levels of a categorical predictor individually. However, this issue can be addressed by utilizing the TreeDiscretizer, which automatically bins numerical variables and groups the levels of categorical variables.
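
To make the MRMR (minimum redundancy, maximum relevance) idea mentioned in the quote more concrete, here is a minimal pure-Python sketch. It is not the arfs implementation: the function names `abs_pearson` and `mrmr` are hypothetical, and absolute Pearson correlation is used as a stand-in for both relevance and redundancy (real MRMR variants often use mutual information or F-statistics). At each step the greedy loop picks the feature with the highest relevance to the target minus its average redundancy with the features already selected.

```python
def abs_pearson(xs, ys):
    """Absolute Pearson correlation between two equally long sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0
    return abs(cov) / (vx * vy) ** 0.5


def mrmr(columns, y, k):
    """Greedily select k features: high relevance to y, low redundancy."""
    relevance = {name: abs_pearson(values, y) for name, values in columns.items()}
    selected = []
    candidates = list(columns)
    while candidates and len(selected) < k:

        def score(name):
            if not selected:
                return relevance[name]
            redundancy = sum(
                abs_pearson(columns[name], columns[s]) for s in selected
            ) / len(selected)
            return relevance[name] - redundancy

        best = max(candidates, key=score)
        selected.append(best)
        candidates.remove(best)
    return selected


# Toy data: y = 2*a + b; a_dup is a near copy of a; c is unrelated to y.
columns = {
    "a": [0, 0, 0, 0, 1, 1, 1, 1],
    "a_dup": [0, 0, 0, 0, 1, 1, 1, 0.8],
    "b": [0, 0, 1, 1, 0, 0, 1, 1],
    "c": [0, 1, 0, 1, 0, 1, 0, 1],
}
y = [0, 0, 1, 1, 2, 2, 3, 3]

selected = mrmr(columns, y, k=2)
print(selected)
```

In this toy example a ranking by relevance alone would put `a_dup` second, while the redundancy penalty makes the sketch prefer `b`, which carries information that `a` does not.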

2. Package features

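After a quick look at this package, the first phrase that comes to mind is "all-in-one" for feature selection.
First, it includes currently popular variable selection methods such as the Boruta algorithm, and it can compute feature importance in several ways: naive (native) importance, SHAP importance, and permutation importance. Users who are used to selecting variables with Boruta therefore get more options.
Second, the author also presents algorithms other than Boruta, namely BoostAGroota and GrootCV. The author seems to favor GrootCV, describing it as fast because it relies on lightgbm and as more accurate because it uses SHAP importance.
Finally, the package also includes more traditional methods such as Lasso regression and correlation-based selection.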
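
The shadow-feature idea at the heart of Boruta can be sketched in pure Python as follows. This is a simplified illustration, not the arfs or Boruta implementation: the names `abs_pearson` and `shadow_hits` are hypothetical, absolute Pearson correlation stands in for the importance measure (the place where naive, SHAP, or permutation importance would plug in), and a plain hit count replaces the statistical test that Boruta applies across iterations. Each real feature is compared against the best of the "shadow" features, which are shuffled copies that cannot carry any real signal.

```python
import random


def abs_pearson(xs, ys):
    """Absolute Pearson correlation, used here as a stand-in importance score."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0
    return abs(cov) / (vx * vy) ** 0.5


def shadow_hits(columns, y, n_trials=20, seed=0):
    """Count, per feature, how often it beats the best shadow feature."""
    rng = random.Random(seed)
    hits = {name: 0 for name in columns}
    for _ in range(n_trials):
        # Shadow features: shuffled copies that keep each column's
        # distribution but break any link with the target.
        shadow_scores = []
        for values in columns.values():
            shadow = values[:]
            rng.shuffle(shadow)
            shadow_scores.append(abs_pearson(shadow, y))
        threshold = max(shadow_scores)
        for name, values in columns.items():
            if abs_pearson(values, y) > threshold:
                hits[name] += 1
    return hits


# Toy data: one informative column and two pure-noise columns.
data_rng = random.Random(42)
n_rows = 200
signal = [data_rng.gauss(0, 1) for _ in range(n_rows)]
columns = {
    "signal": signal,
    "noise_a": [data_rng.gauss(0, 1) for _ in range(n_rows)],
    "noise_b": [data_rng.gauss(0, 1) for _ in range(n_rows)],
}
y = [s + 0.1 * data_rng.gauss(0, 1) for s in signal]

hits = shadow_hits(columns, y)
print(hits)
```

A feature that beats the shadows in every trial, as the informative column does here, is a strong candidate to keep. Features with only occasional hits are the ones a proper statistical test would have to decide on.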

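# In a Jupyter notebook, put %%time on the first line of the cell to time it.
# Import paths follow the arfs tutorials; adjust them if your version differs.
import matplotlib.pyplot as plt
import arfs.feature_selection.allrelevant as arfsgroot
from arfs.benchmark import highlight_tick

# X (a DataFrame of predictors) and y (the target) are assumed to be loaded already.

# GrootCV
feat_selector = arfsgroot.GrootCV(
    objective="binary", cutoff=1, n_folds=5, n_iter=5, silent=True, fastshap=False
)
feat_selector.fit(X, y, sample_weight=None)
print(f"The selected features: {feat_selector.get_feature_names_out()}")
print(f"The agnostic ranking: {feat_selector.ranking_}")
print(f"The naive ranking: {feat_selector.ranking_absolutes_}")
fig = feat_selector.plot_importance(n_feat_per_inch=5)

# highlight synthetic random variable
fig = highlight_tick(figure=fig, str_match="random")
fig = highlight_tick(figure=fig, str_match="genuine", color="green")
plt.show()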

Another feature is that the code is concise and works well with scikit-learn, so the selectors can be used inside a Pipeline:

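from sklearn.base import clone
from sklearn.pipeline import Pipeline
import arfs.feature_selection.allrelevant as arfsgroot
from arfs.feature_selection import (
    CollinearityThreshold,
    MissingValueThreshold,
    UniqueValuesThreshold,
)

# `model` is assumed to be an sklearn-compatible estimator defined earlier,
# and X, y the predictors and target.
model = clone(model)

# Leshy/Boruta
# feat_selector = arfsgroot.Leshy(model, n_estimators=50, verbose=1, max_iter=10, random_state=42, importance="shap")

# BoostAGroota
feat_selector = arfsgroot.BoostAGroota(
    estimator=model, cutoff=1, iters=10, max_rounds=10, delta=0.1, importance="shap"
)

# GrootCV
# feat_selector = arfsgroot.GrootCV(objective="binary", cutoff=1, n_folds=5, n_iter=5, silent=True)

# Basic filters first, then the all-relevant selector as the last step.
arfs_fs_pipeline = Pipeline(
    [
        ("missing", MissingValueThreshold(threshold=0.05)),
        ("unique", UniqueValuesThreshold(threshold=1)),
        ("collinearity", CollinearityThreshold(threshold=0.85)),
        ("arfs", feat_selector),
    ]
)
X_trans = arfs_fs_pipeline.fit(X=X, y=y).transform(X=X)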
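
To give an intuition for what the three basic filtering steps in the pipeline do, here is a conceptual sketch in pure Python. It does not reproduce the arfs implementation: `pearson`, `missing_fraction`, and `basic_filters` are hypothetical names, the strict or non-strict comparisons at each threshold are assumptions, and plain Pearson correlation stands in for whatever association measures the library uses. The thresholds mirror the values passed in the pipeline (0.05, 1, 0.85).

```python
def pearson(xs, ys):
    """Pearson correlation on the rows where both values are present."""
    pairs = [(x, y) for x, y in zip(xs, ys) if x is not None and y is not None]
    n = len(pairs)
    if n < 2:
        return 0.0
    mx = sum(x for x, _ in pairs) / n
    my = sum(y for _, y in pairs) / n
    cov = sum((x - mx) * (y - my) for x, y in pairs)
    vx = sum((x - mx) ** 2 for x, _ in pairs)
    vy = sum((y - my) ** 2 for _, y in pairs)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / (vx * vy) ** 0.5


def missing_fraction(values):
    """Share of entries that are missing (None)."""
    return sum(v is None for v in values) / len(values)


def basic_filters(columns, max_missing=0.05, min_unique=1, max_corr=0.85):
    """Return the names of the columns that survive the three filters."""
    kept = []
    for name, values in columns.items():
        # 1. Too many missing values.
        if missing_fraction(values) > max_missing:
            continue
        # 2. Constant (or nearly constant) column.
        if len({v for v in values if v is not None}) <= min_unique:
            continue
        # 3. Too strongly correlated with a column that was already kept.
        if any(abs(pearson(values, columns[k])) > max_corr for k in kept):
            continue
        kept.append(name)
    return kept


columns = {
    "x": [1, 2, 3, 4, 5],
    "x_scaled": [2.1, 3.9, 6.0, 8.1, 9.9],  # almost a linear copy of x
    "z": [3, 1, 4, 1, 5],
    "constant": [7, 7, 7, 7, 7],
    "sparse": [None, None, 2, None, 1],
}

kept = basic_filters(columns)
print(kept)
```

Running cheap filters like these before the all-relevant selector reduces the number of columns the more expensive step has to evaluate.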