Feature Engineering (I): Exploratory Data Analysis

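A saying circulates widely in the industry: data and features determine the upper limit of machine learning, while models and algorithms merely approximate that limit. Feature engineering therefore plays a very important role in machine learning; in practice, it is arguably the key to success.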

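Feature engineering is the most time- and effort-consuming part of data analysis. Unlike algorithms and models, it does not follow fixed steps; it depends largely on engineering experience and trade-offs, so there is no single unified method. This article only summarizes some commonly used approaches.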

Feature engineering covers sub-problems such as data preprocessing, feature extraction, feature selection, and feature construction.

Dataset description

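This project uses the Home Credit Default Risk dataset from Kaggle, a standard machine learning classification problem. The goal is to predict whether a client will default, using historical loan information together with the client's socio-economic and financial information.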

The dataset includes 8 different data files:

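application_{train|test}: the main files, containing each client's socio-economic information and the Home Credit loan application data. Each row represents one loan application, uniquely identified by SK_ID_CURR. The training set has about 307,500 rows and the test set about 48,700 rows. In the training set, TARGET=1 means the loan was not repaid. These two files are sufficient for basic data analysis and modeling on this task, and they are the main focus of this article.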

Jupyter Notebook script: feature_engineering_demo_p1_EDA

Exploratory Data Analysis (EDA)

Data overview

Import the required packages

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import warnings

# Setting configuration.
warnings.filterwarnings('ignore')
sns.set_style('whitegrid')
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)
pd.options.display.max_colwidth = 100

SEED = 42

Load the data

df = pd.read_csv('../datasets/Home-Credit-Default-Risk/application_train.csv')
print(df.head())

print('Training data shape: ', df.shape)

# `SK_ID_CURR` is the unique id of the row.
id_col = "SK_ID_CURR"
df[id_col].nunique() == df.shape[0]

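When a dataset has a large number of columns, we usually start from the distribution of data types and check how many columns of each type there are.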

# Number of each type of column
print(df.dtypes.value_counts())

print("\nCategorical features:")
print(df.select_dtypes(["object"]).columns.tolist())
print("\nNumeric features:")
print(df.select_dtypes("number").columns.tolist())
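Beyond counting columns per dtype, it can help to look at how many distinct values each categorical column has, since that tends to influence the choice of encoding. The following is a minimal sketch on a small made-up frame; the column names and values are illustrative assumptions, not taken from the Home Credit files.

```python
import pandas as pd

# Hypothetical toy data standing in for a few application columns.
toy = pd.DataFrame({
    "CONTRACT_TYPE": ["Cash", "Revolving", "Cash", "Cash"],
    "GENDER": ["F", "M", "F", "F"],
    "INCOME": [1000.0, 2500.0, 1800.0, 3000.0],
})

# Number of distinct values in each non-numeric (categorical) column.
cardinality = toy.select_dtypes(exclude="number").nunique()
print(cardinality.sort_values(ascending=False))
```

Low-cardinality columns are usually easy to one-hot encode, while high-cardinality ones may call for a different treatment.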

Next, look at the summary statistics of the dataset.

# df.info() prints its report directly and returns None, so no print() is needed.
df.info()

print(df.describe())
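The output of info() and describe() already hints at missing values through the non-null counts. A per-column summary of missing values is a common complement at this stage. The sketch below uses a hypothetical helper, missing_values_table, on toy data; it is one possible way to build such a summary, not part of the original notebook.

```python
import numpy as np
import pandas as pd

def missing_values_table(frame):
    """Return count and percentage of missing values per column, largest first."""
    n_missing = frame.isna().sum()
    table = pd.DataFrame({
        "n_missing": n_missing,
        "pct_missing": 100 * n_missing / len(frame),
    })
    # Keep only columns that actually have missing values.
    return table[table["n_missing"] > 0].sort_values("pct_missing", ascending=False)

# Toy data with gaps in two of the three columns.
toy = pd.DataFrame({
    "a": [1.0, np.nan, 3.0, np.nan],
    "b": ["x", "y", None, "z"],
    "c": [1, 2, 3, 4],
})
missing = missing_values_table(toy)
print(missing)
```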

Data correlation

Examining the pairwise correlation between features helps guide how the features should be processed.

# The correlation matrix
corrmat = df.corr(numeric_only=True)

# Upper triangle of correlations (np.bool was removed from NumPy; use the builtin bool)
upper = corrmat.where(np.triu(np.ones(corrmat.shape), k=1).astype(bool))

# Absolute value correlation
corr_abs = upper.unstack().abs().sort_values(ascending=False).dropna()
corr_abs.head(20)

fig, axs = plt.subplots(figsize=(10, 10))
sns.heatmap(corrmat, vmax=0.9, square=True)
axs.set_title('Correlations', size=15)
plt.show()
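One way to act on the correlation matrix is to list feature pairs whose absolute correlation exceeds a threshold, as candidates for dropping or merging. The sketch below uses synthetic data, and the 0.9 cut-off is an assumed value that would need tuning for a real project.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
x = rng.normal(size=200)
toy = pd.DataFrame({
    "x": x,
    "x_scaled": 2 * x + rng.normal(scale=0.01, size=200),  # near-duplicate of x
    "noise": rng.normal(size=200),
})

corr = toy.corr().abs()
# Keep only the strict upper triangle so each pair appears once.
mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
pairs = corr.where(mask).stack()

threshold = 0.9  # assumed cut-off
high_pairs = pairs[pairs > threshold]
print(high_pairs)
```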

Correlation with the target variable

Continuous Features

# Correlation map to see how features are correlated with target.
target = "TARGET"
corrmat[target].abs().sort_values(ascending=False).head(20)

cont_features = ['EXT_SOURCE_1', 'EXT_SOURCE_2', 'EXT_SOURCE_3']

fig = plt.figure(figsize=(5, 8))
plt.subplots_adjust(right=1.5)

for i, source in enumerate(cont_features):
    # Distribution of feature in dataset
    ax = fig.add_subplot(3, 1, i + 1)
    sns.kdeplot(x=source, data=df, hue=target, common_norm=False, fill=True, ax=ax)
    # Label the plots
    plt.title(f'Distribution of {source} by Target Value')
    plt.xlabel(f'{source}')
    plt.ylabel('Density')

plt.tight_layout(h_pad=2.5)

cont_features = ['EXT_SOURCE_1', 'EXT_SOURCE_2', 'EXT_SOURCE_3']

for source in cont_features:
    # Calculate the correlation coefficient between the variable and the target
    corr = df[target].corr(df[source])
    # Calculate medians for repaid vs not repaid
    median_repaid = df.loc[df[target] == 0, source].median()
    median_not_repaid = df.loc[df[target] == 1, source].median()
    # Print out the correlation
    print('\nThe correlation between %s and the TARGET is %0.4f' % (source, corr))
    # Print out the median values
    print('Median value for loan that was not repaid = %0.4f' % median_not_repaid)
    print('Median value for loan that was repaid =     %0.4f' % median_repaid)

Categorical Features

cat_features = df.select_dtypes(["object"]).columns

fig = plt.figure(figsize=(16, 16))
for i, feature in enumerate(cat_features):
    ax = fig.add_subplot(4, 4, i + 1)
    sns.countplot(x=feature, data=df, hue=df[target].map(lambda x: str(x)), fill=True, ax=ax)
    ax.set_xlabel(feature)
    ax.set_ylabel('App Count')
    ax.legend(loc='upper center')
plt.show()

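As the plots show, some categorical features are strongly associated with the target variable. They can be one-hot encoded, or combined into new ordinal categorical variables.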
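The two encoding options can be sketched with pandas alone. The column names, category values, and the ordering of education levels below are illustrative assumptions rather than the dataset's actual contents.

```python
import pandas as pd

# Hypothetical toy data.
toy = pd.DataFrame({
    "CONTRACT_TYPE": ["Cash", "Revolving", "Cash"],
    "EDUCATION": ["Secondary", "Higher", "Lower secondary"],
})

# One-hot encode a nominal feature.
one_hot = pd.get_dummies(toy["CONTRACT_TYPE"], prefix="CONTRACT_TYPE")

# Map an ordinal feature to ordered integer codes (the order is an assumption).
education_order = ["Lower secondary", "Secondary", "Higher"]
education_code = pd.Categorical(
    toy["EDUCATION"], categories=education_order, ordered=True
).codes

print(one_hot.columns.tolist())
print(education_code.tolist())
```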

Target variable distribution

For classification problems, the target variable is typically encoded numerically; sklearn provides the following utilities:

sklearn.preprocessing	Preprocessing
LabelEncoder	Integer encoding of the target variable
LabelBinarizer	Binarization of a two-class target
MultiLabelBinarizer	Binarization of a multi-label target
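A short sketch of how the three encoders behave on made-up labels (the label strings are illustrative assumptions):

```python
from sklearn.preprocessing import LabelEncoder, LabelBinarizer, MultiLabelBinarizer

labels = ["repaid", "default", "repaid", "default"]

# LabelEncoder maps classes to integers 0..n_classes-1, in sorted class order.
le = LabelEncoder()
encoded = le.fit_transform(labels)

# LabelBinarizer yields a single 0/1 column for a two-class target.
binarized = LabelBinarizer().fit_transform(labels)

# MultiLabelBinarizer handles samples that carry several labels at once.
mlb = MultiLabelBinarizer()
multi = mlb.fit_transform([{"car", "house"}, {"car"}, set()])

print(list(le.classes_), encoded)
print(binarized.ravel())
print(list(mlb.classes_), multi, sep="\n")
```

Because classes are sorted alphabetically, "default" becomes 0 and "repaid" becomes 1 here, so it is worth checking classes_ before interpreting the encoded values.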

Check the target variable distribution

# `TARGET` is the target variable we are trying to predict (0 or 1):
# 1 = Not Repaid 
# 0 = Repaid
target = 'TARGET'
print(f"percentage of default : {df[target].mean():.2%}")
print(df[target].value_counts())

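In practice, class imbalance is common. When the imbalance ratio (majority class vs. minority class) is severe, a classifier tends to make predictions that favor the majority class. Common remedies include class weighting, resampling, data augmentation, modified loss functions, ensemble methods, and choosing appropriate evaluation metrics.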

Method	Function	Python package
SMOTE	SMOTE	imblearn.over_sampling
ADASYN	ADASYN	imblearn.over_sampling
Bagging	BalancedBaggingClassifier	imblearn.ensemble
Boosting	EasyEnsembleClassifier	imblearn.ensemble
Loss function	Focal Loss	user-defined

Most of these methods, such as resampling and ensemble models, can be implemented with the imbalanced-learn Python library.
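The class-weighting approach does not require an extra library. As a minimal sketch on synthetic data (the 900/100 class split and the features are made up), sklearn's "balanced" heuristic weights each class inversely to its frequency:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.utils.class_weight import compute_class_weight

rng = np.random.default_rng(0)
# Synthetic imbalanced data: 900 negatives, 100 positives.
y = np.array([0] * 900 + [1] * 100)
X = rng.normal(size=(1000, 3)) + y[:, None]  # positives are shifted by 1

# 'balanced' weights are n_samples / (n_classes * count_per_class).
weights = compute_class_weight("balanced", classes=np.array([0, 1]), y=y)
print(dict(zip([0, 1], weights)))

# Many sklearn classifiers accept the same heuristic via class_weight.
clf = LogisticRegression(class_weight="balanced").fit(X, y)
```

Here the minority class receives a weight of 5.0 and the majority class about 0.56, so errors on defaults count roughly nine times as much during training.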

For a regression task, suppose the prediction target is the client's credit amount, AMT_CREDIT.

df['AMT_CREDIT'].describe()

We plot the distribution and the QQ plot of AMT_CREDIT.

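A quantile-quantile (QQ) plot is a common statistical graphic for comparing two distributions. For a normality check, it is a scatter plot with the quantiles of the standard normal distribution on the horizontal axis and the ordered sample values on the vertical axis. If the points lie close to a straight line, the data are approximately normal; the slope of that line estimates the standard deviation and the intercept estimates the mean.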
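The slope and intercept interpretation can be checked numerically. The sketch below draws a synthetic normal sample with an assumed mean of 5 and standard deviation of 2, then reads the fitted line returned by scipy.stats.probplot without plotting anything.

```python
import numpy as np
from scipy.stats import probplot

rng = np.random.default_rng(0)
sample = rng.normal(loc=5.0, scale=2.0, size=5000)

# probplot returns the QQ points and the least-squares line through them.
(theoretical_q, ordered_values), (slope, intercept, r) = probplot(sample, dist="norm")
print(f"slope={slope:.2f}, intercept={intercept:.2f}, r={r:.4f}")
```

The fitted slope should land close to 2 and the intercept close to 5, with r near 1 indicating that the points follow a straight line.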

import matplotlib.pyplot as plt
import seaborn as sns
from scipy.stats import probplot, norm

def norm_comparison_plot(series):
    series = pd.Series(series).dropna()
    mu, sigma = norm.fit(series)
    kurt, skew = series.kurt(), series.skew()
    print(f"Kurtosis: {kurt:.2f}", f"Skewness: {skew:.2f}", sep='\t')

    fig = plt.figure(figsize=(10, 4))
    # Histogram with KDE (sns.distplot is deprecated), plus the fitted normal density
    ax1 = fig.add_subplot(121)
    ax1.set_title('Distribution')
    ax1.set_ylabel('Density')
    sns.histplot(series, stat='density', kde=True, ax=ax1)
    xs = np.linspace(series.min(), series.max(), 200)
    ax1.plot(xs, norm.pdf(xs, mu, sigma), color='black',
             label=rf'Normal dist. ($\mu=$ {mu:.2f} and $\sigma=$ {sigma:.2f})')
    ax1.legend(loc='best')
    # QQ plot against the normal distribution
    ax2 = fig.add_subplot(122)
    probplot(series, plot=ax2)

norm_comparison_plot(df['AMT_CREDIT'])
plt.show()

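The distribution of AMT_CREDIT is skewed. Many regression algorithms rely on normality assumptions, so we try a log transform to bring the data closer to a normal distribution.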

norm_comparison_plot(np.log1p(df['AMT_CREDIT']))
plt.show()

After the log transform, the distribution is much closer to normal.
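The effect of the log transform on skewness can also be seen numerically. This sketch uses a synthetic log-normal sample as a stand-in for a right-skewed amount column; the parameters are assumptions chosen only for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Synthetic right-skewed amounts (log-normal), not the real AMT_CREDIT values.
amounts = pd.Series(rng.lognormal(mean=12.0, sigma=0.8, size=10_000))

skew_raw = amounts.skew()
skew_log = np.log1p(amounts).skew()
print(f"skewness before: {skew_raw:.2f}, after log1p: {skew_log:.2f}")
```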

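TransformedTargetRegressor in sklearn.compose is designed for transforming the target in regression tasks. For simple transformations, it transforms the target variable before fitting the regression model, and at prediction time maps the predictions back to the original scale through the inverse transformation.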

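from sklearn.compose import TransformedTargetRegressor
from sklearn.linear_model import LinearRegression
# inverse_func is required so that predictions are mapped back to the original scale
reg = TransformedTargetRegressor(regressor=LinearRegression(), func=np.log1p, inverse_func=np.expm1)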
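To see the round trip end to end, here is a small self-contained sketch on synthetic data. The features, coefficients, and noise level are made up, and the target is constructed so that log1p(y) is linear in X.

```python
import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 2))
# Synthetic right-skewed target: log1p(y) = 3.0 + 0.5*x0 - 0.3*x1 + noise.
log_y = 3.0 + 0.5 * X[:, 0] - 0.3 * X[:, 1] + rng.normal(scale=0.1, size=500)
y = np.expm1(log_y)

demo_reg = TransformedTargetRegressor(
    regressor=LinearRegression(), func=np.log1p, inverse_func=np.expm1
)
demo_reg.fit(X, y)           # the inner model is fitted on log1p(y)
pred = demo_reg.predict(X)   # predictions are mapped back through expm1

print(demo_reg.regressor_.coef_)
```

The fitted inner model is available as regressor_, and its coefficients live on the log scale, while predict returns values on the original scale of y.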
