Progress 24/12/14
Yesterday's review:
Pandas course completed
Intermediate Machine Learning: 2/7
Today's log:
Intermediate Machine Learning: Categorical Variables
Read two articles on how to ask good questions, and post a question in the discussion area
Hands-on: work through the Housing Prices Competition for Kaggle Learn Users from start to finish and submit it successfully
Intermediate Machine Learning: Pipelines (I had been mistranslating "pipeline" as "workflow")
Categorical Variables
This lesson covers three ways to handle categorical features. (In my write-up I prefer to call these "variables" "features".) Feature types are usually divided into numerical and categorical.
- Strategy 1: drop. Simply remove the categorical columns from the dataset.
- Strategy 2: ordinal encoding. Assign each category a distinct integer. Not every categorical variable has a natural order that maps onto an ordinal encoding, but for tree-based models ordinal encoding usually works well.
- Strategy 3: one-hot encoding. Create a new column for each category. This usually works well when the categories have no intrinsic order, but breaks down when the number of categories is too large. (A toy comparison of strategies 2 and 3 is sketched right after this list.)
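A toy comparison of the two encodings (my own sketch, not from the course; assumes scikit-learn >= 1.2 for the sparse_output parameter):

import pandas as pd
from sklearn.preprocessing import OrdinalEncoder, OneHotEncoder

# One categorical column with three distinct categories
df = pd.DataFrame({'Color': ['Red', 'Green', 'Blue', 'Green']})

# Ordinal encoding: one column, one integer per category (alphabetical order)
print(OrdinalEncoder().fit_transform(df))
# [[2.] [1.] [0.] [1.]]  i.e. Blue=0, Green=1, Red=2

# One-hot encoding: one 0/1 column per category
print(OneHotEncoder(sparse_output=False).fit_transform(df))
# [[0. 0. 1.] [0. 1. 0.] [1. 0. 0.] [0. 1. 0.]]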
Get the features whose dtype is object (i.e. strings):
# Get list of categorical variables
s = (X_train.dtypes == 'object')
object_cols = list(s[s].index)

print("Categorical variables:")
print(object_cols)
Strategy 1: drop
drop_X_train = X_train.select_dtypes(exclude=['object'])
drop_X_valid = X_valid.select_dtypes(exclude=['object'])
Strategy 2: ordinal encoding
from sklearn.preprocessing import OrdinalEncoder

# Make copy to avoid changing original data
label_X_train = X_train.copy()
label_X_valid = X_valid.copy()

# Apply ordinal encoder to each column with categorical data
ordinal_encoder = OrdinalEncoder()
label_X_train[object_cols] = ordinal_encoder.fit_transform(X_train[object_cols])
label_X_valid[object_cols] = ordinal_encoder.transform(X_valid[object_cols])
A problem appears: how should categories that show up in the validation set but never in the training set be handled? First explore the data and split the categorical columns into those that can be safely ordinal encoded and those that cannot.
# Categorical columns in the training data
object_cols = [col for col in X_train.columns if X_train[col].dtype == "object"]
# Columns that can be safely ordinal encoded
good_label_cols = [col for col in object_cols if set(X_valid[col]).issubset(set(X_train[col]))]
# Problematic columns that will be dropped from the dataset
bad_label_cols = list(set(object_cols)-set(good_label_cols))
print('Categorical columns that will be ordinal encoded:', good_label_cols)
print('\nCategorical columns that will be dropped from the dataset:', bad_label_cols)
The simplest strategy for now: drop the categorical columns that cannot be safely encoded, then apply the ordinal encoder to the remaining ones.
label_X_train = X_train.drop(bad_label_cols, axis=1)
label_X_valid = X_valid.drop(bad_label_cols, axis=1)
label_X_train[good_label_cols] = ordinal_encoder.fit_transform(label_X_train[good_label_cols])
label_X_valid[good_label_cols] = ordinal_encoder.transform(label_X_valid[good_label_cols])
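As an alternative to dropping the unsafe columns, newer scikit-learn (0.24+) lets OrdinalEncoder map categories it never saw during fit to a sentinel value; a minimal sketch, not part of the course exercise:

from sklearn.preprocessing import OrdinalEncoder

# Categories that appear only in the validation set become -1 instead of raising an error
safe_encoder = OrdinalEncoder(handle_unknown='use_encoded_value', unknown_value=-1)
safe_X_train = X_train.copy()
safe_X_valid = X_valid.copy()
safe_X_train[object_cols] = safe_encoder.fit_transform(X_train[object_cols])
safe_X_valid[object_cols] = safe_encoder.transform(X_valid[object_cols])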
Strategy 3: one-hot encoding
from sklearn.preprocessing import OneHotEncoder

# Apply one-hot encoder to each column with categorical data
# (note: newer scikit-learn uses sparse_output=False instead of sparse=False)
OH_encoder = OneHotEncoder(handle_unknown='ignore', sparse=False)
OH_cols_train = pd.DataFrame(OH_encoder.fit_transform(X_train[object_cols]))
OH_cols_valid = pd.DataFrame(OH_encoder.transform(X_valid[object_cols]))

# One-hot encoding removed index; put it back
OH_cols_train.index = X_train.index
OH_cols_valid.index = X_valid.index

# Remove categorical columns (will replace with one-hot encoding)
num_X_train = X_train.drop(object_cols, axis=1)
num_X_valid = X_valid.drop(object_cols, axis=1)

# Add one-hot encoded columns to numerical features
OH_X_train = pd.concat([num_X_train, OH_cols_train], axis=1)
OH_X_valid = pd.concat([num_X_valid, OH_cols_valid], axis=1)

# Ensure all columns have string type
OH_X_train.columns = OH_X_train.columns.astype(str)
OH_X_valid.columns = OH_X_valid.columns.astype(str)
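The course compares the three strategies by validation MAE. A sketch of the helper it uses, assuming the course's X_train/X_valid/y_train/y_valid split and the encoded frames built above:

from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error

def score_dataset(X_train, X_valid, y_train, y_valid):
    # Fit a fixed random forest and report validation MAE
    model = RandomForestRegressor(n_estimators=100, random_state=0)
    model.fit(X_train, y_train)
    preds = model.predict(X_valid)
    return mean_absolute_error(y_valid, preds)

print("Drop:    ", score_dataset(drop_X_train, drop_X_valid, y_train, y_valid))
print("Ordinal: ", score_dataset(label_X_train, label_X_valid, y_train, y_valid))
print("One-hot: ", score_dataset(OH_X_train, OH_X_valid, y_train, y_valid))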
Before starting, first investigate the categorical feature information (how many unique values each column has):
# Get number of unique entries in each column with categorical data
object_nunique = list(map(lambda col: X_train[col].nunique(), object_cols))
d = dict(zip(object_cols, object_nunique))

# Print number of unique entries by column, in ascending order
sorted(d.items(), key=lambda x: x[1])
As with the previous approach, first find the columns that suit this method. Columns with too many categories (high cardinality) can be dropped outright or ordinal encoded instead; for instance, one-hot encoding a column with 100 categories in a 10,000-row dataset would add 100 columns, i.e. 1,000,000 new entries.
# Columns that will be one-hot encoded
low_cardinality_cols = [col for col in object_cols if X_train[col].nunique() < 10]
# Columns that will be dropped from the dataset
high_cardinality_cols = list(set(object_cols)-set(low_cardinality_cols))
print('Categorical columns that will be one-hot encoded:', low_cardinality_cols)
print('\nCategorical columns that will be dropped from the dataset:', high_cardinality_cols)
Final code
from sklearn.preprocessing import OneHotEncoder

# Use as many lines of code as you need!
# low_OH_X_train = X_train.drop(high_cardinality_cols, axis=1)
# low_OH_X_valid = X_valid.drop(high_cardinality_cols, axis=1)

OH_encoder = OneHotEncoder(handle_unknown='ignore', sparse_output=False)
OH_cols_train = pd.DataFrame(OH_encoder.fit_transform(X_train[low_cardinality_cols]))
OH_cols_valid = pd.DataFrame(OH_encoder.transform(X_valid[low_cardinality_cols]))

# One-hot encoding removed the index; put it back from the original frames
OH_cols_train.index = X_train.index
OH_cols_valid.index = X_valid.index

# Drop all original categorical columns: the high-cardinality ones are
# discarded, the low-cardinality ones come back as one-hot columns
num_X_train = X_train.drop(object_cols, axis=1)
num_X_valid = X_valid.drop(object_cols, axis=1)

OH_X_train = pd.concat([num_X_train, OH_cols_train], axis=1)
OH_X_valid = pd.concat([num_X_valid, OH_cols_valid], axis=1)

# Ensure all column names have string type
OH_X_train.columns = OH_X_train.columns.astype(str)
OH_X_valid.columns = OH_X_valid.columns.astype(str)
print(OH_X_train.columns)

# Check your answer
step_4.check()
Question
Why do all column names need to be converted to string type at the end?
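My working guess before posting (to be confirmed by the forum answer): pd.DataFrame(OH_encoder.fit_transform(...)) labels its columns with the integers 0, 1, 2, ..., so after pd.concat the frame mixes integer and string column names, and recent scikit-learn versions refuse to fit on mixed-type feature names. A minimal sketch of the failure, with made-up toy data:

import pandas as pd
from sklearn.ensemble import RandomForestRegressor

X = pd.DataFrame({'LotArea': [8450, 9600, 11250]})
X[0] = [1.0, 0.0, 1.0]  # integer column name, like the raw one-hot output

# Recent scikit-learn raises TypeError here: feature names are only
# supported if all input features have string names
RandomForestRegressor(n_estimators=10, random_state=0).fit(X, [1, 2, 3])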
To ask in the forum, I first read the two guides on asking questions:
Kaggle Community Guidelines
Frequently Asked Questions
Then posted the question in the discussion area.
Hands-on: apply the Missing Values and Categorical Variables material to write a notebook from scratch and submit it successfully.
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np  # linear algebra
import pandas as pd  # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All"
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session
# Load original data
from sklearn.model_selection import train_test_split

X_full = pd.read_csv("/kaggle/input/home-data-for-ml-course/train.csv")
X_test = pd.read_csv("/kaggle/input/home-data-for-ml-course/test.csv")

# Drop rows with a missing target, then split the target from the predictors
X_full.dropna(axis=0, subset=['SalePrice'], inplace=True)
y = X_full.SalePrice
X_full.drop(['SalePrice'], axis=1, inplace=True)

X_train, X_valid, y_train, y_valid = train_test_split(X_full, y, train_size=0.8,
                                                      test_size=0.2, random_state=0)
# Define the evaluation function and the submission file generator
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error

def score_dataset(X_train, X_valid, y_train, y_valid, model):
    model.fit(X_train, y_train)
    preds = model.predict(X_valid)
    return mean_absolute_error(y_valid, preds)

def generate_submit_file(X_test, model):
    preds_test = model.predict(X_test)
    output = pd.DataFrame({'Id': X_test.Id,
                           'SalePrice': preds_test})
    output.to_csv('submission.csv', index=False)
    print("submission.csv saved.")
    return output

print("func defined")
# Define the data preprocessor
from sklearn.preprocessing import OrdinalEncoder

def na_processor(X_data, non_cols, is_train=False):
    # non_cols = [col for col in X_data.columns if X_data[col].isnull().any()]
    X_data = X_data.drop(non_cols, axis=1)
    return X_data

# def cate_processor(X_data, bad_cols, good_cols, is_train=False):
#     # cate_cols = [col for col in X_data.columns if X_data[col].dtype == "object"]
#     X_data = X_data.drop(cate_cols, axis=1)
#     return X_data

def data_preprocessor(train, valid, test):
    """Drop columns with missing values, then ordinal encode the categorical
    columns that are safe to encode and drop the rest."""
    # Missing values: collect every column with a NaN in any of the three sets
    train_non_cols = [col for col in train.columns if train[col].isnull().any()]
    valid_non_cols = [col for col in valid.columns if valid[col].isnull().any()]
    test_non_cols = [col for col in test.columns if test[col].isnull().any()]
    non_cols = train_non_cols + valid_non_cols + test_non_cols

    # Drop NA columns
    X_train = na_processor(train, non_cols, is_train=True)
    X_valid = na_processor(valid, non_cols)
    X_test = na_processor(test, non_cols)

    # Categorical variables: ordinal encoding
    object_cols = [col for col in X_train.columns if X_train[col].dtype == "object"]
    good_label_cols = [col for col in object_cols
                       if set(X_valid[col]).issubset(set(X_train[col]))
                       and set(X_test[col]).issubset(set(X_train[col]))]
    bad_label_cols = list(set(object_cols) - set(good_label_cols))

    ordinal_encoder = OrdinalEncoder()
    # Encode good cols
    X_train[good_label_cols] = ordinal_encoder.fit_transform(X_train[good_label_cols])
    X_valid[good_label_cols] = ordinal_encoder.transform(X_valid[good_label_cols])
    X_test[good_label_cols] = ordinal_encoder.transform(X_test[good_label_cols])

    # Drop bad cols
    X_train.drop(bad_label_cols, axis=1, inplace=True)
    X_valid.drop(bad_label_cols, axis=1, inplace=True)
    X_test.drop(bad_label_cols, axis=1, inplace=True)
    return X_train, X_valid, X_test

print("func defined")
# train and valid
model = RandomForestRegressor(n_estimators=100, random_state=0)
final_X_train, final_X_valid, final_X_test = data_preprocessor(X_train, X_valid, X_test)
# print(final_X_train.dtypes)
score = score_dataset(final_X_train, final_X_valid, y_train, y_valid, model)
print(f"MAE score is {score}")

# Generate test output
output = generate_submit_file(final_X_test, model)
Problem: the submission failed. After digging in, the cause turned out to be the Id column.
def generate_submit_file(X_test, model):
    preds_test = model.predict(X_test)
    output = pd.DataFrame({'Id': X_test.index,  # bug: this should be X_test.Id
                           'SalePrice': preds_test})
    output.to_csv('submission.csv', index=False)
    print("submission.csv saved.")
    return output
After the fix, the submission succeeded.
Pipeline
A pipeline is a simple way to organize preprocessing and modeling code: it bundles the individual preprocessing and modeling steps into one object.
Some people do without pipelines entirely, but their benefits include:
- Cleaner code
- Fewer bugs
- Easier to move into production
- More options for model validation
The resulting code is remarkably concise, much clearer than what came before, though I still need to study pipeline usage in detail.
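The snippet below assumes numerical_cols and categorical_cols already exist; in the course tutorial they are selected roughly like this:

# Select categorical columns with relatively low cardinality
categorical_cols = [cname for cname in X_train.columns
                    if X_train[cname].nunique() < 10 and X_train[cname].dtype == "object"]

# Select numerical columns
numerical_cols = [cname for cname in X_train.columns
                  if X_train[cname].dtype in ['int64', 'float64']]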
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error

# Preprocessing for numerical data
numerical_transformer = SimpleImputer(strategy='constant')

# Preprocessing for categorical data
categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))
])

# Bundle preprocessing for numerical and categorical data
preprocessor = ColumnTransformer(transformers=[
    ('num', numerical_transformer, numerical_cols),
    ('cat', categorical_transformer, categorical_cols)
])

model = RandomForestRegressor(n_estimators=100, random_state=0)

# Bundle preprocessing and modeling code in a pipeline
my_pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('model', model)
])

# Preprocessing of training data, fit model
my_pipeline.fit(X_train, y_train)

# Preprocessing of validation data, get predictions
preds = my_pipeline.predict(X_valid)

# Evaluate the model
score = mean_absolute_error(y_valid, preds)
print('MAE:', score)
Pipeline and ColumnTransformer dramatically simplify both the code and the encoding workflow, freeing attention for choosing among strategies.
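The "more options for model validation" benefit listed earlier falls out directly: the whole pipeline acts as a single estimator, so it can be handed to cross-validation as-is. A minimal sketch using scikit-learn's cross_val_score (cross-validation itself is a later lesson):

from sklearn.model_selection import cross_val_score

# Preprocessing is re-fit inside every fold, so there is no leakage
# between the training and validation parts of each split
scores = -1 * cross_val_score(my_pipeline, X_train, y_train,
                              cv=5, scoring='neg_mean_absolute_error')
print("Average MAE:", scores.mean())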