Introduction

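Categorical data is data that takes only a limited number of values.

For example, if people responded to a survey about which brand of car they owned, the result would be categorical (because the answers would be things like Honda, Toyota, Ford, None, and so on). Responses fall into a fixed set of categories.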

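You will get an error if you try to plug these variables into most machine learning models in Python without "encoding" them first. Here we'll show the most popular method for encoding categorical variables.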
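To make the error concrete, here is a minimal sketch, assuming scikit-learn's LinearRegression as a stand-in for "most models". The tiny car-survey DataFrame and the prices are invented purely for illustration.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# Hypothetical survey data: "brand" is a raw text (categorical) column.
cars = pd.DataFrame({
    "brand": ["Honda", "Toyota", "Ford", "Honda"],
    "age": [3, 5, 2, 8],
})
price = [18000, 15000, 21000, 9000]

try:
    # The model tries to convert every column to numbers and fails on "brand".
    LinearRegression().fit(cars, price)
    fit_failed = False
except ValueError as err:
    fit_failed = True
    print("Model rejected the raw text column:", err)
```

The exact wording of the message depends on the library version, but the fit does not succeed until the text column is encoded or dropped.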

One-Hot Encoding : The Standard Approach for Categorical Data

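One-hot encoding is the most widespread approach, and it works very well unless your categorical variable takes on a large number of values (you generally won't do it for variables taking more than 15 different values; it can also be a poor choice in some cases with fewer values, though that varies).

One-hot encoding creates new (binary) columns, indicating the presence of each possible value from the original data. Let's work through an example.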


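The values in the original data are Red, Yellow and Green. We create a separate column for each possible value. Wherever the original value was Red, we put a 1 in the Red column, and a 0 in the others.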
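The same Red / Yellow / Green idea can be sketched in a few lines of pandas. The column name Color and the five rows are made up for illustration, and dtype=int is passed only so the output is 0/1 regardless of the pandas version's default.

```python
import pandas as pd

# Toy data mirroring the Red / Yellow / Green example (values are illustrative).
colors = pd.DataFrame({"Color": ["Red", "Red", "Yellow", "Green", "Yellow"]})

# One new binary column per distinct value; dtype=int yields 0/1 instead of booleans.
one_hot = pd.get_dummies(colors, columns=["Color"], dtype=int)
print(one_hot)
```

Each row has exactly one 1, in the column that matches its original value.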

Example

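Let's see this in code. The first cell handles the basic data set-up, so the example proper starts from the point where you have the train_predictors and test_predictors DataFrames. This data contains housing characteristics. You will use them to predict home prices, which are stored in a Series called target.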

[1]

# Read the data
import pandas as pd
train_data = pd.read_csv('../input/train.csv')
test_data = pd.read_csv('../input/test.csv')

# Drop houses where the target is missing
train_data.dropna(axis=0, subset=['SalePrice'], inplace=True)

target = train_data.SalePrice

# Since missing values isn't the focus of this tutorial, we use the simplest
# possible approach, which drops these columns. 
# For more detail (and a better approach) to missing values, see
# https://www.kaggle.com/dansbecker/handling-missing-values
cols_with_missing = [col for col in train_data.columns if train_data[col].isnull().any()]                                  
candidate_train_predictors = train_data.drop(['Id', 'SalePrice'] + cols_with_missing, axis=1)
candidate_test_predictors = test_data.drop(['Id'] + cols_with_missing, axis=1)

# "cardinality" means the number of unique values in a column.
# We use it as our only way to select categorical columns here. This is convenient, though
# a little arbitrary.
low_cardinality_cols = [cname for cname in candidate_train_predictors.columns if
                        candidate_train_predictors[cname].nunique() < 10 and
                        candidate_train_predictors[cname].dtype == "object"]
numeric_cols = [cname for cname in candidate_train_predictors.columns if candidate_train_predictors[cname].dtype in ['int64', 'float64']]
my_cols = low_cardinality_cols + numeric_cols
train_predictors = candidate_train_predictors[my_cols]
test_predictors = candidate_test_predictors[my_cols]

Pandas assigns a data type (called a dtype) to each column or Series. Let's look at a random sample of dtypes from our prediction data:

[2]

train_predictors.dtypes.sample(10)

Heating          object
CentralAir       object
Foundation       object
Condition1       object
YrSold            int64
PavedDrive       object
RoofMatl         object
PoolArea          int64
EnclosedPorch     int64
KitchenAbvGr      int64
dtype: object

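The object dtype indicates that a column has text (there are other things it could theoretically be, but that's unimportant for our purposes). It's most common to one-hot encode these "object" columns, since they can't be plugged directly into most models. Pandas offers a convenient function called get_dummies to get one-hot encodings. Call it like this: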

[3]

one_hot_encoded_training_predictors = pd.get_dummies(train_predictors)
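When called on a whole DataFrame like this, get_dummies encodes only the text columns and passes numeric columns through unchanged. A small sketch of that behavior, borrowing three column names from the dtypes sample with invented values:

```python
import pandas as pd

# Column names come from the housing data; the values are illustrative only.
toy = pd.DataFrame({
    "CentralAir": ["Y", "N", "Y"],
    "PavedDrive": ["Y", "P", "N"],
    "YrSold": [2008, 2007, 2010],
})

# Text columns are expanded into one column per value; YrSold is left as-is.
encoded = pd.get_dummies(toy, dtype=int)
print(encoded.columns.tolist())
print(encoded.shape)
```

The two text columns become five indicator columns (two for CentralAir, three for PavedDrive), so the three original columns turn into six.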

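Alternatively, you could have dropped the categoricals. To see how the approaches compare, we can calculate the mean absolute error of models built with two alternative sets of predictors:

1. One-hot encoded categoricals as well as numeric predictors
2. Numerical predictors, where we drop categoricals.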

One-hot encoding usually helps, but it varies on a case-by-case basis. In this case, there doesn't appear to be any meaningful benefit from using one-hot encoded variables.

[4]

from sklearn.model_selection import cross_val_score
from sklearn.ensemble import RandomForestRegressor

def get_mae(X, y):
    # multiply by -1 to make positive MAE score instead of neg value returned as sklearn convention
    return -1 * cross_val_score(RandomForestRegressor(n_estimators=50),
                                X, y,
                                scoring='neg_mean_absolute_error').mean()

predictors_without_categoricals = train_predictors.select_dtypes(exclude=['object'])

mae_without_categoricals = get_mae(predictors_without_categoricals, target)

mae_one_hot_encoded = get_mae(one_hot_encoded_training_predictors, target)

print('Mean Absolute Error when Dropping Categoricals: ' + str(int(mae_without_categoricals)))
print('Mean Absolute Error with One-Hot Encoding: ' + str(int(mae_one_hot_encoded)))

Mean Absolute Error when Dropping Categoricals: 18350
Mean Absolute Error with One-Hot Encoding: 18023

Applying to Multiple Files

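So far, you've one-hot encoded your training data. What about when you have multiple files (e.g. a test dataset, or some other data that you'd like to make predictions for)? Scikit-learn is sensitive to the ordering of columns, so if the training dataset and test dataset get misaligned, your results will be nonsense. This could happen if a categorical had a different number of values in the training data than in the test data.

Use the align command to ensure the test data is encoded in the same manner as the training data: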

[5]

one_hot_encoded_training_predictors = pd.get_dummies(train_predictors)
one_hot_encoded_test_predictors = pd.get_dummies(test_predictors)
final_train, final_test = one_hot_encoded_training_predictors.align(one_hot_encoded_test_predictors,join='left', axis=1)

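The align command makes sure the columns show up in the same order in both datasets (it uses column names to identify which columns line up in each dataset). The argument join='left' specifies that we will do the equivalent of SQL's left join. That means, if there are ever columns that show up in one dataset and not the other, we will keep exactly the columns from our training data. The argument join='inner' would do what SQL databases call an inner join, keeping only the columns showing up in both datasets. That's also a sensible choice.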
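The difference between the two join options is easiest to see on a toy pair of datasets where the test data contains a value (Blue) that the training data never saw, and lacks two values that it did. The frames below are invented for illustration. Filling the resulting gaps with 0 is an assumption on our part, not something the align call does for you.

```python
import pandas as pd

# Illustrative data: "Blue" appears only in test, "Green"/"Yellow" only in train.
train = pd.DataFrame({"Color": ["Red", "Green", "Yellow"], "Size": [1, 2, 3]})
test = pd.DataFrame({"Color": ["Red", "Blue"], "Size": [4, 5]})

train_enc = pd.get_dummies(train, dtype=int)
test_enc = pd.get_dummies(test, dtype=int)

# join='left': keep exactly the training columns, in the training order.
left_train, left_test = train_enc.align(test_enc, join="left", axis=1)

# join='inner': keep only the columns present in both datasets.
inner_train, inner_test = train_enc.align(test_enc, join="inner", axis=1)

# Columns the test set lacked come back as NaN after a left join.
# Filling with 0 ("this category is absent") is one reasonable choice.
left_test = left_test.fillna(0).astype(int)

print(left_test.columns.tolist())
print(inner_test.columns.tolist())
```

With the left join, the Color_Blue column is dropped from the test data and the missing training columns are added to it. With the inner join, both datasets shrink to the shared columns.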

Conclusion

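The world is filled with categorical data. You will be a much more effective data scientist if you know how to use this data. Here are resources that will be useful as you start doing more sophisticated work with categorical data.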

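Pipelines: Deploying models into production-ready systems is a topic unto itself. While one-hot encoding is still a great approach, your code will need to be built in an especially robust way. Scikit-learn pipelines are a great tool for this. Scikit-learn offers a class for one-hot encoding (OneHotEncoder), which can be added to a pipeline. Older releases could not handle text or object values, which is a common use case; since version 0.20 it accepts string columns directly.
Applications to Text for Deep Learning: Keras and TensorFlow have functionality for one-hot encoding, which is useful for working with text.
Categoricals with Many Values: Scikit-learn's FeatureHasher uses the hashing trick to store high-dimensional data. This will add some complexity to your modeling code.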
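As a rough sketch of the pipeline route mentioned above, the snippet below wraps OneHotEncoder and a random forest in a single scikit-learn Pipeline. It assumes scikit-learn 0.20 or later (for ColumnTransformer and string support in OneHotEncoder). The column names echo the housing data, but the rows, prices, and step names are invented for illustration.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Illustrative training data and new data; "Floor" never appears in training.
X_train = pd.DataFrame({
    "Heating": ["GasA", "GasW", "GasA", "Wall"],
    "YrSold": [2008, 2007, 2010, 2009],
})
y_train = [200000, 150000, 220000, 120000]
X_new = pd.DataFrame({"Heating": ["GasA", "Floor"], "YrSold": [2008, 2009]})

# One-hot encode the text column and pass the numeric column through untouched.
# handle_unknown="ignore" encodes unseen categories as all zeros instead of raising.
preprocess = ColumnTransformer(
    transformers=[("onehot", OneHotEncoder(handle_unknown="ignore"), ["Heating"])],
    remainder="passthrough",
)

model = Pipeline(steps=[
    ("preprocess", preprocess),
    ("forest", RandomForestRegressor(n_estimators=50, random_state=0)),
])

model.fit(X_train, y_train)
predictions = model.predict(X_new)
print(predictions)
```

Because the encoder learns its categories during fit, the columns it produces for new data always match the training columns, which removes the need for a separate align step.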