What is Data Leakage

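Data leakage is one of the most important issues for a data scientist to understand. If you don't know how to prevent it, leakage will come up frequently, and it will ruin your models in the most subtle and dangerous ways. Specifically, leakage causes a model to look accurate until you start making decisions with it, at which point it becomes very inaccurate. This tutorial shows what leakage is and how to avoid it.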

There are two main types of leakage: Leaky Predictors and Leaky Validation Strategies.

Leaky Predictors

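This occurs when your predictors include data that will not be available at the time you make predictions.
For example, imagine you want to predict who will get pneumonia. The first few rows of the raw data might look like this: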

got_pneumonia	age	weight	male	took_antibiotic_medicine	...
False	65	100	False	False	...
False	72	130	True	False	...
True	58	100	False	True	...

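People take antibiotic medicines after getting pneumonia in order to recover, so the raw data shows a strong relationship between those columns. But took_antibiotic_medicine is frequently changed after the value of got_pneumonia is determined. This is target leakage.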

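The model would learn that anyone with a value of False for took_antibiotic_medicine did not have pneumonia. Validation data comes from the same source, so the pattern repeats itself in validation, and the model gets great validation (or cross-validation) scores. But the model will be very inaccurate when it is subsequently deployed in the real world.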

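To prevent this type of data leakage, exclude any variable that is updated (or created) after the target value is realized, because that data will not be available when we use the model to make new predictions.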
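A minimal sketch of that exclusion, using the three example rows above as a toy DataFrame. The name post_target_columns is hypothetical; deciding which columns belong in it requires knowing when each value was recorded, which the code cannot determine for you.

```python
import pandas as pd

# Toy copy of the example rows above (illustrative only).
raw = pd.DataFrame({
    "got_pneumonia": [False, False, True],
    "age": [65, 72, 58],
    "weight": [100, 130, 100],
    "male": [False, True, False],
    "took_antibiotic_medicine": [False, False, True],
})

# Hypothetical list: columns updated or created after the target was realized.
post_target_columns = ["took_antibiotic_medicine"]

y = raw["got_pneumonia"]
# Drop the target itself and every post-target column from the predictors.
X = raw.drop(columns=["got_pneumonia"] + post_target_columns)

print(list(X.columns))
```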

Leaky Validation Strategy

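A different type of leak occurs when you aren't careful to keep training data separate from validation data. For example, this happens if you run preprocessing (such as fitting the Imputer for missing values) before calling train_test_split. Validation is meant to measure how the model does on data it hasn't seen before. If the validation data affects the preprocessing behavior, you corrupt this process in subtle ways. The end result? Your model gets very good validation scores, giving you great confidence in it, but performs poorly when you deploy it to make decisions.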
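A sketch of the safe ordering, on synthetic data: split first, then fit the imputer on the training rows only. It assumes a recent scikit-learn, where the imputer class is SimpleImputer in sklearn.impute; the array names are illustrative.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split

# Synthetic feature matrix with some missing values (illustrative only).
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
X[rng.random(X.shape) < 0.1] = np.nan
y = rng.integers(0, 2, size=100)

# Split first, so validation rows cannot influence the imputer.
X_train, X_valid, y_train, y_valid = train_test_split(X, y, random_state=0)

imputer = SimpleImputer(strategy="mean")
X_train_imputed = imputer.fit_transform(X_train)  # fit on training rows only
X_valid_imputed = imputer.transform(X_valid)      # reuse the training means

print(imputer.statistics_)
```

The leaky variant would call fit_transform on the full X before splitting, which lets the validation rows shift the column means used to fill the training data.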

Preventing Leaky Predictors

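There is no single solution that universally prevents leaky predictors. It requires knowledge of your data, case-specific inspection, and common sense.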

However, leaky predictors frequently have a high statistical correlation with the target. So keep two tactics in mind:

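To screen for possible leaky predictors, look for columns that are statistically correlated with your target.
If you build a model and find it extremely accurate, you likely have a leakage problem.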
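The first tactic can be sketched as a simple correlation ranking. The data below is synthetic: the column named leaky is deliberately derived from the target and noise is unrelated to it, so the names and values are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

# Synthetic data (illustrative only): "leaky" is derived from the target,
# while "noise" is unrelated to it.
rng = np.random.default_rng(0)
target = rng.integers(0, 2, size=500).astype(bool)
df = pd.DataFrame({
    "target": target,
    "leaky": np.where(target, rng.uniform(1, 100, size=500), 0.0),
    "noise": rng.normal(size=500),
})

# Absolute correlation of each predictor with the target, highest first.
correlations = (
    df.drop(columns="target")
      .corrwith(df["target"].astype(int))
      .abs()
      .sort_values(ascending=False)
)
print(correlations)
```

A high correlation is only a prompt to inspect the column. A legitimate predictor can also correlate strongly with the target, so the ranking narrows the search without deciding it.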

Preventing Leaky Validation Strategies

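If your validation is based on a simple train-test split, exclude the validation data from any type of fitting, including the fitting of preprocessing steps. This is easier if you use scikit-learn Pipelines. When using cross-validation, it is even more critical to use pipelines and do your preprocessing inside the pipeline.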
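A sketch of preprocessing inside a pipeline under cross-validation, on synthetic data. The choice of SimpleImputer with a mean strategy and the forest settings are assumptions for illustration, not a recommendation.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

# Synthetic data with missing values (illustrative only).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
y = (X[:, 0] > 0).astype(int)          # target computed before values go missing
X[rng.random(X.shape) < 0.1] = np.nan

# The imputer is refit on the training folds of every split, so the
# held-out fold never contributes to the imputation statistics.
pipeline = make_pipeline(
    SimpleImputer(strategy="mean"),
    RandomForestClassifier(n_estimators=50, random_state=0),
)
cv_scores = cross_val_score(pipeline, X, y, cv=5, scoring="accuracy")
print("Cross-val accuracy: %f" % cv_scores.mean())
```

Because cross_val_score clones and refits the whole pipeline for each fold, there is no place where a preprocessing step could be fit on held-out rows by accident.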

Example

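We will use a small dataset about credit card applications and build a model that predicts which applications were accepted (stored in a variable called card). Here is a look at the data: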

[1]

import pandas as pd

# Read "yes"/"no" columns as booleans
data = pd.read_csv('../input/AER_credit_card_data.csv', true_values=['yes'], false_values=['no'])
print(data.head())

   card  reports       age  income     share  expenditure  owner  selfemp  \
0  True        0  37.66667  4.5200  0.033270   124.983300   True    False
1  True        0  33.25000  2.4200  0.005217     9.854167  False    False
2  True        0  33.66667  4.5000  0.004156    15.000000   True    False
3  True        0  30.50000  2.5400  0.065214   137.869200  False    False
4  True        0  32.16667  9.7867  0.067051   546.503300   True    False

   dependents  months  majorcards  active
0           3      54           1      12
1           3      34           1      13
2           4      58           1       5
3           0      25           1       7
4           2      64           1       5

Using data.shape, we can see that this is a small dataset (1,319 rows), so we should use cross-validation to get an accurate measure of model quality.

[2]

data.shape

(1319, 12)

[3]

from sklearn.pipeline import make_pipeline
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

y = data.card
X = data.drop(['card'], axis=1)

# Since there was no preprocessing, we didn't need a pipeline here. Used anyway as best practice
modeling_pipeline = make_pipeline(RandomForestClassifier())
cv_scores = cross_val_score(modeling_pipeline, X, y, scoring='accuracy')
print("Cross-val accuracy: %f" % cv_scores.mean())

Cross-val accuracy: 0.979528

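With experience, you'll find that it is very rare to find models that are accurate 98% of the time. It happens, but it is uncommon enough that we should inspect the data more closely to see whether this is target leakage.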

Here is a summary of the data, which you can also find under the data tab:

card: Dummy variable, 1 if application for credit card accepted, 0 if not
reports: Number of major derogatory reports
age: Age in years plus twelfths of a year
income: Yearly income (divided by 10,000)
share: Ratio of monthly credit card expenditure to yearly income
expenditure: Average monthly credit card expenditure
owner: 1 if owns their home, 0 if rent
selfemp: 1 if self employed, 0 if not
dependents: 1 + number of dependents
months: Months living at current address
majorcards: Number of major credit cards held
active: Number of active credit accounts

A few variables look suspicious. For example, does expenditure mean expenditure on this card, or on cards used before applying?

At this point, basic data comparisons can be very helpful:

[4]

expenditures_cardholders = data.expenditure[data.card]
expenditures_noncardholders = data.expenditure[~data.card]

# Share of each group with exactly zero expenditure
print('Fraction of those who received a card with no expenditures: %.2f'
      % (expenditures_cardholders == 0).mean())
print('Fraction of those who did not receive a card with no expenditures: %.2f'
      % (expenditures_noncardholders == 0).mean())

Fraction of those who received a card with no expenditures: 0.02
Fraction of those who did not receive a card with no expenditures: 1.00

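Everyone with card == False had no expenditures, while only 2% of those with card == True had no expenditures. It is not surprising that our model appeared to have high accuracy. But this looks like a data leak, where expenditure probably means expenditure on the card they applied for.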

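Since share is partially determined by expenditure, it should be excluded too. The variables active and majorcards are a little less clear, but from the description they sound concerning. In most situations, if you can't track down the people who created the data to find out more, it is better to be safe than sorry.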
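The zero-expenditure comparison can be repeated for any suspect column in one step. The sketch below uses a small made-up DataFrame as a stand-in for the credit card data, so the values and the resulting fractions are illustrative and say nothing about the real dataset; suspects is a hypothetical list.

```python
import pandas as pd

# Toy stand-in for the credit card data (values are made up for illustration).
toy = pd.DataFrame({
    "card": [True, True, True, False, False],
    "expenditure": [124.98, 0.0, 15.0, 0.0, 0.0],
    "active": [12, 13, 5, 0, 7],
})

# Hypothetical list of columns to inspect.
suspects = ["expenditure", "active"]

# Fraction of zero values in each suspect column, split by the target.
zero_fraction = toy[suspects].eq(0).groupby(toy["card"]).mean()
print(zero_fraction)
```

A column whose zero fraction is close to 1.00 for one target value and close to 0.00 for the other is worth tracing back to how and when it was recorded.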

We would run a model without the leakage as follows:

[5]

potential_leaks = ['expenditure', 'share', 'active', 'majorcards']
X2 = X.drop(potential_leaks, axis=1)
cv_scores = cross_val_score(modeling_pipeline, X2, y, scoring='accuracy')
print("Cross-val accuracy: %f" % cv_scores.mean())

Cross-val accuracy: 0.806677

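This accuracy is quite a bit lower, which on the one hand is disappointing. However, we can expect the model to be right about 80% of the time when used on new applications, whereas the leaky model would likely do much worse than that (despite its higher apparent score in cross-validation).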