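Competition goal:
- Predict the click-through rate of items on web pages by analyzing online system logs and user behavior data.
- This is a binary classification problem: we only need to predict whether a user clicks.
- Ideally the model outputs a probability, for example the probability that a user clicks a given ad.
Competition website
Files: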
train - Training set. 10 days of click-through data, ordered chronologically. Non-clicks and clicks are subsampled according to different strategies.
test - Test set. 1 day of ads for testing your model predictions.
sampleSubmission.csv - Sample submission file in the correct format, corresponds to the All-0.5 Benchmark.
Fields:
- id: ad identifier
- click: 0/1 for non-click/click
- hour: format is YYMMDDHH, so 14091123 means 23:00 on Sept. 11, 2014 UTC.
- C1 – anonymized categorical variable
- banner_pos
- site_id
- site_domain
- site_category
- app_id
- app_domain
- app_category
- device_id
- device_ip
- device_model
- device_type
- device_conn_type
- C14-C21 – anonymized categorical variables
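Initial analysis:
- This is a click-through-rate prediction problem, i.e. binary classification.
- A first look at the attributes suggests four groups: user, site, ad, and time.
- Time is probably an important attribute and deserves careful analysis, because people tend to look at different things at different times of day.
- Site category is also likely to correlate strongly with the user.
- Device type can hint at the user's economic bracket and spending level.
- And so on. There are surely many more relationships among these attributes, and we should put ourselves in the user's position when thinking about them.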
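As a minimal sketch of how the time attribute could be analyzed, the `hour` field (format YYMMDDHH) can be parsed into a timestamp and split into hour-of-day and day-of-week. The column names `hour_of_day` and `day_of_week` are hypothetical, and the toy values below are illustrative rather than a sample of the real file.

```python
import pandas as pd

# Toy values in the YYMMDDHH format described in the field list.
df = pd.DataFrame({"hour": [14091123, 14102100, 14102113]})

# The column is stored as an integer, so cast to str before parsing.
ts = pd.to_datetime(df["hour"].astype(str), format="%y%m%d%H")

# Derived time features (hypothetical column names).
df["hour_of_day"] = ts.dt.hour
df["day_of_week"] = ts.dt.dayofweek  # Monday=0 ... Sunday=6

print(df)
```

Note that features like these only carry signal when the data spans several hours or days.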
Load Data
import pandas as pd

# Initial setup
train_filename = "train_small.csv"  # the original data set is large, so load a downsampled sample first
test_filename = "test.csv"
submission_filename = "submit.csv"

training_set = pd.read_csv(train_filename)
Explore Data
training_set.shape
(99999, 24)
# First, take a look at the data
training_set.head(10)
| | id | click | hour | C1 | banner_pos | site_id | site_domain | site_category | app_id | app_domain | ... | device_type | device_conn_type | C14 | C15 | C16 | C17 | C18 | C19 | C20 | C21 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1.000009e+18 | 0 | 14102100 | 1005 | 0 | 1fbe01fe | f3845767 | 28905ebd | ecad2386 | 7801e8d9 | ... | 1 | 2 | 15706 | 320 | 50 | 1722 | 0 | 35 | -1 | 79 |
1 | 1.000017e+19 | 0 | 14102100 | 1005 | 0 | 1fbe01fe | f3845767 | 28905ebd | ecad2386 | 7801e8d9 | ... | 1 | 0 | 15704 | 320 | 50 | 1722 | 0 | 35 | 100084 | 79 |
2 | 1.000037e+19 | 0 | 14102100 | 1005 | 0 | 1fbe01fe | f3845767 | 28905ebd | ecad2386 | 7801e8d9 | ... | 1 | 0 | 15704 | 320 | 50 | 1722 | 0 | 35 | 100084 | 79 |
3 | 1.000064e+19 | 0 | 14102100 | 1005 | 0 | 1fbe01fe | f3845767 | 28905ebd | ecad2386 | 7801e8d9 | ... | 1 | 0 | 15706 | 320 | 50 | 1722 | 0 | 35 | 100084 | 79 |
4 | 1.000068e+19 | 0 | 14102100 | 1005 | 1 | fe8cc448 | 9166c161 | 0569f928 | ecad2386 | 7801e8d9 | ... | 1 | 0 | 18993 | 320 | 50 | 2161 | 0 | 35 | -1 | 157 |
5 | 1.000072e+19 | 0 | 14102100 | 1005 | 0 | d6137915 | bb1ef334 | f028772b | ecad2386 | 7801e8d9 | ... | 1 | 0 | 16920 | 320 | 50 | 1899 | 0 | 431 | 100077 | 117 |
6 | 1.000072e+19 | 0 | 14102100 | 1005 | 0 | 8fda644b | 25d4cfcd | f028772b | ecad2386 | 7801e8d9 | ... | 1 | 0 | 20362 | 320 | 50 | 2333 | 0 | 39 | -1 | 157 |
7 | 1.000092e+19 | 0 | 14102100 | 1005 | 1 | e151e245 | 7e091613 | f028772b | ecad2386 | 7801e8d9 | ... | 1 | 0 | 20632 | 320 | 50 | 2374 | 3 | 39 | -1 | 23 |
8 | 1.000095e+19 | 1 | 14102100 | 1005 | 0 | 1fbe01fe | f3845767 | 28905ebd | ecad2386 | 7801e8d9 | ... | 1 | 2 | 15707 | 320 | 50 | 1722 | 0 | 35 | -1 | 79 |
9 | 1.000126e+19 | 0 | 14102100 | 1002 | 0 | 84c7ba46 | c4e18dd6 | 50e219e0 | ecad2386 | 7801e8d9 | ... | 0 | 0 | 21689 | 320 | 50 | 2496 | 3 | 167 | 100191 | 23 |
10 rows × 24 columns
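- The data has 24 columns: 22 attributes besides id and the click label, many of them categorical.
- The training set has 99,999 samples, a reasonable size: neither too many nor too few.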
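Since many attributes are categorical strings, they have to be turned into numbers before a model such as logistic regression can use them. The sketch below shows two common options on toy rows: integer codes via `pd.factorize`, and one-hot columns via `pd.get_dummies`. Which columns get which treatment is an assumption made here for illustration, not the author's pipeline.

```python
import pandas as pd

# Toy rows using site_id values that appear in the head() output above.
df = pd.DataFrame({
    "site_id": ["1fbe01fe", "1fbe01fe", "fe8cc448", "d6137915"],
    "banner_pos": [0, 0, 1, 0],
})

# Integer codes: each distinct string gets an id in order of first appearance.
codes, uniques = pd.factorize(df["site_id"])
df["site_id_code"] = codes

# One-hot encoding, practical only for columns with few distinct values.
one_hot = pd.get_dummies(df["banner_pos"], prefix="banner_pos")

print(df)
print(one_hot)
```

Integer codes impose an arbitrary order on the categories, which a linear model will treat as a magnitude; one-hot columns avoid that but grow with the number of distinct values.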
training_set.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 99999 entries, 0 to 99998
Data columns (total 24 columns):
id 99999 non-null float64
click 99999 non-null int64
hour 99999 non-null int64
C1 99999 non-null int64
banner_pos 99999 non-null int64
site_id 99999 non-null object
site_domain 99999 non-null object
site_category 99999 non-null object
app_id 99999 non-null object
app_domain 99999 non-null object
app_category 99999 non-null object
device_id 99999 non-null object
device_ip 99999 non-null object
device_model 99999 non-null object
device_type 99999 non-null int64
device_conn_type 99999 non-null int64
C14 99999 non-null int64
C15 99999 non-null int64
C16 99999 non-null int64
C17 99999 non-null int64
C18 99999 non-null int64
C19 99999 non-null int64
C20 99999 non-null int64
C21 99999 non-null int64
dtypes: float64(1), int64(14), object(9)
memory usage: 18.3+ MB
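- Because the data has already been preprocessed, it is complete and has no missing values, which saves us a lot of work.
- Many attributes are categorical and need to be encoded.
- The numeric columns are almost all int64, but we still need to check whether their ranges are comparable; if not, they will need to be normalized.
- Next, look at the distribution of the numeric columns.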
# Inspect the training set
training_set.describe()
| | id | click | hour | C1 | banner_pos | device_type | device_conn_type | C14 | C15 | C16 | C17 | C18 | C19 | C20 | C21 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
count | 9.999900e+04 | 99999.000000 | 99999.0 | 99999.000000 | 99999.000000 | 99999.000000 | 99999.000000 | 99999.000000 | 99999.000000 | 99999.000000 | 99999.000000 | 99999.000000 | 99999.000000 | 99999.000000 | 99999.000000 |
mean | 9.500834e+18 | 0.174902 | 14102100.0 | 1005.034440 | 0.198302 | 1.055741 | 0.199272 | 17682.106071 | 318.333943 | 56.818988 | 1964.029090 | 0.789328 | 131.735447 | 37874.606366 | 88.555386 |
std | 5.669435e+18 | 0.379885 | 0.0 | 1.088705 | 0.402641 | 0.583986 | 0.635271 | 3237.726956 | 11.931998 | 36.924283 | 394.961129 | 1.223747 | 244.077816 | 48546.369299 | 45.482979 |
min | 3.237563e+13 | 0.000000 | 14102100.0 | 1001.000000 | 0.000000 | 0.000000 | 0.000000 | 375.000000 | 120.000000 | 20.000000 | 112.000000 | 0.000000 | 33.000000 | -1.000000 | 13.000000 |
25% | 4.183306e+18 | 0.000000 | 14102100.0 | 1005.000000 | 0.000000 | 1.000000 | 0.000000 | 15704.000000 | 320.000000 | 50.000000 | 1722.000000 | 0.000000 | 35.000000 | -1.000000 | 61.000000 |
50% | 1.074496e+19 | 0.000000 | 14102100.0 | 1005.000000 | 0.000000 | 1.000000 | 0.000000 | 17654.000000 | 320.000000 | 50.000000 | 1993.000000 | 0.000000 | 35.000000 | -1.000000 | 79.000000 |
75% | 1.457544e+19 | 0.000000 | 14102100.0 | 1005.000000 | 0.000000 | 1.000000 | 0.000000 | 20362.000000 | 320.000000 | 50.000000 | 2306.000000 | 2.000000 | 39.000000 | 100083.000000 | 156.000000 |
max | 1.844670e+19 | 1.000000 | 14102100.0 | 1010.000000 | 5.000000 | 5.000000 | 5.000000 | 21705.000000 | 728.000000 | 480.000000 | 2497.000000 | 3.000000 | 1835.000000 | 100248.000000 | 157.000000 |
- The numeric columns span very different ranges, so they will need to be normalized later.
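One simple form of normalization is z-score standardization, sketched below with plain pandas on a few toy values. Keep in mind that the field list describes C14-C21 as anonymized categorical variables, so treating them as numeric magnitudes at all is an assumption of this sketch.

```python
import pandas as pd

# Toy values that appear in the C14 and C16 columns of the tables above.
df = pd.DataFrame({
    "C14": [15706, 15704, 18993, 21689],
    "C16": [50, 50, 50, 480],
})

# Z-score: subtract the column mean, then divide by the column standard deviation.
standardized = (df - df.mean()) / df.std(ddof=0)

print(standardized)
```

After this step every column has mean 0 and unit variance, so no single column dominates a linear model just because of its scale.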
import joblib
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn import metrics

from utils import load_df
# Evaluation metrics
def print_metrics(true_values, predicted_values):
    print("Accuracy: %s" % metrics.accuracy_score(true_values, predicted_values))
    print("AUC: %s" % metrics.roc_auc_score(true_values, predicted_values))
    print("Confusion Matrix: %s" % metrics.confusion_matrix(true_values, predicted_values))
    print(metrics.classification_report(true_values, predicted_values))

# Fit a classifier
def classify(classifier_class, train_input, train_targets):
    classifier_object = classifier_class()
    classifier_object.fit(train_input, train_targets)
    return classifier_object

# Save the model
def save_model(clf):
    joblib.dump(clf, 'classifier.pkl')
train_data = load_df('train_small.csv').values
train_data.shape  # still 99999 rows
(99999L, 14L)
train_data[:,:]
array([[ 0, 14102100, 1005, ..., 35, -1, 79],
       [ 0, 14102100, 1005, ..., 35, 100084, 79],
       [ 0, 14102100, 1005, ..., 35, 100084, 79],
       ...,
       [ 0, 14102100, 1005, ..., 35, -1, 79],
       [ 1, 14102100, 1005, ..., 35, -1, 79],
       [ 0, 14102100, 1005, ..., 35, -1, 79]], dtype=int64)
Let's train a baseline first. For a baseline, the natural choice is logistic regression (LR), which is widely accepted as a baseline model in industry.
# Train and save the model
X_train, X_test, y_train, y_test = train_test_split(
    train_data[:, 1:], train_data[:, 0], test_size=0.3, random_state=0)
classifier = classify(LogisticRegression, X_train, y_train)  # use the LR model
predictions = classifier.predict(X_test)
print_metrics(y_test, predictions)  # evaluate the classifier with several metrics
save_model(classifier)  # save the model
Accuracy: 0.8233
AUC: 0.5
Confusion Matrix: [[24699 0]
 [ 5301 0]]
             precision recall f1-score support
0 0.82 1.00 0.90 24699
1 0.00 0.00 0.00 5301
avg / total 0.68 0.82 0.74 30000
E:\Anaconda2\soft\lib\site-packages\sklearn\metrics\classification.py:1135: UndefinedMetricWarning: Precision and F-score are ill-defined and being set to 0.0 in labels with no predicted samples.
'precision', 'predicted', average, warn_for)
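From the baseline result we can draw the following conclusions:
- Predicting every sample as "no click" yields 82.33% accuracy, which is clearly not a meaningful result.
- The confusion matrix shows that every actual click was predicted as a non-click. The likely cause is class imbalance: ads are rarely clicked, so most labels in the data are "no click", which biases the model toward predicting "no click".
- The experiment shows that accuracy alone can be a very misleading indicator of how well a model is doing.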
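The numbers above can be reproduced without any model. The sketch below rebuilds the test labels from the confusion matrix counts, scores an "always predict no click" output, and computes AUC with a small hand-written rank-based helper (`rank_auc` is a hypothetical name used only here, not a library function). It illustrates why constant predictions give high accuracy but an AUC of exactly 0.5.

```python
# Counts taken from the confusion matrix above: 24699 non-clicks, 5301 clicks.
y_true = [0] * 24699 + [1] * 5301
y_pred = [0] * len(y_true)  # a "model" that always predicts no click

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / float(len(y_true))


def rank_auc(labels, scores):
    """AUC as the chance that a random positive outranks a random negative (ties count 0.5)."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    ranks = [0.0] * len(scores)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and scores[order[end + 1]] == scores[order[start]]:
            end += 1
        average_rank = (start + end) / 2.0 + 1.0  # tied items share their mean 1-based rank
        for k in range(start, end + 1):
            ranks[order[k]] = average_rank
        start = end + 1
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    pos_rank_sum = sum(r for r, y in zip(ranks, labels) if y == 1)
    return (pos_rank_sum - n_pos * (n_pos + 1) / 2.0) / (n_pos * n_neg)


auc = rank_auc(y_true, y_pred)
print("accuracy=%.4f auc=%.1f" % (accuracy, auc))
```

Because every sample receives the same score, every positive/negative pair is a tie, so the ranking carries no information and AUC lands at 0.5 regardless of the class ratio.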
# Non-click samples make up about 82.5% of all samples, which matches our analysis: the classes are heavily imbalanced.
training_set[training_set["click"] == 0].count()[0] * 1.0 / training_set.shape[0]
0.8250982509825098
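One common way to counter this imbalance, offered here as a possible next step rather than what the original experiment did, is to reweight the classes and to evaluate on predicted probabilities instead of hard labels. The sketch uses synthetic stand-in data (not the competition data) and scikit-learn's `class_weight="balanced"` option.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

# Synthetic stand-in data (NOT the competition data), with positives in the minority.
rng = np.random.RandomState(0)
X = rng.normal(size=(5000, 3))
logits = 1.5 * X[:, 0] - 1.0 * X[:, 1] - 2.0
y = (rng.uniform(size=5000) < 1.0 / (1.0 + np.exp(-logits))).astype(int)

X_train, X_test = X[:3500], X[3500:]
y_train, y_test = y[:3500], y[3500:]

# class_weight="balanced" weights samples inversely to their class frequency.
clf = LogisticRegression(class_weight="balanced").fit(X_train, y_train)

# Score with the probability of the positive class rather than 0/1 labels.
proba = clf.predict_proba(X_test)[:, 1]
print("AUC:", roc_auc_score(y_test, proba))
print("predicted clicks:", int(clf.predict(X_test).sum()))
```

Reweighting pushes the predicted probabilities of the minority class upward, so if the final metric is probability-based the outputs may need recalibration.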
import numpy as np
from pandas import DataFrame

# Write the results in the required format
def create_submission(ids, predictions, filename='submission.csv'):
    submissions = np.concatenate((ids.reshape(len(ids), 1), predictions.reshape(len(predictions), 1)), axis=1)
    df = DataFrame(submissions)
    df.to_csv(filename, header=['id', 'click'], index=False)

classifier = joblib.load('classifier.pkl')
test_data_df = load_df('test.csv', training=False)
ids = test_data_df.values[0:, 0]
predictions = classifier.predict(test_data_df.values[0:, 1:])
create_submission(ids, predictions)
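The stated goal prefers a click probability, and the sample submission corresponds to an all-0.5 benchmark, so a probability-valued submission may be a better fit than hard 0/1 labels. The sketch below is a hypothetical variant, not the author's code: the ids, feature values, and the two-feature model are made up for illustration. It also reads `id` as a string, since the `info()` output above shows it parsed as float64, which cannot represent 20-digit ids exactly.

```python
import io

import pandas as pd
from sklearn.linear_model import LogisticRegression

# Tiny made-up stand-ins for the train/test files (hypothetical ids and values).
train_csv = io.StringIO(
    "id,click,banner_pos,device_type\n"
    "10000371904215119486,0,0,1\n"
    "10000640724480838376,1,1,1\n"
    "10000679056417042096,0,0,0\n"
    "10000720757801103869,1,1,0\n"
)
test_csv = io.StringIO(
    "id,banner_pos,device_type\n"
    "10000174058809263569,0,1\n"
    "10000182526920855428,1,0\n"
)

# Keep id as text so no digits are lost to float rounding.
train = pd.read_csv(train_csv, dtype={"id": str})
test = pd.read_csv(test_csv, dtype={"id": str})

features = ["banner_pos", "device_type"]
clf = LogisticRegression().fit(train[features], train["click"])

# Column 1 of predict_proba is the estimated probability that click == 1.
submission = pd.DataFrame({
    "id": test["id"],
    "click": clf.predict_proba(test[features])[:, 1],
})

out = io.StringIO()  # stands in for a file path such as "submission.csv"
submission.to_csv(out, index=False)
print(out.getvalue())
```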