Contents
- 1. Baseline KNN
- 2. Try SVC
Practice competition: Digit Recognizer
Related post: [Hands On ML] 3. Classification (MNIST handwritten digit prediction)
1. Baseline KNN
- Load the data
import pandas as pd
train = pd.read_csv('train.csv')
X_test = pd.read_csv('test.csv')
- Separate features and labels
train.head()
y_train = train['label']
X_train = train.drop(['label'], axis=1)
X_train
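The split above can be checked on a tiny table before running it on the full file. The following is a minimal sketch, assuming the training data has one `label` column and the remaining columns are pixel values; the helper `split_features_labels` and the toy frame are hypothetical and only serve as an illustration.

```python
import numpy as np
import pandas as pd

def split_features_labels(df, label_col="label"):
    """Return (X, y): every column except label_col, and label_col itself."""
    y = df[label_col]
    X = df.drop(columns=[label_col])  # drop returns a copy, df is unchanged
    return X, y

# Hypothetical stand-in for train.csv: 5 images with 4 pixels each
rng = np.random.default_rng(0)
toy = pd.DataFrame(rng.integers(0, 256, size=(5, 4)),
                   columns=[f"pixel{i}" for i in range(4)])
toy.insert(0, "label", [1, 0, 4, 7, 3])

X_toy, y_toy = split_features_labels(toy)
print(X_toy.shape, y_toy.shape)  # (5, 4) (5,)
```

Because `drop` returns a new frame, `train` keeps its `label` column and the split can be repeated safely.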
- Grid search for the best KNN parameters
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import accuracy_score
# help(KNeighborsClassifier)
# leaf_size only affects tree build/query speed, not the predictions
para_dict = [{'weights': ["uniform", "distance"],
              'n_neighbors': [3, 4, 5],
              'leaf_size': [10, 20]}]
knn_clf = KNeighborsClassifier()
grid_search = GridSearchCV(knn_clf, para_dict, cv=3,scoring='accuracy',n_jobs=-1)
grid_search.fit(X_train, y_train)
Output:
GridSearchCV(cv=3, estimator=KNeighborsClassifier(), n_jobs=-1,
             param_grid=[{'leaf_size': [10, 20], 'n_neighbors': [3, 4, 5],
                          'weights': ['uniform', 'distance']}],
             scoring='accuracy')
- Best parameters
grid_search.best_params_
# {'leaf_size': 10, 'n_neighbors': 4, 'weights': 'distance'}
- Best score (mean accuracy over the 3 cross-validation folds)
grid_search.best_score_
# 0.9677619047619048
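`best_params_` and `best_score_` report only the winning combination. To see how close the other candidates were, the full `cv_results_` table can be ranked. This is a rough sketch, not the run above: it assumes scikit-learn's small built-in `load_digits` data (8x8 images) as a stand-in for `train.csv` and a reduced grid so it finishes in seconds; the variable names are illustrative.

```python
import pandas as pd
from sklearn.datasets import load_digits
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

# Assumption: small built-in digits data instead of the Kaggle training file
X_small, y_small = load_digits(return_X_y=True)

param_grid = {"weights": ["uniform", "distance"], "n_neighbors": [3, 4, 5]}
search = GridSearchCV(KNeighborsClassifier(), param_grid, cv=3, scoring="accuracy")
search.fit(X_small, y_small)

# One row per parameter combination, best candidates first
results = pd.DataFrame(search.cv_results_)
cols = ["param_n_neighbors", "param_weights",
        "mean_test_score", "std_test_score", "rank_test_score"]
ranking = results[cols].sort_values("rank_test_score")
print(ranking.to_string(index=False))
```

If the top few rows differ by less than their `std_test_score`, the choice between them is probably within noise.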
- Generate predictions for the test set
y_pred = grid_search.predict(X_test)
- Write the submission file
image_id = pd.Series(range(1,len(y_pred)+1))
output = pd.DataFrame({'ImageId':image_id, 'Label':y_pred})
output.to_csv("submission.csv", index=False) # omit the index column
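Before uploading, it can help to sanity-check the table that was just written. The sketch below assumes the expected layout is exactly two columns, `ImageId` counting from 1 and `Label` holding a digit from 0 to 9, as built above; `check_submission` is a hypothetical helper, not part of pandas or of the competition tooling.

```python
import numpy as np
import pandas as pd

def check_submission(df, n_expected):
    """Raise ValueError if df does not look like an ImageId/Label table."""
    if list(df.columns) != ["ImageId", "Label"]:
        raise ValueError(f"unexpected columns: {list(df.columns)}")
    if len(df) != n_expected:
        raise ValueError(f"expected {n_expected} rows, got {len(df)}")
    if not (df["ImageId"].to_numpy() == np.arange(1, n_expected + 1)).all():
        raise ValueError("ImageId must run 1..n without gaps")
    if not df["Label"].between(0, 9).all():
        raise ValueError("Label must be a digit between 0 and 9")
    return True

# Made-up predictions standing in for y_pred
y_fake = np.array([2, 0, 9, 9, 3])
demo = pd.DataFrame({"ImageId": range(1, len(y_fake) + 1), "Label": y_fake})
print(check_submission(demo, n_expected=len(y_fake)))  # True
```

For the real file, the same check would be called as `check_submission(output, n_expected=len(X_test))`.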
- Submission result
Leaderboard: the KNN model above scores 0.97067, currently ranked 2467.
2. Try SVC
- Load the data
import pandas as pd
train = pd.read_csv('train.csv')
X_test = pd.read_csv('test.csv')
y_train = train['label']
X_train = train.drop(['label'], axis=1)
- Import packages
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC, LinearSVC
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import classification_report
from sklearn.metrics import accuracy_score
- Search for the best parameters
pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ('clf', SVC(decision_function_shape="ovr", gamma="auto"))
])
from sklearn.model_selection import RandomizedSearchCV
from scipy.stats import reciprocal, uniform
# gamma is sampled log-uniformly from [0.001, 0.1]
# uniform(1, 10) means loc=1, scale=10, so C is sampled from [1, 11]
param_distributions = {"clf__gamma": reciprocal(0.001, 0.1), "clf__C": uniform(1, 10)}
rnd_search_cv = RandomizedSearchCV(pipeline, param_distributions, n_iter=10, verbose=2, cv=3)
rnd_search_cv.fit(X_train, y_train)
- The search took about 12.4 hours (744.1 minutes for 30 fits, running with n_jobs=1)
[Parallel(n_jobs=1)]: Done 30 out of 30 | elapsed: 744.1min finished
- Best estimator
rnd_search_cv.best_estimator_
Output:
Pipeline(steps=[('scaler', StandardScaler()),
                ('clf', SVC(C=10.729327185542381, gamma=0.0022750096640207287))])
- Best score
rnd_search_cv.best_score_
# 0.9584285714285713
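The 744-minute search above ran single-process on the full training set. One common way to shorten such a search is to tune on a stratified subsample, run the fits in parallel with `n_jobs=-1`, and refit the winning pipeline on all the data afterwards. The following is only a sketch under stated assumptions: scikit-learn's built-in `load_digits` stands in for `train.csv`, and the subsample size of 500 and `n_iter=5` are arbitrary illustrative values, not tuned recommendations.

```python
from scipy.stats import reciprocal, uniform
from sklearn.base import clone
from sklearn.datasets import load_digits
from sklearn.model_selection import RandomizedSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Assumption: small built-in digits data instead of the Kaggle training file
X_all, y_all = load_digits(return_X_y=True)

# Stratified subsample so every digit keeps its share of the rows
X_sub, _, y_sub, _ = train_test_split(
    X_all, y_all, train_size=500, stratify=y_all, random_state=42)

pipe = Pipeline([("scaler", StandardScaler()), ("clf", SVC())])
param_distributions = {"clf__gamma": reciprocal(0.001, 0.1),
                       "clf__C": uniform(1, 10)}  # C in [1, 11]

# n_jobs=-1 spreads the cross-validation fits over all available cores
search = RandomizedSearchCV(pipe, param_distributions, n_iter=5, cv=3,
                            n_jobs=-1, random_state=42)
search.fit(X_sub, y_sub)

# Refit an unfitted copy of the best pipeline on the full data
final_model = clone(search.best_estimator_).fit(X_all, y_all)
print(search.best_params_)
```

Parameters found on a subsample are not guaranteed to be the best for the full set, so this trades some accuracy for a much shorter search.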
- Predict and write the submission file
y_pred = rnd_search_cv.best_estimator_.predict(X_test)
image_id = pd.Series(range(1,len(y_pred)+1))
output = pd.DataFrame({'ImageId':image_id, 'Label':y_pred})
output.to_csv("submission_svc.csv", index=False)
The SVC (support vector classifier) model scores 0.96464 on the leaderboard, lower than the KNN model above (0.97067).