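Contents
- 1. Binary classification with logistic regression
- 2. Spam filtering
- 2.1 Performance metrics
- 2.2 Accuracy
- 2.3 Precision and recall
- 2.4 F1 score
- 2.5 ROC and AUC
- 3. Tuning hyperparameters with grid search
- 4. Multi-class classification
- 5. Multi-label classification
- 5.1 Performance metrics for multi-label classification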
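These are study notes on the book "scikit-learn 机器学习" (2nd edition).
Logistic regression is commonly used for classification tasks.
1. Binary classification with logistic regression
See also: 《统计学习方法》 logistic regression model (Logistic Regression, LR)
Definition: let $X$ be a continuous random variable. $X$ follows the logistic distribution if it has the following distribution function and density function: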
$$F(x) = P(X \leq x) = \frac{1}{1+e^{-(x-\mu)/\gamma}}$$
$$f(x) = F'(x) = \frac{e^{-(x-\mu)/\gamma}}{\gamma \left(1+e^{-(x-\mu)/\gamma}\right)^2}$$
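In logistic regression, an instance is predicted as the positive class when its predicted probability is >= the threshold, and as the negative class otherwise.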
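As a rough illustration of this decision rule, the sketch below maps a linear score to a probability with the standard logistic function (the case $\mu = 0$, $\gamma = 1$ of $F(x)$) and compares it with a threshold. The function names and the default threshold of 0.5 are assumptions made for the example, not something the notes specify.

```python
import math

def sigmoid(z):
    """Standard logistic function: F(x) with mu = 0 and gamma = 1."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_label(z, threshold=0.5):
    """Return 1 (positive) if the probability reaches the threshold, else 0."""
    return 1 if sigmoid(z) >= threshold else 0

# z stands for the linear score w.x + b of one instance (hypothetical values).
for z in (-2.0, 0.0, 2.0):
    print(z, round(sigmoid(z), 3), predict_label(z))
```

Raising the threshold makes the classifier more conservative about predicting the positive class, which is the lever behind the precision/recall trade-off.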
2. Spam filtering
Extract TF-IDF features from the messages and classify them with logistic regression.
import pandas as pd
data = pd.read_csv("SMSSpamCollection", delimiter='\t', header=None)
data
data[data[0]=='ham'][0].count() # 4825 ham (legitimate) messages
data[data[0]=='spam'][0].count() # 747 spam messages
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split, cross_val_score

X = data[1].values
y = data[0].values
from sklearn.preprocessing import LabelBinarizer
lb = LabelBinarizer()
# ham -> 0, spam -> 1; the result is a column vector of shape (n_samples, 1)
y = lb.fit_transform(y)

X_train_raw, X_test_raw, y_train, y_test = train_test_split(X, y, random_state=520)

# Fit the vocabulary and IDF weights on the training messages only
vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(X_train_raw)
X_test = vectorizer.transform(X_test_raw)

classifier = LogisticRegression()
classifier.fit(X_train, y_train)

pred = classifier.predict(X_test)
for i, pred_i in enumerate(pred[:5]):
    print("Predicted: %s, message: %s, actual: %s" % (pred_i, X_test_raw[i], y_test[i]))
Predicted: 0, message: Aww that's the first time u said u missed me without asking if I missed u first. You DO love me! :), actual: [0]
Predicted: 0, message: Poor girl can't go one day lmao, actual: [0]
Predicted: 0, message: Also remember the beads don't come off. Ever., actual: [0]
Predicted: 0, message: I see the letter B on my car, actual: [0]
Predicted: 0, message: My love ! How come it took you so long to leave for Zaher's? I got your words on ym and was happy to see them but was sad you had left. I miss you, actual: [0]
2.1 Performance metrics
Confusion matrix
from sklearn.metrics import confusion_matrix
import matplotlib.pyplot as plt
# Store the result under a new name so the imported function is not shadowed
cm = confusion_matrix(y_test, pred) # rows: true class, columns: predicted class
plt.matshow(cm)
plt.title("Confusion matrix")
plt.ylabel('True')
plt.xlabel('Predicted')
plt.colorbar()
plt.show()
2.2 Accuracy
scores = cross_val_score(classifier, X_train, y_train, cv=5)
print('Accuracies: %s' % scores)
print('Mean accuracy: %s' % np.mean(scores))
Accuracies: [0.94976077 0.95933014 0.96650718 0.95215311 0.95688623]
Mean accuracy: 0.9569274847434318
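Accuracy is not a very suitable metric here: it does not distinguish between the two kinds of error, a positive predicted as negative versus a negative predicted as positive.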
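To make the limitation concrete, here is a small sketch using the class counts reported for the SMS data set (4825 ham, 747 spam). The "always predict ham" baseline is a hypothetical classifier introduced only for illustration; it is not part of the notes.

```python
# Class counts taken from the SMS data set described in these notes.
n_ham, n_spam = 4825, 747

# Hypothetical baseline: label every message as ham (0), never as spam (1).
y_true = [0] * n_ham + [1] * n_spam
y_pred = [0] * (n_ham + n_spam)

correct = sum(t == p for t, p in zip(y_true, y_pred))
accuracy = correct / len(y_true)

# Recall on the spam class: fraction of real spam that was caught.
caught = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
spam_recall = caught / n_spam

print("accuracy: %.3f" % accuracy)        # about 0.866
print("spam recall: %.3f" % spam_recall)  # 0.000
```

A filter that never flags anything still scores roughly 87% accuracy on this imbalanced data, which is why the following sections look at precision, recall, and F1.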
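2.3 Precision and recall
See also: [Hands On ML] 3. Classification (MNIST handwritten digit prediction)
Precision or recall on its own says little, because either one can be raised at the expense of the other; the two should be read together.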
from sklearn.metrics import precision_score, recall_score, f1_score
precisions = precision_score(y_test, pred)
print('Precision: %s' % precisions)
recalls = recall_score(y_test, pred)
print('Recall: %s' % recalls)
Precision: 0.9852941176470589
Almost every message predicted as spam really is spam.
Recall: 0.6979166666666666
About 30% of the spam messages were predicted as non-spam.
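The two scores can be traced back to confusion-matrix counts. The counts used below (67 true positives, 1 false positive, 29 false negatives) are an assumption: they are the smallest integers that reproduce the reported fractions 67/68 and 67/96, and any common multiple (for example 134, 2, 58) gives the same scores. The notes do not print the actual counts.

```python
def precision_recall(tp, fp, fn):
    """Compute precision and recall from confusion-matrix counts."""
    precision = tp / (tp + fp)  # of the predicted positives, how many are right
    recall = tp / (tp + fn)     # of the actual positives, how many were found
    return precision, recall

# Assumed counts, chosen to match the reported scores (see the text above).
precision, recall = precision_recall(tp=67, fp=1, fn=29)
print(precision, recall)
```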
2.4 F1 score
The F1 score is the harmonic mean of precision and recall, so it balances the two.
f1s = f1_score(y_test, pred)
print('F1 score: %s' % f1s)
# F1 score: 0.8170731707317074
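As a quick check, the reported F1 value can be recomputed by hand from the precision and recall printed earlier. The helper name `f1_from` is made up for this sketch.

```python
def f1_from(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# Precision and recall as reported in section 2.3.
f1 = f1_from(0.9852941176470589, 0.6979166666666666)
print(round(f1, 4))  # 0.8171
```

Because the harmonic mean is pulled toward the smaller of the two values, the low recall drags F1 well below the precision.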
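2.5 ROC and AUC
- The closer a classifier's AUC is to 1 the better; a random classifier has an AUC of 0.5.
from sklearn.metrics import roc_curve
from sklearn.metrics import roc_auc_score

# Use the predicted probability of the positive class rather than hard 0/1
# predictions, so that the curve is traced over many thresholds
pred_proba = classifier.predict_proba(X_test)[:, 1]
false_positive_rate, recall, thresholds = roc_curve(y_test.ravel(), pred_proba)
# Store the score under a new name so the imported function is not shadowed
auc = roc_auc_score(y_test.ravel(), pred_proba)

plt.title('Receiver operating characteristic')
plt.plot(false_positive_rate, recall, 'b', label='AUC = %0.2f' % auc)
plt.legend(loc='lower right')
plt.plot([0, 1], [0, 1], 'r--') # diagonal: a random classifier
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.0])
plt.ylabel('Recall')
plt.xlabel('Fall-out')
plt.show()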
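One way to build intuition for AUC: it equals the probability that a randomly chosen positive instance receives a higher score than a randomly chosen negative one. The sketch below computes it that way in plain Python on hypothetical toy labels and scores; it is a hand-rolled illustration, not how scikit-learn implements `roc_auc_score`.

```python
def auc_by_ranking(y_true, y_score):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties count as half a correct pair.
    """
    pos = [s for t, s in zip(y_true, y_score) if t == 1]
    neg = [s for t, s in zip(y_true, y_score) if t == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

# Hypothetical toy example: 2 negatives, 2 positives.
y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]
print(auc_by_ranking(y_true, y_score))  # 0.75 (3 of 4 pairs ranked correctly)
```

A classifier that gives every instance the same score gets 0.5 under this definition, matching the random-classifier baseline.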
3. Tuning hyperparameters with grid search
import pandas as pd
from sklearn.preprocessing import LabelEncoder
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split
from sklearn.metrics import precision_score, recall_score, accuracy_score

pipeline = Pipeline([
    ('vect', TfidfVectorizer(stop_words='english')),
    # liblinear supports both the 'l1' and 'l2' penalties searched below
    ('clf', LogisticRegression(solver='liblinear'))
])
parameters = {
    'vect__max_df': (0.25, 0.5, 0.75), # key format: step name + '__' + parameter name
    'vect__stop_words': ('english', None),
    'vect__max_features': (2500, 5000, None),
    'vect__ngram_range': ((1, 1), (1, 2)),
    'vect__use_idf': (True, False),
    'clf__penalty': ('l1', 'l2'),
    'clf__C': (0.01, 0.1, 1, 10),
}

if __name__ == "__main__":
    df = pd.read_csv('./SMSSpamCollection', delimiter='\t', header=None)
    X = df[1].values
    y = df[0].values
    label_encoder = LabelEncoder()
    y = label_encoder.fit_transform(y)
    X_train, X_test, y_train, y_test = train_test_split(X, y)

    # Try every parameter combination with 3-fold cross-validation
    grid_search = GridSearchCV(pipeline, parameters, n_jobs=-1, verbose=1, scoring='accuracy', cv=3)
    grid_search.fit(X_train, y_train)

    print('Best score: %0.3f' % grid_search.best_score_)
    print('Best parameters set:')
    best_parameters = grid_search.best_estimator_.get_params()
    for param_name in sorted(parameters.keys()):
        print('\t%s: %r' % (param_name, best_parameters[param_name]))

    # Evaluate the refitted best model on the held-out test set
    predictions = grid_search.predict(X_test)
    print('Accuracy: %s' % accuracy_score(y_test, predictions))
    print('Precision: %s' % precision_score(y_test, predictions))
    print('Recall: %s' % recall_score(y_test, predictions))
Best score: 0.985
Best parameters set:
	clf__C: 10
	clf__penalty: 'l2'
	vect__max_df: 0.5
	vect__max_features: 5000
	vect__ngram_range: (1, 2)
	vect__stop_words: None
	vect__use_idf: True
Accuracy: 0.9791816223977028
Precision: 1.0
Recall: 0.8605769230769231
After tuning the hyperparameters, recall improved (from about 0.70 to about 0.86).
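Grid search is exhaustive, so its cost grows with the product of the option counts. The sketch below enumerates the same grid with `itertools.product` to show how many candidates and fits the search involves; it only mimics the enumeration and is not how `GridSearchCV` is implemented internally.

```python
from itertools import product

# Same grid as in the search in this section.
parameters = {
    'vect__max_df': (0.25, 0.5, 0.75),
    'vect__stop_words': ('english', None),
    'vect__max_features': (2500, 5000, None),
    'vect__ngram_range': ((1, 1), (1, 2)),
    'vect__use_idf': (True, False),
    'clf__penalty': ('l1', 'l2'),
    'clf__C': (0.01, 0.1, 1, 10),
}

# One dict per combination: the Cartesian product of all option tuples.
names = sorted(parameters)
candidates = [dict(zip(names, values))
              for values in product(*(parameters[n] for n in names))]

cv = 3
print(len(candidates))       # 576 parameter combinations
print(len(candidates) * cv)  # 1728 model fits, plus one final refit
```

Adding one more option to any parameter multiplies the total, which is why `n_jobs=-1` is used to spread the fits across all CPU cores.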
4. Multi-class classification
Predicting the sentiment of movie reviews
data = pd.read_csv("./chapter5_movie_train.csv",header=0,delimiter='\t')
data
data['Sentiment'].describe()
count 156060.000000
mean 2.063578
std 0.893832
min 0.000000
25% 2.000000
50% 2.000000
75% 3.000000
max 4.000000
Name: Sentiment, dtype: float64
On average the sentiment is fairly neutral: the mean is about 2 on the 0 to 4 scale.
data["Sentiment"].value_counts()/data["Sentiment"].count()
2 0.509945
3 0.210989
1 0.174760
4 0.058990
0 0.045316
Name: Sentiment, dtype: float64
About 51% of the examples have neutral sentiment (label 2).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, accuracy_score, confusion_matrix
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV

df = pd.read_csv('./chapter5_movie_train.csv', header=0, delimiter='\t')
X, y = df['Phrase'], df['Sentiment'].values
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.5)

pipeline = Pipeline([
    ('vect', TfidfVectorizer(stop_words='english')),
    ('clf', LogisticRegression())
])
parameters = {
    'vect__max_df': (0.25, 0.5),
    'vect__ngram_range': ((1, 1), (1, 2)),
    'vect__use_idf': (True, False),
    'clf__C': (0.1, 1, 10),
}

grid_search = GridSearchCV(pipeline, parameters, n_jobs=-1, verbose=1, scoring='accuracy')
grid_search.fit(X_train, y_train)

print('Best score: %0.3f' % grid_search.best_score_)
print('Best parameters set:')
best_parameters = grid_search.best_estimator_.get_params()
for param_name in sorted(parameters.keys()):
    print('\t%s: %r' % (param_name, best_parameters[param_name]))
Best score: 0.619
Best parameters set:
	clf__C: 10
	vect__max_df: 0.25
	vect__ngram_range: (1, 2)
	vect__use_idf: False
- Performance metrics
predictions = grid_search.predict(X_test)

print('Accuracy: %s' % accuracy_score(y_test, predictions))
print('Confusion Matrix:')
print(confusion_matrix(y_test, predictions))
print('Classification Report:')
print(classification_report(y_test, predictions))
Accuracy: 0.6292323465333846
Confusion Matrix:
[[ 1013  1742   682   106    11]
 [  794  5914  6275   637    49]
 [  196  3207 32397  3686   222]
 [   28   488  6513  8131  1299]
 [    1    59   548  2388  1644]]
Classification Report:
              precision    recall  f1-score   support

           0       0.50      0.29      0.36      3554
           1       0.52      0.43      0.47     13669
           2       0.70      0.82      0.75     39708
           3       0.54      0.49      0.52     16459
           4       0.51      0.35      0.42      4640

    accuracy                           0.63     78030
   macro avg       0.55      0.48      0.50     78030
weighted avg       0.61      0.63      0.62     78030
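The numbers in the classification report can be recovered directly from the confusion matrix, which helps when reading a multi-class report. The sketch below recomputes the overall accuracy and the per-class precision and recall from the matrix printed in this section, in plain Python.

```python
# Confusion matrix from the output in this section:
# rows are true classes 0..4, columns are predicted classes 0..4.
cm = [
    [1013, 1742,   682,  106,   11],
    [ 794, 5914,  6275,  637,   49],
    [ 196, 3207, 32397, 3686,  222],
    [  28,  488,  6513, 8131, 1299],
    [   1,   59,   548, 2388, 1644],
]
k = len(cm)

total = sum(sum(row) for row in cm)
accuracy = sum(cm[i][i] for i in range(k)) / total  # diagonal = correct predictions

# Recall of class i: correct predictions of i / true instances of i (row sum).
recalls = [cm[i][i] / sum(cm[i]) for i in range(k)]
# Precision of class j: correct predictions of j / predictions of j (column sum).
precisions = [cm[j][j] / sum(row[j] for row in cm) for j in range(k)]

print(round(accuracy, 4))
print([round(r, 2) for r in recalls])
print([round(p, 2) for p in precisions])
```

The row sums equal the `support` column of the report, and the strong recall on class 2 reflects how heavily the neutral class dominates the data.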
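5. Multi-label classification
- An instance can be assigned multiple labels.
Problem transformation:
- Treat each combination of labels seen on an instance (for example L1 and L2) as a single new class (L1 and L2), and so on. Drawbacks: this produces a large number of classes, and the model can only predict combinations that appear in the training data, so many possible combinations may never be covered.
- Train one binary classifier per label (is this instance L1? is it L2?). Drawback: this ignores relationships between labels.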
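The two transformations above can be sketched on a toy data set. The label sets below are hypothetical and the variable names are made up for this example; only the target construction is shown, not the training of any classifier.

```python
# Hypothetical toy data: each instance carries a set of labels.
label_sets = [{"L1", "L2"}, {"L1"}, {"L2"}, {"L1", "L2"}]
labels = sorted(set().union(*label_sets))  # ['L1', 'L2']

# Transformation 1: every distinct label combination becomes one class.
powerset_targets = ["+".join(sorted(s)) for s in label_sets]

# Transformation 2: one binary target vector per label,
# each of which would be fed to its own binary classifier.
binary_targets = {lab: [int(lab in s) for s in label_sets] for lab in labels}

print(powerset_targets)  # ['L1+L2', 'L1', 'L2', 'L1+L2']
print(binary_targets)    # {'L1': [1, 1, 0, 1], 'L2': [1, 0, 1, 1]}
```

With n labels the first transformation can need up to 2^n classes, while the second always needs exactly n binary problems.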
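5.1 Performance metrics for multi-label classification
- Hamming loss: the average fraction of incorrectly predicted labels; 0 is best.
- Jaccard similarity coefficient: the size of the intersection of the predicted and true labels divided by the size of their union; 1 is best.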
from sklearn.metrics import hamming_loss, jaccard_score
# help(jaccard_score)

y_true = np.array([[0.0, 1.0], [1.0, 1.0]])
# Three predictions: all correct, one wrong label, two wrong labels
y_preds = [
    np.array([[0.0, 1.0], [1.0, 1.0]]),
    np.array([[1.0, 1.0], [1.0, 1.0]]),
    np.array([[1.0, 1.0], [0.0, 1.0]]),
]
for y_pred in y_preds:
    print(hamming_loss(y_true, y_pred))
for y_pred in y_preds:
    # average=None returns one score per label (column)
    print(jaccard_score(y_true, y_pred, average=None))
0.0
0.25
0.5
[1. 1.]
[0.5 1. ]
[0. 1.]
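To see where these numbers come from, both metrics can be recomputed by hand. The plain-Python sketch below mirrors the definitions rather than scikit-learn's implementation; the function names are made up, and returning 0.0 for a label with an empty union is an assumption of this sketch.

```python
def hamming(y_true, y_pred):
    """Fraction of label slots where prediction and truth differ."""
    cells = [(t, p) for rt, rp in zip(y_true, y_pred) for t, p in zip(rt, rp)]
    return sum(t != p for t, p in cells) / len(cells)

def jaccard_per_label(y_true, y_pred):
    """Per label (column): |true AND pred| / |true OR pred|."""
    result = []
    for col in range(len(y_true[0])):
        pairs = [(rt[col], rp[col]) for rt, rp in zip(y_true, y_pred)]
        inter = sum(1 for t, p in pairs if t == 1 and p == 1)
        union = sum(1 for t, p in pairs if t == 1 or p == 1)
        # Assumption: score 0.0 when the label is absent from both sides.
        result.append(inter / union if union else 0.0)
    return result

y_true = [[0, 1], [1, 1]]
for y_pred in ([[0, 1], [1, 1]], [[1, 1], [1, 1]], [[1, 1], [0, 1]]):
    print(hamming(y_true, y_pred), jaccard_per_label(y_true, y_pred))
# 0.0 [1.0, 1.0]
# 0.25 [0.5, 1.0]
# 0.5 [0.0, 1.0]
```

In the second case one of the four label slots is wrong, giving a Hamming loss of 0.25, and the first label has an intersection of 1 over a union of 2, giving 0.5.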