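sklearn supports both multiclass and multilabel classification:
Multiclass classification: a classification task with more than two classes. It assumes that each sample belongs to exactly one class. For example, a fruit can be an apple or an orange, but not both.
Multilabel classification: each sample is assigned one or more labels. For example, a news article can belong to both the sports category and the entertainment category.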
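The difference shows up in the shape of the target. As a minimal sketch (the fruit and news labels are made-up illustration data, not from any dataset), a multiclass target is a 1-D array with one class per sample, while a multilabel target is commonly encoded as a binary indicator matrix with one column per label. MultiLabelBinarizer is one way to build that matrix:

```python
import numpy as np
from sklearn.preprocessing import MultiLabelBinarizer

# Multiclass: exactly one class per sample -> 1-D target (hypothetical data).
y_multiclass = np.array(["apple", "orange", "apple"])

# Multilabel: a set of labels per sample (hypothetical data).
news_labels = [{"sports"}, {"sports", "entertainment"}, {"entertainment"}]

mlb = MultiLabelBinarizer()
Y_multilabel = mlb.fit_transform(news_labels)

print(mlb.classes_)   # column order of the indicator matrix
print(Y_multilabel)   # one row per sample, one column per label
```

Each row of Y_multilabel may contain several ones, which is exactly what a multiclass target cannot express.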
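The official sklearn documentation lists the classes that support multilabel classification.
The walkthrough below uses a decision tree as the example.
Data preparation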
from sklearn.datasets import make_multilabel_classification
# Generate a random multilabel classification problem.
# For each sample, the generative process is:
#   pick the number of labels: n ~ Poisson(n_labels)
#   n times, choose a class c: c ~ Multinomial(theta)
#   pick the document length: k ~ Poisson(length)
#   k times, choose a word: w ~ Multinomial(theta_c)
X, Y = make_multilabel_classification(n_samples=10, n_features=5, n_classes=3, n_labels=2)
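The generated X and Y have the following form: X is an array of shape (n_samples, n_features), and Y is a binary indicator matrix of shape (n_samples, n_classes).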
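To see that form concretely, the generated arrays can be inspected directly. This is a small sketch; random_state=0 is an added assumption so that repeated runs give the same data, and it is not part of the original call.

```python
from sklearn.datasets import make_multilabel_classification

# random_state=0 is an added assumption for reproducibility.
X, Y = make_multilabel_classification(
    n_samples=10, n_features=5, n_classes=3, n_labels=2, random_state=0
)

print(X.shape)        # (n_samples, n_features)
print(Y.shape)        # (n_samples, n_classes)
print(Y[:3])          # each row is a 0/1 indicator vector over the 3 labels
print(Y.sum(axis=1))  # number of labels assigned to each sample
```

A row of Y can contain zero, one, or several ones, since the number of labels per sample is drawn at random.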
Classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
# Split the dataset 8:2 into training and test sets
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2)
cls = DecisionTreeClassifier()
cls.fit(X_train, Y_train)
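As a quick sanity check on the steps above, the following self-contained sketch confirms that DecisionTreeClassifier accepts the 2-D indicator matrix directly and treats each label column as a separate output. The random_state arguments are added assumptions for reproducibility.

```python
from sklearn.datasets import make_multilabel_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Same settings as in the walkthrough; random_state is an added assumption.
X, Y = make_multilabel_classification(
    n_samples=10, n_features=5, n_classes=3, n_labels=2, random_state=0
)
X_train, X_test, Y_train, Y_test = train_test_split(
    X, Y, test_size=0.2, random_state=0
)

clf = DecisionTreeClassifier(random_state=0)
clf.fit(X_train, Y_train)

print(clf.n_outputs_)              # one output per label column
print(clf.predict(X_test).shape)   # same shape as Y_test
```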
Multilabel classification evaluation
from sklearn import metrics
Y_pred = cls.predict(X_test)
The values of Y_test and Y_pred are as follows:
metrics.f1_score(Y_test, Y_pred, average="macro")
# 0.666
metrics.f1_score(Y_test, Y_pred, average="micro")
# 0.8
metrics.f1_score(Y_test, Y_pred, average="weighted")
# 1.0
metrics.f1_score(Y_test, Y_pred, average="samples")
# 0.4
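The four averaging modes can disagree considerably on the same predictions. The sketch below uses two hypothetical samples (not the data from the run above) in which one label is predicted but never truly present and one sample has no labels at all. Passing zero_division=0 is an added choice that scores undefined cases as 0 without emitting a warning.

```python
import numpy as np
from sklearn.metrics import f1_score

# Hypothetical ground truth and predictions: 2 samples, 3 labels.
Y_true = np.array([[1, 1, 0],
                   [0, 0, 0]])
Y_hat = np.array([[1, 1, 1],
                  [0, 0, 0]])

# macro: unweighted mean of the per-label F1 scores.
f1_macro = f1_score(Y_true, Y_hat, average="macro", zero_division=0)
# micro: F1 computed from TP/FP/FN pooled over all labels.
f1_micro = f1_score(Y_true, Y_hat, average="micro", zero_division=0)
# weighted: per-label F1 weighted by the number of true instances of each label.
f1_weighted = f1_score(Y_true, Y_hat, average="weighted", zero_division=0)
# samples: F1 computed per sample (row), then averaged over samples.
f1_samples = f1_score(Y_true, Y_hat, average="samples", zero_division=0)

print(f1_macro, f1_micro, f1_weighted, f1_samples)
```

Here the third label has no true instances, so it lowers the macro average but carries zero weight in the weighted average. The empty second sample contributes an F1 of 0 to the samples average.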
Probability prediction
Y_prob = cls.predict_proba(X_test)
The values of X_test and Y_prob are as follows:
predict_proba(X)
X: array-like or sparse matrix of shape = [n_samples, n_features]
RETURN: array of shape = [n_samples, n_classes], or a list of n_outputs such arrays if n_outputs > 1
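For a multilabel target the tree has several outputs, so predict_proba returns a list with one array per label rather than a single matrix. The sketch below shows one way to collapse that list into an (n_samples, n_labels) matrix of "label is present" probabilities. The helper positive_proba is a hypothetical name introduced here, and n_samples=50 and random_state=0 are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.datasets import make_multilabel_classification
from sklearn.tree import DecisionTreeClassifier

# Illustration data; sample count and random_state are assumptions.
X, Y = make_multilabel_classification(
    n_samples=50, n_features=5, n_classes=3, n_labels=2, random_state=0
)
clf = DecisionTreeClassifier(random_state=0).fit(X, Y)

# One array per label, each of shape (n_samples, n_classes_for_that_label).
proba_list = clf.predict_proba(X)

def positive_proba(p, classes):
    """Return the probability column for class 1 (hypothetical helper).

    If a label never appeared as 1 in training, there is no such column,
    so the probability of the label being present is taken as 0.
    """
    classes = list(classes)
    if 1 in classes:
        return p[:, classes.index(1)]
    return np.zeros(p.shape[0])

# clf.classes_ is a list of class arrays, one per output.
P = np.column_stack(
    [positive_proba(p, c) for p, c in zip(proba_list, clf.classes_)]
)
print(len(proba_list))  # number of labels
print(P.shape)          # (n_samples, n_labels)
```

Looking up the column through clf.classes_ instead of hard-coding index 1 avoids an IndexError when a label takes only a single value in the training split.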