Note: ARI takes values in [-1, 1], with larger values being better. It measures the agreement between two partitions, and it requires the data to come with ground-truth class labels.
Let C denote the ground-truth class partition and K the clustering result. Define a as the number of instance pairs that fall in the same class in C and in the same cluster in K, and b as the number of pairs that fall in different classes in C and in different clusters in K. The Rand Index is then defined as:

RI = (a + b) / C(n, 2)

where C(n, 2) is the total number of instance pairs among the n samples.
The Rand Index cannot guarantee that a random clustering scores close to 0. Hence the Adjusted Rand Index was proposed:

ARI = (RI - E[RI]) / (max(RI) - E[RI])

where E[RI] is the expected Rand Index under a random partition.
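Before moving on to the adjusted version, the plain Rand Index above can be sketched by brute-force pair enumeration (a minimal illustration, not how sklearn implements it):

```python
from itertools import combinations

def rand_index(labels_true, labels_pred):
    """Plain Rand Index: enumerate all instance pairs and count agreements."""
    a = b = 0
    for i, j in combinations(range(len(labels_true)), 2):
        same_class = labels_true[i] == labels_true[j]
        same_cluster = labels_pred[i] == labels_pred[j]
        if same_class and same_cluster:
            a += 1  # same class in C, same cluster in K
        elif not same_class and not same_cluster:
            b += 1  # different classes in C, different clusters in K
    n = len(labels_true)
    return (a + b) / (n * (n - 1) // 2)

print(rand_index([0, 0, 0, 1, 1, 1], [0, 0, 1, 1, 2, 2]))
# → 0.6666666666666666
```

Note that this later example pair scores a fairly high RI of 2/3 even though the clustering is far from perfect, which is exactly the inflation that the adjustment below corrects for.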
To compute ARI, a contingency table is introduced. It reflects the overlap between the true class partition and the clustering: rows correspond to the true classes, columns to the cluster labels, and n_ij is the number of instances shared by class i and cluster j, as shown below:

         K_1   K_2   ...  K_s  | sums
    C_1  n_11  n_12  ...  n_1s | a_1
    ...  ...   ...   ...  ...  | ...
    C_r  n_r1  n_r2  ...  n_rs | a_r
    -----------------------------
    sums b_1   b_2   ...  b_s
With the contingency table in hand, ARI can be computed as:

ARI = [ Σ_ij C(n_ij, 2) − Σ_i C(a_i, 2) · Σ_j C(b_j, 2) / C(n, 2) ] /
      [ (Σ_i C(a_i, 2) + Σ_j C(b_j, 2)) / 2 − Σ_i C(a_i, 2) · Σ_j C(b_j, 2) / C(n, 2) ]

where a_i and b_j are the row and column sums. Note that here max(RI) has, in effect, been replaced by mean(RI): the denominator uses the average of the two marginal pair counts rather than their maximum.
Let's work through an example.
Example: given the true partition labels_true = [0, 0, 0, 1, 1, 1] and the clustering labels_pred = [0, 0, 1, 1, 2, 2], compute the ARI.
Partition diagram: (figure showing how the true classes and the predicted clusters split the six instances)
The contingency table for this example:

         K=0  K=1  K=2 | sums
    C=0   2    1    0  |  3
    C=1   0    1    2  |  3
    -------------------
    sums  2    2    2
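Plugging the table into the formula above gives the ARI by hand; a quick sanity-check sketch using the standard-library math.comb:

```python
from math import comb

# Contingency table for the example: rows = true classes, columns = clusters
contingency = [[2, 1, 0],
               [0, 1, 2]]
n = 6

sum_comb = sum(comb(n_ij, 2) for row in contingency for n_ij in row)  # 2
sum_comb_c = sum(comb(sum(row), 2) for row in contingency)            # 6
sum_comb_k = sum(comb(s, 2) for s in map(sum, zip(*contingency)))     # 3
prod_comb = sum_comb_c * sum_comb_k / comb(n, 2)                      # 1.2
mean_comb = (sum_comb_c + sum_comb_k) / 2                             # 4.5
ari = (sum_comb - prod_comb) / (mean_comb - prod_comb)
print(ari)  # ≈ 0.2424
```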
Now let's see how sklearn computes it: the adjusted_rand_score function in https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/metrics/cluster/_supervised.py.
labels_true, labels_pred = check_clusterings(labels_true, labels_pred)
n_samples = labels_true.shape[0]
n_classes = np.unique(labels_true).shape[0]
n_clusters = np.unique(labels_pred).shape[0]

# Special limit cases: no clustering since the data is not split;
# or trivial clustering where each document is assigned a unique cluster.
# These are perfect matches hence return 1.0.
if (n_classes == n_clusters == 1 or
        n_classes == n_clusters == 0 or
        n_classes == n_clusters == n_samples):
    return 1.0

# Compute the ARI using the contingency data
contingency = contingency_matrix(labels_true, labels_pred, sparse=True)
sum_comb_c = sum(_comb2(n_c) for n_c in np.ravel(contingency.sum(axis=1)))
sum_comb_k = sum(_comb2(n_k) for n_k in np.ravel(contingency.sum(axis=0)))
sum_comb = sum(_comb2(n_ij) for n_ij in contingency.data)

prod_comb = (sum_comb_c * sum_comb_k) / _comb2(n_samples)
mean_comb = (sum_comb_k + sum_comb_c) / 2.
return (sum_comb - prod_comb) / (mean_comb - prod_comb)
Run it and check the result:
# coding:utf-8
"""
Test the ARI clustering evaluation metric.
"""
from sklearn import metrics

labels_true = [0, 0, 0, 1, 1, 1]
labels_pred = [0, 0, 1, 1, 2, 2]
print(metrics.adjusted_rand_score(labels_true, labels_pred))

Output:
0.24242424242424246
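As a quick check of the properties discussed above, ARI is symmetric in its two arguments, is invariant to relabeling the clusters, and returns 1.0 for a perfect match (just as the special-case branch in the sklearn source promises):

```python
from sklearn import metrics

labels_true = [0, 0, 0, 1, 1, 1]
labels_pred = [0, 0, 1, 1, 2, 2]

# Symmetric: swapping the two partitions gives the same score
print(metrics.adjusted_rand_score(labels_pred, labels_true))

# Permuting the cluster ids does not change the score
print(metrics.adjusted_rand_score(labels_true, [1, 1, 0, 0, 2, 2]))

# Perfect agreement (up to renaming) scores exactly 1.0
print(metrics.adjusted_rand_score(labels_true, [2, 2, 2, 0, 0, 0]))  # → 1.0
```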