Note: ARI takes values in [-1, 1], with larger values being better. It measures the agreement between two partitions, and it requires the data to come with ground-truth class labels.
Let C denote the ground-truth class partition and K the clustering result. Define a as the number of instance pairs that fall in the same class in C and in the same cluster in K, and b as the number of pairs that fall in different classes in C and in different clusters in K. The Rand Index is then defined as:

RI = (a + b) / C(n, 2)

where C(n, 2) is the total number of instance pairs among the n samples.
The Rand Index cannot guarantee that a random clustering scores close to 0. Hence the Adjusted Rand Index was proposed:

ARI = (RI - E[RI]) / (max(RI) - E[RI])

where E[RI] is the expected Rand Index under a random partition.
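Before moving on to the adjusted version, the plain Rand Index above can be sketched by brute-force pair enumeration (a minimal illustration, not how sklearn implements it):

```python
from itertools import combinations

def rand_index(labels_true, labels_pred):
    """Plain Rand Index: enumerate all instance pairs and count agreements."""
    a = b = 0
    for i, j in combinations(range(len(labels_true)), 2):
        same_class = labels_true[i] == labels_true[j]
        same_cluster = labels_pred[i] == labels_pred[j]
        if same_class and same_cluster:
            a += 1  # same class in C, same cluster in K
        elif not same_class and not same_cluster:
            b += 1  # different classes in C, different clusters in K
    n = len(labels_true)
    return (a + b) / (n * (n - 1) // 2)

print(rand_index([0, 0, 0, 1, 1, 1], [0, 0, 1, 1, 2, 2]))
# → 0.6666666666666666
```

Note that this later example pair scores a fairly high RI of 2/3 even though the clustering is far from perfect, which is exactly the inflation that the adjustment below corrects for.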
To compute ARI, a contingency table is introduced. It reflects the overlap between the true class partition and the clustering: rows correspond to the true classes, columns to the cluster labels, and n_ij is the number of instances shared by class i and cluster j, as shown below:

         K_1   K_2   ...  K_s  | sums
    C_1  n_11  n_12  ...  n_1s | a_1
    ...  ...   ...   ...  ...  | ...
    C_r  n_r1  n_r2  ...  n_rs | a_r
    -----------------------------
    sums b_1   b_2   ...  b_s
With the contingency table in hand, ARI can be computed as:

ARI = [ Σ_ij C(n_ij, 2) − Σ_i C(a_i, 2) · Σ_j C(b_j, 2) / C(n, 2) ] /
      [ (Σ_i C(a_i, 2) + Σ_j C(b_j, 2)) / 2 − Σ_i C(a_i, 2) · Σ_j C(b_j, 2) / C(n, 2) ]

where a_i and b_j are the row and column sums. Note that here max(RI) has, in effect, been replaced by mean(RI): the denominator uses the average of the two marginal pair counts rather than their maximum.
Let's work through an example.
Example: given the true partition labels_true = [0, 0, 0, 1, 1, 1] and the clustering labels_pred = [0, 0, 1, 1, 2, 2], compute the ARI.
Partition diagram: (figure showing how the true classes and the predicted clusters split the six instances)
The contingency table for this example:

         K=0  K=1  K=2 | sums
    C=0   2    1    0  |  3
    C=1   0    1    2  |  3
    -------------------
    sums  2    2    2
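Plugging the table into the formula above gives the ARI by hand; a quick sanity-check sketch using the standard-library math.comb:

```python
from math import comb

# Contingency table for the example: rows = true classes, columns = clusters
contingency = [[2, 1, 0],
               [0, 1, 2]]
n = 6

sum_comb = sum(comb(n_ij, 2) for row in contingency for n_ij in row)  # 2
sum_comb_c = sum(comb(sum(row), 2) for row in contingency)            # 6
sum_comb_k = sum(comb(s, 2) for s in map(sum, zip(*contingency)))     # 3
prod_comb = sum_comb_c * sum_comb_k / comb(n, 2)                      # 1.2
mean_comb = (sum_comb_c + sum_comb_k) / 2                             # 4.5
ari = (sum_comb - prod_comb) / (mean_comb - prod_comb)
print(ari)  # ≈ 0.2424
```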
Now let's see how sklearn computes it: the adjusted_rand_score function in https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/metrics/cluster/_supervised.py.
labels_true, labels_pred = check_clusterings(labels_true, labels_pred)
n_samples = labels_true.shape[0]
n_classes = np.unique(labels_true).shape[0]
n_clusters = np.unique(labels_pred).shape[0]

# Special limit cases: no clustering since the data is not split;
# or trivial clustering where each document is assigned a unique cluster.
# These are perfect matches hence return 1.0.
if (n_classes == n_clusters == 1 or
        n_classes == n_clusters == 0 or
        n_classes == n_clusters == n_samples):
    return 1.0

# Compute the ARI using the contingency data
contingency = contingency_matrix(labels_true, labels_pred, sparse=True)
sum_comb_c = sum(_comb2(n_c) for n_c in np.ravel(contingency.sum(axis=1)))
sum_comb_k = sum(_comb2(n_k) for n_k in np.ravel(contingency.sum(axis=0)))
sum_comb = sum(_comb2(n_ij) for n_ij in contingency.data)

prod_comb = (sum_comb_c * sum_comb_k) / _comb2(n_samples)
mean_comb = (sum_comb_k + sum_comb_c) / 2.
return (sum_comb - prod_comb) / (mean_comb - prod_comb)
Run it and check the result:
# coding:utf-8
"""
Test the ARI clustering evaluation metric.
"""
from sklearn import metrics

labels_true = [0, 0, 0, 1, 1, 1]
labels_pred = [0, 0, 1, 1, 2, 2]
print(metrics.adjusted_rand_score(labels_true, labels_pred))

Output:
0.24242424242424246
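As a quick check of the properties discussed above, ARI is symmetric in its two arguments, is invariant to relabeling the clusters, and returns 1.0 for a perfect match (just as the special-case branch in the sklearn source promises):

```python
from sklearn import metrics

labels_true = [0, 0, 0, 1, 1, 1]
labels_pred = [0, 0, 1, 1, 2, 2]

# Symmetric: swapping the two partitions gives the same score
print(metrics.adjusted_rand_score(labels_pred, labels_true))

# Permuting the cluster ids does not change the score
print(metrics.adjusted_rand_score(labels_true, [1, 1, 0, 0, 2, 2]))

# Perfect agreement (up to renaming) scores exactly 1.0
print(metrics.adjusted_rand_score(labels_true, [2, 2, 2, 0, 0, 0]))  # → 1.0
```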