【K-means Clustering】

K-means Clustering: Python Implementation

- Clustering
- K-means clustering code

Clustering

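Definition: Clustering is an unsupervised machine learning method whose main goal is to group the objects (or points) in a dataset according to their similarity. Objects within the same group (called a cluster) are similar under some chosen measure, while objects in different clusters are dissimilar. In short, clustering puts similar objects into the same class and dissimilar objects into different classes.

Common clustering methods, their characteristics, and the scenarios they suit: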

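K-means clustering:

Characteristics: distance-based clustering; the objective is to minimize the sum of squared distances from each data point to the center of its assigned cluster.
Applicable scenarios: roughly spherical clusters of similar size and shape.

Hierarchical clustering:

Characteristics: clusters the data along a hierarchy, either bottom-up (agglomerative) or top-down (divisive).
Applicable scenarios: clusters of various shapes and sizes, depending on the linkage criterion, at the cost of more computation; the cost typically grows at least quadratically with the number of points.

DBSCAN:

Characteristics: density-based clustering; finds clusters of arbitrary shape and marks low-density points as noise.
Applicable scenarios: noisy datasets with irregularly shaped clusters of comparable density. Because a single neighborhood radius (eps) applies to the whole dataset, clusters whose densities differ widely are hard to separate.

Spectral clustering:

Characteristics: graph-based clustering; data points are treated as graph nodes and pairwise similarities as edge weights, and the clustering is performed on the graph.
Applicable scenarios: non-convex clusters. It can also be applied to high-dimensional data, provided a meaningful similarity measure is available.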
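
To make the comparison above concrete, the following sketch runs the four methods on scikit-learn's two-moons toy dataset, which is a non-convex case. The dataset, the parameter values (eps=0.3, single linkage, a nearest-neighbors affinity), and the adjusted Rand index as a score are illustrative assumptions, not settings taken from the text.

```python
from sklearn.cluster import (AgglomerativeClustering, DBSCAN, KMeans,
                             SpectralClustering)
from sklearn.datasets import make_moons
from sklearn.metrics import adjusted_rand_score

# Two interleaved half-circles: a non-convex toy dataset (assumed for illustration).
X, y_true = make_moons(n_samples=300, noise=0.05, random_state=0)

# Parameter values below are illustrative guesses, not tuned recommendations.
models = {
    "KMeans": KMeans(n_clusters=2, n_init=10, random_state=0),
    "Agglomerative (single linkage)": AgglomerativeClustering(
        n_clusters=2, linkage="single"),
    "DBSCAN": DBSCAN(eps=0.3, min_samples=5),
    "Spectral": SpectralClustering(
        n_clusters=2, affinity="nearest_neighbors", random_state=0),
}

# Adjusted Rand index: 1.0 means the clustering matches the true grouping.
ari = {}
for name, model in models.items():
    labels = model.fit_predict(X)
    ari[name] = adjusted_rand_score(y_true, labels)
    print(f"{name}: ARI = {ari[name]:.2f}")
```

On data like this, K-means tends to cut each moon in half with a straight boundary, while the density-based and graph-based methods can follow the curved shapes.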
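Principle and steps of K-means clustering:

Principle:

K-means is a distance-based clustering method whose objective is to minimize the sum of squared distances from each data point to the center of its assigned cluster.
Choose K initial points in the given dataset as cluster centers.
Assign each data point to its nearest cluster center.
Recompute the center of each cluster as the mean of all points in that cluster.
Repeat the assignment and recomputation until the centers no longer change, or change very little.

Steps:

1. Initialization: choose K data points as the initial cluster centers.
2. Assignment step: for each data point, compute its distance to every cluster center and assign it to the nearest cluster.
3. Update step: for each cluster, recompute its center as the mean of all points in the cluster.
4. Iteration: repeat steps 2 and 3 until a stopping condition is met (the centers no longer change or change very little, or a preset number of iterations is reached).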
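
The four steps above can be sketched directly in NumPy. This is a minimal illustration, not the scikit-learn implementation: the function name `kmeans_numpy`, its parameters, and the toy data are all assumptions made for the example, and empty clusters are handled simply by leaving their center where it was.

```python
import numpy as np

def kmeans_numpy(X, k, max_iter=100, tol=1e-6, seed=0):
    """Hypothetical from-scratch K-means; returns (centers, labels)."""
    X = np.asarray(X, dtype=float)
    rng = np.random.default_rng(seed)

    # Step 1, initialization: pick k distinct data points as centers.
    centers = X[rng.choice(len(X), size=k, replace=False)]

    for _ in range(max_iter):
        # Step 2, assignment: distance of every point to every center.
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)

        # Step 3, update: each center becomes the mean of its members.
        new_centers = centers.copy()
        for j in range(k):
            members = X[labels == j]
            if len(members) > 0:  # keep the old center if a cluster is empty
                new_centers[j] = members.mean(axis=0)

        # Step 4, stopping condition: centers barely moved.
        shift = np.linalg.norm(new_centers - centers)
        centers = new_centers
        if shift < tol:
            break

    # Recompute labels against the final centers.
    dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
    return centers, dists.argmin(axis=1)

# Toy data: two well-separated groups of three points each.
X = np.array([[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]])
centers, labels = kmeans_numpy(X, k=2)
print(centers)
print(labels)
```

The cluster numbering (which group is 0 and which is 1) depends on the random initialization, so only the grouping itself is meaningful.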
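Note that K-means is sensitive to the choice of initial cluster centers: different initializations can converge to different clusterings. In addition, K-means works best on convex, roughly spherical clusters, and the number of clusters K must be specified in advance.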
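
The sensitivity to initialization can be shown on a six-point dataset with two hand-picked starting configurations. The initial centers below are hypothetical values chosen for illustration; passing an array as `init` with `n_init=1` makes scikit-learn start from exactly those centers.

```python
import numpy as np
from sklearn.cluster import KMeans

data = np.array([[1, 2], [1, 4], [1, 0], [4, 2], [4, 4], [4, 0]], dtype=float)

# Two hand-picked initializations (hypothetical, for illustration only).
inits = {
    "split by x": np.array([[1.0, 2.0], [4.0, 2.0]]),  # left column vs right column
    "split by y": np.array([[2.5, 4.0], [2.5, 1.0]]),  # top row vs the rest
}

results = {}
for name, init in inits.items():
    km = KMeans(n_clusters=2, init=init, n_init=1).fit(data)
    results[name] = (km.labels_.tolist(), km.inertia_)
    print(name, km.labels_, km.inertia_)
```

Both runs converge, but to different partitions: the first reaches a within-cluster sum of squares of 16, the second stays at 17.5. This is why scikit-learn's `KMeans` uses k-means++ initialization by default and supports several restarts through `n_init`, keeping the run with the lowest inertia.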

K-means clustering code

K-means implementation

from sklearn.cluster import KMeans
import numpy as np

# Example data
data = np.array([[1, 2], [1, 4], [1, 0], [4, 2], [4, 4], [4, 0]])

# Create a KMeans object with the number of clusters set to 2
kmeans = KMeans(n_clusters=2)

# Fit the data
kmeans.fit(data)

# Predict the cluster of each data point
labels = kmeans.predict(data)
print("Clustering result:", labels)
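
Beyond the labels, a fitted `KMeans` object exposes the learned centers and the within-cluster sum of squares, and it can assign previously unseen points. The sketch below reuses the same example data; `n_init=10`, `random_state=0`, and the two new points are assumptions added for reproducibility and illustration.

```python
import numpy as np
from sklearn.cluster import KMeans

data = np.array([[1, 2], [1, 4], [1, 0], [4, 2], [4, 4], [4, 0]])
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(data)

# Learned cluster centers, one row per cluster.
print("centers:", kmeans.cluster_centers_)

# Sum of squared distances of the samples to their closest center.
print("inertia:", kmeans.inertia_)

# Assign unseen points (hypothetical values) to the nearest learned center.
new_points = np.array([[0, 0], [5, 3]])
new_labels = kmeans.predict(new_points)
print("labels for new points:", new_labels)
```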

Application:

import os

import matplotlib.pyplot as plt
import numpy as np
from sklearn.cluster import KMeans


def z_score(arr, threshold=3):
    # Return the indices of values whose z-score exceeds the threshold
    mean = np.mean(arr)
    std_dev = np.std(arr)
    print('std_dev:', std_dev)
    z_scores = [(x - mean) / std_dev for x in arr]
    print('z_scores: ', z_scores)
    outliers = [i for i, z in enumerate(z_scores) if abs(z) > threshold]
    return outliers


def _kmeans(x, file):
    prefix, suffix = os.path.splitext(file)
    data = x
    n_clusters = 3
    kmeans = KMeans(n_clusters=n_clusters, random_state=0).fit(data)
    for i in range(n_clusters):
        plt.scatter(data[kmeans.labels_ == i, 0], data[kmeans.labels_ == i, 1], label=f'cluster{i}')
    # Cluster centers
    plt.scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1],
                marker='x', color='red', label='cluster center')
    print(kmeans.cluster_centers_)
    print(np.bincount(kmeans.labels_))

    # Number of points per cluster, and the first coordinate (yaw) of each center
    id_count_list = list(np.bincount(kmeans.labels_))
    y_center_list = list(kmeans.cluster_centers_[:, 0])
    sorted_y_list = sorted(y_center_list, reverse=True)
    max_count_index = id_count_list.index(max(id_count_list))
    y_calib_value = sorted_y_list[len(sorted_y_list) // 2]
    value_index = y_center_list.index(y_calib_value)
    # The calibration value is not necessarily at the middle index of the cluster list
    front_value = sorted_y_list[len(sorted_y_list) // 2 - 1]
    end_value = sorted_y_list[len(sorted_y_list) // 2 + 1]
    front_index, end_index = y_center_list.index(front_value), y_center_list.index(end_value)
    pitch_list = [kmeans.cluster_centers_[:, 1][end_index],
                  kmeans.cluster_centers_[:, 1][value_index],
                  kmeans.cluster_centers_[:, 1][front_index]]
    print(pitch_list)
    # flag_list = z_score(pitch_list)

    flag = True
    # Comparison method: the pairwise pitch differences must be mutually consistent
    if len(pitch_list) == 3:
        diff0 = np.abs(pitch_list[0] - pitch_list[1])
        diff1 = np.abs(pitch_list[0] - pitch_list[2])
        diff3 = np.abs(pitch_list[1] - pitch_list[2])
        thresh = 10
        if np.abs(diff3 - diff0) > thresh or np.abs(diff3 - diff1) > thresh:
            flag = False

    pitch_calib, yaw_calib = None, None
    if value_index == max_count_index and flag:
        yaw_calib, pitch_calib = kmeans.cluster_centers_[max_count_index]
        print(yaw_calib.item())
        print(pitch_calib.item())
        # print(roll_calib.item())
    else:
        print('Calibration failed: collect data again')
    print('=========', file)

    plt.legend()
    plt.title(file)
    # plt.show()
    # Make sure the output directories exist before saving
    os.makedirs('runs/onnx_infer', exist_ok=True)
    os.makedirs('runs/pred', exist_ok=True)
    plt.savefig('runs/onnx_infer/' + prefix + '.png')
    plt.savefig('runs/pred/kmeans.png')
    plt.close()
    return pitch_calib, yaw_calib


if __name__ == '__main__':
    # Analyze the data to obtain the calibration values
    file_path = r'/home/'
    files = os.listdir(file_path)
    files = [item for item in files if item.endswith('.txt')]
    for file in files:
        pt_list = []
        with open(os.path.join(file_path, file), 'r') as f:
            lines = f.read().splitlines()
        for line in lines:
            line = line.split()
            line = [float(item) for item in line]
            pt = [line[3], line[2]]
            pt_list.append(pt)
        x = np.array(pt_list)
        # Input is a 2-D array whose rows are [yaw, pitch]
        pitch_calib, yaw_calib = _kmeans(x, file)
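
The selection rule in the script above can be exercised without files or plots. The sketch below is one reading of that rule, not a replacement for it: the helper name `pick_calibration`, the `n_init=10` setting, and the synthetic [yaw, pitch] data are assumptions made for the example. It accepts the cluster with the median yaw only if that cluster is also the largest and the pitch gaps between the three centers are consistent.

```python
import numpy as np
from sklearn.cluster import KMeans

def pick_calibration(points, thresh=10.0):
    """Hypothetical helper: return (yaw, pitch) of the accepted center, or None."""
    km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(points)
    centers = km.cluster_centers_
    counts = np.bincount(km.labels_, minlength=3)

    order = np.argsort(centers[:, 0])   # cluster indices sorted by yaw
    middle = order[1]                   # cluster with the median yaw
    pitches = centers[order, 1]         # center pitches, in yaw order

    # Pairwise pitch gaps; reject if they differ by more than `thresh`.
    d01 = abs(pitches[0] - pitches[1])
    d02 = abs(pitches[0] - pitches[2])
    d12 = abs(pitches[1] - pitches[2])
    consistent = abs(d12 - d01) <= thresh and abs(d12 - d02) <= thresh

    if middle == counts.argmax() and consistent:
        yaw, pitch = centers[middle]
        return float(yaw), float(pitch)
    return None

# Synthetic data (assumed): three groups along yaw, the middle one the largest.
rng = np.random.default_rng(0)
left = rng.normal([-20.0, 0.0], 1.0, size=(30, 2))
mid = rng.normal([0.0, 0.0], 1.0, size=(100, 2))
right = rng.normal([20.0, 0.0], 1.0, size=(30, 2))
points = np.vstack([left, mid, right])

result = pick_calibration(points)
print(result)
```

Returning `None` corresponds to the branch in the script that asks for the data to be collected again.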