Clustering with a Gaussian Mixture Model

Overview

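Most clustering algorithms decide cluster membership by similarity, and similarity is most often measured by Euclidean distance. GMM uses a different criterion, probability: each sample is assigned to the class it has the highest probability of belonging to.
The basic idea of GMM is that a probability distribution of almost any shape can be approximated by several Gaussian distributions, given enough of them. A GMM is therefore made up of multiple single Gaussian densities, each of which is called a "Component", and their weighted linear combination forms the GMM probability density.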
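
To make the "weighted linear combination" concrete, here is a minimal sketch that rebuilds the mixture density by hand from a fitted model and checks it against scikit-learn's own value. The synthetic two-blob data, the seed, and the variable names are assumptions made for illustration; they are not part of the original post.

```python
import numpy as np
from scipy.stats import multivariate_normal
from sklearn.mixture import GaussianMixture

# Synthetic 2-D data drawn from two Gaussians (illustrative only).
rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal(loc=[0.0, 0.0], scale=0.5, size=(200, 2)),
    rng.normal(loc=[3.0, 3.0], scale=1.0, size=(200, 2)),
])

gmm = GaussianMixture(n_components=2, covariance_type="full", random_state=0).fit(X)

# Mixture density: the sum over components of weight_k * N(x | mean_k, cov_k).
density = np.zeros(len(X))
for w, mu, cov in zip(gmm.weights_, gmm.means_, gmm.covariances_):
    density += w * multivariate_normal(mean=mu, cov=cov).pdf(X)

# score_samples returns the log of that same mixture density.
log_density = gmm.score_samples(X)

# Soft assignment: one probability per component for every sample.
proba = gmm.predict_proba(X)
# The hard label is the component with the largest probability.
labels = proba.argmax(axis=1)
```

Each row of `proba` sums to 1, and taking the most probable component per row gives the cluster label that `gmm.predict(X)` reports.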

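Function parameters

n_components : the number of Gaussian components, i.e. the target number of clusters
covariance_type : the type of covariance used when the parameters are estimated with the EM algorithm; the default is "full"
full: each component has its own general covariance matrix
tied: all components share one general covariance matrix
diag: each component has its own diagonal covariance matrix
spherical: each component has its own single variance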
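
One way to see what the four covariance types mean is to look at the shape of the fitted `covariances_` attribute. The sketch below assumes random 2-D data and 3 components, chosen only for illustration.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Random 2-D data (illustrative only; any numeric array would do).
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 2))

shapes = {}
for cov_type in ("full", "tied", "diag", "spherical"):
    gmm = GaussianMixture(n_components=3, covariance_type=cov_type, random_state=0)
    gmm.fit(X)
    # The stored covariance parameters shrink as the constraint gets stronger.
    shapes[cov_type] = gmm.covariances_.shape
    print(cov_type, shapes[cov_type])
```

With 3 components and 2 features, "full" stores one 2x2 matrix per component, "tied" a single 2x2 matrix, "diag" one variance per feature per component, and "spherical" one variance per component. Fewer parameters mean a faster, more constrained fit that is less able to follow elongated or rotated clusters.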

For comparison, see K-means clustering:

https://blog.csdn.net/fanzonghao/article/details/85045232
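
As a rough side-by-side, the sketch below runs both estimators on the same synthetic data and scores each against the known labels. The stretched blobs from `make_blobs`, the transformation matrix, and the use of the adjusted Rand index are assumptions made for illustration; the outcome depends on the data, so no particular winner is claimed.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import adjusted_rand_score
from sklearn.mixture import GaussianMixture

# Three blobs, then a linear map so the clusters become elongated.
X, y_true = make_blobs(n_samples=600, centers=3, cluster_std=0.6, random_state=0)
X = X @ np.array([[0.6, -0.6], [-0.4, 0.8]])

# Both estimators expose fit_predict, so they can be swapped freely.
kmeans_labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
gmm_labels = GaussianMixture(
    n_components=3, covariance_type="full", random_state=0
).fit_predict(X)

# Adjusted Rand index: 1.0 means perfect agreement with the true labels.
ari_kmeans = adjusted_rand_score(y_true, kmeans_labels)
ari_gmm = adjusted_rand_score(y_true, gmm_labels)
print("K-means ARI:", ari_kmeans)
print("GMM ARI:", ari_gmm)
```

K-means assigns each point to the nearest center by Euclidean distance, while a GMM with "full" covariance can also model the orientation and spread of each cluster.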

# coding: utf-8
import time

import matplotlib.pyplot as plt
import pandas as pd
from sklearn.cluster import KMeans  # only needed for the K-means alternative below
from sklearn.mixture import GaussianMixture

data_path = "./Aggregation_cluster=7.txt"


# Load the data
def load_data():
    points = pd.read_table(data_path, header=None)
    return points


def plotRes(data, clusterRes, clusterNum):
    """
    Visualize the clustering result.
    :param data: sample set
    :param clusterRes: clustering result (one label per sample)
    :param clusterNum: number of clusters
    :return:
    """
    nPoints = len(data)
    scatterColors = ['black', 'blue', 'green', 'yellow', 'red', 'purple', 'orange', 'brown']
    for i in range(clusterNum):
        color = scatterColors[i % len(scatterColors)]
        x1 = []
        y1 = []
        # Collect the points assigned to cluster i
        for j in range(nPoints):
            if clusterRes[j] == i:
                x1.append(data[j, 0])
                y1.append(data[j, 1])
        plt.scatter(x1, y1, c=color, alpha=1, marker='+')
    plt.show()


if __name__ == '__main__':
    n_cluster = 7
    points = load_data()
    # Use every column except the last one as features
    # (the last column is assumed to hold the class label)
    X = points.iloc[:, :-1].values
    print(X)
    print('========== Do clustering ==========')
    start_time = time.time()
    gmm = GaussianMixture(n_components=n_cluster, covariance_type='full')
    # gmm = KMeans(n_clusters=n_cluster)  # swap in K-means for comparison
    gmm.fit(X)
    y_pred = gmm.predict(X)
    end_time = time.time()
    print('--- {} s ---'.format(end_time - start_time))
    plotRes(X, y_pred, n_cluster)

n_cluster==3: