目录
- 1. 无监督学习的类型
- 2. 无监督学习的挑战
- 3. 预处理与缩放
- 3.1 不同类型的预处理
- 3.2 应用数据变换
- 3.3 对训练数据和测试数据进行相同的缩放
- 快捷方式与高效的替代方法
- 3.4 预处理对监督学习的作用
- 4. 降维、特征提取与流形学习
- 4.1 主成分分析(PCA)
- 4.1.1 将PCA应用于cancer数据集并可视化
- 4.1.2 特征提取的特征脸
- 4.2 非负矩阵分解(NMF)
- 4.2.1 将NMF应用于模拟数据
- 4.2.2 将NMF应用于人脸图像
- 4.3 用t-SNE进行流形学习
- 5. 聚类
- 5.1 k均值聚类
- 5.1.1 k均值的失败案例
- 5.1.2 矢量量化,或者将k均值看作分解
- 5.1.3 优点、缺点
- 5.2 凝聚聚类
- 层次聚类与树状图
- 5.3 DBSCAN
- 5.4 聚类算法的对比与评估
- 5.4.1 用真实值评估聚类
- 5.4.2 在没有真实值的情况下评估聚类
- 5.4.3 在人脸数据集上比较算法
- 用DBSCAN分析人脸数据集
- 用k均值分析人脸数据集
- 用凝聚聚类分析人脸数据集
- 5.5 聚类方法小结
1. 无监督学习的类型
- 两种无监督学习
- 数据集变换(数据集的无监督变换)
- 创建数据新的表示的算法
- 新的表示可能更容易被人或其他机器学习算法所理解
- 常见应用
- 降维
- 接受包含许多特征的数据的高维表示
- 找到表示该数据的一种新方法
- 用较少的特征就可以概括其重要特征
- 常见应用
- 将数据降为二维(为了可视化)
- 找到“构成”数据的各个组成部分
- 常见应用
- 对文本文档集合进行主题提取
- 任务
- 找到每个文档中讨论的未知主题
- 学习每个文档中出现了哪些主题
- 用于追踪社交媒体上的话题讨论
- 聚类
- 将数据划分成不同的组
- 每个组包含相似的物项
- 常见应用
- 相册的智能分类
- 提取所有的人脸
- 将看起来相似的人脸分在一组
2. 无监督学习的挑战
- 主要挑战:评估算法是否学到了有用的东西
- 无监督学习算法一般用于不包含任何标签信息的数据,所以我们不知道正确的输出应该是什么
- 我们没有办法“告诉”算法我们要的是什么
- 通常来说,评估无监督算法结果的唯一方法就是人工检查
- 如果数据科学家想要更好地理解数据,那么无监督算法通常可用于探索性的目的,而不是作为大型自动化系统的一部分
- 无监督算法的另一个常见应用是作为监督算法的预处理步骤
- 可以提高监督算法的精度
- 可以减少内存占用和时间开销
3. 预处理与缩放
- 对于数据缩放敏感的算法,可以对特征进行调节,使数据表示更适合于这些算法
- 对数据的简单的按特征的缩放和移动
3.1 不同类型的预处理
from matplotlib import pyplot as plt
import mglearn

mglearn.plots.plot_scaling()
plt.tight_layout()
plt.show()
- 左侧:有两个特征的二分类数据
- 第一个特征值:10~15
- 第二个特征值:1~9
- 右侧:4种数据变换方法
- StandardScaler
- 确保每个特征的平均值为0,方差为1,使所有特征都位于同一量级
- 但不能确保特征达到任何特定的最大值和最小值
- RobustScaler
- 确保每个特征的统计属性都位于同一范围
- 中位数和四分位数
- 忽略与其他点有很大不同的数据点(异常值)
- MinMaxScaler
- 使所有特征都刚好位于0~1
- Normalizer
- 对每个数据点进行缩放,使特征向量的欧氏长度等于1
- 将每个数据点投射到半径为1的圆(球面)上
- 每个数据点的缩放比例都不相同
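下面用一个很小的示例直观对比上述几种缩放器的输出(示例数组是随意构造的,仅作演示):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler, RobustScaler, MinMaxScaler, Normalizer

# 随意构造的小数据:每行一个样本,每列一个特征,第一列含一个异常值
X = np.array([[1., 10.], [2., 12.], [3., 14.], [100., 16.]])

for scaler in [StandardScaler(), RobustScaler(), MinMaxScaler(), Normalizer()]:
    print(scaler.__class__.__name__)
    print(scaler.fit_transform(X))
```

可以看到,RobustScaler受第一列异常值的影响比StandardScaler小,而Normalizer是按行(每个样本)而不是按列(每个特征)进行缩放的。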
3.2 应用数据变换
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

cancer = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(cancer.data, cancer.target, random_state=1)

scaler = MinMaxScaler()
scaler.fit(X_train)

# 对训练数据进行变换
X_train_scaled = scaler.transform(X_train)

# 打印缩放之后数据集属性
print("per-feature minimum after scaling:\n {}".format(X_train_scaled.min(axis=0)))
# per-feature minimum after scaling:
# [0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
# 0. 0. 0. 0. 0. 0.]
print("per-feature maximum after scaling:\n {}".format(X_train_scaled.max(axis=0)))
# per-feature maximum after scaling:
# [1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1.
# 1. 1. 1. 1. 1. 1.]

# 对测试数据进行变换
X_test_scaled = scaler.transform(X_test)

# 打印缩放之后数据集属性
print("per-feature minimum after scaling:\n {}".format(X_test_scaled.min(axis=0)))
# per-feature minimum after scaling:
# [ 0.0336031 0.0226581 0.03144219 0.01141039 0.14128374 0.04406704
# 0. 0. 0.1540404 -0.00615249 -0.00137796 0.00594501
# 0.00430665 0.00079567 0.03919502 0.0112206 0. 0.
# -0.03191387 0.00664013 0.02660975 0.05810235 0.02031974 0.00943767
# 0.1094235 0.02637792 0. 0. -0.00023764 -0.00182032]

print("per-feature maximum after scaling:\n {}".format(X_test_scaled.max(axis=0)))
# per-feature maximum after scaling:
# [0.9578778 0.81501522 0.95577362 0.89353128 0.81132075 1.21958701
# 0.87956888 0.9333996 0.93232323 1.0371347 0.42669616 0.49765736
# 0.44117231 0.28371044 0.48703131 0.73863671 0.76717172 0.62928585
# 1.33685792 0.39057253 0.89612238 0.79317697 0.84859804 0.74488793
# 0.9154725 1.13188961 1.07008547 0.92371134 1.20532319 1.63068851]
- 由于scaler是在X_train上拟合的,所以X_train的所有特征都被缩放到0~1之间;而对X_test做同样的变换后,特征的最小值和最大值不再是0和1,部分特征甚至超出了0~1的范围
3.3 对训练数据和测试数据进行相同的缩放
import mglearn
from matplotlib import pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=50, centers=5, random_state=4, cluster_std=2)
X_train, X_test = train_test_split(X, random_state=5, test_size=.1)

# 绘制训练集和测试集
fig, axes = plt.subplots(1, 3, figsize=(13, 4))
axes[0].scatter(X_train[:, 0], X_train[:, 1], c=mglearn.cm2(0), label="Training set", s=60)
axes[0].scatter(X_test[:, 0], X_test[:, 1], marker='^', c=mglearn.cm2(1), label="Test set", s=60)
axes[0].legend(loc='upper left')
axes[0].set_title("Original Data")

# 利用MinMaxScaler缩放数据
scaler = MinMaxScaler()
scaler.fit(X_train)

X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)

# 将正确缩放的数据可视化
axes[1].scatter(X_train_scaled[:, 0], X_train_scaled[:, 1], c=mglearn.cm2(0), label="Training set", s=60)
axes[1].scatter(X_test_scaled[:, 0], X_test_scaled[:, 1], marker='^', c=mglearn.cm2(1), label="Test set", s=60)
axes[1].set_title("Scaled Data")

# 单独对测试集进行缩放
test_scaler = MinMaxScaler()
test_scaler.fit(X_test)
X_test_scaled_badly = test_scaler.transform(X_test)

# 将错误缩放的数据可视化
axes[2].scatter(X_train_scaled[:, 0], X_train_scaled[:, 1], c=mglearn.cm2(0), label="training set", s=60)
axes[2].scatter(X_test_scaled_badly[:, 0], X_test_scaled_badly[:, 1], marker='^', c=mglearn.cm2(1), label="test set", s=60)
axes[2].set_title("Improperly Scaled Data")

for ax in axes:
    ax.set_xlabel("Feature 0")
    ax.set_ylabel("Feature 1")

plt.tight_layout()
plt.show()
- 左图:未缩放的二维数据集
- 中图:使用MinMaxScaler进行缩放
- 右图:训练集和测试集分别进行不同的缩放
快捷方式与高效的替代方法
scaler.fit(X).transform(X)
# 等效于
scaler.fit_transform(X)
3.4 预处理对监督学习的作用
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.preprocessing import MinMaxScaler
from sklearn.preprocessing import StandardScaler

cancer = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(cancer.data, cancer.target, random_state=0)

svm = SVC(C=100)
svm.fit(X_train, y_train)
print("test score: {:.3f}".format(svm.score(X_test, y_test)))
# test score: 0.944

# 使用0-1缩放进行预处理
scaler = MinMaxScaler()
scaler.fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)

# 在缩放后的训练数据上学习SVM
svm.fit(X_train_scaled, y_train)

# 在缩放后的测试集上计算分数
print("test score: {:.3f}".format(svm.score(X_test_scaled, y_test)))
# test score: 0.965

# 利用零均值和单位方差的缩放方法进行预处理
scaler = StandardScaler()
scaler.fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)

# 在缩放后的训练数据上学习SVM
svm.fit(X_train_scaled, y_train)

# 在缩放后的测试集上计算分数
print("test score: {:.3f}".format(svm.score(X_test_scaled, y_test)))
# test score: 0.958
4. 降维、特征提取与流形学习
4.1 主成分分析(PCA)
- 一种旋转数据集的方法
- 旋转后的特征在统计上不相关
- 旋转后通常根据新特征对解释数据的重要性来选择它的一个子集
import matplotlib.pyplot as plt
import mglearn

mglearn.plots.plot_pca_illustration()
plt.tight_layout()
plt.show()
-
左上图:原始数据点
-
算法查找方差最大的方向(Component 1)
- 数据中包含最多信息的方向
-
算法找到与第一个方向正交且包含最多信息的方向
- 利用此方法找到的方向称为主成分
- 数据方差的主要方向
- 主成分的个数与原始特征相同
-
右上图:旋转原始数据,使第一主成分与x轴平行且第二主成分与y轴平行
- 旋转之前,数据减去平均值
- 使变换后的数据以0为中心
- 旋转之前,数据减去平均值
-
左下图:仅保留第一个主成分
- 将二维数据降为一维数据
-
右下图:反向旋转并将平均值重新加到数据中
- 去除数据中的噪声影响
- 将主成分中保留的那部分信息可视化
4.1.1 将PCA应用于cancer数据集并可视化
-
对每个特征分别计算两个类别的直方图
import mglearn
import numpy as np
from matplotlib import pyplot as plt
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

cancer = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(cancer.data, cancer.target, random_state=0)

fig, axes = plt.subplots(15, 2, figsize=(10, 20))
malignant = cancer.data[cancer.target == 0]
benign = cancer.data[cancer.target == 1]

ax = axes.ravel()
for i in range(30):
    _, bins = np.histogram(cancer.data[:, i], bins=50)
    ax[i].hist(malignant[:, i], bins=bins, color=mglearn.cm3(0), alpha=.5)
    ax[i].hist(benign[:, i], bins=bins, color=mglearn.cm3(2), alpha=.5)
    ax[i].set_title(cancer.feature_names[i])
    ax[i].set_yticks(())

ax[0].set_xlabel("Feature magnitude")
ax[0].set_ylabel("Frequency")
ax[0].legend(["malignant", "benign"], loc="best")

fig.tight_layout()
fig.show()
-
利用PCA,获取到主要的相互作用
-
利用StandardScaler缩放数据
from sklearn.preprocessing import StandardScaler
from sklearn.datasets import load_breast_cancer

cancer = load_breast_cancer()

scaler = StandardScaler()
scaler.fit(cancer.data)
X_scaled = scaler.transform(cancer.data)
-
学习并应用PCA
- 默认情况下,PCA仅旋转(移动)数据,并保留所有主成分
from sklearn.preprocessing import StandardScaler
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA

cancer = load_breast_cancer()

scaler = StandardScaler()
scaler.fit(cancer.data)
X_scaled = scaler.transform(cancer.data)

# 保留数据的前两个主成分
pca = PCA(n_components=2)  # n_components: 保留的主成分个数

# 对乳腺癌数据拟合PCA模型
pca.fit(X_scaled)

# 将数据变换到前两个主成分的方向上
X_pca = pca.transform(X_scaled)

print("Original shape: {}".format(str(X_scaled.shape)))
# Original shape: (569, 30)

print("Reduced shape: {}".format(str(X_pca.shape)))
# Reduced shape: (569, 2)
-
对前两个主成分作图
import mglearn
from matplotlib import pyplot as plt
from sklearn.preprocessing import StandardScaler
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA

cancer = load_breast_cancer()

scaler = StandardScaler()
scaler.fit(cancer.data)
X_scaled = scaler.transform(cancer.data)

pca = PCA(n_components=2)
pca.fit(X_scaled)
X_pca = pca.transform(X_scaled)

plt.figure(figsize=(8, 8))
mglearn.discrete_scatter(X_pca[:, 0], X_pca[:, 1], cancer.target)

plt.legend(cancer.target_names, loc="best")
plt.gca().set_aspect("equal")
plt.xlabel("First principal component")
plt.ylabel("Second principal component")

plt.tight_layout()
plt.show()
-
PCA的缺点:不容易对图中的两个轴做出解释
-
主成分在PCA对象的components_属性中
-
用热图将系数可视化
from matplotlib import pyplot as plt
from sklearn.preprocessing import StandardScaler
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA

cancer = load_breast_cancer()

scaler = StandardScaler()
scaler.fit(cancer.data)
X_scaled = scaler.transform(cancer.data)

pca = PCA(n_components=2)
pca.fit(X_scaled)
X_pca = pca.transform(X_scaled)

plt.matshow(pca.components_, cmap='viridis')
plt.yticks([0, 1], ["First component", "Second component"])
plt.colorbar()
plt.xticks(range(len(cancer.feature_names)), cancer.feature_names, rotation=60, ha='left')

plt.xlabel("Feature")
plt.ylabel("Principal components")
plt.tight_layout()
plt.show()
4.1.2 特征提取的特征脸
- 思想:找到一种数据表示,比给定的原始表示更适合于分析
- 应用实例:图像
- 图像由像素构成
- 通常存储为RGB强度
from matplotlib import pyplot as plt
from sklearn.datasets import fetch_lfw_people
import ssl

ssl._create_default_https_context = ssl._create_unverified_context

people = fetch_lfw_people(min_faces_per_person=20, resize=0.7)
image_shape = people.images[0].shape

fix, axes = plt.subplots(2, 5, figsize=(15, 8), subplot_kw={'xticks': (), 'yticks': ()})
for target, image, ax in zip(people.target, people.images, axes.ravel()):
    ax.imshow(image)
    ax.set_title(people.target_names[target])

print("people.images.shape: {}".format(people.images.shape))
# people.images.shape: (3023, 87, 65)
# 3023张图像
# 87像素*65像素

print("Number of classes: {}".format(len(people.target_names)))
# Number of classes: 62
# 62个人

plt.tight_layout()
plt.show()
-
数据集有些偏斜
- 参与分类的两个类别(或多个类别)样本数量差异很大
import numpy as np
from sklearn.datasets import fetch_lfw_people
import ssl

ssl._create_default_https_context = ssl._create_unverified_context

people = fetch_lfw_people(min_faces_per_person=20, resize=0.7)

# 计算每个目标出现的次数
counts = np.bincount(people.target)

# 将次数与目标名称一起打印出来
for i, (count, name) in enumerate(zip(counts, people.target_names)):
    print("{0:25} {1:3}".format(name, count), end=' ')
    if (i + 1) % 3 == 0:
        print()

# Alejandro Toledo 39 Alvaro Uribe 35 Amelie Mauresmo 21
# Andre Agassi 36 Angelina Jolie 20 Ariel Sharon 77
# Arnold Schwarzenegger 42 Atal Bihari Vajpayee 24 Bill Clinton 29
# Carlos Menem 21 Colin Powell 236 David Beckham 31
# Donald Rumsfeld 121 George Robertson 22 George W Bush 530
# Gerhard Schroeder 109 Gloria Macapagal Arroyo 44 Gray Davis 26
# Guillermo Coria 30 Hamid Karzai 22 Hans Blix 39
# Hugo Chavez 71 Igor Ivanov 20 Jack Straw 28
# Jacques Chirac 52 Jean Chretien 55 Jennifer Aniston 21
# Jennifer Capriati 42 Jennifer Lopez 21 Jeremy Greenstock 24
# Jiang Zemin 20 John Ashcroft 53 John Negroponte 31
# Jose Maria Aznar 23 Juan Carlos Ferrero 28 Junichiro Koizumi 60
# Kofi Annan 32 Laura Bush 41 Lindsay Davenport 22
# Lleyton Hewitt 41 Luiz Inacio Lula da Silva 48 Mahmoud Abbas 29
# Megawati Sukarnoputri 33 Michael Bloomberg 20 Naomi Watts 22
# Nestor Kirchner 37 Paul Bremer 20 Pete Sampras 22
# Recep Tayyip Erdogan 30 Ricardo Lagos 27 Roh Moo-hyun 32
# Rudolph Giuliani 26 Saddam Hussein 23 Serena Williams 52
# Silvio Berlusconi 33 Tiger Woods 23 Tom Daschle 25
# Tom Ridge 33 Tony Blair 144 Vicente Fox 32
# Vladimir Putin 49 Winona Ryder 24
-
降低数据偏斜
- 每个人最多取50张图像
import numpy as np
from sklearn.datasets import fetch_lfw_people
import ssl

ssl._create_default_https_context = ssl._create_unverified_context

people = fetch_lfw_people(min_faces_per_person=20, resize=0.7)
mask = np.zeros(people.target.shape, dtype=np.bool_)
for target in np.unique(people.target):
    mask[np.where(people.target == target)[0][:50]] = 1

X_people = people.data[mask]
y_people = people.target[mask]

# 将灰度值缩放到0到1之间,而不是在0到255之间
# 以得到更好的数据稳定性
X_people = X_people / 255.
-
使用单一最近邻分类器(1-nn)
- 寻找与要分类的人脸最为相似的人脸
import numpy as np from sklearn.datasets import fetch_lfw_people from sklearn.model_selection import train_test_split from sklearn.neighbors import KNeighborsClassifier import sslssl._create_default_https_context = ssl._create_unverified_contextpeople = fetch_lfw_people(min_faces_per_person=20, resize=0.7) mask = np.zeros(people.target.shape, dtype=np.bool_)for target in np.unique(people.target):mask[np.where(people.target == target)[0][:50]] = 1X_people = people.data[mask] y_people = people.target[mask]X_people = X_people / 255.# 将数据分为训练集和测试集 X_train, X_test, y_train, y_test = train_test_split(X_people, y_people, stratify=y_people, random_state=0)# 使用一个邻居构建KNeighborsClassifier knn = KNeighborsClassifier(n_neighbors=1) knn.fit(X_train, y_train)print("test score: {:.3f}".format(knn.score(X_test, y_test))) # test score: 0.215
-
使用PCA
-
启动白化选项
- 将主成分缩放到相同的尺度
- 结果与StandardScaler相同
from matplotlib import pyplot as plt
import mglearn

mglearn.plots.plot_pca_whitening()

plt.tight_layout()
plt.show()
import numpy as np from sklearn.datasets import fetch_lfw_people from sklearn.decomposition import PCA from sklearn.model_selection import train_test_split from sklearn.neighbors import KNeighborsClassifier import sslssl._create_default_https_context = ssl._create_unverified_contextpeople = fetch_lfw_people(min_faces_per_person=20, resize=0.7) mask = np.zeros(people.target.shape, dtype=np.bool_)for target in np.unique(people.target):mask[np.where(people.target == target)[0][:50]] = 1X_people = people.data[mask] y_people = people.target[mask]X_people = X_people / 255.X_train, X_test, y_train, y_test = train_test_split(X_people, y_people, stratify=y_people, random_state=0)pca = PCA(n_components=100, whiten=True, random_state=0).fit(X_train) # 提取前100个主成分,并进行拟合X_train_pca = pca.transform(X_train) X_test_pca = pca.transform(X_test)print("X_train_pca.shape: {}".format(X_train_pca.shape)) # X_train_pca.shape: (1547, 100)knn = KNeighborsClassifier(n_neighbors=1) knn.fit(X_train_pca, y_train)print("test score: {:.3f}".format(knn.score(X_test_pca, y_test))) # test score: 0.297
-
主成分可视化
import numpy as np from matplotlib import pyplot as plt from sklearn.datasets import fetch_lfw_people from sklearn.decomposition import PCA from sklearn.model_selection import train_test_split import sslssl._create_default_https_context = ssl._create_unverified_contextpeople = fetch_lfw_people(min_faces_per_person=20, resize=0.7) mask = np.zeros(people.target.shape, dtype=np.bool_)for target in np.unique(people.target):mask[np.where(people.target == target)[0][:50]] = 1X_people = people.data[mask] y_people = people.target[mask]X_people = X_people / 255.X_train, X_test, y_train, y_test = train_test_split(X_people, y_people, stratify=y_people, random_state=0)pca = PCA(n_components=100, random_state=0).fit(X_train)image_shape = people.images[0].shapefix, axes = plt.subplots(3, 5, figsize=(15, 12), subplot_kw={'xticks': (), 'yticks': ()})for i, (component, ax) in enumerate(zip(pca.components_, axes.ravel())):ax.imshow(component.reshape(image_shape), cmap='viridis')ax.set_title("{}. component".format((i + 1)))plt.tight_layout() plt.show()
-
尝试找到一些数字(PCA旋转后的新特征值),使我们可以将测试点表示为主成分的加权求和
- $x_0$、$x_1$ 等:数据点的主成分系数
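这一加权求和可以直接用PCA对象的属性验证。下面是一个示意性草图(沿用前文在人脸训练集上拟合的、不带whiten的pca与X_train,变量名只是沿用前文的假设):样本的主成分系数乘以对应分量、再加上均值,就得到inverse_transform的重建结果。

```python
import numpy as np

# 某个样本在各主成分上的系数 x_0, x_1, ...
coeffs = pca.transform(X_train[[0]])               # 形状 (1, n_components)

# 加权求和:均值 + sum_i x_i * 第i个主成分
reconstruction = pca.mean_ + np.dot(coeffs, pca.components_)

# 与 inverse_transform 的结果一致
print(np.allclose(reconstruction, pca.inverse_transform(coeffs)))
```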
-
对人脸数据进行变换
- 将数据降维到只包含一些主成分,然后反向旋转回到原始空间
- 回到原始特征空间的方法:inverse_transform
import mglearn import numpy as np from matplotlib import pyplot as plt from sklearn.datasets import fetch_lfw_people from sklearn.model_selection import train_test_split import sslssl._create_default_https_context = ssl._create_unverified_contextpeople = fetch_lfw_people(min_faces_per_person=20, resize=0.7) mask = np.zeros(people.target.shape, dtype=np.bool_)for target in np.unique(people.target):mask[np.where(people.target == target)[0][:50]] = 1X_people = people.data[mask] y_people = people.target[mask]X_people = X_people / 255.X_train, X_test, y_train, y_test = train_test_split(X_people, y_people, stratify=y_people, random_state=0)image_shape = people.images[0].shapemglearn.plots.plot_pca_faces(X_train, X_test, image_shape)plt.tight_layout() plt.show()
-
利用PCA的前两个主成分,将数据集中的所有人脸在散点图中可视化
import numpy as np from matplotlib import pyplot as plt from sklearn.datasets import fetch_lfw_people from sklearn.decomposition import PCA from sklearn.model_selection import train_test_split import mglearn import sslssl._create_default_https_context = ssl._create_unverified_contextpeople = fetch_lfw_people(min_faces_per_person=20, resize=0.7) mask = np.zeros(people.target.shape, dtype=np.bool_)for target in np.unique(people.target):mask[np.where(people.target == target)[0][:50]] = 1X_people = people.data[mask] y_people = people.target[mask]X_people = X_people / 255.X_train, X_test, y_train, y_test = train_test_split(X_people, y_people, stratify=y_people, random_state=0)pca = PCA(n_components=100, whiten=True, random_state=0).fit(X_train)X_train_pca = pca.transform(X_train) X_test_pca = pca.transform(X_test)mglearn.discrete_scatter(X_train_pca[:, 0], X_train_pca[:, 1], y_train)plt.xlabel("First principal component") plt.ylabel("Second principal component")plt.tight_layout() plt.show()
4.2 非负矩阵分解(NMF)
- 提取有用的特征
- 将每个数据点写成一些分量的加权求和
- 希望分量和系数都大于或等于0
- 只能应用于每个特征都是非负的数据
- 对由多个独立源相加创建而成的数据特别有用
- 多人说话的音轨
- 多种乐器的音乐
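"分量的加权求和"对应矩阵分解 X ≈ W·H:transform得到的系数矩阵W与components_中的分量矩阵H相乘即可近似重建数据。下面是一个在随机非负数据上的最小草图(数据是假设的):

```python
import numpy as np
from sklearn.decomposition import NMF

rng = np.random.RandomState(0)
X = rng.uniform(size=(20, 6))      # 非负的模拟数据

nmf = NMF(n_components=3, random_state=0, max_iter=1000)
W = nmf.fit_transform(X)           # 系数(权重),形状 (20, 3),非负
H = nmf.components_                # 分量,形状 (3, 6),非负

X_approx = np.dot(W, H)            # 每个样本 ≈ 分量的非负加权求和
print("重建误差: {:.4f}".format(np.linalg.norm(X - X_approx)))
```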
4.2.1 将NMF应用于模拟数据
from matplotlib import pyplot as plt
import mglearn

mglearn.plots.plot_nmf_illustration()
plt.tight_layout()
plt.show()
- 左图:所有数据点都可以写成这两个分量的正数组合
- 右图:指向平均值的分量
- NMF使用随机初始化,根据随机种子的不同可能产生不同的结果
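下面的小草图(非书中代码)用init='random'演示随机初始化的影响:不同的random_state可能得到不同的分量:

```python
import numpy as np
from sklearn.decomposition import NMF

X = np.random.RandomState(42).uniform(size=(100, 5))   # 假设的非负数据

for seed in [0, 1]:
    nmf = NMF(n_components=2, init='random', random_state=seed, max_iter=1000)
    nmf.fit(X)
    print("random_state={} 的分量:\n{}".format(seed, nmf.components_.round(2)))
```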
4.2.2 将NMF应用于人脸图像
-
NMF的主要参数
- 想要提取的分量个数
- 要小于输入特征的个数
-
分量个数对NMF重建数据的影响
from matplotlib import pyplot as plt
import mglearn

# 观察分量个数对NMF重建人脸的影响
# (X_train、X_test、image_shape沿用前文人脸数据的划分与图像形状)
mglearn.plots.plot_nmf_faces(X_train, X_test, image_shape)

plt.tight_layout()
plt.show()
- 比PCA稍差
-
提取一部分分量,并观察数据
import numpy as np from matplotlib import pyplot as plt from sklearn.datasets import fetch_lfw_people from sklearn.model_selection import train_test_split from sklearn.decomposition import NMF import sslssl._create_default_https_context = ssl._create_unverified_contextpeople = fetch_lfw_people(min_faces_per_person=20, resize=0.7) mask = np.zeros(people.target.shape, dtype=np.bool_)for target in np.unique(people.target):mask[np.where(people.target == target)[0][:50]] = 1X_people = people.data[mask] y_people = people.target[mask]X_people = X_people / 255.X_train, X_test, y_train, y_test = train_test_split(X_people, y_people, stratify=y_people, random_state=0)image_shape = people.images[0].shapenmf = NMF(n_components=15, random_state=0) nmf.fit(X_train)X_train_nmf = nmf.transform(X_train) X_test_nmf = nmf.transform(X_test)fix, axes = plt.subplots(3, 5, figsize=(15, 12), subplot_kw={'xticks': (), 'yticks': ()}) for i, (component, ax) in enumerate(zip(nmf.components_, axes.ravel())):ax.imshow(component.reshape(image_shape))ax.set_title("{}. component".format(i))plt.tight_layout() plt.show()
-
绘制分量4和7的图像
import numpy as np from matplotlib import pyplot as plt from sklearn.datasets import fetch_lfw_people from sklearn.model_selection import train_test_split from sklearn.decomposition import NMF import sslssl._create_default_https_context = ssl._create_unverified_contextpeople = fetch_lfw_people(min_faces_per_person=20, resize=0.7) mask = np.zeros(people.target.shape, dtype=np.bool_)for target in np.unique(people.target):mask[np.where(people.target == target)[0][:50]] = 1X_people = people.data[mask] y_people = people.target[mask]X_people = X_people / 255.X_train, X_test, y_train, y_test = train_test_split(X_people, y_people, stratify=y_people, random_state=0)image_shape = people.images[0].shapenmf = NMF(n_components=15, random_state=0) nmf.fit(X_train)X_train_nmf = nmf.transform(X_train)compn = 4 # 按第4个分量排序,绘制前10张图像 inds = np.argsort(X_train_nmf[:, compn])[::-1] fig, axes = plt.subplots(2, 5, figsize=(15, 8), subplot_kw={'xticks': (), 'yticks': ()}) for i, (ind, ax) in enumerate(zip(inds, axes.ravel())):ax.imshow(X_train[ind].reshape(image_shape))plt.tight_layout() plt.show()compn = 7 # 按第7个分量排序,绘制前10张图像 inds = np.argsort(X_train_nmf[:, compn])[::-1] fig, axes = plt.subplots(2, 5, figsize=(15, 8), subplot_kw={'xticks': (), 'yticks': ()}) for i, (ind, ax) in enumerate(zip(inds, axes.ravel())):ax.imshow(X_train[ind].reshape(image_shape))plt.tight_layout() plt.show()
-
对信号进行处理
import mglearn
from matplotlib import pyplot as plt

S = mglearn.datasets.make_signals()

plt.figure(figsize=(6, 1))
plt.plot(S, '-')
plt.xlabel("Time")
plt.ylabel("Signal")

plt.tight_layout()
plt.show()
-
将混合信号分解为原始信号
import mglearn import numpy as np from matplotlib import pyplot as plt from sklearn.decomposition import NMF, PCAS = mglearn.datasets.make_signals()# 将数据混合成100维的状态 A = np.random.RandomState(0).uniform(size=(100, 3)) X = np.dot(S, A.T)# 使用NMF还原信号 nmf = NMF(n_components=3, random_state=42) S_ = nmf.fit_transform(X)# 使用PCA还原信号 pca = PCA(n_components=3) H = pca.fit_transform(X)models = [X, S, S_, H] names = ['Observations (first three measurements)','True sources','NMF recovered signals','PCA recovered signals'] fig, axes = plt.subplots(4, figsize=(8, 4), gridspec_kw={'hspace': .5}, subplot_kw={'xticks': (), 'yticks': ()})for model, name, ax in zip(models, names, axes):ax.set_title(name)ax.plot(model[:, :3], '-')plt.tight_layout() plt.show()
4.3 用t-SNE进行流形学习
-
流形学习算法
- 用于可视化的算法
- 允许进行复杂的映射
- 可以给出较好的可视化
- 算法计算训练数据的一种新表示,但不允许变换新数据
- 这意味着这些算法不能用于测试集:它们只能变换拟合(训练)时使用过的数据
-
t-SNE
- 思想:找到数据的一个二维表示,尽可能地保持数据点之间的距离
- 步骤
- 给出每个数据点的随机二维表示
- 尝试让在原始特征空间中距离较近的点更加靠近,在原始特征空间中距离较远的点更加远离
- 重点关注距离较近的点
- 试图保存那些表示哪些点比较靠近的信息
- 仅根据原始空间中数据点之间的靠近程度就能将各个类别明确分开
-
加载手写数字数据集
from matplotlib import pyplot as plt
from sklearn.datasets import load_digits

digits = load_digits()

fig, axes = plt.subplots(2, 5, figsize=(10, 5), subplot_kw={'xticks': (), 'yticks': ()})
for ax, img in zip(axes.ravel(), digits.images):
    ax.imshow(img)

plt.tight_layout()
plt.show()
-
用PCA将降到二维的数据可视化
- 对前两个主成分作图,并按类别对数据点着色
from matplotlib import pyplot as plt from sklearn.datasets import load_digits from sklearn.decomposition import PCAdigits = load_digits()# 构建一个PCA模型 pca = PCA(n_components=2) pca.fit(digits.data)# 将digits数据变换到前两个主成分的方向上 digits_pca = pca.transform(digits.data) colors = ["#476A2A", "#7851B8", "#BD3430", "#4A2D4E", "#875525","#A83683", "#4E655E", "#853541", "#3A3120", "#535D8E"]plt.figure(figsize=(10, 10)) plt.xlim(digits_pca[:, 0].min(), digits_pca[:, 0].max()) plt.ylim(digits_pca[:, 1].min(), digits_pca[:, 1].max())for i in range(len(digits.data)):# 将数据实际绘制成文本,而不是散点plt.text(digits_pca[i, 0], digits_pca[i, 1], str(digits.target[i]),color=colors[digits.target[i]],fontdict={'weight': 'bold', 'size': 9})plt.xlabel("First principal component") plt.ylabel("Second principal component")plt.tight_layout() plt.show()
- 0、4、6相对较好地分开
-
将t-SNE应用于数据集
- TSNE类没有transform方法
- 调用fit_transform代替
- 构建模型,并立刻返回变换后的数据
from matplotlib import pyplot as plt
from sklearn.datasets import load_digits
from sklearn.manifold import TSNE

digits = load_digits()

tsne = TSNE(random_state=42)
# 使用fit_transform而不是fit,因为TSNE没有transform方法
digits_tsne = tsne.fit_transform(digits.data)

colors = ["#476A2A", "#7851B8", "#BD3430", "#4A2D4E", "#875525",
          "#A83683", "#4E655E", "#853541", "#3A3120", "#535D8E"]

plt.figure(figsize=(10, 10))
plt.xlim(digits_tsne[:, 0].min(), digits_tsne[:, 0].max() + 1)
plt.ylim(digits_tsne[:, 1].min(), digits_tsne[:, 1].max() + 1)

for i in range(len(digits.data)):
    # 将数据实际绘制成文本,而不是散点
    plt.text(digits_tsne[i, 0], digits_tsne[i, 1], str(digits.target[i]),
             color=colors[digits.target[i]],
             fontdict={'weight': 'bold', 'size': 9})

plt.xlabel("t-SNE feature 0")
plt.ylabel("t-SNE feature 1")

plt.tight_layout()
plt.show()
- 大多数类别都形成一个密集的组
5. 聚类
- 将数据集划分成组(簇)的任务
- 目标:划分数据,使得一个簇内的数据点非常相似且不同簇内的数据点非常不同
- 算法为每个数据点分配(或预测)一个数字,表示这个点属于哪个簇
5.1 k均值聚类
-
试图找到代表数据特定区域的簇中心
-
步骤
-
将每个数据点分配给最近的簇中心
-
将每个簇中心设置为所分配的所有数据点的平均值
-
重复执行以上两个步骤,直到簇的分配不再发生变化
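为了说明上述步骤如何交替进行,下面给出一个用NumPy手写的k均值草图(仅作示意,省略了空簇等边界情况的处理):

```python
import numpy as np

def simple_kmeans(X, n_clusters, n_iter=10, seed=0):
    rng = np.random.RandomState(seed)
    # 随机选取 n_clusters 个数据点作为初始簇中心
    centers = X[rng.choice(len(X), n_clusters, replace=False)]
    for _ in range(n_iter):
        # 步骤1:将每个数据点分配给最近的簇中心
        dists = np.linalg.norm(X[:, np.newaxis, :] - centers, axis=2)
        labels = dists.argmin(axis=1)
        # 步骤2:将每个簇中心设置为所分配点的平均值
        new_centers = np.array([X[labels == k].mean(axis=0) for k in range(n_clusters)])
        if np.allclose(new_centers, centers):   # 簇中心不再变化,算法结束
            break
        centers = new_centers
    return centers, labels

# 在随机数据上运行
X_demo = np.random.RandomState(1).randn(100, 2)
centers, labels = simple_kmeans(X_demo, n_clusters=3)
print(centers)
```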
-
算法说明
from matplotlib import pyplot as plt
import mglearn

mglearn.plots.plot_kmeans_algorithm()

plt.tight_layout()
plt.show()
- 三角形:簇中心
- 圆形:数据点
- 颜色:簇成员
- 寻找3个簇
- 声明3个随机数据点为簇中心来将算法初始化
- 运行迭代算法
- 每个数据点被分配给距离最近的簇中心
- 将簇中心修改为所分配点的平均值
- 重复2次,第三次迭代后,为簇中心分配的数据点保持不变,算法结束
-
簇中心的边界
from matplotlib import pyplot as plt
import mglearn

mglearn.plots.plot_kmeans_boundaries()

plt.tight_layout()
plt.show()
-
使用k均值
from matplotlib import pyplot as plt import mglearn from sklearn.datasets import make_blobs from sklearn.cluster import KMeans# 生成模拟的二维数据 X, y = make_blobs(random_state=1)# 构建聚类模型 kmeans = KMeans(n_clusters=3) # n_clusters: 簇的个数(默认为8)kmeans.fit(X)# 打印每个点的簇标签 print("Cluster memberships:\n{}".format(kmeans.labels_)) # Cluster memberships: # [0 2 2 2 1 1 1 2 0 0 2 2 1 0 1 1 1 0 2 2 1 2 1 0 2 1 1 0 0 1 0 0 1 0 2 1 2 # 2 2 1 1 2 0 2 2 1 0 0 0 0 2 1 1 1 0 1 2 2 0 0 2 1 1 2 2 1 0 1 0 2 2 2 1 0 # 0 2 1 1 0 2 0 2 2 1 0 0 0 0 2 0 1 0 0 2 2 1 1 0 1 0]# predict方法也可以为新数据点分配簇标签 print(kmeans.predict(X)) # [0 2 2 2 1 1 1 2 0 0 2 2 1 0 1 1 1 0 2 2 1 2 1 0 2 1 1 0 0 1 0 0 1 0 2 1 2 # 2 2 1 1 2 0 2 2 1 0 0 0 0 2 1 1 1 0 1 2 2 0 0 2 1 1 2 2 1 0 1 0 2 2 2 1 0 # 0 2 1 1 0 2 0 2 2 1 0 0 0 0 2 0 1 0 0 2 2 1 1 0 1 0]mglearn.discrete_scatter(X[:, 0], X[:, 1], kmeans.labels_, markers='o') mglearn.discrete_scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1], [0, 1, 2],markers='^', markeredgewidth=2)plt.tight_layout() plt.show()
-
每个元素都有一个标签
- 不存在真实的标签
- 标签本身没有先验意义
-
绘制图像
from matplotlib import pyplot as plt import mglearn from sklearn.datasets import make_blobs from sklearn.cluster import KMeansX, y = make_blobs(random_state=1)kmeans = KMeans(n_clusters=3) kmeans.fit(X)mglearn.discrete_scatter(X[:, 0], X[:, 1], kmeans.labels_, markers='o') mglearn.discrete_scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1], [0, 1, 2],markers='^', markeredgewidth=2)plt.tight_layout() plt.show()
-
使用更多或更少的簇中心
from matplotlib import pyplot as plt import mglearn from sklearn.datasets import make_blobs from sklearn.cluster import KMeansX, y = make_blobs(random_state=1)fig, axes = plt.subplots(1, 2, figsize=(10, 5))# 使用2个簇中心 kmeans = KMeans(n_clusters=2) kmeans.fit(X) assignments = kmeans.labels_mglearn.discrete_scatter(X[:, 0], X[:, 1], assignments, ax=axes[0])# 使用5个簇中心 kmeans = KMeans(n_clusters=5) kmeans.fit(X) assignments = kmeans.labels_mglearn.discrete_scatter(X[:, 0], X[:, 1], assignments, ax=axes[1])plt.tight_layout() plt.show()
-
5.1.1 k均值的失败案例
-
每个簇仅由其中心定义
- 每个簇都是凸形
-
k均值只能找到相对简单的形状
-
k均值假设所有簇在某种程度上都具有相同的直径,总是将簇之间的边界刚好画在簇中心的中间位置
from matplotlib import pyplot as plt import mglearn from sklearn.datasets import make_blobs from sklearn.cluster import KMeansX, y = make_blobs(random_state=1)X_varied, y_varied = make_blobs(n_samples=200, cluster_std=[1.0, 2.5, 0.5], random_state=170)y_pred = KMeans(n_clusters=3, random_state=0).fit_predict(X_varied)mglearn.discrete_scatter(X_varied[:, 0], X_varied[:, 1], y_pred)plt.legend(["cluster 0", "cluster 1", "cluster 2"], loc='best') plt.xlabel("Feature 0") plt.ylabel("Feature 1")plt.tight_layout() plt.show()
- 簇0和簇1都包含一些远离簇中其他点的点
-
k均值假设所有方向对每个簇都同等重要
import numpy as np from matplotlib import pyplot as plt import mglearn from sklearn.datasets import make_blobs from sklearn.cluster import KMeans# 生成一些随机分组数据 X, y = make_blobs(random_state=170, n_samples=600) rng = np.random.RandomState(74)# 变换数据使其拉长 transformation = rng.normal(size=(2, 2)) X = np.dot(X, transformation)# 将数据聚类成3个簇 kmeans = KMeans(n_clusters=3) kmeans.fit(X) y_pred = kmeans.predict(X)# 画出簇分配和簇中心 plt.scatter(X[:, 0], X[:, 1], c=y_pred, cmap=mglearn.cm3) plt.scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1],marker='^', c=[0, 1, 2], s=100, linewidth=2) plt.xlabel("Feature 0") plt.ylabel("Feature 1")plt.tight_layout() plt.show()
-
簇的形状很复杂
from matplotlib import pyplot as plt import mglearn from sklearn.datasets import make_moons from sklearn.cluster import KMeans# 生成模拟的two moons数据(这次的噪声较小) X, y = make_moons(n_samples=200, noise=0.05, random_state=0)# 将数据聚类成2个簇 kmeans = KMeans(n_clusters=2) kmeans.fit(X) y_pred = kmeans.predict(X)# 画出簇分配和簇中心 plt.scatter(X[:, 0], X[:, 1], c=y_pred, cmap=mglearn.cm2, s=60) plt.scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1],marker='^', c=[mglearn.cm2(0), mglearn.cm2(1)], s=100, linewidth=2) plt.xlabel("Feature 0") plt.ylabel("Feature 1")plt.tight_layout() plt.show()
5.1.2 矢量量化,或者将k均值看作分解
-
矢量量化:k均值是一种分解方法,其中每个点用单一分量来表示
-
并排比较PCA、NMF和k均值,分别显示提取的分量,以及利用100个分量对测试集中人脸的重建
import numpy as np from matplotlib import pyplot as plt from sklearn.cluster import KMeans from sklearn.datasets import fetch_lfw_people from sklearn.model_selection import train_test_split from sklearn.decomposition import NMF, PCA import sslssl._create_default_https_context = ssl._create_unverified_contextpeople = fetch_lfw_people(min_faces_per_person=20, resize=0.7) mask = np.zeros(people.target.shape, dtype=np.bool_)for target in np.unique(people.target):mask[np.where(people.target == target)[0][:50]] = 1X_people = people.data[mask] y_people = people.target[mask] X_people = X_people / 255.image_shape = people.images[0].shapeX_train, X_test, y_train, y_test = train_test_split(X_people, y_people, stratify=y_people, random_state=0)nmf = NMF(n_components=100, random_state=0) nmf.fit(X_train) pca = PCA(n_components=100, random_state=0) pca.fit(X_train) kmeans = KMeans(n_clusters=100, random_state=0) kmeans.fit(X_train)X_reconstructed_pca = pca.inverse_transform(pca.transform(X_test)) X_reconstructed_kmeans = kmeans.cluster_centers_[kmeans.predict(X_test)] X_reconstructed_nmf = np.dot(nmf.transform(X_test), nmf.components_)fig, axes = plt.subplots(3, 5, figsize=(8, 8), subplot_kw={'xticks': (), 'yticks': ()})fig.suptitle("Extracted Components") for ax, comp_kmeans, comp_pca, comp_nmf in zip(axes.T, kmeans.cluster_centers_, pca.components_, nmf.components_):ax[0].imshow(comp_kmeans.reshape(image_shape))ax[1].imshow(comp_pca.reshape(image_shape), cmap='viridis')ax[2].imshow(comp_nmf.reshape(image_shape))axes[0, 0].set_ylabel("kmeans") axes[1, 0].set_ylabel("pca") axes[2, 0].set_ylabel("nmf")plt.tight_layout()fig, axes = plt.subplots(4, 5, subplot_kw={'xticks': (), 'yticks': ()},figsize=(8, 8))fig.suptitle("Reconstructions") for ax, orig, rec_kmeans, rec_pca, rec_nmf in zip(axes.T, X_test, X_reconstructed_kmeans, X_reconstructed_pca, X_reconstructed_nmf):ax[0].imshow(orig.reshape(image_shape))ax[1].imshow(rec_kmeans.reshape(image_shape))ax[2].imshow(rec_pca.reshape(image_shape))ax[3].imshow(rec_nmf.reshape(image_shape))axes[0, 0].set_ylabel("original") axes[1, 0].set_ylabel("kmeans") axes[2, 0].set_ylabel("pca") axes[3, 0].set_ylabel("nmf")plt.tight_layout() plt.show()
-
用比输入维度更多的簇来对数据进行编码
from matplotlib import pyplot as plt from sklearn.datasets import make_moons from sklearn.cluster import KMeansX, y = make_moons(n_samples=200, noise=0.05, random_state=0)kmeans = KMeans(n_clusters=10, random_state=0) kmeans.fit(X) y_pred = kmeans.predict(X)plt.scatter(X[:, 0], X[:, 1], c=y_pred, s=60, cmap='Paired') plt.scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1],marker='^', s=60, c=range(kmeans.n_clusters),linewidth=2, cmap='Paired') plt.xlabel("Feature 0") plt.ylabel("Feature 1")print("Cluster memberships:\n{}".format(kmeans.labels_)) # Cluster memberships: # [9 2 5 4 2 7 9 6 9 6 1 0 2 6 1 9 3 0 3 1 7 6 8 6 8 5 2 7 5 8 9 8 6 5 3 7 0 # 9 4 5 0 1 3 5 2 8 9 1 5 6 1 0 7 4 6 3 3 6 3 8 0 4 2 9 6 4 8 2 8 4 0 4 0 5 # 6 4 5 9 3 0 7 8 0 7 5 8 9 8 0 7 3 9 7 1 7 2 2 0 4 5 6 7 8 9 4 5 4 1 2 3 1 # 8 8 4 9 2 3 7 0 9 9 1 5 8 5 1 9 5 6 7 9 1 4 0 6 2 6 4 7 9 5 5 3 8 1 9 5 6 # 3 5 0 2 9 3 0 8 6 0 3 3 5 6 3 2 0 2 3 0 2 6 3 4 4 1 5 6 7 1 1 3 2 4 7 2 7 # 3 8 6 4 1 4 3 9 9 5 1 7 5 8 2]plt.tight_layout() plt.show()
-
将到每个簇中心的距离作为特征,可以得到一种表现力很强的数据表示
- 使用transform方法
from sklearn.datasets import make_moons from sklearn.cluster import KMeansX, y = make_moons(n_samples=200, noise=0.05, random_state=0)kmeans = KMeans(n_clusters=10, random_state=0) kmeans.fit(X)distance_features=kmeans.transform(X) print("Distance feature shape: {}".format(distance_features.shape)) # Distance feature shape: (200, 10)print("Distance features:\n{}".format(distance_features)) # Distance features: # [[0.9220768 1.46553151 1.13956805 ... 1.16559918 1.03852189 0.23340263] # [1.14159679 2.51721597 0.1199124 ... 0.70700803 2.20414144 0.98271691] # [0.78786246 0.77354687 1.74914157 ... 1.97061341 0.71561277 0.94399739] # ... # [0.44639122 1.10631579 1.48991975 ... 1.79125448 1.03195812 0.81205971] # [1.38951924 0.79790385 1.98056306 ... 1.97788956 0.23892095 1.05774337] # [1.14920754 2.4536383 0.04506731 ... 0.57163262 2.11331394 0.88166689]]
5.1.3 优点、缺点
- 优点
- 非常流行的聚类算法
- 相对容易理解和实现
- 运行速度相对较快
- 可以轻松扩展到大型数据集
- 缺点
- 依赖于随机初始化
- 算法的输出依赖于随机种子
- 默认情况下,scikit-learn用10种不同的随机初始化将算法运行10次,并返回最佳结果(簇的方差之和最小)
- 对簇形状的假设的约束性较强
- 要求指定所要寻找的簇的个数(在现实世界的应用中可能并不知道这个数字)
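关于上面提到的随机初始化,下面的小草图(数据用make_blobs生成,仅作演示)把n_init设为1,观察不同random_state收敛到的解(inertia_即簇内方差之和)可能不同,而多次随机初始化会返回其中最好的一次:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(random_state=1)

for seed in range(3):
    km = KMeans(n_clusters=3, n_init=1, random_state=seed).fit(X)
    print("random_state={}  inertia={:.2f}".format(seed, km.inertia_))

# 多次随机初始化,保留簇内方差之和最小的一次
km_best = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print("n_init=10  inertia={:.2f}".format(km_best.inertia_))
```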
5.2 凝聚聚类
-
许多基于相同原则构建的聚类算法
- 原则:算法首先声明每个点是自己的簇,然后合并两个最相似的簇,直到满足某种停止准则为止
- 准则
- scikit-learn:簇的个数
- 链接准则:规定如何度量最相似的簇
- 定义在两个现有的簇之间
- scikit-learn中实现的三种选项
- ward
- 默认选项
- 挑选两个簇进行合并,使得所有簇中的方差增加最小
- 会得到大小差不多相等的簇
- 用于大多数数据集
- average
- 将簇中所有点之间平均距离最小的两个簇合并
- complete
- 将簇中点之间最大距离最小的两个簇合并
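下面的小草图在模拟数据上分别用上述三种链接准则运行凝聚聚类,并比较各簇的大小(参数取值只是演示用的假设):

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=200, cluster_std=[1.0, 2.5, 0.5], random_state=170)

for linkage in ["ward", "average", "complete"]:
    labels = AgglomerativeClustering(n_clusters=3, linkage=linkage).fit_predict(X)
    print("{:>8}: 各簇大小 {}".format(linkage, np.bincount(labels)))
```

一般来说,ward得到的簇大小更加均匀。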
-
二维数据集上的凝聚聚类过程
- 寻找3个簇
import matplotlib.pyplot as plt
import mglearn

mglearn.plots.plot_agglomerative_algorithm()

plt.tight_layout()
plt.show()
-
凝聚聚类对简单三簇数据的效果
from matplotlib import pyplot as plt
import mglearn
from sklearn.datasets import make_blobs
from sklearn.cluster import AgglomerativeClustering

X, y = make_blobs(random_state=1)

agg = AgglomerativeClustering(n_clusters=3)
assignment = agg.fit_predict(X)

mglearn.discrete_scatter(X[:, 0], X[:, 1], assignment)
plt.xlabel("Feature 0")
plt.ylabel("Feature 1")

plt.tight_layout()
plt.show()
层次聚类与树状图
-
同时查看所有可能的聚类
from matplotlib import pyplot as plt
import mglearn

mglearn.plots.plot_agglomerative()

plt.tight_layout()
plt.show()
-
树状图
- 可以处理多维数据
from matplotlib import pyplot as plt from sklearn.datasets import make_blobs from scipy.cluster.hierarchy import dendrogram, wardX, y = make_blobs(random_state=0, n_samples=12)# 将ward聚类应用于数据数组X # SciPy的ward函数返回一个数组,指定执行凝聚聚类时跨越的距离 linkage_array = ward(X)# 现在为包含簇之间距离的linkage array绘制树状图 dendrogram(linkage_array)# 在树中标记划分成两个簇或三个簇的位置 ax = plt.gca() bounds = ax.get_xbound() ax.plot(bounds, [7.25, 7.25], '--', c='k') ax.plot(bounds, [4, 4], '--', c='k')ax.text(bounds[1], 7.25, ' two clusters', va='center', fontdict={'size': 15}) ax.text(bounds[1], 4, ' three clusters', va='center', fontdict={'size': 15})plt.xlabel("Sample index") plt.ylabel("Cluster distance")plt.tight_layout() plt.show()
- x轴:数据点
- y轴:聚类算法中簇的合并时间
- 分支长度:合并的簇之间的距离
5.3 DBSCAN
-
优点
- 不需要用户先验地设置簇的个数
- 可以划分具有复杂形状的簇
- 可以找出不属于任何簇的点
- 可以扩展到相对较大的数据集
-
缺点
- 运行速度较慢
-
原理:识别特征空间的“拥挤”区域中的点
- “拥挤”区域(密集区域):区域中许多数据点靠近在一起
- 密集区域中的点:核心样本
- 如果在距一个给定数据点eps的距离内至少有min_samples个数据点,那么这个点就是核心样本
- DBSCAN将彼此距离小于eps的核心样本放到同一个簇中
-
思想:簇形成数据的密集区域,并由相对较空的区域隔开
-
步骤
- 选取任意一个点
- 找到到这个点的距离小于等于eps的所有点
- 如果距起始点的距离在eps之内的数据点个数小于min_samples,则这个点被标记为噪声
- 这个点不属于任何簇
- 如果距起始点的距离在eps之内的数据点个数大于min_samples,则这个点被标记为核心样本,并被分配一个新的簇标签
- 访问该点的所有邻居(在距离eps以内)
- 如果它们还没有被分配一个簇,则将刚刚创建的新的簇标签分配给它们
- 如果它们是核心样本,那么依次访问其邻居
- 簇逐渐增大,直到在簇的eps距离内没有更多的核心样本为止
- 选取另一个未被访问过的点,重复以上步骤
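拟合之后,可以通过core_sample_indices_属性和标签-1区分核心样本、边界点和噪声。下面是一个最小草图(数据沿用two_moons,eps等参数取值只是假设):

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

X, _ = make_moons(n_samples=200, noise=0.05, random_state=0)

dbscan = DBSCAN(eps=0.2, min_samples=5)
labels = dbscan.fit_predict(X)

core_mask = np.zeros(len(X), dtype=bool)
core_mask[dbscan.core_sample_indices_] = True     # 核心样本
noise_mask = labels == -1                         # 噪声
boundary_mask = ~core_mask & ~noise_mask          # 边界点:属于某个簇但不是核心样本

print("核心样本: {}  边界点: {}  噪声: {}".format(
    core_mask.sum(), boundary_mask.sum(), noise_mask.sum()))
```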
-
eps和min_samples取不同值时的簇分类
import matplotlib.pyplot as plt import mglearnmglearn.plots.plot_dbscan()plt.tight_layout() plt.show() # min_samples: 2 eps: 1.000000 cluster: [-1 0 0 -1 0 -1 1 1 0 1 -1 -1] # min_samples: 2 eps: 1.500000 cluster: [0 1 1 1 1 0 2 2 1 2 2 0] # min_samples: 2 eps: 2.000000 cluster: [0 1 1 1 1 0 0 0 1 0 0 0] # min_samples: 2 eps: 3.000000 cluster: [0 0 0 0 0 0 0 0 0 0 0 0] # min_samples: 3 eps: 1.000000 cluster: [-1 0 0 -1 0 -1 1 1 0 1 -1 -1] # min_samples: 3 eps: 1.500000 cluster: [0 1 1 1 1 0 2 2 1 2 2 0] # min_samples: 3 eps: 2.000000 cluster: [0 1 1 1 1 0 0 0 1 0 0 0] # min_samples: 3 eps: 3.000000 cluster: [0 0 0 0 0 0 0 0 0 0 0 0] # min_samples: 5 eps: 1.000000 cluster: [-1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1] # min_samples: 5 eps: 1.500000 cluster: [-1 0 0 0 0 -1 -1 -1 0 -1 -1 -1] # min_samples: 5 eps: 2.000000 cluster: [-1 0 0 0 0 -1 -1 -1 0 -1 -1 -1] # min_samples: 5 eps: 3.000000 cluster: [0 0 0 0 0 0 0 0 0 0 0 0]
- -1:噪声
- 实心:属于簇的点
- 空心:噪声点
- 较大的标记:核心样本
- 较小的标记:边界点
-
使用StandardScaler或MinMaxScaler对数据进行缩放后,有时更容易找到eps的较好取值
-
在two_moons数据集上运行DBSCAN的结果
import mglearn from matplotlib import pyplot as plt from sklearn.cluster import DBSCAN from sklearn.datasets import make_moons from sklearn.preprocessing import StandardScalerX, y = make_moons(n_samples=200, noise=0.05, random_state=0)# 将数据缩放成平均值为0、方差为1 scaler = StandardScaler() scaler.fit(X) X_scaled = scaler.transform(X)dbscan = DBSCAN() clusters = dbscan.fit_predict(X_scaled)# 绘制簇分配 plt.scatter(X_scaled[:, 0], X_scaled[:, 1], c=clusters, cmap=mglearn.cm2, s=60)plt.xlabel("Feature 0") plt.ylabel("Feature 1")plt.tight_layout() plt.show()
5.4 聚类算法的对比与评估
5.4.1 用真实值评估聚类
-
用于评估聚类算法相对于真实聚类结果的指标
- 调整rand指数(ARI)
- 最佳值:1
- 不相关:0
- 归一化互信息(NMI)
- 最佳值:1
- 不相关:0
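两个指标都可以从sklearn.metrics导入。下面是一个小例子,用假设的标签比较与真实分组一致和基本无关的两种簇分配:

```python
from sklearn.metrics import adjusted_rand_score, normalized_mutual_info_score

y_true = [0, 0, 0, 1, 1, 1]
clusters_good = [1, 1, 1, 0, 0, 0]   # 与真实分组一致,只是簇编号不同
clusters_bad = [0, 1, 0, 1, 0, 1]    # 与真实分组基本无关

print("ARI: good={:.2f}  bad={:.2f}".format(
    adjusted_rand_score(y_true, clusters_good), adjusted_rand_score(y_true, clusters_bad)))
print("NMI: good={:.2f}  bad={:.2f}".format(
    normalized_mutual_info_score(y_true, clusters_good),
    normalized_mutual_info_score(y_true, clusters_bad)))
```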
-
使用ARI比较k均值、凝聚聚类和DBSCAN算法
import numpy as np import mglearn from matplotlib import pyplot as plt from sklearn.cluster import DBSCAN, KMeans, AgglomerativeClustering from sklearn.datasets import make_moons from sklearn.preprocessing import StandardScaler from sklearn.metrics.cluster import adjusted_rand_scoreX, y = make_moons(n_samples=200, noise=0.05, random_state=0)# 将数据缩放成平均值为0、方差为1 scaler = StandardScaler() scaler.fit(X) X_scaled = scaler.transform(X)fig, axes = plt.subplots(1, 4, figsize=(15, 3), subplot_kw={'xticks': (), 'yticks': ()})# 列出要使用的算法 algorithms = [KMeans(n_clusters=2), AgglomerativeClustering(n_clusters=2), DBSCAN()]# 创建一个随机的簇分配,作为参考 random_state = np.random.RandomState(seed=0) random_clusters = random_state.randint(low=0, high=2, size=len(X))# 绘制随机分配 axes[0].scatter(X_scaled[:, 0], X_scaled[:, 1], c=random_clusters, cmap=mglearn.cm3, s=60) axes[0].set_title("Random assignment - ARI: {:.2f}".format(adjusted_rand_score(y, random_clusters)))for ax, algorithm in zip(axes[1:], algorithms):# 绘制簇分配和簇中心clusters = algorithm.fit_predict(X_scaled)ax.scatter(X_scaled[:, 0], X_scaled[:, 1], c=clusters, cmap=mglearn.cm3, s=60)ax.set_title("{} - ARI: {:.2f}".format(algorithm.__class__.__name__, adjusted_rand_score(y, clusters)))plt.tight_layout() plt.show()
-
评估聚类时,不应该使用accuracy_score
- 精度评估:分配的簇标签与真实值完全匹配
- 但簇标签没有意义
from sklearn.metrics.cluster import adjusted_rand_score
from sklearn.metrics import accuracy_score

# 这两种点标签对应于相同的聚类
clusters1 = [0, 0, 1, 1, 0]
clusters2 = [1, 1, 0, 0, 1]

# 精度为0,因为二者标签完全不同
print("Accuracy: {:.2f}".format(accuracy_score(clusters1, clusters2)))
# Accuracy: 0.00

# 调整rand分数为1,因为二者聚类完全相同
print("ARI: {:.2f}".format(adjusted_rand_score(clusters1, clusters2)))
# ARI: 1.00
5.4.2 在没有真实值的情况下评估聚类
-
不需要真实值的聚类评分指标
- 轮廓系数
- 计算一个簇的紧致度
- 越大越好
- 最大值:1
- 不允许复杂的形状
-
使用轮廓系数比较k均值、凝聚聚类和DBSCAN算法
import numpy as np import mglearn from matplotlib import pyplot as plt from sklearn.cluster import DBSCAN, KMeans, AgglomerativeClustering from sklearn.datasets import make_moons from sklearn.preprocessing import StandardScaler from sklearn.metrics.cluster import silhouette_scoreX, y = make_moons(n_samples=200, noise=0.05, random_state=0)# 将数据缩放成平均值为0、方差为1 scaler = StandardScaler() scaler.fit(X)X_scaled = scaler.transform(X)fig, axes = plt.subplots(1, 4, figsize=(15, 3), subplot_kw={'xticks': (), 'yticks': ()})# 列出要使用的算法 algorithms = [KMeans(n_clusters=2), AgglomerativeClustering(n_clusters=2), DBSCAN()]# 创建一个随机的簇分配,作为参考 random_state = np.random.RandomState(seed=0) random_clusters = random_state.randint(low=0, high=2, size=len(X))# 绘制随机分配 axes[0].scatter(X_scaled[:, 0], X_scaled[:, 1], c=random_clusters, cmap=mglearn.cm3, s=60) axes[0].set_title("Random assignment - ARI: {:.2f}".format(silhouette_score(X_scaled, random_clusters)))for ax, algorithm in zip(axes[1:], algorithms):# 绘制簇分配和簇中心clusters = algorithm.fit_predict(X_scaled)ax.scatter(X_scaled[:, 0], X_scaled[:, 1], c=clusters, cmap=mglearn.cm3, s=60)ax.set_title("{} - ARI: {:.2f}".format(algorithm.__class__.__name__, silhouette_score(X_scaled, clusters)))plt.tight_layout() plt.show()
-
- 较好的评估聚类策略:使用基于鲁棒性的聚类指标(见下面的草图)
- 先向数据中添加一些噪声,或使用不同的参数设定
- 然后运行算法,并对结果进行比较
- 思想:如果许多算法参数和许多数据扰动返回相同的结果,那么它很可能是可信的
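按照这一思想可以自己写一个简单的草图:向数据加入少量噪声后重新聚类,并用ARI比较扰动前后的簇分配是否稳定(噪声幅度等均为假设):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_moons
from sklearn.metrics import adjusted_rand_score

X, _ = make_moons(n_samples=200, noise=0.05, random_state=0)
rng = np.random.RandomState(0)

base_labels = KMeans(n_clusters=2, random_state=0).fit_predict(X)

scores = []
for _ in range(10):
    X_noisy = X + rng.normal(scale=0.05, size=X.shape)   # 轻微扰动数据
    noisy_labels = KMeans(n_clusters=2, random_state=0).fit_predict(X_noisy)
    scores.append(adjusted_rand_score(base_labels, noisy_labels))

print("扰动前后簇分配的平均一致性(ARI): {:.2f}".format(np.mean(scores)))
```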
5.4.3 在人脸数据集上比较算法
-
加载人脸数据
- 使用数据的特征脸表示
- 由100个成分的PCA(whiten=True)生成
import numpy as np from sklearn.datasets import fetch_lfw_people from sklearn.decomposition import PCA import sslssl._create_default_https_context = ssl._create_unverified_contextpeople = fetch_lfw_people(min_faces_per_person=20, resize=0.7) mask = np.zeros(people.target.shape, dtype=np.bool_)for target in np.unique(people.target):mask[np.where(people.target == target)[0][:50]] = 1X_people = people.data[mask] y_people = people.target[mask] X_people = X_people / 255.# 从lfw数据中提取特征脸,并对数据进行变换 pca = PCA(n_components=100, whiten=True, random_state=0) # 100个成分pca.fit_transform(X_people)X_pca = pca.transform(X_people)
用DBSCAN分析人脸数据集
-
应用DBSCAN
import numpy as np from sklearn.cluster import DBSCAN from sklearn.datasets import fetch_lfw_people from sklearn.decomposition import PCA import sslssl._create_default_https_context = ssl._create_unverified_contextpeople = fetch_lfw_people(min_faces_per_person=20, resize=0.7) mask = np.zeros(people.target.shape, dtype=np.bool_)for target in np.unique(people.target):mask[np.where(people.target == target)[0][:50]] = 1X_people = people.data[mask] y_people = people.target[mask] X_people = X_people / 255.pca = PCA(n_components=100, whiten=True, random_state=0) pca.fit_transform(X_people)X_pca = pca.transform(X_people)# 应用默认参数的DBSCAN dbscan = DBSCAN() labels = dbscan.fit_predict(X_pca) print("Unique labels: {}".format(np.unique(labels))) # Unique labels: [-1]
- 所有数据点都被标记为噪声
- 改进的两种方式
- 增大eps参数
- 减小min_samples参数
-
减小min_samples参数
import numpy as np from sklearn.cluster import DBSCAN from sklearn.datasets import fetch_lfw_people from sklearn.decomposition import PCA import sslssl._create_default_https_context = ssl._create_unverified_contextpeople = fetch_lfw_people(min_faces_per_person=20, resize=0.7) mask = np.zeros(people.target.shape, dtype=np.bool_)for target in np.unique(people.target):mask[np.where(people.target == target)[0][:50]] = 1X_people = people.data[mask] y_people = people.target[mask] X_people = X_people / 255.pca = PCA(n_components=100, whiten=True, random_state=0) pca.fit_transform(X_people)X_pca = pca.transform(X_people)dbscan = DBSCAN(min_samples=3) labels = dbscan.fit_predict(X_pca) print("Unique labels: {}".format(np.unique(labels))) # Unique labels: [-1]
- 没有发生变化
-
增大eps参数
import numpy as np from sklearn.cluster import DBSCAN from sklearn.datasets import fetch_lfw_people from sklearn.decomposition import PCA import sslssl._create_default_https_context = ssl._create_unverified_contextpeople = fetch_lfw_people(min_faces_per_person=20, resize=0.7) mask = np.zeros(people.target.shape, dtype=np.bool_)for target in np.unique(people.target):mask[np.where(people.target == target)[0][:50]] = 1X_people = people.data[mask] y_people = people.target[mask] X_people = X_people / 255.pca = PCA(n_components=100, whiten=True, random_state=0) pca.fit_transform(X_people)X_pca = pca.transform(X_people)dbscan = DBSCAN(min_samples=3, eps=15) labels = dbscan.fit_predict(X_pca) print("Unique labels: {}".format(np.unique(labels))) # Unique labels: [-1 0]
- 得到了单一簇和噪声点
-
查看数据点的情况
import numpy as np from sklearn.cluster import DBSCAN from sklearn.datasets import fetch_lfw_people from sklearn.decomposition import PCA import sslssl._create_default_https_context = ssl._create_unverified_contextpeople = fetch_lfw_people(min_faces_per_person=20, resize=0.7) mask = np.zeros(people.target.shape, dtype=np.bool_)for target in np.unique(people.target):mask[np.where(people.target == target)[0][:50]] = 1X_people = people.data[mask] y_people = people.target[mask] X_people = X_people / 255.pca = PCA(n_components=100, whiten=True, random_state=0) pca.fit_transform(X_people)X_pca = pca.transform(X_people)dbscan = DBSCAN(min_samples=3, eps=15) labels = dbscan.fit_predict(X_pca)# 计算所有簇中的点数和噪声中的点数 # bincount不允许负值,所以我们需要加1 # 结果中的第一个数字对应于噪声点 print("Number of points per cluster: {}".format(np.bincount(labels + 1))) # Number of points per cluster: [ 37 2026]
-
查看所有的噪声点
import numpy as np from matplotlib import pyplot as plt from sklearn.cluster import DBSCAN from sklearn.datasets import fetch_lfw_people from sklearn.decomposition import PCA import sslssl._create_default_https_context = ssl._create_unverified_contextpeople = fetch_lfw_people(min_faces_per_person=20, resize=0.7) mask = np.zeros(people.target.shape, dtype=np.bool_)for target in np.unique(people.target):mask[np.where(people.target == target)[0][:50]] = 1X_people = people.data[mask] y_people = people.target[mask] X_people = X_people / 255.image_shape = people.images[0].shapepca = PCA(n_components=100, whiten=True, random_state=0) pca.fit_transform(X_people)X_pca = pca.transform(X_people)dbscan = DBSCAN(min_samples=3, eps=15) labels = dbscan.fit_predict(X_pca)noise = X_people[labels == -1]fig, axes = plt.subplots(3, 9, subplot_kw={'xticks': (), 'yticks': ()}, figsize=(12, 4)) for image, ax in zip(noise, axes.ravel()):ax.imshow(image.reshape(image_shape))plt.tight_layout() plt.show()
-
异常值检测:尝试找出数据集中与其他部分不匹配的数据
-
eps不同取值对应的结果
import numpy as np from sklearn.cluster import DBSCAN from sklearn.datasets import fetch_lfw_people from sklearn.decomposition import PCA import sslssl._create_default_https_context = ssl._create_unverified_contextpeople = fetch_lfw_people(min_faces_per_person=20, resize=0.7) mask = np.zeros(people.target.shape, dtype=np.bool_)for target in np.unique(people.target):mask[np.where(people.target == target)[0][:50]] = 1X_people = people.data[mask] y_people = people.target[mask] X_people = X_people / 255.pca = PCA(n_components=100, whiten=True, random_state=0) pca.fit_transform(X_people)X_pca = pca.transform(X_people)for eps in [1, 3, 5, 7, 9, 11, 13]:print("\neps={}".format(eps))dbscan = DBSCAN(eps=eps, min_samples=3)labels = dbscan.fit_predict(X_pca)print("Clusters present: {}".format(np.unique(labels)))print("Cluster sizes: {}".format(np.bincount(labels + 1))) # eps=1 # Clusters present: [-1] # Cluster sizes: [2063] # # eps=3 # Clusters present: [-1] # Cluster sizes: [2063] # # eps=5 # Clusters present: [-1 0] # Cluster sizes: [2059 4] # # eps=7 # Clusters present: [-1 0 1 2 3 4 5 6] # Cluster sizes: [1954 75 4 14 6 4 3 3] # # eps=9 # Clusters present: [-1 0 1] # Cluster sizes: [1199 861 3] # # eps=11 # Clusters present: [-1 0] # Cluster sizes: [ 403 1660] # # eps=13 # Clusters present: [-1 0] # Cluster sizes: [ 119 1944]
-
打印eps=7时7个簇中的图像
import numpy as np from matplotlib import pyplot as plt from sklearn.cluster import DBSCAN from sklearn.datasets import fetch_lfw_people from sklearn.decomposition import PCA import sslssl._create_default_https_context = ssl._create_unverified_contextpeople = fetch_lfw_people(min_faces_per_person=20, resize=0.7) mask = np.zeros(people.target.shape, dtype=np.bool_)for target in np.unique(people.target):mask[np.where(people.target == target)[0][:50]] = 1X_people = people.data[mask] y_people = people.target[mask] X_people = X_people / 255.image_shape = people.images[0].shapepca = PCA(n_components=100, whiten=True, random_state=0) pca.fit_transform(X_people)X_pca = pca.transform(X_people)dbscan = DBSCAN(min_samples=3, eps=7) labels = dbscan.fit_predict(X_pca)for cluster in range(max(labels) + 1):mask = labels == clustern_images = np.sum(mask)fig, axes = plt.subplots(1, n_images, figsize=(n_images * 1.5, 4), subplot_kw={'xticks': (), 'yticks': ()})for image, label, ax in zip(X_people[mask], y_people[mask], axes):ax.imshow(image.reshape(image_shape))ax.set_title(people.target_names[label].split()[-1])plt.tight_layout()plt.show()
用k均值分析人脸数据集
-
提取簇
import numpy as np from sklearn.cluster import KMeans from sklearn.datasets import fetch_lfw_people from sklearn.decomposition import PCA import sslssl._create_default_https_context = ssl._create_unverified_contextpeople = fetch_lfw_people(min_faces_per_person=20, resize=0.7) mask = np.zeros(people.target.shape, dtype=np.bool_)for target in np.unique(people.target):mask[np.where(people.target == target)[0][:50]] = 1X_people = people.data[mask] y_people = people.target[mask] X_people = X_people / 255.image_shape = people.images[0].shapepca = PCA(n_components=100, whiten=True, random_state=0) pca.fit_transform(X_people)X_pca = pca.transform(X_people)# 用k均值提取簇 km = KMeans(n_clusters=10, random_state=0) labels_km = km.fit_predict(X_pca)print("Cluster sizes k-means: {}".format(np.bincount(labels_km))) # Cluster sizes k-means: [ 70 198 139 109 196 351 207 424 180 189]
- 簇的大小相似
-
可视化
import numpy as np from matplotlib import pyplot as plt from sklearn.cluster import KMeans from sklearn.datasets import fetch_lfw_people from sklearn.decomposition import PCA import sslssl._create_default_https_context = ssl._create_unverified_contextpeople = fetch_lfw_people(min_faces_per_person=20, resize=0.7) mask = np.zeros(people.target.shape, dtype=np.bool_)for target in np.unique(people.target):mask[np.where(people.target == target)[0][:50]] = 1X_people = people.data[mask] y_people = people.target[mask] X_people = X_people / 255.image_shape = people.images[0].shapepca = PCA(n_components=100, whiten=True, random_state=0) pca.fit_transform(X_people)X_pca = pca.transform(X_people)km = KMeans(n_clusters=10, random_state=0) labels_km = km.fit_predict(X_pca)fig, axes = plt.subplots(2, 5, subplot_kw={'xticks': (), 'yticks': ()}, figsize=(12, 4)) for center, ax in zip(km.cluster_centers_, axes.ravel()):ax.imshow(pca.inverse_transform(center).reshape(image_shape))plt.tight_layout() plt.show()
-
绘制每个簇中心最典型和最不典型各5个图像
import mglearn import numpy as np from matplotlib import pyplot as plt from sklearn.cluster import KMeans from sklearn.datasets import fetch_lfw_people from sklearn.decomposition import PCA import sslssl._create_default_https_context = ssl._create_unverified_contextpeople = fetch_lfw_people(min_faces_per_person=20, resize=0.7) mask = np.zeros(people.target.shape, dtype=np.bool_)for target in np.unique(people.target):mask[np.where(people.target == target)[0][:50]] = 1X_people = people.data[mask] y_people = people.target[mask] X_people = X_people / 255.image_shape = people.images[0].shapepca = PCA(n_components=100, whiten=True, random_state=0) pca.fit_transform(X_people)X_pca = pca.transform(X_people)km = KMeans(n_clusters=10, random_state=0) km.fit_predict(X_pca)mglearn.plots.plot_kmeans_faces(km, pca, X_pca, X_people, y_people, people.target_names)plt.tight_layout() plt.show()
用凝聚聚类分析人脸数据集
-
提取簇
import numpy as np from sklearn.cluster import AgglomerativeClustering from sklearn.datasets import fetch_lfw_people from sklearn.decomposition import PCA import sslssl._create_default_https_context = ssl._create_unverified_contextpeople = fetch_lfw_people(min_faces_per_person=20, resize=0.7) mask = np.zeros(people.target.shape, dtype=np.bool_)for target in np.unique(people.target):mask[np.where(people.target == target)[0][:50]] = 1X_people = people.data[mask] y_people = people.target[mask] X_people = X_people / 255.image_shape = people.images[0].shapepca = PCA(n_components=100, whiten=True, random_state=0) pca.fit_transform(X_people)X_pca = pca.transform(X_people)# 用ward凝聚聚类提取簇 agglomerative = AgglomerativeClustering(n_clusters=10) labels_agg = agglomerative.fit_predict(X_pca)print("Cluster sizes agglomerative clustering: {}".format(np.bincount(labels_agg))) # Cluster sizes agglomerative clustering: [264 100 275 553 49 64 546 52 51 109]
-
计算ARI来度量凝聚聚类与k均值给出的两种数据划分是否相似
import numpy as np from sklearn.cluster import AgglomerativeClustering, KMeans from sklearn.datasets import fetch_lfw_people from sklearn.decomposition import PCA from sklearn.metrics import adjusted_rand_score import sslssl._create_default_https_context = ssl._create_unverified_contextpeople = fetch_lfw_people(min_faces_per_person=20, resize=0.7) mask = np.zeros(people.target.shape, dtype=np.bool_)for target in np.unique(people.target):mask[np.where(people.target == target)[0][:50]] = 1X_people = people.data[mask] y_people = people.target[mask] X_people = X_people / 255.image_shape = people.images[0].shapepca = PCA(n_components=100, whiten=True, random_state=0) pca.fit_transform(X_people)X_pca = pca.transform(X_people)km = KMeans(n_clusters=10, random_state=0) labels_km = km.fit_predict(X_pca)agglomerative = AgglomerativeClustering(n_clusters=10) labels_agg = agglomerative.fit_predict(X_pca)print("ARI: {:.3f}".format(adjusted_rand_score(labels_agg, labels_km))) # ARI: 0.088
-
绘制树状图
import numpy as np from matplotlib import pyplot as plt from scipy.cluster.hierarchy import ward, dendrogram from sklearn.cluster import AgglomerativeClustering from sklearn.datasets import fetch_lfw_people from sklearn.decomposition import PCA import sslssl._create_default_https_context = ssl._create_unverified_contextpeople = fetch_lfw_people(min_faces_per_person=20, resize=0.7) mask = np.zeros(people.target.shape, dtype=np.bool_)for target in np.unique(people.target):mask[np.where(people.target == target)[0][:50]] = 1X_people = people.data[mask] y_people = people.target[mask] X_people = X_people / 255.image_shape = people.images[0].shapepca = PCA(n_components=100, whiten=True, random_state=0) pca.fit_transform(X_people)X_pca = pca.transform(X_people)agglomerative = AgglomerativeClustering(n_clusters=10) labels_agg = agglomerative.fit_predict(X_pca)linkage_array = ward(X_pca)# 现在我们为包含簇之间距离的linkage array绘制树状图 plt.figure(figsize=(20, 5)) dendrogram(linkage_array, p=7, truncate_mode='level', no_labels=True)plt.xlabel("Sample index") plt.ylabel("Cluster distance")plt.tight_layout() plt.show()
-
将10个簇可视化
import numpy as np from matplotlib import pyplot as plt from sklearn.cluster import AgglomerativeClustering from sklearn.datasets import fetch_lfw_people from sklearn.decomposition import PCA import sslssl._create_default_https_context = ssl._create_unverified_contextpeople = fetch_lfw_people(min_faces_per_person=20, resize=0.7) mask = np.zeros(people.target.shape, dtype=np.bool_)for target in np.unique(people.target):mask[np.where(people.target == target)[0][:50]] = 1X_people = people.data[mask] y_people = people.target[mask] X_people = X_people / 255.image_shape = people.images[0].shapepca = PCA(n_components=100, whiten=True, random_state=0) pca.fit_transform(X_people)X_pca = pca.transform(X_people)agglomerative = AgglomerativeClustering(n_clusters=10) labels_agg = agglomerative.fit_predict(X_pca)n_clusters = 10 for cluster in range(n_clusters):mask = labels_agg == clusterfig, axes = plt.subplots(1, 10, subplot_kw={'xticks': (), 'yticks': ()}, figsize=(15, 8))axes[0].set_ylabel(np.sum(mask))for image, label, asdf, ax in zip(X_people[mask], y_people[mask], labels_agg[mask], axes):ax.imshow(image.reshape(image_shape))ax.set_title(people.target_names[label].split()[-1], fontdict={'fontsize': 9})plt.tight_layout()plt.show()
5.5 聚类方法小结
- 聚类的应用与评估是一个非常定性的过程,通常在数据分析的探索阶段很有帮助
- 三种聚类算法
- k均值
- 允许指定想要的簇的数量
- 可以用簇的平均值来表示簇
- 可以被看作一种分解方法,每个数据点都由其簇中心表示
- DBSCAN
- 允许用eps参数定义接近程度,从而间接影响簇的大小
- 可以检测到没有分配任何簇的“噪声点”
- 可以帮助自动判断簇的数量
- 允许簇具有复杂的形状
- 凝聚聚类
- 允许指定想要的簇的数量
- 可以提供数据的可能划分的整个层次结构
- 可以通过树状图轻松查看
- 三种算法都可以控制聚类的粒度
- 三种方法都可以用于大型的现实世界数据集,都相对容易理解,也都可以聚类成多个簇