Machine Learning Practice 6 --- K-means Clustering and Principal Component Analysis (PCA)

In this exercise we implement the K-means clustering algorithm and apply it to compress an image. In the second part, we use principal component analysis (PCA) to find a low-dimensional representation of face images.

K-means Clustering

Implementing K-means

The K-means algorithm is a method for automatically clustering similar data examples together. The intuition behind K-means is an iterative procedure: it starts from an initial guess of the centroids and then refines that guess by repeatedly assigning each example to its closest centroid and recomputing the centroids from those assignments.
The concrete steps are:

  1. Randomly initialize K centroids.
  2. Start iterating.
  3. Find the closest centroid for each example.
  4. Recompute each centroid as the mean of the points assigned to it.
  5. Go back to step 3.
  6. Converge to the final set of means.
    In practice, the K-means algorithm is usually run several times with different random initializations. One way to choose among the resulting solutions is to pick the one with the lowest value of the cost function (the distortion), as sketched below.
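The distortion itself is never computed in the code below, so here is a minimal sketch of how the best of several random restarts could be picked by that criterion. It reuses the initCentroids and run_k_means functions defined later in this post; the helper name compute_distortion and the parameter n_init are my own, not part of the original exercise.

```python
import numpy as np

def compute_distortion(X, idx, centroids):
    """Mean squared distance between each example and its assigned centroid."""
    diff = X - centroids[np.asarray(idx, dtype=int).ravel()]
    return np.mean(np.sum(diff ** 2, axis=1))

def best_of_n_runs(X, K, n_init=10, max_iters=10):
    """Run K-means n_init times with different random initializations and
    keep the solution with the lowest distortion."""
    best = None
    for _ in range(n_init):
        centroids = initCentroids(X, K)                       # random initialization (defined later)
        idx, centroids_all = run_k_means(X, centroids, max_iters)
        cost = compute_distortion(X, idx, centroids_all[-1])
        if best is None or cost < best[0]:
            best = (cost, idx, centroids_all)
    return best
```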
Finding closest centroids

In the cluster-assignment step, every example is assigned to the centroid closest to it:

$$c^{(i)} := \arg\min_j \lVert x^{(i)} - \mu_j \rVert^2$$

where $c^{(i)}$ is the index of the centroid closest to example $x^{(i)}$ and $\mu_j$ is the position of the j-th centroid.

Computing centroid means

For every centroid k we set

$$\mu_k := \frac{1}{|C_k|} \sum_{i \in C_k} x^{(i)}$$

where $C_k$ is the set of examples currently assigned to centroid k. In other words, each new centroid is reset to the mean of all the points assigned to it.
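As an aside, this update rule can be written very compactly with boolean masks. The snippet below is only a sketch of that vectorized form (the exercise's loop-based implementation follows in the next section); it assumes idx holds 0-based cluster indices and that every cluster is non-empty.

```python
import numpy as np

def compute_centroids_vectorized(X, idx, K):
    """mu_k = mean of all examples currently assigned to centroid k."""
    idx = np.asarray(idx).ravel()
    return np.array([X[idx == k].mean(axis=0) for k in range(K)])
```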

K-means on example dataset

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.io import loadmat


def find_closest_centroids(X, centroids):
    """Assign each example in X to the index of its closest centroid (0-based)."""
    K = len(centroids)
    idx = np.zeros((len(X), 1), dtype=int)
    t = [0 for _ in range(K)]
    for i in range(len(X)):
        for j in range(K):
            diff = centroids[j, :] - X[i, :]
            t[j] = diff.dot(diff.T)      # squared distance to centroid j
        idx[i] = t.index(min(t))         # keep 0-based indices so they match the code below
    return idx


mat = loadmat("ex7data2.mat")
X = mat['X']
init_centroids = np.array([[3, 3], [6, 2], [8, 5]])
idx = find_closest_centroids(X, init_centroids)
print(idx[0:3])
```

With these initial centroids the first three assignments print as 0, 2, 1 (the original MATLAB exercise, being 1-based, reports them as 1, 3, 2).

Compute the mean of the examples assigned to each centroid to obtain the new centroids:

```python
def compute_centroids(X, idx):
    """Recompute every centroid as the mean of the examples assigned to it."""
    K = len(np.unique(idx))
    m, n = X.shape
    temp = np.zeros((K, n))    # running sum of the examples in each cluster
    count = np.zeros((K, 1))   # number of examples in each cluster
    for i in range(m):
        for j in range(K):
            if idx[i] == j:
                temp[j, :] = temp[j, :] + X[i, :]
                count[j] = count[j] + 1
    centroids = temp / count
    return centroids
```

Plot the clustering process:

```python
def plotData(X, centroids, idx=None):
    """Visualize the data, coloring each cluster automatically.

    idx: the idx vector from the last iteration, holding the cluster index of each example
    centroids: a list containing the centroid positions of every iteration
    """
    colors = ['b', 'g', 'gold', 'darkorange', 'salmon', 'olivedrab',
              'maroon', 'navy', 'sienna', 'tomato', 'lightgray', 'gainsboro',
              'coral', 'aliceblue', 'dimgray', 'mintcream']
    assert len(centroids[0]) <= len(colors), 'not enough colors'

    subX = []  # samples grouped by cluster
    if idx is not None:
        for i in range(centroids[0].shape[0]):
            x_i = X[idx[:, 0] == i]
            subX.append(x_i)
    else:
        subX = [X]  # a single-element list, so the plotting loop below works either way

    # plot each cluster with its own color
    plt.figure(figsize=(8, 5))
    for i in range(len(subX)):
        xx = subX[i]
        plt.scatter(xx[:, 0], xx[:, 1], c=colors[i], label='Cluster %d' % i)
    plt.legend()
    plt.grid(True)
    plt.xlabel('x1', fontsize=14)
    plt.ylabel('x2', fontsize=14)
    plt.title('Plot of X Points', fontsize=16)

    # draw the trajectory of the centroids across iterations
    xx, yy = [], []
    for centroid in centroids:
        xx.append(centroid[:, 0])
        yy.append(centroid[:, 1])
    plt.plot(xx, yy, 'rx--', markersize=8)


plotData(X, [init_centroids])
plt.show()


def run_k_means(X, centroids, max_iters):
    centroids_all = [centroids]
    centroid_i = centroids
    for i in range(max_iters):
        idx = find_closest_centroids(X, centroid_i)
        centroid_i = compute_centroids(X, idx)
        centroids_all.append(centroid_i)
    return idx, centroids_all


idx, centroids_all = run_k_means(X, init_centroids, 20)
plotData(X, centroids_all, idx)
plt.show()
```

Random initialization

  1. Shuffle all the data.
  2. Take the first K examples as the centroids.
    In other words, randomly pick K distinct samples as the initial centroids.

```python
def initCentroids(X, K):
    """Randomly pick K distinct examples of X as the initial centroids."""
    m, n = X.shape
    idx = np.random.choice(m, K, replace=False)  # sample without replacement to avoid duplicate centroids
    centroids = X[idx]
    return centroids


K = 3                    # number of clusters for the example dataset
for i in range(K):       # try several different random initializations
    centroids = initCentroids(X, K)
    idx, centroids_all = run_k_means(X, centroids, 10)
    plotData(X, centroids_all, idx)
    plt.show()
```

Image compression with K-means

Read in the image, a 128*128 RGB-encoded picture:

```python
from skimage import io

A = io.imread('bird_small.png')
print(A.shape)
plt.imshow(A)
A = A / 255.
```
K-means on pixels
```python
X = A.reshape(-1, 3)   # every pixel becomes one 3-dimensional (R, G, B) example
K = 16
centroids = initCentroids(X, K)
idx, centroids_all = run_k_means(X, centroids, 10)

# replace every pixel by the centroid of the cluster it belongs to
img = np.zeros(X.shape)
centroids = centroids_all[-1]
for i in range(len(centroids)):
    img[idx.ravel() == i] = centroids[i]
img = img.reshape((128, 128, 3))

fig, axes = plt.subplots(1, 2, figsize=(12, 6))
axes[0].imshow(A)
axes[1].imshow(img)
plt.show()
```
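For reference, here is a rough back-of-the-envelope estimate of the compression achieved; this calculation is my own addition, not part of the original post. The original image stores 24 bits per pixel, whereas the compressed representation only stores the 16 centroid colors (24 bits each) plus a 4-bit cluster index per pixel.

```python
# rough size estimate for a 128x128 image compressed to K = 16 colors
original_bits = 128 * 128 * 24                # 24 bits (RGB) per pixel
compressed_bits = 16 * 24 + 128 * 128 * 4     # 16 stored colors + a 4-bit index per pixel
print(original_bits / compressed_bits)        # roughly a factor of 6
```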

Principal Component Analysis

Example Dataset

```python
mat = loadmat('ex7data1.mat')
X = mat['X']
print(X.shape)
plt.scatter(X[:, 0], X[:, 1], facecolors='none', edgecolors='b')
plt.show()
```

Implementing PCA

PCA consists of two steps:

  1. Compute the covariance matrix of the data, $\Sigma = \frac{1}{m} X^T X$.
  2. Use SVD to obtain the eigenvectors, which are the columns of $U$.
    Feature normalization is applied first so that every feature lies in the same range.
```python
def feature_normalize(X):
    means = X.mean(axis=0)
    stds = X.std(axis=0, ddof=1)
    X_norm = (X - means) / stds
    return X_norm, means, stds


def pca(X):
    sigma = (X.T @ X) / len(X)
    U, S, V = np.linalg.svd(sigma)
    return U, S, V


X_norm, means, stds = feature_normalize(X)
U, S, V = pca(X_norm)
print(U[:, 0])
```
```python
plt.figure(figsize=(7, 5))
plt.scatter(X[:, 0], X[:, 1], facecolors='none', edgecolors='b')
# the principal directions are the columns of U, i.e. U[:, 0] and U[:, 1]
plt.plot([means[0], means[0] + 1.5 * S[0] * U[0, 0]],
         [means[1], means[1] + 1.5 * S[0] * U[1, 0]],
         c='r', linewidth=3, label='First Principal Component')
plt.plot([means[0], means[0] + 1.5 * S[1] * U[0, 1]],
         [means[1], means[1] + 1.5 * S[1] * U[1, 1]],
         c='g', linewidth=3, label='Second Principal Component')
plt.grid()
plt.axis("equal")
plt.legend()
plt.show()
```
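A natural follow-up question is how many principal components to keep. The singular values S returned by pca above make this easy to check; the snippet below is a small addition of mine, not part of the original exercise.

```python
# fraction of the total variance captured by the first k principal components
variance_retained = np.cumsum(S) / np.sum(S)
print(variance_retained)   # variance_retained[0] is the share explained by the first component
```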

Dimensionality Reduction with PCA

Projecting the data onto the principal components
```python
def project_data(X, U, K):
    """Project X onto the first K principal components (the first K columns of U)."""
    Z = X @ U[:, :K]
    return Z


def recover_data(Z, U, K):
    """Map the projected data back to an approximation in the original feature space."""
    X_rec = Z @ U[:, :K].T
    return X_rec


Z = project_data(X_norm, U, 1)
print(Z[0])
```
Reconstructing an approximation of the data
```python
X_rec = recover_data(Z, U, 1)
print(X_rec[0])
```
Visualizing the projections
```python
plt.figure(figsize=(7, 5))
plt.axis("equal")
plt.scatter(X_norm[:, 0], X_norm[:, 1], s=30, facecolors='none', edgecolors='b',
            label='Original Data Points')
plt.scatter(X_rec[:, 0], X_rec[:, 1], s=30, facecolors='none', edgecolors='r',
            label='PCA Reduced Data Points')
plt.title("Example Dataset: Reduced Dimension Points Shown", fontsize=14)
plt.xlabel('x1 [Feature Normalized]', fontsize=14)
plt.ylabel('x2 [Feature Normalized]', fontsize=14)
plt.grid(True)
# connect each original point to its projection; the first list holds the
# x coordinates, the second the y coordinates
for x in range(X_norm.shape[0]):
    plt.plot([X_norm[x, 0], X_rec[x, 0]], [X_norm[x, 1], X_rec[x, 1]], 'k--')
plt.legend()
plt.show()
```

Face Image Dataset

```python
mat = loadmat('ex7faces.mat')
X = mat['X']
print(X.shape)


def display_data(X, row, col):
    """Show the first row*col examples of X as a grid of 32x32 grayscale images."""
    fig, axs = plt.subplots(row, col, figsize=(8, 8))
    for r in range(row):
        for c in range(col):
            axs[r][c].imshow(X[r * col + c].reshape(32, 32).T, cmap='Greys_r')
            axs[r][c].set_xticks([])
            axs[r][c].set_yticks([])
    plt.show()


display_data(X, 10, 10)
```
PCA on Faces
```python
X_norm, means, stds = feature_normalize(X)
U, S, V = pca(X_norm)
display_data(U[:, :36].T, 6, 6)   # visualize the first 36 principal components as 32x32 images
Dimensionality Reduction
z = project_data(X_norm, U, K=36)
X_rec = recover_data(z, U, K=36)
display_data(X_rec, 10, 10)
```
