Understanding the KNN Algorithm: Concepts and Prediction in Python

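KNN consists of a training phase and a classification phase. Training only stores the training set. To classify, each test image is compared with every image in the training set, and the label of the closest image is used (or, for k > 1, the labels of the k closest images).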
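The comparison step can be sketched on toy data. This is a minimal illustration, not the full classifier: the four "images", their labels, and the helper name nearest_neighbor_label are all made up for the example, and the L1 distance is assumed as the measure of difference.

```python
import numpy as np

# Hypothetical toy data: four "images", each flattened to 3 pixel values.
train_X = np.array([[0, 0, 0],
                    [10, 10, 10],
                    [200, 200, 200],
                    [255, 250, 245]], dtype=float)
train_y = np.array([0, 0, 1, 1])  # assumed labels: 0 = dark, 1 = bright


def nearest_neighbor_label(x, train_X, train_y):
    """Return the label of the training row with the smallest L1 distance to x."""
    distances = np.sum(np.abs(train_X - x), axis=1)  # one distance per training row
    return train_y[np.argmin(distances)]


query = np.array([190, 210, 205], dtype=float)
predicted = nearest_neighbor_label(query, train_X, train_y)
print(predicted)  # 1: the closest row is [200, 200, 200], at distance 25
```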

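When plenty of data is available, split the training data into two parts: a small validation set (a stand-in test set) and the remaining training set, which usually keeps 70%-90% of the data. The exact ratio depends on how many hyperparameters need tuning; the more there are, the larger the validation share should be. The purpose of the validation set is to tune hyperparameters. When data is scarce, use cross-validation to tune them instead. Cross-validation is more expensive, though: in K-fold cross-validation (K here is the number of folds, not the number of neighbors), a larger K generally gives a more reliable estimate, but the cost rises with it.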
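A hold-out split of this kind might look as follows. The function name holdout_split, the 20% validation fraction, and the synthetic data are assumptions chosen for illustration.

```python
import numpy as np


def holdout_split(X, y, val_fraction=0.2, seed=0):
    """Shuffle once, then set aside val_fraction of the samples as validation data."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(X))
    n_val = int(len(X) * val_fraction)
    val_idx, train_idx = order[:n_val], order[n_val:]
    return X[train_idx], y[train_idx], X[val_idx], y[val_idx]


# Hypothetical data: 100 samples with 4 features each; row i starts with the value 4 * i.
X = np.arange(400, dtype=float).reshape(100, 4)
y = np.arange(100) % 10

X_tr, y_tr, X_val, y_val = holdout_split(X, y, val_fraction=0.2)
print(X_tr.shape, X_val.shape)  # (80, 4) (20, 4)
```

Shuffling before the cut matters when the data is ordered by class or by collection time; without it the validation set may not resemble the training set.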

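Classification decision

Count the class labels among the K nearest neighbors and assign the test sample to the class that occurs most often. In other words, the class of an input instance is decided by the majority class among its K nearest training instances.

Common decision rules:

Majority voting: this works like everyday voting, where the minority yields to the majority. It is the most commonly used rule.

Weighted voting: in some cases votes carry different weights, much as a judge's vote counts for more than an ordinary spectator's. When the data points carry weights, weighted voting is generally used.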
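The two rules can disagree, as the sketch below shows. The text does not say where the weights come from; inverse distance is assumed here because it is a common choice. The function names and the example neighbors are hypothetical.

```python
import numpy as np


def majority_vote(neighbor_labels):
    """Pick the label that occurs most often among the neighbors."""
    classes, counts = np.unique(np.asarray(neighbor_labels), return_counts=True)
    return classes[np.argmax(counts)]


def weighted_vote(neighbor_labels, neighbor_distances, eps=1e-8):
    """Pick the label whose neighbors have the largest total inverse-distance weight."""
    labels = np.asarray(neighbor_labels)
    weights = 1.0 / (np.asarray(neighbor_distances, dtype=float) + eps)
    classes = np.unique(labels)
    totals = np.array([weights[labels == c].sum() for c in classes])
    return classes[np.argmax(totals)]


# Hypothetical neighborhood: one very close neighbor of class 0, two distant ones of class 1.
labels = [0, 1, 1]
distances = [1.0, 4.0, 5.0]

print(majority_vote(labels))             # 1: two votes against one
print(weighted_vote(labels, distances))  # 0: weight 1.0 beats 0.25 + 0.2
```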

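Advantages:

The neighbors that are consulted are all objects that have already been correctly classified (the labeled training samples).

The algorithm itself is simple. The classifier does not need to be fitted to the training set, so training takes essentially no computation beyond storing the data. The cost of classifying one sample is proportional to the number of training samples.

For samples in regions where classes intersect or overlap heavily, KNN is often better suited than other methods.

Disadvantages:

When the classes are imbalanced, correct classification is difficult, because the majority class tends to dominate the K neighbors.

The computational cost is high, because classifying each test sample requires its distance to every training sample.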
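To make that cost concrete, the sketch below computes the full test-by-train L1 distance matrix with NumPy broadcasting. It processes the test samples in batches so the intermediate array stays small. The function name l1_distance_matrix, the batch size, and the random data are assumptions for illustration. The work still grows with (number of test samples) x (number of training samples) x (number of features).

```python
import numpy as np


def l1_distance_matrix(X_test, X_train, batch_size=100):
    """Return D with D[i, j] = sum(|X_test[i] - X_train[j]|), computed in test batches."""
    D = np.empty((X_test.shape[0], X_train.shape[0]))
    for start in range(0, X_test.shape[0], batch_size):
        block = X_test[start:start + batch_size]
        # Broadcasting: (b, 1, d) - (1, n, d) -> (b, n, d), then sum over the feature axis.
        D[start:start + batch_size] = np.abs(block[:, None, :] - X_train[None, :, :]).sum(axis=2)
    return D


# Hypothetical data: 7 test samples and 11 training samples with 5 features each.
rng = np.random.default_rng(0)
X_test_demo = rng.random((7, 5))
X_train_demo = rng.random((11, 5))

D = l1_distance_matrix(X_test_demo, X_train_demo, batch_size=3)
print(D.shape)  # (7, 11)
```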

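Python implementation:

import numpy as np

class kNearestNeighbor:
    def __init__(self):
        pass

    def train(self, X, y):
        # KNN "training" only memorizes the data.
        self.Xtr = X
        self.ytr = y

    def predict(self, X, k=1):
        num_test = X.shape[0]
        Ypred = np.zeros(num_test, dtype=self.ytr.dtype)
        for i in range(num_test):
            # L1 (Manhattan) distance from test sample i to every training sample.
            distances = np.sum(np.abs(self.Xtr - X[i, :]), axis=1)
            # Index the labels stored by train(), not a global array, so predictions
            # stay correct when the classifier is trained on a subset such as a
            # cross-validation fold.
            closest_y = self.ytr[np.argsort(distances)[:k]]
            # Majority vote among the k nearest labels.
            u, indices = np.unique(closest_y, return_inverse=True)
            Ypred[i] = u[np.argmax(np.bincount(indices))]
        return Ypred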
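The voting step at the end of predict is compact, so it may help to trace it on a small example. The neighbor labels below are hypothetical.

```python
import numpy as np

closest_y = np.array([7, 3, 7, 9, 3, 7])  # hypothetical labels of the k = 6 nearest neighbors

# u holds the distinct labels in sorted order; indices maps each neighbor to its slot in u.
u, indices = np.unique(closest_y, return_inverse=True)
counts = np.bincount(indices)  # number of votes per distinct label
winner = u[np.argmax(counts)]  # argmax returns the first maximum, so ties go to the smallest label

print(u)       # [3 7 9]
print(counts)  # [2 3 1]
print(winner)  # 7
```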


load_CIFAR_batch() and load_CIFAR10() load the CIFAR-10 dataset.

import pickle

def load_CIFAR_batch(filename):
    """Load a single batch of CIFAR-10."""
    with open(filename, 'rb') as f:
        datadict = pickle.load(f, encoding='latin1')
        X = datadict['data']
        Y = datadict['labels']
        # (10000, 3072) -> (10000, 32, 32, 3), with the color channel last.
        X = X.reshape(10000, 3, 32, 32).transpose(0, 2, 3, 1).astype("float")
        Y = np.array(Y)
        return X, Y
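The reshape and transpose can be checked without downloading anything. The CIFAR-10 Python batches store each image as one row of 3072 values: the 1024 red values first, then green, then blue. The sketch below builds a synthetic row in that layout (the constant values 1, 2 and 3 are made up so each channel is easy to recognize) and applies the same calls.

```python
import numpy as np

# Synthetic stand-in for one CIFAR-10 row: 1024 red, then 1024 green, then 1024 blue values.
row = np.concatenate([np.full(1024, 1), np.full(1024, 2), np.full(1024, 3)])
batch = row.reshape(1, 3072)

# Same calls as in load_CIFAR_batch, with a batch of 1 instead of 10000.
images = batch.reshape(1, 3, 32, 32).transpose(0, 2, 3, 1).astype("float")

print(images.shape)     # (1, 32, 32, 3)
print(images[0, 0, 0])  # [1. 2. 3.]
```

After the transpose, every pixel holds its own (red, green, blue) triple, which is the layout most image tools expect.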


import os

def load_CIFAR10(ROOT):
    """Load all of CIFAR-10."""
    xs = []
    ys = []
    for b in range(1, 6):
        f = os.path.join(ROOT, 'data_batch_%d' % (b))
        X, Y = load_CIFAR_batch(f)
        xs.append(X)
        ys.append(Y)
    Xtr = np.concatenate(xs)  # stack the five batches along the first axis
    Ytr = np.concatenate(ys)
    del X, Y
    Xte, Yte = load_CIFAR_batch(os.path.join(ROOT, 'test_batch'))
    return Xtr, Ytr, Xte, Yte


Xtr, Ytr, Xte, Yte = load_CIFAR10('cifar10')
# Flatten each 32x32x3 image into one row of 3072 values.
Xtr_rows = Xtr.reshape(Xtr.shape[0], 32 * 32 * 3)
Xte_rows = Xte.reshape(Xte.shape[0], 32 * 32 * 3)


# The full dataset is fairly large and runs slowly on a personal computer,
# so use only 5000 training samples and 500 test samples.
num_training = 5000
num_test = 500
x_train = Xtr_rows[:num_training, :]
y_train = Ytr[:num_training]
x_test = Xte_rows[:num_test, :]
y_test = Yte[:num_test]


knn = kNearestNeighbor()
knn.train(x_train, y_train)
y_predict = knn.predict(x_test, k=7)
acc = np.mean(y_predict == y_test)
print('accuracy : %f' % (acc))


accuracy : 0.302000


# Which value of k gives the best result? Cross-validation can answer that; 5-fold cross-validation is used here.
num_folds = 5
k_choices = [1, 3, 5, 8, 10, 12, 15, 20, 50, 100]
x_train_folds = np.array_split(x_train, num_folds)
y_train_folds = np.array_split(y_train, num_folds)
k_to_accuracies = {}
for k_val in k_choices:
    print('k = ' + str(k_val))
    k_to_accuracies[k_val] = []
    for i in range(num_folds):
        # Fold i is the validation set; the remaining folds form the training set.
        x_train_cycle = np.concatenate([f for j, f in enumerate(x_train_folds) if j != i])
        y_train_cycle = np.concatenate([f for j, f in enumerate(y_train_folds) if j != i])
        x_val_cycle = x_train_folds[i]
        y_val_cycle = y_train_folds[i]
        knn = kNearestNeighbor()
        knn.train(x_train_cycle, y_train_cycle)
        y_val_pred = knn.predict(x_val_cycle, k_val)
        num_correct = np.sum(y_val_cycle == y_val_pred)
        k_to_accuracies[k_val].append(float(num_correct) / float(len(y_val_cycle)))


k = 1

k = 3

k = 5

k = 8

k = 10

k = 12

k = 15

k = 20

k = 50

k = 100


for k in sorted(k_to_accuracies):
    for accuracy in k_to_accuracies[k]:
        print('k = %d, accuracy = %f' % (int(k), accuracy))


k = 1, accuracy = 0.098000

k = 1, accuracy = 0.148000

k = 1, accuracy = 0.205000

k = 1, accuracy = 0.233000

k = 1, accuracy = 0.308000

k = 3, accuracy = 0.089000

k = 3, accuracy = 0.142000

k = 3, accuracy = 0.215000

k = 3, accuracy = 0.251000

k = 3, accuracy = 0.296000

k = 5, accuracy = 0.096000

k = 5, accuracy = 0.176000

k = 5, accuracy = 0.240000

k = 5, accuracy = 0.284000

k = 5, accuracy = 0.309000

k = 8, accuracy = 0.100000

k = 8, accuracy = 0.175000

k = 8, accuracy = 0.263000

k = 8, accuracy = 0.289000

k = 8, accuracy = 0.310000

k = 10, accuracy = 0.099000

k = 10, accuracy = 0.174000

k = 10, accuracy = 0.264000

k = 10, accuracy = 0.318000

k = 10, accuracy = 0.313000

k = 12, accuracy = 0.100000

k = 12, accuracy = 0.192000

k = 12, accuracy = 0.261000

k = 12, accuracy = 0.316000

k = 12, accuracy = 0.318000

k = 15, accuracy = 0.087000

k = 15, accuracy = 0.197000

k = 15, accuracy = 0.255000

k = 15, accuracy = 0.322000

k = 15, accuracy = 0.321000

k = 20, accuracy = 0.089000

k = 20, accuracy = 0.225000

k = 20, accuracy = 0.270000

k = 20, accuracy = 0.319000

k = 20, accuracy = 0.306000

k = 50, accuracy = 0.079000

k = 50, accuracy = 0.248000

k = 50, accuracy = 0.278000

k = 50, accuracy = 0.287000

k = 50, accuracy = 0.293000

k = 100, accuracy = 0.075000

k = 100, accuracy = 0.246000

k = 100, accuracy = 0.275000

k = 100, accuracy = 0.284000

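k = 100, accuracy = 0.277000

For every k, the accuracy climbs from roughly 10% on the first fold (chance level for the 10 CIFAR-10 classes) to roughly 30% on the last. A trend that depends on fold position rather than on k usually signals a label lookup problem, for example a predict method that reads labels from the full y_train array instead of from the labels of the folds it was trained on. These figures are therefore worth rerunning before they are used to choose k.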
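Once per-fold accuracies are available, a common way to choose k is to average them for each k and take the best mean. The sketch below uses made-up accuracies in the same {k: [accuracy, ...]} layout as k_to_accuracies. The numbers and the helper name best_k are assumptions, not results from this experiment.

```python
import numpy as np

# Hypothetical per-fold accuracies, in the same {k: [acc, ...]} layout as k_to_accuracies.
example_accuracies = {
    1: [0.26, 0.28, 0.27],
    5: [0.29, 0.31, 0.30],
    20: [0.27, 0.29, 0.28],
}


def best_k(k_to_acc):
    """Return the k with the highest mean accuracy, together with every k's mean."""
    means = {k: float(np.mean(v)) for k, v in k_to_acc.items()}
    return max(means, key=means.get), means


chosen_k, means = best_k(example_accuracies)
print(chosen_k)  # 5
```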


Visualizing the cross-validation results

import matplotlib.pyplot as plt

plt.rcParams['figure.figsize'] = (10.0, 8.0)
plt.rcParams['image.interpolation'] = 'nearest'
plt.rcParams['image.cmap'] = 'gray'


for k in k_choices:
    accuracies = k_to_accuracies[k]
    plt.scatter([k] * len(accuracies), accuracies)

accuracies_mean = np.array([np.mean(v) for k, v in sorted(k_to_accuracies.items())])
accuracies_std = np.array([np.std(v) for k, v in sorted(k_to_accuracies.items())])
plt.errorbar(k_choices, accuracies_mean, yerr=accuracies_std)
plt.title('Cross-validation on k')
plt.xlabel('k')
plt.ylabel('Cross-validation accuracy')
plt.show()


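[Figure: "Cross-validation on k", showing cross-validation accuracy against k, with one scatter point per fold and error bars for the mean and standard deviation]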

