Implementing DECISION_STUMP for Homework 2 of Machine Learning Foundations

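Overview: Professor Lin's problem statement uses DECISION_STUMP (a "decision stump", i.e. a decision tree with only one level). The problem says that $\theta$ is chosen by sorting the (one-dimensional) attribute values in ascending order and taking the average of each pair of adjacent values. Some implementations found online also manually add one value smaller than the minimum at the start and one value larger than the maximum at the end. How large those two added values should be is an open question.

Averaging also makes the result harder to interpret. As the "Watermelon Book" (《西瓜书》) puts it, suppose we judge watermelons by sweetness and collect the sweetness values 0.1, 0.2, 0.5, 0.6, 0.9 (ignoring the good/bad labels), but end up using 0.35 (assuming the algorithm splits between 0.2 and 0.5) as the criterion that separates good melons from bad ones. That value never appears in the training data, so the natural reaction is: why this value, and where did it come from? The Watermelon Book therefore notes that the observed values themselves can be used as $\theta$, which is easier to interpret.

Taking averages is of course also correct, but I prefer to use the observed values directly as $\theta$. The code below therefore skips three steps: sorting the attribute values, adding one value above the maximum and one below the minimum, and averaging adjacent values. (When the Watermelon Book does use averages, it does not add the two extra values either; it simply sorts and takes the average of each adjacent pair as $\theta$.)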
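To make the role of $\theta$ concrete, here is a minimal sketch of the stump hypothesis $h_{s,\theta}(x) = s \cdot \mathrm{sign}(x - \theta)$, applied to the sweetness values from the watermelon example. The function name `stump_predict` is a hypothetical helper introduced for illustration; treating a point exactly on the boundary as $-1$ follows the convention the code later in this post uses.

```python
import numpy as np


def sign_zero_as_neg(x):
    """Like np.sign, but maps 0 to -1 (boundary points count as negative)."""
    result = np.sign(x)
    result[result == 0] = -1
    return result


def stump_predict(x, s, theta):
    """Decision stump h(x) = s * sign(x - theta) for a 1-D array x."""
    return s * sign_zero_as_neg(np.asarray(x, dtype=float) - theta)


# Sweetness values from the watermelon example, split at theta = 0.35.
sweetness = np.array([0.1, 0.2, 0.5, 0.6, 0.9])
predictions = stump_predict(sweetness, s=1, theta=0.35)
print(predictions)  # [-1. -1.  1.  1.  1.]
```

With `s=1`, values above $\theta$ are predicted $+1$ and the rest $-1$; `s=-1` flips every prediction.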

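Example: take the three values 1, 3, 2:

(1) Sort in ascending order and average each adjacent pair: 1 $\theta$ 2 $\theta$ 3, which gives 2 values of $\theta$.

(2) Sort in ascending order, manually add one value below the minimum and one above the maximum, and average each adjacent pair: 0 $\theta$ 1 $\theta$ 2 $\theta$ 3 $\theta$ 4, which gives 4 values of $\theta$.

(3) Use the values directly, which gives 3 values of $\theta$.

The three approaches differ, but the choice has little effect on the result. The code below uses (3).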
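The three ways of generating candidate thresholds can be sketched as follows. This is only an illustration: the function names are hypothetical, and the `margin` parameter in option (2) is an assumption, since the size of the added values is exactly the open question mentioned earlier.

```python
import numpy as np


def thetas_midpoints(values):
    """Option (1): midpoints of adjacent sorted values."""
    v = np.sort(np.asarray(values, dtype=float))
    return (v[:-1] + v[1:]) / 2


def thetas_midpoints_padded(values, margin=1.0):
    """Option (2): as (1), after padding with min - margin and max + margin.

    `margin` is a hypothetical parameter; its proper size is left open.
    """
    v = np.sort(np.asarray(values, dtype=float))
    padded = np.concatenate(([v[0] - margin], v, [v[-1] + margin]))
    return (padded[:-1] + padded[1:]) / 2


def thetas_observed(values):
    """Option (3): use the observed values themselves, unsorted."""
    return np.asarray(values, dtype=float)


values = [1, 3, 2]
print(thetas_midpoints(values))         # [1.5 2.5]
print(thetas_midpoints_padded(values))  # [0.5 1.5 2.5 3.5]
print(thetas_observed(values))          # [1. 3. 2.]
```

For the values 1, 3, 2 this yields 2, 4, and 3 candidates respectively, matching the counts in the example.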

util.py (shared helper for loading data)

    # -*- coding:utf-8 -*-
    # Author: Evan Mi
    import numpy as np


    def load_data(file_name):
        """Load a space-separated data file whose last column is the label."""
        x = []
        y = []
        with open(file_name, 'r') as f:
            for line in f:
                line = line.rstrip("\n")
                temp = line.split(" ")
                # Prepend a constant feature x0 = 1, so the real features
                # start at column 1.
                temp.insert(0, '1')
                x_temp = [float(val) for val in temp[:-1]]
                y_tem = [int(val) for val in temp[-1:]][0]
                x.append(x_temp)
                y.append(y_tem)
        nx = np.array(x)
        ny = np.array(y)
        return nx, ny

decision_stump_one_dimension.py (the one-dimensional problem)

    # -*- coding:utf-8 -*-
    # Author: Evan Mi
    import numpy as np


    def sign_zero_as_neg(x):
        """
        Variant of np.sign that returns -1 instead of 0 when the input is 0,
        i.e. points on the boundary are treated as negative examples.
        """
        result = np.sign(x)
        result[result == 0] = -1
        return result


    def data_generator(size):
        """
        Draw x uniformly from [-1, 1), label it with sign(x), then add 20%
        noise: each label is flipped with probability 0.2.
        """
        x_arr = np.random.uniform(-1, 1, size)
        y_arr = sign_zero_as_neg(x_arr)
        y_arr = np.where(np.random.uniform(0, 1, size) < 0.2, -y_arr, y_arr)
        return x_arr, y_arr


    def err_in_counter(x_arr, y_arr, s, theta):
        """
        Count the in-sample errors for every candidate theta.
        :param x_arr: k rows, each a copy of [x1, x2, ..., xn]
        :param y_arr: k rows, each a copy of [y1, y2, ..., yn]
        :param s: direction of the stump, in {-1, 1}
        :param theta: k rows; row j is [theta_j, theta_j, ..., theta_j]
        :return: the smallest of [err_theta1, ..., err_thetak] and its index
        """
        result = s * sign_zero_as_neg(x_arr - theta)
        err_tile = np.where(result == y_arr, 0, 1).sum(1)
        return err_tile.min(), err_tile.argmin()


    def err_out_calculator(s, theta):
        """Closed-form E_out for this data distribution."""
        return 0.5 + 0.3 * s * (abs(theta) - 1)


    def decision_stump_1d(x_arr, y_arr):
        # Option (3): the observed values themselves are the candidate thetas.
        theta = x_arr
        theta_tile = np.tile(theta, (len(x_arr), 1)).T
        x_tile = np.tile(x_arr, (len(theta), 1))
        y_tile = np.tile(y_arr, (len(theta), 1))
        err_pos, index_pos = err_in_counter(x_tile, y_tile, 1, theta_tile)
        err_neg, index_neg = err_in_counter(x_tile, y_tile, -1, theta_tile)
        if err_pos < err_neg:
            return err_pos / len(y_arr), err_out_calculator(1, theta[index_pos])
        else:
            return err_neg / len(y_arr), err_out_calculator(-1, theta[index_neg])


    if __name__ == '__main__':
        avg_err_in = 0
        avg_err_out = 0
        for i in range(5000):
            x, y = data_generator(20)
            e_in, e_out = decision_stump_1d(x, y)
            # Incremental update of the running mean over i + 1 experiments.
            avg_err_in = avg_err_in + (1.0 / (i + 1)) * (e_in - avg_err_in)
            avg_err_out = avg_err_out + (1.0 / (i + 1)) * (e_out - avg_err_out)
        print("e_in:", avg_err_in)
        print("e_out:", avg_err_out)
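The closed form in `err_out_calculator` can be derived in two steps. If a hypothesis disagrees with the noiseless target $\mathrm{sign}(x)$ with probability $\mu$, then with 20% label flips $E_{out} = 0.8\mu + 0.2(1 - \mu) = 0.2 + 0.6\mu$. For $x$ uniform on $[-1, 1]$, a stump with $s = +1$ has $\mu = |\theta|/2$ and one with $s = -1$ has $\mu = 1 - |\theta|/2$, which gives $E_{out} = 0.5 + 0.3\,s\,(|\theta| - 1)$. As a sanity check, the sketch below compares the formula with a Monte Carlo estimate. The function `err_out_monte_carlo`, the sample size, and the seed are assumptions made for this illustration.

```python
import numpy as np


def err_out_calculator(s, theta):
    """Closed-form E_out used in the script above."""
    return 0.5 + 0.3 * s * (abs(theta) - 1)


def err_out_monte_carlo(s, theta, n_samples=200_000, seed=0):
    """Estimate E_out on freshly sampled noisy data (the seed is arbitrary)."""
    rng = np.random.default_rng(seed)
    x = rng.uniform(-1, 1, n_samples)
    # Noiseless target sign(x), with 0 mapped to -1.
    y = np.where(x > 0, 1, -1)
    # Flip each label with probability 0.2.
    y = np.where(rng.uniform(0, 1, n_samples) < 0.2, -y, y)
    # Stump prediction s * sign(x - theta), boundary mapped to -1 before s.
    h = s * np.where(x - theta > 0, 1, -1)
    return np.mean(h != y)


for s, theta in [(1, 0.0), (1, 0.4), (-1, -0.7)]:
    print(s, theta, err_out_calculator(s, theta), err_out_monte_carlo(s, theta))
```

The two columns should agree to about two decimal places; the remaining gap is sampling noise.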

decision_stump_multi_dimension.py (the multi-dimensional problem)

    # -*- coding:utf-8 -*-
    # Author: Evan Mi
    import numpy as np
    from decison_stump import util


    def sign_zero_as_neg(x):
        """
        Variant of np.sign that returns -1 instead of 0 when the input is 0,
        i.e. points on the boundary are treated as negative examples.
        """
        result = np.sign(x)
        result[result == 0] = -1
        return result


    def err_in_counter(x_arr, y_arr, s, theta):
        """
        Count the in-sample errors for every candidate theta.
        :param x_arr: k rows, each a copy of [x1, x2, ..., xn]
        :param y_arr: k rows, each a copy of [y1, y2, ..., yn]
        :param s: direction of the stump, in {-1, 1}
        :param theta: k rows; row j is [theta_j, theta_j, ..., theta_j]
        :return: the smallest of [err_theta1, ..., err_thetak] and its index
        """
        result = s * sign_zero_as_neg(x_arr - theta)
        err_tile = np.where(result == y_arr, 0, 1).sum(1)
        return err_tile.min(), err_tile.argmin()


    def err_out_counter(x_arr, y_arr, s, theta, dimension):
        """Error rate of the chosen stump on the test set."""
        temp = s * sign_zero_as_neg(x_arr.T[dimension] - theta)
        e_out = np.where(temp == y_arr, 0, 1).sum() / np.size(x_arr, 0)
        return e_out


    def decision_stump_1d(x_arr, y_arr):
        """Return (E_in, index of the best theta in x_arr, s) for one dimension."""
        theta = x_arr
        theta_tile = np.tile(theta, (len(x_arr), 1)).T
        x_tile = np.tile(x_arr, (len(theta), 1))
        y_tile = np.tile(y_arr, (len(theta), 1))
        err_pos, index_pos = err_in_counter(x_tile, y_tile, 1, theta_tile)
        err_neg, index_neg = err_in_counter(x_tile, y_tile, -1, theta_tile)
        if err_pos < err_neg:
            return err_pos / len(y_arr), index_pos, 1
        else:
            return err_neg / len(y_arr), index_neg, -1


    def decision_stump_multi_d(x, y):
        x = x.T
        dimension, e_in, theta, s = 0, float('inf'), 0, 0
        for i in range(np.size(x, 0)):
            e_in_temp, index, s_temp = decision_stump_1d(x[i], y)
            if e_in_temp < e_in:
                dimension, e_in, theta, s = i, e_in_temp, x[i][index], s_temp
            # When the error rates are equal, pick one at random.
            elif e_in_temp == e_in:
                pick_rate = np.random.uniform(0, 1)
                if pick_rate > 0.5:
                    dimension, e_in, theta, s = i, e_in_temp, x[i][index], s_temp
        return dimension, e_in, theta, s


    if __name__ == '__main__':
        x_train, y_train = util.load_data('data/train.txt')
        x_test, y_test = util.load_data('data/test.txt')
        determined_dimension, e_in_result, theta_result, s_result = \
            decision_stump_multi_d(x_train, y_train)
        print("E_IN:", e_in_result)
        print("E_OUT:", err_out_counter(x_test, y_test, s_result, theta_result,
                                        determined_dimension))
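As a possible alternative to the `np.tile` construction, the same search can be written with NumPy broadcasting. The sketch below is not the code from this project: the function name `best_stump` is hypothetical, it breaks ties by keeping the first candidate found rather than choosing at random, and the toy data at the bottom is made up purely for illustration.

```python
import numpy as np


def best_stump(X, y):
    """Return (dimension, s, theta, e_in) of the stump with the lowest E_in.

    X has shape (n_samples, n_features); y holds labels in {-1, +1}.
    Candidate thetas are the observed values of each feature (option 3).
    Ties go to the first candidate found, not to a random one.
    """
    best = (0, 1, 0.0, float("inf"))
    for d in range(X.shape[1]):
        col = X[:, d]
        # pred[k, i]: prediction for sample i with theta = col[k] and s = +1;
        # a point exactly on the boundary is predicted -1.
        pred = np.where(col[None, :] - col[:, None] > 0, 1, -1)
        errors_pos = (pred != y[None, :]).mean(axis=1)
        # s = -1 flips every prediction, so its error rate is 1 - errors_pos.
        for s, errors in ((1, errors_pos), (-1, 1.0 - errors_pos)):
            k = int(errors.argmin())
            if errors[k] < best[3]:
                best = (d, s, float(col[k]), float(errors[k]))
    return best


# Hypothetical toy data: the label depends only on feature 1.
rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(50, 3))
y = np.where(X[:, 1] > 0.2, 1, -1)
dimension, s, theta, e_in = best_stump(X, y)
print(dimension, s, theta, e_in)
```

On this toy data the search should select feature 1 with `s = 1`, because the largest observed value not exceeding 0.2 separates the labels perfectly.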

For the full project code and the data it uses, see: DECISION_STUMP
