Overview: In Prof. Lin's problem description, DECISION_STUMP refers to a "decision stump", i.e. a decision tree with only one level. The problem suggests choosing the threshold θ by sorting the (one-dimensional) attribute values in ascending order and taking the midpoint of each pair of adjacent values. Some implementations found online additionally prepend a value smaller than the minimum and append a value larger than the maximum; how large those two sentinel values should be is an open question. A further drawback is that interpretability suffers. As the "watermelon book" (Zhou Zhihua's *Machine Learning*) puts it: suppose we judge watermelons as good or bad by sweetness, and the observed sweetness values are 0.1, 0.2, 0.5, 0.6, 0.9 (ignoring the good/bad labels). If the algorithm ends up using 0.35 (say it picked the midpoint between 0.2 and 0.5) as the boundary between good and bad melons, that value never appeared in the training data, which naturally raises the question: why this value, and where did it come from? For that reason the watermelon book notes that the observed values themselves can be used directly as thresholds, which is more interpretable; taking midpoints is of course also correct. I prefer to use the observed values directly, so the algorithm below does not sort the attribute, does not add a sentinel below the minimum and one above the maximum (the watermelon book does not do this either when taking midpoints; it simply sorts and uses the midpoints of adjacent values), and does not take midpoints.
Example: given the three values 1, 3, 2:
(1) Sort ascending and take the midpoints of adjacent values: from 1 2 3 we get 2 candidate thresholds (1.5 and 2.5).
(2) Sort ascending, take the midpoints of adjacent values, and manually add one value below the minimum and one above the maximum: from 0 1 2 3 4 we get 4 candidate thresholds.
(3) Use the observed values directly: 3 candidate thresholds.
The three approaches differ slightly, but in practice they barely affect the result. The code below uses (3).
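For concreteness, the three candidate-threshold schemes can be sketched in a few lines of NumPy. Note that the ±1 offset used for the sentinel values in scheme (2) is an arbitrary choice of mine, which is exactly the ambiguity mentioned above:

```python
import numpy as np

values = np.array([1, 3, 2])
sorted_vals = np.sort(values)                              # [1, 2, 3]

# (1) midpoints of adjacent sorted values -> 2 candidates
midpoints = (sorted_vals[:-1] + sorted_vals[1:]) / 2       # [1.5, 2.5]

# (2) same, but with one sentinel below the min and one above the max
#     (the offset of 1 here is arbitrary)
padded = np.concatenate(([sorted_vals[0] - 1], sorted_vals,
                         [sorted_vals[-1] + 1]))           # [0, 1, 2, 3, 4]
padded_midpoints = (padded[:-1] + padded[1:]) / 2          # 4 candidates

# (3) use the observed values directly -> 3 candidates
direct = sorted_vals
```

Scheme (2) yields one extra threshold on each side, which lets the stump also express "always positive" / "always negative"; with scheme (3) the same effect comes from the sign `s` and the extreme values.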
util.py (shared helper for loading the data)
# -*- coding:utf-8 -*-
# Author: Evan Mi
import numpy as np


def load_data(file_name):
    x = []
    y = []
    with open(file_name, 'r+') as f:
        for line in f:
            line = line.rstrip("\n")
            temp = line.split(" ")
            temp.insert(0, '1')  # prepend the constant bias feature
            x_temp = [float(val) for val in temp[:-1]]
            y_tem = [int(val) for val in temp[-1:]][0]
            x.append(x_temp)
            y.append(y_tem)
    nx = np.array(x)
    ny = np.array(y)
    return nx, ny
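To make the expected file format explicit (space-separated features, label in the last column), here is a small self-contained demo; the loader is re-stated inline and the two-sample file is hypothetical, just to show the resulting shapes:

```python
import numpy as np

def load_data(file_name):
    x, y = [], []
    with open(file_name) as f:
        for line in f:
            temp = line.rstrip("\n").split(" ")
            temp.insert(0, '1')                    # constant bias feature
            x.append([float(v) for v in temp[:-1]])
            y.append(int(temp[-1]))                # last column is the label
    return np.array(x), np.array(y)

# hypothetical two-sample file in the expected format
with open("sample.txt", "w") as f:
    f.write("0.5 1.2 -1\n")
    f.write("0.3 0.8 1\n")

nx, ny = load_data("sample.txt")
print(nx.shape)   # (2, 3): bias column + 2 features
print(ny)         # [-1  1]
```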
decision_stump_one_dimension.py (the one-dimensional case)
# -*- coding:utf-8 -*-
# Author: Evan Mi
import numpy as np


def sign_zero_as_neg(x):
    """
    Modified np.sign: return -1 instead of 0 when the input is exactly 0,
    i.e. points that fall on the boundary are classified as negative.
    """
    result = np.sign(x)
    result[result == 0] = -1
    return result


def data_generator(size):
    """
    Generate `size` samples uniformly in [-1, 1), then add 20% noise:
    each label is flipped with probability 0.2.
    """
    x_arr = np.random.uniform(-1, 1, size)
    y_arr = sign_zero_as_neg(x_arr)
    y_arr = np.where(np.random.uniform(0, 1, size) < 0.2, -y_arr, y_arr)
    return x_arr, y_arr


def err_in_counter(x_arr, y_arr, s, theta):
    """
    Compute E_in. x_arr and y_arr are k x n tiles (one row per candidate
    threshold); theta is a k x n tile whose i-th row repeats theta_i.
    :param s: {-1, 1}
    :return: the smallest of [err_theta1, ..., err_thetak] and its index
    """
    result = s * sign_zero_as_neg(x_arr - theta)
    err_tile = np.where(result == y_arr, 0, 1).sum(1)
    return err_tile.min(), err_tile.argmin()


def err_out_calculator(s, theta):
    # closed-form E_out for this noisy target
    return 0.5 + 0.3 * s * (abs(theta) - 1)


def decision_stump_1d(x_arr, y_arr):
    theta = x_arr  # use the observed values directly as candidate thresholds
    theta_tile = np.tile(theta, (len(x_arr), 1)).T
    x_tile = np.tile(x_arr, (len(theta), 1))
    y_tile = np.tile(y_arr, (len(theta), 1))
    err_pos, index_pos = err_in_counter(x_tile, y_tile, 1, theta_tile)
    err_neg, index_neg = err_in_counter(x_tile, y_tile, -1, theta_tile)
    if err_pos < err_neg:
        return err_pos / len(y_arr), err_out_calculator(1, theta[index_pos])
    else:
        return err_neg / len(y_arr), err_out_calculator(-1, theta[index_neg])


if __name__ == '__main__':
    avg_err_in = 0
    avg_err_out = 0
    for i in range(5000):
        x, y = data_generator(20)
        e_in, e_out = decision_stump_1d(x, y)
        # running averages of E_in and E_out over the 5000 experiments
        avg_err_in = avg_err_in + (1.0 / (i + 1)) * (e_in - avg_err_in)
        avg_err_out = avg_err_out + (1.0 / (i + 1)) * (e_out - avg_err_out)
    print("e_in:", avg_err_in)
    print("e_out:", avg_err_out)
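The np.tile trick in decision_stump_1d evaluates every candidate threshold against every sample at once. Here is a quick cross-check of that vectorised error count against a plain loop, on a hypothetical four-point dataset:

```python
import numpy as np

def sign_zero_as_neg(x):
    # boundary points (exactly 0) count as negative
    result = np.sign(x)
    result[result == 0] = -1
    return result

x_arr = np.array([-0.5, 0.1, 0.4, 0.9])
y_arr = np.array([-1, 1, -1, 1])
s = 1

# vectorised: row i of the tiles corresponds to threshold theta = x_arr[i]
theta_tile = np.tile(x_arr, (len(x_arr), 1)).T
x_tile = np.tile(x_arr, (len(x_arr), 1))
y_tile = np.tile(y_arr, (len(x_arr), 1))
errs_vec = np.where(s * sign_zero_as_neg(x_tile - theta_tile) == y_tile,
                    0, 1).sum(1)

# plain double loop for comparison
errs_loop = []
for theta in x_arr:
    pred = s * sign_zero_as_neg(x_arr - theta)
    errs_loop.append(int((pred != y_arr).sum()))

print(errs_vec.tolist(), errs_loop)   # both [1, 2, 1, 2]
```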
decision_stump_multi_dimension.py (the multi-dimensional case)
# -*- coding:utf-8 -*-
# Author: Evan Mi
import numpy as np
from decison_stump import util


def sign_zero_as_neg(x):
    """
    Modified np.sign: return -1 instead of 0 when the input is exactly 0,
    i.e. points that fall on the boundary are classified as negative.
    """
    result = np.sign(x)
    result[result == 0] = -1
    return result


def err_in_counter(x_arr, y_arr, s, theta):
    """
    Compute E_in. x_arr and y_arr are k x n tiles (one row per candidate
    threshold); theta is a k x n tile whose i-th row repeats theta_i.
    :param s: {-1, 1}
    :return: the smallest of [err_theta1, ..., err_thetak] and its index
    """
    result = s * sign_zero_as_neg(x_arr - theta)
    err_tile = np.where(result == y_arr, 0, 1).sum(1)
    return err_tile.min(), err_tile.argmin()


def err_out_counter(x_arr, y_arr, s, theta, dimension):
    temp = s * sign_zero_as_neg(x_arr.T[dimension] - theta)
    e_out = np.where(temp == y_arr, 0, 1).sum() / np.size(x_arr, 0)
    return e_out


def decision_stump_1d(x_arr, y_arr):
    theta = x_arr
    theta_tile = np.tile(theta, (len(x_arr), 1)).T
    x_tile = np.tile(x_arr, (len(theta), 1))
    y_tile = np.tile(y_arr, (len(theta), 1))
    err_pos, index_pos = err_in_counter(x_tile, y_tile, 1, theta_tile)
    err_neg, index_neg = err_in_counter(x_tile, y_tile, -1, theta_tile)
    if err_pos < err_neg:
        return err_pos / len(y_arr), index_pos, 1
    else:
        return err_neg / len(y_arr), index_neg, -1


def decision_stump_multi_d(x, y):
    x = x.T
    dimension, e_in, theta, s = 0, float('inf'), 0, 0
    for i in range(np.size(x, 0)):
        e_in_temp, index, s_temp = decision_stump_1d(x[i], y)
        if e_in_temp < e_in:
            dimension, e_in, theta, s = i, e_in_temp, x[i][index], s_temp
        elif e_in_temp == e_in:
            # break ties at random; note this must be `elif`, otherwise a
            # dimension that just became the best would immediately be
            # re-randomised against itself
            pick_rate = np.random.uniform(0, 1)
            if pick_rate > 0.5:
                dimension, e_in, theta, s = i, e_in_temp, x[i][index], s_temp
    return dimension, e_in, theta, s


if __name__ == '__main__':
    x_train, y_train = util.load_data('data/train.txt')
    x_test, y_test = util.load_data('data/test.txt')
    determined_dimension, e_in_result, theta_result, s_result = \
        decision_stump_multi_d(x_train, y_train)
    print("E_IN:", e_in_result)
    print("E_OUT:", err_out_counter(x_test, y_test, s_result, theta_result,
                                    determined_dimension))
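The dimension-by-dimension search can be illustrated on a small in-memory dataset. The compact stump_1d below is my own plain-loop sketch of the same idea (not the repository's vectorised code), and the toy data is hypothetical, chosen so that dimension 1 separates the labels perfectly while dimension 0 does not:

```python
import numpy as np

def sign_zero_as_neg(x):
    # boundary points (exactly 0) count as negative
    result = np.sign(x)
    result[result == 0] = -1
    return result

def stump_1d(x_arr, y_arr):
    # try every observed value as threshold, for both orientations s = +/-1
    best = (len(y_arr) + 1, None, None)
    for s in (1, -1):
        for theta in x_arr:
            err = int((s * sign_zero_as_neg(x_arr - theta) != y_arr).sum())
            if err < best[0]:
                best = (err, theta, s)
    return best  # (error count, theta, s)

# hypothetical toy set: dimension 1 is perfectly separable, dimension 0 is not
x = np.array([[0.9, -0.7],
              [0.1, -0.2],
              [0.3,  0.4],
              [0.2,  0.8]])
y = np.array([-1, -1, 1, 1])

results = [stump_1d(x[:, d], y) for d in range(x.shape[1])]
best_dim = int(np.argmin([r[0] for r in results]))
print(best_dim, results[best_dim])   # dimension 1, zero training errors
```

Because boundary points count as negative, the perfect threshold on dimension 1 is the observed value -0.2 itself: that sample is classified -1, matching its label.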
The full project code and the data it uses are available at: DECISION_STUMP