[Machine Learning] A Summary of Entropy, Decision Trees, and Random Forests

I. Entropy

Formula:

$$H(D) = -\sum_{i=1}^{n} p(x_i)\,\log_2 p(x_i) = \sum_{i=1}^{n} p(x_i)\,\log_2 \frac{1}{p(x_i)}$$
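Read directly off the formula, entropy is a one-liner. A minimal sketch (the helper name `entropy` is ours, not part of the original notebook; it is reused further down):

```python
import numpy as np

def entropy(probs):
    """Shannon entropy, in bits, of a discrete distribution."""
    probs = np.asarray(probs, dtype=float)
    probs = probs[probs > 0]          # by convention 0 * log2(0) = 0
    return -np.sum(probs * np.log2(probs))

entropy([0.3, 0.7])   # ~0.8813, matching the hand computation below
```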


```python
import numpy as np

# Is the account genuine? 3 no (0.3), 7 yes (0.7)
# Information entropy before any split
info_D = 0.3*np.log2(1/0.3) + 0.7*np.log2(1/0.7)
info_D
```

0.8812908992306926

```python
# Decision tree: split the samples to separate the target values
# Three attributes: log density, friend density, genuine avatar
# Build the tree using log density
# 3 s (0.3) -------> 2 no, 1 yes
# 4 m (0.4) -------> 1 no, 3 yes
# 3 l (0.3) -------> 3 yes
info_L_D = 0.3*(2/3*np.log2(3/2) + 1/3*np.log2(3)) \
         + 0.4*(0.25*np.log2(4) + 0.75*np.log2(4/3)) \
         + 0.3*(1*np.log2(1))
info_L_D
```

0.5999999999999999

```python
# Information gain of splitting on log density
info_D - info_L_D
```

0.2812908992306927

```python
# Friend density
# 4 s (0.4) ---> 3 no, 1 yes
# 4 m (0.4) ---> 4 yes
# 2 l (0.2) ---> 2 yes
info_F_D = 0.4*(0.75*np.log2(4/3) + 0.25*np.log2(4)) + 0 + 0
info_F_D
```

0.32451124978365314

```python
# Information gain of splitting on friend density
info_D - info_F_D
```

0.5567796494470394
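Since 0.557 > 0.281, friend density yields the larger gain, so it becomes the first split. Packaged as a sketch (the helper `information_gain` is hypothetical, reusing the `entropy` helper above):

```python
def information_gain(parent_probs, branches):
    """branches: (weight, class-probability list) for each child node."""
    remainder = sum(w * entropy(p) for w, p in branches)
    return entropy(parent_probs) - remainder

# Friend density: s -> 3 no / 1 yes, m -> 4 yes, l -> 2 yes
information_gain([0.3, 0.7],
                 [(0.4, [0.75, 0.25]), (0.4, [1.0]), (0.2, [1.0])])  # ~0.5568
```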

II. Decision Trees

1. Imports

```python
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

from sklearn import datasets, tree
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
```

2. Load the data

```python
iris = datasets.load_iris()
X = iris['data']
y = iris['target']
feature_names = iris.feature_names

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1024)
```

3. Using the decision tree

```python
# Data cleaning (the time-consuming part), feature engineering, model training,
# parameter tuning: sklearn wraps every algorithm up, ready to use with this pattern
from sklearn.metrics import accuracy_score

clf = DecisionTreeClassifier(criterion='entropy')
clf.fit(X_train, y_train)
y_ = clf.predict(X_test)
accuracy_score(y_test, y_)
```

1.0

```python
# Entropy at the root: the 120 training samples split 39/42/39 across the three classes
39/120*np.log2(120/39) + 42/120*np.log2(120/42) + 39/120*np.log2(120/39)
```

1.5840680553754911

```python
# Entropy of the 81-sample child node (42 vs. 39) left after the pure class is split off
42/81*np.log2(81/42) + 39/81*np.log2(81/39)
```

0.9990102708804813
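These two values are node entropies from the tree plotted below: the root, and the 81-sample child that remains once the 39 pure samples are separated off. Assuming that reading of the plot, the gain of the first split works out as:

```python
# Weighted child entropy: the 39-sample child is pure (entropy 0)
info_root = 39/120*np.log2(120/39) + 42/120*np.log2(120/42) + 39/120*np.log2(120/39)
info_children = 39/120*0 + 81/120*(42/81*np.log2(81/42) + 39/81*np.log2(81/39))
info_root - info_children   # ~0.91 bits gained by the first split
```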

```python
plt.figure(figsize=(18, 12))
_ = tree.plot_tree(clf, filled=True, feature_names=feature_names, max_depth=1)
plt.savefig('./tree.jpg')

# Continuous attributes are split against a threshold
X_train

# The larger a feature's spread, the more dispersed it is and the easier to separate on
X_train.std(axis=0)
```

array([0.82300095, 0.42470578, 1.74587112, 0.75016619])

```python
# A threshold is the midpoint of two adjacent sorted values: 1.9 + 3.3 = 5.2, 5.2/2 = 2.6
np.sort(X_train[:, 2])
```
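Under the hood, candidate thresholds for a continuous feature are enumerated as midpoints between adjacent distinct sorted values, (1.9 + 3.3) / 2 = 2.6 being one of them. A minimal sketch of that enumeration:

```python
# Candidate thresholds for petal length: midpoints of adjacent distinct values
values = np.unique(X_train[:, 2])           # unique() also sorts
thresholds = (values[:-1] + values[1:]) / 2
thresholds
```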
```python
%%time
# Capping max_depth prunes the tree and makes it shallower
clf = DecisionTreeClassifier(criterion='entropy', max_depth=5)
clf.fit(X_train, y_train)
y_ = clf.predict(X_test)
print(accuracy_score(y_test, y_))

plt.figure(figsize=(18, 12))
_ = tree.plot_tree(clf, filled=True, feature_names=feature_names)
```

1.0
Wall time: 114 ms

```python
%%time
# The same shallow (pruned) tree, but with the Gini criterion
clf = DecisionTreeClassifier(criterion='gini', max_depth=5)
clf.fit(X_train, y_train)
y_ = clf.predict(X_test)
print(accuracy_score(y_test, y_))

plt.figure(figsize=(18, 12))
_ = tree.plot_tree(clf, filled=True, feature_names=feature_names)
```

1.0
Wall time: 113 ms

Gini impurity formula:

$$\text{Gini} = \sum_{i=1}^{n} p(x_i)\,\bigl(1 - p(x_i)\bigr) = 1 - \sum_{i=1}^{n} p(x_i)^2$$

```python
# One class has probability 1.0 and the rest 0:
# the node is 100% pure, so its Gini impurity is 0
gini = 1*(1 - 1)
gini
```

0

```python
# Gini impurity at the root: classes of size 39, 42, 39
39/120*(1 - 39/120)*2 + 42/120*(1 - 42/120)
```

0.66625
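The same number falls out of the equivalent $1 - \sum p^2$ form of the formula, a quick consistency check:

```python
# Gini via the 1 - sum(p^2) identity; agrees with the 0.66625 above
1 - ((39/120)**2 + (42/120)**2 + (39/120)**2)
```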

```python
feature_names
```

['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']
```python
# Drop class 0 and, for each feature, sort the samples by that feature and read off
# the label order: the feature whose sorted labels separate most cleanly splits best
X_train2 = X_train[y_train != 0]
y_train2 = y_train[y_train != 0]

for col in range(4):
    index = np.argsort(X_train2[:, col])
    display(X_train2[:, col][index])
    display(y_train2[index])
```
Decision-tree models need no feature scaling: no de-dimensionalization, normalization, or standardization is required.
In industry a lone decision tree is rarely deployed; on its own it is too simple.
The upgraded versions of the decision tree are the ensemble algorithms: random forest, extremely randomized trees (ExtraTrees), gradient-boosted trees, and AdaBoost. A sketch of their shared interface follows.
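All of these upgrades expose the same fit/predict interface as `DecisionTreeClassifier`, so swapping them in is mechanical. A quick sketch on the iris split from above (scores will vary with the split):

```python
from sklearn.ensemble import (RandomForestClassifier, ExtraTreesClassifier,
                              GradientBoostingClassifier, AdaBoostClassifier)

# Same X_train/X_test as above; each ensemble is a drop-in replacement
for Model in (RandomForestClassifier, ExtraTreesClassifier,
              GradientBoostingClassifier, AdaBoostClassifier):
    model = Model().fit(X_train, y_train)
    print(Model.__name__, model.score(X_test, y_test))
```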

III. Random Forests

```python
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

import pandas as pd
from sklearn.ensemble import RandomForestClassifier, ExtraTreesClassifier
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score
```

A random forest is built from many decision trees, each one working exactly as described above.
Many trees computing together ------------> an ensemble algorithm.
What does the "random" in random forest mean? Each tree is trained on a bootstrap sample of the rows, and each split considers only a random subset of the columns, as sketched below.
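Concretely, the randomness has two sources. A sketch of both (illustrative only; sklearn does this internally):

```python
rng = np.random.default_rng(42)

# 1. Bagging: each tree trains on a bootstrap sample of the rows (drawn with replacement)
bootstrap_rows = rng.integers(0, 178, size=178)

# 2. Feature subsampling: each split considers only a random subset of the 13 columns
split_features = rng.choice(13, size=int(np.sqrt(13)), replace=False)

bootstrap_rows[:10], split_features
```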

```python
wine = datasets.load_wine()
wine
```

{'data': array([[1.423e+01, 1.710e+00, 2.430e+00, ..., 1.040e+00, 3.920e+00, 1.065e+03],
        ...,
        [1.413e+01, 4.100e+00, 2.740e+00, ..., 6.100e-01, 1.600e+00, 5.600e+02]]),
 'target': array([0, 0, 0, ..., 2, 2, 2]),
 'target_names': array(['class_0', 'class_1', 'class_2'], dtype='<U7'),
 'DESCR': '.. _wine_dataset:\n\nWine recognition dataset\n...',
 'feature_names': ['alcohol', 'malic_acid', 'ash', 'alcalinity_of_ash', 'magnesium', 'total_phenols', 'flavanoids', 'nonflavanoid_phenols', 'proanthocyanins', 'color_intensity', 'hue', 'od280/od315_of_diluted_wines', 'proline']}
```python
X = wine['data']
y = wine['target']
X.shape
```

(178, 13)

Split the data:

```python
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
```

Train a random forest and get its predictions and accuracy:

```python
clf = RandomForestClassifier()
clf.fit(X_train, y_train)
y_ = clf.predict(X_test)
accuracy_score(y_test, y_)
```

1.0
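A fitted forest also reports how much each feature contributed to its splits via `feature_importances_`. A quick look (values vary from run to run):

```python
# Mean impurity decrease attributed to each of the 13 wine features
pd.Series(clf.feature_importances_,
          index=wine['feature_names']).sort_values(ascending=False)
```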

```python
dt_clf = DecisionTreeClassifier()
dt_clf.fit(X_train, y_train)
dt_clf.score(X_test, y_test)
```

0.9444444444444444

Compare the gap between a single decision tree and a random forest:

```python
score = 0
for i in range(100):
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
    dt_clf = DecisionTreeClassifier()
    dt_clf.fit(X_train, y_train)
    score += dt_clf.score(X_test, y_test)/100
print('Decision tree accuracy averaged over 100 runs:', score)
```

Decision tree accuracy averaged over 100 runs: 0.909166666666666

```python
score = 0
for i in range(100):
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
    clf = RandomForestClassifier(n_estimators=100)
    clf.fit(X_train, y_train)
    score += clf.score(X_test, y_test)/100
print('Random forest accuracy averaged over 100 runs:', score)
```

Random forest accuracy averaged over 100 runs: 0.9808333333333332
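Averaging many de-correlated trees reduces the variance of a single tree, which is where the roughly seven-point gap comes from. The extremely randomized variant imported above can be benchmarked the same way (a sketch; expect scores close to the random forest's):

```python
score = 0
for i in range(100):
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
    et_clf = ExtraTreesClassifier(n_estimators=100)
    et_clf.fit(X_train, y_train)
    score += et_clf.score(X_test, y_test)/100
print('ExtraTrees accuracy averaged over 100 runs:', score)
```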

