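The formula for information (Shannon) entropy is $H = -\sum_{i=1}^{n} p(x_i) \log_2 p(x_i)$, where n is the number of classes and p(x_i) is the probability of the i-th class.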
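As a quick sanity check, here is a small worked calculation for two classes. It assumes the base-2 logarithm, which is the base used by the code in this post. The probabilities are made-up illustrative values, not taken from any data set.

```latex
% Entropy with base-2 logarithm (result is in bits)
H = -\sum_{i=1}^{n} p(x_i) \log_2 p(x_i)

% Fair split: n = 2, p(x_1) = p(x_2) = 1/2
H = -\left(\tfrac{1}{2}\log_2\tfrac{1}{2} + \tfrac{1}{2}\log_2\tfrac{1}{2}\right) = 1 \text{ bit}

% Skewed split: n = 2, p(x_1) = 1/4, p(x_2) = 3/4
H = -\left(\tfrac{1}{4}\log_2\tfrac{1}{4} + \tfrac{3}{4}\log_2\tfrac{3}{4}\right) \approx 0.5 + 0.311 = 0.811 \text{ bits}
```

The more uneven the class probabilities, the lower the entropy. A data set with a single class has entropy 0.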
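Suppose the data set has m rows, i.e. m samples, and the last column of each row is that sample's label. The code for computing the Shannon entropy of the data set is as follows: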
from math import log

def calcShannonEnt(dataSet):
    numEntries = len(dataSet)  # number of samples
    labelCounts = {}  # frequency of each class in the data set
    for featVec in dataSet:  # for each sample (row)
        currentLabel = featVec[-1]  # the sample's label is the last column
        if currentLabel not in labelCounts:
            labelCounts[currentLabel] = 0
        labelCounts[currentLabel] += 1
    shannonEnt = 0.0
    for key in labelCounts:
        prob = float(labelCounts[key]) / numEntries  # compute p(xi)
        shannonEnt -= prob * log(prob, 2)  # log base 2
    return shannonEnt
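To see the calculation on concrete numbers, here is a minimal self-contained sketch. It is a variant of the same idea rather than the code above: it counts labels with `collections.Counter` and uses `math.log2`. The names `shannon_entropy` and `toy_data` are hypothetical, and returning 0.0 for an empty data set is an assumption made here for convenience.

```python
from collections import Counter
from math import log2


def shannon_entropy(data_set):
    """Return the Shannon entropy (in bits) of the labels in the last column."""
    num_entries = len(data_set)
    if num_entries == 0:
        return 0.0  # assumption: treat an empty data set as having zero entropy
    # Count how often each label (last element of each row) occurs.
    label_counts = Counter(row[-1] for row in data_set)
    # Sum -p * log2(p) over all classes, where p is the class frequency.
    return -sum(
        (count / num_entries) * log2(count / num_entries)
        for count in label_counts.values()
    )


# Hypothetical toy data: two feature columns followed by the label.
toy_data = [
    [1, 1, "yes"],
    [1, 1, "yes"],
    [1, 0, "no"],
    [0, 1, "no"],
    [0, 1, "no"],
]

# 2 "yes" and 3 "no" labels give about 0.971 bits.
print(round(shannon_entropy(toy_data), 3))
```

With 2 "yes" and 3 "no" labels the class probabilities are 0.4 and 0.6, so the entropy is -(0.4·log2 0.4 + 0.6·log2 0.6), roughly 0.971 bits. If every row carried the same label, the result would be 0.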