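The Jaccard similarity coefficient, also known as the Jaccard similarity measure, Jaccard coefficient, or Jaccard index, is used to compare the similarity and diversity of finite sample sets. The larger the coefficient, the more similar the two sets. It is defined as the size of the intersection divided by the size of the union of the sample sets:
J(A, B) = |A ∩ B| / |A ∪ B|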
(If A and B are both empty, J(A, B) is defined as 1.)
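The counterpart of the Jaccard similarity coefficient is the Jaccard distance, defined as 1 minus the Jaccard coefficient, that is:
d_J(A, B) = 1 - J(A, B) = (|A ∪ B| - |A ∩ B|) / |A ∪ B|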
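The two definitions above translate almost directly into Python's built-in set operations. The following is a minimal sketch, not part of the original code; the function names `jaccard_index` and `jaccard_distance` are chosen here for illustration only.

```python
def jaccard_index(a, b):
    """Return |a ∩ b| / |a ∪ b| for two iterables treated as sets."""
    a, b = set(a), set(b)
    if not a and not b:
        # Convention from the text: two empty sets have similarity 1
        return 1.0
    return len(a & b) / len(a | b)


def jaccard_distance(a, b):
    """Return 1 - J(a, b)."""
    return 1.0 - jaccard_index(a, b)


# Two of the four distinct elements are shared, so J = 2 / 4
print(jaccard_index({1, 2, 3}, {2, 3, 4}))     # 0.5
print(jaccard_distance({1, 2, 3}, {2, 3, 4}))  # 0.5
```

Handling the empty case explicitly avoids a division by zero, since the union of two empty sets has size 0.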
Python code:
import copy
import numpy as np

# `data` is assumed to be a pandas DataFrame with 'school' and 'id' columns
data_school_list = data['school'].unique().tolist()  # list of schools
edu_similar = []
l = len(data_school_list)  # number of schools still to process
for i in data_school_list:
    try:
        print(l)  # show progress
        Jaccard_list = []  # Jaccard coefficients for this school
        true_id = data.loc[data['school'] == i, 'id'].tolist()  # ids linked to school i
        for m in range(len(true_id)):
            true_ids = copy.copy(true_id)  # copy the id list
            true_ids.pop(m)  # drop the current id so it is not compared with itself
            for n in range(len(true_ids)):
                data_id_x = data.loc[data['id'] == true_id[m], 'school'].tolist()
                # index into true_ids (the list without id m), not true_id
                data_id_y = data.loc[data['id'] == true_ids[n], 'school'].tolist()
                union_set = len(set(data_id_x) | set(data_id_y))  # size of the union
                intersection_set = len(set(data_id_x) & set(data_id_y))  # size of the intersection
                Jaccard = intersection_set / union_set  # Jaccard index
                Jaccard_list.append(Jaccard)
        Jaccard_array = np.array(Jaccard_list)
        Jaccard_mean = np.mean(Jaccard_array)
        Jaccard_std = np.std(Jaccard_array)
        Jaccard_list = [i, Jaccard_mean, Jaccard_std]
        edu_similar.append(Jaccard_list)
        l -= 1  # one school done; continue until every school is processed
    except Exception:
        # fall back to 0 for both statistics if the computation fails
        Jaccard_list = [i, 0, 0]
        edu_similar.append(Jaccard_list)
        l -= 1
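The loop above looks up the schools of each id again for every pair, which gets slow on larger data. As a hedged alternative, the sketch below precomputes one set of schools per id and iterates over unordered pairs with `itertools.combinations`. The function name `school_jaccard_stats` and the toy DataFrame are hypothetical. It also makes two assumptions that differ from the code above: ids are deduplicated per school, and a school with fewer than two ids is given 0 for both statistics, mirroring the fallback values. Because each unordered pair is counted once instead of twice, the mean and population standard deviation are unchanged.

```python
from itertools import combinations

import numpy as np
import pandas as pd


def school_jaccard_stats(data):
    """For each school, return [school, mean, std] of the Jaccard index
    over all pairs of ids linked to that school."""
    # Set of schools for every id, computed once
    schools_by_id = data.groupby('id')['school'].apply(set).to_dict()
    rows = []
    for school, ids in data.groupby('school')['id'].unique().items():
        scores = []
        for x, y in combinations(ids, 2):
            sx, sy = schools_by_id[x], schools_by_id[y]
            scores.append(len(sx & sy) / len(sx | sy))
        if scores:
            rows.append([school, float(np.mean(scores)), float(np.std(scores))])
        else:
            # Assumption: no pairs available, so report 0 for both statistics
            rows.append([school, 0.0, 0.0])
    return rows


# Hypothetical toy data: ids 1 and 3 attended schools A and B, id 2 only A, id 4 only C
toy = pd.DataFrame({
    'id':     [1, 1, 2, 3, 3, 4],
    'school': ['A', 'B', 'A', 'A', 'B', 'C'],
})
for row in school_jaccard_stats(toy):
    print(row)
```

For school A the three pairs score 0.5, 1.0, and 0.5, so the mean is 2/3. School B has a single pair with identical school sets, and school C has no pair at all.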