AAAI大会(Association for the Advancement of Artificial Intelligence),是一年一度在人工智能方向的顶级会议之一,旨在汇集世界各地的人工智能理论和领域应用的最新成果。该会议固定在每年的2月份举行,由AAAI协会主办。
第32届AAAI大会-AAAI 2018将于2月2号-7号在美国新奥尔良召开,其中蚂蚁金服人工智能部和新加坡科技大学合作的一篇基于汉字笔画信息的中文词向量算法研究的论文“cw2vec: Learning Chinese Word Embeddings with Stroke n-grams”被高分录用(其中一位审稿人给出了满分,剩下两位也给出了接近满分的评价)。我们将在2月7日在大会上做口头报告(Oral),欢迎大家一起讨论交流。
1 ##!/usr/bin/env python2 ## coding=utf-83 import jieba4 5 filePath='corpus.txt'6 fileSegWordDonePath ='corpusSegDone.txt'7 # read the file by line8 fileTrainRead = []9 #fileTestRead = []
10 with open(filePath) as fileTrainRaw:
11 for line in fileTrainRaw:
12 fileTrainRead.append(line)
13
14
15 # define this function to print a list with Chinese
16 def PrintListChinese(list):
17 for i in range(len(list)):
18 print list[i],
19 # segment word with jieba
20 fileTrainSeg=[]
21 for i in range(len(fileTrainRead)):
22 fileTrainSeg.append([' '.join(list(jieba.cut(fileTrainRead[i][9:-11],cut_all=False)))])
23 if i % 100 == 0 :
24 print i
25
26 # to test the segment result
27 #PrintListChinese(fileTrainSeg[10])
28
29 # save the result
30 with open(fileSegWordDonePath,'wb') as fW:
31 for i in range(len(fileTrainSeg)):
32 fW.write(fileTrainSeg[i][0].encode('utf-8'))
33 fW.write('\n')
import word2vec
model = word2vec.load('corpusWord2Vec.bin')
indexes = model.cosine(u'加拿大')
for index in indexes[0]:print (model.vocab[index])
得到的结果如下:
可以修改希望查找的中文词,例子如下:
四、二维空间中显示词向量
将词向量采用PCA进行降维,得到二维的词向量,并打印出来,代码如下:
1 #!/usr/bin/env python2 # coding=utf-83 import numpy as np4 import matplotlib5 import matplotlib.pyplot as plt6 7 from sklearn.decomposition import PCA8 import word2vec9 # load the word2vec model
10 model = word2vec.load('corpusWord2Vec.bin')
11 rawWordVec=model.vectors
12
13 # reduce the dimension of word vector
14 X_reduced = PCA(n_components=2).fit_transform(rawWordVec)
15
16 # show some word(center word) and it's similar words
17 index1,metrics1 = model.cosine(u'中国')
18 index2,metrics2 = model.cosine(u'清华')
19 index3,metrics3 = model.cosine(u'牛顿')
20 index4,metrics4 = model.cosine(u'自动化')
21 index5,metrics5 = model.cosine(u'刘亦菲')
22
23 # add the index of center word
24 index01=np.where(model.vocab==u'中国')
25 index02=np.where(model.vocab==u'清华')
26 index03=np.where(model.vocab==u'牛顿')
27 index04=np.where(model.vocab==u'自动化')
28 index05=np.where(model.vocab==u'刘亦菲')
29
30 index1=np.append(index1,index01)
31 index2=np.append(index2,index03)
32 index3=np.append(index3,index03)
33 index4=np.append(index4,index04)
34 index5=np.append(index5,index05)
35
36 # plot the result
37 zhfont = matplotlib.font_manager.FontProperties(fname='/usr/share/fonts/truetype/wqy/wqy-microhei.ttc')
38 fig = plt.figure()
39 ax = fig.add_subplot(111)
40
41 for i in index1:
42 ax.text(X_reduced[i][0],X_reduced[i][1], model.vocab[i], fontproperties=zhfont,color='r')
43
44 for i in index2:
45 ax.text(X_reduced[i][0],X_reduced[i][1], model.vocab[i], fontproperties=zhfont,color='b')
46
47 for i in index3:
48 ax.text(X_reduced[i][0],X_reduced[i][1], model.vocab[i], fontproperties=zhfont,color='g')
49
50 for i in index4:
51 ax.text(X_reduced[i][0],X_reduced[i][1], model.vocab[i], fontproperties=zhfont,color='k')
52
53 for i in index5:
54 ax.text(X_reduced[i][0],X_reduced[i][1], model.vocab[i], fontproperties=zhfont,color='c')
55
56 ax.axis([0,0.8,-0.5,0.5])
57 plt.show()
Yang, B., Mitchell, T., 2017. Leveraging Knowledge Bases in LSTMs for Improving Machine Reading. Association for Computational Linguistics, pp. 1436–1446.链接:http://www.aclweb.org/anthology/P/P17/P17-1132.pdf这篇论文是今年发表在 ACL 的一篇文章…