05. Sequence Models W2. Natural Language Processing & Word Embeddings (Assignments: Word Vectors + Emojify)

Table of Contents

  • Assignment 1:
    • 1. Cosine similarity
    • 2. Word analogy
    • 3. Debiasing word vectors
      • 3.1 Neutralizing bias for non-gender-specific words
      • 3.2 Equalization algorithm for gender-specific words
  • Assignment 2: Emojify
    • 1. Baseline model: Emojifier-V1
      • 1.1 Dataset
      • 1.2 Model overview
      • 1.3 Implementing Emojifier-V1
      • 1.4 Testing on the training set
    • 2. Emojifier-V2: Using LSTMs in Keras
      • 2.1 Model overview
      • 2.2 Keras and mini-batching
      • 2.3 Embedding layer
      • 2.4 Building Emojifier-V2

Quiz: see the reference post

Notes: W2. Natural Language Processing & Word Embeddings

Assignment 1:

  • Load pre-trained word vectors and measure similarity with the cosine of the angle $\cos(\theta)$
  • Use word embeddings to solve word analogy problems
  • Modify word embeddings to reduce gender bias
import numpy as np
from w2v_utils import *

This assignment uses 50-dimensional GloVe vectors to represent words.

words, word_to_vec_map = read_glove_vecs('data/glove.6B.50d.txt')

1. Cosine similarity

$$\text{CosineSimilarity}(u, v) = \frac{u \cdot v}{||u||_2 \, ||v||_2} = \cos(\theta)$$

where $||u||_2 = \sqrt{\sum_{i=1}^{n} u_i^2}$

# GRADED FUNCTION: cosine_similarity

def cosine_similarity(u, v):
    """
    Cosine similarity reflects the degree of similarity between u and v
        
    Arguments:
        u -- a word vector of shape (n,)          
        v -- a word vector of shape (n,)

    Returns:
        cosine_similarity -- the cosine similarity between u and v defined by the formula above.
    """
    
    distance = 0.0
    
    ### START CODE HERE ###
    # Compute the dot product between u and v (≈1 line)
    dot = np.dot(u, v)
    # Compute the L2 norm of u (≈1 line)
    norm_u = np.linalg.norm(u)
    # Compute the L2 norm of v (≈1 line)
    norm_v = np.linalg.norm(v)
    # Compute the cosine similarity defined by formula (1) (≈1 line)
    cosine_similarity = dot / (norm_u * norm_v)
    ### END CODE HERE ###
    
    return cosine_similarity
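A quick sanity check of the function (a usage sketch; the exact similarity values depend on the loaded GloVe vectors):

father = word_to_vec_map["father"]
mother = word_to_vec_map["mother"]
ball = word_to_vec_map["ball"]
crocodile = word_to_vec_map["crocodile"]

# related words should give a value close to 1, unrelated words a value near 0 (or negative)
print("cosine_similarity(father, mother) = ", cosine_similarity(father, mother))
print("cosine_similarity(ball, crocodile) = ", cosine_similarity(ball, crocodile))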

2. Word analogy

For example: man : woman --> king : queen

# GRADED FUNCTION: complete_analogy

def complete_analogy(word_a, word_b, word_c, word_to_vec_map):
    """
    Performs the word analogy task as explained above: a is to b as c is to ____. 
    
    Arguments:
    word_a -- a word, string
    word_b -- a word, string
    word_c -- a word, string
    word_to_vec_map -- dictionary that maps words to their corresponding vectors. 
    
    Returns:
    best_word --  the word such that v_b - v_a is close to v_best_word - v_c, as measured by cosine similarity
    """
    
    # convert words to lower case
    word_a, word_b, word_c = word_a.lower(), word_b.lower(), word_c.lower()
    
    ### START CODE HERE ###
    # Get the word embeddings v_a, v_b and v_c (≈1-3 lines)
    e_a, e_b, e_c = word_to_vec_map[word_a], word_to_vec_map[word_b], word_to_vec_map[word_c]
    ### END CODE HERE ###
    
    words = word_to_vec_map.keys()
    max_cosine_sim = -100              # Initialize max_cosine_sim to a large negative number
    best_word = None                   # Initialize best_word with None, it will help keep track of the word to output

    # loop over the whole word vector set
    for w in words:
        # to avoid best_word being one of the input words, pass on them.
        if w in [word_a, word_b, word_c]:
            continue
        
        ### START CODE HERE ###
        # Compute cosine similarity between the vector (e_b - e_a) and the vector ((w's vector representation) - e_c)  (≈1 line)
        cosine_sim = cosine_similarity(e_b - e_a, word_to_vec_map[w] - e_c)
        
        # If the cosine_sim is more than the max_cosine_sim seen so far,
        # then: set the new max_cosine_sim to the current cosine_sim and the best_word to the current word (≈3 lines)
        if cosine_sim > max_cosine_sim:
            max_cosine_sim = cosine_sim
            best_word = w
        ### END CODE HERE ###
        
    return best_word

Test:

triads_to_try = [('italy', 'italian', 'spain'), ('india', 'delhi', 'japan'), ('man', 'woman', 'boy'), ('small', 'smaller', 'large')]
for triad in triads_to_try:
    print('{} -> {} :: {} -> {}'.format(*triad, complete_analogy(*triad, word_to_vec_map)))

Output:

italy -> italian :: spain -> spanish
india -> delhi :: japan -> tokyo
man -> woman :: boy -> girl
small -> smaller :: large -> larger

Extra tests:

good -> ok :: bad -> oops
father -> dad :: mother -> mom

3. Debiasing word vectors

Examine the gender bias reflected in word embeddings and explore algorithms for reducing it.

g = word_to_vec_map['woman'] - word_to_vec_map['man']
print(g)

Output: a 50-dimensional vector

[-0.087144    0.2182     -0.40986    -0.03922    -0.1032      0.94165
 -0.06042     0.32988     0.46144    -0.35962     0.31102    -0.86824
  0.96006     0.01073     0.24337     0.08193    -1.02722    -0.21122
  0.695044   -0.00222     0.29106     0.5053     -0.099454    0.40445
  0.30181     0.1355     -0.0606     -0.07131    -0.19245    -0.06115
 -0.3204      0.07165    -0.13337    -0.25068714 -0.14293    -0.224957
 -0.149       0.048882    0.12191    -0.27362    -0.165476   -0.20426
  0.54376    -0.271425   -0.10245    -0.32108     0.2516     -0.33455
 -0.04371     0.01258   ]
print('List of names and their similarities with constructed vector:')

# girls and boys names
name_list = ['john', 'marie', 'sophie', 'ronaldo', 'priya', 'rahul', 'danielle', 'reza', 'katy', 'yasmin']

for w in name_list:
    print(w, cosine_similarity(word_to_vec_map[w], g))

Output:

List of names and their similarities with constructed vector:
john -0.23163356145973724
marie 0.315597935396073
sophie 0.31868789859418784
ronaldo -0.31244796850329437
priya 0.17632041839009402
rahul -0.16915471039231716
danielle 0.24393299216283895
reza -0.07930429672199553
katy 0.2831068659572615
yasmin 0.2331385776792876

As you can see,

  • female first names tend to have a positive cosine similarity with the vector g,
  • while male first names tend to have a negative cosine similarity. The result seems acceptable.

Try some other words:

print('Other words and their similarities:')
word_list = ['lipstick', 'guns', 'science', 'arts', 'literature', 'warrior','doctor', 'tree', 'receptionist', 'technology',  'fashion', 'teacher', 'engineer', 'pilot', 'computer', 'singer']
for w in word_list:
    print(w, cosine_similarity(word_to_vec_map[w], g))

Output:

Other words and their similarities:
lipstick 0.2769191625638267
guns -0.1888485567898898
science -0.06082906540929701
arts 0.008189312385880337
literature 0.06472504433459932
warrior -0.20920164641125288
doctor 0.11895289410935041
tree -0.07089399175478091
receptionist 0.3307794175059374
technology -0.13193732447554302
fashion 0.03563894625772699
teacher 0.17920923431825664
engineer -0.0803928049452407
pilot 0.0010764498991916937
computer -0.10330358873850498
singer 0.1850051813649629

These results reflect certain gender stereotypes. For example, "computer" is closer to "man", while "literature" is closer to "woman".

Below we see how to reduce the bias of these vectors using the algorithm proposed by Bolukbasi et al., 2016.

Note that some word pairs, such as "actor"/"actress" and "grandmother"/"grandfather", should remain gender-specific, while other words such as "receptionist" and "technology" should be neutralized, i.e. made gender-independent. When debiasing, you must treat these two types of words differently.

3.1 Neutralizing bias for non-gender-specific words

$$e^{bias\_component} = \frac{e \cdot g}{||g||_2^2} * g$$

$$e^{debiased} = e - e^{bias\_component}$$

def neutralize(word, g, word_to_vec_map):
    """
    Removes the bias of "word" by projecting it on the space orthogonal to the bias axis. 
    This function ensures that gender neutral words are zero in the gender subspace.
    
    Arguments:
        word -- string indicating the word to debias
        g -- numpy-array of shape (50,), corresponding to the bias axis (such as gender)
        word_to_vec_map -- dictionary mapping words to their corresponding vectors.
    
    Returns:
        e_debiased -- neutralized word vector representation of the input "word"
    """
    
    ### START CODE HERE ###
    # Select word vector representation of "word". Use word_to_vec_map. (≈ 1 line)
    e = word_to_vec_map[word]
    
    # Compute e_biascomponent using the formula given above. (≈ 1 line)
    e_biascomponent = np.dot(e, g) / np.linalg.norm(g)**2 * g
 
    # Neutralize e by subtracting e_biascomponent from it 
    # e_debiased should be equal to its orthogonal projection. (≈ 1 line)
    e_debiased = e - e_biascomponent
    ### END CODE HERE ###
    
    return e_debiased

Test:

e = "receptionist"
print("cosine similarity between " + e + " and g, before neutralizing: ", cosine_similarity(word_to_vec_map["receptionist"], g))
e_debiased = neutralize("receptionist", g, word_to_vec_map)
print("cosine similarity between " + e + " and g, after neutralizing: ", cosine_similarity(e_debiased, g))

Output:

cosine similarity between receptionist and g, before neutralizing:  0.3307794175059374
cosine similarity between receptionist and g, after neutralizing:  -2.099120994400013e-17

After neutralizing, the similarity between "receptionist" and the gender axis is close to 0: it is biased toward neither men nor women.

3.2 Equalization algorithm for gender-specific words

How do we apply debiasing to word pairs such as "actress" and "actor"?
Equalization is applied to pairs of words that should differ only through the gender property.
As a concrete example, suppose "actress" is closer to "babysit" than "actor" is. By neutralizing "babysit" we can reduce the gender stereotype associated with babysitting, but this still does not guarantee that "actor" and "actress" are equidistant from "babysit"; the equalization algorithm takes care of that.


$$\mu = \frac{e_{w1} + e_{w2}}{2}$$

$$\mu_B = \frac{\mu \cdot \text{bias\_axis}}{||\text{bias\_axis}||_2^2} * \text{bias\_axis}$$

$$\mu_{\perp} = \mu - \mu_B$$

$$e_{w1B} = \frac{e_{w1} \cdot \text{bias\_axis}}{||\text{bias\_axis}||_2^2} * \text{bias\_axis}$$

$$e_{w2B} = \frac{e_{w2} \cdot \text{bias\_axis}}{||\text{bias\_axis}||_2^2} * \text{bias\_axis}$$

$$e_{w1B}^{corrected} = \sqrt{|1 - ||\mu_{\perp}||_2^2|} * \frac{e_{w1B} - \mu_B}{|(e_{w1} - \mu_{\perp}) - \mu_B|}$$

$$e_{w2B}^{corrected} = \sqrt{|1 - ||\mu_{\perp}||_2^2|} * \frac{e_{w2B} - \mu_B}{|(e_{w2} - \mu_{\perp}) - \mu_B|}$$

$$e_1 = e_{w1B}^{corrected} + \mu_{\perp}$$

$$e_2 = e_{w2B}^{corrected} + \mu_{\perp}$$

def equalize(pair, bias_axis, word_to_vec_map):
    """
    Debias gender specific words by following the equalize method described in the figure above.
    
    Arguments:
    pair -- pair of strings of gender specific words to debias, e.g. ("actress", "actor") 
    bias_axis -- numpy-array of shape (50,), vector corresponding to the bias axis, e.g. gender
    word_to_vec_map -- dictionary mapping words to their corresponding vectors
    
    Returns
    e_1 -- word vector corresponding to the first word
    e_2 -- word vector corresponding to the second word
    """
    
    ### START CODE HERE ###
    # Step 1: Select word vector representation of "word". Use word_to_vec_map. (≈ 2 lines)
    w1, w2 = pair[0], pair[1]
    e_w1, e_w2 = word_to_vec_map[w1], word_to_vec_map[w2]
    
    # Step 2: Compute the mean of e_w1 and e_w2 (≈ 1 line)
    mu = (e_w1 + e_w2) / 2

    # Step 3: Compute the projections of mu over the bias axis and the orthogonal axis (≈ 2 lines)
    mu_B = np.dot(mu, bias_axis) / np.linalg.norm(bias_axis)**2 * bias_axis
    mu_orth = mu - mu_B

    # Step 4: Use equations (7) and (8) to compute e_w1B and e_w2B (≈2 lines)
    e_w1B = np.dot(e_w1, bias_axis) / np.linalg.norm(bias_axis)**2 * bias_axis
    e_w2B = np.dot(e_w2, bias_axis) / np.linalg.norm(bias_axis)**2 * bias_axis
        
    # Step 5: Adjust the Bias part of e_w1B and e_w2B using the formulas (9) and (10) given above (≈2 lines)
    corrected_e_w1B = np.sqrt(np.abs(1 - np.linalg.norm(mu_orth)**2)) * np.divide((e_w1B - mu_B), np.abs(e_w1 - mu_orth - mu_B))
    corrected_e_w2B = np.sqrt(np.abs(1 - np.linalg.norm(mu_orth)**2)) * np.divide((e_w2B - mu_B), np.abs(e_w2 - mu_orth - mu_B))

    # Step 6: Debias by equalizing e1 and e2 to the sum of their corrected projections (≈2 lines)
    e1 = corrected_e_w1B + mu_orth
    e2 = corrected_e_w2B + mu_orth
    ### END CODE HERE ###
    
    return e1, e2

Test:

print("cosine similarities before equalizing:")
print("cosine_similarity(word_to_vec_map[\"man\"], gender) = ", cosine_similarity(word_to_vec_map["man"], g))
print("cosine_similarity(word_to_vec_map[\"woman\"], gender) = ", cosine_similarity(word_to_vec_map["woman"], g))
print()
e1, e2 = equalize(("man", "woman"), g, word_to_vec_map)
print("cosine similarities after equalizing:")
print("cosine_similarity(e1, gender) = ", cosine_similarity(e1, g))
print("cosine_similarity(e2, gender) = ", cosine_similarity(e2, g))

Output:

cosine similarities before equalizing:
cosine_similarity(word_to_vec_map["man"], gender) =  -0.11711095765336832
cosine_similarity(word_to_vec_map["woman"], gender) =  0.35666618846270376

cosine similarities after equalizing:
cosine_similarity(e1, gender) =  -0.7165727525843935
cosine_similarity(e2, gender) =  0.7396596474928909

After equalizing, the two similarities have opposite signs and nearly equal magnitudes.

Assignment 2: Emojify

Build an Emojifier using word vector representations.

Make your messages more expressive 😁. With word vectors, even if a word never appeared together with a given emoji in the training data, the model can still learn to use that emoji for it.

  • Import some packages
import numpy as np
from emo_utils import *
import emoji
import matplotlib.pyplot as plt
%matplotlib inline

1. Baseline model: Emojifier-V1

1.1 Dataset

X: 127 sentences (strings)
Y: integer labels from 0 to 4, the emoji associated with each sentence

  • Load the dataset: training set (127 examples), test set (56 examples)
X_train, Y_train = read_csv('data/train_emoji.csv')
X_test, Y_test = read_csv('data/tesss.csv')
maxLen = len(max(X_train, key=len).split())
print(max(X_train, key=len).split())

Output:

['I', 'am', 'so', 'impressed', 'by', 'your', 'dedication', 'to', 'this', 'project']

The longest sentence has 10 words.

  • Look at the dataset
index = 3
print(X_train[index], label_to_emoji(Y_train[index]))

Output:
Miss you so much ❤️

1.2 Model overview


For convenience, convert Y from shape $(m, 1)$ to a one-hot representation of shape $(m, 5)$.

Y_oh_train = convert_to_one_hot(Y_train, C = 5)
Y_oh_test = convert_to_one_hot(Y_test, C = 5)
index = 52
print(Y_train[index], "is converted into one hot", Y_oh_train[index])

Output:

3 is converted into one hot [0. 0. 0. 1. 0.]
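convert_to_one_hot is provided by emo_utils; a minimal numpy equivalent (an assumption about what the helper does, not the provided code) would be:

def convert_to_one_hot(Y, C):
    # pick row Y[i] of the C x C identity matrix for each label, giving an (m, C) matrix
    return np.eye(C)[Y.reshape(-1)]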

1.3 Implementing Emojifier-V1

Use the pre-trained 50-dimensional GloVe embeddings.

word_to_index, index_to_word, word_to_vec_map = read_glove_vecs('data/glove.6B.50d.txt')
  • Check that the mapping is correct
word = "cucumber"
index = 289846
print("the index of", word, "in the vocabulary is", word_to_index[word])
print("the", str(index) + "th word in the vocabulary is", index_to_word[index])

Output:

the index of cucumber in the vocabulary is 113317
the 289846th word in the vocabulary is potatos

Implement sentence_to_avg()

  • Convert each sentence to lower case and split it into words
  • Represent each word of the sentence with its GloVe vector, then average them over the sentence
# GRADED FUNCTION: sentence_to_avg

def sentence_to_avg(sentence, word_to_vec_map):
    """
    Converts a sentence (string) into a list of words (strings). Extracts the GloVe representation of each word
    and averages its value into a single vector encoding the meaning of the sentence.
    
    Arguments:
    sentence -- string, one training example from X
    word_to_vec_map -- dictionary mapping every word in a vocabulary into its 50-dimensional vector representation
    
    Returns:
    avg -- average vector encoding information about the sentence, numpy-array of shape (50,)
    """
    
    ### START CODE HERE ###
    # Step 1: Split sentence into list of lower case words (≈ 1 line)
    words = sentence.lower().split()

    # Initialize the average word vector, should have the same shape as your word vectors.
    avg = np.zeros(word_to_vec_map[words[0]].shape)
    
    # Step 2: average the word vectors. You can loop over the words in the list "words".
    for w in words:
        avg += word_to_vec_map[w]
    avg /= len(words)
    
    ### END CODE HERE ###
    
    return avg

Test:

avg = sentence_to_avg("Morrocan couscous is my favorite dish", word_to_vec_map)
print("avg = ", avg)

Output:

avg =  [-0.008005    0.56370833 -0.50427333  0.258865    0.55131103  0.03104983
 -0.21013718  0.16893933 -0.09590267  0.141784   -0.15708967  0.18525867
  0.6495785   0.38371117  0.21102167  0.11301667  0.02613967  0.26037767
  0.05820667 -0.01578167 -0.12078833 -0.02471267  0.4128455   0.5152061
  0.38756167 -0.898661   -0.535145    0.33501167  0.68806933 -0.2156265
  1.797155    0.10476933 -0.36775333  0.750785    0.10282583  0.348925
 -0.27262833  0.66768    -0.10706167 -0.283635    0.59580117  0.28747333
 -0.3366635   0.23393817  0.34349183  0.178405    0.1166155  -0.076433
  0.1445417   0.09808667]

The model

After sentence_to_avg(), run forward propagation, compute the loss, then back-propagate to update the parameters.

$$z^{(i)} = W \cdot avg^{(i)} + b$$

$$a^{(i)} = softmax(z^{(i)})$$

$$\mathcal{L}^{(i)} = - \sum_{k=0}^{n_y - 1} Yoh^{(i)}_k * \log(a^{(i)}_k)$$
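The softmax used here comes from the assignment's helper file (emo_utils); a minimal numpy equivalent, shown only as an assumption about its behavior:

def softmax(z):
    # shift by the max for numerical stability, then normalize the exponentials
    e = np.exp(z - np.max(z))
    return e / e.sum()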

# GRADED FUNCTION: model

def model(X, Y, word_to_vec_map, learning_rate=0.01, num_iterations=400):
    """
    Model to train word vector representations in numpy.
    
    Arguments:
    X -- input data, numpy array of sentences as strings, of shape (m, 1)
    Y -- labels, numpy array of integers between 0 and 7, numpy-array of shape (m, 1)
    word_to_vec_map -- dictionary mapping every word in a vocabulary into its 50-dimensional vector representation
    learning_rate -- learning_rate for the stochastic gradient descent algorithm
    num_iterations -- number of iterations
    
    Returns:
    pred -- vector of predictions, numpy-array of shape (m, 1)
    W -- weight matrix of the softmax layer, of shape (n_y, n_h)
    b -- bias of the softmax layer, of shape (n_y,)
    """
    
    np.random.seed(1)

    # Define number of training examples
    m = Y.shape[0]                          # number of training examples
    n_y = 5                                 # number of classes  
    n_h = 50                                # dimensions of the GloVe vectors 
    
    # Initialize parameters using Xavier initialization
    W = np.random.randn(n_y, n_h) / np.sqrt(n_h)
    b = np.zeros((n_y,))
    
    # Convert Y to Y_onehot with n_y classes
    Y_oh = convert_to_one_hot(Y, C=n_y) 
    
    # Optimization loop
    for t in range(num_iterations):                       # Loop over the number of iterations
        for i in range(m):                                # Loop over the training examples
            
            ### START CODE HERE ### (≈ 4 lines of code)
            # Average the word vectors of the words from the i'th training example
            avg = sentence_to_avg(X[i], word_to_vec_map)

            # Forward propagate the avg through the softmax layer
            z = np.dot(W, avg) + b
            a = softmax(z)

            # Compute cost using the i'th training label's one hot representation and "A" (the output of the softmax)
            cost = -np.sum(Y_oh[i] * np.log(a))
            ### END CODE HERE ###
            
            # Compute gradients 
            dz = a - Y_oh[i]
            dW = np.dot(dz.reshape(n_y, 1), avg.reshape(1, n_h))
            db = dz

            # Update parameters with Stochastic Gradient Descent
            W = W - learning_rate * dW
            b = b - learning_rate * db
        
        if t % 100 == 0:
            print("Epoch: " + str(t) + " --- cost = " + str(cost))
            pred = predict(X, Y, W, b, word_to_vec_map)

    return pred, W, b
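The baseline is then trained with a call consistent with the function above (shown as a sketch, since the printed cost values depend on the run):

pred, W, b = model(X_train, Y_train, word_to_vec_map)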

1.4 Testing on the training set

print("Training set:")
pred_train = predict(X_train, Y_train, W, b, word_to_vec_map)
print('Test set:')
pred_test = predict(X_test, Y_test, W, b, word_to_vec_map)

Output:

Training set:
Accuracy: 0.9772727272727273
Test set:
Accuracy: 0.8571428571428571

Random guessing would give about 20% accuracy on average (1 out of 5 classes), so the model performs quite well given only 127 training examples.

Let's test it:

  • The training set contains "I love you" with the label ❤️
  • Let's check what happens with "adore", a word that never appears in the training set
X_my_sentences = np.array(["i adore you", "i love you", "funny lol", "lets play with a ball", "food is ready", "not feeling happy"])
Y_my_labels = np.array([[0], [0], [2], [1], [4], [3]])
pred = predict(X_my_sentences, Y_my_labels, W, b, word_to_vec_map)
print_predictions(X_my_sentences, pred)

Output:

Accuracy: 0.8333333333333334  (5/6 — the last one is wrong)

i adore you ❤️  (adore has an embedding similar to love)
i love you ❤️
funny lol 😄
lets play with a ball ⚾
food is ready 🍴
not feeling happy 😄  (misclassified — the model cannot capture word combinations like "not")

Inspecting the errors:
Printing a confusion matrix helps you see which examples the model predicts incorrectly.
A confusion matrix shows how often examples whose true label is one class are mislabeled by the algorithm as a different class.

import pandas as pd

print(Y_test.shape)
print('           '+ label_to_emoji(0)+ '    ' + label_to_emoji(1) + '    ' +  label_to_emoji(2)+ '    ' + label_to_emoji(3)+'   ' + label_to_emoji(4))
print(pd.crosstab(Y_test, pred_test.reshape(56,), rownames=['Actual'], colnames=['Predicted'], margins=True))
plot_confusion_matrix(Y_test, pred_test)


2. Emojifier-V2: Using LSTMs in Keras

Let's build an LSTM model that takes word sequences as input. This model is able to take word ordering into account.
Emojifier-V2 still uses pre-trained word embeddings to represent words; it feeds them into an LSTM whose job is to predict the most appropriate emoji.

  • Import some packages
import numpy as np
np.random.seed(0)
from keras.models import Model
from keras.layers import Dense, Input, Dropout, LSTM, Activation
from keras.layers.embeddings import Embedding
from keras.preprocessing import sequence
from keras.initializers import glorot_uniform
np.random.seed(1)

2.1 Model overview

2.2 Keras and mini-batching

To train on mini-batches, all sentences in a batch must have the same length: sentences shorter than the maximum length are padded at the end with zero vectors, e.g. $(e_{i}, e_{love}, e_{you}, \vec{0}, \vec{0}, \ldots, \vec{0})$.
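A minimal sketch of the padding idea, using made-up word indices (the real index mapping is built by sentences_to_indices below):

import numpy as np

max_len = 5
sentences_as_indices = [[12, 7, 43], [5, 98, 3, 27]]   # two sentences of different lengths (hypothetical indices)

padded = np.zeros((len(sentences_as_indices), max_len))
for i, idxs in enumerate(sentences_as_indices):
    padded[i, :len(idxs)] = idxs                       # remaining positions stay 0, i.e. zero padding

print(padded)
# [[12.  7. 43.  0.  0.]
#  [ 5. 98.  3. 27.  0.]]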

2.3 Embedding layer

https://keras.io/zh/layers/embeddings/

  • First map every word of every sentence to its vocabulary index
# GRADED FUNCTION: sentences_to_indices

def sentences_to_indices(X, word_to_index, max_len):
    """
    Converts an array of sentences (strings) into an array of indices corresponding to words in the sentences.
    The output shape should be such that it can be given to `Embedding()` (described in Figure 4). 
    
    Arguments:
    X -- array of sentences (strings), of shape (m, 1)
    word_to_index -- a dictionary containing the each word mapped to its index
    max_len -- maximum number of words in a sentence. You can assume every sentence in X is no longer than this. 
    
    Returns:
    X_indices -- array of indices corresponding to words in the sentences from X, of shape (m, max_len)
    """
    
    m = X.shape[0]                                   # number of training examples
    
    ### START CODE HERE ###
    # Initialize X_indices as a numpy matrix of zeros and the correct shape (≈ 1 line)
    X_indices = np.zeros((m, max_len))
    
    for i in range(m):                               # loop over training examples
        
        # Convert the ith training sentence in lower case and split it into words. You should get a list of words.
        sentence_words = X[i].lower().split()
        
        # Initialize j to 0
        j = 0
        
        # Loop over the words of sentence_words
        for w in sentence_words:
            # Set the (i,j)th entry of X_indices to the index of the correct word.
            X_indices[i, j] = word_to_index[w]
            # Increment j to j + 1
            j = j + 1
            
    ### END CODE HERE ###
    
    return X_indices
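A quick check (a usage sketch; the actual index values depend on the GloVe vocabulary):

X1 = np.array(["funny lol", "lets play baseball", "food is ready for you"])
X1_indices = sentences_to_indices(X1, word_to_index, max_len=5)
print("X1_indices shape =", X1_indices.shape)   # (3, 5), zero-padded on the right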

Implement pretrained_embedding_layer()

  • Initialize the embedding matrix, paying attention to its shape
  • Fill the embedding matrix with the vectors extracted from word_to_vec_map
  • Define the Keras Embedding layer and make sure to set trainable=False so the layer cannot be trained; if it were True, the algorithm would be allowed to modify the word embedding values
  • Set the layer's weights to the embedding matrix
# GRADED FUNCTION: pretrained_embedding_layer

def pretrained_embedding_layer(word_to_vec_map, word_to_index):
    """
    Creates a Keras Embedding() layer and loads in pre-trained GloVe 50-dimensional vectors.
    
    Arguments:
    word_to_vec_map -- dictionary mapping words to their GloVe vector representation.
    word_to_index -- dictionary mapping from words to their indices in the vocabulary (400,001 words)

    Returns:
    embedding_layer -- pretrained layer Keras instance
    """
    
    vocab_len = len(word_to_index) + 1                  # adding 1 to fit Keras embedding (requirement)
    emb_dim = word_to_vec_map["cucumber"].shape[0]      # define dimensionality of your GloVe word vectors (= 50)
    
    ### START CODE HERE ###
    # Initialize the embedding matrix as a numpy array of zeros of shape (vocab_len, dimensions of word vectors = emb_dim)
    emb_matrix = np.zeros((vocab_len, emb_dim))
    
    # Set each row "index" of the embedding matrix to be the word vector representation of the "index"th word of the vocabulary
    for word, index in word_to_index.items():
        emb_matrix[index, :] = word_to_vec_map[word]

    # Define Keras embedding layer with the correct output/input sizes, make it non-trainable. Use Embedding(...). Make sure to set trainable=False. 
    embedding_layer = Embedding(vocab_len, emb_dim, trainable=False)
    ### END CODE HERE ###

    # Build the embedding layer, it is required before setting the weights of the embedding layer. Do not modify the "None".
    embedding_layer.build((None,))
    
    # Set the weights of the embedding layer to the embedding matrix. Your layer is now pretrained.
    embedding_layer.set_weights([emb_matrix])
    
    return embedding_layer
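A quick usage check (a sketch; the stored weight values depend on the GloVe file):

embedding_layer = pretrained_embedding_layer(word_to_vec_map, word_to_index)
print("weights shape =", embedding_layer.get_weights()[0].shape)   # (400001, 50): 400,000 GloVe words + 1 padding row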

2.4 Building Emojifier-V2


https://keras.io/zh/layers/core/#input
https://keras.io/zh/layers/embeddings/#embedding
https://keras.io/zh/layers/recurrent/#lstm
https://keras.io/zh/layers/core/#dropout
https://keras.io/zh/layers/core/#dense
https://keras.io/zh/activations/
https://keras.io/zh/models/about-keras-models/#model

# GRADED FUNCTION: Emojify_V2

def Emojify_V2(input_shape, word_to_vec_map, word_to_index):
    """
    Function creating the Emojify-v2 model's graph.
    
    Arguments:
    input_shape -- shape of the input, usually (max_len,)
    word_to_vec_map -- dictionary mapping every word in a vocabulary into its 50-dimensional vector representation
    word_to_index -- dictionary mapping from words to their indices in the vocabulary (400,001 words)

    Returns:
    model -- a model instance in Keras
    """
    
    ### START CODE HERE ###
    # Define sentence_indices as the input of the graph, it should be of shape input_shape and dtype 'int32' (as it contains indices).
    sentence_indices = Input(input_shape, dtype='int32')
    
    # Create the embedding layer pretrained with GloVe Vectors (≈1 line)
    embedding_layer = pretrained_embedding_layer(word_to_vec_map, word_to_index)
    
    # Propagate sentence_indices through your embedding layer, you get back the embeddings
    embeddings = embedding_layer(sentence_indices)
    
    # Propagate the embeddings through an LSTM layer with 128-dimensional hidden state
    # Be careful, the returned output should be a batch of sequences.
    X = LSTM(128, return_sequences=True)(embeddings)
    # Add dropout with a probability of 0.5
    X = Dropout(rate=0.5)(X)
    # Propagate X through another LSTM layer with 128-dimensional hidden state
    # Be careful, the returned output should be a single hidden state, not a batch of sequences.
    X = LSTM(128, return_sequences=False)(X)
    # Add dropout with a probability of 0.5
    X = Dropout(rate=0.5)(X)
    # Propagate X through a Dense layer with softmax activation to get back a batch of 5-dimensional vectors.
    X = Dense(5)(X)
    # Add a softmax activation
    X = Activation('softmax')(X)
    
    # Create Model instance which converts sentence_indices into X.
    model = Model(inputs=sentence_indices, outputs=X)
    
    ### END CODE HERE ###
    
    return model
  • Create the model
model = Emojify_V2((maxLen,), word_to_vec_map, word_to_index)
model.summary()

Output:

Model: "model_1"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
input_3 (InputLayer)         (None, 10)                0         
_________________________________________________________________
embedding_4 (Embedding)      (None, 10, 50)            20000050  
_________________________________________________________________
lstm_3 (LSTM)                (None, 10, 128)           91648     
_________________________________________________________________
dropout_1 (Dropout)          (None, 10, 128)           0         
_________________________________________________________________
lstm_4 (LSTM)                (None, 128)               131584    
_________________________________________________________________
dropout_2 (Dropout)          (None, 128)               0         
_________________________________________________________________
dense_1 (Dense)              (None, 5)                 645       
_________________________________________________________________
activation_1 (Activation)    (None, 5)                 0         
=================================================================
Total params: 20,223,927
Trainable params: 223,877
Non-trainable params: 20,000,050  (note: 400,001 vocabulary entries × 50 embedding dimensions)
_________________________________________________________________
  • Compile the model
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
  • Train the model

Convert X and Y to the required format

X_train_indices = sentences_to_indices(X_train, word_to_index, maxLen)
Y_train_oh = convert_to_one_hot(Y_train, C = 5)

Train

model.fit(X_train_indices, Y_train_oh, epochs = 50, batch_size = 32, shuffle=True)

Output:

WARNING:tensorflow:From c:\program files\python37\lib\site-packages\keras\backend\tensorflow_backend.py:422: The name tf.global_variables is deprecated. Please use tf.compat.v1.global_variables instead.

Epoch 1/50
132/132 [==============================] - 1s 5ms/step - loss: 1.6088 - accuracy: 0.1970
Epoch 2/50
132/132 [==============================] - 0s 582us/step - loss: 1.5221 - accuracy: 0.3636
Epoch 3/50
132/132 [==============================] - 0s 574us/step - loss: 1.4762 - accuracy: 0.3939
(omitted)
Epoch 49/50
132/132 [==============================] - 0s 597us/step - loss: 0.0115 - accuracy: 1.0000
Epoch 50/50
132/132 [==============================] - 0s 582us/step - loss: 0.0182 - accuracy: 0.9924

Accuracy on the training set is almost 100%.

  • Evaluate on the test set
X_test_indices = sentences_to_indices(X_test, word_to_index, max_len = maxLen)
Y_test_oh = convert_to_one_hot(Y_test, C = 5)
loss, acc = model.evaluate(X_test_indices, Y_test_oh)
print()
print("Test accuracy = ", acc)

Output:

56/56 [==============================] - 0s 2ms/step

Test accuracy =  0.875

Accuracy on the test set is 87.5%.

  • Look at the mislabelled examples
# This code allows you to see the mislabelled examples
C = 5
y_test_oh = np.eye(C)[Y_test.reshape(-1)]
X_test_indices = sentences_to_indices(X_test, word_to_index, maxLen)
pred = model.predict(X_test_indices)
for i in range(len(X_test)):
    x = X_test_indices
    num = np.argmax(pred[i])
    if num != Y_test[i]:
        print('Expected emoji:' + label_to_emoji(Y_test[i]) + ' prediction: ' + X_test[i] + label_to_emoji(num).strip())

Output:

Expected emoji:😞 prediction: work is hard 😄
Expected emoji:😞 prediction: This girl is messing with me ❤️
Expected emoji:😞 prediction: work is horrible 😄
Expected emoji:🍴 prediction: any suggestions for dinner 😄
Expected emoji:😄 prediction: you brighten my day ❤️
Expected emoji:😞 prediction: go away ⚾
Expected emoji:🍴 prediction: I did not have breakfast ❤️

  • Test with your own examples
# Change the sentence below to see your prediction. Make sure all the words are in the Glove embeddings.  
x_test = np.array(['not feeling happy'])
X_test_indices = sentences_to_indices(x_test, word_to_index, maxLen)
print(x_test[0] +' '+  label_to_emoji(np.argmax(model.predict(X_test_indices))))

not feeling happy 😞  (this time the LSTM handles word combinations like "not")
not very happy 😞
very happy 😄
i really love my wife ❤️


Summary:

  • If you have an NLP task with a small training set, using word embeddings can help your algorithm significantly. Word embeddings allow your model to work on words in the test set that never appeared in the training set.
  • Training sequence models in Keras (and most other deep learning frameworks) requires a few important details:
  1. To use mini-batches, the sequences need to be padded so that all examples in a mini-batch have the same length
  2. An Embedding() layer can be initialized with pre-trained values. These values can be either fixed or trained further on your dataset; if your dataset is small, further training is usually not worth it
  3. LSTM() has a flag called return_sequences that decides whether to return every hidden state or only the last one (see the sketch after this list)
  4. You can use Dropout() right after LSTM() to regularize the network
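A small sketch of the return_sequences flag, assuming the same standalone Keras imports used earlier in this post:

from keras import backend as K
from keras.layers import Input, LSTM

x = Input(shape=(10, 50))                        # a batch of length-10 sequences of 50-d embeddings
seq = LSTM(128, return_sequences=True)(x)        # one hidden state per timestep
last = LSTM(128, return_sequences=False)(x)      # only the last hidden state

print(K.int_shape(seq))    # (None, 10, 128)
print(K.int_shape(last))   # (None, 128)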

Post URL: https://michael.blog.csdn.net/article/details/108902060

My CSDN blog: https://michael.blog.csdn.net/


