Contents:
Self-attention (1)
Self-attention (2)
1 Recap
Take the computation of $b^2$ as an example:
query: $q^1 = W^q a^1$, $q^2 = W^q a^2$, $q^3 = W^q a^3$, $q^4 = W^q a^4$;
key: $k^1 = W^k a^1$, $k^2 = W^k a^2$, $k^3 = W^k a^3$, $k^4 = W^k a^4$;
value: $v^1 = W^v a^1$, $v^2 = W^v a^2$, $v^3 = W^v a^3$, $v^4 = W^v a^4$;
attention score: $\alpha_{2,1} = q^2 \cdot k^1$, $\alpha_{2,2} = q^2 \cdot k^2$, $\alpha_{2,3} = q^2 \cdot k^3$, $\alpha_{2,4} = q^2 \cdot k^4$;
Soft-max: $\alpha'_{2,1} = \frac{\exp(\alpha_{2,1})}{\sum_j \exp(\alpha_{2,j})}$, $\alpha'_{2,2} = \frac{\exp(\alpha_{2,2})}{\sum_j \exp(\alpha_{2,j})}$, $\alpha'_{2,3} = \frac{\exp(\alpha_{2,3})}{\sum_j \exp(\alpha_{2,j})}$, $\alpha'_{2,4} = \frac{\exp(\alpha_{2,4})}{\sum_j \exp(\alpha_{2,j})}$;
$b^2 = \alpha'_{2,1} v^1 + \alpha'_{2,2} v^2 + \alpha'_{2,3} v^3 + \alpha'_{2,4} v^4 = \sum_i \alpha'_{2,i} v^i$.
Q: What are $a^1, \dots, a^4$?
A: They are the input sequence of vectors, e.g. the encoded tokens of "I saw a saw".
Q: What are $W^q$, $W^k$, $W^v$?
A: Matrices; their entries are learned from data.
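To make the recipe above concrete, here is a minimal NumPy sketch of the computation of $b^2$. The dimensions, the random inputs $a^1, \dots, a^4$, and the random stand-ins for the learned matrices $W^q$, $W^k$, $W^v$ are all hypothetical; only the sequence of operations follows the steps listed above.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: input dimension, key/query dimension, value dimension
d_model, d_k, d_v = 6, 4, 4

# Input vectors a^1..a^4, e.g. the encoded tokens of "I saw a saw"
a = [rng.standard_normal(d_model) for _ in range(4)]

# Random stand-ins for the learned matrices W^q, W^k, W^v
W_q = rng.standard_normal((d_k, d_model))
W_k = rng.standard_normal((d_k, d_model))
W_v = rng.standard_normal((d_v, d_model))

q = [W_q @ ai for ai in a]  # q^i = W^q a^i
k = [W_k @ ai for ai in a]  # k^i = W^k a^i
v = [W_v @ ai for ai in a]  # v^i = W^v a^i

# Attention scores for query 2 (index 1): alpha_{2,i} = q^2 . k^i
alpha = np.array([q[1] @ ki for ki in k])

# Soft-max over the four scores: alpha'_{2,i}
alpha_prime = np.exp(alpha) / np.exp(alpha).sum()

# Weighted sum of values: b^2 = sum_i alpha'_{2,i} v^i
b2 = sum(ap * vi for ap, vi in zip(alpha_prime, v))
print(b2)
```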
Next, we revisit the self-attention computation using matrix operations. Stack the input vectors $a^1, \dots, a^4$ as the columns of a matrix $I$; then:
Query matrix: $Q = W^q I$;
Key matrix: $K = W^k I$;
Value matrix: $V = W^v I$.
Attention score matrix: $A = K^T Q$;
Applying Soft-max: $A' = \mathrm{softmax}(A)$, taken over each column of $A$;
Output matrix: $O = V A'$.
The only parameters that need to be learned are $W^q$, $W^k$, and $W^v$.
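The matrix form can be checked against the vector-by-vector recipe. Below is a minimal NumPy sketch, again with hypothetical dimensions and random stand-ins for $I$, $W^q$, $W^k$, $W^v$, that computes $O = VA'$ and verifies that its second column equals $b^2$ computed step by step:

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_k, d_v, seq_len = 6, 4, 4, 4  # hypothetical sizes

I = rng.standard_normal((d_model, seq_len))  # columns are a^1..a^4
W_q = rng.standard_normal((d_k, d_model))    # random stand-ins for the
W_k = rng.standard_normal((d_k, d_model))    # learned matrices
W_v = rng.standard_normal((d_v, d_model))

Q = W_q @ I  # column i is q^i
K = W_k @ I  # column i is k^i
V = W_v @ I  # column i is v^i

A = K.T @ Q  # A[i, j] = k^i . q^j: column j holds the scores for query q^j
A_prime = np.exp(A) / np.exp(A).sum(axis=0, keepdims=True)  # column-wise soft-max
O = V @ A_prime  # column j is b^j

# Cross-check column 2 (index 1) against the step-by-step recipe for b^2
alpha = np.array([K[:, i] @ Q[:, 1] for i in range(seq_len)])
alpha_prime = np.exp(alpha) / np.exp(alpha).sum()
b2 = V @ alpha_prime
assert np.allclose(O[:, 1], b2)
print(O)
```

Note that the soft-max is applied to each column of $A$ independently, since column $j$ holds the scores $\alpha_{j,1}, \dots, \alpha_{j,4}$ belonging to query $q^j$.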