[Deep Learning] Implementing Multi-Head Attention | PyTorch

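About the author: a hard-working undergraduate in Computer Science and Technology who enrolled in 2022 🌸
Author's homepage: @Yaoyao2024
Previous post: [Deep Learning] Attention Mechanism | Encoding Based on "Context", Replacing Clunky Fully Connected Layers with Smarter Matrix Multiplication
Quote of the day 🌼: "The road ahead is long and far; I shall search high and low." (Qu Yuan) 🌺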

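0. Preface

In the previous post, we gave a systematic and detailed introduction to the attention mechanism and its mathematical principles. In this post, we focus on the code implementation of multi-head attention.

The code in this post again comes from the worksheet provided by a YouTube creator: https://github.com/kilianmandon/alphafold-decoded.git

Following the worksheet, we will implement the following in turn:

MultiHeadAttention: multi-head attention
Gated MultiHeadAttention: multi-head attention with gating
Global Gated MultiHeadAttention: global attention with gating

Finally, they are combined into a single attention module, and constructor arguments select which variant to use. This post explains things mainly from the code side, so that readers who are not very familiar with Python and PyTorch can follow the code too.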

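1. Model initialization and q/k/v preparation

[Figure: attention module diagram showing the input of shape (N, c_i) and the bias added to the attention scores]

1.1 def __init__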

import math

import torch
from torch import nn


class MultiHeadAttention(nn.Module):
    """
    A MultiHeadAttention module with optional bias and optional gating.
    """

    def __init__(self, c_in, c, N_head, attn_dim, gated=False, is_global=False, use_bias_for_embeddings=False):
        """
        Initializes the module. MultiHeadAttention theoretically consists of
        N_head separate linear layers for the query, key and value embeddings.
        However, the embeddings can be computed jointly and split afterwards,
        so we only need one query, key and value layer with larger c_out.

        Args:
            c_in (int): Input dimension for the embeddings.
            c (int): Embedding dimension for each individual head.
            N_head (int): Number of heads.
            attn_dim (int): The dimension in the input tensor along which
                the attention mechanism is performed.
            gated (bool, optional): If True, an additional sigmoid-activated
                linear layer will be multiplied against the weighted
                value vectors before feeding them through the output layer.
                Defaults to False.
            is_global (bool, optional): If True, global calculation will be performed.
                For global calculation, key and value embeddings will only use one head,
                and the q query vectors will be averaged to one query vector.
                Defaults to False.
            use_bias_for_embeddings (bool, optional): If True, query,
                key, and value embeddings will use bias, otherwise not.
                Defaults to False.
        """
        super().__init__()

        self.c_in = c_in
        self.c = c
        self.N_head = N_head
        self.gated = gated
        self.attn_dim = attn_dim
        self.is_global = is_global

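First, the initializer takes the following parameters:

c_in: the input feature dimension
c: the feature dimension of each attention head
N_head: the number of attention heads
attn_dim ⭐: the index of the dimension along which attention is computed; the module relates the elements along this dimension to one another. For example, for the input word sequence Input (N, c_i) in the figure above, N is the number of tokens (the sequence length), and the attention module computes the relations between tokens along this dimension.
🌸 When computing the dot-product affinities, dot products are taken between the query and key vectors at different positions along this dimension. They measure how relevant the positions are to each other and thereby determine the attention weights. For example, how strongly one word in a sentence relates to the other words is computed along this dimension.
gated: whether to use the gating mechanism
is_global: whether to use global attention. In global mode, key and value have only a single head after their linear projections. The query is still projected with multiple heads, but later, during q/k/v preparation, it is averaged over the query positions (along attn_dim), leaving one query vector per head.
use_bias_for_embeddings: whether the Q/K/V linear projections use a bias (that is, whether the Linear layers have a bias term; this is different from the bias added to the attention scores in the figure above).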

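Key components

Linear projection layers:
- linear_q: produces the query vectors; output dimension c*N_head
- linear_k: produces the key vectors; output dimension c in global mode, otherwise c*N_head
- linear_v: produces the value vectors; same output dimension as linear_k
- linear_o: output projection that maps the concatenated heads back to c_in
Gating layer (optional):
- linear_g: produces the gating signal, passed through a sigmoid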

        ##########################################################################
        # TODO: Initialize the query, key, value and output layers.              #
        #   Whether or not query, key, and value layers use bias is determined   #
        #   by `use_bias` (False for AlphaFold). The output layer should always  #
        #   use a bias. If gated is true, initialize another linear with bias.   #
        #   For compatibility use the names linear_q, linear_k, linear_v,        #
        #   linear_o and linear_g.                                               #
        ##########################################################################

In the initializer, we mainly create the linear layers at the input and output of the module:

        self.linear_q = nn.Linear(c_in, c*N_head, bias=use_bias_for_embeddings)

        c_kv = c if is_global else c*N_head
        self.linear_k = nn.Linear(c_in, c_kv, bias=use_bias_for_embeddings)
        self.linear_v = nn.Linear(c_in, c_kv, bias=use_bias_for_embeddings)

        self.linear_o = nn.Linear(c*N_head, c_in)

        if gated:
            self.linear_g = nn.Linear(c_in, c*N_head)

That is the whole implementation; PyTorch's nn.Linear is all we need. When I first got to this point, I did not really understand how the is_global case is handled:

If True, global calculation will be performed.
For global calculation, key and value embeddings will only use one head,
and the q query vectors will be averaged to one query vector.
Defaults to False.

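Roughly: k and v use a single head, while q keeps multiple heads (and is then averaged along the attn_dim dimension before the dot product with k that produces the attention scores).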
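To make the effect of is_global on the projection sizes concrete, here is a minimal sketch. The sizes and the helper name kv_out_features are made up for illustration; only the rule `c if is_global else c*N_head` comes from the code above.

```python
import torch
from torch import nn

c_in, c, N_head = 32, 8, 4  # toy sizes, chosen only for illustration


def kv_out_features(is_global: bool) -> int:
    # Hypothetical helper mirroring `c_kv = c if is_global else c*N_head`.
    return c if is_global else c * N_head


x = torch.randn(10, c_in)  # 10 tokens with c_in features each

linear_q = nn.Linear(c_in, c * N_head, bias=False)
linear_k_local = nn.Linear(c_in, kv_out_features(False), bias=False)
linear_k_global = nn.Linear(c_in, kv_out_features(True), bias=False)

q_shape = tuple(linear_q(x).shape)                # (10, 32): all heads at once
k_local_shape = tuple(linear_k_local(x).shape)    # (10, 32): one slice per head
k_global_shape = tuple(linear_k_global(x).shape)  # (10, 8): a single shared head
print(q_shape, k_local_shape, k_global_shape)
```

In global mode the key (and value) projection is N_head times narrower, while the query projection is unchanged.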

1.2 prepare_qkv

Preparing q, k and v for non-global attention:

    def prepare_qkv(self, q: torch.Tensor, k: torch.Tensor, v: torch.Tensor):
        """
        Splits the embeddings into individual heads and transforms the input
        shapes of form (*, q/k/v, *, N_head*c) into the shape
        (*, N_head, q/k/v, c). The position of the q/k/v dimension
        in the original tensors is given by attn_dim.

        Args:
            q (torch.Tensor): Query embedding of shape (*, q, *, N_head*c).
            k (torch.Tensor): Key embedding of shape (*, k, *, N_head*c).
            v (torch.Tensor): Value embedding of shape (*, v, *, N_head*c).

        Returns:
            tuple: The rearranged embeddings q, k, and v of
                shape (*, N_head, q/k/v, c) respectively.
        """

        ##########################################################################
        # TODO: Rearrange the tensors with the following changes:                #
        #   - (*, q/k/v, *, N_head*c) -> (*, q/k/v, N_head*c) with movedim       #
        #   - (*, q/k/v, N_head*c) -> (*, q/k/v, N_head, c)                      #
        #   - (*, q/k/v, N_head, c) -> (*, N_head, q/k/v, c)                     #
        ##########################################################################

        # Transposing to [*, q/k/v, N_head*c]
        q = q.movedim(self.attn_dim, -2)
        k = k.movedim(self.attn_dim, -2)
        v = v.movedim(self.attn_dim, -2)

        # Unwrapping to [*, q/k/v, N_head, c]
        q_shape = q.shape[:-1] + (self.N_head, -1)
        k_shape = k.shape[:-1] + (self.N_head, -1)
        v_shape = v.shape[:-1] + (self.N_head, -1)

        q = q.view(q_shape)
        k = k.view(k_shape)
        v = v.view(v_shape)

        # Transposing to [*, N_head, q/k/v, c]
        q = q.transpose(-2, -3)
        k = k.transpose(-2, -3)
        v = v.transpose(-2, -3)

        ##########################################################################
        #               END OF YOUR CODE                                         #
        ##########################################################################

        return q, k, v

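1. Move the attn_dim dimension to the second-to-last position

self.attn_dim is the position of the query/key/value dimension in the original tensor.
movedim moves the attn_dim dimension to the second-to-last position, which makes the reshaping that follows easier. After this step the tensor has shape (*, q/k/v, N_head*c).

In standard multi-head attention, the head dimension is usually placed third from last, so that the last two dimensions are the sequence positions and the per-head embedding. Moving attn_dim to the second-to-last position first and then doing the remaining rearrangements produces exactly this layout, which keeps the attention computation that follows simple.

2. Split the N_head*c dimension into N_head and c

First, q.shape[:-1] + (self.N_head, -1) builds the new shape tuple, splitting the last dimension N_head*c into N_head and c. The -1 tells PyTorch to infer the size of that dimension automatically.
Then view reshapes the tensor to (*, q/k/v, N_head, c).

3. Swap the second-to-last and third-to-last dimensions

transpose swaps two dimensions of a tensor. Swapping the second-to-last and third-to-last dimensions moves N_head to the third-to-last position, giving a tensor of shape (*, N_head, q/k/v, c).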
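The three steps can be traced on a toy tensor. This is only a sketch: the sizes and attn_dim = -3 are assumptions picked for illustration, and the split is written with an explicit c instead of -1.

```python
import torch

N_head, c = 4, 8
attn_dim = -3  # assumed for this example: the q axis sits third from last

# Toy query embedding of shape (batch, q, extra, N_head*c) = (2, 5, 3, 32)
q = torch.randn(2, 5, 3, N_head * c)

step1 = q.movedim(attn_dim, -2)                      # (2, 3, 5, 32)
step2 = step1.view(*step1.shape[:-1], N_head, c)     # (2, 3, 5, 4, 8)
step3 = step2.transpose(-2, -3)                      # (2, 3, 4, 5, 8)

print(tuple(step1.shape), tuple(step2.shape), tuple(step3.shape))

# Head h of a token is just the slice [h*c:(h+1)*c] of its original embedding.
same = torch.equal(step3[0, 1, 2, 4], q[0, 4, 1, 2 * c:3 * c])
print(same)
```

The final check shows that no values are mixed: the rearrangement only regroups each token's embedding into per-head slices.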

1.3 prepare_qkv_global

    def prepare_qkv_global(self, q, k, v):
        """
        Prepares the query, key and value embeddings with the following
        differences to the non-global version:
            - key and value embeddings use only one head.
            - the query vectors are contracted into one, average query vector.

        Args:
            q (torch.tensor): Query embeddings of shape (*, q, *, N_head*c).
            k (torch.tensor): Key embeddings of shape (*, k, *, c).
            v (torch.tensor): Value embeddings of shape (*, v, *, c).

        Returns:
            tuple: The rearranged embeddings q, k, and v of
                shape (*, N_head, 1, c) for q and shape (*, 1, k, c) for k and v.
        """

        ##########################################################################
        # TODO: Rearrange the tensors to match the output dimensions. Use        #
        #   torch.mean for the contraction of q at the end of this function.     #
        ##########################################################################

        q = q.movedim(self.attn_dim, -2)
        k = k.movedim(self.attn_dim, -2)
        v = v.movedim(self.attn_dim, -2)

        q_shape = q.shape[:-1] + (self.N_head, self.c)
        q = q.view(q_shape)

        q = q.transpose(-2, -3)
        k = k.unsqueeze(-3)
        v = v.unsqueeze(-3)

        q = torch.mean(q, dim=-2, keepdim=True)

        ##########################################################################
        #               END OF YOUR CODE                                         #
        ##########################################################################

        return q, k, v

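As mentioned in the initialization above, k and v are single-headed, so N_head does not need to be considered for them here. For q, however, it does.

The second difference is that q is averaged along attn_dim at the end, so that each head has only a single query vector to multiply with the keys when computing the attention scores.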
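A small shape check for the global variant, under assumed toy sizes (attn_dim = -2 here, so movedim is a no-op). It also shows that the einsum used later in forward broadcasts the single key head against all query heads.

```python
import torch

N_head, c = 4, 8
attn_dim = -2  # assumed for this example

q = torch.randn(2, 5, N_head * c)  # (batch, q, N_head*c)
k = torch.randn(2, 5, c)           # (batch, k, c): a single head

q = q.movedim(attn_dim, -2)
q = q.view(*q.shape[:-1], N_head, c).transpose(-2, -3)   # (2, 4, 5, 8)
q_mean = q.mean(dim=-2, keepdim=True)                     # (2, 4, 1, 8)
k_g = k.movedim(attn_dim, -2).unsqueeze(-3)               # (2, 1, 5, 8)

# The size-1 head axis of k broadcasts against the N_head axis of q.
scores = torch.einsum('...qc,...kc->...qk', q_mean, k_g)  # (2, 4, 1, 5)
print(tuple(q_mean.shape), tuple(k_g.shape), tuple(scores.shape))
```

Each head ends up with a single row of scores over the 5 keys instead of a 5 x 5 matrix.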

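1.4 Explanation: q/k/v under the global option

At this point we have covered the initialization of multi-head attention and the preparation of q/k/v. One question still bothered me:

Why does global attention use a single head only for k/v while q keeps multiple heads? I later realized that I had not fully understood the roles that q, k and v play.

To draw an analogy with CNNs, q plays the role of the convolution kernel, while k and v both represent information from the original data. Only with different kernels can a model extract a variety of features. The same holds here: only with different queries can the model capture diverse information from different angles. k/v do not need multiple heads, because their job is essentially to supply the information to match against and the features to aggregate. A single head is enough to provide that key information; multiple heads may introduce a lot of repeated or similar information and waste resources, whereas a single head supplies what is needed more efficiently (mainly because it significantly reduces the number of operations such as linear projections).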

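1. Core purpose: reducing computation
The core idea of global attention is to compress the sequence-level global information into a single "summary vector", which avoids computing the huge $N \times N$ attention matrix ($N$ is the sequence length).

Single-head Key/Value: all attention heads share the same set of keys/values, which amounts to building one "global memory pool" with a single head.
- Because each head also has just one averaged query, the score tensor shrinks from shape (N_head, N, N) to (N_head, 1, N), so the cost of computing the scores drops from $O(N^2 \cdot H)$ to $O(N \cdot H)$ ($H$ is the number of heads).
Multi-head Query: the multi-head design is kept so that different heads can "query" this global memory pool from different angles, preserving feature diversity.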

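2. Why does Query need multiple heads?
Even though Key/Value are shared globally, different attention heads can still attend to different global patterns:

Example (protein sequences):
- Head 1 might attend to the global distribution of "conserved residues".
- Head 2 might attend to the global density of "hydrophobic residues".
- Head 3 might attend to the periodicity of "secondary structure" (such as α-helices).
Mathematically:
Several queries attending over the same set of keys/values still produce different weighted results, because the query vectors differ.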
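The mathematical point can be checked numerically. In this sketch (random toy tensors, sizes chosen arbitrarily), three heads each bring their own query but share one set of keys and values.

```python
import torch

torch.manual_seed(0)
N, c, N_head = 6, 4, 3

k = torch.randn(1, N, c)       # one shared key set, shape (1, k, c)
v = torch.randn(1, N, c)       # one shared value set, shape (1, k, c)
q = torch.randn(N_head, 1, c)  # one query vector per head, shape (N_head, 1, c)

scores = torch.einsum('...qc,...kc->...qk', q, k) / c ** 0.5   # (3, 1, 6)
weights = torch.softmax(scores, dim=-1)
out = torch.einsum('...qk,...kc->...qc', weights, v)           # (3, 1, 4)

# Every head has a valid distribution over the same 6 keys ...
print(weights.sum(dim=-1).flatten())
# ... but the weighted results differ because the queries differ.
heads_differ = not torch.allclose(out[0], out[1])
print(heads_differ)
```

Sharing k and v therefore does not collapse the heads into one; the diversity comes from the queries.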

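3. Why can Key/Value use a single head?

Information-redundancy assumption:
For very long sequences, the global features carried by Key/Value (such as the overall folding pattern of a protein) usually do not need to be encoded from multiple views; one unified representation is enough.
Computational efficiency:
The Key/Value matrices shrink from $N \times (H \cdot d_k)$ to $N \times d_k$, which greatly reduces memory usage.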
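As a rough back-of-the-envelope check of these savings, the sketch below simply counts tensor entries for one sequence. The function names and the example sizes are hypothetical; the counts follow from the tensor shapes produced by prepare_qkv and prepare_qkv_global.

```python
def score_elements(n_tokens: int, n_heads: int, is_global: bool) -> int:
    """Entries in the attention score tensor (N_head, q, k) for one sequence."""
    n_queries = 1 if is_global else n_tokens  # global mode keeps one averaged query
    return n_heads * n_queries * n_tokens


def kv_elements(n_tokens: int, n_heads: int, c: int, is_global: bool) -> int:
    """Entries in the key (or value) embedding for one sequence."""
    width = c if is_global else n_heads * c  # global mode uses a single head
    return n_tokens * width


N, H, c = 1000, 8, 8  # made-up sizes

print(score_elements(N, H, is_global=False))  # 8000000
print(score_elements(N, H, is_global=True))   # 8000
print(kv_elements(N, H, c, is_global=False))  # 64000
print(kv_elements(N, H, c, is_global=True))   # 8000
```

With these numbers the score tensor is N times smaller and the key/value embeddings are H times smaller in global mode.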

2. Forward

    def forward(self, x, bias=None, attention_mask=None):
        """
        Forward pass through the MultiHeadAttention module.

        Args:
            x (torch.tensor): Input tensor of shape (*, q/k/v, *, c_in).
            bias (torch.tensor, optional): Optional bias tensor of shape
                (*, N_head, q, k) that will be added to the attention weights.
                Defaults to None.
            attention_mask (torch.tensor, optional): Optional attention mask
                of shape (*, k). If set, the keys with value 0 in the mask will
                not be attended to.

        Returns:
            torch.tensor: Output tensor of shape (*, q/k/v, *, c_in)
        """

        out = None

        q = self.linear_q(x)
        k = self.linear_k(x)
        v = self.linear_v(x)

        if self.is_global:
            q, k, v = self.prepare_qkv_global(q, k, v)
        else:
            q, k, v = self.prepare_qkv(q, k, v)

        q = q / math.sqrt(self.c)

        a = torch.einsum('...qc,...kc->...qk', q, k)
        if bias is not None:
            bias_batch_shape = bias.shape[:-3]
            bias_bc_shape = bias_batch_shape + (1,) * (a.ndim-len(bias_batch_shape)-3) + bias.shape[-3:]
            bias = bias.view(bias_bc_shape)

            a = a + bias

        if attention_mask is not None:
            attention_mask = attention_mask[..., None, None, :]
            offset = (attention_mask==0) * -1e8
            a = a + offset

        a = torch.softmax(a, dim=-1)
        # o has shape [*, N_head, q, c]
        o = torch.einsum('...qk,...kc->...qc', a, v)
        o = o.transpose(-3, -2)
        o = torch.flatten(o, start_dim=-2)
        o = o.moveaxis(-2, self.attn_dim)
        if self.gated:
            g = torch.sigmoid(self.linear_g(x))
            o = g * o

        out = self.linear_o(o)

        ##########################################################################
        #               END OF YOUR CODE                                         #
        ##########################################################################

        return out

The forward method is the core computation of multi-head attention. Let us walk through its logic and key steps:

1. Input preprocessing

q = self.linear_q(x)
k = self.linear_k(x)
v = self.linear_v(x)

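Linear projections produce the Query (Q), Key (K) and Value (V) tensors, which prepare_qkv then rearranges to shape (*, N_head, q/k/v, c).
In global attention mode (is_global=True), prepare_qkv_global is called instead: K and V keep a single head and Q is averaged into one query vector per head.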

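2. Query scaling

q = q / math.sqrt(self.c)

The query vectors are divided by √c (c is the per-head dimension). This keeps the dot products from growing too large, which would saturate the softmax and make its gradients vanish.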
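A quick numerical illustration of why the scaling helps, assuming (for this sketch only) that query and key entries are independent standard normal values: the dot product of two such c-dimensional vectors has a standard deviation of about √c, and dividing by √c brings it back to about 1.

```python
import math
import torch

torch.manual_seed(0)
c = 64
q = torch.randn(10_000, c)  # 10,000 random query vectors
k = torch.randn(10_000, c)  # 10,000 random key vectors

raw = (q * k).sum(dim=-1)                      # unscaled dot products
scaled = ((q / math.sqrt(c)) * k).sum(dim=-1)  # dot products after query scaling

raw_std = raw.std().item()        # close to sqrt(64) = 8
scaled_std = scaled.std().item()  # close to 1
print(f"std without scaling: {raw_std:.2f}, with scaling: {scaled_std:.2f}")
```

Scores with a spread of around 8 would push the softmax toward a near one-hot output; with a spread of around 1 it stays in a range where gradients are still informative.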

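3. Attention score computation

a = torch.einsum('...qc,...kc->...qk', q, k)

The dot products of Q and K are computed with Einstein summation notation.
The resulting tensor a has shape [*, N_head, q, k] and holds the similarity between every query position and every key position.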
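For readers new to einsum, the following sketch (toy shapes only) shows that this particular expression is the familiar batched product of Q with the transpose of K.

```python
import torch

torch.manual_seed(0)
q = torch.randn(2, 4, 5, 8)  # (batch, N_head, q, c)
k = torch.randn(2, 4, 7, 8)  # (batch, N_head, k, c)

# 'c' appears in both inputs but not in the output, so it is summed over.
a_einsum = torch.einsum('...qc,...kc->...qk', q, k)
a_matmul = q @ k.transpose(-1, -2)

print(tuple(a_einsum.shape))  # (2, 4, 5, 7)
print(torch.allclose(a_einsum, a_matmul, atol=1e-5))
```

The einsum form has the advantage that the index names document which axes are queries, keys and channels.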

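4. Bias handling

if bias is not None:
    bias_batch_shape = bias.shape[:-3]
    bias_bc_shape = bias_batch_shape + (1,) * (a.ndim-len(bias_batch_shape)-3) + bias.shape[-3:]
    bias = bias.view(bias_bc_shape)
    a = a + bias

The bias tensor is reshaped so that it can be broadcast against the attention score tensor.
The bias is then added to the raw scores (in AlphaFold, for example, this is used to inject residue-pair information).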
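The reshaping is easiest to see with concrete shapes. In this sketch the sizes are invented and the extra axis S stands for any dimension of the scores that the bias does not have; the shape arithmetic is the same as in the snippet above, with the shapes converted to plain tuples.

```python
import torch

B, S, N_head, N = 2, 3, 4, 5
a = torch.zeros(B, S, N_head, N, N)  # scores with an extra axis S
bias = torch.randn(B, N_head, N, N)  # bias without the S axis

bias_batch_shape = tuple(bias.shape[:-3])       # (2,)
n_missing = a.ndim - len(bias_batch_shape) - 3  # 1 axis to fill with size 1
bias_bc_shape = bias_batch_shape + (1,) * n_missing + tuple(bias.shape[-3:])
print(bias_bc_shape)  # (2, 1, 4, 5, 5)

# `a + bias` without the view would fail here, because broadcasting aligns
# trailing axes and would try to match S=3 against B=2.
out = a + bias.view(bias_bc_shape)
print(tuple(out.shape))  # (2, 3, 4, 5, 5)
```

The inserted size-1 axes sit between the batch axes of the bias and its last three axes, so the same bias is reused for every index along S.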

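5. Attention mask handling

if attention_mask is not None:
    attention_mask = attention_mask[..., None, None, :]
    offset = (attention_mask==0) * -1e8
    a = a + offset

A very large negative value (-1e8) is added at the positions to be masked (attention_mask==0).
After the softmax, the weights at these positions are effectively 0, meaning those positions are not attended to.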
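Here is a minimal sketch of the masking trick on toy data. All scores are set to zero so that the effect of the mask is easy to read off.

```python
import torch

a = torch.zeros(1, 2, 3, 4)                    # (batch, N_head, q, k), equal scores
attention_mask = torch.tensor([[1, 1, 0, 1]])  # (batch, k): the third key is masked

mask = attention_mask[..., None, None, :]      # (1, 1, 1, 4), broadcasts over heads and queries
offset = (mask == 0) * -1e8                    # -1e8 where masked, 0 elsewhere
weights = torch.softmax(a + offset, dim=-1)

print(weights[0, 0, 0])  # the masked key gets weight 0, the other three share the rest
```

Because the offset is added before the softmax, the remaining weights are renormalized automatically and still sum to 1.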

6. Softmax normalization

Use softmax to convert the attention scores into a probability distribution.

a = torch.softmax(a, dim=-1)

Softmax is applied over the last dimension (k), giving normalized attention weights.

7. Weighted sum

o = torch.einsum('...qk,...kc->...qc', a, v)

The attention weights are used to take a weighted sum of the Value vectors.
The output has shape [*, N_head, q, c].

8. Output rearrangement

#   - Rearrange the intermediate output in the following way:            #
#       * (*, N_head, q, c) -> (*, q, N_head, c)                         #
#       * (*, q, N_head, c) -> (*, q, N_head * c)                        #
#       * (*, q, N_head * c) -> (*, q, *, N_head * c)                    #
#       The order of these transformations is crucial, as moving q

o = o.transpose(-3, -2)
o = torch.flatten(o, start_dim=-2)
o = o.moveaxis(-2, self.attn_dim)

Swap the head dimension and the query dimension.
Flatten the heads into a single feature dimension.
Move the query dimension back to its original position (attn_dim).
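These three calls undo the layout change made in prepare_qkv. A sketch with assumed toy sizes and attn_dim = -3:

```python
import torch

N_head, c = 4, 8
attn_dim = -3  # assumed for this example: the q axis sits third from last in x

o = torch.randn(2, 3, N_head, 5, c)   # (batch, extra, N_head, q, c)

o1 = o.transpose(-3, -2)              # (2, 3, 5, 4, 8): heads next to channels
o2 = torch.flatten(o1, start_dim=-2)  # (2, 3, 5, 32): heads concatenated
o3 = o2.moveaxis(-2, attn_dim)        # (2, 5, 3, 32): q back where x had it

print(tuple(o1.shape), tuple(o2.shape), tuple(o3.shape))

# Head h of query i ends up in channels [h*c:(h+1)*c] of that query's output.
same = torch.equal(o3[0, 4, 1, 2 * c:3 * c], o[0, 1, 2, 4])
print(same)
```

After this step the output has the same axis order as the input x, with N_head*c channels, so linear_o can be applied directly.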

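9. Gating mechanism

If gated, calculate the gating with linear_g and sigmoid and multiply it against the output.

if self.gated:
    g = torch.sigmoid(self.linear_g(x))
    o = g * o

If gating is enabled, gate values between 0 and 1 are generated.
They are multiplied element-wise with the output to control the flow of information.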
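A minimal sketch of the gate on its own, with toy sizes and a random stand-in for the attention output. It shows that the gate is computed from the input x, has the same shape as o, and can only shrink each entry.

```python
import torch
from torch import nn

torch.manual_seed(0)
c_in, c, N_head = 16, 4, 2
x = torch.randn(3, 5, c_in)             # (batch, q, c_in): module input
o = torch.randn(3, 5, c * N_head)       # stand-in for the attention output

linear_g = nn.Linear(c_in, c * N_head)  # gate layer (with bias)
g = torch.sigmoid(linear_g(x))          # one gate value per output channel
gated = g * o                           # element-wise scaling

print(tuple(g.shape), g.min().item(), g.max().item())
```

A gate value near 0 suppresses a channel and a value near 1 lets it through almost unchanged, and the network learns this per position and per channel.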

10. Final output projection

Apply linear_o to calculate the final output.

out = self.linear_o(o)

The last linear layer maps the features back to the input dimension c_in.

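Key design features:

Efficient tensor operations: einsum is used for batched matrix computations
Flexible dimension handling: arbitrary batch dimensions and a configurable attention dimension are supported
Modular design: bias, mask and gating can each be switched on or off
Global attention support: the is_global flag switches between the two modes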