Transformer详解 - mathor
atten之后经过一个全连接层+残差+层归一化
class BertSelfOutput(nn.Module):def __init__(self, config):super().__init__()self.dense = nn.Linear(config.hidden_size, config.hidden_size)self.LayerNorm = nn.LayerNorm(config.hidden_size, eps=config.layer_norm_eps)self.dropout = nn.Dropout(config.hidden_dropout_prob)def forward(self, hidden_states: torch.Tensor, input_tensor: torch.Tensor) -> torch.Tensor:hidden_states = self.dense(hidden_states) # 全连接 768->768hidden_states = self.dropout(hidden_states)hidden_states = self.LayerNorm(hidden_states + input_tensor) # 残差和层归一化return hidden_states
残差的作用:避免梯度消失
归一化的作用:避免梯度消失和爆炸,加速收敛
然后再送入一个两层的前馈神经网络
class BertIntermediate(nn.Module):def __init__(self, config):super().__init__()self.dense = nn.Linear(config.hidden_size, config.intermediate_size)if isinstance(config.hidden_act, str):self.intermediate_act_fn = ACT2FN[config.hidden_act]else:self.intermediate_act_fn = config.hidden_actdef forward(self, hidden_states: torch.Tensor) -> torch.Tensor:hidden_states = self.dense(hidden_states) # [1, 16, 3072] 映射到高维空间:768 -> 3072hidden_states = self.intermediate_act_fn(hidden_states)return hidden_states
class BertOutput(nn.Module):def __init__(self, config):super().__init__()self.dense = nn.Linear(config.intermediate_size, config.hidden_size)self.LayerNorm = nn.LayerNorm(config.hidden_size, eps=config.layer_norm_eps)self.dropout = nn.Dropout(config.hidden_dropout_prob)def forward(self, hidden_states: torch.Tensor, input_tensor: torch.Tensor) -> torch.Tensor:hidden_states = self.dense(hidden_states) # 3072 -> 768hidden_states = self.dropout(hidden_states)hidden_states = self.LayerNorm(hidden_states + input_tensor) # 残差和层归一化return hidden_states
面试题:为什么注意力机制中要除以根号dk
答:因为q和k做点积后值会很大,会导致反向传播时softmax函数的梯度很小。除以根号dk是为了保持点积后的值均值为0,方差为1.(q和k都是向量)
证明:已知q和k相互独立,且是均值为0,方差为1。
则D(qi*ki)=D(qi)*D(ki)=1
除以dk则D((qi*ki)/根号dk)=1/dk,每一项是这个值,但是根据上面红框的公式,一共有dk项求和,值为1
所以(q*k)/dk的方差就等1
(背景知识)方差性质:
D(CX)=C^2D(X) ,其中C是常量