一、Transformer
Transformer,是由编码块和解码块两部分组成,其中编码块由多个编码器组成,解码块同样也是由多个解码块组成。
编码器:自注意力 + 全连接
- 多头自注意力:Q、K、V
- 公式:
解码块:自注意力 + 编码 - 解码自注意力 +全连接
- 多头自注意力:
- 编码—解码自注意力:Q上个解码器的输出
K、V最后一个编码器输出
二、BERT
- bert,是由Transformer的多个编码器组成。
- Base :12层编码器,每个编码器有12个多头,隐藏维度为768。
- Large: 24层编码器,每个编码器16个头,隐层维度为1024
- bert结构 :
import torch
class MultiHeadAttention(nn.Module):def__init__(self,hidden_size,head_num):super().__init__()self.head_size = hidden_size / head_numself.query = nn.Linear(hidden_size, hidden_size)self.key = nn.Linear(hidden_size, hidden_size)self.value = nn.Linear(hidden_size, hidden_size)def transpose_dim(self,x):x_new_shape = x.size()[:-1]+(self.head_num, head_size)x = x.view(*x_new_shape)return x.permute(0,2,1,3)def forward(self,x,attention_mask):Quary_layer = self.query(x)Key_layer = self.key(x)Value_layer = self.value(x)'''B = Quary_layer.shape[0]N = Quary_layer.shape[1]multi_quary = Quary_layer.view(B,N,self.head_num,self.head_size).transpose(1,2)'''multi_quary =self.transpose_dim(Quary_layer)multi_key =self.transpose_dim(Key_layer)multi_value =self.transpose_dim(Value_layer)attention_scores = torch.matmul(multi_quary, multi_key.transpose(-1,-2))attention_scores = attention_scores / math.sqrt(self.head_size)attention_probs = nn.Softmax(dim=-1)(attention_scores) context_layer = torch.matmul(attention_probs,values_layer)context_layer = context_layer.permute(0,2,1,3).contiguous()context_layer_shape = context_layer.size()[:-2]+(self.hidden_size)context_layer = cotext_layer.view(*context_layer_shape return context_layer