Paper: PolyGen: An Autoregressive Generative Model of 3D Meshes
Before reading on, see the Transformer primer 《Torch中Transformer的中文注释》 ("Chinese Annotations for the Transformer in Torch") for background.
Below is the Encoder part. It is straightforward — simple enough for a primary-school student:
from typing import Optional
import torch
import torch.nn as nn
from torch.nn import (
    MultiheadAttention,
    Linear,
    Dropout,
    LayerNorm,
    ReLU,
    Parameter,
    TransformerEncoderLayer,
)

import pytorch_lightning as pl

from .utils import embedding_to_padding


class PolygenEncoderLayer(TransformerEncoderLayer):
    """The encoder block described in the PolyGen paper."""

    def __init__(
        self,
        d_model: int = 256,
        nhead: int = 4,
        dim_feedforward: int = 1024,
        dropout: float = 0.2,
        re_zero: bool = True,
    ) -> None:
        """Initializes the PolygenEncoderLayer.

        Args:
            d_model: Size of the embedding vectors, i.e. the model's hidden-state dimension.
            nhead: Number of attention heads in the multi-head attention mechanism.
            dim_feedforward: Hidden dimension of the feed-forward network.
            dropout: Dropout rate applied after the ReLU in each fully connected layer, to reduce overfitting.
            re_zero: If True, alpha-scale the residuals with zero initialization (ReZero), a
                regularization technique intended to improve convergence speed and generalization.

        Initializes:
            self_attn: Multi-head attention layer, implemented with MultiheadAttention.
            linear1, linear2: The two linear transformations of the feed-forward network.
            dropout: Dropout layer.
            norm1, norm2: LayerNorm layers for layer normalization.
            activation: The activation function, here ReLU.
            re_zero: Boolean flag indicating whether ReZero is used.
            alpha, beta: When ReZero is used, these parameters scale the outputs of the
                residual branches; both are initialized to 0.
        """
        super(PolygenEncoderLayer, self).__init__(
            d_model, nhead, dim_feedforward=dim_feedforward, dropout=dropout
        )
        self.self_attn = MultiheadAttention(d_model, nhead, dropout=dropout)
        self.linear1 = Linear(d_model, dim_feedforward)
        self.linear2 = Linear(dim_feedforward, d_model)
        self.dropout = Dropout(dropout)
        self.norm1 = LayerNorm(d_model)
        self.norm2 = LayerNorm(d_model)
        self.activation = ReLU()
        self.re_zero = re_zero
        self.alpha = Parameter(data=torch.Tensor([0.0]))
        self.beta = Parameter(data=torch.Tensor([0.0]))

    def forward(
        self,
        src: torch.Tensor,
        src_mask: Optional[torch.Tensor] = None,
        src_key_padding_mask: Optional[torch.Tensor] = None,
    ) -> torch.Tensor:
        """Forward pass of PolygenEncoderLayer.

        Args:
            src: Tensor of shape [sequence_length, batch_size, embed_size]. The input to the
                TransformerEncoder.
            src_mask: Tensor of shape [sequence_length, sequence_length]. Mask over the input sequence.
            src_key_padding_mask: Tensor of shape [batch_size, sequence_length]. Tells the
                attention mechanism which positions of the input are padding and should be ignored.

        Returns:
            src: Tensor of shape [sequence_length, batch_size, embed_size].

        Computation:
            Self-attention: layer-normalize the input src, then feed the result as query, key,
                and value into the multi-head attention layer self_attn. src_mask and
                src_key_padding_mask control which positions are masked out.
            Residual connection with dropout: if ReZero is used, multiply the attention output
                by alpha; apply dropout, then add the result to src.
            Feed-forward network: layer-normalize src again, then pass it through two linear
                layers with an activation in between. If ReZero is used, multiply the output
                by beta; apply dropout, then add the result to src.
        """
        # ReZero is All You Need: Fast Convergence at Large Depth
        # https://arxiv.org/abs/2003.04887
        # The parameter alpha is initialized to 0 at the beginning of training,
        # so the output of the residual block depends almost entirely on its input,
        # which avoids vanishing or exploding gradients.
        # As training progresses, alpha gradually learns the optimal value,
        # letting the block's internal representation influence the final output.
        src2 = self.norm1(src)
        src2 = self.self_attn(
            src2, src2, src2, attn_mask=src_mask, key_padding_mask=src_key_padding_mask
        )[0]
        if self.re_zero:
            src2 = src2 * self.alpha
        src2 = self.dropout(src2)
        src = src + src2
        src2 = self.norm2(src)
        src2 = self.linear1(src2)
        src2 = self.activation(src2)  # ReLU between the two linear layers
        src2 = self.linear2(src2)
        if self.re_zero:
            src2 = src2 * self.beta
        src2 = self.dropout(src2)
        src = src + src2
        return src


class PolygenEncoder(pl.LightningModule):
    """A modified version of the traditional Transformer Encoder suited for Polygen input sequences."""

    def __init__(
        self,
        hidden_size: int = 256,
        fc_size: int = 1024,
        num_heads: int = 4,
        layer_norm: bool = True,
        num_layers: int = 8,
        dropout_rate: float = 0.2,
    ) -> None:
        """Initializes the PolygenEncoder.

        Args:
            hidden_size: Size of the embedding vectors.
            fc_size: Size of the fully connected layer.
            num_heads: Number of multihead attention heads.
            layer_norm: Boolean variable that signifies if layer normalization should be used.
            num_layers: Number of encoder layers in the encoder.
            dropout_rate: Dropout rate applied immediately after the ReLU in each fully connected layer.
        """
        super(PolygenEncoder, self).__init__()
        self.hidden_size = hidden_size
        self.num_layers = num_layers
        self.encoder = nn.TransformerEncoder(
            PolygenEncoderLayer(
                d_model=hidden_size,
                nhead=num_heads,
                dim_feedforward=fc_size,
                dropout=dropout_rate,
            ),
            num_layers=num_layers,
        )
        self.norm = LayerNorm(hidden_size)

    def forward(self, inputs: torch.Tensor) -> torch.Tensor:
        """Forward method for the Transformer Encoder.

        Args:
            inputs: A Tensor of shape [sequence_length, batch_size, embed_size]. Represents the input sequence.

        Returns:
            outputs: A Tensor of shape [sequence_length, batch_size, embed_size]. Represents the result of the TransformerEncoder.
        """
        padding_mask = embedding_to_padding(inputs)
        out = self.encoder(inputs, src_key_padding_mask=padding_mask)
        return self.norm(out)
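The ReZero idea referenced in the comments above is easy to see in isolation. Below is a minimal, self-contained sketch (not code from the PolyGen repo; `ReZeroBlock` is an illustrative name) showing why a zero-initialized residual scale makes the block start out as the exact identity:

```python
import torch
import torch.nn as nn


class ReZeroBlock(nn.Module):
    """Minimal ReZero residual block: out = x + alpha * f(x), with alpha = 0 at init."""

    def __init__(self, dim: int) -> None:
        super().__init__()
        self.f = nn.Linear(dim, dim)
        # Zero-initialized scale: at the start of training the residual branch
        # contributes nothing, so the block is the identity and gradients flow cleanly.
        self.alpha = nn.Parameter(torch.zeros(1))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.alpha * self.f(x)


block = ReZeroBlock(4)
x = torch.randn(2, 4)
out = block(x)
# With alpha == 0, out equals x exactly; as training updates alpha,
# the branch gradually starts to influence the output.
```

As alpha is a learnable `Parameter`, the optimizer moves it away from zero only as fast as the loss demands, which is the property the paper credits for stable training at large depth.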
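`embedding_to_padding` comes from the repo's `.utils` module and is not shown here, so its exact implementation is an assumption. A plausible sketch, inferred from its use as a `src_key_padding_mask` (PyTorch expects shape [batch_size, sequence_length], True at padding positions) and from the convention that padded positions embed to all zeros:

```python
import torch


def embedding_to_padding_sketch(emb: torch.Tensor) -> torch.Tensor:
    """Hypothetical stand-in for the repo's embedding_to_padding helper.

    emb: [sequence_length, batch_size, embed_size]
    Returns a boolean mask of shape [batch_size, sequence_length],
    True where the position is padding (i.e. its embedding is all zeros).
    """
    return emb.abs().sum(dim=-1).eq(0).transpose(0, 1)


emb = torch.zeros(5, 2, 8)
emb[:3, 0] = 1.0  # first sequence: 3 real tokens followed by 2 padding positions
emb[:, 1] = 1.0   # second sequence: no padding
mask = embedding_to_padding_sketch(emb)
# mask[0] -> [False, False, False, True, True]; mask[1] -> all False
```

If the real helper differs (e.g. it keys off an explicit padding token id instead of zero embeddings), the mask construction changes accordingly, but the output shape and meaning must match what `nn.TransformerEncoder` expects for `src_key_padding_mask`.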