karpathy: Let's build GPT

1 Introduction

Following karpathy's tutorial, we build a transformer step by step, deepening our understanding of its design along the way.
karpathy recommends using a Jupyter notebook for quick experiments while keeping the main network code in a plain Python file.


2 Network implementation

2.1 Building the data

  • Read the text
text = open("input.txt", "r", encoding='utf-8').read()
words = sorted(set(text))
vocab_size = len(words)
print(f'vocab_size is: {vocab_size}')
print(''.join(words))
print(text[:1000])

vocab_size is: 65
!$&',-.3:;?ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz
First Citizen:
Before we proceed any further, hear me speak.
All:
Speak, speak.
First Citizen:
You are all resolved rather to die than to famish?

  • Map characters to integers
stoi = {ch : i for i, ch in enumerate(words)}
itos = {i : ch for i, ch in enumerate(words)}
encode = lambda s: [stoi[ch] for ch in s]
decode = lambda l: ''.join([itos[i] for i in l])
print(encode("hii")) 
print(decode(encode("hii")))

[46, 47, 47]
hii

  • Create the dataset
import torch
# build the dataset tensor
data = torch.tensor(encode(text), dtype=torch.long)
print(len(data))
n = int(len(data) * 0.9)
train_data = data[:n]
val_data = data[n:]
print(train_data[:1000])

1115394
tensor([18, 47, 56, 57, 58, 1, 15, 47, 58, 47, 64, 43, 52, 10, 0, 14, 43, 44,
53, 56, 43, 1, 61, 43, 1, 54, 56, 53, 41, 43, 43, 42, 1, 39, 52, 63,
1, 44, 59, 56, 58, 46, 43, 56, 6, 1, 46, 43, 39, 56, 1, 51, 43, 1,
57, 54, 43, 39, 49, 8, 0, 0, 13, 50, 50, 10, 0, 31, 54, 43, 39, 49,

  • Build a batch loader
import torch

batch_size = 4
block_size = 8
torch.manual_seed(1337)

def get_batch(split):
    datasets = {'train': train_data, 'val': val_data}[split]
    ix = torch.randint(0, len(datasets) - block_size, (batch_size,))
    x = torch.stack([datasets[i:i+block_size] for i in ix])
    y = torch.stack([datasets[i+1:i+block_size+1] for i in ix])
    return x, y

xb, yb = get_batch('train')
print(f'x shape is: {xb.shape}, y shape is: {yb.shape}')
print(f'x is {xb}')
print(f'y is {yb}')

x shape is: torch.Size([4, 8]), y shape is: torch.Size([4, 8])
x is tensor([[24, 43, 58, 5, 57, 1, 46, 43],
[44, 53, 56, 1, 58, 46, 39, 58],
[52, 58, 1, 58, 46, 39, 58, 1],
[25, 17, 27, 10, 0, 21, 1, 54]])
y is tensor([[43, 58, 5, 57, 1, 46, 43, 39],
[53, 56, 1, 58, 46, 39, 58, 1],
[58, 1, 58, 46, 39, 58, 1, 46],
[17, 27, 10, 0, 21, 1, 54, 39]])
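Note the one-character offset between x and y: within a sampled block, every offset t yields its own (context, target) pair, so a single (x, y) row packs block_size training examples. A small illustration, reusing the first few encoded characters from above:

```python
import torch

data = torch.tensor([18, 47, 56, 57, 58, 1, 15, 47, 58])  # first 9 encoded chars
block_size = 8
x = data[:block_size]
y = data[1:block_size + 1]
for t in range(block_size):
    context = x[:t + 1]   # everything up to and including position t
    target = y[t]         # the character that follows
    print(f'when input is {context.tolist()} the target is {target.item()}')
```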

2.2 Building the pipeline

  • Define the simplest possible network
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(1337)

class BigramLanguageModel(nn.Module):
    def __init__(self, vocab_size):
        super().__init__()
        self.token_embedding_table = nn.Embedding(vocab_size, vocab_size)

    def forward(self, idx, targets=None):
        self.out = self.token_embedding_table(idx)
        return self.out

xb, yb = get_batch('train')
model = BigramLanguageModel(vocab_size)
out = model(xb)
print(f'x shape is: {xb.shape}')
print(f'out shape is: {out.shape}')

x shape is: torch.Size([4, 8])
out shape is: torch.Size([4, 8, 65])
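Why the output is (4, 8, 65): nn.Embedding is a pure table lookup, so the bigram "logits" for a token are literally that token's row of the embedding table. A minimal sketch:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
table = nn.Embedding(65, 65)     # vocab_size x vocab_size lookup table
idx = torch.tensor([[24, 43]])   # (B=1, T=2)
logits = table(idx)              # (1, 2, 65): one table row per input token
assert torch.equal(logits[0, 0], table.weight[24])
assert torch.equal(logits[0, 1], table.weight[43])
print(logits.shape)
```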

  • The full pipeline, including the loss and generation, is
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(1337)

class BigramLanguageModel(nn.Module):
    def __init__(self, vocab_size):
        super().__init__()
        self.token_embedding_table = nn.Embedding(vocab_size, vocab_size)

    def forward(self, idx, targets=None):
        logits = self.token_embedding_table(idx)  # (B, T, C)
        if targets is None:
            loss = None
        else:
            B, T, C = logits.shape
            logits = logits.view(B*T, C)
            targets = targets.view(B*T)  # targets is (B, T); flatten to match
            loss = F.cross_entropy(logits, targets)
        return logits, loss

    def generate(self, idx, max_new_tokens):
        for _ in range(max_new_tokens):
            logits, loss = self(idx)
            logits = logits[:, -1, :]                    # (B, C): last time step only
            prob = F.softmax(logits, dim=-1)             # softmax over the vocab
            ix = torch.multinomial(prob, num_samples=1)  # (B, 1)
            print(idx)
            idx = torch.cat((idx, ix), dim=1)            # (B, T+1)
            print(idx)
        return idx

xb, yb = get_batch('train')
model = BigramLanguageModel(vocab_size)
out, loss = model(xb)
print(f'x shape is: {xb.shape}')
print(f'out shape is: {out.shape}')
idx = torch.zeros((1, 1), dtype=torch.long)
print(decode(model.generate(idx, max_new_tokens=10)[0].tolist()))
# print(f'idx is {idx}')

x shape is: torch.Size([4, 8])
out shape is: torch.Size([4, 8, 65])
tensor([[0]])
tensor([[ 0, 50]])
tensor([[ 0, 50]])
tensor([[ 0, 50, 7]])
tensor([[ 0, 50, 7]])
tensor([[ 0, 50, 7, 29]])
tensor([[ 0, 50, 7, 29]])
tensor([[ 0, 50, 7, 29, 37]])
tensor([[ 0, 50, 7, 29, 37]])
tensor([[ 0, 50, 7, 29, 37, 48]])
tensor([[ 0, 50, 7, 29, 37, 48]])
tensor([[ 0, 50, 7, 29, 37, 48, 58]])
tensor([[ 0, 50, 7, 29, 37, 48, 58]])
tensor([[ 0, 50, 7, 29, 37, 48, 58, 5]])
tensor([[ 0, 50, 7, 29, 37, 48, 58, 5]])
tensor([[ 0, 50, 7, 29, 37, 48, 58, 5, 15]])
tensor([[ 0, 50, 7, 29, 37, 48, 58, 5, 15]])
tensor([[ 0, 50, 7, 29, 37, 48, 58, 5, 15, 24]])
tensor([[ 0, 50, 7, 29, 37, 48, 58, 5, 15, 24]])
tensor([[ 0, 50, 7, 29, 37, 48, 58, 5, 15, 24, 12]])
l-QYjt’CL?

A few points worth noting here. First, the inputs and targets are:

x is tensor([[24, 43, 58, 5, 57, 1, 46, 43],
[44, 53, 56, 1, 58, 46, 39, 58],
[52, 58, 1, 58, 46, 39, 58, 1],
[25, 17, 27, 10, 0, 21, 1, 54]])
y is tensor([[43, 58, 5, 57, 1, 46, 43, 39],
[53, 56, 1, 58, 46, 39, 58, 1],
[58, 1, 58, 46, 39, 58, 1, 46],
[17, 27, 10, 0, 21, 1, 54, 39]])

Also note that in this pipeline the network places no limit on the input length.

  • Start training
    At this point we need a complete training script. If we stay in a Jupyter notebook, every change to a network component forces us to re-run many cells, which is tedious, so we move the code into a .py file.
import torch
import torch.nn as nn
import torch.nn.functional as F

# hyperparameters
batch_size = 32
block_size = 8
max_iter = 3000
eval_interval = 300
learning_rate = 1e-2
device = 'cuda' if torch.cuda.is_available() else 'cpu'
eval_iters = 200
# ---------------------

torch.manual_seed(1337)

text = open("input.txt", "r", encoding='utf-8').read()
chars = sorted(list(set(text)))
vocab_size = len(chars)
stoi = {ch : i for i, ch in enumerate(chars)}
itos = {i : ch for i, ch in enumerate(chars)}
encode = lambda s: [stoi[ch] for ch in s]
decode = lambda l: ''.join([itos[i] for i in l])

# build the dataset
data = torch.tensor(encode(text), dtype=torch.long)
n = int(len(data) * 0.9)
train_data = data[:n]
val_data = data[n:]

def get_batch(split):
    datasets = {'train': train_data, 'val': val_data}[split]
    ix = torch.randint(0, len(datasets) - block_size, (batch_size,))
    x = torch.stack([datasets[i:i+block_size] for i in ix])
    y = torch.stack([datasets[i+1:i+block_size+1] for i in ix])
    x, y = x.to(device), y.to(device)
    return x, y

@torch.no_grad()
def estimate_loss():
    out = {}
    model.eval()
    for split in ['train', 'val']:
        losses = torch.zeros(eval_iters)
        for k in range(eval_iters):
            X, Y = get_batch(split)
            logits, loss = model(X, Y)
            losses[k] = loss.item()
        out[split] = losses.mean()
    model.train()
    return out

class BigramLanguageModel(nn.Module):
    def __init__(self, vocab_size):
        super().__init__()
        self.token_embedding_table = nn.Embedding(vocab_size, vocab_size)

    def forward(self, idx, targets=None):
        logits = self.token_embedding_table(idx)  # (B, T, C)
        if targets is None:
            loss = None
        else:
            B, T, C = logits.shape
            logits = logits.view(B*T, C)
            targets = targets.view(B*T)  # targets is (B, T); flatten to match
            loss = F.cross_entropy(logits, targets)
        return logits, loss

    def generate(self, idx, max_new_tokens):
        for _ in range(max_new_tokens):
            logits, loss = self(idx)
            logits = logits[:, -1, :]                    # (B, C)
            prob = F.softmax(logits, dim=-1)
            ix = torch.multinomial(prob, num_samples=1)  # (B, 1)
            idx = torch.cat((idx, ix), dim=1)            # (B, T+1)
        return idx

model = BigramLanguageModel(vocab_size)
m = model.to(device)
optimizer = torch.optim.AdamW(model.parameters(), lr=learning_rate)

for iter in range(max_iter):
    if iter % eval_interval == 0:
        losses = estimate_loss()
        print(f'step {iter}: train loss {losses["train"]:.4f}, val loss {losses["val"]:.4f}')
    xb, yb = get_batch('train')
    out, loss = m(xb, yb)
    optimizer.zero_grad(set_to_none=True)
    loss.backward()
    optimizer.step()

# generate from the model
context = torch.zeros((1, 1), dtype=torch.long, device=device)
print(decode(m.generate(context, max_new_tokens=500)[0].tolist()))

The output:

step 0: train loss 4.7305, val loss 4.7241
step 300: train loss 2.8110, val loss 2.8249
step 600: train loss 2.5434, val loss 2.5682
step 900: train loss 2.4932, val loss 2.5088
step 1200: train loss 2.4863, val loss 2.5035
step 1500: train loss 2.4665, val loss 2.4921
step 1800: train loss 2.4683, val loss 2.4936
step 2100: train loss 2.4696, val loss 2.4846
step 2400: train loss 2.4638, val loss 2.4879
step 2700: train loss 2.4738, val loss 2.4911
CEThik brid owindakis b, bth
HAPet bobe d e.
S:
O:3 my d?
LUCous:
Wanthar u qur, t.
War dXENDoate awice my.
Hastarom oroup
Yowhthetof isth ble mil ndill, ath iree sengmin lat Heriliovets, and Win nghir.
Swanousel lind me l.
HAshe ce hiry:
Supr aisspllw y.
Hentofu n Boopetelaves
MPOLI s, d mothakleo Windo whth eisbyo the m dourive we higend t so mower; te
AN ad nterupt f s ar igr t m:
Thin maleronth,
Mad
RD:
WISo myrangoube!
KENob&y, wardsal thes ghesthinin couk ay aney IOUSts I&fr y ce.
J

2.3 self-attention

When processing the current character we need it to communicate with the preceding characters. The history can be viewed as a feature, and the simplest way to extract it is to average over the preceding characters.

# the simplest form of communication: average the current token's
# features with those of all preceding tokens, treating the running
# mean as a feature of the history
a = torch.tril(torch.ones(3, 3))
print(a)
a = a / torch.sum(a, 1, keepdim=True)
print(a)

tensor([[1., 0., 0.],
[1., 1., 0.],
[1., 1., 1.]])
tensor([[1.0000, 0.0000, 0.0000],
[0.5000, 0.5000, 0.0000],
[0.3333, 0.3333, 0.3333]])

The same averaging weights can be produced by masking followed by a softmax:

import torch.nn.functional as F

T = 8
tril = torch.tril(torch.ones(T, T))
wei = torch.zeros(T, T)   # will become the attention weights
wei = wei.masked_fill(tril == 0, float('-inf'))
print(wei)
wei = F.softmax(wei, dim=-1)
print(wei)

tensor([[0., -inf, -inf, -inf, -inf, -inf, -inf, -inf],
[0., 0., -inf, -inf, -inf, -inf, -inf, -inf],
[0., 0., 0., -inf, -inf, -inf, -inf, -inf],
[0., 0., 0., 0., -inf, -inf, -inf, -inf],
[0., 0., 0., 0., 0., -inf, -inf, -inf],
[0., 0., 0., 0., 0., 0., -inf, -inf],
[0., 0., 0., 0., 0., 0., 0., -inf],
[0., 0., 0., 0., 0., 0., 0., 0.]])
tensor([[1.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000],
[0.5000, 0.5000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000],
[0.3333, 0.3333, 0.3333, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000],
[0.2500, 0.2500, 0.2500, 0.2500, 0.0000, 0.0000, 0.0000, 0.0000],
[0.2000, 0.2000, 0.2000, 0.2000, 0.2000, 0.0000, 0.0000, 0.0000],
[0.1667, 0.1667, 0.1667, 0.1667, 0.1667, 0.1667, 0.0000, 0.0000],
[0.1429, 0.1429, 0.1429, 0.1429, 0.1429, 0.1429, 0.1429, 0.0000],
[0.1250, 0.1250, 0.1250, 0.1250, 0.1250, 0.1250, 0.1250, 0.1250]])

The result of the feature aggregation:

xbow2 = wei @ x   # (T, T) @ (B, T, C) --> (B, T, C); x plays the role of v
print(xbow2.shape)

torch.Size([4, 8, 2])
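As a sanity check, the masked-softmax aggregation matches an explicit per-position averaging loop. A sketch, with random x standing in for the embeddings:

```python
import torch
import torch.nn.functional as F

torch.manual_seed(1337)
B, T, C = 4, 8, 2
x = torch.randn(B, T, C)

# version 1: explicit loop, average over all tokens up to and including t
xbow = torch.zeros(B, T, C)
for b in range(B):
    for t in range(T):
        xbow[b, t] = x[b, :t + 1].mean(dim=0)

# version 2: masked softmax yields the same lower-triangular averages
tril = torch.tril(torch.ones(T, T))
wei = torch.zeros(T, T).masked_fill(tril == 0, float('-inf'))
wei = F.softmax(wei, dim=-1)
xbow2 = wei @ x   # (T, T) @ (B, T, C) -> (B, T, C)
print(torch.allclose(xbow, xbow2, atol=1e-6))  # True
```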

The current version of forward with a positional embedding added:

def forward(self, idx, targets=None):
    B, T = idx.shape
    tok_emb = self.token_embedding_table(idx)  # (B, T, n_embd)
    pos_emb = self.position_embedding_table(torch.arange(T, device=device))  # (T, n_embd) positional encoding
    x = tok_emb + pos_emb        # (B, T, C) via broadcasting
    logits = self.lm_head(x)     # (B, T, vocab_size)
    if targets is None:
        loss = None
    else:
        B, T, C = logits.shape
        logits = logits.view(B*T, C)
        targets = targets.view(B*T)
        loss = F.cross_entropy(logits, targets)
    return logits, loss

Some insights from karpathy:

  • Attention is a communication mechanism. Can be seen as nodes in a directed graph looking at each other and aggregating information with a weighted sum from all nodes that point to them, with data-dependent weights.
  • There is no notion of space. Attention simply acts over a set of vectors. This is why we need to positionally encode tokens.
  • Each example across batch dimension is of course processed completely independently and never “talk” to each other
  • In an “encoder” attention block just delete the single line that does masking with tril, allowing all tokens to communicate. This block here is called a “decoder” attention block because it has triangular masking, and is usually used in autoregressive settings, like language modeling.
  • “self-attention” just means that the keys and values are produced from the same source as queries. In “cross-attention”, the queries still get produced from x, but the keys and values come from some other, external source (e.g. an encoder module)
  • “Scaled” attention additionally divides wei by sqrt(head_size). This makes it so when input Q,K are unit variance, wei will be unit variance too and Softmax will stay diffuse and not saturate too much. Illustration below

The attention formula, where the scale factor keeps the variance stable when the two distributions are multiplied:

Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V

B, T, head_size = 4, 8, 16
k = torch.randn(B, T, head_size)
q = torch.randn(B, T, head_size)
wei = q @ k.transpose(-2, -1)
wei_scale = wei / head_size**0.5
print(k.var())
print(q.var())
print(wei.var())
print(wei_scale.var())

Output:

tensor(1.0278)
tensor(0.9802)
tensor(15.9041)
tensor(0.9940)

Initialization matters a great deal here: we want the input to the softmax to be a low-variance distribution. Without the scale, the softmax saturates:

torch.softmax(torch.tensor([0.1, -0.2, 0.3, -0.2, 0.5]) * 8, dim=-1)

tensor([0.0326, 0.0030, 0.1615, 0.0030, 0.8000])
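Multiplying the logits by a large factor (as an unscaled q·k product effectively does) pushes the softmax toward one-hot, while small logits keep it diffuse:

```python
import torch

logits = torch.tensor([0.1, -0.2, 0.3, -0.2, 0.5])
diffuse = torch.softmax(logits, dim=-1)      # low-variance input -> near-uniform output
peaked = torch.softmax(logits * 8, dim=-1)   # high-variance input -> saturated output
print(diffuse.max().item())  # ~0.29
print(peaked.max().item())   # ~0.80
```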

Now we modify the original .py file to add a single self-attention head:

class Head(nn.Module):
    def __init__(self, head_size):
        super().__init__()
        self.query = nn.Linear(n_embd, head_size, bias=False)
        self.key = nn.Linear(n_embd, head_size, bias=False)
        self.value = nn.Linear(n_embd, head_size, bias=False)
        self.register_buffer('tril', torch.tril(torch.ones(block_size, block_size)))

    def forward(self, x):
        B, T, C = x.shape
        q = self.query(x)   # (B, T, head_size)
        k = self.key(x)     # (B, T, head_size)
        v = self.value(x)   # (B, T, head_size)
        wei = q @ k.transpose(-2, -1) * C**-0.5      # (B, T, hs) @ (B, hs, T) -> (B, T, T)
        wei = wei.masked_fill(self.tril[:T, :T] == 0, float('-inf'))
        wei = F.softmax(wei, dim=-1)                 # (B, T, T)
        out = wei @ v                                # (B, T, T) @ (B, T, hs) -> (B, T, hs)
        return out

Modify the model:

class BigramLanguageModel(nn.Module):
    def __init__(self, vocab_size):
        super().__init__()
        self.token_embedding_table = nn.Embedding(vocab_size, n_embd)
        self.position_embedding_table = nn.Embedding(block_size, n_embd)
        self.sa_head = Head(n_embd)  # head size kept equal to n_embd
        self.lm_head = nn.Linear(n_embd, vocab_size)

    def forward(self, idx, targets=None):
        B, T = idx.shape
        tok_emb = self.token_embedding_table(idx)  # (B, T, n_embd)
        pos_emb = self.position_embedding_table(torch.arange(T, device=device))  # (T, n_embd)
        x = tok_emb + pos_emb     # (B, T, C) via broadcasting
        x = self.sa_head(x)
        logits = self.lm_head(x)  # (B, T, vocab_size)
        if targets is None:
            loss = None
        else:
            B, T, C = logits.shape
            logits = logits.view(B*T, C)
            targets = targets.view(B*T)
            loss = F.cross_entropy(logits, targets)
        return logits, loss

    def generate(self, idx, max_new_tokens):
        for _ in range(max_new_tokens):
            idx_cond = idx[:, -block_size:]  # crop to the last block_size tokens
            logits, loss = self(idx_cond)
            logits = logits[:, -1, :]                    # (B, C)
            prob = F.softmax(logits, dim=-1)
            ix = torch.multinomial(prob, num_samples=1)  # (B, 1)
            idx = torch.cat((idx, ix), dim=1)            # (B, T+1)
        return idx

Result with self-attention added:
step 4500: train loss 2.3976, val loss 2.4041

2.4 multi-head attention


Multi-head attention borrows the idea behind grouped convolutions: run several smaller heads in parallel and concatenate their outputs.

class MultiHeadAttention(nn.Module):
    """ multiple heads of self-attention in parallel """
    def __init__(self, num_heads, head_size):
        super().__init__()
        self.heads = nn.ModuleList([Head(head_size) for _ in range(num_heads)])

    def forward(self, x):
        return torch.cat([h(x) for h in self.heads], dim=-1)

To use it:

self.sa_head = MultiHeadAttention(4, n_embd//4)  # total width stays n_embd

Training result:

step 4500: train loss 2.2679, val loss 2.2789

2.5 feedforward network

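The figures for this section did not survive extraction. As a sketch, the feed-forward sub-layer used later in this post is a simple per-token MLP with a 4x inner expansion (the exact expansion factor at this stage of the tutorial is an assumption based on the later code):

```python
import torch
import torch.nn as nn

n_embd = 32  # illustrative value

class FeedForward(nn.Module):
    """ a per-token MLP: no communication across positions """
    def __init__(self, n_embd):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_embd, 4 * n_embd),
            nn.ReLU(),
            nn.Linear(4 * n_embd, n_embd),
        )

    def forward(self, x):
        return self.net(x)

x = torch.randn(4, 8, n_embd)
out = FeedForward(n_embd)(x)
print(out.shape)  # same shape in and out: (4, 8, 32)
```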

Result with the feed-forward network added:
step 4500: train loss 2.2337, val loss 2.2476

We also wrap this unit into a block:
a transformer block can be understood as a communication part (attention) plus a computation part (feed-forward).

class Block(nn.Module):
    def __init__(self, n_embd, n_head):
        super().__init__()
        head_size = n_embd // n_head
        self.sa = MultiHeadAttention(n_head, head_size)
        self.ffwd = FeedForward(n_embd)

    def forward(self, x):
        x = self.sa(x)
        x = self.ffwd(x)
        return x

Modify the model definition:

class BigramLanguageModel(nn.Module):
    def __init__(self, vocab_size):
        super().__init__()
        self.token_embedding_table = nn.Embedding(vocab_size, n_embd)
        self.position_embedding_table = nn.Embedding(block_size, n_embd)
        self.blocks = nn.Sequential(
            Block(n_embd, n_head=4),
            Block(n_embd, n_head=4),
            Block(n_embd, n_head=4),
        )
        self.lm_head = nn.Linear(n_embd, vocab_size)

    def forward(self, idx, targets=None):
        B, T = idx.shape
        tok_emb = self.token_embedding_table(idx)  # (B, T, n_embd)
        pos_emb = self.position_embedding_table(torch.arange(T, device=device))  # (T, n_embd)
        x = tok_emb + pos_emb     # (B, T, C) via broadcasting
        x = self.blocks(x)
        logits = self.lm_head(x)  # (B, T, vocab_size)
        if targets is None:
            loss = None
        else:
            B, T, C = logits.shape
            logits = logits.view(B*T, C)
            targets = targets.view(B*T)
            loss = F.cross_entropy(logits, targets)
        return logits, loss

2.6 Residual network

The model is now quite deep, and training it directly may fail to converge well, so we need another essential tool: residual connections.

class Block(nn.Module):
    def __init__(self, n_embd, n_head):
        super().__init__()
        head_size = n_embd // n_head
        self.sa = MultiHeadAttention(n_head, head_size)
        self.ffwd = FeedForward(n_embd)

    def forward(self, x):
        x = x + self.sa(x)
        x = x + self.ffwd(x)
        return x
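Why residuals help, in one line: with y = x + f(x) the gradient always has an identity path back to x, so deep stacks stay trainable. A minimal check with a linear branch (illustrative, not from the original post):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
f = nn.Linear(3, 3, bias=False)
x = torch.randn(3, requires_grad=True)
y = (x + f(x)).sum()
y.backward()
# d/dx_j sum_i (x_i + (Wx)_i) = 1 + sum_i W_ij: the "1" is the identity path
expected = torch.ones(3) + f.weight.sum(dim=0)
print(torch.allclose(x.grad, expected, atol=1e-6))  # True
```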

With the extra depth the model now overfits more easily:

step 4500: train loss 2.0031, val loss 2.1067

2.7 Layer normalization

Before layer norm, let's revisit the variance argument behind normalization. Assume X and Y are independent distributions with mean 0 and variance 1.
By Var(XY) = E[X]^2 Var(Y) + E[Y]^2 Var(X) + Var(X) Var(Y) = 1, each product of such variables also has unit variance.
After a matrix multiplication, each output entry is a sum of T2 i.i.d. products; by the central limit theorem this sum is approximately normal, with mean the sum of the means and variance the sum of the variances.
That is, each entry of the product has var = T2 and mean = 0.
So scaling the matrix product by 1/sqrt(T2) normalizes it (variance is a squared quantity, hence the square root).
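A quick numeric check of this argument (sample sizes are arbitrary):

```python
import torch

torch.manual_seed(0)
# product of two independent unit normals has variance 1
X = torch.randn(1_000_000)
Y = torch.randn(1_000_000)
print((X * Y).var().item())          # ~1.0

# a sum of T2 such products has variance ~T2; dividing by sqrt(T2) restores 1
T2 = 16
S = (torch.randn(100_000, T2) * torch.randn(100_000, T2)).sum(dim=1)
print(S.var().item())                # ~16
print((S / T2 ** 0.5).var().item())  # ~1.0
```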

x = torch.ones(5,5)
x = torch.tril(x)
print(x)
print(x.mean(dim=0))
print(x.mean(dim=1))

Observe the difference between column means (dim=0) and row means (dim=1):

tensor([[1., 0., 0., 0., 0.],
[1., 1., 0., 0., 0.],
[1., 1., 1., 0., 0.],
[1., 1., 1., 1., 0.],
[1., 1., 1., 1., 1.]])
tensor([1.0000, 0.8000, 0.6000, 0.4000, 0.2000])
tensor([0.2000, 0.4000, 0.6000, 0.8000, 1.0000])

class BatchNorm1d:
    def __init__(self, dim, eps=1e-5, momentum=0.1):
        self.eps = eps
        self.momentum = momentum
        self.training = True
        # parameters (trained with backprop)
        self.gamma = torch.ones(dim)
        self.beta = torch.zeros(dim)
        # buffers (trained with a running 'momentum update')
        self.running_mean = torch.zeros(dim)
        self.running_var = torch.ones(dim)

    def __call__(self, x):
        # calculate the forward pass
        if self.training:
            if x.ndim == 2:
                dim = 0
            elif x.ndim == 3:
                dim = (0, 1)
            xmean = x.mean(dim, keepdim=True)  # batch mean
            xvar = x.var(dim, keepdim=True)    # batch variance
        else:
            xmean = self.running_mean
            xvar = self.running_var
        xhat = (x - xmean) / torch.sqrt(xvar + self.eps)  # normalize to unit variance
        self.out = self.gamma * xhat + self.beta
        # update the buffers
        if self.training:
            with torch.no_grad():
                self.running_mean = (1 - self.momentum) * self.running_mean + self.momentum * xmean
                self.running_var = (1 - self.momentum) * self.running_var + self.momentum * xvar
        return self.out

    def parameters(self):
        return [self.gamma, self.beta]

BatchNorm normalizes each column (across the batch); LayerNorm normalizes each row (across the features).

class LayerNorm1d:  # same code as BatchNorm1d, but normalizing over dim=1
    def __init__(self, dim, eps=1e-5):
        self.eps = eps
        # parameters (trained with backprop)
        self.gamma = torch.ones(dim)
        self.beta = torch.zeros(dim)
        # the running buffers are no longer needed: normalization
        # no longer depends on other examples in the batch

    def __call__(self, x):
        dim = 1
        xmean = x.mean(dim, keepdim=True)  # per-row mean
        xvar = x.var(dim, keepdim=True)    # per-row variance
        xhat = (x - xmean) / torch.sqrt(xvar + self.eps)  # normalize to unit variance
        self.out = self.gamma * xhat + self.beta
        return self.out

    def parameters(self):
        return [self.gamma, self.beta]
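PyTorch's nn.LayerNorm does exactly this row-wise normalization over the last dimension; a quick check:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
x = torch.randn(4, 8)
out = nn.LayerNorm(8)(x)
# each row now has ~zero mean and ~unit (biased) std
print(out.mean(dim=1))                 # ~0 for every row
print(out.std(dim=1, unbiased=False))  # ~1 for every row
```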

The pre-norm formulation commonly used today applies LayerNorm before each sub-layer. The corresponding code:

class Block(nn.Module):
    def __init__(self, n_embd, n_head):
        super().__init__()
        head_size = n_embd // n_head
        self.sa = MultiHeadAttention(n_head, head_size)
        self.ffwd = FeedForward(n_embd)
        self.ln1 = nn.LayerNorm(n_embd)
        self.ln2 = nn.LayerNorm(n_embd)

    def forward(self, x):
        x = x + self.sa(self.ln1(x))
        x = x + self.ffwd(self.ln2(x))
        return x
A final LayerNorm is also usually added after the stack of decoder blocks:

class BigramLanguageModel(nn.Module):
    def __init__(self, vocab_size):
        super().__init__()
        self.token_embedding_table = nn.Embedding(vocab_size, n_embd)
        self.position_embedding_table = nn.Embedding(block_size, n_embd)
        self.blocks = nn.Sequential(
            Block(n_embd, n_head=4),
            Block(n_embd, n_head=4),
            Block(n_embd, n_head=4),
            nn.LayerNorm(n_embd),
        )
        self.lm_head = nn.Linear(n_embd, vocab_size)

With layer normalization the loss improves a bit more:

step 4500: train loss 1.9931, val loss 2.0892
The gap between training and validation loss is now fairly large, and we need a way to address it.

2.8 Using dropout

  • Use dropout inside each head so the model is not dominated by any particular feature, which improves robustness.
    def forward(self, x):
        B, T, C = x.shape
        q = self.query(x)   # (B, T, head_size)
        k = self.key(x)     # (B, T, head_size)
        v = self.value(x)   # (B, T, head_size)
        wei = q @ k.transpose(-2, -1) * C**-0.5      # (B, T, T)
        wei = wei.masked_fill(self.tril[:T, :T] == 0, float('-inf'))
        wei = F.softmax(wei, dim=-1)                 # (B, T, T)
        wei = self.dropout(wei)
        out = wei @ v                                # (B, T, T) @ (B, T, hs) -> (B, T, hs)
        return out
  • Use dropout on the multi-head output projection, for the same reason:
    def forward(self, x):
        out = torch.cat([h(x) for h in self.heads], dim=-1)
        out = self.dropout(self.proj(out))
        return out
  • Use dropout before the output of the computation unit:
class FeedForward(nn.Module):
    def __init__(self, n_embd):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_embd, 4 * n_embd),
            nn.ReLU(),
            nn.Linear(4 * n_embd, n_embd),
            nn.Dropout(dropout),
        )
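What dropout does mechanically (train vs. eval), as a quick sketch:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
drop = nn.Dropout(p=0.2)
x = torch.ones(1000)

drop.train()
y = drop(x)
# in training, ~20% of elements are zeroed and survivors are scaled by 1/(1-p)
print((y == 0).float().mean().item())  # ~0.2
print(y.max().item())                  # 1.25

drop.eval()
print(torch.equal(drop(x), x))         # True: dropout is a no-op at eval time
```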

Scale up the hyperparameters:

# hyperparameters
batch_size = 64
block_size = 256
max_iter = 5000
eval_interval = 500
learning_rate = 3e-4    # self-attention can't tolerate a very high learning rate
device = 'cuda' if torch.cuda.is_available() else 'cpu'
eval_iters = 200
n_embd = 384
n_layer = 6
n_head = 6
dropout = 0.2

step 4500: train loss 1.1112, val loss 1.4791

References

[1] Andrej Karpathy, "Let's build GPT: from scratch, in code, spelled out." https://www.youtube.com/watch?v=kCc8FmEb1nY
