昇思25天学习打卡营第18天|文本解码原理--以MindNLP为例

文章目录

- - 昇思MindSpore应用实践
  - - - 1、自回归语言模型
      - RNN网络
      - 2、文本解码原理--以MindNLP为例
      - Greedy search
        Beam search
        Repeat problem
        TopK sample
  - Refernence

昇思MindSpore应用实践

本系列文章主要用于记录昇思25天学习打卡营的学习心得。

1、自回归语言模型

自回归语言模型（Autoregressive Language Model）是一种用于生成文本的统计模型。它基于序列数据的概率分布，通过建模当前词语与前面已生成词语的条件概率来预测下一个词语，即 根据前文预测下一个单词。

一个文本序列的概率分布可以分解为每个词基于其上文的条件概率的乘积

𝑊_0:初始上下文单词序列
𝑇: 时间步
当生成EOS标签时，停止生成。

自回归语言模型可以使用不同的方法来建模条件概率分布。其中，在基于Transformer架构的大语言模型出现之前，一种常见的方法是使用循环神经网络（Recurrent Neural Network，RNN），
RNN 可以通过在每个时间步骤上接收输入并保留隐状态信息，来捕捉序列中的上下文关系。通过训练RNN模型，可以学习到词语之间的概率分布，并用于生成新的文本。

RNN网络

RNN 网络的基本结构包括一个输入层 $x_t$ 、隐藏层 $h_t$ （含激活函数Activation Function）、延迟器（循环单元）、输出层 $h_t$ ，
在这里插入图片描述

网络中的神经元通过时间步骤连接形成循环：允许信息从一个时间步骤的输出 $h_{t-1}$ 通过与输入 $X_t$ 经过tanh函数激活后，传递至下一个时间步骤输入的一部分；
RNN具体计算公式：

$h_t=tanh(W_{ih}x_t+b_{ih}+W_{hh}x_{t-1}+b_{hh})$
在这里插入图片描述

单个展开的RNN结构

在这里插入图片描述

整体展开的RNN结构

对于某 t 时刻的步骤，RNN隐藏状态大致的计算方法为：

2、文本解码原理–以MindNLP为例

MindNLP/huggingface Transformers提供的文本生成方法

Greedy search

在每个时间步𝑡都简单地选择概率最高的词作为当前输出词:

𝑤_𝑡=𝑎𝑟𝑔𝑚𝑎𝑥_𝑤 𝑃(𝑤|𝑤_(1:𝑡−1))

按照贪心搜索输出序列(“The”,“nice”,“woman”) 的条件概率为：0.5 x 0.4 = 0.2

缺点: 错过了隐藏在低概率词后面的高概率词，如：dog=0.5, has=0.9

#greedy_searchfrom mindnlp.transformers import GPT2Tokenizer, GPT2LMHeadModeltokenizer = GPT2Tokenizer.from_pretrained("iiBcai/gpt2", mirror='modelscope')  # 导入预训练GPT2# add the EOS token as PAD token to avoid warnings
model = GPT2LMHeadModel.from_pretrained("iiBcai/gpt2", pad_token_id=tokenizer.eos_token_id, mirror='modelscope')# encode context the generation is conditioned on
input_ids = tokenizer.encode('I enjoy walking with my cute dog', return_tensors='ms')# generate text until the output length (which includes the context length) reaches 50
greedy_output = model.generate(input_ids, max_length=50)print("Output:\n" + 100 * '-')
print(tokenizer.decode(greedy_output[0], skip_special_tokens=True))

在这里插入图片描述

Beam search

Beam search通过在每个时间步保留最可能的 num_beams 个词，并从中最终选择出概率最高的序列来降低丢失潜在的高概率序列的风险。如图以 num_beams=2 为例:

(“The”,“dog”,“has”) : 0.4 * 0.9 = 0.36

(“The”,“nice”,“woman”) : 0.5 * 0.4 = 0.20

优点：一定程度保留最优路径

缺点：1. 无法解决重复问题；2. 开放域生成效果差

from mindnlp.transformers import GPT2Tokenizer, GPT2LMHeadModeltokenizer = GPT2Tokenizer.from_pretrained("iiBcai/gpt2", mirror='modelscope')# add the EOS token as PAD token to avoid warnings
model = GPT2LMHeadModel.from_pretrained("iiBcai/gpt2", pad_token_id=tokenizer.eos_token_id, mirror='modelscope')# encode context the generation is conditioned on
input_ids = tokenizer.encode('I enjoy walking with my cute dog', return_tensors='ms')# activate beam search and early_stopping
beam_output = model.generate(input_ids, max_length=50, num_beams=5, early_stopping=True
)print("Output:\n" + 100 * '-')
print(tokenizer.decode(beam_output[0], skip_special_tokens=True))
print(100 * '-')# set no_repeat_ngram_size to 2
beam_output = model.generate(input_ids, max_length=50, num_beams=5, no_repeat_ngram_size=2, early_stopping=True
)print("Beam search with ngram, Output:\n" + 100 * '-')
print(tokenizer.decode(beam_output[0], skip_special_tokens=True))
print(100 * '-')# set return_num_sequences > 1
beam_outputs = model.generate(input_ids, max_length=50, num_beams=5, no_repeat_ngram_size=2, num_return_sequences=5, early_stopping=True
)# now we have 3 output sequences
print("return_num_sequences, Output:\n" + 100 * '-')
for i, beam_output in enumerate(beam_outputs):print("{}: {}".format(i, tokenizer.decode(beam_output, skip_special_tokens=True)))
print(100 * '-')

Output:
----------------------------------------------------------------------------------------------------
I enjoy walking with my cute dog, but I don't think I'll ever be able to walk with her again.""I don't think I'll ever be able to walk with her again.""I don't think I
----------------------------------------------------------------------------------------------------
Beam search with ngram, Output:
----------------------------------------------------------------------------------------------------
I enjoy walking with my cute dog, but I don't think I'll ever be able to walk with her again.""I'm not sure what to say to that," she said. "I mean, it's not like I'm
----------------------------------------------------------------------------------------------------
return_num_sequences, Output:
----------------------------------------------------------------------------------------------------
0: I enjoy walking with my cute dog, but I don't think I'll ever be able to walk with her again.""I'm not sure what to say to that," she said. "I mean, it's not like I'm
1: I enjoy walking with my cute dog, but I don't think I'll ever be able to walk with her again.""I'm not sure what to say to that," she said. "I mean, it's not like she's
2: I enjoy walking with my cute dog, but I don't think I'll ever be able to walk with her again.""I'm not sure what to say to that," she said. "I mean, it's not like we're
3: I enjoy walking with my cute dog, but I don't think I'll ever be able to walk with her again.""I'm not sure what to say to that," she said. "I mean, it's not like I've
4: I enjoy walking with my cute dog, but I don't think I'll ever be able to walk with her again.""I'm not sure what to say to that," she said. "I mean, it's not like I can
----------------------------------------------------------------------------------------------------

Repeat problem

n-gram 惩罚:

将出现过的候选词的概率设置为 0
设置no_repeat_ngram_size=2 ，任意 2-gram 不会出现两次
实际文本生成需要重复出现

import mindspore
from mindnlp.transformers import GPT2Tokenizer, GPT2LMHeadModeltokenizer = GPT2Tokenizer.from_pretrained("iiBcai/gpt2", mirror='modelscope')# add the EOS token as PAD token to avoid warnings
model = GPT2LMHeadModel.from_pretrained("iiBcai/gpt2", pad_token_id=tokenizer.eos_token_id, mirror='modelscope')# encode context the generation is conditioned on
input_ids = tokenizer.encode('I enjoy walking with my cute dog', return_tensors='ms')mindspore.set_seed(0)
# activate sampling and deactivate top_k by setting top_k sampling to 0
sample_output = model.generate(input_ids, do_sample=True, max_length=50, top_k=0
)print("Output:\n" + 100 * '-')
print(tokenizer.decode(sample_output[0], skip_special_tokens=True))

Output:
----------------------------------------------------------------------------------------------------
I enjoy walking with my cute dog Neddy as much as I'd like. Keep up the good work Neddy!"I realized what Neddy meant when he first launched the website. "Thank you so much for joining."

TopK sample

选出概率最大的 K 个词，重新归一化，最后在归一化后的 K 个词中采样；
将采样池限制为固定大小 K ：

在分布比较尖锐的时候产生胡言乱语
在分布比较平坦的时候限制模型的创造力

import mindspore
from mindnlp.transformers import GPT2Tokenizer, GPT2LMHeadModeltokenizer = GPT2Tokenizer.from_pretrained("iiBcai/gpt2", mirror='modelscope')# add the EOS token as PAD token to avoid warnings
model = GPT2LMHeadModel.from_pretrained("iiBcai/gpt2", pad_token_id=tokenizer.eos_token_id, mirror='modelscope')# encode context the generation is conditioned on
input_ids = tokenizer.encode('I enjoy walking with my cute dog', return_tensors='ms')mindspore.set_seed(0)
# activate sampling and deactivate top_k by setting top_k sampling to 0
sample_output = model.generate(input_ids, do_sample=True, max_length=50, top_k=50
)print("Output:\n" + 100 * '-')
print(tokenizer.decode(sample_output[0], skip_special_tokens=True))

Output:
----------------------------------------------------------------------------------------------------
I enjoy walking with my cute dog.She's always up for some action, so I have seen her do some stuff with it.Then there's the two of us.The two of us I'm talking about were