## __Text Decoding Principles -- A MindNLP Example__
### Recap: autoregressive language models
__Predict the next word from the preceding context__
<div align=center><img src="https://openi.pcl.ac.cn/mindspore-courses/Step_into_LLMs/raw/commit/8f6e55c907ef7d2b616e8e3c4da76b065633c2ae/Season2.step_into_llm/03.Decoding/example.png"></div>
__The probability of a text sequence can be factorized into the product of each word's conditional probability given its preceding context__
<div align=center><img src="https://openi.pcl.ac.cn/mindspore-courses/Step_into_LLMs/raw/commit/8d9110b0d7b8840d536880b7fceaf0df1b3e8354/Season2.step_into_llm/03.Decoding/P%e5%85%ac%e5%bc%8f.png"></div>
* W_0: the initial context word sequence
* T: the time step
* Generation stops once the EOS token is produced.
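
Written out (following the notation above; this restates the formula shown in the figure), the factorization is:

$$
P(w_{1:T} \mid W_0) = \prod_{t=1}^{T} P(w_t \mid w_{1:t-1}, W_0), \qquad w_{1:0} = \emptyset
$$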
__Text generation methods provided by MindNLP / Hugging Face Transformers__
<div align=center><img src="https://openi.pcl.ac.cn/mindspore-courses/Step_into_LLMs/raw/commit/8f6e55c907ef7d2b616e8e3c4da76b065633c2ae/Season2.step_into_llm/03.Decoding/3possible.png"></div>
__Greedy search__
At each time step t, greedy search simply selects the word with the highest conditional probability as the output:

w_t = argmax_w P(w | w_{1:t-1})

Following greedy search, the output sequence ("The", "nice", "woman") has conditional probability 0.5 × 0.4 = 0.2.

Drawback: greedy search misses high-probability words hidden behind a lower-probability word. In the example, "has" (probability 0.9) sits behind "dog", so the overall more probable path through "dog" is never explored.
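
For intuition, here is a minimal, self-contained sketch of greedy decoding over a hand-made probability tree. The numbers are illustrative only (assumed to roughly mirror the example figure, not produced by a real model); it shows how following the locally best word at each step can end on a path with lower overall probability.

```python
# A hand-made probability tree; the numbers are illustrative only and are
# assumed to mirror the example figure, not produced by a real model.
tree = {
    ("The",): {"nice": 0.5, "dog": 0.4, "car": 0.1},
    ("The", "nice"): {"woman": 0.4, "house": 0.3, "guy": 0.3},
    ("The", "dog"): {"has": 0.9, "and": 0.05, "runs": 0.05},
}

def greedy_decode(prefix, steps):
    """Pick the locally most probable word at each step: w_t = argmax_w P(w | w_{1:t-1})."""
    prob = 1.0
    for _ in range(steps):
        next_probs = tree[tuple(prefix)]
        word = max(next_probs, key=next_probs.get)
        prob *= next_probs[word]
        prefix = prefix + [word]
    return prefix, prob

print(greedy_decode(["The"], 2))  # (['The', 'nice', 'woman'], 0.2)
# The path through "dog" has a higher overall probability, but greedy search never sees it:
print(0.4 * 0.9)                  # ~0.36
```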

```python
# greedy search
from mindnlp.transformers import GPT2Tokenizer, GPT2LMHeadModel

tokenizer = GPT2Tokenizer.from_pretrained("iiBcai/gpt2", mirror='modelscope')

# add the EOS token as PAD token to avoid warnings
model = GPT2LMHeadModel.from_pretrained("iiBcai/gpt2", pad_token_id=tokenizer.eos_token_id, mirror='modelscope')

# encode context the generation is conditioned on
input_ids = tokenizer.encode('I enjoy walking with my cute dog', return_tensors='ms')

# generate text until the output length (which includes the context length) reaches 50
greedy_output = model.generate(input_ids, max_length=50)

print("Output:\n" + 100 * '-')
print(tokenizer.decode(greedy_output[0], skip_special_tokens=True))
```
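
__Beam search__

Beam search reduces the risk of missing a high-probability sequence hidden behind a lower-probability word: at each time step it keeps the `num_beams` most likely partial hypotheses instead of only the single best one, and finally returns the hypothesis with the highest overall probability. The snippet below activates it by passing `num_beams=5` and `early_stopping=True` (stop once `num_beams` finished hypotheses are found) to `generate`.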
```python
from mindnlp.transformers import GPT2Tokenizer, GPT2LMHeadModel

tokenizer = GPT2Tokenizer.from_pretrained("iiBcai/gpt2", mirror='modelscope')

# add the EOS token as PAD token to avoid warnings
model = GPT2LMHeadModel.from_pretrained("iiBcai/gpt2", pad_token_id=tokenizer.eos_token_id, mirror='modelscope')

# encode context the generation is conditioned on
input_ids = tokenizer.encode('I enjoy walking with my cute dog', return_tensors='ms')

# activate beam search and early_stopping
beam_output = model.generate(
    input_ids,
    max_length=50,
    num_beams=5,
    early_stopping=True
)

print("Output:\n" + 100 * '-')
print(tokenizer.decode(beam_output[0], skip_special_tokens=True))
print(100 * '-')
```
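
Beam search output still tends to repeat itself. A common remedy is an n-gram penalty: with `no_repeat_ngram_size=2`, no 2-gram may appear twice in the generated text. Note that this also blocks legitimate repetitions (for example an entity name that should occur more than once), so it has to be used with care.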
```python
# set no_repeat_ngram_size to 2
beam_output = model.generate(
    input_ids,
    max_length=50,
    num_beams=5,
    no_repeat_ngram_size=2,
    early_stopping=True
)

print("Beam search with ngram, Output:\n" + 100 * '-')
print(tokenizer.decode(beam_output[0], skip_special_tokens=True))
print(100 * '-')
```
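
Because beam search keeps several hypotheses alive, `generate` can also return more than one of them: `num_return_sequences` chooses how many of the highest-scoring beams to return, and it must not exceed `num_beams`.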
```python
# set num_return_sequences > 1
beam_outputs = model.generate(
    input_ids,
    max_length=50,
    num_beams=5,
    no_repeat_ngram_size=2,
    num_return_sequences=5,
    early_stopping=True
)

# now we have 5 output sequences
print("num_return_sequences, Output:\n" + 100 * '-')
for i, beam_output in enumerate(beam_outputs):
    print("{}: {}".format(i, tokenizer.decode(beam_output, skip_special_tokens=True)))
print(100 * '-')
```