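Tokenization Guide: Byte Pair Encoding, WordPiece and Other Methods, Explained with Python Code

Since OpenAI released ChatGPT in November 2022, large language models (LLMs) have become extremely popular. Their use has grown explosively since then, helped in part by libraries such as Hugging Face's Transformers and PyTorch.

Before a computer can process language, the text must first be converted into numerical form. This process is called tokenization.

Tokenization consists of two steps:

1. Dividing the input text into tokens

The tokenizer first takes the text and splits it into smaller pieces, which may be words, parts of words, or individual characters. These smaller pieces of text are called tokens. The Stanford NLP Group [2] defines a token more strictly as:

An instance of a sequence of characters in some particular document that are grouped together as a useful semantic unit for processing.

2. Assigning an ID to each token

Once the tokenizer has split the text into tokens, each token can be assigned an integer called a token ID. For example, if the word cat is assigned the value 15, every cat token in the input text is represented by the number 15. Replacing text tokens with their numerical representation is called encoding. Likewise, converting encoded tokens back into text is called decoding.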
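To make encoding and decoding concrete, here is a minimal sketch. The vocabulary and its ID values (including 15 for cat) are arbitrary and chosen only for illustration; a real tokenizer learns its vocabulary from data.

```python
# Toy vocabulary; the IDs are arbitrary and chosen only for illustration.
vocab = {"cat": 15, "sat": 7, "the": 2, "on": 9, "mat": 31}
id_to_token = {token_id: token for token, token_id in vocab.items()}

def encode(tokens):
    """Replace each token with its integer ID."""
    return [vocab[token] for token in tokens]

def decode(token_ids):
    """Map integer IDs back to their tokens."""
    return [id_to_token[token_id] for token_id in token_ids]

tokens = ["the", "cat", "sat", "on", "the", "mat"]
ids = encode(tokens)
print(ids)          # [2, 15, 7, 9, 2, 31]
print(decode(ids))  # ['the', 'cat', 'sat', 'on', 'the', 'mat']
```

Both occurrences of the map to the same ID, which is exactly the behaviour described for cat above.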

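Representing a token with a single number has drawbacks, so these encodings are processed further to create word embeddings. That is outside the scope of this article and will be covered later.

Tokenization methods

There are three main approaches to dividing text into tokens:

1. Word-based:

Word-based tokenization is the simplest of the three methods. The tokenizer breaks sentences into words by splitting on every space character (sometimes called "whitespace-based tokenization") or by a similar set of rules, such as punctuation-based tokenization [12].

For example, the sentence:

 Cats are great, but dogs are better!

can be split on whitespace into:

 ['Cats', 'are', 'great,', 'but', 'dogs', 'are', 'better!']

and on punctuation and whitespace into:

 ['Cats', 'are', 'great', ',', 'but', 'dogs', 'are', 'better', '!']

This shows that the rules used to determine the split matter. The whitespace approach produces the potentially rare token better!, whereas splitting on punctuation produces two less rare tokens, better and !. Note that punctuation should not be stripped out entirely, because it can carry very specific meaning. The apostrophe is one example: it distinguishes the plural form of a word from the possessive form. "book's" refers to some property of a book, while "books" refers to many books.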
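The two splitting rules can be reproduced with a few lines of standard-library Python. This is only a sketch: the regular expression is one possible punctuation rule, not the rule used by any particular tokenizer.

```python
import re

sentence = "Cats are great, but dogs are better!"

# Whitespace-based: split on runs of whitespace only.
whitespace_tokens = sentence.split()

# Punctuation-aware: keep runs of word characters together, and emit every
# other non-space character (i.e. punctuation) as its own token.
punctuation_tokens = re.findall(r"\w+|[^\w\s]", sentence)

print(whitespace_tokens)
# ['Cats', 'are', 'great,', 'but', 'dogs', 'are', 'better!']
print(punctuation_tokens)
# ['Cats', 'are', 'great', ',', 'but', 'dogs', 'are', 'better', '!']
```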

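Once the tokens have been generated, each can be assigned a number. The next time the tokenizer produces a token it has already seen, it simply assigns the number already designated for that word. For example, if the token great is assigned the value 1 in the sentence above, all subsequent instances of great are also assigned the value 1 [3].

Pros and cons:

Tokens produced by word-based methods carry a high degree of information, since each token contains semantic and contextual information. One of the biggest drawbacks, however, is that very similar words are treated as completely separate tokens. For example, the link between cat and cats is lost, and the two are treated as separate words. This becomes a problem in large-scale applications containing many words, because the number of possible tokens in the model's vocabulary (the total set of tokens the model has seen) can grow very large. English has roughly 170,000 words, which leads to what is known as the exploding vocabulary problem. One example is the TransformerXL tokenizer, which uses whitespace-based splitting and ends up with a vocabulary of more than 250,000 tokens [4].

One way to address this is to impose a hard limit on the number of tokens the model can learn (for example 10,000). Any word outside the 10,000 most frequent tokens is then classed as out-of-vocabulary (OOV) and assigned the token value UNKNOWN (often shortened to UNK) rather than a numerical value. This degrades performance when many unknown words are present, but can be a suitable compromise if the data contains mostly common words [5].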
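The vocabulary cap can be sketched as follows. The helper names build_vocab and encode, the choice of ID 0 for UNK, and the tiny limit of two tokens are all assumptions made for this illustration.

```python
from collections import Counter

def build_vocab(tokens, max_size):
    """Keep the max_size most frequent tokens; ID 0 is reserved for UNK."""
    most_common = Counter(tokens).most_common(max_size)
    vocab = {"UNK": 0}
    for token, _count in most_common:
        vocab[token] = len(vocab)
    return vocab

def encode(tokens, vocab):
    """Map tokens to IDs, falling back to the UNK ID for unseen tokens."""
    return [vocab.get(token, vocab["UNK"]) for token in tokens]

training_tokens = "the cat sat on the mat the cat ate".split()
vocab = build_vocab(training_tokens, max_size=2)

print(vocab)                                 # {'UNK': 0, 'the': 1, 'cat': 2}
print(encode("the dog sat".split(), vocab))  # [1, 0, 0]
```

Note that sat was present in the training text but still maps to UNK, because it fell outside the most frequent tokens.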

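2. Character-based tokenizers

Character-based tokenization splits text on every character, including letters, numbers and special characters such as punctuation. This greatly reduces the vocabulary size: English can be represented with around 256 tokens instead of the 170,000+ needed for a word-based approach [5]. Even East Asian languages such as Chinese and Japanese see a significant reduction in vocabulary size, despite having thousands of unique characters in their writing systems.

With a character-based tokenizer, the sentence:

 Cats are great, but dogs are better!

is split into:

 ['C', 'a', 't', 's', ' ', 'a', 'r', 'e', ' ', 'g', 'r', 'e', 'a', 't', ',', ' ', 'b', 'u', 't', ' ', 'd', 'o', 'g', 's', ' ', 'a', 'r', 'e', ' ', 'b', 'e', 't', 't', 'e', 'r', '!']

Pros and cons:

Compared with word-based methods, character-based methods have a much smaller vocabulary and far fewer out-of-vocabulary tokens. They can even tokenize misspelled words (although differently from the correct form of the word).

This approach has drawbacks too. A single character-based token stores very little information: unlike the tokens in word-based methods, it captures no semantic or contextual meaning (especially in languages with alphabet-based writing systems such as English). The approach also limits the size of the tokenized input that can be fed into a language model, since many numbers are needed to encode the input text.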
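The trade-off between sequence length and vocabulary size is easy to see on the example sentence. This short sketch uses only built-in Python:

```python
sentence = "Cats are great, but dogs are better!"

char_tokens = list(sentence)     # one token per character, spaces included
word_tokens = sentence.split()   # one token per whitespace-separated word

print(len(word_tokens))          # 7
print(len(char_tokens))          # 36
print(len(set(char_tokens)))     # 14 distinct characters
```

The character-based encoding is about five times longer than the word-based one, yet needs only 14 distinct tokens to represent this sentence.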

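3. Subword-based tokenizers

Subword-based tokenization aims to gain the benefits of both word-based and character-based methods while minimising their downsides. It takes a middle ground by splitting text within words to create tokens that carry semantic meaning even if they are not full words. For example, the tokens ing and ed carry grammatical meaning although they are not words in themselves.

This approach produces a vocabulary smaller than that of word-based methods but larger than that of character-based methods. The same holds for the amount of information stored in each token, which also sits between the tokens produced by the previous two methods.

Only uncommon words are split, so that conjugations, plurals and similar forms break down into their constituent parts while the relationship between tokens is preserved. For example, cat might be a very common word in the dataset, but cats might be less common. So cats is split into cat and s, where cat is now assigned the same value as every other cat token, and s is assigned a different value that can encode the meaning of plurality. Another example is the word tokenization, which can be split into the root word token and the suffix ization. This approach preserves syntactic and semantic similarity [6]. For these reasons, subword-based tokenizers are very commonly used in NLP models today.

Normalization and pre-tokenization

The tokenization process requires some pre-processing and post-processing steps, which together make up the tokenization pipeline. The tokenization method itself (subword-based, character-based and so on) takes place in the model step [7].

When using a tokenizer from Hugging Face's transformers library, all the steps of the tokenization pipeline are handled automatically. The entire pipeline is performed by a single object called Tokenizer. This section looks at the inner workings of code that most users never need to handle manually when working on NLP tasks. It also covers how to customise the base tokenizer class in the tokenizers library, so that a tokenizer can be purpose-built for a specific task if required.

1. Normalization methods

Normalization is the process of cleaning text before it is split into tokens. It includes steps such as converting every character to lowercase, removing accents from characters, and removing unnecessary whitespace. Take the string ThÍs is áN ExaMPlé sÉnteNCE as an example: different normalizers will perform different steps on it.

Hugging Face's normalizers package contains several basic normalizers. Commonly used ones include:

NFC: does not convert case or remove accents

Lower: converts case but does not remove accents

BERT: converts case and removes accents

The three can be compared as follows: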

 from tokenizers.normalizers import NFC, Lowercase, BertNormalizer

 # Text to normalize
 text = 'ThÍs is  áN ExaMPlé     sÉnteNCE'

 # Instantiate normalizer objects
 NFCNorm = NFC()
 LowercaseNorm = Lowercase()
 BertNorm = BertNormalizer()

 # Normalize the text
 print(f'NFC:   {NFCNorm.normalize_str(text)}')
 print(f'Lower: {LowercaseNorm.normalize_str(text)}')
 print(f'BERT:  {BertNorm.normalize_str(text)}')

 # NFC:   ThÍs is  áN ExaMPlé     sÉnteNCE
 # Lower: thís is  án examplé     séntence
 # BERT:  this is  an example     sentence

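The example below accesses the normalizers used by pretrained tokenizers from the transformers library. In these examples, only the FNet normalizer removes the unnecessary whitespace.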

 from transformers import (FNetTokenizerFast, CamembertTokenizerFast,
                           BertTokenizerFast)

 # Text to normalize
 text = 'ThÍs is  áN ExaMPlé     sÉnteNCE'

 # Instantiate tokenizers
 FNetTokenizer = FNetTokenizerFast.from_pretrained('google/fnet-base')
 CamembertTokenizer = CamembertTokenizerFast.from_pretrained('camembert-base')
 BertTokenizer = BertTokenizerFast.from_pretrained('bert-base-uncased')

 # Normalize the text with each tokenizer's own normalizer
 fnet_output = FNetTokenizer.backend_tokenizer.normalizer.normalize_str(text)
 camembert_output = CamembertTokenizer.backend_tokenizer.normalizer.normalize_str(text)
 bert_output = BertTokenizer.backend_tokenizer.normalizer.normalize_str(text)

 print(f'FNet Output:      {fnet_output}')
 print(f'CamemBERT Output: {camembert_output}')
 print(f'BERT Output:      {bert_output}')

 # FNet Output:      ThÍs is áN ExaMPlé sÉnteNCE
 # CamemBERT Output: ThÍs is  áN ExaMPlé     sÉnteNCE
 # BERT Output:      this is  an example     sentence
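For intuition about what these normalization steps involve, here is a rough sketch using only Python's standard unicodedata module. It is not how the Hugging Face normalizers are implemented; the helper names strip_accents and collapse_whitespace are made up for this illustration.

```python
import unicodedata

def strip_accents(text):
    """Decompose characters (NFD) and drop the combining accent marks."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")

def collapse_whitespace(text):
    """Replace every run of whitespace with a single space."""
    return " ".join(text.split())

text = "ThÍs is  áN ExaMPlé     sÉnteNCE"

# Lowercase, then strip accents, then tidy the whitespace.
print(collapse_whitespace(strip_accents(text.lower())))
# this is an example sentence
```

Each step is independent, which is why different normalizers can pick and choose: some only lowercase, some also strip accents, and some also clean up whitespace.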

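2. Pre-tokenization

The pre-tokenization step is the first split of the raw text. The split is performed to give an upper bound on what the final tokens can be. A sentence can be split into several words in the pre-tokenization step, and then in the model step some of those words may be split further according to the tokenization method (for example, a subword-based method). The pre-tokenized text therefore represents the largest tokens that could still remain after tokenization.

For example, a sentence can be split on every space, on every space plus some punctuation, or on every space plus every punctuation mark.

The following compares the basic WhitespaceSplit pre-tokenizer with the slightly more complex BertPreTokenizer, both from the pre_tokenizers package. The output of the whitespace pre-tokenizer leaves punctuation intact and still attached to the neighbouring words; for example, includes: is treated as a single word. The BERT pre-tokenizer, by contrast, treats punctuation marks as individual words [8].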

 from tokenizers.pre_tokenizers import WhitespaceSplit, BertPreTokenizer

 # Text to pre-tokenize
 text = ("this sentence's content includes: characters, spaces, and "
         "punctuation.")

 # Define helper function to display pre-tokenized output
 def print_pretokenized_str(pre_tokens):
     for pre_token in pre_tokens:
         print(f'"{pre_token[0]}", ', end='')

 # Instantiate pre-tokenizers
 wss = WhitespaceSplit()
 bpt = BertPreTokenizer()

 # Pre-tokenize the text
 print('Whitespace Pre-Tokenizer:')
 print_pretokenized_str(wss.pre_tokenize_str(text))

 # Whitespace Pre-Tokenizer:
 # "this", "sentence's", "content", "includes:", "characters,", "spaces,",
 # "and", "punctuation.",

 print('\n\nBERT Pre-Tokenizer:')
 print_pretokenized_str(bpt.pre_tokenize_str(text))

 # BERT Pre-Tokenizer:
 # "this", "sentence", "'", "s", "content", "includes", ":", "characters",
 # ",", "spaces", ",", "and", "punctuation", ".",

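Pre-tokenization methods can also be called directly from common tokenizers such as the GPT-2 and ALBERT (A Lite BERT) tokenizers. These differ slightly from the standard BERT pre-tokenizer shown above, in that space characters are not removed when splitting the tokens. Instead they are replaced with a special character that marks where the space was. The advantage is that the space characters can be ignored in further processing, while the original sentence can still be recovered if needed. The GPT-2 model uses the Ġ character, a capital G with a dot above it. The ALBERT model uses an underscore-like character (▁).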

 from transformers import AutoTokenizer

 # Text to pre-tokenize
 text = ("this sentence's content includes: characters, spaces, and "
         "punctuation.")

 # Instantiate the pre-tokenizers
 GPT2_PreTokenizer = (AutoTokenizer.from_pretrained('gpt2')
                      .backend_tokenizer.pre_tokenizer)
 Albert_PreTokenizer = (AutoTokenizer.from_pretrained('albert-base-v1')
                        .backend_tokenizer.pre_tokenizer)

 # Pre-tokenize the text
 # (print_pretokenized_str is the helper defined in the previous example)
 print('GPT-2 Pre-Tokenizer:')
 print_pretokenized_str(GPT2_PreTokenizer.pre_tokenize_str(text))

 # GPT-2 Pre-Tokenizer:
 # "this", "Ġsentence", "'s", "Ġcontent", "Ġincludes", ":", "Ġcharacters", ",",
 # "Ġspaces", ",", "Ġand", "Ġpunctuation", ".",

 print('\n\nALBERT Pre-Tokenizer:')
 print_pretokenized_str(Albert_PreTokenizer.pre_tokenize_str(text))

 # ALBERT Pre-Tokenizer:
 # "▁this", "▁sentence's", "▁content", "▁includes:", "▁characters,", "▁spaces,",
 # "▁and", "▁punctuation.",
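The idea of replacing spaces with a marker, so that the original sentence stays recoverable, can be imitated in plain Python. This sketch only mimics the visible effect; it is not the actual GPT-2 or ALBERT implementation, the helpers mark_spaces and restore are hypothetical, and it assumes the text has no trailing spaces and does not already contain the marker character.

```python
import re

def mark_spaces(text, marker="Ġ"):
    """Split into words, folding the preceding spaces into a marker prefix."""
    # Each piece is any leading spaces followed by a run of non-space characters.
    pieces = re.findall(r" *[^ ]+", text)
    return [piece.replace(" ", marker) for piece in pieces]

def restore(pre_tokens, marker="Ġ"):
    """Rebuild the original string by turning markers back into spaces."""
    return "".join(pre_tokens).replace(marker, " ")

text = "this sentence's content"
pre_tokens = mark_spaces(text)

print(pre_tokens)            # ['this', "Ġsentence's", 'Ġcontent']
print(restore(pre_tokens))   # this sentence's content
```

Because the marker travels with the token, no separate record of where the spaces were is needed to reverse the split.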

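Below is the result of the BERT pre-tokenization step on the same example sentence. The returned object is a Python list of tuples. Each tuple corresponds to one pre-token: the first element is the pre-token string, and the second element is a tuple containing the start and end indices of the string in the original input text.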

 from tokenizers.pre_tokenizers import BertPreTokenizer

 # Text to pre-tokenize
 text = ("this sentence's content includes: characters, spaces, and "
         "punctuation.")

 # Instantiate pre-tokenizer
 bpt = BertPreTokenizer()

 # Pre-tokenize the text
 bpt.pre_tokenize_str(text)

The result is:

 [('this', (0, 4)),
  ('sentence', (5, 13)),
  ("'", (13, 14)),
  ('s', (14, 15)),
  ('content', (16, 23)),
  ('includes', (24, 32)),
  (':', (32, 33)),
  ('characters', (34, 44)),
  (',', (44, 45)),
  ('spaces', (46, 52)),
  (',', (52, 53)),
  ('and', (54, 57)),
  ('punctuation', (58, 69)),
  ('.', (69, 70))]
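The offsets follow Python's usual slicing convention: the start index is inclusive and the end index is exclusive. A quick check on the first few tuples from the result confirms this:

```python
text = "this sentence's content includes: characters, spaces, and punctuation."

# The first few pre-tokens and offsets, copied from the result shown above.
pre_tokens = [("this", (0, 4)), ("sentence", (5, 13)), ("'", (13, 14)),
              ("s", (14, 15)), ("content", (16, 23))]

for token, (start, end) in pre_tokens:
    # Slicing the original text with the offsets gives back the token.
    assert text[start:end] == token
    print(f"text[{start}:{end}] == {token!r}")
```

Gaps between consecutive offsets, such as between 4 and 5, correspond to the space characters that this pre-tokenizer discards.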

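Subword tokenization methods

With normalization and pre-tokenization complete, the tokenizer can start merging tokens. For transformer models, three methods are commonly used to implement subword-based tokenization. They all use slightly different techniques to split uncommon words into smaller tokens.

1. Byte Pair Encoding

The Byte Pair Encoding (BPE) algorithm is a commonly used tokenizer, found in models such as GPT and GPT-2 (OpenAI), BART (Lewis et al.) and others [9-10]. It was originally designed as a text compression algorithm, but was found to work very well for tokenization in language models. The BPE algorithm breaks a string of text into subword units that appear frequently in a reference corpus (the text used to train the tokenization model) [11]. A BPE model is trained as follows:

a) Construct the corpus

The input text is passed through the normalization and pre-tokenization steps to create a clean list of words. These words are then handed to the BPE model, which determines the frequency of each word and stores that number alongside the word in a list called the corpus.

b) Construct the vocabulary

The words in the corpus are then broken down into individual characters, which are added to an empty list called the vocabulary. The algorithm adds to this vocabulary iteratively each time it determines which character pairs can be merged together.

c) Find the frequency of character pairs

The frequency of character pairs is then recorded for each word in the corpus. For example, the word cats has the character pairs ca, at and ts. Every word is examined in this way and contributes to a global frequency counter, so each instance of ca found in any token increases the frequency count for the pair ca.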
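The pair-counting step can be sketched in a few lines with collections.Counter. The three-word corpus and its frequencies are a cut-down illustration; each pair is weighted by the frequency of the word it appears in.

```python
from collections import Counter

# A tiny corpus of (word split into characters, word frequency) entries.
corpus = [(["c", "a", "t"], 5), (["c", "a", "t", "s"], 2), (["e", "a", "t"], 10)]

pair_counts = Counter()
for symbols, freq in corpus:
    # zip pairs each symbol with its right-hand neighbour.
    for pair in zip(symbols, symbols[1:]):
        pair_counts[pair] += freq

print(pair_counts.most_common(3))
# [(('a', 't'), 17), (('e', 'a'), 10), (('c', 'a'), 7)]
```

Using tuples as keys keeps the two symbols separate without needing a separator character, which avoids ambiguity if a symbol ever contains that separator.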

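d) Create a merge rule

Once the frequency of each character pair is known, the most frequent pair is added to the vocabulary. The vocabulary now consists of every individual letter in the tokens plus the most frequent character pair. This also yields a merge rule that the model can use. For example, if the model learns that ca is the most frequent character pair, it has learned that all adjacent instances of c and a in the corpus can be merged to give ca, which is then treated as a single character for the remaining steps.

Steps c and d are repeated, finding more merge rules and adding more character pairs to the vocabulary. The process continues until the vocabulary reaches the target size specified at the start of training.

Below is a Python implementation of the BPE algorithm: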

 class TargetVocabularySizeError(Exception):
     def __init__(self, message):
         super().__init__(message)


 class BPE:
     '''An implementation of the Byte Pair Encoding tokenizer.'''

     def calculate_frequency(self, words):
         ''' Calculate the frequency for each word in a list of words.

         Take in a list of words stored as strings and return a list of
         tuples where each tuple contains a string from the words list,
         and an integer representing its frequency count in the list.

         Args:
             words (list): A list of words (strings) in any order.

         Returns:
             corpus (list[tuple(str, int)]): A list of tuples where the
               first element is a string of a word in the words list, and
               the second element is an integer representing the frequency
               of the word in the list.
         '''
         freq_dict = dict()

         for word in words:
             if word not in freq_dict:
                 freq_dict[word] = 1
             else:
                 freq_dict[word] += 1

         corpus = [(word, freq_dict[word]) for word in freq_dict.keys()]

         return corpus

     def create_merge_rule(self, corpus):
         ''' Create a merge rule and add it to the self.merge_rules list.

         Args:
             corpus (list[tuple(list, int)]): A list of tuples where the
               first element is a list of a word in the words list (where
               the elements are the individual characters (or subwords in
               later iterations) of the word), and the second element is
               an integer representing the frequency of the word in the
               list.

         Returns:
             None
         '''
         pair_frequencies = self.find_pair_frequencies(corpus)
         # max() raises ValueError when there are no pairs left to merge
         most_frequent_pair = max(pair_frequencies, key=pair_frequencies.get)
         self.merge_rules.append(most_frequent_pair.split(','))
         self.vocabulary.append(most_frequent_pair)

     def create_vocabulary(self, words):
         ''' Create a list of every unique character in a list of words.

         Args:
             words (list): A list of strings containing the words of the
               input text.

         Returns:
             vocabulary (list): A list of every unique character in the list
               of input words.
         '''
         vocabulary = list(set(''.join(words)))
         return vocabulary

     def find_pair_frequencies(self, corpus):
         ''' Find the frequency of each character pair in the corpus.

         Loop through the corpus and calculate the frequency of each pair
         of adjacent characters across every word. Return a dictionary of
         each character pair as the keys and the corresponding frequency as
         the values.

         Args:
             corpus (list[tuple(list, int)]): A list of tuples where the
               first element is a list of a word in the words list (where
               the elements are the individual characters (or subwords in
               later iterations) of the word), and the second element is
               an integer representing the frequency of the word in the
               list.

         Returns:
             pair_freq_dict (dict): A dictionary where the keys are the
               character pairs from the input corpus and the values are an
               integer representing the frequency of the pair in the
               corpus.
         '''
         pair_freq_dict = dict()

         for word, word_freq in corpus:
             for idx in range(len(word)-1):
                 # Pairs are stored as comma-joined strings, e.g. 'a,t'
                 char_pair = f'{word[idx]},{word[idx+1]}'
                 if char_pair not in pair_freq_dict:
                     pair_freq_dict[char_pair] = word_freq
                 else:
                     pair_freq_dict[char_pair] += word_freq

         return pair_freq_dict

     def get_merged_chars(self, char_1, char_2):
         ''' Merge the highest score pair and return to the self.merge method.

         This method is abstracted so that the BPE class can be used as the
         base class for other Tokenizers, and so the merging method can be
         easily overwritten. For example, in the BPE algorithm the
         characters can simply be concatenated and returned. However in the
         WordPiece algorithm, the # symbols must first be stripped.

         Args:
             char_1 (str): The first character in the highest-scoring pair.
             char_2 (str): The second character in the highest-scoring pair.

         Returns:
             merged_chars (str): Merged characters.
         '''
         merged_chars = char_1 + char_2
         return merged_chars

     def initialize_corpus(self, words):
         ''' Split each word into characters and count the word frequency.

         Split each word in the input word list on every character. For each
         word, store the split word in a list as the first element inside a
         tuple. Store the frequency count of the word as an integer as the
         second element of the tuple. Create a tuple for every word in this
         fashion and store the tuples in a list called 'corpus', then return
         the corpus list.

         Args:
             words (list): A list of words (strings) in any order.

         Returns:
             corpus (list[tuple(list, int)]): A list of tuples where the
               first element is a list of a word in the words list (where
               the elements are the individual characters of the word),
               and the second element is an integer representing the
               frequency of the word in the list.
         '''
         corpus = self.calculate_frequency(words)
         corpus = [([*word], freq) for (word, freq) in corpus]
         return corpus

     def merge(self, corpus):
         ''' Loop through the corpus and perform the latest merge rule.

         Args:
             corpus (list[tuple(list, int)]): A list of tuples where the
               first element is a list of a word in the words list (where
               the elements are the individual characters (or subwords in
               later iterations) of the word), and the second element is
               an integer representing the frequency of the word in the
               list.

         Returns:
             new_corpus (list[tuple(list, int)]): A modified version of the
               input argument where the most recent merge rule has been
               applied to merge the most frequent adjacent characters.
         '''
         merge_rule = self.merge_rules[-1]
         new_corpus = []

         for word, word_freq in corpus:
             new_word = []
             idx = 0

             while idx < len(word):
                 # If a merge pattern has been found. The bounds check stops
                 # word[idx+1] from running past the end of the word.
                 if (idx < len(word) - 1 and word[idx] == merge_rule[0]
                         and word[idx+1] == merge_rule[1]):
                     new_word.append(self.get_merged_chars(word[idx],
                                                           word[idx+1]))
                     idx += 2
                 # If a merge pattern has not been found
                 else:
                     new_word.append(word[idx])
                     idx += 1

             new_corpus.append((new_word, word_freq))

         return new_corpus

     def train(self, words, target_vocab_size):
         ''' Train the model.

         Args:
             words (list[str]): A list of words to train the model on.
             target_vocab_size (int): The number of words in the vocabulary
               to be used as the stopping condition when training.

         Returns:
             None.
         '''
         self.words = words
         self.target_vocab_size = target_vocab_size
         self.corpus = self.initialize_corpus(self.words)
         self.corpus_history = [self.corpus]
         self.vocabulary = self.create_vocabulary(self.words)
         self.vocabulary_size = len(self.vocabulary)
         self.merge_rules = []

         # Iteratively add vocabulary until reaching the target vocabulary size
         if len(self.vocabulary) > self.target_vocab_size:
             raise TargetVocabularySizeError(
                 'Error: Target vocabulary size must be greater than the '
                 f'initial vocabulary size ({len(self.vocabulary)})')
         else:
             while len(self.vocabulary) < self.target_vocab_size:
                 try:
                     self.create_merge_rule(self.corpus)
                     self.corpus = self.merge(self.corpus)
                     self.corpus_history.append(self.corpus)
                 # If no further merging is possible
                 except ValueError:
                     print('Exiting: No further merging is possible')
                     break

     def tokenize(self, text):
         ''' Take in some text and return a list of tokens for that text.

         Args:
             text (str): The text to be tokenized.

         Returns:
             tokens (list): The list of tokens created from the input text.
         '''
         tokens = [*text]

         # Apply the learned merge rules in the order they were created
         for merge_rule in self.merge_rules:
             new_tokens = []
             idx = 0

             while idx < len(tokens):
                 # If a merge pattern has been found (with a bounds check so
                 # that tokens[idx+1] always exists)
                 if (idx < len(tokens) - 1 and tokens[idx] == merge_rule[0]
                         and tokens[idx+1] == merge_rule[1]):
                     new_tokens.append(self.get_merged_chars(tokens[idx],
                                                             tokens[idx+1]))
                     idx += 2
                 # If a merge pattern has not been found
                 else:
                     new_tokens.append(tokens[idx])
                     idx += 1

             tokens = new_tokens

         return tokens

It can be used as follows:

 # Training set
 words = ['cat', 'cat', 'cat', 'cat', 'cat',
          'cats', 'cats',
          'eat', 'eat', 'eat', 'eat', 'eat', 'eat', 'eat', 'eat', 'eat', 'eat',
          'eating', 'eating', 'eating',
          'running', 'running',
          'jumping',
          'food', 'food', 'food', 'food', 'food', 'food']

 # Instantiate the tokenizer
 bpe = BPE()
 bpe.train(words, 21)

 # Print the corpus at each stage of the process, and the merge rule used
 print(f'INITIAL CORPUS:\n{bpe.corpus_history[0]}\n')
 for rule, corpus in list(zip(bpe.merge_rules, bpe.corpus_history[1:])):
     print(f'NEW MERGE RULE: Combine "{rule[0]}" and "{rule[1]}"')
     print(corpus, end='\n\n')

The output is:

 INITIAL CORPUS:
 [(['c', 'a', 't'], 5), (['c', 'a', 't', 's'], 2), (['e', 'a', 't'], 10), (['e', 'a', 't', 'i', 'n', 'g'], 3), (['r', 'u', 'n', 'n', 'i', 'n', 'g'], 2), (['j', 'u', 'm', 'p', 'i', 'n', 'g'], 1), (['f', 'o', 'o', 'd'], 6)]

 NEW MERGE RULE: Combine "a" and "t"
 [(['c', 'at'], 5), (['c', 'at', 's'], 2), (['e', 'at'], 10), (['e', 'at', 'i', 'n', 'g'], 3), (['r', 'u', 'n', 'n', 'i', 'n', 'g'], 2), (['j', 'u', 'm', 'p', 'i', 'n', 'g'], 1), (['f', 'o', 'o', 'd'], 6)]

 NEW MERGE RULE: Combine "e" and "at"
 [(['c', 'at'], 5), (['c', 'at', 's'], 2), (['eat'], 10), (['eat', 'i', 'n', 'g'], 3), (['r', 'u', 'n', 'n', 'i', 'n', 'g'], 2), (['j', 'u', 'm', 'p', 'i', 'n', 'g'], 1), (['f', 'o', 'o', 'd'], 6)]

 NEW MERGE RULE: Combine "c" and "at"
 [(['cat'], 5), (['cat', 's'], 2), (['eat'], 10), (['eat', 'i', 'n', 'g'], 3), (['r', 'u', 'n', 'n', 'i', 'n', 'g'], 2), (['j', 'u', 'm', 'p', 'i', 'n', 'g'], 1), (['f', 'o', 'o', 'd'], 6)]

 NEW MERGE RULE: Combine "i" and "n"
 [(['cat'], 5), (['cat', 's'], 2), (['eat'], 10), (['eat', 'in', 'g'], 3), (['r', 'u', 'n', 'n', 'in', 'g'], 2), (['j', 'u', 'm', 'p', 'in', 'g'], 1), (['f', 'o', 'o', 'd'], 6)]

 NEW MERGE RULE: Combine "in" and "g"
 [(['cat'], 5), (['cat', 's'], 2), (['eat'], 10), (['eat', 'ing'], 3), (['r', 'u', 'n', 'n', 'ing'], 2), (['j', 'u', 'm', 'p', 'ing'], 1), (['f', 'o', 'o', 'd'], 6)]

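This code is intended only for learning the process; in practice the transformers library can be used directly.

A BPE tokenizer can only recognise characters that appeared in its training data. When it meets a character outside that set, it should convert the character into an unknown token, which matters as soon as the model is used to tokenize real-world data. However, not every BPE implementation includes error handling that adds an unknown token, so some productionized models can crash on unseen characters.

The BPE tokenizers used in GPT-2 and RoBERTa do not have this problem. Instead of analysing the training data in terms of Unicode characters, they analyse the bytes of the characters. This is called Byte-Level BPE, and it allows a small base vocabulary to tokenize every character the model might see.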
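The reason a byte-level base vocabulary can cover any input is visible from plain UTF-8 encoding: every string, whatever its script, becomes a sequence of values between 0 and 255. The sample string below is arbitrary.

```python
text = "naïve 🐈"

# Every string becomes a sequence of byte values in the range 0-255.
byte_ids = list(text.encode("utf-8"))
print(byte_ids)
print(all(0 <= b < 256 for b in byte_ids))       # True

# Decoding the bytes recovers the original text exactly.
print(bytes(byte_ids).decode("utf-8") == text)   # True
```

A base vocabulary of 256 byte values is therefore enough to represent characters that never appeared in training, at the cost of longer sequences for non-ASCII text.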

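2. WordPiece

WordPiece is a tokenization method developed by Google for its BERT model, and is used in its derivative models such as DistilBERT and MobileBERT.

The full details of the WordPiece algorithm have not been released to the public, so the method presented here is based on the explanation given by Hugging Face [12]. WordPiece is similar to BPE but uses a different metric to determine the merge rules. Instead of choosing the most frequent character pair, it calculates a score for each pair, and the pair with the highest score determines which characters are merged. WordPiece is trained as follows:

a) Construct the corpus

The input text is passed through the normalization and pre-tokenization steps to create clean words.

b) Construct the vocabulary

As with BPE, the words in the corpus are then broken down into individual characters and added to an empty list called the vocabulary. This time, however, instead of simply storing each character, two # symbols are used as a marker to indicate whether the character was found at the start of a word or in the middle/end of a word. For example, the word cat would be split into ['c', 'a', 't'] in BPE, but in WordPiece it looks like ['c', '##a', '##t']. A c at the start of a word and a ##c in the middle or at the end of a word are treated differently. The vocabulary is added to iteratively each time the algorithm determines which character pairs can be merged.

c) Calculate the pair score for each adjacent character pair

Unlike the BPE model, this time a score is calculated for each character pair. First, every adjacent character pair in the corpus is identified, e.g. (c, ##a), (##a, ##t) and so on, and its frequency is counted. The frequency of each character individually is also determined. With these values known, the pair score can be calculated using the following formula:

 score = freq(pair) / (freq(first element) × freq(second element))

This metric assigns higher scores to characters that appear together frequently but appear less frequently on their own or with other characters. This is the main difference between WordPiece and BPE, since BPE does not take the overall frequency of the individual characters into account.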
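A small worked example shows how this score behaves. The counts below were tallied by hand from the toy training set used for the BPE example above (cat ×5, cats ×2, eat ×10, eating ×3, running ×2, jumping ×1, food ×6); the function name pair_score is a label chosen for this sketch.

```python
def pair_score(pair_freq, first_freq, second_freq):
    """Pair frequency divided by the product of the two element frequencies."""
    return pair_freq / (first_freq * second_freq)

# (##a, ##t) occurs in cat (5), cats (2), eat (10) and eating (3), and
# neither ##a nor ##t appears anywhere else, so all three counts are 20.
score_at = pair_score(pair_freq=20, first_freq=20, second_freq=20)

# (##m, ##p) occurs only in jumping (1), as do ##m and ##p individually.
score_mp = pair_score(pair_freq=1, first_freq=1, second_freq=1)

print(score_at)   # 0.05
print(score_mp)   # 1.0
```

Although (##a, ##t) is by far the more frequent pair, which is why BPE merges it first, the rare pair (##m, ##p) scores twenty times higher under the WordPiece metric.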

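d) Create a merge rule

A high score represents a character pair that is commonly seen together. That is, if c##a has a high pair score, then c and a frequently appear together in the corpus rather than separately. As with BPE, the merge rule is determined by the best-ranked character pair, but this time the ranking uses the pair score rather than the raw frequency.

Steps c and d are then repeated, finding more merge rules and adding more character pairs to the vocabulary. The process continues until the vocabulary reaches the target size specified at the start of training.

A simple code example is shown below: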

 class WordPiece(BPE):

     def add_hashes(self, word):
         ''' Add # symbols to every character in a word except the first.

         Take in a word as a string and add # symbols to every character
         except the first. Return the result as a list where each element is
         a character with # symbols in front, except the first character
         which is just the plain character.

         Args:
             word (str): The word to add # symbols to.

         Returns:
             hashed_word (list): A list of the characters with # symbols
               (except the first character which is just the plain
               character).
         '''
         hashed_word = [word[0]]

         for char in word[1:]:
             hashed_word.append(f'##{char}')

         return hashed_word

     def create_merge_rule(self, corpus):
         ''' Create a merge rule and add it to the self.merge_rules list.

         Args:
             corpus (list[tuple(list, int)]): A list of tuples where the
               first element is a list of a word in the words list (where
               the elements are the individual characters (or subwords in
               later iterations) of the word), and the second element is
               an integer representing the frequency of the word in the
               list.

         Returns:
             None
         '''
         pair_frequencies = self.find_pair_frequencies(corpus)
         char_frequencies = self.find_char_frequencies(corpus)
         pair_scores = self.find_pair_scores(pair_frequencies, char_frequencies)

         highest_scoring_pair = max(pair_scores, key=pair_scores.get)
         self.merge_rules.append(highest_scoring_pair.split(','))
         self.vocabulary.append(highest_scoring_pair)

     def create_vocabulary(self, words):
         ''' Create a list of every unique character in a list of words.

         Unlike the BPE algorithm where each character is stored normally,
         here a distinction is made by characters that begin a word
         (unmarked), and characters that are in the middle or end of a word
         (marked with a '##'). For example, the word 'cat' will be split
         into ['c', '##a', '##t'].

         Args:
             words (list): A list of strings containing the words of the
               input text.

         Returns:
             vocabulary (list): A list of every unique character in the list
               of input words, marked accordingly with ## to denote if the
               character was featured in the middle/end of a word, instead
               of as the first character of the word.
         '''
         vocabulary = set()

         for word in words:
             vocabulary.add(word[0])
             for char in word[1:]:
                 vocabulary.add(f'##{char}')

         # Convert to list so the vocabulary can be appended to later
         vocabulary = list(vocabulary)
         return vocabulary

     def find_char_frequencies(self, corpus):
         ''' Find the frequency of each character in the corpus.

         Loop through the corpus and calculate the frequency of characters.
         Note that 'c' and '##c' are different characters, since the first
         represents a 'c' at the start of a word, and '##c' represents a 'c'
         in the middle/end of a word. Return a dictionary of each character
         as the keys and the corresponding frequency as the values.

         Args:
             corpus (list[tuple(list, int)]): A list of tuples where the
               first element is a list of a word in the words list (where
               the elements are the individual characters (or subwords in
               later iterations) of the word), and the second element is
               an integer representing the frequency of the word in the
               list.

         Returns:
             char_frequencies (dict): A dictionary where the keys are the
               characters from the input corpus and the values are an
               integer representing the frequency.
         '''
         char_frequencies = dict()

         for word, word_freq in corpus:
             for char in word:
                 if char in char_frequencies:
                     char_frequencies[char] += word_freq
                 else:
                     char_frequencies[char] = word_freq

         return char_frequencies

     def find_pair_scores(self, pair_frequencies, char_frequencies):
         ''' Find the pair score for each character pair in the corpus.

         Loops through the pair_frequencies dictionary and calculate the
         pair score for each pair of adjacent characters in the corpus.
         Store the scores in a dictionary and return it.

         Args:
             pair_frequencies (dict): A dictionary where the keys are the
               adjacent character pairs in the corpus and the values are
               the frequencies of each pair.

             char_frequencies (dict): A dictionary where the keys are the
               characters in the corpus and the values are corresponding
               frequencies.

         Returns:
             pair_scores (dict): A dictionary where the keys are the
               adjacent character pairs in the input corpus and the values
               are the corresponding pair score.
         '''
         pair_scores = dict()

         for pair in pair_frequencies.keys():
             char_1 = pair.split(',')[0]
             char_2 = pair.split(',')[1]
             denominator = (char_frequencies[char_1]*char_frequencies[char_2])
             score = (pair_frequencies[pair]) / denominator
             pair_scores[pair] = score

         return pair_scores

     def get_merged_chars(self, char_1, char_2):
         ''' Merge the highest score pair and return to the self.merge method.

         Remove the # symbols as necessary and merge the highest scoring
         pair then return the merged characters to the self.merge method.

         Args:
             char_1 (str): The first character in the highest-scoring pair.
             char_2 (str): The second character in the highest-scoring pair.

         Returns:
             merged_chars (str): Merged characters.
         '''
         if char_2.startswith('##'):
             merged_chars = char_1 + char_2[2:]
         else:
             merged_chars = char_1 + char_2

         return merged_chars

     def initialize_corpus(self, words):
         ''' Split each word into characters and count the word frequency.

         Split each word in the input word list on every character. For each
         word, store the split word in a list as the first element inside a
         tuple. Store the frequency count of the word as an integer as the
         second element of the tuple. Create a tuple for every word in this
         fashion and store the tuples in a list called 'corpus', then return
         the corpus list.

         Args:
             words (list): A list of words (strings) in any order.

         Returns:
             corpus (list[tuple(list, int)]): A list of tuples where the
               first element is a list of a word in the words list (where
               the elements are the individual characters of the word),
               and the second element is an integer representing the
               frequency of the word in the list.
         '''
         corpus = self.calculate_frequency(words)
         corpus = [(self.add_hashes(word), freq) for (word, freq) in corpus]
         return corpus

     def tokenize(self, text):
         ''' Take in some text and return a list of tokens for that text.

         Args:
             text (str): The text to be tokenized.

         Returns:
             tokens (list): The list of tokens created from the input text.
         '''
         # Create cleaned vocabulary list without # and commas to check against
         clean_vocabulary = [word.replace('#', '').replace(',', '')
                             for word in self.vocabulary]
         clean_vocabulary.sort(key=lambda word: len(word))
         clean_vocabulary = clean_vocabulary[::-1]

         # Break down the text into the largest tokens first, then smallest
         remaining_string = text
         tokens = []
         keep_checking = True

         while keep_checking:
             keep_checking = False
             for vocab in clean_vocabulary:
                 if remaining_string.startswith(vocab):
                     tokens.append(vocab)
                     remaining_string = remaining_string[len(vocab):]
                     keep_checking = True

         # Anything that could not be matched is kept as a single token
         if len(remaining_string) > 0:
             tokens.append(remaining_string)

         return tokens

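The tokens learned by WordPiece are very different from those learned by BPE. Training on the same toy dataset clearly shows that WordPiece favours combinations in which the characters appear together more often than they appear alone: m and p are merged immediately, since they only ever occur together in the dataset and never separately.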

 wp = WordPiece()
 wp.train(words, 30)

 print(f'INITIAL CORPUS:\n{wp.corpus_history[0]}\n')
 for rule, corpus in list(zip(wp.merge_rules, wp.corpus_history[1:])):
     print(f'NEW MERGE RULE: Combine "{rule[0]}" and "{rule[1]}"')
     print(corpus, end='\n\n')

Result:

 INITIAL CORPUS:
 [(['c', '##a', '##t'], 5), (['c', '##a', '##t', '##s'], 2), (['e', '##a', '##t'], 10), (['e', '##a', '##t', '##i', '##n', '##g'], 3), (['r', '##u', '##n', '##n', '##i', '##n', '##g'], 2), (['j', '##u', '##m', '##p', '##i', '##n', '##g'], 1), (['f', '##o', '##o', '##d'], 6)]

 NEW MERGE RULE: Combine "##m" and "##p"
 [(['c', '##a', '##t'], 5), (['c', '##a', '##t', '##s'], 2), (['e', '##a', '##t'], 10), (['e', '##a', '##t', '##i', '##n', '##g'], 3), (['r', '##u', '##n', '##n', '##i', '##n', '##g'], 2), (['j', '##u', '##mp', '##i', '##n', '##g'], 1), (['f', '##o', '##o', '##d'], 6)]

 NEW MERGE RULE: Combine "r" and "##u"
 [(['c', '##a', '##t'], 5), (['c', '##a', '##t', '##s'], 2), (['e', '##a', '##t'], 10), (['e', '##a', '##t', '##i', '##n', '##g'], 3), (['ru', '##n', '##n', '##i', '##n', '##g'], 2), (['j', '##u', '##mp', '##i', '##n', '##g'], 1), (['f', '##o', '##o', '##d'], 6)]

 NEW MERGE RULE: Combine "j" and "##u"
 [(['c', '##a', '##t'], 5), (['c', '##a', '##t', '##s'], 2), (['e', '##a', '##t'], 10), (['e', '##a', '##t', '##i', '##n', '##g'], 3), (['ru', '##n', '##n', '##i', '##n', '##g'], 2), (['ju', '##mp', '##i', '##n', '##g'], 1), (['f', '##o', '##o', '##d'], 6)]

 NEW MERGE RULE: Combine "ju" and "##mp"
 [(['c', '##a', '##t'], 5), (['c', '##a', '##t', '##s'], 2), (['e', '##a', '##t'], 10), (['e', '##a', '##t', '##i', '##n', '##g'], 3), (['ru', '##n', '##n', '##i', '##n', '##g'], 2), (['jump', '##i', '##n', '##g'], 1), (['f', '##o', '##o', '##d'], 6)]

 NEW MERGE RULE: Combine "jump" and "##i"
 [(['c', '##a', '##t'], 5), (['c', '##a', '##t', '##s'], 2), (['e', '##a', '##t'], 10), (['e', '##a', '##t', '##i', '##n', '##g'], 3), (['ru', '##n', '##n', '##i', '##n', '##g'], 2), (['jumpi', '##n', '##g'], 1), (['f', '##o', '##o', '##d'], 6)]

 NEW MERGE RULE: Combine "##i" and "##n"
 [(['c', '##a', '##t'], 5), (['c', '##a', '##t', '##s'], 2), (['e', '##a', '##t'], 10), (['e', '##a', '##t', '##in', '##g'], 3), (['ru', '##n', '##n', '##in', '##g'], 2), (['jumpi', '##n', '##g'], 1), (['f', '##o', '##o', '##d'], 6)]

 NEW MERGE RULE: Combine "ru" and "##n"
 [(['c', '##a', '##t'], 5), (['c', '##a', '##t', '##s'], 2), (['e', '##a', '##t'], 10), (['e', '##a', '##t', '##in', '##g'], 3), (['run', '##n', '##in', '##g'], 2), (['jumpi', '##n', '##g'], 1), (['f', '##o', '##o', '##d'], 6)]

 NEW MERGE RULE: Combine "run" and "##n"
 [(['c', '##a', '##t'], 5), (['c', '##a', '##t', '##s'], 2), (['e', '##a', '##t'], 10), (['e', '##a', '##t', '##in', '##g'], 3), (['runn', '##in', '##g'], 2), (['jumpi', '##n', '##g'], 1), (['f', '##o', '##o', '##d'], 6)]

 NEW MERGE RULE: Combine "jumpi" and "##n"
 [(['c', '##a', '##t'], 5), (['c', '##a', '##t', '##s'], 2), (['e', '##a', '##t'], 10), (['e', '##a', '##t', '##in', '##g'], 3), (['runn', '##in', '##g'], 2), (['jumpin', '##g'], 1), (['f', '##o', '##o', '##d'], 6)]

 NEW MERGE RULE: Combine "runn" and "##in"
 [(['c', '##a', '##t'], 5), (['c', '##a', '##t', '##s'], 2), (['e', '##a', '##t'], 10), (['e', '##a', '##t', '##in', '##g'], 3), (['runnin', '##g'], 2), (['jumpin', '##g'], 1), (['f', '##o', '##o', '##d'], 6)]

 NEW MERGE RULE: Combine "##in" and "##g"
 [(['c', '##a', '##t'], 5), (['c', '##a', '##t', '##s'], 2), (['e', '##a', '##t'], 10), (['e', '##a', '##t', '##ing'], 3), (['runnin', '##g'], 2), (['jumpin', '##g'], 1), (['f', '##o', '##o', '##d'], 6)]

 NEW MERGE RULE: Combine "runnin" and "##g"
 [(['c', '##a', '##t'], 5), (['c', '##a', '##t', '##s'], 2), (['e', '##a', '##t'], 10), (['e', '##a', '##t', '##ing'], 3), (['running'], 2), (['jumpin', '##g'], 1), (['f', '##o', '##o', '##d'], 6)]

 NEW MERGE RULE: Combine "jumpin" and "##g"
 [(['c', '##a', '##t'], 5), (['c', '##a', '##t', '##s'], 2), (['e', '##a', '##t'], 10), (['e', '##a', '##t', '##ing'], 3), (['running'], 2), (['jumping'], 1), (['f', '##o', '##o', '##d'], 6)]

 NEW MERGE RULE: Combine "f" and "##o"
 [(['c', '##a', '##t'], 5), (['c', '##a', '##t', '##s'], 2), (['e', '##a', '##t'], 10), (['e', '##a', '##t', '##ing'], 3), (['running'], 2), (['jumping'], 1), (['fo', '##o', '##d'], 6)]

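Despite the limited training data, the model has still managed to learn some useful tokens. This can be tested by tokenizing a word such as jumper. First the string is split into ['jump', 'er'], since jump is the largest token from the training set that can be found at the start of the word. Next, the string er is split into individual characters, because the model has not learned to combine the characters e and r.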

 print(wp.tokenize('jumper'))

 # ['jump', 'e', 'r']

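3. Unigram

The Unigram tokenizer takes a different approach from BPE and WordPiece: it starts with a large vocabulary and iteratively reduces it until the desired size is reached.

The Unigram model uses a statistical approach in which the probability of each word or character in a sentence is considered. A sentence can be split into a list of words or a list of characters, and each element of such a list can be thought of as a token t. The probability of a sequence of tokens t1, t2, …, tn occurring is given by:

 P(t1, t2, …, tn) = P(t1) × P(t2) × … × P(tn)

a) Construct the corpus

As always, the input text is passed through the normalization and pre-tokenization steps to create clean words.

b) Construct the vocabulary

The vocabulary of a Unigram model starts out very large and is then reduced iteratively until it reaches the desired size. To construct the initial vocabulary, find every possible substring in the corpus. For example, if the first word in the corpus is cats, the substrings ['c', 'a', 't', 's', 'ca', 'at', 'ts', 'cat', 'ats'] are added to the vocabulary.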
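Generating every substring of a word takes only a nested loop over lengths and start positions. The helper name all_substrings is made up for this sketch, and note that it also returns the full word itself as the longest substring.

```python
def all_substrings(word):
    """Return every contiguous substring of word, shortest first."""
    return [word[start:start + length]
            for length in range(1, len(word) + 1)
            for start in range(len(word) - length + 1)]

print(all_substrings("cats"))
# ['c', 'a', 't', 's', 'ca', 'at', 'ts', 'cat', 'ats', 'cats']
```

A word of length n has n(n+1)/2 substrings, which is why the initial Unigram vocabulary is so large before pruning.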

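c) Calculate the probability of each token

The probability of a token can be approximated by counting its occurrences in the corpus and dividing by the total number of occurrences of all tokens.

d) Find all possible segmentations of a word

Suppose a word in the training corpus is cat. It can be segmented in the following ways:

['c', 'a', 't']

['ca', 't']

['c', 'at']

['cat']

e) Calculate the approximate probability of each segmentation occurring in the corpus

Combining the token probabilities with the equation above gives the probability of each sequence of tokens.

Suppose the segmentation ['ca', 't'] has the highest probability score. It is then the one used to tokenize the word, so cat is tokenized as ['ca', 't']. As you can imagine, for longer words such as tokenization, splits may occur at several places throughout the word, for example ['token', 'iza', 'tion'] or ['token', 'ization'].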
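Steps d and e can be sketched together: enumerate every segmentation of a word, score each one as the product of its token probabilities, and keep the best. The probabilities in token_probs are hypothetical values made up for this illustration (they are not normalised and do not come from a real corpus); they were chosen so that ['ca', 't'] wins, matching the example in the text.

```python
from math import prod

def segmentations(word):
    """Yield every way of cutting word into contiguous pieces."""
    if not word:
        yield []
        return
    for end in range(1, len(word) + 1):
        for rest in segmentations(word[end:]):
            yield [word[:end]] + rest

# Hypothetical token probabilities, made up for this illustration.
token_probs = {"c": 0.10, "a": 0.15, "t": 0.20,
               "ca": 0.12, "at": 0.05, "cat": 0.01}

def sequence_prob(tokens):
    """Product of the individual token probabilities."""
    return prod(token_probs[token] for token in tokens)

candidates = list(segmentations("cat"))
best = max(candidates, key=sequence_prob)

print(candidates)  # [['c', 'a', 't'], ['c', 'at'], ['ca', 't'], ['cat']]
print(best)        # ['ca', 't']
```

Brute-force enumeration grows exponentially with word length, so this approach is only practical for short words; it is meant to show the idea rather than an efficient method.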

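f) Calculate the loss

The loss here refers to a score for the model: if an important token is removed from the vocabulary the loss increases greatly, whereas if a less important token is removed the loss does not increase much. By calculating what the loss would be after removing each token, the least useful token in the vocabulary can be found. This is repeated iteratively until the vocabulary has been reduced to only the most useful tokens from the training corpus.

In the usual formulation, the loss is the negative log-likelihood of the corpus, where w is a word in the corpus, freq(w) is its frequency and P(w) is the probability of its segmentation:

 L = −Σ freq(w) × log P(w)

Once enough tokens have been removed to bring the vocabulary down to the desired size, training is complete and the model is ready to tokenize words.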
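The pruning idea in step f can be sketched as follows, under several simplifying assumptions: the loss is taken to be the frequency-weighted negative log-likelihood of the corpus, each word's probability is taken from its single most likely segmentation, and probabilities are not renormalised after a token is removed. The word counts and token probabilities are hypothetical, and the function names are labels chosen for this sketch.

```python
from math import log

def best_prob(word, token_probs):
    """Probability of the most likely segmentation of word."""
    # best[i] holds the highest probability of any segmentation of word[:i].
    best = [1.0] + [0.0] * len(word)
    for end in range(1, len(word) + 1):
        for start in range(end):
            piece = word[start:end]
            if piece in token_probs:
                best[end] = max(best[end], best[start] * token_probs[piece])
    return best[len(word)]

def corpus_loss(word_freqs, token_probs):
    """Negative log-likelihood of the corpus, weighted by word frequency."""
    return -sum(freq * log(best_prob(word, token_probs))
                for word, freq in word_freqs.items())

# Hypothetical counts and probabilities, made up for this illustration.
word_freqs = {"cat": 5, "cats": 2}
token_probs = {"c": 0.1, "a": 0.1, "t": 0.1, "s": 0.1,
               "ca": 0.1, "at": 0.2, "cat": 0.3}

base_loss = corpus_loss(word_freqs, token_probs)
for token in ["ca", "at", "cat"]:   # single characters are always kept
    reduced = {t: p for t, p in token_probs.items() if t != token}
    increase = corpus_loss(word_freqs, reduced) - base_loss
    print(f"removing {token!r} increases the loss by {increase:.3f}")
```

With these numbers, removing ca or at leaves the loss unchanged because neither appears in the best segmentation of any word, whereas removing cat raises it sharply. The first two are therefore the candidates for pruning.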

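Comparing BPE, WordPiece and Unigram

Depending on the training set and the data to be tokenized, some tokenizers may perform better than others. When choosing a tokenizer for a language model, it is best to experiment with the training set for the specific use case and see which gives the best results.

Of the three methods, BPE appears to be the most popular choice in current language model tokenizers, although in such a fast-moving field this could well change in the future. Other subword tokenizers such as SentencePiece have also been gaining popularity in recent years [13].

Compared with BPE and Unigram, WordPiece seems to produce more single-word tokens, but regardless of the model chosen, all tokenizers appear to produce fewer tokens as the vocabulary size increases [14].

The choice of tokenizer depends on the datasets intended for use with the model. A reasonable starting point is to experiment with BPE or SentencePiece.

Post-processing

The final step of tokenization is post-processing, where any final modifications can be made to the output if necessary. BERT uses this step to add two additional types of token:

[CLS] - this token stands for "classification" and marks the start of the input text. It is required in BERT because one of the tasks it was trained on was classification (hence the name of the token). Even when BERT is not being used for a classification task, the model still expects this token.

[SEP] - this token stands for "separation" and is used to separate sentences in the input. It is useful for many of the tasks BERT performs, including handling multiple instructions at once in the same prompt [15].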
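The effect of this post-processing step can be imitated on plain token lists. This is a sketch of the resulting layout only, not the library's post-processor; the function name add_special_tokens is chosen for this illustration.

```python
def add_special_tokens(sentence_a, sentence_b=None):
    """Wrap one or two token lists in BERT-style [CLS] and [SEP] markers."""
    tokens = ["[CLS]"] + sentence_a + ["[SEP]"]
    if sentence_b is not None:
        tokens += sentence_b + ["[SEP]"]
    return tokens

print(add_special_tokens(["cats", "are", "great"]))
# ['[CLS]', 'cats', 'are', 'great', '[SEP]']

print(add_special_tokens(["cats", "are", "great"], ["dogs", "are", "better"]))
# ['[CLS]', 'cats', 'are', 'great', '[SEP]', 'dogs', 'are', 'better', '[SEP]']
```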

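The tokenizers library

The tokenizers library makes it very easy to use a pretrained tokenizer. Simply import the Tokenizer class, call the from_pretrained method, and pass in the name of the model whose tokenizer you want to use. A list of models is given in [16].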

 from tokenizers import Tokenizer

 tokenizer = Tokenizer.from_pretrained('bert-base-cased')

The following implementations can be used directly:

 BertWordPieceTokenizer - The famous Bert tokenizer, using WordPiece
 CharBPETokenizer - The original BPE
 ByteLevelBPETokenizer - The byte level version of the BPE
 SentencePieceBPETokenizer - A BPE implementation compatible with the one used by SentencePiece

The train method can also be used for custom training. Once training is complete, the save method stores the trained tokenizer so that training does not have to be repeated.

 # Import a tokenizer
 from tokenizers import (BertWordPieceTokenizer, CharBPETokenizer,
                         ByteLevelBPETokenizer, SentencePieceBPETokenizer)

 # Instantiate the model
 tokenizer = CharBPETokenizer()

 # Train the model
 tokenizer.train(['./path/to/files/1.txt', './path/to/files/2.txt'])

 # Tokenize some text
 encoded = tokenizer.encode('I can feel the magic, can you?')

 # Save the model
 tokenizer.save('./path/to/directory/my-bpe.tokenizer.json')

Below is a complete workflow for training a custom tokenizer:

 from tokenizers import (Tokenizer, models, pre_tokenizers, decoders,
                         trainers, processors)

 # Initialize a tokenizer
 tokenizer = Tokenizer(models.BPE())

 # Customize pre-tokenization and decoding
 tokenizer.pre_tokenizer = pre_tokenizers.ByteLevel(add_prefix_space=True)
 tokenizer.decoder = decoders.ByteLevel()
 tokenizer.post_processor = processors.ByteLevel(trim_offsets=True)

 # And then train
 trainer = trainers.BpeTrainer(
     vocab_size=20000,
     min_frequency=2,
     initial_alphabet=pre_tokenizers.ByteLevel.alphabet()
 )
 tokenizer.train([
     "./path/to/dataset/1.txt",
     "./path/to/dataset/2.txt",
     "./path/to/dataset/3.txt"
 ], trainer=trainer)

 # And save it
 tokenizer.save("byte-level-bpe.tokenizer.json", pretty=True)