1 Basic Concepts

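Chinese word segmentation is the process of splitting a piece of text into a sequence of words such that concatenating those words in order reproduces the original text.
As the first stage of Chinese information processing, it underlies later NLP tasks. Chinese word segmentation algorithms fall roughly into two families, dictionary rules and statistical learning; for a concrete problem, a system usually relies mainly on statistical learning, with dictionary rules as a supplement.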
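To make the defining property concrete, here is a minimal sketch that checks whether a candidate segmentation concatenates back to the original text. The helper name `is_valid_segmentation` is hypothetical and chosen only for illustration; it does not come from any library.

```python
def is_valid_segmentation(text, words):
    """Return True if every word is non-empty and the words, joined in order, equal the text."""
    return all(words) and "".join(words) == text


print(is_valid_segmentation("就读北京大学", ["就读", "北京大学"]))  # True
print(is_valid_segmentation("就读北京大学", ["就读", "北京"]))      # False: "大学" is missing
```

Note that the check says nothing about whether the segmentation is linguistically right; it only verifies that no characters were lost, added, or reordered.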

2 Forward Longest Matching

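The longest-matching algorithm looks up words that begin at a given index while the end index grows, and prefers the longer word whenever more than one is found. Matching from the front of the text to the back is called forward longest matching; matching in the opposite direction is called backward longest matching.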

# -*- coding:utf-8 -*-
from tests.book.ch02.utility import load_dictionary


def forward_segment(text, dic):
    word_list = []
    i = 0
    while i < len(text):
        longest_word = text[i]                      # single character at the current scan position
        for j in range(i + 1, len(text) + 1):       # every possible end position
            word = text[i:j]                        # substring from the current position to that end
            if word in dic:                         # it is in the dictionary
                if len(word) > len(longest_word):   # and it is longer
                    longest_word = word             # so it takes priority
        word_list.append(longest_word)              # output the longest word
        i += len(longest_word)                      # continue scanning forward
    return word_list


if __name__ == '__main__':
    dic = load_dictionary()
    print(forward_segment('就读北京大学', dic))
    print(forward_segment('研究生命起源', dic))

Output:

['就读', '北京大学']
['研究生', '命', '起源']

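As the code shows, the longest dictionary word starting at the current position is output first; if no string starting at that position is in the dictionary, the single character at that position is output as a word.
The result ['研究生', '命', '起源'] is wrong because forward longest matching gives "研究生" priority over the shorter "研究". Backward longest matching, described next, handles this case.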
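To see why "研究生" wins over "研究", it can help to print every dictionary hit at each scan position. The sketch below does that with a tiny hand-made dictionary; `toy_dic` and the function name `forward_candidates` are assumptions for illustration, not the dictionary or code used in the listing above.

```python
def forward_candidates(text, dic):
    """For each scan position, record the dictionary words found there and the one chosen."""
    trace = []
    i = 0
    while i < len(text):
        # Every dictionary word that starts at position i, shortest first.
        found = [text[i:j] for j in range(i + 1, len(text) + 1) if text[i:j] in dic]
        # The longest hit wins; fall back to the single character when nothing matches.
        chosen = max(found + [text[i]], key=len)
        trace.append((i, found, chosen))
        i += len(chosen)
    return trace


# Toy dictionary, assumed for illustration only.
toy_dic = {"研究", "研究生", "生命", "起源"}

for start, found, chosen in forward_candidates("研究生命起源", toy_dic):
    print(start, found, chosen)
# 0 ['研究', '研究生'] 研究生
# 3 [] 命
# 4 ['起源'] 起源
```

Once "研究生" is consumed at position 0, the scan resumes at "命", so "生命" is never considered as a candidate.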

3 Backward Longest Matching

# -*- coding:utf-8 -*-
# Author: hankcs
# Date: 2018-05-22 21:05
# 《自然语言处理入门》 (Introduction to Natural Language Processing), 2.3.3 Backward longest matching
# Companion book: http://nlp.hankcs.com/book.php
# Discussion and Q&A: https://bbs.hankcs.com/
from tests.book.ch02.utility import load_dictionary


def backward_segment(text, dic):
    word_list = []
    i = len(text) - 1
    while i >= 0:                                   # the scan position is the end of the word
        longest_word = text[i]                      # single character at the scan position
        for j in range(0, i):                       # try each start j in [0, i), earliest first
            word = text[j: i + 1]                   # candidate word spanning [j, i]
            if word in dic:
                if len(word) > len(longest_word):   # longer words take priority
                    longest_word = word
                    break                           # the first hit is already the longest
        word_list.insert(0, longest_word)           # scanning backward, so words found earlier sit later in the text
        i -= len(longest_word)
    return word_list


if __name__ == '__main__':
    dic = load_dictionary()
    print(backward_segment('研究生命起源', dic))
    print(backward_segment('项目的研究计划', dic))

Output:

['研究', '生命', '起源']
['项', '目的', '研究计划']

['研究', '生命', '起源'] is now segmented correctly, but ['项', '目的', '研究计划'] is wrong; forward longest matching segments this second sentence correctly.
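The two directions can be compared side by side without the book's dictionary. The following is a self-contained sketch of both scans over a small hand-made dictionary; `toy_dic` is an assumption for illustration, so results with a real dictionary may differ.

```python
def forward_segment(text, dic):
    """Forward longest matching: scan left to right, taking the longest word at each start."""
    words, i = [], 0
    while i < len(text):
        longest = text[i]
        for j in range(i + 1, len(text) + 1):
            if text[i:j] in dic and len(text[i:j]) > len(longest):
                longest = text[i:j]
        words.append(longest)
        i += len(longest)
    return words


def backward_segment(text, dic):
    """Backward longest matching: scan right to left, taking the longest word at each end."""
    words, i = [], len(text) - 1
    while i >= 0:
        longest = text[i]
        for j in range(0, i):            # the earliest start gives the longest candidate
            if text[j:i + 1] in dic:
                longest = text[j:i + 1]
                break
        words.insert(0, longest)
        i -= len(longest)
    return words


# Toy dictionary, assumed for illustration only.
toy_dic = {"研究", "研究生", "生命", "起源", "项目", "目的", "研究计划"}

for sentence in ("研究生命起源", "项目的研究计划"):
    print(forward_segment(sentence, toy_dic), backward_segment(sentence, toy_dic))
# ['研究生', '命', '起源'] ['研究', '生命', '起源']
# ['项目', '的', '研究计划'] ['项', '目的', '研究计划']
```

With this toy dictionary each direction gets exactly one of the two sentences right, which is the situation the text describes.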

4 Bidirectional Longest Matching

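Professor Sun Maosong (孙茂松) of Tsinghua University once compiled statistics on 3,680 randomly selected sentences: forward matching was wrong while backward matching was right in 9.24% of them, and there were no cases in which forward matching was right while backward matching was wrong.
Building on empirical observations like this, bidirectional longest matching was proposed, with the following rules: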

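(1) Run both forward and backward longest matching; if the two results contain different numbers of words, return the one with fewer words.
(2) Otherwise, return the one with fewer single-character words. If the single-character counts are also equal, prefer the backward longest matching result.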
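The two rules only compare finished segmentations, so they can be sketched independently of any dictionary. In the example below the function name `choose` is hypothetical, and the forward result for "项目的研究计划" is an assumed correct segmentation, since only the backward result for that sentence is shown above.

```python
def count_single_char(words):
    """Number of single-character words in a segmentation."""
    return sum(1 for w in words if len(w) == 1)


def choose(forward, backward):
    """Apply rules (1) and (2) to a forward and a backward segmentation."""
    if len(forward) != len(backward):
        return min(forward, backward, key=len)   # rule (1): fewer words wins
    if count_single_char(forward) < count_single_char(backward):
        return forward                           # rule (2): fewer single characters wins
    return backward                              # full tie: prefer the backward result


print(choose(["研究生", "命", "起源"], ["研究", "生命", "起源"]))
# ['研究', '生命', '起源']

# Assumed forward result; both sides have 3 words and 1 single character.
print(choose(["项目", "的", "研究计划"], ["项", "目的", "研究计划"]))
# ['项', '目的', '研究计划']
```

Under that assumption the second call is a full tie, so the rules fall back to the backward result even though it is the wrong one; the heuristic reduces errors but cannot remove them.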

# -*- coding:utf-8 -*-
from tests.book.ch02.backward_segment import backward_segment
from tests.book.ch02.forward_segment import forward_segment
from tests.book.ch02.utility import load_dictionary


def count_single_char(word_list: list):  # count the single-character words
    return sum(1 for word in word_list if len(word) == 1)


def bidirectional_segment(text, dic):
    f = forward_segment(text, dic)
    b = backward_segment(text, dic)
    if len(f) < len(b):                                  # fewer words take priority
        return f
    elif len(f) > len(b):
        return b
    else:
        if count_single_char(f) < count_single_char(b):  # fewer single characters take priority
            return f
        else:
            return b                                     # on a full tie, backward matching takes priority


if __name__ == '__main__':
    dic = load_dictionary()
    print(bidirectional_segment('研究生命起源', dic))