Elasticsearch之文本分析

文本分析基本概念

官网：Text analysis | Elasticsearch Guide [7.17] | Elastic

官网称为文本分析，这是对文本进行一直分析处理的方式，基本处理逻辑是为按照预先制定的分词规则，把原本的文档进行分割成多个小颗粒度的词项，颗粒度的大小取决于分词器的配置规则。

不同文本分析结果如下：

POST _analyze
{"analyzer":"standard","text":"中华人民共和国"
}

#ik_smart:会做最粗粒度的拆
POST _analyze
{"analyzer": "ik_smart","text": "中华人民共和国"}

#ik_max_word:会将文本做最细粒度的拆分
POST _analyze
{"analyzer":"ik_max_word","text":"中华人民共和国"
}

文本分析发生时间

Index and search analysis | Elasticsearch Guide [7.17] | Elastic

文本分析处理发生在 Index Time 和 Search Time 两个时间。

Index Time：文本写入并创建倒排索引时期，其分词逻辑取决于映射参数analyzer。
Search Time：搜索执行时期，其分词仅对搜索词产生作用。

文本分析构成

Anatomy of an analyzer | Elasticsearch Guide [7.17] | Elastic

切词器（Tokenizer）：定义切词（分词）逻辑
词项过滤器（Token Filter）：分词之后的单个词项的处理逻辑
字符过滤器（Character Filter）：处理单个字符

分词器：Tokenizer

tokenizer 是文本分析的核心组成部分之一，其主要作用是分词，或称之为切词。主要用来对原始文本进行细粒度拆分。拆分之后的每一个部分称之为一个 Term，或称之为一个词项。可以把切词器理解为预定义的切词规则。官方内置了很多种切词器，默认标准切词器standard。

词项过滤器：Token Filter

词项过滤器用来处理切词完成之后的词项，例如把大小写转换，删除停用词或同义词处理等。官方同样预置了很多词项过滤器，基本可以满足日常开发的需要。当然也是支持第三方也自行开发的。

停用词

在切词完成之后，会被干掉词项，即停用词。停用词可以自定义
英文停用词（english）：a, an, and, are, as, at, be, but, by, for, if, in, into, is, it, no, not, of, on,
or, such, that, the, their, then, there, these, they, this, to, was, will, with。
中日韩停用词（cjk）：a, and, are, as, at, be, but, by, for, if, in, into, is, it, no, not, of, on, or, s,
such, t, that, the, their, then, there, these, they, this, to, was, will, with, www。

GET _analyze
{
"tokenizer": "standard",
"filter": ["stop"],
"text": ["how are you"]
}

自定义过滤器


# 自定义 filter
DELETE test_token_filter_stop
PUT test_token_filter_stop
{"settings": {"analysis": {"filter": {"my_filter": {"type": "stop","stopwords": ["you"],"ignore_case": true}}}}
}
GET test_token_filter_stop/_analyze
{"tokenizer": "standard", "filter": ["my_filter"], "text": ["how are you"]
}

同义词

同义词定义规则：

a, b, c => d：这种方式，a、b、c 会被 d 代替。
a, b, c, d：这种方式下，a、b、c、d 是等价的。

PUT test_token_filter_synonym
{"settings": {"analysis": {"filter": {"my_synonym": {"type": "synonym","synonyms": [ "good, nice => excellent" ] //good, nice, excellent}}}}
}
GET test_token_filter_synonym/_analyze
{"tokenizer": "standard", "filter": ["my_synonym"], "text": ["good"]
}

字符过滤器：Character Filter

分词之前的预处理，过滤无用字符。

PUT <index_name>
{"settings": {"analysis": {"char_filter": {"my_char_filter": {"type": "<char_filter_type>"}}}}
}

type：使用的字符过滤器类型名称，可配置以下值：

html_strip
mapping
pattern_replace

HTML 标签过滤器：HTML Strip Character Filter

字符过滤器会去除 HTML 标签和转义 HTML 元素，如、&

PUT test_html_strip_filter
{"settings": {"analysis": {"char_filter": {"my_char_filter": {"type": "html_strip",  // html_strip 代表使用 HTML 标签过滤器"escaped_tags": [     // 当前仅保留 a 标签        "a"]}}}}
}
GET test_html_strip_filter/_analyze
{"tokenizer": "standard", "char_filter": ["my_char_filter"],"text": ["<p>I&apos;m so <a>happy</a>!</p>"]
}

参数：escaped_tags：需要保留的 html 标签

字符映射过滤器：Mapping Character Filter

通过定义映替换为规则，把特定字符替换为指定字符

PUT test_html_strip_filter
{"settings": {"analysis": {"char_filter": {"my_char_filter": {"type": "mapping",    // mapping 代表使用字符映射过滤器"mappings": [                // 数组中规定的字符会被等价替换为 => 指定的字符"滚 => *","垃 => *","圾 => *"]}}}}
}
GET test_html_strip_filter/_analyze
{//"tokenizer": "standard", "char_filter": ["my_char_filter"],"text": "你个垃圾废物！滚"
}

正则替换过滤器：Pattern Replace Character Filter

PUT text_pattern_replace_filter
{"settings": {"analysis": {"char_filter": {"my_char_filter": {"type": "pattern_replace",    // pattern_replace 代表使用正则替换过滤器            "pattern": """(\d{3})\d{4}(\d{4})""",    // 正则表达式"replacement": "$1****$2"}}}}
}
GET text_pattern_replace_filter/_analyze
{"char_filter": ["my_char_filter"],"text": "手机号是18868686688"
}

倒排索引的数据结构

当数据写入 ES 时，数据将会通过分词被切分为不同的 term，ES 将 term 与其对应的文档列表建立一种映射关系，这种结构就是倒排索引。

为了进一步提升索引的效率，ES 在 term 的基础上利用 term 的前缀或者后缀构建了 term index, 用于对 term 本身进行索引，ES 实际的索引结构如下图所示：

这样当我们去搜索某个关键词时，ES 首先根据它的前缀或者后缀迅速缩小关键词的在 term dictionary 中的范围，大大减少了磁盘IO的次数。

单词词典（Term Dictionary) ：记录所有文档的单词，记录单词到倒排列表的关联关系

常用字典数据结构：

数据结构	优缺点
排序列表Array/List	使用二分法查找，不平衡
HashMap/TreeMap	性能高，内存消耗大，几乎是原始数据的三倍
Skip List	跳跃表，可快速查找词语，在lucene、redis、Hbase等均有实现。相对于TreeMap等结构，特别适合高并发
Trie	适合英文词典，如果系统中存在大量字符串且这些字符串基本没有公共前缀，则相应的trie树将非常消耗内存
Double Array Trie	适合做中文词典，内存占用小，很多分词工具均采用此种算法
Ternary Search Tree	三叉树，每一个node有3个节点，兼具省空间和查询快的优点
Finite State Transducers (FST)	一种有限状态转移机，Lucene 4有开源实现，并大量使用

倒排列表(Posting List)-记录了单词对应的文档结合，由倒排索引项组成
倒排索引项(Posting)：
- 文档ID
- 词频TF–该单词在文档中出现的次数，用于相关性评分
- 位置(Position)-单词在文档中分词的位置。用于短语搜索（match phrase query)
- 偏移(Offset)-记录单词的开始结束位置，实现高亮显示