Elasticsearch 7.15.2: Customizing the IK Chinese Analyzer with a Local Extension Dictionary

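Background: the two analyzers that ship with IK do not recognize some newer words and sometimes fall short of real business needs. In that case we can define a custom dictionary to close the gap.
Goal: customize the Chinese analyzer so that it supports the extension words we add.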

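Contents

- I. Current Search Behavior
  - 1. Search keyword
  - 2. Results
  - 3. Analysis
  - 4. IK tokenization in ES
  - 5. IK tokenization result and analysis
- II. Customizing the Analyzer
  - 2.1. Add a custom dictionary file
  - 2.2. Configure the dictionary
  - 2.3. Restart ES 7
  - 2.4. Check the tokenization again
  - 2.5. Search again
  - 2.6. Rebuild the index
  - 2.7. Query again
  - 2.8. Analysis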

I. Current Search Behavior

1. Search keyword

# Search for hotels related to "凯悦" (Hyatt)
GET /shop/_search
{"query": {"match": {"name": "凯悦"}}}

2. Results

{"took" : 7,"timed_out" : false,"_shards" : {"total" : 1,"successful" : 1,"skipped" : 0,"failed" : 0},
 "hits" : {"total" : {"value" : 5,"relation" : "eq"},"max_score" : 3.3362136,"hits" : [
  {"_index" : "shop","_type" : "_doc","_id" : "9","_score" : 3.3362136,"_source" : {"price_per_man" : 176,"remark_score" : 2.2,"category_name" : "酒店","@version" : "1","seller_disabled_flag" : 0,"@timestamp" : "2021-11-21T04:10:03.916Z","tags" : "落地大窗","location" : "31.306172,121.525843","seller_remark_score" : 3.0,"id" : 9,"name" : "凯悦酒店","seller_id" : 17,"category_id" : 2}},
  {"_index" : "shop","_type" : "_doc","_id" : "10","_score" : 2.836244,"_source" : {"price_per_man" : 182,"remark_score" : 0.5,"category_name" : "酒店","@version" : "1","seller_disabled_flag" : 0,"@timestamp" : "2021-11-21T04:10:03.918Z","tags" : "自助餐","location" : "31.196742,121.322846","seller_remark_score" : 3.0,"id" : 10,"name" : "凯悦嘉轩酒店","seller_id" : 17,"category_id" : 2}},
  {"_index" : "shop","_type" : "_doc","_id" : "11","_score" : 2.836244,"_source" : {"price_per_man" : 74,"remark_score" : 1.0,"category_name" : "酒店","@version" : "1","seller_disabled_flag" : 0,"@timestamp" : "2021-11-21T04:10:03.920Z","tags" : "自助餐","location" : "31.156899,121.238362","seller_remark_score" : 3.0,"id" : 11,"name" : "新虹桥凯悦酒店","seller_id" : 17,"category_id" : 2}},
  {"_index" : "shop","_type" : "_doc","_id" : "12","_score" : 2.638537,"_source" : {"price_per_man" : 71,"remark_score" : 2.0,"category_name" : "美食2","@version" : "1","seller_disabled_flag" : 0,"@timestamp" : "2021-11-21T04:10:03.923Z","tags" : "有包厢","location" : "30.679819,121.651921","seller_remark_score" : 3.0,"id" : 12,"name" : "凯悦咖啡(新建西路店)","seller_id" : 17,"category_id" : 1}},
  {"_index" : "shop","_type" : "_doc","_id" : "4","_score" : 1.3119392,"_source" : {"price_per_man" : 152,"remark_score" : 2.0,"category_name" : "美食2","@version" : "1","seller_disabled_flag" : 0,"@timestamp" : "2021-11-21T04:10:03.907Z","tags" : "落地大窗 有WIFI","location" : "31.306419,121.524878","seller_remark_score" : 2.0,"id" : 4,"name" : "花悦庭果木烤鸭","seller_id" : 2,"category_id" : 1}}
 ]}
}

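3. Analysis

One of the results above does not belong. Its name does not contain the keyword "凯悦", yet the search still returns it and it shows up on the page, which is not the expected result.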

 {"_index" : "shop","_type" : "_doc","_id" : "4","_score" : 1.3119392,"_source" : {"price_per_man" : 152,"remark_score" : 2.0,"category_name" : "美食2","@version" : "1","seller_disabled_flag" : 0,"@timestamp" : "2021-11-21T04:10:03.907Z","tags" : "落地大窗 有WIFI","location" : "31.306419,121.524878","seller_remark_score" : 2.0,"id" : 4,"name" : "花悦庭果木烤鸭","seller_id" : 2,"category_id" : 1}}

4. IK tokenization in ES

# Check how "凯悦" is tokenized
GET /shop/_analyze
{"analyzer": "ik_smart", "text": "凯悦"}

5. IK tokenization result and analysis

{"tokens" : [
  {"token" : "凯","start_offset" : 0,"end_offset" : 1,"type" : "CN_CHAR","position" : 0},
  {"token" : "悦","start_offset" : 1,"end_offset" : 2,"type" : "CN_CHAR","position" : 1}
]}

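As the output shows, the ik_smart analyzer splits the text into "凯" and "悦" instead of treating "凯悦" as a single token. The main reason is that the dictionary of the IK plugin installed in ES does not contain the word "凯悦". Because the query is reduced to these two single characters, any name that contains just "悦", such as 花悦庭果木烤鸭, also matches.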
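To make the effect concrete, here is a minimal Python sketch of dictionary-based tokenization. It is not IK's real algorithm: it uses a simple greedy forward maximum match, and `base_dict` is a tiny hypothetical stand-in for IK's built-in dictionary. The `match_any` helper mimics the default OR behavior of a match query, where one shared token is enough for a hit.

```python
def tokenize(text, dictionary):
    """Greedy forward maximum matching; unknown characters become single tokens."""
    max_len = max((len(w) for w in dictionary), default=1)
    tokens, i = [], 0
    while i < len(text):
        # Try the longest candidate first, then fall back to a single character.
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in dictionary:
                tokens.append(piece)
                i += size
                break
    return tokens


def match_any(query_tokens, doc_tokens):
    """Mimic a match query with the default OR operator."""
    return any(t in doc_tokens for t in query_tokens)


# Hypothetical stand-in for IK's built-in dictionary (assumption for this demo).
base_dict = {"酒店", "烤鸭"}

query = tokenize("凯悦", base_dict)
print(query)                                                    # ['凯', '悦']
print(match_any(query, tokenize("花悦庭果木烤鸭", base_dict)))   # True
print(tokenize("凯悦", base_dict | {"凯悦"}))                    # ['凯悦']
```

Once "凯悦" is in the dictionary, the query becomes a single token and the stray "悦" no longer triggers a match on its own.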

II. Customizing the Analyzer

2.1. Add a custom dictionary file

cd /app/elasticsearch-7.15.2/config/analysis-ik/
vim new_word.dic

Add the custom word, one word per line:

凯悦
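If the dictionary grows over time, editing it by hand gets error-prone. The following is a small sketch, assuming the file holds one word per line in UTF-8 (the encoding IK expects for extension dictionaries). The helper name `add_words` is made up for this example, and the demo writes to a temporary directory rather than the real config path.

```python
import tempfile
from pathlib import Path


def add_words(dic_path, words):
    """Add words that are not yet in the dictionary file; return the ones added."""
    path = Path(dic_path)
    lines = []
    if path.exists():
        lines = [ln.strip() for ln in path.read_text(encoding="utf-8").splitlines() if ln.strip()]
    seen = set(lines)
    added = []
    for word in words:
        word = word.strip()
        if word and word not in seen:
            seen.add(word)
            added.append(word)
    if added:
        # Rewrite the whole file so every word ends up on its own line.
        path.write_text("\n".join(lines + added) + "\n", encoding="utf-8")
    return added


with tempfile.TemporaryDirectory() as tmp:
    dic = Path(tmp) / "new_word.dic"       # demo path, not the real config directory
    print(add_words(dic, ["凯悦", "凯悦"]))  # ['凯悦']  (the duplicate is skipped)
    print(add_words(dic, ["凯悦"]))          # []        (already present)
```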

2.2. Configure the dictionary

Tell IK to load the custom dictionary file:

vim IKAnalyzer.cfg.xml

Contents:

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE properties SYSTEM "http://java.sun.com/dtd/properties.dtd">
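<properties>
    <comment>IK Analyzer extension configuration</comment>
    <!-- Users can configure their own extension dictionaries here -->
    <entry key="ext_dict">new_word.dic</entry>
    <!-- Users can configure their own extension stop-word dictionaries here -->
    <entry key="ext_stopwords"></entry>
    <!-- Users can configure remote extension dictionaries here -->
    <!-- <entry key="remote_ext_dict">words_location</entry> -->
    <!-- Users can configure remote extension stop-word dictionaries here -->
    <!-- <entry key="remote_ext_stopwords">words_location</entry> -->
</properties>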
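A typo in the entry key or file name only shows up later as "the new word still is not recognized", so it can help to check the file before restarting. The sketch below parses a copy of the configuration with Python's standard library and lists the configured dictionaries. The function name `configured_dicts` is hypothetical, and splitting on semicolons assumes the convention used in IK's sample configuration for listing several dictionary files.

```python
import xml.etree.ElementTree as ET

# Inline copy of the configuration shown above (comments omitted for brevity).
CFG = """<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE properties SYSTEM "http://java.sun.com/dtd/properties.dtd">
<properties>
    <comment>IK Analyzer extension configuration</comment>
    <entry key="ext_dict">new_word.dic</entry>
    <entry key="ext_stopwords"></entry>
</properties>
"""


def configured_dicts(xml_text, key="ext_dict"):
    """Return the dictionary paths listed under the given entry key."""
    root = ET.fromstring(xml_text.encode("utf-8"))
    for entry in root.iter("entry"):
        if entry.get("key") == key:
            # Assumption: several paths are separated by semicolons.
            return [p.strip() for p in (entry.text or "").split(";") if p.strip()]
    return []


print(configured_dicts(CFG))                   # ['new_word.dic']
print(configured_dicts(CFG, "ext_stopwords"))  # []
```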

2.3. Restart ES 7

ps -ef|grep elasticsearch

kill -9 <ES process ID>

cd /app/elasticsearch-7.15.2/
bin/elasticsearch -d

2.4. Check the tokenization again

# Check how "凯悦" is tokenized
GET /shop/_analyze
{"analyzer": "ik_smart", "text": "凯悦"}

GET /shop/_analyze
{"analyzer": "ik_max_word", "text": "凯悦"}

2.5. Search again

GET /shop/_search
{"query": {"match": {"name": "凯悦"}}}

GET /shop/_search

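The "凯悦" query no longer returns a single hit, yet the unfiltered GET /shop/_search confirms that the documents are all still there.

2.6. Rebuild the index

The index was built at a time when the IK analyzer did not know the word "凯悦". For a record such as 凯悦酒店, the inverted index therefore already holds the single characters "凯" and "悦" as separate terms, because the extension dictionary had not been loaded when the index was created.

The problem now is that the index stores "凯" and "悦" separately, while a search for "凯悦" is analyzed with the analyzer as it is at search time, which produces the single token "凯悦".

The query looks for a two-character term, but the inverted index was built from single characters, so the search comes back empty.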
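The mismatch between index-time and search-time tokens can be reproduced with a toy inverted index. This is only an illustration: `old_analyze` and `new_analyze` are hypothetical stand-ins for the IK analyzer before and after the dictionary change, and real IK output for the other characters will differ.

```python
from collections import defaultdict


def build_index(docs, analyze):
    """Map each token to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, name in docs.items():
        for token in analyze(name):
            index[token].add(doc_id)
    return index


def search(index, query, analyze):
    """OR together the postings of every query token, like a default match query."""
    hits = set()
    for token in analyze(query):
        hits |= index.get(token, set())
    return sorted(hits)


def old_analyze(text):
    # Stand-in for IK before the change: one token per character.
    return list(text)


def new_analyze(text):
    # Stand-in for IK after the change: "凯悦" stays whole, the rest is split per character.
    tokens = []
    for i, part in enumerate(text.split("凯悦")):
        if i > 0:
            tokens.append("凯悦")
        tokens.extend(part)
    return tokens


docs = {9: "凯悦酒店", 4: "花悦庭果木烤鸭"}
stale = build_index(docs, old_analyze)      # built before the dictionary change
print(search(stale, "凯悦", old_analyze))   # [4, 9]  original behavior, with the false hit
print(search(stale, "凯悦", new_analyze))   # []      new query token, stale index
fresh = build_index(docs, new_analyze)      # after re-indexing
print(search(fresh, "凯悦", new_analyze))   # [9]
```

The middle case is the situation described here: the query token "凯悦" simply does not exist in an index that was built from single characters.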

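Solutions:

Option 1 (first-time setup): delete the whole index, then run a full sync to rebuild it.
Option 2 (recommended): re-index only the documents whose indexed name contains the terms "凯" and "悦", and leave all other documents untouched.

# Re-index the documents affected by the new word "凯悦"
POST /shop/_update_by_query
{"query": {"bool": {"must": [{"term": {"name": "凯"}}, {"term": {"name": "悦"}}]}}}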
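When more words are added to the dictionary later, the same kind of request body is needed for each of them. Here is a small sketch that builds it, assuming (as the _analyze output earlier showed for "凯悦") that the stale index stored the word as one term per character. The helper name `reindex_query` and its `field` parameter are made up for this example, and sending the body to POST /shop/_update_by_query is left to whatever HTTP client you already use.

```python
import json


def reindex_query(word, field="name"):
    """Select documents whose field holds every character of `word` as its own term."""
    return {
        "query": {
            "bool": {
                # One term clause per character; "must" means all of them are required.
                "must": [{"term": {field: ch}} for ch in word]
            }
        }
    }


body = reindex_query("凯悦")
print(json.dumps(body, ensure_ascii=False))
```

Restricting the update to documents that contain all characters of the new word keeps the re-index small, which is the point of the recommended option.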

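2.7. Query again

GET /shop/_search
{"query": {"match": {"name": "凯悦"}}}

2.8. Analysis

The search now returns four hits, all of which contain "凯悦" in the name. The unrelated record 花悦庭果木烤鸭 is no longer in the results: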

{"took" : 3,"timed_out" : false,"_shards" : {"total" : 1,"successful" : 1,"skipped" : 0,"failed" : 0},
 "hits" : {"total" : {"value" : 4,"relation" : "eq"},"max_score" : 2.0709352,"hits" : [
  {"_index" : "shop","_type" : "_doc","_id" : "9","_score" : 2.0709352,"_source" : {"price_per_man" : 176,"remark_score" : 2.2,"category_name" : "酒店","@version" : "1","seller_disabled_flag" : 0,"@timestamp" : "2021-11-21T04:10:03.916Z","tags" : "落地大窗","location" : "31.306172,121.525843","seller_remark_score" : 3.0,"id" : 9,"name" : "凯悦酒店","seller_id" : 17,"category_id" : 2}},
  {"_index" : "shop","_type" : "_doc","_id" : "10","_score" : 1.7177677,"_source" : {"price_per_man" : 182,"remark_score" : 0.5,"category_name" : "酒店","@version" : "1","seller_disabled_flag" : 0,"@timestamp" : "2021-11-21T04:10:03.918Z","tags" : "自助餐","location" : "31.196742,121.322846","seller_remark_score" : 3.0,"id" : 10,"name" : "凯悦嘉轩酒店","seller_id" : 17,"category_id" : 2}},
  {"_index" : "shop","_type" : "_doc","_id" : "11","_score" : 1.7177677,"_source" : {"price_per_man" : 74,"remark_score" : 1.0,"category_name" : "酒店","@version" : "1","seller_disabled_flag" : 0,"@timestamp" : "2021-11-21T04:10:03.920Z","tags" : "自助餐","location" : "31.156899,121.238362","seller_remark_score" : 3.0,"id" : 11,"name" : "新虹桥凯悦酒店","seller_id" : 17,"category_id" : 2}},
  {"_index" : "shop","_type" : "_doc","_id" : "12","_score" : 1.5828056,"_source" : {"price_per_man" : 71,"remark_score" : 2.0,"category_name" : "美食2","@version" : "1","seller_disabled_flag" : 0,"@timestamp" : "2021-11-21T04:10:03.923Z","tags" : "有包厢","location" : "30.679819,121.651921","seller_remark_score" : 3.0,"id" : 12,"name" : "凯悦咖啡(新建西路店)","seller_id" : 17,"category_id" : 1}}
 ]}
}