文章目录
- 5、语言处理与自动补全技术探测
- 5.1 自定义语料库
- 5.1.1 语料库映射OpenAPI
- 5.1.2 语料库文档OpenAPI
- 5.2 产品搜索与自动补全
- 5.2.1 汉字补全OpenAPI
- 5.2.2 拼音补全OpenAPI
- 5.3 产品搜索与语言处理
- 5.3.1 什么是语言处理(拼写纠错)
- 5.3.2 语言处理OpenAPI
- 5.4 总结
- 6、电商平台产品推荐
- 6.1 什么是搜索推荐
- 6.2 产品推荐OpenAPI
- 7、指标聚合与下钻分析
- 7.1 指标聚合与分类
- 7.2 指标聚合与下钻设计
- 7.2.1 基础框架搭建
- 7.2.2 单值分析API设计
- 7.2.3 多值分析API设计
- 8、电商平台日志埋点与搜索热词
- 8.1 什么是热度搜索
- 8.2 提取热度搜索
- 8.3 日志埋点
- 8.4 数据落盘
- 8.5 热度搜索OpenAPI
5、语言处理与自动补全技术探测
实现的效果
实现的最终效果如下图京东搜索相似,输入词的时候返回提示。同时输入拼音和首字母也会有相同的提示效果
输入汉字
输入拼音
输入首字母
5.1 自定义语料库
5.1.1 语料库映射OpenAPI
索引映射OpenAPI
-
定义索引(映射)接口
/*** 索引操作接口*/ public interface ElasticsearchIndexService {//新增索引+映射public boolean addIndexAndMapping(CommonEntity commonEntity) throws Exception; }
-
定义索引(映射)实现
@Overridepublic boolean addIndexAndMapping(CommonEntity commonEntity) throws Exception {boolean flag=false;//创建索引请求CreateIndexRequest request=new CreateIndexRequest(commonEntity.getIndexName());//获取下游业务参数Map<String,Object> map =commonEntity.getMap();//循环参数for(Map.Entry<String,Object> entry:map.entrySet()){//设置settings参数if("settings".equals(entry.getKey()) && entry.getValue() instanceof Map && ((Map)entry.getValue()).size()>0){request.settings(((Map)entry.getValue()));}//设置mapping参数if("mapping".equals(entry.getKey()) && entry.getValue() instanceof Map && ((Map)entry.getValue()).size()>0){request.mapping(((Map)entry.getValue()));}}//创建索引操作客户端IndicesClient indicesClient=client.indices();//创建响应对象CreateIndexResponse response=indicesClient.create(request,RequestOptions.DEFAULT);flag=response.isAcknowledged();return flag;}
-
新增控制器
/*** 索引操作控制器*/ @RestController @RequestMapping("v1/indices") public class ElasticsearchIndexController {private static final Logger logger = LoggerFactory.getLogger(ElasticsearchIndexController.class);@AutowiredElasticsearchIndexService elasticsearchIndexService;@PostMapping(value = "/add")public ResponseData addIndexAndMapping(@RequestBody CommonEntity commonEntity) {//构造返回下游业务数据ResponseData rData = new ResponseData();if (StringUtils.isEmpty(commonEntity.getIndexName())) {rData.setResultEnum(ResultEnum.param_isnull);return rData;}//增加索引(映射)是否成功boolean isSuccess = false;try {//通过接口调用远程结构化查询方法isSuccess = elasticsearchIndexService.addIndexAndMapping(commonEntity);//通过类型推断自动装箱(多个参数取交集)rData.setResultEnum(isSuccess, ResultEnum.success, null);//日志记录logger.info(TipsEnum.create_index_success.getMessage());} catch (Exception e) {//打印到控制台e.printStackTrace();//日志记录logger.error(TipsEnum.create_index_fail.getMessage());//构建错误返回信息rData.setResultEnum(ResultEnum.error);}//返回return rData;} }
-
开始新增映射
http://127.0.0.1:8888/v1/indices/add
参数
自定义分词器ik_pinyin_analyzer(ik和pinyin组合分词器)
tips
在创建映射前,需要安装拼音插件
{"indexName": "product_completion_index","map": {"settings": {"number_of_shards": 1,"number_of_replicas": 2,"analysis": {"analyzer": {"ik_pinyin_analyzer": {"type": "custom","tokenizer": "ik_smart","filter": "pinyin_filter"}},"filter": {"pinyin_filter": {"type": "pinyin","keep_first_letter": true,"keep_separate_first_letter": false,"keep_full_pinyin": true,"keep_original": true,"limit_first_letter_length": 16,"lowercase": true,"remove_duplicated_term": true}}}},"mapping": {"properties": {"name": {"type": "keyword"},"searchkey": {"type": "completion","analyzer": "ik_pinyin_analyzer"}}}}
}
settings下面的为索引的设置信息,动态设置参数,遵循DSL写法
mapping下为映射的字段信息,动态设置参数,遵循DSL写法
属性 | 说明 |
---|---|
keep_first_letter | 启用此选项时,例如:刘德华> ldh,默认值: true |
keep_separate_first_letter | 启用该选项时,将保留第一个字母分开,例如: 刘德华> l,d,h,默认:假的,注意:查询结果 也许是太模糊,由于长期过频 |
limit_first_letter_length | 设置first_letter结果的最大长度,默认值:16 |
keep_full_pinyin | 当启用该选项,例如:刘德华> [ liu,de, hua],默认值:true |
keep_joined_full_pinyin | 当启用此选项时,例如:刘德华> [ liudehua], 默认值:false |
keep_none_chinese | 在结果中保留非中文字母或数字,默认值:true |
keep_none_chinese_together | 默认值:true,如:DJ音乐家- > DJ,yin,yue, jia,当设置为false,例如:DJ音乐家- > D,J, yin,yue,jia,注意:keep_none_chinese必 须先启动 |
keep_none_chinese_in_first_letter | 第一个字母保持非中文字母,例如:刘德华 AT2016- > ldhat2016,默认值:true |
keep_none_chinese_in_joined_full_pinyin | 保留非中文字母加入完整拼音,例如:刘德华 2016- > liudehua2016,默认:false |
none_chinese_pinyin_tokenize | 打破非中国信成单独的拼音项,如果他们拼音, 默认值:true,如: liudehuaaibaba13zhuanghan- > liu,de, hua,a,li,ba,ba,13,zhuang,han,注 意:keep_none_chinese和 keep_none_chinese_together应首先启用 |
keep_original | 当启用此选项时,也会保留原始输入,默认值: false |
lowercase | 小写非中文字母,默认值:true |
trim_whitespace | 默认值:true |
remove_duplicated_term | 当启用此选项时,将删除重复项以保存索引,例 如:de的> de,默认值:false,注意:位置相关 查询可能受影响 |
返回
5.1.2 语料库文档OpenAPI
-
定义批量新增文档接口
//批量新增文档public RestStatus bulkAndDoc(CommonEntity commonEntity) throws Exception;
-
定义批量新增文档实现
@Overridepublic RestStatus bulkAndDoc(CommonEntity commonEntity) throws Exception {//构建批量新增请求BulkRequest bulkRequest = new BulkRequest(commonEntity.getIndexName());//循环下游业务文档数据for (int i = 0; i < commonEntity.getList().size(); i++) {bulkRequest.add(new IndexRequest().source(XContentType.JSON, SearchTools.mapToObjectGroup(commonEntity.getList().get(i))));}//开始执行批量新增操作BulkResponse bulkResponse = client.bulk(bulkRequest, RequestOptions.DEFAULT);return bulkResponse.status();}
官方文档
如上图,需要定义成箭头中的形式
所以上面SearchTools.mapToObjectGroup将map转成了数组 -
定义批量新增文档控制器
@PostMapping(value = "/batch")public ResponseData bulkAndDoc(@RequestBody CommonEntity commonEntity) {//构造返回下游业务数据ResponseData rData = new ResponseData();if (StringUtils.isEmpty(commonEntity.getIndexName()) || CollectionUtils.isEmpty(commonEntity.getList())) {rData.setResultEnum(ResultEnum.param_isnull);return rData;}//定义批量返回结果RestStatus result = null;try {//通过接口调用批量新增方法result = elasticsearchDocumentService.bulkAndDoc(commonEntity);//通过类型推断自动装箱(多个参数取交集)rData.setResultEnum(result, ResultEnum.success, null);//日志记录logger.info(TipsEnum.batch_create_doc_success.getMessage());} catch (Exception e) {//打印到控制台e.printStackTrace();//日志记录logger.error(TipsEnum.batch_create_doc_fail.getMessage());//构建错误返回信息rData.setResultEnum(ResultEnum.error);}//返回return rData;}
-
开始批量新增调用
http://127.0.0.1:8888/v1/docs/batch
参数
定义23个suggest词库(定义了两个小米手机,验证是否去重)
tips
学完聚合通过日志埋点、数据落盘进行维护
{"indexName": "product_completion_index","list": [{"searchkey": "小米手机","name": "小米(MI)"},{"searchkey": "小米10","name": "小米(MI)"},{"searchkey": "小米电视","name": "小米(MI)"},{"searchkey": "小米路由器","name": "小米(MI)"},{"searchkey": "小米9","name": "小米(MI)"},{"searchkey": "小米手机","name": "小米(MI)"},{"searchkey": "小米耳环","name": "小米(MI)"},{"searchkey": "小米8","name": "小米(MI)"},{"searchkey": "小米10Pro","name": "小米(MI)"},{"searchkey": "小米笔记本","name": "小米(MI)"},{"searchkey": "小米摄像头","name": "小米(MI)"},{"searchkey": "小米电饭煲","name": "小米(MI)"},{"searchkey": "小米充电宝","name": "小米(MI)"},{"searchkey": "adidas男鞋","name": "adidas男鞋"},{"searchkey": "adidas女鞋","name": "adidas女鞋"},{"searchkey": "adidas外套","name": "adidas外套"},{"searchkey": "adidas裤子","name": "adidas裤子"},{"searchkey": "adidas官方旗舰店","name": "adidas官方旗舰店"},{"searchkey": "阿迪达斯袜子","name": "阿迪达斯袜子"},{"searchkey": "阿迪达斯外套","name": "阿迪达斯外套"},{"searchkey": "阿迪达斯运动鞋","name": "阿迪达斯运动鞋"},{"searchkey": "耐克外套","name": "耐克外套"},{"searchkey": "耐克运动鞋","name": "耐克运动鞋"}]
}
返回
查看
GET product_completion_index/_search
5.2 产品搜索与自动补全
- Term suggester :词条建议器。对给输入的文本进进行分词,为每个分词提供词项建议
- Phrase suggester :短语建议器,在term的基础上,会考量多个term之间的关系
- Completion Suggester,它主要针对的应用场景就是"Auto Completion"。
- Context Suggester:上下文建议器
GET product_completion_index/_search
{"from": 0,"size": 100,"suggest": {"czbk-suggest": {"prefix": "小米","completion": {"field": "searchkey","size": 20,"skip_duplicates": true}}}
}
5.2.1 汉字补全OpenAPI
-
定义自动补全接口
//自动补全(完成建议)public List<String> cSuggest(CommonEntity commonEntity) throws Exception;
-
定义自动补全实现
@Overridepublic List<String> cSuggest(CommonEntity commonEntity) throws Exception {//定义返回List<String> suggestList = new ArrayList<>();//定义自动完成构建器CompletionSuggestionBuilder completionSuggestionBuilder = SuggestBuilders.completionSuggestion(commonEntity.getSuggestFileld());//定义搜索关键字completionSuggestionBuilder.prefix(commonEntity.getSuggestValue());//去重completionSuggestionBuilder.skipDuplicates(true);//获取建议条数completionSuggestionBuilder.size(commonEntity.getSuggestCount());//定义返回字段SearchRequest searchRequest = new SearchRequest().indices(commonEntity.getIndexName()).source(new SearchSourceBuilder().sort(new ScoreSortBuilder().order(SortOrder.DESC)).suggest(new SuggestBuilder().addSuggestion("czbk-suggest", completionSuggestionBuilder)));//定义查找响应SearchResponse response = client.search(searchRequest, RequestOptions.DEFAULT);//定义完成建议对象CompletionSuggestion completionSuggestion = response.getSuggest().getSuggestion("czbk-suggest"); //获取返回数据List<CompletionSuggestion.Entry.Option> optionList = completionSuggestion.getEntries().get(0).getOptions();//从optionList取出结果if (!CollectionUtils.isEmpty(optionList)) {optionList.forEach(item -> {suggestList.add(item.getText().toString());});}return suggestList;}
-
定义自动补全控制器
@GetMapping(value = "/csuggest")public ResponseData cSuggest(@RequestBody CommonEntity commonEntity) {//构造返回下游业务数据ResponseData rData = new ResponseData();if (StringUtils.isEmpty(commonEntity.getIndexName()) || StringUtils.isEmpty(commonEntity.getSuggestFileld()) || StringUtils.isEmpty(commonEntity.getSuggestValue())) {rData.setResultEnum(ResultEnum.param_isnull);return rData;}//定义建议返回结果List<String> result = null;try {//通过接口调用批量新增方法result = elasticsearchDocumentService.cSuggest(commonEntity);//通过类型推断自动装箱(多个参数取交集)rData.setResultEnum(result, ResultEnum.success, result.size());//日志记录logger.info(TipsEnum.csuggest_get_doc_success.getMessage());} catch (Exception e) {//打印到控制台e.printStackTrace();//日志记录logger.error(TipsEnum.csuggest_get_doc_fail.getMessage());//构建错误返回信息rData.setResultEnum(ResultEnum.error);}//返回return rData;}
-
自动补全调用验证
http://192.168.150.7:6666/v1/docs/csuggest
或者
http://localhost:6666/v1/docs/csuggest
参数
{"indexName": "product_completion_index","suggestFileld": "searchkey","suggestValue": "小米","suggestCount": 13
}
- indexName索引名称
- suggestFileld:自动补全查找列
- suggestValue:自动补全输入的关键字
- suggestCount:自动补全返回个数(京东是13个)
返回
{"code": "200","desc": "操作成功!","data": ["小米10","小米10Pro","小米8","小米9","小米充电宝","小米手机","小米摄像头","小米电视","小米电饭煲","小米笔记本","小米耳环","小米路由器"],"count": 12
}
自动补全自动去重
5.2.2 拼音补全OpenAPI
使用拼音访问【小米】
http://localhost:8888/v1/docs/csuggest
参数
// 全拼访问
{"indexName": "product_completion_index","suggestFileld": "searchkey","suggestValue": "xiaomi","suggestCount": 13
}
// 全拼访问(分隔)
{"indexName": "product_completion_index","suggestFileld": "searchkey","suggestValue": "xiao mi","suggestCount": 13
}
// 首字母访问
{"indexName": "product_completion_index","suggestFileld": "searchkey","suggestValue": "xm","suggestCount": 13
}
安装pinyin插件
5.3 产品搜索与语言处理
5.3.1 什么是语言处理(拼写纠错)
场景描述
例如:错误输入"【adidaas官方旗舰店】 ”能够纠错为【adidas官方旗舰店】
5.3.2 语言处理OpenAPI
GET product_completion_index/_search
{"suggest": {"czbk-suggestion": {"text": "adidaas官方旗舰店","phrase": {"field": "name","size": 13}}}
}
-
定义拼写纠错接口
// 拼写纠错public String pSuggest(CommonEntity commonEntity) throws Exception;
-
定义拼写纠错实现
@Overridepublic String pSuggest(CommonEntity commonEntity) throws Exception {//定义返回String pSuggestString = new String();//定义短语建议器的构建器PhraseSuggestionBuilder phraseSuggestionBuilder = new PhraseSuggestionBuilder(commonEntity.getSuggestFileld());//设置搜索关键字phraseSuggestionBuilder.text(commonEntity.getSuggestValue());//数量匹配phraseSuggestionBuilder.size(1);//定义返回字段SearchRequest searchRequest = new SearchRequest().indices(commonEntity.getIndexName()).source(new SearchSourceBuilder().sort(new ScoreSortBuilder().order(SortOrder.DESC)).suggest(new SuggestBuilder().addSuggestion("czbk-suggest", phraseSuggestionBuilder)));//定义查找响应SearchResponse response = client.search(searchRequest, RequestOptions.DEFAULT);//定义短语建议对象PhraseSuggestion phraseSuggestion = response.getSuggest().getSuggestion("czbk-suggest");//获取返回数据List<PhraseSuggestion.Entry.Option> optionList = phraseSuggestion.getEntries().get(0).getOptions();//从optionList取出结果if (!CollectionUtils.isEmpty(optionList)) {pSuggestString = optionList.get(0).getText().toString();}return pSuggestString;}
-
定义拼写纠错控制器
@GetMapping(value = "/psuggest")public ResponseData pSuggest(@RequestBody CommonEntity commonEntity) {//构造返回下游业务数据ResponseData rData = new ResponseData();if (StringUtils.isEmpty(commonEntity.getIndexName()) || StringUtils.isEmpty(commonEntity.getSuggestFileld()) || StringUtils.isEmpty(commonEntity.getSuggestValue())) {rData.setResultEnum(ResultEnum.param_isnull);return rData;}//定义纠错返回结果String result = null;try {//通过接口调用批量新增方法result = elasticsearchDocumentService.pSuggest(commonEntity);//通过类型推断自动装箱(多个参数取交集)rData.setResultEnum(result, ResultEnum.success, null);//日志记录logger.info(TipsEnum.psuggest_get_doc_success.getMessage());} catch (Exception e) {//打印到控制台e.printStackTrace();//日志记录logger.error(TipsEnum.psuggest_get_doc_fail.getMessage());//构建错误返回信息rData.setResultEnum(ResultEnum.error);}//返回return rData;}
-
语言处理调用验证
http://192.168.150.7:6666/v1/docs/psuggest
或者
http://localhost:6666/v1/docs/psuggest参数
{"indexName": "product_completion_index","suggestFileld": "name","suggestValue": "adidaas官方旗舰店" }
- indexName索引名称
- suggestFileld:自动补全查找列
- suggestValue:自动补全输入的关键字
返回
{"code": "200","desc": "操作成功!","data": "adidas官方旗舰店" }
5.4 总结
- 需要一个搜索词库/语料库,不要和业务索引库在一起,方便维护和升级语料库
- 根据分词及其他搜索条件去语料库中查询若干条(京东13条、淘宝(天猫)10条、百度4条)记录
返回 - 为了提升准确率,通常都是前缀搜索
6、电商平台产品推荐
6.1 什么是搜索推荐
例如:关键词输入【阿迪达斯 耐克 外套 运动鞋 袜子】
汪~没有找到与“阿迪达斯 耐克 外套 运动鞋 袜子”相关的商品,为您推荐“ 阿迪达斯耐克运动鞋”的相关商品,或者试试:
6.2 产品推荐OpenAPI
GET product_completion_index/_search
{"suggest": {"czbk-suggestion": {"text": "阿迪达斯 耐克 外套 运动鞋 袜子","term": {"field": "name","min_word_length": 2,"string_distance": "ngram","analyzer": "ik_smart"}}}
}
注意的地方,查看官网
https://www.elastic.co/guide/en/elasticsearch/reference/7.4/search-suggesters.html#te
rm-suggester
-
定义搜索推荐接口
//搜索推荐public String tSuggest(CommonEntity commonEntity) throws Exception;
-
定义搜索推荐实现
@Overridepublic String tSuggest(CommonEntity commonEntity) throws Exception {//定义返回String tSuggestString = new String();//定义词条建议器的构建器TermSuggestionBuilder termSuggestionBuilder = SuggestBuilders.termSuggestion(commonEntity.getSuggestFileld());//定义搜索关键字termSuggestionBuilder.text(commonEntity.getSuggestValue());//设置分词termSuggestionBuilder.analyzer("ik_smart");//定义查询长度termSuggestionBuilder.minWordLength(2);//设置查找算法termSuggestionBuilder.stringDistance(TermSuggestionBuilder.StringDistanceImpl.NGRAM);//定义返回字段SearchRequest searchRequest = new SearchRequest().indices(commonEntity.getIndexName()).source(new SearchSourceBuilder().sort(new ScoreSortBuilder().order(SortOrder.DESC)).suggest(new SuggestBuilder().addSuggestion("czbk-suggest", termSuggestionBuilder)));//定义查找响应SearchResponse response = client.search(searchRequest, RequestOptions.DEFAULT);//定义term建议对象TermSuggestion termSuggestion = response.getSuggest().getSuggestion("czbk-suggest");//获取返回数据List<TermSuggestion.Entry.Option> optionList = termSuggestion.getEntries().get(0).getOptions();//从optionList取出结果if (!CollectionUtils.isEmpty(optionList)) {tSuggestString = optionList.get(0).getText().toString();}return tSuggestString;}
-
定义搜索推荐控制器
@GetMapping(value = "/tsuggest")public ResponseData tSuggest(@RequestBody CommonEntity commonEntity) {//构造返回下游业务数据ResponseData rData = new ResponseData();if (StringUtils.isEmpty(commonEntity.getIndexName()) || StringUtils.isEmpty(commonEntity.getSuggestFileld()) || StringUtils.isEmpty(commonEntity.getSuggestValue())) {rData.setResultEnum(ResultEnum.param_isnull);return rData;}//定义搜索推荐返回结果String result = null;try {//通过接口调用批量新增方法result = elasticsearchDocumentService.tSuggest(commonEntity);//通过类型推断自动装箱(多个参数取交集)rData.setResultEnum(result, ResultEnum.success, null);//日志记录logger.info(TipsEnum.tsuggest_get_doc_success.getMessage());} catch (Exception e) {//打印到控制台e.printStackTrace();//日志记录logger.error(TipsEnum.tsuggest_get_doc_fail.getMessage());//构建错误返回信息rData.setResultEnum(ResultEnum.error);}//返回return rData;}
-
语言处理调用验证
http://127.0.0.1:8888/v1/docs/tsuggest
参数
{"indexName": "product_completion_index","suggestFileld": "name","suggestValue": "阿迪达斯 耐克 外套 运动鞋 袜子" }
- indexName索引名称
- suggestFileld:自动补全查找列
- suggestValue:自动补全输入的关键字
返回
{"code": "200","desc": "操作成功!","data": "阿迪达斯外套" }
7、指标聚合与下钻分析
7.1 指标聚合与分类
什么是指标聚合(Metric)
聚合分析是数据库中重要的功能特性,完成对某个查询的数据集中数据的聚合计算,
如:找出某字段(或计算表达式的结果)的最大值、最小值,计算和、平均值等。
ES作为搜索引擎兼数据库,同样提供了强大的聚合分析能力。
对一个数据集求最大值、最小值,计算和、平均值等指标的聚合,在ES中称为指标聚合。
Metric聚合分析分为单值分析和多值分析两类
- 单值分析,只输出一个分析结果
min,max,avg,sum,cardinality(cardinality 求唯一值,即不重复的字段有多少(相当于mysql中的
distinct) - 多值分析,输出多个分析结果
stats,extended_stats,percentile,percentile_rank
7.2 指标聚合与下钻设计
官网
语法
"aggregations" : {"<aggregation_name>" : { <!--聚合的名字 -->"<aggregation_type>" : { <!--聚合的类型 --><aggregation_body> <!--聚合体:对哪些字段进行聚合 -->}[,"meta" : { [<meta_data_body>] } ]? <!--元 -->[,"aggregations" : { [<sub_aggregation>]+ } ]? <!--在聚合里面在定义子聚合-->}[,"<aggregation_name_2>" : { ... } ]* <!--聚合的名字 -->
}
openAPI设计目标与原则:
- DSL调用与语法进行高度抽象,参数动态设计
- Open API通过结果转换器支持上百种组合调用
qurey,constant_score,match/matchall/filter/sort/size/frm/higthlight/_source/includes - 逻辑处理公共调用,提升API业务处理能力
- 保留原生API与参数的用法
7.2.1 基础框架搭建
7.2.2 单值分析API设计
-
Avg(平均值)
从聚合文档中提取的价格的平均值。
对所有文档进行avg聚合(DSL)POST product_list/_search {"size": 0,"aggs": {"czbk": {"avg": {"field": "price"}}} }
以上汇总计算了所有文档的平均值。
“size”: 0, 表示只查询文档聚合数量,不查文档,如查询50,size=50
aggs:表示是一个聚合
czbk:可自定义,聚合后的数据将显示在自定义字段中结果:
{"took" : 1662,"timed_out" : false,"_shards" : {"total" : 1,"successful" : 1,"skipped" : 0,"failed" : 0},"hits" : {"total" : {"value" : 10000,"relation" : "gte"},"max_score" : null,"hits" : [ ]},"aggregations" : {"czbk" : {"value" : 920.1535462724372}} }
OpenAPI查询参数设计
{"indexName": "product_list","map": {"size": 0,"aggs": {"czbk": {"avg": {"field": "price"}}}} }
对筛选后的文档聚合
POST product_list/_search {"size": 0,"query": {"match": {"onelevel": "手机通讯"}},"aggs": {"czbk": {"avg": {"field": "price"}}} }
结果:
{"took" : 159,"timed_out" : false,"_shards" : {"total" : 1,"successful" : 1,"skipped" : 0,"failed" : 0},"hits" : {"total" : {"value" : 10000,"relation" : "gte"},"max_score" : null,"hits" : [ ]},"aggregations" : {"czbk" : {"value" : 314.77633210684854}} }
OpenAPI查询参数设计
{"indexName": "product_list","map": {"size": 0,"query": {"match": {"onelevel": "手机通讯"}},"aggs": {"czbk": {"avg": {"field": "price"}}}} }
根据Script计算平均值:
es所使用的脚本语言是painless这是一门安全-高效的脚本语言,基于jvm的
#统计所有 POST product_list/_search?size=0 {"aggs": {"czbk": {"avg": {"script": {"source": "doc.evalcount.value"}}}} } 结果:"value" : 599929.110015995 #有条件 POST product_list/_search?size=0 {"query": {"match": {"onelevel": "手机通讯"}},"aggs": {"czbk": {"avg": {"script": {"source": "doc.evalcount"}}}} } 结果:"value" : 600055.6935087288
OpenAPI查询参数设计
{"indexName": "product_list","map": {"size": 0,"aggs": {"czbk": {"avg": {"script": {"source": "doc.evalcount"}}}}} }
总结:
avg平均
1、统一avg(所有文档)
2、有条件avg(部分文档)
3、脚本统计(所有)
4、脚本统计(部分)代码编写
//平均值if (m.getValue() instanceof ParsedAvg) {map.put("value", ((ParsedAvg) m.getValue()).getValue());}
访问验证
http://localhost:6666/v1/analysis/metric/agg
或者
http://localhost:5555/v1/analysis/metric/agg -
Max(最大值)
计算从聚合文档中提取的数值的最大值。
统计所有文档
POST product_list/_search {"size": 0,"aggs": {"czbk": {"max": {"field": "price"}}} }
结果: “value” :1.0E8
OpenAPI查询参数设计
{"indexName": "product_list","map": {"size": 0,"aggs": {"czbk": {"max": {"field": "price"}}}} }
统计过滤后的文档
POST product_list/_search {"size": 0,"query": {"match": {"onelevel": "手机通讯"}},"aggs": {"czbk": {"max": {"field": "price"}}} }
结果: “value” :2474000.0
OpenAPI查询参数设计
{"indexName": "product_list","map": {"size": 0,"query": {"match": {"onelevel": "手机通讯"}},"aggs": {"czbk": {"max": {"field": "price"}}}} }
结果: “value” : 2474000.0
代码编写
//最大值if (m.getValue() instanceof ParsedMax) {map.put("value", ((ParsedMax) m.getValue()).getValue());}
访问验证
http://localhost:6666/v1/analysis/metric/agg
OR
http://localhost:5555/v1/analysis/metric/agg -
Min(最小值)
计算从聚合文档中提取的数值的最小值。
统计所有文档
POST product_list/_search {"size": 0,"aggs": {"czbk": {"min": {"field": "price"}}} }
结果:“value”: 0.0
OpenAPI查询参数设计
{"indexName": "product_list","map": {"size": 0,"aggs": {"czbk": {"min": {"field": "price"}}}} }
统计筛选后的文档
POST product_list/_search {"size": 1,"query": {"match": {"onelevel": "手机通讯"}},"aggs": {"czbk": {"min": {"field": "price"}}} }
结果:“value”: 0.0
参数size=1;可查询出金额为0的数据
OpenAPI查询参数设计
{"indexName": "product_list","map": {"size": 1,"query": {"match": {"onelevel": "手机通讯"}},"aggs": {"czbk": {"min": {"field": "price"}}}} }
代码编写
//最小值if (m.getValue() instanceof ParsedMin) {map.put("value", ((ParsedMin) m.getValue()).getValue());}
访问验证
http://localhost:6666/v1/analysis/metric/agg
或者
http://localhost:5555/v1/analysis/metric/agg -
Sum(总和)
统计所有文档汇总
POST product_list/_search {"size": 0,"query": {"constant_score": {"filter": {"match": {"threelevel": "手机"}}}},"aggs": {"czbk": {"sum": {"field": "price"}}} }
结果:“value” : 9.652872986812243E8
OpenAPI查询参数设计
{"indexName": "product_list","map": {"size": 0,"query": {"constant_score": {"filter": {"match": {"threelevel": "手机"}}}},"aggs": {"czbk": {"sum": {"field": "price"}}}} }
代码编写
//求和if (m.getValue() instanceof ParsedSum) {map.put("value", ((ParsedSum) m.getValue()).getValue());}
访问验证
http://localhost:6666/v1/analysis/metric/agg
OR
http://localhost:5555/v1/analysis/metric/agg -
Cardinality(唯一值)
Cardinality Aggregation,基数聚合。它属于multi-value,基于文档的某个值(可以是特定的字段,
也可以通过脚本计算而来),计算文档非重复的个数(去重计数),相当于sql中的distinct。cardinality 求唯一值,即不重复的字段有多少(相当于mysql中的distinct)
统计所有文档
POST product_list/_search {"size": 0,"aggs": {"czbk": {"cardinality": {"field": "storename.keyword"}}} }
结果:“value” : 103169
OpenAPI查询参数设计
{"indexName": "product_list","map": {"size": 0,"aggs": {"czbk": {"cardinality": {"field": "storename.keyword"}}}} }
统计筛选后的文档
POST product_list/_search {"size": 0,"query": {"constant_score": {"filter": {"match": {"threelevel": "手机"}}}},"aggs": {"czbk": {"cardinality": {"field": "storename.keyword"}}} }
OpenAPI查询参数设计
{"indexName": "product_list","map": {"size": 0,"query": {"constant_score": {"filter": {"match": {"threelevel": "手机"}}}},"aggs": {"czbk": {"cardinality": {"field": "storename.keyword"}}}} }
代码编写
//不重复的值if (m.getValue() instanceof ParsedCardinality) {map.put("value", ((ParsedCardinality) m.getValue()).getValue());}
访问验证
http://localhost:6666/v1/analysis/metric/agg
OR
http://localhost:5555/v1/analysis/metric/agg
7.2.3 多值分析API设计
-
Stats Aggregation
Stats Aggregation,统计聚合。它属于multi-value,基于文档的某个值(可以是特定的数值型字段,也可以通过脚本计算而来),计算出一些统计信息(min、max、sum、count、avg5个值)
统计所有文档
POST product_list/_search {"size": 0,"aggs": {"czbk": {"stats": {"field": "price"}}} }
返回
"aggregations" : {"czbk" : {"count" : 5072448,"min" : 0.0,"max" : 1.0E8,"avg" : 920.1535462724372,"sum" : 4.667431015482532E9}}
OpenAPI查询参数设计
{"indexName": "product_list","map": {"size": 0,"aggs": {"czbk": {"stats": {"field": "price"}}}} }
统计筛选文档
POST product_list/_search {"size": 0,"query": {"constant_score": {"filter": {"match": {"threelevel": "手机"}}}},"aggs": {"czbk": {"stats": {"field": "price"}}} }
OpenAPI查询参数设计
{"indexName": "product_list","map": {"size": 0,"query": {"constant_score": {"filter": {"match": {"threelevel": "手机"}}}},"aggs": {"czbk": {"stats": {"field": "price"}}}} }
代码编写
//状态统计if (m.getValue() instanceof ParsedStats) {map.put("count", ((ParsedStats) m.getValue()).getCount());map.put("min", ((ParsedStats) m.getValue()).getMin());map.put("max", ((ParsedStats) m.getValue()).getMax());map.put("avg", ((ParsedStats) m.getValue()).getAvg());map.put("sum", ((ParsedStats) m.getValue()).getSum());}
访问验证
http://localhost:6666/v1/analysis/metric/agg
OR
http://localhost:5555/v1/analysis/metric/agg -
扩展状态统计
Extended Stats Aggregation,扩展统计聚合。它属于multi-value,比stats多4个统计结果: 平方
和、方差、标准差、平均值加/减两个标准差的区间统计所有文档
POST product_list/_search {"size": 0,"aggs": {"czbk": {"extended_stats": {"field": "price"}}} }
返回
"aggregations" : {"czbk" : {"count" : 5072448,"min" : 0.0,"max" : 1.0E8,"avg" : 920.1535462724372,"sum" : 4.667431015482532E9,"sum_of_squares" : 2.0182209454063148E16,"variance" : 3.9779441210362864E9,"variance_population" : 3.9779441210362864E9,"variance_sampling" : 3.9779449052621484E9,"std_deviation" : 63070.94514145389,"std_deviation_population" : 63070.94514145389,"std_deviation_sampling" : 63070.951358467304,"std_deviation_bounds" : {"upper" : 127062.04382918023,"lower" : -125221.73673663534,"upper_population" : 127062.04382918023,"lower_population" : -125221.73673663534,"upper_sampling" : 127062.05626320705,"lower_sampling" : -125221.74917066217}}}
- sum_of_squares:平方和
- variance:方差
- std_deviation:标准差
- std_deviation_bounds:标准差的区间
OpenAPI查询参数设计
{"indexName": "product_list","map": {"size": 0,"aggs": {"czbk": {"extended_stats": {"field": "price"}}}} }
统计筛选后的文档
POST product_list/_search {"size": 1,"query": {"constant_score": {"filter": {"match": {"threelevel": "手机"}}}},"aggs": {"czbk": {"extended_stats": {"field": "price"}}} }
返回
"aggregations" : {"czbk" : {"count" : 340378,"min" : 0.0,"max" : 2474000.0,"avg" : 2835.927406240193,"sum" : 9.652872986812243E8,"sum_of_squares" : 6.06065362437439E13,"variance" : 1.7001407710991383E8,"variance_population" : 1.7001407710991383E8,"variance_sampling" : 1.7001457659747353E8,"std_deviation" : 13038.944631752749,"std_deviation_population" : 13038.944631752749,"std_deviation_sampling" : 13038.963785419206,"std_deviation_bounds" : {"upper" : 28913.81666974569,"lower" : -23241.961857265305,"upper_population" : 28913.81666974569,"lower_population" : -23241.961857265305,"upper_sampling" : 28913.854977078605,"lower_sampling" : -23242.00016459822}}}
OpenAPI查询参数设计
{"indexName": "product_list","map": {"size": 1,"query": {"constant_score": {"filter": {"match": {"threelevel": "手机"}}}},"aggs": {"czbk": {"extended_stats": {"field": "price"}}}} }
代码编写
状态统计ParsedStats是扩展状态统计ParsedExtendedStats父类
判断无需更改顺序
//扩展统计if (m.getValue() instanceof ParsedExtendedStats) {map.put("count", ((ParsedExtendedStats) m.getValue()).getCount());map.put("min", ((ParsedExtendedStats) m.getValue()).getMin());map.put("max", ((ParsedExtendedStats) m.getValue()).getMax());map.put("avg", ((ParsedExtendedStats) m.getValue()).getAvg());map.put("sum", ((ParsedExtendedStats) m.getValue()).getSum());map.put("sum_of_squares", ((ParsedExtendedStats) m.getValue()).getSumOfSquares());map.put("variance", ((ParsedExtendedStats) m.getValue()).getVariance());map.put("std_deviation", ((ParsedExtendedStats) m.getValue()).getStdDeviation());map.put("upper", ((ParsedExtendedStats) m.getValue()).getStdDeviationBound(ExtendedStats.Bounds.UPPER));map.put("lower", ((ParsedExtendedStats) m.getValue()).getStdDeviationBound(ExtendedStats.Bounds.LOWER));}
访问验证
http://localhost:6666/v1/analysis/metric/agg
OR
http://localhost:5555/v1/analysis/metric/agg -
百分位度量/百分比统计
Percentiles Aggregation,百分比聚合。它属于multi-value,对指定字段(脚本)的值按从小到大累计每个值对应的文档数的占比(占所有命中文档数的百分比),返回指定占比比例对应的值。默认返回[1, 5, 25, 50, 75, 95, 99 ]分位上的值。
它们表示了人们感兴趣的常用百分位数值。
统计所有文档
POST product_list/_search {"size": 0,"aggs": {"czbk": {"percentiles": {"field": "price"}}} }
返回
},"aggregations" : {"czbk" : {"values" : {"1.0" : 0.0,"5.0" : 14.99999272133453,"25.0" : 58.76038168571048,"50.0" : 139.47447505232998,"75.0" : 388.59368606915626,"95.0" : 3634.3835145207904,"99.0" : 12547.450833578012}}}
OpenAPI查询参数设计
{"indexName": "product_list","map": {"size": 0,"aggs": {"czbk": {"percentiles": {"field": "price"}}}} }
统计筛选后的文档
POST product_list/_search {"size": 0,"query": {"constant_score": {"filter": {"match": {"threelevel": "手机"}}}},"aggs": {"czbk": {"percentiles": {"field": "price"}}} }
OpenAPI查询参数设计
{"indexName": "product_list","map": {"size": 0,"query": {"constant_score": {"filter": {"match": {"threelevel": "手机"}}}},"aggs": {"czbk": {"percentiles": {"field": "price"}}}} }
代码编写
//百分位度量if (m.getValue() instanceof ParsedTDigestPercentiles) {for (Iterator<Percentile> iterator = ((ParsedTDigestPercentiles) m.getValue()).iterator(); iterator.hasNext(); ) {Percentile p = iterator.next();map.put(p.getPercent(), p.getValue());}}
访问验证
http://localhost:6666/v1/analysis/metric/agg
OR
http://localhost:5555/v1/analysis/metric/agg -
百分位等级/百分比排名聚合
百分比排名聚合:这里有另外一个紧密相关的度量叫 percentile_ranks 。 percentiles 度量告诉
我们落在某个百分比以下的所有文档的最小值。统计所有文档
统计价格在15元之内统计价格在30元之内文档数据占有的百分比
tips:
统计数据会变化
这里的15和30;完全可以理解万SLA的200;比较字段不一样而已POST product_list/_search {"size": 0,"aggs": {"czbk": {"percentile_ranks": {"field": "price","values": [15,30]}}} }
返回
价格在15元之内的文档数据占比是4.92%
价格在30元之内的文档数据占比是12.72%"aggregations" : {"czbk" : {"values" : {"15.0" : 4.89331591488828,"30.0" : 12.732247823263487}}}
OpenAPI查询参数设计
{"indexName": "product_list","map": {"size": 0,"aggs": {"czbk": {"percentile_ranks": {"field": "price","values": [15,30]}}}} }
统计过滤后的文档
POST product_list/_search {"size": 0,"query": {"constant_score": {"filter": {"match": {"threelevel": "手机"}}}},"aggs": {"czbk": {"percentile_ranks": {"field": "price","values": [15,30]}}} }
OpenAPI查询参数设计
{"indexName": "product_list","map": {"size": 0,"query": {"constant_score": {"filter": {"match": {"threelevel": "手机"}}}},"aggs": {"czbk": {"percentile_ranks": {"field": "price","values": [15,30]}}}} }
代码编写
//百分位等级if (m.getValue() instanceof ParsedTDigestPercentileRanks) {for (Iterator<Percentile> iterator = ((ParsedTDigestPercentileRanks) m.getValue()).iterator(); iterator.hasNext(); ) {Percentile p = iterator.next();map.put(p.getValue(), p.getPercent());}}
访问验证
http://localhost:6666/v1/analysis/metric/agg
OR
http://localhost:5555/v1/analysis/metric/agg
8、电商平台日志埋点与搜索热词
8.1 什么是热度搜索
以下为【京东】热搜词
8.2 提取热度搜索
热搜词分析流程图
8.3 日志埋点
下面的配置针对需要埋点的服务
这里以service-elasticsearch为例
-
Spring Cloud 整合Log4j2
相比与其他的日志系统,log4j2丢数据这种情况少;disruptor技术,在多线程环境下,性能高于logback等10倍以上;利用jdk1.5并发的特性,减少了死锁的发生;
排除logback的默认集成。
因为Spring Cloud 默认集成了logback, 所以首先要排除logback的集成,在pom.xml文件<!--排除logback的默认集成 Spring Cloud 默认集成了logback--><dependency><groupId>org.springframework.boot</groupId><artifactId>spring-boot-starter-web</artifactId><exclusions><exclusion><groupId>org.springframework.boot</groupId><artifactId>spring-boot-starter-logging</artifactId></exclusion></exclusions></dependency>
-
引入log4j2起步依赖
<!-- 引入log4j2起步依赖--><dependency><groupId>org.springframework.boot</groupId><artifactId>spring-boot-starter-log4j2</artifactId></dependency><!-- log4j2依赖环形队列--><dependency><groupId>com.lmax</groupId><artifactId>disruptor</artifactId><version>3.4.2</version></dependency>
-
设置配置文件
如果自定义了文件名,需要在application.yml中配置
进入Nacos修改配置
logging:config: classpath:log4j2-dev.xml
-
配置文件模板
<Configuration><Appenders><Socket name="Socket" host="192.168.150.7" port="4567"><JsonLayout compact="true" eventEol="true" /></Socket></Appenders><Loggers><Root level="info"><AppenderRef ref="Socket"/></Root></Loggers> </Configuration>
从配置文件中可以看到,这里使用的是Socket Appender来将日志打印的信息发送到Logstash。
注意了,Socket的Appender必须要配置到下面的Logger才能将日志输出到Logstash里!
另外这里的host是部署了Logstash服务端的地址,并且端口号要和你在Logstash里配置的一致才行。
-
日志埋点
private void getClientConditions(CommonEntity commonEntity, SearchSourceBuilder searchSourceBuilder) {//循环下游业务查询条件for (Map.Entry<String, Object> m : commonEntity.getMap().entrySet()) {if (StringUtils.isNotEmpty(m.getKey()) && m.getValue() != null) {String key = m.getKey();String value = String.valueOf(m.getValue());//构造DSL请求体中的querysearchSourceBuilder.query(QueryBuilders.matchQuery(key, value));logger.info("search for the keyword:" + value);}}}
-
创建索引
下面的索引存储用户输入的关键字,最终通过聚合的方式处理索引数据,最终将数据放到语料库
PUT es-log/ {"mappings": {"properties": {"@timestamp": {"type": "date"},"host": {"type": "text"},"searchkey": {"type": "keyword"},"port": {"type": "long"},"loggerName": {"type": "text"}}} }
8.4 数据落盘
- 配置Logstash.conf
连接logstash方式有两种
(1) 一种是Socket连接
(2)另外一种是gelf连接
对外暴露logstash容器的4567端口:参考文档
- 执行全文检索
http://localhost:8888/v1/docs/mquery
参数
{"pageNumber": 1,"pageSize": 3,"indexName": "product_list","highlight": "productname","map": {"productname": "小米"}
}
- 查询是否有数据
GET es-log/_search
{"from": 0,"size": 200,"query": {"match_all": {}}
}
返回
"hits" : [{"_index" : "es-log","_type" : "_doc","_id" : "H94AKpQB5vqCNWpIYHYT","_score" : 1.0,"_source" : {"host" : "192.168.150.1","loggerName" : "com.xin.service.impl.ElasticsearchDocumentServiceImpl","@timestamp" : "2025-01-03T02:30:55.118Z","searchkey" : "小米","port" : 54544}},{"_index" : "es-log","_type" : "_doc","_id" : "ZdgAKpQBrYxtVgSQgvHB","_score" : 1.0,"_source" : {"host" : "192.168.150.1","loggerName" : "com.xin.service.impl.ElasticsearchDocumentServiceImpl","@timestamp" : "2025-01-03T02:31:04.021Z","searchkey" : "小米","port" : 54544}}]
8.5 热度搜索OpenAPI
聚合
获取es-log索引中的文档数据并对其进行分组,统计热搜词出现的频率,根据频率获取有效数据。
DSL实现
POST es-log/_search?size=0
{"aggs": {"czbk": {"terms": {"field": "searchkey","min_doc_count": 5,"size": 2,"order": {"_count": "desc"}}}}
}
返回
{"took" : 155,"timed_out" : false,"_shards" : {"total" : 1,"successful" : 1,"skipped" : 0,"failed" : 0},"hits" : {"total" : {"value" : 14,"relation" : "eq"},"max_score" : null,"hits" : [ ]},"aggregations" : {"czbk" : {"doc_count_error_upper_bound" : 0,"sum_other_doc_count" : 0,"buckets" : [{"key" : "华为","doc_count" : 7},{"key" : "小米","doc_count" : 7}]}}
}
OpenAPI查询参数设计
-
定义搜索推荐接口
//获取搜索热词public Map<String, Long> hotWords(CommonEntity commonEntity) throws Exception;
-
定义搜索推荐实现
@Overridepublic Map<String, Long> hotWords(CommonEntity commonEntity) throws Exception {//定义返回数据Map<String, Long> map = new LinkedHashMap<>();//执行查询SearchResponse response = getSearchResponse(commonEntity);//接收数据Terms termsAggData = response.getAggregations().get(response.getAggregations().getAsMap().entrySet().iterator().next().getKey());for (Terms.Bucket entry : termsAggData.getBuckets()) {if (entry.getKey() != null) {//key为分组字段String key = entry.getKey().toString();//count数据条数Long count = entry.getDocCount();//设置到mapmap.put(key, count);}}return map;}
-
定义搜索推荐控制器
@GetMapping(value = "/hotwords")public ResponseData hotWords(@RequestBody CommonEntity commonEntity) {//构造返回数据ResponseData responseData = new ResponseData();if (StringUtils.isEmpty(commonEntity.getIndexName())) {responseData.setResultEnum(ResultEnum.param_isnull);return responseData;}//定义查询返回结果Map<String, Long> result = null;try {result = analysisService.hotWords(commonEntity);//通过类型推断自动装箱responseData.setResultEnum(result, ResultEnum.success, null);//日志记录logger.info(TipsEnum.hotwords_get_doc_success.getMessage());} catch (Exception e) {//打印到控制台e.printStackTrace();//日志记录logger.error(TipsEnum.hotwords_get_doc_fail.getMessage());//构建错误信息responseData.setResultEnum(ResultEnum.error);}return responseData;}
-
调用验证
获取分析服务热搜词数据
http://localhost:5555/v1/analysis/hotwords
参数
{"indexName": "es-log","map": {"aggs": {"per_count": {"terms": {"field": "searchkey","min_doc_count": 5,"size": 2,"order": {"_count": "desc"}}}}} }
- field表示需要查找的列
- min_doc_count:热搜词在文档中出现的次数
- size表示本次取出多少数据
- order表示排序(升序or降序)
返回
{"code": "200","desc": "操作成功!","data": {"华为": 7,"小米": 7} }