Leveling Up with Elasticsearch in Seven Days [Two]

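Bulk save

Bulk saving is done through the _bulk API.

Request method: POST

Request path: _bulk

When operating on documents through _bulk, each operation normally takes at least two lines:

The first line says what to do (create, index, update or delete).

The second line carries the data for that operation. Delete is the exception: it has no data line.

That is the standard form. The same operations can also be done without it, using other request methods.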
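The two-line layout described above can be sketched as a small helper that produces the newline-delimited body. This is a minimal sketch, not part of the original post: `build_bulk_body` is a hypothetical name, and it only builds the text that would be sent with `Content-Type: application/x-ndjson`; it does not send anything.

```python
import json

def build_bulk_body(operations):
    """Serialize (action, metadata, source) triples into a _bulk body.

    `source` is None for actions such as delete that have no data line.
    """
    lines = []
    for action, metadata, source in operations:
        # First line: what to do and on which document.
        lines.append(json.dumps({action: metadata}, ensure_ascii=False))
        # Second line: the data, when the action takes one.
        if source is not None:
            lines.append(json.dumps(source, ensure_ascii=False))
    # Every line, including the last one, must end with a newline.
    return "\n".join(lines) + "\n"

body = build_bulk_body([
    ("create", {"_index": "book", "_id": 3},
     {"id": 3, "title": "一战历史", "price": 99.99}),
    ("delete", {"_index": "book", "_id": 4}, None),
])
print(body)
```

The trailing newline is easy to forget when a body is pasted by hand, which is one reason to generate it.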

Bulk save demo (the examples include _type, which is deprecated in Elasticsearch 7.x and removed in 8.x):

POST localhost:9200/_bulk
{"create":{"_index":"book","_type":"_doc","_id":3}}
{"id":3,"title":"一战历史","price":99.99}
{"create":{"_index":"book","_type":"_doc","_id":4}}
{"id":4,"title":"二战历史","price":99.99}

Bulk save / replace

Bulk replace: if the document does not exist it is created, otherwise it is replaced.

Bulk replace demo:

{"index":{"_index":"book","_type":"_doc","_id":3}}
{"id":3,"title":"西点军校进化史","price":88}
{"index":{"_index":"book","_type":"_doc","_id":5}}
{"id":5,"title":"黄埔军校建校史","price":188}

Result: document 3 already existed, so it was updated (a full replacement), while document 5 was created.

{"took": 9, "errors": false, "items": [
{"index": {"_index": "book","_type": "_doc","_id": "3","_version": 2,"result": "updated","_shards": {"total": 2,"successful": 1,"failed": 0},"_seq_no": 2,"_primary_term": 1,"status": 200}},
{"index": {"_index": "book","_type": "_doc","_id": "5","_version": 1,"result": "created","_shards": {"total": 2,"successful": 1,"failed": 0},"_seq_no": 3,"_primary_term": 1,"status": 201}}
]}

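With create, a duplicate _id causes an error for that item (unless you leave out the id so that ES generates a new one).
With index, documents whose _id already exists are replaced and the rest are added.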
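Because a failed create is reported per item rather than failing the whole request, it helps to scan the `items` array of the response. The sketch below is an illustration only: `failed_items` is a hypothetical helper, and `sample` is a hand-written response in the shape ES returns, not output captured from a real cluster. It assumes the duplicate create is reported with status 409.

```python
def failed_items(bulk_response):
    """Return (action, _id, status) for every item with status 400 or above."""
    failures = []
    for item in bulk_response.get("items", []):
        # Each item is a single-key object such as {"create": {...}}.
        for action, detail in item.items():
            if detail.get("status", 0) >= 400:
                failures.append((action, detail.get("_id"), detail.get("status")))
    return failures

# Hand-written sample: id 3 already exists, id 6 is new.
sample = {
    "errors": True,
    "items": [
        {"create": {"_index": "book", "_id": "3", "status": 409,
                    "error": {"type": "version_conflict_engine_exception"}}},
        {"create": {"_index": "book", "_id": "6", "status": 201,
                    "result": "created"}},
    ],
}

if sample["errors"]:
    print(failed_items(sample))
```

Checking the top-level `errors` flag first avoids walking the items when everything succeeded.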

Bulk delete:

POST localhost:9200/_bulk
{"delete":{"_index":"book","_type":"_doc","_id":4}}
{"delete":{"_index":"book","_type":"_doc","_id":5}}

Result:

{"took": 18, "errors": false, "items": [
{"delete": {"_index": "book","_type": "_doc","_id": "4","_version": 2,"result": "deleted","_shards": {"total": 2,"successful": 1,"failed": 0},"_seq_no": 4,"_primary_term": 1,"status": 200}},
{"delete": {"_index": "book","_type": "_doc","_id": "5","_version": 2,"result": "deleted","_shards": {"total": 2,"successful": 1,"failed": 0},"_seq_no": 5,"_primary_term": 1,"status": 200}}
]}

Bulk update:

POST localhost:9200/_bulk
Body:
{"update":{"_index":"book","_type":"_doc","_id":3}}
{"doc":{"title":"中华上下五千年","price":100}}

Combined use

Combined use: a single request can mix operations, including create, update, delete and replace.

{"create":{"_index":"book","_type":"_doc","_id":"id"}}
{"id":1,"title":"资治通鉴","price":66}
{"index":{"_index":"book","_type":"_doc","_id":"id"}}
{"id":2,"title":"三国志","price":76}
{"delete":{"_index":"book","_type":"_doc","_id":3}}
{"update":{"_index":"book","_type":"_doc","_id":5}}
{"doc":{"id":8,"title":"三国志2","price":76}}

Result:

{"took": 14, "errors": false, "items": [
{"create": {"_index": "book","_type": "_doc","_id": "id","_version": 1,"result": "created","_shards": {"total": 2,"successful": 1,"failed": 0},"_seq_no": 11,"_primary_term": 1,"status": 201}},
{"index": {"_index": "book","_type": "_doc","_id": "id","_version": 2,"result": "updated","_shards": {"total": 2,"successful": 1,"failed": 0},"_seq_no": 12,"_primary_term": 1,"status": 200}},
{"delete": {"_index": "book","_type": "_doc","_id": "3","_version": 6,"result": "deleted","_shards": {"total": 2,"successful": 1,"failed": 0},"_seq_no": 13,"_primary_term": 1,"status": 200}},
{"update": {"_index": "book","_type": "_doc","_id": "5","_version": 2,"result": "updated","_shards": {"total": 2,"successful": 1,"failed": 0},"_seq_no": 14,"_primary_term": 1,"status": 200}}
]}

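Bulk read:

mget

POST localhost:9200/_mget
{"docs":[{"_index":"book","_id":4},{"_index":"book","_id":5}]}

Simplified mget: put the index name in the URL, and the body only needs the ids.

localhost:9200/book/_mget
{"ids":[4,5]}

Note: the request below, with the ids written as strings, also succeeds. ES stores _id as a string (the responses always show it quoted), so the numeric and the quoted forms address the same document.

{"docs":[{"_index":"book","_id":"4"},{"_index":"book","_id":"5"}]}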

Response:

{"docs": [
{"_index": "book","_type": "_doc","_id": "4","_version": 2,"_seq_no": 15,"_primary_term": 1,"found": true,"_source": {"id": 4,"title": "中华上下五千年","price": 100}},
{"_index": "book","_type": "_doc","_id": "5","_version": 3,"_seq_no": 16,"_primary_term": 1,"found": true,"_source": {"id": 5,"title": "三国志2","price": 100}}
]}

If a requested id does not exist, its entry carries no document data, only found: false.
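Since missing ids come back as entries with `found: false` rather than as an error, client code usually has to separate the two cases. A minimal sketch, assuming the response shape shown above; `split_mget` is a hypothetical helper and `sample` is hand-written for illustration.

```python
def split_mget(response):
    """Split an _mget response into {_id: _source} for hits and a list of missing ids."""
    found, missing = {}, []
    for doc in response.get("docs", []):
        if doc.get("found"):
            found[doc["_id"]] = doc["_source"]
        else:
            # Entries for unknown ids have no _source, only found: false.
            missing.append(doc["_id"])
    return found, missing

# Hand-written sample: id 4 exists, id 99 does not.
sample = {
    "docs": [
        {"_index": "book", "_id": "4", "found": True,
         "_source": {"id": 4, "title": "中华上下五千年", "price": 100}},
        {"_index": "book", "_id": "99", "found": False},
    ]
}

found, missing = split_mget(sample)
print(found, missing)
```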

Batch search

Sending an _msearch request from Postman and from Kibana differs in a few ways.

In Kibana:

GET /book/_msearch
{}
{"query": {"match_all":{}}}
{"index": "book1"}
{"query": {"match_all":{}}}

If the first {} is removed, the request fails with the following error:

{"error": {"root_cause": [{"type": "illegal_argument_exception","reason": "key [query] is not supported in the metadata section"}],"type": "illegal_argument_exception","reason": "key [query] is not supported in the metadata section"},"status": 400}

When the request was sent from Postman, the first {} had to be removed instead, otherwise it also failed.

The same requests in Kibana and in Postman, side by side.

First, with an index in the URL.

Kibana:
GET /book/_msearch
{}
{"query": {"match_all":{}}}
{"index": "book1"}
{"query": {"match_all":{}}}

Postman:
localhost:9200/book/_msearch
{"query":{"match_all":{}}}
{"index":"book1"}
{"query":{"match_all":{}}}

In Postman the first {} has to be removed, while in Kibana it has to stay; otherwise each of them reports an error.

Next, with the index removed from the URL.

Kibana:
GET /_msearch
{"index":"book"}
{"query":{"match_all":{}}}
{"index":"book1"}
{"query":{"match_all":{}}}

Postman (written this way, the request fails):
localhost:9200/_msearch
{"index":"book"}
{"query":{"match_all":{}}}
{"index":"book1"}
{"query":{"match_all":{}}}

So how do we query both the book and book1 indices?

We can fall back on the earlier approach:

localhost:9200/_search
{"query": {"bool": {"should":[{"match":{"_index":"book"}},{"match":{"_index":"book1"}}]}}}
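The documented _msearch body is a sequence of header and query line pairs, each line ending with a newline (the last one included) and sent as `application/x-ndjson`. Failures seen from a generic HTTP client may therefore come from how the body was typed or sent, and generating it removes that variable. The sketch below only builds the text; `build_msearch_body` is a hypothetical name, not an ES client call.

```python
import json

def build_msearch_body(searches):
    """Serialize (header, query) pairs into an _msearch body.

    An empty or None header becomes {}, meaning "use the index from the URL".
    """
    lines = []
    for header, query in searches:
        lines.append(json.dumps(header or {}))
        lines.append(json.dumps(query))
    # The final newline is required, just as for _bulk.
    return "\n".join(lines) + "\n"

match_all = {"query": {"match_all": {}}}
body = build_msearch_body([
    (None, match_all),               # searched against the index in the URL
    ({"index": "book1"}, match_all), # overrides the index for this search
])
print(body)
```

For a plain search across both indices, the comma-separated form `GET /book,book1/_search` is also available and avoids the `_index` query.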

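How ES retrieval works

The principle behind ES retrieval: keep narrowing the range of data to examine, and turn random access into sequential access.

When we search for a keyword, ES first uses its prefix or suffix to quickly locate the range where the data lives, which reduces the number of disk I/O operations.

To do that, ES maintains:

Term dictionary: records every term found in the documents, and links each term to its posting list
Posting list: records the documents in which a term appears, that is, the relation between documents and terms
Posting entry:
Document ID: the id of a document in which the term appears
Term frequency: how many times the term appears in that document, used for relevance scoring
Position: where the term falls in the document's token sequence, used for phrase search
Offset: the start and end character offsets of the term, used for highlighting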
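The structures listed above can be illustrated with a toy inverted index. This is a simplified sketch for intuition only, not how Lucene stores its data: tokens are just lowercase word runs, and `build_inverted_index` is a hypothetical name.

```python
import re
from collections import defaultdict

def build_inverted_index(docs):
    """Map each term to {doc_id: {"tf", "positions", "offsets"}}."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for position, match in enumerate(re.finditer(r"\w+", text.lower())):
            posting = index[match.group()].setdefault(
                doc_id, {"tf": 0, "positions": [], "offsets": []}
            )
            posting["tf"] += 1                                      # term frequency
            posting["positions"].append(position)                   # token position
            posting["offsets"].append((match.start(), match.end())) # character offsets
    return index

docs = {1: "to be or not to be", 2: "let it be"}
index = build_inverted_index(docs)

term_dictionary = sorted(index)  # the sorted list of all terms
print(term_dictionary)
print(index["be"])
```

Looking up `index["be"]` returns one posting entry per document, which is the data needed for scoring (tf), phrase matching (positions) and highlighting (offsets).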

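Any discussion of retrieval has to cover the analyzers in ES. Several analyzers are available, and some of them offer more than one analysis mode:
1. ik analyzer: ik_smart, ik_max_word
2. jieba analyzer: jieba
3. hanlp analyzer: hanlp
and so on. The one we commonly use for Chinese is ik.
The default ES analyzer is standard, which splits Chinese text character by character.
ik_smart: splits at the coarsest granularity
ik_max_word: splits at the finest granularity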

For example:

# default analyzer
POST _analyze
{"analyzer": "standard","text": "蓦然回首,那人却在灯火阑珊处"}

POST _analyze
{"analyzer": "ik_smart","text": "蓦然回首,那人却在灯火阑珊处"}

POST _analyze
{"analyzer": "ik_max_word","text": "蓦然回首,那人却在灯火阑珊处"}

For English text, the standard analyzer already splits at the finest granularity, word by word, so when configuring an analyzer we need to consider which one suits our data better.

POST _analyze
{"analyzer": "standard","text": "I have a pen"}

POST _analyze
{"analyzer": "ik_smart","text": "I have a pen"}

POST _analyze
{"analyzer": "ik_max_word","text": "I have a pen"}
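To get a feel for what the standard analyzer does with these two inputs without a running cluster, here is a rough approximation. It is an assumption-laden sketch: the real analyzer follows Unicode text segmentation rules, while this one only handles common CJK ideographs and ASCII words, and it makes no attempt to imitate ik, which relies on a dictionary. `approx_standard` is a hypothetical name.

```python
import re

# One CJK ideograph per token, or a run of ASCII letters and digits.
_TOKEN = re.compile(r"[\u4e00-\u9fff]|[A-Za-z0-9]+")

def approx_standard(text):
    """Approximate standard-analyzer tokens: single Chinese characters, lowercased words."""
    return [token.lower() for token in _TOKEN.findall(text)]

chinese_tokens = approx_standard("蓦然回首,那人却在灯火阑珊处")
english_tokens = approx_standard("I have a pen")
print(chinese_tokens)
print(english_tokens)
```

The Chinese sentence falls apart into individual characters, which is why a dictionary-based analyzer such as ik tends to be a better fit for Chinese search.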

Recap: when creating an index we can specify its default analyzer (this requires the ik plugin to be installed):

PUT /test
{"settings": {"index": {"analysis.analyzer.default.type": "ik_max_word"}}}

GET /test/_settings

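Querying large result sets

ES places some limits on queries that return large amounts of data.

For example, asking for 20,000 documents:

GET /book/_search
{"query": {"match_all": {}}, "size": 20000}

...excerpt... "reason" : "Result window is too large, from + size must be less than or equal to: [10000] but was [20000]. See the scroll api for a more efficient way to request large data sets. This limit can be set by changing the [index.max_result_window] index level setting."

As the message shows, ES caps from + size at 10000 by default. To fetch 20,000 documents we have to page through the results, and for that ES provides the scroll API, which is designed for reading large result sets.

We could also raise the limit, but doing so affects ES performance, so it is not recommended.
In addition, changing the limit this way only applies to indices that already exist; indices created afterwards do not pick it up.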

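PUT /_all/_settings
{"index.max_result_window":"20000"}

Querying 20,000 documents now no longer fails.

But if we then create a new index:

PUT /newindex

GET /newindex/_search
{"query": {"match_all": {}}, "size": 20000}

The error comes back, because the new index does not have the raised limit and would need it set again. This suggests the ES team does not want us to get rid of the error by raising the limit.
The cost of raising it is higher memory consumption, which is why changing it is officially discouraged.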
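The scroll flow mentioned in the error message can be sketched as follows: open a scroll with the first search, keep requesting `/_search/scroll` with the returned `_scroll_id` until a page comes back empty, then clear the scroll. This is a minimal sketch under stated assumptions: `send(method, path, body)` is a hypothetical transport callable standing in for whatever HTTP client is used, and `make_fake_send` is an in-memory stand-in so the loop can run without a cluster. Recent ES documentation also points to search_after for deep pagination; that is not covered here.

```python
def scroll_all(send, index, page_size=1000, keep_alive="1m"):
    """Yield every hit of a match_all query by paging with the scroll API."""
    page = send("POST", f"/{index}/_search?scroll={keep_alive}",
                {"size": page_size, "query": {"match_all": {}}})
    scroll_id = page.get("_scroll_id")
    try:
        while page["hits"]["hits"]:
            yield from page["hits"]["hits"]
            page = send("POST", "/_search/scroll",
                        {"scroll": keep_alive, "scroll_id": scroll_id})
            scroll_id = page.get("_scroll_id", scroll_id)
    finally:
        # Release the server-side search context once paging is done.
        if scroll_id:
            send("DELETE", "/_search/scroll", {"scroll_id": scroll_id})

def make_fake_send(total):
    """In-memory stand-in for a cluster holding `total` documents (illustration only)."""
    state = {"pos": 0, "size": 0, "cleared": False}

    def send(method, path, body):
        if method == "DELETE":
            state["cleared"] = True
            return {"succeeded": True}
        if "size" in body:  # only the initial search carries the page size
            state["size"] = body["size"]
        start = state["pos"]
        end = min(start + state["size"], total)
        state["pos"] = end
        return {"_scroll_id": "fake-scroll-id",
                "hits": {"hits": [{"_id": str(i)} for i in range(start, end)]}}

    return send, state

send, state = make_fake_send(total=25)
hits = list(scroll_all(send, "book", page_size=10))
print(len(hits), state["cleared"])
```

Each page stays well under index.max_result_window, so the limit never has to be raised.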
