ElasticSearch - A Deep Dive into Pagination and De-duplication in Elasticsearch Composite Aggregation

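Contents

Pre
Overview
What is `composite aggregation`?
Basic structure
What the `after` parameter does
- Background: duplicates in traditional pagination
- The design idea behind `after`
- Sample response
How `after` avoids duplicates
- Core mechanism
- Example
  - Step 1: Create test data
    - Create the index
    - Insert test data
  - Step 2: Query the first page
    - Query the first page
    - Response
  - Step 3: Query the second page
    - Query the second page
    - Response
  - Step 4: Query the third page
    - Query the third page
    - Response
  - Step 5: Query the fourth page
    - Query the fourth page
    - Response
  - Verification
  - Recap
Summary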

Pre

ElasticSearch - Paginating Buckets with Composite Aggregation

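Overview

In Elasticsearch, composite aggregation offers an efficient way to page through aggregation buckets, which is especially useful when there are many of them. To avoid the duplicates that offset-based pagination can produce, composite aggregation provides the after parameter. This article looks at how after works and how it keeps pages from overlapping.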

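What is `composite aggregation`?

composite aggregation is an aggregation type that groups documents by one or more fields, and what sets it apart is that its buckets can be paged. The size parameter controls how many buckets each request returns, and the after parameter fetches the next page.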

Basic structure

A typical composite aggregation query:

GET /your_index_name/_search
{
  "size": 0,
  "aggs": {
    "my_composite_agg": {
      "composite": {
        "size": 10,
        "sources": [
          { "field1": { "terms": { "field": "your_field_name1" } } },
          { "field2": { "terms": { "field": "your_field_name2" } } }
        ]
      }
    }
  }
}

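In this query:

sources defines the fields to group by; the order of the sources determines the order of the values in each bucket key, and therefore the sort order of the buckets.
size sets the number of buckets per page.
after_key in the response is the value to pass as after to fetch the next page.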
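
The same request body can also be assembled in code. What follows is a minimal Python sketch, not part of any official client: build_composite_request and its parameters are hypothetical names introduced here, and the function only produces the JSON body shown above.

```python
import json


def build_composite_request(sources, size=10, after=None,
                            agg_name="my_composite_agg"):
    """Build the body of a composite aggregation search request.

    sources: list of (source_name, field_name) pairs. Their order fixes
             the order of the values inside each bucket key.
    after:   the after_key of the previous page, or None for the first page.
    """
    composite = {
        "size": size,
        "sources": [
            {name: {"terms": {"field": field}}} for name, field in sources
        ],
    }
    if after is not None:
        composite["after"] = after
    # "size": 0 at the top level suppresses the document hits.
    return {"size": 0, "aggs": {agg_name: {"composite": composite}}}


body = build_composite_request(
    [("field1", "your_field_name1"), ("field2", "your_field_name2")]
)
print(json.dumps(body, indent=2))
```

Keeping the sources as an ordered list of pairs makes it explicit that reordering them changes the bucket key, and with it the paging order.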

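What the `after` parameter does

Background: duplicates in traditional pagination

With offset-based pagination (the from and size parameters), data that changes between requests can shift page boundaries and produce duplicates. For example:

If a document is inserted or updated while you are paging, every later offset shifts, so some documents can show up on more than one page.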
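
The effect is easy to reproduce without a cluster. The sketch below is a plain-Python simulation under simple assumptions (a sorted list stands in for the result set, and offset_page is a hypothetical helper); it is not Elasticsearch code.

```python
def offset_page(items, page_from, size):
    """Return one page the way from + size does: sort, skip, take."""
    return sorted(items)[page_from:page_from + size]


data = ["b", "c", "d", "e", "f"]

page1 = offset_page(data, 0, 2)   # first page, read before the insert
data.append("a")                  # new item that sorts before page 1
page2 = offset_page(data, 2, 2)   # second page, read after the insert

duplicates = set(page1) & set(page2)
print(page1, page2, duplicates)
```

Because the new item sorts ahead of the first page, every later position moves by one and the last item of page 1 is served again on page 2.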

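The design idea behind `after`

The after parameter is specific to composite aggregation. You pass it the key of the last bucket on the previous page (the after_key), and the next page starts right after that key. Because paging follows the sort order of the bucket keys rather than a numeric offset, consecutive pages are contiguous and do not overlap.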
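
To make the idea concrete, here is a small Python simulation of cursor-based paging. It is a sketch under simple assumptions (string keys in a list, a hypothetical cursor_page helper), not a description of Elasticsearch internals.

```python
import bisect


def cursor_page(keys, after, size):
    """Return up to `size` keys that sort strictly after the cursor."""
    ordered = sorted(keys)
    start = 0 if after is None else bisect.bisect_right(ordered, after)
    return ordered[start:start + size]


data = ["b", "c", "d", "e", "f"]

page1 = cursor_page(data, None, 2)        # first page, no cursor yet
data.append("a")                          # new key that sorts before the cursor
page2 = cursor_page(data, page1[-1], 2)   # resume after the last key of page 1

print(page1, page2)
```

The pages stay disjoint even though the data changed between the two calls. Note the trade-off in this model: the key added behind the cursor is not returned in this pass, so the cursor prevents repeats but is not a snapshot of the data.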

Sample response

Here is the response to one page of the query:

{
  "aggregations": {
    "my_composite_agg": {
      "buckets": [
        { "key": { "field1": "value1", "field2": "value2" }, "doc_count": 10 },
        { "key": { "field1": "value3", "field2": "value4" }, "doc_count": 8 }
      ],
      "after_key": { "field1": "value3", "field2": "value4" }
    }
  }
}

For the next page, pass the after_key as the after value:

GET /your_index_name/_search
{
  "size": 0,
  "aggs": {
    "my_composite_agg": {
      "composite": {
        "size": 10,
        "after": { "field1": "value3", "field2": "value4" },
        "sources": [
          { "field1": { "terms": { "field": "your_field_name1" } } },
          { "field2": { "terms": { "field": "your_field_name2" } } }
        ]
      }
    }
  }
}
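
Repeating this request by hand gets tedious, so the loop is usually automated. The following Python sketch shows one way to do it under stated assumptions: search is a hypothetical callable that takes the request body as a dict and returns the parsed JSON response, and canned_search is a stand-in that replays the sample response above followed by an empty page. Neither is a real client API.

```python
def iter_composite_buckets(search, sources, size=10,
                           agg_name="my_composite_agg"):
    """Yield every bucket of a composite aggregation, page by page."""
    after = None
    while True:
        composite = {"size": size, "sources": sources}
        if after is not None:
            composite["after"] = after      # resume after the previous page
        body = {"size": 0, "aggs": {agg_name: {"composite": composite}}}
        agg = search(body)["aggregations"][agg_name]
        buckets = agg["buckets"]
        if not buckets:                     # an empty page ends the loop
            return
        yield from buckets
        after = agg.get("after_key")
        if after is None:
            return


def canned_search(body):
    """Stand-in for a real search call: one page of data, then nothing."""
    composite = body["aggs"]["my_composite_agg"]["composite"]
    if "after" not in composite:
        return {"aggregations": {"my_composite_agg": {
            "buckets": [
                {"key": {"field1": "value1", "field2": "value2"}, "doc_count": 10},
                {"key": {"field1": "value3", "field2": "value4"}, "doc_count": 8},
            ],
            "after_key": {"field1": "value3", "field2": "value4"},
        }}}
    return {"aggregations": {"my_composite_agg": {"buckets": []}}}


sources = [
    {"field1": {"terms": {"field": "your_field_name1"}}},
    {"field2": {"terms": {"field": "your_field_name2"}}},
]
keys = [b["key"] for b in iter_composite_buckets(canned_search, sources)]
print(keys)
```

Injecting the search call keeps the loop independent of any particular client library; in real use it would wrap whatever HTTP or client call sends the body to the _search endpoint.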

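How `after` avoids duplicates

Core mechanism

Sorting keeps the order consistent
- composite aggregation builds each bucket key from the sources, in the order they are defined, and sorts the buckets by those values (ascending by default, which is lexicographic for keyword fields).
- The result order depends only on the key values, so it is the same on every request, and changes to the data do not reorder buckets that were already returned.
Recording where the page ended
- Each page of results comes with an after_key, the key of the last bucket on that page.
- The next request passes that key as after, and Elasticsearch returns the buckets that sort after it.
Skipping buckets already processed
- When it runs the query, Elasticsearch skips every bucket up to and including the given key, so each bucket is returned only once.
Precise cursor positioning
- after_key means "start reading after the last bucket of the previous page"; buckets that were already returned are not read again.
- Each query continues from the key's position in the sort order and returns the following buckets in sequence.
No offset arithmetic
- Paging does not rely on a from offset, which avoids the page shifts that inserts and deletes cause.
Globally consistent ordering
- The bucket order is determined globally, so buckets come back in one consistent order even when the data is spread across several shards.
- Elasticsearch merges the sorted results of the shards, so the buckets on each page are unique and never repeated.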
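
The ordering, skipping, and merging steps above can be modeled in a few lines of Python. This is a toy sketch, not Elasticsearch's implementation: each shard is assumed to be a dict from bucket key to doc_count, Python tuple comparison stands in for comparing the source values in order, and shard_buckets_after and composite_page are hypothetical helper names.

```python
import heapq
from itertools import groupby, islice


def shard_buckets_after(shard, after):
    """Sorted (key, doc_count) pairs of one shard, strictly after the cursor."""
    return [(k, n) for k, n in sorted(shard.items())
            if after is None or k > after]


def composite_page(shards, size, after=None):
    """Merge per-shard buckets in key order, sum equal keys, take `size`."""
    merged = heapq.merge(*(shard_buckets_after(s, after) for s in shards))
    grouped = ((key, sum(n for _, n in group))
               for key, group in groupby(merged, key=lambda kv: kv[0]))
    return list(islice(grouped, size))


# Keys are (field1, field2) tuples; the same key may live on both shards.
shard_0 = {("A", "x"): 2, ("B", "x"): 1, ("C", "y"): 4}
shard_1 = {("A", "x"): 1, ("A", "y"): 3, ("C", "y"): 1}

page1 = composite_page([shard_0, shard_1], size=2)
page2 = composite_page([shard_0, shard_1], size=2, after=page1[-1][0])
print(page1)
print(page2)
```

In this model a key that exists on both shards is merged into a single bucket, and the second page starts strictly after the last key of the first, which is why no bucket can appear twice.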

Example

Step 1: Create test data

We create an index named test_index and insert some test data. Each document has a category field, which we use for the paged aggregation.

Create the index

PUT /test_index
{
  "mappings": {
    "properties": {
      "category": { "type": "keyword" },
      "value": { "type": "integer" }
    }
  }
}

Insert test data

POST /test_index/_bulk
{ "index": {} }
{ "category": "A", "value": 10 }
{ "index": {} }
{ "category": "A", "value": 20 }
{ "index": {} }
{ "category": "A", "value": 30 }
{ "index": {} }
{ "category": "B", "value": 40 }
{ "index": {} }
{ "category": "B", "value": 50 }
{ "index": {} }
{ "category": "B", "value": 60 }
{ "index": {} }
{ "category": "C", "value": 70 }
{ "index": {} }
{ "category": "C", "value": 80 }
{ "index": {} }
{ "category": "C", "value": 90 }
{ "index": {} }
{ "category": "D", "value": 100 }
{ "index": {} }
{ "category": "D", "value": 110 }
{ "index": {} }
{ "category": "D", "value": 120 }
{ "index": {} }
{ "category": "E", "value": 130 }
{ "index": {} }
{ "category": "E", "value": 140 }
{ "index": {} }
{ "category": "E", "value": 150 }
{ "index": {} }
{ "category": "F", "value": 160 }
{ "index": {} }
{ "category": "F", "value": 170 }
{ "index": {} }
{ "category": "F", "value": 180 }
{ "index": {} }
{ "category": "G", "value": 190 }
{ "index": {} }
{ "category": "G", "value": 200 }
{ "index": {} }
{ "category": "G", "value": 210 }
{ "index": {} }
{ "category": "H", "value": 220 }
{ "index": {} }
{ "category": "H", "value": 230 }
{ "index": {} }
{ "category": "H", "value": 240 }
{ "index": {} }
{ "category": "I", "value": 250 }
{ "index": {} }
{ "category": "I", "value": 260 }
{ "index": {} }
{ "category": "I", "value": 270 }
{ "index": {} }
{ "category": "J", "value": 280 }
{ "index": {} }
{ "category": "J", "value": 290 }
{ "index": {} }
{ "category": "J", "value": 300 }
{ "index": {} }
{ "category": "K", "value": 310 }
{ "index": {} }
{ "category": "K", "value": 320 }
{ "index": {} }
{ "category": "K", "value": 330 }
{ "index": {} }
{ "category": "L", "value": 340 }
{ "index": {} }
{ "category": "L", "value": 350 }
{ "index": {} }
{ "category": "L", "value": 360 }
{ "index": {} }
{ "category": "M", "value": 370 }
{ "index": {} }
{ "category": "M", "value": 380 }
{ "index": {} }
{ "category": "M", "value": 390 }
{ "index": {} }
{ "category": "N", "value": 400 }
{ "index": {} }
{ "category": "N", "value": 410 }
{ "index": {} }
{ "category": "N", "value": 420 }
{ "index": {} }
{ "category": "O", "value": 430 }
{ "index": {} }
{ "category": "O", "value": 440 }
{ "index": {} }
{ "category": "O", "value": 450 }
{ "index": {} }
{ "category": "P", "value": 460 }
{ "index": {} }
{ "category": "P", "value": 470 }
{ "index": {} }
{ "category": "P", "value": 480 }
{ "index": {} }
{ "category": "Q", "value": 490 }
{ "index": {} }
{ "category": "Q", "value": 500 }
{ "index": {} }
{ "category": "Q", "value": 510 }
{ "index": {} }
{ "category": "R", "value": 520 }
{ "index": {} }
{ "category": "R", "value": 530 }
{ "index": {} }
{ "category": "R", "value": 540 }
{ "index": {} }
{ "category": "S", "value": 550 }
{ "index": {} }
{ "category": "S", "value": 560 }
{ "index": {} }
{ "category": "S", "value": 570 }
{ "index": {} }
{ "category": "T", "value": 580 }
{ "index": {} }
{ "category": "T", "value": 590 }
{ "index": {} }
{ "category": "T", "value": 600 }
{ "index": {} }
{ "category": "U", "value": 610 }
{ "index": {} }
{ "category": "U", "value": 620 }
{ "index": {} }
{ "category": "U", "value": 630 }
{ "index": {} }
{ "category": "V", "value": 640 }
{ "index": {} }
{ "category": "V", "value": 650 }
{ "index": {} }
{ "category": "V", "value": 660 }
{ "index": {} }
{ "category": "W", "value": 670 }
{ "index": {} }
{ "category": "W", "value": 680 }
{ "index": {} }
{ "category": "W", "value": 690 }
{ "index": {} }
{ "category": "X", "value": 700 }
{ "index": {} }
{ "category": "X", "value": 710 }
{ "index": {} }
{ "category": "X", "value": 720 }
{ "index": {} }
{ "category": "Y", "value": 730 }
{ "index": {} }
{ "category": "Y", "value": 740 }
{ "index": {} }
{ "category": "Y", "value": 750 }
{ "index": {} }
{ "category": "Z", "value": 760 }
{ "index": {} }
{ "category": "Z", "value": 770 }
{ "index": {} }
{ "category": "Z", "value": 780 }
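
Typing 78 documents by hand is error-prone, so the same body can be generated. The Python sketch below reproduces the data above (26 categories, three documents each, values rising in steps of 10); build_bulk_body is a hypothetical helper name, and sending the body to the _bulk endpoint is left out.

```python
import json
from string import ascii_uppercase


def build_bulk_body():
    """Return the NDJSON body for the _bulk request shown above."""
    lines = []
    value = 10
    for category in ascii_uppercase:        # "A" .. "Z"
        for _ in range(3):                  # three documents per category
            lines.append(json.dumps({"index": {}}))
            lines.append(json.dumps({"category": category, "value": value}))
            value += 10
    # The bulk API expects the body to end with a newline.
    return "\n".join(lines) + "\n"


bulk_body = build_bulk_body()
docs = [json.loads(line) for line in bulk_body.splitlines()[1::2]]
print(len(docs), docs[0], docs[-1])
```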

Step 2: Query the first page

We query the first page with composite aggregation, returning 10 buckets per page.

Query the first page

GET /test_index/_search
{
  "size": 0,
  "aggs": {
    "composite_agg": {
      "composite": {
        "size": 10,
        "sources": [
          { "category": { "terms": { "field": "category" } } }
        ]
      }
    }
  }
}

Response

{
  "took" : 11,
  "timed_out" : false,
  "_shards" : { "total" : 1, "successful" : 1, "skipped" : 0, "failed" : 0 },
  "hits" : {
    "total" : { "value" : 78, "relation" : "eq" },
    "max_score" : null,
    "hits" : [ ]
  },
  "aggregations" : {
    "composite_agg" : {
      "after_key" : { "category" : "J" },
      "buckets" : [
        { "key" : { "category" : "A" }, "doc_count" : 3 },
        { "key" : { "category" : "B" }, "doc_count" : 3 },
        { "key" : { "category" : "C" }, "doc_count" : 3 },
        { "key" : { "category" : "D" }, "doc_count" : 3 },
        { "key" : { "category" : "E" }, "doc_count" : 3 },
        { "key" : { "category" : "F" }, "doc_count" : 3 },
        { "key" : { "category" : "G" }, "doc_count" : 3 },
        { "key" : { "category" : "H" }, "doc_count" : 3 },
        { "key" : { "category" : "I" }, "doc_count" : 3 },
        { "key" : { "category" : "J" }, "doc_count" : 3 }
      ]
    }
  }
}

Step 3: Query the second page

We pass the after_key returned by the first page, { "category": "J" }, to query the second page.

Query the second page

GET /test_index/_search
{
  "size": 0,
  "aggs": {
    "composite_agg": {
      "composite": {
        "size": 10,
        "after": { "category": "J" },
        "sources": [
          { "category": { "terms": { "field": "category" } } }
        ]
      }
    }
  }
}

Response

[Screenshot of the response: buckets K through T, after_key { "category": "T" }]

Step 4: Query the third page

We pass the after_key returned by the second page, { "category": "T" }, to query the third page.

Query the third page

GET /test_index/_search
{
  "size": 0,
  "aggs": {
    "composite_agg": {
      "composite": {
        "size": 10,
        "after": { "category": "T" },
        "sources": [
          { "category": { "terms": { "field": "category" } } }
        ]
      }
    }
  }
}

Response

[Screenshot of the response: buckets U through Z, after_key { "category": "Z" }]

Step 5: Query the fourth page

We pass the after_key returned by the third page, { "category": "Z" }, to query the fourth page.

Query the fourth page

GET /test_index/_search
{
  "size": 0,
  "aggs": {
    "composite_agg": {
      "composite": {
        "size": 10,
        "after": { "category": "Z" },
        "sources": [
          { "category": { "terms": { "field": "category" } } }
        ]
      }
    }
  }
}

Response

[Screenshot of the response: no further buckets, the end has been reached]

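Verification

The four paged queries confirm the following:

No duplicate results:
- Each page is distinct and no bucket appears twice. For example:
  - Page 1 returns buckets A, B, C … J
  - Page 2 returns buckets K, L, M … T
  - Page 3 returns buckets U, V … Z
  - Page 4 returns no buckets: the end has been reached
Consistent order:
- All results are sorted by the value of the category field, in the order A, B, C, …, Z.
after_key positions the cursor correctly:
- after_key marks where the next page starts: each query resumes right after the key of the last bucket on the previous page, with no gaps and no repeats.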
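
These three checks can also be written down as code. The Python sketch below is illustrative only: check_pages is a hypothetical helper, and the pages are typed in from the bucket ranges listed above rather than fetched from a cluster.

```python
from string import ascii_uppercase


def check_pages(pages, expected_keys):
    """Return (no_duplicates, in_order, complete) for pages of bucket keys."""
    seen = [key for page in pages for key in page]
    no_duplicates = len(seen) == len(set(seen))
    in_order = seen == sorted(seen)
    complete = set(seen) == set(expected_keys)
    return no_duplicates, in_order, complete


letters = list(ascii_uppercase)
# Pages as reported above: A-J, K-T, U-Z, then an empty fourth page.
pages = [letters[0:10], letters[10:20], letters[20:26], []]

print([len(p) for p in pages], check_pages(pages, letters))
```

In real use the pages would be collected from the responses of each request, and the same function would flag a repeated, misplaced, or missing bucket.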

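Recap

composite aggregation pages with an after_key cursor, so walking through the buckets of an unchanged index returns each bucket exactly once, with no duplicates and no gaps.
Its design suits aggregating and paging over large data sets, and it is an efficient alternative to traditional from + size pagination.

Paging with after_key shows that the pages do not overlap and that the buckets are strictly sorted by the category field.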

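Summary

Traditional pagination (`from` + `size`)	Composite Aggregation (cursor)
Offset-based; data changes can easily cause duplicates	Cursor-based; bucket order and position are stable, with no duplicates
Performance drops noticeably on large data sets	Handles large data sets efficiently and avoids the cost of offsets
Sorted across shards, but each shard must collect from + size entries for every page	Sorted consistently across shards; results come back with no duplicates or gaps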