A Charming Algorithm for Count-Distinct

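How to Estimate the Number of Distinct Elements

This article presents a fun algorithm for estimating the number of distinct elements in a sequence; the key point is that it uses only a specified, fixed amount of memory.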

I recently came across a paper called “Distinct Elements in Streams: An Algorithm for the (Text) Book” by Chakraborty, Vinodchandran, and Meel.

The usage of the phrase “from the book” is of course a reference to Erdős, who often referred to a “book” within which God kept the best proofs of any given theorem. Thus, for something to be “from the book” is for it to be particularly elegant. I have to say, I agree with their assessment. This is an extremely charming little algorithm that I really enjoyed thinking about, so today, I’m going to explain it to you.

The count-distinct problem is to estimate the number of distinct elements appearing in a stream. That is, given some enumeration of “objects,” which you can think of as any data type you like, we want to know approximately how many unique objects there are. For instance, this array:

[1,1,2,1,2,3,1,2,1,2,2,2,1,2,3,1,1,1,1]

has only three distinct objects: 1, 2, and 3. It’s pretty natural to want to know how many distinct objects appear in such a list. Unfortunately, if you require the actual number, there are basically only two options:

  • sort the list, or
  • use a hash table.

Both of these options require memory proportional at least to the number of distinct elements, which in some cases could be as big as the entire stream. This might be fine for some smaller data sizes, but if we want to handle millions or billions of elements, that’s a nonstarter for many use cases.
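For reference, the sorting option can be sketched like this. It is a minimal illustration rather than anything from the paper: the name `countDistinctBySorting` is made up here, and it assumes the values are all numbers or all strings so that `<` orders them consistently.

```javascript
// Exact count-distinct by sorting a copy of the list and counting
// the positions where the value changes.
// Assumption: all values are numbers, or all are strings.
function countDistinctBySorting(list) {
  const sorted = [...list].sort((a, b) => (a < b ? -1 : a > b ? 1 : 0));
  let count = 0;
  for (let i = 0; i < sorted.length; i++) {
    // The first element, and every element that differs from its
    // predecessor, starts a new run of equal values.
    if (i === 0 || sorted[i] !== sorted[i - 1]) {
      count++;
    }
  }
  return count;
}

console.log(countDistinctBySorting(
  [1, 1, 2, 1, 2, 3, 1, 2, 1, 2, 2, 2, 1, 2, 3, 1, 1, 1, 1]
));
// => 3
```

Note that the sorted copy is as large as the whole stream, which is exactly the memory cost we are trying to avoid.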

It turns out that if we can tolerate some imprecision, and we often can, there are ways we can vastly reduce the amount of memory we need by using approximate algorithms.

The most well-known approximate count-distinct algorithm is HyperLogLog, which is widely used in production for all sorts of things. While the idea behind HyperLogLog is simple, the analysis of it is somewhat complex. What this paper provides is an alternative algorithm which:

  1. has simpler analysis, and
  2. doesn’t rely on hashing at all.

The paper provides proofs around the algorithm’s correctness, so I’m just going to explain how it works by way of derivation; one lovely thing about this algorithm is that we can build up to it very naturally.

The most obvious solution to the count-distinct problem is to just maintain a hash table of the objects you’ve seen, and to emit its size at the end:

function countDistinct(list) {
  let seen = new Set();
  for (let value of list) {
    seen.add(value);
  }
  return seen.size;
}

console.log(countDistinct([
  "the", "quick", "brown", "fox", "jumps", "jumps", "over",
  "over", "dog", "over", "the", "lazy", "quick", "dog",
]));
// => 8

This will have to store every element we’ve seen. If we’re trying to save memory, one obvious trick to try is to just not store everything.

If we attempt to only store half the values, then the expected size of seen should be half the actual number of distinct elements, so at the end we can just multiply that size by two to get an approximation of the number of distinct elements.

When we see an element, we flip a coin, and only store it if the flip is heads:

function countDistinct(list) {
  let seen = new Set();
  for (let value of list) {
    if (Math.random() < 0.5) {
      seen.add(value);
    }
  }
  return seen.size * 2;
}

console.log(countDistinct([
  "the", "quick", "brown", "fox", "jumps", "jumps", "over",
  "over", "dog", "over", "the", "lazy", "quick", "dog",
]));
// => 10 (one possible result; the output varies from run to run)

Well, this is actually wrong, because if we see the same element multiple times, we’re more likely to have it in our final representative set:

console.log(countDistinct([
  "a", "a", "a", "a", "a", "a", "a",
  "a", "a", "a", "a", "a", "a", "a",
  "a", "a", "a", "a", "a", "a", "a",
  "a", "a", "a", "a", "a", "a", "a",
]));
// => 2 (with very high probability)

The number of times an element appears shouldn’t impact the output of our algorithm (this is sort of the defining property of count-distinct, I’d say).

There’s an easy fix for this though: when we see an element, we can just remove it from the set before flipping, so the only coin flip that actually matters is the last one (which works out, because every element that appears at least once has exactly one final appearance):

function countDistinct(list) {
  let seen = new Set();
  for (let value of list) {
    seen.delete(value);
    if (Math.random() < 0.5) {
      seen.add(value);
    }
  }
  return seen.size * 2;
}

In this iteration, every distinct element appears in seen with probability 0.5.
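One way to convince yourself of the difference is to average each version over many runs on the all-"a" input. The sketch below is only an illustration: the names `estimateNaive`, `estimateLastFlipWins`, and `average` are made up here, and the optional `random` parameter is an assumption added so the coin can be swapped out when testing.

```javascript
// Coin-flip estimator WITHOUT the delete step: repeats get extra chances.
function estimateNaive(list, random = Math.random) {
  const seen = new Set();
  for (const value of list) {
    if (random() < 0.5) seen.add(value);
  }
  return seen.size * 2;
}

// Coin-flip estimator WITH the delete step: only the last flip counts.
function estimateLastFlipWins(list, random = Math.random) {
  const seen = new Set();
  for (const value of list) {
    seen.delete(value);
    if (random() < 0.5) seen.add(value);
  }
  return seen.size * 2;
}

// Mean of the estimator's output over `trials` independent runs.
function average(estimator, list, trials) {
  let total = 0;
  for (let i = 0; i < trials; i++) total += estimator(list);
  return total / trials;
}

const repeated = new Array(28).fill("a"); // one distinct element, 28 times
console.log(average(estimateNaive, repeated, 10000));        // close to 2
console.log(average(estimateLastFlipWins, repeated, 10000)); // close to 1
```

The true answer is 1, so the version with the delete step is the one that is right on average, even though any single run of it returns either 0 or 2.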

We can improve the memory usage even further (at the cost of precision) by requiring each element to win more coin flips to be included in the final set:

function countDistinct(list, p) {
  let seen = new Set();
  for (let value of list) {
    seen.delete(value);
    if (Math.random() < p) {
      seen.add(value);
    }
  }
  return seen.size / p;
}

console.log(countDistinct([
  "the", "quick", "brown", "fox", "jumps", "jumps", "over",
  "over", "dog", "over", "the", "lazy", "quick", "dog",
], 0.125));

Now each element is included in the final set with probability p, so we divide by p to get the actual estimate.
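To spell out why dividing by p is the right correction, here is the expectation calculation. The notation is mine and not taken from the paper: n is the true number of distinct elements, S is the final contents of seen, and x ranges over the distinct elements of the stream.

```latex
\mathbb{E}\!\left[\frac{|S|}{p}\right]
  = \frac{1}{p} \sum_{x} \Pr[x \in S]
  = \frac{1}{p} \cdot n \cdot p
  = n
```

So the estimate is correct on average, although any individual run can land above or below n.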

We’ve reduced our memory usage by some constant factor, and that’s maybe good! But it’s not any better asymptotically, and more importantly, it doesn’t let us bound the amount of memory usage: I can’t tell you ahead of time how much memory I’m going to use for this.
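To make the memory problem concrete, the following sketch counts how many elements the fixed-p version ends up holding for streams of different sizes. The helper name `storedCountFixedP` and its optional `random` parameter are assumptions made for this illustration.

```javascript
// Runs the fixed-p filter and returns how many elements are stored
// at the end, rather than the estimate.
function storedCountFixedP(list, p, random = Math.random) {
  const seen = new Set();
  for (const value of list) {
    seen.delete(value);
    if (random() < p) seen.add(value);
  }
  return seen.size;
}

for (const n of [1000, 10000, 100000]) {
  // A stream of n distinct values.
  const values = Array.from({ length: n }, (_, i) => i);
  console.log(n, storedCountFixedP(values, 0.125)); // roughly n / 8
}
```

The stored set is smaller by a factor of about 8, but it still grows in step with the number of distinct elements, which is what the dynamic choice of p is meant to fix.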

The final trick to get us to the actual algorithm is to pick p dynamically.

That is, we start with a p of 1, and have a threshold for how big is “too big.” If our set grows beyond this size, we “upgrade” p so that elements now have to win an additional coin flip to be included in the final set. When we upgrade p this way, we have to do two things:

  1. ensure future elements are subject to the new filter, by updating the variable p, and
  2. ensure old elements are subject to the new filter, by forcing them to win an additional coin-flip on top of what they won before.

Since elements that have already “won” and are included in seen need to be held to this new standard, we have to do a Thanos-snap and have them each win an additional coin-flip in order to stay in the set.

At the end of the day, we still have a set that contains each distinct element with probability p, so we can divide its size by p to get our estimate.

The final algorithm looks like this:

function countDistinct(list, thresh) {
  let p = 1;
  let seen = new Set();
  for (let value of list) {
    seen.delete(value);
    if (Math.random() < p) {
      seen.add(value);
    }
    if (seen.size === thresh) {
      // Objects now need to win an extra coin flip to be included
      // in the set. Every element in `seen` already won n-1 coin
      // flips, so they now have to win one more.
      seen = new Set([...seen].filter(() => Math.random() < 0.5));
      p *= 1 / 2;
    }
  }
  return seen.size / p;
}

That’s the whole algorithm—the paper contains an actual analysis, as well as guidance for picking a value of thresh for a desired level of precision.
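To check the memory claim empirically, here is a lightly instrumented variant of the same loop that also records the largest size the set ever reached. This is my own sketch, not code from the paper: the name `countDistinctTracked`, the `peak` counter, and the optional `random` parameter are all assumptions added for illustration.

```javascript
// Same loop as the final algorithm, plus a record of the peak set size.
function countDistinctTracked(list, thresh, random = Math.random) {
  let p = 1;
  let seen = new Set();
  let peak = 0; // largest size `seen` ever reached
  for (const value of list) {
    seen.delete(value);
    if (random() < p) seen.add(value);
    peak = Math.max(peak, seen.size);
    if (seen.size === thresh) {
      // Every survivor must win one more coin flip to stay.
      seen = new Set([...seen].filter(() => random() < 0.5));
      p /= 2;
    }
  }
  return { estimate: seen.size / p, peak };
}

// 100,000 distinct values, each appearing twice.
const stream = [];
for (let i = 0; i < 100000; i++) stream.push(i, i);

console.log(countDistinctTracked(stream, 1000));
```

On a run like this, `peak` stays at or below `thresh` while `estimate` lands in the neighborhood of 100,000. The one caveat is that a halving round could, in principle, remove nothing; with a threshold of 1000 that has probability 2^-1000, but in that case this sketch would no longer trigger the size check and the set could keep growing.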

It’s not really clear to me if this algorithm is appropriate for real-world use. A comparison with HyperLogLog is notably absent from the paper. My immediate suspicion is that it’s not, really, since HyperLogLog has some additional nice properties (for instance, it distributes very well, due to sketches being mergeable) and it’s not clear to me whether they’re preserved here.

Asking that question sort of misses the point, though, since, as the authors emphasize, the appeal of this algorithm is its simplicity, and to me, the surprise of its existence: I actually had no idea it was possible to do efficient count-distinct without hashes, but it turns out it is!
