密度聚类dbscan_DBSCAN —基于密度的聚类方法的演练

密度聚类dbscan

The idea of having newer algorithms come into the picture doesn’t make the older ones ‘completely redundant’. British statistician, George E. P. Box had once quoted that, “All models are wrong, but some are useful”, meaning that no model is exact enough to certify as cent percent accurate. Reverse claims can only lead to the loss of generalization. The most accurate thing to do is to find the most approximate model.

出现新算法的想法并不能使旧算法“完全冗余”。 英国统计学家George EP Box曾经引述过: “所有模型都是错误的,但有些模型是有用的” ,这意味着没有任何一种模型能够精确到百分之一的精度。 反向主张只能导致泛化。 最准确的事情是找到最近似的模型。

Clustering is an unsupervised learning technique where the aim is to group similar objects together. We are virtually living in a world where our past and present choices have become a dataset that can be clustered to identify patterns in our searches, shopping carts, the books we read, etc such that the machine algorithm is sophisticated enough to recommend the things to us. It is fascinating that the algorithms know much more about us then we ourselves can recognize!

聚类是一种无监督的学习技术,其目的是将相似的对象分组在一起。 实际上,我们生活在一个世界中,过去和现在的选择已成为一个数据集,可以将其聚类以识别我们的搜索,购物车,阅读的书籍等中的模式,从而机器算法足够复杂,可以向您推荐事物我们。 令人着迷的是,这些算法对我们的了解更多,然后我们自己就能意识到!

As already discussed in the previous blog, K-means makes use of Euclidean distance as a metric to form the clusters. This leads to a variety of drawbacks as mentioned. Please refer to the blog to read about the K-means algorithm, implementation, and drawbacks: Clustering — Diving deep into K-means algorithm

如先前博客中已讨论的,K-means利用欧几里得距离作为度量来形成聚类。 如上所述,这导致了各种缺点。 请参阅博客,以了解有关K-means算法,实现和缺点的信息: 聚类-深入探讨K-means算法

The real-life data has outliers and is irregular in shape. K-means fails to address these important points and becomes unsuitable for arbitrary shaped, noisy data. In this blog, we are going to learn about an interesting density-based clustering approach — DBSCAN.

现实生活中的数据存在异常值,并且形状不规则。 K均值无法解决这些重要问题,因此不适用于任意形状的嘈杂数据。 在此博客中,我们将学习一种有趣的基于密度的聚类方法-DBSCAN。

应用程序基于密度的空间聚类— DBSCAN (Density-based spatial clustering of applications with noise — DBSCAN)

DBSCAN is a density-based clustering approach that separates regions with a high density of data points from the regions with a lower density of data points. Its fundamental definition is that the cluster is a contiguous region of dense data points separated from another such region by a region of the low density of data points. Unlike K-means clustering, the number of clusters is determined by the algorithm. Two important concepts are density reachability and density connectivity, which can be understood as follows:

DBSCAN是基于密度的聚类方法,可将数据点密度较高的区域与数据点密度较低的区域分开 它的基本定义是,群集是密集数据点的连续区域,该区域与另一个此类区域之间被数据点的低密度区域分隔开 。 与K均值聚类不同,聚类的数量由算法确定。 密度可达性密度连通 是两个重要的概念,可以理解如下:

“A point is considered to be density reachable to another point if it is situated within a particular distance range from it. It is the criteria for calling two points as neighbors. Similarly, if two points A and B are density reachable (neighbors), also B and C are density reachable (neighbors), then by chaining approach A and C belong to the same cluster. This concept is called density connectivity. By this approach, the algorithm performs cluster propagation.”

“如果一个点位于另一个点的特定距离范围内,则认为该点可以密度达到另一个点。 这是将两个点称为邻居的标准。 类似地,如果两个点A和B是密度可达的(邻居),则B和C也是密度可达的(邻居),则通过链接方法A和C属于同一群集。 这个概念称为密度连接。 通过这种方法,该算法执行集群传播。”

The key constructs of the DBSCAN algorithm that help it determine the ‘concept of density’ are as follows:

DBSCAN算法可帮助确定“密度概念”的关键结构如下:

Epsilon ε (measure): ε is the threshold radius distance which determines the neighborhood of a point. If a point is located at a distance less than or equal to ε from another point, it becomes its neighbor, that is, it becomes density reachable to it.

小量 ε(测量):ε 是确定点附近的阈值半径距离。 如果一个点与另一个点的距离小于或等于ε,则该点成为其相邻点,即它可以达到的密度。

Choice of ε: The choice of ε is made in a way that the clusters and the outlier data can be segregated perfectly. Too large ε value can cluster the entire data as one cluster and too small value can classify each point as noise. In layman terms, the average distance of each point from its k-nearest neighbors is determined, sorted, and plotted. The point of maximum change (the elbows bend) determines the optimal value of ε.

的选择 ε :选择ε时,可以将聚类和离群数据完美地分开。 太大的ε值会将整个数据聚类为一个聚类,而太小的ε值会将每个点归类为噪声。 用外行术语来说,每个点到其k个最近邻居的平均距离被确定,排序和绘制。 最大变化点(肘部弯曲)确定ε的最佳值。

Min points m (measure): It is a threshold number of points present in the ε distance of a data point that dictates the category of that data point. It is driven by the number of dimensions present.

最小点数m (小节):它是数据点的ε距离中存在的阈值点数,它决定了该数据点的类别。 它由当前尺寸的数量驱动。

Choice of Min points: Minimum value of Min points has to be 3. Larger density and dimensionality means larger value should be chosen. The formula to be used while assigning value to Min points is: Min points>= Dimensionality + 1

最小点数的选择:最小点数的最小值必须为3。较大的密度和维数表示应选择较大的值。 将值分配给“最小点”时要使用的公式为: 最小点> =维+ 1

Core points (data points): A point is a core point if it has at least m number of points within radii of ε distance from it.

核心点 (数据点):如果一个点在距其ε距离的半径内至少有m个点,则它是一个核心点。

Border points (data points): A point that doesn’t qualify as a core point but is a neighbor of a core point.

边界点 (数据点):不符合核心点要求但与核心点相邻的点。

Noise points (data points): An outlier point that doesn’t fulfill any of the above-given criteria.

噪声点 (数据点):不满足上述任何标准的异常点。

Image for post

Algorithm:

算法:

  1. Select a value for ε and m.

    εm选择一个值。

  2. Mark all points as outliers.

    将所有点标记为离群值。
  3. For each point, if at least m points are present within its ε distance range:

    对于每个点,如果在其ε距离范围内至少存在m个点:

  • Identify it as a core point and mark the point as visited.

    将其标识为核心点并将该点标记为已访问。
  • Assign the core point and its density reachable points in one cluster and remove them from the outlier list.

    在一个群集中分配核心点及其密度可达到的点,并将其从异常值列表中删除。

4. Check for the density connectivity of the clusters. If so, merge the clusters into one.

4.检查集群的密度连接。 如果是这样,请将群集合并为一个。

5. For points remaining in the outlier list, identify them as noise.

5.对于剩余在异常值列表中的点,将其标识为噪声。

Image for post
Example of cluster formation by DBSCAN
DBSCAN形成集群的示例

The time complexity of the DBSCAN lies between O(n log n) (best case scenario) to O(n²) (worst case), depending upon the indexing structure, ε, and m values chosen.

-O之间的DBSCAN位于(N log n)的 (最好的情况下)至O(N²)(最坏情况),取决于所选择索引结构 ,ε,和m值的时间复杂度。

Python code:

Python代码:

As a part of the scikit-learn module, below is the code of DBSCAN with some of the hyperparameters set to the default value:

作为scikit-learn模块的一部分,以下是DBSCAN的代码,其中一些超参数设置为默认值:

class sklearn.cluster.DBSCAN(eps=0.5, *, min_samples=5, metric='euclidean')

eps is the epsilon value as already explained.

如前所述,eps是epsilon值。

min_samples is the Min points value.

min_samples最低分值。

metric is the process by which distance is calculated in the algorithm. By default, it is Euclidean distance, other than that it can be any user-defined distance function or a ‘precomputed’ distance matrix.

metric是在算法中计算距离的过程。 默认情况下,它是欧几里得距离,除了可以是任何用户定义的距离函数或“预计算”距离矩阵。

There are some advanced hyperparameters which will be best discussed in future projects.

有一些高级超参数将在以后的项目中进行最佳讨论。

Drawbacks:

缺点:

Image for post
  1. For the large differences in densities and unequal density spacing between clusters, DBSCAN shows unimpressive results at times. At times, the dataset may require different ε and ‘Min points’ value, which is not possible with DBSCAN.

    对于群集之间的密度差异和不相等的密度间距,DBSCAN有时会显示令人印象深刻的结果。 有时,数据集可能需要不同的ε和“最小点”值,而DBSCAN则不可能。
  2. DBSCAN sometimes shows different results on each run for the same dataset. Although rarely so, but it has been termed as non-deterministic.

    对于同一数据集,DBSCAN有时每次运行都会显示不同的结果。 虽然很少这样,但是它被称为不确定性的。
  3. DBSCAN faces the curse of dimensionality. It doesn’t work as expected in high dimensional datasets.

    DBSCAN面临着维度的诅咒。 在高维数据集中无法正常工作。

To overcome these, other advanced algorithms have been designed which will be discussed in future blogs.

为了克服这些问题,已经设计了其他高级算法,这些算法将在以后的博客中讨论。

Stay tuned. Happy learning :)

敬请关注。 快乐学习:)

翻译自: https://medium.com/devtorq/dbscan-a-walkthrough-of-a-density-based-clustering-method-b5e74ca9fcfa

密度聚类dbscan

本文来自互联网用户投稿,该文观点仅代表作者本人,不代表本站立场。本站仅提供信息存储空间服务,不拥有所有权,不承担相关法律责任。如若转载,请注明出处:http://www.mzph.cn/news/391871.shtml

如若内容造成侵权/违法违规/事实不符,请联系多彩编程网进行投诉反馈email:809451989@qq.com,一经查实,立即删除!

相关文章

node aws 内存溢出_在AWS Elastic Beanstalk上运行生产Node应用程序的现实

node aws 内存溢出by Jared Nutt贾里德努特(Jared Nutt) 在AWS Elastic Beanstalk上运行生产Node应用程序的现实 (The reality of running a production Node app on AWS Elastic Beanstalk) 从在AWS的ELB平台上运行生产Node应用程序两年的经验教训 (Lessons learned from 2 y…

Day2-数据类型

数据类型与内置方法 数据类型 数字字符串列表字典元组集合字符串 1.用途 用来描述某个物体的特征:姓名,性别,爱好等 2.定义方式 变量名 字符串 如:name huazai 3.常用操作和内置方法 1.按索引取值:(只能取…

嵌套路由

父组件不能用精准匹配,否则只组件路由无法展示 转载于:https://www.cnblogs.com/dianzan/p/11308146.html

leetcode 992. K 个不同整数的子数组(滑动窗口)

给定一个正整数数组 A,如果 A 的某个子数组中不同整数的个数恰好为 K,则称 A 的这个连续、不一定独立的子数组为好子数组。 (例如,[1,2,3,1,2] 中有 3 个不同的整数:1,2,以及 3。) …

从完整的新手到通过TensorFlow开发人员证书考试

I recently graduated with a bachelor’s degree in Civil Engineering and was all set to start with a Master’s degree in Transportation Engineering this fall. Unfortunately, my plans got pushed to the winter term because of COVID-19. So as of January this y…

微信开发者平台如何编写代码_编写超级清晰易读的代码的初级开发者指南

微信开发者平台如何编写代码Writing code is one thing, but writing clean, readable code is another thing. But what is “clean code?” I’ve created this short clean code for beginners guide to help you on your way to mastering and understanding the art of c…

【转】PHP面试题总结

PHP面试总结 PHP基础1:变量的传值与引用。 2:变量的类型转换和判断类型方法。 3:php运算符优先级,一般是写出运算符的运算结果。 4:PHP中函数传参,闭包,判断输出的echo,print是不是函…

Winform控件WebBrowser与JS脚本交互

1)在c#中调用js函数 如果要传值,则可以定义object[]数组。 具体方法如下例子: 首先在js中定义被c#调用的方法: function Messageaa(message) { alert(message); } 在c#调用js方法Messageaa private void button1_Click(object …

从零开始撸一个Kotlin Demo

####前言 自从google将kotlin作为亲儿子后就想用它撸一管app玩玩,由于工作原因一直没时间下手,直到项目上线后才有了空余时间,期间又由于各种各样烦人的事断了一个月,现在终于开发完成项目分为服务器和客户端;服务器用…

移动平均线ma分析_使用动态移动平均线构建交互式库存量和价格分析图

移动平均线ma分析I decided to code out my own stock tracking chart despite a wide array of freely available tools that serve the same purpose. Why? Knowledge gain, it’s fun, and because I recognize that a simple project can generate many new ideas. Even t…

敏捷开发创始人_开发人员和技术创始人如何将他们的想法转化为UI设计

敏捷开发创始人by Simon McCade西蒙麦卡德(Simon McCade) 开发人员和技术创始人如何将他们的想法转化为UI设计 (How developers and tech founders can turn their ideas into UI design) Discover how to turn a great idea for a product or service into a beautiful UI de…

在ubuntu怎样修改默认的编码格式

ubuntu修改系统默认编码的方法是:1. 参考 /usr/share/i18n/SUPPORTED 编辑/var/lib/locales/supported.d/* gedit /var/lib/locales/supported.d/localgedit /var/lib/locales/supported.d/zh-hans如:more /var/lib/locales/supported.d/localzh_CN GB18…

JAVA中PO,BO,VO,DTO,POJO,Entity

https://my.oschina.net/liaodo/blog/2988512转载于:https://www.cnblogs.com/dianzan/p/11311217.html

【Lolttery】项目开发日志 (三)维护好一个项目好难

项目的各种配置开始出现混乱的现象了 在只有一个人开发的情况下也开始感受到维护一个项目的难度。 之前明明还好用的东西,转眼就各种莫名其妙的报错,完全不知道为什么。 今天一天的工作基本上就是整理各种配置。 再加上之前数据库设计出现了问题&#xf…

leetcode 567. 字符串的排列(滑动窗口)

给定两个字符串 s1 和 s2,写一个函数来判断 s2 是否包含 s1 的排列。 换句话说,第一个字符串的排列之一是第二个字符串的子串。 示例1: 输入: s1 “ab” s2 “eidbaooo” 输出: True 解释: s2 包含 s1 的排列之一 (“ba”). 解题思路 和s1每个字符…

静态变数和非静态变数_统计资料:了解变数

静态变数和非静态变数Statistics 101: Understanding the different type of variables.统计101:了解变量的不同类型。 As we enter the latter part of the year 2020, it is safe to say that companies utilize data to assist in making business decisions. F…

代码走查和代码审查_如何避免代码审查陷阱降低生产率

代码走查和代码审查Code reviewing is an engineering practice used by many high performing teams. And even though this software practice has many advantages, teams doing code reviews also encounter quite a few code review pitfalls.代码审查是许多高性能团队使用…

Zabbix3.2安装

一、环境 OS: CentOS7.0.1406 Zabbix版本: Zabbix-3.2 下载地址: http://repo.zabbix.com/zabbix/3.2/rhel/7/x86_64/zabbix-release-3.2-1.el7.noarch.rpm MySQL版本: 5.6.37 MySQL: http://repo.mysql.com/mysql-community-release-el7-5.noarch.r…

Warensoft Unity3D通信库使用向导4-SQL SERVER访问组件使用说明

Warensoft Unity3D通信库使用向导4-SQL SERVER访问组件使用说明 (作者:warensoft,有问题请联系warensoft163.com) 在前一节《warensoft unity3d通信库使用向导3-建立WarensoftDataService》中已经说明如何配置Warensoft Data Service,从本节开始,将说明…

01-gt;选中UITableViewCell后,Cell中的UILabel的背景颜色变成透明色

解决方案有两种方法一 -> 新建一个UILabel类, 继承UILabel, 然后重写 setBackgroundColor: 方法, 在这个方法里不做任何操作, 让UILabel的backgroundColor不发生改变.写在最后, 感谢参考的出处:不是谢志伟StackOverflow: UITableViewCell makes labels background clear whe…