DBSCAN: A Walkthrough of a Density-Based Clustering Method

The arrival of newer algorithms doesn’t make the older ones ‘completely redundant’. The British statistician George E. P. Box once said, “All models are wrong, but some are useful”, meaning that no model is exact enough to be certified as 100 percent accurate. Claims to the contrary only lead to a loss of generalization. The best one can do is find the closest approximate model.

Clustering is an unsupervised learning technique whose aim is to group similar objects together. We live in a world where our past and present choices have become a dataset that can be clustered to identify patterns in our searches, shopping carts, the books we read, and so on, so that an algorithm can recommend things to us. It is fascinating that these algorithms can know more about us than we recognize ourselves!

As discussed in the previous blog, K-means uses Euclidean distance as the metric to form clusters, which leads to several drawbacks. Please refer to that blog to read about the K-means algorithm, its implementation, and its drawbacks: Clustering — Diving deep into K-means algorithm

Real-life data has outliers and is irregular in shape. K-means fails to address these points and is unsuitable for arbitrarily shaped, noisy data. In this blog, we are going to learn about an interesting density-based clustering approach: DBSCAN.

Density-Based Spatial Clustering of Applications with Noise (DBSCAN)

DBSCAN is a density-based clustering approach that separates regions with a high density of data points from regions with a lower density. Its fundamental definition is that a cluster is a contiguous region of dense data points, separated from another such region by a region of low density. Unlike K-means clustering, the number of clusters is determined by the algorithm. Two important concepts are density reachability and density connectivity, which can be understood as follows:

“A point is considered to be density reachable from another point if it is situated within a particular distance range of it. This is the criterion for calling two points neighbors. Similarly, if two points A and B are density reachable (neighbors), and B and C are also density reachable (neighbors), then by chaining A and C belong to the same cluster. This concept is called density connectivity. By this approach, the algorithm performs cluster propagation.”

Note that the chain only extends through core points (defined below): a border point can join a cluster, but it cannot pull further points into it.

The key constructs of the DBSCAN algorithm that help it determine the ‘concept of density’ are as follows:

Epsilon ε (measure): ε is the threshold radius that determines the neighborhood of a point. If a point lies at a distance less than or equal to ε from another point, it becomes that point's neighbor, that is, it becomes density reachable from it.

Choice of ε: ε should be chosen so that the clusters and the outliers are separated as cleanly as possible. Too large a value can merge the entire dataset into one cluster, and too small a value can label every point as noise. In layman's terms, the average distance of each point from its k nearest neighbors is computed, sorted, and plotted. The point of maximum change (the elbow of the curve) gives a good value of ε. A common variant plots the distance to the k-th nearest neighbor instead of the average.
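The elbow heuristic above can be sketched in plain Python. This is only an illustration under assumptions: the helper name `k_distances`, the toy points, and the "largest jump" rule for locating the elbow are not from the original post, and in practice the curve is usually inspected visually on a plot.

```python
import math

def k_distances(points, k):
    """Return, sorted ascending, each point's mean distance to its k nearest neighbors."""
    result = []
    for i, p in enumerate(points):
        # Distances from p to every other point (the point itself is excluded).
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        result.append(sum(dists[:k]) / k)
    return sorted(result)

# Hypothetical toy data: two tight groups and one isolated point.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
          (10.0, 10.0), (10.0, 11.0), (11.0, 10.0),
          (50.0, 50.0)]

curve = k_distances(points, k=2)

# Crude elbow detection: the largest jump between consecutive sorted values.
jumps = [b - a for a, b in zip(curve, curve[1:])]
elbow = jumps.index(max(jumps))
eps_candidate = curve[elbow]  # the last value before the curve shoots up
```

Here the isolated point produces one very large value at the end of the sorted curve, so the candidate ε is the value just before that jump.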

Min points m (measure): The threshold number of points that must lie within ε distance of a data point; it dictates the category of that data point. Its value is driven by the number of dimensions in the data.

Choice of Min points: Min points should be at least 3. Denser and higher-dimensional data call for a larger value. The rule of thumb for assigning a value is: Min points >= Dimensionality + 1
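Taken together, the two rules above give a lower bound on Min points. A tiny sketch (the function name `min_points_lower_bound` is hypothetical, chosen only for this illustration):

```python
def min_points_lower_bound(n_dimensions):
    """Smallest Min points value allowed by the two rules in the text."""
    # Rule 1: never below 3. Rule 2: at least dimensionality + 1.
    return max(3, n_dimensions + 1)

bound_2d = min_points_lower_bound(2)   # 2-D data
bound_5d = min_points_lower_bound(5)   # 5-D data
```

This is only a floor; as noted above, denser data generally calls for a larger value.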

Core points (data points): A point is a core point if it has at least m points within a radius of ε from it.

Border points (data points): A point that doesn’t qualify as a core point but is a neighbor of a core point.

Noise points (data points): A point that is neither a core point nor a border point; it is treated as an outlier.
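The three categories can be illustrated with a short brute-force sketch. The function name `classify_points` and the toy data are assumptions for illustration. The sketch also assumes that a point counts toward its own neighborhood, which matches scikit-learn's convention for `min_samples`; the text above leaves this open.

```python
import math

def classify_points(points, eps, min_pts):
    """Label each point 'core', 'border', or 'noise'."""
    # Neighborhood of each point: indices within eps, counting the point itself.
    neighbors = [
        [j for j, q in enumerate(points) if math.dist(p, q) <= eps]
        for p in points
    ]
    is_core = [len(n) >= min_pts for n in neighbors]
    labels = []
    for i in range(len(points)):
        if is_core[i]:
            labels.append("core")
        elif any(is_core[j] for j in neighbors[i]):
            labels.append("border")   # not core, but next to a core point
        else:
            labels.append("noise")
    return labels

# Hypothetical toy data: a unit square, one point hanging off it, one far away.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0), (2.0, 1.0), (8.0, 8.0)]
labels = classify_points(points, eps=1.0, min_pts=3)
# labels == ['core', 'core', 'core', 'core', 'border', 'noise']
```

The point (2, 1) has only itself and (1, 1) within ε, so it is not a core point, but because (1, 1) is a core point it becomes a border point.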

Algorithm:

  1. Select values for ε and m.
  2. Mark all points as outliers.
  3. For each unvisited point, if at least m points are present within its ε distance range:
     • Identify it as a core point and mark it as visited.
     • Assign the core point and its density-reachable points to one cluster, and remove them from the outlier list.
  4. Check whether any clusters are density connected. If so, merge them into one.
  5. Identify the points remaining in the outlier list as noise.
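The steps above can be sketched from scratch in plain Python. This is a minimal illustration, not the reference implementation: the names `dbscan` and `NOISE` and the toy data are assumptions, neighbors are found by brute force (O(n²)), and a point is assumed to count toward its own neighborhood. Instead of forming clusters first and merging them afterwards, the sketch grows each cluster outward through its core points with a queue, which is a common formulation and has the same effect as the merge in step 4.

```python
import math
from collections import deque

NOISE = -1

def dbscan(points, eps, min_pts):
    """Return one cluster label per point; NOISE (-1) marks outliers."""
    n = len(points)
    neighbors = [
        [j for j in range(n) if math.dist(points[i], points[j]) <= eps]
        for i in range(n)
    ]
    labels = [NOISE] * n          # step 2: everything starts as an outlier
    visited = [False] * n
    cluster_id = 0
    for i in range(n):
        if visited[i] or len(neighbors[i]) < min_pts:
            continue              # already handled, or not a core point
        # step 3: i is an unvisited core point, so it seeds a new cluster
        visited[i] = True
        labels[i] = cluster_id
        queue = deque(neighbors[i])
        while queue:
            j = queue.popleft()
            if labels[j] == NOISE:
                labels[j] = cluster_id      # claim a core or border point
            if visited[j]:
                continue
            visited[j] = True
            if len(neighbors[j]) >= min_pts:
                queue.extend(neighbors[j])  # step 4: expand through core points
        cluster_id += 1
    return labels                 # step 5: unclaimed points keep the NOISE label

# Hypothetical toy data: two groups and one isolated point.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0), (2.0, 1.0),
          (10.0, 10.0), (10.0, 11.0), (11.0, 10.0),
          (30.0, 30.0)]
labels = dbscan(points, eps=1.0, min_pts=3)
# labels == [0, 0, 0, 0, 0, 1, 1, 1, -1]
```

Because a border point is claimed by the first cluster that reaches it, the label of a border point shared by two clusters depends on the order in which points are processed.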

Figure: Example of cluster formation by DBSCAN

The time complexity of DBSCAN ranges from O(n log n) in the best case to O(n²) in the worst case, depending on the indexing structure and the ε and m values chosen.

Python code:

DBSCAN is available in the scikit-learn module. Below is its class signature, with some of the hyperparameters shown at their default values:

class sklearn.cluster.DBSCAN(eps=0.5, *, min_samples=5, metric='euclidean')

eps is the epsilon value, as already explained.

min_samples is the Min points value. In scikit-learn, the count includes the point itself.

metric is the function used to calculate the distance between points. By default it is Euclidean distance; it can also be any user-defined distance function, or ‘precomputed’ if the input is already a distance matrix.
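As a minimal usage sketch of the class above: the toy array `X` and the chosen `eps` and `min_samples` values are assumptions for illustration, not values from the original post. The fitted attributes `labels_` and `core_sample_indices_` are part of scikit-learn's DBSCAN.

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Hypothetical toy data: two unit squares and one far-away point.
X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0],
              [10.0, 10.0], [10.0, 11.0], [11.0, 10.0], [11.0, 11.0],
              [30.0, 30.0]])

model = DBSCAN(eps=1.5, min_samples=3, metric="euclidean").fit(X)

labels = model.labels_                   # one label per row; -1 marks noise
core_idx = model.core_sample_indices_    # row indices of the core points
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
```

With ε = 1.5 every corner of a square sees the other three corners, so all eight square points are core points, and the isolated point is labeled -1.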

There are some advanced hyperparameters, which are best discussed in future projects.

Drawbacks:

  1. When clusters differ widely in density, or the spacing between them is uneven, DBSCAN can give poor results. Such a dataset may need different ε and ‘Min points’ values for different clusters, but DBSCAN applies a single pair of values to the whole dataset.
  2. DBSCAN is not fully deterministic. Core points and noise points are always labeled the same way, but a border point that lies within ε of core points from two clusters is assigned to whichever cluster reaches it first, so the result can change when the order of the data changes. This is rare in practice.
  3. DBSCAN suffers from the curse of dimensionality. In high-dimensional datasets, distances between points become less informative, so a suitable ε is hard to find and the algorithm does not work as expected.

To overcome these drawbacks, other advanced algorithms have been designed, which will be discussed in future blogs.

Stay tuned. Happy learning :)

Translated from: https://medium.com/devtorq/dbscan-a-walkthrough-of-a-density-based-clustering-method-b5e74ca9fcfa
