VC: Everything About Scatter Plots

Scatterplots are one of the most popular visualization techniques. Their purpose is to reveal clusters and correlations in pairs of variables. There are many variations of scatter plots, and we will look at some of them.

Strip Plots


Scatter plots in which one attribute is categorical are called "strip plots". If the points of each category are drawn on a single line they become hard to tell apart, so we spread (jitter) them slightly, as shown above, and we can also split the points according to a given label.
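
As a minimal sketch of a jittered strip plot (assuming seaborn and its bundled "tips" example dataset; any data frame with one categorical and one numeric column would do):

```python
import matplotlib.pyplot as plt
import seaborn as sns

# Example dataset bundled with seaborn (assumed available).
tips = sns.load_dataset("tips")

# jitter spreads the points of each category so they do not collapse onto a
# single vertical line; hue + dodge additionally split them by a given label.
sns.stripplot(data=tips, x="day", y="total_bill", hue="sex", jitter=0.2, dodge=True)
plt.show()
```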

Scatterplot Matrices (SPLOM)


[Figure: Scatterplot Matrices]

A SPLOM produces scatterplots for all pairs of variables and places them in a matrix; for p variables there are (p² − p)/2 unique scatterplots. The diagonal is usually filled with a KDE or a histogram. As you can see, the scatterplots are arranged in some order. Does the order matter? It cannot change the data, of course, but it can change how people perceive it.
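
A minimal SPLOM sketch, assuming seaborn's bundled "iris" dataset; pairplot draws every pairwise scatterplot and fills the diagonal with a KDE:

```python
import matplotlib.pyplot as plt
import seaborn as sns

iris = sns.load_dataset("iris")  # 4 numeric variables plus a species label

# Pairwise scatterplots for all variable pairs; KDEs on the diagonal.
sns.pairplot(iris, hue="species", diag_kind="kde")
plt.show()
```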

[Figure: Ordering matters, image taken from Peng et al. 2004]

Therefore, we need to consider the ordering. In his 2004 work, Peng suggests an ordering in which similar scatterplots are placed close to each other [Peng et al. 2004]. They distinguish between high-cardinality and low-cardinality dimensions (a dimension whose number of possible values exceeds the number of points counts as high cardinality) and sort the low-cardinality dimensions by their number of values. The ordering of the high-cardinality dimensions is rated based on their correlation, and the Pearson correlation coefficient is used for sorting.
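
As an illustrative sketch only (not Peng et al.'s exact procedure), the dimensions could be chained greedily by pairwise Pearson correlation so that similar scatterplot columns end up next to each other; the column names here are hypothetical:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Hypothetical data frame with four high-cardinality (continuous) dimensions.
df = pd.DataFrame(rng.normal(size=(200, 4)), columns=["a", "b", "c", "d"])

corr = df.corr(method="pearson").abs()  # pairwise Pearson correlations

# Greedy chaining: start anywhere, always append the most correlated
# unused dimension, so similar columns end up adjacent in the SPLOM.
order = ["a"]
while len(order) < len(corr):
    remaining = corr.columns.difference(order)
    order.append(corr.loc[order[-1], remaining].idxmax())
print(order)
```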

[Figure: Clutter measure]

All pairs of x, y scatterplots are evaluated with a clutter measure: it computes all correlations and compares them with each pair (x, y) of high-cardinality dimensions, and if the result is smaller than a threshold, that scatterplot is chosen as an important one. However, this takes a lot of computing power, since an exhaustive search over orderings is O(p² · p!). They therefore suggest random swapping: repeatedly swap two dimensions, keep the ordering with the smaller clutter measure, and iterate.
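
A rough sketch of such a random-swap search, with a placeholder cost function standing in for the actual clutter measure (which is not reproduced here); the correlation table is hypothetical:

```python
import random

def clutter(order, corr):
    """Placeholder cost: penalize adjacent dimensions with low |correlation|
    (a stand-in for the real clutter measure from Peng et al. 2004)."""
    return sum(1.0 - abs(corr[a][b]) for a, b in zip(order, order[1:]))

def random_swap_order(dims, corr, iters=1000, seed=0):
    """Repeatedly swap two dimensions and keep the cheaper ordering,
    instead of searching all p! orderings exhaustively."""
    rng = random.Random(seed)
    best, best_cost = list(dims), clutter(dims, corr)
    for _ in range(iters):
        cand = list(best)
        i, j = rng.sample(range(len(cand)), 2)   # pick two positions to swap
        cand[i], cand[j] = cand[j], cand[i]
        cost = clutter(cand, corr)
        if cost < best_cost:                     # keep improvements only
            best, best_cost = cand, cost
    return best

# Hypothetical pairwise correlation table as a nested dict.
corr = {"a": {"a": 1, "b": .9, "c": .1, "d": .2},
        "b": {"a": .9, "b": 1, "c": .3, "d": .1},
        "c": {"a": .1, "b": .3, "c": 1, "d": .8},
        "d": {"a": .2, "b": .1, "c": .8, "d": 1}}
print(random_swap_order(["a", "c", "b", "d"], corr))
```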

Selecting Good Views


Correlation alone is not enough to choose good scatterplots when we are trying to find clusters based on a given label (or on labels obtained from clustering).

[Figure: Histogram and DSC, Sips et al. 2009]

If the left graph had no labels, you could pick either the x-axis projection or the y-axis projection, since there is little difference between them; but labels are given, so we can see that the x-axis projection is the better choice. Distance Consistency (DSC) is introduced from this point of view: it measures how good a scatterplot is, and the better the class separation, the better the scatterplot.

[Figure: Cluster center]
[Figure: The equation to calculate DSC]

First of all, we calculate the center of each cluster and measure the distance between each data point and each cluster center. If a point's distance to its own cluster center is shorter than its distance to every other cluster center, we increase a count; the count is then normalized by the number of data points and multiplied by 100. This is similar in spirit to the k-means clustering method. Since it only considers distances to centroids, it is limited in where it can be applied.
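
A minimal Distance Consistency sketch following that description (class centroids as cluster centers; the score is the percentage of points whose nearest centroid is the centroid of their own class):

```python
import numpy as np

def dsc(points, labels):
    """Distance Consistency sketch: percentage of points whose nearest
    class centroid is the centroid of their own class."""
    points = np.asarray(points, dtype=float)
    labels = np.asarray(labels)
    classes = np.unique(labels)
    centroids = np.stack([points[labels == c].mean(axis=0) for c in classes])
    # Distances from every point to every class centroid, shape (n, k).
    d = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
    nearest = classes[np.argmin(d, axis=1)]
    return 100.0 * np.mean(nearest == labels)

# Tiny usage example: two well-separated 2D clusters score 100.
pts = np.array([[0, 0], [0.1, 0.2], [5, 5], [5.2, 4.9]])
lab = np.array([0, 0, 1, 1])
print(dsc(pts, lab))  # -> 100.0
```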

Distribution Consistency (DC)


DC can be seen as an upgraded version of DSC. It scores a view by penalizing local entropy in high-density regions. DSC implicitly assumes particular cluster shapes, whereas DC makes no assumption about the shapes.

[Figure: Example of DC, Sips et al. 2009]
[Figure: Entropy]

This equation comes from information theory and measures how much information (uncertainty) there is in a specific distribution. The data should first be estimated with a KDE before applying the entropy function; p(x, y) denotes the KDE-estimated class distribution at (x, y), so the local entropy is H(x, y) = −Σ_c p_c(x, y) log₂ p_c(x, y). The entropy is larger when the region we measure is mixed with other clusters (the minus sign keeps it non-negative): its minimum is 0 for a pure region and its maximum is log₂|C| for a fully mixed one.
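
A tiny self-contained sketch of this local entropy, taking the per-class probabilities p_c(x, y) at one location as input:

```python
import numpy as np

def local_entropy(p):
    """Entropy of a per-class probability vector p (assumed to sum to 1)."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]                      # treat 0 * log(0) as 0
    return -np.sum(p * np.log2(p))

print(local_entropy([1.0, 0.0]))      # pure region  -> 0.0
print(local_entropy([0.5, 0.5]))      # fully mixed  -> 1.0 == log2(|C|) for |C| = 2
```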

[Figure: Normalization function]
[Figure: DC score function]

We compute the entropy from the KDE, but we do not want to weight every region equally, because many regions are almost empty; denser regions should count more. Finally, we normalize the result, which gives the DC score, and we can then select scatterplots using a threshold of our choosing.
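
A rough, density-weighted sketch in the spirit of these descriptions — an illustrative approximation, not the exact formulation from Sips et al. 2009 — where per-class KDEs give the local class probabilities on a grid, the local entropy is weighted by the total density, and the result is mapped to [0, 100]:

```python
import numpy as np
from scipy.stats import gaussian_kde

def dc_score(points, labels, grid_size=50):
    """Illustrative DC-style score: 100 * (1 - density-weighted mean of the
    normalized local class entropy). Assumes >= 2 classes, each with enough
    points for a KDE. Not the exact Sips et al. formula."""
    points = np.asarray(points, dtype=float)
    labels = np.asarray(labels)
    classes = np.unique(labels)

    # Evaluation grid over the bounding box of the data.
    xs = np.linspace(points[:, 0].min(), points[:, 0].max(), grid_size)
    ys = np.linspace(points[:, 1].min(), points[:, 1].max(), grid_size)
    gx, gy = np.meshgrid(xs, ys)
    grid = np.vstack([gx.ravel(), gy.ravel()])

    # One KDE per class, weighted by the class prior, evaluated on the grid.
    dens = np.stack([
        gaussian_kde(points[labels == c].T)(grid) * np.mean(labels == c)
        for c in classes
    ])
    total = dens.sum(axis=0) + 1e-12
    p = dens / total                            # local class probabilities

    entropy = -(p * np.log2(p + 1e-12)).sum(axis=0)
    entropy /= np.log2(len(classes))            # normalize to [0, 1]

    weights = total / total.sum()               # denser cells count more
    return 100.0 * (1.0 - np.sum(weights * entropy))

# Usage: two well-separated blobs should score close to 100.
rng = np.random.default_rng(0)
pts = np.vstack([rng.normal(0, 0.3, size=(200, 2)), rng.normal(3, 0.3, size=(200, 2))])
lab = np.repeat([0, 1], 200)
print(dc_score(pts, lab))
```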

[Figure: WHO example of HIV risk groups]

This dataset is from the WHO: 194 countries, 159 attributes, and 6 HIV risk groups. Focusing on views with DC > 80 eliminates 97% of the plots, which makes this a highly effective way to narrow down the views worth inspecting.

Besides these cluster-oriented methods, there are many measures that target other specific patterns, e.g. the fraction of outliers, sparsity, convexity, etc.; see [Wilkinson et al. 2006]. PCA can also be used as an alternative way to group similar plots together.

SPLOM Navigation


[Figure: The 3D transition between neighboring views, Elmqvist et al. 2008]

Since each plot in a SPLOM shares one axis with its neighboring plots, the transition between neighboring views can be shown as a projection through 3D space.

The limitation of scatterplots: Overdraw


[Figure: Overdraw]
[Figure: KDE solution]

Too many data points lead to overdraw. We can address this with a KDE, but then we can no longer see individual points. The second problem is high-dimensional data, which produces too many scatterplots; we have already discussed solutions to that problem, so now let us look at the first one.
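
A minimal sketch of the trade-off, assuming numpy, matplotlib and seaborn: the same 2D data drawn as a raw scatterplot (points pile on top of each other) and as a KDE (overdraw is gone, but so are the individual points):

```python
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

rng = np.random.default_rng(0)
x, y = rng.multivariate_normal([0, 0], [[1, 0.6], [0.6, 1]], size=10_000).T

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
ax1.scatter(x, y, s=2)                      # overdraw: points overlap heavily
ax1.set_title("Raw scatterplot (overdraw)")
sns.kdeplot(x=x, y=y, fill=True, ax=ax2)    # density view: no individual points
ax2.set_title("KDE of the same data")
plt.show()
```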

Splatterplots


[Figure: Splatterplots, Mayorga/Gleicher 2013]

Splatterplots combine KDE and scatterplots: high-density regions are represented by a colored area, while points in low-density regions are drawn individually. We need to choose a proper kernel width for the KDE; splatterplots define the kernel width in screen space, i.e. in terms of how much of the unit screen space the data points cover. However, the density threshold still has to be chosen by ourselves.
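
A rough splatterplot-style sketch under that description, not the Mayorga/Gleicher implementation: the kernel width here comes from scipy's default rule in data space rather than screen space, and the density threshold is hand-picked:

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import gaussian_kde

rng = np.random.default_rng(1)
pts = rng.normal(0, 1, size=(5_000, 2))

kde = gaussian_kde(pts.T)                      # bandwidth via Scott's rule (data space)
dens_at_points = kde(pts.T)
threshold = np.quantile(dens_at_points, 0.5)   # hand-picked threshold (assumption)

# Shade the high-density region on a grid.
xs = np.linspace(-4, 4, 150)
ys = np.linspace(-4, 4, 150)
gx, gy = np.meshgrid(xs, ys)
gd = kde(np.vstack([gx.ravel(), gy.ravel()])).reshape(gx.shape)
plt.contourf(gx, gy, gd, levels=[threshold, gd.max()], colors=["#9ecae1"])

# Draw only the points that fall in low-density regions individually.
low = pts[dens_at_points < threshold]
plt.scatter(low[:, 0], low[:, 1], s=3, color="#3182bd")
plt.title("Splatterplot-style view (sketch)")
plt.show()
```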

[Figure: Zoom of splatterplots, Mayorga/Gleicher 2013]

If clusters overlap, the colors matter. High luminance and saturation can cause the misperception that a mixed region is a separate cluster of its own. Therefore, we reduce the saturation and luminance there to indicate that the clusters are mixed.

This post was published on 9/2/2020.

Translated from: https://medium.com/@jeheonpark93/vc-everything-about-scatter-plots-467f80aec77c
