K-means algorithm, bisecting k-means algorithm
Have you ever seen a Caribbean reef? Well, if you haven’t, prepare yourself.
Today, we will be answering a question that, at face value, appears quite simple: “What does a Caribbean reef look like?” However, this question can be decomposed into many complex layers. So, to avoid ambiguity, let’s refine the question to: “What are the non-mobile components of a Caribbean reef, and how are they related?”
That seems reasonable; we’ll have to look at fish another day.
Now we’re not going to roll out beautiful images of underwater cities teeming with diversity. Instead, we have bar charts. Without further ado, let’s dive in.
What Makes Up a Typical Reef?
To start, we have developed a baseline graph (Figure 1) of the components of all Caribbean reefs. Here we have the median percent cover for nine substrate types. Now, if you haven’t conducted a scuba transect before, it may be helpful to break down that sentence. First, percent cover is how coral reef composition is measured: from a bird’s-eye view, what percent of the sea floor is hard coral, sponge, rock, etc. Second, substrate types are broad categories of sea floor, such as silt or sand. If you’re curious about the sampling methods or specific substrate definitions, check this out.
Ok, so in Figure 1 we’re looking at the median value for each of the nine substrate types. For example, in the Hard Coral column, we can see that hard coral’s median percent cover is roughly 17%. Good to know.
Diving deeper into the chart, it appears that most Caribbean reefs are primarily composed of four substrate types: rock, hard coral, nutrient indicator algae (NI Algae), and sand. Together, these four categories account for 91% of the sum of the median values. On the other hand, recently killed coral (RK Coral) and silt both have median values of 0, so they’re relatively rare.
We have learned that Caribbean reefs are rocky and sandy. Lovely.
But here’s an alarming analogy: the average number of children per US family is 1.93. If we take that number to be representative of the data, we might conclude that most families have 1.93 children, which I find hard to believe. Even worse, we have no understanding of the underlying distribution that led to an average of 1.93. There could be one family with 184 children, 9 families with one child, and 90 families with none. Instead, it would be useful to see if there are common counts for the number of kids per family.
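Here is a quick sketch of that point in Python. The data are made up: 100 hypothetical families (one with 184 children, nine with one child, ninety with none), chosen only so that the mean comes out to 1.93. None of these numbers come from the Statista source.

```python
from collections import Counter
from statistics import mean, median

# Hypothetical families, invented for illustration only.
children = [184] + [1] * 9 + [0] * 90

avg = mean(children)                           # the headline "average"
mid = median(children)                         # the middle family
most_common = Counter(children).most_common(1)[0]  # (count of kids, number of families)

print(avg, mid, most_common)
```

The mean says "about two kids", while the median and the most common count both say "none". The same mismatch can hide in a chart of summary values per substrate.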
K-Means Demo
Applying this logic to reef composition, we will explore whether there are groups of coral reefs using the above substrate categories. This is where unsupervised classification comes into play. Unsupervised algorithms fit data where we don’t know the “correct” answer, and one of the simplest of these methods is the k-means algorithm.
Without getting too technical, k-means attempts to split data into k clusters. The algorithm does this by minimizing the total squared distance from the center of each cluster (the cluster mean) to all points in that cluster. And because of this simple fitting criterion, it’s really easy to interpret. So let’s see an example…
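For readers who want to see the mechanics, here is a minimal sketch of the standard k-means loop (Lloyd’s algorithm) in plain Python. It is not the code behind the figures in this post. The function name `kmeans`, the random starting centers, and the six (hard coral %, NI algae %) points are all assumptions made up for illustration.

```python
import math
import random

def kmeans(points, k, n_iter=100, seed=0):
    """Lloyd's algorithm: alternate an assignment step and a mean-update step."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # start from k randomly chosen data points
    for _ in range(n_iter):
        # Assignment step: each point joins its nearest center.
        labels = [min(range(k), key=lambda j: math.dist(p, centers[j]))
                  for p in points]
        # Update step: each center moves to the mean of its assigned points.
        new_centers = []
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                new_centers.append(tuple(sum(c) / len(members) for c in zip(*members)))
            else:
                new_centers.append(centers[j])  # leave an empty cluster's center in place
        if new_centers == centers:  # nothing moved, so the fit has converged
            break
        centers = new_centers
    return centers, labels

# Made-up (hard coral %, NI algae %) pairs: three coral-heavy, three algae-heavy.
points = [(30, 5), (28, 8), (35, 4), (5, 40), (8, 45), (3, 38)]
centers, labels = kmeans(points, k=2)
print(centers)
print(labels)
```

Each center is just a tuple of means, one number per substrate, which is why the result stays easy to read even when it can no longer be plotted.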
In Figure 2 we have created two clusters (k=2 in this case) using two substrate categories: hard coral and nutrient indicator algae. As you can see, there appears to be a clear divide between these two categories. But, let’s not get into interpretation quite yet.
Instead, let’s consider the case where we add another variable. Here, the k-means algorithm would categorize each point using three dimensions instead of two. But as you increase the number of dimensions, you lose the ability to visualize; it’s pretty hard to think in five or eight dimensions. However, we can still see where the cluster centers are numerically located in hyperspace.
Now that we have a basic understanding of what k-means does, let’s move on to the interesting graphs.
Top 4 Substrate Types (k=3)
In Figure 3 (below) we have fit three clusters (k=3) using the four most prevalent substrate types. Each bar represents a substrate category. The height of each bar represents the difference between the cluster mean and the total mean for that given substrate. Blue bars correspond to a cluster mean greater than the entire category’s mean and, conversely, red bars correspond to a cluster mean less than the entire category’s mean.
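The bar heights described above are simple to compute once every reef has a cluster label. The sketch below uses six invented reefs and hand-assigned labels (standing in for the output of a k=3 fit); the names `reefs`, `labels`, and `deviations` are hypothetical and the numbers are not from the Reef Check data.

```python
substrates = ["rock", "hard_coral", "ni_algae", "sand"]

# Made-up percent cover per reef, in the column order given by `substrates`.
reefs = [
    [60, 15, 5, 20],   # rock-heavy
    [55, 20, 5, 20],
    [20, 25, 5, 50],   # sand-heavy
    [25, 20, 5, 50],
    [20, 10, 55, 15],  # algae-heavy
    [25, 10, 50, 15],
]
labels = [0, 0, 1, 1, 2, 2]  # pretend cluster assignments from a k=3 fit

def column_means(rows):
    """Mean of each column (substrate) across the given rows (reefs)."""
    return [sum(col) / len(rows) for col in zip(*rows)]

overall = column_means(reefs)
deviations = {}
for cluster in sorted(set(labels)):
    members = [row for row, lab in zip(reefs, labels) if lab == cluster]
    deviations[cluster] = {
        name: cluster_mean - total_mean
        for name, cluster_mean, total_mean
        in zip(substrates, column_means(members), overall)
    }

# Bar height is the deviation; color follows its sign.
for cluster, bars in deviations.items():
    for name, height in bars.items():
        color = "blue" if height > 0 else "red"
        print(f"cluster {cluster}  {name:<10} {height:+6.2f}  {color}")
```

Plotting deviations rather than raw means makes a low-cover substrate visible when it shifts a lot between clusters.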
When classifying Caribbean reefs into three clusters, there appear to be sensible groupings: sand-dominated, rock-dominated, and algae-dominated. Interestingly, hard coral showed relatively little change even though it was the second most abundant substrate category. Conversely, nutrient indicator algae, which is often found on degraded reefs, had an extremely high signal relative to its abundance.
We can also observe that sand-dominated reefs allowed for the highest quantity of hard coral at roughly 10 percentage points more than the total data average. Rock-dominated reefs were net positive but had little impact on hard corals. And finally, as most people would expect, the evil nutrient indicator algae appears to have a fairly strong negative impact on all other substrate types.
Ok, we’re starting to get somewhere. Now let’s increase the number of substrate types by including all categories that had a median value greater than zero: only silt and recently killed coral were not included.
Non-Zero-Median Substrate Types (k=3)
In Figure 4 it appears the categories we found above hold steady. Sand/rubble-dominated reefs seem to support the most life, with above-average values in hard coral, soft coral, and sponge. Rocky reefs also exhibit life-supporting ability, although less than their sandy counterparts. And finally, nutrient indicator algae reefs show below-average percent cover in all other substrate types observed.
Now you might be wondering what the deal is with NI Algae. Well, nutrient indicator algae are often found on degraded reefs because they thrive in waters with elevated nutrient levels, such as nitrogen and phosphorus; Reef Check added this category to monitor the infamous algal blooms. Conversely, these high levels of nutrients can be harmful to corals. Thus, we would expect to see an inverse relationship between nutrient indicator algae and the other living substrate types, namely sponges, soft corals, and hard corals.
This stuff is pretty cool.
Fitting Using the Non-Zero Substrate Values (k=4)
In our final chart, we will try increasing the number of clusters to four because who’s to say there are only three types of Caribbean reefs? Well, technically there are statistical methods to show reasonable values that k can take. In this case the elbow method was implemented and three to five clusters were deemed sensible.
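As a rough illustration of the elbow method, the sketch below fits k-means for several values of k and records the within-cluster sum of squared distances (often called inertia). The elbow is the k after which adding clusters stops helping much. Everything here is an assumption for illustration: the twelve toy points form three tight groups, the helper names are hypothetical, and the deterministic farthest-first start is a simplification rather than what the original analysis used.

```python
import math

def farthest_first(points, k):
    """Deterministic start: repeatedly add the point farthest from all chosen centers."""
    centers = [points[0]]
    while len(centers) < k:
        centers.append(max(points, key=lambda p: min(math.dist(p, c) for c in centers)))
    return centers

def kmeans_inertia(points, k, n_iter=100):
    """Run Lloyd's algorithm; return the within-cluster sum of squared distances."""
    centers = farthest_first(points, k)
    for _ in range(n_iter):
        labels = [min(range(k), key=lambda j: math.dist(p, centers[j])) for p in points]
        new_centers = []
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            new_centers.append(
                tuple(sum(c) / len(members) for c in zip(*members)) if members else centers[j]
            )
        if new_centers == centers:  # converged
            break
        centers = new_centers
    return sum(math.dist(p, centers[lab]) ** 2 for p, lab in zip(points, labels))

# Toy data: three tight groups of four points each (made up, not reef data).
points = [
    (0, 0), (1, 0), (0, 1), (1, 1),
    (10, 0), (11, 0), (10, 1), (11, 1),
    (5, 9), (6, 9), (5, 10), (6, 10),
]
inertias = {k: kmeans_inertia(points, k) for k in range(1, 5)}
for k, value in inertias.items():
    print(k, round(value, 2))
```

On this toy data the inertia falls steeply up to k=3 and barely moves afterward, so the elbow sits at three. Real survey data rarely bends that sharply, which is why a range such as three to five can come out as sensible.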
As shown in Figure 5, and as expected, a fourth category has emerged. Boasting extremely high values of hard and soft corals, this coral-dominated reef type appears to be the “healthiest” of the four.
Now why did increasing the number of clusters suddenly create this magical healthy reef category? Well, with only three clusters, the high levels of hard and soft corals were lumped into the sand-dominated and rock-dominated classifications. By allowing for a fourth category, the data could be subset more cleanly.
In a similar vein, why can’t we conclude that there are five types of reefs? To answer your outstanding question, k-means with k=5 was plotted; however, the categories created were not intuitive. Moreover, because four central substrate categories compose 91% of the median total, limiting to four clusters is intuitive.
Ok, final question: how can we tell whether three or four clusters is better? Another outstanding question, but unfortunately there isn’t a clear answer.
From an ecological perspective, there is no reason why rock and sand-dominated reefs can’t support corals and sponges, which argues for k=3. It’s also simpler. However, by creating four clusters we can develop clear-cut classifications that appear to correspond to health, which argues for k=4. Those categories are:
- High health: coral-dominated
- Medium health: sand/rubble-dominated, rock-dominated
- Low health: algae-dominated
As with many applied statistics problems, humans have to make judgement calls based on subject-matter knowledge. Here, there are good arguments for both k=3 and k=4.
Conclusion
I’m glad you now understand why bar charts are superior to pretty pictures. Even though you have no idea what a Caribbean reef looks like, you have a better understanding of what makes up a Caribbean reef (which is pretty cool).
What else can we conclude?
- Caribbean reefs tend to be dominated by sand, rock, hard coral, and nutrient indicator algae. However, ratios differ greatly at the tails of the distributions.
- One of the most consistent reef classifications was algae-dominated reefs. Algal blooms tend to occur in areas with high levels of sunlight, nutrients, and CO2 (a process called eutrophication), so from an ecological standpoint, it makes sense that coral cover would have an inverse relationship with algae. That being said, further research is required, specifically a species breakdown of the NI algae.
- All classifications that do not include nutrient indicator algae have the ability to support coral. That being said, sand-dominated reefs show a higher “life capacity” than rock-dominated reefs.
Got any other ideas?
Sources
Algae can function as indicators of water pollution. (n.d.). Retrieved August 21, 2020, from http://www.walpa.org/waterline/june-2012/algae-can-function-as-indicators-of-water-pollution/
Barott, K. L., Rodriguez-Mueller, B., Youle, M., Marhaver, K. L., Vermeij, M. J., Smith, J. E., & Rohwer, F. L. (2011). Microbial to reef scale interactions between the reef-building coral Montastraea annularis and benthic algae. Proceedings of the Royal Society B: Biological Sciences, 279(1733), 1655–1664. doi:10.1098/rspb.2011.2155
Duffin, P. (2020, January 13). Average number of own children per family U.S. Retrieved August 20, 2020, from https://www.statista.com/statistics/718084/average-number-of-own-children-per-family/
The data were collected by Reef Check, a coral conservation non-profit that trains volunteer divers to collect marine data. There were 1576 unique entries for the Caribbean ranging from 1997-05-24 to 2019-08-24. The date of the dive was not taken into account; however, in future iterations it would be interesting to see how these cluster centers change over time. The only modification to the traditional k-means algorithm was the inclusion of weights that correspond to the median percent cover of each substrate category.
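The post does not say how those weights enter the algorithm. One common way to weight features in k-means, shown here purely as an assumption, is a weighted squared Euclidean distance, which is equivalent to multiplying each feature by the square root of its weight and then running ordinary k-means. The weights, the reef, and the center below are hypothetical numbers.

```python
import math

def weighted_sq_dist(x, c, weights):
    """Squared Euclidean distance with a per-feature weight."""
    return sum(w * (a - b) ** 2 for w, a, b in zip(weights, x, c))

def scale(x, weights):
    """Rescale features so plain Euclidean distance matches the weighted one."""
    return tuple(math.sqrt(w) * a for w, a in zip(weights, x))

weights = (0.30, 0.17, 0.25)   # hypothetical per-substrate weights
reef = (40.0, 20.0, 10.0)      # hypothetical percent cover for one reef
center = (30.0, 15.0, 25.0)    # hypothetical cluster center

direct = weighted_sq_dist(reef, center, weights)
via_scaling = math.dist(scale(reef, weights), scale(center, weights)) ** 2
print(direct, via_scaling)
```

Because the two numbers agree, a weighted fit can be run with any standard k-means routine by scaling the columns first. Cluster centers then need to be divided by the same square-root factors to read them back in percent cover.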
Here is the code.
Note: These are my findings. If you would like to contact me, leave a message here. All criticisms are welcome.
Original article: https://medium.com/data-diving/classification-of-caribbean-coral-reefs-using-k-means-51a66997a989