Review, IEEE/ACM TCBB 2023: Comparison of Methods for Biological Sequence Clustering

Wei, Ze-Gang, et al. "Comparison of methods for biological sequence clustering." IEEE/ACM Transactions on Computational Biology and Bioinformatics (2023). https://ieeexplore.ieee.org/document/10066180

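Citations: 1
Background: Advances in sequencing technology have greatly accelerated genomics research and produced massive amounts of sequencing data. Cluster analysis is a powerful way to study and explore large-scale sequence data, and many clustering methods have been developed over the past decade. Although numerous comparison studies have been published, the authors note two main limitations: those studies compare only traditional alignment-based clustering methods, and their evaluation metrics rely heavily on labeled sequence data.
Significance: Sequence clustering helps remove redundant sequences from databases.
Author information: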

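I. Traditional sequence clustering methods

Traditional methods: based on a hierarchical strategy; clustering requires pairwise alignment of the sequences.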

1. mothur

[42] P. D. Schloss et al., “Introducing mothur: Open-source, platform-independent, community-supported software for describing and comparing microbial communities,” Appl. Environ. Microbiol., vol. 75, no. 23, pp. 7537–7541, 2009.

2. ESPRIT

[43] Y. Sun et al., “ESPRIT: Estimating species richness using large collections of 16S rRNA pyrosequences,” Nucleic Acids Res., vol. 37, no. 10, pp. e76–e76, 2009.

3. HPC-CLUST

[44] M. Rodrigues, J. F., and C. von Mering, “HPC-CLUST: Distributed hierarchical clustering for large sets of nucleotide sequences,” Bioinformatics, vol. 30, no. 2, pp. 287–288, 2013.

4. mcClust

[45] Q. Wang et al., “Naive Bayesian classifier for rapid assignment of rRNA sequences into the new bacterial taxonomy,” Appl. Environ. Microbiol., vol. 73, no. 16, pp. 5261–5267, 2007.

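II. Modern large-scale sequence clustering methods

1. CD-HIT: applies a greedy incremental strategy

It uses a statistical k-mer filter (a k-mer is a subsequence of fixed length k) to avoid unnecessary pairwise sequence alignments.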
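The filtering idea can be illustrated with a minimal sketch. It is not CD-HIT's actual code: the function names are hypothetical, and the threshold uses a simple counting bound (one substitution can destroy at most k overlapping k-mers), assuming equal-length sequences that differ only by substitutions.

```python
from collections import Counter


def kmer_counts(seq: str, k: int) -> Counter:
    """Count every overlapping k-mer in seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def shared_kmers(a: str, b: str, k: int) -> int:
    """Number of k-mers common to a and b, counted with multiplicity."""
    common = kmer_counts(a, k) & kmer_counts(b, k)  # multiset intersection
    return sum(common.values())


def min_shared_bound(length: int, k: int, max_mismatches: int) -> int:
    """Lower bound on shared k-mers when at most max_mismatches
    substitutions separate two sequences of the given length
    (assumption: each substitution destroys at most k k-mers)."""
    return max(0, length - k + 1 - k * max_mismatches)


def passes_filter(a: str, b: str, k: int, max_mismatches: int) -> bool:
    """True if the pair shares enough k-mers to be worth aligning."""
    bound = min_shared_bound(min(len(a), len(b)), k, max_mismatches)
    return shared_kmers(a, b, k) >= bound
```

A pair that fails the filter cannot satisfy the mismatch budget under the stated assumption, so the expensive alignment is skipped for it; a pair that passes still has to be aligned to confirm its similarity.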

[46] L. Fu et al., “CD-HIT: Accelerated for clustering the next-generation sequencing data,” Bioinformatics, vol. 28, no. 23, pp. 3150–3152, 2012.

[47] Y. Huang et al., “CD-HIT Suite: A web server for clustering and comparing biological sequences,” Bioinformatics, vol. 26, no. 5, pp. 680–682, 2010.

[48] W. Li and A. Godzik, “Cd-hit: A fast program for clustering and comparing large sets of protein or nucleotide sequences,” Bioinformatics, vol. 22, no. 13, pp. 1658–1659, 2006.

2. UCLUST: uses the greedy search algorithm of USEARCH

It applies a k-mer filter to skip alignments of low-similarity sequence pairs.

[49] R. C. Edgar, “Search and clustering orders of magnitude faster than BLAST,” Bioinformatics, vol. 26, no. 19, pp. 2460–2461, 2010.

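3. VSEARCH: an alternative to UCLUST

VSEARCH is a free, 64-bit, open-source tool for sequence clustering. It uses a fast k-mer-based heuristic (similar to the strategy in UCLUST) to detect similar sequences efficiently. VSEARCH implements most of the UCLUST functions for analyzing biological sequences, such as sequence sorting and dereplication. It is therefore worthwhile to evaluate how VSEARCH and UCLUST compare on sequence clustering.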

[50] T. Rognes et al., “VSEARCH: A versatile open source tool for metagenomics,” PeerJ, vol. 4, 2016, Art. no. e2584.

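4. DBH: based on the de Bruijn (DB) graph

To overcome a key problem of traditional heuristic clustering algorithms, namely sensitivity to seed selection, and to reduce the computational burden for large-scale 16S rRNA sequences, the authors developed a de Bruijn graph-based heuristic clustering method.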

[51] Z.-G. Wei and S.-W. Zhang, “DBH: A de Bruijn graph-based heuristic method for clustering large-scale 16S rRNA sequences into OTUs,” J. Theor. Biol., vol. 425, pp. 80–87, 2017.
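For readers unfamiliar with the data structure, here is a minimal sketch of a generic de Bruijn graph built from a set of sequences. It only shows the graph construction (nodes are (k-1)-mers, each k-mer contributes one edge); how DBH uses the graph to pick seeds is not described in these notes and is not reproduced here. The function name is hypothetical.

```python
from collections import defaultdict


def de_bruijn_graph(seqs, k):
    """Build a de Bruijn graph: each k-mer adds an edge from its
    (k-1)-length prefix to its (k-1)-length suffix."""
    graph = defaultdict(set)
    for seq in seqs:
        for i in range(len(seq) - k + 1):
            kmer = seq[i:i + k]
            graph[kmer[:-1]].add(kmer[1:])
    return dict(graph)


# Two sequences that share a prefix and then diverge create a branch.
g = de_bruijn_graph(["ACGT", "ACGA"], k=3)
```

In this example the node `CG` has two outgoing edges (`GT` and `GA`), which is where the two sequences diverge.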

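5. edClust: based on the Edlib library

edClust groups similar sequences using Edlib, a C/C++ library that provides fast, exact semi-global pairwise sequence alignment. edClust is also a heuristic method and follows the greedy incremental approach of CD-HIT. It uses the semi-global alignment implemented in Edlib to compute the similarity between each query sequence and the seeds.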

[52] M. Cao et al., “EdClust: A heuristic sequence clustering method with higher sensitivity,” J. Bioinf. Comput. Biol., vol. 20, 2021, Art. no. 2150036.
[53] M. Šošić and M. Šikić, “Edlib: A C/C++ library for fast, exact sequence alignment using edit distance,” Bioinformatics, vol. 33, no. 9, pp. 1394–1395, 2017.

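During prefiltering, CD-HIT, UCLUST, VSEARCH, DBH, and edClust only count the identical k-mers shared between sequences. Because this count drops rapidly as the similarity of the compared sequences decreases, most of these methods produce, at low clustering thresholds (especially below 50%), a large fraction of corrupted clusters that contain non-homologous sequences.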
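A back-of-the-envelope calculation shows how fast the shared k-mer count falls. It assumes (this is a simplification, not a claim from the paper) that mismatches occur independently at each position, so a k-mer survives intact with probability identity^k. The choice k = 5 is only illustrative.

```python
def expected_shared_fraction(identity: float, k: int) -> float:
    """Expected fraction of k-mers left unchanged when every position
    matches independently with probability `identity` (assumption)."""
    return identity ** k


# Print the expected fraction for a few identity levels, with k = 5.
for identity in (0.97, 0.9, 0.7, 0.5):
    print(identity, round(expected_shared_fraction(identity, 5), 4))
```

Under this model, at 50% identity only about 3% of 5-mers are expected to survive, which is hard to distinguish from chance matches between unrelated sequences.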

6. kClust

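kClust was developed to improve clustering sensitivity at low clustering thresholds; it achieves high sensitivity by searching for similar k-mers rather than only identical ones.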

[54] M. Hauser, C. E. Mayer, and J. Söding, “kClust: Fast and sensitive clustering of large protein sequence databases,” BMC Bioinf., vol. 14, no. 1, 2013, Art. no. 248.

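To summarize, CD-HIT, UCLUST, VSEARCH, DBH, edClust, kClust, and MMseqs2 all apply a greedy incremental strategy to cluster sequences, with a computational complexity of roughly O(KN), where N and K are the number of sequences and the number of clusters, respectively. For hundreds of millions of sequences, K is usually of the same order of magnitude as N, so the computational cost grows almost quadratically with N.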
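The greedy incremental strategy and its O(KN) cost can be sketched as follows. This is a toy illustration, not any tool's implementation: the `identity` function is a hypothetical stand-in that compares positions without alignment, and the longest-first ordering follows the CD-HIT convention.

```python
def identity(a: str, b: str) -> float:
    """Toy similarity: matching positions over the longer length.
    Real tools compute this from a pairwise alignment."""
    if not a or not b:
        return 0.0
    matches = sum(x == y for x, y in zip(a, b))
    return matches / max(len(a), len(b))


def greedy_cluster(seqs, threshold):
    """Greedy incremental clustering.

    Each sequence joins the first seed it matches at or above the
    threshold; otherwise it becomes the seed of a new cluster.
    Returns the clusters and the number of seed comparisons made.
    """
    seeds = []      # one representative per cluster
    clusters = []   # members of each cluster, parallel to seeds
    comparisons = 0
    for seq in sorted(seqs, key=len, reverse=True):  # longest first
        for idx, seed in enumerate(seeds):
            comparisons += 1
            if identity(seq, seed) >= threshold:
                clusters[idx].append(seq)
                break
        else:
            # No seed was similar enough: start a new cluster.
            seeds.append(seq)
            clusters.append([seq])
    return clusters, comparisons


clusters, comparisons = greedy_cluster(
    ["AAAAAAAA", "AAAAAAAT", "CCCCCCCC", "CCCCCCCG"], threshold=0.8
)
```

Each of the N sequences is compared with at most K seeds, which gives the O(KN) bound. When most sequences end up as their own seed, K approaches N and the inner loop dominates the running time.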

7. Linclust: linear time, O(N)

Clusters huge sets of protein sequences.

[57] M. Steinegger and J. Söding, “Clustering huge protein sequence sets in linear time,” Nature Commun., vol. 9, no. 1, 2018, Art. no. 2542.

8. MMseqs2

[55] M. Hauser, M. Steinegger, and J. Söding, “MMseqs software suite for fast and deep clustering and searching of large protein sequence sets,” Bioinformatics, vol. 32, no. 9, pp. 1323–1330, 2016.
[56] M. Steinegger and J. Söding, “MMseqs2 enables sensitive protein sequence searching for the analysis of massive data sets,” Nature Biotechnol., vol. 35, no. 11, pp. 1026–1028, 2017.

9. MeShClust: mean shift algorithm

Clusters DNA sequences.

[58] B. T. James, B. B. Luczak, and H. Z. Girgis, “MeShClust: An intelligent tool for clustering DNA sequences,” Nucleic Acids Res., vol. 46, no. 14, pp. e83–e83, 2018.

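III. Four benchmark datasets

Table I: Summary statistics of the four sequence datasets

1. Simulated dataset

The simulated dataset was generated by James et al. [58]. It contains 236 sequences in 10 clusters, each with about 23 sequences. The average sequence length is about 1000 bp.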

[58] B. T. James, B. B. Luczak, and H. Z. Girgis, “MeShClust: An intelligent tool for clustering DNA sequences,” Nucleic Acids Res., vol. 46, no. 14, pp. e83–e83, 2018.

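2. Schmidt dataset

The Schmidt dataset is a comprehensive, global 16S rRNA gene sequence dataset built by Schmidt et al. [44] (http://meringlab.org/suppdata/2014-otu_robustness/). It covers nearly the full length of the bacterial 16S rRNA gene and contains 887,870 sequences collected from NCBI GenBank, with an average length of about 1401 bp.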

[44] M. Rodrigues, J. F., and C. von Mering, “HPC-CLUST: Distributed hierarchical clustering for large sets of nucleotide sequences,” Bioinformatics, vol. 30, no. 2, pp. 287–288, 2013.

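3. Alfree dataset

The Alfree benchmark dataset [39] is built on the ASTRAL v2.06 dataset [65], which contains 6569 protein sequences divided into 513 family groups. Sequence lengths range from 20 to 1047 amino acids, with an average of 184. The Alfree dataset and its class labels can be downloaded freely from http://150.254.123.165/alfree//download/data/.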

[39] A. Zielezinski et al., “Alignment-free sequence comparison: Benefits, applications, and tools,” Genome Biol., vol. 18, no. 1, 2017, Art. no. 186.

[65] N. K. Fox, S. E. Brenner, and J.-M. Chandonia, “SCOPe: Structural classification of proteins—Extended, integrating SCOP and ASTRAL data and classification of new structures,” Nucleic Acids Res., vol. 42, no. D1, pp. D304–D309, 2014.

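4. UniProt sequence dataset

The UniProt sequence dataset [64] is a carefully curated protein sequence database that aims to provide a high level of annotation, minimal redundancy, and a high level of integration with other databases. It contains ~562 K protein sequences with an average length of ~359 aa.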

[64] B. E. Suzek et al., “UniRef: Comprehensive and non-redundant UniProt reference clusters,” Bioinformatics, vol. 23, no. 10, pp. 1282–1288, 2007.

IV. Clustering evaluation metrics

NMI (normalized mutual information) metric [43]

[43] Y. Sun et al., “ESPRIT: Estimating species richness using large collections of 16S rRNA pyrosequences,” Nucleic Acids Res., vol. 37, no. 10, pp. e76–e76, 2009.
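As an illustration of the NMI metric, the sketch below computes it from a list of true labels and a list of predicted cluster labels. It assumes the common normalization NMI = 2·I(U;V) / (H(U) + H(V)); the notes do not say which normalization the paper uses, so treat that choice as an assumption. Function names are hypothetical.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (natural log) of a label assignment."""
    n = len(labels)
    return -sum((c / n) * math.log(c / n) for c in Counter(labels).values())


def mutual_information(a, b):
    """Mutual information between two label assignments of equal length."""
    n = len(a)
    count_a, count_b = Counter(a), Counter(b)
    joint = Counter(zip(a, b))
    return sum(
        (n_ij / n) * math.log(n_ij * n / (count_a[i] * count_b[j]))
        for (i, j), n_ij in joint.items()
    )


def nmi(true_labels, pred_labels):
    """NMI normalized by the arithmetic mean of the two entropies."""
    if len(true_labels) != len(pred_labels):
        raise ValueError("label lists must have the same length")
    total = entropy(true_labels) + entropy(pred_labels)
    if total == 0:
        return 1.0  # both assignments put everything in one cluster
    return 2 * mutual_information(true_labels, pred_labels) / total
```

NMI depends only on how the sequences are grouped, not on the label names, so a clustering that reproduces the reference grouping under different cluster IDs still scores 1. It does require reference labels, which is the dependence on labeled data noted in the background.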

Other evaluation metrics: cluster number, seed sensitivity (SS), clustered fraction (CF), and the wrong clustered fraction (WCF) of a seed sequence. These were applied in a previous study [52].

[52] M. Cao et al., “EdClust: A heuristic sequence clustering method with higher sensitivity,” J. Bioinf. Comput. Biol., vol. 20, 2021, Art. no. 2150036.