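Note: all of these tools have a limited scope, and some were designed only for animal transcripts, so be sure to test them against ground-truth data before relying on them. I want to run predictions on transcripts from a plant species, so the well-annotated Arabidopsis thaliana makes a reasonable test set. (The test results turned out to be rather surprising.)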
CPC
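(A familiar name: it turns out CPC was developed by Gao Ge and Wei Liping at Peking University.)
While searching for the paper, I found that CPC2 had already been released in 2017.
CPC can be used online.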
a Support Vector Machine-based classifier, named Coding Potential Calculator (CPC), to assess the protein-coding potential of a transcript based on six biologically meaningful sequence features.
Coding Potential Calculator distinguish protein-coding from non-coding RNAs based on the sequence features of the input transcripts. Our preliminary performance assessment suggests the CPC can reliably discriminate the coding and non-coding transcripts in ~98% accuracy. We provide an online version of CPC here.
It claims an accuracy of about 98%.
bin/run_predict.sh (input_seq) (result_in_table) (working_dir) (result_evidence)
CPC RESULTS (The first column is the input sequence ID; the second column is the input sequence length; the third column is the coding status; and the fourth column is the coding potential score, i.e. the "distance" to the SVM classification hyper-plane in the feature space.)
AF282387 528 coding 3.32462
Tsix_mus 4300 noncoding -1.30047
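The four-column result table is easy to post-process. Below is a minimal sketch, assuming one whitespace-separated record per line as described above; the names `CpcResult` and `parse_cpc_table` are my own and are not part of CPC.

```python
from typing import List, NamedTuple


class CpcResult(NamedTuple):
    seq_id: str    # input sequence ID
    length: int    # input sequence length
    status: str    # "coding" or "noncoding"
    score: float   # distance to the SVM hyper-plane


def parse_cpc_table(text: str) -> List[CpcResult]:
    """Parse the four-column CPC result table (hypothetical helper)."""
    results = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) != 4:
            continue  # skip blank or malformed lines
        seq_id, length, status, score = fields
        results.append(CpcResult(seq_id, int(length), status, float(score)))
    return results


example = """AF282387 528 coding 3.32462
Tsix_mus 4300 noncoding -1.30047"""

records = parse_cpc_table(example)
noncoding_ids = [r.seq_id for r in records if r.status == "noncoding"]
print(noncoding_ids)
```

Filtering on the status column like this is how I would pull out the candidate non-coding transcripts for the downstream steps.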
HOMO EVIDENCE
ORF EVIDENCE
AF282387 ORF_FRAMEFINDER 4 529 99.43 109.41 Full
Tsix_mus ORF_FRAMEFINDER 4077 4206 3.00 27.50 Full
FRAME FINDER
>AF282387 Filobasidiella neoformans calcineurin B regulatory subunit (CNB1) mRNA, complete cds [framefinder (3,528) score=109.41 used=99.43% {forward,strict} ]
MGAAESSMFNSLEKNSNFSGPELMRLKKRFMKLDKDGSGSIDKDEFLQIPQIANNPLAHR
MIAIFDEDGSGTVDFQEFVGGLSAFSSKGGRDEKLRFAFKVYDMDRDGYISNGELYLVLK
QMVGNNLKDQQLQQIVDKTIMEADKDGDGKLSFEEFTQMVASTDIVKQMTLEDLF
>Tsix_mus NR_002844.1 Mus musculus X (inactive)-specific transcript, antisense (Tsix) on chromosome X [framefinder (4076,4205) score=27.50 used=3.00% {forward,strict} ]
MKGYVLKLSSWAGEIAQWLGVLTALPEGLSSILNNFVVAHSHL
BLAST RESULT
CPC2
CPC2 runs ∼1000 times faster than CPC1 and exhibits superior accuracy compared with CPC1, especially for long non-coding transcripts. Moreover, the model of CPC2 is species-neutral, making it feasible for ever-growing non-model organism transcriptomes.
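In my own tests, CPC1 is reasonably fast without BLAST, but once BLAST is enabled it becomes painfully slow. Behind the scenes it still calls blastall, a long-outdated program; these days even BLAST feels slow to us and we just use diamond.
CPC2 has been rewritten in Python, and it still calls libsvm for classification.
Roughly how CPC2 works:
1. Feature selection: four intrinsic features as Fickett TESTCODE score, open reading frame (ORF) length, ORF integrity and isoelectric point (pI).
2. Build the classification model with an SVM: trained a support vector machine (SVM) model
3. Validate the model's performance on data from multiple species. Evaluation metrics: sensitivity, specificity and accuracy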
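The three evaluation metrics are also what I would compute when testing any of these tools on ground-truth annotation. Here is a minimal sketch, assuming "coding" is treated as the positive class; the function name `evaluate` and the toy labels are made up for illustration and are not real results.

```python
def evaluate(truth, predicted, positive="coding"):
    """Return (sensitivity, specificity, accuracy) for two label lists."""
    tp = fp = tn = fn = 0
    for t, p in zip(truth, predicted):
        if t == positive and p == positive:
            tp += 1
        elif t == positive:
            fn += 1   # a true positive-class transcript was missed
        elif p == positive:
            fp += 1   # a negative-class transcript was called positive
        else:
            tn += 1
    sensitivity = tp / (tp + fn) if (tp + fn) else float("nan")
    specificity = tn / (tn + fp) if (tn + fp) else float("nan")
    accuracy = (tp + tn) / len(truth) if truth else float("nan")
    return sensitivity, specificity, accuracy


# Toy labels, invented purely to exercise the function.
truth = ["coding", "coding", "coding", "noncoding", "noncoding"]
predicted = ["coding", "coding", "noncoding", "noncoding", "coding"]

sens, spec, acc = evaluate(truth, predicted)
print(f"sensitivity={sens:.3f} specificity={spec:.3f} accuracy={acc:.3f}")
```

For an Arabidopsis test, `truth` would come from the reference annotation and `predicted` from the status column of the tool being tested.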
With a method this simple, don't you suddenly get the illusion that you could publish in NAR too?
PLEK
(predictor of long non-coding RNAs and messenger RNAs based on an improved k-mer scheme)
an efficient alignment-free computational tool to distinguish lncRNAs from mRNAs in RNA-seq transcriptomes of species lacking reference genomes.
There does not seem to be a website or a GitHub repository; the program is hosted on SourceForge.
How it works:
Core ideas: k-mers and an SVM
It is suitable for vertebrates lacking high-quality genome sequences and annotation information and is especially effective for the de novo assembled transcriptome data generated by PacBio or 454 sequencing platforms.
A k-mer pattern is a specific string with k nucleotides, each can be A, C, G or T. For k = 1 to 5, we had 4 + 16 + 64 + 256 + 1024 = 1,364 patterns: 4 one-mer patterns, 16 two-mer patterns, 64 three-mer patterns, 256 four-mer patterns, and 1,024 five-mer patterns.
So five k-mer sizes are used (k = 1 to 5).
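To make the 1,364 patterns concrete, here is a minimal sketch that enumerates them and computes plain relative k-mer frequencies for a sequence. This is an assumption-laden simplification: it does not reproduce PLEK's own weighting of the k-mer frequencies, and the function names are my own.

```python
from itertools import product


def kmer_patterns(max_k=5):
    """All k-mer patterns over A, C, G, T for k = 1..max_k."""
    return ["".join(p)
            for k in range(1, max_k + 1)
            for p in product("ACGT", repeat=k)]


def kmer_frequencies(seq, max_k=5):
    """Relative frequency of every pattern, normalised separately per k.

    Windows containing characters other than A/C/G/T (e.g. N) are ignored.
    """
    seq = seq.upper()
    freqs = {}
    for k in range(1, max_k + 1):
        counts = {"".join(p): 0 for p in product("ACGT", repeat=k)}
        total = 0
        for i in range(len(seq) - k + 1):
            kmer = seq[i:i + k]
            if kmer in counts:
                counts[kmer] += 1
                total += 1
        for kmer, n in counts.items():
            freqs[kmer] = n / total if total else 0.0
    return freqs


patterns = kmer_patterns()
print(len(patterns))  # 4 + 16 + 64 + 256 + 1024

freqs = kmer_frequencies("ACGTACGT")
print(freqs["A"], freqs["AC"])
```

Each transcript then becomes a 1,364-dimensional feature vector, which is the kind of input an SVM library such as libsvm expects.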
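Very conventional feature selection, and in the end it still calls libsvm, yet it was published in BMC Bioinformatics. After reading it, don't you feel like writing a paper of your own?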
CNCI
Utilizing sequence intrinsic composition to classify protein-coding and long non-coding transcripts
Feature selection
To distinguish protein-coding sequences from the non-coding sequences, we extracted five features, i.e. the length and S-score of MLCDS, length-percentage, score-distance and codon-bias. The length and S-score of MLCDS were used as the first two features, which assess the extent and quality of the MLCDS, respectively. Moreover, as demonstrated earlier in the text, protein-coding transcripts possess a special reading frame obviously distinct from the other five in the distribution of ANT. We analyzed six MLCDS candidates outputted by dynamic programming of the six reading frames for each transcript, with the assumption that there must exist one best MLCDS (as described earlier in the text); however, this phenomenon does not generally exist for non-coding transcripts. Thus, we defined other two features, length-percentage and score-distance, as follows:
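Test result: CNCI could not process my FASTA sequences directly; with FASTA input the result came out empty. I had to supply a GTF file plus the genome as a 2bit file to get valid results.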
CPAT
CPAT: Coding-Potential Assessment Tool using an alignment-free logistic regression model
Documentation: http://rna-cpat.sourceforge.net/
Feature selection:
The first feature was the maximum length of the open reading frame (ORF).
The second feature was ORF coverage defined as the ratio of ORF to transcript lengths.
The third feature we used was the Fickett TESTCODE score (termed ‘Fickett score’ hereafter), which is a simple linguistic feature that distinguishes protein-coding RNA and ncRNA according to the combinational effect of nucleotide composition and codon usage bias (22).
The fourth feature we used was hexamer usage bias (termed ‘hexamer score’ hereafter). This may be the most discriminating feature because of the dependence between adjacent amino acids in proteins (23).
We build a logistic regression model using these four linguistic features as predictor variables. A χ2 test was used to evaluate whether our logit model with predictors fits the training data significantly better than the null model, which had only an intercept.
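The first two features and the logistic model can be sketched in a few lines. This is only an illustration under my own assumptions: the ORF search looks at the three forward frames only, requires an ATG start and an in-frame stop codon, and the logistic coefficients are hypothetical placeholders, not CPAT's fitted values. The Fickett and hexamer scores are not computed here.

```python
import math

STOP_CODONS = {"TAA", "TAG", "TGA"}


def longest_orf_length(seq):
    """Length in nt (stop codon included) of the longest ATG..stop ORF
    in the three forward reading frames."""
    seq = seq.upper()
    best = 0
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOP_CODONS:
                best = max(best, i + 3 - start)
                start = None
    return best


def orf_coverage(seq):
    """Ratio of ORF length to transcript length."""
    return longest_orf_length(seq) / len(seq) if seq else 0.0


def coding_probability(features, coefficients, intercept):
    """Logistic model: p = 1 / (1 + exp(-(b0 + sum(b_i * x_i))))."""
    z = intercept + sum(b * x for b, x in zip(coefficients, features))
    return 1.0 / (1.0 + math.exp(-z))


seq = "CCATGAAATTTGGGTAACC"
orf_len = longest_orf_length(seq)
coverage = orf_coverage(seq)

# Hypothetical coefficients, chosen only to show the call.
p = coding_probability([orf_len, coverage], [0.01, 1.0], -1.0)
print(orf_len, round(coverage, 3), round(p, 3))
```

In a real run the coefficients would come from fitting the model on known coding and non-coding transcripts of the species in question.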
FEELnc
FEELnc: a tool for long non-coding RNA annotation and its application to the dog transcriptome
OrfPredictor
OrfPredictor: predicting protein-coding regions in EST-derived sequences
PhyloCSF
PhyloCSF: a comparative genomics method to distinguish protein coding and non-coding regions
Predicting the coding potential of lncRNAs: using PhyloCSF
I will test each of these tools in turn later.
To be continued.