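dbCAN (a database of carbohydrate-active enzyme genes) is an online tool for predicting carbohydrate-active enzyme (CAZyme) genes in genomes. It combines hidden Markov model (HMM) searches with BLAST searches to identify and annotate different classes of CAZymes in protein sequences, including cellulases, ligninases, hemicellulases, amylases, and fructosidases. dbCAN is a useful bioinformatics tool for research on cellulose bioconversion, bioenergy, and the production of bio-based chemicals.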
Run_dbCAN
Run_dbCAN is a tool for predicting carbohydrate-active enzymes. It analyzes genome or transcriptome data to identify the genes that encode them.
Related articles:
dbCAN2: a meta server for automated carbohydrate-active enzyme annotation | Nucleic Acids Research | Oxford Academic
dbCAN3: automated carbohydrate-active enzyme and substrate annotation | Nucleic Acids Research | Oxford Academic
dbCAN-seq: a database of carbohydrate-active enzyme (CAZyme) sequence and annotation | Nucleic Acids Research | Oxford Academic
Latest source code on GitHub:
GitHub - linnabrown/run_dbcan: Run_dbcan V4, using genomes/metagenomes/proteomes of any assembled organisms (prokaryotes, fungi, plants, animals, viruses) to search for CAZymes.
Other related links (some may be temporarily unreachable; try again later, or look for the related resources I have posted on this site):
CAZy - Home
Index of /dbCAN2/download (unl.edu)
https://github.com/linnabrown/run_dbcan/issues
dbCAN-sub
1. Installing dbcan
Install in a conda environment
conda create -n run_dbcan python=3.8 dbcan -c conda-forge -c bioconda
conda activate run_dbcan
Pull with Docker
docker pull haidyi/run_dbcan:latest
docker run --name <preferred_name> -v <host-path>:<container-path> -it haidyi/run_dbcan:latest run_dbcan <input_file> [params] --out_dir <output_dir>
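Before moving on to the database step, it can help to confirm that the command-line tools used in this post are actually on PATH. The following is a minimal sketch, not part of run_dbcan; the function name `missing_tools` and the tool list are illustrative assumptions based on the commands that appear in this post.

```python
import shutil

# Executables that the commands in this post call.
# The list is an assumption; adjust it to your own setup.
REQUIRED_TOOLS = ["run_dbcan", "diamond", "hmmpress", "makeblastdb", "wget"]


def missing_tools(tools=REQUIRED_TOOLS):
    """Return the names in `tools` that cannot be found on PATH."""
    return [name for name in tools if shutil.which(name) is None]
```

Calling `missing_tools()` inside the activated conda environment should return an empty list if everything installed correctly.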
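2. Database configuration
Create a db (or dbcan) directory wherever you like, then download the required files and process each one with the corresponding tool. Some of the files referenced here are not the latest releases; you can edit the URLs to fetch newer versions before running. The script below is the official one: it creates db if the directory does not exist, then enters it and downloads and indexes the database files. The steps can also be run one at a time.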
test -d db || mkdir db
cd db \
&& wget http://bcb.unl.edu/dbCAN2/download/Databases/fam-substrate-mapping-08252022.tsv \
&& wget http://bcb.unl.edu/dbCAN2/download/Databases/PUL.faa && makeblastdb -in PUL.faa -dbtype prot \
&& wget http://bcb.unl.edu/dbCAN2/download/Databases/dbCAN-PUL_07-01-2022.xlsx \
&& wget http://bcb.unl.edu/dbCAN2/download/Databases/dbCAN-PUL_07-01-2022.txt \
&& wget http://bcb.unl.edu/dbCAN2/download/Databases/dbCAN-PUL.tar.gz && tar xvf dbCAN-PUL.tar.gz \
&& wget http://bcb.unl.edu/dbCAN2/download/Databases/dbCAN_sub.hmm && hmmpress dbCAN_sub.hmm \
&& wget http://bcb.unl.edu/dbCAN2/download/Databases/V11/CAZyDB.08062022.fa && diamond makedb --in CAZyDB.08062022.fa -d CAZy \
&& wget https://bcb.unl.edu/dbCAN2/download/Databases/V11/dbCAN-HMMdb-V11.txt && mv dbCAN-HMMdb-V11.txt dbCAN.txt && hmmpress dbCAN.txt \
&& wget https://bcb.unl.edu/dbCAN2/download/Databases/V11/tcdb.fa && diamond makedb --in tcdb.fa -d tcdb \
&& wget http://bcb.unl.edu/dbCAN2/download/Databases/V11/tf-1.hmm && hmmpress tf-1.hmm \
&& wget http://bcb.unl.edu/dbCAN2/download/Databases/V11/tf-2.hmm && hmmpress tf-2.hmm \
&& wget https://bcb.unl.edu/dbCAN2/download/Databases/V11/stp.hmm && hmmpress stp.hmm \
&& cd ../ && wget http://bcb.unl.edu/dbCAN2/download/Samples/EscheriaColiK12MG1655.fna \
&& wget http://bcb.unl.edu/dbCAN2/download/Samples/EscheriaColiK12MG1655.faa \
&& wget http://bcb.unl.edu/dbCAN2/download/Samples/EscheriaColiK12MG1655.gff
Manual download location: Index of /dbCAN2/download (unl.edu)
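Because the download chain is long and any single failed wget stops it, a quick check of the db directory afterwards can save a confusing error later. The sketch below is only an illustration: the file names are taken from the download commands in this post, the `.h3f/.h3i/.h3m/.h3p` suffixes are the index files hmmpress writes, and `.dmnd` is what `diamond makedb` writes. The function names are hypothetical, and the BLAST index for PUL.faa is deliberately not checked because its file set varies between BLAST+ versions.

```python
from pathlib import Path

# File names copied from the download commands in this post.
HMM_FILES = ["dbCAN.txt", "dbCAN_sub.hmm", "tf-1.hmm", "tf-2.hmm", "stp.hmm"]
DIAMOND_DBS = ["CAZy.dmnd", "tcdb.dmnd"]  # written by `diamond makedb -d CAZy` / `-d tcdb`
PLAIN_FILES = ["PUL.faa", "fam-substrate-mapping-08252022.tsv"]
HMMPRESS_SUFFIXES = [".h3f", ".h3i", ".h3m", ".h3p"]  # index files written by hmmpress


def expected_db_files():
    """List every file the setup commands should leave in the db directory."""
    names = list(PLAIN_FILES) + list(DIAMOND_DBS)
    for hmm in HMM_FILES:
        names.append(hmm)
        names.extend(hmm + suffix for suffix in HMMPRESS_SUFFIXES)
    return names


def missing_db_files(db_dir):
    """Return the expected file names that are absent from `db_dir`."""
    root = Path(db_dir)
    return [name for name in expected_db_files() if not (root / name).is_file()]
```

If you downloaded newer releases with different dates in the file names, update the lists accordingly before relying on the result.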
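Downloading and configuring SignalP
Article: Predicting Secretory Proteins with SignalP | SpringerLink
SignalP 4.1 - DTU Health Tech - Bioinformatic Services
You must enter your email address and accept the terms; a time-limited download link (valid for 4 hours) is then sent to that address.
Alternatively, you can upload a FASTA file on the website, choose the parameters, and submit an online annotation job.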
3. Using run_dbcan
Help message:
Required arguments:
  inputFile             User input file. Must be in FASTA format.
  {protein,prok,meta}   Type of sequence input. protein=proteome; prok=prokaryote; meta=metagenome

optional arguments:
  -h, --help            show this help message and exit
  --dbCANFile DBCANFILE
                        Indicate the file name of HMM database such as dbCAN.txt, please use the newest one from dbCAN2 website.
  --dia_eval DIA_EVAL   DIAMOND E Value
  --dia_cpu DIA_CPU     Number of CPU cores that DIAMOND is allowed to use
  --hmm_eval HMM_EVAL   HMMER E Value
  --hmm_cov HMM_COV     HMMER Coverage val
  --hmm_cpu HMM_CPU     Number of CPU cores that HMMER is allowed to use
  --out_pre OUT_PRE     Output files prefix
  --out_dir OUT_DIR     Output directory
  --db_dir DB_DIR       Database directory
  --tools {hmmer,diamond,dbcansub,all} [{hmmer,diamond,dbcansub,all} ...], -t {hmmer,diamond,dbcansub,all} [{hmmer,diamond,dbcansub,all} ...]
                        Choose a combination of tools to run
  --use_signalP USE_SIGNALP
                        Use signalP or not, remember, you need to setup signalP tool first. Because of signalP license, Docker version does not have signalP.
  --signalP_path SIGNALP_PATH, -sp SIGNALP_PATH
                        The path for signalp. Default location is signalp
  --gram {p,n,all}, -g {p,n,all}
                        Choose gram+(p) or gram-(n) for proteome/prokaryote nucleotide, which are params of SingalP, only if user use singalP
  -v VERSION, --version VERSION

dbCAN-sub parameters:
  --dbcan_thread DBCAN_THREAD, -dt DBCAN_THREAD
  --tf_eval TF_EVAL     tf.hmm HMMER E Value
  --tf_cov TF_COV       tf.hmm HMMER Coverage val
  --tf_cpu TF_CPU       tf.hmm Number of CPU cores that HMMER is allowed to use
  --stp_eval STP_EVAL   stp.hmm HMMER E Value
  --stp_cov STP_COV     stp.hmm HMMER Coverage val
  --stp_cpu STP_CPU     stp.hmm Number of CPU cores that HMMER is allowed to use

CGC_Finder parameters:
  --cluster CLUSTER, -c CLUSTER
                        Predict CGCs via CGCFinder. This argument requires an auxillary locations file if a protein input is being used
  --cgc_dis CGC_DIS     CGCFinder Distance value
  --cgc_sig_genes {tf,tp,stp,tp+tf,tp+stp,tf+stp,all}
                        CGCFinder Signature Genes value

CGC_Substrate parameters:
  --cgc_substrate       run cgc substrate prediction?
  --pul PUL             dbCAN-PUL PUL.faa
  -o OUT, --out OUT
  -w WORKDIR, --workdir WORKDIR
  -env ENV, --env ENV
  -oecami, --oecami     out eCAMI prediction intermediate result?
  -odbcanpul, --odbcanpul
                        output dbCAN-PUL prediction intermediate result?

dbCAN-PUL homologous searching parameters:
  how to define homologous gene hits and PUL hits
  -upghn UNIQ_PUL_GENE_HIT_NUM, --uniq_pul_gene_hit_num UNIQ_PUL_GENE_HIT_NUM
  -uqcgn UNIQ_QUERY_CGC_GENE_NUM, --uniq_query_cgc_gene_num UNIQ_QUERY_CGC_GENE_NUM
  -cpn CAZYME_PAIR_NUM, --CAZyme_pair_num CAZYME_PAIR_NUM
  -tpn TOTAL_PAIR_NUM, --total_pair_num TOTAL_PAIR_NUM
  -ept EXTRA_PAIR_TYPE, --extra_pair_type EXTRA_PAIR_TYPE
                        None[TC-TC,STP-STP]. Some like sigunature hits
  -eptn EXTRA_PAIR_TYPE_NUM, --extra_pair_type_num EXTRA_PAIR_TYPE_NUM
                        specify signature pair cutoff.1,2
  -iden IDENTITY_CUTOFF, --identity_cutoff IDENTITY_CUTOFF
                        identity to identify a homologous hit
  -cov COVERAGE_CUTOFF, --coverage_cutoff COVERAGE_CUTOFF
                        query coverage cutoff to identify a homologous hit
  -bsc BITSCORE_CUTOFF, --bitscore_cutoff BITSCORE_CUTOFF
                        bitscore cutoff to identify a homologous hit
  -evalue EVALUE_CUTOFF, --evalue_cutoff EVALUE_CUTOFF
                        evalue cutoff to identify a homologous hit

dbCAN-sub major voting parameters:
  how to define dbsub hits and dbCAN-sub subfamily substrate
  -hmmcov HMMCOV, --hmmcov HMMCOV
  -hmmevalue HMMEVALUE, --hmmevalue HMMEVALUE
  -ndsc NUM_OF_DOMAINS_SUBSTRATE_CUTOFF, --num_of_domains_substrate_cutoff NUM_OF_DOMAINS_SUBSTRATE_CUTOFF
                        define how many domains share substrates in a CGC, one protein may include several subfamily domains.
  -npsc NUM_OF_PROTEIN_SUBSTRATE_CUTOFF, --num_of_protein_substrate_cutoff NUM_OF_PROTEIN_SUBSTRATE_CUTOFF
                        define how many sequences share substrates in a CGC, one protein may include several subfamily domains.
  -subs SUBSTRATE_SCORS, --substrate_scors SUBSTRATE_SCORS
                        each cgc contains with substrate must more than this value
Commands and results reference
# Usage format
run_dbcan [inputFile] [inputType] [-c AuxillaryFile] [-t Tools]
# Output files
uniInput - The unified input file for the rest of the tools(created by prodigal if a nucleotide sequence was used)
dbsub.out - the output from the dbCAN_sub run
diamond.out - the output from the diamond blast
hmmer.out - the output from the hmmer run
tf.out - the output from the diamond blast predicting TF's for CGCFinder
tc.out - the output from the diamond blast predicting TC's for CGCFinder
cgc.gff - GFF input file for CGCFinder
cgc.out - output from the CGCFinder run
overview.txt - Details the CAZyme predictions across the three tools with signalp results
### These descriptions are self-explanatory, so they are not repeated here.
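Since overview.txt summarizes the predictions of all three tools, a common follow-up is to keep only the genes that more than one tool agrees on. The sketch below shows one way to do that. It assumes overview.txt is tab-separated with a header row containing a `Gene ID` column and a `#ofTools` column; those column names are an assumption about the file layout, so check the header of your own file and pass different names if needed. The function name is hypothetical.

```python
import csv


def confident_cazymes(overview_path, min_tools=2,
                      count_column="#ofTools", id_column="Gene ID"):
    """Return gene IDs predicted as CAZymes by at least `min_tools` tools.

    The column names are assumptions about the overview.txt header;
    override them if your file uses different labels.
    """
    hits = []
    with open(overview_path, newline="") as handle:
        reader = csv.DictReader(handle, delimiter="\t")
        for row in reader:
            if int(row[count_column]) >= min_tools:
                hits.append(row[id_column])
    return hits
```

For example, `confident_cazymes("output_EscheriaColiK12MG1655/overview.txt")` would list the genes supported by two or more tools, provided the header matches the assumed names.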
Examples:
run_dbcan EscheriaColiK12MG1655.fna prok --out_dir output_EscheriaColiK12MG1655
run_dbcan EscheriaColiK12MG1655.faa protein --out_dir output_EscheriaColiK12MG1655
run_dbcan EscheriaColiK12MG1655.fna prok -c cluster --out_dir output_EscheriaColiK12MG1655
run_dbcan EscheriaColiK12MG1655.faa protein -c EscheriaColiK12MG1655.gff --out_dir output_EscheriaColiK12MG1655
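When many genomes need to be processed, the example commands above can be assembled from a script instead of typed by hand. The following is a rough sketch that only uses options shown in the help message (`-c`, `--db_dir`, `--out_dir`); the helper names `build_run_dbcan_cmd` and `run_dbcan` are hypothetical wrappers, not part of the run_dbcan package.

```python
import subprocess

VALID_INPUT_TYPES = {"protein", "prok", "meta"}  # from the help message


def build_run_dbcan_cmd(input_file, input_type, out_dir, cluster=None, db_dir=None):
    """Build the argument list for one run_dbcan invocation."""
    if input_type not in VALID_INPUT_TYPES:
        raise ValueError(f"input_type must be one of {sorted(VALID_INPUT_TYPES)}")
    cmd = ["run_dbcan", input_file, input_type]
    if cluster is not None:
        # For protein input, the help message says -c needs an auxiliary locations file.
        cmd += ["-c", cluster]
    if db_dir is not None:
        cmd += ["--db_dir", db_dir]
    cmd += ["--out_dir", out_dir]
    return cmd


def run_dbcan(input_file, input_type, out_dir, cluster=None, db_dir=None):
    """Run run_dbcan and raise CalledProcessError if it exits with an error."""
    cmd = build_run_dbcan_cmd(input_file, input_type, out_dir, cluster, db_dir)
    subprocess.run(cmd, check=True)
```

Passing the arguments as a list rather than a shell string avoids quoting problems with file names that contain spaces.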
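Manual annotation against CAZyDB
1. Download the database files, making sure to get the latest version:
### The 07312020 in the file name marks the release of July 31, 2020; browse the download directory to confirm which version is the latest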
wget -c http://bcb.unl.edu/dbCAN2/download/CAZyDB.07312020.fa
wget -c http://bcb.unl.edu/dbCAN2/download/Databases/CAZyDB.07302020.fam-activities.txt
2. Run a fast alignment with DIAMOND
# Build a DIAMOND reference database from the FASTA file
diamond makedb --in CAZyDB.07312020.fa --db CAZyDB.07312020
# Extract the annotation for each family
grep -v '#' CAZyDB.07302020.fam-activities.txt |sed 's/ //'| sed '1 i CAZy\tDescription' > CAZy_description.txt
### Database location: /database/CAZyDB
diamond blastp --db /database/CAZyDB/CAZyDB.07312020 --query out_pro.fa --threads 10 -e 1e-5 --outfmt 6 --max-target-seqs 1 --quiet --out ./gene_diamond.f6
# Build the table that maps each gene to its dbCAN family
perl ./format_dbcan2list.pl -i gene_diamond.f6 -o gene.list
# Sum abundance according to the mapping table
python ./summarizeAbundance.py -i gene.count -m gene.list -c 2 -s ',' -n raw -o ./TPM
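The format_dbcan2list.pl and summarizeAbundance.py scripts used here come from Yong-Xin Liu's article and GitHub repository. I will cover them in detail when time allows; in the meantime, see the article and repository: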
https://doi.org/10.1002/imt2.83
YongxinLiu/EasyMicrobiome (github.com)
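To make the last two steps of the manual workflow less of a black box, here is a small Python sketch of the same idea: map each query gene to CAZy families from the DIAMOND tabular output, then sum gene abundances per family. It is not the EasyMicrobiome code and may differ from it in detail. It assumes that `--outfmt 6` puts the query ID in column 1 and the subject ID in column 2, and that CAZyDB subject IDs look like `accession|family1|family2`; if your database version adds other pipe-separated fields, they would be treated as families too. All function names are hypothetical.

```python
import csv
from collections import defaultdict


def families_from_subject(sseqid):
    """Split a subject ID such as 'ACC123.1|GH5|CBM2' into ['GH5', 'CBM2'].

    Assumes every field after the first '|' is a family label.
    """
    return [part for part in sseqid.split("|")[1:] if part]


def gene_to_families(diamond_tsv):
    """Map each query gene to the families of its first listed hit."""
    mapping = {}
    with open(diamond_tsv, newline="") as handle:
        for row in csv.reader(handle, delimiter="\t"):
            if len(row) < 2:
                continue  # skip blank or malformed lines
            query, subject = row[0], row[1]
            # Keep only the first hit per query, as with --max-target-seqs 1.
            mapping.setdefault(query, families_from_subject(subject))
    return mapping


def sum_by_family(gene_counts, mapping):
    """Add each gene's count to every family that gene was assigned to."""
    totals = defaultdict(float)
    for gene, count in gene_counts.items():
        for family in mapping.get(gene, []):
            totals[family] += count
    return dict(totals)
```

Note the design choice in `sum_by_family`: a multi-domain gene contributes its full count to each of its families, so family totals can add up to more than the total gene count. Genes without a hit are simply left out.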