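dbCAN carbohydrate-active enzyme database and the run_dbCAN4 tool: installation, configuration, and usage

dbCAN (the carbohydrate-active enzyme gene database) is an online tool dedicated to predicting carbohydrate-active enzyme (CAZyme) genes in genomes. Built on hidden Markov models (HMMs) and BLAST searches, it identifies and annotates different types of CAZyme genes in protein sequences, including cellulases, ligninases, hemicellulases, amylases, fructosidases, and others. dbCAN is a valuable bioinformatics tool for research on cellulose bioconversion, bioenergy, and the production of bio-based chemicals.

Run_dbCAN is a bioinformatics tool for predicting carbohydrate-active enzymes. It analyzes genome or transcriptome data to identify the genes that encode them.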

dbCAN2: a meta server for automated carbohydrate-active enzyme annotation | Nucleic Acids Research | Oxford Academic

dbCAN3: automated carbohydrate-active enzyme and substrate annotation | Nucleic Acids Research | Oxford Academic

dbCAN-seq: a database of carbohydrate-active enzyme (CAZyme) sequence and annotation | Nucleic Acids Research | Oxford Academic

Latest source code on GitHub

GitHub - linnabrown/run_dbcan: Run_dbcan V4, using genomes/metagenomes/proteomes of any assembled organisms (prokaryotes, fungi, plants, animals, viruses) to search for CAZymes.

Other related links (some are temporarily unreachable; try again later, or look for the related resources I have posted on this site):

CAZy - Home

Index of /dbCAN2/download (unl.edu)

https://github.com/linnabrown/run_dbcan/issues

dbCAN-sub

1. Installing dbcan

Installation in a conda environment

conda create -n run_dbcan python=3.8 dbcan -c conda-forge -c bioconda
conda activate run_dbcan
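
After activating the environment, it can be worth confirming that the command-line tools used later in this post are actually on the PATH. The following is a minimal sketch; the helper name missing_tools is hypothetical and not part of run_dbcan.

```shell
# Print the name of every listed tool that is NOT found on PATH.
# (missing_tools is a hypothetical helper, not a run_dbcan command.)
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
    done
}

# Tools used later in this post; nothing is printed if all are installed.
missing_tools run_dbcan diamond hmmpress makeblastdb
```

If any name is printed, that tool was not installed into the active environment.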

Pulling the Docker image

docker pull haidyi/run_dbcan:latest
docker run --name <preferred_name> -v <host-path>:<container-path> -it haidyi/run_dbcan:latest run_dbcan <input_file> [params] --out_dir <output_dir>

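2. Database configuration

Create a db or dbcan directory at a location of your choice, then download the relevant file packages and process them with the corresponding software. Some of the files referenced here are not the latest; you can edit the file names to fetch the newest versions before running. The script below is the official one: it first checks whether a db folder exists and creates it if not, then enters db and downloads and processes the database files. The steps can also be run separately.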

test -d db || mkdir db
cd db \
    && wget http://bcb.unl.edu/dbCAN2/download/Databases/fam-substrate-mapping-08252022.tsv \
    && wget http://bcb.unl.edu/dbCAN2/download/Databases/PUL.faa && makeblastdb -in PUL.faa -dbtype prot \
    && wget http://bcb.unl.edu/dbCAN2/download/Databases/dbCAN-PUL_07-01-2022.xlsx \
    && wget http://bcb.unl.edu/dbCAN2/download/Databases/dbCAN-PUL_07-01-2022.txt \
    && wget http://bcb.unl.edu/dbCAN2/download/Databases/dbCAN-PUL.tar.gz && tar xvf dbCAN-PUL.tar.gz \
    && wget http://bcb.unl.edu/dbCAN2/download/Databases/dbCAN_sub.hmm && hmmpress dbCAN_sub.hmm \
    && wget http://bcb.unl.edu/dbCAN2/download/Databases/V11/CAZyDB.08062022.fa && diamond makedb --in CAZyDB.08062022.fa -d CAZy \
    && wget https://bcb.unl.edu/dbCAN2/download/Databases/V11/dbCAN-HMMdb-V11.txt && mv dbCAN-HMMdb-V11.txt dbCAN.txt && hmmpress dbCAN.txt \
    && wget https://bcb.unl.edu/dbCAN2/download/Databases/V11/tcdb.fa && diamond makedb --in tcdb.fa -d tcdb \
    && wget http://bcb.unl.edu/dbCAN2/download/Databases/V11/tf-1.hmm && hmmpress tf-1.hmm \
    && wget http://bcb.unl.edu/dbCAN2/download/Databases/V11/tf-2.hmm && hmmpress tf-2.hmm \
    && wget https://bcb.unl.edu/dbCAN2/download/Databases/V11/stp.hmm && hmmpress stp.hmm \
    && cd ../ && wget http://bcb.unl.edu/dbCAN2/download/Samples/EscheriaColiK12MG1655.fna \
    && wget http://bcb.unl.edu/dbCAN2/download/Samples/EscheriaColiK12MG1655.faa \
    && wget http://bcb.unl.edu/dbCAN2/download/Samples/EscheriaColiK12MG1655.gff
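
Because the download chain stops at the first failed step, a quick check of the db directory afterwards can show what is still missing. The sketch below assumes the file names produced by the commands above (for example, that `diamond makedb -d CAZy` writes CAZy.dmnd); the helper name missing_db_files is hypothetical.

```shell
# Print every expected database file that is missing (or empty) in directory $1.
# The file list is taken from the download commands in this post;
# missing_db_files is a hypothetical helper, not part of run_dbcan.
missing_db_files() {
    db_dir=$1
    for f in CAZy.dmnd tcdb.dmnd dbCAN.txt dbCAN_sub.hmm \
             tf-1.hmm tf-2.hmm stp.hmm PUL.faa; do
        [ -s "$db_dir/$f" ] || printf '%s\n' "$f"
    done
}

# Check the db directory created above; no output means all files are present.
missing_db_files db
```

Any file listed can then be downloaded and processed on its own, since each step of the official script is independent of the others.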

Manual download location: Index of /dbCAN2/download (unl.edu)

Downloading and configuring SignalP

Article: Predicting Secretory Proteins with SignalP | SpringerLink

SignalP 4.1 - DTU Health Tech - Bioinformatic Services

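You must fill in your e-mail address and accept the license terms; a time-limited download link (valid for 4 hours) is then sent to that address.

Alternatively, you can upload a FASTA file on the website, choose the parameters, and submit an online annotation job.

3. Using run_dbcan

Help information: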

Required arguments:
  inputFile             User input file. Must be in FASTA format.
  {protein,prok,meta}   Type of sequence input. protein=proteome; prok=prokaryote; meta=metagenome

optional arguments:
  -h, --help            show this help message and exit
  --dbCANFile DBCANFILE
                        Indicate the file name of HMM database such as dbCAN.txt, please use the newest one from dbCAN2 website.
  --dia_eval DIA_EVAL   DIAMOND E Value
  --dia_cpu DIA_CPU     Number of CPU cores that DIAMOND is allowed to use
  --hmm_eval HMM_EVAL   HMMER E Value
  --hmm_cov HMM_COV     HMMER Coverage val
  --hmm_cpu HMM_CPU     Number of CPU cores that HMMER is allowed to use
  --out_pre OUT_PRE     Output files prefix
  --out_dir OUT_DIR     Output directory
  --db_dir DB_DIR       Database directory
  --tools {hmmer,diamond,dbcansub,all} [{hmmer,diamond,dbcansub,all} ...], -t {hmmer,diamond,dbcansub,all} [{hmmer,diamond,dbcansub,all} ...]
                        Choose a combination of tools to run
  --use_signalP USE_SIGNALP
                        Use signalP or not, remember, you need to setup signalP tool first. Because of signalP license, Docker version does not have signalP.
  --signalP_path SIGNALP_PATH, -sp SIGNALP_PATH
                        The path for signalp. Default location is signalp
  --gram {p,n,all}, -g {p,n,all}
                        Choose gram+(p) or gram-(n) for proteome/prokaryote nucleotide, which are params of SingalP, only if user use singalP
  -v VERSION, --version VERSION

dbCAN-sub parameters:
  --dbcan_thread DBCAN_THREAD, -dt DBCAN_THREAD
  --tf_eval TF_EVAL     tf.hmm HMMER E Value
  --tf_cov TF_COV       tf.hmm HMMER Coverage val
  --tf_cpu TF_CPU       tf.hmm Number of CPU cores that HMMER is allowed to use
  --stp_eval STP_EVAL   stp.hmm HMMER E Value
  --stp_cov STP_COV     stp.hmm HMMER Coverage val
  --stp_cpu STP_CPU     stp.hmm Number of CPU cores that HMMER is allowed to use

CGC_Finder parameters:
  --cluster CLUSTER, -c CLUSTER
                        Predict CGCs via CGCFinder. This argument requires an auxillary locations file if a protein input is being used
  --cgc_dis CGC_DIS     CGCFinder Distance value
  --cgc_sig_genes {tf,tp,stp,tp+tf,tp+stp,tf+stp,all}
                        CGCFinder Signature Genes value

CGC_Substrate parameters:
  --cgc_substrate       run cgc substrate prediction?
  --pul PUL             dbCAN-PUL PUL.faa
  -o OUT, --out OUT
  -w WORKDIR, --workdir WORKDIR
  -env ENV, --env ENV
  -oecami, --oecami     out eCAMI prediction intermediate result?
  -odbcanpul, --odbcanpul
                        output dbCAN-PUL prediction intermediate result?

dbCAN-PUL homologous searching parameters:
  how to define homologous gene hits and PUL hits

  -upghn UNIQ_PUL_GENE_HIT_NUM, --uniq_pul_gene_hit_num UNIQ_PUL_GENE_HIT_NUM
  -uqcgn UNIQ_QUERY_CGC_GENE_NUM, --uniq_query_cgc_gene_num UNIQ_QUERY_CGC_GENE_NUM
  -cpn CAZYME_PAIR_NUM, --CAZyme_pair_num CAZYME_PAIR_NUM
  -tpn TOTAL_PAIR_NUM, --total_pair_num TOTAL_PAIR_NUM
  -ept EXTRA_PAIR_TYPE, --extra_pair_type EXTRA_PAIR_TYPE
                        None[TC-TC,STP-STP]. Some like sigunature hits
  -eptn EXTRA_PAIR_TYPE_NUM, --extra_pair_type_num EXTRA_PAIR_TYPE_NUM
                        specify signature pair cutoff.1,2
  -iden IDENTITY_CUTOFF, --identity_cutoff IDENTITY_CUTOFF
                        identity to identify a homologous hit
  -cov COVERAGE_CUTOFF, --coverage_cutoff COVERAGE_CUTOFF
                        query coverage cutoff to identify a homologous hit
  -bsc BITSCORE_CUTOFF, --bitscore_cutoff BITSCORE_CUTOFF
                        bitscore cutoff to identify a homologous hit
  -evalue EVALUE_CUTOFF, --evalue_cutoff EVALUE_CUTOFF
                        evalue cutoff to identify a homologous hit

dbCAN-sub major voting parameters:
  how to define dbsub hits and dbCAN-sub subfamily substrate

  -hmmcov HMMCOV, --hmmcov HMMCOV
  -hmmevalue HMMEVALUE, --hmmevalue HMMEVALUE
  -ndsc NUM_OF_DOMAINS_SUBSTRATE_CUTOFF, --num_of_domains_substrate_cutoff NUM_OF_DOMAINS_SUBSTRATE_CUTOFF
                        define how many domains share substrates in a CGC, one protein may include several subfamily domains.
  -npsc NUM_OF_PROTEIN_SUBSTRATE_CUTOFF, --num_of_protein_substrate_cutoff NUM_OF_PROTEIN_SUBSTRATE_CUTOFF
                        define how many sequences share substrates in a CGC, one protein may include several subfamily domains.
  -subs SUBSTRATE_SCORS, --substrate_scors SUBSTRATE_SCORS
                        each cgc contains with substrate must more than this value
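
To show how several of these options combine, here is a small sketch that assembles a run_dbcan call for a protein FASTA file using only flags listed in the help text (--db_dir, --out_dir, --hmm_cpu, --dia_cpu, --tools). The wrapper names run_or_echo and annotate_proteome, the DRY_RUN variable, and the output directory name are hypothetical and only serve the illustration.

```shell
# Print the command instead of executing it when DRY_RUN=1.
# (run_or_echo, annotate_proteome and DRY_RUN are hypothetical names.)
run_or_echo() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# annotate_proteome <proteins.faa> <out_dir> [threads]
# Runs only the HMMER and DIAMOND searches against the databases in ./db.
annotate_proteome() {
    threads=${3:-4}
    run_or_echo run_dbcan "$1" protein \
        --db_dir db --out_dir "$2" \
        --hmm_cpu "$threads" --dia_cpu "$threads" \
        --tools hmmer diamond
}

# Dry run: show the command that would be executed.
DRY_RUN=1
annotate_proteome EscheriaColiK12MG1655.faa output_hmmer_diamond 8
```

Setting DRY_RUN=0 (or leaving it unset) executes the same command instead of printing it.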

Command and output reference

# Reference format
run_dbcan [inputFile] [inputType] [-c AuxillaryFile] [-t Tools]
# Output files
uniInput - The unified input file for the rest of the tools (created by prodigal if a nucleotide sequence was used)
dbsub.out - the output from the dbCAN_sub run
diamond.out - the output from the diamond blast
hmmer.out - the output from the hmmer run
tf.out - the output from the diamond blast predicting TF's for CGCFinder
tc.out - the output from the diamond blast predicting TC's for CGCFinder
cgc.gff - GFF input file for CGCFinder
cgc.out - output from the CGCFinder run
overview.txt - Details the CAZyme predictions across the three tools with signalp results
### These descriptions are self-explanatory, so they are not repeated here.
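
Since overview.txt summarizes the predictions of all three tools, a common follow-up is to keep only genes supported by several of them. The sketch below assumes that overview.txt is tab-separated, has one header line, and stores the number of supporting tools in its last column; check this against your own output before relying on it. The helper name genes_with_min_tools is hypothetical.

```shell
# Print the gene IDs (column 1) of rows whose last column is >= min_tools.
# Assumption: tab-separated file, one header line, last column = number of
# tools that predicted the gene. genes_with_min_tools is a hypothetical helper.
# Usage: genes_with_min_tools <overview.txt> <min_tools>
genes_with_min_tools() {
    awk -F '\t' -v min="$2" 'NR > 1 && $NF + 0 >= min + 0 { print $1 }' "$1"
}

# Example: genes predicted by at least two tools, if the file exists.
overview=output_EscheriaColiK12MG1655/overview.txt
if [ -f "$overview" ]; then
    genes_with_min_tools "$overview" 2
fi
```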

Examples:

run_dbcan EscheriaColiK12MG1655.fna prok --out_dir output_EscheriaColiK12MG1655
run_dbcan EscheriaColiK12MG1655.faa protein --out_dir output_EscheriaColiK12MG1655
run_dbcan EscheriaColiK12MG1655.fna prok -c cluster --out_dir output_EscheriaColiK12MG1655
run_dbcan EscheriaColiK12MG1655.faa protein -c EscheriaColiK12MG1655.gff --out_dir output_EscheriaColiK12MG1655
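
When there are many proteomes to annotate, the same protein-mode call can be wrapped in a loop. This is only a sketch: the helper names out_dir_for and annotate_all are hypothetical, and the output_<name> directory naming simply mirrors the examples above.

```shell
# Derive an output directory name from an input file name,
# following the naming used in the examples: foo.faa -> output_foo
# (out_dir_for and annotate_all are hypothetical helpers.)
out_dir_for() {
    name=$(basename "$1")
    printf 'output_%s\n' "${name%.*}"
}

# Run run_dbcan in protein mode on every .faa file in directory $1.
annotate_all() {
    for faa in "$1"/*.faa; do
        [ -e "$faa" ] || continue   # skip when no .faa file matches
        run_dbcan "$faa" protein --out_dir "$(out_dir_for "$faa")"
    done
}
```

Calling `annotate_all .` processes the current directory, writing one output_<name> folder per input file.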

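Manual annotation with CAZyDB

1. Download the database files, taking care to get the latest version:

### The 07312020 in the file name is the release date (31 July 2020); browse the download directory to confirm the latest version
wget -c http://bcb.unl.edu/dbCAN2/download/CAZyDB.07312020.fa
wget -c http://bcb.unl.edu/dbCAN2/download/Databases/CAZyDB.07302020.fam-activities.txt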

2. Run a fast alignment with diamond

# Build the diamond reference database from the FASTA file
diamond makedb --in CAZyDB.07312020.fa --db CAZyDB.07312020
# Extract the annotation for each family
grep -v '#' CAZyDB.07302020.fam-activities.txt |sed 's/ //'| sed '1 i CAZy\tDescription' > CAZy_description.txt
### Database location: /database/CAZyDB
diamond blastp --db /database/CAZyDB/CAZyDB.07312020 --query out_pro.fa --threads 10 -e 1e-5 --outfmt 6 --max-target-seqs 1 --quiet --out ./gene_diamond.f6
# Extract the gene-to-dbCAN-family mapping table
perl ./format_dbcan2list.pl -i gene_diamond.f6 -o gene.list
# Sum abundance according to the mapping table
python ./summarizeAbundance.py -i gene.count -m gene.list -c 2 -s ',' -n raw -o ./TPM
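
For readers who want to see roughly what a gene-to-family mapping step involves, here is an awk sketch. It is not the format_dbcan2list.pl script and may differ from it in output format. It assumes that the subject IDs in column 2 of the DIAMOND tabular output look like "accession|GH5|CBM2", with CAZy families following the accession and separated by "|"; inspect the CAZyDB FASTA headers to confirm this. The function name and the output file name gene_family.tsv are hypothetical.

```shell
# Read DIAMOND --outfmt 6 lines and print "gene<TAB>family" for each family
# attached to the hit. Assumption: subject IDs (column 2) look like
# "accession|GH5|CBM2". f6_to_family_list is a hypothetical helper and is
# NOT the format_dbcan2list.pl script used above.
f6_to_family_list() {
    awk -F '\t' '{
        n = split($2, part, "|")
        for (i = 2; i <= n; i++)
            if (part[i] != "") print $1 "\t" part[i]
    }' "$@"
}

# Example: convert the alignment result produced above, if it exists.
if [ -f gene_diamond.f6 ]; then
    f6_to_family_list gene_diamond.f6 > gene_family.tsv
fi
```

A gene whose best hit carries several families yields one line per family, so downstream abundance summaries may need to decide how to handle such multi-domain proteins.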

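The scripts format_dbcan2list.pl and summarizeAbundance.py come from Yong-Xin Liu's article and GitHub repository. I will cover them in detail when time allows; alternatively, read the article and work through them yourself: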

https://doi.org/10.1002/imt2.83

YongxinLiu/EasyMicrobiome (github.com)