在进行变异检测时,以群体基因组重测序数据为例,涉及到的个体基本都是上百个,而其中大多数流程均是重复的步骤。
本文将基于GATK进行SNP calling的流程写入循环,便于批量分析。
1 涉及变量
1.工作目录work_dir/
2.参考基因组ref_genome.fa
3.Reads列表read_list.txt
4.测序平台Illumina
5.调用线程数
2 调用数据
1.参考基因组ref_genome.fa
2.重测序数据sample1/sample1_1.fq.gz
、sample1/sample1_2.fq.gz
……
3.Reads列表:read_list.txt
生成方法:预先将存放各个个体Reads的文件夹放入一个文件夹work_dir/
然后使用下列命令生成:
ls work_dir/ > read_list.txt
3 主要脚本
usage:
bash GATK_pipeline.sh work_dir/ ref_genome.fa read_list.txt Illumina 10
GATK_pipeline.sh
#---------------------------------------------------------------#
# objection defined by user #
#---------------------------------------------------------------#set -au# 1.
# Master dir.:
WORK_dir=$1# 2.
# Reference genome:
REF=$2# 3.
# Read list:
READ_list=$3# 4.
# Seqencing platform:
PL=$4# 5.
# number of threads:
NT=$5#---------------------------------------------------------------#
# main loop for SNPs calling by gatk pipeline #
#---------------------------------------------------------------##READ_list.txt is a list of read groups.
while read -r READdoSAMPLE=SM_${READ}
ID=${READ}
READ1="${WORK_dir}${READ}_1.fq"
READ2="${WORK_dir}${READ}_2.fq"
OUT="${READ}"#1.
#Alignning reads to reference genome by BWA-MEM2-mem, producing a .sam data
bwa-mem2 \mem \-M \-t ${NT} \-R "@RG\tID:${ID}\tSM:${SAMPLE}\tPL:${PL}" \${REF} \${READ1} \${READ2} \> ${OUT}.sam#2.
#Sorting .sam by gatk-SortSam, producing a .bam data
gatk \SortSam \-I ${OUT}.sam \-O ${OUT}.bam \-SO coordinate \-VALIDATION_STRINGENCY LENIENT \-CREATE_INDEX true \-TMP_DIR ./${OUT}tmp.sort
#3.
#Marking dupulications in .bam by gatk-MarkDuplicates
#producing a .dup.bam and .dup.txt data
gatk \MarkDuplicates \-I ${OUT}.bam \-O ${OUT}.dup.bam \-M ${OUT}.dup.txt \-REMOVE_DUPLICATES true \-VALIDATION_STRINGENCY LENIENT \-CREATE_INDEX true \-TMP_DIR ${OUT}tmp.dup#4.
#QC by samtools-flagstat, producing a .dup.bam.stat data
samtools \flagstat \${OUT}.dup.bam \> ${OUT}.dup.bam.stat#5.
#Calling SNPs by gatk-HaplotypeCaller, producing a .dup.vcf data
gatk \HaplotypeCaller \-R ${REF} \-I ${OUT}.dup.bam \-O ${OUT}.dup.vcfdone < $READ_list
##