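Contents
- Introduction
- cfDNAPro
- Demo
- 1. Visualizing fragment lengths
- 2. Comparing fragment length distributions
- 3. Visualizing modal DNA fragment lengths
- 4. Comparing fragmentation oscillation patterns
- 5. Polishing plots with ggplot2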
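Introduction
cfDNA (cell-free DNA, also called circulating free DNA) refers to DNA fragments found in the bloodstream. These fragments are not contained in any cell, hence "cell-free" or "circulating free". cfDNA comes from many sources: it is released when both normal and diseased cells (such as tumor cells) die and break down. cfDNA fragments are typically around 160-180 base pairs long, which matches the length of DNA protected by a nucleosome.
cfDNA research matters for non-invasive diagnosis, disease monitoring, early detection, and understanding physiological and pathological states. In oncology in particular, analyzing circulating tumor DNA (ctDNA), the fraction of cfDNA that comes from tumor cells, yields genetic information about the tumor and can guide cancer diagnosis, treatment selection, and monitoring of treatment response.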
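cfDNAPro
cfDNAPro is an R package for characterizing and visualizing cfDNA fragment size data. It covers two areas:
- Data characterization: computes the overall, median, and modal fragment size distributions, the peaks and valleys in the fragment size profile, and the oscillation periodicity.
- Data visualization: provides functions for plotting these results, from overall to single-group fragment size plots to metrics plots and mode and mode-summary plots.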
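Demo
1. Visualizing fragment lengths
- Top panel: the x-axis shows fragment length, from 30 bp to 500 bp; the y-axis shows the proportion of reads with a given fragment length. The line is not a smoothed curve; it simply connects adjacent data points with straight segments.
- Bottom panel: the cumulative version. Count the reads with length less than or equal to 30 bp (say N), normalize that count to a proportion, and repeat for every fragment length (30 bp, 31 bp, …, 500 bp); the results are drawn as a line plot. As in the non-cumulative plot, the line connects the data points and is not smoothed.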
library(cfDNAPro)
library(scales)
library(ggpubr)
library(ggplot2)
library(dplyr)

# Assumption: data_path is never defined in the original post. The example
# data bundled with cfDNAPro is used here; point it at your own directory
# of Picard insert-size metrics otherwise.
data_path <- examplePath("groups_picard")

# Define a list for the groups/cohorts.
grp_list <- list("cohort_1" = "cohort_1",
                 "cohort_2" = "cohort_2",
                 "cohort_3" = "cohort_3",
                 "cohort_4" = "cohort_4")

# Generate the plots and store them in a list (one entry per cohort).
result <- sapply(grp_list, function(x) {
  callSize(path = data_path) %>%
    dplyr::filter(group == as.character(x)) %>%
    plotSingleGroup()
}, simplify = FALSE)
#> setting default outfmt to df.
#> setting default input_type to picard.
#> setting default outfmt to df.
#> setting default input_type to picard.
#> setting default outfmt to df.
#> setting default input_type to picard.
#> setting default outfmt to df.
#> setting default input_type to picard.

# Multiplex the plots in one figure.
suppressWarnings(
  multiplex <- ggarrange(
    result$cohort_1$prop_plot + theme(axis.title.x = element_blank()),
    result$cohort_4$prop_plot + theme(axis.title = element_blank()),
    result$cohort_1$cdf_plot,
    result$cohort_4$cdf_plot + theme(axis.title.y = element_blank()),
    labels = c("Cohort 1 (n=5)", "Cohort 4 (n=4)"),
    label.x = 0.2,
    ncol = 2,
    nrow = 2
  )
)

# Show the figure.
multiplex
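The two panels described above boil down to a normalization and a running sum. The following is a minimal base-R sketch of that arithmetic on made-up counts; the vectors `insert_size` and `read_count` are hypothetical toy data, not cfDNAPro output, and this is not presented as the package's internal implementation.

```r
# Hypothetical toy histogram: number of reads observed at each fragment length.
insert_size <- 30:39
read_count  <- c(2, 3, 5, 10, 20, 25, 15, 10, 6, 4)

# Top panel: proportion of reads at each fragment length.
prop <- read_count / sum(read_count)

# Bottom panel: cumulative proportion, i.e. the share of reads whose
# length is less than or equal to each fragment length.
cdf <- cumsum(prop)

size_profile <- data.frame(insert_size, read_count, prop, cdf)
print(size_profile)
```

Plotting `prop` or `cdf` against `insert_size` with straight line segments gives the unsmoothed line plots described for the two panels.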
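2. Comparing fragment length distributions
- callMetrics: computes the median fragment size distribution for each group.
- Top panel: the median fragment size distribution of each cohort, as proportions. The y-axis shows the proportion of reads and the x-axis shows fragment size. The lines are not smoothed curves; they connect adjacent data points.
- Bottom panel: the median cumulative distribution function (CDF). The y-axis shows the cumulative proportion and the x-axis again shows fragment size. The curve rises monotonically, showing how reads accumulate as fragment size increases.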
# Set an order for those groups (i.e. the levels of factors).
order <- c("cohort_1", "cohort_2", "cohort_3", "cohort_4")

# Generate plots.
compare_grps <- callMetrics(data_path) %>% plotMetrics(order = order)
#> setting default input_type to picard.

# Modify plots.
p1 <- compare_grps$median_prop_plot +
  ylim(c(0, 0.028)) +
  theme(axis.title.x = element_blank(),
        axis.title.y = element_text(size = 12, face = "bold")) +
  theme(legend.position = c(0.7, 0.5),
        legend.text = element_text(size = 11),
        legend.title = element_blank())

p2 <- compare_grps$median_cdf_plot +
  scale_y_continuous(labels = scales::number_format(accuracy = 0.001)) +
  theme(axis.title = element_text(size = 12, face = "bold")) +
  theme(legend.position = c(0.7, 0.5),
        legend.text = element_text(size = 11),
        legend.title = element_blank())

# Finalize plots.
suppressWarnings(
  median_grps <- ggpubr::ggarrange(p1, p2, label.x = 0.3, ncol = 1, nrow = 2)
)

median_grps
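To make "median fragment size distribution" concrete, here is a small base-R sketch on made-up numbers. The matrix `prop_matrix` is hypothetical, and taking the median of the per-sample CDFs is an assumption about how such a summary could be built, not a statement of how callMetrics computes it.

```r
# Hypothetical proportions for three samples from one cohort
# (rows = samples, columns = fragment lengths in bp).
prop_matrix <- rbind(
  sample_a = c(0.10, 0.30, 0.40, 0.20),
  sample_b = c(0.20, 0.20, 0.50, 0.10),
  sample_c = c(0.15, 0.25, 0.30, 0.30)
)
colnames(prop_matrix) <- 165:168

# Median proportion at each fragment length across the samples.
median_prop <- apply(prop_matrix, 2, median)

# Per-sample cumulative proportions, then their median at each length.
# (Assumption: one plausible way to summarize CDFs across samples.)
cdf_matrix <- t(apply(prop_matrix, 1, cumsum))
median_cdf <- apply(cdf_matrix, 2, median)

print(median_prop)
print(median_cdf)
```

Because the median is taken independently at each length, a median profile is a summary across samples and does not in general sum to exactly 1.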
3. Visualizing modal DNA fragment lengths
- Bar chart: the modal fragment size is the DNA fragment length that occurs most often in a sample.
# Set an order for your groups, it will affect the group order along x axis!
order <- c("cohort_1", "cohort_2", "cohort_3", "cohort_4")

# Generate mode bin chart.
mode_bin <- callMode(data_path) %>%
  plotMode(order = order, hline = c(167, 111, 81))
#> setting default mincount as 0.
#> setting default input_type to picard.

# Show the plot.
suppressWarnings(print(mode_bin))
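As a quick illustration of what "modal fragment size" means for one sample, the sketch below picks the most frequent length from a made-up histogram. The vectors are hypothetical toy data, and this is not presented as the internals of callMode.

```r
# Hypothetical toy histogram for a single sample.
insert_size <- 160:170
read_count  <- c(40, 55, 70, 90, 120, 150, 180, 210, 170, 130, 95)

# The mode is the fragment length with the highest read count.
# which.max() returns the first maximum if there are ties.
mode_size <- insert_size[which.max(read_count)]

print(mode_size)
```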
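- Stacked bar chart: shows, for each group, how the modal fragment sizes are distributed across the length classes you specify.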
# Set an order for your groups, it will affect the group order along x axis.
order <- c("cohort_1", "cohort_2", "cohort_3", "cohort_4")

# Generate mode stacked bar chart. You could specify how to stratify the modes
# using 'mode_partition' arguments. If other modes exist other than you
# specified, an 'other' group will be added to the plot.
mode_stacked <- callMode(data_path) %>%
  plotModeSummary(order = order, mode_partition = list(c(166, 167)))
#> setting default input_type to picard.

# Modify the plot using ggplot syntax.
mode_stacked <- mode_stacked + theme(legend.position = "top")

# Show the plot.
suppressWarnings(print(mode_stacked))
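The idea of partitioning modes into named classes plus an "other" bucket can be sketched in base R as follows. The data frame `modes`, its column names, and the class label are hypothetical and chosen for illustration only; they are not the structures plotModeSummary uses internally.

```r
# Hypothetical per-sample modal fragment sizes with their cohort labels.
modes <- data.frame(
  group     = c("cohort_1", "cohort_1", "cohort_1", "cohort_2", "cohort_2"),
  mode_size = c(167, 166, 134, 167, 145)
)

# One partition covering 166 and 167 bp; everything else becomes "other".
partition <- c(166, 167)
modes$mode_class <- ifelse(modes$mode_size %in% partition, "166-167", "other")

# Count samples per group and class, then convert each row to fractions.
counts    <- table(modes$group, modes$mode_class)
fractions <- prop.table(counts, margin = 1)

print(fractions)
```

Each row of `fractions` sums to 1, which is what a stacked bar per group displays.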
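4. Comparing fragmentation oscillation patterns
- Inter-peak distance: compares the 10 bp periodic oscillation pattern across cohorts by measuring the distances between adjacent peaks.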
# Set an order for your groups, it will affect the group order.
order <- c("cohort_1", "cohort_2", "cohort_4", "cohort_3")

# Plot and modify inter-peak distances.
inter_peak_dist <- callPeakDistance(path = data_path, limit = c(50, 135)) %>%
  plotPeakDistance(order = order) +
  labs(y = "Fraction") +
  theme(axis.title = element_text(size = 12, face = "bold"),
        legend.title = element_blank(),
        legend.position = c(0.91, 0.5),
        legend.text = element_text(size = 11))
#> setting the mincount to 0.
#> setting the xlim to c(7,13).
#> setting default outfmt to df.
#> Setting default mincount to 0.
#> setting default input_type to picard.

# Show the plot.
suppressWarnings(print(inter_peak_dist))
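To show what an inter-peak distance is, the sketch below finds local maxima in a synthetic profile and takes the differences between neighbouring peak positions. The profile is made up (a rising trend plus a 10 bp cosine oscillation), and the local-maximum rule used here is a simple stand-in, not necessarily the peak-calling method in callPeakDistance.

```r
# Synthetic profile over 50-135 bp: rising trend plus a 10 bp oscillation.
insert_size <- 50:135
prop <- 0.001 +
  0.00002 * (insert_size - 50) +
  0.0005 * cos(2 * pi * (insert_size - 52) / 10)

# A point is a local maximum when the slope changes from rising (+1)
# to falling (-1), i.e. the difference of the slope signs equals -2.
# The first and last points have only one neighbour and are excluded.
is_peak <- c(FALSE, diff(sign(diff(prop))) == -2, FALSE)
peaks   <- insert_size[is_peak]

# Inter-peak distances: gaps between consecutive peak positions.
inter_peak_distance <- diff(peaks)

print(peaks)
print(table(inter_peak_distance))
```

Replacing `== -2` with `== 2` selects local minima instead, which gives the corresponding inter-valley distances.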
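- Inter-valley distance: compared with the inter-peak plot above, this visualization focuses on where read counts dip rather than where they rise. The two plots look at different features of the fragment size profile: peaks (local maxima of frequency) versus valleys (local minima of frequency).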
# Set an order for your groups, it will affect the group order.
order <- c("cohort_1", "cohort_2", "cohort_4", "cohort_3")

# Plot and modify inter-valley distances.
inter_valley_dist <- callValleyDistance(path = data_path, limit = c(50, 135)) %>%
  plotValleyDistance(order = order) +
  labs(y = "Fraction") +
  theme(axis.title = element_text(size = 12, face = "bold"),
        legend.title = element_blank(),
        legend.position = c(0.91, 0.5),
        legend.text = element_text(size = 11))
#> setting the mincount to 0.
#> setting the xlim to c(7,13).
#> setting default outfmt to df.
#> setting the mincount to 0.
#> setting default input_type to picard.

# Show the plot.
suppressWarnings(print(inter_valley_dist))
5. Polishing plots with ggplot2
library(ggplot2)
library(cfDNAPro)
# Set the path to the example sample.
exam_path <- examplePath("step6")
# Calculate peaks and valleys.
peaks <- callPeakDistance(path = exam_path)
#> setting default limit to c(35,135).
#> setting default outfmt to df.
#> Setting default mincount to 0.
#> setting default input_type to picard.
valleys <- callValleyDistance(path = exam_path)
#> setting default limit to c(35,135).
#> setting default outfmt to df.
#> setting the mincount to 0.
#> setting default input_type to picard.
# A line plot showing the fragmentation pattern of the example sample.
exam_plot_all <- callSize(path=exam_path) %>% plotSingleGroup(vline = NULL)
#> setting default outfmt to df.
#> setting default input_type to picard.
# Label peaks and valleys with dashed and solid lines.
exam_plot_prop <- exam_plot_all$prop_plot +
  coord_cartesian(xlim = c(90, 135), ylim = c(0, 0.0065)) +
  geom_vline(xintercept = peaks$insert_size, colour = "red", linetype = "dashed") +
  geom_vline(xintercept = valleys$insert_size, colour = "blue")

# Show the plot.
suppressWarnings(print(exam_plot_prop))

# Label peaks and valleys with dots.
exam_plot_prop_dot <- exam_plot_all$prop_plot +
  coord_cartesian(xlim = c(90, 135), ylim = c(0, 0.0065)) +
  geom_point(data = peaks, mapping = aes(x = insert_size, y = prop),
             color = "blue", alpha = 0.5, size = 3) +
  geom_point(data = valleys, mapping = aes(x = insert_size, y = prop),
             color = "red", alpha = 0.5, size = 3)
# Show the plot.
suppressWarnings(print(exam_plot_prop_dot))
If you want to work with cfDNA, data characterization is the first step of the analysis.