cfDNAPro | An R package for biological characterization and visualization of cfDNA fragment data

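Contents

- Preface
- cfDNAPro
- demo
  - 1. Fragment length visualization
  - 2. Comparing fragment length distributions
  - 3. Visualizing modal fragment lengths
  - 4. Comparing fragmentation oscillation patterns
  - 5. Polishing with ggplot2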

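Preface

cfDNA (cell-free DNA, also called circulating free DNA) refers to DNA fragments present in the bloodstream. These fragments are not contained in any cell, hence the terms "cell-free" and "free". cfDNA has diverse origins: it is released when normal cells and diseased cells (such as tumor cells) die and break down. cfDNA fragments are typically around 160-180 bp long, which matches the length of DNA protected by a nucleosome.

cfDNA research matters for non-invasive diagnosis, disease monitoring, early detection, and understanding physiological and pathological states. In oncology in particular, analyzing circulating tumor DNA (ctDNA), the fraction of cfDNA that originates from tumor cells, provides genetic information about the tumor and can guide cancer diagnosis, treatment selection, and monitoring of treatment response.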

cfDNAPro

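Main features:

Data characterization: computes the overall, median, and modal fragment size distributions, the peaks and valleys in the fragment size profile, and the oscillation periodicity.
Data visualization: provides a range of plotting functions for these data, from overall views down to individual fragment profiles, plus metric, mode, and summary plots.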

demo

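1. Fragment length visualization

Top panels: the x axis shows fragment length, from 30 bp to 500 bp. The y axis shows the proportion of reads with a given fragment length. The line is not a smoothed curve; it consists of straight segments connecting the individual data points.
Bottom panels: first count the reads with length less than or equal to 30 bp (say N), then normalize that count to a proportion. Repeat this for every fragment length (30 bp, 31 bp, …, 500 bp) and draw the result as a line plot. As in the non-cumulative plot, the line connects the individual data points and is not smoothed.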
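The two quantities described above can be sketched in a few lines of base R. This is only an illustration on made-up counts; the vectors `sizes` and `counts` are hypothetical and are not cfDNAPro output, and the package may compute these values differently internally.

```r
# Toy insert-size counts (hypothetical numbers, not cfDNAPro output).
sizes  <- 30:34
counts <- c(10, 20, 40, 20, 10)

# Top panel: proportion of reads at each fragment length.
prop <- counts / sum(counts)

# Bottom panel: cumulative proportion of reads with length <= each size.
cdf <- cumsum(prop)

profile <- data.frame(insert_size = sizes, prop = prop, cdf = cdf)
print(profile)
```

Plotting `prop` and `cdf` against `insert_size` with straight line segments gives the non-cumulative and cumulative views, respectively.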

library(cfDNAPro)
library(scales)
library(ggpubr)
library(ggplot2)
library(dplyr)

# Path to the input data. Assumption: the example data set bundled with
# cfDNAPro; replace this with the directory holding your own files.
data_path <- examplePath("groups_picard")

# Define a list for the groups/cohorts.
grp_list <- list("cohort_1" = "cohort_1", "cohort_2" = "cohort_2",
                 "cohort_3" = "cohort_3", "cohort_4" = "cohort_4")

# Generate the plots and store them in a list.
result <- sapply(grp_list, function(x) {
  callSize(path = data_path) %>%
    dplyr::filter(group == as.character(x)) %>%
    plotSingleGroup()
}, simplify = FALSE)
#> setting default outfmt to df.
#> setting default input_type to picard.
#> setting default outfmt to df.
#> setting default input_type to picard.
#> setting default outfmt to df.
#> setting default input_type to picard.
#> setting default outfmt to df.
#> setting default input_type to picard.

# Multiplex the plots in one figure.
suppressWarnings(
  multiplex <- ggarrange(
    result$cohort_1$prop_plot + theme(axis.title.x = element_blank()),
    result$cohort_4$prop_plot + theme(axis.title = element_blank()),
    result$cohort_1$cdf_plot,
    result$cohort_4$cdf_plot + theme(axis.title.y = element_blank()),
    labels = c("Cohort 1 (n=5)", "Cohort 4 (n=4)"),
    label.x = 0.2, ncol = 2, nrow = 2
  )
)
multiplex

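2. Comparing fragment length distributions

callMetrics: computes the median fragment size distribution for each group.
Top panel: the median fragment size distribution of each cohort, as proportions. The y axis shows the proportion of reads and the x axis shows fragment size. The lines are not smoothed curves; they connect the individual data points.
Bottom panel: the median cumulative distribution function (CDF). The y axis shows the cumulative proportion and the x axis again shows fragment size. The curve rises step by step, reflecting how reads accumulate as fragment size increases.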
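To make the idea of a per-group median distribution concrete, here is a minimal base R sketch on a made-up table. It assumes the median is taken across samples within a group at each fragment length; the data frame `toy` and its values are hypothetical, and this is not a reimplementation of `callMetrics`.

```r
# Toy per-sample proportions (hypothetical): 2 groups x 3 samples x 3 sizes.
toy <- data.frame(
  group       = rep(c("cohort_1", "cohort_2"), each = 9),
  sample      = rep(paste0("s", 1:6), each = 3),
  insert_size = rep(c(166, 167, 168), times = 6),
  prop        = c(0.30, 0.40, 0.30,
                  0.20, 0.50, 0.30,
                  0.25, 0.45, 0.30,
                  0.50, 0.30, 0.20,
                  0.40, 0.40, 0.20,
                  0.45, 0.35, 0.20)
)

# Median proportion across samples, per group and fragment length.
median_prop <- aggregate(prop ~ group + insert_size, data = toy, FUN = median)
print(median_prop)
```

Each row of `median_prop` corresponds to one point on a group's line in the top panel.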

# Set an order for those groups (i.e. the levels of factors).
order <- c("cohort_1", "cohort_2", "cohort_3", "cohort_4")

# Generate plots.
compare_grps <- callMetrics(data_path) %>% plotMetrics(order = order)
#> setting default input_type to picard.

# Modify plots.
p1 <- compare_grps$median_prop_plot +
  ylim(c(0, 0.028)) +
  theme(axis.title.x = element_blank(),
        axis.title.y = element_text(size = 12, face = "bold")) +
  theme(legend.position = c(0.7, 0.5),
        legend.text = element_text(size = 11),
        legend.title = element_blank())

p2 <- compare_grps$median_cdf_plot +
  scale_y_continuous(labels = scales::number_format(accuracy = 0.001)) +
  theme(axis.title = element_text(size = 12, face = "bold")) +
  theme(legend.position = c(0.7, 0.5),
        legend.text = element_text(size = 11),
        legend.title = element_blank())

# Finalize plots.
suppressWarnings(
  median_grps <- ggpubr::ggarrange(p1, p2, label.x = 0.3, ncol = 1, nrow = 2)
)
median_grps

3. Visualizing modal fragment lengths

Bar chart: the modal fragment size is the DNA fragment length that occurs most often in a sample.
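As a small illustration of that definition, the sketch below finds the modal fragment length per sample from a made-up count table. The data frame `toy` is hypothetical, and `which.max` returns the first maximum if there is a tie; `callMode` may handle such cases differently.

```r
# Toy fragment counts for two samples (hypothetical values).
toy <- data.frame(
  sample      = rep(c("s1", "s2"), each = 4),
  insert_size = rep(c(165, 166, 167, 168), times = 2),
  count       = c(100, 300, 500, 200,
                  150, 450, 400, 100)
)

# Modal fragment length per sample: the size with the highest count.
modes <- sapply(split(toy, toy$sample), function(d) {
  d$insert_size[which.max(d$count)]
})
print(modes)
```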

# Set an order for your groups, it will affect the group order along x axis!
order <- c("cohort_1", "cohort_2", "cohort_3", "cohort_4")

# Generate mode bin chart.
mode_bin <- callMode(data_path) %>%
  plotMode(order = order, hline = c(167, 111, 81))
#> setting default mincount as 0.
#> setting default input_type to picard.

# Show the plot.
suppressWarnings(print(mode_bin))

Stacked bar chart: shows how the modal fragment lengths are distributed within each group.

# Set an order for your groups, it will affect the group order along x axis.
order <- c("cohort_1", "cohort_2", "cohort_3", "cohort_4")

# Generate mode stacked bar chart. You could specify how to stratify the modes
# using 'mode_partition' arguments. If other modes exist other than you
# specified, an 'other' group will be added to the plot.
mode_stacked <- callMode(data_path) %>%
  plotModeSummary(order = order, mode_partition = list(c(166, 167)))
#> setting default input_type to picard.

# Modify the plot using ggplot syntax.
mode_stacked <- mode_stacked + theme(legend.position = "top")

# Show the plot.
suppressWarnings(print(mode_stacked))

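4. Comparing fragmentation oscillation patterns

Inter-peak distance: compares the 10 bp periodic oscillation pattern across cohorts by measuring the distances between adjacent peaks.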
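The notion of inter-peak distance can be illustrated with a naive local-maximum rule on a synthetic profile. This is a sketch under stated assumptions: the profile is an artificial cosine with a 10 bp period, and the peak rule (strictly higher than both neighbours) is chosen for illustration and is not necessarily what `callPeakDistance` does internally.

```r
# Synthetic fragment-size profile with a 10 bp oscillation (hypothetical).
sizes <- 50:135
prop  <- 0.01 + 0.002 * cos(2 * pi * (sizes - 1) / 10)

# A point is a peak if it is strictly higher than both of its neighbours.
i       <- 2:(length(prop) - 1)
is_peak <- prop[i] > prop[i - 1] & prop[i] > prop[i + 1]
peaks   <- sizes[i][is_peak]

# Inter-peak distances: gaps between consecutive peak positions.
inter_peak <- diff(peaks)
print(peaks)
print(table(inter_peak))
```

On real data the distances scatter around 10 bp, and it is the distribution of those distances that is compared between cohorts.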

# Set an order for your groups, it will affect the group order.
order <- c("cohort_1", "cohort_2", "cohort_4", "cohort_3")

# Plot and modify inter-peak distances.
inter_peak_dist <- callPeakDistance(path = data_path, limit = c(50, 135)) %>%
  plotPeakDistance(order = order) +
  labs(y = "Fraction") +
  theme(axis.title = element_text(size = 12, face = "bold"),
        legend.title = element_blank(),
        legend.position = c(0.91, 0.5),
        legend.text = element_text(size = 11))
#> setting the mincount to 0.
#> setting the xlim to c(7,13).
#> setting default outfmt to df.
#> Setting default mincount to 0.
#> setting default input_type to picard.

# Show the plot.
suppressWarnings(print(inter_peak_dist))

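Inter-valley distance: compared with the inter-peak distance plot, the inter-valley distance plot focuses on the regions where read counts dip rather than where they rise. The two plots capture different features of the fragment size profile: one looks at peaks (local maxima of frequency), the other at valleys (local minima of frequency).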

# Set an order for your groups, it will affect the group order.
order <- c("cohort_1", "cohort_2", "cohort_4", "cohort_3")

# Plot and modify inter-valley distances.
inter_valley_dist <- callValleyDistance(path = data_path, limit = c(50, 135)) %>%
  plotValleyDistance(order = order) +
  labs(y = "Fraction") +
  theme(axis.title = element_text(size = 12, face = "bold"),
        legend.title = element_blank(),
        legend.position = c(0.91, 0.5),
        legend.text = element_text(size = 11))
#> setting the mincount to 0.
#> setting the xlim to c(7,13).
#> setting default outfmt to df.
#> setting the mincount to 0.
#> setting default input_type to picard.

# Show the plot.
suppressWarnings(print(inter_valley_dist))

5. Polishing with ggplot2

library(ggplot2)
library(cfDNAPro)
# Set the path to the example sample.
exam_path <- examplePath("step6")
# Calculate peaks and valleys.
peaks <- callPeakDistance(path = exam_path) 
#> setting default limit to c(35,135).
#> setting default outfmt to df.
#> Setting default mincount to 0.
#> setting default input_type to picard.
valleys <- callValleyDistance(path = exam_path) 
#> setting default limit to c(35,135).
#> setting default outfmt to df.
#> setting the mincount to 0.
#> setting default input_type to picard.
# A line plot showing the fragmentation pattern of the example sample.
exam_plot_all <- callSize(path=exam_path) %>% plotSingleGroup(vline = NULL)
#> setting default outfmt to df.
#> setting default input_type to picard.
# Label peaks and valleys with dashed and solid lines.
exam_plot_prop <- exam_plot_all$prop +
  coord_cartesian(xlim = c(90, 135), ylim = c(0, 0.0065)) +
  geom_vline(xintercept = peaks$insert_size, colour = "red", linetype = "dashed") +
  geom_vline(xintercept = valleys$insert_size, colour = "blue")

# Show the plot.
suppressWarnings(print(exam_plot_prop))

# Label peaks and valleys with dots.
exam_plot_prop_dot <- exam_plot_all$prop +
  coord_cartesian(xlim = c(90, 135), ylim = c(0, 0.0065)) +
  geom_point(data = peaks, mapping = aes(x = insert_size, y = prop),
             color = "blue", alpha = 0.5, size = 3) +
  geom_point(data = valleys, mapping = aes(x = insert_size, y = prop),
             color = "red", alpha = 0.5, size = 3)
# Show the plot.
suppressWarnings(print(exam_plot_prop_dot))