clusterprolifer go kegg msigdbr 富集分析应该使用哪个数据集,GO?KEGG?Hallmark?

关注微信:生信小博士 

5 Overview of enrichment analysis

Chapter 5 Overview of enrichment analysis | Biomedical Knowledge Mining using GOSemSim and clusterProfiler

5.1.2 Gene Ontology (GO)

Gene Ontology defines concepts/classes used to describe gene function, and relationships between these concepts. It classifies functions along three aspects:

  • MF: Molecular Function
    • molecular activities of gene products
  • CC: Cellular Component
    • where gene products are active
  • BP: Biological Process
    • pathways and larger processes made up of the activities of multiple gene products

GO terms are organized in a directed acyclic graph, where edges between terms represent parent-child relationship.

5.1.3 Kyoto Encyclopedia of Genes and Genomes (KEGG)

KEGG is a collection of manually drawn pathway maps representing molecular interaction and reaction networks. These pathways cover a wide range of biochemical processes that can be divided into 7 broad categories:

  1. Metabolism
  2. Genetic information processing
  3. Environmental information processing
  4. Cellular processes
  5. Organismal systems
  6. Human diseases
  7. Drug development.

5.1.4 Other gene sets

GO and KEGG are the most frequently used for functional analysis. They are typically the first choice because of their long-standing curation and availability for a wide range of species.

Other gene sets include but are not limited to Disease Ontology (DO), Disease Gene Network (DisGeNET), wikiPathways, Molecular Signatures Database (MSigDb).

富集分析的原理

Over Representation Analysis (ORA)和Gene Set Enrichment Analysis (GSEA)是用于分析基因表达数据的两种常见方法。ORA基于预定义的基因集,用超几何分布计算差异表达基因与该基因集中的基因数之间的显著性,适用于已知基因集和差异表达较大情况。GSEA考虑所有基因,根据基因在表型排序上的变化来寻找整个基因集是否显著富集,比ORA更适用于寻找微小但协调变化的基因集。GSEA还包括较新的leading edge analysis和core enriched genes等功能,能够帮助确定影响基因富集的关键基因。两种方法各有优缺点,在不同的研究问题中可以灵活选用。

5.2 原理1 Over Representation Analysis

Example: Suppose we have 17,980 genes detected in a Microarray study and 57 genes were differentially expressed. Among the differentially expressed genes, 28 are annotated to a gene set1.

d <- data.frame(gene.not.interest=c(2613, 15310), gene.in.interest=c(28, 29))
row.names(d) <- c("In_category", "not_in_category")
d

##                 gene.not.interest gene.in.interest
## In_category                  2613               28
## not_in_category             15310               29

Whether the overlap(s) of 25 genes are significantly over represented in the gene set can be assessed using a hypergeometric distribution. This corresponds to a one-sided version of Fisher’s exact test.

fisher.test(d, alternative = "greater")

 5.3 原理2 Gene Set Enrichment Analysis

ORA的方法,对于差异明显的对比结果较好,但是对于差异小的结果不好

A common approach to analyzing gene expression profiles is identifying differentially expressed genes that are deemed interesting. The ORA enrichment analysis is based on these differentially expressed genes. This approach will find genes where the difference is large and will fail where the difference is small, but evidenced in coordinated way in a set of related genes. Gene Set Enrichment Analysis (GSEA)(Subramanian et al. 2005) directly addresses this limitation. All genes can be used in GSEA; GSEA aggregates the per gene statistics across genes within a gene set, therefore making it possible to detect situations where all genes in a predefined set change in a small but coordinated way. This is important since it is likely that many relevant phenotypic differences are manifested by small but consistent changes in a set of genes.

Genes are ranked based on their phenotypes. Given apriori defined set of gene S (e.g., genes sharing the same DO category), the goal of GSEA is to determine whether the members of S are randomly distributed throughout the ranked gene list (L) or primarily found at the top or bottom.

There are three key elements of the GSEA method:

  • Calculation of an Enrichment Score.
    • The enrichment score (ES) represents the degree to which a set S is over-represented at the top or bottom of the ranked list L. The score is calculated by walking down the list L, increasing a running-sum statistic when we encounter a gene in S and decreasing when it is not encountered. The magnitude of the increment depends on the gene statistics (e.g., correlation of the gene with phenotype). The ES is the maximum deviation from zero encountered in the random walk; it corresponds to a weighted Kolmogorov-Smirnov(KS)-like statistic (Subramanian et al. 2005).
  • Esimation of Significance Level of ES.
    • The p-value of the ES is calculated using a permutation test. Specifically, we permute the gene labels of the gene list L and recompute the ES of the gene set for the permutated data, which generate a null distribution for the ES. The p-value of the observed ES is then calculated relative to this null distribution.
  • Adjustment for Multiple Hypothesis Testing.
    • When the entire gene sets are evaluated, the estimated significance level is adjusted to account for multiple hypothesis testing and also q-values are calculated for FDR control.

We implemented the GSEA algorithm proposed by Subramanian (Subramanian et al. 2005). Alexey Sergushichev implemented an algorithm for fast GSEA calculation in the fgsea (Korotkevich, Sukhov, and Sergushichev 2019) package. In our packages (clusterProfiler, DOSE, meshes and ReactomePA), users can use the GSEA algorithm implemented in DOSE or fgsea by specifying the parameter by="DOSE" or by="fgsea". By default, the fgsea method will be used since it is much more faster.

5.4 原理3 Leading edge analysis and core enriched genes

Leading edge analysis reports Tags to indicate the percentage of genes contributing to the enrichment score, List to indicate where in the list the enrichment score is attained and Signal for enrichment signal strength.

It would also be very interesting to get the core enriched genes that contribute to the enrichment. Our packages (clusterProfiler, DOSE, meshes and ReactomePA) support leading edge analysis and report core enriched genes in GSEA analysis.

详细的富集分析使用范围及其原理

1 GO 富集

GO comprises three orthogonal ontologies, i.e. molecular function (MF), biological process (BP), and cellular component (CC).

6.1 Supported organisms

GO analyses (groupGO()enrichGO() and gseGO()) support organisms that have an OrgDb object available (see also session 2.2).

If a user has GO annotation data (in a data.frame format with the first column as gene ID and the second column as GO ID), they can use the enricher() and GSEA() functions to perform an over-representation test and gene set enrichment analysis.

If the genes are annotated by direction annotation, they should also be annotated by their ancestor GO nodes (indirect annotation). If a user only has direct annotation, they can pass their annotation to the buildGOmap function, which will infer indirect annotation and generate a data.frame that is suitable for both enricher() and GSEA().

6.2 groupGo是 对基因的分类

In clusterProfiler, the groupGO() function is designed for gene classification based on GO distribution at a specific level. Here we use the dataset geneList provided by DOSE.

library(clusterProfiler)
data(geneList, package="DOSE")
gene <- names(geneList)[abs(geneList) > 2]# Entrez gene ID
head(gene)

## [1] "4312"  "8318"  "10874" "55143" "55388" "991"
ggo <- groupGO(gene     = gene,OrgDb    = org.Hs.eg.db,ont      = "CC",level    = 3,readable = TRUE)head(ggo)

6.3 enrichGO()是通过ORA对基因进行富集分析

The clusterProfiler package implements enrichGO() for gene ontology over-representation test.

ego <- enrichGO(gene          = gene,universe      = names(geneList),OrgDb         = org.Hs.eg.db,ont           = "CC",pAdjustMethod = "BH",pvalueCutoff  = 0.01,qvalueCutoff  = 0.05,readable      = TRUE)
head(ego)

gene.df <- bitr(gene, fromType = "ENTREZID",toType = c("ENSEMBL", "SYMBOL"),OrgDb = org.Hs.eg.db)ego2 <- enrichGO(gene         = gene.df$ENSEMBL,OrgDb         = org.Hs.eg.db,keyType       = 'ENSEMBL',ont           = "CC",pAdjustMethod = "BH",pvalueCutoff  = 0.01,qvalueCutoff  = 0.05)
head(ego2, 3)    

6.4   gseGO() 是对基因集进行gsea分析,而GSEA是对基因列表基因gsea分析??

The clusterProfiler package provides the gseGO() function for gene set enrichment analysis using gene ontology.

ego3 <- gseGO(geneList     = geneList,OrgDb        = org.Hs.eg.db,ont          = "CC",minGSSize    = 100,maxGSSize    = 500,pvalueCutoff = 0.05,verbose      = FALSE)

The format of input data, geneList, was documented in the FAQ. Beware that only gene Set size in [minGSSize, maxGSSize] will be tested.

A.1 How to prepare your own geneList

GSEA analysis requires a ranked gene list, which contains three features:

  • numeric vector: fold change or other type of numerical variable
  • named vector: every number has a name, the corresponding gene ID
  • sorted vector: number should be sorted in decreasing order

If you import your data from a csv file, the file should contains two columns, one for gene ID (no duplicated ID allowed) and another one for fold change. You can prepare your own geneList via the following command:

d = read.csv(your_csv_file)
## assume 1st column is ID
## 2nd column is FC## feature 1: numeric vector
geneList = d[,2]## feature 2: named vector
names(geneList) = as.character(d[,1])## feature 3: decreasing orde
geneList = sort(geneList, decreasing = TRUE)

7.2 KEGG pathway ORA

 Input ID type can be keggncbi-geneidncbi-proteinid or uniprot (see also session 16.1.2). Unlike enrichGO(), there is no readable parameter for enrichKEGG(). However, users can use the setReadable() function if there is an OrgDb available for the species.

data(geneList, package="DOSE")
gene <- names(geneList)[abs(geneList) > 2]kk <- enrichKEGG(gene         = gene,organism     = 'hsa',pvalueCutoff = 0.05)
head(kk)
##                ID                                                   Description
## hsa04110 hsa04110                                                    Cell cycle
## hsa04114 hsa04114                                                Oocyte meiosis
## hsa04218 hsa04218                                           Cellular senescence
## hsa04061 hsa04061 Viral protein interaction with cytokine and cytokine receptor
## hsa03320 hsa03320                                        PPAR signaling pathway
## hsa04914 hsa04914                       Progesterone-mediated oocyte maturation
##          GeneRatio  BgRatio       pvalue     p.adjust       qvalue
## hsa04110     11/94 126/8142 1.829412e-07 3.841764e-05 3.774365e-05
## hsa04114     10/94 131/8142 2.368439e-06 2.486861e-04 2.443231e-04
## hsa04218     10/94 156/8142 1.135672e-05 7.949704e-04 7.810235e-04
## hsa04061      8/94 100/8142 1.821466e-05 9.562698e-04 9.394931e-04
## hsa03320      7/94  75/8142 2.285993e-05 9.601169e-04 9.432728e-04
## hsa04914      7/94 102/8142 1.651911e-04 5.781690e-03 5.680256e-03
##                                                      geneID Count
## hsa04110 8318/991/9133/890/983/4085/7272/1111/891/4174/9232    11
## hsa04114    991/9133/983/4085/51806/6790/891/9232/3708/5241    10
## hsa04218     2305/4605/9133/890/983/51806/1111/891/776/3708    10
## hsa04061           3627/10563/6373/4283/6362/6355/9547/1524     8
## hsa03320                 4312/9415/9370/5105/2167/3158/5346     7
## hsa04914                    9133/890/983/4085/6790/891/5241     7

7.3 KEGG pathway GSEA

kk2 <- gseKEGG(geneList     = geneList,organism     = 'hsa',minGSSize    = 120,pvalueCutoff = 0.05,verbose      = FALSE)
head(kk2)

##                ID                              Description setSize
## hsa05169 hsa05169             Epstein-Barr virus infection     193
## hsa04613 hsa04613  Neutrophil extracellular trap formation     130
## hsa05166 hsa05166  Human T-cell leukemia virus 1 infection     202
## hsa04510 hsa04510                           Focal adhesion     190
## hsa04218 hsa04218                      Cellular senescence     141
## hsa05170 hsa05170 Human immunodeficiency virus 1 infection     189
##          enrichmentScore       NES       pvalue     p.adjust      qvalues rank
## hsa05169       0.4335010  1.962039 5.505487e-08 4.239225e-06 2.839672e-06 2820
## hsa04613       0.4496569  1.934454 6.345081e-06 2.442856e-04 1.636363e-04 2575
## hsa05166       0.3893613  1.755469 1.357530e-05 3.484327e-04 2.333999e-04 1955
## hsa04510      -0.4169068 -1.691234 2.771385e-05 5.334917e-04 3.573629e-04 2183
## hsa04218       0.4115945  1.783044 5.168200e-05 7.959028e-04 5.331406e-04 1155
## hsa05170       0.3711150  1.677643 1.118012e-04 1.229814e-03 8.237986e-04 3178
##                            leading_edge
## hsa05169 tags=39%, list=23%, signal=31%
## hsa04613 tags=37%, list=21%, signal=30%
## hsa05166 tags=26%, list=16%, signal=22%
## hsa04510 tags=27%, list=17%, signal=22%
## hsa04218  tags=17%, list=9%, signal=16%
## hsa05170 tags=38%, list=25%, signal=29%
##                                                                                                                                                                                                                                                                                                                                                                                core_enrichment
## hsa05169 3627/890/6890/9636/898/9134/6502/6772/3126/3112/4609/917/5709/1869/3654/919/915/4067/4938/864/4940/5713/5336/11047/3066/54205/1871/578/1019/637/916/3383/4939/10213/23586/4793/5603/7979/7128/6891/930/5714/3452/6850/5702/4794/7124/3569/7097/5708/2208/8772/3119/5704/7186/5971/3135/1380/958/5610/4792/10018/8819/3134/10379/9641/1147/5718/6300/3109/811/5606/2923/3108/5707/1432
## hsa04613                                                                                                                                      820/366/51311/64581/3015/85236/55506/8970/8357/1535/2359/5336/4688/92815/3066/8336/292/1991/3689/8345/5603/4689/5880/10105/1184/6404/3018/6850/5604/3014/7097/1378/8290/1536/834/5605/1183/728/2215/8335/5594/9734/3674/5578/5582/7417/8331/6300
## hsa05166                                                                                                                      991/9133/890/4085/7850/1111/9232/8061/701/9700/898/4316/9134/3932/3559/3126/3112/4609/3561/917/1869/1029/915/114/2005/5902/55697/1871/1031/2224/292/1019/3689/916/3383/11200/706/3600/6513/3601/468/5604/7124/1030/3569/4049/4055/10393/3119/5901/5971/1959/3135
## hsa04510                                                                                                                   5595/5228/7424/1499/4636/83660/7059/5295/1288/23396/3910/3371/3082/1291/394/3791/7450/596/3685/1280/3675/595/2318/3912/1793/1278/1277/1293/10398/55742/2317/7058/25759/56034/3693/3480/5159/857/1292/3908/3909/63923/3913/1287/3679/7060/3479/10451/80310/1311/1101
## hsa04218                                                                                                                                                                                                                                                                 2305/4605/9133/890/983/51806/1111/891/993/3576/1978/898/9134/4609/1869/1029/22808/1871/5499/91860/292/1019/11200/1875
## hsa05170                   9133/9582/983/51806/1111/891/6890/200315/917/3654/919/915/5336/54205/91860/578/995/25939/637/1234/916/5603/2792/5880/6891/6921/3452/5604/7124/9978/7097/7852/8772/1174/7186/3135/164/60489/2787/356/7133/5605/27350/6199/1642/5594/4792/3134/5578/4893/8454/5582/2786/1147/3984/6300/200316/811/5606/2923/4775/162/1432/2784/836/5747/5058/3106/2770/5534/5579/4615

7.4 KEGG module ORA. 这里的module指的是msigdb种收集的其他kegg内容,共有

KEGG Module is a collection of manually defined function units. In some situation, KEGG Modules have a more straightforward interpretation.

mkk <- enrichMKEGG(gene = gene,organism = 'hsa',pvalueCutoff = 1,qvalueCutoff = 1)
head(mkk)       

3.KEGG 分类备注

KEGG Pathway可以进行分类汇总。

在KEGG官网(KEGG PATHWAY Database)中,将通路分成了7类:

1. Metabolism (代谢)

2. Genetic Information Processing (遗传信息处理)

3. Environmental Information Processing (环境信息处理)

4. Cellular Processes (细胞过程)

5. Organismal Systems (生物系统)

6. Human Diseases (人类疾病)

7. Drug Development (药物开发)

做出一个分类汇总图,这样可以在更上一层级研究pathway。主要应用于特殊物种的denovo测序注释,也可以应用于我们常规的pathway富集分析。

7.5 KEGG module gene set enrichment analysis

mkk2 <- gseMKEGG(geneList = geneList,organism = 'hsa',pvalueCutoff = 1)
head(mkk2)

通路可视化

library("pathview")
hsa04110 <- pathview(gene.data  = geneList,
                     pathway.id = "hsa04110",
                     species    = "hsa",
                     limit      = list(gene=max(abs(geneList)), cpd=1))

browseKEGG(kk, 'hsa04110')

 

本文来自互联网用户投稿,该文观点仅代表作者本人,不代表本站立场。本站仅提供信息存储空间服务,不拥有所有权,不承担相关法律责任。如若转载,请注明出处:http://www.mzph.cn/news/128545.shtml

如若内容造成侵权/违法违规/事实不符,请联系多彩编程网进行投诉反馈email:809451989@qq.com,一经查实,立即删除!

相关文章

网络安全进阶学习第二十一课——XXE

文章目录 一、XXE简介二、XXE原理三、XXE危害四、XXE如何寻找五、XXE限制条件六、XXE分类七、XXE利用1、读取任意文件1.1、有回显1.2、没有回显 2、命令执行&#xff08;情况相对较少见&#xff09;3、内网探测/SSRF4、拒绝服务攻击(DDoS)4.1、内部实体4.2、参数实体 八、绕过基…

rpm 软件包管理工具

RPM&#xff08;RedHat Package Manager&#xff09;&#xff0c;RedHat软件包管理工具。 rpm 查询 rpm -qa #查询所有包(query all)rpm -qa |grep firefox #firefox-102.15.0-1.el7.centos.x86_64rpm -qi | grep firefox #(query information) #Name : firefox #…

Scala语言用Selenium库写一个爬虫模版

首先&#xff0c;我将使用Scala编写一个使用Selenium库下载yuanfudao内容的下载器程序。 然后我们需要在项目的build.sbt文件中添加selenium的依赖项。以下是添加Selenium依赖项的代码&#xff1a; libraryDependencies "org.openqa.selenium" % "selenium-ja…

2.4G合封芯片 XL2422,集成M0核MCU,高性能 低功耗

XL2422芯片是一款高性能低功耗的SOC集成无线收发芯片&#xff0c;集成M0核MCU&#xff0c;工作在2.400~2.483GHz世界通用ISM频段。该芯片集成了射频接收器、射频发射器、频率综合器、GFSK调制器、GFSK解调器等功能模块&#xff0c;并且支持一对多线网和带ACK的通信模式。发射输…

Proteus仿真--基于51单片机的走马灯实现(仿真文件+程序)

本文主要介绍基于51单片机的走马灯仿真&#xff08;完整仿真源文件及代码见文末链接&#xff09; 本设计中有16个LED灯用于流水走马演示&#xff0c;一位数码管用于显示当前模式状态&#xff0c;3个按键分别用于选择模式及加减速度控制 仿真图如下 其中 K1&#xff1a;用于模…

Linux设置ssh免密登录

ssh连接其他服务器 基本语法 ssh 另一台机器的ip地址 连接后输入连接主机用户的密码&#xff0c;即可成功连接。 输入exit 可以登出&#xff1b; 由于我配置了主机映射所以可以不写ip直接写映射的主机名即可&#xff0c;Linux配置主机映射的操作为 vim /etc/hosts # 我自己…

STM32循迹小车原理介绍和代码示例

目录 1. 循迹模块介绍 2. 循迹小车原理 3. 循迹小车核心代码 4. 循迹小车解决转弯平滑问题 1. 循迹模块介绍 TCRT5000传感器的红外发射二极管不断发射红外线当发射出的红外线没有被反射回来或被反射回来但强度不够大时红外接收管一直处于关断状态&#xff0c;此时模块的输出…

如何实现异步通知的重试机制

工作中经常要和第三方做对接&#xff0c;比如支付、电子合同等系统。操作成功之后&#xff0c;第三方会发送异步的通知&#xff0c;返回最终的处理结果&#xff0c;使用异步而不是使用同步通知&#xff0c;是为了加快系统响应速度&#xff0c;防止线程阻塞。任务处理完成后通过…

【使用Python编写游戏辅助工具】第二篇:键盘监听的应用

前言 这里是【使用Python编写游戏辅助工具】的第二篇&#xff1a;键盘监听的应用。本文主要介绍使用Python实现事件监听功能。 键盘监听是指通过编程的方式监控用户在键盘上的按键操作。 在这里键盘监听的主要用途是&#xff1a; 监听我们按下的按键&#xff0c;如果按下了指…

JSPv2之El

​ (一)EL的基本语法 1优点 1 jsp的java太长了,el自己的语言${ 开始 }结束 2el直接返回空字符转,而java直接报错 3使用“lt”代替“<”运算符&#xff0c;如果运算符后面是数字&#xff0c;在运算符 *EL取值时&#xff0c;没有数组的下标越界&#xff0c;没有…

面试常考:从lc24《两两交换链表中的节点》 到 lc25《K 个一组翻转链表》带你认识链表递归

1 lc24《两两交换链表中的节点》 1.1 描述 1.2 题解 1.2.1 递归解法 下面的三行注释要理解透彻&#xff0c; public ListNode swapPairs(ListNode head) {if(headnull||head.nextnull)return head;// 具体的两两交换过程ListNode nexthead.next; ListNode nextNexthead.next…

Apache ECharts简介和相关操作

文章目录 一、Apache ECharts介绍二、快速入门1.下载echarts.js文件2.新建index.html文件3.准备一个DOM容器用于显示图表4.完整代码展示5.相关配置 三、演示效果四、总结 一、Apache ECharts介绍 Apache ECharts 是一款基于 Javascript 的数据可视化图表库&#xff0c;提供直观…

Android java Handler sendMessage使用Parcelable传递实例化对象,我这里传递Bitmap 图片数据

一、Bundle给我们提供了一个putParcelable(key,value)的方法。专门用于传递实例化对象。 二、我这里传递Bitmap 图片数据&#xff0c;实际使用可以成功传统图像数据。 发送&#xff1a;Bundle bundle new Bundle();bundle.putParcelable("bitmap",bitmap);msg.setD…

Python库Requests的爬虫程序爬取视频通用模版

目录 一、引言 二、Requests库介绍 三、通用视频爬虫模板设计 1、确定目标网站和视频页面结构 2、发送HTTP请求获取页面内容 3、解析HTML内容提取视频链接 4、下载视频文件 四、模板应用与实践 五、注意事项 总结与展望 一、引言 随着互联网的发展&#xff0c;视频内…

系列四、全局配置文件mybatis-config.xml

一、全局配置文件中的属性 mybatis全局配置中的文件非常多&#xff0c;主要有如下几个&#xff1a; properties&#xff08;属性&#xff09;settings&#xff08;全局配置参数&#xff09;typeAliases&#xff08;类型别名&#xff09;typeHandlers&#xff08;类型处理器&am…

全球高分辨率地表太阳辐射数据集包含36年(1983.7-2018.12)

简介&#xff1a; 全球高分辨率地表太阳辐射数据集包含36年&#xff08;1983.7-2018.12&#xff09;的全球地表太阳辐射数据&#xff0c;其分辨率为3小时&#xff0c;10公里&#xff0c;数据单位为W/㎡&#xff0c;瞬时值。该数据集可用于水文建模、地表建模和工程应用&#x…

车载电子电器架构 —— 基于AP定义车载HPC

我是穿拖鞋的汉子,魔都中坚持长期主义的汽车电子工程师。 老规矩,分享一段喜欢的文字,避免自己成为高知识低文化的工程师: 屏蔽力是信息过载时代一个人的特殊竞争力,任何消耗你的人和事,多看一眼都是你的不对。非必要不费力证明自己,无利益不试图说服别人,是精神上的节…

kafka动态认证 自定义认证 安全认证-亲测成功

kafka动态认证 自定义认证 安全认证-亲测成功 背景 Kafka默认是没有安全机制的&#xff0c;一直在裸奔。用户认证功能&#xff0c;是一个成熟组件不可或缺的功能。在0.9版本以前kafka是没有用户认证模块的&#xff08;或者说只有SSL&#xff09;&#xff0c;好在kafka0.9版本…

OpenGL_Learn04

我这边并不是教程&#xff0c;只是学习记录&#xff0c;方便后面回顾&#xff0c;代码均是100%可以运行成功的。 1. 渐变三角形 #include <glad/glad.h> #include <GLFW/glfw3.h>#include <iostream> #include <cmath>void framebuffer_size_callba…