生信-ChAMP学习笔记

未分类 2018-11-06 1,193 次浏览 Comments Off on 生信-ChAMP学习笔记

ChAMP芯片甲基化数据分析管道

Yuan Tian, Tiffany J Morris, Amy P Webster, Zhen Yang, Stephan Beck, Andrew Feber, and Andrew E Teschendorff

Zifeng译

2018-10-30

1 新更新

ChAMP paper 已经在 Bioconductor上出版了! 作为 ChAMP的作者, 我们非常感谢您的帮助和建议,我们会努力将这个工具包做得越来越好.

如果你对 ChAMP有任何的bug提交或者建议,请发邮 champ450k@gmail.com tianyuan1991hit@gmail.com.

2 介绍

ChAMP 工具包被设计用于 Illumina Methylation bead array data (EPIC and 450k) 芯片数据的分析, 整合提供了目前可用的450k和EPIC分析的管道。这包括各种不同类型的数据导入方法(例如来自.idat文件或β值矩阵)和质量控制图。 Type-2探头校正方法包括SWAN1,Peak Based Correction(PBC)和BMIQ(默认选择)。 minfi软件包提供了流行的Functional Normalization功能。singular value decomposition(SVD)方法允许深入研究batch effects,并且为了校正多个batch effects,加入了ComBat方法。可以通过RefbaseEWAS9调整细胞类型的异质性。 ChAMP还包括用于从450k或EPIC数据中推断拷贝数改变的功能。为了鉴定差异甲基化区域(DMR),除了先前的DMR检测功能Bumphunter12和DMRcate13之外,ChAMP还提供了新的Probe Lasso方法。对于需要找到差异化甲基化区块的用户,新版本的ChAMP包括检测这些区域的功能。Gene Set Enrichment Analysis(GSEA)也是可能的,新版的ChAMP采用了一些方法来纠正由基因之间不均匀的探针表现引起的偏倚。此外,新版的ChAMP还整合了FEM包,它可以推断出用户指定的基因网络中的基因模型,这些基因模型在不同表型之间表现出不同的甲基化。

尽管有许多其他的管道和封装可用于450k或EPIC阵列分析(包括IMA,minfi,methylumi,RnBeads和wateRmelon),但ChAMP提供了更全面,更完整的分析流程,从读取原始数据文件到最终的三级分析结果,如GSEA,它简化了研究人员的甲基化阵列分析。新版ChAMP还提供一系列基于Shiny和Plotly的WebBrower交互式分析功能(GUI功能),以帮助科学家查看ChAMP分析结果。这需要一个基于WebBrower的交互式框架,用于本地或远程调用Graph System。例如,X11适用于大多数Linux系统。

3 安装

必须在计算机或服务器上安装R 3.3或更高版本。 ChAMP是一种管道,它利用了许多其他Bioconductor封装,目前可从CRAN和Bioconductor获得。 要使管道的所有步骤正常工作,请确保已将Bioconductor升级到最新版本(3.5)。

正确安装R和Bioconductor后,即可开始安装ChAMP。 安装它的最简单方法是在R会话中键入以下代码:

if (!requireNamespace("BiocManager", quietly=TRUE))
    install.packages("BiocManager")
BiocManager::install("ChAMP")

如果您使用Bioconductor时有任何问题,另一个方法是先安装所有依赖包,然后再安装ChAMP。 ChAMP依赖包有很多。 您可以在R中使用以下命令来安装其中的大多数::

if (!requireNamespace("BiocManager", quietly=TRUE))
    install.packages("BiocManager")
BiocManager::install(c("minfi","ChAMPdata","Illumina450ProbeVariants.db","sva","IlluminaHumanMethylation450kmanifest","limma","RPMM","DNAcopy","preprocessCore","impute","marray","wateRmelon","goseq","plyr","GenomicRanges","RefFreeEWAS","qvalue","isva","doParallel","bumphunter","quadprog","shiny","shinythemes","plotly","RColorBrewer","DMRcate","dendextend","IlluminaHumanMethylationEPICmanifest","FEM","matrixStats","missMethyl","combinat"))

当您安装ChAMP时,您可能会遇到一些错误,错误说明中指出某些软件包未安装。 这些错误是由递归依赖的R包引起的,例如,ChAMP引用minfi,minfi引用lumi,而lumi引用其他包,因此如果您的计算机上未安装lumi,则ChAMP将安装失败。 要解决这些错误,您只需要检查这些错误消息,找出所需的包缺失,然后直接使用命令BiocManager :: install(“YourErrorPackage”)进行安装。 然后重试安装ChAMP,可能需要3-4次,但最终应该可以正常工作。

如果你现在使用的 ChAMP 版本低于 2.8.3, 请确保您已经安装了最新的  ChAMPdata package(version 2.8.1) 以开始一个新的ChAMP工作.

安装完成后,你应该可以在R交互式界面中使用ChAMP包了:

library("ChAMP")

4 测试数据集

该软件包包含两个测试数据集,一个是HumanMethylation450数据(.idat),另一个是模拟EPIC数据,可用于测试ChAMP中可用的功能。 这可以通过将添加目录指向testDataSet来加载,如下所示:

testDir=system.file("extdata",package="ChAMPdata")
myLoad <- champ.load(testDir,arraytype="450K")

450k肺癌数据集仅包含8个样品,4个肺肿瘤样品(T)和4个对照样品(C)。 我们使用此数据集来演示ChAMP的功能。

对于EPIC模拟数据集,用户可以使用以下代码加载它

data(EPICSimData)

该模拟数据集包含16个样本,所有这些样本实际上最初来自一个样本,但已修改为DMP和DMR,也单独存在一些错误的变异值。 在这16个样本数据集中,我们模拟了8个样本作为对照,8个样本作为案例,样本在EPICSimData对象的pd中标记。 在这个数据中,我们从bumphunter包的clusterMaker()函数中随机选择5000个区域。 在每个区域,我们随机选择一些连续的CpG作为DMR,然后增加或减少该DMR的β值。 因此,该数据集中应该有少于5000个DMR(4700+),因为一些模拟DMR仅包含1-2个CpG,它们在champ.DMR()函数中不会被视为DMR。 用户可以使用该数据来测试ChAMP处理EPIC数据的功能.

5 ChAMP 分析管道

5.1 管道介绍

上图概述了ChAMP管道中的各个步骤。 每个步骤都可以作为独立的功能单独运行。 这让ChAMP的结果与其他分析管道之间的集成成为可能。 在上面的管道图中,包括了ChAMP的所有功能,它们使用三种颜色分类:

  • 蓝色功能代表甲基化数据预处理的功能,如导入,标准化,质量控制检查等。
  • 红色功能代表产生分析结果的功能,如差异甲基化位置(DMP),差异甲基化区域(DMR),差异甲基化区块,EpiMod(该方法衍生自FEM包的差异甲基化基因模块),途径富集结果等。
  • 黄色函数表示用于展示数据集和分析结果的GUI界面。

上图中的实线灰色线表示管道流,而划线灰线表示不一定使用的功能,具体使用取决于您的数据和项目。

在上图中,有一些功能和连接(线)标有绿色闪烁,表示核心分析管道。 ChAMP整合了许多功能,但并非所有功能在一次成功的分析中都会被使用到。 绿色光线代表最有可能用于各种数据集的核心分析管道。 我们还为这些主要管道步骤进行了标记(在使用时),这对新用户应该有所帮助。 图中间的黑点代表已经完全准备好的甲基化数据集,这是数据准备和数据分析之间的枢纽。 因此,我们建议您完整保存准备好的数据,这样您可以在之后的工作中直接启动分析。

5.2 完整管道

可以使用一个命令立即运行完整管道:

champ.process(directory = testDir)

在通过 champ.process()函数运行完整管道时,可以调整一系列的参数. 这些参数被ChAMP中涵盖的函数所依赖,但是并没有覆盖所有的参数,否则champ.process()函数会变得过于臃肿; 具体参数设置请查阅ChAMP手册.

5.3 分开的步骤

我们会发现champ.process() 并不总是能够成功地完整运行, 常常是因为一些非标准的特殊数据集的特性引起的错误时. 举个例子, pd文件中相关的协变量并没有表示为“Sample_Group”而是其他种类, 比如 “Disease”, 但是在 ChAMP 方法中, “Sample_Group” 被配置为默认的需处理的变量列. 如果您在使用 champ.process()时遇到任何问题, 我们建议您用以下方法一步一步来:

myLoad <- cham.load(testDir)
# Or you may separate about code as champ.import(testDir) + champ.filter()
CpG.GUI()
champ.QC() # Alternatively: QC.GUI()
myNorm <- champ.norm()
champ.SVD()
# If Batch detected, run champ.runCombat() here.
myDMP <- champ.DMP()
DMP.GUI()
myDMR <- champ.DMR()
DMR.GUI()
myBlock <- champ.Block()
Block.GUI()
myGSEA <- champ.GSEA()
myEpiMod <- champ.EpiMod()
myCNA <- champ.CNA()
# If DataSet is Blood samples, run champ.refbase() here.
myRefbase <- champ.refbase()

我们可以注意到ChAMP函数的上述实现使用了默认的参数,参数可以根据用户其他独特的分析工具和方法进行调整和协调。有关参数选项的详细信息,请参阅ChAMP手册或本文的相关章节。

5.4 EPIC pipeline

ChAMP为EPIC阵列分析提供了友好的用户界面。对于450K数组和EPIC数组,大多数函数都是相同的,唯一的区别在于参数“arraytype”,这一参数用于一些需要注释文件的函数:只需将这个参数设置为“EPIC”即可。在ChAMP包中,我们创建了一个模拟EPIC数据集(beta值)。分析EPIC阵列的管道如下:

# myLoad <- champ.load(directory = testDir,arraytype="EPIC")
# We simulated EPIC data from beta value instead of .idat file,
# but user may use above code to read .idat files directly.
# Here we we started with myLoad. 
data(EPICSimData)
CpG.GUI(arraytype="EPIC")
champ.QC() # Alternatively QC.GUI(arraytype="EPIC")
myNorm <- champ.norm(arraytype="EPIC")
champ.SVD()
# If Batch detected, run champ.runCombat() here.This data is not suitable.
myDMP <- champ.DMP(arraytype="EPIC")
DMP.GUI()
myDMR <- champ.DMR()
DMR.GUI()
myDMR <- champ.DMR(arraytype="EPIC")
DMR.GUI(arraytype="EPIC")
myBlock <- champ.Block(arraytype="EPIC")
Block.GUI(arraytype="EPIC") # For this simulation data, not Differential Methylation Block is detected.
myGSEA <- champ.GSEA(arraytype="EPIC")
myEpiMod <- champ.EpiMod(arraytype="EPIC")
# champ.CNA(arraytype="EPIC")
# champ.CNA() function call for intensity data, which is not included in our Simulation data.

5.5 计算量需求

在使用管道处理大量示例的能力多少取决于可用的内存大小。ChAMP管道在8GB内存的计算机上可以成功运行200个示例。如果有需要,可能还需要找到一个服务器/集群来运行分析.

champ.load() 函数使用了绝大多数的内存.如果您打算多次运行一次分析我们建议您使用 myLoad <- champ.load() 后保存此列表以供将来的分析工作.

在champ.DMR(), champ.norm()和 champ.Block()函数中,有一些步骤可以通过并行计算来加速处理过程.如果你的电脑或服务器有多个核心, 您可以同时指定更多的内核,以使函数运行得更快,尽管这可能会消耗更多的内存。您可以在R中使用以下代码来检测内核的数量.

library("doParallel")detectCores()

GUI 的实现需要更多的东西. 一个GUI环境的基本框架是被需要的. 如果你在自己的电脑上运行 ChAMP ,那并没有问题. 但是如果你在一个远程服务器上运行, 要使用这些基于web浏览器的GUI功能,本地显示系统是必不可少的, X11就是一个很常见的工具。对于Windows用户, XmanagerMobaxTerm 都是很好的选择.

ChAMP的一个显著改进是,即使用户不是从原始.idat文件开始进行分析,他们也可以完成分析工作。只要你有一个甲基化矩阵和相应的表现型(pd)文件,你就可以进行几乎所有的ChAMP分析。这使得只有beta值而不具有原始idat文件的用户更容易进行分析,比如用户从TCGA或GEO获得的数据。

6 ChAMP 处理管道的说明

如前所述,用户可以选择单独运行某一ChAMP函数,以便与其他管道集成或在每个步骤完成后保存结果。下面我们将详细介绍每个函数.

6.1 加载数据

第一步总是加载数据. ChAMP 提供一个加载函数来从.idat文件(其中需包括pd文件(Sample_Sheet.csv))获取数据. 旧方法 champ.load() 函数使用minfi从.idat文件加载数据, 但是现在我们提供了一种新的方法“ChAMP”来读取数据。这两种加载方法均返回了相同的对象,除此之外旧版本还返回了rgSet和mset mset对象,这些对象可以用作champ.norm()中的SWAN和FunctionalNormliazation方法的输入。默认情况下,将从当前工作目录加载数据,在这个目录中应该有所有的.idat文件和一个示例表。当前的示例表需要是一个.csv文件,因为混合之后的结果是这样的。用于从idat文件加载数据的champ.load()函数会自动过滤掉可能影响分析结果的snp. SNP列表是从 Zhou’s Nucleic Acids Research paper in 201621, 专门为450K和EPIC所设计,同时也考虑了种群差异. 因此,对于450k的珠子阵列数据,在过滤前对低质量的探针进行过滤, 数据集了包括 485,512 个探针. 而且对于EPIC 芯片阵列数据, 过滤之前的数据集, 将会包含 867,531 个探针位点.

当前的默认加载方法是“ChAMP”. 你可能需要设置 “method” 在 champ.load() 以使用经典的minifi加载方式. 注意 “ChAMP” 方法在 champ.load() 仅仅是 champ.import() champ.filter()两个方法的组合. 如果不满意返回的结果 champ.load(),比如说你想要得到 beadcount, 检测P值矩阵, 或者Meth UnMeth 矩阵,可能你需要 champ.import() 来实现它们.

用户可以调用以下代码来加载数据集:

myLoad <- champ.load(testDir)

或者, 用户可以从ChAMPdata的 testDataSet数据集对象来加载数据集, 使用:

data(testDataSet)

检查pd文件很重要(Sample_Sheet.csv). 大多数pd文件的格式相似, 但它们最有价值的信息并不总是存储在同一列中, 比如,对于大多数pd文件来说, Sample_Group 表示需要被研究的表现型, 但是在一些pd文件中, 类似的信息可以存储在名为 “Diagnose” 或 “CancerType”的列中, 所以用户必须 理解他们的pd文件,以成功进行分析工作

myLoad$pd##    Sample_Name Sample_Plate Sample_Group Pool_ID Project Sample_Well## C1          C1           NA            C      NA      NA         E09## C2          C2           NA            C      NA      NA         G09## C3          C3           NA            C      NA      NA         E02## C4          C4           NA            C      NA      NA         F02## T1          T1           NA            T      NA      NA         B09## T2          T2           NA            T      NA      NA         C09## T3          T3           NA            T      NA      NA         E08## T4          T4           NA            T      NA      NA         C09##     Array      Slide                   Basename                  filenames## C1 R03C02 7990895118 Test450K/7990895118_R03C02 Test450K/7990895118_R03C02## C2 R05C02 7990895118 Test450K/7990895118_R05C02 Test450K/7990895118_R05C02## C3 R01C01 9247377086 Test450K/9247377086_R01C01 Test450K/9247377086_R01C01## C4 R02C01 9247377086 Test450K/9247377086_R02C01 Test450K/9247377086_R02C01## T1 R06C01 7766130112 Test450K/7766130112_R06C01 Test450K/7766130112_R06C01## T2 R01C02 7766130112 Test450K/7766130112_R01C02 Test450K/7766130112_R01C02## T3 R01C01 7990895118 Test450K/7990895118_R01C01 Test450K/7990895118_R01C01## T4 R01C02 7990895118 Test450K/7990895118_R01C02 Test450K/7990895118_R01C02

就像我们所看到的, Sample_Group包含了两种 表现型: C 和 T, 这是我们想作比较的。 在这个数据集中, C 表示 “对照”, 而 T 表示 “肿瘤”.

6.2 数据过滤

ChAMP 提供了一个全面的数据过滤功能 champ.filter() 可以接受任何数据矩阵作为输入 (beta, M, Meth, UnMeth, intensity) 并进行过滤. champ.filter() 不调用任何特定的对象或类结果, 它也不需要提供 IDAT 文件,只要你有一个 single beta 矩阵, 过滤功能就可以被完成. 在 champ.filter()中,一些探头和样品可能会因为质量问题而被移除, NA 值可能会仍然存在于过滤后的数据中.所以, 我们在函数中提供了一个参数 autoimpute 来决定是否自动对剩余的探针数据进行归一化, 如果 NAs 依然存在. 实际上, imputation工作是由 champ.impute()函数完成的, 读者可以检查该函数的说明以获得更多信息。

对于一些过滤方法, 某些额外数据是必需的, 例如,如果要使用p值检测执行过滤, 您需要提供一个p值检测矩阵. 同样的, 如果你想要基于beadcounts来执行过滤, 你需要提供 beadcount 信息. 如果你用champ.load()的默认方法“ChAMP”来读入 IDAT 文件,返回的结果中应该会包含beadcount信息. 同样的,champ.filter()被设计用于从同一数据集中对多种类型的数据帧进行过滤, 如果你有 来自同一.idat 文件的beta matrix, M matrix, intensity matrix, 你可能会需要使用 champ.filter()来 同时对它们做过滤, 在未来的分析中,它们会保持它们的依从性.

  • 首先,必须加载至少一个数据矩阵
  • 第二, 如果提供多个矩阵作为输入,它们必须具有完全相同的行名和列名。例如,如果您想同时对强度数据和M值进行过滤,您必须确保它们具有相同的名称,否则ChAMP将假定它们是来自不同来源的数据帧。然后,champ.filter()将停止过滤过程.
  • 第三,一些低质量的样本 (比如有许多未探测到数据的探针) 可能在过滤中被删除, 确保pd文件中的Sample_Name标识符与数据矩阵的名称完全相同非常重要
  • 第四, 如果要进行归一化,则必须提供检测P矩阵、beta矩阵或M矩阵,且参数ProbeCutoff不能为0,该参数控制NA比率的阈值,探头是否应该被移除。如果这三个条件中的任何一个失败,champ.filter()将重置autoimpute参数为FALSE,那么就不会进行归一化处理.
  • 第五,如果要通过beadcount进行过滤,必须提供beadcount, champ.import()将返回beads信息。或者,您可以使用beadcount函数从wateRmelon包中提取是从rgSet对象中提取的“minfi”方法

ChAMP .filter()函数是ChAMP .import()的后处理,它是ChAMP团队编写的一个新的加载函数。您可以像下面这样使用它们:

myImport <- champ.import(testDir)myLoad <- champ.filter()

然后,champ.filter()(上面代码中的myLoad)的结果会与旧的minfi方法相同。

实际上,在加载过程中,champ.load()函数也会进行一些过滤。下面是在champ.load()和champ.filter()期间执行的过滤步骤.

  • 第一个过滤器用于检查检测p (< 0.01). 这使用了存储在.idat文件中的检测p值. champ.import() 会读取这些信息并将其整合到数据帧中.对于每一个探针, 如果p值低于01, 将会被认为是失败的探测. 一个失败的样本信息会被打印在屏幕上, 显示每个样本中失败探头的比例. 如果有一个样本其中每一个值都很高(>0.1),您可能需要考虑从分析中删除该样本并重新运行它。之前我们发现,在大多数情况下,大数据集中只有一两个样本不适合进行分析, 他们可能有70%到80%的失败探针. 所以如果我们只对探测器进行过滤,大约80%的探针会被屏蔽, 因此,我们开发了一种新的样品和探针质量的过滤系统。 如果某个样品的失败探针比率高于特定阈值(默认值= 0.1),则将丢弃该样品,并对剩余样品进行探针过滤。 样品和探针的阈值可以分别由参数SampleCutoff和ProbeCutoff控制。
  • 其次,ChAMP将过滤掉每个探针至少5%样品中<3beads的探针。可以使用filterBeads参数更改此默认值,也可以使用beadCutoff参数调整频率。
  • 第三,ChAMP将默认过滤掉数据集中包含的所有非CpG探针。
  • 第四,默认情况下,ChAMP将过滤所有与SNP相关的探针。 SNP清单来自Zhou’s Nucleic Acids Research paper in 2016. 请注意,如果您知道数据来自哪个群体,您可以选择某些群体进行过滤。否则ChAMP将使用Zhou提供的一般推荐探针进行过滤。您只需指定“population”参数即可实现此目的。
  • 第五,默认情况下,ChAMP将过滤所有multi-hit探针。multi-hit探针来自Nordlund’s Genome Biology Paper in 201322.
  • 第六,ChAMP将过滤掉位于染色体XY的所有探针。这也是默认设置,但用户可以使用filterXY参数进行更改。

以上所有过滤都由champ.load()和champ.filter()函数中的参数控制,用户可以根据需要进行调整。 请注意,虽然大多数ChAMP函数支持单独的beta matrix分析,这意味着从一开始就不依赖于.idat文件,但champ.load()不能单独对beta matrix进行过滤。 对于没有.IDAT数据但是有beta matrix和Sample_Sheet.csv的用户,您可能希望使用champ.filter()函数执行过滤,然后使用之后的函数进行分析。

除了过滤功能外,我们还提供了一个函数CpG.GUI(),供用户检查它们在染色体,CpG岛,TSS区域上的CpG分布。 例如 此函数可用于任何CpG列表,例如,在分析期间,只要从DMR获得重要的CpG列表,就可以使用以下函数检查CpG向量的分布:

CpG.GUI(CpG=rownames(myLoad$beta),arraytype="450K")

这是一个有用的函数,用于演示CpG列表的分布。如果你得到一个DMP列表,你可以用这个函数来检查你的DMP在染色体上的分布。

6.3 进一步的质量控制和探索性分析

质量控制是确保数据集适合于下游分析的一个重要过程。champ.QC()函数和QC.GUI()函数将绘制一些图供用户方便地检查其数据质量。通常会生成三个图:

champ.QC()

  • mdsPlot(多维标度图):此图允许基于所有样本中1000个最可变的探针可视化样本的相似性。在这个图中,样本是由样本组着色的.
  • densityPlot(密度图): 每个样本的beta分布;使用者可使用该图找出与其他样本有明显差异、品质不佳的样本(例如亚硫酸氢盐转换不完全)。NB:如果研究中包含了甲基化或非甲基化的对照样本,它还可以用于识别和确认这些样本)
  • 系统树状图: 对于所有样本的聚类图,您可以选择不同的方法来生成这个图。有一个参数特性。champ.QC()函数中的 Feature.sel="None" 。而“None”是指样本之间的距离将由所有探针结果直接计算而得(如果直接做一 个大数据集服务器可能会崩溃),“SVD”指champ.QC()函数将使用SVD计算方法对β矩阵做反卷积,然后基于“isva”包的EstDimRMT()方法处理选择的重要的组件。然后根据这些分量计算样本之间的距离。、

或者您可以使用如下所示的QC.GUI()函数,但是这可能会占用大量内存,所以在ChAMP中运行这个GUI函数时,请确保有一个好的服务器或计算机。

QC.GUI(beta=myLoad$beta,arraytype="450K")

与champ.QC()不同,QC.GUI()将提供5个交互图。这些图如下:mdsPlot, type-I&II densityPlot, sample beta distribution plot, dendrogram plot and top 1000 most variable CpG’s heatmap. 用户可以很容易地与这些图进行交互,以查看您的示例是否符合进一步分析的条件,并检查集群结果和top CpGs。

  • Probe-Type-I&II plot: 数据集中Type-I和Type-II探针的beta分布,这可以帮助您检查数据集的规范化状态。
  • Top variable CpGs’ heatmap: 这个热图将选择大多数可变的CpGs,并根据这些位点的甲基化值创建一个热图。

6.4 标准化

在Illumina beadarrays,上,探针有两种不同的设计(称为I型和II型),两者具有不同的杂交化学原理,这意味着来自这两种不同设计的探针将表现出不同的分布。这是使用不同技术产生的效果,并且不是由于I型和II型探针的生物学特征(例如CpG密度)的差异引起的变化。 I型和II型甲基化分布之间最显着的差异是II型分布显示一个减少的动态范围。在监督分析(supervised analyses)中,这可能导致选择I型或II型探针后产生偏差。对于DMR检测,其中I型和II型探针可能落在相同区域,显然调整是最重要的。通常,建议调整II型探针偏差,您可以使用champ.norm()函数执行此规范化。

在champ.norm()中,我们提供了四种方法来进行这种II型探针标准化:BMIQ,SWAN,PBC和FunctionalNormliazation。每种方法之间存在一些关键性差异,用户可以阅读相关论文以选择最适合所需分析工作的方法。

目前,FunctionalNormalization需要rgSet,SWAN需要从IDAT文件生成的mset和rgSet对象(如minfi包描述中所述),而FunctionalNormliazation需要rgSet对象。因此,FunctionalNormliazation和SWAN规范化方法仅适用于“minfi”加载方法的分析。

请注意,BMIQ函数已经更新到1.6版本,这将为EPIC数组数据提供更好的规范化。我们注意到有时BMIQ 并不能收敛然后提供某些样本的输出, 如果样品的甲基化分布明显偏离了3-state beta-mixture的分布(如甲基化/非甲基化的对照组),就可能发生这种情况,当然也可能是因为样本的质量太差。一个新特性是BMIQ函数现在可以并行运行,因此如果您的计算机有多个核心,您可以在函数中设置参数“核心”来加速该函数的运行。如果你设置参数 PDFplot=TRUE, BMIQ函数的绘图将保存在resultsDir中

下面是进行标准化工作的代码:

myNorm <- champ.norm(beta=myLoad$beta,arraytype="450K",cores=5)

在标准化之后, 你可以使用 QC.GUI() 来再次检查结果。 记住要将“myNorm”传递给 QC.GUI()。

6.5 SVD 图

Teschendorff为甲基化数据实现的突变分解方法(SVD)是评估数据集中变异的标志性部分的数量和性质的有力工具。这些变异部件理想地与感兴趣的生物因子相关,但通常也与技术变异来源(例如批次效应)相关。我们强烈建议用户尽可能多地收集所分析样本的信息(例如杂交日期,收集样本的季节,流行病学信息等),并在与SVD组分相关时包括所有这些因素。如果已从.idat文件加载样品,那么如果在champ.SVD()函数中设置参数RGEffect = TRUE,则将包括珠芯片上的18个内部探针控件(包括亚硫酸氢盐转换效率)。此外,与旧版本的ChAMP软件包相比,当前版本的champ.SVD()将检测所有有效因子以执行分析,这意味着该图包含以下两个条件:

  • 协变量不与name、Sample_Name或File_Name相关联
  • 协变量至少包含两个值(例如,要将“ID”作为协变量进行测试,必须包含来自至少两个不同BeadChips的样本)

champ.SVD()函数将检测pd文件中的所有有效协变量。因此,如果您想要分析一些独特的协变量,您可以将它们组整合到您的pd文件中,从而允许champ.SVD()测试与数据集中变化的最重要组成部分的关联。请注意,champ.SVD()仅使用pd文件形式的表型数据。因此,如果你有自己的协变量,你想用你当前的pd文件进行分析,请cbind()你的协变量和myLoad $ pd,然后使用这个对象作为champ.SVD()函数的pd参数的输入。

请注意,对于不同类型的协变量(分类和数字),champ.SVD()使用不同的方法来计算协变量和分量之间的显着性(Krustal.Test和线性回归)。因此,请确保您的pd对象是数据框,数字协变量转换为“数字”类型,而分类协变量则转换为“因子”或“字符”类型。如果您的Age协变量被指定为“字符”,则champ.SVD函数将不会检测到它应该被分析为数字。因此,我们建议用户非常清楚地了解他们的数据集和pd文件。

当champ.SVD()运行时,所有检测到的协变量将打印在屏幕上,以便用户可以检查您想要分析的协变量是否正确。结果是与所提供的协变量信息相关的顶部主成分的热图(保存为SVDsummary.pdf)。在champ.SVD()中,我们使用isva包中的随机矩阵理论来检测潜在变量的数量。如果我们的方法检测到超过20个组件,我们将只选择前20个。在热图中,较暗的颜色表示更显着的p值,表明SVD组件与感兴趣的因子更强相关。如果SVD分析显示技术因素占变化的很大一部分,那么有必要实施其他标准化方法(例如ComBat),这可能有助于消除这种技术变化。 ComBat包含在ChAMP管道中以消除与芯片,位置和/或板相关的变化,但它可用于去除SVD分析中显示的其他批次效应。

champ.SVD() function detects all valid covariates in your pd file. Thus if you have some unique covariates you want to be analysed, you may combine them into your pd file, allowing champ.SVD() to test for association with the most significant components of variation in the dataset. Note that champ.SVD() will only uses phenotype data in the form of a pd file. So, if you have your own covariates which you want to be analysed with your current pd file, please cbind() your covariates along with myLoad$pd, then use this object as input for the pd parameter of the champ.SVD() function.

Note that, for different type of covariates (categorical and numeric), champ.SVD()uses different methods to calculate the significance between covariates and components (Krustal.Test and Linear Regression) . Thus, please make sure your pd object is a data frame and numeric covariates are transferred into “numeric” type, while categorical covariates are transferred into “factor” or “character” type. If your Age covariate was assigned as “character”, the champ.SVDfunction will not detect that it should be analysed as numeric. Thus, we suggest users have very clear understanding of their dataset and pd files.

While champ.SVD() is running, all detected covariates will be printed on the screen, so that the user may check if the covariates you want to analyse are correct. The result is a heatmap (saved as SVDsummary.pdf) of the top principal components correlated to the covariates information provided. In champ.SVD() we used Random Matrix Theory from isva package to detect numbers of latent variables. If our method detected more than 20 components, we would select only top 20. In the heatmap, the darker colours represent a more significant p-value, indicating a stronger correlation of the SVD component with a factor of interest. If the SVD analysis shows that technical factors account for a substantial fraction of variation, then it will be necessary to implement other normalization methods (e.g ComBat) that may help remove this technical variation. ComBat is included in the ChAMP pipeline to remove variation related to the beadchip, position and/or plate, but it may be used to remove other batch effects revealed in the SVD analysis.

使用以下代码来生成SVD图:

champ.SVD(beta=myNorm,pd=myLoad$pd)

Above plot is generated on our Demo 450K array data set. Since there are only 8 samples in that data set, so the final headmap plot looks not complicated. Below we used another plot from GSE40279, this data contains 656 sample, and a complicated pd file, with covariates like Age and Race included. Below we are showing the result of SVD plot we generated by using champ.SVD() function.

In above plot, Age is a numeric variable, while other factors are categorical factors. The darker block means stronger correlation between deconvoluted components and covariates.

6.6 Batch Effect Correction

This function implements the ComBat normalization method8 that was developed for microarrays. The sva R package is used to implement this function. ComBat is specifically defined in the ChAMP package to correct for batch effects. More advanced users can implement ComBat using sva documentation to adjust for other batch effects. You need to set parameter batchname=c("Slide") in the function to correct them.

Note that champ.runCombat() would extract batch from your batchname, so please make sure you understand you dataset perfectly. Another important thing is, previously champ.runCombat() would automatically assume Sample_Group is the covariates need to be analysed, but now, we added a parameter called variablename so that users may assign their own covariates to analysis, thus for some data set whose disease phenotype is NOT Sample_Group, please remember to assign variablename parameter.

ComBat algorithm uses an empirical Bayes method to correct for technical variation. If ComBat were applied directly to beta values, the output may not be bounded between 0 and 1. For this reason ChAMP logit transforms beta values before ComBat adjustment and then computes the reverse logit transformation following the ComBat adjustment. If the user has chosen to use M-values, please assign logitTrans=FALSE.

During the function champ.runCombat(), the program would automatically detect all valid factors might be corrected and list on screen, so that user may check if there are the batches you want to correct. Then if the batchname assigned by user are correct, the function would started to do batch correction.

In the latest version, champ.runCombat() would also check if user’s Batch and Variable counfounded with each other. Confounding means all samples from one phenotype of your variable are comes from certain one phenotype of your batch, which means, if you correct batches, your variable-contained information are also gone, which is not allowed in Combat function of SVA package, that’s why we made check beforehead.

You can use following code to do batch correction:

myCombat <- champ.runCombat(beta=myNorm,pd=myLoad$pd,batchname=c("Slide"))

The ComBat function could be a very time consuming step in the pipeline if you have a large number of samples. After correction, you may use champ.SVD() function to check the corrected result.

6.7 Differential Methylation Probes

Differentially Methylated Probes (DMPs) is the most common desired output. Here users may use champ.DMP() function to calculate differential methylation, and can use DMP.GUI() to check the result.

champ.DMP() function implements the limma package to calculate the p-value for differential methylation using a linear model. Our latest champ.DMP() function would support numeric variable like age, and categorical variables which contain more than two phenotypes, like “tumor”, “metastasis”, “control”. If our function detected numeric variables like age, linear regression would be conducted on each CpG sites, to find your covariate-related CpGs, say age-related CpGs. While categorical variables detected, champ.DMP() would do contrast comparison between each two phenotypes in your covariate. Say you have covariate as “tumor”, “metastasis”, “control”, then champ.DMP() would compare “tumor–metastatic”, “tumor-control”, and “metastatis-control”. Each comparison would return one data frame containing significantly differential methylated probes.

The output of champ.DMP() includes some data frames of P-value, t-statistic and difference in mean methylation (this is labeled as logFC since it is analogous to the log fold-change in gene expression analysis) between two groups (for categorical covariate only). It also includes the annotation for each probe, the average beta value for the sample group, and the delta beta value for the two groups used in the comparison (delta value is same as logFC, we kept is here because very old version ChAMP used it).

In our latest version champ.DMP(), numeric variables are fully support not, if you have covariate like ages or time series data. However, more advanced users can run limma with other settings and use the output (including all probes and their p-values) for the DMR hunter.

Users may use following code to get DMPs (still we are using Demo 450K test data here):

myDMP <- champ.DMP(beta = myNorm,pheno=myLoad$pd$Sample_Group)##       Contrasts## Levels pT-pC##     pC    -1##     pT     1

We can check the result of champ.DMP() as below. Remember that in champ.DMP(), each returned result would be stored in an element in a list. Each element in the list is one result of DMPs detected between two phenotypes. If your covariate is numeric variable, the returned data frame name is “NumericVariable”. If your covariate is categorical variable, the returned name would be “PhenoA_to_PhenoB”, where “PhenoA” and “PhenoB” are two name of your phenotypes.

head(myDMP[[1]])

In new version of ChAMP, we include a GUI interface for user to check the result of myDMP generated from above code. You just need to provide the “unmodified” result of champ.DMP (myDMP), orginal beta matrix you used to generat DMP (beta) and covariates you want to analyse. DMP.GUI() function would automatically detect if your covariate is numeric or categorical. If your covariate is numeric, DMP.GUI() would plot scatter plot for each CpGs, showing trend of beta value along with your covariate. If your covariate is categorical variable like cancer/normal, DMP.GUI() would plot boxplot for significantly differential methylated CpGs. You can use following code to open DMP.GUI() function.

DMP.GUI(DMP=myDMP[[1]],beta=myNorm,pheno=myLoad$pd$Sample_Group)# myDMP is a list now, each data frame is stored as myDMP[[1]], myDMP[[2]], myDMP[[3]]...

The left panel indicates the cut off parameters for significant CpGs. Currently only two parameters are supported here: P-value and abs(logFC). Users may set these two values then press “Submit” to generate significant CpGs table on the right. On the right is the table of significant CpGs, all information are included, the table can be ranked by each column.

The second tab provides the heatmap of ALL significant CpGs you generate by left panel cutoffs. Note that rows and columns of this heatmap has been Hierarchical Clustered.

The third panel provides the barplot of proportion of each CpG feature(e.g. TSS200, TSS1500, body, opensea, shore) and cgi summary information for hyper and hypo CpGs. You may press the marker in the legend to close or open certain bars.

The fourth tab is controled by the Gene parameter on the left panel. You may select a gene name (the example default is NFIX here) and press “submit”. Then the plot for this gene will be plot above. Note that the x axis here is not the actual MAPINFO of each CpG. The distance between each CpG are the same. There are two kinds of lines plotted for each phenotype, the solid lines represent Mean Value plot, and the dash line represent the Loess line, user may close or open they by clicking the marker on the legend. Below is the feature and cgi information of each CpG. At the bottom of this tab, is the gene information extracted from wikigene, which is also controlled by the gene parameter on the left.

The fifth tab includes two plots. At the top is the enrichment plot for top 70 genes involved in significant CpGs you select by left cutoff parameters. The number in each bar indicates how many hyper or hypo differential methylated CpGs are included in this gene. Users may use this plot to find intereted genes then use tab 4 to plot these genes. At the bottom of this tab, is the boxplot of certain CpGs, which is controled by parameter on the left panel.

Note that all plots here can be Zoomed in or out. And can be downloaded directly. You may even change the shape of the brower to resize of plot then download it with different resolution.

6.7.1 DMP.GUI performance on numeric variable

Above DMP.GUI plots are generated on our Demo 450K array. Since that’s a categorical covariate data set, below we are showing part result from a numeric covariate data set: GSE40279. Here we are doing whole analysis, but showing the DMP.GUI() webpage for Age-related CpGs detected from champ.DMP() on GSE40279 data set. Note that we get below result after steps like loading, filtering, normalization.e.g. then run champ.DMP() and DMP.GUI().

For numeric variable, to showing the trend of beta value change along with variable, DMP.GUI() would automatically cut the numeric variable into couple groups increasingly (default is 4), then plot lines and dot based on each group. Say if your variable is age, DMP.GUI() would automatically cut your age variable into couple groups, like 30-, 30-50, 50-70, 70+, then four lines and groups of dots would be plotted. Thus you may see the line for old sample group is significantly different from line from the youngest group.

Firstly we show the third tab (gene related tab) of DMP.GUI() webpage:

In above plot, we can see that DMP.GUI() automatically cut age covariate into 4 groups. Then for each group, dots ant lines are plotted. We can see that in gene ELOVL2, there are 8 CpGs in total shows age-related trend. And lines show that older group (80.5 ~ 101), tend to be hypermethylated than younger groups.

Then we show the fourth tab (CpG related tab) of DMP.GUI() webpage:

In above plot, CpGs are plotted in to scatter plot, so that we can see trend of CpG along with age.

6.7.2 Hydroxymethylation Analysis

Some users might want to analyse hydroxymethylation data, which can be done with champ.DMP() function, here we provided some demo code below for user to check.

myDMP <- champ.DMP(beta=myNorm, pheno=myLoad$pd$Sample_Group, compare.group=c("oxBS", "BS"))# In above code, you can set compare.group() as "oxBS" and "BS" to do DMP detection between hydroxymethylatio and normal methylation. hmc <- myDMP[[1]][myDMP[[1]]$deltaBeta>0,]# Then you can use above code to extract hydroxymethylation CpGs.

6.8 Differential Methylation Regions

Differentially Methylated Regions (DMRs) are extended segments of the genome that show a quantitative alteration in DNA methylation levels between two groups. champ.DMR() is the function ChAMP provides to compute and return a data frame of probes binned into discreet differentially methylated regions (DMRs), with accompanying P-value.

There are three DMR algorithms implemented within ChAMP: Bumphunter, ProbeLasso and DMRcate. The Bumphunter algorithm first groups all probes into small clusters (or regions), then applies random permutation method to estimate candidate DMRs. This method is very user friendly and is not relying on any output of previous functions. The permutation step in Bumphunter may be computer-intensive, but user may assign more cores to accelerate it. The result of bumphunter algorithm is a dataframe containing all detected DMRs, with their length, clusters, number of CpGs annotated.

For ProbeLasso method, the final data frame is distilled from a limma output of probes and their association statistics. Previously, the result from champ.DMP() must be inputted into champ.DMR() function to do run ProbeLasso, but now we have upgraded the function and ProbeLasso needs no result from champ.DMP() anymore. For the principle and mechanism of ProbeLasso function, user may find it in original paper11.

A new method recently incoporated into ChAMP is DMRcate. This is a data-driven approach that is agnostic to all annotations except for spatial ones, specifically chromosomal coordinates. Critical to DMRcate are robust estimates of differential methylation (DM) at individual CpG sites derived from limma. DMRcate pass the square of the moderated t statistic calculated on each 450K probe to DMR-finding function. Then DMRcate would apply a Gaussian kernel to smooth this metric within a given window, and also derive an expected value of the smoothed estimate (in other words, one with no experimental effect) from the varying density of CpGs incu by reduced representation and irregular spacing. For more detail about DMRcate function, user may read the original paper13 and DMRcate R pacakge.

Note that in this version ChAMP, we made some strict checking process in champ.DMR(), now champ.DMR() only takes categorical covariates with EXACTLY only two phenotypes. If your covariates contains multiple phenotype, say “tumor”/“metastasis”/“control”, please manually select any two of them, and input corresponding beta matrix and pheno information.

Users may use following code to generate DMR:

myDMR <- champ.DMR(beta=myNorm,pheno=myLoad$pd$Sample_Group,method="Bumphunter")

The output of Bumphunter, DMRcate and ProbeLasso functions are all dataframe, but may contain different columns. Here we can have a check on DMRcate result.

head(myDMR$DMRcateDMR)##       seqnames     start       end width strand no.cpgs        minfdr## DMR_1     chr2 177021702 177030228  8527            48  3.243739e-69## DMR_2     chr2 177012117 177017797  5681      *      33  1.137990e-59## DMR_3    chr12 115134148 115136308  2161      *      51 3.571093e-108## DMR_4    chr11  31820441  31828715  8275      *      40  3.087712e-43## DMR_5    chr17  46655289  46658170  2882            27  1.565729e-85## DMR_6     chr6  32115964  32118811  2848      *      50 5.055382e-123##           Stouffer maxbetafc meanbetafc## DMR_1 4.011352e-24 0.6962753  0.4383726## DMR_2 6.555658e-20 0.6138975  0.4676655## DMR_3 5.711951e-19 0.5372083  0.3601741## DMR_4 1.206741e-18 0.5673115  0.4112588## DMR_5 2.087453e-17 0.6859260  0.4812521## DMR_6 7.216567e-16 0.5772168  0.2971849##                                                                                              overlapping.promoters## DMR_1                                                                       HOXD3-001, HOXD3-004, RP11-387A1.5-001## DMR_2                                                                             HOXD4-001, MIR10B-201, HOXD3-005## DMR_3                                                                                                         <NA>## DMR_4 PAX6-016, PAX6-006, PAX6-014, PAX6-015, PAX6-007, PAX6-028, PAX6-025, PAX6-027, PAX6-029, PAX6-024, PAX6-026## DMR_5                                                      HOXB4-001, MIR10A-201, HOXB3-009, HOXB3-005, MIR10A-001## DMR_6                                                        PRRT1-002, PRRT1-201, PRRT1-007, PRRT1-006, PRRT1-003

Just like DMP result, we provide DMR.GUI() as interface for DMR result. All result from three different method of champ.DMR() are accepted in the GUI function. The DMR.GUI() function would automatically detected what kind of input the DMR result is. Thus it’s not recommended to modify the structure of myDMR result.

DMR.GUI(DMR=myDMR)# It might be a little bit slow to open DMR.GUI() because function need to extract annotation for CpGs from DMR. Might take 30 seconds.

Just like the DMP.GUI above, there is a panel for parameters on the left, user may set the parameters to select significant DMRs. We currently only provide two parameters: P-value and min number probes. The above title will indicate the method you have used to get DMR result (here is DMRcate). Note that for different method the P-value may indicate different columns. Bumphunter is pvaluearea, while ProbeLasso is dmrP, and there is no cutoff for DMRcate result, which has been done during calculation. After your selection and press “Submit”, you will get table of significant DMRs on the right. For different method, the table content might be different. Thus user should research what method you want to use to find DMRs.

The second tab provides a table for all CpGs involvded in these DMRs, which can be regarded as a mapping list between CpGs and DMRs. Also if you can not get gene information in first tab, you can get them here. In this table, for each CpGs, their corresponding genes and DMRs are marked.

The third plot is the plot of each DMR, just like the gene plot in DMP.GUI() function. Here the x axis are not real MAPINFO of each CpG, the distance between each CpGs are equal. Also I have feature and CpGs marked under the plot. At the top of this tab, are the table of all CpGs in this DMR. You can use the DMR index parameter on the left panel to select DMR you want to check.

The fourth tab provide the enrichment information of genes and heatmap. Note that there might be many genes and CpGs in this plot. So if you have too many DMRs selected, the plot might be too big to be open. But you don’t need to worry can not see it clearly, because you may zoom in as much as you can to check details.

Currently, champ.DMR() does not support multiple phenotypes in one covariates, in theory we can do pairewise comparision, but it would takes very long time. Also, we are not supporing numeric variables, say looking for “Age-related Methylated Regions”, it calls for many heavy coding working and we may support that in future version.

6.9 Differential Methylation Blocks

There may also be potential interest to identify Differentially Methylated Blocks (DMBs) 19. Here we provide a function to infer DMBs. In our Block-finder function, champ.Block() would firstly calculate small clusters (regions) on whole genome based on their locations on genome. Then for each cluster (region), its mean value and mean position would be calculated, thus you can assume that each region would be collapsed into a single unit. When we finding DMB, only single unit from open sea would be used to do clustering. Here Bumphunter algorithm will be used to find “DMRs” over these regions (single units after collapse). In our previous paper23, and other scientists’ work24 we demonstrated that Differential Methylated Blocks may show universal feature across various cancers

In recent years, there are more researchs about Differential Methylated Blocks14. Here we provide a function to calculate methylation blocks. In our Block-finder function, champ.Block() would firstly calculate small clusters (regions) on whole genome based on their locations on genome. Then for each cluster (region), its mean value and mean position would be calcuated, thus you can assume that each region would be collapsed into a dot. Then Bumphunter algorithsm will be used to find “DMRs” over these regions (dots after collapse). Note that only open sea regions will be used to calculate Blocks. In our previous paper23 we demonstrated that Differential Methylated Blocks may show universal feature across various cancers.

User may use following code to generate methylation blocks:

myBlock <- champ.Block(beta=myNorm,pheno=myLoad$pd$Sample_Group,arraytype="450K")

champ.Block() function returns numerous objects, but the most important one is the first list: Block. You may check it like below:

head(myBlock$Block)##         chr     start       end      value     area cluster indexStart## Block_1  12 125405786 133245207 -0.1398238 61.66232     649      27996## Block_2   6  28839951  30308328 -0.1725335 40.20032     356      81359## Block_3  15  24778439  27360551 -0.1752729 38.73531     732      35251## Block_4  14 100851839 102144080 -0.2166707 35.75066     724      34607## Block_5  11 130933979 133954227 -0.1833729 34.29073     616      22569## Block_6   7  40026390  42107777 -0.2616146 24.85339     400      88773##         indexEnd   L clusterL      p.value  fwer  p.valueArea fwerArea## Block_1    28436 441     2176 1.698689e-05 0.028 1.698689e-05    0.028## Block_2    81591 233     2396 1.698689e-05 0.028 1.577354e-04    0.136## Block_3    35471 221      261 1.698689e-05 0.028 2.196162e-04    0.136## Block_4    34771 165      708 1.698689e-05 0.028 3.858451e-04    0.188## Block_5    22755 187     1601 1.698689e-05 0.028 4.622861e-04    0.188## Block_6    88867  95     1497 1.698689e-05 0.028 1.571287e-03    0.420

And we provide a GUI function for users to check the result of myBlock above.

Block.GUI(Block=myBlock,beta=myNorm,pheno=myLoad$pd$Sample_Group,runDMP=TRUE,compare.group=NULL,arraytype="450K")

Note that there are two special parameters here: runDMP means if Block.GUI() function should calculate DMP for each CpGs, if not, the GUI() function would only return the annotation of all involvded CpG. compare.group could to be assigned if you want to calculate DMP while running Block.GUI(), it works if you have multiple phenotypes in our covariates.

Just like DMP.GUI() and DMR.GUI(), the P-value cutoff and min cluster (similar to min probes in DMR.GUI()) controls the number of significant Blocks you selected. Then the information of significant blocks will be listed on the right.

The second tab provides information of all CpGs involved in all Blocks. For methylation blocks, there are actually three unit integrated together: Block, Cluster (regions) and CpG. Blocks are formed by some clusters, and clusters are formed by some CpGs. Thus the second tab actually provides a mapping solution from CpGs to Blocks. User may use this tab to find interesting genes and CpGs. Then use DMP.GUI() to combind research between Blocks and DMPs.

The third tab provides the plot for each Blocks. Be caution that each Block may contain many many clusters. Thus it would be slow to calculate and plot. And please make sure you have a good computer or server to run this function. Note that plot is slightly different from DMR.GUI() and DMP.GUI() that the x axis is actually the REAL MAPINFO of each cluster, because I also need to map CpG islands position above the lines (Those dark red dots).

6.10 Gene Set Enrichment Analysis

Gene Set Enrichment Analysis is a very important step for most bioinformatics work. After previous steps, you may already get some significant DMPs or DMRs, thus you may want to know if genes involved in these significant DMPs or DMRs are enriched for specific biological terms or pathways. To achieve this analysis, you can use champ.GSEA() to do GSEA analysis.

champ.GSEA() would automatically extract genes in myDMP and myDMR (No matter what method you have used to generate DMRs). Also if you have your own significant CpG list or gene list calculated from THE SAME methylation arrays (450K or EPIC) by other method not included in ChAMP. You may also format them into list and input the function to do GSEA (Please assign each element in your own CpG list or gene list a name, otherwise error might be triggered). champ.GSEA() would automatically extract gene information, transfer CpG information into gene information then conduct GSEA on each list. During the mapping process between CpGs to genes, if there are multiple CpGs mapped to one gene, that gene would only be counted once in case of over counting.

There are three ways to do GSEA. In previous version, ChAMP used pathway information downloaded from MSigDB. Then Fisher Exact Test will be used to calculate the enrichment status of each pathway. After gene enrichment analysis, champ.GSEA() function would automatically return pathways with P-value smaller then adjPval cutoff.

However, as pointed out by Geeleher [citation], since different genes has different numbers of CpGs contained inside, the two situation that one genes with 50 CpGs inside but only one of them show significant methylation, and one gene with 2 CpGs inside but two are significant methylated should not be eaqualy treated. The solution is use number of CpGs contained by genes to correct significant genes. as implemented in the gometh function from missMethyl package25. In gometh function, it used number of CpGs contained by each gene replace length as biased data, to correct this issue. The idea of gometh is fitting a curve for numbers of CpGs across genes related with GSEA, then using the probability weighting function to correct GO’s p value.

In latest version, we incoporated a new empirical bayes GSEA method into champ.GSEA() function, which is short for “ebayes” in “method” parameter. This method would not relying on DMP or DMR result to do GSEA, but instead would use global test to directly find significant genes, with the bias of inequality of CpG number corrected for each genes. Also this method considered the significant degree of of each CpG, thus user may use this method to detect some marginal significant genes for GSEA process.

Note: If you want to fix bias from inequality of CpG number of genes, and take significant level of CpGs into consideration, you could set “method” parameter as “ebayes” to use empirical bayes method. Otherwise, you may use “gometh” method or “fisher” method to do GSEA as well.

User may use following command to do GSEA:

myGSEA <- champ.GSEA(beta=myNorm,DMP=myDMP[[1]], DMR=myDMR, arraytype="450K",adjPval=0.05, method="fisher")# myDMP and myDMR could (not must) be used directly.

If you don’t want to do GSEA on DMP or DMR, you may set them as NULL.

In above code, we also used the demo 450K array incorporated in our package to demonstrate the effect of champ.GSEA(). After the calculation, for each list (DMP,DMR or other list), one GSEA result would be returned. User may check the result like below:

head(myGSEA$DMP)## NULL# Above is the GSEA result for differential methylation probes.head(myGSEA$DMR)##                                                         Gene_List nList## BENPORATH_EED_TARGETS                       BENPORATH_EED_TARGETS  1062## BENPORATH_ES_WITH_H3K27ME3             BENPORATH_ES_WITH_H3K27ME3  1118## BENPORATH_SUZ12_TARGETS                   BENPORATH_SUZ12_TARGETS  1038## HOX_GENES                                               HOX_GENES    65## BENPORATH_PRC2_TARGETS                     BENPORATH_PRC2_TARGETS   652## MEISSNER_BRAIN_HCP_WITH_H3K27ME3 MEISSNER_BRAIN_HCP_WITH_H3K27ME3   269##                                  nRep      fRep nOVLAP        OR## BENPORATH_EED_TARGETS             966 0.9096045    118  4.816181## BENPORATH_ES_WITH_H3K27ME3       1031 0.9221825    118  4.457708## BENPORATH_SUZ12_TARGETS           909 0.8757225    102  4.257157## HOX_GENES                          60 0.9230769     29 28.960565## BENPORATH_PRC2_TARGETS            594 0.9110429     79  5.037991## MEISSNER_BRAIN_HCP_WITH_H3K27ME3  259 0.9628253     48  7.200159##                                       P.value      adjPval## BENPORATH_EED_TARGETS            1.002738e-36 8.364836e-33## BENPORATH_ES_WITH_H3K27ME3       5.684943e-34 2.371190e-30## BENPORATH_SUZ12_TARGETS          1.482686e-28 4.122856e-25## HOX_GENES                        1.863102e-27 3.885499e-24## BENPORATH_PRC2_TARGETS           6.092783e-27 1.016520e-23## MEISSNER_BRAIN_HCP_WITH_H3K27ME3 7.157888e-23 9.951850e-20##                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        Genes## BENPORATH_EED_TARGETS            HOXC5 HOXD4 TP73 FLJ45983 DKK1 HOXD9 SIX6 EVX1 HIST1H3I NEUROD2 KCNQ1 TRPA1 SOX17 CHST8 HOXD3 CCDC140 DLX5 DCC ADCYAP1 GHSR PAX3 HOXA2 MKX BTG4 EMX2 SIM1 CRTAC1 ISL2 HIST1H4F FEZF2 DLK1 HOXC6 HSPA1L HOXB7 GBX2 HOXB8 GSC GATA3 MSX1 HOXC9 CYP26C1 DUOXA1 ALX4 HOXA3 TAP1 EYA4 DSC3 HOXA9 SFRP1 HOXD8 FOXF1 QRFPR VSX1 HCG9 HOXC11 TAC1 HTR1E OTP ALX1 HOXA10 CALCA HOXD12 RAB32 CIDEA HIST1H4L PDGFRA HOXA7 FOXG1 PTGDR HIST1H3C BARHL2 ZIC4 APBB1IP OLIG3 HSPA1A LBX1 PITX1 IGFBP3 EN1 CBLN1 ZIC1 PAX6 DUOX1 PRDM14 HOXB1 HOXD1 ADARB2 CMTM2 MAGI2 HOXA6 HIC1 HTR1A RASGRF1 GATA4 HOXD13 TBX5 LHX8 UCN HLX DLX1 NKX2-3 FLJ32063 PAX9 HOXC13 HOXA4 TLX1 SIM2 HPSE2 PDX1 LOC153684 TMEM26 HIST1H3E DRD5 HS3ST3B1 VAX1 HOXC4 MAB21L1 PRLHR## BENPORATH_ES_WITH_H3K27ME3                  GBX2 OTP KLHL1 VAX1 PRLHR CALCA LRAT HSPA1A DUOXA1 HOXD8 OLIG3 LHX8 PCDHGC4 EVX1 DSC3 SOX17 HTR1A HLA-F SFRP1 PITX1 ZIC4 PAX3 HLX SIM2 ESR1 EXOC3L2 RFX4 ISL2 BARHL2 DUOX1 HOXA3 SIM1 CRTAC1 KLHL14 GATA4 HOXC11 HCG9 CBLN1 NKX2-3 FOXG1 UCN LOC153684 TAP1 FEZF2 DLK1 PAX6 RASGRF1 EYA4 HPSE2 ADARB2 HOXD1 HOXB8 ADCYAP1 GABRG3 TAC1 EGFLAM OSR2 CCNA1 EN1 HOXD13 HOXB7 ZIC1 KCNQ1 GHSR GATA3 PAX9 PTGDR ALX4 HOXA2 HOXD4 PDGFRA PTPRN2 TBX5 CMTM2 PDX1 HOXD12 HOXD9 HSPA1L MSX1 IGFBP3 NEUROD2 GPR150 DCC DLX1 CCDC140 MAB21L1 HOXB1 CHST8 CYP26C1 HOXD3 TLX1 FLJ45983 PRRT1 TP73 CIDEA FOXF1 DRD5 SIX6 HOXA6 HOXA9 HOXC4 THBS2 HOXA10 HOXC5 HIC1 DKK1 HOXA4 MKX DLX5 VSX1 HOXA7 HOXC6 FLJ32063 HS3ST3B1 ZFPM1 SLC6A2 LBX1 GSC## BENPORATH_SUZ12_TARGETS                                                                                                                 HOXC9 CIDEA RASGRF1 CRTAC1 SFRP1 FLJ45983 PRLHR FOXF1 CBLN1 TMEM26 TLX1 DUOXA1 ADCYAP1 EMX2 GATA5 HOXC6 MAB21L1 GHSR GSC FOXG1 ALX4 MSX1 ISL2 HOXD12 ADARB2 DLX1 ZIC1 CCDC140 LOC153684 SPOCK2 QRFPR HOXD13 PAX6 HOXD1 HPSE2 HOXC11 DKK1 VSX1 LHX8 GPR158 PAX9 VAX1 HOXC4 NEUROD2 DCC PAX3 MKX CYP26C1 PITX1 UCN DRD5 HOXC5 GATA4 SLC6A2 PDX1 HOXC13 GPR83 CHST8 HOXB1 PTGDR TOX2 TP73 GBX2 TBX5 CRABP1 EN1 APBB1IP GATA3 FLJ32063 SIX6 ZAR1 TRPA1 HOXD4 ZIC4 CCNA1 HTR1A OSR2 HOXB8 PDGFRA HOXD9 CALCA ACCN1 SOX17 BARHL2 EGFLAM HOXB7 FEZF2 DSC3 ALDH1L1 HS3ST3B1 LBX1 GIPC2 HLX ALX1 NKX2-3 DUOX1 SIM2 UPB1 HOXD8 HOXD3 OTP CMTM2## HOX_GENES                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                 HOXA6 HOXC5 HOXD10 HOXB5 LBX1 HOXC6 HOXA5 HOXD9 HOXA2 HOXC11 HOXD13 HOXA4 HOXA11 HOXC13 PITX1 HOXD1 HOXD12 HOXA3 HOXA10 HOXD4 HOXB7 HOXC4 HOXB8 HOXA7 HOXD3 HOXB1 HOXB9 TLX1 HOXA9## BENPORATH_PRC2_TARGETS                                                                                                                                                                                                                                                                 TP73 CCDC140 HOXD13 SOX17 PDX1 TBX5 OTP HOXC5 HOXD3 DKK1 PAX6 PDGFRA DLX1 HOXD12 ZIC1 CHST8 CRTAC1 CIDEA DSC3 HOXC11 ISL2 NEUROD2 VAX1 EN1 GBX2 ADARB2 MKX PAX9 PRLHR DCC TLX1 RASGRF1 GHSR UCN ADCYAP1 HOXD1 GATA4 SIM2 HTR1A HLX LOC153684 GSC LHX8 SIX6 FEZF2 HOXB7 HOXD8 PTGDR PAX3 DUOX1 DRD5 LBX1 NKX2-3 CALCA HPSE2 CMTM2 GATA3 ZIC4 FOXF1 HOXC4 HS3ST3B1 HOXB8 FLJ32063 MSX1 PITX1 FLJ45983 CBLN1 HOXC6 DUOXA1 BARHL2 MAB21L1 CYP26C1 HOXB1 SFRP1 HOXD4 VSX1 HOXD9 ALX4 FOXG1## MEISSNER_BRAIN_HCP_WITH_H3K27ME3                                                                                                                                                                                                                                                                                                                                                                                                                                                       HOXB5 LOXL1 PRDM6 LHX8 HOXD1 GATA5 ALX1 EVX1 HOXC4 TNXB SIM1 OTP TLX1 PRDM14 OLIG3 HOXC13 DBX1 HOXD9 HOXB9 SLC6A2 SIM2 GSC HOXC11 TP73 HOXC5 KRT7 HOXA11 HOXD8 ALX4 PRSS50 HOXA4 HOXB7 GATA4 HOXD12 HOXD10 LBX1 GATA3 HOXA10 GBX2 BARHL2 PITX1 HOXA5 SIX6 LHX1 DLK1 HOXD13 HOXB4 PDX1# Above is the GSEA result for differential methylation regions.# Too many information may be printed, so we are not going to show the result here.

6.10.1 Empirical Bayes GSEA method

As we introduced before, ChAMP now incorporated a new method called “ebayes” in champ.GSEA(), this method is different from most other GSEA method because it does not need DMP or DMR information. Thus actually users could run this method separately from champ.GSEA(). We provided champ.ebayGSEA() function in ChAMP, for users who want to directly do GSEA from normalized beta matrix and phenotype. Like below:

myebayGSEA <- champ.ebayGSEA(beta=myNorm,pheno=myLoad$pd$Sample_Group,arraytype="450K")

This is a promising GSEA method, without bias from number of CpGs enriched in genes, or significant level of CpGs.

6.11 Differential Methylated Interaction Hotspots

Another function included in ChAMP is champ.EpiMod(). This function uses FEM package to infer differentially methylated gene modules within a user-specific functional gene-network. This network could be e.g. a protein-protein interaction network. Thus, the champ.EpiMod() function can be viewed as a functional supervised algorithm, which uses a network of relations between genes (usually a PPI network), to identify subnetworks where a significant number of genes are associated with a phenotype of interest (POI). The EpiMod algorithm can be run in two different modes: at the probe level, in which case the most differentially methylated probe is assigned to each gene, or at the gene-level in which case a DNAm value is assigned to each gene using an optimized procedure described in detail in Jiao Y, Widschwendter M, Teschendorff AE Bioinformatics 2014. Originally, the FEM package was developed to infer differentially methylated gene modules which are also deregulated at the gene expression level, however here we only provide the EpiMod version, which only infers differentially methylated modules. More advanced user may refer to FEM package for more information.

User may use following code to do champ.EpiMod()

myEpiMod <- champ.EpiMod(beta=myNorm,pheno=myLoad$pd$Sample_Group)

The all significant modules would be returned. And PDF version plots would returned in resultsDir in function parameter. Like below:

Above we demonstrated four modules detect by champ.EpiMod() based on our small 450K test data. But in practice, each modules would be saved in to a pdf file individually in resultsDir directory.

6.12 Copy Number Variation

For Copy Number Variation analysis, we provide champ.CNA() to achieve this work. This function uses the HumanMethylation450 or HumanMethylationEPIC data to identify copy number alterations10. The function utilises the intensity values for each probe to count copy number and determine if copy number alterations are present. Copy number is determined using the CopyNumber package.

Basically there are two methods you may choose to do CNA analysis: you may compare the copy number various between case samples and control samples; or you can compare copy number of each sample to the average copy number status of all samples. For the first method, you may specify which groups of samples in your pheno parameter, or ChAMP already included some blood control samples itself, you may use them as control then calculate the copy number variance, all you need to do is assign parameter control=TRUE and controlGroup="champCtls". For the other method, you may assign control=FALSE to apply.

There are two kinds of plots would be generated, analysis for each sample (sampleCNA=TRUE parameter) and analysis for each group (groupFreqPlots=TRUE parameter). Like champ.QC() function before, there are two parameter could be assigned for plots. The Rplot parameter would control if champ.CNA() would plot with R session, and PDFplot parameter would control if champ.CNA() would save PDF file to resultsDir. By default, only PDFplot is TRUE.

Note that different from old version of ChAMP. No batch correction would be run automatically on intensity data, if you want to correct batch effect on intensity data, please use champ.runCombat(), the usage for intensity data is exactly the same for beta or M matrix, the only difference is you need to replace the beta value with myLoad$intensity and set logitTrans=FALSE.

Feber10 compared the results obtained using this method to copy number data from Illumina CytoSNP array and Affymetrix SNP 6.0 arrays and found that using this method for 450k data was effective in identifying regions of gain and loss.

Users may use following code to calculate Copy Number Variance:

myCNA <- champ.CNA(intensity=myLoad$intensity,pheno=myLoad$pd$Sample_Group)

The main usage of champ.CNA() function is to generate plot for sample and groups in pheno. Like below:

6.13 Cell Type Heterogeneity

DNA methylation profiles are usually generated in samples that constitute mixtures over many cell-types. Given that DNAm is highly cell-type specific, thus there is often the need to infer DMPs/DMRs which are not driven by underlying changes in cell-type composition. Many methods have been proposed to deal with the cell-type heterogeneity problem, like RefbaseEWAS.

RefbaseEWAS method uses a reference-database of DNAm profiles of the major cell-types thought to be present in the tissue of interest, and uses this reference to infer sample-specific cell-type proportions using a form of constrained multivariate regression, known as constrained projection or quadratic programming. These cell-type proportions can subsequently be used as covariates to infer phenotype-associated changes which are not driven by changes in cell-type composition. We included this refbase method into our ChAMP for users who want to detect cell proportion and remove cell type influence on their WHOLE BLOOD data.

In ChAMP, we include a reference databases for whole blood, one for 27K and the other for 450K. After champ.refbase() function, cell type heterogeneity corrected beta matrix, and cell-type specific proportions in each sample will be returned. Do remember champ.refbase() can only works on Blood Sample Data Set.

myRefBase <- champ.refbase(beta=myNorm,arraytype="450K")# Our test data set is not whole blood. So it should not be run here.

Note that our demo 450K array is not whole blood data, so champ.refbase() can not works on this data set. Above code is only demo here. In the future, we will add more reference for champ.refbase function.

7 Summary

We have briefly described the various pipelines and functions included in the new ChAMP package. In the future we hope to improve it further, along with more functionality. If you have any question, bug report, or suggestion for improving ChAMP, please email it to champ450k@gmail.com. Feeback is key for improving a package which tries to seamlessly incorporate functionality and flexibility from many other tools.

8 Citing ChAMP

ChAMP is a wrapper incorporating algorithms and functions from many other sources. The majority of functions used in ChAMP are incorporated directly from other packages or are drawn from specific published algorithms. Hence, below we provide guidance on which papers should be cited alongside ChAMP when using these functions:

If you use the ChAMP package, please cite:

Morris, T. J., Butcher, L. M., Feber, A., Teschendorff, A. E., Chakravarthy, A. R., Wojdacz, T. K., and Beck, S. (2014). Champ: 450k chip analysis methylation pipeline pg - 428-30. Bioinformatics, 30(3), 428-30.

The loading part of ChAMP (champ.load) uses minfi package, so please cite:

Aryee MJ, Jaffe AE, Corrada-Bravo H, Ladd-Acosta C, Feinberg AP, Hansen KD and Irizarry RA (2014). Minfi: A flexible and comprehensive Bioconductor package for the analysis of Infinium DNA Methylation microarrays. Bioinformatics, 30(10), pp. 1363-1369. doi: 10.1093/bioinformatics/btu049 .

Jean-Philippe Fortin, Timothy Triche, Kasper Hansen. Preprocessing, normalization and integration of the Illumina HumanMethylationEPIC array. bioRxiv 065490; doi: https://doi.org/10.1101/065490

Filtering out probes with SNPs was based following recommendations presented in this paper:

Zhou W, Laird PW and Shen H: Comprehensive characterization, annotation and innovative use of Infinium DNA Methylation BeadChip probes. Nucleic Acids Research 2016

FunctionalNormliazation and SWAN normalization methods are based on:

Jean-Philippe Fortin, Timothy Triche, Kasper Hansen. Preprocessing, normalization and integration of the Illumina HumanMethylationEPIC array. bioRxiv 065490; doi: https://doi.org/10.1101/065490

Maksimovic J et a. SWAN: Subset-quantile within array normalization for illumina infinium humanmethylation450 beadchips. Genome Biol. 2012;13(6):R44.

The default type-2 probe correction method in ChAMP is BMIQ, based on:

Teschendorff AE et al . A beta-mixture quantile normalization method for correcting probe design bias in illumina infinium 450 k dna methylation data. Bioinformatics. 2013;29(2):189-196.

Another type-2 probe correction method is PBC, which was presented in:

Dedeurwaerder S et a. Evaluation of the infinium methylation 450K technology. Epigenomics. 2011;3(6):771-784.

champ.SVD() function is based on:

Teschendorff AE et a. An epigenetic signature in peripheral blood predicts active ovarian cancer. PLoS One. 2009;4(12):e8274.

ChAMP uses the batch correction method Combat as implanted in the SVA package. Please cite:

Johnson WE et a. Adjusting batch effects in microarray expression data using empirical bayes methods. Biostatistics. 2007;8(1):118-127.

Leek JT, Johnson WE, Parker HS, Fertig EJ, Jaffe AE, Storey JD, Zhang Y and Torres LC (2017). sva: Surrogate Variable Analysis. R package version 3.24.4.

Above are all functions related to Data Normalization. Below are functions and methods for downstream analyses.

If using limma to find DMPs, please cite:

Smyth GK. Linear models and empirical bayes methods for assessing differential expression in microarray experiments. Stat Appl Genet Mol Biol. 2004;3:Article3. Epub 2004 Feb 12. PubMed PMID: 16646809 .

Wettenhall JM, Smyth GK. limmaGUI: a graphical user interface for linear modeling of microarray data. Bioinformatics. 2004 Dec 12;20(18):3705-6. Epub 2004 Aug 5. PubMed PMID: 15297296 .

For DMR detection, if you used Bumphunter method, please cite:

Jaffe AE et a. Bump hunting to identify differentially methylated regions in epigenetic epidemiology studies. Int J Epidemiol. 2012;41(1):200-209.

If you used ProbeLasso function, please cite:

Butcher LM, Beck S. Probe lasso: A novel method to rope in differentially methylated regions with 450K dna methylation data. Methods. 2015;72:21-28.

Or if you used DMRcate function, please cite:

Peters TJ, Buckley MJ, Statham AL, et al. De novo identification of differentially methylated regions in the human genome. Epigenetics & Chromatin. 2015;8(1):1-16.

In this version of ChAMP, we provided a function to find Differential Methylation Blocks (DMB). The reference should be:

Hansen KD, Timp W, Bravo HC, et al. Increased methylation variation in epigenetic domains across cancer types. Nat Genet. 2011;43(8):768-775.

Timp W, Bravo HC, McDonald OG, et al. Large hypomethylated blocks as a universal defining epigenetic alteration in human solid tumors. Genome Med. 2014;6(8):61.

For GSEA method, if you used ‘gometh’ method, please cite below paper:

Paul Geeleher, Lori Hartnett, Laurance J. Egan, Aaron Golden, Raja Affendi Raja Ali, and Cathal Seoighe Gene-set analysis is severely biased when applied to genome-wide methylation data Bioinformatics 2013 : btt311v2-btt311.

Young MD, Wakefield MJ, Smyth GK and Oshlack A (2010). “Gene ontology analysis for RNA-seq: accounting for selection bias.” Genome Biology, 11, pp. R14.

Phipson, B, Maksimovic, J, Oshlack, A (2016). missMethyl: an R package for analyzing data from Illumina’s HumanMethylation450 platform. Bioinformatics, 32, 2:286-8.

If you used the FEM package to find Differential Methylated Modules in gene networks, please cite:

Jiao Y, Widschwendter M, Teschendorff AE. A systems-level integrative framework for genome-wide dna methylation and gene expression data identifies differential gene expression modules under epigenetic control. Bioinformatics. 2014;30(16):2360-2366.

The CNA detection function was provided by Andy Feber of UCL, you may citate related paper below:

Feber A, Guilhamon P, Lechner M, et al. Using high-density dna methylation arrays to profile copy number alterations. Genome Biol. 2014;15(2):R30.

We included two methods for cell proportion correction. If you used the Refbase method in ChAMP, please cite:

Houseman EA, Accomando WP, Koestler DC, et al. DNA methylation arrays as surrogate measures of cell mixture distribution. BMC Bioinformatics. 2012;13(1):1-16.

References

  1. Maksimovic J et a. SWAN: Subset-quantile within array normalization for illumina infinium humanmethylation450 beadchips. Genome Biol. 2012;13(6):R44.
  2. Dedeurwaerder S et a. Evaluation of the infinium methylation 450K technology. Epigenomics. 2011;3(6):771-784.
  3. Teschendorff AE et al . A beta-mixture quantile normalization method for correcting probe design bias in illumina infinium 450 k dna methylation data. Bioinformatics. 2013;29(2):189-196.
  4. Fortin J-P, Labbe A, Lemire M, et al. Functional normalization of 450k methylation array data improves replication in large cancer studies. Genome Biology. 2014;15(11):1-17.
  5. Aryee MJ, Jaffe AE, Corrada-Bravo H, et al. Minfi: a flexible and comprehensive Bioconductor package for the analysis of Infinium DNA methylation microarrays. Bioinformatics. 2014;30(10):1363-1369.
  6. Fortin J-P, Triche T, Hansen K. Preprocessing, normalization and integration of the illumina humanmethylationepic array. bioRxiv. 2016. doi:10.1101/065490
  7. Teschendorff AE et a. An epigenetic signature in peripheral blood predicts active ovarian cancer. PLoS One. 2009;4(12):e8274.
  8. Johnson WE et a. Adjusting batch effects in microarray expression data using empirical bayes methods. Biostatistics. 2007;8(1):118-127.
  9. Houseman EA, Accomando WP, Koestler DC, et al. DNA methylation arrays as surrogate measures of cell mixture distribution. BMC Bioinformatics. 2012;13(1):1-16.
  10. Feber A, Guilhamon P, Lechner M, et al. Using high-density dna methylation arrays to profile copy number alterations. Genome Biol. 2014;15(2):R30.
  11. Butcher LM, Beck S. Probe lasso: A novel method to rope in differentially methylated regions with 450K dna methylation data. Methods. 2015;72:21-28.
  12. Jaffe AE et a. Bump hunting to identify differentially methylated regions in epigenetic epidemiology studies. Int J Epidemiol. 2012;41(1):200-209.
  13. Peters TJ, Buckley MJ, Statham AL, et al. De novo identification of differentially methylated regions in the human genome. Epigenetics & Chromatin. 2015;8(1):1-16.
  14. Hansen KD, Timp W, Bravo HC, et al. Increased methylation variation in epigenetic domains across cancer types. Nat Genet. 2011;43(8):768-775.
  15. Young MD, Wakefield MJ, Smyth GK, Oshlack A. Gene ontology analysis for RNA-seq: accounting for selection bias. Genome Biol. 2010;11(2):R14.
  16. Jiao Y, Widschwendter M, Teschendorff AE. A systems-level integrative framework for genome-wide dna methylation and gene expression data identifies differential gene expression modules under epigenetic control. Bioinformatics. 2014;30(16):2360-2366.
  17. Wang D et a. IMA: An r package for high-throughput analysis of illumina’s 450K infinium methylation data. Bioinformatics. 2012;28(5):729-730.
  18. Davis S, Du P, Bilke S, Triche T, Jr., Bootwalla M. Methylumi: Handle illumina methylation data. 2013.
  19. Assenov Y, Muller F, Lutsik P. RnBeads - Comprehensive DNA Methylation Analysis. 2012.
  20. Wong C, Pidsley R, Schalkwyk L. The wateRmelon Package. 2012.
  21. Zhou W, Laird PW, Shen H. Comprehensive characterization, annotation and innovative use of infinium dna methylation beadchip probes. Nucleic Acids Research. 2016. doi:10.1093/nar/gkw967
  22. Nordlund J, Bäcklin CL, Wahlberg P, et al. Genome-wide signatures of differential dna methylation in pediatric acute lymphoblastic leukemia. Genome Biology. 2013;14(9):r105. doi:10.1186/gb-2013-14-9-r105
  23. Yuan YA de J Tian AND Jiao. An integrative multi-scale analysis of the dynamic dna methylation landscape in aging. PLoS Genet. 2015;11(2):1-21.
  24. Timp W, Bravo HC, McDonald OG, et al. Large hypomethylated blocks as a universal defining epigenetic alteration in human solid tumors. Genome Med. 2014;6(8):61.
  25. Phipson B, Maksimovic J, Oshlack A. missMethyl: an R package for analyzing data from Illumina’s HumanMethylation450 platform. Bioinformatics. 2016;32(2):286-288.

微信扫一扫,分享到朋友圈

生信-ChAMP学习笔记
梓沨

站长 INTP,生物搬砖工