STATISTICAL METHODS FOR LOW-FREQUENCY AND RARE GENETIC VARIANTS
by
Clement Ma
A dissertation submitted in partial fulfillment
of the requirements for the degree of
Doctor of Philosophy
in the University of Michigan
Professor Michael L. Boehnke, Co-Chair
Research Associate Professor Laura J. Scott, Co-Chair
Professor Gonçalo Abecasis
Assistant Professor Hyun M. Kang
Assistant Professor Seunggeun Lee
Professor Peter X. Song
Assistant Professor Cristen J. Willer

© Clement Ma 2014

Dedication
To my wife, Joyce.
Acknowledgements
I would like to express my deepest thanks and gratitude to my advisors Mike Boehnke and Laura Scott. Both of you were outstanding mentors, and encouraged me to achieve what I once thought was impossible. Without your careful guidance and mentorship, I would not be the statistical geneticist that I am today. I also want to thank my committee members: Gonçalo Abecasis, Hyun Min Kang, Seunggeun Lee, Peter Song, and Cristen Willer for their constructive feedback and support for my dissertation research.
I would like to thank all my colleagues from the Center for Statistical Genetics. I learned a great deal from Tom Blackwell, who was an active collaborator on my first two dissertation topics. Thanks to Sean Caron and Paul Anderson for helping me run my simulations smoothly and efficiently on the computing cluster. I want to thank Dawn Keene and Laura Baker for helping me on my faculty applications and other administrative issues.
I want to thank my many colleagues and collaborators outside the University of Michigan.
Thank you to all the GoT2D study collaborators for allowing me to use an early data freeze of the sequencing data for my dissertation research. I gained many useful insights from the regular participants of the Single Variant Group conference call. Thanks to Georg Heinze for helpful discussions regarding the Firth bias-corrected logistic regression test.
I want to thank all my Ann Arbor area friends who have made my five years here fun and memorable. Thanks to Mark Reppell and Adrian Tan who were always there, and served as groomsmen at my wedding ceremony. I would like to thank Ryan Welch, Rebecca Rothwell, Yancy Lo, Giorgio Pistis, Eleonora Porcu, Zhenzhen Zhang, Caroline Cheng, Katie Huang, Lisa Henn, Min A Jhun, Yeji Lee, Tanya Teslovich, Xueling Sim, Adam Locke, Christopher Moraes, and Stefanie Moraes for the many board game nights, dinners, and happy hours.
I want to thank my family, Danny, Joyce, and Winnie Ma for supporting me throughout my graduate studies. I was very fortunate that Ann Arbor is within driving distance to Toronto, so I was able to visit home frequently. Thanks to my new family, King, Susan, and Coral Wong, who have been very supportive and encouraging during my time in Michigan.
Most of all, I want to thank my wife, Joyce Wong, who supported me every step of the way.
Five years ago, she encouraged me to pursue my dream of doctoral studies, even though it would mean we would spend over four years living apart from each other. You were always there to cheer me up, listen to my fears, and share a laugh. I am so happy that you were able to join me in Ann Arbor, and watch me complete this long yet rewarding journey.
I truly could not have done this without you.
List of Tables
List of Figures
List of Supplemental Figures
Chapter 1: Introduction
Chapter 2: Recommended joint and meta-analysis strategies for case-control association testing of single low count variants
Chapter 3: Near equivalent calibration and power of joint and meta-analysis for association analysis of quantitative traits
Chapter 4: Evaluating the calibration and power of three gene-based association tests for the X chromosome
Chapter 5: Summary, discussion, and future directions
List of Tables
Table 3.1: Sample-sizes and untransformed HDL values for GoT2D studies and substudies
Table 4.1: Sample sizes for simulated case-control datasets
Table 4.2: Sample sizes for simulated quantitative trait datasets
Table 4.3: Type I error rates for burden, SKAT, and SKAT-O tests in binary and quantitative trait studies.
List of Figures
Figure 2.1: Type I error rates by minor allele count (MAC) for logistic regression tests in joint and meta-analysis.
Figure 2.2: Type I error rates by case-control ratio for logistic regression tests in joint and meta-analysis.
Figure 2.3: Simulation-based power curves for joint and meta-analysis.
Figure 2.4: Joint analysis type I error rates by sample size for fixed expected minor allele count (MAC)
Figure 2.5: Logistic regression p-value distributions for fixed total minor allele count (MAC).
Figure 2.6: Comparison of score test-based meta-analysis and Firth test-based joint analysis p-values in the GoT2D study
Figure 3.1: Type I error rates of inverse-normalized and normally distributed quantitative traits (QTs) for linear regression in joint and meta-analysis.
Figure 3.2: Type I error rates of inverse-normalized quantitative traits (QTs) for linear regression in joint and meta-analysis.
Figure 3.3: Type I error rates of non-normally distributed quantitative traits (QTs) for linear regression in joint and meta-analysis.
Figure 3.4: Power of linear regression in joint and meta-analysis.
Figure 3.5: Joint and meta-analysis of high density lipoprotein (HDL) in the GoT2D study
Figure 4.1: Power for gene-based tests in case-control studies assuming all causal variants are deleterious.
Figure 4.2: Power for gene-based tests in case-control studies assuming causal variants are 50% deleterious and 50% protective.
Figure 4.3: Power for gene-based tests in QT studies assuming all causal variants are deleterious.
Figure 4.4: Power for gene-based tests in QT studies assuming causal variants are 50% deleterious and 50% protective.
List of Supplemental Figures
Figure S2.1: Type I error rates by fixed expected minor allele count (MAC) for different sample sizes.
Figure S2.2: Meta-analysis type I error rates by sample size for fixed expected minor allele count (MAC)
Figure S2.3: Comparison of score and Firth test association p-values in the GoT2D study
Figure S2.4: Comparison of joint and meta-analysis p-values in the GoT2D study.
Figure S2.5: Score test type I error rate and power with study-level minor allele count (MAC) filters.
Figure S2.6: Score test type I error rate and power curves for meta-analysis of K = 10 and 50 sub-studies
Figure S2.7: Type I error rates by minor allele count (MAC) for logistic regression tests and Fisher's exact test in joint and meta-analysis.
Figure S2.8: Type I error rates by case-control ratio for logistic regression and Fisher's exact tests in joint and meta-analysis.
Figure S2.9: Simulated power curves for joint and meta-analysis.
Figure S3.1: Type I error rates of normally distributed quantitative traits (QTs) for linear regression in joint and meta-analysis with covariates.
Figure S3.2: Type I error rates of additional non-normally distributed quantitative traits (QTs) for linear regression in joint and meta-analysis.
Figure S4.1: Complete type I error rates for the burden (BURD), SKAT, and SKAT-O tests in case-control studies.
Figure S4.2: Type I error rates based on simulated datasets with re-sampling and without re-sampling.
Figure S4.3: Power simulated with X-inactivation for gene-based tests in case-control studies assuming all causal variants are deleterious.
Figure S4.4: Power simulated with X-inactivation for gene-based tests in case-control studies assuming causal variants are 50% deleterious and 50% protective.
Abstract
Genetic association studies using sequencing, dense-array genotyping, or sequencing-based imputation provide the means to identify low-frequency and rare variants associated with diseases and traits, but analysis of these variants presents new statistical challenges. Single marker tests (e.g. logistic and linear regression), and methods to combine information across studies (e.g. joint and meta-analysis) may be poorly calibrated and/or of low power.
The calibration and power of aggregation tests, where multiple rare variants are analyzed jointly, have not been evaluated for variants on the X chromosome. In my dissertation, I
address three topics:
First, for case-control studies, I evaluate the calibration and power of four logistic regression tests in joint and meta-analysis for low-frequency and rare variants and demonstrate that: (a) for joint analysis, the Firth bias-corrected test is best (i.e. most powerful among well-calibrated tests); (b) for meta-analysis of balanced studies (equal numbers of cases and controls), the score test is best, but is less powerful than Firth test-based joint analysis; and (c) for meta-analysis of sufficiently unbalanced studies, all four tests can be anti-conservative, particularly the score test.
Second, for quantitative trait (QT) studies, I evaluate the calibration and power of linear regression in joint and meta-analysis. For normally distributed QTs, I demonstrate that joint and sample-size weighted meta-analysis are equally well-calibrated and powerful for variants with expected minor allele count E[MAC]≥10, whereas inverse-variance weighted meta-analysis is slightly anti-conservative for small-sized studies. For non-normally distributed QTs, joint and meta-analysis are equally anti-conservative for low-frequency and rare variants. Inverse-normal transformation of the QT remedies this problem, but transforming QTs of any distribution reduces power.
Third, for case-control and QT studies, I evaluate the calibration and power of three aggregation tests for the X chromosome: burden, SKAT, and SKAT-O. For case-control studies, tests are relatively well-calibrated across all simulation scenarios. Power is usually slightly increased when the coding scheme for male genotypes matches the underlying model, but power loss is small when the model is misspecified. Differences in male:female ratio in cases and controls have little effect on power. For QTs, calibration and power results are very similar to those for binary traits.
Chapter 1: Introduction
Many human diseases and biological traits can be hereditary in nature [Gottlieb and Root, 1968; Kaprio et al., 1992; Silventoinen et al., 2003], but their genetic mechanisms are not fully understood. In genome-wide association studies (GWAS), we aim to identify genetic variants that cause differences in biological traits or disease risk. While many associated variants identified by GWAS are not causal, associated variants help localize genes or genomic regions that may harbor the true causal variants. Through fine-mapping and functional studies, we hope to identify the true causal variants, and better understand the biological mechanisms underlying human diseases and traits [Shea et al., 2011; Kulzer et al., 2014].
Genotype array-based common-variant GWAS have identified thousands of genetic variants associated with hundreds of different traits [Hindorff et al., 2012]. Investigators typically use case-control studies to detect disease-associated variants and cohort studies to detect quantitative trait (QT)-associated variants. We also often analyze QTs collected from case-control studies to identify variants associated with these QTs. To increase power to detect novel variants with small effect sizes in GWAS, investigators often combine samples across multiple association studies, typically using meta-analysis of summary-level association results [Scott et al., 2007], and less frequently, joint analysis of the combined individual-level data [Schizophrenia Psychiatric Genome-Wide Association Study Consortium, 2011].
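To make the distinction between the two summary-level meta-analysis weightings concrete, the following is a minimal Python sketch of the standard fixed-effects formulas. It is an illustration under stated assumptions, not the implementation used in this dissertation: the function names (`two_sided_p`, `sample_size_weighted_meta`, `inverse_variance_weighted_meta`) are hypothetical, and each study is assumed to report either a signed Z-score with its sample size, or an effect estimate with its standard error.

```python
from math import sqrt
from statistics import NormalDist


def two_sided_p(z):
    """Two-sided p-value for a standard normal test statistic."""
    return 2.0 * (1.0 - NormalDist().cdf(abs(z)))


def sample_size_weighted_meta(z_scores, sample_sizes):
    """Combine per-study signed Z-scores with weights sqrt(N_k).

    Assumption: all Z-scores are signed with respect to the same
    reference allele.
    """
    weights = [sqrt(n) for n in sample_sizes]
    numerator = sum(w * z for w, z in zip(weights, z_scores))
    denominator = sqrt(sum(w * w for w in weights))
    z_meta = numerator / denominator
    return z_meta, two_sided_p(z_meta)


def inverse_variance_weighted_meta(betas, std_errs):
    """Combine per-study effect estimates with weights 1 / SE_k^2."""
    weights = [1.0 / (se * se) for se in std_errs]
    total_weight = sum(weights)
    beta_meta = sum(w * b for w, b in zip(weights, betas)) / total_weight
    se_meta = sqrt(1.0 / total_weight)
    return beta_meta, se_meta, two_sided_p(beta_meta / se_meta)


if __name__ == "__main__":
    # Hypothetical summary statistics from two studies of equal size.
    print(sample_size_weighted_meta([1.0, 1.0], [100, 100]))
    print(inverse_variance_weighted_meta([0.2, 0.4], [0.1, 0.1]))
```

Both functions need only per-study summary statistics, which is why meta-analysis is possible when individual-level data cannot be shared. Joint analysis instead fits a single model to the pooled individual-level data.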
Although early genotyping arrays could assay only hundreds of thousands of common variants per individual, these variants are sufficient to tag a large proportion of the common variation in the population [International HapMap Consortium, 2005]. Since studies use different genotype arrays, only the small subset of overlapping variants can be meta-analyzed together directly. Genotype imputation using early reference panels (such as HapMap haplotypes [International HapMap Consortium, 2005]) fills in missing common genotypes with high accuracy, and allows the meta-analysis of the same dense set of genetic markers across all available samples [Marchini et al., 2007; Li et al., 2010].
Nearly all associated variants identified by GWAS are common [Hindorff et al., 2012].