An adaptive gene-level association test for pedigree data
BMC Genetics volume 19, Article number: 68 (2018)
Abstract
Background
We propose a gene-level association test that accounts for individual relatedness and population structures in pedigree data in the framework of linear mixed models (LMMs). Our method data-adaptively combines the results across a class of score-based tests and requires fitting only a single null model (under the null hypothesis) for the whole genome, which makes it computationally efficient.
Results
We applied our approach to test for association with the post- to pretreatment ratio of high-density lipoprotein (HDL) in the GAW20 data. Using an LMM similar to that used by Aslibekyan et al. (PLoS One, 7:e48663, 2012), our method identified 2 nearly significant genes (APOA5 and ZNF259) near rs964184, whereas neither the other gene-level tests nor the standard test on each individual single-nucleotide polymorphism (SNP) detected any significant gene in a genome-wide scan.
Conclusions
Gene-level association testing can be a complementary approach to SNP-level association testing, and our method is adaptive and efficient compared to several other existing gene-level association tests.
Background
Genome-wide association studies (GWASs) are the standard approach for detecting common genetic variants associated with complex traits. It has become popular to extend the widely used single-nucleotide polymorphism (SNP)-level analysis to gene-level analysis by aggregating multiple SNPs in a gene or other functional unit. As a complement to the standard single SNP-based approach, the gene-level approach can achieve higher reproducibility and power. An additional benefit of the gene-level approach is that fewer hypotheses need to be tested, thereby reducing the burden of multiple testing.
The goal of this work is to perform a gene-level association test to detect genes significantly associated with a single trait using the GAW20 data while effectively controlling the false-positive rate. Note that the candidate gene approach conducted by Aslibekyan et al. was based on the 95 loci drawn from previous studies based on SNP-level association testing [1], and found SNP rs964184 to be strongly associated with the post- to pretreatment ratio of high-density lipoprotein (HDL). We are interested in determining whether a gene-level analysis can uncover significantly associated genes, and, in particular, whether the genes near rs964184 are significantly associated in a genome-wide scan. Specifically, we apply the adaptive sum of powered score (aSPU) test [2], which is designed to account for unknown and varying association patterns (e.g., varying numbers or proportions of associated SNPs) across the genes, thus maintaining higher power than other nonadaptive gene-level tests. The aSPU test is computationally feasible because it does not require fitting separate models for each SNP or gene, and it satisfactorily controls false-positive rates. Note that the aSPU test was originally proposed for generalized linear models, and was later extended to generalized estimating equations and generalized linear mixed models (GLMMs) [3,4,5]. Its application to and empirical performance in linear mixed models (LMMs), especially with large pedigree data, have not been discussed in previous studies.
The Genetics of Lipid Lowering Drugs and Diet Network (GOLDN) study collected pedigree data, motivating the use of LMMs to account for population structures and relatedness as adopted by Aslibekyan et al. [1]. In our LMM, we account for genetic relatedness among subjects as a random effect with a covariance matrix calculated from individual-level SNP data. We also adjust for covariates such as age, gender, and study center. In this paper, we present the results of the aSPU test based on the LMM and compare them with those of other existing gene-level tests and of individual SNP analysis.
Methods
Suppose that y_{i} denotes a quantitative trait for individual i = 1, ⋯, n, X_{i} = (X_{i1}, ⋯, X_{iq})^{′} is a vector of q covariates, and G_{i} = (G_{i1}, ⋯, G_{ip})^{′} is a vector of p SNPs in a gene for individual i. An LMM is constructed as
\[ y_i = X_i^{\prime}\alpha + G_i^{\prime}\beta + b_i + \varepsilon_i, \tag{1} \]
where α and β are the unknown regression coefficient vectors for the corresponding covariates and SNPs, and b_{i} and ε_{i} are a random intercept and an error term that are independent of each other. We further assume that the error terms ε_{i} are independently distributed, but the b_{i} are not. Specifically,
\[ b = (b_1, \cdots, b_n)^{\prime} \sim N(0, \sigma_b^2 \Psi), \qquad \varepsilon_i \sim N(0, \sigma_e^2), \tag{2} \]
where Ψ is a known n × n genetic relationship matrix, which reflects the genetic relatedness among the subjects in the data, and σ_b^2 and σ_e^2 are the variance components of the random intercept and the error term. The null hypothesis to be tested for association between the group of SNPs and the trait is H_{0} : β = 0.
Fitting (generalized) LMMs can be computationally demanding. However, using penalized quasi-likelihood (PQL) to fit the model enables us to extract the test statistic for score-based tests including the aSPU test [6]. It is known that maximizing PQL is equivalent to maximizing the likelihood for quantitative traits. Specifically, we first need to fit the LMM under the null hypothesis,
\[ y_i = X_i^{\prime}\alpha + b_i + \varepsilon_i, \tag{3} \]
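from which the score vector U = (U_{1}, ⋯, U_{p})^{′}, to be used to construct various gene-level score-based tests, can be expressed as
\[ U = G^{\prime}\hat{\Sigma}^{-1}(y - X\hat{\alpha}), \tag{4} \]
where y = (y_{1}, ⋯, y_{n})^{′}, X and G are the n × q covariate matrix and the n × p genotype matrix with rows X_{i}^{′} and G_{i}^{′}, and \( \hat{\alpha} \) and \( \hat{\Sigma} \) are the estimate of α and the estimated covariance matrix of y under the null model.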
The aSPU test statistic can be obtained using the score vector U and its covariance matrix V under the null hypothesis, which can also be written in a closed form. Because the score vector asymptotically follows a normal distribution with mean zero under the null hypothesis, one can use the Monte Carlo method to compute p values. Note that both U and V depend only on the null model (3), which provides computational efficiency when the number of tests is large, as in a genome-wide scan. We can use the R package GMMAT to derive U and V [7].
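As a rough illustration of why a single null fit is enough, the sketch below computes a score vector and its covariance for one gene with NumPy. It is not the GMMAT interface: the function name null_score, the toy data, and the assumption that the two variance components (var_b, var_e) are already known from the null fit are all hypothetical choices made for this example. The formulas used are the standard generalized least squares score U = G′Σ⁻¹(y − Xα̂) and V = G′PG, with P the usual projection matrix.

```python
import numpy as np

def null_score(y, X, G, Psi, var_b, var_e):
    """Score vector U and covariance V for H0: beta = 0.

    Assumes the variance components var_b and var_e were already
    estimated from the null model (hypothetical inputs for this sketch).
    """
    n = y.shape[0]
    Sigma = var_b * Psi + var_e * np.eye(n)      # covariance of y under H0
    Si_X = np.linalg.solve(Sigma, X)             # Sigma^-1 X
    Si_G = np.linalg.solve(Sigma, G)             # Sigma^-1 G
    Si_y = np.linalg.solve(Sigma, y)             # Sigma^-1 y
    A = X.T @ Si_X                               # X' Sigma^-1 X
    alpha_hat = np.linalg.solve(A, X.T @ Si_y)   # GLS estimate of alpha
    U = Si_G.T @ (y - X @ alpha_hat)             # G' Sigma^-1 (y - X alpha_hat)
    XtSiG = Si_X.T @ G                           # X' Sigma^-1 G
    V = G.T @ Si_G - XtSiG.T @ np.linalg.solve(A, XtSiG)  # G' P G
    return U, V

# Toy data: 60 subjects, intercept plus 2 covariates, 4 SNPs in one gene.
rng = np.random.default_rng(1)
n, q, p = 60, 3, 4
X = np.column_stack([np.ones(n), rng.normal(size=(n, q - 1))])
G = rng.binomial(2, 0.3, size=(n, p)).astype(float)
L = rng.normal(size=(n, n))
Psi = L @ L.T / n + np.eye(n)   # arbitrary positive definite stand-in for the relationship matrix
var_b, var_e = 0.5, 1.0
b = rng.multivariate_normal(np.zeros(n), var_b * Psi)
y = X @ np.array([1.0, 0.5, -0.5]) + b + rng.normal(scale=np.sqrt(var_e), size=n)

U, V = null_score(y, X, G, Psi, var_b, var_e)
```

In a genome-wide scan only G changes from gene to gene, so the quantities that involve Sigma, X, and y alone could be computed once and reused.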
We briefly introduce the idea of the aSPU test here. All score-based association tests require U and V, and each nonadaptive test has its own advantages and disadvantages. For example, consider these 2 cases: (a) every SNP in a gene is associated with the trait with an equal effect size and direction, and (b) only one or a small proportion of the SNPs are associated. The burden test, which takes \( {\sum}_{j=1}^p{U}_j \) as a test statistic, is well suited to the first case, but it will lose power in the second case. On the other hand, the UminP test, which takes max{|U_{1}|, ⋯, |U_{p}|} as a test statistic when the variances of the score elements are the same, is advantageous in the second case but not in the first case. Thus, applying a single nonadaptive score-based test might not be powerful in gene-level analysis. The aSPU test offers a way to combine various score-based tests; it is based on a class of sum of powered score (SPU) tests indexed by a positive integer γ. Specifically, the SPU(γ) test statistic is
\[ \mathrm{SPU}(\gamma) = \sum_{j=1}^{p} U_j^{\gamma}. \tag{5} \]
It is easy to see that the burden test and the sum of squared score (SSU) test are equivalent to the SPU(1) and SPU(2) tests, respectively. It was also shown that SPU(2) is equivalent to the sequence kernel association test (SKAT) with the linear kernel and to Multivariate Distance Matrix Regression (MDMR) with the Euclidean distance (under the framework of LMM) [8]. Furthermore, assuming equal variances of the score elements, the UminP test is equivalent to the SPU test with γ = ∞. One can treat γ as a factor that determines the weight on each score element. The aSPU test uses the minimum p value of the SPU tests as the test statistic, which provides a general data-adaptive method to test for associations. The set γ ∈ {1, 2, ⋯, 8, ∞} was proposed by Pan et al. based on experience [2].
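The SPU and aSPU p values described above can be sketched in a few lines of NumPy. This is a minimal illustration rather than the authors' implementation: the function names spu_stats and aspu_test, the number of Monte Carlo draws, and the toy inputs are assumptions. Null score vectors are drawn from N(0, V), each SPU(γ) p value is the proportion of null statistics at least as extreme as the observed one, and the same draws are reused to calibrate the minimum p value.

```python
import numpy as np

GAMMAS = (1, 2, 3, 4, 5, 6, 7, 8, np.inf)

def spu_stats(U, gammas=GAMMAS):
    """SPU(gamma) statistics along the last axis of U; gamma = inf gives max |U_j|."""
    U = np.asarray(U, dtype=float)
    cols = [np.max(np.abs(U), axis=-1) if np.isinf(g) else np.sum(U ** int(g), axis=-1)
            for g in gammas]
    return np.stack(cols, axis=-1)

def aspu_test(U, V, n_sim=1000, gammas=GAMMAS, seed=0):
    """Monte Carlo p values for each SPU(gamma) test and for the aSPU test."""
    rng = np.random.default_rng(seed)
    t_obs = np.abs(spu_stats(U, gammas))                            # shape (k,)
    U0 = rng.multivariate_normal(np.zeros(len(U)), V, size=n_sim)   # null draws
    t_null = np.abs(spu_stats(U0, gammas))                          # shape (n_sim, k)
    p_spu = (np.sum(t_null >= t_obs, axis=0) + 1) / (n_sim + 1)
    # p value of every null draw, ranked against the other null draws
    rank_asc = np.argsort(np.argsort(t_null, axis=0), axis=0)       # 0 = smallest
    p_null = (n_sim - rank_asc) / n_sim
    min_p_null = p_null.min(axis=1)
    p_aspu = (np.sum(min_p_null <= p_spu.min()) + 1) / (n_sim + 1)
    return p_spu, p_aspu

# Toy usage: a score vector far from the null with identity covariance.
p_spu, p_aspu = aspu_test(np.array([10.0, 10.0, 10.0]), np.eye(3))
```

Reusing one set of null draws for both steps avoids a nested simulation, at the cost of p values that cannot be smaller than 1/(n_sim + 1); a genome-wide threshold therefore needs many more draws than the default used here.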
Results
The LMM we used for the GAW20 data was similar to that used by Aslibekyan et al.; we used the ratio of post- to pretreatment HDL as the trait, and age, gender, and study center as covariates. The only difference was the covariance matrix of the random effects. Our covariance matrix Ψ of the random effects reflected the genetic relatedness, where each Ψ_{ij} was the Pearson correlation coefficient between subjects i and j computed across 20,000 randomly selected SNPs. Our analysis was based on 821 subjects who did not have missing values in either the trait or the covariates. We included only common variants with minor allele frequencies (MAFs) greater than 0.05. Among those, we randomly imputed missing variants using the MAF if the proportion of missing values was less than 1%. This resulted in a total of 595,304 SNPs included in our analysis. For the gene-level analysis, we used hg18 as the reference genome, and each gene included the SNPs that were within 10,000 regions upstream or downstream of the gene’s coding region. In total, we included 22,434 genes in our analysis.
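The relationship matrix described above can be sketched as a subject-by-subject Pearson correlation over a random subset of SNPs. The function name relationship_matrix and the toy genotype matrix are hypothetical, and the sketch assumes genotypes are coded as 0/1/2 allele counts with no missing values; whether the genotypes were standardized per SNP before computing the correlations is not stated in the text, so none is applied here.

```python
import numpy as np

def relationship_matrix(genotypes, n_snps=20000, seed=0):
    """Pearson correlation between subjects across randomly selected SNPs.

    genotypes: (n_subjects, n_total_snps) array of allele counts (assumed 0/1/2).
    """
    rng = np.random.default_rng(seed)
    m = genotypes.shape[1]
    cols = rng.choice(m, size=min(n_snps, m), replace=False)  # random SNP subset
    # np.corrcoef treats each row (subject) as one variable
    return np.corrcoef(genotypes[:, cols].astype(float))

# Toy usage: 5 subjects and 200 SNPs, of which 100 are sampled.
demo_rng = np.random.default_rng(42)
demo_genotypes = demo_rng.integers(0, 3, size=(5, 200))
Psi_demo = relationship_matrix(demo_genotypes, n_snps=100)
```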
We conducted the SPU(γ) and aSPU tests under the LMM. In addition to the SPU(1), SPU(2), and SPU(∞) tests, whose theoretical equivalences with other existing gene-level tests are described in the Methods section, we also performed the gene-level score test and famSKAT (family-based sequence kernel association test) [9] using the same covariates and relationship matrix. Figure 1 shows the results of the tests. Using the Bonferroni adjustment for the genome-wide significance level (α = 0.05), the aSPU test and the score test did not detect any significant genes, but 2 genes (APOA5 and ZNF259 on chromosome 11) were close to being significant. However, these 2 genes were detected by the SPU(1) test, suggesting that their association effects were not dominated by a small number of variants. We emphasize the adaptiveness of the aSPU test by noting gene BUD13 on chromosome 11 and genes GUCD1 and SNRPD3 on chromosome 22, whose −log_{10}(p values) were less than 3 by the SPU(1) test, but much larger by the SPU(∞) test (as well as by a few other SPU tests and the aSPU test). We also note that APOA5 and ZNF259 were located near each other, as shown in Fig. 2. In particular, they shared 7 variants out of 9 SNPs in both genes. The gene-level score test yielded a gene (DDX42 on chromosome 17) that was almost significant at the genome-wide significance level, but the score test did not detect any loci near rs964184. Similarly, famSKAT did not detect any significant gene.
<insert Figure(s) 1 and 2 here>.
We also compared the gene-based tests to the score test for single variants. We used the usual 5 × 10^{−8} as the genome-wide significance level for the SNP-level analysis. Even though rs964184 turned out to be the SNP most significantly associated with the trait, its p value was far from the genome-wide significance level, as shown in Fig. 3. This example partially confirms the usefulness of gene-level testing.
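To make the multiple-testing argument concrete, the short calculation below compares the significance thresholds implied by the counts reported above. The variable names are illustrative; the Bonferroni threshold over all 595,304 SNPs is included only for comparison and is not a threshold used in the analysis.

```python
import math

alpha = 0.05
n_genes = 22434        # genes tested in the gene-level scan
n_snps = 595304        # SNPs tested in the SNP-level scan

gene_threshold = alpha / n_genes       # Bonferroni over genes, about 2.23e-6
snp_bonferroni = alpha / n_snps        # Bonferroni over SNPs, about 8.40e-8
snp_threshold = 5e-8                   # conventional genome-wide level used in the text

for label, thr in [("gene-level Bonferroni", gene_threshold),
                   ("SNP-level Bonferroni", snp_bonferroni),
                   ("SNP-level conventional", snp_threshold)]:
    print(f"{label}: p < {thr:.3g}, -log10(p) > {-math.log10(thr):.2f}")
```

On the −log10 scale the gene-level cutoff is about 5.65, compared with about 7.30 for the conventional SNP-level cutoff, which is the reduced multiple-testing burden mentioned in the Background.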
Discussion
In GWAS, individuals in pedigree data are not independent, thus motivating the use of (generalized) LMMs. We considered a general LMM with a random intercept that reflects the genetic relatedness among the subjects. We then conducted the aSPU test on the genes across the whole genome based on fitting a single null model, and identified 2 genes near SNP rs964184 to be nearly significant. In contrast, none of the SNPs, including SNP rs964184, were nearly significant in a standard single SNP-based analysis.
Conclusions
We have demonstrated the applicability and usefulness of our proposed aSPU test in LMMs for association analysis of large pedigree data. Furthermore, our study has confirmed possible advantages and complementary roles of gene-level analyses with the adaptive aSPU test when compared to standard single SNP-based analyses.
Abbreviations
aSPU: Adaptive sum of powered score
GLMM: Generalized linear mixed model
GWAS: Genome-wide association study
LMM: Linear mixed model
MAF: Minor allele frequency
SNP: Single-nucleotide polymorphism
SPU: Sum of powered score
References
1. Aslibekyan S, Goodarzi MO, Frazier-Wood AC, Yan X, Irvin MR, Kim E, Tiwari HK, Guo X, Straka RJ, Taylor KD, et al. Variants identified in a GWAS meta-analysis for blood lipids are associated with the lipid response to fenofibrate. PLoS One. 2012;7(10):e48663.
2. Pan W, Kim J, Zhang Y, Shen X, Wei P. A powerful and adaptive association test for rare variants. Genetics. 2014;197(4):1081–95.
3. Zhang Y, Xu Z, Shen X, Pan W; Alzheimer’s Disease Neuroimaging Initiative. Testing for association with multiple traits in generalized estimation equations, with application to neuroimaging data. Neuroimage. 2014;96:309–25.
4. Kim J, Zhang Y, Pan W. Powerful and adaptive testing for multi-trait and multi-SNP associations with GWAS and sequencing data. Genetics. 2016;203(2):715–31.
5. Park JY, Wu C, Basu S, McGue M, Pan W. Adaptive SNP-set association testing in generalized linear mixed models with application to family studies. Behav Genet. 2018;48(1):55–66.
6. Breslow NE, Clayton DG. Approximate inference in generalized linear mixed models. J Am Stat Assoc. 1993;88(421):9–25.
7. Chen H, Wang C, Conomos MP, Stilp AM, Li Z, Sofer T, Szpiro AA, Chen W, Brehm JM, Celedón JC, et al. Control for population structure and relatedness for binary traits in genetic association studies via logistic mixed models. Am J Hum Genet. 2016;98(4):653–66.
8. Pan W. Relationship between genomic distance-based regression and kernel machine regression for multi-marker association testing. Genet Epidemiol. 2011;35(4):211–6.
9. Chen H, Meigs JB, Dupuis J. Sequence kernel association test for quantitative traits in family samples. Genet Epidemiol. 2013;37(2):196–204.
Acknowledgements
We thank the reviewers for many helpful and constructive comments and the organizers of Genetic Analysis Workshop 20. This research was supported by the Minnesota Supercomputing Institute.
Funding
Publication of this article was supported by NIH R01 GM031575. This research was funded by NIH grants R21AG057038, R01HL116720, R01GM113250, and R01HL105397. CW was funded by the University of Minnesota Doctoral Dissertation Fellowship.
Availability of data and materials
The data that support the findings of this study are available from the Genetic Analysis Workshop (GAW), but restrictions apply to the availability of these data, which were used under license for the current study. Qualified researchers may request these data directly from GAW.
About this supplement
This article has been published as part of BMC Genetics Volume 19 Supplement 1, 2018: Genetic Analysis Workshop 20: envisioning the future of statistical genetics by exploring methods for epigenetic and pharmacogenomic data. The full contents of the supplement are available online at https://bmcgenet.biomedcentral.com/articles/supplements/volume-19-supplement-1
Author information
Contributions
JYP, CW, and WP designed the study. JYP and CW performed the data analysis. JYP drafted the manuscript. WP helped revise the manuscript. All authors read and approved the final manuscript.
Corresponding author
Correspondence to Wei Pan.
Ethics declarations
Ethics approval and consent to participate
Not applicable.
Consent for publication
Not applicable.
Competing interests
The authors declare that they have no competing interests.
Publisher’s Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
About this article
Cite this article
Park, J.Y., Wu, C. & Pan, W. An adaptive gene-level association test for pedigree data. BMC Genet 19, 68 (2018). https://doi.org/10.1186/s12863-018-0639-2
Keywords
 aSPU
 GWAS
 HDL
 Linear mixed models
 Score test