A general semiparametric approach to the analysis of genetic association studies in population-based designs
 Sharon Lutz^{1, 3, 4} (corresponding author),
 Wai-Ki Yip^{3, 4},
 John Hokanson^{2},
 Nan Laird^{3, 4} and
 Christoph Lange^{3, 4, 5, 6}
DOI: 10.1186/1471-2156-14-13
© Lutz et al.; licensee BioMed Central Ltd. 2013
Received: 15 October 2012
Accepted: 1 February 2013
Published: 28 February 2013
Abstract
Background
For genetic association studies in designs of unrelated individuals, current statistical methodology typically models the phenotype of interest as a function of the genotype and assumes a known statistical model for the phenotype. In the analysis of complex phenotypes, especially in the presence of ascertainment conditions, the specification of such model assumptions is not straightforward and is error-prone, potentially causing misleading results.
Results
In this paper, we propose an alternative approach that treats the genotype as the random variable and conditions upon the phenotype. Thereby, the validity of the approach does not depend on the correctness of assumptions about the phenotypic model. Misspecification of the phenotypic model may lead to reduced statistical power. Theoretical derivations and simulation studies demonstrate both the validity and the advantages of the approach over existing methodology. In the COPDGene study (a GWAS for Chronic Obstructive Pulmonary Disease (COPD)), we apply the approach to a secondary, quantitative phenotype, the Fagerstrom nicotine dependence score, that is correlated with COPD affection status. The software package that implements this method is available.
Conclusions
The flexibility of this approach enables the straightforward application to quantitative phenotypes and binary traits in ascertained and unascertained samples. In addition to its robustness features, our method provides the platform for the construction of complex statistical models for longitudinal data, multivariate data, multi-marker tests, rare-variant analysis, and others.
Keywords
Genetic association studies, Secondary phenotypes, Case-control, Ascertainment, Semiparametric
Background
In genetic association studies, individuals are often recruited based on case-control ascertainment conditions of the primary phenotype [1]. For the analysis of secondary phenotypes, this recruitment scheme can become problematic. If the secondary phenotype is correlated with the primary phenotype in a case-control study, the distribution of the secondary phenotype can be fundamentally different from that in the general population. For example, in a genetic association study of COPD in which all cases have COPD and control subjects have normal pulmonary function, the distribution of quantitative lung phenotypes can deviate substantially from their distribution in the general population. For samples that are ascertained in this fashion, standard statistical methods may lead to misleading results or may lack statistical power to identify true genotype-phenotype associations. There are several methods to accurately estimate the odds ratio of genetic variants for binary secondary phenotypes associated with case-control status [2–10], but these methods cannot easily accommodate continuous secondary phenotypes. For the special case that the secondary phenotype is normally distributed or binary, Lin & Zeng (2009) proposed an adjusted score test that incorporates genetic associations with affection status into the test statistic [11].
We present a more general approach that does not require any distributional assumptions for the secondary phenotype. We refer to the approach as the nonparametric population-based association test (NPBAT). The approach has a form similar to the Family Based Association Test (FBAT), a nonparametric test statistic that is frequently used in the family-based setting [12–15]. The flexibility of our approach allows us to construct a genetic association test for standard and complex phenotypes that is nonparametric with respect to the phenotype. The class of tests is very general. It includes most standard association tests and can be applied to multivariate traits and phenotypes, multiple genetic markers, and case-control studies where phenotypic information is available for the cases but correlated with the case-control status [16–18].
The general concept of the proposed association-testing framework is to condition on the phenotype of interest and treat only the genetic data as random [12, 13, 15]. By treating the phenotype data as deterministic, the validity of the approach does not depend on the correctness of the phenotypic assumptions. Nevertheless, the power of the approach can be increased by incorporating a plausible model for the phenotype into the test statistic. Based on theoretical considerations and on simulation studies, we show that the new approach is robust against misspecification of phenotype assumptions. At the same time, this approach achieves the same power level as standard genetic association tests for population-based designs when the phenotype of interest has a normal distribution or is dichotomous. For studies where a quantitative trait is correlated with case-control status, our simulation studies examine the power and significance levels for the proposed approach, which does not require any adjustment for the ascertainment conditions.
We illustrate the practical advantages of NPBAT by an application to the COPDGene study. The COPDGene study is a case-control study of the genetics of COPD in current or former smokers with at least 10 pack-years of smoking history [19]. We test the genetic association of single nucleotide polymorphisms (SNPs) in the CHRNA 3/5 region and the Fagerstrom Nicotine Dependence score (FNDS). FNDS is a validated instrument of nicotine dependence in current smokers and was measured in the current smokers, but not the former smokers, in the COPDGene study. NPBAT, which uses the genotype data in both current and former smokers, is compared to the published genetic association of SNPs in the CHRNA 3/5 region and FNDS that was performed in current smokers only [20].
Methods
where $E_x$ denotes the expectation of the marker score/genotype X under the null hypothesis of no genetic association between the phenotype and the marker locus. $E_x$ can be estimated based on the sample mean of the genotypes. The asymptotic distribution of the NPBAT statistic under the null hypothesis depends on the estimation of $E_x$ and on the specification of the trait information $T_i$, and is derived in the Appendix.
There are various ways to code the phenotype of interest and define the coding function $T_i$. For the analysis of affection status, one could specify the coding function to be $T_i = 1$ or $T_i = 0$, depending on the disease status of the proband. However, as we show in Appendix A (Offset choice when Y is binary), a more efficient way is to set $T_i = 1 - \frac{\#\text{cases}}{n}$ for the cases and $T_i = 0 - \frac{\#\text{cases}}{n}$ for the controls. Then the NPBAT statistic is approximately the same as the Cochran-Armitage trend test.
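The effect of the two binary codings can be checked numerically. The display equation for the statistic is not reproduced in this text, so the sketch below assumes an FBAT-like form, $Z = \sum_i T_i (X_i - E_x) / \sqrt{s_x^2 \sum_i T_i^2}$, with $E_x$ and $s_x^2$ estimated by the sample mean and sample variance of the genotypes. The function names `npbat_z` and `trend_chi2` are illustrative and are not part of the NPBAT package. The Cochran-Armitage trend statistic is computed through the standard identity $\chi^2 = N r^2$, where $r$ is the correlation between case status and the additive genotype score.

```python
import math
import random
from statistics import mean, variance

def npbat_z(t, x):
    """Assumed FBAT-like form: sum T_i (X_i - E_x) / sqrt(s_x^2 * sum T_i^2)."""
    e_x = mean(x)                      # E_x estimated by the sample genotype mean
    s2_x = variance(x)                 # sample variance (N - 1 denominator)
    u = sum(ti * (xi - e_x) for ti, xi in zip(t, x))
    return u / math.sqrt(s2_x * sum(ti * ti for ti in t))

def trend_chi2(y, x):
    """Cochran-Armitage trend test via the identity chi^2 = N * r^2."""
    n = len(y)
    my, mx = mean(y), mean(x)
    sxy = sum((a - my) * (b - mx) for a, b in zip(y, x))
    sxx = sum((b - mx) ** 2 for b in x)
    syy = sum((a - my) ** 2 for a in y)
    return n * sxy ** 2 / (sxx * syy)

random.seed(1)
n = 1000
y = [1] * 500 + [0] * 500                                      # 500 cases, 500 controls
x = [sum(random.random() < 0.2 for _ in range(2)) for _ in y]  # additive genotype, p = 0.2

frac_cases = sum(y) / n
t_centered = [yi - frac_cases for yi in y]                     # T_i = Y_i - #cases/n
t_raw = [float(yi) for yi in y]                                # T_i = 1 or 0

z_centered = npbat_z(t_centered, x)
z_raw = npbat_z(t_raw, x)
chi2 = trend_chi2(y, x)
```

Under these assumptions `z_centered ** 2` equals `chi2` up to the factor (N - 1)/N, while the raw 0/1 coding leaves the numerator unchanged and inflates the denominator, so `abs(z_raw)` is smaller than `abs(z_centered)` by a factor of about 0.71 in this balanced design.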
If the phenotype $Y_i$ is in fact normally distributed and $T_i = Y_i - \hat{Y}_i$, where $\hat{Y}_i$ denotes the fitted values of regressing the phenotype Y on any covariates, then the NPBAT statistic is approximately the same as a t-statistic from a linear regression. In general, if the phenotype $Y_i$ is a continuous phenotype, we recommend $T_i = Y_i - \mu_y$, where $\mu_y$ is the phenotypic mean in the general population.
While it is appealing that the NPBAT statistic is comparable to standard methods in these simple scenarios, the real appeal of the NPBAT statistic arises when phenotype information is available for only some subjects but genetic information is available for all subjects. For example, in case-control studies, an additional quantitative phenotype may be available for the cases but not the controls. When testing for a genetic association with this additional quantitative phenotype, the NPBAT statistic uses the genotypes of both the cases and the controls with the optimal coded phenotype $T_i = Y_i - Y_{\text{offset}}$, where $Y_{\text{offset}}$ is a constant. The choice of this constant is described in detail in the Simulations subsection, and the asymptotic distribution of the NPBAT statistic is derived in the Appendix. Using this optimal offset choice, the NPBAT statistic has a substantial increase in power over other methods, such as the NPBAT statistic with the offset choice $T_i = Y_i - \bar{Y}$ or the improved score test, which is uniformly more powerful than score tests based on the generalized linear model such as the Cochran-Armitage trend test, the allelic $\chi^2$ test and the genotypic $\chi^2$ test [21].
Adjustments for population admixture
The NPBAT statistic can be adjusted for population admixture by using standard methods such as principal components analysis or genomic control [22, 23]. For example, to account for population admixture, one can treat the principal components as additional covariates representing population information and incorporate them into the test statistic in equation (2) by taking $T_i = Y_i - \hat{Y}_i$, where $\hat{Y}_i$ denotes the fitted values of regressing the phenotype Y on the top principal components that explain the greatest amount of variability in the data. Note that the above approach requires that the phenotype Y is dichotomous or roughly normally distributed.
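The residual coding $T_i = Y_i - \hat{Y}_i$ can be sketched as an ordinary least-squares fit of the phenotype on an intercept and the principal-component scores. This is a minimal standard-library illustration; the helper names `solve_linear` and `residualize` and the toy scores are hypothetical, and a real analysis would take the scores from a PCA of genome-wide markers.

```python
def solve_linear(a, b):
    """Solve a @ beta = b by Gaussian elimination with partial pivoting."""
    k = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]   # augmented matrix
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, k):
            f = m[r][col] / m[col][col]
            for c in range(col, k + 1):
                m[r][c] -= f * m[col][c]
    beta = [0.0] * k
    for r in reversed(range(k)):
        s = sum(m[r][c] * beta[c] for c in range(r + 1, k))
        beta[r] = (m[r][k] - s) / m[r][r]
    return beta

def residualize(y, pcs):
    """Return T_i = Y_i - Yhat_i from an OLS fit of y on an intercept and the PCs."""
    design = [[1.0] + list(row) for row in pcs]        # one row per subject
    k = len(design[0])
    xtx = [[sum(d[i] * d[j] for d in design) for j in range(k)] for i in range(k)]
    xty = [sum(d[i] * yi for d, yi in zip(design, y)) for i in range(k)]
    beta = solve_linear(xtx, xty)
    return [yi - sum(b * di for b, di in zip(beta, d)) for d, yi in zip(design, y)]

# Toy data: 6 subjects, 2 hypothetical principal-component scores each.
pcs = [(0.5, -0.1), (-0.3, 0.4), (0.1, 0.2), (-0.6, -0.5), (0.2, 0.3), (0.1, -0.3)]
y = [2.1, 1.4, 1.9, 0.7, 2.0, 1.5]
t = residualize(y, pcs)
```

By construction the residuals sum to zero and are orthogonal to each principal component, so the coded phenotype carries no linear trace of the ancestry covariates.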
Extension to multiple phenotypes
Due to the estimation of $E_x$ based on the sample, this statistic does not have a chi-square distribution, and a permutation test is needed to assess significance levels. This can be done with the NPBAT software package (https://sites.google.com/site/genenpbat/).
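A permutation test of this kind can be sketched generically: the phenotypes are held fixed, the genotypes are shuffled across subjects, and the observed statistic is compared with its permutation distribution. The multi-phenotype statistic itself is not reproduced in this text, so the `score` function below is only a single-phenotype stand-in, and `permutation_pvalue` is a hypothetical helper rather than the routine used in the NPBAT package.

```python
import random

def score(t, x):
    """Illustrative association score: |sum_i T_i (X_i - mean(X))|."""
    mx = sum(x) / len(x)
    return abs(sum(ti * (xi - mx) for ti, xi in zip(t, x)))

def permutation_pvalue(t, x, stat=score, n_perm=999, seed=0):
    """Empirical p-value: shuffle genotypes across subjects, keep phenotypes fixed."""
    rng = random.Random(seed)
    observed = stat(t, x)
    x_perm = list(x)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(x_perm)
        if stat(t, x_perm) >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)   # add-one correction keeps p > 0

# Toy data in which the coded phenotype increases with the genotype.
t = [-1.5, -1.0, -0.5, 0.0, 0.5, 1.0, 1.5, 2.0]
x = [0, 0, 0, 1, 1, 1, 2, 2]
p = permutation_pvalue(t, x)
```

Any statistic that takes the coded phenotypes and genotypes can be passed through the `stat` argument, which is the design choice that would let the same loop serve a multivariate statistic.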
Simulations
In genetic association case-control studies, only the cases may have additional phenotypic information available. For instance, in a case-control study where the cases have asthma (the primary phenotype), only the cases may have FEV measurements (the secondary phenotype). In this scenario, the secondary phenotype FEV will be more severe than it would be in the general population, and the analysis of this secondary phenotype can be misleading due to the ascertainment of subjects based on the primary phenotype, asthma. To simulate this scenario, we generated the genotype X for 500 cases and 500 controls and a secondary phenotype Y for only the 500 cases from a truncated normal distribution with standard deviation σ = 1, mean $aX$ under the alternative and mean 0 under the null, with the cutoff chosen such that the secondary phenotype lies in the top 50 percent of the normal distribution. We consider an allele frequency of p = 20%, and a is chosen such that the heritability h [24] equals 1%, 2%, 3%, or 5%. Solving for a gives $a=\sigma \sqrt{h/\left(2p(1-p)(1-h)\right)}$.
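The data-generating step for the cases can be sketched as follows. The effect-size formula is the one stated above; the truncation cutoff of 0 (the median of the null distribution) is an assumption about what "top 50 percent" means, and the function names and the rejection-sampling loop are illustrative choices, not the authors' simulation code.

```python
import math
import random

def effect_size(h, p, sigma=1.0):
    """a = sigma * sqrt(h / (2 p (1 - p) (1 - h)))."""
    return sigma * math.sqrt(h / (2 * p * (1 - p) * (1 - h)))

def heritability(a, p, sigma=1.0):
    """Share of phenotypic variance explained by an additive SNP effect a."""
    var_g = a * a * 2 * p * (1 - p)
    return var_g / (var_g + sigma * sigma)

def simulate_cases(n, a, p, sigma=1.0, cutoff=0.0, seed=0):
    """Draw genotypes, then Y ~ N(a*X, sigma^2) truncated to Y > cutoff."""
    rng = random.Random(seed)
    xs, ys = [], []
    for _ in range(n):
        x = (rng.random() < p) + (rng.random() < p)   # additive coding 0, 1, 2
        y = rng.gauss(a * x, sigma)
        while y <= cutoff:                            # rejection step for truncation
            y = rng.gauss(a * x, sigma)
        xs.append(x)
        ys.append(y)
    return xs, ys

a = effect_size(h=0.01, p=0.2)        # about 0.178 for h = 1%, p = 20%
x_cases, y_cases = simulate_cases(n=500, a=a, p=0.2)
```

Plugging the returned `a` back into `heritability` recovers the target h, which is a quick check that the closed-form expression was inverted correctly.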
We compute the NPBAT statistic with the coded phenotype $T_i = Y_i - Y_{\text{offset}}$, where $Y_{\text{offset}}$ is a constant that ranges from 5 to 15 and $E_x$ is the sample mean of the genotypes in the cases. We also compute the NPBAT statistic with $E_x$ equal to the sample mean of the genotypes in the controls and with $E_x$ equal to the sample mean of the genotypes in the cases and the controls. We compare the power of these three NPBAT statistics to the Improved Score Test, which is uniformly more powerful than score tests based on the generalized linear model such as the Cochran-Armitage trend test, the allelic $\chi^2$ test and the genotypic $\chi^2$ test [21]. We also compare the power of the NPBAT approach to a standard linear regression.
Based on these simulations, when analyzing secondary phenotypes correlated with case-control status in case-control studies, we recommend setting $Y_{\text{offset}}$ to a constant that differs substantially from the phenotypic mean of the sample and setting $E_x$ equal to the genotypic mean of the controls. In this situation, a robust and efficient choice for the offset $Y_{\text{offset}}$ is the phenotypic mean in the general population. Note that the results of these simulations are analogous to those for the FBAT statistic in family studies, where it was found that when ascertaining cases only from a quantitative distribution, one needed to choose an offset that was outside the range of the cases' phenotypic values [15].
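One way to read this recommendation is through an algebraic identity for the numerator of the statistic. Assuming the numerator has the form $\sum_i T_i (X_i - E_x)$ with the sum taken over the cases (the display equation is not reproduced in this text, so this form is an assumption), writing $c$ for $Y_{\text{offset}}$ and expanding around the case means gives:

```latex
\begin{align*}
\sum_{i \in \text{cases}} (Y_i - c)\,(X_i - \bar{X}_{\text{controls}})
  &= \sum_{i \in \text{cases}} (Y_i - \bar{Y})\,(X_i - \bar{X}_{\text{cases}}) \\
  &\quad + n_{\text{cases}}\,(\bar{Y} - c)\,(\bar{X}_{\text{cases}} - \bar{X}_{\text{controls}})
\end{align*}
```

The two cross terms vanish because deviations from a sample mean sum to zero. The first term is the within-case covariation between phenotype and genotype. The second term is a contrast of the case and control genotype means, and it enters only when $c \ne \bar{Y}$ and $E_x$ is not the case mean. Under this reading, an offset away from the sample mean lets the control genotypes contribute to the test.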
Data analysis
This table displays the p-values for the association between the Fagerstrom Test for Nicotine Dependence (FTND) and the markers listed below for the different statistical tests: the NPBAT where $E_x = \bar{x}_c$ is the genotypic mean of the current smokers, the NPBAT where $E_x = \bar{x}_f$ is the genotypic mean of the former smokers, the Improved Score Test, and a linear regression.

| SNP | NPBAT: $E_x = \bar{x}_c$ | NPBAT: $E_x = \bar{x}_f$ | Improved Score Test | Regression |
| --- | --- | --- | --- | --- |
| rs1051730 | 0.00134 | 0.00138 | 0.00227 | 0.00259 |
| rs8034191 | 0.00386 | 0.00391 | 0.00694 | 0.00744 |
Results and discussion
 1. when a sample is ascertained based on case/control status and the phenotype of interest is correlated with case status
 2. in a cohort study in which prevalent cases are excluded (i.e. the classic epidemiologic cohort study) and the phenotype of interest is correlated with the disease of interest
 3. in a pharmacogenetics study using a randomized clinical trial when participants are ascertained based on the levels of the target of therapy
The broad application of NPBAT is to scenarios where samples are ascertained based on selection criteria that are correlated with the phenotype of interest.
Conclusions
In conclusion, the key advantage of the proposed approach is its robustness against misspecification of the phenotype model. This enables extensions to different types of traits and the integration of complex statistical models for the phenotype, while the validity of the approach is not compromised by such generalizations. Although the power is sensitive to the offset choice, NPBAT is valid regardless of the offset. As with all population-based association tests, population stratification can be a problem. Adjusting for known population substructure using principal components of ancestry informative markers (AIMs) or using genomic control can reduce the impact of population stratification. The NPBAT software package that implements this method is described in the Appendix.
Appendix
Appendix A: Offset choice when Y is binary
where $\gamma = \frac{\#\text{cases}}{\#\text{controls}}$. Given this ratio, the power of the NPBAT statistic relative to the Cochran-Armitage trend test is maximized for the offset choice $\mu_y^{\text{optimal}} = \frac{\gamma}{1+\gamma} = \frac{\#\text{cases}}{N}$. For example, if the ratio of cases to controls is 1, the offset choice $\mu_y$ is $\frac{1}{2}$. This corresponds to equally weighting the cases and controls in the conditional test statistic. For large sample size N, such that $\sqrt{\frac{N}{N-1}} \approx 1$, the ratio of the test statistics is approximately one when the offset is set to $\mu_y^{\text{optimal}} = \frac{\#\text{cases}}{N}$. Consequently, for the optimal offset choice, the test statistics are approximately the same.
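As a worked illustration of the offset formula with an unbalanced design, consider hypothetical counts of 500 cases and 1500 controls (these numbers are chosen for the example and do not come from the study):

```latex
\begin{align*}
\gamma &= \frac{500}{1500} = \frac{1}{3}, &
\mu_y^{\text{optimal}} &= \frac{\gamma}{1+\gamma} = \frac{1/3}{4/3} = \frac{1}{4} = \frac{500}{2000}, \\
T_i^{\text{case}} &= 1 - \frac{1}{4} = \frac{3}{4}, &
T_i^{\text{control}} &= 0 - \frac{1}{4} = -\frac{1}{4}, \\
\sum_i T_i &= 500 \cdot \frac{3}{4} - 1500 \cdot \frac{1}{4} = 0, &
\sqrt{\frac{N}{N-1}} &= \sqrt{\frac{2000}{1999}} \approx 1.00025 .
\end{align*}
```

With this offset the coded phenotypes sum to zero, so the smaller group receives proportionally more weight per subject, and the large-sample factor is already within 0.03% of one at N = 2000.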
Appendix B: asymptotic distribution when the secondary phenotype is available for both the cases and controls
Note that the statistic is maximized and has a standard normal distribution when $Y_{\text{offset}} = E[Y]$.
Appendix C: asymptotic distribution when the secondary phenotype is only available for the cases
Then the NPBAT statistic is normally distributed with mean zero and variance given above. Note that the variance is always greater than or equal to one and equals one when $Y_{\text{offset}} = E[Y]$. Note that if $Y_{\text{offset}} = \bar{Y}$ and $E_x = \bar{X}_{\text{controls}}$, then NPBAT has a standard normal distribution. As seen in the Simulations section and Figure 1, when $E_x$ is based on the controls and the phenotype information is only available for the cases, the power is maximized when $Y_{\text{offset}} \ne \bar{Y}$ because the variance attains its minimum when $Y_{\text{offset}} \approx E[Y]$.
Appendix D: NPBAT software
A software package implemented in C++ to compute both single-phenotype and multiple-phenotype NPBAT statistics is available for download at the following website: https://sites.google.com/site/genenpbat/. In addition to NPBAT statistics, other population-based statistics such as the Armitage Trend Test and the Fisher Exact Test are also available. Currently, only two platforms are supported: linux64 and windows64. The NPBAT software package reads in genetic data through PLINK-style pedigree (ped), map (map) and phenotype (phe) files. The website provides detailed information on how to use the software package.
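For readers preparing input, the layout of a PLINK-style pedigree record can be sketched as follows. The sketch follows the standard PLINK .ped convention (six leading columns for family ID, individual ID, paternal ID, maternal ID, sex and phenotype, then two alleles per marker, with `0` as the default missing-allele code). Whether the NPBAT reader applies exactly the same conventions is an assumption, and `parse_ped_line` and `additive_score` are hypothetical helpers, not functions of the package.

```python
def parse_ped_line(line):
    """Split one PLINK .ped record into subject fields and per-marker allele pairs."""
    fields = line.split()
    fid, iid, father, mother, sex, pheno = fields[:6]
    alleles = fields[6:]
    if len(alleles) % 2:
        raise ValueError("expected two alleles per marker")
    pairs = list(zip(alleles[0::2], alleles[1::2]))
    return {"fid": fid, "iid": iid, "sex": sex, "pheno": pheno, "genotypes": pairs}

def additive_score(pair, counted_allele):
    """Count copies of `counted_allele` (0, 1 or 2); None if the genotype is missing."""
    if "0" in pair:                     # PLINK's default missing-allele code
        return None
    return sum(allele == counted_allele for allele in pair)

# One made-up subject with three markers; the third genotype is missing.
record = parse_ped_line("FAM1 IND1 0 0 1 2 A G G G 0 0")
scores = [additive_score(g, "G") for g in record["genotypes"]]
```

The additive scores produced this way correspond to the marker score X used throughout the paper, with missing genotypes left as `None` so that they can be excluded before any statistic is computed.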
Abbreviations
 COPD: Chronic obstructive pulmonary disease
 FBAT: Family Based Association Test
 FEV: Forced expiratory volume
 FNDS: Fagerstrom nicotine dependence score
 GWAS: Genome Wide Association Study
 LRT: Likelihood ratio test
 NPBAT: Nonparametric Population Based Association Test
 PC: Principal component
Declarations
Acknowledgements
We would like to acknowledge Carla Wilson at National Jewish Health for her help with the COPDGene dataset. This work was funded by NIH/NHLBI U01 HL089856 (Edwin K. Silverman, PI). COPDGene is supported by NHLBI Grant Nos. U01HL089897 and U01HL089856.
Authors’ Affiliations
References
 1. Thomas DC: Statistical Methods in Genetic Epidemiology. 2004, Oxford: Oxford University Press
 2. Wang J, Shete S: Power and type I error results for a bias-correction approach recently shown to provide accurate odds ratios of genetic variants for the secondary phenotypes associated with primary diseases. Genet Epidemiol. 2011, 35: 739-743. 10.1002/gepi.20611.
 3. Wang J, Shete S: Estimation of odds ratios of genetic variants for the secondary phenotypes associated with primary diseases. Genet Epidemiol. 2011, 35: 190-200. 10.1002/gepi.20568.
 4. Richardson DB, Rzehak P, Klenk J, Weiland SK: Analyses of case-control data for additional outcomes. Epidemiol. 2007, 18 (4): 441-445. 10.1097/EDE.0b013e318060d25c.
 5. Li H, Gail MH, Berndt S, Chatterjee N: Using cases to strengthen inference on the association between single nucleotide polymorphisms and a secondary phenotype in genome-wide association studies. Genet Epidemiol. 2010, 34: 427-433. 10.1002/gepi.20495.
 6. Li H, Gail MH: Efficient adaptively weighted analysis of secondary phenotypes in case-control genome-wide association studies. Hum Hered. 2012, 73 (3): 159-173. 10.1159/000338943.
 7. Monsees GM, Tamimi RM, Kraft P: Genome-wide association scans for secondary traits using case-control samples. Genet Epidemiol. 2009, 33: 717-728. 10.1002/gepi.20424.
 8. Greenland S: Quantifying biases in causal models: classical confounding vs collider-stratification bias. Epidemiol. 2003, 14 (3): 300-306.
 9. He J, Li H, Edmondson AC, Rader DJ, Li M: A Gaussian copula approach for the analysis of secondary phenotypes in case-control genetic association studies. Biostatistics. 2011, 13 (3): 497-508.
 10. Kraft P: Letter to the editor: analyses of genome-wide association scans for additional outcomes. Epidemiol. 2007, 18 (6): 838. 10.1097/EDE.0b013e318154c7e2.
 11. Lin DY, Zeng D: Proper analysis of secondary phenotype data in case-control association studies. Genet Epidemiol. 2009, 33 (3): 256-265. 10.1002/gepi.20377.
 12. Laird NM, Horvath S, Xu X: Implementing a unified approach to family-based tests of association. Genet Epidemiol. 2000, 19: S36. 10.1002/1098-2272(2000)19:1+<::AID-GEPI6>3.0.CO;2-M.
 13. Laird NM, Lange C: Family-based methods for linkage and association analysis. Adv Genet. 2008, 60: 219-252.
 14. Lange C, DeMeo D, Silverman EK, Weiss ST, Laird NM: PBAT: Tools for family-based association studies. Am J Hum Genet. 2004, 74 (2): 367-369. 10.1086/381563.
 15. Lange C, DeMeo DL, Laird NM: Power and design considerations for a general class of family-based association tests: the asymptotic distribution, the conditional power, and optimality considerations. Genet Epidemiol. 2002, 23 (2): 165-180. 10.1002/gepi.209.
 16. Lange C, Blacker D, Laird NM: Family-based association tests for survival and times-to-onset analysis. Stat Med. 2004, 23 (2): 179-189. 10.1002/sim.1707.
 17. Lange C, Silverman EK, Xu X, Weiss ST, Laird NM: A multivariate family-based association test using generalized estimating equations: FBAT-GEE. Biostatistics. 2003, 4 (2): 195-206. 10.1093/biostatistics/4.2.195.
 18. Lange C, Laird NM: Power calculations for a general class of family-based association tests: dichotomous traits. Am J Hum Genet. 2002, 71 (3): 575-584. 10.1086/342406.
 19. Regan EA, Hokanson JE, Murphy JR, Make B, Lynch D, Silverman EK, Crapo JD: Genetic epidemiology of COPD (COPDGene): study design and methods. COPD. 2010, 7 (1): 32-43. 10.3109/15412550903499522.
 20. Kim DK, Hersh CP, Washko GR, Hokanson JE, Lynch DA, Newell JD, Murphy JR, Crapo JD, Silverman EK: Epidemiology, radiology, and genetics of nicotine dependence in COPD. Respir Res. 2011, 12: 9-15. 10.1186/1465-9921-12-9.
 21. Sha Q, Zhang Z, Zhang S: An improved score test for genetic association studies. Genet Epidemiol. 2011, 35 (5): 350-359. 10.1002/gepi.20583.
 22. Devlin B, Roeder K: Genomic control for association studies. Biometrics. 1999, 55 (4): 997-1004.
 23. Price AL, Patterson NJ, Plenge RM, Weinblatt ME, Shadick NA, Reich D: Principal components analysis corrects for stratification in genome-wide association studies. Nature Genet. 2006, 38 (8): 904-909. 10.1038/ng1847.
 24. Falconer DS, Mackay TFC: Introduction to Quantitative Genetics. 1997, London: Longman
 25. Serfling R: Approximation Theorems of Mathematical Statistics. 1980, New York: Wiley
 26. Billingsley P: Probability and Measure. 1995, New York: Wiley
Copyright
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.