Correcting for cryptic relatedness by a regression-based genomic control method
 Ting Yan^{1},
 Bo Hou^{1} (corresponding author) and
 Yaning Yang^{1, 2}
DOI: 10.1186/1471-2156-10-78
© Yan et al; licensee BioMed Central Ltd. 2009
Received: 25 June 2009
Accepted: 2 December 2009
Published: 2 December 2009
Abstract
Background
The genomic control (GC) method is a useful tool for correcting for cryptic relatedness in population-based association studies. It was originally proposed to correct for the variance inflation of the Cochran-Armitage additive trend test by using information from unlinked null markers, and was later generalized to other tests with the additional requirement that the null markers be matched with the candidate marker in allele frequency. However, matching allele frequencies limits the number of available null markers and thus the applicability of the GC method. Moreover, errors in genotype/allele frequencies may cause further bias and variance inflation and thereby degrade the GC correction.
Results
In this paper, we propose a regression-based GC method using null markers that are not necessarily matched in allele frequencies with the candidate marker. Variation of allele frequencies of the null markers is adjusted by a regression method.
Conclusion
The proposed method can be readily applied to Cochran-Armitage trend tests other than the additive trend test, to Pearson's chi-square test and to other robust efficiency tests. Simulation results show that the proposed method is effective in controlling type I error in the presence of population substructure.
Background
Population-based genetic association analysis is a powerful method for detecting susceptibility loci for complex diseases. A common issue with this design is that it may be subject to population heterogeneity; as a result, spurious associations may be reported if the population substructure is not properly addressed. Many methods have been proposed to deal with population heterogeneity in genetic association analysis.
When there is population stratification (PS) on allele frequencies, a direct method is to use a family-based design [1–5], in which unaffected family members are chosen to match each case so that the association detected is truly due to the linkage between the candidate marker and the disease. But this method is limited by the cost and the difficulty of recruiting family members. Pritchard et al. [6, 7] used a Bayesian clustering method to infer the number of subpopulations and to assign the individuals to putative subpopulations. The inferred memberships in each subpopulation are then used to perform tests of association for that subpopulation. A modification of this method was implemented by Satten et al. [8], in which subpopulation memberships were decided by a latent class model. Patterson et al. [9] proposed a principal components analysis method to correct for the population structure and obtained a test statistic based on the eigenvalues of the correlation matrix to detect the population structure. When the population has substructure, the usual chi-square statistics have noncentral chi-square distributions under the null. Gorroochurn et al. [10] proposed a δ-centralization method to correct for PS by centralizing the test statistics using information from the null markers.
Another form of population heterogeneity is cryptic relatedness (CR), or correlation across individuals. For this type of data, Devlin and Roeder [11] developed the genomic control (GC) method to correct for the variance inflation. They proposed to use the additive Cochran-Armitage trend test to detect the gene-phenotype association. Assuming that the correlations or kinship coefficients are the same across all markers, they showed that the scaled test statistic asymptotically has a 1-df chi-square distribution. The scaling factor, known as the variance inflation factor (VIF), can be estimated from the unlinked null markers.
The GC method is a simple and effective method in association studies to correct for population heterogeneity caused by cryptic relatedness. However, when the GC method is applied to recessive or dominant trend tests [12], to Pearson's chi-square test [13] or to other robust tests [14], the null loci are required to match the candidate loci in allele frequencies, which reduces the number of available null markers.
In this study, we propose a regression-based genomic control (RGC) method that can be applied to association tests other than the additive trend test. This method allows arbitrary null markers to be used in the GC correction procedure by adjusting for the variability of the allele frequencies of the null markers through linear regression. We use simulation studies to check whether the method appropriately corrects for the problem of spurious association. In addition, the robustness of the proposed method to errors in selecting null markers is assessed. We also simulate the power of our method.
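As a rough illustration of the correction described above, the sketch below estimates the VIF from 1-df chi-square statistics computed at null markers and rescales a candidate statistic. The median-based estimator and the truncation at 1 are common conventions assumed here, not details taken from this text, and the function names are hypothetical.

```python
from statistics import median

# Median of a chi-square distribution with 1 degree of freedom (approx. 0.4549).
CHI2_1DF_MEDIAN = 0.4549


def estimate_vif(null_stats):
    """Estimate the variance inflation factor from null-marker statistics.

    null_stats: 1-df chi-square statistics (e.g. squared additive trend
    statistics) computed at unlinked null markers.
    """
    lam = median(null_stats) / CHI2_1DF_MEDIAN
    # Assumption: truncate at 1 so that a statistic is never inflated.
    return max(lam, 1.0)


def gc_correct(candidate_stat, null_stats):
    """Return the candidate statistic divided by the estimated VIF."""
    return candidate_stat / estimate_vif(null_stats)
```

The corrected statistic would then be compared with the 1-df chi-square distribution, as in the uncorrected test.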
Methods
Trend tests
Table 1. Genotype counts
                  Genotype
Group    aa      Aa      AA      Total
Case     n_{00}  n_{01}  n_{02}  n_{0}
Control  n_{10}  n_{11}  n_{12}  n_{1}
Total    m_{0}   m_{1}   m_{2}   n
Because the VIF λ_{0.5} of the additive trend test does not depend on the allele frequency of the candidate marker, λ_{0.5} can be consistently estimated from a sequence of unlinked markers with arbitrary allele frequencies [11]. Unfortunately, this is not true for trend tests with x other than 0.5, since the quantity λ_{x} does depend on the allele frequency of the candidate marker. Therefore, dominant or recessive trend tests and other robust tests cannot be uniformly adjusted by the GC method using null markers with different allele frequencies. To overcome this problem, Zheng et al. [12] proposed to use null markers that have the same allele frequency as the candidate marker to evaluate the variance inflation factor. This constraint of matching allele frequency substantially limits the number of null markers that can be used.
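For readers who want to experiment with the trend tests indexed by x, the following sketch computes a Cochran-Armitage trend statistic with scores (0, x, 1) from the genotype counts of a case-control table. It uses the standard textbook form of the statistic, which is assumed rather than copied from this text; the function name and the sign convention (positive when cases carry more A alleles) are illustrative choices.

```python
from math import sqrt


def trend_test(cases, controls, x=0.5):
    """Cochran-Armitage trend statistic Z_x with scores (0, x, 1).

    cases, controls: genotype counts ordered (aa, Aa, AA), i.e. the rows
    (n_00, n_01, n_02) and (n_10, n_11, n_12) of the genotype-count table.
    x = 0, 0.5 and 1 give the recessive, additive and dominant tests.
    """
    scores = (0.0, x, 1.0)
    n0, n1 = sum(cases), sum(controls)          # row totals n_0, n_1
    n = n0 + n1
    m = [r + s for r, s in zip(cases, controls)]  # column totals m_0, m_1, m_2
    num = sum(w * (n1 * r - n0 * s)
              for w, r, s in zip(scores, cases, controls))
    var = n0 * n1 * (n * sum(w * w * mi for w, mi in zip(scores, m))
                     - sum(w * mi for w, mi in zip(scores, m)) ** 2)
    # var is zero when the scored genotype has no variation in the sample.
    return sqrt(n) * num / sqrt(var)
```

Under the null hypothesis and without population structure, Z_x is approximately standard normal, so Z_x squared can be referred to a 1-df chi-square distribution.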
RGC method
In what follows, we propose a regression-based GC method to adjust for the frequency variability of null markers when the GC method is applied to the general trend tests and the Pearson chi-square test.
When the population exhibits pure CR, the theoretical value of μ_{x} is zero. In practice, however, PS and CR are usually mixed together, so including this term in the analysis does no harm.
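The idea of adjusting for allele-frequency variability by regression can be illustrated with the sketch below. The regression equations are not reproduced in this text, so everything about the functional form is an assumption: the null-marker statistics are regressed linearly on the centered allele frequency to estimate a mean term, the squared residuals are regressed in the same way to estimate a variance term, and the candidate statistic is standardized with the fitted values at its own allele frequency. All names are hypothetical.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = a + b * x; returns (a, b)."""
    n = len(xs)
    xbar, ybar = sum(xs) / n, sum(ys) / n
    sxx = sum((x - xbar) ** 2 for x in xs)
    b = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys)) / sxx
    return ybar - b * xbar, b


def rgc_correct(z_cand, p_cand, z_null, p_null):
    """Standardize a candidate trend statistic using null markers.

    z_null, p_null: trend statistics and allele frequencies of null markers.
    Assumed model: mean and variance of the null statistics are each
    linear in the centered allele frequency.
    """
    pbar = sum(p_null) / len(p_null)
    centered = [p - pbar for p in p_null]       # centered predictors
    a_mu, b_mu = fit_line(centered, z_null)     # mean term
    resid_sq = [(z - (a_mu + b_mu * c)) ** 2
                for z, c in zip(z_null, centered)]
    a_tau, b_tau = fit_line(centered, resid_sq)  # variance term
    c0 = p_cand - pbar
    mu_hat = a_mu + b_mu * c0
    tau2_hat = a_tau + b_tau * c0
    if tau2_hat <= 0:
        raise ValueError("fitted variance is not positive at p_cand")
    return (z_cand - mu_hat) ** 2 / tau2_hat
```

A linear fit can give a non-positive variance near the boundary of the allele-frequency range, which is why the sketch raises an error in that case instead of returning a statistic.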
Simulation study
To assess the validity of the proposed RGC method, we carried out extensive simulations. Following [11], we use Wright's coefficient F to measure the correlation due to CR. Since it is difficult to simulate pure CR data, following [11, 12] and [14], we employ the following procedure to generate a CR population. Let p be the allele frequency of a marker. Assume that there are L subpopulations containing a_{1}, ⋯, a_{L} cases and b_{1}, ⋯, b_{L} controls. We first generate p_{1}, ..., p_{L} independently from the Beta distribution Beta((1 - F)p/F, (1 - F)(1 - p)/F). We then generate L subpopulations having allele frequencies p_{1}, ..., p_{L} respectively, assuming that Hardy-Weinberg equilibrium holds within each subpopulation. Finally we mix the L subpopulations together. In the long run, this mixed population resembles a pure CR population.
Next we drew independent genotype counts (a_{0j}, a_{1j}, a_{2j}) of cases and (b_{0j}, b_{1j}, b_{2j}) of controls in subpopulation j from multinomial distributions of sizes a_{j} and b_{j}, respectively. We then pooled the counts over the L subpopulations to obtain the case-control data set given in Table 1, with n_{0i} = Σ_{j} a_{ij} and n_{1i} = Σ_{j} b_{ij} for i = 0, 1, 2.
With this method of generating data, we simulated the cases p = 0.2 and p = 0.45, where p is the minor allele frequency of the candidate marker. The frequencies of the unlinked null markers were drawn uniformly from [0.1, 0.5]. The data for the K null markers, with equal penetrances f_{0} = f_{1} = f_{2}, and for a candidate marker with penetrances f_{0}, f_{1}, f_{2} were generated independently. The number of replicates in each simulation was 10,000. To avoid instability of the linear regression, the predictors were centered before being entered into the regression [15].
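The Beta distribution used for the subpopulation frequencies has mean p and variance Fp(1 - p), since its two parameters sum to (1 - F)/F. A minimal sketch of this sampling step, using only the Python standard library, is given below; the function name and the use of a seeded generator are illustrative choices.

```python
import random


def subpop_frequencies(p, F, L, rng):
    """Draw L subpopulation allele frequencies from
    Beta((1 - F) p / F, (1 - F)(1 - p) / F)."""
    a = (1 - F) * p / F
    b = (1 - F) * (1 - p) / F
    return [rng.betavariate(a, b) for _ in range(L)]


# Example: two subpopulations, overall frequency 0.2, F = 0.01.
rng = random.Random(2009)
freqs = subpop_frequencies(0.2, 0.01, 2, rng)
```

Larger values of F spread the subpopulation frequencies further around p, which is what produces the correlation among individuals in the mixed sample.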
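The genotype-count step can be sketched as follows. The arguments of the multinomial distributions are not legible in this text, so the genotype probabilities below are an assumption: within subpopulation j, genotypes follow Hardy-Weinberg proportions, cases are weighted by the penetrances f_{i} and controls by 1 - f_{i}. The function names are hypothetical.

```python
import random


def hwe_probs(p):
    """Genotype probabilities (aa, Aa, AA) under Hardy-Weinberg equilibrium,
    where p is the frequency of allele A."""
    q = 1 - p
    return [q * q, 2 * p * q, p * p]


def draw_counts(size, weights, rng):
    """One multinomial draw of genotype counts (aa, Aa, AA)."""
    counts = [0, 0, 0]
    for g in rng.choices((0, 1, 2), weights=weights, k=size):
        counts[g] += 1
    return counts


def simulate_table(freqs, a, b, penetrances, rng):
    """Pool case and control genotype counts over subpopulations.

    freqs: allele frequencies p_j; a, b: numbers of cases and controls per
    subpopulation; penetrances: (f_0, f_1, f_2) for genotypes aa, Aa, AA.
    """
    cases, controls = [0, 0, 0], [0, 0, 0]
    for p_j, a_j, b_j in zip(freqs, a, b):
        g = hwe_probs(p_j)
        case_w = [f * gi for f, gi in zip(penetrances, g)]
        ctrl_w = [(1 - f) * gi for f, gi in zip(penetrances, g)]
        for i, c in enumerate(draw_counts(a_j, case_w, rng)):
            cases[i] += c
        for i, c in enumerate(draw_counts(b_j, ctrl_w, rng)):
            controls[i] += c
    return cases, controls
```

With equal penetrances the case and control weights are proportional to the same Hardy-Weinberg probabilities, which corresponds to a null marker.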
Results
A regression-based genomic control (RGC) method is proposed and applied to association tests other than the additive trend test. This method allows arbitrary null markers to be used in the GC correction procedure, in which the variability of the allele frequencies of the null markers is adjusted by linear regression. The method is assessed by extensive simulations. In addition, the robustness of the proposed method to errors in selecting null markers is evaluated. We also simulate the power of our method.
Type I error of the uncorrected and GC- or RGC-corrected tests under H_{0}: f_{0} = f_{1} = f_{2} (nominal level 0.05, a = (500, 1500), b = (1500, 500)).
F     MAF       K    Method       T_{0}  T_{1/2}  T_{1}  Pearson χ²
0.01  p = 0.2   200  Uncorrected  0.350  0.557    0.539  0.509
                     GC           0.032  0.056    0.067  0.045
                     RGC          0.063  0.054    0.052  0.055
                300  Uncorrected  0.335  0.551    0.532  0.497
                     GC           0.027  0.047    0.057  0.038
                     RGC          0.055  0.053    0.051  0.052
      p = 0.45  200  Uncorrected  0.446  0.543    0.486  0.496
                     GC           0.085  0.051    0.035  0.053
                     RGC          0.052  0.052    0.054  0.049
                300  Uncorrected  0.464  0.550    0.487  0.512
                     GC           0.096  0.049    0.037  0.058
                     RGC          0.051  0.050    0.052  0.051
0.02  p = 0.2   200  Uncorrected  0.473  0.667    0.650  0.633
                     GC           0.026  0.046    0.061  0.040
                     RGC          0.065  0.053    0.052  0.056
                300  Uncorrected  0.452  0.679    0.662  0.637
                     GC           0.022  0.048    0.060  0.035
                     RGC          0.054  0.050    0.051  0.053
      p = 0.45  200  Uncorrected  0.591  0.665    0.612  0.627
                     GC           0.101  0.047    0.038  0.060
                     RGC          0.052  0.053    0.054  0.050
                300  Uncorrected  0.581  0.663    0.610  0.622
                     GC           0.106  0.047    0.032  0.062
                     RGC          0.051  0.052    0.053  0.052
Power of RGC-corrected tests (nominal level 0.05, K = 200, a = (300, 200), b = (200, 300)).
F     MAF      Model                                         T_{0}  T_{1/2}  T_{1}  Pearson χ²
0.01  p = 0.2  DOM(f_{0} = 0.1, f_{1} = f_{2} = 0.15)        0.134  0.791    0.857  0.777
               ADD(f_{0} = 0.1, f_{1} = 0.17, f_{2} = 0.24)  0.401  0.805    0.781  0.734
               REC(f_{0} = f_{1} = 0.1, f_{2} = 0.2)         0.803  0.378    0.130  0.728
      p = 0.4  DOM(f_{0} = 0.1, f_{1} = f_{2} = 0.15)        0.179  0.704    0.852  0.780
               ADD(f_{0} = 0.1, f_{1} = 0.14, f_{2} = 0.18)  0.701  0.905    0.856  0.866
               REC(f_{0} = f_{1} = 0.1, f_{2} = 0.2)         0.995  0.936    0.418  0.990
0.02  p = 0.2  DOM(f_{0} = 0.1, f_{1} = f_{2} = 0.15)        0.129  0.682    0.767  0.674
               ADD(f_{0} = 0.1, f_{1} = 0.17, f_{2} = 0.24)  0.362  0.698    0.689  0.636
               REC(f_{0} = f_{1} = 0.1, f_{2} = 0.2)         0.753  0.333    0.130  0.687
      p = 0.4  DOM(f_{0} = 0.1, f_{1} = f_{2} = 0.15)        0.179  0.704    0.852  0.780
               ADD(f_{0} = 0.1, f_{1} = 0.14, f_{2} = 0.18)  0.655  0.853    0.811  0.809
               REC(f_{0} = f_{1} = 0.1, f_{2} = 0.2)         0.987  0.876    0.403  0.974
Type I error of the uncorrected, GC-corrected and RGC-corrected tests when the null markers are linked to the disease with probability 2% (nominal level 0.05, K = 200, a = (500, 1500), b = (1500, 500), F = 0.02; f_{0}, f_{1}, f_{2} are the penetrances for aa, Aa, AA).
(f_{0}, f_{1}, f_{2})  MAF       Method       T_{0}  T_{1/2}  T_{1}  Pearson χ²
(0.01, 0.02, 0.02)     p = 0.2   Uncorrected  0.470  0.673    0.657  0.631
                                 GC           0.021  0.041    0.055  0.035
                                 RGC          0.064  0.051    0.047  0.058
(0.01, 0.015, 0.02)              Uncorrected  0.474  0.679    0.656  0.637
                                 GC           0.018  0.042    0.056  0.034
                                 RGC          0.056  0.052    0.051  0.054
(0.01, 0.01, 0.02)               Uncorrected  0.473  0.669    0.653  0.630
                                 GC           0.022  0.040    0.054  0.039
                                 RGC          0.063  0.052    0.053  0.055
(0.01, 0.02, 0.02)     p = 0.45  Uncorrected  0.592  0.668    0.615  0.630
                                 GC           0.098  0.046    0.034  0.062
                                 RGC          0.054  0.051    0.046  0.049
(0.01, 0.015, 0.02)              Uncorrected  0.608  0.675    0.619  0.638
                                 GC           0.105  0.045    0.033  0.060
                                 RGC          0.053  0.052    0.053  0.050
(0.01, 0.01, 0.02)               Uncorrected  0.598  0.670    0.620  0.631
                                 GC           0.103  0.045    0.034  0.061
                                 RGC          0.049  0.051    0.054  0.048
Discussion
The case-control design is useful in detecting genes related to complex disease. For a case-control sample, if there is population structure and cryptic relatedness, spurious association between disease and genotype can occur due to variance inflation in the statistical tests. The genomic control method proposed by Devlin and Roeder [11] is a simple and effective method for eliminating spurious results caused by cryptic relatedness.
However, when the GC method is applied to correct for inflation of the type I error of a general trend test or Pearson's chi-square test, the null markers are required to match the candidate marker in allele frequencies. This matching limits the applicability of the GC method. In this paper we propose an RGC method to correct for population stratification effects that allows any null markers to be used. To adjust for the variability of allele frequencies of the null markers, we estimate the inflated variance τ_{x} and the noncentrality parameter μ_{x} by linear regression. The RGC method can be applied to Cochran-Armitage trend tests with arbitrary score (not only the additive trend test), to the Pearson genotype-based association test and to other robust efficiency tests.
Simulation results show that the RGC method can properly correct for the inflation of the type I error of trend tests or Pearson's chi-square test caused by cryptic relatedness in the population. We observe that the RGC method is slightly conservative for the recessive trend test and anti-conservative for the dominant trend test when the minor allele frequency is close to 0. We think that this is due to the instability of linear regression near the boundary of the MAF values.
Conclusion
Simulation studies show that the RGC method can effectively correct for the variance inflation caused by cryptic relatedness and is robust to inclusion of linked loci in the selection of null markers.
Appendix
where p is the frequency of allele A.
Declarations
Acknowledgements
BH and YY are supported by Chinese Natural Science Foundation and Chinese Academy of Science Grant. TY is supported by USTC Graduate Student Innovation Foundation. The authors thank three anonymous reviewers for their helpful comments and Yifan Yang for careful reading of the manuscript.
Authors’ Affiliations
References
1. Spielman R, McGinnis R, Ewens W: Transmission test for linkage disequilibrium: the insulin gene region and insulin-dependent diabetes mellitus (IDDM). Am J Hum Genet. 1993, 52: 506-516.
2. Curtis D: Use of siblings as controls in case-control association studies. Ann Hum Genet. 1997, 61: 319-333. 10.1017/S000348009700626X.
3. Gauderman W, Witte J, Thomas D: Family-based association studies. J Natl Cancer Inst Monogr. 1999, 26: 31-37.
4. Li Z, Gail M, Pee D, Gastwirth J: Statistical properties of Teng and Risch's sibship type tests for detecting an association between disease and a candidate allele. Hum Hered. 2002, 53: 114-129. 10.1159/000064974.
5. Li Z, Gastwirth J, Gail M: Power and related statistical properties of conditional likelihood score tests for association studies in nuclear families with parental genotypes. Ann Hum Genet. 2005, 69: 296-314. 10.1046/J.1469-1809.2005.00169.x.
6. Pritchard J, Stephens M, Donnelly P: Inference of population structure using multilocus genotype data. Genetics. 2000, 155: 945-959.
7. Pritchard J, Stephens M, Rosenberg N, Donnelly P: Association mapping in structured populations. Am J Hum Genet. 2000, 67: 170-181. 10.1086/302959.
8. Satten G, Flanders W, Yang Q: Accounting for unmeasured population substructure in case-control studies of genetic association using a novel latent-class model. Am J Hum Genet. 2001, 68: 466-477. 10.1086/318195.
9. Patterson N, Price A, Reich D: Population structure and eigenanalysis. PLoS Genet. 2006, 2 (12): e190. 10.1371/journal.pgen.0020190.
10. Gorroochurn P, Heiman G, Hodge S, Greenberg D: Centralizing the non-central chi-square: a new method to correct for population stratification in genetic case-control association studies. Genet Epidemiol. 2006, 30: 277-289. 10.1002/gepi.20143.
11. Devlin B, Roeder K: Genomic control for association studies. Biometrics. 1999, 55: 997-1004. 10.1111/j.0006-341X.1999.00997.x.
12. Zheng G, Freidlin B, Li Z, Gastwirth J: Genomic control for association studies under various genetic models. Biometrics. 2005, 61: 186-192. 10.1111/j.0006-341X.2005.t01-1-.x.
13. Zheng G, Freidlin B, Gastwirth J: Robust genomic control for association studies. Am J Hum Genet. 2006, 78: 350-356. 10.1086/500054.
14. Zang Y, Zhang H, Yang Y, Zheng G: Robust genomic control and robust delta centralization tests for case-control association studies. Hum Hered. 2007, 63: 187-195. 10.1159/000099831.
15. Ryan TP: Modern Regression Methods. 1996, Wiley-Interscience: New York.
Copyright
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.