Volume 6 Supplement 1
Genetic Analysis Workshop 14: Microsatellite and single-nucleotide polymorphism
Identifying susceptibility genes by using joint tests of association and linkage and accounting for epistasis
 Joshua Millstein^{1},
 Kimberly D Siegmund^{2},
 David V Conti^{2} and
 W James Gauderman^{2}
DOI: 10.1186/1471-2156-6-S1-S147
© Millstein et al; licensee BioMed Central Ltd 2005
Published: 30 December 2005
Abstract
Simulated Genetic Analysis Workshop 14 data were analyzed by jointly testing linkage and association and by accounting for epistasis using a candidate gene approach. Our group was unblinded to the "answers." The 48 single-nucleotide polymorphisms (SNPs) within the six disease loci were analyzed in addition to five SNPs from each of two non-disease-related loci. Affected sib-parent data were extracted from the first 10 replicates for populations Aipotu, Kaarangar, and Danacaa, and analyzed separately for each replicate. We developed a likelihood for testing association and/or linkage using data from affected sib pairs and their parents. Identical-by-descent (IBD) allele sharing between sibs was explicitly modeled using a conditional logistic regression approach that incorporates a covariate representing expected IBD allele sharing given the genotypes of the sibs and their parents. Interactions were accounted for by performing likelihood ratio tests in stages determined by the highest-order interaction term in the model. In the first stage, main effects were tested independently, and in subsequent stages, multilocus effects were tested conditional on significant marginal effects. The number of tests performed was reduced by prescreening gene combinations with a goodness-of-fit chi-square statistic that depended on mating-type frequencies. SNP-specific joint effects of linkage and association were identified for loci D1, D2, D3, and D4 in multiple replicates. The strongest effect was for SNP B03T3056, which had a median p-value of 1.98 × 10^{-34}. No two- or three-locus effects were found in more than one replicate.
Background
The need to account for gene × gene interactions in the search for susceptibility genes for complex diseases such as cancer, diabetes, hypertension, and obesity has been widely suggested. However, accounting for interactions is not a trivial task because the large number of possible interactions, even for a relatively small set of candidate genes, creates a serious multiple-testing problem. This problem is compounded by the notoriously low power of formal tests for interaction. In the context of modeling association in unrelated individuals, Devlin et al. [1] proposed a testing strategy that conserves power by jointly testing main effects together with interactions and by adjusting for multiple tests by controlling false discovery rates (FDR). However, this method suffers from interpretability difficulties because a positive test of a set of main effects and interactions for several loci does not necessarily imply that all loci in the set are contributing to the significance of the test. Also, the method requires exhaustively testing sets of two and possibly three genes. We propose an analytic strategy that uses likelihood ratio tests (LRT) within a framework to test main effects and interactions of association and/or linkage jointly, conditional on significant single-locus effects. This strategy also incorporates a screening statistic that reduces the number of marker combinations that need to be tested for multilocus effects.
To apply this testing framework to nuclear family data, the investigator must choose a particular type of test. An important advantage of conditional-on-parental-genotype (CPG) transmission disequilibrium test (TDT) methods over unconditional methods, for testing association due to linkage disequilibrium (LD), is that they are not subject to confounding due to population stratification. Generally, these methods can be implemented using standard statistical software packages. However, if there is more than one affected offspring per family, directly applying the TDT by treating the offspring as if they are from independent families will no longer provide a valid test of association when there is linkage. This is due to a downward bias in the standard error estimator for the association parameter [2, 3]. We propose a model for the CPG likelihood that can be fit using standard statistical software and can be used for joint tests of linkage and association. We apply our proposed testing framework to single-nucleotide polymorphism (SNP) data in disease regions and non-disease regions from the simulated Genetic Analysis Workshop 14 (GAW14) data to demonstrate the performance of this approach in the context of a candidate gene study.
Methods
Data
Nuclear family data were extracted from the first ten GAW14 simulated replicate datasets. Kofendrerd Personality Disorder (KPD) disease status and genotype data were used for the first two affected sibs in each nuclear family, but only genotype data were used for the parents. For each replicate, there were 100 affected sib-pair-parent nuclear families that were obtained from each of the populations Aipotu, Kaarangar, and Danacaa with no missing data. All 48 SNPs from the six disease regions were included in the analysis as well as 10 SNPs from two regions on chromosomes 2 and 8 that were simulated with LD but had no relation to disease.
Analytic approach
Let g_{1}, g_{2}, g_{m}, g_{f} denote the genotypes of two affected offspring, mother, and father at a marker locus under study, and let D_{i} denote the disease status of offspring i. The CPG likelihood for an individual family takes the form P(g_{1}, g_{2} | g_{m}, g_{f}, D_{1}, D_{2}) [4]. By the laws of conditional probability, the likelihood can alternatively be written as P(g_{1} | g_{m}, g_{f}, D_{1}, D_{2}) × P(g_{2} | g_{1}, g_{m}, g_{f}, D_{1}, D_{2}). The basic analytic approach, developed by Millstein et al. [5], is based on the approach described by Self et al. [6] for estimating association due to LD in case-parent trios. By making the reasonable assumption that D_{2} without g_{2} provides no information on g_{1} conditional on D_{1}, the first factor of the product can be modeled using the standard CPG conditional logistic regression approach, i.e.,
$$P(g_{1} \mid g_{m}, g_{f}, D_{1}) = \frac{\exp(\beta G_{1})}{\sum_{g^{*}} \exp(\beta G^{*})}$$
where G_{i} denotes an indicator variable for alleles at g_{i} and exp(β) is an association estimate of relative risk for the genotype. The second factor of the product can be modeled by including a covariate, e_{ij}, to indicate the expected number of alleles shared identical by descent (IBD) by affected sibs i and j given the genotypes of both sibs and their parents. The resulting likelihood for both sibs would take the form
$$L = \frac{\exp(\beta G_{1})}{\sum_{g^{*}} \exp(\beta G^{*})} \times \frac{\exp(\beta G_{2} + \gamma e_{12})}{\sum_{g^{*}} \exp(\beta G^{*} + \gamma e_{1*})}$$
where the sums with respect to g* are over all possible offspring genotypes given the parental genotypes, e_{1*} is the expected IBD allele sharing between sib 1 and pseudosib * given the genotypes of both sibs and their parents, and γ is a measure of linkage. This model can be easily fit using the SAS (SAS Institute Inc., Cary, NC) procedure PROC PHREG with a conditional logistic regression approach, by creating a risk set for each affected sib and creating a covariate indicating IBD allele sharing for all members of the sib 2 risk sets (this covariate is set to zero for members of the sib 1 risk sets). The choice of which sib to assign the sib 1 position does not affect the likelihood. This approach can be used to test the null of no association in the presence of linkage (H_{0}: β = 0 and γ ≠ 0), linkage (H_{0}: γ = 0), or joint association and linkage (H_{0}: β = 0 and γ = 0). Among the advantages of using this approach are: 1) it is easy to implement using standard statistical software; 2) direct adjustment for individual-level covariates (for effect modification) is possible; 3) LRTs can be used to test jointly for multilocus effects of association; 4) LRTs of gene × environment interactions are possible (see below); 5) joint LRTs of linkage and association are possible; 6) IBD sharing is explicitly modeled.
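To make the structure of this likelihood concrete, the following Python sketch evaluates the log-likelihood contribution of a single family at one diallelic marker. It is only an illustration under stated assumptions, not the authors' SAS implementation: genotypes are coded as the count of allele "A" (a log-additive coding), each risk set is taken to be the four equally likely parental transmissions, and the expected IBD covariate is obtained by averaging over all transmission patterns consistent with the two genotypes, weighted equally. The function names are hypothetical.

```python
import math
from itertools import product

# The four equally likely (maternal allele index, paternal allele index) transmissions.
TRANSMISSIONS = list(product((0, 1), repeat=2))


def allele_count(mother, father, t):
    """Number of 'A' alleles carried by the child produced by transmission t."""
    return int(mother[t[0]] == "A") + int(father[t[1]] == "A")


def expected_ibd(mother, father, g_ref, g_other):
    """Expected IBD sharing (0 to 2) between two sibs with allele counts
    g_ref and g_other, averaging over all consistent transmission pairs
    (assumption: equal prior weight on every transmission pattern)."""
    total, n = 0, 0
    for t1, t2 in product(TRANSMISSIONS, repeat=2):
        if (allele_count(mother, father, t1) == g_ref
                and allele_count(mother, father, t2) == g_other):
            total += int(t1[0] == t2[0]) + int(t1[1] == t2[1])
            n += 1
    if n == 0:
        raise ValueError("genotypes are not consistent with the parents")
    return total / n


def family_loglik(mother, father, g1, g2, beta, gamma):
    """Log-likelihood of one affected-sib-pair-parent family.
    Sib 1 contributes the standard CPG term; sib 2 adds gamma times
    the expected IBD sharing with sib 1 (zero in the sib 1 risk set)."""
    pseudo = [allele_count(mother, father, t) for t in TRANSMISSIONS]
    ll = beta * g1 - math.log(sum(math.exp(beta * g) for g in pseudo))
    denom = sum(
        math.exp(beta * g + gamma * expected_ibd(mother, father, g1, g))
        for g in pseudo
    )
    ll += beta * g2 + gamma * expected_ibd(mother, father, g1, g2) - math.log(denom)
    return ll
```

Summing `family_loglik` over families and maximizing over `beta` and `gamma` (for example with a general-purpose optimizer) would give the fitted model; constraining one or both parameters to zero gives the null models for the tests listed above.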
Testing framework
We restricted our investigation to main effects and interactions between two or three SNPs in different chromosome regions; thus our focus was on gene × gene interactions rather than haplotype effects or within-gene SNP interactions. We modified the testing framework by prescreening SNP combinations using a mating type screening statistic (MS), and we tested only those combinations with an observed MS above a cutoff value.
Screening statistic
If loci interact to produce disease in an offspring, then mating types that are likely to produce the susceptibility multilocus genotype will be present in the parents of cases more often than we would expect based on marginal mating type frequencies. For a diallelic locus, there are six possible parental mating types if we ignore parent-of-origin effects (AA × AA, AA × Aa, AA × aa, Aa × Aa, Aa × aa, aa × aa), and thus 36 possible mating types for a pair of loci or, in general, 6^{k} possible mating types for k-locus genotypes. Let m_{i} denote the number of parental pairs that are described by the i^{th} multilocus mating type and let r denote the number of possible mating types. Then the statistic,
$$MS = \sum_{i=1}^{r} \frac{(m_{i} - E[m_{i}])^{2}}{E[m_{i}]}$$
which is equivalent to a goodness-of-fit chi-squared statistic, can be used to screen gene combinations for case-parent designs, thus reducing the number of gene sets that require testing for multilocus effects. The quantity E[m_{i}] is calculated from the observed single-locus mating type frequencies by assuming independence between loci in the population; consequently, the power of the method will be decreased if the loci are in LD in the population. In these analyses, the combinations of markers analyzed involved loci on separate chromosomes. The MS statistic uses only between-mating-type information, which is independent of the within-family information that is used in any CPG TDT analysis [6–9].
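The screening statistic can be sketched in a few lines of Python. This is a minimal illustration, assuming parental genotypes are coded as allele counts (0, 1, 2) and that mating types with zero observed marginal frequency are skipped because their expected count is zero; the function names and input layout are hypothetical rather than taken from the paper.

```python
from collections import Counter
from itertools import product


def mating_type(g_mother, g_father):
    """Unordered parental genotype pair at one locus (parent of origin ignored)."""
    return tuple(sorted((g_mother, g_father)))


def ms_statistic(families):
    """Goodness-of-fit statistic comparing observed multilocus mating-type
    counts with counts expected if the loci were independent.

    families: list of families; each family is a list with one
    (mother_genotype, father_genotype) pair per locus.
    """
    n = len(families)
    k = len(families[0])
    # Observed counts of multilocus mating types (the m_i).
    joint = Counter(tuple(mating_type(*loc) for loc in fam) for fam in families)
    # Observed single-locus mating-type counts, one Counter per locus.
    marginals = [Counter(mating_type(*fam[j]) for fam in families) for j in range(k)]
    ms = 0.0
    # Only combinations of marginally observed mating types have E[m_i] > 0.
    for combo in product(*(m.keys() for m in marginals)):
        expected = float(n)
        for j, mt in enumerate(combo):
            expected *= marginals[j][mt] / n
        ms += (joint.get(combo, 0) - expected) ** 2 / expected
    return ms
```

For example, four families whose two-locus mating types cover every combination exactly once give a value of zero, whereas families in which the mating types at the two loci always co-occur give a positive value.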
Results and Discussion
Table 1. Number of the 10 replicate datasets with significant effects.

Single-locus effects
SNP       Locus  Significant replicates (frequency)
B01T0554  D1     2
B01T0559  D1     2
C01R0052  D1     2
B03T3056  D2     10
B03T3057  D2     10
B03T3058  D2     10
C03R0281  D2     7
B03T3062  D2     1
B03T3063  D2     1
B03T3066  D2     4
B03T3067  D2     6
B05T4135  D3     2
B05T4136  D3     8
C05R0380  D3     7
B05T4138  D3     1
B05T4139  D3     3
B05T4140  D3     2
B05T4141  D3     1
B05T4142  D3     5
B09T8331  D4     4
B09T8332  D4     8
B09T8333  D4     9
B09T8334  D4     4
C09R0765  D4     8
B09T8337  D4     9
B09T8338  D4     5
B09T8339  D4     5
B09T8340  D4     8
B09T8341  D4     9
B09T8342  D4     3

3-locus effects
SNP 1     SNP 2     SNP 3     Loci        Significant replicates (frequency)
C01R0052  B03T3056  C05R0380  D1, D2, D3  1
B03T3062  B09T8341  B02T1017  D1, D4, 7   1
The solid line in Figure 2, representing the joint test of linkage and association, generally lies above (is more significant than) the lines for the independent tests of association and linkage, which implies that the power of the joint test under these conditions is greater. A highly significant p-value due to association occurs at SNP B03T3056, whereas the linkage effect is nonsignificant at this SNP. This pattern is consistent with the observation that a linkage effect would not be expected conditional on association if the actual disease SNP is included in the analysis [10]. Although SNP B03T3056 is not the disease-causing SNP, it may be so strongly in LD with the true disease SNP that there is essentially no conditional linkage effect.
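The relationship among the three tests can be illustrated with a short Python sketch that converts maximized log-likelihoods into LRT p-values. It assumes the usual asymptotic chi-squared reference distribution, with 2 degrees of freedom for the joint test and 1 for each single-parameter test, and it takes the reading that each single-parameter test leaves the other parameter unconstrained. The function names and the input log-likelihood values are hypothetical.

```python
import math


def chi2_sf(x, df):
    """Upper-tail probability of a chi-squared variable (df = 1 or 2 only)."""
    if x <= 0:
        return 1.0
    if df == 1:
        return math.erfc(math.sqrt(x / 2.0))
    if df == 2:
        return math.exp(-x / 2.0)
    raise ValueError("only df = 1 or 2 are supported in this sketch")


def lrt_pvalue(loglik_null, loglik_alt, df):
    """p-value of a likelihood ratio test between nested models."""
    stat = 2.0 * (loglik_alt - loglik_null)
    return chi2_sf(stat, df)


def three_tests(ll_00, ll_b0, ll_0g, ll_bg):
    """p-values for the joint, association, and linkage tests.

    ll_00: beta = 0, gamma = 0     ll_b0: beta free, gamma = 0
    ll_0g: beta = 0, gamma free    ll_bg: both parameters free
    """
    return {
        "joint": lrt_pvalue(ll_00, ll_bg, df=2),
        "association": lrt_pvalue(ll_0g, ll_bg, df=1),
        "linkage": lrt_pvalue(ll_b0, ll_bg, df=1),
    }
```

A call such as `three_tests(-10.0, -8.0, -9.5, -7.0)` returns one p-value per test, which makes it easy to see how a strong association signal can dominate the joint test even when the conditional linkage test is unremarkable.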
The MS statistic was used to restrict the investigation to the top 20% of two-locus combinations and the top 10% of three-locus combinations. No two-locus effects were detected, and no three-locus effect was sufficiently strong to be detected in more than one replicate dataset after controlling the experiment-wise FDR at 0.05. Relaxing the significance criteria did not yield consistent effects across replicates. The multilocus effect involving {C01R0052, B03T3056, C05R0380} was identified in replicate 1 and the effect involving {B03T3062, B09T8341, B02T1017} was identified in replicate 2 (Table 1). The test of the multilocus effect of {B03T3062, B09T8341, B02T1017} was significant after conditioning on the marginal main effect of B09T8341, and {C01R0052, B03T3056, C05R0380} was significant conditional on B03T3056. Thus three of these five disease-related SNPs were identified by their involvement in multilocus effects in these replicates, and one SNP, B02T1017, was falsely identified as a disease-related SNP. Latent phenotype P1 was the result of a D1, D2 interaction and latent phenotype P2 involves a D2, D3 interaction; therefore the observed {D1, D2, D3} multilocus effect is consistent with the simulation design. The D1, D4 interaction has a penetrance of 1.0 for phenotype P3; thus the {D1, D4, non-KPD region 7} effect could be partially explained by that interaction.
The lack of significant two-locus effects, together with the lack of consistency for the three-locus effects, indicates a general lack of power under these conditions for detecting multilocus effects conditional on significant marginal effects after accounting for multiple tests, even after prescreening locus combinations with the MS statistic. Each of the four disease loci was involved in multiple interactions that caused risk of multiple latent phenotypes. Additionally, there was heterogeneity across populations in how these latent phenotypes caused the KPD trait. Therefore, the strength of the interaction effects relative to the marginal effects was diluted by the presence of multiple interactions per locus. In this situation the identification of a disease locus is more likely to happen through its marginal effect. The testing framework employed here involved multi-df tests of main effects and interactions, which could lead to positive tests in the presence of main effects but no interaction. However, the number of tests per stage increased with the order of interaction and the alpha level was equally allocated to each of the three stages. This resulted in the test-specific significance threshold becoming more stringent with stage; thus a significant multilocus effect test was unlikely to be explained by main effects alone. Also, following Schaid et al. [8], our models assumed log-additive risk and multiplicativity between loci, while the true susceptibility patterns were either dominant or recessive, and multiplicativity did not necessarily hold. Departures from the true risk model may have contributed to our lack of power to detect multilocus effects. For example, various combinations of the three latent phenotypes, P1–P3, determined the KPD trait, and first-order interactions, involving dominant and recessive susceptibility patterns, between the four disease regions, D1–D4, influenced risk of P1–P3. Also, the relationship between KPD and P1–P3 varied across the three populations.
Therefore, the relationship between D1–D4 and KPD was complicated, and there may not have been adequate information in the data to detect those interactions at the provided sample size. However, it needs to be emphasized that the principal objective of this approach is not to identify interaction effects per se but rather to identify loci or combinations of loci that influence disease risk. The method was thus successful in identifying the disease regions through the marginal effects of the SNPs.
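The stage-wise allocation of the error rate discussed above can be sketched as follows. The text does not name the FDR procedure that was used, so this example assumes the Benjamini-Hochberg step-up procedure applied separately within each stage, with the overall level of 0.05 split equally across the three stages. The function names and the p-values in the usage note are hypothetical.

```python
def benjamini_hochberg(pvalues, q):
    """Indices of hypotheses rejected by the Benjamini-Hochberg
    step-up procedure at FDR level q."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    cutoff = 0
    for rank, i in enumerate(order, start=1):
        # Keep the largest rank whose p-value falls under its threshold.
        if pvalues[i] <= q * rank / m:
            cutoff = rank
    return sorted(order[:cutoff])


def staged_rejections(stage_pvalues, total_q=0.05):
    """Apply the procedure within each stage, giving every stage
    an equal share of the overall level (assumed allocation rule)."""
    q_stage = total_q / len(stage_pvalues)
    return [benjamini_hochberg(p, q_stage) for p in stage_pvalues]
```

With stage-wise p-value lists such as `[[0.001, 0.5], [0.04], []]`, `staged_rejections` returns one list of rejected indices per stage. Because the smallest per-test threshold is the stage level divided by the number of tests in that stage, stages with more tests impose a more stringent threshold on any single test.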
Conclusion
While consistent multilocus effects were not identified by this particular analysis, an approach was documented that simultaneously accounts for possible interactions, maintains adequate power for detecting main effects, and rigorously controls the FDR for multiple tests. A novel method was implemented for jointly testing linkage and association using affected sib-pair-parent data that is computationally fast and easy to implement using standard statistical software. With further research this method could be generalized to nuclear families with more than two affected offspring. Four of the six disease loci were identified in at least a subset of the 10 replicate data sets when all disease-region SNPs were included in the analysis. These results bolster the idea that it is feasible to explicitly account for interactions in a candidate gene study while maintaining adequate power for finding marginal effects.
Abbreviations
CPG: Conditional-on-parental-genotype
FDR: False discovery rate
GAW14: Genetic Analysis Workshop 14
IBD: Identical by descent
KPD: Kofendrerd Personality Disorder
LD: Linkage disequilibrium
LRT: Likelihood ratio test
MS: Mating type screening statistic
SNP: Single-nucleotide polymorphism
TDT: Transmission disequilibrium test
Declarations
Acknowledgements
This study was funded in part by the National Institute of Environmental Health Sciences (Grants ES10421 and 5P30ES07048), the National Cancer Institute (CA52862), and the Institute of General Medicine (GM58897).
Authors’ Affiliations
References
1. Devlin B, Roeder K, Wasserman L: Analysis of multilocus models of association. Genet Epidemiol. 2003, 25: 36-47. 10.1002/gepi.10237.
2. Siegmund KD, Gauderman WJ: Association tests in nuclear families. Hum Hered. 2001, 52: 66-76. 10.1159/000053357.
3. Martin ER, Kaplan NL, Weir BS: Tests for linkage and association in nuclear families. Am J Hum Genet. 1997, 61: 439-448. 10.1017/S0003480097006362.
4. Cordell HJ: Properties of case/pseudocontrol analysis for genetic association studies: effects of recombination, ascertainment, and multiple affected offspring. Genet Epidemiol. 2004, 26: 186-205. 10.1002/gepi.10306.
5. Millstein J, Siegmund KD, Conti DV, Gauderman WJ: Testing association in the presence of linkage using affected-sib-parent study designs. Genet Epidemiol. 2005, 29: 225-233. [http://hydra.usc.edu/biostat/TR%20pages/TR171/171.htm]
6. Self SG, Longton G, Kopecky KJ, Liang KY: On estimating HLA/disease association with application to a study of aplastic anemia. Biometrics. 1991, 47: 53-61. 10.2307/2532495.
7. Schaid DJ, Sommer SS: Genotype relative risks: methods for design and analysis of candidate-gene association studies. Am J Hum Genet. 1993, 53: 1114-1126.
8. Schaid DJ: General score tests for associations of genetic markers with disease using cases and their parents. Genet Epidemiol. 1996, 13: 423-449. 10.1002/(SICI)1098-2272(1996)13:5<423::AID-GEPI1>3.0.CO;2-3.
9. Weinberg CR: Methods for detection of parent-of-origin effects in genetic studies of case-parents triads. Am J Hum Genet. 1999, 65: 229-235. 10.1086/302466.
10. Siegmund KD, Langholz B, Kraft P, Thomas DC: Testing linkage disequilibrium in sibships. Am J Hum Genet. 2000, 67: 244-248. 10.1086/302973.
Copyright
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.