Testing association and maternally mediated genetic effects with the principal component analysis in case-parents studies

Abstracts Background Major advances in genotyping technology have generated high-density maps of single nucleotide polymorphism (SNP) markers that provide an unprecedented opportunity to identify genes underlying complex traits. Several family-based statistical methods showing robust population stratification have been developed to test the association between multiple markers and disease-susceptibility genes. Only a few methods focus on testing for maternally mediated genetic effects, which is a critical risk for birth defects. The present study focuses on testing for association and maternally mediated genetic effects with family-based methods. Results In the present study, we proposed a new method, max_PC integrating principal component analysis, to test association or maternally mediated genetic effects with case-parent data. The proposed method only uses the genotypes of case-parents triads and accommodates missing SNP data. Our results demonstrated that this method is powerful to test association or maternally mediated genetic effects and attractive because it provides a tool for testing the null hypothesis of no association and no maternally mediated genetic effects. Simulations with the permutation procedure as well as an application in the Crohn’s disease study showed that the type I error rates of the proposed statistic were nominal with slightly higher power as compared to those of the max_Z2 test. Conclusions We conclude that the max_PC is a good approach to test association or maternally mediated genetic effects.


Background
The rapid advancement of genotyping technologies and the availability of enormous quantities of genotype or haplotype data provide an unprecedented opportunity for identifying genes involved in susceptibility to complex diseases. The utilization of information from individual markers as well as linkage disequilibrium (LD) structure between the markers has resulted in the use of haplotypebased association studies to determine the relationship between complex traits and a series of possibly linked markers. However, haplotype-based methods are challenging as the large number of distinct haplotypes will in turn produce a large number of degrees of freedom. In addition, some haplotype-based methods need the estimation of haplotype phases when haplotypes are not directly observed. A potential strategy to avoid the estimation of haplotype phases and frequencies is to develop genotype-based statistical approaches that directly use genotype data.
Population-based case-control studies and family-based studies are two common strategies for association analysis. Compared to case-control studies, family-based studies are more attractive due to their robustness to population stratification. Several genotype-based methods with case-parent data have been proposed for association test.
McIntyre et al. [1] proposed a max_TDT test, in which the usual TDT statistic for each locus was calculated and the maximum was utilized as the test statistic. Using the principal components (PC) of a variance-covariance matrix of difference vectors that were calculated by comparing the genotypes of affected offspring with their corresponding "complements", which are defined to be the parental genotypes that were not present in the affected offspring, Lee WC [2] proposed the adaptive principal component test, PCT L 2 . Based on the same difference vector, Fan et al. [3] proposed a paired Hotelling's T 2 test, while Shi et al. [4] developed the max_Z 2 test. The max_Z 2 test is superior to other methods as it is able to study the maternally mediated genetic effects while only using the genotypes of affected individuals and their parents. It could also accommodate missing SNP data. Maternally mediated genetic effects arise when the phenotype of an offspring is influenced by maternal alleles via the intrauterine environment [5]. This effect can be detected by using the father's genotype as a matched control for the mother's under an assumption of mating symmetry [6]. The maternal genome influences the risk for specific diseases in the offspring, suggesting maternal genes play an independent role in the etiology of birth defects [5,6]. So incorporating maternally mediated genetic effects into an association analysis might be particularly useful in studying diseases that happened during fetal life.
In this report, we proposed a new statistic, max_PC, to test association and maternally mediated genetic effects with case-parent data. Here, the null hypothesis is that neither association nor maternally mediated genetic effects exist. The max_PC test has three characteristics: (1) only using the genotypes of case-parent triads while accommodating missing SNP data, (2) utilizing the principal component analysis, and (3) evaluating the association and the maternally mediated genetic effects at the same time. Simulations analysis using a permutation procedure and an application in Crohn's disease study indicated that this max_PC statistic has good performance.

Simulation setting
To assess the performance of the statistic, max_PC, we performed a simulation study using a wide range of parameters. The simulations were implemented as described by Lee WC [2]. A candidate region of 100 kb in size was considered, where 20 dense marker loci were uniformly distributed in a homogeneous population. The marker frequencies were randomly determined, with values ranging from 0.1-0.9, and the disequilibrium coefficients between two adjacent markers were uniformly generated between −0.9 and 0.9 [2,7]. Assumption was made that a biallelic disease susceptibility gene with alleles D and d was uniformly located within the candidate region. The frequency of disease allele D was set to be 0.1. Let R 1 denote the relative risk with one copy of the disease allele D and R 2 denote the relative risk with two copies compared with no copies. We considered a simple mutational process for the disease allele D where the allele D appears in an individual with the "ancestral haplotype" and it broke up because of meiotic recombination. We used the first-order Markov model [8] to generate the markers on the ancestral haplotype and then generate the haplotypes in the present-day population.
For the null hypothesis that neither association nor maternally mediated genetic effects exist, we let R 1 = R 2 = 1. For the alternative hypothesis that there exists association or maternally mediated genetic effects, we considered the following three scenarios. First, there exists association where the state of disease phenotype of the offspring was determined by the offspring genotype; the second, there exists maternally mediated genetic effects where the state of disease phenotype of the offspring was determined by the mother genotype; and third, there exist both association and maternally mediated genetic effects where the state of disease phenotype of the offspring was determined by the genotypes of both mother and offspring. For the third scenario, we considered a simple model in which the relative risk was the product of the relative risks of the offspring and the mother. For the above three scenarios, we let the baseline disease risk (the penetrance of genotype dd) be 0.01, and R 1 = R 2 = 2 and R 1 = 1, R 2 = 2, which represented a dominant and a recessive model, respectively. The number of case-parent triads, N, was chosen as 100, 200, or 400. We assumed that each individual had an available marker that could be utilized as a genotype in the analysis. Upon simulation of the familial data, we first calculated the value of max_PC, and then recalculated it by using the permutation procedure 1,000 times. The p value was the proportion of permutationbased statistics that were larger than the data-based statistic. We performed this data simulation with 10,000 replicates. For a given significance level α, the power/type I error rate is then estimated as the proportion of rejecting the null hypothesis when p ≤α.
We also investigated the performance of max_PC in the presence of population admixture. We assumed that there were two subpopulations that differed in disease allele frequency and baseline disease risk. Disease allele D was introduced 1,000 and 500 generations ago through two different ancestral haplotypes in the two subpopulations, respectively. The frequencies of the D allele were set as 0.1 and 0.2, and the baseline disease risks were set as 0.01 and 0.05 in the first and second subpopulation, respectively. We used the previously described approach to generate the markers and haplotypes in each subpopulation and let the two subpopulations admix at the last generation, with 30 percent for the first subpopulation and 70 percent for the second subpopulation.

Assessment of the performance of the max_PC statistic
As shown in Table 1, the type I error rates are around the nominal value, indicating that the type I error rates are validated in both homogeneous and admixture populations. The powers under three scenarios are exhibited in Tables 2 and 3. The power of max_Z 2 was also investigated so that we could compare the performance of max_PC with that of max_Z 2 . It should be noted that max_Z 2 uses the difference vector 2C − F − M and M − F under the first scenario and the second scenario, respectively. However, we perform two tests under the third scenario: one uses the difference vector 2C − F − M to calculate max_Z 2 for testing association and the other uses the difference vector M − F to calculate max_Z 2 for testing maternally genetic effects. It is observed from Table 2 that, in homogeneous population, the power of max_PC for testing association and maternally mediated genetic effects is highest among three scenarios. The powers for testing association and maternally mediated genetic effects reach 80 % and 90 %, respectively, when the sample size is 200 or 400. The power of max_PC under the recessive model is slightly lower than those under the dominant mode. The second to the fifth columns of Table 2 show that the powers of max_PC are similar to those of max_Z 2 . Under the third scenario of testing association and maternally mediated genetic effects at the same time, the powers of max_PC are slightly higher than those of max_Z 2 with the difference vector 2C − F − M and those of max_Z 2 with M − F. Table 3 presents the powers in admixture population. It can be seen that the results are similar to those observed in homogeneous population.

A real example
To evaluate the performance of the max_PC on more realistic scenarios, we applied our method to a real data set of Crohn's disease. Crohn's disease is one of the two major forms of inflammatory bowel diseases [9,10]. Rioux et al. [10,11] detected a candidate region containing a genetic risk factor for Crohn disease on human chromosome 5q31with 139 case-parents trios. We applied the max_PC and the max_Z 2 to the publically available subset of 129 trios genotyped at 103 common SNPs. We assessed P value using the permutation procedure 10,000 times. The P value using the max_PC is 3.62 × 10 −4 , indicating that this region is associated with Crohn's disease or there is a possible maternal effects. When the max_Z 2 with 2C − F − M and M − F was used, the P values are 3.78 × 10 −4 and 4.05 × 10 −3 , respectively, indicating that there may exist association and maternal effects. In order to further compare the two methods, we also considered risk-haplotype-tagging alleles which are highly predictive of disease susceptibility as descripted by Shi et al. [4]. According to Shi et al. [4], we consider the locus with P value being smaller than a preset threshold to be related to risk and designate the corresponding  The max_Z 2 using M − F overtransmitted allele as a risk-haplotype-tagging allele. It can be seen from Fig. 1 that the number of risk-haplotypetagging alleles identified using the max_PC (Fig. 1c) is larger than that using the max_Z 2 ( Fig. 1a and b).

Discussion
Based on the principal component analysis, we proposed a statistic max_PC to test association and maternally mediated genetic effects with multiple markers. Similar to the max_Z 2 test, the max_PC test only uses the genotypes of case-parent triads and accommodates missing SNP data. Notably, at least two characteristics of the max_PC test are totally different from those of the max_Z 2 test: (1) the max_PC test uses the principal component analysis, and (2) the max_PC can be used to test association and maternally mediated genetic effects at the same time, whereas two tests with max_Z 2 should be performed for this case. The performance of max_PC was assessed by simulations analysis as well as the application in Crohn's disease study.
In practice, when multiple markers are studied, individuals may have incomplete information of individual marker data. Our method calculates the PC value for each locus, indicating that it is capable of handling missing SNP data. Moreover, our approach is not biased toward admixture population and Hardy-Weinberg equilibrium was not essential when using our method. The null hypothesis for the max_PC test is that there is no association and no maternally mediated genetic effects, whereas the alternative hypothesis is that there is an association or maternally mediated genetic effects. One attractive feature of max_PC is that it is able to test association and maternally mediated genetic effects exist at the same time. Also, it provides a method for testing the null hypothesis of no association and no maternally mediated genetic effects. When the null hypothesis is rejected we will use max_Z 2 with the difference vector, 2C − F − M, for testing association, and use max_Z 2 with the difference vector, M − F, for testing maternally mediated genetic effects. In this way, we will be able to determine whether only maternally mediated genetic effects or only association exists. It is worth noting that testing for maternally mediated genetic effects rely on the assumption of mating symmetry when max_PC was used according to M − F. However, when an offspring effect is present, parent-of-origin effects can also cause the effect of symmetry [4]. In this case, it is not easy to distinguish maternally mediated genetic effects from parent-of-origin effects. Further studies to carefully evaluate the possible parent-of-origin effects are thus warranted.

Conclusions
We have proposed a statistical approach for detecting association or maternally mediated genetic effects with case-parent data. The approach is powerful to test association and the maternally mediated genetic effects at the same time. It also provides a valid tool for testing no association and no maternally mediated genetic effects. The proposed method for association analysis incorporating maternally mediated genetic effects is particularly useful in studying diseases that happened during fetal life.

Methods
In this study, all datasets were publically available and no research requiring ethics approval was conducted.
We consider a biallelic marker with alleles "A" and "a". Assume that case-parent triads are sampled with the genotypes known for each member of the triads. Let M, F, and C be the number of copies of allele "A" carried by the mother, father, and affected offspring, respectively. 2C − F − M is the paired differences in genotypes between the affected offspring and the complement, which carries the non-transmitted genotypes in the case-parent data. M − F is the SNP-count difference between the mother and the father, which can be used to measure  Under the null hypothesis of no association and no maternally mediated genetic effects, Z = 0. We assume that there are n case-parent triads. Let S be the sample covariance matrix of Z for the observed case-parent trios data, Z i = (X i , Y i ) T (i = 1, ⋯, n). Denote the two sample eigenvalue-eigenvector pairs for S by ðλ 1 ; e 1 Þ, ðλ 2 ; e 2 Þ, where λ 1 ≥λ 2 ≥0 . Then, the kth (k = 1, 2) sample principal component is ζ k = ê k T Z, with variance of V ar ζ k ð Þ ¼λ k . Note that the two sample principal components ζ 1 , ζ 2 are uncorrelated, i.e., Cov(ζ 1 , ζ 2 ) = 0. The kth sample principal component for the ith case-parent trio is ζ ik = ê k T Z i (k = 1, 2; i = 1, ⋯, n).  Fig. 1 Crohn's disease study. Result of testing association using the max_Z 2 with M − F (a), testing maternal effects using the max_Z 2 with 2C − F − M (b), and testing association or maternal effects using the max_PC (c). The Y-axis shows -log10(p) at individual SNPs; the X-axis shows SNPs and no maternally mediated genetic effects, E ζ k À Á ¼ 0 for k = 1, 2. Define a statistic, denoted by PC, as follow: The statistic PC is asymptotically a central χ (1) 2 distribution under the null hypothesis of no association and no maternally mediated genetic effects. Now, we consider q biallelic markers each with alleles "A" and "a". The markers are indexed by l (l = 1, ⋯, q). Let PC (l) be the statistic for marker l. Our test statistic, here, denoted as max_PC, is the maximum of the PC (l) across all q marker loci, i.e., maxÀPC ¼ max l fPC ðlÞ g. It can be seen that this max_PC approach is able to accommodate missing individual marker genotypes and does not require any haplotype information. Here, the statistical significance is assessed by a permutation procedure. We first calculate the data-based statistic. Then we permute the "case" and "complement" labels with equal probability and recalculate the statistic. The estimated P value (denoted byp ) is then the proportion of permutation-based statistics that are larger than the data-based statistic.