A multi-marker test based on family data in genome-wide association study

Background Complex diseases are believed to be the results of many genes and environmental factors. Hence, multi-marker methods that can use the information of markers from different genes are appropriate for mapping complex disease genes. There already have been several multi-marker methods proposed for case-control studies. In this article, we propose a multi-marker test called a Multi-marker Pedigree Disequilibrium Test (MPDT) to analyze family data from genome-wide association studies. If the parental phenotypes are available, we also propose a two-stage test in which a genomic screening test is used to select SNPs, and then the MPDT is used to test the association of the selected SNPs. Results We use simulation studies to evaluate the performance of the MPDT and the two-stage approach. The results show that the MPDT constantly outperforms the single marker transmission/disequilibrium test (TDT) [1]. Comparing the power of the two-stage approach with that of the one-stage approach, which approach is more powerful depends on the value of the prevalence; when the prevalence is no less than 10%, the two-stage approach may be more powerful than the one-stage approach. Otherwise, the one-stage approach is more powerful. Conclusion The proposed MPDT, is more powerful than the single marker TDT. When the parental phenotypes are available and the prevalence is no less than 10%, the proposed two-stage approach is more powerful than the one-stage approach.


Background
Complex diseases are presumed to be the results of many genes and environmental factors, with each gene only having a small effect on the disease. To test for the association, multi-marker methods that can combine the information of markers from different genes or across the genome are appropriate. To search for a set of susceptibility genes across the whole genome that is responsible for a complex trait, we need a multi-marker test (applicable to linked and unlinked markers) and a searching algorithm.
In case-control studies, several multi-marker association tests have been proposed which include the Hotelling's T 2 test proposed by Xiong et al. [2] and the score test proposed by Chapman et al. [3] and Wallace et al. [4], among others. All these methods can combine information from multiple markers in one candidate gene, different genes, or across the genome. Although several haplotype-based methods have been proposed for a family-based design [5][6][7][8][9][10][11][12], those methods can only deal with the markers in a candidate gene or a tightly linked chromosome region. Xu et al. [13] first proposed a multi-marker method dealing with unlinked markers. The test statistic of Xu et al.'s method is the weighted sum of the single marker test statistics and the weights were calculated by using the information of parental phenotypes. This method is mainly developed for quantitative traits and requires parental phenotypes.
In this article, we proposed a multi-marker association test for family-based designs. Our method, proposed for qualitative traits, does not require parental phenotypes and can deal with markers from different genes or across the genome. The proposed MPDT, as an extension of the Pedigree Disequilibrium Test (PDT) [14], can be applied to any size of pedigrees. To apply the MPDT to genomewide association studies, we also propose a searching algorithm. The proposed multi-marker association test together with the searching algorithm allows one to search for a set of susceptibility genes across the genome responsible for a complex trait. If the parental phenotypes are available, we propose a two-stage test by using the same family-based data set. In the two-stage approach, we first use a single marker test by contrasting parental cases with parental controls to screen the SNPs, and then use the MPDT and the searching algorithm to search for a set of susceptibility genes. The two-stage approach is motivated by the method recently proposed by Steen et al. [15]. In mapping quantitative trait loci using family data, Steen et al. proposed an interesting approach that performs a SNP screening and association test using the same sample. The basic idea of Steen et al.'s method is that the screening test based on the traits and between-family genotype scores is statistically independent of the association test which depends on trait values and within-family genotype scores. The screening test is used first to select SNPs, and then the association test is performed on a much smaller set of the selected SNPs. Our two-stage approach uses a similar idea but different tests in both stages.
We use simulation studies to evaluate the performance of our proposed method. The results show that the MPDT (either one-stage or two-stage) has the correct type I error rates. In all the cases that we considered, the MPDT is more powerful than the single marker TDT. Comparing the power of the two-stage approach that uses parental phenotypes with that of the one-stage approach, which approach is more powerful depends on the value of the prevalence; when the prevalence is no less than 10%, the two-stage approach may be more powerful than the onestage approach. Otherwise, the one-stage approach is the more powerful one.

Methods
In this section, we will first give the test statistic of the MPDT. Then, we will discuss a searching algorithm and how to find a set of susceptibility genes by the searching method and the MPDT. Finally, we will describe a twostage approach used to incorporate the information of parental phenotypes if it is available.

The MPDT
As the PDT proposed by Martin et al. [14], the MPDT is designed for pedigrees of any size. In the following discussion, for simplicity of presentation, we will only give the statistic for nuclear families with affected children. It is straightforward to extend the statistic for general pedigrees. Suppose we have genotyped m markers across the genome or in a candidate region for each sampled individual. Consider a sample of n nuclear families with n i affected children in the i th family. For a biallelic marker with two alleles A and a, we code the three genotypes aa, Aa, and AA as 0, 1, and 2, respectively.
Let F ij , M ij and u ijk denote the genotype codes of the father, mother and k th child in the i th family at the j th marker, respectively, i = 1, 2,..., n; j = 1, 2,..., m; k = 1, 2,..., n i . Considering each affected child as a case, we define a pseudocontrol matching each case. The pseudo-control matching the k th child in the i th family has a genotype code at the j th marker where is the genotype code of the two alleles not transmitting to the k th child by the parents. For example, if the genotypes of the father, mother, and a child are Aa, Aa, and AA, respectively, then the pseudocontrol matching this child has a genotype of aa and a genotype code of 0.
It is easy to see that the genotype codes of parents, children, and the pseudo-controls have the relationship where V ⊕ is the generalized inverse of V. Under the null hypothesis of no association between the markers and the trait, the MPDT has approximately a χ 2 distribution with degrees of freedom k, where k is the rank of V. If only one marker is considered, is the test statistic of the PDT.

Searching algorithm and overall p-value
In this section, we consider a genome-wide association study. Suppose we have genotyped M markers across the genome. Our aim is to find a set of markers that jointly have significant association with the trait. We propose two searching algorithms: Conditional Search (CS) and Sequential Forward Search (SFS). In both algorithms, each of the M markers is tested by using the PDT first. Then, the markers are ordered according to their p-values of the PDT. Suppose the p-values of markers 1, 2,..., M are in ascending order. Based on the ordered markers, the two algorithms are given below:

CS
The CS algorithm searches marker-sets A 1 ,..., A L , where marker-set A i consists of markers 1,..., i (i = 1,..., L) and L is a pre-specified value. We calculate the p-value of the MPDT for each set of markers and call the p-value from this step a raw p-value.

SFS
The SFS algorithm begins with marker-set A 1 which consists of marker 1. Then, by adding one marker to the marker-set A 1 , we get all of the two-locus combinations with the first marker included. We test all of the two-locus combinations by the MPDT and choose the two-locus combination with the smallest p-value (also called a raw p-value) as marker-set A 2 . In this way, we can get a series of marker-sets A 1 ,..., A L .
Both of the two searching algorithms give a series of candidate marker-sets and the corresponding raw p-values of the MPDT. The problems that remain are choosing the "best" or the final marker-set and evaluating the overall pvalue of the final marker-set. An intuitive idea is to choose the marker-set with the smallest raw p-values as the final marker-set and use a permutation procedure to evaluate the overall p-value. However, our simulation studies (results not shown) show that in most cases, the more markers a marker-set contains, the smaller the p-value of the marker-set will be. Thus, instead of using the raw pvalues, we propose to use a permutation procedure recently proposed by Ge et al. [16] and further discussed by Becker and Knapp [17] to adjust the raw p-values and use the adjusted p-values to choose the final marker-set. This procedure also gives the overall p-value of the final marker-set. Let A 1 ,..., A L denote the candidate marker-sets and P 01 ,..., P 0L denote the associated raw p-values of the MPDT. The permutation procedure includes the following steps: 1. Generate S (say, 1,000) permuted datasets. In each permutation, there is a 50% probability of changing the multi-marker genotype (the genotype across all of the M markers) of each child with that of the corresponding pseudo-control. The reason that we changed the genotypes across M markers simultaneously is to keep the LD structure in each permuted data set.
2. For each permuted data set, search for the L candidate marker-sets by either of the two algorithms. Based on the permutated data set, test for the association between each marker-set and the trait using the MPDT. For the s th permuted data set, denote the L candidate marker-sets by A s1 ,..., A sL and the associated raw p-values by P s1 ,..., P sL .
Then, the adjusted p-value corresponding to the candidate is a indicator function. We will choose the marker-set with the smallest adjusted p-value, p 0 = min(p 01 , p 02 ,..., p 0L ), as the final marker-set.
3. To evaluate the overall p-value of the final marker-set, we first adjust the raw p-values P s1 ,..., P sL for the s th permuted data, s = 1,..., S. The adjusted value of P sl is given by . Let p s = min{p s1 ,..., p sL }. Then, the overall p-value of the final marker-set is given by Usually, p overall is obtained through another layer of permutation by a standard double permutation procedure, according to Ge et al. [16], p overall can be estimated by (1), which needs only one layer of permutation.

A two-stage approach to incorporate parental phenotypes
If parental phenotypes are available, we propose a twostage approach to incorporate the parental phenotypes. The basic idea of the two-stage approach is that the test used in the first stage is independent of the association test used in the second stage; the test in the first stage is used to select promising SNPs, and the association test in the second stage can be performed on a smaller set of the selected SNPs.

Stage one
The test that we propose to use in this stage is based on a test statistic for a case-control study. Consider a case-control study with N 1 cases and N 2 controls, and each sampled individual has a genotype at a bi-allelic marker with two alleles A and a. To test the association between the marker and the disease, one can use the test statistic where p and q are the sample frequencies of allele A in cases and controls, respectively; is the estimated variance of p -q; p 0 is the sample frequency of allele A in the whole sample. Under the null hypothesis of no association, this test statistic asymptotically follows a Chi-squared distribution with one degree of freedom. To use this test statistic in the first stage, we consider the affected parents of the sampled nuclear families as cases and the unaffected parents of the sampled nuclear families as controls. We propose to use the statistic T p on each of the M markers and get a corresponding P-value for each marker. Select M 1 markers with the smallest P-values, where M 1 is a pre-specified number, which usually is smaller than M. We will discuss how to choose M 1 later. In this stage, we use only the parental information of the nuclear families.

Stage two
Apply the searching algorithm (including the permutation procedure) to the M 1 selected markers to find a final marker-set and the overall p-value of the MPDT to test the association between the final marker-set and the trait. Since all the calculations including searching and permutation procedure are applied to the data set of M 1 markers, the calculation will be much faster than that of applying the method directly to the original M markers. If the parental phenotypes and genotypes have sufficient information to keep most of the disease susceptibility loci in the selected markers and delete many noise markers in the first stage, then the two-stage approach should be more powerful. Otherwise, the two-stage approach may lose power.

Other method compared
We compared the proposed MPDT (plus the searching algorithms) with the single-marker TDT. We also compared the power of the tests using (two-stage approach) and without using (one-stage approach) parental phenotypes. For the single marker TDT, we search for a set of significance markers by controlling the False Discovery Rates (FDRs) [18], the ratio of the number of falsely rejected null hypotheses to the total number of rejected null hypotheses. We use the one-stage approach as an example to explain the procedure. Calculate the TDT for each of the M markers and denote the ordered p-values by p (1) ,..., p (M) .
Declare a marker significant if the P-value of the TDT at this marker is less than a threshold δ M such that the FDR can be controlled at a level of α. The threshold δ M is determined by The marker set that consists of all the markers associated with the trait is called the final marker-set.

Results
We use two sets of simulation studies to evaluate the type I error rates and the power of the proposed methods. The type I error rates are evaluated for both one-stage and twostage approaches. For the power comparisons, we compare the proposed multiple-marker approach with the single-marker approach, and the one-stage approach with the two-stage approach. In the simulation studies, we consider nuclear families with one affected child and use L = 15, where L is the maximum number of markers contained in the searched marker-sets. To identify a set of markers, we propose two searching algorithms: CS and SFS. The results for type I error rates and power comparisons by using CS and SFS are very similar. We only show the results using the CS algorithm.
In the first set of simulation studies, we use the data sets where the nearby markers are in Linkage Disequilibrium (LD). For the purpose of generating genotypes with LD between markers, we use Hudson's program (ms software) [19] which assumes the coalescent process with recombination to generate multi-marker haplotypes. We follow Nordborg M, Tavare S [20] and Kimmel G, Shamir R's [21] methods, using a mutation rate of 2.5 × 10 -8 per nucleotide per generation, a recombination rate of 10 -8 per pair of nucleotides per generation, and an effective population size of 10,000 individuals. Of all the segregating sites, only the ones with minor-allele frequency >5% are defined as SNPs and are used in the rest of the analysis. The simulation program produces populations from which samples of cases and controls can be drawn.
In the second set of simulation studies, we generate the genotypes by assuming the Hardy-Weinberg equilibrium and linkage equilibrium. This means that we generate each allele at each marker independently, and the minor allele frequency at each marker is drawn randomly between 0.05~0.5.

Assessing the type I error rates
To assess the type I error rates, we generate data under the null hypothesis of no association between trait values and the multi-marker genotypes. In both of the two sets of simulations, we first generate the genotypes of the parents of each nuclear family and then each parent randomly transmits one of the two haplotypes to the child to form the child's genotype. We consider the sample size of 500 nuclear families and three different numbers of markers: 100, 1,000 and 100,000. For each simulation scenario, we generate 1,000 samples to estimate the type I error rates of the MPDT and TDT for one-stage approaches as well as two-stage approaches by using different values of M 1 (the number of markers retained in the first stage). Note that when M 1 = M, the two-stage method is, in fact, the onestage approach. For 1,000 replications, the 95% confidence interval of the type I error rate is (0.0362, 0.0638) with a nominal level of 5%. The results are summarized in Table 1. Table 1 shows that the type I error rates of the MPDT and TDT are very consistent with the nominal level of 5% for both one-stage approaches (the case of M 1 = M) and two-stage approaches (M 1 <M). We can see that the consistency does not depend on the value of M 1 . The correct type I error rates for the two-stage approach also show that the tests used in the first stage and in the second stage should be independent.

Simulation studies for evaluating power
For the power comparisons, we compare the proposed multiple-marker approach with the single-marker approach, and the one-stage approach with the two-stage approach. First, we clarify the meaning of powers for different methods.

Power and Power Calculation
To estimate the power of the MPDT and TDT for one-stage approaches as well as two-stage approaches, we use 100

Power comparisons
To assess the power of the proposed method, we generate genotype data under four different disease models. In model I and model II, we consider two three-locus disease models, and in model III and model IV, we consider two ten-locus disease models. These models are similar to those used by Millstein et al. [22] in their simulation studies. Let p = P(Affected|genotype) denote the penetrance and x k denote the numerical code of the genotype at the k th disease locus. The relationship between p and genotype codes at the disease loci is given by the logistic models: for model I, II, III, and IV, respectively. The values of coefficients (except for β 0 ) are given in Table 2. x k is coded by an additive coding scheme i.e. x k = 0, 1, or 2 corresponding to genotype a k a k , A k a k , or A k A k at the kth disease locus, where A k is the high risk allele. The value of β 0 can be determined by the values of the other parameters and population prevalence. We use three different values of population prevalence: 5%, 10%, and 20%. In our simulation studies, we assume that the frequency of the minor allele (high risk allele) at each disease locus is between 0.1 and 0.33. The four models represent different interaction structures. Model II and IV represent additive effects, while Model I and III represent additive as well as interaction effects.
We generate genotypes at 100,000 markers for each individual. To generate multi-marker genotypes of parents and the affected child in each nuclear family, we use a reject/accept procedure. Using the case of genotypes with LD as an example, for each parent, we randomly choose two haplotypes with replacement from the haplotype population generated by Hudson's ms program to form the multi-marker genotype of each parent. Then, each parent randomly transmits one of the two multi-marker haplotypes to the child to form the child's multi-marker genotype. Let G denote the multi-marker genotype of the child at the disease loci and let f G = Pr(affected|G) denote the penetrance that can be calculated from a given disease model. Then, the child is affected with a probability of f G . If the child is affected, we will retain the family. Otherwise, we will discard the family.
In our simulations, we vary the value of M 1 , the number of markers retained in the first step, from 1,000 to 100,000. The results of M 1 = 100,000 are the results of the one stage approaches. The power comparisons of the MPDT and TDT for one-stage and two-stage approaches with different prevalence, different LD patterns, and different models are given in Figure 1, Figure 2 and   (2), β 123 = log (5) Model II (2) Model III the more realistic situation where there is LD between markers, which approach is more powerful depends on the value of the prevalence. When the prevalence is no less than 10%, the two-stage approach may be more powerful than the one-stage approach. Otherwise, the one-stage approach is more powerful. Our conclusion of the power comparisons between the one-stage and two-stage approaches is different from that of Steen et al.'s which says that when M 1 is smaller than 10, for all the cases they considered, the two-stage approach is much more powerful than the one-stage approach for a quantitative trait. The difference between our results for a qualitative trait and Steen et al.'s for a quantitative trait is not surprising. If the prevalence is small, 1% for example, the affected parents are probably only little more than 1% among all the sampled parents. Suppose we have sampled 1000 par-ents. The test T p used in the first stage is based on a casecontrol study with ~10 cases and ~990 controls. Since there are too few cases, the test T p should have limited power. When the value of the prevalence increases, the power of the test T p will increase, and then the two-stage approach may have some advantages over the one-stage approach.
We have also compared the Proportion of Non-Disease Loci (PNDL) contained in the final marker-set of the MPDT and TDT. Let fi (ti) denote the number of non-disease loci (total loci) contained in the final marker-set, and pi as the overall p-value of the test, to test the association between the final marker-set and the trait, respectively, for the ith replicated sample. Then, the estimated PNDL of the MPDT is given by When there is no LD between markers, the PNDLs of both the TDT and MPDT are very close to the nominal level 0.05 for different values of M 1 , prevalence and difference disease models. However, when there is LD between markers, the PNDL of the MPDT is much larger than that of the TDT. In this case, the MPDT has an average PNDL of 0.6 and the TDT has an average PNDL of 0.3. We need to note that when there is no LD between markers, the PNDL is the same as the FDR, but when there is LD between markers, the PNDL is different from the FDR. The FDR is defined as the proportion of falsely rejected loci (no LD with disease loci) in the final marker-set, but the PNDL is the proportion of non-disease loci in the final marker-set. Noting that the null hypothesis is no association with the disease (no LD with disease loci), non-disease loci in the final marker-set may have LD with the disease loci, and thus may not be the falsely rejected loci. Based on the above discussion, when there is LD between markers, although the PNDL of the MPDT is much larger Power comparisons using the population prevalence of 0.1 Figure 2 Power comparisons using the population prevalence of 0.1. TDT and MPDT denote the two tests based on the first set of simulation studies, generating genotypes with LD between markers; TDT_indep and MPDT_indep denote the two tests based on the second set of simulation studies, generating genotypes by assuming the Hardy-Weinberg equilibrium and linkage equilibrium.

M1 (in thousands)
Power than that of the TDT, we can not say that the FDR of the MPDT is larger than that of the TDT.
When there is LD between markers, we further define the FDR by changing f i and F i in equation (2) and (3) to be the number of markers that have no LD with the disease loci (the absolute value of LD measure D' less than 0.05) in the final marker-set. Using this definition, the FDRs of the MPDT and TDT are both near 0.05.

Discussion
In this article, we proposed a multi-marker test, the MPDT, to analyze family-based data from genome-wide or region-wide association studies. The MPDT can combine information from multiple unrelated markers, while haplotype-based methods [5][6][7][8][9][10][11][12] can only combine information from nearby markers. Thus, the MPDT is more appropriate for searching for a set of susceptibility loci across the whole genome. By using simulation studies, we are able to demonstrate that the proposed multi-marker method, MPDT, consistently outperforms the singlemarker method, TDT.
One remaining question in the MPDT is choosing the values of L, the maximum number of markers contained in the searched marker-sets. In our simulation studies, we use L = 15. If one has a prior knowledge of the number of disease loci, the value of L can be set to be that number. Otherwise, we suggest choosing L between 10 and 30. A large value of L increases computational burden, while a small value of L may limit the power.
If parental phenotypes are available, we proposed a twostage approach. In the first stage a screening test based on parental phenotypes is used to select SNPs, and in the second stage the MPDT is performed to search for a set of sus-Power comparisons using the population prevalence of 0.2 Figure 3 Power comparisons using the population prevalence of 0.2. TDT and MPDT denote the two tests based on the first set of simulation studies, generating genotypes with LD between markers; TDT_indep and MPDT_indep denote the two tests based on the second set of simulation studies, generating genotypes by assuming the Hardy-Weinberg equilibrium and linkage equilibrium.
ceptibility loci from the selected SNPs. Comparing a onestage approach with the corresponding two-stage approach, the more powerful one depends on population prevalence and M 1 , the number of SNPs selected in the first stage. Our simulations show that only if the population prevalence is very high (≥10%), then the two-stage approach may outperform the one-stage approach.
Another question needed to be pointed out is choosing the value of M 1 . In our simulation studies, we varied M 1 from 1% to 100% of the total number of markers, M. The results show that 10% to 20% of the total numbers of markers are good choices for M 1 . In general, we need further investigations on choosing the optimal value of M 1 .
In this article, we proposed two searching algorithms: CS and SFS. Theoretically, if all the disease loci have detectable marginal effects, CS should be more powerful than SFS. If the disease loci can be divided into two groups: the first group consisting of loci with detectable marginal effects and the second consisting of loci with weak marginal effects, but having strong interaction effects with the loci in the first group, SFS should be the more powerful one. However, our simulation studies showed that the two methods have similar power in all the cases that we considered. Thus, we suggest using CS in practice because CS is much easier computationally than SFS.