Linkage analysis of longitudinal data and design consideration

Background: Statistical methods have been proposed recently to analyze longitudinal data in genetic studies. So far, little attention has been paid to examining the relationship among key factors in genetic longitudinal studies, including power, the number of families or sibships, and the number of repeated measures per individual subject.
Results: We proposed a variance component model that extends classic variance component models for a single quantitative trait to mapping longitudinal traits. Our model includes covariate effects and allows genetic effects to vary over time. Using our proposed model, we examined the power, pedigree structures, and sample size through simulation experiments.
Conclusion: Our simulation results provide useful insights into the study design for genetic, longitudinal studies. For example, collecting a small number of large sibships is much more powerful than collecting a large number of small sibships or increasing the number of repeated measures, when the total number of measurements is comparable.


Background
Longitudinal study design has been routinely used to investigate the etiology and epidemiology of complex diseases, and statistical methods for analyzing longitudinal data are well established [1]. However, there are limited applications of longitudinal data in genetic studies.
Province and Rao [2] used path analysis for assessing familial aggregation in the presence of temporal trends, although their analysis did not include genetic marker information. Longitudinal studies have also been used on a few occasions for twin and adoption studies (e.g., [3][4][5][6]). However, the main purpose of those studies was to assess the heritability of a trait rather than to map candidate loci.
Using an ad hoc approach, Levy and colleagues [7] conducted a linkage scan of the Framingham Heart Study. They regressed the phenotype against covariates as in a standard mixed effects model, and then treated the residuals corresponding to individual measurements as a quantitative trait in standard linkage analysis software such as SOLAR [8]. More recently, in the Genetic Analysis Workshop 13, some participants examined two-step models and some proposed joint models [9]. The first step in a two-step model is similar to that of Levy et al. [7]: an "ordinary" longitudinal model is fitted without consideration for genetic markers or family structures. Then, in the second step, linkage analysis is performed on one or more statistics derived from the first step. While such two-step methods are practical and simple, they are not ideal. For example, even if the covariate effects are additive to the genetic effects, potentially useful information can be lost in deriving the residuals or other summary statistics. Besides, choosing among different statistics (e.g., residuals and averages) for the second stage increases the number of tests to be performed, which raises the multiple comparison issue. Just as importantly, the lack of a well-defined statistical model directly associating the original phenotype with the inheritance of the markers makes it infeasible to conduct formal statistical inference. In fact, the authors in the Genetic Analysis Workshop 13 [9] clearly pointed out that a joint approach to simultaneously estimating genetic and longitudinal model parameters is appealing, because estimates of genetic and longitudinal parameters will be mutually adjusted for one another. Thus, in this report, we consider a joint model that is related to some of the models described in [9].
Our main objective is to use our model to examine the relationship among key factors in genetic longitudinal studies, including power, the number of families or sibships, and the number of repeated measures per individual subject.
There is a growing effort to develop mixed effects models that separate the genetic effect from environmental effects [10] and that incorporate temporal information [11]. However, those models do not have simple structures to accommodate genetic and temporal interactions, or to enable us to assess the longitudinal study design in linkage analysis. This raises a computational concern and may limit the analyses that can be performed, as pointed out in [9]. Hence, our idea is to use a realistic yet simple variance component model that can be used to analyze general pedigree data such as the Framingham Heart Study and that allows us to consider age-specific genetic effects and related study design issues. We choose a variance component model because this type of model is well established for linkage analysis of quantitative traits (e.g., [8,12,13]).

Results
In this section, we report simulation results that assess the Type I error rate based on the asymptotic theory and the power of our method in detecting linkage. We are particularly interested in the effectiveness of repeated measures in improving the power. For example, how do we determine the most cost-effective number of repeated measures? The computation was performed in the statistical software R using our own program, which is available upon request. We should note that our model and program have been used to analyze general pedigree data such as the Framingham Heart Study (to be reported in a future report), although our simulation below is focused on sibships to reduce the computational burden.
Nuclear families were simulated, and fully informative markers with four equally frequent alleles were generated. All parental alleles were distinguished. For the nuclear families, phenotypes were simulated only for the siblings. In all the simulations, each sib in every nuclear family has 5 measurements taken at different times. The measurement times were simulated simply as (1, 2, 3, 4, 5). A covariate was simulated from a uniform distribution between 0 and 1. For clarity, we used $f(X, t) = X\beta$ to generate the data, where $\beta = (\beta_0, \beta_1, \beta_2)' = (1, 1, 1)'$ and $\beta_0$, $\beta_1$, and $\beta_2$ are the parameters for the intercept, the time, and the simulated covariate in the mean structure. As in related studies [13], we did not consider dominant effects in the simulation studies and set both dominance variances (at the locus of interest and over the residual genome) to (0, 0).
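The marker setup described above can be illustrated with a short Python sketch. It is not the authors' R program: the function names (`transmit`, `ibd_count`, `simulate_ibd`) and the allele labels 0 to 3 are assumptions made for illustration. Because all four parental alleles are distinguishable, the number of alleles that two sibs share identical by descent can be read directly from their genotypes.

```python
import random

def transmit(rng):
    """One child's genotype: a paternal allele (0 or 1) and a maternal allele (2 or 3).

    Labelling the four parental alleles 0..3 is an illustrative assumption that
    mimics a fully informative marker with all parental alleles distinguished.
    """
    return (rng.choice((0, 1)), rng.choice((2, 3)))

def ibd_count(sib_a, sib_b):
    """Number of alleles shared IBD, comparing paternal and maternal alleles separately."""
    return int(sib_a[0] == sib_b[0]) + int(sib_a[1] == sib_b[1])

def simulate_ibd(n_pairs, seed=0):
    """Empirical proportions of sib pairs sharing 0, 1, and 2 alleles IBD."""
    rng = random.Random(seed)
    counts = [0, 0, 0]
    for _ in range(n_pairs):
        counts[ibd_count(transmit(rng), transmit(rng))] += 1
    return [c / n_pairs for c in counts]

if __name__ == "__main__":
    print(simulate_ibd(20000))
```

Under Mendelian transmission the three proportions should be close to 1/4, 1/2, and 1/4, which gives a quick sanity check on any simulated data set.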

Type I error rates
To evaluate the Type I error rates of the proposed tests, we considered two different null models. The first null model assumes that the genetic linkage effect due to the tested QTL and the polygenic effect are both zero, that is, both the major gene variance and the polygenic variance equal zero. A likelihood ratio test is used to test the null hypothesis that the genetic variance due to the tested QTL equals zero (no linkage).
We use two times the natural logarithm of the likelihood ratio as the test statistic. Its asymptotic distribution appears to be a mixture of $\chi^2$ distributions [16], but the degrees of freedom depend on $s(t)$. In practice, we do not know the form of $s(t)$. However, we can use backward selection as in regression analysis, beginning with the quadratic polynomial and testing whether the coefficients are zero. This strategy can serve as a guide in determining the final form of $s(t)$. Table 1 presents the empirical Type I error rates based on 5,000 simulated replications under two null models. The rejection rates in the table were obtained by computing the frequencies at which the null hypotheses were rejected at the critical values from the stated asymptotic distributions. Given that we used only 100 sib pairs, the empirical Type I error rates are numerically close to the nominal significance levels.
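To make the mixture reference distribution concrete, the following Python sketch computes a p-value for a likelihood ratio statistic under a user-supplied mixture of $\chi^2$ distributions, together with the Monte Carlo standard error of an empirical rejection rate. The function names and the restriction to 0, 1, or 2 degrees of freedom are assumptions of this sketch. The equal-weight mixture of $\chi^2_0$ and $\chi^2_1$ used in the example is the textbook case for a single variance component tested on the boundary; the weights appropriate for a given $s(t)$ are not taken from the text and must be supplied by the analyst.

```python
import math

def chi2_sf(x, df):
    """Upper-tail probability P(X >= x) for a chi-square variable with df in {0, 1, 2}."""
    if x <= 0:
        return 1.0
    if df == 0:
        return 0.0                      # point mass at zero
    if df == 1:
        return math.erfc(math.sqrt(x / 2.0))
    if df == 2:
        return math.exp(-x / 2.0)
    raise ValueError("this sketch only handles df = 0, 1, 2")

def mixture_pvalue(lrt, weights):
    """p-value under a chi-square mixture; weights maps df -> mixing probability."""
    if abs(sum(weights.values()) - 1.0) > 1e-12:
        raise ValueError("mixture weights must sum to 1")
    return sum(w * chi2_sf(lrt, df) for df, w in weights.items())

def monte_carlo_se(level, n_replications):
    """Standard error of an empirical rejection rate at a nominal level."""
    return math.sqrt(level * (1.0 - level) / n_replications)

if __name__ == "__main__":
    print(mixture_pvalue(2.705543, {0: 0.5, 1: 0.5}))
    print(monte_carlo_se(0.05, 5000))
```

With 5,000 replications, the Monte Carlo standard error at the 0.05 level is about 0.003, which gives a scale for judging how close an empirical rate is to its nominal level.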

Power comparisons
To compare the power increment from larger sibships, we considered the scenarios of collecting 200 sib pairs, 400 sib pairs, and 200 nuclear families with 4 siblings each, so that we can assess the corresponding effects of the number of nuclear families, the size of the nuclear families, and the number of repeated measures on power. We simulated data from three forms of $s(t)$. We also generated measurement errors from a multivariate normal distribution with variance $\sigma^2$ and within-subject autocorrelation $\exp(-\alpha|t - u|)$ between measurements at two time points $t$ and $u$. To evaluate the power, we conducted a number of experiments using various genetic models, each specified by four parameters: the major gene variance, the polygenic variance, $\sigma^2$, and $\alpha$. Note that these four parameters determine the extent of the overall genetic heritability as well as the heritability due to a specific locus under consideration.
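The measurement-error structure described above can be sketched in a few lines of Python. This is an illustration rather than the authors' code: the helper names and the hand-written Cholesky factorization are assumptions, and the values $\sigma^2 = 7$ and $\alpha = 0.5$ are used only as example inputs.

```python
import math
import random

def error_covariance(times, sigma2, alpha):
    """Covariance sigma2 * exp(-alpha * |t - u|) between measurement errors at times t and u."""
    return [[sigma2 * math.exp(-alpha * abs(t - u)) for u in times] for t in times]

def cholesky(matrix):
    """Lower-triangular factor L with L L' = matrix (matrix must be positive definite)."""
    n = len(matrix)
    lower = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            partial = sum(lower[i][k] * lower[j][k] for k in range(j))
            if i == j:
                lower[i][j] = math.sqrt(matrix[i][i] - partial)
            else:
                lower[i][j] = (matrix[i][j] - partial) / lower[j][j]
    return lower

def draw_errors(times, sigma2, alpha, rng):
    """One multivariate normal draw of the error vector for a single subject."""
    lower = cholesky(error_covariance(times, sigma2, alpha))
    z = [rng.gauss(0.0, 1.0) for _ in times]
    return [sum(lower[i][k] * z[k] for k in range(i + 1)) for i in range(len(times))]

if __name__ == "__main__":
    rng = random.Random(0)
    print(draw_errors([1, 2, 3, 4, 5], sigma2=7.0, alpha=0.5, rng=rng))
```

A larger $\alpha$ makes the correlation decay faster with the time gap, so repeated measures on the same subject carry more independent information about the error-free trait.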
When presenting our power assessment, we make use of a generalized heritability measure for longitudinal traits proposed by de Andrade et al. [11]. To incorporate the serial variance components, we express the polygenic and major gene heritabilities in our model following that measure. Table 2 displays the polygenic and major gene heritabilities used in our simulation models when different numbers of repeated measurements are used.
Regardless of the true form of $s(t)$, in our estimation we assumed $s(t)$ to be one of the following three forms: $s(t) = s_0$, $s(t) = s_0 + s_1 t$, and $s(t) = s_0 + s_1 t + s_2 t^2$, where $s_0$ is nonnegative and may need to be estimated together with $s_1$ and/or $s_2$, depending on the choice. One of the true $s(t)$'s is the logit function, because we want to know what happens in linkage detection when $s(t)$ is misspecified.
To understand the gain of power as a result of more repeated measures, we examined the power using all or some of the 5 measurements for each sib. We also compared the power from our models with the power of the traditional variance component (VC) method for a single measurement. The single measure can be a measurement at a particular time point or the average of the five measurements for each sib.
Table caption (fragment): The underlying parameters are $(\beta_0, \beta_1, \beta_2) = (1, 1, 1)$ and $(\sigma^2, \alpha) = (7, 0.5)$. The assumed $s(t)$ is labeled "c" for constant, "l" for $s_0 + s_1 t$, and "q" for $s_0 + s_1 t + s_2 t^2$.
Tables 3, 4, and 5 display the power in the experiments as specified above. To appreciate the incremental gain of power as the number of repeated measures increases, we compared the power estimates when we used all or some of the 5 repeated measurements. As expected, the power increases as the number of repeated measures and/or the number of families increases. However, the increment of power is not uniform and depends on the significance level. For example, ascertaining 200 sib pairs with four repeated measures tends to yield better power than collecting 400 sib pairs with two repeated measures when there is a gene-time interaction, and vice versa when there is no gene-time interaction. The information from these tables underscores the importance of conducting the power calculation under the specific designs and significance levels in order to choose the most cost-effective designs.
Tables 3 and 4 reveal a serious loss of power from ignoring a gene-time interaction. For example, in Table 3, when the underlying $s(t) = 1 + 0.1t$ and 5 repeated measures are used, the power estimates obtained by ignoring $s(t)$ were 0.77, 0.56, 0.26, and 0.09 at significance levels 0.05, 0.01, 0.001, and 0.0001, respectively. In contrast, the respective power estimates increased to 0.90, 0.78, 0.45, and 0.24 when we estimated $s(t)$ from $s_0 + s_1 t$. We should also note that the fold increase is more dramatic at a more stringent significance level. On the other hand, is there a loss of power if we consider $s(t)$ when there is no time-dependent genetic effect? Or, more broadly, what happens to the power if the time-dependent effect is misspecified? Tables 3, 4, and 5 address these questions. As expected, the power is at its peak when the underlying time trend is correctly specified. However, even with a misspecified trend, the test based on our model is more powerful than the one using a single measure, regardless of whether it was from a particular age or the average of the same number of repeated measures. We should note that, in our experiment, the use of the average of repeated measures yields more power than the use of a single measure at a given time point. In other words, without any consideration for the cost and effectiveness, we gain power from repeated measures even with a simple approach.
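The fold increase mentioned above can be checked with simple arithmetic on the power estimates quoted from Table 3. The short Python sketch below only divides the quoted numbers; the variable and function names are illustrative.

```python
# Power estimates quoted in the text for s(t) = 1 + 0.1t with 5 repeated measures.
levels = [0.05, 0.01, 0.001, 0.0001]
power_ignoring_s = [0.77, 0.56, 0.26, 0.09]   # s(t) ignored
power_linear_s = [0.90, 0.78, 0.45, 0.24]     # s(t) estimated from s0 + s1*t

def fold_increase(baseline, improved):
    """Ratio improved / baseline at each significance level, rounded to 2 decimals."""
    return [round(b / a, 2) for a, b in zip(baseline, improved)]

folds = fold_increase(power_ignoring_s, power_linear_s)

if __name__ == "__main__":
    for level, fold in zip(levels, folds):
        print(level, fold)
```

The ratios grow from roughly 1.2 at the 0.05 level to roughly 2.7 at the 0.0001 level, which is the pattern the text describes.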
Finally, Tables 3, 4, and 5 reveal the substantial gain in power from ascertaining large pedigrees. Table 5 displays the power of using 200 nuclear families with 4 siblings each. The power estimates using 400 sib pairs are available in Tables 3 and 4. Clearly, whenever feasible, collecting large sibships is more effective than collecting more sibships or more repeats.
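One way to see why sibship size matters is to count, for each design, the total number of measurements and the number of distinct sib pairs that contribute allele-sharing information. The Python sketch below is offered as intuition only and is not the authors' explanation of the power gain; the function name `design_summary` is hypothetical.

```python
def design_summary(n_families, sibship_size, n_repeats):
    """Total measurements and distinct sib pairs for a design of equal-sized sibships."""
    measurements = n_families * sibship_size * n_repeats
    sib_pairs = n_families * sibship_size * (sibship_size - 1) // 2
    return {"measurements": measurements, "sib_pairs": sib_pairs}

if __name__ == "__main__":
    # Three designs with the same total number of measurements (1600).
    print(design_summary(200, 2, 4))   # 200 sib pairs, 4 repeats
    print(design_summary(400, 2, 2))   # 400 sib pairs, 2 repeats
    print(design_summary(200, 4, 2))   # 200 four-sib families, 2 repeats
```

At a fixed total of 1,600 measurements, the four-sib design yields 1,200 sib pairs, compared with 200 or 400 for the sib-pair designs.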

Discussion
In this work, we proposed a variance component model to map candidate genes when the quantitative trait is measured repeatedly. A notable feature of our model is that it accommodates a potential gene-time interaction. In the existing literature, longitudinal information on the trait is sometimes reduced to a single trait and then the standard variance component model is applied [7]. Agreeing with other authors, we believe it is useful to have a unified model so that formal statistical inference can be performed. This benefit is evident from the simulation reported here.
We should note that the power is low with the sample sizes that we considered when the significance level is set at 0.0001. These sample sizes were chosen purely to reduce the computational time for our simulation. Since our purpose is to compare the power in various design settings, the absolute level of power is not critical. In practice, if 80% power is desired, for example, both the sample size and the number of simulation replications should be increased.
Despite the fact that the longitudinal study design is very popular in epidemiological and medical research, its use is still limited in linkage analysis [11]. Here, we only discuss a basic model to explore the potential of using longitudinal data and to investigate cost-effective designs. Our model is related to, but has a simpler structure than, that of de Andrade et al. [11]. We focus on the times at which the data are collected, and different study subjects may have data available at different time points. We also allow a potentially general temporal trend to interact with the genetic effect. In contrast, de Andrade et al. [11] proposed a model that assumed an individual genetic effect at every time point, which requires a uniform time schedule for all study subjects. This is a reasonable assumption for some studies, including the Framingham Heart Study, but it may be restrictive for other studies.

Clearly, many important research issues warrant further investigation. For example, we need to consider gene-gene interactions, gene-environment interactions, and more general forms of gene-time interaction and fixed effects.
Other classic issues, including sample selection, ascertainment bias, multiple genes, and imprinting, also require further investigation.

Conclusion
We conducted a number of simulation studies to explore the increment of power when the number of sibships is increased, when the number of repeated measures is increased, and when the size of families is increased. While we expect that these factors enhance the power, how they do so is rather intriguing. Our results can provide useful guidance for designing a genetic, longitudinal study to balance the cost, feasibility, and power. For example, collecting a small number of families with large sibships is more effective than collecting a comparatively large number of families with small sibships. Collecting fewer families with more repeated measures may or may not lead to more power than collecting more families with fewer repeated measures, depending on the underlying genetic models. In general, however, the relationship between the power and design is subtle, and depends on the significance level and, obviously, the size of the genetic effects. It is wise to conduct appropriate power simulations before a genetic, longitudinal study is carried out so that the cost, the feasibility, and the power can be balanced. Software can be requested from the authors for such simulations.
Although our simulations were based on nuclear families, our model can handle general pedigrees as we have used it to analyze data from the Framingham Heart Study for which the pedigree size was, on average, 5 and ranged from 2 to 29.

The model and methods
Let $y$ denote a quantitative trait. For convenience, we first consider one pedigree. Assuming independence between pedigrees, the likelihood for multiple pedigrees is simply the product of the likelihoods of the individual pedigrees.
Let $i$ refer to the $i$th member in a pedigree and $t_{ij}$ be the time when the quantitative trait is measured at the $j$th occasion, $j = 1, \ldots, T_i$ and $i = 1, \ldots, n$. Consider the model:
$$y_i(t_{ij}) = f(X_i, t_{ij}) + s(t_{ij})\,\gamma_{i1} + \gamma_{i2} + e_i(t_{ij}), \qquad (1)$$
where $f(X_i, t_{ij})$ is a function of the fixed effect $X_i$ and time $t_{ij}$, $s(t_{ij})$ is a simple parametric function to accommodate time-variant genetic effects, $\gamma_{i1}$ is the random effect for a major gene, $\gamma_{i2}$ is the random effect for unspecified polygenic effects over the genome, and $e_i(t_{ij})$ is the measurement error, $j = 1, \ldots, T_i$ and $i = 1, \ldots, n$. We assume that $\gamma_{i1}$, $\gamma_{i2}$, and $e_i$ are independent, although $e_i(t_{ij})$, $j = 1, \ldots, T_i$, has a within-subject correlation structure that needs to be specified on a case-by-case basis. It follows that
$$\mathrm{Cov}(y_i(t), y_l(u)) = s(t)\,s(u)\,\mathrm{Cov}(\gamma_{i1}, \gamma_{l1}) + \mathrm{Cov}(\gamma_{i2}, \gamma_{l2}) + \sigma(t, u)\,\delta(i = l),$$
where $\sigma(t, u)$ is the covariance function for $e(t)$ and $e(u)$ and $\delta(i = l)$ is the identity indicator. In addition, the covariances of $\gamma_{i1}$ and $\gamma_{i2}$ can be partitioned into additive and dominant variances as follows:
$$\mathrm{Cov}(\gamma_{i1}, \gamma_{l1}) = \left(\tfrac{1}{2}k_{1,il} + k_{2,il}\right)\sigma_{a1}^2 + k_{2,il}\,\sigma_{d1}^2$$
and
$$\mathrm{Cov}(\gamma_{i2}, \gamma_{l2}) = 2\phi_{il}\,\sigma_{a2}^2 + \tau_{il}\,\sigma_{d2}^2,$$
where $k_{1,il}$ and $k_{2,il}$ represent the $k$ coefficients of [14] for the probability of members $i$ and $l$ sharing 1 and 2 alleles, respectively, identical by descent (IBD) at the locus of interest; $\phi_{il}$ and $\tau_{il}$ are respectively the expected kinship coefficient and the expected probability of sharing 2 alleles IBD over the residual components of the genome; $\sigma_{a1}^2$ and $\sigma_{d1}^2$ are respectively the additive and dominant genetic variances at the locus of interest; and $\sigma_{a2}^2$ and $\sigma_{d2}^2$ are respectively the total additive and dominant genetic variances over the residual components of the genome.
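The covariance structure can be assembled numerically as in the following Python sketch. Several things here are assumptions rather than the authors' implementation: the time-varying function is taken to scale only the major gene effect, so that it enters the covariance as $s(t)s(u)$; the variance names `var_a1`, `var_d1`, `var_a2`, and `var_d2` are hypothetical labels for the additive and dominant variances at the locus and over the residual genome; relatives other than the same person are treated as full sibs with expected kinship 1/4 and expected probability 1/4 of sharing 2 alleles IBD; and $s(t) = 1 + 0.1t$ is used only as an example.

```python
def major_gene_cov(k1, k2, var_a1, var_d1):
    """Covariance of the major gene effects given IBD sharing probabilities k1 and k2."""
    return (0.5 * k1 + k2) * var_a1 + k2 * var_d1

def polygenic_cov(phi, tau, var_a2, var_d2):
    """Covariance of the polygenic effects given kinship phi and 2-allele sharing tau."""
    return 2.0 * phi * var_a2 + tau * var_d2

def linear_s(t, s0=1.0, s1=0.1):
    """Example time trend s(t) = s0 + s1 * t (an illustrative choice)."""
    return s0 + s1 * t

def trait_cov(t, u, same_person, k1, k2, var_a1, var_a2, sigma_tu,
              var_d1=0.0, var_d2=0.0, s=linear_s):
    """Covariance between trait values at times t and u.

    same_person=True : both values belong to one individual (k1=0, k2=1, phi=1/2, tau=1).
    same_person=False: the individuals are assumed to be full sibs (phi=1/4, tau=1/4),
                       with k1, k2 taken from the marker data.
    sigma_tu is the measurement-error covariance, used only within a person.
    """
    if same_person:
        k1, k2, phi, tau = 0.0, 1.0, 0.5, 1.0
    else:
        phi, tau = 0.25, 0.25
    genetic = (s(t) * s(u) * major_gene_cov(k1, k2, var_a1, var_d1)
               + polygenic_cov(phi, tau, var_a2, var_d2))
    return genetic + (sigma_tu if same_person else 0.0)

if __name__ == "__main__":
    # Two sibs sharing 2 alleles IBD, measured at times 1 and 2.
    print(trait_cov(1, 2, False, k1=0.0, k2=1.0, var_a1=1.0, var_a2=1.0, sigma_tu=0.0))
```

Filling a matrix with `trait_cov` over all (member, time) combinations of a pedigree gives the covariance matrix of the stacked phenotype vector.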

Parameter estimation and hypothesis testing
If we arrange the phenotype in model (1) as
$$y = \big(y_1(t_{11}), \ldots, y_1(t_{1T_1}), \ldots, y_i(t_{i1}), \ldots, y_i(t_{iT_i}), \ldots, y_n(t_{n1}), \ldots, y_n(t_{nT_n})\big)', \qquad (2)$$
then its covariance matrix is
$$\Sigma = \big(\mathrm{Cov}(y_i(t_{ij}), y_l(t_{lk}))\big), \qquad (3)$$
whose entries, indexed by $(i, j)$ and $(l, k)$, follow from the covariance structure given above.
In this work, we assume that $\gamma_{i1}$, $\gamma_{i2}$, and $e_i$ have normal distributions with mean 0. If normality is not assumed, a generalized estimating equation approach can be adopted. However, we will not explore this approach here. For clarity, we consider a specific version of model (1). Namely, let $f(X_i, t_{ij}) = \beta_0 + t_{ij}\beta_1 + X_i(t_{ij})\beta_2$, where $\beta_2$ is a $p$-vector of parameters. In addition, assume that $s(t)$ is a first-order polynomial function, $s(t) = s_0 + s_1 t$.
Let $\beta = (\beta_0, \beta_1, \beta_2)'$ (4) be the vector of fixed effect parameters, and let $\theta$ (5) be the vector of the covariance parameters. We estimate these parameters through the restricted maximum likelihood (REML) approach introduced by Patterson and Thompson [15], which takes into account the loss in degrees of freedom resulting from estimating the fixed effects and avoids the bias in the estimation of the covariance parameters.
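Note that $y$ has a multivariate normal distribution with mean $A\beta$ and covariance $\Sigma$, where $A$ is the design matrix whose row for the $j$th measurement of member $i$ is $(1, t_{ij}, X_i(t_{ij}))$. Now, let us consider $M$ independent pedigrees, so that the restricted log likelihood $l(\theta)$ combines the contributions of the $M$ pedigrees. Based on the theory of matrix derivatives, we have $\partial \log|\Sigma| / \partial\theta_k = \mathrm{tr}(\Sigma^{-1}\,\partial\Sigma/\partial\theta_k)$ and $\partial\Sigma^{-1}/\partial\theta_k = -\Sigma^{-1}(\partial\Sigma/\partial\theta_k)\Sigma^{-1}$. Therefore, the first-order and second-order partial derivatives of the log likelihood $l(\theta)$ with respect to $\theta$ can be written in closed form in terms of $\Sigma^{-1}$ and the derivatives of $\Sigma$, giving the score vector and the Hessian matrix, respectively.
Linkage is tested by a likelihood ratio test that compares the likelihood under the alternative hypothesis, in which the genetic variance component due to the tested QTL is estimated, with that under the null hypothesis that the genetic variance due to the tested QTL equals zero (no linkage). Twice the natural logarithm of the likelihood ratio of these two models may have a complex asymptotic distribution, namely a mixture of $\chi^2$ distributions [16], and the particular mixture depends on how $s(t)$ is defined.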
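As a sketch of the estimation step, and assuming the usual REML log likelihood for a Gaussian model, the restricted log likelihood and its derivatives can be written as below. The stacked notation is an assumption made for compactness: $y$ and $A$ stack the $M$ pedigrees and $\Sigma$ is block diagonal with one block per pedigree. The symbols $P$, $\Sigma_k = \partial\Sigma/\partial\theta_k$, and $\Sigma_{km} = \partial^2\Sigma/\partial\theta_k\,\partial\theta_m$ are introduced here for illustration and may differ from the authors' displayed formulas.

```latex
% Stacked notation (an assumption): y and A stack the M pedigrees;
% \Sigma is block diagonal with one block per pedigree.
\begin{align*}
P &= \Sigma^{-1} - \Sigma^{-1}A\,(A'\Sigma^{-1}A)^{-1}A'\Sigma^{-1},\\[4pt]
l(\theta) &= -\tfrac{1}{2}\Big\{\log|\Sigma| + \log|A'\Sigma^{-1}A| + y'Py\Big\} + \text{const},\\[4pt]
\frac{\partial l}{\partial\theta_k}
  &= -\tfrac{1}{2}\Big\{\mathrm{tr}(P\,\Sigma_k) - y'P\,\Sigma_k\,P\,y\Big\},\\[4pt]
\frac{\partial^2 l}{\partial\theta_k\,\partial\theta_m}
  &= -\tfrac{1}{2}\Big\{\mathrm{tr}(P\,\Sigma_{km}) - \mathrm{tr}(P\,\Sigma_k\,P\,\Sigma_m)\Big\}
     + \tfrac{1}{2}\,y'P\big(\Sigma_{km} - \Sigma_k P\,\Sigma_m - \Sigma_m P\,\Sigma_k\big)P\,y .
\end{align*}
```

These expressions follow from the two matrix-derivative identities together with $\partial P/\partial\theta_k = -P\,\Sigma_k\,P$. Because $s(t)$ and the autocorrelation parameter $\alpha$ enter $\Sigma$ nonlinearly, the second-derivative terms $\Sigma_{km}$ do not vanish in general.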