- Open Access
Stability of exploratory multivariate data modeling in longitudinal data
BMC Geneticsvolume 4, Article number: S38 (2003)
Exploratory data-driven multivariate analysis provides a means of investigating underlying structure in complex data. To explore the stability of multivariate data modeling, we have applied a common method of multivariate modeling (factor analysis) to the Genetic Analysis Workshop 13 (GAW13) Framingham Heart Study data. Given the longitudinal nature of the data, multivariate models were generated independently for a number of different time points (corresponding to cross-sectional clinic visits for the two cohorts), and compared. In addition, each multivariate model was used to generate factor scores, which were then used as a quantitative trait in variance component-based linkage analysis to investigate the stability of linkage signals over time. We found surprisingly good correlation between factor models (i.e., predicted factor structures), maximum LOD scores, and locations of maximum LOD scores (0.81< ρ <0.94 for factor scores; ρ >0.99 for peak locations; and 0.67< ρ <0.93 for peak LOD scores). Furthermore, the regions implicated by linkage analysis with these factor scores have also been observed in other studies, further validating our exploratory modeling.
When examining large amounts of data with many correlated variables, a common approach is to employ dimensionality-reducing techniques, such as clustering methods, principle component analysis, or common factor analysis. Each of these methods attempts to reduce the complexity in large systems by creating combinations of variables that reflect underlying, unobservable structures inherent in the data. Applying these methods to understand the structure of complex data falls into the realm of exploratory data analysis , and is commonly followed by a confirmatory phase, in which the reproducibility of the results is investigated.
Multivariable modeling of this type has been used to identify genetic latent factors in studies of twins and family data . These genetic latent factors are thought to represent the effects of genes, which exhibit pleiotropic effects on the measured variables with which the factors are correlated. Several studies [2–4] have examined the effect that pleiotropy has on analysis methods and study design. These studies conclude that recognition of pleiotropic effects on multiple measured traits and application of appropriate multivariate analysis methods results in increased power to detect a genetic effect compared with univariate (simple) analysis of the individual measured traits.
In line with this approach, we have applied the method of factor analysis to investigate the underlying structure of data in the Framingham Heart Study (FHS) data provided as part of the Genetic Analysis Workshop 13 (GAW13) conference. The FHS represents one of the largest (and longest) ongoing projects in the history of NIH-funded research, with a goal of identifying risk factors that contribute to the development of cardiovascular disease. This objective was pursued by collecting a population-based sample and following them forward in time (an example of repeated-measures prospective study design). Since these data were collected longitudinally over many years, we chose to investigate the stability of factors by comparing factor models and resulting multipoint linkage signals generated from different time points.
Linear and logistic regressions were used to remove the effects of age and gender from all selected traits. Regressions were performed for each time-point independently. Residuals from each regression were used as inputs to factor analysis, thus generating four sets of factor structures (one for each time-point). The composition of the factors for each time-point are given in Table 1. Predicted factor scores were then generated using these models for all phenotyped individuals. Correlations between the factor scores are presented in Table 2. These factor scores, along with the pedigree information and marker genotypes from the FHS genome-wide scan were used in a variance component linkage analysis. A summary of the significant multipoint LOD scores, as well as their chromosomal locations, is given in Table 3.
Exploratory factor analysis allows the researcher to investigate the structure of complex data by looking for commonalities between variables. These commonalities become expressed as the function of an unobserved latent variable that manifests on the system in question through the measured variables with which it is correlated. This concept works well with classical genetics, which explains the latent variables as underlying genetic factors exhibiting pleiotropic effects on the system of measured traits.
In testing genetic factor models, a key issue is identifiability. Appropriately applied rotations can modify the loadings and the resulting factor scores, so that factor solutions are not unique. Thus, the factor loadings to be used in genetic modeling must correspond to what is known about the biology of the system under investigation. In our study, the first and third factors can be identified as components of lipid homestasis. The factor loadings for the first factor (see Table 1) have the highest degree of identifiability, and seem to describe a latent variable which is highly correlated with high-density lipoprotein (HDL) levels, and correspondingly inversely correlated with triglycerides, smoking, and body mass index (BMI). These loadings are in line with current understanding of the biology of high density lipoproteins. Thus, in some way, this factor can be thought of as having a significant effect on HDL levels (perhaps coding for a component of HDLs). Looking across timepoints, it appears that this definition remains fairly stable, differing only in later timepoints with the addition of blood pressure and fasting glucose as predictive factors, and the removal of smoking (perhaps as older individuals cease, or at the very least reduce, the smoking habit of their earlier years). It is interesting to note that, although apparently similar in factor loadings, the factors from the first and subsequent time points differ significantly, as exemplified by the correlation between factor scores (the correlation between time1 and other times' scores for this factor is low – on the order of 0.1–0.2, whereas the correlation between other times for this factor are fairly high – on the order of 0.8–0.9). This also explains the difference observed later in peak LOD scores and positions in time1. The exact source of this difference is unknown, but may be due to numerical instability in the factor solution for the first time point (the loading of 1.0 on HDL levels, and the concomitant communality of 1.0 for HDL as a variable, indicates the presence of a Heywood instability, or a problem with the factor solution).
Variance component-linkage analysis using the predicted factor scores highlights several regions of the genome. Most prominent in these analyses are the regions on chromosome 6 (at roughly 127 cM) and chromosome 7 (at 160 cM), which correspond with previous studies of HDL variability [5, 6]. The stability of the location estimates for these peaks (and their LOD scores) are impressive, and lend considerable confirmatory evidence to support our multivariate modeling.
Factor 3 may also be related to some component of the cholesterol homeostasis mechanism, as it seems to be correlated with triglyceride levels, fasting glucose levels, HDL levels, BMI, and systolic blood pressure. As with Factor 1, correlations between the different time point factor models are high. This component appears to be distinct from that described by Factor 1 based on the location of linkage signals produced by the Factor 3 structures.
For the other two factors, identifiability is an issue. Factor 2 resembles to some degree Factor 1, in that it is composed of contributions from triglyceride levels, total cholesterol levels, and blood pressure measures. However, this factor does not provide any unique (or significant) linkage results on its own, in spite of good correlations between factor scores derived from each time point. These findings suggest a limit to the correlation between factor scores that will generate significant, reproducible findings. Based on the correlations in Table 2, correlations of 70% or more appear to be required to get reproducible linkage findings from factor models. In this way, the table of factor score correlations could possibly be useful in predicting the strength of further analysis.
Factor 4 also defies our attempts at identification, as no clear pattern is observable in the factor loadings. These factor models may actually demonstrate the effect of extraneous variables in a factor model, as extraneous variables should appear in factor structures independently of the other variables. Because this happens in at least two of the time points for this factor, it is possible that these structures simply represent the residual variation left in the system after the effects of other latent variables have been taken into account. Additionally, no clear, reproducible linkage signals appear.
We have demonstrated, by the application of multivariate modeling to a longitudinal data set, good stability of the predicted factor models, and surprisingly good stability of the locations and LOD scores from linkage analyses using the predicted factor structure when that structure corresponds with previous knowledge about the system being studied. Thus, the use of factor models for exploratory data analysis appears to be a suitable tool for the dissection of complex traits, and can be used to identify genetic factors in longitudinal studies.
Framingham Heart Study
The FHS has been described in detail elsewhere . For the purposes of the GAW13 workshop, phenotype data were provided for the first 40 years (from 1948 to 1988, or 21 exams) of follow up in Cohort 1 and the first 20 years (from 1971 to 1991, or 5 exams) of follow up in Cohort 2. The individuals available for study come from 330 of the largest pedigrees in the FHS sample. These pedigrees comprise 4692 subjects, of which 2885 were available for study. Additionally, we have genotype data for Marshfield Mapping Set 8A on 1702 of these individuals. For the purposes of this study, use of this data has been approved by the institutional review board of the University of Pittsburgh (IRB approval number 020481).
Matching of timepoints for Cohort 1 and 2
Since the FHS data were collected as two separate cohorts, and since the clinic visits were not scheduled at the same time, we had to find a way to match timepoints (clinic visits) between the two cohorts. To get cross-sectional samples from these two cohorts we averaged the data from the years 1970 and 1972 in Cohort 1 (or used the value present if data was available for only one of these years) to match the data from the year 1971 in Cohort 2 for all the variables used in this study (this is referred to hereafter as time1 data). We used the same method for data from years 1978 and 1980 (time2), 1982 and 1984 (time3), and 1986 and 1988 (time4) in Cohort 1 to match the data from the years 1979, 1983, and 1987 in Cohort 2, respectively. Thus we created four artificial time points for our studies.
Selection of traits and removal of environmental factors
We selected the following traits to use as main effects in our modeling: fasting triglycerides, fasting glucose, fasting HDL cholesterol, total cholesterol, BMI (calculated from height and weight data), systolic blood pressure, high blood pressure, number of grams of alcohol per day, and number of cigarettes smoked per day. For regression modeling, we first selected one phenotyped individual at random from each family in Cohort 2 (the amount of missing data in all available timepoints in Cohort 1 prohibited the use of these individuals for regression modeling). Age and gender adjustments were made using standard multiple linear regression techniques. Logistic regression was used to model high-blood pressure as a trait (thus, age- and gender-adjusted risk of high-blood pressure was the trait used in all further multivariate modeling). Logarithmic transformations were used when necessary to correct for non-normal trait distributions. Transformation was necessary for triglyceride levels, fasting glucose, and HDL cholesterol levels. Residuals from the regression analyses were used as inputs for factor analysis.
We used a minimum-volume ellipsoid covariance estimate to perform the factor analysis. Factor analyses were performed using the R statistical package . To be able to perform a test of the hypothesis that the specified number of factors is adequate to explain the model, we used the maximum likelihood factor estimate. This provides a likelihood-based test which compares the probability that the number of factors specified is sufficient with the probability that more factors are required. In this way, we were able to compare the validity of one-factor, two-factor, three-factor, and four-factor models for the system of traits. Using the residuals from the regression equations formed for each observed variable to adjust for age and gender, we arrived at a common factor solution for each time point corresponding to Exams 1 to 4 for Cohort 2. Four factors were found by the analysis in each case. The factor models predicted by each of these factor analyses were then applied to all the data from Cohort 1 and Cohort 2 to obtain estimated factor scores for all phenotyped individuals. Factor scores were generated using a regression-based approximation method.
Variance component analysis
Using factor scores as trait values, we ran multipoint variance-component analysis using the SOLAR package . The standard polygenic model was used for analysis, which hypothesizes a single major gene, a background (uncorrelated) polygenic effect, and an uncorrelated environmental effect. Fine mapping was performed on chromosomes which displayed a LOD score signal in excess of 1.5.
Correlations between factor scores, peak multipoint LOD scores, and peak locations were all calculated using the R statistical package .
Tukey J: Exploratory Data Analysis. Reading, MA, Addison-Wesley. 1977
Martin N, Boomsma D, Machin G: A twin-pronged attack on complex traits. Nat Genet. 1997, 17: 387-392. 10.1038/ng1297-387.
Allison DB, Thiel B, St Jean P, Elston RC, Infante MC, Schork NJ: Multiple phenotype modeling in gene-mapping studies of quantitative traits: power advantages. Am J Hum Genet. 1998, 63: 1190-1201. 10.1086/302038.
Dolan CV, Boomsma DI, Neale MC: A note on the power provided by sibships of sizes 2, 3, and 4 in genetic covariance modeling of a codominant QTL. Behav Genet. 1999, 29: 163-170. 10.1023/A:1021687817609.
Arya R, Blangero J, Williams K, Almasy L, Dyer TD, Leach RJ, O'Connell P, Stern MP, Duggirala R: Factors of insulin resistance syndrome-related phenotypes are linked to genetic locations on chromosomes 6 and 7 in nondiabetic Mexican-Americans. Diabetes. 2002, 51: 841-847. 10.2337/diabetes.51.3.841.
Shearman AM, Ordovas JM, Cupples LA, Schaefer EJ, Harmon MD, Shao Y, Keen JD, DeStefano AL, Joost O, Wilson PW, Housman DE, Myers RH: Evidence for a gene influencing the TG/HDL-C ratio on chromosome 7q32.3-qter: a genome-wide scan in the Framingham Study. Hum Mol Genet. 2000, 9: 1315-1320. 10.1093/hmg/9.9.1315.
Dawber TR, Kannel WB: The Framingham Study. An epidemiological approach to coronary heart disease. Circulation. 1966, 34: 553-555.
Ihaka R, Gentleman R: A language for data analysis and graphics. J Comput Graph Stat. 1996, 299-314. 10.2307/1390807.
Almasy L, Blangero J: Multipoint quantitative-trait linkage analysis in general pedigrees. Am J Hum Genet. 1998, 62: 1198-1211. 10.1086/301844.