Empirically derived phenotypic subgroups – qualitative and quantitative trait analyses
© Wilcox et al; licensee BioMed Central Ltd 2003
Published: 31 December 2003
The Framingham Heart Study has contributed a great deal to advances in medicine. Most of the phenotypes investigated have been univariate traits (quantitative or qualitative). The aims of this study are to derive multivariate traits by identifying homogeneous groups of people and assigning both qualitative and quantitative trait scores; to assess the heritability of the derived traits; and to conduct both qualitative and quantitative linkage analysis on one of the heritable traits.
Multiple correspondence analysis, a nonparametric analogue of principal components analysis, was used for data reduction. Two-stage clustering, using both k-means and agglomerative hierarchical clustering, was used to cluster individuals based upon the axis (factor) scores obtained from the data reduction. Probability of cluster membership was calculated using binary logistic regression. Heritability was calculated using SOLAR, which was also used for the quantitative trait analysis. GENEHUNTER-PLUS was used for the qualitative trait analysis.
We found four phenotypically distinct groups. Membership in the smallest group was heritable (38%, p < 1 × 10⁻⁶) and had characteristics consistent with atherogenic dyslipidemia. We found both qualitative and quantitative LOD scores above 3 on chromosomes 11 and 14 (11q13, 14q23, 14q31). There were two Kong & Cox LOD scores above 1.0 on chromosome 6 (6p21) and chromosome 11 (11q23).
This approach may be useful for the identification of genetic heterogeneity in complex phenotypes by clarifying the phenotype definition prior to linkage analysis. Some of our findings are in regions linked to elements of atherogenic dyslipidemia and related diagnoses; others may be novel or may be false positives.
Contemporary advances in medicine are due, at least in part, to the long history of research conducted in Framingham. To date, the majority of the outcome measures have been univariate qualitative or quantitative traits. The objectives of the present analyses were to derive multivariate qualitative and quantitative traits empirically, to examine the heritability of the traits, and to conduct genome-wide linkage analyses with a trait that demonstrated some heritability. The analyses were conducted in the families collected by the Framingham Heart Study made available to participants in the Genetic Analysis Workshop 13.
This study was conducted in the sample from the Framingham Heart Study distributed to participants in Genetic Analysis Workshop 13. For each variable, the most extreme category observed across all of an individual's measurements was used to create the multivariate phenotypes. For example, if, over the course of the available measurements, the maximum triglyceride level reached the fourth quartile, the summary measure was the fourth quartile. Continuous measures were categorized according to classes commonly used in clinical practice as follows: body-mass index (BMI) (underweight, normal weight, overweight, obese); tobacco use (none, less than one pack per day, one to two packs per day, two to three packs per day, more than three packs per day); alcohol use (abstinence, moderate use, heavy use); systolic blood pressure (SBP) [low (< 80), normal (80–129), elevated (130–139), high (≥ 140)]; cholesterol (normal, borderline, high); glucose (low, normal, impaired, hyperglycemic); atherogenic dyslipidemia [no criteria; either lowest HDL quartile or highest triglyceride decile; both low HDL and high triglycerides (atherogenic dyslipidemia)]. High density lipoproteins (HDL) and triglycerides were characterized in age- and gender-specific quartiles as observed in the Framingham Heart Study data. An individual was classified as having high blood pressure if they were being treated for hypertension, regardless of the clinical measurement.
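The categorization and "most extreme category" summary can be illustrated for systolic blood pressure with a minimal Python sketch. The function names are hypothetical and are not taken from the study. Two assumptions are made here: "most extreme" is read as the highest-ranked class, following the triglyceride example, and readings of 140 and above are placed in the high class.

```python
# Ordered SBP classes, lowest to highest (cut points taken from the text).
SBP_CLASSES = ["low", "normal", "elevated", "high"]


def sbp_class(sbp, treated=False):
    """Classify one systolic blood pressure reading (mm Hg)."""
    if treated:
        # Treatment for hypertension overrides the clinical measurement.
        return "high"
    if sbp < 80:
        return "low"
    if sbp < 130:
        return "normal"
    if sbp < 140:
        return "elevated"
    return "high"  # assumption: 140 and above counts as high


def most_extreme(classes, ordered=SBP_CLASSES):
    """Return the highest-ranked class seen across all examinations."""
    return max(classes, key=ordered.index)


def summarize_sbp(exams):
    """exams: list of (sbp, treated) tuples, one per examination."""
    return most_extreme([sbp_class(s, t) for s, t in exams])


# Made-up example: three examinations for one individual.
summary = summarize_sbp([(118, False), (134, False), (126, False)])
```

The same pattern would apply to the other ordered variables by swapping in their class lists and cut points.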
The strategy for the development of qualitative and quantitative traits included nonparametric data reduction, iterative two-staged clustering on the observed dimensions, and the assignment of probability of cluster membership in each cluster for each individual.
Principal components analysis (PCA) is a method commonly used for data reduction. PCA is based upon a Pearson product-moment correlation, which assumes a pair-wise Gaussian structure. The original continuous data were not pair-wise normal and did not meet the assumptions for this method. Multiple correspondence analysis (MCA) is a nonparametric data reduction method free of the assumptions underlying PCA. The only requirement for MCA is a non-negative rectangular data matrix. MCA uses a singular value decomposition (SVD) of the matrix. Eigenvalue (vector) decomposition is a special case of SVD. The objective of MCA is to identify a low-dimensional subspace that comes closest to all of the data points. It is analogous to graphing the results of a factor analysis in a multidimensional Euclidean space. However, the space identified in MCA is not Euclidean. The coordinates of each individual in the identified multi-dimensional space served as the basis for the identification of subgroups or clusters.
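To make the SVD step concrete, the following is a minimal NumPy sketch of correspondence analysis applied to an indicator (one-hot) matrix, which is one standard formulation of MCA. It is not the software used in the study; the function names and the toy data are assumptions made for illustration, and every category is assumed to occur at least once.

```python
import numpy as np


def indicator_matrix(data, n_levels):
    """One-hot code integer categories; data is (n_individuals, n_variables)."""
    blocks = [np.eye(k)[data[:, j]] for j, k in enumerate(n_levels)]
    return np.hstack(blocks)


def mca_row_coordinates(Z, n_axes):
    """Correspondence analysis of the indicator matrix Z via SVD."""
    P = Z / Z.sum()              # correspondence matrix
    r = P.sum(axis=1)            # row (individual) masses
    c = P.sum(axis=0)            # column (category) masses
    # Standardized residuals from the independence model r c^T.
    S = (P - np.outer(r, c)) / np.sqrt(np.outer(r, c))
    U, sing, Vt = np.linalg.svd(S, full_matrices=False)
    # Principal coordinates of individuals on the leading axes.
    F = (U[:, :n_axes] * sing[:n_axes]) / np.sqrt(r)[:, None]
    return F, sing ** 2          # coordinates, principal inertias


# Toy data (made up): 6 individuals, 2 variables with 3 and 2 categories.
toy = np.array([[0, 0], [0, 1], [1, 0], [1, 1], [2, 1], [2, 1]])
Z = indicator_matrix(toy, n_levels=[3, 2])
coords, inertias = mca_row_coordinates(Z, n_axes=2)
```

In this formulation the rows of `coords` play the role of the axis scores that are passed on to clustering; individuals with identical category profiles receive identical coordinates.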
It is possible to represent cluster membership as both qualitative and quantitative traits. The qualitative trait is membership in the cluster, which is binary. The quantitative trait represents the degree of affiliation with the cluster, distance from the cluster centroid, or probability of membership. To compare the utility of the measures and the consistency of the linkage results, both traits were constructed and linkage analyses were conducted on each.
Binary logistic regression was used to estimate the probability of cluster membership for each study participant in each of the clusters. The natural logarithm of the probability of membership in Group 4 (described below) was the dependent measure in the quantitative trait analyses. Categorical cluster membership was used in the qualitative trait analyses. Two-point variance components linkage analysis was conducted using SOLAR. Multipoint NPL (nonparametric linkage) analysis was performed using the S(pairs) option of GENEHUNTER-PLUS, and maximizing nonparametric LOD scores ("K&C LOD scores") were calculated under an exponential model with δ constrained between 0 and 2.
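The construction of the quantitative trait can be sketched as follows. The text does not state which predictors entered the logistic regression, so this example assumes the axis scores were used; the simulated data, the function names, and the plain gradient-descent fit are all illustrative assumptions rather than the authors' procedure.

```python
import numpy as np


def fit_logistic(X, y, n_iter=2000, lr=0.1):
    """Binary logistic regression fitted by plain gradient descent."""
    Xb = np.column_stack([np.ones(len(X)), X])   # add intercept column
    w = np.zeros(Xb.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-Xb @ w))        # current fitted probabilities
        w -= lr * Xb.T @ (p - y) / len(y)        # gradient of mean log-loss
    return w


def membership_probability(X, w):
    """Fitted probability of cluster membership for each individual."""
    Xb = np.column_stack([np.ones(len(X)), X])
    return 1.0 / (1.0 + np.exp(-Xb @ w))


rng = np.random.default_rng(0)
# Simulated axis scores: 40 members of the target cluster, 160 non-members.
scores = np.vstack([rng.normal(1.5, 1.0, (40, 2)),
                    rng.normal(-0.5, 1.0, (160, 2))])
member = np.r_[np.ones(40), np.zeros(160)]       # qualitative (binary) trait

w = fit_logistic(scores, member)
prob = membership_probability(scores, w)
log_prob = np.log(prob)                          # quantitative trait (natural log)
```

Here `member` corresponds to the qualitative trait and `log_prob` to the quantitative trait, so both can be derived from the same cluster solution.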
Correspondence analysis and clustering
Coordinates on eight axes (analogous to factor scores) were retained and used for clustering. Four clusters were identified. The cluster sizes were n = 1030 (35.70%), n = 670 (23.22%), n = 881 (30.54%), and n = 304 (10.54%).
An index measure of the prevalence of each of the independent variables within each of the clusters was calculated by dividing the observed category proportion in a cluster by its expectation, the marginal proportion. If the prevalence in a cluster did not differ from the sample, the index would be unity. Group 1 had indices higher than 1.25 for the first quartile triglyceride measure (2.02), high blood pressure (1.53), high cholesterol (1.44), hyperglycemia (1.43), fourth quartile HDL (1.39), and heavy alcohol use (1.27). Group 2 was characterized by high HDL (1.62) and lower rates of all other measures. This was a particularly healthy group. Group 3 was characterized by low HDL, obesity, and high triglycerides (1.65, 1.31, and 1.24, respectively). The last group (Group 4) contained all of the individuals in the sample who met the criteria for atherogenic dyslipidemia as defined by lowest quartile for HDL and highest decile for triglycerides. They had high indices for atherogenic dyslipidemia (8.39), top decile for triglycerides (5.24), lowest quartile HDL (3.82), obesity (1.51), and smoking (1.49).
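The index described above is the within-cluster proportion of a category divided by its proportion in the whole sample. A minimal sketch with made-up data (the function name and the example values are hypothetical):

```python
def prevalence_index(categories, clusters, category, cluster):
    """Proportion of `category` within `cluster` divided by its overall proportion."""
    in_cluster = [c for c, g in zip(categories, clusters) if g == cluster]
    observed = in_cluster.count(category) / len(in_cluster)
    expected = categories.count(category) / len(categories)
    return observed / expected


# Made-up example: weight status for 10 people assigned to two clusters.
status = ["obese", "obese", "obese", "normal", "normal",
          "normal", "normal", "normal", "obese", "normal"]
group = [4, 4, 4, 4, 1, 1, 1, 1, 1, 1]

idx_group4 = prevalence_index(status, group, "obese", 4)  # 0.75 / 0.40
idx_group1 = prevalence_index(status, group, "obese", 1)  # (1/6) / 0.40
```

An index above 1 means the category is over-represented in the cluster relative to the sample, and an index below 1 means it is under-represented.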
Figure 1 shows a simple correspondence analysis graph of the relationships between the groups and each of the categories used to identify them. The four-group structure is fully represented in the three-dimensional display. Group 4 is nearest the criteria for atherogenic dyslipidemia labeled "MS" on the graph. The measures of good health cluster around Group 2. Groups 1 and 3 have moderate to high levels of most of the independent variables.
The heritability of the probability of group membership was computed using SOLAR. The heritability of the quantitative traits was 20% (p < 1 × 10⁻⁶) for Group 1, 19% (p < 1 × 10⁻⁶) for Group 2, 39% (p < 1 × 10⁻⁶) for Group 3, and 38% (p < 1 × 10⁻⁶) for Group 4. Linkage analysis was conducted for the probability of membership in Group 4 and for a binary qualitative trait representing membership in Group 4.
Quantitative trait analysis
Quantitative trait two-point LOD scores (chromosome – location)
1 – 1p32
2 – 2p23
3 – 3pter
4 – 4q22
4 – 4q21
4 – 4q13
4 – 4q21
4 – 4q21
4 – 4q28
5 – 5q34
6 – 6p12
6 – 6p21
8 – 8q16
8 – 8q24
8 – 8q24
9 – 9q34
10 – 10q26
11 – 11q23
11 – 11q13
14 – 14q31
14 – 14q23
15 – 15q21
17 – 17p11
17 – 17q24
18 – 18q22
18 – 18q22
19 – 19p13
22 – 22p11
Qualitative trait analysis
Qualitative trait: multipoint NPL scores, p-values, and Kong & Cox LOD scores
Discussion and Conclusions
We found four empirically derived, phenotypically distinct subgroups. One group was very healthy, two groups had mild to moderately elevated lipid levels, and one group had lipid levels characteristic of atherogenic dyslipidemia. Grundy identified atherogenic dyslipidemia as a disorder characterized by elevated triglycerides, small LDL particles, and reduced HDL. The multivariate measure of this related trait had significant heritability (38%) and was chosen for examination in linkage analyses. It should be noted that this cluster was identified empirically, not clinically: it represents a constellation of factors associated with atherogenic dyslipidemia.
Three loci were common across our qualitative and quantitative analyses. One of the three LOD scores above 3 in the quantitative trait was observed on 11q23. The highest NPL score in the qualitative trait analysis was observed in the same region. Similarly, there were consistent findings on 6p21 and 18q22 in both the qualitative and quantitative analyses.
Several of our results are close to those reported by Aouizerat et al. in a genome scan for familial combined hyperlipidemia. Our results on chromosomes 2q and 11q are in the same regions as the two highest LOD scores reported in that study. Additionally, our highest scores on 10q, 15p, 18p, and 22p are close to regions reported in the same study. Comuzzie et al. (in two separate reports), Hager et al., and Rotimi et al. all previously reported human obesity quantitative trait loci in the same regions in which we found some evidence for linkage on chromosomes 2q and 17q.
Lindsay et al. reported linkage of diabetes in Pima Indians at the same region on chromosome 14 as two of our three LOD scores over 3. Their reported region of linkage on chromosome 6 maps to the same regions in which we report evidence for linkage to our quantitative trait. Arya et al. used a principal components approach to construct three quantitative traits representing insulin-resistance syndrome. They reported linkage on both chromosomes 6q and 7. Our findings on chromosome 6 were on 6p and are likely not related to those shown by Arya et al. We did not have a measurable signal on chromosome 7. This lack of replication is likely due to differences between the traits. Our empirically derived trait has more in common with atherogenic dyslipidemia than it does with insulin resistance syndrome.
Our results are also close to those reported by Soro et al. In a study investigating the genetic etiology of low HDL, they showed linkage on chromosomes 8q23 and 16q24. We found some evidence in the quantitative trait analysis at 8q24, and at 16q21 in the qualitative trait analysis.
For the qualitative trait analyses, one location on chromosome 14p was common across this analysis and another done by this group examining a trait for atherogenic dyslipidemia. Both analyses resulted in K&C LOD scores of approximately 0.6. It appears that the empirically derived qualitative trait is similar to atherogenic dyslipidemia, but not identical.
- Greenacre M: Theory and Applications of Correspondence Analysis. New York, Wiley. 1984
- LeBart L, Morineau A, Warwick K: Multivariate Descriptive Statistical Analysis: Correspondence Analysis and Related Techniques for Large Matrices. New York, Wiley. 1984
- CISIA: SPAD. CISIA, Paris. 2001
- SAS Institute Inc: Program Guide. Version 8.2. Cary, NC, SAS Institute Inc. 1989
- Insightful Corporation: S-Plus Version 6. Seattle, WA, Insightful Corporation. 2001
- Almasy L, Blangero J: Multipoint quantitative trait linkage analysis in general pedigrees. Am J Hum Genet. 1998, 62: 1198-1211. 10.1086/301844.
- Kong A, Cox NJ: Allele-sharing models: LOD scores and accurate linkage tests. Am J Hum Genet. 1997, 61: 1179-1188. 10.1086/301592.
- Grundy SM: Hypertriglyceridemia, atherogenic dyslipidemia, and the metabolic syndrome. Am J Cardiol. 1998, 81 (4A): 18B-25B. 10.1016/S0002-9149(98)00033-2.
- Aouizerat BE, Allayee H, Cantor RM, Davis RC, Lanning CD, Wen PZ, Dallinga-Thie GM, de Bruin TW, Rotter JI, Lusis AJ: A genome scan for familial combined hyperlipidemia reveals evidence of linkage with a locus on chromosome 11. Am J Hum Genet. 1999, 65: 397-412. 10.1086/302490.
- Comuzzie AG, Funahashi T, Sonnenberg G, Martin LJ, Jacob HJ, Black AE, Maas D, Takahashi M, Kihara S, Tanaka S, Matsuzawa Y, Blangero J, Cohen D, Kissebah A: The genetic basis of plasma variation in adiponectin, a global endophenotype for obesity and the metabolic syndrome. J Clin Endocrinol Metab. 2001, 86: 4321-4325. 10.1210/jc.86.9.4321.
- Comuzzie AG, Hixson JE, Almasy L, Mitchell BD, Mahaney MC, Dyer TD, Stern MP, MacCluer JW, Blangero J: A major quantitative trait locus determining serum leptin levels and fat mass is located on human chromosome 2. Nat Genet. 1997, 15: 273-276. 10.1038/ng0397-273.
- Hager J, Dina C, Francke S, Dubois S, Houari M, Vatin V, Vaillant E, Lorentz N, Basdevant A, Clement K, Guy-Grand B, Froguel P: A genome-wide scan for human obesity genes reveals a major susceptibility locus on chromosome 10. Nat Genet. 1998, 20: 304-308. 10.1038/3123.
- Rotimi CN, Comuzzie AG, Lowe WL, Luke A, Blangero J, Cooper RS: The quantitative trait locus on chromosome 2 for serum leptin levels is confirmed in African-Americans. Diabetes. 1999, 48: 643-644. 10.2337/diabetes.48.3.643.
- Lindsay RS, Kobes S, Knowler WC, Bennett PH, Hanson RL: Genome-wide linkage analysis assessing parent-of-origin effects in the inheritance of type 2 diabetes and BMI in Pima Indians. Diabetes. 2001, 50: 2850-2857. 10.2337/diabetes.50.12.2850.
- Arya R, Blangero J, Williams K, Almasy L, Dyer TD, Leach RJ, O'Connell P, Stern MP, Duggirala R: Factors of insulin resistance syndrome-related phenotypes are linked to genetic locations on chromosomes 6 and 7 in nondiabetic Mexican-Americans. Diabetes. 2002, 51: 841-847. 10.2337/diabetes.51.3.841.
- Soro A, Pajukanta P, Lilja HE, Ylitalo K, Hiekkalinna T, Perola M, Cantor RM, Viikari JS, Taskinen MR, Peltonen L: Genome scans provide evidence for low-HDL-C loci on chromosomes 8q23, 16q24.1-24.2, and 20q13.11 in Finnish families. Am J Hum Genet. 2002, 70: 1333-1340. 10.1086/339988.
- Yip A, Ma Q, Wilcox MA, Panhuysen CI, Farrell J, Farrer LA, Wyszynski D: Search for genetic factors predisposing to atherogenic dyslipidemia. BMC Genetics. 2003, 4 (suppl 1): S100. 10.1186/1471-2156-4-S1-S100.
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.