A genome-wide scan using tree-based association analysis for candidate loci related to fasting plasma glucose levels
© Chen et al; licensee BioMed Central Ltd 2003
- Published: 31 December 2003
In the analysis of complex traits such as fasting plasma glucose levels, researchers often adjust the trait for important covariates before assessing genetic susceptibility, and may encounter confounding between the covariates and the susceptibility genes. Tree-based methods have previously been employed to accommodate the heterogeneity in complex traits. In this study, we performed a genome-wide screen on fasting glucose levels in the offspring generation of the Framingham Heart Study, using data provided by the Genetic Analysis Workshop 13. We defined one quantitative trait and converted it to a dichotomous trait based on a predetermined cut-off value, and performed association analyses using regression and classification trees for the two traits, respectively. A marker was interpreted as positive if at least one of its alleles exhibited association in both analyses. Our purpose was to identify candidate genes related to fasting glucose levels in the presence of other covariates. The covariates entered in the analysis included sex, body mass index, and lipids (total plasma cholesterol, high density lipoprotein cholesterol, and triglycerides) of the subjects and of their parents.
Four out of the seven positive regions, located on chromosomes 1, 2, 6, 11, 16, 18, and 19, harbored or were very close to previously reported diabetes-related genes or potential candidate genes.
This screening method, which employs tree-based association analysis, showed promise for identifying candidate loci in the presence of covariates in genome scans for complex traits.
- Regression Tree
- Classification Tree
- Candidate Locus
- Fasting Glucose Level
- Total Plasma Cholesterol
Problem 1 of Genetic Analysis Workshop 13 (GAW13) provided the data from the Framingham Heart Study. We focused on the offspring cohort because of the rate of missing data in the parental cohort.
Because the history of medical intervention, including lifestyle adjustment and the use of anti-diabetic medications, was not available, we chose the highest fasting plasma glucose level across the course of follow-up as the target quantitative trait to indicate the potential risk for abnormal glucose disposal. As suggested by the American Diabetes Association, impaired fasting glucose (IFG, fasting plasma glucose between 110 and 125 mg/dl) is a risk factor for type 2 diabetes mellitus (T2DM) [1]. We further used the lower limit of IFG (≥110 mg/dl) as the cut-off to transform this quantitative trait into a dichotomous trait. In this way, subjects with one or more occurrences of elevated fasting plasma glucose were placed in the high-glucose group. We then performed association analyses using regression and classification trees for the two traits, respectively. A marker was considered positive if at least one of its alleles showed association in both analyses.
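The trait definition above can be sketched in a few lines. This is only an illustration under stated assumptions: the input layout (one list of exam values per subject, with `None` for a missing exam), the function name `define_traits`, and the choice to drop subjects with no measurement are hypothetical and not taken from the study.

```python
from typing import Dict, List, Optional, Tuple

IFG_LOWER_LIMIT = 110.0  # mg/dl, lower limit of IFG quoted in the text


def define_traits(
    glucose_by_subject: Dict[str, List[Optional[float]]],
    cutoff: float = IFG_LOWER_LIMIT,
) -> Dict[str, Tuple[float, int]]:
    """Return {subject: (highest_glucose, dichotomous_trait)}.

    Missing exams are given as None and skipped. Subjects without any
    measurement are dropped (an assumption made for this sketch).
    """
    traits = {}
    for subject, values in glucose_by_subject.items():
        observed = [v for v in values if v is not None]
        if not observed:
            continue
        highest = max(observed)  # quantitative trait
        traits[subject] = (highest, int(highest >= cutoff))  # 1 if >= cut-off
    return traits


# Made-up values for three hypothetical subjects.
example = {
    "s1": [92.0, 101.0, None],
    "s2": [105.0, 110.0, 98.0],
    "s3": [None, None],
}
traits = define_traits(example)
```

Subject `s2` is coded 1 because a single exam reaches the cut-off, which mirrors the "one or more occurrences" rule described in the text.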
Our purpose was to identify candidate genes related to the fasting glucose levels in the presence of covariates. We found a few interesting markers that are closely linked with some potential candidate genes biologically relevant to glucose metabolism.
For the phenotype measurements, the corresponding covariates were created using their cross-sectional means. The covariates entered in the analysis included sex, body mass index, and lipids (total plasma cholesterol, high density lipoprotein cholesterol, and triglycerides) for each subject. To control for potential familial correlations, the cross-sectional means of the maternal and paternal phenotype measurements were also included as covariates.
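As a rough sketch of how such covariates might be assembled, the snippet below averages each phenotype over a person's available exams and then combines the subject's and the parents' means into one row. The data layout, the helper names, and the `father_`/`mother_` prefixes are assumptions made for illustration; sex, which is not averaged, is left out.

```python
from statistics import mean
from typing import Dict, List, Optional

Measurements = Dict[str, List[Optional[float]]]


def cross_sectional_mean(values: List[Optional[float]]) -> Optional[float]:
    """Mean over the exams that were actually observed (None = missing)."""
    observed = [v for v in values if v is not None]
    return mean(observed) if observed else None


def covariate_row(subject: Measurements, father: Measurements,
                  mother: Measurements, names: List[str]) -> Dict[str, Optional[float]]:
    """One row of phenotype covariates: the subject's means plus both parents' means."""
    row = {}
    for prefix, person in (("", subject), ("father_", father), ("mother_", mother)):
        for name in names:
            row[prefix + name] = cross_sectional_mean(person.get(name, []))
    return row


# Made-up values; a parent with no measurements yields None.
row = covariate_row(
    subject={"bmi": [24.0, 26.0, None]},
    father={"bmi": [28.0]},
    mother={},
    names=["bmi"],
)
```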
For the genotypic data, an allele was chosen to enter the analyses if its allele frequency was at least 10%. Alleles of the same marker with frequencies below 10% were pooled into a single "incognito" allele. The allelic covariates were created using the technique proposed by Zhang and Bonney [2].
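The allele-selection rule can be illustrated as follows. This is a sketch under assumptions: each allelic covariate is coded here as the number of copies (0, 1, or 2) a subject carries, which may differ in detail from the coding of Zhang and Bonney, and the function name and the `"pooled"` label are hypothetical.

```python
from collections import Counter
from typing import Dict, Tuple


def allelic_covariates(
    genotypes: Dict[str, Tuple[str, str]],
    min_freq: float = 0.10,
    pooled_label: str = "pooled",
) -> Dict[str, Dict[str, int]]:
    """Build per-subject allele-count covariates for a single marker.

    Alleles with frequency >= min_freq get their own covariate; all rarer
    alleles are merged into one pooled allele.
    """
    counts = Counter(a for pair in genotypes.values() for a in pair)
    total = sum(counts.values())
    common = {a for a, n in counts.items() if n / total >= min_freq}
    labels = sorted(common)
    if len(common) < len(counts):  # at least one rare allele exists
        labels.append(pooled_label)
    covariates = {}
    for subject, pair in genotypes.items():
        recoded = [a if a in common else pooled_label for a in pair]
        covariates[subject] = {lab: recoded.count(lab) for lab in labels}
    return covariates


# Ten hypothetical subjects: alleles C and D each have frequency 1/20 = 5%.
genotypes = {
    "s1": ("A", "A"), "s2": ("A", "B"), "s3": ("A", "B"), "s4": ("B", "B"),
    "s5": ("A", "A"), "s6": ("A", "B"), "s7": ("B", "B"), "s8": ("A", "C"),
    "s9": ("A", "D"), "s10": ("B", "B"),
}
covs = allelic_covariates(genotypes)
```

In this toy marker, alleles A and B (45% each) keep their own covariates, while C and D fall below the 10% threshold and are counted together.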
Association analysis using classification trees
The classification tree (CT) and regression tree (RT) methods are both built on the recursive partitioning technique; they can be used to partition a study population into homogeneous, disjoint subgroups. The optimal tree is created by growing and then pruning. The maximal tree is built by splitting each node into two child nodes until the terminal nodes are pure. In splitting, the best split into child nodes is the one at which the entropy impurity function reaches its minimum. In pruning, the tree is processed over the binary classes j in each subtree τ until the unconditional misclassification rate is attained, where c(j|i) is the cost of classifying a class j subject as class i and IP is the entropy impurity function. In general, the choice of cost depends on the severity of the misclassification. In this study, equal cost was chosen for both misclassifications, i.e., c(1|0) = c(0|1), because it frequently gives the most satisfactory analyses. The optimal tree in RT is obtained in the same way as in CT but with a different impurity function, i.e., the within-node variance in the tree τ. More details of CT, RT, and the corresponding splitting criteria are described elsewhere [3–5].
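For orientation, the quantities named in this paragraph are commonly written as below in the standard tree literature. The notation is an assumption: p_τ is the proportion of class 1 subjects in node τ, τ_L and τ_R are the child nodes, n_τ is the node size, and ȳ_τ the node mean. The exact expressions used in this study may differ.

```latex
% Entropy impurity of a node \tau for a binary trait
IP(\tau) = -p_\tau \log p_\tau - (1 - p_\tau)\log(1 - p_\tau)

% A split is chosen to minimize the weighted impurity of the child nodes
P(\tau_L)\, IP(\tau_L) + P(\tau_R)\, IP(\tau_R)

% Misclassification cost of a node, using c(j|i) as defined in the text
% (cost of classifying a class j subject as class i), with c(j|j) = 0
r(\tau) = \min_{i} \sum_{j} c(j \mid i)\, P(Y = j \mid \tau),
\qquad R(\tau) = P(\tau)\, r(\tau)

% Within-node variance used as the impurity in regression trees
IP_{RT}(\tau) = \frac{1}{n_\tau} \sum_{k \in \tau} \left( y_k - \bar{y}_\tau \right)^2
```

With equal costs c(1|0) = c(0|1), r(τ) is proportional to the smaller of the two class proportions in the node, so the node is simply assigned to its majority class.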
Tree-based association analysis was implemented by using genotype measurements such as allelic covariates and related phenotype measurements to construct binary trees. An allele shows association with the trait if its corresponding covariate is included in the optimal tree.
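To make the idea concrete, the sketch below performs a single step of recursive partitioning: it searches every covariate, allelic or phenotypic, for the binary split with the lowest weighted entropy impurity. It is a simplified illustration only; it does not grow or prune a full tree, it is not the QUEST or RT algorithm, and all names and data in it are hypothetical.

```python
import math
from typing import Dict, List, Optional, Tuple


def entropy(labels: List[int]) -> float:
    """Entropy impurity of a node holding 0/1 labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    p = sum(labels) / n
    if p in (0.0, 1.0):
        return 0.0  # pure node
    return -p * math.log(p) - (1 - p) * math.log(1 - p)


def best_split(rows: List[Dict[str, float]],
               labels: List[int]) -> Optional[Tuple[str, float, float]]:
    """Return (covariate, threshold, weighted impurity) of the best binary split."""
    n = len(labels)
    best = None
    for name in sorted(rows[0]):
        values = sorted({row[name] for row in rows})
        # Candidate thresholds lie midway between consecutive observed values.
        for lo, hi in zip(values, values[1:]):
            threshold = (lo + hi) / 2
            left = [y for row, y in zip(rows, labels) if row[name] <= threshold]
            right = [y for row, y in zip(rows, labels) if row[name] > threshold]
            score = (len(left) * entropy(left) + len(right) * entropy(right)) / n
            if best is None or score < best[2]:
                best = (name, threshold, score)
    return best


# Four hypothetical subjects: copies of allele A at marker m1, plus BMI.
rows = [
    {"m1_A": 0, "bmi": 22.0},
    {"m1_A": 0, "bmi": 31.0},
    {"m1_A": 1, "bmi": 24.0},
    {"m1_A": 2, "bmi": 29.0},
]
labels = [0, 0, 1, 1]
split = best_split(rows, labels)
```

Here the allelic covariate `m1_A` separates the two classes perfectly and is selected over BMI; in the full method, an allele counts as associated only if its covariate survives into the pruned, optimal tree.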
In this study, we conducted a genome-wide screen to identify candidate genes in the presence of a set of specified covariates. We performed RT- and CT-based association analyses on the quantitative and dichotomous traits, respectively. A marker was interpreted as positive if at least one of its alleles showed association in both analyses. The allelic covariates from the same chromosome were entered in the analyses simultaneously, so the genome-wide screen consisted of 22 such runs, one for each autosome. The computer programs QUEST [6] and RT [7] were used to construct the binary trees for the CT and RT analyses, respectively.
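The positivity rule can be expressed as a set intersection. The sketch assumes that each analysis yields the set of (marker, allele) pairs whose covariates appear in its optimal tree, and it reads the rule as requiring the same allele to appear in both trees. The marker and allele labels are placeholders, not results from the study.

```python
from typing import Iterable, List, Tuple

Allele = Tuple[str, str]  # (marker, allele)


def positive_markers(ct_alleles: Iterable[Allele],
                     rt_alleles: Iterable[Allele]) -> List[str]:
    """Markers with at least one allele selected in both the CT and the RT tree."""
    shared = set(ct_alleles) & set(rt_alleles)
    return sorted({marker for marker, _ in shared})


# Hypothetical output of the two analyses for one chromosome.
ct_tree = {("M1", "a3"), ("M2", "a1"), ("M4", "a2")}
rt_tree = {("M1", "a3"), ("M2", "a5"), ("M3", "a1")}
positives = positive_markers(ct_tree, rt_tree)
```

Marker `M2` is not positive under this reading because different alleles were selected in the two trees. Repeating the call for each of the 22 autosomes would give the genome-wide list.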
Web-searching for candidate genes
The map position was defined using the Ensembl Genome Server at the Sanger Institute http://www.ensembl.org/Homo_sapiens/. For the candidate gene search, we used Online Mendelian Inheritance in Man at the National Center for Biotechnology Information http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=OMIM or euGene http://iubio.bio.indiana.edu:8089/man/.
Positive markers found in the analyses using classification and regression trees
Chromosome 1: Amylase, salivary, pancreatic (1p21), facilitated glucose transporter (1p31-35), angiopoietin-like 3 (1p31)
Chromosome 2: Glucokinase regulatory protein (2p23), Alstrom syndrome (2p13), serum levels of leptin (2p21), hexokinase 2 (2p12), eukaryotic translation initiation factor 2-α kinase 3 (Wolcott-Rallison syndrome, 2p21)
Chromosome 6: IDDM1 (6q21), IDDM5 (6q24-27), IDDM8 (6q25-27), IDDM15, transient neonatal diabetes mellitus (6q24), pleomorphic adenoma gene-like 1 (6q24-26), phosphodiesterase 1 (6q22-23)
Chromosome 11: HRAS (11p14.1), INS (11p14.1), MODY1 (11p15.5), phosphodiesterase 3B (11p15), SUR1 (11p15.1), Kir6.2 (11p15.1)
Chromosome 19: Glycogen synthase 1 (muscle) (19q13.3), ApoCIII (19q13.2), Dystrophia Myotonica 1 (19q13.2-13.3), AKT2 (19q13.1-13.2), TGFβ1 (19q13.1), Glycogen synthase kinase 3A (19q13.1-13.2)
In this study, the intent of our screening method was to identify candidate markers rather than to pinpoint susceptibility alleles, although it can be applied to detect allelic or non-allelic heterogeneity. The cut-off value used in CT in this analysis was chosen for a biological reason. However, the analysis was sensitive to the choice of cut-off when the subjects were largely clustered around the cut-off point (110 mg/dl). Only three regions, on 1p, 16q, and 18p, were consistently positive at neighboring cut-offs from 100 to 120 mg/dl.
Although covariates such as BMI and HBP, which are associated with fasting glucose level, were included in our analyses, the cut-offs of these covariates in our final 22 optimal trees were not the same. Further studies are needed to examine the impact of different cut-offs and associated alleles.
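A sensitivity check of this kind could be organized as sketched below: first count how group membership shifts as the cut-off moves, then keep only the regions that are positive at every cut-off examined. The glucose values, region labels, cut-off grid, and function names are all made up for illustration and do not reproduce the study's results.

```python
from typing import Dict, Iterable, List


def case_counts(max_glucose: Iterable[float],
                cutoffs: Iterable[float]) -> Dict[float, int]:
    """Number of subjects at or above each candidate cut-off."""
    values = list(max_glucose)
    return {c: sum(1 for g in values if g >= c) for c in cutoffs}


def consistently_positive(positives_by_cutoff: Dict[float, Iterable[str]]) -> List[str]:
    """Regions that are positive at every cut-off examined."""
    sets = [set(regions) for regions in positives_by_cutoff.values()]
    return sorted(set.intersection(*sets)) if sets else []


# Hypothetical highest glucose values clustered near 110 mg/dl.
glucose = [98.0, 104.0, 109.0, 110.0, 111.0, 118.0, 126.0]
counts = case_counts(glucose, cutoffs=(100, 105, 110, 115, 120))

# Hypothetical positive regions from repeating the screen at three cut-offs.
stable = consistently_positive({
    100: {"R1", "R2"},
    110: {"R1", "R3"},
    120: {"R1", "R2", "R3"},
})
```

When many subjects sit close to the threshold, small shifts in the cut-off move a sizeable share of them between groups, which is one plausible reason the selected alleles can change.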
From a different point of view, our method used the RT analysis on the quantitative trait to validate the results from CT such that the positive markers showed association in both analyses. Notably, four out of the seven candidate regions harbored previously reported genes that are related to glucose metabolism or diabetes mellitus. In conclusion, our screen method shows promise for searching candidate loci in genome scans for complex traits.
This study was partially supported by the National Science Council in Taiwan (NSC 91-3112-B-001-006-M51) and National Taiwan University Hospital (NTUH-91A15).
1. The Expert Committee on the Diagnosis and Classification of Diabetes Mellitus: Report of the Expert Committee on the Diagnosis and Classification of Diabetes Mellitus. Diabetes Care. 1997, 20: 1183-1197.
2. Zhang Z, Bonney GE: Use of classification trees for association studies. Genet Epidemiol. 2000, 19: 323-332. 10.1002/1098-2272(200012)19:4<323::AID-GEPI4>3.0.CO;2-5.
3. Breiman L, Friedman JH, Olshen RA, Stone CJ: Classification and Regression Trees. New York, Chapman and Hall. 1989.
4. Chang CJ, Fann CSJ: Using data mining to address heterogeneity in the Southampton data. Genet Epidemiol. 2001, 21: S180-S185.
5. Fann CSJ, Shugart YY, Lachman H, Collins A, Chang CJ: The impact of redefining affection status for alcoholism on affected-sib-pair analysis. Genet Epidemiol. 1999, 17: S151-S156. 10.1002/(SICI)1098-2272(1999)17:2<151::AID-GEPI5>3.3.CO;2-B.
6. Department of Mathematics, National Chung Cheng University, Taiwan: Quest User Manual. Version 1.8.8. Taiwan. 2000.
7. Torgo L: RT 4.1 User's Manual. University of Porto, Porto, Portugal. 2001.
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.