A genome-wide tree- and forest-based association analysis of comorbidity of alcoholism and smoking
BMC Genetics volume 6, Article number: S135 (2005)
Genetic mechanisms underlying alcoholism are complex. Understanding the etiology of alcohol dependence and its comorbid conditions such as smoking is important because of the significant health concerns. In this report, we describe a method based on classification trees and deterministic forests for association studies to perform a genome-wide joint association analysis of alcoholism and smoking. This approach is used to analyze the single-nucleotide polymorphism data from the Collaborative Study on the Genetics of Alcoholism in the Genetic Analysis Workshop 14. Our analysis reaffirmed the importance of sex difference in alcoholism. Our analysis also identified genes that were reported in other studies of alcoholism and identified new genes or single-nucleotide polymorphisms that can be useful candidates for future studies.
Alcoholism is a complex disease that is highly concordant within family clusters. It is a widespread problem; nearly 14 million Americans abuse alcohol or are alcoholic. It is a major cause of certain cancers, especially liver cancer, is a risk factor for brain damage, and is hazardous to developing fetuses. The genetic etiology of alcoholism is well documented but not well understood, though the results of controlled family and twin studies suggest that alcoholism is caused in part by genetic components.
Smoking is highly associated with alcohol dependence. Genetic factors contribute to a person's risk of both smoking and alcoholism. There is a high prevalence of smoking among active alcoholics. An analysis of data from a 1981 Australian twin panel cohort found a positive genetic correlation between habitual smoking and alcoholism. The effect remained significant even after controlling for personality variables. Thus, a joint analysis of alcohol dependence and smoking using genetic information should be informative.
Classification trees and forests are known for their ability to identify complex relationships, especially in large, complex datasets. The availability of the single-nucleotide polymorphism (SNP) data in the Collaborative Study on the Genetics of Alcoholism (COGA) makes these methods well suited for identifying SNPs associated with smoking and alcoholism. In fact, we identified multiple trees of similar quality in terms of prediction error, and those trees suggest multiple potential genetic pathways underlying smoking and alcoholism.
The COGA data include 1,614 family members. After removing those individuals with missing genotype data on some markers, there were 1,306 individuals in the Illumina genotype dataset. There are 4,752 SNP markers released by Illumina, 32 of them without a map position. The number of SNPs released in the reformatted data was 4,720. Phenotypes used for this analysis are alcohol dependence based on DSM-III-R and Feighner, coded as ALDX1, and smoking. We combined ALDX1 with smoking to construct a comorbid response. Because ALDX1 has 4 levels (261 pure unaffected, 28 never drank, 408 unaffected with some symptoms, 609 affected), the comorbid response has 8 levels (4 ALDX1 levels × 2 smoking levels). The covariates include sex, parental phenotypes, and the SNP markers. Including parental phenotypes in such an association analysis is a documented way to control for residual familial correlations. The coding scheme for a SNP genotype is 0 for 1/1, 1 for 1/2, and 2 for 2/2. A variable, sex, was used to account for any sex differences.
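The coding just described can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names are hypothetical, smoking is assumed to be a binary flag (which the count of 8 levels implies), and the numbering of the 8 comorbid levels is an arbitrary choice, not the authors' coding.

```python
# Genotype coding from the text: 1/1 -> 0, 1/2 -> 1, 2/2 -> 2.
GENOTYPE_CODE = {"1/1": 0, "1/2": 1, "2/2": 2}


def code_genotype(genotype):
    """Return the 0/1/2 code for a genotype string, or None if it is missing."""
    # Assumption: "2/1" denotes the same heterozygote as "1/2".
    if genotype == "2/1":
        genotype = "1/2"
    return GENOTYPE_CODE.get(genotype)


def comorbid_level(aldx1_level, smoker):
    """Combine a 4-level ALDX1 code (0-3) with a binary smoking flag into 0-7.

    The ordering of the 8 levels is hypothetical.
    """
    if aldx1_level not in range(4):
        raise ValueError("ALDX1 level must be 0, 1, 2, or 3")
    return 2 * aldx1_level + int(bool(smoker))


# Every (ALDX1, smoking) combination maps to a distinct level.
levels = sorted({comorbid_level(a, s) for a in range(4) for s in (False, True)})
print(len(levels), "comorbid levels:", levels)
```

Any one-to-one mapping of the (ALDX1, smoking) pairs would serve equally well, because the tree method treats the response as a set of q categories.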
The tree construction consists of two steps: tree growing and pruning. Tree growing is based on recursive partitioning. The classification tree for ALDX1 as the single outcome is shown in Figure 1, while Figure 2 depicts the classification tree for comorbid ALDX1 and smoking.
In Figure 1, the root node at the top contains all study samples. We use circles and boxes to represent internal nodes and terminal nodes, respectively. A splitting rule consists of a covariate and its corresponding threshold. As shown in Figure 1, sex is selected to split the root node, with males going to the right daughter node and females to the left daughter node, underscoring a prominent sex difference. The selection of such a split is based on a specific goodness-of-split measure such as entropy. The objective of the split is to produce two daughter nodes (numbers 2 and 3 in Figure 1) such that the within-node distribution of the phenotype, such as ALDX1 in Figure 1, is as homogeneous as possible. Specifically, suppose that we consider splitting node t, which can be the root node, and that the outcome variable has q levels, which is 4 for ALDX1 and 8 for the combination of ALDX1 and smoking. The entropy-based goodness of split is defined as
$$ i(s) = P(t_L) \sum_{i=1}^{q} P(i \mid t_L) \log P(i \mid t_L) + P(t_R) \sum_{i=1}^{q} P(i \mid t_R) \log P(i \mid t_R), $$
where $t_L$ and $t_R$ are the left and right daughter nodes of node $t$ resulting from split $s$, respectively, $P(t_L)$ is the probability for an individual to be in node $t_L$, and $P(i \mid t_L)$ is the probability for an individual in node $t_L$ to have response level $i$ ($i = 1, \ldots, q$). The definitions of $P(t_R)$ and $P(i \mid t_R)$ are analogous to those of $P(t_L)$ and $P(i \mid t_L)$. The split based on the sex variable for the root node in Figure 1 was selected because it yielded the highest $i(s)$ after evaluating all possible splits of the root node using all covariates and all SNPs.
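As a rough illustration of how such a measure can be evaluated, the sketch below assumes the goodness of split is the sum, over the two daughter nodes, of the node probability times the sum over response levels of P(i | node) log P(i | node), so that values closer to zero indicate more homogeneous daughters. The function names and the toy data are hypothetical, and probabilities are estimated by sample proportions.

```python
import math
from collections import Counter


def weighted_neg_entropy(labels, n_parent):
    """P(node) * sum_i P(i | node) * log P(i | node) for one daughter node."""
    n = len(labels)
    if n == 0:
        return 0.0
    total = 0.0
    for count in Counter(labels).values():
        p = count / n
        total += p * math.log(p)
    return (n / n_parent) * total


def goodness_of_split(x, y, threshold):
    """Goodness of the split 'x <= threshold' (left) versus 'x > threshold' (right)."""
    left = [yi for xi, yi in zip(x, y) if xi <= threshold]
    right = [yi for xi, yi in zip(x, y) if xi > threshold]
    n = len(y)
    return weighted_neg_entropy(left, n) + weighted_neg_entropy(right, n)


# Hypothetical toy data: a sex indicator, one SNP coded 0/1/2, and a response.
sex = [0, 0, 0, 0, 1, 1, 1, 1]
snp = [0, 1, 2, 0, 1, 2, 0, 1]
response = [0, 0, 0, 0, 3, 3, 3, 3]

score_sex = goodness_of_split(sex, response, threshold=0)
score_snp = goodness_of_split(snp, response, threshold=0)
print("sex split:", score_sex)
print("SNP split:", score_snp)
```

In this toy example the sex split separates the two response levels perfectly, so both daughter nodes are pure and its score attains the maximum possible value of zero, whereas the SNP split leaves both daughters mixed and scores below zero.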
After splitting the root node into two daughter nodes, we repeated the procedure to further partition the daughter nodes into the next layer, and as a result, the study sample is divided into smaller, and hopefully more homogeneous, daughter nodes hierarchically or recursively. This recursive partitioning procedure produces an initial tree that usually contains many nodes. Because there are a finite number of ways of splitting any given study sample, the recursive partitioning can run for a while, but always terminates when it exhausts all possible splits. To improve the reliability and interpretability of the information contained in a tree, the initial tree from the recursive partitioning procedure is usually pruned to a smaller size.
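The growing step can be sketched as a short recursive function. This is a minimal illustration, not the authors' software: it assumes covariates are numeric codes, uses an entropy-based score in which higher values mean more homogeneous daughter nodes, and stops when a node is pure or cannot be split further. The names `best_split`, `grow`, `count_nodes` and the toy data are hypothetical.

```python
import math
from collections import Counter


def neg_entropy(labels):
    """sum_i P(i) * log P(i) within one node (0 for a pure node)."""
    n = len(labels)
    return sum((c / n) * math.log(c / n) for c in Counter(labels).values())


def best_split(rows, labels):
    """Return (score, column, threshold) for the best split, or None if none exists."""
    best = None
    n = len(rows)
    for col in range(len(rows[0])):
        values = sorted({r[col] for r in rows})
        for thr in values[:-1]:  # the largest value would leave the right node empty
            left = [y for r, y in zip(rows, labels) if r[col] <= thr]
            right = [y for r, y in zip(rows, labels) if r[col] > thr]
            score = (len(left) / n) * neg_entropy(left) + (len(right) / n) * neg_entropy(right)
            if best is None or score > best[0]:
                best = (score, col, thr)
    return best


def grow(rows, labels):
    """Recursively partition until each node is pure or has no usable split."""
    split = best_split(rows, labels) if len(set(labels)) > 1 else None
    if split is None:
        return {"leaf": True, "counts": dict(Counter(labels))}
    _, col, thr = split
    left = [i for i, r in enumerate(rows) if r[col] <= thr]
    right = [i for i, r in enumerate(rows) if r[col] > thr]
    return {
        "leaf": False,
        "column": col,
        "threshold": thr,
        "left": grow([rows[i] for i in left], [labels[i] for i in left]),
        "right": grow([rows[i] for i in right], [labels[i] for i in right]),
    }


def count_nodes(tree):
    """Count internal and terminal nodes."""
    if tree["leaf"]:
        return 1
    return 1 + count_nodes(tree["left"]) + count_nodes(tree["right"])


# Hypothetical toy sample: each row is (sex, SNP genotype code).
rows = [(0, 0), (0, 1), (0, 2), (0, 2), (1, 0), (1, 0), (1, 1), (1, 2)]
labels = [0, 0, 1, 1, 3, 3, 3, 2]
tree = grow(rows, labels)
print("root splits on column", tree["column"], "and the tree has", count_nodes(tree), "nodes")
```

Because every split sends at least one observation to each daughter, the recursion must stop after finitely many steps, which mirrors the termination argument in the text. An unpruned tree grown this way tends to overfit, which is why a pruning step follows.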
We adopted the bottom-up method described in Zhang and Singer to delete superficial or unreliable splits. A χ² test statistic for a 2 × q contingency table was calculated for each internal node. For example, for the root node in Figure 1, we have the 2 × 4 table shown in Table 1, and the χ² value for testing the independence of the cell counts in the table equals 189.8. After the χ² values are obtained for all internal nodes, we can prespecify a significance level (e.g., 0.01), as previously suggested, and void every split whose χ² value, as well as the χ² values of all subsequent splits, does not exceed the predetermined threshold. This pruning step resulted in the tree in Figure 1 for ALDX1.
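The two ingredients of this pruning step, the χ² statistic for a 2 × q table and the bottom-up removal of splits, can be sketched as follows. This is an illustration under assumptions: the tree is a hypothetical nested dictionary in which internal nodes carry a precomputed `chi2` value, the counts and χ² values are made up (Table 1 is not reproduced here), and the threshold 11.345 is the standard upper 1% point of the χ² distribution with q − 1 = 3 degrees of freedom.

```python
def chi_square_2xq(left_counts, right_counts):
    """Pearson chi-square statistic for the 2 x q table of daughter-node counts."""
    n_left, n_right = sum(left_counts), sum(right_counts)
    if n_left == 0 or n_right == 0:
        return 0.0
    n = n_left + n_right
    stat = 0.0
    for l, r in zip(left_counts, right_counts):
        col = l + r
        if col == 0:
            continue  # a level absent from both daughters contributes nothing
        for observed, row_total in ((l, n_left), (r, n_right)):
            expected = row_total * col / n
            stat += (observed - expected) ** 2 / expected
    return stat


# Upper 1% point of chi-square with 3 degrees of freedom (2 x 4 table).
CRITICAL_DF3_ALPHA_01 = 11.345


def prune(node, critical):
    """Return (pruned node, largest chi2 in the original subtree).

    An internal node becomes terminal when neither its own chi2 nor the chi2
    of any split below it exceeds the critical value.
    """
    if "chi2" not in node:  # terminal node
        return node, 0.0
    left, left_max = prune(node["left"], critical)
    right, right_max = prune(node["right"], critical)
    subtree_max = max(node["chi2"], left_max, right_max)
    if subtree_max <= critical:
        return {"name": node["name"]}, subtree_max
    kept = {"name": node["name"], "chi2": node["chi2"], "left": left, "right": right}
    return kept, subtree_max


def count_nodes(node):
    if "chi2" not in node:
        return 1
    return 1 + count_nodes(node["left"]) + count_nodes(node["right"])


# Hypothetical 2 x 4 counts for one split.
example_stat = chi_square_2xq([40, 10, 30, 20], [20, 5, 35, 40])
print("example chi-square:", round(example_stat, 2))

# Hypothetical tree with made-up chi2 values.
toy_tree = {
    "name": "root", "chi2": 50.0,
    "left": {"name": "A", "chi2": 5.0, "left": {"name": "A1"}, "right": {"name": "A2"}},
    "right": {
        "name": "B", "chi2": 3.0,
        "left": {"name": "C", "chi2": 20.0, "left": {"name": "C1"}, "right": {"name": "C2"}},
        "right": {"name": "B2"},
    },
}
pruned, _ = prune(toy_tree, CRITICAL_DF3_ALPHA_01)
print(count_nodes(toy_tree), "nodes before pruning,", count_nodes(pruned), "after")
```

Note that node B is retained even though its own χ² value is small, because a split beneath it exceeds the threshold; node A is collapsed because nothing in its subtree does.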
With a large number of covariates, multiple splits may be of similar quality in terms of the goodness-of-split measure and the predictive precision for the phenotype. Biologically, it is possible that there are multiple pathways to a disease. Thus, it is useful to uncover and make use of all competitive splits and to form a forest of competitive trees. Although random forests (Breiman) provide a popular option, for the reasons explained in Zhang et al., we adopted their approach to form a deterministic forest. The key points made there are that deterministic forests perform similarly to random forests for data similar to the COGA data, and that deterministic forests are reproducible and can be studied easily, whereas random forests are produced with uncertainty by design, which may not be desirable. We refer to Zhang et al. for further discussion.
Following the recommendation of Zhang et al., we consider the top 20 splits of the root node and the top 3 splits of each of the two daughter nodes of the root node, giving rise to a maximum of 180 (20 × 3 × 3) trees in the forest.
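The count of 180 follows from enumerating every combination of a root split with one split for each daughter node. The sketch below illustrates that enumeration with placeholder split labels; the function name and the dictionary layout are assumptions made for the example, and a real forest would hold fitted trees rather than label tuples.

```python
from itertools import product


def enumerate_forest(root_splits, left_splits, right_splits):
    """List every (root split, left-daughter split, right-daughter split) combination.

    root_splits: ranked candidate splits of the root node.
    left_splits, right_splits: dicts mapping each root split to the ranked
    candidate splits of the corresponding daughter node.
    """
    forest = []
    for root in root_splits:
        for left, right in product(left_splits[root], right_splits[root]):
            forest.append((root, left, right))
    return forest


# Placeholder labels standing in for the top-ranked splits.
root_splits = [f"root{i}" for i in range(1, 21)]                          # top 20
left_splits = {r: [f"{r}-L{j}" for j in range(1, 4)] for r in root_splits}   # top 3
right_splits = {r: [f"{r}-R{j}" for j in range(1, 4)] for r in root_splits}  # top 3

forest = enumerate_forest(root_splits, left_splits, right_splits)
print(len(forest), "trees")
```

The figure is a maximum because a daughter node may offer fewer than three usable splits, in which case the corresponding list is shorter and fewer trees result. Since nothing is sampled at random, rerunning the enumeration on the same data always gives the same forest.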
Using the method described above, we obtained an initial tree with 139 nodes for ALDX1. Pruning at a significance level of 0.0001, based on a 2 × 4 contingency table, yields a tree with 39 nodes. Pruning at a significance level of 0.00001 yields the 19-node tree shown in Figure 1. The tree in Figure 1 identifies six important SNP markers that appear to be significantly associated with alcoholism. Table 2 lists the SNP markers that are selected when ALDX1 alone, or ALDX1 combined with smoking, is used as the response.
In this report, we identified 37 SNPs that are associated with alcoholism and smoking. Fifteen of these SNPs are within known genes. Table 3 lists the eight genes with known or inferred functions. For example, SNP marker rs476646 is from gene SLC6A13, i.e., member 13 of solute carrier family 6 (neurotransmitter transporter, GABA), in chromosome region 12p13. GABA is a neurotransmitter in the human central nervous system as well as in the human liver. Evidence indicates that GABA genes are likely candidates for alcohol dependence, and increased clearance of GABA by the liver is susceptible to alcoholism. It is not surprising that the transporter of these genes is associated with alcohol addiction. According to our MEDLINE search, the remaining SNPs and the corresponding genes that we identified have not previously been suggested to be specifically associated with either alcoholism or smoking. However, a recent genome-wide scan for smoking genes reported strong or suggestive evidence for linkage on chromosomes 9, 11, 14, and X. While that scan identified the genes on chromosomes 9, 11, and 14 in regions different from those we identified, the SNPs that we identified on the X chromosome (rs1934176, rs1536163, rs2015312, rs204141, and rs204165) are in the same regions as those identified by Gelernter et al. It is noteworthy that our analysis supports the strong sex difference in alcoholism, which is well documented. For example, Zhang and Merikangas suggested the need to use a lower threshold of alcoholism for females. This is another important motivation for us to analyze the ordinal spectrum of alcoholism, and it may partially explain why most of the SNPs that we identified were not previously found to be associated with or linked to alcoholism or smoking.
Grant B, Harford T, Dawson D, Chou P, Dufour M, Pickering R: Prevalence of DSM-IV alcohol abuse and dependence: United States, 1992. Alcohol Health Res World. 1992, 18: 243-248.
Edenberg HJ: The collaborative study on the genetics of alcoholism: an update. Alcohol Res Health. 2002, 26: 214-218.
Dick D, Foroud T: Candidate genes for alcohol dependence: a review of genetic evidence from human studies. Alcohol Clin Exp Res. 2003, 27: 868-879. 10.1097/01.ALC.0000065436.24221.63.
Drobes D: Concurrent alcohol and tobacco dependence: mechanism and treatment. Alcohol Res Health. 2002, 26: 136-142.
Madden P, Bucholz K, Martin N, Heath A: Smoking and the genetic contribution to alcohol-dependence risk. Alcohol Res Health. 2000, 24: 209-214.
Zhang HP, Singer B: Recursive Partitioning in the Health Sciences. 1999, New York: Springer.
Zhang HP, Bonney G: Use of classification trees for association studies. Genet Epidemiol. 2000, 19: 323-332. 10.1002/1098-2272(200012)19:4<323::AID-GEPI4>3.0.CO;2-5.
Breiman L: Random forests. Machine Learning. 2001, 45: 5-32. 10.1023/A:1010933404324.
Zhang HP, Yu CY, Singer B: Cell and tumor classification using gene expression data: construction of forests. Proc Natl Acad Sci USA. 2003, 100: 4168-4172. 10.1073/pnas.0230559100.
Gong Y, Zhang M, Cui L, Minuk Y: Sequence and chromosomal assignment of a human novel cDNA: similarity to gamma-aminobutyric acid transporter. Can J Physiol Pharmacol. 2001, 79: 977-984. 10.1139/cjpp-79-12-977.
Gelernter J, Liu X, Hesselbrock V, Page GP, Goddard A, Zhang H: Results of a genomewide linkage scan: support for chromosomes 9 and 11 loci increasing risk for cigarette smoking. Am J Med Genet Part B (Neuropsychiatric Genet). 2004, 128B: 94-101. 10.1002/ajmg.b.30019.
Zhang HP, Merikangas K: A frailty model of segregation analysis: understanding the familial transmission of alcoholism. Biometrics. 2000, 56: 815-823. 10.1111/j.0006-341X.2000.00815.x.
This research is supported in part by grants DA12468, DA016750, and DA017713 from the National Institute on Drug Abuse.
Ye, Y., Zhong, X. & Zhang, H. A genome-wide tree- and forest-based association analysis of comorbidity of alcoholism and smoking. BMC Genet 6, S135 (2005) doi:10.1186/1471-2156-6-S1-S135