Data mining of the GAW14 simulated data using rough set theory and tree-based methods

Rough set theory and decision trees are data mining methods used for dealing with vagueness and uncertainty. They have been utilized to unearth hidden patterns in complicated datasets collected for industrial processes. The Genetic Analysis Workshop 14 simulated data were generated using a system that implemented multiple correlations among four consequential layers of genetic data (disease-related loci, endophenotypes, phenotypes, and one disease trait). When information of one layer was blocked and uncertainty was created in the correlations among these layers, the correlation between the first and last layers (susceptibility genes and the disease trait in this case), was not easily directly detected. In this study, we proposed a two-stage process that applied rough set theory and decision trees to identify genes susceptible to the disease trait. During the first stage, based on phenotypes of subjects and their parents, decision trees were built to predict trait values. Phenotypes retained in the decision trees were then advanced to the second stage, where rough set theory was applied to discover the minimal subsets of genes associated with the disease trait. For comparison, decision trees were also constructed to map susceptible genes during the second stage. Our results showed that the decision trees of the first stage had accuracy rates of about 99% in predicting the disease trait. The decision trees and rough set theory failed to identify the true disease-related loci.

a system that implemented multiple correlations among four consequential layers of genetic data (disease-related loci, endophenotypes, phenotypes, and one disease trait). When information of one layer was blocked and uncertainty was created in the correlations among these layers, the correlation between the first and last layers (susceptibility genes and the disease trait in this case), was not easily directly detected. In this study, we proposed a two-stage process that applied rough set theory and decision trees to identify genes susceptible to the disease trait. During the first stage, based on phenotypes of subjects and their parents, decision trees were built to predict trait values. Phenotypes retained in the decision trees were then advanced to the second stage, where rough set theory was applied to discover the minimal subsets of genes associated with the disease trait. For comparison, decision trees were also constructed to map susceptible genes during the second stage. Our results showed that the decision trees of the first stage had accuracy rates of about 99% in predicting the disease trait. The decision trees and rough set theory failed to identify the true disease-related loci.

Background
Data mining approaches have been applied to different areas to derive useful and comprehensive knowledge. Methods focusing on functionalities of data mining, such as classification, prediction, association, and clustering, have been developed [1]. Variants of decision trees, such as ID3 [2] and C4.5 [3], have become standard tools for classification [4,5]. Recently, tree-based methods have been applied to genome-wide association studies for disease gene mapping [6]. Rough set theory [7] has also been utilized to solve decision problem in business and industrial areas [8][9][10]. In this study, we proposed two-stage methods that utilize decision trees C4.5 and rough set theory to analyze the Genetic Analysis Workshop 14 (GAW14) simulated data. Our goal was to search genes susceptible to Kofendrerd Personality Disorder (KPD), a behavioral disorder with multiple possible phenotype definitions.

Materials
The GAW14 simulated data was generated to represent diseased families sampled from four geographically diverse sites, Aipotu, Karangar, Danacaa, and New York City, with varied criteria for diagnosis of KPD. Subjects from these four sites had different living environments and ethnic backgrounds. One hundred replicates were generated. In each replicate, 100 nuclear families were collected from each of the first three sites and 50 extended families from the fourth site. In addition to the KPD affected status, 12 KPD-related phenotypes, labeling as a, b, c, ...,l, were given for each subject. A total of 917 SNP markers, spaced 3 cM apart, were provided on 10 chromosomes. In addition, a genome screen of 416 microsatellite markers, spaced 7 cM apart, was also given. In this study, only the SNP datasets of the first 10 replicates were analyzed. Simulated data answers were revealed after the analysis was done.

Decision trees
A decision tree is often constructed based on some attributes to divide a group of subjects into more homogenous subgroups with respect to the target outcome variable. Briefly, a decision tree is built using a recursive partitioning process and a pruning process. Initially, a root node is built to represent the entire group. Then two leaf nodes are constructed, each representing a subgroup with a specific character of a selected attribute. At each level of tree construction, entropy was employed to calculate the information gain of each attribute. The attribute with the maximal information gain was chosen as a node at that level. The process continued until we got to the end of the branch. Then, each branch was defined as the leaf of the selected attribute. A route stemmed from the root to the leaf is defined as a rule. The attribute closer to the tree root is the most important decision factor for the rule. The pruning process for a decision tree was to replace a whole sub-tree with a leaf node. The replacement took place if a decision rule was established such that the expected error rate in the sub-tree was greater than in the single leaf. With this approach, the final decision tree was built. In this study, the C4.5 Release 8 software http:// www.rulequest.com/Personal/ was used to build the deci-sion tree. The choice of pruning confidence affects how the error rates were estimated and hence the severity of pruning; values smaller than the default (25%) cause more of the initial tree to be pruned, while larger values resulted in less pruning. In this study, the pruning confidence level was set at 25%. The GAW14 simulated data was transformed into an appropriate format for the software. The KPD affected status and the 12 phenotypes were coded as 0 and 1 for unaffected and affected; SNP genotypes 11, 12, and 22 were coded as 1, 2, and 3, respectively.
For the second stage of this study, the KPD affected status and the 12 phenotypes were treated as decision variables. SNP genotypes 11, 12, and 22 were coded as 1, 2, and 3 respectively. The significant conditional attributes retained in the decision rules could be seen as the genes susceptible to the corresponding trait.

Two-stage method
At the first stage, based on the phenotypes of subjects and their parents, classification trees were built to predict trait values. Phenotypes retained in the decision trees were then advanced to the second stage where RST was applied to discover the minimal subsets of genes associated with the disease trait. For comparison, decision trees were also constructed to map susceptible genes at the second stage. In addition, phenotypes not significantly associated with the KPD affected status were also analyzed at the second stage. Analysis was done for each of the four groups as well as for the pooled data of the four groups. SNPs on a same chromosome were analyzed at the same time. Genome scans were performed by analyzing the pooled set of the significant SNPs across the 10 chromosomes.

Relationships between KPD and 12 phenotypes
At the first stage, the decision trees had accuracy rates of about 99% in predicting the disease trait. The set of significant phenotypes for predicting the disease trait varied among groups. This might suggest some population-specific effects with respect to KPD. The rest of the first ten replicates showed similar results. Phenotypes b and h were the most common effects shown in the decision trees across groups (6 out of 10, and 7 out of 10, respectively). More nodes were needed to construct decision trees in the NYC group than in the other three groups (Figure 1). In addition, the decision trees remained the same when the parental phenotypes, one at a time or 12 at the same time, were taken into consideration in the construction of the trees. Figure 2 showed the SNPs identified to be associated with KPD and the 12 phenotypes in each of four groups using RST. Genotypes that were significantly associated with KPD did not show better results in terms of hits of true SNPs related to traits in Aipotu, Karangar, Dancaa, and NYC by rough set theory Figure 2 SNPs related to traits in Aipotu, Karangar, Dancaa, and NYC by rough set theory.

SNPs identified to be associated with KPD and phenotypes
susceptible SNPs. Similar result were found in the rest of the first 10 replicates.

Discussion
In this study, the decision trees based on a few phenotypes successfully predicted the KPD affected status at the first stage. Some phenotypes were frequently included in the decision trees. These phenotypes might be used to screen KPD or to become biomarkers themselves. The decision trees for the NYC group had different structures, in terms of number of nodes. This might be due to the underlying genetic background or the extended pedigree structures. Further study should clarify the difference. It was difficult to rationalize the failures of the application of decision trees and RST methods in identifying SNPs susceptible to KPD or the 12 phenotypes. One possible reason was that the two methods might not be suitable to decompose the complex algorithms of the simulation. Another conjecture was the low penetrance rate of the disease alleles. It is also possible that association does not exist in this kind of population with SNPs so far apart. It would be interested to estimate statistical power of the two methods (RST and decision trees) for identifying SNPs with various levels of penetrance rates.

Conclusion
Our results showed that the decision trees at the first stage had accuracy rates about 99% in predicting the disease trait. The application of the decision trees and RST failed to identify some disease-related loci.

Authors' contributions
L-YW participated in the study design, carried out the analysis, and drafted the manuscript. C-LH participated in the study design, developed the algorithms, and helped to draft the manuscript. C-HC participated in study design and coordination and helped to draft the manuscript. All authors read and approved the final manuscript.