

Figure 1 | BMC Genetics

Figure 1

From: G2D: a tool for mining genes associated with disease


The G2D algorithm. The cylinders represent public databases. MEDLINE contains references to scientific literature annotated at the National Library of Medicine with terms from the MeSH ontology. For each disease being studied, we take as its keywords the MeSH C terms ('Diseases Category') of the publications associated with it in OMIM [3]. For each gene, we take as its keywords the Gene Ontology (GO) terms [8] associated with its product in the RefSeq protein database [34]. MEDLINE does not contain enough clinical literature to allow us to relate every symptom, represented by a MeSH C term, directly to every gene feature, represented by a GO term. Because genes relate to phenotypes by means of molecules, we can make the gene/phenotype relations more robust by adding an intermediate association step through the MeSH D category, 'Chemicals & Drugs' (top). Accordingly, we first compute associations between MeSH C terms ('Diseases') and MeSH D terms ('Chemicals & Drugs') from their co-annotation on the same record, looking specifically for dependencies of MeSH D terms on MeSH C terms. For example, we would deduce a relation between "Alzheimer's disease" (MeSH C) and "Amyloid protein" (MeSH D) if the presence of the C term in a MEDLINE entry always implies the presence of the D term. Records in the RefSeq database contain GO annotations that describe the protein function, and often include a link to a MEDLINE reference, which mostly deals with the experimental characterization of the protein. We use these links to relate the MeSH D terms of the MEDLINE reference to the GO terms of the sequence, again looking for the dependence of a GO term on a MeSH D term. In this case we could deduce an association between the MeSH D term "Amyloid Protein" and the GO term "Amyloid Protein". Finally, we combine both sets of relations to obtain associations between MeSH C terms and GO terms (for example, the relation of Alzheimer's disease to the amyloid protein).
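The two-step association described above can be illustrated with a minimal Python sketch. It is not the G2D implementation: the caption does not give the scoring formula, so the dependence of a target term on a source term is assumed here to be the fraction of records carrying the source term that also carry the target term, and the two sets of relations are assumed to be chained by taking the best product over shared MeSH D terms. The function names, record layout, and toy records (including the placeholder term "Other compound") are hypothetical.

```python
from collections import defaultdict
from itertools import product


def dependence(records, source_key, target_key):
    """Assumed measure: fraction of records with a source term that also carry the target term."""
    source_count = defaultdict(int)
    pair_count = defaultdict(int)
    for rec in records:
        sources = set(rec[source_key])
        targets = set(rec[target_key])
        for s in sources:
            source_count[s] += 1
        for s, t in product(sources, targets):
            pair_count[(s, t)] += 1
    return {(s, t): n / source_count[s] for (s, t), n in pair_count.items()}


def combine(c_to_d, d_to_go):
    """Chain C->D and D->GO scores (assumed rule: best product over shared D terms)."""
    by_d = defaultdict(list)
    for (d, go), score in d_to_go.items():
        by_d[d].append((go, score))
    c_to_go = {}
    for (c, d), s1 in c_to_d.items():
        for go, s2 in by_d.get(d, []):
            key = (c, go)
            c_to_go[key] = max(c_to_go.get(key, 0.0), s1 * s2)
    return c_to_go


# Toy MEDLINE records annotated with MeSH C and MeSH D terms (made up for illustration).
medline = [
    {"mesh_c": ["Alzheimer's disease"], "mesh_d": ["Amyloid protein"]},
    {"mesh_c": ["Alzheimer's disease"], "mesh_d": ["Amyloid protein", "Other compound"]},
]

# Toy RefSeq entries: MeSH D terms of the linked MEDLINE reference plus GO annotations.
refseq_links = [
    {"mesh_d": ["Amyloid protein"], "go": ["Amyloid Protein"]},
    {"mesh_d": ["Amyloid protein"], "go": ["Amyloid Protein", "protein binding"]},
]

c_to_d = dependence(medline, "mesh_c", "mesh_d")
d_to_go = dependence(refseq_links, "mesh_d", "go")
c_to_go = combine(c_to_d, d_to_go)
```

In this toy data the C term always co-occurs with "Amyloid protein", which mirrors the "always implies" case in the caption's example; the placeholder D term appears in only half of the records and therefore receives a weaker dependence.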
To evaluate the genes associated with a particular disease, we proceed along two lines. First, we deduce the gene functions (GO terms) related to the disease using the associations from the phenotypes (MeSH C terms) that describe the disease. For this, we collect the MeSH C terms found in the MEDLINE references of the corresponding OMIM entry (left), score all GO terms according to their relation to the terms in the MeSH C list (top), and finally score every protein in RefSeq with the average of the scores of its GO terms (right). For example, the analysis of late-onset familial Alzheimer disease (LOFAD) [9] would start by characterizing the disease with the MeSH C term "Alzheimer's Disease", among others. This would point to a series of GO terms, including "Amyloid Protein", as likely related functions. One of the most related sequences in RefSeq (according to its GO annotations) would be the human amyloid beta A4 precursor protein-binding, which is annotated with the GO term "amyloid protein". The other component of the analysis is a BLAST homology search [35] of the human genome region where the disease is mapped against the sequences stored in the RefSeq database (bottom). All hits in the region (red block) with an E-value below the cut-off of 10e-10 are registered and sorted according to the score of the RefSeq protein they hit. Following our example, the analysis of the region where LOFAD was mapped would show a gene similar to the human amyloid beta A4 precursor protein-binding annotated with the GO term "amyloid protein": the APBA3 gene, which interacts with the Alzheimer's beta-amyloid precursor protein [12]. The analysis of LOFAD is extensively described in the Results section. Further details of the method are given in [2] and on the G2D web site.
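The scoring and ranking steps above can be sketched in Python as follows. Only two rules come from the caption: a protein's score is the average of the scores of its GO terms, and BLAST hits under the E-value cut-off are sorted by the score of the RefSeq protein they hit. Everything else is an assumption made for illustration: a GO term is scored by its strongest association with any MeSH C term of the disease, GO terms without an association count as 0, and all identifiers and data (protein_A, hit_1, and so on) are hypothetical.

```python
from statistics import mean

# Cut-off written as in the caption; Python reads 10e-10 as 1e-09.
E_VALUE_CUTOFF = 10e-10


def score_go_terms(disease_terms, c_to_go):
    """Assumed rule: a GO term scores its best association with any disease MeSH C term."""
    go_scores = {}
    for (c, go), score in c_to_go.items():
        if c in disease_terms:
            go_scores[go] = max(go_scores.get(go, 0.0), score)
    return go_scores


def score_proteins(protein_go, go_scores):
    """Score each protein with the average score of its GO terms (unscored terms count as 0)."""
    return {
        protein: mean([go_scores.get(go, 0.0) for go in gos])
        for protein, gos in protein_go.items()
        if gos
    }


def rank_hits(hits, protein_scores, cutoff=E_VALUE_CUTOFF):
    """Keep hits below the E-value cut-off, sorted by the score of the protein they hit."""
    kept = [h for h in hits if h["evalue"] < cutoff]
    return sorted(kept, key=lambda h: protein_scores.get(h["protein"], 0.0), reverse=True)


# Toy MeSH C -> GO associations and RefSeq annotations (made up for illustration).
associations = {
    ("Alzheimer's Disease", "amyloid protein"): 1.0,
    ("Alzheimer's Disease", "protein binding"): 0.5,
}
protein_go = {
    "protein_A": ["amyloid protein"],
    "protein_B": ["protein binding", "transport"],
}

# Toy BLAST hits of the mapped genomic region against RefSeq proteins.
hits = [
    {"locus": "hit_1", "protein": "protein_B", "evalue": 1e-30},
    {"locus": "hit_2", "protein": "protein_A", "evalue": 1e-12},
    {"locus": "hit_3", "protein": "protein_A", "evalue": 1e-3},
]

go_scores = score_go_terms({"Alzheimer's Disease"}, associations)
protein_scores = score_proteins(protein_go, go_scores)
ranked = rank_hits(hits, protein_scores)
```

Note that the ranking depends on the score of the matched protein rather than on the strength of the BLAST hit: in the toy data, hit_2 outranks hit_1 despite its weaker E-value, and hit_3 is discarded because it does not pass the cut-off.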
