Sequencing genes in silico using single nucleotide polymorphisms
© Zhang et al; licensee BioMed Central Ltd. 2012
Received: 31 October 2011
Accepted: 30 January 2012
Published: 30 January 2012
Skip to main content
© Zhang et al; licensee BioMed Central Ltd. 2012
Received: 31 October 2011
Accepted: 30 January 2012
Published: 30 January 2012
The advent of high throughput sequencing technology has enabled the 1000 Genomes Project Pilot 3 to generate complete sequence data for more than 906 genes and 8,140 exons representing 697 subjects. The 1000 Genomes database provides a critical opportunity for further interpreting disease associations with single nucleotide polymorphisms (SNPs) discovered from genetic association studies. Currently, direct sequencing of candidate genes or regions on a large number of subjects remains both cost- and time-prohibitive.
To accelerate the translation from discovery to functional studies, we propose an in silico gene sequencing method (ISS), which predicts phased sequences of intragenic regions, using SNPs. The key underlying idea of our method is to infer diploid sequences (a pair of phased sequences/alleles) at every functional locus utilizing the deep sequencing data from the 1000 Genomes Project and SNP data from the HapMap Project, and to build prediction models using flanking SNPs. Using this method, we have developed a database of prediction models for 611 known genes. Sequence prediction accuracy for these genes is 96.26% on average (ranges 79%-100%). This database of prediction models can be enhanced and scaled up to include new genes as the 1000 Genomes Project sequences additional genes on additional individuals. Applying our predictive model for the KCNJ11 gene to the Wellcome Trust Case Control Consortium (WTCCC) Type 2 diabetes cohort, we demonstrate how the prediction of phased sequences inferred from GWAS SNP genotype data can be used to facilitate interpretation and identify a probable functional mechanism such as protein changes.
Prior to the general availability of routine sequencing of all subjects, the ISS method proposed here provides a time- and cost-effective approach to broadening the characterization of disease associated SNPs and regions, and facilitating the prioritization of candidate genes for more detailed functional and mechanistic studies.
The recent advent of SNP array based high throughput genotyping technologies has stimulated a wave of genetic association studies, including genome-wide association studies (GWAS) , and has led to a number of discovered and validated associations, which are fully catalogued on the website http://www.genome.gov/gwastudies/. While many of the disease associations were found with single nucleotide polymorphisms (SNPs) in non-coding regions, some are found with SNPs within or nearby known functional genes. To move beyond SNP associations, a common approach is to sequence the candidate gene region in affected subjects and fully define the entire variation in the region. With this, one could study the disease association with the gene via natural "genetic alleles" before launching more detailed mechanistic studies . Despite the recent advances in high-throughput sequencing technologies (e.g., Solexa of Illumina http://www.illumina.com/, 454 Life Sciences http://www.454.com/, Solid of Applied Biosystem http://www.appliedbiosystems.com), sequencing targeted genes or regions for a large number of samples remains cost-prohibitive.
In this study, we describe an in silico sequencing (ISS) method to more fully define the sequence variations in candidate gene regions. Recently, the 1000 Genomes Project http://www.1000genomes.org has completed its initial phases, including a pilot project which has sequenced 8,140 target exons from 906 selected gene regions in 697 subjects with an average depth of 50X. With this high quality sequence data, we were able to deduce a database of multi-allelic gene polymorphisms for a majority of the selected gene regions. Based upon the linkage-disequilibrium between the inferred gene alleles and the SNP genotype data of these subjects, we constructed models for predicting a pair phased sequences of these genes using selected flanking SNPs. Such prediction models can be used to sequence candidate genes in silico with minimal genotyping requirement.
This development is an extension to our recently developed method , which predicts Human Leukocyte antigen (HLA) gene alleles. The key idea is to model gene alleles as multi-allelic polymorphisms, and to build a likelihood framework of all gene alleles and genotypes of flanking SNPs for the sample set. Previously, we have demonstrated success in predicting HLA-A, -B, -C, -DR and -DQ genes in high resolution with accuracy 95, 93, 97, 79, and 83% respectively. Of course, HLA genes are quite special, since 1) these genes are among the most polymorphic genes known to date, 2) there is extensive linkage disequilibrium (LD) in the region, 3) large sets of samples have been directly sequenced for their HLA gene alleles with resolved phasing information in hematopoietic stem cell transplantation (HSCT), or studies of autoimmune diseases, such as Type 1 diabetes. In this study, we show that it is possible to build comparable prediction models with satisfactory accuracies for other genes by treating diploid sequences in the gene region as large multi-allelic polymorphisms.
This method of predicting polymorphic genetic variants is closely related to the commonly used imputation methods. Typically, imputation methodologies such as Impute [4, 5], Mach , and BEAGLE , are designed to impute untyped SNPs for GWAS on a particular commercial SNP arrays (e.g. Affymetrix 6.0 or Illumina 1.2 M array), based upon a reference panel of a more dense set of SNPs (e.g. the HapMap SNPs). While utilizing underlying LD structure, these imputation methods typically aim to produce "imputed genotypes" for single nucleotide variants. In contrast, we have tuned our approach to predict (or impute) a pair of phased gene alleles. The primary reason for our focus is that gene alleles, typically associated with functional sequences in the coding gene region, can be readily converted to amino acid sequences for further interpretation. In addition, one can naturally incorporate structural variations into allele polymorphisms, due to the modeling of the entire sequence region.
In the remainder of the paper, we first describe the proposed methods then summarize the properties of genes deep sequenced in the 1000 Genomes Project, and evaluate the performance of our gene prediction models in the Results Section. Then we present an application of our approach to an existing GWAS data gathered from the WTCCC. We conclude this paper with discussions on the strengths and limitations of the proposed method.
In order to build prediction models for gene diploid sequences, one of the first steps is to deduce the phase information of the gene sequences for the training dataset, which are the samples sequenced in Pilot 3 of the 1000 Genomes Project here. Utilizing all SNP calls within each gene, we deduced diploid sequences for every subject. This is equivalent to phasing multiple SNPs, in the absence of any structural variants as in the current case. Several methods have been developed for phasing multiple SNPs [8–12]. The empirical method for estimating haplotype frequencies and for inferring haplotypes as described in  and implemented in HPlus http://qge.fhcrc.org/hplus/ was used here to efficiently phase the genotypes even when the number of SNPs was large.
where the second term equals the difference of the number of gene-SNP haplotypes (m) and the number of gene alleles (k).
The search region was empirically chosen to maximize the prediction accuracies while retaining parsimony. We have chosen 150 kb flanking region to predict HLA genes in our previous work . However, when building prediction models to accommodate multiple genes, we had to choose the searching boundary generally applicable to all genes. In this study, we evaluated the performance of flanking regions of size 0 to 500 kb with a step size of 50 kb for each gene. In order to compare the relationship between objective function and size of flanking region among all selected genes, we rescale the objective function values (O g ) for gene g, denoted as Ôg, by (O g -min(O g ))/(max(O g )-min(O g )), where the minimum and maximum of O g is taken across all sizes of flanking region. By definition, these rescaled objective function values are within range 0 . The goal was to choose a size of flanking region such that Ôg reaches its minimum for a majority of the genes. In addition, to accommodate genes with different complexity level, we have incorporated several modifications (see Additional file 1) to ensure computational efficiency and feasibility.
represents the number of correctly predicted diploid sequences for the ith subject.
where the summation is over all possible alleles. Entropy with value of 0 means that the gene is not polymorphic. The maximum value of entropy equals to log(n). In general, the larger the entropy is, the more polymorphic is the gene.
Samples genotyped in the 1000 Genomes Project pilot3 and the HapMap3 Project
Project pilot3 #
Given the large amount of sequencing data, the typical first step is to align short-read sequences to the reference genome, and then to make variant callings. In this project, investigators from the 1000 Genomes Project have already performed such analysis, producing all possible SNP genotype calls in each study population.
The second step is to clean the data. For each gene, we filtered out samples that have at least one SNP missing in the gene. Genes that have more than 50% of samples filtered or have only one SNP were excluded from analysis. In addition, samples that were filtered in more than 60% of genes were excluded.
The third step is to deduce diploid sequences by phasing genotypes of all SNPs within every gene for every subject. To maintain phasing efficiency and at the same time retain the independence of the training and validation sets, we estimated the frequency of alleles of each gene using the entire set of samples, and then predicted the phases of gene alleles for samples in the training and validation set separately. Each distinct gene allele was then assigned a label as GeneName*Allele, where Allele was labeled by the numerical number 1, 2,..., in the order of the frequency of corresponding gene allele. The diploid sequences were called for each subject, if the corresponding posterior probability was maximum and also exceeded a pre-assigned threshold. Otherwise, the subject's sequence pair was deemed as missing. To ensure the quality of prediction models, a stringent threshold of probability > 0.95 was applied and only genes that had less than 10% samples with missing diploid sequences in both training and validation sets were selected for further model building.
The fourth step is to build prediction models for each gene using our algorithm of predicting multi-allelic genes. The prediction models built based on diploid sequences and SNP genotype data of samples in the training set were then applied to the training set itself. The prediction accuracy on the training set was thus evaluated by comparing the predicted diploid sequences with the observed ones. To control the quality of prediction accuracy, we used call threshold (CT) of CT = 0, 0.5 and 0.9, i.e., one would make a prediction only if the posterior probability exceeds CT.
The last step is to validate these prediction models. By applying the prediction models built in the above step to the validation set, we computed the prediction accuracy using CT = 0, 0.5 and 0.9, which would reflect "true" prediction accuracies in practice.
Of the individuals sequenced in Pilot 3, 614 samples from multiple populations have been genotyped in HapMap 3 using the Affymetrix Human SNP array 6.0 and the Illumina Human 1M-single BeadChip. We have recently shown that prediction models built using a training set that includes multi-ethnic populations are reliable and perform well for the subpopulations . Thereby we combined samples from different ethnicities in this model building process.
The prediction models built from the training set were validated by predicting sequence variants for each of the 611 genes in the independent validation sample set. The accuracy of the prediction was evaluated by comparing between the predicted and observed diploid sequences. The distributions of prediction accuracy with CT = 0, 0.5 and 0.9 are illustrated in Figure 4b, and the distributions of the call rates corresponding to CT = 0, 0.5 and 0.9 are illustrated in Figure 4d. As expected, the call rates decline as CT increases from 0 to 0.9. Interestingly, the accuracies with any CT largely peaked at the interval [95%, 100%), among about 500 out of 611 genes. About 50 genes have accuracies approaching to 100% and slightly more than 50 genes have accuracies around 90%. The relationship between the prediction accuracy and the variability of gene is similar to that shown in the training set (Figure 5). For more details, see Additional file 3.
The training and validation exercises described above have shown that the proposed method of in silico sequencing candidate genes is valid, scalable, and has satisfactory prediction accuracies for a broad sampling of genes. To make the gene prediction models more robust, we combined the training and validation sets to build the final prediction models for all 611 genes using the procedure described above. The prediction models included the number of SNPs ranged from 1 to 38. Those models and the SNPs selected are available on our website http://qge.fhcrc.org/ISS/.
To illustrate the utility of the prediction model in silico gene sequencing, we analyzed the Type 2 diabetes (T2D) GWAS data generated by WTCCC . The WTCCC T2D study includes 2,000 cases and 3,000 controls. Genotyping was done using the Human GeneChip 500 K Mapping Array (Affy500K). WTCCC investigators have successfully replicated 13 disease associated variants which include the SNPs located in KCNJ11 and PPARG gene. KCNJ11 gene has been deeply sequenced for its exon region in the 1000 Genomes Project Pilot 3. Thus, using KCNJ11 as an example, we illustrate how one would use our prediction models to sequence KCNJ11 gene in silico, and to advance the validation/functional study of the discovered SNP association before actually sequencing the gene in the wet lab.
Results of the multi-allelic association of the KCNJ11 gene alleles with T2D.
GC C G G G G C T
GC T G G G G C C
GC T G G G A C C
GC C G C G G C T
GT T G G G G C C
GC T G G C G C C
GC C T G G G C T
CC T G G G G C C
GC TGG G G A C
Results of the multi-allelic association of the KCNJ11 amino acid gene alleles with T2D.
E V LI S
K V LV S
K V VV S
E V LI T
E L LIS
The 1000 Genomes Project, built upon the success of the HapMap project, promises to generate high quality sequence data for thousands of subjects from multiple racial/ethnic groups. Besides shedding light on human genome variations, this project will also generate an abundance of sequence data that will be useful for fine mapping and better defining disease associated functional variants. Towards this goal, we have described a method for in silico sequencing of candidate genes and regions of potential interest. Through training and validation exercises on currently available sequence data from the 1000 Genomes project, we have shown that the method produces reliable prediction models with high accuracies, with few exceptions. To facilitate the future use for the research community, we developed a set of predictive models for 611 genes utilizing all available sequence data. Such models can now be used to sequence known candidate genes in silico and further to investigate their functional properties, prior to carrying out actual sequencing in the wet lab. Therefore, applications of this method would help accelerate validation researches from discoveries to functional characterization in the most cost-effective and time-efficient way.
Our method of predicting polymorphic genetic variants may be thought of as an extension to imputation methods. Specifically, like imputation methods, our method also uses local LD to infer untyped gene alleles (or equivalently, phased diplotypes with multiple SNPs), whereas imputation methods infer untyped SNPs. By predicting fully phased sequence diploids, our approach allows one to translate DNA sequence variants in the coding region into amino acid sequence variants, thereby facilitating the functional interpretation of disease associations.
Our methodology also helps to resolve a long-standing debate on the value of haplotype based association analysis. Recall that haplotype based association analysis, since its inception, has been suggested as a preferred method for assessing disease association [19–22], because of its desired genetic interpretation and reduced number of multiple comparisons. However, the broader application of haplotype based association analyses would require investigators to rationally choose "haplotype units" of multiple SNPs, such as "haplotype blocks" [19, 23, 24]. Further investigation of the human genome has suggested that such blocks are not as robust as they were originally postulated and are also variable across populations. Thus, "haplotype unit" remains to be conceptually appealing. In practice, the majority of GWAS takes simple SNP-by-SNP association analysis, based upon publications thus far, because SNP-based association analysis is straightforward and is more amenable to comparisons across studies. Our proposed analytic strategy potentially has the advantages from both sides: 1) it focuses on natural units such as coding genes, or functional regions, which are comparable across studies; 2) it facilitates interpretations by framing associations in terms of diploid sequences and reduces the severity of multiple comparison issue.
The current methodology can naturally incorporate structural variants in building prediction models. One of the major advantages of direct sequencing is the ability to identify structural variants, including insertions and deletions. Structural variants could be naturally coded as additional gene alleles, and incorporated into prediction models. In the current study, few genes are known to have structural variants in those sequenced exons; no illustrations are given here.
During the model training process, one subjective parameter to be chosen, is the flanking region size for each gene (or region). The larger the flanking region, the more likely it is that all predictive SNPs are included, but also the greater is the chance that the prediction models will be over-fitting. Based upon our empirical experiences determining the optimal size of flanking regions for predicting alleles for five HLA genes [3, 15], we have concluded that a default 150 KB flanking region is usually satisfactory. However, since the majority of the genes modeled for this study are larger than HLA genes, the empirical data summarized in Figure 3 lead us to use a flanking region of ± 250 kb for each gene.
While prediction accuracies of most genes are approaching 99%, it is noted that two genes (TLL1 and ZNF474) have particularly low prediction accuracies in the validation set. At CT = 0, the lowest prediction accuracy among all the 611 genes is 79% for TLL1 gene. This gene is one of the complex genes in this dataset (see Figure 2), with length of 230,584, as many as 31 SNPs, more than 50 alleles and entropy of 1.38. More than half of the prediction errors made on the validation set occur for uncommon alleles with frequency < 0.01. The predictive model includes 28 SNPs, with the prediction accuracy of 97% and 79% on the training and validation sets respectively, which indicates a possible model over-fitting. The remedies for such a problem include 1) increasing sample size in the training set and thus improving the power of predicting gene alleles, or 2) reducing the gene complexity by dividing the single gene into two or more segments based on their LD structure, or 3) choosing a smaller size of searching boundary when training the model.
The second gene ZNF474 has the prediction accuracy of 80% on the training set. Seventy out of the 92 prediction errors involved three alleles ZNF474*1, ZNF474*2 and ZNF474*3, which differ by two SNPs (rs2560306 and rs35262183). Unfortunately, neither SNP was genotyped nor in significant LD with any of the SNPs in the HapMap 3. In the absence of these "highly informative and yet isolated SNPs", one would not be able to differentiate these alleles. This phenomenon has also been observed for HLA predictions . To overcome this limitation, the remedy is to restrict the prediction only to the combined alleles, a common practice in HLA genotyping. If separating such alleles is essential for practical reasons in prospective studies, one has no choice but to genotype additional missing SNPs.
In order to recover diploid sequences from the 1000 Genomes Project, it is necessary to statistically infer for haplotypes. Among the 645 selected genes, 32 genes have been excluded, because > 10% of samples had relatively poor haplotype inference with probability less than 0.95. The poorer haplotype inference may associate with a complex gene structure. For example, of these 32 genes, four genes (ANKRD15, TRPV3, NLRP11, CYP24A1) are longer than 850Kb, or the entropies are greater than 2, or have more than 50 gene alleles. To address this challenge, our strategy is to divide a gene into several regions, e.g., by combining adjacent exons or by short segments, and then deduce the alleles within each region separately. On the other hand, for those genes with complex structural variations, probably the most effective approach is to use sequence technologies that produce relatively long reads, such as several hundred or even thousand nucleotide bases, which allow us to deduce diploid sequences directly. Technological improvement in the future would lessen the impact of this particular limitation.
A critical factor of building useful prediction models, especially for those genes with rare gene alleles, is to have a sufficiently large sample size so that rare alleles are observed in the training set. In our study, the SNP and sequence data used in this paper came from the 614 samples in both HapMap and the 1000 Genomes Project. Prediction accuracies should be improved when much more samples are available in future.
Prior to the general availability of routine sequencing all subjects, the ISS method proposed here provides a timely and cost-effective approach to broadening the characterization of disease associated SNPs and regions, and facilitating the prioritization of candidate genes for more detailed functional and mechanistic studies.
We would like to thank NIH for providing partial funding support to this project through the following grants: NIH/NIMH R01MH084621 (PI: Lue Ping Zhao), NIH/NCI R01CA106320 (PI: Lue Ping Zhao), NIH/NCI R01CA119225 (PI: Lue Ping Zhao), and NIH/NHLBI R01HL087690 (PI: John Hansen).
This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.