 Methodology article
 Open Access
 Published:
An improved statistical model for taxonomic assignment of metagenomics
BMC Genetics volume 19, Article number: 98 (2018)
Abstract
Background
With the advances in the nextgeneration sequencing technologies, researchers can now rapidly examine the composition of samples from humans and their surroundings. To enhance the accuracy of taxonomy assignments in metagenomic samples, we developed a method that allows multiple mismatch probabilities from different genomes.
Results
We extended the algorithm of taxonomic assignment of metagenomic sequence reads (TAMER) by developing an improved method that can set a different mismatch probability for each genome rather than imposing a single parameter for all genomes, thereby obtaining a greater degree of accuracy. This method, which we call TADIP (Taxonomic Assignment of metagenomics based on DIfferent Probabilities), was comprehensively tested in simulated and real datasets. The results support that TADIP improved the performance of TAMER especially in large sample size datasets with high complexity.
Conclusions
TADIP was developed as a statistical model to improve the estimate accuracy of taxonomy assignments. Based on its varying mismatch probability setting and correlated variance matrix setting, its performance was enhanced for high complexity samples when compared with TAMER.
Background
As the next generation sequencing technologies continue to advance at a rapid pace, it is now possible to identify the presence of microorganisms with greater efficiency and accuracy. Such studies can, in turn, help to explain whether the presence or absence of certain species or specific genus contributes to disease processes of interest. Biological samples taken from different parts of the human body as well as from different environment, such as seawater, soil, etc. can be used to extract DNA, and then those DNA samples can be analyzed as short reads of ∼100 base pairs. To analyze these short reads, Basic Local Alignment Search Tool (BLAST) is often used to identify regions of similarity between nucleotide or protein sequences by comparing sequence reads from one sample to sequences in reference databases. It then assesses the significance of matches, and, using a scoring matrix, assigns reads to the taxonomy tree that most likely to represent what had happened in the evolutionary process [1, 2]. Here, a single sequence read can be matched to multiple genomes because of sequence homology across species as well as overlapping of sequences.
BLAST can potentially lead to inaccurate estimates when errors occur in taxonomy assignment in the context of metagenomic analysis [3,4,5]. To improve the accuracy of taxonomy assignment, several algorithms have been developed to optimize the use of BLAST searches. Metagenome Analyzer called MEGAN is one of the most commonly used analytical tool [4]. MEGAN assigns matched reads to the least common ancestor in the taxonomy tree when there are multiple matches to different genomes [2], and because it assigns short reads to one genome with the best match and ignores relevant biological information with weak statistical significance, MEGAN can lead to false findings. To address this issue, Jiang and colleagues [2] introduced TAMER, which assigns metagenomic sequence reads with a mixture model by estimating the probability for each read generated from the genomes. Based on analyses of both simulated and real datasets, Jiang and colleagues [2] showed that TAMER had a higher degree of accuracy and efficiency compared to MEGAN. However, because TAMER assigns equal mismatch probability for all genomes, TAMER will experience difficulties when there exists a high degree of complexity among data and a high degree of correlations among microorganisms as in human microbiome samples.
In the present paper, we propose a statistical framework: a Taxonomic Assignment of metagenomics based on Different Probabilities (TADIP) method with the goal of improving the accuracy of the estimates by setting different mismatch probabilities for different candidate genomes. Unlike TAMER which sets the same mismatch probability for different genomes, TADIP extends TAMER to address the biological reality that: (1) different organisms may have different genetic variants at homologous loci; and (2) different organisms coexist within one microbial community. Specifically, TADIP allows the true mismatch probabilities for the genomes to be generated by each genome’s own mismatch part plus systemic errors. We also illustrate the use of a burden/variance component test based on a logistic regression model to test the need for a range of setting with varying mismatch probabilities to reflect the complexity of samples. We evaluated TADIP using both simulated and real datasets.
Methods
Data
This study uses the NCBINT data from the NCBI website as a reference dataset, and uses BLASTn as the primary analytical tool for data analysis. Following the BLASTn analysis, for each read and its corresponding candidate genome, we mapped and recorded the read serial number, corresponding genome name, taxonomy identification number, matched length, and alignment length. These variables constitute the input file for TADIP, which is consistent with the input file for TAMER [2].
TADIP model
Model parameters
We first summarize known information from the BLAST output, which includes the following: n reads(x_{j} denotes the j^{th} read), k genomes (genome i denotes the i^{th} genome), L_{ji} denotes alignment length for read j against genome i, M_{ji} denotes matched length. The parameters of interest are: R_{i}: the true proportion of reads generated from genome i or the probability of a read x_{j} is generated by genome i, \( {R}_i\ge 0,\sum \limits_{i=1}^k{R}_i=1 \). p_{i}: the probability of observing a mismatched base pair for genome i. Even if x_{j} could be generated from genome i, it is unlikely to match 100% because of potential errors involving sequencing, alignment and among others.
Likelihood
Because alignment lengths for sequence reads are nearly the same, let L_{j} = max {L_{ji}, i = 1, 2, …, k}, then the probability that a read x_{j} is generated by genome i with M_{ji} matched base pairs, and L_{j} − M_{ji} mismatched base pairs would be \( {R}_i{p}_i^{L_j{M}_{ji}}{\left(1{p}_i\right)}^{M_{ji}} \). Therefore, the probability of observing a read x_{j} from the sample is:
Let θ = (p, R), p = (p_{1}, …, p_{k})^{T}, R = (R_{1}, R_{2}, …, R_{k})^{T}, and let D = (x, L, M) where x = (x_{1}, …, x_{n})^{T}, L = (L_{1}, …, L_{n})^{T}, M denotes matched lengths for n reads against k genomes. Assuming that the sequence reads are independent and identically distributed random variables, the likelihood and log likelihood of θ are:
EM algorithm
The maximization of the loglikelihood is not straightforward. As in [2], an ExpectationMaximization (EM) algorithm is used to obtain the maximum likelihood estimators of θ by introducing latent variables Z = (Z_{1}, Z_{2}, …, Z_{n})^{T}. The latent variable determines the genome from which a sequence read originates, for example, for each read j and each genome i:
Then the probability of observing a read x_{j} from the sample becomes:
The log likelihood when the latent variables are observed is:
Base on (3), the following EM algorithm can be used. In the Estep of algorithm, let T_{ji} denote the conditional distribution of the latent variable Z_{j} given the current estimates of parameters θ^{(t)} = (p^{(t)}, R^{(t)}), then
The expectation of the log likelihood (3) with respect to Pr(Z_{j} = i θ^{(t)}, D) is:\( Q={E}_{\mathbf{Z}\mid {\boldsymbol{\uptheta}}^{\left(\mathbf{t}\right)},\mathbf{D}}\left[l\left(\boldsymbol{\uptheta} \mathbf{D},\mathbf{Z}\right)\right]={E}_{\mathbf{Z}\mid {\boldsymbol{\uptheta}}^{\left(\mathbf{t}\right)},\mathbf{D}}\left[\sum \limits_{j=1}^n\sum \limits_{i=1}^k\mathit{\log}\left({R}_i{p}_i^{L_j{M}_{ji}}{\left(1{p}_i\right)}^{M_{ji}}\right){I}_{z_j=i}\right] \)
In the Mstep, the expected log likelihood (5) can be maximized, which yields
Repeat the Estep and Mstep until the convergence of the estimated values of parameters to obtain the estimate of θ.
Taxonomic assignment of reads
Given the above information, we can then estimate the probability that a read x_{j} is generated from genome i by:
for i = 1, 2, …, k and j = 1, 2, …, n. Then the read x_{j} is assigned to the genome for which the maximum estimated value of probability is reached.
Hypotheses testing
The null hypothesis H_{0} : p_{1} = p_{2} = p_{3} = … = p_{k} for all p_{i} (or p = p_{0} where p = (p_{1}, …, p_{k})^{T}) can then be tested against the alternative hypothesis H_{1}: at least one pair of p_{i} is not equal (or p ≠ p_{0}). If H_{0} is true, then the TAMER method is valid to use. On the other hand, if H_{0} is rejected, then the TAMER method is not valid, and the TADIP method is appropriate.
Wald test
Using the estimate \( \widehat{\mathbf{p}} \) from the TADIP method, the Wald test statistics W = (p − p_{0})^{T} × Σ^{−1} × (p − p_{0}) can be used, when the variancecovariance matrix Σ of p is known. W follows a χ^{2} distribution with k − 1 degree of freedom under H_{0}.
In practice, however, the variancecovariance matrix (Σ) is often unknown. The dimension of the variancecovariance matrix (Σ) is equal to the number of genomes, and the number of genomes can be large for most metagenomic samples. As a result, it is computationally difficult to obtain the inverse of the estimate even when it is estimated using moment estimation methods for example, and its inverse is very much likely to be singular because of sparsity. Moreover, the Wald test often has poor power under sparse alternatives [6]. To circumvent this, we present two alternative tests that can be implemented in practice.
Logistic regression model
One simple test is to use logistic regression models for the mismatch probabilities. Recall that p_{i} represents the true mismatch probability of genome i which consists of system errors that include sequencing errors, alignment errors and SNPs. The system errors ought to be the same for all genomes in the datasets processing steps, but SNPs are not [7]. Assuming (1) true mismatch probability for each genome to be the sum of a fixed part which comprise average system errors and (2) the genome dependent part denotes different adjustment (SNPs) for each genome i, the following logistic regression model can be used: logit(p) = α + V × β where the matrix V takes value 1 in the diagonal and 1/(k − 1) in the nondiagonal. The model yields an estimate of fixed part α which is exactly the logit (p_{0}), and an estimate of β which denotes the coefficient of random part of mismatch probabilities for different genomes. Then test hypotheses are equivalent to the null hypothesis where H_{0}: β = 0 against the alternative hypothesis where H_{1}:β ≠ 0. We can then apply two collapsing tests that are easier to implement than the Wald test to metagenomic samples.
Burden test
The burden test collapses information for multiple random variants into a single score [8] with the assumption that a large proportion causal variants effects are in the same direction and magnitude (See the variance component test below for violation of the assumptions). The magnitude of effects is adjusted by weight wherew = (w_{1}, ⋯, w_{n}) [9], i.e., β_{j} = β_{0} × w_{j}, j = 1, …, n. Let v_{ij} denote the (i, j)th element of the matrix V, then the logistic regression model can be expressed as logit(p) = α + C × β_{0}, where C = (C_{1}, …, C_{k}), with \( {C}_j=\sum \limits_{i=1}^n{w}_j\times {v}_{ij} \). The burden test is to test where H_{0}: β_{0} = 0 with following test statistic
which follows a χ^{2} distribution with one degree of freedom under H_{0}. The weight w_{j} can be generated from the Beta function where, \( {w}_j= Beta\left({\widehat{p}}_j,{a}_1,{a}_2\right) \) [10]. Throughout this paper, the empirical value for burden test parameters of the Beta function is set to a_{1}=96, a_{2}=100.
Variance component test
When the assumptions of the burden tests are violated, which are not infrequent, we can use the variance component test method to deal with differences in directions and magnitudes to take into account both positive or negative effects [9]. Here, we present a variance component test based on the kernel method.
In this variance component test, it is assumed that β_{j} follows a distribution with mean 0 and variance \( {w}_j^2\tau \). Therefore, we test τ = 0 for the equality of all β_{j}, j = 1, …, n. The test statistic is:
where K = V^{T}WV. Under the H_{0} : τ = 0, the test statistic Q_{VCT} follows a mixture degree of freedom and it can be asymptotically calculated by \( \sum \limits_{i=1}^n{\lambda}_i{\chi}_{1,i}^2 \). \( {\chi}_{1,i}^2 \) means independent χ_{1} variables, and λ_{i} are eigenvalues of \( {\mathbf{P}}_{\mathbf{0}}^{\mathbf{1}/\mathbf{2}}{\mathbf{KP}}_{\mathbf{0}}^{\mathbf{1}/\mathbf{2}} \), where P_{0} is the inverse of variance under the null and pvalues can be calculated using the Davies Method [10, 11]. Again, the weight w_{j} is often generated from the Beta function, where \( {w}_j= Beta\left({\widehat{p}}_j,{a}_1,{a}_2\right) \). In this paper, the empirical values for the variance component test parameters were set to a_{1} at 96 and a_{2} at 100.
Simulation study
We performed simulation studies using MetaSim, the first sequencing simulator for metagenomics. MetaSim can generate a collection of reads with genome profile settings that mimic existing sequencing platforms such as Illumina, Roche 454, and Sanger [12]. To generate realistic simulated data, MetaSim introduces sequencing errors, SNPs, indels, inversions, translocations, copy number variants (CNVs), short tandem repeats (STRs); consequently, these features generate different mismatch probabilities [13].
To test our models, we generated datasets with the same as well as different mismatch probabilities. Most existing genomic nextgeneration sequencing simulation tools can be set to generate data with the same mismatch probability at the nucleotide base level. For example, MetaSim uses a fixed value of error rate for the same nucleotide base in one platform for a single run using a dataset. Empirically, the error rates for Roche 454 ranged from 1.07 to 1.7%, while those for Illumina ranged from 0.0034 to 1% [14]. Since we need to set the probabilities at the genome level to be the same, we assumed that the dataset in one sample generated from one approach (“fixed error model”, henceforth) yielded the same mismatch probability at the genome level with a small range of error rates. When the datasets were generated from multiple approaches with different error rates (“varyingerror model”, henceforth), however, different mismatch probabilities at the genome level were expected.
Evaluation of burden test and variance component test
We compare the performance of the burden test and variance component test by estimating empirical type I error and power using simulated datasets. To evaluate type I errors, we generated 100 datasets using an one fixed error model, where all genomes in the datasets were assumed to have the same mismatch probability. The burden test and variance component tests were used with weights generated from the Beta function where \( Beta\left({\widehat{p}}_j,96,100\right) \). The empirical type I error rate was estimated using the proportion of p values less than α = 0.05. To evaluate power, we generated 100 datasets with 1000 reads, where half of the dataset of one genome being generated from the one errorrate approach and the other half generated from the other errorrate approach (specifically “varyingtwoerror model”, henceforth). Therefore, two genomes in the datasets were assumed to have two different mismatch probabilities. Again, the burden test and variance component tests were used with weights generated from the Beta function where \( Beta\left({\widehat{p}}_j,96,100\right) \). The empirical power was estimated using the proportion of p values less than α < 0.05.
Comparison of TAMER and TADIP methods
To determine whether or not TADIP improves the accuracy of taxonomy assignment, we simulated three benchmark datasets with low (2 genomes, simLC), medium (9 genomes, simMC), high (15 genomes, simHC) complexity. We generated these three benchmark datasets under the varyingerror model with two different read lengths: 500 bp and 150 bp. First, we generated each dataset from a number of genomes that included 10,000 reads with the average length of 500 bp (‘long reads,’ henceforth). Second, we generated the same set of simulated data with an average read length of 150 bp (‘short reads,’ henceforth). With these simulated datasets, we compared the performance of TAMER and TADIP by estimating the proportion of correct assignment (true positive (TP)) and the proportion of incorrect assignment (false positive (FP) at different taxonomy ranks. Incorrect assignment includes reads that were aligned incorrectly or overmatched. Here, TP represents the number of correctly assigned reads / the total number of reads (10,000). FP represents the number of incorrectly assigned reads / the total number of reads (10,000).
Real data study
Using TADIP, we examined eight oral datasets from human oral cavity (http://www.mgrast.org) [15], and 11 gut datasets (https://www.ncbi.nlm.nih.gov) [16] generated from two real metagenomic studies. The oral cavity study compared the metagenomics in four groups: healthy controls who never had caries against patients who had been treated caries; those who had active caries; and those who had cavities. Each group contributed two samples. The datasets included ~ 2 million reads in total, and the average read length was 425 ± 117 bp, representing the long reads. The smallest samples had ~ 70,000 reads, while the largest sample had ~ 465,000 reads [15]. In addition, we examined 11 human gut data, obtained from a study of Crohn’s disease, an inflammatory bowel disease (https://www.ncbi.nlm.nih.gov) [16]. Crohn’s disease results in changes of microbial community in the human gut [17]. This dataset comprised seven healthy donors and four donors with Crohn’s disease. The whole genome reads were generated using the Illumina platform, and the average length of the whole genome is 119 bp, representing the short reads [16].
Results
Type I error and power
Table 1 shows that the burden test was less conservative than the variance component test where the empirical type I error for the burden test was 0.05 and that for variance component test was 0.02. Table 1 further shows that two tests were valid and powerful, despite the relatively small sample size. Specifically, the power estimate for the burden test was 0.99, while that for the variance component test was 1.00.
Comparison of TAMER and TADIP methods
We compared the performance of TAMER and TADIP using three datasets with low, medium and high levels of complexity as defined by the number of genomes (i.e., simLC, simMC, and simHC). Datasets with different mismatch probabilities were tested for both long and short reads. Table 2 shows the results for the long reads, and Table 3 represents the results for short reads.
Low level complexity with two genomes
For datasets with long reads, the simulation study revealed that, at the level of Species, both TAMER and TADIP assigned 100.0% of the reads correctly, and had 0.00% false assignments. It is evident that the numbers of reads that TAMER and TADIP assigned were close to the true values at the level of Species (Fig. 1). However, the pvalues of the burden test and variance component test were less than 0.05. This is because of different taxonomic composition, different genomes were assigned to the sample based on TAMER and TADIP, though these genomes belong to the same species. When the datasets with short reads were examined in Table 3, the rates of TP were lower than those in long reads, and this is most likely due to low matching rate from BLAST. Both TAMER and TADIP assigned 71.0% of the reads correctly at the at all rank levels and had 0.00% false assignments. The pvalues of the burden test and variance component test were 0.131 and 0.009, respectively (Table 4). The test results indicate that TAMER and TADIP are both appropriate to use in this simplest setting.
Medium level complexity with nine genomes
For the datasets with long reads, TAMER assigned 92.3% of the reads correctly at the level of Species and had 17.3% false assignments. On the other hand, TADIP assigned 92.3% of the reads correctly and has 10.0% false assignments (Table 2). Note that the sum of TP and FP is greater than 1 under this setting due to the occurrence of multiple assignments. We then counted the numbers of reads that TAMER and TADIP assigned were close to the true value at the level of Species (Fig. 2). Particularly, the assignment number of Escherichia coli, Francisella tularensis and Shigella dysenteriae using TADIP was notably close to the truth. We then tested the datasets with short reads. TAMER assigned 67.0% of the reads correctly at the level of Species, and had 3.98% false assignments, while TADIP assigned 67.0% of the reads correctly and has 0.65% false assignments (Table 3). The assignment number of Escherichia coli, Shigella dysenteriae using TADIP in this simulation was notably close to the truth, however, the detected number of Francisella tularensis, Pasteurella multocida, Pseudomonas entomophila, Pseudomonas fluorescens are less than the truth both for TAMER and TADIP as shown in Fig. 2, which contribute to the decrease of TP. The pvalues for the burden test and variance component test of two simulation studies were less than 0.05, suggesting that it was appropriate to use TADIP in this setting.
High level complexity with 15 genomes
For the datasets with long reads, TAMER assigned 96.4% of the reads correctly at the level of Genus, and had 1.94% false assignments. On the other hand, TADIP assigned 96.4% of the reads correctly, and had 0.87% false assignments (Table 2). It is evident that the numbers of reads that TAMER and TADIP assigns were close to the true values at the level of Species, even though there were differences in subspecies or strain. The assignment number of Escherichia and Shigella in this simulation using TADIP is closer to the truth (Fig. 3). For the dataset with short reads, TAMER assigns 71.1% of the reads correctly at the level of Species, and has 2.42% false assignments. TADIP assigns 72.9% of the reads correctly and has 0.55% false assignments (Table 3). The assignment number of Escherichia and Shigella using TADIP is also closer to the truth (Fig. 3). In the meanwhile, the assigned number is less than the truth over half of the species both for TAMER and TADIP. The pvalues of burden test and variance component test were close to 0 and less than 0.05, respectively.
As the complexity increases, TADIP performs better than TAMER, especially for generating lower levels of FPs. The three levels of complexity for simulation showed that TADIP performed better than TAMER, when mismatch probabilities were different across datasets. However, the enhancement is weakened when the length of the reads was shorter such that the likelihood of low matching rate of each assignment increased.
Oral metagenomics
We identified approximately 2500 species in these eight oral samples, comprising two controls and six patients (grouped by treated caries, active caries, cavities), and the number of identified species varied by sample ranging from 700 to 1400. We estimated the proportions of reads assigned to the dominant Classes based on TADIP as shown in Additional file 1: Figure S1 and Table S1. Under the TADIP analysis, we can observe that in general diseased samples had more Bacteroidia, Fusobacteriia at the rank level of Classes than the healthy samples. On the other hand, controls had more Gammaproteobacteria that were seem to be absent in five diseased samples. In addition, the proportions of Betaproteobacteria and Actinobacteria behaved oddly in the first and last patient samples, and some other Class such as Epsilonproteobacteria, Coriobacteriia appeared in samples with caries or cavities. As the eight samples were chosen with a range of clinical features, there was a large variation among the samples, indicating the individuation in oral samples. Under the TAMER analysis, most of estimated proportions of dominant Classes were similar with the result of TADIP, except for evident differences in the bar plot of first patient with treated caries (Fig. 4). TAMER identified less Bacilli and Fusobacteria and Gammaproteobacteria than TADIP. It is of interest to note that the results from TADIP were more consistent to the results in the original published [15], i.e., the amount of Gammaproteobacteria of that patient ought to be the largest of all eight samples. The result of burden test also showed that there were significance differences between the mismatch probabilities among different genomes.
Gut metagenomics
Around 300 dominant species were assigned in 11 gut samples consisting seven controls and four patients with Crohn’s disease. Fig. 5 shows that the estimated proportions of reads assigned to the dominant Phylums based on TADIP. Under the TADIP analysis, the four major Phylum in the human gut were Firmicutes, Bacteroidetes, Actinobacteia and Proteobacteria, replicating the earlier published results [16]. In our analysis, Verrucomicrobia was in high abundance compared with Proteobacteria, and Actinobacteia in three healthy samples and one patient sample (Fig. 5). This agrees with some findings in human gut that Verrucomicrobia can be occasionally observed [3, 18]. In general, the proportions of the phylum Proteobacteria and Actinobacteia were higher in Crohn’s disease patients than in healthy controls. In addition, the proportions of Firmicutes and Bacteroidetes were unusual in the samples from the first two Crohn’s disease patients, indicating that they might have been overrepresented or depleted. These observations on samples from Crohn’s disease patients agree with previous findings [19,20,21,22,23,24], supporting the notion that the causes of Crohn’s disease among patients vary widely. Under the TAMER analysis, most of estimated proportions of dominant Phylums were similar to the results from the TADIP analysis where fewer Phylums were detected in the second diseased sample and the sixth control sample (Additional file 1: Figures S2S3 and Tables S2S3). The burden test showed that there were significance differences at the significance level of 5%, indicating that the mismatch probabilities among genomes were significant.
Discussion
We have proposed a statistical framework called TADIP to improve the estimate accuracy of taxonomy assignments in metagenomic data. This approach extends the TAMER method and allows efficient analysis of metagenomics data by allowing different mismatch probabilities to different candidate genomes, rather than using one common mismatch probability for an array of different genomes.
It has been shown that TAMER performs better than MEGAN at high taxonomic ranks and estimates with a greater degree of accuracy at the genus level or even at the species level [2]. However, TAMER does not perform well in samples with a high degree of complexity. Our study has demonstrated that TADIP performed better for high complexity samples, because TADIP takes into account the correlations among different genomes with different mismatch probabilities. Using simulated and real datasets, we showed that TADIP can overcome some of the problems faced by complex samples. When we tested both simulation and real data incorporating significantly different mismatch probabilities among samples, for highly complicated samples, we showed that the burden test and variance component tests may yield different results because of the opposite assumptions, especially when sample size is large and these is a high degree of complexity. A future work will apply a bootstrap method to further explore this problem by resampling the original sequence reads with replacement for the statistical inference.
BLAST is considered as a generalpurpose tool of the preprocessing step for metagenomics study, in which the set of DNA sequences is compared against publicly available databases [4]. However, we can also use some other efficient alignment tools such as BOWTIE [25] for short reads. Output from these alignment tools can serve as input for TADIP. The results from BLAST, Bowtie, and TADIP are presented in Additional file 1.
We note that selection of the right match algorithm is important. TADIP, TAMER and MEGAN rely on homologous searches of the sequence reads in the reference databases, and these algorithms do not perform well when the reads are generated from new genomes and the reference databases contain limited genome data. When a limited set of genomic data is available on novel genus, order or higher level in the evolutionary tree, algorithms that employ the sequence composition approach that characterizes sequence reads phylogenetically, such as PhyloPhythiaS and Phymm, performed substantially better than other algorithms [26, 27].
Conclusions
TADIP is developed as a statistical model to improve the estimate accuracy of taxonomy assignments. TADIP allows a varying mismatch probability setting and a correlated variance matrix setting to mimic the biological reality (i.e., truth). It performs better than TAMER for high complexity samples, especially in samples that contain different species, where different mismatch probabilities are likely to be abundant.
Abbreviations
 BLAST:

Basic local alignment search tool
 CNVs:

Copy number variations
 FP:

False positives
 MEGAN:

Metagenome analyzer
 simHC:

Simulated data with high level complexity
 simLC:

Simulated data with low level complexity
 simMC:

Simulated data with medium level complexity
 SNPs:

Single nucleotide polymorphisms
 STR:

Short tandem repeats
 TADIP:

Taxonomic assignment of metagenomics base on different probabilities
 TAEC:

Taxonomic analysis by elimination and correction
 TAMER:

Taxonomic assignment of metagenomic sequence reads
 TP:

True positives
References
 1.
Altschul SF, Madden TL, Schaffer AA, Zhang J, Zhang Z, Miller W, Lipman DJ. Gapped blast and psiblast: a new generation of protein database search programs. Nucleic Acids Res. 1997;25(17):3389–402.
 2.
Jiang H, An L, Lin SM, Feng G, Qiu Y. A statistical framework for accurate taxonomic assignment of metagenomic sequencing reads. PLoS One. 2012;7(10):46450.
 3.
Sohn MB, An L, Pookhao N, Li Q. Accurate genome relative abundance estimation for closely related species in a metagenomic sample. BMC Bioinf. 2014;15(1):242.
 4.
Huson DH, Auch AF, Qi J, Schuster SC. Megan analysis of metagenomic data. Genome Res. 2007;17(3):377–86.
 5.
Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ. Basic local alignment search tool. J Mol Biol. 1990;215(3):403–10.
 6.
Fan J, Liao Y, Yao J. Power enhancement in highdimensional crosssectional tests. Econometrica. 2015;83(4):1497–541.
 7.
Brookes AJ. The essence of snps. Gene. 1999;234(2):177–86.
 8.
Wu MC, Lee S, Cai T, Li Y, Boehnke M, Lin X. Rarevariant association testing for sequencing data with the sequence kernel association test. Am J Hum Genet. 2011;89(1):82–93.
 9.
Lee S, Abecasis GR, Boehnke M, Lin X. Rarevariant association analysis: study designs and statistical tests. Am J Hum Genet. 2014;95(1):5–23.
 10.
Zhao N, Chen J, Carroll IM, RingelKulka T, Epstein MP, Zhou H, Zhou JJ, Ringel Y, Li H, Wu MC. Testing inmicrobiomeprofiling studies with mirkat, the microbiome regressionbased kernel association test. Am J Hum Genet. 2015;96(5):797–807.
 11.
Davies RB. The distribution of a linear combination of χ2 random variables. Appl Stat. 1980;29(3):323–33.
 12.
Richter DC, Ott F, Auch AF, Schmid R, Huson DH. Metasim—a sequencing simulator for genomics and metagenomics. PLoS One. 2008;3(10):3373.
 13.
Pattnaik S, Gupta S, Rao AA, Panda B. Sinc: an accurate and fast errormodel based simulator for snps, indels and cnvs coupled with a read generator for shortread sequence data. BMC Bioinf. 2014;15(1):40.
 14.
Jia B, Xuan L, Cai K, Hu Z, Ma L, Wei C. Nessm: a nextgeneration sequencing simulator for metagenomics. PLoS One. 2013;8(10):75448.
 15.
BeldaFerre P, Alcaraz LD, CabreraRubio R, Romero H, SimonSoro A, Pignatelli M, Mira A. The oral metagenome in health and disease. ISME J. 2012;6(1):46.
 16.
Morgan XC, Tickle TL, Sokol H, Gevers D, Devaney KL, Ward DV, Reyes JA, Shah SA, LeLeiko N, Snapper SB, et al. Dysfunction of the intestinal microbiome in inflammatory bowel disease and treatment. Genome Biol. 2012;13(9):79.
 17.
Khor B, Gardet A, Xavier RJ. Genetics and pathogenesis of inflammatory bowel disease. Nature. 2011;474(7351):307.
 18.
Dubourg G, Lagier JC, Armougom F, Robert C, Audoly G, Papazian L, Raoult D. Highlevel colonisation of the human gut by verrucomicrobia following broadspectrum antibiotic treatment. Int J Antimicrob Agents. 2013;41(2):149–55.
 19.
Thota VR, Dacha S, Natarajan A, Nerad J. Eggerthella lentabacteremia in a crohn’s disease patient after ileocecal resection. Future Microbiol. 2011;6(5):595–7.
 20.
P ́erezBrocal V, Gar ́cıaLopez R, V’azquezCastellanos JF, Nos P, Beltr ́an B, Latorre A, Moya A. Study of the viral and microbial communities associated with crohn’s disease: a metagenomics approach. Clin Transl Gastroenterol. 2013;4(6):36.
 21.
Sartor RB. Microbial influences in inflammatory bowel diseases. Gastroenterology. 2008;134(2):577–94.
 22.
Liu Y, Van Kruiningen HJ, West AB, Cartun RW, Cortot A, Colombel JF. Immunocytochemical evidence of listeria, escherichia coli, and streptococcus antigens in crohn’s disease. Gastroenterology. 1995;108(5):1396–404.
 23.
Man SM, Kaakoush NO, Mitchell HM. The role of bacteria andpatternrecognition receptors in crohn’s disease. Nat Rev Gastroenterol Hepatol. 2011;8(3):152.
 24.
Frank DN, Amand ALS, Feldman RA, Boedeker EC, Harpaz N, Pace NR. Molecularphylogenetic characterization of microbial community imbalances in human inflammatory bowel diseases. Proc Natl Acad Sci. 2007;104(34):13780–5.
 25.
Langmead B, Salzberg SL. Fast gappedread alignment with bowtie2. Nat Methods. 2012;9(4):357.
 26.
Patil KR, Haider P, Pope PB, Turnbaugh PJ, Morrison M, Scheffer T, McHardy AC. Taxonomic metagenome sequence assignment with structured output models. Nat Methods. 2011;8(3):191–2.
 27.
McHardy AC, Martin HG, Tsirigos A, Hugenholtz P, Rigoutsos I. Accurate phylogenetic classification of variablelength dna fragments. Nat Methods. 2007;4(1):63–72.
Funding
This study was supported in part by the funding from the NIH/NIA (AG054186 for ZJ, JL; AG051876 for JL). The funding agreement ensured the authors’ independence in designing the study, collecting, analyzing and interpreting the data, and writing the manuscript.
Availability of data and materials
Simulated data were generated as specified in the Methods section. The deidentified publicly available data can be obtained from the MGRAST Project ID mgp128 and NCBI BioProject ID numbers 82,111 and 175,224.
Author information
Affiliations
Contributions
ZJ, YY and JL conceived and designed experiments; YY and ZJ performed the experiments; YY, JL and ZJ analyzed and interpreted the data; YY, JL and ZJ drafted the manuscript; All authors read and approved the final version of the manuscript.
Corresponding author
Correspondence to Joseph H Lee.
Ethics declarations
Ethics approval and consent to participate
Not applicable.
Consent for publication
Not applicable.
Competing interests
The authors declare that they have no competing interests.
Publisher’s Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Additional file
Additional file 1:
Figure S1.  Heat maps for the representative Classes of oral metagenomics data based on the TADIP (A) model and the TAMER (B) model. Figure S2.  Heat maps for the representative Species of gut metagenomics data based on the TADIP (A) model and the TAMER (B) model. Figure S3.  Numbers of reads assigned using TAMER and TADIP for the representative Phylums of gut metagenomics data. Figure S4.  Comparison of Blast and Bowtie for the study of oral metagenomics data. Table S1.  Tables for the estimated proportion of reads assigned to representative Classes of oral metagenomics data based on the TADIP model and the TAMER model. Table S2.  Tables for the estimated proportion of reads assigned to representative Phylum of gut metagenomics data(disease)based on the TADIP model and the TAMER model. Table S3.  Tables for the estimated proportion of reads assigned to representative Phylum of gut metagenomics data(control)based on the TADIP model and the TAMER model. Supplementary Code  BLAST. Supplementary Code  Bowtie. Supplementary Code  R package for TADIP and Hypothesis Testing. (DOCX 4876 kb)
Rights and permissions
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
About this article
Cite this article
Yao, Y., Jin, Z. & Lee, J.H. An improved statistical model for taxonomic assignment of metagenomics. BMC Genet 19, 98 (2018). https://doi.org/10.1186/s1286301806801
Received:
Accepted:
Published:
Keywords
 EM algorithm
 Metagenomics
 Taxonomic assignment