This study demonstrates that untargeted MPS can be used to detect variation between metagenome profiles. The method for deriving rumen microbiome profiles described here, which has similarities to the analysis of read counts from RNA sequence data, allows comparison of samples based on the whole population, not just individual species. The method uses data and programs in the public domain. It is attractive because minimal computational resources are required: less than half a gigabase of sequence data was sufficient to produce repeatable results (clustering of samples within cow) when the bovine rumen was used as an example.
Other methods used to examine the rumen metagenome tend to be either targeted 16S gene studies [24–26], which use some form of amplification and then sequence the amplified products; or they perform whole genomic (shot-gun) untargeted sequencing and then assemble contigs to gain some functional insight into the sample of origin [6, 17]. Our method uses the same type of data that is typically generated for functional studies. We have expanded the use of this data to give a profile of species in the metagenome. This type of data can now be used not only for functional analysis, but also to perform an analysis analogous to 16S profiling studies.
No published studies have previously used MPS to directly address the hypothesis tested here: that, for the fluid component, within-rumen variation is less than between-rumen variation. However, it is an assumption implicitly made by studies that take only one technical replicate per animal, e.g. [8, 17, 27]. In this way, our small study appears to confirm previous assumptions made about the rumen fluid microbiome. Samples clustered by host animal when hierarchical clustering was performed, and the linear models confirmed that a much larger proportion of variation was due to the host animal than to the sample position. The uniformity in the rumen fluid microbiome is likely due to the churning action of the rumen and reticulum. Future sequencing efforts can therefore focus on increasing animal numbers rather than taking multiple samples per individual.
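The comparison of within-animal and between-animal variation can be illustrated with a minimal sketch. This is not the linear model fitted in this study; it is a simplified one-way sum-of-squares decomposition for the abundance of a single contig, and the function name, abundance values and cow labels are all hypothetical.

```python
from collections import defaultdict


def proportion_between_groups(values, groups):
    """Return the fraction of the total sum of squares explained by group.

    values: abundance of one contig in each sample (hypothetical data).
    groups: host-animal label for each sample, in the same order.
    """
    if len(values) != len(groups):
        raise ValueError("values and groups must have the same length")

    grand_mean = sum(values) / len(values)

    # Collect the observations belonging to each host animal.
    by_group = defaultdict(list)
    for value, group in zip(values, groups):
        by_group[group].append(value)

    total_ss = sum((v - grand_mean) ** 2 for v in values)
    if total_ss == 0:
        return 0.0  # no variation at all, so nothing to attribute

    # Between-group sum of squares: group size times the squared
    # deviation of the group mean from the grand mean.
    between_ss = sum(
        len(vs) * (sum(vs) / len(vs) - grand_mean) ** 2
        for vs in by_group.values()
    )
    return between_ss / total_ss


# Hypothetical example: two samples from each of two cows.
abundance = [10.0, 11.0, 30.0, 29.0]
cow = ["cow_A", "cow_A", "cow_B", "cow_B"]
print(proportion_between_groups(abundance, cow))
```

A value close to 1 for most contigs would be consistent with the pattern reported here, in which host animal accounts for far more of the variation than sampling position within the rumen.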
As rumen fluid samples are difficult to obtain, it would be convenient if there were considerable overlap of the microbe profile between rumen and faecal samples from the same animal. Faecal samples are easy to obtain and hence would lend themselves well to large studies. We were unable to find evidence that rumen fluid and faecal profiles are significantly linked. This was illustrated by the large number of contigs unique to faeces or rumen fluid, and by the many contigs whose abundance changed significantly between rumen fluid and faeces. Further mixed model analysis showed that the host animal explained little of the variation present; most of the variation was attributed to sample type. We also hypothesised that the microbial profile of a cow’s faeces would correlate more strongly with that cow’s rumen fluid microbial profile than with the rumen fluid microbial profile of any other cow. We did not find evidence to support this hypothesis. A more sophisticated analysis may be required to investigate this question in detail.
The differences between rumen and faeces may reflect the function of the two environments: the rumen microbial community may be under strong selection to remain highly functional, as the host animal depends on it for digestion, while the faecal community may be under fewer restrictions. It is therefore interesting that there was more variation in the relative abundance of functional pathways in rumen than in faeces. Brulc et al. compared rumen sequence with termite sequence. This could be a more relevant comparison, as the hindgut of the termite performs a similar function to the rumen: digesting plant material for the host animal.
In our study, a sequencing depth of 1 million paired reads was sufficient to detect variation between the rumen fluid microbiomes of different animals. This finding will allow researchers to obtain a relatively inexpensive and detailed summary of the whole rumen fluid metagenome. While not detracting from the importance of deep sequencing, or targeted sequencing of particular genes or species, this study has illustrated a method to obtain a whole rumen fluid metagenome profile without targeting any particular gene or species group. The production of whole metagenome profiles is likely to be of particular importance for traits where interactions between hundreds or thousands of species may be occurring.
Using multiple databases allowed us to examine the effect of the database source on the clustering pattern. The NCBI prokaryotes reference metagenome and GreenGenes reference metagenome represent well-annotated databases. This is the type of database that sequence data is often compared to for taxonomic classification of reads. The soil database reference metagenome is a database of contigs from a different sample type to the test samples. The human faeces reference metagenome represents a database of contigs from a similar sample type to the test samples (both are gut samples from mammals). The JGI_rumen database was a database of contigs from a very similar sample type to the test samples (fibre-adherent microbes as compared to liquid fraction microbes). The DPI rumen derived database was a database of contigs from the same sample type as the test samples (although DNA was extracted using a different method, and the sequencing technology used was different). Our samples clustered by animal only when the rumen or human faeces databases were used. The soil database did not produce clustering by cow. This suggests that using a contig database derived from independent sequencing of the same (or similar) community type as the experimental sequences is most appropriate. Interestingly, clustering by host animal was observed even when the database was derived from another species or prepared in a different way to the test samples. Our results suggest that a large proportion of reads aligning to the reference, while desirable, is not absolutely necessary if the reference is biologically relevant to the samples.
Here we have used two databases derived from bovine metagenome sequencing. Other possible sources of bovine metagenome sequence include Brulc et al. as a rumen-derived source or Durso et al. as a faecal source. However, the amount of sequence in each of the above-mentioned studies may cause difficulties in assembling a database. We therefore suggest that they be combined with the rumen sequence from this or other studies. A combined bovine metagenome reference assembly could contain reads from multiple sources to take best advantage of sequencing efforts around the world.
In our study we show that the biological relevance of the reference metagenome, in this case rumen-derived sequences, is more important than its size (as measured by the total number of base pairs in the database). Database size did not have a positive effect on the percentage of sequence reads that aligned. For example, the two rumen-derived databases showed a five-fold difference in the proportion of reads that aligned, despite being approximately equal in size (as determined by the total summed length of contigs). Hierarchical clustering also appeared unaffected by database size: the largest (NCBI prokaryotes) and smallest (GreenGenes) databases did not cluster samples by cow, whereas the two intermediate-sized, rumen-derived databases did. Therefore size is not the most critical feature of the reference metagenome.
The majority of rumen sequences obtained by MPS in this study were novel. It is therefore possible that the reference databases used have limited the power of this study, as only a small proportion (0.72%–6.00%) of the sequence information generated was actually used for hierarchical clustering. Despite this, as the contigs present in the rumen-derived references represent the most common rumen taxa, it is evident that the relative abundances of even the most common rumen species vary enough between animals to allow the microbial profiles of different individuals to be discriminated.
The hierarchical clustering method used in this study should be a useful approach in the analysis of other metagenome datasets. It is similar to methods using RNA sequence to quantify gene expression levels, and it does not require assembly of reads into contigs, or the use of BLAST, both of which require large computational capacities for the volume of data produced by MPS. Also, we have shown that the database need not be from the same environment as the samples, illustrated by the clustering of samples even when databases were prepared in different ways (e.g. DPI_rumen versus JGI_rumen) or from different species (e.g. human faeces). Therefore this method may be particularly useful for novel gut metagenomes, where a reference from closely related species is not available.
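As a rough illustration of clustering samples directly from a profile matrix, the sketch below groups samples by agglomerative clustering of their contig abundance vectors. The distance measure (one minus the Pearson correlation) and average linkage are assumptions made for this example, not a statement of the settings used in this study, and the sample names and counts are hypothetical.

```python
import math


def correlation_distance(x, y):
    """One minus the Pearson correlation of two abundance vectors."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if sd_x == 0 or sd_y == 0:
        return 1.0  # correlation undefined; treat as uncorrelated
    return 1.0 - cov / (sd_x * sd_y)


def average_linkage(profiles, n_clusters):
    """Merge the closest pair of clusters until n_clusters remain.

    profiles: {sample_name: [abundance of each reference contig]}.
    Returns a list of sets of sample names.
    """
    names = list(profiles)
    # Pairwise distances between individual samples.
    dist = {
        frozenset((a, b)): correlation_distance(profiles[a], profiles[b])
        for i, a in enumerate(names)
        for b in names[i + 1:]
    }
    clusters = [frozenset([name]) for name in names]

    def linkage(c1, c2):
        # Average linkage: mean distance over all cross-cluster pairs.
        pairs = [dist[frozenset((a, b))] for a in c1 for b in c2]
        return sum(pairs) / len(pairs)

    while len(clusters) > n_clusters:
        _, i, j = min(
            (linkage(clusters[i], clusters[j]), i, j)
            for i in range(len(clusters))
            for j in range(i + 1, len(clusters))
        )
        merged = clusters[i] | clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return [set(c) for c in clusters]


# Hypothetical profiles: two rumen fluid samples from each of two cows,
# counted against three reference contigs.
profiles = {
    "cowA_1": [10, 1, 1],
    "cowA_2": [9, 2, 1],
    "cowB_1": [1, 10, 2],
    "cowB_2": [1, 9, 3],
}
for cluster in average_linkage(profiles, n_clusters=2):
    print(sorted(cluster))
```

Because the input is only a table of read counts per reference contig, no assembly or BLAST step is involved; the cost is dominated by aligning reads to the reference to produce the counts.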
The method allows examination of relationships between metagenomes; however, the hierarchical clustering cannot in itself provide information on which species are driving the hierarchical relationships. One way of dealing with this limitation is to use the metagenome profile matrix to find contigs that are significantly over- or under-represented between two sample groups, as we did in the comparison of rumen fluid to faeces. Another limitation is that the dendrograms only represent species in the reference database, and hence if key species in the community were missing from the database, the clustering pattern may not represent the community accurately.
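A minimal sketch of how a profile matrix could be screened for contigs that differ between two sample groups is given below. It only ranks contigs by the log2 ratio of mean depth-normalised abundance and does not perform a significance test, so it is a first-pass filter rather than the analysis carried out in this study. The function name, the pseudocount and the sample data are hypothetical.

```python
import math


def log2_fold_changes(counts, group_a, group_b, pseudocount=0.5):
    """Log2 ratio of mean relative abundance (group_b over group_a).

    counts: {sample_name: {contig_name: read_count}}.
    group_a, group_b: lists of sample names to compare.
    A pseudocount is added so contigs absent from one group do not
    cause division by zero.
    """
    contigs = sorted({c for sample in counts.values() for c in sample})

    def mean_relative_abundance(samples, contig):
        total = 0.0
        for s in samples:
            # Normalise by sequencing depth so samples are comparable.
            depth = sum(counts[s].values()) + pseudocount * len(contigs)
            total += (counts[s].get(contig, 0) + pseudocount) / depth
        return total / len(samples)

    return {
        c: math.log2(
            mean_relative_abundance(group_b, c)
            / mean_relative_abundance(group_a, c)
        )
        for c in contigs
    }


# Hypothetical counts for one rumen fluid and one faecal sample.
counts = {
    "rumen_1": {"contig_1": 90, "contig_2": 10},
    "faeces_1": {"contig_1": 10, "contig_2": 90},
}
changes = log2_fold_changes(counts, ["rumen_1"], ["faeces_1"])
# Contigs with the largest absolute change are listed first.
for contig, lfc in sorted(changes.items(), key=lambda kv: -abs(kv[1])):
    print(contig, round(lfc, 2))
```

Negative values indicate contigs that are less abundant in the second group; in practice the ranking would be followed by a formal test with correction for the number of contigs examined.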
This study did not investigate fibre-adherent microbes; further work may therefore be required to assess whether this method is also applicable to the fibre-adherent rumen fraction. Because a database even from a different species (human) could successfully cluster samples by animal, we predict that this method will be applicable to the fibre-adherent rumen fraction, as well as to gut metagenomes from other animals.
Another limitation is that the DNA extraction methodology used here may have caused shifts in the proportion of species, or degradation of the DNA. Studies have shown that the method used to extract DNA has a substantial effect on the microbial population observed. However, as all samples were treated in the same manner, these possible effects would have been uniform across all samples. The specific effect of the extraction method we have chosen may be that eukaryotes such as protozoa have been under-represented, as the physical disruption used for DNA extraction would likely shear DNA released early in the lysis process. Likewise, the centrifugation of samples before DNA extraction probably caused an under-representation of viral sequences, as many viruses would remain in the supernatant. However, neither of these is a concern, as it is the prokaryote population that we were interested in investigating.