SNP discovery in swine by reduced representation and high throughput pyrosequencing
© Wiedmann et al; licensee BioMed Central Ltd. 2008
Received: 04 September 2008
Accepted: 04 December 2008
Published: 04 December 2008
Relatively little information is available for sequence variation in the pig. We previously used a combination of short read (25 base pair) high-throughput sequencing and reduced genomic representation to discover > 60,000 single nucleotide polymorphisms (SNP) in cattle, but the current lack of complete genome sequence limits this approach in swine. Longer-read pyrosequencing-based technologies have the potential to overcome this limitation by providing sufficient flanking sequence information for assay design. Swine SNP were discovered in the present study using a reduced representation of 450 base pair (bp) porcine genomic fragments (approximately 4% of the swine genome) prepared from a pool of 26 animals relevant to current pork production, and a GS-FLX instrument producing 240 bp reads.
Approximately 5 million sequence reads were collected and assembled into contigs having an overall observed depth of 7.65-fold coverage. The approximate minor allele frequency was estimated from the number of observations of the alternate alleles. The average coverage at the SNPs was 12.6-fold. This approach identified 115,572 SNPs in 47,830 contigs. Comparison to partial swine genome draft sequence indicated 49,879 SNP (43%) and 22,045 contigs (46%) mapped to a position on a sequenced pig chromosome and the distribution was essentially random. A sample of 176 putative SNPs was examined and 168 (95.5%) were confirmed to have segregating alleles; the correlation of the observed minor allele frequency (MAF) to that predicted from the sequence data was 0.58.
The process was an efficient means to identify a large number of porcine SNP with a high validation rate, to be used in an ongoing international collaboration to produce a highly parallel genotyping assay for swine. By using a conservative approach, a robust group of SNPs was detected with greater confidence and relatively high MAF that should be suitable for genotyping in a wide variety of commercial populations.
The identification of genes and mutations that lead to genetic variation in complex, economically important traits in livestock has been hindered by the lack of genomic sequence, adequate map density and effective platforms for high density genotyping. It is estimated that linkage disequilibrium extends for hundreds of kilobases in the pig and that 30,000–50,000 SNP would be necessary for whole genome associations in livestock [2, 3]. The availability of livestock genome sequence, a high density of markers, and cost effective SNP genotyping will allow genome-wide association studies in swine. A major limitation to the development of highly parallel genotyping assays for swine is a lack of suitable SNPs for genotyping. To date there are slightly more than 8,400 SNPs for swine in dbSNP, but many of these are clustered into a small number of sequences that do not effectively cover the genome. While many SNP will surely be discovered by genome sequencing, low sequence coverage means that the conversion rate of putative SNP and their minor allele frequencies will not be known until they are tested across populations. In order to identify large numbers of randomly distributed SNPs for swine, we chose to construct a reduced representation library (RRL) to reduce the complexity of the genome and to use massively parallel second-generation sequencing to identify large numbers of high-confidence SNP for high density genotyping on a cost effective platform.
Reduced representation was first used to construct an SNP map in human to scan the genome for haplotypes associated with disease. Recently, reduced representation was coupled with second-generation sequencing technology for SNP detection, estimation of allele frequency and validation in cattle. Reduced representation sequencing has also been used successfully for gene discovery, methylation analysis and genomic characterization of repetitive genomes. Because reduced representation reduces the complexity of the genome being sampled by orders of magnitude, does not require prior knowledge of genome sequence, and samples identical regions dispersed across the genome from different individuals, it is an ideal strategy for SNP discovery in species without a complete genome sequence.
Reduced Representation Library Selection
Table 1. Percent repetitive element content of reduced representation libraries.
GS FLX Sequencing and Assembly
A total of 5,024,039 reads were obtained in 11 runs of the GS FLX instrument producing 1,167,904,923 total bases of sequence (average read length of 232 bp), with 87% of bases having a quality score of 20 or more. Most of the reads (96.4%) started with "CC", as expected for these restriction enzyme digested fragments. Less than 1% of reads had an internal Hae III recognition site (GGCC), indicating that digestion was complete. About 32.6% of the sequence was found to be repetitive DNA by RepeatMasker, consistent with the results obtained from the Sanger sequenced Hae III fragments (see Table 1). The percentage of bases called as "N" was 0.02%.
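The read-level checks reported above (average read length, fraction of reads starting with "CC", fraction containing an internal GGCC site, and fraction of "N" calls) can be sketched as follows. This is a minimal illustration rather than the pipeline used in the study: the function name `summarize_reads` and its return keys are hypothetical, and reads are assumed to be available as plain sequence strings.

```python
def summarize_reads(reads, site="GGCC", expected_start="CC"):
    """Return simple QC summaries for a list of read sequences.

    Hypothetical helper: names and return keys are illustrative only.
    """
    if not reads:
        raise ValueError("no reads supplied")
    reads = [r.upper() for r in reads]
    n_reads = len(reads)
    total_bases = sum(len(r) for r in reads)
    return {
        # mean read length in bases
        "mean_length": total_bases / n_reads,
        # reads beginning with the half-site left by a GG^CC cut
        "frac_expected_start": sum(r.startswith(expected_start) for r in reads) / n_reads,
        # reads still containing an uncut recognition site
        "frac_internal_site": sum(site in r for r in reads) / n_reads,
        # proportion of all bases called as N
        "frac_n": sum(r.count("N") for r in reads) / total_bases,
    }


stats = summarize_reads(["CCATGA", "CCGGCCTT", "ATNN"])
```

Applied to the totals given in the text, 1,167,904,923 bases over 5,024,039 reads gives a mean of about 232 bases per read, which agrees with the reported average read length.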
Prior to the SNP discovery step, the repetitive content of the reads was masked by RepeatMasker and only those reads containing a 50 bp length of non-repetitive sequence were retained. Repetitive sequence near the ends of the reads was trimmed. Using the Newbler unmasked assembly (421,060 contigs) as the reference, the ssaha2 software mapped 2,189,534 (72.0%) of the 3,041,168 repeat-masked and trimmed reads to the reference contigs, placing a total of 468,787,385 bp onto the reference sequence. The repetitive regions of the reference assembly are not expected to have coverage from the repeat-masked reads, and, in fact, 27% of the reference had no reads mapped onto it. For the positions that did have coverage, the average depth was 5.84×. We used ssaha_pileup to detect variation between the mapped reads and the reference, and among the mapped reads.
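The read filter described above can be illustrated with a short sketch. It assumes that repeat-masked bases are represented as "N" (the default hard-masking behaviour of RepeatMasker), so genuine "N" base calls would be treated as masked as well. The function names are hypothetical, and trimming is simplified to removing masked bases only at the extreme ends of a read, which is narrower than the "near the ends" trimming in the text.

```python
import re

MIN_UNMASKED = 50  # from the text: a 50 bp stretch of non-repetitive sequence


def longest_unmasked_run(seq, mask_char="N"):
    """Length of the longest stretch of consecutive non-masked bases."""
    runs = re.split(re.escape(mask_char) + "+", seq)
    return max((len(run) for run in runs), default=0)


def keep_read(seq, min_run=MIN_UNMASKED, mask_char="N"):
    """True if the read has at least min_run consecutive non-masked bases."""
    return longest_unmasked_run(seq, mask_char) >= min_run


def trim_masked_ends(seq, mask_char="N"):
    """Remove masked bases from both ends (a simplification of end trimming)."""
    return seq.strip(mask_char)


reads = ["A" * 60 + "N" * 20, "A" * 49 + "N" + "C" * 49]
retained = [trim_masked_ends(r) for r in reads if keep_read(r)]
```

In this example only the first read is retained, because the second has two 49-base unmasked stretches but no single stretch of 50.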
SNP Genotyping and Validation
Average minor allele frequencies by breed for 168 sampled SNPs (table columns include number of animals and number in pool; rows are breeds, e.g. European Wild Boar).
These results indicate that reduced representation coupled with second-generation sequencing technology detects a large number of valid SNPs at a much lower investment of time and expense than conventional methods. This approach was recently used to discover a large number of SNP in cattle by deep-sequencing RRLs to simultaneously identify SNP and determine their MAF using short reads that were mapped to the bovine genome. This study differs in its use of longer sequence reads, which are more likely to map to the unfinished pig genome sequence or that of related species, provide the ability to design assay primers for many different genotyping platforms and allow the detection of neighboring SNP in the same fragment. Although the rate of detection per base was lower than our initial prediction based on Sanger sequencing of PCR products from pig genomic DNA, the conservative approach of requiring that the variation be seen at least twice to be called polymorphic resulted in a large number of accurately identified SNPs, as reflected by the high success rate in validation, with a false discovery rate of less than 5%. The success rate could probably be further improved by eliminating SNPs residing in homopolymeric regions, because the pyrosequencing method has difficulty with homopolymers, which increases the rate of sequencing errors relative to true SNPs in those regions. Additional sequencing of the Hae III RRL to get deeper coverage would likely uncover more SNPs, but would bias toward those with lower minor allele frequencies. From this resource, assays could be designed from the greater than 50,000 non-repetitive loci (contigs) for high-density genotyping, providing a reasonable distribution of markers across the genome. Although map positions will only be known for those contigs that fall within the sequenced regions of the pig genome, the remaining contig positions should be available in the near future. Alternatively, the contigs could be mapped by linkage or with a high resolution radiation hybrid panel, or by similarity to the completely sequenced human or bovine genomes.
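The suggestion to exclude SNPs in homopolymeric regions was not implemented in this study. One way such a filter might look is sketched below, where a SNP is flagged if the run of identical bases immediately to its left or right reaches a threshold. The function names and the threshold of four bases are assumptions made for illustration and would need to be tuned against validation data.

```python
def flanking_run_lengths(seq, pos):
    """Lengths of the identical-base runs directly left and right of pos (0-based)."""
    left = 0
    i = pos - 1
    while i >= 0 and seq[i] == seq[pos - 1]:
        left += 1
        i -= 1
    right = 0
    j = pos + 1
    while j < len(seq) and seq[j] == seq[pos + 1]:
        right += 1
        j += 1
    return left, right


def near_homopolymer(seq, pos, min_run=4):
    """True if either flanking run is at least min_run long (threshold is an assumption)."""
    return max(flanking_run_lengths(seq, pos)) >= min_run


contig = "ACGTTTTTAGC"
flagged = near_homopolymer(contig, 8)  # SNP position directly after a run of five T
```

The sketch considers each flank separately; a stricter version could also combine the two flanks when they consist of the same base.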
Although SNP are abundant in the genome and are amenable to high-throughput genotyping technology, the identification of a large number of informative SNP spaced over the genome and suitable for whole genome association is a difficult and expensive task. The combination of next-generation sequencing technology and reduced representation of pooled genomes provides a powerful and efficient strategy to discover large numbers of genetic markers in a target population. Sampling several unrelated animals of different breeds and sequencing to sufficient depth for reliable SNP identification made it possible to detect many common SNP present at a high MAF. The SNPs identified in this report will provide a much needed resource for genetic studies in swine and will contribute to the development of a high density, cost effective genotyping platform for swine.
Design and Construction of Reduced Representation Libraries
The libraries were designed to generate fragments with a typical length of 450 bp to allow for a 50 bp overlap of reads from opposite fragment ends, assuming a 250 bp median read length from the GS FLX instrument (454 Life Sciences, Branford, CT, USA). Genomic DNA was extracted from semen of 21 unrelated boars (International Boar Semen, Eldora, IA) representing the seven most predominant industry breeds (four each of Duroc, Landrace, Yorkshire, Large White and Hampshire breeds, three Berkshire and two Pietrain) and from five Duroc-Landrace-Yorkshire cross-bred boars from the United States Meat Animal Research Center (USMARC) resource population. Equal amounts (40 μg) of DNA were pooled and digested overnight, to ensure complete digestion, with 5 U/μg of Alu I, Bst UI, Dra I, Hae III, and Pvu II (New England Biolabs, Beverly, MA, USA) to produce 5 fragment libraries. Each of these enzymes produces fragments with blunt ends, which improves ligation efficiency of adaptors in the preparation of single strand libraries for sequencing on the 454 platform (T. Smith, unpublished data). Digested genomic DNA was fractionated in a 5% polyacrylamide gel immobilized to GelBond film (FMC, Rockland, ME, USA), stained with CyberGold (Molecular Probes, Eugene, OR, USA) and visualized on a DarkReader (Clare Chemical Research, Dolores, CO, USA; see Additional file 1), and a gel section containing digested fragments of approximately 450 bp (427–456 bp) was removed. The gel sections were crushed by centrifugation through an 18-gauge needle hole in a microfuge tube and DNA was eluted from the gel pieces by incubation at 37°C overnight in 0.5 M ammonium acetate and 0.1 mM EDTA.
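The library design can be illustrated in silico. The sketch below cuts a sequence at every Hae III site (GGCC, cut between GG and CC to leave blunt ends), selects fragments in the 427–456 bp window excised from the gel, and computes the overlap expected between reads from opposite ends of a fragment. The function names are hypothetical, and the first and last fragments of a linear input lack a genuine cut site at one end, so a real analysis would treat them separately.

```python
def digest(sequence, site="GGCC", cut_offset=2):
    """Cut at each occurrence of site, cut_offset bases into it (GG^CC for Hae III)."""
    seq = sequence.upper()
    fragments = []
    start = 0
    pos = seq.find(site)
    while pos != -1:
        cut = pos + cut_offset
        fragments.append(seq[start:cut])
        start = cut
        pos = seq.find(site, pos + 1)
    fragments.append(seq[start:])
    return fragments


def size_select(fragments, low=427, high=456):
    """Keep fragments whose length falls in the excised gel window (inclusive)."""
    return [f for f in fragments if low <= len(f) <= high]


def read_overlap(read_length, fragment_length):
    """Bases covered by both reads when sequencing a fragment from each end."""
    return max(0, 2 * read_length - fragment_length)


fragments = digest("AAGGCCTTGGCCAA")  # ['AAGG', 'CCTTGG', 'CCAA']
overlap = read_overlap(250, 450)      # 50 bp, as in the design above
```

Fragments produced by internal cuts begin with "CC", which is consistent with the observation that most GS FLX reads from this library started with "CC".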
To evaluate each enzyme for suitability in large-scale sequencing, samples of fragments generated by each enzyme were cloned into pBluescript, and two 384-well plates of transformed clones from each library were sequenced on an ABI 3730 (Applied Biosystems, Foster City, CA, USA). The repetitive DNA content of the libraries was determined using RepeatMasker (http://www.repeatmasker.org/cgi-bin/WEBRepeatMasker) after running Crossmatch (http://www.genome.washington.edu/UWGC/analysistools/Swat.cfm) to remove vector sequence.
A single strand library was prepared from the Hae III fragments for sequencing on the GS FLX platform as recommended by the manufacturer (454 Life Sciences Corporation, Branford, CT, USA). Six micro-bead sequencing runs of the Hae III library were performed as a service by 454 Life Sciences Corporation (Branford, CT, USA) using a Roche-454 GS FLX sequencer. Five additional machine runs were performed on a Roche-454 GS FLX sequencer at USMARC.
Assembly of the reference sequence and SNP Detection
To generate the reference contigs used for SNP discovery, unmasked sequence reads were assembled using the Newbler algorithm (version 1.1.03) provided with the GS FLX sequencer. Repeat-masked reads were removed from further analysis if they contained fewer than 50 consecutive non-masked bases in one region, leaving 60% of the initial reads. These reads were trimmed of repetitive sequence near their ends prior to either assembly or mapping. The repeat-masked reads were then mapped onto the reference contigs using ssaha2 software and SNP were detected by ssaha_pileup. Putative SNPs were tagged if each of two alleles appeared at least twice and no other alleles were detected; therefore a minimum depth of four reads was necessary in a contig to detect SNPs.
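The calling rule stated above, together with the read-count estimate of minor allele frequency, can be written as a short sketch. It operates on the base calls observed at one reference position and is not the ssaha_pileup implementation; the function name, the returned dictionary and the assumption that gaps and ambiguous calls have already been removed are all illustrative.

```python
from collections import Counter

MIN_ALLELE_OBS = 2  # from the text: each allele must be seen at least twice


def call_snp(bases, min_obs=MIN_ALLELE_OBS):
    """Return a SNP record for one position, or None if the rule is not met.

    bases: iterable of base calls from the reads covering the position
    (assumed to exclude gaps and ambiguous calls).
    """
    counts = Counter(b.upper() for b in bases)
    if len(counts) != 2:
        return None  # monomorphic, or more than two alleles detected
    if min(counts.values()) < min_obs:
        return None  # one allele seen only once
    (major, n_major), (minor, n_minor) = counts.most_common(2)
    depth = n_major + n_minor
    return {
        "major": major,
        "minor": minor,
        "depth": depth,
        # approximate MAF estimated from allele observations in the pool
        "maf": n_minor / depth,
    }


snp = call_snp("AAAAAAGG")  # minor allele G seen in 2 of 8 reads
```

Because the reads come from pooled DNA and depth is modest, a frequency estimated this way is only approximate, which is in line with the moderate correlation (0.58) between predicted and observed MAF reported for the validated SNPs.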
Validation of SNPs
A group of 1,000 SNPs representing differing allele frequencies, positions in the contig and depths of reads was randomly selected from among those that mapped to the finished pig chromosome sequence. An excess of SNPs was selected to allow for multiple SNPs in a contig and to provide enough targets for multiplex assay design. For contigs containing more than one SNP, only one SNP was selected per contig for validation. In addition, SNP were discarded if the contig did not have at least 30 bp on either side of the SNP to allow for amplification primer design. Multiplex assays were designed for the Sequenom MASSARRAY® system using the MASSARRAY® Assay Design software (Sequenom, San Diego, CA, USA). Assays were designed for 192 unique SNP, with thirty-two SNPs in each of six multiplexes. Each amplification primer had a 10-base tag added to ensure that the amplification primer masses were outside the range of the allele masses; amplicon lengths with tags were approximately 90 bp. Reactions were performed using iPLEX™ chemistry as recommended by Sequenom. The SNP were genotyped across a panel of 192 animals, which included the 21 discovery animals and consisted of about 40 animals each for the Duroc, Landrace and Yorkshire breeds, 29 Hampshire, and 2–7 animals each for the Berkshire, Chester White, European Wild Boar, Fengjing, Meishan, Minzhu, Pietrain, Poland China, and Spotted breeds.
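The two candidate filters described above (at most one SNP per contig, and enough flanking sequence for primer design) can be sketched as follows. The sketch reads "at least 30 bp on either side" as a requirement on both sides of the SNP and keeps the first eligible SNP encountered in each contig; both choices, and the function name, are assumptions for illustration, since the text does not state how the single SNP per contig was picked.

```python
def pick_validation_candidates(snps, contig_lengths, min_flank=30):
    """Select at most one SNP per contig with sufficient flanking sequence.

    snps: iterable of (contig_id, position) pairs, position 0-based.
    contig_lengths: dict mapping contig_id to contig length in bp.
    """
    chosen = {}
    for contig_id, pos in snps:
        left_flank = pos
        right_flank = contig_lengths[contig_id] - pos - 1
        if left_flank < min_flank or right_flank < min_flank:
            continue  # too close to a contig end for primer design
        chosen.setdefault(contig_id, pos)  # keep the first eligible SNP only
    return sorted(chosen.items())


lengths = {"c1": 300, "c2": 60, "c3": 200}
candidates = pick_validation_candidates(
    [("c1", 10), ("c1", 100), ("c1", 150), ("c2", 40), ("c3", 5)], lengths
)
```

In this toy input only contig c1 yields a candidate, at position 100; the other SNPs lie within 30 bp of a contig end.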
Additional data files and information
The sequence and nucleotide variation has been submitted to GenBank dbSTS and dbSNP databases. The Accession Numbers are: [GenBank: BV729586 to BV999999, GF000001 to GF089508 and GF089703 to GF091743]. The SNPs are submitted under the handle MARC in batch number 2008-11-06 [GenBank: ss107796326 to ss107911925].
The authors thank Renee Godtel for assistance in library construction, Sue Hauver for DNA preparation and genotyping, Bob Lee for 454 sequencing, Steve Simcox for sequencing preliminary libraries, Sherry Kluver for manuscript preparation, Jim Wray for sequence submissions and the National Swine Registry for providing DNA samples for genotyping. Mention of a trade name, proprietary product or specified equipment does not constitute a guarantee or warranty by the USDA and does not imply approval to the exclusion of other products that may be suitable.
- Du FX, Clutter AC, Lohuis MM: Characterizing linkage disequilibrium in pig populations. Int J Biol Sci. 2007, 3: 166-178.
- Solberg TR, Sonesson AK, Woolliams JA, Meuwissen THE: Genomic selection using different marker types and densities. J Anim Sci. 2008, 86 (10): 2447-2454. 10.2527/jas.2007-0010.
- McKay SD, Schnabel RD, Murdoch BM, Matukumalli LK, Aerts J, Coppieters W, Crews D, Dias Neto E, Gill CA, Gao C, Mannen H, Stothard P, Wang Z, Van Tassell CP, Williams JL, Taylor JF, Moore SS: Whole genome linkage disequilibrium maps in cattle. BMC Genet. 2007, 8: 74. 10.1186/1471-2156-8-74.
- Fahrenkrug SC, Freking BA, Smith TP, Rohrer GA, Keele JW: Single nucleotide polymorphism (SNP) discovery in porcine expressed genes. Anim Genet. 2002, 33: 186-195. 10.1046/j.1365-2052.2002.00846.x.
- Altshuler D, Pollara VJ, Cowles CR, van Etten WJ, Baldwin J, Linton L, Lander ES: An SNP map of the human genome generated by reduced representation shotgun sequencing. Nature. 2000, 407: 513-516. 10.1038/35035083.
- van Tassell CP, Smith TP, Matukumalli LK, Taylor JF, Schnabel RD, Lawley CT, Haudenschild CD, Moore SS, Warren WC, Sonstegard TS: SNP discovery and allele frequency estimation by deep sequencing of reduced representation libraries. Nat Methods. 2008, 5: 247-252. 10.1038/nmeth.1185.
- Timko MP, Rushton PJ, Laudeman TW, Bokowiec MT, Chipumuro E, Cheung F, Town CD, Chen X: Sequencing and analysis of the gene-rich space of cowpea. BMC Genomics. 2008, 9: 103. 10.1186/1471-2164-9-103.
- Meissner A, Gnirke A, Bell GW, Ramsahoye B, Lander ES, Jaenisch R: Reduced representation bisulfite sequencing for comparative high-resolution DNA methylation analysis. Nucleic Acids Res. 2005, 33: 5868-5877. 10.1093/nar/gki901.
- Paterson AH: Leafing through the genomes of our major crop plants: strategies for capturing unique information. Nat Rev Genet. 2006, 7: 174-184. 10.1038/nrg1806.
- Barbazuk WB, Emrich SJ, Chen HD, Li L, Schnable PS: SNP discovery via 454 transcriptome sequencing. Plant J. 2007, 51: 910-918. 10.1111/j.1365-313X.2007.03193.x.
- Brockman W, Alvarez P, Young S, Garber M, Giannoukos G, Lee WL, Russ C, Lander ES, Nusbaum C, Jaffe DB: Quality scores and SNP detection in sequencing-by-synthesis systems. Genome Res. 2008, 18: 763-770. 10.1101/gr.070227.107.
- McKay SD, Schnabel RD, Murdoch BM, Aerts J, Gill CA, Gao C, Li C, Matukumalli LK, Stothard P, Wang Z, van Tassell CP, Williams JL, Taylor JF, Moore SS: Construction of bovine whole-genome radiation hybrid and linkage maps using high-throughput genotyping. Anim Genet. 2007, 38: 120-125. 10.1111/j.1365-2052.2006.01564.x.
- Ning Z, Cox AJ, Mullikin JC: SSAHA: a fast search method for large DNA databases. Genome Res. 2001, 11: 1725-1729. 10.1101/gr.194201.
- Wernersson R, Schierup MH, Jørgensen FG, Gorodkin J, Panitz F, Staerfeldt HH, Christensen OF, Mailund T, Hornshøj H, Klein A, Wang J, Liu B, Hu S, Dong W, Li W, Wong GK, Yu J, Wang J, Bendixen C, Fredholm M, Brunak S, Yang H, Bolund L: Pigs in sequence space: a 0.66× coverage pig genome survey based on shotgun sequencing. BMC Genomics. 2005, 6: 70. 10.1186/1471-2164-6-70.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.