Genetic diversity of a New Zealand multi-breed sheep population and composite breeds’ history revealed by a high-density SNP chip

Brito, Luiz F.; McEwan, John C.; Miller, Stephen P.; Pickering, Natalie K.; Bain, Wendy E.; Dodds, Ken G.; Schenkel, Flávio S.; Clarke, Shannon M.

doi:10.1186/s12863-017-0492-8

Research article
Open access
Published: 14 March 2017

Luiz F. Brito (ORCID: orcid.org/0000-0002-5819-0922)¹,²,
John C. McEwan²,
Stephen P. Miller¹,²,
Natalie K. Pickering³,
Wendy E. Bain²,
Ken G. Dodds²,
Flávio S. Schenkel¹ &
Shannon M. Clarke²

BMC Genetics volume 18, Article number: 25 (2017)


Abstract

Background

Knowledge of the genetic diversity of a population is crucial for the implementation of successful genomic selection and the conservation of genetic resources. The aim of this research was to establish the scientific basis for the implementation of genomic selection in a composite Terminal sheep breeding scheme by providing consolidated linkage disequilibrium (LD) measures across SNP markers, estimating consistency of gametic phase between breed-groups, and assessing genetic diversity measures, such as effective population size (N_e), and population structure parameters, using a large number of animals (n = 14,845) genotyped with a high-density SNP chip (606,006 markers). Information generated in this research will be useful for optimizing molecular breeding value predictions and managing the available genetic resources.

Results

Overall, as expected, levels of pairwise LD decreased with increasing distance between SNP pairs. The mean LD r² between adjacent SNPs was 0.26 ± 0.10. The most recent effective population size was quite variable: 687 for all animals and, per breed-group, 974 for Primera, 380 for Lamb Supreme, 227 for Texel and 125 for Dual-Purpose. The genotyped animals were outbred or had a low average level of inbreeding. Consistency of gametic phase was higher than 0.94 for all breed pairs at the average distance between SNPs on the chip (~4.74 kb). Moreover, there was no clear separation between the breed-groups based on principal component analysis, suggesting that a mixed-breed training population for calculation of molecular breeding values would be beneficial.

Conclusions

This study reports, for the first time, estimates of linkage disequilibrium, genetic diversity and population structure parameters from a genome-wide perspective in New Zealand Terminal Sire composite sheep breeds. The levels of linkage disequilibrium indicate that genomic selection could be implemented with the high density SNP panel. The moderate to high consistency of gametic phase between breed-groups and overlapping population structure support the pooling of the animals in a mixed training population for genomic predictions. In addition, the moderate to high N_e highlights the need to genotype and phenotype a large training population in order to capture most of the haplotype diversity and increase accuracies of genomic predictions. The results reported herein are a first step toward understanding the genomic architecture of a Terminal Sire composite sheep population and for the optimal implementation of genomic selection and genome-wide association studies in this sheep population.

Background

Sheep farming is of significant economic importance to New Zealand and is represented throughout the country. The variable climates and landscapes have favoured the adoption of a wide diversity of sheep breeds that have adapted and performed well for different breeding objectives (Maternal vs Terminal) under a range of production systems (e.g. intensive vs extensive). Although there are a significant number of purebred sheep farms, over time the New Zealand sheep industry has been characterized by a high and increasing proportion of composite breeds and crossbred animals [1, 2]. As described by Blair [1], New Zealand sheep farmers are largely focused on the profitability of their stock rather than on raising solely purebred animals.

Genomic selection (GS) [3] has played an important role in increasing profitability in livestock species by improving selection efficiency. The success of GS depends on many factors, such as the extent of linkage disequilibrium (LD, the non-random association of alleles at different loci) across the genome, which may vary between breeds/populations. The history of the population under selection and its genetic diversity have implications for the long-term success of a breeding program (the genetic gains per generation that can be achieved) and determine the cost-effective tools/ways to apply GS (e.g. SNP chip density) [4]. Over the last 30 years, several composite breeds, such as Primera and Lamb Supreme, have been developed in New Zealand to meet a commercial need; however, their genetic diversity is still unknown and their breeding history has not been fully documented in the scientific literature. Therefore, to enable GS and characterise the genetic diversity in the New Zealand Terminal Sire composite breeds, a high-density SNP array (606,006 SNPs) was commissioned by FarmIQ™ (joint New Zealand government and industry Primary Growth Partnership) and developed in conjunction with the International Sheep Genomics Consortium (ISGC) and Illumina [5, 6].

The main objectives of this study were: 1) to collate and present the breeding history of new composite breeds widely raised in New Zealand and overseas; and 2) to establish the scientific basis for the implementation of genomic selection in a composite Terminal breeding scheme by: providing consolidated LD measures across SNP markers; estimating consistency of gametic phase between breed-groups; and, estimating other genetic diversity measures relevant for the successful predictions of molecular breeding values (mBVs), such as N_e, pedigree and genomic inbreeding, and population structure. This investigation will also provide fundamental information related to the genomic architecture of this sheep population.

Methods

Genotype data and quality control

There were 14,845 animals of both sexes (7,961 males and 6,884 females) with an HD (Ovine Infinium® HD SNP Beadchip) genotype call rate greater than 95%. The animals were born in 2007–2009 (n = 208), 2010 (n = 3,623), 2011 (n = 3,782), 2012 (n = 2,383), 2013 (n = 2,175) and 2014 (n = 2,674). DNA was extracted mostly from ear punch tissue [7], but also from blood [8] and semen samples. Genotyping was conducted at the AgResearch Animal Genomics Research Laboratory, Mosgiel, New Zealand.

Genotypes were called on the AB system using the Illumina GenomeStudio® software and were coded as the number of A alleles (0, 1 or 2). SNPs were excluded from the analysis if they had a minor allele frequency (MAF) less than 0.01, had a call rate less than 95%, were non-autosomal, had an unknown genomic position on the sheep reference genome assembly version OARv3.1, had duplicated map positions (two SNPs with the same position but different names), had misplaced positions compared to OARv3.1, and/or showed an extreme departure from Hardy-Weinberg equilibrium (p < 10⁻¹⁵). A total of 517,902 SNPs were retained for further analyses after filtering. Following quality control, missing genotypes were minimal (2.16%) and were subsequently imputed using the FImpute software [9]. The analyses were performed for each breed-group separately (Primera, Lamb Supreme, Texel or Dual-Purpose) and using the whole dataset of genotyped animals.
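The two frequency-based filters above (MAF and call rate) can be illustrated with a minimal sketch. This is not the pipeline used in the study: the function name `snp_passes_qc` and the use of `None` for a missing call are assumptions made for illustration, and the positional and Hardy-Weinberg filters are not covered.

```python
def snp_passes_qc(genotypes, min_maf=0.01, min_call_rate=0.95):
    """Check one SNP against the MAF and call-rate thresholds.

    genotypes: A-allele counts (0, 1 or 2) per animal, with None marking
    a missing call (hypothetical encoding chosen for this sketch).
    """
    called = [g for g in genotypes if g is not None]
    if not genotypes or not called:
        return False
    call_rate = len(called) / len(genotypes)
    # Frequency of allele A among called genotypes (two alleles per animal).
    freq_a = sum(called) / (2 * len(called))
    maf = min(freq_a, 1.0 - freq_a)
    return call_rate >= min_call_rate and maf >= min_maf
```

A SNP is kept only when both thresholds are met, which mirrors the exclusion rule stated in the text (MAF below 0.01 or call rate below 95% leads to removal).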

Extent of linkage disequilibrium

The degree of LD between markers was estimated using the squared correlation coefficient (r²) statistic proposed by Hill and Robertson [10], which is the squared correlation between alleles at two loci. It can be expressed as \( r^2=\frac{D^2}{f\left(A_i\right) f\left(B_i\right) f\left(A_j\right) f\left(B_j\right)} \), where f(A_i), f(B_i), f(A_j) and f(B_j) are the observed frequencies of alleles A_i, B_i, A_j and B_j, respectively, and i and j are markers. D was estimated as suggested by Lynch and Walsh [11]: \( D=\frac{N}{N-1}\left[\frac{4N_{AABB}+2\left(N_{AABb}+N_{AaBB}\right)+N_{AaBb}}{2N}-2\times f(A)\times f(B)\right] \), where N is the total number of animals, and N_AABB, N_AABb, N_AaBB and N_AaBb are the numbers of individuals in each genotypic category (AABB, AABb, AaBB and AaBb). When r² is taken between a bi-allelic marker and an (unobserved) bi-allelic quantitative trait locus (QTL), it is the proportion of variation caused by the alleles at the QTL that is explained by the marker [12]; it ranges from 0 (no LD) to 1 (complete LD). The r² for each pair of loci on each chromosome was calculated to determine the LD between adjacent and syntenic SNP pairs. LD (r²) decay over different distances was also investigated.
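As a rough illustration of the two formulas above, the sketch below computes D and r² for one pair of markers from genotypes coded as the number of A alleles. It assumes that the denominator of r² is the product of the frequencies of both alleles at each marker, i.e. p_i(1 − p_i)p_j(1 − p_j), and the function name `composite_ld` is hypothetical; it is not the snp1101 implementation.

```python
def composite_ld(geno_i, geno_j):
    """Return (D, r2) for two markers from unphased genotypes.

    geno_i, geno_j: A-allele counts (0, 1 or 2) for the same animals at
    marker i and marker j, with no missing values.
    """
    n = len(geno_i)
    if n != len(geno_j) or n < 2:
        raise ValueError("need the same animals (at least 2) at both markers")
    p_i = sum(geno_i) / (2 * n)  # frequency of the counted allele, marker i
    p_j = sum(geno_j) / (2 * n)  # frequency of the counted allele, marker j
    pairs = list(zip(geno_i, geno_j))
    n_aabb = pairs.count((2, 2))  # N_AABB
    n_aabh = pairs.count((2, 1))  # N_AABb
    n_ahbb = pairs.count((1, 2))  # N_AaBB
    n_ahbh = pairs.count((1, 1))  # N_AaBb
    weighted = 4 * n_aabb + 2 * (n_aabh + n_ahbb) + n_ahbh
    d = n / (n - 1) * (weighted / (2 * n) - 2 * p_i * p_j)
    denom = p_i * (1 - p_i) * p_j * (1 - p_j)
    if denom == 0:
        raise ValueError("r2 is undefined for a monomorphic marker")
    return d, d * d / denom
```

Because of the N/(N − 1) correction, estimates from very small samples can fall slightly outside the 0 to 1 range; with sample sizes like those in this study the effect is negligible. The sign of D is returned as well, since it is needed when a signed r-value is derived from r².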

Consistency of gametic phase

The consistency of gametic phase was defined as the Pearson correlation of signed r-values between the two breed-groups of a pair. For each marker pair with a measure of r², the signed r-value was determined by taking the square root of the r² value and assigning the appropriate sign based on the calculated disequilibrium (D) value. Data were sorted into bins based on pairwise marker distance to determine the breakdown in the consistency of gametic phase across distances. For each distance bin, the signed r-values were then correlated for all six breed-group pairs. The analyses were performed with the snp1101 software [13].
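The procedure described above can be sketched for a single breed-group pair as follows. The input layout (a list of tuples holding the marker distance and the signed r-value in each breed-group), the bin width and all function names are assumptions made for illustration; the study itself used snp1101.

```python
import math
from collections import defaultdict


def signed_r(d, r2):
    """Square root of r2 carrying the sign of the disequilibrium D."""
    return math.copysign(math.sqrt(r2), d)


def pearson(x, y):
    """Pearson correlation; NaN when either input has no variance."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return float("nan")
    return sxy / math.sqrt(sxx * syy)


def phase_consistency(marker_pairs, bin_width_bp):
    """Correlate signed r between two breed-groups within distance bins.

    marker_pairs: iterable of (distance_bp, r_group1, r_group2).
    Returns {bin_index: correlation} for bins with at least two pairs.
    """
    bins = defaultdict(lambda: ([], []))
    for distance_bp, r_one, r_two in marker_pairs:
        index = int(distance_bp // bin_width_bp)
        bins[index][0].append(r_one)
        bins[index][1].append(r_two)
    return {
        index: pearson(xs, ys)
        for index, (xs, ys) in sorted(bins.items())
        if len(xs) >= 2
    }
```

A correlation close to 1 in a bin indicates that marker pairs at that distance tend to share the same LD phase in both breed-groups, while values near 0 indicate that phase is not preserved.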

Current and ancestral effective population size

To estimate N_e through time, the formula used was \( N_e=\left(\frac{1}{E\left[r^2\right]}-1\right)\frac{1}{4c} \) [14], where c is the average genetic distance in Morgans estimated for each chromosome in the LD analysis (estimated using the snp1101 package) and E[r²] is the expected r² at distance c, calculated as \( E\left[r^2\right]=\frac{1}{1+4N_e c} \). Time is in generations, assuming T = 1/(2c) [15]. N_e was determined from the current generation to 1,000 generations ago.
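The relationship between mean r², genetic distance and N_e can be made concrete with a small sketch. The function name and the input values in the usage note are illustrative assumptions and are not taken from the study.

```python
def effective_population_size(mean_r2, c_morgans):
    """Return (N_e, generations_ago) for one distance class.

    mean_r2: average r2 of marker pairs separated by c_morgans.
    c_morgans: genetic distance between the markers, in Morgans.
    """
    if not 0 < mean_r2 <= 1:
        raise ValueError("mean_r2 must be in (0, 1]")
    if c_morgans <= 0:
        raise ValueError("c_morgans must be positive")
    n_e = (1.0 / mean_r2 - 1.0) / (4.0 * c_morgans)  # N_e = (1/E[r2] - 1) / (4c)
    generations_ago = 1.0 / (2.0 * c_morgans)        # T = 1 / (2c)
    return n_e, generations_ago
```

For example, a mean r² of 0.2 at c = 0.005 Morgans gives N_e = (5 − 1)/0.02 = 200 at about 100 generations ago; substituting back, 1/(1 + 4 × 200 × 0.005) = 0.2, as expected. Shorter distances therefore inform on more ancient N_e, and longer distances on more recent N_e.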

Principal component analysis

To investigate the genomic composition of the population, the principal components were derived from the genomic relationship matrix (G) calculated using all the genotyped animals and all SNPs that passed the quality control process. The G matrix was calculated using the method described by VanRaden [16]: \( G=\frac{\left(M-2P\right)\left(M-2P\right)'}{2\sum p_i\left(1-p_i\right)} \), where M is a matrix of counts of the allele “A” (with dimensions equal to the number of animals by the number of SNPs), p_i is the frequency of allele “A” at the i-th SNP, and P is a matrix (with dimensions equal to the number of animals by the number of SNPs) with each row containing the p_i values. Principal components were calculated using the prcomp function of R [17].
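A minimal pure-Python sketch of the G matrix formula is given below for a toy genotype matrix. It assumes that the allele frequencies p_i are estimated from the genotyped animals themselves, and the function name is hypothetical; a dataset of the size used here would need an optimized linear algebra library, and the principal component step (an eigen-decomposition) is not shown.

```python
def genomic_relationship_matrix(m):
    """Build G = (M - 2P)(M - 2P)' / (2 * sum(p_i * (1 - p_i))).

    m: list of rows, one per animal, each holding A-allele counts
    (0, 1 or 2) for every SNP, with no missing values.
    """
    n_animals = len(m)
    n_snp = len(m[0])
    # Allele A frequency per SNP, estimated from the data (assumption).
    p = [sum(row[i] for row in m) / (2 * n_animals) for i in range(n_snp)]
    denom = 2 * sum(p_i * (1 - p_i) for p_i in p)
    if denom == 0:
        raise ValueError("all SNPs are monomorphic")
    # Centred genotypes: each entry of M minus twice the allele frequency.
    z = [[row[i] - 2 * p[i] for i in range(n_snp)] for row in m]
    return [
        [sum(a * b for a, b in zip(z[j], z[k])) / denom for k in range(n_animals)]
        for j in range(n_animals)
    ]
```

With three animals genotyped as [2, 0, 1], [0, 2, 1] and [1, 1, 1], all p_i equal 0.5, the scaling term is 1.5, and the first two animals obtain a diagonal value of 4/3 and a relationship of −4/3 with each other.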

Pedigree and genomic inbreeding coefficients

Both pedigree (F_PED) and genomic inbreeding coefficients in this population were estimated and compared. Pedigree information was available from 243,486 individuals born from 1990 to 2014 and F_PED was calculated using the Meuwissen and Luo [18] algorithm. Genomic inbreeding was calculated as:

1) Inbreeding coefficient based on excess of homozygosity (F_EH; PLINK software [19]): \( \frac{1}{m}\sum_{i=1}^{m}\left[1-\frac{c_i\left(2-c_i\right)}{p_i\left(1-0.5p_i\right)}\right] \), where m is the number of SNPs, p_i is the minor allele frequency at locus i and c_i is the genotype call (0, 1 or 2).
2) Diagonal of VanRaden’s G matrix minus 1 (F_VR): the genomic relationship matrix was calculated as in VanRaden [16] and F_VR was calculated as the diagonal element minus 1 for each individual.
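The second measure, F_VR, only needs the diagonal of G, so it can be computed per animal without building the full matrix. The sketch below assumes allele frequencies estimated from the genotyped animals and uses a hypothetical function name; it illustrates the definition and is not the software used in the study.

```python
def genomic_inbreeding_vanraden(m):
    """Return F_VR (diagonal of VanRaden's G minus 1) for each animal.

    m: list of rows, one per animal, each holding A-allele counts
    (0, 1 or 2) for every SNP, with no missing values.
    """
    n_animals = len(m)
    n_snp = len(m[0])
    # Allele A frequency per SNP, estimated from the data (assumption).
    p = [sum(row[i] for row in m) / (2 * n_animals) for i in range(n_snp)]
    denom = 2 * sum(p_i * (1 - p_i) for p_i in p)
    if denom == 0:
        raise ValueError("all SNPs are monomorphic")
    return [
        sum((row[i] - 2 * p[i]) ** 2 for i in range(n_snp)) / denom - 1.0
        for row in m
    ]
```

Negative values arise when an animal is more heterozygous than expected from the allele frequencies, which is consistent with the slightly negative averages reported for the genomic coefficients.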

Results

Genotypes

The 517,902 SNP markers that passed quality control spanned about 2.45 Gb of the genome, with an average distance of 4.74 kb between adjacent SNPs, which varied between chromosomes (ranging from 4.50 kb in OAR11 to 4.84 kb in OAR10). Figure 1 presents the number of SNP per chromosome and chromosome length, indicating that SNPs were uniformly distributed across the genome. The number of SNP per chromosome ranged from 58,074 (OAR1, longest chromosome; 42.01 Mb) to 9,191 (OAR24, shortest chromosome; 27.56 Mb). The maximum gaps between adjacent SNPs were observed on OAR5 (305.58 kb), OAR10 (357.01 kb) and OAR13 (343.36 kb). The distribution of MAF of the SNPs after quality control is given in Fig. 2 and the MAF distribution per breed group is shown in Fig. 3. The mean MAF (± SD) over all genotyped animals was 0.255 ± 0.136 and for the breed-groups Primera, Lamb Supreme, Texel and Dual-Purpose was 0.254 ± 0.137, 0.248 ± 0.141, 0.249 ± 0.140 and 0.245 ± 0.143, respectively. SNPs were found to have a broad range of MAF (Fig. 2). The distribution of the MAF shows that the proportion of SNPs with high polymorphism (MAF > 0.3) after quality control was 39.27%. The mean expected heterozygosity (H_e) for all the genotyped animals was 0.346 (±0.009) and ranged from 0.249 to 0.383. H_e (± SD) was 0.350 (±0.006), 0.346 (±0.011), 0.340 (±0.007) and 0.332 (±0.010) for Primera, Texel, Lamb Supreme and Dual-Purpose, respectively.

Genetic resources

The sheep population under investigation is predominantly focused on breeding for faster growth, higher carcass yield, survival and improved meat quality. The majority of the genotyped animals were progeny of Terminal Sire composites and Texel mated to a variety of maternal/dual-purpose breeds. The main breeds involved were Lamb Supreme, Primera, Texel, Romney, Coopworth, Landmark and Highlander. Due to the lack of literature for some of the composite breeds, we collate a brief history of them, presented in Additional file 1.

Genomic and pedigree inbreeding

Pedigree (F_PED) and two genomic (F_EH, F_VR) inbreeding coefficients by year of birth were calculated (Table 1). Pedigree inbreeding had the highest average values of the three inbreeding coefficient measures. The average F_PED was 0.002 ± 0.009 and ranged from 0.000 to 0.277. The average F_PED was 0.014 for the sires and 0.012 for the dams. The average F_PED for the inbred animals (F_PED > 0) was 0.029. The genomic inbreeding coefficients based on excess of homozygosity (F_EH) and on the G matrix (F_VR) were −0.008 ± 0.031 (range: −0.079 to 0.301) and −0.009 ± 0.027 (range: −0.093 to 0.328), respectively. The correlation between F_PED and genomic inbreeding was 0.27 (F_EH) and 0.36 (F_VR), and the correlation between F_EH and F_VR was 0.51. Some individuals had high genomic inbreeding but zero pedigree inbreeding, which is attributed to incomplete pedigree information. This highlights another advantage of genomic information for breeding programs.

Table 1 Mean inbreeding coefficients (± SD) and inbreeding range per year
