Linkage disequilibrium compared between five populations of domestic sheep

Background The success of genome-wide scans depends on the strength and magnitude of linkage disequilibrium (LD) present within the populations under investigation. High density SNP arrays are currently in development for the sheep genome, however little is known about the behaviour of LD in this livestock species. This study examined the behaviour of LD within five sheep populations using two LD metrics, D' and x2'. Four economically important Australian sheep flocks, three pure breeds (White Faced Suffolk, Poll Dorset, Merino) and a crossbred population (Merino × Border Leicester), along with an inbred Australian Merino museum flock were analysed. Results Short range LD (0 – 5 cM) was observed in all five populations, however the persistence with increasing distance and magnitude of LD varied considerably between populations. Average LD (x2') for markers spaced up to 20 cM exceeded the non-syntenic average within the White Faced Suffolk, Poll Dorset and Macarthur Merino. LD decayed faster within the Merino and Merino × Border Leicester, with LD below or consistent with observed background levels. Using marker-marker LD as a guide to the behaviour of marker-QTL LD, estimates of minimum marker spacing were made. For a 95% probability of detecting QTL, a microsatellite marker would be required every 0.1 – 2.5 centimorgans, depending on the population used. Conclusion Sheep populations were selected which were inbred (Macarthur Merino), highly heterogeneous (Merino) or intermediate between these two extremes. This facilitated analysis and comparison of LD (x2') between populations. The strength and magnitude of LD was found to differ markedly between breeds and aligned closely with both observed levels of genetic diversity and expectations based on breed history. This confirmed that breed specific information is likely to be important for genome wide selection and during the design of successful genome scans where tens of thousands of markers will be required.


Background
Mapping genes of interest within animal genomes has been a lengthy and expensive task. In the past, the technique of choice has been within family linkage analysis, requiring the construction of large multigenerational ped-igrees. A faster and more economical way to narrow the genetic interval surrounding a gene of interest is through whole genome scans and linkage disequilibrium (LD) mapping. The power of LD mapping lies in its ability to exploit historical recombination within populations of unrelated animals to track the sequence variations which contribute to phenotypic variation. Linkage disequilibrium refers to the ability of an allele from one marker to predict the allelic status at a second marker. The extent of LD serves to inform the number of markers required for a whole genome scan. A population with extensive LD will require a lower marker density as large tracts of the genome will be redundant to those surrounding it. Conversely if LD persists over short distances many more markers will be required to obtain the same power to detect association. Recombination events, population dynamics including drift and admixture as well as breed selection bottlenecks all serve to influence the extent of LD. With this in mind, it is important to quantify the extent of LD within different breeds as this is likely to have an impact on the success of gene mapping experiments.
The potential application of LD has prompted investigation into its magnitude and persistence within a number of livestock species including cattle [1][2][3][4], pig [5,6] and sheep [7]. A common finding is significant LD extending across tens of centimorgans. The majority of these studies have examined only one or two breeds, however recent studies in cattle have compared LD between multiple breeds [8,9]. In addition, an investigation comparing five divergent canine breeds which revealed marked differences between populations and a wide range in breed specific LD decay [10]. Sheep breeds represent a broad spectrum of both population history and phenotypic attributes. The process of sheep domestication began approximately 9000 years ago [11] and subsequent selection has occurred for such diverse traits as environmental tolerance, wool characteristics, milk yield and meat production. The result is formation of more than 1400 breeds [12]. The focus of this study was to sample multiple populations of sheep reflecting different population histories and to use microsatellites to measure the magnitude and significance of linkage disequilibrium across one ovine chromosome (OAR 18). By extrapolating the LD measured across a single chromosome to that present in the whole genome, the study aimed to provide a guide to minimum marker spacing for whole genome scans and to examine the impact of breed selection on such undertakings.

Genetic Diversity and Population Structure
A total of 555 animals from five ovine populations were genotyped at 28 microsatellite loci. The mean amount of missing data per locus across all populations was 3.8% (WFS 2.7%; PD 3.1%; MER 3.2%; MxB 6.8%; EMAI 2.1%). Information describing the chromosomal location and the polymorphism observed at each marker is contained within Additional file 1. Analysis of genetic diversity within the five populations (Table 1) showed the Merino (MER) contained the highest genetic diversity as measured by average number of alleles observed per locus (A N = 8.13), gene diversity (H E = 0.70) and allelic richness (A R = 8.13). The MER also appeared the most distinct as measured by private allelic richness (pA R = 0.58). The closed population of Macarthur Merinos (EMAI) contained the lowest amount of diversity, with estimates of A N (3.03) A R (3.13) and pA R (0.09) less than half that of the next lowest population (Table 1). Comparison with previous estimates of sheep gene diversity [13] reveal that the commercial Merino used in this study was amongst the most diverse and the Macarthur Merino were approximately equivalent to the least diverse of ovine populations.
The level of relatedness between ovine populations was investigated by calculation of pair-wise F ST (Table 1). The smallest value was observed between the White Faced Suffolk (WFS) and Poll Dorset (PD) (F ST = 0.035), indicating of the five groups analysed, these two are the most closely related. The next lowest F ST was observed between the MER and MxB (F ST = 0.043). This is likely a reflection of the common Merino contribution to both populations. The highest F ST values were observed for every pair-wise combination of populations which included the EMAI animals. A cluster based method was used to estimate the minimum number of sub-populations (K) required to explain the total sum of genetic variation observed [14]. Figure 1 illustrates four sub-populations (K = 4) differentiated the MER, MxB and EMAI as distinct populations, however the fourth cluster contains both the WFS and PD. The undifferentiated genetic unit containing both the WFS and PD is in keeping with the low F ST reported for these breeds and is also consistent with breed history, as the White Faced Suffolk was founded in part by the Poll Dorset. Cluster analysis also illustrated subpopulation diversity. Figure 1 shows the EMAI group as a solid green block which is almost completely free from contribution of other sub-populations whilst MER appears to be a more heterogeneous subpopulation.

Linkage Disequilibrium Analysis Using x 2'
Linkage disequilibrium was estimated for all marker pairs using the metric x 2' , a standardised chi-square statistic suitable for use with multi-allelic markers [15]. The values of x 2' derived from chromosome 18 marker pairs were plotted as a function of increasing genetic distance ( Figure  2). Figure 2 shows x 2' derived from syntenic marker pairs (green circles) exceeded the average derived from nonsyntenic markers (orange line) for closely spaced markers in each of the five populations tested. For example, average LD for markers separated by less than 5 cM in WFS (x 2' = 0.167 ± 0.076) was well above the average observed using non-syntenic markers in the same population (x 2' = 0.099 ± 0.047; Figure 2, Table 2). Short range LD was observed in all five populations, however LD was observed to persist over larger chromosomal distances in some populations. Average LD for markers spaced up to 20 cM exceeded the non-syntenic average within the WFS, PD and EMAI populations (Table 2, Figure 2). When x 2' was compared against the 5% threshold for significant LD (red line, Figure 2), many fewer marker pairs display both the magnitude and significance which exceeds the critical level. This was particularly evident in MER and MxB where less than 9% of marker pair combinations had x 2' which exceeded the 5% threshold. The threshold limits applied here (0.05 -0.15, Table 2) did not appear unrealistically high when compared to those applied in commercial chicken (x 2' range 0.07 -0.25) [16].
The proportion of microsatellite pairs in significant LD was determined as a function of genetic distance ( Table  3). As expected, the proportion of significant LD decreased with increasing genetic distance. At short distances (< 5 cM), a high proportion of marker pairs displayed significant LD (Table 3). The highest proportion was observed within Macarthur Merinos (1.00) and lowest with the commercial Merino (0.56). In each population, with the exception of MER, the proportion of marker pairs in significant LD exceeded the non-syntenic fraction for pairs up to 20 cM apart.

Rate of LD Decay Compared Between Breeds
To examine the decline in LD, decay with distance was modelled and plotted (black line, Figure 2) and the coefficient of decay (b j ) used to quantify this curve for each population. The value for b j is inversely proportional to the extent of LD, meaning high values of b j indicate a low persistence of disequilibrium with distance [16]. Table 2 shows the maximum decay coefficient was observed within the MER (b = 9.015) followed by the MxB (4.875), PD (1.066), WFS (0.802) and EMAI (0.239).

Linkage Disequilibrium Analysis Using D'
Linkage disequilibrium was estimated for all marker pairs using D' (see Additional files 2, 3, 4). This facilitated comparison with x 2' (this study) and the only other investigation of LD in sheep which employed D' [7]. The magnitude of D', plotted as a function of genetic distance, revealed the expected decline with increasing distance was only clearly evident in the Macarthur Merino and Poll Dorset populations (Additional file 2). Examples of strong LD (D' > 0.5) can be seen at long range (> 30 cM) in several populations, consistent with previous studies in sheep and other livestock species [1,4,7]. Comparison of D' against the 5% critical threshold for significance revealed low levels of average LD in the MER and MxB, even over short genetic distances (red line, Additional file 2). Estimation of D' between non-syntenic marker pairs Cluster analysis of five sheep populations  Table 1), b j was approximately the same in all five groups of animals (0.027 -0.031; Additional file 3).

Predictions for Genome Wide Association Studies
The chance of detecting LD between a marker and QTL was estimated given the observed levels of marker-marker LD (x 2' ) using a probabilistic relationship (see equation 4). This used the proportion of marker pairs which display LD in a given range (LD R ) to estimate the probability of detecting marker-QTL LD (P R ). A genome scan performed using unrelated animals and markers spaced at 2 cM intervals is predicted to identify 99% of QTL within the Macarthur Merino population (LD R = 0.58; mR = 5; T set to x 2' > 0.2; calculation 1, Table 4). The probability of detecting the same QTL within commercial Merinos was dramatically lower at 25% (LD R = 0.06, Table 4). For WFS and PD, the probability remained high at 91% and 80% respectively (LD R = 0.39 and LD R = 0.58, Table 4). The same equation was used to estimate the number of markers required to achieve a 95% probability of detecting LD between a marker and QTL (P R = 0.95; calculation 2, Table   4). For the population which displayed the highest rate of LD decay (b = 9.015), a total of 35,000 markers would be required at 0.1 cM intervals across the genome. This minimum marker number is reduced 5 -8 fold when the other commercial sheep populations are considered ( Table 4). The predictions for genome wide association studies were revisited with population specific LD thresholds taken from the 5% critical value. This served to lower the LD threshold in all populations and as a result the probability of finding QTL and the minimum marker spacing distance increased in most populations ( Table 4). The trends observed between populations remained the same.

Discussion
The magnitude of linkage disequilibrium (LD) and its decay with distance was measured within five sheep populations across a single chromosome (OAR18). Studies which use multi-allelic markers to measure LD in livestock species have mainly calculated D' [1,2,5], however more recent investigations have promoted use of the metric x 2' [15,16]. Comparison between metrics in this study revealed the average magnitude of D' was higher than x 2' for a given genetic distance (Table 2 and Additional file 3) and many more marker pairs had elevated values (LD > 0.60) using D'. This variance between measures has been reported previously and likely reflects the theoretical expectation that rare alleles and unobserved haplotypes Mean values for x 2' (standard deviation) were calculated following classification of marker pairs into distance bins. The number of both syntenic and non-syntenic marker pairs used for the calculation of mean x 2' are given for each population. The x 2' value which corresponds to the 5% level of significance is given for each population. This appears as a horizontal red line in Figure 2. The decay of LD with distance is quantified using b j (formula 3).
tend to inflate D' but not x 2' [16][17][18]. The inflation of D' values also appeared between non-syntenic (NS) marker pairs. For the five sheep populations tested, 0 -14% of NS pairs had D' > 0.5. When NS LD was calculated using x 2' however, 0 -1.6% of marker pairs reported x 2' > 0.5, a nine fold reduction in apparent NS LD. This difference is smaller than the dramatic reduction reported by [16], where a 100 fold decrease in NS LD was observed within commercial chicken populations. The nine fold reduction observed in this study is still important, as artificially high levels of background LD are expected to result in a proportionate increase in the rate of false positive associations reported for whole genome scans. The conclusion is therefore that D' is to be avoided as it tends to reduce the power Linkage disequilibrium (x 2' ) as a function of genetic distance  Table 2. The decay of LD modelled as a function of distance according to formula 3 is shown using black diamonds. Two significance thresholds are indicated using horizontal lines. The first represents the average x 2' value obtained between non-syntenic marker pairs (orange line) while the second represents the 5% significance threshold (red line).
to identify true association where marker spacing is either dense (fine mapping) or sparse (current microsatellite based genome scans).
Only one previous investigation reported on the level of LD found within sheep populations [7]. These authors described high LD extending over tens of centimorgans and highlighted the sensitivity of D' to both rare alleles and marker heterozygosity. Comparison with this study necessitated the use of D', and direct comparison between the studies should be treated with caution due to differ-ences in sample size, breed, population structure and the molecular markers used. A common finding to both investigations was of significant LD extending across large genetic distances. The proportion of marker pairs in significant LD persisted well above the NS-LD rate for distances up to 20 cM or more within some, but not all, of the populations tested here (Additional file 4). This lends support to the original finding of [7] by showing some sheep populations contain extensive LD.
The behaviour of LD, measured with the x 2' metric, was found to differ markedly between breeds. Table 2 quantifies this difference by reporting a wide range of solutions to b j , the coefficient of LD decay, for the five populations. LD decayed fastest within the commercial Merino (b = 9.02). Conversely, LD persisted over the largest distance and decayed slowest within the Macarthur Merino population (b = 0.24). This neatly fits both the known breed history for each population and the objective measures of genetic diversity (Table 1). For example, the Merino is an old breed, the foundation of which in Australia is known to contain contributions from numerous European, Asian and African breeds [19,20]. The levels of allelic richness and gene diversity observed place the breed amongst the most diverse sheep populations tested to date (Table 1) [13]. The finding that this high level of diversity coincided with the sharpest decline in LD suggests historic recombination and a large effective population size are likely to be responsible. At the other extreme, the Macarthur Merinos have been maintained as a closed museum flock. The animals are descendants of a small number of rams imported into the Australian colonies by John Macarthur in the early 19 th century [21]. The very low estimates of genetic diversity observed support anecdotal information indicating that little or no introgression into the flock has occurred. The persistence of LD over large distances was The number of marker pairs with significant x 2' (p < 0.05) is given before the total number of markers tested for each bin and population. The proportion is given in brackets. In calculation 1, the threshold (T) was set to 0.2 or the empirically derived 5% significance threshold for each population. This allowed the value for LD R to be taken from the dataset and used to calculate P R . The range (R) was set to 0 -5 cM in each case and mR = 5. In calculation 2, P R was set to 0.95 in each case and the thresholds used were the same as for calculation 1 which resulted in use of the same values for LD R . This allowed the number of markers (mR) for size range (R = 5) to be calculated. mR was converted into the required marker spacing in cM (M) and the total number of markers required for a genome scan (Total M) for each population. The calculation was not applicable (na) where LD was equal to 1.
therefore not surprising and suggests a small effective population may have acted to preserve LD. The White Faced Suffolk (WFS) and Poll Dorset (PD) had intermediate coefficients of decay (WFS b = 0.802; PD b = 1.066; Table 2). In the past 100 years, both populations have undergone bottlenecks during breed formation. The WFS was developed during the 1970s in an attempt to remove the black pigmentation from the head and legs of the Suffolk [22]. Similarly, the PD was developed from the Dorset beginning in the 1930s with the aim to select against horns. In each case, breed foundation necessarily reduced the effective population size [22]. The result is a reduction in the number of haplotypes observed compared with the commercial Merino and an intermediate decline in LD as a function of distance. It is also possible selection may also played a role in generating the observed differences in LD. Each of the closely spaced microsatellites reside in a genomic region known to harbour loci which influence muscularity [23,24]. This is an important consideration given some of the breeds have been selected for muscularity (WFS, PD) more intensively than others (eg MER). Taken together, the comparison between populations indicate that LD behaves in a breed specific manner and that simple indices of genetic diversity appear to serve as predictors.
The extent of LD observed within each population was used to make predictions about marker spacing and the likelihood of detecting QTL in genome wide association studies. Table 4 shows that, dependant on the population used, microsatellite markers are required at 0.1 -2.5 centimorgans intervals to detect QTL with high confidence. This suggests LD mapping within closed populations containing low diversity, such as long term selection lines, can be successfully performed using the existing set of approximately 1500 microsatellites [25,26]. Populations in which LD decays much more sharply will require many more microsatellites than currently available, with approximately 35,000 required for LD mapping within the commercial Merino (Table 4). Given the prohibitively high cost associated with genotyping such a large number of microsatellites, future genome wide association experiments will utilize SNP markers. It was not possible to draw any conclusion regarding the number of SNP which will be required, due to differences in information content, mutation rate and genomic distribution when SNP are compared with microsatellites. The microsatellite based projections should be considered with caution as they rely on certain assumptions. Foremost amongst these is that the magnitude and significance of LD observed across chromosome 18 is representative of the entire ovine genome. Several studies have demonstrated considerable variation in LD between chromosomes in human [27], cattle [18], deer [17] and pig [5]. The projections were also reliant on a low level of statistical significance and the requirement for only modest levels of LD between markers. Association studies which used these thresholds would likely have a high rate of false positive findings and fail to detect QTL with small effects. In addition, the extent of LD may vary significantly along the length of individual chromosomes, creating LD 'holes' which display very low levels of LD in the presence of tightly spaced markers [27]. Finally, marker -marker LD has been considered the equivalent of marker -QTL LD. Comparison between metrics revealed x 2' best reflects marker -QTL LD [15] however the current analysis does not consider sample size or the size of QTL effects. The frequency and severity of these phenomena are yet to be described within the ovine genome, meaning this study is likely to be calibrated by subsequent experimentation using high density genome wide SNP panels.

Conclusion
Knowledge concerning the behaviour of LD is important for performing genome wide association analysis and the emerging objective of genomic selection. Genomic selection involves the prediction of molecular estimated breeding values (mEBV) based on markers spread across the genome [28]. The major finding of this study is that the magnitude and significance of LD varies markedly between sheep populations. This makes information concerning LD between breeds important. For example, a molecular EBV generated within one breed (eg Poll Dorset) may have limited use in a second breed where the structure of LD is different (eg Merino). Conversely, Poll Dorset derived mEBVs are likely to have higher accuracy within closely related breeds which share a similar LD structure (eg White Faced Suffolk). The characterisation of LD across OAR 18 within these historically and genetically different sheep breeds also has implications for association mapping, confirming that tens of thousands of markers will be required for genome scans.

Animal Resources
The study consisted of 460 Australian commercial sheep from four populations; White Faced Suffolk (WFS; n = 84), Poll Dorset (PD; n = 122), Merino (MER; n = 126) and Merino × Border Leicester (MxB; n = 128). Animals were selected from between 3 and 11 properties across Australia to ensure the recruitment of as many unrelated individuals as possible. The MER is a wool breed, the PD and WFS meat breeds and the MxB a terminal composite which has been selected for both wool and meat production. A fifth population was also included in the study. The Elizabeth Macarthur Agricultural Institute Merinos (EMAI; n = 95) are maintained as descendants of the original nineteenth century Macarthur Merinos and are a single, closed flock. DNA from the WFS, PD, MER and MxB was prepared from whole blood using QIAamp DNA mini kits (QIAGEN, Australia) following the manufacture's instructions, whilst DNA from each EMAI animal was extracted using standard phenol/chloroform methods.

Marker Selection and Genotyping
Two panels of microsatellites were used. Microsatellite panel 1 (MSP1) consisted of nineteen markers selected to span 113 cM of ovine chromosme (OAR) 18. Marker locations (in cM) were taken from the CompLDB integrated map [26]. The average distance separating marker pairs was 6.2 cM, with the smallest interval 0 cM and the largest 30.5 cM. Panel 2 (MSP2) was composed of nine microsatellites, each located on different autosomes, plus hh47 from MSP1. MSP2 was used to estimate levels of non-syntenic LD. The forward primer of each marker pair was fluorescently labelled and after multiplex PCR was performed, the products were separated using an ABI 3130 × l Genetic Analyser (Applied Biosystems, USA). GeneMapper v3.7 software (Applied Biosystems, USA) was used for allele sizing and binning. The name, genomic location, observed allelic size range and polymorphism associated with each marker is presented in Addition file 1.

Genetic Analysis of Genetic Diversity
Four indices of genetic diversity were used to compare the amount of diversity within each ovine population. Calculations of gene diversity (H E ), average number of alleles per locus (A N ), allelic richness (A R ) and private allelic richness (pA R ) were performed using the complete data set (MSP1 and MSP2) in HP-RARE v1.0 [29]. FSTAT 2.9.3.2 http://www2.unil.ch/popgen/softwares/fstat.htm was used to evaluate population relatedness using pair-wise estimates of F ST . The presence of population substructure was investigated using MSP2 data and an admixture ancestry model-based clustering method as implemented in STRUCTURE v2.2 [14]. Three replicates of one to five subpopulations (K = 1 -5) were performed using 50,000 Markov chain steps after a burn-in period of 20,000 steps.

Analysis of Linkage Disequilibrium
Two measures were considered. The first metric, x 2' (formula 1), has recently been proposed as the measure of choice for use with multi-allelic markers such as microsatellites [15]. The second metric, D' (formula 2), was first described by Hedrick [30] as a multi-allelic extension of Lewontin's D' ij [31]. D' was implemented by the only other published study to empirically measure ovine LD [7].
where, and D ij = P(A i B j ) -P(A i )P(B j ) where P(A i ) is the frequency of allele i at marker A, P(B j ) is the frequency of allele j at marker B. N is the population size and n is the number of alleles at the marker with the smaller number of alleles. Both x 2' and D' require two-marker haplotype frequency estimation. This was performed using the Expectation-Maximisation (EM) algorithm and 20 initial conditions for each of 5000 permutation tests. The maximum likelihood estimate of haplotype frequencies was then used to estimate D' and x 2' . The EM algorithm, D' and associated p-value calculations were implemented in PyPOP release 0.6.0 [32] whilst the calculation of x 2' was performed with R statistical software [33]. LD derived from non-syntenic marker pairs was used to determine the critical levels of significance for each metric and population. This was achieved by ranking the p-values and selecting the LD value corresponding to the 5% significance threshold in each population. Theory states LD is negatively correlated with genetic distance [34]. This principle was examined graphically by plotting each metric as a function of distance (in centimorgans). The decay in LD was quantified by fitting the following formula to the observed data. [16] LD ij = 1/(1 + 4b j d ij ) + e ij where LD ij is the LD between microsatellite pair i of population j, separated by genetic distance (in cM) d ij , and where b j expresses LD decay with distance for population j, and e ij equates to the model residual. Parameter b j was calculated using the nls function set in R.

Predictions for Genome Wide Association Analysis
Calculations regarding genome wide association studies were made using formula 4. The proportion of marker pairs within a given cM distance range (R) which had x 2' values exceeding a defined threshold (T) was termed LD R . The number of markers in this range was denoted MR and the probability of finding QTL with LD > T with at least one marker in the given range is (P R ). The relationship between each is given in formula 4 as: [16] P R = 1 -(1-LD R ) MR Two separate questions were addressed (reported as calculation 1 and 2, Table 4). Firstly, the probability of detecting a QTL was estimated given observed levels of LD within each population. For this calculation, marker spacing was assumed to be 2 cM, as this is the approximate situation in sheep (1,500 microsatellites and genome size of 3,500 cM [26]). At 2 cM intervals, a randomly positioned QTL would be within 5 cM of approximately 5 markers (ie for distance range (R) 0 -5 cM; number of markers (MR) = 5). The value of LD R was determined empirically where T was set to either x 2' > 0.2 or the 5% critical threshold for significance. T > 0.2 represents the threshold estimate of detecting QTL between SNP taken from [28]. Zhao and colleagues [15] illustrated that the metric of measuring SNP LD, r 2 and x 2' are comparable. The second question examined the number of markers (MR) required to obtain a 95% probability of detecting QTL given the observed magnitude of LD in each population (ie R = 0 -5 cM; P R = 0.95). The number of markers was converted into the total required for a genome scan assuming a genome size of 3,500 cM.