Linkage disequilibrium, persistence of phase and effective population size estimates in Hereford and Braford cattle

Background The existence of moderate to high levels of linkage disequilibrium (LD) between genetic markers and quantitative trait loci (QTL) affecting traits of interest is fundamental for the success of genome-wide association (GWAS) and genomic selection (GS) studies. Knowledge about the extent and the pattern of LD in livestock populations is essential to determine the density of single nucleotide polymorphisms (SNP) required for accurate GWAS and GS. Moreover, observed LD is related to historical effective population sizes (Ne), and can provide insights into the genetic diversity history of populations. Estimates of the consistency of linkage phase across breeds (RH,B) can be used to determine if there is sufficient relationship to use pooled reference populations in multi-breed GS programs. The objective of this study was to estimate LD levels, persistence of phase and effective population size in Hereford and Braford cattle populations sampled in Brazil. Results Mean LD estimates, measured using the squared correlation of alleles at two loci (r2), obtained between adjacent SNP across all chromosomes were 0.21 ± 0.27 for Herefords (391 samples with 41,241 SNP) and 0.16 ± 0.22 for Brafords (2044 samples and 41,207 SNP). Estimated r2 was > 0.2 and 0.3, respectively, for 34 and 25 % of adjacent markers in Herefords, and 26 and 17 % in Brafords. Estimated Ne for Brafords and Herefords at the current generation was 220 and 153 individuals, respectively. The two breeds demonstrated moderate to strong persistence of phase at all distances (RH,B = 0.53 to 0.97). The largest phase correlations were found in the 0 to 50 Kb bins (RH,B = 0.92 to 0.97). Estimated LD decreased rapidly with increasing distance between SNP, however, useful linkage for GWAS and GS (r2 > 0.2) was found spanning to ~50 Kb. Conclusions Panels containing about 50,000 and 150,000 SNP markers are necessary to detect minimal levels of LD between adjacent markers that would be useful for GWAS and GS studies to Hereford and Braford breeds, respectively. Markers are expected to be linked to the same QTL alleles in distances < 50 Kb in both populations due to observed high persistence of phase levels.


Background
The evolution of molecular biology tools and techniques occurred over the last decades helped unveiling underlying genetic factors and improving rates of genetic gains for economically important traits in livestock. The availability of high-density single nucleotide polymorphisms (SNP) genotyping assays for cattle coupled with advances in computational and statistical methods allowed the generation and use of large amounts of genomic data in genome-wide association (GWAS) and genomic selection (GS) studies for production-relevant traits [1]. These advancements created new opportunities to identify and select animals with superior genetic merit, while reducing generation intervals and overall associated costs [2,3].
Existing genome-wide linkage disequilibrium (LD) levels are usually affected by genetic (selection, mutation, drift, migration, and non-random mating) and nongenetic (marker ascertainment bias) factors [4][5][6] and they can reflect historical effective population sizes and rates of recombination in a population [7]. Studies to understand LD levels and structure in livestock species are necessary for revealing diversity levels among breeds and detecting regions of genome that have been historically subjected to different selection pressures [8,9]. Knowledge about historical effective population sizes is also important to determine optimal selection pressures [10] for achieving breeding goals while maintaining acceptable levels of genetic diversity in breeding populations [11]. Estimates of linkage phase consistency across breeds and populations are also essential for determining the potential success of using data from pooled reference populations for multi-breed genomic evaluation and selection programs [12].
Hereford and Braford cattle are important breeds for beef production in southern Brazil, where subtropical climates are observed with average low temperatures in winter months. Local climate conditions naturally control the incidence of tropical ectoparasites such as the bovine tick (Rhipicephalus (Boophilus) microplus) however, infestation peaks can be observed sporadically, raising risks of losses associated with tick-borne diseases [13][14][15]. Development of strategies to implement GS methods for genetic improvement of tick-resistance in these breeds are underway [16], in addition to studies focusing on unrevealing biological mechanisms underlying this trait [17]. Minimal levels of linkage disequilibrium between causative variants and genetic markers are fundamental for performing effective GWAS and GS studies, since these approaches rely on the non-random associations between markers and functional mutations affecting traits of interest [18,19]. Successful experimental designs for performing GWAS and GS with Brazilian Hereford and Braford cattle will be highly dependent of the right choice of marker density, which in term is dependent on the estimated levels of LD across the genome [20].
The objective of this study was to estimate linkage disequilibrium levels at varying SNP densities, persistence of phase and effective population sizes in Brazilian Hereford and Braford cattle populations.

Animal welfare
All experimental procedures involving live animals were approved by the Committee for Ethics in Animal Experimentation from the Federal University of Pelotas (Comissão de Ética em Experimentação Animal, Pelotas, Rio Grande do Sul, Brazil; process number 6849).

Sample collection and genotyping
Samples from a total of 391 Hereford and 2079 Braford (with a breed composition ranging between ½ Hereford + ½ Zebu and ¾ Hereford + ¼ Zebu) cattle born between 2008 and 2010 in commercial farms associated with the Delta G Connection Genetic Improvement Consortium [21] were used in this study. DNA was extracted from blood samples from FTA cards or from cryopreserved semen. Genotyping of all samples was performed with Illumina BovineSNP50v2 (50 K) BeadChip (Illumina Inc., San Diego, USA). Additionally, data from 40 bulls (17 Hereford and 23 Braford) genotyped with Illumina High-Density (HD) Bovine BeadChip Array (Illumina Inc., San Diego, USA) were also included in the final dataset. Genotype calling and initial data quality control (QC) were performed using GenomeStudio software (Illumina Inc. San Diego, CA), according to manufacturer's protocols. Genotypes with a GenCall Score < 0.15 were set as missing genotypes.

Data quality control
Additional QC was performed with R snpStats package [22,23]. Samples with call rates < 0.90, heterozygosity deviations > 3.0 standard deviations, conflicts between declared and genotype-based sex, and duplicated genotypes with different sample identification were removed from the final dataset. Only SNP located on autosomes (BTA) were considered in further analyses. SNP with call rates < 0.98, minor allele frequencies (MAF) < 0.03 and highly significant deviations (P < 10 -6 ) from Hardy-Weinberg Equilibrium (HWE) were also excluded. Moreover, whenever multiple SNP were observed in the same physical map position, only one SNP with highest MAF was retained. The high-density (HD) panel was filtered to select only the SNP also present in the 50K panel.

Haplotype reconstruction and phasing
Haplotype reconstruction and imputation of sporadically missing genotypes (0.39 %) were carried out using FImpute 2.0 [24]. Expectation-Maximization algorithm employed initially estimates the most probable haplotypes considering observed genotypes, using pedigree relationship information. Subsequently, the program performs genotype imputation using a haplotype search based on a sliding window approach, walking along each chromosome and using overlapping windows to reconstruct haplotypes [25].

Linkage disequilibrium analysis
Linkage disequilibrium was calculated as pairwise r 2 [26], which relies on the allele phase information at the gametic level. Considering two marker loci (A and B), each one with two alleles (A1, A2, B1 and B2), the frequency of alleles in the population can be denoted p A1 , p A2 , p B1 and p B2 , and the frequency of haplotypes with allele 1 at marker locus A and allele 1 at locus B, for example, denoted p A1B1 . Following Hill and Roberson [26], For each population, LD values between all pairs of SNP of all chromosomes were binned according to pairwise physical distances into intervals of 100 Kb starting from 0 up to 10 Mb. Average values of r 2 were calculated for each bin.
To evaluate the feasibility of successfully using sparser marker panels in GWAS and GS studies, average r 2 between adjacent markers was calculated for different marker densities, sequentially removing SNP from the total dataset using every second, fourth, fifth, sixth, seventh and 14 th marker (using, respectively, 50, 25, 20, 17, 14 and 7 % of available SNP). Linkage disequilibrium estimates were calculated according to Badke et al. [27], using R scripts [23] available at https://www.msu.edu/~steibelj/ JP_files/LD_estimate.html.

Intra and inter chromosomal and breed heterogeneities
To investigate intra and inter chromosomal and breed variation in LD, two analysis of covariance with general linear models were fitted. The following linear model (Equation 2) was used to estimate the effects of physical distance (covariate), chromosome, breed and the chromosome x breed interaction on LD through a total of 82,356 adjacent breed-specific marker pairs: were r 2 ijk was the observed LD over marker distance d k on chromosome i of breed j, μ is the overall mean of r 2 ijk across markers pairs, c i is the effect of chromosome i, b j is the effect of breed j, 1 is the regression coefficient on marker distance, d Ã k is the adjusted marker distance log 10 d k −log 10 d À Á , d k is the observed physical distance for marker pair k, d is the average physical distance between markers, and e ijk is the residual effect. Distances were log 10 transformed in an attempt to linearize the relationship between LD and the logtransformed distance [28].
Additionally, a more complex linear model was fit to all 12,911,174 syntenic SNP pairs to investigate the adjusted mean r 2 in a broader range of inter marker distances than that observed in Equation 2, where just adjacent markers were considered. This more comprehensive model can be represented by: In addition to the variables already described above for Equation 2, here 2 and 3 are the regression coefficients (quadratic and cubic, respectively) on marker distance. Although the log 10 transformation of physical distances to linearize the relationship between LD and the logtransformed distances, we decided to fit these higher order coefficients to consider possible deviations of the expected linearity and interaction with breed and chromosome effects (cβ 1i , cβ 2i , cβ 3i , bβ 1j , bβ 2j , bβ 3j , cbβ 1ij , cbβ 2ij and cbβ 3ij terms).
The physical distances of interest to predict r 2 using Equation 3 were related to the average inter marker distance values observed in some commercial panels available for cattle genomic selection. The density of the considered panels was 150K (GeneSeek Genomic Profiler HD-150K), 80K (GeneSeek Genomic Profiler HD-80K), 50K (Illumina BovineSNP50v2 BeadChip), 20K (GeneSeek Genomic Profiler LD v2), 8K (Illumina BovineLD v.2 BeadChip) and 3K (Illumina Bovine3k BeadChip). Assuming a bovine genome size of 3.000 Mb [29], the target distance values were calculated dividing the genome length in Mb by the number of markers in each panel. As example, for the 150K panel, we divided 3.000 Mb by 150.000 markers and obtained an average inter marker distance (d k ) of 20 Kb. Accordingly, chromosome by breed predicted r 2 values were obtained from estimated parameters of Equation 3 at physical inter marker distances of 20, 38, 60, 150, 375 and 1000Kb. All analysis was performed using R package [23].

Analysis of persistence of phase
The degree of phase concordance between the two breeds for pairs of SNP was calculated according to Badke et al. [27] with the following formula: where R B,H is the correlation of phase between r ij(B) in the Braford (B) population and r ij(H) in the Hereford (H) population, S (B) and S (H) are the standard deviations of r ij(B) and r ij(H) , respectively, r B ð Þ and r H ð Þ are the average r ij across all SNP i and j within interval p for populations B and H, respectively.
Positive r values are expected when two markers are in LD and show equal linkage phase in both studied populations [30]. Marker pairs were binned according to marker distances (intervals of 100 Kb starting from 0 up to 10 Mb), and average values of R B,H were calculated for each bin, using markers common to both breeds.

Estimation of effective population size
Estimates of LD decay in relation to different SNP distances were used to infer past effective population sizes (N e ) of the two studied cattle breeds. The relationship between r 2 and N e can be expressed by the formula: where c is the genetic distance between two markers expressed in Morgans [31]. Effective population size was estimated considering each SNP pair located within 100 Mb of the same chromosome, with physical distances between SNP converted to genetic distances, assuming 1 Mb = 1 cM [32,33]. Because generations are discrete and distances between SNP are continuous, historical effective population size (N et ) for a given generation t = 1/2c [34] in the past was assessed by selecting SNP pairs with map distance within corresponding ranges of c values. When applied in t = 1/2c, the resulting t value was rounded to the target generation. For example, r 2 of all SNP pairs with distance between 33.3 cM (t = 1.5) and 1 M (t = 0.5) of the same autosomal chromosome were selected to calculate N e at t = 1. Then, within each bin, the average values of r 2 and c were obtained and applied in the formula: for 0.0 < r 2 < 1.0. Longer c ranges were used to define generation bins as we moved further in the past, because they correspond to shorter distances with fewer markers and we wanted to ensure sufficient numbers of SNP pairs for reliably estimating N et within each bin [35]. The actual bins were of one generation for t between 1 and 10, e.g. a range from 0.5 to 1.5 for the current generation; five generations for t between 15 and 100, and of 50 generations for t between 150 and 1000 (see Table 1 for additional details).

SNP quality control
Data QC excluded 43 samples with call rates < 0.90, 24 samples with heterozygosity deviations > 3.0 standard deviations, eight samples with incorrect sex assignment and eight duplicated genotypes with different sample identification. The final resulting dataset contained 2435 samples (391 Hereford and 2044 Braford). Some samples were excluded for not meeting more than one criteria of QC. Marker QC also removed 4232 SNP with call rates < 0.98, 5712 SNP with MAF < 0.03, and 1342 SNP with highly significant HWE deviations (P < 10 -6 ). Estimated means for call rate, MAF and HWE were 0.998, 0.271, and 0.541, respectively. Additionally, 34 monomorphic markers in the subset of Hereford samples were excluded from subsequent analyses of this breed. The final resulting dataset contained a total of 41,241 autosomal SNP (75.52 %) in Brafords and 41,207 autosomal SNP (75.46 %) in Herefords.

Linkage disequilibrium
Average r 2 ± SD between adjacent SNP across all chromosomes was 0.21 ± 0.27 for Hereford and 0.16 ± 0.22 for Braford. The analysis revealed a rapid decrease in LD in both populations with increasing physical distances (Fig. 1). Herefords showed higher LD than Brafords up to a distance of 1780 Kb. For larger distances, Brafords showed a mean of r 2 slightly higher than Herefords. Average LD for markers at some distance intervals is presented in Table 2. At distances of 50 Kb, average LD was 0.25 ± 0.29 for Herefords and 0.18 ± 0.24 for Brafords. In Brafords, average r 2 decayed faster than in Herefords with the increase of distance, declining to 50 % of its initial value at~5 Kb whereas in Herefords the same decline was observed at~50 Kb. Observed LD was > 0. 2 (Table 4). Using only every fifth marker (20 % of available SNP), the percentage of markers with r 2 > 0.2 decreased from 34 to 15 % in Herefords and from 26 to 10.1 % in Brafords, while the percentage of markers with r 2 > 0.3 decreased from 25 to 8.3 % in Herefords, and from 17 to 4.9 % in Brafords.

Intra and inter chromosomal heterogeneity and differences between breed in r 2
Average distances between SNP in different chromosomes were similar, and average physical distances between adjacent markers on Hereford and Braford autosomes was 61 Kb. For all chromosomes, average r 2 between adjacent SNP was larger in Herefords than ) revealed that physical distance, breed and chromosome have significant effects on r 2 (P < 0.001), whereas the interaction between breed and chromosome was not significant. Therefore, predicted marginal (least square) means ± SE of r 2 were obtained for each chromosome averaged across breeds, chromosome by breed interactions and adjacent SNP distances (Fig. 2). Squared correlation estimates considering all SNP pairs (Equation 3) revealed significant effects of linear, quadratic and cubic physical distance coefficients, breed, chromosome, and all interactions (P < 0.001). Predicted r 2 values at specific distances for Herefords and Brafords are shown in Fig. 3 for chromosomes with the greatest, lowest and intermediate LD averages (BTA6, BTA23 and BTA10, respectively).

Effective population size
Although current estimated effective population size (Fig. 4) for Brafords (N e = 220) is larger than for Herefords (N e = 153), different recent N e trends were observed for the two studied populations. Braford´s decreasing historical N e trend was reversed and sharply increased in the last two generations, while Herefords had an accelerated N e decline starting four generations   ago. Consequently, the relative positions of the breed were changed, restoring the larger past effective size of Braford compared to Hereford, which was observed up to about thirty generations ago, when respective values were 315 and 309 for the two populations.

Persistence of phase
Moderate to strong persistence of phase at all distances (R B,H = 0.53 to 0.97) were observed for both breeds (Fig. 5). Phase correlations decreased rapidly with increasing distances between SNP, as was similarly observed for average r 2 . For markers < 50 Kb apart mean estimated R B,H was 0.92, decreasing to 0.74 and 0.53 at marker distances of 1 and 5 Mb, respectively. Marker phase correlation values of R B,H > 0.9 were found in the 0 to 50 Kb bins (R B,H = 0.92 to 0.97), and the proportion of SNP with reversed r signs was lower, ranging from 5 % at 1 Kb to 34 % at 10 Mb distance.

Discussion
Numerous studies have been conducted to estimate the pattern and extent of LD in different domestic animal species [8,27,28,36]. Obtained results are essential for fine-tuning experimental designs to increase GWAS and GS efficiency and accuracies in studied populations, and can therefore have great impact in realized rates of genetic progress in economically important traits. Linkage disequilibrium patterns and scale within and between populations/breeds can be influenced by several factors such as marker allele frequencies, selection history, population structure and effective size, marker type and density, as well as which LD measure is used [9,32,37,38]. Therefore, these factors should be considered when conclusions are drawn from comparisons between different studies. The choice of r 2 [23] to estimate LD in Hereford and Braford data was based on the parameter´s lower sensitivity to variations in allele frequency [38] and sample size [39,40], when compared to D and D´.
Robust ascertainment bias towards informativeness in taurine breeds has been observed in the 50K panel [41] and has to be considered when interpreting population genetics inferences based on LD estimates. Using a high density marker panel (446,985 SNP) and 795 genotyped Nelore steers, Espigolan et al. [42] observed an overall average r 2 of 0.17.
Lower mean LD estimates were also reported for B. indicus breeds by Villa-Angulo et al. [43] when analyzing a dataset with 31,857 SNP derived from the 50K panel from 487 animals of seventeen taurine and indicine breeds. Reports of estimated larger historical effective population sizes for B. indicus breeds may be representative of differences which occurred during the domestication and selection processes of B. indicus and B. taurus cattle, and offers a plausible explanation for the lower LD levels observed in indicine populations [8,44]. Considering an expected average of 3/8 Zebu contribution to Braford breed composition, the lower average r 2 observed in this population can be a result of those differences reported for B. indicus and B. taurus breeds in terms of past effective population sizes and selection processes during breed formation.
Both studied populations presented an inverse relationship between LD and inter-marker distances, and this decline of LD as a function of distance agrees with other studies based on r 2 estimates in cattle [8,38,42,45]. The Bovine HapMap Consortium [46] reported that, in general, B. indicus breeds had lower r 2 values at short distances and higher r 2 values at longer distances between markers when compared to B. taurus breeds. As the extent of LD at short inter-marker distances reflect the historic effective population size [44]. Rapid decline in average r 2 of Braford compared to the decrease of r 2 in Hereford can be associated to differences in effective population size of the breeds.
Decreases in effective population sizes have been widely observed within the last~50 years in several cattle breeds, mostly due to the advent of artificial insemination which allowed the intense use of fewer males and high selection pressures for specific traits [47]. As demonstrated in Fig. 4, effective population sizes for both breeds have declined over time as reflection of the historical process of domestication and breed formation [46]. Brazilian Brafords were essentially formed with a limited pool of Nelore sires and the increased N e trend observed in the last two generations (Fig. 1) could have resulted from recent efforts to increase diversity of this population through the introduction of foreign lineages formed with Brahman zebu. Conversely, the decreasing N e trend observed in Brazilian Herefords is supported by the known restricted use of few purebred sires observed in most recent generations (Lopa TBP, personal communication).
Moreover, selection can lead to increased interchromosomal LD heterogeneity [48]. Thus, higher LD values observed in BTA6 in Brafords and Herefords in comparison to other chromosomes can be indicative of the Fig. 3 Predicted r 2 as a function of inter-marker distance considering different panels densities. Square symbols pinpoint predicted r 2 at distances of 20, 38, 60, 150, 375 and 1000 Kb corresponding, respectively, to average physical distances expected for panels of 150K, 80K, 50K, 20K, 8K, and 3K equally-spaced markers in chromosomes 6 (BTA6), 10 (BTA10) and 23 (BTA23) for Hereford and Braford populations presence of QTL affecting traits that have been under intense selection in both breeds. Evidence of QTL in BTA6 affecting growth traits such as birth weight [49], carcass weight [50] and ribeye area [49], and also feed intake and body weight gain [51,52], have been reported in different cattle breeds. Additionally, the "spotted" locus (S) [53] is also located at BTA6 in the region containing the KIT gene. The S H allele is responsible for Hereford pattern of coat colour (white face, belly, feet and tail, often with a white stripe over the shoulder when homozygous) [53,54]. Since the breed is fixed for this phenotype, it is expected that a historical selection occurred during breed formation fixed the spotted allele at this locus. The absence of breed × chromosome interaction effects obtained by Equation 2 indicates that even though high levels of interchromosomal LD heterogeneity were observed, similar trends were observed in Brafords and Herefords between adjacent markers at the 50K panel  distances. This is in agreement with the common selection objectives applied to both populations analysed in this study [21] and to fact that Herefords are expected to have contributed 5/8 of the Braford genome. Nevertheless, as we considered a wider range of distances and had additional statistical power by including all pairs of syntenic markers in Equation 3, we were able to capture significant differences between breeds that were specific for chromosome and distance.
Knowledge about breed and chromosome-specific variation in LD levels could be used to determine optimal marker density for performing GWAS and GS studies [39,48]. This information could be used to design customized marker panels for Hereford and Braford cattle with different chromosome-specific densities or to choose commercially available panels that would yield the desired LD levels across the whole genome. For example, if a minimal LD value of r 2 = 0.2 is established as a threshold and the 80K panel will be used for genotyping, only BTA6 and BTA10 in Hereford and BTA6 in Braford would be expected to meet the target LD threshold (Fig. 3). To reach average r 2 > 0.2 in all chromosomes, genotyping with 150K is required for Herefords, while for Brafords even higher densities would be required, such as available with HD commercial panels (as Illumina High-Density Bovine BeadChip Array Illumina 777K or Affymetrix Axiom Genome-Wide BOS 1 Array 650K). Alternatively, animals genotyped with panels below the desirable density could be imputed to higher densities, provided that reference haplotype panels are available for the respective populations [55].
The across-population, or across-breed, accuracy of GS estimates based on prediction equations derived from a specific reference population depends basically on the persistence of LD phase across populations, which reflects their genetic relationships [56]. Even if r 2 values close to 1 are observed across populations, if a large proportion of SNP are in reversed phase the result of selection for these markers will lead to antagonistic responses [12]. Our estimates of phase correlation indicated that markers in LD at distances lower than 50 Kb in Herefords show similar levels of LD in Brafords, and a high proportion of these SNP share the same linkage phase. These results could be expected, since Brafords are composites with a contribution of 62.5 % of the Hereford breed. The correlation of phase values between populations proportionally decreases with the extent of divergence between the populations. To find markers that are in LD with QTL across diverging breeds, such as Australian Angus and New Zealand Jersey populations, de Roos et al. [12] concluded that a panel of approximately 300,000 marker would be required. In our case, due to the genetic similarity of Hereford and Braford breeds, 50K panel data can be pooled into a single reference population for performing GS and GWAS studies and results can subsequently be applied to either breed.

Conclusions
Our results indicate that at least 50K and 150K of evenly spaced SNP are necessary to effectively perform GWAS and GS studies, in Hereford and Braford respectively. For distances < 50 Kb, SNP are expected to be linked to the same QTL alleles in both breeds, due to high persistence of phase; therefore, Hereford and Braford data can be pooled into a single reference population for multibreed GS and GWAS studies performed with the above mentioned marker densities.

Ethics approval and consent to participate
Biological samples used for genotyping were collected according the Brazilian Guidelines for Use of Animals for Scientific and Educational Purposes, available at http://www.mct.gov.br/upd_blob/0234/234054.pdf.
There was a written informed consent from cattle owners (Embrapa SAIC 10200.10/01623) to cover the use of their animals and sample collection.