High-resolution haplotype block structure in the cattle genome
© Villa-Angulo et al. 2009
Received: 05 June 2008
Accepted: 24 April 2009
Published: 24 April 2009
The Bovine HapMap Consortium has generated assay panels to genotype ~30,000 single nucleotide polymorphisms (SNPs) from 501 animals sampled from 19 worldwide taurine and indicine breeds, plus two outgroup species (Anoa and Water Buffalo). Within the larger set of SNPs we targeted 101 high density regions spanning up to 7.6 Mb with an average density of approximately one SNP per 4 kb, and characterized the linkage disequilibrium (LD) and haplotype block structure within individual breeds and groups of breeds in relation to their geographic origin and use.
From the 101 targeted high-density regions on bovine chromosomes 6, 14, and 25, between 57 and 95% of the SNPs were informative in the individual breeds. The regions of high LD extend up to ~100 kb and the size of haplotype blocks ranges between 30 bases and 75 kb (10.3 kb average). On the scale from 1–100 kb the extent of LD and haplotype block structure in cattle has high similarity to humans. The estimation of effective population sizes over the previous 10,000 generations conforms to two main events in cattle history: the initiation of cattle domestication (~12,000 years ago), and the intensification of population isolation and current population bottleneck that breeds have experienced worldwide within the last ~700 years. Haplotype block density correlation, block boundary discordances, and haplotype sharing analyses were consistent in revealing unexpected similarities between some beef and dairy breeds, making them non-differentiable. Clustering techniques permitted grouping of breeds into different clades given their similarities and dissimilarities in genetic structure.
This work presents the first high-resolution analysis of haplotype block structure in worldwide cattle samples. Several novel results were obtained. First, cattle and human share a high similarity in LD and haplotype block structure on the scale of 1–100 kb. Second, unexpected similarities in haplotype block structure between dairy and beef breeds make them non-differentiable. Finally, our findings suggest that ~30,000 uniformly distributed SNPs would be necessary to construct a complete genome LD map in Bos taurus breeds, and ~580,000 SNPs would be necessary to characterize the haplotype block structure across the complete cattle genome.
Rapid improvements in high-throughput single nucleotide polymorphism (SNP) discovery and genotyping technologies are making many thousands of SNP markers available for genome-wide association studies [1–5]. High-resolution linkage disequilibrium (LD) maps and characterizations of haplotype block structure are being generated for different organisms, confirming that elucidating the fine-scale structure of LD at the population level is crucial for understanding the highly non-linear association between genes and phenotypic traits, such as complex diseases and quantitative trait loci (QTL) [6–8].
Initial studies in humans [9, 10] demonstrated that, by investigating regions for evidence of recombination and LD patterns, it was possible to parse the human genome into haplotype blocks, and that those blocks shared just a few common haplotypes. This result provided impetus for the construction of LD and haplotype maps of the human genome. Furthermore, haplotype block structure appears to be conserved across mammals.
Recently, high-resolution LD and haplotype block maps were generated for humans using a set of 3.1 million SNPs genotyped in 270 individuals from four geographically diverse populations. Overall, 98.6% of the assembled genome is within 5 kb of the nearest polymorphic SNP. Analysis of these high-resolution data is helping to infer, with great precision, information about population history, recombination and mutation rates, and evidence of positive selection, and is providing invaluable information for gene-disease association studies.
An initial bovine study reported characterization of haplotype blocks in Holstein-Friesian cattle using a 15 K SNP chip with an average intermarker spacing of 251.8 kb. Another study reported haplotype block structure for 14 European and African cattle breeds using 1,536 SNPs. This study had an average resolution of 311 kb intermarker distance and was focused mainly on chromosome 3. Recently, the Bovine HapMap Consortium generated an assay of 30 K SNPs and genotyped 501 animals sampled from 19 worldwide taurine (Bos taurus) and indicine (Bos indicus) breeds, plus two outgroup species (Anoa and Water Buffalo). In this article we present the characterization of LD and haplotype block structure across 101 high-density targeted regions from the bovine HapMap data, spanning 7.6 Mb of the genome with an average intermarker distance of ~4 kb. The extent of LD is presented along with the estimation of ancestral population size for different generations. In a first level of analysis, haplotype block characterization allowed us to elucidate the breed-specific block structure and its variability compared with all other breeds. In a second level of analysis, haplotype block density correlation, haplotype block boundary comparison, and haplotype sharing between breeds and subgroups helped us to elucidate high-resolution similarities between breeds, and also permitted us to distinguish breeds related by geographic separation from those related by shared ancestry. Finally, breeds were clustered according to genetic distances computed from the haplotype block analysis.
Using the filtered data set (see Methods section for Quality Control filters) from the Bovine HapMap Consortium, consisting of 31,857 markers from 487 animals sampled from 19 cattle breeds (see Additional file 1), we selected the three chromosomes having the highest number of SNP markers, BTA 6, 14, and 25, and performed an analysis of high-density regions on these chromosomes. High-density regions were originally genotyped on chromosomes 6 and 14 based on evidence of QTL, and on chromosome 25 based on a lack of known QTL (see Methods section). The high-density regions were defined as non-overlapping genomic windows of 100 kb containing 10 or more markers and a maximum gap between markers of 20 kb. We identified 101 such high-density regions covering a total genomic distance of 10.1 Mb (see Additional file 2). The effective region covered (the regions between markers) is 7.6 Mb and contains 1,981 markers in total, an average of one marker every ~4 kb. The following sections discuss the haplotype block structure of these 101 high-density regions.
In general, African and indicine breeds exhibited lower MAF values. It could be thought that this is due to an ascertainment bias in the SNP discovery, because all targeted SNPs in this study were originally derived by comparison between a Hereford assembly and sequence reads from a series of bacterial artificial chromosomes (BACs) constructed from Holstein DNA. However, analysis of variation among the major cattle breeds free from SNP ascertainment bias demonstrated a higher genetic diversity in indicine compared to taurine breeds. In the targeted regions, MAF values ranged from a maximum of 0.253 (Holstein) to a minimum of 0.116 (Nelore), a difference of about 28% of the full scale of 0.0 to 0.5. The average decay in MAF between breeds was 1.51% (see Additional file 4). Furthermore, we compared the proportion of polymorphic SNPs in the selected regions with the proportion of polymorphic SNPs in the entire HapMap data set and found a 20% higher proportion in the complete HapMap data than in the selected regions.
The 1,981 SNPs in the high-density regions were used to evaluate the extent of pairwise LD as a function of physical distance. The complete set of SNPs (31,857) was used to estimate the effective population size in the previous 10,000 generations for each breed. A pair of haplotypes was inferred for each sample using the software fastPHASE version 1.2.3, which provided imputed haplotypes for missing genotypes where necessary.
After adjusting r² for sample size error (see Methods section), we estimated the effective population size over the 10,000 previous generations (assuming a generation time of six to seven years). This estimation was based on the observation that, in a population with constant effective population size N, the approximate expectation of r² is $E(r^2) \approx 1/(1 + 4Nc)$, where N is the effective population size 1/(2c) generations in the past, E(r²) is the average of r² values for all SNPs within a specified range, and c is the median of the range in Morgans (we assumed 1 cM ~1 Mb) [15, 19–22].
Haplotype blocks based on r² were estimated using the definition discussed in the Methods section. Additional file 7 details the block characteristics for all breeds. In summary, the average maximum number of markers per block was 27.16. Across all breeds, 34.7% of the high-density regions were covered by haplotype blocks. We found that mean block size varied from 5.7 to 15.67 kb across breeds (with a mean block size of 10.3 kb over all breeds) and an average of 3.8 markers per block. These results are similar to those in a recent study of human haplotype blocks, which reported haplotype block sizes averaging 7.3, 13.2, and 16.3 kb in three human populations when analysing ten 500-kilobase regions with a density of one SNP per ~5 kb. The human data showed a marked decline in LD over the range of 1–100 kb, again similar to our observed decline in cattle LD from 0.6 to 0.1 over the range 1–100 kb.
From this and the results in the previous section, if we assume that the observed average r² of ~0.1 at 100 kb and the average haplotype block size of ~10 kb, with one informative SNP every ~5 kb, are homogeneously distributed across the bovine genome, then an LD map for association studies requires at least one tag SNP per 100 kb. We therefore estimate that at least 28,700 SNPs would need to be successfully assayed to build an LD map for association studies. In the same way, at least 574,000 SNPs (one per ~5 kb) would need to be assayed to characterize the haplotype block structure across the entire bovine genome (assuming a bovine genome size of 2.87 Gb).
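The arithmetic behind these two estimates can be checked with a minimal sketch. It assumes, as the text does, a 2.87 Gb genome and evenly spaced markers; the function and variable names are illustrative only.

```python
GENOME_SIZE_BP = 2_870_000_000  # bovine genome size assumed in the text (2.87 Gb)


def snps_required(genome_size_bp: int, spacing_bp: int) -> int:
    """Number of evenly spaced SNPs needed for one SNP per `spacing_bp`."""
    return genome_size_bp // spacing_bp


# One tag SNP per 100 kb for an LD map.
ld_map_snps = snps_required(GENOME_SIZE_BP, 100_000)
# One informative SNP per ~5 kb for haplotype block characterization.
block_map_snps = snps_required(GENOME_SIZE_BP, 5_000)

print(ld_map_snps, block_map_snps)
```

Rounded, these are the ~30,000 and ~580,000 SNP figures quoted in the summary.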
Average haplotype block density correlations from all breeds within the group and outside the group.
Proportions of block boundary discordances and concordances among cattle subgroups
Columns: NR – REC (%); REC – NR (%)
Rows: Beef vs Dairy; Beef vs Indicus; Beef vs Composite; Beef vs African; Dairy vs Indicus; Dairy vs Composite; Dairy vs African; Indicus vs Composite; Indicus vs African; Composite vs African
Normalized proportion of shared haplotypes
In this work we present a high-resolution characterization of haplotype block structure in cattle. The analysis was performed on 101 targeted genomic regions spanning 7.6 Mb with an average density of one SNP every ~4 kb, sampled from 19 worldwide breeds. We studied LD and elucidated the block structure for each specific breed. Consistent with previous analyses in cattle, and in close agreement with observations in humans, we observed that LD declines rapidly, such that r² averages ~0.1 at 100 kb, and haplotype blocks exhibit an overall mean size of 10.3 kb (varying from 5.7 kb to 15.57 kb across all breeds) with an average of 3.8 markers per block. Estimation of effective population size in previous generations reflects the period of domestication ~12,000 years ago, as well as the current population bottleneck that breeds have experienced worldwide (last ~700 years) as a result of population isolation and selective breeding. In addition, analyses of block density correlations, block boundary discordances, and haplotype sharing across all breeds and between subgroups were consistent in showing a clear differentiation between indicus, African, and composite subgroups, but not between dairy and beef subgroups.
In summary, this work presents the first high-resolution analysis of haplotype block structure in worldwide cattle samples. First, novel results show that cattle and humans share a high similarity in LD and haplotype block structure on the scale of 1–100 kb. Second, unexpected similarities in haplotype block structure between dairy and beef breeds make them non-differentiable. Finally, our results suggest that it would be necessary to successfully assay ~30,000 SNPs to construct an LD map for association studies, and ~580,000 SNPs to characterize the haplotype block structure across the entire bovine genome.
The data used for this analysis correspond to the BTA4.0 assembly of the Bovine HapMap consortium database. It includes genotypes from 501 animals on a set of 32,826 markers. Animals were sampled from 19 cattle breeds and two outgroups, Anoa and Water Buffalo (see Additional file 1). All breeds belong to the taurus and indicus subspecies of Bos taurus, and represent several different geographical regions: N'Dama and Sheko are African breeds; Angus, Hereford, and Red Angus are British beef breeds; Charolais, Limousin, Piedmontese, and Romagnola are European beef breeds; Guernsey and Jersey are British dairy breeds; Brown Swiss, Holstein, and Norwegian Red are European dairy breeds; Brahman, Nelore, and Gir are indicus breeds; Beefmaster and Santa Gertrudis are composites of taurine-indicine origin. Individuals were selected to be unrelated for at least 4–5 ancestral generations, with the exception of 44 trios of sire, dam and offspring included to allow quality control of the data and to assist in the determination of allelic phase relationships. The DNA samples were taken from whole blood or cryopreserved semen.
To ensure the overall quality of samples and a consistent set of genotypes, QC filters were applied to the initial data. The filters included removal of all genotypes that had >20% missing genotypes, that violated Hardy-Weinberg frequency distribution, or that violated Mendelian inheritance. Data were also removed for all animals with genotype completeness <98%, for markers with estimated genotyping error >5% and at least one breed out of Hardy-Weinberg equilibrium, as well as markers that were monomorphic for all breeds, markers with minor allele frequency <0.05 among all breeds, markers containing >2 discordant trios, and markers assigned to an unknown chromosome. After this QC procedure, the data set contained 31,857 markers from 487 animals, and excluded Anoa and Water Buffalo.
In addition to previous QC filters, we removed monomorphic SNPs breed by breed in order to avoid the analysis of uninformative data.
In order to facilitate the study of haplotypes extended over multiple markers, we focused on the regions of the bovine genome that had the highest density of markers in the HapMap data set. We focused exclusively on chromosomes 6, 14, and 25, which were selected for additional genotyping due to the presence of known QTL of interest in chromosomes 6 and 14, and the absence of known QTL on chromosome 25.
Chromosome 25 therefore served as a control for studies focusing on high-density regions. For this study, we defined high-density regions as non-overlapping genomic windows of 100 kb containing 10 or more markers and a maximum gap between markers of 20 kb. This definition identified 101 high-density regions that contained a total of 1,981 markers, yielding an average density of 19.61 markers per region. The average distance between adjacent high-density regions on the same chromosome was 1.46 Mb, but they were not evenly spaced. There were 31 instances in which two adjacent high-density regions were contiguous on the chromosome.
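The window definition above can be sketched as follows. The text does not say how windows were placed, so this sketch assumes a fixed grid of 100 kb windows starting at coordinate 0 and measures gaps between consecutive markers inside each window; the function name and return format are hypothetical.

```python
from collections import defaultdict

WINDOW_BP = 100_000   # window size from the text
MIN_MARKERS = 10      # minimum markers per window from the text
MAX_GAP_BP = 20_000   # maximum gap between markers from the text


def high_density_windows(positions, window_bp=WINDOW_BP,
                         min_markers=MIN_MARKERS, max_gap_bp=MAX_GAP_BP):
    """Return (start, end, markers) for each qualifying window on one chromosome.

    Assumption: windows tile the chromosome on a fixed grid from position 0.
    """
    bins = defaultdict(list)
    for pos in sorted(positions):
        bins[pos // window_bp].append(pos)

    regions = []
    for index in sorted(bins):
        markers = bins[index]
        if len(markers) < min_markers:
            continue
        # Gaps between consecutive markers inside this window.
        gaps = [b - a for a, b in zip(markers, markers[1:])]
        if max(gaps, default=0) > max_gap_bp:
            continue
        regions.append((index * window_bp, (index + 1) * window_bp, markers))
    return regions


# Toy data: 13 markers spaced 8 kb apart, plus two isolated markers.
dense = list(range(100_000, 200_000, 8_000))
sparse = [250_000, 290_000]
for start, end, markers in high_density_windows(dense + sparse):
    print(start, end, len(markers))
```

A sliding-window placement would find somewhat different regions; the fixed grid is used here only because it makes the non-overlap condition trivial.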
Pairwise LD between two SNPs was measured by r²:
$$r^2 = \frac{(p_{11} - p_1 q_1)^2}{p_1 p_2 q_1 q_2}$$
where $p_1$ and $p_2$ are the minor and major allele frequencies at SNP 1 respectively, $q_1$ and $q_2$ are the minor and major allele frequencies at SNP 2 respectively, and $p_{11}$ is the frequency of observing both minor alleles in the same individual across the whole population.
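As a minimal sketch of this statistic, the function below computes r² for one SNP pair, assuming the standard form r² = D²/(p1·p2·q1·q2) with D = p11 − p1·q1. It also assumes that p11 is counted on phased haplotypes coded 0/1, with 1 marking the minor allele; the function name and the coding are illustrative choices.

```python
def r_squared(hap_snp1, hap_snp2):
    """r^2 between two SNPs from phased haplotypes coded 0/1 (1 = minor allele)."""
    if len(hap_snp1) != len(hap_snp2) or not hap_snp1:
        raise ValueError("haplotype lists must be non-empty and of equal length")
    n = len(hap_snp1)
    p1 = sum(hap_snp1) / n   # minor allele frequency at SNP 1
    q1 = sum(hap_snp2) / n   # minor allele frequency at SNP 2
    p2, q2 = 1.0 - p1, 1.0 - q1
    # Frequency of haplotypes carrying both minor alleles.
    p11 = sum(1 for a, b in zip(hap_snp1, hap_snp2) if a == 1 and b == 1) / n
    denominator = p1 * p2 * q1 * q2
    if denominator == 0.0:
        raise ValueError("r^2 is undefined when either SNP is monomorphic")
    d = p11 - p1 * q1        # disequilibrium coefficient D
    return d * d / denominator


print(r_squared([1, 1, 0, 0], [1, 1, 0, 0]))  # identical columns: complete LD
print(r_squared([1, 1, 0, 0], [1, 0, 1, 0]))  # independent columns: no LD
```

The monomorphic check mirrors the breed-by-breed removal of monomorphic SNPs described earlier, since such SNPs make the denominator zero.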
The effective population size was estimated from the approximate expectation of r²:
$$E(r^2) \approx \frac{1}{1 + 4Nc}$$
where N is the effective population size 1/(2c) generations in the past, E(r²) is the average of r² values for all SNPs within a specified range, and c is the median of the range in Morgans [19–22]. To compute N for each breed, the number of previous generations was first selected. Then, c was computed in Morgans and taken as the median of the range (using a range of 10 kb and an approximation of 1 cM ≈ 1 Mb). The adjusted r² values were averaged for all SNP pairs within the range across all 29 autosomes. We estimated N for 10 to 10,000 previous generations by using the complete set of SNPs (31,857 SNPs), since the set comprising just the targeted high-density regions only permitted the estimation of N for 5,000 to 10,000 previous generations.
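A minimal sketch of this estimate, assuming the widely used approximation E(r²) ≈ 1/(1 + 4Nc), which rearranges to N ≈ (1/E(r²) − 1)/(4c). The sample-size adjustment of r² is not reproduced here, and the function names are hypothetical. The input of 0.1 at 100 kb is the overall average quoted in the text, so the printed value only illustrates the arithmetic and is not a breed-specific estimate.

```python
def distance_to_morgans(distance_bp, bp_per_cm=1_000_000):
    """Convert physical distance to Morgans, assuming 1 cM ~ 1 Mb as in the text."""
    return distance_bp / bp_per_cm / 100.0


def effective_population_size(mean_r2, c_morgans):
    """Solve E(r^2) = 1 / (1 + 4 N c) for N."""
    if not 0.0 < mean_r2 <= 1.0:
        raise ValueError("mean r^2 must lie in (0, 1]")
    if c_morgans <= 0.0:
        raise ValueError("recombination distance must be positive")
    return (1.0 / mean_r2 - 1.0) / (4.0 * c_morgans)


def generations_ago(c_morgans):
    """Markers c Morgans apart reflect N roughly 1/(2c) generations in the past."""
    return 1.0 / (2.0 * c_morgans)


c = distance_to_morgans(100_000)  # 100 kb under the 1 cM ~ 1 Mb assumption
print(effective_population_size(0.1, c), generations_ago(c))
```

Shorter marker distances give smaller c and therefore reach further back in time, which is why the densely spaced regions alone inform only the most distant generations.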
Haplotype blocks were defined by the following algorithm: (i) begin a block by selecting the pair of adjacent SNPs with the highest r² value (no less than α = 0.4); (ii) repeatedly extend the block if the average r² value between an adjacent marker and the current block members is at least β (= 0.3) and all the pairwise r² values within the block are at least γ (= 0.1).
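The two steps can be sketched as a greedy procedure over a symmetric matrix of pairwise r² values for SNPs in map order. Several details are assumptions rather than part of the stated definition: blocks are seeded repeatedly among SNPs not yet assigned, a block may grow in either direction, and when both neighbours qualify the one with the higher average r² is added first. The function name is hypothetical.

```python
ALPHA, BETA, GAMMA = 0.4, 0.3, 0.1  # thresholds from the block definition


def find_blocks(r2, alpha=ALPHA, beta=BETA, gamma=GAMMA):
    """Return blocks as (first_index, last_index) over SNPs in map order."""
    n = len(r2)
    free = [True] * n
    blocks = []
    while True:
        # Step (i): best remaining adjacent pair with r^2 >= alpha.
        seeds = [(r2[i][i + 1], i) for i in range(n - 1)
                 if free[i] and free[i + 1] and r2[i][i + 1] >= alpha]
        if not seeds:
            break
        _, lo = max(seeds)
        hi = lo + 1
        # Step (ii): extend while a flanking SNP meets both conditions.
        while True:
            best = None
            for candidate in (lo - 1, hi + 1):
                if not (0 <= candidate < n and free[candidate]):
                    continue
                values = [r2[candidate][m] for m in range(lo, hi + 1)]
                mean = sum(values) / len(values)
                if mean >= beta and min(values) >= gamma:
                    if best is None or mean > best[0]:
                        best = (mean, candidate)
            if best is None:
                break
            lo, hi = min(lo, best[1]), max(hi, best[1])
        for m in range(lo, hi + 1):
            free[m] = False
        blocks.append((lo, hi))
    return sorted(blocks)


example_r2 = [
    [1.0, 0.9, 0.5, 0.0, 0.0],
    [0.9, 1.0, 0.6, 0.0, 0.0],
    [0.5, 0.6, 1.0, 0.05, 0.0],
    [0.0, 0.0, 0.05, 1.0, 0.8],
    [0.0, 0.0, 0.0, 0.8, 1.0],
]
print(find_blocks(example_r2))
```

Checking the minimum r² between the candidate and every current member is enough to keep all pairwise values in the block at or above γ, because members already in the block passed the same check when they were added.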
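Confidence intervals for the mean block size were computed as
$$\bar{x} \pm t_{\alpha/2,\,n-1}\,\frac{s}{\sqrt{n}}$$
where $\bar{x}$ denotes the sample mean block size, $s$ denotes the sample standard deviation, $n$ denotes the sample size, and $t_{\alpha/2,\,n-1}$ denotes the percentile of a t distribution with n−1 degrees of freedom.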
To determine if the haplotype block structure in high-density regions is conserved among breeds, we counted the number of haplotype blocks occurring in each of the 101 high-density regions for each breed, producing a 101-element vector for each breed.
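The correlation between the vectors of two breeds was computed as
$$\rho_{i,j} = \frac{\sum_{k}(x_{i,k} - \bar{x}_i)(y_{j,k} - \bar{y}_j)}{\sqrt{\sum_{k}(x_{i,k} - \bar{x}_i)^2 \, \sum_{k}(y_{j,k} - \bar{y}_j)^2}}$$
where i and j represent two breeds, k represents a high-density region, $x_{i,k}$ and $y_{j,k}$ represent the number of haplotype blocks found in region k for breeds i and j respectively, and $\bar{x}_i$ and $\bar{y}_j$ represent the mean number of haplotype blocks found across all regions for breeds i and j respectively.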
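A minimal sketch of a block density correlation between two breeds, assuming the statistic is the ordinary Pearson correlation of their per-region block counts (the definitions of the means above suggest this, but the form is an assumption). The function name and the toy vectors are illustrative; real vectors would have 101 elements.

```python
from math import sqrt


def block_density_correlation(x, y):
    """Pearson correlation between two vectors of per-region block counts."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("vectors must have equal length of at least 2")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = sum((a - mean_x) ** 2 for a in x)
    spread_y = sum((b - mean_y) ** 2 for b in y)
    if spread_x == 0 or spread_y == 0:
        raise ValueError("correlation is undefined for a constant vector")
    return covariance / sqrt(spread_x * spread_y)


# Toy block counts for two hypothetical breeds over five regions.
breed_a = [3, 1, 4, 2, 5]
breed_b = [2, 1, 4, 2, 4]
print(block_density_correlation(breed_a, breed_b))
```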
In order to assess the consistency of block boundaries across breeds, we examined adjacent pairs of SNPs with intermarker distances up to 10 kb. For each breed, it was determined whether the pair was assigned to a single block or not. Then, for a given pair of breeds, a SNP pair was termed concordant if the assignment was the same in both breeds and discordant if the assignments disagreed. We performed this analysis for all pairs of breeds. In addition, we computed concordances and discordances between beef and dairy groups, and between dairy and indicus groups as well.
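The comparison for one pair of breeds can be sketched as below. It assumes that blocks are given as (first_index, last_index) ranges over a shared, ordered SNP list, and that an adjacent pair counts as "in a single block" when both SNPs fall inside the same range; the function names are hypothetical.

```python
MAX_PAIR_DISTANCE_BP = 10_000  # adjacent pairs farther apart are ignored


def pair_in_block(i, blocks):
    """True if SNPs i and i+1 lie inside the same block."""
    return any(lo <= i and i + 1 <= hi for lo, hi in blocks)


def boundary_agreement(positions, blocks_a, blocks_b,
                       max_distance_bp=MAX_PAIR_DISTANCE_BP):
    """Count concordant and discordant adjacent SNP pairs between two breeds."""
    concordant = discordant = 0
    for i in range(len(positions) - 1):
        if positions[i + 1] - positions[i] > max_distance_bp:
            continue
        if pair_in_block(i, blocks_a) == pair_in_block(i, blocks_b):
            concordant += 1
        else:
            discordant += 1
    return concordant, discordant


positions = [0, 4_000, 8_000, 30_000, 34_000]
breed_a_blocks = [(0, 2), (3, 4)]
breed_b_blocks = [(0, 1), (3, 4)]
print(boundary_agreement(positions, breed_a_blocks, breed_b_blocks))
```

Splitting the discordant count by direction (in a block in the first breed only, or in the second only) would be a small extension of the same loop.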
S'(P1, P2, k) has value 1.0 if the proportion of shared haplotypes between populations P1 and P2 at locus k is equal to the average of the proportions of shared haplotypes within the two populations P1 and P2. If S'(P1, P2, k) << 1.0, then the proportion of shared haplotypes between the two populations is much less than the average within the two populations.
where u is the number of loci. This is related to common measurements of genetic distance between two individuals [28–30]. D'(P1, P2) has value 0 if breeds P1 and P2 share the same proportion of haplotypes as is shared by the individuals within each individual breed.
Vectors resulting from the computation of haplotype block boundary discordances for each breed compared to the remaining breeds were used to perform a Principal Component Analysis (PCA) and look for differentiation between cattle subgroups. We used R software to perform this analysis. The central idea of PCA is to reduce the dimensionality of a data set which consists of a large number of interrelated variables, while retaining as much as possible of the variation present in the data set. This is achieved by transforming to a new set of variables, the principal components (PCs), which are uncorrelated, and which are ordered so that the first few retain most of the variation present in all the original variables.
Formally, PCA is defined as an orthogonal linear transformation that transforms the data to a new coordinate system such that the greatest variance by any projection of the data comes to lie on the first coordinate (called the first principal component), the second greatest variance on the second coordinate, and so on. PCA is theoretically the optimum transform for a given data set in least square terms. The procedure for obtaining the PCs can be summarized as follows:
Select d eigenvectors to represent the n variables, d < n. Then P1, P2, ..., Pd are called the principal components.
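The usual sequence (center the data, form the covariance matrix, extract its eigenvectors, and keep the d with the largest eigenvalues) can be sketched in plain Python. This is not the R analysis used in the study: power iteration with deflation is swapped in here for a full eigendecomposition purely to keep the example dependency-free, and all function names are hypothetical.

```python
from math import sqrt


def covariance_matrix(rows):
    """Sample covariance of the columns of `rows` (one row per observation)."""
    m, n = len(rows), len(rows[0])
    means = [sum(row[j] for row in rows) / m for j in range(n)]
    centered = [[row[j] - means[j] for j in range(n)] for row in rows]
    return [[sum(r[i] * r[j] for r in centered) / (m - 1) for j in range(n)]
            for i in range(n)]


def multiply(matrix, vec):
    """Matrix-vector product for a square list-of-lists matrix."""
    return [sum(a * v for a, v in zip(row, vec)) for row in matrix]


def leading_eigenpair(matrix, iterations=500):
    """Largest eigenvalue and its unit eigenvector by power iteration."""
    n = len(matrix)
    vec = [float(i + 1) for i in range(n)]  # arbitrary asymmetric start vector
    norm = sqrt(sum(v * v for v in vec))
    vec = [v / norm for v in vec]
    for _ in range(iterations):
        product = multiply(matrix, vec)
        norm = sqrt(sum(v * v for v in product))
        if norm == 0.0:
            break
        vec = [v / norm for v in product]
    value = sum(v * p for v, p in zip(vec, multiply(matrix, vec)))
    return value, vec


def principal_components(rows, d):
    """Return the d leading (eigenvalue, eigenvector) pairs of the covariance."""
    cov = covariance_matrix(rows)
    n = len(cov)
    components = []
    for _ in range(d):
        value, vec = leading_eigenpair(cov)
        components.append((value, vec))
        # Deflation: remove the component just found before the next pass.
        cov = [[cov[i][j] - value * vec[i] * vec[j] for j in range(n)]
               for i in range(n)]
    return components


data = [[2.0, 0.0], [0.0, 1.0], [-2.0, 0.0], [0.0, -1.0]]
for value, vec in principal_components(data, 2):
    print(round(value, 4), [round(v, 4) for v in vec])
```

Power iteration can converge slowly when two eigenvalues are close, so a library eigendecomposition is the safer choice for real data.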
This project was supported by National Research Initiative Grant no. 2007-35604-17870 from the USDA Cooperative State Research, Education, and Extension Service Animal Genome program. RV was supported in part by a Fulbright Scholarship.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.