Genome-wide association study for birth weight in Nellore cattle points to previously described orthologous genes affecting human and bovine height

Background Birth weight (BW) is an economically important trait in beef cattle, and is associated with growth- and stature-related traits and calving difficulty. One region of the cattle genome, located on Bos primigenius taurus chromosome 14 (BTA14), has been previously shown to be associated with stature by multiple independent studies, and contains orthologous genes affecting human height. A genome-wide association study (GWAS) for BW in Brazilian Nellore cattle (Bos primigenius indicus) was performed using estimated breeding values (EBVs) of 654 progeny-tested bulls genotyped for over 777,000 single nucleotide polymorphisms (SNPs). Results The most significant SNP (rs133012258, PGC = 1.34 × 10-9), located at BTA14:25376827, explained 4.62% of the variance in BW EBVs. The surrounding 1 Mb region presented high identity with human, pig and mouse autosomes 8, 4 and 4, respectively, and contains the orthologous height genes PLAG1, CHCHD7, MOS, RPS20, LYN, RDHE2 (SDR16C5) and PENK. The region also overlapped 28 quantitative trait loci (QTLs) previously reported in literature by linkage mapping studies in cattle, including QTLs for birth weight, mature height, carcass weight, stature, pre-weaning average daily gain, calving ease, and gestation length. Conclusions This study presents the first GWAS applying a high-density SNP panel to identify putative chromosome regions affecting birth weight in Nellore cattle. These results suggest that the QTLs on BTA14 associated with body size in taurine cattle (Bos primigenius taurus) also affect birth weight and size in zebu cattle (Bos primigenius indicus).


Background
Birth weight (BW) is an economically important trait in beef cattle, and is usually the first characteristic measured in a calf. Birth weight is associated with growthrelated traits [1], mature size [2] and carcass weight, thus being a valuable production indicator, as well as a selection criterion to improve calving ease [3,4].
Despite the beef industry's pursuit of animals with rapid growth, yielding heavier carcasses, selection for these objectives needs to be properly balanced against selection for reproductive traits, which have great economic importance in beef cattle production systems [5,6]. While low estimated breeding values (EBVs) for BW are associated with reduced calf viability [4] and lower growth rates [3,7], the use of sires with high EBVs for BW on dams with small pelvic size may result in higher rates of dystocia [7] and increased perinatal mortality [8]. These antagonisms result from the strong association of birth weight with the body size of the calf, i.e., with the stature of the animal [2].
Calving difficulties can result from a mismatch between pelvic opening and calf size [6]. This relationship between BW with reproductive and growth/size traits highlights the importance of understanding the underlying genetic architecture of BW.
Birth weight exhibits sufficient variability and heritability in the Nellore breed (Bos primigenius indicus), with an average of 29.8 ± 2.7 kg [9] and estimated heritability between 0.25 and 0.33 [1,9,10]. Despite the low frequency of dystocia in Nellore cows, BW has been recorded and used to monitor genetic trend. One selection strategy of the breeding programs in Brazil has been to preferentially use sires with higher EBVs for weaning and yearling weights, but with low or close to average EBVs for BW [11]. The identification of major genes and variants affecting multiple weight and carcass traits or influencing BW alone would be of help to balance these conflicting goals, because BW is positively correlated with weaning and yearling weights [1].
In the past two decades, linkage studies attempting to map quantitative trait loci (QTLs) affecting weight, growth, or stature in cattle have been published (e.g. [12][13][14][15][16]). The release of the reference bovine genome [17], the discovery of common single nucleotide polymorphisms (SNPs) across breeds [18,19], and the availability of high-throughput microarrays have enhanced the process of mapping loci that affect complex traits. This has led to several population-based investigations of associations between weight/growth/height phenotypes with genome-wide variants in different cattle breeds [15,[20][21][22][23].
In particular, two regions of the bovine genome associated with stature and growth have been highlighted recently. The first, located on Bos primigenius taurus (BTA) autosome 6 [20,23], shelters the orthologous genes NCAPG and LCORL, which have been also found to be associated with adult height in humans [24,25]. The second, located on BTA14 [15,[21][22][23], contains the genes PLAG1, CHCHD7, RDHE2, MOS, RPS20, LYN, PENK and TGS1, that were previously found to affect stature in both cattle and humans [22,24,[26][27][28]. Importantly, the majority of the genome-wide association studies (GWAS) reported in literature were conducted in the humpless subspecies of cattle (Bos primigenius taurus, known as taurine cattle), and GWAS in the humped bovine subspecies (Bos primigenius indicus, often referred as indicine or zebu cattle) are only now emerging, especially because the first SNP microarrays were optimized for taurine cattle [19].
In this paper, results from a genome-wide scan for SNPs associated with BW variation in Nellore cattle using EBVs of progeny-tested Brazilian bulls are reported. As EBVs take into account information from performance of the individual, progeny and parents, pedigree relationships, and systematic management and environmental factors, they can be used as composite phenotypes for proceeding with association analyses (see, e.g., [29]). The objective of this study was to identify putative SNP associated with differences in BW and to explore the genomic regions around them to unravel prospective functional relationships among weight, fertility, and growth/size traits.

Genotype informativeness and quality control
From the initial set of 777,961 SNPs, 42,669 (5.5%) were non-autosomal markers. Fifty four autosomal SNPs with redundant genomic coordinates were identified and excluded from further analyses. The identity by state (IBS) check revealed no unexpected sample duplicates. A total of 223,309 (30.4%) markers were excluded due to minor allele frequency (MAF) < 0.02. The number of SNPs excluded due to SNP Call Rate (CR SNP ) < 0.98 and Fisher's exact test P-value for Hardy-Weinberg Equilibrium (HWE) < 1 × 10 -5 were 122,611 (16.7%) and 13,194 (1.8%), respectively. Five individuals were removed due to low Call Rate (CR IND ). The final dataset included data for 649 individuals and 434,020 SNPs.

Descriptive statistics of dependent variables
Based on the results for the Shapiro-Wilk test, there was no evidence that EBVs deviated from normality (P = 0.415), and no outliers were observed. Average accuracy was 0.87 ± 0.11, with minimum, median and maximum accuracies of 0.51, 0.91 and 0.99, respectively. After fitting the regression in formula (2), there was no evidence against the hypotheses of normally distributed (P = 0.57) and homoskedastic residuals (studentized Breusch-Pagan test, P = 0.11). These findings suggest that the dependent variables used were reliable and did not violate possible assumptions of the statistical analyses used hereafter.

Population substructure
The Principal Coordinates Analysis (PCoA) revealed genetic stratification among the Nellore samples ( Figure 1). After using k-means clustering to assign individuals to two different groups according to their coordinates in the PCoA, a highly significant association (P = 5.41 × 10 -34 ) between the k-means assignments (n cluster1 = 531, n cluster2 = 118) and breeding program subgroups (n subgroup1 = 352, n subgroup2 = 297) was found. For the k-means groups, BW average was 0.434 ± 1.306 in cluster 1 and −0.216 ± 1.295 in cluster 2. For the breeding program subgroups, the trait averages in subgroup 1 and subgroup 2 were 0.683 ± 1.276 and −0.119 ± 1.255, respectively. When considering either k-means clusters or breeding program subgroups as subpopulation labels, the EBVs showed homogeneity of variance (P > 0.05), but the trait mean was significantly different between groups (P = 1.21 × 10 -6 and P = 4.37 × 10 -15 for k-means and breeding program, respectively). Thus, population substructure was a potential confounder in the genotype-EBV association analysis, which justified the inclusion of eigenvectors from the PCoA as fixed effects in the linear model.

Association analysis
A total of 32 eigenvectors from the PCoA were significantly correlated with the phenotypes, which together explained 15.23% of the genotypic variability. Residuals from the weighted regression on these significant eigenvectors were used as the dependent variable for the SNP association analysis. The quantile-quantile (Q-Q) plot ( Figure 2) showed that the deviation of the observed test statistics from the theoretical quantiles was mild and acceptable (λ = 1.002831), and the values were adjusted for the inflation factor via Genomic Control (GC). The T 2 values deviating from the expected values were interpreted as SNPs departing from the null hypothesis. The genomewide deflated P-values are shown in Figure 3.
A peak crossing the boundary for Bonferroni significance (α = 1.15 × 10 -7 ) was detected on BTA14, comprising 5 SNPs (Table 1) which were highly linked (mean r 2 = 0.728 ± 0.12). The most significant SNP (rs133012258, P GC = 1.34 × 10 -9 ), located at BTA14:25376827, had an estimated allele substitution effect of 0.452 kg (i.e., for each extra A allele, the BW breeding value is expected to increase 0.452 kg), with lower and upper limits for the 95% confidence interval (CI) of 0.306 kg and 0.598 kg, respectively, and the percentage of the variance in sires EBVs explained by the SNP was 4.62% (with a 95% CI of 2.12-8.09%). The overall rs133012258 A allele frequency was 0.274, whereas the breeding program subgroups 1 and 2 had frequencies of 0.351 and 0.184, respectively. Figure 4 shows the distribution of the EBVs (in standard deviations) for the three genotype classes of rs133012258. In both Illumina TOP and Forward allele notation, the AB correspondence for rs133012258 was A = A and B = G.
The BTA14 region pointed out by the present study has also been shown to be associated with reproductive traits. Cole et al. [16] reported a QTL on BTA14 associated with stillbirth, which also has been associated with body size in dairy cattle [8,30], but found no effect on stature or other conformation traits on that chromosome. The region also associates with many fertility and growth-related traits in the indicine breed Brahman, for example scrotal circumference [31,32], age at the first corpus luteum [31,33], blood levels of insulin-like growth factor 1 (IGF1) [32,33] and hip height [33].
A significant SNP was found within intron 2 of the XKR4 gene in the present study. Lindholm-Perry et al. [34] identified five SNPs near XKR4 associated with feed intake and gain in crossbred steers. Bolormaa et al. [35] found five SNPs in a narrow region of BTA14 encompassing XKR4 associated with rump fat thickness measured at the P8 position (CHILLP8) in seven breeds of cattle, including taurine, indicine and composite breeds. The authors found that four of these SNPs were also associated with CHILLP8 in a confirmatory sample of 1,338 animals, including Angus, Hereford and Brahman cattle. Furthermore, Porto Neto et al. [36] performed a replication study using samples of Belmont Red, Santa Gertrudis and Brahman animals genotyped for SNPs within XKR4 and found that although the SNP effect may vary depending on the breed, the variant rs42646 Chi-squared statistics for β. Genomic control correction was performed by dividing the χ 2 statistics by the distribution inflation/deflation factor estimate (λ = 1.002831). 708 (BTA14:24573257) explain around 1.3% of CHILLP8 variance in cattle. This SNP is also located within intron 2 of XKR4, only 17.6 kb apart from the intronic SNP detected in the present study, which strongly suggests XKR4 as a candidate gene for being further explored in future studies of weight and carcass traits in Nellore cattle.
The most significant SNP (rs133012258, P GC = 1.34 × 10 -9 ) was found to explain 4.62% of the variance in sires EBVs, with a 95% CI of 2.12-8.09%. One hundred and eighty loci associated with human adult height explain only 10% of the phenotypic variance together, while individual loci account for 0.4% or less [37]. SNPs analyzed by [22] within a nearby BTA14 region explain from 0.29 to 2.53% of the bovine stature variability, and the quantitative trait nucleotides (QTN) spanning MOS, CHCHD7 and PLAG1 described by [27] explain from 1.10 to 3.50% of height in Jersey and Holstein breeds. Furthermore, the genome-wide survey performed by [21] provided strong evidence for two QTL on BTA14 and BTA21 that together explain at least 10% of the variation of EBVs for calving ease in the German Fleckvieh.
Considering that multiple stature-related traits are governed by variants with small effects, and that the genomic region identified in this study has been previously found to be associated with several of these traits, the putative SNP detected in the present analysis can be considered as a marker in linkage disequilibrium (LD) with major untyped (i.e., not probed by the SNP assay used) causative variants affecting BW and other heightassociated traits in Nellore cattle, and further studies would be needed to determine if the QTNs reported by [27] are also segregating in the Nellore population. Also, future investigations are needed to better characterize the effect of nearby SNPs on other weight and carcass   traits in Nellore cattle, as it is not clear yet how the putative pleiotropic effect of these variants would be used towards balancing conflicting selection goals for birth, weaning and yearling weights. Although we cannot confirm that the allele substitution effects of these SNPs work in the same direction for all three traits, because only birth weight was analyzed here, these findings suggest that the SNPs identified would be key polymorphisms to be monitored over time. In a scenario where the SNP effects have the same direction in all three traits, one strategy could be avoiding strong positive selection or drifting of the allele that contributes to higher BW EBVs, and identify and promote positive selection of other variants that have effects on weaning and yearling weights only.
The high identity found in the alignment of this BTA14 region against other mammalian species genomes suggests that these orthologous genes are located in a conserved syntenic block which may have arisen and been maintained after speciation from a common ancestor of the mammal clade. Moreover, the evidence for variants associated with growth and stature within this BTA14 region in both taurine and zebu cattle raises two hypotheses: 1) these variants have been introgressed into Nellore via historical admixture with taurine Creole cattle in the maternal line, and was maintained in the breed in spite of several generations of backcrossing; 2) these are ancient polymorphisms, probably already segregating in the founder population of wild Aurochs (Bos primigenius) before subspecies formation. Figure 6 Ensembl alignments of UMD v3.1 sequence for the 1 Mb region surrounding rs133012258. The bovine reference genome sequence was aligned against (from top to bottom) the human (GRCh37 assembly), pig (Sscrofa10.2 assembly) and mouse (GRCm38 assembly) genome builds. Gene colors: yellow -merged Ensembl/Havana, red -protein coding, blue -processed transcript, grey -pseudogene, purple -RNA gene. Triangles: black -breakpoint between different chromosomes, blue -inversion in chromosome, brown -breakpoint on chromosome, red -gap between two underlying slices.
Regarding functional meaning, the set of genes reported participate in diverse growth and tumor development mechanisms. Among these genes, PLAG1 is the most appealing functional candidate. It is an oncogene that encodes a transcription factor broadly expressed during fetal development, but is down-regulated at birth [27]. It interacts with several growth factors controlling body size, including IGF2 [38]. In addition, PLAG1 knock-out mice have been shown to have marked growth retardation and reduced fertility [39]. In a replication study, [28] confirmed the findings reported by [27], demonstrating association of growth rate and early life and peripubertal body weight with PLAG1 polymorphisms, supporting its status as a key regulator of mammalian growth.
The lack of significant association between BW and SNPs within other previously described weight-and height-related chromosome regions in the present study should not be interpreted as a lack of existence of true association, but rather it might be due to limitations specific to this study. Firstly, because complex trait mapping requires large sample sizes and only 649 bulls were analyzed here. Secondly, the significance level adopted was highly stringent, which may have caused inflation of type II errors. In spite of these limitations, it was possible to demonstrate that a well-characterized chromosome region affecting human and taurine cattle stature also associates with BW in a zebu breed. The release of a Bos primigenius indicus reference genome assembly, as well as the application of re-sequencing and replication studies would help improve resolution to narrow down the genomic region as close as possible to the true causative variants.

Conclusions
This study is believed to be the first genome-wide association study applying a high-density SNP panel to identify putative chromosome regions affecting birth weight in zebu cattle. The findings presented, which are strongly supported by the literature, point to orthologous genes already known to affect growth-and stature-related traits in both humans and cattle, which may shelter ancient polymorphisms responsible for variation in those traits since before cattle subspecies divergence.

Estimated breeding values
Estimated breeding values for BW were obtained from routine genetic evaluations using performance and pedigree data from the Aliança database [11], containing data from different commercial Nellore breeding programs, including more than 250 farms distributed across Brazil and Paraguay. The genetic evaluation for BW was calculated using a subset of that data that included 542,918 animals, born from 1985 to 2011, and distributed in approximately 5,000 distinct contemporary groups. These data were collected in 243 grazing-based herds in Brazil. Estimated breeding values were obtained using an animal model that included fixed effects for the age of dam at calving and contemporary group (defined as animals from the same herd, born in the same year and season, and belonging to the same management group at birth, and sex), as well as random effects that include direct additive genetic, maternal additive genetic, maternal permanent environmental and residual error effects. The variance ratios required to solve the mixed model equations were computed based on restricted maximum likelihood (REML) estimates of the variance components from previous studies in this population. Only EBVs of progeny-tested bulls whose accuracy (i.e., square root of reliability, calculated based on prediction error variance estimates) was ≥ 0.50 were used for sample collection and genotyping (described later). The majority of the bulls were used under artificial insemination service.

Genotyping, informativeness, and quality assurance
A total of 654 progeny-tested Nellore bulls were genotyped with the Illumina® BovineHD Genotyping BeadChip assay, according to the manufacturer's protocol. Genotype calls (i.e. successfully determined genotypes) were defined as genotypes with GenCall Scores greater than 0.70, using the validated standard cluster file provided by the manufacturer. As chromosomes X, Y and mtDNA present different mode of inheritance from the rest of the genome, only autosomal markers with unique genomic coordinates were included into the analyses. After this initial screening, potential duplicated samples were determined by calculating the proportion of alleles identical by state (IBS) shared between all pairs of individuals. Any pair of samples with IBS ≥ 0.95 for 2,000 randomly sampled markers was considered unexpected duplicates, and resulted in the exclusion of both members of the pair. Individual SNPs were removed from the dataset if they did not exhibit: 1) minor allele frequency (MAF) greater than or equal to 0.02, 2) Fisher's exact test P-value for Hardy-Weinberg Equilibrium (HWE) greater than or equal to 1 × 10 -5 (i.e. extremely deviating from HWE, suggesting potential genotyping error) or 3) Call rate (CR SNP ) of at least 98%. After the SNP pruning, individuals exhibiting call rate (CR IND ) below 90% were also removed. These procedures and many others described later were performed in the R v2.15.0 environment [40], using combinations of functions from the R base, locally developed scripts, and the GenABEL v1.7-2 package [41].

Assessment of population substructure
Sires genotyped in this study were known to belong to one of two major breeding program subgroups in the Aliança database [11] that have different selection objectives. One group emphasizes selection for weaning and yearling weight (subgroup 1) and the other emphasizes selection for fertility and carcass traits (subgroup 2). Thus, genetic stratification was expected and therefore population substructure was evaluated by performing a Principal Coordinates Analysis (PCoA).
Pair-wise genomic kinship coefficients for all subjects under study were calculated first, following [42,43]: Wheref i;j is the estimated genomic kinship between individuals i and j, L is the total number of loci used for the calculation, p l is the reference allele frequency for locus l, and g l,i and g l,j are the locus l genotypes for individuals i and j, respectively (coded as 0, 1 or 2 reference alleles). Calculations were based on 10,000 randomly sampled markers using GenABEL [41]. The calculated genomic kinship coefficients within the yielded n x n symmetric matrix (where n is the total number of samples) were then transformed to squared Euclidean distances, and the dissimilarities between the subjects within the matrix were captured in n-1 dimensional spaces of n observations (eigenvectors), via classical multidimensional scaling [44].
A clustering analysis was applied to the two eigenvectors that explained the largest proportion of the data variance using the k-means algorithm [45] implemented in R [40]. Individuals were clustered into 2 groups, and the association between the prior information on breeding program and the k-means clustering results was tested using Pearson's χ 2 with Yates' continuity correction in order to see if the algorithm could reproduce the known breeding programs subgroups. Additionally, an F-test for homogeneity of variance between subpopulations and a t-test for difference between subpopulation means, defining subpopulations either as k-means assignments or breeding program of origin, was performed to determine if there was confounding due to stratification.

Association analysis
In order to reduce computation time, the ideas of [46] were abstracted and a three-step association analysis was performed. In the first step, a linear regression using the weighted least squares method with weights equal to the squared accuracy (i.e., reliability) of the EBVs was applied. By weighting the EBVs by their respective accuracies, the uncertainty around the estimates was taken into account when estimating the regression parameters. The following model was fitted: Where y i is the EBV of sire i, μ is the overall mean, X ij is value i (corresponding to sire i) in the eigenvector j calculated in the PCoA, β j is the estimated effect of eigenvector j, and ε i is the residual effect for animal j. Only eigenvectors significantly (P < 0.05) correlated with the dependent variable, as assessed by Pearson correlations, were included in the model. Next, the residuals were obtained from the fitted model in (2): and had their homoskedasticity and normality tested by using the studentized Breusch-Pagan test and the Shapiro-Wilk test, respectively. Then, these residuals were used as the new dependent variable for a singlemarker linear regression: Where β g is the marker regression coefficient (i.e., the allele substitution effect of the SNP) and g i is the genotype (0, 1 or 2) of the sire i. For each SNP, β g and its respective standard error (SE g ) were estimated using ordinary least squares. The association between the SNP and the trait was assessed via a test statistic, calculated as: The test statistics are assumed to asymptotically follow a χ 2 distribution with one degree of freedom under the null hypothesis. To assess the validity of this assumption, the deviation of the distribution of the test statistics from the expected theoretical quantiles was examined via 1) a quantile-quantile (Q-Q) plot, and 2) calculation of the inflation/deflation factor: If λ < 1.1, the inflation was considered acceptable, and the Genomic Control (GC) correction was applied to adjust for that inflation [47]. Then, P-values were derived from the χ 2 cumulative distribution function for the corrected test statistics. Finally, markers within the smallest 0.1% P-value percentile (i.e., most significant) were considered for re-analysis with the full model: The EBVs were again weighted by their respective accuracies. The conservative Bonferroni adjustment for multiple testing (α=0.05/N, where N is the number of tests, i.e., number of SNPs) was used to reject the null hypothesis (β g = 0, i.e. there is no association between the SNP and the EBVs), which resulted in an adjusted significance of α = 1.15 × 10 -7 .

Exploratory view of significant SNPs
For any peak crossing the Bonferroni significance threshold, the estimated regression parameters were reported. For the most significant SNP, a 95% confidence interval (CI) for the estimated allele substitution effect size (β g ) was calculated, and the percentage of the EBV variance explained was calculated as: Where p and q are the allele frequencies and S 2 is the sample EBVs variance. The upper and lower limits of the estimated 95% CI forβ g were used to derive a 95% CI for %π σ 2 .
The genomic region containing the most significant SNP of a peak was explored by inspecting a 1 Mb window around the location of this SNP using the BioMart tool and the Ensembl genes 69 database [48] to interrogate 500 kb to each side of the marker using the UMD v3.1 assembly. The cattle QTLdb database [49] was also examined to find out if the significant SNP mapped against any previously described bovine QTL. Additionally, the alignments of the UMD v3.1 assembly sequence of the 1 Mb window against the human (Homo sapiens, GRCh37 assembly), pig (Sus scrofa, Sscrofa10.2 assembly) and mouse (Mus musculus, GRCm38 assembly) genome builds were inspected in the Ensembl Comparative genomics alignments and Comparative genomics synteny tools in order to determine if any homologous genes were present in the putative region.