Genome-wide association study for backfat thickness in Canchim beef cattle using Random Forest approach

Background: Meat quality involves many traits, such as marbling, tenderness, juiciness, and backfat thickness, all of which require attention from livestock producers. Improving backfat thickness in Canchim beef cattle by traditional selection has been challenging because the trait has low heritability and is measured late in an animal’s life. Therefore, the implementation of new methodologies for identifying single nucleotide polymorphisms (SNPs) linked to backfat thickness is an important strategy for genetic improvement of carcass and meat quality. Results: The set of SNPs identified by the random forest approach explained as much as 50% of the deregressed estimated breeding value (dEBV) variance associated with backfat thickness, and a small set of 5 SNPs was able to explain 34% of the dEBV variance for backfat thickness. Several quantitative trait loci (QTL) for fat-related traits were found in the regions surrounding the SNPs, as well as many genes with roles in lipid metabolism. Conclusions: These results provide a better understanding of backfat deposition and its regulation pathways, and can be considered a starting point for future implementation of a genomic selection program for backfat thickness in Canchim beef cattle.


Background
Beef cattle production in Brazil is based on several breeds, depending on the geography and climate of a given area. Breeds based on Bos taurus are commonly raised for beef in the South of Brazil, but in most parts of the country, beef cattle production is based on Bos indicus (zebu) breeds raised on natural pastures. A good description of Brazilian beef cattle production was recently published [1]. Zebu breeds are considered highly adapted to the tropical environment in Brazil [2][3][4][5], but they are known for their lower meat quality in certain aspects, such as tenderness, palatability, and marbling [6][7][8][9][10], and for their lower reproduction efficiency [11,12] when compared to Bos taurus. The Canchim (3/8 zebu + 5/8 Charolais) breed was developed in the early 1960s in Brazil [13] with the intention of combining the fitness traits of zebu with the higher reproduction efficiency and meat quality of the Charolais breed.
Although the Canchim breed has fared well when raised on natural pastures in Brazil, some carcass traits have remained inferior when compared to Bos taurus. One such trait is backfat thickness, which has been a concern for Canchim producers, and for the beef cattle industry in general, because fat deposition is low in animals raised on pasture (1.90 ± 0.77 mm) [14]. Improvement of this trait in Canchim beef cattle using traditional selection techniques has had limited success because of its relatively low heritability (0.23) [14], and because it is measured late in an animal's life. Most studies of backfat thickness available in the literature have been conducted on animals raised in feedlot systems, which permit earlier ultrasound measurements, and have reported moderate to high heritabilities [15][16][17][18][19], so traditional selection techniques are more successful under these conditions than in the Canchim breed.
In attempts to improve meat quality, previous studies have focused on the identification of candidate markers associated with meat quality traits, including backfat thickness, in Canchim and other Bos indicus × Bos taurus crosses in Brazil [20][21][22][23][24]. However, these have had limited success, particularly for markers in the DDEF1 and LEP genes [20,23]. Therefore, the identification of genetic markers linked to backfat thickness by novel methodologies is an important strategy for genetic improvement of carcass and meat quality. One recently developed approach relies on examining how SNPs (single nucleotide polymorphisms) are associated with these traits [25]. More specifically, this method has been used successfully in studies that examined fat-related traits, such as intramuscular fat percentage, marbling, rib fat, backfat thickness and rump fat depth [26][27][28][29][30][31][32][33][34][35]. Using high-density SNP panel assays for different breeds and crosses, these studies have collectively found such traits to be associated with regions on nine bovine chromosomes (6, 15, 17, 20, 21, 24, 25, 26, and 28) [27,28,32,35]. However, another study suggested that some of the effects attributed to each SNP can vary with the breed's origin, as a result of variation in indicine and taurine-indicine composite cattle [35], thereby justifying the investigation of SNPs in the breed of interest.
A previous study using a high-density SNP panel associated 100 SNPs with backfat thickness in a Canchim population, using an approach that selected animals with extreme phenotypes for genotyping [33]. Those SNPs were located on several bovine autosomes, and from them, the authors further investigated and validated two regions on chromosome (chr) 14 associated with backfat thickness, where the haplotypes were responsible for 0.24% to 1.1% of the phenotypic variance for this trait.
Although these results are useful, it is well known that quantitative traits are polygenic and that each SNP may account for only a small part of the phenotypic variance. Joint analysis of many SNPs has therefore become a more interesting strategy [36,37]. This, however, exacerbates the 'large p, small n' problem faced by genome-wide studies, in which a small number of phenotypes (n) must be used to predict a large number of SNP (p) effects [38].
One solution to this problem is the use of Random Forest, a machine learning algorithm capable of handling such datasets to build model-independent predictors for classification and/or regression problems [39]. Specifically, it embeds a procedure that accounts for predictor variable importance, which results in a score that can be used for prioritizing variables (SNPs), similar to p-values from statistical tests [40][41][42]. Because of these features, the variable importance of the random forest method has been recognized as a useful methodology for genome-wide association studies [43].
Considering all of the above, the objectives of this study were to identify SNPs associated with backfat thickness in Canchim beef cattle using the random forest approach for genome-wide association studies, to gain insight into potential genes associated with this trait, and to discover potential SNPs for future implementation of genomic selection (GS). The set of SNPs identified by this methodology explains as much as 50% of the deregressed estimated breeding value variance associated with the observed phenotype. These results are intended to provide a better understanding of backfat deposition and its regulatory pathways, and to enable the use of the identified SNPs in validation studies for genomic selection.

Animals and phenotypes
Animals used in this study were part of the Canchim Breeding Association from seven herds located in two Brazilian states (São Paulo and Goiás). This research is in agreement with the ethical principles of animal experimentation of Embrapa Southeast Livestock Ethical Committee of Animal Use (CEUA-CPPSE), and has been performed with the approval of CEUA-CPPSE under protocol number 02/2009. An initial sample of 987 animals (males and females) was evaluated for backfat thickness by ultrasound in vivo over the 12th rib around the age of 18 months. All animals evaluated were born between 2003 and 2005 and raised on natural pastures.
For these 987 animals, the estimated breeding value (EBV) was predicted by restricted maximum likelihood using the MTDFREML software [44]. The animal model included the fixed effects of contemporary group (sex, year, herd, and genetic group) and age at measurement as a linear covariate; the additive genetic effect and the error were included as random effects. From these animals, a sample of 400 was selected considering EBV, accuracy, family size, and the proportion of males (196) and females (204). These 400 animals were offspring of 50 different sires (with 1 to 30 offspring per sire).
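The animal model described above can be written compactly in standard mixed-model notation. This is only a sketch: the symbols (the incidence matrices X and Z, the relationship matrix A, and the variance components) follow common convention and are assumed here, since the text does not define its own notation.

```latex
\mathbf{y} = \mathbf{X}\mathbf{b} + \mathbf{Z}\mathbf{a} + \mathbf{e},
\qquad
\mathbf{a} \sim N(\mathbf{0}, \mathbf{A}\sigma^2_a),
\qquad
\mathbf{e} \sim N(\mathbf{0}, \mathbf{I}\sigma^2_e),
\qquad
h^2 = \frac{\sigma^2_a}{\sigma^2_a + \sigma^2_e}
```

In this notation, y holds the backfat thickness records, b holds the contemporary group effects and the linear regression coefficient on age at measurement, a holds the additive genetic effects (whose predictions are the EBVs), and A is the numerator relationship matrix built from the pedigree.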

Genotyping and SNPs quality control
The selected 400 animals were genotyped using the BovineHD BeadChip (Illumina Inc., San Diego, CA). The quality control filters included call rate (< 0.90) for samples and SNPs, minor allele frequency (MAF < 0.01), and heterozygosity (< 3 standard deviations). After quality control processing, 396 animals and 708,641 SNPs with an average call rate higher than 0.99 remained in the study.
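The per-SNP part of these filters can be illustrated with a short sketch. It assumes genotypes are stored as 0, 1, or 2 copies of one allele with `None` for a missing call; the function names are hypothetical and the heterozygosity filter is not shown. It is not the pipeline used in the study.

```python
from typing import List, Optional

Genotype = Optional[int]  # 0, 1, 2 = copies of allele B; None = missing call


def call_rate(calls: List[Genotype]) -> float:
    """Fraction of non-missing genotype calls for one SNP (or one sample)."""
    return sum(g is not None for g in calls) / len(calls)


def minor_allele_frequency(calls: List[Genotype]) -> float:
    """Minor allele frequency computed from the non-missing calls of one SNP."""
    observed = [g for g in calls if g is not None]
    if not observed:
        return 0.0
    freq_b = sum(observed) / (2 * len(observed))  # each animal carries two alleles
    return min(freq_b, 1.0 - freq_b)


def passes_snp_qc(calls: List[Genotype],
                  min_call_rate: float = 0.90,
                  min_maf: float = 0.01) -> bool:
    """Keep a SNP only if call rate >= 0.90 and MAF >= 0.01 (thresholds from the text)."""
    return (call_rate(calls) >= min_call_rate
            and minor_allele_frequency(calls) >= min_maf)


# Toy SNP genotyped in ten animals, with one missing call.
snp = [0, 1, 2, 1, None, 0, 0, 1, 0, 0]
print(call_rate(snp), round(minor_allele_frequency(snp), 4), passes_snp_qc(snp))
```

The same `call_rate` function can be applied to a sample's genotypes across SNPs to mirror the per-sample call-rate filter.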

Genome-wide association analysis
Genome-wide association (GWA) analysis was performed on deregressed EBVs (dEBV) [45], which take into account the pedigree matrix, estimated heritability (0.16, data not shown), EBVs, and EBV accuracies obtained with the same animal model described above. For the estimation of dEBVs, the data set was enhanced with data collected from animals born between 2005 and 2008, totaling 1,648 individuals with phenotypes for backfat thickness and 6,801 animals in the pedigree matrix.
Association of SNPs with dEBVs was assessed using a random forest package [46] available in the R-project software [47]. The association analysis consisted of a two-step procedure. In the first step, the SNPs with the highest 1% importance scores within each chromosome were selected. In the second step, the set of SNPs from the first step was re-analyzed disregarding chromosome, and the SNPs with the highest 1% importance scores were selected. For the association analysis, missing genotypes were imputed by the naïve method provided in the random forest package (which imputes column median values for missing genotypes). The number of trees to grow and the number of randomly selected candidate SNPs at each split were set to 5,000 and 10% of the SNPs being evaluated, respectively. This procedure was done using the 396 samples available.
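The selection logic around the random forest (median imputation, then a top 1% cut per chromosome followed by a pooled top 1% cut) can be sketched as follows. The forest itself is not implemented: importance scores are taken as given, and `rescore` is a hypothetical callable standing in for re-running the forest on the pooled SNPs. Rounding the 1% cut upward is an assumption, as the text does not say how it was rounded.

```python
import math
from statistics import median
from typing import Callable, Dict, List, Optional


def impute_median(calls: List[Optional[int]]) -> List[float]:
    """Replace missing genotypes (None) with the median of the observed ones."""
    observed = [g for g in calls if g is not None]
    fill = median(observed)
    return [fill if g is None else g for g in calls]


def top_fraction(scores: Dict[str, float], fraction: float = 0.01) -> List[str]:
    """SNP ids with the highest importance scores (always at least one SNP)."""
    keep = max(1, math.ceil(len(scores) * fraction))  # rounding up is an assumption
    return sorted(scores, key=scores.get, reverse=True)[:keep]


def two_step_selection(importance_by_chr: Dict[str, Dict[str, float]],
                       rescore: Callable[[List[str]], Dict[str, float]],
                       fraction: float = 0.01) -> List[str]:
    """Step 1: top fraction within each chromosome.
    Step 2: re-score the pooled SNPs and take the top fraction again."""
    pooled: List[str] = []
    for chrom_scores in importance_by_chr.values():
        pooled.extend(top_fraction(chrom_scores, fraction))
    return top_fraction(rescore(pooled), fraction)


# Toy example with made-up SNP ids and scores, using a 50% cut for readability.
toy = {"1": {"a": 0.1, "b": 0.9, "c": 0.5, "d": 0.2}, "2": {"e": 0.3, "f": 0.8}}
second_pass = {"b": 0.4, "c": 0.7, "f": 0.6}
selected = two_step_selection(toy, lambda ids: {i: second_pass[i] for i in ids}, 0.5)
print(selected)
```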
Taking into account the unbalanced offspring range among sires, 10 subsamples consisting of 198 animals each were also analyzed in the same two-step process as previously described. The 10 subsamples were selected as follows: i) The first animal was chosen at random from the 396 genotyped animals; ii) The next animal was selected based on the lowest relationship with the previous selected animal, but most representative from the rest of the genotyped animals; and iii) Step ii was repeated until 198 animals were selected.
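One possible reading of this subsampling rule is sketched below. It assumes a square matrix of pairwise relationship coefficients and interprets "most representative" as a tie-break in favor of the candidate with the highest mean relationship to the animals still unselected. Both the interpretation and the function name are assumptions, not the authors' implementation.

```python
import random
from typing import List, Sequence


def select_subsample(rel: Sequence[Sequence[float]], size: int, seed: int = 0) -> List[int]:
    """Greedy subsample: start at random, then repeatedly add the animal least
    related to the previously selected one (ties broken by representativeness)."""
    rng = random.Random(seed)
    remaining = list(range(len(rel)))
    current = rng.choice(remaining)          # step i: random first animal
    remaining.remove(current)
    chosen = [current]
    while len(chosen) < size and remaining:
        def key(cand: int):
            others = [o for o in remaining if o != cand]
            mean_rel = sum(rel[cand][o] for o in others) / len(others) if others else 0.0
            # lowest relationship to the last pick first, then highest mean relationship
            return (rel[current][cand], -mean_rel)
        current = min(remaining, key=key)    # step ii
        remaining.remove(current)
        chosen.append(current)               # step iii: repeat until `size` animals
    return chosen


# Toy relationship matrix for four animals (values are illustrative only).
rel = [
    [1.0, 0.5, 0.25, 0.0],
    [0.5, 1.0, 0.0, 0.25],
    [0.25, 0.0, 1.0, 0.5],
    [0.0, 0.25, 0.5, 1.0],
]
chosen = select_subsample(rel, size=3, seed=1)
print(chosen)
```

Calling the function with ten different seeds and `size=198` would produce ten subsamples in the spirit of the procedure described above.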
Two approaches were considered for further SNP investigation among the results obtained by the random forest analysis. One approach selected the SNPs in common among the analysis with the 396 animals and the 10 subsamples, called the Common SNPs strategy. Another approach selected only the top 1% (importance score) from the analysis with 396 animals, called the Highest 1% SNPs.
Finally, after both sets of SNPs (Common SNPs and Highest 1% SNPs) had been selected, each set was fitted in a final stepwise regression model using SAS/STAT software [48] to estimate the amount of variance explained by the selected SNPs in the data set (final model R² values correspond to the dEBV variance explained by the model, and are reported in Table 1). For this purpose, the SNPs were coded as 0, 1, and 2 for the AA, AB, and BB genotypes, respectively. To evaluate the significance of the results, a permutation test was conducted to estimate the bias associated with the R² obtained from the stepwise regression analysis. In the permutation test, the dEBV values were shuffled and then regressed on the same SNPs previously selected. The permutation test was repeated 1,000 times.
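The genotype coding and the permutation idea can be illustrated with a deliberately simplified sketch. It uses a single SNP and simple linear regression, for which R² is the squared Pearson correlation, rather than the multi-SNP stepwise model fitted in SAS/STAT. The function names and toy data are hypothetical.

```python
import random
from statistics import mean
from typing import List, Sequence, Tuple

GENOTYPE_CODE = {"AA": 0, "AB": 1, "BB": 2}  # coding described in the text


def r_squared(x: Sequence[float], y: Sequence[float]) -> float:
    """R² of a one-predictor linear regression (squared Pearson correlation)."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy * sxy / (sxx * syy)


def permutation_r2(x: Sequence[float], y: Sequence[float],
                   n_perm: int = 1000, seed: int = 0) -> Tuple[float, float, float]:
    """Return (observed R², mean R² under shuffled y, empirical p-value)."""
    rng = random.Random(seed)
    observed = r_squared(x, y)
    shuffled = list(y)
    null: List[float] = []
    for _ in range(n_perm):
        rng.shuffle(shuffled)                 # break the SNP-dEBV link
        null.append(r_squared(x, shuffled))
    p_value = (1 + sum(r >= observed for r in null)) / (n_perm + 1)
    return observed, mean(null), p_value


# Toy data: dEBV depends linearly on the coded genotype.
genotypes = ["AA", "AB", "BB", "AA", "BB", "AB", "AA", "BB"]
x = [GENOTYPE_CODE[g] for g in genotypes]
debv = [2.0 * g + 1.0 for g in x]
observed, null_mean, p_value = permutation_r2(x, debv)
print(observed, null_mean, p_value)
```

The mean R² over the shuffled data estimates how much variance the model appears to explain by chance alone, which is the bias the permutation test is meant to quantify.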

Candidate genes and pathways
A pathway analysis was conducted to characterize the genomic regions identified by the set of SNPs previously selected and to identify candidate genes influencing biological functions and pathways related to backfat thickness and fat-related traits.
The software fastPHASE version 1.4.0 [49] was used for reconstructing the haplotypes for each chromosome. Afterwards, the reconstructed haplotypes were analyzed by the software Haploview [50] (using default parameters) for estimating haplotype blocks and linkage disequilibrium (LD), which was calculated based on the squared correlation coefficient between SNP pairs (r²). Considering the extent of LD based on the overall average r² (average r² = 0.12 at a distance of 250Kb, data not shown), a window of 500Kb (SNP position ± 250Kb) surrounding each SNP previously selected by the stepwise regression was considered to define the region used for candidate gene discovery and pathway annotation.
The Cattle Genome Browser, using the UMD 3.1 cattle genome assembly [51], was used for visualization of the selected SNPs and surrounding areas for localization and identification of QTLs, genes, and other interesting genomic landmarks. Other databases, such as the NCBI BioSystems database [52] and the Kyoto Encyclopedia of Genes and Genomes (KEGG) [53,54], were also used for pathway annotation to gain insight into the biological processes involved in backfat thickness deposition.
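As an illustration of the two quantities used here, the sketch below computes r² between two biallelic SNPs from phased haplotypes coded 0/1 and builds the ± 250Kb window around a SNP position. It is not the Haploview implementation; the function names, gene names, and coordinates are made up for the example.

```python
from typing import List, Sequence, Tuple


def ld_r2(hap_a: Sequence[int], hap_b: Sequence[int]) -> float:
    """r² between two biallelic SNPs from phased haplotypes coded 0/1."""
    n = len(hap_a)
    p_a = sum(hap_a) / n                                  # allele frequency at SNP A
    p_b = sum(hap_b) / n                                  # allele frequency at SNP B
    p_ab = sum(a * b for a, b in zip(hap_a, hap_b)) / n   # frequency of the 1-1 haplotype
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    if denom == 0:
        return 0.0                                        # a monomorphic SNP carries no LD signal
    d = p_ab - p_a * p_b                                  # disequilibrium coefficient D
    return d * d / denom


def candidate_window(position_bp: int, flank_bp: int = 250_000) -> Tuple[int, int]:
    """Window of SNP position ± flank (500Kb in total with the default)."""
    return max(0, position_bp - flank_bp), position_bp + flank_bp


def genes_in_window(position_bp: int,
                    genes: List[Tuple[str, int, int]],
                    flank_bp: int = 250_000) -> List[str]:
    """Names of genes (name, start, end) overlapping the window around a SNP."""
    start, end = candidate_window(position_bp, flank_bp)
    return [name for name, g_start, g_end in genes if g_start <= end and g_end >= start]


# Hypothetical annotations: gene_x overlaps the window, gene_y lies outside it.
annotations = [("gene_x", 700_000, 760_000), ("gene_y", 1_300_000, 1_400_000)]
print(ld_r2([0, 0, 1, 1], [0, 0, 1, 1]), genes_in_window(1_000_000, annotations))
```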

Results
We performed regression analysis for both strategies (Common SNPs and Highest 1% SNPs). The results were very similar in the final number of SNPs selected and in the percentage of dEBV variance explained by the final set of SNPs (Table 1). The discussion therefore focuses on the set of 21 SNPs selected by the Highest 1% SNPs strategy, which explained the higher percentage of dEBV variance. Also, the first five SNPs (rs133046994, rs137294146, rs109349988, rs136717249, rs134790147) in the regression model were the same, and in the same order, for both strategies. These first five SNPs were responsible for 34.13% of the dEBV variance for backfat thickness.
Because splitting a small sample into training and validation datasets can produce spurious artifacts, no such split was performed here. An alternative option is to use a permutation test, which calculates the probability of obtaining a value more extreme than or equal to the observed value of a test statistic by shuffling the data and recalculating the test statistic. The proper test statistic for multiple regression is the coefficient of multiple determination, R² [55].
A permutation test was carried out to evaluate the probability of bias associated with the R² from the stepwise regression analysis (Table 1). The average R² from 1,000 permutation tests was 0.00 ± 0.02 for the Highest 1% SNPs strategy, showing that there is a small bias associated with the R² from the stepwise regression analysis. However, this bias is very small when compared to the 53.27% obtained from the Highest 1% SNPs strategy, and therefore reinforces the significance of the results presented in Table 1. Table 2 shows the 21 SNPs selected by the stepwise regression, their chromosome, position, % of dEBV variation explained by the SNP, genes annotated within ± 250Kb, fat-related QTLs described in the current literature, and references. Table 3 shows a summary of pathway annotation using the genes within ± 250Kb from the 21 selected SNPs using the KEGG [53,54] pathway database.

Discussion
The use of the random forest approach as a first step, to filter candidate SNPs without specifying a statistical model, is advantageous in genome-wide association studies, especially when little is known about candidate regions and the genetic architecture of the trait. Furthermore, the fact that the two strategies (Common SNPs and Highest 1% SNPs) gave very similar results lends reliability to the random forest methodology, in agreement with a previous study [43].
With the exception of four selected SNPs in the Highest 1% SNPs strategy (chr 12: rs136348926; chr 11: rs110833507; chr 2: rs42923911; chr 9: rs110025080), all other SNPs had a fat-related QTL described in their chromosome region. Also, only one SNP on chr 3 (rs42021729) is not close to any described gene in the surrounding area (± 250Kb) (Table 2).
In a previous genome-wide association study in Canchim, 100 SNPs on several chromosomes were considered the optimal set of SNPs to differentiate the 30 individuals with extreme phenotypes for backfat thickness. Among these SNPs, two haplotypes on chr 14 were genotyped and their association with the phenotype was validated in the whole population [33]. In the current study, even though SNPs from chr 14 were associated with backfat thickness by the random forest approach (in the Common SNPs and Highest 1% SNPs strategies, data not shown), these SNPs were not selected in the stepwise regression model. Conflicting results and/or studies that cannot be replicated are not uncommon in the post-genomic era [56][57][58][59], and these differences can be attributed to insufficient power, false-positive results, bias, sample size, and differences in populations, controls, and methodologies [56][57][58], or to true heterogeneity of associations [56]. In these two GWA studies with Canchim, the base population is very similar, but the sample size and methodologies are not, which could explain the difference in the findings. A future option to help clarify the inconsistency in these findings would be to perform a meta-analysis, which combines data to increase sample size and power while reducing the risk of error [58,60].
Another outcome of this study and the previous one [33] is the possibility of including these SNPs in the development of a low density SNP (LD-SNP) panel for implementation of genomic selection in Canchim beef cattle. The most widespread strategy for developing small panels is to apply variable selection methods to identify a small set of SNPs with good predictive power for the trait or breeding value [61]. The increase in accuracy of genomic breeding values obtained with LD-SNP panels can be highly similar (around 90%) to that obtained with high density panels [62,63], at a lower cost. Therefore, LD-SNP panels are more likely to be adopted by farmers and the beef industry [64]. Furthermore, LD-SNP panels developed with SNPs selected on the basis of their effects perform better than LD-SNP panels with evenly spaced SNPs [62,63]. Importantly, SNPs identified in these studies need to be validated in a population of animals not included in the population used for SNP discovery (training population), to give confidence in genomic predictions for future populations.
Of the SNPs identified in this study, two on chr 10 (rs133046994, rs135638125) were associated with backfat thickness and together accounted for almost 12% of the dEBV variation (Table 2). These two SNPs are in the same chromosomal region as fat-related QTLs identified in previous studies [65,66], and they map to the same genes (THSD4, thrombospondin, type I, domain containing 4; and LRRC49, leucine-rich repeat-containing protein 49), thereby indicating THSD4, LRRC49 and the surrounding areas as strong candidates for further investigation and validation. The LRRC49 gene has been linked to breast cancer in humans, but very little is known about the biological function of the protein encoded by this gene [67].
The THSD4 gene in Bos taurus and in Homo sapiens has a provisional status from RefSeq [68], which, by definition, supports that this gene is both transcribed and expressed. Further evidence for the annotation of this gene is given by its sequence identity in the UniGene database [52] when compared to orthologous sequences from M. musculus (95.1%), which has a validated status in RefSeq, and to H. sapiens (93.1%), suggesting a well-conserved homology of the THSD4 gene in these species.
The THSD4 gene encodes a protein with conserved disintegrin and metalloprotease domains, which it shares with the ADAM-TS1 protein family, and plays an important role in adipogenesis [69]. Previous studies have shown that this protein family interferes with the availability of differentiation-inducing or differentiation-inhibiting growth factors, either by modifying the extracellular matrix, affecting cell migration and adhesion, or by activating other pathways, which are key for regulating the differentiation of adipocytes, allowing their growth and expansion during adipogenesis [70]. The subcutaneous fat percentage QTL reported on chr 10 (Table 2) is from a Charolais × Holstein crossbred cattle population, and is described as highly significant, with additive effects estimated to be 0.5 phenotypic standard deviation units [65]. The study also reveals that the Charolais allele was associated with higher fat levels.
The SNP on chr 1 (rs137294146) associated with backfat thickness is responsible for approximately 9.4% of the dEBV variation (Table 2). There is also a reported QTL for fat thickness over the 12th rib [29] and another for intramuscular fat percentage [71], indicating that there should be one or more genes in this area affecting fat metabolism. In the 500Kb window surrounding this SNP, three genes are annotated: SOX14 (sex determining region Y-box 14), CLDN18 (claudin 18), and DZIP1L (DAZ interacting protein 1-like). The SOX14 gene seems to be involved in the regulation of embryonic development, whereas CLDN18 belongs to a multigene family that encodes tetraspanning membrane proteins that are components of tight junctions, but its regulatory mechanisms and its roles in physiology and pathology are still under investigation [72]. The DZIP1L gene encodes a zinc finger protein, but how it affects either adipogenesis or lipid metabolism has not been described in the current literature. The functions of these gene products are still being elucidated.
The 500Kb window around the SNP on chr 3 (rs109349988) reveals many annotated genes, some of which have been reported as participating in lipid metabolism. For example, PMVK (phosphomevalonate kinase) catalyzes the conversion of mevalonate 5-phosphate with ATP to form mevalonate 5-diphosphate and ADP, which is one of the initial reactions involved in the cholesterol biosynthetic pathway [73]. Other genes in this region include ADAR (adenosine deaminase, RNA-specific), which encodes an enzyme that edits RNA by site-specific deamination of adenosines, resulting in changes in protein function or gene expression. A study in humans found that ADAR enzymes were associated with serum triglyceride and adiponectin levels, abdominal circumference, and body mass index [74]. Interestingly, this region also contains SHC1 (Src homology 2 domain containing transforming protein 1), which has been reported as having a role in human obesity [75], and as being one of the mediators regulating the insulin-like growth factor 1 (IGF-1) pathway, which plays a key role in regulating cell proliferation, differentiation and apoptosis [76]. Lastly, this region contains ADAM15 (ADAM metallopeptidase domain 15), which belongs to the ADAM protein family previously discussed. These studies corroborate our findings, although further investigation is required to elucidate how these genes affect the deposition of subcutaneous fat in bovines.
The SNP associated with backfat thickness on chr 19 (rs136717249) is responsible for approximately 4.88% of the dEBV variance. This region contains the PHOSPHO1 (phosphatase, orphan 1) gene, which encodes a phosphatase enzyme that has been implicated in the mineralization of the extracellular matrix, a key process for skeletal development [77]. The PHOSPHO1 gene product has high activities toward phosphoethanolamine (PEA) and phosphocholine (PCho) [78], which are the main metabolites involved in the pathway for the formation of phosphatidylcholine and phosphatidylethanolamine [79]. These compounds are implicated in the metabolism of complex glycerolipids, prostaglandins, leukotrienes, glycosylphosphatidylinositol-anchors, and some amino acids, such as glycine, serine and threonine. Also included in this region is the PHB gene (prohibitin), which is thought to be involved in regulating cell proliferation, gene transcription, and apoptosis. In recent studies, deficient PHB activity in the liver has been associated with non-alcoholic steatohepatitis and obesity, although the mechanism remains unknown [80,81]. Other examples include the IGF2BP1 (insulin-like growth factor 2 mRNA binding protein 1) gene, which encodes a protein that binds to the mRNAs of certain genes and regulates their translation. Lastly, the GIP (gastric inhibitory polypeptide, also known as the glucose-dependent insulinotropic polypeptide) gene has a known effect on stimulating the release of insulin from pancreatic β cells, but also has an insulin-like effect on adipocytes, suggesting that the GIP gene product enhances adipocyte glucose uptake, and that, at least in humans, it has an important role in the development of nutrition-induced obesity [82]. A recent study suggests that the GIP gene product has an effect on reducing free fatty acid release from adipose tissues, either by increasing reesterification or by inhibition of lipolysis [83]. Indeed, QTL studies reveal oleic acid content (OAC) and palmitoleic acid content (PAC) QTLs [84,85] in close proximity to the GIP gene in the bovine genome, which further suggests an association between this gene and free fatty acid processing.
The SNP rs134790147 on chr 13 was also associated with backfat thickness, and it accounts for 3.51% of the dEBV variance. Within this SNP region, a QTL for fat thickness over the 12th rib was found and described in an Angus population [29]. Also, a set of four genes is located in the ± 250Kb window around the SNP position. The CCDC7 gene (coiled-coil domain containing 7) seems to be associated with human cancer [86,87], and there is no information available for bovines. The ARL5B gene product (ADP-ribosylation factor-like 5B), also known as ARL8, belongs to a family of proteins that show similar structure to ADP-ribosylation factors (ARFs family). ARLs and ARFs belong to the RAS superfamily of small GTPases, which function as modulators of complex and diverse cellular processes [88,89], of which the most canonical are cell proliferation and differentiation. However, they are also involved in protein trafficking through the trans-Golgi network (TGN). The TGN has a central role in protein sorting and directs the transport of newly synthesized proteins to different transport vesicles [90][91][92], and also receives recycled molecules and extracellular materials by retrograde transport. Recently, it was observed that ARL5B enhances retrograde transport from endosomes to the TGN [93]. No functional information is available for the products of the MGC152301 (uncharacterized LOC783682) and LOC524240 (Alk-like) genes, but both show the same two conserved domains: cd00112 (LDLa) and cd06263 (MAM) [94]. The LDLa is a low density lipoprotein receptor class A domain that plays an important role in mammalian cholesterol metabolism; the receptor protein binds LDL and transports it into the cell by endocytosis [95]. The MAM is an extracellular domain that mediates protein-protein interactions, and is found in a variety of proteins, many of which are known to function in cell adhesion [96].
The remaining 16 SNPs, which were not described in detail here, accounted for 19.14% of the dEBV variation for backfat thickness and, as seen in Table 2, most of them have some fat-related QTL described within their regions [29,65,66,85][97][98][99]. They are of interest for future investigations on how these SNPs may influence backfat thickness deposition in Canchim beef cattle.

Conclusions
In this study, we were able to identify a set of SNPs that explains approximately 50% of the deregressed estimated breeding value variance for backfat thickness in Canchim beef cattle, which introduces the possibility of including these SNPs in the development of a low density SNP panel for future implementation of a genomic selection program in Canchim beef cattle. We have also applied a new methodology using the Random Forest approach to identify novel gene candidates for improving backfat thickness in Canchim beef cattle. In addition, although this study used backfat thickness as the target trait, other analyses of this type have successfully used other traits, thereby supporting the random forest approach as a means for future investigations of livestock production traits. Lastly, some of the regions identified are not conspicuously associated with any specific genes. This suggests that they may be involved in as yet unidentified regulatory functions of gene expression or processing. Given the intrinsic complexity of biochemical pathways, these regions and the genes within them merit further investigation, specifically regarding how they correlate with backfat thickness deposition in Canchim beef cattle and in other breeds.