Estimates of linkage disequilibrium and effective population size in rainbow trout

Background The use of molecular genetic technologies for broodstock management and selective breeding of aquaculture species is becoming increasingly more common with the continued development of genome tools and reagents. Several laboratories have produced genetic maps for rainbow trout to aid in the identification of loci affecting phenotypes of interest. These maps have resulted in the identification of many quantitative/qualitative trait loci affecting phenotypic variation in traits associated with albinism, disease resistance, temperature tolerance, sex determination, embryonic development rate, spawning date, condition factor and growth. Unfortunately, the elucidation of the precise allelic variation and/or genes underlying phenotypic diversity has yet to be achieved in this species having low marker densities and lacking a whole genome reference sequence. Experimental designs which integrate segregation analyses with linkage disequilibrium (LD) approaches facilitate the discovery of genes affecting important traits. To date the extent of LD has been characterized for humans and several agriculturally important livestock species but not for rainbow trout. Results We observed that the level of LD between syntenic loci decayed rapidly at distances greater than 2 cM which is similar to observations of LD in other agriculturally important species including cattle, sheep, pigs and chickens. However, in some cases significant LD was also observed up to 50 cM. Our estimate of effective population size based on genome wide estimates of LD for the NCCCWA broodstock population was 145, indicating that this population will respond well to high selection intensity. However, the range of effective population size based on individual chromosomes was 75.51 - 203.35, possibly indicating that suites of genes on each chromosome are disproportionately under selection pressures. Conclusions Our results indicate that large numbers of markers, more than are currently available for this species, will be required to enable the use of genome-wide integrated mapping approaches aimed at identifying genes of interest in rainbow trout.


Background
The use of molecular genetic technologies for broodstock management and selective breeding of aquaculture species is becoming increasingly more common with the con-tinued development of genome tools and reagents for species of interest [1]. Rainbow trout are the most widely produced salmonid in the US, attracting significant interest due to their economic impacts as an aquaculture spe-cies and on sport fisheries, and as a model research organism for studies related to carcinogenesis, toxicology, comparative immunology, disease ecology, physiology and nutrition [2]. To this end several international laboratories have produced genetic maps for this species to aid in the identification of loci affecting phenotypes of interest. These maps primarily include amplified fragment length polymorphisms (AFLPs) and microsatellites [3][4][5][6][7][8][9] and have resulted in the identification of many quantitative/qualitative trait loci (QTL) affecting phenotypic variation in traits associated with albinism, disease resistance, temperature tolerance, sex determination, embryonic development rate, spawning date, condition factor and growth [10][11][12][13][14][15][16][17][18][19]. In spite of these efforts, the elucidation of the precise allelic variations and/or genes underlying phenotypic diversity has yet to be achieved in this species having low marker densities and lacking a whole genome reference sequence.
Experimental designs which integrate complex segregation analyses with linkage disequilibrium (LD) approaches facilitate the discovery of genes affecting important traits [20][21][22]. To this end, the extent of LD has been characterized for humans and several agriculturally important livestock species including cattle, sheep, chickens, and pigs. Farnir et al. [20] genotyped 284 autosomal microsatellite markers on 581 dutch black-and-white dairy cattle to construct a whole genome LD map (excluding the sex chromosomes) spanning 2702 cM. Estimations of LD between syntenic loci using Lewontin's normalized D' [23] revealed large blocks of LD spanning tens of centiMorgans including values of 50% for markers <5 cM and decaying to 16% for distances of 50 cM. The value of D' between non-syntenic loci was estimated to be 12%. Vallejo et al. [24] selected distantly related animals to quantify the level of genetic diversity in United States Holstein cattle. While only 23 Holstein bulls were genotyped with 54 microsatellite loci that spanned most of the autosomal genome, extensive LD was detected in the United States Holstein population in agreement with the findings of Farnir et al. [20]. In 2002, McRae et al. [25] observed similar results by genotyping 90 microsatellites from 10 chromosomes on 276 progeny from Coopworth sheep, estimating D' values of 34.3% for marker distances <60 cM, 12.4% for syntenic markers >60 cM, and 12.4% for non-sytenic loci. In 2005 Heifetz et al. [26] reported that significant LD was only observed for loci < 5 cM apart in a commercial layer chicken population. In 2006 Harmegnies et al. [27] characterized LD in two commercial pig populations of 33 and 44 unrelated individuals by genotyping 29 and 5 microsatellite markers on two chromosomes, respectively. Estimates of r 2 (squared correlation coefficient) revealed significant LD only for loci < 1 cM apart. More recently, Du et al. [28] and McKay et al. [29] evaluated LD in pigs and cattle, respectively, by gen-otyping syntenic single nucleotide polymorphisms (SNPs) and observing significant LD for marker pairs < 3 cM apart for pigs and 0.5 Mb for cattle. All of these results indicate that large numbers of markers from high-density maps are required to identify genes of interest using whole genome association studies in these species.
The USDA/ARS National Center for Cool and Cold Water Aquaculture (NCCCWA) has established a breeding program for rainbow trout, the use of molecular genetic technologies in this program is expected to enhance capabilities for selective breeding of important aquaculture production traits. To this end we have worked within international collaborations to develop genomic tools and technologies for rainbow trout [5,8,9,[30][31][32][33][34][35][36][37][38][39][40][41][42][43][44][45][46] while initiating and characterizing our broodstock population with respect to genetic and phenotypic variation relevant to aquaculture production [47][48][49][50][51][52][53][54][55][56][57]. Our approaches include genetic linkage and LD mapping for the identification of QTL affecting traits of economic importance which will permit the development of marker/gene assisted selection strategies [58,59] and eventually genomic selection [60]. Characterizing the extent of LD in the NCCCWA rainbow trout broodstock population will support the use of integrated mapping approaches and facilitate the identification of genes affecting traits of interest by determining the marker densities required to conduct genome association scans. The NCCCWA broodstock population is closely related to commercial germplasm [49], therefore our findings also have the potential to impact industry selective breeding programs. In addition to LD we have evaluated effective population size (Ne) in an effort to characterize the true breeding size of our population. This estimate of Ne should be considered when making decisions concerning selection pressure. To this end, we characterized extent of LD by genotyping 96 unrelated individuals with 49 markers spanning four chromosomes.

Results
Genotypes were obtained for a total of 49 microsatellites on chromosomes OMY13 (n = 6), OMY14 (n = 20), OMY17 (n = 8), and OMYSex (n = 15). Genotyping success rate was 95.5% with 82 to 96 animals scored for each marker. The number of alleles per marker, marker heterozygosity, PIC, Allelic Diversity, and exact tests for departure from HWE proportions are reported in Table 1. The number of alleles per marker averaged 10.6 and ranged from 3 to 24. Marker heterozygosities ranged from 28.1% to 94.8% with an average of 69.8%. Polymorphism Information Content and Allelic Diversity ranged from 0.385 to 0.936 and 0.425 to 0.94, with averages of 0.745 and 0.773, respectively. A total of 19 loci had significant departure from HWE at P < 0.05, consisting of 3 markers from OMY13, 9 from OMY14, 2 from OMY17, and 5 from OMYSex.
LD for syntenic loci was estimated for each individual chromosome and the four chromosomes combined (Figure 1), the genome-wide estimate for extent of significant LD (r 2 > 0.25) was about 2 cM. The non-linear modeling of LD decline using the Sved (1971) equation indicates that the extent of significant LD decays sharply with physical distance. It also indicates that significant LD effects (i.e., r 2 ≥ 0.25) are not expected beyond 5 cM.
The LD for non-syntenic loci was estimated for each chromosome pairing and all pairings of non-syntenic loci, the proportion of pairings having (r2 > 0.25) is shown in Figure 2. Clearly, the proportion of significant LD (r 2 ≥ 0.25) among non-syntenic loci was quite low Effective population sizes were estimated for each chromosome and the four chromosomes as shown in Table 2.
The estimated average effective population size was N e = 145 with an average N e /N ratio of 0.45.

Discussion
The extent of LD in the NCCCWA selective breeding program was evaluated by genotyping 49 microsatellites from four chromosomes on a total of 96 unrelated individuals from the 2005 (n = 43) and 2006 (n = 53) brood classes. As a result of an evolutionarily recent autopolyploid genome duplication event [61], many microsatellite markers for rainbow trout amplify two loci which can be scored independently and placed on genetic maps [4]. Although medium density genetic maps based on microsatellites including duplicated markers have been constructed [6,9], they are problematic for use in population genetic analysis such as whole genome association studies as alleles from the two loci often have overlapping and/or identical allele sizes. Thus the number of loci available for use in whole genome association studies is much less than what is available for segregation analyses. Only evaluating single locus markers resulted in observation of a 6 cM region of OMY13 with 6 loci, a 108 cM region of OMY14 with 20 loci, a 55 cM region of OMY17 with 8 loci, and a 58 cM region of OMYSex with 15 loci. Overall, 227 cM of the 2927 cM genome was represented in this study. Of interest is the 55 cM region of OMY14 spanning the region from 75-130 cM where 6/7 loci depart from HWE. It is possible that these loci are under selection in the NCCCWA broodstock population which includes selective pressure for disease resistance and growth.
The level of LD between syntenic loci decayed rapidly at distances greater than 2 cM which is similar to observations of LD in other agriculturally important species including cattle [20], sheep [25], pigs [27,28] and chick-ens [26]. However, significant LD was also observed up to 40 cM as reported by Farnir et al. [20] and McRae et al. [25] ( Figure 1). The average proportion of significant LD (r 2 > 0.25) between non-syntenic loci was under .005, however OMY13 and OMY17, which have a common ancestor chromosome resulting from the salmonid whole genome duplication event, showed a significantly higher proportion of significant LD.
We acknowledge that the LD range reported here may be up-biased because evidences for structure and admixture have been shown in the NCCCWA Broodstock population [62] and we used unrelated individuals from 2005 and 2006 brood years that are being improved for disease resistance and production traits, respectively, in the NCCCWA selective breeding program.
As the NCCCWA selective breeding program continues throughout successive generations, we expect that the effective population size will continue to decrease. Our estimate of N e = 145 (Table 2) based on genome wide LD indicates that this population will respond well to high selection intensity as it has for a single generation of breeding for resistance to the causative agent of bacterial cold-water disease [50]. Using this estimate we can observe the effects of selection and identify correlations with the effects of inbreeding, usually first observed in reproductive success. However, the range of effective population size based on individual chromosomes was 75.51 -203. 35, possibly indicating that suites of genes on each chromosome are disproportionately under selection pressure. This also establishes the need to develop whole genome applications for these species for studies of LD.
Although F-statistics have been used to estimate population genetic parameters [49], characterization of LD will also provide valuable information about population substructure and should be a factor in developing strategies aimed at the identification of associations between markers and QTL.
We chose to evaluate the extent of LD by genotyping representatives of each year class a single generation after the population was founded by mixing germplasm from multiple sources. Our resulting characterization of LD not only provides information about population structure, but serves to document the degree of genetic diversity used to initiate the broodstock program. We expect our estimates of Ne to be higher than populations with limited genetic diversity and no introduction of new germplasm in recent past generations.
The estimated N e /N ratio of 0.45 in this study is between the ranges of estimates reported in this species in previous studies [63][64][65]. Other studies in salmonids consistently reported that the variance in reproductive success is the key factor to reduce N e /N in salmon populations [64,66].
Characterizing the extent and distribution of LD helps to determine the required marker density for LD mapping and genomic selection as they both require markers to be in LD with QTL. Our observation of significant syntenic LD at distances over 2 cM has implications for designing genome wide association studies in this population and on this species. The sex averaged map is roughly 3000 cM long; therefore 1500 markers are required to identify loci of interest. The male map having 2500 cM would require 1125 markers; the female map having 4300 cM would require 2150 markers. Currently about 1800 microsatellite markers are available for genome analyses in trout. However, we must take into consideration that: 1) roughly 33% are duplicated; 2) on average 70% are informative based on estimates of heterozygosity; and 3) these markers are not necessarily spaced at regular intervals throughout the genome.

Conclusions
To effectively conduct whole genome association studies the number of available markers for rainbow trout must be increased. Whereas genotyping microsatellites can be very expensive and time consuming process, it is likely that a marker system that enables high-throughput genotyping protocols such as single nucleotide polymorphisms will be the basis for LD mapping and genomic selection studies in the near future.

Germplasm
The NCCCWA rainbow trout selective breeding program was initiated in 2002 through 2004 with the acquisition of fish from Troutlodge, Inc. (Sumner, WA), the Donaldson strain from the University of Washington (Seattle, WA), the House Creek strain from the College of Southern Idaho (Twin Falls, ID), and the Shasta strain from the Ennis National Fish Hatchery, (Ennis, MT) [49]. To date, selective breeding for increased aquaculture production efficiency has been conducted by evaluating and selecting for growth traits on even years and resistance to challenge by the bacterial pathogen Flavobacterium psychrophilum on odd years [47,50]. To identify individuals that could best represent the genetic variation contained within this broodstock population, we identified 96 unrelated individuals (no siblings or half-siblings) from the 2005 and

Genotyping
Microsatellite markers from four chromosomes (n = 49) were selected from the NCCCWA genetic map [9] for genotyping the DNA panel of 96 fish representing NCCCWA broodstock (Table 1). This included markers mapped to the sex chromosome (OMYSex), chromosome 14 having the largest linkage group in cM (OMY14), and paralogous regions of chromosomes 13 and 17 (OMY13, OMY17) as defined by duplicated loci resulting from an evolutionarily recent genome duplication event [4,9,61]. Only single locus markers were utilized. Markers were either genotyped using the tailed protocol [68] size standard. Samples were denatured at 95°C for 5 min and kept on ice until loading on an automated DNA sequencer ABI 3730 DNA Analyzer (ABI, Foster City, CA, USA). Output files were analyzed using GeneMapper version 3.7 (ABI, Foster City, CA, USA), formatted using Microsoft Excel and stored in Microsoft Access database.

Analysis
The initial dataset for analysis included marker genotype data for 49 microsatellite loci typed on 96 unrelated individuals. These marker genotype data were analyzed to estimate frequency of alleles per marker using the ALLELE procedure of software package SAS ® , version 9.3.1 [70]. Within each marker, alleles with frequency ≤ 0.01 were merged to minimize the up biasing effect of rare alleles on LD estimates. Then, the marker alleles initially recorded in size of fragments were recoded into a consecutive-numbered allele system using the computer program RECODE [71]. This dataset with recoded marker genotypes were divided into files including marker loci corresponding to each of the four linkage groups analyzed in this study. Subsequently, these recoded marker genotype data were used in haplotype reconstruction and linkage disequilibrium analysis.

Haplotype reconstruction
For each linkage group, the most likely haplotype configuration for each individual was estimated using the software package PHASE, version 2.1, following procedures described by [72] and [73]. Briefly, this is a statistical method for inferring haplotypes from unphased genotype data for a sample of "unrelated" individuals from a population; population haplotype frequencies are assumed under Hardy-Weinberg equilibrium (HWE) but PHASE has proven robust to deviations from HWE, the effect of population structure and moderate amounts of recombination. The estimated haplotype data was formatted for subsequent analysis with the ALLELE procedure of the software package SAS ® , version 9.3.1 [70]. The haplotype data was formatted into SAS ALLELE procedure format (option TALL for data input format) using a Java script (Haplo_2SAS.jar) written by Christopher Schmitt. This script is available for academic use upon request to RLV. This formatted dataset was used in linkage disequilibrium analysis.

Hardy-Weinberg equilibrium
The HWE analysis for each of the 49 loci typed in 96 unrelated individuals was performed using the ALLELE procedure of the software package SAS ® , version 9.3.1 [70] using the option TALL for data input format. The input data were the most likely haplotypes reconstructed for each individual using the computer program PHASE and reformatted into SAS format as outlined above. The exact test for HWE for each microsatellite loci (exact P-value) was estimated using 10,000 permutations.

Linkage disequilibrium between syntenic loci
The LD analysis was performed between marker pairs within each linkage group using the ALLELE procedure of software package SAS ® , version 9.3.1 [70]. We used the option TALL for data input format, and the HAPLO = GIVEN option which indicates that the haplotypes have been observed, and thus the observed haplotype frequencies were used in the LD test statistic and measures.
The linkage (or gametic) disequilibrium coefficient D uv between two alleles from marker loci M and N, respectively, was estimated using this expression [74], Where p uv is probability that an individual receives the haplotype M u N v for marker loci M and N, p u is the probability of the u allele, and p v is the probability of the v allele.
The ALLELE procedure calculates the maximum likelihood estimates, , of the LD coefficient between a pair of alleles at different markers. This procedure calculates an overall chi-square statistic to test that all of the D uv 's between two markers are zero as follows, Which has (k-1) and (l-1) degrees of freedom for markers with k and l alleles, respectively. A Monte Carlo estimate of the exact P-value for testing the hypothesis was calculated by conditioning on the haplotype counts. The significance level is obtained by permuting the alleles at one locus to form 2 n new two-locus haplotypes. We used 10,000 permutations to estimate the exact P-values.
We used the squared correlation coefficient (r 2 ) as the LD measure for each pair of alleles M u and N v located at loci M and N, respectively [75], Where D = p 11 p 12 -p 12 p 21 is the LD coefficient, which can be directly estimated using the observed haplotype frequencies when using the option HAPLO = GIVEN with the ALLELE procedure of SAS ® , version 9.3.1 [70].

Linkage disequilibrium between nonsyntenic loci
We estimated LD adjusted for experimental sample size (r 2 -) between pairs of non-syntenic loci using the ALLELE procedure of software package SAS ® , version 9.3.1 [70]. As input data, we used marker genotype data with alleles recoded into a consecutive-numered system (nonphased marker genotype data). In the analysis, we used the option HAPLO = EST which indicates that the maximum likelihood estimates of the haplotype frequencies are used to calculate the LD test statistic as well as the LD measures. For LD estimation among non-syntenic loci, we used the ALLELE procedure options ALLELEMIN = GENOMIN = HAPLOMIN = 0.01. This last statement ensures that only alleles, genotypes, and haplotypes with frequency ≥ 0.01 are used in the LD analysis.

Effective population size
The analysis was based on the known relationship between LD as measured by r 2 (squared correlation of allele frequencies at a pair of loci) and effective population size N e , Where c is the recombination rate between the microsatellite loci and n is the experimental sample size. The constant α = 1 in the absence of mutation [76] and α = 2 if mutation is taken into account [77][78][79]. The constant k was set to k = 2 for sex chromosome and k = 4 for autosomes.
Given the formulae described in the linkage disequilibrium section, and knowing r 2 and c, we estimated N e for each chromosome by fitting this nonlinear regression model, Where is the observed LD (adjusted for chromosome sample size n) for marker pair i in chromosome j, c ij is the recombination rate from two-point linkage analysis for marker pair i in chromosome j. The parameter β j is the estimator of effective population size for chromosome j where . The parameters α j and β j were estimated iteratively using non-linear modeling with JMP ® Genomics 3.1 (SAS Institute Inc., Carey, NC, 2007).

Authors' contributions
CR participated in design, marker selection, genotyping, and drafted the manuscript, RV participated in design, performed the calculations for LD and effective population size. Both authors read and approved the final manuscript.