SNP frequency, haplotype structure and linkage disequilibrium in elite maize inbred lines

Background Recent studies of ancestral maize populations indicate that linkage disequilibrium tends to dissipate rapidly, sometimes within 100 bp. We set out to examine the linkage disequilibrium and diversity in maize elite inbred lines, which have been subject to population bottlenecks and intense selection by breeders. Such population events are expected to increase the amount of linkage disequilibrium, but reduce diversity. The results of this study will inform the design of genetic association studies. Results We examined the frequency and distribution of DNA polymorphisms at 18 maize genes in 36 maize inbreds, chosen to represent most of the genetic diversity in U.S. elite maize breeding pool. The frequency of nucleotide changes is high, on average one polymorphism per 31 bp in non-coding regions and 1 polymorphism per 124 bp in coding regions. Insertions and deletions are frequent in non-coding regions (1 per 85 bp), but rare in coding regions. A small number (2–8) of distinct and highly diverse haplotypes can be distinguished at all loci examined. Within genes, SNP loci comprising the haplotypes are in linkage disequilibrium with each other. Conclusions No decline of linkage disequilibrium within a few hundred base pairs was found in the elite maize germplasm. This finding, as well as the small number of haplotypes, relative to neutral expectation, is consistent with the effects of breeding-induced bottlenecks and selection on the elite germplasm pool. The genetic distance between haplotypes is large, indicative of an ancient gene pool and of possible interspecific hybridization events in maize ancestry.


Background
Direct analysis of genetic variation at the DNA sequence level at many loci became possible in recent years due to improvements in sequencing technology. High through-put genotyping methods, including DNA chips, allele-specific PCR and primer extension approaches make single nucleotide polymorphisms (SNPs) especially attractive as genetic markers [1][2][3].
If a whole-genome scan is to be undertaken, trait mapping by allele association requires high marker density [4][5][6][7] which could be provided by SNPs. Recent detailed analysis of allelic diversity at the maize Dwarf8 gene, which indicated association with flowering time [8], is an example of association approach using candidate genes. SNPs may also be used for mapping expressed sequence tags (ESTs) in defined segregating populations and for the integration of genetic and physical (contig) maps, which contain ESTderived landmarks.
While polymorphic simple sequence repeats (SSRs, [9]) are excellent molecular markers, because of their multiallelism and the resulting high informativeness, they may not be frequent enough for association studies. Size homoplasy of SSR alleles, as well as allele reversion could also be a problem in some applications [10,11].
In contrast to humans [12], few systematic whole genome searches for single nucleotide polymorphisms have been undertaken in plant species, with the exception of Arabidopsis [http://www.arabidopsis.org/cereon/]. However, it has been established that plants differ widely in the level of intra-specific sequence diversity. For a recent review, see [13] Maize is generally considered highly polymorphic, and it has been suggested that active transposon systems contributed to the creation of diversity [14]. For example, in one of the early studies of the sh1 locus, [15] detected 16 nucleotide changes in 540 bp, while in the 3'-untranslated region 10 changes occurred in 270 bp. Several other maize loci including Adh1 [16,17], Adh2 [18], Opaque-2 [19], b [20], glb1 [21] have been studied systematically. The range of nucleotide diversity p reported for maize genes is wide, from 0.47 (per 1000 bp) for the promoter region of tb1 [22] to 37, for synonymous substitutions at glb1 [21], a difference of almost two orders of magnitude.
Reduced allelic diversity is expected in domestication related genes. This was found in c1, an anthocyanin-biosynthesis regulatory locus [23]. Wang [22] and White [24] recently examined domestication -related changes in nucleotide diversity along the length of two maize genes. In the case of teosinte branched, tb1, a significant reduction of diversity occurred in the promoter region, but not in the coding region [22]. The sequences of terminal ear 1 (te1) alleles showed evidence of linkage disequilibrium, and only a small number of haplotypes was identified in cultivated maize, in contrast to maize progenitors.
A recent study involved 21 loci along chromosome 1 of maize, and indicated high level of diversity in landraces, only somewhat reduced in U.S. inbreds [25]. As has been previously found in Drosophila [26], diversity was correlated with recombination rate. Linkage disequilibrium was found to decline within 100-200 bp [25].
Further studies of the nature, frequency and distribution of sequence variation in the agronomically relevant maize germplasm, would allow better understanding of the range diversity and the nature of genetic changes associated with domestication and selection for agronomic performance. To this end, we surveyed sequence diversity at 18 loci. Gene segments were amplified from 36 maize elite inbred lines and sequenced. The frequency and the nature of polymorphisms were examined in detail. Structure of SNP haplotypes and short-range linkage disequilibrium within loci were also analysed.

Experimental approach
To identify and characterise patterns of DNA sequence polymorphisms in or near maize genes, we sequenced 22 maize amplicons from up to 36 diverse maize genotypes, representing the major heterotic groups of cultivated maize germplasm mainly of U.S. origin (Table 1). This germplasm set provides an excellent representation of the allelic diversity in agronomically relevant maize, as evidenced by the fact that RFLP alleles present in a modest subset of these lines (#3, 5, 16, 26-28, 30, 31, 36, Table  1) represent allelic diversity of 94.7 % of the 345 maize lines tested (data not shown). To maximise the amount of observed sequence diversity, and thus to increase the number of informative SNPs discovered, we analysed primarily the 3'-untranslated regions of the selected maize genes. PCR primers were designed to amplify a 300-500 bp segment of each gene. In some cases parts of the last intron and exon were also included. The amplicons were derived from 17 different ESTs, eight of which have exact maize GenBank homologs, and from one well-characterised gene sequence (see Additional file 1).

Types and frequency of polymorphisms
Multiple nucleotide changes and insertions / deletions of various lengths were identified, and the results are summarised in Table 2. The distribution of various types of polymorphisms at individual loci is shown in Additional file 2. Single nucleotide changes occur on average every 60.8 bp, and indels occur every 126 bp. The frequency of nucleotide substitutions is almost three times higher in non-coding regions than in coding sequences. Most of the nucleotide changes in the protein-coding regions are silent -only 5 out of 18 changes detected result in amino acid substitution. The difference in the distribution of indels is even more striking, only one 3 bp indel was found in 2.35 kb of coding sequences, while indels occur on average every 85 bp in non-coding regions (54 indels varying in size from 1 bp to over 400 bp were identified). The number of observed insertion / deletion events per locus varies widely, from 0 to 11 (median 1.5 indels per locus). Figure 1 shows size distribution of indels. Among the 55 indels reported here, dinucleotide indels are most fre-quent. Previous indel analysis in a larger data set (655 indels in 215 loci) have shown that single base insertions / deletions are most common [27]. The difference may be due to the fact that a few simple sequence repeat (SSR)like variants, generated by a different mutational mechanism [28] contribute several 2-nt. indels found in the present data. Some nested indels are observed.

SNPs as genetic markers
SNPs were evaluated individually and on the basis of haplotypes (see Additional file 2). The SNP expected heterozygosity is 0.263 (see Additional file 2). Exclusion of indels from the calculation does not produce significant change in the heterozygosity values. In comparison, the SSR expected heterozygosity has been estimated at H = 0.77 [29]. Therefore, individual SNPs are not very informative as molecular markers for genetic diagnostics. If the expected heterozygosity is calculated on the basis of haplotypes, rather than individual SNPs, the value is over twice as high, 0.561. The haplotype expected heterozygosity is comparable to the heterozygosity of RFLP markers (H = 0.58, [29]). Haplotype analysis, while increasing informativeness, it would increase the cost of genotyping, relative to the analysis of single SNPs. This increase would be proportional to the number of SNPs needed to define each haplotype. Usually 2-4 SNPs will be required to tag the haplotypes [30]. The high frequency of polymorphism in maize translates into a large number of SNPs and indels potentially available for use as genetic markers. These markers may be discovered by direct sequencing of gene-adjacent sequences, as described here, or by computer analysis of available EST sequences derived from multiple genotypes [31]. It is in principle feasible to obtain several SNP markers in the vicinity of each maize gene, a subset of which will completely define haplotypes. Such a collection of SNPs may enable whole genome scanning linkage disequilibriumbased approaches [6] to trait dissection and gene mapping in maize, if the amount of linkage disequilibrium in the relevant populations is sufficient.

Gene diversity and divergence dates
The overall level of sequence polymorphism in maize is high, more than double the inter-specific polymorphism rate in mouse (M. castaneous / M.domesticus, [1]), and about an order of magnitude higher than in humans [32]. In maize, expected heterozygosity per nucleotide site (p) values ranging from low 0.47 in the promoter region of a domestication gene, tb1 [22], to 37, for synonymous sites in Globulin-1 (glb1) locus were found [21,24]. For comparison, p in humans is from 0.3 to 1.1 [32]. In our maize study, p averages 6.3 (per bp, ´1000, non-coding regions only), on the low side of the previously reported range for silent sites. This may be explained by the difference in germplasm selections. Most of the earlier studies included a diverse set of maize accessions from North and Central America, while we concentrated on U.S. elites. Gaut [33,34] estimated the synonymous rate of substitution at 4.7-7.0 ´ 10 -9 substitutions per synonymous site per year.
The mean between-haplotype distance we observed is 11.5 nucleotide substitutions per 1000 nt of non-coding sequence, excluding indels, corresponding to 0.8-1.2 my. This number is derived primarily from silent sites, at which the substitution rate may be lower than in synonymous sites [24]. Previous estimates for the age of maize gene pool, derived from the most divergent haplotypes of te1, are quite similar, 1.2-1.4 my [24]. As expected, estimates for individual loci deviate considerably from the mean. For example, the two most distant haplotypes of stearoyl-ACP-desaturase differ by 7 nucleotide substitutions over 228 nt of the 3'-untranslated region of the gene, translating to 2.2-3.2 my divergence, which is close to the estimated divergence time between Tripsacum and maize, 2.3-2.6 my [24]. Two divergent Adh1 haplotypes (15 substitutions per 1025 bp) produce numbers close to the te1 estimates, 1-1.5 my, and slightly lower than the previous estimates for Adh of 1.9 my [35]. The individual gene-derived numbers have to be treated with caution, because they are obtained from short sequence segments and thus are burdened with significant error. Despite lower heterozygosity per nucleotide site (p) in elite maize, highly diverse haplotypes have been maintained in elite lines. Selection for heterosis, which is related to genetic diversity between parents [36][37][38] may have contributed to this effect.
S-adenosylmethionine synthase was the only gene completely monomorphic within the 254 bp (86% 3'-UTR) examined. A reduced diversity was also observed at the Glutamyl-tRNA reductase precursor locus, where one common (p = 0.935) and one rare allele were found, and nucleotide diversity p is 1.9 (per bp, ´1000). However it would be premature to speculate about any functional significance of the apparently reduced diversity at these loci, without first examining larger segments of the genes for polymorphism.
Insertions/deletions occurring on the background of a common haplotype, and therefore presumably of more recent origin, can occasionally be found. The mean difference between haplotypes is strongly affected by the exclusion of indels: 15 differences/ 1000 bp vs. 11.5 nt/1000 bp if indels are disregarded, underscoring the significant contribution of indels to maize genetic diversity.

Haplotype structure and allele distribution
To evaluate the allele distribution in the set of germplasm selected for this study, we applied Tajima D statistics [39,40], which was developed to test neutrality of mutations. Tajima D is based on the comparison of two estimators of Q = 4N e m (where N e is the effective population size and m is the mutation rate), one based on the number of segregating sites and one based on the number of pairwise differences between sequences in the sample [41]. Size range (bp)

Number of occurences
Departures from neutrality expectation can be dues to a number of factors, including population expansion, bottleneck or heterogeneity of mutation rates [42], therefore neutrality is not an expectation in the set of germplasm analysed here. While the Tajima test in the strict sense does not apply to non-random collections of germplasm such as the maize lines selected for this study, it is still a convenient indicator of the pattern of allele distribution. Negative Tajima D values indicate an excess of low frequency alleles relative to neutral mutation -drift equilibrium. Positive Tajima D indicates a deficit of low frequency alleles relative to expectation. This could be due to a population bottleneck, population subdivision or balancing selection. These factors are likely to be operational in maize elite lines.
There is no indication of the overall strong bias of Tajima (Fig. 2,3,4). It is likely that such patterns would only be revealed upon sampling of a larger set of genetic loci [43,44]. In general, higher genetic similarity is observed within heterotic groups than between heterotic groups, irrespective of the genetic marker system used [44].
At each of the loci, the sequence diversity is organised into a relatively small number (two to eight) distinct haplotypes, many of which were represented multiple times among the 36 inbred maize lines analysed. Figure 2,3,4 and Table 3 show examples of the haplotype relationships. The three most common haplotypes account for over 80% of allelic diversity at 16 out of the 18 loci examined. For example, at the stearoyl-ACP desaturase locus (Fig 3) there are three common haplotypes relatively distant from each other, and a rare one which differs by only one nucleotide change from one of the common haplotypes. Eighteen inbreds, from three heterotic groups share haplotype 4, while two Lancaster-type inbreds, H60 and H98 have rare haplotype 3.
The expected number of haplotypes may be calculated using coalescent theory [45,46], even though such calculations involve many assumptions. Mean number of predicted haplotypes for all loci examined was calculated to be 6.01 (st. dev 2.4), while 3.4 (st. dev. 1.1) was observed. These means are significantly different at P < 0.001 level (two-tailed t-test). Two loci, when examined individually, showed a statistically significant difference between Haplotype structure of a few Z. mays genes has been recognised previously [18,24], but the predominance in maize elite lines of a few diverged haplotypes in linkage disequilibrium, has not been obvious until now. In teosinte, no clear haplotype structure has been identified [24].
Selinger and Chandler [20] found three distinct clades in the phylogenetic tree of maize b gene alleles, with strong separation between clades, indicating that the alleles within clades may have arisen recently when compared with the divergence of the three clades. Both Z. mays and Z. mays parviglumis sequences appear in the three clades. One possible interpretation of this finding is that the three clades may have diverged before the divergence of the genus Zea. An alternative hypothesis, that the nucleotide substitution rates at the upstream region of b are much higher, is favored by Selinger and Chandler. Our study indicates that at least one aspect of the evolutionary pattern seen by these authors, the presence of highly divergent haplotypes, is widespread in elite maize inbreds, favoring the hypothesis of early separation of the three clades.
Phylogenetic analysis of the maize terminal ear (te1) sequences did not resolve all Z. mays sequences into a single clade. Members of the Zea subspecies, with the exception of Z. huehuetenangensis are intermixed within clades [24]. This observation has been made for other maize genes Neighbor-joining trees representing Adh1 haplotype relationships. Level of support for branch points is indicated in %, and branch length expressed as nucleotide differences are shown in parentheses. Genotypes correspond to those of Table  1, and color indicates major heterotic groups: stiff stalk (blue), non stiff stalk (green) and Lancaster (red).  [18,21,23] and has been interpreted as indicative of introgression among Zea taxa [24], or of lineage sorting [24]. The lack of resolution of species within the genus Zea into single clades was also found for c1 and Adh2 [14]. In contrast, glb1 and Adh1 appear to have a different evolutionary history, with Zea luxurians alleles forming a distinct clade [14,17,21].

Non-Stiff Stalk Stiff Stalk Synthetic Lancaster
These observations, together with our data which showed a widespread distribution of highly diverged haplotypes, seem to indicate that interspecific gene flow in the genus Zea amy have been significant. It is tempting to speculate that incongruent evolutionary histories of different loci are related to the origins of alleles either within a single Zea species, or within two or more species, followed by an inter-specific introgression event(s) [47]. Recent surprising finding that some alleles at the bz locus differ in their gene complement and in the composition of intergenic repetitive DNA segments appears to lend further credence to this hypothesis [48].
The observed haplotypes predate domestication of corn, and their distribution at different genetic loci may help understand the process of domestication, including the resulting population subdivision and selective pressures. It is tempting to speculate that selection for high yield, and consequently heterosis in open pollinated varieties and, more recently, between heterotic groups, favoured presence of highly divergent haplotypes at many loci, while in the same time bottleneck effects limiting the Neighbor-joining trees representing stearoyl-ACP desaturase haplotype relationships. Level of support for branch points is indicated in %, and branch length expressed as nucleotide differences are shown in parentheses. Genotypes correspond to those of Table 1, and color indicates major heterotic groups: stiff stalk (blue), non stiff stalk (green) and Lancaster (red).

Non-Stiff Stalk Stiff Stalk Synthetic Lancaster
number of haplotypes. As a result of these competing processes, despite strong selection a relatively high fraction of diversity (77%, [25]) is retained in elite germplasm as few highly divergent haplotypes.

Linkage disequilibrium
The presence of a small number of haplotypes shared by multiple individuals is indicative of linkage disequilibrium (LD). Population bottlenecks and inbreeding increase LD [49]. Thus, elite germplasm may be expected to have extensive linkage disequilibrium.
Linkage disequilibrium measures D' and r 2 were calculated for SNP loci within each gene ( Figure 5). No decline in the value of D' was found within the range of 300-500 bp analysed. Also, the r 2 measure of linkage disequilibrium does not appear to be declining significantly. D' is an accepted measure for the analysis of distance dependence of linkage disequilibrium [30,50], but r 2 has also been used frequently. As a control, LD was also calculated for all pairs of SNP loci between the 18 genes examined. The genes are not known to be linked genetically, and therefore no significant LD was expected between genes. In agreement with the expectation, only 0.3% of between gene pairs of SNP loci showed significant LD at P < 0.01. In contrast, 36.3 % of within gene pairs of SNP loci showed significant LD at P < 0.01. We conclude that the linkage disequilibrium observed within genes is not an artefact.

Figure 4
Neighbor-joining trees representing acetyl-CoA C-acyltransferase haplotype relationships. Level of support for branch points is indicated in %, and branch length expressed as nucleotide differences are shown in parentheses. Genotypes correspond to those of Table 1, and color indicates major heterotic groups: stiff stalk (blue), non stiff stalk (green) and Lancaster (red).   It remains to be determined at what distances, on average, LD declines in this population. In contrast to our result, in recent studies, LD was found to decline rapidly in maize [25,51]. However, both authors examined broad-based sets of germplasm -breeding germplasm and diverse lan-draces, respectively. Significant differences exist between the two studies. [51], unlike [25], found large differences in the rate of LD decay with between loci. Also, overall rate of decay in LD is less in the former study [51], based on a somewhat narrower population of individuals. In conclu-   Position in GenBank accesion AF498472 and genotype   Haplotype  Frequency  38  73 100 111 143 161 239 245 247 250 251 257 266 270 279 290 295 330  -391   449   1 0.030

Non-Stiff Stalk Stiff Stalk Synthetic Lancaster
Allelic sites at polymorphic sites are shown, together with their nucleotide positions. The insertion -deletion is represented as I. Missing data are denoted "?". Promoter region is separated from the rest of the gene by "^" (Adh1 only).
sion, appropriate choice of germplasm may allow one to adjust resolution of association studies, and, consequently, the number of genetic markers required. Elite germplasm may be preferred for initial lower resolution analysis, followed by higher resolution study in a broader germplasm collection.

Evidence for haplotype recombination
At the Adh1 locus there are only two haplotypes, one common and one rare (D = 0.06) within the promoter region (nt. 2-345, X04050, Table 3 (Table 3, haplotype 2). Further analysis of the sequences between nt. 345 and 1030 may help localise the site of recombination. This picture bears some resemblance to the observations of Wang on teosinte branched 1 locus, in that one finds a reduction in diversity in the promoter region and also a recombination event close to the beginning of transcription, even though Adh1 is usually considered a neutral gene [22]. We are currently analysing haplotype structure and linkage disequilibrium in a large region surrounding the Adh1 gene (M. Jung, private communication).

Conclusions
In contrast to previous results obtained in ancestral maize populations, the analysis of maize elite inbred lines demonstrated the presence of a small number of highly diverse haplotypes and strong linkage disequilibrium between SNP loci extending at least to 500 bp. This population structure may result from bottlenecks and selection associated with plant breeding, and has implications for the design of genetic association studies in maize.
Leaves from two-week old greenhouse-grown plants was harvested for DNA extraction.

DNA extraction
Leaf material (fresh, frozen at -80, or lyophilised) was ground with glass beads (150 microns, Sigma G9018) into a fine powder using mortar and pestle, in the presence of liquid nitrogen. The DNA was then extracted using Plant DNAzol (Life Technologies, Inc.) following the manufacturer's recommendation with one modification: after the initial room temperature incubation the tissue homogenate was centrifuged at 10,000 g for 10 min, and the supernatant was collected and used for the chloroform extraction step. The expected product sizes were 300-500 bp on average, usually corresponding to the 3' untranslated region of the gene. In the case of Adh1 three independent amplicons were analysed. A T3 tag (5'-AATTAACCCTCACTAAAGGG-3') was added to the 5' end of the forward primer, and a T7 tag (5'-GTAATACGACTCACTATAGGGC-3') was similarly added to the reverse primer, to facilitate direct PCR product sequencing.

DNA sequence accession numbers
GenBank accession numbers (18 loci, all genotypes examined) are included in Additional file 2. The aligned and concatenated DNA sequences used in the analysis are available as additional data files in the text format (Additional file 3), Nexus format (Additional file 4) and MEGA format (Additional file 5). The list of the sequences included and the coordinates of individual loci within the above listed file is available in Additional file 6.

Data analysis
Conserved haplotypes, that is DNA sequences containing identical allelic variants at all identified polymorphic sites at a locus, but derived from separate individuals, were identified visually or by alphabetical sorting of the list of sequence variants at a locus (see Table 3 for an example). Number of transitions (S), number of transversions (V) and number of insertion / deletion polymorphisms (indels) were counted directly or calculated by using Arlequin 1.1 [41]. Linkage disequilibrium measures D' and R 2 were calculated with DNAsp [45] and with Tassel (Buckler IV, E.S., [http://brooks.statgen.ncsu.edu/buckler]). Insertions / deletions and sites with excess missing data were excluded from the LD calculations. Estimation of expected number of haplotypes, given the estimated value of Theta and recombination using coalescent process simulations were also performed with DNAsp [45].
Frequencies of polymorphic sites per bp (Table 2) were calculated by dividing the total number of polymorphic sites of a given type (SNPs, indels, or both) by the length of the DNA sequence examined. Genetic parameters, including nucleotide expected heterozygosity, number of haplotypes, haplotype expected heterozygosity, mean number of differences between pairs of haplotypes, and Tajima D were calculated using Arlequin 1.1 [41].
Nucleotide expected heterozygosity and haplotype expected heterozygosity calculated from allele and haplotype frequencies, respectively: where n is the number of gene copies in the sample, p i is the frequency of the i-th allele or i-th haplotype [52]. The reported values of nucleotide expected heterozygosity are averages over all polymorphic nucleotide sites within the locus. Expected heterozygosity per nucleotide site p was calculated from nucleotide expected heterozygosity values: Where H i is nucleotide expected heterozygosity at a polymorphic site i, and L is the length of the sequence segment analysed, which contains n polymorphic sites. Insetion / deletion rates are likely to be different from single nucleotide mutation rates, and may not be caused by single molecular events, causing complications in the estimation of divergence times. Therefore, calculations involving genetic parameters were determined for single nucleotide polymorphisms only, and, in some cases, separately, for all polymorphic sites including insertions / deletions (indels). For the purpose of this calculation, each indel was treated as a single event.
Neighbor-joining trees were based on the haplotype sequences, using nucleotide number of differences as a distance measure and were calculated with Mega 2.0 (S. Kumar, K. Tamura, I.B. Jakobsen and M. Nei, [http:// www.bio.psu.edu/People/Faculty/Nei/Lab/Programs.html]. For the purposes of tree calculation indels were treated as equivalent to single nucleotide differences. The support level for branching points in the trees was determined by 1000 bootstrap re-samplings of the data.
disequilibrium and in the coordination of the study. Author 6 initials MM contributed to the study design and to data analysis. Author 7 initials JAR conceived the study, participated in its design and in data analysis.