Genetic fixity in the human major histocompatibility complex and block size diversity in the class I region including HLA-E

Background: The definition of human MHC class I haplotypes through association of HLA-A, HLA-Cw and HLA-B has been used to analyze ethnicity, population migrations and disease association.
Results: Here, we present HLA-E allele haplotype association and population linkage disequilibrium (LD) analysis within the ~1.3 Mb bounded by HLA-B/Cw and HLA-A to increase the resolution of identified class I haplotypes. Through local breakdown of LD, we inferred ancestral recombination points both upstream and downstream of HLA-E contributing to alternative block structures within previously identified haplotypes. Through single nucleotide polymorphism (SNP) analysis of the MHC region, we also confirmed the essential genetic fixity, previously inferred by MHC allele analysis, of three conserved extended haplotypes (CEHs), and we demonstrated that commercially-available SNP analysis can be used in the MHC to help define CEHs and CEH fragments.
Conclusion: We conclude that to generate high-resolution maps for relating MHC haplotypes to disease susceptibility, both SNP and MHC allele analysis must be conducted as complementary techniques.


Background
The human major histocompatibility complex (MHC) is a highly polymorphic genomic region occupying approximately 4 Mb on chromosome 6p21.3. In addition to the major HLA class I and class II gene clusters, there are several other HLA-related and immune response-related genes, some of unknown function, as well as likely pseudogenes. The rich polymorphism in this region is a critical determinant for success in tissue transplantation, and in recent years has found a further use in characterizing both ethnic and geographical population relationships. Haplotype analysis is based on the conservation of short blocks of DNA sequence containing specific allele combinations of two or more adjacent or nearby genetic loci. Within the MHC region, a limited number of specific haplotypes are known to be shared by unrelated individuals of well-defined human populations. These relatively long stretches of conserved DNA sequence in the MHC have been termed conserved extended haplotypes (CEHs) [1] or ancestral haplotypes [2,3]. It is also well recognized that CEHs may be represented as a higher order of association, through successive generations, of four or more defined MHC blocks, showing stronger linkage disequilibrium (LD) than that expected from random recombination.
Portions of a few CEHs can be detected by maximum likelihood statistics but much more precisely and completely by family studies and direct counting [1][2][3][4][5][6]. In either instance, LD can be analyzed and a significance assigned to the association [1][2][3][4][5][6]. MHC haplotype blocks and the larger CEHs are usually inherited intact as a unit, and the allele frequency distribution of particular MHC locus combinations in individuals is non-random [1][2][3][4][5][6][7]. Reports describe the existence of blocks of conserved DNA sequence in the range of 5 to 150 kb within the human genome separated by sites of high recombination activity [8][9][10]. These reports, based on LD analysis applied to single nucleotide polymorphism (SNP) data, suggested the blocks represent relatively uniform lengths of conserved DNA sequence maintained throughout the human population as haplotypes.
Conserved MHC blocks and CEHs have been shown to represent markers of human diversity and/or disease susceptibility [4]. Multi-block conserved haplotypes are not limited to the MHC region, since genes encoding drug metabolizing enzymes [11], hormone receptors [12] or microtubule-associated proteins [13] are also associated with extended haplotype blocks. For human MHC studies, past work has focused on haplotypes defined by the relationship of classical HLA class I and class II loci and intermediate MHC genes. The HLA-E locus, located approximately halfway between the HLA-A and HLA-Cw class I loci, approximately 780 kb telomeric to HLA-C, has limited polymorphism and has not generally been incorporated into HLA association studies. Here, we describe newly identified block associations within the MHC, specifically determining the distribution of HLA-E alleles in relation to HLA-A, B, Cw, complotype and DRB1 blocks, defining a set of CEHs extending over 2.6 Mb (1.5% of chromosome 6). The inclusion of HLA-E in MHC haplotype analysis significantly improves the resolution of class I haplotypic blocks, further refining our ability to analyze associations of the human MHC to disease. Through SNP analysis of the MHC class I/class II region, we confirmed the regional genetic fixity identified by MHC allele analysis and demonstrated that SNPs can be used in the MHC to help define CEHs and CEH fragments.

Results
To improve human MHC haplotype resolution, we initially set about determining HLA-E allele polymorphism in the HLA-A/HLA-Cw interval. Within our samples, only 4 of the currently-identified HLA-E alleles were identified (E*0101, E*010301, E*010302 and E*010303) while HLA-E*0104 was not detected. We did not type for the recently identified allele HLA-E*010304 [14]; our typing method would have designated such an allele, if it existed in our subjects, as HLA-E*010302. HLA-E*010303 was found in only one of 176 individuals screened (representing subjects from all 3 panels studied) and was therefore not tested for in the other subjects. In 583 individuals, the most frequent allele found was HLA-E*0101, followed by E*010302 and E*010301. HLA-A, Cw and B alleles were identified at expected frequencies for Caucasian, African-American and Hispanic populations, respectively [1]. In 216 individuals (Panel 1), we found 9 statistically significant haplotypes between HLA-A and HLA-Cw, B, only 5 between HLA-E and HLA-Cw/B and 7 between HLA-A and HLA-E (Table 1). Of the latter, the two most significant were (A*0101, E*0101) and (A*0301, E*010302). Of the 5 identified associations between HLA-E and HLA-Cw/B, the most significant were (E*0101, Cw*0701, B*0801) and (E*010302, Cw*0702, B*0702). Analysis of the entire class I region revealed 9 haplotypes, of which the most significant were (A*0101, E*0101, Cw*0701, B*0801); (A*0301, E*010302, Cw*0702, B*0702) and (A*0201, E*0101, Cw*0501, B*4402).
Many significant HLA-Cw/B associations were found within Panel 2, as expected due to the physical proximity of HLA-Cw and -B (85 kb). Extending the region to 864 kb between HLA-E and HLA-B, 4 of the same HLA class I haplotypes found in Panel 1 individuals and 4 other statistically significant class I haplotypes were found (Fig. 1, column C). LD analysis of the complete class I region encompassing 1.41 Mb identified the same 9 class I haplotypes found in Panel 1 (Fig. 1, column D). All four HLA-A/E pairs in LD (Fig. 1, column B) were part of at least one of the larger class I haplotypes (gray lines). However, some of the larger haplotypes contained sub-domain regions not strongly linked when analyzed independently. Specifically, 4 HLA-A/E pairs ((A*0201, E*0101); (A*2301, E*0101); (A*2402, E*0101) and (A*0201, E*010302)) found in the larger class I haplotypes (Fig. 1, red lines) did not show significant LD when analyzed alone. Analysis of HLA-E to Cw/B (Fig. 1, column C) revealed that HLA-E*0101 was not in LD with (Cw*0401, B*3501); (Cw*0602, B*5701) nor (Cw*0401, B*4403) despite significant LD when the haplotypes included HLA-A (Fig. 1, column D). Conversely, one haplotype with strong D', (E*010301, Cw*12xx, B*5201), was not in LD with HLA-A. From these results, we infer ancestral breakpoints both centromeric and telomeric to HLA-E.
Within Panel 3, we also studied CEHs, ranging over 2.6 Mb, consisting of their HLA class I loci (Table 2) along with the class II HLA-DRB1 locus and the closely-linked complement genes BF, C2, C4A and C4B (the complotype; Table 3). HLA-E alleles were found to be in significant association with 10 CEHs (Table 3), but excluding HLA-A reduced this number. Nevertheless, HLA-E association with the Cw/B block and HLA-A in Panel 3 (Table 2) showed significance for 7 of the 9 class I haplotypes observed in Panels 1 and 2. Furthermore, 6 other class I haplotypes not found in Panels 1 or 2 had statistical significance in Panel 3. Conversely, the larger number of (HLA-E*0101, Cw*08, B*14) and (HLA-E*0101, Cw*06, B*50) haplotypes found as compared with their most frequent HLA-A variants implies past recombination events between HLA-E and HLA-A. In summary, we demonstrate, by both χ² and LD analysis, non-random association of HLA-E alleles with alleles at other class I loci and the HLA-E allele markers for 10 CEHs. Through breakdown of LD between MHC blocks, we once again infer recombination breakpoints on
Selecting only those SNPs identical in EM10 and FS10 and designated "HLA-A*26, B*38", and comparing them with those SNPs identical in B8HM1 and B8HM2 designated "HLA-A*01, B*08" (Fig. 2B), we observed 113/271 (41.7%) complete discordance between the two sets (p: <

Discussion
Human MHC polymorphisms likely represent the geographic dispersal of early man and expansion of limited haplotypes in concert with selection driven by local microbial organisms. This has led to association of haplotypes with both ethnicity and various immunopathologies. It has been postulated that the basis for some of the disease-associations may be a cross-reactivity between a microbe-specific peptide sequence and a closely-related host sequence leading to anti-host reactivity (e.g., HLA-B27 and ankylosing spondylitis [15]). To accurately identify the relationship of a genetic locus to disease, it is critical to determine whether an allele is associated with such pathology or whether the locus is co-segregating due to proximity with the responsible gene. Consideration of cosegregation is particularly critical given that direct determination of MHC haplotypes from family studies shows frequently occurring small block variants and given that a third to a half of Caucasian haplotypes are fixed from HLA-B to HLA-DRB1/DQB1 (at least 1 Mb) as CEHs [1][2][3][4][5][6].
To increase the resolution of haplotypes within the human MHC region defined by population LD analysis, this study was initially conceived as a means of incorporating HLA-E into the other class I, class II and complotype regions. HLA-E is an HLA class Ib molecule of limited polymorphism that interacts with natural killer receptors and functions as an important mediator of cytotoxicity [16][17][18]. Initial LD analysis suggested that HLA-E polymorphism occurred early in hominid development and stabilized in Homo sapiens before the major geographic dispersals [19]. Consequently, it seems likely that the distribution of HLA-E alleles represents population migration with inbred expansion. In support of this notion, our analysis of HLA-E alleles identified 3 alleles (HLA-E*0101, HLA-E*010301 and HLA-E*010302) nonrandomly associated with particular CEHs. Our typing method was not designed to detect the recently identified allele HLA-E*010304 [14], and, if it had been present in any of the haplotypes, it would be reported here as HLA-E*010302. We are unaware of any report describing the population frequency of that allele; we shall clarify its presence or absence in particular CEHs in future studies. We identified apparent ancestral breakpoints upstream and downstream of HLA-E. In the context of the limited number of HLA-E alleles identified, this would seem to reinforce the notion that HLA-E polymorphism arose early in hominid development and then stabilized, which does not conflict with the more recent stabilization of extended haplotypes confirmed here by both population LD analysis and SNP analysis. There is the further implication that recombination breakpoints in the HLA region are relatively infrequent.
Haplotype blocks and breakpoints revealed by population analysis do not always correlate with those identified by direct haplotype sequencing of sperm [20][21][22]. Sperm crossover points may indicate the potential for recombination while family studies represent the practical end result reflecting fertilization potential and environmental selective pressures. Accordingly, recombination frequencies from a single individual or limited pool should be used cautiously to describe the effect of recombination on haplotype frequencies in the population [6]. Other suggested mechanisms to explain discrepancies between sperm crossover points and family-inferred breakpoints include higher crossover rates in female gametes not observed in sperm [21], as well as the possibility that some breakpoints recognized by segregation analysis represent inactive ancestral recombination "hot spots" which have become fixed in populations [20].
Since selection in its most accepted formulation operates mostly upon protein products, the power of allele variant haplotype analysis is undeniable. In recent reports, extensive analysis of single nucleotide polymorphisms (SNPs) has been used to produce high-resolution maps of breakpoints at greater frequency than those identified by allele variant population haplotype analysis. Some have argued that allele variant segregation and population haplotype analysis is erratic, influenced by gene frequency and population dynamics [23]. On the contrary, it is exactly these properties that have allowed allele variant population haplotype analysis to identify ethnic descent and migration of Homo sapiens so precisely.
LD analysis of SNP distribution in haplotypes defined by maximum likelihood methods has revealed genomic structures similar to and yet far less complex than those identified by allele variants haplotyped by segregation analysis [1][2][3][4][5][6][24]. The former method may be responsible for some oversimplification of recent haplotype analyses [1,4], but using SNP markers alone may also pose inherent problems. High-throughput localization of SNP distribution is inarguably efficient, but the vast majority of SNPs reside outside coding regions. Although there is potential for polymorphisms in non-coding promoter and intron DNA to influence subsequent transcription and splicing of a gene [25,26], selection pressure is more likely to operate at the protein level. Particular haplotype block combinations of relatively long genomic distance are likely to have been initially fixed in response to geographical or environmental influences. The passage of time, migration and alterations in climate and local flora prevent analysis, but identification of other non-immune-related haplotype blocks offers support for selection influence on haplotype structure [11]. However, a recent report "mapping" the MHC using both HLA alleles and SNPs by LD analysis of haplotypes defined by maximum likelihood methods [24], suggests that the primary reason such maps fail to detect the details of human population haplotype structure [1][2][3][4][5][6] is their use of probabilistic (as opposed to segregation) analysis.

Conclusion
The results of several recent studies on two specific CEHs support our general conclusion of the fixity of CEHs in the class I region. Both high density SNP [28] and resequencing [29] analysis of the A1-B8-DR3 CEH and high density SNP analysis of the A30-B18-DR3 CEH [30] showed the essential sequence fixity of each of those haplotypes in unrelated individuals. Here, in a more limited set of samples, our high density SNP analysis confirms the essential fixity of the CEH [HLA-A*26, Cw*12, B*38, SC21, DRB1*04].
Since the SNP data so strongly support the genetic fixity of CEHs first observed by direct allele analysis, several approaches may be taken to improve haplotype definition. First, to define the SNP variants of particular CEHs, the density of SNP analysis can be raised to almost complete levels by choosing the limited subset expressed within a predefined CEH. An alternative approach, based on the strong SNP support for CEHs, is to identify other polymorphic MHC genes, particularly in the HLA-A to HLA-C region, for consideration in LD analysis. Therefore, we identified several polymorphic markers within the 1.3 Mb of genomic DNA between HLA-A and HLA-C (Fig. 3). Analysis of these markers permits determination of hierarchical haplotype block associations where block variation within the CEH may provide further insights into human diversity and disease susceptibility. Determining the frequency of sizes of DNA blocks in different populations will add a new dimension to studies of human diversity and gene localization in diseases associated with the MHC class I region [1]. In this latter instance, high-resolution allele analysis will lead to better definition of the associative levels of MHC DNA blocks, CEHs and their fragments influenced by genetic admixture, allowing more precise elucidation of disease-associated HLA alleles when comparing different ethnic groups and nationalities.

Cell lines
The EM10, FS10, B8HM1, B8HM2 and L2DB cell lines were used to represent homozygous haplotypes in Figure 2, as described previously [31].

MHC typing
Genomic DNA was obtained from peripheral blood mononuclear cells (PBMC), EDTA-treated plasma or lymphoblastoid cell lines and was isolated using the QIAamp DNA mini kit (Qiagen, Valencia, CA). Molecular typing of IHW cell lines was previously known [32] and/or was conducted as described below. Molecular typing of samples from Panels 1 and 2 was performed by PCR and sequence-specific oligonucleotide probes (PCR-SSOP) at intermediate to high resolution [33]. SSP molecular typing of non-IHW cell line samples from Panel 3 was performed either using an SSP UniTray kit (Invitrogen/Dynal/Pel-Freez, Brown Deer, WI) or by PCR-SSOP (HLA Quick-Type kits, Lifecodes, Stamford, CT), according to previously described amplification conditions [33]. Some samples from the CBRI had several HLA types identified serologically [34]. Typing of BF, C4A and C4B alleles was done by agarose gel electrophoresis and immunofixation of their protein products with specific antisera, and C2 alleles were determined by isoelectric focusing of serum samples in polyacrylamide gels followed by a C2-sensitive hemolytic overlay [35]. MHC complement gene haplotypes or complotypes are designated by their BF, C2, C4A, and C4B alleles, in that arbitrary order [7]. Null or Q0 alleles are simply designated 0. Thus, FC31 indicates the complotype BF*F, C2*C, C4A*3, C4B*1. Some of the non-HLA-E typings have been published previously [4,5,27].
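The complotype naming convention described above can be illustrated with a minimal Python sketch. It assumes each complotype is written as exactly four single characters in the order BF, C2, C4A, C4B; designations that do not fit that pattern are outside its scope. The function names are hypothetical and are not part of any typing software used in this study.

```python
# Locus order as stated in the text: BF, C2, C4A, C4B.
LOCI = ("BF", "C2", "C4A", "C4B")


def parse_complotype(code: str) -> dict:
    """Split a four-character complotype such as 'FC31' into per-locus alleles.

    Assumption: one character per locus. A '0' marks a null (Q0) allele.
    """
    if len(code) != len(LOCI):
        raise ValueError(f"expected {len(LOCI)} characters, got {code!r}")
    return {locus: ("Q0" if ch == "0" else ch) for locus, ch in zip(LOCI, code)}


def format_complotype(alleles: dict) -> str:
    """Render the parsed alleles in the 'LOCUS*allele' style used in the text."""
    return ", ".join(f"{locus}*{alleles[locus]}" for locus in LOCI)


if __name__ == "__main__":
    print(format_complotype(parse_complotype("FC31")))
    print(format_complotype(parse_complotype("SC21")))
```

Under these assumptions, `FC31` expands to BF*F, C2*C, C4A*3, C4B*1, matching the example in the text.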

HLA-E haplotype assignment
Panel 1 haplotypes were unambiguously assigned from individuals homozygous for at least HLA-A and HLA-E (in whom HLA-Cw, B blocks were assigned based on known associations [1]) or homozygous for at least HLA-E and HLA-Cw, B. Panel 2 haplotypes were assigned by family study using segregation analysis [5]. For the third panel, we assigned HLA-E alleles to 258 MHC haplotypes. Of these, 167 haplotypes (65%) were unambiguously assigned by one of four methods: a) in IHW or locally produced MHC homozygous cell lines; b) by segregation analysis in pedigrees [5]; c) to previously defined (by segregation analysis) haplotypes in subjects homozygous for HLA-E; or d) to deduced haplotypes in subjects homozygous for at least HLA-E and their HLA-Cw, B blocks. The cell lines (a, above) were assumed to be consanguineous (and received only one haplotype assignment) unless known not to be consanguineous. At the end of this first analysis, we assigned HLA-E alleles to the six most frequent CEHs (Table 3). The remaining haplotypes (n = 91) were assigned HLA-E alleles with two assumptions. First, individuals who had all of the class I to complotype markers of at least one CEH were included in the analysis, and all of the markers of a given CEH were assigned to one of the haplotypes. Second, for individuals without clear HLA-E assignment (e.g., a family in which all subjects were HLA-E heterozygous and identical or an HLA-E heterozygous individual without relatives in the study), but who had at least one haplotype with the class I markers of one of the six CEHs defined above, the defined HLA-E assignment was given to that CEH.
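The reasoning behind assignment from homozygous individuals can be sketched in Python as follows. This is an illustration of the general principle only, not the assignment procedure used in this study: when at most one locus (or block) is heterozygous, the two haplotypes can be read directly from the genotype without family data. Treating HLA-Cw/B as a single block, and the function name itself, are assumptions made for the example.

```python
def unambiguous_haplotypes(genotype):
    """Return the two haplotypes if phase is unambiguous, else None.

    genotype maps a locus (or block) name to a pair of alleles.
    Phase is unambiguous when at most one locus is heterozygous.
    """
    heterozygous = [locus for locus, (x, y) in genotype.items() if x != y]
    if len(heterozygous) > 1:
        return None  # phase cannot be read off without segregation data
    hap1 = {locus: pair[0] for locus, pair in genotype.items()}
    hap2 = {locus: pair[1] for locus, pair in genotype.items()}
    return hap1, hap2


if __name__ == "__main__":
    # Hypothetical individual: homozygous at HLA-A and HLA-E, heterozygous Cw/B block.
    subject = {
        "HLA-A": ("A*0101", "A*0101"),
        "HLA-E": ("E*0101", "E*0101"),
        "Cw/B": ("Cw*0701,B*0801", "Cw*0702,B*0702"),
    }
    print(unambiguous_haplotypes(subject))
```

A subject heterozygous at two or more loci returns `None` in this sketch, which corresponds to the cases that required segregation analysis or the additional assumptions described above.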

SNP analysis
Genomic DNA was digested with NspI or StyI prior to adapter ligation, amplification, end-labeling and hybridization to a GeneChip (GeneChip Human Mapping 500 K Array Set; Affymetrix, Santa Clara, CA). Arrays were analyzed on a GeneChip Scanner 7000 RG, and data were analyzed using the GTYPE software, all according to the manufacturer's directions. A total of 428 SNPs from the region from position 28,944,796 (near the gene TRIM27, approximately 1.0 Mb telomeric to HLA-A) to 33,362,643 (near the gene B3GALT4, approximately 0.2 Mb centromeric to HLA-DPB1) were analyzed (Genbank dbSNP build 126 rs209163 to rs466384). In several instances, a clear call on the polymorphism could not be made, in which case the SNP was not used. Consequently, depending on the calls for each cell line, approximately 370 SNPs with high-confidence calls for each cell line were compared (Fig. 2B).
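The comparison of SNP calls between cell lines can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the GTYPE workflow: genotype calls are represented as plain strings, the `NO_CALL` label and the SNP identifiers are hypothetical, and only the region boundaries come from the text. The sketch keeps SNPs with clear, identical calls in two cell lines carrying the same haplotype, then counts how many of those differ between two such consensus sets.

```python
REGION_START = 28_944_796  # region boundaries as given in the text
REGION_END = 33_362_643
NO_CALL = "NoCall"  # hypothetical label for a call that could not be made


def in_region(position):
    return REGION_START <= position <= REGION_END


def consensus(calls_a, calls_b, positions):
    """SNPs inside the region with clear, identical calls in both cell lines."""
    shared = {}
    for snp, pos in positions.items():
        if not in_region(pos):
            continue
        a = calls_a.get(snp, NO_CALL)
        b = calls_b.get(snp, NO_CALL)
        if a != NO_CALL and a == b:
            shared[snp] = a
    return shared


def discordance(consensus_1, consensus_2):
    """Return (discordant, compared) over SNPs present in both consensus sets."""
    common = consensus_1.keys() & consensus_2.keys()
    differing = sum(1 for snp in common if consensus_1[snp] != consensus_2[snp])
    return differing, len(common)


if __name__ == "__main__":
    # Toy data with made-up SNP names; snpD lies outside the region.
    positions = {"snpA": 29_000_000, "snpB": 30_000_000,
                 "snpC": 31_000_000, "snpD": 34_000_000}
    line1 = {"snpA": "AA", "snpB": "GG", "snpC": "CC", "snpD": "TT"}
    line2 = {"snpA": "AA", "snpB": "GG", "snpC": NO_CALL, "snpD": "TT"}
    line3 = {"snpA": "AA", "snpB": "TT", "snpC": "CC", "snpD": "TT"}
    line4 = {"snpA": "AA", "snpB": "TT", "snpC": "CC", "snpD": "TT"}
    hap_x = consensus(line1, line2, positions)
    hap_y = consensus(line3, line4, positions)
    print(discordance(hap_x, hap_y))
```

Dropping no-call SNPs before comparison mirrors the rule stated above that a SNP without a clear call was not used, which is why the number of SNPs compared varies by cell line.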

Statistical analysis
Allele frequencies of HLA generic and allele types were calculated for each of the three panels separately by direct counting [1][2][3][4][5][6]. LD for alleles at loci between HLA-E and HLA-A or between HLA-C and HLA-B was analyzed in Panel 2 using delta (Δ) and normalized delta (D'). Other two-point LD calculations were made between HLA-E and the HLA-Cw/B block, with the latter analyzed as a single entity, and between HLA-A and HLA-E/Cw/B, with the latter analyzed as a single entity. Although D' normalizes for allele frequency, it does not compensate for sample size. Accordingly, we used Fisher's exact test to provide an additional measure of significance of association of the loci. We defined significant LD as positive normalized delta (D') in the context of p < 0.05. LD is defined as a frequency of association of specific alleles at two or more loci (i.e., a putative haplotype) that departs from the expectation based on the known frequencies of the individual alleles comprising that haplotype (determined in this report by pedigree analysis, i.e., from genotypic data). In a homogeneous population at genetic equilibrium, if the alleles A and B at two loci with frequencies f(A) and f(B), respectively, are completely randomly associated with one another, they form an AB haplotype with a frequency of f(AB) = f(A) · f(B). If these conditions are not met, the alleles are said to be "in LD." The extent of LD is given by

$$\Delta = \mathrm{HF} - a_i b_j$$

where HF is the haplotype frequency and $a_i$ and $b_j$ are the frequencies of the $A_i$ and $B_j$ alleles [1]. The Δ value is converted to a normalized LD value (D') to determine the relative LD irrespective of individual allele frequencies. This normalized value is calculated as

$$D' = \frac{\Delta}{\Delta_{\max}}$$

where $\Delta_{\max}$ is the maximum LD value possible [38]. The significance of all the results (Tables 1, 2, 3 and Figure 1) was assessed with Fisher's exact test with Bonferroni correction [39]. Odds ratios (ORs) were calculated with a 95% CI [36].
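The LD measures described above can be sketched in Python as follows. This is an illustrative sketch under stated assumptions, not the software used for the analysis. The form of the maximum Δ used here is the standard Lewontin definition, which is assumed rather than taken from the text; the two-sided Fisher's exact test is implemented directly from the hypergeometric distribution; the Bonferroni correction multiplies the p-value by the number of tests; and the haplotype counts are made-up numbers.

```python
from math import comb


def delta(hf, a, b):
    """Delta = HF - a*b: observed haplotype frequency minus its expectation."""
    return hf - a * b


def delta_max(d, a, b):
    """Largest |Delta| attainable for these allele frequencies (assumed standard form)."""
    if d >= 0:
        return min(a * (1 - b), (1 - a) * b)
    return min(a * b, (1 - a) * (1 - b))


def d_prime(hf, a, b):
    """Normalized LD: Delta divided by its maximum possible value."""
    d = delta(hf, a, b)
    dmax = delta_max(d, a, b)
    return d / dmax if dmax > 0 else 0.0


def fisher_exact_two_sided(n11, n12, n21, n22):
    """Two-sided Fisher's exact p-value for a 2x2 table of haplotype counts."""
    row1, col1 = n11 + n12, n11 + n21
    total = n11 + n12 + n21 + n22

    def prob(x):
        # Hypergeometric probability of x in the top-left cell.
        return comb(col1, x) * comb(total - col1, row1 - x) / comb(total, row1)

    p_obs = prob(n11)
    lo, hi = max(0, row1 + col1 - total), min(row1, col1)
    # Sum all tables at least as extreme (no more probable) as the observed one.
    return sum(p for p in map(prob, range(lo, hi + 1)) if p <= p_obs * (1 + 1e-9))


def is_significant(dp, p_value, n_tests, alpha=0.05):
    """Positive D' with a Bonferroni-corrected p-value below alpha."""
    return dp > 0 and min(1.0, p_value * n_tests) < alpha


if __name__ == "__main__":
    # Hypothetical counts: AB, A without B, B without A, neither.
    n_ab, n_a_only, n_b_only, n_neither = 30, 10, 10, 150
    total = n_ab + n_a_only + n_b_only + n_neither
    a = (n_ab + n_a_only) / total
    b = (n_ab + n_b_only) / total
    dp = d_prime(n_ab / total, a, b)
    p = fisher_exact_two_sided(n_ab, n_a_only, n_b_only, n_neither)
    print(f"D' = {dp:.4f}, significant = {is_significant(dp, p, n_tests=9)}")
```

Pairing D' with an exact test reflects the point made above: D' corrects for allele frequency but not for sample size, so a large D' from a handful of haplotypes would not pass the significance criterion.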

MHC gene location and distances
Physical distances between MHC genes were found at the Wellcome Trust Sanger Institute Human Chromosome 6 website [40].