An evaluation of the performance of HapMap SNP data in a Shanghai Chinese population: Analyses of allele frequency, linkage disequilibrium pattern and tagging SNPs transferability on chromosome 1q21-q25
© Hu et al; licensee BioMed Central Ltd. 2008
Received: 06 June 2007
Accepted: 27 February 2008
Published: 27 February 2008
The HapMap project aimed to catalog millions of common single nucleotide polymorphisms (SNPs) in the human genome in four major populations, in order to facilitate association studies of complex diseases. To examine the transferability of HapMap data from Han Chinese in Beijing to Southern Han Chinese in Shanghai, we performed comparative analyses of genotypes from over 4,500 SNPs in a 21 Mb region on chromosome 1q21-q25 in 80 unrelated Shanghai Chinese individuals and 45 HapMap Chinese individuals.
Three thousand and forty-two SNPs were analyzed after removal of SNPs that failed quality control and those not in the HapMap panel. We compared the allele frequency distributions, linkage disequilibrium patterns, haplotype frequency distributions and transferability of tagging SNP sets between the HapMap populations and the Shanghai Chinese population. Among the four HapMap populations, Beijing Chinese showed the best correlation with the Shanghai population in allele frequencies, linkage disequilibrium and haplotype frequencies. Tagging SNP sets selected from the four HapMap populations at different thresholds were evaluated in the Shanghai sample. At an r2 threshold of 0.8 or 0.5, both the HapMap Chinese and Japanese data showed better coverage and tagging efficiency than the Caucasian and African data.
Our study supports the applicability of HapMap Beijing Chinese SNP data to the study of complex diseases in the southern Chinese population.
The International HapMap Project aimed to determine the common patterns of DNA sequence variants, their frequencies, and the correlations between them, by genotyping samples from four large populations at a density of 1 SNP every 5 kb: Centre d'Etude du Polymorphisme Humain reference individuals from Utah, USA (CEU), Han Chinese in Beijing, China (CHB), Japanese in Tokyo, Japan (JPT), and Yoruba in Ibadan, Nigeria (YRI). The populations genotyped in the HapMap can serve as reference populations for the selection of tagging SNPs (tSNPs) that capture most of the variation in the genome. This provides an important shortcut for candidate-gene and genome-wide association studies in a given population by minimizing the number of SNPs that need to be genotyped [1–3].
As stated by the International HapMap Consortium, the general applicability of the HapMap data should be confirmed in other populations. Several previous studies showed high concordance with HapMap data in allele frequencies and haplotype distributions, and good performance of tSNPs selected from the HapMap SNP data [4–12]. However, few reports in the literature have compared the linkage disequilibrium (LD) patterns of the CHB population in the HapMap data with those of other Chinese populations. Whether the HapMap CHB data can be broadly used in other Chinese populations remains a key question.
In our study, over 4,500 SNPs from a 21 Mb region on chromosome 1q21-q25 were genotyped in 80 Chinese Hans from Shanghai as a component of the International Type 2 Diabetes 1q Consortium. Located in the southeast of China, Shanghai is over 1,000 kilometers away from Beijing, where the CHB samples were recruited. Studies have shown that the Chinese Han population can be divided geographically into two genetically distinct groups, northern Han and southern Han Chinese, separated approximately by the Yangtze River [13–16]. The samples from Shanghai in our study are southern Hans, while most of the CHB samples are northern Hans. Although previous studies showed similarity of genetic background between East Asian populations [17, 18], no study has directly analyzed the utility of HapMap data in the southern Hans. In our study, we estimated the allele and haplotype frequencies of SNPs in Shanghai individuals and compared them with those provided by the HapMap project. Furthermore, we evaluated the transferability and performance of tSNPs selected from the HapMap data in this Shanghai population.
Minor allele frequency distributions in Shanghai and four HapMap populations.
The frequencies estimated from the JPT sample were also highly correlated with those from the Shanghai sample (R = 0.88, P < 0.001). The allele frequency distribution of 547 SNPs was significantly different. After 10,000 permutations, 7 of them remained significant (Additional file 1).
However, the allele frequencies estimated from the CEU and YRI samples differed dramatically from those in the Shanghai sample and Pearson's correlation coefficients between them were 0.46 and 0.41, respectively (P < 0.0001) (Figure 1).
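The correlation coefficients reported above can be illustrated with a minimal sketch of Pearson's R between two vectors of allele frequencies. The function name `pearson_r` and the five frequency values are hypothetical and chosen only for illustration; they are not data from this study. The sketch assumes that each pair of values refers to the same allele of the same SNP in both samples.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson's correlation coefficient between two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sum of cross-products and sums of squares around the means.
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / sqrt(var_x * var_y)


# Hypothetical allele frequencies for five SNPs (illustrative values only).
shanghai_freq = [0.10, 0.25, 0.40, 0.05, 0.33]
reference_freq = [0.12, 0.22, 0.43, 0.07, 0.30]
r = pearson_r(shanghai_freq, reference_freq)
print(f"R = {r:.2f}")
```

A high R indicates only that frequencies rise and fall together across SNPs; individual SNPs can still differ significantly between samples, which is why per-SNP tests are reported separately.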
The LD structures for the whole region in the Shanghai and the four HapMap samples are shown in Additional file 2. The LD structure was similar across all non-African populations. In the YRI sample, LD was less extensive.
A total of 2667, 2648, 2643, 2775 and 2301 SNPs with MAFs greater than 0.05 were included in the analyses of haplotype blocks for the Shanghai, CHB, JPT, CEU and YRI samples, respectively. In the Shanghai sample, 399 blocks were inferred across the region, while 370, 367, 448 and 434 blocks were inferred in the CHB, JPT, CEU and YRI samples, respectively. In our samples, the average block size and marker number were 30.51 kb and 5.5 SNPs per block. Similar results were observed in the CHB sample, with an average block size of 30.47 kb and an average marker number of 5.48. The JPT and CEU samples were also similar to our Shanghai samples to a certain extent. Their average block sizes were 30.77 kb and 29.72 kb, and their average marker numbers were 5.52 and 5.34, respectively. However, the YRI sample was the most distinct from the other populations. Its average block size and marker number were only 18.55 kb and 3.43.
To mimic the way an investigator would use the HapMap resource, we used the HapMap populations as reference for tSNP selection at different thresholds and then evaluated the performance of tSNP sets in the Shanghai sample. The performance was determined by computing both the percentage of SNPs captured and the average tagging efficiency. SNPs with MAFs over 0.05 in both the reference population and the Shanghai population were analyzed.
Performance of tSNP sets selected from the four HapMap populations. Columns: number of SNPs analyzed; at r2 = 0.8, number and percentage of tSNPs selected and percentage of SNPs captured; at r2 = 0.5, number and percentage of tSNPs selected and percentage of SNPs captured.
Association studies are a common way of identifying genetic markers for complex diseases, such as diabetes, obesity, cancer, psychiatric illness and cardiovascular disease. The HapMap data provide LD and tSNP information for four populations to facilitate the design of association studies.
In this study, we evaluated the performance of the HapMap tSNPs in a Shanghai Chinese Han population by comparing allele frequencies, LD patterns, and haplotype frequencies between them. We further assessed transferability of tSNPs selected from the reference populations.
We found that the allele frequencies of the SNPs in this 21 Mb region on chromosome 1q21-q25 were highly correlated between the HapMap CHB and the Shanghai samples. As marker allele frequencies affect the power and sample size required for genetic association studies [20–22], knowing the allele frequencies for the population is important for the design of genetic studies. Our findings indicate that the HapMap CHB sample provides this information for the Chinese population.
We also found that the extent of LD was similar among non-African populations. The average haplotype block length, which was dramatically smaller in the YRI sample, was similar in the other populations. The r2 values of adjacent SNPs were highly correlated between the CHB and Shanghai samples. However, |D'| was poorly correlated between these two populations. This may be the result of high marker density. With an average marker density of one SNP per ~7 kb in this study, |D'| values can easily reach their maximum value of 1.0 in one or both populations, causing a ceiling effect.
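The ceiling effect can be made concrete with a small sketch of the standard two-locus LD measures. It assumes that the four phased haplotype frequencies are already known, which is a simplification: in practice they are estimated from unphased genotypes. The function name `ld_measures` and the example frequencies are hypothetical. The sketch shows that |D'| equals 1 whenever one of the four possible haplotypes is absent, even when r2 is modest.

```python
def ld_measures(f11, f12, f21, f22):
    """Return (D, |D'|, r2) from the four two-locus haplotype frequencies.

    f11 = freq(A1-B1), f12 = freq(A1-B2), f21 = freq(A2-B1), f22 = freq(A2-B2).
    """
    total = f11 + f12 + f21 + f22
    if abs(total - 1.0) > 1e-9:
        raise ValueError("haplotype frequencies must sum to 1")
    p_a = f11 + f12  # frequency of allele A1
    p_b = f11 + f21  # frequency of allele B1
    if not (0 < p_a < 1 and 0 < p_b < 1):
        raise ValueError("both SNPs must be polymorphic")
    d = f11 - p_a * p_b  # raw disequilibrium coefficient
    # D is scaled by its largest attainable magnitude given the allele frequencies.
    if d >= 0:
        d_max = min(p_a * (1 - p_b), (1 - p_a) * p_b)
    else:
        d_max = min(p_a * p_b, (1 - p_a) * (1 - p_b))
    d_prime = abs(d) / d_max
    r2 = d * d / (p_a * (1 - p_a) * p_b * (1 - p_b))
    return d, d_prime, r2


# Hypothetical frequencies: haplotype A1-B2 is absent, so |D'| hits its ceiling.
_, d_prime_ceiling, r2_ceiling = ld_measures(0.5, 0.0, 0.3, 0.2)

# Hypothetical frequencies with all four haplotypes present.
_, d_prime_open, r2_open = ld_measures(0.4, 0.1, 0.3, 0.2)
```

Because a rare haplotype is easily missed in a sample of modest size, many SNP pairs sit at |D'| = 1 in one or both samples, which compresses the range over which a correlation between populations can be measured. r2 does not saturate in the same way.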
Not surprisingly, the tSNP set selected from the CHB sample performed best in the Shanghai population in terms of coverage and tagging efficiency. Although the JPT sample showed poorer correlation than the CHB sample in allele frequencies and in r2 of adjacent SNPs, the tSNP set selected from the JPT sample performed equally well in this Shanghai population. In addition, an analysis of ALFRED (the Allele FREquency Database) showed that tSNPs selected from Japanese samples performed well in southern Hans. Thus, combining JPT and CHB SNP data is worth trying when selecting tSNPs from HapMap reference populations.
Several comparative studies have examined similarities and differences in LD patterns and tSNP transferability between HapMap data and various populations, based on genes or chromosome regions. Among the four HapMap populations, the population of northern European ancestry has been studied the most. Researchers have performed comparative studies between the CEU SNP data and several other populations, including Spanish, Finnish and Estonian populations and several population isolates [6, 8, 9, 12]. They all concluded that the CEU SNP dataset was a robust tool for association studies in these populations. Two studies focusing on Asian populations have also been reported. Lim et al. analyzed the LD patterns and haplotype structures for ENCODE region ENm010 on chromosome 7p15.2 by genotyping 792 SNPs in 90 healthy Korean individuals. Their analyses showed remarkable similarities in LD strength and haplotype profile, and efficient tSNP transferability, among the HapMap CHB, JPT and Korean samples. Mahasirimongkol et al. analyzed 861 SNPs in 166 drug-related genes in HapMap East Asian populations and Thais. They also found extensive correlation in allele frequency, Fst statistics and r2 between these populations.
One limitation of this study is that only 79 individuals were analyzed after quality control. A small sample size may bias the estimates of allele frequencies and of the extent of LD between markers, and as a result the portability of tSNPs could be overestimated. It is interesting to note, however, that we did not detect many significant differences between our sample and the HapMap CHB sample. Moreover, as demonstrated previously, for common SNPs with MAFs over 0.05, the use of 60 independent individuals did not affect the performance of tSNPs significantly. Another limitation is that our samples were normal controls from a case-control study. They were over 65 years old with normal weight and were free from diabetes, hypertension, dyslipidaemia and a family history of these diseases, whereas the HapMap CHB individuals were recruited randomly from Beijing without regard to disease status. The particular characteristics of our sample might lead to differences in some SNPs or haplotypes.
We conclude that the HapMap CHB SNP set has good portability to the Shanghai population and is thus a powerful tool for genetic studies of complex diseases in Chinese Hans. Further studies focusing on populations from other regions and nationalities in China are needed to confirm our findings.
Eighty unrelated Chinese individuals were recruited from the Shanghai Caoyang community as a component of the case-control study in the International Type 2 Diabetes 1q Consortium. All the individuals were normal controls free from diabetes, hypertension or dyslipidaemia. The sample was composed of 55 males and 25 females, with an average age of 74 years. This study was reviewed and approved by the institutional review board of Shanghai Jiao Tong University Affiliated Sixth People's Hospital, Shanghai, China. Written informed consent was obtained from all participants.
SNP selection and genotyping
A 21 Mb region on chromosome 1q21-q25 (position 148.10 Mb to 169.42 Mb, from the NCBI build 35 UCSC genome) was selected for genotyping. Over 4,500 SNPs were genotyped using the Illumina GoldenGate assay (Illumina Inc., San Diego, CA, USA), and quality control was performed by the 1q Consortium. After removal of SNPs that failed quality control and those not in the HapMap panel, 3,042 SNPs and 79 individuals were analyzed in this study, with an average density of one SNP per 7.0 kb. Detailed SNP information is shown in Additional file 3 and can also be obtained from dbSNP.
The HapMap SNP data of 60 CEU individuals, 45 CHB individuals, 45 JPT individuals and 60 YRI individuals were obtained from the HapMap database (release #20).
Allele frequencies were estimated by gene counting and checked for accordance with Hardy-Weinberg equilibrium in each population. Allele frequencies of SNPs were compared between populations by chi-square or Fisher's exact tests, where appropriate. Linkage disequilibrium parameters (|D'| and r2) for adjacent SNPs were calculated, and haplotype blocks were defined within each population using the confidence interval algorithm as implemented in Haploview version 3.32. Haplotype frequencies were estimated by the expectation-maximization algorithm. A permutation test that randomly assigns the phenotypes while keeping the genotypes intact was used to obtain empirical P values as an alternative to multiple test correction. Pearson's correlation coefficient (R) was used to estimate correlations in allele and haplotype frequencies and linkage disequilibrium parameters among populations. A paired t-test was performed to compare the allele frequencies and linkage disequilibrium parameters among the populations using SAS for Windows (version 6.12, SAS Institute Inc., Cary, NC, USA).
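As an illustration of the first step, the sketch below estimates an allele frequency by gene counting from the three genotype counts of a biallelic SNP and then checks Hardy-Weinberg equilibrium. The check uses a textbook chi-square goodness-of-fit test with one degree of freedom, which is an assumption made here for simplicity and not necessarily the test applied in this study. The function names and the genotype counts are hypothetical.

```python
from math import erfc, sqrt


def allele_freq_by_counting(n_aa, n_ab, n_bb):
    """Frequency of allele A from genotype counts (AA, AB, BB)."""
    n = n_aa + n_ab + n_bb
    if n == 0:
        raise ValueError("no genotypes supplied")
    # Each AA individual carries two copies of A, each AB individual one.
    return (2 * n_aa + n_ab) / (2 * n)


def hwe_chi_square(n_aa, n_ab, n_bb):
    """Chi-square goodness-of-fit test of Hardy-Weinberg proportions (1 df)."""
    n = n_aa + n_ab + n_bb
    p = allele_freq_by_counting(n_aa, n_ab, n_bb)
    q = 1.0 - p
    expected = (n * p * p, 2 * n * p * q, n * q * q)
    if min(expected) == 0:
        raise ValueError("test is undefined for a monomorphic SNP")
    observed = (n_aa, n_ab, n_bb)
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    # Upper tail of the chi-square distribution with 1 df: erfc(sqrt(x / 2)).
    p_value = erfc(sqrt(chi2 / 2))
    return chi2, p_value


# Hypothetical genotype counts for one SNP in 79 individuals.
freq_a = allele_freq_by_counting(10, 39, 30)
chi2, p_value = hwe_chi_square(10, 39, 30)
```

With small expected counts the chi-square approximation becomes unreliable, and an exact test is generally preferable in that situation.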
SNPs with MAFs greater than or equal to 0.05 were selected for the analyses of tSNP transferability across populations. The Tagger program in Haploview was used to identify tSNPs that optimally capture allelic variation among SNPs. The tSNPs were selected based on a pairwise approach. r2 values of 0.5 and 0.8 were used as thresholds for tSNP selection. Coverage of tSNPs was defined as the percentage of SNPs in the evaluated population that had an r2 above 0.5 or 0.8 with a tSNP selected from the reference population. The tagging efficiency of a tSNP set was defined as the average number of SNPs captured by each tSNP selected.
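The selection and evaluation steps can be sketched as follows. This is a simplified greedy pairwise tagging procedure written for illustration, not the Tagger implementation. It assumes that a symmetric matrix of pairwise r2 values is already available for the same ordered set of SNPs in both the reference and the evaluated population, and it treats an r2 equal to the threshold as captured. The function names and both small matrices are hypothetical.

```python
def select_tag_snps(r2, threshold):
    """Greedily pick tag SNPs until every SNP is captured at r2 >= threshold."""
    n = len(r2)
    untagged = set(range(n))
    tags = []

    def newly_captured(i):
        # SNPs still untagged that candidate i would capture (including itself).
        return {j for j in untagged if j == i or r2[i][j] >= threshold}

    while untagged:
        # Best candidate captures the most untagged SNPs; ties go to the lowest index.
        best = max(range(n), key=lambda i: (len(newly_captured(i)), -i))
        untagged -= newly_captured(best)
        tags.append(best)
    return tags


def evaluate_tags(r2, tags, threshold):
    """Return (coverage, tagging efficiency) of a tag set in a given population."""
    n = len(r2)
    captured = [
        j for j in range(n)
        if any(j == t or r2[t][j] >= threshold for t in tags)
    ]
    coverage = len(captured) / n            # fraction of SNPs captured
    efficiency = len(captured) / len(tags)  # SNPs captured per tSNP
    return coverage, efficiency


# Hypothetical pairwise r2 matrices for four SNPs (illustrative values only).
reference_r2 = [
    [1.00, 0.90, 0.85, 0.10],
    [0.90, 1.00, 0.70, 0.10],
    [0.85, 0.70, 1.00, 0.20],
    [0.10, 0.10, 0.20, 1.00],
]
evaluated_r2 = [
    [1.00, 0.88, 0.60, 0.10],
    [0.88, 1.00, 0.65, 0.10],
    [0.60, 0.65, 1.00, 0.15],
    [0.10, 0.10, 0.15, 1.00],
]

tags = select_tag_snps(reference_r2, threshold=0.8)
coverage, efficiency = evaluate_tags(evaluated_r2, tags, threshold=0.8)
```

In this toy example the tags chosen in the reference matrix capture every SNP there, but one SNP falls below the threshold in the evaluated matrix, so coverage drops. That drop is the quantity a transferability analysis measures.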
This work was supported by the National Natural Science Foundation of China grants 30630061 and 30600361, the National 973 Program 2006CB503901 and Innovation Fund for Ph.D. Student from Shanghai Jiao Tong University, School of Medicine (BXJ0732). Studies of the International Type 2 Diabetes 1q Consortium were supported primarily as a supplement to National Institutes of Health/National Institute of Diabetes and Digestive and Kidney Diseases award U01-DK58026, and by grant DK073490. We thank all the research volunteers for their participation.
- The International HapMap Consortium: The International HapMap Project. Nature. 2003, 426: 789-796. 10.1038/nature02168.
- The International HapMap Consortium: Integrating ethics and science in the International HapMap Project. Nat Rev Genet. 2004, 5: 467-475. 10.1038/nrg1351.
- The International HapMap Consortium: A haplotype map of the human genome. Nature. 2005, 437: 1299-1320. 10.1038/nature04226.
- Lim J, Kim YJ, Yoon Y, Kim SO, Kang H, Park J, et al: Comparative study of the linkage disequilibrium of an ENCODE region, chromosome 7p15, in Korean, Japanese, and Han Chinese samples. Genomics. 2006, 87: 392-398. 10.1016/j.ygeno.2005.11.002.
- Mahasirimongkol S, Chantratita W, Promso S, Pasomsab E, Jinawath N, Jongjaroenprasert W, et al: Similarity of the allele frequency and linkage disequilibrium pattern of single nucleotide polymorphisms in drug-related gene loci between Thai and northern East Asian populations: implications for tagging SNP selection in Thais. J Hum Genet. 2006, 51: 896-904. 10.1007/s10038-006-0041-1.
- Montpetit A, Nelis M, Laflamme P, Magi R, Ke X, Remm M, et al: An evaluation of the performance of tag SNPs derived from HapMap in a Caucasian population. PLoS Genet. 2006, 2: e27. 10.1371/journal.pgen.0020027.
- Mueller JC, Lohmussaar E, Magi R, Remm M, Bettecken T, Lichtner P, et al: Linkage disequilibrium patterns and tagSNP transferability among European populations. Am J Hum Genet. 2005, 76: 387-398. 10.1086/427925.
- Ribas G, Gonzalez-Neira A, Salas A, Milne RL, Vega A, Carracedo B, et al: Evaluating HapMap SNP data transferability in a large-scale genotyping project involving 175 cancer-associated genes. Hum Genet. 2006, 118: 669-679. 10.1007/s00439-005-0094-9.
- Service S, Sabatti C, Freimer N: Tag SNPs chosen from HapMap perform well in several population isolates. Genet Epidemiol. 2007, 31: 189-194. 10.1002/gepi.20201.
- Smith EM, Wang X, Littrell J, Eckert J, Cole R, Kissebah AH, et al: Comparison of linkage disequilibrium patterns between the HapMap CEPH samples and a family-based cohort of Northern European descent. Genomics. 2006, 88: 407-414. 10.1016/j.ygeno.2006.04.004.
- Stankovich J, Cox CJ, Tan RB, Montgomery DS, Huxtable SJ, Rubio JP, et al: On the utility of data from the International HapMap Project for Australian association studies. Hum Genet. 2006, 119: 220-222. 10.1007/s00439-005-0120-y.
- Willer CJ, Scott LJ, Bonnycastle LL, Jackson AU, Chines P, Pruim R, et al: Tag SNP selection for Finnish individuals based on the CEPH Utah HapMap database. Genet Epidemiol. 2006, 30: 180-190. 10.1002/gepi.20131.
- Xiao C, Cavalli-Sforza LL, Minch E, Du R: Principal component analysis of gene frequencies of Chinese populations. Sci China C Life Sci. 2000, 43: 472-481.
- Chu JY, Huang W, Kuang SQ, Wang JM, Xu JJ, Chu ZT, et al: Genetic relationship of populations in China. Proc Natl Acad Sci USA. 1998, 95: 11763-11768. 10.1073/pnas.95.20.11763.
- Wen B, Li H, Lu D, Song X, Zhang F, He Y, et al: Genetic evidence supports demic diffusion of Han culture. Nature. 2004, 431: 302-305. 10.1038/nature02878.
- Zhao TM, Lee TD: Gm and Km allotypes in 74 Chinese populations: a hypothesis of the origin of the Chinese nation. Hum Genet. 1989, 83: 101-110. 10.1007/BF00286699.
- Gu S, Pakstis AJ, Li H, Speed WC, Kidd JR, Kidd KK: Significant variation in haplotype block structure but conservation in tagSNP patterns among global populations. Eur J Hum Genet. 2007, 15: 302-312. 10.1038/sj.ejhg.5201751.
- Huang W, He Y, Wang H, Wang Y, Liu Y, Wang Y, et al: Linkage disequilibrium sharing and haplotype-tagged SNP portability between populations. Proc Natl Acad Sci USA. 2006, 103: 1418-1421. 10.1073/pnas.0510360103.
- Cardon LR, Bell JI: Association study designs for complex diseases. Nat Rev Genet. 2001, 2: 91-99. 10.1038/35052543.
- Gordon D, Finch SJ: Factors affecting statistical power in the detection of genetic association. J Clin Invest. 2005, 115: 1408-1418. 10.1172/JCI24756.
- Pfeiffer RM, Gail MH: Sample size calculations for population- and family-based case-control association studies on marker genotypes. Genet Epidemiol. 2003, 25: 136-148. 10.1002/gepi.10245.
- Zondervan KT, Cardon LR: The complex interplay among factors that influence allelic association. Nat Rev Genet. 2004, 5: 89-100. 10.1038/nrg1270.
- Zeggini E, Rayner W, Morris AP, Hattersley AT, Walker M, Hitman GA, et al: An evaluation of HapMap sample size and tagging SNP performance in large-scale empirical and simulated data sets. Nat Genet. 2005, 37: 1320-1322. 10.1038/ng1670.
- Wigginton JE, Cutler DJ, Abecasis GR: A note on exact tests of Hardy-Weinberg equilibrium. Am J Hum Genet. 2005, 76: 887-893. 10.1086/429864.
- Gabriel SB, Schaffner SF, Nguyen H, Moore JM, Roy J, Blumenstiel B, et al: The structure of haplotype blocks in the human genome. Science. 2002, 296: 2225-2229. 10.1126/science.1069424.
- Barrett JC, Fry B, Maller J, Daly MJ: Haploview: analysis and visualization of LD and haplotype maps. Bioinformatics. 2005, 21: 263-265. 10.1093/bioinformatics/bth457.
- Qin ZS, Niu T, Liu JS: Partition-ligation-expectation-maximization algorithm for haplotype inference with single-nucleotide polymorphisms. Am J Hum Genet. 2002, 71: 1242-1247. 10.1086/344207.
- Carlson CS, Eberle MA, Rieder MJ, Yi Q, Kruglyak L, Nickerson DA: Selecting a maximally informative set of single-nucleotide polymorphisms for association analyses using linkage disequilibrium. Am J Hum Genet. 2004, 74: 106-120. 10.1086/381000.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.