Evaluating the transferability of Hapmap SNPs to a Singapore Chinese population
© Andiappan et al; licensee BioMed Central Ltd. 2010
Received: 9 December 2009
Accepted: 7 May 2010
Published: 7 May 2010
The International Hapmap project serves as a valuable resource for human genome variation data; however, its applicability to other populations has yet to be exhaustively investigated. In this paper, we use high-density genotyping chips and resequencing strategies to compare the Singapore Chinese population with the Hapmap populations. First, we compared 1028 and 114 unrelated Singapore Chinese samples, genotyped using the Illumina Human Hapmap 550 k chip and the Affymetrix 500 k array respectively, against the 270 samples from Hapmap. Second, data from 20 candidate genes on 5q31-33, resequenced for an asthma candidate-gene study, were also used for the analysis.
A total of 237 SNPs were identified through resequencing, of which only 95 SNPs (40%) were in Hapmap; an additional 56 SNPs (24%) were not genotyped directly but had a proxy SNP in Hapmap. At the genome-wide level, Singapore Chinese were highly correlated with Hapmap Han Chinese, with correlations of 0.954 and 0.947 for the Illumina and Affymetrix platforms respectively, and deviant SNPs were randomly distributed within and across all chromosomes.
The high correlation between our population and Hapmap Han Chinese reaffirms the applicability of Hapmap-based genome-wide chips for GWA studies. There is a clear population signature for the Singapore Chinese samples, and they predominantly resemble the southern Han Chinese population; however, when new migrants, particularly those with a northern Han Chinese background, are included, population stratification issues may arise. Future studies need to address population stratification within the sample collection when designing and interpreting GWAS in the Chinese population.
The International Hapmap Project is a multi-centre effort aimed at identifying genetic variations across the human genome among different individuals, to aid biomedical researchers in identifying genetic links to various diseases and variable drug response [1–3]. The Hapmap Consortium developed a human haplotype map by genotyping 270 samples from four populations with diverse geographic ancestry. These samples included 30 trios (mother, father, and adult child) from the Yoruba in Ibadan, Nigeria (YRI); 30 trios from the Centre d'Etude du Polymorphisme Humain (CEPH) collection of Utah residents of Northern and Western European ancestry; 45 unrelated Han Chinese in Beijing (CHB); and 45 unrelated Japanese in Tokyo (JPT). While the latest published update to the Hapmap project indicates the availability of data for more than 3.1 million single nucleotide polymorphisms (SNPs) in the four populations, this number has grown to more than 26 million SNPs in 11 populations (NCBI). The common patterns of DNA sequence variants, their frequencies and correlations have been made available online at the Hapmap database and dbSNP. While the genotyping data from the four main Hapmap populations serve as a valuable resource for linkage disequilibrium (LD) based marker selection in genetic association studies [2, 7], there is a need to evaluate their extensibility to other populations. Studies comparing LD patterns and transferability of tag SNPs [8–13] have shown that allele and haplotype frequencies of independent populations are relatively similar to those obtained from the Hapmap populations. The concordance is, however, not always near 100%. In analyzing regions spanning 750 kb in various European populations, Mueller et al. reported that only two out of the four studied regions were well represented in the Hapmap CEPH population.
While such studies on European populations are plentiful, only a few have focused on Asian populations and their concordance with the Han Chinese or Japanese Hapmap populations. A recent study looked at a 21 Mb region on chromosome 1q21-q25 in 80 Han Chinese from Shanghai as part of the International Type 2 Diabetes 1q Consortium, where 3042 SNPs were identified to match with Hapmap data from the CHB population. Another study, focused on the linkage disequilibrium of a region on chromosome 7p15 in Korean, Japanese, and Han Chinese samples, also reported similar results. These results are not surprising given that the study and reference populations were of the same ethnic origin from the same region. What is currently lacking is a similar validation on an ethnic Chinese population that is far removed from China. The recently published Singapore Genome Variation Project compares three Singaporean populations (Chinese, Malay and Indian) against the Hapmap populations. Interestingly, it showed that most Singapore Chinese were similar to southern Han Chinese. There was also evidence of population sub-structure when the Hapmap Han Chinese samples were compared with samples from the northern Han Chinese population, although the data were not conclusive due to the small sample size.
In our current study we investigate the applicability of the data obtained from the Hapmap CHB population to a Singapore Chinese population using genotyping data from the Affymetrix Gene Chip Human Mapping 500 K Array Set and the Illumina Human Hapmap 550 k chip. This would also serve to validate the use of these genome wide chips in disease based genetic association studies for a Hapmap based population from a different geographical location. To supplement the whole-genome comparison, a more focused gene based analysis of genes in the highly replicated 5q31-q33 chromosomal region for asthma was performed to compare the coverage of Hapmap SNP data in the context of a case control association study.
Correlation of SNP frequencies for Illumina 550 k Genotyping chip
Concordance Correlation Coefficient for Hapmap populations against Singaporean Chinese population using Illumina 550 k chip
Difference in average MAF for SNPs for Singaporean Chinese population to Hapmap Han Chinese population
[Table columns: Difference in MAF; Number of SNPs. Rows (difference in MAF): 0.4 - 0.5; 0.3 - 0.399; 0.2 - 0.299; 0.1 - 0.199; 0.05 - 0.099; 0 - 0.05]
Correlation of SNP frequencies for Affymetrix 500 k Genotyping chip
Concordance Correlation Coefficient for Hapmap populations against Singaporean Chinese population using Affymetrix 500 k chip
Difference in average MAF for SNPs for Singaporean Chinese population to Hapmap Han Chinese population
[Table columns: Difference in MAF; Number of SNPs. Rows (difference in MAF): 0.4 - 0.5; 0.3 - 0.399; 0.2 - 0.299; 0.1 - 0.199; 0.05 - 0.099; 0 - 0.05]
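A tally of this kind (number of SNPs per MAF-difference bin) could be produced along the following lines. This is only a minimal sketch: the SNP identifiers and MAF values are hypothetical, the function names are ours, and assigning a difference of exactly 0.05 to the "0.05 - 0.099" bin is an assumption, since the bin labels overlap at that value.

```python
from collections import Counter

# Lower bounds of the bins, checked from the largest downwards.
# Labels mirror the row labels of the table; the handling of the
# shared 0.05 boundary is an assumption of this sketch.
BINS = [
    (0.4, "0.4 - 0.5"),
    (0.3, "0.3 - 0.399"),
    (0.2, "0.2 - 0.299"),
    (0.1, "0.1 - 0.199"),
    (0.05, "0.05 - 0.099"),
    (0.0, "0 - 0.05"),
]


def bin_label(diff):
    """Return the table row label for an absolute MAF difference."""
    for lower, label in BINS:
        if diff >= lower:
            return label
    raise ValueError("MAF difference must be non-negative")


def tally_maf_differences(maf_study, maf_reference):
    """Count SNPs per bin, using only SNPs present in both populations."""
    counts = Counter()
    for snp in maf_study.keys() & maf_reference.keys():
        # Rounding guards against floating-point noise at bin edges.
        diff = round(abs(maf_study[snp] - maf_reference[snp]), 6)
        counts[bin_label(diff)] += 1
    return counts


if __name__ == "__main__":
    # Hypothetical MAFs keyed by SNP identifier.
    study = {"rs1": 0.10, "rs2": 0.45, "rs3": 0.25}
    reference = {"rs1": 0.12, "rs2": 0.20, "rs3": 0.25, "rs4": 0.30}
    for label, n in tally_maf_differences(study, reference).items():
        print(label, n)
```

Only SNPs typed in both data sets contribute to the counts, which matches the idea of comparing frequencies for shared markers.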
Chromosome based Analysis
To detect any patterns in chromosomal aggregation of similarities between Singapore Chinese and Hapmap CHB, the MAF comparison was performed at a chromosomal level. Pearson's correlation was found to be consistently high along each of the chromosomes with the lowest value being 0.948 (Additional file 1: Figure S2). This high concordance was not related to the number of SNPs from the Illumina 550 k chip on each chromosome or the length of the chromosome (Additional file 1: Figures S1, S2, and S3). Of the 561466 SNPs on the chip, 502 were found to be discordant compared to the Hapmap CHB data, with MAFs differing by up to 0.2. To identify if these SNPs were in any potential chromosomal hotspots, they were mapped to regions within each chromosome based on their physical positions. No particular chromosomal hotspot was found (Data not shown).
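The per-chromosome comparison described above can be illustrated with a short sketch. It assumes that each SNP is available as a (chromosome, study MAF, reference MAF) record; the record layout and function names are hypothetical and are not taken from the software used in the study.

```python
import math
from collections import defaultdict


def pearson(x, y):
    """Pearson's correlation coefficient for two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    # Undefined (division by zero) if either sequence has no variance.
    return sxy / math.sqrt(sxx * syy)


def per_chromosome_correlation(records):
    """Group (chromosome, maf_study, maf_reference) records by chromosome
    and return the Pearson correlation of the two MAF series for each."""
    by_chrom = defaultdict(lambda: ([], []))
    for chrom, maf_study, maf_reference in records:
        by_chrom[chrom][0].append(maf_study)
        by_chrom[chrom][1].append(maf_reference)
    return {chrom: pearson(xs, ys) for chrom, (xs, ys) in by_chrom.items()}


if __name__ == "__main__":
    # Hypothetical records for two chromosomes.
    demo = [
        ("1", 0.10, 0.11), ("1", 0.25, 0.24), ("1", 0.40, 0.42),
        ("2", 0.05, 0.07), ("2", 0.30, 0.29), ("2", 0.48, 0.45),
    ]
    for chrom, r in per_chromosome_correlation(demo).items():
        print(chrom, round(r, 3))
```

Flagging discordant SNPs would then amount to filtering the same records on the absolute difference between the two MAF values.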
Principal Component Analysis (PCA)
Identification of SNPs by resequencing Asthma Candidate Genes
SNP Coverage of 5q31-33 Region in Public Databases
SNP coverage for the 20 studied genes on 5q31-33 in various Hapmap populations
The resequencing data were used to estimate the microarray coverage of the Illumina 550 k chip for the 20 resequenced genes. Of the 237 SNPs identified, only 182 had been reported previously and 52 had not been documented before. Thus, only the previously reported SNPs were used to estimate coverage of the whole-genome chip.
Microarray coverage for Singapore Chinese population for the region containing the 20 genes resequenced on 5q based on Illumina 550 k chip
[Table columns: SNPs on microarray; r² ≥ 0.8; r² ≥ 0.5]
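Coverage at a given r² threshold can be thought of as the fraction of reference SNPs that are either on the chip or tagged by a chip SNP at or above that threshold. The sketch below implements this common reading; it may differ in detail from the procedure of Magi et al. used in the study, and the SNP names, the pairwise r² lookup and the function name are all hypothetical.

```python
def coverage(reference_snps, chip_snps, pairwise_r2, threshold=0.8):
    """Fraction of reference SNPs that are on the chip or are in LD
    (r² >= threshold) with at least one chip SNP.

    pairwise_r2 maps frozenset({snp_a, snp_b}) to an r² value; pairs
    that are absent are treated as r² = 0.
    """
    chip = set(chip_snps)
    covered = 0
    for snp in reference_snps:
        if snp in chip:
            covered += 1  # directly genotyped
        elif any(
            pairwise_r2.get(frozenset((snp, tag)), 0.0) >= threshold
            for tag in chip
        ):
            covered += 1  # captured through a proxy on the chip
    return covered / len(reference_snps)


if __name__ == "__main__":
    # Hypothetical example: four resequenced SNPs, two chip SNPs.
    reference = ["snpA", "snpB", "snpC", "snpD"]
    chip = ["snpA", "snpX"]
    r2 = {
        frozenset(("snpB", "snpX")): 0.9,
        frozenset(("snpC", "snpX")): 0.6,
    }
    print(coverage(reference, chip, r2, threshold=0.8))
    print(coverage(reference, chip, r2, threshold=0.5))
```

Lowering the threshold from 0.8 to 0.5 can only keep or increase the coverage, which is why the two thresholds are reported side by side.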
Linkage Disequilibrium Analysis
We also looked at all SNPs on chromosome 5 on the Illumina 550 k whole-genome chip and estimated pairwise LD values for Singapore Chinese and all four Hapmap populations. Additional file 1: Table S1 shows further evidence that the Singapore Chinese population is more closely associated with the Hapmap Han Chinese than with the other three Hapmap populations.
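For reference, the pairwise r² statistic can be written down compactly for the simplest case of phased haplotypes. This is an illustrative sketch only: it assumes phase is known and alleles are coded 0/1, whereas tools working from unphased genotypes typically estimate haplotype frequencies first; the function name is ours.

```python
def r_squared(haplotypes):
    """Pairwise LD r² between two biallelic SNPs from phased haplotypes.

    haplotypes is a sequence of (allele_at_snp1, allele_at_snp2) pairs
    with alleles coded 0 or 1. Returns NaN if either SNP is monomorphic.
    """
    n = len(haplotypes)
    p_a = sum(h[0] for h in haplotypes) / n            # freq. of allele 1 at SNP 1
    p_b = sum(h[1] for h in haplotypes) / n            # freq. of allele 1 at SNP 2
    p_ab = sum(1 for h in haplotypes if h[0] == 1 and h[1] == 1) / n
    d = p_ab - p_a * p_b                               # disequilibrium coefficient D
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    if denom == 0:
        return float("nan")
    return d * d / denom


if __name__ == "__main__":
    # Perfectly correlated alleles versus independent alleles.
    print(r_squared([(1, 1), (1, 1), (0, 0), (0, 0)]))
    print(r_squared([(1, 1), (1, 0), (0, 1), (0, 0)]))
```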
Correlation of SNP frequencies
Correlation of SNP frequencies between Singapore Chinese and Hapmap populations
Genes underlying common complex diseases, such as asthma and other allergic diseases, are likely to be multiple, each with a relatively small effect, but acting in concert or with environmental influences to lead to clinical presentation. The Hapmap project was designed to allow researchers to identify common disease-causing variants based upon the "common disease, common variant" hypothesis, which suggests that genetic influences on many common diseases are attributable to a limited number of allelic variants (one or a few at each major disease locus) that are present in more than 1-5% of the population [18–20]. Linkage disequilibrium (LD) data are also available in Hapmap to facilitate the design of genome-wide chips for association studies. This study attempts to explore the genetic architecture of the Singapore Chinese population. By considering our population in the context of the Hapmap populations, this study reveals insights that are relevant to conducting genetic studies in a population of Chinese ancestry.
Hapmap data and the world populations
Hu et al. described Shanghai Chinese as very similar to Hapmap Han Chinese, based on 4,500 SNPs in a 21 Mb region on chromosome 1q21-q25 in 80 unrelated Shanghai Chinese and 45 Hapmap Beijing Han Chinese. They obtained a correlation coefficient of R = 0.94, p < 0.001 for 3042 SNPs (some SNPs were filtered out based on their data quality control criteria). They also reported a correlation coefficient of R = 0.88, p < 0.001 for the comparison of Shanghai Chinese with Hapmap Japanese. Takeuchi et al. performed a similar comparison of Japanese individuals against Hapmap Japanese by combining resequencing and high-density genotyping approaches. They stated that the Hapmap coverage is not thorough for SNPs in the Japanese population, and that this needs to be considered when association results are interpreted. Researchers elsewhere have also performed comparative studies between CEU SNP data and several other populations, including Spanish, Finnish, and Estonian [10, 12, 13, 22]. They all came to the same conclusion: that the CEU SNP dataset was a robust dataset for comparative and association studies in these populations. These observations by different groups studying the effectiveness of the Hapmap dataset for different populations are not entirely consistent. Even though Hapmap serves as a good reference for some populations, its applicability to other populations not evaluated in the Hapmap project needs to be assessed closely.
Hapmap and Singapore Chinese
The genotype data for Singapore Chinese from both Illumina and Affymetrix gave a high correlation coefficient of approximately 0.95 in comparison with the Hapmap Han Chinese. In contrast, comparison with the Caucasian and African populations showed very low correlation. However, in a comparison of close to 1 million SNPs, a 5% deviation is still not negligible. In an attempt to localize this deviation, a chromosome-based correlation analysis was performed. A consistently high correlation (more than 0.95) was observed across all the chromosomes, with deviating SNPs not associated with minor allele frequencies or any specific chromosomal location. This indicates that the 5% deviation observed between Hapmap CHB and our local population is likely to be random and not due to any major differences between the two populations.
The HumanHap550 BeadChip from Illumina displays a genomic coverage of 87% for the Asian population (CHB+JPT) and 90% and 57% for the CEU and YRI populations respectively (Illumina Inc), as measured by Phase I+II Hapmap genotype data. The mean MAFs determined using the HumanHap550 BeadChip were 0.23, 0.21 and 0.22 for the CEU, CHB+JPT and YRI populations, respectively. It should be noted that although the mean MAF is similar for all three populations, the distribution of SNPs in terms of MAF is quite different. The mean MAF determined for our Singapore Chinese population is 0.215, which is similar to the estimate for the Asian population reported by Illumina. The high genomic coverage for the Asian population set, as well as the comparable mean MAF, suggests that the BeadChip designed on the basis of linkage disequilibrium data from Hapmap can be extended to genome-wide analysis of other similar population cohorts not previously genotyped, such as the Singapore Chinese in our case. A very recent study by Chen et al. described the genetic architecture of Han Chinese from all over the world and found a "north-south" population structure, which was also clearly visible in our population. That study also included 570 Han Chinese samples from Singapore and found that they were more similar to the southern Han Chinese population. These differences need to be addressed when an association study includes both northern and southern Han Chinese samples.
Using the resequencing data of the 5q31-33 region, we compared and estimated the coverage of the genes in this region with that in the Hapmap project. Of the 237 SNPs we identified through resequencing, 73 (31%) were identified in Hapmap. This meant that more than two-thirds of the variation in Singapore Chinese was not reported in the Hapmap CHB population. While a further 24% of our 'novel' SNPs had proxies in Hapmap, the fact remains that even the Hapmap CHB population, likely to be genealogically closest to Singapore Chinese, was unable to provide information for at least 50% of the genetic variation in our local population. In the study of complex diseases such as asthma, it is of the utmost importance to capture as much of the genetic variation in the study population as possible so that it can be screened for potential associations. In such situations, Hapmap by itself may be insufficient, and targeted resequencing may be essential to capture all the variation in a specific population. A study by Tantoso et al. has also demonstrated that the Hapmap SNPs are not robust enough to capture the untyped variants for most genes. They estimated a marginal coverage of about 55% for European and Asian samples, and a coverage as low as 30% for the Hapmap YRI panel. A recent study also evaluated the coverage of different SNP chips used for genome-wide association studies. Such information would be useful in selecting the chip that provides better coverage for the population under investigation.
In this study we evaluated the correlation between the MAF of Hapmap SNPs and that obtained from a Singapore Chinese population. We found that minor allele frequencies of 976219 unique Hapmap SNPs for the Han Chinese population correlated with those from a Singapore Chinese population with a concordance correlation coefficient of 0.95. This demonstrates the effectiveness of using the Hapmap Han Chinese population as a reference population for future whole-genome association studies in Singapore Chinese. It also emphasizes that the SNPs selected for the genome-wide chips are performing as expected, as the observed MAFs are quite similar to the MAFs in the Hapmap project. Although the principal component analysis reveals no significant population stratification, the migration pattern of the samples needs to be addressed when designing and interpreting genome-wide association studies. While we showed that the SNPs deposited in Hapmap are sufficient to represent the gross genetic variation, based on the similar LD patterns observed in the Hapmap Han Chinese and Singapore Chinese populations, targeted resequencing, as used in a candidate-gene approach, may still be necessary to capture all the variation in specific target genes. This SNP information can also help to develop SNP chips that are more targeted towards a specific population with clear population signatures.
The DNA samples used in this study were collected from ethnic Chinese participants following standard protocols of informed consent, as part of an on-going whole-genome asthma and allergic disease case-control association study (unpublished data) in Singapore. The experimental research reported in this manuscript was performed with the approval of the NUS Institutional Review Board (IRB; Reference NUS07-023) and is in compliance with the Helsinki declaration. Genomic DNA was extracted from buccal cells obtained from a mouthwash of 0.9% saline solution following a standardized protocol. In short, the buccal cells were pelleted and lysed; DNA was extracted using the phenol-chloroform phase-separation technique and purified by two washes in ethanol, and the DNA pellet was resuspended in reduced Tris-EDTA buffer. Samples were quantified in triplicate on the Nanodrop (ND-1000). Samples which fell within a 1% error margin in the replicate measurements were subsequently diluted to 50 ng/μl, according to the requirements in the assay manual.
A total of 114 and 1028 samples were genotyped on the Affymetrix 500 k and Illumina Hapmap 550 k chips (Illumina Infinium HumanHap550 Duo or Illumina Infinium HumanHap610 Quad) respectively, processed according to the protocols outlined in the Gene Chip Mapping 500 k Assay Manual and the Infinium II Assay Workflow. Genotypes were obtained using the BRLMM algorithm as implemented in the Genotyping Console v2.1 for the Affymetrix platform, and from the BeadStation software for the Illumina platform. Cryptic relatedness was tested to remove any relatives within the samples, and a gender check was also performed to ensure that all predicted sexes matched the recorded gender.
The concordance correlation coefficient was calculated to determine "correlation" as a measure of accuracy between actual and estimated allele frequencies. The R software package and PASW Statistics 17 (SPSS Inc) were used to calculate the correlation statistics. Unless otherwise stated, all measures of correlation were deemed statistically significant at p < 0.05. Mean absolute deviation (MAD) was used as a more robust estimator of dispersion of errors than standard deviation or variance. Principal Component Analysis (PCA) statistics were calculated using the EIGENSTRAT software package. LD blocks were developed using Haploview version 4.0 (http://www.broad.mit.edu/mpg/haploview). The r² values were used to determine pairwise linkage using the default Gabriel et al. algorithm. A "proxy SNP" is defined as a SNP which is covered by another SNP at an r² value of 0.8. Microarray coverage was calculated as described by Magi et al.
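The two summary statistics named above can be sketched as follows. This is an illustration rather than the code used in the study (which relied on R and PASW): it assumes Lin's definition of the concordance correlation coefficient with population (1/n) moments, and it takes "mean absolute deviation" to mean the average absolute deviation of the errors from their mean, which is one common reading. Function names are ours.

```python
def concordance_correlation(x, y):
    """Lin's concordance correlation coefficient:
    2*s_xy / (s_x^2 + s_y^2 + (mean_x - mean_y)^2), with 1/n moments.
    Unlike Pearson's r, it penalizes systematic shifts between x and y."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    var_x = sum((a - mean_x) ** 2 for a in x) / n
    var_y = sum((b - mean_y) ** 2 for b in y) / n
    cov_xy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / n
    return 2 * cov_xy / (var_x + var_y + (mean_x - mean_y) ** 2)


def mean_absolute_deviation(values):
    """Average absolute deviation of the values from their mean."""
    n = len(values)
    mean = sum(values) / n
    return sum(abs(v - mean) for v in values) / n


if __name__ == "__main__":
    # Hypothetical allele frequencies in two populations.
    study = [0.10, 0.20, 0.30]
    reference = [0.20, 0.30, 0.40]  # same pattern, shifted by 0.1
    print(concordance_correlation(study, reference))
    errors = [s - r for s, r in zip(study, reference)]
    print(mean_absolute_deviation(errors))
```

In the hypothetical example the two series are perfectly linearly related, so Pearson's r would be 1, yet the concordance coefficient is lower because of the constant offset. This is the property that makes it a measure of agreement rather than of association alone.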
This study is supported by grants from the Singapore Immunology Network (SIgN-06-006), Biomedical Research Council (BMRC), Singapore and National Medical Research Council (NMRC), Singapore and National University of Singapore (NUS) for the Graduate Research Scholarship for AKA. We would like to thank Dr. Liu Jianjun, Dr. Li Yi and Low Hui Qi from the Genome Institute of Singapore for their support in data analysis.
- Altshuler D, Brooks LD, Chakravarti A, Collins FS, Daly MJ, Donnelly P, Int HapMap C: A haplotype map of the human genome. Nature. 2005, 437 (7063): 1299-1320. 10.1038/nature04226.
- Brohede J, Dunne R, McKay JD, Hannan GN: PPC: an algorithm for accurate estimation of SNP allele frequencies in small equimolar pools of DNA using data from high density microarrays. Nucleic Acids Research. 2005, 33 (17): 10.1093/nar/gni142.
- Collins FS, Guyer MS, Chakravarti A: Variations on a theme: Cataloging human DNA sequence variation. Science. 1997, 278 (5343): 1580-1581. 10.1126/science.278.5343.1580.
- Foster MW, Int HapMap C: Integrating ethics and science in the international HapMap project. Nature Reviews Genetics. 2004, 5 (6): 467-475. 10.1038/nrg1351.
- Frazer KA, Ballinger DG, Cox DR, Hinds DA, Stuve LL, Gibbs RA, Belmont JW, Boudreau A, Hardenbol P, Leal SM, et al: A second generation human haplotype map of over 3.1 million SNPs. Nature. 2007, 449 (7164): 851-U853. 10.1038/nature06258.
- Gibbs RA, Belmont JW, Hardenbol P, Willis TD, Yu FL, Yang HM, Ch'ang LY, Huang W, Liu B, Shen Y, et al: The International HapMap Project. Nature. 2003, 426 (6968): 789-796. 10.1038/nature02168.
- Hinds DA, Kloek AP, Jen M, Chen XY, Frazer KA: Common deletions and SNPs are in linkage disequilibrium in the human genome. Nature Genetics. 2006, 38 (1): 82-85. 10.1038/ng1695.
- Hu C, Jia WP, Zhang WH, Wang CR, Zhang R, Wang J, Ma XJ, Xiang KS, Int Type 2 Diabet 1q C: An evaluation of the performance of HapMap SNP data in a Shanghai Chinese population: Analyses of allele frequency, linkage disequilibrium pattern and tagging SNPs transferability on chromosome 1q21-q25. BMC Genetics. 2008, 9: 19-10.1186/1471-2156-9-19.
- Khoury MJ, Wacholder S: Invited Commentary: From Genome-Wide Association Studies to Gene-Environment-Wide Interaction Studies - Challenges and Opportunities. American Journal of Epidemiology. 2009, 169 (2): 227-230. 10.1093/aje/kwn351.
- Lander ES: The new genomics: Global views of biology. Science. 1996, 274 (5287): 536-539. 10.1126/science.274.5287.536.
- Li MY, Li C, Guan W: Evaluation of coverage variation of SNP chips for genome-wide association studies. European Journal of Human Genetics. 2008, 16 (5): 635-643. 10.1038/sj.ejhg.5202007.
- Lim J, Kim YJ, Yoon Y, Kim SO, Kang HJ, Park J, Han AR, Han B, Oh B, Kimm K, et al: Comparative study of the linkage disequilibrium of an ENCODE region, chromosome 7p15, in Korean, Japanese, and Han Chinese samples. Genomics. 2006, 87 (3): 392-398. 10.1016/j.ygeno.2005.11.002.
- Lundmark PE, Liljedahl U, Boomsma DI, Mannila H, Martin NG, Palotie A, Peltonen L, Perola M, Spector TD, Syvanen AC: Evaluation of HapMap data in six populations of European descent. European Journal of Human Genetics. 2008, 16 (9): 1142-1150. 10.1038/ejhg.2008.77.
- Manolio TA, Brooks LD, Collins FS: A HapMap harvest of insights into the genetics of common disease. Journal of Clinical Investigation. 2008, 118 (5): 1590-1605. 10.1172/JCI34772.
- Teo YY, Sim XL, Ong RTH, Tan AKS, Chen JM, Tantoso E, Small KS, Ku CS, Lee EJD, Seielstad M, et al: Singapore Genome Variation Project: A haplotype map of three Southeast Asian populations. Genome Research. 2009, 19 (11): 2154-2162. 10.1101/gr.095000.109.
- Willer CJ, Scott LJ, Bonnycastle LL, Jackson AU, Chines P, Pruim R, Bark CW, Tsai YY, Pugh EW, Doheny KF, et al: Tag SNP selection for Finnish individuals based on the CEPH Utah HapMap database. Genetic Epidemiology. 2006, 30 (2): 180-190. 10.1002/gepi.20131.
- Montpetit A, Nelis M, Laflamme P, Magi R, Ke XY, Remm M, Cardon L, Hudson TJ, Metspalu A: An evaluation of the performance of tag SNPs derived from HapMap in a Caucasian population. Plos Genetics. 2006, 2 (3): 282-290. 10.1371/journal.pgen.0020027.
- Mueller JC, Lohmussaar E, Magi R, Remm M, Bettecken T, Lichtner P, Biskup S, Illig T, Pfeufer A, Luedemann J, et al: Linkage disequilibrium patterns and tagSNP transferability among European populations. American Journal of Human Genetics. 2005, 76 (3): 387-398. 10.1086/427925.
- Murcray CE, Lewinger JP, Gauderman WJ: Gene-Environment Interaction in Genome-Wide Association Studies. American Journal of Epidemiology. 2009, 169 (2): 219-226. 10.1093/aje/kwn353.
- Reich DE, Lander ES: On the allelic spectrum of human disease. Trends in Genetics. 2001, 17 (9): 502-510. 10.1016/S0168-9525(01)02410-6.
- Takeuchi F, Serizawa M, Kato N: HapMap coverage for SNPs in the Japanese population. Journal of Human Genetics. 2008, 53 (1): 96-99. 10.1007/s10038-007-0221-7.
- Shek LPC, Tay AHN, Chew FT, Goh DLM, Lee BW: Genetic susceptibility to asthma and atopy among Chinese in Singapore - linkage to markers on chromosome 5q31-33. Allergy. 2001, 56 (8): 749-753. 10.1034/j.1398-9995.2001.056008749.x.
- Chen JM, Zheng HF, Bei JX, Sun LD, Jia WH, Li T, Zhang FR, Seielstad M, Zeng YX, Zhang XJ, et al: Genetic Structure of the Han Chinese Population Revealed by Genome-wide SNP Variation. Am J Hum Genet. 2009, 85 (6): 775-785. 10.1016/j.ajhg.2009.10.016.
- Tantoso E, Yang YC, Li KB: How well do HapMap SNPs capture the untyped SNPs?. BMC Genomics. 2006, 7: 238-10.1186/1471-2164-7-238.
- Moore DDD: Unit 2.1A: Purification and Concentration of DNA from Aqueous Solutions. Current Protocols in Molecular Biology. 2002.
- Price AL, Patterson NJ, Plenge RM, Weinblatt ME, Shadick NA, Reich D: Principal components analysis corrects for stratification in genome-wide association studies. Nature Genetics. 2006, 38 (8): 904-909. 10.1038/ng1847.
- Magi R, Pfeufer A, Nelis M, Montpetit A, Metspalu A, Remm M: Evaluating the performance of commercial whole-genome marker sets for capturing common genetic variation. BMC Genomics. 2007, 8: 8-10.1186/1471-2164-8-159.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.