Characterization of a likelihood based method and effects of markers informativeness in evaluation of admixture and population group assignment
© Yang et al; licensee BioMed Central Ltd. 2005
Received: 06 June 2005
Accepted: 14 October 2005
Published: 14 October 2005
Detection and evaluation of population stratification are crucial issues in the conduct of genetic association studies. Statistical approaches useful for understanding these issues have been proposed; these methods rely on information gained from genotyping sets of markers that reflect population ancestry. Before using these methods, a set of markers informative for differentiating population genetic substructure (PGS) is necessary. We have previously evaluated the performance of a Bayesian clustering method implemented in the software STRUCTURE in detecting PGS with a particular informative marker set. In this study, we implemented a likelihood based method (LBM) to evaluate the informativeness of the same selected marker panel, with respect to assessing potential for stratification in samples of European Americans (EAs) and African Americans (AAs), populations that are known to be admixed. LBM calculates the probability of a set of genotypes based on observations in a reference population with known specific allele frequencies for each marker, assuming Hardy-Weinberg equilibrium (HWE) for each marker and linkage equilibrium among markers.
In EAs, the assignment accuracy by LBM exceeded 99% using the most efficient marker FY, and reached perfect assignment accuracy using the 10 most efficient markers excluding FY. In AAs, the assignment accuracy reached 96.4% using FY, and >95% when using at least the 9 most efficient markers. The comparison of the observed and reference allele frequencies (which were derived from previous publications and public databases) shows that allele frequencies observed in EAs matched the reference group more accurately than allele frequencies observed in AAs. As a result, the LBM performed better in EAs than AAs, as might be expected given the dependence of LBMs on prior knowledge of allele frequencies. Performance was not dependent on sample size.
The performance of the LBM depends on the efficiency and number of markers, and depends greatly on how well the available reference allele frequencies represent those of the population being assigned. This method is of value when the parental population is known and relevant allele frequencies are available.
Population stratification is a crucial issue in conducting genetic association studies, in particular for case-control study designs: if it is not accounted for, study results could be invalid, yielding either false positives or false negatives. Methods to address the issue have been proposed [2–17]. Before using these methods, an informative set of markers is necessary; this is known as a set of ancestry informative markers (AIMs). In this study, we implemented a likelihood based method (LBM), as an alternative to popular Bayesian methods such as that implemented in STRUCTURE [3, 13], and used it to evaluate the informativeness of a selected marker panel and to assess potential for stratification in a sample of European Americans (EAs) and African Americans (AAs) that are known to be admixed.
Likelihood-based methods (LBMs) provide a framework for assignment of individuals to specific populations based on observed allele frequencies in AIMs. LBMs for the classification of individuals into subgroups can be implemented by calculating the probability of a marker genotype profile (i.e., a set of genotypes) based on observations in a reference population with known specific allele frequencies for each marker ("training frequencies"), assuming Hardy-Weinberg equilibrium (HWE) for each marker and linkage equilibrium among markers. The LBM method is also called an "assignment test" and is widely applied in molecular ecology and animal forensics for identifying population genetic substructures for animals or plants [18–24]. Research on the assignment test or LBM has not yet focused on the performance of the test or of specific markers in differentiating the PGS in human subjects. In theory, LBM may be better for probabilistic classification of individuals to subpopulations, if certain conditions are met. The most important of these conditions is availability of an accurate set of training frequencies. This method is therefore applicable only if the populations from which the sample to be classified is drawn are already known or can be determined. This condition can be met in most situations; for example, the AA population is well known to have principally African and European American ancestry.
In the present study, we compared the performance of LBMs to that of the popular Bayesian approach used by the software program STRUCTURE. We predicted that, if the conditions for successful LBM application are met, LBMs would be more efficient than Bayesian methods for population group assignment, because they make use of more information (i.e., known ancestral population allele frequencies, which are provided a priori rather than inferred from the data presented to the program).
In EAs (Figure 2, (1)), the assignment accuracy by LBM exceeded 99% using the most efficient marker FY, and reached 100% using the 10 most efficient markers excluding FY (when FY was excluded, the assignment accuracy using the next most efficient marker D11S936 dropped by 9%). In contrast, it would take 29 markers to reach >99% assignment accuracy when the least efficient markers are selected or the seven most efficient markers are omitted. In AAs (Figure 2, (2)), the assignment accuracy reached 96.4% using FY, and then the assignment accuracy changed inconsistently as more markers were added up to 21 markers, at which point assignment accuracy stabilized at 97.6%, achieving the maximum of 98.8% when all 36 markers were used. Overall, assignment accuracy in AAs by LBM exceeded 95% when at least the 9 most efficient markers were used. When FY was excluded, the assignment accuracy dropped by 38%.
This 38% drop, which reflects the difference in accuracy between the most efficient marker, FY, and the second most efficient one, D11S936, was further investigated by a corresponding analysis in which the study sample was randomly split into two groups and one group was treated as a reference sample. The drop declined to 6%, which was more comparable to the 9% in EAs. Thus, this reduced accuracy was in large part attributable to mismatch between reported training allele frequencies and frequencies that are more representative of our Northeastern US AA population. LBM never reaches perfect assignment accuracy for AAs in this sample even when all the 36 markers were used, but accuracy did reach 98.8%.
Comparison of observed and reference allele frequencies
The high assignment accuracy by LBMs was observed notwithstanding the deviation between our observed allele frequencies and the reference frequencies described above. We further compared our observed allele frequencies with published reference allele frequencies using the χ2 test. In EAs, after adjusting for sample size, there were 19 markers that differed at p < 0.05, while in AAs, the corresponding number of markers was 29. In other words, allele frequencies observed in EAs matched the reference group more closely than did allele frequencies observed in AAs. As a result, the LBM performed better in EAs than AAs, as might be expected given the dependence of LBMs on prior knowledge of allele frequencies.
Evaluation of the influence of mismatched reference allele frequencies on assignment accuracy by means of split samples
Logarithm likelihood ratio
One individual in the AA series appeared to be misclassified; see Figure 4 with 9 to 12 markers. Based on this observation, we examined the phenotypic information for this subject, and determined that, although self-identified as AA, the subject had one AA and one EA parent.
Comparison of LBM results with Bayesian results obtained using STRUCTURE
The LBM is appealing for population group assignment because it is straightforward and easily implemented, provided that sufficiently accurate reference allele frequencies are available. We provide a set of allele frequencies for all markers herein that will prove useful for classifying populations similar to those discussed in the present article [see Additional file 1]. Under these circumstances, the LBM should classify individuals at least as accurately as STRUCTURE, and probably more accurately. However, a representative reference population may be difficult to establish in some cases. With a good reference group, as shown in the analysis of split samples (p-values of the χ2 test range from >0.57 to 1 for distributions of allele frequencies for the two split samples), LBM performed very well. In EAs, the clustering by LBM is as good as by STRUCTURE (using an ancestry model of admixture and an allele frequency dependence model) for δ descending, but LBM performs better for ascending values of δ. For AAs, LBM and STRUCTURE cluster the groups equally well. STRUCTURE retains certain advantages, such as the ability to classify individuals by proportional ancestry for subsequent application of the structured association method, as discussed elsewhere. It should be noted that the superior performance of LBMs over STRUCTURE, when observed, depends on LBM having more data available than STRUCTURE in the form of reference allele frequencies.
The observed allele frequencies in this study matched reference allele frequencies better for EA than AA populations. Subjects from some populations from different geographic areas might have quite different admixture proportions and ancestral origins. This is demonstrably the case for African-Americans, since in different parts of the US the percent admixture from EAs is known to range at least from 12% to 23%. Another issue with LBM involves justification for the multiplication of allele frequencies across loci under the assumption of linkage equilibrium. If the allele frequencies of different STRs vary among subpopulations, then the loci are not in complete linkage equilibrium or are not statistically independent even if they are genetically unlinked. However, we did assume linkage equilibrium within the subpopulations. This is also an underlying assumption for STRUCTURE. This assumption might prove to be problematic under some circumstances, but the practical impact seemed minimal for the present study, as evidenced by the fact that LBM performed well.
The result from the single most informative marker, FY, could exceed 99% and 96% assignment accuracy in EAs and AAs, respectively. This result is, of course, sample-specific to some extent; AA subjects who are homozygous for the allele more characteristic of European ancestry (i.e., FY (+/+)), should have a population frequency of about 4%, given a 20% admixture rate from EA, and would be misclassified into the EA group if based only on this marker; this misclassification rate is equal to what we observed, about 4% in AAs. Likewise, EAs heterozygous for the FY(-) allele characteristic of AAs are observed as well, and they are liable to be misclassified as AAs. Our Northeastern US AA population had approximated the expected European admixture rate, based on the information from FY.
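The 4% figure quoted above follows from Hardy-Weinberg proportions. The short sketch below reproduces the arithmetic under one simplifying assumption that is ours rather than stated in the text: that the FY(+) allele frequency in the AA sample is approximately equal to the EA admixture proportion (20%). The function name is illustrative only.

```python
def expected_homozygote_frequency(allele_freq: float) -> float:
    """Expected frequency of a homozygous genotype under Hardy-Weinberg equilibrium."""
    if not 0.0 <= allele_freq <= 1.0:
        raise ValueError("allele frequency must lie in [0, 1]")
    return allele_freq ** 2


# Assumption (not from the source): the FY(+) allele frequency in AAs
# is taken to equal the 20% EA admixture proportion.
admixture_from_ea = 0.20
fy_plus_homozygotes = expected_homozygote_frequency(admixture_from_ea)
print(f"Expected FY(+/+) frequency in AAs: {fy_plus_homozygotes:.2%}")
```

Under this assumption, roughly 4 in 100 AA subjects would carry FY(+/+) and would be assigned to the EA group if FY were the only marker used.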
The sample size of the populations being assigned is not an issue for LBM, while it is for STRUCTURE. The Bayesian cluster approach taken by STRUCTURE requires building a likelihood function from the observed samples to infer allele frequencies, such that if the sample size is insufficient, the estimated allele frequencies might not be accurate. As a result, sample size in each subgroup affects the assignment accuracy, and our simulation result shows that approximately fifty subjects are required to have stable assignment accuracy by STRUCTURE. LBM, in contrast, uses allele frequencies from the reference populations; there is no need to estimate allele frequencies by LBM. Thus, even a single individual can be assigned accurately using the LBM.
We conclude that assignment accuracy by LBM depends on the efficiency of the markers selected (FY alone can separate EAs and AAs with accuracy that can approach 99% for excluding AAs from a presumed EA sample), the number of markers (other things being equal, more markers produce higher assignment accuracy), and greatly on how representative the parental population reference allele frequencies are for the populations being queried.
Three hundred sixty-six individuals recruited in the Northeastern US (classified as 282 EAs, 84 AAs) were studied. These individuals were selected from a larger sample for evaluation of this likelihood based method because they had complete marker data for all markers described below. All subjects provided informed consent as approved by the appropriate institutional review boards.
Detailed marker and genotyping information was described previously. Briefly, two different sets of STR markers were used. First, we used the AmpFLSTR Identifiler PCR Amplification Kit (Applied Biosystems (ABI), Foster City, CA), which provides data from a set of 16 loci useful for forensic purposes (D8S1179, D21S11, D7S820, CSF1PO, D3S1358, TH01, D13S317, D16S539, D2S1338, D19S433, vWA, TPOX, D18S51, D5S818, FGA, and amelogenin). Amelogenin is used for sex identification rather than for polymorphism content, so information from that locus was not included in any analyses. Second, we selected 21 markers known to have high δ between EAs and AAs, and in some cases Hispanic and Asian populations, based on the report of Smith et al. 2001. This marker panel includes markers D1S196, D1S2628, D2S162, D2S319, D5S407, D5S410, D6S1610, D7S640, D7S657, D8S272, D8S1827, D9S175, D10S197, D10S1786, D11S935, D12S352, D14S68, D15S1002, D16S3017, D17S799, and D22S274. We also genotyped marker FY, added to the 36 STRs because of its known value in identifying individuals of primarily African ancestry.
Measures of marker efficiency
δ was used to measure marker efficiency. The definition and properties of δ are described in Yang et al. 2005. Briefly, for each marker, δ is half the sum, over all alleles, of the absolute difference in allele frequency between two populations.
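The verbal definition of δ can be sketched in a few lines. The function name and the allele frequencies below are hypothetical and chosen only for illustration; alleles missing from one population are treated as having frequency zero there, which is our assumption.

```python
from typing import Mapping


def delta(freqs_a: Mapping[str, float], freqs_b: Mapping[str, float]) -> float:
    """Half the sum of absolute allele-frequency differences between two populations."""
    alleles = set(freqs_a) | set(freqs_b)
    # An allele not listed for a population is assumed to have frequency 0 there.
    return 0.5 * sum(
        abs(freqs_a.get(allele, 0.0) - freqs_b.get(allele, 0.0)) for allele in alleles
    )


# Hypothetical frequencies for a single marker (not taken from the study).
population_a = {"1": 0.7, "2": 0.3}
population_b = {"1": 0.1, "2": 0.6, "3": 0.3}
marker_delta = delta(population_a, population_b)
print(round(marker_delta, 3))
```

With this definition δ ranges from 0 (identical allele distributions) to 1 (no shared alleles), so markers can be ranked by δ from most to least efficient.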
Analysis with the likelihood-based method (LBM)
We assumed HWE among alleles for each marker within populations and linkage equilibrium between markers. Under these assumptions, the likelihood, or the probability of the observed genotype profile, for each individual to be in a specific population is calculated as

$$\Pr(X \mid Z, P_Z) = \prod_{l=1}^{m} (2 - h_l)\, p_{l a_l}\, p_{l b_l} \qquad (1)$$

where $X$ is a vector of genotypes of marker loci, $Z$ is the proposed population of origin, $P_Z = (p_{11}, p_{12}, \ldots, p_{m n_m})$ is the set of reference allele frequencies for the $n_m$ alleles of the $m$ markers of population $Z$, $a_l$ and $b_l$ denote the two alleles observed at locus $l$, and $h_l$ is a dummy variable for homozygosity (i.e., when the locus is homozygous, $h_l$ is 1, otherwise $h_l$ is 0) for each marker locus. When an allele is absent for a given population in the reference frequencies, the corresponding allele frequency in the study group is estimated and used in the calculation of likelihood. An individual is assigned to a population if the maximum likelihood results from assignment to that population among all possible population-specific likelihoods calculated. For assigning individuals into one of two populations A or B, an individual is assigned to population A if the logarithm of the likelihood ratio is greater than zero, or otherwise to B, as shown in equation (2):

$$\log \frac{\Pr(X \mid A, P_A)}{\Pr(X \mid B, P_B)} > 0 \qquad (2)$$
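The assignment rule described above can be sketched as follows. This is a minimal illustration, not the authors' R/S-Plus code: the data layout, function names, and example frequencies are all hypothetical, and the sketch assumes every observed allele already has a positive reference frequency (alleles absent from the reference would need to be replaced beforehand, as described in the text).

```python
import math
from typing import Mapping, Tuple

Genotype = Tuple[str, str]                       # the two alleles at one locus
Frequencies = Mapping[str, Mapping[str, float]]  # marker -> allele -> frequency


def log_likelihood(profile: Mapping[str, Genotype], freqs: Frequencies) -> float:
    """Log-probability of a genotype profile under HWE and linkage equilibrium."""
    total = 0.0
    for marker, (a, b) in profile.items():
        p_a = freqs[marker][a]
        p_b = freqs[marker][b]
        # HWE genotype probability: p^2 for a homozygote, 2*p_a*p_b for a heterozygote.
        factor = 1.0 if a == b else 2.0
        # Linkage equilibrium lets us sum log-probabilities across markers.
        total += math.log(factor * p_a * p_b)
    return total


def assign(profile: Mapping[str, Genotype],
           freqs_a: Frequencies, freqs_b: Frequencies) -> Tuple[str, float]:
    """Assign to 'A' if the log likelihood ratio exceeds zero, otherwise to 'B'."""
    llr = log_likelihood(profile, freqs_a) - log_likelihood(profile, freqs_b)
    return ("A" if llr > 0 else "B"), llr


# Hypothetical reference frequencies for two markers (not from the study).
ref_a = {"M1": {"x": 0.9, "y": 0.1}, "M2": {"s": 0.5, "t": 0.5}}
ref_b = {"M1": {"x": 0.2, "y": 0.8}, "M2": {"s": 0.5, "t": 0.5}}
individual = {"M1": ("x", "x"), "M2": ("s", "t")}

label, llr = assign(individual, ref_a, ref_b)
print(label, round(llr, 3))
```

Working on the log scale avoids numerical underflow when many markers are multiplied together, and the marker "M2" in the example contributes nothing to the ratio because its frequencies are identical in both reference populations.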
An individual was considered to be assigned accurately to a group when the greatest likelihood among all the calculated likelihoods assigned the individual the same ethnicity as that individual's self-identified population group. Assignment accuracy in each population group was defined as the proportion of correctly assigned ethnicities. (The above decision rule is optimal if the two ethnic groups have equal prior proportions. However, when more people are expected a priori from one group, the prior information needs to be incorporated to improve the overall performance in terms of misclassification rate.) The method was implemented in R/S-Plus; the function codes are available upon request from the authors.
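One standard way to incorporate unequal priors, offered here as an assumption rather than as the procedure used in this study, is to add the log prior odds to the log likelihood ratio and compare the resulting log posterior odds with zero. The function name and numeric values are illustrative.

```python
import math


def assign_with_priors(llr: float, prior_a: float) -> str:
    """Assign to 'A' when the log posterior odds (LLR + log prior odds) exceed zero."""
    if not 0.0 < prior_a < 1.0:
        raise ValueError("prior_a must lie strictly between 0 and 1")
    log_prior_odds = math.log(prior_a / (1.0 - prior_a))
    return "A" if llr + log_prior_odds > 0 else "B"


# With equal priors the rule reduces to the sign of the log likelihood ratio.
print(assign_with_priors(1.0, 0.5))
# A strong prior against population A can overturn weak likelihood evidence.
print(assign_with_priors(1.0, 0.2))
```

With `prior_a = 0.5` the log prior odds are zero, so the rule coincides with the equal-prior decision rule described in the text.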
The initial set of reference population-specific allele frequencies (training frequencies) for the 36 markers were derived from ABI reference materials or Smith et al. 2001, depending on the source of each marker. Since ABI uses different nomenclature (in some cases) and we redesigned some primers referenced by Smith to facilitate efficient genotyping, each observed allele had to be matched to the corresponding allele for each marker. Alleles at one marker locus (D6S1610) described by Smith et al. 2001 could not be matched accurately to our data from the same marker (however, the value of EA/AA δ that we derived for that marker, 0.336, was similar to the value reported by Smith et al., which was 0.337). The χ2 test was used to compare the allele distributions of the study group and the reference group.
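A comparison of allele distributions of this kind can be sketched as a χ2 test of homogeneity on a 2 × k table of allele counts. The sketch below assumes that allele counts (not only frequencies) are available for both groups, which may not match how the published reference data were handled; the function name and counts are hypothetical. It returns the test statistic and degrees of freedom, from which a p-value can be obtained using any χ2 distribution routine.

```python
from typing import Sequence, Tuple


def chi_square_homogeneity(study: Sequence[int],
                           reference: Sequence[int]) -> Tuple[float, int]:
    """Chi-square statistic and degrees of freedom for a 2 x k allele-count table."""
    if len(study) != len(reference):
        raise ValueError("both groups must list counts for the same alleles")
    n_study, n_reference = sum(study), sum(reference)
    if n_study == 0 or n_reference == 0:
        raise ValueError("each group needs at least one observed allele")
    grand_total = n_study + n_reference

    statistic = 0.0
    informative_alleles = 0
    for s, r in zip(study, reference):
        column_total = s + r
        if column_total == 0:
            continue  # allele unseen in both groups carries no information
        informative_alleles += 1
        for observed, row_total in ((s, n_study), (r, n_reference)):
            expected = row_total * column_total / grand_total
            statistic += (observed - expected) ** 2 / expected
    return statistic, informative_alleles - 1


# Hypothetical allele counts for one biallelic marker (not from the study).
stat, df = chi_square_homogeneity([30, 70], [50, 50])
print(round(stat, 3), df)
```

Because the statistic grows with the number of alleles counted, comparisons across groups of different size need the kind of sample-size adjustment mentioned in the text.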
Evaluation of the impact of the training frequencies on population group assignment accuracy
To evaluate the impact of the training frequencies on population group assignment accuracy, we compared the literature-derived training allele frequencies (described above) with training allele frequencies computed from our specific populations. To do so, we randomly split the 282 EAs and 84 AAs into two equal-sized groups. One was treated as the study group, and the other was treated as the reference group, from which the training allele frequencies were estimated.
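The split-sample procedure can be sketched as below: individuals are shuffled, divided into two halves, and training allele frequencies are estimated by allele counting in the reference half. The function names, the fixed seed, and the toy genotypes are assumptions for illustration and are not taken from the study.

```python
import random
from collections import Counter
from typing import Dict, List, Sequence, Tuple

Genotype = Tuple[str, str]  # the two alleles carried at one marker


def split_in_half(individuals: Sequence, seed: int = 0) -> Tuple[List, List]:
    """Randomly split individuals into a study half and a reference half."""
    shuffled = list(individuals)
    random.Random(seed).shuffle(shuffled)  # seeded for reproducibility
    half = len(shuffled) // 2
    return shuffled[:half], shuffled[half:]


def allele_frequencies(genotypes: Sequence[Genotype]) -> Dict[str, float]:
    """Estimate allele frequencies at one marker by counting alleles."""
    counts = Counter(allele for genotype in genotypes for allele in genotype)
    total = sum(counts.values())
    return {allele: n / total for allele, n in counts.items()}


# Toy data: 84 individuals genotyped at one marker (hypothetical genotypes).
sample = [("a", "a") if i % 3 else ("a", "b") for i in range(84)]
study_half, reference_half = split_in_half(sample)
training_freqs = allele_frequencies(reference_half)
print(len(study_half), len(reference_half), training_freqs)
```

The training frequencies from the reference half would then replace the literature-derived frequencies when assigning individuals in the study half.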
Greg Dalton-Kay and Ann Marie Lacobelle provided excellent technical assistance. This work was supported in part by funds from NIH: MH14276 (Biological Sciences Training Program support to BZY), the U.S. Department of Veterans Affairs (the VA Medical Research Program [Merit Review to JG], and the VA CT REAP (Research Enhancement Award Program)), NIMH grant K02-MH01387, NIDA grants DA12690, DA12849, and DA12468, NIAAA grants AA11330, AA12870, AA13736, AA03510, and RR06192 (University of Connecticut General Clinical Research Center), and NIGMS grant GM59507.
- Yang BZ, Zhao H, Kranzler H, Gelernter J: Practical population group assignment with selected informative markers: characteristics and properties of Bayesian clustering via STRUCTURE. Genetic Epidemiology. 2005, 28: 302-312. 10.1002/gepi.20070.
- Devlin B, Roeder K: Genomic control for association studies. Biometrics. 1999, 55: 997-1004. 10.1111/j.0006-341X.1999.00997.x.
- Pritchard JK, Stephens M, Donnelly P: Inference of population structure using multilocus genotype data. Genetics. 2000, 155: 945-959.
- Pritchard JK, Stephens M, Rosenberg NA, Donnelly P: Association mapping in structured populations. Am J Hum Genet. 2000, 67: 170-181. 10.1086/302959.
- Reich DE, Goldstein DB: Detecting association in a case-control study while correcting for population stratification. Genetic Epidemiology. 2001, 20: 4-16. 10.1002/1098-2272(200101)20:1<4::AID-GEPI2>3.0.CO;2-T.
- Ripatti S, Pitkaniemi J, Sillanpaa MJ: Joint modeling of genetic association and population stratification using latent class models. Genetic Epidemiology. 2001, Suppl 1: S409-14.
- Satten GA, Flanders WD, Yang Q: Accounting for unmeasured population substructure in case-control studies of genetic association using a novel latent-class model. Am J Hum Genet. 2001, 68: 466-477. 10.1086/318195.
- Sillanpaa MJ, Kilpikari R, Ripatti S, Onkamo P, Uimari P: Bayesian association mapping for quantitative traits in a mixture of two populations. Genetic Epidemiology. 2001, Suppl 1: S692-9.
- Zhang S, Zhao H: Quantitative similarity-based association test using population samples. Am J Hum Genet. 2001, 69: 601-614. 10.1086/323037.
- Pfaff C, Kittles R, Shriver MD: Adjusting for population structure in admixed populations. Genetic Epidemiology. 2002, 22: 196-198. 10.1002/gepi.0126.
- Zhang S, Zhu X, Zhao H: On a semi-parametric test to detect associations between quantitative traits and candidate genes using unrelated individuals. Genetic Epidemiology. 2003, 24: 44-56. 10.1002/gepi.10196.
- Chen HS, Zhu X, Zhao H, Zhang S: Qualitative semi-parametric test to detect genetic association in case-control design under structured population. Annals of Human Genetics. 2003, 67: 250-264. 10.1046/j.1469-1809.2003.00036.x.
- Falush D, Stephens M, Pritchard JK: Inference of population structure using multilocus genotype data: Linked loci and correlated allele frequencies. Genetics. 2003, 164: 1567-1587.
- Hoggart CJ, Parra EJ, Shriver MD, Bonilla C, Kittles RA, Clayton DG, McKeigue PM: Control of confounding of genetic associations in stratified populations. Am J Hum Genet. 2003, 72: 1492-1504. 10.1086/375613.
- Patterson N, Hattangadi N, Lane B, Lohmueller KE, Hafler DA, Oksenberg JR, Hauser SL, Smith MW, O'Brien SJ, Altshuler D, Daly MJ, Reich D: Methods for high-density admixture mapping of disease genes. Am J Hum Genet. 2004, 74: 979-1000. 10.1086/420871.
- Smith MW, Patterson N, Lautenberger JA, Truelove AL, McDonald GJ, Waliszewska A, Kessing BD, Malasky MJ, Scafe C, Le E, De Jager PL, Mignault AA, Yi Z, De The G, Essex M, Sankale JL, Moore JH, Poku K, Phair JP, Goedert JJ, Vlahov D, Williams SM, Tishkoff SA, Winkler CA, De La Vega FM, Woodage T, Sninsky JJ, Hafler DA, Altshuler D, Gilbert DA, O'Brien SJ, Reich D: A high-density admixture map for disease gene discovery in African Americans. Am J Hum Genet. 2004, 74: 1001-13. 10.1086/420856.
- Tang H, Peng J, Wang P, Risch NJ: Estimation of individual admixture: Analytical and study design considerations. Genetic Epidemiology. 2005, 28: 289-301. 10.1002/gepi.20064.
- Paetkau D, Calvert W, Sterling I, Strobeck C: Microsatellite analysis of population structure in Canadian polar bears. Molecular Ecology. 1995, 4: 347-354.
- Paetkau D, Waits LP, Clarkson PL, Craighead L, Strobeck C: An empirical evaluation of genetic distance statistics using microsatellite data from bear (Ursidae) populations. Genetics. 1997, 147: 1943-1957.
- Rannala B, Mountain JL: Detecting immigration by using multilocus genotypes. Proc Natl Acad Sci USA. 1997, 94: 9197-9201. 10.1073/pnas.94.17.9197.
- Cornuet JM, Piry S, Luikart G, Estoup A, Solignac M: New methods employing multilocus genotypes to select or exclude populations as origins of individuals. Genetics. 1999, 153: 1989-2000.
- Guinand B, Topchy A, Page KS, Burnham-Curtis MK, Punch WF, Scribner KT: Comparison of likelihood and machine learning methods of individual classification. The Journal of Heredity. 2002, 93: 260-269. 10.1093/jhered/93.4.260.
- Manel S, Berthier P, Luikart G: Detecting wildlife poaching: Identifying the origin of individuals with Bayesian assignment tests and multilocus genotypes. Conservation Biology. 2002, 16: 650-659. 10.1046/j.1523-1739.2002.00576.x.
- Piry S, Alapetite A, Cornuet JM, Paetkau D, Baudouin L, Estoup A: GENECLASS2: a software for genetic assignment and first-generation migrant detection. J Hered. 2004, 95: 536-9. 10.1093/jhered/esh074.
- Rosenberg NA, Li LM, Ward R, Pritchard JK: Informativeness of genetic markers for inference of ancestry. Am J Hum Genet. 2003, 73: 1402-22. 10.1086/380416.
- Parra EJ, Marcini A, Akey J, Martinson J, Batzer MA, Cooper R, Forrester T, Allison DB, Deka R, Ferrell RE, Shriver MD: Estimating African American admixture proportions by use of population-specific alleles. Am J Hum Genet. 1998, 63: 1839-1851. 10.1086/302148.
- Applied Biosystems website. [http://home.appliedbiosystems.com/]
- Smith MW, Lautenberger JA, Shin HD, Chretien JP, Shrestha S, Gilbert DA, O'Brien SJ: Markers for mapping by admixture linkage disequilibrium in African American and Hispanic populations. Am J Hum Genet. 2001, 69: 1080-94. 10.1086/323922.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.