Comparing self-reported ethnicity to genetic background measures in the context of the Multi-Ethnic Study of Atherosclerosis (MESA)
 Jasmin Divers^{1},
 David T Redden^{2},
 Kenneth M Rice^{4},
 Laura K Vaughan^{2},
 Miguel A Padilla^{2},
 David B Allison^{2},
 David A Bluemke^{6},
 Hunter J Young^{5} and
 Donna K Arnett^{3}
DOI: 10.1186/1471-2156-12-28
© Divers et al; licensee BioMed Central Ltd. 2011
Received: 6 May 2010
Accepted: 4 March 2011
Published: 4 March 2011
Abstract
Background
Questions remain regarding the utility of self-reported ethnicity (SRE) in genetic and epidemiologic research. It is not clear whether conditioning on SRE provides adequate protection from inflated type I error rates due to population stratification and admixture. We address this question using data obtained from the Multi-Ethnic Study of Atherosclerosis (MESA), which enrolled individuals from 4 self-reported ethnic groups. We compare the agreement between SRE and genetic-based measures of ancestry (GBMA), and conduct simulation studies based on observed MESA data to evaluate the performance of each measure under various conditions.
Results
Four clusters are identified using 96 ancestry informative markers. Three of these clusters are well delineated, but 30% of the self-reported Hispanic-Americans are misclassified. We also found that MESA SRE provides type I error rates that are consistent with the nominal levels. More extensive simulations revealed that this finding is likely due to the multiethnic nature of the MESA. Finally, we describe situations where SRE may perform as well as a GBMA in controlling the effect of population stratification and admixture in association tests.
Conclusions
The performance of SRE as a control variable in genetic association tests is more nuanced than previously thought, and may have more value than it is currently credited with, especially when smaller replication studies are being considered in multiethnic samples.
Background
The use of self-reported race and ethnicity (SRE) in genetic and epidemiologic studies has been much discussed in the literature [1–8]. Some researchers have proposed banning their use in these studies altogether, arguing that race and ethnicity are poorly defined social constructs with a weak biologic and genetic basis [2, 3, 9]. Others, however, have argued that completely disregarding racial and ethnic differences in genetic and epidemiologic studies may not be appropriate, since these differences can be useful when generating and exploring new hypotheses regarding the effect of environmental and genetic risk factors and their interaction on important medical outcomes [1, 9].
Some studies have found SRE to be closely related to an individual's genetically estimated ancestry proportions [8, 10] and have suggested that SRE may provide adequate control against type I error inflation and/or loss of power due to population stratification and admixture in genetic association tests. However, it has also been shown [5, 11] that while SRE may be sufficient to predict the continent or subcontinent on which an individual's ancestors were born, genetic markers may provide a finer genetic ancestry measure capable of capturing more subtle variation within ethnic groups. In fact, most investigators currently rely on a genetic measure of an individual's ancestral background, rather than SRE, as a control variable against confounding due to population stratification and admixture in genetic association tests. This use of genetic background measures is particularly common in large endeavors such as genome-wide association studies. However, in smaller candidate gene studies, investigators have asked whether accounting for SRE alone might be sufficient to control for the confounding effect, especially when the number of markers needed to control for confounding caused by population stratification and admixture is large compared to the number of variants to be tested.
We used ancestry informative markers (AIMs) and phenotypic data on left-ventricular hypertrophy (LVH) collected in the context of the Multi-Ethnic Study of Atherosclerosis (MESA) to address two related questions: (1) What agreement is there between SRE and clusters created based on the genotyped AIMs? (2) In multiethnic genetic association studies, does SRE provide comparable type I error control to that provided by genetic ancestry background measures, such as individual ancestry proportions or genetic background scores? To address these two questions we compared three sets of measures: SRE, individual ancestry proportion (IAP) estimates obtained using STRUCTURE [7, 12], and a genetic background score (GBS), which we define below. We then tested for their genetic association with left ventricular mass and its related systolic functional counterpart, the left ventricular ejection fraction. These two phenotypes are known to vary considerably between ethnic groups. We also used these phenotypes as the basis to generate plasmodes [13, 14] and to illustrate the potential for type I errors when genetic association studies are conducted on phenotypic variables that are differentially distributed among the 4 ethnic groups represented in MESA: African-Americans (AA), Chinese-Americans (CA), European-Americans (EA) and Hispanic-Americans (HA).
LVH is a condition where the ventricular mass increases as existing cells of the LV enlarge or hypertrophy [15]. LVH is one of the most potent risk factors for cardiovascular disease, particularly ischemic heart disease and heart failure [16, 17], and its reversal has recently been shown to reduce the rate of cardiovascular events independently of the blood pressure level [18]. Risk factors associated with LVH include age, gender, hypertension, obesity, and diabetes [18, 19]. There is significant evidence of ethnic differences in the distribution of LVH. The rate of occurrence of LVH in EA is approximately 16%, compared to 33–43% in AA [20–23]. Many of the risk factors for LVH also differ across ethnic groups and may partially account for the observed ethnic differences in LVH. However, given the number of potential determinants of LVH, there are plausibly several genes acting independently or synergistically to increase risk for LVH in different populations. A substantial amount of work has been done and published on LV mass and other LVH-related phenotypes using data collected in MESA [24] and other studies.
Results
Agreement between self-reported ancestry and the genetic background scores
Table 1. Agreement between self-reported ethnicity and the 4 observed clusters (columns AA, CA, EA and HA give the assigned ethnic group)

Self-reported ethnicity    AA     CA     EA     HA     Total
African American           637    0      8      67     712
Chinese American           0      717    1      0      718
European American          1      0      630    81     712
Hispanic American          69     3      135    498    705
Total                      707    720    774    646    2847
Agreement between self-reported ancestry and the ancestry proportion estimates
Distribution of body surface adjusted (BSA) LV mass and the LV ejection fraction
Table 2. Distribution of adjusted LVH and adjusted ejection fraction by self-reported ethnicity

                      BSA-adjusted LV mass                LV ejection fraction
Statistic             AA       CA       EA       HA       AA      CA      EA      HA
N                     498      591      544      519      498     591     544     519
Mean                  79.9     73.8     75.8     80.7     68.4    72.3    68.5    68.7
Standard deviation    17.0     13.6     15.5     17.6     7.6     6.1     7.4     7.5
Minimum               37.4     42.2     40.4     34.6     40.6    45.3    22.2    28.1
Q1                    68.3     64.7     64.9     68.5     63.5    68.4    64.2    64.4
Median                77.9     72.1     73.8     78.0     69.0    72.9    68.6    69.9
Q3                    89.7     81.5     85.6     89.6     73.6    76.3    73.6    73.9
Maximum               146.6    180.4    163.6    153.1    88.1    81.6    86.6    84.4
Association between the AIMs and body surface adjusted LV mass and LV ejection fraction
Table 3. Type I error associated with the test for association between LV mass and the 96 AIMs

Control variable                  Average type I error    Standard error    Observed minimum type I error    Observed maximum type I error
Individual admixture estimates    0.033                   0.006             0.0105                           0.042
Principal components              0.048                   0.0074            0.021                            0.052
Self-reported ethnicity           0.037                   0.006             0.021                            0.052
Ignoring confounding              0.320                   0.022             0.221                            0.357
Table 4. Type I error associated with the test for association between LV ejection fraction and the 96 AIMs

Control variable                  Average type I error    Standard error    Observed minimum type I error    Observed maximum type I error
Individual admixture estimates    0.038                   0.0058            0.021                            0.042
Principal components              0.048                   0.007             0.021                            0.052
Self-reported ethnicity           0.042                   0.004             0.031                            0.053
Ignoring confounding              0.705                   0.009             0.694                            0.737
As mentioned above, the second set of simulation studies was designed to better understand when SRE might perform well as a control variable in genetic association tests.
Table 5. Observed type I error rates when controlling for individual ancestry estimate, true ethnicity and self-reported ethnicity, assuming various misclassification error rates, when the admixed population results from intermating between exactly two ancestral populations

Gap     Misclassification rate    Ethnicity    Admixture    SRE
0.05    0.05                      0.067        0.046        0.402
0.05    0.075                     0.087        0.041        0.46
0.05    0.1                       0.089        0.061        0.432
0.05    0.125                     0.083        0.059        0.42
0.05    0.15                      0.071        0.047        0.427
0.2     0.05                      0.061        0.054        0.517
0.2     0.075                     0.075        0.059        0.52
0.2     0.1                       0.077        0.051        0.518
0.2     0.125                     0.063        0.045        0.517
0.2     0.15                      0.067        0.055        0.492
0.35    0.05                      0.058        0.05         0.631
0.35    0.075                     0.06         0.056        0.622
0.35    0.1                       0.067        0.051        0.657
0.35    0.125                     0.06         0.052        0.62
0.35    0.15                      0.054        0.044        0.642
0.5     0.05                      0.064        0.066        0.752
0.5     0.075                     0.061        0.063        0.767
0.5     0.1                       0.043        0.038        0.787
0.5     0.125                     0.05         0.043        0.783
0.5     0.15                      0.062        0.064        0.765
Discussion
We focused on the utility of self-reported ethnicity as a control variable in genetic and epidemiologic studies. We used data collected in the MESA for LVH traits, specifically LV mass and ejection fraction, to illustrate our points. LVH is one of the strongest determinants of cardiovascular outcomes. Ethnic differences in the distribution of both LV mass and the LV ejection fraction have been reported in many studies, and we found significant evidence of an ethnicity-related effect on these phenotypes in the MESA sample.
We observed a high degree of agreement between self-reported ethnicity and two GBMAs computed using genotyped ancestry informative markers. The self-reported Hispanic-Americans were by far the most heterogeneous group represented in this dataset. This result is not surprising given the current definition of the term "Hispanic", which refers to a group of individuals who are culturally and genetically quite diverse. It is now well accepted that the ancestry distribution of self-reported Hispanics reflects, to different degrees, the genetic contribution of three ancestral populations: Africans, Europeans and Native Americans [27].
Another factor that may explain the genetic heterogeneity detected among the self-reported Hispanics is the lack of individuals from the Native American ancestral groups in the sample. The initial panel of ancestry informative markers used in the MESA study was chosen based on its capacity to distinguish between individuals of Chinese, African and European ancestry. This panel might not be adequate to detect subtle variation between individuals who self-identified as Hispanic. Following this analysis, a new panel of markers known to be particularly informative for Hispanics was typed in an effort to better understand the observed variation in the estimation of ancestry in this ethnic group. However, judging by the observed type I error rates, it appears that ancestry proportions estimated with the current marker panel work well as control variables in all the association tests that we have considered.
We could not directly evaluate the type I error on the original sample since it is not known which markers are truly under the null hypothesis in that sample. Nevertheless, self-reported ethnicity appeared to be as effective a control variable against population stratification and admixture as the genetic background scores and the estimated individual ancestry proportions, since we observed significant agreement between the sets of markers that showed significant p-values regardless of the control variable selected for the analysis. The plasmode analysis showed similar results. The type I error was kept at its appropriate level regardless of the choice of control variable, and showed significant inflation when none of them was included in the model. We should note that the p-values obtained with the two AIM-based control variables were more strongly correlated with each other than either was with the p-values obtained with SRE. The simulation study shows that, when the number of ancestral populations is equal to 2, even controlling for an individual's true ethnicity might lead to significant type I error inflation depending on the composition of the study sample. As the gap value increased, the performance of true ethnicity improved, and it came very close to the nominal level when the gap was equal to 0.5. However, controlling for the genetic-based measure of ancestry led to the correct type I error rates regardless of the value of the gap. It is rarely the case that study participants report their ethnicity without error. Errors in self-reported ethnicity may occur for various reasons: some people may not be fully aware of their true ethnicity, while others may identify with one ethnic group despite their admixed background. Therefore, the use of SRE as a control variable when K = 2 is discouraged.
Our analyses simply suggest that there may be cases where SRE might provide adequate type I error control in the presence of population stratification. This MESA sample, given its composition, seems to be one of these cases. SRE would not necessarily perform as well in other cases. Investigators, in general, will be best served by allocating part of the resources available for their study to genotyping an appropriate set of ancestry informative markers. However, SRE seems to be a valuable alternative in multiethnic samples when the misclassification rate is likely to be small. This observation is particularly true for smaller studies, such as candidate gene and other replication studies, where a relatively small number of markers are being considered. It is also not unexpected; for example, Wacholder et al. [28] used the law of total probability and stratification to show that the bias due to the confounding effect of admixture decreased with the number of ancestral populations that intermated to produce the admixed population under consideration. They showed using simulated data that the relative bias when K = 3 was between 0.95 and 1.05. We have treated SRE and genetic ancestry as confounders in association tests. We should note that they could also act as effect modifiers depending on the context. However, this distinction cannot be made with a simple test statistic. More information about the causal pathway is needed.
Methods
The MESA study was designed to investigate the determinants and progression of subclinical cardiovascular diseases in 4 ethnic groups enrolled from six geographic regions in the United States [29]. Institutional Review Boards at each of the six MESA centers where participants were seen for clinical exams reviewed and approved the conduct of the MESA study, including genetics research. The sample of 6,814 individuals was drawn such that it contains roughly equal proportions of men and women, all of whom were free of clinically recognized cardiovascular disease at enrollment. Race was self-reported: about 23% of the subjects enrolled in the study self-identified as AA, 11% as CA, 38% as EA and 28% as HA. An intensive evaluation was conducted at baseline, where information regarding height, weight, waist circumference, smoking history, alcohol intake, education level, physical activity, medication, hypertension, heart rate, diabetes, and cholesterol level was collected, among other variables. Data on LV mass and LV ejection fraction were obtained via magnetic resonance imaging (MRI) from MESA participants who consented to the cardiac MRI scan. The LV ejection fraction was calculated by dividing the individual stroke volume by the end-diastolic volume. LV mass was adjusted for body surface area calculated using the formula provided by Wang et al. [21, 30]. As will be seen in the results section, these two measures are differently distributed in the four ethnic groups. We will use these ethnic differences in our simulations to create confounding situations that we will seek to control for with three measures of individual ancestry that we describe below.
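The ejection fraction calculation described above can be sketched as follows. This is a minimal illustration, assuming the usual definition of stroke volume as end-diastolic volume minus end-systolic volume and a result expressed as a percentage; the function name and the input volumes are hypothetical and are not MESA data.

```python
def ejection_fraction(end_diastolic_volume_ml, end_systolic_volume_ml):
    """Return the LV ejection fraction as a percentage.

    Assumes stroke volume = end-diastolic volume - end-systolic volume.
    """
    if end_diastolic_volume_ml <= 0:
        raise ValueError("end-diastolic volume must be positive")
    stroke_volume_ml = end_diastolic_volume_ml - end_systolic_volume_ml
    return 100.0 * stroke_volume_ml / end_diastolic_volume_ml


# Hypothetical volumes chosen only to exercise the function.
example_ef = ejection_fraction(120.0, 40.0)
```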
As part of a MESA genotyping project, ninety-six AIMs were initially genotyped on more than 2,848 participants. These AIMs were selected from an Illumina proprietary SNP database to maximize the difference in allele frequencies between the following pairs of ancestral populations: African vs. Chinese, African vs. European and Chinese vs. European.
Using a plasmode generation approach, we developed a resampling procedure that generates datasets in which the null hypothesis of no association holds between each marker and the phenotypes of interest. A plasmode is similar to a simulated dataset; however, a plasmode dataset uses genotypic and phenotypic information observed in a study to construct pseudo-phenotypes so that the 'truth' of the data generating process is known [13, 14]. In this case, the relationship among the covariates remained intact, while adding the permuted residual to the outcome variable ensured that all tests were conducted under the null hypothesis of no association between the genetic marker and the phenotype of interest. Therefore, we are sure that any observed significant association constitutes a type I error. Finally, we ran a series of simulation studies whose objectives were to better understand our findings and to determine whether they could be further generalized. These simulations are described in the simulation section.
In the remainder of this section, we discuss the clustering method that was used to group MESA participants based on their observed genotypes, present the statistical approach retained for testing for association between each AIM and the 2 phenotypes of interest, and describe the simulation procedures in detail.
Classification scheme
We used the 'cluster' and 'tree' procedures in the SAS software (version 9.1) to create 4 clusters based on the principal components computed from the 96 AIMs. We applied Ward's minimum variance method and the K-means clustering algorithm to identify clusters of individuals with similar ancestry proportions. To assess the agreement between self-reported ancestry and these clusters, we used Cohen's weighted kappa [31]. Bhapkar's test [32] was applied to test for marginal homogeneity, and log-linear models were used to test for quasi-symmetry of the data shown in Table 1. This was done to assess the degree, direction and cause (chance or real) of agreement between SRE and the 4 clusters created from the GBS.
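As an illustration of the agreement calculation, the sketch below computes Cohen's kappa and the per-group agreement from the counts reported in Table 1. It is a simplified stand-in: it uses the unweighted kappa, because the weights used for the weighted kappa are not given in the text, and the function name is ours.

```python
# Counts from Table 1: rows are self-reported ethnicity, columns are the
# assigned clusters, both in the order AA, CA, EA, HA.
table1 = [
    [637, 0, 8, 67],
    [0, 717, 1, 0],
    [1, 0, 630, 81],
    [69, 3, 135, 498],
]


def cohen_kappa(counts):
    """Unweighted Cohen's kappa for a square agreement table."""
    n = sum(sum(row) for row in counts)
    row_totals = [sum(row) for row in counts]
    col_totals = [sum(col) for col in zip(*counts)]
    observed = sum(counts[i][i] for i in range(len(counts))) / n
    # Agreement expected by chance from the marginal totals.
    expected = sum(r * c for r, c in zip(row_totals, col_totals)) / n**2
    return (observed - expected) / (1.0 - expected)


kappa = cohen_kappa(table1)
# Share of each self-reported group assigned to the matching cluster.
per_group_agreement = [table1[i][i] / sum(table1[i]) for i in range(4)]
```

The last entry of `per_group_agreement` corresponds to the self-reported Hispanic-Americans, the group with the lowest agreement in Table 1.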
Genetic association tests
We examined all the available AIMs for association with both body surface-adjusted (BSA) LV mass and LV ejection fraction using 3 different control variables: SRE, ancestry proportion estimates computed using STRUCTURE, and genetic background measures computed using principal component analysis. That is, we implemented multivariate linear regression, including (1) SRE as a categorical variable, (2) the proportions of African, Chinese and European ancestry estimated using STRUCTURE, or (3) the first 4 principal components computed using the AIMs data as covariates. We also included gender, income, education level, smoking history, alcohol use, systolic blood pressure, diastolic blood pressure, body mass index, and waist circumference as covariates in each model. Both analyses are conducted using generalized linear models. They also both fall under the structured association test (SAT) framework, which consists of testing for genetic association while controlling for a genetic-based measure of ancestral background [33]. The SAT approach is commonly used to correct for population stratification and admixture in genetic association tests. Recently published papers have shown that SAT approaches may fail to provide nominal type I errors for various reasons, including measurement error in the estimation of the confounding effect [34, 35] and cases where the estimated confounder captures an insufficient fraction of phenotypic variation [36]. However, we restricted our attention to older and better-known SAT approaches [33, 37–40]. As a preliminary analysis, we tested the null hypothesis of no association between the selected AIM and the phenotype of interest, adjusting for the control variables mentioned above. Unlike in simulation-based work (see next section), the true association of the AIMs with LV mass is not known, and type I error rates for these tests could not be directly evaluated.
However, marked differences between the statistical significance observed with each control variable may provide valuable insight regarding their performance. We devised a resampling procedure that guarantees that the methods are being compared under the null hypothesis. Details about this simulation procedure can be found in the next section. Finally, we ran two completely in silico analyses, which are also described in the next section, to mimic candidate gene association tests controlling for both SRE and a genetic-based measure of ancestral background in order to gain a better understanding of the results observed in the initial test and the plasmode analysis.
Simulation studies
The first set of simulations is a resampling procedure that guarantees that the control variables are being compared under the null hypothesis of no association between each marker and the phenotype of interest. The second set of simulations seeks to expand upon the results observed under the first set of simulations by identifying the conditions under which accounting for ethnicity (either true or self-reported) is likely to provide appropriate type I error control and evaluating the effect of misclassification errors of SRE in genetic association studies.
Resampling procedure
Let N denote the number of subjects in the dataset. Let α denote the nominal type I error rate. Let Y denote the phenotype. Let K denote the number of AIMs. Let X denote the genotype at the marker being considered. Our resampling procedure is as follows:
Simulation 1
For iter = 1 to iterations {
  For i = 1 to K { (for each marker)
    Regress Y on X plus all relevant covariates except the confounding variables. When the looping variable i takes the value s (1 ≤ s ≤ K), X corresponds to the s^{th} marker. This regression is fitted using the generalized linear model.
    For j = 1 to N { (for each observation)
      Compute $\widehat{e}_{j} = Y_{j} - \widehat{Y}_{j}$, where $\widehat{Y}_{j}$ is the predicted value for the j^{th} person.
    }
    Sort the $\widehat{e}_{j}$'s.
    Compute a new pseudo-phenotype $\tilde{Y}_{j} = \widehat{Y}_{j} + \widehat{e}_{[j]}$, where [j] denotes the new order after sorting the residuals.
    For m = 1 to K { (for each marker)
      If (m ≠ i) {
        Regress $\tilde{Y}$ on X (now the m^{th} marker) plus the same relevant covariates and each confounding variable.
        Compute the Wald test p-value for the regression coefficient of X. Denote the resulting p-value as p_{m}.
      }
    }
    Compute $T_{i}^{iter} \equiv \frac{1}{K-1} \sum_{m \ne i}^{K} I(p_{m} < \alpha)$.
  }
  Compute $T_{iter} \equiv \frac{1}{K} \sum_{i=1}^{K} T_{i}^{iter}$.
}
T_{iter} is then an estimate of the type I error rate for the current iteration. Results for 10,000 iterations are summarized in Table 3 for LV mass and Table 4 for LV ejection fraction.
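The core of the resampling step can be sketched in Python as follows. This is a simplified illustration, not the authors' SAS implementation: it regresses the phenotype on a single marker with no additional covariates, it reassigns the residuals by a random shuffle (an assumption based on the description of a "permuted residual" in the Methods), and the toy genotypes and phenotypes are generated here rather than taken from MESA. All function names are hypothetical.

```python
import random
from statistics import mean


def fit_simple_regression(x, y):
    """Least-squares intercept and slope for y regressed on one predictor."""
    x_bar, y_bar = mean(x), mean(y)
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope


def make_pseudophenotype(x, y, rng):
    """Return (pseudo-phenotype, fitted values, residuals).

    The pseudo-phenotype is the fitted value plus a randomly reassigned
    residual, so the marker-specific signal left in the residuals is broken.
    """
    intercept, slope = fit_simple_regression(x, y)
    fitted = [intercept + slope * xi for xi in x]
    residuals = [yi - fi for yi, fi in zip(y, fitted)]
    shuffled = residuals[:]
    rng.shuffle(shuffled)
    pseudo = [fi + ei for fi, ei in zip(fitted, shuffled)]
    return pseudo, fitted, residuals


def rejection_fraction(p_values, alpha=0.05):
    """T_i: share of the other K - 1 markers with p-value below alpha."""
    return sum(p < alpha for p in p_values) / len(p_values)


rng = random.Random(0)
genotype = [rng.choice([0, 1, 2]) for _ in range(200)]  # toy marker
phenotype = [75.0 + 2.0 * g + rng.gauss(0.0, 15.0) for g in genotype]
pseudo, fitted, residuals = make_pseudophenotype(genotype, phenotype, rng)
```

In the full procedure, each remaining marker would then be tested against `pseudo` with the chosen control variable in the model, and the resulting p-values would be passed to `rejection_fraction`.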
Description of the second set of simulation procedures
As can be seen in Tables 3 and 4, SRE appears to perform as well as the other GBMAs. The second set of simulations is designed to further elucidate why SRE, despite its much-publicized shortcomings of not being able to adequately control for confounding due to population stratification and admixture, seems to provide the correct type I error rate in this dataset.
First, we considered the case where the confounder is univariate. This could arise when the study sample comprises admixed individuals born from intermating between exactly two ancestral populations. We evaluated how the performance of SRE as a control variable depends on the distribution of ancestry proportions in the sample. Specifically, we wanted to see how the continuity and the size of the gap in a discontinuous ancestry proportion distribution would affect the performance of SRE. Note that a continuous distribution would have a gap of zero. The gap for a discontinuous distribution is defined as the range of the discontinuity region. For example, admixture is an ongoing process. A sample of admixed individuals can comprise individuals who are at different stages of the admixture process. Therefore, it is possible to recruit a sample that can be divided into 2 subsets of individuals: one with a very high level of European ancestry and the other with a very low level of European ancestry. If the minimum European ancestry in the first subset is 0.8 and the maximum European ancestry in the second subset is 0.2, then the gap value would simply be (0.8 − 0.2) = 0.6. We also looked at the effect of various misclassification rates in these association tests.
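To make the gap definition concrete, the sketch below draws ancestry proportions from an equal mixture of U(0, f) and U(1 − f, 1) with f = (1 − gap)/2, which is the form specified in the Simulation 2 steps, and then measures the empty region between the two subsets. The function name, the seed and the sample size are illustrative choices.

```python
import random


def draw_ancestry(gap, rng):
    """Draw one ancestry proportion from 0.5*U(0, f) + 0.5*U(1 - f, 1)."""
    f = 0.5 * (1.0 - gap)
    if rng.random() < 0.5:
        return rng.uniform(0.0, f)
    return rng.uniform(1.0 - f, 1.0)


rng = random.Random(1)
gap = 0.6  # matches the worked example: subsets below 0.2 and above 0.8
sample = [draw_ancestry(gap, rng) for _ in range(2000)]
low = [x for x in sample if x < 0.5]
high = [x for x in sample if x >= 0.5]
# Width of the region in which no ancestry proportion was drawn.
observed_gap = min(high) - max(low)
```

With a finite sample the observed gap is slightly wider than the nominal value, since the extreme draws rarely fall exactly on the boundaries 0.2 and 0.8.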
Simulation 2 (univariate ancestry proportion distribution (K = 2))
Let N = 2000 be the total number of individuals
For gap = 0.05 to 0.5 by 0.05{
Draw individual ancestry proportion x from $\frac{1}{2}U(0,f)+\frac{1}{2}U(1-f,1)$, where $f=\frac{1}{2}(1-\mathrm{gap})$
Set true ethnicity to 1 if x ∈[0,f] and 2 otherwise.
Let m represent the misclassification rate.
For m = 0.025 to 0.15 by 0.025{
Draw the random variable s from a Bernoulli (m).
If (s = 0) then SRE = true ethnicity
else change the true ethnicity so that a misclassification occurs.
}
Compute $p_{s}^{adx} = x\,p_{s1} + (\mathbf{1}_{N} - x)\,p_{s2}$, where p_{s1} and p_{s2} represent the allele frequency of the s^{th} marker in each ancestral population respectively, $\mathbf{1}_{N}$ is a vector of ones, and $p_{s}^{adx}$ denotes the vector of allele frequencies in the admixed population for the s^{th} marker. We consider 2 markers G_{1} and G_{2}. Draw G_{1} from Binomial(2, $p_{1}^{adx}$) and G_{2} from Binomial(2, $p_{2}^{adx}$). That is, we use the simulated ancestry proportion and the allele frequencies in the 2 ancestral populations to generate the genotypic probability for each individual under Hardy-Weinberg equilibrium. Note that G_{1} and G_{2} are independent conditional on the individual ancestry. We will use G_{1} to generate the trait and G_{2} to test for genetic association.
Compute y = α_{ 0 } + α_{1}G_{1}+ e where e ~ N (0, σ^{2}).
## Note that we set α_{ 1 }and σ such that the effect size is equal to 0.5 in Figure 3a and 1 in Figure 3b.
## Note that conditional on the individual ancestry, the random variable Y is also independent of G_{ 2 } .
Fit the following three regression models:
 1. y = β_{0} + β_{1} x + β_{2} G_{2} + e
 2. y = β_{0} + β_{1} ethn + β_{2} G_{2} + e
 3. y = β_{0} + β_{1} sre + β_{2} G_{2} + e
Test whether β_{2} is significantly different from 0 at the 0.05 level in each case.
## A statistically significant association observed between Y and G_{ 2 } will constitute a type I error.
}
Repeat the experiment 10,000 times for each configuration of the gap value and the misclassification rate and count the proportion of times that the parameter β_{2} is statistically significant for each control variable. These proportions for the ancestry proportion and SRE regressions are shown in Figure 3. We also show the effect of various misclassification rates for 3 gap values (0.05, 0.3 and 0.5) in Table 5.
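A compact Python sketch of Simulation 2 is given below. Several choices are assumptions made for illustration and are not taken from the paper: the ancestral allele frequencies (`p1`, `p2`), the trait parameters (`alpha1`, `sigma`), the reduced sample size and number of replicates (to keep the run short), and the use of a normal approximation for the Wald test. The regression is solved with plain normal equations so that the example needs only the standard library.

```python
import random
from statistics import NormalDist


def invert(matrix):
    """Gauss-Jordan inverse of a small, well-conditioned square matrix."""
    k = len(matrix)
    aug = [row[:] + [float(i == j) for j in range(k)] for i, row in enumerate(matrix)]
    for c in range(k):
        pivot = max(range(c, k), key=lambda r: abs(aug[r][c]))
        aug[c], aug[pivot] = aug[pivot], aug[c]
        scale = aug[c][c]
        aug[c] = [v / scale for v in aug[c]]
        for r in range(k):
            if r != c:
                factor = aug[r][c]
                aug[r] = [vr - factor * vc for vr, vc in zip(aug[r], aug[c])]
    return [row[k:] for row in aug]


def wald_pvalue_last(predictors, y):
    """Two-sided Wald p-value (normal approximation) for the last predictor."""
    n = len(y)
    X = [[1.0] + list(row) for row in predictors]
    k = len(X[0])
    xtx = [[sum(r[i] * r[j] for r in X) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * yi for r, yi in zip(X, y)) for i in range(k)]
    inv = invert(xtx)
    beta = [sum(inv[i][j] * xty[j] for j in range(k)) for i in range(k)]
    rss = sum((yi - sum(b * v for b, v in zip(beta, r))) ** 2 for r, yi in zip(X, y))
    se = (rss / (n - k) * inv[-1][-1]) ** 0.5
    return 2.0 * (1.0 - NormalDist().cdf(abs(beta[-1] / se)))


def type1_error(gap, misclass, control, reps, n, rng,
                p1=(0.1, 0.9), p2=(0.2, 0.8), alpha1=1.0, sigma=1.0):
    """Share of replicates in which the null marker G2 has p < 0.05."""
    f = 0.5 * (1.0 - gap)
    rejections = 0
    for _ in range(reps):
        rows, y = [], []
        for _ in range(n):
            x = rng.uniform(0.0, f) if rng.random() < 0.5 else rng.uniform(1.0 - f, 1.0)
            ethn = 1 if x <= f else 2
            # With probability `misclass` the reported ethnicity is flipped.
            sre = ethn if rng.random() >= misclass else 3 - ethn
            q1 = x * p1[0] + (1.0 - x) * p1[1]
            q2 = x * p2[0] + (1.0 - x) * p2[1]
            g1 = (rng.random() < q1) + (rng.random() < q1)  # Binomial(2, q1)
            g2 = (rng.random() < q2) + (rng.random() < q2)  # Binomial(2, q2)
            controls = {"admixture": x, "ethnicity": ethn, "sre": sre}
            rows.append([controls[control], g2] if control in controls else [g2])
            y.append(alpha1 * g1 + rng.gauss(0.0, sigma))  # trait depends on G1 only
        rejections += wald_pvalue_last(rows, y) < 0.05
    return rejections / reps


rng = random.Random(2011)
results = {c: type1_error(0.2, 0.1, c, reps=200, n=500, rng=rng)
           for c in ("none", "admixture", "sre")}
```

Here `results["none"]` corresponds to ignoring the confounder altogether, `results["admixture"]` to model 1, and `results["sre"]` to model 3; passing `"ethnicity"` as the control gives model 2.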
Simulation 3 (multivariate ancestry proportion distribution (K > 2))
We wanted to determine how SRE, when used as a control variable against population stratification, would perform in a multivariate setting, that is, when the number of ancestral populations is greater than 2. This simulation procedure resembles the previous one, except that the individual ancestry proportions are drawn from a Dirichlet distribution. We used a different set of parameters for each ethnic group in order to create the conditions needed for confounding to occur. The parameters used to generate the Dirichlet distribution can be represented by a 4 × 4 matrix, where the rows represent the expected individual ancestry proportions in each ethnic group and, for a fixed row, the columns represent the expected individual ancestry proportion from each ancestral population.
1) Let N = 2000 individuals divided equally into 4 subsets such that n_{ k } = 500 for k = 1,2,3,4 is the sample size in each ethnic group.
2) Let the SRE for an individual in the k^{ th } subgroup be k.
3) We considered 4 cases; the parameters used for the Dirichlet distribution in each case are chosen as follows:
(a) Proportions that are very close to ancestry proportion estimates obtained with STRUCTURE in the MESA sample;
(b) Values near a 4 × 4 identity matrix with diagonal elements equal to 0.97 and off diagonal elements equal to 0.01;
(c) Values very close to what would be observed in equally admixed individuals, that is, all proportions are set to 0.25;
(d) Different ancestry proportions where the contribution of one specific ancestral population is clearly greater in each admixed population. That is, the diagonal elements are set to 0.55 and the off-diagonal elements to 0.15, so that the row and column sums are equal to 1.
4) Let P _{ 1 }= (0.1, 0.5, 0.3, 0.9) and P _{ 2 }= (0.05, 0.25, 0.50, 0.75) be the frequency of the reference allele of the markers G _{ 1 }and G _{ 2 }respectively in each ancestral population. These frequencies were chosen arbitrarily with the only constraint being that they vary greatly between the 4 ancestral populations. Confounding will occur if the distribution of the trait is also different in the 4 ancestral populations.
5) The allele frequencies of G_{1} and G_{2} in the admixed population ($p_{1}^{adx}$, $p_{2}^{adx}$) are computed as the weighted averages of the allele frequencies given in (4), where the weights are the simulated ancestry proportions drawn from the Dirichlet distribution.
6) Draw G _{ 1 }and G _{ 2 }according to these averages.
7) The outcome variable Y is simulated from a normal distribution whose parameters are based on the distribution of the LV ejection fraction observed in each ethnic group in the MESA study. That is, we used the mean in each group as the intercept in the model described in the next step. We then computed the pooled variance and set $\beta_{1}=\beta_{2}=\beta_{3}=\frac{\sigma}{2}$ such that each component has an effect size of 0.5, where σ^{2} represents the pooled variance.
8) The remaining steps are similar to those taken in simulation 2. That is, we fitted the following linear regressions:
1. y = β _{0} + β _{1} x _{1} + β _{2} x _{2} + β _{3} x _{3} + β _{4} G _{2} + e. The variables x_{ 1 } , x_{ 2 } and x_{ 3 } are the first 3 components of the individual ancestry proportion vector drawn from the Dirichlet distribution.
2. y = α _{0} + α _{1} SRE + α _{2} G _{2} + e. The variable SRE has 4 levels, which are defined in step 2. Therefore, α_{ 1 } is a vector with 4 components.
We then tested whether β_{4} and α_{2} are significantly different from 0 at the 0.05 level in each model, and repeated each experiment 10,000 times. The results of this simulation procedure can be seen in Figure 4. As can be seen in this figure, when the sample contained admixed individuals with more than 2 ancestral populations, SRE performed rather well as a control variable. This suggests that it is not the variations in the ancestry proportions themselves that cause the type I error inflation.
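Steps 1 to 6 of Simulation 3 can be sketched as below for case (d). The expected ancestry matrix and the allele frequencies P_1 and P_2 come from the text; the Dirichlet concentration of 20 is a hypothetical choice, since the text specifies only the expected proportions and not how tightly individuals cluster around them. Dirichlet draws are obtained by normalising gamma variates, and all function names are ours.

```python
import random


def draw_dirichlet(expected, concentration, rng):
    """Dirichlet draw with mean `expected`, via normalised gamma variates."""
    gammas = [rng.gammavariate(concentration * e, 1.0) for e in expected]
    total = sum(gammas)
    return [g / total for g in gammas]


def admixed_frequency(ancestry, ancestral_freqs):
    """Allele frequency as the ancestry-weighted average over populations."""
    return sum(w * p for w, p in zip(ancestry, ancestral_freqs))


def draw_genotype(freq, rng):
    """Binomial(2, freq) genotype under Hardy-Weinberg equilibrium."""
    return (rng.random() < freq) + (rng.random() < freq)


P1 = (0.1, 0.5, 0.3, 0.9)       # reference allele frequencies for G1 (step 4)
P2 = (0.05, 0.25, 0.50, 0.75)   # reference allele frequencies for G2 (step 4)
# Case (d): 0.55 on the diagonal, 0.15 elsewhere.
expected = [[0.55 if i == j else 0.15 for j in range(4)] for i in range(4)]
concentration = 20.0            # assumed; not specified in the text

rng = random.Random(3)
sample = []
for k in range(4):              # SRE group k + 1, 500 individuals each
    for _ in range(500):
        ancestry = draw_dirichlet(expected[k], concentration, rng)
        g1 = draw_genotype(admixed_frequency(ancestry, P1), rng)
        g2 = draw_genotype(admixed_frequency(ancestry, P2), rng)
        sample.append((k + 1, ancestry, g1, g2))
```

Each tuple in `sample` holds the SRE label, the ancestry vector and the two genotypes; the trait would then be generated from the group mean and the ancestry components as in step 7.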
To better understand when the use of SRE as a control variable may fail, we devised a situation where it may be unclear which ethnicity to assign to individuals whose ancestry proportions take specific values. To facilitate the graphical representation of each scenario, we focused on the case where there are exactly 3 ancestral populations. In this case, an individual's ancestry proportions can be represented by a vector with 3 components, $adx_1$, $adx_2$ and $adx_3$, such that $\sum_{i=1}^{3} adx_i = 1$. Therefore, without loss of generality, we can restrict our attention to a bivariate distribution by focusing only on the first two components of this vector. As can be seen in Figure 5, the choice of gap values on the x-axis and y-axis, together with the constraint that $adx_1 + adx_2 \le 1$, defines 3 or 4 specific regions. Figure 5a shows 4 valid regions, and if one decides to assign ethnicity according to the maximum value of the vector $(adx_1, adx_2, adx_3)$, it is not clear what the correct ethnicity assignment should be for the individuals whose ancestry proportions fall in region IV. There is no such ambiguity in Figure 5b.
The simulation steps are similar to those described in simulation 2, except that there are now two gap values: one for $adx_1$ and one for $adx_2$. The ancestry proportions are each drawn independently from a uniform distribution and then rescaled so that they add up to 1. We then excluded the ancestry proportions that fell in the regions defined by the gaps. In Figure 5a, we excluded values of $adx_1$ and $adx_2$ that fall between 0.1 and 0.3; in Figure 5b, the range of excluded values went from 0.1 to 0.5. As in all previous cases, we considered 2 markers, $G_1$ and $G_2$. We used $G_1$ to simulate the trait and $G_2$ to test for association with the simulated trait, so any significant association is counted as a type I error. Because $k$, the number of ancestral populations, is now 3 instead of 4, we changed the allele frequency vectors: we used (0.4, 0.2, 0.6) as the allele frequencies in the 3 ancestral populations for $G_1$ and (0.2, 0.4, 0.6) for $G_2$. We also considered various effect sizes to evaluate the contribution of admixture to the confounding pathway. The remaining simulation steps are again similar to those described in simulation 3. The results of this simulation analysis are shown in Figure 6, where the vector $(a, b)$ represents the coefficients associated with the variables $adx_1$ and $adx_2$ in the model. The error term in all models is drawn from a normal distribution with mean 0 and variance 1, so that the effect sizes associated with $adx_1$ and $adx_2$ are equal to $a$ and $b$, respectively.
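The gap-based sampling of ancestry proportions described above can be sketched as a simple rejection sampler. This is an illustrative sketch under stated assumptions, not the authors' implementation: the function names are hypothetical, the gaps are treated as open intervals, and an individual is assumed to be excluded when either of the first two components falls in its gap. Ethnicity is assigned by the largest component, as discussed for Figure 5.

```python
import random


def draw_ancestry(rng):
    """Draw 3 independent uniforms and rescale them so they sum to 1."""
    u = [rng.random() for _ in range(3)]
    total = sum(u)
    return [x / total for x in u]


def in_gap(value, gap):
    """True when value lies strictly inside the excluded interval (assumed open)."""
    low, high = gap
    return low < value < high


def sample_with_gaps(n, gap1, gap2, rng):
    """Rejection sampling: keep draws whose adx1 and adx2 both avoid their gaps."""
    kept = []
    while len(kept) < n:
        adx = draw_ancestry(rng)
        if in_gap(adx[0], gap1) or in_gap(adx[1], gap2):
            continue
        kept.append(adx)
    return kept


def assign_ethnicity(adx):
    """Label an individual by the largest ancestry component (1-based index)."""
    return max(range(3), key=lambda i: adx[i]) + 1


rng = random.Random(5)
sample_a = sample_with_gaps(1000, (0.1, 0.3), (0.1, 0.3), rng)  # gaps of Figure 5a
sample_b = sample_with_gaps(1000, (0.1, 0.5), (0.1, 0.5), rng)  # gaps of Figure 5b
labels_a = [assign_ethnicity(adx) for adx in sample_a]
```

With the wider gap of Figure 5b, both $adx_1$ and $adx_2$ cannot lie above the gap at the same time, since their sum would exceed 1. That is why the ambiguous region IV only appears with the narrower gap of Figure 5a.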
Simulation 4 (effect of misclassification error on SRE when K = 4)
As can be seen in Figure 7, when the number of ancestral populations is greater than 2, it is misclassification errors, as opposed to the actual distribution of the ancestry proportions, that dictate the performance of SRE as a control variable. We ran a final simulation study to evaluate the effect of misclassification errors on a multi-ethnic sample like the MESA. The simulation steps are very similar to those taken in simulation 3.
1) Let $n_k = 500$ for $k = 1, 2, 3, 4$ be the sample size in each ethnic group.
2) Let the true ethnicity of any individual in the $k^{th}$ subgroup be $k$.
3) Draw $x_k$ from $Dir(\alpha_k)$, where $\alpha_k$ is based on the ancestry proportions observed in the MESA study. These proportions are displayed in Figure 2: (0.09, 0.02, 0.84, 0.05) for the African-Americans, (0.89, 0.02, 0.02, 0.07) for the European-Americans, (0.24, 0.05, 0.14, 0.57) for the Hispanic-Americans and (0.01, 0.97, 0.01, 0.01) for the Chinese-Americans.
To evaluate the effect of misclassification errors on the performance of SRE as a control variable, the true ethnicity of a fraction $m$ of individuals in each subgroup is changed to create misclassification in the SRE variable. Each of these individuals is reassigned to one of the 3 remaining subgroups with equal probability. This is done for each subgroup separately.
4) We let $m$ vary from 0.05 to 0.15 in increments of 0.025.
The remaining simulation steps are similar to those taken in simulation 3 and are not repeated here.
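The misclassification step can be sketched as follows. This is a minimal sketch under assumptions: the function name `misclassify` is hypothetical, and the number of relabeled individuals per subgroup is taken to be the rounded value of $m \times n_k$, a detail the text does not specify.

```python
import random


def misclassify(true_labels, m, rng, groups=(1, 2, 3, 4)):
    """Return SRE labels in which a fraction m of each subgroup is relabeled.

    Within each subgroup, round(m * n_k) individuals are chosen at random and
    each is moved to one of the other subgroups with equal probability.
    """
    sre = list(true_labels)
    for k in groups:
        members = [i for i, label in enumerate(true_labels) if label == k]
        n_flip = round(m * len(members))
        others = [g for g in groups if g != k]
        for i in rng.sample(members, n_flip):
            sre[i] = rng.choice(others)
    return sre


rng = random.Random(4)
true_labels = [k for k in (1, 2, 3, 4) for _ in range(500)]  # n_k = 500 per subgroup
for m in (0.05, 0.075, 0.10, 0.125, 0.15):
    sre = misclassify(true_labels, m, rng)
    changed = sum(t != s for t, s in zip(true_labels, sre))
    print(f"m = {m:.3f}: {changed} of {len(true_labels)} labels changed")
```

The resulting `sre` vector would then replace the true ethnicity as the control variable in the SRE regression model, while the ancestry proportions themselves are left unchanged.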
List of abbreviations
MESA: Multi-Ethnic Study of Atherosclerosis
SRE: self-reported ethnicity
GBMA: genetic based measures of ancestry
AIMs: ancestry informative markers
IAP: individual ancestry proportion
GBS: genetic background score
AA: African-Americans
CA: Chinese-Americans
EA: European-Americans
HA: Hispanic-Americans
LVH: left ventricular hypertrophy
LV: left ventricular
MRI: magnetic resonance imaging
BSA: body surface-adjusted
SAT: structured association test
Declarations
Acknowledgements
We thank the other investigators, the staff, and the participants of the MESA study for their valuable contributions. A full list of participating MESA investigators and institutions can be found at http://www.mesa-nhlbi.org. This research was supported by contracts N01HC95159 through N01HC95165 and N01HC95169 from the National Heart, Lung, and Blood Institute. This work was also supported by NIH grant numbers 1R01GM077490, 3R01DK067426 03S1 and R01DK070941.
Copyright
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.