Open Access Research Article Population Based Allele Frequencies of Disease Associated Polymorphisms in the Personalized Medicine Research Project

Background: The frequency of disease associated polymorphisms is poorly characterized in many populations, so population attributable risk remains unknown. Factors that could affect the association of an allele with disease, either positively or negatively, such as race, ethnicity, and gender, may not be possible to determine without population based allele frequencies. Here we used a panel of 51 polymorphisms previously associated with at least one disease and determined the allele frequencies within the entire Personalized Medicine Research Project population based cohort. We compared these allele frequencies to those in dbSNP and other data sources stratified by race. Differences in allele frequencies between self reported race, region of origin, and sex were determined.
Results: There were 19544 individuals who self reported a single racial category; 19027 (97.4%) self reported white Caucasian, and 11205 (57.3%) were female. Of the 11,208 (57%) individuals with an identifiable region of origin, 8337 (74.4%) were German. Forty-one polymorphisms were significantly different between self reported races at the 0.05 level. Stratification of our Caucasian population by self reported region of origin revealed 19 polymorphisms that were significantly different (p < 0.05) between individuals of different origins. Further stratification of the population by gender revealed few significant differences in allele frequencies between the genders.
Conclusions: This represents one of the largest population based allele frequency studies to date. Stratification by self reported race and region of origin revealed wide differences in allele frequencies not only by race but also by region of origin within a single racial group. We report allele frequencies for our Asian/Hmong and American Indian populations; these two minority groups are not typically selected for population allele frequency detection.
Population wide allele frequencies are important for the design and implementation of studies and for determining the relevance of a disease associated polymorphism for a given population.


Background
One of the challenges for translating disease associated polymorphisms into use is the lack of knowledge regarding the frequency of the polymorphism in the targeted population. Without this information, population attributable risk remains unknown. In addition, factors that could affect the association of the allele with disease, either positively or negatively, such as ethnicity and gender, may not be possible to determine without population based allele frequencies. There are a number of reasons that these frequencies have yet to be determined. Disease associations, whether identified through candidate gene studies or genome wide studies, are determined using case control studies that classify affected and unaffected individuals but are not necessarily representative of a population and may need correction for population stratification [1][2][3][4][5]. Another reason that many polymorphisms lack population allele frequencies is cost; without a clear necessity to genotype a large, diverse population, funding for such studies is scarce [6,7]. Finally, with the creation of large population wide biorepositories still in its infancy, there has been a lack of samples for genotyping [8][9][10][11].
In the past, geneticists have relied on a small population of well characterized reference samples to estimate allele frequency in a population [12][13][14][15]. These samples include the HapMap collection of individuals with different ethnicities that has been anonymized and immortalized by Coriell [13,14]. These samples underlie the population allele frequencies most often reported in dbSNP [15], and a subset of them is also used for population allele frequencies on the Cancer500 website [12]. While these population samples are a valuable resource, they were not collected to be representative of a population and are often made up of related individuals. Even with the advent of the 1000 genomes project [16], there will still be relatively few samples of any ethnic background from which to determine population allele frequencies with any certainty. Because these samples do not reflect a representative population distribution, the resulting allele frequencies may not be representative either [2,3,5,8,1]. Ethnicities such as Hispanic are not well represented within the dbSNP database, with often fewer than 100 individuals genotyped to determine population allele frequencies. Other backgrounds, such as Native American, may be missing from dbSNP altogether [15].
One of the most widely used population samples, the Centre d'Etude du Polymorphisme Humain (CEPH) Human Diversity Panel, has been successful for determining large trends in human diversity and population structure [17][18][19][20][21]. However, even with the Caucasian population available from the CEPH collection and other collections, there are too few individuals to determine allelic variation within an ethnic population. Recent studies have shown that even within the Central European population the country of origin can have a great impact on allele frequencies, with clear variation between countries and even within a country [22][23][24] (Feder, 2008; Hannelius, 2008). This creates an even more challenging problem when attempting to use these generalized allele frequencies to estimate the burden that disease associated polymorphisms place on a given population [20,21]. Current genome wide association studies (GWAS) control for this by stratifying the individuals by ethnicity and country of origin, but because of their case-control nature these studies are less informative regarding allele frequencies for an entire population group [2,3,5,8,1].
One method of determining population wide allele frequencies is to genotype entire biorepositories regardless of case or control status. There are several repositories that are population based, including NHANES III [25], the UK biobank [11], the Marshfield Clinic Research Foundation Personalized Medicine Research Project [26], the Kaiser Permanente Research Program on Genes Environment and Health [8], and the Vanderbilt Databank Resource [10]. These biorepositories offer a broad cross sectional population for genotyping and will be a valuable resource as more genotypes become publicly available from all of them.
Here we used a panel of polymorphisms previously associated with disease and used as quality control markers in the PMRP population [27] to determine the allele frequencies within the entire Personalized Medicine Research Project population based cohort (Additional File 1). The allele frequencies found in our population were then compared to the frequencies found in dbSNP and other databases stratified by race; some of our allele frequencies varied by over 10% from the reported allele frequencies. Differences in allele frequencies between self reported ethnicities within our population and differences between sexes were determined as well.

Population Characteristics
Demographic characteristics for the population are presented in Table 1. There were 19544 individuals who self reported a single racial category; of these, the vast majority (19027 or 97.4%) self reported white Caucasian. Of the individuals who selected an ancestral origin, 11,208 (57%) selected either a single ancestral origin or two ancestral origins that were grouped together for our analysis (e.g., Norwegian and Swedish). Of the individuals in this analysis, the majority self reported German origin (8337 or 74.4%). Females outnumbered males in our sample, with 11205 (57.3%) individuals of confirmed female sex.

Population allele frequencies
As expected, our population allele frequencies differed significantly between self reported races. Allele frequencies stratified by race are reported in Table 2. Of the 51 polymorphisms tested, 41 were significantly different between self reported races at the 0.05 level, with 35 polymorphisms significantly different at p < 0.0001 (Table 2). Of these polymorphisms, 13 exhibited a minor allele frequency of over 50% within a racial category, although the limited size of the non-Caucasian groups in this study creates a large amount of uncertainty regarding the actual allele frequencies in these racially stratified populations (Figure 1). For alleles with greater than a 2% minor allele frequency, only two polymorphisms deviated from Hardy-Weinberg equilibrium (p < 0.01): rs4680 in the COMT gene in Black/African American individuals and rs1800588 in the LIPC gene in individuals self reported as Hispanic.
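To illustrate how an allele that is minor in the total population can exceed 50% within one racial category, the following is a minimal Python sketch. It is not the analysis code used in this study; the function names, group labels, and genotype counts are hypothetical, and allele B is assumed to denote the minor allele of the total population.

```python
def allele_frequency(n_aa, n_ab, n_bb):
    """Frequency of allele B computed from genotype counts (AA, AB, BB)."""
    n_individuals = n_aa + n_ab + n_bb
    # Each BB individual carries two B alleles, each AB individual carries one.
    return (2 * n_bb + n_ab) / (2 * n_individuals)


def groups_with_flipped_allele(counts_by_group):
    """Return the groups in which allele B is more common than allele A.

    counts_by_group maps a group label to a tuple (n_AA, n_AB, n_BB).
    Allele B is assumed to be the minor allele in the total population,
    so a frequency above 0.5 means major and minor alleles are switched.
    """
    return sorted(
        group
        for group, counts in counts_by_group.items()
        if allele_frequency(*counts) > 0.5
    )


# Hypothetical genotype counts for two illustrative groups.
example_counts = {
    "group_1": (81, 18, 1),
    "group_2": (1, 18, 81),
}
flipped = groups_with_flipped_allele(example_counts)
```

With small groups, a frequency near 50% can cross the threshold by chance, which is why the confidence intervals in Figure 1 matter when interpreting such switches.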
We compared the allele frequencies of our population with two previously published sets of population allele frequencies, from the CLUE II population [28] and the NHANES III population [29], as well as with the dbSNP [15] and Cancer500 [12] allele frequencies. There were three polymorphisms assayed in all four populations, and when compared, our allele frequencies varied little from previously published population allele frequencies (Figure 2). Our Caucasian allele frequencies were within 2% of the NHANES III [29] and CLUE II [28] allele frequencies, and our African American and Hispanic allele frequencies were within 5% of these previously reported populations. In contrast, the allele frequencies reported here showed more variability when compared to reported allele frequencies in dbSNP [15]. When we compared all of the allele frequencies with those reported on the dbSNP [15] and Cancer500 [12] websites, several polymorphisms differed by more than 10% (Additional File 2). Unfortunately, our American Indian population could not be compared, as this is not one of the populations with widely reported allele frequencies.

Allele frequencies by self reported region of ancestry
There is mounting evidence that further population stratification may be necessary, as even within a racial group there can be significant differences among people of different ancestral origin. To investigate this, we stratified our Caucasian population by self reported region of origin to determine whether allele frequencies differed between individuals of different origins. In total, 11,208 (57%) individuals could be categorized by a region of origin. Using this population, there were 19 polymorphisms that were significantly different (p < 0.05) between individuals of different origins (Table 3). Of these, 5 (rs231775, rs6280, rs351855, rs601338, and rs429358) were significantly different with a p value of 0.0001 or less. Interestingly, the allele frequencies of some of the ethnic groups fell outside the 95% confidence interval of the total Caucasian minor allele frequency (MAF), particularly the Eastern European ethnicity (Figure 3, Table 3).
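The comparison of a subgroup frequency against the 95% confidence interval of the total Caucasian MAF can be sketched as follows. This is a minimal Python illustration under stated assumptions: the paper reports 95% binomial confidence limits without naming the method, so the normal (Wald) approximation used here is an assumption, and the function names and all counts are hypothetical.

```python
import math


def maf_with_ci(minor_allele_count, total_allele_count, z=1.96):
    """Minor allele frequency with an approximate 95% binomial interval.

    Uses the normal (Wald) approximation, which is an assumption of this
    sketch; an exact or Wilson interval may be preferable for small groups.
    """
    p = minor_allele_count / total_allele_count
    half_width = z * math.sqrt(p * (1.0 - p) / total_allele_count)
    return p, max(0.0, p - half_width), min(1.0, p + half_width)


def falls_outside(subgroup_maf, lower, upper):
    """True when a subgroup frequency lies outside the reference interval."""
    return subgroup_maf < lower or subgroup_maf > upper


# Hypothetical example: 200 minor alleles among 1000 chromosomes in total.
total_maf, ci_lower, ci_upper = maf_with_ci(200, 1000)
# Hypothetical subgroup frequencies compared against the total interval.
subgroup_flags = {
    "subgroup_a": falls_outside(0.25, ci_lower, ci_upper),
    "subgroup_b": falls_outside(0.21, ci_lower, ci_upper),
}
```

Because the total population is much larger than any subgroup, its interval is narrow, so a subgroup falling outside it is suggestive rather than a formal test; the chi-squared comparison described in the Statistical Analysis section remains the test of record.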

Allele frequencies by gender
Further stratification of the population by gender revealed few significant differences in allele frequencies between the genders. Only three polymorphisms differed between the genders in the total population (p < 0.05): rs7121 (GNAS), rs1801253 (ADRB1), and rs1042714 (ADRB2). After stratification by racial group, these three polymorphisms were associated with gender differences in the white racial group. Three different polymorphisms exhibited gender differences in black individuals, and two polymorphisms differed between the genders in Hispanic individuals. Among the American Indian population, 5 polymorphisms differed between the genders, and the Asian/Hmong population exhibited the greatest number of polymorphisms with gender differences, with 7 (Additional File 3).

Discussion
This study determined the allele frequencies of 51 polymorphisms, previously associated with at least one disease state, in a large rural population. This represents one of the largest population based allele frequency studies to date. Stratification by self reported race and region of origin revealed wide differences in allele frequencies not only by race but also by region of origin within a single racial group. Here we also report allele frequencies for our Asian/Hmong and American Indian populations, two minority groups that are not typically selected for population allele frequency detection. As we move from gene disease discovery to the application of genetic knowledge, the true population wide allele frequencies become more important for the design and implementation of studies and for determining the relevance of a disease associated polymorphism for a given population.
Several recent studies have reported population wide allele frequencies; our study complements these previously published studies and contributes new information. The two largest US population allele frequency studies, NHANES III [29] and CLUE II [28], also chose to report allele frequencies for polymorphisms associated with disease. The NHANES III study published allele frequencies from a nationally representative cohort for 91 polymorphisms previously associated with disease [29]; our study included 7 of these polymorphisms. The CLUE II population was genotyped for 49 polymorphisms in inflammatory genes [28], 4 of which are also reported here. We chose to compare allele frequencies for the 3 polymorphisms all studies had in common. Our reported allele frequencies were very similar to those of the other studies, particularly in the Caucasian population, with 2 polymorphisms varying by less than 1% between the studies. This similarity contrasts with the dbSNP allele frequencies [15], from which ours differed by over 10% for a number of polymorphisms. This highlights the continued need for large population based genotyping efforts to determine allele frequencies of disease associated polymorphisms.
Figure 1. Polymorphisms with major and minor alleles that vary with race. Minor allele frequency for the total population and the same allele for each racial group, with 95% confidence intervals.
Beyond agreeing with previously published population allele frequencies, this study reports allele frequencies for 43 polymorphisms not genotyped in either the NHANES III [29] or CLUE II populations [28]. The additional polymorphisms included alleles that have been associated with a number of different diseases such as cancer, heart disease, and diabetes [27] and will contribute to our understanding of disease risk. We also report initial population allele frequencies for two additional minority groups that have importance for our population: the Asian/Hmong and American Indian populations. However, our minority allele frequencies have a high degree of uncertainty given the small number of individuals within the racial categories (50 African Americans), and several alleles have zero individuals in at least one homozygote category. Racial category was self reported, with each individual choosing a single racial group, and admixture was not determined. As expected, our minority population allele frequencies were significantly different from the Caucasian allele frequencies for the majority (79%) of the tested polymorphisms, with 25% of the tested polymorphisms switching major and minor alleles between races. These differences will need to be considered when determining tag and causal variants for disease. As an example, one of the IL6 polymorphisms tested here, rs1800796, has had both the C and G allele associated with osteoporosis depending on the race of the individual [28,29]. In Asians and Caucasians the minor allele is reversed, creating complexity for determining the causal variant [30,31]. The IL1B polymorphism (rs16944) exhibits different allele frequencies between races as well, and the same allele has been shown in meta-analysis to be protective for gastric cancer in Asians and a risk factor in Caucasians [32][33][34]. Wide differences in allele frequencies may also contribute to differences in disease prevalence between racial groups. For instance, cystic fibrosis has large racial disparities [35], and as expected we found large differences in the CFTR gene (rs213950) among different racial groups in our population.
Here, we also began to investigate further substructure of our population within a single racial category. Because of the low number of individuals in minority racial groups in our population, we chose to investigate potential substructure within only our Caucasian population. Using self reported region of ancestry, we found that 37% of the tested polymorphisms were significantly different between individuals reporting different regions of ancestry. Recent substructure analysis of Icelandic [23], Swedish [24,36], and Ashkenazi Jewish populations [22,37] determined that substructure was present even in these seemingly homogeneous populations and that the substructure has implications for disease risk [2][3][4][5][22][23][24]36]. This study is one of the largest US studies to stratify a population by region of origin and demonstrates the large heterogeneity of the population even within a single racial group. These differences in allele frequencies may need to be considered when designing association studies in the future. One weakness of our analysis is that almost half (43%) of our Caucasian population could not be categorized into a single region of origin, because admixture could not be determined given the self reported nature of the questionnaire and our limited genotyping. We are currently examining the more than 4000 individuals with completed whole genome scans to determine ethnicities and regions of origin with greater accuracy.
Stratification of the population by sex revealed few differences in allele frequency within our population. This is similar to previously reported sex stratifications of large populations. The three genes that exhibited sex differences in our population (GNAS, ADRB1 and ADRB2) have also demonstrated gene-gender interactions in other studies. Interestingly, ADRB1 and ADRB2 alleles are associated with differential gender responses in blood pressure [38], and in rats these genes interact with sex hormones in a differential manner [39]. The GNAS gene was associated with different responses to hip arthroplasty [40]. While allele frequency differences between the sexes may be infrequent, when they occur they may indicate potential gene-gender interactions for future research.

Conclusions
Using a panel of polymorphisms originally designed to uniquely identify individuals within our biorepository we were able to construct population allele frequencies for the PMRP population. In doing so, we have highlighted the need to consider population substructure beyond race using a large Caucasian population. Stratification of the population may lead to increased study power for future association studies. The genotyping data is available to investigators using the PMRP resource both for identification purposes and as a resource for investigating gene disease interactions.

Study population and questionnaire
The Personalized Medicine Research Project is a population based cohort of approximately 20,000 individuals aged 18 and older residing within one of 19 zip codes surrounding Marshfield, Wisconsin, USA, who have agreed to provide DNA, serum, and plasma samples to be linked with a dynamic medical record for research [26]. Individuals are eligible to participate in the project if they have had care within the Marshfield Clinic Healthcare System within the 3 years prior to enrolment in PMRP. The majority of recruitment was accomplished in the first 18 months of enrolment, beginning in September of 2002. At enrolment each individual was asked to complete a brief questionnaire, which included questions regarding self reported racial affiliations and self reported ancestry using the US census questions, as well as occupational and environmental exposure questions. The enrolment questionnaire is available on the PMRP website [41], and the resulting summary statistics for our population have been published elsewhere [26].
The racial, ethnic and sex distribution of this cohort has been described previously. Briefly, the current cohort is over 98% Caucasian and 57.44% female, with a mean age of 47.5 at enrolment [26]. For this study, 19544 individuals who self reported a single racial category were examined to determine allele frequencies. Individuals were allowed to select more than one ancestral origin. For this study individuals were grouped into four regions of ancestral origin: England or Ireland, Norway or Sweden, Czech Republic or Poland, and Germany. Individuals who selected two disparate regions were excluded from our ancestral analysis due to our inability to determine admixture percentages. However, individuals who selected multiple countries of origin within the same region were included in the analysis. For example, an individual selecting ancestral origin from both Norway and Sweden would be used in the analysis, but an individual selecting Ireland and Germany would be excluded.
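The grouping rule described above can be sketched in a few lines of Python. This is an illustrative sketch only, not the study's code: the mapping and function name are hypothetical, and excluding individuals who report a country outside the four regions is an assumption, since the text does not state how such responses were handled.

```python
# Hypothetical mapping from self reported country to the four analysis regions.
REGION_OF_COUNTRY = {
    "England": "England or Ireland",
    "Ireland": "England or Ireland",
    "Norway": "Norway or Sweden",
    "Sweden": "Norway or Sweden",
    "Czech Republic": "Czech Republic or Poland",
    "Poland": "Czech Republic or Poland",
    "Germany": "Germany",
}


def assign_region(selected_countries):
    """Return the single ancestral region for an individual, or None.

    None means the individual is excluded from the ancestral analysis:
    no origin selected, origins spanning disparate regions, or (by
    assumption in this sketch) a country outside the four regions.
    """
    regions = set()
    for country in selected_countries:
        region = REGION_OF_COUNTRY.get(country)
        if region is None:
            return None  # assumption: unmapped countries lead to exclusion
        regions.add(region)
    if len(regions) != 1:
        return None  # no selection, or two or more disparate regions
    return regions.pop()


# The two cases given in the text.
included = assign_region(["Norway", "Sweden"])
excluded = assign_region(["Ireland", "Germany"])
```

Returning None rather than raising an error keeps excluded individuals easy to count, which matters here because a large share of the Caucasian cohort could not be assigned to a single region.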
Informed consent was obtained upon enrolment in PMRP and this study was approved by the Marshfield Clinic Human Subjects Protection Institutional Review Board.

Genotyping
The entire population was genotyped with 2 multiplex panels for a total of 52 alleles, including a sex marker, using the proprietary Sequenom® platform (Additional File 1). An initial panel of 36 alleles was genotyped to serve as a quality control and assurance panel. Each allele in the panel was previously associated with at least one disease and was reported to have at least a 20% minor allele frequency in Caucasians [27]. In addition, a separate panel of 15 alleles, each with at least one association with disease in Caucasian populations and with no restriction on allele frequency, was also genotyped in the entire cohort. Only individuals who achieved an 80% or greater call rate were used in the study. In addition, each polymorphism was required to achieve at least an 80% call rate. Individual genotypes were decided via a mixture of automated calls from Typer 3.4® [42] and manual calls; manual calls were checked by multiple individuals to assure agreement with the genotype assessment. Each multiplexed plate contained 4 water controls and 6 CEPH controls with 2 duplicate samples to ensure plate to plate accuracy. Eight individuals were duplicated blindly within the genotyping plates, and all previous allele calls were compared to the study allele calls to ensure consistency within the cohort. In addition, over 4,000 individuals in the cohort have been independently assayed using the Illumina 660 whole genome chip [43]. Discrepancies between genotyping methods were resolved with sequencing.
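The 80% call rate thresholds can be illustrated with a short Python sketch. It is a hedged illustration rather than the pipeline used in this study: the data layout, function names, and sample and SNP labels are hypothetical, and applying the individual filter before the polymorphism filter is an assumption, since the order is not stated.

```python
def call_rate(calls):
    """Fraction of genotype calls that are not missing (None)."""
    calls = list(calls)
    if not calls:
        return 0.0
    return sum(call is not None for call in calls) / len(calls)


def filter_by_call_rate(genotypes, threshold=0.80):
    """Keep individuals, then polymorphisms, with call rate >= threshold.

    genotypes maps sample_id -> {snp_id: genotype string or None}.
    Filtering individuals first is an assumption of this sketch.
    """
    kept_samples = {
        sample: calls
        for sample, calls in genotypes.items()
        if call_rate(calls.values()) >= threshold
    }
    snp_ids = sorted({snp for calls in kept_samples.values() for snp in calls})
    kept_snps = [
        snp
        for snp in snp_ids
        if call_rate(calls.get(snp) for calls in kept_samples.values()) >= threshold
    ]
    return {
        sample: {snp: calls.get(snp) for snp in kept_snps}
        for sample, calls in kept_samples.items()
    }


# Hypothetical data: sample C fails (3 of 5 calls), then snp5 fails (1 of 2 calls).
example = {
    "A": {"snp1": "AA", "snp2": "AG", "snp3": "GG", "snp4": "CT", "snp5": "TT"},
    "B": {"snp1": "AG", "snp2": "AA", "snp3": "GG", "snp4": "CC", "snp5": None},
    "C": {"snp1": "AA", "snp2": None, "snp3": None, "snp4": "CT", "snp5": "TT"},
}
filtered = filter_by_call_rate(example)
```

The order of the two filters can change which polymorphisms pass, because removing poorly genotyped individuals first usually raises the per-polymorphism call rate.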

Statistical Analysis
Hardy-Weinberg equilibrium was assessed using a χ² test with 1 degree of freedom, and polymorphisms were said to be out of equilibrium if the observed frequencies differed from the expected frequencies beyond the 99% confidence level. Allele frequencies were compared to reported allele frequencies both for the population as a whole and for each self reported racial group. Differences in allele frequencies between self reported racial and ethnic groups were determined using chi-squared analysis, and the 95% binomial confidence limits were reported. Differences between the sexes were determined using chi-squared analysis. Differences were reported as significant if P < 0.05.
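The two chi-squared analyses described above can be sketched in Python using only the standard library. This is a minimal illustration under stated assumptions, not the software used in the study: the function names and all counts are hypothetical, the Hardy-Weinberg test is written on genotype counts, and the group comparison is limited to two groups so that 1 degree of freedom applies (comparing more groups would need more degrees of freedom and a general chi-squared distribution).

```python
import math


def chi2_sf_1df(x):
    """Upper-tail probability of a chi-squared variable with 1 degree of freedom."""
    return math.erfc(math.sqrt(x / 2.0))


def hwe_chi2(n_aa, n_ab, n_bb):
    """Chi-squared test of Hardy-Weinberg equilibrium from genotype counts.

    Returns (chi-squared statistic, p value). Expected counts come from the
    allele frequencies estimated in the same sample.
    """
    n = n_aa + n_ab + n_bb
    p = (2 * n_aa + n_ab) / (2.0 * n)  # frequency of allele A
    q = 1.0 - p
    expected = (n * p * p, 2 * n * p * q, n * q * q)
    observed = (n_aa, n_ab, n_bb)
    # Cells with zero expected count (monomorphic sites) are skipped.
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected) if e > 0)
    return chi2, chi2_sf_1df(chi2)


def allele_chi2_two_groups(a1, b1, a2, b2):
    """Chi-squared test (1 df) comparing allele counts between two groups.

    a1, b1 are counts of alleles A and B in group 1; a2, b2 in group 2.
    Assumes every row and column total is greater than zero.
    """
    n = a1 + b1 + a2 + b2
    row1, row2 = a1 + b1, a2 + b2
    col_a, col_b = a1 + a2, b1 + b2
    cells = ((a1, row1, col_a), (b1, row1, col_b), (a2, row2, col_a), (b2, row2, col_b))
    chi2 = 0.0
    for observed, row_total, col_total in cells:
        expected = row_total * col_total / n
        chi2 += (observed - expected) ** 2 / expected
    return chi2, chi2_sf_1df(chi2)


# Hypothetical counts: a site with no heterozygotes, flagged at the 0.01 level.
hwe_stat, hwe_p = hwe_chi2(50, 0, 50)
out_of_equilibrium = hwe_p < 0.01
# Hypothetical allele counts in two groups, tested at the 0.05 level.
group_stat, group_p = allele_chi2_two_groups(60, 40, 40, 60)
groups_differ = group_p < 0.05
```

For groups as small as some of the minority categories in this cohort, expected cell counts can be low, in which case an exact test may be more appropriate than the chi-squared approximation.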