Current limitations of SNP data from the public domain for studies of complex disorders: a test for ten candidate genes for obesity and osteoporosis

Background: Public SNP databases are frequently used to choose SNPs for candidate genes in association and linkage studies of complex disorders. However, their utility for such studies of diseases with ethnic-dependent background has never been evaluated.

Results: To estimate the accuracy and completeness of SNP public databases, we analyzed the allele frequencies of 41 SNPs in 10 candidate genes for obesity and/or osteoporosis in a large American-Caucasian sample (1,873 individuals from 405 nuclear families) by PCR-invader assay. We compared our results with those from the databases and other published studies. Of the 41 SNPs, 8 were monomorphic in our sample. Twelve were reported for the first time for Caucasians and the other 29 SNPs in our sample essentially confirmed the respective allele frequencies for Caucasians in the databases and previous studies. The comparison of our data with other ethnic groups showed significant differentiation between the three major world ethnic groups at some SNPs (Caucasians and Africans differed at 3 of the 18 shared SNPs, and Caucasians and Asians differed at 13 of the 22 shared SNPs). This genetic differentiation may have an important implication for studying the well-known ethnic differences in the prevalence of obesity and osteoporosis, and complex disorders in general.

Conclusion: A comparative analysis of the SNP data of the candidate genes obtained in the present study, as well as those retrieved from the public domain, suggests that the databases may currently have serious limitations for studying complex disorders with an ethnic-dependent background due to the incomplete and uneven representation of the candidate SNPs in the databases for the major ethnic groups. This conclusion attests to the imperative necessity of large-scale and accurate characterization of these SNPs in different ethnic groups.


Background
A single nucleotide polymorphism (SNP) is generally defined as a stable substitution of a single base with a frequency of more than 0.01 in at least one population [1]. In human genetic studies, SNPs are simply referred to as bi-allelic markers since the other types (tri-allelic and tetra-allelic SNPs) are very rare in the human genome [2]. SNPs have been recognized as an important tool in human genetics and medicine [3,4]. Because SNPs are abundant and scattered throughout the whole human genome, they have been widely used in genetic association studies of various complex diseases such as obesity, osteoporosis, asthma, and hypertension (see, e.g., [5][6][7]). Various analyses of SNPs across the human genome have also been conducted to determine haplotype patterns in human populations [8][9][10][11][12]. These data are very useful for studying the genetic basis of common complex diseases [13].
At present there are several public SNP databases. The largest are dbSNP and HGVbase, which together contain several million SNPs [14]. There are also other relatively small or specific SNP databases, e.g., TSC [15], JSNP [16], HOWDY [17], GeneSNPs [18]. While the databases have been continuously expanded, the quality and completeness of the deposited SNP data remain of particular importance and still need to be assessed. A recent evaluation of the quality and comprehensiveness of about four million candidate SNPs from some public and Celera databases showed that about 6-12% of the SNPs could not be validated [19]. These represent rare variants, population-specific SNPs and sequencing errors. Other studies reported that only about 50% of SNPs are common in any given population. Therefore, allele frequencies and linkage disequilibrium should be measured for the SNPs in the databases in order to efficiently select a minimal subset of SNPs for population association studies of complex diseases [20,21].
Obesity and osteoporosis are common complex disorders with a continuously growing burden and cost of prevention and treatment. For example, the most recent data for the United States, derived from the third National Health and Nutrition Examination Survey, showed that ~20% of US men and ~25% of US women are obese and that these proportions are increasing [22]. The same tendency has been observed in other human populations [23]. Osteoporosis is another major health problem, particularly in the elderly. More than 40% of postmenopausal women, on average, will suffer at least one osteoporotic fracture [24]. This disorder incurred an estimated direct cost of ~14 billion dollars in the USA alone in 1995 [25].
SNPs of candidate genes for obesity and osteoporosis may be of particular importance in genetic studies of these disorders. The allele frequencies are important in the selection of SNPs for studying complex diseases [26]. For example, association studies of osteoporosis usually generate inconsistent results in different ethnic groups [27]. One important reason is that polymorphisms associated significantly with certain osteoporotic phenotypes in a given ethnic group may be absent or rare in another ethnic group [28][29][30][31]. Moreover, racial differences in the prevalence of certain alleles could account for a certain proportion of disease trait variation between different ethnicities [32]. Finally, a comparison of SNP allele frequencies among different ethnicities can provide valuable information for mapping by admixture linkage disequilibrium (MALD) [33].
Despite the importance of the above data, no attempts have been made to check the quality and completeness of SNPs of candidate genes for obesity and osteoporosis in the public databases. In fact, most allele frequencies in the databases were obtained from studies of relatively small samples (dozens or so), which could yield large sampling errors. In addition to the databases, there are abundant SNP data from many association/linkage studies of these two common complex diseases. These individually published data have not yet been assessed in reference to each other, or to those in the databases. In most association studies, the sample sizes used to estimate allele frequencies have only been of the order of a hundred or so. However, to obtain reliable estimates of allele frequency distributions, larger sample sizes are needed. For example, to keep the error of the allele frequency estimate of an SNP within 0.05 in absolute value, the required sample sizes for the frequency distributions (0.5, 0.5), (0.7, 0.3), and (0.9, 0.1) should be at least 1150, 1000, and 400, respectively [34]. Otherwise, the estimates of allele frequencies may be biased to a large extent.
In the present study, we determined allele frequencies of 41 SNPs in 10 candidate genes for obesity and/or osteoporosis in a large American-Caucasian sample (1,873 subjects from 405 nuclear families). We then compared them with the corresponding data for the major ethnic groups obtained from the public databases and various individual studies to check their accuracy and completeness, and to compare the SNP allele frequencies in different ethnic groups.

SNP polymorphism in the studied Caucasian sample
Of the 41 candidate SNPs for obesity and/or osteoporosis identified in the studied Caucasian sample (Table 1), 29 (70.7%) had minor allele frequency ≥0.1. Eight SNPs (19.5%) were monomorphic. Allele frequencies of twelve SNPs in Caucasians were reported for the first time (Table 2). Five of these SNPs were monomorphic and the other seven were polymorphic, with minor allele frequencies ranging from 0.019 to 0.398.

Data on the candidate SNPs for obesity and/or osteoporosis from major ethnic groups: an analysis of literature and databases

Caucasians
In available literature and public SNP databases, we found data about allele frequencies of 29 SNPs (representing 9 genes) in Caucasians, which were shared with the present study. The corresponding data are given in Table 3. The allele frequencies of only 14 SNPs had been previously reported for Caucasians to the public databases (dbSNP and/or HGVbase). No significant differences were found between the allele frequencies in our study and in the public databases, except for SNP31 in Australians (P = 0.05, Table 3).
In contrast to the databases, the available literature contained data on allele frequencies of 23 SNPs. We found significant differences between our data and the literature data for allele frequencies of four SNPs (SNP16, SNP32, SNP35, and SNP36). We have not found any allele frequency data on the SNPs of the PTHR1 gene (SNP21-SNP25) for Caucasians. Some SNPs showed large discrepancies in allele frequencies between the data from the different databases. For example, the minor allele frequencies of SNP26 for two Caucasian samples reported to dbSNP and HGVbase are 0.43 and 0.26, respectively (Table 3). The differences between the values are not statistically significant due to the small sample sizes (n = 31 and n = 42, respectively), so it is impossible to conclude what the actual allele frequencies are in these populations and whether they are indeed significantly different.

African/African-Americans
Table 4 presents our SNP data in comparison with the corresponding data on African and/or African-American samples obtained from literature and the databases. In the public domain, we found in total only 18 SNPs from seven genes, which are shared with our study. The candidate genes for obesity and/or osteoporosis appear to be underrepresented for Africans/African-Americans in the public SNP databases and literature: only 11 shared SNPs were found in the databases and seven in the literature. Among those, one (SNP36) was monomorphic and four had minor allele frequency <0.1. Significant differences in allele frequencies between Caucasians and Africans/African-Americans were found at three SNPs (SNP16, SNP34 and SNP35). Three more SNPs (SNP32, SNP33, and SNP41) manifested nearly significant differences in allele frequencies between these ethnic groups (Table 4). We found no data in the public domain on the SNPs of three genes (COL1α1, PTHR1, and TNFR2) for Africans/African-Americans that would be shared with our study.
As with Caucasians, there are large inconsistencies between some SNP data for Africans/African-Americans from the different databases. For example, the reported minor allele frequencies of SNP26 from the TSC and dbSNP databases are 0.27 and 0.15, respectively (Table 4). The former is not significantly different (P = 0.61) from the corresponding value in Caucasians, while the latter is nearly significantly different (P = 0.07). However, it is hard to determine which value is true because of the small sample sizes (n = 42 and n = 24, respectively) in the databases.

Asians
For this ethnic group, the public domain contained data on 22 SNPs (8 genes) shared with our study (Table 5). The data on 13 SNPs of 6 genes were available from the public databases, and data on 10 SNPs of 5 genes were reported in the literature. Among those SNPs, two (SNP16 and SNP33) were monomorphic and one (SNP36) had minor allele frequency <0.1. Two more SNPs of the VDR gene (SNP40 and SNP41) have a minor allele frequency about this value; however, this frequency varies in different Asian populations. We observed a significant inconsistency between the values of the minor allele frequency for SNP40 in Asians from the database and from the literature. In the database, this value (0.41) is about 3-10 times higher than that reported in the literature (0.043-0.13). For gene abbreviations and SNP designations see Table 1.
A comparison of our Caucasian sample and the Asian populations revealed significant differences in allele frequencies of 13 SNPs located in 6 genes (Table 5). Another SNP, representing the TGF-ß1 gene (SNP26), showed a nearly significant difference (P = 0.06). The UCP3 gene was the only one in which no such differences were observed.

Hispanic and Pacific Rim populations
These populations had the poorest data on the SNPs from the databases and literature as compared to the corresponding data from the present study. We found data on only seven SNPs of five candidate genes for Hispanics (Table 6) and five SNPs of four genes for Pacific Rim populations (Table 7). SNP33 and SNP36 of the UCP3 gene were virtually monomorphic in Hispanics, as in our Caucasian sample, and among the seven compared SNPs only one (SNP16) differed significantly in allele frequencies between these ethnic populations (Table 6).
No data on the SNPs of the 10 candidate genes for obesity and/or osteoporosis were found for Pacific Rim populations in the available literature. All the data presented in Table 7 were obtained from the dbSNP database. Three of the five SNPs differed significantly in allele frequencies between Pacific Rim populations and the Caucasian sample in our study.

Some new candidate SNPs for genetic studies of obesity and osteoporosis
In this study, we identified a number of SNPs, which may be suitable for genetic studies of obesity and osteoporosis.
In particular, the six polymorphic SNPs with minor allele frequency ≥0.1 that were determined for the first time in Caucasians (Table 2) may be used in population association studies. A number of SNPs with significant variation in allele frequencies between populations of different ethnicity may be appropriate for studying the genetic basis of between-ethnic differences in the rates of obesity and/or osteoporosis. Examples are SNP34, SNP35, and SNP32 for Caucasians and African-Americans (Table 4) and several SNPs located in the ER-α, LEPR, PTHR1, TNFR2, and VDR genes for Caucasians and Asians (Table 5). Some other SNPs may also be candidates for such studies, if their allele frequencies reported to the databases are validated using sufficiently large sample sizes (e.g., SNP26 in Asians).

Ethnic variation of SNP allele frequencies of the candidate genes for obesity and/or osteoporosis
We observed significant differences in allele frequencies of some studied SNPs for Caucasians (SNP16, SNP35, SNP32, and SNP36, Table 3) between our results and the respective data from the databases and literature. This may be due to various factors, such as population admixture and relatively small sample sizes in the other studies [34]. For example, the sample in the present study consisted of subjects of various ethnic backgrounds (German, French, Dutch, Swedish, some Portuguese and Italian backgrounds) and the relative proportions of these ethnic groups were not ascertained. In the mixed samples from other studies, these proportions may differ and thus influence the allele frequencies to a larger or lesser extent (e.g., SNP35 and SNP36, Table 3). Likewise, monoethnic samples may have allele frequencies significantly different from those in the mixed samples (e.g., SNP16 in English and Spaniards, SNP32 in French, Table 3). Our results showed that major ethnic groups have significantly different allele distributions at some SNP markers of candidate genes for complex disorders. The largest differences were observed between Asian and Caucasian populations, namely at 13 of the 22 SNPs compared (Table 5). Some genes showed significant differences at most SNPs compared (e.g., PTHR1 between Japanese and Caucasians, Table 5). These results are consistent with the previously reported data about high differentiation between Asians and Caucasians at candidate genes for bone mass [35,36].
Another important finding is the observed significant differences between Caucasian and Pacific Rim populations in SNP allele frequencies of three important candidate genes for osteoporosis, ER-α, IL-6, and VDR (Table 7). Interestingly, the respective SNP allele frequencies in Asian and Pacific Rim populations have very similar values (Tables 5 and 7). This may suggest that the Pacific Rim sample from the database has mainly Asian ancestry [37] or, perhaps there are some implications for an evolutionary history of these populations [38]. In any event, much more comprehensive data should be obtained to clarify these issues.
There are abundant data about significant between-ethnic differentiation at loci that may underlie complex diseases, including obesity and osteoporosis (see, e.g., [36][39][40][41]). Given the well-known differences in the incidence of these disorders among ethnic groups (e.g., [42,43]), such genetic differentiation may have an important implication for studying the genetic basis of ethnicity-specific definitions of obesity and osteoporosis.
Another important application of SNP markers with large between-ethnic and small within-ethnic differences is MALD studies of complex diseases in admixed populations with known parental ethnicities [33,44]. Our results showed that such SNPs exist in the candidate genes for obesity and osteoporosis. However, large-scale screening of the candidate genes is necessary to identify such markers with sufficient density.
The results of our study suggest that ethnic heterogeneity of large samples may notably affect the observed allele frequencies. This is supported by the fact that several SNP allele frequencies of some sufficiently large monoethnic samples (e.g., SNP16 in English, SNP32 in French) differ significantly from those determined in our sample of mixed Caucasians (Table 3).

SNP databases: current limitations for studying complex diseases
Our analyses showed that the SNP databases in their current status might have some limitations for studies of complex disorders, especially in different ethnic groups, due to incomplete and/or uneven representation of SNPs and/or candidate genes in these groups. As indicated above, of the ten candidate genes examined here, only four have corresponding but incomplete SNP data in the databases for all the major ethnic groups and may be used in the comparative studies of obesity and/or osteoporosis in the populations of different ethnicity. The SNP data in the databases for the other six genes (60% examined) need substantial updating. To do this, large-scale studies should be performed for the other major ethnic populations. Given that many complex diseases have different rates in different ethnic groups, the extensive volume of the SNP data needs to be updated and validated. This conclusion essentially corresponds to the recently reported results of the SNP databases evaluation regarding their use for whole-genome association studies in humans [21]. Our study provides an example showing the incompleteness of the SNP data in the current databases for studying complex disorders with an ethnic-dependent background.
Another important problem is inconsistency of the data for some SNP markers, either between the different databases or between the databases and the literature. This makes selection of the right SNP for a study rather difficult. For example, a SNP is usually considered appropriate for association or linkage studies if its minor allele frequency is ≥0.1 [45,46]. Based on the data for SNP11 from the dbSNP database (Table 3), this SNP is hardly appropriate for studies in Caucasians, because its minor allele frequency is 0.08. In fact, as determined in the present study, its actual frequency is 0.257, which makes this SNP suitable for population association and linkage research.
Given the well-known ethnic differences in the incidence of some complex diseases, the discrepancies in the SNP database data may lead to wrong conclusions about the suitability of particular SNPs for studying the genetic basis of these differences. For example, as mentioned above, SNP26 from the TGF-ß1 gene has a reported minor allele frequency of 0.43 (dbSNP) and 0.26 (HGVbase) for Caucasians, 0.27 (TSC) and 0.15 (dbSNP) for African-Americans, and 0.47 (dbSNP) for Asians (Tables 3,4,5). This gene is a candidate for bone mineral density [47] and, thus, may contribute to the ethnic differences in bone mass and osteoporosis. However, from the above data on the SNP allele frequencies, it is impossible to infer whether this particular SNP may be related to these ethnic differences in bone mass, because none of the values differs significantly from any other due to small sample size and, accordingly, limited statistical power.
Large discrepancies in some allele frequency data between the different databases (e.g., SNP4 for Caucasians) are likely due to small sample sizes and the correspondingly large sampling errors of the estimates. The sample sizes in the SNP databases are usually smaller than 50. With such a sample size, the relative sampling error of the allele frequency estimates ranges from 0.12 to 0.67 (Fig. 1). In this respect, the estimates of the SNP allele frequencies in the present study, with a sample size of 1,873, are far more reliable than the respective data from most databases. Estimation of allele frequencies in small samples is notably affected by heterogeneity of the samples. Given that information about the ethnic background of the samples in the SNP databases is usually scarce, the allele frequency estimates in the databases may be significantly biased due to ethnic admixture.
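To make the dependence of sampling error on sample size concrete, the following minimal sketch computes a relative sampling error for a few sample sizes. It assumes simple binomial sampling of 2n independent chromosomes from n unrelated individuals, so that the variance of the estimate is p(1 - p)/(2n); this is a textbook approximation and not necessarily the exact computation behind Fig. 1. The function name and the chosen frequencies are illustrative only. Family members are not independent, so for a family-based sample such as ours the values shown for n = 1,873 are optimistic.

```python
import math


def relative_sampling_error(p: float, n_individuals: int) -> float:
    """Standard error of an allele frequency estimate, divided by the frequency.

    Assumption (not taken from the paper): binomial sampling of
    2 * n_individuals independent chromosomes.
    """
    if not 0.0 < p <= 0.5:
        raise ValueError("p must be a minor allele frequency in (0, 0.5]")
    if n_individuals <= 0:
        raise ValueError("n_individuals must be positive")
    variance = p * (1.0 - p) / (2 * n_individuals)
    return math.sqrt(variance) / p


if __name__ == "__main__":
    # Illustrative sample sizes: two database-like sizes and the present study.
    for n in (25, 50, 1873):
        for p in (0.1, 0.3, 0.5):
            err = relative_sampling_error(p, n)
            print(f"n = {n:5d}  p = {p:.1f}  relative error = {err:.3f}")
```

Under this approximation the relative error shrinks with the square root of the sample size and grows as the minor allele becomes rarer, which is why small database samples are least informative for low-frequency SNPs.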
The small size of the samples in the databases makes it difficult to powerfully test differences in SNP allele frequencies (∆f) between the data from various sources, especially if the absolute values of these differences are not large. Fig. 2 illustrates the statistical power of such a test. Given two samples of size 50 each, the probability of correctly detecting ∆f = 0.1 is only 27% (Fig. 2A). Having a sample size of 1,800 (similar to ours) increases this probability only up to 40% (Fig. 2B). Even a 3-fold difference in the absolute values (e.g., SNP11, Table 4) is not statistically significant due to the insufficient sample size in the databases. These examples suggest that some differences in SNP allele frequency data in the public databases may exist due to improper sampling or too small sample sizes [34]. These inconsistencies may be somewhat reduced by pooling the samples. However, this should be done with caution, because pooling data from small samples with a particular genetic background may introduce a large bias to the new values. Furthermore, in many cases, even after pooling, the sample size may remain too small to gain sufficient reliability of the newly obtained allele frequencies.
Figure 1. Relative sampling error of different minor allele frequencies under various sample sizes. The relative sampling error is obtained as $\sigma_{\hat{p}}/p$, where $\sigma_{\hat{p}}^{2}$ is the variance of the estimated allele frequency and is computed as $\sigma_{\hat{p}}^{2} = p(1-p)/(2n)$ [51]. We plotted relative sampling errors of estimated allele frequencies vs. sample sizes.
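The power argument above can be explored with a short sketch based on the normal approximation to a two-sided, two-sample test of proportions. This is an assumption made for illustration, not the procedure used to draw Fig. 2, so the numbers it prints need not match the percentages quoted in the text. The function name, the example frequencies (0.3 and 0.4), and the treatment of each individual as two independent chromosomes are all hypothetical choices.

```python
from statistics import NormalDist


def power_two_sample(p1: float, p2: float, n1: int, n2: int,
                     alpha: float = 0.05) -> float:
    """Approximate power to detect p1 != p2 with a two-sided z-test.

    Assumptions (not from the paper): n1 and n2 are numbers of unrelated
    diploid individuals, giving 2*n1 and 2*n2 independent chromosomes, and
    the unpooled standard error is used under the alternative.
    """
    for p in (p1, p2):
        if not 0.0 < p < 1.0:
            raise ValueError("allele frequencies must lie strictly between 0 and 1")
    if n1 <= 0 or n2 <= 0:
        raise ValueError("sample sizes must be positive")
    std_normal = NormalDist()
    z_crit = std_normal.inv_cdf(1.0 - alpha / 2.0)
    se = (p1 * (1.0 - p1) / (2 * n1) + p2 * (1.0 - p2) / (2 * n2)) ** 0.5
    shift = abs(p1 - p2) / se
    # Probability that the test statistic falls in either rejection tail.
    return std_normal.cdf(shift - z_crit) + std_normal.cdf(-shift - z_crit)


if __name__ == "__main__":
    # Hypothetical frequencies with a difference of 0.1.
    for n1, n2 in ((50, 50), (50, 1800), (1800, 1800)):
        pw = power_two_sample(0.3, 0.4, n1, n2)
        print(f"n1 = {n1:4d}, n2 = {n2:4d}: power = {pw:.2f}")
```

The qualitative point is that enlarging only one of the two samples yields a limited gain, because the standard error of the difference stays dominated by the smaller sample.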
The SNP data in literature may somewhat supplement those from the databases. They are sometimes obtained with larger samples and, therefore, have smaller sampling error. However, they are not systematized, which makes their collection and use difficult.
In conclusion, our study demonstrated that, although a large volume of SNP data is available in public databases and literature, a great portion of these data needs comprehensive updating and validating in order to be appropriate for genetic studies of complex disorders, such as obesity and/or osteoporosis. Such large-scale studies of the disease-associated SNPs in various ethnicities may provide important insights into the evolutionary history of human populations as well as in etiology of these diseases.

Subjects
All the study subjects came from ongoing genetic studies of complex traits that have been approved by the Creighton University Institutional Review Board. All the subjects were Caucasians of western or northern European origin (German, French, Dutch, Swedish), but some had Portuguese and Italian backgrounds. We have recruited 405 nuclear families, each composed of both parents and at least one child. The total sample size was 1,873 subjects, including 840 parents and 1,133 children. All individuals volunteered to participate in the research and signed informed-consent documents before entering the project.

The study SNPs
The candidate SNPs and genes for the present study were selected from publicly available sources based on some of the following criteria: 1) functional relevance and importance for obesity and/or osteoporosis; 2) degree of heterozygosity, i.e., allele frequencies, as reported in literature or databases; 3) position in or around the genes; and 4) their use in previous genetic epidemiology studies. After searching public SNP databases (dbSNP, JSNP, HGVbase, and TSC) and literature, we chose 41 SNPs in 10 genes shown to be associated with obesity and/or osteoporosis. The information about the SNPs is given in Table 1. Among them, SNP27 is an insertion/deletion polymorphism of a cytosine (+/-C) and the others are nucleotide substitutions.

SNP genotyping
DNA was extracted from whole blood using a commercial isolation kit (Gentra Systems, Minneapolis, MN, USA) following the procedure recommended by the manufacturer. The genotyping involved a polymerase chain reaction (PCR) and invader assay (Third Wave Technology, [...] Table 1. The PCR was performed on a PE9700 Thermal Cycler (Perkin Elmer Cetus, Norwalk, CT) using the following profile: 95°C for 5 min, 30 cycles of 94°C for 1 min, 50°C for 1 min, 72°C for 1 min, and then 72°C for 5 min. After the amplification, the product was diluted 1:20 in nuclease-free water. The invader reaction was carried out in a 7.5 µl reaction volume containing 3.75 µl diluted PCR product, 1.5 µl probe mix, 1.75 µl Cleavase FRET mix, and 0.5 µl Cleavase enzyme/MgCl2 solution (Third Wave Technology). The reaction mix was overlaid with 15 µl mineral oil and denatured at 95°C for 5 min, and then incubated at 63°C for 20 min on the PE9700 Thermal Cycler. After the incubation, the fluorescence intensity for both colors (FAM dye and Red dye) was measured by Cytofluor 4000 (ABI). The data were then loaded into the Invader Analyzer software (Third Wave Technology), and the genotype for every sample was called according to the ratio of the fluorescence intensity of the two dyes.
Figure 2. Power to detect differences in allele frequencies between two samples at P = 0.05. Since the power depends on the sample sizes of the two populations, we fixed one population at a sample size of n, while allowing another population to vary in sample sizes. ∆f is a frequency difference between the sample of given size n and the samples of various sizes. A: n = 50; B: n = 1,800.
We initially genotyped all the 41 SNPs in a random sample of 190 to 380 subjects to find SNPs with minor allele frequencies <0.01. Those were then excluded from the further genotyping. Finally, 33 SNPs were genotyped for the whole sample.
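The screening step can be illustrated with a minimal sketch that estimates a minor allele frequency by direct allele counting in a pilot sample and flags SNPs below the 0.01 threshold. The function names, SNP labels, and genotype counts are hypothetical and are not data from this study. Simple counting also treats all subjects as unrelated, which is only a rough approximation for family data.

```python
def minor_allele_frequency(n_hom1: int, n_het: int, n_hom2: int) -> float:
    """Minor allele frequency from counts of the three genotypes of a bi-allelic SNP.

    n_hom1 and n_hom2 are the numbers of the two homozygotes, n_het the
    number of heterozygotes. Relatedness between subjects is ignored.
    """
    n_subjects = n_hom1 + n_het + n_hom2
    if n_subjects == 0:
        raise ValueError("at least one genotyped subject is required")
    freq_allele2 = (2 * n_hom2 + n_het) / (2 * n_subjects)
    return min(freq_allele2, 1.0 - freq_allele2)


def is_monomorphic(maf: float, threshold: float = 0.01) -> bool:
    """Apply the study's convention: MAF below 0.01 counts as monomorphic."""
    return maf < threshold


if __name__ == "__main__":
    # Hypothetical pilot genotype counts (homozygote 1, heterozygote, homozygote 2).
    pilot_counts = {
        "SNP_A": (188, 2, 0),
        "SNP_B": (120, 60, 10),
        "SNP_C": (190, 0, 0),
    }
    for snp, counts in pilot_counts.items():
        maf = minor_allele_frequency(*counts)
        decision = "exclude" if is_monomorphic(maf) else "genotype whole sample"
        print(f"{snp}: MAF = {maf:.4f} -> {decision}")
```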

Data analysis
PedCheck software [48] was used to verify the accuracy of SNP genotyping in reference to Mendelian inheritance of the alleles within each family. For the 29 polymorphic SNPs, the allele frequencies in the nuclear families were estimated by the maximum-likelihood method [49] implemented in SOLAR (http://www.sfbr.org/sfbr/public/software/solar). This method uses all available marker information by accounting for the dependence between relatives. Differences in allele frequencies of each SNP among various populations were tested using the χ² test or Fisher's exact two-tailed test as implemented in Advisor software [50]. SNPs with minor allele frequency less than 0.01 were considered to be monomorphic.
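As an illustration of the two tests named above, the following sketch compares allele counts between two samples with a Pearson χ² test (1 degree of freedom, no continuity correction) and a two-tailed Fisher's exact test. It is a plain standard-library re-implementation for readers and does not reproduce the Advisor software; the function names and the allele counts are hypothetical.

```python
import math


def chi_square_2x2(a: int, b: int, c: int, d: int) -> tuple[float, float]:
    """Pearson chi-square statistic and P value for the table [[a, b], [c, d]].

    Rows are samples; columns are counts of the minor and major allele.
    """
    row1, row2, col1, col2 = a + b, c + d, a + c, b + d
    if min(row1, row2, col1, col2) == 0:
        raise ValueError("every row and column total must be positive")
    n = row1 + row2
    stat = n * (a * d - b * c) ** 2 / (row1 * row2 * col1 * col2)
    # Upper tail of the chi-square distribution with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(stat / 2.0))
    return stat, p_value


def fisher_exact_two_tailed(a: int, b: int, c: int, d: int) -> float:
    """Two-tailed Fisher's exact P value: sum over tables no more likely than observed."""
    row1, row2, col1 = a + b, c + d, a + c
    n = row1 + row2

    def prob(x: int) -> float:
        # Hypergeometric probability of x minor alleles in the first sample.
        return math.comb(row1, x) * math.comb(row2, col1 - x) / math.comb(n, col1)

    p_observed = prob(a)
    lowest, highest = max(0, col1 - row2), min(row1, col1)
    total = sum(prob(x) for x in range(lowest, highest + 1)
                if prob(x) <= p_observed * (1.0 + 1e-9))
    return min(1.0, total)


if __name__ == "__main__":
    # Hypothetical allele counts: (minor, major) in two small samples.
    sample_1 = (12, 38)
    sample_2 = (20, 30)
    stat, p_chi = chi_square_2x2(*sample_1, *sample_2)
    p_fisher = fisher_exact_two_tailed(*sample_1, *sample_2)
    print(f"chi-square = {stat:.3f}, P = {p_chi:.3f}; Fisher exact P = {p_fisher:.3f}")
```

Fisher's exact test is the safer choice when any expected cell count is small, which is the typical situation for rare alleles in database samples of a few dozen subjects.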