To distinguish populations, the ideal loci are those that have an allele fixed in one population and absent in the other populations [19–21]. For the HapMap data, SNPs were sequenced in a small number of individuals, which means that SNPs with rare alleles are less likely to be discovered [22]. There are many measures for Ancestry Informative Markers (AIMs), such as the absolute allele frequency difference (*δ*), expected heterozygosity, *F* statistics (*F*_{ST}), informativeness for assignment (*I*_{n}) and informativeness for ancestry coefficients (*I*_{a}) [20]. We did not use any AIM estimates for SNP selection in this study because AIM estimates from one data set may not be safely applied to other data sets if the reference sample size is not large enough. Despite our SNP ascertainment procedures, we obtained very good results for population structure detection using the HapMap and Perlegen data sets. The implication of this work is that it is possible to tackle population structure issues in genome-wide association studies with a large number of markers. Instead of a special marker selection procedure, the AW-clust algorithm uses a large number of random genome-wide SNPs to ensure a sufficient number of informative markers for inference. Subpopulations can be identified, and association tests can then be applied to each homogeneous group of individuals.
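As a small illustration of two of the AIM measures listed above, the sketch below computes *δ* and a two-population *F*_{ST} for a single biallelic SNP. It assumes known allele frequencies, equal weighting of the two populations, and the heterozygosity-based form (*H*_T − *H*_S)/*H*_T; the function names are hypothetical and this is not the estimator used in any of the cited studies.

```python
def delta(p1, p2):
    """Absolute allele frequency difference between two populations."""
    return abs(p1 - p2)


def fst_two_pops(p1, p2):
    """Heterozygosity-based F_ST for one biallelic SNP.

    Assumption: both populations are weighted equally and the allele
    frequencies p1 and p2 are known without sampling error.
    """
    p_bar = (p1 + p2) / 2
    h_total = 2 * p_bar * (1 - p_bar)                      # pooled heterozygosity
    if h_total == 0:
        return 0.0                                         # monomorphic locus
    h_within = (2 * p1 * (1 - p1) + 2 * p2 * (1 - p2)) / 2  # mean within-population
    return (h_total - h_within) / h_total


# An allele fixed in one population and absent in the other is maximally informative.
fixed_delta = delta(1.0, 0.0)
fixed_fst = fst_two_pops(1.0, 0.0)
# Identical frequencies carry no ancestry information.
null_fst = fst_two_pops(0.5, 0.5)
```

The fixed-versus-absent case corresponds to the "ideal locus" described at the start of the paragraph: both measures reach their maximum of 1.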

There are two key parts in the implementation of the AW-clust algorithm. The first part is the ASD distance matrix between all pairs of individuals. This distance was chosen because it can be shown, using Balding and Nichols's DNA profile match probability theory [23], that with SNP markers the expected ASD between individuals from different subpopulations is always greater than that between individuals from the same subpopulation [24]. Thus, the within- and between-population distances should hardly overlap when many genome-wide random SNP loci are used. It is therefore possible to differentiate populations from the half-matrix of pairwise distances without explicitly estimating allele frequencies for each subpopulation. It is through the accumulated effect of many SNP loci that population structure can be identified. The second part of AW-clust is Ward's minimum variance algorithm, chosen because inference of population structure based on ASD is likely to reduce to contrasting group means and minimizing the within-group variance of ASD. Ward's minimum variance approach is particularly suited to our problem, where the correct number of subpopulations is not known in advance and we need to minimize the increase in the within-group ASD variance each time an individual is added to a cluster.
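The two parts described above can be sketched in a few lines of pure Python. This is a minimal illustration under stated assumptions, not the authors' implementation: genotypes are coded as 0/1/2 copies of one allele, the per-locus distance is taken as the absolute difference of those counts, and Ward's criterion is applied through the Lance-Williams update on squared dissimilarities. The function names and the simulated, strongly differentiated toy populations are hypothetical.

```python
import random


def asd(g1, g2):
    """Allele sharing distance between two genotype vectors coded 0/1/2.

    Per locus: 0 if both alleles are shared, 1 if one is shared, 2 if none.
    """
    return sum(abs(a - b) for a, b in zip(g1, g2)) / len(g1)


def asd_matrix(genotypes):
    """Symmetric matrix of pairwise ASD values."""
    n = len(genotypes)
    d = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            d[i][j] = d[j][i] = asd(genotypes[i], genotypes[j])
    return d


def ward_clusters(dist, k):
    """Agglomerate until k clusters remain, using Ward's Lance-Williams update.

    Assumption: dissimilarities are squared before the update is applied.
    """
    clusters = {i: [i] for i in range(len(dist))}
    d2 = {(i, j): dist[i][j] ** 2 for i in clusters for j in clusters if i < j}
    next_id = len(dist)
    while len(clusters) > k:
        (a, b), d_ab = min(d2.items(), key=lambda kv: kv[1])
        n_a, n_b = len(clusters[a]), len(clusters[b])
        merged = clusters.pop(a) + clusters.pop(b)
        updated = {}
        for c, members in clusters.items():
            n_c = len(members)
            d_ac = d2[(min(a, c), max(a, c))]
            d_bc = d2[(min(b, c), max(b, c))]
            updated[(c, next_id)] = (
                (n_a + n_c) * d_ac + (n_b + n_c) * d_bc - n_c * d_ab
            ) / (n_a + n_b + n_c)
        # Drop every distance involving the two merged clusters.
        d2 = {key: v for key, v in d2.items() if a not in key and b not in key}
        d2.update(updated)
        clusters[next_id] = merged
        next_id += 1
    return sorted(sorted(members) for members in clusters.values())


def simulate(freqs, n, rng):
    """Draw n diploid genotypes, two independent alleles per locus."""
    return [[(rng.random() < p) + (rng.random() < p) for p in freqs]
            for _ in range(n)]


rng = random.Random(0)
n_loci = 1000
freqs_a = [rng.uniform(0.1, 0.9) for _ in range(n_loci)]
freqs_b = [rng.uniform(0.1, 0.9) for _ in range(n_loci)]  # independent of freqs_a
genotypes = simulate(freqs_a, 8, rng) + simulate(freqs_b, 8, rng)
clusters = ward_clusters(asd_matrix(genotypes), 2)
```

No allele frequencies are estimated at any point; the clustering works from the pairwise distances alone, which is the property the paragraph emphasizes. In R, the analogous step would typically use `hclust`, whose Ward variants differ in whether the input dissimilarities are squared.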

The advantage of our algorithm is that, relative to model-based methods, it is fast. It can therefore be applied to data sets with large numbers of individuals and SNPs, such as those in genome-wide association studies. It took less than one minute on a desktop computer (P4 3.0G CPU with 1 GB memory) to separate CHB and JPT using 20K random SNP loci. Kruglyak (1999) estimated that approximately 500,000 uniformly distributed markers are required for genome-wide association studies. It would be valuable if we could make full use of these SNP markers to guard association studies against population structure. The AW-clust algorithm provides a possible solution to this scenario. The AW-clust algorithm does not assume Hardy-Weinberg equilibrium or linkage equilibrium among the sampled individuals, it requires no special marker selection criterion, and it is robust to relatively small sample sizes.

One drawback with distance-based methods is that the distance measure and the clustering algorithm are somewhat arbitrary. If another definition of distance or another clustering method is used, the clustering results may change, which may be a reason, in addition to the relatively small number of SNP loci used, why some authors did not see a separation of Chinese and Japanese sample individuals in previous studies [9, 10]. A problem with Ward's minimum variance method is that it may or may not give the minimum possible error sum of squares over all possible sets of *K* clusters from the data units. However, Ward's solution is generally very good even if it is not optimal on this criterion [25]. We found that classical multidimensional scaling (MDS) [26], available in the standard statistical software package R as the *cmdscale*() function, can also be used to determine the ethnic clusters in the second stage of the AW-clust algorithm. Distance-based methods are criticized for being more suited to data exploration than to statistical inference [11]. However, we believe AW-clust could be effectively used as a first step in statistical inference. After using AW-clust to identify the major clusters, we can use Bayesian methods to calculate the posterior probabilities of individuals belonging to each cluster. A general challenge for population structure analysis is to derive the correct number of subpopulations, *K*, and it is no different for the AW-clust algorithm. We view *K* as a variable instead of a fixed number and let researchers determine the most appropriate level of separation. For example, CHB and JPT are often grouped together for data analysis [17, 27]. However, these two samples can be separated and fall in different clusters using AW-clust, as shown in the results.
Whether to treat them as one group or two is subjective, and *K* should be defined to fit the researcher's interests, as far as the data permit. One possible alternative to the subjective definition of *K* is to define it using the gap statistic. It should be noted, however, that estimating *K* is still more art than science and depends on many factors, such as population distance, number of individuals in each population, number of markers, random replicates and the method used.
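The classical MDS step mentioned above can be illustrated without R. The sketch below follows the standard construction (double-centre the squared distances, then take the leading eigenvectors scaled by the square roots of their eigenvalues), which is also what `cmdscale` is documented to do. It is a pure-Python approximation under assumptions: eigenvectors are found by power iteration with deflation rather than a full eigendecomposition, and the function name and the three-point input are hypothetical.

```python
import math


def classical_mds(dist, dims=2, iters=1000):
    """Embed n items in `dims` dimensions from a symmetric distance matrix."""
    n = len(dist)
    sq = [[dist[i][j] ** 2 for j in range(n)] for i in range(n)]
    row_mean = [sum(r) / n for r in sq]
    grand_mean = sum(row_mean) / n
    # Double centring: B = -1/2 * J * D^2 * J, with J the centring matrix.
    b = [[-0.5 * (sq[i][j] - row_mean[i] - row_mean[j] + grand_mean)
          for j in range(n)] for i in range(n)]
    coords = [[0.0] * dims for _ in range(n)]
    for d in range(dims):
        # Deterministic start vector, normalised to unit length.
        v = [math.sin(i + d + 1.0) for i in range(n)]
        length = math.sqrt(sum(x * x for x in v))
        v = [x / length for x in v]
        for _ in range(iters):
            w = [sum(b[i][j] * v[j] for j in range(n)) for i in range(n)]
            norm = math.sqrt(sum(x * x for x in w))
            if norm < 1e-12:      # nothing left to extract
                break
            v = [x / norm for x in w]
        bv = [sum(b[i][j] * v[j] for j in range(n)) for i in range(n)]
        lam = sum(v[i] * bv[i] for i in range(n))  # Rayleigh quotient
        if lam <= 1e-12:
            # Remaining axes are left at zero. Power iteration can also stop
            # here if a negative eigenvalue dominates (non-Euclidean input).
            break
        for i in range(n):
            coords[i][d] = v[i] * math.sqrt(lam)
        # Deflate so that the next pass finds the next eigenvector.
        for i in range(n):
            for j in range(n):
                b[i][j] -= lam * v[i] * v[j]
    return coords


# Three items lying on a line at positions 0, 1 and 3.
line = [[0.0, 1.0, 3.0],
        [1.0, 0.0, 2.0],
        [3.0, 2.0, 0.0]]
xy = classical_mds(line)
```

Applied to an ASD matrix, the first two coordinates can be plotted to inspect candidate clusters visually; the overall sign of each axis is arbitrary.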

It is easy to separate genetically distant ethnic groups, such as CEU (60), YRI (60) and CHB (45) + JPT (44) in the HapMap data, or AA (23), EA (24) and HC (24) in the Perlegen data (sample sizes are in parentheses). Both STRUCTURE (version 2.1) and the AW-clust algorithm gave very good classification in these situations using several hundred random SNP loci. However, the burn-in period and number of iterations in STRUCTURE may not be easily decided. Different authors used different settings [8, 11–13, 20, 21, 28]. In our tests, the settings needed to recover the correct number of populations, *K*, appeared to depend on the sample size, the number of loci, and the genetic distances among populations. STRUCTURE easily found the correct *K* = 3 with 5,000 burn-in followed by 1,000 iterations when we tested it on CEU (60), YRI (60) and CHB (45) + JPT (44) using 200 random SNP loci. However, when we reduced the CHB sample size to 10 (CEU (60), YRI (60) and CHB (10)), we needed to set 10,000 burn-in and 10,000 iterations for STRUCTURE in order to get the correct *K* using 200 random SNP loci. Another challenging situation for STRUCTURE is fine-scale population structure detection, such as that between Chinese and Japanese. STRUCTURE assumes that marker loci are in linkage equilibrium within subpopulations [11], which theoretically puts a restriction on the number of SNP loci that we can use from the human genome data. Even when we ignored this assumption and used 10,000 random SNPs in STRUCTURE, it did not separate CEU (60), YRI (60), CHB (45) and JPT (44) simultaneously, since CHB and JPT were predicted to be in one cluster. This contrasts with the AW-clust algorithm, which separated CHB and JPT into two different clusters. When we chose to run STRUCTURE on the pooled CHB (45) and JPT (44) using 5,000 random SNP loci with a 5,000 burn-in period followed by 1,000 iterations, STRUCTURE identified two major clusters from the posterior probabilities of *K*.
But when we reduced the sample size for JPT from 44 to 20, STRUCTURE failed to identify the correct *K*, even with 10,000 iterations after a burn-in period of 10,000. AW-clust identified two major clusters for CHB and JPT in the above situations. Neither of the two methods worked well when we reduced the sample size for JPT from 44 to 10 individuals using 5,000 random SNP loci. However, AW-clust created two discrete clusters for CHB (45) and JPT (10) when we increased the number of SNP loci to 20,000. In fine-scale population structure situations, STRUCTURE may or may not find the correct number of clusters, *K*, using random SNP loci, especially when a population has a relatively small sample size, such as 5 or 10 individuals, and it also consumes considerable computing time. If accurate AIMs are available, a likelihood approach should work well when predicting an individual's ethnicity [29].

STRUCTURE usually requires multiple runs to check the convergence of MCMC (STRUCTURE manual), which requires substantial computing time when a large number of individuals and SNP loci are used. The estimate of admixture proportions for each individual, *Q*, is considered an advantage of STRUCTURE. However, in our test on the Perlegen data set (AA is an admixed population), the inferred ancestry of individuals appeared sensitive to the number of SNP loci and the sample sizes used. For example, we ran STRUCTURE with 50,000 burn-in followed by 50,000 iterations on AA (23), EA (24) and HC (24) using 5,000 random SNP loci. All EA individuals were predicted to have ~10% membership with HC, while most of the predicted HC membership in EA individuals went away when we used only 1,000 random SNP loci. With 50,000 burn-in and 50,000 iterations, nearly half of the AA individuals were predicted to have some EA membership, the majority of the EA individuals were predicted to have nearly pure EA membership, and most of the HC individuals were predicted to have nearly pure HC membership for the data AA (23), EA (24) and HC (24) using 1,000 random SNP loci. Most of the AA individuals' EA membership either disappeared or was predicted to be much smaller when we reduced the sample size of AA from 23 to 5 and kept EA (24) and HC (24) using 1,000 random SNP loci. But when the sample size setting was AA (23), EA (5) and HC (24), EA individuals showed ~20% membership with HC. The apparent advantage of the AW-clust algorithm over STRUCTURE is in fine-scale population structure detection with small sample sizes, since a large number of SNP loci can be used and a relatively short computing time is required. The correct number of populations may be easily identified from the major clusters in the hierarchical plot rather than through multiple runs of rough estimation from MCMC posterior probabilities of *K*, which depend on many factors, such as the length of burn-in, the number of iterations, convergence, and the number of populations in the sample.