A new model calling procedure for Illumina BeadArray data

Background Accurate genotype calling for high throughput Illumina data is an important step to extract more genetic information for a large scale genome wide association studies. Many popular calling algorithms use mixture models to infer genotypes of a large number of single nucleotide polymorphisms in a fast and efficient way. In practice, mixture models are mostly restricted to infer genotypes for common SNPs where their minor allele frequencies are quite large. However, it is still challenging to accurately genotype rare variants, especially for some rare variants where the boundaries of their genotypes are not clearly defined. Results To further improve the call accuracy and the quality of genotypes on rare variants, a new model calling procedure, named M-D, is proposed to infer genotypes for the Illumina BeadArray data. In this calling procedure, a Gaussian Mixture Model and a Dirichlet Process Gaussian Mixture Model are integrated to infer genotypes. Conclusions Applications to Illumina data illustrate that this new approach can improve calling performance compared to other popular genotyping algorithms.


Background
Genome-wide association studies (GWAS) have been designed to discover many causal genetic variants contributing to human diseases [1,2]. The success of GWAS relies heavily on the International HapMap Project where millions of single nucleotide polymorphisms (SNPs) have been widely identified on SNP arrays [3,4]. With the rapid development in biotechnology, a leading producer, Illumina [5], is capable of offering SNP arrays with tremendously wide coverage of genetic variants in a fast and cost efficient way. A number of high dimensional intensity data are generated by this manufacturer, and various powerful genotyping algorithms are imperatively needed to accurately infer genotypes. Recently, several popular calling algorithms have been designed for Illumina platform, such as: BEAGLE with BEAGLECALL software [6], CRLMM [7,8], GenCall [9], GenoSNP [10], and Iluminus [11]. In general, Illumina chip catalogs millions of SNPs and processes a large number of parallel samples, and the genotyping algorithms for the Illumina data is of the main interest.
With the application of single base extension (SBE) biochemistry technology [12], the Illumina data measures the pair of intensities with two alleles (A and B) at every SNP for each individual. Typically, a SNP with alleles A and B makes three possible genotype clusters, named AA, AB, and BB, and all possible genotypes of each SNP are called by various genotyping algorithms. One strategy is the population-based approach through which genotypes of all individuals within a SNP are inferred at one time, but its calling performances highly depend on the size of population. Thus, this method is not applicable for rare SNPs with low minor allele frequency (MAF). Another approach, GenoSNP, is designed to infer all SNP genotypes within one individual simultaneously, and is referred to as a SNP-based calling method. The applicability of this algorithm [10] relies on the assumptions that response features of all probes are similar. Compared to the population-based method, it would be unnecessary to collect a large number of samples for rare SNP calling due to the availability of high density SNPs. Unfortunately, this method leads to a larger proportion of SNPs breaking the Hardy-Weinberg (HW) principle which violates the assumption that commonly occurs in practice.
Most of the predominant calling algorithms employ the mixture models [13][14][15] to infer three genotype clusters. In particular, the mixture models developed from the population-based strategy work well for common SNPs but gradually lose their effectiveness for rare variants. To improve calculation accuracy, the mixture models need a sufficient number of observations in each genotype cluster to precisely estimate parameters. However, rare SNPs always contain a small number of individuals in one or two genotype clusters, and some rare SNPs with extremely small values of MAF may lose one or two clusters. This phenomenon creates two problems: (1) the number of components for rare SNPs is uncertain; (2) the boundaries of some genotype clusters are not clear for rare SNPs with sparsely populated observations. The problem about developing better inference for rare SNPs motivates the use of the Dirichlet Process (DP) Gaussian Mixture Model (GMM) [16][17][18]. One popular application of DP is clustering in the fields of brain imaging, information retrieval and genetics. To successfully perform a cognitive task, DP has been applied to analyze activation structures in functional magnetic resonance imaging [19]. DP has also been used to model relationships among documents in the field of information retrieval [20,21]. For better understanding of ancestry history in the genetic study, DP was smoothly adopted to identify the sets of haplotypes corresponding to subpopulation [21]. Due to its good characteristics in clustering, this paper extends DP model to the genotyping area. Specifically, a DP prior plays a critical role in clustering data through defining a mixture model with a variable number of components. More importantly, its clustering and discreteness properties allows an easy partitioning of the data into different groups, even though some observations lack clear cluster membership. Besides, empirical studies have showed that GenoSNP can improve the genotyping quality for rare variants through calling a large number of SNPs within one individual. However, the genotype clusters implemented by GenoSNP may be in a shift away from their expected positions, which could result in many SNPs breaking the HW principle [5]. For a DP Gaussian Mixture Model (DP-GMM), its model selection procedure is based on a rich-gets-richer phenomenon [17], which indicates that the cluster with an extremely small number of observations is still toughly estimated. A reference SNP selection step [22] is incorporated here to infer genotypes of rare SNPs with extremely low MAF, and this new method may solve the HW principle problem.
In this paper, a new model calling procedure (M-D) is an approach that is made up of two models and one SNP selection procedure, namely Gaussian Mixture Model, DP Gaussian Mixture Model, and reference SNP selection. In brief, this method partitions SNPs into three groups in terms of the SNP's MAF and the sample size of each cluster. In this method, three models are applied in three groups individually. The performance of M-D is evaluated through comparison with other genotyping algorithms for Illumina BeadArray data.

Illumina BeadArray data
The Illumina Omni BeadArray chip collects over one million SNPs per sample, and increasingly covers the newly identified variants. In the probe design, every beadtype that is capable of assaying two SNP alleles represents a SNP [12]. A large number of beadpools that include millions of beadtypes results in the ultimate production of the Illumina microarray. Here, Illumina data measures the pair of raw intensity at each beadtype for every sample, and the genotype clusters are estimated at this scale.

Statistical models Model I: Gaussian mixture model (GMM)
The pair of raw intensity x is = (r is , g is ) for the ith individual at the sth SNP is the basic measurement. Within one SNP, all subjects' intensity data may fall into three genotype clusters corresponding to three genotypes (AA, AB, BB) and one null component which collects the abnormal raw intensity measurements. Model I is a Gaussian Mixture Model [23] that is applied to the basic measurement x is . In principle, this model assigns each pair of raw intensities x is to one of the components with probability π ks for k = 1, 2 or 3. The relevant latent genotype class is measured by an indicator variable z is generated from a multinomial distribution (Mult 3 ) where z is = 1, 2 or 3. Then this Gaussian Mixture Model can be expressed as: where n s is the total number of individuals observed at the sth SNP, and S is the total number of SNPs. denotes a normal density with mean μ ks and variance-covariance matrix ks in the kth component at the sth SNP; all pairs of raw intensity within the sth SNP are measured by x s =(x 1s , x 2s ,..., x n s s ); the unknown parameters of the GMM is denoted by s =(π s , μ s , s ) where π s =(π 1s , π 2s , π 3s ), μ s =(μ 1s , μ 2s , μ 3s ), and s =( 1s , 2s , 3s ).
The maximum likelihood estimates of the parameters are inferred [23]. For the indicator variable z is = k, the (t + 1)th iteration is estimated by The iterative estimates of mean μ ks and variancecovariance matrix ks are expressed as, Two measurements Posterior Rate (PR: p k is ) and the Average Posterior Rate (APR: p s ) for the sth SNP are adopted to assess the quality of SNP calling [22]. Specifically, PR quantifies the strength of every individual's cluster signal, and APR gives the average strength of all individuals at the sth SNP [22].
is a conditional probability of the ith individual given that this subject is assigned to the kth cluster, and n ks is the sample size of the kth cluster at the sth SNP.

Model II: Dirichlet Process Gaussian mixture model (DP-GMM)
Model I is a fast and efficient genotyping model for SNPs having large values of MAF. In real experiments, many SNPs with low MAF may result in the disappearance of one or two genotype clusters. Also even though some SNPs with low MAF display three genotype groups, some clusters may lack sufficient data to support and recognize. In this case, Model II, DP Gaussian Mixture Model, is motivated by the need to carry out the model selection for SNPs with an uncertain number of genotype clusters [24]. Generally speaking, this is a nonparametric Bayesian method that potentially allows a flexible number of mixture components and also provides estimates for the mixture component parameters and the relevant mixing proportions.
A DP Gaussian Mixture Model [24] fits the pair of raw intensity x is into K-component Gaussian Mixture Model with K approaching a large number. The model is expressed as, where K is the total number of clusters. s =(π s , μ s , s ) denotes the unknown parameters at the sth SNP where π s =(π 1s , ..., π Ks ), μ s =(μ 1s , ..., μ Ks ), and s =( 1s , ..., Ks ).
Generally, the number of observations within the sth SNP (n s ) are partitioned into K components (n 1s , n 2s , ..., n Ks ) with relevant mixing proportions (π 1s , π 2s , ..., π Ks ). The distribution of n 1s , n 2s , ..., n Ks follows a multinomial distribution and its probability mass function is written by, p(n 1s , n 2s , . . . n Ks |π 1s , π 2s , . . . , π Ks , n s ) = n s ! n 1s ! n 2s ! . . . n Ks ! K k=1 π n ks ks (6) where n s = K k=1 n ks denotes the total number of individuals at the sth SNP. Then each pair of raw intensity for the sth SNP x is has its own indicator z is (i = 1, ...,n s ), and the distribution of indicator variables is expressed as, p(z 1s , z 2s , . . . z n s s |π 1s , π 2s , . . . , π Ks ) = K k=1 π n ks ks (7) The model can then be expressed as: where α is the DP concentration parameter and can be thought as the inverse variance of DP. The distribution of the reciprocal of α follows a Gamma distribution with 1 degree freedom and mean 1. K is the maximum number of clusters, then π s is distributed with a symmetric Dirichlet distribution with parameter α K . m and r are hyperparameters being the mean and relative precision of μ ks , and the hyperparameters ν and S −1 are degrees of freedom and inverse mean of R ks where R ks follows a Wishart distribution with parameters ν and S −1 , respectively.
The inference on Model II relies on the posterior distribution of each parameter conditional on all other parameters, then the parameters, hyperparameters and indicator variables are repeatedly sampled from their posterior distributions. In particular, the conditional posterior probabilities are proportional to the likelihood function multiplying priors. Then the posterior probabilities of the cluster indicator variable z is conditional on all other variables are expressed as: if k is an existing cluster, and n −i,ks > 0 α n s −1+α p(x is |μ ks , R ks )p(μ ks , R ks |m, r, ν, S)dμ ks dR ks if k is a new cluster (9) Note that p(x is |μ ks , R ks ) and p(μ ks , R ks |m, r, ν, S) are the likelihood function and the joint function of parameters (μ ks and R ks ), respectively. Once the optimal genotype clusters and their relevant component parameters are obtained, two measurements Posterior Rate (PR) and the Average Posterior Rate (APR) measuring the quality of the sth SNP can be calculated in the similar way.

Model III: Dirichlet Process Gaussian mixture model with reference SNP selection (DP-Ref)
A DP Gaussian Mixture Model with reference SNP selection step (DP-Ref ) combines the benefits of the population-based method with the SNP-based approach. In this context, the reference SNP selection plays an important role in determining the effectiveness of Model III. A reference SNP is referred to as a good quality SNP providing sufficient information about three genotypes clusters, thus each SNP in the third group will be called with assistants of the carefully selected reference SNP. Practically, the final reference SNP is selected by a threestep procedure [22]. Through out this section, each SNP in the third group is denoted as the "T-SNP" that needs to be called with the support of a reference SNP, and the final reference SNP having good quality is defined as "R-SNP" Step I. High MAF SNPs are selected as candidate reference SNPs. In fact, SNPs with large MAF (> 0.15) before the T-SNP are selected as R1-SNPs.
Step II. Good clustering property SNPs from R1-SNPs are further selected (denoted as R2-SNPs). This step requires three genotype clusters of R1-SNPs to contain at least 10 % of entire observations individually.
Step III. A SNP from R2-SNPs being remarkably similar to the T-SNP is selected (denoted as R-SNP). The resemblance between the T-SNP and each R2-SNP is measured by the cluster distance D t [22]. For simplifying the calculation, two dimensional raw intensity vector x is is projected to an univariate variable y is [11], and the T-SNP and all R2-SNPs are classified into three genotype clusters in terms of this univariate variable.
Empirical studies show that the initial cutoffs dividing the univariate variable y s = (y 1s , ..., y n s s ) T into three clusters can be fixed as 0.5 and −0.5. The cluster label of each individual would be roughly determined by the above equation.
We select one SNP from the third group as the T-SNP, then the cluster measure (D t ) is to find the minimum distance between the T-SNP and R2-SNPs [22]. The SNP from R2-SNP gives the minimum distance will be the R-SNP. The cluster measure is calculated by, Note that is the set of R2-SNPs selected for the T-SNP; x kt and kt are the raw intensity vector and variancecovariance matrix in the kth cluster for the T-SNP; μ kd and kd are the mean and variance-covariance matrix of the dth R2-SNP; In brief, The final R-SNP will provide sufficient clusters information to assist the testing T-SNP.
A new augmented vector is generated by combining the T-SNP with the final reference SNP, where d ∈ , and the second model DP-GMM (Eqs. 6-9) will be applied to the combined raw intensities to identify the genotype clusters through the aid of the reference SNP.

Application of new model (M-D)
This section focuses on the application of M-D. Specifically, entire SNPs are classified into three groups, and an appropriate model is selected to fit in each group. The classification standard relies on the calculations of MAF and the sample size of each genotype cluster through Model I. The reason for choosing this model is that GMM can quickly estimate the SNP's MAF and the sample size of each genotype cluster. Other advanced models (Model II and III) will be applied to the selected SNPs with small MAFs. According to this calling procedure, any SNP will be classified by, x s ∈ g 2 if MAF < 0.05 and b 1 ≤ n ks < b 2 for any one of clusters. x s ∈ g 3 otherwise.
Note that n ks is the sample size of the kth cluster at the sth SNP. The first group (g 1 ) collects SNPs with high MAF (≥ 0.05), and a large proportion of SNPs is in this group. The second group (g 2 ) includes SNPs with low MAF (< 0.05) and a certain number of subjects in either existing genotype clusters. In this study, b 1 and b 2 are fixed as 3 and 10 to determine the number of SNPs recruited in g 2 . The last group (g 3 ) collects the rest SNPs with low MAF and a small number of observations in one or two genotype clusters. In fact, SNPs in g 1 can display three genotype clusters (one major homozygote, one minor homozygote and one heterozygote) with a large number of subjects in each cluster. The rest poor SNPs with low MAF are contained in g 2 and g 3 where some SNPs may not display three genotype clusters, either one or two clusters disappear and the existing cluster may contain very few observations. In particular, the classification between g 2 and g 3 is not fixed, and scientists can easily manage the allocation of SNPs between g 2 and g 3 through adjusting the values of b 1 and b 2 .
The proposed new calling procedure is based on the partitions of SNPs.
⎧ ⎨ ⎩ g 1 : GMM g 2 : DP-GMM g 3 : DP-Ref (13) In the first group, Model I (GMM) is applied to genotype SNPs. A sufficient number of observations are observed in three genotype clusters, which will greatly help the genotyping procedure identify the boundary of each cluster. In the second group, SNPs with low MAF, Model II (DP-GMM) can implement the model selection to search the appropriate number of clusters for each SNP, and DP's clustering and discreteness properties assures the optimum partition of observations, even for a small number of observations in a genotype cluster. In the third group, the number of genotype clusters for each SNP is uncertain and an extremely small number of observations are observed in either one or two clusters. In this case, applying a DP-GMM alone for clustering is not enough due to a rich-gets-richer phenomenon [17] where the larger genotype cluster can greatly attract sparsely populated observations that originally belong to another cluster. In view of this situation, the reference SNP strategy [22] is applied to help DP-GMM call rare SNPs (DP-Ref ). More importantly, the selection of models can be determined through adjusting b 1 and b 2 in Eq. 12. For example, when b 1 takes a large value, a smaller proportion of rare SNPs may enter g 2 and more rare SNPs are allocated to g 3 , thus GMM and DP-Ref will become major methods. If b 2 takes a large value, a larger proportion of rare SNPs may be assigned to g 2 , then GMM and DP-GMM will become main methods. This flexible option provides more solutions for scientist who are interested in this genotyping method.
In this study, b 1 and b 2 are fixed as 3 and 10, then 88.6 % of SNPs are in g 1 , 4.03 % and 7.37 % of SNPs will be assigned to g 2 and g 3 , respectively. More importantly, DP Gaussian Mixture Model is powerful to infer the cluster containing a certain number of observations, thus Fig. 1 displays the genotyping results of three SNPs inferred by DP-GMM (rs1003505 MAF: 0.0479, rs1004262 MAF: 0.0404, rs1009148 MAF: 0.0439). For the extremely rare variants in g 3 , DP-Ref is used to infer genotypes (rs10084633 MAF: 0.0166, rs1003945 MAF: 0.0118, rs1008185 MAF: 0), and the calling results are summarized in Fig. 2. To clearly illustrate the effect of the reference SNP on rare SNP calling in g 3 , Fig. 3 displays how the reference SNP help rare SNP be genotyped. It is clearly seen that our model could actively infer genotypes of rare SNPs under the support of the reference SNP.

Illumina BeadArray data description
The proposed method M-D is applied to an Illumina data consisting of 1 million SNPs and 3258 samples.

Results
The performances of three calling algorithms are compared in terms of the call rate genotypes that can be inferred genotypes that are supposed to be genotyped and the concordance rate measuring the genotype agreement between any two algorithms. The overall comparison results are given in Table 1. It is clearly seen that  genotypes inferred from M-D, GenCall and GenoSNP are highly consistent, and genotypes from M-D are more consistent with those inferred from GenCall (99.93 %), than those from GenoSNP (99.65 %). This is because M-D is a population-based method in a wide sense, and the model selection of a DP and the reference SNP selection step in M-D greatly improve its call accuracy and call rate ( Table 1). Because most samples in this Illumina data are collected from the hospital, true genotypes of these sample are not known totally, so the high agreement among 3 algorithms can not tell us which method performs best. Fortunately, 141 HapMap samples are contained in this data, and the true genotypes of these HapMap samples are explored by the HapMap project as a gold standard. Table 2 provides the comparison results between each discussed method and a gold standard in terms of the call accuracy and the call rate. In brief, M-D gives the best call accuracy and the largest call rate, followed by GenoSNP and GenCall. For example: the largest call rate is achieved by M-D (99.78 %), followed by GenoSNP (99.14 %) and GenCall (96.79 %). Moreover, M-D offers the best call accuracy (99.44 %), followed by GenoSNP (98.52 %), and GenCall (96.63 %).
Compared to the population-based methods (GenCall) and the SNP-based approaches (GenoSNP) [9,10], the new model (M-D) is expected to perform better because it integrates a model selection step of DP and predominance of the population-based and the SNP-based strategies. In this study, SNPs are classified into 3 groups according to Eq. 12, and the comparison results corresponding to these 3 groups are summarized in Table 3. In brief, M-D gives the best call accuracy and largest call rate, followed by GenoSNP and GenCall. In particular, g 2 and g 3 collects   Table 4. A SNP-based method, GenoSNP, considers all SNPs calls within a sample at a time to improve genotyping quality for rare variants, but a large number of SNPs corresponding to four populations break the HW Table 4 Comparisons of Hardy-Weinberg Equilibrium test among GenCall, GenoSNP and M-D principle. In contrast, GenCall applies the populationbased strategy to call all individuals within one SNP, so the calling results are less biased, and a small number of SNPs fail the HWE test. M-D is also a population-based model in a wide sense, and the quality of SNP calls is much better than that from GenoSNP, a moderate number of SNPs break the HW principle. In summary, M-D performs well on genotyping rare variants and controlling the quality of SNPs.

Discussion
The principle of a DP Gaussian Mixture Model is to run a model selection procedure to explicitly estimate the number of components for rare variants. The concentration parameter measures the inverse variance of DP, which suggests that a larger concentration parameter implies an increasing number of components [17]. It brings a new problem of how to select the appropriate strength of the prior to control the number of components. In particular, this parameter is sensitive to SNPs where sparsely populated observations are in one or two components. There might be better ways to define this parameter to help the DP Gaussian Mixture Model more efficiently call genotypes for rare variants.
The DP mixture model incorporates the reference SNP selection step to take advantage of the population-based strategy and the SNP-based strategy for improving the missing rate and call accuracy for rare SNPs. The successful application of M-D is also based on the selection of the reference SNP across the genome. In practice, it is difficult to search the reference SNP from the entire genome due to the heavy calculation burden. In these cases, the instrumental SNPs before the testing SNP are picked out. When some probes break the assumption about identical probe responses for various SNPs, searching the best reference SNP is still challenging. In particular, the method about accurately measures the similarity between the testing SNP and the reference SNP still needs to be improved.

Conclusion
One classical genotyping approach is the populationbased method, GenCall, and it requires a large number of observations to achieve a nice call accuracy. When the increasing number of rare variants are commonly identified on the large scale Illumina array, it is extremely difficult to successfully call genotypes for rare variants. A SNP-based method, GenoSNP, was designed to solve this challenging problem, but many more SNPs inferred from GenoSNP break the HW principle. In this paper, a new model calling procedure (M-D) is proposed to take benefits of a model selection step of a DP and the advantage of GenCall and GenoSNP to improve the quality of rare SNP calls. In brief, the new model calling procedure partitions SNPs into three classes in terms of MAF and the sample size of each cluster, and a DP Gaussian Mixture Model with or without reference SNP selection are applied to rare SNPs with low MAF. The finest performance of M-D is evaluated by comparing genotypes inferred by each discussed calling method to those from the HapMap project. Compared to GenCall and GenoSNP, M-D performs better on genotyping rare SNPs, and it also infers better quality of SNP calls than that from GenoSNP.