Multiview singular value decomposition for disease subtyping and genetic associations
 Jiangwen Sun^{1},
 Jinbo Bi^{1}Email author and
 Henry R Kranzler^{2}
https://doi.org/10.1186/147121561573
© Sun et al.; licensee BioMed Central Ltd. 2014
Received: 29 January 2014
Accepted: 6 June 2014
Published: 17 June 2014
Abstract
Background
Accurate classification of patients with a complex disease into subtypes has important implications for medicine and healthcare. Using more homogeneous disease subtypes in genetic association analysis will facilitate the detection of new genetic variants that are not detectible using the nondifferentiated disease phenotype. Subtype differentiation can also improve diagnostic classification, which can in turn inform clinical decision making and treatment matching. Currently, the most sophisticated methods for disease subtyping perform cluster analysis using patients’ clinical features. Without guidance from genetic information, the resultant subtypes are likely to be suboptimal and efforts at genetic association may fail.
Results
We propose a multiview matrix decomposition approach that integrates clinical features with genetic markers to detect confirmatory evidence for a disease subtype. This approach groups patients into clusters that are consistent between the clinical and genetic dimensions of data; it simultaneously identifies the clinical features that define the subtype and the genotypes associated with the subtype. A simulation study validated the proposed approach, showing that it identified hypothesized subtypes and associated features. In comparison to the latest biclustering and multiview data analytics using reallife disease data, the proposed approach identified clinical subtypes of a disease that differed from each other more significantly in the genetic markers, thus demonstrating the superior performance of the proposed approach.
Conclusions
The proposed algorithm is an effective and superior alternative to the disease subtyping methods employed to date. Integration of phenotypic features with genetic markers in the subtyping analysis is a promising approach to identify concurrently disease subtypes and their genetic associations.
Keywords
Genotypephenotype association Multiview data analysis Subtyping Biclustering Matrix decompositionBackground
For complex diseases, such as substance dependence or psychiatric disorders, a variety of clinical features that collectively indicate or characterize the disease phenotype often vary substantially among individuals [1]. Studies of genetic association or those that aim to match patients with certain treatments for a complex disease can be impeded by this phenotypic heterogeneity [2]. Casecontrol association studies based on a binary trait, such as the diagnosis of a disease, which partitions the population into cases (subjects with the disease) and noncases (subjects without the disease), cannot differentiate the heterogeneous manifestations of the disease. Although many candidate genes or genomic regions have been associated with complex diseases [3], the characteristics or subtypes of the disease for which the association exists remain to be specified. For instance, the specific addictive behaviors that underlie the associations with candidate genetic variants need to be elucidated to clarify the risk for addiction [4].
Classification of a complex disease into homogeneous subcategories or subtypes may help to identify the genetic variants contributing to the effect of the subphenotypes [5, 6]. However, prior studies have been limited to unsupervised cluster analysis or latent class analysis on clinical features to derive subtypes. Genotypic data have only been used to evaluate the validity of subtypes, such as in subsequent association tests with the derived subtypes, rather than to guide the creation of the subtypes. Consequently, the resultant subtypes may be of limited utility in genetic association analysis. Integration of data from both clinical and genomic dimensions also offers opportunities to find confirmatory evidence of a subtype based on both its genetic and clinical features. A few studies have examined the joint use of gene expression and genotypic data for cancer subtyping [7, 8], but they did not identify a variable subspace (or a subset of features) in each data source so as to group subjects consistently across the two subspaces. Hence, they could not detect genetic variants associated with the identified clusters.
There has also been little research on this topic in the statistics literature. The most relevant area involves coclustering [9] or multiview data analysis [10], where samples are characterized or viewed in multiple ways, thus creating multiple sets of input variables. There are two types of coclustering methods: (1) biclustering, also called twomode clustering [11, 12], which simultaneously clusters the rows and columns of a data matrix and (2) multiview coclustering [9, 13], which seeks groupings that are consistent across different views. Biclustering is similar to another set of algorithms that search for subspaces and group subjects differently in each subspace [14].
Biclustering and subspace searching essentially identify different subgroups of subjects using different features (or markers), thus helping to identify genetic variants specific to a particular subgroup. However, this method can only be applied to one data matrix from a single view rather than data jointly from multiple views. Multiview coclustering, on the other hand, seeks a grouping of subjects that is consistent across different views (i.e., different sets of features), but the resultant clusters are defined using all of the available features, e.g., all of the studied genetic markers. Hence, it cannot be used to identify subtypespecific variants/features. Thus, to address our subtyping problem, we not only partitioned subjects in such a way that the subgroups differed in both clinical features and genetic markers, but also included a subspace search to identify the specific features or markers that defined the subgroups.
In this paper, we propose a multiview matrix decomposition approach based on the sparse singular value decomposition (SSVD) technique [12] to classify a complex disease into subtypes using data both from the clinical and genetic views. The objective of this problem is to identify subject clusters that agree in the clinical and genetic views, and simultaneously identify features and markers that are associated with the clusters. Employing the sparse SVD in our approach is critical to its success, especially in terms of successfully detecting associated variants given that the number of truely associated variants are much fewer than the number of single nucleotide polymorphisms (SNPs) in the whole genome. The proposed approach was validated on synthetic datasets that were simulated to have subtype structures and several genetic markers associated with the subtypes and a real world clinical dataset that was aggregated from multiple genetic studies of substance dependence. We compared our approach to a biclustering approach [12] and the latest multiview data analytics methods [9]. The results clearly show that the performance of our approach is superior to that of all other available methods.
Methods
We start with a presentation of the notations that are used throughout the paper. A vector is denoted by a bold lower case letter as in v and ∥v∥_{ p } represents its ℓ_{ p }norm, which is defined by ∥v∥_{ p }=(v_{(1)}^{ p }+⋯+v_{(d)}^{ p })^{1/p}, where v_{(j)} is the jth component of v and d is the length of v, i.e., the total number of components in v. We use ∥v∥_{0} to represent the socalled 0norm of v that equals the number of nonzero components in v. Denote u⊙v the componentwise (Hadamard) products of u and v. The set ${\mathcal{\mathcal{B}}}_{d}$ contains all binary vectors of length d. A binary vector is a vector whose components equal either 0 or 1. A matrix is denoted by a bold upper case letter, e.g., M_{n×d} is a nbyd matrix, and ∥M∥_{ F } is its Frobenius norm defined by (t r(M^{ T }M))^{1/2} where t r(·) is the trace of a matrix. Rows and columns in M are denoted by M_{(i,·)} and M_{(·,j)}, respectively.
Review of singleview biclustering
The regularization terms ∥σ u∥_{0} and ∥σ v∥_{0} are used to enforce the sparsity of u and v. Note that the scalar σ will not affect the value of the regularization terms. The parameters λ_{ u } and λ_{ v } are two hyperparameters to balance the approximation performance and the regularization terms. If both λ_{ u } and λ_{ v } equal 0, the optimal solution to this problem is the left and right singular vectors of M that correspond to its largest singular value. An alternating algorithm has been proposed in [12] to solve this problem effectively when λ_{ u } and λ_{ v } are not 0. This algorithm first initiates u and v by the first left and right singular vectors of M, then alternates between solving two subproblems until it converges. The two subproblems are: (a), fix u and find v that optimizes the objective of Eq.(1); (b), fix v and find u that optimizes the objective of Eq.(1).
Assume that each row of M represents a subject and each column corresponds to a feature. Once a pair of vectors u and v is obtained, a subject (row) cluster as indicated by the nonzero components of u is obtained. At the same time, the features on which the subjects in the cluster show high similarity are also identified in a column cluster as indicated by the nonzero components of v. More clusters can be obtained by repeating the optimization process with modified data matrices. To obtain subsequent clusters that are disjoint from any identified cluster in terms of subjects, the SSVD solves Eq.(1) using a new matrix M that excludes subjects (rows) already included in a row cluster. To obtain subsequent clusters that allow overlapping of subjects with identified clusters, the SSVD can solve Eq.(1) with the deflated M=M−σ u v^{ T } that removes the identified SVD components as used in the standard SVD.
The proposed formula for twoview joint biclustering
In this section, we extend the singleview SSVD to find a consistent grouping of subjects across two data matrices. In a later section, the resulting method will be extended to incorporate more than two data matrices.
Assume that two data matrices denoted by M_{1} of size nby d_{1} and M_{2} of size nby d_{2} characterize the same set of n subjects from two different views. We can obtain u_{1}, v_{1}, and u_{2}, v_{2} by a separate SSVD of M_{1} and M_{2}, respectively. However, it will not guarantee that the row clusters specified by u_{1} and u_{2} agree. To make them consistent, u_{1} and u_{2} must have nonzero components at the same position. Note that the two u vectors are not necessarily the same, because they may be derived from very different features in the views, such as realvalued clinical features but discrete values in genetic markers.
where λ_{ z }, ${\lambda}_{{v}_{1}}$ and ${\lambda}_{{v}_{2}}$ are tuning parameters that balance the approximation errors and regularization terms. Although the values of u’s are constrained to be unit vectors, the values of z⊙u’s are not necessarily unit vectors. However, a careful examination reveals that for any optimal solution $\widehat{\mathbf{u}}$, we can find another optimal solution $\stackrel{\u0304}{\mathbf{u}}$ that has nonzero values only at the entries indicated by the binary vector z, which ensures that $\mathbf{z}\odot \stackrel{\u0304}{\mathbf{u}}$ is also a unit vector. We first set ${\stackrel{\u0304}{\mathbf{u}}}_{\left(j\right)}={\widehat{\mathbf{u}}}_{\left(j\right)}$, if z_{(j)}≠0, or ${\stackrel{\u0304}{\mathbf{u}}}_{\left(j\right)}=0$ otherwise, for j=1⋯,n. We then update the corresponding singular value $\sigma =\sigma \parallel \stackrel{\u0304}{\mathbf{u}}{\parallel}_{2}$ and rescale $\stackrel{\u0304}{\mathbf{u}}=\stackrel{\u0304}{\mathbf{u}}/\parallel \stackrel{\u0304}{\mathbf{u}}{\parallel}_{2}$. This new vector $\stackrel{\u0304}{\mathbf{u}}$ satisfies the constraints of Eq.(2), and together with the new σ will produce the same objective value as the original solution $\widehat{\mathbf{u}}$, thus corresponding to an optimal solution as well. We design a fast algorithm in a later section to find such a sparse $\stackrel{\u0304}{\mathbf{u}}$ for Eq.(2).
By requiring u to be sparse, it can also identify consistent row clusters between two views. The resultant optimization problem is easier to solve without integer variables in z. However, it is an unnecessarily stringent constraint to limit the search space to u_{1}=u_{2}, which rules out a number of potential solutions that may include the optimal row clusters. Another alternative is to minimize the difference between u_{1} and u_{2}, which suffers from the same overconstrained problem because the exact values of the difference are not involved. Our problem only seeks to identify the indicators of whether or not a component of u is zero.
where the v vector is of size d_{1}+d_{2}. In comparison with Eq.(3), Eq.(4) uses a single σ for the two views, and the concatenated v is constrained to be a unit vector rather than individual v_{1} and v_{2}. It is easy to show that any optimal solution to Problem (3) can become a feasible solution to Problem (4) by properly rescaling v_{1} and v_{2} and absorbing the scaling factors by σ_{1} and σ_{2} to make σ_{1}=σ_{2}, but is not necessarily an optimal solution to Problem (4). An optimal v to Problem (4) may have either v_{1} or v_{2} be zero, which is however not allowed in Eq.(3). When one of the v vectors is zero, the resultant clusters differ only on one view of the features. As an example, we concatenated 64 clinical features to 1248 SNPs in a disease subtyping analysis. Because the genetic markers outweighed the clinical features, the resultant clusters differed significantly only on the SNPs, leading to disease subtypes that could not be clinically recognized.
A fast algorithm for twoview joint biclustering
 (1)
Find the optimal u _{ 1 } , v _{ 1 } , u _{ 2 } , and v _{ 2 } with fixed z
 (a)
Solve for v _{1} when u _{1} is fixed
We solve the following equivalent problem for the optimal ${\stackrel{~}{\mathbf{v}}}_{1}$ by relaxing the unit length constraint on v_{1}, and then setting ${\sigma}_{1}=\parallel {\stackrel{~}{\mathbf{v}}}_{1}{\parallel}_{2}$ and ${\mathbf{v}}_{1}={\stackrel{~}{\mathbf{v}}}_{1}/{\sigma}_{1}$.
 (b)
Solve for u _{1} when v _{1} is fixed
After v_{1} is obtained and fixed, we optimize Problem (5) with respect to σ_{1} and u_{1}. We let ${\stackrel{~}{\mathbf{u}}}_{1}={\sigma}_{1}{\mathbf{u}}_{1}$, and solve the following problem to obtain ${\stackrel{~}{\mathbf{u}}}_{1}$. By setting ${\sigma}_{1}=\parallel {\stackrel{~}{\mathbf{u}}}_{1}{\parallel}_{2}$ and ${\mathbf{u}}_{1}={\stackrel{~}{\mathbf{u}}}_{1}/{\sigma}_{1}$, we obtain a solution to Problem (5).
 (2)
Find the optimal z with fixed u _{ 1 } , v _{ 1 } , u _{ 2 } , and v _{ 2 }
where ${\gamma}_{\left(i\right)}=\frac{{\mathbf{E}}_{(i,\xb7)}{\mathbf{M}}_{(i,\xb7)}^{T}}{\parallel {\mathbf{E}}_{(i,\xb7)}{\parallel}_{2}^{2}}$ and $\theta =\frac{{\lambda}_{z}}{2\parallel {\mathbf{E}}_{(i,\xb7)}{\parallel}_{2}^{2}}$. Eq.(12) is derived based on the same calculation in [12] which was used to derive Eq.(7).
and σ_{1}, σ_{2} are recalculated as: σ_{1}=∥u_{1}∥_{2}, ${\sigma}_{2}=({\widehat{\sigma}}_{2}/{\widehat{\sigma}}_{1})\parallel {\mathbf{u}}_{2}{\parallel}_{2}$; then we normalize u_{1} and u_{2} by u_{1}=u_{1}/∥u_{1}∥_{2}, and u_{2}=u_{2}/∥u_{2}∥_{2}.
The proposed algorithm alternates between solving the three subproblems (6), (8) and (10) until a local minimizer is reached. The overall objective is monotonically nonincreasing when minimizing each subproblem, so the convergence of this iterative process is guaranteed. When applied to both synthetic and real world datasets, this process reached a convergent point in about 10 iterations. To derive another row subgroup, we repeat the algorithm using new matrices M_{1} and M_{2} that either exclude the rows corresponding to the subjects in the identified subgroup or are deflated by subtracting the identified singular value components σ u v^{ T }. By repeating this procedure, the desired number of subject groups can be achieved.
Extension to more than two views
Results and discussion
We first validated the proposed method using synthetic data that were simulated with known cluster and association structures. We then evaluated our approach on a real world disease dataset aggregated from multiple genetic studies of cocaine dependence (CD).
where $I({\mathcal{C}}^{\left(1\right)},{\mathcal{C}}^{\left(2\right)})=\sum _{i,j}\frac{{\mathcal{C}}_{i}^{\left(1\right)}\cap {\mathcal{C}}_{j}^{\left(2\right)}}{n}log\frac{n\underset{i}{\overset{\left(1\right)}{\mathcal{C}}}\cap {\mathcal{C}}_{j}^{\left(2\right)}}{\left\underset{i}{\overset{\left(1\right)}{\mathcal{C}}}\right\left{\mathcal{C}}_{j}^{\left(2\right)}\right}$, $H\left(\mathcal{C}\right)=\sum _{i}\frac{\left{\mathcal{C}}_{i}\right}{n}log\frac{\left{\mathcal{C}}_{i}\right}{n}$, and $\left{\mathcal{C}}_{i}\right$ denotes the number of subjects in the cluster ${\mathcal{C}}_{i}$. Because the true clusters are known in synthetic data, we computed NMI to measure the agreement between the true cluster assignments and the cluster assignments resulting from cluster analysis. A higher NMI value indicates better performance.
In addition to NMI, for each clustering, classifiers were constructed based on genetic markers to separate subjects in different clusters. We used the Area Under the receiver operating characteristic Curve (AUC) [15] in a 10fold crossvalidation setting to measure the genetic separability or homogeneity of the clusters in a clustering and compared it between different clusterings. We used a regularized logistic regression [16] as the classification model in these experiments.
We compared the proposed approach extensively against biclustering and multiview analytics. We calculated NMI for different methods on synthetic data and AUC values on both synthetic and real world data. Our comparison study included the following existing methods:

Singleview SSVD: Clusters were included in the comparison by running the method of SSVDbased biclustering in the clinical view, as the biclustering method does not handle multiple views. Applying this method to genetic data created completely different clusters from those obtained in the clinical view.

Coregularized spectral: This method was proposed previously [9] to find consistent row clusters across multiple views by applying spectral clustering to each view in turn together with a coregularization factor applied to the cluster indicator vector.

Kernel addition: Radial basis function (RBF) kernels were calculated for each view and combined by summing them. Then spectral clustering was applied to the combined kernel to obtain row clusters.

Kernel product: This is the same procedure as in the kernel addition described above except that kernel matrices were combined by multiplying their components in the same position.

Feature concatenation: Data from the two views were combined by feature concatenation and a kernel matrix was computed based on the combined features. It was then used in spectral clustering to obtain row clusters.
A simulation study
Two disease subtypes, subtype 1 and subtype 2, were simulated. Each of the subtypes was both defined by a set of phenotypic/clinical features and associated with a set of genetic markers. However, the clinical features and genetic markers differed for the two subtypes. Thus, each subtype corresponded to a cluster of subjects with the specific clinical features and the associated SNP markers (here we assumed that minor alleles at each locus were risk variants). The goal of the simulation was to create a reference partitioning of subjects in both views (i.e., genetic markers and clinical features).
Genetic data were obtained from the 1000 Genome Project [17], in which 1092 subjects were genotyped for several million genetic markers. We randomly selected 1000 markers from chromosome 5 that had a minor allele frequency of at least 5% as genetic inputs in our experiments. Ten markers (different for each subtype) were randomly chosen to be associated with each subtype. Thus, a cluster of subjects was formed for each subtype, and we assigned subjects to a cluster if they had ≥8 risk variants out of the 10 SNPs chosen for that subtype. This amounts to an additive genetic model for each subtype (i.e., derived by adding the risk variants). Subjects who did not belong to either of the subtypes were treated as controls, forming the third subject cluster. We removed from the analysis subjects who belonged to both subtypes to ensure clarity in the partition. A total of 1013 subjects were retained. Of these, 247 and 167 were assigned to subtype 1 and subtype 2, respectively, and 599 were controls. We named these clusters the genotypic clusters.
We then created clusters of the same subjects in the clinical view to be consistent to a certain degree with the genotypic clusters. Note that many diseases, although highly heritable, are multifactorial genetically and environmentally. To reflect the environmental effects on the clinical features, we introduced random noise to the synthesized clinical data so that the clinical clusters were not exactly the same as the genotypic clusters, so as to test the robustness of the proposed approach. We used a parameter e to indicate the relative effect that genetic variation contributed to the phenotypic variation. Denote ${r}_{i}^{j}$ the number of risk variants of subtype j shared by subject i, so $0\le {r}_{i}^{j}\le 10$ according to our definition of genotypic clusters. If ${r}_{i}^{j}\ast e+N(0,1)>7.5\ast e$, we assigned subject i to subtype j. This process created clusters of subjects that were different but similar to the genotypic clusters (with the parameter e reflecting the level of similarity).
We named these clusters the phenotypic clusters because they were used to synthesize clinical features such that the clinical data represented these clusters. Similarly, we removed from the analysis subjects that overlapped in the two phenotypic clusters. Fewer than 15 subjects were excluded in any simulated dataset in the experiments. In addition to these two phenotypic clusters, two additional phenotypic clusters, independent of any genetic variant and based on clinical features only, were created to make the simulated data more difficult but more realistic. Each of the two additional clusters included 200 subjects that were randomly selected among the controls. This design aimed to reflect the observation that multiple clinical clusters may exist in a sample, but only some clusters (two in our simulations) are associated with genetic factors.
We simulated 10 binary phenotypic/clinical features that exhibited the phenotypic clusters. A subject was assigned a value of 0 or 1 for each of the features according to a predefined probability. Subtype 1 and subtype 2 each were associated with three features. Subjects in each simulated phenotypic cluster were assigned a value of 1 with probabilities of 0.6, 0.5, and 0.4, respectively, for the three designated features. Each of the two additional phenotypic clusters was associated with two features, and subjects in each of the two subtypes were assigned a value of 1 in the two features, with probabilities of 0.6 and 0.5, respectively. A subject was assigned a value of 1 with a probability of 0.1 on any other features.
To evaluate how the proposed method performed when the genetic effect varied, four phenotypic datasets with e=1, 0.8, 0.6, and 0.4 were generated and analysed. The genetic effect on phenotypic variation decreases with decreasing e, which leads to a lower level of agreement between genotypic and phenotypic clusters.
Comparison of different methods on their cluster validity in the simulation
e=1  e=0.8  e=0.6  e=0.4  

Singleview SSVD  0.0821  0.1798  0.2432  0.2286 
Coregularized Spectral  0.2306  0.2477  0.2338  0.2549 
Kernel addition  0.2587  0.2295  0.2350  0.2566 
Kernel product  0.1917  0.2432  0.2302  0.2310 
Feature concatenation  0.1569  0.1576  0.1532  0.1211 
Proposed method  0.7949  0.7693  0.6815  0.6329 
The features identified by the proposed method in both views in the simulation
Phenotypic view  Genotypic view  

TF  TPF  FPF  TF  TPF  FPF  
Subtype 1  e=1  3  3  1  10  10  4 
e=0.8  3  1  10  5  
e=0.6  3  2  10  15  
e=0.4  3  0  10  10  
Subtype 2  e=1  3  3  0  10  10  4 
e=0.8  3  0  10  4  
e=0.6  3  0  10  2  
e=0.4  3  0  10  5 
A disease study: cocaine use and related behaviors
A total of 1,474 African Americans were phenotyped and genotyped for genetic studies of cocaine dependence (CD) [18]. Subjects were recruited from the Yale University School of Medicine, University of Connecticut Health Center, University of Pennsylvania School of Medicine, McLean Hospital and Medical University of South Carolina. All subjects gave written, informed consent to participate, using procedures approved by the institutional review board at each participating site. Subjects were phenotyped using a computerassisted interview, called the SemiStructured Assessment for Drug Dependence and Alcoholism (SSADDA) [19], a polydiagnostic instrument that was used to generate diagnoses of dependence on cocaine and other substances. Sixtyfour yesorno variables were generated by this survey, which were also used in previous genetic association studies [1, 20, 21]. These variables were used as the phenotypic features. Of the 1,474 subjects, 1,287 were diagnosed with cocaine dependence. Subjects were genotyped for 1,350 SNPs selected from 130 candidate genes [4] and 186 ancestry informative markers (AIMs) using the Illumina GoldenGate Assay platform (Illumina, Inc., San Diego, CA).
The original dataset aggregated from two studies was preprocessed with a sequence of steps for data cleaning and to address population stratification. Race was classified using STRUCTURE v2.3 [22] and AIMs, which stratified the study subjects into two population groups: African Americans (AAs) and European Americans (EAs). The AA group was used in the present analysis. Of the 1,474 AAs, 93.78% had AA as their selfreported race. We excluded other population groups from the analysis. Principal components analysis (PCA) was performed on the 186 AIMs for the stratified AA population. The first PCA dimension was used in the subsequent association tests as a covariate to correct for the residual population structure. SNPs for which data were available for less than 95% of the subjects, or for which the P value for HardyWeinberg equilibrium was less than 10^{−7}, were excluded from our analysis. The minor allele frequency (MAF) of each SNP was calculated within this AA population group. SNPs with a MAF <1% were removed. The remaining 1,248 SNPs were used as the genetic markers in the multiview biclustering experiment. The SNPs selected by the proposed Algorithm 1 were then used in the association test that was based on the logistic regression model.
From these two figures, we can see that Group 1 is distinctively associated with several withdrawal symptoms, such as feeling depressed, restless, or tired when the subject stopped, cut down or went without cocaine. When Group 2, the second row cluster, was identified, the corresponding column cluster contained 17 clinical features. We plotted the percentage of positive responses to eight of these features for all three cocaine user groups in Figure 4. Subjects in both Group 2 and Group 1 showed high values on these features. Note that subjects in Group 1 were excluded when the second cluster was derived. From these observations, we can conclude that Group 1 is a heavy user group with many negative consequences of cocaine use, Group 2 is a moderate cocaine user group, and Group 3 is a low cocaine user group.
Top five SNPs associated with each of the two CD subtypes
SNP  Chr  MAF  HWE  Gene  

rs6318  chrX  0.3643  1.00  HTR2C  
Group 1  rs2427400  chr20  0.1280  0.22  NTSR1 
vs.  rs460401  chr21  0.3500  0.18  GRIK1 
Group 3  rs10485058  chr06  0.0585  0.38  OPRM1 
rs2279423  chr15  0.0237  0.81  CHRM5  
rs897692  chr11  0.3972  0.86  HTR3A  
Group 2  rs9996854  chr04  0.5436  0.61  GABRB1 
vs.  rs481036  chr01  0.5582  0.21  CHRM3 
Group 3  rs6092933  chr20  0.2070  0.17  SLC32A1 
rs9371781  chr06  0.3687  0.49  OPRM1 
Conclusion
It is challenging to identify the genetic causes of complex disorders such as substance dependence, due to their heterogeneous clinical manifestations and complex genetic etiologies, which include gene x environment interactions. Phenotype refinement that leads to homogeneous subtypes is a promising approach to solve this problem [1, 5, 23–25]. However, most of the methods used to refine phenotypes take into consideration only the phenotypic information, despite the availability of genotypic information in genetic studies of a complex disorder. Thus, existing approaches have had limited success in finding a phenotypic subtype that is genetically homogeneous. In this paper, we propose a multiview biclustering approach to refine the phenotype by jointly taking into account genetic and phenotypic information.
The proposed method is distinct from existing multiview data analytics in that the relevant features can be identified at the same time that a subtype is determined, which is critical to its success. This increases the likelihood of finding genetic associations. The proposed method is distinct from existing biclustering methods in that it harmonizes the subject groupings in two or more views. The developed algorithm is highly scalable with large datasets because at each iteration it calculates closedform solutions for different groups of working variables. The results from extensive experimental comparisons on both synthetic data and real world datasets demonstrate the effectiveness and superior performance of the proposed approach.
This study has a number of limitations. The proposed multiview biclustering method, in its current form, does not simultaneously handle population stratification and phenotypegenotype association. It may spuriously identify markers that are relevant to a disease subtype due to population structure rather than being truly associated with the specific disease. Thus, population groups need to be stratified in additional steps such as those performed in our experiments. It is desirable to extend our method to address the threeway relationship among population subgroups, genotypes and phentoypes to ensure the validity of the identified phenotypegenotype associations. Further, the proposed method was used in our empirical study to identify the first two major subgroups of subjects, for which no invalid clusters caused by random noise were identified. When larger numbers of clusters are to be identified, the two methods we designed to find subsequent clusters (by either excluding subjects in the identified subgroups or deflating singular value components from the data matrix) become susceptible to the detection of invalid clusters because singular values will decrease in subsequent decomposition. Empirical studies may be needed to examine more thoroughly the signaltonoise pattern of the proposed method.
Declarations
Acknowledgements
This work was supported by NSF grant IIS1320586 and NIH grant DA030976. We thank Joel Gelernter, M.D. from Yale University who was instrumental in recruiting, characterizing, and genotyping the subjects in the datasets used here. Kathleen Brady, M.D., Ph.D. of the Medical University of South Carolina, Roger Weiss, M.D., of McLean Hospital and Harvard Medical School, and David Oslin, M.D., of the University of Pennsylvania Perelman School of Medicine oversaw study recruitment at their respective sites. That work was supported by NIH grants AA011330, AA017535, DA12690, DA18432, and DA12849.
Authors’ Affiliations
References
 Kranzler HR, Wilcox M, Weiss RD, Brady K, Hesselbrock V, Rounsaville B, Farrer L, Gelernter J: The validity of cocaine dependence subtypes. Addict Behav. 2008, 33 (1): 4153.PubMedPubMed CentralView ArticleGoogle Scholar
 Babor TF, Caetano R: Subtypes of substance dependence and abuse: implications for diagnostic classification and empirical research. Addiction (Abingdon, England). 2006, 101: 104110.View ArticleGoogle Scholar
 McCarthy MI, Abecasis GR, Cardon LR, Goldstein DB, Little J, Ioannidis JPA, Hirschhorn JN: Genomewide association studies for complex traits: consensus, uncertainty and challenges. Nat Rev Genet. 2008, 9 (5): 356369.PubMedView ArticleGoogle Scholar
 Hodgkinson CA, Yuan Q, Xu K, Shen PH, Heinz E, Lobos EA, Binder EB, Cubells J, Ehlers CL, Gelernter J, Mann J, Riley B, Roy A, Tabakoff B, Todd RD, Zhou Z, Goldman D: Addictions biology: haplotypebased analysis for 130 candidate genes on a single array. Alcohol Alcohol. 2008, 43 (5): 505515.PubMedPubMed CentralView ArticleGoogle Scholar
 Gelernter J, Panhuysen C, Wilcox M, Hesselbrock V, Rounsaville B, Poling J, Weiss R, Sonne S, Zhao H, Farrer L, Kranzler HR: Genomewide linkage scan for opioid dependence and related traits. Am J Hum Genet. 2006, 78 (5): 759769.PubMedPubMed CentralView ArticleGoogle Scholar
 Schwartz B, Wetzler S, Swanson A, Sung SC: Subtyping of substance use disorders in a highrisk welfaretowork sample: a latent class analysis. J Subst Abuse Treat. 2010, 38 (4): 366374.PubMedView ArticleGoogle Scholar
 Chen P, Hung YS, Fan Y, Wong STC: An integrative bioinformatics approach for identifying subtypes and subtypespecific drivers in cancer. IEEE Symposium on Computational Intelligence in Bioinformatics and Computational Biology (CIBCB). 2012, New York: IEEE, 169176.View ArticleGoogle Scholar
 Tay ST, Leong SH, Yu K, Aggarwal A, Tan SY, Lee CH, Wong K, Visvanathan J, Lim D, Wong WK, Soo KC, Kon OL, Tan P: A combined comparative genomic hybridization and expression microarray analysis of gastric cancer reveals novel molecular subtypes. Cancer Res. 2003, 63 (12): 33093316.PubMedGoogle Scholar
 Kumar A, Rai P, Daume HIII: Coregularized multiview spectral clustering. Advances in Neural Information Processing Systems 24. Edited by: Weinberger KQ, Pereira FCN, Bartlett P, Zemel RS, ShaweTaylor J. 2011, Cambridge, MA: MIT Press, 14131421.Google Scholar
 Chaudhuri K, Kakade SM, Livescu K, Sridharan K: Multiview clustering via canonical correlation analysis. Proceedings of the 26th International Conference on Machine Learning. 2009, New York: ACM, 129136.Google Scholar
 Van Mechelen I, Bock HH, De Boeck P: Twomode clustering methods: a structured overview. Stat Methods Med Res. 2004, 13 (5): 363394.PubMedView ArticleGoogle Scholar
 Lee M, Shen H, Huang JZ, Marron JS: Biclustering via sparse singular value decomposition. Biometrics. 2010, 66 (4): 10871095.PubMedView ArticleGoogle Scholar
 Kumar A, Daume HIII: A cotraining approach for multiview spectral clustering. Proceedings of the 28th International Conference on Machine Learning. Edited by: Getoor L. 2011, Scheffer New York: ACM, 393400.Google Scholar
 Guan Y, Dy J, Jordan MI: A unified probabilistic model for global and local unsupervised feature selection. Proceedings of the 28th International Conference on Machine Learning. 2011, New York: ACM, 10731080.Google Scholar
 Fawcett T: An introduction to ROC analysis. Pattern Recogn Lett. 2006, 27 (8): 861874.View ArticleGoogle Scholar
 Yuan GX, Ho CH, Lin CJ: An improved glmnet for l1regularized logistic regression. J Mach Learn Res. 2012, 13: 19992030.Google Scholar
 The 1000 Genomes Project Consortium: An integrated map of genetic variation from 1,092 human genomes. Nature. 2012, 491 (7422): 5665.View ArticleGoogle Scholar
 American Psychiatric Association: Diagnostic and Statistical Manual of Mental Disorders: Fourth Edition (DSMIV). 1994, Washington, DC: American Psychiatric Press IncGoogle Scholar
 PierucciLagha A, Gelernter J, Chan G, Arias A, Cubells JF, Farrer L, Kranzler HR: Reliability of dsmiv diagnostic criteria using the semistructured assessment for drug dependence and alcoholism (ssadda). Drug Alcohol Depend. 2007, 91 (1): 8590.PubMedPubMed CentralView ArticleGoogle Scholar
 Bi J, Gelernter J, Sun J, Kranzler HR: Comparing the utility of homogeneous subtypes of cocaine use and related behaviors with DSMIV cocaine dependence as traits for genetic association analysis. Am J Med Genet B. 2013, 2: 148156.Google Scholar
 Sun J, Bi J, Kranzler HR: Multiview comodeling to improve subtyping and genetic association of complex diseases. IEEE J Biomed Health Inf. 2013, 18 (2): 548554.Google Scholar
 Pritchard JK, Stephens M, Donnelly P: Inference of population structure using multilocus genotype data. Genetics. 2000, 155 (2): 945959.PubMedPubMed CentralGoogle Scholar
 Chan G, Gelernter J, Oslin D, Farrer L, Kranzler HR: Empirically derived subtypes of opioid use and related behaviors. Addiction. 2011, 106 (6): 11461154.PubMedPubMed CentralView ArticleGoogle Scholar
 Sun J, Bi J, Chan G, Anton RF, Oslin D, Farrer L, Gelernter J, Kranzler HR: Improved methods to identify stable, highly heritable subtypes of opioid use and related behaviors. Addict Behav. 2012, 37 (10): 11381144.PubMedPubMed CentralView ArticleGoogle Scholar
 Sun J, Bi J, Kranzler HR: A multiobjective program for quantitative subtyping of clinicallyrelevant phenotypes. Proceedings of IEEE International Conference on Bioinformatics and Biomedicine (BIBM2012). 2012, New York: ACM, 256261.Google Scholar
Copyright
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly credited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.