Network-based regularization for high dimensional SNP data in the case–control study of Type 2 diabetes

Background Over the past decades, the prevalence of type 2 diabetes mellitus (T2D) has been steadily increasing around the world. Despite large efforts devoted to better understand the genetic basis of the disease, the identified susceptibility loci can only account for a small portion of the T2D heritability. Some of the existing approaches proposed for the high dimensional genetic data from the T2D case–control study are limited by analyzing a few number of SNPs at a time from a large pool of SNPs, by ignoring the correlations among SNPs and by adopting inefficient selection techniques. Methods We propose a network constrained regularization method to select important SNPs by taking the linkage disequilibrium into account. To accomodate the case control study, an iteratively reweighted least square algorithm has been developed within the coordinate descent framework where optimization of the regularized logistic loss function is performed with respect to one parameter at a time and iteratively cycle through all the parameters until convergence. Results In this article, a novel approach is developed to identify important SNPs more effectively through incorporating the interconnections among them in the regularized selection. A coordinate descent based iteratively reweighed least squares (IRLS) algorithm has been proposed. Conclusions Both the simulation study and the analysis of the Nurses’s Health Study, a case–control study of type 2 diabetes data with high dimensional SNP measurements, demonstrate the advantage of the network based approach over the competing alternatives.


Background
Type 2 diabetes mellitus (T2D), a chronic metabolic disorder, has been a major public health concern for years. An estimated 366 million cases of T2D over the world are expected by the year 2030 [1]. To better understand T2D etiology, significant efforts have been devoted to the identification of genetic markers that may contribute to the predisposition of the disease. The large scale genomewide association studies (GWAS) has proven to be powerful in finding the association between individual genetic variant (like SNPs) and complex diseases, including type 2 diabetes. However, those identified SNPs from existing studies can only account for about 10% of the genetic variance of type 2 diabetes [2], which motivate the development of more advanced statistical methodologies with the hope to explain the missing heritability.
One major limitation shared by many of the previous studies, especially the early ones, is that they are marginal in the sense that one or a small number of genetic factors are analyzed at a time. Since complex disease phenotypes are associated with the joint effects of multiple genetic factors, signals with weak or moderate marginal but strong joint effects may not be captured by the marginal analysis.
As unprecedented amount of high dimensional omics data has been generated from high-throughput profiling studies, extensive regularized variable selection methods such as LASSO [3] and elastic net [4], have been proposed to identify genes that are associated with disease phenotypes, with the genes being treated as variables. More recently, to incorporate the interconnection information, or network structure existing among genetic variants into the selection procesure, the networkconstrained regularization approaches have been developed, as in Li and Li [5] and Huang et al. [6], among many others. In particular, Huang et al. [6] developed the sparse Laplacian shrinkage (SLS) penalty built upon the combination of MCP (Zhang [7]) and Laplacian quadratic associated with a graph. They also demonstrated that in high dimension settings with p ≫ n under reasonable assumptions, SLS is selection consistent and equivalent to the oracle Laplacian shrinkage estimator with high probability.
This study has been partially motivated by analyzing the case control data from the Nurses's Health Studies (NHS) and studies alike. As a major component of the Gene Environment Association Studies Initiative, NHS was launched in 1976 in order to identify important genetic variants related to type 2 diabetes and gene-trait association under environmental exposures [8]. To accommodate the linkage disequilibrium (LD) existing among SNPs, we adopt a network measure and incorporate it in SLS. We further extend the SLS into the penalized logistic regression model for the analysis of the T2D case control data, and develop an efficient coordinate descent based algorithm. Compared with the alternatives, the proposed method can borrow strength from the correlation among SNPs and leads to more meaningful identification of important ones.
We first introduce the data and model settings, and describe the proposed approach. An efficient computational algorithm is subsequently developed. Simulation study demonstrates the significant advantage of the proposed approach over multiple competing alternatives. We analyze NHS type 2 diabetes data with high dimensional SNP measurements.

Methods
Denote the i th subject by using the subscript i. Let (X i , Y i ), i = 1, …, n be n independent and identically distributed random vectors. Y i is the binary response variable where y i = 1 indicating the case of disease, and 0 otherwise. X i is the p-dimensional design vector of SNPs. Assuming that y i follows a binomial distribution, then where η i is the i th component of η = Xβ, and β = (β i , …, β p ) T is the regression coefficient vector. The corresponding loss function is the negative log-likelihood Regularized logistic regression As the T2D disease status is only affected by a small number of important SNPs that are associated with the disease, and the dimensionality of the total number of SNPs is much larger than the sample size n, the problem is of a "large p, small n" nature. Regularization is a natural tool for such type of problem appropriate in both biological and statistical sense. By imposing penalty function to the loss function in (1), we have the following penalized likelihood where P(β; λ, γ) is the penalty function with tuning parameters λ and γ. A seemingly straightforward choice for the penalty is þ dx is the MCP penalty with tuning parameter λ 1 and regularization parameter γ (Zhang [7]).
For the SNPs, MCP is imposed on their regression coefficients. Penalized regression will shrink some components of coefficient vector β to zero, which indicates that the corresponding SNPs are not associated with the disease status y. SNPs with nonzero coefficients are treated as important variants. A major limitation of MCP here is that it ignores the interconnections among SNPs, while the high correlation among genetic variants, including SNPs, have been widely observed and reported due to LD. We use a network structure to describe the correlation pattern among SNPs. In a SNP network, a node corresponds to a SNP, and if the two SNPs are statistically or biologically associated, the two corresponding nodes are connected. To incorporate the network information, we adopt the sparse Laplacian penalty from Huang et al. [6] as follows: where |a mk | is the measure of connection intensity between SNP x m and x k . The first term of (3) is a summation of MCPs, promoting sparsity in the estimated model. The role of the second term is to encourage smoothness among the coefficient profiles of the related SNPs. Furthermore, the second term can be associated with a Laplacian matrix for a properly defined undirected weighted graph corresponding to the SNPs. As shown in Huang et al. [6], the penalty in (3) is capable of taking correlation structure into account without introducing extra bias, consequently it outperforms a large class of network-constrained penalty functions. The oracle property has also been rigorously established. Therefore, we choose (3) and extend it to the penalized logistic regression model for the analysis of case control type 2 diabetes data. The network adjacency measure, |a mk |, is perhaps the most crucial characteristic in a network to quantify strength of connection between any two nodes (Zhang and Horvath [9]). Denote A = (a mk , 1 ≤ m, k ≤ p) as the adjacency matrix, and let r mk be the corresponding Pearson correlation coefficient. We propose to use a mk = r mk α ⋅ I{ | r mk | > r c } with α =5. This measure keeps the strong correlations while downweighing the weak ones. In addition, it guarantees that a mk and r mk have the same sign. Compared with the threshold r c which determines whether the edge joins the corresponding nodes in a network, the power only denotes the relative strength of connection, and does not influence the network structure. Thus α can be chosen via an ad hoc fashion. The correlation cutoff r c is calculated based on the Fisher transformation z mk = 0.5 log((1 + r mk )/(1 − r mk )). If the correlation between m th and k th predictor is zero, then ffiffiffiffiffiffiffiffi n−3 p z mk approximately follows a standard normal distribution N(0,1), which can be used to determine a threshold c for ffiffiffiffiffiffiffiffi n−3 p z mk . Subsequently, the corresponding threshold for r mk is . Such a network is weighted and sparse. We acknowledge that there are other ways of constructing the network adjacency matrix, and conjecture that they are equally applicable. Since our main purpose is not to compare the constructions of different networks, we focus on this particular network structure in this paper.

Computation
Huang et al. [6] adopted a coordinate descent algorithm to obtain the sparse Laplacian shrinkage estimate when the continuous response variable follows a normal distribution. However, this cannot be applied to a binary response directly. We develop a coordinate descent based iteratively reweighed least squares (IRLS) algorithm for the logistic regression, which yields a form the same as the quadratic approximation to the penalized objective function based on Taylor expansion about current estimates.
Denote β (d) as the value of the regression coefficients at the beginning of the dth iteration, the quadratic approximation to (2) is where W is an n × n diagonal matrix of weights with elements w i = π i (1 − π i ), andỹ is the working response, defined as where π = (π 1 , …, π n ) T is evaluated at current parameters β (d) . The residuals after each iteration can be expressed as Here, v m needs to be re-weighted in every iteration, leading to increased computational cost. As the Hessian terms can be approximated by an exact upper bound  1 4 . Define u m and t m at iteration d as Then the close form update of β (d + 1) can be obtained as where S(⋅) is the soft thresholding function. With fixed tuning parameters λ 1 and λ 2 , the coordinate descent algorithm proceeds as follows.
The convergence is achieved when the L 2 difference between β estimates from two contiguous iterations is smaller than a predefined threshold. Tuning parameters λ 1 and λ 2 control the sparsity in SNP selection and smoothness among coefficient profiles, respectively. They can be chosen from cross validation based methods. We search over a two-dimensional discrete grid of values for λ 1 and λ 2 , and select the optimal pair in terms of testing misclassification rate. In penalized logistic regression, regularization parameter γ needs to be larger than 1/w i for MCP. We set it as 4.5 in the simulation study since it has been observed that smaller γ yield slightly better results.

Simulation
We evaluate the performance of the proposed approach through extensive simulation studies. Both categorical and continuous predictors are considered, and they correspond to SNP and gene expression data, respectively. We first generate a n × p matrix of gene expressions, where n = 500 and p = 750, from a multivariate normal distribution. For the 750 genes, there are 100 clusters with 5 genes per cluster. The gene expressions have been marginally standardized. We consider two correlation structures. (1) the auto-regression (AR) structure, in which gene i and j within the same cluster have correlation coefficients ρ |i − j| , and they are independent cluster-wisely. (2) the block structure, in which the within cluster correlation coefficient is ρ, and gene expressions in different clusters are independent. We consider ρ =0.1, 0.5 and 0.9 for both structures. In addition to the 500 by 750 matrix of gene expressions, a 1000 by 1500 matrix has also been generated with 150 clusters and 10 genes per cluster following the same correlation structures. The SNP data are simulated by dichotomizing expression values of each gene at the 1st and 3rd quartiles, with the 3-level (2,1,0) for genotypes (AA,Aa,aa) respectively. For both combinations of (n, p), (500, 750) and (1000, 1500), 10% of clusters are randomly selected to have nonzero regression coefficients, which are generated from Unif [0.25,0.75]. The binary response can subsequently be simulated. We choose the tuning parameters based upon the prediction performance of the corresponding model in an independently simulated validation dataset. For comparison, we consider three alternative approaches, LASSO, elastic net and MCP. LASSO is perhaps so far the the most widely used penalization approach for the analysis of genomic data. In contrast to LASSO, elastic net encourages the grouping effects among genomic features. MCP is equivalent to the proposed approach when λ 2 = 0 in (3). Comparison with MCP as well as elastic net will directly demonstrate the advantage of each penalty term in the formulation (3). For convenience, we term the network approach, MCP, elastic net and LASSO as A1, A2, A3 and A4, respectively.
Simulation results for the SNP data are tabulated in Table 1. We can observe that from the upper panel where (n, p) = (500, 750), A1 (network) and A2 (MCP)  Table 2 have similar performance when correlation is low. As correlation increases, the proposed one starts to outperform A2. For example, when ρ = 0.9 under AR correlation structure, A1 can identify most of the 75 true positives, 74.98 (sd 0.14), with a small number of false positives 9.74 (sd 13.31). A2 identifies similar number of false positives with a much lower number of true positives 31.22 (sd 4.57). Out of all the four approaches, A3 (elastic net) always has the largest false positives and A4 (LASSO) is inferior to A2 in general. Consistent patterns have been observed under other scenarios in Table 1. In addition, we examine the performances using the ROC curves. The ROC curves corresponding to Table 1 are given in Fig. 1, which clearly show that A1 outperforms A2-4. Additional simulation results for gene expression data are given in Table 2 and Fig. 2, which also demonstrate the merit of the network approach over the alternatives when moderate to strong correlation exists among genetic variants. To further examine the performance of the proposed approach, we also conduct simulation under n = 500 and p = 1500. Results are summarized in Table 3 and Fig. 3. Both the identification accuracy in Table 3 and ROC curves in Fig. 3 demonstrate the superiority of the proposed A1 over alternatives.
In addition to the identification results and the ROC curves, we acknowledge that plots of piece-wise solution path can be adopted to investigate the similarity and difference among different regularization methods, especially when the number of predictors (features) entering the model is moderate or small. In our simulation study, the number of important SNPs and gene expressions is large, therefore such an approach is not pursued.

Real data analysis
As described in the background section, we analyze Nurses' Health Study (NHS), a case control study of type 2 diabetes which are part of the Gene Environment Association Studies. Detailed information of the datasets are available from Hu et al. [11]. We focus on SNPs in several important pathways potentially related to T2D. They are the Wnt signaling pathway, cell cycle pathway and p53 signaling pathway. After cleaning the data through matching phenotypes and genotypes, removing SNPs with minor allele frequency (MAF) less than 0.05 or deviation from Hardy-Weinberg equilibrium, the working dataset contains 3391 subjects. There are 5079 SNPs in the Wnt signaling pathway, and 3793 SNPs in the cell cycle and p53 signaling pathway.
We first apply all the 4 methods on the Wnt signaling pathway. The A1-4 identify 834, 841,847 and 848 SNPs that are associated with T2D, respectively. As a representative example, we examine closely gene DAMM1 and its subnetwork. DAMM1 is reported to be associated with diabetic nephropathy, a common complication of type 2 diabetes (Sapienza et al. [12] and Kavanagh et al. [13]). The upper panel of Fig. 4 shows the subnetwork of DAMM1, where the red nodes indicate the SNPs from DAMM1. In the subnetworks, thickness of edges denotes the strength of correlation between SNPs. It can be clearly observed that the proposed approach has identified much more highly correlated SNPs, since the interconnections among SNPs have been accommodated by the approach that incorporates the network structure information. The network approach (A1) selects 19 SNPs and 15 belong to DAMM1, while other 3 approaches only identify 9,6 and 6 SNPs correspondingly. A1 leads to a more tightly connected network, which is consistent with our findings in the simulation study that it promotes the interconnections among SNPs. Furthermore, the proposed one identifies SNP rs1252906, which plays a crucial role in the progression of nephropathy (Kavanagh et al. [13]). Other methods fail to identify this important SNP. The genes corresponding to the SNPs in the subnetworks are given in the upper panel of Fig. 5. The analysis has also been carried out on the SNPs combined from the cell cycle pathway and p53 signaling pathway. There are 814, 828, 827 and 829 SNPs identified by A1-4 correspondingly. We focus on the subnetwork of gene CASP9, which is one of the key players in inducing cell apoptosis (Cnop et al. [14]). Previous studies show that CASP9 is associated with diabetic retinopathy, a common and serious complication of type 2 diabetes (Baharian et al. [15] and Looker et al. [16]). In the NHS study, CASP9 has a total of 11 SNPs. The subnetwork of CASP9 is shown in the lower panel of Fig. 4. The proposed method identifies a subnetwork which has 7 SNPs from CASP9, and the other 7 SNPs from gene CELA2A, CELA2B and DNAJC16. Both CELA2A and CELA2B encode protein elastases, which hydrolyze elastin and many other proteins. DNAJC16 is a member of heat shock protein family (Hsp40). And it has been found in multiple studies that Hsp40 are related to cell apoptosis in type 2 diabetes (Laybutt et al. [17] and Chien et al. [18]. It is very interesting that CELA2A, CELA2B, CASP9 and DNAJC16 locate on chromosome 1 as a cluster. CELA2A is also identified by the rest 3 approaches, while CELA2B is only not in the subnetwork identified by A2(MCP). Overall, the network effects of CASP9, CELA2A, CELA2B and DANJC16 on type 2 diabetes, especially diabetic retinopathy, worth further investigations.

Discussion
In this paper, we develop a network-based regularized logistic regression model for the analysis of high dimensional genetic data and identification of important SNPs in the case-control study of type 2 diabetes. Advancing from existing studies, the proposed one has desired property to take correlation pattern among genetic variants into account without incurring extra bias. We provide an efficient iteratively reweighed least squares (IRLS) algorithm within the coordinate descent framework. The computational cost has been significantly reduced due to convenient approximations to the original regularized log likelihood function. Simulation demonstrates that the proposed one outperforms closely related alternatives.
Computation feasibility is an important practical consideration for high-dimensional regularization methods. In simulation, the CPU time (in minutes) of applying the proposed method on 100 replicates of simulated SNP data with n = 1000, p = 1500 and AR structure is 218.3 on a regular laptop. In the case study, the CPU time for analyzing the Wnt pathway with n = 3391 and p = 5079 is 70. 16. The proposed method can be potentially applied to larger datasets with a reasonable computation time. It has been widely acknowledged in Fan and Lv [19], Jiang et al. [20] and studies alike that regularization methods have to be coupled with screening strategy to accommodate ultra-high dimensional data from for instance, large-scale GWAS studies. The proposed network constrained regularization method can be implemented in such a two stage framework. Further investigations are intriguing but beyond the scope of this paper, and will be postponed to the future.
The methodological development in this article has been partially motivated by the analysis of the datasets from Nurese's Health Study (NHS). In the past, this study has been extensively investigated under marginal methods ( [21] and [22]) which ignores the joint effects of the SNPs. In addition, although studies like Wu, Cui and Ma [23] consider the effects of SNPs jointly within the penalization framework for continuous phenotypes, the correlation among the SNPs still have not been fully taken into account. The proposed approach quantifies the strength of correlation among SNPs via network structure and is able to incorporate the correlation in the identification of important SNPs. In the case study, we have identified interesting subnetworks with respect to genes closely related to T2D. In this work, we have focused on methodological development. More thorough bioinformatics and functional investigations will be needed in the future to fully understand the identified results.

Conclusions
The network-constained logistic regulaization method proposed in this study has demonstrated superior performance in identifying important genetic variants from both simulation study and the Nurese's Health Study, a case-control study of type 2 diabetes with high dimensional SNP measurements. The network term in the regularized loss function accomodates the LD widely present among SNPs, which guarantees the advantage of the developed one over the competing alternative methods.