 Methodology article
 Open Access
A strategy analysis for genetic association studies with known inbreeding
 Stefano Cabras^{1}Email author,
 Maria Eugenia Castellanos^{2},
 Ginevra Biino^{5},
 Ivana Persico^{3},
 Alessandro Sassu^{4},
 Laura Casula^{3},
 Stefano del Giacco^{6},
 Francesco Bertolino^{1},
 Mario Pirastu^{3, 4} and
 Nicola Pirastu^{4, 7}
https://doi.org/10.1186/147121561263
© Cabras et al; licensee BioMed Central Ltd. 2011
 Received: 16 March 2011
 Accepted: 18 July 2011
 Published: 18 July 2011
Abstract
Background
Association studies consist in identifying the genetic variants which are related to a specific disease through the use of statistical multiple hypothesis testing or segregation analysis in pedigrees. This type of studies has been very successful in the case of Mendelian monogenic disorders while it has been less successful in identifying genetic variants related to complex diseases where the insurgence depends on the interactions between different genes and the environment. The current technology allows to genotype more than a million of markers and this number has been rapidly increasing in the last years with the imputation based on templates sets and whole genome sequencing. This type of data introduces a great amount of noise in the statistical analysis and usually requires a great number of samples. Current methods seldom take into account genegene and geneenvironment interactions which are fundamental especially in complex diseases. In this paper we propose to use a nonparametric additive model to detect the genetic variants related to diseases which accounts for interactions of unknown order. Although this is not new to the current literature, we show that in an isolated population, where the most related subjects share also most of their genetic code, the use of additive models may be improved if the available genealogical tree is taken into account. Specifically, we form a sample of cases and controls with the highest inbreeding by means of the Hungarian method, and estimate the set of genes/environmental variables, associated with the disease, by means of Random Forest.
Results
We have evidence, from statistical theory, simulations and two applications, that we build a suitable procedure to eliminate stratification between cases and controls and that it also has enough precision in identifying genetic variants responsible for a disease. This procedure has been successfully used for the betathalassemia, which is a well known Mendelian disease, and also to the common asthma where we have identified candidate genes that underlie to the susceptibility of the asthma. Some of such candidate genes have been also found related to common asthma in the current literature.
Conclusions
The data analysis approach, based on selecting the most related cases and controls along with the Random Forest model, is a powerful tool for detecting genetic variants associated to a disease in isolated populations. Moreover, this method provides also a prediction model that has accuracy in estimating the unknown disease status and that can be generally used to build kit tests for a wide class of Mendelian diseases.
Keywords
 Thalassemia
 Random Forest Model
 Kinship Coefficient
 Genealogical Tree
 Hungarian Algorithm
Background
One of the main objectives in studying the genetics of complex diseases is not only the search of genetic variants associated to pathologies [1], but also to build predictive models which help both their diagnosis and early treatment. This problem can be formalized by expressing the disease status, Y, of each subject as a Bernoulli random variable Y = {0, 1} where Y = 1 indicates an affected subject. The main quantity of interest, F(x) = Pr(Y = 1x), is the conditional probability of being affected given a set x of genetic variants and environmental variables. Such variables form a huge set of potential predictors, which we will refer to as omic profile. Essentially, , where P is the number of considered genetic variants and environmental variables. Specifically, we use a sample of N ≪ P where N is of order of hundreds while P is thousands times larger. This setup complicates the estimation of F(x), because, in absence of strong prior information [2] on the part of the omic profile related to the disease, the data should allow us to choose F(x) within a large class of models . In order to achieve this goal it is necessary to reduce the spurious genetic variability not related to Y. For example, if we were using logistic regression models, F(x) would be one of these models selected between all the possible logistic regression models, , which are, at least, 2 ^{ P } . In this paper we relax the assumption of a parametric model by using nonparametric methods [3], which means that has infinite dimension. Such models are usually referred as nonparametric models. The genomic profile, part of the omic profile, consists of a large set of DNA markers, say 500000 SingleNucleotide Polymorphisms (SNPs), while the set of environmental variables includes individual anthropometric measurements and information derived from a standardized interview collecting sociodemographic, lifestyle, medical and pharmacological history data on many pathologies. We suppose that such covariates may cause the outcome Y. In particular, for certain types of diseases, it is possible to have prior information about the environmental variables, but in most cases there is no such information about the causing genes. The disease prediction model, F(x), for the future outcome Y x must take into account genegene interactions and also their interaction with the environment. Such interactions, usually of unknown order, can be multiplicative or additive [4]. Estimation of F (x) is a primary concern in personalized medicine, because F(x) can be used as the basis for early diagnosis of a disease, permitting actions to prevent the pathology before its insurgence, and to personalize treatments.
In order to estimate F(x), we consider a matrix of omic profiles, X_{ N }_{× P}, and the known disease status, y_{ N } = (y_{1}, y_{2}, ..., y_{ N } ), measured on N ≈ 100 highly inbred individuals that belong to an isolate population where the genealogy is fully known.
The translation of this estimation problem into statistical terms sounds as follows: given a huge set of covariates, P ≈ 500000, we have to estimate a probability model, , using a sample of dependent observations, (y_{ N } , X_{ N }_{× P}), of size N ≪ P. The statistical analysis of such problem presents the following critical points:
i) observations are not independent and consequently all unconditional inference, with respect to the genealogy, cannot be applied here. Differently from usual association studies between genetic variants and diseases, we have knowledge of such dependency by means of the genealogical tree. Moreover, the dependency is important in order to gain precision in estimating F(x). In fact, two affected brothers are more likely to share the same part of x causing the disease with respect to two unrelated subjects.
ii) the estimation of model F(x) would typically lead to a sparse model because biological background suggests that only a very small set of genetic variants interact in order to produce the disease. The dimension of grows exponentially with P. For example, if SNPs configurations were represented by categorical variables with two levels the space would have dimension 2 ^{ P } , without considering interactions. Such dimension prevents an exhaustive exploration of all possible models. Moreover, as N ≪ P then classical multivariate analysis techniques, such as multivariate regression, cannot be employed here to make an exhaustive search of all possible models. Finally, usual model selection approaches are not feasible due to computational costs.
In this paper, we aim to address the above critical points. In particular, point i) is considered in the Methods Section by reducing the genetic variability not related with the disease. We achieve this through an experimental design in which we choose, for each case, the most related control, based on the known genealogy. Point ii) is treated also in the Methods Section where Random Forest, a nonparametric regression model based on ensemble methods, is employed to estimate F(x). This allows us to explore a wide region of at the price of reasonable computational costs.
For validation purposes, we present applications of the method to two different phenotypes: BetaThalassemia and common asthma. BetaThalassemia is a genetic disorder caused by a mutation inside the betahemoglobin (HBB) gene [5]. Only homozygous individuals for the mutation manifest the clinical traits of the disease. However, carriers, although completely sane, show a reduced mean cell volume (MCV≤72) of red blood cells [6], and this parameter can be used to identify them. In Sardinia carrier of betathalassaemia are about 15% of the population and a single mutation account for 95% of the betathalassaemia mutations [7, 8]. The main goal of this analysis is to validate the method by tracing back the position of the mutation in the gene.
Differently from Betathalassemia, the goal of analyzing common asthma is to gain more biological insights on this diffuse disease which may be caused by several unknown variants on different genes.
Although the method here proposed is of general applicability to any isolated population, we tested it on a population located in one small village (Talana) in a secluded area (Ogliastra) of Sardinia (Italy). Such population is characterized by a great deal of homogeneity in life style and eating habits and by a high endogamy and consanguinity. Inhabitants of the village participated to an epidemiological survey assessing their health status, so that a complete and standardized data set is available. Thanks to the accessibility of complete municipal and parish archives, going back to the seventeenth century, it was possible to cluster all people living in the villages into large familiar structures with common ancestor. Data have been collected by Shardna Live Science http://www.shardna.com within the Ogliastra  project aimed at studying several genetic isolates of Ogliastra.
Results and Discussion
The following two sections present the application of the method here proposed to betaThalassemia and to common asthma.
Application to BetaThalassemia
One of the biggest issue in genome wide association studies is the number of false positive results due to the great number of association tests performed. In casecontrol studies it should be possible to reduce such number by choosing the control subjects so that they are the most genetically similar to the cases. We propose to apply the Hungarian algorithm [9, 10] to solve this problem by using the known kinship coefficient [11] as a measure of the genetic similarity between the subjects. The proposed approach also works if the kinship were estimated from the sample, which is however not the case here. The details on how the choice of the controls is done are described in the method section.
The aim of this application is to evaluate the power of the Hungarian method in choosing the best set of controls for a given sample of cases which correspond to all the BetaThalassemia mutation carriers in the isolated population we investigated. Such cases can be easily identified, among the available data, through the use of the MCV values measured for each subject in our study. In fact BetaThalassemia mutation carriers are known to show reduced MCV, namely we used MCV≤72 for cases and MCV≥75 for controls. With such cutoffs there are 123 cases, over a data set of 805 subjects. We then selected the subsample D, where 123 controls have been chosen to be the most related to the I = 123 cases, based on the kinship coefficient matrix of the 805 subjects. Analysis of such data is compared against other 9 balanced subsamples, of size N = 246, where controls have been assigned randomly from the available 682 controls. Such assignation, that ignore the relativeness, is the usual one for observational studies from outbred populations.
Comparison of prediction errors for Thalassemia's analysis
Sample Id.  First SNPs  Error (%)  Distance (in Kb) 

Hungarian  rs7124435  12  394 
Random 1  rs12271916  32  581 
Random 2  rs12271916  32  581 
Random 3  rs16932946  33  278 
Random 4  rs12271916  32  581 
Random 5  rs12271916  32  581 
Random 6  rs12271916  32  581 
Random 7  rs1378738  32  316 
Random 8  rs12271916  32  581 
Random 9  rs12271916  32  581 
Comparing approach with the singlepoint analysis, carried out with Fisher test on subsample D, we obtain that SNP rs12271916 is the most significantly associated according to the Fisher test, but it is also located at 580 Kb from the HBB gene farther away from the SNP obtained with .
The final message conveyed by Figure 1 and Table 1 is that the subsample D has lower genetic spurious variability than other subsamples obtained ignoring the genealogy. Finally, using with the genealogical tree we are more likely to trace back the position of the mutation point causing BetaThalassemia.
Application to Asthma
Over a data set of 208 genotyped subjects underwent clinical and instrumental examination, we dispose of a total of I = 57 asthmatic cases and J = 151 non asthmatic potential controls. For each subject we measured breath nitric oxide, forced expiratory volume (FEV1) and blood IgE levels. Moreover a standardized epidemiological questionnaire was administered to all participants to the study. The affection status has been then assessed by a clinician according to the GINA (Global INitiative for Asthma) guidelines. The total amount of considered SNPs is P = 500192, each of them with two or three configurations. For each individual sex, age, smoking habits and degree of physical and sport activity is known. The latter has been classified in three categories: scarce, moderate and intense. We also dispose of 561 further subjects who did not undergo a clinical assessment for asthma but did participate to the epidemiological survey reporting their health status for whom genotypes are available. This sample will be used for validation purposes by looking at the association of the asthma status, predicted by the method, with asthma status selfreported by subjects.
After a first screening with the Bagging approach, from P variables, we end up with a list of P' = 200 variables. Such variables are all SNPs, while environmental variables do not play almost any role in predicting asthma if compared with these 200 SNPs.
We further validate the estimated prediction model , with the first 100 most important SNPs, on the 561 genotyped individuals, not including in the previous I + J = 208 subjects, by looking at the association of the predicted asthma state, , with the selfreported diagnosis. It is worth noting that comparing the clinically assessed individuals to the selfreporting ones we observed an agreement on the 90.9% of subjects. The association between and the anamnestic data is highly significative (pvalue ≈ 10^{6} ). Among the 68 asthmatic, 52 have been correctly classified by , while the 493 nonasthmatic 271 have been correctly classified. Essentially we have that individuals predicted to be healthy and having an healthy anamnesi are more than the number expected under the hypothesis of independence between anamnesi and the predicted asthma status, . The same, but in the reverse, can be noted for the frequency of those classified as asthmatic which also exhibit an asthmatic anamnesi more than that expected under the hypothesis of independence.
Order  SNPs  Gene  Chrom.  Chrom. Pos. 

1  rs12273350  PKNOX2  11  125241046 
2  rs1254673  no  10  44473793 
3  rs10861957  CORO1C  12  109056377 
4  rs3105377  no  7  68837002 
5  rs10004892  no  4  189872850 
6  rs434949  no  11  29602274 
7  rs9524111  GPC6  13  94169901 
8  rs1918215  no  12  77750550 
9  rs7958647  no  12  77747825 
10  rs10746129  CORO1C  12  109091089 
Asthma's analysis with Mixed Effect model
Order  SNPs  Gene  Chrom.  Chrom. Pos. 

1  rs2505506  CSGALNACT2  10  43645854 
2  rs2813829  no  7  24253168 
3  rs739854  no  7  24257317 
4  rs7559302  PARD3B  2  205951603 
5  rs16218  no  7  24257655 
6  rs16212  no  7  24264661 
7  rs2642265  no  7  24253062 
8  rs16997879  no  20  51909584 
9  rs16217  no  7  24257706 
10  rs94967  no  21  41195307 
Simulation study
In this section we show the results of a small simulation study with the main objective of studying differences among Receiver Operating Characteristic (ROC) curves when considering and ignoring the kinship among individuals. Beside we consider: the BH method for multiple testing [13, 14], the Qvalues described in [21], the Efron's procedure in [22] and the ME model also applied for the asthma. It is worth to mention that the comparison of different methods is very difficult and not clear, because each method is specific and tailed for certain desirable features. For instance, among the methods that control an error rate for multiple testing, different methods claim to control the FDR. However, they indeed control different definition of the FDR, because the BH method [13] controls a FDR whose definition differs from the positive FDR of the Qvalues in [21] and the local FDR in [22]. We left to the reader to look at [23] for a general discussion on comparison among these methods. Further more, the three mentioned methods differ from the in that they do not explicitly estimate the F(x) model, while they just assume that there exists one that belongs to a certain class of models as the one specified in [14] for the BH procedure. Moreover, the comparison between the importance of a predictor with and the corresponding significance of a regression coefficient is challenging because they belongs to different metrics as illustrated in [24]. In particular, comparison against the GEE logistic regression discussed in [25] and [26] is left as further interesting work. However, the main advantage of , against regression models, is that nonlinearities and interactions can be learned from the data without any need to be specified beforehand.
Although an extensive comparison of the with other classification methods is beyond the scope of the paper (on this purposes see, for instance, [27]), we consider a small simulation study were we compare the area under the ROC curve of and the three mentioned multiple testing approaches with the primary intention of assessing the differences under two scenarios: when the kinship is taken into account, as the case of this paper, and when it is ignored.
Here we simply assume that each method under comparison is able to sort P variables according to their level of association, whether or not this is significative. Considering P cut points on each list of important variables, we are able to build a ROC curve and measure the corresponding area.
Another message conveyed by Figure 4 is that improvement of methods, under the Hungarian sample, it is not specific to the , but it may be also substantial for other genomewide scanning methods.
Conclusions
We propose a strategy analysis that leads to a statistical procedure which allows us to estimate the status of genetic disease, based on the omic profile, and also provides more insights on the biological background of the disease.
Differently from usual methods, we make a multipoint analysis where interactions among all genes are considered and estimated. The method also provides a way to reduce spurious genetic variability by introducing the genealogical information into the analysis through a suitable experimental design.
We think that this strategy may be a reference one for analyzing data from genetic isolated populations, where the degree of relativeness is known.
Methods
The two critical points of genetic association studies of isolated populations with known genealogy are addressed here. In particular, dependency among subjects is accounted in the following section with a suitable experimental design, based on Hungarian method. The prediction model F(x) and its estimation, based on Random Forest , is described in the Model Section.
The Hungarian method has been implemented in R (library clue, http://www.rproject.org) while for the we recurred to the open source project named PArallel Random Forest (PARF, http://code.google.com/p/parf). The latter is implemented in Fortran 95 and compiled with Open MPI libraries http://www.openmpi.org.
An experimental design that accounts for genealogy
Estimation of model F(x), , is based on a subset, D, of the available random sample of size N , . Variable y_{ n } represents the disease status of the nth subject, where y_{ n } = 1 if the subject n has been diagnosed with the disease and y_{ n } = 0 otherwise. Suppose to have cases and J = N  I controls, where cases and controls have been labeled according to the medical diagnosis of the disease. Vector x_{ n } is of size P and contains SNPs configurations and covariates that resemble habits and individual characteristics of subject n. The vector x_{ n } defines the omic profile of subject n which we use to predict the disease status Y.
We are interested in using a balanced sample of size 2I where the genetic variability, exogenous to the disease, is reduced. The idea to obtain such sample consists in accounting for relatedness among individuals of different disease status, which we can do taking advantage of the availability of the entire genealogy of the investigated isolated population. For instance, if we consider two individuals, one affected and one not, we are more likely to identify disease predisposing variants if they were brothers rather than if they were unrelated.
The genetic variability, between two individuals, increases with the number of meiotic steps that separate them into the genealogical tree. Such number define relatedness among two individuals. The most common used measure of relatedness between two individuals i and j, is the kinship coefficient [11], k_{ ij } , which represents the probability that two genes, sampled at random from each individual, are identical because inherited from the same ancestor (IBD). For instance k_{ ij } = 1/4 if i is the parent of individual j, while k_{ ij } = 0 zero if i and j are not related.
Let be the matrix of kinship indexes between cases and controls, whose generic entry is k_{ ij } , the kinship index between case i, i = 1, ..., I and control j, j = 1, ..., J. Finding a subset of size I of controls most related to the I cases corresponds to find a submatrix K_{I × I}of K' with the I columns made by a set of indexes such that is maximized over all possible sets of indexes .
This is an assignment problem, which is typical of combinatorial optimization area, and that has been solved with the well known Hungarian algorithm. This method was, for many years, attributed to H. Kuhn who developed and published it in [9]. However, in 2006, it was discovered that Carl Gustav Jacobi solved the assignment problem in the 19th century, and published posthumously in 1890 in Latin [10].
In the sequel, we assume that the prediction model F(x) is estimated on the subsample D ⊂ D' where controls correspond to the columns of K.
The design of such experiment provides a balanced case/control study with 2I individuals, where spurious genetic variability is supposed to have been reduced.
Model
In order to obtain the prediction model, F(x), for a certain omic profile, x, we propose a procedure based on Random Forest described in [29]. is a non parametric regression model and its use in genetic association studies grows only in the very recent years. In particular, [27] and [30] use it with expression data, where the considered values of P are of order of a few thousands and the observations are assumed independent. On the contrary, most genetic association studies consider only one genetic variant at time and rarely take into account the interactions between them [12]. Other papers, based on multiple testing [23, 31], take into account interactions among genes, but they do not explicitly model such dependency as done in this paper.
In this application, each h_{ m } (a _{ m }) estimates part of the complex relation between the omic profile and the disease, while the whole estimation of such complex model is given by .
If then the profile x belongs to an ill individual otherwise to a healthy one. Note that is also the Bayes rule under prior probabilities Pr(Y = 1) = Pr(Y = 0) = 0.5. The threshold, 0.5, may be changed according to different prior probability for the disease of interest. Such prior may be suggested by the prevalence of the disease in the population of interest.
where s = 1  E_{(Y,X)}(λ_{ m }), is the classifier's strength, and ρ is the correlation among classifiers, also known as classifier's diversity. Under conditions of Theorems 1.2 and 2.3, does not overfit if we arbitrary increase M. A more general discussion on consistency of and other classification methods can be found in [33].
Using we obtain a nonparametric regression model that we can use to make prediction and variable (SNP) selection at the same time. Variable selection is performed by means of each h_{ m } , because a_{ m } is obtained by using only those variables that assure the largest decrease of the Bernoulli deviance.
Therefore, for each of the P variables we can consider its importance according to the total amount of decreasing of the Bernoulli deviance induced in all M trees, say . We can then produce a rank of the P variables by sorting them into a decreasing order according to their respective η_{ j } . The interpretation of the set is that each genetic variant contributes to the complex genetic model, but some of them are more important than others. The greater the η_{ j } the more likely is that variable (SNP) j is a risk or protective factor for the disease.
Another advantage of is that it is highly parallelizable, in fact we implemented it for running in a cluster of CPUs, but it could also be implemented to run on GPUs (Global Processor Units, gpu.sourceforge.net). This approach makes a suitable approach for association studies with high numbers of combination of genetic variants that interact one with the other. Moreover, the simulation study in [34] showed that is able to detect true associations under a purely interactive model.
A Bagging strategy for genomewide association studies
Association studies aim at discovering causal variants that are responsible for the disease and these can be located by those SNPs that are nearest to the causal variant. In order to have a good prediction model, we should exclude all SNPs that, although having some predictive power on the outcome Y, are far from the causal variant and they are essentially false positives. Statistically speaking, we have to make a screening among all SNPs before proceeding to estimate the final model F(x). After the screening phase we construct the final model, by selecting the most associated SNPs. These two phases are related to the screening and cleaning procedures advocated in [36]. The work of [36] shows the consistency of this strategy in the context of classical regression models.
For the screening procedure we noted that each h_{ m } makes the subset of the p variables randomly chosen to compete, by using, as splitting variables, only those that are most associated with Y (i.e. the highest decrease in the Bernoulli deviance). Therefore, only SNPs most related to I are chosen in each tree. If p ≪ P then each a_{ m } is estimated by the competition between only p variables. For this reason it is likely that in M trees there would be many different SNPs that are selected to form the final . This approach, which is the usual one in , may produce a final model in which many SNPs are included just because each a_{ m } refers to different sets of predictors. In order to avoid such false positives, we start by running in Bagging mode, that is with p = P, implying that all D_{ ib } in M trees differ only for individuals but not for variables. The set , corresponding to the bagging, allows us to select those SNPs that are most related to Y by eliminating all variables with η_{ j } = 0. We actually expect many variables with zero importance, because such variables have never been used in any of the M trees. We then repeat the bagging procedure on the remaining set of variables until all η_{ j } > 0. This screening phase eliminates all those false positives due only to LD while retaining those SNPs that bear important information. For example, in the case of an association to an haplotype instead of a single SNP on a gene, we expect that only SNPs important to the haplotype would be retained.
We end up, with the screening phase, with a set of P' variables and the corresponding model has a high Λ just because trees only differs for individuals and not for the used variables, therefore its correlation ρ → 1. In order to make ρ → 0, we run again with the k most important variables among the selected P' using the rule suggested in [29], that is for k = 1, ..., P', and denotes the nearest smallest integer to . We start estimating with all the elements in the list according to and we subsequently remove one variable at a time until the estimated value of Λ starts to increase. The final list of variable (SNPs) is the one obtained by running the RF on the increasing sets of most important variables obtained with Bagging. This set is the smallest one that produces the lowest prediction error. This is also the set of variables supposed to be most associated with Y and that produces the smallest classification error with .
Remarks
As final remarks, we would like to comment some aspects of the methods we used:

all covariates in X have almost the same number of categories, 2 or 3, both for SNPs and environmental variables. This is important because it is known that if the number of categories are very different among predictors, or if continuous covariates are mixed with categorical ones, then the latter would result to be most important than categorical variables with a small number of categories. Essentially, there would be a BIAS in assessing variable importance with as noted in [37]. The BIAS is due to the fact that a variable with more levels or a continuous variable is more likely to be a splitting variable with respect to another discrete variable with a lower number of levels. This could affect situations in which there variables with different number of levels, however it is not the case in our applications. Finally, it would be interesting to compare variable importance by looking at their distance from the root node of the tree as recently proposed in [38] for survival analysis. At the moment, this is beyond the scope of the paper;

the huge number of predictors, say P ≈ 500000 (but P ≈ 10^{6} is on the wing), requires parallel computing and huge amount of storage and fast access memory. Therefore, it is difficult to manage the kind of analysis, here discussed, by classical regression methods which usually require to calculate (X^{ T }X)^{1}. Other alternatives to , based on ensemble methods, could be the gradient boosting algorithm [39] or its stochastic version [40]. From a Bayesian perspective, we note that the Bayesian Additive Regression Trees (BART), proposed in [41], could be an alternative because it allows to draw explicit causal relations among genetic variants and disease status. In our study, we rather draw simple associations. Unfortunately, BART is much more difficult to be parallelize than and the computational e ort would be not affordable with a standard machine used for the present study;

the kinship coefficient is a first attempt to summarize the genealogical tree and some information may be lost. To avoid this, it would be possible to directly use the genealogy in the construction of each classification tree h_{ m }. This would be in line with the Breiman's prescription which requires that all aspects of growing a random forest take into account the outcome.

the proposed design consisting in selecting the controls most inbred with cases could be potentially applied to other methods for association studies, beginning with the very simple Fishertest as done in the simulation study of Figure 1 for the Thalassemia and also in the simulation study where alternative approaches to the are compared. At the moment, we have no evidence for stating that the combination of the design and Random Forest has unique advantages;

rare causal variants, which have a very small effect, can be a problem for RF because each of the M trees are grown until the subjects in the final node are less than a specified quantity which is 2 subjects in the present study. Therefore, if there are rare causal variants and they have an effect so small that less than 2 individuals are affected, these cannot be detected with the proposed approach. However, such variants are, in general, very difficult to be detected.
Declarations
Acknowledgements
This work was supported by the following institutions:
• Italian Ministry of Education, University and Research, grants no. 2008275CMW002 and no. 5571/DASPAR/2002;
• The visiting professor program of Regione Autonoma della Sardegna (Italy);
• Ministerio de Ciencias e Imnovaci'on of Spain, grant no. MTM201019528;
• Comunidad Autonoma de Madrid of Spain, grant no. CM. S2009/esp1594;
• Project ICT SIAI 101 "Creating a center of excellence in biocomputing technologies applied to personalized medicine" Sardegna Ricerche, Sardinia Region, Ministry of Economy and Finance.
Authors’ Affiliations
References
 Balding D: A tutorial on statistical methods for population association studies. Nat Rev Genet. 2006, 7: 781191. 10.1038/nrg1916.View ArticlePubMedGoogle Scholar
 Fridley BL, Serie D, Jenkins G, White K, Bamlet W, Potter JD, Goode EL: Bayesian mixture models for the incorporation of prior knowledge to inform genetic association studies. Genet Epidemiol. 2010, 34 (5): 41826. 10.1002/gepi.20494.PubMed CentralView ArticlePubMedGoogle Scholar
 Wasserman L: All of Nonparametric Statistics. 2006, SpringerGoogle Scholar
 Andersson T, Alfredsson L, Källberg H, Zdravkovic S, Ahlbom A: Calculating measures of biological interaction. Eur J Epidemiol. 2005, 20 (7): 575579. 10.1007/s106540057835x.View ArticlePubMedGoogle Scholar
 Trecartin RF, Liebhaber SA, Chang JC, Lee KY, Kan YW, Furbetta M, Angius A, Cao A: beta zero thalassemia in Sardinia is caused by a nonsense mutation. J Clin Invest. 1981, 68 (4): 10121017. 10.1172/JCI110323.PubMed CentralView ArticlePubMedGoogle Scholar
 Rosatelli C, Leoni G, Tuveri T: Heterozygous Betathalassemia: relationship between the hematological phenotype and the type of Betathalassemia mutation. American Journal of Hematology. 1989, 39:Google Scholar
 Rosatelli MC, Dozy A, Faà V, Meloni A, Sardu R, Saba L, Kan YW, Cao A: Molecular characterization of betathalassemia in the Sardinian population. Am J Hum Genet. 1992, 50 (2): 4226.PubMed CentralPubMedGoogle Scholar
 Cao A, Congiu R, Sollaino MC, Desogus MF, Demartis FR, Loi D, Cau M, Galanello R: Thalassaemia and glucose6phosphate dehydrogenase screening in 13 to 14yearold students of the Sardinian population: preliminary findings. Community Genet. 2008, 11 (3): 1218. 10.1159/000113873.View ArticlePubMedGoogle Scholar
 Kuhn HW: The Hungarian Method for the Assignment Problem. Naval Research Logistics. 1955, 2: 8397. 10.1002/nav.3800020109.View ArticleGoogle Scholar
 Calmet J, Ollivier F: Editors' foreword [Special issue: Jacobi's legacy]. Appl Algebra Engrg Comm Comput. 2009, 20: 14. 10.1007/s0020000900908.View ArticleGoogle Scholar
 Lange K: Mathematical and statistical methods for genetic analysis. 1997, SpringerVerlag, New YorkView ArticleGoogle Scholar
 Balding DJ: A tutorial on statistical methods for population association studies. Nat Rev Genet. 2006, 7 (10): 781791. 10.1038/nrg1916.View ArticlePubMedGoogle Scholar
 Benjamini Y, Hochberg Y: Controlling the False Discovery Rate: a Practical and Powerful Approach to Multiple Testing. J Roy Stat Soc B. 1995, 57: 289300.Google Scholar
 Benjamini Y, Yekutieli D: The control of the False Discovery Rate in multiple testing under dependence. Ann Stat. 2001, 29: 11651188. 10.1214/aos/1013699998.View ArticleGoogle Scholar
 Lee YW, Eum SY, Chen KC, Hennig B, Toborek M: Gene expression profile in interleukin4stimulated human vascular endothelial cells. Mol Med. 2004, 10 (16): 1927.PubMed CentralView ArticlePubMedGoogle Scholar
 Weiss ST, Raby BA, Rogers A: Asthma genetics and genomics 2009. Curr Opin Genet Dev. 2009, 19 (3): 279282. 10.1016/j.gde.2009.05.001.View ArticlePubMedGoogle Scholar
 Almon RR, Yang E, Lai W, Androulakis PI, DuBois CD, Jusko WC: Circadian Variations in Rat Liver Gene Expression: Relationships to Drug Actions. The Journal of Pharmacology and Experimental Therapeutics. 2008, 326: 700726. 10.1124/jpet.108.140186.PubMed CentralView ArticlePubMedGoogle Scholar
 Thompson EA, Shaw RG: Pedigree analysis for quantitative traits: variance components without matrix inversion. Biometrics. 1990, 46 (2): 399413. 10.2307/2531445.View ArticlePubMedGoogle Scholar
 Chen WM, Abecasis GR: Familybased association tests for genomewide association scans. Am J Hum Genet. 2007, 81 (5): 913926. 10.1086/521580.PubMed CentralView ArticlePubMedGoogle Scholar
 Aulchenko YS, de Koning DJ, Haley C: Genomewide rapid association using mixed model and regression: a fast and simple method for genomewide pedigreebased quantitative trait loci association analysis. Genetics. 2007, 177: 577585. 10.1534/genetics.107.075614.PubMed CentralView ArticlePubMedGoogle Scholar
 Storey JD: The Positive False Discovery Rate: A Bayesian Interpretation and the Qvalue. Ann Stat. 2003, 31: 20132035. 10.1214/aos/1074290335.View ArticleGoogle Scholar
 Efron B: Size, Power and False Discovery Rates. Ann Statist. 2007, 35 (4): 13511377. 10.1214/009053606000001460.View ArticleGoogle Scholar
 Farcomeni A: A review of modern multiple hypothesis testing, with particular attention to the false discovery proportion. Stat Methods Med Res. 2008, 17 (4): 347388.View ArticlePubMedGoogle Scholar
 Grömping U: Variable Importance Assessment in Regression: Linear Regression versus Random Forest. The American Statistician. 2009, 63 (4): 308319. 10.1198/tast.2009.08199.View ArticleGoogle Scholar
 Hancock DB, Martin ER, Li YJ, Scott WK: Methods for interaction analyses using familybased casecontrol data: conditional logistic regression versus generalized estimating equations. Genet Epidemiol. 2007, 31 (8): 88393. 10.1002/gepi.20249.View ArticlePubMedGoogle Scholar
 Chen MH, Yang Q: GWAF: an R package for genomewide association analyses with family data. Bioinformatics. 2010, 26 (4): 5801. 10.1093/bioinformatics/btp710.PubMed CentralView ArticlePubMedGoogle Scholar
 DíazUriarte R, de Andrés SA: Gene selection and classification of microarray data using random forest. BMC Bioinformatics. 2006, 7 (3): 1122.Google Scholar
 Genovese C, Wasserman L: Operating characteristics and extensions of the false discovery rate procedure. J Roy Stat Soc B. 2002, 64: 499518. 10.1111/14679868.00347.View ArticleGoogle Scholar
 Breiman L: Random forests. Machine Learning. 2001, 45: 532. 10.1023/A:1010933404324.View ArticleGoogle Scholar
 de Edelenyi FS, Goumidi L, Bertrais S, Phillips C, Macmanus R, Roche H, Planells R, Lairon D: Prediction of the metabolic syndrome status based on dietary and genetic parameters, using Random Forest. Genes Nutr. 2008, 3: 1736. 10.1007/s122630080097y.View ArticleGoogle Scholar
 Cabras S: A note on multiple testing for composite null hypotheses. Journal of Statistical Planning and Inference. 2010, 140:Google Scholar
 Hastie T, Tibshirani R, Friedman JH: The Elements of Statistical Learning: Data Mining, Inference, and Prediction. 2001, SpringerVerlagView ArticleGoogle Scholar
 Biau G, Devroye L, Lugosi G: Consistency of random forests and other averaging classifiers. Journal of Machine Learning Research. 2008, 9: 20152033.Google Scholar
 Garcı'aMagarinos M, de Ullibarri IL, Cao R, Salas A: Evaluating the ability of treebased methods and logistic regression for the detection of SNPSNP interaction. Ann Hum Genet. 2009, 73 (3): 360369. 10.1111/j.14691809.2009.00511.x.View ArticleGoogle Scholar
 Lewontin RC, Kojima K: The evolutionary dynamics of complex polymorphisms. Evolution. 1960, 14 (4): 458472. 10.2307/2405995.View ArticleGoogle Scholar
 Wasserman L, Roeder K: High Dimensional Variable Selection. Ann Stat. 2009, 37 (5A): 21782201. 10.1214/08AOS646.PubMed CentralView ArticlePubMedGoogle Scholar
 Strobl C, Boulesteix A, Zeileis A, Hothorn T: Bias in Random Forest Variable Importance Measures: Illustrations, Sources and a Solution. BMC Bioinformatics. 2007, 8 (25):Google Scholar
 Ishwaran H, Kogalur UB, Gorodeski EZ, Minn AJ, Lauer MS: HighDimensional Variable Selection for Survival Data. Journal of the American Statistical Association. 2010, 105 (489): 205217. 10.1198/jasa.2009.tm08622.View ArticleGoogle Scholar
 Friedman JH: Greedy Function Approximation: A Gradient Boosting Machine. The Annals of Statistics. 2001, 29 (5): 11891232.View ArticleGoogle Scholar
 Friedman JH: Stochastic Gradient Boosting. Computational Statistics and Data Analysis. 2002, 38: 367378. 10.1016/S01679473(01)000652.View ArticleGoogle Scholar
 Chipman HA, George EI, McCulloch RE: BART: Bayesian Additive Regression Trees. The Annals of Applied Statistics. 2010, 4: 266298. 10.1214/09AOAS285.View ArticleGoogle Scholar
Copyright
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.