One of the main objectives in studying the genetics of complex diseases is not only to find genetic variants associated with pathologies [1], but also to build predictive models that support both diagnosis and early treatment. This problem can be formalized by expressing the disease status, *Y*, of each subject as a Bernoulli random variable *Y* ∈ {0, 1}, where *Y* = 1 indicates an affected subject. The main quantity of interest, *F*(**x**) = Pr(*Y* = 1 | **x**), is the conditional probability of being affected given a set **x** of genetic variants and environmental variables. Such variables form a huge set of potential predictors, which we will refer to as the *omic profile*. Essentially, **x** is a vector with *P* components, where *P* is the number of genetic variants and environmental variables considered. Specifically, we use a sample of size *N* ≪ *P*, where *N* is of the order of hundreds while *P* is thousands of times larger. This setup complicates the estimation of *F*(**x**) because, in the absence of strong prior information [2] on the part of the omic profile related to the disease, the data should allow us to choose *F*(**x**) within a large class of models. To achieve this goal, it is necessary to reduce the spurious genetic variability not related to *Y*. For example, if we were using logistic regression models, *F*(**x**) would be selected among all the possible logistic regression models, of which there are at least 2^{P}. In this paper we relax the assumption of a parametric model by using non-parametric methods [3], which means that the class of candidate models has infinite dimension. Such models are usually referred to as non-parametric models. The genomic profile, part of the omic profile, consists of a large set of DNA markers, say 500000 Single-Nucleotide Polymorphisms (SNPs), while the set of environmental variables includes individual anthropometric measurements and information derived from a standardized interview collecting socio-demographic, lifestyle, medical and pharmacological history data on many pathologies. We suppose that such covariates may cause the outcome *Y*. In particular, for certain types of diseases, prior information about the environmental variables may be available, but in most cases there is no such information about the causal genes. The disease prediction model, *F*(**x**), for the future outcome *Y* | **x** must take into account gene-gene interactions as well as gene-environment interactions. Such interactions, usually of unknown order, can be multiplicative or additive [4]. Estimation of *F*(**x**) is a primary concern in personalized medicine, because *F*(**x**) can be used as the basis for early diagnosis of a disease, permitting actions to prevent the pathology before its onset and to personalize treatments.
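As a concrete reading of *F*(**x**) = Pr(*Y* = 1 | **x**) in the parametric case mentioned above, the following minimal sketch evaluates a logistic regression model with main effects and, optionally, product terms on the log-odds scale, which is one common way to encode an interaction. The variable names, the 0/1 coding and all coefficient values are hypothetical and chosen only for illustration; they are not taken from the data analyzed in this paper.

```python
import math

def logistic_probability(x, intercept, main_effects, interactions=None):
    """Return F(x) = Pr(Y = 1 | x) under a logistic regression model.

    x             -- dict mapping variable name to numeric value
    intercept     -- baseline log-odds
    main_effects  -- dict mapping variable name to its coefficient
    interactions  -- dict mapping (name_a, name_b) to the coefficient
                     of the product term x[name_a] * x[name_b]
    """
    log_odds = intercept
    for name, beta in main_effects.items():
        log_odds += beta * x[name]
    for (a, b), gamma in (interactions or {}).items():
        log_odds += gamma * x[a] * x[b]
    return 1.0 / (1.0 + math.exp(-log_odds))

# Hypothetical profile: two SNPs coded 0/1 and one environmental exposure.
profile = {"snp_1": 1, "snp_2": 0, "smoker": 1}

# Hypothetical coefficients, for illustration only.
effects = {"snp_1": 0.8, "smoker": 0.5}
p_main_only = logistic_probability(profile, -2.0, effects)
p_with_interaction = logistic_probability(
    profile, -2.0, effects, {("snp_1", "smoker"): 1.2}
)
```

Each choice of which main effects and product terms to include defines a different candidate model, which is why the number of logistic models grows so quickly with *P*.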

In order to estimate *F*(**x**), we consider a matrix of omic profiles, **X**_{N × P}, and the known disease status, **y**_{N} = (*y*_{1}, *y*_{2}, ..., *y*_{N}), measured on *N* ≈ 100 highly inbred individuals that belong to an isolated population where the genealogy is fully known.

In statistical terms, the estimation problem reads as follows: given a huge set of covariates, *P* ≈ 500000, we have to estimate a probability model, *F*(**x**), using a sample of dependent observations, (**y**_{N}, **X**_{N × P}), of size *N* ≪ *P*. The statistical analysis of such a problem presents the following critical points:

*i*) observations are not independent, and consequently inference that is unconditional with respect to the genealogy cannot be applied here. Unlike usual association studies between genetic variants and diseases, we have knowledge of this dependency by means of the genealogical tree. Moreover, the dependency is important in order to gain precision in estimating *F*(**x**). In fact, two affected brothers are more likely than two unrelated subjects to share the part of **x** causing the disease.

*ii*) the estimation of *F*(**x**) would typically lead to a sparse model, because the biological background suggests that only a very small set of genetic variants interact to produce the disease. The dimension of the model space grows exponentially with *P*. For example, if SNP configurations were represented by categorical variables with two levels, the model space would have dimension 2^{P}, without considering interactions. Such a dimension prevents an exhaustive exploration of all possible models. Moreover, since *N* ≪ *P*, classical multivariate analysis techniques, such as multivariate regression, cannot be employed here to search all possible models exhaustively. Finally, usual model selection approaches are not feasible because of their computational cost.
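To give a sense of the scale in point *ii*), the sketch below counts candidate models when each term is either included or excluded, and reports the size of 2^{P} through its number of decimal digits rather than printing the number itself. The function names are hypothetical, and the pairwise-interaction count is an added illustration of how the space grows once interactions are considered; it is not a figure from this paper.

```python
import math

def count_submodels(n_terms):
    """Number of models obtained by including or excluding each candidate term."""
    return 2 ** n_terms

def decimal_digits_of_power_of_two(exponent):
    """Number of decimal digits of 2**exponent, computed via logarithms."""
    return math.floor(exponent * math.log10(2)) + 1

# Small case: every subset of 3 main effects is one candidate model.
n_small = count_submodels(3)

# Scale discussed in the text: about 500000 markers, main effects only.
P = 500_000
digits_main_effects = decimal_digits_of_power_of_two(P)

# Allowing every pairwise product term adds P * (P - 1) / 2 candidate terms.
n_terms_pairwise = P + math.comb(P, 2)
digits_with_pairs = decimal_digits_of_power_of_two(n_terms_pairwise)
```

With *P* = 500000, the count 2^{P} already has roughly 150,000 decimal digits before any interaction is considered, which is why exhaustive search is ruled out.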

In this paper, we aim to address the above critical points. Point *i*) is considered in the Methods section by reducing the genetic variability not related to the disease. We achieve this through an experimental design in which we choose, for each case, the most related control, based on the known genealogy. Point *ii*) is also treated in the Methods section, where Random Forest, a non-parametric regression model based on ensemble methods, is employed to estimate *F*(**x**). This allows us to explore a wide region of the model space at a reasonable computational cost.
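The design idea for point *i*), choosing for each case the most related control, can be sketched as follows. This is only an illustration under stated assumptions, not the procedure detailed in the Methods section: it assumes that relatedness is summarized by a pairwise kinship coefficient derived from the genealogy, that each control is used at most once, and that matching is greedy in the order the cases are listed. All identifiers and coefficient values are hypothetical.

```python
def match_cases_to_controls(cases, controls, kinship):
    """Pair each case with its most related, still unused control.

    cases, controls -- lists of subject identifiers
    kinship         -- dict mapping frozenset({a, b}) to a kinship coefficient;
                       pairs that are absent are treated as unrelated (0.0)
    Returns a dict case -> control (greedy, one control per case: assumption).
    """
    available = set(controls)
    pairs = {}
    for case in cases:
        if not available:
            break  # more cases than controls: remaining cases stay unmatched
        best = max(
            sorted(available),  # sorted so that ties are broken deterministically
            key=lambda ctrl: kinship.get(frozenset((case, ctrl)), 0.0),
        )
        pairs[case] = best
        available.remove(best)
    return pairs

# Hypothetical kinship coefficients, for illustration only.
kinship = {
    frozenset(("case_A", "ctrl_1")): 0.25,    # e.g. full siblings, outbred pedigree
    frozenset(("case_A", "ctrl_2")): 0.0625,  # e.g. first cousins, outbred pedigree
    frozenset(("case_B", "ctrl_1")): 0.125,
    frozenset(("case_B", "ctrl_3")): 0.03125,
}
pairs = match_cases_to_controls(
    ["case_A", "case_B"], ["ctrl_1", "ctrl_2", "ctrl_3"], kinship
)
```

A greedy pass is the simplest choice; it does not guarantee the matching with the highest total kinship, which would require an assignment algorithm instead.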

For validation purposes, we present applications of the method to two different phenotypes: beta-thalassemia and common asthma. Beta-thalassemia is a genetic disorder caused by a mutation inside the beta-hemoglobin (HBB) gene [5]. Only individuals homozygous for the mutation manifest the clinical traits of the disease. However, carriers, although completely healthy, show a reduced mean cell volume (MCV≤72) of red blood cells [6], and this parameter can be used to identify them. In Sardinia, carriers of beta-thalassemia are about 15% of the population, and a single mutation accounts for 95% of the beta-thalassemia mutations [7, 8]. The main goal of this analysis is to validate the method by tracing back the position of the mutation in the gene.

Unlike for beta-thalassemia, the goal of analyzing common asthma is to gain more biological insight into this widespread disease, which may be caused by several unknown variants in different genes.

Although the method proposed here is generally applicable to any isolated population, we tested it on a population located in one small village (Talana) in a secluded area (Ogliastra) of Sardinia (Italy). This population is characterized by a great deal of homogeneity in lifestyle and eating habits and by high endogamy and consanguinity. Inhabitants of the village participated in an epidemiological survey assessing their health status, so that a complete and standardized data set is available. Thanks to the accessibility of complete municipal and parish archives going back to the seventeenth century, it was possible to cluster all people living in the village into large family structures with common ancestors. Data were collected by Shardna Live Science http://www.shardna.com within the Ogliastra project, which is aimed at studying several genetic isolates of Ogliastra.