Probability genotype imputation method and integrated weighted lasso for QTL identification
- Nino Demetrashvili^{1, 2},
- Edwin R Van den Heuvel^{1, 2} and
- Ernst C Wit^{1}
DOI: 10.1186/1471-2156-14-125
© Demetrashvili et al.; licensee BioMed Central Ltd. 2013
Received: 11 July 2013
Accepted: 17 December 2013
Published: 30 December 2013
Abstract
Background
Many QTL studies have two common features: (1) marker information is often missing, and (2) among the many markers involved in the biological process, only a few are causal. In statistics, the second issue falls under the headings “sparsity” and “causal inference”. The goal of this work is to develop a two-step statistical methodology for QTL mapping for markers with binary genotypes. The first step introduces a novel imputation method for missing genotypes. The outcomes of the proposed imputation method are probabilities, which serve as weights in the second step, a weighted lasso. This sparse phenotype inference is employed to select a set of predictive markers for the trait of interest.
Results
Simulation studies validate the proposed methodology under a wide range of realistic settings. Furthermore, the methodology outperforms alternative imputation and variable selection methods in such studies. The methodology was applied to an Arabidopsis experiment containing 69 markers for 165 recombinant inbred lines of an F8 generation. The results confirm previously identified regions; however, several new markers are also found. On the basis of the inferred ROC behavior, these markers show good potential for being real, especially for the germination trait G_{max}.
Conclusions
Our imputation method shows higher accuracy in terms of sensitivity and specificity compared to an alternative imputation method. Also, the proposed weighted lasso outperforms commonly practiced multiple regression as well as the traditional lasso and the adaptive lasso with three weighting schemes. This means that under realistic missing data settings this methodology can be used for QTL identification.
Keywords
- Arabidopsis
- Germination traits
- QTL mapping
- Recombinant inbred line (RIL)
- Binary genotypes
- Likelihood-based genotype imputation
- Sparse variable selection
- Weighted lasso
Background
Quantitative traits are traits that vary continuously. The phenotype of a quantitative trait (QT) is the cumulative result of several genes, their interactions and the environment. Regions within genomes that contain genes associated with a particular QT are known as quantitative trait loci (QTL) [1]. The major biological question is to identify the QTL associated with variation in traits. Understanding the genetic architecture of quantitative traits is important for animal and plant breeding, medicine, and evolution. For example, plant breeders can use the QTL identified for seed quality to select and breed plants with certain desirable characteristics. Molecular markers are specific fragments of DNA that represent the genetic differences between individual organisms or species [1]. Development of molecular (or genetic) markers creates new opportunities for QTL identification. Markers are not usually targets themselves but act as “flags” for genes controlling the trait. Molecular markers that are closely located and tightly linked to genes that control the trait are referred to as “tags”.
The process of coupling the phenotype (i.e. trait measurements) and genotype (i.e. molecular markers) data followed by QTL analysis is known as QTL mapping. The aim of QTL mapping is to identify the markers which are tightly linked to genes affecting the trait as well as to estimate the magnitude of their effects. Most methods consider repeated single QTL models, but it is now understood that modeling multiple QTLs simultaneously, as we consider in this paper, is superior to single QTL models [2]. Often both the phenotype and genotype data are incomplete. Though imputation methods for phenotype data in the context of QTL mapping are quite well developed [3, 4], there is less consensus on imputation of missing genotype data, due to its categorical nature. The two major strategies for genotype imputation are based on (1) maximum likelihood and (2) multiple imputation [5]. Although multiple imputation is potentially flexible, it tends to be slow for a large fraction of missing values. Therefore, we propose a likelihood-based method. In the context of QTL mapping, existing genotype imputation methods use phenotype data and multiple generation information to obtain a conditional probability of a missing genotype [6]. These methods are design-specific and lack generalizability [6, 7]. Most commonly, the missing genotypes are replaced with predicted values that are based on the observed genotypes at neighboring markers, as in the multiple QTL mapping (MQM) algorithm [8, 9].
Due to the roughly Markov structure of the meiosis process, we introduce a probability imputation method for markers with binary genotypes that includes information only from immediate neighbors. This method is applied to a recombinant inbred line (RIL) experiment, though it can be extended to other mating designs with binary genotypes (e.g. backcross, double-haploid). Our method is therefore applicable to a wider set of designs, and it does not require the phenotype data in order to compute a probability for a missing genotype. In contrast to others [8], the recombination rate is not estimated separately; rather, a specific parameter that plays a similar role is computed within the algorithm. Our imputation method considers two separate models, one for markers at the edge of a chromosome and another for all others. Each model requires the estimation of a recombination rate parameter. The model-specific parameter for middle markers is estimated as a function of the genotypes of the two flanking markers and the genetic distances towards those neighbors. A distinguishing characteristic of our imputation method is that the outcomes are probabilities which correspond to weights of observing one of the two parental lines at that locus. We integrate these weights into a lasso [10] to advance the QTL identification. The proposed analysis pipeline is validated using extensive simulations and compared to alternative methods. It is then applied to a real dataset that motivated our method and which is described next.
Motivating example
The primary biological goal of this work is QTL detection and the eventual goal is to improve the quality of seed production in Arabidopsis thaliana. It has been shown that measurements of the germination rate of maize in the laboratory could predict the relative performance in field sowing [11]. An increase in sowing performance can result in economically important crops. A similar strategy is taken in our study, where germination characteristics of Arabidopsis seeds were examined in order to find QTLs associated with each trait [12]. Lines from a recombinant inbred population are important and convenient resources for studying the genetic mapping of quantitative traits in plants or animals.
In our study, F8 seeds from 165 lines of Arabidopsis were obtained from the Versailles Biological Resource Centre [13]. The seeds were the result of a cross between Bayreuth and Shahdara Arabidopsis plants, using an inbreeding approach over eight generations. Bay-0 originates from a fallow-land habitat near Bayreuth in Germany, whereas Shahdara grows at high altitude in the Pamiro-Alay mountains in Tadjikistan [14]. The Bay-0 and Sha RIL populations have been used in several previous studies to map QTLs [12]. Arabidopsis has five chromosomes. For every RIL, 69 markers were genotyped with an average genetic distance of 6.1 centimorgans (cM) between markers [14]. Of the 69 included markers, 18, 11, 12, 11 and 17 are located on chromosomes 1, 2, 3, 4 and 5, respectively. The lengths of the chromosomes are 91.3, 64.6, 72.2, 69.1 and 91.2 cM, respectively.
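As a quick consistency check on the map described above, the marker counts and chromosome lengths can be combined to recover the reported average spacing. This is only an arithmetic sketch: the variable names are ours, and it assumes that the first and last markers of each chromosome sit at its ends, so that the intervals between adjacent markers span the full reported length.

```python
# Marker counts and chromosome lengths (cM) as reported in the text.
markers_per_chr = [18, 11, 12, 11, 17]
length_cm = [91.3, 64.6, 72.2, 69.1, 91.2]

total_markers = sum(markers_per_chr)

# A chromosome with k markers has k - 1 intervals between adjacent markers.
total_intervals = sum(k - 1 for k in markers_per_chr)

# Assumption: the outermost markers lie at the chromosome ends.
mean_spacing = sum(length_cm) / total_intervals

print(total_markers)           # 69
print(round(mean_spacing, 1))  # 6.1
```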
Arabidopsis germination experiment
The phenotyping experiment was conducted in two stages: (1) seed sowing followed by measurements of grown plant traits and (2) collecting seeds from these plants followed by germination [12]. In this study we examine traits from the second part of the experiment, namely germination. In the first stage of the experiment (2008), Arabidopsis seeds obtained from Versailles were randomly allocated to three plates and grown in a climatized chamber. The plates can be considered technical replicates, each with 3-5 RIL plants. One year later, seeds stored in 2008 were planted on a fourth plate. In addition, the best seeds (free of fungus, etc.) from the first three plates were collected and germinated in 2009 on a fifth plate. In the second stage of the experiment, 50-100 seeds from each line of the core population of grown plants were collected and germinated.
Several factors were varied or simply needed to be accounted for in the germination experiment, such as seed age, dormancy, imbibition, growing plate (and selection), temperature, and chemical stress. With respect to dormancy, i.e. storage conditions, two types can be identified, namely fresh and after-ripened (AR) [12]. Besides normal temperature, two types of temperature stress were applied, namely cold and heat shock. The following chemical stresses were applied: table salt, osmotic-inducing mannitol, oxidizer hydrogen peroxide, inhibitor abscisic acid (ABA), and controlled deterioration (CD) [12]. The germination process under all chemicals except hydrogen peroxide was carried out in light and dark imbibition conditions.
Statistical design
All five traits are continuous traits (G_{max} (%), T_{50} (h), T_{10} (h), U_{8416} (h), AUC_{100} (%× h)). Higher G_{max} means more germination. Lower T_{10} and T_{50} mean faster germination. The aim of this QTL analysis is to find the markers that are associated with these five response variables. We use the terms “(quantitative) trait” and “response variable” interchangeably.
The set of explanatory variables are the 69 markers. Our genetic dataset contains genotypes of 69 markers for 167 plants (including parental Bay and Sha). Marker genotypes for Bay and Sha are denoted by 0 and 1 respectively. Genotypes across all markers are the same for each parent. Genotypes in children plants are inherited from either of the parents, therefore each marker has only two possible genotypes {0,1}. So, technically, the 69 markers can be seen as a distinct combination of 0’s and 1’s across the RILs.
All 167 plants are treated under 42 different conditions, resulting in 7014 observations for every trait, some of which are missing. These conditions are made up of combinations of the factors (described in the previous section), which are not of primary interest but must be taken into account. The conditions are age, dormancy (Fresh, AR), plate (1-5), imbibition (light, dark), temperature (8, 10, 20, 25, 30 degrees Celsius) and chemical stress (no, salt, mannitol, hydrogen peroxide, ABA, CD). Hence, each response variable is adjusted for these nuisance variables. Such adjustment helps ensure that any detected marker effect is robust under all these conditions.
Missing data
The phenotypes G_{max}, T_{50}, T_{10}, U_{8416}, and AUC_{100} contain respectively 0.49%, 1.90%, 1.90%, 1.92% and 0.49% missing data. It seems reasonable to assume that the missingness of any observation for a given trait is independent of the observed and unobserved values. Such a missingness mechanism is known as missing completely at random (MCAR). Furthermore, the small percentage of missingness across all phenotypes means that we can safely omit the missing observations from our analysis, even if the MCAR assumption is not true.
We summarized the number of missing markers per plant and the number of missing plants per marker. About 25% of RILs do not have missing values for any marker. The remaining RILs have up to 9 missing values, except for two RILs that have 20 and 24 missing markers. As for the number of missing plants per marker, we counted that each marker has at least one missing RIL. The number of missing RILs per marker varies between 1 and 13. Missingness in markers may be caused by assay quality, poor hybridization and/or other reasons. This kind of missingness is unlikely to be MCAR, as whether a marker is missing may well depend statistically on whether its neighbor is missing, owing to the sequential operation of the genotyping instrument. This missingness feature might then be described as missing at random (MAR), since the missingness does not depend on the unobserved markers themselves but does depend on the observed markers, i.e. we observe whether its neighbor is missing or not and this enables prediction of the probability that this marker is missing.
Methods
Marker probability model
where t_{0} and t_{1} are genetic locations of flanking markers and ${x}_{c,{t}_{0}}^{\left(i\right)}$ and ${x}_{c,{t}_{1}}^{\left(i\right)}$ are genotypes of those markers. In our case genetic locations t_{0}, t and t_{1} refer to genetic distances from starting point of a chromosome.
As stated above, markers in a RIL have only two genotypes {0,1}. There are two possible sources of genetic variability: genetic recombination and mutation. Recombination/meiosis is a process of chromosomal crossover whereby two chromatids can mesh with one another. The variations isolated by breeders are the result of recombination and not mutation, owing to the short period of time involved in isolating the varieties. Variation due to mutation on the time-scale of this experiment (i.e. 6/8 generations) is dwarfed by the variation due to recombination.
Note, for a marker surrounded by two known markers, if α=0 this means the recombination rate is zero. Therefore, (3) gives us that the probability of a Sha marker between two given Sha markers is 1. In contrast, if α→∞ then the recombination rate is infinite meaning that there is no information in neighboring markers. Therefore, (3) gives us a probability of 0.5 for any value of the flanking markers.
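The two limiting cases just described can be illustrated numerically. The sketch below does not reproduce equation (3). It assumes a first-order Markov chain along the chromosome and a Haldane-type map function scaled by α, r(d) = ½(1 − exp(−2αd)), which is our own stand-in for the paper's parameterization. The function names are hypothetical.

```python
import math

def recomb_fraction(alpha, d):
    """Assumed Haldane-type recombination fraction over distance d, scaled by alpha."""
    return 0.5 * (1.0 - math.exp(-2.0 * alpha * d))

def step_prob(a, b, r):
    """Probability of going from genotype a to genotype b across one interval."""
    return 1.0 - r if a == b else r

def prob_sha_middle(x0, x1, t0, t, t1, alpha):
    """P(x_t = 1 | flanking genotypes x0 at t0 and x1 at t1) under the assumed chain."""
    r_left = recomb_fraction(alpha, t - t0)
    r_right = recomb_fraction(alpha, t1 - t)
    num = step_prob(x0, 1, r_left) * step_prob(1, x1, r_right)
    den = num + step_prob(x0, 0, r_left) * step_prob(0, x1, r_right)
    if den == 0.0:
        # alpha = 0 with discordant flanking markers is impossible under this model.
        return float("nan")
    return num / den

# No recombination: a marker between two Sha markers is Sha with certainty.
print(prob_sha_middle(1, 1, 0.0, 5.0, 10.0, alpha=0.0))   # 1.0
# Very large alpha: the neighbors carry no information.
print(prob_sha_middle(1, 1, 0.0, 5.0, 10.0, alpha=50.0))  # 0.5
```

Under this stand-in model the probability moves smoothly from 1 towards 0.5 as α grows, which is the qualitative behaviour the text attributes to (3).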
where x_{0}∈{0,1} and δ_{ x } is defined as in (4). The probability model for the markers at left edge of a chromosome is similar to equation (5).
Pseudo maximum likelihood in imputation
from which we estimate α. In a similar way we estimate β.
The distribution of the LRT is the weighted sum of four independent ${\chi}_{1}^{2}$ distributions [17]. Since the pseudo likelihood is probabilistically close to the true likelihood, the LRT can also be approximated by ${\chi}_{4}^{2}$. Alternatively, the p-value can be calculated using a bootstrap approach.
We also tested the goodness of fit of the proposed model using Pearson’s chi-squared statistic.
Genotype imputation
where ${w}_{c,t}^{\left(i\right)}\in [0,1]$. Indeed, zero weights are used for genotypes with rounded probabilities equal to 0.5, and weights of one are used when the imputed probabilities approach zero or one. For non-missing genotypes, we assume that they are observed with complete certainty, resulting in a weight of one. The highest possible weights are thus given to the observed genotypes. The weights ${w}_{c,t}^{\left(i\right)}$ are computed from the imputed (predicted) probabilities $\widehat{P}\left({x}_{c,t}^{\left(i\right)}\right)$. These probabilities were estimated via the maximum likelihood estimation procedure given in the previous two sections.
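One simple mapping with the endpoints described above (weight 0 at probability 0.5, weight 1 at probabilities 0 or 1, and weight 1 for observed genotypes) is |2p − 1|. The sketch below uses that form purely as an assumption for illustration; it is not claimed to be the paper's defining equation, and the function name is ours.

```python
def imputation_weight(p_hat, observed=False):
    """Map an imputed probability to a weight in [0, 1].

    Assumed form |2p - 1|: equals 0 at p = 0.5 and 1 at p = 0 or p = 1.
    """
    if observed:
        return 1.0  # observed genotypes are treated as certain
    if not 0.0 <= p_hat <= 1.0:
        raise ValueError("p_hat must lie in [0, 1]")
    return abs(2.0 * p_hat - 1.0)

# An uninformative imputation gets no weight; a confident one gets nearly full weight.
for p in (0.5, 0.75, 0.98):
    print(p, imputation_weight(p))
```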
Phenotype response
where the residual variance is constant across all observations and the residuals are independent and identically distributed (iid), y_{i} ∼ iid N(0,τ^{2}). The model (9) employs m regression parameters ϕ_{0},…,ϕ_{m}. Then we computed the difference between the observed and fitted values for every observation and used these residuals y_{i} as inputs to the weighted lasso.
Weighted lasso phenotype inference
The original lasso weights all observations equally [10]. The adaptive lasso extends the lasso by weighting or penalizing different coefficients differently in a way that depends on the data [18]. The proposed weighted lasso (wlasso) is a different extension of the lasso, which weights different observations differently in a way that depends on the data and the value of the coefficients. The adaptive lasso places weights in the penalty part of the objective function, whereas wlasso places weights in the sum of squares part of the objective function.
The idea behind the method is to downweight observations with many imputed values x_{ij}: note for instance that observations with all weights w_{ij} zero are eliminated from the regression, whereas observations with a fully observed x_{i.}, and therefore all weights w_{ij} equal to one, are fully taken into consideration. Moreover, observations that have missing values only at marker locations j that are deemed irrelevant for the regression, i.e. ${\widehat{\theta}}_{j}=0$, will not be penalized for their partial missingness. The model naturally accounts for imputation imprecision, without letting irrelevant imputations affect the quality of the estimate, and with the ordinary lasso as the limiting case when no imputation was performed.
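To make the contrast with the adaptive lasso concrete, the sketch below places per-observation weights in the sum-of-squares term and solves the resulting problem by coordinate descent. It is a generic observation-weighted lasso in plain Python with the observation weights w_i taken as given. How the paper combines the marker-level weights w_{ij} and the current coefficients into observation weights is not reproduced here, and the function names are ours.

```python
def soft_threshold(z, lam):
    """Soft-thresholding operator used in lasso coordinate descent."""
    if z > lam:
        return z - lam
    if z < -lam:
        return z + lam
    return 0.0

def weighted_lasso(X, y, w, lam, n_sweeps=200):
    """Minimise 0.5 * sum_i w_i * (y_i - x_i . theta)^2 + lam * sum_j |theta_j|."""
    n, p = len(X), len(X[0])
    theta = [0.0] * p
    resid = list(y)  # residuals at theta = 0
    for _ in range(n_sweeps):
        for j in range(p):
            # Weighted inner product of column j with the partial residual.
            rho = sum(w[i] * X[i][j] * (resid[i] + X[i][j] * theta[j])
                      for i in range(n))
            denom = sum(w[i] * X[i][j] ** 2 for i in range(n))
            new = soft_threshold(rho, lam) / denom if denom > 0 else 0.0
            if new != theta[j]:
                for i in range(n):
                    resid[i] -= X[i][j] * (new - theta[j])
                theta[j] = new
    return theta

# Toy data: the response depends on the first marker only.
X = [[1, 0], [1, 1], [0, 1], [0, 0], [1, 0], [0, 1]]
y = [2.0 * row[0] for row in X]

# A wildly wrong observation with weight zero has no influence on the fit.
theta = weighted_lasso(X + [[1, 1]], y + [100.0], [1.0] * 6 + [0.0], lam=0.1)
print(theta)
```

Setting all weights to one recovers the ordinary lasso, in line with the limiting case mentioned above.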
where s^{2} is some robust estimate of the variance that does not depend on λ and $\mathit{\text{df}}\left(\lambda \right)=\sum _{j=1}^{p}{1}_{\{{\widehat{\theta}}_{\lambda ,j}\ne 0\}}$ is the number of non-zero parameters in the model.
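The degrees-of-freedom term is simply a count of non-zero coefficients, which the sketch below computes. The accompanying score uses one common BIC-type form, RSS/s² + log(n)·df, as an assumption; the exact criterion used in the paper is not reproduced here, and the function names are ours.

```python
import math

def degrees_of_freedom(theta, tol=0.0):
    """df(lambda): number of coefficients whose estimate is non-zero."""
    return sum(1 for t in theta if abs(t) > tol)

def bic_type_score(rss, s2, n, theta):
    """Assumed BIC-type score: RSS / s^2 + log(n) * df."""
    return rss / s2 + math.log(n) * degrees_of_freedom(theta)

# Hypothetical fits at two values of lambda: a sparser fit with a larger RSS
# and a denser fit with a smaller RSS.
sparse_fit = bic_type_score(rss=120.0, s2=1.0, n=165, theta=[0.0, 1.1, 0.0, 0.0])
dense_fit = bic_type_score(rss=118.0, s2=1.0, n=165, theta=[0.2, 1.0, -0.1, 0.3])
print(sparse_fit < dense_fit)
```

In practice the score would be evaluated along a grid of λ values and the minimiser retained.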
We define convergence as the first iteration k such that $|{w}_{i}^{(k+1)}-{w}_{i}^{\left(k\right)}{|}^{2}<\epsilon $, where ε is a predefined tolerance level. Using the tolerance level ε=10^{−8}, all five traits converged in 4 iterations. The plots of weights are included for visualization convenience (see Additional files 1, 2 and 3).
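The stopping rule can be sketched as follows. We read the criterion as having to hold for every observation i, which is an assumption on our part. The `update` argument stands for one pass of the reweighting step and is represented here by a hypothetical toy function.

```python
def has_converged(w_new, w_old, eps=1e-8):
    """True when every squared change in the weights is below eps."""
    return all((a - b) ** 2 < eps for a, b in zip(w_new, w_old))

def iterate_until_converged(update, w0, eps=1e-8, max_iter=100):
    """Apply `update` repeatedly; return the final weights and the first k
    for which the change from w^(k) to w^(k+1) meets the criterion."""
    w = list(w0)
    for k in range(max_iter):
        w_next = update(w)
        if has_converged(w_next, w, eps):
            return w_next, k
        w = w_next
    raise RuntimeError("weights did not converge within max_iter iterations")

# Hypothetical update that pulls each weight halfway towards 1.
def toy_update(w):
    return [0.5 * wi + 0.5 for wi in w]

weights, k = iterate_until_converged(toy_update, [0.0, 0.2])
print(k, weights)
```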
Results
In this section, we first present the results of detecting QTL effects for germination in Arabidopsis. Then, we present the strategies and results from three simulation studies. With the first simulation study we aim to justify our proposed methodology. In the second and third simulation studies we compare our methodology with alternative methodologies for the case study. The second study emphasizes a comparison of our imputation methods with the nearest marker imputation. The third study focuses on comparison of sparse variable selection techniques, namely our weighted lasso, the traditional lasso [10] and adaptive lasso [18].
Analysis of Arabidopsis germination experiment
LRT statistic of markers selected by weighted lasso for G_{max}, T_{50}, T_{10}, U_{8416}, and AUC_{100}, ordered by genetic distance across 5 chromosomes in Arabidopsis
Marker | chr | gdist (cM) | G_{max} | T_{50} | T_{10} | U_{8416} | AUC_{100} |
---|---|---|---|---|---|---|---|
MSAT100008 | 1 | 0 | 0.64 | ||||
F21M12 | 1 | 9.7 | 0.09 | ||||
IND4992 | 1 | 15.4 | 0.04 | 0.32 | |||
MSAT110 | 1 | 21.6 | 0.02 | 0 | |||
MSAT108193 | 1 | 26.6 | 0.14 | 0.21 | |||
T27K12 | 1 | 49.1 | 1.27 | 0.65 | |||
F5I14 | 1 | 69.6 | 0 | 0 | |||
MSAT127088 | 1 | 82.7 | 0.15 | 0.33 | |||
MSAT15 | 1 | 91.3 | 0.01 | 0.81 | 0 | ||
MSAT25 | 2 | 0.19 | |||||
MSAT200897 | 2 | 7.9 | 0.02 | ||||
MSAT238 | 2 | 13 | 0.16 | 0.15 | |||
MSAT241 | 2 | 35 | 0.09 | 0 | |||
IND216199 | 2 | 51.5 | 0.03 | ||||
MSAT210 | 2 | 57.9 | 0.17 | 0.24 | |||
MSAT222 | 2 | 64.6 | 0.86 | ||||
MSAT399 | 3 | 3.2 | 0.68 | 0.79 | 0.58 | 1.03 | |
ATHCHIB2 | 3 | 6.6 | 0.57 | 0.02 | 0 | 0.16 | |
MSAT305754 | 3 | 7.9 | 0.58 | ||||
MSAT319 | 3 | 23.2 | 0.03 | 0.13 | |||
MSAT332 | 3 | 39.5 | 0.17 | 0.38 | 0.51 | 0.03 | 0.31 |
MSAT321 | 3 | 48 | 0.33 | 0.09 | 0.25 | ||
MSAT318406 | 3 | 53.3 | 0.05 | 0 | |||
MSAT318 | 3 | 64.1 | 0 | ||||
MSAT370 | 3 | 72.2 | 1.28 | ||||
MSAT48 | 4 | 2 | 0.09 | 0.04 | |||
MSAT443 | 4 | 10.7 | 0.52 | ||||
MSAT415 | 4 | 33.5 | 0.45 | ||||
CIW7 | 4 | 45 | 1.22 | 0.29 | |||
MSAT418 | 4 | 47 | 0.03 | ||||
MSAT49 | 4 | 55.6 | 0.06 | 0.60 | 0.31 | 0 | |
MSAT468 | 4 | 61.8 | 0.33 | 0.17 | |||
MSAT500027 | 5 | 0 | 0.34 | 0.5 | |||
MSAT514 | 5 | 26.6 | 0 | 0.16 | 0.01 | 0.46 | 0.09 |
NGA139 | 5 | 30.4 | 0.42 | 0.03 | 0 | 0.13 | |
MSAT520037 | 5 | 67.4 | 0.9 | 0.30 | 0.27 | 0.01 | 1.07 |
MSAT512 | 5 | 71.6 | 0.13 | 0.39 | 0.31 | ||
MSAT519 | 5 | 85 | 0.26 | 0.01 | |||
K9I9 | 5 | 91.2 | 0.35 | 0.08 | 0.25 |
None of the detected peaks on chromosomes 2 and 4 were identified before [12]. Peaks on chromosome 2 are relatively low. The region with MSAT25 provides relatively high confidence for association with T_{10}. The marker pairs (MSAT238 and MSAT241, IND216199 and MSAT210, MSAT210 and MSAT222), with considerable LRT values, demonstrate a certain confidence of association with G_{max}, T_{10} and U_{8416}, respectively. Thus, regions on chromosome 2 should be considered among the QTL despite their low LRT statistics. Similar interpretations apply to the detected regions on chromosome 4. In addition, several peaks that we detected on chromosomes 1, 3 and 5 have not been identified by others [12].
Simulation strategies
Simulating genotype data by an Ising model (interaction parameter η) is quite realistic for a recombination process on a chromosome [20]. The parameter η shows the strength of the dependence between markers. We considered that inheritance at loci on different chromosomes are independent events and simulated the markers of every chromosome one at a time. We included the dependence between markers in two ways: (1) the dependence between equally-spaced neighboring markers using an Ising model with η=0.4, (2) the genetic distances between the observed markers by subsampling the full process. In particular, we rounded the genetic distance of every chromosome and simulated markers with genotypes {0,1} 1 cM apart for 165 RILs. Then we selected those markers which were spaced with the same genetic distances as markers in our RIL experimental data. Markers for every chromosome were simulated independently and then joined together as a genotype dataset.
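The simulate-then-subsample scheme can be sketched in a few lines. The sketch assumes the standard ±1 spin parameterization of a one-dimensional Ising chain with no external field, under which each locus equals its left neighbor with probability e^η/(e^η + e^{−η}); the paper's exact parameterization may differ. The marker positions below are illustrative rounded values, not the full experimental map.

```python
import math
import random

def simulate_chain(n_loci, eta, rng):
    """Simulate 0/1 genotypes along a 1D Ising chain with coupling eta.

    Without an external field the chain is Markov: each locus copies its
    left neighbour with probability e^eta / (e^eta + e^-eta).
    """
    p_same = math.exp(eta) / (math.exp(eta) + math.exp(-eta))
    x = [rng.randint(0, 1)]
    for _ in range(n_loci - 1):
        x.append(x[-1] if rng.random() < p_same else 1 - x[-1])
    return x

def subsample(chain, positions_cm):
    """Keep the loci at the given integer cM positions (1 cM grid starting at 0)."""
    return [chain[pos] for pos in positions_cm]

rng = random.Random(1)
# Hypothetical rounded marker positions (cM) on a chromosome of length 91 cM.
positions = [0, 10, 15, 22, 27, 49, 70, 83, 91]
rils = [subsample(simulate_chain(92, 0.4, rng), positions) for _ in range(165)]
print(len(rils), len(rils[0]))  # 165 9
```

Repeating this per chromosome and concatenating the columns gives a genotype matrix with independent chromosomes, as described above.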
We assumed that among all observed markers about 10% have the true QTL effect. Thus, the largest positive and negative $\widehat{\theta}$ of 6 markers from weighted lasso of our real experiments were selected as the true effects (see Figure 6). Among the simulated 69 markers, 6 evenly spaced markers (along five chromosomes) were selected as the true input variables. An additive effect of markers was assumed and the response variable was generated using the multiple regression model. Residual error, having the normal distribution with mean μ=0, was added to the trait. We studied our method under several values of residual error variances, namely σ^{2}=0.5,1,2,3. We also investigated our methodology with 6 true markers being clustered (3 markers on the first chromosome and 3 on the second one) and compared it with the case of evenly-spaced markers.
We studied two missingness mechanisms among markers: (1) MCAR using Bernoulli missingness and (2) MAR using an Ising missingness model. Following our experimental data, we explored the case with 10% missingness. Thus, the probability parameter in the Bernoulli distribution is 0.1. We assumed a stronger dependence, η_{MAR}=0.6, among missing markers than among observed and non-observed markers in general (η=0.4). For every scenario described above, we carried out 50 simulations. The simulated data were analyzed using the proposed imputation model and wlasso as well as alternative approaches. For every simulation scenario, we summarized the performance of the tested methodology using the receiver operating characteristic (ROC) curve. For that we measured the fraction of true positives out of all positives, the so-called true positive rate (TPR), and the fraction of false positives out of all negatives, the so-called false positive rate (FPR). To be specific, TPR=TP/(TP+FN) and FPR=FP/(FP+TN), where TP, TN, FP, and FN are the numbers of true positives, true negatives, false positives, and false negatives. TPR and FPR are also known as the sensitivity and (1-specificity), respectively.
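The two rates can be computed directly from the set of selected markers and the set of true QTL markers. The sketch below is a minimal illustration; the function name and the example indices are hypothetical.

```python
def tpr_fpr(selected, true_markers, n_markers):
    """Return (TPR, FPR) for a set of selected marker indices."""
    selected, true_markers = set(selected), set(true_markers)
    tp = len(selected & true_markers)   # true markers that were selected
    fp = len(selected - true_markers)   # null markers that were selected
    fn = len(true_markers - selected)   # true markers that were missed
    tn = n_markers - tp - fp - fn       # null markers correctly left out
    return tp / (tp + fn), fp / (fp + tn)

# Hypothetical example: 69 markers, 6 true QTL, 8 selected of which 5 are true.
true_idx = [5, 17, 28, 40, 52, 63]
picked = [5, 17, 28, 40, 52, 3, 30, 60]
print(tpr_fpr(picked, true_idx, 69))
```

Evaluating this pair over a grid of λ values traces out one ROC curve per simulation scenario.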
For the sparse variable selection methods, we studied the TPR and FPR across a range of the lasso regularization parameter λ (BIC is not employed here as it was for real case study).
Simulation study 1: justification of the proposed methodology
To evaluate the accuracy of RIL experiments, we have to examine the results from real data in light of the simulation results. Traits G_{max} and T_{50} were examined as examples. In the simulation studies, regression coefficients were θ=0.5 (and up), the investigated residual error variances were σ^{2}=0.5,1,2,3, and the number of RILs was 165. In the experiment for T_{50}, regression coefficients were θ=0.5 (and up), the residual error variance was σ^{2}=150, and the number of observations was n=167×42=7014. These are equivalent to simulations with 165 RILs and σ^{2}≈3.5. In the experiment for G_{max}, θ=0.02 (and up), σ^{2}=0.04, and n=7014. These are equivalent to simulations with θ=0.5 (and up), 165 RILs, and σ^{2}=0.04(0.5/0.02)^{2}/42≈0.6. From these residual error variances, we can find the corresponding ROC curves for T_{50} and G_{max} (see Figure 10 or 11). Thus, the results of the T_{50} experiment are less powerful, given the overall low ROC curve (σ^{2}=3). In contrast, the results for G_{max} are highly stable, given the overall high ROC curve (σ^{2}=0.5).
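The conversion between the experimental and the simulated settings can be checked with a few lines of arithmetic. The sketch follows the expression given in the text for G_{max}, where the variance is rescaled by the squared ratio of effect sizes and divided by the 42 conditions per line; applying the same division by 42 to T_{50} is our reading of the text. Variable names are ours.

```python
n_conditions = 42

# T50: effects are already on the simulation scale (theta = 0.5),
# so only the replication over conditions is accounted for.
sigma2_t50 = 150 / n_conditions

# Gmax: rescale effects from 0.02 to 0.5, then divide by the replication.
scale = 0.5 / 0.02
sigma2_gmax = 0.04 * scale ** 2 / n_conditions

print(round(sigma2_t50, 2), round(sigma2_gmax, 2))  # 3.57 0.6
```

Both values are close to the figures quoted above and fall near the largest and smallest simulated variances, respectively.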
Simulation study 2: comparison of the proposed methodology focusing on imputation methods
Simulation study 3: comparison of the proposed methodology focusing on sparse variable selection techniques used for phenotype inference
Discussion
The simulation studies have shown that the combination of the proposed probabilistic imputation method and wlasso is an accurate methodology for QTL analysis. The pipeline suggested in this paper has an advantage of computational speed. An alternative to the proposed likelihood-based imputation is multiple imputation [21], but it is slower and leads every time to a possibly different result. The wlasso is used to advance the selection of markers associated with a trait.
In this paper, we analyzed each of the five traits of Arabidopsis separately. In principle, it is possible to analyze the traits jointly, as they are all associated with germination. Clearly, the QTLs shared by all traits can be analyzed further. Identifying whether these loci are causal or reactive for a particular trait is an interesting follow-up question. Possible causal relationships among a trio of two traits and a QTL are summarized by others [22]. Their approach can be applied to various pairs of selected Arabidopsis traits and extended to a quintet of traits in order to determine the type of relationship (for example, independent, reactive, causal) existing among traits. However, this goes beyond the scope of this paper.
Conclusions
Our methodology has high accuracy in terms of sensitivity and specificity for clustered and evenly-spaced markers under both MCAR and MAR missingness mechanisms. Clearly, the accuracy increases as the magnitude of the residual error variance decreases. In comparison with other approaches, our proposed methodology outperforms alternative methods under most investigated scenarios and is never worse than any of them. More specifically, our probabilistic imputation method is more accurate than the nearest marker imputation. Also, our wlasso is more accurate than commonly practiced multiple regression, the traditional lasso, and the adaptive lasso (with the three selected weighting schemes). More importantly, our methodology has been biologically validated on an Arabidopsis study and demonstrated good accuracy. In conclusion, the proposed methodology can be used for QTL identification, especially under the realistic setting of missing genotypes among markers.
Declarations
Acknowledgements
We would like to thank Ronny Joosen from the University of Wageningen for providing the Arabidopsis data and useful discussions about biology. We also would like to thank staff from the Groningen Bioinformatics Centre, particularly Danny Arends, for productive discussions about genetics. This study was funded by the pharmaceutical company Merck/Schering-Plough and the University Medical Center Groningen, University of Groningen.
We would like to thank the reviewers and the editor for valuable comments which helped to improve this manuscript considerably.
Authors’ Affiliations
References
- Collard B, Jahufer M, Brouwer J, Pang E: An introduction to markers, quantitative trait loci (QTL) mapping and marker-assisted selection for crop improvement: The basic concepts. Euphytica. 2005, 142 (1-2): 169-196. 10.1007/s10681-005-1681-5.
- Foster S: The LASSO linear mixed model for mapping quantitative trait loci. PhD thesis. University of Adelaide, School of Agriculture, Food and Wine; 2006.
- Bobb J, Scharfstein D, Daniels M, Collins F, Kelada S: Multiple imputation of missing phenotype data for QTL mapping. Stat Appl Genet Mol Biol. 2011, 10 (1): 1-27.
- Guo Z, Nelson J: Multiple-trait quantitative trait locus mapping with incomplete phenotypic data. BMC Genet. 2008, 9: 82. 10.1186/1471-2156-9-82.
- Balding D: A tutorial on statistical methods for population association studies. Nat Rev Genet. 2006, 7 (10): 781-791. 10.1038/nrg1916.
- Jiang C, Zeng Z: Mapping quantitative trait loci with dominant and missing markers in various crosses from two inbred lines. Genetica. 1997, 101 (1): 47-58. 10.1023/A:1018394410659.
- Haley C, Knott S, Elsen J: Mapping quantitative trait loci in crosses between outbred lines using least squares. Genetics. 1994, 136 (3): 1195-1207.
- Arends D, Prins P, Jansen R, Broman K: R/qtl: high-throughput multiple QTL mapping. Bioinformatics. 2010, 26 (23): 2990-2992. 10.1093/bioinformatics/btq565.
- Arends D, Prins P, Broman K, Jansen R: Tutorial-multiple-QTL mapping (MQM) analysis. Technical report. 2010.
- Tibshirani R: Regression shrinkage and selection via the lasso. J R Stat Soc Series B (Methodological). 1996, 58 (1): 267-288.
- Khajeh-Hosseini M, Lomholt A, Matthews S: Mean germination time in the laboratory estimates the relative vigour and field performance of commercial seed lots of maize (Zea mays L). Seed Sci Technol. 2009, 37 (2): 446-456.
- Joosen R, Arends D, Willems L, Ligterink W, Jansen R, Hilhorst H: Visualizing the genetic landscape of Arabidopsis seed performance. Plant Physiol. 2012, 158 (2): 570-589. 10.1104/pp.111.186676.
- Joosen R, Kodde J, Willems L, Ligterink W, Hilhorst H, Van Der Plas L: Germinator: a software package for high-throughput scoring and curve fitting of Arabidopsis seed germination. Plant J. 2010, 62 (1): 148-159. 10.1111/j.1365-313X.2009.04116.x.
- Loudet O, Chaillou S, Camilleri C, Bouchez D, Daniel-Vedele F: Bay-0 × Shahdara recombinant inbred line population: a powerful tool for the genetic dissection of complex traits in Arabidopsis. TAG Theor Appl Genet. 2002, 104 (6): 1173-1184.
- Besag J: Statistical analysis of non-lattice data. Statistician. 1975, 24 (3): 179-195. 10.2307/2987782.
- Besag J: Efficiency of pseudolikelihood estimation for simple Gaussian fields. Biometrika. 1977, 64 (3): 616-618. 10.1093/biomet/64.3.616.
- Aerts M, Claeskens G: Bootstrapping pseudolikelihood models for clustered binary data. Ann Inst Stat Math. 1999, 51 (3): 515-530.
- Zou H: The adaptive lasso and its oracle properties. J Am Stat Assoc. 2006, 101 (476): 1418-1429. 10.1198/016214506000000735.
- Zou H, Hastie T, Tibshirani R: On the “degrees of freedom” of the lasso. Ann Stat. 2007, 35 (5): 2173-2192. 10.1214/009053607000000127.
- Majewski J, Li H, Ott J: The Ising model in physics and statistical genetics. Am J Hum Genet. 2001, 69 (4): 853-862. 10.1086/323419.
- Souverein O, Zwinderman A, Tanck M: Multiple imputation of missing genotype data for unrelated individuals. Ann Hum Genet. 2006, 70 (3): 372-381.
- Li Y, Tesson B, Churchill G, Jansen R: Critical reasoning on causal inference in genome-wide linkage and association studies. Trends Genet. 2010, 26 (12): 493-498. 10.1016/j.tig.2010.09.002.
Copyright
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.