iXora: exact haplotype inferencing and trait association
© Utro et al.; licensee BioMed Central Ltd. 2013
Received: 7 October 2012
Accepted: 14 May 2013
Published: 6 June 2013
We address the task of extracting accurate haplotypes from genotype data of individuals of large F1 populations for mapping studies. While methods for inferring parental haplotype assignments on large F1 populations exist in theory, these approaches do not work in practice at high levels of accuracy.
We have designed iXora (Identifying crossovers and recombining alleles), a robust method for extracting reliable haplotypes of a mapping population, as well as parental haplotypes, that runs in linear time. Each allele in the progeny is assigned not just to a parent, but more precisely to a haplotype inherited from the parent. iXora shows an improvement of at least 15% in accuracy over similar systems in literature. Furthermore, iXora provides an easy-to-use, comprehensive environment for association studies and hypothesis checking in populations of related individuals.
iXora provides detailed resolution in parental inheritance, along with the capability of handling very large populations, which allows for accurate haplotype extraction and trait association. iXora is available for non-commercial use from http://researcher.ibm.com/project/3430.
We address the task of extracting accurate haplotypes from genotype data of individuals of large F1 populations for mapping studies. Haplotypes are useful for inferring the underlying causal genetic basis of the traits in mapping populations as one can more efficiently evaluate the parental inheritance of the haplotype implicated in the determination of the trait [1, 2]. iXora is specifically suited to plant (or animal) breeding, in which mapping populations of individuals of inbred (or non-inbred) parents are utilized. iXora uses a novel approach by effectively utilizing the large data size and exploiting the fortuitous combinatorial structure in the problem. The algorithm is outlined in the next section with a running example and the mathematical details are presented in Methods.
Given genotypes of n progeny on m loci, the general problem of constructing haplotypes from genotypes is NP-complete, under various models such as parsimony, maximum likelihood, phylogeny (see  for a detailed exposition). Both statistical and combinatorial frameworks have been used in the literature to solve the problem of haplotype extraction from genotype data. For instance, BEAGLE [4, 5], fastPHASE , HAPI , HAPI-UR , MACH , and SHAPEIT [10, 11] use a Hidden Markov Model; whereas Gusfield  proposes a combinatorial approach that is based on the parsimony principle. Merlin  uses pedigree data under a parsimony model to construct the haplotypes of F1 progeny based on their genotypes and the genotypes of the parents. A review of phasing methods, particularly applicable on human data, is presented in . Based on the models of the input population, the existing methods can be categorized into the following scenarios: unrelated individuals, unrelated trios (two parents per one progeny), and related trios (two parents per several progeny, our F1 population of interest). These categories are discussed in more detail in Methods.
Definition and size of the classes
Unrelated individuals (no parent information)
Up to 30-100 markers
< 15 progeny/parent
< 15 progeny/parent
< 50 progeny/parent
iXora provides a user-friendly, easy-to-use comprehensive environment for mapping studies. The analysis framework and user interface are described in Additional file 1: Figures A1 and A2. Its usage is described in the section "Using iXora” in Additional file 1, with a running example, where we also demonstrate that genomic regions can be associated with a phenotype at a much higher resolution with haplotypes than with genotypes. The iXora framework has been successfully applied to real data analysis , in which pod color phenotype was localized, with high resolution, to a single locus in the T. cacao genome.
Results and discussion
In this section we outline the main results: the iXora phasing algorithm and its comparison with related methods in literature.
Outline of the core algorithm
Phase I (Preprocessing)
Phase II (Detecting crossovers)
Phase III (Staging output)
The parent haplotypes are encoded in V a , V b respectively and the progeny haplotypes in R a and R b . The solution shows no recombinations in the progeny, but has two errors between the phased sequences and the observed genotypes, shown in red. We omit the evaluation of the results (precision measures) here, and direct the reader to Methods.
Comparison with related methods
Here we describe the results from a simulation study on a F1 population with 200 individuals and 300–600 markers. The parameters of the simulation were chosen to reflect real data and the details are described in the section "Using iXora” in Additional file 1.
We compare iXora with the existing phasing methods BEAGLE , fastPHASE , FMPH , HAPI , HAPI-UR , MACH , Merlin , and SHAPEIT2 . Each existing method solves a slightly different phasing problem, such as not providing a parental haplotype assignment for the progeny, or not processing the entire population in one run, as discussed in Methods. Hence we used evaluation criteria that enable a meaningful comparison of this wide-spectrum of methods, when applied on simulated data. The evaluation criteria are described in Methods, and the results are summarized in Table 1.
Accuracy of Parent Assignments (PA)
This accuracy is measured on a marker-by-marker basis. We post-process the output of the systems that do not directly provide parental assignment. The best parental assignment is seen with BEAGLE and HAPI-UR, followed by iXora. In the two former cases, there are no unassigned markers. HAPI and SHAPEIT2 show moderate accuracy while Merlin, fastPHASE, and MACH perform poorly. Note that Merlin’s performance deteriorates with the increase in number of markers, while HAPI and iXora display similar levels of accuracy.
Handling missing data/imputation
All the methods, except HAPI, show some capability of handling missing data. Merlin has about a third of the missing data unresolved, while iXora has less than 10% unresolved. The rest of the methods resolve all the missing data. BEAGLE, HAPI-UR and iXora display levels of accuracy in the imputed data larger than 90% while the rest perform poorly. Note that these values only account for missing data in the progeny. We found that missing data in the parents were debilitating for all the trio based methods, except Merlin and iXora. These two methods were the only ones that produced some results when all the marker data of both parents were missing. Since Merlin can handle only a small number of individuals per parent, about 15% of the parent haplotypes remained unresolved. We observed that iXora is the only method robust enough to be unaffected by completely missing parent genotypes. We attribute this resilience to its ability to handle large families of individuals without splitting into smaller sets.
Accuracy of Parent Haplotype Assignments (PHA)
Note that PHA is the most important computation since this crucially contributes to the improvement in accuracy and resolution in genomic region assignment to traits (see "Using iXora” in Additional file 1 for an example). In Table 1 the corresponding column, labeled PHA, is shown in bold and is the focus of the comparison study.
The PHA accuracy is measured on a marker-by-marker basis. Only HAPI, Merlin, and iXora provide an assignment of the parental haplotypes. Note that although SHAPEIT2 utilizes trios, it did not give us any means to extract parent haplotype information from the output. Both HAPI and Merlin perform poorly, with accuracy under 85% and 70% respectively. In contrast, iXora yields an accuracy of over 95%.
Although, HAPI and Merlin give means of identifying the parent haplotypes, they suffer a severe scaling problem, and are unable to handle more than about ten progeny per family. Thus it is not obvious how these systems can be coaxed to exploiting the availability of large progeny to improve the accuracy of the parental haplotype assignments.
From the comparison with related methods, we conclude that while methods to the problem of inferencing parental haplotype assignments on large F1 populations exist in theory, these approaches do not work in practice at high levels of accuracy (say > 90%). Moreover, iXora is the only algorithm that is robust enough to accurately extract the parental haplotypes in the absence of any parental genotype information. In practice, when the genotypes of the parents were known, we used this capability of iXora to match the estimated parent genotypes against the true genotypes to confirm the integrity of the phasing results. iXora additionally outputs several intrinsic measures of preciseness (the triplet ), and all the equally-likely phasing solutions with annotations (q/Q/*), see Methods. These added capabilities make iXora and its output particularly attractive, over existing methods, for trait association and inferencing studies.
In this section we describe the mathematical details of the iXora haplotype inference algorithm, the measures used to quantify the precision of the output, and the different downstream processing of the output. We conclude the section with the description of the measures used to compare the results from different phasing algorithms.
iXora algorithm: haplotype inferencing
The outline of the three phases in the iXora algorithm is shown in Figure 1. In this section what follows is a more precise mathematical description of the steps which is presented as a sequence of key observations: thus this describes as well as provides the rationale for each of the steps. Note that Figure 1, annotated by equation and observation numbers, is a road map for the exposition in this section to help the reader understand the description.
Let be an n × m matrix that encodes n (> 1) progeny with m (> 1) markers. Each row i represents a progeny and each column j represents a marker. The order of the markers is also captured in the matrix, i.e., if and only if marker j is located between markers j′ and j′′ on the chromosome. The j th marker of the i th progeny, denoted or simply 〈i j〉, also referred to as the position 〈i j〉, is a pair of alleles: each individual allele can be accessed as and . Thus if or simply written as AC, A and C. The two observed alleles at marker j (across all individuals) will be denoted as Z and Z′. A marker j is polymorphic if Z ≠ Z′. We make the assumption that all the markers in are polymorphic. When or the j th marker at individual i, or position 〈i j〉, is said to be homozygous. Similarly, when , position 〈i j〉 is said to be heterozygous.
We next introduce a definition and notation for conjugacy. Let conjugate of z be written as , where z is a matrix or a discrete value as defined below. Note that the conjugate of conjugate of z is z, i.e., always holds. For parents p = a,b, if p = a, then and vice-versa. Similarly, if k = 0, then and vice-versa. Thus (and ). Also, and for the parents p = a,b.
Solvability of a given genotype matrix
Assume that there is no more than one crossover, in an individual, between two adjacent positions. If d t is the true number of such crossovers, then 0≤d t ≤ 2(m - 1)n. Let the estimate of d t be d c ≥ 0. Consider a scenario where each position in is heterozygous. Then there exist an exponentially large number (2 m n ) of distinct and equally-likely haplotype configurations of the progeny, each with no crossovers. While informative markers in general are chosen for their heterozygosity in data sets, it is observed that the same marker is also homozygous in many a progeny. Thus, in practice, due to Mendelian inheritance  it is very unlikely to have a run of markers that are heterozygous in all the progeny. It is this random sprinkling of homozygous positions in , that makes d c estimation possible. We conducted simulations to study the relationship between d c and the extent of homozygosity in in realistic data scenarios.
Let e denote the mean number of crossovers in a progeny. We used simulations with values 2 ≤ e ≤ 15. We observed that for this wide range of crossover profiles, the required fraction of homozygous sites in to get a good estimate of d t (i.e., within 5% of the true value) was bound. The empirical observation is summarized below.
Observation 1. When at least 28% of each subsample of randomly chosen positions inis homozygous, the estimated d c is within 5% of the exact value d t , i.e., d c ≥ 0.95 d t .
In practice, we have encountered values of n in the range of 50 to 400 and m in the range of 30 to 600. We observe that consistently, at least 50% positions are homozygous and the solutions, obtained using methods of the following sections, displayed e ≈ 2. With decrease in cost and in increase in accessibility of sequencing and genotyping technologies, m could be orders of magnitude larger in the coming years. Note that e is not expected to increase with m. Thus the lower bound estimation scales well with m, since the observation above is independent of m and the algorithm discussed in the later sections is linear in m.
Phase I: Preprocessing
Since all the individuals have the same two parents, let the two parents be a and b. Let V a (0) encode haplotype 0 of parent a and V a (1) encode haplotype 1 of parent a. Similarly, V b (0) and V b (1). The two distinct allele values of parent p at marker j are represented by and .
Also, V a and V b is initialized as and , for each j and p = a,b.
If marker j is homozygous in both parents, then position 〈i,j〉 is heterozygous, for all progeny i. If marker j is not homozygous in both parents, then possible progeny values are Z j Z j , , and . When both parents are homozygous, it may not be apparent what the allelic values of the individual parents are, but as is seen in the running example of the Overview section, this does not affect the solution. Such loci are marked as "-” in the parents, i.e., "-” for p = a,b. Also the column j of the matrices are updated as , for each i. If exactly one parent (either a or b) is homozygous, then only Z j Z j , but not , can be observed in some progeny, while the rest are . In Step 2, we identify all such markers.
Note that while it is easy to estimate if a marker is homozygous in both parents or heterozygous in both, it is not obvious to estimate the heterozygous parent when exactly one of the parents is so. In Step 3 we identify markers which are homozygous in exactly one parent (i.e., either a or b). The execution of this process is illustrated in the Overview section through the running example and is described in detail here.
In our implementation we use α1 = 0.3 and α2 = 0.15, which we have observed to yield accurate phasing results on real data, confirmed by the precision measures discussed later. When the polarization condition is violated, it is considered to be an error in data . In practice, we observed that in all such cases this was due to experimental errors at one of the markers. Hence, we make the following assumption. For parent p and for the pair of non-homozygous markers j and j′, must be polarized. In practice, when this condition is violated, we flag the marker for closer scrutiny in the experiments. A consequence of this assumption is the following.
Observation 2. Let J a (resp J b ) be the set of markers with heterozygous parent a (resp b ). Ifis not polarized, then j ∈ J p and.
This observation states that it is possible to computationally obtain two non-overlapping sets of markers, where one set represents the markers that are homozygous in only parent a and the other set represents the markers that are homozygous in only parent b. Thus, when the parent genotypes are not available, this observation is utilized in Step 3 to resolve the markers that are homozygous in exactly in one parent.
Note that the above is equivalent to: For a marker j that is homozygous only in parent a: is set to H for all i and if was X, then it is updated to 1. Similarly, for a marker j that is homozygous only in parent b: is set to H for all i and if was X, then it is updated to 1.
Rationale. Note parent a is homozygous at j (but parent b is not). The reason for switching X to 1 in column j of M b is that no matter what the value of is, or will be eventually updated to, 1 encodes for the allele Z’.
When marker j is homozygous only in parent b, a similar update as above is performed. Note that in the implementation, as a marker j is resolved (Step 3), the marker is immediately updated in V p and M p (Step 4).
Phase II: Detecting crossovers
Without loss of generality, for each non-homozygous j in M p , let j n x t be the adjacent non-homozygous marker in M p . Substituting j′ with j n x t in the definitions from the last section, let tmax and tmin be redefined. Then in Step 5, V p is phased as follows: For parent p, if , then the values of and are interchanged. Further, column j in M p is updated by replacing an existing entry of 0 with 1 and an entry of 1 with 0, to reflect the updated .
Let . Then the minimal number of crossovers (or recombinations) between j and j n x t in haplotypes inherited from parent p is . Thus:
Note that d c could be larger than the sum on the left, since some further crossovers may be introduced in the next phase. Since only adjacent markers need to be considered for detecting crossovers, the following holds.
Observation 4. Given, the haplotype matrices M a ,M b and the parent haplotypes V a ,V b can be constructed intime.
Missing values There are often missing values in the data, sometimes as high as 20%. When the value is missing at position 〈i j〉, we make the assignment , for p = a,b. However, if after the computation of V, marker j is homozygous at parent p, then and is appropriately updated. It can be verified that this treatment does not introduce any extra (spurious) crossovers.
V a =V b and if then , for all i and j, or,
V a ≠ V b and if then , for all i and j.
Thus M b is defined completely by M a and thus the task is to estimate only M a .
Monotonic state transitions Next, the matrices M p are transformed as follows. A position 〈i j〉 is a heterozygous trio, if the two parents at j are heterozygous and so is .
Note that a marker j can never be both leftmost and rightmost since the number of markers is at least two, hence both Lt(·) and Rt(·) are well defined. Let M i j = x. Then the transition S x →Sy is applied to assign the value y to M i j (written as M i j ←y) as follows.
o If Lt(M i j ) = Rt(M i j ) = k, then M i j ← e k .
o If Lt(M i j ) ≠ Rt(M i j ) with Lt(M i j ) = k1 and Rt(M i j ) = k2, then
o , (note that k1 ≠ k2)
o If Lt(M i j ) = Rt(M i j ) = k, then M i j ← e k .
o M i j ← k.
o If is numeric, then
To estimate the running time of the algorithm, we classify the transitions as intra-transitions and inter-transitions, depending on whether M or is used respectively. Thus is the only inter-transition. The above transitions are applied to the elements of M a and M b till no more transition is applicable. This final form of matrix is written as F(M a ) or simply F a (similarly, F(M b ) or simply F b ).
Each element of F p is in the final state, i.e., for each position 〈i j〉, , p = a,b. A numeric value of -1 indicates that no information regarding the haplotypes can be deduced. It can be verified that these state transitions induce the following properties on F and V.
If is not numeric but is, then must hold.
If and are both not numeric, then 〈i j〉, must be a heterozygous trio. The converse however is not true.
If -1, for some j, then -1, for all j.
Given M a and M b , F a and F b are unique.
F a and F b can be constructed from M a and M b in time.
To show that F a and F b are unique, the iterations can be viewed as of two types: one where only inter-transitions and the other where only intra-transitions are applicable. In the very first iteration, due to the possible start states, no inter-transition can be applied. Thus using uniquely applicable intra-transitions, each and each is transformed. When no intra-transition can be applied, the uniquely applicable inter-transitions are applied. Thus the iterations alternate between intra- and inter-transitions. Hence the final forms F a and F b are unique. Next, since each entry can go through no more than three transitions, it is possible to obtain F a and F b from M a and M b in time using suitable list data structures.
Phase III: Staging output
An optimization problem (e.g., minimizing an appropriate error function) whose solution is associated with an output configuration, such as alignment of multiple sequences or a phylogeny topology or landscape of crossovers in chromosomes, has the added burden of proving stability in the solution. In other words, how distinct in configuration are the next closest solutions? This is usually very difficult to answer, and most methods are inadequate in addressing this. However, due to the very specific nature of our problem, we provide an agglomerate of "best” solutions, so that its stability can be gauged. Our focus is not just on a maximum likely or a most parsimonious solution, but on an "agglomerate” of all (equally-likely) feasible solutions. This is a characteristic not just of the method but, in a sense, that of the data as well.
Suppose there are L > 0 feasible distinct solutions for a given . Is it possible to generate an agglomerate in a single pair of matrices R a and R b that captures all the L solutions ? In the following paragraphs we describe how to construct the agglomerate (R a and R b ) based on the encodings in F a and F b .
If for some j, then without loss of generality and , for all j.
For each j, if is numeric and is not, then .
- 4.For each j, where and are both not numeric, then(3)
This is illustrated in the following example.
Example 1. Letand, for some j. Further letandfor some i. Then by Equation 4,. However if, then.
The non-numeric values in R p encode multiple solutions, possibly with multiple crossovers. Hence we call the contiguous blocks of non-numeric values as dispersion intervals. The details of interpreting these intervals is discussed in the following sections.
Precision measures of iXora output
Note that d c is the number crossovers seen in the result matrices, thus an average number of crossovers for the individuals is d c /n. But these also correspond to an output configuration. Then how precise is the output of iXora for an input genotype matrix ?
The matrices R a and R b enable the elicitation of the parameters for the various haplotype distributions across the markers as well as the precision measures. The agglomerate structure aids in defining a stability measure summarized as the metric (Equation 11). The other inadequacies, such as insufficient information and errors due to imperfect experiments or data-acquisition, are evaluated by the methodology and summarized as and respectively (Equations 12-13). The former is the extent of dispersion in position of each crossover over the agglomerate, and the latter is the observed error in the R matrices. The trio define the haplotype precision in the given genotypes.
Let r be a run of contiguous values in a row (progeny) of R p . A dispersion interval is a non-numeric run, sandwiched by the chromosome boundary or numeric value. Let R a and R b be runs in the two haplotypes of the progeny and are aligned in the examples discussed below. It is possible that only one of R a or R b is a numeric run. Then it is called a asymmetric. But if both are non-numeric, then by construction (Equation 4) r a = r b and the run is symmetric. A switch is defined to occur between q and Q or between Q and the numeric value that sandwiches the run. The number of switches is written as s w(r). The value ∗ is a wild card and can be treated either as q or Q. The switch detection is succinctly described by the following two examples. In the illustrations the dispersion interval is shown in green sandwiched between numeric values (in black) and each switch is marked as a numbered (red) down-arrow.
The wild card may result in different positions of the same switch corresponding to whether it was interpreted as a q or a Q. These distinct positions of the switch is termed the wild card count of a switch. If a switch position is not affected by any wild card, its count is 1.
The wild card counts of the four switches are 1, 1, 2 and 1 respectively.
The transformed values are shown in bold above. The same process is applied to every dispersion interval to transform the matrices R a and R b . The observations from the transformations above is summarized as:
Observation 7. R is such that for each dispersion interval r, if s w(r) > 0, then (1) r begins and ends with Q and (2) the positions of the first and the last switch sandwich the dispersion interval.
In fact, the following is easily verified:
Observation 8. Let r be a dispersion interval and let s = sw(r). Then (1) the total number of crossovers in the interval, across both the haplotypes, is exactly s and each crossover in each haplotype of the configuration is at a position of the switch.(2) If r is asymmetric, then s is zero. In general, s is even and the number of crossovers in each haplotype of each distinct configuration is odd.
In practice, we observed that in all data sets all the dispersion intervals had no switches. There was exactly one instance where s w(r) was 2. Also, the following is easily verified.
Observation 9. Let s = s w(r) for a dispersion interval r. Then if s ≤ 2, there is no additional crossover introduced by r. If s > 2, the number of additional crossovers is s-2.
where c l is the wild card count of each switch l in r. When the location of each crossover is known precisely, this index is 0, while value 1 indicates maximum uncertainty in the location.
Also, in our experiments these mismatches were extremely low (less than 0.01) and when followed up turned out to be experimental errors. Hence we have followed the convention that such a mismatch be flagged as a potential error. Then an error, if any, is at 〈i j〉 in F that underwent the transitions . Note that the converse is not true. Also, an additional crossover, if any, is at 〈i j〉 in F that underwent the transitions . Again, the converse is not true.
To summarize, the trio () define the haplotype precision in the given genotypes . Across our experiments (30 distinct data sets) with real data , the mean precision trio were observed to be (0,0.002165,0.000299).
Downstream processing of iXora output
Since iXora associates the parent haplotypes (not just the inherited alleles) to each marker in a progeny, it is possible to study the distributions of the inherited parent haplotypes independent of or in association with a phenotype. The details of these and other related postprocessing available in the iXora framework are described here.
Haplotype frequency distributions
One of the important consequences of haplotype inferencing is obtaining the haplotype frequency distribution across the chromosomes. A marker j of progeny i has two alleles, one derived either from haplotype 0 or haplotype 1 of parent a and the other either from haplotype 0 or haplotype 1 of parent b. Since R is an agglomerate, it also contains the non-numeric values that encode multiple configurations. Based on the encoding in R we estimate the expected value of the frequency count and its variance. First, we enumerate the feasible solutions encoded by a non-numeric run r.
Let w t(r) be the number of distinct solutions encoded by r.
Observation 10. Let s = sw(r). Let c1,c2,…c s be the wild card counts of the switch positions. Let wt(r) be the number of distinct feasible solutions due to r. Then
(1) If < 2, wt(r) = len(r) + 1.
above is easily verified and for (2), Equation 17 is explained through the following example.
There are six distinct configurations for the first case and since the two haplotypes can be switched it gives 2 × 6 distinct configurations. For the second there aredistinct configuration, giving a total of 12+20 distinct solutions due to r.
Expected frequency and variance
In the absence of any other external information, each of the alternative solutions in R is equally likely. Under this assumption, it is easy to verify the following:
The varianceof the count of the haplotype pair is approximated as.
Haplotype-phenotype association analysis
Below, we describe the iXora methodology for associating discrete traits with genomic locations using haplotypes. The same approach can be used for continuous traits, using different statistical tests and randomizations. In general, the phasing output can be used in other types of statistical tests, for example to test for associations between a pair of markers and a phenotype. In the following, let L the number of discrete phenotype values.
Combination of parents We can test the effect of each haplotype pair at a marker with a phenotype as follows. In the case of discrete phenotype with L possible values, iXora constructs for each marker a 4 × L contingency table of haplotype pair & phenotype counts. In the current iXora implementation, we use Fisher’s exact test fisher.test in R  to test for association between haplotype pairs & phenotype. The test outputs a p-value for each marker, denoting the significance of the association.
Individual parents We can also investigate the contribution to phenotype of each parent individually. The contingency table in this case is a 2 × L table of haplotype & phenotype counts (a separate table for each parent at each marker). Fisher’s Exact test is used to test for association between haplotypes & phenotype. The test outputs a p-value for each marker and for each parent, denoting the significance of the association.
Significance thresholds via randomization We include a method for directly estimating the background distribution of p-values in the haplotype-phenotype data by randomizing the phenotype labels. If p-values observed in the randomized data are always larger than in the real data, then the findings on real data may indeed be significant. When working with categorical traits, we take into account the size (number of individuals) of each phenotype category in the permutation test. Let S be the size of the smallest category. For the permutation test, we randomly select S individuals from each category and permute their phenotype values. The permutation test is repeated T times, running the same statistical test on the randomized data as on the real data, for each marker. The smallest p-value observed in the randomized data is generated as output and becomes the threshold for significance in the real results.
iXora output visualization
The agglomerate solution from the phasing algorithm can be directly visualized to detect distortions in the data, with or without using phenotypic information. Approaches for visualizing the phasing solution are demonstrated in the following two paragraphs, while the third paragraph describes visualization of haplotype-phenotype associations. The figures shown here as examples stem from the use case described in detail in the Section "Using iXora” in Additional file 1. Therein is detailed the phasing an trait association process for locating the genomic region and specific haplotype that determines the simulated phenotype (height).
Individual haplotypes The individual haplotypes can be directly visualized, for example as the colored haplotype blocks shown in Additional file 1: Figure A3. The visualization indicates the locations of crossovers in each individual, including uncertainty in the crossover locations. As an alternative visualization, Additional file 1: Figure A4 shows iXora output on the frequencies of individual haplotypes per parent. In this case, the individuals are divided into two groups by phenotype, and clear distortion is observed regarding frequency of haplotypes in each group.
Comparison with related methods
Here we first elaborate on the three distinct categories of population models for phasing, and then give details on the comparison of iXora with related phasing methods in literature. Technical details on the settings for each compared method can be found in Additional file 1.
Unrelated individuals (no parent information) These methods treat the input genotypes as samples from a population of unrelated individuals, and do not assign the progeny to parental haplotypes. While they may be applicable to human population data, they are not suitable for our purposes. fastPHASE can be adapted to our problem setting by treating the input as n + 2 individuals from a population originating from four founder haplotype clusters. We adapt MACH similarly, though it has a much larger default value on the number of haplotypes. FMPH (Integer Programming Formulations To Solve Maximum Parsimony Haplotyping)  is closest in spirit to iXora, but is computationally very expensive and suitable only for small data sets (up to ≈50×30), although a hybrid approach is suitable for slightly larger data sets (up to ≈150×100). While the limitation on number of individuals can be tackled by splitting the data (as we do for Merlin, HAPI, SHAPEIT2 below), the limitation on the number of markers is debilitating, so we were unable to run FMPH on our data sets.
Unrelated trios These methods allow the definition of family relationships between parents and progeny in the input, with the limitation that each parent has only one progeny. BEAGLE and HAPI-UR (HAPlotype Inference for UnRelated samples) are two such methods. The methods phase the progeny individually in terms of sequences that were transmitted from each parent. Therefore the progeny are not necessarily assigned to a consistent set of parent haplotypes.
Related trios These methods allow defining several progeny originating from the same two parents, thus the underlying algorithms utilize the full sibling information. However, unlike iXora, none of the existing methods was able to use the entire set of progeny per two parents. In our application this size is in hundreds. HAPI and Merlin produce results only on families of about 10 progeny while SHAPEIT2 can only process sizes up to 50. We therefore randomly divided the progeny into sets of appropriate sizes and phased the sets independently. However, we observed that the phasing results for the parents between sets were often inconsistent, affecting the overall accuracy. HAPI and Merlin produce an assignment of progeny to parental haplotypes while SHAPEIT2 does not.
Comparison measures Here we describe the measures that we used to compare the different methods. Since existing phasing methods generate various types of output, we use two different measures so that all the methods are comparable on at least one measure. Our interest was not simply restricted to phasing the progeny genotypes by assigning each allele to the correct parent (PA), but also assigning them to the correct haplotype of that parent (PHA).
First, the phasing accuracy (PA) of progeny is examined, on a marker-by-marker basis, of only the heterozygous positions. We report the number of correctly assigned and the unassigned (unknown) positions as a percentage. BEAGLE, HAPI, HAPI-UR, Merlin, SHAPEIT2 and iXora can be directly compared on this measure for progeny, because they report the parental origin (maternal, paternal) of each allele. To incorporate fastPHASE and MACH also in this comparison, we post-processed their output as follows: progeny haplotypes are labeled as ‘maternal’ and ‘paternal’, using the labeling that minimizes mismatches compared to true maternal and paternal haplotypes. After the post-processing, all methods can be compared on this measure for progeny. The same accuracy evaluation is used to report imputation accuracy, by examining only the phasing for the missing input values.
Second, the accuracy of assigning the correct parental haplotype (PHA) for each progeny allele is examined, again on a marker-by-marker basis. All markers, including homozygous positions are used. For the output of programs where the input had to be split into smaller families, we consider only those subsets of progeny whose parents’ phasing are consistent with the majority parents (see Additional file 1 for details). Again, we report the number of correctly assigned and unassigned (unknown) positions, deeming a progeny position to be correct only when both alleles of the genotype are assigned to the correct parental haplotype. Only HAPI, Merlin, and iXora can be directly compared on this measure for the progeny.
All the simulated data sets are available at the iXora website http://researcher.ibm.com/project/3430.
FU is partially funded by MARS, under the joint cacao project with IBM Research. OEC is partially funded by MARS, Incorporated. The authors would like to thank Dr. Seth Findley and Joseph Conrad Stack, and the anonymous reviewers for providing valuable comments on the manuscript.
- Browning BL, Browning SR: Efficient multilocus association testing for whole genome association studies using localized haplotype clustering. Genetic Epidemiol. 2007, 5: 365-375.View ArticleGoogle Scholar
- Schaid DJ: Evaluating associations of haplotypes with traits. Genet Epidemiol. 2004, 27: 348-364. 10.1002/gepi.20037.View ArticlePubMedGoogle Scholar
- Bafna V, Edwards N, Lippert R, Yooseph S, Istrail S, Halldórsson B: A survey of computational methods for determining haplotypes. Computational Methods for SNPs and Haplotype Inference,Volume 2983 of Lecture Notes in Computer Science. Edited by: Clark A, Waterman M, Istrail S, Istrail S, Waterman M, Clark A. 2004, Berlin: Heidelberg: Springer, 613-614.Google Scholar
- Browning SR, Browning BL: Rapid and accurate haplotype phasing and missing-data inference for whole-genome association studies by use of localized haplotype clustering. Am J Human Genet. 2007, 81 (5): 1084-1097. 10.1086/521987.View ArticleGoogle Scholar
- Browning BL, Browning SR: A unified approach to genotype imputation and haplotype phase inference for large data sets of trios and unrelated individuals. Am J Human Genet. 2009, 84 (2): 210-223. 10.1016/j.ajhg.2009.01.005.View ArticleGoogle Scholar
- Scheet P, Stephens M: A fast and flexible statistical model for large-scale population genotype data: applications to inferring missing genotypes and haplotypic phase. Am J Human Genet. 2006, 78 (4): 629-644. 10.1086/502802.View ArticleGoogle Scholar
- Williams A, Housman DE, Rinard MC, Gifford DK: Rapid haplotype inference for nuclear families. Genome Biol. 2010, 11 (10): R108-10.1186/gb-2010-11-10-r108.PubMed CentralView ArticlePubMedGoogle Scholar
- Williams AL, Glessner J, Hakonarson H, Reich D: Phasing of many thousands of genotyped samples. Am J Human Genet. 2012, 91 (2): 238-251. 10.1016/j.ajhg.2012.06.013.View ArticleGoogle Scholar
- Li Y, Willer CJ, Ding J, Scheet P, Abecasis GR: MaCH: using sequence and genotype data to estimate haplotypes and unobserved genotypes. Genetic Epidemiol. 2010, 34 (8): 816-834. 10.1002/gepi.20533.View ArticleGoogle Scholar
- Delaneau O, Marchini J, Zagury JF: A linear complexity phasing method for thousands of genomes. Nature Methods. 2012, 9 (2): 179-181.View ArticleGoogle Scholar
- Delaneau O, Zagury JF, Marchini J: Improved whole-chromosome phasing for disease and population genetic studies. Nature Methods. 2013, 10: 5-6.View ArticlePubMedGoogle Scholar
- Gusfield D: Haplotype inference by pure parsimony. Combinatorial Pattern Matching, Volume 2676 of Lecture Notes in Computer Science. Edited by: Chávez E, Baeza-Yates R, Chávez E, Crochemore M. 2003, Berlin: Heidelberg: Springer, 144-155.Google Scholar
- Abecasis GR, Cherny SS: Merlin–rapid analysis of dense genetic maps using sparse gene flow trees. Nature Genet. 2002, 30: 97-101. 10.1038/ng786.View ArticlePubMedGoogle Scholar
- Browning SR, Browning BL: Haplotype phasing: existing methods and new developments. Nature Rev Genet. 2011, 12 (10): 703-714. 10.1038/nrg3054.PubMed CentralView ArticlePubMedGoogle Scholar
- Motamayor JC, Mockaitis K, Schmutz J, Haiminen N, Livingstone DIII, Cornejo O, Findley SD, Zheng P, Utro F, Royaert S, Saski C, Jenkins J, Podicheti R, Zhao M, Scheffler BE, Stack JC, Feltus FA, Mustiga GM, Amores F, Phillips W, Marelli JP, May GD, Shapiro H, Ma J, Bustamante CD, Schnell RJ, Main D, Gilbert D, Parida L, Kuhn DN: The genome sequence of the most widely cultivated cacao type and its use to identify candidate genes regulating pod color. Genome Biology. 2013, 14 (6): R53-10.1186/gb-2013-14-6-r53.http://genomebiology.com/2013/14/6/R53,PubMed CentralView ArticlePubMedGoogle Scholar
- Hartl DL, Clark AG: Principles of Population Genetics. 2006, Sinauer Associates: Inc.Google Scholar
- Development Core Team R: R: A Language and Environment for Statistical Computing. 2006, Vienna: R Foundation for Statistical Computing, [http://www.R-project.org] [ISBN 3-900051-07-0]Google Scholar
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.