Improved IBD detection using incomplete haplotype information
 Giulio Genovese^{1}Email author,
 Gregory Leibon^{1},
 Martin R Pollak^{2} and
 Daniel N Rockmore^{1, 3}
DOI: 10.1186/147121561158
© Genovese et al; licensee BioMed Central Ltd. 2010
Received: 11 September 2009
Accepted: 30 June 2010
Published: 30 June 2010
Abstract
Background
The availability of high density genetic maps and genotyping platforms has transformed human genetic studies. The use of these platforms has enabled populationbased genomewide association studies. However, in inheritancebased studies, current methods do not take full advantage of the information present in such genotyping analyses.
Results
In this paper we describe an improved method for identifying genetic regions shared identicalbydescent (IBD) from recent common ancestors. This method improves existing methods by taking advantage of phase information even if it is less than fully accurate or missing. We present an analysis of how using phase information increases the accuracy of IBD detection compared to using only genotype information.
Conclusions
Our algorithm should have utility in a wide range of genetic studies that rely on identification of shared genetic material in large families or small populations.
Background
Genetic studies designed to identify the location of loci that influence phenotypes depend on identifying regions of the genome that are shared among different individuals. This is true for both identification of rare, highly penetrant monogenic disease loci via linkage analysis or for common alleles that influence disease susceptibility via linkage disequilibrium as revealed by genomewide association studies (GWAS). The use of very dense panels of singlenucleotide polymorphisms (SNPs) via microarrays makes direct identification of diseaseassociated variation possible in some study designs. However, in familybased studies of monogenic or oligogenic phenotypes, causal alleles are expected to have nontrivial penetrance and be relatively rare, thus making identification of diseaseassociated chromosomal regions a necessary prerequisite for identifying causal variation.
There has been tremendous recent progress at both ends of the spectrum for finding diseaseinfluencing variants. There are useful techniques for identifying rare, but highly penetrant, monogenic disorders as well as for uncovering common variation conferring small but reproducibly increased risk. However, the middle ground of variants of moderate risk and moderate frequency is less well explored. Studies in large complex extended families and isolated populations with a high rate of specific phenotypes provide one method for approaching such phenotypes. Analyses of this sort would benefit from improved methods for identifying shared chromosome segments that rely neither on standard genetic linkage methods (which in turn rely on a near perfect understanding of family structure and are computationally very intensive) nor on association analysis (which is not well suited for identifying less common and more recent phenotypeinfluencing variants).
Overview
Alleles that are identical on homologous chromosomes are said to be IBS (identical by state). IBS alleles are said to be IBD (identical by descent) if they are IBS by virtue of having been inherited from a recent common ancestor. It is common practice to identify chromosomal segments as "likely IBD" when a sequence of consecutive loci is observed to be IBS and is of such a length that the odds of this event happening by chance is small compared to the probability of that segment being inherited IBD. Thus, information about a series of consecutive loci informs the likelihood that alleles at any one of those loci are inherited IBD.
In the absence of informative parental genotypes at the loci of interest, the only absolute certainty as to whether a locus is IBD occurs when two genotyped individuals share no common alleles (say, one has genotype AB and the other CD). In this case no pair of autosomes is IBS, and we can conclude with certainty that the locus is not IBD. For biallelic markers, this can happen only when each individual is homozygous for a different allele (i.e., one individual is AA and the other is BB). Loci where this happens are said to be "incompatible" (for the pair) and, assuming no genotyping error has taken place, provide certainty for not being IBS and therefore not being IBD.
This simple observation provides the foundation for probabilistic approaches to identify likely IBD segments in [1] and [2]. These approaches consider regions IBD if no incompatible loci are observed on a sufficiently long segment. More intricate approaches using allele frequencies to weigh the evidence brought for IBD by every single locus are described in [3, 4], and [5]. In [3] and [5] a Hidden Markov Model (HMM) is used in which there are two states, corresponding to being IBD and not being IBD, even though the process generating the states does not in general satisfy the Markov property. The model is chosen mainly for its relative simplicity and consequent computational tractability, as opposed to other approaches, such as those described in [6] and [7], which use inheritance vectors as states from which IBD status can be inferred. This inheritance vector approach is computationally intractable even for moderately sized pedigrees, but has the property that the underlying process is Markov.
Markov Chain Monte Carlo (MCMC) methods have been developed in [8] to deal with complex pedigrees, but they are not suitable for the common situation where information about the pedigree is inaccurate, incomplete, or spans only a few of the most recent generations.
Pedigree data can be useful for many purposes, but it does not in general provide a great deal of additional information for the purpose of IBD detection in situations in which the genotyping platforms are orders of magnitude finer than the expected number of recombination events in the pedigree. This can be the case when using Affymetrix or Illumina SNP microarrays. It is in fact possible to detect IBD segments and infer undetected relationships.
When haplotype (i.e., "phased" genotype) data is available, it is possible to identify as IBD segments that are shorter than those identified solely with genotype data. This is a key observation and we take advantage of this fact in the method we present below. Several algorithms use the haplotype to perform IBD detection [9] and association testing [10]. Note that even if we had the genome sequence data with base pair resolution, we would still be in need of a statistical model to infer haplotypes.
It is for these reasons that the "phasing" of genotype is important and relevant. Several different techniques have been used to infer the phase from genotype data in multiple individuals. Some techniques try to make such inferences by looking at thousands of individuals and then applying linkage disequilibrium (LD) to discern local haplotypes, using principles like maximum parsimony [11], entropy minimization [12], or variable length Markov chains [13]. These algorithms are fast and efficient. However, they require large sets of samples and while they tend to perform very well on the local scale, they perform less well, if at all, over long intervals. As a consequence, IBD detection algorithms using the inferred haplotype data end up identifying IBD segments broken up by short regions due to errors in the phase [9]. In general, assigning the most likely haplotype to a given sample without modeling the uncertainty can bias the results of studies based on haplotype.
An alternative approach to phasing is described in [14]. This technique uses IBD segments among relatives to infer the alleles that have been inherited from the maternal side and those that have been inherited from the paternal side of each individual. This approach is more robust than traditional methods and performs much better over the long range.
In this paper we introduce a new approach to the detection of IBD segments for a pair of samples. This approach uses the available phased genotype information to increase the accuracy of the detection, but at the same time is robust to inaccuracies of the phase. The method we present takes advantage of phase information even if this information is incomplete or less than fully accurate. In addition, we provide a novel method to improve the given configuration for the phased genotype through the identified IBD segments. We have implemented this algorithm and we detail its performance using genetic data in a large family known to be affected by an inherited form of kidney disease. In this example, our approach finds an additional 10% (probable) IBD loci as compared to the findings of an analogous method based only on genotype data. The ability to identify the entire set of regions that are actually IBD is of clear importance in identifying phenotypeinfluencing loci. Our algorithm should therefore be useful in a wide range of genetic studies that rely on the identification of shared genetic material.
Results
IBD detection using genotype
Naively, a region would be identified as IBD for two samples if a long streak of loci for which at least one allele for one sample is IBS to one allele for the other sample is observed, while if the region is not shared IBD, we would expect to observe some loci as homozygous for both samples but for different alleles. Our approach to IBD detection expands on this basic idea and can be viewed as a descendant of the various HMMbased approaches. For context it is worth a quick review of the basic HMM approach. For a clear general explanation of HMMs see [[15], Chap. 3] and [[16], Section 3.10].
A typical HMMbased method for detecting IBD segments in a pair of samples uses a twostate model along two homologous pairs of autosomes belonging to two different samples (see e.g., [3]). We call this HMM for genotype emission. The two states correspond to the cases in which the two samples share at least one allele at a given locus (the state "IBD") and the case in which they do not share any allele by descent (the state "NO IBD"). It is possible to add a third state for the case in which both alleles are shared. It is also possible to identify seven additional unordered ways to be IBD between two samples (for a total of nine states), as described in [[17], Chap. 5] and introduced originally by [18]. This takes into account the possibility of more than a pair of segments being IBD among the four homologous chromosomes for the two samples. In order to mitigate the computational burden we decided not to consider this enlarged state space, although the extension to a greater number of states would be straightforward.
HMM with genotype/haplotype emissions.
A  NO IBD  IBD  lev_{ G } 

{AA, AA}  p ^{4}  p ^{3}  log_{2}(p) 
{AA, AB}  4p^{3}q  2p^{2}q  1 + log_{2}(p) 
{AA, BB}  2p^{2}q^{2}  0  γ 
{AB, AB}  4p^{2}q^{2}  pq  2 + log_{2}(pq) 
{AB, BB}  4pq^{3}  2pq^{2}  1 + log_{2}(q) 
{BB, BB}  q ^{4}  q ^{3}  log_{2}(q) 
B  NO IBD  IBD  lev _{ H } 
{A, A}  p ^{2}  p  log_{2}(p) 
{A, B}  2pq  0  γ 
{B, B}  q ^{2}  q  log_{2}(q) 
Note that while the property of IBD/NO IBD is not itself Markovian, the HMM framework still makes sense here: emission probabilities are defined in exactly the same manner, but initial and transition probabilities need a new interpretation. One way to proceed is by assigning these probabilities for each pair so that the defined Markov process has the correct equilibrium frequencies in accordance with the amount of genome shared IBD by the pair [3, 5, 9]. This way the choice of initial and transition probabilities are guided by the degree of relationship of the two samples being compared.
Our approach differs and instead regards the transition probabilities as costs. The smaller we keep the transition probabilities, the longer the streaks of compatible observations are needed to justify transitioning back and forth between the two states, and therefore the more likely it will be that we avoid false positive IBD detection. On the other hand, in doing this, we may incur false negative IBD detections. We show how to mitigate this and balance these considerations via the use of phase information.
Standard forwardbackward decoding algorithms are then used to assign probabilities to the hidden variables, and each locus then is labeled as IBD if the probability of this event is greater than the probability of the opposite event.
One assumption of this analysis is that the observations for adjacent loci be independent of each other. This requires SNPs to be in linkage equilibrium. This is clearly not the case for SNPs in current genotype arrays, and relaxing this assumption would require a more sophisticated analysis. Nevertheless, this approach does still encode the intuition that haplotype data brings more evidence to avoid false positives than genotype data does.
In Appendix A we compare how easier it would be to detect IBD segments if we had haplotype information as opposed to genotype information.
IBD detection using phased genotype
If we have the phased genotype information for each individual, then we could use a more detailed HMM, with one state representing the case in which no pair of autosomes is IBD and four different states each corresponding to the pair of autosomes (denoted as either "L" or "R") carrying the haplotype shared in the first and second sample. Following standard usage, when discussing phased genotype of heterozygous loci, we use the notation AB to indicate that allele A belongs to the "left" autosome and BA to indicate that allele A belongs to the "right" autosome. The definition of left and right autosome is of course purely arbitrary, but makes sense in the context of consecutive heterozygous loci. We will consider a choice for the phase for a group of heterozygous loci to be "correct" if it is consistent with the haplotypes of the segment containing those loci. In this situation the cases break down as

IBD LL  The allele on the left autosome of sample 1 is IBD with the allele of the left autosome of sample 2,

IBD LR  The allele on the left autosome of sample 1 is IBD with the allele of the right autosome of sample 2,

IBD RL  The allele on the right autosome of sample 1 is IBD with the allele of the left autosome of the sample 2,

IBD RR  The allele on the right autosome of sample 1 is IBD with the allele of the right autosome of the sample 2.
Up to ten additional states could be introduced to account for all possible cases (for a total of fifteen states), in which more than one pair of chromosomes are in an IBD state [[17], Chap. 5]. As before, for ease of exposition as well as computational efficiency we have chosen not to describe a model with this level of detail, but it would be straightforward to extend our techniques in this way.
This suggests that it is of great importance for an IBD detection algorithm to be able to cope with uncertainty while at the same time still being able to exploit the available phase information. In order to account for the possibility of errors in the phase, the method we present here modifies the HMM by introducing transition probabilities between the four IBD states every time one of the observed loci is in a heterozygous state, so as to allow the tagged IBD segments to switch from one autosome to the homologous one.
We will call this new model HMM with phased genotype emission. The simple idea is that IBD segments should be allowed to switch between homologous autosomes at every heterozygous locus to allow for the possibility that the relative phase between two consecutive loci is incorrect.
HMM with phased genotype transitions.
The constants λ and μ are chosen according to how unlikely we expect that switching should take place. If the current phase configuration is completely random, then we would expect switching and not switching to be equally likely. If we model this by choosing λ = μ = 1/2, then the results are equivalent to those obtained using the HMM with genotype emission probabilities. We prove this in Appendix B. The method works well and does not run the risk of missing IBD segments that otherwise would be identified with the more simple HMM with genotype emission.
Phased genotype inference from IBD segments
We record a configuration like the previous one as a link between the two heterozygous loci. Once a sample has been compared with all the other available samples for shared IBD segments, the phased genotype that satisfies the largest number of links is computed.
Because of errors, it will not be possible in general to find a configuration that is compatible with all IBD segments detected. Therefore we use a maximum satisfiability approach. For a given sample, target, let t_{1},....,t_{ m }denote the locations of the heterozygous loci. For every phasing configuration of the genotype, associate a binary vector (y_{1}, y_{2},...,y_{ m }) with the convention that if y_{ i }= 0, then the phased genotype at locus t_{ i }is AB, otherwise if y_{ i }= 1 then the phased genotype is BA. Now, for each sample target we initialize a sparse square matrix L_{ target }. For each IBD segment sample target shares with another sample source, we search for all consecutive pairs of loci (t_{ i }, t_{ j }) for which sample target is heterozygous at both loci t_{ i }and t_{ j }while sample source is homozygous, and no other locus in between t_{ i }and t_{ j }has the same property.
For each pair (t_{ i }, t_{ j }) defined this way, we increment L_{ target }(I, j) by one if for sample source we observe either and or and , and we decrement L_{ target }(i, j) by one if for sample source we observe either and or and . Pseudocode is given in Algorithm 1 in Appendix C.
is as large as possible. In fact, notice that the expression in (1) is equal to the number of links satisfied by the phase configuration given by the binary vector (y_{1}, y_{2},...,y_{ m }) minus the number of unsatisfied links. Moreover the number of loci (t_{ i }, t_{ j }) for which L_{ target }(i, j) is not zero is linear in m since the likelihood of finding a link between loci t_{ i }and t_{ j }decreases exponentially with the difference j  i.
The general problem is a special instance of the more general class of MAX GEN2SAT problems [20]. This a class of problems which try to find the configuration of a collection of binary variables with constraints on pairs of variables so to satisfy as many constraints as possible. It is known that for such problems it is possible to find a configuration that satisfies at least 87.856% of the maximum number of satisfiable links [21] in time polynomial in the number of binary variables.
where ⊕ corresponds to the logical xor (the addition of numbers modulo 2). Using relative binary phase coefficients makes it easier to devise an algorithm that performs an iterative climb in the space of all possible combinations of binary phase coefficients. Testing with the FGFM dataset (see the Experiments section below) proved that very good results are already possible with a naive approach that will search among all possible configurations of s binary coefficients (z_{ i }, z_{i+1},...,z_{i+s}) for the one that satisfies the largest number of links for i = 1,...,m  s  1. Pseudocode is given in Algorithm 2 in Appendix C. Note that an important conceptual advantage of this approach is that each link provides information with respect to the relative phase of two heterozygous loci, usually close to each other, leading to a simpler approach to handling conflicting information coming from multiple IBD segments. In fact, trying to determine which alleles have been inherited from the maternal chromosomes and which have been inherited from the paternal chromosomes (as is done in [14]) might be more problematic when dealing with information coming from short IBD segments for which the inheritance pattern is unclear. A straightforward example comes from phasing the genotype of an individual using the genotype of a child. All loci will be easily recognized as IBD although we cannot infer where the recombination events took place. Nevertheless, only one link per recombination event would not be satisfied by the true haplotype. If no other evidence is available there is no way to infer the correct phase, but if other links are collected around the recombination event from other sources, it is likely that the true haplotype will be the one satisfying the maximum number of links. The same argument can be made for links not satisfiable by the true haplotype which are due to genotype errors in the source sample or to false positive IBD segments.
Experiments
We performed two different analyses, the first one aimed at understanding how much phase information is concealed in the IBD segments and the second one aimed at quantifying how much IBD detection with phased genotype, even if incomplete, can be achieved and how this outperforms the one with genotype.
To compare how phasing using IBD segments compares with other haplotype phasing methods, we simulated the dynamics of a chromosome segment reproducing through a small population using a WrightFisher model with recombination events. For a clear description of this model see [[22], Chap. 3]. The switch error rate (see [13, 23]) measures how often the relative phase between heterozygous loci has been retrieved correctly. As an example, suppose that a sample consists of two haplotypes A_{1}A_{2}A_{3}B_{4}B_{5}A_{6} and B_{1}A_{2}B_{3}B_{4}A_{5}B_{6}, but the inferred haplotypes are instead A_{1}A_{2}B_{3}B_{4}A_{5}B_{6} and B_{1}A_{2}A_{3}B_{4}B_{5}A_{6}. Then the switch error rate is 33%, since the relative phase between the first and third locus is incorrect, while the relative phases between the third and fifth locus and between the fifth and sixth locus are correct. The second and fourth loci do not matter since they are homozygous for the sample.
To indicate how much information is concealed in pairwise IBD information of a group, we compared the switch error rate of the haplotype retrieved by our phase update algorithm using IBD segments (as computed from the simulation) and how much relative phase was instead retrieved by the BEAGLE algorithm (see [13]). We also computed how much phase could be retrieved correctly by using both BEAGLE and the IBD segments.
Comparison of switch error rates.
n  L  IBD  BGL  BOTH 

50  10  3.88%  6.17%  2.41% 
100  10  2.99%  4.54%  1.81% 
200  10  1.96%  2.23%  1.18% 
50  5  3.73%  4.87%  1.83% 
100  5  2.13%  2.02%  1.01% 
200  5  1.47%  1.01%  0.63% 
50  2  3.81%  3.16%  1.57% 
100  2  2.70%  1.20%  0.62% 
200  2  1.55%  0.55%  0.35% 
These results are indicative of how much information is being missed by the BEAGLE algorithm and that could have still been retrieved using complete and correct IBD segment information. It is clear that BEAGLE scales very well as the size of the samples increases. However it is also clear that BEAGLE misses some of the correct switches which could be correctly inferred if the information from IBD segments was correctly exploited. This is true in particular for less dense SNP arrays. It is likely that while BEAGLE will perform very well when predicting the phase between closely linked markers, it might not do so for markers at larger genetic distance from each other, as could be the case for markers at opposite sides of a recombination hotspot.
Comparison of IBD detection accuracy.
n  L  FN GT  FN HT  FP GT  FP HT 

50  10  30%  9.9%  0.67%  0.29% 
100  10  26%  13%  1.5%  0.65% 
200  10  32%  28%  3.4%  0.75% 
50  5  30%  2.9%  0.83%  0.19% 
100  5  15%  5.6%  3.3%  0.85% 
200  5  19%  13%  7.1%  1.6% 
50  2  29%  2.6%  5.4%  4.8% 
100  2  18%  2.4%  16%  14% 
200  2  14%  5.5%  39%  34% 
The percentages relate to the total amount of loci pairwise shared by descent from one of the founders used to bootstrap the simulation. Notice that if two founders shared a segment IBD, their two haplotypes would not have been counted as IBD from the simulation, but they might have been detected as IBD by the IBD detection algorithm if inherited by two different samples, resulting in a false positive.
FGFM family
Genetic data was collected from FGFM using the Affymetrix 100 k Array Set, with the goal of searching for an underlying genetic basis to the inherited form of kidney disease affecting many members of this family. The complexity of the pedigree, together with the lack of information for how founders are related, made parametric approaches unfeasible and was the motivating factor for developing an alternative methodology. We evaluate how the use of phased data improves the performance of our algorithm by first measuring how many loci are identified as IBD by the HMM with genotype emission and then using this information to compute phased genotype data. At this point, we iterate the HMM with phased genotype emission algorithm to identify IBD segments and to update the phase a few times.
For both HMM with genotype emission and with phased genotyped emission we used parameters δ = 40, γ = 8 and for the HMM with phased genotype emission we use λ = μ = 1/4. The average percentage of IBD detected by the two algorithms between two samples was, respectively, 14.14% and 16.36%. This increase, even if (as of yet) still insufficient to identify the diseasecausing mutation in our case, was in fact dramatic, because it was mainly due to a whole new set of smaller IBD segments that we were now able to identify. This, was largely due to the fact that we were able to retrieve a great deal of the phase. In fact, for every sample, on average more than 90% of the loci were detected IBD with at least another sample of the family. This means that a given random consecutive pair of heterozygous loci has a good chance to encounter evidence to determine what their relative phase should be. This is in part due to the close relatedness of the people in the FGFM family and the high level of inbreeding.
Discussion and Conclusions
We presented a new algorithm for the detection of IBD segments in genotype data. This approach uses the available phased genotype information to increase the accuracy of the detection, but at the same time is robust against inaccuracies of the phase. The method we present takes advantage of phase information even if this information is incomplete or less than fully accurate. The key components are (1) a new method for using phase information for the identification of IBD segments and (2) a new method for using IBD segment identification for the determination of phase. In one exemplary dataset derived from samples in a complex pedigree of a large complex family with a high rate of kidney disease (focal segmental glomerulosclerosis), this approach produced approximately 10% additional likely IBD loci compared with traditional methods. The ability to detect IBD segments, both long and short, will become increasingly critical for human genetic research, particularly in the search for genetic variants that are relatively rare and shared by only small subsets of people (as opposed to common variants that are present in large populations). It is therefore of paramount importance to be able to use the (necessarily) limited data available as best as possible in order to extract the full amount of information concealed in genotype data. When the fastest exact approaches [25, 26] are not feasible, our algorithm can be a useful alternative. In fact, most exact algorithms have a computational burden exponential in the number of people in the pedigree. By contrast, our algorithm has a computational cost quadratic in the number of collected samples, requiring only that comparisons between all pairs of samples need to be performed. Our algorithm remains linear in the number of loci for which genotype data is available. It does not scale to the speed of [9], which is aimed at identifying short IBD segments with a computational cost that is linear in the number of people. However, unlike traditional approaches, our algorithm does not require precise pedigree information to run nor is it sensitive to pedigree loops, as are algorithms that perform exact inference (see [27]). Also, in our approach, there is no need to assume that the genotypes of the founders are independent, an assumption that can easily create significant problems, as can occur when two people distantly related share a long IBD segment.
This algorithm lends itself very well for iterative updates. For example, given a database of phased samples we can use this algorithm to perform phase inference on new samples using the IBD segments they will share with the samples already in the database. This would be very useful for the analysis of large biobanks of genetic data. The idea of using IBD segments to perform longrange phasing on genotype data from biobanks was introduced first in [14]. Our approach is different from this, perhaps conceptually simpler, and more robust. Most importantly, it is not strongly tied to the availability of correct pedigree data, which is often available only on a small scale, that is, only for nuclear subfamilies in the most recent generations. Moreover, it can take advantage of smaller IBD segment, in case larger ones are not available to phase at a given loci. We anticipate that this is but one of a number of future applications and venues for extension of this work.
Appendix A  Analysis of HMM with haplotype and genotype emission
The possible values for lev_{ G }are shown in the last column of part A of Table 1. Think of lev_{ G }as the logevidence in favor of being in state NO IBD. Approximately, if the overall logevidence for a segment is smaller than 2δ, then the model would decode that segment as IBD, since it would be more likely to have switched at the beginning and at the end of the two segment than having remained in the NO IBD state. Notice that the probability of observing genotypes {AA, BB} is 0 for state IBD, since this state is incompatible with the locus observed being IBD, and therefore the value lev_{ G }({AA, BB}) would be infinite. In practice, this is weighted by the fact that we might believe that the locus has been genotyped incorrectly and therefore it is not absolute evidence to discard the hypothesis that the locus is in the IBD state. As a consequence, we modify lev_{ G }({AA, BB}) to be equal to a fixed constant γ.
In part B of Table 1 we show the possible observed haplotypes together with their emission probabilities and the values for lev_{ H }.
where g is one of the six observable pair of genotypes in part A of Table 1, and h is one of the six observable pair of haplotypes in part B Table 1. The variables Σ_{IBDNO IBD}and Σ_{NO IBDIBD}represent, respectively, the log evidence pro IBD once the underlying Markov chain is in NO IBD state, and the log evidence pro NO IBD while the underlying Markov chain is in IBD state.
The ratio between expectation and standard deviation is clearly much smaller for Ω_{IBDNO IBD}than it is for Σ_{IBDNO IBD}, even more so when the minor allele frequency is large. Naively, alleles with large minor allele frequency are more likely to witness the fact that two segments are not IBD, even more so if we have available the haplotype rather than the genotype.
If a short sequence of loci is IBD, the HMM algorithm will recognize it as such if the collected logevidence lev_{ G }or lev_{ H }will reach a threshold depending on the transition probabilities. Consider the HMM with genotype emission using the emission probabilities as in part A of Table 1. Figure 9C shows the expected value of Σ_{NO IBDIBD}, with respect to the allele frequency of that locus, together with the standard deviation. Notice that for alleles for which the minor allele frequency is very small, the standard deviation is very high. This is explained by the paradoxical fact that the logevidence lev_{ G }can be positive, that is, in favor of state NO IBD, both for state {AA, BB} and one of the states {AA, AB} or {AB, BB}, despite the fact that either of these states are compatible with IBD state.
Using the model with the emission probabilities as in part B of Table 1, we get instead the graph in Figure 9D. The fact that the standard deviation for Σ_{NO IBDIBD}and Ω_{NO IBDIBD}is large when the minor allele frequency is small reflects the fact that the evidence for IBD might be very small if the allele shared IBD is the common one and huge if the observed allele is the rare one, so there is a wide range of possibilities. Again, the smaller ratio between expectation and standard deviation measures how much more unlikely we are to incur false negatives in detecting IBD segments when using haplotype data rather than genotype data. To better quantify this concept, we show some experimental results with data generated in silico in the Experiments section of this paper.
Appendix B  Analysis of HMM with phased genotype emission
We will show in this appendix that the HMM with genotype is a particular case of the HMM with phased genotype when λ = μ = 1/2. Also, the HMM with haplotype is a particular case of the HMM with haplotype when λ = μ = 0, which implies that no transition among different IBD states is allowed. This shows intuitively how the HMM with phased genotype is a model which is at the same time flexible and precise.
Notice that equals the number of observed states x_{t+1}corresponding to the observed state .
HMM with phased genotype emissions.
Phased genotype  NO  LL  LR  RL  RR 

(AA, AA)  p ^{4}  p ^{3}  p ^{3}  p ^{3}  p ^{3} 
(AA, AB)  p ^{3} q  p ^{2} q  0  p ^{2} q  0 
(AA, BA)  p ^{3} q  0  p ^{2} q  0  p ^{2} q 
(AA, BB)  p ^{2} q ^{2}  0  0  0  0 
(AB, AA)  p ^{3} q  p ^{2} q  p ^{2} q  0  0 
(AB, AB)  p ^{2} q ^{2}  pq ^{2}  0  0  p ^{2} q 
(AB, BA)  p ^{2} q ^{2}  0  pq ^{2}  p ^{2} q  0 
(AB, BB)  pq ^{3}  0  0  pq ^{2}  pq ^{2} 
(BA, AA)  p ^{3} q  0  0  p ^{2} q  p ^{2} q 
(BA, AB)  p ^{2} q ^{2}  0  p ^{2} q  pq ^{2}  0 
(BA, BA)  p ^{2} q ^{2}  p ^{2} q  0  0  pq ^{2} 
(BA, BB)  pq ^{3}  pq ^{2}  pq ^{2}  0  0 
(BB, AA)  p ^{2} q ^{2}  0  0  0  0 
(BB, AB)  pq ^{3}  0  pq ^{2}  0  pq ^{2} 
(BB, BA)  pq ^{3}  pq ^{2}  0  pq ^{2}  0 
(BB, BB)  q ^{4}  q ^{3}  q ^{3}  q ^{3}  q ^{3} 
where the second equation follows from equation (3).
Because of the previous statement, using coefficients λ = μ = 1/2 will give as a result that the probability for being in one of the four IBD states for the HMM with phased genotype emission is the same as the probability for being in IBD state according to the HMM with genotype emission. Therefore it will not increase our ability to detect IBD segments. Although using coefficients λ = μ = 0 will indeed increase our ability since the costs of switching back and forth from NO IBD state to one of the IBD states are about the same for the HMM with haplotype emission and the HMM with phased genotype emission, other than for a small factor of 4, but we would need to have accurate phased genotype data. The nice property of the HMM with phased genotype emission is that we can tune the parameters λ and μ to some intermediate values, so that we allow for inaccurate phase but we still take advantage of it to detect small IBD segments. A choice of λ = μ = 1/4 would for example increase the ability to detect smaller IBD segments and still be very robust against errors in the phase.
Appendix C  Link and phase update pseudocode
Algorithm 1  Link update algorithm
Given a set of phased genotypes, this algorithm creates a matrix of links for each phased genotype that will be later used to perform phase update. The algorithm will identify all IBD segments among each pair of samples and for each individual it will update a sparse upper triangular matrix with the links generated by each detected IBD segment.
Input: Phased genotype {G_{1}, G_{2},...,G_{ n }} from n samples.
for all pair of samples do
Identify IBD segments in between the two samples using the HMM with phased genotype.
for all IBD segments detected do
Define one sample as the source and one sample as the target.
Identify loci t_{1} < ... <t_{ m }for which the target genotype is heterozygous.
Identify indexes i_{1} < ... <i_{ k }for which the source phased genotype at loci is homozygous AA or BB and the target phased genotype is heterozygous AB or BA for j = 1,...,k.
for j := 1 to k  1  do
if G_{ source }( ) = G_{ source }( ) then
L_{ target }(i_{ j }, i_{j+1}) := L_{ target }(i_{ j }, i_{j+1}) + 1
else
L_{ target }(i_{ j }, i_{j+1}) := L_{ target }(i_{ j }, i_{j+1})  1
end if
end for
Swap target and source and repeat once.
end for
end for
return Sparse matrices {L_{1}, L_{2},...,L_{ n }}.
Algorithm 2  Phase update algorithm
Given a phased genotype with m heterozygous loci and phase choice represented by an m × 1 binary vector, an m × m sparse matrix of links L, this algorithm generates a new choice for the phase. The idea is to identify first for every consecutive heterozygous loci which links are affected by modifying the relative phase in between the two. Then, for every sliding window of a given length s, all possible combinations of phases are tested to identify which one satisfies the maximum number of links. Coefficients z_{ i }represent the present relative phase between the ith and i + 1th heterozygous locus. For every nonzero entry in L the sparse matrix B keeps track of all the relative phase coefficients which determine if the corresponding links are satisfied or not while the vector S keeps track of the amount of links for each nonzero entry of L. In the second part of the algorithm, using a sliding window of length s all possible configurations for the relative phase coefficients withing that sliding window are tested and the one which maximizes the amount of links is chosen. Whenever a change is made, a flag variable is raised, and the process reiterates until no flag variable is raised, that is, no further changes within a sliding window of length s can increase the amount of satisfied links.
Input: Current m × 1 phase binary vector Y = (y_{1}, y_{2},...,y_{ m }).
Input: Sparse m × m matrix L with link coefficients collected from IBD segments.
Input: Window length s.
Initialize r × m binary sparse matrix B with r the number of nonzero entries of L, r × 1 vector S, and k := 1.
Compute coefficients z_{ i }:= y_{ i }⊕ y_{i+1}for i = 1,...,m  1.
for all nonzero entries in L at position i,j do
for l := i to j  1 do
B(k, l) := 1.
end for
if y_{ i }= y_{ j }then
S(k) := L(i, j).
else
S(k) :=  L(I, j).
end if
k := k + 1.
end for
Initialize flag := 1.
while flag = 1 do
flag := 0.
for i := 1 to m  1 do
for all binary vectors (c_{0}, c_{1}, c_{2},...,c_{ s }) with c_{0} = 1 do
z_{i+k}:= z_{i+k}⊕ c_{ k }for k = 0,...,s
S(k) :=  S(k) for k such that w(k) = 1
flag := 1.
end if
end for
end for
end while
Recompute coefficients y_{i+1}:= y_{ i }⊕ z_{ i }for i = 1,...,m  1.
return updated phase binary vector Y = (y_{1}, y_{2},...,y_{ m }).
Declarations
Acknowledgements
This work is supported by grants from the U.S. National Institutes of Health GM07531002 and DK54931.
Authors’ Affiliations
References
 Miyazawa H, Kato M, Awata T, Kohda M, Iwasa H, Koyama N, Tanaka T, Kyo S, Okazaki Y, Hagiwara K: Homozygosity haplotype allows a genomewide search for the autosomal segments shared among patients. The American Journal of Human Genetics. 2007, 80 (6): 10901102. 10.1086/518176.View ArticlePubMedGoogle Scholar
 Thomas A, Camp N, Farnham J, AllenBrady K, CannonAlbright L: Shared genomic segment analysis. Mapping disease predisposition genes in extended pedigrees using SNP genotype assays. Annals of Human Genetics. 2008, 72 (2): 279287. 10.1111/j.14691809.2007.00406.x.PubMed CentralView ArticlePubMedGoogle Scholar
 Purcell S, Neale B, ToddBrown K, Thomas L, Ferreira M, Bender D, Maller J, Sklar P, de Bakker P, Daly M, et al: PLINK: A tool set for wholegenome association and populationbased linkage analyses. The American Journal of Human Genetics. 2007, 81 (3): 559575. 10.1086/519795.View ArticlePubMedGoogle Scholar
 Leibon G, Rockmore D, Pollak M: A SNP streak model for the identification of genetic regions identical by descent. Statistical Applications in Genetics and Molecular Biology. 2008, 7: 1610.2202/15446115.1340.PubMed CentralView ArticleGoogle Scholar
 Thompson E: Analysis of data on related individuals through inference of identity by descent. 2008, Department of Statistics, University of Washington, Technical Report Number 539Google Scholar
 Donnelly K: The probability that related individuals share some section of genome identical by descent. Theoretical Population Biology. 1983, 23: 3463. 10.1016/00405809(83)900047.View ArticlePubMedGoogle Scholar
 Lander E, Green P: Construction of multilocus genetic linkage maps in humans. Proceedings of the National Academy of Sciences. 1987, 84 (8): 23632367. 10.1073/pnas.84.8.2363.View ArticleGoogle Scholar
 Heath S: Markov chain Monte Carlo segregation and linkage analysis for oligogenic models. The American Journal of Human Genetics. 1997, 61 (3): 748760. 10.1086/515506.View ArticlePubMedGoogle Scholar
 Gusev A, Lowe J, Stoffel M, Daly M, Altshuler D, Breslow J, Friedman J, Pe'er I: Whole population, genomewide mapping of hidden relatedness. Genome Research. 2009, 19 (2): 31810.1101/gr.081398.108.PubMed CentralView ArticlePubMedGoogle Scholar
 Lin D, Huang B: The use of inferred haplotypes in downstream analyses. The American Journal of Human Genetics. 2007, 80 (3): 577579. 10.1086/512201.View ArticlePubMedGoogle Scholar
 Wang L, Xu Y: Haplotype inference by maximum parsimony. Bioinformatics. 2003, 19 (14): 177380. 10.1093/bioinformatics/btg239.View ArticlePubMedGoogle Scholar
 Gusev A, Măndoiu I, Paşaniuc B: Highly scalable genotype phasing by entropy minimization. IEEE/ACM Transactions on Computational Biology and Bioinformatics (TCBB). 2008, 5 (2): 252261. 10.1109/TCBB.2007.70223.View ArticleGoogle Scholar
 Browning S, Browning B: Rapid and accurate haplotype phasing and missingdata inference for wholegenome association studies by use of localized haplotype clustering. The American Journal of Human Genetics. 2007, 81 (5): 10841097. 10.1086/521987.View ArticlePubMedGoogle Scholar
 Kong A, Masson G, Frigge M, Gylfason A, Zusmanovich P, Thorleifsson G, Olason P, Ingason A, Steinberg S, Rafnar T, et al: Detection of sharing by descent, longrange phasing and haplotype imputation. Nature Genetics. 2008, 40 (9): 106810.1038/ng.216.PubMed CentralView ArticlePubMedGoogle Scholar
 Durbin R, Eddy S, Krogh A, G M: Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids. 1998, Cambridge University PressView ArticleGoogle Scholar
 Duda R, Hart P, Stork D: Pattern Classification. 2001, Wiley New YorkGoogle Scholar
 Lange K: Mathematical and Statistical Methods for Genetic Analysis. 2002, Springer New YorkView ArticleGoogle Scholar
 Jacquard A: Genetic information given by a relative. Biometrics. 1972, 28 (4): 110114. 10.2307/2528643.View ArticlePubMedGoogle Scholar
 Thompson E: The IBD process along four chromosomes. Theoretical population biology. 2008, 73 (3): 369373. 10.1016/j.tpb.2007.11.011.PubMed CentralView ArticlePubMedGoogle Scholar
 Hochbaum D, Pathria A: Approximating a generalization of MAX 2SAT and MIN 2SAT. Discrete Applied Mathematics. 2000, 107 (13): 4159. 10.1016/S0166218X(00)002444.View ArticleGoogle Scholar
 Goemans M, Williamson D: Improved approximation algorithms for maximum cut and satisfiability problems using semidefinite programming. Journal of the ACM (JACM). 1995, 42 (6): 11151145. 10.1145/227683.227684.View ArticleGoogle Scholar
 Wakeley J: Coalescent Theory: an Introduction. 2009, Roberts & Co. PublishersGoogle Scholar
 Stephens M, Donnelly P: A comparison of bayesian methods for haplotype reconstruction from population genotype data. The American Journal of Human Genetics. 2003, 73 (5): 11621169. 10.1086/379378.View ArticlePubMedGoogle Scholar
 Marchini J, Howie B, Myers S, McVean G, Donnelly P: A new multipoint method for genomewide association studies by imputation of genotypes. Nature Genetics. 2007, 39 (7): 906913. 10.1038/ng2088.View ArticlePubMedGoogle Scholar
 Browning S, Browning B: On reducing the statespace of Hidden Markov Models for the identity by descent process. Theoretical Population Biology. 2002, 62: 18. 10.1006/tpbi.2002.1583.View ArticlePubMedGoogle Scholar
 Fishelson M, Geiger D: Exact genetic linkage computations for general pedigrees. Bioinformatics. 2002, 18 (Suppl 1): S189S198.View ArticlePubMedGoogle Scholar
 Fishelson M, Dovgolevsky N, Geiger D: Maximum likelihood haplotyping for general pedigrees. Human Heredity. 2005, 59: 4160. 10.1159/000084736.View ArticlePubMedGoogle Scholar
Copyright
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.