Statistics on continuous IBD data: Exact distribution evaluation for a pair of full(half)-sibs and a pair of a (great-) grandchild with a (great-) grandparent

Background: Pairs of related individuals are widely used in linkage analysis. Most tests for linkage are based on statistics associated with identity-by-descent (IBD) data. Current biotechnology provides data on very densely packed loci and may therefore yield almost continuous IBD data for pairs of closely related individuals. The distribution theory for statistics on continuous IBD data is thus of interest. In particular, distributional results that allow the evaluation of p-values for relevant tests are important.

Results: A technology is provided for numerical evaluation, with any given accuracy, of the cumulative probabilities of some statistics on continuous genome data for pairs of closely related individuals. In the case of a pair of full-sibs, the following statistics are considered: (i) the proportion of genome with 2 (at least 1) haplotypes shared IBD on a chromosomal segment, and (ii) the number of distinct pieces (subsegments) of a chromosomal segment, on each of which exactly 2 (at least 1) haplotypes are shared IBD. The natural counterparts of these statistics for the other relationships are also considered. Relevant Maple codes are provided for a rapid evaluation of the cumulative probabilities of such statistics. The genomic continuum model, with Haldane's model for the crossover process, is assumed.

Conclusions: A technology, together with relevant software codes for its automated implementation, is provided for exact evaluation of the distributions of relevant statistics associated with continuous genome data on closely related individuals.


Background
Pairs of related individuals, such as full-sibs, are widely used in linkage analysis. Most linkage tests are based on statistics associated with identity-by-descent (IBD) data. Evaluation of p-values requires relevant information on the distributions of such statistics. Current biotechnology provides data on very densely packed loci and may therefore yield almost continuous IBD data for pairs of closely related individuals. The distribution theory for statistics on continuous IBD data has not yet been developed. Bickeböller and Thompson [1,2] provide approximations, based on the Poisson clumping heuristic, to the distribution of the proportion of genome shared IBD by half-sibs, while Stefanov [3] provides a methodology for exact evaluation of the cumulative probabilities for the proportion of genome shared IBD by two individuals in a grandparent-type relationship. Browning [4,5] suggests a Monte Carlo approach for such evaluations. Zhao and Liang [6] deal with the exact calculation of the likelihood of a particular relationship for given gamete IBD data.
This paper provides a technology for numerical evaluation, with any given accuracy, of the cumulative probabilities of relevant statistics for pairs of closely related individuals, such as full(half)-sibs and a (great-)grandchild with a (great-) grandparent. Codes are provided, in the popular software package Maple, for rapidly implementing such evaluations. Possible applications of the results are also discussed (see subsection Discussion).

Results
A technology is provided for numerical evaluation, with any given accuracy, of cumulative probabilities of statistics on continuous IBD data from pairs of related individuals. The pairs of interest are full-sibs, half-sibs, a grandchild with a grandparent, and a great-grandchild with a great-grandparent. Three Maple codes are provided in the section Materials and Methods. Two of these concern a pair of full-sibs. The first evaluates the cumulative probabilities for the number of pieces, of a chromosomal segment of a fixed length, on each of which 2 haplotypes are shared IBD. The second evaluates the same for the number of pieces on each of which at least 1 haplotype is shared IBD. The third evaluates the cumulative probabilities for the number of pieces inherited by a great-grandchild from a great-grandparent on a chromosomal segment of a fixed length. The user of these codes should enter the length in morgans of the chromosomal segment of interest (y) and the number of pieces (k). The codes contain hypothetical values for these, together with the corresponding evaluated probability that appears on the screen after the code is executed.

A formula is provided in the section Materials and Methods (cf. (8)) for a straightforward evaluation of the cumulative probabilities for the number of IBD pieces for a pair of half-sibs. The same section also explains (cf. (1), (2), (3) and (6)) how to use the Maple codes provided in [3] to evaluate the cumulative probabilities for the proportion of genome with 2 (at least 1) haplotypes shared IBD by a pair of full-sibs, and for the proportion of genome shared IBD by a pair of half-sibs, all on a chromosomal segment of a fixed length. Excerpts from such evaluations, concerning the statistics of interest for the related pairs of interest, are provided in Tables 1 to 8.
Furthermore, our Maple codes evaluate the corresponding cumulative probabilities conditional on information (such as inheritance) on one of the flanking markers. To do so, the user should set the initial probabilities $(c_0, c_1, c_2)$ accordingly.

Discussion
In this article a technology is provided for numerical evaluation, with any given accuracy, of the cumulative probabilities of some statistics on continuous genome data for pairs of related individuals. The pairs of interest are full-sibs, half-sibs, a grandparent with a grandchild, and a great-grandparent with a great-grandchild. In the case of a pair of full-sibs, the following statistics are considered: (i) the proportion of genome with 2 (at least 1) haplotypes shared IBD on a chromosomal segment, and (ii) the number of distinct pieces of a chromosomal segment, on each of which 2 (at least 1) haplotypes are shared IBD. The natural counterparts of these statistics for the other relationships are also covered. Relevant Maple codes are provided for a rapid evaluation of the cumulative probabilities of such statistics.
In the case of full-sibs, IBD is meant within pedigrees consisting of a pair of sibs and their two parents, that is, nuclear families. Our distributional results assume this interpretation of IBD. If the sibs in a pair, or at least one of their parents, are inbred within a larger pedigree than the nuclear family (see [7] for relevant terms), then IBD subsegments with respect to the nuclear family and IBD subsegments with respect to the larger pedigree will not be distinguishable, due to the identity-by-state (IBS) status of the data. Consequently, if the sibs in a pair, or at least one of their parents, are inbred, then the data will record larger numbers of distinct chromosomal segments with 2 haplotypes shared IBD than in the case of non-inbreeding. Therefore, the distribution of the number of such pieces for a pair of sibs, which is evaluated by the relevant enclosed Maple codes, can be used to assess the evidence for a lack of inbreeding. Likewise, the distributional results for the proportion of shared genome and the number of IBD pieces on the different chromosomes may be used in testing for a mis-specified sib relationship. Such significance tests may be based on a combination of separate tests, each corresponding to the data on a single chromosome, with a suitable Bonferroni correction of the significance level. Similar applications hold for a grandparent-type relationship when using the corresponding distributional results.
Our results may also be used in identifying chromosomal segments that may contain loci responsible for complex diseases. Nonparametric tests, similar to that suggested in [3] for pairs in a grandparent-type relationship, can be devised for pairs of full-sibs. Assume a chromosomal segment is suspected of carrying the gene(s) responsible for a particular disease. The hypothesis to be tested is 'the segment does not carry such genes'. Assume continuous IBD data are available for $n$ independent pairs of full-sibs, all affected by the disease. In particular, the data contain the proportions, $x_1, x_2, \ldots, x_n$, of genome with 2 haplotypes shared IBD on the chromosomal segment in question. A relevant test statistic is the minimum of these proportions, say $x$, over the $n$ full-sib pairs. The relevant p-value is equal to $(1 - F(x))^n$, where $F(x)$ can be evaluated using the relevant Maple code. Likewise, a similar test can be based on the corresponding proportions of genome with at least 1 haplotype shared IBD. Both tests are robust and do not depend on the mode of inheritance. However, one may expect that the first, based on the genome with 2 haplotypes shared IBD, would be more sensitive to a recessive pattern of inheritance on the chromosomal segment. The second is relevant if sharing of one allele cannot be distinguished from sharing of two.

Our results may also be used in identifying the presence of further gene(s) responsible for a complex disease on a chromosomal segment flanked by an already identified major disease gene. The relevant tests are to be based on the corresponding proportions of shared genome, conditional on the information on one of the flanking markers. Recall that our Maple codes also evaluate such conditional probabilities and therefore evaluate the relevant p-values.
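The p-value of the minimum-proportion test can be sketched in a few lines. The example below uses Python rather than Maple, and the function names and the stand-in CDF are hypothetical: in practice `cdf` would wrap the cumulative probabilities produced by the relevant Maple code, which are not reproduced here.

```python
from typing import Callable, Sequence


def min_sharing_p_value(proportions: Sequence[float],
                        cdf: Callable[[float], float]) -> float:
    """Return (1 - F(x))^n, where x is the minimum of the n observed proportions."""
    if not proportions:
        raise ValueError("at least one sib pair is required")
    x = min(proportions)      # test statistic: smallest IBD proportion
    n = len(proportions)      # number of independent affected sib pairs
    return (1.0 - cdf(x)) ** n


def toy_cdf(x: float) -> float:
    """Stand-in CDF (uniform on [0, 1]); NOT the full-sib distribution."""
    return min(1.0, max(0.0, x))


# Three hypothetical sib pairs; with the toy CDF the p-value is (1 - 0.5)^3.
p_value = min_sharing_p_value([0.5, 0.8, 0.9], toy_cdf)
```

The CDF is passed in as a callable so that the same function serves both the test based on 2 haplotypes shared IBD and the test based on at least 1 haplotype shared IBD.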
Table 2: Full-sibs: Cumulative probabilities ($F_{(T_1(0.5)+T_2(0.5))/0.5}(x)$) of the proportion ($x$) of genome, with at least 1 haplotype shared IBD, on a chromosomal segment of length $t$ morgans

Conclusions
A technology, together with relevant software codes for its automated implementation, is provided for exact evaluation of the distributions of relevant statistics associated with continuous genome data on closely related individuals.

Materials and Methods
The underlying mathematical model
Throughout the paper the genomic continuum model, with Haldane's model for the crossover process, is assumed. That is, the occurrence of crossovers along the chromosomes is modelled by a Poisson process (see [8]). If distances are measured in morgans, then the rate of the Poisson process is one. Donnelly [9] elaborated on this model and showed that all crossover processes on a pedigree can be viewed as a continuous-time Markov chain whose states are the vertices of a hypercube and in which time refers to distance. For a pair of full-sibs (the relevant pedigree consists of the two sibs and their parents) the relevant hypercube is four-dimensional. Each coordinate is either 0 or 1, depending on whether grand-paternal or grand-maternal DNA was transmitted. The first two coordinates indicate the parental transmissions for sib one and the other two do the same for sib two. For example, the vertex (0,1,1,0) indicates the following transmissions at a chromosomal locus: a grand-paternal (0) from the mother and a grand-maternal (1) from the father of sib one, and a grand-maternal (1) from the mother and a grand-paternal (0) from the father of sib two. The DNA at a location on, or a segment from, one of the homologous chromosomes is called a haplotype. The sixteen states of the hypercube can be divided into three groups of vertices, indicating whether 0, 1, or 2 haplotypes are shared IBD at a locus, without distinguishing between sharing of maternal and paternal DNA. The underlying model can then be reduced to a three-state continuous-time Markov chain whose parameters are described as follows (see [10]). States are denoted by 0, 1, and 2, corresponding to the number of haplotypes shared IBD.
The holding times are exponentially distributed with rate parameter 4, and the one-step transition probability matrix of the embedded discrete-time Markov chain, with rows and columns ordered as states 0, 1, 2, is given by

$$P = \begin{pmatrix} 0 & 1 & 0 \\ 1/2 & 0 & 1/2 \\ 0 & 1 & 0 \end{pmatrix}.$$

The initial probability vector is (1/4, 1/2, 1/4) (the steady-state probabilities). The continuous data on a chromosomal segment consist of the lengths of the consecutive pieces (subsegments) characterised by the number of haplotypes shared IBD.
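The reduction from the sixteen hypercube vertices to three states can be checked by direct enumeration. The sketch below is in Python rather than Maple and rests on two assumptions taken from the description above: the coordinates are ordered (mother to sib one, father to sib one, mother to sib two, father to sib two), and each coordinate flips at rate one. All names are illustrative.

```python
from fractions import Fraction
from itertools import product


def ibd_count(v):
    """Number of haplotypes shared IBD at vertex v = (m1, f1, m2, f2)."""
    m1, f1, m2, f2 = v
    return int(m1 == m2) + int(f1 == f2)


vertices = list(product((0, 1), repeat=4))

# Steady-state probabilities: the share of the 16 vertices in each IBD group.
initial = [Fraction(sum(ibd_count(v) == s for v in vertices), len(vertices))
           for s in range(3)]


def jump_row(v):
    """Distribution of the IBD state after one crossover (one coordinate flip)."""
    row = [Fraction(0)] * 3
    for i in range(4):            # the four coordinates flip at equal rates
        w = list(v)
        w[i] ^= 1
        row[ibd_count(w)] += Fraction(1, 4)
    return tuple(row)


# The grouping is a valid reduction only if every vertex in a group
# produces the same row; collect the distinct rows per group and check.
rows = {s: {jump_row(v) for v in vertices if ibd_count(v) == s} for s in range(3)}
assert all(len(r) == 1 for r in rows.values())
P = [list(next(iter(rows[s]))) for s in range(3)]
```

Every flip changes the IBD state, so the holding rate in each reduced state equals the total flip rate of 4.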
The sojourn time in a state has the following interpretation. Let $d$ be the length (in morgans) of a chromosomal segment of interest. Then the sojourn time in state $i$ ($i = 0, 1, 2$), within a time interval of length $d$, is the length of genome on that segment at each location of which $i$ haplotypes are shared IBD by the two sibs. Such genome will be called briefly genome with $i$ haplotypes shared IBD.

Likewise, the underlying model for a pair of half-sibs is a continuous-time Markov chain with four states, which are the vertices of the two-dimensional cube. The four states can be divided into two groups, each indicating the number (0 or 1) of haplotypes shared IBD at a locus, again without distinguishing between sharing of maternal and paternal DNA. The reduced underlying model is then a two-state continuous-time Markov chain with states denoted by 0 and 1, exponentially distributed holding times with rate parameter 2, and the following one-step transition probability matrix of the embedded discrete-time Markov chain:

$$P = \begin{pmatrix} 0 & 1 \\ 1 & 0 \end{pmatrix}.$$

The initial probability vector is (1/2, 1/2).
There is a very close similarity between the aforementioned underlying models and those corresponding to the relationships great-grandchild-great-grandparent and grandchild-grandparent. Donnelly [9] discusses the models for grandparent-type relationships. Note that the reduced underlying model for the relationship great-grandchild-great-grandparent and that for a pair of full-sibs are the same, except for the value of the parameter of the holding time distribution. This is 4 for a pair of full-sibs and 2 for the relationship great-grandchild-great-grandparent. Of course, the interpretation of the states is different. For example, state 2 (or state 0; note that states 0 and 2 are interchangeable) indicates a transmission of great-grandparent DNA to a great-grandchild, and the remaining two states indicate two cases resulting in non-transmission. Likewise, the reduced underlying model for the relationship grandchild-grandparent and that for a pair of half-sibs are the same, again except for the value of the parameter of the holding time distribution. This is 2 for a pair of half-sibs and 1 for the relationship grandchild-grandparent.

Methods
Our methodology is similar to that introduced by Stefanov (2000), who treated grandparent-type relationships. Namely, relevant stopping times are introduced and explicit expressions for their characteristic functions are found. These characteristic functions are numerically invertible using the system Maple V (for an introduction to Maple V see [11]) and some numerical tools. Therefore, their distribution functions are derivable. Finally, the latter distribution functions yield the distribution functions of relevant random quantities, such as the sojourn time in a state and the counts of transitions from one state to another, all within a fixed time interval. Subsequently, the cumulative probabilities of relevant statistics on continuous IBD data can be calculated. For example, such statistics for a pair of full-sibs are the proportion of genome with 2 (at least 1) haplotypes shared IBD, and the count of pieces of a chromosomal segment on each of which 2 (at least 1) haplotypes are shared IBD. More details follow.
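To illustrate what numerical inversion of a characteristic function involves, the following Python sketch applies the Gil-Pelaez inversion formula with a simple midpoint rule. Both choices are assumptions made for illustration: the text does not say which inversion formula or quadrature the Maple codes use. The test case is a sum of three exponential variables with rate 2, chosen only because its distribution function is known in closed form; it is not one of the stopping times of the paper.

```python
import cmath
import math


def cdf_from_cf(cf, x, upper=200.0, step=0.005):
    """Gil-Pelaez: F(x) = 1/2 - (1/pi) * integral_0^inf Im(exp(-iux) cf(u)) / u du.

    The integral is truncated at `upper` and approximated by the midpoint rule.
    """
    n = int(upper / step)
    total = 0.0
    for j in range(n):
        u = (j + 0.5) * step                  # midpoints avoid u = 0
        total += (cmath.exp(-1j * u * x) * cf(u)).imag / u
    return 0.5 - total * step / math.pi


def erlang_cf(u, shape=3, rate=2.0):
    """Characteristic function of a sum of `shape` independent Exp(rate) variables."""
    return (rate / (rate - 1j * u)) ** shape


def erlang_cdf(x, shape=3, rate=2.0):
    """Closed-form distribution function of the same sum, used as a reference."""
    tail = sum((rate * x) ** j / math.factorial(j) for j in range(shape))
    return 1.0 - math.exp(-rate * x) * tail


approx = cdf_from_cf(erlang_cf, 1.0)
exact = erlang_cdf(1.0)
```

The truncation point and step size control the accuracy; a characteristic function that decays more slowly than this one would need a larger `upper` or a more careful quadrature.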

Full-sibs
Let $\{X(t)\}_{t \ge 0}$ be a three-state continuous-time Markov chain whose parameters are those of the underlying model for a pair of full-sibs. Denote by $N_{ij}(t)$ the number of one-step transitions from state $i$ to state $j$, and by $T_i(t)$ the sojourn time in state $i$, $i, j = 0, 1, 2$, all up to time $t$.
Note the following interpretation of these quantities when considering a chromosomal segment of length $t$. The sojourn time in state 2, $T_2(t)$, is the length of genome with both haplotypes shared IBD on the segment. The sojourn time $T_1(t) + T_2(t)$ is the length of genome with at least one haplotype shared IBD on the segment. $N_{12}(t)$ (respectively $N_{12}(t) + 1$), given the initial state is not 2 (is 2), counts the number of distinct pieces of the segment on each of which both haplotypes are shared IBD. $N_{01}(t)$ (respectively $N_{01}(t) + 1$), given the initial state is 0 (is not 0), counts the number of distinct pieces of the segment on each of which at least one haplotype is shared IBD.
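These quantities can be made concrete with a small Monte Carlo simulation of the three-state chain. This is only an illustrative sketch in Python, not the exact method of the paper. It assumes holding rate 4, a steady-state start (1/4, 1/2, 1/4), and jump probabilities implied by the hypercube description: states 0 and 2 always move to state 1, and state 1 moves to 0 or 2 with probability 1/2 each. The function and variable names are hypothetical.

```python
import random


def simulate_full_sibs(t, rng, rate=4.0, init=(0.25, 0.5, 0.25)):
    """Simulate the chain on [0, t].

    Returns (sojourn times T_0, T_1, T_2; number of pieces with 2 haplotypes
    shared IBD; number of pieces with at least 1 haplotype shared IBD).
    """
    state = rng.choices((0, 1, 2), weights=init)[0]
    sojourn = [0.0, 0.0, 0.0]
    pieces_2 = 1 if state == 2 else 0      # the "+ 1" when starting in state 2
    pieces_12 = 1 if state != 0 else 0     # the "+ 1" when not starting in state 0
    pos = 0.0
    while True:
        hold = rng.expovariate(rate)
        if pos + hold >= t:                # segment ends during this holding time
            sojourn[state] += t - pos
            break
        sojourn[state] += hold
        pos += hold
        new = 1 if state != 1 else rng.choice((0, 2))
        if (state, new) == (1, 2):
            pieces_2 += 1                  # an N_12 transition opens a 2-IBD piece
        if (state, new) == (0, 1):
            pieces_12 += 1                 # an N_01 transition opens a >=1-IBD piece
        state = new
    return sojourn, pieces_2, pieces_12


rng = random.Random(1)
t = 0.5
runs = [simulate_full_sibs(t, rng) for _ in range(20000)]
mean_prop_2 = sum(s[2] for s, _, _ in runs) / (t * len(runs))
```

With a steady-state start, the average of $T_2(t)/t$ over many runs should be close to 1/4, which gives a quick sanity check on the simulation.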
Note the following interpretation of $T_2(t)$ and $N_{12}(t)$ if the aforementioned three-state Markov chain is the underlying model for the great-grandchild-great-grandparent relationship: $T_2(t)$ is the amount of genome inherited by a great-grandchild from his great-grandparent on a chromosomal segment of length $t$, and $N_{12}(t)$ (respectively $N_{12}(t) + 1$), given the initial state is not 2 (is 2), is the number of distinct pieces whose relevant haplotypes are shared IBD on a chromosomal segment of length $t$.
Denote by $F_t(s)$ the cumulative probability that is evaluated by the Maple program provided in [3] for the great-grandchild-great-grandparent relationship ($t$ and $s$ are the lengths of the chromosomal segment and of the part of it shared IBD, respectively). It is then easy to see that the identities (1), (2), and (3) hold when the underlying model is that for a pair of full-sibs. Therefore, in view of the aforementioned interpretation of $T_2(t)$ and $T_1(t) + T_2(t)$, using the identities (1), (2), and (3) and the Maple program for the great-grandchild-great-grandparent relationship provided in [3], one can derive the cumulative probabilities of the following quantities associated with a pair of full-sibs: the proportion of genome with both haplotypes shared IBD, and the proportion of genome with at least one haplotype shared IBD, on any chromosomal segment.
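One way to see why a program written for the great-grandchild-great-grandparent model can serve the full-sib model is to rescale distance, since the two reduced chains differ only in the holding rate (4 versus 2). The relation below is a sketch of that argument and is not claimed to coincide with the form of identities (1), (2), and (3). It assumes that both chains start from the same initial distribution, and the superscript notation $T_2^{(\lambda)}(t)$, for the sojourn time in state 2 under holding rate $\lambda$, is introduced here only for illustration.

```latex
\[
  T_2^{(4)}(t) \;\overset{d}{=}\; \tfrac{1}{2}\, T_2^{(2)}(2t),
  \qquad\text{so that}\qquad
  P\bigl(T_2^{(4)}(t) \le s\bigr) \;=\; P\bigl(T_2^{(2)}(2t) \le 2s\bigr),
  \quad 0 \le s \le t .
\]
```

In words, a full-sib segment of length $t$ behaves like a great-grandchild-great-grandparent segment of length $2t$ in which all lengths are halved.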
In what follows we discuss how the cumulative probabilities of other relevant statistics are derived. Introduce the stopping times $\tau_k$ and $\nu_k$. Explicit expressions for the characteristic functions of $\tau_k$ and $\nu_k$, corresponding to different initial states, are derivable. The relevant propositions and their proofs are found in the next subsection.
Denote by $S_2(t)$ ($S_{1,2}(t)$) the number of distinct pieces of a chromosomal segment of length $t$, on each of which 2 (at least 1) haplotypes are shared IBD. Then the distributions of $S_2(t)$ and $S_{1,2}(t)$ are related to the distributions of the $\tau_k$ and $\nu_k$ as follows.
In order to compute these cumulative probabilities we need the conditional cumulative probabilities of the $\tau_k$ and $\nu_k$ given the initial state. The propositions in the next subsection provide the characteristic functions of these conditional distributions. They are numerically invertible, and subsequently the required cumulative probabilities are derivable. Likewise, the cumulative probabilities of the number of distinct pieces inherited by a great-grandchild from a great-grandparent on a chromosomal segment of length $t$ can be calculated (see Remark 2 in the next subsection).
The relevant Maple codes, for rapidly implementing such evaluations, are provided in subsection Maple V codes.

Half-sibs
Consider now the underlying model for a pair of half-sibs. We use the same notation, $N_{ij}(t)$ and $T_i(t)$ ($i, j = 0, 1$), for the number of one-step transitions from state $i$ to state $j$ and the sojourn time in state $i$, respectively, up to time $t$. The sojourn time $T_1(t)$ is the amount of genome shared IBD by the half-sibs on a chromosomal segment of length $t$. Similarly to the preceding case, the cumulative probabilities of the proportion of such genome can be calculated using the identity (6), where $F_t(s)$ is the cumulative probability that is evaluated by the Maple program provided in [3] for the grandchild-grandparent relationship ($t$ and $s$ are the lengths of the chromosomal segment and of the part of it shared IBD, respectively).
Introduce the following stopping times: $\mu_k = \inf\{t : N_{01}(t) = k\}$, $k = 1, 2, \ldots$, and $\mu_0 = 0$. That is, $\mu_k$ is the waiting time until state 1 is entered for the $k$th time. Denote by $U(t)$ the number of IBD pieces on a chromosomal segment of length $t$. Then it is easy to see that the following hold.
Therefore, the distribution of $U(t)$ is related to the distributions of the $\mu_k$ through the identity (7). It is easy to see that $\mu_k$, given the initial state is 1, is distributed as the sum of $2k$ independent, exponentially distributed random variables with parameter 2. Likewise, $\mu_k$, given the initial state is 0, is distributed as the sum of $2k - 1$ such variables. Therefore, the following hold.
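Fact 1. The conditional distribution of $\mu_k$, given the initial state is 1, is a Gamma distribution, $G(2k, 0.5)$, with parameters $2k$ and 0.5.

Fact 2. The conditional distribution of $\mu_k$, given the initial state is 0, is a Gamma distribution, $G(2k - 1, 0.5)$, with parameters $2k - 1$ and 0.5.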
In view of these facts and the identity given in (7), the cumulative probabilities of $U(t)$ are expressed by formula (8) in terms of $F_{G(\cdot,\cdot)}$, the cumulative distribution function of a Gamma distribution $G(\cdot,\cdot)$, and the initial probability vector $(c_0, c_1)$. Thus, the cumulative probabilities of $U(t)$ can be computed using any standard statistical software. Excerpts of such probabilities are provided in Table 6 for the case $c_0 = c_1 = 0.5$ (the steady-state probabilities).
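As an illustration of how such a computation might look, the Python sketch below derives $P(U(t) \le k)$ directly from the interpretation above; it is not a transcription of formula (8). It assumes that $U(t) = N_{01}(t)$ when the chain starts in state 0 and $U(t) = N_{01}(t) + 1$ when it starts in state 1, so that $U(t) \le k$ corresponds to $\mu_{k+1} > t$ and $\mu_k > t$ respectively. Because the Gamma shape parameters are integers, the closed-form Erlang distribution function is used and no statistical library is needed. The function names are hypothetical.

```python
import math


def erlang_cdf(n, rate, t):
    """CDF at t of a sum of n independent Exp(rate) variables.

    This is the Gamma distribution with shape n and scale 1/rate;
    n = 0 is treated as a point mass at 0.
    """
    if n == 0:
        return 1.0
    tail = sum((rate * t) ** j / math.factorial(j) for j in range(n))
    return 1.0 - math.exp(-rate * t) * tail


def pieces_cdf(k, t, c0=0.5, c1=0.5, rate=2.0):
    """P(U(t) <= k), k = 0, 1, 2, ..., for initial probabilities (c0, c1)."""
    from_state_0 = 1.0 - erlang_cdf(2 * k + 1, rate, t)   # P(mu_{k+1} > t | start in 0)
    from_state_1 = 1.0 - erlang_cdf(2 * k, rate, t)       # P(mu_k > t | start in 1)
    return c0 * from_state_0 + c1 * from_state_1


# Half-sibs on a segment of 0.5 morgans, steady-state start.
half_sib_probs = [pieces_cdf(k, 0.5) for k in range(6)]
```

Setting `rate=1.0` gives the corresponding sketch for the grandchild-grandparent relationship, since the two models differ only in the holding rate.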

Remark 1.
If the underlying model for the grandchild-grandparent relationship is considered, then the second parameter of the aforementioned Gamma distributions is to be changed from 0.5 to 1.

Relevant characteristic functions
Consider the underlying model for a pair of full-sibs. The random quantities $\tau_k$ and $\nu_k$ have been introduced in subsection Methods. The following propositions hold.
Proposition 1. Assume that the initial state is either 0 or 2. Then the characteristic function of $\tau_k$ is given by:

Proposition 2. Assume that the initial state is 1. Then the characteristic function of $\tau_k$ is given by:

.9856720668

Full-sibs
Cumulative probabilities for the number of pieces (k), of a chromosomal segment of length y morgans, on each of which at least 1 haplotype is shared IBD
> assume(x, real, y, real):
> y := 0.5:
.7448682221

Great-grandchild-great-grandparent
Cumulative probabilities for the number of pieces (k), inherited by a great-grandchild from his great-grandparent, on a chromosomal segment of length y morgans
> assume(x, real, y, real):
> y := 0.5:
.9999781100