Inferring relationships between pairs of individuals from locus heterozygosities
© Presciuttini et al. This is an Open Access article: verbatim copying and redistribution of this article are permitted in all media for any purpose, provided this notice is preserved along with the article's original URL. 2002
Received: 12 July 2002
Accepted: 20 November 2002
Published: 20 November 2002
The traditional exact method for inferring relationships between individuals from genetic data is not easily applicable in all situations that may be encountered in several fields of applied genetics. This study describes an approach that gives affordable results and is easily applicable; it is based on the probabilities that two individuals share 0, 1 or both alleles at a locus identical by state.
We show that these probabilities (zi) depend on locus heterozygosity (H), and are scarcely affected by variation of the distribution of allele frequencies. This allows us to obtain empirical curves relating zi's to H for a series of common relationships, so that the likelihood ratio of a pair of relationships between any two individuals, given their genotypes at a locus, is a function of a single parameter, H. Application to large samples of mother-child and full-sib pairs shows that the statistical power of this method to infer the correct relationship is not much lower than the exact method. Analysis of a large database of STR data proves that locus heterozygosity does not vary significantly among Caucasian populations, apart from special cases, so that the likelihood ratio of the more common relationships between pairs of individuals may be obtained by looking at tabulated zi values.
A simple method is provided, which may be used by any scientist with the help of a calculator or a spreadsheet to compute the likelihood ratios of common alternative relationships between pairs of individuals.
The usual, long-established method of inferring relationships between individuals in forensic genetics is based on the population frequencies of the observed alleles and on the conditional probabilities of the observed genotypes, given two alternative hypothesized relationships. In the more frequent instances, such as paternity testing of trios, or similar cases with deficiencies, well-known formulas have come into common use [2, 3]. However, in more complex cases where, for example, the relationship between pairs of individuals from large samples is under investigation, or where the DNA profiles of a number of related individuals are known and we want to know the most likely relationships among them, these calculations become exceedingly complex. Each particular problem requires the development of specific formulas, necessitating either the expertise of highly specialized professionals or recourse to suitable computer programs [4–8]; the latter, in turn, require trained personnel. In addition, the exact method assumes knowledge of allele frequencies at the marker loci, which often show considerable variability between ethnic groups.
Examples of 'difficult' situations sometimes encountered in forensic science include attribution of one or more sets of body remains to missing individuals, identification of the victims of mass disasters, and validation of large databases of individual genetic profiles. Examples from other fields include linkage analysis (investigators may want to verify the true relationships existing among reported relatives [12, 13]), natural and domestic population studies (to resolve kin structures in the wild or confirm the stock source of animal food), and research in physical anthropology (in reconstructing genealogies when there are no civic records, or inferring relationships in ancient cemeteries [17, 18]).
The increasing availability of highly polymorphic genetic markers and their decreasing cost of typing provide high power to resolve the true biological relationship between individuals, even with methods that use only part of the genetic information and are therefore more easily applicable. The aim of this work is to generalize a method for inferring relationships between pairs of individuals, based on the probabilities (here called z0, z1, z2) that two subjects with a given relationship share 0, 1 or 2 alleles identical by state at a locus. This approach was suggested by Chakraborty and co-authors [19, 20], and was subsequently developed by others, generally in the context of genome-wide linkage scans [21, 22]. We first show that the values of zi for a certain relationship depend on the heterozygosity of a locus (H) and very little on the particular distribution of its allele frequencies; this property allows us to obtain regression equations relating the values of zi to H for the more common relationships. We then compare the results of our method with those of the conventional exact approach in large samples of mother-child and full-sib pairs. Finally, we examine a large database of gene frequencies of human populations typed for loci commonly used in forensic science. Based on results from this analysis, the zi values of the CODIS and other loci are tabulated for Caucasian populations; these may be directly used by any scientist with the help of a calculator or a spreadsheet to compute the likelihood ratios of common alternative relationships between any pair of individuals. In more general cases, the equations relating zi to H provide an easy way to compute the zi values to be applied to each particular problem.
Relationship between zi and H
Table 1. Probabilities of genotype combinations and allele sharing for several common relationships
A) Genotype combination (surviving entries only: $4p_A^3 p_B$, $p_A^2 p_B^2/2$, $p_A^2 p_B^2$, $2p_A^2 p_B^2$, $4p_A^2 p_B^2$)
B) Number of shared alleles (Z)
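Equations relating heterozygosity to zi (only the H and H² terms of each equation are preserved; "…" marks the missing terms)
$\dots\; 0.1914\,H - 0.5815\,H^2 + \dots$
$\dots\; 0.2272\,H + 1.7586\,H^2 - \dots$
$\dots\; 0.0358\,H - 1.1771\,H^2 + \dots$
$\dots\; 0.3829\,H - 1.1630\,H^2 + \dots$
$\dots\; 0.9544\,H + 3.5173\,H^2 - \dots$
$\dots\; 0.5715\,H - 2.3543\,H^2 + \dots$
$\dots\; 0.7658\,H - 2.3259\,H^2 + \dots$
$\dots\; 2.9088\,H + 7.0345\,H^2 - \dots$
$\dots\; 2.1431\,H - 4.7086\,H^2 + \dots$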
LRs compared between the IBS method and the exact method
Percentages of pairs with LR higher than given cut-off values
Variation of STR heterozygosity among populations
Probabilities of zi for 18 loci and four common relationships
Inferring the biological relationships existing between two or more specimens using genetic polymorphisms is a cornerstone task in forensic science, which is also encountered in a variety of problems of applied genetics. Application of the conventional exact method requires three critical steps: 1) identification of the correct genotype combination for each locus (among seven possible types); 2) identification of the allele(s) whose frequency must be entered into the appropriate formula; 3) identification of the proper allele frequencies to be used in the calculation. Steps 1 and 2 are computationally tricky; if large databases have to be examined to identify first-degree relatives among unrelated individuals, one must resort to programs developed by experts. Step 3 is also computationally nontrivial (it requires looking up values in external tables and importing them into the specific formula appropriate for each pair), and it is critical from a conceptual point of view as well, since we are sometimes uncertain about the allele frequencies that are appropriate for a given problem. This latter problem is particularly important, for example, in mass disasters, when large numbers of ethnically diverse victims must be screened against large numbers of possible relative matches. In such a situation, use of H in the IBS method may be much more appropriate than using incorrect "average" values of allele frequencies in the exact method. Once a match has been made for a pair of relatives, its exact probabilities can then be computed with the exact method using the ethnically appropriate allele frequencies.
In this study, we have shown that an approach based on the number of alleles shared IBS at each locus may be conveniently used for the purpose of inferring relationships. In this method, the probability of a certain relationship, given the genotypes observed in a pair, depends on a single parameter, the locus heterozygosity (H). This makes it easy to handle even large volumes of data in a single spreadsheet, since these probabilities depend only on the number of shared alleles (0, 1 or 2) and on a constant (H). For example, Presciuttini et al. were interested in the relationships connecting 26 individuals buried in an 18th-century cemetery; application of the IBS method to all possible pairs of this sample was not only computationally easier than the exact method, but also theoretically more robust, as it did not require making inferences about the allele frequencies that characterized the population, but only required assumptions about H. The functional dependence on H is one of the main advantages of the IBS approach. Heterozygosity, being a composite parameter, is inherently less variable among populations than individual allele frequencies. To examine this issue in real data, we analysed the sampling variance of H in a large database of allele frequencies, and concluded that heterozygosity is sufficiently homogeneous, at least among Caucasian populations, to justify the adoption of a single common mean value, apart from special cases of historically isolated groups. Based on this observation, we tabulated the values of the zi for the more common relationships and the more frequently used loci. The values of H of these 18 loci span from 0.632 (TPOX) to 0.878 (D18S51), thus covering a wide range of values. This table has two main applications.
The first concerns all those studies in which the individuals belong to a population whose allele frequencies are unknown (e.g., immigrants from poorly investigated ethnic groups or large samples of ethnically mixed victims and putative relatives); in this situation, applying the exact method is problematic, due to the uncertainty about the proper allele frequencies to be used, whereas the IBS method is straightforward. The second application concerns cases where a new locus has been typed in a certain population, perhaps of an animal species, and its heterozygosity is known; in this case, one may use the tabulated values of a locus with a similar heterozygosity, or interpolate between them (for out-of-range loci, or for more precise results, the regression equations may be used). More generally, using the tabulated zi values makes it easy to obtain a first inference about the relationship between any two samples by simple inspection of genotype data. Given the number of alleles the pair shares at a locus, one writes down the corresponding probabilities under the two relationships to be tested, and then multiplies the ratios of the two probabilities across all typed loci. The exact approach is more cumbersome and error-prone, as it requires a table of formulas to be applied to each genotype combination and a table of allele frequencies from which one obtains the correct frequencies.
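The multiplication across loci described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' spreadsheet: the function name, the locus names, the H values, and the non-relative zi values are hypothetical placeholders invented for the example (real values would come from the tabulated zi or the regression equations). The parent-child values use the relation z0 = 0, z1 = H, z2 = 1 - H, which is derived here under Hardy-Weinberg proportions rather than copied from the article's table.

```python
def parent_child_z(h):
    """(z0, z1, z2) for a parent-child pair at a locus of heterozygosity h.

    Assumes Hardy-Weinberg proportions; a parent and child always share
    at least one allele, so z0 is zero.
    """
    return (0.0, h, 1.0 - h)


def ibs_likelihood_ratio(shared, z_r1, z_r2):
    """Likelihood ratio of relationship R1 versus R2 for one pair.

    shared: dict mapping locus -> number of alleles shared IBS (0, 1 or 2)
    z_r1, z_r2: dicts mapping locus -> (z0, z1, z2) under R1 and R2
    The relationship in the denominator must give a non-zero probability
    to every observed sharing count.
    """
    lr = 1.0
    for locus, n_shared in shared.items():
        lr *= z_r1[locus][n_shared] / z_r2[locus][n_shared]
    return lr


# Hypothetical loci: heterozygosities and non-relative z values are
# made-up placeholders, NOT values from the article's tables.
z_pc = {"locusA": parent_child_z(0.8), "locusB": parent_child_z(0.7)}
z_nr = {"locusA": (0.40, 0.50, 0.10), "locusB": (0.35, 0.50, 0.15)}

# A pair sharing one allele at locusA and two alleles at locusB.
observed = {"locusA": 1, "locusB": 2}
print(ibs_likelihood_ratio(observed, z_pc, z_nr))
```

A pair that shares no allele at some locus gets a likelihood ratio of zero for parent-child versus non-relatives in this sketch, because mutation and typing error are ignored.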
The main disadvantage of the IBS method is, of course, its reduced power. If the results are part of a legal case, all available information must be used to support a given hypothesis, and the exact method must always be used if it is applicable. However, there are many instances in which a standard significance level (0.05 or 0.01) is acceptable for screening a large database or for drawing provisional scientific conclusions, and the IBS method may reach these limits even with a small number of typed loci, at least for discriminating first-degree relatives from non-relatives.
Table 5. Comparison of exact and IBS methods in a particular case
Columns: Genotype combination (1); Allele frequencies (2); Likelihood ratio (3); Likelihood ratio (5)
Allele frequencies: p9 = 0.044, p13 = 0.062, p20 = 0.156, p24.2 = 0.009, p12 = 0.266, p13 = 0.047, p5 = 0.184, p12 = 0.369, p6 = 0.080, p10 = 0.275, p9.3 = 0.241, p13 = 0.344, p15 = 0.116, p8 = 0.512, p17 = 0.275, p18 = 0.206, p63 = 0.244
In Table 5, markers are arranged in decreasing order of the ratio between the two LRs (last column). It may be seen that the three topmost markers provide most of the bias in favor of the exact method, and this is clearly the consequence of the occurrence, in this particular pair, of rare alleles. This highlights the major difference between the two methods. In the exact method, the frequencies of the observed alleles are both necessary for the calculation and critical. They are necessary because they contain all information we can use for inference, and they are critical because small changes of their values may cause large variations of the resulting likelihood ratios. The exact method assumes the allele frequencies are known without error; if they are misspecified (because of poor quality of published estimates, inadequate information about the ethnicity of the members of the putative pair, etc.), then the results of the exact method will be incorrect. When the alleles shared by any two individuals are rare, the LR that they are related may reach high values. In the IBS method, the frequencies of the observed alleles are irrelevant, so that we do not expect to find high peaks of LR in any pair. However, the occurrence of rare alleles in random pairs of individuals is also rare, so that, on the average, the power of the IBS method is not much lower than that of the exact method. This was apparent in our analysis of true parent-child and full-sib pairs; the exact method produced a tail of pairs with very high LR, whereas the IBS method appeared to be more constrained in the upper bound.
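The sensitivity of the exact method to rare alleles can be illustrated with a small worked case. This is a sketch under stated assumptions, not a calculation taken from the article: it considers a hypothetical pair with genotypes AB and AC (one shared allele, A), Hardy-Weinberg proportions, no mutation, and the standard parent-child versus non-relative comparison.

```latex
\begin{align}
P(AB, AC \mid \mathrm{NR}) &= (2 p_A p_B)(2 p_A p_C) = 4 p_A^2 p_B p_C \\
P(AB, AC \mid \mathrm{PC}) &= (2 p_A p_B)\left(\tfrac{1}{2}\, p_C\right) = p_A p_B p_C \\
\mathrm{LR}_{\mathrm{exact}} &= \frac{P(AB, AC \mid \mathrm{PC})}{P(AB, AC \mid \mathrm{NR})} = \frac{1}{4 p_A} \\
p_A = 0.009 &\;\Rightarrow\; \mathrm{LR}_{\mathrm{exact}} \approx 27.8, \qquad
p_A = 0.3 \;\Rightarrow\; \mathrm{LR}_{\mathrm{exact}} \approx 0.83
\end{align}
```

In the IBS method the same observation (Z = 1) contributes the ratio of the two z1 values, which depends on H alone, so it is the same whether the shared allele is rare or common.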
In conclusion, the IBS method presented here may be conveniently used as a preliminary approach to investigate the relationship existing between any pair of individuals. It can be applied by anybody using a desk calculator or a spreadsheet. Future work based on extensive computer simulations will address issues that we have not examined here. These include analysis of statistical power (which requires considering the distribution of LR when assumed relationships are false), the effects of typing errors and gene mutations, and the robustness of the method to deviations from any of its assumptions (such as taking average H values).
Using the IBS method may help decide, before embarking on exact calculations, whether the evidence collected for a certain problem is sufficient for its purposes or whether it is advisable to type additional loci. In addition, its use of estimates of H rather than of allele frequencies makes the IBS method particularly attractive in all those cases where ethnicity poses a problem, since H varies less across ethnicities. Furthermore, the results of the IBS method may even be accepted without further analyses in certain circumstances, since the LRs are highly correlated with those calculated by the exact method. Of course, the exact method should always follow IBS analysis when the results are critical to living human subjects.
The conventional approach to determining the biological relationship existing between pairs of individuals is based on the probability P(X|R) of the observed marker genotypes (X), conditional on a certain relationship R. Here, X may be a multi-locus genotype. The collected evidence is then summarized in the form of a likelihood ratio of two alternative hypotheses: the probability of observation X given the relationship R1, divided by the probability of the same observation X given the relationship R2. Seven configurations of genotypes (regardless of order) are possible for two individuals and a multi-allelic locus; Table 1A shows the formulas expressing P(X|R) for the following four relationships: 1) parent-child (PC), 2) full sibs (FS), 3) second-degree relationships, including half sibs, avuncular pairs, and grandparent-grandchild pairs (2D), and 4) non-relatives (NR), as a function of the allele frequencies at a single locus.
In the IBS method, we consider the probabilities P(Z|R) that two individuals share 0, 1 or 2 alleles identical by state (Z) at a locus (again, Z may be a multi-locus vector), given a certain relationship R. Thus, the particular genotypes observed in each individual are irrelevant, as the observed variable Z is the number of alleles they have in common. In the case of a diallelic locus, these probabilities (z0, z1, z2, for Z = 0, Z = 1, and Z = 2, respectively) were easily obtained for the most common relationships as simple functions of the locus heterozygosity H (Table 1B). As the number of alleles increases, the values of z0, z1, z2 show increasing departures from those predicted by the diallelic formulas; only the linear relation of parent-child pairs remains valid for any value of H and for any number of alleles. For example, z0 is still $H^2/2$ in the case of a tri-allelic locus (allele frequencies p, q, r) and a pair of non-relatives (Table 1), whereas $z_2 = 1 - 2H + H^2 + 2(p^2q^2 + p^2r^2 + q^2r^2)$. If the last term of this equation were equal to $H^2/2$, the equation would be identical to that shown in Table 1; instead, this term is smaller than $H^2/2$ by three cross-product terms ($4p^2qr$, $4pq^2r$, and $4pqr^2$), so that z2 cannot be expressed as a simple function of H. Correspondingly, the value of z1 is higher than that predicted by the diallelic formula by the same amount. In the case of a locus with four alleles, even the value of z0 differs from $H^2/2$. This suggests that the zi values of multi-allelic loci, although related to H, are not exact functions of it.
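The tri-allelic statement for non-relatives can be checked term by term. The expansion below is a sketch assuming Hardy-Weinberg proportions and allele frequencies p, q, r with p + q + r = 1; z2 is the probability that two unrelated individuals carry the same genotype.

```latex
\begin{align}
H &= 1 - (p^2 + q^2 + r^2) = 2(pq + pr + qr) \\
z_2 &= p^4 + q^4 + r^4 + 4(p^2q^2 + p^2r^2 + q^2r^2) \\
(1 - H)^2 &= p^4 + q^4 + r^4 + 2(p^2q^2 + p^2r^2 + q^2r^2) \\
\Rightarrow\; z_2 &= 1 - 2H + H^2 + 2(p^2q^2 + p^2r^2 + q^2r^2) \\
\frac{H^2}{2} &= 2(p^2q^2 + p^2r^2 + q^2r^2) + 4p^2qr + 4pq^2r + 4pqr^2
\end{align}
```

With only two alleles (r = 0) the cross-product terms vanish and the last term of z2 reduces to $H^2/2$, which recovers the diallelic form.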
Exact zi values for markers with arbitrary numbers of alleles and for the four relationships above were computed by first determining the population probability of all possible genotype pairs for each locus; this was accomplished by listing all possible genotype pairs for each locus and then applying to each pair the exact formulas of Table 1A. The number of shared alleles (Z) was also determined for each pair, and the values of zi were calculated by summing the probabilities of all pairs for each of the three values of Z. In this procedure, we used allele frequencies from the databases currently used in our forensic casework studies; these values were also used in computing likelihood ratios of different relationships for parent-child pairs. The exact zi values obtained were fitted to third-order polynomial equations, where the independent variable was the locus heterozygosity.
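The enumeration can be sketched in Python as follows. This is an illustration under stated assumptions rather than the authors' procedure: instead of applying the Table 1A formulas to a list of genotype pairs, it sums over allele draws conditional on the number of alleles shared identical by descent, using the standard IBD coefficients for each relationship, which should give the same zi under Hardy-Weinberg proportions and no mutation. All function and variable names are hypothetical, and the polynomial fit is not included.

```python
from itertools import product

# Standard IBD coefficients (k0, k1, k2) for each relationship; using
# them here is an assumption of this sketch, not the article's Table 1A.
IBD_COEFFICIENTS = {
    "PC": (0.0, 1.0, 0.0),    # parent-child
    "FS": (0.25, 0.5, 0.25),  # full sibs
    "2D": (0.5, 0.5, 0.0),    # second-degree relatives
    "NR": (1.0, 0.0, 0.0),    # non-relatives
}


def heterozygosity(freqs):
    """Expected heterozygosity H = 1 - sum of squared allele frequencies."""
    return 1.0 - sum(p * p for p in freqs)


def ibs_shared(g1, g2):
    """Number of alleles (0, 1 or 2) two genotypes share identical by state."""
    remaining = list(g2)
    shared = 0
    for allele in g1:
        if allele in remaining:
            remaining.remove(allele)
            shared += 1
    return shared


def ibs_sharing_probabilities(freqs):
    """Return {relationship: (z0, z1, z2)} for one locus."""
    alleles = range(len(freqs))

    # No allele IBD: four independent allele draws.
    z_ibd0 = [0.0, 0.0, 0.0]
    for a, b, c, d in product(alleles, repeat=4):
        prob = freqs[a] * freqs[b] * freqs[c] * freqs[d]
        z_ibd0[ibs_shared((a, b), (c, d))] += prob

    # One allele IBD: allele a is carried by both individuals.
    z_ibd1 = [0.0, 0.0, 0.0]
    for a, b, c in product(alleles, repeat=3):
        prob = freqs[a] * freqs[b] * freqs[c]
        z_ibd1[ibs_shared((a, b), (a, c))] += prob

    # Two alleles IBD: identical genotypes, so Z = 2 with certainty.
    z_ibd2 = [0.0, 0.0, 1.0]

    result = {}
    for rel, (k0, k1, k2) in IBD_COEFFICIENTS.items():
        result[rel] = tuple(
            k0 * z0 + k1 * z1 + k2 * z2
            for z0, z1, z2 in zip(z_ibd0, z_ibd1, z_ibd2)
        )
    return result


# Hypothetical tri-allelic locus, used only to exercise the functions.
example_freqs = [0.5, 0.3, 0.2]
example_h = heterozygosity(example_freqs)
example_z = ibs_sharing_probabilities(example_freqs)
print(example_h, example_z)
```

The four-fold loop grows with the fourth power of the number of alleles, which is acceptable for typical STR loci; repeating the calculation over many loci would give the (H, zi) points to which a cubic could then be fitted.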
In the analysis of true familial data, two independent samples were used: i) a series of 102 mother-child pairs from disputed paternity studies, typed for a variable number of markers (5 to 17, out of 26 codominant loci), and ii) a sample of 80 sib pairs, bone marrow transplant recipients and their donors, typed for 13 loci (listed in Table 5). These siblings were identical for haplotypes of both HLA class I and class II loci, making it unlikely that any of these pairs were actually biologically unrelated. In the analysis of interpopulation variation of H, allele frequency data of STR markers and sample sizes were extracted from the on-line database "The Distribution of the Human DNA-PCR Polymorphisms" http://www.uni-duesseldorf.de/WWW/MedFak/Serology/dna.html. Heterozygosities and their sampling variances were computed by formulas 8.3 and 8.13 in Nei, respectively. Tukey's multiple comparison procedure was used to test differences in single-locus heterozygosities between populations. This test is essentially a t-test applied to multiple means, and uses an appropriate and controlled significance level; it is designed to recognize the mean(s) that are significantly different from one or more other means in a given group. In applying this test to our data, we formed all possible pairs of the populations to be tested and calculated the test statistic $q = (2d)^{1/2}/s_d$, d being the difference in H between two populations and $s_d$ being the standard error of this difference. Critical values of q are tabulated (Tukey's studentized range distribution tables, see e.g. http://cse.niaes.affrc.go.jp/miwa/probcalc/s-range/).
1. Evett IW, Weir BS: Interpreting DNA evidence. Sunderland, Sinauer Associates Inc 1998.
2. Lee HS, Lee JW, Han GR, Hwang JJ: Motherless case in paternity testing. Forensic Sci Int 2000, 114:57–65.
3. Ayres KL: Relatedness testing in subdivided populations. Forensic Sci Int 2000, 114:107–115.
4. Egeland T, Mostad PF, Mevag B, Stenersen M: Beyond traditional paternity and identification cases. Selecting the most probable pedigree. Forensic Sci Int 2000, 110:47–59.
5. Brenner CH: Symbolic kinship program. Genetics 1997, 145:535–42.
6. Slate J, Marshall TC, Pemberton JM: A retrospective assessment of the accuracy of the paternity inference program CERVUS. Mol Ecol 2000, 9:801–808.
7. Maviglia R, Mortera J, Dobosz M, Caglià A, Pascali VL, van Boxel DW, Dawid AP: Forensic inference from incomplete pedigrees by probabilistic expert systems. Progress in Forensic Genetics 2000, 8:399–401.
8. Epstein MP, Duren WL, Boehnke M: Improved inference of relationship for pairs of individuals. Am J Hum Genet 2000, 67:1219–1231.
9. Corach D, Sala A, Penacino G, Iannucci N, Bernardi P, Doretti M, Fondebrider L, Ginarte A, Inchaurregui A, Somigliana C, Turner S, Hagelberg E: Additional approaches to DNA typing of skeletal remains: the search for "missing" persons killed during the last dictatorship in Argentina. Electrophoresis 1997, 18:1608–1612.
10. Hsu CM, Huang NE, Tsai LC, Kao LG, Chao CH, Linacre A, Lee JC: Identification of victims of the 1998 Taoyuan Airbus crash accident using DNA analysis. Int J Legal Med 1999, 113:43–46.
11. Ruitberg CM, Reeder DJ, Butler JM: STRBase: a short tandem repeat DNA database for the human identity testing community. Nucleic Acids Res 2001, 29:320–322.
12. Sun L, Abney M, McPeek MS: Detection of mis-specified relationships in inbred and outbred pedigrees. Genet Epidemiol 2001, 21(Suppl 1):S36–41.
13. Sieberts SK, Wijsman EM, Thompson EA: Relationship inference from trios of individuals, in the presence of typing error. Am J Hum Genet 2002, 70:170–80.
14. Alderson GW, Gibbs HL, Sealy SG: Parentage and kinship studies in an obligate brood parasitic bird, the brown-headed cowbird (Molothrus ater), using microsatellite DNA markers. J Hered 1999, 90:182–90.
15. Ciampolini R, Moazami-Goudarzi A, Vaiman D, Dillmann C, Mazzanti E, Foulley JL, Leveziel H, Cianci D: Individual multilocus genotypes using microsatellite polymorphisms to permit the analysis of the genetic variability within and between Italian beef cattle breeds. J Anim Sci 1995, 73:3259–3268.
16. Calafell F, Shuster A, Speed WC, Kidd JR, Black FL, Kidd KK: Genealogy reconstruction from short tandem repeat genotypes in an Amazonian population. Am J Phys Anthropol 1999, 108:137–146.
17. Presciuttini S, Bramanti B, Hummel S, Herrmann B: Assessing relationships in an ancient skeletal collection by the number of alleles shared identical by state (IBS) among pairs of individuals. Progress in Forensic Genetics 2002, in press.
18. Shinoda K, Kanai S: Intracemetery genetic analysis at the Nakazuma Jomon site in Japan by mitochondrial DNA sequencing. Anthropol Sci 1999, 107:129–140.
19. Chakraborty R, Jin L: Determination of relatedness between individuals using DNA fingerprinting. Hum Biol 1993, 65:875–895.
20. Stivers DN, Zhong Y, Hanis CL, Chakraborty R: RELTYPE: a computer program for determining biological relatedness between individuals based on allele sharing at microsatellite loci. Am J Hum Genet 1995, suppl 59:A190.
21. Ehm MG, Wagner M: A test statistic to detect errors in sib-pair relationship. Am J Hum Genet 1998, 62:181–188.
22. McPeek MS, Sun L: Statistical tests for detection of mispecified relationships by use of genome-screen data. Am J Hum Genet 2000, 66:1076–1094.
23. Butler JM: Forensic DNA typing. London San Diego, Academic Press 2001.
24. Lange K: A test statistic for the affected-sib-set method. Ann Hum Genet 1986, 50:283–290.
25. Bacigalupo A, van Lint MT, Valbonesi M, Lercari G, Carlier P, Lamparelli T: Thiotepa cyclophosphamide followed by granulocyte colony-stimulating factor mobilized allogenic peripheral blood cells in adults with advanced leukemia. Blood 1996, 88:353–357.
26. Huckenbeck W, Kuntze K, Scheil HG: The distribution of the human DNA-PCR polymorphisms. Berlin, Verlag Köster 2002, (Suppl II).
27. Nei M: Molecular evolutionary genetics. New York, Columbia University Press 1987.
28. Kirk RE: Multiple comparison tests. In Experimental Design, Belmont, Brooks/Cole 2 Edition 1982, 90–126.