Volume 6 Supplement 1
Calculation of multipoint likelihoods using flanking marker data: a simulation study
© George et al; licensee BioMed Central Ltd 2005
Published: 30 December 2005
The calculation of multipoint likelihoods is computationally challenging, with the exact calculation of multipoint probabilities only possible on small pedigrees with many markers or large pedigrees with few markers. This paper explores the utility of calculating multipoint likelihoods using data on markers flanking a hypothesized position of the trait locus. The calculation of such likelihoods is often feasible, even on large pedigrees with missing data and complex structures. Performance characteristics of the flanking marker procedure are assessed through the calculation of multipoint heterogeneity LOD scores on data simulated for Genetic Analysis Workshop 14 (GAW14). Analysis is restricted to data on the Aipotu population on chromosomes 1, 3, and 4, where chromosomes 1 and 3 are known to contain disease loci. The flanking marker procedure performs well, even when missing data and genotyping errors are introduced.
The calculation of multipoint likelihoods on general pedigrees is computationally challenging. Factors influencing the complexity of multipoint calculations include family size, pedigree structure, marker number, and missing data. Efficient algorithms have been developed for handling large pedigrees with few markers or small pedigrees with many markers, but calculating multipoint probabilities on large pedigrees with many markers is infeasible.
In this paper, the performance characteristics of CHROM-WALK, a computer program for calculating multipoint likelihoods on general pedigrees with many linked markers, are explored. Multipoint likelihoods are calculated using data observed only on markers flanking a hypothesized position of the trait locus. Calculating these three-point likelihoods is often feasible even on large complex pedigrees. Likelihood computations in CHROM-WALK are performed via VITESSE. The speed and accuracy of CHROM-WALK are examined through a heterogeneity LOD (HLOD) score analysis of data simulated on the Aipotu population from Genetic Analysis Workshop (GAW) 14. CHROM-WALK has been developed to make the multilocus linkage analysis of data on large general pedigrees computationally feasible.
The CHROM-WALK computer program uses VITESSE to perform the likelihood calculations needed to compute three-point likelihoods across a chromosome on general pedigrees. Multipoint likelihoods are calculated from data on the markers flanking a hypothesized position of the trait locus. Results are reported as either homogeneity LOD scores or HLOD scores at pre-specified positions of the trait locus. Functionally, CHROM-WALK is similar to GENEHUNTER. Only a single locus file and a single pedigree file in linkage format are required as input. Command line arguments are used to specify the distance (in cM) between hypothesized positions, the distance beyond the linkage map (if any) over which to compute likelihoods, and whether homogeneity LOD scores or HLOD scores are to be reported.
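The bookkeeping behind the flanking marker procedure can be illustrated with a minimal sketch. It is not CHROM-WALK's code: the function names are hypothetical, the marker map is assumed to be a sorted list of positions in cM, and using the two end markers for positions beyond the map is an assumption made here for illustration.

```python
from bisect import bisect_right

def hypothesized_positions(marker_map_cm, step_cm, beyond_cm=0.0):
    """Grid of trait-locus positions (cM), spaced step_cm apart, starting
    beyond_cm before the first marker and ending no later than beyond_cm
    after the last marker."""
    start = marker_map_cm[0] - beyond_cm
    end = marker_map_cm[-1] + beyond_cm
    n_steps = int((end - start) / step_cm + 1e-9)
    return [start + i * step_cm for i in range(n_steps + 1)]

def flanking_markers(marker_map_cm, position_cm):
    """Indices of the two markers used for the three-point calculation
    at position_cm (trait locus plus one marker on each side)."""
    right = bisect_right(marker_map_cm, position_cm)
    # Assumption: outside the map, fall back to the two nearest end markers.
    right = min(max(right, 1), len(marker_map_cm) - 1)
    return right - 1, right

# Hypothetical four-marker map with 7.5 cM spacing, scanned every 1 cM.
marker_map = [0.0, 7.5, 15.0, 22.5]
for pos in hypothesized_positions(marker_map, step_cm=1.0)[:3]:
    left, right = flanking_markers(marker_map, pos)
    # A likelihood engine such as VITESSE would be called here with the
    # trait locus and markers `left` and `right` only.
    print(pos, left, right)
```

Because each position involves only three loci, the cost of a scan grows with the number of grid positions rather than with the number of markers considered jointly.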
In this study, data generated on the GAW14 Aipotu population were chosen for analysis. These data consisted of 100 nuclear families, ranging in size from 2 to 10 siblings. For computational expedience, the other three GAW14 populations were not analyzed. Marker and trait data were observed on each individual. The trait is dichotomous: an individual is either affected or unaffected for the disease. Microsatellite marker data on chromosomes 1, 3, and 4, containing 41, 42, and 44 linked markers, respectively, were selected for analysis. Inter-marker distances are, on average, 7.5 cM. Chromosome 1 contained a disease locus between the 23rd and 24th marker loci. Chromosome 3 contained a disease locus between the 41st and 42nd marker loci. Chromosome 4 is unlinked to disease-causing loci. There were 100 replicates of the data.
Linkage detection and mapping
The accuracy of CHROM-WALK for detecting and localizing trait loci was examined through the analysis of simulated family data. A dominant trait model with incomplete penetrance (0.05, 0.95, 0.95) and a disease allele frequency of 0.01 was assumed. Three marker sets formed from the original GAW14 simulated data were considered: the four linked markers closest to the disease locus (Mset4), the 16 linked markers closest to the disease locus (Mset16), and all markers on a chromosome (MsetAll). Because chromosome 4 is unlinked to a disease locus, its markers were selected from the beginning of the marker map. HLOD scores were calculated every 1 cM. Three-point HLOD scores were compared to multipoint HLOD scores calculated via GENEHUNTER. Multipoint scores on each data set were calculated only on markers available in that data set. Hence, the impact of increasing the number of markers incorporated into the multipoint calculation can be examined.
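For readers unfamiliar with the HLOD statistic, the sketch below shows how it is commonly obtained from per-family LOD scores at one hypothesized position under the standard admixture model, HLOD = max over alpha in [0, 1] of the sum over families of log10(alpha * 10^LOD_i + 1 - alpha), where alpha is the proportion of linked families. The grid search over alpha and the function names are illustrative assumptions; the source does not describe how CHROM-WALK or GENEHUNTER carries out the maximization.

```python
import math

def admixture_lod(alpha, family_lods):
    """Sum over families of log10(alpha * 10**LOD_i + 1 - alpha)."""
    total = 0.0
    for z in family_lods:
        mix = alpha * 10.0 ** z + (1.0 - alpha)
        if mix <= 0.0:
            # Happens only when alpha = 1 and a family LOD underflows to zero.
            return float("-inf")
        total += math.log10(mix)
    return total

def hlod(family_lods, grid_size=1000):
    """Maximize the admixture LOD over a grid of alpha values in [0, 1].
    Returns (HLOD, alpha at the maximum). alpha = 0 always gives 0, so the
    HLOD is never negative."""
    best_alpha, best = 0.0, 0.0
    for k in range(1, grid_size + 1):
        alpha = k / grid_size
        value = admixture_lod(alpha, family_lods)
        if value > best:
            best_alpha, best = alpha, value
    return best, best_alpha

# Hypothetical per-family LOD scores: two families support linkage, two do not.
score, alpha_hat = hlod([2.0, 2.0, -3.0, -3.0])
print(round(score, 2), alpha_hat)
```

With alpha fixed at 1 the statistic reduces to the homogeneity LOD score, which is the sum of the family LOD scores.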
Missing data and genotyping errors
Multipoint calculations on pedigrees are affected by missing data and genotyping error. To explore the utility of CHROM-WALK given imperfect data, missing data and Mendelian-consistent genotyping errors were introduced. The marker phenotype at a locus for an individual was randomly removed with probability 0.01. Mendelian-consistent genotyping errors were created, with probability 0.005, by randomly permuting with equal probability the transmitted allele from one of the parents. Note that this error model is simplistic since it cannot produce genotyping errors in the parents and does not make distinctions between types of genotyping errors, which are all equally likely in the present study. The assumed probability of Mendelian-consistent errors was consistent with an overall (pedigree consistent and inconsistent) genotyping error rate of 1%. The levels of missing data and genotyping error were realistic compared to real data.
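One reading of this perturbation scheme is sketched below. The data layout (a child genotype stored as a pair of paternal and maternal alleles, parents as unordered allele pairs) and all function names are assumptions made for illustration; the rates 0.01 and 0.005 are the values stated above. An error is modelled by replacing the allele a child received from one randomly chosen parent with that parent's other allele, which keeps the trio Mendelian consistent.

```python
import random

MISSING_RATE = 0.01  # probability a marker phenotype is removed (from the text)
ERROR_RATE = 0.005   # probability of a Mendelian-consistent error (from the text)

def other_allele(parent_genotype, transmitted):
    """Return the allele the parent did not transmit."""
    a, b = parent_genotype
    return b if transmitted == a else a

def perturb_child(child, father, mother, rng, error_rate=ERROR_RATE):
    """child = (paternal_allele, maternal_allele). With probability
    error_rate, swap the allele from one parent (chosen with equal
    probability) for that parent's other allele."""
    pat, mat = child
    if rng.random() < error_rate:
        if rng.random() < 0.5:
            pat = other_allele(father, pat)
        else:
            mat = other_allele(mother, mat)
    return pat, mat

def mask_missing(genotype, rng, missing_rate=MISSING_RATE):
    """Return None (missing) with probability missing_rate."""
    return None if rng.random() < missing_rate else genotype

rng = random.Random(2005)  # fixed seed so the sketch is reproducible
child = perturb_child((1, 3), father=(1, 2), mother=(3, 4), rng=rng)
print(mask_missing(child, rng))
```

Under this reading, an error drawn for a homozygous parent leaves the child's genotype unchanged, so the realized error rate is somewhat below the nominal 0.005.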
Linkage detection and mapping
[Table: Comparison of CHROM-WALK and GENEHUNTER HLOD scores for different marker sets]
Figure 1b plots the difference in the chromosomal location of the peaks on the vertical axis against the peak GENEHUNTER HLOD on the horizontal axis. Again, there is a clustering of points around a horizontal line intersecting 0 on the vertical axis, indicating close agreement between the localization of the trait using flanking markers and all available markers. It is also reassuring that the largest differences in locations occur for small peak GENEHUNTER scores. When the peak GENEHUNTER score is small, there is little information in the data for detecting linkage. Using HLOD scores calculated on flanking markers for localization leads to the same conclusions as analyzing data on all available markers jointly.
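The quantity plotted in Figure 1b can be computed per replicate as in the sketch below, which assumes both methods report HLOD scores on the same grid of positions. The function names and the sign convention (flanking-marker peak location minus all-marker peak location) are assumptions; the source does not state the direction of the difference.

```python
def peak(positions_cm, hlods):
    """Position and height of the largest HLOD on the grid
    (the first maximum is used if there are ties)."""
    i = max(range(len(hlods)), key=hlods.__getitem__)
    return positions_cm[i], hlods[i]

def peak_location_difference(positions_cm, flanking_hlods, full_hlods):
    """Return (difference in peak location in cM, peak all-marker HLOD),
    i.e. one point of a plot like Figure 1b."""
    pos_flank, _ = peak(positions_cm, flanking_hlods)
    pos_full, peak_full = peak(positions_cm, full_hlods)
    return pos_flank - pos_full, peak_full

# Hypothetical scores for one replicate on a 1 cM grid.
positions = [0.0, 1.0, 2.0, 3.0]
print(peak_location_difference(positions,
                               [0.1, 0.5, 1.2, 0.3],
                               [0.2, 1.5, 1.0, 0.1]))
```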
Missing data and genotyping errors
Figure 2 shows that the mean HLOD scores calculated using CHROM-WALK and GENEHUNTER are quite robust to imperfect data. The mean CHROM-WALK HLODs on the imperfect data (the dashes in Figure 2a) are slightly lower, but there is still clear evidence of linkage. The mean GENEHUNTER HLODs calculated on the imperfect data (the dashes in Figure 2b) are almost identical to the GENEHUNTER HLODs with perfect data (the circles in Figure 2b). Furthermore, it is reassuring that the CHROM-WALK HLODs are similar to the GENEHUNTER HLODs across marker subsets, even though GENEHUNTER required run times an order of magnitude longer for the analysis of most replicates.
In this paper, the calculation of multipoint likelihoods using a new computer program, CHROM-WALK, is assessed through the calculation of HLOD scores on simulated data. By only considering data observed on flanking markers, the computational complexity of multipoint calculations is greatly reduced. For data simulated on nuclear families, there is little loss in accuracy using the proposed approximation procedure. Furthermore, CHROM-WALK produced multipoint results, on average, an order of magnitude faster than GENEHUNTER. Further exploration is warranted for extended families, differing amounts and patterns of missing data, differing amounts of genotyping error, and changes in marker informativeness.
GAW14: Genetic Analysis Workshop 14
HLOD: Heterogeneity LOD score
This work was funded in part by R01 MH052841 (VJV) and NSF ITR/ACI0218491 (AMS).
- Elston R, Stewart J: A general model for the analysis of pedigree data. Hum Hered. 1971, 21: 523-542.
- Lander E, Green P: Construction of multilocus genetic linkage maps in humans. Proc Natl Acad Sci U S A. 1987, 84: 2363-2367. 10.1073/pnas.84.8.2363.
- O'Connell J, Weeks D: The VITESSE algorithm for rapid exact multilocus linkage analysis via genotype set-recoding and fuzzy inheritance. Nat Genet. 1995, 11: 402-408. 10.1038/ng1295-402.
- Kruglyak L, Daly M, Reeve-Daly M, Lander E: Parametric and nonparametric linkage analysis: a unified multipoint approach. Am J Hum Genet. 1996, 58: 1347-1363.
- Douglas J, Skol A, Boehnke M: Probability of detection of genotyping errors and mutations as inheritance inconsistencies in nuclear-family data. Am J Hum Genet. 2002, 70: 487-495. 10.1086/338919.
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.