Volume 6 Supplement 1
Genetic Analysis Workshop 14: Microsatellite and single-nucleotide polymorphism
Calculation of multipoint likelihoods using flanking marker data: a simulation study
 Andrew W George^{1},
 LaVonne A Mangin^{2},
 Christopher W Bartlett^{1, 3},
 Mark W Logue^{1},
 Alberto M Segre^{1, 2} and
 Veronica J Vieland^{1, 4}
DOI: 10.1186/1471-2156-6-S1-S44
© George et al; licensee BioMed Central Ltd 2005
Published: 30 December 2005
Abstract
The calculation of multipoint likelihoods is computationally challenging, with the exact calculation of multipoint probabilities only possible on small pedigrees with many markers or large pedigrees with few markers. This paper explores the utility of calculating multipoint likelihoods using data on markers flanking a hypothesized position of the trait locus. The calculation of such likelihoods is often feasible, even on large pedigrees with missing data and complex structures. Performance characteristics of the flanking marker procedure are assessed through the calculation of multipoint heterogeneity LOD scores on data simulated for Genetic Analysis Workshop 14 (GAW14). Analysis is restricted to data on the Aipotu population on chromosomes 1, 3, and 4, where chromosomes 1 and 3 are known to contain disease loci. The flanking marker procedure performs well, even when missing data and genotyping errors are introduced.
Background
The calculation of multipoint likelihoods on general pedigrees is computationally challenging. Factors influencing the complexity of multipoint calculations include family size, pedigree structure, marker number, and missing data. Efficient algorithms have been developed for handling large pedigrees with few markers [1] or small pedigrees with many markers [2], but calculating multipoint probabilities on large pedigrees with many markers is infeasible.
In this paper, the performance characteristics of CHROMWALK, a computer program for calculating multipoint likelihoods on general pedigrees with many linked markers, are explored. Multipoint likelihoods are calculated using data observed only on the markers flanking a hypothesized position of the trait locus. Calculating these three-point likelihoods is often feasible even on large complex pedigrees. Likelihood computations in CHROMWALK are performed via VITESSE [3]. The speed and accuracy of CHROMWALK are examined through a heterogeneity LOD (HLOD) score analysis of data simulated on the Aipotu population from Genetic Analysis Workshop (GAW) 14. CHROMWALK has been developed to make the multilocus linkage analysis of data on large general pedigrees computationally feasible.
Methods
CHROMWALK
The CHROMWALK computer program uses VITESSE to perform the likelihood calculations needed to obtain three-point likelihoods across a chromosome on general pedigrees. Multipoint likelihoods are calculated using data on the markers flanking a hypothesized position of the trait locus. Results are reported as either homogeneity LOD scores or HLOD scores at prespecified positions of the trait locus. Functionally, CHROMWALK is similar to GENEHUNTER [4]. Only a single locus file and a single pedigree file, both in LINKAGE format, are required as input. Command line arguments are used to specify the distance (in cM) between hypothesized positions, the distance beyond the linkage map (if any) over which to compute likelihoods, and whether homogeneity LOD scores or HLOD scores are to be reported.
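The flanking-marker lookup described above can be sketched in Python. This is a minimal illustration and not CHROMWALK's own code: the function names `grid_positions` and `flanking_markers`, the parameters `step_cm` and `beyond_cm`, and the handling of positions outside the map (only one flanking marker is returned) are assumptions made for the example.

```python
from bisect import bisect_right


def grid_positions(marker_cm, step_cm=1.0, beyond_cm=0.0):
    """Hypothesized trait positions (cM), evenly spaced by step_cm.

    The grid runs from beyond_cm before the first marker to at most
    beyond_cm past the last marker. marker_cm must be sorted.
    """
    start = marker_cm[0] - beyond_cm
    end = marker_cm[-1] + beyond_cm
    n_steps = int((end - start) / step_cm + 1e-9)  # guard against float round-off
    return [start + i * step_cm for i in range(n_steps + 1)]


def flanking_markers(marker_cm, pos_cm):
    """Indices (left, right) of the markers flanking pos_cm.

    A position that coincides with a marker is treated as lying just to
    its right. Outside the map one side is None (assumed convention).
    """
    right = bisect_right(marker_cm, pos_cm)
    left = right - 1
    return (left if left >= 0 else None,
            right if right < len(marker_cm) else None)


if __name__ == "__main__":
    markers = [0.0, 7.5, 15.0]  # hypothetical three-marker map
    for pos in grid_positions(markers, step_cm=5.0):
        print(pos, flanking_markers(markers, pos))
```

Each grid position would then be passed, together with the genotypes at its two flanking markers only, to the likelihood engine, so the cost per position does not grow with the total number of markers on the map.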
Simulated data
In this study, data generated on the GAW14 Aipotu population were chosen for analysis. These data consisted of 100 nuclear families, ranging in size from 2 to 10 siblings. For computational expedience, the other three GAW14 populations were not analyzed. Marker and trait data were observed on each individual. The trait is dichotomous: an individual is either affected or unaffected for the disease. Microsatellite marker data on chromosomes 1, 3, and 4, containing 41, 42, and 44 linked markers, respectively, were selected for analysis. Intermarker distances are, on average, 7.5 cM. Chromosome 1 contained a disease locus between the 23^{rd} and 24^{th} marker loci. Chromosome 3 contained a disease locus between the 41^{st} and 42^{nd} marker loci. Chromosome 4 is unlinked to disease-causing loci. There are 100 replicates of the data.
Linkage detection and mapping
The accuracy of CHROMWALK for detecting and localizing trait loci was examined through the analysis of simulated family data. A dominant trait model with incomplete penetrance (0.05, 0.95, 0.95) and a disease allele frequency of 0.01 was assumed. Three marker sets formed from the original GAW14 simulated data were considered: the four linked markers closest to the disease locus (Mset4), the 16 linked markers closest to the disease locus (Mset16), and all markers on a chromosome (MsetAll). Because chromosome 4 is unlinked to a disease locus, its markers were selected from the beginning of the marker map. HLOD scores were calculated every 1 cM. Three-point HLOD scores were compared to multipoint HLOD scores calculated via GENEHUNTER. Multipoint scores on each data set were calculated only on the markers available in that data set. Hence, the impact of increasing the number of markers incorporated into the multipoint calculation can be examined.
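For readers unfamiliar with the HLOD statistic, the following Python sketch shows how it is commonly computed from per-family LOD scores at a single map position under the admixture model, HLOD = max over α of Σ_i log10(α·10^{LOD_i} + 1 − α), where α is the proportion of linked families. The text above does not give the formula used by CHROMWALK or GENEHUNTER, so this standard form, the function name `hlod`, and the grid search over α are assumptions for illustration only.

```python
import math


def hlod(family_lods, alpha_steps=100):
    """Return (HLOD, alpha_hat) at one map position.

    family_lods: finite per-family LOD scores (log10 likelihood ratios)
    evaluated at the same hypothesized trait position.
    alpha is searched on an even grid over [0, 1]; alpha = 0 gives HLOD = 0,
    so the returned HLOD is never negative.
    """
    best_hlod, best_alpha = 0.0, 0.0
    for k in range(alpha_steps + 1):
        alpha = k / alpha_steps
        total = sum(
            math.log10(alpha * 10.0 ** lod + 1.0 - alpha)
            for lod in family_lods
        )
        if total > best_hlod:
            best_hlod, best_alpha = total, alpha
    return best_hlod, best_alpha


if __name__ == "__main__":
    # Hypothetical LOD scores for three families at one position.
    print(hlod([2.0, -3.0, 0.4]))
```

A grid search is used here only for transparency; a one-dimensional numerical optimizer over α would serve equally well.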
Missing data and genotyping errors
Multipoint calculations on pedigrees are affected by missing data and genotyping error. To explore the utility of CHROMWALK given imperfect data, missing data and Mendelian-consistent genotyping errors were introduced. The marker phenotype at a locus for an individual was randomly removed with probability 0.01. Mendelian-consistent genotyping errors were created, with probability 0.005, by randomly permuting with equal probability the transmitted allele from one of the parents. Note that this error model is simplistic, since it cannot produce genotyping errors in the parents and does not distinguish between types of genotyping error, all of which are equally likely in the present study. The assumed probability of Mendelian-consistent errors was consistent with an overall (pedigree-consistent and pedigree-inconsistent) genotyping error rate of 1% [5]. The levels of missing data and genotyping error were realistic compared to real data.
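One possible reading of this perturbation scheme is sketched below in Python. It is an illustration under stated assumptions, not the procedure actually used: the function name `perturb_child`, the representation of a child's genotype as an ordered (paternal, maternal) pair, the choice to apply the missing-data step before the error step, and the interpretation of "permuting the transmitted allele" as swapping it for the chosen parent's other allele are all hypothetical.

```python
import random


def perturb_child(child, father, mother, rng, p_missing=0.01, p_error=0.005):
    """Return a perturbed child genotype at one locus, or None if missing.

    child:  (paternal_allele, maternal_allele)
    father, mother: (allele_1, allele_2) for each parent
    An error replaces the allele transmitted by one randomly chosen parent
    with that parent's other allele, so the result stays Mendelian consistent.
    If the chosen parent is homozygous, the genotype is unchanged.
    """
    if rng.random() < p_missing:
        return None  # marker phenotype removed
    pat, mat = child
    if rng.random() < p_error:
        if rng.random() < 0.5:  # pick father or mother with equal probability
            pat = father[1] if pat == father[0] else father[0]
        else:
            mat = mother[1] if mat == mother[0] else mother[0]
    return (pat, mat)


if __name__ == "__main__":
    rng = random.Random(2005)  # fixed seed so the run is repeatable
    father, mother, child = (1, 2), (3, 4), (1, 3)
    results = [perturb_child(child, father, mother, rng) for _ in range(10000)]
    print("missing:", sum(r is None for r in results))
    print("changed:", sum(r is not None and r != child for r in results))
```

Because the altered allele is always drawn from the same parent, no error introduced this way can be detected as a Mendelian inconsistency, which is the property the simulation relies on.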
Results
Linkage detection and mapping
Comparison of CHROMWALK and GENEHUNTER HLOD scores for different marker sets (HLOD_{FL}: CHROMWALK flanking-marker score; HLOD_{GH}: GENEHUNTER multipoint score)

Chr  Pos     Mset4                     Mset16                    MsetAll
             HLOD_{FL}    HLOD_{GH}    HLOD_{FL}    HLOD_{GH}    HLOD_{FL}    HLOD_{GH}
1    175 cM  1.96 (0.15)  2.23 (0.15)  1.95 (0.15)  2.30 (0.16)  1.94 (0.15)  2.30 (0.16)
3    312 cM  1.57 (0.14)  1.57 (0.14)  1.57 (0.14)  1.57 (0.14)  1.57 (0.14)  1.57 (0.14)
4    20 cM   0.06 (0.02)  0.06 (0.02)  0.06 (0.02)  0.06 (0.02)  0.06 (0.02)  0.06 (0.02)

Figure 1b plots the difference in the chromosomal location of the peaks on the vertical axis against the peak GENEHUNTER HLOD on the horizontal axis. Again, there is a clustering of points around a horizontal line intersecting 0 on the vertical axis, indicating close agreement between the localization of the trait using flanking markers and using all available markers. It is also reassuring that the largest differences in location occur for small peak GENEHUNTER scores. When the peak GENEHUNTER score is small, there is little information in the data for detecting linkage. Using HLOD scores calculated on flanking markers for localization does not lead to conclusions different from those obtained by analyzing data on all available markers jointly.
Missing data and genotyping errors
Figure 2 shows that the mean HLOD scores calculated using CHROMWALK and GENEHUNTER are quite robust to imperfect data. The mean CHROMWALK HLODs (the dashes in Figure 2a) are slightly lower, but there is still clear evidence of linkage. The mean GENEHUNTER HLODs calculated on the imperfect data (the dashes in Figure 2b) are almost identical to the GENEHUNTER HLODs with perfect data (the circles in Figure 2b). Furthermore, it is reassuring that the CHROMWALK HLODs are similar to the GENEHUNTER HLODs across marker subsets, even though GENEHUNTER required run times an order of magnitude longer for the analysis of most replicates.
Conclusion
In this paper, the calculation of multipoint likelihoods using a new computer program, CHROMWALK, is assessed through the calculation of HLOD scores on simulated data. By considering only data observed on flanking markers, the computational complexity of multipoint calculations is greatly reduced. For data simulated on nuclear families, there is little loss in accuracy using the proposed approximation procedure. Furthermore, CHROMWALK produced multipoint results, on average, an order of magnitude faster than GENEHUNTER. Further exploration is warranted for extended families, differing amounts and patterns of missing data, differing amounts of genotyping error, and changes in marker informativeness.
Abbreviations
GAW14: Genetic Analysis Workshop 14
HLOD: Heterogeneity LOD score
Declarations
Acknowledgements
This work was funded in part by R01 MH052841 (VJV) and NSF ITR/ACI0218491 (AMS).
References
1. Elston R, Stewart J: A general model for the analysis of pedigree data. Hum Hered. 1971, 21: 523-542.
2. Lander E, Green P: Construction of multilocus genetic linkage maps in humans. Proc Natl Acad Sci U S A. 1987, 84: 2363-2367. 10.1073/pnas.84.8.2363.
3. O'Connell J, Weeks D: The VITESSE algorithm for rapid exact multilocus linkage analysis via genotype set-recoding and fuzzy inheritance. Nat Genet. 1995, 11: 402-408. 10.1038/ng1295-402.
4. Kruglyak L, Daly M, Reeve-Daly M, Lander E: Parametric and nonparametric linkage analysis: a unified multipoint approach. Am J Hum Genet. 1996, 58: 1347-1363.
5. Douglas J, Skol A, Boehnke M: Probability of detection of genotyping errors and mutations as inheritance inconsistencies in nuclear-family data. Am J Hum Genet. 2002, 70: 487-495. 10.1086/338919.
Copyright
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.