An examination of the genotyping error detection function of SIMWALK2
© Badzioch et al; licensee BioMed Central Ltd 2003
Published: 31 December 2003
This investigation was undertaken to assess the sensitivity and specificity of the genotyping error detection function of the computer program SIMWALK2. We chose to examine chromosome 22, which had 7 microsatellite markers, from a single simulated replicate (330 pedigrees with a pattern of missing genotype data similar to the Framingham families). We created genotype errors at five overall frequencies (0.0, 0.025, 0.050, 0.075, and 0.100) and applied SIMWALK2 to each of these five data sets, assuming in turn that the total error rate specified in the program was at each of these same five levels. In this data set, up to an assumed error rate of 10%, only 50% of the Mendelian-consistent mistypings were found under any level of true errors. Because as many as 70% of the errors detected were false-positives, blanking suspect genotypes (at any error probability) will result in a reduction of statistical power due to the concomitant blanking of correctly typed alleles. This work supports the conclusion that allowing for genotyping errors within likelihood calculations during statistical analysis may be preferable to choosing an arbitrary cut-off.
Optimal performance of genetic linkage and association tests relies on accurate and efficient genotyping, as data errors reduce power to detect and map genetic effects. Even at low rates (<2%), typing errors can have significant effects on results. They inflate apparent recombination and can falsely exclude linkage, especially in multipoint analysis. The most insidious errors, often accounting for over 25% of all mistypings, are those not violating rules of Mendelian inheritance. A recent extension to a widely used computer program, SIMWALK2, has been offered to detect these errors through a Markov-chain Monte Carlo algorithm using full, extended pedigrees and multiple markers. However, before application to a real data set, it is desirable to measure the program's sensitivity and specificity in detecting known genotyping errors. Low specificity or a high rate of false positives will result in undesirable loss of statistical power while low sensitivity will leave many true errors undetected. We report here a study of SIMWALK2's ability to accurately and efficiently detect known genotyping errors under a variety of conditions in a single simulated replicate of 330 pedigrees provided by the Genetic Analysis Workshop 13 organizers.
Table: Number of Families and Average Family Size for each range (0.08 – 0.19, 0.20 – 0.29, 0.30 – 0.39, 0.40 – 0.49, 0.50 – 0.59, 0.60 – 0.67) and in Total (mean 0.359).
A computer program was written to simulate genotyping errors according to the empirical error model presented by Sobel et al. For each genotype, five random numbers ranging from 0 to 1 were generated, and if their values were less than or equal to a pre-set limit then a new, incorrect genotype replaced the original, correct genotype for that marker. Five data sets were created, each containing a different overall generated error rate (GER), using 0x (i.e., simulating no errors), 1x, 2x, 3x, and 4x the pre-set (default) error rates used by SIMWALK2. SIMWALK2's five default error rates are (ε1) 0.0125 for false homozygosity; (ε2) 0.0075 for misreading one heterozygotic allele; (ε3) 0.005 for misreading both heterozygotic alleles; (ε4) 0.01 for misreading a homozygote as a heterozygote; and (ε5) 0.0025 for random mistyping (sample switch, etc.). These sum to an overall error rate of 0.0175 (ε3 + ε4 + ε5) in true homozygotes and to 0.0275 (ε1 + ε2 + ε3 + ε5) in true heterozygotes. If more than one error type was probable for any genotype, the one with the highest pre-set rate limit was simulated. This happened quite rarely and was not thought to bias the results. For each error, the original and simulated genotypes were recorded as well as the error type (1 through 5). In these five data sets 0, 285, 564, 822, and 1108 genotyping errors, respectively, were generated out of a total of 11,586 chromosome 22 genotypes contained in the 330 families. Thus, the overall generated error rates were 0, 0.025, 0.049, 0.071, and 0.096. These data sets will be referred to as ge000, ge025, ge050, ge075, and ge100, respectively.
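The selection step described above can be sketched roughly as follows. This is a minimal illustration, not the authors' program: it assumes that one uniform random number is drawn per error type, that only the error types entering the homozygote or heterozygote sums apply to a given genotype, and that a genotype is a pair of allele labels. The names `choose_error_type`, `overall_rate`, and `multiplier` are hypothetical, and the sketch only picks the error type; it does not generate the replacement genotype.

```python
import random

# SIMWALK2 default rates as listed in the text (error type -> rate).
DEFAULT_RATES = {1: 0.0125, 2: 0.0075, 3: 0.005, 4: 0.01, 5: 0.0025}
# Error types entering each overall rate, per the sums given in the text.
HOMOZYGOTE_TYPES = (3, 4, 5)
HETEROZYGOTE_TYPES = (1, 2, 3, 5)


def overall_rate(types, multiplier=1):
    """Overall error rate for a zygosity class at a given multiple (0x-4x)."""
    return multiplier * sum(DEFAULT_RATES[t] for t in types)


def choose_error_type(genotype, multiplier, rng):
    """Return the error type (1-5) to simulate for one genotype, or None."""
    a, b = genotype
    applicable = HOMOZYGOTE_TYPES if a == b else HETEROZYGOTE_TYPES
    # Assumption: one uniform draw on [0, 1) per error type.
    draws = {t: rng.random() for t in DEFAULT_RATES}
    hits = []
    for t in applicable:
        limit = multiplier * DEFAULT_RATES[t]
        if limit > 0 and draws[t] <= limit:
            hits.append(t)
    if not hits:
        return None
    # If several types fire, keep the one with the highest pre-set rate.
    return max(hits, key=lambda t: DEFAULT_RATES[t])


# Generated error rates reported in the text: errors / total genotypes.
generated = [0, 285, 564, 822, 1108]
ger = [round(n / 11586, 3) for n in generated]
```

Calling `choose_error_type((120, 124), 2, random.Random(1))` would apply the 2x rates to one heterozygous genotype; the allele labels here are made up for illustration.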
PEDCHECK was used to identify all Mendelian errors in the five data sets. PEDCHECK levels 1 and 2 detect Mendelian inconsistencies between parents and children. PEDCHECK levels 3 and 4 detect occurrences of more than four alleles in full sibships and list the relative likelihood of possible corrective measures. Errors found using PEDCHECK were untyped (reset to missing values) and the program was rerun on the updated file until no more errors were found. (See Results for description of PEDCHECK-detected errors.)
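To make the notion of a parent-child Mendelian inconsistency concrete, the following is a minimal sketch of a trio check. It is not PEDCHECK's algorithm, which also handles larger sibships and missing parents; it only tests whether a child can have received one allele from each typed parent. The function names and the data layout (genotypes as allele pairs, `None` for untyped individuals) are assumptions made for illustration.

```python
def trio_is_consistent(father, mother, child):
    """True if the child can inherit one allele from each parent."""
    c1, c2 = child
    return (c1 in father and c2 in mother) or (c2 in father and c1 in mother)


def find_inconsistent_trios(trios, genotypes):
    """List the (father, mother, child) ID triples that violate Mendelian rules.

    trios: iterable of (father_id, mother_id, child_id)
    genotypes: dict mapping ID -> (allele, allele), or None if untyped
    """
    flagged = []
    for father_id, mother_id, child_id in trios:
        gf = genotypes.get(father_id)
        gm = genotypes.get(mother_id)
        gc = genotypes.get(child_id)
        # A trio with any untyped member cannot be tested by this simple check.
        if gf is None or gm is None or gc is None:
            continue
        if not trio_is_consistent(gf, gm, gc):
            flagged.append((father_id, mother_id, child_id))
    return flagged
```

In a workflow like the one described above, flagged genotypes would be reset to missing and the check repeated until nothing further is flagged.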
MEGA2 was used to prepare input files and SIMWALK2 was used to analyze the remaining genotypes in each of the five data sets, which excluded all Mendelian errors. All SIMWALK2 analyses were performed under the empiric error model. For each of the five generated data sets, five analysis error rates (AERs) were tested, 0.00001, 0.025, 0.050, 0.075, and 0.100, and are respectively referred to as ae000, ae025, ae050, ae075, and ae100. SIMWALK2's output consists of a list of identification numbers, the marker name and genotype purported to be erroneous, and the probability of mistyping of the first, second, both, and either allele(s) when any probability is greater than 0.25. The present study uses the probability that either allele was mistyped as the probability that the genotype is in error; the p-values given in the Results section refer to this probability.
Overall, PEDCHECK found 1104 Mendelian errors or 40% (range 38-42%) of the generated errors in the four data sets containing errors. In all but one case of 57 level 4 errors, the true misgenotyped person was listed as a possible error. Thirty-eight of these individuals were indicated as being the most likely misgenotyped person when several closely related individuals were suggested. In two cases, the individuals for whom errors were simulated were not listed in the level 4 output but others in their nuclear family were listed. Except for these latter two cases, the simulated error was untyped (even when PEDCHECK didn't indicate it was the most probable).
We have examined factors that potentially affect detection of Mendelian-consistent genotyping errors and SIMWALK2's ability to detect these errors under varying GERs and program-specified AERs. In this data set, up to an assumed error rate of 10% (specified in the program), only 50% of the Mendelian-consistent mistypings were found under any level of true errors. In fact, the true error rate appeared to have little impact on the proportion of errors detected. Although the chance of identifying a true error increased as the assumed error rate increased, the ratio of true positive to false-positive errors detected decreased. As many as 70% of the errors detected were false-positives. This decrease in specificity was dependent on the overall level of errors in the data and was not associated with marker heterozygosity.
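The two quantities discussed above, the proportion of true errors detected and the proportion of detections that are false positives, can be computed from a list of simulated errors and a list of flagged genotypes. The sketch below is a hedged illustration with a hypothetical data layout: each genotype is identified by a key such as (person, marker), `probs` maps keys to the reported probability that either allele was mistyped, and `cutoff` is whatever threshold the analyst chooses. None of these names come from SIMWALK2 itself.

```python
def flag_genotypes(probs, cutoff):
    """Keys whose reported mistyping probability meets the chosen cutoff."""
    return {key for key, p in probs.items() if p >= cutoff}


def detection_summary(true_errors, flagged):
    """Sensitivity and false-positive fraction among flagged genotypes."""
    true_errors, flagged = set(true_errors), set(flagged)
    true_pos = true_errors & flagged
    false_pos = flagged - true_errors
    sensitivity = len(true_pos) / len(true_errors) if true_errors else float("nan")
    fp_fraction = len(false_pos) / len(flagged) if flagged else float("nan")
    return {
        "true_positives": len(true_pos),
        "false_positives": len(false_pos),
        "sensitivity": sensitivity,
        "false_positive_fraction": fp_fraction,
    }


# Toy example with made-up keys and probabilities (not data from the study).
true_errors = {"a", "b", "c", "d"}
probs = {"a": 0.9, "b": 0.3, "x": 0.6, "y": 0.26}
strict = detection_summary(true_errors, flag_genotypes(probs, 0.5))
lenient = detection_summary(true_errors, flag_genotypes(probs, 0.25))
```

In this toy case, lowering the cutoff raises the sensitivity from 0.25 to 0.5 while half of the flagged genotypes remain false positives, which mirrors the kind of trade-off described in the text.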
Many genotyping errors will necessarily go undetected under current techniques. Under the highest assumed error rate, ae100, at least 50% of the generated errors remain undetected. These errors are consistent with Mendelian inheritance but no further examination of them was performed. Characterizing these undetected errors may provide clues leading to their identification. At present there appears to be no means by which "true-positives" can be differentiated from "false-positives" and the cost of false-positives can be quite severe. To detect true errors the investigator has no choice but to accept that this trade-off will result in loss of power in identifying genetic effects. However, our research has suggested that avenues to lower the 'overhead' costs, such as increasing the proportion of genotyped individuals per family, could be of value. Several potentially important parameters were not examined here. For instance, it is possible that allele frequency may have a significant impact on error detection rate. If a more common allele was misread as a less common one, it may be more likely to be detected as an error than otherwise. Additionally, no attempt was made to re-generate errors multiple times at a constant GER. Nevertheless, the results presented here are likely robust to sampling error because most trends were smooth and consistent over varying conditions. Any sampling variation present would be seen as differences between GER levels. Except for ae000 in Figure 1 and p = 1.0 in Figure 2, the plotted values were generally in proportion to GER value.
We do not attempt here to evaluate the theoretical foundation of SIMWALK2's genotyping error detection procedure but only offer a brief analysis of its function and set its results into a contextual framework. Further work in detecting true genotyping errors will no doubt be done due to its importance in linkage and association studies. Retyping suspect genotypes may help, but factors such as reproducible errors and mutations may limit its utility. Overall, we conclude that progress has been made in detecting Mendelian-consistent errors. However, blanking suspect genotypes (at any error probability) will result in a reduction of statistical power due to the concomitant blanking of correctly typed alleles. Several authors [2, 7] have suggested allowing for genotyping errors within likelihood calculations during linkage analysis and this approach may be preferable to choosing an arbitrary cut-off.
The authors thank the Institute of Systems Biology, Seattle, WA, for providing computer support for this research and especially Kerry Deutsch for her expert assistance. This work was sponsored in part by National Institutes of Health grant PO1 HL30086.
- Douglas JA, Boehnke M, Lange K: A multipoint method for detecting genotyping errors and mutations in sibling-pair linkage data. Am J Hum Genet. 2000, 66: 1287-1297. 10.1086/302861.
- Goring HH, Terwilliger JD: Linkage analysis in the presence of errors. II: Marker-locus genotyping errors modeled with hypercomplex recombination fractions. Am J Hum Genet. 2000, 66: 1107-1118. 10.1086/302798.
- Sobel E, Lange K: Descent graphs in pedigree analysis: applications to haplotyping, location scores, and marker sharing statistics. Am J Hum Genet. 1996, 58: 1323-1337.
- Sobel E, Papp JC, Lange K: Detection and integration of genotyping errors in statistical genetics. Am J Hum Genet. 2002, 70: 496-508. 10.1086/338920.
- O'Connell JR, Weeks DE: PedCheck: A program for identifying genotype incompatibilities in linkage analysis. Am J Hum Genet. 1998, 63: 259-266. 10.1086/301904.
- Mukhopadhyay N, Almasy L, Schroeder M, Mulvihill WP, Weeks DE: Mega2, a data-handling program for facilitating genetic linkage and association analyses. Am J Hum Genet. 1999, 65: A436.
- Abecasis GR, Cherny SS, Cookson WO, Cardon LR: Merlin – rapid analysis of dense genetic maps using sparse gene flow trees. Nat Genet. 2002, 30: 97-101. 10.1038/ng786.
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.