Volume 6 Supplement 1
The effect of linkage disequilibrium on linkage analysis of incomplete pedigrees
© Levinson and Holmans; licensee BioMed Central Ltd 2005
Published: 30 December 2005
Dense SNP maps can be highly informative for linkage studies. But when parental genotypes are missing, multipoint linkage scores can be inflated in regions with substantial marker-marker linkage disequilibrium (LD). Such regions were observed in the Affymetrix SNP genotypes for the Genetic Analysis Workshop 14 (GAW14) Collaborative Study on the Genetics of Alcoholism (COGA) dataset, providing an opportunity to test a novel simulation strategy for studying this problem. First, an inheritance vector (with or without linkage present) is simulated for each replicate, i.e., locations of recombinations and transmission of parental chromosomes are determined for each meiosis. Then, two sets of founder haplotypes are superimposed onto the inheritance vector: one set that is inferred from the actual data and which contains the pattern of LD; and one set created by randomly selecting parental alleles based on the known allele frequencies, with no correlation (LD) between markers. Applying this strategy to a map of 176 SNPs (66 Mb of chromosome 7) for 100 replicates of 116 sibling pairs, significant inflation of multipoint linkage scores was observed in regions of high LD when parental genotypes were set to missing, with no linkage present. Similar inflation was observed in analyses of the COGA data for these affected sib pairs with parental genotypes set to missing, but not after reducing the marker map until r2 between any pair of markers was ≤ 0.05. Additional simulation studies of affected sib pairs assuming uniform LD throughout a marker map demonstrated inflation of significance levels at r2 values greater than 0.05. When genotypes are available only from two affected siblings in many families in a sample, trimming SNP maps to limit r2 to 0–0.05 for all marker pairs will prevent inflation of linkage scores without sacrificing substantial linkage information. 
Simulation studies on the observed pedigree structures and map can also be used to determine the effect of LD on a particular study.
Linkage genome scans using dense maps of single nucleotide polymorphism (SNP) markers have been shown to provide greater information content than 10-cM microsatellite scans [1–5]. However, false positive peaks were observed in a SNP-based linkage study of prostate cancer in regions with marker-marker D' values greater than 0.6 [4]; and in simulations of pairs of markers with no linkage present, inflation of linkage scores was observed as marker-marker D' values increased between 0.4 and 0.8 [6]. The problem has not been systematically studied using the r2 measure of linkage disequilibrium (LD).
The Collaborative Study on the Genetics of Alcoholism (COGA) datasets for Genetic Analysis Workshop 14 (GAW14) provided an opportunity to study this problem, because the Affymetrix SNP data revealed regions with high marker-marker LD. The main goal of the present analyses was to test a novel simulation strategy for studying the effect of LD on linkage scores with and without availability of parental genotypes.
We selected the 116 European-ancestry pedigrees from the 143 families in the GAW14 COGA dataset. We focus here on analyses that used one pair of affected siblings per family plus parents, but additional analyses utilized the full pedigrees (485 affected and 287 unaffected genotyped individuals, including most parents); nuclear families (139 sibships, 473 affected and 127 unaffected genotyped individuals); or one sibship from each pedigree (390 affected and 264 unaffected individuals). In these additional analyses, comparisons of linkage results with different maps were similar to those reported in other GAW14 papers and so are not discussed in detail here; comparisons of information content for different maps, and analyses of the effects of LD on linkage scores, were consistent with those reported here for simulated data.
Genotypes were available for a 10-cM microsatellite map, 4,752 Illumina SNPs and 11,560 Affymetrix SNPs. In the 66 Mb of chromosome 7 containing the largest linkage peak in these 116 pedigrees (by multipoint analysis of microsatellite data), there were 212 Affymetrix SNPs. We excluded 36 of these because of deviation from Hardy-Weinberg equilibrium (P < 0.001), call rate < 0.8, or minor allele frequency < 0.05. We studied the remaining 176 SNPs ("High-LD map") and a subset of 109 SNPs ("Low-LD map") in which there was no pairwise r2 value > 0.05.
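The trimming step can be illustrated with a minimal sketch. It assumes phased 0/1 haplotypes are available (the r2 values in this study were computed with HAPLOVIEW, not with this code), and it uses a simple greedy left-to-right rule to select markers; the rule actually used to choose the 109-SNP Low-LD map is not described, so `trim_map` and its selection order are hypothetical.

```python
import numpy as np

def pairwise_r2(haps):
    """Return the marker-by-marker r2 matrix for phased haplotypes.

    haps: array of shape (n_haplotypes, n_markers) with alleles coded 0/1.
    """
    haps = np.asarray(haps, dtype=float)
    p = haps.mean(axis=0)                                # frequency of allele 1
    # D_ij = P(allele 1 at i and at j) - p_i * p_j
    d = haps.T @ haps / haps.shape[0] - np.outer(p, p)
    var = p * (1.0 - p)
    denom = np.outer(var, var)
    with np.errstate(divide="ignore", invalid="ignore"):
        # monomorphic markers give denom == 0; report r2 = 0 for them
        return np.where(denom > 0, d ** 2 / denom, 0.0)

def trim_map(r2, threshold=0.05):
    """Greedy pass (assumed rule): keep a marker only if its r2 with
    every marker already kept is at or below the threshold."""
    kept = []
    for j in range(r2.shape[0]):
        if all(r2[j, k] <= threshold for k in kept):
            kept.append(j)
    return kept
```

A different selection order (for example, preferring markers with higher heterozygosity) could retain a different, possibly larger, subset while satisfying the same pairwise threshold.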
Linkage analyses were carried out with ALLEGRO [7] (exponential model, Spairs, with families weighted to the power 0.5 of the variance of their expected scores without linkage). The r2 LD statistic was computed with HAPLOVIEW [8], and correlation and regression statistics with SYSTAT 8.0.
To create founder haplotypes for simulation studies, we used ALLEGRO to infer 176-marker haplotypes for all individuals in the full pedigree data. ALLEGRO reports the most likely haplotype, with "0" alleles where no inference can be made. From inferred haplotypes with fewer than five "0" alleles, we selected 464 haplotypes from unrelated individuals for use as parental haplotypes in the simulation study described below. Missing alleles in these haplotypes were imputed based on COGA pedigree SNP allele frequencies. These haplotypes had the same LD pattern as the entire dataset (HAPLOVIEW).
Data were simulated using SIM (unpublished, A. Kirby) and programs written for this study. For each replicate, SIM assigned 2 unique alleles to each founder (e.g., founder 100 was assigned allele "199" for all markers on one chromosome and "200" on the paired chromosome), transmitted them to offspring by selecting locations of recombinations for each meiosis based on genetic distances, and then transmitted parental chromosomes (here, assuming no linkage, although the program can also follow a specified disease transmission model). Then, for each replicate two different datasets were created by replacing each unique allele with an allele from a corresponding founder haplotype (gene-dropping). First, we assigned to each parent 2 haplotypes from among the 464 haplotypes inferred from COGA data as described above (LD condition), and then we assigned parental haplotypes created by random selection of alleles based on the allele frequencies in the COGA data, with no correlation between markers (No-LD).
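The two-stage procedure can be sketched as follows. This is not the SIM program, whose code is unpublished; the function names are hypothetical, alleles are coded 0/1, and the use of Haldane's map function to convert genetic distance into a recombination fraction is an assumption.

```python
import math
import random

def simulate_meiosis(parent_labels, positions_cm, rng):
    """Stage 1: transmit one gamete as a list of unique founder labels.

    parent_labels: (label_a, label_b), one label per parental chromosome,
    e.g. (199, 200). Crossovers between adjacent markers are drawn with
    Haldane's map function (an assumption of this sketch).
    """
    strand = rng.randrange(2)                  # which chromosome starts the gamete
    gamete = [parent_labels[strand]]
    for left, right in zip(positions_cm, positions_cm[1:]):
        theta = 0.5 * (1.0 - math.exp(-2.0 * (right - left) / 100.0))
        if rng.random() < theta:               # recombination in this interval
            strand = 1 - strand
        gamete.append(parent_labels[strand])
    return gamete

def gene_drop(gamete, founder_haplotypes):
    """Stage 2: replace each unique label with the allele carried at that
    marker by the founder haplotype assigned to the label."""
    return [founder_haplotypes[label][m] for m, label in enumerate(gamete)]

def random_haplotype(freqs, rng):
    """No-LD founder haplotype: alleles drawn independently per marker."""
    return [1 if rng.random() < f else 0 for f in freqs]
```

For the LD condition, `founder_haplotypes` would map each label to a haplotype drawn from the pool inferred from the real data; for the No-LD condition, each label would map to a `random_haplotype`. Because both datasets are dropped onto the same gametes, any difference in linkage scores between them reflects the founder haplotypes and not the inheritance pattern.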
In addition, to examine the effects of varying levels of inter-marker LD more systematically, 5,000 replicates were created for 650 pedigrees containing 920 affected sib pairs (ASPs) with 30% of parents genotyped, for 200 SNPs (0.2 cM apart) with no linkage present, with uniform LD at r2 values between marker pairs of 0–0.4 in steps of 0.05.
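One way to generate founder haplotypes with a uniform r2 is a first-order Markov chain along the map, sketched below. The method used for these simulations is not described in the text, so this is only an illustration under stated assumptions: all markers share one allele frequency, the target r2 applies to adjacent markers only (LD between more distant markers decays under this model), and the correlation is positive.

```python
import numpy as np

def simulate_uniform_ld_haplotypes(n_haps, n_markers, freq, r2, seed=0):
    """Simulate 0/1 haplotypes whose adjacent markers have LD equal to r2.

    Each allele depends only on the allele at the previous marker. The
    transition probabilities keep the allele frequency at `freq` and the
    adjacent-marker correlation at sqrt(r2).
    """
    rng = np.random.default_rng(seed)
    rho = np.sqrt(r2)
    haps = np.empty((n_haps, n_markers), dtype=np.int8)
    haps[:, 0] = rng.random(n_haps) < freq
    for m in range(1, n_markers):
        prev = haps[:, m - 1]
        # P(allele 1 | previous allele 1) = freq + rho * (1 - freq)
        # P(allele 1 | previous allele 0) = freq * (1 - rho)
        p_one = np.where(prev == 1, freq + rho * (1.0 - freq), freq * (1.0 - rho))
        haps[:, m] = rng.random(n_haps) < p_one
    return haps
```

Setting `r2=0` reduces the chain to independent draws at each marker, which corresponds to a no-LD baseline.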
Marker-marker LD (r2) in one region (≈ 60 Mb).
Correlations between the difference between observed and "true" Zlr and measures of LD and distance (ASPs without parents). Correlation of Zlr difference (observed minus true) with: (a) LD (pairwise r2, left); (b) distance (pairwise, left); (c) Avg4-LD (average of 4 consecutive r2 values); (d) Avg4-Distance (average of 4 consecutive distances).
Discussion and Conclusions
These analyses support the conclusion that when parental genotypes are missing and cannot be reconstructed from the constellation of genotyped individuals, the presence of marker-marker LD can substantially inflate linkage scores [4, 6]. In the real COGA data, this effect was clearly visible, although inconsistent. However, we only studied one small chromosomal region in the real dataset. In simulated data, where multiple replicates could be studied, the effect was highly significant. Thus, in genome-wide studies with many missing parental genotypes, one would expect that if strong LD was present in many regions, linkage scores would be inflated in some of them.
The data presented in Figure 3 suggest that when a dataset includes incomplete families, and especially ASPs without parents, an r2 threshold of 0.05 is probably desirable. Fortunately, as shown in Figure 1, little linkage information is likely to be lost by using the densest map with all pairwise values of r2 < 0.05. Alternatively, it might be possible to correct for LD statistically, although it may prove difficult to account for patterns of LD that extend beyond the adjacent two markers.
The simulation method described here can also be used to evaluate whether inflation of linkage scores is likely with a marker map and pedigree sample. One would first simulate replicates based on the pedigree structures in the real study as described above, with or without linkage present, assuming that all parental genotypes are available. Gene-dropping would then be carried out, using haplotypes inferred from the real data (and thus containing the observed LD pattern). After setting parental genotypes to missing, one would repeat the linkage analysis of each replicate for the "true" data (the unique alleles from the simulation) and the gene-dropping data that contain the LD pattern, compute the difference between these scores for each replicate, and determine whether the difference is correlated with r2 values (such as the Avg4 measure described above).
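The final comparison step might be sketched as below. The inputs are assumed to be arrays of linkage scores already produced by the linkage program for each replicate and marker. Averaging the score difference over replicates, and aligning each 4-value r2 window with the marker at its centre, are assumptions of this sketch and are not details given in the text.

```python
import numpy as np

def avg4(values):
    """Sliding mean of 4 consecutive values (input length n, output n - 3)."""
    values = np.asarray(values, dtype=float)
    return np.convolve(values, np.ones(4) / 4.0, mode="valid")

def score_inflation_correlation(observed, true, adjacent_r2):
    """Correlate the mean score difference with the Avg4 LD measure.

    observed, true: arrays of shape (n_replicates, n_markers) holding the
        linkage score at each marker for the gene-dropped (LD) data and
        for the unique-allele ("true") data.
    adjacent_r2: array of length n_markers - 1, r2 between neighbours.
    """
    diff = (np.asarray(observed, dtype=float)
            - np.asarray(true, dtype=float)).mean(axis=0)
    ld = avg4(adjacent_r2)
    # Window k covers markers k..k+4; pair it with the central marker k+2.
    centred = diff[2:2 + len(ld)]
    return np.corrcoef(centred, ld)[0, 1]
```

A clearly positive correlation would suggest that scores are inflated where local LD is high, and that trimming the map may be warranted for that pedigree sample.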
ASP: Affected sib pair
COGA: Collaborative Study on the Genetics of Alcoholism
GAW: Genetic Analysis Workshop
SNP: Single nucleotide polymorphism
This work was supported by grants K24 MH64197, R01 MH062276 and R01-MH61675 from the U.S. National Institute of Mental Health. Andrew Kirby wrote the SIM software used in the simulation studies.
- Middleton FA, Pato MT, Gentile KL, Morley CP, Zhao X, Eisener AF, Brown A, Petryshen TL, Kirby AN, Medeiros H, Carvalho C, Macedo A, Dourado A, Coelho I, Valente J, Soares MJ, Ferreira CP, Lei M, Azevedo MH, Kennedy JL, Daly MJ, Sklar P, Pato CN: Genomewide linkage analysis of bipolar disorder by use of a high-density single-nucleotide-polymorphism (SNP) genotyping assay: a comparison with microsatellite marker assays and finding of significant linkage to chromosome 6q22. Am J Hum Genet. 2004, 74: 886-897. 10.1086/420775.
- John S, Shephard N, Liu G, Zeggini E, Cao M, Chen W, Vasavda N, Mills T, Barton A, Hinks A, Eyre S, Jones KW, Ollier W, Silman A, Gibson N, Worthington J, Kennedy GC: Whole-genome scan, in a complex disease, using 11,245 single-nucleotide polymorphisms: comparison with microsatellites. Am J Hum Genet. 2004, 75: 54-64. 10.1086/422195.
- Sawcer SJ, Maranian M, Singlehurst S, Yeo T, Compston A, Daly MJ, De Jager PL, Gabriel S, Hafler DA, Ivinson AJ, Lander ES, Rioux JD, Walsh E, Gregory SG, Schmidt S, Pericak-Vance MA, Barcellos L, Hauser SL, Oksenberg JR, Kenealy SJ, Haines JL: Enhancing linkage analysis of complex disorders: an evaluation of high-density genotyping. Hum Mol Genet. 2004, 13: 1943-1949. 10.1093/hmg/ddh202.
- Schaid DJ, Guenther JC, Christensen GB, Hebbring S, Rosenow C, Hilker CA, McDonnell SK, Cunningham JM, Slager SL, Blute ML, Thibodeau SN: Comparison of microsatellites versus single-nucleotide polymorphisms in a genome linkage screen for prostate cancer-susceptibility loci. Am J Hum Genet. 2004, 75: 948-965. 10.1086/425870.
- Evans DM, Cardon LR: Guidelines for genotyping in genomewide linkage studies: single-nucleotide-polymorphism maps versus microsatellite maps. Am J Hum Genet. 2004, 75: 687-692. 10.1086/424696.
- Huang Q, Shete S, Amos CI: Ignoring linkage disequilibrium among tightly linked markers induces false-positive evidence of linkage for affected sib pair analysis. Am J Hum Genet. 2004, 75: 1106-1112. 10.1086/426000.
- Gudbjartsson DF, Jonasson K, Frigge ML, Kong A: Allegro, a new computer program for multipoint linkage analysis. Nat Genet. 2000, 25: 12-13. 10.1038/75514.
- Barrett JC, Fry B, Maller J, Daly MJ: Haploview: analysis and visualization of LD and haplotype maps. Bioinformatics. 2004 Aug 5,
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.