 Proceedings
 Open Access
Interval estimation of disease loci: development and applications of new linkage methods
BMC Genetics volume 6, Article number: S21 (2005)
Abstract
Three variants of the confidence set inference (CSI) procedure were proposed and applied to both the simulated and the Collaborative Study on the Genetics of Alcoholism (COGA) data. For each of the two applications, we first performed a preliminary genome scan based on the microsatellite markers, using the GENEHUNTER+ software, to identify regions that potentially harbor disease loci. For each such region, we estimated the sibling identity-by-descent sharing probability distribution at the putative disease locus. Based on these estimated probabilities, the CSI procedures were employed to further localize the disease loci using the single-nucleotide polymorphism markers, leading to confidence intervals/regions for their locations. For our analysis with the simulated data, we had knowledge of the simulating models at the time we performed the analysis.
Background
A frequently used strategy in linkage analysis is to first screen the entire genome using microsatellite (MS) markers, and then to follow up on preliminary linkage regions using densely spaced markers, often single-nucleotide polymorphisms (SNPs). Although many statistical methods are available for each step of this two-stage approach, only a limited number of them (e.g., Liang et al. [1]) are able to provide confidence estimates of disease gene locations. Furthermore, most of the available methods require multiplicity adjustment, which is a nontrivial matter given the complex dependency of the statistics involved.
The confidence set inference (CSI) procedure [2] can be used to obtain confidence estimates using affected sib-pair (ASP) data, and avoids the multiple testing problem. Unlike the approach of Liang et al. [1], it is not based on the asymptotic distribution of the estimator of the location of the trait locus. Instead, it indirectly deduces a confidence region for the trait locus based upon a set of markers that are inferred to be within a prespecified distance from the trait locus. Note that this is a nondirectional procedure that makes no distinction between loci symmetrically located around a marker. In the present paper, we explore three variants of the CSI procedure to further improve its performance. The first modification is to test every location (not just the markers) in the region of interest. This practice effectively eliminates the "nondirectional" problem of the original CSI method. The second variant is a multipoint extension of the first variant, in that information from all markers is utilized to calculate the IBD sharing statistic at a marker locus. The last is also a multipoint approach, but the identity-by-descent (IBD) sharing statistic is calculated at each hypothesized disease locus rather than at its nearest marker locus.
Methods
Confidence set inference
This is a two-point approach that tests the following hypotheses for each marker m:
H_{0m} : θ_{m} ≤ θ_{0} vs. H_{am}: θ_{m} > θ_{0},
where θ_{m} denotes the true, but unknown, recombination fraction between the disease locus and marker m, and θ_{0} is a predetermined recombination fraction. Note that the above null and alternative hypotheses are the reverse of those in traditional linkage analysis. It is this formulation that allows us to construct the confidence region for the location of the trait locus [2, 3]. For testing the above hypotheses with ASP data, we use the mean statistic [4] because it has been shown to perform well under a wide variety of disease models [3, 4]. Recently, Elston et al. [5] suggested that, when there is actually no linkage, the average IBD sharing between sib pairs in an ASP-only design may be higher than what would be expected under the traditional null hypothesis of no linkage. Because our null hypothesis assumes tight linkage, the effect of such a phenomenon on our method is unclear, and further investigation is needed.
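As an illustration of the reversed hypotheses, the following is a minimal sketch of a one-sided mean test of H_{0m}: θ_{m} ≤ θ_{0}. It is not the implementation used in the paper. It assumes fully informative markers (so the number of alleles shared IBD by each pair is observed directly), a known mean sharing proportion pi_trait at the disease locus that exceeds 1/2, and a normal approximation with the sample variance; the function names are hypothetical. It relies on the standard sib-pair identity that the excess sharing over 1/2 decays by the factor (1 – 2θ)^2 at recombination fraction θ.

```python
from math import sqrt
from statistics import NormalDist


def expected_mean_sharing(pi_trait, theta):
    """Expected proportion of alleles shared IBD at a marker that lies at
    recombination fraction theta from the trait locus, where pi_trait is the
    expected proportion shared at the trait locus itself."""
    return 0.5 + (1.0 - 2.0 * theta) ** 2 * (pi_trait - 0.5)


def reject_tight_linkage(ibd_counts, pi_trait, theta0, alpha=0.05):
    """One-sided test of H0: theta_m <= theta0 (assumed setup, see lead-in).

    ibd_counts holds 0, 1 or 2 for each affected sib pair at marker m.
    H0 is rejected when the observed mean sharing is significantly BELOW
    the value expected at the boundary theta_m = theta0.
    """
    n = len(ibd_counts)
    props = [c / 2.0 for c in ibd_counts]           # proportion shared per pair
    mean = sum(props) / n
    var = sum((p - mean) ** 2 for p in props) / (n - 1)
    if var == 0.0:
        raise ValueError("sample variance is zero; the test is undefined")
    mu0 = expected_mean_sharing(pi_trait, theta0)   # sharing at the H0 boundary
    z = (mean - mu0) / sqrt(var / n)
    return z < NormalDist().inv_cdf(alpha)          # lower-tail critical value
```

Markers for which the function returns False would be retained in the set L described next.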
Let L be the set of markers for which the corresponding null hypotheses are not rejected at level α. Then the probability that L includes at least one marker located within θ_{0} of the disease locus is at least (1 – α). From this set of markers, we can deduce the following confidence region for the disease locus:
∪_{m ∈ L} [τ_{m} – d(θ_{0}), τ_{m} + d(θ_{0})],
where τ_{m} is the map position of marker m, while d(θ_{0}) is the genetic distance that corresponds to recombination fraction θ_{0}. In practice, θ_{0} is usually chosen to correspond to half of the maximum distance between any two adjacent markers in the region to be investigated.
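The construction of the region from the retained markers can be sketched as follows, taking the region to be the union of intervals of half-width d(θ_{0}) centered on the retained markers. The choice of the Haldane map function for d(·) is an assumption made here for illustration (the text does not name a map function), and the function names are hypothetical.

```python
from math import log


def haldane_distance_cm(theta):
    """Haldane map distance in cM for a recombination fraction theta < 0.5
    (assumed map function; another one, e.g. Kosambi, could be substituted)."""
    return -50.0 * log(1.0 - 2.0 * theta)


def confidence_region(retained_positions_cm, theta0):
    """Union of [tau_m - d(theta0), tau_m + d(theta0)] over retained markers,
    returned as a sorted list of disjoint (start, end) intervals in cM."""
    half_width = haldane_distance_cm(theta0)
    intervals = sorted((p - half_width, p + half_width)
                       for p in retained_positions_cm)
    merged = []
    for lo, hi in intervals:
        if merged and lo <= merged[-1][1]:
            # Overlaps the previous interval: extend it.
            merged[-1] = (merged[-1][0], max(merged[-1][1], hi))
        else:
            merged.append((lo, hi))
    return merged
```

Because neighboring retained markers produce overlapping intervals, the result is typically one segment, or a few segments when the retained markers fall into separate clusters.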
Variant I (CSIv1): testing at an arbitrary locus
For an arbitrary location τ in the preliminary linkage region, we test the following hypotheses:
H_{0m}: τ = τ* vs. H_{am}: τ ≠ τ*
where τ* is the true location of the disease locus in the region, and m is the marker closest to τ. The following is our strategy for carrying out these tests. First, we consider a finite number of loci (say, at 1 cM density) in the region. For each of these loci, we test whether the IBD sharing at the nearest marker locus m is within the margin of error from what is expected if the disease locus is indeed at τ (the null hypothesis). We then iteratively refine the discretization strategy so that the locations for which the null hypotheses are not rejected constitute a (union of) "continuous" chromosomal segment(s).
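The grid search can be sketched as below. This is only an illustration under stated assumptions, not the authors' implementation: the observed mean sharing and its standard error at each marker are taken as given inputs, the mean sharing proportion pi_trait at the disease locus is treated as known, distances are converted to recombination fractions with the Haldane map function, and "within the margin of error" is interpreted as a two-sided normal test. All function and parameter names are hypothetical.

```python
from math import exp
from statistics import NormalDist


def haldane_theta(distance_cm):
    """Recombination fraction for a map distance in cM (Haldane, assumed)."""
    return 0.5 * (1.0 - exp(-abs(distance_cm) / 50.0))


def csiv1_scan(marker_pos_cm, marker_mean, marker_se, pi_trait, grid_cm,
               alpha=0.05):
    """Return the grid locations tau that are NOT rejected as the disease locus.

    For each tau, the nearest marker m is found and its observed mean IBD
    sharing is compared with the sharing expected if the disease locus
    were at tau.
    """
    crit = NormalDist().inv_cdf(1.0 - alpha / 2.0)
    kept = []
    for tau in grid_cm:
        # Index of the marker closest to the hypothesized location.
        m = min(range(len(marker_pos_cm)),
                key=lambda j: abs(marker_pos_cm[j] - tau))
        theta = haldane_theta(marker_pos_cm[m] - tau)
        expected = 0.5 + (1.0 - 2.0 * theta) ** 2 * (pi_trait - 0.5)
        if abs(marker_mean[m] - expected) <= crit * marker_se[m]:
            kept.append(tau)
    return kept
```

A finer grid can then be laid over the boundaries of the retained set, which mirrors the iterative refinement described above.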
Variant II (CSIv2): multipoint extension
This is the multipoint extension of CSIv1. The hypotheses and the discretization/search strategies are the same as before. However, when calculating the observed sharing statistic at the nearest marker locus m, information from all markers, not just marker m itself, is used.
Variant III (CSIv3): multipoint extension with sharing statistic
This is another multipoint extension with the same hypotheses as in CSIv1 and CSIv2, but the observed sharing statistic is calculated at the hypothesized locus itself given all the observed marker data.
Estimation of IBD sharing probabilities at putative disease loci
In Papachristou and Lin [2], the IBD sharing probabilities are estimated through the use of population risk characteristics (disease prevalence, relative risks for offspring and sibling), which are frequently available from population epidemiological studies. Alternatively, these probabilities can be estimated directly from the current data after preliminary linkage regions are established. Let τ be the putative disease locus (at which the maximum score occurs) in a linkage region, and z_{k}, k = 0, 1, 2, be the probabilities that an ASP shares k alleles IBD at τ. Then, the likelihood of z_{k} is
L(z_{1}, z_{2}) = ∏_{i=1}^{n} Σ_{k=0}^{2} z_{k} P(M_{i} | pair i shares k alleles IBD at τ),
where n is the number of ASPs in the study and M_{i} denotes the marker data of the i-th pair. Note that the likelihood is parameterized in terms of z_{1} and z_{2} only, since z_{0} = 1 – z_{1} – z_{2} is completely determined by the other two. For more details on the computation of the above likelihood, the reader is referred to Kruglyak et al. [6]. The z_{k} values that maximize the above likelihood (obtained using the EM algorithm) are taken as estimates of the IBD probabilities.
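A minimal sketch of the EM step follows, under the assumption that the likelihood is a three-component mixture in which each pair i contributes Σ_k z_{k} w_{ik}, with w_{ik} the probability of the pair's marker data given that it shares k alleles IBD at τ. The weights w_{ik} are taken as precomputed inputs (for example from a multipoint program); the function name and the stopping rule are hypothetical, and no genetic constraints on (z_{0}, z_{1}, z_{2}) are imposed.

```python
def em_ibd_probabilities(weights, tol=1e-10, max_iter=10000):
    """EM estimate of (z0, z1, z2) for the assumed mixture likelihood.

    weights[i][k] = P(marker data of pair i | pair shares k alleles IBD at tau).
    """
    z = [0.25, 0.5, 0.25]                 # start from the no-linkage values
    n = len(weights)
    for _ in range(max_iter):
        totals = [0.0, 0.0, 0.0]
        for w in weights:
            # E-step: posterior probability of each IBD state for this pair.
            joint = [z[k] * w[k] for k in range(3)]
            s = sum(joint)
            for k in range(3):
                totals[k] += joint[k] / s
        # M-step: average the posterior probabilities over pairs.
        new_z = [t / n for t in totals]
        if max(abs(a - b) for a, b in zip(new_z, z)) < tol:
            return new_z
        z = new_z
    return z
```

With fully informative data, each weight vector has a single nonzero entry and the estimates reduce to the observed proportions of pairs sharing 0, 1, and 2 alleles.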
Data and phenotypes
For all four simulated populations, we extracted all possible families with at least two affected children. For families with three or more affected children, all possible pairs were formed and were treated as if they were independent nuclear families. This method yielded an average of 150, 170, 150, and 180 ASPs for the AI, DA, KA, and NYC populations, respectively. For the application to the Collaborative Study on the Genetics of Alcoholism (COGA) data, ASPs were also extracted from the extended pedigrees, yielding a sample of 551 pairs. The ALDX1 diagnostic criterion was used, and only those who were confirmed to be affected were used in our study.
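The pair-extraction rule can be illustrated with a short sketch. The input layout (a mapping from a family identifier to the list of its affected children) and the function name are assumptions made for the example, not the format of the simulated or COGA data files.

```python
from itertools import combinations


def affected_sib_pairs(families):
    """Form all possible affected sib pairs within each family.

    families maps a family id to the list of its affected children.
    Families with fewer than two affected children contribute nothing;
    a family with s affected children contributes s * (s - 1) / 2 pairs,
    each treated downstream as an independent nuclear family.
    """
    pairs = []
    for fam_id, affected in families.items():
        if len(affected) >= 2:
            pairs.extend((fam_id, a, b) for a, b in combinations(affected, 2))
    return pairs
```

For example, a family with three affected children yields three pairs, which is why the number of ASPs can exceed the number of families.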
Selection of SNPs in the linkage regions
After linkage regions are identified using the method of Kong and Cox (KAC) [7] as implemented in GENEHUNTER+, a variant of GENEHUNTER [6], SNPs surrounding the maximum KAC scores are selected for CSI analyses. For chromosome 1 in the simulated data, we use 20 SNP markers, ten on each side of the maximum KAC score, that cover a region of about 60 cM. Because the KAC scores on chromosomes 5 and 9 are usually maximized at the beginning of the chromosome, we use the first 15 available SNP markers, which are spread over a region of about 55 cM. On the other hand, the KAC scores on chromosome 3 are usually maximized at the end of the chromosome, so we use the last 15 SNP markers, which span a region of 45 cM. For the COGA data, for each chromosome with a maximum KAC score exceeding the threshold, we use all the Illumina SNPs within 30 cM (or less if the end of a chromosome is encountered, as on chromosome 12) on each side of the maximum score for CSI analyses.
Results
Simulated data
First, the KAC scores were calculated using the MS markers throughout the whole genome. Only chromosomes 1, 3, 5, and 9 yielded significant results using the threshold of 3.09 (corresponding to a pointwise significance level of 0.001) for more than one-third of the replicates in any of the populations. Therefore, we decided to focus on those 4 chromosomes for obtaining confidence regions using CSI and its variants based on the 3-cM-density SNPs.
From the clinical ascertainment scheme, DA would be most informative for a locus influencing the behavioral symptoms. It turns out that, in the simulating model, locus D1 (on chromosome 1) plays an important role in the trait relating to these symptoms. Indeed, the KAC results reveal that only population DA has a majority of the replicates (in fact, all 100 of them) showing linkage at the pointwise significance level of 0.001. Among the other three populations, only AI has more than one-third of the replicates (45) showing significant results. Similarly, as expected from the ascertainment schemes, all four populations contain information about the disease gene on chromosome 3, and thus they all have more than one-third of the replicates (61, 82, 39, and 50 for AI, DA, KA, and NYC, respectively) with significant results. For the genes on chromosomes 5 and 9, on the other hand, only AI and KA contain information about their locations. Again, the results are consistent with the simulating model in that KA is more informative about these two loci than AI (77 vs. 48 for D3, and 76 vs. 36 for D4). Surprisingly, more than two-thirds of the replicates from NYC are not informative for these loci.
Table 1 presents the results from the four CSI procedures with a preset 95% coverage probability. For each disease locus, only populations that are informative for linkage for at least one-third of the replicates are investigated. As can be seen from the results, the original CSI method indeed produced wide regions. On the other hand, the two multipoint variants (CSIv2 and CSIv3) have considerably narrower regions, with almost all of them including the true disease locations.
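The correspondence between the score thresholds and the pointwise significance levels can be checked with a short calculation, assuming that the KAC score is referred to a standard normal distribution under no linkage and that the test is one-sided. The function name is hypothetical.

```python
from statistics import NormalDist


def pointwise_threshold(alpha):
    """Upper-tail standard normal critical value for a one-sided level alpha."""
    return NormalDist().inv_cdf(1.0 - alpha)


print(round(pointwise_threshold(0.001), 2))  # 3.09, threshold for the simulated data
print(round(pointwise_threshold(0.01), 2))   # 2.33, cutoff used for the COGA scan
```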
The COGA dataset
Whole-genome screening using KAC based on the MS markers resulted in 6 chromosomal regions (one on each of chromosomes 1, 2, 6, 11, 12, and 15) with maximum KAC scores exceeding the cutoff of 2.33 (a pointwise significance level of 0.01). As with the simulated data, our analysis scheme, after preliminary linkages are established, is to use the SNP data to further narrow down the linkage regions. Specifically, the four CSI procedures, all with 95% coverage probability, were used to construct confidence regions for the disease gene locations. However, chromosomes 1 and 15 are not included in the CSI analysis because there are no SNPs in the preliminary linkage regions (toward the ends) on these two chromosomes. Figure 1 shows the results for the remaining four chromosomes, focusing on the linkage regions. For chromosomes 6, 11, and 12, the results demonstrate the ability of the two multipoint CSI methods to narrow down the linkage regions. In particular, CSIv3 narrows the regions on these three chromosomes to 1.5, 12.3, and 7.4 cM, respectively. However, for chromosome 2, the two multipoint procedures failed to narrow the region further than the two-point procedures did. Inspection of the KAC scores suggests that there are potentially multiple disease loci in the region, which might, in part, explain the CSI results.
Discussion
The purpose of this contribution is twofold. First, we want to demonstrate that, unlike most other linkage methods, the CSI procedures yield confidence regions with prespecified coverage probabilities. This is especially useful following preliminary linkage analysis. Specifically, after linkage is established, dense SNP markers can be genotyped in the linkage regions so that the CSI procedures can then be applied, perhaps as an intermediate mapping method before fine-mapping association studies commence. Second, through the analyses of both the simulated and the COGA data, we show that the CSI procedure [2] can be further refined to provide narrower confidence regions for disease gene locations. We are highly encouraged by the extremely high actual coverage probabilities for the two multipoint CSI procedures, as can be seen from the simulated data. This also gives us confidence in results from real applications.
For the COGA dataset, we are able to place the disease loci on three of the chromosomes within narrow confidence regions using CSIv3 (ranging from 1.5 cM to 12.3 cM in length), which may have implications for studying the genetics of alcoholism. For the simulated data, however, the confidence regions are still quite large (around 30 cM for most of them) with the two multipoint variants. We speculate that this is mainly due to the limited informativeness of the still quite sparse 3-cM-density SNP markers. We believe that, with a much denser SNP map (say, one with 0.25–0.5 cM intermarker separation), further narrowing can be achieved. Moreover, we also plan to explore other methods for estimating relative risks (or IBD probability distributions at disease loci) to examine their effects on the results from the CSI procedures.
Abbreviations
ASP: Affected sib-pair
COGA: Collaborative Study on the Genetics of Alcoholism
CSI: Confidence set inference
EM: Expectation maximization
IBD: Identity-by-descent
KAC: Kong and Cox
MS: Microsatellite
SNP: Single-nucleotide polymorphism
References
1. Liang KY, Chiu YF, Beaty TH: A robust identity by descent procedure using affected sib pairs: multipoint mapping for complex diseases. Hum Hered. 2001, 51: 64–78. 10.1159/000022961.
2. Papachristou C, Lin S: A confidence set inference procedure for gene mapping using markers with incomplete polymorphism. Hum Hered. 2005, 59: 1–13. 10.1159/000084731.
3. Lin S: Construction of a confidence set of markers for the location of a disease gene using affected-sib-pair data. Hum Hered. 2002, 53: 103–112. 10.1159/000057988.
4. Blackwelder WC, Elston RC: A comparison of sib-pair linkage tests for disease susceptibility loci. Genet Epidemiol. 1985, 2: 85–97. 10.1002/gepi.1370020109.
5. Elston RC, Song D, Iyengar SK: Mathematical assumptions versus biological reality: myths in affected sib pair linkage analysis. Am J Hum Genet. 2005, 76: 152–156. 10.1086/426872.
6. Kruglyak L, Daly MJ, Reeve-Daly MP, Lander ES: Parametric and nonparametric linkage analysis: a unified multipoint approach. Am J Hum Genet. 1996, 58: 1347–1363.
7. Kong A, Cox N: Allele-sharing models: lod scores and accurate linkage tests. Am J Hum Genet. 1997, 61: 1179–1188. 10.1086/301592.
Acknowledgements
This work was supported in part by NSF grant DMS-0306800 and NIH grant 1R01HG002657-01A1.
Authors' contributions
Both authors contributed equally to the conceptual development of the project. CP performed the analysis. SL drafted the manuscript. Both authors read and approved the final manuscript.
Cite this article
Papachristou, C., Lin, S. Interval estimation of disease loci: development and applications of new linkage methods. BMC Genet 6, S21 (2005). doi:10.1186/1471-2156-6-S1-S21
Keywords
 Coverage Probability
 Linkage Region
 Confidence Region
 Ascertainment Scheme
 Disease Gene Location