Investigation of altering single-nucleotide polymorphism density on the power to detect trait loci and frequency of false positive in nonparametric linkage analyses of qualitative traits

Genome-wide linkage analysis using microsatellite markers has been successful in the identification of numerous Mendelian and complex disease loci. The recent availability of high-density single-nucleotide polymorphism (SNP) maps provides a potentially more powerful option. Using the simulated and Collaborative Study on the Genetics of Alcoholism (COGA) datasets from the Genetic Analysis Workshop 14 (GAW14), we examined how altering the density of SNP marker sets affected the overall information content, the power to detect trait loci, and the number of false positive results. For the simulated data we used SNP maps with densities of 0.3 cM, 1 cM, 2 cM, and 3 cM. For the COGA data we combined the marker sets from Illumina and Affymetrix to create a map with an average density of 0.25 cM and then, using a sub-sample of these markers, created maps with densities of 0.3 cM, 0.6 cM, 1 cM, 2 cM, and 3 cM. For each marker set, multipoint linkage analysis using MERLIN was performed for both dominant and recessive traits derived from marker loci. Our results showed that information content increased with increased map density. For the homogeneous, completely penetrant traits we created, there was only a modest difference in ability to detect trait loci. Additionally, as map density increased there was only a slight increase in the number of false positive results when there was linkage disequilibrium (LD) between markers. The presence of LD between markers may have led to an increased number of false positive regions, but no clear relationship between regions of high LD and locations of false positive linkage signals was observed.


Background
Genome-wide linkage analysis using microsatellite markers has been successful in the identification of numerous Mendelian and complex disease loci. Recently available high-density single-nucleotide polymorphism (SNP) maps theoretically provide greater information content (IC), which should help to both identify and narrow linkage regions. This is supported by a few published reports comparing genome-wide linkage analysis using microsatellites to studies of the same dataset using dense SNP maps [1,2]. Yet questions remain about the optimal density of SNP marker sets for linkage studies. Additionally, current algorithms for linkage analysis assume that adjacent markers are in linkage equilibrium. However, there may be significant linkage disequilibrium (LD) between adjacent markers in dense SNP marker sets, which can lead to false positive results [3]. To explore these issues we used the simulated and Collaborative Study on the Genetics of Alcoholism (COGA) datasets to examine how altering the SNP density impacted the overall IC, the power to detect trait loci, and the number of false positive results. We compared these results to analyses performed using microsatellite markers.

Simulated data
Analyses were performed separately for each population and replicate, using all replicates of the Aiputo, Danaca, and Karanga populations. The full marker sets for both the microsatellite (MS; 7.5 cM) and SNP (3 cM) maps were used. Additional fine-mapping markers were purchased for chromosomes 8 and 9 (packets 400-406 and 416-419) to increase the density of the SNPs (0.3 cM). We had knowledge of the answers.

Trait definition (simulated)
Dominant or recessive traits were created using these marker loci: B08T8044, B08T8045, B08T8050, and B08T8051. Affection status for a dominant trait was defined as individuals with ≥ 1 copy of allele 1 at the marker and for a recessive trait as individuals with 2 copies of allele 1.
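The trait definition above can be illustrated with a minimal sketch. The function name `affection_status`, the genotype representation as a pair of allele codes, and the use of 0 as a missing-allele code are assumptions made for illustration; they are not taken from the analysis scripts used in this study.

```python
def affection_status(genotype, model, risk_allele=1):
    """Derive affection status from a marker genotype.

    genotype: pair of allele codes, e.g. (1, 2). A code of 0 is treated
              as missing (an assumption of this sketch).
    model:    "dominant"  -> affected with >= 1 copy of the risk allele
              "recessive" -> affected with 2 copies of the risk allele
    Returns True (affected), False (unaffected), or None (unknown).
    """
    if 0 in genotype:
        return None  # genotype missing: status cannot be derived
    copies = sum(1 for allele in genotype if allele == risk_allele)
    if model == "dominant":
        return copies >= 1
    if model == "recessive":
        return copies == 2
    raise ValueError("model must be 'dominant' or 'recessive'")
```

Because both traits are derived from the same genotype, every individual affected under the recessive definition is also affected under the dominant definition at that marker.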

COGA data
Using a Perl script, we created an interpolated genetic map that used MS markers from the deCode map and SNPs from both Illumina and Affymetrix. For each SNP, the 2 MS markers from the deCode map that flanked the SNP were identified using the physical positions of these markers obtained from sequence build 34. The genetic position of the SNP was then determined from the physical and genetic positions of the 2 flanking microsatellites, assuming linear interpolation between them. Any MS or SNP without a physical position was removed. If SNP markers mapped to the same genetic location, the SNP with the largest physical position was kept.
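The interpolation step can be sketched as follows. This is an illustration in Python rather than the Perl script used in the study; the function name `snp_genetic_position`, the list-of-tuples input, and the choice to return `None` for SNPs without a flanking pair are assumptions.

```python
import bisect

def snp_genetic_position(snp_bp, ms_markers):
    """Linearly interpolate a SNP's genetic position (cM).

    snp_bp:     physical position of the SNP in base pairs.
    ms_markers: list of (physical_bp, genetic_cM) tuples for the
                microsatellites on one chromosome, sorted by physical_bp.
    Returns the interpolated cM position, or None when the SNP is not
    flanked by two microsatellites (an assumption of this sketch).
    """
    bps = [bp for bp, _ in ms_markers]
    i = bisect.bisect_right(bps, snp_bp)
    if i == 0 or i == len(bps):
        return None  # SNP lies outside the span of the MS map
    bp0, cm0 = ms_markers[i - 1]  # left flanking marker (bp0 <= snp_bp)
    bp1, cm1 = ms_markers[i]      # right flanking marker (bp1 > snp_bp)
    fraction = (snp_bp - bp0) / (bp1 - bp0)
    return cm0 + fraction * (cm1 - cm0)
```

For example, a SNP halfway (in physical distance) between microsatellites at 10.0 cM and 12.0 cM is placed at 11.0 cM.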

Trait definition (COGA)
The following markers (and risk alleles) were used to create a dominant and/or a recessive trait: rs0041510 (allele 2), tsc2832191 (allele 1), tsc0061481 (allele 1). To avoid errors due to differences in allele frequencies between ethnic groups, analysis was limited to the white/non-Hispanic families, which comprised the largest ethnic subgroup.

Creation of SNP maps
Using a Perl script, we selected a subset of the SNP markers to create maps that were less dense. Our goal was to select markers with the desired inter-marker distances. To avoid tight clusters of markers, we moved at least the desired distance minus 10% of that distance before another marker was selected. If there were multiple markers within ± 10% of the desired distance, the marker with the minor allele frequency (MAF) closest to 0.5 was selected. For example, for the 0.3-cM map, markers were forced to be at least 0.27 cM apart, and if there were multiple markers located between 0.27 cM and 0.33 cM from the last marker, the marker with the MAF closest to 0.5 was selected.
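A minimal sketch of this selection rule is given below, in Python rather than Perl. Two details are not specified in the text and are assumptions here: the first marker on the chromosome is always kept, and when no marker falls inside the ± 10% window the next marker beyond the minimum distance is taken. The function name `thin_markers` and the tuple layout are also hypothetical.

```python
def thin_markers(markers, spacing, tol=0.10):
    """Select a sparser marker subset with roughly `spacing` cM between markers.

    markers: list of (name, position_cM, maf) tuples for one chromosome.
    spacing: desired inter-marker distance in cM (must be > 0).
    tol:     fractional tolerance around the desired distance (10% here).
    """
    if spacing <= 0:
        raise ValueError("spacing must be positive")
    ordered = sorted(markers, key=lambda m: m[1])
    if not ordered:
        return []
    selected = [ordered[0]]  # assumption: always keep the first marker
    while True:
        last_cm = selected[-1][1]
        lo = last_cm + spacing * (1 - tol)  # minimum allowed distance
        hi = last_cm + spacing * (1 + tol)  # upper edge of preferred window
        beyond = [m for m in ordered if m[1] >= lo]
        if not beyond:
            break  # no markers left far enough from the last selection
        window = [m for m in beyond if m[1] <= hi]
        if window:
            # several candidates: prefer the MAF closest to 0.5
            pick = min(window, key=lambda m: abs(m[2] - 0.5))
        else:
            # assumption: fall back to the nearest marker past the minimum
            pick = beyond[0]
        selected.append(pick)
    return selected
```

With `spacing=0.3`, a marker at 0.28 cM with MAF 0.30 and a marker at 0.31 cM with MAF 0.45 both fall in the 0.27 to 0.33 cM window after a marker at 0 cM, and the second is chosen.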

Statistical analysis
We used the analysis program MERLIN for all linkage analyses [4]. Allele frequencies were estimated from all founders. Kong and Cox LOD scores [5] and the associated p-values for Whittemore and Halpern's NPL All [6] statistic were used for the analysis of qualitative traits. Entropy was used as the measure of IC. Multipoint evaluation was performed at each of the marker loci (between-marker evaluations were not performed). For the evaluation of power and type I error we used 4 standard p-value thresholds (0.05, 0.01, 0.001, and 0.0001) and 2 Lander-Kruglyak [7] genome-wide significance levels. We calculated power as the number of replicates with a p-value less than the threshold within a 20-cM region (10 cM in either direction) of the trait loci. To assess the frequency of false positive results, we counted the number of regions where a p-value less than the above-mentioned cut-off occurred on chromosomes not containing the trait loci. To ensure that adjacent markers with p-values below the given level were not counted as multiple false positive results, a region with a p-value greater than or equal to 0.2 was required to occur between two false positive regions.

Results
Table 1 presents the results of our comparison of the IC for the various map densities. In the simulated data, the average IC of the MS map was 0.934. There was a loss of information when we compared the 3-cM SNP map (0.833) with the MS map. Conversely, the very dense (0.3-cM) SNP map had the highest mean IC (0.986), a modest increase over the MS map. In the COGA dataset, IC increased with increasing map density and was lowest in the MS marker set. The overall IC was somewhat lower in the COGA data; this could be due in part to missing data in the COGA dataset or to overall marker heterozygosity. Note that the MS map in the COGA dataset (13.6 cM) is less dense than the MS map in the simulated dataset (7.5 cM).
There was a modest increase in power with increasing SNP map density in the simulated data (Table 2). Power was greatest for the 0.3-cM density. Power for the MS map appeared to fall between that of the 1-cM and 3-cM SNP maps. Overall power was quite low when we used a genome-wide significance level of 0.000049. However, in the COGA dataset (Table 3) there were less consistent trends in the ability to detect the trait loci as map density increased. In fact, the denser maps sometimes gave smaller LOD scores than the less dense maps (e.g., Drs0041510). This could be due to errors in marker order or intermarker distance for the denser map sets. It is important to note that our created traits were homogeneous and had complete penetrance, and thus overall power was very high, possibly masking any true variations in power due to differences in map density. For all map sets, disease frequencies had a large impact on power. Additionally, because we only performed analysis at the marker loci and not between marker loci, we cannot evaluate whether denser maps yielded smaller confidence intervals for the linkage peaks, because 1-LOD confidence intervals are dependent upon the density of analytic evaluations.
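The power criterion used here (a replicate counts as a detection when any p-value within 10 cM of the trait locus falls below the threshold) can be sketched as follows. The function names, the input layout, and the conversion of the replicate count to a proportion are assumptions made for illustration.

```python
def replicate_detects(results, trait_cm, threshold, half_window=10.0):
    """True if any marker within `half_window` cM of the trait locus
    has a p-value below `threshold`.

    results: list of (position_cM, p_value) pairs for the chromosome
             carrying the trait locus in one replicate.
    """
    return any(
        abs(pos - trait_cm) <= half_window and p < threshold
        for pos, p in results
    )

def power(replicates, trait_cm, threshold, half_window=10.0):
    """Proportion of replicates in which the trait locus is detected."""
    if not replicates:
        raise ValueError("at least one replicate is required")
    hits = sum(
        replicate_detects(r, trait_cm, threshold, half_window)
        for r in replicates
    )
    return hits / len(replicates)
```

A signal 20 cM away from the trait locus does not count toward power under this definition, even if its p-value is far below the threshold.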
The number of false positive linkages (p-value below a given level in a region unlinked to the trait loci) for the simulated data is shown in Tables 4 and 5. Although the densest map showed an increase in the number of false positive results compared to the 1-cM map, it is hard to interpret this because such a dense map was only available in one 18-cM region. We also examined the number of false positive regions for each of the traits in the COGA dataset (Table 6).
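The rule for counting false positive regions described under Statistical analysis (a new region is counted only after a p-value of at least 0.2 has been seen since the previous one) can be sketched as below. The function name and the assumption that p-values are supplied in map order for a single unlinked chromosome are illustrative choices.

```python
def count_false_positive_regions(p_values, threshold, reset=0.2):
    """Count distinct false positive regions along one unlinked chromosome.

    p_values:  p-values at consecutive marker loci, in map order.
    threshold: significance cut-off (assumed to be smaller than `reset`).
    reset:     a p-value >= reset must occur between two regions for
               them to be counted separately.
    """
    count = 0
    separated = True  # the start of the chromosome opens a new region
    for p in p_values:
        if p < threshold:
            if separated:
                count += 1
                separated = False  # stay in this region until a reset
        elif p >= reset:
            separated = True
    return count
```

Under this rule, two significant markers separated only by markers with p-values between the threshold and 0.2 are counted as a single region.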

Conclusion
Overall, IC was higher for the dense SNP maps as compared with the less dense SNP and MS maps. In the simulated data, there was a modest increase in power with increasing SNP map density. However, in the COGA data, no consistent trends were observed in our ability to detect trait loci with increasing map density. There was variation in the LOD scores across maps, with more dense maps sometimes yielding lower LOD scores. This could be due to errors in map order and supports the need for precise genetic maps when using dense SNP maps for linkage. Unsurprisingly, power was dependent on disease prevalence for these homogeneous, completely penetrant traits.
In the simulated data, in which there was no significant LD between markers, the number of false positives did not increase with increasing map density. In the COGA data, more false positives were observed for the densest map set, 0.25 cM, in which there was significant intermarker LD. Huang et al. [3] reported that the presence of intermarker LD caused an increase in false positives, particularly when there is missing parental data. This is of particular concern because others have reported that SNPs are more powerful than microsatellites when there is missing parental data. To examine this, we calculated the LD between all SNPs up to 500 kb apart. Twenty-one percent of all SNP pairs had a D' > 0.70 (high LD). Of those pairs with a D' > 0.70, 89% were <200 kb apart, 9% were 200-400 kb apart, and 2% were >400 kb apart. The LD between SNPs diminished as distance increased, suggesting that maps with an average marker distance >200 kb would have limited intermarker LD. Comprehensive review of the locations of all type I errors observed for two of these traits (created from marker tsc006148 on chromosome 13) showed that while 90% of these regions contained markers exhibiting LD, the LD patterns in these regions did not differ markedly from the LD on the remainder of the chromosomes. Interestingly, 20% of the false positives occurred at the telomeres of chromosomes. While some of the increases in numbers of type I errors could be due to increased intermarker LD in the densest maps, they could also be caused by the fact that more evaluations of linkage were performed for the dense maps, since we evaluated linkage at each marker location and did not perform any intermarker evaluations. Thus, the densest map had the largest number of linkage tests performed (see Table 1), so increased type I errors could be due to LD or to the increased number of tests.
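For readers unfamiliar with the D' measure used above, the standard definition for two biallelic markers can be sketched as follows. The sketch assumes the haplotype frequency is already known; how haplotype frequencies were estimated in this study is not described in the text, and the function name `d_prime` is hypothetical.

```python
def d_prime(p_ab, p_a, p_b):
    """Absolute normalized linkage disequilibrium |D'| for two biallelic loci.

    p_ab: frequency of the haplotype carrying allele A and allele B.
    p_a:  frequency of allele A at the first locus.
    p_b:  frequency of allele B at the second locus.
    """
    d = p_ab - p_a * p_b  # raw disequilibrium coefficient D
    if d >= 0:
        d_max = min(p_a * (1 - p_b), (1 - p_a) * p_b)
    else:
        d_max = min(p_a * p_b, (1 - p_a) * (1 - p_b))
    if d_max == 0:
        return 0.0  # a monomorphic locus carries no LD information
    return abs(d) / d_max
```

A pair of SNPs would be classed as being in high LD under the criterion used here when this value exceeds 0.70.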