Comparative linkage analysis and visualization of high-density oligonucleotide SNP array data
© Leykin et al; licensee BioMed Central Ltd. 2005
Received: 06 July 2004
Accepted: 15 February 2005
Published: 15 February 2005
The identification of disease-associated genes using single nucleotide polymorphisms (SNPs) has been increasingly reported. In particular, the Affymetrix Mapping 10 K SNP microarray platform uses one PCR primer to amplify the DNA samples and determine the genotype of more than 10,000 SNPs in the human genome. This provides the opportunity for large scale, rapid and cost-effective genotyping assays for linkage analysis. However, the analysis of such datasets is nontrivial because of the large number of markers, and visualizing the linkage scores in the context of genome maps remains less automated using the current linkage analysis software packages. For example, the haplotyping results are commonly represented in the text format.
Here we report the development of a novel software tool called CompareLinkage for automated formatting of the Affymetrix Mapping 10 K genotype data into the "Linkage" format and the subsequent analysis with multi-point linkage software programs such as Merlin and Allegro. The new software has the ability to visualize the results for all these programs in dChip in the context of genome annotations and cytoband information. In addition we implemented a variant of the Lander-Green algorithm in the dChipLinkage module of dChip software (V1.3) to perform parametric linkage analysis and haplotyping of SNP array data. These functions are integrated with the existing modules of dChip to visualize SNP genotype data together with LOD score curves. We have analyzed three families with recessive and dominant diseases using the new software programs and the comparison results are presented and discussed.
The CompareLinkage and dChipLinkage software packages are freely available. They provide the visualization tools for high-density oligonucleotide SNP array data, as well as the automated functions for formatting SNP array data for the linkage analysis programs Merlin and Allegro and calling these programs for linkage analysis. The results can be visualized in dChip in the context of genes and cytobands. In addition, a variant of the Lander-Green algorithm is provided that allows parametric linkage analysis and haplotyping.
The oligonucleotide Mapping 10 K arrays have been used for linkage analysis [2–4], and their advantages in genome coverage and information content compared to microsatellite-based assays have been demonstrated. The array contains 11,550 SNPs with an average heterozygosity rate of 0.32 and an average marker distance of 0.31 cM. However, the commonly used multi-point linkage analysis software packages such as GeneHunter [5, 6] and Merlin are command-line programs, and it is not straightforward to find genes in the regions of high linkage scores. In addition, the haplotyping results are commonly represented in a text format without any gene context.
Here we report the development of a new software tool called CompareLinkage that can be used for automated conversion of Mapping 10 K genotype data into the "Linkage" format for linkage analysis in Merlin, GeneHunter and Allegro. In addition, the program can convert the pedigree information and SNP marker information into the "Linkage" format. After performing the linkage analysis using one or more of these programs, the CompareLinkage software can export the linkage score information into the dChip software [9–11] to visualize the results within a chromosome window. We also implemented a variant of the Lander-Green [5, 12] algorithm in the dChipLinkage module to analyze pedigrees with up to 18 bits (bits = 2n - f, where n is the number of non-founders and f is the number of founders) using the parametric linkage analysis method. We are currently testing and validating the implementation of the algorithm, which will be described in detail elsewhere. The linkage score curves, genotypes and haplotypes are graphically displayed in a dChip chromosome window, which includes the gene, cytoband and SNP marker information. Together the CompareLinkage and dChip software programs provide for the first time a graphical user interface (GUI) and an automated procedure for comparative linkage analysis utilizing three commonly used linkage software programs.
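As a small worked example of the bit count defined above, the following Python sketch computes 2n - f for a pedigree and the corresponding number of inheritance vectors (2 raised to the number of bits). The function names and the example family are illustrative assumptions and are not part of dChip.

```python
def pedigree_bits(n_nonfounders: int, n_founders: int) -> int:
    """Return 2n - f, the bit count that sizes the inheritance-vector space."""
    if n_nonfounders < 0 or n_founders < 0:
        raise ValueError("counts must be non-negative")
    return 2 * n_nonfounders - n_founders


def inheritance_vector_count(bits: int) -> int:
    """Number of distinct inheritance vectors for a given bit count."""
    return 2 ** bits


# Hypothetical nuclear family: 2 founders (the parents) and 4 children.
bits = pedigree_bits(n_nonfounders=4, n_founders=2)
print(bits, inheritance_vector_count(bits))  # 6 64
```

At the stated limit of 18 bits, the same calculation gives 262,144 inheritance vectors per marker, which illustrates why the bit count bounds the pedigree size that can be analyzed.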
The CompareLinkage software for comparative linkage analysis using Merlin and Allegro
To analyze large pedigrees rapidly and to compare the linkage analysis results of different software packages, we developed a software tool called CompareLinkage to automate the following processes: (1) Conversion of Affymetrix Mapping 10 K genotype data, pedigree files and marker information into the "Linkage" format, and detection and correction of incompatibilities in pedigree genotypes. The input for CompareLinkage can be a separate genotype text file for each sample or a combined text file as exported by the Affymetrix GDAS 3.0 software. (2) Automatic calling of the software packages Merlin and Allegro for linkage analysis, and conversion of the analysis results (LOD or non-parametric linkage (NPL) scores) into input files for dChip to visualize the results in the context of genes and cytobands. (3) Conversion of the SNP genotype data in the "Linkage" format into the dChip input files (genotype, pedigree and marker information files) to perform parametric linkage analysis with dChipLinkage. All steps are discussed in detail in the CompareLinkage software manual provided on the software website. These functions are useful for cross-validating linkage results and for identifying concordances and discordances between different linkage analysis programs as well as between parametric and non-parametric linkage results.
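The conversion in step (1) can be illustrated with a minimal Python sketch that turns array genotype calls into one line of a pre-makeped "Linkage" pedigree file. The call labels (AA, AB, BB, NoCall), the allele coding and the function name are assumptions made for illustration; the text does not describe the GDAS export layout or how CompareLinkage itself performs this step.

```python
# Assumed mapping from array genotype calls to allele pairs (0 = missing allele).
CALL_TO_ALLELES = {"AA": ("1", "1"), "AB": ("1", "2"), "BB": ("2", "2")}


def linkage_ped_line(family, person, father, mother, sex, status, calls):
    """Build one pre-makeped pedigree line in the "Linkage" column order.

    sex: 1 = male, 2 = female; status: 0 = unknown, 1 = unaffected, 2 = affected.
    calls: genotype calls in marker order; any unrecognised call becomes "0 0".
    """
    fields = [family, person, father, mother, str(sex), str(status)]
    for call in calls:
        fields.extend(CALL_TO_ALLELES.get(call.strip().upper(), ("0", "0")))
    return " ".join(fields)


# Hypothetical affected son (individual 3) of individuals 1 and 2 in family F1.
print(linkage_ped_line("F1", "3", "1", "2", 1, 2, ["AA", "AB", "NoCall", "BB"]))
# F1 3 1 2 1 2 1 1 1 2 0 0 2 2
```

Coding uncalled genotypes as missing keeps them from contributing spurious Mendelian incompatibilities, although a real converter would also need to check the pedigree structure itself.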
The dChipLinkage software module
The Affymetrix Mapping 10 K array CEL files and genotype TXT files can be imported into dChip and visualized along cytobands and genes as previously reported [9, 11]. Information about the SNPs, such as their genetic and physical distances and their allele frequencies in three ethnic groups (Asian, African American and Caucasian), is obtained from the Affymetrix website and converted into the genome information files for dChip. Information about the reference genes and cytobands is obtained from the UCSC genome bioinformatics database for the human genome assembly (hg12 or hg15) that matches the SNP information, and is organized into the refGene and cytoband files provided with dChip.
where Fi represents the i-th of all the possible founder allele configurations and is independent of v. P(real genotypes i | v, Fi) is 1, since an inheritance vector and a founder allele configuration uniquely determine the real genotypes, and P(observed genotypes | real genotypes i) is obtained by comparing the real genotype and the observed genotype for all the individuals and multiplying the probability by the error rate of 0.01 (default value) for each disagreement and by 0.99 for each agreement. We also use the matrix-vector multiplication algorithm and the bit reduction due to founder phase symmetry described previously, and the founder allele factoring technique reported in [6, 17], to speed up the computation of single-locus and accumulative likelihood vectors as well as the likelihood vector of disease phenotypes.
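The equation that these definitions refer to is not reproduced in this text. Based only on the terms defined in the paragraph above, the single-locus likelihood of an inheritance vector v plausibly takes the following form; this is a reconstruction under that assumption, not the authors' typeset equation:

```latex
P(\text{observed genotypes} \mid v)
  = \sum_{i} P(\text{observed genotypes} \mid \text{real genotypes}_i)\,
             P(\text{real genotypes}_i \mid v, F_i)\, P(F_i)
```

Because the middle factor equals 1, each term would reduce to P(F_i) multiplied by a product over individuals of 0.99 for each agreement and 0.01 for each disagreement between real and observed genotypes.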
We use the forward-backward computation in the Lander-Green algorithm to obtain the marginal probability distribution of the inheritance vector at each SNP marker position given the data of all the markers on a chromosome. In addition, the most likely inheritance vector at each marker given the genotype data of all the markers on this chromosome is calculated. Conditioned on the most likely inheritance vector at a marker and the observed genotype data, we can find the most likely founder allele configurations. When there are competing inheritance vectors with the same largest marginal probability at a marker, we select the one with fewer crossover events relative to the previous marker, since the distance between adjacent markers is small (average 300 kb) and multiple crossover events between two markers in a pedigree are therefore less likely. Together these procedures give the haplotyping results of the pedigree data. dChipLinkage visualizes the haplotyping result in either the haplotype view or the ordered genotype view.
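The tie-breaking rule described above can be sketched in Python as follows. This is a minimal illustration that assumes inheritance vectors are encoded as integers with one bit per meiosis, so that the number of crossovers between adjacent markers is the number of differing bits; the function names and the tolerance are hypothetical and do not reflect the dChipLinkage implementation, which also applies bit reduction.

```python
def crossovers(prev_vector: int, vector: int) -> int:
    """Count the bits that differ between two inheritance vectors."""
    return bin(prev_vector ^ vector).count("1")


def pick_inheritance_vector(marginals, prev_vector, tol=1e-12):
    """Return the most probable vector, breaking ties by fewest crossovers.

    marginals: marginal probabilities indexed by inheritance vector (as an int).
    prev_vector: the vector selected at the previous marker.
    """
    best = max(marginals)
    # Vectors whose marginal probability matches the maximum within tolerance.
    tied = [v for v, p in enumerate(marginals) if best - p <= tol]
    return min(tied, key=lambda v: crossovers(prev_vector, v))


# Two-bit toy example: vectors 0b01 and 0b11 are equally likely.
marginals = [0.1, 0.4, 0.1, 0.4]
print(pick_inheritance_vector(marginals, prev_vector=0b11))  # 3
print(pick_inheritance_vector(marginals, prev_vector=0b00))  # 1
```

In both calls the selected vector is the tied candidate that requires the fewest meiosis switches from the vector chosen at the previous marker.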
The comparative linkage analysis using Merlin, GeneHunter, Allegro and dChipLinkage
Linkage analysis and visualization using dChipLinkage
1. Open dChip.
2. Select the Analysis menu and the Get External Data function to read in the genotype file in the text format (Figure 13A).
3. Select the genome information file downloaded from the dChip website (Figure 13B). This file is provided in three versions, each containing the SNP information like TSC SNP ID and genetic map locations but having different allele frequencies for each of the three ethnic groups (Asian, Caucasian and African Americans).
4. Select the Analysis menu and the Chromosome function to display the genotype calls, genes and cytobands along the chromosome.
5. After the program has displayed the genotype data, select the Chromosome menu and the Linkage function to start the dChipLinkage module (Figure 3). Specify the pedigree file (Figure 13C) and other linkage parameters. Depending on whether the dChip "Chromosome View" displays one or all chromosomes, the linkage analysis will be performed for one or all chromosomes accordingly. For the analysis of the 5026.10 family, the recessive disease model is assumed, and a penetrance of 0.99, phenocopy of 0.01, disease allele frequency of 0.001 and a SNP marker error rate of 0.01 are used. The SNP allele frequencies in the genome information file are used and truncated to values between 0.001 and 0.999. This family has 13 bits and it takes about 20 minutes for the whole genome linkage analysis.
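The linkage parameters quoted in step 5 can be made concrete with a short Python sketch. It assumes the usual reading of these terms: "penetrance" is the probability of being affected with two disease alleles, "phenocopy" is the probability of being affected otherwise, and disease genotype priors follow Hardy-Weinberg proportions. These interpretations and the function names are assumptions for illustration, not a description of the dChipLinkage code.

```python
def clamp_allele_frequency(freq, low=0.001, high=0.999):
    """Truncate an allele frequency to the closed interval [low, high]."""
    return min(max(freq, low), high)


def recessive_penetrances(penetrance=0.99, phenocopy=0.01):
    """P(affected | 0, 1 or 2 disease alleles) under a recessive model."""
    return (phenocopy, phenocopy, penetrance)


def disease_genotype_priors(disease_allele_freq=0.001):
    """Assumed Hardy-Weinberg priors for carrying 0, 1 or 2 disease alleles."""
    q = disease_allele_freq
    p = 1.0 - q
    return (p * p, 2 * p * q, q * q)


print([clamp_allele_frequency(f) for f in (0.0, 0.32, 1.0)])  # [0.001, 0.32, 0.999]
print(recessive_penetrances())  # (0.01, 0.01, 0.99)
```

Truncating the marker allele frequencies away from 0 and 1 avoids assigning zero probability to an allele that is actually observed in the family.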
Discussion and conclusions
We have developed the CompareLinkage software for easy comparison and analysis of genotype datasets with common multi-point linkage analysis software programs. It provides functions such as automated data formatting and the calling of linkage analysis software programs to facilitate comparative linkage analysis. The results can be visualized in a chromosome window in the context of genes, cytobands and SNPs in dChip's user-friendly graphical interface. The linkage scores of other linkage software packages can be saved into the dChip score file format through CompareLinkage and viewed in the dChip chromosome viewer. The same interface can be used to view other computed statistics, such as linkage disequilibrium scores, along the chromosomes. We have also implemented a variant of the Lander-Green algorithm as the dChipLinkage module for parametric linkage analysis of small pedigrees. It can analyze all chromosomes for families with up to 18 bits within one hour on a PC with one gigabyte of memory. This is useful for recessive and consanguineous families, whose bit counts are often small.
The comparative analysis of three Mapping 10 K array data sets shows similar results in regions with significant LOD scores across all four software packages. The regions with concordant LOD/NPL scores should provide more confidence in the candidate disease loci. However, there are clear differences in isolated regions. This emphasizes the challenge of a comparative analysis using different linkage algorithm implementations. We hypothesize that the differences between the software programs in peak locations are attributable to:
1. The specific algorithm implementation in each program.
2. The difference between parametric and non-parametric analysis.
3. The existence of undetected genotype errors in the data sets, which could falsely deflate LOD scores [17, 22]. dChipLinkage uses an error model to automatically handle genotype errors and avoid sporadic LOD score peaks due to undetected non-Mendelian errors, which results in a smoother LOD curve as seen in Figures 7 to 12. However, this error handling algorithm involves more iterations and increases the computation time. There are further techniques to reduce the memory and time requirements of the Lander-Green algorithm [7, 8, 23, 24].
In light of the discordance between the results from common linkage software packages and from dChipLinkage, we will validate the dChipLinkage implementation using additional datasets and the CompareLinkage software.
In summary, the CompareLinkage and dChipLinkage software automate comparative linkage analysis and visualization using multiple software packages. With these tools, users will be able to increase their confidence in candidate regions and can use the visualization tools to explore the disease-associated genome regions.
Availability and requirements
Project name: The CompareLinkage software and the dChipLinkage software module
Project home page: http://biosun1.harvard.edu/complab/linkage
Operating system(s): Windows (dChipLinkage); Windows (CompareLinkage and its graphical interface), Unix (CompareLinkage command line version)
Programming language: Visual C++ 6.0 (dChipLinkage); Perl and Java (CompareLinkage software)
Other requirements: None
Any restrictions to use by non-academics: No restrictions
We thank Hajime Matsuzaki, Patricia Dahia, Robert Sean Hill and Steven Boyden for helpful discussions. This work is supported by NIH grants 1R01HG02341 and P20-CA96470 (IL, KH and WHW), RO1-DC02842 (Richard J.H. Smith), NIH DK54931 (Martin R. Pollak), and grants from Friends of Dana-Farber Cancer Institute (CL) and Claudia Adams Barr Program in Cancer Research (CL).
1. Kennedy GC, Matsuzaki H, Dong S, Liu WM, Huang J, Liu G, Su X, Cao M, Chen W, Zhang J, Liu W, Yang G, Di X, Ryder T, He Z, Surti U, Phillips MS, Boyce-Jacino MT, Fodor SP, Jones KW: Large-scale genotyping of complex DNA. Nat Biotechnol. 2003, 21: 1233-1237. 10.1038/nbt869.
2. Matsuzaki H, Loi H, Dong S, Tsai YY, Fang J, Law J, Di X, Liu WM, Yang G, Liu G, Huang J, Kennedy GC, Ryder TB, Marcus GA, Walsh PS, Shriver MD, Puck JM, Jones KW, Mei R: Parallel genotyping of over 10,000 SNPs using a one-primer assay on a high-density oligonucleotide array. Genome Res. 2004, 14: 414-425. 10.1101/gr.2014904.
3. Middleton FA, Pato MT, Gentile KL, Morley CP, Zhao X, Eisener AF, Brown A, Petryshen TL, Kirby AN, Medeiros H, Carvalho C, Macedo A, Dourado A, Coelho I, Valente J, Soares MJ, Ferreira CP, Lei M, Azevedo MH, Kennedy JL, Daly MJ, Sklar P, Pato CN: Genomewide linkage analysis of bipolar disorder by use of a high-density single-nucleotide-polymorphism (SNP) genotyping assay: a comparison with microsatellite marker assays and finding of significant linkage to chromosome 6q22. Am J Hum Genet. 2004, 74: 886-897. 10.1086/420775.
4. John S, Shephard N, Liu G, Zeggini E, Cao M, Chen W, Vasavda N, Mills T, Barton A, Hinks A, Eyre S, Jones KW, Ollier W, Silman A, Gibson N, Worthington J, Kennedy GC: Whole-genome scan, in a complex disease, using 11,245 single-nucleotide polymorphisms: comparison with microsatellites. Am J Hum Genet. 2004, 75: 54-64. 10.1086/422195.
5. Lander ES, Green P: Construction of multilocus genetic linkage maps in humans. Proc Natl Acad Sci U S A. 1987, 84: 2363-2367.
6. Kruglyak L, Daly MJ, Reeve-Daly MP, Lander ES: Parametric and nonparametric linkage analysis: a unified multipoint approach. Am J Hum Genet. 1996, 58: 1347-1363.
7. Abecasis GR, Cherny SS, Cookson WO, Cardon LR: Merlin--rapid analysis of dense genetic maps using sparse gene flow trees. Nat Genet. 2002, 30: 97-101. 10.1038/ng786.
8. Gudbjartsson DF, Jonasson K, Frigge ML, Kong A: Allegro, a new computer program for multipoint linkage analysis. Nat Genet. 2000, 25: 12-13. 10.1038/75514.
9. Lin M, Wei LJ, Sellers WR, Lieberfarb M, Wong WH, Li C: dChipSNP: significance curve and clustering of SNP-array-based loss-of-heterozygosity data. Bioinformatics. 2004, 20: 1233-1240. 10.1093/bioinformatics/bth069.
10. Li C, Wong WH: Model-based analysis of oligonucleotide arrays: expression index computation and outlier detection. Proc Natl Acad Sci U S A. 2001, 98: 31-36. 10.1073/pnas.011404098.
11. Li C, Wong WH: DNA-Chip Analyzer (dChip). The analysis of gene expression data: methods and software. Edited by: Parmigiani G, Garrett ES, Irizarry R and Zeger SL. 2003, New York, Springer, 120-141.
12. Lange K: Mathematical and statistical methods for genetic analysis. 2002, New York, Springer-Verlag, 2nd edition.
13. Lathrop M, Ott J: Linkage User's Guide. [ftp://linkage.rockefeller.edu/software/linkage]
14. Affymetrix: Affymetrix Mapping 10K Array - Support Materials. [http://www.affymetrix.com/support/technical/byproduct.affx?product=10k]
15. UCSC: UCSC Genome Bioinformatics. [http://genome.ucsc.edu/]
16. Kruglyak L, Daly MJ, Lander ES: Rapid multipoint linkage analysis of recessive traits in nuclear families, including homozygosity mapping. Am J Hum Genet. 1995, 56: 519-527.
17. Sobel E, Papp JC, Lange K: Detection and integration of genotyping errors in statistical genetics. Am J Hum Genet. 2002, 70: 496-508. 10.1086/338920.
18. O'Connell JR, Weeks DE: PedCheck: a program for identification of genotype incompatibilities in linkage analysis. Am J Hum Genet. 1998, 63: 259-266. 10.1086/301904.
19. Lander E, Kruglyak L: Genetic dissection of complex traits: guidelines for interpreting and reporting linkage results. Nat Genet. 1995, 11: 241-247. 10.1038/ng1195-241.
20. Zheng L, Sekerkova G, Vranich K, Tilney LG, Mugnaini E, Bartles JR: The deaf jerker mouse has a mutation in the gene encoding the espin actin-bundling proteins of hair cell stereocilia and lacks espins. Cell. 2000, 102: 377-385. 10.1016/S0092-8674(00)00042-8.
21. Naz S, Griffith AJ, Riazuddin S, Hampton LL, Battey JFJ, Khan SN, Wilcox ER, Friedman TB: Mutations of ESPN cause autosomal recessive deafness and vestibular dysfunction. J Med Genet. 2004, 41: 591-595. 10.1136/jmg.2004.018523.
22. Douglas JA, Boehnke M, Lange K: A multipoint method for detecting genotyping errors and mutations in sibling-pair linkage data. Am J Hum Genet. 2000, 66: 1287-1297. 10.1086/302861.
23. Markianos K, Daly MJ, Kruglyak L: Efficient multipoint linkage analysis through reduction of inheritance space. Am J Hum Genet. 2001, 68: 963-977. 10.1086/319507.
24. Kruglyak L, Lander ES: Faster multipoint linkage analysis using Fourier transforms. J Comput Biol. 1998, 5: 1-7.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.