LD2SNPing: linkage disequilibrium plotter and RFLP enzyme mining for tag SNPs

Background: Linkage disequilibrium (LD) mapping is commonly used to evaluate markers for genome-wide association studies. Most LD software focuses strictly on LD analysis and visualization, but lacks supporting services for genotyping.
Results: We developed freeware called LD2SNPing, which provides a complete package of mining tools for genotyping and LD analysis environments. The software provides SNP ID- and gene-centric online retrieval of SNP information and tag SNP selection from dbSNP/NCBI and HapMap, respectively. Restriction fragment length polymorphism (RFLP) enzyme information for SNP genotyping is available for all SNP IDs and tag SNPs. Single and multiple SNP inputs are possible in order to perform LD analysis by online retrieval from HapMap and NCBI. An LD statistics section provides D, D', r², δQ, ρ, and the P values of the Hardy-Weinberg equilibrium for each SNP marker, as well as Chi-square and likelihood-ratio tests for the pair-wise association of two SNPs in the LD calculation. Finally, 2D and 3D plots, as well as plain-text output of the results, can be selected.
Conclusion: LD2SNPing thus provides a novel visualization environment for multiple SNP input, which facilitates SNP association studies. The software, user manual, and tutorial are freely available at http://bio.kuas.edu.tw/LD2SNPing.


Introduction
Single nucleotide polymorphisms (SNPs) are the most common genetic polymorphisms in the human genome. They are increasingly important to personalized medicine and to many association studies. However, too many SNPs can make it hard to identify the SNPs of interest that are associated with diseases or cancers.
Accordingly, it is essential to use a small, representative subset of informative SNPs for association studies.
Linkage disequilibrium (LD) analysis is one of the common methods for identifying these representative SNPs, called tag SNPs (tSNPs). Here, we developed LD2SNPing to compute LD measurements and visualize them in 2D and 3D plots, either from a user's input data file or by on-line retrieval of multiple SNPs from HapMap and NCBI. Gene name input returns tag SNPs from HapMap, and SNP ID (rs#) input returns RFLP restriction enzyme information for SNP genotyping. The software, user manual, and video tutorial can be downloaded freely from http://bio.kuas.edu.tw/LD2SNPing. Animations are provided at the end of each figure legend to help users practice the examples. The updated LD2SNPing is implemented in Java and supports three-dimensional display. It requires the Java Runtime Environment (JRE) and Java 3D; Java 3D is packaged with LD2SNPing. If your computer has no JRE installed, download it from Sun's website (http://www.sun.com/). Installation is described below.

System Requirements
Programming language: Java Runtime Environment (JRE) needs to be installed.
Computer system: The software has been used on a Pentium 4 CPU system with 256 MB of RAM and 15 MB of disk space.

Installing JRE
The JRE is Sun's software for running Java applications. LD2SNPing is coded in Java, so a JRE is required to operate it. Users can download the latest version of the JRE from http://www.sun.com/ and install it on their computers.

Installing Java 3D
Java 3D is Sun's three-dimensional graphics software for Java. LD2SNPing can display graphics in three dimensions, and Java 3D is packaged with LD2SNPing for this purpose. Java 3D requires a DirectX or OpenGL environment, and LD2SNPing relies on the standard DirectX software developed by Microsoft. If LD2SNPing fails to run, install the updated version of DirectX from http://www.microsoft.com/windows/directx/default.aspx.

Installing LD2SNPing
Before running the program, users have to make sure that the Java platform environment has been set up correctly. The latest version of LD2SNPing (LD 2 SNPing V2.0.exe) can be downloaded from http://bio.kuas.edu.tw/LD2SNPing (Updated: 2009/05/07) and is set up step by step following the installer's instructions. The software was developed for, and is used on, the Windows operating system. LD 2 SNPing V2.0.exe can run on any Windows platform, but 3D rendering is memory-intensive, and more than 256 MB of RAM is recommended.

Most functions require an internet connection to retrieve the necessary information; the only exception is LD calculation from file input.

Input Format
Before introducing the functions of LD2SNPing, we first list the acceptable inputs: files, gene names, and rsID#s. LD2SNPing accepts four input file formats: Excel (.xls and .csv), Word (.doc), and NotePad (.txt) (Fig. 1, Fig. 2, and Fig. 3, respectively). The first row of each file holds the SNP names (users can type any names). The second row holds the distances (optional). More example files are available in the example file folder of LD2SNPing (described later in Fig. 15), which is a subfolder of the LD2SNPing program folder.
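The row layout described above can be illustrated with a small reader for the .csv case. This is only a sketch under stated assumptions: the comma delimiter, the rule that a second row counts as distances only when every cell is numeric, and the function name `parse_snp_table` are all hypothetical and are not taken from LD2SNPing itself.

```python
import csv
import io


def _is_number(token):
    """Return True if the token parses as a float."""
    try:
        float(token)
        return True
    except ValueError:
        return False


def parse_snp_table(text, delimiter=","):
    """Split a delimited table into SNP names, optional distances, and genotype rows.

    Assumed layout (hypothetical): row 1 = SNP names, row 2 = distances (optional),
    remaining rows = one individual per row, one genotype per SNP column.
    """
    rows = [row for row in csv.reader(io.StringIO(text), delimiter=delimiter) if row]
    names = [cell.strip() for cell in rows[0]]
    body = rows[1:]
    distances = None
    # Treat the second row as distances only if every cell is numeric.
    if body and all(_is_number(cell.strip()) for cell in body[0]):
        distances = [float(cell) for cell in body[0]]
        body = body[1:]
    genotypes = [[cell.strip() for cell in row] for row in body]
    return names, distances, genotypes


example = "SNP1,SNP2,SNP3\n0,1200,3500\nAA,CT,GG\nAG,CC,GT\n"
names, distances, genotypes = parse_snp_table(example)
```

When the distance row is omitted, `distances` is returned as `None` and every row after the header is read as genotype data.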

XLS and CSV Formats
See Fig. 1. If LD2SNPing is installed on drive C, the example files are found in C:\Program Files\LD2SNPing\example (see also Fig. 15).

TXT Format
See Fig. 3.

Retrieval of the individual SNP information from NCBI
LD2SNPing accepts rsID# input to retrieve individual SNP information on-line from dbSNP at NCBI (Fig. 5). The SNP information for all populations present in the current version of dbSNP is therefore provided. The ssID# corresponding to an rsID# can be selected from a pull-down window.

Retrieval of genotype frequency of different populations for multiple SNPs from NCBI
Alternatively, users may need to retrieve several SNPs of interest from NCBI using rsID# and ssID# for further LD calculation and visualization (Fig. 6; only the input steps are shown here, and the output is shown later in the output section).
LD2SNPing lets users input several SNP IDs and automatically retrieves the SNP frequency information on-line from dbSNP at NCBI (Fig. 6). LD measurements between these SNPs are provided even if the user has no prior knowledge of the SNP information for these SNP IDs (see the output section later). The input procedure is indicated by arrows 1 to 6. At arrow 2, users key in the desired number of SNP IDs to input.
The corresponding number of input boxes is then generated immediately (arrow 4). Arrow 5 indicates the pull-down window for rsID# and ssID# selection. Once an rsID# is selected, the corresponding ssID#s are interchangeable if they are provided in dbSNP at NCBI (not shown here; described later). Clicking the box indicated by arrow 6 shows the SNP genotype frequency for the selected population, retrieved on-line from dbSNP at NCBI. Each SNP genotype frequency is retrieved one by one by clicking the "find" box. The rsID# or ssID# in each input window is editable; after editing, click the "find" box again to update the search. The characteristics of the sample file are provided. LD2SNPing accepts genotypes in the forms NN, N_N, and N/N, where N is one of the nucleotides (not shown).
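The three genotype spellings mentioned above (NN, N_N, and N/N) can be normalized with a short routine. The following is a minimal sketch, not LD2SNPing's own parser; the function name `parse_genotype` and the choice to return a sorted allele pair are assumptions made for illustration.

```python
import re

# One nucleotide, an optional "_" or "/" separator, then a second nucleotide.
_GENOTYPE = re.compile(r"^([ACGT])[_/]?([ACGT])$")


def parse_genotype(token):
    """Return the two alleles of a genotype written as NN, N_N, or N/N.

    The alleles are sorted so that "AG" and "GA" compare equal
    (an illustrative choice, not documented behavior of LD2SNPing).
    """
    match = _GENOTYPE.match(token.strip().upper())
    if match is None:
        raise ValueError(f"unrecognized genotype: {token!r}")
    return tuple(sorted(match.groups()))


alleles = [parse_genotype(t) for t in ("AG", "A_G", "G/A")]
```

All three tokens in the usage line resolve to the same allele pair, which is the property a downstream frequency count would rely on.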

Function of LD2SNPing
Running RunLD2SNPing.exe opens the main screen of LD2SNPing (Fig. 8).

Brief review for the function of LD2SNPing
Six functions are provided: three LD-free functions and three LD-available functions.
LD-free functions
1) Single rsID information browser and RFLP enzyme mining: on-line retrieval of individual SNP information among different populations from NCBI.
2) Gene input to find rsID data of tagSNP and RFLP enzyme mining: on-line retrieval of tagSNP in HapMap by HUGO gene name input.
3) RFLP enzyme mining tool using rsID input: RFLP enzymes are provided for SNP genotyping using rsID# input.

Single rsID information browser and RFLP enzyme mining: On-line retrieval of individual SNP information among different populations from NCBI.
After the user inputs an rsID# (e.g., rs17884306) and selects an ssID# (e.g., ss48297306), the population class, total sample, major allele, minor allele, genotype frequency, HWP (P value of the Hardy-Weinberg equilibrium), and data source are provided, as shown in Fig. 9A. These data match those of NCBI (http://www.ncbi.nlm.nih.gov/SNP/snp_ref.cgi?rs=17884306) exactly, as shown in Fig. 9B.
Please note that different ssID#s may have different genotype frequencies.
This is a natural characteristic of dbSNP at NCBI, because the data for the same rsID# were reported by different laboratories under different ssID#s. Clicking the example (rs2247603) automatically performs a similar SNP ID retrieval (Fig. 14, later).
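The HWP value listed above can be illustrated with the textbook Pearson chi-square test for Hardy-Weinberg equilibrium with one degree of freedom. This is a sketch only: whether dbSNP or LD2SNPing uses exactly this test is an assumption, and the function name `hwe_p_value` is hypothetical.

```python
import math


def hwe_p_value(n_aa, n_ab, n_bb):
    """Chi-square test (1 degree of freedom) of Hardy-Weinberg proportions.

    Takes the observed counts of the three genotypes and returns (chi2, p_value).
    """
    n = n_aa + n_ab + n_bb
    if n == 0:
        raise ValueError("no genotypes observed")
    p = (2 * n_aa + n_ab) / (2 * n)  # frequency of allele A
    q = 1.0 - p                      # frequency of allele B
    expected = (p * p * n, 2 * p * q * n, q * q * n)
    observed = (n_aa, n_ab, n_bb)
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected) if e > 0)
    # For 1 degree of freedom, the chi-square survival function is erfc(sqrt(x / 2)).
    return chi2, math.erfc(math.sqrt(chi2 / 2.0))


chi2_balanced, p_balanced = hwe_p_value(25, 50, 25)
chi2_skewed, p_skewed = hwe_p_value(50, 0, 50)
```

The first call uses counts that sit exactly on the Hardy-Weinberg proportions, so the statistic is zero; the second has no heterozygotes at all and deviates strongly.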

Gene input to find rsID data of tagSNP and RFLP enzyme mining: On-line retrieval of tagSNP in HapMap by HUGO gene name input.
To obtain output from gene-related information (e.g., BRCA2 and BRCA1 in Fig. 11 and Fig. 12, respectively), LD2SNPing provides the tagSNP information through HapMap (http://hapmap.jst.go.jp/hapmappopulations.html).
An example is demonstrated in Fig. 11.

RFLP enzyme mining tool using rsID input: RFLP enzymes are provided for SNP genotyping using rsID# input.
This tool is designed to provide RFLP enzyme information for SNP genotyping before LD analysis. In LD2SNPing, restriction enzyme information for a SNP of interest (e.g., rs9534275) is provided as shown in Fig. 13. The restriction enzyme information is downloaded from REBASE. The implementation of the SNP-RFLP function is similar to that of our previous publication, SNP-RFLPing (http://bio.kuas.edu.tw/snp-rflp).
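The idea behind RFLP enzyme mining can be sketched as follows: an enzyme is useful for genotyping a SNP when its recognition site is present with one allele and absent with the other. The example below is an illustration under stated assumptions and does not reproduce the SNP-RFLPing implementation. The small enzyme table is hard-coded rather than downloaded from REBASE, the flanking sequence is invented, and the function name `rflp_enzymes` is hypothetical.

```python
# Recognition sites of a few well-known enzymes. All are palindromic,
# so scanning one strand is sufficient for this sketch.
ENZYMES = {
    "EcoRI": "GAATTC",
    "BamHI": "GGATCC",
    "HindIII": "AAGCTT",
    "TaqI": "TCGA",
    "MspI": "CCGG",
}


def rflp_enzymes(left_flank, allele_a, allele_b, right_flank, enzymes=ENZYMES):
    """Map each discriminating enzyme to the allele whose sequence it cuts."""
    snp_index = len(left_flank)
    hits = {}
    for name, site in enzymes.items():
        present = []
        for allele in (allele_a, allele_b):
            seq = (left_flank + allele + right_flank).upper()
            # Only sites that overlap the SNP position can differ between alleles.
            first = max(0, snp_index - len(site) + 1)
            present.append(
                any(seq.startswith(site, i) for i in range(first, snp_index + 1))
            )
        if present[0] != present[1]:
            hits[name] = allele_a if present[0] else allele_b
    return hits


# Invented flanking sequence: the T allele creates GAATTC, the C allele creates CCGG.
result = rflp_enzymes("AAGAAT", "T", "C", "CGGTT")
```

In this invented example, EcoRI cuts only the T allele and MspI cuts only the C allele, so either enzyme could distinguish the two genotypes by fragment length.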
Here we first introduce, separately and briefly, how to use these different inputs.
We then show the LD calculation and 2D visualization that they have in common, the graph analysis, and the LD visualization in 3D. These are described in detail as follows.

Sample file input for the tutorial of LD calculation/visualization
These sample files give users the chance to practice with the LD2SNPing software. A SNP is defined as a nucleotide variant whose minor allele frequency (MAF) exceeds 1% (MAF > 0.01) in the population. Therefore, by default, variants with allele frequencies below 1% are not shown in the 2D-LD plot, such as SNP1, 3, 5-9, 11-13, 15, 19, 22-26, 28, 30, and 32 of sample 4 in Fig. 15A. These SNPs with MAF < 0.01 are not visualized in the 2D-LD plot (Fig. 15B). The MAF values for all SNPs are provided in "show SNP data" under "help" (Fig. 15C).
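The default MAF > 0.01 filter described above can be sketched in a few lines. This is an illustration only; the function names and the dictionary layout (SNP name mapped to a list of two-letter genotypes) are assumptions, not the data structures used inside LD2SNPing.

```python
from collections import Counter


def minor_allele_frequency(genotypes):
    """MAF of one SNP, given two-letter genotypes such as "AG"."""
    counts = Counter(allele for genotype in genotypes for allele in genotype)
    if len(counts) < 2:
        return 0.0  # monomorphic: no minor allele observed
    total = sum(counts.values())
    # The second most common allele is taken as the minor allele.
    return counts.most_common()[1][1] / total


def filter_by_maf(snps, threshold=0.01):
    """Keep only the SNPs whose MAF is strictly greater than the threshold."""
    return {
        name: genotypes
        for name, genotypes in snps.items()
        if minor_allele_frequency(genotypes) > threshold
    }


sample = {
    "SNP1": ["AA"] * 100,               # monomorphic, MAF = 0
    "SNP2": ["AA"] * 90 + ["AG"] * 10,  # MAF = 10/200 = 0.05
    "SNP3": ["CC"] * 99 + ["CT"],       # MAF = 1/200 = 0.005
}
kept = filter_by_maf(sample)
```

Only SNP2 passes the default threshold here, mirroring how low-frequency SNPs drop out of the default 2D-LD plot.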

Multiple rsID/ssID information browsers for LD calculation/visualization
The genotype frequencies of different populations for multiple SNPs are retrieved on-line from NCBI. The SNP information for all populations existing in the current version of dbSNP is therefore provided. Five SNP rsID#s with available HapMap genotype raw data (Fig. 16) are used to demonstrate LD visualization in Fig. 18A.
Fig. 16. Search rsID input type. First, users input the required number of SNPs in the top window (arrows 1 and 2). After "Enter" is clicked (arrow 3), the system generates the same number of boxes for SNP ID input. Input the rsIDs (e.g., rs11571315, rs733618, rs5742909, and rs11571316) and select the population from the pull-down window (arrow 5). Users can select HapMap (arrow 6), simulation (arrow 7), or all (both HapMap and simulation; arrow 8). The SNP information is retrieved by clicking the "Find" boxes one by one (arrow 9).

Then, the SNP frequency information appears.
The multiple rsID/ssID information browser for LD calculation/visualization in LD2SNPing works in a direct manner (Fig. 16). In contrast, the equivalent workflow in Haploview is indirect, as shown in Fig. 17. Haploview cannot accept multiple SNP inputs directly. Instead, it provides all the SNPs within the user's input range, and the user then narrows these down to the SNPs of interest by manually clicking them one by one. In Fig. 17, the same four SNPs listed in Fig. 16 are used as an example for LD analysis in Haploview, and the resulting 2D-LD plot is shown in Fig. 18B. The 2D-LD plot patterns of LD2SNPing (for the HapMap-CEU example) and Haploview match completely, as shown in Fig. 18A and Fig. 18B, respectively.
Fig. 18. Comparison of four rsID# inputs and their 2D-LD plots in LD2SNPing vs. Haploview.
The r² values from LD2SNPing and Haploview also match.

rsID# and ssID# searching for SNP frequency information
For LD calculation in a specific population, selecting a specific ssID# may be helpful for population-based association studies. Different records from different submitters for the same SNP rsID# are given different ssID#s (Fig. 19, rs2078486) (http://www.ncbi.nlm.nih.gov/SNP/snp_ref.cgi?rs=2078486). The ssID#s for a single rsID# may be derived from different populations (as in Fig. 20, rs41446050). The ssID# corresponding to an rsID# can be selected from a pull-down window.
Although the genotype frequency data are available, the genotype raw data needed for LD measurement cannot be retrieved in this case. When no HapMap raw data are available, LD2SNPing uses the genotype frequencies to randomly generate simulated genotype data and performs a simulated LD analysis on them.
Because the genotype frequencies are known, a simulation of 100 randomized genotypes is computationally generated to fit those frequencies. Repeated runs of the simulation may produce different genotype orders within the population. To provide a more reliable LD value, we designed ten different simulations for the LD analysis, and LD2SNPing reports the average of their LD values. Although the possible linkage between different SNPs within the same individual is ignored, this method provides an LD evaluation for SNPs for which only genotype frequencies are available. However, this kind of evaluation is not recommended if HapMap data are available for the selected SNP ID. Performing the LD calculation with SNP information from different populations is likewise not recommended.
In LD2SNPing, eight functions are provided in the control panel of the 2D-LD plot, as shown in Fig. 21. They are described in detail later (after Fig. 22). Fig. 22 demonstrates that the distance between SNPs can optionally be shown or hidden. The distance value shown for SNPn is the distance from SNPn to the zero point.
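The frequency-based simulation described earlier in this section (100 randomized genotypes per SNP, ten runs, averaged LD) can be sketched as below. Several parts are assumptions made for illustration and are not taken from LD2SNPing: genotypes are encoded as minor-allele dosages (0, 1, 2), and the LD value is a swapped-in stand-in, the squared Pearson correlation between dosages, because the estimator LD2SNPing applies to simulated genotypes is not stated here. All function names are hypothetical.

```python
import random


def simulate_genotypes(freqs, n=100, rng=random):
    """Build n dosages (0, 1, or 2 minor alleles) matching freqs, in random order.

    freqs is (homozygous major, heterozygous, homozygous minor) and sums to 1.
    """
    counts = [round(f * n) for f in freqs]
    # Absorb any rounding drift in the largest class so the total is exactly n.
    counts[counts.index(max(counts))] += n - sum(counts)
    pool = [0] * counts[0] + [1] * counts[1] + [2] * counts[2]
    rng.shuffle(pool)
    return pool


def r_squared(x, y):
    """Squared Pearson correlation between two dosage vectors (stand-in LD value)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy * sxy / (sxx * syy)


def averaged_simulated_ld(freqs_1, freqs_2, runs=10, n=100, seed=None):
    """Average the stand-in LD value over several independent simulations."""
    rng = random.Random(seed)
    values = [
        r_squared(simulate_genotypes(freqs_1, n, rng), simulate_genotypes(freqs_2, n, rng))
        for _ in range(runs)
    ]
    return sum(values) / runs


ld_value = averaged_simulated_ld((0.25, 0.50, 0.25), (0.49, 0.42, 0.09), seed=1)
```

Because the two SNPs are shuffled independently, the averaged value reflects chance association only, which is consistent with the caveat above that linkage within an individual is ignored.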

Scope selection from large to small area of 2D-LD plot
Scope control is designed to narrow down the number of SNPs for LD analysis when a large SNP data set is evaluated. As shown in Fig. 23, users can "zoom in" to select the SNPs within a small region, and the operation is reversible.
Step 2: Select the required number of blocks (e.g., 10 at arrow 2). The default color setting is gray-scale. If no change is necessary, go to the next step. To change individual colors, see Fig. 27.

Save text file for output LD measure data
LD2SNPing outputs the LD-related text data for all SNPs within the plot, as in Fig. 31. The steps to save the text data for the LD plot are demonstrated in Fig. 32. The output for the LD graph is described later (Fig. 33).

Comparison of LD plotting between JLIN and LD2SNPing
The performance and accuracy of LD2SNPing are demonstrated by comparison with three common LD programs: JLIN (Fig. 34), LDA (Fig. 35), and Haploview (Fig. 36). Since JLIN accepts only .csv files, three example files (.csv) are used to compare the visualization of LD2SNPing and JLIN, as shown in Fig. 33.
Haploview requires the .ped and .info formats, so the same data must be converted to those formats. The resulting 2D-LD plots match completely. The visualizations of LD2SNPing and LDA are compared in Fig. 35.
Their 2D-LD plots also match completely. Because the formulas that LD2SNPing uses to calculate LD-related information are derived from the help system of LDA, the values of all LD-related information were confirmed to be identical between the two programs.
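For readers who want to check LD values by hand, the standard textbook definitions of D, D', and r² for two biallelic SNPs can be written compactly. This sketch assumes haplotype and allele frequencies are already known; it is not a transcription of the LDA help-system formulas, and it does not cover δQ or ρ. The function name `ld_measures` is hypothetical.

```python
def ld_measures(p_ab, p_a, p_b):
    """Return (D, |D'|, r2) for two biallelic loci.

    p_ab: frequency of the haplotype carrying allele A at locus 1 and B at locus 2
    p_a, p_b: frequencies of allele A and allele B
    """
    d = p_ab - p_a * p_b
    # D is normalized by its theoretical maximum, which depends on the sign of D.
    if d >= 0:
        d_max = min(p_a * (1 - p_b), (1 - p_a) * p_b)
    else:
        d_max = min(p_a * p_b, (1 - p_a) * (1 - p_b))
    d_prime = abs(d) / d_max if d_max > 0 else 0.0
    denominator = p_a * (1 - p_a) * p_b * (1 - p_b)
    r2 = d * d / denominator if denominator > 0 else 0.0
    return d, d_prime, r2


complete = ld_measures(0.5, 0.5, 0.5)    # alleles A and B always occur together
partial = ld_measures(0.4, 0.5, 0.5)     # partial association
```

With equal allele frequencies of 0.5, a haplotype frequency of 0.5 gives complete LD, while 0.4 gives an intermediate value.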

Graph analysis-related functions in 2D-LD plot
In addition to LD analysis and visualization, LD2SNPing provides some supplementary functions for LD analysis. Five graph analysis-related functions in the 2D-LD plot are briefly reviewed below.
Brief review for five graph analysis-related functions in 2D-LD plot
Since the functions of the control panel were introduced above, we focus here on the functions with icons, as shown in Fig. 37. They are described in detail later (after Fig. 38). Sometimes users may lose their way back to the home screen of LD2SNPing when starting another analysis. Instead of closing and restarting LD2SNPing, they can use the first icon, which provides the "close file" function for returning to the home screen (Fig. 38).

Brief review for all functions in analysis graph
LD2SNPing provides several graphic analyses, such as grid, bar, and pie3D graphs, to supplement the 2D-LD visualization and analysis (Fig. 39). The functions are described in detail in Figs. 40-44. The steps to perform grid graph analysis in LD2SNPing are shown in Fig. 40. The LD information in the grid graph is consistent with the text data (Fig. 41).

Return to 2D-LD plot
From any screen, users can return to the first screen for 2D-LD visualization by clicking "Show 2D", as indicated by the arrow in Fig. 44.

Representative view of 3D-LD plot
The relationship between the 2D-LD and 3D-LD visualizations is described in Fig. 46. The distance between SNPs is visualized in both the 2D-LD and 3D-LD plots if distance information is available in the SNP data set. The height of the diagonal line in the 3D-LD plot is proportional to the distance of each SNP from the first SNP.

Zoom-in, zoom-out and rotation of 3D-LD plot
LD2SNPing can rotate the 3D-LD plot as well as zoom in and out (Fig. 47). To avoid running out of memory, at most 10 SNPs are shown in the 3D-LD visualization. Six different graphic screens are provided to suit personal preferences. Six colors are available for selection in 3D-LD plotting, as shown in Fig. 48.
Fig. 48. Changing color for 3D-LD plot. Clicking any one of the "select color" boxes, indicated by A to F on the left side, allows users to change the 3D color. This 2D-LD plot is drawn using sample file 2 of LD2SNPing. The original data set in sample file 2 contains no distance information, and therefore the diagonal line with the white patch is flat.

Selection of block number for 3D-LD plot
The current version of LD2SNPing accepts a maximum of 10 SNPs for 3D-LD visualization (see Fig. 15 and the 3D-LD plot in Fig. 49).

Help
Help is available at any time, both before and after LD plotting (Fig. 50).