According to the National Institutes of Health, by 2005 the rate of DNA sequence submission to National Center for Biotechnology Information's (NCBI) GenBank database increased to approximately 3 million new sequences each month or 4,000 sequences per hour [1], and the rate of deposition has continued to accelerate ever since. As more genetic information becomes available, the demand for more computing power to analyze this information grows proportionally. Although CPU speed continues to increase at a remarkable rate, the large quantity of sequence data has outpaced the rate at which computer hardware is improving. The idea, popularized by Gordon Moore [2], that the processing speed of sequential computers doubles every two years is insufficient to keep up with the expanding complexity of genetic information. Thus, optimization and parallel processing need to be employed in order to develop algorithms that offer significantly more efficient data processing.

Parallelization coupled with optimization has been particularly effective in speeding up the most heavily used bioinformatics tool NCBI BLAST [3]. BLAST has a web interface that makes it accessible to the widest possible array of users. Web interfaced tools are popular among biologists because they are easy to use, require only a web browser to execute, and typically return valuable information in an intuitive format. BLAST has been adapted to handle the influx of data with efficient search algorithms and several approaches for parallel processing [4, 5]. Asterias, ParaMEME, and CBSU's Web Computing Interface [6–8] are also web-based bioinformatics analysis tools that have improved processing time by subdividing work among multiple processors. For example, CSBU has tools specifically of interest for population and evolutionary genetics analysis (e.g., MrBayes, Parentage, PLINK).

Like CSBU, the Isolation by Distance Web Service (IBDWS) is a web-based program that performs statistical analysis with a user-friendly interface for population genetics [9]. Statistical analysis can be performed on the relationships among individuals, or by grouping sets of individuals into populations a priori. IBDWS is generally designed to perform statistical tests on the latter. The website is named after "Isolation by distance" (IBD), a population genetics principle first described by Sewall Wright [10]. IBD describes patterns in allelic frequencies that are the consequence of spatially restricted gene flow, specifically an increase in the genetic distance between pairs of populations as the geographic distance between them increases. Two separate methods (the Mantel test and Reduced Major Axis (RMA) regression) are used to determine the correlation between genetic distance and geographic distance. The Mantel algorithm tests for non-random associations between a genetic distance matrix and a matrix containing geographic distances [11]. As described by Bohonak [12], the RMA regression quantifies the strength of the IBD relationship, with slope and intercept errors calculated through a variety of resampling techniques.

IBDWS arose as a conversion of the standalone Isolation by Distance program for Macintosh and Windows [9, 12]. IBDWS has progressively become more flexible through its later versions (e.g., the ability to directly input raw DNA data sets in v. 3.0). The conversion to IBDWS in 2004 allowed many users to process more data faster than before and there was no other easily accessible web-based software available for exploring IBD patterns. The papers published in 2004 that cited IBDWS had an average population size of 16, with the largest 10% having an average of 34 populations [13–23]. Between November 2006 and November 2007 we tracked the number of populations in the data sets used in IBDWS. Out of 7947 analyses, the average number of populations was 23, with the largest 10% having an average of 95 populations. As the number of populations increases, the complexity of the calculations increases and the time to process submitted data sets grows exponentially.

In this paper, we set out to improve the speed and performance of IBDWS using a combination of algorithmic optimization techniques and parallel computing. Even after algorithmic optimizations, the growing size of submitted data sets is already too large for rapid analysis on a single server. For example, in the last implementation prior to these modifications, 100 populations with 10,000 randomizations required at least 7 hours to complete. In order to improve analysis speed, we first identified the time intensive parts of the program and determined their time complexity. Randomization routines within the Mantel test and RMA calculations proved particularly time-consuming but amenable to algorithmic optimization and parallelization. Accordingly, we algorithmically combined the Mantel test and RMA randomizations, distributed the randomizations among several processors, and implemented the Fisher-Yates shuffling algorithm [24] for conducting the randomizations. The differences in runtime between each algorithm enhancement, the overall runtime changes, the predicted speedup, and the effects of increasing the number of nodes are discussed.