Comparative analysis of copy number variation detection methods and database construction
© Koike et al; licensee BioMed Central Ltd. 2011
Received: 29 August 2010
Accepted: 7 March 2011
Published: 7 March 2011
Skip to main content
© Koike et al; licensee BioMed Central Ltd. 2011
Received: 29 August 2010
Accepted: 7 March 2011
Published: 7 March 2011
Array-based detection of copy number variations (CNVs) is widely used for identifying disease-specific genetic variations. However, the accuracy of CNV detection is not sufficient and results differ depending on the detection programs used and their parameters. In this study, we evaluated five widely used CNV detection programs, Birdsuite (mainly consisting of the Birdseye and Canary modules), Birdseye (part of Birdsuite), PennCNV, CGHseg, and DNAcopy from the viewpoint of performance on the Affymetrix platform using HapMap data and other experimental data. Furthermore, we identified CNVs of 180 healthy Japanese individuals using parameters that showed the best performance in the HapMap data and investigated their characteristics.
The results indicate that Hidden Markov model-based programs PennCNV and Birdseye (part of Birdsuite), or Birdsuite show better detection performance than other programs when the high reproducibility rates of the same individuals and the low Mendelian inconsistencies are considered. Furthermore, when rates of overlap with other experimental results were taken into account, Birdsuite showed the best performance from the view point of sensitivity but was expected to include many false negatives and some false positives. The results of 180 healthy Japanese demonstrate that the ratio containing repeat sequences, not only segmental repeats but also long interspersed nuclear element (LINE) sequences both in the start and end regions of the CNVs, is higher in CNVs that are commonly detected among multiple individuals than that in randomly selected regions, and the conservation score based on primates is lower in these regions than in randomly selected regions. Similar tendencies were observed in HapMap data and other experimental data.
Our results suggest that not only segmental repeats but also interspersed repeats, especially LINE sequences, are deeply involved in CNVs, particularly in common CNV formations.
The detected CNVs are stored in the CNV repository database newly constructed by the "Japanese integrated database project" for sharing data among researchers. http://gwas.lifesciencedb.jp/cgi-bin/cnvdb/cnv_top.cgi
Copy number variations (CNVs), duplications, and deletions of chromosomal segments longer than 1 kb are major structural variations in various organisms such as yeast, Drosophila, and humans and are believed to have evolutionary importance  and be related to various diseases such as mental retardation , neurological disorders [3, 4], and cancers . Although chromosomal abnormalities including CNVs were originally investigated using cytogenetic karyotype analyses such as chromosome banding analysis and fluorescence in situ hybridization (FISH), recent technologies such as pair-end sequencing [6–8], whole genome sequencing using massively parallel sequencers , and array-based approaches such as array comparative genomic hybridization (CGH)  and single nucleotide polymorphism (SNP) genotyping microarrays ) enabled us to obtain refined CNV structures with relatively high throughput. Particularly, SNP genotyping microarrays are becoming widely used for CNV identification since CNVs can be detected using the same microarrays used in the SNP-based case control study. The SNP microarray approaches, however, tend to lead to too many false negatives and positives and require sophisticated CNV detection methods/algorithms.
Many methods/algorithms have been proposed for detecting CNVs for array CGH and SNP array and are divided into several models such as smoothing methods, clustering methods, maximum likelihood procedures including Hidden Markov models (HMMs) and expectation-maximization (EM) algorithms. The simplest smoothing method is for smoothing log2ratio (ratio is defined as the intensity of target of probe divided by that of reference probe) profiles using a moving average and detecting duplicated or deleted regions over the specified thresholds . For more sophisticated smoothing methods, a quantile smoothing method based on L1 norm (the sum of absolute values) penalty minimization  and a wavelet de-noising method  have been proposed. For a clustering method, the cluster along with chromosomes (CLAC) method was developed, in which hierarchical clustering trees along each chromosome arm (or chromosome) are calculated and the 'interesting' clusters considering the false discovery rate (FDR) are selected . Smoothing and clustering methods are effective in simulation data, but they do not achieve good enough CNV detection performance compared with other methods in array CGH experimental data . To date, various maximum likelihood-related approaches have been proposed. Jong et al. introduced genetic local search algorithms (memetic algorithms) for maximizing the likelihood by considering the penalty function of breakpoints . Picard et al. developed an adaptive method for estimating the penalty constant to avoid selecting too large a segmentation number for over fitting given data. In this method, the probe intensity profile (log2ratio) is supposed to be a Gaussian distribution, and the number of segments is estimated by maximizing the likelihood . A circular binary segmentation (CBS) method was proposed by Venkatraman and Olshen , in which the average probe intensity is assumed to also have a Gaussian distribution. The likelihood ratio statistic for testing the null hypothesis, in which there is no change, and the alternative hypothesis, in which there is exactly one change at an unknown location, are introduced in this method. The test is done using a permutation test. If the null hypothesis is rejected, the hypothetical change-points are adopted. The change-points are searched recursively using overlapping windows .
An HMM is a statistical model in which it is assumed that the system follows a Markov process [20–22]. In most HMM models for CNV detection methods, the probe intensity values or logR ratio (LRR, log2(Robserved/Rexpected), Rexpected is calculated from linear interpolation of canonical genotype clusters, R is a sum of probe intensities) and B allele frequency (BAF, normalized measure of relative signal intensity ratio of the B and A alleles in the SNP array) or genotypes are assumed to be independent, and the copy number states of the probes are set to be hidden states with certain transition probabilities. By maximizing the likelihood of observed data (probe intensity, LRR and BAF, or genotypes), the copy number state of each probe is obtained.
Many studies have compared the performance of these methods or programs based mainly on simulation data, arrayCGH data  and Illumina SNP arrays . However, Affymetrix SNP-array-based CNV detection requires much more robust algorithms than those using Illumina SNP arrays and array CGH due to the characteristics of the Affymetrix SNP arrays.
We assessed the following widely used methods/programs, the circular binary segmentation (CBS) method (implemented in DNAcopy , R package), Picard's adaptive method (CGHseg , R package), HMMs (Birdsuite 1.4  and PennCNV ) on the basis of whether they accurately detect CNVs on Affymetrix data (Affymetrix 6.0). The first two methods/programs are known to be effective in detecting CNVs of array-CGH experiments, but their performances in microarrays are yet unknown. QuantiSNP , which uses an objective Bayes HMM, showed the best detection performance with the simulation and Illumina SNP array data in a previous study . We also tested QuantiSNP, but the parameter tunings for Affymetrix data were too difficult for us to achieve sufficient performance. Therefore, the results are not shown in the following results. The assessments are done using HapMap data and other experimental results. After that, CNVs of 180 healthy Japanese individuals were detected using the parameters that showed the best performance in the HapMap data. The characteristics of start and end regions of CNVs are also discussed. These results are registered in the CNV control database, which has been developed as part of the Japanese integrated database project.
We conducted comparative analyses on the CNV detection performance of five programs: DNAcopy  (R package) as a circular binary segmentation (CBS) method, CGHseg  ("tiling array" of R) as a Picard's adaptive method, Birdseye (part of Birdsuite)  and PennCNV  as HMMs, and Birdsuite (mainly consisting of Birdseye and the EM-based Canary).
There are two main subjects for CNV detection: estimation of the copy numbers and detection of accurate CNV boundaries from probe intensity data. In this study, duplications (gains) or deletions (losses) were assessed, but copy numbers were not assessed because there were not enough validation data to confirm them. In the programs for array CGH, the log2ratios of normalized intensities of target probes against control probes were used as input data, while programs for the SNP arrays, not only the log2ratios or LRR but also allele frequencies or genotypes were also used in most cases. Since DNAcopy and CGHseg were developed for array CGH, only log2ratios were used as input data. During log2ratio calculation, quantile-normalization was done for probe intensities and the average values of HapMap data (270 individuals) were used as reference probe intensity data.
Birdsuite consists of four programs (modules), "Canary", "Birdseed", "Birdseye", and "Fawkes," and determines CNVs using multiple individual data. In Canary, common CNVs are detected with EM algorithms using registered known common copy number data, while Birdseye detects novel CNVs by using the HMM with the Viterbi algorithm based on probe intensities and genotypes determined using the Birdseed program. Fawkes combines results of Canary, Birdseye, and Birdseed and assigns a comprehensive SNP genotype. Since we do not access the genotypes of CNVs, we used only Canary and Birdseye to obtain the following results.
In PennCNV, probe intensity data is converted into LRR and BAF, copy numbers are set as hidden states, and the emission probability of LRR and BAF are modeled. The hidden state for maximizing the likelihood of observed data (LRR and BAF) is obtained using the Viterbi algorithm.
To compare the detection performance of these algorithms, we used Affymetrix Genome-wide Human SNP Array 6.0 (Affy 6.0) data of the HapMap data. The HapMap data were collected from 45 Japanese in Tokyo, Japan, 45 Han Chinese in Beijing, China, 90 Yoruba individuals (30 trios) in Ibadan, Nigeria, and 90 individuals from the US state of Utah with northern and western European ancestry (collected in 1980 by the Centre d'Etude du Polymorphisme Humain) whose CNVs were previously measured in other experiments and/or algorithms [6–8, 11, 20, 21, 24–29]. Furthermore, five sets of five Affy 6.0 microarrays of the same individuals (NA04626, NA01416, NA06061, NA10851, NA15510) were also used to investigate the reproducibility of the detection algorithm. These Affy 6.0 data were typed in Affymetrix and downloaded from the Affymetrix web site .
Furthermore, novel Affy 6.0 data of 180 healthy Japanese individuals, whose typing was carried out in our previous study , were used for detecting Japanese CNVs. (Hereinafter we call these data "original data.") As described in the previous paper , the study was approved by the research ethics committee of Central Research Laboratory, Hitachi Ltd. (permission number 128-2) and the Faculty of Medicine, The University of Tokyo (permission number 2583) and the informed consent was obtained from all participants.
A CNV control database has been constructed as part of the integrated database project by the Ministry of Education, Culture, Sports, Science and Technology (MEXT) using mySQL. In the database, the start-end information of copy number segments, copy number regions which are clustered copy number segment information, and their frequencies were accumulated. Other information such as genes, exons, introns, and the CNVs of database of genomic variants (DGV ) data were also accumulated to enable users to easily interpret the meaning of detected CNV regions. The start and end positions of the CNVs were also stored on a distributed annotation system (DAS) with the GMOD Gbrowse-based  browser for calling up our data on other DAS servers, such as Ensemble, or vice versa; the data on another DAS are called up on our DAS to view various data simultaneously for interpretation.
Average stability rate of CNVs using same individual's data
Average of concordance rate
The rate of Mendelian inconsistencies in offsprings' CNVs of HapMap trio data
Overlap > 0%
The stability rate does not directly predict performance but indicates the stability of detection performance. Table 1 shows that PennCNV and Birdseye (part of Birdsuite) have the same level of detection stability, about 10% higher than those of DNA copy and CGHseg, while Canary (part of Birdsuite) achieves higher stability than others due to the utilization of pre-defined copy number regions.
Table 2 shows the rate of Mendelian inconsistencies of HapMap trio (90 individuals) data. The rate is calculated as the average number of an offspring's CNVs that are not detected in both parents divided by the number of each offspring's CNVs. Although a rate of Mendelian inconsistencies close to 0 does not directly mean high CNV detection performance, the rate of Mendelian inconsistencies much larger than 0 implies low CNV detection performance considering heritability because the frequency of copy number change is assumed to be in the range of 10-6 to 10-4 per gamete [24, 34]. The rates of Mendelian inconsistencies of Canary (part of Birdsuite), and Birdseye (part of Birdsuite) in Table 2 are the lowest and second lowest, respectively; however, there may be influences of the parameters which are tuned using HapMap data  considering the low stability rate of Birdseye in Table 1 (that is, the square-root of the stability rate << 1-Mendelian inconsistency in Birdseye). For Canary, since only predefined regions are the targets of CNV detection, low Mendelian inconsistencies of Canary are reasonable. The low Mendelian inconsistency is a necessary but not sufficient condition for high accuracy for CNV detections. Mendelian inconsistencies of all methods, except Canary and Birdseye, are not sufficiently low, and DNAcopy shows the highest Mendelian inconsistencies, although the stability rate in Table 1 is the same level as CGHseg. The cause is not clear, but there might be CNV patterns that are difficult to detect with circular binary-segmentation-based detection for DNAcopy.
Similarity of CNVs detected between programs using HapMap data
Birdseye (part of Birdsuite) Birdsuite
Birdseye (part of Birdsuite)
Affymetrix data is known to be noisy for determining genotypes under some experimental conditions. When detecting CNVs, data noise is expected to be problematic. Supplemental Figures 1S-a (HapMap) and 2S-a (180 healthy Japanese individuals) [Additional file 1: Supplemental Figures 1S and 2S] show the relationship between the standard deviation of log2ratio in each array and CNV counts per individual, while supplemental Figures 1S-b (HapMap) and 2S-b (180 healthy Japanese individuals) [Additional file 1: Supplemental figures 1S and 2S] show the relationships between the call rate (the percentage of SNPs whose genotypes are determined in the genotype calling process) and CNV counts per individual detected using each program. Figures 1S-a and 2S-a indicate that the CNV counts per individual drastically increase when the standard deviation of the log2ratio of the array is high. Figures 1S-b and 2S-b indicate that Affymetrix arrays with low recall showed higher CNV counts. Since these tendencies were observed in all programs, arrays with large standard deviations or with low recall should be removed. In the analysis of original data in Section 'CNVs of original data of healthy Japanese', individuals with high CNV counts (15 individuals whose CNV number are PennCNV > 250, Birdseye (part of Birdsuite) > 400, Canary (part of Birdsuite) > 500, DNAcopy > 100, or CGHseg > 100) were removed.
Sensitivity of each detection program when other experimental result is regarded as golden standard
Shaikh** (Illumine Data)
Park(array-CGH and massively parallel sequencing)
(part of Birdsuite)
The distributions of CNV length are summarized in Figure 1. As shown in Figure 1 (a), the difference of distributions of CNV length between each program is not so large and especially the distribution of CNV length between Birdseye (part of Birdsuite) and PennCNV, which both use HMM, is similar. When Figures 1 (a) and (b) are compared, the lengths of the newly detected CNVs are longer than other experimental results and the total number of detected CNVs are less than others, which are partly caused of used thresholds for detecting CNVs. Shorter CNVs are also detected using these programs before applying thresholds. The distributions of CNV lengths are quite different depending on the CNV experimental method, even if the same individual is used, as shown in Figure 1 (b). Although a gain or loss < 1 kb is not regarded as a CNV but as a deletion or insertion in the DGV criteria , all gain and loss data were used as CNVs in this study. When sequencing techniques were used, small gain and loss could be detected, as shown in this figure. On the other hand, when SNP array or array CGH was used, only long CNVs could be detected due to the sparseness of the probe density on a genome.
The results of the section 'Evaluation of CNV detection algorithms using trios data' and 'Evaluation of CNV detection algorithms by comparing other experimental results' indicate that Hidden Markov based programs PennCNV and Birdseye (part of Birdsuite), or Birdsuite (mainly consisting of Birdseye and EM-based Canary) are superior to others when the high CNV reproducible (stability) rates of the same individuals and the low Mendelian inconsistencies are taken into account. For measuring sensitivities with other experimental results, Birdsuite is the best. However, overlapping rates with other experimental results suggest that there remain many false negatives and some false positives, although other experimental results are also expected to contain many false positives and negatives.
Overlap ratio of original data and HapMap data with Park and Conrad's data
Birdseye (part of Birdsuite)
The numbers of commonly detected CNVs with HapMap data and original data of healthy Japanese are summarized in supplemental Table 4S [Additional file 2: Supplemental Table 4S]. Since the overlap between original data and HapMap data should be JPT/CHB > CEU > YRI, considering the genetic distance of ethnics, the tendency JPT/CHB > CEU > YRI in any program is reasonable in Table 4S.
As interpretations of origin of copy number variations, two mechanisms have been proposed . One is the non-allelic homologous recombination (NAHR, ectopic HR) mediated mechanism and the other is a microhomology-based mechanism. NAHR requires long repeated sequences (up to 300 bp in human ) in the start and end regions of CNVs, while microhomology requires only 5-15 bp homology sequences. NAHR is expected to occur by unequal crossing-over and break-induced replication (BIR). Single-strand annealing (SSA) is also a deletion mechanism . In SSA, the complementary single-stranded sequences of the 5'-end of a double-strand break are annealed, and the regions between the two complementary sequences are deleted.
The segmental repeats and interspersed repeats-included percentage of start and end regions of CNV
Segmental repeat +interspersed repeats (ALL*)
In all detection programs, with an increasing number of commonly detected individuals, the CNV ratios containing interspersed repeats or segmental repeats also increased as shown in supplemental Table 5S [Additional file 2: Supplemental Table 5S]. (It should be noted that there are no obvious relationships between these ratios and the stability rate or the Mendelian inconsistency ratio of each program.) When the ratios of CNVs including "segmental repeats or interspersed repeats" in start and end regions of CNVs are calculated, the results are higher than "segmental repeats" only or "interspersed repeats (ALL)" only as shown in Table 6. About 40% of segmental repeats detected around CNVs include interspersed repeats. Of these segmental repeats, LINE and LTR are about 50% and 15%, respectively. Since LINE, SINE, and LTR account for about 52%, 14%, 23% of all interspersed repeats in the human genome (hg18), respectively, there are no LINE biases in the segmental repeats around CNVs. The enrichment of interspersed repeats and segmental repeats is also observed in previous experimental data as shown in supplemental Table 6S [Additional file 2: Supplemental Table 6S]with a few exceptions.
These results indicate that not only segmental repeats but also interspersed repeat regions, especially "LINEs", have an important role in the formation of CNVs, at least in frequently observed copy number variations, although there is a possibility that interspersed repeats promote segmental repeats. Although both SINE and LINE elements have been reported to contribute structural variations , our results and those of others  do not support a contribution of SINE to formation of CNVs.
When the CNV ratios including simple repeats and mobile elements (provided in UCSC) in both the start and end regions are compared, there are no differences among Common1, Common2, and random values. The recombination rate of these regions (± 500 bp) calculated using the Marshfield average does not show statistically meaningful differences among random values ( = 1.17) and CNV common1 ( = 1.14) and common2 ( = 1.18).
The expected length, 5-15 bp, of microhomology is too short to use array-based CNV detection methods. In many CNV start and end regions, microhomologous sequences were found but there were almost no statistical significances (for example, 4bases^5bp is at most 1024) when the start and end positions of CNVs were ambiguous due to sparseness of probe positions. Conrad et al.  have investigated breakpoints of CNV deletions by massive parallel sequencer and found 70% of them have microhomology. Sequencing of these regions is necessary to confirm their importance.
Average conservation score of start and end of CNV regions
Average conservation score in start regions
Average conservation score in end regions
Random ± 500bp
Random ± 800bp
Random ± 1kbp
CNVCommon1 ± 500bp
CNVCommon1 ± 800bp
CNVCommon1 ± 1kbp
CNVCommon2 ± 500bp
CNVCommon2 ± 800bp
CNVCommon2 ± 1kbp
The above-mentioned tendencies in the start and end regions of CNVs were also observed in our CNV results of HapMap data, and differences between CNV duplications and deletions were not observed. It should be noted that although the overlapping ratio between our CNV results of HapMap data and other experimental data are low, most of them have the similar tendency that both segmental repeats and interspersed repeats (especially LINE) are enriched in CNV start and end regions, and conservation scores based on primates are lower in these regions than in randomly selected regions.
In the Japanese Integrated Database Project of MEXT, our organization (Univ. of Tokyo, Univ. of Tokai, and Hitachi, Ltd.) has constructed and maintained a public repository for the genome-wide association study (GWAS) database for continuous and intensive management of GWAS data and to facilitate data-sharing among researchers . A CNV control database has been constructed as a part of the GWAS database.
In this system, bulk CNV data are freely available, but a simple application is required for accessing CNV data at the individual level. To use raw data for CNVs, researchers must submit an application describing the research purpose in detail to the database access committee. Access is determined at the data sharing review board. The details of the data sharing policy are summarized in http://gwas.lifesciencedb.jp/gwasdb/db_policy_en.html.
The CNV control database is accessible at http://gwas.lifesciencedb.jp/cgi-bin/cnvdb/cnv_top.cgi.
In this study, we evaluated five widely used CNV detection programs, Birdsuite, Birdseye (part of Birdsuite), PennCNV, CGHseg, and DNAcopy from the viewpoint of performance on the Affymetrix platform using HapMap data and other experimental data. Our results indicate that hidden Markov based programs PennCNV and Birdseye (part of Birdsuite), or Birdsuite are superior to others when the high CNV detection stability (reproducibility) rates of the same individuals and the low Mendelian inconsistencies are considered. For measuring the sensitivity of other experimental results, Birdsuite shows the best performance. However, the low overlapping rates with other experimental results imply that there remain many false negatives and some false positives, although other experimental results also contain many false positives and negatives.
The analysis of start and end regions of CNVs in the data for healthy Japanese and the HapMap data showed that both segmental repeats and interspersed repeats are enriched in CNV start and end regions, suggesting that not only segmental repeats but also interspersed repeats, especially LINE, are deeply involved in CNV formation, particularly in common CNV formations, although the previous studies mainly focused on segmental repeats [9, 11]. There are CNVs without segmental repeats or interspersed repeats. They might contain microhomologies or other characteristics, the resolution of SNP array seems too coarse to analyze microhomologies. Other sequence level technologies will be required for further detailed analysis.
This work was supported by the contract research fund "Integrated Database Project" from the Ministry of Education, Culture, Sports, Science, and Technology of Japan. We thank anonymous reviewers for helpful comments.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.