An integrative framework for the identification of double minute chromosomes using next generation sequencing data
© Hayes and Li; licensee BioMed Central Ltd. 2015
Published: 23 April 2015
Double minute chromosomes are circular fragments of DNA whose presence is associated with the onset of certain cancers. Double minutes are lethal, as they are highly amplified and typically contain oncogenes. Locating double minutes can supplement the process of cancer diagnosis, and it can help to identify therapeutic targets. However, there is currently a dearth of computational methods available to identify double minutes. We propose a computational framework for the idenfication of double minute chromosomes using next-generation sequencing data. Our framework integrates predictions from algorithms that detect DNA copy number variants, and it also integrates predictions from algorithms that locate genomic structural variants. This information is used by a graph-based algorithm to predict the presence of double minute chromosomes.
Using a previously published copy number variant algorithm and two structural variation prediction algorithms, we implemented our framework and tested it on a dataset consisting of simulated double minute chromosomes. Our approach uncovered double minutes with high accuracy, demonstrating its plausibility.
Although we only tested the framework with three programs (RDXplorer, BreakDancer, Delly), it can be extended to incorporate results from programs that 1) detect amplified copy number and from programs that 2) detect genomic structural variants like deletions, translocations, inversions, and tandem repeats.
The software that implements the framework can be accessed here: https://github.com/mhayes20/DMFinder
Double minute chromosomes (DM) are circular fragments of extrachromosomal DNA . They have been found in human tumors of the lungs, ovaries, colon, and breast . They have also been detected in tumors of patients afflicted with neuroblastoma . Double minutes tend to be highly amplified, and they usually contain genes that encode proteins which are essential to cancer formation (oncogenes); these genes may also be highly resistant to drug treatment . The combination of oncogenes and amplification suggests that double minutes are highly lethal, and it is thus important to have efficient methods to locate and characterize them. Such methods could, for example, help researchers to develop drugs that target double minutes. Also, if double minutes are detected, it could help researchers assess the effectiveness of existing drugs in the treatment of double minutes. One such study noted that a certain chemotherapy drug (Gemcitabine) is effective in eliminating double minutes from ovarian cancer cells . The authors note that eliminating double minutes is important because it decreases the malignancy of cancer. However, there are few computational tools available for detecting double minute chromosomes. Raphael et al.  provide a method to reconstruct the tumor amplisome. They constructed a bacterial artificial chromosome (BAC) library from the MCF7 breast cancer cell line, and they reconstructed the amplisome of this cell line using end sequence profiling (ESP), which uses fragments of 100-300 kb in length. Using matched tumor and normal samples in NGS data, Sanborn et al.  provide methods that can reconstruct circular genome assemblies, including DMs and homogenously staining regions (HSRs). They used their method to reconstruct double minutes in tumors from patients with glioblastoma multiforme (GBM).
As mentioned by Raphael et al., concurrent analysis of the amplisome and genomic rearrangements is an important problem. In the context of double minute (DM) detection, we want to identify 1) contiguous amplified segments, and 2) the breakpoints that unite these contiguous segments. Identifying these breakpoints is analogous to the problem of detecting genomic rearrangements that are caused by large structural variants (SV), including deletions, translocations, inversions, and tandem duplications. Integrating SV breakpoint prediction with amplicon detection is essential to algorithmically discovering DMs.
Double minute chromosomes have two key properties. First, they are comprised of several distinct genomic segments. When these segments adjoin, their breakpoints resemble those of deletion, inversion, interchromosomal, and intrachromosomal rearrangements. These breakpoints can be discovered using a structural variation detection algorithm like BreakDancer . The second distinctive property of double minutes is their propensity to amplify. Every segment comprising the double minute is an amplicon, and the copy numbers of these amplicons should be similar to one another.
Double minute discovery algorithms
Double minute chromosomes consist of consecutively-joined segments that are highly amplified. Our algorithms exploit the fact that double minutes are circular, and thus in the amplicon graph, the corresponding subgraph representing a true double minute should contain a simple path from some vertex back to itself. Lines 1-13 in Algorithm 1 are straightfoward. It simply builds the amplicon graph G, and it adds the same vertices in G to the auxiliary graph H. In line 13, we perform a modified depth-first search (DFS) on the graph G by keeping track of whether a currently explored edge has already been traversed. If not, then we add the corresponding directed edge to H with an edge weight of 1. This DFS is performed because we ultimately want our algorithm to identify cyclical subgraphs. If there is some simple path from a vertex v back to itself (excluding the trivial case where the path only contains v), a DFS will encounter a series of edges that represent a path from that vertex back to itself. While searching G, we add a directed edge to H if the current edge in the search is unexplored. This ensures that for the graph G, every vertex reachable from itself will have a corresponding directed path in H with a path to itself. If such a path can be identified, then it could indicate a possible double minute.
Algorithms 2 and 3
In Algorithm 2, the method processes the auxiliary graph H and determines whether or not it contains double minute chromosomes. It does this by first collecting all of the strongly connected components (SCC) in H, because a double minute is an SCC, or should at least be captured in a SCC. In line 5 of Algorithm 2, the method iterates through each SCC to see if it contains a double minute. It first runs an all-pairs-shortest-path algorithm on each retrieved SCC, and for each vertex u in an SCC, it searches for a shortest path from v to u, where v is a vertex adjacent to u. This step is performed because a double minute should contain a simple cycle from any vertex back to itself. We compute the shortest path for computational efficiency, as the longest simple path problem is NP-hard. However, we do not query the shortest path matrix for a path from u back to itself, because the shortest path from a node to itself is simply that node. To find a double minute, the method finds the shortest path cycle that minimizes the total edge weight. Thus, we assume that a double minute will be a cycle containing the fewest vertices (i.e., amplicons). This cycle is stored to DM in line 13. As previously stated, double minutes must consist of amplicons that have similar copy number. In NGS experiments, the copy number of a region in a donor genome is proportional to the sequence coverage in that region. Thus, each amplicon should also have similar mapping coverage. We check this condition by assessing the difference between the amplicon with the highest mapping coverage, and the amplicon with the lowest mapping coverage. This is performed in lines 15-17. If this difference is less than the average mapping coverage, then we predict the entire cycle in DM as a double minute.
We tested our double minute prediction algorithm on three simulated NGS datasets. We created 20 double minutes, each comprised of 10 amplicons whose coordinates we chose beforehand. These regions were selected from chromosomes 1, 2, 3, and 14. The double minutes and their chosen coordinates are specified in Additional file 1. The selection of these chromosomes was arbitrary; any subset of chromosomes could have been selected for this experiment. From the coordinates, we extracted the corresponding FASTA sequences from the human reference genome build 37 (hg19). We then copied the sequences five times (5 copies), then two times (2 copies), then one time (1 copy), assigning each group of copied sequences to its own dataset. We then appended chromosomes 1, 2, 3, and 14 to each of the three datasets (5 copies, 2 copies, 1 copy). We then created simulated sequence reads (in FASTQ format) from each of these datasets using WGSim . The reads were "sequenced" at 35X coverage with 100 bp reads and 400 bp fragment lengths, with a standard deviation of 50. The mutation rate was set to 0.001 and the indel rate was set to 0.15, which are default values. The default base error rate of 0.02 was used.
We aligned the aforementioned sequence reads to chromosomes 1, 2, 3 and 14 of the human reference genome using Bowtie2 . To detect amplicons, we provided the resulting SAM file to RDXplorer, which is a program to detect copy number variants in NGS data . This step is necessary to fulfill line 3 in Algorithm 1 (Figure 4). After acquiring the amplicons and applying a simple procedure to merge copy number predictions, we applied the BreakDancer and Delly algorithms to the alignments to uncover deletion, intrachromosomal translocation, interchromsomal translocation, and inversion breakpoints [10, 12]. These types of rearrangements comprise the breakpoints that are seen in double minutes. After this step, we looked for all predicted double minutes that consisted of 7 or more amplicons, since we only wanted to predict a DM if many of its amplicons were linked together by our algorithms (we have 10 amplicons per DM). Real DMs can have dozens of amplicons, so this number is reasonable in the general case. A complete description of programs and settings used in our analysis is provided in Additional file 2.
RDXplorer + BreakDancer.
RDXplorer + Delly.
It should be noted that although BreakDancer, Delly, and RDXplorer were used for this experiment, our framework is applicable to any methods that call deletion, translocation, inversion, and duplication breakpoints, and any methods that predict copy number gains. Results from several programs may also be combined and provided as input to the framework. The method failed to return predictions for the 1 Copy dataset. This is not unexpected, as this dataset has the lowest read depth per DM, which decreases the ability of the framework to detect them. In reality, double minutes are highly amplified beyond just a single copy, so the framework is likely to have an ample signal to make predictions.
We have presented a framework for detecting double minute chromosomes in next generation sequencing data. Our method accurately reconstructed double minutes whose signal was sufficiently represented in the structural variant and copy number predictions. Our method is also flexible; it can be applied to any SV and copy number resolution algorithms, thus giving a researcher the choice of methods to use in analysis. Future work will entail the use of real patient cancer datasets. Double minutes are present in many cancers, so using data from tumors would give better insights into the strengths and weaknesses of the method on real data.
Regarding limitations, our method currently does not attempt to measure the abundance of predicted DMs. Such a method would take into consideration the read depth of the amplicons that are predicted to comprise the double minute. Such information would be very helpful to a researcher who is attempting to assess the malignancy of a tumor that contains DMs. Future work will address such methods. Furthermore, the method currently looks for double minutes that are along a shortest path in the auxiliary graph. This was done for computational efficiency, but biologically, a double minute may exist along any path. We will explore this issue in subsequent publications. Lastly, we will consider more sophisticated methods of finding double minutes that may have been missed due to a lack of a clear signal in the data. With Algorithm 3, it is possible to have a predicted double minute that exhibits a bizarre graph structure (e.g. like a star graph), so we will explore more intelligent methods to address this problem that will take into account the circular nature of DMs.
We would like to thank the Ohio Supercomputer Center for its resources.
The publication costs for the article were funded by the authors.
This article has been published as part of BMC Genetics Volume 16 Supplement 2, 2015: Selected articles from The 11th Annual Biotechnology and Bioinformatics Symposium (BIOT-2014): Genetics. The full contents of the supplement are available online at http://www.biomedcentral.com/bmcgenet/supplements/16/S2.
- Barker P: Double minutes in human tumor cells. Cancer Genetics and Cytogenetics. 1982, 5: 81-94. 10.1016/0165-4608(82)90043-7.View ArticlePubMedGoogle Scholar
- Yu L, Zhao Y, Quan C, Ji W, Zhu J, Huang Y, Guan R, Sun D, Jin Y, Meng X, Zhang C, Yu Y, Bai J, Sun W, Fu S: Gemcitabine eliminates double minute chromosomes from human ovarian cancer cells. PLoS ONE. 2013, 8 (8): 71988-10.1371/journal.pone.0071988.View ArticleGoogle Scholar
- Storlazzi CT, Lonoce A, Guastadisegni MC, Trombetta D, D'Addabbo P, Daniele G, L'Abbate A, Macchia G, Surace C, Kok K, Ullmann R, Purgato S, Palumbo O, Carella M, Ambros PF, Rocchi M: Gene amplification as double minutes or homogeneously staining regions in solid tumors: origin and structure. Genome Res. 2010, 20 (9): 1198-1206. 10.1101/gr.106252.110.PubMed CentralView ArticlePubMedGoogle Scholar
- Robertson SM, Ling V, Stanners CP: Co-amplification of double minute chromosomes, multiple drug resistance, and cell surface P-glycoprotein in DNA-mediated transformants of mouse cells. Mol Cell Biol. 1984, 4 (3): 500-506.PubMed CentralView ArticlePubMedGoogle Scholar
- Raphael BJ, Pevzner PA: Reconstructing tumor amplisomes. Bioinformatics. 2004, 20 (Suppl 1): 265-273. 10.1093/bioinformatics/bth931.View ArticleGoogle Scholar
- Sanborn JZ, Salama SR, Grifford M, Brennan CW, Mikkelsen T, Jhanwar S, Katzman S, Chin L, Haussler D: Double minute chromosomes in glioblastoma multiforme are revealed by precise reconstruction of oncogenic amplicons. Cancer Res. 2013, 73 (19): 6036-6045. 10.1158/0008-5472.CAN-13-0186.PubMed CentralView ArticlePubMedGoogle Scholar
- Stephens PJ, Greenman CD, Fu B, Yang F, Bignell GR, Mudie LJ, Pleasance ED, Lau KW, Beare D, Stebbings LA, McLaren S, Lin ML, McBride DJ, Varela I, Nik-Zainal S, Leroy C, Jia M, Menzies A, Butler AP, Teague JW, Quail MA, Burton J, Swerdlow H, Carter NP, Morsberger LA, Iacobuzio-Donahue C, Follows GA, Green AR, Flanagan AM, Stratton MR, Futreal PA, Campbell PJ: Massive genomic rearrangement acquired in a single catastrophic event during cancer development. Cell. 2011, 144: 27-40. 10.1016/j.cell.2010.11.055.PubMed CentralView ArticlePubMedGoogle Scholar
- Storlazzi CT, Fioretos T, Surace C, Lonoce A, Mastrorilli A, Strombeck B, D'Addabbo P, Iacovelli F, Minervini C, Aventin A, Dastugue N, Fonatsch C, Hagemeijer A, Jotterand M, Muhlematter D, Lafage-Pochitaloff M, Nguyen-Khac F, Schoch C, Slovak ML, Smith A, Sole F, Van Roy N, Johansson B, Rocchi M: MYC-containing double minutes in hematologic malignancies: evidence in favor of the episome model and exclusion of MYC as the target gene. Hum Mol Genet. 2006, 15 (6): 933-942. 10.1093/hmg/ddl010.View ArticlePubMedGoogle Scholar
- McClintock B: Chromosome organization and genic expression. Cold Spring Harb Symp Quant Biol. 1951, 16: 13-47. 10.1101/SQB.1951.016.01.004.View ArticlePubMedGoogle Scholar
- Chen K, Wallis JW, McLellan MD, Larson D, Kalicki J, Pohl C, McGrath S, Wendl M, Zhang Q, Locke D, Shi X, Fulton R, Ley T, Wilson R, Ding L, Mardis E: Breakdancer: an algorithm for high-resolution mapping of genomic structural variation. Nat Methods. 2009, 6: 677-681. 10.1038/nmeth.1363.PubMed CentralView ArticlePubMedGoogle Scholar
- Yoon S, Xuan Z, Makarov V, Ye K, Sebat J: Sensitive and accurate detection of copy number variants using read depth of coverage. Genome Res. 2009, 19 (9): 1586-1592. 10.1101/gr.092981.109.PubMed CentralView ArticlePubMedGoogle Scholar
- Rausch T, Zichner T, Schlattl A, Stutz AM, Benes V, Korbel JO: DELLY: structural variant discovery by integrated paired-end and split-read analysis. Bioinformatics. 2012, 28 (18): 333-339. 10.1093/bioinformatics/bts378.View ArticleGoogle Scholar
- Hayes M: Algorithms to Resolve Large-Scale and Complex Structural Variants in the Human Genome (PhD thesis). 2013, Case Western Reserve University, Electrical Engineering and Computer ScienceGoogle Scholar
- lh3/wgsim. Accessed: 2013-06-03, [https://github.com/lh3/wgsim]
- Langmead B, Salzberg SL: Fast gapped-read alignment with Bowtie 2. Nat Methods. 2012, 9: 357-359. 10.1038/nmeth.1923.PubMed CentralView ArticlePubMedGoogle Scholar
- Yoon S, Xuan Z, Makarov V, Ye K, Sebat J: Sensitive and accurate detection of copy number variants using read depth of coverage. Genome Res. 2009, 19: 1586-1592. 10.1101/gr.092981.109.PubMed CentralView ArticlePubMedGoogle Scholar
- Gansner ER, North SC: An open graph visualization system and its applications to software engineering. SOFTWARE - PRACTICE AND EXPERIENCE. 2000, 30 (11): 1203-1233. 10.1002/1097-024X(200009)30:11<1203::AID-SPE338>3.0.CO;2-N.View ArticleGoogle Scholar
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.