It is critical to ensure that samples selected for use in validation of NGS carry the representative changes and mutations that a clinical laboratory expects to detect in real-world samples.
NGS is able to detect complex mutations using targeted amplification
Genes selected included ACADVL, BCKDHA, CBS, CFTR, DMD, GAA, GALC, GALT, GBA, GJB2, HEXB, IDUA, OPA1, RECQL4, SGSH, SMPD1, and ZEB2. Duchenne muscular dystrophy (DMD) is caused by mutations in the DMD gene, the largest human gene, spanning 2.2 Mb on the X chromosome
[13, 14]. Gaucher disease is an autosomal recessive disorder where mutations in the GBA gene result in a decrease in the activity of acid β-glucosidase. The GBA gene is an extremely difficult gene to perform diagnostic testing on, due to the presence of a pseudogene that is >98% identical to the active gene
[15, 16]. The RECQL4 gene has an atypical structure: it is a very compact gene of ~6.5 kb in which most introns are less than 100 bp in length. It is also highly repetitive and GC rich, making it difficult to amplify and sequence cleanly
[17, 18]. Other genes were selected for inclusion in the validation run mainly on the basis of the changes they carry. One such example is a sample with two mutations in the GJB2 gene, carrying c.35delG on one allele and c.35dupG on the second allele (Table
1). In conventional Sanger sequencing analysis, it is very difficult to interpret the data when there are two different changes at the same nucleotide position
. Both mutations in the GJB2 gene were identified on the NGS run. Because NGS sequences each allele independently, it provides our laboratory with not only the genotype but also the data needed to determine which change resides on which allele.
Target amplification method needs to be chosen carefully for NGS
In this study, we used a standard PCR approach to test the sensitivity and specificity of NGS. We faced many challenges during the initial startup phase of acquiring and deploying an NGS instrument in a clinical laboratory environment. Clinical laboratories routinely generate hundreds if not thousands of PCR reactions a day for use in Sanger sequencing, but this enrichment strategy does not scale to NGS: it involves too many labor-intensive steps to accurately quantitate individual PCR amplicons before they can be pooled for the NGS chemistry pipeline. Such a labor-intensive manual process raises costs and lengthens the entire workflow. Laboratories will find it hard to continue using standard Sanger sequencing enrichment techniques on a routine basis, because the full capacity of the NGS instrument must be exploited to minimize costs. On the SOLiD v3 instrument, we are able to interrogate up to 2.4 Mbp of a region of interest in a single quad. The cost in time and effort to generate individual PCR amplicons for an entire 2.4-Mbp region of interest is prohibitive and raises the chance that a mistake will occur. Even if long-PCR techniques could be employed as the enrichment technique, 240 individual 10-kb reactions would be required to enrich a 2.4-Mbp region.
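To make the labor burden concrete, the arithmetic behind equimolar amplicon pooling can be sketched as below. Every amplicon must be quantitated and converted to molarity before pooling; the amplicon names, lengths, and concentrations here are purely hypothetical illustrations, not values from this study.

```python
# Sketch of the equimolar-pooling arithmetic that must be repeated for every
# amplicon when standard PCR is used as the enrichment method.

def molarity_nM(conc_ng_per_ul: float, length_bp: float) -> float:
    """Convert a dsDNA concentration to nanomolar (average 660 g/mol per bp)."""
    return conc_ng_per_ul * 1e6 / (660.0 * length_bp)

def pooling_volumes(amplicons, target_fmol=10.0):
    """Volume (ul) of each amplicon needed to contribute target_fmol to the pool."""
    volumes = {}
    for name, (conc, length) in amplicons.items():
        nM = molarity_nM(conc, length)  # 1 nM == 1 fmol/ul
        volumes[name] = target_fmol / nM
    return volumes

amplicons = {  # name: (ng/ul, length in bp) -- hypothetical values
    "CFTR_ex11": (25.0, 450),
    "GBA_ex9":   (12.0, 600),
    "DMD_ex45":  (40.0, 300),
}
for name, vol in pooling_volumes(amplicons).items():
    print(f"{name}: {vol:.2f} ul")
```

Repeating this measurement and calculation for hundreds of amplicons per run, by hand, is what makes the Sanger-style enrichment strategy impractical for NGS.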
It is clear that, to manage the workflow of a larger number of amplicons for gene panels, clinical laboratories will need to consider target enrichment methods such as multiplex PCR (Fluidigm™), microdroplet-based PCR (RainDance™), or in-solution capture (Agilent SureSelect™). Jones et al. have recently demonstrated the use of microdroplet-based PCR for testing 25 genes for congenital disorders of glycosylation (CDG) in a clinical laboratory. Their work showed that, even after using target enrichment methods, some exons fail to give adequate coverage and still need Sanger sequencing to complete the clinical test. Sanger sequencing will continue to play an important role in the clinical laboratory for assay completeness, both for sequencing low-coverage and difficult regions of a gene and for confirmatory studies once a mutation is identified in a proband and additional family members need to be tested. With our initial approach of adapting the enrichment method used for standard Sanger sequencing, we have demonstrated that any change within the boundaries of custom-designed primers flanking the region of interest (eg, exons) can be detected successfully.
Using coverage data as the sole indicator of whether a change is real is difficult. The nine false-positive changes that were picked up had a median coverage of approximately 400 reads and a mean of approximately 3,600 reads. By contrast, confirmed changes had a median coverage of approximately 5,300 reads and a mean coverage of approximately 7,000 reads. The number of reads for actual confirmed changes is approximately 15-fold higher than for false-positive changes; however, because the read counts of confirmed and false-positive changes overlap significantly, we are unable to use the number of reads as the sole indicator. In this study, we see a great overlap in coverage between substitution mutations and smaller insertion/deletion mutations. For larger deletions/duplications detected using NextGENe's™ condensation function, the effective number of reads is much lower. The GBA_2 sample's c.1265_1319del55 mutation had only 20 reads, compared with the GJB2 sample's single-base deletion, c.35delG, which had 34,879 reads. Similarly, the OPA1 sample's c.93_96dupAAAA mutation had only 179 reads, compared with the GJB2 sample's c.35dupG mutation, which had 33,377 reads. In an effort to determine an appropriate coverage threshold, simulation experiments were run for the c.2052_2053insA mutation in the CFTR gene. Varying numbers of reads aligning to the region were randomly selected and used for analysis. We performed 80 simulations, with the number of reads selected varying from 15 to 50 per 10,000 reads. Coverage of the insertion varied from 8 to 43, and in some simulations NextGENe was able to detect the insertion with coverage as low as 8 reads. We chose 20 reads as the average threshold. Other groups have also expressed a similar viewpoint
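The downsampling simulation described above can be sketched as follows. The simple "minimum supporting reads" detection rule is a stand-in for NextGENe's proprietary caller, and all counts (10,000 aligned reads, ~50% carrying the insertion) are illustrative assumptions rather than the study's actual data.

```python
import random

# Sketch of a read-downsampling simulation: reads covering a known insertion
# are randomly subsampled, and we ask at what coverage the variant remains
# detectable. The majority-evidence rule below is a simplified stand-in for
# the software's actual variant caller.

def simulate_detection(total_reads, reads_with_insertion, sampled, trials=80,
                       min_supporting=3, seed=42):
    """Fraction of trials in which >= min_supporting sampled reads carry the insertion."""
    rng = random.Random(seed)
    pool = [True] * reads_with_insertion + [False] * (total_reads - reads_with_insertion)
    detected = 0
    for _ in range(trials):
        sample = rng.sample(pool, sampled)
        if sum(sample) >= min_supporting:
            detected += 1
    return detected / trials

# Hypothetical numbers: 10,000 aligned reads, half carrying a heterozygous
# insertion, subsampled down to between 15 and 50 reads per trial.
for n in (15, 25, 50):
    print(f"{n} reads sampled: detected in {simulate_detection(10000, 5000, n):.0%} of trials")
```

Running many such trials at each depth is what justifies choosing a conservative average threshold (20 reads here) even when occasional calls succeed at lower coverage.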
[21–25]. De Leeneer et al. have performed a detailed analysis to determine the coverage needed during an NGS run, given two variables (data quality score and sequencing error rate), to detect heterozygous changes. In their paper, they determined that data with a quality score of 30 require a minimum of 18X coverage if the sequencing error rate is 15%. Dohm et al., in their study, identified bona fide SNPs by applying high coverage of >20X.
The software reports a Phred-like confidence score calculated with a novel SoftGenetics algorithm. The algorithm takes multiple variables into account to calculate a final probability that any one change is a true variant. A Phred score of 10 means there is approximately a 1 in 10 chance that the change is the result of an error, while a Phred score of 30 represents a 1 in 1,000 chance that the change is an error. This Phred-like score gives us greater confidence in distinguishing true from false-positive changes. In our study, real changes had Phred-like confidence scores averaging 24, with a minimum of 9.4 and a maximum of 34.6 (Table
1). Some changes detected using the condensation algorithm do not have a Phred-like confidence score. A confidence score of nine or above, together with coverage above 20X, makes it more likely that a change is real.
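The scores quoted above translate directly into error probabilities via the standard Phred relationship, p_error = 10^(−Q/10). A quick illustrative sketch (using the standard definition, not SoftGenetics' proprietary scoring):

```python
# Map a Phred-like score Q to the probability that the call is erroneous.

def phred_to_error_probability(q: float) -> float:
    """Probability that a call with Phred-like score q is an error."""
    return 10 ** (-q / 10)

# 24 and 34.6 are this study's mean and maximum scores for real changes.
for q in (10, 24, 30, 34.6):
    print(f"Q{q}: ~1 in {round(1 / phred_to_error_probability(q))} chance of error")
```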
Proportion of bases
Another indicator is the relative proportion of the mutant base compared with the wild-type base. One of the samples we ran carries a heterozygous c.1504C>G (p.L502V) missense mutation in the ACADVL gene. This mutation had 5,869 reads, showing roughly equal proportions of the wild-type C allele (60%) and the mutant G allele (40%). Our validation data set suggests that real heterozygous calls should be present in approximately equal proportions, ranging to as much as 70% wild-type to 30% mutant, whereas homozygous/hemizygous calls should consist almost exclusively of the mutant allele but can range to as much as 20% wild-type to 80% mutant. The proportion of bases called will never be exact, owing to nonspecific amplification products that are sequenced and aligned back to the regions of interest. This is compounded by errors generated during the next-generation sequencing wet-bench process and by errors introduced by the SOLiD instrument during sequencing.
NGS pipeline in a clinical laboratory
Most clinical laboratories are well equipped and accustomed to performing high-complexity testing that requires multiple steps. While most clinical laboratories will not find it difficult to perform the wet-bench work required for an NGS run, it is a challenge to maintain the same level of consistency that can be achieved easily with a Sanger sequencing pipeline.
Current NGS pipelines involve many interdependent steps, and a major challenge faced by our laboratory was how to accurately and consistently quantitate the small amounts of enriched library present at each step of the process. A subtle change in quantity can result in a poor library preparation and a less-than-ideal data set, especially when loading the quad to its maximum capacity. Uniformly deep coverage of at least 20 reads per base across every region of interest is needed to ensure that all changes are picked up accurately by the laboratory.
Changes in laboratory structure
Clinical laboratories often lack experienced bioinformatics staff and the necessary computing infrastructure within a clinical setup. There are only a few NGS 50-bp fragment analysis programs available on the market, and the few that exist were developed for use by programmers and bioinformatics specialists. This dearth of software packages that are both 'laboratorian'-friendly and powerful enough to perform de novo detection of the entire mutation spectrum hinders developments that would enable the use of NGS fragment capabilities for targeted resequencing projects. We selected the SoftGenetics NextGENe™ software package because it is designed to detect the entire mutation spectrum, including small and large indels, using data generated from a 50-bp fragment run. Our laboratory has demonstrated that we are able to leverage the power of SOLiD's 50-bp fragment run to detect not only single nucleotide changes but also small and large indels. This is possible due to a proprietary indel detection process called condensation, developed by SoftGenetics. The condensation tool is used to polish and lengthen short sequence reads into fragments that are longer and more accurate. The short reads from the SOLiD System are often not unique within the genome being analyzed. By clustering similar reads containing a unique anchor sequence, data of adequate coverage are condensed: short reads are lengthened and instrument errors are filtered from the analysis. This stage helps to prepare data for analysis in applications such as SNP/indel detection by statistically removing many of the errors while maintaining true variations. The reads used for each condensed read are recorded to maintain allele frequency information. In addition, the condensation tool can be set to run multiple cycles automatically, further increasing the read lengths. Condensation operates without referring to a reference sequence. Reads are clustered using 12-bp anchor sequences within the reads; each possible 12-bp sequence within the reads is considered for indexing. All reads containing an exact anchor sequence are clustered together to form a group, which is further sorted into subgroups by the flanking shoulder sequences immediately upstream and downstream of the anchor. A consensus read, generally 1.6 times the original read length, is created for each subgroup. By removing many low-frequency, biased calls and improving alignment accuracy by lengthening reads, the condensation tool is useful for preparing data prior to indel detection. NextGENe™ then aligns the consensus reads to the reference sequence.

NextGENe™ can be run by a laboratory technician, which is an important consideration for a clinical laboratory. A laboratory technician who has been trained to analyze Sanger sequencing data does not necessarily have the programming skills to perform NGS analysis.
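The anchor-clustering idea described above can be sketched minimally as follows. This is not NextGENe's actual implementation (which adds shoulder-sequence subgrouping, multi-cycle condensation, and error filtering); it only illustrates how reads sharing an exact 12-bp anchor can be aligned on the anchor position and collapsed into a consensus longer than any individual read. All read sequences are hypothetical.

```python
from collections import Counter, defaultdict

ANCHOR_LEN = 12  # length of the exact anchor sequence used for clustering

def condense(reads, anchor):
    """Build a consensus from all reads containing the exact anchor sequence,
    aligning them on the anchor and taking a per-column majority vote."""
    assert len(anchor) == ANCHOR_LEN
    columns = defaultdict(Counter)  # offset relative to anchor start -> base counts
    for read in reads:
        pos = read.find(anchor)
        if pos < 0:
            continue  # read does not belong to this cluster
        for i, base in enumerate(read):
            columns[i - pos][base] += 1
    if not columns:
        return ""
    lo, hi = min(columns), max(columns)
    return "".join(columns[i].most_common(1)[0][0] for i in range(lo, hi + 1))

# Hypothetical overlapping 20-bp reads from one locus; one has a read error.
reads = [
    "ACGTACGTTGCAGGTCCATA",
    "ACGTTGCAGGTCCATAGGCT",
    "ACGTTGCAGGTCCATAGGCA",  # trailing sequencing error, outvoted below
    "TTGCAGGTCCATAGGCTTAC",
]
consensus = condense(reads, anchor="TTGCAGGTCCAT")
print(consensus)  # consensus spans 27 bp, longer than any single 20-bp read
```

The majority vote both lengthens the read and filters the isolated error, which is the essence of why condensation improves downstream alignment and indel detection.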
Skilled professional programmers or bioinformatics specialists are needed to work in partnership with laboratory directors, genetic counselors, and clinicians to interpret the massive amount of data generated in a single NGS run.
Due to the immense data-generating capacity of an NGS platform, clinical laboratories will not perform single-gene analysis on it. We are able to use the increased capability of the NGS platform by raising the number of genes analyzed at a time. As the number of genes in a panel increases, the number of potential false positives identified will rise correspondingly. Clinical laboratories will have to deal with a larger number of false-positive changes in order to avoid missing any real disease-causing mutations. As with any clinical test, changes identified on an NGS platform will need to be confirmed using an alternative technology, such as Sanger sequencing. It is important that clinical laboratories perform such confirmation to determine the validity of calls generated from the NGS data. We have been able to identify three indicators (coverage above 20 reads; a confidence score of 30 and above; and a proportion of bases that can range as skewed as 70% wild-type to 30% mutant for heterozygotes and as much as 20% wild-type to 80% mutant for homozygotes) to help determine whether a detected change is real.
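The three indicators can be combined into a single screening check, sketched below. The confidence cutoff is left as a parameter because the discussion quotes both nine and thirty at different points; and screening is only triage, since flagged calls are still confirmed by Sanger sequencing.

```python
# Screen a variant call using the three indicators discussed in the text:
# coverage, Phred-like confidence score, and mutant-allele proportion.

def passes_screen(coverage: int, confidence: float, mutant_fraction: float,
                  zygosity: str, min_confidence: float = 30.0) -> bool:
    if coverage <= 20 or confidence < min_confidence:
        return False
    if zygosity == "heterozygous":
        # real heterozygotes range from ~50/50 up to 70% wild-type / 30% mutant
        return 0.30 <= mutant_fraction <= 0.70
    if zygosity in ("homozygous", "hemizygous"):
        # homozygous/hemizygous calls may carry up to 20% wild-type reads
        return mutant_fraction >= 0.80
    raise ValueError(f"unknown zygosity: {zygosity}")

# The ACADVL c.1504C>G call from the text (5,869 reads, ~40% mutant allele),
# screened with the nine-and-above confidence threshold quoted earlier:
print(passes_screen(5869, 24.0, 0.40, "heterozygous", min_confidence=9.0))  # True
```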
Cost considerations when implementing NGS in a clinical laboratory
The cost of implementing an NGS system in a laboratory is not confined to the cost of the instrument package provided by the manufacturer. Many pieces of ancillary equipment are required, and their availability is critical to the success of the NGS setup in the laboratory. Equipment such as a powerful computer and secure data storage is required to handle the massive amounts of data. Cloud computing is an option that has emerged as NGS developed over the last few years; however, a clinical laboratory taking this route will need to identify a secure, HIPAA-compliant cloud provider able to support clinical needs. While the cost of such a computer and storage cluster is reasonable, laboratories will need to budget additional funds to cover the purchase of this ancillary equipment.