Evaluation of a target region capture sequencing platform using monogenic diabetes as a study-model

Background Monogenic diabetes is a genetic disease often caused by mutations in genes involved in beta-cell function. Correct sub-categorization of the disease is a prerequisite for appropriate treatment and genetic counseling. Target-region capture sequencing is a combination of genomic region enrichment and next generation sequencing which might be used as an efficient way to diagnose various genetic disorders. We aimed to develop a target-region capture sequencing platform to screen 117 selected candidate genes involved in metabolism for mutations and to evaluate its performance using monogenic diabetes as a study-model. Results The performance of the assay was evaluated in 70 patients carrying known disease causing mutations previously identified in HNF4A, GCK, HNF1A, HNF1B, INS, or KCNJ11. Target regions with a less than 20-fold sequencing depth were either introns or UTRs. When only considering translated regions, the coverage was 100% with a 50-fold minimum depth. Among the 70 analyzed samples, 63 small size single nucleotide polymorphisms and indels as well as 7 large deletions and duplications were identified as being the pathogenic variants. The mutations identified by the present technique were identical with those previously identified through Sanger sequencing and Multiplex Ligation-dependent Probe Amplification. Conclusions We hereby demonstrated that the established platform as an accurate and high-throughput gene testing method which might be useful in the clinical diagnosis of monogenic diabetes.


Background
Since 1992, several genetic subtypes of monogenetic diabetes have been described in which gene mutations result in diabetes, primarily by causing beta-cell dysfunction [1]. Thus, patients previously categorized clinically as having maturity-onset diabetes of the young (MODY), permanent neonatal diabetes mellitus (NDM), or transient NDM, can now be classified by genetic sub grouping. Identification of the genetic subgroups is a prerequisite for appropriate treatment, genetic counseling, and prognostic information.
MODY, defined as dominantly inherited, young-onset (at least one family member with onset before 25 years of age), non-autoimmune and non-insulin dependent diabetes (no insulin treatment needed three or more years after diabetes diagnosis or measurable circulating C-peptide) [1,2], is the most common form of monogenetic diabetes. Molecular genetic findings have shown that MODY is very heterogeneous. To date, the MODY phenotype has been reported to be linked with mutations within 13 [3][4][5]. Approximately 70-80% of clinically diagnosed MODY patients carry mutations in one of these 13 genes, however, the remaining 20-30% are likely to carry disease causing mutations in yet unidentified genes [1,[6][7][8]. Thus, the term MODY encompasses a group of clinically and genetically heterogeneous forms of beta cell dysfunction and in a few instances also insulin resistance which at a molecular level is defined by mutations in various genes [2,5].
NDM, a form of monogenic diabetes which is usually diagnosed within the first six months of life, has a rare incidence of one in 100,000 to 400,000 live births. Clinically, NDM can be divided into two subtypes: transient neonatal diabetes mellitus (TNDM) and permanent neonatal diabetes mellitus (PNDM). The most frequent cause of PNDM is due to mutations in KCNJ11 or ABCC8 which encode a subunit and a modulator of the ATPsensitive potassium (K ATP ) channels in the pancreatic beta-cell, respectively [9]. Patients with PNDM due to KCNJ11 or ABCC8 mutations may be treated with oral sulfonylurea instead of insulin [1].
The monogenic subset of the diabetic population is often mis-classified as type 1 diabetes (T1D) or type 2 diabetes (T2D) and thus is not offered optimal treatment regimens [10]. In Europe and USA, approximately 2-5% of T2D and 10% of familial T1D have previously been estimated to be misdiagnosed MODY cases [11,12], however, the exact incidence and prevalence of monogenic diabetes is not known. According to recent estimates, approximately 92 million Chinese are recognized diabetes patients, making China one of the world's leading countries with regards to diabetes prevalence [13]. In the latest report, 27.7 million Chinese children and 334 million Chinese adults are estimated to be pre-diabetic or diabetic [14]. With this high diabetes prevalence and the known high prevalence of specific MODY forms in young onset diabetic patients in China [15,16], the prevalence of monogenic diabetes cases in China is likely to be accordingly high, and the importance of clinical differentiation should thus not be ignored. A high-throughput method with high accuracy to diagnose monogenic forms of diabetes is needed not only for a scientific purpose but also for medical application. Due to its high accuracy, Sanger sequencing is still the gold standard in terms of gene variant detection. However, when it comes to heterogeneous disorders, the method requires accurate clinical assessments and gene selection. If the "wrong gene" is tested, a (false) negative result is obtained and the testing procedure has to be repeated. In this case, the method is time consuming and expensive.
Here, we present the use of a gene testing panel (117 genes) based on a target region capture system coupled with next-generation sequencing (NGS) technology. The performance of the capture probe was tested by use of in-house YanHuang (YH) DNA [17]. The assay performance was validated by investigating six of the known monogenic diabetes genes (HNF4A, GCK, HNF1A, HNF1B, INS and KCNJ11) in 70 Danish patients carrying previously identified causative variants. All 70 causative variants were correctly identified. We therefore believe the established platform to be a high-throughput gene testing method which may be applied in clinical diagnosis of monogenic diabetes.

Samples
As a means to test the use of the platform, six YH samples were contributed from the Clinical Laboratory Center at Beijing Genomic Institute (BGI) [17], Shenzhen. For the validation of the method in relation to detection of known diabetes causing mutations, a total of 70 samples from unrelated Danish MODY and NDM probands collected at Steno Diabetes Center, Gentofte, Denmark, were examined. Prior to the participation in the study, informed consent was obtained from all subjects. The study was approved by the Ethical Committee of Copenhagen and was in accordance with the principles of the Declaration of Helsinki II.

Target region array design
A customized oligonucletide probe was designed to capture whole genome regions (including exons, introns, and untranslated regions (UTRs)) of 117 genes for monogenic diabetes [3][4][5] and selected candidate genes referred to type 2 diabetes mellitus, metabolism of glucose and oral hypoglycemic agents, and obesity [2,18,19] (Additional file 1: Table S1). Hence, the total size of probe set was 4962226 bp. The probes (GenCap™ Enrichment, MyGenostics, USA) were approximately 60 bp in length and with a 5-10 bp overlap in restricted regions.
DNA extraction, target region capture and nextgeneration sequencing Methods for DNA extraction, target region capture, and NGS have previously been extensively described [20] and a brief experimental workflow is included in Additional file 1: Figure S1. In brief, genomic DNA was extracted from peripheral blood lymphocytes by standard procedures using QIAamp DNA Bloodmini kits (Qiagen, Germany). Next, 1 ug genomic DNA was fragmented by Covaris sonicator (Covaris S2, USA) to sizes of 150-300 bp and then purified. The blunt ends of the purified DNA fragments were then repaired, and A-tailing was added. The fragments were ligated overnight using standard Illumina paired-end (PE) adapter. The ligated products were then amplified through 4-cycle polymerase chain reactions (PCRs) using PE primers containing 8 bp index tags. The purified PCR products containing 3 ug DNA were hybridized to the GenCap™ probe (in solution) at 65°C for 22 hours using a PCR machine. The products were bound to a rotator for 1 hour at room temperature using Dynal Myone Streptavidin C1 magnetic beads (Invitrogen, USA), which had been activated beforehand, and the products were then washed with buffer according to the kit manual. The captured DNA libraries were amplified using 15-cycle PCRs, purified, and subsequently eluted in a 30 ul volume and subjected to Agilent 2100 Bioanalyzer and quantitative PCR to estimate the magnitude of enrichment. The final captured DNA libraries were sequenced using the Illumina HiSeq2000 Analyzers as PE 90 bp reads (following the manufacturer's standard cluster generation and sequencing protocols), providing an average coverage depth for each sample of at least 100-fold.

Data filtering and analysis
Image analysis, error estimation, and base calling were performed using the Illumina pipeline (version 1.3.4) with default parameters. Indexed primers were used to identify the different samples in the primary data. All unqualified reads (defined as reads either polluted by adapter, containing more than 10% nucleotides out of read length, having an average quality of less than 10, or having 50% bases with a quality value less than 5) were removed using a local dynamic programming algorithm. The remaining reads were aligned to the reference human genome (UCSC hg19) using Burrows-Wheeler Alignment Tool (BWA-0.5.9). Next, SNPs and indels were identified using SOAPsnp software 2.0 and SAMtools v1.4 while using the recommended parameters [21,22].

FNFP assessment
The accuracy and precision of the targeted region capture sequencing were assessed by comparing SNPs identified in one of the 6 YH samples with those reported in the online YH database (http://yh.genomics.org.cn/). The remaining 5 YH samples were used as inter-control samples.
According to the YH database reference, an in-house pipeline was used to estimate the true positive (TP) ratio, true negative (TN) ratio, false positive (FP) ratio, and false negative (FN) ratio of all SNPs [23]. The accuracy and precision was calculated by (TP + TN)/(TP + FN + TN + FP) and TP/(TP + FN), respectively. In this calculation, only high quality SNPs fulfilling the following criteria were included: 1) SOAPsnp score ≥ 20; 2) Depth ≥ 20; 3) Percentage of reads supporting variation ≥ 28% [24].

Functional annotation of genetic variants
The variants were functionally annotated using an inhouse pipeline as well as the reported frequencies available from public databases (dbSNP 135, HapMap database, 1000 genome variants database, and a local control database) and categorized into either missense, nonsense, splice-site, insertion, deletion, synonymous or noncoding mutations. For all variants, the results were filtered using a quality value of single base sequencing ≥ 20. The variants were filtered to potential mutation candidates through: 1). Functional variants (insertion/deletion: in CDS and splicing region, SNP: nonsense, splice site and missense) and 2). Variants with an allele frequency below 0.01 in either of the public databases mentioned above. To validate the pre-screened samples, variants were initially filtered by six known disease-causing genes (HNF4A, GCK, HNF1A, HNF1B, INS and KCNJ11).

Large deletions/duplications analysis
The depths of each region of a gene in different samples within the same sequencing lane are significantly correlated (r > 0.7), and the depth of each capture region was therefore used to calculate a z-score according to the following equation: The large deletions and duplications were identified using a predefined cut-off point (±3) of derived z-score of each captured gene region. We used the cut-off value of 3 for absolute z-score, as it represents the 99.9 th percentile of the normal samples set for one tailed region. Any region with a z-score above 3 was defined as either a deletion (<-3) or a duplication (> 3).

General performance of the overall sequencing
In total, we collected 76 samples to assess the method of target region capture combined with high-throughput sequencing, the brief analysis workflow can be found in supplementary information (Additional file 1: Figure S1) and the average performance of the method is provided in Table 1. For each sample, we obtained an average of 4409.95 Mb raw data of which 2046.69 Mb was mapped to the target region after alignment using BWA. The specificity of the probe was 53.72%, the overall coverage of the target region was 99.37% and 96.95% for a minimum depth of 1X and 20X, respectively, and the average sequencing depth was 412-fold (Table 1). For the 200 bp flanking target regions, the coverage was 94.45%, and the average depth was 240-fold ( Table 1). The mean depth and the coverage of the target region for each screened sample are provided in Figure 1.
Using SOAPsnp and SAMtools for variant calling [21,22], we obtained a total of 15875 variants in the 117 genes on average pr. sample, including 13808 SNPs and 2067 indels. All variants were annotated as either heterozygous (het) or homozygous (hom) using the bioinformatics software (hom-0.0005, het-0.0010) [21]. Overall, 96.83% and 1.12% of SNPs were located in the introns and exons, respectively (Table 2), and the percentage of heterozygotic and homozygotic SNPs was 70.94% and 29.06%, respectively (Table 2). In each screened sample, an average of 70 missense, 18 splice site, 1 nonsense, and 66 synonymous mutations were identified.
With regards to the indels, 97.28% and 0.2% were located in introns and exons, respectively, and the percentage of heterozygotic and homozygotic indels was 93.27% and 6.73%, respectively (Table 3). Furthermore, on average, each sample had a total of 1226 deletions and 841 insertions in the all gene region, 3 deletions and 1 insertion positioned within a coding region, and 1 indel positioned in a splice site region.

Accuracy and precision of target region captured sequencing
To assess the accuracy of the method, we used the YH DNA as a control sample, as genome information of this sample was readily available [17]. In this sample, the relative TP, TN, FN and FP ratio for all SNPs were 90.96%, 99.99%, 9.04%, and 0.006%, respectively. The precision of sequencing was 95.43%, and the accuracy was 99.98% (Table 4).

Identified variants in monogenic diabetes samples
To validate whether the method can accurately detect causative variants in patients with known disease causing mutations in HNF4A, GCK, HNF1A, HNF1B, INS, or KCNJ11, we applied the method to 70 samples from Danish patients.
In regions with a more than 20-fold minimum depth, the coverage was 97.87%. Regions with a less than 20-fold depth were either introns or UTRs. Moreover, when only considering translated regions, the coverage was 100% with a 50-fold minimum depth. Thus, by selecting only genes of relevance for the individuals phenotype combined with data filtering, the number of potential pathogenic variants (indel, missense, nonsense, and splice-site) [5,23] in each sample could be narrowed down to 3-10 variants. These variants were arbitrarily defined to be either pathogenic, suspected pathogenic, polymorphism or of unknown significance. Mutations previously reported in MODY patients, or leading to loss of function due to the introduction of a stop codon or a frameshift, were termed pathogenic (n = 51). For novel mutations, functional prediction was performed with the use of bioinformatics tools. If novel mutations were located in important functional regions and/or conservative regions, they were defined as suspected pathogenic (n = 12). Polymorphisms, i.e. variants detected in both diabetes patients and controls, were predicted to be benign, and the remaining variants were defined as being of unknown significance. Hence, in 70 positive samples, we identified 63 pathogenic variants, including 15 indels, 36 missense, 5 nonsense, and 7 splicesite variants (Figure 2 and Additional file 1: Table S2). In the remaining 7 samples, we identified large deletions or duplications in exons (Additional file 1: Table S2, Additional file 1: Figure S3). Among the 70 samples, the percentage of GCK-mutated samples was 37%, HNF1A was 40%, HNF4A was 13%, HNF1B was 3%, INS was 3%, and KCNJ11 was 4% (Additional file 1: Table S2). In addition, a series of polymorphisms and variants of unknown significance located within these genes was also identified.

The clinical characteristics of Danish patients of monogenic diabetes mellitus
The clinical characteristics were collected at Steno Diabetes Center, Denmark and included the pathogenic gene, sex, age at examination and diagnosis, height, weight, BMI, HbA1c and treatment of every patient ( Table 5). All 70 patients were diagnosed with diabetes mellitus at a young age, except one patient who was diagnosed with hyperinsulinemic hypoglycemia. Furthermore, all patients had a normal BMI and slightly elevated levels of HbA1c, and these clinical characteristics are all corresponding to the known features of monogenic diabetes. With the use of target region capture NGS, 100% of pathogenic variants previously identified by Sanger sequencing and MLPA were detected in a study sample of 70 Danish patients with known monogenic forms of disease. Therefore, the targeted NGS approach was demonstrated to be an accurate and reliable method to detect variants of monogenic diabetes or hyperinsulinemic hypoglycemia.

Discussion
In the present study, target region enrichment combined with NGS technology proved to be a useful tool for the identification of genetic variants involved in monogenic diabetes. The hybridizing procedure to enrich target region is easy to use, and the duration of the total process is approximately 30 hours which is considerably less than performing directional sequencing of each gene. Combined with the applied sample pooling strategy, the target region capture procedure proved to be very effective. With the use of this method, 117 genes potentially implicated in metabolic disorders could be analyzed in one run. Each sample generated an average of 2005.13 Mb data mapped to the target region. Though the capture specificity of the applied chip is relatively low (53.72% compared to~70% on other commercial chips, e.g. Nimblegen EZ Choice), the compatible performance in capture coverage (reducing the FN rate) as well as the lower cost (20% more sequencing cost is also considered) makes the present capture chip a good choice for target-region capture sequencing. For the entire testing panel, the coverage of the 117 genes was 99.38%, and the coverage was only slightly diminished (96.95%) at 20-fold minimum depth.
To assess the accuracy of the method, in-house YH DNA was initially tested. We found that after filtering of the low quality variants, the accuracy was 95%, and both the FP and FN ratio reached the lowest value. Nevertheless, comparison with the reference data revealed that the FN ratio was still quite high (9.04%). The relatively high FN ratio may be due to errors in the YH reference sequence; such false positives in the reference sequence will cause an inflated FN rate in the examined sequence. This is supported by the fact that an FN ratio of 5-10% has also been observed in other previous studies using the YH reference sequence [23].
By applying this method, a total of 70 pre-screened samples were tested, and in 100% of the samples, we were able to detect known mutations. With the use of this method, small variants (both SNPs and indels) as well as larger deletions and duplications could be identified in one analysis. In order to apply the method in Sample Number  clinical settings, several improvements could be implemented. Firstly, the gene panel could be re-designed to include only translated regions of phenotype specific genes (known causative genes, candidate genes for MODYX, other disease functional related genes), plus regulatory regions with reported causative mutations, for example the P2 promoter of HNF4A [26]. Secondly, any low coverage regions should be supplemented by Sanger sequencing methods in order to improve the mutation detection rate. Thirdly, our assay at its current form has a turnaround time of seven weeks, which is relatively long for routine clinical testing and reporting. Therefore, transferring the sequencing procedure from Hiseq to Miseq or Proton might shorten turnaround time while maintaining the high accuracy and throughput. Last but not least, clinical variation interpretation can be a great challenge. Thus, we suggest a "phenotype-specific candidate-genes panel" approach i.e. only genes of relevance to the individuals' phenotype are analysed in order to reduce the number of variants identified to facilitate the interpretation of the pathogenicity of the variants [27]. A similar approach was recently published targeting the coding regions of 29 genes in which mutations have been reported to cause neonatal diabetes, MODY, maternally inherited diabetes and deafness (MIDD) or familial partial lipodystrophy (FPLD) [28]. However, as the present platform includes a much larger number of genes, includes non-coding region, and has the ability to identify larger structural variations, we believe that the present platform has a greater usability.

Conclusions
We have established a straight-forward target region sequencing method for examining a large number of genes. The method was successfully evaluated using monogenic diabetes as a study-model. The turnaround time was seven weeks, and the coverage of the target gene coding region reached 100%. The method can be used to detect small size mutations and larger deletions and/or duplications in the same run. In terms of the 70   FN)):

95.43
The assessment is calculated based on qualified SNPs filtered by SOAPsnp score, depth, and supporting reads ratio. Figure 2 The distribution of pathogenic (suspected pathogenic) variants of disease-causing genes among 70 pre-screened samples. The mutation type (large exon deletion/duplication, splice, nonsense, missense, small indel) are present by gradually changing color in blue. pre-screened samples, we correctly identified all the causative variants. We believe our method could be a useful tool in large-scale monogenic diabetes gene testing.

Additional file
Additional file 1: Table S1. The examined genes on designed capture panel. Table S2. Overview of the pathogenic variants of six monogenic diabetes mellitus' casual genes. Figure S1. Workflow of experimental procedure before sequencing. Figure S2. Bioinformatics analysis pipeline. Figure S3. Detection of 7 large deletions or duplications in Danish patients with known monogenic diabetes. YH4 is shown as control.