Exome scale map of genetic alterations promoting metastasis in colorectal cancer

Background Approximately 90% of colorectal cancer (CRC) deaths are caused by tumors ability to migrate into the adjacent tissues and metastase into distant organs. More than 40 genes have been causally linked to the development of CRC but no mutations have been associated with metastasis yet. To identify molecular basis of CRC metastasis we performed whole-exome and genome-scale transcriptome sequencing of 7 liver metastases along with their matched primary tumours and normal tissue. Multiple, spatially separated fragments of primary tumours were analyzed in each case. Uniformly malignant tissue specimen were selected with macrodissection, for three samples followed with laser microdissection. Results > 100 sequencing coverage allowed for detection of genetic alterations in subpopulation of tumour cells. Mutations in KRAS, APC, POLE, and PTPRT, previously associated with CRC development, were detected in most patients. Several new associations were identified, including PLXND1, CELSR3, BAHD1 and PNPLA6. Conclusions We confirm the essential role of inflammation in CRC progression but question the mechanism of matrix metalloproteinases activation described in other work. Comprehensive sequencing data made it possible to associate genome-scale mutation distribution with gene expression patterns. To our knowledge, this is the first work to report such link in CRC metastasis context. Electronic supplementary material The online version of this article (10.1186/s12863-018-0673-0) contains supplementary material, which is available to authorized users.


Background
High mortality rate of colorectal cancer stems from its metastatic potential [1]. Metastasis is also crucial health problem for other tumours -it causes 90% of deaths for all solid tumours [2]. Recently great progress has been made in the understanding of biological principles of the metastatic process [3], which translated into new therapies extending patient survival over twofold [4]. Further advances in clinical treatment are hampered by genetic heterogeneity and evolutional potential of lesions. Genotyping of single variant or even single whole gene is often insufficient to predict effectiveness of molecularly targeted therapies and we still lack the thorough atlas of underlying genetic aberrations.
The development of primary colorectal tumour (PT) occurs along well described sequence of genomic mutations. The most essential are alterations in APC, TP53, KRAS, PIK3CA and TGFB, but many others have been detected -46 genes have been causally linked to the development of CRC according to the Catalogue Of Somatic Mutations In Cancer (COSMIC) database [5]. In contrast, no mutations have been associated with metastasis yet [6].
There are two possible reasons of the failure of previous work [7,8] to demonstrate genetic causal link to metastasis. The first one is the molecular heterogeneity of cancer specimen studied. Metastatic lesions (MT) have been shown to harbour from less than 10 to more than 800 somatic mutations in the exomic region [7]. The molecular features of primary tumours are also highly inconsistent which led to selection of distinct subclasses [9]. There may be multiple paths leading to dissemination into distant locations for each subclass of primary tumour, making published studies underpowered. Secondly, metastasis may be purely stochastic process, independent of specific genetic traits present in the primary lesion. Factors outside cancerous cells, like immunological response, relative position of primary tumour in respect to existing vasculature and susceptibility of vascular epithelia to invasion may contribute to metastasis, greater than any single genetic mutation.
There are three aspects of metastasis genetics that are yet to be explained: which alterations are key drivers of the process, in what mechanism they occur and what functions/aspects of cell do they modify. The first problem is complicated by the fact that multiple distinct DNA modifications can lead to similar phenotype, which increases sample size required to prove causal link. Functional alterations are yet impossible to decrypt on genomic scale with genotype alone and without broad information on gene expression.
Here we employ next generation sequencing for both, exome genotyping and transcriptome sequencing of freshly frozen samples sets (normal tissue, primary tumour and liver metastasis) from 7 patients to characterise mutational landscape of metastatic CRC.

Tissue specimen
Primary colon tumours with normal tissue margin and slices of liver metastases less than 1 mm thick and less than 10 mm long were dissected simultaneously. Parts of both were used for immediate pathology examination and the rest was frozen in -80°C upon further processing.
For primary tumours, sections of uniformly malignant tissue were selected in macro-dissection procedure. For 5 primary tumours further dissection of multiple spatially separated fragments of malignant tissue was conducted to assess intra-tumour variability. For three primary tumours microdissection was performed using PALM laser microdissection and pressure catapulting (LMPC) system (PALM MicroBeam with PALM Robo-Mover module and PALM RoboSoftware; Carl Zeiss MicroImaging GmbH, Germany) (samples 10PT3, 10PT4, 5PT1, 5PT2, 9PT4, 9PT5).
The extraction and purification of DNA was performed using QIAamp DNA Micro Kit (Qiagen, Germany) according to Protocol for Isolation of Genomic DNA from Laser-Microdissected Tissues. DNA sample concentration was measured using NanoDrop spectrophotometer, following the manufacturer's instructions. DNA was further stored at -20°C.
Variants previously linked to CRC development were imported from COSMIC database (version 20,161,128 [5]).
"Filtered variants" sets were created in three consecutive steps. First, variants differentiating tumour and normal tissue were called with Varscan2. Detected variants were then filtered according to read depth (> = 20) and number of non-reference reads from each strand (> = 4). Last, variants detected in more than 1% of population according to ExAC, 6500 exomes or 1000 Genomes Project (both global and European) database were discarded. Exclusive metastatic variants (EMV) were selected in similar way, by further removing variants detected in primary tumours from metastatic variants set.
Functional analysis of EMV was performed with Gene Set Enrichment Analysis (GSEA) software (version 2.2.4, [16]), using Reactome [17] as gene sets database. Two scores were used as gene rankings for GSEA -Cancer-specific High-throughput Annotation of Somatic Mutations (CHASM) [18] for missense driver cancer mutations and highest Combined Annotation Dependent Depletion (CADD) score variant per gene [19] for all the variants.

Transciptome sequencing
Total RNA was isolated from tissue using RNeasy Plus Mini Kit (Qiagen, Germany), following manufacturer protocol. The purity and quantity of RNA was measured with NanoDrop spectrophotometer and assessed using an Agilent 2100 Bioanalyzer with RNA 6000 Nano Kit (Agilent, California). Samples were stored at -70°C.
Sequencing libraries were generated using Ion Ampli-Seq Library Kit Plus (Thermo Fisher). Sequencing was performed using Ion Proton instrument with 5 or 6 samples per chip with Ion PI Hi-Q Sequencing 200 Kit (Thermo Fisher). Reads were aligned to the hg19 Ampli-Seq Transcriptome ERCC v1, target panel 21 K v1. Transcripts were quantified with HTseq-count (version 0.6.0 [20]), run with default options. Differentially expressed genes were determined with negative binominal test implemented in DESeq2 package (version 1.12.4, [21]). Patients were used as confounding variable. P-values were corrected for multiple hypotheses testing with Benjamini-Hochberg procedure and differences with corrected p-values < 0.05 were considered significant.
Overrepresentation of Gene Ontology (GO) terms [22] assigned to genes with the most marked expression differences between groups was tested with Fisher Exact test implemented in the GOstats package (version 2.40.0, [23]). Tests were performed in the "conditional" mode, separately for biological process and molecular function branch. Only terms with more than 2 and less than 2% of the total number of observed transcribed genes (~20,000) were assessed. P-values from Fisher Exact test were corrected with Benjamini-Hochberg procedure.
The link between transcriptome changes and observed mutations was probed with Kolmogorov-Smirnov test. Genes were sorted according to expression fold-change (FC) between each tumour sample and respective normal sample. Positions of genes carrying selected classes of mutations on FC sorted list was used as an input for Kolmogorov-Smirnov test. Analysis was done separately for all non-silent, homozygous non-silent, stopgain and indel mutations.

Availability of data and materials
The dataset supporting the conclusions of this article is available in the Gene Expression Omnibus repository (https://www.ncbi.nlm.nih.gov/geo/) under entry GSE89393.

Results
Exome sequencing Between 1.5 and 9.7 billion base reads that mapped to the reference genome were generated during exome sequencing for 7 sets of freshly frozen samples (Additional file 1: Table S1). Each set consisted of normal tissue, metastatic tumour and between 1 and 6 samples of primary tumour. Between 54 and 3029 variants differentiating primary tumours and metastases from normal tissue were found ("filtered variants", Fig. 1). Between 1 and 88 of those variants were stop-gains. Samples could be classified into low and high mutation count categories with between 54 and 306 mutations detected in the former and between 1490 and 3029 in the latter (1-7 and 36-88 stop-gains, respectively). Characteristics of variants detected in metastatic samples closely resembled those of respective primary tumours. 426 filtered variants were detected in more than one patient, 49 were detected in 3 patients and 17 in 4 patients (Additional file 2: Table S2). There were three frameshift substitutions detected simultaneously in 4 patients, in ABTB2, TPI1 and GLI2. 6 filtered variants that were homozygous, exonic and nonsynonymous were detected in at least two patients (Additional file 3: Table S3). Frameshift causing insertion of adenine at codon 336 of transcript NM_000365 of TPI1 was detected in four patients, three of those insertions were homozygous.
Mutations of C:G pairs were detected over nine times more often than mutations of A:T pairs and three times as many transversions than transitions (Fig. 1). Most of the filtered variants were exonic (46.2%), intronic (14.5%), 3'UTR (11.6%) or intergenic (10.5%), which was in line with library preparation method used. 29.0% of filtered variants were nonsynonymous SNV, 11.4% were synonymous SNV and 2.6% were stopgains according to Annovar (Additional file 4: Table S4).
Numerous variants in genes already implicated in CRC development were detected among filtered variants (Additional file 5: Table S5). Mutation in KRAS was detected in five patients and mutation in APC, POLE and PTPRT was detected in four patients. Notably there were no mutations detected in APC and KRAS in 3 metastatic samples although primary tumours from the same patients were carrying mutations in this genes.
In metastases there were between 26 and 2029 variants that weren't detected in any normal tissue nor in primary tumours (exclusive metastatic variants -EMV). Mutation types were similar to those differentiating primary tumours and normal tissue with C:G pairs substitutions ten times more likely than A:T pair and 4.6 times as many transversions than transitions (Additional file 6: Fig. S2). 47.3% of EMV were exonic, 15.7% intronic, 10.6% 3'UTR and 10.3% were intergenic. 30.9% EMV were nonsynonymous SNV, 10.9% were synonymous SNV and 2.8% were stopgain (Additional file 7: Table S6). The most frequently mutated genes in MT (normalized for length) were NHLH2, RPL13A and SSNA1.
Among variants exclusive to metastatic tumors (EMVs), 89 missense variants are potential cancer drivers (FDR(false discovery rate)-adjusted CHASM p-value < 0.05). Only one variant, BAHD1 p. R533S is present in more than one sample (Additional file 8: Table S7). There are 128 genes, are potential cancer-driver genes (FDR-adjusted CHASM composite p-value < 0.05). None of them is mutated in every samplethe most changed is PLXND1, with mutation in 5 samples (Additional file 9: Table S8). CELSR3 had EMV in four patients, BAHD1 and PNPLA6 in three. GSEA analysis with CHASM score as ranking feature revealed 5 Reactome pathways with FDR values in 0.05-0.1 range, which included Signaling by FGFR pathway (Additional file 10: Table S9). On the other hand, similar analysis with highest CADD score per gene yielded no significant results (not shown).

Transcriptome sequencing
Between 7.8 and 22.4 million of tags read during transcriptome sequencing were mapped to the reference sequence. Between 61.2 and 74.5% of the reference transcripts were detected (Additional file 11: Table S10). There were two outliers among samples according to Principal Component Analysis (PCA), (Additional file 12: Fig. S1). Corresponding samples weren't taken into account in comparisons between groups.
Expression of 3066 genes was significantly different between normal tissue and primary tumours. 1555/1511 were up/down regulated in tumours. 2677 of them showed at least 2 fold change and 216 over 10 fold change in expression (Additional file 13: Table S11A). There were genes with over 100 fold decrease and over 100 fold increase in expression ( Table 1). The most notable examples of down-regulated genes were GUCA2B (FC = 187), TMIGD1 (FC = 129) and CA1 (FC = 121), while CST1 and S100A2 (FC = 131/88.7, respectively) were highly expressed in tumours but not in normal tissue. Differences in expression were attributed to electrolyte homeostasis (GO:0015711, GO:0006811) and some metabolic processes, including lipid and fatty acid metabolism (Table 3A). "Response to drug" (GO:0042493) is particularly interesting in this context because all samples were collected prior to chemotherapeutic treatment. "Magnesium ion binding" (GO:0000287) was the only molecular function overrepresented among the most differentiating genes (adj. p = 0.015, FC = 1.98).
105 genes were differentially expressed between metastases and primary tumours. 38/67 were up/down regulated in metastases. For CRP and FGG expression increased over 50-fold (Table 2, Additional file 13: Table S11B). The most overrepresented biological processes among differentiating genes were cellular component and extracellular matrix organization, followed by immune response-related processes (Table 3B). Interestingly, neither EGFR nor EGF, previously proposed as essential for matrix organization [24], were found to be differentially expressed. The most significant molecular function was "heparin binding" and several extracellular matrix remodelling processes (Additional file 14: Table S12).
GO biological processes with the highest overrepresentation in the 10% of genes with the lowest p-value (selected subset) in comparison between normal colon vs primary tumour (A) or primary tumour vs metastases (B). Count -number of genes in selected subset attributed to a given GO term. Expected count -number of genes expected to be attributed to given category by chance. The aggregated effect of accumulated mutations was visible in the observed transcriptome remodelling. When genes were sorted according to fold-change of expression (FC) for three pairs of tumour-normal sample, the genes with detected filtered variants weren't distributed randomly. For various classes of filtered variants there was a significant bias of distribution along FC-sorted genes detected with Kolmogorov-Sminov test (bold highlight in Table 4). Differences in one MT transcriptome vs respective normal tissue were linked to the set of all non silent mutations. Interestingly, stop-gains were less impactful on their own, with significant association  with transcriptome changes only in one primary and none of metastatic tumours (Table 4).

Discussion
Contrary to previously published results [7], where transversions were less prevalent than transitions by twofold, there were 3 times more transversions than transitions. The number of detected somatic variants was, on average, more than two times higher here than in Lim B et al. [7]. Discrepancies cannot be explained neither by sequencing technology (Illumina HiSeq in both cases), nor by sequencing depth, which was similar (101 vs 133). Mapping software was also comparable (BWA [25] vs Bowtie 2). The most significant protocol difference is that we used Varscan2 [11] instead of MuTect [26]. Varscan2 is more sensitive than MuTect, detecting over 3 times more SNP in some scenarios [27]. Furthermore, MuTect misses some high quality variants [28]. We believe that Mutect is overly conservative, especially when sequencing depth is high (> 100). Additional filtering for minimal number of reads from each strand (> = 4) supporting variant protect against high false-positive rate. Contradicting results on mutation type distribution highlight the dependence of conclusions regarding mutation mechanism on analytic choices. There were 89 cancer-driver mutations among EMVs predicted by CHASM, however most of them concerned only one tumour. On the other hand, on gene level there were 128 cancer-driver genes predicted, two mutated in four patients and one in five. Moreover, GSEA analysis revealed significant enrichment of FGFR signalling and  Given values are -log10 of p-value from Kolmogorow-Smirnoff test of altered vs non-altered genes (see methods). Bold highlights significant association (values greater than -log10(0.05)). Inf -values greater than 10 antigen processing pathways. These results suggest indeed there are no specific mutations involved in metastatic processes, however the cancer-driver mutation distribution is not entirely random since it involves specific genes and pathways. High levels of CRP, the gene with the most significant expression increase in liver metastases (Table 2), were previously associated with poorer prognosis for CRC [29,30]. This is in line with other findings associating various inflammation symptoms with metastasis (Table  3) [31]. The key players in inflammation progression are matrix metalloproteinases (MMP) well described in the CRC context [32] and significantly differentiating primary tumour from normal tissue here (Additional file 13: Table S11). EGFR was labelled MMP regulator [24] and was found downregulated in lymph node metastasis vs primary tumours [33]. In our study neither EGF nor EGFR expression did differentiate metastasis from primary tumour, which suggests there is other mode of MMP activation.