SERpredict: Detection of tissue- or tumor-specific isoforms generated through exonization of transposable elements

Background:
Transposed elements (TEs) are known to affect transcriptomes, because either new exons are generated from intronic transposed elements (this is called exonization), or the element inserts into the exon, leading to a new transcript. Several examples in the literature show that isoforms generated by an exonization are specific to a certain tissue (for example the heart muscle) or cause a disease. Thus, exonizations can have negative effects on the transcriptome of an organism.

Results:
As we aimed at detecting other tissue- or tumor-specific isoforms in human and mouse genomes which were generated through exonization of a transposed element, we designed the automated analysis pipeline SERpredict (SER = Specific Exonized Retroelement), which makes use of Bayesian statistics. With this pipeline, we found several genes in which a transposed element formed a tissue- or tumor-specific isoform.

Conclusion:
Our results show that SERpredict produces relevant results, demonstrating the importance of transposed elements in shaping both the human and the mouse transcriptomes. The effect of transposed elements on the human transcriptome is several times higher than the effect on the mouse transcriptome, due to the contribution of the primate-specific Alu elements.



Background
Transposed elements (TEs) are sequences of DNA that can move from one position to another in the genome. There are two classes of transposed elements, the DNA transposons and the retroelements. DNA transposons usually move by cut and paste using the transposase enzyme. In contrast, retroelements are genetic elements that integrate into a genome via an RNA intermediate which is reverse-transcribed to DNA. In mammals, almost half the genome consists of TEs: around 45% of the human genome is made up of them. This translates to millions of elements, so that on average, every gene in our genome contains about 3 transposed elements. Transposed elements comprise approximately 37% of the mouse genome.
The human and mouse genome sequences show that TEs have played an important role in shaping the genomes [1,2]. The human genome contains retroelements such as Alu, which is a short interspersed element (SINE), MIR (mammalian interspersed repeat) as well as LINE-1 (L1), LINE-2 (L2) and CR1 (L3). The last three of the given families of retroelements are termed long interspersed elements. In addition, the human genome contains LTR elements such as MaLR (mammalian apparent LTR-retrotransposon), ERVL and ERV1 (endogenous retroviruses) as well as DNA transposons where common families are MER1 and MER2. The mouse genome contains MIR elements as well as rodent-specific SINEs such as B1 (homologous to the left arm of the Alu), B2, B4 and ID as well as LINEs such as L1, L2 and CR1. Similar to the human genome, the mouse genome contains LTRs and DNA transposons. With approximately 1 million copies, Alu is the most frequently encountered TE in the human genome. In mouse, B1 and L1 are the elements with the highest number of copies (B1: 500,000 copies, L1: 800,000 copies).
Through splicing processes ("exonizations"), small pieces of transposed elements can be inserted into mature mRNAs. These exonizations are caused by motifs that resemble consensus splice sites in both strands of the TEs [3]. The transposed elements do not only contain these splice sites but also polyadenylation sites, promoters, enhancers and silencers. Therefore, they can add a variety of functions to their targeted genes [4][5][6].
Mutations within intronic TEs may yield active splice sites which can be used instead of the normal splice sites, leading to the partial exonization of the intronic TE. However, the other TEs of the human and mouse genomes can be exonized, too. In a previous study, Sela et al. [7] showed that 1824 TEs are exonized in the human genome, of which about 58% are Alus. In the mouse genome, 506 transposed elements are exonized, most of which are either B1 or L1 elements (26% and 20%, respectively). Thus, transposed elements can affect the transcriptome. Either new exons are generated from intronic TEs (see Figure 1a (i)), or the TE inserts into the first or last exon of a gene (Figure 1b (i)), leading to a new transcript [8]. In the first case, the exonization can either generate an internal cassette exon (Figure 1a (ii)), an alternative 3' splice site (Figure 1a (iii)), an alternative 5' splice site (Figure 1a (iv)) or a constitutively spliced exon (Figure 1a (v)) [7,9]. In the case of insertions into first or last exons, the insertions cause either an elongation of the first/last exon (Figure 1b (ii, iii)) or an activation of an alternative intron (Figure 1b (iv)). For the exact number of occurrences of the different events please refer to [7].
It has been previously reported that more than 5% of the alternatively spliced exons in the human genome are Alu-derived and that all Alu-derived exons are the result of exonization of intronic sequences [8]. It can therefore be supposed that genetic diseases can occur when an intronic TE is constitutively or alternatively spliced into the mature mRNA. Searching the literature indeed uncovers evidence that Alu insertions cause genetic disorders [10][11][12][13].
Another effect of new exonizations is a potential tissue specificity, in which an exon shows strong tissue regulation [14]. An experimental verification of this mechanism is described in a report on an Alu de novo insertion and subsequent exonization within the dystrophin gene that creates a tissue-specific exon causing cardiomyopathy [15].
Figure 1. The effects of TE insertions. a) (i) A TE inserts into an intron of a gene. (ii-v) show the possible effects of this integration: (ii) an alternatively spliced exon is created, (iii) the TE contributes an alternative 5' splice site, (iv) the TE contributes an alternative 3' splice site, (v) the TE creates a constitutively spliced exon. b) (i) A TE inserts into the first or last exon of a gene. (ii-iv) show the possible effects of this integration: (ii, iii) enlargement of the first or last exon, (iv) the TE activates an alternative intron.
Furthermore, since tumorous tissues have been shown to adopt aberrant splicing patterns [16], there might be TE exonizations that are potentially tumor-specific. The survivin gene is one example in which an Alu-generated splice variant is tumor-specific [17].
For the detection of new tissue- or tumor-specific TE-containing isoforms in human or mouse genes, we designed and implemented SERpredict, an analysis pipeline making use of Bayesian statistics.

Implementation
SERpredict is based on several databases: the Ensembl database [18], the UCSC genome browser database [19], dbEST [20], and EMBL [21]. How they are used and combined is described in the following section.

Library classification
To obtain information about the tissue and the health status of alternative splice forms of genes, the databases dbEST [20] for expressed sequence tags (ESTs) and EMBL [21] for mRNAs are used. These EST and mRNA sequence databases provide information about tissue and tumor sources. For dbEST, library information which includes an ID, the library name, the organism, the tissue and a more detailed description is provided with each entry. For the EMBL database, there are features termed "clone lib" and "tissue type". However, the names of the tissues and tumors are not standardized across the databases. For this reason, we extracted all the EST and mRNA identifiers from the two databases dbEST and EMBL, obtained the associated library information and assigned a tissue category to every given tissue according to [22] (see also Additional file 1). Here, we used keywords to identify 52 different cell or tissue source categories, e.g., leukocyte is mapped to the category blood, hippocampus is mapped to brain and so on. Furthermore, either "tumor" or "normal" was added to each library, again using keywords. All the information was then stored in a locally installed MySQL database which is automatically updated if one of the underlying databases is updated. The "annotated" EST and mRNA sequences obtained in this way were used to perform the statistical analysis to determine whether a certain isoform is tissue- or tumor-specific.
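The keyword-based classification described above could be sketched as follows. Only the two mappings leukocyte to blood and hippocampus to brain come from the text; all other keywords, the fallback category "unclassified" and the function name `classify_library` are hypothetical and serve illustration only, not as a reproduction of the keyword list from [22].

```python
# Keyword -> tissue category. Only "leukocyte" and "hippocampus" are taken
# from the text; the remaining entries are hypothetical placeholders.
TISSUE_KEYWORDS = {
    "leukocyte": "blood",
    "hippocampus": "brain",
    "blood": "blood",   # hypothetical
    "brain": "brain",   # hypothetical
}

# Hypothetical keywords marking a library as tumor-derived.
TUMOR_KEYWORDS = ("tumor", "carcinoma", "cancer")


def classify_library(description):
    """Return (tissue_category, source) for a free-text library description."""
    text = description.lower()
    tissue = next(
        (category for keyword, category in TISSUE_KEYWORDS.items() if keyword in text),
        "unclassified",
    )
    source = "tumor" if any(keyword in text for keyword in TUMOR_KEYWORDS) else "normal"
    return tissue, source
```

In such a scheme, a library that matches no tissue keyword stays "unclassified" rather than being forced into one of the categories.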

Tissue and tumor specificity
The analysis for tissue or tumor specificity of a certain alternative splice form can be done using the above "annotated" EST and mRNA sequences. To this end, all EST and mRNA sequences which map to the gene of interest have to be extracted. To determine tissue or tumor specificity, the extracted sequences have to be divided into two groups reflecting the two isoforms of the gene. This is easy in our special case of alternative isoforms because one of the isoforms was generated by the exonization of a TE.
On the basis of this information, the ESTs and mRNAs are separated into the groups "holding", when the TE is present in the sequence of interest or "skipping", in cases where the TE is not. Using the library classification terms for the sequences, we then get four sets of distributions. For each of those EST and mRNA sequences skipping the TE as well as for those holding the TE, we obtain a tissue and a source (tumor or normal) distribution.
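The grouping into "holding" and "skipping" sequences and the resulting four distributions can be illustrated with a short sketch. The record layout (a flag for TE presence, a tissue category and a source label) and the function name `build_distributions` are assumptions made for this example.

```python
from collections import Counter


def build_distributions(records):
    """Tally tissue and source counts separately for holding and skipping sequences.

    records: iterable of (has_te, tissue, source) tuples, where has_te is True
    if the EST/mRNA contains the TE exon (hypothetical record layout).
    """
    dists = {
        "holding": {"tissue": Counter(), "source": Counter()},
        "skipping": {"tissue": Counter(), "source": Counter()},
    }
    for has_te, tissue, source in records:
        group = "holding" if has_te else "skipping"
        dists[group]["tissue"][tissue] += 1
        dists[group]["source"][source] += 1
    return dists
```

The four counters correspond to the tissue and source (tumor or normal) distributions of the two groups and would serve as the observations for the statistical analysis.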
Determining tissue or tumor specificity from these distributions is not easy, because tissue and tumor source data for EST or mRNA sequences are often incomplete and inconsistent. For a certain gene there are often only a few ESTs sequenced from a particular tissue covering the exons of interest. We therefore have to cope with a poor EST library coverage. Additionally, there are extremely different numbers of ESTs and mRNAs for the different tissues, see Figure 2. This leads to a sampling bias problem.
To address these problems, statistical analysis is needed.
Furthermore, when dealing with tissue or tumor specificity, there is a problem with including ESTs from cell lines in the analysis. Cell lines are often immortalized, and the immortal lines obtained might not be a perfect representation of the original cells in primary culture. To estimate how many of the ESTs originated from cell lines, we checked the annotation in the dbEST database and determined that only about 10% of the human and mouse EST sequences were derived from cell lines.

Statistical analysis
To deal with the incomplete and inconsistent data, we used a previously described Bayesian statistics approach to identify tissue-specific exons [23] and to identify exons showing deregulated splicing in tumors [24].
To identify tissue-specific exons, a tissue specificity score (TS score) is computed. The confidence that a certain splice variant is preferred in tissue T is calculated as a Bayesian posterior probability given the observations, where θ_1T represents the hidden frequency of the splice variant in tissue T and obs stands for the number of ESTs and mRNAs observed in tissue T. The likelihood P(obs|θ_1T) is calculated using a binomial distribution, and P(θ_1T) = 1 was used as an uninformative prior probability. In the same way, the posterior probability that the same splice variant is preferred in the pool of all other tissues (denoted ~) is computed, with θ_1~ being the frequency of the splice variant in that pool. The TS score for tissue T is then defined as the difference between the posterior probability for tissue T and the posterior probability for the pool of all other tissues. To assess the stability of the TS score, robustness values r_TS and r_TS~ were calculated analogously to the "jack-knife" resampling method. For more details, please refer to [23].
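The two quantities described in this paragraph can be written out as follows. This is a sketch under two assumptions that are not stated in the text: that "preferred" means a hidden frequency above 50%, and that the score is scaled by 100 (as suggested by cutoffs such as TS > 50). The authoritative definitions are those given in [23].

```latex
P(\theta_{1T} > 0.5 \mid \mathrm{obs}) =
  \frac{\int_{0.5}^{1} P(\mathrm{obs} \mid \theta_{1T})\, P(\theta_{1T})\, d\theta_{1T}}
       {\int_{0}^{1} P(\mathrm{obs} \mid \theta_{1T})\, P(\theta_{1T})\, d\theta_{1T}}

TS = 100 \cdot \left[ P(\theta_{1T} > 0.5 \mid \mathrm{obs})
                    - P(\theta_{1\sim} > 0.5 \mid \mathrm{obs}) \right]
```

With the binomial likelihood and the flat prior P(θ_1T) = 1, the posterior of θ_1T is a Beta distribution, so both integrals have a closed form.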
To identify tumor-specific exons, a log-odd score (LOD score) is calculated, giving the confidence that the frequency of a splice variant in tumor tissue (θ_T) is higher than the frequency of the splice variant in normal tissue (θ_N). The LOD score was calculated using direct numerical integration [24].
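A minimal sketch of such a numerical integration is given below. It assumes, without confirmation from the text, flat priors and binomial likelihoods for both frequencies and the definition LOD = -log10 P(θ_T ≤ θ_N | obs), under which a score above 2 corresponds to a probability below 0.01. The function names and the midpoint rule are choices made for this example; the exact procedure is described in [24].

```python
import math


def beta_pdf(x, k, n):
    """Density of the Beta(k+1, n-k+1) posterior: flat prior, k of n sequences show the variant."""
    return (n + 1) * math.comb(n, k) * x**k * (1 - x) ** (n - k)


def beta_sf(x, k, n):
    """P(theta > x) for Beta(k+1, n-k+1), using the identity with a Binomial(n+1, x) sum."""
    m = n + 1
    return sum(math.comb(m, i) * x**i * (1 - x) ** (m - i) for i in range(k + 1))


def lod_score(k_tumor, n_tumor, k_normal, n_normal, steps=2000):
    """-log10 of P(theta_T <= theta_N | obs), integrated with the midpoint rule."""
    h = 1.0 / steps
    q = 0.0
    for j in range(steps):
        t = (j + 0.5) * h
        # density of theta_T at t times the probability that theta_N exceeds t
        q += beta_pdf(t, k_tumor, n_tumor) * beta_sf(t, k_normal, n_normal) * h
    return -math.log10(q) if q > 0 else math.inf
```

For example, `lod_score(5, 5, 0, 5)` describes a variant seen in all five tumor sequences and in none of five normal sequences.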
The criteria used for high confidence of tissue specificity were TS > 50, r_TS > 0.9 and r_TS~ > 0.9, as described in [23]. A necessary condition for high-confidence tissue specificity was at least three EST observations of the mRNA containing the exon in tissue T. As we wanted to have results with high significance only, we raised the criterion for the TS score to TS > 85 for our pipeline. For tumor specificity, a log-odd score was calculated. As in [24], a log-odd score above 2, equivalent to a p-value < 0.01, was considered significant.
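How a TS score and the cutoffs above fit together can be illustrated as follows. The sketch assumes that "preferred" means a hidden frequency above 50% and that the score is scaled by 100; neither is stated explicitly in the text. The robustness values are taken as given inputs because their computation is only described in [23], and all function names are hypothetical.

```python
import math


def posterior_preferred(k, n):
    """P(theta > 0.5 | k of n sequences show the variant) under a flat prior.

    The posterior is Beta(k+1, n-k+1); its upper tail at 0.5 equals
    P(Binomial(n+1, 0.5) <= k).
    """
    return sum(math.comb(n + 1, i) for i in range(k + 1)) / 2 ** (n + 1)


def ts_score(k_tissue, n_tissue, k_rest, n_rest):
    """Scaled difference of the posteriors for tissue T and for the pool of all other tissues."""
    return 100.0 * (posterior_preferred(k_tissue, n_tissue) - posterior_preferred(k_rest, n_rest))


def is_high_confidence(ts, r_ts, r_ts_rest, k_tissue, ts_cutoff=85.0):
    """Apply the cutoffs from the text: TS, both robustness values, and at least three observations."""
    return ts > ts_cutoff and r_ts > 0.9 and r_ts_rest > 0.9 and k_tissue >= 3
```

A variant seen in three of three sequences from tissue T and in none of three sequences from all other tissues obtains `ts_score(3, 3, 0, 3) == 87.5` under these assumptions, which would pass the stricter cutoff of 85 if both robustness values exceed 0.9.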
Figure 2. Pie chart of EST numbers.

Work flow in SERpredict
Using the information presented in the sections above, we designed SERpredict to detect tissue- or tumor-specific isoforms which were generated through the exonization of transposed elements. The work flow is displayed in Figure 3.
Initially, the genomic information of the input sequence is determined. To this end, a Blast search [25] against Ensembl_cdna [18] is performed. Utilizing the Ensembl Application Programming Interface (API), the extracted Ensembl gene identifier is then used to find all transcripts of the gene and thereby all the different exons. If there is no Blast hit matching the criteria (identity > 95% and E-value < 10^-3), an empty output is produced.
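The hit criteria can be expressed as a small filter. The sketch assumes that the Blast results are available in the standard 12-column tabular format (query, subject, percent identity, ..., E-value, bit score); whether SERpredict uses this format is not stated. Selecting the passing hit with the highest bit score is likewise an assumption, as is the function name `best_hit`.

```python
def best_hit(tabular_lines, min_identity=95.0, max_evalue=1e-3):
    """Return the subject ID of the best hit passing both criteria, or None.

    tabular_lines: lines of tab-separated Blast output with 12 columns
    (assumed format); column 2 is the subject ID, column 3 the percent
    identity, column 11 the E-value and column 12 the bit score.
    """
    best = None
    for line in tabular_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = line.rstrip("\n").split("\t")
        subject, identity = fields[1], float(fields[2])
        evalue, bitscore = float(fields[10]), float(fields[11])
        if identity > min_identity and evalue < max_evalue:
            if best is None or bitscore > best[1]:
                best = (subject, bitscore)
    return best[0] if best else None
```

A return value of `None` corresponds to the case in which the pipeline produces an empty output.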
Subsequently, every extracted exon is screened for transposed elements. This is done using the chrN_rmsk table of the UCSC genome browser database [19], which maps the positions of all TEs that have been found by RepeatMasker [26] and the Repbase annotations [27,28] to the human and mouse genomes, respectively. This approach is much faster than using RepeatMasker directly.
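Screening an exon against pre-computed repeat annotations amounts to an interval-overlap test, sketched below. The `Repeat` record and the function name are hypothetical, and the sketch assumes that exon and repeat coordinates have already been converted to the same convention (here 0-based, half-open intervals, as used in UCSC tables).

```python
from typing import NamedTuple


class Repeat(NamedTuple):
    """One annotated repeat (hypothetical record holding a subset of the rmsk columns)."""
    chrom: str
    start: int   # 0-based, inclusive
    end: int     # exclusive
    family: str


def repeats_in_exon(chrom, exon_start, exon_end, repeats):
    """Return all repeats on the same chromosome that overlap the half-open exon interval."""
    return [
        r for r in repeats
        if r.chrom == chrom and r.start < exon_end and exon_start < r.end
    ]
```

A linear scan is shown for brevity; for genome-wide screens a sorted or indexed lookup per chromosome would be the more practical choice.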
Finally, for every such TE-containing exon, an analysis of tissue or tumor specificity is performed as described in the Section "Statistical analysis". To this end, SERpredict extracts all expressed sequence tags (ESTs) and mRNA sequences from the UCSC genome browser database. These are then divided into two groups as described in the Section "Tissue and tumor specificity" and used as input for the statistical analysis.
As output, SERpredict returns a file with the following information:
• Information about the genomic location: Ensembl gene identifier, gene description, chromosome, strand, start and end on genome, number of transcripts and corresponding number of exons
• A graphical display of the alternative splice forms of the gene
• Information about the repetitive elements: family, ID of exon in which the TE is located, start and end of the TE and the IDs of the transcript containing the TE-exon
• If observed: the tissue or tumor specificity of the TE-containing exon
These results are provided as HTML for visual inspection (see Figure 4) or can be downloaded as XML for easier extraction of relevant results and for storage in private databases.

Results and Discussion
All annotated genes of the human and mouse genomes were screened for TE-containing exons. The number of times the different transposed elements were exonized (and fulfilled the condition of at least three EST observations of the mRNA containing the exon in tissue T) are shown in Table 1 (for the human genome) and Table 2 (for the mouse genome).
The 859 human and 260 mouse TE-containing exons were then analyzed for tissue or tumor specificity using SERpredict. In the human exon list, we were able to identify 39 tissue-specifically spliced exons (see Table 3 for the exons with tissue specificity (TS) score > 90). In the mouse exon list, 11 exons showed tissue-specific splicing (see Table 4 for the exons with tissue specificity (TS) score > 90). In the human genome, 18 exons belonged to Alu, 5 were L1 exons, 2 were L2 exons, 1 was a CR1 exon, 5 were MIR exons, 4 were LTR exons and 4 were exons derived from DNA transposons. The highest number of tissue-specific exonizations arises from the exonization of Alu elements. The fact that Alu is the most abundant transposed element in the human genome and that it contains potential splice sites makes it a much better-suited sequence for the exonization process than other transposed elements [7] and could be a reason for these results. In mouse, 4 were B1 exons, 2 were B2 exons and 2 were LTR exons. For B4, L2 and MIR there was one exon each. The higher number of specific exonized B1 elements is consistent with the fact that B1 derived from the same ancestral origin as Alu. Still, B1 does not reach the same number of specific exonizations as Alu because the majority of exonizations of Alu occur in the right arm of the Alu element, which is not present in B1. In contrast to the dimeric structure of the Alu element, B1 is a monomer.
We did not observe a tendency for specificity in any certain tissue in humans. In the mouse genome, interestingly, there is a bias for specific exons in pancreas tissue. This is not due to a bias in the number of ESTs/mRNAs from mouse pancreatic tissue in the database since there are as many pancreatic sequences as sequences from other tissues like intestine or blood. Therefore, this is an interesting result for which we do not have any explanations so far.
As MIR SINEs were active prior to the mammalian diversification [29], it was unexpected to find 5 tissue-specific MIR exonizations in human and only 1 in the mouse genome. We examined the orthologous loci of the 5 relevant genes RDH13, Elmo2, MRRF, Tri14 and NP_060401.2 in the human and the mouse genome and discovered that there is no MIR element in the mouse genome in 4 of the 5 cases. Only for MRRF there is a MIR in the mouse genome, but the exonization in mouse is not tissue-specific. For the specific exon of gene ST6galrsc4 in the mouse genome, there is a MIR at the same position in the human genome, but the exon boundaries are different. Therefore, the MIR is not exonized in the human genome.
Figure 3. Work flow of SERpredict. Programs and rules used for extracting tissue- or tumor-specific TE-containing exons; for details see Section "Work flow in SERpredict".
To show the efficiency of SERpredict, some of the genes which we predicted to have a tissue-specific TE-derived exon were verified by searching both the literature and the database annotations. Isoform 2 of the T-cell activation NFKB-like protein contains an Alu exon and was predicted as ovary-specific (Table 3), which was verified through the human SwissProt [30] entry Q9BRG9. A testis-specific isoform of TPK1 (Thiamine pyrophosphokinase 1) is described in the OMIM [31] entry 606370 [32,33]. This isoform is 100 bp longer than the broadly expressed variant. This complies with our results of an additional ERV1-derived exon of about 100 bp which makes this isoform testis-specific (Table 3). Additionally, the 4F2 cell-surface heavy chain protein seems to be highly expressed in the early stage of new bone formation [34]. Although we found an alternative isoform expressed in bone (Table 3), the specificity of the TE-derived exon is not described in the literature and could therefore not be verified.
Our second analysis identified exons which were spliced in a tumor-specific way. We found 21 such exons in human and 2 in mouse genes. In the human genome, 11 were Alu exons, 1 was an L1 exon, 1 was an L2 exon, 4 were MIR exons, 3 were LTR exons and one exon was derived from a DNA transposon (see Table 5). In mouse, there was 1 L1 exon and 1 MIR exon (see Table 6). The data were filtered to search for exons that were intronic within normal tissues and were recognized as exons only within tumorous tissues and, as such, could serve as potential markers for tumor diagnostics. One such exon, which contains an Alu element, was found in the human gene YY1AP1 (YY1-associated protein 1: hepatocellular carcinoma susceptibility protein). All results for TS > 85 and LOD > 2 are given in Additional file 2 for the human and the mouse genome.
We also found an indication for the accuracy of our predictions of genes with tumor-specific exons. From the ST6GALNAC6 gene, a 2.4 kb transcript has been described for colon carcinoma, while in normal colon, transcripts of 2.5 and 7.5 kb length are found [35,36]. The colon carcinoma transcript could represent the isoform which omits the first exon and contains the tumor-specific exon.
Taking these results into account, SERpredict is a useful tool for analyzing TE insertions in genes and for determining their effects on the human and mouse transcriptomes. On the one hand, their insertion into mature mRNAs and the subsequent change in the protein can cause effects in single tissues or even cause major illnesses like cancer. This has already been shown in several examples in the literature [10][11][12]15,17]. On the other hand, these new exons could be raw material for future evolution of the organisms. The new alternative TE-exons are only included in a fraction of the transcripts of a gene while the rest of the transcripts maintain their original function. Therefore, the addition may be free to evolve with no loss of original function. If the alternative form gains a useful function, its splice sites are strengthened or it can become tissue-specific if the new function has only local benefits [14].
Table 1. Number of exonized transposed elements in the human genome which have at least three EST observations of the mRNA containing the TE exon. The Alu element is exonized most frequently among TEs.
Table 3. Potentially tissue-specific TE-exons in the human transcriptome. From left to right: the gene name in which the exonization occurred, the transposed element's family name, the chromosome number, the name of the tissue to which the exon is specific and the TS score.
Table 4. Potentially tissue-specific TE-exons in the mouse transcriptome. From left to right: the gene name in which the exonization occurred, the transposed element's family name, the chromosome number, the name of the tissue to which the exon is specific and the TS score.

Conclusion
Our results show that SERpredict produces relevant results, demonstrating the importance of transposed elements in shaping both the human and the mouse transcriptomes. Due to the contribution of the primate-specific Alu elements, the effect of TEs on the human transcriptome is several times higher than the effect on the mouse transcriptome. We found some evidence for our results in both the literature and the database annotations. Other results still need biological verification. The pipeline can therefore be used as an indicator for biologists interested in tissue- or tumor-specific isoforms to decide which gene might be interesting for further research.
Due to the incompleteness of the present gene databases, our analysis remains confined to the annotated gene data.
With the continuous updating of the mRNA and EST databases, and with it our internal MySQL database, the analysis can be repeated. This will make analyses with SERpredict more precise and will provide results on tissue- or tumor-specific splicing for previously undiscovered exons.
In further studies we will include additional organisms into SERpredict in order to determine differences to the human and mouse genomes. Additionally, we are planning to build a database containing the data of TE-containing exons, the annotation with the TEs, as well as tissue and tumor specificities for different organisms. This will be an extension and an update of the AluGene database [37,38].

Availability and requirements
Project name: SERpredict
Table 6. Potentially tumor-specific TE-exons in the mouse transcriptome. From left to right: the gene name in which the exonization occurred, the transposed element's family name, the chromosome number and the LOD score.

Usage
As part of the HUSAR open server, applications are listed on the web page with additional information about the tasks they perform. Query sequences can be uploaded by the usual "copy & paste" procedure into the input box. If more than one sequence is to be queried, a multiple FASTA file can be used. The query starts by clicking on the "submit" button and then the "run" button on the following page. Results can be received by selecting the tab "Go to results page". For further explanation, a flow chart, an example output, and a test sequence are given on the web page.

Input/output formats
SERpredict accepts only nucleotide sequences as input. For output, see Section "Work flow in SERpredict".

Performance
Calculations are normally fast, depending on the length of the input sequence and the number of exons the input sequence contains. A calculation takes approximately one minute.

Abbreviations
TE - transposed element, SINE - short interspersed element, LINE - long interspersed element, MIR - mammalian interspersed repeat, EST - expressed sequence tag, TS score - tissue specificity score, LOD score - log-odd score, HUSAR - Heidelberg Unix Sequence Analysis Resources