Computational cloning of drug target genes of a parasitic nematode, Oesophagostomum dentatum

Background Gene identification and sequence determination are critical requirements for many biological, genomic, and bioinformatic studies. With the advent of next generation sequencing (NGS) technologies, such determinations are predominantly accomplished in silico for organisms for which the genome is known or for which there exists substantial gene sequence information. Without detailed genomic/gene information, in silico sequence determination is not straightforward, and full coding sequence determination typically involves time- and labor-intensive PCR-based amplification and cloning methods. Results An improved method was developed with which to determine full length gene coding sequences in silico using de novo assembly of RNA-Seq data. The scheme improves upon initial contigs through contig-to-gene identification by BLAST nearest–neighbor comparison, and through single-contig refinement by iterative-binning and -assembly of reads. Application of the iterative method produced the gene identification and full coding sequence for 9 of 12 genes and improved the sequence of 3 of the 12 genes targeted by benzimidazole, macrocyclic lactone, and nicotinic agonist classes of anthelminthic drugs in the swine nodular parasite Oesophagostomum dentatum. The approach improved upon the initial optimized assembly with Velvet that only identified full coding sequences for 2 genes. Conclusions Our reiterative methodology represents a simplified pipeline with which to determine longer gene sequences in silico from next generation sequence data for any nematode for which detailed genetic/gene information is lacking. The method significantly improved upon an initial Velvet assembly of RNA-Seq data that yielded only 2 full length sequences. The identified coding sequences for the 11 target genes enables further future examinations including: (i) the use of recombinant target protein in functional assays seeking a better understanding of the mechanism of drug resistance, and (ii) seeking comparative genomic and transcriptomic assessments between parasite isolates that exhibit varied drug sensitivities.


Background
Helminth infection of the gut of humans and domestic animals is a global concern with tremendous social and economic costs [1]. Treatment of either humans or animals may involve administering anthelminthic drugs, typically from 1 of 3 main drug classes: the benzimidazoles (which selectively bind nematode beta-tubulin), macrocyclic lactones (which allosterically activate glutamate-gated chloride channels present in nematodes), or nicotinic agonists (which selectively activate subtypes of nematode nicotinic acetylcholine receptors; nAChRs) [2,3]. Not yet in widespread use are two more recently developed drugs monepantel [4] and derquantel [5] that represent aminoacetonitrile and spiroindole drug classes, respectively; both of these compounds have sites of action on subtypes of nematode nicotinic receptors other than the nematode levamisole receptor subtype [3] and have been promoted as 'resistance-busting'. Of particular concern is that resistance has developed to members of each of the typically used drug classes [6,7]. Studies of anthelminthic drugs, many performed in the non-parasitic nematode Caenorhabditis elegans, have identified a number of genes/ proteins that are drug targets of benzimidazoles (ben-1), macrocyclic lactones (avr-14, avr-15, glc-1, glc-2, glc-3, glc-4), and the nicotinic agonist levamisole (lev-1, lev-8, unc-29, unc-38, unc-63) (reviewed in [7,8]). From studies like these it appears that the molecular basis of susceptibility to macrocyclic lactones and levamisole is polygenic.
Although studies in C. elegans have provided important insights about drug-targets and drug-sensitivity, they are free-living bacteriovores and not parasitic worms. Therefore, there is great interest in extending drug sensitivity studies to the parasitic nematodes. One major hindrance to such efforts has been the lack of genome/ gene information that is available for the parasitic nematodes. Consequently, studies requiring sequence determination typically include the time-and laborintensive steps of: (i) identifying the target gene from among known genes of genetically close organisms, (ii) aligning target gene sequences to determine regions of high similarity, (iii) designing DNA primers to those regions, and finally, (iv) amplifying the target using PCR and a single pair of primers, cloning and sequencing the cloned product.
A modification of this approach, one having the potential to reduce the time and expense required to identify multiple gene sequences, is to use RNA-Seq data to build the target nematode gene sequences in silico, i.e., by computational methods. The next-generation sequencing (NGS) technique of RNA-Seq economically produces large amounts of sequence data, albeit comprised of very short (50-150 base) reads. When applied to an organism having a known genome, RNA-Seq sequence data can be computationally analyzed using software packages such as the commonly used Velvet package [9] to produce entire gene sequences in silico. Unfortunately, when applied to organisms for which little genome information is known, the output is more typically comprised of short contigs and contigs of lower quality [10].
An example of a parasitic nematode lacking a described genome is Oesophagostomum dentatum, a soiltransmitted helminth (STH) with a classic fecal-oral transmission route that causes minor nodules in the large intestine of swine and that is used as a parasite model for STH [11,12]. This report describes a reiterative method that builds upon assembly using the existing Velvet analysis software to allow greatly improved in silico determination of gene sequences from O. dentatum. The method was applied towards a determination of the sequence of 12 O. dentatum genes that, predominantly in C. elegans, have been identified as targets of the major classes of anthelminthic drugs (benzimidazoles, macrocyclic lactones, and nicotinic agonists). Establishing these sequences in O. dentatum facilitates downstream physiological and molecular studies.

Generation of mRNA-seq read libraries
Separate mRNA-seq libraries were constructed from 5 μg of high quality total RNA (RNA Integrity Number ≥ 7.3) isolated from 58 adult male and 141 adult female worms. Given the absence of a sequenced genome of O. dentatum for use as a scaffold during subsequent assembly and analysis, paired-end sequencing was performed to facilitate contig-building steps. Similar numbers of 75-cycle paired reads were obtained for both libraries, (2.144 × 10 7 for male, 2.159 × 10 7 for female) for a combined total of more than 40 million reads. The lack of information about the exome size and complexity of O. dentatum, and that the RNA-seq libraries were not normalized, preclude estimation of the depth of coverage of the libraries; however, in broad approximation, the 3,100 × 10 6 DNA base calls within the 40 × 10 6 reads represent considerable coverage of mRNAs transcribed from the estimated 53-59 MB genome [13]. Reads were trimmed of read tags (added during library building to allow multiplexing during the sequencing run) and deposited at the NCBI sequence read archive: male library [GenBank:SRR393668] and female library [GenBank:SRR393669].

Velvet assembly
The NGS assembler Velvet executed in the paired-end mode was used to assemble the 40 million reads from the combined male and female libraries into contigs ( Figure 1  (1) Individual data sets were assembled into contigs using Velvet.
(2) BLAST searches for genes of the nAChR-pathway were carried out with a high cutoff (expect value = 1E -10 ) to identify contigs highly similar to the target genes. (3) Reads were individually mapped (using Bowtie) to high similarity contigs. (4) All paired-reads for which at least one read mapped to a contig (in Step-3) were identified and binned using a custom Java program. (5) De novo assembly of Step-4 sequences was performed using Velvet. (6) Iteration of Steps 3-5 was performed until the iteration resulted in no additional reads being mapped to the contig of interest. number of reads that were identified for each target sequence while maintaining compliance with NCBI guidelines requiring submitted cDNA sequences be derived from a single strain. The paired-end mode was used for optimal de novo assembly, given the lack of a reference genome. Hashing, the early Velvet step in which individual reads are indexed into overlapping sequences (k-mers) of some length "k", has significant impact on the length and quality of the contigs built during an assembly, with longer k-mers generally corresponding to greater specificity (i.e., that a detected alignment is actually correct) but lower sensitivity to detect alignments [9]. Because optimal k-mer length cannot be known a priori, multiple assemblies were performed using k-mers ranging from length 17 to 49.
The 12 genes of C. elegans shown to produce anthelminthic targets for which the homologous genes and sequences of O. dentatum were sought are shown in Table 1; the drug classes for which they are targets are shown in Table 2. To identify contigs exhibiting high similarity to the C. elegans target sequences separate BLAST databases, each comprised of the set of contigs produced by a single k-mer assembly, were built and then queried each database using tblastn and the protein sequences of the C. elegans target genes ( Figure 1, step 2). A stringent "E-value" threshold of 1 × 10 -10 was used within the tblastn search to limit false positive identifications; the E-value corresponds to the number of times a match of the same quality would be found by chance within the database. A custom java-script (Additional file 1) was used to collect those contigs whose corresponding high scoring pair (HSP) E-values were ≤ 1 × 10 -10 and, for further stringency, to retain only those contigs whose region of alignment with the target protein exhibited ≥ 60% identity across ≥ 50% of the contig length. Some contigs that passed these filters exhibited high similarity to more than one target gene (data not shown), as would be expected given that some of the target genes are paralogous to one another. To unambiguously assign the filtered contigs to single target genes, contigs were queried (using blastx) against the C. elegans genome database and then gene identity was assigned based upon the match exhibiting the lowest E-value.
For each gene of interest, Table 1 indicates C. elegans target gene information including coding sequence length, and summarizes the BLAST results profile for each k-mer assembly from k-value 21 to 47; profiles for k-values of 17, 19, and 49 were unremarkable and are not shown. For each k-mer evaluated and each target sequence, the table indicates the number of HSPs identified, their mean length, and their range from shortest to longest. A best/longest HSP was identified for each target contig (as is indicated by bold font for R value), and no contig yielded more than 1 HSP for a target gene (data not shown). Relative to the length of the target gene coding sequences, the length of the corresponding longest HSP was quite short for 9 genes, and was near to completely full length for 3 other sequences (lev-1, unc-38, and glc-4). Table 1 demonstrates the impact that k-mer can have on the quality of assembly; for a given length k-mer, the quality of the assembly varied greatly, with (i) some k-mers producing many but short HSPs, (ii) other k-mers yielding few, but longer HSPs, and (iii) no single k-mer yielding the longest HSP (contig) for all target genes. Also as shown in Table 1, for some targets only a single k-mer produced the best HSP (e.g. glc-3 at 39-mer), whereas for other targets a range of k-mers yielded equivalent best HSPs (e.g. unc-38 at k-mers 29-45).

Iterative-binning and iterative-assembly to optimize sequence determination
To improve the overall quality of the contigs (i.e., increase contig lengths by extension and by gap filling) additional computational steps of assembly were developed ( Figure 1, steps 3-6). In outline, this involved identifying and binning all RNA-Seq library read-pairs for which at least 1 read matched a single contig for each target gene, and then using Velvet to reassemble those binned reads into contigs; this process was repeated until the output contigs exhibited no relative improvement. In detail, high-identity contigs that were identified in the initial Velvet assembly (at k-mer length 31 for all gene targets excepting lev-8, glc-1, and glc-3,which used k-mer lengths of 23, 25, and 39 respectively) were used in Step 3 ( Figure 1) to identify and bin all library reads that mapped to those contigs; this process utilized the mapping program Bowtie set to increase the sensitivity of read identification by using a low quality threshold value (of 150) and by running it in unpaired-read mode. The unpaired-read mapping mode allowed inclusion of those reads that did not contribute to the prior contig (Table 1) for reason their pair read failed to map to that prior contig. Consequently, the reads binned in Step 3 included the paired reads along with a number of single reads for which their pair did not map; a custom Javascript collected into a single bin those reads that mapped as well as reads that did not map but whose read-pair did map (Figure 1, Step 4: Additional file 2). The Velvet program was then used to assemble contigs from the collected paired-end reads combined from both libraries (Figure 1, Step 5) using multiple length k-mers and coverage cutoffs to identify assembly conditions that produced a maximum contig length; this step involved minimal computational time given the small number of reads (< 2 * 10 5 ) present among all bins. Bowtie mapping, read set collection and assembly were reiterated ( Figure 1, Step 6) until a maximum contig length and maximum coverage (relative to the target genes) were achieved.   C. elegans target genes (name (ID), GenBank accession number (Acc #) and coding sequence length (CDS len)) used to BLASTx-query a database comprised of the initial de novo library assembly. For each k-mer (e.g. "21 mer HSPs") are listed in columns the number of high scoring pairs identified (HSP; "#"), the mean HSP length in DNA bases (" X "), and the range of HSP-lengths ("R") with minimum and maximum length shown. Bold HSP-length values indicates the longest HSP identified among all k-mers for a given target gene. "nd" indicate no high-similarity HSPs were identified at that k-mer.
The dramatic effectiveness of the reiterative method is evidenced by comparison of contigs and corresponding best-HSPs of the final (iterative-derived) contigs to the initial contigs and to the C. elegans target sequence, as is shown in Table 2. Whereas the initial Velvet assembly yielded 3 sequences corresponding to nearly 100% of the target gene coding sequences (lev-1, unc-38, glc-4), reiteration yielded the full coding sequence of an additional 7 target sequences including a second isotype of ben-1, and yielded significant improvement to all other sequences excepting that of lev-8.
Interestingly, whereas the initial assembly yielded contigs that best-mapped by BLAST analysis to glc-1 (Table 1), reiteration yielded an extended contig unambiguously identified by BLAST as the paralogous target gene avr-15. A closer examination of the initial contigs that were identified as glc-1 revealed that their region of similarity to glc-1 was quite small and exhibited almostas-high BLAST similarity to the paralog avr-15. Consequently, reads within the libraries provide evidence for transcripts for 2 avr-15 variants and no glc-1.
The quality/accuracy of the in silico derived sequences can be inferred from the shared identity to target genes of C. elegans (Table 2, column "% ID"). Additionally, a direct indication of in silico sequence quality for 2 of the target genes (unc-38 and -63) is possible because their sequence in O. dentatum had previously been determined (from PCR-amplicons). BLAST nucleotide comparison of the previously determined O. dentatum unc-63 mRNA sequence [GenBank:HQ162136] to the corresponding 1770 nucleotide in silico sequence ( Table 2, GACS01000005) identified a single alignment comprised of 1768 bases (including 30 non-identical bases) and no gaps. Of the non-identical bases, 20 were located within the coding sequence but only 1 of the 20 resulted in a change in the deduced protein sequence, a G 1411 A/ R 433 Q (numbering relative to HQ162136 DNA and corresponding protein sequences). Within the RNAseq library reads this base call was invariant, i.e. of the 5 reads that mapped over the G 1411 A position, all were unique and all contained the "A" base call.
A similar comparison of the 1681 base O. dentatum unc-38 mRNA sequence [GenBank:GU256648.1] to the corresponding 1631 nucleotide in silico sequence ( Table 2, GACS01000004) identified a single alignment comprised of 1607 bases and no gaps, and which, when limiting the comparison to only the coding sequence, exhibited 3 base changes (i.e., 1521 of the 1524 coding sequence bases were identical). All 3 base changes result in amino acid changes: A 179 G/ Y 35 C, T 434 C/ Y 120 H, and T 1103 C/ F 343 S (numbering relative to GACS01000004 DNA and corresponding protein sequences). The other differences determined by alignment of the full sequences were that the GACS01000004 3' UTR was shorter by 43 bases, and that the first 25 bases of the GACS01000004 75 base 5' UTR was not contained within the 82 base GU256648.1 5' UTR (suggesting the in silico sequence may represent a 5' splice variant). To validate these differences within the 5' UTR and the coding sequence of GACS01000004, reverse transcription Best initial high scoring pair (HSP) and contig lengths correspond to the longest HSP from the initial read assemblies (see bold values from Table 1) and the contig from which that HSP derives. "% full length" represents the percent of the comparator C. elegans protein that is represented within the final contig. "% ID" represents the percent identity as determined by pairwise alignment.
polymerase chain reaction (RT-PCR) using a 5' SL1 splice leader forward primer (see Methods) and an unc-38 specific internal reverse primer were used to amplify an approximately 1600 base RNA segment (spliced leaders represent a set of invariant nucleotide sequences that are post-transcriptionally added to the 5' end of many nematode mRNAs). The sequence of individual PCR clones fully recapitulated the GACS01000004 5' UTR sequence as well as the 3 base changes within the coding sequence. Further evidence for the validity of the in silico derived sequences was shown by RT-PCR amplification of ben-1 and avr-15 from RNA template using gene specific primers designed from the in silico sequence ( Table 2, GACS01000008 and GACS01000007, respectively). The sequences of 9 ben-1 PCR clones compared against that of GACS01000008 showed a single 1349 base no-gap alignment with 1307 identities and 42 variant base calls of which 6 resulted in a change in the deduced protein sequence: G 165 A/ V 49 I, G 512 A/ M 164 I, A 646 G/ D209G, T 840 C/ S 274 P, G 996 A/ V 326 M, C 1111 T/ A 364 V (numbering relative to GACS01000008 DNA and corresponding protein sequences). Interestingly, at each of the 42 sites of base variance, the majority of clone sequences called for an identical base to that of the in silico GACS01000008 sequence. In addition, at 21 of the 42 sites of base variance only a single PCR clone contained the variant base call. Thus the 42 base variants likely represent single nucleotide polymorphisms that are present within the population of ben-1 RNAs examined. In the equivalent analysis of avr-15, the sequences of 2 PCR clones compared against that of GACS01000007 showed a single 1363 base no-gap alignment with 1320 identities and 43 variant base calls of which 6 resulted in a change in the deduced protein sequence: A 1089 G/ K 345 R, C 1103 T/ H 350 Y, C 1173 T/ V 373 A, T 1185 C/ V 377 A, A 1284 G/ K 410 R, C 1292 G/ L 413 V (numbering relative to GACS01000007 DNA and corresponding protein sequences). Among the 2 PCR clones, one contained only the V 373 A deduced change and otherwise was identical in translation to the protein deduced from GACS01000007, and the other clone contained all 6 deduced amino acid changes.

Discussion
The data shown in Results demonstrate the efficient and successful use of an iterative de novo assembly of RNA-Seq data to determine in silico the sequence of 12 O. dentatum anthelminthic drug target genes. Selection of these particular genes was based upon their general importance to a range of studies investigating helminth drug targets (reviewed in [7,8]). The iterative assembly produced full length coding sequences for 9 target genes, whereas the Velvet assembly yielded full length (or nearly full-length) sequences for only a 3-gene subset of those 9 genes. A major utility of this process is that, as a computational process, it is scalable and should fit well to a variety of gene-characterization situations in O. dentatum or in any other nematode lacking a known genome sequence.
Related computational processes have been described with utility for producing output other than full length coding sequences. In one method, remapping of reads by identity to contigs within an initial assembly, and then reassembling contigs from those remapped reads, improved transcriptome assembly [14]; the desired output from that work was production of a general gene ontology. In another method, genome assemblies were variably improved via gap closure achieved by mapping paired-end reads and collecting pairs for which only one of the ends aligned to a contig [15].
As shown in Table 1, a number of k-mer lengths were used for initial library assemblies, demonstrating the dramatic effect of k-mer length on contigs. That said, as noted in Results, k-mer length 31 was used to build the contigs used in all target-gene resampling excepting that for lev-8 (for which k-mer 31 returned no contigs; see Table 1) and glc-3 (for which k-mer 31 contigs failed to support generation of resampled contigs representing the full target sequence). This suggests there is little need to use the best or longest contig as input to the resampling process. As a further test of this concept we were able to successfully build avr-15 from a single approximately 300 base initial contig. Thus, there seems to be a reduced need to conduct full library assemblies over a wide range of k-mers when attempting to derive in silico the sequences for a set of target genes; instead, one stops when a limited set of k-mers that have been run produce a quality contig for each target.
The reiterative approach presented here may have utility on a larger scale to facilitate transcriptome projects in nematodes and other organisms that lack genomic information but for which more complete gene-transcript sequences are desired. A dynamic programming approach will likely be required in such extended application to accommodate the conditional filters used during reiterative binning and assembly, and to accommodate a broader range of possible contig-to-target sequence similarity scores (i.e., blast E-values). We note that for a gene family represented by many members within a single organism there can arise ambiguities in contig identification, something that was seen in the present study for some initially assembled nAChR subunit contigs (data not shown); that such ambiguities were not observed for any reiteratively assembled contigs logically suggests that the longer the contig sequence-lengths, the less the chance that ambiguities of identity assignment will occur.
Of interest to nematologists is the identification of a lev-8-like sequence in O. dentatum, since it has also been found in C. elegans [16] but not in H. contortus [17], a Clade V nematode that is considered more closely related to O. dentatum than is C. elegans [18]. This identification is validated by its 85% amino acid identity ( Table 2) and 94% similarity to C. elegans lev-8; by comparison, it exhibited only 67% identity and 82% similarity to the highest BLAST-identified matching gene/protein in H. contortus, acr-8 [GenBank:ABV68891]. These data suggest a loss of lev-8 in H. contortus but not O. dentatum. Because data reported here identified only a partial in silico sequence for lev-8 (<300 bp), it is uncertain whether O. dentatum contains a full length (functional) lev-8 or only a vestigial (partial) lev-8 sequence; if the latter, then O. dentatum may be close behind H. contortus in losing the gene; if the former, then given that the lev-8 nAChR subunit has been shown in C. elegans to confer sensitivity to levamisole [19], one might predict a difference in levamisole sensitivity as a function of the presence, or absence of lev-8. While differences in the nAChR properties and drug sensitivities of closely related nematode species have been observed, the genetic basis for these differences are unknown.

Conclusions
The reiterative approach presented here was effective in determining in silico longer sequence reads for 11 genes of a 12-gene set of drug target genes of O. dentatum, a nematode for which exists very little genomic or gene information; an initial Velvet assembly that yielded 3 full/nearly-full length sequences can be improved by reiteration to yield full coding sequences for 9 (or 10, including ben-1 isotype 2) target genes and improved sequences for the remaining genes. The reiterative approach is expected to have general application for the in silico gene identification/sequencing of any nematode for which detailed genetic/gene information is lacking. The identification of full coding sequences for the target genes enables further examinations including studies like (i) seeking to reconstitute functional proteins/systems for assessment in vitro (similar to [17]), (ii) seeking comparative genomic and transcriptomic assessments between parasites isolates that exhibit varied drug sensitivities; such studies are ongoing in our labs and those of collaborators.

RNA isolation
Parasite samples resuspended in 1.0 ml TRI reagent (Molecular Research Center, Cincinnati, Ohio) were ground by mortar and pestle under liquid nitrogen then brought to a total volume of 2 to 3 ml TRI reagent. Total RNA was extracted from the TRI reagent according to the manufacturer's instructions, including an additional centrifugation step for clearing insoluble material. Extracted RNA was treated with DNase I (New England BioLabs, Ipswich, Massachusetts) (10 min at 37°C, 10 min at 75°C), then re-extracted with TRI reagent and resuspended in diethylpyrocarbonate-treated water. RNA concentration, purity, and quality (RNA Integrity Number) were assessed on a 2100 Bioanalyzer (Agilent Technologies, Santa Clara, California).

mRNA-Seq
The building of indexed, non-normalized, paired-end mRNA-seq libraries, and subsequent 75-cycle pyrosequencing on an Illumina GAIIx platform, were performed as a service by the DNA Facility (Office of Biotechnology, Iowa State University) using 5 μg total RNA (per sample). Male and female libraries were duplexed in a single sequencing lane.

Genomics and bioinformatics Assembly
Velvet version 1.1.06 [9] was used for contig assembly.

Similarity searching
BLAST algorithms [21] were used to compare contigs with sequences available in public databases including the National Center for Biotechnology Information (NCBI) to identify homologues from other nematodes, i.e., sequences returning BLAST expect values ≤ 1E -10 .
Read mapping 64 bit Bowtie [22] version 0.12.7 was used to map reads for contig building.

Pairwise comparison
The Needle algorithm [23] was used for pairwise comparison.

Custom codes
Java 1: Java-script code to read BLAST output and collect contigs that pass identity thresholds from a contig