A chloroplast genomic strategy for designing taxon specific DNA mini-barcodes: a case study on ginsengs

Background Universal conventional DNA barcodes will become more and more popular in biological material identification. However, in many cases, such as processed medicines or canned food, universal conventional barcodes are unnecessary and/or inapplicable because of DNA degradation. DNA mini-barcodes are a solution for such specific purposes. Here we illustrate how to develop the best mini-barcodes for specific taxa, using the ginseng genus (Panax) as an example. Results The chloroplast genome of P. notoginseng was sequenced and compared with that of P. ginseng. Regions of the highest variability were sought out. The shortest lengths that had the same discrimination power as the conventional lengths were considered the best mini-barcodes. The results showed that the chloroplast genome of P. notoginseng is 156,387 bp. There are only 464 (0.30%) substitutions between the two genomes. The intron of rps16 and two regions of the coding gene ycf1, ycf1a and ycf1b, evolved the quickest and served as candidate regions. The mini-barcodes of Panax turned out to be 60 bp for ycf1a at a discrimination power of 91.67%, 100 bp for ycf1b at 100%, and 280 bp for rps16 at 83.33%. Conclusions The strategy of searching whole chloroplast genomes, identifying the most variable regions, and shortening the focal regions into mini-barcodes is believed to be efficient for developing taxon-specific DNA mini-barcodes. Following this strategy is expected to yield the best available DNA mini-barcodes for a given taxon. Electronic supplementary material The online version of this article (doi:10.1186/s12863-014-0138-z) contains supplementary material, which is available to authorized users.

Background DNA barcoding is a relatively new concept that aims to provide rapid, accurate and automatable species identification using a standard DNA region. Chloroplast (or plastid) sequences such as rbcL and matK are usually used as DNA barcodes for plants [1]. The commonly used barcoding markers are longer than 650 bp. In most cases PCR succeeds easily when high-quality DNA is used. However, if the DNA molecules have degraded into fragments shorter than the region spanned by the primers, say 650 bp, it is not possible to amplify the DNA barcodes. In these cases, DNA mini-barcodes can be used.
A DNA mini-barcode is a short DNA sequence, generally 100-250 bp [2], suitable for species identification. Thus far, a few attempts have been made to design DNA mini-barcodes [3,4]. Because the sequences are much shorter, PCR amplification success should improve considerably, but identification success is likely to suffer. A good DNA mini-barcode should give high PCR and sequencing success without much loss of species discrimination power. Therefore, DNA mini-barcodes are more often taxon-specific than universal. Preferably, DNA mini-barcodes should be the most informative regions of a genome. For seed plants, it is now realistic to find such DNA mini-barcodes by searching whole chloroplast genomes, owing to the ease of chloroplast genome sequencing [5].
Chloroplast sequences have been extensively used for species identification and phylogenetic reconstruction of plants. Chloroplast sequences evolve relatively slowly, and there are few substitutions between species within a genus. To find the best DNA mini-barcodes, screening of the whole chloroplast genome is therefore usually necessary. Typically, the chloroplast genome of higher plants ranges from 120 to 160 kb in size, and a pair of inverted repeats (IRs) divides the genome into a large single copy (LSC) region and a small single copy (SSC) region. The IR regions are quite conserved [6], and the variable regions are located predominantly in the LSC and SSC [7].
DNA mini-barcodes can be used for species identification of digested material [8], old herbarium/museum specimens [9], ancient DNA, and, more frequently, processed medicinal herbs, when high-quality DNA is not available and degraded DNA has to be used. Ginsengs (Panax spp., Araliaceae) are the best-known Chinese medicines worldwide. They have been used as medicines alone or in combination with other medicines. Recently, ginsengs have also been used as ingredients in cosmetics, toothpaste and beverages, and as vegetables. There are eight species in Panax. All species are considered seriously endangered medicinal plants. Panax notoginseng (Burkill) F. H. Chen ex C. Y. Wu & K. M. Feng is extinct in the wild, and wild P. ginseng C.A. Mey. in China is nearly extinct. Nevertheless, illegal harvest and trade still occur occasionally. For law-enforcement activities in the conservation of wild populations of endangered species, there is a need for a method that correctly identifies confiscated materials in the form of fragments, powders or decoctions of any organ. Panax ginseng and P. notoginseng have been cultivated in China for a long time for medicinal purposes. Roots of P. quinquefolius L. are imported from the USA or produced in Northeast China. The commercial roots of P. ginseng and P. quinquefolius resemble each other, and it is difficult for laymen to tell them apart. Once they are sliced or powdered, even experts are unlikely to distinguish them. Almost all species are identifiable using the DNA barcoding method according to Zuo et al. [10]. However, if the materials have been processed, as in decoctions and dietary supplements, the conventional DNA barcodes would probably fail. Therefore, it is justified to design DNA mini-barcodes of ginsengs for conservation purposes, for monitoring the ginseng market, and for protecting consumers' rights.
In this study, we report a strategy for designing taxon-specific DNA mini-barcodes using ginsengs as an example. We first sequenced the chloroplast genome of P. notoginseng, then sought out the hypervariable regions by comparing the new genome with that of P. ginseng, and finally determined the lengths and positions of the best DNA mini-barcodes and tested their applicability.

Results
Characteristics of the chloroplast genome of P. notoginseng
The chloroplast genome of P. notoginseng is 156,387 bp in length, slightly longer than the genome of P. ginseng which is 156,318 bp (GenBank Accession number: KJ566590). The length of IR regions is 26,126 bp each, 55 bp longer. The LSC region is 86,111 bp, 5 bp longer; and SSC is 18,024 bp, 46 bp shorter. There are 79 protein-coding genes, 30 tRNA genes, and 4 rRNA genes (Figure 1, Additional file 1: Table S1). The total G + C content of the whole chloroplast genome is 38.08%. The IRa/LSC, LSC/IRb and IRb/SSC junctions are identical to the chloroplast genome of P. ginseng, but the SSC/IRa junction (ycf1) of P. notoginseng is 8 bp shorter.
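The region lengths reported above can be cross-checked with simple arithmetic, since a quadripartite chloroplast genome consists of the LSC, the SSC and two copies of the IR. The following is only a minimal sketch of that check; the function name `genome_length` is a hypothetical helper introduced for illustration and is not part of the original analysis.

```python
# Region lengths (bp) reported for P. notoginseng in the text.
LSC = 86_111
SSC = 18_024
IR = 26_126  # length of each of the two inverted repeats

# Total genome length (bp) reported for P. ginseng in the text.
P_GINSENG_TOTAL = 156_318


def genome_length(lsc, ssc, ir):
    """Total length of a quadripartite chloroplast genome: LSC + SSC + 2 x IR."""
    return lsc + ssc + 2 * ir


total = genome_length(LSC, SSC, IR)

# Net difference implied by the per-region differences stated in the text:
# each IR is 55 bp longer, the LSC 5 bp longer, and the SSC 46 bp shorter.
net_difference = 2 * 55 + 5 - 46

print(total, total - P_GINSENG_TOTAL, net_difference)
```

Both routes give the same difference between the two genomes, so the reported region lengths are internally consistent.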
A comparison of the entire chloroplast genome sequences of P. notoginseng and P. ginseng revealed 464 nucleotide substitutions, including 273 transitions (Ts) and 191 transversions (Tv) (Figure 2). Of these substitutions, 193 events were in the coding regions, 45 in the introns and 226 in the intergenic spacers. The patterns among the three regions were similar. The proportion of Ts was much higher than that of Tv in all regions, indicating a bias in favor of transitions. This bias was even more pronounced in the coding region, in which the Ts/Tv was 1.68, whereas the Ts/Tv in the introns and intergenic spacers was 1.50 and 1.24, respectively. Among the 79 genes, 23 genes had non-synonymous substitutions.
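As an illustration of how substitutions are classified, the sketch below counts transitions (purine to purine or pyrimidine to pyrimidine) and transversions (purine to pyrimidine or the reverse) between two aligned sequences. It is a minimal example under stated assumptions, not the procedure used by the authors: the function name `count_ts_tv` and the toy sequences are hypothetical, and columns containing gaps or ambiguous bases are simply skipped.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def count_ts_tv(seq_a, seq_b):
    """Count (transitions, transversions) between two aligned sequences.

    Columns with gaps or non-ACGT characters are ignored (an assumption
    made for this sketch).
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    ts = tv = 0
    for a, b in zip(seq_a.upper(), seq_b.upper()):
        if a not in "ACGT" or b not in "ACGT" or a == b:
            continue
        pair = {a, b}
        if pair <= PURINES or pair <= PYRIMIDINES:
            ts += 1  # same chemical class: transition
        else:
            tv += 1  # different chemical class: transversion
    return ts, tv


# Toy alignment (hypothetical data): A/G and C/T are transitions,
# G/T and A/C are transversions, the gap column is skipped.
toy_counts = count_ts_tv("ACGTAC-", "GCTTCT-")

# Overall Ts/Tv ratio from the genome-wide counts reported in the text.
overall_ratio = 273 / 191
print(toy_counts, round(overall_ratio, 2))
```

Applied to the reported totals, the genome-wide ratio is about 1.43, which lies between the values given for the coding regions (1.68) and the intergenic spacers (1.24).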
In total, 156 indels were detected between the chloroplast genomes of the two Panax species (Additional file 2: Table S2): 84 insertions and 72 deletions in P. notoginseng or, equivalently, 84 deletions and 72 insertions in P. ginseng. Most of the indels (63.06%) were single nucleotide differences. Indels longer than 10 bp occurred 16 times. The longest indel (34 bp) was in the spacer between rps16 and trnQ. The majority of the indels occurred in the non-coding regions with two exceptions, a 15 bp insertion and an 18 bp insertion in the ycf2 gene of P. notoginseng.
Three short inversions were observed in ndhD-psaC, petB intron and trnM-atpE between the two chloroplast genomes (Additional file 3: Table S3). All the inversions have hairpin structures, including the inversions and the inverted repeats. The inverted repeats formed the stem structures, and the inversions formed the loops. The lengths of inverted repeats were 3 bp, 44 bp, and 11 bp, and the lengths of the inversions were 19 bp, 18 bp, and 14 bp, respectively in the ndhD-psaC, petB intron and trnM-atpE regions.

Variability throughout the chloroplast genomes
The variability throughout the chloroplast genomes was quantified using the average nucleotide diversity (π) (Figure 3). The average value of π is 0.00208. The IR regions exhibited lower variability than the LSC and SSC regions. Three peaks showed remarkably higher π values (>0.012): one is the intron of rps16, and the other two are coding regions of ycf1 (ycf1a and ycf1b) (Figure 3). The variability of these three regions was tested together with that of the three conventional candidate barcodes (matK, rbcL and trnH-psbA) using 24 samples of all eight Panax species. The ycf1a, ycf1b and trnH-psbA regions showed nearly double the π values of the other three markers (Table 1).
A barcoding analysis demonstrated that the matK and trnH-psbA can discriminate 62.50% of the samples. The percentages are 83.33% for rps16 intron, 91.67% for rbcL and ycf1a, and 100% for ycf1b (Table 2).

DNA mini-barcode for Panax
Discrimination power, measured as the maximum percentage of samples discriminated (Pm), varied with sequence length and among markers (Figure 4). The Pm of trnH-psbA did not change as sequence length increased. The Pm stabilized at 100 bp for matK and ycf1a, 150 bp for ycf1b, and 200 bp for rbcL, whereas the Pm of the rps16 intron kept rising with sequence length (Figure 4). Since no change was observed in the Pm of trnH-psbA, the shortest mini-barcode is 60 bp of ycf1a with 91.67% discrimination power (Table 2), whereas ycf1b needs 100 bp for 100% discrimination power. A pair of primers was designed for the best mini-barcode of each marker (Table 2). Powdered roots of P. notoginseng and steamed roots of P. ginseng purchased from the market were used to test the mini-barcodes (Additional file 4: Figure S1). Amplification and sequencing of the ycf1b mini-barcode were successful, but amplification of the conventional ycf1b failed (Additional file 5: Figure S2).

Discussion
In practice, the length of a barcode becomes an issue of concern. Very subjectively, barcodes can be classified by length, for example, micro-barcodes of less than 100 bp, mini-barcodes of 100-250 bp [2,11], conventional barcodes of 250-1000 bp, super-barcodes of 1000-6000 bp, and genome barcodes using the whole genome. However, many applications of mini-barcodes do not fall strictly within this class, because a conventional barcode could be created by concatenating several mini-barcodes. Mini-barcodes have often been used at the risk of lowering taxonomic resolution and consequently underestimating biodiversity. Mini-barcodes have potential in situations where long fragments are impracticable, or unnecessary for reasons of efficiency and economy. Mini-barcodes of high resolution are not easily found, and that is why whole genomes are indispensable for the development of mini-barcodes.
The chloroplast genome is specific to plants. Chloroplast DNA barcodes avoid contamination by DNA from organisms without chloroplasts, such as animals and fungi. Therefore, chloroplast DNA barcodes are a primary choice. Unfortunately, chloroplast genes usually evolve more slowly than nuclear genes [12], and candidate barcodes such as matK and rbcL often have limited resolution at the species level [13,14]. However, some regions of the chloroplast genome evolve much more quickly and meet the criteria for a DNA barcode. The strategy of searching whole chloroplast genomes has been successfully applied to Jacobaea [15], Oncidium [16], Parthenium [17], and Theobroma [18]. Although some species are extremely closely related and show no variation at the matK and rbcL loci, for example, Acorus americanus vs. A. calamus and Oryza nivara vs. O. sativa, there are some differences at other loci [7]. Therefore, searching the chloroplast genomes of congeners is a reliable strategy for finding the best chloroplast DNA mini-barcodes. Another advantage of chloroplast mini-barcodes is that they are almost free of intra-populational variation and show very low inter-populational variation. Sequence divergence is predominantly between species [19]. Species identification is therefore more reliable in most cases.
Figure 2. The patterns of nucleotide substitutions between the two Panax chloroplast genomes. The patterns were divided into six types, as indicated by the six non-strand-specific base-substitution types (i.e., numbers of considered G to A and C to T sites for each respective set of associated mutation types). The chloroplast genome of P. notoginseng was used as the standard.
Indels (gaps) are another kind of potentially useful informative signal [20]. There are 156 indels between the two Panax genomes. Indels are more useful at the lowest taxonomic level. Microsatellite markers are analogous to indels. Using indel information is often cumbersome in practice. When gaps are coded as a fifth character state, they are very likely to be overweighted. To solve this problem, gaps are better coded manually. Indels are unlikely to occur in the mini-barcodes of closely related species. However, chloroplast indels are likely to be another kind of DNA barcode for closely related species.
DNA mini-barcodes have so far been used for studying flora or fauna [3,4,9,11,21,22]. Such usages are often a compromise between resolution and experimental success. Consequently, the mini-barcodes may underestimate the diversity of the flora or fauna. However, DNA mini-barcodes are more suitable for ecologically and economically important taxa, because the best taxon-specific mini-barcodes are more likely to be found for them. Ginsengs are the most well-known herbal medicine in China. They have been extensively used for a long time. Substitution of expensive materials with similar but cheaper ones from congeners is reported occasionally. An effective and quick method for identifying ginseng species is helpful for monitoring ginseng markets. We tested our mini-barcodes using materials purchased from the market, and they proved applicable in such cases.

Conclusions
In this study we provide a strategy for developing taxon-specific DNA mini-barcodes without lowering discrimination power, using the ginseng genus (Panax) as an example. The strategy of searching whole genomes, identifying the most variable regions, and shortening the focal regions into mini-barcodes is believed to be efficient for developing taxon-specific DNA mini-barcodes. The mini-barcodes for Panax were tested and found useful for identifying processed ginsengs from the medicinal market.

Methods
Chloroplast genome sequencing
Leaves of P. notoginseng were collected from Wenshan, Yunnan province (Collection number: A8). The genomic DNA was extracted using a modified CTAB (mCTAB) method [23] and purified using the Wizard DNA Clean-Up System (Promega, Madison, WI, USA). The chloroplast genome was sequenced using a short-range PCR method similar to that of Dong et al. [5]. Panax-specific primers (Additional file 6: Table S4) based on the chloroplast genome of P. ginseng [24] and some universal primers [5] were used to amplify and sequence the chloroplast genome of P. notoginseng. The chloroplast genome of P. ginseng served as a reference. The genome structure was confirmed by amplifying additional fragments spanning the LSC ↔ IRb, IRb ↔ SSC, SSC ↔ IRa, and IRa ↔ LSC junctions [5].

Genome annotation
The whole chloroplast genome was annotated using DOGMA [25] to identify the coding sequence, rRNA, and tRNA using the chloroplast/bacterial genetic code. The annotation of the tRNA genes was checked using tRNAscan-SE [26]. The genome map was generated using GenomeVx [27].

Identification of the hypervariable regions
The chloroplast genome of P. notoginseng was aligned to the chloroplast genome of P. ginseng [24] using MAFFT [28] and then adjusted manually using Se-Al 2.0 [29]. To identify the highly variable regions within the chloroplast genomes, we calculated the nucleotide diversity using DnaSP ver. 5.0 [30] with a sliding window analysis. The window length was set to 600 bp with a step size of 25 bp.
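The sliding-window calculation described above can be sketched as follows. This is a simplified illustration rather than a reimplementation of DnaSP: π is computed here as the mean proportion of differing sites over all sequence pairs, columns with gaps or ambiguous bases are skipped pair by pair (DnaSP's gap handling may differ), and the function names are hypothetical. The defaults mirror the 600 bp window and 25 bp step given in the text.

```python
from itertools import combinations


def pairwise_difference(a, b):
    """Proportion of differing sites between two aligned sequences,
    ignoring columns with gaps or non-ACGT characters."""
    compared = diffs = 0
    for x, y in zip(a, b):
        if x in "ACGT" and y in "ACGT":
            compared += 1
            diffs += x != y
    return diffs / compared if compared else 0.0


def nucleotide_diversity(seqs):
    """Mean pairwise difference per site (pi) over all sequence pairs."""
    pairs = list(combinations(seqs, 2))
    return sum(pairwise_difference(a, b) for a, b in pairs) / len(pairs)


def sliding_pi(seqs, window=600, step=25):
    """Return (window midpoint, pi) for each window along the alignment."""
    seqs = [s.upper() for s in seqs]
    length = len(seqs[0])
    results = []
    for start in range(0, length - window + 1, step):
        chunk = [s[start:start + window] for s in seqs]
        results.append((start + window // 2, nucleotide_diversity(chunk)))
    return results


# Toy alignment (hypothetical data) with two differences at columns 10-11,
# scanned with a small window so the output is easy to follow.
toy = ["A" * 20, "A" * 10 + "CC" + "A" * 8]
toy_profile = sliding_pi(toy, window=10, step=5)
print(toy_profile)
```

Windows whose π stands well above the genome-wide average, such as the rps16 intron and the two ycf1 regions reported in the Results, would then be taken as candidate hypervariable regions.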

Plant material, PCR amplification and hypervariable region sequencing
All eight Panax species were included in this study, and each species was represented by at least two accessions (Additional file 7: Table S5). Medicinal materials (Additional file 4: Figure S1) were purchased from the market to test the mini-barcodes designed in this study. The primers for amplifying the highly variable regions were designed using FastPCR (Additional file 8: Table S6). The primers for amplifying and sequencing the control markers rbcL, matK and trnH-psbA were the same as those of Zuo et al. [10]. The PCR amplifications were performed in a final volume of 25 μL containing 1× PCR buffer (with Mg2+), 0.25 mmol/L each dNTP, 0.25 μmol/L each primer, 1.25 U Taq polymerase, and 20-30 ng DNA. The PCR program started at 94°C for 4 min, followed by 34 cycles of 30 s at 94°C, 40 s at 52°C, and 1 min at 72°C, and ended with a final extension of 10 min at 72°C. The PCR products were checked by electrophoresis on a 1% agarose gel containing ethidium bromide and visualized using an ultraviolet transilluminator. Both strands were sequenced on an ABI Prism 3730xl (Applied Biosystems, Foster City, U.S.A.) following the manufacturer's protocols.

DNA barcoding analysis
Distance is likely the most commonly used method for classifying DNA sequences. In this study, the distance method was used to analyze the barcoding performance of the newly identified highly variable regions. The function nearNeighbour of SPIDER was used for the barcoding analysis [31]. Species discrimination was considered successful if, for every individual of a given species, the closest K2P distance was to a conspecific individual only.
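The nearest-neighbour criterion can be illustrated with a short sketch. It is not a port of SPIDER's nearNeighbour: the function names are hypothetical, the K2P distance is computed directly from the standard formula d = -0.5 ln[(1 - 2P - Q) sqrt(1 - 2Q)], where P and Q are the proportions of transitions and transversions, and the treatment of ties (a sample counts as discriminated only if all of its equally close nearest neighbours are conspecific) is an assumption of this example.

```python
import math

PURINES = {"A", "G"}


def k2p_distance(a, b):
    """Kimura 2-parameter distance between two aligned sequences.

    Columns with gaps or non-ACGT characters are skipped. Intended for
    closely related sequences, where the log argument stays positive.
    """
    sites = ts = tv = 0
    for x, y in zip(a.upper(), b.upper()):
        if x not in "ACGT" or y not in "ACGT":
            continue
        sites += 1
        if x == y:
            continue
        if (x in PURINES) == (y in PURINES):
            ts += 1  # transition
        else:
            tv += 1  # transversion
    p, q = ts / sites, tv / sites
    return -0.5 * math.log((1 - 2 * p - q) * math.sqrt(1 - 2 * q))


def nearest_neighbour_calls(seqs, species):
    """True for each sample whose nearest neighbour(s) are all conspecific."""
    calls = []
    for i, seq in enumerate(seqs):
        dists = [(k2p_distance(seq, other), species[j])
                 for j, other in enumerate(seqs) if j != i]
        best = min(d for d, _ in dists)
        nearest_species = {sp for d, sp in dists if d == best}
        calls.append(nearest_species == {species[i]})
    return calls


def discrimination_power(seqs, species):
    """Percentage of samples correctly discriminated."""
    calls = nearest_neighbour_calls(seqs, species)
    return 100.0 * sum(calls) / len(calls)


# Toy data (hypothetical): two species separated by one substitution.
toy_seqs = ["ACGTACGTAC", "ACGTACGTAC", "ACGTACGTAT", "ACGTACGTAT"]
toy_species = ["x", "x", "y", "y"]
print(discrimination_power(toy_seqs, toy_species))
```

With 24 samples, as in this study, each sample contributes about 4.17 percentage points, which is why the reported discrimination powers take values such as 62.50%, 83.33% and 91.67%.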

DNA mini-barcode search using SPIDER
We used the sliding window function slideAnalyses of SPIDER [31] version 1.2-0 to find the shortest informative windows. This function extracts all the possible windows of a chosen size in a DNA alignment and performs pairwise distance (K2P) and NJ tree-based analyses of each window. To assess how the markers performed as their sequence lengths increased, the changes in discrimination power, measured as the maximum percentage of samples discriminated (Pm), were plotted at 50, 100, 150, 200, 250, 300, and 350 bp. To determine the minimum length of a mini-barcode that performed as well as the full length, sliding window analyses were conducted. The starting length was set to 50 bp. The length was then increased by 10 bp in each subsequent round until the maximum discrimination power was reached. The shortest length of a marker that achieved this was considered the shortest mini-barcode of that marker.
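The search procedure described above can be sketched as a loop over window lengths. This is a simplified illustration, not the SPIDER implementation: mismatch counts stand in for K2P distances, ties among nearest neighbours are treated as failures unless all tied neighbours are conspecific, the alignment is assumed to be gap-free, and all function names are hypothetical. The defaults mirror the 50 bp starting length and 10 bp increment given in the text.

```python
def mismatches(a, b):
    """Number of differing columns (stand-in for a K2P distance)."""
    return sum(x != y for x, y in zip(a, b))


def percent_discriminated(seqs, species):
    """Percentage of samples whose nearest neighbours are all conspecific."""
    ok = 0
    for i, seq in enumerate(seqs):
        dists = [(mismatches(seq, other), species[j])
                 for j, other in enumerate(seqs) if j != i]
        best = min(d for d, _ in dists)
        if {sp for d, sp in dists if d == best} == {species[i]}:
            ok += 1
    return 100.0 * ok / len(seqs)


def best_window(seqs, species, width):
    """Return (Pm, start) of the best-scoring window of the given width."""
    best = (-1.0, 0)
    for start in range(len(seqs[0]) - width + 1):
        window = [s[start:start + width] for s in seqs]
        score = percent_discriminated(window, species)
        if score > best[0]:
            best = (score, start)
    return best


def shortest_minibarcode(seqs, species, start_len=50, step=10):
    """Grow the window until its Pm matches that of the full-length marker.

    Returns (window length, start position, Pm).
    """
    full = percent_discriminated(seqs, species)
    width = start_len
    while width < len(seqs[0]):
        pm, start = best_window(seqs, species, width)
        if pm >= full:
            return width, start, pm
        width += step
    return len(seqs[0]), 0, full


# Toy alignment (hypothetical data): species y differs at column 4 and
# species z at column 8, so a window must span both columns.
toy_seqs = ["AAAAAAAAAAAA"] * 2 + ["AAAACAAAAAAA"] * 2 + ["AAAAAAAAGAAA"] * 2
toy_species = ["x", "x", "y", "y", "z", "z"]
toy_result = shortest_minibarcode(toy_seqs, toy_species, start_len=2, step=2)
print(toy_result)
```

In the toy case, windows of 2 and 4 columns cannot cover both diagnostic sites, so the search stops at 6 columns; the same logic underlies the reported 60 bp (ycf1a) and 100 bp (ycf1b) mini-barcodes.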