Principal components analysis - K-means transposon element based foxtail millet core collection selection method
- Ernesto Borrayo†1, 2,
- Ryoko Machida-Hirano†1,
- Masaru Takeya2,
- Makoto Kawase1 and
- Kazuo Watanabe1
© Borrayo et al. 2016
Received: 19 April 2015
Accepted: 1 February 2016
Published: 16 February 2016
Core collections are important tools in genetic resources research and administration. At present, most core collection selection criteria are based on one of the following item characteristics: passport data, genetic markers, or morphological traits, which may lead to inadequate representations of variability in the complete collection. The development of a comprehensive methodology that includes as much element data as possible has been poorly explored. Using a collection of foxtail millet (Setaria italica subsp. italica (L.) P. Beauv.) as a model, we developed a method for core collection construction based on genotype data and numerical representations of agromorphological traits, thereby improving the selection process.
Principal component analysis allows the selection of the most informative discriminators among the various elements evaluated, regardless of whether they are genetic or morphological, thereby providing an adequate criterion for subsequent K-means clustering. The core collections of S. italica constructed using only genotype data demonstrated better overall validation scores than the other core collections that we generated. However, the core collection based on both genotype and agromorphological characteristics represented the overall diversity adequately.
The inclusion of both genotype and agromorphological characteristics as a comprehensive dataset in this methodology ensures that agricultural traits are considered in the core collection construction. This approach will be beneficial for genetic resources management and research activities for S. italica as well as other genetic resources.
Keywords: Core collection, Foxtail millet, K-means, Principal component analysis
The exploitation of genetic resources has been a primary concern for several governmental and nongovernmental agricultural institutions around the world, where the interest may vary from economically exploitable variant crops, to sociocultural, health-related, and biological-related studies (phylogenetic relationships, phenotype-genotype relationships, and physiological-environmental behaviors). However, most researchers must address the problem of data mining to obtain collections of an appropriate size.
Due to the size of some collections, data mining of the complete collection (MC) may sometimes be too expensive in both operational and monetary terms; therefore, core collections (CC) and mini-core collections have emerged in recent decades.
Methods for obtaining an optimal CC have been explored widely [8–11], and several algorithms and informatics tools have been developed [12–15], but CCs still have many different objectives and various evaluation criteria.
Most CC-related studies are based on one or more of three principal characteristics: a) passport data, b) genotypic analysis, and c) morphological traits. As new genetic information becomes available, CC selection has increasingly used genotypic analysis as a good criterion, but the efficiency of specific molecular markers needs to be demonstrated for phenotypic traits of interest because both types of data are fundamental requirements of genetic breeding programs. Several studies have utilized molecular markers in different collections, including the development of CCs based on widely used simple sequence repeats [11, 17, 18] and restriction fragment length polymorphisms, which have demonstrated the great potential of using genetic data for CC selection.
Foxtail millet (Setaria italica subsp. italica (L.) P. Beauv.) is one of the oldest cereals consumed by people in Eurasia, America, Africa, and Australia. Foxtail millet has a relatively small genome (515 Mb) and it has been adopted as a model organism [20, 21] because of its potential use in studies that involve grass species evolution, C3 and C4 photosynthesis, stress biology, and biofuel [22–24].
Three recently active transposable elements (TEs) have proved to be suitable genome-wide markers for evolutionary studies of S. italica. We hypothesize that these markers may also be useful for CC selection in this species.
In this study, we combined principal components analysis (PCA) and the K-means method for CC selection based on evaluations of traditional and newly described CC evaluation parameters. This methodology allowed us to include both genotypic and agromorphological traits (AT) in CC selection. Thus, we present a proof of concept for the potential use of TE and AT combined as selection criteria for CC construction in S. italica.
Core collection selection
The accessions used in this study originated from 38 different countries, which encompassed the major traditional geographical distribution (Asia, Eurasia, and Africa) of the study species. To obtain genomic information, transposon display (TD), a modified form of amplified fragment length polymorphism (AFLP), was performed with some modifications using three TEs of different classes and characteristics: TSI-1 [Tourist miniature inverted-repeat transposable element (Tourist MITE)], TSI-7 [long terminal repeat (LTR) retrotransposon], and TSI-10 [short interspersed nuclear element (SINE)]. These TEs were identified in the mutant alleles of Waxy (GBSS1), which controls the amylose content in the starch endosperm. The genomic dataset obtained (data 0) comprised a total of 423 S. italica accessions, which were genotyped by TD. AT data were downloaded from the National Institute of Agrobiological Sciences (NIAS) database (http://www.gene.affrc.go.jp/databases-plant_search_char_en.php?type=9) for 141 of the original 423 accessions. Eight ATs were categorized and mapped to binary data, which were represented as 28 “m” characteristics (data II). For discrete variables, each possible phenotypic state was treated as present/absent. Continuous variables were categorized arbitrarily into three groups and then treated as discrete variables using the same present/absent criteria. The original phenotypic values and their numerical representations are summarized in Additional file 1 (Online Resource 1). To facilitate comparisons of data II behavior, we created data I, which comprised the same 141 accessions used in data II, but with the genotypic information from data 0. To determine the feasibility of analyzing phenotypic traits with genotypic markers in a single step, we merged the data I and data II sets to obtain data III, where each m element was treated as equal regardless of its TD or AT origin.
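The present/absent encoding and the merging step described above can be illustrated with a minimal Python sketch. It is not the authors' FREEMAT code: the function names, the toy trait values, and the use of equal-width bins for continuous traits are all assumptions made for illustration (the text only states that continuous variables were split arbitrarily into three groups).

```python
def encode_discrete(values):
    """Map a discrete trait to one present/absent column per observed state."""
    states = sorted(set(values))
    return states, [[1 if v == s else 0 for s in states] for v in values]


def encode_continuous(values, n_bins=3):
    """Bin a continuous trait into equal-width groups (an assumption of this
    sketch), then encode the group membership as present/absent columns."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins or 1.0  # fall back to 1.0 if all values are equal
    bins = [min(int((v - lo) / width), n_bins - 1) for v in values]
    return [[1 if b == i else 0 for i in range(n_bins)] for b in bins]


# Hypothetical toy data: three accessions, one discrete and one continuous trait.
grain_colour = ["yellow", "red", "yellow"]
plant_height_cm = [90.0, 150.0, 120.0]
td_bands = [[1, 0, 1], [0, 1, 1], [1, 1, 0]]  # genotype (TD) band presence/absence

states, colour_cols = encode_discrete(grain_colour)
height_cols = encode_continuous(plant_height_cm)

# Merged matrix: every column is treated equally, whatever its origin.
merged = [g + c + h for g, c, h in zip(td_bands, colour_cols, height_cols)]
```

In this toy case each accession ends up as a single row of eight binary columns (three TD bands, two colour states, three height groups), which mirrors how genotype and trait columns are placed side by side in the merged dataset.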
Principal component analysis - K-means analysis
Because the informativeness differs for each m element of the data, PCA was performed to rearrange the data into a new matrix. This procedure orders the elements by decreasing informativeness and discards elements with a variance equal to 0. The process generated two new matrices: one containing the vectors onto which the original m characteristics were mapped (x) and one containing the rearranged variance values (X). Thus, matrix X contained n samples, each of which was a numerical vector of length m − (number of non-informative m). The number of elements retained can also be set arbitrarily in order to work with only the most informative elements of the data. To select the CCs, we performed PCA to arrange the data from the most significant to the least significant elements in terms of their discriminating information, but without affecting the element associations. After rearranging the data, the scores representing each sample were subjected to K-means clustering using an implementation that enhances the K-means algorithm in order to avoid empty clusters. For each of the K clusters, the sample with the lowest Euclidean distance from the cluster centroid was selected as a representative. The newly generated CC was evaluated according to several validation parameters, which have been used widely [8, 9] and reviewed in recent studies.
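The clustering and representative-selection step can be sketched as follows. This is a hedged illustration in Python rather than the FREEMAT implementation: it assumes the PCA scores have already been computed, uses a deterministic farthest-point initialisation, and handles an empty cluster simply by keeping its previous centroid, which is a stand-in for, not a reproduction of, the modified K-means algorithm the authors used. The function name and toy scores are hypothetical.

```python
import math


def kmeans_representatives(scores, k, n_iter=100):
    """Cluster score vectors with K-means and return, for each cluster,
    the index of the sample closest to the cluster centroid."""
    # Deterministic farthest-point initialisation (an assumption of this sketch).
    centroids = [list(scores[0])]
    while len(centroids) < k:
        far = max(scores, key=lambda s: min(math.dist(s, c) for c in centroids))
        centroids.append(list(far))

    labels = None
    for _ in range(n_iter):
        new_labels = [min(range(k), key=lambda j: math.dist(s, centroids[j]))
                      for s in scores]
        if new_labels == labels:  # assignments stable: converged
            break
        labels = new_labels
        for j in range(k):
            members = [s for s, lab in zip(scores, labels) if lab == j]
            if members:  # an empty cluster keeps its previous centroid
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]

    reps = []
    for j in range(k):
        idx = [i for i, lab in enumerate(labels) if lab == j]
        if idx:
            reps.append(min(idx, key=lambda i: math.dist(scores[i], centroids[j])))
    return sorted(reps)


# Hypothetical PCA scores for six samples forming two obvious groups.
toy_scores = [(0.0, 0.0), (1.0, 0.0), (0.4, 0.1),
              (5.0, 5.0), (6.0, 5.0), (5.4, 5.2)]
core = kmeans_representatives(toy_scores, k=2)
```

Choosing an actual sample nearest to each centroid, rather than the centroid itself, ensures that every core collection entry is a real accession from the complete collection.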
Evaluation of the selected core collections
The selected CCs were analyzed based on their distribution according to a phylogenetic reconstruction. A genetic distance matrix and a neighbor-joining dendrogram were obtained using AFLP-SURV 1.0 and the Phylogeny Inference Package (PHYLIP) 3.69, respectively, for the 141 accessions present in data I. The data I dendrogram and the visualization of the CCs were obtained using MEGA 5.2. The geographical distributions of the CCs were digitized and visualized using DIVA GIS (http://www.diva-gis.org/).
where K is the total number of CC elements, k is each CC element, and D is the GAFD distance between k and all other jth CC elements, cCC, excluding itself, thereby yielding L comparisons in total.
The ideal value for ANE is 0, where each sample of CC represents itself and others exactly like it. It is useful to evaluate CCs where the objective is a homogeneous representation of the diversity in the MC. In addition, ENE and E are used to evaluate the data dispersion for the CC, where higher values indicate the better representation of extreme values.
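As a rough illustration of these distance-based criteria, the Python sketch below computes ANE, ENE, and E under the assumption that they follow the commonly reviewed definitions: ANE is the mean distance from each MC accession to its nearest CC entry, ENE is the mean distance from each CC entry to its nearest other entry, and E is the mean distance between CC entries. Euclidean distance is used here purely as a stand-in for the GAFD distance, and the function name and toy coordinates are hypothetical.

```python
import math


def distance_criteria(mc, core_idx, dist=math.dist):
    """Return (ANE, ENE, E) for a core collection given as indices into mc.
    Requires at least two core entries."""
    core = [mc[i] for i in core_idx]
    # ANE: mean distance from every MC accession to its nearest core entry.
    ane = sum(min(dist(a, c) for c in core) for a in mc) / len(mc)
    # ENE: mean distance from each core entry to its nearest other core entry.
    ene = sum(min(dist(c, o) for j, o in enumerate(core) if j != i)
              for i, c in enumerate(core)) / len(core)
    # E: mean distance over all ordered pairs of distinct core entries.
    pairs = [(i, j) for i in range(len(core)) for j in range(len(core)) if i != j]
    e = sum(dist(core[i], core[j]) for i, j in pairs) / len(pairs)
    return ane, ene, e


# Hypothetical complete collection of four accessions; entries 0 and 2 form the core.
toy_mc = [(0, 0), (0, 1), (4, 0), (4, 3)]
ane, ene, e = distance_criteria(toy_mc, [0, 2])
```

With these toy points the core entries are 4 units apart, so ENE and E are both 4.0, while ANE is 1.0 because the two non-core accessions lie 1 and 3 units from their nearest entries. An ANE of exactly 0 would only arise if every accession coincided with a core entry.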
Evaluation criteria based on statistical parameter comparisons between the CC and the MC are used mainly to determine whether the CC adequately represents the identity of the MC as well as its distribution. Widely used evaluation parameters that meet these criteria were applied as follows.
A homogeneity test was performed on each trait for CC and MC based on the means and variances. For each comparison, a global value was represented as the percentage of traits that were statistically different (α=0.05) according to a t-test for means (MD) and the F-test for variances (VD).
where ACC is the set of alleles in the CC and AMC is the set of alleles present in the MC.
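The expression that these set definitions belong to is not reproduced in the text, so the following Python sketch rests on an assumption: that the parameter is an allele coverage, computed as the size of the CC allele set as a percentage of the size of the MC allele set. An allele is counted as present in a collection if at least one accession carries it. The function name and the toy band matrix are hypothetical.

```python
def allele_coverage(mc_rows, core_idx):
    """Percentage of MC alleles (columns with at least one 1) retained in the CC.
    Assumes coverage = 100 * |A_CC| / |A_MC|, which is not confirmed by the text."""
    n_cols = len(mc_rows[0])
    a_mc = {j for j in range(n_cols) if any(row[j] for row in mc_rows)}
    a_cc = {j for j in range(n_cols) if any(mc_rows[i][j] for i in core_idx)}
    return 100.0 * len(a_cc) / len(a_mc)


# Hypothetical presence/absence matrix: four accessions, four marker bands.
toy_bands = [[1, 0, 1, 0],
             [0, 1, 1, 0],
             [1, 1, 0, 0],
             [0, 0, 0, 1]]
coverage = allele_coverage(toy_bands, [0, 3])
```

Here the core made of accessions 0 and 3 retains three of the four bands present in the complete set, giving a coverage of 75 %.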
Excluding the phylogenetic reconstruction and geographical distribution, all of the methodological procedures were performed using FREEMAT v4.2 www.freemat.sourceforge.net.
The FREEMAT codes are available in Additional file 2 (Online Resource 2).
Results and discussion
Usefulness of transposon display markers for CC selection
Locus-specific molecular marker systems, such as SNPs [21, 34], microsatellites, and other indel events, are available for foxtail millet. These markers may provide useful information for CC selection, but the full coverage of the complete genome with these markers has some conceptual and methodological limitations. SNPs and indels provide relatively less information per locus due to their bi-allelic nature, and over 10,000 SNPs may be required to discriminate closely related populations. Microsatellites may overcome these limitations, but testing microsatellites that cover the complete genome also incurs high laboratory expenses and time-consuming procedures.
The use of TEs as an alternative to locus-specific molecular marker systems is based on the assumption that TEs comprise a significant fraction of plant genomes and that recently active TEs display higher polymorphism. A considerably large number of alleles can be detected using TEs as genetic markers with a small number of primer sets. CC selection using TEs combined with the recently released foxtail millet genome sequence will considerably increase the number of polymorphic markers. Thus, we proposed a method that does not require genomic information or a large number of locus-specific genetic markers, and which is based on an AFLP-like technique that could easily be transferred to other biological systems. This method will enhance the reliability of CC selection considerably, thereby refining the exploitation of genetic resources.
To demonstrate the efficiency of ATs and TEs as CC selection criteria, we used K-means as a practical approach to clustering based on Kai et al., who stated that using principal coordinates instead of raw data (i.e., microsatellite genotype data) before K-means clustering makes the clustering step less sensitive to changes in the noisiness of the raw data. We agree that dimensionality reduction can enhance the clustering process and that it is possible to reduce the number of dimensions analyzed during this methodological step. However, to avoid introducing additional variables into the evaluation of ATs and TEs, we used all of the informative data; we will explore the significance of dimension reduction in future implementations.
Validation of the CCs selected by different datasets
Core Collection evaluation scores for different K selected elements
The discreteness of the 141 accessions used in the CC selection procedures was confirmed by displaying their distribution on the phylogenetic dendrogram based on data 0 presented in Additional file 3 (Online Resource 3). To evaluate whether the CC was representative, a phylogenetic dendrogram was constructed based on the genotypic distances among the MC accessions in data I. The phylogenetic reconstruction yielded eight groups, which agreed with previously reported groupings. Thus, the selected CCs were identified according to this dendrogram.
To the best of our knowledge, the present study is the first attempt to combine genotypic and morphological information during CC construction with this approach. It was possible to construct CCs based on both information types using the proposed methodology. As demonstrated by the VS values, the PCA distribution (Figs. 2, 3, and 4), phylogenetic representations (Fig. 5), and geographic distributions (Fig. 6), the phenotypic data provided useful and potentially important information. We believe that genotypic information alone should not be used to generate CCs. In general, morphological information is used to include variation in the CC [11, 18]. Our evaluation of the PCA distribution suggests that both phenotypic and genotypic information have important effects on the selected CCs.
Our approach was successful in capturing most of the genotypic, phenotypic, and geographical diversity in a small set of individuals. Data III CCs were highly representative in terms of both genetic and phenotypic variations. The use of this approach for CC selection may provide beneficial materials in terms of biochemical, morphological, agronomic, and phylogenetic traits, which can be combined with genomic information. The precise definition of phenotypic numerical representations requires further attention, but we believe that combined information CCs will be highly beneficial for breeding improvement, domestication description processes, evolutionary studies, and phenotype/genotype correlation research given the advantages of using adequate CCs for S. italica as well as other crops.
Availability of data and materials
Supporting data and codes are available as additional files.
This research was supported partly by the SATREPS project entitled ’Diversity Assessment and Development of Sustainable Use of Mexican Genetic Resources’ funded by the Japan Science and Technology Agency and by Japan International Cooperation Agency, and partly by a Grant-in-Aid (No. 25257416) from the Japan Society for the Promotion of Science.
Open Access: This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
- McCouch SR, McNally KL, Wang W, Sackville Hamilton R. Genomics of gene banks: a case study in rice. Am J Bot. 2012; 99(2):407–23. doi:10.3732/ajb.1100385.
- Studnicki M, Madry W, Schmidt J. Efficiency of sampling strategies to establish a representative in the phenotypic-based genetic diversity core collection of orchardgrass (Dactylis glomerata). Czech J Genet Plant Breed. 2013; 2013(1):36–47. [Accessed 8 August 2014].
- Bellon M, Smale M, Aguirre A, Taba S. Identifying appropriate germplasm for participatory breeding: An example from the Central Valleys of Oaxaca, Mexico. 2000. http://ageconsearch.umn.edu/bitstream/46524/2/wp000003.pdf [Accessed 8 August 2014].
- Santra M, Matthews SB, Thompson HJ. Development of a core collection of Triticum and Aegilops species for improvement of wheat for activity against chronic diseases. Agric Food Secur. 2013; 2(1):4. doi:10.1186/2048-7010-2-4.
- Reeves PA, Panella LW, Richards CM. Retention of agronomically important variation in germplasm core collections: implications for allele mining. Theor Appl Genet. 2012; 124(6):1155–71. doi:10.1007/s00122-011-1776-4.
- Brown AHD. Core collections: a practical approach to genetic resources management. Genome. 1989; 31(2):818–24. doi:10.1139/g89-144.
- Guo Y, Li Y, Hong H, Qiu LJ. Establishment of the integrated applied core collection and its comparison with mini core collection in soybean (Glycine max). Crop J. 2014; 2(1):38–45. doi:10.1016/j.cj.2013.11.001.
- Franco J, Crossa J, Warburton ML, Taba S. Sampling strategies for conserving maize diversity when forming core subsets using genetic markers. Crop Sci. 2006; 46(2):854. doi:10.2135/cropsci2005.07-0201.
- Hu J, Zhu J, Xu HM. Methods of constructing core collections by stepwise clustering with three sampling strategies based on the genotypic values of crops. Theor Appl Genet. 2000; 101(1-2):264–8. doi:10.1007/s001220051478.
- Odong TL, Jansen J, van Eeuwijk FA, van Hintum TJL. Quality of core collections for effective utilisation of genetic resources: review, discussion and interpretation. Theor Appl Genet. 2013; 126(2):289–305. doi:10.1007/s00122-012-1971-y.
- Kai S, Tanaka H, Hashiguchi M, Iwata H, Akashi R. Analysis of genetic diversity and morphological traits of Japanese Lotus japonicus for establishment of a core collection. Breed Sci. 2010; 60(4):436–46. doi:10.1270/jsbbs.60.436.
- Thachuk C, Crossa J, Franco J, Dreisigacker S, Warburton M, Davenport GF. Core Hunter: an algorithm for sampling genetic resources based on multiple genetic measures. BMC Bioinformatics. 2009; 10:243. doi:10.1186/1471-2105-10-243.
- De Beukelaer H, Smýkal P, Davenport GF, Fack V. Core Hunter II: fast core subset selection based on multiple genetic diversity measures using Mixed Replica search. BMC Bioinformatics. 2012; 13:312. doi:10.1186/1471-2105-13-312.
- Jansen J, van Hintum T. Genetic distance sampling: a novel sampling method for obtaining core collections using genetic distances with an application to cultivated lettuce. Theor Appl Genet. 2007; 114(3):421–8. doi:10.1007/s00122-006-0433-9.
- Gouesnard B, Bataillon T. MSTRAT: An algorithm for building germ plasm core collections by maximizing allelic or phenotypic richness. J Hered. 2001; 92(1):93–4. doi:10.1093/jhered/92.1.93.
- Parra-Quijano M, Iriondo JM, Cruz MDL, Torres E. Strategies for the development of core collections based on ecogeographical data. Crop Sci. 2011; 51(2):656. doi:10.2135/cropsci2010.04.0191.
- Santos-Garcia MO, de Toledo-Silva G, Sassaki RP, Ferreira TH, Resende RMSA, Chiari L, et al. Using genetic diversity information to establish core collections of Stylosanthes capitata and Stylosanthes macrocephala. Genet Mol Biol. 2012; 35(4):847–61. doi:10.1590/S1415-47572012005000076.
- Ebana K, Kojima Y, Fukuoka S, Nagamine T, Kawase M. Development of mini core collection of Japanese rice landrace. Breed Sci. 2008; 58(3):281–91. doi:10.1270/jsbbs.58.281.
- Kojima Y, Ebana K, Fukuoka S, Nagamine T, Kawase M. Development of an RFLP-based Rice Diversity Research Set of Germplasm. Breed Sci. 2005; 55(4):431–40. doi:10.1270/jsbbs.55.431.
- Zhang Y, Zhang X, Che Z, Wang L, Wei W, Li D. Genetic diversity assessment of sesame core collection in China by phenotype and molecular markers and extraction of a mini-core collection. BMC Genet. 2012; 13(1):102. doi:10.1186/1471-2156-13-102.
- Bennetzen JL, Schmutz J, Wang H, Percifield R, Hawkins J, Pontaroli AC, et al. Reference genome sequence of the model plant Setaria. Nat Biotechnol. 2012; 30(6):555–61. doi:10.1038/nbt.2196.
- Doust AN, Kellogg EA, Devos KM, Bennetzen JL. Foxtail millet: a sequence-driven grass model system. Plant Physiol. 2009; 149(1):137–41. doi:10.1104/pp.108.129627.
- Brutnell TP, Wang L, Swartwood K, Goldschmidt A, Jackson D, Zhu XG, et al. Setaria viridis: a model for C4 photosynthesis. Plant Cell. 2010; 22(8):2537–44. doi:10.1105/tpc.110.075309.
- Muthamilarasan M, Prasad M. Advances in Setaria genomics for genetic improvement of cereals and bioenergy grasses. Theor Appl Genet. 2015; 128(1):1–14. doi:10.1007/s00122-014-2399-3.
- Hirano R, Naito K, Fukunaga K. Genetic structure of landraces in foxtail millet (Setaria italica (L.) P. Beauv.) revealed with transposon display and interpretation to crop evolution of foxtail millet. Genome. 2011; 54(6):498–506. doi:10.1139/G11-015.
- Casa AM, Nagel A, Wessler SR. MITE display. Methods Mol Biol (Clifton, N.J.). 2004; 260(1):175–88. doi:10.1385/1-59259-755-6:175.
- Kawase M, Fukunaga K, Kato K. Diverse origins of waxy foxtail millet crops in East and Southeast Asia mediated by multiple transposable element insertions. Mol Genet Genomics. 2005; 274(2):131–40. doi:10.1007/s00438-005-0013-8.
- Bamberg J, del Rio A. Selection and validation of an AFLP marker core collection for the wild potato Solanum microdontum. Am J Potato Res. 2013. doi:10.1007/s12230-013-9357-5.
- Pakhira MK. A modified k-means algorithm to avoid empty clusters. Int J Recent Trends Eng. 2009; 1(1):220–6.
- Vekemans X. AFLP-SURV. Laboratoire de Génétique et Ecologie Végétale. Bruxelles, Belgium: Université Libre de Bruxelles; 2002.
- Felsenstein J. PHYLIP. Seattle, Washington, USA: Department of Genome Sciences, University of Washington; 2004.
- Tamura K, Peterson D, Peterson N, Stecher G, Nei M, Kumar S. MEGA5: molecular evolutionary genetics analysis using maximum likelihood, evolutionary distance, and maximum parsimony methods. Mol Biol Evol. 2011; 28(10):2731–9. doi:10.1093/molbev/msr121.
- Borrayo E, Mendizabal-Ruiz EG, Vélez-Pérez H, Romo-Vázquez R, Mendizabal AP, Morales JA. Genomic signal processing methods for computation of alignment-free distances from DNA sequences. PLoS ONE. 2014; 9(11):e110954. doi:10.1371/journal.pone.0110954.
- Jia G, Huang X, Zhi H, Zhao Y, Zhao Q, Li W, et al. A haplotype map of genomic variations and genome-wide association studies of agronomic traits in foxtail millet (Setaria italica). Nat Genet. 2013; 45(8):957–61. doi:10.1038/ng.2673.
- Zhang S, Tang C, Zhao Q, Li J, Yang L, Qie L, et al. Development of highly polymorphic simple sequence repeat markers using genome-wide microsatellite variant analysis in Foxtail millet [Setaria italica (L.) P. Beauv]. BMC Genomics. 2014; 15(1):78. doi:10.1186/1471-2164-15-78.
- Yamamoto T, Nagasaki H, Yonemaru J-I, Ebana K, Nakajima M, Shibaya T, et al. Fine definition of the pedigree haplotypes of closely related rice cultivars by means of genome-wide discovery of single-nucleotide polymorphisms. BMC Genomics. 2010; 11:267. doi:10.1186/1471-2164-11-267.
- Feschotte C, Jiang N, Wessler SR. Plant transposable elements: where genetics meets genomics. Nat Rev Genet. 2002; 3(5):329–41. doi:10.1038/nrg793.
- Monden Y, Naito K, Okumoto Y, Saito H, Oki N, Tsukiyama T, et al. High potential of a transposon mPing as a marker system in japonica x japonica cross in rice. DNA Res. 2009; 16(2):131–40. doi:10.1093/dnares/dsp004.