Characterization of the late embryogenesis abundant (LEA) proteins family and their role in drought stress tolerance in upland cotton

Background Late embryogenesis abundant (LEA) proteins are large groups of hydrophilic proteins with major role in drought and other abiotic stresses tolerance in plants. In-depth study and characterization of LEA protein families have been carried out in other plants, but not in upland cotton. The main aim of this research work was to characterize the late embryogenesis abundant (LEA) protein families and to carry out gene expression analysis to determine their potential role in drought stress tolerance in upland cotton. Increased cotton production in the face of declining precipitation and availability of fresh water for agriculture use is the focus for breeders, cotton being the backbone of textile industries and a cash crop for many countries globally. Results In this work, a total of 242, 136 and 142 LEA genes were identified in G. hirsutum, G. arboreum and G. raimondii respectively. The identified genes were classified into eight groups based on their conserved domain and phylogenetic tree analysis. LEA 2 were the most abundant, this could be attributed to their hydrophobic character. Upland cotton LEA genes have fewer introns and are distributed in all chromosomes. Majority of the duplicated LEA genes were segmental. Syntenic analysis showed that greater percentages of LEA genes are conserved. Segmental gene duplication played a key role in the expansion of LEA genes. Sixty three miRNAs were found to target 89 genes, such as miR164, ghr-miR394 among others. Gene ontology analysis revealed that LEA genes are involved in desiccation and defense responses. Almost all the LEA genes in their promoters contained ABRE, MBS, W-Box and TAC-elements, functionally known to be involved in drought stress and other stress responses. Majority of the LEA genes were involved in secretory pathways. Expression profile analysis indicated that most of the LEA genes were highly expressed in drought tolerant cultivars Gossypium tomentosum as opposed to drought susceptible, G. hirsutum. The tolerant genotypes have a greater ability to modulate genes under drought stress than the more susceptible upland cotton cultivars. Conclusion The finding provides comprehensive information on LEA genes in upland cotton, G. hirsutum and possible function in plants under drought stress. Electronic supplementary material The online version of this article (10.1186/s12863-017-0596-1) contains supplementary material, which is available to authorized users.


Background
Drought stress has resulted in to massive losses in crop production and also has altered the natural equilibrium of the environment [1]. To save the ecosystem and enhance production, advanced molecular breeding is the recipe for activation and regulation of specific stressrelated genes [2]. Water deficit stress do led to a series of changes including biochemical alterations like accumulation of osmolytes and specific proteins involved in stress tolerance [3]. One of the proteins that play a role in the mechanism of drought resistance is the LEA types of protein known as dehydrin [4]. In cotton production, drought is the main abiotic stress responsible for plant growth compromise and severe yield loss. Even though cotton is considered to be relatively tolerant to water deficit, its optimal growth and yield negatively affected when water supply is limited or interrupted [5]. Water is an essential element for biotic component of the biosphere, such that various responses have evolved to withstand water deficit in all plants and animals, to enable them withstand long periods of water deprivation by adopting a type of life condition known as anhydrobiosis [6].
There is great agronomic significance to understand cotton plant responses to water deficit due to the huge economic losses that results from drought [7]. Cotton metabolism and yield are negatively affected under water deficit conditions, especially at flowering stage [8]. Plants have acquired an evolutionary response to withstand the effect of low water availability, a condition that can disadvantage their growth and development. As immobile organisms, plants possess diverse strategies of responses to drought. Among the molecules highly associated with plant responses to water limitation are the late embryogenesis abundant (LEA) proteins [9]. These proteins are widespread in the plant kingdom and highly enriched during the late stages of embryogenesis and in vegetative tissues in response to water deficit [10].
LEA proteins were first discovered more than 30 years ago and were observed to accumulate at late stages of plant seed development [11]. The LEA proteins have been found in various tissues of abiotic stressed plants and non-plant organisms known to be tolerant to desiccation, such as bacteria and some invertebrates [12]. LEA proteins are members of a large group of hydrophilic, glycine-rich proteins present in a wide range of plant species [13]. This class of proteins are known to be intrinsically disordered in their structures and are mainly expressed under water deprivation condition [14]. The LEA genes are highly diverse, with wide distribution in the plant kingdom and has pivotal role in various stress tolerance responses [15].
Scientific investigations on LEA protein families have been on-going for more than two decades [16]. Although there has been a strong association of LEA protein families with environmental stress tolerance of significance drought and cold stress [17], LEA protein families for most of that time, their function has been entirely obscure [18]. Considerable evidence gives an indication that LEA genes are involved in desiccation, though their precise function is unknown [19]. The bacterial group 1 LEA proteins have the ability to block enzyme inactivation upon freeze-thaw treatments in vitro and it has analogous functions to plant LEA proteins [10]. Therefore, there is need to conduct a genome wide characterization of LEA protein families in cotton. The recent upland cotton genome publications, G. hirsutum [20], G. arboreum [21] and Gossypium raimondii [22], enabled us to carry out the identification and characterization of all cotton LEA genes. In this study, we identified 242, 136 and 142 candidate LEA proteins in G. hirsutum, G. arboreum and G. raimondii respectively, analysed their phylogenetic tree relationships, chromosomal positions, duplicated gene events, gene structure, conserved motif compositions and profiling analysis of gene expression from different cotton plant organs. Our results provides a strong platform for better understanding of the roles and evolutionary history of LEA genes, and will help in future studies of the molecular and biological functions of LEA protein families in cotton.

Chromosomal locations and syntenic analysis
The chromosomal distribution of LEA genes were mapped on cotton chromosomes based on gene position, from up down by Circos-0.69 (http://circos.ca/) [27]. Homologous genes of G. hirsutum, G. raimondii and G. arboreum were identified by BLASTP with threshold >80% in similarity and at least in 80% alignment ratio to their protein total lengths. Default parameters were maintained in all of the steps. Tandem duplications were designated as multiple genes of one family located within the same or neighbouring intergenic region [28]. The Ks/Ka value is an important tool in determining selection pressure acting on the protein coding genes. The genes paralogous pair, which has Ka/ Ks, ratio greater than 1, denotes activating evolution under beneficial selection, indicating that at least some of the mutations were advantageous. When the ratio is equal to 1, then the mutation is neutral but if the ratio is less than 1, it implies that the mutation is disadvantageous or under purifying selection [29]. In the estimation of Ks and Ka substitution rate, we used an alignment of multiple nucleotide sequences of homologous genes which code for LEA proteins. In this research, paralogous pairs were aligned using MEGA 6.0. synonymous substitution (Ks) and non-synonymous substitution (Ka) rate were obtained by Dnasp [30].
Phylogenetic analyses, gene structure organization and motif composition of the LEA proteins in cotton Full-length sequences of G. hirsutum, G. arboreum, G. raimondii, P. tabuliformis and A. thaliana LEA proteins were first aligned using ClustalW on MEGA 6 software [31] then conducted phylogenetic analyses based on protein sequences, with neighbour joining (NJ) method. Support for each node was tested with 1000 bootstrap replicates. The analysis of phylogenetic tree was carried out on upland cotton, G. hirsutum. The gene structures were obtained through comparing the genomic sequences and their predicted coding sequences from the cotton genome project. In addition, MEME (Multiple Expectation Maximization for Motif EliCitation) online program (http:// meme.nbcr.net/meme/cgi-bin/meme.cgi) [32], was used to identify the conserved protein motifs, with maximum number of different motif at 20; the minimum and largest base sequence width of 6 and 50 respectively.

Promoter cis-element analysis
Promoter sequences (2 kb upstream of the translation start site) of all LEA genes were obtained from the cotton genome project (http://cgp.genomics.org.cn/page/ species/index.jsp).Transcriptional response elements of LEA genes promoters were predicted using using the PLACE database (http://www.dna.affrc.go.jp/PLACE/ signals can.html) [36].

Gene ontology (GO) annotation
The functional grouping of LEA proteins sequences and the analysis of annotation data were executed using Blas-t2GO PRO software version 4.1.1 (https://www.blast2go. com). Blast2GO annotation associates genes or transcripts with GO terms using hierarchical terms, cellular component (CC), biological process (BP) and molecular function (MF). Genes were described in three categories of the GO classification terms: molecular function, biological processes and cellular components.

Plant materials and treatment
One-month-old cotton seedlings of G. tomentosum-AD3-00 (P0601211), G. hirsutum-CRI-12 (G09091801-2) and their BC 2 F 1 genotypes, with G. tomentosum as the donor and G. hirsutum as the recurrent parent were used to examine the expression patterns of the LEA genes under drought condition. G. tomentosum is drought tolerant genotype while G. hirsutum is drought susceptible genotype. The two upland cotton accessions are perennially grown and maintained by our research group, in Sanya Island, Hainan province, China. Plants were grown in boxes, with dimension of 41 × 41 cm, with a depth of 30 cm and with three biological replications in the greenhouse located at the cotton research institute, Chinese Academy of Agricultural Sciences (CAAS), Anyang, Henan province, China. The greenhouse conditions were set with temperature at 23 ± 1°C and a 14-h light/10-h dark photoperiod. After one month of growth, watering was totally withdrawn from drought treated seedlings but not in control. The samples for RNA extraction were collected at 0, 7 and 14th day of drought stress exposure, for plants under drought and control. Root, stem and leaf were the main organs of target in this study.

RNA isolation and qRT-PCR verification
RNA extraction kit, EASYspin plus plant RNA kit, obtained from Aid Lab, China was used to extract total RNA from roots, stems and leaves. The quality and concentration of each RNA sample was determined using gel electrophoresis and a NanoDrop 2000 spectrophotometer. Only RNAs which met the criterion 260/280 ratio of 1.8-2.1, 260/230 ratio ≥ 2.0, were used for further analyses and stored at −80°C. The cotton constitutive Ghactin7 gene, forward "ATCCTCCGTCTTGACCTTG" and reverse sequence "TGTCCGTCAGGCAACTCAT" was used as a reference gene and specific LEA genes primers were used for qRT-PCR validation. The firststrand cDNA synthesis was carried out with TranScript-All-in-One First-Strand cDNA Synthesis SuperMix for qPCR, obtained from TRAN, it was used in accordance with the manufacturer's instructions. Primer Premier 5 was used to design 43 LEA primers with melting temperatures of 55-60°C, primer lengths of 18-25 bp, and amplicon lengths of 101-221 bp. Details of the primers are shown in (Additional file 1: Table S1). Fast Start Universal SYBRgreen Master (Rox) (Roche, Mannheim, Germany) was used to perform qRT-PCR in accordance with the manufacturer's instructions. Reactions were prepared in a total volume of 20 μL, containing 10 μL of SYBR green master mix, 2 μL of cDNA template, 6 μL of ddH2O, and 2 μL of each primer to make a final concentration of 10 μM.

Identification of LEA genes in cotton
The HMM profile of the Pfam LEA domains (PF3760, PF03168, PF03242, PF02987, PF00477, PF10714, PF00257 and PF 04927) were used as the query to identify LEA genes in the cotton genomes. Two hundred and eighty LEA genes were identified in upland cotton, Gossypium hirsutum, one hundred-seventy LEA genes in G. raimondii and one hundred-fifty LEA genes in G. arboreum. All the LEA genes were analyzed manually using the SMART and PFAM database (http://pfam.xfam.org/) to verify the presence of the LEA gene domain. Finally, 242, 136 and 142 candidate LEA proteins were identified in G. hirsutum, G. arboreum and G. raimondii respectively. All identified LEA genes were grouped into eight groups, ranging from LEA 1 to LEA 6, dehydrin and seed maturation protein (SMP). To validate our classification of upland cotton LEA genes, we compared the LEA genes nomenclature with previous identification adopted by Hundertmark and Hincha [12] and Bies-Etheve et al. [37] (Table 1).
The physicochemical parameters of each LEA gene were calculated by using ExPASy, an online tool [38]. Most of the LEA proteins in the same family had similar physicochemical parameters. Cotton LEAs of the LEA 4 contained a greater number of amino acid residues as depicted by their protein lengths (aa), followed closely by the dehydrins (Table 2). Dehydrins have been found to contain high number of amino acid residues from the structural analysis of LEA genes in Brassica napus [39]. Cotton LEA_6 family members all had relatively low molecular masses, ranging from 10.177 to 11.9634 kDa, similar findings was also reported in the analysis of B. napus LEA genes, in which all the LEA 6 genes had lower molecular masses [39]. Approximately two-thirds of the cotton LEA proteins had high isoelectric points Pl ≥ 7.0, including majority of LEA 2 family.
The only LEA proteins with all its members having Pl < 7, were the SMPs, this results is in agreement to Pl values obtained for SMPs in Brassica napus, all had Pl < 7.0 [39]. The grand average of hydropathy (GRAVY) results as obtained from ExPASy indicated that cotton LEA 2 proteins are the most hydrophobic, with all except three with GRAVY values <0. The rest of the LEA proteins were highly hydrophilic, with almost all of the groups had gravy value of less than 0, these results are consistent with those of the LEA proteins in Arabidopsis thaliana [40]. Low hydrophobicity and high net charge are the main characteristics of other LEA proteins [38] which enables them to be totally or partially disordered, this unique features is an attribute which gives the LEA proteins the ability to form flexible structural elements such as molecular chaperones which are integral for the protection of plants from desiccation effects [41]. TargetP1.1 (http://www.cbs. dtu.dk/services/TargetP/) server [24] and Protein Pprowler Subcellular Localisation Predictor version 1.2 (http:// bioinf.scmb.uq.edu.au/Pprowler_webapp_1-2/) [25], were       (Table 2 and Additional file 2: Table S2). We further used WoLFPSORT [26] to investigate the particular cell compartments in which the LEA proteins were embedded in, 148 LEA genes were predicted to be chloroplasts proteins, 47 as cytoplasm proteins, 20 as mitochondrion proteins, 35 as nucleus proteins, 11 as Golgi body proteins, 7 as extracellular proteins, 7 as plasma proteins, 4 as vacuole proteins and 3 as endoplasmic reticulum proteins. The details of other characteristics of the nucleic acid and protein sequences are provided in (Table 2). LEA genes have ubiquitous distribution across cell compartments with unique subcellular localization [42]. LEA 4 gene families were found to be widely distributed in cell structures such as cytosol, mitochondria, plastid, ER, and pexophagosome [42]. The unique and wide distribution of LEA genes within the various cell structures is to establish interactions with various cellular membranes under stress conditions. The broad subcellular distribution of LEA proteins highlights the requirement for each cellular compartment to be provided with protective mechanisms to cope with drought stress [17]. In Summary, both experimental and prediction data indicates that LEA proteins have wide distribution in subcellular compartments [42].
Phylogenetic analyses, gene structure and protein motifs of LEA genes in upland cotton To examine the evolutionary history and relationships of LEA protein families, an unrooted phylogenetic tree was constructed from alignments of the full lengths of LEA gene sequences with Neighbor-joining method based on similarities of the LEA genes in upland cotton, G. hirsutum. We constructed phylogenetic tree of all the groups of LEA genes separately, which we further combined with intron-exon and motifs to unearth more information about phylogenetic tree and LEA genes similarities  (Fig. 1). Gene structural diversity and conserved motif divergence are possible mechanisms for the evolution of multigene families [43]. To gain further information into the structural diversity of cotton LEA genes, we analyzed the exon / intron organization in the full-length cDNAs with their corresponding genomic DNA sequences of individual LEA genes in cotton (Fig. 1). Most closely related LEA gene members within the same groups shared similar gene structures in terms of either intron numbers or exon lengths. For example, LEA 1,3,4,5, SMP and dehydrins genes had one to four introns with exception of LEA 2 and 6, which had zero to five introns. This result is in agreement with earlier finding in which dehydrin were found to have introns [44]. By contrast, the gene structure appeared to be more variable in LEA 2 which had the largest number of genes, with sizes of exon/intron structure variants with striking distinctions (Additional file 3: Figure S1). The result suggest the divergence functions of this group of protein family in upland cotton.
Twenty-five distinct motifs were identified. Motifs 1, 2, 3, 4, 5 and 6 were common among all the different groups of LEA genes, similar motifs have been previously identified in other plant species, including maize [45], Arabidopsis thaliana [40], tomato [46] among other plants. Motif analysis of the cotton LEA proteins showed that members of each LEA group possess several group-specific conserved motifs (Table 3). Similar features have been reported for LEA proteins in Solanum lycopersicum [46], Arabidopsis [40], Prunus [47] and poplar [48]. For example, a distinctive and conserved motif in the dehydrin group is the repeated motif, EKKGIMDKIKEKLPG (motif K, richness in lysine residues), in this study, we identified a unique motif among the dehydrin families, GEG REKKGFLEKIKEKLPGHHKKTEEAS, which we named as K1, because of the close similarity with the K-motif. In addition, the commonly known motifs such as EHHEKKG IMDKIKEKLPGHH (K motif) and HSLLEKLHRSNSSSSSS SSDE (S-motif) were also observed. K motif is rich in Fig. 1 Phylogenetic tree, gene structure and motif compositions of LEA genes in upland cotton. The phylogenetic tree was constructed using MEGA 6.0. Exon/intron structures of LEA genes in upland cotton, exons introns and up / down-stream were represented by yellow boxes, black lines and blue boxes, respectively. Protein motif analysis represented by different colours, and each motif represented by number lysine (K) residues and it is known for protective role of enzymatic activities from the drought effects [49]. The motif pattern formation indicates that cotton LEA proteins are actively involved in various biological processes and are group specific in terms of their activities. The distinct nature of the conserved motifs observed in all the LEA Table 3 A consensus amino acid sequence of the different motifs features of each upland cotton LEA protein families The colour scheme of the logo indicates amino acid types. Polar: green = uncharged; blue = +vely charged; red = -vely charged; Non-polar: violet/purple = aliphatic. As described by Dure, 2001 protein families, gives an indication that, the LEA proteins evolved from the gene expansion within their specific gene families. In addition, LEA 4 gene families were found to contain repeats of conserved motif 3, in which in some cases, the repeats were 5, the same attribute was also noted in which they were found to have tendencies of harboring repeat motifs, more so motif 8 [37]. We further did comparison of the common motifs with already identified motif, by the use of Tomtom motif comparison tool, adopting the distance measure of Sandelin-Wasserman function [50]. Motif 1 had 23 matches, with 5 jolma2013, 3 JASPER-CORE2014 vertebrates and 15-uniprobe mouse. In motif 2, had 35 matches, 5 jolma2013, 5JASPERCORE2014 vertebrates and 25-uniprobe mouse. With MEME functional tool, we were able to affirm the similarities of our motifs to already published motif in the motif database.

Phylogenetic analyses of the LEA -proteins in cotton with other plants
To get a better understanding of the evolutionary history and relationships of LEA gene families in cotton to other plants, multiple sequence alignment of 242 genes for G. hirsutum, 136 genes for G. arboreum, 146 genes for G. raimondii, 30 genes for Pinus tabuliformis and 51 genes for Arabidopsis LEA protein sequences (Fig. 2) were done. The boot strap values for some nodes of the NJ tree were low due to long sequence similarities. The reliability of the phylogenetic tree was done by reconstructing the phylogenetic tree with minimal evolution method. The trees produced by the two methods were identical thus the results were consistent. Based on the Phylogenetic tree analysis, LEA genes in cotton were further classified into eight (8) groups. LEA 2 was the largest group with 334 genes from G. hirsutum (157), G. raimondii (89), G. arboreum (85), A. thaliana (3) and P. tabuliformis (1). All the ortholog genes in LEA 2 were found in upland cotton, G. hirsutum, G. arboreum and G. raimondii genome while no ortholog genes were observed between upland cotton, G. hirsutum to either Arabidopsis thaliana and or P. tabuliformis. The second largest group were LEA 4, with highest number of genes 13 and 16 in P. tabuliformis and upland cotton Fig. 2 Phylogenetic relationship of LEA genes in three cotton species with Arabidopsis and Pinus tabuliformis. Neighbor-joining phylogeny of 242 genes for G. hirsutum, 136 genes for G. arboreum, 146 genes for G. raimondii, 30 genes for Pinus tabuliformis and 51 Arabidopsis LEA protein sequences, as constructed by MEGA 6.0. The difference colours mark the various LEA gene types respectively. Upland cotton, G. hirsutum contained the highest numbers of LEA genes of the following groups, LEA 1, LEA 2, LEA 3, LEA 5, LEA 6, SMP and dehydrin with the exception of LEA 4. Among all the LEA gene groups, only LEA 6 had fewer genes, 10 and 3 gene in cotton genome and Arabidopsis respectively ( Table 1). The total number of ortholog genes between upland cotton, G. hirsutum, G. arboreum and G. raimondii were 201 out of 601 genes mapped on the Phylogenetic tree, which translates to 33%.
In this study, no ortholog genes were detected between upland cotton, G. hirsutum, P. tabuliformis and Arabidopsis. All the ortholog genes were detected only among the cotton species; this could be due to their evolution. Upland cotton, G. hirsutum emerged through hybridization mechanism between A and D genome [51]. Based on the results, there was close relationship between Arabidopsis and cotton species as compared to P. tabuliformis. The LEA genes seems to have a common evolutionary history [37], the aggregation pattern of the genes within the tree showed that LEA 1, LEA3, LEA 4, LEA5, LEA 6 and SMP had a common origin, similar results have also been obtained in the analysis of LEA genes in potato, in which SMP, LEA 1, LEA4 and LEA 5 shared a common point of origin [52]. A unique feature on the abundance of cotton LEA genes and their distribution was observed in which LEA 2 formed the majority of the cotton LEA gene families (Table 1 and Fig. 2). Analysis of the LEA genes in monocots and dicots, nearly half of the species containing LEA genes, the majority of the genes belong to the LEA 4 and dehydrin families [39]. The analysis of cotton LEA genes with other plants revealed that main differences occur in the LEA 2 genes (Fig. 2). The abundance of LEA 2 genes was lowest in Pinus tabuliformis (1) and Arabidopsis (3) and higher in G. hirsutum (157), G. arboreum (85) and G. raimondii (89). Similar results were also been observed in which lower proportions of other LEA gene families were observed in grapevine but significantly higher number of LEA 2 genes were observed in rice and poplar [53]. It is important to note that, such large number of LEA 2 families have not been described in the previously investigated genomes of poplar [48], rice [11] and Arabidopsis [40]. This result may be explained in part by the improved annotation of the higher plant genomes available at Phytozome (v10.2) and also by the fact that LEA 2 is an unusual group composed of 'a typical' proteins of hydrophobic nature. This finding suggests that the LEA protein families in higher plants may be larger and much more complex than previously described. On the other hand, minor variations were observed in the other upland cotton LEA gene families. Based on this result, the entire LEA 2 gene families probably were the last to evolve among the LEA gene families in higher plants.

Chromosomal distribution of cotton genes encoding LEA proteins
To determine the chromosomal locations of cotton LEA genes based on their positions, data retrieved from the whole cotton genome sequences were used. Chromosome distribution was done by BLASTN search against G. hirsutum and G. arboreum in cotton genome project and G.raimondii genome database in Phytozome (http:// www.phytozome.net/cotton.php). One hundred and eighty six (186) upland cotton LEA genes were mapped in all chromosomes by Map chart and 56 upland LEA genes into unknown chromosomes (scaffold). A plot of LEA genes on the cotton genome shows that LEA loci are found on every chromosome (Fig. 3). The distribution of the mapped LEA genes were more in Dt with 110 (59%) compared to At, with only 76 (41%) genes. However, the densities of these loci were high on Dt_chr 09, with 9% of all the LEA genes. Gene loss was observed on At_chr 05, with a single gene compared to its homolog chromosome Dt_chr 05, which had 9 genes. Similar case was also noted on chromosome At_chr02 and Dt_chr02 with 2 and 7genes respectively. This result indicates an element of gene loss during the hybridization period, as result of crossing over or other internal or external chromosomal phenomenon.
In the A genome of, G. arboreum 136 LEA genes were mapped across all the 13 chromosomes, high density of these loci were observed on chromosome 10, which contained 21 genes, translating to 15% of all the LEA genes in the genome. The mapping of the gene loci were generally uniform, the lowest loci density was observed on chromosome 9, with 5 genes (4%), followed by chr 2, chr5 and chr 8, with 6 genes each (Fig. 3). In D genome, (G. raimondii), 143 LEA genes were distributed across all the chromosomes. The highest gene loci density was in chromosome 9 with 18 genes (13%) and the lowest density was in chromosome 12, with only 5 genes (3%). The mapping of the LEA genes in both diploid and tetraploid cotton chromosomes, tend to have a unique clustering pattern, high density LEA gene clusters were observed in specific chromosomal regions, either at the upper arm, lower arm or the middle region of the chromosomes as shown on chromosomes At_ch01, Dt01_chr15, Dt02_chr14, Dt05_chr19 and Dt10_chr20 within the AD genome, chr02, chr05, chr06 and chr07 in A genome and in D genome, ch07 and chr11 (Fig. 4). The clustering pattern of the LEA gene and chromosomal location could be attributed to LEA gene duplication patterns [37].
In general, genes which belong to the same family are distributed across the entire chromosomes in order to ensure their maximum functionalization [54]; this was evident in LEA 2 genes, which was distributed across the entire chromosomes of both tetraploid and diploid cotton (Figs. 3 and 4). It was unique to find some members of LEA gene families with restricted distribution, mainly found in some chromosomes but not all like dehydrin despite of their numbers, this implies that dehydrin like genes have the tendency to duplicate and evolve more conservatively within a particular chromosome.

Gene duplication and syntenic analysis
Expansion of gene families occurs through three processes namely, segmental duplication, tandem duplication and whole genome duplication [55]. Homologous and orthologous genes are the products of gene duplication events. Duplicated genes function in stress response and development processes in plants [56]. To analyse the relationships between the LEA genes and gene duplication events, syntenic blocks of LEA genes were combined among G. hirsutum, G. raimondii and G. arboreum (Fig. 5). A total of 241 LEA genes were duplicated across the three cotton genome. The most duplicated genes were detected between G. hirsutum and its ancestors, G. arboreum and G. raimondii, this could be due the origin of G. hirsutum, as a result of polyploidization of A and D genome (Table 4). Two types of duplication, tandem and segmental duplication event were identified. Majority of the duplicated LEA genes, were segmental, this implied that, segmental gene duplication, had a major contributing factor during the evolution time [57]. The Ka/Ks ratio is a pointer to selective pressure acting on a protein-coding gene. It has been observed that some systematic bias in some species do occur more easily in the process of nucleotides substitutions because of species diversity and high mutation rate do accelerates the changes in amino acid proportions [58]. The analysis of the Ka /Ks ratios of the 156 paralogous pairs, were less than 1 and only 20 had ratios of more than 1. Majority of LEA 2, LEA 4 and SMP had very low Ka/Ks ratios; the highest Ka/Ks ratio of 2.59265 was noted in LEA 6. This result is consistent to the previous findings of Brassica napus LEA genes, LEA 3 and LEA 6, families recorded higher Ka/Ks ratios, whereas LEA 5 and LEA 2 gene families recorded lowest Ka/Ks ratios [39]. In general Ka/Ks for paralogous gene pair of LEA genes had a range of 0 to 2.593 with mean of 0.4717. This result gives an indication that the LEA genes have been influenced extensively by purifying selection during the process of their evolution. LEA 2 gene families preferentially do have conserved structure and functions under selective pressure [59].

Prediction of LEA genes (mRNA) targeted by miRNAs in upland cotton
Large groups of small RNAs, known as microRNAs (miRNAs) are reported as the regulators in plant adaptation to abiotic stresses [60]. In transgenic creeping bent grass, Agrostis stolonifera, over expression of rice miR319a showed enhanced salt and drought tolerance [61]; over expression of miR396c and miR394 in plants was due to hypersensitive to salinity stress [62]. In cotton, Gossypium hirsutum, a group of miRNAs and their targets have been identified, and some of them respond to salt and drought stresses [60]. To get more information on LEA genes functions, we carried out the prediction of miRNAs targets on LEA transcripts (mRNA) by the use of psRNATarget, the same as been used for other     functional genes in cotton [63]. Out of 242 upland cotton LEA genes, 89 genes were found to be targeted by 63 miRNAs, representing 37% of all the LEA genes (Additional file 4: Table S3). The highest levels of target were detected on the following genes with more than 6 miRNAs; CotAD_00799 (6 miRNAs), CotAD_06037 (9 miRNAs), CotAD_13,827 (6 miRNAs), CotAD_19,205 (6 miRNAs), CotAD_31,936 (6 miRNAs) CotAD_33,143 (6miRNAs), CotAD_41,925 (8miRNAs) and CotAD_69,738 (7miRNAs) as highlighted in (Additional file 4: Table S3). The rest of the genes were either targeted by one or not more than 5 miRNAs. The high number of miRNAs targeting LEA genes could possibly had direct or indirect correlation to their stress tolerance levels to abiotic stress more so drought. Some specific miRNAs had high level of target to various genes such as ghr-miR164 (5 genes), ghr-miR2949a-3p (7 genes), ghr-miR2950 (10 genes), ghr-miR7492a (10 genes), ghr-miR7492b (10 genes), ghr-miR7492c (10 genes), ghr-miR7495a (10 genes), ghr-miR7495b (10 genes), ghr-miR7504a (5 genes), ghr-miR7507 (5 genes), ghr-miR7510a (6 genes), ghr-miR7510b (10 genes), ghr-miR827b (4 genes) and lastly ghr-miR827c (4 genes). It has been found that miRNAs might be playing a role in response to drought and salinity stresses through targeting a series of stress-related genes [60]. Cotton ghr-miR7510b not only involved in drought stress but also highly up regulated in ovule and fibre, thus has an integral role in fibre formation [64]. Deep sequencing of miRNA under drought and salinity, ghr-miR408a, ghr-miR2911, ghr-miR156a/c/d and ghr-miR3954a/b were found to have differential expression in either of the stress factors, drought and salt stress [60].

Gene ontology (GO) annotation
The biological processes, molecular functions and cellular components of cotton LEA genes were examined according Gene Ontology (GO) data base. Blast2GO v4.0 was used to carry out the analysis ( Fig. 6 and Additional file 5: Table S4). The results showed that the 242 LEA genes were putatively involved in a range of biological processes. Of the 5 terms of biological processes defined by Blast2Go terms, most LEA genes were predicted to function in the response to desiccation (~29%), followed by response to stress and response to defense. Molecular function prediction indicated that all 242 LEA genes, majority were involved in signal transducer activity, transferase activity and DNA binding. In cellular component prediction of LEA genes exhibited to be involved in membrane were 117, membrane parts (113), cell (13) and cell part (13). Higher numbers of upland cotton, G. hirsutum LEA genes were mainly involved in cellular component and molecular functions and few were found to be involved in biological processes. In all the LEA groups, molecular functions, biological process and cellular components were noted except in LEA 1 in which only two GO functions, biological process and cellular components were observed.

Promoter cis-element analysis
Promoter sequences, 2 kb upstream and downstream of the translation start and stop site of all the LEA genes were obtained from the cotton genome project. Transcriptional response elements of LEA genes promoters were analyzed using the PLACE database (http://www.dna.affrc.go.jp/ PLACE/signalscan.html) [36]. In order to determine cisacting regulator element, we queried a section of the sequence of each gene, but only the start and end codon were used for the selection of cis-promoter elements. Analysis of the promoter region of all upland cotton LEA genes identified the presence of various stress responsive cisacting regulatory elements, including DRE/CRT, ABRE, LTRE and MYBS. These stress-responsive elements were relatively abundant in the promoters of the upland cotton LEA genes, more specifically ABREs ( Fig. 7 and Additional file 6: Table S5), indicating that LEA proteins may have an important functional role in drought stress response and tolerance in upland cotton, G. hirsutum. There were significant differences in the average proportions of the promoter elements detected within the different LEA gene families (Fig. 7). The upland cotton LEA genes from LEA 2, LEA3, SMP and dehydrins gene families contained the highest average proportions of stress-responsive elements, while those from LEA 1 and LEA 6 contained the lowest proportions. ABRELATERD1 (ACGTG) was the dominant cis promoter elements, similar findings, with the predominance of ABRE cis-element, have been reported for LEA genes in tomato [46], Arabidopsis [40] and Chinese plum [47]. ABRE is a cis-acting element majorly involved in abscisic acid signaling in response to abiotic stresses, while DRE/ CRT and LTRE are major cis-acting regulatory elements involved in the ABA-independent gene expression in response to water deficit (DRE/CRT) and cold (DRE/CRT and LTRE) [65]. MYBS is well-studied cis-acting promoter element with key role in the abscisic acid-dependent signaling pathway in response to drought, salt and cold [66].

Upland cotton LEA genes expression analysis under drought stress
To examine the expression profile of LEA proteins family in various tissues under drought stress treatments, we selected 42 LEA genes based on phylogenetic tree analysis, intron-exon and protein motif features, for each LEA group. Three cotton genotypes, G. tomentosum, a wild type known to be drought resistant, G. hirsutum, an elite cultivar, though drought susceptible cultivar and their backcross type BC 2 F 1 generation were cultivated in the greenhouse under drought simulated and well watered condition. The qRT-PCR analysis was done on the three sets of accessions on different plant organs, roots, stems, and leaves. The results showed that LEA genes were differentially expressed under drought treatment across different tissues tested. Based on the cluster analysis, gene expression profiling were categorized into 2 groups, sub group 1, included 15 genes; the  Table S4 majority in this cluster of genes were up regulated in G. tomentosum in all tissues after 14 days of stress exposure and down regulation after 7 days of stress except in root, which showed partial expression. In G. hirsutum, majority of the genes were up regulated after treatment except in leave tissues in which the genes showed down regulation. In BC 2 F 1 , majority of the genes were down regulated after one week of exposure but expression was high after two weeks under drought stress. This result showed that these genes might be involved either directly or indirectly in drought stress, and their role was majorly concentrated in roots and stem. The second cluster, with 28 genes, the majority of the genes showed down regulation after one week of stress in all the tissues across the three genotypes. Some genes were up regulated across the three genotypes after 2 weeks of drought treatment. The results exhibited differential expression pattern in the 3 genotypes tested (Fig. 8). Some of the LEA genes were differentially expressed in the three plant organs and genotypes tested while others showed same expression pattern in different tissues, this could be due to functional divergence of LEA genes during plant development, for instance, CotAD_16,595 and CotAD_40,972 were highly expressed in the roots in all the three genotypes ( Fig. 1), implying that they could be responsible for enhancing roots traits to drought tolerance. CotAD_13,827 and CotAD_31,906 were highly expressed in the leaves while others such as CotAD_10,044 and CotAD_03264 were highly up regulated in the stem. Further analysis of the expression showed that more than a half the upland cotton LEA genes were increased in roots and leaves at 7th and 14th day of stress as opposed to the stem. Roots and leaves tissues are highly sensitive to drought, the roots is the first organ to be affected by water deficit [67]. The leaves wilt or become chlorotic in stress conditions and affects photosynthesis process [68].

Discussion
LEA proteins family is a large and widely diversified across plant kingdom [15]. The LEA gene family has been identified in several crops, such as rice and maize [69], in other organisms such as invertebrates and microorganisms [70]. However, characterization of the LEA protein family and their role in drought stress tolerance in upland cotton has never been reported. In this study, we identified different numbers of LEA genes in G. hirsutum (242), G. arboreum (136), G. raimondii (142), A. thaliana (51) and P. tabuliformis (30). The number of LEA genes in cotton genome AD (G. hirsutum) were higher than A (G. arboreum) and D (G. raimondii). The number of LEA genes in AD is approximately 1.78 and 1.70 times that in A and D respectively. The high number of LEA genes in G. hirsutum were more likely caused by gene duplication and the conservation of the LEA genes during the polyploidization process, signifying the important role played by these groups of gene families in the process of plant growth and development [40]. Gene duplication, is a major feature of genomic architecture, with cardinal role in the process of plant genomic  and organismal evolution, resulting into new raw genetic materials for genetic drift, mutation and selection, which ultimately results into emergence of new gene functions and evolution of gene networks [71]. Gene duplication mechanism not only lead to the expansion of genome content but aids in the diversification of gene function to ensure adequate adaptability and evolution of plants [55]. The tetraploid cotton have undergone whole genome duplication events during their evolution period [72] and G. hirsutum emerged through allopolyploidy [73]. In this study, only 46 (17%) tandemly duplicated genes were detected, similarly only 6 genes were found to be tandemly duplicated in Brassica napus, perhaps because the evolution of upland cotton and Brassica napus are due to whole genome duplication [74,75]. Majority of the upland LEA genes showed a close relationship with respect to the block locations of G. arboreum and G. raimondii LEA genes.
A phylogenetic analysis provided evidence of the contribution of whole genome duplication contribution to Upland cotton LEA genes abundance. LEA gene expansion through whole genome duplication have been observed in Arabidopsis [37] and Brassica napus [86]. The Gossypium arboreum genome contained 136 LEA genes and G. raimondii genome had 142 LEA genes; therefore, a WGD process would be expected to produce more than 242 LEA genes in Gossypium hirsutum. The LEA genes numbers proportions in G. hirsutum (tetraploid) implied that a larger number of the duplicated LEA genes were lost or became functionless after whole genome duplication. The loss of Upland cotton LEA genes could have been due to chromosomes rearrangement, the same mechanism was also observed in the case of Brassica [76]. The expansion of LEA genes in upland cotton was majorly through segmental duplication, 44% (130/242) of the upland cotton LEA genes emerged through segmental duplication. This finding is concurrent to observation made in Brassica 72 out of 108 genes occurred through segmental duplication [39] and in Arabidopsis in which 24% of its LEA genes arose through segmental type of gene duplication [40]. In synteny analysis, we identified 241 pairs with high similarity, implying that most LEA gene family members are embedded in highly-conserved syntenic regions, and some genes were either lost or recovered. The loss or gain of genes within the syntenic region have been observed in a number of gene families not only in LEA genes [77].
Characterization and structural analysis of genes with major functions on abiotic and biotic stress factors have been found to have fewer introns [48]. The analysis of the upland cotton LEA genes, LEA 1, 3, 4, 5, SMP and dehydrins genes had one to four intron with exception of LEA 2 and 6, which had zero to five introns. The reduced intron numbers in stress responsive genes have been recorded, such as trehalose-6-phosphate synthase gene family which plays an important role in abiotic stress and metabolic regulation [78]. The existence of introns in a genome is argued to cause enormous burden on the host [79]. The burden is because the introns requires a spliceosome, which is among the largest molecular complexes in the cell, comprising 5 small nuclear RNAs and more than 150 proteins [79]. It has also been found that intron transcription is costly in terms of time and energy [80]. Moreover, introns can extend the length of the nascent transcript, resulting into an additional expense for transcription [81].
The motif protein analysis and composition of each LEA gene family largely varied, although some amino acid-rich regions were detected, similar to previous studies done on Arabidopsis [40] and legumes [82]. We found that that genes belonging to the same families exhibited similar gene structure and motif composition. This results is consistent to previous studies which recorded similar exon -intron and protein motif within the same group of the LEA genes [23]. LEA proteins have disordered structure along their sequences due to their amino acid compositions [83]. LEA proteins play key roles in the plant cell despite of their disordered structure [41], they have the ability to form chaperons with other elements [84].The structural flexibility of the LEA proteins facilitate interactions with other macromolecules, such as membrane proteins, hence cell membrane stability during drought stress [85]. These results demonstrate that LEA proteins have intrinsic characteristics which enables them to functions as flexible integrators in protecting other molecules under drought stress and other forms of abiotic stress factors [86].
In relation to gene ontology (GO) analysis, biological processes, molecular functions and cellular components are features of genes or gene products that enable us to understand the diverse molecular functions of proteins. Cellular component and molecular activities were highest among the upland cotton LEA proteins, this could be in line with their functions of protecting the membranes and enzymes to maintain cellular activities under drought stress conditions [87]. The finding in this study is concurrent to previous studies which reported that LEA proteins are mainly located in subcellular regions such as chloroplast, nucleus, cytoplasm and mitochondria in Arabidopsis [40] and tomato [46]. The subcellular localization and the role of the LEA protein in the cell are positively correlated. Binding to different molecules such as ATP binding (GO: 0005524), sequence-specific DNA binding (GO: 0003700) and zinc ion binding (GO: 0008270) were the major activities for the action of upland LEA proteins as molecular function. Binding of LEA proteins to nucleic acids in order to protect cellular structures by constructing hydrogen network was reported, which is related to the roles of LEA proteins in drought stress tolerance [88]. In addition, LEA protein family groups have been found to enhance membrane stabilization through chaperons formation with phospholipids and other sugar molecules as described in model membranes under drought condition [87]. The molecular function of LEA proteins in drought stress may be through the binding activity.
In addition, biological processes in response to stress factors were dominant, response to desiccation (GO: 0009269); abscisic acid transport (GO: 0080168); response to stress (GO: 0006950); response to water (GO: 0009415); auxin-activated signalling pathway (GO: 0009734); response to water deprivation (GO: 0009414); response to cytokinin (GO: 0009735) and phosphorylation (GO: 0016310). These biological roles detected in cotton LEA proteins were consistent with earlier findings of biological functions of the LEA proteins such as oxidant scavenging activity, enzyme and nucleic acid preservation and membrane maintenance, these biological functions protect cell structures from the deleterious effects of drought and other abiotic stress factors [89]. Our findings is further supported by the highly up regulation of LEA proteins in various studies done in transgenic modal plant, Arabidopsis [90] and bacteria [91].
The small RNAs are a diverse class of non-coding regulatory with important function in gene regulation under drought stress conditions by destroying the target gene transcripts in plants [92]. The analysis of upland cotton miRNAs showed that 89 LEA transcripts were targeted by 63 different miRNAs. The NAC gene family are plant-specific transcriptional factors known to play diverse roles in various plant developmental processes, MYB is a transcriptional factor family mainly involved in controlling various processes like responses to biotic and abiotic stresses, development, differentiation, metabolism, defense among other biological processes while mitogen-activated protein kinase (MAPK) gene families also do play an important roles in plant growth, development and defense response. The three plant transcriptional factors, MYB, NAC and MAPK are ranked top under the context of drought and salinity indicating their important roles for the plant to combat drought and salinity stress. Through target prediction, a series of cotton miRNAs were found to be associated with MYB, NAC and MAPK genes including miR164 [60,93]. In this research work miR164 was found to target four (5) LEA genes, CotAD_03784, CotAD_07516, CotAD_19,375, CotAD_24497and CotAD_63,174. The association of these 5 LEA genes with miR164, which have been found to be linked to highly ranked plants transcription factors under drought and salt stress, provides a strong evidence of a major role played by LEA genes in drought stress. A small RNA like miR827 have been found to confer drought tolerance in transgenic Arabidopsis, homologous form of the same miRNA, denoted as Hv-miR827, have been proved to confer drought tolerance in barley [94]. The same miRNA, ghr-miR827a/b/c/d, was found to target 12 different LEA genes, this implied that these genes targeted by the miRNA had a direct functional role in enhancing drought tolerance in upland cotton.
Gene promoters, also termed as cis-element, play various key roles in the transcriptional regulation of genes controlling a number of abiotic stress and plant hormones responses. Phyto-hormones enhance the ability of plants to adapt to changing environments. Many abiotic stress-related and plant hormones-related cis-elements, including W-Box, MBS, HSE, ABRE and TCA-elements, have been identified [95]. All of these and other ciselements were detected in our investigation. Therefore, the results obtained is in agreement to the various cisacting element detected in the analysis of LEA genes in various plants such as tomato [46], Chinese plum [47], in brassica [39] and poplar [48]. In each LEA gene, contained more than four cis-elements related to abiotic stress signal responsiveness, which provides strong evidence, that these genes might have important functions under different drought stresses.
A number of studies have shown that LEA genes do have a significant contribution in drought stress [69]. From the heatmap and expression pattern of cotton LEA gene families, high number of LEA genes showed higher expression levels across all the plant organs tested. The high expression levels of these genes under drought condition, indicates the maintenance functions during stress conditions, leading to drought tolerance of the plants. A unique observation was made, in which high percentage of the LEA 2 genes, used in the expression analysis, almost all showed high expression, and it would be of interest to characterize this group of LEA gene family in upland cotton. High expression pattern of the LEA genes have been observed in Brassica napus [39], Prunus meme [47], Arabidopsis [40] and sweet orange [53]. LEA genes have been found to have wide distribution and abundance among the terrestrial plants as opposed to aquatic plants, the abundance of these gene families could be pegged on to their conservative role, aquatic plants do not suffer from drought stress, thus the shrinking number of LEA genes. Therefore, the finding of this work and previous publication on function of LEA genes may explain why the LEA gene family have a wider distribution in terrestrial plants but not moss plants [96]. LEA families with close taxonomic relationships generally exhibited similar scales and distributions. However, the scales of the LEA gene family differ in upland cotton Gossypium hirsutum and other higher plants such as Theobroma cacao L, sweet orange, Arabidopsis among others; this could be due to changes in the environmental conditions. The high number of LEA genes in upland cotton, suggested that stress adaptation might have initiated the evolution of protein coding sequences which are LEA specific. Similar observation have been reported in maize in which adaptation to abiotic stress led to evolution of protein coding sequences, leading to the variation of LEA genes in maize compared to rice [97]. We carried out the expression profiling of the LEA genes in drought susceptible upland cotton, Gossypium hirsutum, drought tolerant cultivar Gossypium tomentosum and their BC 2 F 1 offspring. High numbers of genes were found to be highly up regulated in G. tomentosum than in G. hirsutum (Figs.  8 and 9). The results obtained is concurrent to similar findings which have been reported in maize landraces with varying drought stress tolerance levels when compared at the transcriptional level [98]. This result implied that more tolerant cotton genotypes had a greater ability to rapidly adjust more genes under drought stress than the more susceptible cotton cultivars. Moreover, rapid adjustments of greater number of differentially expressed genes and of different transcriptome factor families is considered an important trait of the drought tolerant genotypes [99].

Conclusions
This research work provides the very first detailed analysis, characterization and expression profile of upland LEA genes under drought condition. A total of 242 LEA genes were identified in upland cotton and divided into eight groups. Chromosomal mapping and syntenic analysis showed that all the LEA genes were distributed in Fig. 9 Quantitative PCR analysis of the selected LEA genes. Abbreviations: Rt (root), Sm (stem) and Lf (leaf). 0, 7 and 14 days of stress. Gh -Gossypium hirsutum, Gt -Gossypium tomentosum and BC-BC 2 F 1 offspring. Y-axis: relative expression (2 −ΔΔCT )