The transcriptome enables the identification of candidate genes behind medicinal value of Drumstick tree (Moringa oleifera)

The transcriptome enables the identification of candidate genes behind medicinal value of Drumstick tree (Moringa oleifera)

Genomics xxx (xxxx) xxx–xxx Contents lists available at ScienceDirect Genomics journal homepage: Original Article Th...

1MB Sizes 0 Downloads 19 Views

Genomics xxx (xxxx) xxx–xxx

Contents lists available at ScienceDirect

Genomics journal homepage:

Original Article

The transcriptome enables the identification of candidate genes behind medicinal value of Drumstick tree (Moringa oleifera) Shaik Naseer Pashaa, K. Mohamed Shafia,b, Adwait G. Joshia, Iyer Meenakshia, K. Harinia, Jarjapu Mahitaa, Radha Sivarajan Sajeevana, Snehal D. Karpea, Pritha Ghosha, Sathyanarayanan Nitisha,b, A. Gandhimathia, Oommen K. Mathewa, Subramanian Hari Prasannaa, Manoharan Malinia, Eshita Mutta, Mahantesha Naikaa, Nithin Ravoorua, Rajas M. Raoa, Prashant N. Shingatea, Anshul Sukhwala, Margaret S. Sunithaa, ⁎ Atul K. Upadhyaya, Rithvik S. Vinekara, Ramanathan Sowdhaminia, a b

National Centre for Biological Sciences (NCBS), TIFR, GKVK Campus, Bangalore 560065, Karnataka, India The University of Trans-Disciplinary Health Sciences & Technology (TDU), Yelahanka, Bangalore 560064, Karnataka, India



Keywords: Plant transcriptome Functional annotation Differential expression Metabolic pathway analysis

Moringa oleifera is a plant well-known for its nutrition value, drought resistance and medicinal properties. cDNA libraries from five different tissues (leaf, root, stem, seed and flower) of M. oleifera cultivar Bhagya were generated and sequenced. We developed a bioinformatics pipeline to assemble transcriptome, along with the previously published M. oleifera genome, to predict 17,148 gene models. Few candidate genes related to biosynthesis of secondary metabolites, vitamins and ion transporters were identified. Expressions were further confirmed by real-time quantitative PCR experiments for few promising leads. Quantitative estimation of metabolites, as well as elemental analysis, was also carried out to support our observations. Enzymes in the biosynthesis of vitamins and metabolites like quercetin and kaempferol are highly expressed in leaves, flowers and seeds. The expression of iron transporters and calcium storage proteins were observed in root and leaves. In general, leaves retain the highest amount of small molecules of interest.

1. Background Drumstick tree (Moringa oleifera Lam.), a member of the family Moringaceae, is a perennial softwood tree native to the sub-Himalayan region of India, growing to a height of 7–12 m with a diameter of 40 cm [1,2]. It is characterized by a corky bark, tuberous roots, spirally arranged, long-petioled tripinnate leaves and leaflets varying in length from 12 to 18 mm. The slender, curved pods resemble the stick used for

beating the drum, conferring the popular name “Drumstick” to this plant [3]. Its ability to withstand dry and arid tracts, coupled with its high nutritional value, make it widely sought-after in the tropics especially in drought-prone regions [4]. Oil extracted from the seeds of M. oleifera is utilized as a vegetable oil as well as a lubricant. M. oleifera has been extensively studied for its medicinal and pharmacological properties and the different tissues of this plant have been exploited for treating a wide variety of diseases in the indigenous system of medicine.

Abbreviations: Xb, Xbasepairs; XB, Xbytes; CEG, Core Eukaryotic Genes; DEG, Database of Essential Genes; FIR, functionally important residue; 4CL, 4-CoumarateCoA ligase; CHS, chalcone synthase; CHI, chalcone flavone isomerase; F3H, flavonone 3-hydroxylase; FLS, flavonone synthase; OMT, tricin synthase; F3’H, flavonoid 3-monooxygenase; NFD, N-substituted formamide deformylase; LCY, lycopene beta cyclase; TMT, gamma tocopherol methyltransferase; GLS, glucosinolate; GGP, GDP-L-galactose phosphorylase; GME, GDP-mannose-3,5-epimerase; GPP, L-galactose-1-P phosphatase; GDH, L-galactose dehydrogenase; GLDH, L-galactono-1,4lactone dehydrogenase; ICP, inductively coupled plasma-atomic emission spectroscopic; VIT, vacuolar iron transporters; C-CAMP, Centre for Cellular and Molecular Platforms; PE, paired end; MP, mate paired; CEGMA, Core Eukaryotic Genes Mapping approach; BUSCO, Benchmarking Universal Single-Copy Orthologs; EST, expressed sequence tags; NR, non-redundant; ITS, internal transcribed spacer; FPKM, fragments per kilobase of exon per million mapped reads; GO, Gene Ontology; ML, maximum likelihood; MSA, multiple sequence alignment; qRT-PCR, real-time quantitative polymerase chain reaction; PPR, pentatricopeptide repeat; LRR, leucine-rich repeat; TPR, tetratricopeptide repeats; WD40, beta-transducin repeat; RRM, RNA-recognition motif; Myb, Myb-like DNA-binding domain ⁎ Corresponding author at: Lab-25, National Centre for Biological Sciences, Tata Institute of Fundamental Research, GKVK Campus, Bellary Road, Bangalore 560065, India. E-mail address: [email protected] (R. Sowdhamini). Received 3 December 2018; Received in revised form 26 February 2019; Accepted 21 April 2019 0888-7543/ © 2019 Elsevier Inc. All rights reserved.

Please cite this article as: Shaik Naseer Pasha, et al., Genomics,

Genomics xxx (xxxx) xxx–xxx

S.N. Pasha, et al.

2. Materials and methods

For example, the anti-inflammatory and anti-microbial properties of the root make it useful in the treatment of constipation and lower back pain [2,5,6], but more so, as an anti-fertility agent [7]. Juice extracted from the leaves has been demonstrated to exhibit anti-diabetic properties. Similarly, the stem possesses analgesic, anti-diabetic, anti-cancer and anti-inflammatory properties [2,6]. The flower and pod have cholesterol-lowering effects [2,5]. For these reasons, M. oleifera is also referred to as a “Miracle Tree” or “Magic Tree” [8]. These medicinal aspects of the plant, as described above, are attributed to the presence of bioactive compounds, including secondary metabolites like quercetin, kaempferol, moringine (benzylamine), ursolic and oleanolic acid, placing it as a valuable resource in the pharmaceutical industry towards the development of drugs [2]. In India, treatment of chronic diseases and infections using different plant tissue extracts has been in practice for many centuries [9]. However, reports of quantitative structure-activity studies of plant secondary metabolites using in vitro and in vivo experiments are very limited [10]. A recent report by Ravindranath and group on Ashwagandha (Withania somnifera), showed the direct effect in reduction of plaque formation and hence its relevance in Alzheimer's disease [11]. Total organic synthesis of the secondary metabolites is also a timeconsuming and challenging process in most cases, due to the complex structure of these molecules, such as those having multiple chiral centers. Despite several studies on the specific metabolite pathways and genes involved in drought resistance for drumstick, transcriptome analysis and relative expression of transcripts contributing to these aspects of the plant are less studied. Sequencing genomes and transcriptomes of medicinally important plants is emerging as a popular alternative to traditional methods in the drug discovery process [12–14]. The genomes and transcriptomes of other indigenous plants - Neem (Azadirachta indica) [15], Turmeric (Curcuma longa L.) [16], Shatavari (Asparagus racemosus) [17] and Tulsi (Ocimum tenuiflorum) [18,19] have been published recently from India. We present the combined transcriptome of five different tissues of M. oleifera plant - leaf, root, stem, seed and flower (Fig. 1) and report on unique and tissue-specific abundant transcripts. A total of 14,624 transcripts encoded from these were found to be associated with Pfam domains. We specifically investigated the expression of enzyme-coding genes involved in the biosynthesis of quercetin, kaempferol, benzylamine and ursolic/oleanolic acid which are responsible for the medicinal properties of M. oleifera. Transcripts participating in biosynthesis of vitamins A, C and E, as well as transcripts coding for metal ion transporters were also analyzed for their relative expression among different tissues. Transcriptome analysis of M. oleifera presented here, along with additional analysis based on results of qRT-PCR experiments, quantification of metabolites and minerals, has enabled us to identify around 36 candidate genes and stress-responsive transcription factors. Our results will hopefully open up new avenues towards the utilization of the relevant genes of this plant for commercial purposes.

2.1. Isolation and sequencing of RNA from five tissues of M. oleifera M. oleifera plant (Bhagya variety) at the University of Agricultural Sciences, GKVK, Bangalore, India, was the source for our study and samples from five different tissues (leaf, root, stem, seed and flower) were obtained for transcriptome sequencing. RNA from the different tissues were isolated using Spectrum Plant total RNA kit (Sigma Aldrich) as per manufacturer's guidelines and the isolated RNA was treated with Ambion-DNase1 (Thermofisher). Quality and integrity of all the RNA samples were checked by Bioanalyzer (Agilent Technologies) and only samples having acceptable RNA Integrity Number (RIN) value above 7 were sequenced using Illumina HiSeq 1000 platform. Each library was sequenced in two technical replicates and the reads were processed using Trimmomatic (version 0.35) [20] to remove bad quality reads using default parameters. A total of 271 million reads were obtained after quality processing from a total of 10 libraries. 2.2. Transcriptome assembly, annotation and validation The reads were assembled using Trinity (version 2.4.0) with default parameters [21], with the reference genome assembly [22] used as a guide. The assembled transcript set is referred to as transcriptome hereafter. Gene annotation was carried out by integrating the previously published M. oleifera genome assembly, with our transcriptome data as evidence, using the MAKER (version 2.31.9) annotation pipeline [23]. The Augustus gene models from Arabidopsis thaliana were used to guide the annotations. The complete set of proteins identified from the transcripts is referred to as proteome hereafter. In addition, HMMSCAN (HMMER v3.1) searches were performed to identify Pfam domains [24,25] against the Pfam library (Pfam version 31) at an E-value of 0.01. The completeness of the proteome was assessed using the Core Eukaryotic Genes (CEG) dataset from CEGMA [26] and the Viridiplantae and Embryophyta (odb10) datasets from BUSCO [27,28]. 2.3. Orthology analysis Orthologous groups of M. oleifera-encoded proteins in four other plant species (Carica papaya, Theobroma cacao, Arabidopsis thaliana and Oryza saiva) were obtained using OrthoMCL (version 2.0.9) [29] at an E-value of 10−5. The analysis was extended to 37 sequenced plant species, for which the related proteomes were first collected from Phytozome v10.3.1 [30]. The list of these plant species is provided in Supplementary Data 1. Orthogroups were identified across M. oleifera and these 37 proteomes (with E-value cutoff of 10−10) using ProteinOrtho v5.15. Unique protein sequences from M. oleifera observed through analysis from both OrthoMCL v.2.0.9 [29] and ProteinOrtho v.5.15 [31], were further studied for their domain and GO annotations

Fig. 1. Morphology of various tissues in M. oleifera. 2

Genomics xxx (xxxx) xxx–xxx

S.N. Pasha, et al.

using Pfam v. 31 (HMMscan, E-value 0.001), InterPro (version 2.0.9) [32], InterPro2GO and Blast2GO using default parameters [33]. Representative species from Brassicales and Malvales, taxonomically closest to M. oleifera, were studied in detail as follows. In the first step, a phylogenetic tree for these selected species was built using concatenated nucleotide sequence alignments for ribulose-1,5-bisphosphate carboxylase/oxygenase large subunit (rbcL), maturase K (matK) and internal transcribed spacer 1 and 2 (ITS1 and ITS2) [34]. During this process, MAFFT (version 7) [35] was used in aligning sequences of the individual loci, with maximum allowance for aligning regions having the considerable number of gaps. O. sativa sequences were later added to these alignments as outgroup to minimize the number of gaps, since O. sativa is a monocot while the other plants are dicots. These alignments corresponding to all the above-mentioned genetic loci were then concatenated to obtain the combined alignment. Manual editing of the alignment was performed to remove the positions rich in gaps. Subsequent to construction and manual editing of the alignment, phylogenetic tree reconstruction of the concatenated alignment was performed using MEGA6 [36] with the Neighbour-Joining method [37]. The ortholog distribution of the Brassicales-Malvales clade was plotted against their phylogenetic tree for easy comparison.

2.6. Vitamin biosynthesis enzymes and ion transporters A similar sequence search strategy was employed for the identification of enzymes involved in the biosynthesis of vitamins - and the precursor of Vitamin A (β-carotene), Vitamin C (L-ascorbate) and Vitamin E (α-tocopherol), as well as metal ion transporters from the M. oleifera proteome. FIR mapping and phylogenetic analysis, as described earlier, was performed to recognize the best hit. The expression of the genes coding for the transporters in different tissue samples were analyzed based on their FPKM values. 2.7. Validation of DEGs by quantitative RT-PCR analysis Quantitative reverse transcription PCR (qRT-PCR) was employed to validate the expression of genes involved in biosynthesis of quercetin, kaempferol, benzylamine and a few ion transporters among the five tissues. Details of the RNA isolation and qRT-PCR [42] are mentioned in the Methods section of Supplementary data 1. 2.8. Quantification of metabolites and minerals in different tissue samples Samples corresponding to leaf, root, stem, seed, and flower tissues were air dried and powdered. These powdered samples were then subjected to quantitative analysis using mass spectrometry and HPLC for the presence of different metabolites [43]. The minerals were quantified by ICP-AES (Inductively Coupled Plasma-Atomic Emission Spectroscopy) from all five tissue samples and compared with the values for the same observed in spinach leaves. The detailed procedure for the quantitative estimation of metabolites and minerals [44] is described in the Methods section of Supplementary data 1.

2.4. Differential expression analysis The reads from all the five tissue samples, including the technical replicates for each tissue, were mapped onto the reference genome assembly [22] using default parameters in TopHat [38]. Cufflinks (version 2.2.1) [39] was then used to assemble and generate the gene models (GTF files) from each sample and the Fragments Per Kilobase Million (FPKM) values for each transcript in all the samples was also determined. Each of these assemblies was merged into a master assembly using the cuffmerge module of the Cufflinks suite. The differential abundance of these transcripts in the five tissues was calculated by using the cuffdiff module [40] of the Cufflinks suite. The log2 (fold change) expression values for every gene in a particular tissue, relative to another tissue. Their corresponding p-values and q-values were also calculated and were noted wherever huge differences in FPKM values were observed. Transcripts with higher relative abundance, among five tissues of the plant, were identified. The top 100 highly expressed transcripts from all five tissues have been annotated for their function using Blast2GO tool (version 5.2) [33]. Gene ontology enrichment analysis was performed using Fisher's exact method and a p-value of < 0.05 GO terms considered over-represented.

3. Results and discussion 3.1. Transcriptome assembly, gene annotation and quality assessment Total 271 million reads were sequenced from the five tissues - leaf, root, stem, seed and flower of M. oleifera. Combined transcriptome assembly of the reads from these five tissues was carried out to achieve maximum coverage of the transcripts. The results of the combined transcriptome assembly and annotation have been provided in Supplementary Table 1. A total of 66,079 transcripts was observed, corresponding to a N50 value of 2094 bp. Above 90% of the raw reads could be mapped back to the combined assembly, ensuring substantial coverage of reads. A total of 17,148 gene models were predicted by aligning our transcriptome data to the available M. oleifera genome [22], using the MAKER pipeline. A similar number of gene models (19,465) was predicted from the reference genome of drumstick by Tian and co-workers [22]. The quality of transcriptome assembly was assessed by searching for the presence of essential genes in the proteome created from above gene models. The results reveal that nearly 85.9% of the 248 essential genes, as recorded in the database of Core Eukaryotic Genes (CEG), could be identified and this percentage further increases to 95.2%, if partial matches with the essential genes are included. Another tool, BUSCO reported 83.8% and 79.3% completeness based on Viridiplantae and Embryophyta (odb10) datasets, respectively.

2.5. Proteins involved in synthesis of secondary metabolites A detailed analysis was carried out on four secondary metabolites, namely quercetin, kaempherol, α-aminotoluene (popularly known as benzylamine) and ursolic/oleanolic acid. The PlantCyc database [41] was referred in order to select the protein sequences of the enzymes participating in the biosynthesis of these secondary metabolites. These sequences were then used as queries in a sequence search approach, termed Jump start PSI-BLAST (E-value = 10−5, h-value = 10−5 and number of iterations = 2), to identify homologous sequences in the M. oleifera proteome. The hits obtained were validated by two approaches. The first approach was through the construction of a phylogenetic tree to understand their evolutionary relationship with enzymes belonging to the same class, while the second approach involved mapping of functionally important residues (FIR) onto a multiple sequence alignment of the enzymes. Details of the protocol and validation of the hits are provided in Supplementary Data 1. The putative transcripts were recognized and their expression was compared across leaf, root, stem, seed, and flower tissue samples of M. oleifera.

3.2. Protein domain assignment and functional association Out of 17,148 gene models, Pfam associations could be obtained for 14,624 (85.3%) predicted proteins, while 12,026 (70.1%) proteins had > 70% coverage to Pfam HMM models. A majority of proteins were predicted to contain domains like protein kinases or protein tyrosine kinases, followed by repeat containing domains such as PPR, LRR, TPR, WD40 and nucleotide-binding domains like RRM, Myb and zinc fingers. (Supplementary Fig. 1). We were also able to identify homologous 3

Genomics xxx (xxxx) xxx–xxx

S.N. Pasha, et al.

Fig. 2. Orthology analysis of M. oleifera proteome with select proteomes. (A) Orthogroup distribution of proteins from A. thaliana, C. papaya, O. sativa, T. cacao and M. oleifera. (B) Orthogroup distribution of species belonging to Brassicales-Malvales.

tolerance activity, secondary metabolite production and metal ion transporters. The GO enriched terms ‘GDP-L-galactose phosphorylase activity’ (GO:0080047), involved in the Vitamin C synthesis, were associated with abundant transcripts in leaf, seed, root and stem tissues. Likewise, ‘antioxidant activity’ (GO: GO:0016209) term was significantly enriched in top-100 abundant transcripts in the root. Transcripts associated with ‘metal ion binding’ (GO:0046872) and ‘response to metal ion’ (GO:0010038) were abundant in seed, stem and root tissues (Supplementary Table 2). Following this, we directed our efforts towards investigating the expression of enzyme-coding genes involved in the biosynthesis of the secondary metabolites responsible for imparting medicinal properties to the plant.

proteins in the Viridiplantae database, UniProt for 16,365 (95.4%) predicted proteins in M. oleifera at cut-off values of 30% sequence identity and 70% query coverage. These observations demonstrate that 91% of our predicted proteins are nearly full-length and have homologous sequences in other highly annotated plant genomes. Please see Supplementary Downloads for the annotation results. 3.3. Orthologous relationships of M. oleifera proteins Orthology analysis across other species was carried out for the predicted proteome to gauge the distribution of orthologs across closely related species like C. papaya, T. cacao and distantly related species like A. thaliana and O. sativa using OrthoMCL. We observe 7830 orthogroups are common to all these species (Fig. 2A). While M. oleifera shares 182 unique orthogroups with C. papaya, a closely-related species, it shares the most uniquely common orthogroups (254) with T. cacao. This may be due to the different assembly and annotation protocols employed for the plants used in this study. This prompted us to conduct a further detailed study of orthologous groups across 37 representative species, using another orthology prediction tool, ProteinOrtho with a more rigorous E-value criterion. Orthology prediction for sequenced plant proteomes using Proteinortho5.15 resulted in a total of 1,88,732 orthogroups. Here, each orthogroup can possess more than one protein from a plant species. It was found that 17,148 M. oleifera proteins were distributed across 12,508 orthogroups. Due to a stringent E-value cut-off and the diversity of species used, only 102 orthogroups were found to be shared among all the 38 plant species, whereas 51 orthogroups were unique to only C. papaya and M. oleifera (please see Supplementary Downloads). A closer inspection of species, with respect to their phylogenetic relationship, also reveals that M. oleifera and its closest relative C. papaya, both of which are within the plant Order Brassicales, follow very similar orthogroup distribution. This trend is followed with T. cacao, which belongs to the neighbouring order Malvales (Fig. 2B). Hence, the analysis of Proteinortho against larger number of proteomes provide satisfactory results. Further assessment and analysis of singletons in M. oleifera reveal that several of singletons are presently of short length (please see Supplementary Data 1).

3.5. Expression of enzymes involved in the synthesis of key medicinally relevant secondary metabolites are higher in leaves, flowers and seeds Secondary metabolites such as quercetin, kaempferol, benzylamine and ursolic/oleanolic acid synthesized by M. oleifera are responsible for its renowned medicinal values (please see Introduction section in Supplementary Data 1). Enzymes participating in the biosynthesis of metabolites were investigated for their expression among different tissues as illustrated in Supplementary Data 1. The number of hits obtained for each enzyme and their FPKM values is provided in Supplementary Table 3, while the results of the qRT-PCR analysis is shown in Supplementary Fig. 4. The relative expression of these enzymes (in terms of FPKM values) among different tissues is also represented as a heat map in Fig. 3A. 3.5.1. Biosynthesis of quercetin and kaempferol Seven enzymes are involved in the Quercetin biosynthesis pathway. 4-Coumarate-CoA ligase (4CL), Chalcone synthase (CHS), Chalcone flavone isomerase (CHI), Flavonone 3-hydroxylase (F3H), Flavonol synthase (FLS), Tricin synthase (OMT) and Flavonoid 3′-monooxygenase (F3’H). Out of these, five enzymes, 4CL, CHS, CHI, F3H and FLS, are common to kaempferol biosynthesis (Fig. 3B). We could identify a number of proteins corresponding to these enzymes in the assembled transcriptome, after resolving cross-family connections (Supplementary Table 3). The multiple sequence alignments, highlighting the functional residues and the corresponding phylogenetic trees, are shown in Supplementary Fig. 2(A–N). Interestingly, two enzymes in the pathway (F3H and FLS), appear to be encoded by the same transcript Supplementary Fig. 2(G–J). There are reports of bifunctional FLS, which also acts as a F3H, from certain other plants [45]. OMT is involved in the production of quercetin only in certain plants by a yet-

3.4. Differential expression analysis of transcripts Next, we identified highly abundant transcripts in the five tissues using cufflink program. We found that the top-100 highly abundant transcripts in the five tissues retain significant GO terms such as stress 4

Genomics xxx (xxxx) xxx–xxx

S.N. Pasha, et al.

Fig. 3. Enzymes and reactants involved in the quercetin synthesis pathway. (A) A heatmap showing the expression of candidate genes. Normalized FPKM values of enzymes involved in the synthesis of secondary metabolites, vitamins and metal ion transporters in different tissues of M. oleifera. (B) Quercetin biosynthesis pathway.

unidentified mechanism [46]). Differential expression analysis of the putative transcripts, encoding all the seven enzymes of the quercetin/kaempferol biosynthesis pathway, indicates that 4CL and FLS enzymes are highly expressed in flower, CHS and OMT in root, CHI in seed, and F3H in stem (Fig. 3B). The log2 (fold change) value for the first enzyme in the pathway 4CL is 7.3 in flower (with respect to seed) and CHS is 8.5 in root (over stem). CHI exhibits a log2 (fold change) value of 2.5 in seed (relative to stem), while F3H/FLS shows a log2 (fold change) value of 8 in flower (with reference to seed). These observations were further supported by qRTPCR analysis of these key enzyme transcripts (Supplementary Fig. 4 and Supplementary Data 1) and HPLC analysis of the metabolites (Supplementary Fig. 3A). These results suggest that quercetin and kaempferol are present in higher amounts in flower, leaf and seed.

3.5.3. Biosynthesis of ursolic/oleanolic acid Ursolic acid and oleanolic are synthesized from a common precursor, (3S)-2, 3-epoxy squalene oxide. The enzymes, α-amyrin synthase and β-amyrin synthase, act on this precursor to produce α-amyrin and β-amyrin, respectively [48]. These intermediate products are further utilized as substrates by a Cytochrome P450 (Cyp450) (Amyrin monooxygenase), culminating in the production of ursolic and oleanolic acid, respectively. CYP450 is a large superfamily consisting of 52 subfamilies. It has been previously shown that the CYP450 716A/B subfamily catalyzes the reaction corresponding to the synthesis of Ursolic acid and Oleanolic acid [49]. Six transcripts, which encode putative CYP450 domains, were identified in the M. oleifera proteome through sequence search methods (please see Supplementary Data 1). Two out of the six hits were accepted as true hits, since they were observed to cluster with the clade corresponding to the CYP450 716A/B subfamily. Further, analysis of these two transcripts indicate that they are highly upregulated in root (Supplementary Table 3). The putative transcript encoding the amyrin monooxygenase enzyme shows a log2 (fold change) value of 4.5 in root with reference to seed (p-value of 5e05, significant). Interestingly, the roots of M. oleifera and these two secondary metabolites, in particular, have long been associated with medicinal value and used as anti-fertility agents [7,8].

3.5.2. Biosynthesis of benzylamine Benzylamine is produced by the action of the enzyme N-substituted formamide deformylase (NFD) on the reactant N-benzylcyanide in bacteria [47]. In plants, this reactant is involved in the biosynthesis of benzylamine corresponds to N-benzylformamide and the enzyme NFD catalyzing the conversion of N-benzylformamide to benzylamine has been characterized in barley (Hordeum vulgare) (Supplementary Downloads). A sequence search for the NFD enzyme in the M. oleifera proteome (using the protein sequence of barley enzyme as a query) resulted in a single homologous sequence hit, which was further validated using phylogenetic analysis and mapping of functionally important residues (Supplementary Downloads). The expression of this transcript is higher in seeds, stem and leaf tissues relative to flower and root. This observation is supported by the qRT-PCR results, where maximum expression is observed in seeds, followed by leaf tissue (Supplementary Fig. 4F). Quantitative analysis of the metabolites, however, indicates that dibenzylamine is present in higher quantities in the root sample (Supplementary Fig. 3B), implying that benzylamine may be synthesized in other tissues and transported for storage in root.

3.6. Expression of genes involved in vitamin biosynthesis are higher in leaves, flowers and seeds During the biosynthesis of Vitamin A, the enzyme lycopene β-cyclase (LCY) cyclises the linear lycopene at both the ends in two sequential steps, resulting in the formation of β-carotene which is the precursor of Vitamin A in animals [50]. In the case of Vitamin C (Lascorbate) biosynthesis in plants, the route is majorly through the Lgalactose pathway [50]. The committed step in the pathway is the conversion of GDP-L galactose to L-Galactose-1-phosphate catalysed by GDP-L-galactose phosphorylase (GGP). The other enzymes that 5

Genomics xxx (xxxx) xxx–xxx

S.N. Pasha, et al.

constitute the pathway are GDP-mannose-3,5-epimerase (GME), L-Galactose-1-P phosphatase (GPP), L-Galactose dehydrogenase (GDH) and L-Galactono-1,4-lactone dehydrogenase (GLDH). The last enzyme, GLDH is also an essential component of the mitochondrial complex-I in plants. Tocopherol cyclase and γ-tocopherol methyltransferase (TMT) catalyze the last two steps in the biosynthesis of Vitamin E. The phylogenetic tree and multiple sequence alignment of homologous sequences of these enzymes are shown in Supplementary Downloads. Few hits could be recognized in the assembled transcriptome of M. oleifera for each of these enzymes, through sequence searches and validation, as explained earlier and in Methods (please see Supplementary Data 1). Further, functional residues were mapped on to the multiple sequence alignment of these enzymes to confirm their conservation in the proteins and in turn, to confirm their functional significance. The transcripts corresponding to these hits were analyzed for their expression values among different tissues. We observe that expression of the putative transcripts coding for the LCY enzyme required for the biosynthesis of Vitamin A, is higher in M. oleifera leaves, followed by flowers. The log2 (fold change) value for the putative transcript, encoding the LCY enzyme, is observed to be 3.4 for leaf relative to stem. Likewise, for Vitamin C, transcript evidence suggests that mRNA expression of GGP and GLDH is high in leaves and stem, whereas GME is highly expressed in stem and seeds, GPP in leaves and GDH in flowers and seeds. The results are shown in Supplementary Downloads. These observations were also confirmed using qRT-PCR analysis (Supplementary Fig. 4G) for one of the enzymes, GGP. The enzyme TMT, involved in Vitamin E biosynthesis, is observed to be expressed higher in M. oleifera leaves and seeds, relative to other tissues. The expression of tocopherol beta cyclase (TC) was higher in leaves and seeds.

respectively. Indeed, M. oleifera seeds are used for water purification and have well-characterized flocculation and coagulation properties, due to presence of such minerals in high quantities [8]. The expression of select genes, involved in the storage of some of these minerals, was compared across the five tissues. Genes coding for vacuolar iron transporters (VIT), Calreticulin for calcium storage, Zinc transporters and Magnesium transporters were identified and their corresponding transcript levels compared across the different tissues. As expected, transcripts coding for vacuolar iron transporter are found to be more abundant in root and leaf, consistent with the results on mineral quantification (that demonstrate iron to be present in relatively higher amounts in root). Three transcripts were identified that potentially retain a Calreticulin domain, out of which two show higher expression in leaf and one in root tissue (Supplementary Table 3). Six unique transcripts, coding for putative Magnesium transporters, were also identified. They show highest expression in leaf, followed by seed tissues. Three transcripts, coding for proteins belonging to the zinc transporter family of proteins, were identified and their analysis suggest that these proteins are expressed more in root and leaf. Results of the qRT-PCR analysis also correlate with the observations based on the FPKM values and mineral quantification for calcium and magnesium ion transporters (Supplementary Fig. 4I–J). Overall, this analysis leads to interesting candidate genes and is consistent with previous literature reports [2] on M. oleifera. 3.8. Investigation of transcription factor families and their involvement in stress response Transcription factors (TFs) play a significant role in plant development and stress tolerance. They act as regulatory proteins by regulating a set of targeted genes in a coordinated manner and consequently enhance the stress tolerance of the plant. Transcription factors were identified in M. oleifera transcriptome using PlantTFcat [51]. Around 2100 M. oleifera transcripts were observed to contain 2326 TFs. The TFs could be grouped into 96 families (Supplementary Table 4). and compared with TFs in other related species used for our initial orthology analysis (A. thaliana, C. papaya, T. cacao and O. sativa). This analysis showed a comparable number of TFs in the closely related species C. papaya. Since M. oleifera is known for its drought-resistant properties, we next searched for TFs that are implicated in multiple biotic and abiotic stress responses. We identified nine transcription factor families (AP2EREBP, NAM, WRKY, ARF, bHLH, MYB, Homeobox, bZIP and HSF, as recorded in our STIFDB2 database [52]. Such stress-responsive TFs accounts for 32% of total predicted TFs in M. oleifera, agreeing well with previous reports of such TFs [22]. The C2H2 transcription factor family was observed to be the highest in number, which has a role in

3.7. Quantification of minerals and differential expression analysis of putative ion transporter-coding transcripts reveal large amounts of minerals in leaves and higher abundance of candidate transcripts in roots and leaves The mineral content in different tissues of M. oleifera was quantified and compared with the amount present in spinach leaves taken as the standard. A detailed list of the minerals and their quantities in the corresponding tissues is provided in (Table 1). A comparison among the five tissue types reveals that the M. oleifera root contains a relatively higher quantity of Aluminum (Al), Calcium (Ca), Magnesium (Mg), Zinc (Zn), Sodium (Na) and Iron (Fe). Flowers and seeds are observed to contain higher amounts of Potassium (K), while leaf contains relatively more quantities of Manganese (Mn) and Phosphorus (P). It is also observed that tissues of M. oleifera contain higher content in comparison to spinach. Remarkably, the leaves of M. oleifera contain 30 times and 100 times more Iron and Calcium, in comparison to spinach leaves, Table 1 Comparison of the mineral content in the tissues of M. oleifera and spinach. Elements









Spinach leaves






2.85 15.84 0.01 6.35 4.56 0.06 0.08 1.16 0.59 5.27

0.14 1.76 0.01 17.02 1.69 0.02 0.03 0.43 2.23 0.08

0.17 2.88 0.02 17.10 2.49 0.02 0.05 0.48 3.54 2.28

0.25 5.67 0.01 14.04 2.39 0.03 0.06 0.59 1.82 2.19

0.19 10.74 0.03 11.99 4.23 0.05 0.16 0.47 2.98 3.95

(mg/g) Al Ca Cu K Mg Zn Mn Na P Fe

185.59 183.80 221.81 404.72 202.58 472.20 191.51 330.26 178.77 259.94

0.00 0.11 0.00 0.22 0.07 0.00 0.00 0.14 0.04 0.11

Bold-faced numbers indicate relatively higher mineral content. 6

Genomics xxx (xxxx) xxx–xxx

S.N. Pasha, et al.

defense mechanism and multiple abiotic stresses, followed by other families like GRAS, MADS and C3H. Further studies are required to validate the regulatory role of these transcription factor families in M. oleifera plant.

SRX3011280 (Seed), SRX3011278 (Leaf), SRX3011259 (Flower). BioProject: PRJNA394193 Moringa oleifera cultivar:Bhagya (KDM01). Competing interests

4. Conclusions

The author(s) declare that they have no competing interests.

The assembly and analysis of transcriptome of M. oleifera has provided new insights into the relative expression of certain key enzymes participating in the biosynthesis of medicinally-important secondary metabolites such as quercetin, kaempferol and benzylamine among the stem, leaf, flower, root and seed tissues of the plant. We also extended our investigations to enzymes participating in Vitamin biosynthesis pathways, as well as ion transporters. Our interpretations based on the differential expression analysis of the transcripts were further validated by real-time quantitative PCR for few candidate genes. The presence of metabolites and trace elements in these tissues were also quantified by HPLC and mass spectrometry, respectively. For example, in the case of quercetin, which is an important secondary metabolite with wellcharacterized medicinal properties, our transcriptome data suggested a relatively higher expression of some of the enzymes involved in its biosynthesis in the flower and leaf tissues. Leaves of M. oleifera contain high amounts of minerals like Iron and Calcium than spinach leaves, suggesting this to be a “super food” (Table 1). We report high abundance of vacuolar iron transporters and calcium transporters in root and leaves of M. oleifera, through our transcriptome studies. Such data provide genomic evidence for the nutrient properties and water purifying nature of this tree. Similarly, we also investigated transcription factors present in M. oleifera plant and their roles in stress responses and other regulatory mechanisms. Although we do not have biological replicates for our sample, we have used the genome assembly from the previous report [22] to strengthen our transcriptome assembly which provides better understanding of the spatial expression of genes of interest. The correlation between the small molecule quantification and relevant transcript expression is presented in Supplementary Downloads. We are providing a consolidated view of these results as an animation in Supplementary Downloads. We have also developed a bioinformatics pipeline which can be of benefit to other researchers, working in this rapidly-growing field of genome sequencing. Supplementary data to this article can be found online at https://

Consent for publication Not applicable. Ethics approval and consent to participate Not applicable. Acknowledgements We thank NCBS (TIFR) for infrastructural and financial support. We thank Dr. C. Anandharamakrishnan, the Director of the Indian Institute of Food Processing Technology (IIFPT) and Mr. Kumaravel (IIFPT), Thanjavur for mass spectrometric quantification of metabolites and HPLC quantification of quercetin. Sequencing was performed at Centre for Cellular and Molecular Platforms, Bangalore. We thank Dr. Mahesh Sankaran of NCBS for permitting us to use the ICP facility and his students, Nandita Nataraj and K. Chengappa for their assistance. We thank Prof. S. Ramaswamy for providing the space to carry out the wet lab experiments. We also thank the Department of Horticulture, Gandhi Krishi Vignan Kendra (GKVK), UAS for providing plant sample. Supplementary downloads. Additional downlodable files have been provided in the download link ( download/ddat). List of the files under the link has been provided in the file List_index.txt. References [1] C. Ramachandran, K.V. Peter, P.K. Gopalakrishnan, Drumstick (Moringa oleifera): a multipurpose Indian vegetable, Econ. Bot. 34 (1980) 276–283. [2] J. Fahey, Moringa oleifera: a review of the medical evidence for its nutritional, therapeutic, and prophylactic properties. Part 1, Trees Life J. (2005) 1–15. [3] M.-C. Shih, C.-M. Chang, S.-M. Kang, M.-L. Tsai, Effect of different parts (leaf, stem and stalk) and seasons (summer and winter) on the chemical compositions and antioxidant activity of Moringa oleifera, Int. J. Mol. Sci. 12 (2011) 6077–6088. [4] A. Parotta John, Moringa oleifera Lam. Reseda, Horseradish Tree. Moringaceae. Horseradish Tree Family, (1993). [5] B.R. Goyal, B.B. Agrawal, R.K. Goyal, A.A. Mehta, Phyto-pharmacology of Moringa oleifera Lam.?? An overview, Indian J. Nat. Prod. Resour. 6 (2014) 347–353. [6] Akwasi Ampofo-yeboah, H. Lambrechts, D. Brink, F. Hilten, Evans Afriyie-Gyawu, Analysis of oleanolic acid and ursolic acid, potential antifertility agents in Moringa, J. Agric. Sci. Technol. A 3 (2013) 989–999. [7] S. Shukla, R. Mathur, A.O. Prakash, Antifertility profile of the aqueous extract of Moringa oleifera roots, J. Ethnopharmacol. 22 (1988) 51–62. [8] F. Anwar, S. Latif, M. Ashraf, A.H. Gilani, Moringa oleifera: a food plant with multiple medicinal uses, Phyther. Res. (2007) 17–25. [9] C.P. Khare, Indian Medicinal Plants: An Illustrated Dictionary, (2007), p. 900. [10] A. Al-Romaiyan, B. Liu, R. Docherty, G.-C. Huang, S. Amiel, S.J. Persaud, et al., Investigation of intracellular signalling cascades mediating stimulatory effect of a Gymnema sylvestre extract on insulin secretion from isolated mouse and human islets of Langerhans, Diabetes Obes. Metab. 14 (2012) 1104–1113. [11] N. Sehgal, A. Gupta, R.K. Valli, S.D. Joshi, J.T. Mills, E. Hamel, et al., Withania somnifera reverses Alzheimer's disease pathology by enhancing low-density lipoprotein receptor-related protein in liver, Proc. Natl. Acad. Sci. 109 (2012) 3510–3515. [12] A.K. Al-Asmari, S.M. Albalawi, M.T. Athar, A.Q. Khan, H. Al-Shahrani, M. Islam, Moringa oleifera as an anti-cancer agent against breast and colorectal cancer cell lines, PLoS One 10 (2015) e0135814. [13] K. Shameer, M.B.N. Naika, K.M. Shafi, R. Sowdhamini, Decoding systems biology of plant stress for sustainable agriculture development and optimized food production, Prog. Biophys. Mol. Biol. (2018) pii: S0079-6107(17)30089-5. [14] B. Song, H. Lu, H. Yang, J. Wang, L. Li, S. Cheng, et al., The Draft Genomes of Five Agriculturally Important African Orphan Crops, Gigascience 8 (2019) pii: giy152. [15] N.M. Krishnan, S. Pattnaik, P. Jain, P. Gaur, R. Choudhary, S. Vaidyanathan, et al., A draft of the genome and four transcriptomes of a medicinal and pesticidal angiosperm Azadirachta indica, BMC Genomics 13 (2012) 464. [16] R.S. Annadurai, R. Neethiraj, V. Jayakumar, A.C. Damodaran, S.N. Rao,

Funding This work was supported by the Department of Biotechnology, Government of India grant (BT/PR10550/BID/7/479/2013) and by the JC Bose fellowship, Science and Engineering Research Board, Department of Science and Technology, Government of India (SB/S2/ JC-071/2015) to RS. We also thank NCBS (TIFR) for infrastructural and financial support. Author's contributions Conceived and designed experiments: RS. Data generation: SNP, KH, SN, RSS, MN, MSS. Data analysis and presentation: SNP, KH, JM, PG, SN, RSS, IM, AG, SDK, OKM, KMS, AGJ, SHP, MM, EM, MN, NR, RMR, PNS, AKU, RSV. Writing of Manuscript: JM, SNP, KH, PG, SDK, IM, SN, RSS, KMS, and RS. Provided resources and tools and critical reviewed the manuscript: RS. Availability of data and materials Raw transcript reads are available from NCBI Short Read Sequence Archive – SRA at with accession numbers: SRX3011282 (Stem), SRX3011281 (Root), 7

Genomics xxx (xxxx) xxx–xxx

S.N. Pasha, et al.

[17] [18] [19] [20] [21] [22] [23] [24] [25] [26] [27] [28] [29] [30] [31] [32] [33] [34]

M.A.V.S.K. Katta, et al., De novo transcriptome assembly (NGS) of Curcuma longa L. rhizome reveals novel transcripts related to anticancer and antimalarial terpenoids, PLoS One 8 (2013). S. Upadhyay, U.J. Phukan, S. Mishra, R.K. Shukla, De novo leaf and root transcriptome analysis identified novel genes involved in Steroidal sapogenin biosynthesis in Asparagus racemosus, BMC Genomics 15 (2014). A.K. Upadhyay, A.R. Chacko, A. Gandhimathi, P. Ghosh, K. Harini, A.P. Joseph, et al., Genome sequencing of herb Tulsi (Ocimum tenuiflorum) unravels key genes behind its strong medicinal properties, BMC Plant Biol. 15 (2015) 212. A. Gandhimathi, N. Sathyanarayanan, M. Iyer, R. Gupta, R. Sowdhamini, Evolutionary Analysis of a Few Protein Superfamilies in Ocimum tenuiflorum, The Ocimum Genome, Springer International Publishing, 2018. A.M. Bolger, M. Lohse, B. Usadel, Trimmomatic: a flexible trimmer for {Illumina} sequence data, Bioinformatics 30 (2014) 2114–2120. M.G. Grabherr, B.J. Haas, M. Yassour, J.Z. Levin, D.A. Thompson, I. Amit, et al., Full-length transcriptome assembly from RNA-Seq data without a reference genome, Nat. Biotechnol. 29 (2011) 644–652. Y. Tian, Y. Zeng, J.J. Zhang, C.G. Yang, L. Yan, X.J. Wang, et al., High quality reference genome of drumstick tree (Moringa oleifera Lam.), a potential perennial crop, Sci. China Life Sci. 58 (2015) 627–638. B.L. Cantarel, I. Korf, S.M.C. Robb, G. Parra, E. Ross, B. Moore, et al., {MAKER}: {An} easy-to-use annotation pipeline designed for emerging model organism genomes, Genome Res. 18 (2008) 188–196. R.D. Finn, P. Coggill, R.Y. Eberhardt, S.R. Eddy, J. Mistry, A.L. Mitchell, et al., The Pfam protein families database: towards a more sustainable future, Nucleic Acids Res. 44 (2016) D279–D285. R.D. Finn, J. Clements, S.R. Eddy, HMMER web server: interactive sequence similarity searching, Nucleic Acids Res. 39 (2011) W29–W37. G. Parra, K. Bradnam, I. Korf, CEGMA: a pipeline to accurately annotate core genes in eukaryotic genomes, Bioinformatics 23 (2007) 1061–1067. F.A. Simão, R.M. Waterhouse, P. Ioannidis, E.V. Kriventseva, E.M. Zdobnov, BUSCO: assessing genome assembly and annotation completeness with single-copy orthologs, Bioinformatics 31 (2015) 3210–3212. R.M. Waterhouse, M. Seppey, F.A. Simão, M. Manni, P. Ioannidis, G. Klioutchnikov, et al., BUSCO applications from quality assessments to gene prediction and phylogenomics, Mol. Biol. Evol. 35 (2017) 543–548. L.L. Li, C.J. Stoeckert, D.S. Roos, OrthoMCL: identification of ortholog groups for eukaryotic genomes, Genome Res. 13 (2003) 2178–2189. D.M. Goodstein, S. Shu, R. Howson, R. Neupane, R.D. Hayes, J. Fazo, et al., Phytozome: a comparative platform for green plant genomics, Nucleic Acids Res. 40 (2012) D1178–1186. M. Lechner, S. Findeiß, L. Steiner, M. Marz, P.F. Stadler, S.J. Prohaska, Proteinortho: detection of (co-)orthologs in large-scale analysis, BMC Bioinformatics 12 (2011) 124. S. Hunter, R. Apweiler, T.K. Attwood, A. Bairoch, A. Bateman, D. Binns, et al., InterPro: the integrative protein signature database, Nucleic Acids Res. 37 (2009) D211–D215. A. Conesa, S. Götz, Blast2GO: a comprehensive suite for functional analysis in plant genomics, Int. J. Plant Genom. 2008 (2008) 1–12. W.J. Kress, D.L. Erickson, A two-locus global DNA barcode for land plants: the coding rbcL gene complements the non-coding trnH-psbA spacer region, PLoS One 2 (2007) e508.

[35] K. Katoh, K. Misawa, K. Kuma, T. Miyata, MAFFT: a novel method for rapid multiple sequence alignment based on fast Fourier transform, Nucleic Acids Res. 30 (2002) 3059–3066. [36] K. Tamura, G. Stecher, D. Peterson, A. Filipski, S. Kumar, MEGA6: molecular evolutionary genetics analysis version 6.0, Mol. Biol. Evol. 30 (2013) 2725–2729. [37] N. Saitou, M. Nei, The neighbour-joining method: a new method for reconstructing phylogenetic trees, Mol. Biol. Evol. 4 (1987) 406–425. [38] C. Trapnell, L. Pachter, S.L. Salzberg, TopHat: discovering splice junctions with RNA-Seq, Bioinformatics 25 (2009) 1105–1111. [39] C. Trapnell, B.A. Williams, G. Pertea, A. Mortazavi, G. Kwan, M.J. van Baren, et al., Transcript assembly and quantification by RNA-Seq reveals unannotated transcripts and isoform switching during cell differentiation, Nat. Biotechnol. 28 (2010) 511–515. [40] C. Trapnell, D.G. Hendrickson, M. Sauvageau, L. Goff, J.L. Rinn, L. Pachter, Differential analysis of gene regulation at transcript resolution with RNA-seq, Nat. Biotechnol. 31 (2013) 46–53. [41] L. Chae, T. Kim, R. Nilo-Poyanco, S.Y. Rhee, Genomic signatures of specialized metabolism in plants, Science 344 (2014) 510–513. [42] L.-T. Deng, Y.-L. Wu, J.-C. Li, K.-X. OuYang, M.-M. Ding, J.-J. Zhang, et al., Screening reliable reference genes for RT-qPCR analysis of gene expression in Moringa oleifera, PLoS One 11 (2016) e0159458.. [43] G. Sathyaprabha, S. Kumaravel, D. Ruffina, P. Praveenkumar, Available Online Through a Comparative Study on Antioxidant, Proximate Analysis, Antimicrobial Activity and Phy- Tochemical Analysis of Aloe Vera and Cissus quadrangularis by GC-MS, vol. 3, (2010), pp. 5–8. [44] R. Paranthaman, P. Praveen Kumar, S. Kumaravel, GC-MS analysis of phytochemicals and simultaneous determination of flavonoids in Amaranthus caudatus (Sirukeerai) by RP-HPLC, J. Anal. Bioanal. Tech. 03 (2012) 5. [45] R. Lukacin, F. Wellmann, L. Britsch, S. Martens, U. Matern, Flavonol synthase from Citrus unshiu is a bifunctional dioxygenase, Phytochemistry 62 (2003) 287–292. [46] P.Y. Lam, F.-Y. Zhu, W.L. Chan, H. Liu, C. Lo, Cytochrome P450 93G1 is a flavone synthase II that channels flavanones to the biosynthesis of tricin O-linked conjugates in rice, Plant Physiol. 165 (2014) 1315–1327. [47] H. Fukatsu, Y. Hashimoto, M. Goda, H. Higashibata, M. Kobayashi, Amine-synthesizing enzyme N-substituted formamide deformylase: screening, purification, characterization, and gene cloning, Proc. Natl. Acad. Sci. U. S. A. 101 (2004) 13726–13731. [48] H.H. Rees, G. Britton, T.W. Goodwin, The biosynthesis of beta-amyrin. Mechanism of squalene cyclization, Biochem. J. 106 (1968) 659–665. [49] R.C. Misra, S. Sharma, A.Garg Sandeep, C.S. Chanotiya, S. Ghosh, Two CYP716A subfamily cytochrome P450 monooxygenases of sweet basil play similar but nonredundant roles in ursane- and oleanane-type pentacyclic triterpene biosynthesis, New Phytol. 214 (2017) 706–720. [50] C.L. Linster, S.G. Clarke, L-Ascorbate biosynthesis in higher plants: the role of VTC2, Trends Plant Sci. 13 (2008) 567–573. [51] X. Dai, S. Sinharoy, M. Udvardi, P.X. Zhao, PlantTFcat: an online plant transcription factor and transcriptional regulator categorization and analysis tool, BMC Bioinformatics 14 (2013) 321. [52] M. Naika, K. Shameer, O.K. Mathew, R. Gowda, R. Sowdhamini, STIFDB2: an updated version of plant stress-responsive transcription factor database with additional stress signals, stress-responsive transcription factor binding sites and stressresponsive genes in arabidopsis and rice, Plant Cell Physiol. 54 (2013) e8.