Characterization of the Tibet plateau Jerusalem artichoke (Helianthus tuberosus L.) transcriptome by de novo assembly to discover genes associated with fructan synthesis and SSR analysis

Background Jerusalem artichoke (Helianthus tuberosus L.) is a characteristic crop of the Qinghai-Tibet Plateau that has developed rapidly and gained socioeconomic importance in recent years. Fructans are abundant in its tubers and underlie tuber formation, yield, processing and utilization; they are also widely used in new sugar-based materials, bioenergy processing, ecological management, and functional feed. To identify key genes in the fructan metabolic pathway of Jerusalem artichoke, high-throughput sequencing was performed on Illumina HiSeq™ 2500 equipment to construct a transcriptome library. Results The Qinghai-Tibet Plateau Jerusalem artichoke “Qingyu No.1” was used as the material; roots, stems, leaves, flowers and tubers collected at the flowering stage were pooled to build a mixed-tissue transcriptome library, yielding 63,089 unigenes with an average length of 713.6 bp. Gene annotation against the Nr, Swiss-Prot, GO, KOG and KEGG databases showed that 34.95% and 46.91% of these unigenes had similar sequences in the Nr and Swiss-Prot databases, respectively. The GO classification divided the Jerusalem artichoke unigenes into three ontologies (biological processes, cellular components, and molecular functions) with a total of 49 functional groups. Among these, the functional groups for cellular processes, metabolic processes, and single-organism processes contained the most unigenes. A total of 38,999 unigenes were annotated by KOG and divided into 25 categories according to their functions, the most common annotation being general function prediction. A total of 13,878 unigenes (22%) were annotated in the KEGG database, with the largest proportion corresponding to pathways related to carbohydrate metabolism. A total of 12 unigenes were involved in the synthesis and degradation of fructan. 
Cluster analysis revealed that the proteins of the 12 candidate unigenes were dispersed among the 5 major families of proteins involved in fructan synthesis and degradation. The synergistic effect of INV gene is necessary during fructose synthesis and degradation in Jerusalem artichoke tuber development. The sequencing data from the transcriptome of this species can provide a reliable data basis for identifying the members of the INV gene family and assessing their expression. A simple sequence repeat (SSR) loci search was performed on the transcriptome data of Jerusalem artichoke, identifying 6635 eligible SSR loci; most motifs were repeated 5 or 6 times. Dinucleotide and trinucleotide repeat motifs were the most frequent, with AG/CT and ACC/GGT repeat motifs accounting for the highest proportions. Conclusions In this study, the transcriptome of Jerusalem artichoke from the Qinghai-Tibet Plateau was characterized by high-throughput sequencing and database searches to obtain important transcript and SSR loci information. This allowed characterization of the overall expression features of the Jerusalem artichoke transcriptome and identification of the key genes involved in metabolism in this species. In turn, this offers a foundation for further research on the regulatory mechanisms of fructan metabolism in Jerusalem artichoke.


Background
Jerusalem artichoke (Helianthus tuberosus L.) belongs to the genus Helianthus L. of Asteraceae. Although it is native to temperate zones in North America, it was introduced into China over 300 years ago, where it was long planted sporadically as a pickled vegetable. The twenty-first century has seen a gradual increase in the economic value of Jerusalem artichoke, with a rapid expansion in its planting and processing. The Jerusalem artichoke tuber contains fructan, which can be used to produce inulin [1], bioethanol [2] and various foods [3]. In addition, because of its strong stress resistance, it can be cultivated in dry and saline-alkaline environments, and has also been used in the ecological treatment of desert soil [4], seashore saline soil [5] and heavy metal-contaminated soil [6]. The Qinghai-Tibet Plateau is one of the earliest areas in China to conduct large-scale planting and processing of Jerusalem artichoke. The climate is cold, with large differences in temperature between day and night. The yield of the Jerusalem artichoke tuber is high, due to its elevated fructan content. Moreover, the region has the advantage of a natural, pollution-free environment, which is very beneficial to the processing of organic fructan products. The Jerusalem artichoke planting and processing industry therefore has rich development potential in the Qinghai-Tibet Plateau.
Fructan is the third largest storage carbohydrate after starch and sucrose. About 15% of angiosperms contain fructan, most prominently in the Compositae, Liliaceae and Gramineae families, which mostly grow in cold temperate zones [7]. Plant fructan has been studied for over 200 years, with early research focused on the biochemical properties of fructans [8]. In recent years, abundant studies have shown that fructan metabolism is closely related to plant resistance to cold [9], drought [10] and saline-alkali conditions [11], as well as other stressors. With the discovery of the enzymes involved in fructan metabolism, the research focus has shifted to understanding the metabolic processes and functions of fructan [12]. In addition to the important role of fructan in plants, processed food derived from this carbohydrate also helps maintain human health. The properties and processing techniques of fructan have been extensively studied [13]. Fructan is an optimal nutrient and sweetener for patients with diabetes and hypertension, as it has a caloric value of only 1 kcal/g and is not easily digested and decomposed into monosaccharides, and therefore does not sharply increase blood glucose [14,15]. Highly polymerized fructans are often used as substitutes for fats in foods such as yogurt, ice cream and jelly pudding [16].
Plant fructans are mainly linear or branched and comprise 5 different types of derivatives. The earliest model of fructan metabolism was proposed from research on Jerusalem artichoke [17]. Jerusalem artichoke is a typical crop that accumulates inulin-type linear fructan. This type is composed of D-fructose residues linked by β-2,1 bonds, with molecular weights ranging from 3 to 5 kDa [18]. Current models of inulin-type fructan metabolism suggest its synthesis uses sucrose as a substrate: first, the synthesis of kestose (1-kestose) is catalyzed by sucrose 1-fructosyltransferase (1-SST, EC 2.4.1.99), followed by elongation of the fructan chain [19,20] catalyzed by fructan 1-fructosyltransferase (1-FFT, EC 2.4.1.100). The degradation of fructan is then catalyzed by fructan 1-exohydrolase (1-FEH, EC 3.2.1.80), with fructosyl units being cleaved one by one from the end of the sugar chain. At present, the functions of the known key enzymes in the metabolism of Jerusalem artichoke fructan remain unclear, and the genes involved in regulating this metabolism remain to be discovered.
In recent years, high-throughput transcriptome sequencing has been widely used to study gene expression in different biological problems. This method can comprehensively and quickly acquire all mRNA sequences [21-23] generated by gene expression in a given state and identify important functional genes among them, in order to reveal the molecular mechanisms underlying different biological traits. In this study, the transcriptome of Jerusalem artichoke from the Qinghai-Tibet Plateau was sequenced using high-throughput sequencing technology with Illumina HiSeq 2500 equipment. Functional annotation, functional classification and metabolic pathway analysis of the unigenes were carried out with bioinformatics methods to lay a foundation for further research on the key metabolic genes in Jerusalem artichoke, for interpretation of the molecular mechanisms involved in the regulation of fructan metabolism, and for the development of related molecular markers.

Analysis of RNA-Seq datasets
Illumina sequencing obtained a total of 123,169,632 reads, containing 18.68 G of nucleotide sequence information. The percentages of Q20 (sequencing error rate less than 1%), Q30 (sequencing error rate less than 0.1%), and GC content were 98.28, 95.40, and 44.92%, respectively (Table 1). The transcriptome sequencing data were high in both volume and quality, providing good raw data for subsequent assembly. Assembly with Trinity yielded 63,098 spliced transcript sequences, with an average length of 713 nt and an N50 of 1106 nt. The longest transcript of each gene was taken as a unigene, giving an average unigene length of 713.6 bp. The length distribution of the unigenes revealed a total of 12,962 unigenes with a length greater than 1000 nt, accounting for 20.55% of all unigenes. By length class, 34,765 unigenes had a length of 200-499 nt, accounting for 55.10% of the total; 14,004 unigenes had a length of 500-999 nt, corresponding to 22.20% of the total; and 14,320 unigenes had a length of ≥1000 nt, accounting for 22.70% of the total (Fig. 1). This indicated that the transcriptome library had good sequencing and assembly results and was adequate for subsequent bioinformatics analysis.
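The assembly statistics above (N50 and the share of unigenes per length class) can be reproduced from a list of sequence lengths. The following is a minimal sketch, not the pipeline used in this study: the function names and the toy lengths are hypothetical, and only the three class counts are taken from the text.

```python
def n50(lengths):
    """Return the N50: the length L such that sequences of length >= L
    together contain at least half of all assembled bases."""
    ordered = sorted(lengths, reverse=True)
    if not ordered:
        raise ValueError("lengths must not be empty")
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length


def length_bins(lengths):
    """Count sequences in the three length classes used in the text (nt)."""
    bins = {"200-499": 0, "500-999": 0, ">=1000": 0}
    for length in lengths:
        if length >= 1000:
            bins[">=1000"] += 1
        elif length >= 500:
            bins["500-999"] += 1
        elif length >= 200:
            bins["200-499"] += 1
    return bins


def percent(count, total):
    """Share of `count` in `total`, as a percentage rounded to 2 decimals."""
    return round(100 * count / total, 2)


# Toy lengths (hypothetical, for illustration only).
toy_lengths = [300, 400, 600, 1200, 1500]
toy_n50 = n50(toy_lengths)
toy_bins = length_bins(toy_lengths)

# Class counts reported in the text; their sum is the unigene total.
reported = {"200-499": 34765, "500-999": 14004, ">=1000": 14320}
total_unigenes = sum(reported.values())
reported_percent = {k: percent(v, total_unigenes) for k, v in reported.items()}
```

With the reported class counts, the three classes sum to 63,089 unigenes and the recomputed shares match the percentages given for Fig. 1.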

Functional annotation and classification
The assembled unigene sequences were aligned separately against the Nr and Swiss-Prot databases with Blast software. A total of 22,048 unigenes aligned to similar sequences in the Nr database; of these, 9873 (44.78%) had E values ranging from 1E-50 to 1E-20. To determine the species to which the homologous sequences belong, the sequence giving the best result for each unigene (the lowest E value) in the Nr library was selected (in the case of a tie, the first one was selected). The number of homologous sequences aligned to each species in the Nr database was then counted. The species with the most matches to the unigenes from Jerusalem artichoke was Cynara cardunculus (19002), followed by Daucus carota (1346), Helianthus annuus (1126), Vitis vinifera (1042), Theobroma cacao (934), Sesamum indicum (810), and others. A total of 29,595 (46.91%) unigenes had similar sequences in the Swiss-Prot database, with the largest group, 8135 unigenes (27.49%), having E values between 1E-50 and 1E-20. Compared with Nr, Swiss-Prot functional annotation found a large reduction in unigenes with high-similarity sequences, consistent with the characteristics of the databases. The Swiss-Prot data source is very strict, and all sequence entries are carefully checked by experienced molecular biologists and protein chemists using computer tools and the related literature. In contrast, the Nr database integration standard is looser, and any two sequences with any difference are regarded as two distinct records.
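The best-hit rule described above (lowest E value per unigene, first hit kept on ties) followed by a per-species tally can be sketched as follows. This is an illustrative sketch only: the tuple layout, function names and toy hits are assumptions, not the authors' scripts or real alignment results.

```python
from collections import Counter


def best_hits(hits):
    """Keep, for each query, the hit with the lowest E value.

    `hits` is an iterable of (query_id, species, evalue) tuples in the
    order they were reported; on ties the first hit is kept because a
    later hit only replaces the stored one when it is strictly lower.
    """
    best = {}
    for query, species, evalue in hits:
        if query not in best or evalue < best[query][1]:
            best[query] = (species, evalue)
    return best


def species_counts(best):
    """Count how many queries have their best hit in each species."""
    return Counter(species for species, _ in best.values())


# Toy hits (hypothetical identifiers and E values).
toy_hits = [
    ("UnigeneA", "Cynara cardunculus", 1e-40),
    ("UnigeneA", "Helianthus annuus", 1e-40),   # tie: first hit is kept
    ("UnigeneB", "Daucus carota", 1e-10),
    ("UnigeneB", "Cynara cardunculus", 1e-30),  # lower E value wins
]
toy_best = best_hits(toy_hits)
toy_counts = species_counts(toy_best)
```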
The GO database was used to classify the unigenes according to biological function. The 63,089 Jerusalem artichoke unigenes were divided into three ontologies (Fig. 2): biological processes, cellular components, and molecular functions, with a total of 49 functional groups. Further analysis showed 46,360 GO ... Jerusalem artichoke unigenes were compared with the KOG database and classified according to their function. A total of 38,999 unigenes were divided into 25 categories, with the most annotated unigenes corresponding to general function prediction (7016 unigenes), followed by signal transduction mechanisms (4740 unigenes) and posttranslational modification, protein turnover and chaperones (4507 unigenes). Meanwhile, the groups with the fewest annotated unigenes were those related to nuclear structure (147 unigenes), extracellular structures (114 unigenes), and cell motility (21 unigenes). The identified unigenes of Jerusalem artichoke were involved in most biological activities typical of growth and development (Fig. 3).

Functional classification by the KEGG pathway
Jerusalem artichoke unigenes were compared against the KEGG database, and the metabolic pathways were analyzed from the annotation information to understand the pathways and the cellular functions of the gene products. A total of 13,878 unigenes (22.00%) were annotated for analysis of metabolic pathways. Among the 7 most statistically significant pathways, carbohydrate metabolism-related pathways accounted for the largest proportion, followed by translation; global and overview maps; folding, sorting and degradation; amino acid metabolism; transport and catabolism; and energy metabolism. Most unigenes corresponded to metabolic pathways, indicating that Jerusalem artichoke was metabolically active during this period. In addition, more than 400 other unigenes were linked to other subclasses related to ribosomes, carbon metabolism, biosynthesis of amino acids, protein processing in the endoplasmic reticulum, and plant hormone signal transduction pathways (Fig. 4).

Phylogenetic analysis of fructan synthesis and degradation unigenes
Results indicated that 12 unigenes were involved in the synthesis and degradation of fructan (Table 2). A protein phylogenetic tree of Jerusalem artichoke and other fructan crops was constructed with MEGA 5.0 software. According to the topology of the phylogenetic tree (Fig. 5), the proteins of fructan-synthesizing and fructan-degrading enzyme genes were clustered into three categories: the unigenes obtained by sequencing, fructosyltransferase genes, and fructosyl synthase genes. The proteins of the 12 candidate unigenes were dispersed among five major protein families of fructan synthesis and degradation enzymes, especially the fructosyltransferase gene family, which accounted for 91.67%. This may be because the fructosyltransferase gene is similar in structure to the vacuolar invertase and cell wall invertase genes, displaying the widest distribution in the phylogenetic tree. The fructosyltransferase genes were clustered with Unigene37416, Unigene49657, Unigene7092 and Unigene23036. These fructosyltransferase genes were mainly INV, CW-INV and β-fructofuranosidase. However, Unigene32177 was the only candidate unigene that clustered with the fructan synthase protein. In addition, Unigene37416 and Unigene49657 were on the same evolutionary branch as Helianthus annuus and Cichorium intybus, respectively. Unigene36264, Unigene46471, Unigene20477, and Unigene21582 were adjacent and on ... (Table 3). We found the Jerusalem artichoke transcriptome to be rich in SSR types, with 6635 SSR loci containing 180 repeat motifs, corresponding to 4 dinucleotides, 10 trinucleotides, 30 tetranucleotides, 52 pentanucleotides and 83 hexanucleotides. Among the dinucleotides, AG/CT repeat motifs were dominant, accounting for 32.9% of all SSRs, while the trinucleotide repeat motifs ACC/GGT, ATC/ATG, and AAG/CTT accounted for 10, 8.4 and 8.4%, respectively. 
Among the pentanucleotides, the motifs with the highest proportions were AAAAC/GTTTT, AAAAG/CTTTT, AAAAT/ATTTT and AAACC/GGTTT, while the remaining tetranucleotide, pentanucleotide and hexanucleotide repeat motifs were more dispersed, with low occurrence frequencies, and accounted for only 9.4% (Fig. 7).

Discussion
Jerusalem artichoke is a hexaploid (2n = 6x = 102) outcrossing plant with a large genome [24]; it reproduces asexually through tubers and shows self-incompatibility, a low hybridization rate, and a complex genetic structure. Compared with major crops, bioinformatics resources on Jerusalem artichoke are relatively scarce. With the widespread use of transcriptome sequencing technology, mining for genetic resources is possible at low cost and high throughput. In this study, RNA-Seq technology was used to assemble and analyze the transcriptomes of different tissues at the flowering stage, obtaining 63,089 unigenes. After searching the Nr and Swiss-Prot protein databases, 41,041 (65.05%) and 33,494 (53.09%) unigenes, respectively, had no annotation in each database. This phenomenon has been reported in transcriptome sequencing of many species, such as Abelmoschus esculentus [25], Fraxinus velutina [26], Morus alba L. [27], and others. It has been attributed to the short length of some unigene fragments, the lack of information on the relevant genes in the databases, and the possible presence of new genes in Jerusalem artichoke [28]. The GO database was used to analyze the functional classification and biological characteristics of the annotated genes. However, due to limitations such as incomplete functional classification in the GO database, 41,738 unigenes were not assigned to any GO entry. Therefore, other bioinformatics methods are needed to supplement the functional classification of Jerusalem artichoke unigenes.
Through amino acid sequence alignment, we found that the unigenes related to fructan metabolism in Jerusalem artichoke shared some sequence-conserved functional domains, all belonging to glycosyl hydrolase family 32 (GH32). The phylogenetic tree showed that fructosyltransferases (FBE) and vacuolar invertase constituted one branch, while exohydrolase and cell wall invertase were clustered into another group, indicating that FBE in plants may have evolved from vacuolar invertase [29,30]. These similar evolutionary origins are supported by studies in which the synthetic enzyme 1-FFT has been observed to display degradative enzymatic function, while the degrading enzyme 1-FEH sometimes has partial synthesis ability [31,32]. Further research is needed to clarify the functions of the currently known fructan metabolism-related enzymes. Indeed, although some genes involved in fructan metabolism in Jerusalem artichoke have been cloned [32,33], other genes in this metabolic pathway need further exploration.
Analysis of the expression of key unigenes in the synthesis and degradation of fructan in Jerusalem artichoke revealed that 12 unigenes were expressed during tuber development. The metabolism of fructan is a complex process. Fructan is a carbohydrate composed of multiple fructose groups linked by glycosidic bonds. Sucrose is a precursor of fructan synthesis, and different combinations of fructosyltransferases catalyze the synthesis of multiple types of fructans, without the involvement of phosphorylation or nucleotide cofactors [34]. Results indicate that Unigene32177, which is related to fructan synthesis in Jerusalem artichoke, has an expression pattern similar to those of the sucrose invertases Unigene7092 and Unigene46471. Because the substrates are fructose and glucose produced by sucrose hydrolysis during inulin synthesis, fructan synthesis and sucrose conversion are thought to occur simultaneously. Current research on inulin-type fructan has focused on 1-SST and 1-FFT (two important genes involved in fructan synthesis) and FEH (which is ultimately responsible for the degradation of fructan). However, invertase INV is required for the complete degradation of fructan [35]. Because one of the final products of FEH hydrolysis is sucrose, this reaction is inhibited by sucrose [36]. The outcomes of the present study indicate that different INV genes have diverse expression levels at various stages of Jerusalem artichoke tuber development. It is hypothesized that these INV genes, expressed at different stages, differ in function, showing functional pleiotropy. More importantly, the synergistic effect of INV gene is necessary during fructose synthesis and degradation in Jerusalem artichoke tuber development. The sequencing data from the transcriptome of this species can provide a reliable data basis for identifying the members of the INV gene family and assessing their expression. 
Therefore, further study is required regarding the functional analysis of INV in fructose synthesis and degradation, and its involvement in signal transduction pathways.
There is currently no SSR marker available for Jerusalem artichoke, and only 6 pairs of SSR primers have been identified in sunflower, which belongs to the same genus as Jerusalem artichoke [34]; this does not meet research needs. The direct use of Jerusalem artichoke sequence information to develop SSR markers is therefore an important prerequisite for systematic research on the inheritance and evolution of this species. In this study, the transcriptome sequence information of Jerusalem artichoke was used to conduct an SSR locus search, identifying 6635 SSR loci with a frequency of 10.52%, which was higher than the 7.6% reported for Cajanus cajan L. [35], but lower than the 14.70% documented for tea [36]. Dinucleotides (45.24%) and trinucleotides (42.53%) appeared most frequently as repeat motifs in Jerusalem artichoke, which was consistent with most species [37]. The most frequent dinucleotide repeat motif was AG/CT, echoing studies in dicots such as Hevea brasiliensis [38] and Vitis L. [39]. The most abundant trinucleotide repeat was ACC/GGT, which is inconsistent with previous reports in which the main trinucleotide repeat type in dicots was AAG/CTT [40].

Conclusions
In this study, the transcriptome of Jerusalem artichoke from the Qinghai-Tibet Plateau was sequenced with high-throughput sequencing technology and evaluated against public databases, yielding a large amount of data on transcripts and SSR loci, depicting the general expression characteristics of this species' transcriptome, and identifying its key metabolism-related genes. Moreover, this study provides a reference for subsequent investigation of the metabolism of fructan and its regulation in Jerusalem artichoke, along with abundant data for molecular biology research on Helianthus plants of the Asteraceae.

Materials and methods
Plant material and RNA extraction
The transcriptome data were derived from the results of high-throughput sequencing conducted with Illumina equipment in " Then poly (A) was added, and the sequencing adapter was ligated to prepare a sequencing library for PCR amplification and subsequent sequencing.

Illumina sequencing and de novo assembly
The PE125 sequencing method was used to construct the Jerusalem artichoke transcriptome library on the Illumina HiSeq™ 2500 sequencing platform. The original image data obtained by sequencing were converted to raw reads via base recognition (base calling), and the raw reads were filtered to obtain clean reads, followed by statistical analysis of the amount of raw data and its length. Sequencing adapters, repeated redundant sequences, and low-quality sequence data were removed from the original data. Then, the number of clean reads, their total length, Q20%, Q30%, N%, and GC% were counted. Q20 represents the proportion of bases with a quality score of not less than 20 after filtration, Q30 represents the proportion of bases with a quality score of not less than 30 after filtration, and N represents the proportion of bases that could not be determined after filtration. The assembly was performed using Trinity [41] software: reads were extended into contigs through overlaps between sequences, and transcripts were then obtained by local assembly. Finally, the transcripts were clustered by homology and spliced using TGICL [42] and Phrap software [43] to obtain single gene clusters (unigenes). Analyses included evaluation of the sequencing assembly results, FPKM statistical analysis, and SSR analysis.
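The Q20, Q30 and GC statistics defined above follow from the Phred scale, in which a quality score Q corresponds to an error probability of 10^(-Q/10). The sketch below illustrates the calculation on a single toy read; it assumes Phred+33 quality encoding (an assumption, since the encoding is not stated in the text), and the function names and toy read are hypothetical.

```python
def phred_error_probability(q):
    """Error probability implied by a Phred quality score q."""
    return 10 ** (-q / 10)


def decode_qualities(quality_string, offset=33):
    """Convert a FASTQ quality string to Phred scores.

    The Phred+33 offset is an assumption of this sketch.
    """
    return [ord(ch) - offset for ch in quality_string]


def fraction_at_least(qualities, threshold):
    """Fraction of bases whose quality score is >= threshold (e.g. Q20)."""
    if not qualities:
        raise ValueError("qualities must not be empty")
    return sum(1 for q in qualities if q >= threshold) / len(qualities)


def gc_fraction(sequence):
    """Fraction of G and C bases among all bases of the sequence."""
    sequence = sequence.upper()
    if not sequence:
        raise ValueError("sequence must not be empty")
    return sum(1 for base in sequence if base in "GC") / len(sequence)


# Toy read (hypothetical): qualities 'I', 'I', '5', '+' decode to 40, 40, 20, 10.
toy_quals = decode_qualities("II5+")
toy_q20 = fraction_at_least(toy_quals, 20)
toy_q30 = fraction_at_least(toy_quals, 30)
toy_gc = gc_fraction("ACGT")
```

A score of 20 thus corresponds to an error rate of 1% and a score of 30 to 0.1%, matching the thresholds quoted for Q20 and Q30 in the results.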

Sequence annotation
The Jerusalem artichoke unigene sequences were compared against the protein databases (E value <1e-5) with the Blastx alignment tool. Functional annotation was performed according to sequence similarity: the protein with the highest sequence similarity to a given unigene was selected, thus obtaining the relevant functional annotation information. All annotation details were merged to summarize the information retrieved in this process. The databases used were Nr (non-redundant protein database), Swiss-Prot (Swiss-Prot protein database), KOG (Clusters of orthologous groups for eukaryotic complete genomes), GO (Gene Ontology) and KEGG (Kyoto Encyclopedia of Genes and Genomes). In addition, gene annotation was performed in the latter to identify key genes involved in the biosynthetic pathway of fructan in Jerusalem artichoke.

Expressed sequence tag-simple sequence repeat identification
Unigenes in the Jerusalem artichoke transcriptome data were searched using the microsatellite identification tool (MISA), available online at http://pgrc.ipk-gatersleben.de/misa. The search criteria were a minimum repeat number of 6 for dinucleotides, 5 for trinucleotides, 4 for tetranucleotides, 4 for pentanucleotides, and 4 for hexanucleotides, and the SSR locus interval was set to a minimum of 100 bp. The text file generated was imported into Excel for basic statistical analysis. SSR occurrence frequency = number of SSR-containing unigene sequences / total number of unigene sequences; SSR distribution frequency = number of SSRs / total number of unigene sequences; average distance of SSR distribution = total unigene length / number of SSRs.
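The repeat thresholds and frequency formulas above can be illustrated with a small regular-expression search for perfect SSRs. This is a simplified sketch, not MISA itself: it ignores the 100 bp interval setting and compound SSRs, and the function names and toy sequences are hypothetical.

```python
import re

# Minimum repeat number per motif length, as listed in the search criteria.
MIN_REPEATS = {2: 6, 3: 5, 4: 4, 5: 4, 6: 4}


def is_reducible(motif):
    """True if the motif is a repetition of a shorter unit, e.g. 'AA' or 'ATAT'."""
    n = len(motif)
    return any(n % k == 0 and motif == motif[:k] * (n // k) for k in range(1, n))


def find_ssrs(sequence):
    """Return sorted (start, motif, repeat_count) tuples for perfect SSRs."""
    sequence = sequence.upper()
    found = []
    for size, min_rep in MIN_REPEATS.items():
        # A motif of `size` bases followed by at least min_rep - 1 more copies.
        pattern = re.compile(r"([ACGT]{%d})\1{%d,}" % (size, min_rep - 1))
        pos = 0
        while True:
            match = pattern.search(sequence, pos)
            if match is None:
                break
            motif = match.group(1)
            if is_reducible(motif):
                # Skip homopolymers and motifs built from a shorter unit,
                # then retry one base further along.
                pos = match.start() + 1
                continue
            repeats = (match.end() - match.start()) // size
            found.append((match.start(), motif, repeats))
            pos = match.end()
    return sorted(found)


def ssr_distribution_frequency(n_ssr, n_unigenes):
    """SSR distribution frequency = number of SSRs / number of unigenes."""
    return n_ssr / n_unigenes


toy_di = find_ssrs("TT" + "AG" * 6 + "CC")
toy_tri = find_ssrs("ACC" * 5)
toy_short = find_ssrs("AG" * 5)          # below the dinucleotide threshold
toy_homopolymer = find_ssrs("A" * 20)    # not a di- to hexanucleotide SSR
reported_frequency = round(100 * ssr_distribution_frequency(6635, 63089), 2)
```

Using the counts reported in this study (6635 SSR loci in 63,089 unigenes), the distribution frequency formula reproduces the 10.52% quoted in the Discussion.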

Quantitative polymerase chain reaction analysis
The test material was "Qingyu No.1", the main cultivar of the Qinghai-Tibet Plateau. Planting occurred in the middle of April 2016 in the experimental field of the Academy of Agriculture and Forestry Sciences, Qinghai University, Xining, Qinghai. The normal cultivation method was implemented, with a row spacing of 1 m and a plant spacing of 80 cm. Tubers were sampled a total of 4 times, from tuber expansion until the plants withered, with 3 replicates per sampling. Immediately after sampling, the material was quick-frozen in liquid nitrogen and then stored at −80°C. RNA was extracted using the TIANGEN RNAprep Pure Plant Total RNA Extraction Kit (DP432), and RNA integrity was assessed by electrophoresis on a 1.2% agarose gel in freshly prepared 0.5× TBE at 150 V for 15 min. Reverse transcription was carried out with the TaKaRa reverse transcription kit PrimeScript™ RT Master Mix (Perfect Real Time) (RR036B) according to the manufacturer's instructions. The RNA sample amount was 1 μg, and the reaction system was 20 μL. The reverse-transcribed cDNA sample was then stored at −20°C for later use. We performed real-time qPCR on a 96-well reaction plate with an Applied Biosystems StepOnePlus instrument using TaKaRa's SYBR Premix Ex Taq™ (Tli RNaseH Plus) kit (RR420B). The internal reference was the 25S ribosomal RNA of Jerusalem artichoke, with primers F: CTGTCTACTATCCAGCGAAACCA and R: AGGGCTCCCACTTATCCTACAC. The total volume of the quantitative PCR reaction was 20 μL, containing 10 μL of SYBR Premix, 0.6 μL of the upstream and downstream primers, 3.4 μL of ddH2O, and 5 μL of 0.1× cDNA. The reaction conditions of qPCR were: pre-denaturation at 95°C for 30 s, followed by 40 cycles of 95°C for 5 s and 60°C for 30 s. The fluorescence signal was collected during the extension phase at 60°C to obtain the cycle threshold (Ct) of the different genes. 
At the end of cycling, a melting curve was measured to evaluate the specificity of the PCR product. The melting curve program was 95°C for 15 s, 60°C for 1 min, and 95°C for 15 s. During quantification, each sample had 3 biological replicates and 3 technical replicates, and each 96-well reaction plate included 3 primer-free negative controls. The relative expression of the genes was calculated using the 2^(−ΔΔCT) method [44]. The HeatMap Tools at http://www.omicshare.com/tools/home/soft/heatmap/ were used to draw a heatmap with parameters set to: cluster_rows: TRUE, cluster_cols: TRUE, colors: green, black, red.
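The relative expression calculation can be written out explicitly. In the sketch below, ΔCt is the Ct of the target gene minus the Ct of the reference gene (here the 25S rRNA), and ΔΔCt is the ΔCt of a sample minus the ΔCt of a calibrator sample. The function names, the choice of calibrator and all Ct values are hypothetical; the method also assumes near-100% amplification efficiency for both primer pairs.

```python
def mean(values):
    """Arithmetic mean, used to average technical replicate Ct values."""
    values = list(values)
    if not values:
        raise ValueError("values must not be empty")
    return sum(values) / len(values)


def relative_expression(ct_target, ct_reference,
                        ct_target_calibrator, ct_reference_calibrator):
    """Fold change of a target gene by the 2^(-ddCt) method.

    All arguments are Ct values (already averaged over technical replicates).
    """
    delta_sample = ct_target - ct_reference
    delta_calibrator = ct_target_calibrator - ct_reference_calibrator
    delta_delta = delta_sample - delta_calibrator
    return 2 ** (-delta_delta)


# Hypothetical Ct values: three technical replicates per measurement.
sample_target = mean([22.0, 22.1, 21.9])
sample_reference = mean([12.0, 12.0, 12.0])
calibrator_target = mean([24.0, 24.0, 24.0])
calibrator_reference = mean([12.0, 12.0, 12.0])

fold_change = relative_expression(sample_target, sample_reference,
                                  calibrator_target, calibrator_reference)
```

In this toy case the target reaches threshold two cycles earlier in the sample than in the calibrator at equal reference Ct, which corresponds to roughly a 4-fold higher relative expression.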

Abbreviations
GO: Gene Ontology; KEGG: Kyoto Encyclopedia of Genes and Genomes; KOG: Clusters of orthologous groups for eukaryotic complete genomes; SSR: Simple sequence repeat