Genetic diversity and structure of tea plant in Qinba area in China by three types of molecular markers

Background Qinba area has a long history of tea planting and is one of the northernmost regions in China where Camellia sinensis L. is grown. To provide basic data for the selection and optimization of molecular markers for tea plants, 118 markers, including 40 EST-SSR, 40 SRAP and 38 SCoT markers, were used to evaluate the genetic diversity and assess the population structure of 50 tea plant (Camellia sinensis) samples collected from Qinba tea germplasm.
Results In this study, a total of 414 alleles were obtained using 38 SCoT primers, with an average of 10.89 alleles per primer. The percentage of polymorphic bands (PPB), polymorphism information content (PIC), resolving power (Rp), effective multiplex ratio (EMR), average band informativeness (Ibav), and marker index (MI) were 96.14%, 0.79, 6.71, 10.47, 0.58, and 6.07, respectively. A total of 338 alleles were amplified with 40 pairs of SRAP primers (8.45 per primer), with PPB, PIC, Rp, EMR, Ibav, and MI values of 89.35%, 0.77, 5.11, 7.55, 0.61, and 4.61, respectively. Furthermore, 320 alleles were detected using 40 EST-SSR primers (8.00 per primer), with PPB, PIC, Rp, EMR, Ibav, and MI values of 94.06%, 0.85, 4.48, 7.53, 0.56, and 4.22, respectively. These results indicated that SCoT markers had higher efficiency. The Mantel test was used to compare the genetic distance matrices generated by EST-SSRs, SRAPs and SCoTs. The correlation between the genetic distance matrices based on EST-SSR and SRAP was the lowest (r = 0.01), followed by SCoT and SRAP (r = 0.17), then by SCoT and EST-SSR (r = 0.19). The 50 tea samples were divided into two sub-populations using STRUCTURE, the Neighbor-joining (NJ) method and principal component analysis (PCA). The results produced by STRUCTURE were completely consistent with the PCA results. Furthermore, there was no obvious relationship between sub-population membership and geographical origin.
Conclusion Among the three types of markers, SCoT markers have many advantages in terms of number of polymorphic bands (NPB), PPB, Rp, EMR, and MI. Nevertheless, the values of PIC showed a different trend, with the highest values generated with EST-SSR, followed by SCoT and SRAP. The average band informativeness likewise did not favor SCoT, being highest for SRAP. The correlations between the genetic distances produced by the three marker types were very small; thus, it is not recommended to use a single marker type to evaluate genetic diversity and population structure. Instead, a combination of different types of molecular markers should be used. It also seems crucial to screen out, for each type of molecular marker, core markers for Camellia sinensis. This study revealed that genes of exotic plant varieties have been constantly integrated into the gene pool of Qinba area tea. A low level of genetic diversity was observed, as shown by an average coefficient of genetic similarity of 0.74.


Background
Evaluation of genetic diversity and population structure has significant implications for genetic improvement in plant breeding. It is well established that the genetic basis of biological organisms is encoded in the genome sequence, and that base-pair substitutions, insertions, deletions, and other alterations give rise to genetic diversity; the diversity of organisms is manifested through phenotypic, chromosomal and proteomic differences. DNA molecular markers, having stable performance, high polymorphism and other useful properties, are increasingly employed in taxonomic, evolutionary, breeding, and cloning studies. Different molecular markers, and different primers for the same marker, may amplify distinct regions of the genome. Theoretically, the more polymorphic markers are used, the wider the amplified regions, the better the coverage of the entire genome, and the more accurate the results.
EST-SSR (Expressed Sequence Tag-Simple Sequence Repeat) molecular markers have been widely used in many species and for many applications, such as genetic linkage mapping, comparative mapping, and evaluation of genetic diversity [1][2][3][4][5]. SRAP (Sequence-related amplified polymorphism) was first used on Brassica in 2001 by Li G [6]. Genetic diversity and population structure analyses of Camellia sinensis by SRAP [7][8][9][10][11][12] have already been reported. The SCoT (Start codon targeted polymorphism) marker was designed according to the Kozak sequence pattern and was developed after the discovery that the sequences flanking the initiation codon ATG (+1, +2, +3) are conserved: positions +4, +7, +8, and +9 are occupied by nucleotides G, A, C, and C, respectively. These seven nucleotides are generally conserved. At positions −3, −6, and −9, G is the usual nucleotide. Primers can therefore be designed according to the conservation of the initiation sequence. The SCoT marker allows single-primer amplification of the region between two genes. Bertrand et al. first applied this marker to Oryza sativa [13]. More recently, SCoT molecular markers have been used to assess the genetic diversity of plant species such as Saccharum spontaneum L. [14], Dactylis glomerata [15], Mangifera indica [16], Arachis hypogaea [17], Saccharum officinarum [18], Podocarpus macrophyllus [19] and Paeonia suffruticosa [20]. Nevertheless, no similar study has been conducted on Camellia sinensis. The tea plant is an allogamous species; theoretically, after prolonged spontaneous hybridization, the genetic background of the tea plant should be increasingly complex.
China is one of the main sources of tea germplasm. Currently, there are 1,100,000 ha of tea planting area, with different regions growing different types and varieties of tea according to topographic, soil, and climatic characteristics. Xinan, Huanan, Jiangnan, and Jiangbei are the four main tea planting districts in China. The Qinba area belongs to the Jiangbei district. In this research, 50 tea varieties, including those collected from different districts, common tea plant species, and local species in the Qinba area, were genotyped with EST-SSR, SRAP, and SCoT markers. We constructed a dataset for each of the three marker types and used these datasets for diversity analysis, marker efficiency analysis, and correlation analysis among the marker systems. Our study allowed the population structure to be established and provides insights into the selection of molecular markers for tea plant breeding.

Marker efficiency analysis
In this study, three types of molecular markers were used to differentiate tea plant accessions. A total of 1072 bands were produced using 118 primer pairs. After screening on six selected genotypes, 38 SCoT, 40 SRAP and 40 EST-SSR primers were selected for further study according to the percentage of polymorphic bands (PPB), the polymorphism information content (PIC) and the clarity of the bands (Table 1). A total of 414, 338, and 320 bands were obtained from the 50 test materials using SCoT, SRAP and EST-SSR markers, respectively; these included 398, 302, and 301 polymorphic bands, giving PPBs of 96.14%, 89.35%, and 94.06%. Comparisons of the three types of markers are shown in Table 2. SCoT markers have a higher marker efficiency and are excellent for the appraisal of polymorphic loci, except that their polymorphism information content is lower than that of EST-SSR.
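The summary figures in this section can be cross-checked from the reported band counts. The short Python sketch below recomputes bands per primer, PPB, EMR and MI for each marker type. It assumes that EMR is the number of polymorphic bands per primer and that MI is the product of EMR and Ibav; the Ibav values are taken from the abstract rather than recomputed, and the function and variable names are illustrative only.

```python
# Reported values per marker type:
# (number of primers, total bands, polymorphic bands, Ibav from the abstract)
reported = {
    "SCoT": (38, 414, 398, 0.58),
    "SRAP": (40, 338, 302, 0.61),
    "EST-SSR": (40, 320, 301, 0.56),
}


def summarize(primers, total_bands, polymorphic_bands, ib_av):
    """Recompute summary statistics from band counts.

    Assumptions (not stated explicitly in this section):
    EMR = polymorphic bands per primer, MI = EMR * Ibav.
    """
    emr = polymorphic_bands / primers
    return {
        "bands_per_primer": total_bands / primers,
        "ppb": 100.0 * polymorphic_bands / total_bands,
        "emr": emr,
        "mi": emr * ib_av,
    }


summary = {name: summarize(*values) for name, values in reported.items()}

for name, stats in summary.items():
    print(
        f"{name}: {stats['bands_per_primer']:.2f} bands/primer, "
        f"PPB {stats['ppb']:.2f}%, EMR {stats['emr']:.2f}, MI {stats['mi']:.2f}"
    )

total_bands = sum(values[1] for values in reported.values())
print("total bands:", total_bands)
```

Under these assumptions the recomputed values agree with the reported ones to within rounding, and the three band totals sum to 1072.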
Correlation analysis among genetic distance matrices by three-types of marker dataset
Mantel tests [21] were used to measure the correlation between the genetic distance matrices generated by SCoT, SRAP and EST-SSR molecular markers. Values of r ≥ 0.9, 0.8 ≤ r < 0.9, 0.7 ≤ r < 0.8, and r < 0.7 represented significant correlation, moderate correlation, weak correlation, and no correlation, respectively. In the present study, the coefficients of correlation (r) between the genetic distance matrices of SCoT and EST-SSR markers, SCoT and SRAP markers, and SRAP and EST-SSR markers were 0.19, 0.17, and 0.01, respectively (Fig. 1). Different molecular markers, and different primers of the same marker, all yielded distinct amplification products reflecting the polymorphism of different genomic regions; hence, different marker design strategies produce different results. Theoretically, the validity of the results should improve with increasing numbers of markers and increasing coverage of the genome. Therefore, we combined the three types of molecular markers, generating 1072 bands, to perform the genetic constitution analyses.
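As an illustration of the Mantel comparison described above, the following Python sketch correlates the upper triangles of two distance matrices and estimates a permutation p-value by shuffling the accession order of one matrix. It is a generic Mantel test under stated assumptions, not the NTSYS-pc routine used in this study; the matrices `d_a` and `d_b` are made-up toy data.

```python
import random
from math import sqrt


def upper_triangle(matrix, order=None):
    """Flatten the upper triangle, optionally after reordering rows/columns."""
    n = len(matrix)
    idx = order if order is not None else list(range(n))
    return [matrix[idx[i]][idx[j]] for i in range(n) for j in range(i + 1, n)]


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def mantel(d1, d2, permutations=199, seed=1):
    """Return (r, one-sided permutation p-value) for two distance matrices."""
    rng = random.Random(seed)
    x = upper_triangle(d1)
    r_obs = pearson(x, upper_triangle(d2))
    n = len(d1)
    hits = 1  # the observed arrangement counts as one permutation
    for _ in range(permutations):
        order = list(range(n))
        rng.shuffle(order)
        if pearson(x, upper_triangle(d2, order)) >= r_obs:
            hits += 1
    return r_obs, hits / (permutations + 1)


# Toy distance matrices for five hypothetical accessions.
d_a = [
    [0.0, 0.2, 0.5, 0.7, 0.6],
    [0.2, 0.0, 0.4, 0.6, 0.7],
    [0.5, 0.4, 0.0, 0.3, 0.5],
    [0.7, 0.6, 0.3, 0.0, 0.2],
    [0.6, 0.7, 0.5, 0.2, 0.0],
]
d_b = [[2.0 * value for value in row] for row in d_a]  # perfectly correlated

r, p = mantel(d_a, d_b)
print(f"r = {r:.2f}, p = {p:.3f}")
```

With real data, each matrix would hold the pairwise genetic distances among the 50 accessions computed from one marker type.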

Genetic constitution analysis
Analysis using STRUCTURE
One thousand seventy-two polymorphic bands with MAF (minor allele frequency) < 5% were used to elucidate the population structure of the entire pool of tea germplasm. In this study, STRUCTURE 2.3.4, which applies a Bayesian clustering algorithm, was used to simulate the population genetic structure based on the assumption that the 1072 loci were independent. To estimate the most probable number of populations, K values from 1 to 10 were simulated with 20 replicate runs for each K, each run using a burn-in period of 10,000 followed by 10,000 Markov chain Monte Carlo (MCMC) iterations; a membership probability threshold of 0.60 was used. Delta K was plotted against K; the best number of clusters was determined following the method proposed by Evanno et al. [22] and obtained via the Structure Harvester platform (http://taylor0.biology.ucla.edu/structureHarvester/). Delta K reached its maximum at K = 2, suggesting that the 50 tea accessions were best divided into two subgroups (Fig. 2).
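The Delta K criterion used to choose K can be sketched in a few lines of Python. The version below is a simplified reading of the Evanno method: it takes the second-order difference of the mean log-likelihoods across replicate runs and divides by the standard deviation at each K. The log-likelihood values in `lnp_runs` are invented for illustration and are not results from this study.

```python
from statistics import mean, stdev


def delta_k(lnp_runs):
    """Compute Delta K for each interior K.

    lnp_runs maps K -> list of ln P(D) values from replicate runs.
    Simplification (assumption): the second-order difference is taken on
    the per-K means, |L(K+1) - 2 L(K) + L(K-1)|, then divided by sd[L(K)].
    """
    ks = sorted(lnp_runs)
    means = {k: mean(lnp_runs[k]) for k in ks}
    sds = {k: stdev(lnp_runs[k]) for k in ks}
    result = {}
    for k in ks:
        if k - 1 not in means or k + 1 not in means:
            continue  # Delta K is undefined at the smallest and largest K
        if sds[k] == 0:
            continue  # avoid division by zero when all runs are identical
        second_diff = abs(means[k + 1] - 2 * means[k] + means[k - 1])
        result[k] = second_diff / sds[k]
    return result


# Hypothetical ln P(D) values, three replicate runs per K.
lnp_runs = {
    1: [-5000.0, -5002.0, -4998.0],
    2: [-4500.0, -4501.0, -4499.0],
    3: [-4480.0, -4490.0, -4470.0],
    4: [-4470.0, -4500.0, -4440.0],
    5: [-4465.0, -4505.0, -4425.0],
}

dk = delta_k(lnp_runs)
best_k = max(dk, key=dk.get)
for k in sorted(dk):
    print(f"K = {k}: Delta K = {dk[k]:.2f}")
print("best K:", best_k)
```

In this toy example the likelihood rises sharply from K = 1 to K = 2 and then plateaus with growing variance, so Delta K peaks at K = 2, the same qualitative pattern that supports a two-group solution.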

UPGMA clustering
A dendrogram was constructed by cluster analysis using the unweighted pair-group method with arithmetic means (UPGMA), which showed that the 50 genotypes could be clearly divided into 2 groups (Fig. 3). Group I included 27 varieties, and group II contained 23 varieties. The average similarity coefficient was 0.74. The two most closely related accessions were 15 and 16, which are sister lines with a genetic similarity coefficient of 0.93.
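To make the clustering step concrete, here is a minimal pure-Python UPGMA sketch. It assumes that distances are obtained as 1 minus the similarity coefficient, and it uses a four-accession toy matrix (labels A to D are hypothetical); it is not the NTSYS-pc implementation used for Fig. 3.

```python
def upgma(dist, labels):
    """Cluster with UPGMA; return (nested-tuple tree, list of merges).

    dist is a symmetric matrix of pairwise distances between the labels.
    Each merge is recorded as (left subtree, right subtree, distance).
    """
    n = len(labels)
    clusters = {i: [i] for i in range(n)}       # cluster id -> member indices
    tree = {i: labels[i] for i in range(n)}     # cluster id -> subtree
    pair_dist = {(i, j): dist[i][j] for i in range(n) for j in range(i + 1, n)}
    next_id = n
    merges = []
    while len(clusters) > 1:
        (a, b), height = min(pair_dist.items(), key=lambda item: item[1])
        members = clusters[a] + clusters[b]
        # Average of all original pairwise distances to each other cluster.
        for c, other in clusters.items():
            if c in (a, b):
                continue
            total = sum(dist[i][j] for i in members for j in other)
            pair_dist[(c, next_id)] = total / (len(members) * len(other))
        pair_dist = {
            key: value for key, value in pair_dist.items()
            if a not in key and b not in key
        }
        merges.append((tree[a], tree[b], height))
        tree[next_id] = (tree[a], tree[b])
        clusters[next_id] = members
        del clusters[a], clusters[b]
        next_id += 1
    (root_id,) = clusters
    return tree[root_id], merges


# Toy similarity matrix; distance = 1 - similarity (assumption).
labels = ["A", "B", "C", "D"]
similarity = [
    [1.00, 0.93, 0.60, 0.60],
    [0.93, 1.00, 0.60, 0.60],
    [0.60, 0.60, 1.00, 0.80],
    [0.60, 0.60, 0.80, 1.00],
]
distance = [[1.0 - s for s in row] for row in similarity]

root, merges = upgma(distance, labels)
print(root)
for left, right, height in merges:
    print(left, right, round(height, 2))
```

The pair with the highest similarity (0.93 in the toy matrix) is joined first, mirroring how the two most similar accessions appear as sister lines in a dendrogram.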

Principal components analysis
The top three principal components were used to analyze population structure. Principal component analysis was conducted in NTSYS-pc2.10e [23]. The first three PCs had contribution rates of 15.97%, 8.50% and 6.17%. PCA separated the 50 genotypes into two major groups (Fig. 4), consistent with the STRUCTURE and UPGMA results. Group I consisted of 18 genotypes (Fig. 4, left), with the other 32 genotypes belonging to group II (Fig. 4, right). The analyses performed using STRUCTURE, UPGMA and PCA yielded similar results, clustering the 50 genotypes into 2 sub-populations. Of note, the PCA results were highly consistent with the STRUCTURE results. The results generated using UPGMA were slightly different from those using STRUCTURE and PCA (Table 3); bold numbers in group 1 by UPGMA mark the differences between the STRUCTURE and PCA results and the UPGMA results.

Conclusions
This is the first report of the use of SCoT markers to analyze the genetic diversity of tea germplasm. The results showed that SCoT markers revealed high genetic diversity among tea resources. In the future, we plan to select core SCoT markers. Different kinds of molecular markers can reveal different and complementary information about the same genome. Thus, we highly recommend using multiple marker types for a comprehensive evaluation of genetic diversity and structure. The 50 accessions were clustered into 2 sub-populations based on STRUCTURE, UPGMA and PCA; there were no obvious differences between imported and local germplasm. The genes of exotic varieties have been constantly integrated into the gene pool of Qinba tea through long-term (20-25 years) tea breeding and production activities. The selection of varieties with economically valuable traits was emphasized during breeding, resulting in the loss of some tea resources and a decrease in genetic diversity; thus, it is necessary to introduce new tea plant resources in order to broaden the genetic diversity.

Fig. 1 The correlation between the genetic distance matrices using Mantel tests (Table 4).

DNA extraction and marker genotyping
Genomic DNA was extracted from fresh leaves of each individual using a modified CTAB method and checked by 0.8% agarose gel electrophoresis. Each PCR reaction (15 μL in total) contained 2 × Taq Master Mix (7.5 μL), forward and reverse primers (1 μL each, 2 μL for SCoT primers), RNase-free water (3.5 μL), and tea genomic DNA (2 μL). To improve amplification, a two-stage annealing temperature program was used: initial denaturation at 94.0°C for 5 min; 10 cycles of denaturation at 94.0°C for 1 min, annealing at 60.0°C for 1 min, and extension at 72.0°C for 1 min; then 35 cycles of denaturation at 94.0°C for 30 s, annealing at 35°C for 30 s, and extension at 72.0°C for 1 min. A final extension of 10 min was followed by storage at 4.0°C.

Genetic variation and marker efficiency analysis
Following electrophoresis, each amplified band was taken to correspond to a primer hybridization locus and was considered an effective molecular marker. Each polymorphic band detected by a given primer represented an allelic variant. To generate the molecular data matrices, clear bands for each fragment were scored in every accession for each primer pair and recorded as 1 (presence of a fragment), 0 (absence of a fragment), or 9 (complete absence of bands). Excel was used to compute the marker index (MI) of the three types of markers, and the marker frequencies of the three types of markers were compared. MI was obtained as the product of the average band informativeness (Ibav) and the effective multiplex ratio (EMR):

$$\mathrm{MI} = \mathrm{Ib}_{av} \times \mathrm{EMR}$$

EMR represents the number of polymorphic loci per primer, and Ibav is given by the following formula:

$$\mathrm{Ib}_{av} = \frac{1}{n}\sum_{i=1}^{n}\left[1 - 2\left|0.5 - P_i\right|\right]$$

where P_i represents the proportion of samples in which the i-th amplified locus is present and n represents the total number of amplified loci. Using the method reported by Smith et al. [24], the polymorphism information content (PIC) was calculated with the formula:

$$\mathrm{PIC}_i = 1 - \sum_{j} P_{ij}^{2}$$

where PIC_i represents the PIC value of the i-th locus and P_ij represents the frequency of allele j at the i-th locus. The value of PIC varies from 0 to 1, with 0 indicating an absence of polymorphism at a given locus and values approaching 1 reflecting multiple alleles at a given locus. The level of polymorphism of each marker was assessed by the polymorphism information content (Botstein et al. [25]), which measures the extent of genetic variation: PIC values smaller than 0.25 indicate a low level of polymorphism at a locus, PIC values between 0.25 and 0.5 imply a moderate level of polymorphism, and PIC values greater than 0.5 indicate a high level of polymorphism.
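The indices defined in this section can be computed directly from a 1/0/9 score matrix. The Python sketch below assumes Ibav is the mean over all bands of 1 − 2|0.5 − P_i|, EMR is the number of polymorphic bands per primer, MI is their product, and PIC is 1 minus the sum of squared allele frequencies; scores of 9 are excluded when estimating band frequencies, which is an assumption about how such entries are handled. The score matrix and primer count are toy values.

```python
MISSING = 9  # score used for a complete absence of bands


def band_frequencies(matrix):
    """Frequency of band presence per column, ignoring MISSING scores."""
    freqs = []
    for column in zip(*matrix):
        scored = [value for value in column if value != MISSING]
        freqs.append(sum(scored) / len(scored))
    return freqs


def marker_metrics(matrix, n_primers):
    """Return PPB, Ibav, EMR and MI for one marker type.

    Rows are accessions, columns are bands. Assumed definitions:
    Ibav = mean(1 - 2*|0.5 - p_i|), EMR = polymorphic bands / primers,
    MI = EMR * Ibav.
    """
    freqs = band_frequencies(matrix)
    polymorphic = [p for p in freqs if 0.0 < p < 1.0]
    ib_av = sum(1.0 - 2.0 * abs(0.5 - p) for p in freqs) / len(freqs)
    emr = len(polymorphic) / n_primers
    return {
        "ppb": 100.0 * len(polymorphic) / len(freqs),
        "ib_av": ib_av,
        "emr": emr,
        "mi": emr * ib_av,
    }


def pic(allele_frequencies):
    """PIC of one locus: 1 minus the sum of squared allele frequencies."""
    return 1.0 - sum(f * f for f in allele_frequencies)


# Toy score matrix: 4 accessions x 4 bands, produced by 2 hypothetical primers.
scores = [
    [1, 1, 1, 0],
    [1, 0, 1, 9],
    [1, 1, 1, 1],
    [1, 0, 0, 0],
]

metrics = marker_metrics(scores, n_primers=2)
print({name: round(value, 3) for name, value in metrics.items()})
print("PIC for allele frequencies 0.5/0.3/0.2:", round(pic([0.5, 0.3, 0.2]), 2))
```

A locus with a single allele gives a PIC of 0, and the value rises toward 1 as more alleles occur at similar frequencies, which is why multi-allelic markers can exceed the 0.5 threshold used to indicate high polymorphism.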
Correlation analysis among genetic distance matrices by three-types of marker dataset
The Mantel test was carried out with the batch file of the NTSYS-pc2.10e software.
Genetic constitution analysis
STRUCTURE v2.3.4 was used to assess the population structure of the 50 tea genotypes with 1072 loci. The number of sub-populations (K) was set from 1 to 10 under the admixture model with correlated band frequencies.
Genetic similarity coefficients were computed using the SM (simple matching) function of the NTSYS-pc2.10e software, cluster analysis was conducted using the UPGMA method, and principal component analysis was performed using the batch file of the NTSYS-pc2.10e software.
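For readers unfamiliar with the simple matching coefficient, the following Python sketch computes it for 1/0 band profiles as the fraction of compared positions at which two accessions agree (shared presence or shared absence). Positions scored 9 in either accession are skipped, which is an assumption and may differ from how NTSYS-pc treats such entries; the three profiles are toy data.

```python
MISSING = 9  # score used for a complete absence of bands


def simple_matching(x, y):
    """Simple matching coefficient between two band profiles."""
    pairs = [(a, b) for a, b in zip(x, y) if a != MISSING and b != MISSING]
    if not pairs:
        raise ValueError("no positions scored in both profiles")
    return sum(a == b for a, b in pairs) / len(pairs)


def similarity_matrix(profiles):
    """Pairwise simple matching similarities for a list of profiles."""
    return [[simple_matching(x, y) for y in profiles] for x in profiles]


def mean_pairwise(sim):
    """Mean of the off-diagonal (upper-triangle) similarities."""
    n = len(sim)
    values = [sim[i][j] for i in range(n) for j in range(i + 1, n)]
    return sum(values) / len(values)


# Toy band profiles for three hypothetical accessions.
profiles = [
    [1, 0, 1, 1, 0, 1],
    [1, 0, 1, 1, 0, 0],
    [0, 1, 1, 0, 9, 1],
]

sim = similarity_matrix(profiles)
for row in sim:
    print([round(value, 2) for value in row])
print("mean similarity:", round(mean_pairwise(sim), 3))
```

The resulting similarity matrix is the kind of input that UPGMA clustering takes, and its mean off-diagonal value corresponds to an average similarity coefficient such as the 0.74 reported for the 50 accessions.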