Genetic variation, population structure and linkage disequilibrium in Switchgrass with ISSR, SCoT and EST-SSR markers

Background To evaluate genetic variation, population structure, and the extent of linkage disequilibrium (LD), 134 switchgrass (Panicum virgatum L.) samples were analyzed with 51 markers, including 16 ISSRs, 20 SCoTs, and 15 EST-SSRs. Results In this study, a high level of genetic variation was observed in the switchgrass samples and they had an average Nei’s gene diversity index (H) of 0.311. A total of 793 bands were obtained, of which 708 (89.28 %) were polymorphic. Using a parameter marker index (MI), the efficiency of the three types of markers (ISSR, SCoT, and EST-SSR) in the study were compared and we found that SCoT had a higher marker efficiency than the other two markers. The 134 switchgrass samples could be divided into two sub-populations based on STRUCTURE, UPGMA clustering, and principal coordinate analyses (PCA), and upland and lowland ecotypes could be separated by UPGMA clustering and PCA analyses. Linkage disequilibrium analysis revealed an average r2 of 0.035 across all 51 markers, indicating a trend of higher LD in sub-population 2 than that in sub-population 1 (P < 0.01). Conclusions The population structure revealed in this study will guide the design of future association studies using these switchgrass samples.


Background
Genetic diverstiy is a significant factor that contributes to crop improvement. Evaluation of genetic variation in contemporary germplasm through breeding programs may be indirectly favorable for genetic progress in future cultivars [1]. Thus, estimation of plant diversity is crucial for the efficacious use of genetic resources in breeding programs. Molecular markers, as particular segments of DNA that represent different functional classes, play an essential role in all aspects of plant breeding, and have been widely used to estimate genetic variation.
Compared with conventional phenotyping methods, molecular markers have numerous advantages as they are easily detectable and stable in plant tissues regardless of environmental influences [2]. The inter simple sequence repeat marker (ISSR) is highly polymorphic and is useful in studies of genetic diversity, genome mapping and evolutionary biology [3]. This PCR-based technique is used in various types of plants and can overcome many defects of other marker methods, such as high-cost of amplified fragment length (AFLP) and the low reproducibility of random amplified polymorphic DNA (RAPD) [4]. Start codon targeted marker (SCoT) is a reliable and simple gene-targeted marker located on the translational start codon [5]. This technique involves designing single primers from the short conserved region flanking the ATG start codon [6] without knowing any further genomic sequence information. It has been used in peanut and mango crops for genetic diversity and cultivar relationship analysis [7]. Expressed sequence tag-simple sequence repeats marker (EST-SSR) detects variation based on the expressed portion of the genome from EST databases, thus explaining the low cost of development compared with the genomic simple sequence repeat marker (SSR) [8]. These EST-SSR primers can be used across various species for comparative mapping and the construction of genetic linkage maps [9,10]. Each marker type has unique advantages and these three marker systems have found extensive application in the evaluation of genetic variation, population structure, and assisted selection for crop improvement [3,[11][12][13][14]. Many studies have shown that these markers are mainly used to develop genetic linkage maps [15,16], however, fewer studies have focused on constructing linkage disequilibrium (LD) maps. Remarkably, LD and linkage are two different genetic terms, where LD refers to correlation between alleles in a population, while LD means the correlated inheritance of loci through physical connection on a chromosome [17]. Some factors can affect the LD level, including allele frequency and recombination. Unlike linkage analysis, LD mapping relies on a natural population which is used to identify the relationships between genetic and phenotypic variation. LD mapping, that is association analysis, represents a useful tool to identify trait-marker relationships, and the first LD mapping of a quantitative trait was the analysis of flowering time and the dwarf8 gene in maize [18].
Linkage disequilibrium (LD), referring to the nonrandom association of alleles between linked or unlinked loci, is the basis of association mapping to identify genetic regions associated with agronomic traits [17]. Recently, LD studies have been performed in various plants, such as rice (Oryza sativa L.) [19], barley (Hordeum vulgare L.) [20], Maize (Zea mays L.) [21], chickpea (Cicer arietinum L.) [22], perennial ryegrass (Lolium perenne L.) [23], and the model legume, Medicago truncatula [24]. The level of LD is constantly regarded as a standard to reflect mapping resolution. Association mapping in populations with low LD requires a high number of markers, whereas a high LD means low mapping resolution [25]. In addition, information about population structure within germplasm collections is also crucial for the interpretation and identification of associations between genetic and functional diversity, and to assess whether the inter-sample relatedness is suitable for association studies [26][27][28]. Therefore, population structure is also included as an effect in models used for association analysis. [15,29].
Switchgrass (Panicum virgatum L.), as a warm season C 4 perennial grass that is native to North America [30], is regarded as an important biofuel crop for its remarkable biomass yield and good adaptability on marginal lands thereby not competing with food crops on farmland [31][32][33]. In this study, we explored two distinct forms of switchgrass, upland and lowland ecotypes. The upland accessions are distributed in northern cold areas with lower biomass than lowland varieties. Generally, upland switchgrass is shorter (≤2.4 m, tall) than lowland types (≥ 2.7 m) in favorable environments. However, lowland cultivars appear more sensitive to moisture stress than upland cultivars [34].
Constructing association maps comparing the physiological and genetic basis of varying stresses can provide an available reference for the genetic improvement of switchgrass, and the evaluation of the level of LD and population structure can aid association analyses. To date, however, LD analysis across the switchgrass genome remains inadequate [35]. In our study, we present 134 switchgrass accessions supplied by Plant Genetic Resources Conservation Unit, Griffin, Georgia USA to identify the levels of genetic variation, population structure, and extent of LD using 51 markers including 16 ISSRs, 20 SCoTs, and 15 EST-SSRs. These results will provide a valuable molecular basis for enriching switchgrass genetic variation, and the information on the level of LD and population structure may guide association mapping using this representative collection.
Here we constructed a three-marker molecular dataset with important applications for diversity analysis, establishment of population structure and evaluation of linkage disequilibrium in switchgrass which is an allogamous species.
These three marker systems (ISSR, SCoT, and EST-SSR) have been used for cultivar identification and genetic variation assessment in many plant species [36][37][38][39]. In this study, these markers were successfully used to differentiate switchgrass accessions. A total of 51 primer pairs were used and 793 bands were produced, with a mean of 15.5 bands per primer, among which 89.28 % were polymorphic. Our results suggested that ISSR, SCoT, and EST-SSR analyses could contribute to the detection of genetic variation. In addition, Nei's (1973) gene diversity index (H) and Shannon's information index (I) was 0.311 and 0.471, respectively, and the similarity coefficient, ranging from 0.162 to 0.857 with an average of 0.510 was similar to other studies on switchgrass, in which the similarity coefficients were estimated to be between 0.45 to 0.81 [40] or 0.53 to 0.78 [41]. This indicates that switchgrass has abundant genetic variation and is a highly heterogenous species [42]. The AMOVA of the distance matrix for the genotypes permitted a partitioning of the overall variation into two levels: between upland and lowland ecotypes and within a population. The results revealed genetic differentiation between upland and lowland ecotypes (P < 0.001), with 31.42 % of genetic variation between ecotypes and 68.58 % of genetic variation within ecotypes. Similar results were obtained in other switchgrass germplasm collections [40,43,44] and in other perennial, and cross-pollinated plants [45].

Marker efficiency analysis
In this study, we extracted genomic DNA from an individual so that we were able to obtain complete genetic information including allele numbers, gene frequency and observed heterozygosity for marker efficiency analysis. A parameter marker index (MI) was used to compare the efficiencies of the three assays in the collection of 134 switchgrass genotypes (Table 2). There was almost no disparity between the average band informativeness (Ib av ) indice for ISSRs, SCoTs, and EST-SSRs, which were 0.38, 0.43, and 0.36, respectively. However, the effective multiplex ratio (EMR) index for ScoT (20.10) was twice as high as that of the ISSRs (12.25) and three times as high as that of the EST-SSRs (7.33).
The MI calculation indicated an efficient and distinctive nature of the SCoTs with the MI for these markers (8.64) higher than the other two assays examined here (4.66 for ISSRs and 2.64 for EST-SSRs). A parameter MI, has been widely used to evaluate the overall utility of each marker system [46]. The high MI in the SCoTs results from its high EMR, making these markers appropriate for fingerprinting [47] or evaluating genetic variation in breeding populations [48,49]. In addition, the SCoTs performed well in other species. Compared with ISSR and inter-retrotransposon amplified polymorphism (IRAP), SCoT markers were more informative than IRAP and ISSR for the assessment of diversity among Persian oak (Quercus brantii Lindl.) individuals [50]. Results from the evaluation on the genetic variation of mango (Mangifera indica L.) cultivars indicated that the SCoT analysis represents actual relationships better than the ISSR analysis [51].

Population structure analysis
After removing low frequency bands (considering MAF ≤ 0.05), we analyzed the data from 51 pairs of Table 1 The ISSR, SCoT, and EST-SSR primers used in this study and amplification results (Continued) ISSR, SCoT, and EST-SSR primers to understand the population structure of the entire switchgrass collection based on a Bayesian clustering approach using STRUC-TURE [52]. The number of subpopulations (K) was identified based on maximum likelihood and ΔK values. For the 134 switchgrass genotypes the maximum ΔK was observed at K = 2 ( Fig. 1), with genotypes falling into two subpopulations. Using a membership probability threshold of 0.75, 76 genotypes were assigned to subpopulation 1 (G1), out of which, 69 genotypes belonged to upland ecotypes, and the remaining 7 were lowland. Subpopulation 2 (G2) contained 42 genotypes, and all of them were upland ecotypes. The remaining 16 genotypes were classified into an admixed group as they had membership probabilities lower than 0.75 in any given subpopulation. With the maximum membership probability, 91 accessions were assigned to G1 and 43 accessions to G2 (Fig. 2). The UPGMA cluster analysis from 51 markers generated a dendrogram, demonstrating that the 134 genotypes could be clearly divided into two groups (Fig. 3).
The dendrogram clustered all of the lowland ecotypes (LL) into the first. The second group contained all of the upland ecotypes (UL). Other methods have also been used to cluster upland and lowland switchgrass ecotypes. Missaoui et al adopted restriction fragment length polymorphism (RFLP) markers to analyze the genetic relationships among 21 switchgrass genotypes, resulting in three upland and eighteen lowland genotypes clusteringinto two different groups [53]. Huang et al identified differences between the coding sequences of a nuclear gene encoding plastid acetyl-CoA carboxylase in upland and lowland ecotypes genetic variation analysis at gene level, provided by Huang et al researching about a nuclear gene encoding plastid acetyl-CoA carboxylase [54]. In this study, we preliminarily presented population structure analysis of 7 lowland and 127 upland genotypes using 51 ISSR, SCoT, and EST-SSR primer pairs, resulting in an apparently separate cluster among the two ecotypes, confirming the genetic differences between upland and lowland ecotypes. However, as we do not have as many lowland switchgrass samples as upland, we highly recommend more lowland ecotype or other nuclear markers should be used in conjunction with ISSR, SCoT and EST-SSR to more appropriately classify upland and lowland ecotypes.
Based on modified Rogers distances (MRD), PCA separated the 134 genotypes into two major groups, which was consistent with assignments generated by STRUC-TURE and the UPGMA dendrogram (Fig. 4). Seven genotypes formed group 1 (Fig. 4, upper right), and the other 127 genotypes, belonging to group 2, were mainly distributed at the lower portion of the plot. The accessions belonging to G1 inferred by the STRUCTURE analysis were all distributed on the right portion of the  resulting plot, while G2 was distributed on the left portion of the plot. The distribution of G1 accessions was less tightly clustered than G2, indicating accessions in G1 had higher diversity than G2 (Fig. 4). Before analyzing LD and association mapping, the analysis of population structure emphasizes the need for the genetic analysis of different ecotypes [28]. The UPGMA cluster and PCA analysis demonstrated that 134 genotypes could be clearly divided into two groups (Figs. 1 and 4), and the lowland and upland germplasm clusters were almost completely separated, which was consistent with the results of several other switchgrass studies [41,55,56]. For the UPGMA cluster analysis, the first group only included lowland ecotypes, while the second group contained upland ecotypes and could be further classified into two subgroups. Subgroup 1 (G1) contained 83 genotypes, while the remaining 43 belonged to subgroup 2 (G2). The 46 accessions of the 70SG series and 42 accessions of the 71SG series dispersed into these two subgroups are from the same geographical distribution of North Dakota, United States. This indicates that most of the germplasm sub-clustered in accordance with different regions [43,55], and the assignment of 132 accessions (98.51 % of the total) by the UPGMA cluster analysis was consistent with their classification using PCA (Fig. 4). Unexpectedly, in the STRUCTURE analysis, the 127 upland genotypes were assigned to two subpopulations, possibly because the UPGMA and STRUCTURE programs calculate parameters in different ways. Clusters are generated in STRUCTURE based on both transitory Hardy-Weinberg disequilibrium and LD caused by admixture between populations [55], while the UPGMA dendrogram generates clusters based on the genetic distance among populations [57].

Linkage disequilibrium estimation
After the deletion of low frequency alleles (MAF ≤ 5 %), the 51 ISSRs, SCoTs, and EST-SSRs with unknown chromosome information were used to evaluate the extent of LD among the switchgrass samples. In the collection, interallelic r 2 values, the association between any pair of alleles from different loci, were calculated and ranged from 0.000 to 1.000 with an average r 2 of 0.035. Across all 51 loci, 247,456 locus pairs were detected in the 134 switchgrass samples. Among all of the locus pairs, 7107 of 135,718 (5.24 %) showed LD at the P < 0.001 level for G1 and 5415 locus pairs (3.99 %) were found at r 2 > 0.1 at P < 0.001. For G2, 84,154 locus pairs were detected, 4833 were significant pairs (P < 0.001, 5.74 %), while 4235 locus pairs (5.03 % of 84,154) were found at r 2 > 0.1 at P < 0.001. The mean r 2 for all materials was 0.480 (P < 0.001), and the LD in G2 (0.668, ranging from 0.068 to 1.000) was significantly (P < 0.001) larger than that in G1 (0.291, ranging from 0.066 to 1.000) (P < 0.01).
Populations with high levels of outcrossing have relatively low LD [58]. Among outcrossing maize (Zea mays L.), Remington et al. [59] found lower levels of LD among 47 SSR loci (9.7 % of SSR pairs performing LD at P < 0.01), compared to LD data from an SSR survey of inbred lines of maize, which showed high levels of LD [60]. For switchgrass, LD data comparisons showed a trend towards higher LD in G2 (mean r 2 = 0.668) including 42 genotypes all belonging to upland ecotypes, compared with G1 (mean r 2 = 0.291), which contained 76 genotypes, including 7 lowland ecotypes.

Plant material
A total of 134 switchgrass genotypes, representing most of the natural geographical distribution areas of switchgrass supplied by the Plant Genetic Resources Conservation Unit, Griffin, Georgia USA were used in this study. These included 7 lowland genotypes originating from 5 US states and 127 upland genotypes originating from Belgium and 15 US states ( Table 3). The full accession  data and information on switchgrass germplasm comes from ARS GRIN (http://www.ars-grin.gov/). The 134 genotypes, including one seedling from each accession, were grown and maintained in the experimental farm of the Sichuan Agricultural University during the 2012 growing season.

DNA extraction and marker genotyping
Genomic DNA was extracted from tender leaves of each individual using a modified cetyltrimethylammonium bromide (CTAB) method [61]. ISSR [designed by the University of British Columbia (UBC set No. 9)], EST-SSR [62], and SCoT primer [45] sequences were aligned to the Panicum reference genome using the bl2seq blast program in NCBI (www.ncbi.nlm.nih.gov/BLAST/), which was designed to eliminate redundancies. Initially, four germplasms were used to screen marker primers [PI421999 (AM-314/MS-155), PI422006 (Alamo), PI642190 (Falcon), and PI642207 (70SG 016)]. The selected primers were synthesized by the Shanghai Sangon Biological Engineering Technology and Service Company (Shanghai, China) to genotype the collection. ISSR-PCR was carried out according to Li et al [63] as follows: the total reaction volume was 15 μL and contained 20 ng template DNA, approximately 1.0 μM primer, 7.5 μL Mix (10 × PCR buffer, Mg 2+ , dNTPs; Tiangen Biotech, Beijing, China), and 1 U Taq polymerase. Amplifications were performed in a BioRad iCycle PCR machine (BIO-RAD Certified) under the following conditions: 95°C for 5 min, followed by 35 cycles of the following: 95°C for 45 s, 52-55°C for 45 s, and 72°C for 90 s. A final extension was conducted at 72°C for 7 min. All PCR bands were visualized on 1 % polyacrylamide gel electrophoresis in 1 × TBE buffer. Silver staining was used to visualize the bands. The SCoT-PCR amplification reaction was conducted in a total volume of 15 μL according to Collard and Mackill [5], and containing 10 ng template DNA, 0.8 mM primers, 1.2 mM MgCl 2 , 0.4 mM dNTPs, and 1 U Taq DNA polymerase (Tiangen Biotech, Beijing, China). PCR amplification had an initial denaturation step of 5 min at 95°C, followed by 45 s at 95°C, 45 s at 55°C, 1.5 min at 72°C for 30 cycles, and 7 min at 72°C. PCR products were visualized following agarose gel (1.5 %) electrophoresis at 120Vfor 1.5 h in 1 × TBE buffer, followed by staining with GelRed (Tiangen Biotech, Beijing, China). The EST-SSR PCR consisted of a denaturation for 5 min at 94°C then 35 cycles of 30 s at 94°C, 30 s at 53-55°C, and 2 min at 72°C, with a final extension of 5 min at 72°C [62] and products were visualized as described above.

Genetic variation and marker efficiency analysis
For each marker, polymorphic alleles were scored as "1" for presence and "0" for absence at the same mobility, and this data was used to construct an original data matrix. Using Excel 2007 and POPGENE v.1.32 [64], corresponding diversity parameters were estimated including: total number of bands (TNB), number of polymorphic bands (NPB), percentage of polymorphic bands (PPB), Nei's (1973) gene diversity index (H), and Shannon's information index (I). AMOVA v.1.55 was employed to reveal genetic variation among groups and within a population [65]. The data input to POPGENE and AMOVA was produced by DCFA v.1.1 [66]. The comparative efficiency of ISSRs, SCoTs, and EST-SSRs in these 134 switchgrass genotypes was assessed with MI. MI is the product of the EMR and the Ib av for the polymorphic markers [67]. EMR is explained as the average number of polymorphic bands [68]. Ib av is defined as: pi is the proportion of the i-th amplification site, n represents the total number of amplification site.

Population structure analysis
The model-based program STRUCTURE v.2.3.4 (http:// pritchardlab.stanford.edu/structure.html) [69] was applied to assess the population structure of the 134 switchgrass genotypes with 51 ISSRs, SCoTs, and EST-SSRs. The number of subpopulations (K) was set from 1 to 10 based on admixture models and correlated band frequencies. With 5 × 10 5 Markov Chain Monte Carlo replications carried out for each run after a burn-in period of 10 6 iterations, 20 independent runs were performed per K. When there was a clear maximum value for posterior probability [LnP(D)] output in STRUC-TURE, a K value was selected in the range of 1 to 10 subpopulations. The most probable K value was the ΔK, an ad hoc quantity related to the rate of change in LnP(D) between successive K inferred by STRUCTURE [70]. The replication of K showing the maximum likelihood was applied to subdivide the genotypes into different groups with membership probabilities ≥ 0.75. Genotypes with less than 0.75 membership probabilities were assigned to an admixed group. Bar charts from the STRUCTURE data were displayed using Distruct 1.1 [71].
To reveal relationships among the 134 switchgrass genotypes, a figure of two-dimensional scatterplots representing all of the genotypes was obtained for principal coordinate analysis (PCA) using NTsys-pc v.2.1 [73]. All of the switchgrass individuals were analyzed to calculate MRD [74]. The resulting genetic distance matrices were double-centered and used to obtain eigenvectors by the modules DCENTER and EIGEN using NTsys-pc.

Evaluation of linkage disequilibrium
The significance of pairwise LD was evaluated using squared band-frequency correlations (r 2 ) between all combinations of marker loci using the package TASSEL version 2.1 (http://www.maizegenetics.net/bioinformatics) [75]. Rare bands with a band frequency of less than 5 % were removed to avoid biased evaluations of LD because of their large variances. Other pairs of bands were evaluated with a minor band frequency of at least 5 % (MAF ≥ 0.05) with the GDA 1.1 program [76].

Conclusions
The results of this study showed a great level of genetic variation among switchgrass germplasm. The switchgrass accessions were clearly divided into two groups containing upland and lowland ecotypes. For the first time, we revealed the extent of LD and population structure in switchgrass. The implications of these results in terms of utilizing association mapping for genes or QTL discovery in switchgrass were discussed. For further association mapping using a collection of switchgrass samples, we highly recommend the inclusion of more lowland ecotypes or the use of other nuclear markers in conjunction with ISSR, SCoT and EST-SSR.

Competing interests
The authors declare that they have no competing interests.
Authors' contributions L-KH and X-QZ conceived the project and designed the experiments. YZ, X-MJ, and H-DY performed the experiment. H-DY, X-LW and YZ wrote the paper. BX and L-XZ revised the paper. All authors discussed the results and commented on the manuscript. All authors read and approved the final manuscript.