Genetic diversity of avocado from the southern highlands of Tanzania as revealed by microsatellite markers

Background: Avocado is an important cash crop in Tanzania; however, its genetic diversity has not been thoroughly investigated. This study was undertaken to explore the genetic diversity of avocado in the southern highlands using microsatellite markers. A total of 226 local avocado trees originating from seeds were sampled in eight districts of the Mbeya, Njombe and Songwe regions. Each district was considered as a population. The diversity at 10 microsatellite loci was investigated.
Results: A total of 167 alleles were detected across the 10 loci, with an average of 16.7 ± 1.3 alleles per locus. The average expected and observed heterozygosity were 0.84 ± 0.02 and 0.65 ± 0.04, respectively. All but two loci showed a significant deviation from Hardy-Weinberg equilibrium. Analysis of molecular variance showed that about 6% of the variation was partitioned among the eight geographic populations. Pairwise population FST comparisons revealed a lack of genetic differentiation for seven of the 28 population pairs tested. The principal components analysis (PCA) and hierarchical cluster analysis showed a mixing of avocado trees from different districts. The model-based STRUCTURE analysis subdivided the tree samples into four major genetic clusters.
Conclusion: The high diversity detected in the analysed avocado germplasm implies that this germplasm is a potentially valuable source of variable alleles that might be harnessed for genetic improvement of this crop in Tanzania. The mixing of avocado trees from different districts observed in the PCA and dendrogram points to strong gene flow among the avocado populations, which led to the population admixture revealed in the STRUCTURE analysis. However, there is still significant differentiation among the tree populations from different districts that can be utilized in the avocado breeding program.

hybridisation can occur between trees of different races when grown near each other [6,7].
Microsatellites are DNA sequences with 2 to 10 base pair repeat motifs, typically repeated 5 to 50 times [8,9]. They are spread throughout the genome, especially in the euchromatic regions of eukaryotic chromosomes, in both coding and non-coding DNA regions [10,11]. Microsatellite regions have a higher mutation rate than other genomic regions, leading to high genetic diversity [12]. Microsatellites are also referred to as simple sequence repeats or SSRs [13]. Polymerase strand-slippage during DNA replication or recombination errors may result in differences in the number of repeats of a given motif (SSR locus), leading to new alleles at the locus under consideration. Thus, different alleles may exist at a given SSR locus, a characteristic that makes SSRs more informative than other molecular markers, including single nucleotide polymorphisms or SNPs [14]. Being highly informative, codominant, multi-allelic, highly reproducible and transferable among related species, SSR genetic markers have been widely used for estimating gene flow, diversity, crossing over rates and evolution, and for uncovering intraspecific genetic relatedness [13][14][15][16]. They have also been used in linkage map construction, for quantitative trait loci (QTL) mapping and marker assisted selection, for DNA fingerprinting of cultivars and for estimation of the degree of kinship between genotypes [17,18].
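To make the notion of an SSR allele concrete, the following minimal sketch scans a DNA string for tandem repeats using the motif length (2 to 10 bp) and minimum repeat number (5) quoted above. The function name, the regular expression and the two example sequences are hypothetical illustrations and are not part of the study's workflow.

```python
import re

# A motif of 2-10 bases followed by at least 4 further copies (5 or more repeats in total).
# Note: runs of a single base are reported as a two-base motif (e.g. "AA").
SSR_PATTERN = re.compile(r"([ACGT]{2,10}?)\1{4,}")


def find_ssrs(sequence):
    """Return (start, motif, repeat_count) for each tandem repeat found."""
    hits = []
    for match in SSR_PATTERN.finditer(sequence.upper()):
        motif = match.group(1)
        repeats = len(match.group(0)) // len(motif)
        hits.append((match.start(), motif, repeats))
    return hits


# Two hypothetical alleles of one locus that differ only in the number of AG repeats.
allele_a = "TTGC" + "AG" * 8 + "CCTA"
allele_b = "TTGC" + "AG" * 11 + "CCTA"
print(find_ssrs(allele_a))
print(find_ssrs(allele_b))
```

The two alleles share the same flanking sequence, so the same primer pair would amplify both, and the fragments would differ in length by six base pairs.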
Genetic diversity refers to differences in genomic regions among individuals, populations and species. It allows populations to adapt to environmental changes. The wider the genetic diversity, the higher the chance of individuals harbouring allele variants that can help cope with given environmental changes. Such individuals will survive and transfer the favourable alleles to their offspring [19].
In Tanzania, avocado is one of the most important commercial fruits sold on domestic and international markets [20]. The first report of avocado cultivation in Tanzania dates back to 1892 [21,22]. The crop has been grown, mainly seed propagated, for over 100 years and has adapted to a wide range of topography, habitats and climates. As a result, a large diversity has accumulated in this germplasm. So far, only a single study has been conducted to assess the diversity of this germplasm using morphological traits [23]. The aim of the present study was to uncover the genetic diversity of this germplasm using microsatellite markers. The results from the study can be used to establish proper management and conservation strategies and for future breeding of the crop.

Microsatellite polymorphism and diversity
A total of 167 different alleles were recorded for the 10 loci across the 226 sampled avocado trees. The mean number of alleles/locus for all loci was 16.70 ± 1.30 (Table 1). All markers were polymorphic and detected at least 10 alleles each. The highest number of alleles per locus was 23 (AVAG22) followed by 20 (LMAV02 and LMAV29). The lowest number of alleles was 10, which was detected for locus LMAV35. The effective number of alleles ranged from 3.84 (AVAG05) to 9.59 (AVAG22) with an average of 6.81 ± 0.66. The Shannon's information index (I) ranged from 1.69 (LMAV35) to 2.59 (AVAG22). The minimum and maximum observed heterozygosity was 0.46 (LMAV14) and 0.82 (LMAV31), with an average of 0.65 ± 0.04. The average polymorphism information content was 0.82 ± 0.02 and it spanned from 0.70 (AVAG05) to 0.89 (AVAG22). With the exception of LMAV29 and LMAV31, the loci showed a significant deviation from the Hardy-Weinberg equilibrium.

Genetic diversity among the eight avocado geographic populations
Analysis of genetic diversity of the 226 avocado trees at the intra-population level revealed that the average observed number of alleles (Na) was lowest in the Njombe urban population (4.20 ± 0.36) and highest in the Mbeya rural and Njombe rural populations (10.70 ± 2.26 and 10.70 ± 0.70, respectively; Table 2). The mean effective number of alleles (Ne) was lowest (2.96 ± 0.29) in the Njombe urban population and highest (5.93 ± 0.57) in the Mbozi population. For Shannon's information index (I), the Njombe rural population had the highest average value, 1.96 ± 0.09, whereas the Njombe urban population had the lowest, 1.19 ± 0.09. The Wanging'ombe and Njombe urban populations had the lowest average values for the observed (Ho) and the expected (He) heterozygosity, i.e., 0.51 ± 0.06 and 0.71 ± 0.04, respectively. The highest Ho was recorded in Mbozi (0.71 ± 0.10), whereas the highest He was recorded in Mbozi and Njombe rural (0.83 ± 0.03 and 0.83 ± 0.02, respectively). The lowest and highest average gene diversity was detected in Njombe urban (0.47 ± 0.09) and Mbeya rural (0.65 ± 0.11), respectively.

Molecular variance and population divergence
Analysis of molecular variance (AMOVA) was employed to detect genetic divergence within and among the eight avocado populations. The analysis partitioned 6.08% of the variation among the populations, 17.04% among individuals within populations, and 76.87% within individuals (Table 3). The detected variations were significant at P < 0.0001. The total population differentiation due to genetic structure, i.e., the fixation index (FST), was 0.061 (P < 0.0001; Table 3). When the populations were further grouped into regions, 1.98% and 4.71% of the variation was found among groups (regions) and among populations (districts) within groups, respectively. However, the variation among individuals within populations (districts) and within individuals was similar to the values in the analysis performed on the eight populations. AMOVA conducted by grouping the genotypes into four groups according to their altitude of growth revealed that 2.52% of the total variation differentiated the groups (P < 0.0001).
The genetic divergence between the eight avocado populations was established by calculating pairwise FST comparisons (Table 4). Twenty-one of the 28 pairs of populations showed significant differentiation (P ≤ 0.05), with FST values ranging from 0.0111 (Mbeya city and Mbeya rural) to 0.1475 (Rungwe and Mbozi). The second highest FST value (0.1369, P < 0.05) was recorded for Njombe urban and Mbozi.
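As a rough cross-check, fixation indices can be recovered from the three variance percentages reported above. The sketch below assumes the standard three-level AMOVA relationships (among populations, among individuals within populations, within individuals); the function name is illustrative, and this is not necessarily how Arlequin computes the indices internally.

```python
def fixation_indices(pct_among_pops, pct_among_inds, pct_within_inds):
    """Derive FST, FIS and FIT from AMOVA variance percentages (assumed relationships)."""
    total = pct_among_pops + pct_among_inds + pct_within_inds
    # FST: share of the total variance found among populations
    f_st = pct_among_pops / total
    # FIS: share of the within-population variance found among individuals
    f_is = pct_among_inds / (pct_among_inds + pct_within_inds)
    # FIT: share of the total variance not found within individuals
    f_it = (pct_among_pops + pct_among_inds) / total
    return f_st, f_is, f_it


# Variance percentages reported in the text for the eight districts
f_st, f_is, f_it = fixation_indices(6.08, 17.04, 76.87)
print(round(f_st, 3), round(f_is, 3), round(f_it, 3))
```

Under these assumptions the percentages reproduce the reported FST of 0.061, as well as the FIS and FIT values of 0.181 and 0.231 that are quoted in the Discussion.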

Principal components analysis and hierarchical cluster analysis
Principal components analysis (PCA) was used to study the genetic relationships among the 226 avocado trees (Fig. 1). While the first two axes of the PCA accounted for 8.18% of all variation, most of the trees grouped irrespective of geographic origin. The sampled trees from Njombe urban tend to group to the right and so do those from Busokelo and Rungwe. The trees from the remaining populations are quite scattered in the plot. Grouping of samples from different districts or regions in the PCA plot points to genetic admixture among the sampled trees.
The genetic distance matrix of the 226 avocado tree samples was used to study the genetic relationships among the eight populations through hierarchical clustering. The UPGMA-based dendrogram produced three major groups, each containing samples from different districts and regions (Fig. 2), pointing to genetic admixture between samples from different districts.

Population structure and genetic relationship of the studied avocado samples
Estimation of K-values, based on the methods of Puechmaille [29], revealed that the most probable K-value for our genetic data set was four (MedMeaK, MaxMeaK, MedMedK, MaxMedK = 4; Fig. 3a). This suggests that the 226 avocado tree samples can be grouped into four subpopulations or clusters (Fig. 3b). The genetic structure suggests a high similarity between the Busokelo and Njombe urban avocado populations as well as between the Wanging'ombe and Njombe rural populations.
Members of Mbozi population showed highly similar genetic constitution with some members of populations from other districts, such as Mbeya city and Mbeya rural across the 10 SSR loci.

Discussion
In the present study, a total of 167 alleles were detected using 10 SSR loci across 226 sampled avocado trees, with the number of alleles ranging from 10 to 23 per locus (Table 1). Gross-German and Viruel [25] and Abraham and Takrama [32] reported lower numbers, 11.4 and 11.5, respectively. Similarly, Liu et al. [33] reported the lowest number of alleles per locus, 3.10, across 10 SSR loci for 56 avocado trees investigated in Hainan Province, China. The differences between our results and the previously reported results could be due to variation in the levels of polymorphism of the markers used, sample size, the diversity of the germplasm investigated and the platforms employed for resolution of amplified products [34]. The quality of genomic DNA used in PCR amplification, optimization of PCR protocols and differences in allele scoring accuracy could also account for these differences. The 16.7 alleles per locus detected in the present study is comparable to that reported for other cross-pollinated species such as maize (21.7 alleles [35]) and bur oak (14.3 alleles [36]). The average observed heterozygosity for the 10 SSR loci obtained in the present study was 0.65 (Table 1), which is similar to the 0.64 and 0.61 obtained by Schnell et al. [30] and Guzmán et al. [31], respectively, indicating similar levels of genetic diversity in their analysed samples. Lower observed heterozygosity values have been reported by Abraham and Takrama [32] (0.48), Boza et al. [37] (0.56) and Liu et al. [33] (0.39), pointing to a lower genetic diversity in the germplasm they used or to differences between the polymorphism levels of their SSR loci and ours.
Expected and observed heterozygosity, Shannon's information index and average gene diversity are indicators of the extent of genetic diversity in populations. The analysis of diversity at the intra-population level revealed that while the observed heterozygosity, Shannon's information index and average gene diversity were highest for the Mbozi, Njombe rural and Mbeya rural populations, respectively, the expected heterozygosity was highest for both Mbozi and Njombe rural (Table 2). This suggests that the Mbozi, Njombe rural and Mbeya rural populations are more diverse than the other populations and thus may offer elite materials for breeding programmes [38]. The three populations may also be able to cope with changes in environmental conditions better than the other populations [19]. All four diversity measures, except observed heterozygosity, were lowest in the Njombe urban population, pointing to lower genetic diversity. This result might be attributed to the massive replacement of local seed-propagated avocado with commercial cultivars, leading to decreasing variation within the gene pool. The lowest observed heterozygosity detected in Wanging'ombe, relative to the Njombe urban population, could possibly be attributed to the presence of null alleles.
Most of the loci in the present study showed a significant deviation from the Hardy-Weinberg equilibrium (Table 1). Geographical structure and inbreeding within subpopulations may lead to deviations from Hardy-Weinberg equilibrium. Wright's fixation indices (FIT, FST and FIS) can be used to assess these within- and among-population components of genetic variation. FIT measures the excess (FIT > 0; heterozygosity deficit) or deficit (FIT < 0; heterozygosity excess) of homozygotes at the global level [39]. The global heterozygosity deficit (FIT), when AMOVA was calculated without considering regions, was 0.231 (P < 0.0001; Table 3).
This implies that the observed homozygotes exceeded the expected value by about 23%. In other words, there was a reduction by 23% of observed heterozygotes relative to the expected ones. The result may suggest that avocado has a significant level of self-pollination although it is generally regarded as an out-crossing species. The SSR loci showing significant deviation from HWE could be in linkage disequilibrium with genic loci under selection in the form of heterozygote disadvantage. The fixation index FIS is an inbreeding coefficient that measures the excess (FIS > 0; heterozygosity deficit) or deficit (FIS < 0; heterozygosity excess) of homozygotes within a subpopulation [39]. The average inbreeding coefficient of individuals within subpopulations (districts) (FIS) was 0.181 (P < 0.0001), which indicates that within subpopulations, the observed homozygotes exceeded the expected value by about 18%. In other words, there was a reduction by 18% of observed heterozygotes relative to the expected ones within subpopulations. At the global level, this implies that about 78% (i.e., FIS × 100% / FIT) of the global heterozygosity deficit was due to the within-population deficit. FST is the fixation index that measures differentiation between subpopulations and ranges from 0 to 1. A value of 0 indicates that the populations under consideration are interbreeding freely (complete panmixis), while a value close to 0 indicates an unstructured population [40]. A value of 1 implies that all genetic variation is explained by the population structure, and that the populations under consideration do not share any genetic diversity [40]. In the present study, the global degree of genetic differentiation (FST) for the eight avocado geographical populations was 0.061 (P < 0.0001; Table 3). This indicates that there is a significant district-based subdivision of Tanzanian avocados.
This is possibly the result of mutations or genetic drift and indirect selection pressure, which normally lead to loss of certain alleles or changes in allele frequencies. As shown through analysis of molecular variance (AMOVA; Table 3), about 94% of the genetic variation was shared by the eight populations. The FST value detected in our study was lower than the 0.19, 0.22 and 0.25 previously reported by Boza et al. [37], Guzmán et al. [31] and Gross-German and Viruel [25], respectively. This could be due to the fact that our study was based on samples collected only from local avocados (excluding commercial cultivars) and that sampling involved only three neighbouring geographic regions in Tanzania. In contrast to our study, Boza et al. [37] studied avocado samples from the United States and Mexico representing Persea americana (218 samples), P. nubigena (2 samples) and P. krugii (1 sample). On the other hand, Guzmán et al. [31] analysed only P. americana collected in one country, comprising local selections, rootstocks and commercial cultivars. Similarly, the 315 samples analysed by Gross-German and Viruel [25] also included 5 samples from P. longipes, P. nubigena and P. schiedeana. However, the average FST in the present study is comparable to the 0.05 reported by Cañas-Gutiérrez et al. [15] for 197 avocado samples from Colombia.

Fig. 3 Population structure of the analysed trees. a: Identification of the optimum K-value using four different approaches developed by Puechmaille [29]; b: Population structure of 226 avocado trees suggesting varying levels of genetic admixture from four origins (colours) in the eight districts in the southern highlands of Tanzania
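The fixation indices discussed above can be cross-checked against one another. The worked calculation below assumes that Wright's hierarchical identity holds for the reported estimates; it is an illustrative consistency check rather than part of the original analysis.

```latex
\begin{align*}
1 - F_{IT} &= (1 - F_{IS})(1 - F_{ST}) \\
           &= (1 - 0.181)(1 - 0.061) = 0.819 \times 0.939 \approx 0.769, \\
F_{IT} &\approx 0.231, \qquad
\frac{F_{IS}}{F_{IT}} = \frac{0.181}{0.231} \approx 0.78 .
\end{align*}
```

The recomputed FIT agrees with the reported 0.231, and the ratio reproduces the share of about 78% of the global heterozygosity deficit attributed to the within-population deficit.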
Analysis of molecular variance showed that 6.08% of the total genetic variation was partitioned among the eight districts (Table 3). When the samples were grouped according to elevation ranges, the genetic variation among groups was 2.52%. These results concur with those of Teshome et al. [41], who observed lower differentiation among altitude-based groups of Ethiopian field pea (Pisum sativum L.) populations (2%) compared to that of region-based groups (8%).
In the present study, principal components analysis (PCA) and hierarchical cluster analysis were used to study the genetic relationships among the sampled trees. Neither the PCA nor the dendrogram separated these trees according to their districts or regions. This was in line with the AMOVA findings (Table 3), which showed that about 94% of the total genetic variation was shared by the populations. The PCA and dendrogram findings are also supported by the pairwise population FST values, which revealed an absence of differentiation between pairs of populations such as Rungwe versus Njombe urban and Busokelo versus Njombe urban. Similar results were obtained when these avocado trees were characterised with morphological markers [23]. Genetic admixture among the avocado populations is attributed to sharing of seeds between farmers from different districts and selling of avocado produce from one district to another, where the seeds could then be planted. It may also be due to the introduction of highly similar germplasm to more than one district or region.
The model-based STRUCTURE analysis was used to study the population structure of the 226 avocado plants. The results showed that the sampled plants can be grouped into four clusters based on their genetic characteristics detected at the 10 studied loci. High similarity in population structure was noticed between the Busokelo and Njombe urban avocados as well as between the Wanging'ombe and Njombe rural avocados. This result is supported by the lack of differentiation among these pairs as shown by the pairwise population FST values (Table 4). Each geographic population (district) has alleles originating from at least three STRUCTURE-based populations (clusters). Various analyses, such as AMOVA, global FST, PCA and UPGMA, revealed that avocados grown in different districts of Tanzania show high genetic similarity with low but significant genetic differentiation between them.

Conclusion
High diversity was detected in the analysed avocado germplasm based on standard and molecular diversity indices. These findings imply that this germplasm is a potentially valuable source of variable alleles that might be harnessed for genetic improvement of this crop in Tanzania. The principal components analysis and hierarchical cluster analysis showed a mixing of avocado trees from different districts, pointing to strong gene flow among the eight populations. This is in line with the results of the model-based population structure analysis, which revealed that the alleles of each district-based population originated from at least three of the four genetic populations.

Collecting samples and DNA extraction
Samples were collected in three regions, i.e., Mbeya, Njombe and Songwe, which are located in the southern highlands of Tanzania. In these regions, a total of 41 villages or streets were visited across eight districts that are renowned for harbouring many seed-propagated avocado trees. Locations of the study sites are given in Fig. 4. The number of trees sampled in each district and region is presented in Table 5, whereas some data on the climate of the studied districts are presented in Table 6.
We visited the study sites from February to August 2017, and young leaf samples were collected from a total of 226 seed-propagated avocado trees. We sampled four to six leaves from each tree and packed them in porous tea bags (two to three leaves per bag). The tea bags were then put into plastic bags, and silica gel was added to dry the leaves. When needed, we replaced the spent silica gel and continued doing so until the leaf samples were completely dry. The leaves were considered completely dry when newly added silica gel retained its original colour. We extracted DNA from the dried young avocado leaf samples using the Thermo Scientific Genomic DNA Purification Kit following the manufacturer's instructions with minor modifications. DNA integrity was checked by electrophoresis in a 1.2% agarose gel, and its quality and quantity were assessed with a NanoDrop spectrophotometer.

Microsatellite analysis
Forty microsatellite markers were selected among those developed by Sharon et al. [24] and Gross-German and Viruel [25] based on their reported levels of polymorphism. The primers were tested on eight avocado samples from eight populations, each representing a different district. Thereafter, 16 microsatellite markers that were highly polymorphic within the eight samples were chosen and used for analysis of all samples. However, only 10 of the 16 markers showed consistent amplification, and the data analysis is therefore based on these 10 markers. Background information about the microsatellites, such as names and repeat motifs, is presented in Table 2. Amplification of target microsatellite loci was carried out in a total reaction volume of 25 μl containing 2.5 μl of 10X PCR buffer, 1.5 μl of 25 mM MgCl2, 0.3 μl of 25 mM dNTPs, 0.75 μl each of 10 μM forward and reverse primers, 0.2 μl of 5 U/μl Taq polymerase and 25 ng genomic DNA. PCR reactions were run in an S1000™ thermal cycler (BIO RAD, Hercules, CA, USA) using a program of initial denaturation at 94°C for 1 min, 35 cycles of 1 min denaturation at 94°C, 30 s annealing at the primer-specific annealing temperature and 1 min extension at 72°C, followed by a 10 min final extension at 72°C. The sizes of the PCR products were determined using an Applied Biosystems 3500 Genetic Analyzer (Thermo Fisher Scientific, Waltham, MA, USA).
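The final concentrations implied by this recipe follow from the dilution relationship C1V1 = C2V2. The short sketch below recomputes them from the stock concentrations and volumes listed above; the variable and function names are illustrative, and treating the remaining volume as DNA template plus water is an assumption, since the text does not specify how the reaction was brought to 25 μl.

```python
REACTION_VOLUME_UL = 25.0

# name: (stock concentration, unit, volume added in µl), as listed in the text
components = {
    "PCR buffer": (10.0, "X", 2.5),
    "MgCl2": (25.0, "mM", 1.5),
    "dNTPs": (25.0, "mM", 0.3),
    "forward primer": (10.0, "µM", 0.75),
    "reverse primer": (10.0, "µM", 0.75),
    "Taq polymerase": (5.0, "U/µl", 0.2),
}


def final_concentration(stock, volume_ul, total_ul=REACTION_VOLUME_UL):
    """Apply C1 * V1 = C2 * V2 to obtain the concentration in the final reaction."""
    return stock * volume_ul / total_ul


for name, (stock, unit, volume) in components.items():
    print(f"{name}: {final_concentration(stock, volume):.2f} {unit}")

# Volume left for DNA template and water (assumption: nothing else was added)
reagent_volume = sum(volume for _, _, volume in components.values())
remaining = REACTION_VOLUME_UL - reagent_volume
print(f"reagents: {reagent_volume:.1f} µl; remaining volume: {remaining:.1f} µl")
```

Under these assumptions the reaction contains 1X buffer, 1.5 mM MgCl2, 0.3 mM dNTPs, 0.3 μM of each primer and 1 U of Taq polymerase in total.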

Data analysis
Standard diversity indices for markers, i.e., observed and effective number of alleles, observed and Nei's [27] expected heterozygosity and Shannon's information index [28], were computed using Popgene32 software version 1.32 [48]. Polymorphism information content for each SSR locus was assessed using Cervus version 3.0.7 [49]. Hardy-Weinberg equilibrium tests (molecular diversity index) were carried out in Popgene32 software version 1.32 [48]. To assess diversity at the intra-population level, we computed standard diversity indices for each geographical population (district) using Arlequin 3.5.2.2 [50]. The indices calculated were the observed number of alleles and the observed and expected heterozygosity. The effective number of alleles and Shannon's information index were computed using GenAlEx version 6.5 [51].
We also estimated average gene diversity across all loci for each population in Arlequin 3.5.2.2. Analysis of molecular variance (AMOVA) was carried out in Arlequin 3.5.2.2 with 1000 permutations, 100,000 Markov chain steps and 10,000 dememorisation steps. Hierarchical global AMOVA was conducted by grouping individual samples according to districts, regions and elevation ranges. Arlequin 3.5.2.2 was used to compute fixation indices (FST, FIT, FIS, FCT and FSC) and pairwise comparisons between the eight geographical populations.
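For orientation, the per-locus indices named above can be written in terms of allele frequencies. The sketch below uses the textbook formulas (effective number of alleles 1/Σp², expected heterozygosity 1 − Σp² without a sample-size correction, Shannon's index −Σp ln p, and the usual polymorphism information content formula); it is not a re-implementation of Popgene32, Cervus or GenAlEx, and the allele counts are hypothetical.

```python
import math


def locus_diversity(allele_counts):
    """Compute per-locus diversity indices from a mapping of allele -> count."""
    total = sum(allele_counts.values())
    freqs = [count / total for count in allele_counts.values()]
    sum_sq = sum(p * p for p in freqs)

    # Shannon's information index, using natural logarithms
    shannon = -sum(p * math.log(p) for p in freqs if p > 0)

    # Cross term of the PIC formula: sum over i < j of 2 * p_i^2 * p_j^2
    cross = sum(
        2 * freqs[i] ** 2 * freqs[j] ** 2
        for i in range(len(freqs))
        for j in range(i + 1, len(freqs))
    )
    return {
        "Na": len(freqs),         # observed number of alleles
        "Ne": 1.0 / sum_sq,       # effective number of alleles
        "He": 1.0 - sum_sq,       # expected heterozygosity (uncorrected)
        "I": shannon,
        "PIC": 1.0 - sum_sq - cross,
    }


# Hypothetical allele counts (fragment sizes in bp) at a single SSR locus
result = locus_diversity({"150": 10, "152": 6, "156": 4})
for name, value in result.items():
    print(name, round(value, 3))
```

The software packages used in the study may apply sample-size corrections, so values computed this way can differ slightly from the reported ones.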
We employed principal components analysis (PCA) to display the genetic relationships among the eight avocado geographical populations (districts). Allele composition for each tree was computed in the adegenet R package [52] and then used for PCA in XLSTAT version 2019.4.2 [53]. Hierarchical cluster analysis was also used to group avocado samples with similar genetic characteristics across the 10 loci. To achieve this, we computed Nei's genetic distance in GenAlEx and imported it into MEGA X [54], where the dendrogram was produced in Newick format using the unweighted pair group method with arithmetic mean (UPGMA) [55]. The dendrogram was visualized and customized in the Interactive Tree Of Life (iTOL) v4 [56]. To identify genetic populations for the 226 avocado trees, we performed a Bayesian cluster analysis on the allele dataset using STRUCTURE 2.3.4 [57][58][59]. The admixture model was adopted with a burn-in period of 10,000 and 100,000 Markov chain Monte Carlo iterations. The range of K, the probable number of subpopulations or clusters, tested was 1 to 10. For each K value, we performed 20 independent runs. The STRUCTURE results were analysed and visualised in STRUCTURE SELECTOR [60], where the optimal number of genetic clusters was computed based on four approaches: the median of medians (MedMedK), median of means (MedMeaK), maximum of medians (MaxMedK) and maximum of means (MaxMeaK) [29].
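The UPGMA step described above can be illustrated with a small, self-contained sketch. It repeatedly merges the two closest clusters, averages distances in proportion to cluster size, and emits a Newick string. It is not the MEGA X implementation; the function name, the four labels and the distance matrix are hypothetical.

```python
def upgma(labels, dist):
    """Build a Newick string from a symmetric distance matrix (list of lists)."""
    # cluster id -> (newick subtree, number of leaves, node height)
    clusters = {i: (labels[i], 1, 0.0) for i in range(len(labels))}
    # pairwise distances, keyed by (smaller id, larger id)
    d = {
        (i, j): dist[i][j]
        for i in range(len(labels))
        for j in range(i + 1, len(labels))
    }
    next_id = len(labels)
    while len(clusters) > 1:
        # pick the closest pair of clusters
        (a, b), d_ab = min(d.items(), key=lambda item: item[1])
        tree_a, size_a, height_a = clusters.pop(a)
        tree_b, size_b, height_b = clusters.pop(b)
        height = d_ab / 2.0
        newick = (
            f"({tree_a}:{height - height_a:.3f},"
            f"{tree_b}:{height - height_b:.3f})"
        )
        # distance from the merged cluster to every remaining cluster,
        # weighted by the number of leaves in each merged part
        new_d = {}
        for c in clusters:
            d_ac = d[(min(a, c), max(a, c))]
            d_bc = d[(min(b, c), max(b, c))]
            new_d[(c, next_id)] = (size_a * d_ac + size_b * d_bc) / (size_a + size_b)
        # drop distances that involve the two merged clusters
        d = {pair: v for pair, v in d.items() if a not in pair and b not in pair}
        d.update(new_d)
        clusters[next_id] = (newick, size_a + size_b, height)
        next_id += 1
    return next(iter(clusters.values()))[0] + ";"


# Hypothetical genetic distances among four samples
labels = ["A", "B", "C", "D"]
dist = [
    [0.0, 0.2, 0.6, 0.8],
    [0.2, 0.0, 0.6, 0.8],
    [0.6, 0.6, 0.0, 0.4],
    [0.8, 0.8, 0.4, 0.0],
]
print(upgma(labels, dist))
```

The resulting Newick string could be loaded into a tree viewer such as iTOL; with real data, the matrix would hold the pairwise Nei's genetic distances exported from GenAlEx.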