Danielle Botha, et al.
4.1. Reference library development
Despite several studies in which DNA barcoding was applied to the analysis of herbivore diets [3, 73, 74], few have focused on herbivore diets in African savannas [2, 26]. To date, none has focused on South African savannas, and no robust DNA reference sequence library is currently available for the semi-arid eastern South African savanna. During this study, we curated two DNA reference libraries, an rbcL dataset and a trnL dataset, with special attention to forb species neglected in previous studies and models concerning savanna ecology and definitions. This is the first step towards facilitating metabarcoding studies aimed at South African savanna ecosystems, specifically for the identification of plant material obtained from herbivore faeces.
Ideally, a reference library would be constructed by sampling, identifying, and sequencing the relevant barcodes of every species on the species list. However, this would be a costly and laborious research campaign. Mining sequences from global databases is a much more cost-effective option. The method illustrated here can be reproduced for any metabarcoding study aimed at representing the composition of an ecosystem without manually sequencing all specimens. The use of geographically restricted reference databases, as proposed in this study, may sacrifice the discovery of rare, novel, or invasive species that are not recorded in the species lists describing the floristic diversity of the study area in which the metabarcoding study will be conducted. Furthermore, errors made in the early stages of the process, or incorrect identifications submitted to BOLD or GenBank, condition subsequent choices and may have a devastating impact on the final result. The use of GenBank as a taxonomic resource has been questioned because of missing preserved voucher specimens, unjustified species identifications, and low-quality data [71, 75]. The same is true for BOLD: although this database is better curated owing to higher-quality submission standards, it may still contain incorrectly identified taxa and low-quality sequence data because it serves as a workbench for barcode research [71, 72, 76]. However, it has also been shown that the proportion of mislabelled barcode sequences on GenBank and BOLD is low and that taxonomic errors are small [75–77]. Being aware of these limitations enables the user to define download criteria that result in a database of sequences with known localities and known voucher specimens, and that favour barcodes with the fewest ambiguous bases.
These databases, like any database mined from global reference databases, are not free of errors, but using a species list and applying download quality criteria lifts the standard above that of a taxonomic assignation against whole databases such as GenBank and BOLD. Owing to the rigorous testing and validation efforts, the barcode databases developed during this study represent a step towards more reliable identifications for eDNA samples.
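The download criteria described above can be illustrated with a minimal sketch. It assumes that mined records have already been parsed into simple objects; the `Record` fields, the function names, and the 1% ambiguity cut-off are hypothetical choices made for illustration and are not the criteria or thresholds used in this study.

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional

# IUPAC ambiguity codes (anything other than A, C, G, T that denotes a base).
AMBIGUOUS = set("RYSWKMBDHVN")


@dataclass
class Record:
    """Hypothetical container for one mined barcode record."""
    species: str
    sequence: str
    voucher: Optional[str]   # voucher specimen identifier, if any
    locality: Optional[str]  # collection locality, if any


def ambiguous_fraction(sequence: str) -> float:
    """Fraction of positions that carry an IUPAC ambiguity code."""
    seq = sequence.upper()
    if not seq:
        return 1.0  # treat an empty sequence as fully ambiguous
    return sum(base in AMBIGUOUS for base in seq) / len(seq)


def filter_records(records: Iterable[Record],
                   species_list: Iterable[str],
                   max_ambiguous: float = 0.01) -> List[Record]:
    """Keep records that are on the species list, have a voucher and a
    locality, and do not exceed the (assumed) ambiguity cut-off."""
    wanted = {name.lower() for name in species_list}
    return [
        r for r in records
        if r.species.lower() in wanted
        and r.voucher
        and r.locality
        and ambiguous_fraction(r.sequence) <= max_ambiguous
    ]
```

Restricting records to a species list is what makes the resulting database geographically specific; the remaining checks only screen record quality and cannot detect a misidentified specimen.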
The phylogenetic analysis served as a re-identification strategy for sequences included in the rbcL and trnL reference databases. The exclusion of singletons, as recommended by the identification rates of the various distance-based measures (Table 1), was not considered in this part of the study, as it would have led to a great loss of taxonomic coverage and identifications, as seen by a drop of 5.41% and 7.22% in the total taxonomy included in the rbcL and trnL databases, respectively. Excluding singletons from the reference datasets would lead to the loss of the following families of flowering plants: Aristolochiaceae, Dioscoreaceae, Gentianaceae, Kirkiaceae, Passifloraceae, Piperaceae, Plantaginaceae and Urticaceae in the rbcL dataset, and Hypoxidaceae, Menispermaceae, Nyctaginaceae, Passifloraceae, Piperaceae, Rutaceae, Scrophulariaceae and Stilbaceae in the trnL dataset. We therefore sacrificed some identification success in order to include singletons and thereby retain an accurate representation of the species present in a South African savanna. The NJ trees revealed that sequences formed cohesive clusters of orders for the rbcL dataset. However, a low taxonomic resolution was again evident for some orders in the trnL dataset, such as Poales, Commelinales and Fabales. Similar results were also shown by Gill et al. for the families Poaceae, Fabaceae, and Malvaceae. Low taxonomic resolution was expected, since many genera known to occur in South African savannas are under-sampled in global reference databases or are yet to be barcoded with either rbcL or trnL, leading to a lack of representatives for certain genera or the inclusion of singletons in the reference database. Species represented in the South African savannas are very diverse, as can be seen from the sum of the branch lengths of the NJ trees, namely 108.23 for the 922 trnL-barcode taxa and 382.34 for the 1239 rbcL-barcode taxa.
The lowest level of taxonomic resolution achieved in metabarcoding studies is typically genus-level [73, 78], and the assignation of taxonomy at lower taxonomic levels is often infeasible and inaccurate. Therefore, to ensure that these databases of barcode sequences are robust enough to be used as an identification tool, the barcode gap was analysed together with the barcode phylogeny for each dataset. A barcoding gap is defined as a disparity between inter- and intraspecific distances that, if present at a locus, enables the reliable differentiation of species [42, 67]. A barcode gap was evident in both approaches used and for both datasets. A barcode gap was obtained for 76% of the species in the rbcL barcode sequences and for 68% of the species in the trnL sequences. This implies that reliable species differentiation would not be possible for 24% of the species in the rbcL dataset and 32% in the trnL dataset. However, the lack of a barcode gap for a considerable proportion of reference sequences has also been reported in other DNA-barcode studies [28, 45, 79]. Gill et al. reported a barcode gap for only 73% of the species investigated for the rbcL primer and 79% for the trnL-F primer in their barcode library of semi-arid East African savanna plants. Mishra et al. have shown that even within one genus (Terminalia Linn.) the three barcodes matK, ITS and rbcL contained a barcode gap for <70% of the species. As stated by Gill et al., savannas comprise a diverse range of species that are prone to the absence of a barcoding gap.
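The barcode gap criterion can be sketched as follows, assuming a precomputed symmetric pairwise distance matrix and one species label per sequence. A species is counted as having a gap when its smallest interspecific distance exceeds its largest intraspecific distance. The function name and the decision to skip species with fewer than two sequences (for which no intraspecific distance exists) are assumptions of this sketch, not a description of the software used in this study.

```python
from typing import Dict, List, Sequence, Tuple


def barcode_gap(dist: Sequence[Sequence[float]],
                labels: List[str]) -> Dict[str, Tuple[float, float, bool]]:
    """Return {species: (max_intra, min_inter, has_gap)}.

    dist   -- symmetric pairwise distance matrix (e.g. K2P distances)
    labels -- species label of each row/column of dist
    Species with a single sequence are skipped (assumption of this sketch).
    """
    by_species: Dict[str, List[int]] = {}
    for i, label in enumerate(labels):
        by_species.setdefault(label, []).append(i)

    result = {}
    for species, idx in by_species.items():
        if len(idx) < 2:
            continue  # no intraspecific distance can be computed
        members = set(idx)
        intra = [dist[i][j] for i in idx for j in idx if i < j]
        inter = [dist[i][j] for i in idx
                 for j in range(len(labels)) if j not in members]
        if not inter:
            continue  # only one species in the dataset
        max_intra, min_inter = max(intra), min(inter)
        result[species] = (max_intra, min_inter, min_inter > max_intra)
    return result


def gap_fraction(gaps: Dict[str, Tuple[float, float, bool]]) -> float:
    """Proportion of evaluated species that show a barcode gap."""
    return sum(has_gap for _, _, has_gap in gaps.values()) / len(gaps)
```

Under this definition, a figure such as 76% is the value `gap_fraction` would return, and the remaining species are those whose nearest non-conspecific sequence is at least as close as their most divergent conspecific.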
The distance-based measures of k-NN, BCM, and BOLD identification criteria infer identification success rates by using the K2P distance matrix of each dataset to simulate the taxonomic identification of each sequence against the rest of the database, matching taxonomies within the identification threshold. Usually, the distance-based measures are applied at species level, but for this metabarcoding study it was decided to predict the identification success of the datasets based on genus-level identifications (Table 1). The best success rates across all three methods were seen with the exclusion of singletons, in this case genera represented by only one individual. The analysis of identification success rates, with the exclusion of singletons, is also a prerequisite for the alignment that precedes the barcode gap analysis. Singletons are typically a problem in barcoding studies because the identification simulations treat the singleton as a query that has no match available in the reference dataset, which results in either “incorrect” or “no identification”.
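The simulation described above can be sketched as a leave-one-out nearest-neighbour test at genus level. The K2P distance follows the standard Kimura two-parameter formula, d = -1/2 ln((1 - 2P - Q) sqrt(1 - 2Q)), where P and Q are the proportions of transitions and transversions. This is a simplified illustration only: it assumes pre-aligned sequences of equal length, covers only the nearest-neighbour criterion (no BCM or BOLD distance threshold), and its function names are hypothetical.

```python
import math
from typing import List

PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def k2p(a: str, b: str) -> float:
    """Kimura two-parameter distance between two aligned sequences.

    Positions with gaps or ambiguous bases are ignored. Returns infinity
    when the distance is undefined (saturated or no comparable sites).
    """
    n = transitions = transversions = 0
    for x, y in zip(a.upper(), b.upper()):
        if x not in "ACGT" or y not in "ACGT":
            continue
        n += 1
        if x == y:
            continue
        if {x, y} <= PURINES or {x, y} <= PYRIMIDINES:
            transitions += 1
        else:
            transversions += 1
    if n == 0:
        return math.inf
    p, q = transitions / n, transversions / n
    w1, w2 = 1 - 2 * p - q, 1 - 2 * q
    if w1 <= 0 or w2 <= 0:
        return math.inf
    return -0.5 * math.log(w1 * math.sqrt(w2))


def nearest_neighbour_success(seqs: List[str], genera: List[str]) -> float:
    """Leave-one-out success rate: each sequence is used as a query and is
    'correct' if its closest other sequence belongs to the same genus.
    Ties are resolved by taking the first closest sequence."""
    correct = 0
    for i, query in enumerate(seqs):
        others = [j for j in range(len(seqs)) if j != i]
        best = min(others, key=lambda j: k2p(query, seqs[j]))
        correct += genera[best] == genera[i]
    return correct / len(seqs)
```

In this sketch a genus represented by a single sequence can never be identified correctly, because its only possible match is removed when it is used as the query. This is why excluding singletons raises the simulated success rate.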
These analyses revealed that the rbcL and trnL datasets have predicted identification success rates of 90.78% and 79.05%, respectively, with the k-NN method. These genus-level rates exceed the proportions of species with a barcode gap (76% and 68%), which would imply that a barcode gap analysis is not always an accurate predictor of the species discrimination success of sequences in a reference database, as concluded in certain studies [69, 78, 79]. This is also reflected in this study by the inconsistencies between the identification success rate simulations and the evaluation of the barcode gap. However, in this case, the barcode gap analysis can be accepted as a more efficient tool to predict database accuracy, since it displays the identification success rate achieved by species-level identifications, whereas the distance-based analysis implemented genus-level taxonomic assignments. In addition, the discrimination success rates demonstrated by the datasets are above (rbcL = 76%) or very close to (trnL = 68%) the discrimination success rate of 72% proposed by the CBOL Plant Working Group. This implies that an adequate discrimination success rate would be possible down to the genus level when the predicted identification success rates are taken into account. The collection, vouchering and sequencing of more species, especially forb species from South African savannas, will contribute to more comprehensive public reference databases, which can then be used as a sequence mining source to improve the resolution of smaller, study-area-specific databases through the addition of new species and the reduction of singletons. Until a standardized plant barcoding region can be agreed upon, reference databases used for metabarcoding studies will not be universally applicable, and the choice of barcoding loci is largely left to the researcher, based on the targeted taxonomies as well as the nature of the environmental sample.
In the meantime, robust DNA barcode reference libraries are essential to moving ecological studies forward.