INTRODUCTION
Temperate forests have been shown to harbour an outsized diversity of fungal endophytes living in standing trees (Unterseher, 2011). This mycobiota plays a key role in the fitness and functioning of trees through complex dynamics (Baldrian, 2016), with roles that fall along a continuum from mutualism through commensalism to parasitism and that can shift even within the lifetime of a single fungal organism (Saikkonen et al., 1998; Stone et al., 2004). Plant-associated mycobiota also contribute to large-scale patterns of plant diversity in forest ecosystems (Wang et al., 2019). However, there is a large gap in the understanding of fungal endophyte diversity, the drivers that modulate such communities, and the nature of the interactions they establish with plants (Suryanarayanan, 2020).
In recent decades, high-throughput sequencing (HTS) technologies have become widespread and accessible, driving progress in plant mycobiome research, especially through metabarcoding approaches. These approaches are cost- and time-efficient and have increased the sensitivity and rate at which biomes can be assessed (Terhonen et al., 2019). Nonetheless, they have brought great challenges in data processing, in terms of sequence quality filtering and curation, assembly, clustering, and taxonomic assignment of a huge volume of output sequences. This has led to the development of many different bioinformatic tools addressing each step of the workflow, whose performance is still under assessment. HTS metabarcoding approaches also require certain bioinformatics and programming skills from researchers. In response, numerous developments have emerged that aim to provide accessible and integrated tools for the whole data processing workflow (Gweon et al., 2015; Rognes et al., 2016; Palmer et al., 2018; Jalili et al., 2020). There have been various efforts to compare the performance of individual tools aimed at different steps of the workflow (Schloss & Westcott, 2011; Edgar & Flyvbjerg, 2015), but only limited efforts to assess integrated pipelines useful to microbiologists and mycologists who are not programmers or developers (Mysara et al., 2017). Many of these tools were developed for bacterial 16S amplicon analysis and subsequently adapted to fungal data (Anslan et al., 2018). Among the available tools, PIPITS (Gweon et al., 2015) and the Amplicon toolkit (AMPtk) (Palmer et al., 2018) were created specifically to process fungal ITS sequencing data and have been shown to perform better for fungal ITS amplicon analysis than other available pipelines (Anslan et al., 2018; Nilsson et al., 2019).
Both PIPITS and AMPtk are command-line cost-free toolkits, designed with easy-to-use straightforward pipelines that allow performing all data analysis from raw sequences to final operational taxonomic unit (OTU) tables and taxonomy. They take advantage of extant methods from other toolkits and, also, develop new tools for certain steps of the workflow.
We recently provided the first description of the endophytic mycobiota of the Andean Patagonian Forest through culture prospection, in a study that assessed wood endophyte communities of Nothofagus trees (Molina et al., 2020). Even though we described a rich and heterogeneous diversity, we concluded that our results underestimated it and that the fungal endophyte diversity harboured by Nothofagus trees could not be fully assessed through culture prospection; to further describe this diversity and elucidate beta diversity patterns, a culture-independent approach was proposed (Molina et al., 2020, 2022).
In this study, we aim to compare the performances of three pipelines from two automated toolkits (AMPtk and PIPITS) in the assessment of wood fungal endophytes assemblages of Nothofagus trees from the Patagonian Forests; this is accomplished by comparing the ITS metabarcoding dataset with another sequence dataset derived from culture prospection isolates, from the same sites and trees.
MATERIALS AND METHODS
Study area and sampling procedure
The study was conducted in Los Alerces National Park in Argentinian Patagonia, from May 2016 to April 2018. The sampling collection was described in Molina et al. (2020). Briefly, at each site, roots and stems were sampled seasonally, from ten trees of similar diameter at breast height. The sampling was performed in seven sites: three stands of Nothofagus pumilio (Poepp. & Endl.) Krasser and four stands of Nothofagus dombeyi (Poepp. & Endl.) Krasser. Sapwood cores of 5 mm diameter and 15 mm length were extracted by using an increment borer sterilized with 70% ethanol (v/v) and flaming between samples. A total of 280 trees were sampled, taking different samples for the culture-dependent and culture-independent approaches, but from the same trees.
Culture-dependent database construction
Sample collection, fungal isolation, and molecular identification methods were reported in Molina et al. (2020). Briefly, the sapwood tissue of the core samples was cut into 5 mm pieces, surface sterilized, placed on Ascomycota- and Basidiomycota-selective media, and incubated at 20-24 °C for up to 4 months. Pure axenic cultures were used for DNA extraction and ITS sequencing.
Culture-independent database construction
Sample processing, library preparation, and amplicon sequencing were described in Molina & Pildain (2022). Briefly, sapwood samples were recovered using a sterilized increment borer, and about 50 mg of wood from each sample was ground to powder according to Dumolin et al. (1995). Total DNA was extracted using the DNeasy PowerPlant Pro Kit (QIAGEN, Hilden, Germany).
The Internal Transcribed Spacer 1 (ITS1) library was prepared using the TruSeq dual indexing strategy. ITS1 amplification was performed using the primer pair TS-ITS1-F and TS-ITS2-R (White et al., 1990; Gardes & Bruns, 1993) and MyTaq™ Mix (Bioline, USA, Inc., Memphis) in a total volume of 25 μL per reaction, with the following cycling conditions: 94 °C for 5 minutes; 32 cycles of 94 °C for 45 seconds, 50 °C for 45 seconds, and 72 °C for 1 minute; and a final extension at 72 °C for 7 minutes. PCR products were purified using ExoSap-IT (USB Corporation, Cleveland, OH). The purified PCR products were indexed using sample-specific barcode combinations of the TruSeq primer pairs i5-TS-DI-5xx and i7-TS-DI-7xx, with the following cycling conditions: 95 °C for 3 minutes; 8 cycles of 95 °C for 30 seconds, 55 °C for 30 seconds, and 72 °C for 30 seconds; and a final extension at 72 °C for 5 minutes. PCR products were purified as described above and quantified using a NanoDrop spectrophotometer (ThermoFisher, Waltham, MA). Negative controls (from both the DNA extraction and the PCR runs) and non-biological synthetic mock communities (SynMock; Palmer et al., 2018) were processed and sequenced simultaneously. Synthetic mock communities are non-biological constructs specifically designed to mimic the composition and complexity of real-world fungal communities, serving as references for validating and benchmarking experimental procedures in mycological research. All samples were randomly split into two groups, and the purified ITS amplicons of each group were pooled in equimolar ratios (multiplexed). Libraries were sequenced at the Purdue Genomics Core Facilities (Purdue University, West Lafayette, IN) with a MiSeq version 2 Reagent kit of 500 cycles on the Illumina MiSeq platform (2 × 250 bp).
Bioinformatic analysis
Data processing was carried out with two toolkits: PIPITS (v2.7.12) and the Amplicon toolkit (AMPtk) (v1.2.4). AMPtk was run with two different clustering methods. The detailed workflow of the three resulting pipelines (hereafter PIPITS, AMPtk-UPARSE, and AMPtk-DADA2) is illustrated in Fig. 1. Pre-clustering steps were conducted under default conditions.
The two toolkits differ in the order and the tools used for quality filtering, read-pair assembly, and sequence trimming to obtain the ITS amplicon. AMPtk uses USEARCH tools (v9.2.64; Edgar, 2010), while PIPITS takes advantage of the open-source alternative, VSEARCH (v2.7.0; Rognes et al., 2016).
Figure 1 shows that the first step in the PIPITS toolkit is read-pair assembly using VSEARCH; the assembled reads are then filtered by quality using the FASTX-toolkit (v0.0.13; Gordon & Hannon, 2010). Sequences are dereplicated using VSEARCH, which eliminates redundant sequences from the large dataset, streamlining downstream data processing and improving computational efficiency. Next, the pipeline calls the ITS extractor algorithm (ITSx, v1.1b1; Bengtsson-Palme et al., 2013) to identify the ITS1 region and extract it from the reads, deleting any conserved regions or primer sequences. After this processing, the pipeline re-inflates the dereplicated sequences in order to keep the read abundance information. In contrast, the AMPtk toolkit first discards short reads and trims the primer sequences from the remaining reads; it then merges the paired-end reads using USEARCH and performs the quality filtering (Edgar & Flyvbjerg, 2015).
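The dereplication and re-inflation steps described above can be illustrated with a minimal Python sketch. It is not the VSEARCH or PIPITS implementation: the function names `dereplicate` and `reinflate`, the toy reads, and the flank-stripping "extraction" are hypothetical, chosen only to show how read abundances survive the round trip.

```python
from collections import Counter


def dereplicate(reads):
    """Collapse identical reads into unique sequences with abundance counts."""
    counts = Counter(reads)
    # Sort by decreasing abundance, then alphabetically, for a stable order.
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))


def reinflate(unique_with_counts, processed):
    """Restore read abundances after per-sequence processing.

    `processed` maps each unique input sequence to its processed form
    (for example, an extracted ITS1 region) or to None if it was discarded.
    """
    reads = []
    for seq, count in unique_with_counts:
        result = processed.get(seq)
        if result is not None:
            reads.extend([result] * count)
    return reads


# Toy reads (hypothetical): flanking "GG"/"CC" stand in for conserved regions.
reads = ["GGACGTCC", "GGACGTCC", "GGTTTTCC", "GGACGTCC", "AAAA"]
unique = dereplicate(reads)

# Hypothetical "extraction": strip the flanks; reads without them are dropped.
processed = {
    seq: seq[2:-2] if seq.startswith("GG") and seq.endswith("CC") else None
    for seq, _ in unique
}
inflated = reinflate(unique, processed)
print(unique)
print(inflated)
```

The expensive step (here, the stand-in extraction) runs once per unique sequence rather than once per read, which is the efficiency gain that dereplication is meant to provide.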
For OTU clustering, the threshold was set at 97% identity. PIPITS uses VSEARCH to cluster OTUs and to detect and remove chimeras against the UNITE UCHIME reference dataset (http://unite.ut.ee/repository.php). In the AMPtk toolkit, taxonomic units were defined with two algorithms: OTU clustering with UPARSE (v9.2.64; Edgar, 2013) and the DADA2 pipeline (v1.6.0; Callahan et al., 2016), both of which also perform chimera detection and removal. The DADA2 pipeline does not cluster OTUs but defines amplicon sequence variants (ASVs). Unlike OTUs, ASVs represent unique sequences without clustering (Callahan et al., 2017). The AMPtk toolkit provides an additional functionality to address cross-contamination errors by leveraging the sequences of the SynMock community: it identifies the SynMock sequences, estimates their frequencies, and calculates the tag-switching index. It also allows running the LULU algorithm (v0.1.0; Frøslev et al., 2017), a post-clustering curation pipeline that combines co-occurrence patterns and sequence identity analysis to detect and delete (or merge) erroneous OTUs from the set.
Finally, to assign taxonomic classifications to the defined OTUs, PIPITS employs the RDP Classifier (v2.10.2; Wang et al., 2007), a machine learning approach that automatically classifies sequences based on patterns and characteristics present in the data. The RDP Classifier compares the obtained sequences against UNITE, a carefully curated reference dataset of fungal ITS regions (https://sourceforge.net/projects/rdp-classifier/files/RDP_Classifier_TrainingData). In contrast, a “hybrid” approach was used to assign taxonomy with the AMPtk toolkit. This approach combines classification from a global alignment with classifications from the UTAX (RC Edgar, http://drive5.com/usearch/manual9.2/cmd_utax.html) and SINTAX (Edgar, 2016) approaches. The hybrid method chooses the best taxonomy from the three approaches by prioritizing the global alignment result if its identity is higher than 97%, or otherwise selecting the result with the higher confidence score from the other two approaches. If the taxonomies conflict, the algorithm chooses the last common ancestor taxonomy (Palmer et al., 2018), that is, the lowest taxonomic rank at which there is no conflict.
Manual curation of the three pipeline outputs was performed following the recommendations of Brown et al. (2015); thus, three additional databases were evaluated (hereafter, curated data). Non-fungal and kingdom-undefined OTUs/ASVs, as well as OTUs/ASVs represented by fewer than 10 reads, were removed from the set.
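The decision rule of the hybrid approach can be sketched in Python as follows. This is one reading of the description above, not AMPtk's actual code: the function names, the input tuples, and the decision to resolve conflicts only between the UTAX and SINTAX lineages are assumptions made for illustration.

```python
def last_common_ancestor(lineage_a, lineage_b):
    """Return the shared leading ranks of two lineages (kingdom first)."""
    shared = []
    for rank_a, rank_b in zip(lineage_a, lineage_b):
        if rank_a != rank_b:
            break
        shared.append(rank_a)
    return shared


def hybrid_taxonomy(alignment, utax, sintax, identity_threshold=97.0):
    """Pick a lineage from three classifier results (hypothetical helper).

    alignment: (lineage, percent_identity)
    utax, sintax: (lineage, confidence_score)
    """
    aln_lineage, identity = alignment
    # 1) A global alignment hit above the identity threshold takes priority.
    if identity > identity_threshold:
        return aln_lineage
    utax_lineage, utax_conf = utax
    sintax_lineage, sintax_conf = sintax
    shared = last_common_ancestor(utax_lineage, sintax_lineage)
    # 2) Conflict: the lineages diverge before the shorter one ends,
    #    so fall back to the last common ancestor.
    if len(shared) < min(len(utax_lineage), len(sintax_lineage)):
        return shared
    # 3) No conflict: keep the result with the higher confidence score.
    return utax_lineage if utax_conf >= sintax_conf else sintax_lineage


conflict = hybrid_taxonomy(
    (["Fungi"], 85.0),
    (["Fungi", "Ascomycota", "Sordariomycetes"], 0.8),
    (["Fungi", "Ascomycota", "Leotiomycetes"], 0.9),
)
print(conflict)
```

In the conflicting case, the two classifiers disagree at Class rank, so the sketch returns the assignment only down to the phylum. This shows how a last-common-ancestor rule trades taxonomic resolution for fewer misassignments.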
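The two curation filters can be expressed as a short Python sketch. The function name `curate`, the dictionary-based table layout, and the toy counts are hypothetical; the sketch also assumes that the 10-read threshold applies to the total read count of an OTU/ASV across all samples, which the text does not specify.

```python
def curate(otu_table, kingdom, min_reads=10):
    """Keep OTUs/ASVs assigned to Fungi with at least `min_reads` total reads.

    otu_table: dict mapping OTU id -> list of per-sample read counts.
    kingdom: dict mapping OTU id -> kingdom label (None if undefined).
    """
    kept = {}
    for otu_id, counts in otu_table.items():
        if kingdom.get(otu_id) != "Fungi":
            continue  # non-fungal or kingdom-undefined
        if sum(counts) < min_reads:
            continue  # fewer than `min_reads` reads in total (assumption)
        kept[otu_id] = counts
    return kept


# Toy table (hypothetical values): three samples per OTU.
otu_table = {
    "OTU1": [12, 0, 3],
    "OTU2": [4, 5, 0],
    "OTU3": [50, 20, 1],
    "OTU4": [30, 0, 0],
}
kingdom = {"OTU1": "Fungi", "OTU2": "Fungi", "OTU3": "Viridiplantae", "OTU4": None}
curated = curate(otu_table, kingdom)
print(curated)
```

In this toy example only OTU1 survives: OTU2 has nine reads in total, OTU3 is assigned to a non-fungal kingdom, and OTU4 has no kingdom assignment.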
Database comparisons
Sequences obtained from fungal strains isolated in culture were used as a reference to assess how accurately the HTS methodologies detect and describe the fungal taxa present in the sapwood of the Nothofagus species under study. To achieve this, the ITS sequences obtained from cultures by Molina et al. (2020) were compared, using the BLAST algorithm (Altschul et al., 1997), to the datasets obtained from the PIPITS, AMPtk-UPARSE, and AMPtk-DADA2 pipelines. The comparisons were performed in Geneious Prime (v2020.1.1; Biomatters Ltd, https://www.geneious.com/).
The output FASTA files from each pipeline were used as databases, and all OTUs/ASVs with a similarity above 97% were reported. The culture-based dataset that was blasted against them consisted of 72 sequences of the full ITS region of rDNA (ITS1, the intervening 5.8S rRNA gene, and ITS2), obtained through Sanger sequencing from isolates of the phyla Basidiomycota, Ascomycota, and Mucoromycota collected at the same study sites following the same sampling methods. These taxa are known to be present in the studied system as a result of the previous culture prospection study (Molina et al., 2020). The sensitivity of each pipeline in detecting the recorded taxa was evaluated, as well as its precision in taxonomy assignment. Five parameters were defined to assess pipeline performance: a) the percentage of cultured taxa that matched at least one OTU/ASV defined bioinformatically from HTS (as an estimator of sensitivity); b) the percentage of cultured taxa that were detected as the same OTU/ASV by the pipeline algorithms (i.e., different cultured taxa that were merged by the culture-independent approach); c) the percentage of cultured taxa that were assigned a wrong taxonomy by a pipeline; d) the average number of OTUs/ASVs that matched a cultured taxon; and e) the maximum number of OTUs/ASVs that matched a single cultured taxon in the HTS pipeline (the last two parameters assess redundancy in OTU/ASV clustering and post-clustering curation).
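As an illustration of how the five parameters could be derived from a table of BLAST matches, the following Python sketch computes them for one pipeline. All names and toy data are hypothetical. The sketch assumes that the percentages in a) to c) use all cultured sequences as the denominator, that d) is averaged over detected taxa only, and that a "wrong" taxonomy means a disagreement at any rank defined in both lineages.

```python
def lineages_conflict(lineage_a, lineage_b):
    """True if two lineages disagree at any rank they both define."""
    return any(a != b for a, b in zip(lineage_a, lineage_b))


def pipeline_performance(hits, cultured_lineage, otu_lineage):
    """Compute the five comparison parameters for one pipeline.

    hits: dict cultured id -> list of matching OTU/ASV ids (may be empty).
    cultured_lineage, otu_lineage: dict id -> list of ranks, kingdom first.
    """
    n_total = len(hits)
    detected = {c: otus for c, otus in hits.items() if otus}

    # Count how many cultured taxa point at each OTU/ASV.
    otu_usage = {}
    for otus in detected.values():
        for otu in set(otus):
            otu_usage[otu] = otu_usage.get(otu, 0) + 1

    merged = [c for c, otus in detected.items()
              if any(otu_usage[o] > 1 for o in otus)]
    wrong = [c for c, otus in detected.items()
             if any(lineages_conflict(cultured_lineage[c], otu_lineage[o])
                    for o in otus)]
    n_matches = [len(set(otus)) for otus in detected.values()]

    return {
        "a_detected_pct": 100 * len(detected) / n_total,
        "b_merged_pct": 100 * len(merged) / n_total,
        "c_misassigned_pct": 100 * len(wrong) / n_total,
        "d_mean_matches": sum(n_matches) / len(n_matches) if n_matches else 0.0,
        "e_max_matches": max(n_matches, default=0),
    }


# Toy data (hypothetical): four cultured taxa and four OTUs.
hits = {"C1": ["OTU1"], "C2": ["OTU1"], "C3": ["OTU2", "OTU3", "OTU4"], "C4": []}
cultured_lineage = {
    "C1": ["Fungi", "Ascomycota"],
    "C2": ["Fungi", "Ascomycota"],
    "C3": ["Fungi", "Basidiomycota"],
    "C4": ["Fungi", "Mucoromycota"],
}
otu_lineage = {
    "OTU1": ["Fungi", "Ascomycota"],
    "OTU2": ["Fungi", "Basidiomycota"],
    "OTU3": ["Fungi"],
    "OTU4": ["Fungi", "Ascomycota"],
}
result = pipeline_performance(hits, cultured_lineage, otu_lineage)
print(result)
```

In the toy data, C1 and C2 are merged into OTU1, C3 is split across three OTUs (one of them with a conflicting phylum), and C4 goes undetected, so each parameter is exercised at least once.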
Statistical analyses
Differences in alpha diversity between the three pipelines were assessed by using the Friedman test and the Bonferroni test. Statistical analyses and graphics were performed in R (R Core Team, 2022) with the packages biomformat (v1.8.0; McMurdie & Paulson, 2016), phyloseq (v1.24.2; McMurdie & Holmes, 2013), ggplot2 (v3.3.3; Wickham, 2016), and vegan (v2.5.7; Oksanen et al., 2020).
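For readers unfamiliar with the Friedman test, the following Python sketch shows the underlying computation for a design in which each sample is analysed by three pipelines. It is an illustration only and not the R code used in this study. The function names and richness values are hypothetical, ties within a sample are assumed to be absent (so no tie correction is applied), and the p-value uses the chi-square approximation, which for three treatments has 2 degrees of freedom and a survival function of exp(-Q/2).

```python
import math


def friedman_statistic(blocks):
    """Friedman chi-square for a list of blocks (one row per sample).

    Each row holds one value per treatment (here, per pipeline).
    Assumes no tied values within a row, so no tie correction is applied.
    """
    n = len(blocks)
    k = len(blocks[0])
    rank_sums = [0.0] * k
    for row in blocks:
        # Rank the treatments within this sample (1 = lowest value).
        order = sorted(range(k), key=lambda j: row[j])
        for rank, j in enumerate(order, start=1):
            rank_sums[j] += rank
    sum_sq = sum(r * r for r in rank_sums)
    return 12.0 / (n * k * (k + 1)) * sum_sq - 3.0 * n * (k + 1)


def chi2_sf_df2(x):
    """Survival function of the chi-square distribution with 2 d.f."""
    return math.exp(-x / 2.0)


def bonferroni(p_values):
    """Bonferroni adjustment: multiply by the number of tests, cap at 1."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]


# Hypothetical richness per sample for (PIPITS, AMPtk-UPARSE, AMPtk-DADA2).
richness = [[120, 70, 65], [98, 60, 55], [150, 90, 80], [110, 75, 77]]
q = friedman_statistic(richness)
p = chi2_sf_df2(q)
print(q, p)
```

With real data, a library routine that handles ties would be preferable; the hand-written version is kept here only to make the rank-sum logic explicit.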
RESULTS
Bioinformatic pipelines comparisons
The sequencing experiment yielded a mean depth of 133 855 reads per sample (paired-end raw reads) and a total depth of 38 million raw reads.
With identical computing power, the AMPtk toolkit was much more time-efficient than PIPITS. The AMPtk-UPARSE pipeline took 79 minutes of total run time, and AMPtk-DADA2 took 205 minutes; this difference originated in the clustering algorithm of each pipeline. The PIPITS pipeline took 11 418 minutes, most of which was spent on the “pipits-funits” step: read dereplication, ITS extraction, and read re-inflation. The clustering method of PIPITS (119 minutes) was also more time-consuming than the AMPtk-UPARSE method (38 minutes).
OTU/ASV richness differed strongly between pipelines: PIPITS generated almost twice as many OTUs (14 647 OTUs) as the AMPtk pipelines [UPARSE (8 031 OTUs) and DADA2 (7 524 ASVs)], and the differentiation occurred mainly at the clustering step (Fig. 1). Differences in OTU/ASV richness between toolkits were also observed at the sample level (Fig. 2, above). Furthermore, richness per sample differed significantly between the two AMPtk pipelines (Wilcoxon test, p < 0.001), showing that the choice of clustering algorithm affects richness results. However, the rarefaction curves approached asymptotes for the three datasets when manually curated data were considered (Fig. 2, below).
Deleting OTUs/ASVs with low read abundance and poor taxonomic resolution improved the representation of the fungal community for the three pipelines tested. Manual curation reduced the differences in OTU/ASV richness between pipelines, although these remained significant (Fig. 1; Fig. 2, above). The reduction occurred because 34% of the final PIPITS OTUs lacked taxonomic assignment at the Kingdom level (against 20% from the AMPtk pipelines), which might indicate redundant OTU clustering and a lower performance of post-clustering curation methods in this pipeline. After these taxa were removed from the set, the taxonomic resolution did not differ significantly between pipelines, although PIPITS showed a slightly higher proportion of OTUs assigned to Class or lower taxonomic levels (48% against 37% and 36% in AMPtk-UPARSE and AMPtk-DADA2, respectively).
Comparison with culture prospection dataset
The sequence matches between the cultured and uncultured datasets, and their estimated identity percentages, are listed in Table 1. The pipelines did not detect certain cultured taxa in the HTS experiment. In that sense, PIPITS was the most sensitive pipeline, with 19% of the cultured taxa undetected, followed by AMPtk-DADA2 (24%) and AMPtk-UPARSE (28%) (Table 2). Cultured taxa were molecularly identified at genus or species rank in Molina et al. (2020), whereas the corresponding uncultured OTUs/ASVs were identified at higher taxonomic ranks: 40% and 47% of the matched OTUs/ASVs were assigned at Family level or higher in the AMPtk and PIPITS pipelines, respectively. Despite their low taxonomic resolution, the AMPtk pipelines made no incorrect taxonomic identifications (Table 2), whereas PIPITS exhibited 1.39% misassignment.
Clustering redundancy was low for the AMPtk pipelines: only 5.6% of the cultured taxa matched multiple OTUs/ASVs (maximum 3), resulting in a mean value of 1.10 OTUs and 1.12 ASVs per taxon for UPARSE and DADA2, respectively (Table 2). In contrast, PIPITS showed high redundancy in OTU clustering: 33.33% of the cultured taxa matched multiple OTUs, and the maximum number of matched OTUs for a single cultured taxon was 35, giving an average of 1.85 OTUs per taxon.
DISCUSSION
In recent years, HTS metabarcoding approaches have revolutionized fungal ecology, increasing our ability to assess biodiversity in a wide range of habitats (Alberdi et al., 2018). As an emerging technology, it has posed new methodological and theoretical challenges in terms of data processing and the integration of new knowledge with existing backgrounds. Despite the number of studies that compare the performance of the different tools developed for data processing, there is still no consensus on the most appropriate bioinformatics approach (Anslan et al., 2018; Pauvert et al., 2019). These studies use mock communities as comparison criteria to evaluate pipeline results. To our knowledge, this is the first study that assesses the performance of different automated bioinformatic toolkits in the characterization of natural fungal communities and that uses diversity data obtained by culture methods, from the same sites and trees, as comparison criteria. This is a key step to further characterize endophyte communities and their beta diversity patterns through metabarcoding methodologies.
Bioinformatic pipelines comparisons
When evaluating the performance of a bioinformatic pipeline for analyzing community data, important factors to consider include runtime, sensitivity, and precision. These parameters provide insight into the efficiency and accuracy of the pipeline and support informed decisions on which pipeline to apply, based on the study goals and capabilities. The runtime of a pipeline reflects its computational efficiency and speed in processing large volumes of sequencing data. The sensitivity of the pipeline is directly affected by the read filtering, clustering, and chimera detection steps. These steps influence the ability to capture the true fungal taxa present in the samples, because the more reads are removed from the dataset during filtering, the greater the risk of inadvertently eliminating taxa that exist in the biological community. Precision, on the other hand, measures the pipeline’s accuracy in identifying true fungal taxa without introducing false positives, and relies on stringent filtering and error correction methods, such as chimera removal, clustering, and taxonomy assignment. The choice of clustering algorithm affects precision: conservative algorithms create distinct clusters, reducing the risk of merging sequences from different taxa, whereas sensitive algorithms may capture more taxa but have a higher chance of including false positives. In this context, the trade-off between sensitivity and precision becomes evident (Weiss et al., 2016). Increasing sensitivity by relaxing filtering criteria may lead to a higher chance of false positives or artifactual taxa, whereas prioritizing precision through more stringent filtering may result in the loss of rare or low-abundance taxa (Baldrian et al., 2021).
In this study, the filtering steps resulted in differences in read recruitment, with the FASTX-toolkit applied in the PIPITS pipeline being less strict than the error trimming used in AMPtk. However, it is the clustering method that explains most of the differences between pipeline performances. The VSEARCH and USEARCH algorithms did not perform similarly: VSEARCH clustering resulted in almost double the OTU richness of UPARSE, even though it removed singletons before clustering. On the one hand, the PIPITS pipeline overestimates taxa, as evidenced by its high OTU redundancy (number of OTUs per taxon); that is, it showed less precision. On the other hand, the higher richness obtained with this approach partially reflects actual taxa in the fungal community that the other pipelines could not recover, as evidenced by the higher coverage of the cultured dataset reported for PIPITS (the proportion of cultured sequences that were matched). These results agree with previous studies that have pointed out that the VSEARCH clustering method is a more sensitive approach (Rognes et al., 2016). However, others have found similar richness and sensitivity between VSEARCH and USEARCH when applied in other pipelines/toolkits (Anslan et al., 2018; Pauvert et al., 2019). Here, we found that the USEARCH clustering method, as applied in the AMPtk toolkit, results in lower sensitivity. In fact, the AMPtk-UPARSE pipeline recovered the lowest proportion of cultured taxa (that is, it showed less sensitivity) and showed the best OTU redundancy parameters (that is, more precision).
The latter might be an effect of the LULU algorithm, which filtered 800 OTUs from the AMPtk-UPARSE pipeline. This tool was developed to detect taxa split into more than one OTU/ASV during clustering and to merge them; it is not available in the PIPITS toolkit.
The AMPtk-DADA2 pipeline offered the best balance between precision and sensitivity. It yielded low ASV redundancy against the cultured sequences and more than 76% coverage of the cultured dataset, while producing the lowest richness among the three pipelines. These results are consistent with other studies that found that the DADA2 pipeline achieves the closest characterization of mock community alpha diversity (Pauvert et al., 2019). DADA2 is a clustering-free method developed to enable result comparisons between studies and improve taxonomic resolution (Callahan et al., 2017). It evaluates reads at the sample level and combines the identification of ASVs with chimera detection and removal. This approach assumes that highly similar ASVs within the same sample represent errors when they occur in very low abundances. It allows the detection of single-nucleotide polymorphisms that may indicate different fungal species while reducing OTU redundancy (Callahan et al., 2016).
The taxonomy assignment lacks mycological accuracy for the system under study in both the PIPITS and AMPtk toolkits: up to 35% of the manually curated OTUs could not be resolved beyond Kingdom Fungi. This is a common issue in metabarcoding studies from diverse environments (Kirker et al., 2017; Purahong et al., 2019), especially when plant endophyte mycobiota are being assessed (James et al., 2020). In general, sequence-based identification depends on informative sequence databases (Costello et al., 2013). In particular, bioinformatic methods for taxonomy assignment, and especially machine learning approaches, are sensitive to the incompleteness of the reference databases because the algorithms perform better when there are multiple representatives for each group (Gdanetz et al., 2017). A current lack of knowledge about fungal diversity in certain environments and about entire fungal lineages keeps the public databases incomplete (Halwachs et al., 2017). Apart from the general database limitations that both toolkits faced, the PIPITS and AMPtk pipelines performed differently in taxonomy assignment. On the one hand, PIPITS resulted in a larger proportion of OTUs unassigned at the Kingdom rank, which might be due to its more relaxed algorithms for read filtering and post-clustering curation. On the other hand, this pipeline resulted in slightly better taxonomic resolution, meaning a larger proportion of assignments at Class rank or lower. However, it also showed a proportion of misassignments when validated against the cultured dataset. The AMPtk toolkit achieved lower resolution but with no such errors. Unlike PIPITS, which implements the RDP Classifier method, AMPtk uses an approach in which taxonomy is assigned through the consensus of three different methods.
The hybrid method is therefore a more conservative approach, which loses resolution by assigning taxonomy with a “last common ancestor” criterion but reduces errors.
Cultured taxa from our reference dataset are known to be present in the studied ecosystem; however, the HTS experiment did not recover a fraction of their sequences. The culture-dependent and culture-independent experiments were not carried out on the same samples (although the samples were taken from the same individuals in a year of culture prospection); therefore, some taxa could be absent from one of the approaches, especially if those taxa are rare. Nevertheless, some of the sequences absent from the pipeline outputs were from taxa reported at high frequency in the culture-prospection study (Molina et al., 2020). Furthermore, some of these genera, such as Ophiostoma or Arambarria, were not reported at all in the culture-independent approaches. Here, we might be witnessing the biases and limitations of the HTS metabarcoding approach and of the ITS amplicon itself. During total DNA extraction from an environmental sample, the cell wall properties of the different fungal taxa or types (Vesty et al., 2017) and the variable number of nuclei per cell across taxa (Roper et al., 2011) affect DNA recovery. Besides, it is well documented that markers differ in their capacity to recover OTUs/ASVs across fungal lineages (Tedersoo et al., 2015).
The Internal Transcribed Spacer region has been used as a universal fungal marker because of its multi-copy nature and its variation rate across lineages (Schoch et al., 2012). However, it has some drawbacks for HTS metabarcoding studies. For instance, the ITS region is highly variable in length (Schoch et al., 2014) and GC content (Wang et al., 2015); longer barcodes and those with higher GC content are reported to be difficult templates to amplify in NGS because of unequal competition for primers (Aird et al., 2011). In addition, longer barcodes are less likely to be recovered in short-read-based approaches, because quality drops towards the read tails and low-quality sequences are problematic to pair (Baldrian et al., 2021).
It is noteworthy, however, that the few studies that have compared the alpha diversity obtained by HTS approaches with that obtained by morphological studies found a much smaller overlap of taxa between approaches than the one reported here (Porter et al., 2008; Heine et al., 2021). Furthermore, the findings of this study align with previous assessments of the sensitivity of HTS workflows that employ mock community approaches (Pauvert et al., 2019), showing consistent results in terms of the percentage of recovered sequences. All this suggests that the sampling and sequencing efforts in this study were satisfactory.
The sequencing depth achieved here was higher than that of similar studies of wood fungal endophytes (Küngas et al., 2020; Migliorini et al., 2021). Sequencing depth is the most important variable in HTS experimental designs that aim to assess beta diversity (Smith & Peay, 2014). However, high sequencing depth increases the potential for cross-contamination and errors during sequencing (Baldrian et al., 2021). In this study, we combined a high sequencing depth with an approach to correct cross-contamination errors by using a synthetic mock community.
In summary, the AMPtk toolkit proved more precise than PIPITS in terms of false positives and taxonomy assignment. Both AMPtk pipelines performed similarly, but the pipeline that uses the DADA2 algorithm showed lower redundancy and higher sensitivity. AMPtk-DADA2 would therefore be the pipeline of choice for analyses of community patterns; however, PIPITS proved to be a more sensitive pipeline and should be considered in studies aiming at species detection.
DATA AVAILABILITY
Raw sequence reads are deposited in the Sequence Read Archive of the National Center for Biotechnology Information (BioProject ID: PRJNA785007).
Sanger DNA sequences from culture are available on GenBank (accessions MT076081-MT07685).