SOAPBarcode: Revealing arthropod biodiversity through assembly of Illumina shotgun sequences of PCR amplicons

Shanlin Liu, Yiyuan Li, Jianliang Lu, Xu Su, Min Tang, Rui Zhang, Lili Zhou, Chengran Zhou, Qing Yang, Yinqiu Ji, Douglas W. Yu, Xin Zhou*

*Corresponding author for this work

Research output: Contribution to journal, Article, peer-reviewed

47 Citations (Scopus)


Summary: Metabarcoding of mixed arthropod samples for biodiversity assessment has mostly been carried out on the 454 GS FLX sequencer (Roche, Branford, Connecticut, USA), because its long reads (≥400 bp) are believed to allow higher taxonomic resolution. The Illumina sequencing platforms, with their much higher throughput, could potentially reduce sequencing costs and improve sequence quality, but their shorter read length (typically <150 bp) has deterred their use in next-generation-sequencing (NGS)-based analyses of eukaryotic biodiversity, which often rely on standard barcode markers (e.g. COI, rbcL, matK, ITS) that are hundreds of nucleotides long.

We present a new Illumina-based pipeline to recover full-length COI barcodes from mixed arthropod samples. Our new assembly program, SOAPBarcode, a variant of the genome assembly program SOAPdenovo, uses paired-end reads of the standard COI barcode region as anchors to extract the correct pathways (sequences) out of otherwise chaotic 'de Bruijn graphs', which are caused by the presence of large numbers of COI homologs of high sequence similarity.

Two bulk insect samples of known species composition, previously analysed in a 454 metabarcoding study (Yu et al. 2012), are re-analysed with our pipeline. Compared to the Roche 454 results (c. 400-bp reads), our pipeline recovered full-length COI barcodes (658 bp) and 17-31% more species-level operational taxonomic units (OTUs) from the bulk insect samples, with fewer untraceable (novel) OTUs. On the other hand, our PCR-based pipeline also revealed higher rates of contamination across samples, due to Illumina's increased sequencing depth. On balance, the assembled full-length barcodes and increased OTU recovery rates resulted in more resolved taxonomic assignments and more accurate beta diversity estimation.

Together, the HiSeq 2000 and the SOAPBarcode pipeline can achieve more accurate biodiversity assessment at a much reduced sequencing cost in metabarcoding analyses. However, greater precaution is needed to prevent cross-sample contamination during field preparation and laboratory operation, because of the greater ability to detect non-target DNA amplicons present in low copy numbers.
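
The anchoring idea described in the summary can be illustrated with a toy example. The sketch below is not SOAPBarcode's actual implementation; the function names (`build_graph`, `paths_between`), the k-mer size, and the two short sequences are hypothetical and chosen only for illustration. It assumes anchors are already oriented on the same strand and that the graph is small enough for exhaustive search. Two "species" share an identical middle region, so a de Bruijn graph built from their shotgun fragments admits chimeric walks; restricting the search to start and end nodes taken from the same read pair keeps only the true sequences.

```python
from collections import defaultdict

K = 5  # toy k-mer size (assumption; real assemblers use larger k)

def build_graph(reads, k):
    """Map each (k-1)-mer to the set of (k-1)-mers that follow it in any read."""
    graph = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].add(kmer[1:])
    return graph

def paths_between(graph, start, end, max_len):
    """Return every sequence spelled by a walk from node `start` to node `end`."""
    found = []
    stack = [(start, start)]
    while stack:
        node, seq = stack.pop()
        if node == end:
            found.append(seq)
            continue
        if len(seq) >= max_len:  # length bound guards against cycles
            continue
        for nxt in graph.get(node, ()):
            stack.append((nxt, seq + nxt[-1]))
    return found

# Two hypothetical homologs: different ends, identical middle region.
species_a = "ACGTT" + "GGATCCGATTACA" + "TTGCA"
species_b = "TCGAA" + "GGATCCGATTACA" + "CCGTA"

# Simulated shotgun fragments (10 bp windows) from both amplicons.
shotgun = [s[i:i + 10] for s in (species_a, species_b) for i in range(len(s) - 9)]
graph = build_graph(shotgun, K)

# Each anchor pair holds the two ends of ONE amplicon (toy paired-end reads).
anchor_pairs = [(s[:6], s[-6:]) for s in (species_a, species_b)]
starts = [fwd[:K - 1] for fwd, _ in anchor_pairs]
ends = [rev[-(K - 1):] for _, rev in anchor_pairs]

# Without pairing information, any start may be joined to any end.
unanchored = [p for s in starts for e in ends for p in paths_between(graph, s, e, 30)]

# With pairing information, only ends from the same read pair are joined.
anchored = [p for s, e in zip(starts, ends) for p in paths_between(graph, s, e, 30)]

print(len(unanchored), "walks without anchors;", len(anchored), "with anchors")
```

In this toy graph the unpaired search also returns two chimeras (the start of one species joined to the end of the other), whereas the paired search returns only the two input sequences.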

Original language: English
Pages (from-to): 1142-1150
Number of pages: 9
Journal: Methods in Ecology and Evolution
Issue number: 12
Publication status: Published - Dec 2013
Externally published: Yes


  • High-throughput sequencing
  • Metabarcoding
  • Next-generation-sequencing
  • Operational taxonomic units
  • Species richness
  • Standard barcode
  • phylogenetic diversity

