terminating contamination: large-scale search identifies ... · 1/26/2020  · terminator in refseq...

8
Terminating contamination: large-scale search identifies more than 2,000,000 contaminated entries in GenBank Martin Steinegger 1 and Steven L Salzberg 1 1 Center for Computational Biology, Johns Hopkins University, Baltimore * Metagenomic sequencing allows researchers to investigate organisms sampled from their native environments by sequencing their DNA directly, and then quantifying the abundance and taxonomic composition of the organisms thus captured. How- ever, these types of analyses are sensitive to contamination in public databases caused by incorrectly labeled reference sequences. Here we describe Conterminator, an ef- ficient method to detect and remove incorrectly labelled sequences by an exhaustive all-against-all sequence comparison. Our analysis reports contamination in 114,035 se- quences and 2767 species in the NCBI Reference Sequence Database (RefSeq), 2,161,746 sequences and 6795 species in the GenBank database, and 14,132 protein sequences in the NR non-redundant protein database. Conterminator uncovers contamination in sequences spanning the whole range from draft genomes to “complete” model organ- ism genomes. Our method, which scales linearly with input size, was able to process 3.3 terabytes of genomic sequence data in 12 days on a single 32-core compute node. We believe that Conterminator can become an important tool to ensure the quality of reference databases with particular importance for downstream metagenomic analyses. Source code (GPLv3): https://github.com/martin-steinegger/conterminator INTRODUCTION The number of genomes in public and private reposi- tories has been skyrocketing for at least the past decade, primarily due to the rapidly dropping costs of sequencing. The public genome database GenBank, which is regularly synchronized with the EMBL and DDBJ databases, has been doubling in size roughly every 18 months [1]. These genomics databases provide a vital worldwide resource that has been driving new findings in biotechnology and medicine for nearly three decades. Draft genomes consisting of hundreds to thousands of unordered DNA sequence fragments represent a large fraction of the over 500,000 genomes stored in GenBank [2]. Some of these fragments contain foreign DNA due to contamination from reagents, laboratory materials, sam- ple processing artifacts, or cross-contamination from mul- tiplexed sequencing runs (Fig. 1a). These contaminating sequences may cause a variety of problems, including in- correct labels on sequences in metagenomic studies [3], faulty conclusions about horizontal gene transfer [4, 5], or poor annotation quality of genomes [6]. To combat the contamination issue, NCBI (the home of GenBank) applies two filtering protocols for detection of contaminated fragments. First, VecScreen [7] is used to detect synthetic sequences (vectors, adapters, link- ers, primers, etc.), and second, BLAST alignments [8] against common contaminants identifies a broader array of contaminating sequences. Despite these filters, con- tamination still occurs, and its detection remains chal- lenging [9, 10]. * [email protected] Because humans are always present in sequencing labs, Homo sapiens continues to be a major source of contam- ination for genome projects. Contaminating pieces of human DNA occasionally remain in published genomes [11] despite automated searches. A recent study, for ex- ample, showed that thousands of human DNA fragments can be found in draft bacterial genomes, and that many of these have been erroneously translated and annotated as proteins [10]. However, many other species also cause contamination [12–16]. Systematic approaches to detect contamination are limited by computational costs of com- paring every submitted genome against all other known genomes. For example, a BLAST all-against-all com- parison of the RefSeq database [17], which has a size of 1.5 Tb, would take 30,000 CPU years. Faster align- ment methods such as Minimap2 [18] or Bowtie2 [19] will take less time, but will still suffer from the quadratic complexity of this comparison. We present Conterminator (Fig. 1b) a fast method for detecting contamination in nucleotide and protein databases by computing local alignments across taxo- nomic kingdoms. It utilizes the linear-time all-against-all comparison algorithm from Linclust [20] followed by ex- haustive alignments using MMseqs2 [21]. This enables us to process huge nucleotide and protein sequence sets on a single server. We applied this method to quantify the cur- rent state of contamination in the nucleotide databases Genbank [1] and RefSeq [17], and in the comprehensive NR protein database [1]. . CC-BY-NC-ND 4.0 International license available under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprint this version posted January 26, 2020. ; https://doi.org/10.1101/2020.01.26.920173 doi: bioRxiv preprint

Upload: others

Post on 23-Sep-2020

10 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Terminating contamination: large-scale search identifies ... · 1/26/2020  · terminator in RefSeq (Fig. 2a,b) and GenBank (Fig. 2c,d). Processing the 1:5 and 3:3 terabytes in RefSeq

Terminating contamination: large-scale search identifies

more than 2,000,000 contaminated entries in GenBank

Martin Steinegger1 and Steven L Salzberg1

1Center for Computational Biology, Johns Hopkins University, Baltimore∗

Metagenomic sequencing allows researchers to investigate organisms sampled fromtheir native environments by sequencing their DNA directly, and then quantifyingthe abundance and taxonomic composition of the organisms thus captured. How-ever, these types of analyses are sensitive to contamination in public databases causedby incorrectly labeled reference sequences. Here we describe Conterminator, an ef-ficient method to detect and remove incorrectly labelled sequences by an exhaustiveall-against-all sequence comparison. Our analysis reports contamination in 114,035 se-quences and 2767 species in the NCBI Reference Sequence Database (RefSeq), 2,161,746sequences and 6795 species in the GenBank database, and 14,132 protein sequences inthe NR non-redundant protein database. Conterminator uncovers contamination insequences spanning the whole range from draft genomes to “complete” model organ-ism genomes. Our method, which scales linearly with input size, was able to process3.3 terabytes of genomic sequence data in 12 days on a single 32-core compute node.We believe that Conterminator can become an important tool to ensure the quality ofreference databases with particular importance for downstream metagenomic analyses.Source code (GPLv3): https://github.com/martin-steinegger/conterminator

INTRODUCTION

The number of genomes in public and private reposi-tories has been skyrocketing for at least the past decade,primarily due to the rapidly dropping costs of sequencing.The public genome database GenBank, which is regularlysynchronized with the EMBL and DDBJ databases, hasbeen doubling in size roughly every 18 months [1]. Thesegenomics databases provide a vital worldwide resourcethat has been driving new findings in biotechnology andmedicine for nearly three decades.

Draft genomes consisting of hundreds to thousandsof unordered DNA sequence fragments represent a largefraction of the over 500,000 genomes stored in GenBank[2]. Some of these fragments contain foreign DNA due tocontamination from reagents, laboratory materials, sam-ple processing artifacts, or cross-contamination from mul-tiplexed sequencing runs (Fig. 1a). These contaminatingsequences may cause a variety of problems, including in-correct labels on sequences in metagenomic studies [3],faulty conclusions about horizontal gene transfer [4, 5],or poor annotation quality of genomes [6].

To combat the contamination issue, NCBI (the homeof GenBank) applies two filtering protocols for detectionof contaminated fragments. First, VecScreen [7] is usedto detect synthetic sequences (vectors, adapters, link-ers, primers, etc.), and second, BLAST alignments [8]against common contaminants identifies a broader arrayof contaminating sequences. Despite these filters, con-tamination still occurs, and its detection remains chal-lenging [9, 10].

[email protected]

Because humans are always present in sequencing labs,Homo sapiens continues to be a major source of contam-ination for genome projects. Contaminating pieces ofhuman DNA occasionally remain in published genomes[11] despite automated searches. A recent study, for ex-ample, showed that thousands of human DNA fragmentscan be found in draft bacterial genomes, and that manyof these have been erroneously translated and annotatedas proteins [10]. However, many other species also causecontamination [12–16]. Systematic approaches to detectcontamination are limited by computational costs of com-paring every submitted genome against all other knowngenomes. For example, a BLAST all-against-all com-parison of the RefSeq database [17], which has a size of1.5 Tb, would take ≈ 30,000 CPU years. Faster align-ment methods such as Minimap2 [18] or Bowtie2 [19]will take less time, but will still suffer from the quadraticcomplexity of this comparison.

We present Conterminator (Fig. 1b) a fast methodfor detecting contamination in nucleotide and proteindatabases by computing local alignments across taxo-nomic kingdoms. It utilizes the linear-time all-against-allcomparison algorithm from Linclust [20] followed by ex-haustive alignments using MMseqs2 [21]. This enables usto process huge nucleotide and protein sequence sets on asingle server. We applied this method to quantify the cur-rent state of contamination in the nucleotide databasesGenbank [1] and RefSeq [17], and in the comprehensiveNR protein database [1].

.CC-BY-NC-ND 4.0 International licenseavailable under a(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made

The copyright holder for this preprintthis version posted January 26, 2020. ; https://doi.org/10.1101/2020.01.26.920173doi: bioRxiv preprint

Page 2: Terminating contamination: large-scale search identifies ... · 1/26/2020  · terminator in RefSeq (Fig. 2a,b) and GenBank (Fig. 2c,d). Processing the 1:5 and 3:3 terabytes in RefSeq

2

Species DNA extraction Sequencing Assembly

Contigs

Scaffolds

Submission High quality

GenBank RefSeq

Reads

Metazoa

Bacteria & Archaea

a

b

(2) Sort by k-mer andpick first as center

(3) Cross kingdom alignment

(4) Compute contig length and predict contamination

Flanking region

Scaffolded intothe genome

Single contig

Contamination type*

*

*

*

*

*

(1) Split sequences and extract k-mers

FIG. 1. How contamination occurs and how Conterminator detects it. a, DNA extraction from an organism (red)is imperfect and often introduces contamination by other species (violet). DNA sequencing then generates short reads thatare assembled into longer contigs. Contaminated DNA is typically assembled into separate, small contigs, but sometimes iserroneously included in the same contigs as DNA from source organism. Contigs may also be linked by scaffolding, which canproduce scaffolds containing a mixture of different species. Final assemblies are submitted to GenBank, and higher-qualityassemblies are entered in RefSeq.b, Conterminator detects contamination in proteins and nucleotide sequences across kingdoms; e.g., bacterial contaminants inplant genomes. The following describes the nucleotide contamination detection workflow. (1) We take taxonomically labeledinput sequences and cut them into non-overlapping segments of length 1000 and extract a subset of k-mers. (2) We group thek-mers by sorting them and compute ungapped alignments between the first and all succeeding sequences per group. (3) Weextract each region of the first sequence that has an alignment to other kingdoms that is longer than 100 amino acids (residues)with a sequence identity greater than 90 %. We perform an exhaustive alignment of the input sequence segments against themulti-kingdom regions. We offset the alignment’s start and end position to the respective coordinates in the input sequence.(4) We reconstruct contig lengths within scaffolds by searching for the scaffold breakpoints (indicated by N characters in theDNA sequence) on the left and right side from the alignment start and end position. We predict that contamination is presentif an alignment hits a contig that is shorter than 20 kb that aligns to a different kingdom with an alignment length longer than20 kb.

RESULTS

Figure 2 summarizes the contamination found by Con-terminator in RefSeq (Fig. 2a,b) and GenBank (Fig.2c,d). Processing the 1.5 and 3.3 terabytes in RefSeqand GenBank took 5 and 12 days on a single 32-coremachine with 2 terabytes of main memory. Contermi-nator reported 114,035 and 2,161,746 contaminated se-quences affecting 2767 species and 6795 in RefSeq andGenBank respectively. Identifiers of the contaminatedsequences are available in Supplemental Files 1&2. InGenBank, over 95 % of contamination occurred in eu-karyotic genomes. Eukaryotic genomes tend to be muchmore fragmented due to their larger genome sizes andhigher repetitive content (as compared to prokaryotes),and many of the smaller contigs in eukaryotic genomeassemblies suffer from contamination.

In RefSeq only 52 % of the contamination occurredin eukaryotic genomes. One likely reason for this isthe more stringent filters used to determine which Gen-

Bank genomes are included in RefSeq; these filters rejectgenomes with very low contig sizes or genomes that wereflagged as contaminated. The number of species identi-fied as contaminants (i.e., the species causing contamina-tion) in RefSeq was 2881 and in GenBank the number was13,981. The leading contaminant species are Homo sapi-ens, Saccharomyces cerevisiae, Stenotrophomonas mal-tophilia and Serratia marcescens (see SupplementaryFig. 1).

Contamination in high-quality genomes. We ex-pected that well-studied model organisms would have thehighest-quality genomes, and that these genomes wouldhave very little, if any, contamination. We also expectedvery little contamination in finished microbial genomes.Therefore, we created a control set of high-qualitygenomes consisting of 928 genomes from FDA-ARGOS,a curated set of complete microbial genomes [22], plusgenomes for model organisms Saccharomyces cerevisiae,Danio rerio, Mus musculus, Drosophila melanogaster,

.CC-BY-NC-ND 4.0 International licenseavailable under a(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made

The copyright holder for this preprintthis version posted January 26, 2020. ; https://doi.org/10.1101/2020.01.26.920173doi: bioRxiv preprint

Page 3: Terminating contamination: large-scale search identifies ... · 1/26/2020  · terminator in RefSeq (Fig. 2a,b) and GenBank (Fig. 2c,d). Processing the 1:5 and 3:3 terabytes in RefSeq

3

FIG. 2. Results ofcontamination withinthe RefSeq. a Distri-bution of contaminatedspecies in RefSeq acrossfive kingdoms: Bacte-ria&Archaea (violet),Fungi (yellow), Metazoa(red), Viridiplantae(green) and other Eu-karyotes (turquoise). bSankey plot of the top13 contaminated speciesin RefSeq. We show thetaxonomic ranks domain,kingdom, phylum andspecies. Numbers showabove each taxonomicnode indicate the totalnumber of contaminatedsequences. The tree usesthe same color code forkingdoms as in a. c,d Same as a,b but forGenBank.

a 100

90

80

70

60

50

40

30

20

10

0

Pe

rce

nt

of

co

nta

min

ate

d s

pe

cie

s in

Re

fSe

q Kingdoms

Bacteria&Archaea

Fungi

Metazoa

Viridiplantae

Other Eukaryotes

b

c 100

90

80

70

60

50

40

30

20

10

0Kingdoms

Pe

rce

nt

of

co

nta

min

ate

d s

pe

cie

s in

Ge

nB

an

k

d

Arabidopsis thaliana, Caenorhabditis elegans and Homosapiens. We searched for (presumably) false positive pre-dictions by scanning our RefSeq results for contaminantsin these high-quality genomes. Initially, our method didnot report any contamination for any of these genomes,in part because by default it only reports contaminationwhen the target sequence is shorter than 20 kb (see Meth-ods). We then considered alignments to sequences longerthan 20 kb in this high-quality genome set. In this ad-ditional scan, we found alignments between bacterial se-quences and two eukaryotes: (1) Acidithiobacillus thioox-idans in Homo sapiens and (2) E. coli in Caenorhabditiselegans, shown in Figure 3.

A. thiooxidans in human genomic sequence. Thehuman reference genome (currently GRCh38) consists ofchromosomal scaffolds, unplaced scaffolds, and “alter-nate” scaffolds. The last group is included in the ref-erence genome to represent sequences that are divergentfrom the primary chromosome sequence. In NT 187580,an alternate scaffold on chromosome 10 in GRCh38.p13,we detected a sequence matching Acidithiobacillus thioox-idans that spans positions 169,917–188,315 of the humanscaffold (Fig. 3a), which has a length of 188,315 basepairs. 15 different A. thiooxidans genomes align to the

5,912K

C. elegans X Chromosome NC_003284.9

E. coli CP023834.1 E. coli AP019388.1 E. coli LT838196.1

5,910K5,908K 5,909K 5,911K5,907K 5,913Kb

170K150K125KGRC38 Alternative scaffold NT_187580GRC38 Chromosome 10 NC_000010.10

NZ_JMEB01000072NZ_AZMO01000002NZ_LWSD01000023NZ_LWSA01000031NZ_LWSA01000031

NZ_LGYM01000035

a

Acidithiobacillusthiooxidans

C. elegans X Chromosome UNSB01000006.1

FIG. 3. Contamination in the reference genomes ofHomo Sapiens and Caenorhabditis elegans. a Align-ment of Homo sapiens alternative scaffold NT 187580 of chro-mosome 10 against RefSeq. Chromosome 10 (NC 000011.10)aligns with 100 % sequence identity from position 1 to169,918. The remaining 18,397 residues of NT 187580 alignonly to Acidithiobacillus thiooxidans at 98 % sequence iden-tity. Shown are only 6 out of 15 alignments to Acidithiobacil-lus thiooxidans. b The X chromosome of Caenorhabditis ele-gans NC 003284.9 aligns on the left and right flanking positionaround 5,907,856 until 5,912,458. E. coli genomes aligns from5,907,856 to 5,912,087, a total of 4231 residues. Shown areonly 3 out of 8199 alignments to E. coli.

.CC-BY-NC-ND 4.0 International licenseavailable under a(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made

The copyright holder for this preprintthis version posted January 26, 2020. ; https://doi.org/10.1101/2020.01.26.920173doi: bioRxiv preprint

Page 4: Terminating contamination: large-scale search identifies ... · 1/26/2020  · terminator in RefSeq (Fig. 2a,b) and GenBank (Fig. 2c,d). Processing the 1:5 and 3:3 terabytes in RefSeq

4

contaminated portion of the scaffold. The primary se-quence of human chromosome 10 aligns perfectly frompositions 1 to 169,918 on the alternate scaffold, but thatalignment stops at the region that aligns to A. thioox-idans. Thus, the last ∼18 kb of this human alternativescaffold appears to be bacterial.

E. coli in the C. elegans reference genome. Ourmethod also detected a bacterial contaminant in the C.elegans reference genome, in chromosome X (GenBankaccession NC 003284.9). A segment spanning positions5,907,856–5,912,087 of the C. elegans sequence alignsperfectly to multiple strains of E. coli (Fig. 3b). Tocheck whether this might be a false positive reported byour method, we downloaded the raw Illumina reads usedfor the assembly of this C. elegans strain (SRR003808 andSRR003809) and aligned them against the chromosomeX assembly (NC 003284.9) using Bowtie2 [19]. Only sixreads (30 bp each) aligned in this region. In contrast, theaverage coverage over the rest of the chromosome was∼99.8. This indicates that the E. coli sequence was in-deed a contaminant. To corroborate the contaminationfurther, we looked at a recent assembly of C. elegansthat used a combination of long and short reads [23]. Wealigned their assembly of chromosome X (GenBank ac-cession UNSB01000006.1¸ ) against the current reference,and found that in this newer assembly, the E. coli regionis not present. This strongly suggests that the C. ele-gans reference genome contains a ∼4 kb insertion of E.coli contamination.

Meleagris gallopavo genome cleanup The most con-taminated genome in RefSeq, on our initial scan, wasthe turkey genome, Meleagris gallopavo [24]. The con-taminants included 6698 small, unplaced scaffolds with atotal size of 2,655,271 bases. More than half of the con-taminations were caused by Achromobacter xylosoxidansand Serratia marcescens. We contacted the original au-thors of that assembly to communicate our findings, andthey subsequently removed all contaminated fragments,plus an additional 39,413 contigs that were shorter than300 bp. The new version of the assembly, Turkey 5.1,has no contaminants and is available in GenBank as ac-cession GCA 000146605.4.

Proteins in contaminated RefSeq contigs We de-tected that 19.4 % of the contaminated RefSeq contigscontain protein annotations and encode a total of 47,943proteins. A previous study [10] reported 3437 spuri-ous bacterial proteins that originate from human repeatsthat have contaminated bacterial genome assemblies. Wealigned these sequences against our set using MMseqs2,enforcing a 80 % alignment coverage of the shorter se-quence (--comp-bias-corr 0 --mask 0 --cov-mode 5-c 0.8), and discovered that our set contains 62 % of thepreviously-reported proteins.

We clustered the proteins using MMseqs2 at a 95 %sequence identity, enforcing a bi-directional coverage of95 % (cluster --min-seq-id 0.95 -c 0.95). This re-sulted in 3339 clusters that covered 12,494 sequences.

The remaining sequences were singleton clusters. Thelargest cluster consists of 185 bacterial proteins, all ofwhich are located on contigs shorter than 1 kb, and theproteins are widely spread among multiple phyla in thebacterial kingdom. Despite the long evolutionary dis-tance, 166 of the sequences are 100 % identical to eachother and the remaining are at least 95 % identical, sug-gesting that all of them represent contaminants (see Fig.4). All contigs align to multiple positions in the domesticsheep Ovis aries genome; the fragments align to chromo-some 15 (NC 040266.1) with a sequence identity greaterthan 94 % with nearly complete coverage.

10 20 30 40 50 60CoriobacterialesMycobacteroidesEscherichiaStreptococcusLactobacillusEnterococcusLegionellaOenococcusParabacteroidesStaphylococcusBordetellaNeisseriaBacteroidesListeriaAlloscardoviaMycobacteriumVeillonellaceaeLactococcusCollinsellaKlebsiellaClostridioidesHelicobacterParascardoviaElusimicrobiumActinomycesHaemophilusArcobacterSelenomonasEnterobacterSalmonellaRhodococcus

MGLISPGGQNLLDFLELRQVLSTYDGALRDPLWWPQERPVPMRVPRGPFGIPLPLMPGPKTLCELRAGMGLISPGGQNLLDFLELRQVLSTYDGALRDPLWWPQERPVPMRVPRGPFGIPLPLMPGPKTLCELRAGMGLISPGGQNLLDFLELRQVLSTYDGALRDPLWWPQERPVPMRVPRGPFGIPLPLMPGPKTLCELRAGMGLISPGGQNLLDFLELRQVLSTYDGALRDPLWWPQERPVPMRVPRGPFGIPLPLMPGPKTLCELRAGMGLISPGGQNLLDFLELRQVLSTYDGALRDPLWWPQERPVPMRVPRGPFGIPLPLMPGPKTLCELRAGMGLISPGGQNLLDFLELRQVLSTYDGALRDPLWWPQERPVPMRVPRGPFGIPLPLMPGPKTLCELRAGMGLISPGGQNLLDFLELRQVLSTYDGALRDPLWWPQERPVPMRVPRGPFGIPLPLMPGPKTLCELRAGMGLISPGGQNLLDFLELRQVLSTYDGALRDPLWWPQERPVPMRVPRGPFGIPLPLMPGPKTLCELRAGMGLISPGGQNLLDFLELRQVLSTYDGALRDPLWWPQERPVPMRVPRGPFGIPLPLMPGPKTLCELRAGMGLISPGGQNLLDFLELRQVLSTYDGALRDPLWWPQERPVPMRVPRGPFGIPLPLMPGPKTLCELRAGMGLISPGGQNLLDFLELRQVLSTYDGALRDPLWWPQERPVPMRVPRGPFGIPLPLMPGPKTLCELRAGMGLISPGGQNLLDFLELRQVLSTYDGALRDPLWWPQERPVPMRVPRGPFGIPLPLMPGPKTLCELRAGMGLISPGGQNLLDFLELRQVLSTYDGALRDPLWWPQERPVPMRVPRGPFGIPLPLMPGPKTLCELRAGMGLISPGGQNLLDFLELRQVLSTYDGALRDPLWWPQERPVPMRVPRGPFGIPLPLMPGPKTLCELRAGMGLISPGGQNLLDFLELRQVLSTYDGALRDPLWWPQERPVPMRVPRGPFGIPLPLMPGPKTLCELRAGMGLISPGGQNLLDFLELRQVLSTYDGALRDPLWWPQERPVPMRVPRGPFGIPLPLMPGPKTLCELRAGMGLISPGGQNLLDFLELRQVLSTYDGALRDPLWWPQERPVPMRVPRGPFGIPLPLMPGPKTLCELRAGMGLISPGGQNLLDFLELRQVLSTYDGALRDPLWWPQERPVPMRVPRGPFGIPLPLMPGPKTLCELRAGMGLISPGGQNLLDFLELRQVLSTYDGALRDPLWWPQERPVPMRVPRGPFGIPLPLMPGPKTLCELRAGMGLISPGGQNLLDFLELRQVLSTYDGALRDPLWWPQERPVPMRVPRGPFGIPLPLMPGPKTLCELRAGMGLISPGGQNLLDFLELRQVLSTYDGALRDPLWWPQERPVPMRVPRGPFGIPLPLMPGPKTLCELRAGMGLISPGGQNLLDFLELRQVLSTYDGALRDPLWWPQERPVPMRVPRGPFGIPLPLMPGPKTLCELRAGMGLISPGGQNLLDFLELRQVLSTYDGALRDPLWWPQERPVPMRVPRGPFGIPLPLMPGPKTLCELRAGMGLISPGGQNLLDFLELRQVLSTYDGALRDPLWWPQERPVPMRVPRGPFGIPLPLMPGPKTLCELRAGMGLISPGGQNLLDFLELRQVLSTYDGALRDPLWWPQERPVPMRVPRGPFGIPLPLMPGPKTLCELRAGMGLISPGGQNLLDFLELRQVLSTYDGALRDPLWWPQERPVPMRVPRGPFGIPLPLMPGPKTLCELRAGMGLISPGGQNLLDFLELRQVLSTYDGALRDPLWWPQERPVPMRVPRGPFGIPLPLMPGPKTLCELRAGMGLISPGGQNLLDFLELRQVLSTYDGALRDPLWWPQERPVPMRVPRGPFGIPLPLMPGPKTLCELRAGMGLISPGGQNVLDFLELRQVLSTYDGALRDPLWWPQERPVPMRVPRGPFGIPLPLMPGPKTLCELRAGMGLISPGGQNLLDFLELRQVLSTYDGALRDPLWWPQERPVPLRVPRGPFGIPLPLMPGPKTLCELRAGMGLISPGGQNLLDFLELRQVLSTYDGALRDPLWWPQERPVPIRVPRGPFEIPLPLMPGPKTLCELRAG

FIG. 4. Multiple sequence alignment of 31 spuri-ous bacterial proteins encoded on short contaminatedcontigs Shown here are 31 out of 185 spurious proteins frombacterial genomes. A majority of the sequences are 100 %identical. The only differing residues are highlighted in white.This highly conserved “protein” is conserved on across differ-ent bacterial phyla, suggesting it is likely a contaminant thathas been erroneously translated as part of automated annota-tion procedures. The respective short contigs (< 1 kb) encod-ing these spurious proteins align with high sequence identityand coverage to the Ovis aries genome.

Contamination in the protein database NR Con-terminator can be used to analyze protein sequences. Itclusters proteins [20] at 95 % sequence identity, while re-quiring at least 95 % sequence overlap. It reports clusterscontaining multiple kingdoms, using the same kingdomdefinition as for the nucleotide comparison. We predictthat the kingdom with fewer members in the cluster iscontaminated; e.g., if a cluster contains 100 proteins, and99 represent animals while 1 represents bacteria, then thebacterial protein likely originates from a contaminatedgenome.

We analyzed the NCBI NR protein sequence [1]database using this procedure. We predicted 14,132 pro-teins to represent contaminants, out of which 7359 arealso present in the Uniprot database [25]. The major-ity of these proteins (70.46 %) are eukaryotic, and theremaining 29.34 % are bacterial (see Supplementary Fig.

.CC-BY-NC-ND 4.0 International licenseavailable under a(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made

The copyright holder for this preprintthis version posted January 26, 2020. ; https://doi.org/10.1101/2020.01.26.920173doi: bioRxiv preprint

Page 5: Terminating contamination: large-scale search identifies ... · 1/26/2020  · terminator in RefSeq (Fig. 2a,b) and GenBank (Fig. 2c,d). Processing the 1:5 and 3:3 terabytes in RefSeq

5

2). Over 6114 contaminant proteins originate from thephylum Arthropoda, and out of this 2401 are from thespecies Trichonephila clavipes, the golden silk orb weaverspider, which contributes overall the most contamination.We next take a closer look at this organism.

Contaminated proteins in Trichonephila clavipes.The T. clavipes proteins identified as possible contam-inants are distributed across 504 contigs with a totallength of 24,152,567 residues. 163 of the contigs arelonger than 20 kb. The longest contaminated contigfrom the assembly [26] is MWRG01000001, which spans1,655,743 bases and encodes 490 proteins. We found 219of these proteins in contaminated protein clusters, a ma-jority of them matching the bacterium Gemmobacter sp.YJ-T1-11. We aligned the contig against the assembly ofGemmobacter sp. YJ-T1-11, and found that it is nearly90 % covered at a 97.76 % identity (see SupplementaryFig. 3). Thus, this clearly appears to be a bacterial con-tig mistakenly included in the assembly of the T. clavipesspider.

DISCUSSION

We present two complementary approaches to detectcontamination, first using nucleotide information to de-tect short contaminated fragments, and second usingprotein-based analysis to reveal long contaminated con-tigs. We used a conservative approach that only consid-ered a sequence to be a contaminant if it had a near-identical match to a species in an entirely different king-dom from the source; e.g., a bacterial sequence foundin an animal genome or vice versa. Our method can effi-ciently detect contamination in large reference databases,and we found that a substantial fraction of the genome se-quences in both GenBank and RefSeq (0.54 % and 0.34 %of entries respectively) appear to be contaminated. Con-tamination occurs mostly as short contigs, flanking re-gions on longer contigs, or regions of larger scaffoldsflanked by Ns, but we also observed a few longer sequenceswith contamination.

Contamination can be transferred into other databasesthat are built from GenBank, such as the proteindatabases NR and Uniprot. Methods that rely on tax-onomical classification, particularly metagenomics anal-yses, are strongly affected by cross-kingdom contamina-tion because they often rely on subsets, e.g. microbialsequences extracted from a larger database. This makesit more difficult for such methods to detect contamina-tion of the type reported here.

With the rapid and ongoing increase in the numberof novel genomes sequenced every year, the number andvariety of contaminating sequences continues to increaseas well, presenting challenges for alignment-based meth-ods to detect contamination. Conterminator’s efficiencymeans that it can be used routinely to detect new con-tamination, even on the largest databases.

METHODS

Conterminator detects cross-kingdom alignments andpredicts contamination. It builds upon existing modulesof MMseqs2 [21], which it extends for use in contamina-tion detection.

Detection of cross-kingdom alignments. Contermi-nator identifies regions in genome sequences that align togenomes from other kingdoms with a minimum length ofat least 100 nucleotides and a sequence identity thresh-old of at least 90 %. With very few exceptions, DNAsequences from different kingdoms should not be alignedat all, and sequences that match at this level of identityare strong candidates for contaminants. (Exceptions tothis rule include recent horizontal gene transfer events,but these are very rare.)

Because modern sequence databases are very large, wecannot use a naive all-against-all alignment, which wouldentail a quadratic number of comparisons. Therefore, weused a similar strategy to Linclust [20] to reduce the com-putational cost to a linear number of comparisons. Wereworked the algorithm to support nucleotide sequences,since Linclust was originally built to cluster protein se-quences.

Conterminator first cuts all sequences into fragmentsof length 1000 and records their start positions. Foreach fragment, we extract m canonical k-mers (defaultm = 100) of length 24 with the lowest hash value andwrite them into an array. (We use the hash function de-fined in [20]; see Supplementary Figure 5.) We store thek-mer in 8 bytes, with the most significant bit indicat-ing whether the k-mer is reversed, sequence identifiers (4bytes), its length (2 bytes), and its position j in the ge-nomic sequence (2 bytes). We sort the array by k-mer,length and sequence identifiers. For each k-mer group weassign all sequences to the longest sequence with the low-est sequence identifier c by overwriting their k-mer withthe identifier of c and their position with the diagonali− j respecting the strand directionality. We sort the ar-ray again with the previous criteria so that all sequenceswith same assignment are in a consecutive block. Wewrite each block’s central sequence identifier, assignedsequence, strand and diagonal to hard disk while onlykeeping the diagonal with the most k-mer matches persequence.

We perform a one-dimensional dynamic program-ming ungapped alignment (using the MMseqs2 command‘rescorediagonal --rescore-mode 2’) on each diago-nal. We assign matches a score of 2 and mismatchesa score of −3 bits, and compute an E-value using ALP[27]. We compute the sequence identity by dividing thenumber of identical positions by the number of alignedpositions. We filter out all hits that are shorter than 100bases, or that have a sequence identity below 90 %, orthat have an E-value above 10−3. We compute the align-ment start positions by adding the start position of thefragment to the alignment coordinates (MMseqs2 com-mand ‘offsetalignment’).

.CC-BY-NC-ND 4.0 International licenseavailable under a(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made

The copyright holder for this preprintthis version posted January 26, 2020. ; https://doi.org/10.1101/2020.01.26.920173doi: bioRxiv preprint

Page 6: Terminating contamination: large-scale search identifies ... · 1/26/2020  · terminator in RefSeq (Fig. 2a,b) and GenBank (Fig. 2c,d). Processing the 1:5 and 3:3 terabytes in RefSeq

6

Based on the alignment, we extract the sequence in-tervals from c that are overlapped by different king-doms. We define five “kingdoms” based on the NCBITaxonomy [28]: (1) Bacteria & Archaea (taxonomy IDs2 and 2157), (2) Fungi (4751), (3) Metazoa (33208), (4)Viridiplantae (33090) and (5) all other eukaryotes. Weignored sequences from Viruses (10239), unclassified se-quences (12908), other sequences (28384), artificial se-quences (81077), environmental samples from bacteria,archaea and eukaryotes (61964, 48479, 48510). Note thatwe merged Bacteria and Archaea for simplicity. We ex-cluded viruses because some of them can integrate theirgenomes into other organisms, making it hard to distin-guish contamination from genuine artifacts of viral inte-gration.

Gather all alignments by exhaustive alignment.Our detection method might miss alignments because itdoes not extract all k-mers, and because it uses 24-mersrather than shorter k-mers. After the previous steps, weperform an exhaustive alignment of the sequence frag-ments against the extracted potential contamination se-quences (and their respective reverse complements) us-ing MMseqs2. The search is performed by using thetwo modules prefilter and rescorediagonal. Theprefilter program masks out low complexity regionsand short tandem repeats in the potential contaminantsusing tantan [29] and detects all consecutive double 15-mers diagonal hits. We rescore the detected diagonalsagain with the rescorediagonal module, enforcing aminimal alignment length of 100 and a minimal sequenceidentity of

√0.9. The square root of the sequence iden-

tity ensures that no pair of sequences is greater than 90 %different from each other.

Predict contig length by find scaffolding bound-aries. Genome assembly programs create scaffolds byordering and orienting contigs using a variety of types oflinking information, such as paired-end reads. A scaffoldthus consists of a sequence of contigs, usually separated(in many GenBank entries) by Ns to indicate the scaf-folding boundaries. Some of the contaminants that weidentified appear as short contigs in the midst of a longerscaffold, and we can identify these by finding the flank-ing Ns. It is important for our contamination detectionto know the real length of each contig in a scaffold. Anave approach to determining contig length would be tosearch for the closest N upstream and downstream fromeach alignment start and end. However, this is inefficientbecause many sequences contain million of bases withoutany Ns. We therefore indexed all N’s for each sequence.We store the position of the first N per block in an arrayassociated with the sequence. The N positions are sortedin ascending order, which enables us to perform a binarysearch to detect the closest N efficiently.

Predict the source of contamination. A large major-ity of contamination occurs as small contigs (see Figure2 in Breitweiser et al. [10]). Conterminator uses thisproperty to help it identify contamination based on thelength distribution of sequences from each kingdom. By

default, it only calls a sequence a contaminant if the se-quence is shorter than 20 kb, and if it aligns to a sequencein another kingdom that is longer than 20 kb.

Predicting contamination in protein databases.Conterminator can also detect protein sequence contami-nation using cross-kingdom analysis. It clusters proteinsusing Linclust [20] with a bidirectional length overlapof 95 % and a sequence identity of 95 % (--min-seq-id0.95 -c 0.95 --cov-mode 0 -a). It reports every clus-ter with cross-kingdom members. For each contaminatedcluster, it counts how often each kingdom occurs and re-ports the least abundant kingdom as the one that is con-taminated. It also reports kingdoms with equal abun-dance, however in those cases it cannot predict the con-taminated entry. Using only abundance without concernfor length may lead to incorrect directionality calls.

Data visualization. We created the Sankey plots usingthe krakenuniq-report tool from KrakenUniq [30] tocreate a Kraken-style report from our predicted contam-inations. The visualization was done using Pavian [31]extracted as SVG and colored by Inkscape. The multi-ple alignment was created by MMseqs2 result2msa andvisualized using Jalview [32]. Software and databaseversions

Resource Version

Conterminator fa1c80MMseqs2 [21] 333546KrakenUniq [30] 5c0019Pavian [31] 81d784RefSeq [17] July 2019GenBank [1] Dec 2018Jalview [32] 2.11.0

TABLE I. List of software used in this paper. The ver-sions for MMseqs2, KrakenUniq and Pavian are the first 6characters of the git commit. For databases we list the dateat which the data was downloaded.

Data availability The list of contamination for Gen-Bank (genbank.gz) and RefSeq (refseq.gz) are avail-able at:ftp://ftp.ccb.jhu.edu/pub/data/conterminator/.

COMPETING INTERESTS

The authors declare that they have no competing in-terests.

AUTHOR’S CONTRIBUTIONS

MS & SS designed research, MS developed code andperformed analyses, MS & SS wrote the manuscript.

.CC-BY-NC-ND 4.0 International licenseavailable under a(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made

The copyright holder for this preprintthis version posted January 26, 2020. ; https://doi.org/10.1101/2020.01.26.920173doi: bioRxiv preprint

Page 7: Terminating contamination: large-scale search identifies ... · 1/26/2020  · terminator in RefSeq (Fig. 2a,b) and GenBank (Fig. 2c,d). Processing the 1:5 and 3:3 terabytes in RefSeq

7

ACKNOWLEDGEMENTS

We thank Milot Mirdita, Florian Breitwieser and Jen-nifer Lu for fruitful discussions. We thank the NCBI for

their answering our questions timely and Aleksey Ziminfor cleaning up the Turkey genome. This work was sup-ported in part by NIH grants R35-GM130151 and R01-HG006677, and by NSF grant IOS-1744309 to SLS.

[1] Sayers, E.W., Cavanaugh, M., Clark, K., Ostell, J.,Pruitt, K.D., Karsch-Mizrachi, I.: GenBank. NucleicAcids Res. 47(D1), 94–99 (2019)

[2] Breitwieser, F.P., Lu, J., Salzberg, S.L.: A review ofmethods and databases for metagenomic classificationand assembly. Brief. Bioinform. 20(4), 1125–1136 (2019)

[3] Kirstahler, P., Bjerrum, S.S., Friis-Møller, A., la Cour,M., Aarestrup, F.M., Westh, H., Pamp, S.J.: Genomics-Based identification of microorganisms in human ocularbody fluid. Sci. Rep. 8(1), 4126 (2018)

[4] Arakawa, K.: No evidence for extensive horizontal genetransfer from the draft genome of a tardigrade. Proc.Natl. Acad. Sci. U. S. A. 113(22), 3057 (2016)

[5] Salzberg, S.L.: Horizontal gene transfer is not a hallmarkof the human genome. Genome Biol. 18(1), 85 (2017)

[6] Poptsova, M.S., Gogarten, J.P.: Using comparativegenome analysis to identify problems in annotated micro-bial genomes. Microbiology 156(Pt 7), 1909–1917 (2010)

[7] Schaffer, A.A., Nawrocki, E.P., Choi, Y., Kitts,P.A., Karsch-Mizrachi, I., McVeigh, R.: Vec-Screen plus taxonomy: imposing a tax(onomy) increaseon vector contamination screening. Bioinformatics 34(5),755–759 (2018)

[8] Camacho, C., Coulouris, G., Avagyan, V., Ma, N., Pa-padopoulos, J., Bealer, K., Madden, T.L.: BLAST+: ar-chitecture and applications. BMC Bioinformatics 10, 421(2009)

[9] De Simone, G., Pasquadibisceglie, A., Proietto, R., Polti-celli, F., Aime, S., JM Op den Camp, H., Ascenzi, P.:Contaminations in (meta) genome data: An open issuefor the scientific community. IUBMB Life (2019)

[10] Breitwieser, F.P., Pertea, M., Zimin, A.V., Salzberg,S.L.: Human contamination in bacterial genomes has cre-ated thousands of spurious proteins. Genome Res. 29(6),954–960 (2019)

[11] Longo, M.S., O’Neill, M.J., O’Neill, R.J.: Abundanthuman DNA contamination identified in non-primategenome databases. PLoS One 6(2), 16410 (2011)

[12] Merchant, S., Wood, D.E., Salzberg, S.L.: Unex-pected cross-species contamination in genome sequencingprojects. PeerJ 2, 675 (2014)

[13] Laurence, M., Hatzis, C., Brash, D.E.: Common contam-inants in next-generation sequencing that hinder discov-ery of low-abundance microbes. PLoS One 9(5), 97876(2014)

[14] Orosz, F.: Two recently sequenced vertebrate genomesare contaminated with apicomplexan species of the sarco-cystidae family. Int. J. Parasitol. 45(13), 871–878 (2015)

[15] Mukherjee, S., Huntemann, M., Ivanova, N., Kyrpides,N.C., Pati, A.: Large-scale contamination of microbialisolate genomes by illumina PhiX control. Stand. Ge-nomic Sci. 10, 18 (2015)

[16] Reiter, T., Titus Brown, C.: Microbial contamination inthe genome of the domesticated olive (2018)

[17] O’Leary, N.A., Wright, M.W., Brister, J.R., Ciufo, S.,

Haddad, D., McVeigh, R., Rajput, B., Robbertse, B.,Smith-White, B., Ako-Adjei, D., Astashyn, A., Badret-din, A., Bao, Y., Blinkova, O., Brover, V., Chetvernin,V., Choi, J., Cox, E., Ermolaeva, O., Farrell, C.M., Gold-farb, T., Gupta, T., Haft, D., Hatcher, E., Hlavina, W.,Joardar, V.S., Kodali, V.K., Li, W., Maglott, D., Mas-terson, P., McGarvey, K.M., Murphy, M.R., O’Neill, K.,Pujar, S., Rangwala, S.H., Rausch, D., Riddick, L.D.,Schoch, C., Shkeda, A., Storz, S.S., Sun, H., Thibaud-Nissen, F., Tolstoy, I., Tully, R.E., Vatsan, A.R., Wallin,C., Webb, D., Wu, W., Landrum, M.J., Kimchi, A.,Tatusova, T., DiCuccio, M., Kitts, P., Murphy, T.D.,Pruitt, K.D.: Reference sequence (RefSeq) database atNCBI: current status, taxonomic expansion, and func-tional annotation. Nucleic Acids Res. 44(D1), 733–45(2016)

[18] Li, H.: Minimap2: pairwise alignment for nucleotide se-quences. Bioinformatics 34(18), 3094–3100 (2018)

[19] Langmead, B., Salzberg, S.L.: Fast gapped-read align-ment with bowtie 2. Nat. Methods 9(4), 357–359 (2012)

[20] Steinegger, M., Soding, J.: Clustering huge protein se-quence sets in linear time. Nat. Commun. 9(1), 2542(2018)

[21] Steinegger, M., Soding, J.: MMseqs2 enables sensitiveprotein sequence searching for the analysis of massivedata sets. Nat. Biotechnol. 35(11), 1026–1028 (2017)

[22] Sichtig, H., Minogue, T., Yan, Y., Stefan, C., Hall,A., Tallon, L., Sadzewicz, L., Nadendla, S., Klimke,W., Hatcher, E., Shumway, M., Aldea, D.L., Allen, J.,Koehler, J., Slezak, T., Lovell, S., Schoepp, R., Scherf,U.: FDA-ARGOS is a database with public quality-controlled reference genomes for diagnostic use and reg-ulatory science. Nat. Commun. 10(1), 3313 (2019)

[23] Yoshimura, J., Ichikawa, K., Shoura, M.J., Artiles,K.L., Gabdank, I., Wahba, L., Smith, C.L., Edgley,M.L., Rougvie, A.E., Fire, A.Z., Morishita, S., Schwarz,E.M.: Recompleting the caenorhabditis elegans genome.Genome Res. 29(6), 1009–1022 (2019)

[24] Dalloul, R.A., Long, J.A., Zimin, A.V., Aslam, L., Beal,K., Blomberg, L.A., Bouffard, P., Burt, D.W., Crasta,O., Crooijmans, R.P.M.A., Cooper, K., Coulombe, R.A.,De, S., Delany, M.E., Dodgson, J.B., Dong, J.J., Evans,C., Frederickson, K.M., Flicek, P., Florea, L., Folkerts,O., Groenen, M.A.M., Harkins, T.T., Herrero, J., Hoff-mann, S., Megens, H.-J., Jiang, A., de Jong, P., Kaiser,P., Kim, H., Kim, K.-W., Kim, S., Langenberger, D., Lee,M.-K., Lee, T., Mane, S., Marcais, G., Marz, M., McEl-roy, A.P., Modise, T., Nefedov, M., Notredame, C., Pa-ton, I.R., Payne, W.S., Pertea, G., Prickett, D., Puiu, D.,Qioa, D., Raineri, E., Ruffier, M., Salzberg, S.L., Schatz,M.C., Scheuring, C., Schmidt, C.J., Schroeder, S., Searle,S.M.J., Smith, E.J., Smith, J., Sonstegard, T.S., Stadler,P.F., Tafer, H., Tu, Z.J., Van Tassell, C.P., Vilella, A.J.,Williams, K.P., Yorke, J.A., Zhang, L., Zhang, H.-B.,Zhang, X., Zhang, Y., Reed, K.M.: Multi-platform next-

.CC-BY-NC-ND 4.0 International licenseavailable under a(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made

The copyright holder for this preprintthis version posted January 26, 2020. ; https://doi.org/10.1101/2020.01.26.920173doi: bioRxiv preprint

Page 8: Terminating contamination: large-scale search identifies ... · 1/26/2020  · terminator in RefSeq (Fig. 2a,b) and GenBank (Fig. 2c,d). Processing the 1:5 and 3:3 terabytes in RefSeq

8

generation sequencing of the domestic turkey (meleagrisgallopavo): genome assembly and analysis. PLoS Biol.8(9) (2010)

[25] UniProt Consortium: UniProt: a worldwide hub of pro-tein knowledge. Nucleic Acids Res. 47(D1), 506–515(2019)

[26] Babb, P.L., Lahens, N.F., Correa-Garhwal, S.M., Nichol-son, D.N., Kim, E.J., Hogenesch, J.B., Kuntner, M., Hig-gins, L., Hayashi, C.Y., Agnarsson, I., Voight, B.F.: Thenephila clavipes genome highlights the diversity of spi-der silk genes and their complex expression. Nat. Genet.49(6), 895–903 (2017)

[27] Sheetlin, S., Park, Y., Frith, M.C., Spouge, J.L.: ALP& FALP: C++ libraries for pairwise local alignment e-values. Bioinformatics 32(2), 304–305 (2016)

[28] Federhen, S.: The NCBI taxonomy database. Nucleic

Acids Res. 40(Database issue), 136–43 (2012)[29] Frith, M.C.: A new repeat-masking method enables spe-

cific detection of homologous sequences. Nucleic AcidsRes. 39(4), 23 (2011)

[30] Breitwieser, F.P., Baker, D.N., Salzberg, S.L.: Krake-nUniq: confident and fast metagenomics classification us-ing unique k-mer counts. Genome Biol. 19(1), 198 (2018)

[31] Breitwieser, F.P., Salzberg, S.L.: Pavian: Interactiveanalysis of metagenomics data for microbiome studiesand pathogen identification. Bioinformatics (2019)

[32] Waterhouse, A.M., Procter, J.B., Martin, D.M.A.,Clamp, M., Barton, G.J.: Jalview version 2–a multi-ple sequence alignment editor and analysis workbench.Bioinformatics 25(9), 1189–1191 (2009)

.CC-BY-NC-ND 4.0 International licenseavailable under a(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made

The copyright holder for this preprintthis version posted January 26, 2020. ; https://doi.org/10.1101/2020.01.26.920173doi: bioRxiv preprint