paprica course

Page 1: Paprica course

Welcome!

Universidade de São Paulo

PAthway PRediction by phylogenetIC plAcement (paprica) short course

Jeff Bowman, [email protected]

30 March 2016

Page 2: Paprica course

Introduction and Logistics

Schedule (tentative)
0900 – 0915: Introductions and logistics
0915 – 1015: Task 1: Troubleshoot installations, Task 2: Tutorial 1
1015 – 1030: Break
1030 – 1100: Discussion: The paprica workflow
1100 – 1130: Discussion: Tutorial 1 results
1130 – 1200: Troubleshooting installation for custom build of paprica database
1200 – 1300: Lunch
1300 – 1330: Tutorial 2: Building the paprica database
1330 – 1400: Discussion: The paprica database workflow
1400 – 1430: Demonstration: Metagenomic analysis with paprica (break during module)
1430 – 1630: Your analysis with paprica. If you don’t have a set of libraries that you’d like to work with, we will help you find some.

Objectives
1. Install paprica and dependencies, and learn how to use it to analyze a set of 16S rRNA gene sequences
2. Install the dependencies for building the paprica database, and learn how to build a custom database

Page 3: Paprica course

What is paprica, and what can I do with it?

paprica is a pipeline to estimate the metabolic pathways, enzymes (EC numbers), and genome parameters associated with 16S rRNA gene sequences.

• Designed for NGS data
• Also applicable to small libraries or even single 16S rRNA gene sequences (e.g. isolates)

Bowman and Ducklow, 2015 Bowman, 2015

Page 4: Paprica course

Bowman, 2015

Function | Pathway | Sanger studies | Hatam et al. (2014) | Bowman et al. (2012)
CO2 fixation | CO2 fixation into oxaloacetate (anaplerotic) | Pseudoalteromonas haloplanktis TAC125 | Polaribacter MED152, Acidimicrobiales YM16-304 | Psychrobacter cryohalolentis K5, Polaribacter MED152
Antibiotic resistance | Triclosan resistance | Pelagibacter ubique HTCC1062, Polaribacter MED152 | Polaribacter MED152, Leadbetterella byssophila DSM17132, Thiomicrospira spp., Gloeocapsa PCC7428, Acidimicrobiales YM16-304, Janthinobacterium spp. | P. cryohalolentis K5, Polaribacter MED152, GSOS
C1 metabolism | Formaldehyde oxidation II (glutathione-dependent) | Colwellia psychrerythraea 34H | Gloeocapsa PCC7428, Marinobacter BSs20148, Glaciecola nitratireducens FR1064 | Octadecabacter antarcticus 307
Choline degradation | Choline degradation 1 | C. psychrerythraea 34H | Acidimicrobiales YM304 | P. cryohalolentis K5, O. antarcticus 307
Glycine betaine production | Glycine betaine biosynthesis I (Gram-negative bacteria) | C. psychrerythraea 34H | Acidimicrobiales YM304 | P. cryohalolentis K5, O. antarcticus 307
Halocarbon degradation | 2-chlorobenzoate degradation | P. cryohalolentis K5 | Polaromonas naphthalenivorans CJ2 | P. cryohalolentis K5
Mercury conversion | Phenylmercury acetate degradation | Marinobacter BSs20148, P. haloplanktis TAC125, Octadecabacter arcticus 238 | Belliella baltica DSM15883, Bordetella petrii | O. antarcticus 307
Nitrogen fixation | Nitrogen fixation | Coraliomargarita akajimensis DSM45221 | C. akajimensis DSM45221, Methylomonas methanica MC09, Aeromonas spp. | C. akajimensis DSM45221
Sulfite oxidation | Sulfite oxidation II/III | Pelagibacter ubique HTCC1062 | Cellvibrio japonicus UEDA107 | GSOS
Sulfate reduction | Sulfate reduction IV/V | Halomonas elongata DSM2581, Psychrobacter arcticum 273 | Vibrio vulnificus YJ016 | GSOS
Denitrification | Nitrate reduction I/VII | C. psychrerythraea 34H | C. japonicus UEDA107 | -

Page 5: Paprica course

Bowman et al, in revision

Page 6: Paprica course

Troubleshoot installation and conduct basic analysis

Tutorial 1 – Initial analysis with paprica

• Finish downloading and installing all remaining dependencies; let me know if you need assistance
  • Archaeopteryx

sudo apt-get install default-jre
wget https://googledrive.com/host/0BxMokdxOh-JRM1d2azFoRnF3bGM/download/forester_1038.jar
mv forester_1038.jar archaeopteryx.jar
chmod a+x archaeopteryx.jar

## create bash script archaeopteryx containing these lines (no indentation):
## #!/bin/bash
## java -cp archaeopteryx.jar org.forester.archaeopteryx.Archaeopteryx

## make this script executable
chmod a+x archaeopteryx

  • R and RStudio

• Remove existing paprica directory, then download latest version of paprica:

rm -r paprica
git clone https://github.com/bowmanjeffs/paprica.git

• Start working through the tutorial located here: http://www.polarmicrobes.org/?p=1473
  • Start at “Testing the Installation”

Page 7: Paprica course

The paprica workflow

Three components to metabolic inference:
1. Database construction
2. Analysis
3. Confidence scoring

Workflow diagram (box labels):
• 16S sequence library, the bigger the better!
• Obtain all completed genomes (Genbank)
• Predict metabolic pathways (ptools)
• Construct 16S rRNA gene tree (Infernal, RAxML)
• Place reads on reference tree (Infernal, pplacer)
• Extract pathways for each placement
• Generate confidence score for sample
• Find pathways shared across all members of all clades
• Calculate confidence for each node
• Evaluate genomic plasticity for terminal nodes
• Evaluate relative core genome size

Diagram legend: Analysis, Database Construction, Confidence Scoring

Caveats:
Metabolic inference is only as good as…
• Our genome annotations
• The diversity of completed genomes
• Our knowledge of metabolic pathways

And is further limited by…
• Genomic plasticity

Page 8: Paprica course

The paprica workflow

• Data preparation
  • Read QC – basic steps
    • Overlap if PE
    • Trim for quality
    • Remove chloroplasts, mitochondria, anything else that looks weird
  • Methods
    • Mothur (preferred)
    • Qiime
    • paprica/utilities/read_qc.py
• Test run on single sample
• Set up run for multiple samples
  • where samples.txt contains a list of the sample files without their extension
• Let’s take a look at paprica-run.sh…

while read f;do ./paprica-run.sh $f bacteria;done < samples.txt
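The loop above expects samples.txt to hold one sample name per line, with the file extension removed. As a minimal sketch of one way to build that file, the helper below lists every file with a given extension in a directory and writes the stripped names. The function name, the `.fasta` default, and writing samples.txt into the same directory are assumptions for illustration, not part of paprica.

```python
from pathlib import Path


def write_sample_list(directory, extension=".fasta", out_name="samples.txt"):
    """Write one sample name per line, with the extension stripped.

    Hypothetical helper: paprica does not ship this function.
    """
    directory = Path(directory)
    # Keep only regular files that end in the requested extension.
    names = sorted(
        p.name[: -len(extension)]
        for p in directory.iterdir()
        if p.is_file() and p.name.endswith(extension)
    )
    out_path = directory / out_name
    out_path.write_text("".join(name + "\n" for name in names))
    return names
```

Calling `write_sample_list(".")` from the directory that holds the sample files would produce a samples.txt that the `while read f` loop can consume.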

Page 9: Paprica course

Tutorial 1 results

Files initially provided or created by paprica
summer.fasta
summer.sub.fasta
summer.sub.clean.fasta

Files produced for or during infernal/pplacer
summer.sub.combined_16S.bacteria.tax.clean.align.phyloxml
summer.sub.combined_16S.bacteria.tax.clean.align.csv
summer.sub.combined_16S.bacteria.tax.clean.align.sto
summer.sub.combined_16S.bacteria.tax.clean.align.fasta
summer.sub.combined_16S.bacteria.tax.clean.align.jplace

paprica output files
summer.bacteria.ec.csv
summer.bacteria.sum_ec.csv
summer.bacteria.pathways.csv
summer.bacteria.sum_pathways.csv
summer.bacteria.edge_data.csv
summer.bacteria.sample_data.txt

Page 11: Paprica course

Tutorial 1 results

File shown: summer.sub.combined_16S.bacteria.tax.clean.align.csv

origin,name,multiplicity,edge_num,like_weight_ratio,post_prob,likelihood,marginal_like,distal_length,pendant_length,classification,map_ratio,map_overlap
summer.sub.combined_16S.bacteria.tax.clean.align,SRR584344.1832,1,2568,0.497633,0.769127,-42222.2,-42226,0.457927,0.317102,NA,NA,NA
summer.sub.combined_16S.bacteria.tax.clean.align,SRR584344.4354,1,2253,0.840252,0.915613,-41188,-41192.1,7.3661e-06,0.263113,NA,NA,NA
summer.sub.combined_16S.bacteria.tax.clean.align,SRR584344.3662,1,2422,0.614939,0.615935,-42880.8,-42884.1,6.32695e-06,0.17298,NA,NA,NA
summer.sub.combined_16S.bacteria.tax.clean.align,SRR584344.2443,1,242,0.557322,0.787045,-43458.2,-43459.3,9.2618e-06,0.0380588,NA,NA,NA
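A placement table like the one on this slide can be summarized with the standard library alone. The sketch below counts how many reads land on each edge, optionally ignoring placements below a posterior-probability cutoff. The column names come from the slide; the function name and the 0.7 cutoff are assumptions chosen for illustration, not values that paprica uses.

```python
import csv
import io
from collections import Counter

# The four placements shown on the slide. A real file would be read with
# open(path, newline="") instead of io.StringIO.
EXAMPLE = """origin,name,multiplicity,edge_num,like_weight_ratio,post_prob,likelihood,marginal_like,distal_length,pendant_length,classification,map_ratio,map_overlap
summer.sub.combined_16S.bacteria.tax.clean.align,SRR584344.1832,1,2568,0.497633,0.769127,-42222.2,-42226,0.457927,0.317102,NA,NA,NA
summer.sub.combined_16S.bacteria.tax.clean.align,SRR584344.4354,1,2253,0.840252,0.915613,-41188,-41192.1,7.3661e-06,0.263113,NA,NA,NA
summer.sub.combined_16S.bacteria.tax.clean.align,SRR584344.3662,1,2422,0.614939,0.615935,-42880.8,-42884.1,6.32695e-06,0.17298,NA,NA,NA
summer.sub.combined_16S.bacteria.tax.clean.align,SRR584344.2443,1,242,0.557322,0.787045,-43458.2,-43459.3,9.2618e-06,0.0380588,NA,NA,NA
"""


def count_placements(handle, min_post_prob=0.0):
    """Count reads per edge, skipping placements below min_post_prob."""
    counts = Counter()
    for row in csv.DictReader(handle):
        if float(row["post_prob"]) >= min_post_prob:
            counts[int(row["edge_num"])] += int(row["multiplicity"])
    return counts


# Hypothetical cutoff of 0.7: the read placed on edge 2422 is dropped.
counts = count_placements(io.StringIO(EXAMPLE), min_post_prob=0.7)
```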

Page 12: Paprica course

Tutorial 1 results

File shown: summer.bacteria.ec.csv

,15,37,51,142,242,243,552,649,678,739,796,802,805,1030,1050,1075,1106,1107,2139…
1.-.-.-,0.0,0.0,0.0,35.25,90.0,14.0,0.0,0.0…
1.1.-.-,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0…
1.1.1.-,0.0,0.0,0.0,35.25,0.0,0.0,0.0,0…
1.1.1.1,0.0,0.0,0.0,23.5,135.0,21.0,0.333333333333…

Columns: edge number for each CCG and CEG
Rows: EC number
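Because this table is a plain matrix with features as rows and edges as columns, it is easy to ask which edges contribute to a given EC number. The sketch below is one way to do that with the `csv` module. The data are the first six edge columns from the slide for two of the rows (the slide truncates the rest), and the function names are hypothetical.

```python
import csv
import io

# First six edge columns from the slide; the remaining columns are elided there.
EXAMPLE = """,15,37,51,142,242,243
1.-.-.-,0.0,0.0,0.0,35.25,90.0,14.0
1.1.1.1,0.0,0.0,0.0,23.5,135.0,21.0
"""


def read_matrix(handle):
    """Return {row_name: {edge: value}} from a feature-by-edge CSV."""
    reader = csv.reader(handle)
    edges = next(reader)[1:]  # the first header cell is blank
    table = {}
    for row in reader:
        table[row[0]] = {edge: float(v) for edge, v in zip(edges, row[1:])}
    return table


def edges_with(table, feature):
    """Edges where the named feature has a nonzero value."""
    return {edge: v for edge, v in table[feature].items() if v > 0}


table = read_matrix(io.StringIO(EXAMPLE))
contributing = edges_with(table, "1.1.1.1")
```

The same reader should also work on a pathway-by-edge table, since the `csv` module handles quoted pathway names that contain commas.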

Page 13: Paprica course

Tutorial 1 results

File shown: summer.bacteria.sum_ec.csv

1.-.-.-,175.159090909
1.1.-.-,0.333333333333
1.1.1.-,44.0984848485
1.1.1.1,192.475757576
1.1.1.10,0.0
1.1.1.100,1168.89333799
1.1.1.102,0.333333333333

Values: sum (normalized) across all CCG and CEG
Rows: EC number
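A two-column summary like this one lends itself to a quick ranking. The sketch below keeps only fully specified EC numbers (no "-" placeholders) and sorts them by their summed value. The values are copied from the slide; the function name and the decision to drop partial EC numbers are assumptions for illustration.

```python
import csv
import io

# Rows as shown on the slide (EC number, normalized sum).
EXAMPLE = """1.-.-.-,175.159090909
1.1.-.-,0.333333333333
1.1.1.-,44.0984848485
1.1.1.1,192.475757576
1.1.1.10,0.0
1.1.1.100,1168.89333799
1.1.1.102,0.333333333333
"""


def rank_complete_ec(handle):
    """Sort fully specified EC numbers by value, largest first."""
    rows = [
        (ec, float(value))
        for ec, value in csv.reader(handle)
        if "-" not in ec  # skip partial classifications such as 1.1.-.-
    ]
    return sorted(rows, key=lambda item: item[1], reverse=True)


ranked = rank_complete_ec(io.StringIO(EXAMPLE))
```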

Page 14: Paprica course

Tutorial 1 results

File shown: summer.bacteria.pathways.csv

,15,37,51,142,242,243,552,649,678,739,796,802,805,1030,1050,1075,1106,1107,2139…
"(1,3)-beta-D-xylan degradation",0.0,0.0,0.0,0.0,0.0,0.0,0.0…
(KDO)2-lipid A biosynthesis I,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.6…
(R)-acetoin biosynthesis I,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0…

Columns: edge number for each CCG and CEG
Rows: pathway

Page 15: Paprica course

Tutorial 1 results

File shown: summer.bacteria.edge_data.csv (transposed and put in table)

edge_num | 242 | 243
taxon | GCF_000012345.1_Candidatus Pelagibacter ubique HTCC1062_strain=HTCC1062 |
nedge | 53 | 5
n16S | 1 | 1
nedge_corrected | 53 | 5
nge | 1 | 1
ncds | 1333 | 1355.5
genome_size | 1308759 | 1325981
GC | 29.68308145 | 29.15748
phi | 0.478821295 | 0.480875
clade_size | 1 | 2
branch_length | 0.0189682 | 0.246143
npaths_terminal | | 119.5
npaths_actual | 116 | 144
confidence | 0.478821295 | 0.625556
post_prob | 0.789555434 | 0.814622
nec_actual | 369 | 461
nec_terminal | | 315.5

Page 16: Paprica course

Tutorial 1 results

File shown: summer.bacteria.sample_data.txt

name summer.bacteria
sample_confidence 0.49199211424
npathways 572
ppathways 1007
nreads 1000
database_created_at 2016-03-03T00:59:34.792240
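The sample summary is a short list of key and value pairs, so it can be read into a dictionary for comparison across samples. The sketch below assumes each pair sits on its own line and is separated by whitespace (the slide does not show the delimiter); the function name is hypothetical.

```python
# Key/value pairs as shown on the slide, one per line (layout assumed).
EXAMPLE = """name summer.bacteria
sample_confidence 0.49199211424
npathways 572
ppathways 1007
nreads 1000
database_created_at 2016-03-03T00:59:34.792240
"""


def parse_sample_data(text):
    """Return a dict of the key/value pairs, leaving values as strings."""
    data = {}
    for line in text.splitlines():
        if not line.strip():
            continue
        # Split on the first run of whitespace so tabs or spaces both work.
        key, value = line.split(None, 1)
        data[key] = value.strip()
    return data


sample = parse_sample_data(EXAMPLE)
confidence = float(sample["sample_confidence"])
```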

Page 17: Paprica course

Tutorial 2

• Download the remaining dependencies
  • RAxML
    • add to PATH
    • What if CPU can’t support AVX2? Cheat.

git clone https://github.com/stamatak/standard-RAxML.git
cd standard-RAxML
make -f Makefile.AVX2.PTHREADS.gcc
rm *.o

  • pathway-tools
    • follow GUI instructions
  • taxtastic
    • make sure that system Python is Anaconda (or alternate distro), then:

pip install taxtastic

• Follow the tutorial here: http://www.polarmicrobes.org/?p=1543
  • Only complete the “Test paprica-build.sh” section!
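The slide does not spell out what "Cheat" means when a CPU lacks AVX2. One possible alternative, sketched below, is to check the CPU flags that Linux reports in /proc/cpuinfo and pick a matching Makefile from the standard-RAxML repository. The function names are hypothetical, and a binary built this way will carry a different name from the AVX2 build, so check which RAxML executable name paprica calls before relying on it.

```python
def choose_raxml_makefile(cpu_flags):
    """Pick a standard-RAxML PTHREADS Makefile that the CPU can run."""
    flags = set(cpu_flags)
    if "avx2" in flags:
        return "Makefile.AVX2.PTHREADS.gcc"
    if "avx" in flags:
        return "Makefile.AVX.PTHREADS.gcc"
    # Linux reports SSE3 support under the flag name "pni".
    if "sse3" in flags or "pni" in flags:
        return "Makefile.SSE3.PTHREADS.gcc"
    return "Makefile.PTHREADS.gcc"


def read_cpu_flags(path="/proc/cpuinfo"):
    """Return the flag list from the first 'flags' line (Linux only)."""
    with open(path) as handle:
        for line in handle:
            if line.startswith("flags"):
                return line.split(":", 1)[1].split()
    return []
```

On a Linux machine, `choose_raxml_makefile(read_cpu_flags())` gives the file name to pass to `make -f`.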

Page 18: Paprica course

Discussion: The paprica database workflow

[Figure: directory tree of the paprica database. Labels recovered from the diagram:]

Directories: ref_genome_database, ptools-local, user, bacteria, archaea, refseq, comb…refpkg, GCF…*

Domain-level files (one set each for bacteria and archaea):
terminal_paths.csv
terminal_ec.csv
internal_probs.csv
internal_ec_probs.csv
internal_ec_n.csv
internal_data.csv
genome_data_final.csv
genome_data.csv
combined_16S.bacteria.tax.database_info.txt (combined_16S.archaea.tax.database_info.txt for archaea)

GCF…* files, first listing (shown twice):
*.fasta
*.hits
*.sto
*.5mer_bints.txt.gz
*.genomic.fna
*.genomic.gbff
*.protein.faa

GCF…* files, second listing (shown twice, alongside draft.combined_16S.fasta):
*.fasta
*.hits
*.sto
*.genomic.fna
*protein.gbk

Other files:
paprica-mg.dmnd
paprica-mg.prot.csv.gz

Reference package files:
combined_16S.[domain].tax.clean.align.fasta
combined_16S.[domain].tax.clean.align.sto
CONTENTS.json
phylo_modeleSi5_T.json
RAxML_fastTreeSH_Support.conf.root.ref.tre
RAxML_info.ref.tre

Page 19: Paprica course

Discussion: The paprica database workflow

paprica-make_ref.py
• Downloads all completed genomes from Genbank
• Counts 16S genes in each genome and pulls a representative
• Calculates other genome parameters
• Constructs 16S alignment and distance matrix
• Constructs genome distance matrix (compositional vector based)
• Calculates phi from 16S distance matrix and genome distance matrix
• Finds 16S genes in user genomes (if present)
• Adds user 16S genes to previous alignment

paprica-place_it.py
• Constructs reference tree and reference package from 16S alignment

paprica-build_core_genomes.py
• Predicts metabolic pathways for each genome
• Tallies up EC numbers for each genome
• For each internal node on reference tree, determines mean parameters and fraction of occurrence of EC numbers and metabolic pathways
• Exports all of this information as csv files
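The "fraction of occurrence" step for internal nodes can be illustrated with a small example. The sketch below takes the set of features (EC numbers or pathways) predicted for each genome in a clade and returns, for every feature, the fraction of genomes that carry it. This is a simplified illustration of the idea, not the code in paprica-build_core_genomes.py; the function name and the genome labels are hypothetical.

```python
def fraction_of_occurrence(genome_features):
    """Fraction of genomes in a clade that carry each feature.

    genome_features maps a genome name to the collection of features
    (EC numbers or pathway names) predicted for that genome.
    """
    n_genomes = len(genome_features)
    if n_genomes == 0:
        return {}
    counts = {}
    for features in genome_features.values():
        # set() guards against a feature being listed twice for one genome.
        for feature in set(features):
            counts[feature] = counts.get(feature, 0) + 1
    return {feature: n / n_genomes for feature, n in counts.items()}


# Hypothetical two-genome clade below one internal node.
clade = {
    "genome_A": {"1.1.1.1", "1.1.1.100"},
    "genome_B": {"1.1.1.1"},
}
fractions = fraction_of_occurrence(clade)
```

A feature with a fraction of 1.0 is shared by every member of the clade, which matches the "pathways shared across all members" idea from the workflow slide.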

Page 20: Paprica course

Demonstration: paprica-mg.py

• If you’re on a server you can follow the tutorial at http://www.polarmicrobes.org/?p=1596

• test.annotation.csv: The number of hits in the metagenome, by EC number. This is probably the most useful file to you. The columns are:
  • index: The accession of a representative protein from the database
  • genome: Genome the representative protein comes from
  • domain: Domain of this genome
  • EC_number: The EC number
  • product: A sensible name for the gene product
  • start: Start position of the gene in the genome
  • end: End position of the gene in the genome
  • n_occurences: The number of occurrences of this EC number in the database
  • nr_hits: The number of reads that matched this EC number. Each read is allowed only one hit.
• test.paprica-mg.nr.daa: The DIAMOND format results file. Only one hit per read is reported.
• test.paprica-mg.nr.txt: A text file of the DIAMOND results. Only one hit per read is reported.
• test_mg.pathologic (for -pathways T only): A directory containing .gbk files for each genome in the paprica database that received a hit, with each EC number that got a hit for that genome.
• test.pathways.txt: A simple list of all the pathways that were predicted for the metagenome.

paprica-mg_run.py -i ERR318619_1.qc.fasta.gz -o demo -ref_dir ref_genome_database -pathways F
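Once a run finishes, the annotation table can be ranked by the number of reads matching each EC number. The sketch below uses the column names listed above (`EC_number`, `nr_hits`). The rows, accessions, and genome labels are invented placeholders, and the assumption that the first column header is literally `index` is untested against a real output file.

```python
import csv
import io

# Hypothetical rows; only the column names come from the slide.
EXAMPLE = """index,genome,domain,EC_number,product,start,end,n_occurences,nr_hits
HYPOTHETICAL_1,GENOME_A,bacteria,1.1.1.1,alcohol dehydrogenase,100,1100,12,40
HYPOTHETICAL_2,GENOME_B,bacteria,1.1.1.100,3-oxoacyl-[acyl-carrier-protein] reductase,200,950,30,75
HYPOTHETICAL_3,GENOME_A,archaea,2.7.7.7,DNA-directed DNA polymerase,5,2500,50,0
"""


def top_ec_numbers(handle, n=10):
    """Return up to n (EC_number, nr_hits) pairs with at least one hit."""
    rows = []
    for row in csv.DictReader(handle):
        hits = int(float(row["nr_hits"]))  # tolerate values written as 40.0
        if hits > 0:
            rows.append((row["EC_number"], hits))
    rows.sort(key=lambda item: item[1], reverse=True)
    return rows[:n]


top = top_ec_numbers(io.StringIO(EXAMPLE))
```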