Curso de Genómica - UAT (VHIR) 2012 - Análisis de datos de NGS


Upload: ueb

Post on 08-May-2015


Page 1: Curso de Genómica - UAT (VHIR) 2012 - Análisis de datos de NGS

NGS Data analysis http://ueb.vhir.org/NGS2012

Introduction to NGS (Next Generation Sequencing)

Data Analysis

Statistics and Bioinformatics Research Group
Statistics Department, Universitat de Barcelona

Statistics and Bioinformatics Unit
Vall d’Hebron Institut de Recerca

Alex Sánchez

Page 2: Curso de Genómica - UAT (VHIR) 2012 - Análisis de datos de NGS

Outline

• Introduction
• Bioinformatics Challenges
• NGS data analysis: Some examples and workflows
  – Metagenomics, De novo sequencing, Variant detection, RNA-seq
• Software
  – Galaxy, Genome viewers
• Data formats and quality control


Page 3: Curso de Genómica - UAT (VHIR) 2012 - Análisis de datos de NGS

Introduction


Page 4: Curso de Genómica - UAT (VHIR) 2012 - Análisis de datos de NGS

Why is NGS revolutionary?

• NGS has brought high speed not only to genome sequencing and personal medicine,

• it has also changed the way we do genome research

Got a question on genome organization?

SEQUENCE IT !!!

Ana Conesa, bioinformatics researcher at Principe Felipe Research Center


Page 5: Curso de Genómica - UAT (VHIR) 2012 - Análisis de datos de NGS

NGS means high sequencing capacity

GS FLX 454 (ROCHE)

HiSeq 2000 (ILLUMINA)

5500xl SOLiD (ABI)

Ion TORRENT

GS Junior


Page 6: Curso de Genómica - UAT (VHIR) 2012 - Análisis de datos de NGS

NGS Platforms Performance

454 GS Junior: 35 MB


Page 7: Curso de Genómica - UAT (VHIR) 2012 - Análisis de datos de NGS

454 Sequencing


Page 8: Curso de Genómica - UAT (VHIR) 2012 - Análisis de datos de NGS

ABI SOLID Sequencing


Page 9: Curso de Genómica - UAT (VHIR) 2012 - Análisis de datos de NGS

Solexa sequencing


Page 10: Curso de Genómica - UAT (VHIR) 2012 - Análisis de datos de NGS

Applications of Next-Generation Sequencing


Page 11: Curso de Genómica - UAT (VHIR) 2012 - Análisis de datos de NGS

Comparison of 2nd NGS


Page 12: Curso de Genómica - UAT (VHIR) 2012 - Análisis de datos de NGS

Some numbers

Platform                                                  454/FLX        Solexa (Illumina)    AB SOLID
Read length                                               ~350-400 bp    36, 75, or 106 bp    50 bp
Single read                                               Yes            Yes                  Yes
Paired-end reads                                          Yes            Yes                  Yes
Long-insert (several Kbp) mate-paired reads               Yes            Yes                  No
Number of reads per instrument run                        5.00K          >100 M               400 M
Max data output                                           0.5 Gbp        20.5 Gbp             20 Gbp
Run time to 1 Gb                                          6 days         > 1 day              > 1 day
Ease of use (workflow)                                    Difficult      Least difficult      Difficult
Base calling                                              Flow space     Nucleotide space     Color space

DNA applications
Whole genome sequencing and resequencing                  Yes            Yes                  Yes
De novo sequencing                                        Yes            Yes                  Yes
Targeted resequencing                                     Yes            Yes                  Yes
Discovery of genetic variants (SNPs, InDels, CNV, ...)    Yes            Yes                  Yes
Chromatin immunoprecipitation (ChIP)                      Yes            Yes                  Yes
Methylation analysis                                      Yes            Yes                  Yes
Metagenomics                                              Yes            No                   No

RNA applications                                          Yes            Yes                  Yes
Whole transcriptome                                       Yes            Yes                  Yes
Small RNA                                                 Yes            Yes                  Yes
Expression tags                                           Yes            Yes                  Yes


Page 13: Curso de Genómica - UAT (VHIR) 2012 - Análisis de datos de NGS

Bioinformatics challenges of NGS


Page 14: Curso de Genómica - UAT (VHIR) 2012 - Análisis de datos de NGS

I have my sequences/images. Now what?


Page 15: Curso de Genómica - UAT (VHIR) 2012 - Análisis de datos de NGS

NGS pushes (bio)informatics needs up

• Need for computer power
• VERY large text files (~10 million lines long)
  – Can’t do ‘business as usual’ with familiar tools such as Perl/Python.
  – Impossible memory usage and execution time
  – Impossible to browse for problems
• Need for sequence quality filtering
• Need for large amounts of CPU power
  – Informatics groups must manage compute clusters
  – Challenges in parallelizing existing software or redesigning algorithms to work in a parallel environment
• Need for Bioinformatics power!!!
  – The challenge shifts from data generation to data analysis!
  – How should bioinformatics be structured?
    • Bigger centralized bioinformatics services? (or research groups providing service?)
    • Distributed model: bioinformaticians must be part of the teams. Interoperability?


Page 16: Curso de Genómica - UAT (VHIR) 2012 - Análisis de datos de NGS

Data management issues

• Raw data are large. How long should they be kept?
• Processed data are manageable for most people
  – 20 million reads (50 bp) ~1 Gb
• More of an issue for a facility: HiSeq recommends 32 CPU cores, each with 4 GB RAM
• Certain studies are much more data intensive than others
  – Whole genome sequencing
    • A 30X coverage genome pair (tumor/normal) ~500 GB
    • 50 genome pairs ~25 TB
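
The figures on this slide follow from simple arithmetic. A minimal sketch, assuming decimal units (1 Gb = 10^9 bases, 1 TB = 1000 GB); the variable names are illustrative only and not part of the course material:

```python
# Back-of-the-envelope check of the slide's numbers (decimal units assumed).
reads = 20_000_000           # 20 million reads
read_length_bp = 50          # 50 bp each
total_bases = reads * read_length_bp
print(f"{total_bases / 1e9:.0f} Gb of sequence")

pair_size_gb = 500           # one 30X tumor/normal genome pair, as quoted on the slide
n_pairs = 50
project_tb = n_pairs * pair_size_gb / 1000
print(f"{project_tb:.0f} TB for {n_pairs} genome pairs")
```

Real files are larger than the bare base count, since FASTQ also stores a header and a quality string for every read.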


Page 17: Curso de Genómica - UAT (VHIR) 2012 - Análisis de datos de NGS

So what?

• In NGS we have to process really big amounts of data, which is not trivial in computing terms.

• Big NGS projects require supercomputing infrastructures

• Or put another way: not everyone can do everything.
  – Small facilities must carefully choose projects that scale with their computing capabilities.


Page 18: Curso de Genómica - UAT (VHIR) 2012 - Análisis de datos de NGS

Computational infrastructure for NGS

• There is great variety, but a good starting point is:
  – Computing cluster
    • Multiple nodes (servers) with multiple cores
    • High performance storage (TB, PB level)
    • Fast networks (10Gb ethernet, infiniband)
  – Enough space and conditions for the equipment ("server room")
  – Skilled people (sysadmin, developers)
    • CNAG, in Barcelona: 36 people, more than 50% of them informaticians


Page 19: Curso de Genómica - UAT (VHIR) 2012 - Análisis de datos de NGS

Alternatives (1): Cloud Computing

• Pros
  – Flexibility.
  – You pay for what you use.
  – No need to maintain a data center.
• Cons
  – Transferring big datasets over the internet is slow.
  – You pay for consumed bandwidth. That is a problem with big datasets.
  – Lower performance, especially in disk read/write.
  – Privacy/security concerns.
  – More expensive for big and long-term projects.


Page 20: Curso de Genómica - UAT (VHIR) 2012 - Análisis de datos de NGS

Alternatives (2): Grid Computing

• Pros
  – Cheaper.
  – More resources available.
• Cons
  – Heterogeneous environment.
  – Slow connectivity (especially in Spain).
  – Much time required to find good resources in the grid.


Page 21: Curso de Genómica - UAT (VHIR) 2012 - Análisis de datos de NGS

In summary?

• “NGS” arrived 2007/8
• No-one predicted NGS in 2001 (ten years ago)
• Therefore we cannot predict what we will come up against
• TGS presents specific challenges
  – Large Data Storage
  – Technology-aware software
  – Enables new assays and new science
• We would have said the same about NGS….
• These are not new problems, but will require new solutions
• There is a lag between technology and software….


Page 22: Curso de Genómica - UAT (VHIR) 2012 - Análisis de datos de NGS

Bioinformatics and bioinformaticians

• The term bioinformatician means many things
• Some may require a wide range of skills
• Others require a depth of specific skills
• The best thing we can teach is the ability to learn and adapt
• The spirit of adventure
• There is a definite skills shortage
• There always has been


Page 23: Curso de Genómica - UAT (VHIR) 2012 - Análisis de datos de NGS

Increasing importance of data analysis needs


Page 24: Curso de Genómica - UAT (VHIR) 2012 - Análisis de datos de NGS

NGS data analysis


Page 25: Curso de Genómica - UAT (VHIR) 2012 - Análisis de datos de NGS

NGS data analysis stages


Page 26: Curso de Genómica - UAT (VHIR) 2012 - Análisis de datos de NGS

Quality control and preprocessing of NGS data


Page 27: Curso de Genómica - UAT (VHIR) 2012 - Análisis de datos de NGS

Data types


Page 28: Curso de Genómica - UAT (VHIR) 2012 - Análisis de datos de NGS

Why QC and preprocessing

• Sequencer output:
  – Reads + quality
• Natural questions
  – Is the quality of my sequenced data OK?
  – If something is wrong, can I fix it?
• Problem: HUGE files... What do they look like?
• Files are flat files and big... tens of GB (hard even to browse)
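
One way to look inside a file that is too big to open is to stream it and stop after the first few records. The sketch below is only an illustration, assuming a gzip-compressed FASTQ file with four lines per record; `peek_fastq` is a hypothetical helper name, and an in-memory stream stands in for a real file on disk:

```python
import gzip
import io
from itertools import islice

def peek_fastq(handle, n_records=2):
    """Yield the first n_records FASTQ records (4 lines each) from a text handle."""
    lines = islice(handle, 4 * n_records)   # never reads past what is needed
    while True:
        record = [line.rstrip("\n") for line in islice(lines, 4)]
        if len(record) < 4:
            return
        yield tuple(record)

# Demo: an in-memory gzip stream stands in for a large file on disk.
raw = b"@r1\nACGT\n+\nIIII\n@r2\nGGCC\n+\n!!!!\n@r3\nTTTT\n+\n####\n"
compressed = io.BytesIO(gzip.compress(raw))
with gzip.open(compressed, "rt") as fh:
    first_two = list(peek_fastq(fh, n_records=2))

for header, seq, plus, qual in first_two:
    print(header, seq, qual)
```

Because the file is read line by line, memory use stays constant regardless of file size.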


Page 29: Curso de Genómica - UAT (VHIR) 2012 - Análisis de datos de NGS

Preprocessing sequences improves results


Page 30: Curso de Genómica - UAT (VHIR) 2012 - Análisis de datos de NGS

How is quality measured?

• Sequencing systems assign a quality score to each peak (base call)
• Phred scores are log10-transformed error probabilities: if p is the probability that the base call is wrong, the Phred score is
  $Q = -10 \log_{10} p$
  – score = 20 corresponds to a 1% error rate
  – score = 30 corresponds to a 0.1% error rate
  – score = 40 corresponds to a 0.01% error rate
• The base calling (A, T, G or C) is performed based on Phred scores.
• Ambiguous positions with Phred scores <= 20 are labeled with N.
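
The relation between Phred score and error probability, Q = -10·log10(p), can be checked numerically. A minimal sketch; the function names are chosen here for illustration:

```python
import math

def phred_from_error_prob(p):
    """Q = -10 * log10(p), for an error probability 0 < p <= 1."""
    return -10 * math.log10(p)

def error_prob_from_phred(q):
    """Inverse relation: p = 10 ** (-Q / 10)."""
    return 10 ** (-q / 10)

# Reproduce the three reference points listed on the slide.
for q in (20, 30, 40):
    print(f"Q{q}: error rate {error_prob_from_phred(q):.2%}")
```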


Page 31: Curso de Genómica - UAT (VHIR) 2012 - Análisis de datos de NGS

Data formats

• FastA format (everybody knows about it)
  – Header line starts with “>” followed by a sequence ID
  – Sequence (string of nt).
• FastQ format (http://maq.sourceforge.net/fastq.shtml)
  – First comes the sequence entry (like FastA, but the header starts with “@”)
  – Then “+” and the sequence ID (optional), and on the following line the QVs encoded as single-byte ASCII codes
    • Different quality encoding variants exist
• Nearly all downstream analyses take FastQ as input


Page 32: Curso de Genómica - UAT (VHIR) 2012 - Análisis de datos de NGS

The fastq format

• A FASTQ file normally uses four lines per sequence.
  – Line 1 begins with a '@' character and is followed by a sequence identifier and an optional description (like a FASTA title line).
  – Line 2 is the raw sequence letters.
  – Line 3 begins with a '+' character and is optionally followed by the same sequence identifier (and any description) again.
  – Line 4 encodes the quality values for the sequence in Line 2, and must contain the same number of symbols as letters in the sequence.
• Different encodings are in use
• Sanger format can encode a Phred quality score from 0 to 93 using ASCII 33 to 126

@Seq description
GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCATTTGTTCAACTCACAGTTT
+
!''*((((***+))%%%++)(%%%%).1***-+*''))**55CCF>>>>>>CCCCCCC65
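
The record above can be decoded by hand. The following sketch assumes the Sanger encoding described on this slide (ASCII offset 33); the variable names are illustrative:

```python
record = """@Seq description
GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCATTTGTTCAACTCACAGTTT
+
!''*((((***+))%%%++)(%%%%).1***-+*''))**55CCF>>>>>>CCCCCCC65"""

SANGER_OFFSET = 33  # Sanger encoding: Phred 0..93 stored as ASCII 33..126

header, sequence, plus, quality = record.split("\n")
# Basic format checks from the four-line description above.
assert header.startswith("@") and plus.startswith("+")
assert len(sequence) == len(quality)

# One Phred score per base: character code minus the offset.
phred = [ord(ch) - SANGER_OFFSET for ch in quality]
mean_q = sum(phred) / len(phred)
print(header[1:], "-", len(sequence), "bases, mean quality", round(mean_q, 1))
```

The first symbol, `!`, is ASCII 33 and therefore Phred 0, the lowest possible score.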


Page 33: Curso de Genómica - UAT (VHIR) 2012 - Análisis de datos de NGS

Some tools to deal with QC

• Use FastQC to see your starting state.

• Use Fastx-toolkit to optimize different datasets and then visualize the result with FastQC to prove your success!

• Hints:
  – Trimming, clipping and filtering may improve quality
  – But beware of removing too many sequences…

Go to the tutorial and try the exercises...
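
To make the idea of trimming and filtering concrete, here is a minimal sketch. It is not the algorithm used by FastQC or the Fastx-toolkit; the function names and the thresholds (`min_q=20`, `min_length=4`) are assumptions chosen for the example, and Sanger encoding (offset 33) is assumed:

```python
def trim_3prime(sequence, quality, min_q=20, offset=33):
    """Cut low-quality bases from the 3' end; stop at the first base with Q >= min_q."""
    end = len(sequence)
    while end > 0 and ord(quality[end - 1]) - offset < min_q:
        end -= 1
    return sequence[:end], quality[:end]

def keep_read(sequence, min_length=4):
    """Filter step: discard reads that became too short after trimming."""
    return len(sequence) >= min_length

# Made-up reads: 'I' encodes Q40 and '#' encodes Q2 in Sanger encoding.
reads = [("ACGTACGT", "IIIIII##"), ("ACGTACGT", "II######")]
trimmed = [trim_3prime(s, q) for s, q in reads]
kept = [(s, q) for s, q in trimmed if keep_read(s)]
print(f"kept {len(kept)} of {len(reads)} reads")
```

The second read is dropped entirely, which illustrates the warning above: aggressive thresholds improve average quality but cost sequences.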


Page 34: Curso de Genómica - UAT (VHIR) 2012 - Análisis de datos de NGS

Applications

• [1] Metagenomics
• [2] De novo sequencing
• [3] Amplicon analysis
• [4] Variant discovery
• [5] Transcriptome analysis
• …and more …


Page 35: Curso de Genómica - UAT (VHIR) 2012 - Análisis de datos de NGS

[1] Metagenomics & other community-based “omics”

Zoetendal E G et al. Gut 2008;57:1605-1615


Page 36: Curso de Genómica - UAT (VHIR) 2012 - Análisis de datos de NGS

[1] A metagenomics workflow

[Figure: metagenomics workflow diagram. Data shown: short reads (40-150 bp), contigs, ORFs, proteins/families/functions, functional profiles. Steps shown: assembly, gene prediction, homology searching, functional classification (ontologies), binning (sequences into species).]

Page 37: Curso de Genómica - UAT (VHIR) 2012 - Análisis de datos de NGS

[1] Metagenomic Approaches

SMALL-SCALE: 16S rRNA gene profiling
The basic approach is to identify microbes in a complex community by exploiting universal and conserved targets, such as rRNA genes (Petrosini).
Challenges and limitations: chimeric sequences caused by PCR amplification and sequencing errors.

LARGE-SCALE: Whole Genome Shotgun (WGS)
Whole-genome approaches make it possible to identify and annotate microbial genes and their functions in the community.
Environmental Shotgun Sequencing (ESS).
Challenges and limitations: relatively large amounts of starting material required; potential contamination of metagenomic samples with host genetic material; high numbers of genes of unknown function.

A primer on metagenomics. PLoS Comput Biol. 2010 Feb 26;6(2):e1000667.


Page 38: Curso de Genómica - UAT (VHIR) 2012 - Análisis de datos de NGS

[1] Comparative Metagenomics

Comparing two or more metagenomes is necessary to understand how genomic differences affect, and are affected by, the abiotic environment.

MEGAN can also be used to compare the OTU composition of two or more frequency-normalized samples.

MG-RAST provides a comparative functional and sequence-based analysis for uploaded samples.

Other software based on phylogenetic data includes UniFrac.

Page 39: Curso de Genómica - UAT (VHIR) 2012 - Análisis de datos de NGS

[1] Some Metagenomics projects

"Whole-genome shotgun sequencing": 78 million base pairs of unique DNA sequence were analyzed.

"Whole-genome shotgun sequencing" was applied to microbial populations. A total of 1.045 billion base pairs of nonredundant sequence were analyzed.

To date, 242 metagenomic projects are ongoing and 103 are completed (www.genomesonline.org).


Page 40: Curso de Genómica - UAT (VHIR) 2012 - Análisis de datos de NGS

[2] De novo sequencing


Page 41: Curso de Genómica - UAT (VHIR) 2012 - Análisis de datos de NGS

[3] Amplicon analysis

Each amplicon (PCR product) is sequenced individually, allowing for the identification of rare variants and the assignment of haplotype information over the full sequence length

Some applications:
● Ultra-deep amplicon sequencing: detection of low-frequency (<1%) variants in complex mixtures → rare somatic mutations, viral quasispecies...
● Ultra-broad amplicon sequencing: identification of rare alleles associated with hereditary diseases, heterozygote SNP calling...
● 16S rRNA amplicon sequencing: metabolic profiling of environmental habitats, bacterial taxonomy and phylogeny


Page 42: Curso de Genómica - UAT (VHIR) 2012 - Análisis de datos de NGS

[3] Example of raw data generation with GS-FLX

...


Page 43: Curso de Genómica - UAT (VHIR) 2012 - Análisis de datos de NGS

[3] Data Workflow

...

Data Processing

Page 44: Curso de Genómica - UAT (VHIR) 2012 - Análisis de datos de NGS

[3] Final output examples

...

Bar plots output example (with circular legend for the AA)

NT substitution (error) matrices

AA frequency tables


Page 45: Curso de Genómica - UAT (VHIR) 2012 - Análisis de datos de NGS

[4] Variant discovery

Your aligner decides the type/amount of variants you can identify

• Naive SNP calling: read counting
• Statistically supported SNP calling: maximum likelihood, Bayesian
• Quality score recalibration: recalibrate quality scores from the whole alignment
• Local realignment around indels: realign reads
• Known variants (limited species): dbSNP
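
As an illustration of the "naive" approach based on read counting, the sketch below calls a variant at a single position when enough reads disagree with the reference. It is a toy example, not the method of any particular caller; the function name and both thresholds (`min_depth`, `min_alt_fraction`) are hypothetical:

```python
from collections import Counter

def naive_snp_call(ref_base, pileup_bases, min_depth=8, min_alt_fraction=0.2):
    """Return the most frequent non-reference base if it passes both thresholds, else None."""
    depth = len(pileup_bases)
    if depth < min_depth:
        return None                       # too few reads to say anything
    alt_counts = Counter(b for b in pileup_bases if b != ref_base)
    if not alt_counts:
        return None                       # every read agrees with the reference
    alt_base, alt_count = alt_counts.most_common(1)[0]
    return alt_base if alt_count / depth >= min_alt_fraction else None

print(naive_snp_call("A", "AAAAAGGGGG"))  # half of the reads support G
print(naive_snp_call("A", "AAAAAAAAAG"))  # a single G looks like a sequencing error
```

Counting ignores base and mapping qualities, which is why the statistically supported callers listed above (maximum likelihood, Bayesian) are generally preferred.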


Page 46: Curso de Genómica - UAT (VHIR) 2012 - Análisis de datos de NGS

[4] Example: Exome Variant Analysis


Page 47: Curso de Genómica - UAT (VHIR) 2012 - Análisis de datos de NGS

[4] Genotype calling tools


Page 48: Curso de Genómica - UAT (VHIR) 2012 - Análisis de datos de NGS

[4] GATK pipeline


Page 49: Curso de Genómica - UAT (VHIR) 2012 - Análisis de datos de NGS

[4]


Page 50: Curso de Genómica - UAT (VHIR) 2012 - Análisis de datos de NGS

[4] Many ongoing sequencing projects


Page 51: Curso de Genómica - UAT (VHIR) 2012 - Análisis de datos de NGS

[5] Transcriptome Analysis using NGS

RNA-Seq, or "Whole Transcriptome Shotgun Sequencing" ("WTSS") refers to use of HTS technologies to sequence cDNA in order to get information about a sample's RNA content.

Reads produced by sequencing

Aligned to a reference genome to build transcriptome mappings.


Page 52: Curso de Genómica - UAT (VHIR) 2012 - Análisis de datos de NGS

[5] Applications (1) Whole transcriptome analysis

[Figure: library preparation diagram. Labels: mRNA (poly-A tail, AAAA), fragmentation, RT, cDNA library, sequencing.]

• Detects expression of known and novel mRNAs
• Identification of alternative splicing events
• Detects expressed SNPs or mutations
• Identifies allele specific expression patterns


Page 53: Curso de Genómica - UAT (VHIR) 2012 - Análisis de datos de NGS

[5] Applications (2) Differential expression

1. Reads are mapped to the reference genome or transcriptome.

2. Mapped reads are assembled into expression summaries (tables of counts, showing how many reads are in each coding region, exon, gene or junction).

3. The data are normalized.

4. Statistical testing of differential expression (DE) is performed, producing a list of genes with P-values and fold changes.


Page 54: Curso de Genómica - UAT (VHIR) 2012 - Análisis de datos de NGS

[5] RNA Seq data analysis - Mapping

• Main issues:
  – Number of allowed mismatches
  – Number of multihits
  – Mates expected distance
  – Considering exon junctions

End up with a list of # of reads per transcript

These will be our (discrete) response variable
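
How mapped reads become a table of counts can be sketched in a few lines. The data below are made up, and discarding multihits is only one possible policy, chosen here for simplicity rather than taken from the slides:

```python
from collections import Counter

# (read_id, transcript_id) pairs as a mapper might report them; made-up data.
alignments = [
    ("read1", "tx_A"), ("read2", "tx_A"), ("read3", "tx_B"),
    ("read4", "tx_A"), ("read4", "tx_B"),   # read4 maps to two places (a multihit)
]

hits_per_read = Counter(read_id for read_id, _ in alignments)

# One simple policy: keep only uniquely mapped reads.
counts = Counter(tx for read_id, tx in alignments if hits_per_read[read_id] == 1)
print(dict(counts))
```

The resulting integers per transcript are the discrete response variable that the statistical models in the following slides work with.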


Page 55: Curso de Genómica - UAT (VHIR) 2012 - Análisis de datos de NGS

[5] RNA Seq data analysis - Normalization

• Two main sources of bias
  – Influence of length: counts are proportional to the transcript length times the mRNA expression level.
  – Influence of sequencing depth: the higher the sequencing depth, the higher the counts.
• How to deal with this
  – Normalize (correct) gene counts to minimize biases.
  – Use statistical models that take into account length and sequencing depth
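
The two biases on this slide (transcript length and sequencing depth) can be corrected together. One widely used measure of this kind is RPKM (reads per kilobase of transcript per million mapped reads); the slide does not name a specific method, so choosing RPKM here is an assumption, and the numbers are made up:

```python
def rpkm(count, gene_length_bp, total_mapped_reads):
    """Reads per kilobase of transcript per million mapped reads."""
    return count * 1e9 / (gene_length_bp * total_mapped_reads)

# Two genes with very different raw counts, lengths and library sizes.
short_shallow = rpkm(count=100, gene_length_bp=1_000, total_mapped_reads=10_000_000)
long_deep = rpkm(count=400, gene_length_bp=2_000, total_mapped_reads=20_000_000)
print(short_shallow, long_deep)
```

The raw counts differ fourfold, yet both genes get the same normalized value once length and depth are taken into account.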


Page 56: Curso de Genómica - UAT (VHIR) 2012 - Análisis de datos de NGS

[5] RNA Seq - Differential expression methods

• Fisher's exact test or similar approaches.

• Use Generalized Linear Models and model counts using
  – Poisson distribution.
  – Negative binomial distribution.

• Transform count data to use existing approaches for microarray data.

• …
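
As an example of the first option, Fisher's exact test can be applied to a 2x2 table (reads from one gene vs. all other reads, in two conditions). The sketch below is a plain standard-library implementation written for illustration, with hypothetical function names; it ignores biological replicates, which is one reason count models such as the negative binomial are used in practice:

```python
from math import comb

def hypergeom_pmf(a, row1, row2, col1):
    """P(top-left cell == a) for a 2x2 table with fixed margins."""
    return comb(row1, a) * comb(row2, col1 - a) / comb(row1 + row2, col1)

def fisher_exact_two_sided(a, b, c, d):
    """Two-sided p-value: sum over all tables at most as likely as the observed one."""
    row1, row2, col1 = a + b, c + d, a + c
    p_obs = hypergeom_pmf(a, row1, row2, col1)
    p_value = 0.0
    for x in range(max(0, col1 - row2), min(row1, col1) + 1):
        p_x = hypergeom_pmf(x, row1, row2, col1)
        if p_x <= p_obs * (1 + 1e-9):      # small tolerance for float rounding
            p_value += p_x
    return min(p_value, 1.0)

# Made-up counts: gene reads / other reads in condition A, then in condition B.
p = fisher_exact_two_sided(30, 970, 10, 990)
print(f"p-value = {p:.4g}")
```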


Page 57: Curso de Genómica - UAT (VHIR) 2012 - Análisis de datos de NGS

[5] Advantages of RNA-seq

• Unlike hybridization approaches, does not require existing genomic sequence
• Expected to replace microarrays for transcriptomic studies
• Very low background noise
• Reads can be unambiguously mapped
• Resolution up to 1 bp
• High-throughput quantitative measurement of transcript abundance
• Better than Sanger sequencing of cDNA or EST libraries
• Cost decreasing all the time
• Lower than traditional sequencing
• Can reveal sequence variations (SNPs)
• Automated pipelines available


Page 58: Curso de Genómica - UAT (VHIR) 2012 - Análisis de datos de NGS

Software for NGS preprocessing and analysis


Page 59: Curso de Genómica - UAT (VHIR) 2012 - Análisis de datos de NGS

Which software for NGS (data) analysis?

• Answer is not straightforward.
• Many possible classifications
  – Biological domains
    • SNP discovery, Genomics, ChIP-Seq, De-novo assembly, …
  – Bioinformatics methods
    • Mapping, Assembly, Alignment, Seq-QC, …
  – Technology
    • Illumina, 454, ABI SOLID, Helicos, …
  – Operating system
    • Linux, Mac OS X, Windows, …
  – License type
    • GPLv3, GPL, Commercial, Free for academic use, …
  – Language
    • C++, Perl, Java, C, Python
  – Interface
    • Web Based, Integrated solutions, command line tools, pipelines, …

http://seqanswers.com/wiki/Software/list



Page 61: Curso de Genómica - UAT (VHIR) 2012 - Análisis de datos de NGS

Some popular tools and places


Page 62: Curso de Genómica - UAT (VHIR) 2012 - Análisis de datos de NGS

Galaxy Site


http://galaxy.psu.edu/

Page 63: Curso de Genómica - UAT (VHIR) 2012 - Análisis de datos de NGS

Obtain data from many data sources including the UCSC Table Browser, BioMart, WormBase, or your own data.

Prepare data for further analysis by rearranging or cutting data columns, filtering data and many other actions.

Analyze data by finding overlapping regions, determining statistics, phylogenetic analysis and much more.

Page 64: Curso de Genómica - UAT (VHIR) 2012 - Análisis de datos de NGS

Contains links to the downloading, pre-processing and analysis tools

Displays menus and data inputs

Shows the history of analysis steps, data and result viewing

Register User


Page 65: Curso de Genómica - UAT (VHIR) 2012 - Análisis de datos de NGS


Click Get Data

Page 66: Curso de Genómica - UAT (VHIR) 2012 - Análisis de datos de NGS


Get Data from Database


Page 67: Curso de Genómica - UAT (VHIR) 2012 - Análisis de datos de NGS

Upload File

File Format

Upload or paste file

Page 68: Curso de Genómica - UAT (VHIR) 2012 - Análisis de datos de NGS


Page 69: Curso de Genómica - UAT (VHIR) 2012 - Análisis de datos de NGS

FASTQ file manipulation: format conversion, summary statistics, trimming reads, filtering reads by quality score…

Page 70: Curso de Genómica - UAT (VHIR) 2012 - Análisis de datos de NGS

Input: Sanger FASTQ
Output: SAM format

Page 71: Curso de Genómica - UAT (VHIR) 2012 - Análisis de datos de NGS

Downstream analysis: SAM -> BAM


Page 72: Curso de Genómica - UAT (VHIR) 2012 - Análisis de datos de NGS

Copyright OpenHelix. No use or reproduction without express written consent

List saved histories and shared histories.

Work on a current history, create new, share workflow


Page 73: Curso de Genómica - UAT (VHIR) 2012 - Análisis de datos de NGS

Creates a workflow, which allows the user to repeat an analysis using different datasets.


Page 74: Curso de Genómica - UAT (VHIR) 2012 - Análisis de datos de NGS

DATA VISUALIZATION


Page 75: Curso de Genómica - UAT (VHIR) 2012 - Análisis de datos de NGS

Why is visualization important?

• make large amounts of data more interpretable
• glean patterns from the data
• sanity check / visual debugging
• more…


Page 76: Curso de Genómica - UAT (VHIR) 2012 - Análisis de datos de NGS

History of Genome Visualization

[Figure: timeline of genome visualization; time axis: 1800s, 1900s, 2000s.]


Page 77: Curso de Genómica - UAT (VHIR) 2012 - Análisis de datos de NGS

What is a “Genome Browser”

• linear representation of a genome
• position-based annotations, each called a track
  – continuous annotations: e.g. conservation
  – interval annotations: e.g. gene, read alignment
  – point annotations: e.g. SNPs
• user specifies a subsection of genome to look at
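
The core operation of a genome browser, showing only the annotations that fall inside the region the user is looking at, can be sketched as follows. This is a toy data structure for illustration, not how any of the browsers mentioned later is implemented; the `Interval` class, the function name and the half-open coordinate convention are assumptions of this example:

```python
from dataclasses import dataclass

@dataclass
class Interval:
    chrom: str
    start: int   # 0-based, inclusive (a convention chosen for this sketch)
    end: int     # exclusive
    label: str

def features_in_view(track, chrom, view_start, view_end):
    """Return the annotations of a track that overlap the region being viewed."""
    return [f for f in track
            if f.chrom == chrom and f.start < view_end and f.end > view_start]

# Two tracks: interval annotations (genes) and point annotations (SNPs, length 1).
gene_track = [Interval("chr1", 100, 500, "geneA"), Interval("chr1", 900, 1200, "geneB")]
snp_track = [Interval("chr1", 450, 451, "snp1"), Interval("chr2", 450, 451, "snp2")]

for name, track in (("genes", gene_track), ("SNPs", snp_track)):
    print(name, [f.label for f in features_in_view(track, "chr1", 400, 1000)])
```

A linear scan is enough for a handful of features; real browsers use indexed formats so that only the requested region has to be read.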


Page 78: Curso de Genómica - UAT (VHIR) 2012 - Análisis de datos de NGS

Server-side model (e.g. UCSC, Ensembl, GBrowse)

server
• central data store
• renders images
• sends to client

client
• requests images
• displays images


Page 79: Curso de Genómica - UAT (VHIR) 2012 - Análisis de datos de NGS

Client-side model (e.g. Savant, IGV)

server
• stores data

client
• local HTS store
• renders images
• displays images

HTS machine

Page 80: Curso de Genómica - UAT (VHIR) 2012 - Análisis de datos de NGS

Rough comparison of Genome Browsers

                      UCSC      Ensembl   GBrowse   Savant    IGV
Model                 Server    Server    Server    Client    Client

Other rows, rated graphically in the original slide: Interactive, HTS support, Database of tracks, Plugins

Rating legend: No support / Some support / Good support


Page 81: Curso de Genómica - UAT (VHIR) 2012 - Análisis de datos de NGS

Limitations of most genome browsers
• do not support multiple genomes simultaneously
• do not capture 3-dimensional conformation
• do not capture spatial or temporal information
• do not integrate well with analytics
• cannot be customized

The SAVANT GENOME BROWSER has been created to overcome these limitations


Page 82: Curso de Genómica - UAT (VHIR) 2012 - Análisis de datos de NGS

Integrative Genomics Viewer (IGV)

The Integrative Genomics Viewer (IGV) is a high-performance visualization tool for interactive exploration of large, integrated datasets. It supports a wide variety of data types including sequence alignments, microarrays, and genomic annotations.

Page 83: Curso de Genómica - UAT (VHIR) 2012 - Análisis de datos de NGS

Acknowledgements

Statistics and Bioinformatics Research Group, Department of Statistics, Universitat de Barcelona.

All the members of the Unitat d’Estadística i Bioinformàtica at VHIR (Vall d’Hebron Institut de Recerca).

Unitat de Serveis Científico Tècnics (UCTS) at VHIR (Vall d’Hebron Institut de Recerca).

People whose materials have been borrowed or who have contributed with their work: Manel Comabella, Rosa Prieto, Paqui Gallego, Javier Santoyo, Ana Conesa, Thomas Girke and Silvia Cardona…


Page 84: Curso de Genómica - UAT (VHIR) 2012 - Análisis de datos de NGS

Thank you for your attention and patience
