
Post on 17-Dec-2015


Genome Browser

The Plot

Deepak Purushotham
Hamid Reza Hassanzadeh
Haozheng Tian
Juliette Zerick
Lavanya Rishishwar
Piyush Ranjan
Lu Wang

The Outline

• The Need & The Requirement
• The Options
• The Chosen One
• The New Age

THE NEED
Why one should develop a Genome Browser

Why A Genome Browser?

I want to analyze this organism

• Gene Functions
• Protein Domains
• Metabolic Pathways
• Comparative Analysis
• Synteny

THE REQUIREMENT
What is expected out of a Genome Browser

A Genome Browser?

I want something manageable

A Genome Browser!

The Genome Browser

“Genome browsers facilitate genomic analysis by presenting alignment, experimental and annotation data in the context of genomic DNA sequences.”

Melissa S Cline & James W Kent, 2009

Genome browsers aggregate data

Taken From Andy Conley’s slides without permission

THE OPTIONS
A Short Survey of the Available Genome Browser Modules

A Brief Time Travel

• The four model organism databases (MODs): FlyBase, SGD, MGD, and WormBase
• Setting up a MOD is expensive and time-consuming.
• The four MODs agreed in the fall of 2000 to pool their resources and to make reusable components available to the community free of charge under an open source license.
• The goal of this NIH-funded project, christened GMOD, is “…to generate a model organism database construction set that would allow a new model organism to be assembled by mixing and matching various components.”

GMOD

Who uses GMOD?

GMOD Components

Visualization - GBrowse

Visualization

JBrowse

GBrowse Synteny

CMAP

DATA MANAGEMENT

Chado

Tripal (http://www.cacaogenomedb.org/)

TableEdit

BioMart

InterMine

ANNOTATION

MAKER

DIYA

Galaxy

Ergatis

Apollo

REALLY EXCITING OPTION!

JBrowse

• Smooth, fast navigation (think Google Maps for genomes)
• Supports BED, GFF, Bio::DB::*, Chado, WIG, BAM, UCSC (intron/exon structure, name lookups, quantitative plots)
• Relies on pre-indexing to minimize security exposure and runtime bandwidth/CPU load on the server (future versions more likely to do some server work at runtime)
• Has an API for customized track/glyph extensions
• Is stably funded by NHGRI, with many interesting innovations implemented & pending integration

Smoother UI

Most Genome browsers

How is JBrowse different?

Types of Tracks

Pros

• Fast and smooth!
• User friendly
• Works nicely on an iPad/iPhone too

Cons

• No user-uploaded data support
• Slow for large numbers of reference sequences (e.g. 5,000 annotated contigs)
• Few glyph options; feature tracks are limited by being drawn as HTML <div> elements

What to pick?

?

Tried and tested

Fancy concept

THE CHOSEN ONE
GBrowse and its Features

GBrowse

• Most popular web-based genome browser
• Visualize genome features along a reference sequence
• Open Source
• Highly customizable
• Excellent usability
• Rich set of “glyphs”
  – Genome features
  – Quantitative Data
  – Sequence Alignments

GBrowse

Header

Main Browser Window

Track Menu

Under The Hood

• Client-Server Architecture
• GBrowse Architecture
• Installation Issues
• Input Data
• Configuration File
• Customization

Client Server Architecture

1. The user types in the URL: browser2012.biology.gatech.edu
2. The browser interprets the URL and sends the request to the HTTP server
3. The web server receives the request and “serves” the client, i.e., starts GBrowse
4. On success, the relevant hypertext and multimedia are generated by accessing the database
5. The output traverses the same path back
6. The whole process repeats when the user interacts with the browser
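As a rough illustration of steps 3 to 6, the sketch below uses Python's standard WSGI interface rather than GBrowse's own Perl code. The `FEATURES` dictionary stands in for the database, and `app`, `render_page`, `simulate_request`, and the `name` query parameter are hypothetical names chosen for this example only.

```python
from html import escape
from urllib.parse import parse_qs
from wsgiref.util import setup_testing_defaults

# Hypothetical stand-in for the feature database queried in step 4.
FEATURES = {
    "geneA": "contig1:100..900 (+)",
    "geneB": "contig1:1200..2100 (-)",
}


def render_page(name):
    """Build the hypertext for one feature, or return None if it is unknown."""
    location = FEATURES.get(name)
    if location is None:
        return None
    return (
        f"<html><body><h1>{escape(name)}</h1>"
        f"<p>{escape(location)}</p></body></html>"
    )


def app(environ, start_response):
    """Step 3: receive the request. Step 4: build the page. Step 5: return it."""
    query = parse_qs(environ.get("QUERY_STRING", ""))
    name = query.get("name", [""])[0]
    page = render_page(name)
    if page is None:
        start_response("404 Not Found", [("Content-Type", "text/plain")])
        return [b"feature not found"]
    start_response("200 OK", [("Content-Type", "text/html")])
    return [page.encode("utf-8")]


def simulate_request(query_string):
    """Call the app the way a web server would, without opening a socket."""
    environ = {}
    setup_testing_defaults(environ)
    environ["QUERY_STRING"] = query_string
    captured = {}

    def start_response(status, headers):
        captured["status"] = status

    body = b"".join(app(environ, start_response))
    return captured["status"], body.decode("utf-8")
```

Each call to `simulate_request` corresponds to one pass through the cycle; step 6 is simply another call with a different query string. To serve it over HTTP, `app` could be passed to `wsgiref.simple_server.make_server`.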

How you see what you see

Juxtaposed Images

How are so many images generated?

+ Hyper Text files

Multimedia files + Hyper Text

©2002 by Cold Spring Harbor Laboratory Press

Stein L D et al. Genome Res. 2002;12:1599-1610

GBrowse Architecture

The Bio::DB::SeqFeature database Schema

[Schema diagram: tables Attribute, Attribute List, Feature, Name, Type List, Location List, and Parent2Child, with 1 and n cardinality labels on the connecting lines]

Data file (.gff3)

1. Reference Sequence (Chr/Clone/Contig)
2. Source (e.g., Prodigal/Glimmer)
3. Type (sequence ontology (SO) terms)
4. Start
5. End
6. Score (e.g., E-value)
7. Strand
8. Phase (0/1/2)
9. Attributes (format: tag=value)
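A minimal sketch of reading one such line, assuming the standard GFF3 layout of nine tab-separated columns with "." marking an undefined value. The names `GffRecord` and `parse_gff3_line` and the example record are made up for illustration.

```python
from typing import NamedTuple, Optional


class GffRecord(NamedTuple):
    """One GFF3 feature line, one field per column."""
    seqid: str              # reference sequence (chromosome/clone/contig)
    source: str             # e.g. Prodigal or Glimmer
    feature_type: str       # sequence ontology term
    start: int              # 1-based, inclusive
    end: int                # 1-based, inclusive
    score: Optional[float]  # e.g. an E-value; None when the column is "."
    strand: str             # "+", "-", or "."
    phase: Optional[int]    # 0, 1, or 2; None when the column is "."
    attributes: str         # raw tag=value pairs, left unparsed here


def parse_gff3_line(line: str) -> GffRecord:
    """Split one tab-separated GFF3 line into its nine typed columns."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 9:
        raise ValueError(f"expected 9 tab-separated columns, got {len(fields)}")
    seqid, source, ftype, start, end, score, strand, phase, attributes = fields
    return GffRecord(
        seqid=seqid,
        source=source,
        feature_type=ftype,
        start=int(start),
        end=int(end),
        score=None if score == "." else float(score),
        strand=strand,
        phase=None if phase == "." else int(phase),
        attributes=attributes,
    )


# Made-up example record, not taken from the project's data.
example = "contig1\tProdigal\tCDS\t100\t900\t1e-20\t+\t0\tID=cds1;Name=geneA"
record = parse_gff3_line(example)
```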

Attributes (Data file)

Different tags have predefined meanings:

• ID: Gives the feature a unique identifier. Useful when grouping features together (such as all the exons in a transcript).
• Name: Display name for the feature. This is the name to be displayed to the user.
• Alias: A secondary name for the feature. It is suggested that this tag be used whenever a secondary identifier for the feature is needed, such as locus names and accession numbers.
• Note: A descriptive note to be attached to the feature. This will be displayed as the feature's description.

Alias and Note fields can have multiple values separated by commas. For example: Alias=M19211,gna-12,GAMMA-GLOBULIN

• Other good stuff can go into the attributes field.
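The tag=value convention can be sketched as below, assuming the usual GFF3 rules: pairs are separated by semicolons, multiple values by commas, and reserved characters are URL-escaped. `parse_attributes` is a hypothetical helper name.

```python
from urllib.parse import unquote


def parse_attributes(column: str) -> dict:
    """Turn 'tag=value;tag=v1,v2' into {'tag': ['value'], 'tag': ['v1', 'v2']}."""
    attributes = {}
    for pair in column.strip().split(";"):
        if not pair:
            continue  # tolerate a trailing semicolon
        tag, _, value = pair.partition("=")
        # Every tag maps to a list so single and multiple values look alike.
        attributes[tag] = [unquote(v) for v in value.split(",")]
    return attributes


parsed = parse_attributes("ID=gene1;Name=gna-12;Alias=M19211,gna-12,GAMMA-GLOBULIN")
```

Storing every value as a list keeps the calling code uniform, at the cost of an extra `[0]` when reading single-valued tags such as ID.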

GBrowse Configuration File

• Global Website Settings
• Additional HTML Pages
• JavaScript
• jQuery
• Global Database Settings
• Data Source Definitions

Customizations

Configuration file (.conf)

Making a new Track

### TRACK CONFIGURATION ###
[ExampleFeatures]
feature = remark
glyph = generic
stranded = 1
bgcolor = orange
height = 10
key = Example Features

Adding Multiple Tracks

Data:

Configuration:

Result UI:
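Because the track stanza is INI-like, its structure can be inspected with Python's `configparser`. This is only a sketch for checking a simple stanza: GBrowse reads its configuration with its own Perl code, and real configuration files may use constructs this parser does not handle. `read_tracks` and `STANZA` are names made up here.

```python
import configparser

# The stanza from the slide, one option per line.
STANZA = """\
### TRACK CONFIGURATION ###
[ExampleFeatures]
feature = remark
glyph = generic
stranded = 1
bgcolor = orange
height = 10
key = Example Features
"""


def read_tracks(text):
    """Return {track_name: {option: value}} for every [section] in the text."""
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(text)
    return {name: dict(parser[name]) for name in parser.sections()}


tracks = read_tracks(STANZA)
```

Here the section name (`ExampleFeatures`) identifies the track, `feature` selects which feature type from the data file to draw, `glyph` chooses how it is drawn, and `key` is the label shown to the user.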

Searchable

Links

Popup balloons with links

Searching for Features

• Gene symbols
• Gene IDs
• Sequence IDs
• Genetic markers
• Relative nucleotide coordinates
• Absolute nucleotide coordinates
• etc...

click

Viewing Multiple Tracks

Low Magnification

Viewing Multiple Tracks

High Magnification

In short…

• Main features (determination of protein coding and non-coding, …)
• Quantitative data (E-value, identity percentage)
• Other evidence (InterPro, COGs, etc.)
• GC content and other useful measurements
• Protein and DNA sequences
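As one example of such a measurement, GC content can be computed per window and shown as a quantitative track. This is a minimal sketch; the function names and the window-based layout are assumptions, not the project's actual code.

```python
def gc_content(sequence: str) -> float:
    """Fraction of G and C among the unambiguous bases (A, C, G, T)."""
    counted = [base for base in sequence.upper() if base in "ACGT"]
    if not counted:
        return 0.0
    gc = sum(1 for base in counted if base in "GC")
    return gc / len(counted)


def windowed_gc(sequence: str, window: int) -> list:
    """GC content of consecutive, non-overlapping windows along the sequence."""
    return [
        gc_content(sequence[i:i + window])
        for i in range(0, len(sequence), window)
    ]
```

Ambiguous bases such as N are ignored rather than counted as non-GC, so gaps in a draft assembly do not drag the value down.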

THE NEW AGE
Value-Added Additions

RICHER ANNOTATION
What’s New

INCREASED ANNOTATION INFO
Richer Annotation

[Bar chart: Total Genes, Pangenome Hits, and UniProt hits per strain (axis 0 to 3000) for strains M19107, M19501, M21127, M21621, M21639, and M21709]

INTEGRATED QUALITY SCORE
Richer Annotation

Origin of Database Matches

A color code was used for matches originating from different databases

Quality Value Integration

It distinguishes between different databases…

However, for matches from the same database…

Quality Scores

Origin of Database Matches

A color code will also be used for matches of different quality…

Different E-values are shown with different shades of color
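One way to turn an E-value into a shade is to blend from white toward the database's base color on a negative log10 scale. The slides do not state the scale actually used, so the function name, the base color, and the `floor` cutoff below are all assumptions.

```python
import math


def evalue_to_shade(evalue, base_rgb=(0, 0, 255), floor=1e-100):
    """Map an E-value to a hex color: white for weak hits, base_rgb for strong ones."""
    # Clamp so that an E-value of 0 or above 1 still gives a valid shade.
    clamped = min(max(evalue, floor), 1.0)
    # strength runs from 0.0 (E-value of 1) to 1.0 (E-value at or below floor).
    strength = -math.log10(clamped) / -math.log10(floor)
    # Linear blend of each channel from 255 (white) toward the base color.
    r, g, b = (round(255 + (channel - 255) * strength) for channel in base_rgb)
    return f"#{r:02x}{g:02x}{b:02x}"
```

With one base color per database, a single call covers both cues: the hue says where the match came from and the shade says how good it is.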

What’s New

MORE LINK-OUTS

COGs

KEGG ID

PATHWAYS
What’s New

KEGG ID

KEGG Genes

KEGG Compound

KEGG Pathway

ORGANISM SPECIFIC PAGES
Synthesis!

Organism Summary Page

• At this point of the course, we have gathered a lot of information for the strains we are dealing with

• Not all of this information could be represented inside the genome browser

• We propose a separate section in the browser containing strain-wise summarized information

Organism Summary Page

• Conceptually, the page could contain:
  – Biological information
  – Assembly information: genome size, number of contigs, N50, sequencing platform
  – Gene prediction information: number of protein-coding and non-protein-coding genes, links to the 16S rRNA gene
  – Annotation information: percent annotation, function distribution pie chart
  – Comparative information: unique protein clusters, etc.
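The assembly figures on such a page can be derived from the contig lengths alone. The sketch below uses the common definition of N50 (the length of the contig at which the running total, taken from longest to shortest, first reaches half the assembly size); `n50` and `assembly_summary` are illustrative names.

```python
def n50(contig_lengths):
    """Length of the contig that pushes the running total past half the assembly."""
    lengths = sorted(contig_lengths, reverse=True)
    half = sum(lengths) / 2
    running = 0
    for length in lengths:
        running += length
        if running >= half:
            return length
    raise ValueError("no contigs given")


def assembly_summary(contig_lengths):
    """Collect the assembly figures proposed for the organism summary page."""
    lengths = list(contig_lengths)
    return {
        "genome_size": sum(lengths),
        "num_contigs": len(lengths),
        "n50": n50(lengths),
    }
```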

Organism Summary Page

OPERONS
Adding more value

Operons

• Operon: “…is a functioning unit of genomic DNA containing a cluster of genes under the control of a single regulatory signal or promoter”

• ~70% of the genes have been assigned a unique OperonID

• OperonID will provide an additional browsing mechanism for biologists, connecting co-transcribed and co-regulated genes.
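Browsing by OperonID amounts to grouping genes that share an identifier. A minimal sketch, assuming each gene carries either an operon identifier or none; the gene and operon names are invented for the example.

```python
from collections import defaultdict


def group_by_operon(genes):
    """genes: iterable of (gene_id, operon_id or None) -> {operon_id: [gene_id, ...]}."""
    operons = defaultdict(list)
    for gene_id, operon_id in genes:
        if operon_id is not None:  # genes without an operon are skipped
            operons[operon_id].append(gene_id)
    return dict(operons)


def operon_partners(genes, gene_id):
    """Other genes in the same operon as gene_id (empty list if it has none)."""
    genes = list(genes)
    membership = dict(genes)
    operon_id = membership.get(gene_id)
    if operon_id is None:
        return []
    return [g for g in group_by_operon(genes)[operon_id] if g != gene_id]


# Invented example data.
GENES = [("geneA", "op1"), ("geneB", "op1"), ("geneC", None), ("geneD", "op2")]
```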

Operons

Incorporating Operon Information

BRIG PATTERN
More with Comparison

BRIG Patterns

• Concept: Either generate BRIG images at run time or load static images when the user requests the BRIG pattern between two species

BRIG Patterns

That’s All Folks!

• Questions?
• Comments?
• Concerns?

• If you have any suggestions, we would love to hear from you! (There is a page on the Wiki for it!)