Galaxy: Integrative, Reproducible Analysis of Genomics Data

Genomic and Proteomic Approaches to Heart, Lung, Blood and Sleep Disorders

Jackson Laboratories

Ross Hardison

September 10, 2008

Galaxy is developed and maintained by Anton Nekrutenko (PSU) and James Taylor (Emory U)

Types of data in genomics

• Sequences
• Comparisons of DNA and protein sequences
• Expression data
• Chromosomes and chromatin data
• Experimental manipulation
• Variation and phenotypes
• Protein structure and function
• Stored in databases and browsers (e.g. UCSC Genome Browser)
• Many analysis tools (Galaxy)

Some major web resources in genomics

• UCSC Genome Browser and Table Browser
– http://genome.ucsc.edu/

• Ensembl and EnsMart/BioMart
– http://www.ensembl.org/

• TIGR Comprehensive Microbial Resource
– http://cmr.tigr.org/

• NCBI for Blast server, PubMed, Gene Expression Omnibus, dbSNP, etc.
– http://www.ncbi.nlm.nih.gov/

• dCode for alignments and other
– http://dcode.org

• HapMap for haplotype and variation
– http://hapmap.org

• Galaxy for data retrieval and analysis
– http://galaxy.psu.edu

Sequences

• DNA sequences
– Whole genomes and chromosomes
– Genes
• Transcripts
– Protein-coding and noncoding transcripts
– Full-length or partial (expressed sequence tags or ESTs)
• Protein sequences
– Known
– Predicted
• Repeats
• Variants

Sequences from CFTR: Browser view

Regulation-related features around T2D risk variants

Reg Pot

Browsers vs Data Retrieval

• Browsers are designed to show selected information on one locus or region at a time.
– UCSC Genome Browser
– Ensembl

• Run on top of databases that record vast amounts of information.

• Sometimes need to retrieve one type of information for many genomic intervals or genome-wide.

• Access this by querying on the tables in the databases or “data marts”
– UCSC Table Browser
– EnsMart or BioMart

Retrieve all the protein-coding exons in humans

Challenges in genomic data analysis

• We have great browsers and data warehouses
– But most lack facilities for performing sophisticated analysis

• Many useful computational tools have been developed in bioinformatics
– But they are not well integrated, they have different user interfaces, different data formats, etc.

Some common solutions

• Glue it all together with Excel
– Until you realize Excel cannot handle that much data and the match isn’t coming out right anyway…

• Glue it all together with Perl
– But that leads to duplication of effort, duplication of bugs, …

A better solution

• Build a framework that:
– Defines a common format for describing the interfaces of different computational tools and databases
– Provides the infrastructure to adapt those interfaces into standard form
– Defines common data types and standards for integrating the results
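
As a rough illustration of the first point in the list above, a tool's interface can be written down as data rather than code. The Python sketch below is hypothetical: the names `Param`, `ToolDescription` and `build_command`, and the `sort` example, are assumptions made for illustration and are not Galaxy's actual format (Galaxy itself describes tools in XML configuration files).

```python
from dataclasses import dataclass, field
import shlex


@dataclass
class Param:
    """One user-settable parameter of a tool (hypothetical schema)."""
    name: str
    kind: str           # e.g. "data", "integer", "text"
    default: str = ""   # empty string means "the user must supply a value"


@dataclass
class ToolDescription:
    """Declarative description of a command-line tool's interface."""
    tool_id: str
    command: str                     # template with {name} placeholders
    params: list = field(default_factory=list)
    output_format: str = "tabular"   # common data type assigned to the result

    def build_command(self, values):
        """Fill the template from user values, falling back to defaults."""
        filled = {p.name: values.get(p.name, p.default) for p in self.params}
        missing = [name for name, value in filled.items() if value == ""]
        if missing:
            raise ValueError(f"missing parameters: {missing}")
        # Quote every value, then split into an argument list.
        quoted = {n: shlex.quote(str(v)) for n, v in filled.items()}
        return shlex.split(self.command.format(**quoted))


# Illustrative tool: sort a file on a chosen column.
sort_tool = ToolDescription(
    tool_id="sort1",
    command="sort -k {column} {input}",
    params=[Param("input", "data"), Param("column", "integer", "1")],
)
argv = sort_tool.build_command({"input": "my regions.bed"})
print(argv)
```

Because the description is plain data, the same structure could drive both the parameter form shown to the user and the command that is eventually run.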

Two faces of Galaxy

• A web site where you can easily perform complex analysis integrating various data sources and computational tools

• A framework to easily build similar sites that integrate your choice of tools and data sources

Galaxy: Data retrieval and analysis

• Flexible data retrieval
– From multiple external sources
– Upload from user’s computer
– Upload as URL from any site

• Hundreds of computational tools
– Data editing, filter, sort
– File format conversion
– Extract sequences and alignments
– Operations: merge, intersection, complement, cluster …
– Get conservation and other scores for intervals
– Statistics
– Graphs and displays
– EMBOSS tools for sequence analysis
– HyPhy tools for molecular evolutionary analysis

• Workflows: run multiple steps reproducibly
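
The interval operations named in the list above (merge, intersection, complement) can be sketched in a few lines of Python. This is a minimal illustration, not Galaxy's implementation. It assumes a single chromosome and 0-based, half-open `(start, end)` coordinates as in BED files, and the function names are chosen here for illustration.

```python
def merge(intervals):
    """Merge overlapping or touching half-open intervals."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            # Extends (or sits inside) the previous interval.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def intersect(a, b):
    """Return the regions covered by both interval sets."""
    a, b = merge(a), merge(b)
    i = j = 0
    shared = []
    while i < len(a) and j < len(b):
        start = max(a[i][0], b[j][0])
        end = min(a[i][1], b[j][1])
        if start < end:
            shared.append((start, end))
        # Advance whichever interval ends first.
        if a[i][1] < b[j][1]:
            i += 1
        else:
            j += 1
    return shared


def complement(intervals, chrom_length):
    """Return the regions of [0, chrom_length) not covered by the intervals."""
    gaps = []
    position = 0
    for start, end in merge(intervals):
        if start > position:
            gaps.append((position, start))
        position = max(position, end)
    if position < chrom_length:
        gaps.append((position, chrom_length))
    return gaps


print(merge([(5, 10), (1, 3), (8, 12)]))
print(intersect([(1, 10)], [(5, 15), (20, 30)]))
print(complement([(5, 10)], 20))
```

Real genomic data would also carry a chromosome name and often a strand, so a fuller version would group intervals by chromosome before applying these functions.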

Welcome to Galaxy

Welcome screen, changes periodically

News

When tools are invoked, displays information on the tool and allows the user to choose parameters

Tool choice

Titles are toggles; more options are displayed when you click on them

History

Titles are toggles; more information is displayed when you click on them

Click on the “eye” to see all the data on another page

Click on the “pencil” to edit the attributes

Click on the “x” to delete

“Refresh” to get results if they have not appeared or to get status of query

Use “options” next to “History” to save, rename, move to or share histories. Must be logged in to do this.

Proxy-based tools (e.g. UCSC Table Browser)

User makes request to Galaxy

Galaxy delegates request to external site

Proxy-based tools

External site generates response
– If data, Galaxy determines data type, processes it and adds it to the history
– Otherwise, response is returned to user
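
The branching described above can be sketched as a small dispatch function. Everything here is a hypothetical stand-in rather than Galaxy's code: the `Response` and `History` classes, the `guess_datatype` heuristic, and in particular the assumption that a `text/plain` response means "data" while anything else is a page meant for the user.

```python
from dataclasses import dataclass, field


@dataclass
class Response:
    """Minimal stand-in for what an external site sends back."""
    content_type: str
    body: str


@dataclass
class History:
    """Hypothetical history: an ordered list of (datatype, content) items."""
    items: list = field(default_factory=list)

    def add(self, datatype, content):
        self.items.append((datatype, content))


def guess_datatype(body):
    """Crude guess from the first non-comment line (illustrative only)."""
    for line in body.splitlines():
        if not line.strip() or line.startswith(("#", "track", "browser")):
            continue
        if line.startswith(">"):
            return "fasta"
        fields = line.split("\t")
        if len(fields) >= 3 and fields[1].isdigit() and fields[2].isdigit():
            return "bed"
        return "tabular" if len(fields) > 1 else "txt"
    return "txt"


def handle_response(response, history):
    """Store data in the history, or hand the page back for display."""
    # Assumption: plain text is data; any other content type is a page.
    if response.content_type.startswith("text/plain"):
        history.add(guess_datatype(response.body), response.body)
        return None
    return response.body


history = History()
handle_response(Response("text/plain", "chr7\t100\t200\n"), history)
page = handle_response(Response("text/html", "<form>...</form>"), history)
print(history.items, page)
```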

Command line tools

Pick one of the programs from the left “Tools” bar

User chooses parameters for tool

Command is run
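
The last step, running the command once the parameters are chosen, might look roughly like the Python sketch below. It is only an illustration using the standard `subprocess` module. The helper name `run_tool` is an assumption, and a one-line Python script stands in for a real analysis program so that the example is portable.

```python
import subprocess
import sys


def run_tool(argv, input_text=""):
    """Run a command and return (succeeded, stdout, stderr)."""
    result = subprocess.run(
        argv,
        input=input_text,      # sent to the tool's standard input
        capture_output=True,   # collect stdout and stderr
        text=True,             # work with str rather than bytes
    )
    return result.returncode == 0, result.stdout, result.stderr


# Stand-in "tool": upper-case whatever arrives on standard input.
argv = [
    sys.executable,
    "-c",
    "import sys; sys.stdout.write(sys.stdin.read().upper())",
]
ok, out, err = run_tool(argv, "acgt\n")
print(ok, repr(out))
```

Capturing the exit status and the error stream separately lets the caller distinguish a finished result from a failed job before anything is shown to the user.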

Background jobs in Galaxy

Web page with datasets on transcriptional regulation

Data uploads to Galaxy: use the URL

How many DHS overlap with high RP intervals?

Overlaps of DHS with high RP segments (25%) and highly constrained segments (43%)

24,330/95,709 = 0.254

41,000/95,709 = 0.428
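
The fractions above come from counting how many intervals in one set overlap at least one interval in another. A minimal Python sketch of that count follows. It assumes one chromosome and half-open `(start, end)` coordinates, uses made-up toy intervals, and the function names are illustrative rather than taken from Galaxy. The last two lines simply redo the arithmetic from the slide.

```python
from bisect import bisect_right


def merge(intervals):
    """Merge overlapping or touching half-open intervals."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def count_overlapping(query, reference):
    """Count query intervals that overlap at least one reference interval."""
    merged = merge(reference)
    starts = [s for s, _ in merged]
    ends = [e for _, e in merged]
    count = 0
    for start, end in query:
        # First merged reference interval that ends after the query starts.
        k = bisect_right(ends, start)
        if k < len(merged) and starts[k] < end:
            count += 1
    return count


# Toy data (invented for the example, not from the slides).
dhs = [(0, 10), (20, 30), (40, 50), (60, 70)]
high_rp = [(5, 8), (29, 45)]
hits = count_overlapping(dhs, high_rp)
print(hits, hits / len(dhs))

# The slide's own numbers.
print(round(24330 / 95709, 3))
print(round(41000 / 95709, 3))
```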

Get constraint scores for intervals

Histogram of phastCons scores

Mean vs Maximum phastCons

Distribution of phastCons scores in DHS that are also occupied by CTCF

Panels: mean and max; n=7000
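
Summarizing a per-base score such as phastCons by its mean and its maximum within each interval can be sketched as follows. This is an illustrative Python example under stated assumptions: scores are held in a dictionary keyed by position on a single chromosome, intervals are half-open `(start, end)` pairs, and the name `summarize` and the toy values are invented here.

```python
from statistics import mean


def summarize(scores, intervals):
    """Return (start, end, mean, max) per interval; None where no scores exist."""
    rows = []
    for start, end in intervals:
        values = [scores[p] for p in range(start, end) if p in scores]
        if values:
            rows.append((start, end, mean(values), max(values)))
        else:
            rows.append((start, end, None, None))
    return rows


# Toy per-base scores (invented): position -> score between 0 and 1.
scores = {0: 0.25, 1: 0.5, 2: 0.75}
rows = summarize(scores, [(0, 3), (3, 5)])
for row in rows:
    print(row)
```

The two summaries can differ for the same interval: a short, strongly conserved core inside a longer, weakly conserved interval gives a high maximum but a modest mean.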

Many thanks …

James Taylor, Anton Nekrutenko,

Funding from NIDDK, NHGRI, Huck Institutes of Life Sciences at PSU

Yong Cheng, Demesew Abebe, Christine Dorman, …, Ying Zhang, David King, Swathi Ashok Kumar