PHANG LAB TALK
Tzu L Phang, Ph.D.
Assistant Professor
Department of Medicine
Division of Pulmonary Sciences & Critical Care Medicine
What I do:
• Perform high-throughput data analysis for the scientific community: microarray and Next Generation Sequencing datasets
• Provide analysis solutions for expert and novice users alike
• Develop multimedia approaches to disseminate translational science education
• Study the role of long non-coding RNA (second talk)
• Establish the Bioinformatics Consultation and Analysis Core to help researchers and scientists design, analyze and interpret their experiments
Today’s Talk Layout
• The center of my universe:
  – R and Bioconductor
• Collaboration with Biologists
• 5x5; simple way to teach and contribute
• Next Generation Sequencing (NGS)
R
r-project.org
R is hot
http://blog.revolutionanalytics.com/r-is-hot/
R in the media
Bioconductor
• www.bioconductor.org
• Statistical tools in R for high-throughput data analysis
• 6-month update cycle. Release 2.10 with 554 software packages (45 new)
• Analysis workflows
  – Oligonucleotide Arrays
  – Sequence Analysis
  – Variants
  – Accessing Annotation Data
  – High-throughput Assays
The Website
www.bioconductor.org
Categories
Categories (cont.)
• Typical Analysis Routine
R is easy
Result output
Other Resources
http://www.rseek.org/
http://www.statmethods.net/
http://crantastic.org/
http://stackoverflow.com/
Today’s Talk Layout
• The center of my universe:
  – R and Bioconductor
• Collaboration with Biologists
• 5x5; simple way to teach and contribute
• Next Generation Sequencing (NGS)
Collaboration
• >1000 microarray chips / year
• Affymetrix & Illumina platforms
• Next Generation Sequencing: 25 free pilot projects
• Serve the Rocky Mountain region scientific community
Collaboration - tips
• Don’t be a data analyst; be a co-investigator
• Suggest analysis approaches that are not obvious
• Focus on the result, not the method
• Always look for grant-writing opportunities
• Understand the technical & biological system as thoroughly as possible; you will be surprised what biologists miss informatically
Example 1: Classification of Pituitary Tumors
• Pituitary tumors are the most common type of brain tumor, found in 20% of people at autopsy and in 1/10,000 persons clinically. Based upon 2010 figures of a veteran population of 22.7 million, this translates into >225,000 veterans with pituitary tumors.
• Currently no medical therapies exist for these tumors and surgical resection is the treatment of choice. Recurrence rates approach 40%.
• Understanding of the pathways to tumorigenesis and markers of aggressiveness and risk of recurrence would alter the intensity and cost of clinical care and may provide novel candidates and pathways to explore for new treatment options for these patients
Principal Component Analysis
Potential markers
Outputs
Example 2: Explore the artistic side!
Example 3: Unconventional Usage
Introduction
• Crohn’s Disease (CD) is an Inflammatory Bowel Disease (IBD) affecting up to one million Americans (ages 15 to 30).
• Discordance between monozygotic twins affected by CD provides evidence for an epigenetic role in the etiology of the disease.
• We combined 2 microarray technologies to study these roles
  – CHARM array (Comprehensive High-throughput Array for Relative Methylation)
  – Gene Expression (Affymetrix Gene 1.0 ST)
Research Informatics Integrated Core (RIIC)
Michael G. Kahn MD, PhD
CCTSI Co-Director & RIIC Core Director
RIIC Organizational Model
Michael Kahn
Thomas Yaeger
• Web site
• Portal applications
• Virtual server farm
• Research LIS implementation
• Desktop support
Jessica Bondy (Cancer Center Informatics Core Director)
• REDCap, REDCap Survey
• Data Management Best Practices
Michael Kahn
• Third Thursday @ Three Thirty Three Informatics Seminar Series
• Secondary database and analysis service
Tzu Phang
• 5x5s
• Video Tutorials
• Bioinformatics Tools Tutorials
Steve Ross
• Community Engagement Informatics Liaison
http://cctsi.ucdenver.edu/RIIC/Pages/ConsultationDataAnalysis.aspx
5X5
http://cctsi.ucdenver.edu/5x5
Demonstration
http://gcrc.ucdenver.edu/Videos/Informatics/5x5/SocialNetworking5x5.wmv
Tools
Podcast
TIES – Translational Informatics Education Support
• Bridging the gap in translational research through education
• Training biologists in informatics
• Enhance collaboration through education and knowledge exchange
• Bring awareness of the latest technical advances
• Disseminate knowledge through innovation
Next Generation Sequencing
The future is here ….
High Throughput Parallel Sequencing
• http://www.youtube.com/watch?v=77r5p8IBwJk
Paradigm Shift
• Standard “Sanger” sequencing
  – 96 samples/day
  – Read length ~650 bp
  – Total = 450,000 bases of sequence data
• 454 – the game changer!
  – ~400,000 different templates (reads)/day
  – Read length ~250 bp (at that time)
  – Total = 100,000,000 bases of sequence data
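The 454 total above is simply reads per day multiplied by read length. A minimal Python sketch of that arithmetic follows; the function name `bases_per_day` is illustrative only, and the Sanger total is taken as stated on the slide rather than recomputed.

```python
def bases_per_day(reads_per_day: int, read_length_bp: int) -> int:
    """Raw daily yield in bases: number of reads times bases per read."""
    return reads_per_day * read_length_bp


# 454 figures from the slide: ~400,000 reads/day at ~250 bp each.
yield_454 = bases_per_day(400_000, 250)

# Sanger daily total as quoted on the slide (used as given).
SANGER_TOTAL_BASES = 450_000

# How many times more sequence per day the 454 figure represents.
fold_increase = yield_454 / SANGER_TOTAL_BASES

print(yield_454)             # 100000000
print(round(fold_increase))  # 222
```

On these numbers the shift is roughly two orders of magnitude in daily output, which is the point of the slide.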
The second generation
Roche (454) http://454.com/
  – First on the market
  – Emulsion PCR and pyrosequencing
Illumina (Solexa) http://www.illumina.com/
  – Second on the market
  – Bridge PCR and polymerase-based SBS
ABI (SOLiD) http://solid.appliedbiosystems.com/
  – Third on the market
  – Emulsion PCR and ligase-based sequencing
Single molecule sequencing
Helicos Biosciences http://helicosbio.com
  – True Single Molecule Sequencing technology
Pacific Biosciences http://www.pacificbiosciences.com
  – Single Molecule Real Time sequencing
Portable Sequencer
• Ion Torrent
Others
Polonator http://www.polonator.org
  – Emulsion PCR and ligase-based sequencing
  – Used in the Personal Genome Project
  – Open platform, open source
  – Cheap/affordable
Complete Genomics http://www.completegenomics.com
  – Specializing in human genome sequencing
Type of read data
• Base Space or Color Space
• Paired end or single end
• Stranded or Unstranded
Short Reads
• Short reads from NGS are challenging (Solexa ~36 bp, now HiSeq 100 bp single pass)
  – Very hard to assemble a whole genome
  – Especially in repeat regions
• Requires many-fold coverage
• New and faster algorithms for many traditional bioinformatics operations
• Reads are getting longer – another moving target (2x250)
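To make "many-fold coverage" concrete, mean coverage is commonly estimated as (number of reads × read length) / genome size. The sketch below is a rough illustration under stated assumptions: the 3,000,000,000 bp genome size and the 30x target are round figures chosen for the example, not values from this talk, and the function names are hypothetical.

```python
import math


def mean_coverage(num_reads: int, read_length_bp: int, genome_size_bp: int) -> float:
    """Average number of times each base is sequenced: N * L / G."""
    return num_reads * read_length_bp / genome_size_bp


def reads_needed(target_coverage: float, read_length_bp: int, genome_size_bp: int) -> int:
    """Smallest number of reads that reaches the target mean coverage."""
    return math.ceil(target_coverage * genome_size_bp / read_length_bp)


GENOME_SIZE_BP = 3_000_000_000  # assumption: round figure for a human-sized genome
TARGET_COVERAGE = 30            # assumption: illustrative target, not from the slides

# Read lengths quoted on the slide: Solexa ~36 bp and HiSeq 100 bp.
for read_length in (36, 100):
    n = reads_needed(TARGET_COVERAGE, read_length, GENOME_SIZE_BP)
    print(f"{read_length} bp reads: {n:,} reads for {TARGET_COVERAGE}x")
```

Under these assumptions the 36 bp reads need roughly 2.8 times as many reads as the 100 bp reads for the same coverage, which is one reason read length is a moving target for analysis pipelines.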
Applications
• An explosion of scientific innovation!!
• New usages not directly foreseen by the original developers of the technology
• Some envision the beginning of the next revolution, as with PCR: an NGS machine in every lab!!
• Cheap high-volume sequencing – revisiting data collection and management systems
RNA Sequencing
• “Digital Gene Expression” or “RNA-Seq”
• Truly accurate gene expression measurements
  – Can replace gene expression microarrays
    • 25% more sensitive
    • Does not rely on hybridization (no %GC bias, no cross-hybridization between related genes)
• Discover novel genes (and other kinds of RNA molecules)
  – One experiment found that 34% of human transcripts were not from known genes
    • Sultan et al, Science. 2008 Aug 15;321(5891):956-60.
Why is RNA-seq better than microarray?
• No reliance on predefined gene annotation, which makes discovering novel transcripts possible
• Low, if any, background
• Large dynamic range of expression levels, no upper limit for quantification
• Reveals sequence variation, such as SNPs, in the transcript region
• With Helicos single molecule sequencing there is no PCR step, which removes amplification bias
More information from RNA
• Can capture true alternative splicing information
  – Sequence of splice junctions
    • One study found 4,096 previously unknown splice junctions in 3,106 human genes
  – Different transcription start and end points for RNA molecules
• Allelic variation (SNPs)
• Small RNAs
Bottleneck: Data Analysis
Informatics is the Bottleneck
• Scientists are currently able to generate sequence data much faster/more easily than they are able to analyze it
• Customized analysis / Bioinformatics consulting is needed for every project
Bioinformatics Challenges
• Need for large amounts of CPU power
  – Informatics groups must manage compute clusters
  – Challenges in parallelizing existing software or redesigning algorithms to work in a parallel environment
  – Another level of software complexity and challenges to interoperability
• VERY large text files (millions of lines long)
  – Can’t do ‘business as usual’ with familiar tools such as Microsoft Excel
  – Impossible memory usage and execution time
• Sequence quality filtering
Auer P. Statistical design and analysis of RNA sequencing data. Genetics. 2010.
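Two of the challenges above, very large text files and sequence quality filtering, can be handled together by streaming a file one record at a time. The following is a minimal sketch, not a production filter: it assumes four-line FASTQ records, a Phred+33 quality encoding (older Illumina pipelines used an offset of 64), and an arbitrary mean-quality threshold of 20. The helper names and the two example reads are made up for illustration.

```python
import io
from typing import Iterator, TextIO, Tuple

PHRED_OFFSET = 33  # assumption: Phred+33 encoding; older Illumina files used 64


def read_fastq(handle: TextIO) -> Iterator[Tuple[str, str, str]]:
    """Yield (name, sequence, quality) one record at a time.

    Only one record is held in memory, so file size does not matter.
    Assumes each record is exactly four lines (no wrapped sequences).
    """
    while True:
        header = handle.readline()
        if not header:
            return  # end of file
        sequence = handle.readline().rstrip("\n")
        handle.readline()  # the '+' separator line is skipped
        quality = handle.readline().rstrip("\n")
        yield header.rstrip("\n")[1:], sequence, quality


def mean_quality(quality: str, offset: int = PHRED_OFFSET) -> float:
    """Average Phred score decoded from the ASCII quality string."""
    return sum(ord(char) - offset for char in quality) / len(quality)


def filter_reads(handle: TextIO, min_mean_quality: float) -> Iterator[Tuple[str, str, str]]:
    """Keep only reads whose mean quality reaches the threshold."""
    for name, sequence, quality in read_fastq(handle):
        if quality and mean_quality(quality) >= min_mean_quality:
            yield name, sequence, quality


# Made-up two-read example: 'I' decodes to Q40, '!' decodes to Q0.
example = io.StringIO(
    "@read1\nACGT\n+\nIIII\n"
    "@read2\nACGT\n+\n!!!!\n"
)
kept = [name for name, _, _ in filter_reads(example, min_mean_quality=20)]
print(kept)  # ['read1']
```

For a real file, `open("reads.fastq")` (a hypothetical path) would replace the `io.StringIO` object; the generator design keeps memory use flat regardless of how many million lines the file has.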
Data formats
• Images
• “Raw” basecalls with quality scores
• Sequence reads aligned to reference genomes
• Assemblies (contigs)
• Variants (SNPs, indels, copy number variants)
Hexadecimal mode
Decimal mode
FASTQ
Raw
SAM format
Example
Pileup format
QNAME
FLAG
RNAME
POS
MAPQ
CIGAR
MRNM
MPOS
ISIZE
SEQ
QUAL
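The eleven mandatory SAM columns listed above are tab-separated, so one alignment line can be split into named fields. This is a small sketch using the field names as they appear on the slide; later versions of the SAM specification rename MRNM, MPOS and ISIZE to RNEXT, PNEXT and TLEN. The `SamRecord` type, the `parse_sam_line` helper and the example alignment line are all made up for illustration.

```python
from typing import NamedTuple


class SamRecord(NamedTuple):
    """The 11 mandatory SAM columns, named as on the slide."""
    qname: str  # read name
    flag: int   # bitwise flag
    rname: str  # reference sequence name
    pos: int    # 1-based leftmost mapping position
    mapq: int   # mapping quality
    cigar: str  # alignment description
    mrnm: str   # mate reference name (RNEXT in later specs)
    mpos: int   # mate position (PNEXT in later specs)
    isize: int  # inferred insert size (TLEN in later specs)
    seq: str    # read sequence
    qual: str   # ASCII-encoded base qualities


def parse_sam_line(line: str) -> SamRecord:
    """Split one alignment line into typed fields; optional tags are ignored."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 11:
        raise ValueError(f"expected at least 11 columns, got {len(fields)}")
    qname, flag, rname, pos, mapq, cigar, mrnm, mpos, isize, seq, qual = fields[:11]
    return SamRecord(qname, int(flag), rname, int(pos), int(mapq),
                     cigar, mrnm, int(mpos), int(isize), seq, qual)


# Made-up alignment line for illustration only.
line = "read1\t16\tchr1\t100\t60\t4M\t*\t0\t0\tACGT\tIIII"
record = parse_sam_line(line)

# Bit 0x10 of FLAG marks a read aligned to the reverse strand.
is_reverse = bool(record.flag & 0x10)

print(record.rname, record.pos, record.cigar)  # chr1 100 4M
print(is_reverse)                              # True
```

Header lines, which start with "@", are not alignment records and would need to be skipped before calling `parse_sam_line`.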
CIGAR
• M : match/mismatch
• I : insertion compared with reference
• D : deletion compared with reference
• N : skipped bases on reference
• S : soft clipping (unaligned)
• H : hard clipping
• P : padding
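A CIGAR string is a run of length-plus-operation pairs, so the operations above can be tallied to find how many reference bases an alignment covers and how long the stored read is. The sketch below handles only the seven operations on this slide (the "=" and "X" operations added in later SAM versions are left out); the function names and the example string are illustrative assumptions.

```python
import re
from typing import List, Tuple

CIGAR_PATTERN = re.compile(r"(\d+)([MIDNSHP])")

# Operations that advance along the reference and along the read sequence.
CONSUMES_REFERENCE = {"M", "D", "N"}
CONSUMES_READ = {"M", "I", "S"}


def parse_cigar(cigar: str) -> List[Tuple[int, str]]:
    """Split a CIGAR string into (length, operation) pairs."""
    if not re.fullmatch(r"(\d+[MIDNSHP])+", cigar):
        raise ValueError(f"not a valid CIGAR string: {cigar!r}")
    return [(int(length), op) for length, op in CIGAR_PATTERN.findall(cigar)]


def reference_span(cigar: str) -> int:
    """Number of reference bases covered by the alignment."""
    return sum(n for n, op in parse_cigar(cigar) if op in CONSUMES_REFERENCE)


def read_length(cigar: str) -> int:
    """Number of bases expected in the SEQ column for this alignment."""
    return sum(n for n, op in parse_cigar(cigar) if op in CONSUMES_READ)


example = "3S10M2I5M1D4M"  # made-up CIGAR string for illustration
print(parse_cigar(example))
print(reference_span(example))  # 20
print(read_length(example))     # 24
```

Adding `reference_span` to POS minus one gives the last reference position covered by the read, which is what pileup-style tools need when walking along the genome.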
File Size
• s_1_ILS4_sequence.txt [5.2 GB]
• s_1_ILS4_sequence.fastq [3.3 GB]
• s_1_ILS4_sequence.sam [4.5 GB]
• s_1_ILS4_sequence.bam [995 MB]
• s_1_ILS4_sequence.sorted.bam [696 MB]
The Bible
Utility Tools
• SAMtools
• Picard
• USeq
• Etc.
Bioconductor Solution
A demonstration
Secondary Tools
• Laboratory management
• Data mining and visualization
• Project management for genome assembly
• Pathway mapping (functional analysis of groups of genes)
• Motif finding (for ChIP-Seq)
Integration
• Integrate information from different technologies on a single genome map
  – Genetic variation
  – Gene expression (mRNA levels)
  – Alternative splicing
  – Transcription factor binding
  – Methylation/histone status
  – Small RNA levels (gene regulatory molecules)
  – Non-coding RNA levels!
Speed/Efficiency
• New emphasis on efficient data structures and algorithms
• Use of “old style” tools such as grep/sed/awk
• Machine language programming
• Currently a huge burst of programming creativity in an “anything goes” environment
• A desperate scramble for tools that work
• Huge duplication of effort in programming, but also in evaluating new software
Amazon Web Services
http://aws.amazon.com/education/
Future Directions
• Sequencing will continue to get much faster and cheaper, by 4-10x per year for several more years.
• Affordable complete human genome sequencing will be available as a clinical diagnostic tool within 2-3 years.
• Data storage and analysis bottleneck
• Data security/privacy issues
Field Trip