introduction to r and bioconductor...introduction to r and bioconductor martin morgan...

20
Introduction to R and Bioconductor Martin Morgan ([email protected]) Fred Hutchinson Cancer Research Center Seattle, WA, USA June 23, 2014

Upload: others

Post on 31-May-2020

15 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Introduction to R and Bioconductor...Introduction to R and Bioconductor Martin Morgan (mtmorgan@fhcrc.org) Fred Hutchinson Cancer Research Center Seattle, WA, USA June 23, 2014 R R

Introduction to R and Bioconductor

Martin Morgan ([email protected])Fred Hutchinson Cancer Research Center

Seattle, WA, USA

June 23, 2014

Page 2: Introduction to R and Bioconductor...Introduction to R and Bioconductor Martin Morgan (mtmorgan@fhcrc.org) Fred Hutchinson Cancer Research Center Seattle, WA, USA June 23, 2014 R R

R

R is a language and environment for statistical computing andgraphics

Page 3: Introduction to R and Bioconductor...Introduction to R and Bioconductor Martin Morgan (mtmorgan@fhcrc.org) Fred Hutchinson Cancer Research Center Seattle, WA, USA June 23, 2014 R R

R

R is a language and environment for statistical computing andgraphics

I Full-featured programming language

I Interactive and interpretted – convenient and forgiving of usererrors

I Coherent, extensively documented

Page 4: Introduction to R and Bioconductor...Introduction to R and Bioconductor Martin Morgan (mtmorgan@fhcrc.org) Fred Hutchinson Cancer Research Center Seattle, WA, USA June 23, 2014 R R

R

R is a language and environment for statistical computing andgraphics

I Throughout the language, e.g., factor and NA

I Built-in statistical functionality

I Highly extensible via user-contributed packages

Page 5: Introduction to R and Bioconductor...Introduction to R and Bioconductor Martin Morgan (mtmorgan@fhcrc.org) Fred Hutchinson Cancer Research Center Seattle, WA, USA June 23, 2014 R R

R

R is a language and environment for statistical computing andgraphics

I Explore data

I Communicateresults

Page 6: Introduction to R and Bioconductor...Introduction to R and Bioconductor Martin Morgan (mtmorgan@fhcrc.org) Fred Hutchinson Cancer Research Center Seattle, WA, USA June 23, 2014 R R

R vectors, classes, and functions

I VectorsI logical, integer, numeric, complex, character, raw

(byte)I factor: discrete levelsI Missing-ness, NA

I data.frame, matrix , and other objectsI Functions

I Operating on vectors, e.g., log, lm (fit a linear model)I ‘Higher order’ functions – apply a function to several different

vectors, e.g., lapply(df, log)

I Packages

None of this making sense? R introduction / refresher tutorial thisafternoon

Page 7: Introduction to R and Bioconductor...Introduction to R and Bioconductor Martin Morgan (mtmorgan@fhcrc.org) Fred Hutchinson Cancer Research Center Seattle, WA, USA June 23, 2014 R R

Using R

Documentation

I help()

I vignettes

Work flowsI Scripts. . .

I ReproducibleI Literate

I . . . mature to packagesI Coordinate data, analysis, and documentationI Share with others

Page 8: Introduction to R and Bioconductor...Introduction to R and Bioconductor Martin Morgan (mtmorgan@fhcrc.org) Fred Hutchinson Cancer Research Center Seattle, WA, USA June 23, 2014 R R

Bioconductor project goal

Analysis and comprehension of high-throughput genomic data

Page 9: Introduction to R and Bioconductor...Introduction to R and Bioconductor Martin Morgan (mtmorgan@fhcrc.org) Fred Hutchinson Cancer Research Center Seattle, WA, USA June 23, 2014 R R

Bioconductor project goal

Analysis and comprehension of high-throughput genomic data

Statistical analysis

I Reduce large data to manageable knowledge

I Cope with technological artifacts

I Rigorous exploration

I Designed experiments, e.g., treatment vs. control

I Leading-edge methods for leading-edge questions

Page 10: Introduction to R and Bioconductor...Introduction to R and Bioconductor Martin Morgan (mtmorgan@fhcrc.org) Fred Hutchinson Cancer Research Center Seattle, WA, USA June 23, 2014 R R

Bioconductor project goal

Analysis and comprehension of high-throughput genomic data

I Understandable

I Reproducible

I Effective visualization

I Biological context, e.g., annotation

I Training

Page 11: Introduction to R and Bioconductor...Introduction to R and Bioconductor Martin Morgan (mtmorgan@fhcrc.org) Fred Hutchinson Cancer Research Center Seattle, WA, USA June 23, 2014 R R

Bioconductor project goal

Analysis and comprehension of high-throughput genomic data

I Sequencing: RNA-seq, ChIP-seq, variants, copy number. . .

I Microarrays: expression, SNP, . . .

I Flow cytometry

I Proteomics

I Images

I . . .

Page 12: Introduction to R and Bioconductor...Introduction to R and Bioconductor Martin Morgan (mtmorgan@fhcrc.org) Fred Hutchinson Cancer Research Center Seattle, WA, USA June 23, 2014 R R

What is Bioconductor?

Collection of packages in the R statistical programming language

I Developed by the Bioconductor core and internationalcontributors

I Stable ‘release’ branch, and leading edge ‘devel’ branch

I Open source / open development

Used by. . .

I Individuals

I Academic labs & research groups

I Government agencies

I Pharma and other companies

Page 13: Introduction to R and Bioconductor...Introduction to R and Bioconductor Martin Morgan (mtmorgan@fhcrc.org) Fred Hutchinson Cancer Research Center Seattle, WA, USA June 23, 2014 R R

How to learn & use Bioconductor

1. Install R (&RStudio?)

2. Identify and installpackages

3. Write R scriptsI Input & ‘massage’

dataI Quality assessmentI Statistical analysisI VisualizationI AnnotationI Reports &

summaries

4. Share with colleagues,collaborators, and thecommunity

http://bioconductor.org

Page 14: Introduction to R and Bioconductor...Introduction to R and Bioconductor Martin Morgan (mtmorgan@fhcrc.org) Fred Hutchinson Cancer Research Center Seattle, WA, USA June 23, 2014 R R

I Estabished work flows, e.g., RNA-seq differential expressionwith DESeq2

I Flexible bioinformatic analysis, e.g., . . .

Page 15: Introduction to R and Bioconductor...Introduction to R and Bioconductor Martin Morgan (mtmorgan@fhcrc.org) Fred Hutchinson Cancer Research Center Seattle, WA, USA June 23, 2014 R R

Project strengths

I Extensive

I Respected

I Well-used

I Accessible

I 824 software packages, 867 annotationpackages, 202 experiment data packages

I Sequencing, microarrays, flow cytometry,proteomics, image analysis, . . .

I All packages with vignettes and helppages

I Tutorials, training material, national andinternational conferences

Page 16: Introduction to R and Bioconductor...Introduction to R and Bioconductor Martin Morgan (mtmorgan@fhcrc.org) Fred Hutchinson Cancer Research Center Seattle, WA, USA June 23, 2014 R R

Project strengths

I Extensive

I Respected

I Well-used

I Accessible

“Community repositories that carry out testingare ideal. . . the genetics community isfortunately familiar with the Comprehensive RArchive Network and the principles ofstewardship of modular software embodied inthe Bioconductor suite. . . The journal hassufficient experience with these resources toendorse their use by authors.” – NatureGenetics 46, 1 (2014)

Page 17: Introduction to R and Bioconductor...Introduction to R and Bioconductor Martin Morgan (mtmorgan@fhcrc.org) Fred Hutchinson Cancer Research Center Seattle, WA, USA June 23, 2014 R R

Project strengths

I Extensive

I Respected

I Well-used

I Accessible

PubMedCentral full-text citations

Citations

Bioconductor 9070RNA-seq

edgeR Diff. expression 647DESeq Diff. expression 648

Microarrayaffy Pre-processing 2318limma Diff. expression 4503GOstats GSEA 436

Page 18: Introduction to R and Bioconductor...Introduction to R and Bioconductor Martin Morgan (mtmorgan@fhcrc.org) Fred Hutchinson Cancer Research Center Seattle, WA, USA June 23, 2014 R R

Project strengths

I Extensive

I Respected

I Well-used

I Accessible

I 225,000 unique IP addresses downloaded9.3M packages

I 397,000 site visitors / year (27% increase)viewed 2.8M pages

I ∼ 600 mailing list posts from ∼ 210authors per month

Page 19: Introduction to R and Bioconductor...Introduction to R and Bioconductor Martin Morgan (mtmorgan@fhcrc.org) Fred Hutchinson Cancer Research Center Seattle, WA, USA June 23, 2014 R R

Project strengths

I Extensive

I Respected

I Well-used

I Accessible

http://bioconductor.org

I Package vignettes & help pages

I Work flows

I Mailing list & ‘guest posting’ facility

I Courses and other training

I Annual Conference,Boston July 30 – Aug 1.

Page 20: Introduction to R and Bioconductor...Introduction to R and Bioconductor Martin Morgan (mtmorgan@fhcrc.org) Fred Hutchinson Cancer Research Center Seattle, WA, USA June 23, 2014 R R

Acknowledgements

I Bioconductor core: Vince Carey, Sean Davis, Kasper Hansen,Wolfgang Huber, Robert Gentleman, Rafael Irizzary, MichaelLawrence, Levi Waldron

I Bioconductor team: Sonali Arora (introductory material, copynumber), Marc Carlson (annotation), Nate Hayden (pileup,C++), Valerie Obenchain (variants, ranges), Herve Pages(ranges, strings), Paul Shannon (systems biology), DanTenenbaum (web, build)

I The international Bioconductor community!

I Funding: US NHGRI / NIH U41HG004059; NSF 1247813.

More: http://bioconductor.org