
R in bioinformatics: parallelization and web

Context

Parallelizing code

Web applications

Large data sets and parallelization

R, C, and compression on the fly

Conclusions et al.

What we are doing now

R, parallelization, massive data, and web applications: examples of the use of R in bioinformatics

Ramón Díaz-Uriarte

Dept. Bioquímica, Universidad Autónoma de Madrid

Madrid, Spain

[email protected]

http://ligarto.org/rdiaz

Facultad de Informática, Universidad Complutense de Madrid

9 May 2012


License and copyright

This work is Copyright ©, 2012, Ramón Díaz-Uriarte, and is licensed under the Creative Commons Attribution-NonCommercial-ShareAlike License. To view a copy of this license, visit http://creativecommons.org/licenses/by-nc-sa/3.0/ or send a letter to Creative Commons, 559 Nathan Abbott Way, Stanford, California 94305, USA.

Please respect the copyright. This material is provided freely, and if you use it, I only ask that you use it according to the (very permissive) terms of the license: attribution, non-commercial use, and a share-alike license. If you have any doubts, ask me.


Outline

Context

Parallelizing code

Web applications

Large data sets and parallelization

R, C, and compression on the fly

Conclusions et al.

What we are doing now


Context

Biological context

Computational context


Chromosomes

From Wikipedia; original source: http://www.genome.gov/Pages/Hyperion//DIR/VIP/Glossary/Illustration/karyotype.shtml


DNA → protein

(From O. Rueda’s PhD Thesis)


DNA, genes and probes (spots)

(From O. Rueda’s PhD Thesis)

[Figure: Probe Selection for Microarray. Labels: Nucleotide Sequence (paired strands such as ATACGTT / TATGCAA); Gene (exon 1, intron 1, exon 2, intron 2, exon 3); probe 1, probe 2, probe 3.]


Two-color arrays

(From O. Rueda’s PhD Thesis)

[Figure labels: DNA samples (tumor sample, control sample); red fluorescent dye; green fluorescent dye; hybridization; microarray chip; spot for a probe; optical scanning.]


Data from a microarray experiment

Slide from Gema Moreno Bueno, Department of Biochemistry, UAM


More microarray data

Modified from http://www2.warwick.ac.uk/fac/sci/moac/students/peter_cock/r/heatmap/scaled_color_key.png


aCGH

Olshen, 2005; Barrett et al., 2004; Hupe & Barillot, 2005

[Figure axis labels: Chromosome; Log2(Ratio)]

Arrays: a dot is a DNA fragment. Each array is a sample. Each array covers all chromosomes. (For analysis, location in the chromosome matters.)

Calling gains and losses: hypothesis testing

Inferring number of copy gains/losses: estimation


Data, data, data (in Gigabytes)

Expression arrays (mRNA): > 40,000 probes
Copy number with aCGH: > 400,000 common; some > 4 × 10^6
. . .


aCGH: example of data

(From O. Rueda’s PhD Thesis)

[Figure labels: probe; gene]


Computational issues et al.

We want to analyze, reanalyze, and combine.
Do it in a reasonably short time.
“Wet lab researchers” need user-friendly access to methods that are both statistically rigorous and computationally efficient.

BioConductor paper: second most accessed paper in Genome Biology; yearly “Web server issue” of Nucleic Acids Research.


Multicores and computing clusters

Increases in CPU speed have slowed down (< 20% per year since 2002).
Increase in the number of “cores”: 2, 4, 8. Next 10 years?
Inexpensive computing clusters with off-the-shelf components.
Must design our programs from the start: parallel programming.

Image from http://faq.distributed.net/


Parallelizing code


Standalone

[Diagram] Statistical Computing in Bioinformatics

Develop statistical methods

Implement existing approaches

Implement for statisticians and bioinformaticians

Implement for wet lab users

- Parallel computing
- Fault tolerance

Web apps:
- User friendly
- No installation
- Statistical rigour
- Best practices

Increased speed (40x - 60x)


R code

Code available for many procedures (but a few years ago none parallelized!)
Many computations embarrassingly parallelizable:
- bootstrapping and cross-validation
- arrays (or samples)
- arrays by chromosomes
- parallel chains in MCMC

Figure production can be parallelized
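To make "embarrassingly parallelizable" concrete, here is a minimal sketch of a bootstrap in which every replicate is an independent task. It is written in Python rather than R, purely as an illustration; the names `one_bootstrap_mean` and `bootstrap_means` and the per-replicate seeding scheme are assumptions, not code from the packages mentioned in this talk.

```python
import random
import statistics
from concurrent.futures import Executor
from typing import List, Optional, Sequence


def one_bootstrap_mean(task):
    """Resample the data once, with replacement, and return the resample mean.

    Each replicate carries its own seed, so the result does not depend on
    which worker runs it or in what order (hypothetical seeding scheme).
    """
    data, seed = task
    rng = random.Random(seed)
    resample = [rng.choice(data) for _ in data]
    return statistics.fmean(resample)


def bootstrap_means(data: Sequence[float], n_boot: int, base_seed: int = 0,
                    executor: Optional[Executor] = None) -> List[float]:
    """Run n_boot independent replicates, sequentially or on an executor."""
    tasks = [(list(data), base_seed + b) for b in range(n_boot)]
    # No replicate needs anything from another one, so a plain map is enough.
    mapper = executor.map if executor is not None else map
    return list(mapper(one_bootstrap_mean, tasks))
```

Passing a `concurrent.futures.ProcessPoolExecutor` as `executor` would spread the replicates over several cores; leaving it out runs the same code sequentially, which makes it easy to check that both routes give identical results.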


Parallelizing R code

(Implement missing functionality: R/C)
MPI: R packages Rmpi, papply, snow, snowfall
Load balanced
Wrappers over “mid level” functions in package: ease updating
Parallelize:
- Bootstrap samples/cross-validation runs
- arrays
- arrays by chromosomes
- (or a combination of both)
- Figures
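The idea of load-balanced parallelization over arrays by chromosomes can be sketched as follows. This is a Python illustration under stated assumptions, not the Rmpi/snow code used in the packages: `segment_mean` is a hypothetical stand-in for a real segmentation method, and submitting the largest units first is one simple balancing heuristic chosen for the example.

```python
from concurrent.futures import Executor, as_completed
from typing import Dict, List, Tuple

Unit = Tuple[str, int]  # (array name, chromosome number)


def segment_mean(values: List[float]) -> float:
    """Hypothetical stand-in for a segmentation method: average the log-ratios."""
    return sum(values) / len(values)


def analyze_units(data: Dict[Unit, List[float]], executor: Executor) -> Dict[Unit, float]:
    """Process every (array, chromosome) unit as its own task.

    Units are submitted largest first, and an idle worker takes the next
    pending unit, so long chromosomes do not pile up on one worker.
    """
    order = sorted(data, key=lambda unit: len(data[unit]), reverse=True)
    futures = {executor.submit(segment_mean, data[unit]): unit for unit in order}
    results = {}
    for future in as_completed(futures):  # collect results as they finish
        results[futures[future]] = future.result()
    return results
```

With a `ProcessPoolExecutor` the units run on separate cores; the function only relies on the generic `Executor` interface, so the scheduling logic can be tried out with any executor.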


Is it worth it?

Are speed improvements really worth the effort?
Over what range of problems do we see improvements?
With what hardware can we see improvements?


What do we gain?

[Figure: user wall time (seconds) against number of arrays (samples) for 20,000 genes; panels: HMM, GLAD, CBS, BioHMM; series: sequential code, parallelized code.]


What do we gain?

Are speed improvements really worth the effort?
Your effort: “R CMD INSTALL ADaCGH2”.

Over what range of problems do we see improvements?
10 to 10^3 arrays/samples; 10^4 to 10^6 spots/genes.

With what hardware can we see improvements?
2 cores to 120 cores.

Smaller clusters: more cost effective.
Single node/multi-core: less communication overhead.


Where is this running?

varSelRF (CRAN)
ADaCGH2 (BioConductor)
SignS (launchpad: http://launchpad.net/signs)


Web applications

Web apps: how


Applications for wet lab researchers

Analyze data in a reasonably short time.
User-friendly access to methods that are statistically rigorous.


Web-based applications

User-friendly interface.
No hardware/software hassles for end users.
Parallelization is transparent.
Method selection can be partially transferred (to us).
Short user wall time: use (hardware/software) resources rarely available to individual biomedical researchers.
Just type in a URL: http://www.some-application

Image modified from http://faq.distributed.net/


Sometimes collaborations feel like . . .

(From http://www.bitacoradegalileo.com/2010/11/16/giordano-bruno-en-la-cara-oculta-de-la-luna/)


Parallelization in web applications

[Diagram] Statistical Computing in Bioinformatics

Develop statistical methods

Implement existing approaches

Implement for statisticians and bioinformaticians

Implement for wet lab users

- Parallel computing
- Fault tolerance

Web apps:
- User friendly
- No installation
- Statistical rigour
- Best practices

Increased speed (40x - 60x)


Main web-based applications

[Diagram: applications grouped by analysis stage]

Dealing with raw data
- Remove artifacts from microarrays (missing data, replicate spots): DNMAD, preP

Statistical analysis (sensu stricto)
- Differentially expressed genes: Pomelo_II
- Select genes for classification: Tnasas, GeneSrF
- Molecular signatures, survival data: SignS
- Segment aCGH: WaviCGH, ADaCGH

Annotation and interpretation
- Interpret results: IDClight, PaLS


The applications


What do we gain?

[Figure: user wall time (seconds) against number of simultaneous users (1, 5, 10, 20) for 15,000 genes and 40 arrays; panels: CBS, CGHseg, GLAD, HMM.]


How it works: some key ideas

Each run:
- Parallelization (transparent for users)
- Fault tolerance (network problems, machine crashes, bugs)
- Checkpointing

Periodic tasks (keep system running 24 h, 365 d):
- Automatic monitoring
- Automated testing suite
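Checkpointing is what makes restarting after a crash cheap. The sketch below shows one common way to do it, in Python and under assumptions of my own: progress is kept in a JSON file (`checkpoint_path` is a hypothetical name), and the file is replaced atomically so that a crash in the middle of a write cannot damage the last good checkpoint. It is not the mechanism used by the applications described here.

```python
import json
import os


def run_with_checkpoints(items, work, checkpoint_path):
    """Apply `work` to each item, saving progress after every item.

    If the process dies and is started again with the same checkpoint_path,
    items that were already completed are skipped instead of recomputed.
    """
    done = {}
    if os.path.exists(checkpoint_path):
        with open(checkpoint_path) as fh:
            done = json.load(fh)
    for item in items:
        key = str(item)
        if key in done:
            continue  # finished in an earlier run
        done[key] = work(item)
        # Write to a temporary file and rename it over the old checkpoint.
        tmp_path = checkpoint_path + ".tmp"
        with open(tmp_path, "w") as fh:
            json.dump(done, fh)
        os.replace(tmp_path, checkpoint_path)
    return done
```

A supervising process only has to start the same command again after a failure; the work already recorded in the checkpoint is not repeated.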


What happens

[Diagram: flow of a request]

User
Head node (LVS): send request to one of the servers.
CGI: data checking, file upload
Execution: Python program
- Setting up LAM/MPI
- Starting R
- Fault tolerance
- Checking termination of R
- Checking run errors
- Formatting output
R program
Autorefreshing HTML until final results

Legend: sequential code; parallelized code
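The "autorefreshing HTML until final results" step can be illustrated with a few lines of Python. This is a sketch under assumptions: the function name `status_page` and the convention that a finished run leaves a results file at `result_path` are invented for the example; only the HTML `meta http-equiv="refresh"` mechanism itself is standard.

```python
import html
import os


def status_page(result_path, refresh_seconds=5):
    """Return the results page if the run has finished, else a self-reloading page."""
    if os.path.exists(result_path):
        with open(result_path) as fh:
            body = html.escape(fh.read())
        return f"<html><body><pre>{body}</pre></body></html>"
    # The meta refresh tag asks the browser to request the page again later.
    return (
        "<html><head>"
        f'<meta http-equiv="refresh" content="{refresh_seconds}">'
        "</head><body>Still running; this page reloads automatically.</body></html>"
    )
```

The user keeps a single URL open: while the analysis runs the browser polls it, and once the results file exists the same URL returns the final page without a refresh tag.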


What happens: details

[Flowchart]

User → Head node (LVS) → Server 1, Server 2, Server 3, ..., Server n

On the chosen server (Apache, CGI):
- Read data
- Create MPI universe
- Launch R, Rmpi (one master, the other servers as slaves)
- Rmpi started OK? If not after K attempts: stop execution, halt MPI universe, return error.
- Monitor R execution; maintain R process counters
- Is R done? No: return autorefreshing page; continue R execution till end. Yes: halt MPI universe, produce and return results pages.


MPI details

[Flowchart: MPI control loop. Boxes, in reading order:]
- Can we run? (Count other LAM daemons.) If no: sleep.
- If yes: verify servers (modify LAM defs); boot (new) LAM/MPI.
- MPI universe: servers 1 ... n, with NFS shared storage and NFS shared temporary storage.
- Start R: continue from last checkpoint. Segmentation and figures (over subjects and chromosomes).
- Sleep, then check: Run out of time? Are we done? R crashed (bugs)? Rmpi crashed? LAM/MPI/nodes crashed?
- Halt MPI universe; produce and return results pages.
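The control loop above (start, sleep, check, restart after a crash, give up when out of time) can be sketched with Python's `subprocess` module. This is an illustration, not the actual control program: `supervise`, its parameters, and the three return values are hypothetical, and a non-zero exit status is assumed to mean "crashed". The supervised command is expected to resume from its own checkpoint when it is started again.

```python
import subprocess
import time


def supervise(cmd, max_attempts=3, time_limit=60.0, poll_interval=0.1):
    """Run cmd until it exits with status 0, restarting it after a crash.

    Returns "done", "failed" (crashed max_attempts times) or "timeout".
    """
    deadline = time.monotonic() + time_limit
    for _ in range(max_attempts):
        proc = subprocess.Popen(cmd)
        while proc.poll() is None:           # still running
            if time.monotonic() > deadline:  # run out of time?
                proc.kill()
                proc.wait()
                return "timeout"
            time.sleep(poll_interval)        # sleep, then check again
        if proc.returncode == 0:             # are we done?
            return "done"
        # Non-zero exit status: treat it as a crash and start again.
    return "failed"
```

Because the supervisor is a separate process, it keeps working even when the computation itself dies, which is the property the flowchart relies on.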


Where is this running?

http://signs.bioinfo.cnio.es

http://wavi.bioinfo.cnio.es

http://genesrf.bioinfo.cnio.es


Large data sets and parallelization


Large data sets

Millions of spots.
Hundreds or thousands of subjects.
No need to hold everything in RAM at once.

Package ff: “memory-efficient storage of large data on disk and fast access functions.”

Combined with:
- parallelization
- shared storage


ff and parallelization

ff stores the object on disk.
Read that object from various R processes.
Different R processes can write to different ff objects.
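The same pattern (one shared on-disk object that every worker only reads, plus one output file per worker) can be sketched without ff. The example below is Python and uses the standard `mmap` module as a stand-in for ff; the flat binary layout, the file names, and the functions `write_matrix`, `row_mean`, and `worker` are all assumptions made for the illustration.

```python
import mmap
import struct

DOUBLE = struct.Struct("<d")  # one little-endian 8-byte float


def write_matrix(path, rows):
    """Store a list of equal-length rows on disk, one row after another."""
    with open(path, "wb") as fh:
        for row in rows:
            for value in row:
                fh.write(DOUBLE.pack(value))


def row_mean(path, row_index, n_cols):
    """Read one row through a read-only memory map and average it.

    Only the part of the file holding that row is touched, so a worker
    never needs the whole matrix in RAM.
    """
    with open(path, "rb") as fh:
        with mmap.mmap(fh.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            start = row_index * n_cols * DOUBLE.size
            values = struct.unpack_from(f"<{n_cols}d", mm, start)
    return sum(values) / n_cols


def worker(shared_path, out_path, row_indices, n_cols):
    """Read rows from the shared file; write results to this worker's own file."""
    with open(out_path, "wb") as fh:
        for i in row_indices:
            fh.write(DOUBLE.pack(row_mean(shared_path, i, n_cols)))
    return out_path
```

Since no two workers ever write to the same file, no locking is needed; a master process can afterwards gather the per-worker files by name.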


ff and parallelization (I)

[Diagram: R1, R2, ..., Rn each read the common ff object (read only) and write to their own ff object (ff1, ff2, ..., ffn). Further labels: Rmaster; ffall.]


ff and parallelization (II)

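[Diagram, built up over several overlays. Labels: Data; Rmaster; ffin (write); R1, R2, ..., Rn (multicore) reading ffin (read only) and writing ff1, ff2, ..., ffn; a further round of R1, R2, ..., Rn with i1, i2, ..., in; a further round writing ff1, ff2, ..., ffn and Fig.1, Fig.2, ..., Fig.n; ffout; Results.]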


Where is this running?

ADaCGH2 (BioConductor package)
Web-based application: http://wavi.bioinfo.cnio.es


R, C, and compression on the fly


Store and access (large) pre-computed results

HMM for aCGH data with Reversible Jump: Viterbi.
Common regions: “count” on the Viterbi paths.

Fitting HMM/common regions: distinct operations.

C: number-crunching.
R: wrapper and figures/tables.
C: creates large amounts of data.

In package RJaCGH (CRAN).


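[Diagram: R and C exchange filenames, not data]

Fit HMM:
- R calls C (HMM).
- C stores the Viterbi paths as gzipped files and returns the filenames.

Find common regions:
- R calls C (common regions) and passes the filenames.
- C reads the Viterbi data and returns the results.

Figures, tables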
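The two-stage flow above (write compressed results to disk, hand back only filenames, read them again later) can be sketched with Python's standard `gzip` module. This illustrates the pattern only; RJaCGH does this work in C, and the functions `store_path` and `count_state`, as well as the text format of one space-separated path per file, are assumptions made for the example.

```python
import gzip


def store_path(path, states):
    """Write one state path as a gzip-compressed text file; return the filename."""
    with gzip.open(path, "wt") as fh:
        fh.write(" ".join(str(s) for s in states))
    return path


def count_state(filenames, position, state):
    """Count how many stored paths are in `state` at `position`.

    Each file is decompressed while it is read, and only one path is held
    in memory at a time.
    """
    count = 0
    for name in filenames:
        with gzip.open(name, "rt") as fh:
            states = fh.read().split()
        if int(states[position]) == state:
            count += 1
    return count
```

The first stage returns a list of filenames, which is small however large the paths are; the second stage can run much later, or repeatedly with different questions, without refitting the model.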


Conclusions et al.


Web-based: A few things we’ve learned

Configuration sucks (if you need to modify > 1 file).
Too many languages.
Adding test cases to the testing suites: web, R.
Documentation: in the code, web pages, LaTeX . . .
Too much R code to catch errors.
User interfaces: who designs them?


Too many languages

Impedance mismatch problem: “Building Web-based applications requires the mastering of a number of languages/technologies (e.g. HTML, CSS, CGI, ASP, PHP, XML, etc..). Such languages and technologies were created to address different aspects on a by-need evolutionary manner. The result is a plethora of tools that are fitted together in an ad hoc fashion.” El-Ansary, Grolaux, Van Roy, Rafea (2005) “Overcoming the Multiplicity of Languages and Technologies for Web-Based Development Using a Multi-paradigm Approach”.

R and C
HTML and Python: CGI, data entry, display
Python (and others): control and monitor MPI
Javascript: AJAX and figures


Fault tolerance and communication

Manual check for errors (R ain’t Erlang).
Too much network traffic.

[Flowchart, as in “MPI details”: boot (new) LAM/MPI; start R and continue from last checkpoint; sleep; then check: Run out of time? Are we done? R crashed (coding errors)? Rmpi crashed? LAM/MPI crashed (includes node crashes)? MPI universe: servers 1 ... n, with NFS shared storage and NFS shared temporary storage. Halt MPI universe; produce and return results pages. Verify servers (modify LAM defs).]


Solutions?

Literate programming and org-mode.
Alternatives to MPI and/or use Erlang . . .
Keep things as they are (only a few painful events a year).


Rethinking web-based applications

Users can get into trouble.

Sure, but we can do a good job . . .
- Provide state-of-the-art statistical and computational approaches (R).
- Pedagogical examples and pipelines.
- Minimize the chance of users getting into trouble.

Web-based applications are here to stay


. . . so . . .

Forget about them: just write your R/C/whatever code.
Go for it:
- We can use R + HPC.
- But other tools and work are necessary.


Regardless of web-based applications . . .

Parallel computing can be used routinely
- (library(parallel) in R ≥ 2.14.0)

Large data sets with ff + parallelization.


So far . . .

Most of what I mentioned refers to “traditional cluster setups”:
- Several nodes (e.g., > 10).
- A few CPUs/cores per node.
- Not too much RAM per node.

We’ve been using it for about 10 years.
But things change . . .


New hardware

Only a few nodes (2 in our case).
Many cores.
Lots of RAM available for a single process.
More reliable?

Image from http://blogs.amd.com/work/files/2011/02/Dell61453.jpg


Changes

Little need for control and monitoring software?
Reconfiguration of MPI definition files.
Load balancing of web servers.


CHANGES

Changes in application software:
Rethink how we use MPI.
Start using OpenMP in C code.
- Need to be careful when called from R.
- Random number generation.

Use mclapply (forking) within R.
Rethink usage of ff: we can keep the whole object in RAM.
- Do not use the disk at all.
- Eliminate code.

Rethink I/O and storage.
Combine MPI/Rmpi with OpenMP and mclapply (forking).

Rethink usage of R (Julia? Python?)
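The caution about random number generation applies to any threaded or forked setup: workers that start from the same generator state produce the same "random" numbers. The sketch below shows the problem and one simple remedy in Python. It is an illustration only; `draws` and `worker_seeds` are hypothetical names, and deriving seeds from a master generator is just one possible scheme, not the approach used in the C/OpenMP code mentioned above.

```python
import random


def draws(seed, n=3):
    """Return n pseudo-random numbers from a generator started at `seed`."""
    rng = random.Random(seed)
    return [rng.random() for _ in range(n)]


def worker_seeds(base_seed, n_workers):
    """Derive one seed per worker from a single base seed.

    The 64-bit seeds are reproducible from base_seed and, for practical
    purposes, distinct from each other.
    """
    master = random.Random(base_seed)
    return [master.getrandbits(64) for _ in range(n_workers)]
```

Two workers both started with seed 42 would return identical values from `draws(42)`, silently duplicating work; giving worker k the k-th element of `worker_seeds(42, n_workers)` keeps the run reproducible while giving each worker its own stream.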


Commercials (great deals)

I’d be glad to talk to anybody who wants to play with, and help configure, our machines and code.


Acknowledgements

O. M. Rueda, A. Alibés, A. Cañada, E. R. Morrissey, M. L. Neves, D. Rico.
Funding: Fundación de Investigación Médica Mutua Madrileña, Project TIC2003-09331-C02-02 of the Spanish MEC and BIO2009-12458 of the Spanish MICINN. Ramón y Cajal Programme of the Spanish Ministry of Education and Science.
CNIO (Spanish National Cancer Research Center).
The R users and developers, for a vibrant statistical computing community and an amazing platform.
Victoria López.
