
Page 1: Savvas Petrou spetrou@epcc.ed.ac.uk EPCC, The University of Edinburgh SPRINT SPRINT A Simple Parallel R INTerface

Savvas Petrou

spetrou@epcc.ed.ac.uk

EPCC, The University of Edinburgh

SPRINT

A Simple Parallel R INTerface


March 2010 SPRINT

Overview

• What is SPRINT

• How is SPRINT different from other parallel R packages

• Biological example: Post-genomic data analysis

• Code comparison


SPRINT

Simple Parallel R INTerface (www.r-sprint.org)

“SPRINT: A new parallel framework for R”, J Hill et al, BMC Bioinformatics, Dec 2008.


Issues of existing parallel R packages

• Difficult to program

• Require scientist to also be a parallel programmer!

• Require substantial changes to existing scripts

• Can’t be used to solve some problems

• No data dependencies allowed


Biological example

• Data: A matrix of expression measurements with genes in rows and samples in columns
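As a small illustration of this layout, the sketch below builds a toy expression matrix in base R. The sizes, the gene and sample names, and the random values are made up for the example. One point worth keeping in mind: R's cor() correlates the columns of its argument, so a gene-by-gene correlation of a genes-in-rows matrix needs a transpose. Whether the input file used later in the slides is stored in this orientation is not stated.

```r
# Toy expression matrix: genes in rows, samples in columns (all values invented).
set.seed(1)
n_genes   <- 5
n_samples <- 4
expr <- matrix(
  rnorm(n_genes * n_samples),
  nrow = n_genes,
  dimnames = list(paste0("gene", seq_len(n_genes)),
                  paste0("sample", seq_len(n_samples)))
)

# cor() works column-wise, so transpose to correlate genes with genes.
gene_cor   <- cor(t(expr))  # n_genes x n_genes
sample_cor <- cor(expr)     # n_samples x n_samples

print(dim(gene_cor))
print(dim(sample_cor))
```

The gene-by-gene result grows with the square of the number of genes, which is what makes the full-size problem hard to fit in memory.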


Biological example

• Problem: Using all or many genes will either crash or be very slow (R memory allocation limits, number of computations)

Data limitations (correlations)

Input array dimensions   Input array size   Final array size in memory
11,000 x 320             26.85 MB           923.15 MB (0.9 GB)
22,000 x 320             53.7 MB            3,692.62 MB (3.6 GB)
35,000 x 320             85.44 MB           9,346 MB (9.12 GB)
45,000 x 320             109.86 MB          15,449.52 MB (15.08 GB)

Work load limitations (permutations)

Input array dimensions   Permutation count   Estimated total run time
36,612 x 76              500,000             20,750 seconds (6 hours)
36,612 x 76              1,000,000           41,500 seconds (12 hours)
73,224 x 76              500,000             35,000 seconds (10 hours)
73,224 x 76              1,000,000           70,000 seconds (20 hours)
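The memory figures for the correlation case are consistent with storing every value as an 8-byte double and producing a gene-by-gene result matrix. The sketch below reproduces them under that assumption; the function names are illustrative, and MB is taken to mean 2^20 bytes.

```r
# Assumption: every matrix element is an 8-byte double.
bytes_per_double <- 8

to_mib <- function(bytes) bytes / 1024^2

# Input is n_genes x n_samples; the correlation result is n_genes x n_genes.
cor_memory <- function(n_genes, n_samples) {
  data.frame(
    genes      = n_genes,
    samples    = n_samples,
    input_mib  = to_mib(n_genes * n_samples * bytes_per_double),
    output_mib = to_mib(n_genes^2 * bytes_per_double)
  )
}

sizes <- cor_memory(c(11000, 22000, 35000, 45000), 320)
print(round(sizes, 2))
```

Doubling the number of genes doubles the input size but quadruples the output, which is why the result stops fitting in memory long before the input does.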


Workarounds and solution

• Workaround:

– Remove as many genes as possible before applying the algorithm. This can be an arbitrary process and may remove relevant data.

– Perform multiple executions and post-process the data. This can become a very painful procedure.

• Solution: Parallelisation of R code can be made accessible to bioinformaticians/statisticians. A library of expert-coded solutions, written once, then easy end-point use by all.

[Diagram labels: Big PostGenomic Data, R, SPRINT, HPC, Biological Results]
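To give a feel for the "multiple executions" workaround, here is a minimal serial sketch in base R that computes a gene-by-gene correlation one block at a time and writes each block to disk, so the full result never has to sit in memory. It is not SPRINT code; the function name blockwise_cor, the block size, and the file naming are assumptions made for the example.

```r
# Compute cor() between blocks of genes (rows) and save each block separately.
blockwise_cor <- function(expr, block_size, out_dir = tempdir()) {
  dir.create(out_dir, showWarnings = FALSE, recursive = TRUE)
  n      <- nrow(expr)
  starts <- seq(1, n, by = block_size)
  files  <- character(0)
  for (i in starts) {
    rows_i <- i:min(i + block_size - 1, n)
    for (j in starts) {
      rows_j <- j:min(j + block_size - 1, n)
      # cor(x, y) correlates columns of x with columns of y, hence the transposes.
      block <- cor(t(expr[rows_i, , drop = FALSE]),
                   t(expr[rows_j, , drop = FALSE]))
      path <- file.path(out_dir, sprintf("cor_block_%d_%d.rds", i, j))
      saveRDS(block, path)
      files <- c(files, path)
    }
  }
  files
}

# Toy usage: 6 genes x 5 samples, blocks of 4 genes.
set.seed(42)
toy   <- matrix(rnorm(30), nrow = 6)
files <- blockwise_cor(toy, block_size = 4)
print(basename(files))
```

The bookkeeping (tracking block files, stitching them back together, handling the symmetric halves) is the part the slide describes as painful, and it still runs on a single process.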


Benchmarks (256 processes)

Data limitations (correlations)

Input array dimensions   Input array size   Final array size in memory   Total run time in serial (s)                     Total run time in parallel (s)
11,000 x 320             26.85 MB           923.15 MB (0.9 GB)           63.18                                            4.76
22,000 x 320             53.7 MB            3,692.62 MB (3.6 GB)         “Error: cannot allocate vector of size 3.6 Gb”   13.87
35,000 x 320             85.44 MB           9,346 MB (9.12 GB)           CRASHED                                          36.64
45,000 x 320             109.86 MB          15,449.52 MB (15.08 GB)      CRASHED                                          42.18

Work load limitations (permutations)

Input array dimensions   Permutation count   Estimated total run time in serial   Total run time in parallel (s)
36,612 x 76              500,000             20,750 seconds (6 hours)             73.18
36,612 x 76              1,000,000           41,500 seconds (12 hours)            146.64
73,224 x 76              500,000             35,000 seconds (10 hours)            148.46
73,224 x 76              1,000,000           70,000 seconds (20 hours)            294.61
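The speedups implied by these benchmark figures can be worked out directly. The sketch below does the arithmetic in R using only the numbers from the slide; the variable names are illustrative. Because the serial permutation times are estimates, the resulting ratios should be read as indicative rather than measured.

```r
# Permutation testing: estimated serial vs measured parallel times (seconds).
perm <- data.frame(
  dims         = c("36,612 x 76", "36,612 x 76", "73,224 x 76", "73,224 x 76"),
  permutations = c(5e5, 1e6, 5e5, 1e6),
  serial_s     = c(20750, 41500, 35000, 70000),
  parallel_s   = c(73.18, 146.64, 148.46, 294.61)
)
perm$speedup <- perm$serial_s / perm$parallel_s
print(perm)

# Correlation: only the smallest case completed in serial, so only it has a ratio.
cor_speedup <- 63.18 / 4.76
print(cor_speedup)
```

For the larger correlation inputs no ratio exists at all: the serial run fails, so the benefit there is that the computation completes rather than that it completes faster.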


Correlation code comparison

Serial R:

edata <- read.table("largedata.dat")
pearsonpairwise <- cor(edata)
write.table(pearsonpairwise, "Correlations.txt")
quit(save="no")

With SPRINT:

library("sprint")
edata <- read.table("largedata.dat")
ff_handle <- pcor(edata)
pterminate()
quit(save="no")


Permutation testing code comparison

Serial R:

library("multtest")   # provides mt.maxT
data(golub)
smallgd <- golub[1:100,]
classlabel <- golub.cl
resT <- mt.maxT(smallgd, classlabel, test="t", side="abs")
quit(save="no")

With SPRINT:

library("sprint")
data(golub)
smallgd <- golub[1:100,]
classlabel <- golub.cl
resT <- pmaxT(smallgd, classlabel, test="t", side="abs")
pterminate()
quit(save="no")
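To make the permutation workload concrete, here is a minimal base-R sketch of a permutation test with a max-T adjustment. It is a simplified single-step variant written for illustration, not the code of mt.maxT or pmaxT (mt.maxT implements a step-down procedure). The function names welch_t and single_step_maxT and the toy data are assumptions.

```r
# Welch two-sample t-statistic for every row (gene); labels are 0/1 per column.
welch_t <- function(x, labels) {
  a <- x[, labels == 0, drop = FALSE]
  b <- x[, labels == 1, drop = FALSE]
  (rowMeans(b) - rowMeans(a)) /
    sqrt(apply(a, 1, var) / ncol(a) + apply(b, 1, var) / ncol(b))
}

# Single-step max-T: compare each observed |t| with the permutation
# distribution of the maximum |t| over all genes.
single_step_maxT <- function(x, labels, B = 1000) {
  observed <- abs(welch_t(x, labels))
  max_null <- replicate(B, max(abs(welch_t(x, sample(labels)))))
  adjp <- sapply(observed, function(t_obs) mean(max_null >= t_obs))
  list(teststat = observed, adjp = adjp)
}

# Toy data: 20 genes x 10 samples, gene 1 strongly shifted in group 1.
set.seed(1)
x <- matrix(rnorm(20 * 10), nrow = 20)
labels <- rep(c(0, 1), each = 5)
x[1, labels == 1] <- x[1, labels == 1] + 10
res <- single_step_maxT(x, labels, B = 500)
print(head(res$adjp))
```

Each permutation recomputes a statistic for every gene, so the cost grows with both the permutation count and the number of genes. The permutations are independent of one another, which makes this kind of loop a natural candidate for distribution across processes.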


SPRINT

• Website: http://www.r-sprint.org/

• Source code can be downloaded from website

• Soon also in the CRAN repository

• Mailing list: [email protected]

• Contact email: [email protected]


Acknowledgements

DPM Team:

• Peter Ghazal

• Thorsten Forster

• Muriel Mewissen

EPCC Team:

• Terry Sloan

• Michal Piotrowski

• Savvas Petrou

• Bartek Dobrzelecki

• Jon Hill

• Florian Scharinger

This work is supported by the Wellcome Trust and the NAG dCSE Support service.

Numerical Algorithms Group