computational infrastructure for ngs data analysis

25
Computational infrastructure for NGS data analysis Pablo Escobar [email protected]

Upload: cursongs

Post on 15-Jun-2015

3.413 views

Category:

Technology


2 download

DESCRIPTION

Computational infrastructure for NGS data analysis - Pablo Escobar -Massive sequencing data analysis workshop -Granada 2011

TRANSCRIPT

Page 1: Computational infrastructure for NGS data analysis

Computational infrastructure for NGS data analysis

Pablo Escobar

[email protected]

Page 2: Computational infrastructure for NGS data analysis

Index

Introduction to supercomputing (HPC, high performance computing)

NGS computational infrastructures Sequencing Centers

Medical Genome Project MGP - Sevilla Centro Nacional Análisis genómico CNAG - Barcelona Beijing Genomics Institute BGI - China

Alternatives: Cloud computing Grid computing

Page 3: Computational infrastructure for NGS data analysis

Computational infrastructure for NGS

In NGS we have to process really big amounts of data, which is not trivial in computing terms.

Big NGS projects require supercomputing infrastructures

Page 4: Computational infrastructure for NGS data analysis

Computational infrastructure for NGS

These infrastructures are expensive and not trivial to use, we require:

Acondicionated data center (servers room). This is expensive

Computing cluster: Many computing nodes (servers) High performance storage (hard disks) Fast networks (10Gb ethernet, infiniband...)

Skilled people in computing ( sysadmins and developers). In CNAG currently 30 staff - >50% informatics

Page 5: Computational infrastructure for NGS data analysis

Big infrastructure cluster

Distributed memory cluster

Starting at 20 computing nodes 160 to 240 cores

amd64 (x86_64) is the most

used cpu architecture At least 48GB ram per node

Fast networks

10Gbit Infiniband

Batch queue system (sge, condor, pbs, slurm)

Optional MPI and GPUs environment depending on project requirements

Page 6: Computational infrastructure for NGS data analysis

Big infrastructure storage

Distributed filesystem for high performance storage (starting at 100TB)

Lustre GPFS Ibrix parallel nfs glusterFS

Storage is the most expensive NFS is not a good option for supercomputing

Page 7: Computational infrastructure for NGS data analysis

Big infrastructure storage

Page 8: Computational infrastructure for NGS data analysis

Big infrastructure

Page 9: Computational infrastructure for NGS data analysis

Big infrastructure

Starting at 200.000€ 200.000€ is just the hardware Plus data center (computers room) Plus informatics salary

Not every partner knows about supercomputing. SGI Bull IBM HP

Page 10: Computational infrastructure for NGS data analysis

Middle-size infrastructure

”Small” distributed filesystem ( around 50TB).

”Small” cluster (around 10 nodes, 80 to 120 cores).

At least gigabit ethernet network.

Price range: 50.000 – 100.000 € (just hardware) plus data center and informatics salary

Page 11: Computational infrastructure for NGS data analysis

Small infrastructure

Recommended at least 2 machines 8 or 12 cores each machine. 48Gb ram minimum each machine. BIG local disk. At least 4TB each machine

As much local disks as we can afford

Price range: starting at 8.000€ - 10.000€ (two machines)

Page 12: Computational infrastructure for NGS data analysis

Sequencing centers in Spain

Medical Genome Project Sequencing Instruments

7 GS-FLX (Roche) 4 SolidTM 4 (Applied Biosystems)

Informatics infrastructure 300 core cluster 0,5 petabyte hard disks

Page 13: Computational infrastructure for NGS data analysis

Medical genome project

Storage racks

IBRIX filesystem

front-ends

Page 14: Computational infrastructure for NGS data analysis

MGP raw data generation

a solid sequencer run

7 days running Generates around 4TB

Only the four solid sequencers working full time can generate around 12TB each week.

12TB just of raw data. After running bioinformatics analysis more data is generated

Raw data size grows really fast

New sequencer models New reagents

Page 15: Computational infrastructure for NGS data analysis

MGP raw data generation

Page 16: Computational infrastructure for NGS data analysis

Sequencing centers in Spain

CNAG Sequencing Instruments

8 Illumina Genome Analyzer Iix 6 Illumina HiSeq2000 4 Illumina cBots

Informatics infrastructure 850 core cluster 1.2 petabyte hard disks 10 x 10 Gb/s link with marenostrum (Barcelona Super

Computer 10,240 cores)

Page 17: Computational infrastructure for NGS data analysis

CNAG

Page 18: Computational infrastructure for NGS data analysis

Largest sequencing center in the world

Beijing Genomics Institute (BGI)

Page 19: Computational infrastructure for NGS data analysis

Largest sequencing center in the world

Beijing Genomics Institute (BGI)

Hardware resources

Source: http://www.genomics.cn/en/platform.php?id=249

Page 20: Computational infrastructure for NGS data analysis

Sequencing center resources

Page 21: Computational infrastructure for NGS data analysis

Most used operating system is GNU/LINUX

Source: http://www.top500.org/stats/list/36/osfam

Page 22: Computational infrastructure for NGS data analysis

Alternatives – cloud computing

Pros Flexibility. You pay what you use. Don´t need to maintain a data center.

Cons Transfer big datasets over internet is slow. You pay for consumed bandwidth. That is a problem

with big datasets. Lower performance, specially in disk read/write. Privacy/security concerns. More expensive for big and long term projects.

Page 23: Computational infrastructure for NGS data analysis

Alternatives – grid computing

Page 24: Computational infrastructure for NGS data analysis

Alternatives – grid computing

Pros Cheaper. More resources available.

Cons Too heterogeneous environment. Slow connectivity (specially in Spain). Requires a lot of time to find good resources in the

grid.

Page 25: Computational infrastructure for NGS data analysis

M4NY TH4NKS :)