HPC in the Human Genome Project
James Cuff <[email protected]>
Post on 18-Dec-2015
• The Sanger Centre is a research centre funded primarily by the Wellcome Trust
• Located in 55 acres of parkland
• Also on site:
– European Bioinformatics Institute (EBI)
– Human Genome Mapping Project Resource Centre (HGMP-RC)
The Sanger Centre
• Founded in 1993; >570 staff members now.
• Our purpose is to further the knowledge of the biology of organisms, particularly through large scale sequencing and analysis of their genomes.
• Our lead project is to sequence a third of the human genome as part of the international Human Genome Project.
Sanger Centre research programmes
• Pathogen sequencing programme
• Informatics
– support data collection
– analyse and present results
– develop methodology: algorithms and data resources
• Cancer genome project
• Human genetic programme - study genetic variation (SNPs) and find disease genes
The era of genome sequencing
Organism       Size (Mb)  No. of genes  Gene density  Type             Completion date
H. influenzae  2          1,700         1/1kb         Bacterium        1995
Yeast          13         6,000         1/2kb         Eukaryotic cell  1996
Nematode       100        18,000        1/6kb         Animal           1998
Human          3,000      ?40,000       1/60kb        Mammal           2000/3
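As a quick sanity check on the gene-density column (a sketch derived from the table, not part of the original slides), density is simply genome size over gene count. Note that with 40,000 genes the human figure comes out nearer 1 gene per 75kb, so the slide's 1/60kb presumably reflects a higher working gene estimate:

```python
# Rough check of the gene-density column: kilobases of genome per gene.
# Figures are taken from the table above; "?40,000" is treated as 40,000.
genomes = {
    "H. influenzae": (2, 1_700),   # (size in Mb, gene count)
    "Yeast": (13, 6_000),
    "Nematode": (100, 18_000),
    "Human": (3_000, 40_000),
}

kb_per_gene = {
    name: size_mb * 1_000 / genes
    for name, (size_mb, genes) in genomes.items()
}

for name, kb in kb_per_gene.items():
    print(f"{name}: ~1 gene per {kb:.0f} kb")
```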
Sequence data production increase of >2000%
The Sequencing Facility
Sanger I.T.
• Sanger network: more than 1,600 devices
• 300 PCs (various)
• 150+ X-terminals/Network Computers (NCD)
• 250 NT/Mac ABI collection devices
• Various other servers, Linux desktop systems, printers, etc.
• Paracel, Compugen and TimeLogic systems
• >350 Compaq Alpha systems (DS10, DS20, ES40, 8400)
• 440-node sequence annotation farm (PC/DS10/DS10L)
• >750 Alpha processors in total
Systems architecture hierarchy
[Diagram: desktop workgroup systems connect through front-end compute servers and ATM switches (ASX-0BX) to the compute server farm and its RAID storage over F/C; LSF distributes work across the hierarchy.]
LSF - Load Sharing Facility by Platform Computing Ltd
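As an illustration of how work reaches the farm through LSF (a hypothetical sketch requiring an LSF installation; the queue, resource strings, database and file names below are not from the slides), a sequence-analysis job might be submitted with `bsub`:

```shell
# Submit a BLAST search to a batch queue, letting LSF pick a host
# with sufficient free memory (queue and file names are hypothetical):
bsub -q normal -J blast_run -o blast.%J.out \
     -R "select[mem>500] rusage[mem=500]" \
     blastall -p blastn -d humandb -i query.fa

# Inspect queue state and jobs across the cluster:
bqueues
bjobs -u all
```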
Computer Systems Architecture
Fibre Channel/Memory Channel Tru64 Clusters
Implementing tightly coupled clustering with Tru64 V5.x gives us:
• Improved disk I/O (Fibre Channel) and scalability (multi-CPU, multi-terabyte)
• Improved manageability (single system image: whole clusters are managed as single entities)
ES40 Clusters, F.C Storage
• 32 x Cabletron 100Mb/s switches, 16 x RS232 terminal servers, 2 x 155Mb/s ATM fibre uplinks back to the v5.0 cluster
• Two network subnets (multicast and backbone)
• 640 x 100Mb/s Fast Ethernet ports
• 1,920 UTP cable crimps, 8 cabinets, ~100kW of power
• 8 racks, each with 40 Tru64 v5.0 Alpha DS10L nodes (1U high): 320 x 466MHz Alpha EV6.7 CPUs, 320GB total memory, 19.2TB internal storage
• ca. equivalent to 10 x GS320; performance around 355 Gflops
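The headline farm numbers above are internally consistent, as a quick back-of-envelope check shows (the per-node figures are derived here, not stated on the slide):

```python
# Back-of-envelope check on the annotation-farm hardware figures.
racks = 8
nodes_per_rack = 40
nodes = racks * nodes_per_rack        # 320 DS10L nodes in total

total_mem_gb = 320
total_disk_gb = 19_200                # 19.2 TB internal storage

mem_per_node_gb = total_mem_gb / nodes    # 1 GB per node
disk_per_node_gb = total_disk_gb / nodes  # 60 GB per node

print(nodes, mem_per_node_gb, disk_per_node_gb)  # → 320 1.0 60.0
```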
Annotation Farms
• Highly available NFS (Tru64 CAA)
• Fast I/O (ATM > switched full-duplex Ethernet)
• Socket data transfer (via rdist, rcp, and MySQL DBI sockets)
• Segmented network architecture via two ELANs
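Of the transfer mechanisms above, rdist is driven by a Distfile; a minimal sketch might look like this (host names and paths are hypothetical, not taken from the slides):

```
# Hypothetical Distfile: push sequence data to two farm nodes
# and mail a notification when files change.
HOSTS = ( node001 node002 )
FILES = ( /data/sequences )

${FILES} -> ${HOSTS}
	install ;
	notify admin@localhost ;
```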
[Diagram: an 8-node ES40 M/C-F/C cluster with ATM uplinks joining the Sanger subnet (172.25) to the farm subnet (172.27).]
Network Overview
Compute – systems architecture
[Diagram: an ATM backbone linking the 400-node sequence annotation farm, pathogen, sequence data processing, informatics, mapping, SNP and Ensembl systems, large-scale assembly and sequencing, and the trace server, with external services (webserver, Blast server, AltaVista search, FTP) behind a firewall/DMZ.]
Enterprise Clustering
• LSF is still key for job scheduling and batch operations
• LSF offers greater granularity of operation and functionality than Tru64 scheduling
• Schedules individual nodes, and supports cluster-wide and cross-cluster scheduling
• With LSF we still have the capability to use many of the 750+ compute nodes as a single Sanger Compute Engine
MODULAR SUPERCOMPUTING
Projects
• Will involve thousands of CPUs
– large numbers of PC farm nodes
– high-end, large-memory SMP configurations
• All are computationally expensive
• Will require > 100 Terabytes of storage
• We need to continue scaling up and deal with the physical limitations
Immediate Future
[Diagram: the Sanger Centre and the EBI, both on the Genome Campus, linked by ATM to a shared Storage Area Network (SAN), with LSF clustering spanning the sites.]
Institute to Institute Clustering
Closer collaboration between Sanger, the EBI and other organisations brings the need for site-wide shared clusters.
Implement Storage Area Network
Install multi-TB storage to enable disk mirroring and controller-to-controller snapshots.
Longer Term Future
• Wide-area clusters, needed for large-scale collaborations
• GRID technology (global distributed computing): international cluster collaborations with other scientific institutes
GLOBAL COMPUTE ENGINES
• Sanger is keen to keep abreast of this emerging technology
Questions ?