TRANSCRIPT
Forschungszentrum Karlsruhe in der Helmholtz-Gemeinschaft
Implementation of a reliable and expandable on-line storage for compute clusters
Jos van Wezel, IWR, Forschungszentrum Karlsruhe
Germany
CHEP04
The GridKa centre
• operates a compute cluster for the D0, BaBar, CDF, COMPASS and LHC experiments (Tier 1 for the LHC)
• 500 dual-CPU nodes, 220 TB disk, 400 TB tape
• expected growth to 1.6 PB disk and 4 PB tape in 2008
• tape storage via dCache with a Tivoli Storage Manager backend
• disk storage via NFS/GPFS
Overview
• Storage components at GridKa
• Cluster file system implementation
• Integration with Linux
• On-line storage management
• Load balancing
Storage components (1)
• IO servers
  – dual Xeon 2.4 GHz, 1.5 GB RAM, Broadcom Ethernet
  – failover host bus adapter driver (QLogic version 6.01)
  – RedHat 8, kernel 2.20.18-8, on the production cluster
  – RedHat ES 3 (Scientific Linux) on the test cluster
• disks and RAID
  – 136 GB disks, 10 krpm
  – 9 * 10 units of 14 disks: 1260 in total (36 hot spares)
  – arranged as 153 RAID-5 (7+1) volumes of 957 GB each
  – stripe size 256 KB
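The layout adds up: 9 * 10 * 14 = 1260 disks in total; 1260 - 36 hot spares = 1224 data disks; 1224 / 8 disks per 7+1 volume = 153 RAID-5 volumes. Usable capacity per volume is roughly 7 * 136 GB ≈ 950 GB, consistent with the 957 GB quoted above (the small difference presumably comes from nominal versus formatted disk capacity).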
Storage components (2)
• disk controllers (IBM FAStT700)
  – to the disks: 9 * 4 independent 2 Gb/s FC connections
  – to the servers: 9 * 4 independent 2 Gb/s FC connections
  – a reset or failure of (access to) one controller is handled without service interruption
• parallel cluster file system (GPFS)
  – each node of the storage cluster sees each disk
  – a partition is striped over 2 or more RAID volumes
  – file systems are exported via NFS
  – maximum size of a single LUN is 1 TB
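For scale: each 957 GB RAID volume stays safely under the 1 TB LUN limit and is presented as a single LUN, so striping one GPFS file system over, say, 16 such volumes gives roughly 16 * 957 GB ≈ 15 TB, in line with the 15 TB file spaces quoted at the end of this talk; the number of volumes per file system is an assumption used only for the arithmetic.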
Cluster to storage connection
[Diagram: worker nodes mount the file systems over NFS via Ethernet from the GPFS cluster; the GPFS cluster connects through a Fibre Channel switch (SAN) to the disk collection]
Linux parts
• SCSI driver
  – allows hot-adding of disks/LUNs
  – no fixed relation between LUN ID and SCSI numbering; the HBAs support persistent binding
• Fibre Channel driver
  – the failover driver selects a functional path
  – maximum number of LUNs on a QLogic FC HBA is 128
• NFS server and NFS client
  – server side optimized, client side at defaults
• autofs with program maps
  – version 4.1.3 (autofs4)
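On the 2.4-series kernels used here, a new LUN can typically be brought in at run time by writing an add-single-device entry to /proc/scsi/scsi; because the kernel then assigns SCSI device numbers in discovery order rather than by LUN ID, the persistent-binding feature of the HBA driver is what keeps device names stable across reboots.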
Maintenance and management
• The disk storage supports:
  – global hot spares
  – on-line replaceable parts: controllers (incl. firmware), batteries, power supplies
  – background disk scrubbing
• The LVM layer of GPFS allows for:
  – on-line replacement of volumes
  – expansion of file systems
  – on-line rebalancing after expansion
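In GPFS these maintenance steps correspond to the standard administration commands: mmadddisk to add new volumes to a mounted file system, mmdeldisk to drain and remove old ones, and mmrestripefs to rebalance data over the enlarged set of disks; the exact options depend on the GPFS release and are not taken from this talk.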
Storage load balancing
• At the file system level
  – data transfers are striped over several RAID volumes
  – storage is re-balanced on-line after expansion
• At the server level
  – clients select servers at random
  – combination of autofs and DNS (see the program-map sketch below)
  – introduce selection criteria (server capacity, service groups)
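As an illustration of how the autofs/DNS combination can pick a server per mount, here is a minimal sketch of an executable autofs program map written in Python. autofs calls such a map with the lookup key as its single argument and reads the map entry from standard output; the server names, export path and mount options below are invented for the example and are not taken from the GridKa configuration.

    #!/usr/bin/env python
    # Hypothetical program map, e.g. installed as /etc/auto.gridka and marked
    # executable. autofs passes the lookup key (the directory name) as argv[1]
    # and expects "<options> <server>:<path>" on stdout; a non-zero exit code
    # means "no such key".
    import random
    import sys

    SERVERS = ["nfs-io1", "nfs-io2", "nfs-io3", "nfs-io4"]   # assumed server names

    def main():
        if len(sys.argv) != 2:
            sys.exit(1)                    # no key supplied: refuse the lookup
        key = sys.argv[1]                  # e.g. "data3"
        server = random.choice(SERVERS)    # naive random selection across servers
        print("-rw,hard,intr,tcp %s:/export/%s" % (server, key))

    if __name__ == "__main__":
        main()

A policy-aware version would replace the random choice with a weighted one based on server capacity or service group, which is what the "introduce selection criteria" item above refers to.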
Server level load balancing
[Graph: read and write activity of the last 24 hours, summed over all file servers]
[Graph: read activity of the production file servers]
Presented solution benefits
• scalable size (up to 4 PB) and large (15 TB) file spaces
• scalable performance (about 100 MB/s per server on a single Gigabit Ethernet link)
• native OS syscall API, no application code changes needed
• on-line replaceable components reduce down time
• on-line storage expansion
• dynamic load balancing
• server load policies allow for different server hardware
• native Linux components on the clients
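A single Gigabit Ethernet link carries about 125 MB/s of raw bandwidth, so roughly 100 MB/s of NFS payload per server is close to the practical limit of one link; aggregate throughput therefore scales with the number of NFS servers, and the Ethernet bonding planned on the next slide would raise the per-server ceiling.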
Work to do
• get GPFS/NFS working on RedHat ES 3.0
• integrate dCache into the existing storage environment
• get DC to CERN and the peer Tier 1s rolling
• start experimenting with SATA
• connect the NFS servers with a second Ethernet link via Ethernet bonding
• introduce load policies
Thank you
with thanks to colleagues from the GIS, GES and DASI departments at the Institute for Scientific Computing (IWR), Karlsruhe