TRANSCRIPT
Forschungszentrum Karlsruhe in der Helmholtz-Gemeinschaft
Implementation of a reliable and expandable on-line storage for compute clusters
Jos van Wezel, IWR, Forschungszentrum Karlsruhe
Germany
CHEP04
The GridKa centre
• operates a compute cluster for the D0, BaBar, CDF, COMPASS and LHC experiments (Tier 1 for the LHC)
• 500 dual-CPU nodes, 220 TB disk, 400 TB tape
• expected growth to 1.6 PB disk and 4 PB tape in 2008
• tape storage via dCache with a Tivoli Storage Manager backend
• disk storage via NFS/GPFS
Overview
• Storage components at GridKa
• Cluster file system implementation
• Integration with Linux
• On-line storage management
• Load balancing
Storage components (1)
• IO servers
  – dual Xeon 2.4 GHz, 1.5 GB RAM, Broadcom Ethernet
  – failover host bus adapter driver (QLogic version 6.01)
  – RedHat 8, kernel 2.20.18-8, on the production cluster
  – RedHat ES 3 (Scientific Linux) on the test cluster
• disks and RAID
  – 136 GB disks, 10 krpm
  – 9 * 10 units of 14 disks: 1260 in total (36 hot spares)
  – arranged as 153 RAID-5 (7+1) volumes of 957 GB each
  – stripe size 256 KB
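The layout adds up: 9 * 10 * 14 = 1260 disks in total; 1260 - 36 hot spares = 1224 data disks; 1224 / 8 disks per 7+1 volume = 153 RAID-5 volumes. Usable capacity per volume is roughly 7 * 136 GB ≈ 950 GB, consistent with the 957 GB quoted above (the small difference presumably comes from nominal versus formatted disk capacity).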
Storage components (2)
• disk controllers (IBM FAStT700)
  – to the disks: 9 * 4 independent 2 Gb/s FC connections
  – to the servers: 9 * 4 independent 2 Gb/s FC connections
  – a reset or failure of (access to) one controller is handled without service interruption
• parallel cluster file system (GPFS)
  – each node of the storage cluster sees each disk
  – a partition is striped over 2 or more RAID volumes
  – file systems are exported via NFS
  – maximum size of a single LUN is 1 TB
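For scale: each 957 GB RAID volume stays safely under the 1 TB LUN limit and is presented as a single LUN, so striping one GPFS file system over, say, 16 such volumes gives roughly 16 * 957 GB ≈ 15 TB, in line with the 15 TB file spaces quoted at the end of this talk; the number of volumes per file system is an assumption used only for the arithmetic.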
Cluster to storage connection
[Diagram: worker nodes mount the file systems over NFS via Ethernet from the GPFS cluster; the GPFS cluster connects through a Fibre Channel switch (SAN) to the disk collection]
Linux parts
• SCSI driver
  – allows hot-adding of disks/LUNs
  – no fixed relation between LUN ID and SCSI numbering; the HBAs support persistent binding
• Fibre Channel driver
  – the failover driver selects a functional path
  – maximum number of LUNs on a QLogic FC HBA is 128
• NFS server and NFS client
  – server side optimized, client side at defaults
• autofs with program maps
  – version 4.1.3 (autofs4)
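On the 2.4-series kernels used here, a new LUN can typically be brought in at run time by writing an add-single-device entry to /proc/scsi/scsi; because the kernel then assigns SCSI device numbers in discovery order rather than by LUN ID, the persistent-binding feature of the HBA driver is what keeps device names stable across reboots.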
Maintenance and management
• The disk storage supports:
  – global hot spares
  – on-line replaceable parts: controllers (incl. firmware), batteries, power supplies
  – background disk scrubbing
• The LVM layer of GPFS allows for:
  – on-line replacement of volumes
  – expansion of file systems
  – on-line rebalancing after expansion
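In GPFS these maintenance steps correspond to the standard administration commands: mmadddisk to add new volumes to a mounted file system, mmdeldisk to drain and remove old ones, and mmrestripefs to rebalance data over the enlarged set of disks; the exact options depend on the GPFS release and are not taken from this talk.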
Storage load balancing
• At the file system level
  – data transfers are striped over several RAID volumes
  – storage is re-balanced on-line after expansion
• At the server level
  – clients select servers at random
  – combination of autofs and DNS (see the program-map sketch below)
  – introduce selection criteria (server capacity, service groups)
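As an illustration of how the autofs/DNS combination can pick a server per mount, here is a minimal sketch of an executable autofs program map written in Python. autofs calls such a map with the lookup key as its single argument and reads the map entry from standard output; the server names, export path and mount options below are invented for the example and are not taken from the GridKa configuration.

    #!/usr/bin/env python
    # Hypothetical program map, e.g. installed as /etc/auto.gridka and marked
    # executable. autofs passes the lookup key (the directory name) as argv[1]
    # and expects "<options> <server>:<path>" on stdout; a non-zero exit code
    # means "no such key".
    import random
    import sys

    SERVERS = ["nfs-io1", "nfs-io2", "nfs-io3", "nfs-io4"]   # assumed server names

    def main():
        if len(sys.argv) != 2:
            sys.exit(1)                    # no key supplied: refuse the lookup
        key = sys.argv[1]                  # e.g. "data3"
        server = random.choice(SERVERS)    # naive random selection across servers
        print("-rw,hard,intr,tcp %s:/export/%s" % (server, key))

    if __name__ == "__main__":
        main()

A policy-aware version would replace the random choice with a weighted one based on server capacity or service group, which is what the "introduce selection criteria" item above refers to.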
Server level load balancing
[Graph: read and write activity of the last 24 hours, summed over all file servers]
[Graph: read activity of the production file servers]
Presented solution benefits
• scalable size (up to 4 PB) and large (15 TB) file spaces
• scalable performance (about 100 MB/s per server on a single Gigabit Ethernet link)
• native OS syscall API, no application code changes needed
• on-line replaceable components reduce down time
• on-line storage expansion
• dynamic load balancing
• server load policies allow for different server hardware
• native Linux components on the clients
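A single Gigabit Ethernet link carries about 125 MB/s of raw bandwidth, so roughly 100 MB/s of NFS payload per server is close to the practical limit of one link; aggregate throughput therefore scales with the number of NFS servers, and the Ethernet bonding planned on the next slide would raise the per-server ceiling.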
Work to do
• get GPFS/NFS working on RedHat ES 3.0
• integrate dCache into the existing storage environment
• get DC to CERN and the peer Tier 1s rolling
• start experimenting with SATA
• connect the NFS servers with a second Ethernet link via Ethernet bonding
• introduce load policies
Thank you
with thanks to colleagues from the GIS, GES and DASI departments at the Institute for Scientific Computing (IWR), Karlsruhe