Cheap Clustering with OCFS2 - Mark Fasheh, Oracle | oss.oracle.com
TRANSCRIPT
What is OCFS2
● General purpose cluster file system
  – Shared disk model
  – Symmetric architecture
  – Almost POSIX compliant
    ● fcntl(2) locking
    ● Shared writable mmap
● Cluster stack
  – Small, suitable only for a file system
Why use OCFS2?
● Versus NFS
  – Fewer points of failure
  – Data consistency
  – OCFS2 nodes have direct disk access
    ● Higher performance
● Widely distributed, supported
  – In the Linux kernel
  – Novell SLES9, SLES10
  – Oracle support for RAC customers
Why do we need “cheap” clusters?
● Shared disk hardware can be expensive
  – Fibre Channel as a rough example
    ● Switches: $3,000 - $20,000
    ● Cards: $500 - $2,000
    ● Cables, GBICs: hundreds of dollars
    ● Disk(s): the sky's the limit
● Networks are getting faster and faster
  – Gigabit PCI card: $6
● Some want to prototype larger systems
  – Performance not necessarily critical
Hardware
● Cheap commodity hardware is easy to find:
  – Refurbished from name brands (Dell, HP, IBM, etc.)
  – Large hardware stores (Fry's Electronics, etc.)
  – Online: eBay, Amazon, Newegg, etc.
● Impressive performance
  – Dual core CPUs running at 2GHz and up
  – Gigabit network
  – SATA, SATA II
Hardware Examples - Network
● Gigabit network card: $6
  – Can direct connect rather than buy a switch; buy two!
Hardware Examples - Total
● Total hardware cost per node: $326
  – 3 node cluster for less than $1,000!
  – One machine exports its disk via the network
    ● Dedicated gigabit network for the storage
    ● At $50 each, it is simple to buy an extra, dedicated disk
    ● Generally, this node cannot mount the shared disk
● Spend slightly more for nicer hardware
  – PCI-Express Gigabit: $30
  – Athlon X2 3800+, MB (SATA II, DDR2): $180
Shared Disk via iSCSI
● SCSI over TCP/IP
  – Can be routed
  – Support for authentication, many enterprise features
● iSCSI Enterprise Target (IETD)
  – iSCSI “server”
  – Can run on any disks, regular files
  – Kernel / user space components
● Open-iSCSI Initiator
  – iSCSI “client”
  – Kernel / user space components
Trivial iSCSI Target Config.
● Name the target
  – iqn.YYYY-MM.com.example:disk.name
● Create a “Target” stanza in /etc/ietd.conf
  – Lun definitions describe the disks to export
  – fileio type for normal disks
  – Special nullio type for testing

Target iqn.2006-08.com.example:lab.exports
    Lun 0 Path=/dev/sdX,Type=fileio
    Lun 1 Sectors=10000,Type=nullio
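With a stanza like the one above in /etc/ietd.conf, bringing the target online is just a matter of (re)starting the daemon. A minimal sketch, assuming the IETD packages are installed; the init script name and paths vary by distribution:

```shell
# (Re)start the target daemon so it picks up the new /etc/ietd.conf.
# On some distributions the script is called "ietd" instead.
/etc/init.d/iscsi-target restart

# IETD's kernel component publishes its state under /proc/net/iet/;
# check that the LUNs defined above are actually exported.
cat /proc/net/iet/volume
```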
Trivial iSCSI Initiator Config.
● Recent releases have a DB-driven config.
  – Use the “iscsiadm” program to manipulate it
  – “rm -f /var/db/iscsi/*” to start fresh
  – 3 steps
    ● Add discovery address
    ● Log into target
    ● When done, log out of target

$ iscsiadm -m discovery --type sendtargets --portal examplehost
[cbb01c] 192.168.1.6:3260,1 iqn.2006-08.com.example:lab.exports
$ iscsiadm -m node --record cbb01c --login
$ iscsiadm -m node --record cbb01c --logout
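After a successful login, the kernel registers a new SCSI disk on the initiator node. A quick, hedged way to spot it (the device name you get will vary):

```shell
# List block devices known to the kernel; the iSCSI disk shows up
# as a new sdX entry after login.
cat /proc/partitions

# The kernel log also records the new SCSI device as it attaches.
dmesg | tail
```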
Shared Disk via SLES10
● Easiest option
  – No downloading; all packages included
  – Very simple setup using YaST2
    ● Simple to use, GUI configuration utility
    ● Text mode available
● Supported by Novell/SUSE
● OCFS2 also integrated with Linux-HA software
● Demo on Wednesday
  – Visit the Oracle booth for details
Shared Disk via AoE
● ATA over Ethernet
  – Very simple standard: 6 page spec!
  – Lightweight client
    ● Less CPU overhead than iSCSI
  – Very easy to set up: auto configuration via Ethernet broadcast
  – Not routable, no authentication
    ● Targets and clients must be on the same Ethernet network
● Disks addressed by “shelf” and “slot” numbers
AoE Target Configuration
● “Virtual Blade” (vblade) software available for Linux, FreeBSD
  – Very small, user space daemon
  – Buffered I/O against a device or file
    ● Useful only for prototyping
    ● O_DIRECT patches available
  – Stock performance is not very high
● Very simple command
  – vbladed <shelf> <slot> <ethn> <device>
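As a concrete sketch of the command above: export a file-backed “disk” as shelf 0, slot 1 on a dedicated interface. The backing file path, size, and interface name are assumptions for illustration:

```shell
# Create a 1GB sparse file to act as the exported disk.
dd if=/dev/zero of=/srv/aoe-disk.img bs=1M count=1 seek=1023

# Export it over eth1 as AoE shelf 0, slot 1; clients will see it
# appear automatically via Ethernet broadcast.
vbladed 0 1 eth1 /srv/aoe-disk.img
</imports>
```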
AoE Client Configuration
● Single kernel module load required
  – Automatically finds blades
  – Optional load time option, aoe_iflist
    ● List of interfaces to listen on
● aoetools package
  – Programs to get AoE status, bind interfaces, create devices, etc.
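On the client side, the module load plus aoetools is all it takes. A hedged sketch (interface name is an assumption; shelf/slot numbering matches the vblade example above):

```shell
# Load the aoe module, optionally restricted to the storage network.
modprobe aoe aoe_iflist=eth1

# Ask for a fresh broadcast discovery, then list what was found.
aoe-discover
aoe-stat          # e.g. "e0.1 ... up" -> device /dev/etherd/e0.1
```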
OCFS2
● 1.2 tree
  – Shipped with SLES9/SLES10
  – RPMs for other distributions available online
  – Builds against many kernels
  – Feature freeze, bug fixes only
● 1.3 tree
  – Active development tree
  – Included in the Linux kernel
  – Bug fixes and features go to -mm first
OCFS2 Tools
● Standard set of file system utilities
  – mkfs.ocfs2, mount.ocfs2, fsck.ocfs2, etc.
  – Cluster aware
  – o2cb to start/stop/configure the cluster
  – Work with both OCFS2 trees
● ocfs2console GUI configuration utility
  – Can create the entire cluster configuration
  – Can distribute the configuration to all nodes
● RPMs for non-SLES distributions available online
OCFS2 Configuration
● A major goal for OCFS2 was simple configuration
  – /etc/ocfs2/cluster.conf
    ● Single file, identical on all nodes
  – Only step before mounting is to start the cluster
    ● Can be configured to start at boot

$ /etc/init.d/o2cb online <cluster name>
Loading module "configfs": OK
Mounting configfs filesystem at /sys/kernel/config: OK
Loading module "ocfs2_nodemanager": OK
Loading module "ocfs2_dlm": OK
Loading module "ocfs2_dlmfs": OK
Mounting ocfs2_dlmfs filesystem at /dlm: OK
Starting O2CB cluster ocfs2: OK
Sample cluster.conf

node:
    ip_port = 7777
    ip_address = 192.168.1.7
    number = 0
    name = keevan
    cluster = ocfs2

node:
    ip_port = 7777
    ip_address = 192.168.1.2
    number = 1
    name = opaka
    cluster = ocfs2

cluster:
    node_count = 2
    name = ocfs2
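Putting the pieces together, first mount on a node looks roughly like this. The device name, label, and mount point are assumptions; format only once, from a single node:

```shell
# One-time, on ONE node: format the shared disk
# (here /dev/sdX stands in for the iSCSI or AoE device).
mkfs.ocfs2 -L cheapcluster /dev/sdX

# On EVERY node: start the cluster, then mount.
/etc/init.d/o2cb online ocfs2
mkdir -p /mnt/shared
mount -t ocfs2 /dev/sdX /mnt/shared
```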
OCFS2 Tuning - Heartbeat
● Default heartbeat timeout is tuned very low for our purposes
  – May result in node reboots on lower performance clusters
  – Timeout must be the same on all nodes
  – Increase the O2CB_HEARTBEAT_THRESHOLD value in /etc/sysconfig/o2cb
    ● The OCFS2 Tools 1.2.3 release will add this to the configuration script
    ● SLES10 users can use Linux-HA instead
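As a sketch, the change is a single line in the sysconfig file. The value 31 is only an example; pick one appropriate for your hardware and keep it identical on all nodes:

```
# /etc/sysconfig/o2cb (excerpt)
# Disk heartbeat threshold; the effective timeout is roughly
# (threshold - 1) * 2 seconds, so 31 gives about a minute.
O2CB_HEARTBEAT_THRESHOLD=31
```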
OCFS2 Tuning – mkfs.ocfs2
● OCFS2 uses cluster and block sizes
  – Clusters for data; range from 4K to 1M
    ● Use the -C <clustersize> option
  – Blocks for metadata; range from 512 bytes to 4K
    ● Use the -b <blocksize> option
● More metadata updates -> larger journal
  – -J size=<journalsize> to pick a different size
● mkfs.ocfs2 -T <filesystem-type>
  – -T mail for metadata-heavy workloads
  – -T datafiles for file systems with very large files
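A hypothetical invocation combining these options, e.g. for a metadata-heavy mail store. The device name and label are assumptions:

```shell
# Small blocks and clusters plus the "mail" type (which also sizes
# the journal for frequent metadata updates).
mkfs.ocfs2 -b 4K -C 4K -T mail -L mailstore /dev/sdX
```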
OCFS2 Tuning - Practices
● No indexed directories yet
  – Keep directory sizes small to medium
● Reduce resource contention
  – Read-only access is not a problem
  – Try to keep writes local to a node
    ● Each node has its own directory
    ● Each node has its own logfile
● Spread things out by using multiple file systems
  – Allows you to fine-tune mkfs options depending on each file system's target usage
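The per-node directory practice can be sketched in a few lines of shell. MOUNTPOINT stands in for wherever the OCFS2 volume is mounted (defaulting to a /tmp path here only so the sketch runs anywhere):

```shell
# Each node writes under a directory named after itself, so writes
# stay local and nodes rarely contend for the same cluster locks.
MOUNTPOINT=${MOUNTPOINT:-/tmp/ocfs2-demo}
NODE=$(hostname)

mkdir -p "$MOUNTPOINT/$NODE"
echo "started $(date)" >> "$MOUNTPOINT/$NODE/app.log"
```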
References
● http://oss.oracle.com/projects/ocfs2/
● http://oss.oracle.com/projects/ocfs2-tools/
● http://www.novell.com/linux/storage_foundation/
● http://iscsitarget.sf.net/
● http://www.open-iscsi.org/
● http://aoetools.sf.net/
● http://www.coraid.com/
● http://www.frys-electronics-ads.com/
● http://www.cdw.com/