transtec & BeeGFS Whitepaper
OVERVIEW
> transtec AG and BeeGFS
For more than 35 years, transtec has been a vendor and service provider on the German
market, and is now also represented in a further seven European countries. Over the years,
the direct-selling operation in Germany has grown into an international business that does
much more than manufacture, sell and distribute hardware. Today, transtec operates
highly successfully in the fields of High Performance Computing, Storage, Virtualization and
Consolidation, and as such has developed into a pan-European solution provider.
BeeGFS is a parallel cluster file system developed and maintained by ThinkParQ. When
developing this parallel file system, ThinkParQ focused on performance, easy installation
and simple management: things that many of our customers ask for. We therefore decided to
create this paper to give customers and other interested readers an overview of a possible
configuration of a BeeGFS solution and the performance it can deliver.
transtec is a registered gold partner of ThinkParQ and installed the first petabyte BeeGFS
system worldwide.
> About
This document presents a summary of the results of a series of benchmarks. The goal was to
measure the data streaming and I/O operations throughput of a possible turn-key solution
from transtec AG.
The benchmarks were executed on transtec hardware. For the BeeGFS storage and metadata
services, we used two transtec CALLEO Application Server 4280H systems. As compute nodes, a
CALLEO High-Performance Server 2880 was used; this is a blade system that holds four
server blades. The BeeGFS management service was running on a central server that is also
used as the cluster manager.
CONTACTS
transtec AG
www.transtec.de | [email protected]
Tel. +49 (0)7121 2678 - 400
ThinkParQ BeeGFS
www.thinkparq.com | www.beegfs.com
SYSTEM DESCRIPTION
> Storage and Metadata Server System configuration
The transtec CALLEO Application Server 4280H
Link: shop.transtec.de/SA4280A336R-calleo-application-server-4280h
Each of the two servers contains the following components:
2x Intel(R) Xeon(R) CPU E5-2667 V3 @ 3.20 GHz
64 GB RAM @ 2133 MHz
2x Seagate 1200 SAS SSD, 200GB (ST200FM0053)
> Configured as RAID 1 for metadata
34x Seagate Enterprise Capacity SAS HDD, 2TB (ST2000NM0034)
> Configured as 3x RAID 6 with 11 disks per RAID volume
> 1 disk configured as global hot spare
AVAGO MegaRAID SAS 9361-8i
Mellanox FDR ConnectX-3 MT27500 InfiniBand controller
CentOS 7.2
Mellanox OFED stack (from CentOS 7.2 repo)
BeeGFS Storage and Metadata Service version 2015.03-r11
> Compute and Client Nodes configuration
The transtec CALLEO High-Performance Server 2880
Link: shop.transtec.de/calleo-high-performance-server-2880
Each of the four compute nodes contains the following components:
2x Intel(R) Xeon(R) CPU E5-2680 v4 @ 2.40GHz
128 GB RAM @ 2400 MHz
Mellanox FDR ConnectX-3 MT27500 InfiniBand controller
Stateless CentOS 7.1
Mellanox OFED stack (from CentOS 7.1 repo)
BeeGFS Client Service version 2015.03-r11
> Benchmarks
The benchmark tools used in the experiment are listed below:
IOR-3.0.1: for measuring the sustained throughput of the BeeGFS storage service.
mdtest-1.9.4: for measuring the performance of the BeeGFS metadata service.
IOR and mdtest have been compiled with openmpi/1.10.0-gcc4.9.2.
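For reference, a minimal sketch of how the two tools can be built against the Open MPI installation named above (the module name and source paths are assumptions and will differ per site):
# load the MPI environment used for the benchmarks (module name is an assumption)
module load openmpi/1.10.0-gcc4.9.2
# IOR 3.0.1 is built via its configure script; the POSIX backend is included by default
cd ior-3.0.1
./configure && make
# mdtest 1.9.4 is a single C source file, compiled directly with the MPI compiler wrapper
cd ../mdtest-1.9.4
mpicc -O2 -o mdtest mdtest.c -lm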
> Topology map
Schema 2: Topology map (BeeGFS management server, two BeeGFS meta and storage servers and 4x compute nodes, interconnected via an InfiniBand switch and an Ethernet switch)
CONFIGURATION AND TUNING
> BeeGFS storage server
Formatting Options Value
RAID stripe size per disk (KB) 256
RAID level 6
Disks per RAID volume 9+2
Disk array local Linux file system XFS
Partition Alignment GPT
XFS Mount Options (/etc/fstab) Value
Last File and Directory Access noatime, nodiratime
Log buffer tuning logbufs=8, logbsize=256k
Streaming performance optimization largeio, inode64, swalloc
Streaming write throughput allocsize=131072k
Write Barriers nobarrier
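For illustration, a sketch of how such an XFS file system could be created and mounted (device name and mount point are assumptions; su and sw reflect the 256 KB stripe size and the 9 data disks of a 9+2 RAID 6 volume):
# align XFS to the RAID geometry: 256 KB stripe unit, 9 data disks per volume
mkfs.xfs -d su=256k,sw=9 /dev/sdb
# example /etc/fstab entry combining the mount options from the table above
/dev/sdb  /data/storage01  xfs  noatime,nodiratime,logbufs=8,logbsize=256k,largeio,inode64,swalloc,allocsize=131072k,nobarrier  0 0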
IO Scheduler Options (/sys/block/<dev>/queue/) Value
Scheduler deadline
nr_requests 4096
read_ahead_kb 32768
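Applied at runtime, these settings could look as follows (the device names are assumptions; written this way they do not survive a reboot, so a udev rule or init script is typically used to make them persistent):
# one entry per RAID 6 volume (device names are assumptions)
for dev in sdb sdc sdd; do
    echo deadline > /sys/block/${dev}/queue/scheduler
    echo 4096 > /sys/block/${dev}/queue/nr_requests
    echo 32768 > /sys/block/${dev}/queue/read_ahead_kb
done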
Virtual Memory Settings (/proc/sys/vm/) Value
dirty_background_ratio 5
dirty_ratio 10
vfs_cache_pressure 50
min_free_kbytes 262144
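The same values expressed as sysctl calls (adding them to /etc/sysctl.conf makes them persistent):
# limit the amount of dirty pages and keep a reserve of free memory
sysctl -w vm.dirty_background_ratio=5
sysctl -w vm.dirty_ratio=10
sysctl -w vm.vfs_cache_pressure=50
sysctl -w vm.min_free_kbytes=262144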
CPU Frequency Settings Value
/sys/devices/system/cpu/cpu*/cpufreq/scaling_governor performance
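Set for all cores, this reads:
# pin every core to the performance governor (not persistent across reboots)
for gov in /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor; do
    echo performance > ${gov}
done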
BeeGFS Storage Service (/etc/beegfs/beegfs-storage.conf) Value
connMaxInternodeNum 40
tuneFileReadAheadSize 32m
tuneFileReadAheadTriggerSize 2m
tuneFileReadSize 512K
tuneFileWriteSize 512K
tuneNumWorkers 18
tuneWorkerBufSize 16m
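Expressed as an excerpt of /etc/beegfs/beegfs-storage.conf, the tuning above reads as follows (sketch; all other options keep their defaults):
connMaxInternodeNum          = 40
tuneFileReadAheadSize        = 32m
tuneFileReadAheadTriggerSize = 2m
tuneFileReadSize             = 512k
tuneFileWriteSize            = 512k
tuneNumWorkers               = 18
tuneWorkerBufSize            = 16m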
> BeeGFS meta server
Format Options Value
RAID Level 1
SSDs per RAID volume 2
Disk array local Linux file system ext4
Minimize access times for large directories -Odir_index
Large inodes -I 512
Number of inodes -i 2048
Large journal -J size=400
Extended attributes user_xattr
Partition Alignment GPT
EXT4 Mount Options (/etc/fstab) Value
Last File and Directory Access noatime, nodiratime
Write Barriers nobarrier
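A sketch of how the metadata file system could be created and mounted with these options (device name and mount point are assumptions):
# ext4 with directory indexing, 512-byte inodes, one inode per 2 KB and a 400 MB journal
mkfs.ext4 -O dir_index -I 512 -i 2048 -J size=400 /dev/md0
# example /etc/fstab entry; user_xattr enables the extended attributes used by the BeeGFS metadata service
/dev/md0  /data/meta01  ext4  noatime,nodiratime,nobarrier,user_xattr  0 0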
IO Scheduler Options (/sys/block/<dev>/queue/) Value
Scheduler deadline
nr_requests 128
read_ahead_kb 128
BeeGFS Meta Service (/etc/beegfs/beegfs-meta.conf) Value
connMaxInternodeNum 32
tuneNumWorkers 0
tuneTargetChooser randomrobin
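As an excerpt of /etc/beegfs/beegfs-meta.conf (sketch; tuneNumWorkers = 0 lets the service pick the number of worker threads automatically):
connMaxInternodeNum = 32
tuneNumWorkers      = 0
tuneTargetChooser   = randomrobin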
> BeeGFS client service
BeeGFS Client Service (/etc/beegfs/beegfs-client.conf) Value
connMaxInternodeNum 64
tuneRemoteFSync false
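The corresponding excerpt of /etc/beegfs/beegfs-client.conf (sketch):
connMaxInternodeNum = 64
tuneRemoteFSync     = false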
Parallel Network Requests Option Value
Chunk size 2M
IOR File-per-process: targets per file 3
IOR Single-shared-file: targets per file 6
> BeeGFS striping settings
Command-Line:
beegfs-ctl --setpattern --chunksize=<Chunk size> --numtargets=<targets per file> <path>
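With the values from the tables above, the concrete calls for the two IOR scenarios would be (sketch; /scratch/bench is the benchmark directory used in the commands at the end of this document):
# file-per-process runs: 2 MiB chunks, 3 storage targets per file
beegfs-ctl --setpattern --chunksize=2m --numtargets=3 /scratch/bench
# single-shared-file runs: 2 MiB chunks, 6 storage targets per file
beegfs-ctl --setpattern --chunksize=2m --numtargets=6 /scratch/bench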
RESULTS
> Benchmark Parameters
IOR (file-per-process):
8 benchmarks from 2^0 to 2^7 processes, 5 repeats each
Each process writes and reads its own file
Files of variable size (960 GiB / number of processes)
Total size of data written/read = 960 GiB
Transfer size = 2 MiB
IOR (single-shared-file):
6 benchmarks from 2^1 to 2^6 processes, 5 repeats each
All processes write and read the same shared file
Total size of data written/read = 640 GiB
Transfer size = 2 MiB
mdtest:
8 benchmarks from 2^0 to 2^7 processes, 5 repeats each
Total files created and statted: 1 million
MPI:
MPI was configured to split the running processes evenly between all 4 compute nodes. The
natural exceptions are the benchmarks running with 1 or 2 processes; in those cases, only 1
process was running per compute node.
> Results IOR (file-per-process)
Chart 1: Write throughput (file-per-process), WRITE IOR; min, mean and max in MiB/s for 1 to 128 processes
Chart 2: Read throughput (file-per-process), READ IOR; min, mean and max in MiB/s for 1 to 128 processes
Processes   Write Max   Write Min   Write Mean   Read Max   Read Min   Read Mean
1           1887.0      1878.6      1882.3       2204.3     1925.9     2080.6
2           3549.4      3544.6      3546.3       4122.2     3821.6     4010.2
4           5690.8      5651.0      5678.9       5230.4     4943.5     5089.0
8           5660.3      5558.3      5589.3       5155.5     4850.3     4983.6
16          5539.9      5503.9      5520.5       5733.7     5383.4     5565.5
32          5286.9      5242.2      5261.4       5860.6     5823.1     5844.8
64          4757.0      4727.2      4738.1       5975.8     5824.1     5901.1
128         4040.2      4004.1      4026.2       5915.2     5808.2     5860.6
Table 1: Read and write throughputs (MiB/s), file-per-process benchmark
> Results IOR (single-shared-file)
Chart 3: Write throughput (single-shared-file), WRITE IOR; min, mean and max in MiB/s for 2 to 64 processes
Chart 4: Read throughput (single-shared-file), READ IOR; min, mean and max in MiB/s for 2 to 64 processes
Processes   Write Max   Write Min   Write Mean   Read Max   Read Min   Read Mean
2           2097.1      2042.1      2059.6       3957.2     3652.5     3846.4
4           3835.7      3763.4      3798.9       5021.3     4556.6     4769.3
8           3960.7      3830.7      3889.1       4212.1     3762.3     3890.8
16          4024.3      3890.6      3950.0       4871.2     3882.6     4401.0
32          4039.6      3896.1      3992.1       4937.9     3647.2     4240.8
64          4235.1      3995.3      4154.3       4332.5     3412.3     3957.4
Table 2: Read and write throughputs (MiB/s), single-shared-file benchmark
> Results mdtest
Chart 5: File creation throughput; min, mean and max file creations per second for 1 to 128 processes
Chart 6: File stat throughput; min, mean and max file stat operations per second for 1 to 128 processes
Processes   Create Max   Create Min   Create Mean   Stat Max   Stat Min   Stat Mean
1           7290         5718         6731          28915      21608      26424
2           14735        13547        13887         59287      49436      52923
4           31131        28701        29944         118461     110927     115570
8           53691        51219        52312         206813     199581     203559
16          86437        79607        84207         340804     329923     336653
32          93855        90888        93027         358610     347062     354744
64          97402        93825        95638         292356     252831     274747
128         86754        78000        82665         187561     183149     186103
Table 3: File creation and stat operations per second, mdtest benchmark
COMMANDS
This section shows the commands that were used on the compute nodes to run the streaming and metadata benchmarks.
> IOR file-per-process
# file-per-process: run IOR with N_PROCS = 1, 2, 4, ..., 128 and 5 repeats each
for (( x=0; x <= 7; x++ ))
do
N_PROCS=$((2**$x))
mpirun -npernode ${N_PROCS} --bind-to none -- ./src/ior -a POSIX -i 5 -g -C -d 10 \
-w -r -e -t 2m -F -b $((960/$N_PROCS))g -o /scratch/bench/ior | tee \
./outputs/ior_bench_fpp_${N_PROCS}.log
done
> IOR single-shared-file
# single-shared-file: run IOR with N_PROCS = 2, 4, ..., 64 and 5 repeats each
for (( x=1; x <= 6; x++ ))
do
N_PROCS=$((2**$x))
mpirun -npernode ${N_PROCS} --bind-to none -- ./src/ior -a POSIX -i 5 -g -C -d 10 \
-w -r -e -t 2m -b $((640/$N_PROCS))g -o /scratch/bench/ior | tee \
./outputs/ior_bench_ssh_${N_PROCS}.log
done
> Mdtest
# scale the number of files per leaf directory so that ~1 million files are created in total
for (( x=0; x <= 7; x++ ))
do
N_PROCS=$((2**$x))
filesperdir=$((1000000/64/$N_PROCS))
mpirun -np ${N_PROCS} ./mdtest -C -T -d /scratch/mdtest -i 5 -I ${filesperdir} -z \
2 -b 8 -L -u -F
done