
Page 1

BigInsights Tuning

Eric Yang, Jim Huang, Stewart Tate, Simon Harris, Michael Ahern, John Poelman

Page 2

Agenda

BigInsights Tuning
– Validating and tuning a new cluster
Performance Improvements with Adaptive Map/Reduce
Special Topic: "HBase tuning experience with NICE" by Bruce Brown
Tuning Big SQL: Results and lessons learned
Q & A
Topics covered later in the week
– More on tuning Big SQL (e.g. hints)
– GPFS performance

Page 3

Validating a new cluster

The performance of a new cluster should be validated and tuned
– Slow nodes will drag down the performance of the entire cluster

Things to validate:
– Network
– Storage
– BIOS
– CPU
– Memory

Page 4

Validating a new cluster (Network)

Network validation
– Ensure expected sustained bandwidth between cluster nodes
– There are many tools that can be used to validate the network
– One tool we frequently use is iperf (http://en.wikipedia.org/wiki/Iperf)
  • Test bandwidth between every pair of nodes
  • Example of how to invoke the iperf client:

    iperf --client <server hostname> --time 30 --interval 5 --parallel 1 --dualtest
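For the client command above to work, iperf must also be running in server mode on the target node (standard iperf usage, not shown on the original slide):

    iperf --server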

Page 5

Validating a new cluster (Storage)

Storage validation
– JBOD is the preferred storage setup. Avoid things like volume groups, logical volumes, etc.
– Each local device should be set up as JBOD or (if the storage controller does not support JBOD) as a RAID-0 single-device array.
– Check device settings
  • Are write caches enabled? Write caches are usually enabled, except for storage that needs high durability, such as storage for the NameNode
– Based on the type of file system (e.g. ext), check file system settings (block size, etc.)
  • Example (ext4): /sbin/dumpe4fs /dev/sdX
– Check mount options
  • Usually want to see that "atime" and "user extended attributes" are disabled
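A couple of quick ways to check these settings (standard Linux commands; the device name is a placeholder for your data disk):

    hdparm -W /dev/sdX            # show whether the drive write cache is enabled
    grep sdX /proc/mounts         # show the mount options currently in effect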

Page 6

Validating a new cluster (Storage)

Storage validation (continued)
– Use the Linux dd command to test read/write throughput of every device or file system that BigInsights will use
– (Optional) Test raw device read/write performance BEFORE creating the file system
  • Warning: Writing to a raw device will clobber existing data (if any)
– (Mandatory) Test read/write performance after creating and mounting a file system on the device
  • Test with and without Direct I/O
  • With Direct I/O

    dd if=/dev/zero oflag=direct of=<path>/ddfile.out bs=128k count=1024000
    dd if=<path>/ddfile.out iflag=direct of=/dev/null bs=128k count=1024000

  • Without Direct I/O

    dd if=/dev/zero of=<path>/ddfile.out bs=128k count=1024000
    dd if=<path>/ddfile.out of=/dev/null bs=128k count=1024000
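When testing without Direct I/O, the Linux page cache can make read numbers look unrealistically fast. A common practice (not from the original slides) is to flush caches before the read test:

    sync
    echo 3 > /proc/sys/vm/drop_caches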

Page 7

Validating a new cluster (BIOS/CPU/Memory)

In the BIOS, consider turning off power saving options

Validate CPU and memory performance
– "hdparm -T <device>" reads data from the Linux buffer cache as fast as possible, exercising CPU and memory rather than the disk
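One concrete way to act on both points (standard Linux paths and commands; the cpufreq path exists only when frequency scaling is active, and /dev/sdX is a placeholder):

    cat /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor   # expect "performance" once power saving is disabled
    hdparm -T /dev/sdX                                          # run on every node; a slow outlier points to a BIOS/memory issue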

Page 8

Tuning a new cluster

Cluster and BigInsights performance tuning is a broad topic
We will touch on select topics and settings in the slides that follow

Page 9

Tuning the network

How to check current settings
– ifconfig
– sysctl -a | grep net

ifconfig settings commonly tuned include
– MTU (Maximum Transmission Unit) -- Often set to a larger value like 9000.
– txqueuelen (Transmit Queue Length) -- A high value, e.g. 2000, is recommended for servers that perform large data transfers.

Other network settings we typically check/tune:
– Read/write memory buffers (e.g. net.ipv4.tcp_rmem, net.ipv4.tcp_wmem)

Big SQL users: it is a good idea to tune the "keep alive" and "tcp_fin_timeout" TCP/IP settings for improved stability (fewer timeouts and hangs)
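To make the buffer discussion concrete, entries of this kind can be added to /etc/sysctl.conf (the values below are purely illustrative, not recommendations from these slides; test against your own network and memory budget):

    net.ipv4.tcp_rmem = 4096 87380 16777216
    net.ipv4.tcp_wmem = 4096 65536 16777216
    net.ipv4.tcp_fin_timeout = 30
    net.ipv4.tcp_keepalive_time = 600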

There are more network settings than we can cover here

Page 10

Tuning I/O in the Linux Kernel

Here are examples of things to tune in the Linux kernel

For clusters mostly doing Map/Reduce and a lot of sequential I/O, consider:
– Using the "deadline" scheduler (instead of "completely fair")
  • Check: cat /sys/block/sdX/queue/scheduler
  • How to change: echo deadline > /sys/block/sdX/queue/scheduler
– Tuning the read-ahead buffer size
  • By default the Linux OS will read 128 KB of data in advance so that it is already in memory cache before the program needs it. This value can be increased to get better read performance.
  • Check: cat /sys/block/sdX/queue/read_ahead_kb
  • How to change: echo 3072 > /sys/block/sdX/queue/read_ahead_kb
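These sysfs settings do not survive a reboot; one way to reapply them to every data disk at boot (e.g. from /etc/rc.local) is a small loop like the sketch below, where the sdb..sdf device names are placeholders for your data disks:

    for dev in sdb sdc sdd sde sdf; do
      echo deadline > /sys/block/$dev/queue/scheduler
      echo 3072 > /sys/block/$dev/queue/read_ahead_kb
    done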

There are many more kernel settings not covered here

Page 11

Tuning the file system

Tune the file system based on your file system type
Examples of file system types: ext3, ext4, xfs
– Note: GPFS tuning will be covered in a later session

For ext4 file systems, consider settings like "dir_index" and "extent"
– mkfs.ext4 -O dir_index,extent /dev/sdX
  • dir_index: Use hashed b-trees to speed up lookups in large directories
  • extent: Instead of using the indirect block scheme for storing the location of data blocks in an inode, use extents. This is a much more efficient encoding which speeds up file system access, especially for large files.

Mount options
– Turn off atime
  • noatime: Access timestamps are not updated when a file is read
  • nodiratime: Turn off directory time stamps
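As an illustration (the device and mount point are placeholders), an /etc/fstab entry for a data disk with these mount options might look like:

    /dev/sdb1  /hadoop/disk1  ext4  defaults,noatime,nodiratime  0 0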

Page 12

Tuning Kernel Virtual Machine Memory

redhat_transparent_hugepage/enabled (RedHat 6.2 or later)
– In order to bring "huge page" performance benefits to legacy software, RedHat has implemented a custom Linux kernel extension called "redhat_transparent_hugepage".
– Recommendation: Turn off transparent huge pages since this functionality is still maturing.
  • This is a well-known tuning optimization in the Hadoop community
  • echo never > /sys/kernel/mm/redhat_transparent_hugepage/enabled

vm.swappiness
– This control defines how aggressively the kernel swaps memory pages. Higher values increase aggressiveness; lower values decrease the amount of swap. Default is 60.
– Recommend a lower value in the range 5 to 20 (test in increments of 5)

vm.min_free_kbytes (came up during Big SQL testing)
– min_free_kbytes changes the page reclaim thresholds. When this number is increased the system starts reclaiming memory earlier; when it is lowered the system starts reclaiming memory later.
– If you see "page allocation" failures in /var/log/messages, then chances are this is set too low.
– For Big SQL testing on machines with 128 GB of physical memory, we set this to 2621440 (2.5 GB)
– (If using GPFS) GPFS documentation recommends that vm.min_free_kbytes be set to 5 to 6% of the total physical memory
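Pulling the settings above together, one way to apply them looks like this (swappiness picked from the 5 to 20 range recommended above, min_free_kbytes from the Big SQL example; re-check both against your own memory size, and note the transparent-huge-page echo must be re-run after each reboot, e.g. from /etc/rc.local):

    sysctl -w vm.swappiness=10
    sysctl -w vm.min_free_kbytes=2621440
    echo never > /sys/kernel/mm/redhat_transparent_hugepage/enabled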

There are many more vm kernel settings you can explore

Page 13

Tuning HDFS

Give HDFS as many paths as possible, where each path is on a different device

Can be configured at install time
– Can also be changed later using parameter "dfs.data.dir" in hdfs-site.xml

In the installer, under "DataNode/TaskTracker"
– Installer default path: /hadoop/hdfs/data
– You almost always want to change this to one or more paths
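A sketch of what the resulting hdfs-site.xml entry might look like (the directory names are placeholders; use one directory per physical disk):

    <property>
      <name>dfs.data.dir</name>
      <value>/hadoop/disk1/hdfs/data,/hadoop/disk2/hdfs/data,/hadoop/disk3/hdfs/data</value>
    </property>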

Page 14

Tuning HDFS

Tune the underlying devices and file systems as discussed on previous slides

When running large Big SQL & Hive workloads the following settings have proved useful in improving the stability of a cluster, for both Apache M/R & Platform Symphony M/R:
– dfs.datanode.handler.count=40 (default is 10). The number of server threads for the DataNode. Usually about the same as the number of CPUs.
– dfs.datanode.max.xcievers=65536 (default is 8192). Maximum number of files that a DataNode will serve concurrently.
– dfs.datanode.socket.write.timeout=960000 (default is 480000). Units are milliseconds; 960000 is 16 minutes. Increase to avoid tasks failing due to write timeout errors.
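Expressed as hdfs-site.xml entries (same values as quoted above; restart the DataNodes for the changes to take effect):

    <property><name>dfs.datanode.handler.count</name><value>40</value></property>
    <property><name>dfs.datanode.max.xcievers</name><value>65536</value></property>
    <property><name>dfs.datanode.socket.write.timeout</name><value>960000</value></property>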

Page 15

Tuning Map/Reduce (Intermediate storage)

"Intermediate" data gets shuffled between map and reduce tasks
Intermediate data is kept on local storage as configured by the Hadoop parameter "mapred.local.dir"
The speed of this storage directly impacts M/R performance
For better performance, spread it across multiple devices
Configured at install time on the "File System" panel
– Referred to as "Cache directory"
– Can also be updated post-install in mapred-site.xml
In the installer, expand "MapReduce general settings"
– Default path: /hadoop/mapred/local
– Change this to one or more paths, where each path corresponds to a different device
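A sketch of the corresponding mapred-site.xml entry (directory names are placeholders; these directories can share disks with dfs.data.dir, as the next slide illustrates):

    <property>
      <name>mapred.local.dir</name>
      <value>/hadoop/disk1/mapred/local,/hadoop/disk2/mapred/local,/hadoop/disk3/mapred/local</value>
    </property>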

Page 16

Tuning HDFS and Intermediate storage

[Diagram: DataNodes 1 through N, each with local disks shared between HDFS (dfs.data.dir), intermediate storage (mapred.local.dir), and logs]

Intermediate storage (mapred.local.dir) can share the same disks as HDFS (dfs.data.dir)

Page 17

Tuning Map/Reduce

We only have time to cover a few M/R settings in this presentation
– Number of slots
– JVM heap sizes

Backup slides cover more topics
– Number of reduce tasks
– Compression
– Sorting
– Block and split sizes
– Making clusters rack aware

Page 18

Tuning Map/Reduce (Number of slots)

Used to tune the degree of Map/Reduce concurrency on the cluster
Defines the maximum number of concurrently occupied slots on a node
– Too high and the cluster becomes unstable; too low and you are wasting machine resources
– Tuned in association with the map and reduce task JVM size (mapred.child.java.opts)
– On large machines, particularly those with many CPU cores and/or hyper-threading enabled, the default values may be too large
  • Example: Simon tested Big SQL on Power machines that have 16 cores and SMT=4, resulting in 64 virtual CPUs. By default, BigInsights configured 64 map slots and 32 reduce slots (96 slots in all). This was too many and resulted in high context switching and cluster instability. After tuning, Simon found 24 map slots and 12 reduce slots worked much better.

Setting / Default value:
– mapred.tasktracker.map.tasks.maximum: Formula <%= Math.min(Math.ceil(numOfCores * 1.0), Math.ceil(maxPartition*0.66*totalMem/1000)) %>
– mapred.tasktracker.reduce.tasks.maximum: Formula <%= Math.min(Math.ceil(numOfCores * 0.5), Math.ceil(maxPartition*0.33*totalMem/1000)) %>

Number of map and reduce slots on each TaskTracker
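For instance, pinning the tuned values from the Power example above would look like this in mapred-site.xml (a sketch using the slot counts quoted on this slide, not a universal recommendation):

    <property><name>mapred.tasktracker.map.tasks.maximum</name><value>24</value></property>
    <property><name>mapred.tasktracker.reduce.tasks.maximum</name><value>12</value></property>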

Page 19

Tuning Map/Reduce (Number of slots - Continued)

Additional notes about these two Hadoop parameters:

1. The formulas for “mapred.tasktracker.map.tasks.maximum” and “mapred.tasktracker.reduce.tasks.maximum” are evaluated on each TaskTracker and thus can vary by node

2. If changed, be sure to restart all TaskTrackers

3. Since these are TaskTracker settings, overriding them on individual Hadoop jobs has no effect

Page 20

Tuning Map/Reduce (Task JVM heap size)

Tune the amount of memory assigned to each map and reduce task
– The -Xmx property defines the maximum Java heap size
– A conservative setting would be (60% of real memory) / (mapred.tasktracker.map.tasks.maximum + mapred.tasktracker.reduce.tasks.maximum)
– An aggressive setting would be (80% of real memory) / (mapred.tasktracker.map.tasks.maximum + mapred.tasktracker.reduce.tasks.maximum)
– Often tuned in association with mapred.tasktracker.map.tasks.maximum & mapred.tasktracker.reduce.tasks.maximum
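As a rough worked example (combining the 128 GB nodes mentioned for Big SQL testing with the 24 map / 12 reduce slots from the earlier Power example; the combination is illustrative, not a recommendation from these slides):

    conservative heap ≈ (0.60 × 128 GB) / (24 + 12) ≈ 2.1 GB per task, i.e. roughly -Xmx2048m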

Setting / Default value:
– mapred.child.java.opts: -Xmx1000m (new default value in BI 2.1); BI 2.0 default: -Xmx600m -Xshareclasses
– mapred.map.child.java.opts: None. If set, overrides "mapred.child.java.opts" for map tasks.
– mapred.reduce.child.java.opts: None. If set, overrides "mapred.child.java.opts" for reduce tasks.

Individual Hadoop jobs can run with custom JVM arguments

Page 21

Making tuning changes persist after reboot

Kernel settings:
– sysctl -w <parameter>=<value>, or
  • Always confirm the setting was actually changed
– Add to /etc/sysctl.conf
  • vm.swappiness = 5

ifconfig settings
– Edit /etc/sysconfig/network-scripts/ifcfg-ethX, change lines like:
  • MTU="9000"
– Edit /etc/rc.local, add lines like:
  • ifconfig ethX txqueuelen 2000
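For example (standard sysctl usage, not specific to these slides):

    sysctl -w vm.swappiness=5    # apply now
    sysctl vm.swappiness         # confirm the value actually in effect
    sysctl -p                    # re-apply everything listed in /etc/sysctl.conf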

Page 22

Tuning Map/Reduce (Number of reduce tasks)

Tune the number of reduce tasks for your Hadoop jobs
– The default number of reduce tasks is 1, which is OK only for very small jobs
– Recommendation is to override "mapred.reduce.tasks" with a value greater than 1 for most jobs
– If the number of reduce tasks is too small, then each reduce task will take a long time and reduce time will dominate overall job time
– If too many, then reduce tasks will tie up slots and hold back other jobs

Setting / Default value:
– mapred.reduce.tasks: 1 (not set in mapred-site.xml)

A Hadoop job by default will only have one reduce task, which is usually not what you want
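For a job that uses Hadoop's GenericOptionsParser/ToolRunner, the value can be overridden on the command line (the jar, class, and paths below are placeholders):

    hadoop jar myjob.jar MyJobClass -D mapred.reduce.tasks=32 <input dir> <output dir>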

Page 23

Tuning Map/Reduce (Compression)

Consider the benefits and drawbacks of compression
– Tradeoff between storage space and CPU time
– Compressing map output can speed up the shuffle phase

Setting / Default value:
– mapred.map.output.compression.codec: The compression codec to use (a Java class). Must be listed in parameter "io.compression.codecs". Example: com.ibm.biginsights.compress.CmxCodec
– mapred.output.compress: Boolean. Default is false. Whether or not to compress final output from a Hadoop job.
– mapred.compress.map.output: Boolean. Default is false. Whether or not to compress intermediate data shuffled from map to reduce tasks.
– mapred.output.compression.type: Default is RECORD. How to compress sequence files. Other values are NONE and BLOCK.
– io.seqfile.compress.blocksize: Default is 1 million (bytes). Block size when compressing a sequence file and mapred.output.compression.type=BLOCK.

Use compression to save space and in some cases improve performance
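As a sketch, enabling map-output compression with the codec named above would look like this in mapred-site.xml (verify the codec is listed in io.compression.codecs first):

    <property><name>mapred.compress.map.output</name><value>true</value></property>
    <property><name>mapred.map.output.compression.codec</name><value>com.ibm.biginsights.compress.CmxCodec</value></property>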

Page 24

Tuning Map/Reduce (Sorting)

Tune the amount of memory and concurrency to use for sorting using io.sort.mb and io.sort.factor
– io.sort.mb is the size, in megabytes, of the memory buffer to use while sorting map output
– io.sort.factor is the maximum number of streams to merge at once when sorting files
– Should be tuned in conjunction with mapred.child.java.opts=-Xmx
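Note on the -Xmx relationship (our gloss, not from the slides): the io.sort.mb buffer is allocated inside the task's Java heap, so with the BI 2.1 default of -Xmx1000m, io.sort.mb=256 leaves roughly 744 MB of heap for the rest of the task.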

Setting / Default value:
– io.sort.mb: 256
– io.sort.factor: 10 (new default value in BI 2.1); BI 2.0 default: 64

Hadoop jobs can run with custom sort parameters

Page 25

Tuning Map/Reduce (Block Size/Split Size)

Choose an appropriate block size for your files using dfs.block.size
– The size of one block of data on HDFS
– Block size is a property of each file on HDFS
– Usually defines the size of input data (split) passed to a map task for processing
– Split sizes can be further tuned using "mapred.min.split.size" and "mapred.max.split.size"
– If the split size is too small, then you incur the overhead of creating/destroying many small tasks
– If the split size is too large, then concurrency of the cluster is reduced
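For a rough sense of scale: with the 128 MB default block size below, a 10 GB input file is divided into about 80 blocks, so a job reading it will normally run about 80 map tasks (one per split, assuming default split sizing).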

Setting / Default value:
– dfs.block.size: 128 MB
– mapred.min.split.size: 0
– mapred.max.split.size: No limit

Block size is an important characteristic of your data on HDFS

Page 26

Tuning Map/Reduce

Additional Hadoop parameters (defaults in mapred-site.xml) that are often overridden for individual Hadoop jobs:

Setting / Default value / Notes:
– mapred.reduce.slowstart.completed.maps: 0.5. The fraction of the maps in the job which should be complete before reduces are scheduled for the job. Starting reducers earlier can help some jobs finish faster, but the reducers tie up slots.

Page 27

Tuning Map/Reduce (General Settings)

Rack awareness

Setting / Default value / Notes:
– topology.script.file.name: No script by default. Add to core-site.xml. Use to make Hadoop rack-aware. Enable rack-awareness for large clusters with several racks, when all racks have roughly the same number of nodes.

Warning: It may hurt performance to enable rack awareness for small clusters with an uneven number of nodes in each rack. Racks with fewer nodes tend to be busier.
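A minimal sketch of what such a topology script can look like (the IP-to-rack mapping is invented for illustration; Hadoop invokes the script with node addresses and expects one rack name per argument):

    #!/bin/bash
    for node in "$@"; do
      case "$node" in
        192.168.1.*) echo -n "/rack1 " ;;
        192.168.2.*) echo -n "/rack2 " ;;
        *)           echo -n "/default-rack " ;;
      esac
    done
    echo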

Page 28

Tuning Adaptive M/R – Number of slots

The number of slots on each compute node has a huge impact on performance
– Cluster is under-utilized if there are too few slots
– Cluster can become over-committed if there are too many slots
– The number of slots needs to be configured based on:
  • Compute node hardware (CPU, memory, number of disks, etc.)
  • Workload characteristics (memory, CPU, or storage intensive)

Default number of slots:
– Adaptive M/R: Equal to the number of CPUs
  • Single slot pool for map and reduce tasks
– Apache M/R: Separate pools for map and reduce slots, with a default ratio of 2 map slots for every reduce slot

Page 29

Tuning Adaptive M/R – Number of slots

How to change the number of slots (Adaptive M/R)
– Command line
  • Example: Configure 16 slots per compute node
    Edit $BIGINSIGHTS_HOME/hdm/components/HAManager/conf/ResourceGroups.xml and change two lines:
      <ResourceGroup ResourceGroupName="ManagementHosts" availableSlots="16">
      <ResourceGroup ResourceGroupName="ComputeHosts" availableSlots="16">
    Run $BIGINSIGHTS_HOME/bin/syncconf.sh HAManager
    Check that the changes appear in $EGO_CONFDIR/ResourceGroups.xml
– Web UI (not officially externalized in BigInsights)
  • Open http://<master node>:18080/Platform and bring up the Map/Reduce UI
  • Go to Resources -> Resource Groups -> ComputeHosts

Page 30

Tuning Apache M/R – Number of slots

How to change the number of slots (Apache M/R)
– Edit $BIGINSIGHTS_HOME/hdm/hadoop-conf-staging/mapred-site.xml
– Change mapred.tasktracker.map.tasks.maximum and mapred.tasktracker.reduce.tasks.maximum
– Run $BIGINSIGHTS_HOME/bin/syncconf.sh hadoop

Page 31

Adaptive M/R: Tuning Map and Reduce tasks JVM size

Adaptive M/R honors the property "mapred.child.java.opts"
– Default in mapred-site.xml
  • Out-of-box, mapred.child.java.opts = -Xmx1000m
– Hadoop jobs often override the "mapred.child.java.opts" value

Page 32

Tuning Adaptive M/R

Since there is no distinction between map slots and reduce slots in Adaptive M/R, special consideration should be taken
– Never oversubscribe your system memory (number of slots * memory per slot)
– mapred.reduce.slowstart.completed.maps

io.sort.mb (default 100) / io.sort.factor (default 10)
– io.sort.mb sets the buffer size for map-side sorting. Increasing it when map output is large should decrease map-side I/O. Recommend 256 as a default, or higher depending on block size.
– Increasing io.sort.factor should decrease the number of spills to disk and reduce sort/shuffle time; the penalty is extra GC. Recommend 64 as a default.

Compression: mapred.map.output.compression.codec, mapred.compress.map.output
– Enabling compression of map output decreases disk and network activity
– On most workloads, when CPU is available, compression will improve overall performance, in some environments dramatically
– Out of the box, com.ibm.biginsights.compress.CmxCodec provides a good compression ratio without too much CPU overhead
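As a quick oversubscription check using numbers that appear elsewhere in this deck (16 slots per compute node and roughly 2 GB of heap per task; the pairing is illustrative): 16 × 2 GB = 32 GB of task heap, which must fit in physical memory with room left over for the DataNode, the TaskTracker/Symphony daemons, and the OS.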

Page 33

Tuning Adaptive M/R

mapred.reduce.input.buffer.percent (default 0.00)
– Sets a percentage of memory relative to heap size
– With the default, all in-memory segments are persisted to disk and then read back when the user reduce function is called
– Increasing it to 0.96 would allow reducers to use up to 96% of available memory to keep segments in memory while calling the reduce function; consider this when map output is large, such as Terasort
– Conversely, a TPC-H workload does not benefit because its map outputs are smaller

mapred.job.shuffle.merge.percent
– % of memory allocated to storing in-memory map outputs

mapred.job.shuffle.input.buffer.percent
– % of memory allocated from the max heap for storing map outputs during shuffle

mapred.local.dir / dfs.data.dir
– The more the merrier!
– On a dedicated cluster, using all disks (X) on a compute host should yield better results than using X/2 disks for mapred.local.dir and the other X/2 for dfs.data.dir