
Page 1

BigInsights Tuning

Eric Yang, Jim Huang, Stewart Tate, Simon Harris, Michael Ahern, John Poelman

Page 2

Agenda

BigInsights Tuning
– Validating and tuning a new cluster
Performance Improvements with Adaptive Map/Reduce
Special Topic: "HBase tuning experience with NICE" by Bruce Brown
Tuning Big SQL: Results and lessons learned
Q & A
Topics covered later in the week
– More on tuning Big SQL (e.g. hints)
– GPFS performance

Page 3

Validating a new cluster

The performance of a new cluster should be validated and tuned
– Slow nodes will drag down the performance of the entire cluster

Things to validate:
– Network
– Storage
– BIOS
– CPU
– Memory

Page 4

Validating a new cluster (Network)

Network validation
– Ensure expected sustained bandwidth between cluster nodes
– There are many tools that can be used to validate the network
– One tool we frequently use is iperf (http://en.wikipedia.org/wiki/Iperf)
  • Test bandwidth between every pair of nodes
  • Example of how to invoke the iperf client:

    iperf --client <server hostname> --time 30 --interval 5 --parallel 1 --dualtest
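For the client command above to work, iperf must also be running in server mode on the target node (standard iperf usage, not shown on the original slide):

    iperf --server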

Page 5

Validating a new cluster (Storage)

Storage validation
– JBOD is the preferred storage setup. Avoid things like volume groups, logical volumes, etc.
– Each local device should be set up as JBOD or (if the storage controller does not support JBOD) as a RAID-0 single-device array.
– Check device settings
  • Are write caches enabled? Write caches are usually enabled, except for storage that needs high durability, such as storage for the NameNode
– Based on the type of file system (e.g. ext), check file system settings (block size, etc.)
  • Example (ext4): /sbin/dumpe4fs /dev/sdX
– Check mount options
  • Usually want to see that "atime" and "user extended attributes" are disabled
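A couple of quick ways to check these settings (standard Linux commands; the device name is a placeholder for your data disk):

    hdparm -W /dev/sdX            # show whether the drive write cache is enabled
    grep sdX /proc/mounts         # show the mount options currently in effect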

Page 6

Validating a new cluster (Storage)

Storage validation (continued)
– Use the Linux dd command to test read/write throughput of every device or file system that BigInsights will use
– (Optional) Test raw device read/write performance BEFORE creating the file system
  • Warning: Writing to a raw device will clobber existing data (if any)
– (Mandatory) Test read/write performance after creating and mounting a file system on the device
  • Test with and without Direct I/O
  • With Direct I/O

    dd if=/dev/zero oflag=direct of=<path>/ddfile.out bs=128k count=1024000
    dd if=<path>/ddfile.out iflag=direct of=/dev/null bs=128k count=1024000

  • Without Direct I/O

    dd if=/dev/zero of=<path>/ddfile.out bs=128k count=1024000
    dd if=<path>/ddfile.out of=/dev/null bs=128k count=1024000
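When testing without Direct I/O, the Linux page cache can make read numbers look unrealistically fast. A common practice (not from the original slides) is to flush caches before the read test:

    sync
    echo 3 > /proc/sys/vm/drop_caches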

Page 7

Validating a new cluster (BIOS/CPU/Memory)

In the BIOS, consider turning off power saving options

Validate CPU and memory performance
– "hdparm -T <device>" reads data from the Linux buffer cache as fast as possible, exercising CPU and memory rather than the disk
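One concrete way to act on both points (standard Linux paths and commands; the cpufreq path exists only when frequency scaling is active, and /dev/sdX is a placeholder):

    cat /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor   # expect "performance" once power saving is disabled
    hdparm -T /dev/sdX                                          # run on every node; a slow outlier points to a BIOS/memory issue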

Page 8

Tuning a new cluster

Cluster and BigInsights performance tuning is a broad topic
We will touch on select topics and settings in the slides that follow

Page 9

Tuning the network

How to check current settings
– ifconfig
– sysctl -a | grep net

ifconfig settings commonly tuned include
– MTU (Maximum Transmission Unit) -- Often set to a larger value like 9000.
– txqueuelen (Transmit Queue Length) -- A high value, e.g. 2000, is recommended for servers that perform large data transfers.

Other network settings we typically check/tune:
– Read/write memory buffers (e.g. net.ipv4.tcp_rmem, net.ipv4.tcp_wmem)

Big SQL users: it is a good idea to tune the "keep alive" and "tcp_fin_timeout" TCP/IP settings for improved stability (fewer timeouts and hangs)
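To make the buffer discussion concrete, entries of this kind can be added to /etc/sysctl.conf (the values below are purely illustrative, not recommendations from these slides; test against your own network and memory budget):

    net.ipv4.tcp_rmem = 4096 87380 16777216
    net.ipv4.tcp_wmem = 4096 65536 16777216
    net.ipv4.tcp_fin_timeout = 30
    net.ipv4.tcp_keepalive_time = 600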

There are more network settings than we can cover here

Page 10

Tuning I/O in the Linux Kernel

Here are examples of things to tune in the Linux kernel

For clusters mostly doing Map/Reduce and a lot of sequential I/O, consider:
– Using the "deadline" scheduler (instead of "completely fair")
  • Check: cat /sys/block/sdX/queue/scheduler
  • How to change: echo deadline > /sys/block/sdX/queue/scheduler
– Tuning the read-ahead buffer size
  • By default the Linux OS will read 128 KB of data in advance so that it is already in memory cache before the program needs it. This value can be increased to get better read performance.
  • Check: cat /sys/block/sdX/queue/read_ahead_kb
  • How to change: echo 3072 > /sys/block/sdX/queue/read_ahead_kb
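These sysfs settings do not survive a reboot; one way to reapply them to every data disk at boot (e.g. from /etc/rc.local) is a small loop like the sketch below, where the sdb..sdf device names are placeholders for your data disks:

    for dev in sdb sdc sdd sde sdf; do
      echo deadline > /sys/block/$dev/queue/scheduler
      echo 3072 > /sys/block/$dev/queue/read_ahead_kb
    done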

There are many more kernel settings not covered here

Page 11

Tuning the file system

Tune the file system based on your file system type
Examples of file system types: ext3, ext4, xfs
– Note: GPFS tuning will be covered in a later session

For ext4 file systems, consider settings like "dir_index" and "extent"
– mkfs.ext4 -O dir_index,extent /dev/sdX
  • dir_index: Use hashed b-trees to speed up lookups in large directories
  • extent: Instead of using the indirect block scheme for storing the location of data blocks in an inode, use extents. This is a much more efficient encoding which speeds up file system access, especially for large files.

Mount options
– Turn off atime
  • noatime: Access timestamps are not updated when a file is read
  • nodiratime: Turn off directory time stamps
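As an illustration (the device and mount point are placeholders), an /etc/fstab entry for a data disk with these mount options might look like:

    /dev/sdb1  /hadoop/disk1  ext4  defaults,noatime,nodiratime  0 0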

Page 12

Tuning Kernel Virtual Machine Memory

redhat_transparent_hugepage/enabled (RedHat 6.2 or later)
– In order to bring "huge page" performance benefits to legacy software, RedHat has implemented a custom Linux kernel extension called "redhat_transparent_hugepage".
– Recommendation: Turn off transparent huge pages since this functionality is still maturing.
  • This is a well-known tuning optimization in the Hadoop community
  • echo never > /sys/kernel/mm/redhat_transparent_hugepage/enabled

vm.swappiness
– This control defines how aggressively the kernel swaps memory pages. Higher values increase aggressiveness; lower values decrease the amount of swap. Default is 60.
– Recommend a lower value in the range 5 to 20 (test in increments of 5)

vm.min_free_kbytes (came up during Big SQL testing)
– min_free_kbytes changes the page reclaim thresholds. When this number is increased the system starts reclaiming memory earlier; when it is lowered the system starts reclaiming memory later.
– If you see "page allocation" failures in /var/log/messages, then chances are this is set too low.
– For Big SQL testing on machines with 128 GB of physical memory, we set this to 2621440 (2.5 GB)
– (If using GPFS) GPFS documentation recommends that vm.min_free_kbytes be set to 5 to 6% of the total physical memory
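Pulling the settings above together, one way to apply them looks like this (swappiness picked from the 5 to 20 range recommended above, min_free_kbytes from the Big SQL example; re-check both against your own memory size, and note the transparent-huge-page echo must be re-run after each reboot, e.g. from /etc/rc.local):

    sysctl -w vm.swappiness=10
    sysctl -w vm.min_free_kbytes=2621440
    echo never > /sys/kernel/mm/redhat_transparent_hugepage/enabled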

There are many more vm kernel settings you can explore

Page 13

Tuning HDFS

Give HDFS as many paths as possible, where each path is on a different device

Can be configured at install time
– Can also be changed later using parameter "dfs.data.dir" in hdfs-site.xml

In the installer, under "DataNode/TaskTracker"
– Installer default path: /hadoop/hdfs/data
– You almost always want to change this to one or more paths
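A sketch of what the resulting hdfs-site.xml entry might look like (the directory names are placeholders; use one directory per physical disk):

    <property>
      <name>dfs.data.dir</name>
      <value>/hadoop/disk1/hdfs/data,/hadoop/disk2/hdfs/data,/hadoop/disk3/hdfs/data</value>
    </property>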

Page 14

Tuning HDFS

Tune the underlying devices and file systems as discussed on previous slides

When running large Big SQL & Hive workloads the following settings have proved useful in improving the stability of a cluster, for both Apache M/R & Platform Symphony M/R:
– dfs.datanode.handler.count=40 (default is 10). The number of server threads for the DataNode. Usually about the same as the number of CPUs.
– dfs.datanode.max.xcievers=65536 (default is 8192). Maximum number of files that a DataNode will serve concurrently.
– dfs.datanode.socket.write.timeout=960000 (default is 480000). Units are milliseconds; 960000 is 16 minutes. Increase to avoid tasks failing due to write timeout errors.
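Expressed as hdfs-site.xml entries (same values as quoted above; restart the DataNodes for the changes to take effect):

    <property><name>dfs.datanode.handler.count</name><value>40</value></property>
    <property><name>dfs.datanode.max.xcievers</name><value>65536</value></property>
    <property><name>dfs.datanode.socket.write.timeout</name><value>960000</value></property>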

Page 15

Tuning Map/Reduce (Intermediate storage)

"Intermediate" data gets shuffled between map and reduce tasks
Intermediate data is kept on local storage as configured by the Hadoop parameter "mapred.local.dir"
The speed of this storage directly impacts M/R performance
For better performance, spread it across multiple devices
Configured at install time on the "File System" panel
– Referred to as "Cache directory"
– Can also be updated post-install in mapred-site.xml
In the installer, expand "MapReduce general settings"
– Default path: /hadoop/mapred/local
– Change this to one or more paths, where each path corresponds to a different device
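A sketch of the corresponding mapred-site.xml entry (directory names are placeholders; these directories can share disks with dfs.data.dir, as the next slide illustrates):

    <property>
      <name>mapred.local.dir</name>
      <value>/hadoop/disk1/mapred/local,/hadoop/disk2/mapred/local,/hadoop/disk3/mapred/local</value>
    </property>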

Page 16

Tuning HDFS and Intermediate storage

[Diagram: DataNodes 1 through N, each with local disks shared between HDFS (dfs.data.dir), intermediate storage (mapred.local.dir), and logs]

Intermediate storage (mapred.local.dir) can share the same disks as HDFS (dfs.data.dir)

Page 17

Tuning Map/Reduce

We only have time to cover a few M/R settings in this presentation
– Number of slots
– JVM heap sizes

Backup slides cover more topics
– Number of reduce tasks
– Compression
– Sorting
– Block and split sizes
– Making clusters rack aware

Page 18

Tuning Map/Reduce (Number of slots)

Used to tune the degree of Map/Reduce concurrency on the cluster
Defines the maximum number of concurrently occupied slots on a node
– Too high and the cluster becomes unstable; too low and you are wasting machine resources
– Tuned in association with the map and reduce task JVM size (mapred.child.java.opts)
– On large machines, particularly those with many CPU cores and/or hyper-threading enabled, the default values may be too large
  • Example: Simon tested Big SQL on Power machines that have 16 cores and SMT=4, resulting in 64 virtual CPUs. By default, BigInsights configured 64 map slots and 32 reduce slots (96 slots in all). This was too many and resulted in high context switching and cluster instability. After tuning, Simon found 24 map slots and 12 reduce slots worked much better.

Setting / Default value:
– mapred.tasktracker.map.tasks.maximum: Formula <%= Math.min(Math.ceil(numOfCores * 1.0), Math.ceil(maxPartition*0.66*totalMem/1000)) %>
– mapred.tasktracker.reduce.tasks.maximum: Formula <%= Math.min(Math.ceil(numOfCores * 0.5), Math.ceil(maxPartition*0.33*totalMem/1000)) %>

Number of map and reduce slots on each TaskTracker
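For instance, pinning the tuned values from the Power example above would look like this in mapred-site.xml (a sketch using the slot counts quoted on this slide, not a universal recommendation):

    <property><name>mapred.tasktracker.map.tasks.maximum</name><value>24</value></property>
    <property><name>mapred.tasktracker.reduce.tasks.maximum</name><value>12</value></property>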

Page 19

Tuning Map/Reduce (Number of slots - Continued)

Additional notes about these two Hadoop parameters:

1. The formulas for “mapred.tasktracker.map.tasks.maximum” and “mapred.tasktracker.reduce.tasks.maximum” are evaluated on each TaskTracker and thus can vary by node

2. If changed, be sure to restart all TaskTrackers

3. Since these are TaskTracker settings, overriding them on individual Hadoop jobs has no effect

Page 20

Tuning Map/Reduce (Task JVM heap size)

Tune the amount of memory assigned to each map and reduce task
– The -Xmx property defines the maximum Java heap size
– A conservative setting would be (60% of real memory) / (mapred.tasktracker.map.tasks.maximum + mapred.tasktracker.reduce.tasks.maximum)
– An aggressive setting would be (80% of real memory) / (mapred.tasktracker.map.tasks.maximum + mapred.tasktracker.reduce.tasks.maximum)
– Often tuned in association with mapred.tasktracker.map.tasks.maximum & mapred.tasktracker.reduce.tasks.maximum
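As a rough worked example (combining the 128 GB nodes mentioned for Big SQL testing with the 24 map / 12 reduce slots from the earlier Power example; the combination is illustrative, not a recommendation from these slides):

    conservative heap ≈ (0.60 × 128 GB) / (24 + 12) ≈ 2.1 GB per task, i.e. roughly -Xmx2048m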

Setting / Default value:
– mapred.child.java.opts: -Xmx1000m (new default value in BI 2.1); BI 2.0 default: -Xmx600m -Xshareclasses
– mapred.map.child.java.opts: None. If set, overrides "mapred.child.java.opts" for map tasks.
– mapred.reduce.child.java.opts: None. If set, overrides "mapred.child.java.opts" for reduce tasks.

Individual Hadoop jobs can run with custom JVM arguments

Page 21

Making tuning changes persist after reboot

Kernel settings:
– sysctl -w <parameter>=<value>, or
  • Always confirm the setting was actually changed
– Add to /etc/sysctl.conf
  • vm.swappiness = 5

ifconfig settings
– Edit /etc/sysconfig/network-scripts/ifcfg-ethX, change lines like:
  • MTU="9000"
– Edit /etc/rc.local, add lines like:
  • ifconfig ethX txqueuelen 2000
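For example (standard sysctl usage, not specific to these slides):

    sysctl -w vm.swappiness=5    # apply now
    sysctl vm.swappiness         # confirm the value actually in effect
    sysctl -p                    # re-apply everything listed in /etc/sysctl.conf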

Page 22

Tuning Map/Reduce (Number of reduce tasks)

Tune the number of reduce tasks for your Hadoop jobs
– The default number of reduce tasks is 1, which is OK only for very small jobs
– Recommendation is to override "mapred.reduce.tasks" with a value greater than 1 for most jobs
– If the number of reduce tasks is too small, then each reduce task will take a long time and reduce time will dominate overall job time
– If too many, then reduce tasks will tie up slots and hold back other jobs

Setting / Default value:
– mapred.reduce.tasks: 1 (not set in mapred-site.xml)

A Hadoop job by default will only have one reduce task, which is usually not what you want
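For a job that uses Hadoop's GenericOptionsParser/ToolRunner, the value can be overridden on the command line (the jar, class, and paths below are placeholders):

    hadoop jar myjob.jar MyJobClass -D mapred.reduce.tasks=32 <input dir> <output dir>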

Page 23

Tuning Map/Reduce (Compression)

Consider the benefits and drawbacks of compression
– Tradeoff between storage space and CPU time
– Compressing map output can speed up the shuffle phase

Setting / Default value:
– mapred.map.output.compression.codec: The compression codec to use (a Java class). Must be listed in parameter "io.compression.codecs". Example: com.ibm.biginsights.compress.CmxCodec
– mapred.output.compress: Boolean. Default is false. Whether or not to compress final output from a Hadoop job.
– mapred.compress.map.output: Boolean. Default is false. Whether or not to compress intermediate data shuffled from map to reduce tasks.
– mapred.output.compression.type: Default is RECORD. How to compress sequence files. Other values are NONE and BLOCK.
– io.seqfile.compress.blocksize: Default is 1 million (bytes). Block size when compressing a sequence file and mapred.output.compression.type=BLOCK.

Use compression to save space and in some cases improve performance
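As a sketch, enabling map-output compression with the codec named above would look like this in mapred-site.xml (verify the codec is listed in io.compression.codecs first):

    <property><name>mapred.compress.map.output</name><value>true</value></property>
    <property><name>mapred.map.output.compression.codec</name><value>com.ibm.biginsights.compress.CmxCodec</value></property>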

Page 24

Tuning Map/Reduce (Sorting)

Tune the amount of memory and concurrency to use for sorting using io.sort.mb and io.sort.factor
– io.sort.mb is the size, in megabytes, of the memory buffer to use while sorting map output
– io.sort.factor is the maximum number of streams to merge at once when sorting files
– Should be tuned in conjunction with mapred.child.java.opts=-Xmx
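Note on the -Xmx relationship (our gloss, not from the slides): the io.sort.mb buffer is allocated inside the task's Java heap, so with the BI 2.1 default of -Xmx1000m, io.sort.mb=256 leaves roughly 744 MB of heap for the rest of the task.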

Setting / Default value:
– io.sort.mb: 256
– io.sort.factor: 10 (new default value in BI 2.1); BI 2.0 default: 64

Hadoop jobs can run with custom sort parameters

Page 25

Tuning Map/Reduce (Block Size/Split Size)

Choose an appropriate block size for your files using dfs.block.size
– The size of one block of data on HDFS
– Block size is a property of each file on HDFS
– Usually defines the size of input data (split) passed to a map task for processing
– Split sizes can be further tuned using "mapred.min.split.size" and "mapred.max.split.size"
– If the split size is too small, then you incur the overhead of creating/destroying many small tasks
– If the split size is too large, then concurrency of the cluster is reduced
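For a rough sense of scale: with the 128 MB default block size below, a 10 GB input file is divided into about 80 blocks, so a job reading it will normally run about 80 map tasks (one per split, assuming default split sizing).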

Setting / Default value:
– dfs.block.size: 128 MB
– mapred.min.split.size: 0
– mapred.max.split.size: No limit

Block size is an important characteristic of your data on HDFS

Page 26

Tuning Map/Reduce

Additional Hadoop parameters (defaults in mapred-site.xml) that are often overridden for individual Hadoop jobs:

Setting / Default value / Notes:
– mapred.reduce.slowstart.completed.maps: 0.5. The fraction of the maps in the job which should be complete before reduces are scheduled for the job. Starting reducers earlier can help some jobs finish faster, but the reducers tie up slots.

Page 27

Tuning Map/Reduce (General Settings)

Rack awareness

Setting / Default value / Notes:
– topology.script.file.name: No script by default. Add to core-site.xml. Use to make Hadoop rack-aware. Enable rack-awareness for large clusters with several racks, when all racks have roughly the same number of nodes.

Warning: It may hurt performance to enable rack awareness for small clusters with an uneven number of nodes in each rack. Racks with fewer nodes tend to be busier.
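A minimal sketch of what such a topology script can look like (the IP-to-rack mapping is invented for illustration; Hadoop invokes the script with node addresses and expects one rack name per argument):

    #!/bin/bash
    for node in "$@"; do
      case "$node" in
        192.168.1.*) echo -n "/rack1 " ;;
        192.168.2.*) echo -n "/rack2 " ;;
        *)           echo -n "/default-rack " ;;
      esac
    done
    echo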

Page 28

Tuning Adaptive M/R – Number of slots

The number of slots on each compute node has a huge impact on performance
– Cluster is under-utilized if there are too few slots
– Cluster can become over-committed if there are too many slots
– The number of slots needs to be configured based on:
  • Compute node hardware (CPU, memory, number of disks, etc.)
  • Workload characteristics (memory, CPU, or storage intensive)

Default number of slots:
– Adaptive M/R: Equal to the number of CPUs
  • Single slot pool for map and reduce tasks
– Apache M/R: Separate pools for map and reduce slots, with a default ratio of 2 map slots for every reduce slot

Page 29

Tuning Adaptive M/R – Number of slots

How to change the number of slots (Adaptive M/R)
– Command line
  • Example: Configure 16 slots per compute node
    Edit $BIGINSIGHTS_HOME/hdm/components/HAManager/conf/ResourceGroups.xml and change two lines:
      <ResourceGroup ResourceGroupName="ManagementHosts" availableSlots="16">
      <ResourceGroup ResourceGroupName="ComputeHosts" availableSlots="16">
    Run $BIGINSIGHTS_HOME/bin/syncconf.sh HAManager
    Check that the changes appear in $EGO_CONFDIR/ResourceGroups.xml
– Web UI (not officially externalized in BigInsights)
  • Open http://<master node>:18080/Platform and bring up the Map/Reduce UI
  • Go to Resources -> Resource Groups -> ComputeHosts

Page 30

Tuning Apache M/R – Number of slots

How to change the number of slots (Apache M/R)
– Edit $BIGINSIGHTS_HOME/hdm/hadoop-conf-staging/mapred-site.xml
– Change mapred.tasktracker.map.tasks.maximum and mapred.tasktracker.reduce.tasks.maximum
– Run $BIGINSIGHTS_HOME/bin/syncconf.sh hadoop

Page 31

Adaptive M/R: Tuning Map and Reduce tasks JVM size

Adaptive M/R honors the property "mapred.child.java.opts"
– Default in mapred-site.xml
  • Out-of-box, mapred.child.java.opts = -Xmx1000m
– Hadoop jobs often override the "mapred.child.java.opts" value

Page 32

Tuning Adaptive M/R

Since there is no distinction between map slots and reduce slots in Adaptive M/R, special consideration should be taken
– Never oversubscribe your system memory (number of slots * memory per slot)
– mapred.reduce.slowstart.completed.maps

io.sort.mb (default 100) / io.sort.factor (default 10)
– io.sort.mb sets the buffer size for map-side sorting. Increasing it when map output is large should decrease map-side I/O. Recommend 256 as a default, or higher depending on block size.
– Increasing io.sort.factor should decrease the number of spills to disk and reduce sort/shuffle time; the penalty is extra GC. Recommend 64 as a default.

Compression: mapred.map.output.compression.codec, mapred.compress.map.output
– Enabling compression of map output decreases disk and network activity
– On most workloads, when CPU is available, compression will improve overall performance, in some environments dramatically
– Out of the box, com.ibm.biginsights.compress.CmxCodec provides a good compression ratio without too much CPU overhead
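As a quick oversubscription check using numbers that appear elsewhere in this deck (16 slots per compute node and roughly 2 GB of heap per task; the pairing is illustrative): 16 × 2 GB = 32 GB of task heap, which must fit in physical memory with room left over for the DataNode, the TaskTracker/Symphony daemons, and the OS.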

Page 33

Tuning Adaptive M/R

mapred.reduce.input.buffer.percent (default 0.00)
– Sets a percentage of memory relative to heap size
– With the default, all in-memory segments are persisted to disk and then read back when the user reduce function is called
– Increasing it to 0.96 would allow reducers to use up to 96% of available memory to keep segments in memory while calling the reduce function; consider this when map output is large, such as Terasort
– Conversely, a TPC-H workload does not benefit because its map outputs are smaller

mapred.job.shuffle.merge.percent
– % of memory allocated to storing in-memory map outputs

mapred.job.shuffle.input.buffer.percent
– % of memory allocated from the max heap for storing map outputs during shuffle

mapred.local.dir / dfs.data.dir
– The more the merrier!
– On a dedicated cluster, using all disks (X) on a compute host should yield better results than using X/2 disks for mapred.local.dir and the other X/2 for dfs.data.dir