TRANSCRIPT
Running MongoDB in Production, Part II
Tim Vaillancourt
Sr. Technical Operations Architect, Percona
`whoami`
{name: “tim”,
 lastname: “vaillancourt”,
 employer: “percona”,
 techs: [
  “mongodb”, “mysql”, “cassandra”, “redis”, “rabbitmq”,
  “solr”, “mesos”, “kafka”, “couch*”, “python”, “golang”
 ]}
Agenda
● Architecture and High-Availability
● Hardware
● Tuning MongoDB
● Tuning Linux
Architecture and High-Availability
High Availability
● Replication
  ○ Asynchronous
    ■ Write Concerns can provide pseudo-synchronous replication
    ■ Changelog-based, using the “Oplog”
  ○ Maximum 50 members
  ○ Maximum 7 voting members
    ■ Use “votes: 0” for members $gt 7
  ○ Oplog
    ■ The “oplog.rs” capped collection in the “local” database, storing changes to data
    ■ Read by secondary members for replication
    ■ Written to by the local node after “apply” of an operation
Architecture
● Datacenter Recommendations
  ○ Minimum of 3 x physical servers required for High-Availability
  ○ Ensure only 1 x member per Replica Set is on a single physical server!!!
● EC2 / Cloud Recommendations
  ○ Place Replica Set members in 3 Availability Zones in the same region
  ○ Use a hidden secondary node for Backup and Disaster Recovery in another region
  ○ Entire Availability Zones have been lost before!
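The layout above can be sketched as a replica set configuration from the mongo shell. This is illustrative only: the hostnames and replica set name are hypothetical, and a hidden member must also have `priority: 0`. The `votes: 0` flag keeps the DR member out of elections entirely, as suggested for non-voting members.

```javascript
// Illustrative sketch - hostnames are hypothetical.
// Three data-bearing members in separate Availability Zones, plus a
// hidden secondary in another region for backups / Disaster Recovery:
rs.initiate({
  _id: "rs0",
  members: [
    { _id: 0, host: "mongo-az1.example.net:27017" },
    { _id: 1, host: "mongo-az2.example.net:27017" },
    { _id: 2, host: "mongo-az3.example.net:27017" },
    // hidden member: never elected, invisible to clients
    { _id: 3, host: "mongo-dr.example.net:27017",
      hidden: true, priority: 0, votes: 0 }
  ]
})
```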
Hardware
Hardware: Mainframe vs Commodity
● Databases: The Past
  ○ Buy some really amazing, expensive hardware
  ○ Buy some crazy-expensive license
    ■ Don’t run a lot of servers due to the above
  ○ Scale up:
    ■ Buy even more amazing hardware for the monolithic host
    ■ Hardware came on a truck
  ○ HA: When it rains, it pours
Hardware: Mainframe vs Commodity
● Databases: A New Era
  ○ Everything fails, nothing is precious
  ○ Elastic infrastructures (“The cloud”, Mesos, etc.)
  ○ Scale out: add more cheap, commodity servers
  ○ HA: lots of cheap, commodity servers - still up!
Hardware: Block Devices
● Isolation
  ○ Run the Mongod dbPath on a separate volume
  ○ Optionally, run the Mongod journal on a separate volume
● RAID Level
  ○ RAID 10 == performance/durability sweet spot
  ○ RAID 0 == fast and dangerous
● SSDs
  ○ Benefit MMAPv1 a lot
  ○ Benefit WiredTiger and RocksDB a bit less
  ○ Keep about 20-30% free space for internal GC
Hardware: Block Devices
● EBS / NFS / iSCSI
  ○ Risks / Drawbacks
    ■ Exponentially more things to break (more on this)
    ■ Block device requests wrapped in TCP are extremely slow
    ■ You probably already paid for some fast local disks
    ■ More difficult (sometimes nearly impossible) to troubleshoot
    ■ MongoDB doesn’t really benefit from remote storage features/flexibility
      ● Built-in High-Availability of data via replication
      ● MongoDB replication can bootstrap new members
      ● Strong write concerns can be specified for critical data
Hardware: Block Devices
● EBS / NFS / iSCSI
  ○ Things to break or troubleshoot…
    ■ Application needs a block from disk
    ■ System call to kernel for the block
    ■ Kernel frames the block request in TCP
      ● No logic to align block sizes
    ■ TCP connection/handshake (if not pooled)
    ■ TCP packet moves across wire, routers, switches
      ● Ethernet is exponentially slower than SATA/SAS/SCSI
    ■ Storage server parses TCP to a block device request
    ■ Storage server system calls to kernel for the block
    ■ Storage server storage driver calls the RAID/storage controller
    ■ Block is returned (finally!)
Hardware: CPUs
● Cores vs Core Speed
  ○ Lots of cores > faster cores (4 CPUs minimum recommended)
  ○ Thread-per-connection model
● CPU Frequency Scaling
  ○ ‘cpufreq’: a daemon for dynamic scaling of the CPU frequency
  ○ A terrible idea for databases or any predictability!
  ○ Disable it, or set the governor to 100% frequency always, i.e. mode: ‘performance’
  ○ Disable any BIOS-level performance/efficiency tuneable
  ○ Set ENERGY_PERF_BIAS to ‘performance’ on CentOS/Red Hat
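A sketch of how this is typically done on a bare-metal CentOS/Red Hat host (commands are illustrative, require root, and come from the kernel-tools package; sysfs paths may be absent on VMs):

```shell
# Check the current governor on each core:
cat /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
# Pin all cores to full frequency ('performance' governor):
cpupower frequency-set -g performance
# Set the CPU energy/performance bias to favour performance:
x86_energy_perf_policy performance
```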
Hardware: Network Infrastructure
● Datacenter Tiers
  ○ Network Edge
  ○ Public Server VLAN
    ■ Servers with Public NAT and/or port forwards from the Network Edge
    ■ Examples: Proxies, Static Content, etc.
    ■ Calls backends in the Backend Server VLAN
  ○ Backend Server VLAN
    ■ Servers with port forwarding from the Public Server VLAN (w/ Source IP ACLs)
    ■ Optional load balancer for stateless backends
    ■ Examples: Webserver, Application Server/Worker, etc.
    ■ Calls data stores in the Data VLAN
Hardware: Network Infrastructure
● Datacenter Tiers
  ○ Data VLAN
    ■ Servers, filers, etc. with port forwarding from the Backend Server VLAN (w/ Source IP ACLs)
    ■ Examples: Databases, Queues, Filers, Caches, HDFS, etc.
Hardware: Network Infrastructure
● Network Fabric
  ○ Try to use 10GbE for low latency
  ○ Use Jumbo Frames for efficiency
  ○ Try to keep all MongoDB nodes on the same segment
    ■ Goal: few or no network hops between nodes
    ■ Check with ‘traceroute’
● Outbound / Public Access
  ○ Databases don’t need to talk to the internet*
    ■ Store a copy of your Yum, DockerHub, etc. repos locally
    ■ Deny any access to the public internet, or have no route to it
    ■ Hackers will try to upload a dump of your data out of the network!!
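The fabric checks above can be done with standard tools. Hostnames here are hypothetical; run from one replica set member against another:

```shell
# Count network hops to another member (goal: same segment, no routing):
traceroute -n mongo-az2.example.net
# Verify jumbo frames end-to-end: 8972 bytes of ICMP payload plus 28
# bytes of ICMP/IP headers = a 9000-byte frame; -M do forbids fragmentation:
ping -c 3 -M do -s 8972 mongo-az2.example.net
```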
Hardware: Why So Quick?
● MongoDB allows you to scale reads and writes with more nodes
  ○ Single-instance performance is important, but not a deal-breaker
● You are the most expensive resource!
  ○ Not hardware anymore
Tuning MongoDB
Tuning MongoDB: MMAPv1
● A kernel-level function to map file blocks to memory
● MMAPv1 syncs data to disk once per 60 seconds (default)
  ○ If a server with no journal crashes, it can lose 1 min of data!!!
● In-memory buffering of the Journal
  ○ Synced every 30ms if the ‘journal’ is on a different disk
  ○ Or every 100ms
  ○ Or 1/3rd of the above if a change uses the j:true Write Concern
Tuning MongoDB: MMAPv1
● Fragmentation
  ○ Can cause serious slowdowns on scans, range queries, etc.
  ○ WiredTiger and RocksDB have little to no fragmentation due to checkpoints / compaction
Tuning MongoDB: WiredTiger
● WT syncs data to disk in a process called “Checkpointing”:
  ○ Every 60 seconds or >= 2GB of data changes
● In-memory buffering of the Journal
  ○ Journal buffer size: 128kb
  ○ Synced every 50 ms (as of 3.2)
  ○ Or on every change with the Journaled write concern
Tuning MongoDB: RocksDB
● Deprecated in PSMDB 3.6+
● Level-based strategy using immutable data level files
  ○ Built-in Compression
  ○ Block and Filesystem caches
● RocksDB uses “compaction” to apply changes to data files
  ○ Tiered level compaction
  ○ Follows the same logic as MMAPv1 for journal buffering
Tuning MongoDB: Storage Engine Caches
● WiredTiger
  ○ In-heap
    ■ 50% of available system memory
    ■ Uncompressed WT pages
  ○ Filesystem Cache
    ■ 50% of available system memory
    ■ Compressed pages
● RocksDB
  ○ Internal testing planned from Percona in the future
  ○ 30% in-heap cache recommended by Facebook / Parse Platform
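The WiredTiger in-heap cache split above is controlled by one mongod.conf option; a sketch (the 8GB figure is illustrative, leaving the remainder of RAM for the compressed filesystem cache):

```yaml
# mongod.conf - cap the WiredTiger in-heap cache
storage:
  wiredTiger:
    engineConfig:
      cacheSizeGB: 8
```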
Tuning MongoDB: Durability
● storage.journal.enabled = <true/false>
  ○ Default since 2.0 on 64-bit builds
  ○ Always enable unless data is transient
  ○ Always enable on cluster config servers
● storage.journal.commitIntervalMs = <ms>
  ○ Max time between journal syncs
● storage.syncPeriodSecs = <secs>
  ○ Max time between data file flushes
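In the YAML config file these durability options look like the following sketch; the values shown are the usual 3.x-era defaults, so check the defaults for your version before changing them:

```yaml
# mongod.conf - durability settings (values shown are typical defaults)
storage:
  journal:
    enabled: true
    commitIntervalMs: 100
  syncPeriodSecs: 60
```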
Tuning MongoDB: Don’t Enable!
● “cpu”
  ○ External monitoring is recommended
● “rest”
  ○ Will be deprecated in 3.6+
● “smallfiles”
  ○ In most situations this is not necessary, unless:
    ■ You use MMAPv1, and
    ■ It is a Development / Test environment, and
    ■ You have 100s-1000s of databases with very little data inside (unlikely)
● Profiling mode ‘2’
  ○ Unless troubleshooting an issue / intentional
Tuning Linux
● “I can login via SSH... we’re done!”
● The database is only as fast as the kernel
● Expect a default Linux install to be optimised for a cheap laptop, not your $$$ hardware
Tuning Linux: Love your OS!
● Avoid Linux kernels earlier than 3.10.x - 3.12.x
  ○ Large improvements in parallel efficiency in 3.10+ (for free!)
● More: https://blog.2ndquadrant.com/postgresql-vs-kernel-versions/
Tuning Linux: The Linux Kernel
Tuning Linux: NUMA
● A memory architecture that takes into account the locality of memory, caches and CPUs for lower latency
  ○ But no databases want to use it :(
● The MongoDB codebase is not NUMA-“aware”
  ○ Unbalanced memory allocations to a single zone
Tuning Linux: NUMA
● Disable NUMA
  ○ In the Server BIOS
  ○ Using ‘numactl’ in init scripts BEFORE the ‘mongod’ command (recommended for future compatibility):

    numactl --interleave=all /usr/bin/mongod <other flags>
Tuning Linux: Transparent HugePages
● Introduced in RHEL/CentOS 6, Linux 2.6.38+
● Merges memory pages in the background (khugepaged process)
● “AnonHugePages” in /proc/meminfo shows usage
● Disable TransparentHugePages!
● Add “transparent_hugepage=never” to the kernel command-line (GRUB)
  ○ Reboot the system
    ■ Disabling online does not clear previously-created THP pages
    ■ Rebooting tests that your system will come back up!
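A quick way to check the current THP state before and after the GRUB change (the sysfs path exists on most modern kernels, but may be absent in some containers and VMs):

```shell
# Check whether Transparent HugePages are enabled; the bracketed value
# (e.g. "[always] madvise never") is the active setting.
thp=/sys/kernel/mm/transparent_hugepage/enabled
if [ -r "$thp" ]; then
  cat "$thp"
else
  echo "THP interface not present"
fi
# To disable at runtime (as root) until the next reboot:
#   echo never > /sys/kernel/mm/transparent_hugepage/enabled
```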
Tuning Linux: Time Source
● Replication and Clustering need consistent clocks
  ○ mongodb_consistent_backup relies on time sync, for example!
● Use a consistent time source/server
  ○ “It’s ok if everyone is equally wrong”
● Non-Virtualised
  ○ Run an NTP daemon on all MongoDB and Monitoring hosts
  ○ Enable the service so it starts on reboot
Tuning Linux: Time Source
● Virtualised
  ○ Check if your VM platform has an “agent” syncing time
  ○ VMWare and Xen are known to have their own time sync
  ○ If no time sync is provided, install an NTP daemon
Tuning Linux: I/O Scheduler
● The algorithm the kernel uses to commit reads and writes to disk
● CFQ
  ○ “Completely Fair Queue”
  ○ Perhaps too clever/inefficient for database workloads
  ○ Probably good for a laptop; assumes multi-use
● Deadline
  ○ Best general default IMHO
  ○ Predictable I/O request latencies
Tuning Linux: I/O Scheduler
● Noop
  ○ Use with virtualised servers
  ○ Use with real-hardware BBU RAID controllers
Tuning Linux: Filesystems
● Types
  ○ Use XFS or EXT4, not EXT3
    ■ EXT3 has very poor pre-allocation performance
    ■ Use XFS only on WiredTiger
    ■ EXT4 “data=ordered” mode recommended
  ○ Btrfs not tested, yet!
● Options
  ○ Set ‘noatime’ on MongoDB data volumes in ‘/etc/fstab’
  ○ Remount the filesystem after an options change, or reboot
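An illustrative ‘/etc/fstab’ entry for a dedicated MongoDB data volume (the device name and mount point here are hypothetical, to be replaced with your own):

```
# /etc/fstab - MongoDB data volume with noatime set
/dev/sdb1  /var/lib/mongo  xfs  defaults,noatime  0 0
```

After editing, `mount -o remount /var/lib/mongo` (or a reboot) applies the new options.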
Tuning Linux: Block Device Readahead
● A tuning that causes data ahead of a block on disk to be read and then cached
● Assumption:
  ○ There is a sequential read pattern
  ○ Something will benefit from the extra cached blocks
● Risk
  ○ Set too high, it wastes cache space
  ○ Increases eviction work
    ■ MongoDB tends to have very random disk patterns
● A good start for MongoDB volumes is ‘32’ (16kb) of read-ahead
  ○ Let MongoDB worry about optimising the pattern
Tuning Linux: Block Device Readahead
● Change Read-Ahead
  ○ Add a file to ‘/etc/udev/rules.d’
    ■ /etc/udev/rules.d/60-mongodb-disk.rules:

      # set deadline scheduler and 32/16kb read-ahead for /dev/sda
      ACTION=="add|change", KERNEL=="sda", ATTR{queue/scheduler}="deadline", ATTR{bdi/read_ahead_kb}="16"

  ○ Reboot (or use CLI tools to apply)
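The two units above are easy to mix up: `blockdev --setra` takes 512-byte sectors, while the udev `bdi/read_ahead_kb` attribute takes kilobytes. The recommended ‘32’ sectors is the same as the ‘16’ kb in the rule:

```shell
# Read-ahead in sectors (512 bytes each) converted to kilobytes:
sectors=32
kb=$(( sectors * 512 / 1024 ))
echo "${kb}kb read-ahead"   # prints: 16kb read-ahead
```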
Tuning Linux: Virtual Memory Dirty Pages
● Dirty Pages
  ○ Pages stored in-cache, but needing to be written to storage
● Dirty Ratio
  ○ Max percent of total memory that can be dirty
  ○ The VM stalls and flushes when this limit is reached
  ○ Start with ‘10’; the default (30) is too high
Tuning Linux: Virtual Memory Dirty Pages
● Dirty Background Ratio
  ○ A separate threshold for background dirty page flushing
  ○ Flushes without pauses
  ○ Start with ‘3’; the default (15) is too high
Tuning Linux: Swappiness
● A Linux kernel sysctl setting for preferring RAM or disk for swap
  ○ Linux default: 60
  ○ To avoid disk-based swap: 1 (not zero!)
  ○ To allow some disk-based swap: 10
  ○ ‘0’ can cause more swapping than ‘1’ on recent kernels
    ■ More on this here: https://www.percona.com/blog/2014/04/28/oom-relation-vm-swappiness0-new-kernel/
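The swappiness value here and the dirty-page ratios from the previous slides can be staged as one sysctl fragment. A sketch: it writes to the current directory for a dry run; as root, copy the file into /etc/sysctl.d/ and apply with `sysctl -p <file>` (the filename is arbitrary):

```shell
# Stage a sysctl fragment with the suggested VM starting points
cat > ./99-mongodb-vm.conf <<'EOF'
vm.swappiness = 1
vm.dirty_ratio = 10
vm.dirty_background_ratio = 3
EOF
cat ./99-mongodb-vm.conf
```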
Tuning Linux: Ulimit
● Allows per-Linux-user resource constraints
  ○ Number of User-level Processes
  ○ Number of Open Files
  ○ CPU Seconds
  ○ Scheduling Priority
  ○ And others…
Tuning Linux: Ulimit
● MongoDB
  ○ Should probably have a dedicated VM, container or server
  ○ Creates a new thread (counted as a process by ulimit)
    ■ For every new connection to the Database
    ■ Plus various background tasks / threads
  ○ Creates an open file for each active data file on disk
    ■ 64,000 open files and 64,000 max processes is a good start
Tuning Linux: Ulimit
● Setting ulimits
  ○ A /etc/security/limits.d file
  ○ Systemd Service
  ○ Init script
● Ulimits are set by the Percona and MongoDB packages!
  ○ Example: the PSMDB RPM (Systemd) service unit
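A limits.d fragment matching the “64,000 open files / processes” starting point might look like this sketch; the `mongod` user name is an assumption (it depends on which package created the service user):

```
# /etc/security/limits.d/99-mongodb.conf - illustrative starting point
mongod  soft  nofile  64000
mongod  hard  nofile  64000
mongod  soft  nproc   64000
mongod  hard  nproc   64000
```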
Tuning Linux: Network Stack
● Defaults are not good for > 100mbps Ethernet
● Suggested starting point: see the sysctl tunings in the “Tuning Linux for MongoDB” blog post
● Set Network Tunings:
  ○ Add the sysctl tunings to /etc/sysctl.conf
  ○ Run “/sbin/sysctl -p” as root to set the tunings
  ○ Run “/sbin/sysctl -a” to verify the changes
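A sketch of that kind of starting point, staged as a local file for review (the values are illustrative, drawn from Percona’s “Tuning Linux for MongoDB” post; verify them against your kernel and workload before copying into /etc/sysctl.conf or /etc/sysctl.d/ as root):

```shell
# Stage a network sysctl fragment for review before applying
cat > ./99-mongodb-net.conf <<'EOF'
net.core.somaxconn = 4096
net.core.netdev_max_backlog = 1024
net.ipv4.tcp_max_syn_backlog = 4096
net.ipv4.tcp_fin_timeout = 30
net.ipv4.tcp_keepalive_time = 120
net.ipv4.tcp_keepalive_intvl = 30
EOF
cat ./99-mongodb-net.conf
```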
Tuning Linux: More on this...
https://www.percona.com/blog/2016/08/12/tuning-linux-for-mongodb/
Tuning Linux: “Tuned”
● Tuned
  ○ A “framework” for applying tunings to Linux
    ■ RedHat/CentOS 7 only for now
    ■ Debian added tuned; not sure if compatible yet
    ■ Cannot tune NUMA, filesystem type or fs mount opts
    ■ Sysctls, THP, I/O sched, etc.
● My apology to the community for writing “Tuning Linux for MongoDB”:
  ○ https://github.com/Percona-Lab/tuned-percona-mongodb
To be continued...
May 3rd, 2018
Questions?