TRANSCRIPT
Running MongoDB in Production, Part II
Tim Vaillancourt
Sr. Technical Operations Architect, Percona
`whoami`
{name: “tim”,
 lastname: “vaillancourt”,
 employer: “percona”,
 techs: [
  “mongodb”, “mysql”, “cassandra”, “redis”, “rabbitmq”,
  “solr”, “mesos”, “kafka”, “couch*”, “python”, “golang”
 ]}
Agenda
● Architecture and High-Availability
● Hardware
● Tuning MongoDB
● Tuning Linux
Architecture and High-Availability
High Availability
● Replication
  ○ Asynchronous
    ■ Write Concerns can provide pseudo-synchronous replication
    ■ Changelog-based, using the “Oplog”
  ○ Maximum 50 members
  ○ Maximum 7 voting members
    ■ Use “votes: 0” for members $gt 7
  ○ Oplog
    ■ The “oplog.rs” capped collection in the “local” database, storing changes to data
    ■ Read by secondary members for replication
    ■ Written to by the local node after “apply” of an operation
Architecture
● Datacenter Recommendations
  ○ Minimum of 3 x physical servers required for High-Availability
  ○ Ensure only 1 x member per Replica Set is on a single physical server!!!
● EC2 / Cloud Recommendations
  ○ Place Replica Set members in 3 Availability Zones in the same region
  ○ Use a hidden secondary node for Backup and Disaster Recovery in another region
  ○ Entire Availability Zones have been lost before!
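The layout above can be sketched as a replica set configuration from the mongo shell. This is illustrative only: the hostnames and replica set name are hypothetical, and a hidden member must also have `priority: 0`. The `votes: 0` flag keeps the DR member out of elections entirely, as suggested for non-voting members.

```javascript
// Illustrative sketch - hostnames are hypothetical.
// Three data-bearing members in separate Availability Zones, plus a
// hidden secondary in another region for backups / Disaster Recovery:
rs.initiate({
  _id: "rs0",
  members: [
    { _id: 0, host: "mongo-az1.example.net:27017" },
    { _id: 1, host: "mongo-az2.example.net:27017" },
    { _id: 2, host: "mongo-az3.example.net:27017" },
    // hidden member: never elected, invisible to clients
    { _id: 3, host: "mongo-dr.example.net:27017",
      hidden: true, priority: 0, votes: 0 }
  ]
})
```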
Hardware
Hardware: Mainframe vs Commodity
● Databases: The Past
  ○ Buy some really amazing, expensive hardware
  ○ Buy some crazy-expensive license
    ■ Don’t run a lot of servers due to the above
  ○ Scale up:
    ■ Buy even more amazing hardware for the monolithic host
    ■ Hardware came on a truck
  ○ HA: When it rains, it pours
Hardware: Mainframe vs Commodity
● Databases: A New Era
  ○ Everything fails, nothing is precious
  ○ Elastic infrastructures (“The cloud”, Mesos, etc.)
  ○ Scale out: add more cheap, commodity servers
  ○ HA: lots of cheap, commodity servers - still up!
Hardware: Block Devices
● Isolation
  ○ Run the Mongod dbPath on a separate volume
  ○ Optionally, run the Mongod journal on a separate volume
● RAID Level
  ○ RAID 10 == performance/durability sweet spot
  ○ RAID 0 == fast and dangerous
● SSDs
  ○ Benefit MMAPv1 a lot
  ○ Benefit WiredTiger and RocksDB a bit less
  ○ Keep about 20-30% free space for internal GC
Hardware: Block Devices
● EBS / NFS / iSCSI
  ○ Risks / Drawbacks
    ■ Exponentially more things to break (more on this)
    ■ Block device requests wrapped in TCP are extremely slow
    ■ You probably already paid for some fast local disks
    ■ More difficult (sometimes nearly impossible) to troubleshoot
    ■ MongoDB doesn’t really benefit from remote storage features/flexibility
      ● Built-in High-Availability of data via replication
      ● MongoDB replication can bootstrap new members
      ● Strong write concerns can be specified for critical data
Hardware: Block Devices
● EBS / NFS / iSCSI
  ○ Things to break or troubleshoot…
    ■ Application needs a block from disk
    ■ System call to kernel for the block
    ■ Kernel frames the block request in TCP
      ● No logic to align block sizes
    ■ TCP connection/handshake (if not pooled)
    ■ TCP packet moves across wire, routers, switches
      ● Ethernet is exponentially slower than SATA/SAS/SCSI
    ■ Storage server parses TCP to a block device request
    ■ Storage server system calls to kernel for the block
    ■ Storage server storage driver calls the RAID/storage controller
    ■ Block is returned (finally!)
Hardware: CPUs
● Cores vs Core Speed
  ○ Lots of cores > faster cores (4 CPUs minimum recommended)
  ○ Thread-per-connection model
● CPU Frequency Scaling
  ○ ‘cpufreq’: a daemon for dynamic scaling of the CPU frequency
  ○ A terrible idea for databases or any predictability!
  ○ Disable it, or set the governor to 100% frequency always, i.e. mode: ‘performance’
  ○ Disable any BIOS-level performance/efficiency tuneable
  ○ Set ENERGY_PERF_BIAS to ‘performance’ on CentOS/Red Hat
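A sketch of how this is typically done on a bare-metal CentOS/Red Hat host (commands are illustrative, require root, and come from the kernel-tools package; sysfs paths may be absent on VMs):

```shell
# Check the current governor on each core:
cat /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
# Pin all cores to full frequency ('performance' governor):
cpupower frequency-set -g performance
# Set the CPU energy/performance bias to favour performance:
x86_energy_perf_policy performance
```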
Hardware: Network Infrastructure
● Datacenter Tiers
  ○ Network Edge
  ○ Public Server VLAN
    ■ Servers with Public NAT and/or port forwards from the Network Edge
    ■ Examples: Proxies, Static Content, etc.
    ■ Calls backends in the Backend Server VLAN
  ○ Backend Server VLAN
    ■ Servers with port forwarding from the Public Server VLAN (w/ Source IP ACLs)
    ■ Optional load balancer for stateless backends
    ■ Examples: Webserver, Application Server/Worker, etc.
    ■ Calls data stores in the Data VLAN
Hardware: Network Infrastructure
● Datacenter Tiers
  ○ Data VLAN
    ■ Servers, filers, etc. with port forwarding from the Backend Server VLAN (w/ Source IP ACLs)
    ■ Examples: Databases, Queues, Filers, Caches, HDFS, etc.
Hardware: Network Infrastructure
● Network Fabric
  ○ Try to use 10GbE for low latency
  ○ Use Jumbo Frames for efficiency
  ○ Try to keep all MongoDB nodes on the same segment
    ■ Goal: few or no network hops between nodes
    ■ Check with ‘traceroute’
● Outbound / Public Access
  ○ Databases don’t need to talk to the internet*
    ■ Store a copy of your Yum, DockerHub, etc. repos locally
    ■ Deny any access to the public internet, or have no route to it
    ■ Hackers will try to upload a dump of your data out of the network!!
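The fabric checks above can be done with standard tools. Hostnames here are hypothetical; run from one replica set member against another:

```shell
# Count network hops to another member (goal: same segment, no routing):
traceroute -n mongo-az2.example.net
# Verify jumbo frames end-to-end: 8972 bytes of ICMP payload plus 28
# bytes of ICMP/IP headers = a 9000-byte frame; -M do forbids fragmentation:
ping -c 3 -M do -s 8972 mongo-az2.example.net
```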
Hardware: Why So Quick?
● MongoDB allows you to scale reads and writes with more nodes
  ○ Single-instance performance is important, but not a deal-breaker
● You are the most expensive resource!
  ○ Not hardware anymore
Tuning MongoDB
Tuning MongoDB: MMAPv1
● A kernel-level function to map file blocks to memory
● MMAPv1 syncs data to disk once per 60 seconds (default)
  ○ If a server with no journal crashes, it can lose 1 min of data!!!
● In-memory buffering of the Journal
  ○ Synced every 30ms if the ‘journal’ is on a different disk
  ○ Or every 100ms
  ○ Or 1/3rd of the above if a change uses the j:true Write Concern
Tuning MongoDB: MMAPv1
● Fragmentation
  ○ Can cause serious slowdowns on scans, range queries, etc.
  ○ WiredTiger and RocksDB have little to no fragmentation due to checkpoints / compaction
Tuning MongoDB: WiredTiger
● WT syncs data to disk in a process called “Checkpointing”:
  ○ Every 60 seconds or >= 2GB of data changes
● In-memory buffering of the Journal
  ○ Journal buffer size: 128kb
  ○ Synced every 50 ms (as of 3.2)
  ○ Or on every change with the Journaled write concern
Tuning MongoDB: RocksDB
● Deprecated in PSMDB 3.6+
● Level-based strategy using immutable data level files
  ○ Built-in Compression
  ○ Block and Filesystem caches
● RocksDB uses “compaction” to apply changes to data files
  ○ Tiered level compaction
  ○ Follows the same logic as MMAPv1 for journal buffering
Tuning MongoDB: Storage Engine Caches
● WiredTiger
  ○ In-heap
    ■ 50% of available system memory
    ■ Uncompressed WT pages
  ○ Filesystem Cache
    ■ 50% of available system memory
    ■ Compressed pages
● RocksDB
  ○ Internal testing planned from Percona in the future
  ○ 30% in-heap cache recommended by Facebook / Parse Platform
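The WiredTiger in-heap cache split above is controlled by one mongod.conf option; a sketch (the 8GB figure is illustrative, leaving the remainder of RAM for the compressed filesystem cache):

```yaml
# mongod.conf - cap the WiredTiger in-heap cache
storage:
  wiredTiger:
    engineConfig:
      cacheSizeGB: 8
```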
Tuning MongoDB: Durability
● storage.journal.enabled = <true/false>
  ○ Default since 2.0 on 64-bit builds
  ○ Always enable unless data is transient
  ○ Always enable on cluster config servers
● storage.journal.commitIntervalMs = <ms>
  ○ Max time between journal syncs
● storage.syncPeriodSecs = <secs>
  ○ Max time between data file flushes
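In the YAML config file these durability options look like the following sketch; the values shown are the usual 3.x-era defaults, so check the defaults for your version before changing them:

```yaml
# mongod.conf - durability settings (values shown are typical defaults)
storage:
  journal:
    enabled: true
    commitIntervalMs: 100
  syncPeriodSecs: 60
```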
Tuning MongoDB: Don’t Enable!
● “cpu”
  ○ External monitoring is recommended
● “rest”
  ○ Will be deprecated in 3.6+
● “smallfiles”
  ○ In most situations this is not necessary, unless:
    ■ You use MMAPv1, and
    ■ It is a Development / Test environment, and
    ■ You have 100s-1000s of databases with very little data inside (unlikely)
● Profiling mode ‘2’
  ○ Unless troubleshooting an issue / intentional
Tuning Linux
● “I can login via SSH... we’re done!”
● The database is only as fast as the kernel
● Expect a default Linux install to be optimised for a cheap laptop, not your $$$ hardware
Tuning Linux: Love your OS!
● Avoid Linux kernels earlier than 3.10.x - 3.12.x
  ○ Large improvements in parallel efficiency in 3.10+ (for free!)
● More: https://blog.2ndquadrant.com/postgresql-vs-kernel-versions/
Tuning Linux: The Linux Kernel
Tuning Linux: NUMA
● A memory architecture that takes into account the locality of memory, caches and CPUs for lower latency
  ○ But no databases want to use it :(
● The MongoDB codebase is not NUMA-“aware”
  ○ Unbalanced memory allocations to a single zone
Tuning Linux: NUMA
● Disable NUMA
  ○ In the Server BIOS
  ○ Using ‘numactl’ in init scripts BEFORE the ‘mongod’ command (recommended for future compatibility):

    numactl --interleave=all /usr/bin/mongod <other flags>
Tuning Linux: Transparent HugePages
● Introduced in RHEL/CentOS 6, Linux 2.6.38+
● Merges memory pages in the background (khugepaged process)
● “AnonHugePages” in /proc/meminfo shows usage
● Disable TransparentHugePages!
● Add “transparent_hugepage=never” to the kernel command-line (GRUB)
  ○ Reboot the system
    ■ Disabling online does not clear previously-created THP pages
    ■ Rebooting tests that your system will come back up!
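A quick way to check the current THP state before and after the GRUB change (the sysfs path exists on most modern kernels, but may be absent in some containers and VMs):

```shell
# Check whether Transparent HugePages are enabled; the bracketed value
# (e.g. "[always] madvise never") is the active setting.
thp=/sys/kernel/mm/transparent_hugepage/enabled
if [ -r "$thp" ]; then
  cat "$thp"
else
  echo "THP interface not present"
fi
# To disable at runtime (as root) until the next reboot:
#   echo never > /sys/kernel/mm/transparent_hugepage/enabled
```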
Tuning Linux: Time Source
● Replication and Clustering need consistent clocks
  ○ mongodb_consistent_backup relies on time sync, for example!
● Use a consistent time source/server
  ○ “It’s ok if everyone is equally wrong”
● Non-Virtualised
  ○ Run an NTP daemon on all MongoDB and Monitoring hosts
  ○ Enable the service so it starts on reboot
Tuning Linux: Time Source
● Virtualised
  ○ Check if your VM platform has an “agent” syncing time
  ○ VMWare and Xen are known to have their own time sync
  ○ If no time sync is provided, install an NTP daemon
Tuning Linux: I/O Scheduler
● The algorithm the kernel uses to commit reads and writes to disk
● CFQ
  ○ “Completely Fair Queue”
  ○ Perhaps too clever/inefficient for database workloads
  ○ Probably good for a laptop; assumes multi-use
● Deadline
  ○ Best general default IMHO
  ○ Predictable I/O request latencies
Tuning Linux: I/O Scheduler
● Noop
  ○ Use with virtualised servers
  ○ Use with real-hardware BBU RAID controllers
Tuning Linux: Filesystems
● Types
  ○ Use XFS or EXT4, not EXT3
    ■ EXT3 has very poor pre-allocation performance
    ■ Use XFS only on WiredTiger
    ■ EXT4 “data=ordered” mode recommended
  ○ Btrfs not tested, yet!
● Options
  ○ Set ‘noatime’ on MongoDB data volumes in ‘/etc/fstab’
  ○ Remount the filesystem after an options change, or reboot
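An illustrative ‘/etc/fstab’ entry for a dedicated MongoDB data volume (the device name and mount point here are hypothetical, to be replaced with your own):

```
# /etc/fstab - MongoDB data volume with noatime set
/dev/sdb1  /var/lib/mongo  xfs  defaults,noatime  0 0
```

After editing, `mount -o remount /var/lib/mongo` (or a reboot) applies the new options.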
Tuning Linux: Block Device Readahead
● A tuning that causes data ahead of a block on disk to be read and then cached
● Assumption:
  ○ There is a sequential read pattern
  ○ Something will benefit from the extra cached blocks
● Risk
  ○ Set too high, it wastes cache space
  ○ Increases eviction work
    ■ MongoDB tends to have very random disk patterns
● A good start for MongoDB volumes is ‘32’ (16kb) of read-ahead
  ○ Let MongoDB worry about optimising the pattern
Tuning Linux: Block Device Readahead
● Change Read-Ahead
  ○ Add a file to ‘/etc/udev/rules.d’
    ■ /etc/udev/rules.d/60-mongodb-disk.rules:

      # set deadline scheduler and 32/16kb read-ahead for /dev/sda
      ACTION=="add|change", KERNEL=="sda", ATTR{queue/scheduler}="deadline", ATTR{bdi/read_ahead_kb}="16"

  ○ Reboot (or use CLI tools to apply)
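The two units above are easy to mix up: `blockdev --setra` takes 512-byte sectors, while the udev `bdi/read_ahead_kb` attribute takes kilobytes. The recommended ‘32’ sectors is the same as the ‘16’ kb in the rule:

```shell
# Read-ahead in sectors (512 bytes each) converted to kilobytes:
sectors=32
kb=$(( sectors * 512 / 1024 ))
echo "${kb}kb read-ahead"   # prints: 16kb read-ahead
```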
Tuning Linux: Virtual Memory Dirty Pages
● Dirty Pages
  ○ Pages stored in-cache, but needing to be written to storage
● Dirty Ratio
  ○ Max percent of total memory that can be dirty
  ○ The VM stalls and flushes when this limit is reached
  ○ Start with ‘10’; the default (30) is too high
Tuning Linux: Virtual Memory Dirty Pages
● Dirty Background Ratio
  ○ A separate threshold for background dirty page flushing
  ○ Flushes without pauses
  ○ Start with ‘3’; the default (15) is too high
Tuning Linux: Swappiness
● A Linux kernel sysctl setting for preferring RAM or disk for swap
  ○ Linux default: 60
  ○ To avoid disk-based swap: 1 (not zero!)
  ○ To allow some disk-based swap: 10
  ○ ‘0’ can cause more swapping than ‘1’ on recent kernels
    ■ More on this here: https://www.percona.com/blog/2014/04/28/oom-relation-vm-swappiness0-new-kernel/
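The swappiness value here and the dirty-page ratios from the previous slides can be staged as one sysctl fragment. A sketch: it writes to the current directory for a dry run; as root, copy the file into /etc/sysctl.d/ and apply with `sysctl -p <file>` (the filename is arbitrary):

```shell
# Stage a sysctl fragment with the suggested VM starting points
cat > ./99-mongodb-vm.conf <<'EOF'
vm.swappiness = 1
vm.dirty_ratio = 10
vm.dirty_background_ratio = 3
EOF
cat ./99-mongodb-vm.conf
```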
Tuning Linux: Ulimit
● Allows per-Linux-user resource constraints
  ○ Number of User-level Processes
  ○ Number of Open Files
  ○ CPU Seconds
  ○ Scheduling Priority
  ○ And others…
Tuning Linux: Ulimit
● MongoDB
  ○ Should probably have a dedicated VM, container or server
  ○ Creates a new thread (counted as a process by ulimit)
    ■ For every new connection to the Database
    ■ Plus various background tasks / threads
  ○ Creates an open file for each active data file on disk
    ■ 64,000 open files and 64,000 max processes is a good start
Tuning Linux: Ulimit
● Setting ulimits
  ○ A /etc/security/limits.d file
  ○ Systemd Service
  ○ Init script
● Ulimits are set by the Percona and MongoDB packages!
  ○ Example: the PSMDB RPM (Systemd) service unit
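A limits.d fragment matching the “64,000 open files / processes” starting point might look like this sketch; the `mongod` user name is an assumption (it depends on which package created the service user):

```
# /etc/security/limits.d/99-mongodb.conf - illustrative starting point
mongod  soft  nofile  64000
mongod  hard  nofile  64000
mongod  soft  nproc   64000
mongod  hard  nproc   64000
```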
Tuning Linux: Network Stack
● Defaults are not good for > 100mbps Ethernet
● Suggested starting point: see the sysctl tunings in the “Tuning Linux for MongoDB” blog post
● Set Network Tunings:
  ○ Add the sysctl tunings to /etc/sysctl.conf
  ○ Run “/sbin/sysctl -p” as root to set the tunings
  ○ Run “/sbin/sysctl -a” to verify the changes
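A sketch of that kind of starting point, staged as a local file for review (the values are illustrative, drawn from Percona’s “Tuning Linux for MongoDB” post; verify them against your kernel and workload before copying into /etc/sysctl.conf or /etc/sysctl.d/ as root):

```shell
# Stage a network sysctl fragment for review before applying
cat > ./99-mongodb-net.conf <<'EOF'
net.core.somaxconn = 4096
net.core.netdev_max_backlog = 1024
net.ipv4.tcp_max_syn_backlog = 4096
net.ipv4.tcp_fin_timeout = 30
net.ipv4.tcp_keepalive_time = 120
net.ipv4.tcp_keepalive_intvl = 30
EOF
cat ./99-mongodb-net.conf
```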
Tuning Linux: More on this...
https://www.percona.com/blog/2016/08/12/tuning-linux-for-mongodb/
Tuning Linux: “Tuned”
● Tuned
  ○ A “framework” for applying tunings to Linux
    ■ RedHat/CentOS 7 only for now
    ■ Debian added tuned; not sure if compatible yet
    ■ Cannot tune NUMA, filesystem type or fs mount opts
    ■ Sysctls, THP, I/O sched, etc.
● My apology to the community for writing “Tuning Linux for MongoDB”:
  ○ https://github.com/Percona-Lab/tuned-percona-mongodb
To be continued...
May 3rd, 2018
Questions?