TRANSCRIPT
Umeng Operations Infrastructure & Practice
Wang Yuxi, Umeng
About Me
● Before 2014, the only ops engineer at Umeng
● Now, a core member of the ops team
● Technical generalist, responsible for the overall reliability and performance
of Umeng
● ArchLinux user
@Jasey_Wang | http://JaseyWang.Me
Agenda
● About Umeng
● IDC
● Network
● Server
● Product
● On Giant’s Shoulders
● OS
● User Management
● Critical Infrastructure
● Package Management
● Code Deployment
● Configuration Management
● Monitoring
● Tuning
● Documentation
● Outage & Diagnose
● Security
● With Dev
● What We Are Doing Now
About Umeng
● Founded in April 2010
● Incubated by Innovation
Works
● $10 Million raised from
Matrix China
● Acquired by Alibaba
● Largest mobile app analytics platform in China
● 400K+ apps
● ~1B mobile devices
Network
● Bandwidth
o 4Gbps+
o BGP cost
● Internal Network
o 10G interconnection
o Third network arch upgrade in Q2, 2014
Nexus 752
Bonding
o OOB issue
Server
● Before 2014
o Dell (11G, 12G)
● Now
o Dell, HP, Huawei, Inspur
● 10G NICs, enterprise SSDs
● Redundant hot-plug power supplies
● Hot-plug hard drives
Product
● Real-time analytics (thunder)
o 150k req/s
o ~5B logs/day
o 100+ shards
● Batch processing system (iceberg)
o ~300 2U nodes, 2T/3T 7200rpm SAS
o ~3T/day incremental data
o 4P/5P used
● Push, Social
On Giant’s Shoulders
● OSS
o Nginx (Tengine)
o Finagle, Thrift
o Redis
o Kafka
o Storm
o MongoDB
o Hadoop & ecosystem
● Enterprise
o Google Apps
o Github enterprise
o Redhat
o NewRelic
o CDN
OS
● Before 2013
o Ubuntu 10.04/12.04
● Now
o RedHat 6.2, 2.6.32-279 (80%)
o professional technical support
● BIOS, RAID
o automatic tools
o done before delivery
http://goo.gl/TyDEVR
OS(cont.)
● OS template
o ks & preseed (great pain)
o partition(ext3/ext4, mount options)
o unnecessary service(irqbalance, cpuspeed, netfilter, etc.)
o sshd, monitoring agent
o handy tools(nmap, tcpdump, htop, iftop, screen, etc.)
o lang(Java/Scala, Python, Ruby)
● Custom init setup via Cobbler
● Hosts added automatically to Zabbix
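As a sketch only (not Umeng's actual template), the service and tool items above could land in a Kickstart %post section like this:

```
# Kickstart %post sketch: strip services the template does not need
# and preinstall the handy tools listed above
%post
chkconfig irqbalance off
chkconfig cpuspeed off
chkconfig iptables off        # netfilter disabled by default
yum -y install nmap tcpdump htop iftop screen
%end
```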
User Management
● OpenVPN (multi-path)
o Incredibly stable for 3 years, ZERO outages
o TCP vs UDP
● Public key
o OK for a startup, quick & dirty
● IPA (identity, policy, audit (snoopy))
o preferred
● Headache for us, for historical reasons
o engineers enjoy the “free style”
o so, the sooner the better
Critical Infrastructure
● DNS
o use IP, not hostname in your code
o retry, timeout
● NTP
● Netfilter
o disabled by default
o conntrack
o NAT server
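The DNS retry/timeout point can be made concrete in the resolver config; a sketch with hypothetical internal nameserver addresses and example values:

```
# /etc/resolv.conf sketch: bound resolver latency so a dead DNS
# server fails fast instead of hanging the application
# (values and addresses are illustrative)
options timeout:1 attempts:2 rotate
nameserver 10.0.0.2
nameserver 10.0.0.3
```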
Package Management
● Internal repo
o sync periodically
o GFWed issue :-(
● Do you really need to compile?
● Package manager
o yum/apt
o rpm/dpkg
o how we use them
● One-package principle
o rpm
o tgz
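One way to read the one-package principle: everything a service needs ships in a single rpm. A skeleton .spec, with hypothetical name, version, and paths, might look like:

```
# myapp.spec skeleton: binary, config, and init script in ONE rpm
# (name, version, and paths are hypothetical)
Name:     myapp
Version:  1.0.0
Release:  1
Summary:  Example service packaged as a single rpm
License:  Proprietary

%description
Self-contained package: one service, one rpm.

%files
/opt/myapp
/etc/init.d/myapp
```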
Code Deployment
● Capistrano
o written in Ruby
o deploys any language
o easy to use
● Configuration management
o dev use
o ops
Configuration Management
● 2011
o tens of servers
o free to use, mainly shell
● 2012 ~ 2013
o just ME
o Puppet is OK, learned some Ruby
o tens of modules written by me
● Now
o prerequisites
team skill tree
learning curve
o Puppet
obsolete in new IDC
complex syntax, slow
o Saltstack
easy to pick up
flexible & plain
Ansible as backup
o Python/Ruby scripts, product level
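The "flexible & plain" appeal of Saltstack shows in its state files; a minimal hypothetical state (CentOS package and service names assumed) could be:

```yaml
# ntp.sls: install and run NTP, in plain YAML
ntp:
  pkg.installed: []
  service.running:
    - name: ntpd          # CentOS service name
    - enable: True
    - require:
      - pkg: ntp
```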
Monitoring
● Metrics, Metrics, Metrics!!!
● “All monitoring software evolves towards becoming an
implementation of Nagios”
http://goo.gl/PvBYky
Monitoring(cont.)
● From top to bottom
o customer perspective
o business level (DAU, etc.), critical, sensitive
o application level (qps, latency, return codes, exceptions)
o system level (load, NIC, CPU, memory)
fork
swap in/out
nic speed/drops/errors
tcp queue, retransmit
o hardware level
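For instance, the TCP retransmit metric can be derived from two counters in /proc/net/snmp. A sketch with a captured sample inlined so the arithmetic is reproducible; on a live host, read the file directly:

```shell
#!/bin/sh
# Sketch: compute the TCP retransmit ratio (one of the system-level
# metrics above) from /proc/net/snmp counters. The sample below is a
# captured snapshot; replace it with `cat /proc/net/snmp` in practice.
sample='Tcp: RtoAlgorithm RtoMin RtoMax MaxConn ActiveOpens PassiveOpens AttemptFails EstabResets CurrEstab InSegs OutSegs RetransSegs InErrs OutRsts
Tcp: 1 200 120000 -1 1000 2000 10 5 300 500000 400000 2000 0 50'
# On the Tcp values line, OutSegs is field 12 and RetransSegs field 13
ratio=$(printf '%s\n' "$sample" | awk 'NR==2 {printf "%.2f", $13/$12*100}')
echo "tcp retransmit ratio: ${ratio}%"
```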
Monitoring(cont.)
● Ideal
o near-real-time
o flexible: 5s, 60s, 300s, 1800s
o comparable by date/time
o active/passive or just feed
● Dashboard (core metrics)
● Before
o Nagios/Munin (out of the box)
● Now
o Zabbix/Graphite
o networkbench, alibench (user end)
o New Relic
● Log
o rsyslog
o ELK
o scripts
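A sketch of the rsyslog side: forward everything to a central collector feeding ELK, with a disk-assisted queue so log bursts survive collector outages. The hostname is hypothetical; directives are in the legacy rsyslog style of that era:

```
# /etc/rsyslog.conf fragment (legacy directive style)
$WorkDirectory /var/lib/rsyslog
$ActionQueueType LinkedList        # queue messages in memory
$ActionQueueFileName fwdq          # spill to disk when the queue grows
$ActionResumeRetryCount -1         # retry forever if the collector is down
*.* @@logcollector.internal:514    # @@ = TCP, single @ = UDP
```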
Tuning
● From app level to system level
● App level, not covered here
● System level: takeaways for common use
● Don’t forget hardware(BIOS, RAID)
● Baseline comes first
● One modification one time
● Never over-optimize
o “it works”, then “it runs happily”
o business driven
Tuning(cont.)
● Don’t modify kernel parameters unless 100% sure
o timestamp issue
o ecn issue
● TCP related
● Ring buffer, interrupts, open files, etc.
● DB, watch out
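A sketch of the kind of /etc/sysctl.conf entries involved; the values are illustrative, and per the baseline rule above each deserves its own measured change:

```
# illustrative values only; benchmark against a baseline first
net.ipv4.tcp_timestamps = 1     # the timestamp issue: tcp_tw_recycle
net.ipv4.tcp_tw_recycle = 0     # plus timestamps breaks clients behind NAT
net.ipv4.tcp_ecn = 0            # the ecn issue: some middleboxes drop ECN
net.core.netdev_max_backlog = 30000
fs.file-max = 1000000
```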
Documentation
● Routine
o regular deploy & setup, weekly report
o online standards, 100+ slides for engineers
o ops share every Thu
● Post-Mortem
o blameless
o timeline & deadline
● Github Wiki & Google Docs
Outage & Diagnose
● This year (2014)
o SLA 99% ~ 99.9%
o issues every week, mostly invisible to customers
● When the site is down
o from bottom to top, or vice-versa
o a good bug can be reproduced
o tools are key power
system http://goo.gl/wrNLi7
app
o inform support & BD
o technical background shared (http://blog.umeng.com/?cat=4)
● The network is unreliable, and it can break down
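One reproducible bottom-up check of this kind: listen-queue overflows, reported by `netstat -s`, are a classic invisible cause of request failures under load. A sketch with sample output inlined so the parse is testable:

```shell
#!/bin/sh
# Sketch: count listen-queue overflows from `netstat -s` output.
# The sample below is a captured snippet; on a live host pipe
# `netstat -s` in instead.
sample='    1024 times the listen queue of a socket overflowed
    1024 SYNs to LISTEN sockets dropped'
overflows=$(printf '%s\n' "$sample" |
  awk '/times the listen queue of a socket overflowed/ {print $1}')
echo "listen queue overflows: $overflows"
```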
Security
● IP issue, long, long history
o public & private IP
o port restricted, listen()
o oob
● test IDC
● UDP amplification
● Bash, SSL vulnerability
● DDoS
● whitehat(WooYun, etc.) http://goo.gl/Q1SkXV
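A sketch for the "port restricted, listen()" point: find daemons bound to 0.0.0.0 (every interface, public IP included) instead of the private address. A captured `ss -ltn` sample is inlined here so the check is reproducible:

```shell
#!/bin/sh
# Sketch: flag listeners bound to all interfaces. The sample below
# stands in for live `ss -ltn` output (field 4 is the local address).
sample='LISTEN 0 128 0.0.0.0:6379 0.0.0.0:*
LISTEN 0 128 10.0.0.5:9000 0.0.0.0:*'
exposed=$(printf '%s\n' "$sample" | awk '$4 ~ /^0\.0\.0\.0:/ {print $4}')
echo "listening on all interfaces: $exposed"
```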
With Dev
● Tradeoff
o less dev work usually means a more reliable system
o there will always be conflicts between ops & dev
unless one of them gives in
aggressive or mild, choose one
● Understand the business logic
o code talks
o data talks
http://goo.gl/Qwh6Ze
What We Are Doing Now
● New IDCs, New beginning, Great challenge
o active - backup
o active - active
● Transfer data from BJ to SH
● Env setup, stress test, benchmark
● Finally, switchover
http://goo.gl/TMDnnS
What We Are Doing Now(cont.)
● Private Cloud
o capex & opex
o resource (hardware, software)
o workforce