TRANSCRIPT
Umeng Operations Infrastructure & Practice
Wang Yuxi, Umeng
About Me
● Before 2014, the only ops engineer at Umeng
● Now, a core member of the ops team
● Technical generalist, responsible for the overall reliability and performance
of Umeng
● ArchLinux user
@Jasey_Wang | http://JaseyWang.Me
Agenda
● About Umeng
● IDC
● Network
● Server
● Product
● On Giant’s Shoulders
● OS
● User Management
● Critical Infrastructure
● Package Management
● Code Deployment
● Configuration Management
● Monitoring
● Tuning
● Documentation
● Outage & Diagnose
● Security
● With Dev
● What We Are Doing Now
About Umeng
● Founded in April 2010
● Incubated by Innovation
Works
● $10 Million raised from
Matrix China
● Acquired by Alibaba
● Largest mobile app analytics platform in China
● 400K+ apps
● ~1B mobile devices
Network
● Bandwidth
o 4Gbps+
o BGP cost
● Internal Network
o 10G interconnection
o Third network arch upgrade in Q2, 2014
Nexus 752
Bonding
o OOB issue
Server
● Before 2014
o Dell (11G, 12G)
● Now
o Dell, HP, Huawei, Inspur
● 10G NICs, enterprise SSDs
● Redundant hot-plug power supplies
● Hot-plug hard drives
Product
● Real-time analytics (thunder)
o 150k req/s
o ~5B logs/day
o 100+ shards
● Batch processing system (iceberg)
o ~300 2U nodes, 2T/3T 7200rpm SAS
o ~3T/day incremental data
o 4P/5P used
● Push, Social
On Giant’s Shoulders
● OSS
o Nginx (Tengine)
o Finagle, Thrift
o Redis
o Kafka
o Storm
o MongoDB
o Hadoop & ecosystem
● Enterprise
o Google Apps
o Github enterprise
o Redhat
o NewRelic
o CDN
OS
● Before 2013
o Ubuntu 10.04/12.04
● Now
o RedHat 6.2, 2.6.32-279 (80%)
o professional technical support
● BIOS, RAID
o automatic tools
o done before delivery
http://goo.gl/TyDEVR
OS(cont.)
● OS template
o ks & preseed (great pain)
o partition(ext3/ext4, mount options)
o unnecessary service(irqbalance, cpuspeed, netfilter, etc.)
o sshd, monitoring agent
o handy tools(nmap, tcpdump, htop, iftop, screen, etc.)
o lang(Java/Scala, Python, Ruby)
● Custom init setup via Cobbler
● Hosts added automatically to Zabbix
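As a sketch only (not Umeng's actual template), the service and tool items above could land in a Kickstart %post section like this:

```
# Kickstart %post sketch: strip services the template does not need
# and preinstall the handy tools listed above
%post
chkconfig irqbalance off
chkconfig cpuspeed off
chkconfig iptables off        # netfilter disabled by default
yum -y install nmap tcpdump htop iftop screen
%end
```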
User Management
● OpenVPN (multi-path)
o Incredibly stable for 3 years, ZERO outages
o TCP vs UDP
● Public key
o OK for a startup, quick & dirty
● IPA (identity, policy, audit (snoopy))
o preferred
● Headache for us, for historical reasons
o engineers enjoy the “free style”
o so, the sooner the better
Critical Infrastructure
● DNS
o use IP, not hostname in your code
o retry, timeout
● NTP
● Netfilter
o disabled by default
o conntrack
o NAT server
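The DNS retry/timeout point can be made concrete in the resolver config; a sketch with hypothetical internal nameserver addresses and example values:

```
# /etc/resolv.conf sketch: bound resolver latency so a dead DNS
# server fails fast instead of hanging the application
# (values and addresses are illustrative)
options timeout:1 attempts:2 rotate
nameserver 10.0.0.2
nameserver 10.0.0.3
```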
Package Management
● Internal repo
o sync periodically
o GFWed issue :-(
● Do you really need to compile?
● Package manager
o yum/apt
o rpm/dpkg
o how we use them
● One-package principle
o rpm
o tgz
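One way to read the one-package principle: everything a service needs ships in a single rpm. A skeleton .spec, with hypothetical name, version, and paths, might look like:

```
# myapp.spec skeleton: binary, config, and init script in ONE rpm
# (name, version, and paths are hypothetical)
Name:     myapp
Version:  1.0.0
Release:  1
Summary:  Example service packaged as a single rpm
License:  Proprietary

%description
Self-contained package: one service, one rpm.

%files
/opt/myapp
/etc/init.d/myapp
```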
Code Deployment
● Capistrano
o written in Ruby
o deploys any language
o easy to use
● Configuration management
o dev use
o ops
Configuration Management
● 2011
o tens of servers
o free to use, mainly shell
● 2012 ~ 2013
o just ME
o Puppet is OK, learned some Ruby
o tens of modules written by me
● Now
o prerequisites
team skill tree
learning curve
o Puppet
obsolete in new IDC
complex syntax, slow
o Saltstack
easy to pick up
flexible & plain
Ansible as backup
o Python/Ruby scripts, product level
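The "flexible & plain" appeal of Saltstack shows in its state files; a minimal hypothetical state (CentOS package and service names assumed) could be:

```yaml
# ntp.sls: install and run NTP, in plain YAML
ntp:
  pkg.installed: []
  service.running:
    - name: ntpd          # CentOS service name
    - enable: True
    - require:
      - pkg: ntp
```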
Monitoring
● Metrics, Metrics, Metrics!!!
● “All monitoring software evolves towards becoming an
implementation of Nagios”
http://goo.gl/PvBYky
Monitoring(cont.)
● From top to bottom
o customer perspective
o business level (DAU, etc.), critical, sensitive
o application level (qps, latency, return codes, exceptions)
o system level (load, NIC, CPU, memory)
fork
swap in/out
nic speed/drops/errors
tcp queue, retransmit
o hardware level
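For instance, the TCP retransmit metric can be derived from two counters in /proc/net/snmp. A sketch with a captured sample inlined so the arithmetic is reproducible; on a live host, read the file directly:

```shell
#!/bin/sh
# Sketch: compute the TCP retransmit ratio (one of the system-level
# metrics above) from /proc/net/snmp counters. The sample below is a
# captured snapshot; replace it with `cat /proc/net/snmp` in practice.
sample='Tcp: RtoAlgorithm RtoMin RtoMax MaxConn ActiveOpens PassiveOpens AttemptFails EstabResets CurrEstab InSegs OutSegs RetransSegs InErrs OutRsts
Tcp: 1 200 120000 -1 1000 2000 10 5 300 500000 400000 2000 0 50'
# On the Tcp values line, OutSegs is field 12 and RetransSegs field 13
ratio=$(printf '%s\n' "$sample" | awk 'NR==2 {printf "%.2f", $13/$12*100}')
echo "tcp retransmit ratio: ${ratio}%"
```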
Monitoring(cont.)
● Ideal
o near-real-time
o flexible: 5s, 60s, 300s, 1800s
o comparable by date/time
o active/passive or just feed
● Dashboard (core metrics)
● Before
o Nagios/Munin (out of the box)
● Now
o Zabbix/Graphite
o networkbench, alibench (user end)
o New Relic
● Log
o rsyslog
o ELK
o scripts
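A sketch of the rsyslog side: forward everything to a central collector feeding ELK, with a disk-assisted queue so log bursts survive collector outages. The hostname is hypothetical; directives are in the legacy rsyslog style of that era:

```
# /etc/rsyslog.conf fragment (legacy directive style)
$WorkDirectory /var/lib/rsyslog
$ActionQueueType LinkedList        # queue messages in memory
$ActionQueueFileName fwdq          # spill to disk when the queue grows
$ActionResumeRetryCount -1         # retry forever if the collector is down
*.* @@logcollector.internal:514    # @@ = TCP, single @ = UDP
```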
Tuning
● From app level to system level
● App level, not covered here
● System level: takeaways for common use
● Don’t forget hardware(BIOS, RAID)
● Baseline comes first
● One modification one time
● Never over-optimize
o “it works”, then “it runs happily”
o business driven
Tuning(cont.)
● Don’t modify kernel parameters unless 100% sure
o timestamp issue
o ecn issue
● TCP related
● Ring buffer, interrupts, open files, etc.
● DB, watch out
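A sketch of the kind of /etc/sysctl.conf entries involved; the values are illustrative, and per the baseline rule above each deserves its own measured change:

```
# illustrative values only; benchmark against a baseline first
net.ipv4.tcp_timestamps = 1     # the timestamp issue: tcp_tw_recycle
net.ipv4.tcp_tw_recycle = 0     # plus timestamps breaks clients behind NAT
net.ipv4.tcp_ecn = 0            # the ecn issue: some middleboxes drop ECN
net.core.netdev_max_backlog = 30000
fs.file-max = 1000000
```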
Documentation
● Routine
o regular deploy & setup, weekly report
o online standards, 100+ slides for engineers
o ops share every Thu
● Post-Mortem
o blameless
o timeline & deadline
● Github Wiki & Google Docs
Outage & Diagnose
● This year (2014)
o SLA 99% ~ 99.9%
o issues every week, mostly invisible to customers
● When the site is down
o from bottom to top, or vice-versa
o a good bug can be reproduced
o tools are key power
system http://goo.gl/wrNLi7
app
o inform support & BD
o technical background shared (http://blog.umeng.com/?cat=4)
● The network is unreliable, and it can break down
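One reproducible bottom-up check of this kind: listen-queue overflows, reported by `netstat -s`, are a classic invisible cause of request failures under load. A sketch with sample output inlined so the parse is testable:

```shell
#!/bin/sh
# Sketch: count listen-queue overflows from `netstat -s` output.
# The sample below is a captured snippet; on a live host pipe
# `netstat -s` in instead.
sample='    1024 times the listen queue of a socket overflowed
    1024 SYNs to LISTEN sockets dropped'
overflows=$(printf '%s\n' "$sample" |
  awk '/times the listen queue of a socket overflowed/ {print $1}')
echo "listen queue overflows: $overflows"
```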
Security
● IP issue, long, long history
o public & private IP
o port restricted, listen()
o oob
● test IDC
● UDP amplification
● Bash, SSL vulnerability
● DDoS
● whitehat(WooYun, etc.) http://goo.gl/Q1SkXV
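A sketch for the "port restricted, listen()" point: find daemons bound to 0.0.0.0 (every interface, public IP included) instead of the private address. A captured `ss -ltn` sample is inlined here so the check is reproducible:

```shell
#!/bin/sh
# Sketch: flag listeners bound to all interfaces. The sample below
# stands in for live `ss -ltn` output (field 4 is the local address).
sample='LISTEN 0 128 0.0.0.0:6379 0.0.0.0:*
LISTEN 0 128 10.0.0.5:9000 0.0.0.0:*'
exposed=$(printf '%s\n' "$sample" | awk '$4 ~ /^0\.0\.0\.0:/ {print $4}')
echo "listening on all interfaces: $exposed"
```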
With Dev
● Tradeoff
o less dev work usually means a more reliable system
o there will always be conflicts between ops & dev
unless one of them gives in
aggressive or mild, choose one
● Understand the business logic
o code talks
o data talks
http://goo.gl/Qwh6Ze
What We Are Doing Now
● New IDCs, New beginning, Great challenge
o active - backup
o active - active
● Transfer data from BJ to SH
● Env setup, stress test, benchmark
● Finally, switchover
http://goo.gl/TMDnnS
What We Are Doing Now(cont.)
● Private Cloud
o capex & opex
o resource (hardware, software)
o workforce