Web 2.0 Performance and Reliability: How to Run Large Web Apps

Artur Bergman [email protected] Wikia Inc – We are hiring – Community/Bizdev in Germany – Engineers in Poland http://www.wikia.com/wiki/hiring O’Reilly Radar – http://radar.oreilly.com/artur/

Upload: adunne

Post on 17-Jan-2015

DESCRIPTION

Speaker: Artur Bergman

TRANSCRIPT

Page 1: Web 2.0 Performance and Reliability: How to Run Large Web Apps

Artur Bergman [email protected]

• Wikia Inc
  – We are hiring
  – Community/Bizdev in Germany
  – Engineers in Poland
  – http://www.wikia.com/wiki/hiring
• O’Reilly Radar
  – http://radar.oreilly.com/artur/

Page 2: Web 2.0 Performance and Reliability: How to Run Large Web Apps

The value of operations

• Google
• Orkut
• Friendster
• Myspace

Page 3: Web 2.0 Performance and Reliability: How to Run Large Web Apps

Benefits

• Users trust your brand
• They rely on you
• They spend more time on your site
• Bad operations waste R&D money
• Fixed amount of time + faster site = more page views

Page 4: Web 2.0 Performance and Reliability: How to Run Large Web Apps

Stepchild of Engineering

• Product development
• Engineering
• Operations
  – Sysadmins?
• Why?

Page 5: Web 2.0 Performance and Reliability: How to Run Large Web Apps

Operations Engineering

• It is engineering
• Google terminology
  – Site Reliability Engineer
• Sure, there are sysadmins too, people managing NOCs and datacenters
• Provide career growth

Page 6: Web 2.0 Performance and Reliability: How to Run Large Web Apps

Good Engineers

• Detail oriented
• Aspire to be operational engineers
• Stubborn
• Can steer their inner ADD
  – Interrupt driven
• Not the same as good developers

Page 7: Web 2.0 Performance and Reliability: How to Run Large Web Apps

Danger signs

• Thinks operations is a path to development engineering
  – Fire them
• Want people dedicated to the task
• A good operations engineer should spend some time in development
• A good development engineer MUST spend some time in operations

Page 9: Web 2.0 Performance and Reliability: How to Run Large Web Apps

Debugging

• 9 Rules of debugging
• http://www.debuggingrules.com/Poster_download.html
  – Yes the font is horrible

Page 10: Web 2.0 Performance and Reliability: How to Run Large Web Apps

Rule 1: Understand the system

• Complexity kills
• No excuse
• If you write it, you must know it
• If you run it, you must know it
• If you buy it, you must know it

Page 11: Web 2.0 Performance and Reliability: How to Run Large Web Apps

Rule 3: Quit thinking and look

• "It is a capital mistake to theorize before one has data. Insensibly one begins to twist facts to suit theories, instead of theories to suit facts."

Page 12: Web 2.0 Performance and Reliability: How to Run Large Web Apps

Rule 3: Quit thinking and look

• What do you look at?
• The importance of monitoring
• Monitoring
• Monitoring
• Monitoring

Page 13: Web 2.0 Performance and Reliability: How to Run Large Web Apps

My my, confusing term

• Monitoring
• Alerting
• Trending

Page 14: Web 2.0 Performance and Reliability: How to Run Large Web Apps

Monitoring

• Collects data
• Puts into databases
• Makes it available for you
• Active collection
• Passive interaction

Page 15: Web 2.0 Performance and Reliability: How to Run Large Web Apps

Alerting

• Acts on monitoring data
• Severe alerts
  – Active
  – Needs action
• Passive alerts
  – Things that need to be done but not right now
• DO NOT OVER ALERT
• DO NOT CRY WOLF
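
One way to picture the severe/passive split is a check that only alerts when several samples in a row breach a threshold. This is a minimal sketch, not anything shown in the talk: the names `AlertPolicy` and `classify` and all threshold values are hypothetical and would need tuning for a real site.

```python
from dataclasses import dataclass


@dataclass
class AlertPolicy:
    """Hypothetical thresholds; tune them to your own site."""
    warn_ms: float = 500.0      # slower than this: note it, fix it later
    page_ms: float = 2000.0     # slower than this: someone must act now
    consecutive: int = 3        # samples required before alerting at all


def classify(samples_ms, policy=AlertPolicy()):
    """Return 'severe', 'passive', or 'ok' for recent response times.

    Only the last `policy.consecutive` samples are considered, and every
    one of them must breach a threshold. A single slow sample never
    alerts, which is one simple way to avoid crying wolf.
    """
    recent = samples_ms[-policy.consecutive:]
    if len(recent) < policy.consecutive:
        return "ok"
    if all(s >= policy.page_ms for s in recent):
        return "severe"
    if all(s >= policy.warn_ms for s in recent):
        return "passive"
    return "ok"
```

A "severe" result would go to the phone; a "passive" one could simply land in a ticket queue.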

Page 16: Web 2.0 Performance and Reliability: How to Run Large Web Apps

Wikia alerting strategy

• When the site is slow
• Or down
• We send emails and do phone calls
• Europe and US West coast
• Looking to hire in East Asia
• No night time

Page 17: Web 2.0 Performance and Reliability: How to Run Large Web Apps

Trending

• Long term
• Capacity planning
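
As an illustration of trending for capacity planning, the sketch below fits a straight line to daily usage samples and extrapolates to the day the limit is reached. It assumes roughly linear growth, which real traffic often does not follow; the function name `days_until_full` and the sample numbers are made up for the example.

```python
def days_until_full(daily_usage, capacity):
    """Fit a straight line to daily usage samples and extrapolate.

    daily_usage: one measurement per day, oldest first (e.g. GB of disk).
    capacity: the limit in the same unit.
    Returns the estimated number of days after the last sample until the
    fitted line reaches capacity, or None if usage is flat or shrinking.
    """
    n = len(daily_usage)
    if n < 2:
        raise ValueError("need at least two samples to fit a trend")
    xs = range(n)
    mean_x = sum(xs) / n
    mean_y = sum(daily_usage) / n
    # Ordinary least-squares slope and intercept.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, daily_usage))
    slope = sxy / sxx
    if slope <= 0:
        return None
    intercept = mean_y - slope * mean_x
    day_full = (capacity - intercept) / slope
    return max(0.0, day_full - (n - 1))
```

With samples of 100, 110, 120 and 130 GB and a 200 GB limit, the fitted growth is 10 GB a day and the estimate is seven days after the last sample.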

Page 18: Web 2.0 Performance and Reliability: How to Run Large Web Apps

Monitor Tools

• Nagios
• Cacti
• MRTG
• Hyperic
• Cricket
• Ganglia

Page 19: Web 2.0 Performance and Reliability: How to Run Large Web Apps

External Monitoring

• Use one, it tells you what your clients see every x minutes
• Keynote
• Gomez
• Websitepulse (cheap, easy, I like them; no annoying sales force)

Page 20: Web 2.0 Performance and Reliability: How to Run Large Web Apps

Nagios

• Alerting
• Hassle
• C CGI??
• Doesn’t scale

Page 21: Web 2.0 Performance and Reliability: How to Run Large Web Apps

Hyperic

• Most exciting open source tool
• Agent based, self-configured
• Baseline alerting

Page 22: Web 2.0 Performance and Reliability: How to Run Large Web Apps

Cricket MRTG Cacti

• Impossible to configure
• You need to write tools to do it
• Especially Cacti
  – Somewhat more pleasant than clawing out your eyes

Page 23: Web 2.0 Performance and Reliability: How to Run Large Web Apps

Ganglia

• We love Ganglia
• Automatically graphs everything you want - just works
• Large scale clusters
• Multicast
• Zero config
• RRD

Page 24: Web 2.0 Performance and Reliability: How to Run Large Web Apps

http://ganglia.wikimedia.org/

• 270 hosts
• 880 CPUs
• 2 clusters
• 1.2 TB of memory

Page 25: Web 2.0 Performance and Reliability: How to Run Large Web Apps

http://ganglia.wikimedia.org

Page 26: Web 2.0 Performance and Reliability: How to Run Large Web Apps

Custom Ganglia Gmetrics

• Write your own

gmetric --name='Oldest query' --type=int32 --units='sec' --dmax=65 --value=`echo 'show processlist' | mysql -uroot -ppass | grep -v Sleep | grep -v 'system user' | head -2 | tail -1 | cut -f 6`

Page 27: Web 2.0 Performance and Reliability: How to Run Large Web Apps

Custom Ganglia Gmetrics

• Or Learn Unix

gmetric --name='Oldest query' --type=int32 --units='sec' --dmax=65 --value=`echo 'show processlist' | mysql -uroot -ppass | grep -v Sleep | grep -v 'system user' | head -2 | tail -1 | cut -f 6`
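
The same metric can be sketched as a small script, which is easier to test than the pipeline. This is an illustration under assumptions, not the speaker's code: it expects the tab-separated batch output of `show processlist` from the mysql client, and the function names are hypothetical. Note one deliberate difference: the one-liner reports the first non-sleeping row, while this version takes the largest Time value. The gmetric flags are the ones already used on the slide.

```python
def oldest_query_seconds(processlist_tsv):
    """Return the largest Time value among active, non-system threads.

    Expects the tab-separated text printed by the mysql client for
    `show processlist` in batch mode: a header row, then one row per
    thread with the columns Id, User, Host, db, Command, Time, State, Info.
    """
    oldest = 0
    for line in processlist_tsv.splitlines()[1:]:   # skip the header row
        fields = line.split("\t")
        if len(fields) < 6:
            continue
        user, command, seconds = fields[1], fields[4], fields[5]
        if command == "Sleep" or user == "system user":
            continue
        if seconds.isdigit():
            oldest = max(oldest, int(seconds))
    return oldest


def gmetric_argv(value):
    """Build the gmetric command with the same flags as the slide."""
    return [
        "gmetric",
        "--name=Oldest query",
        "--type=int32",
        "--units=sec",
        "--dmax=65",
        f"--value={value}",
    ]
```

On a host with Ganglia installed, the list returned by `gmetric_argv` could be passed to `subprocess.run(..., check=True)` from cron once a minute, which fits the `--dmax=65` expiry.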

Page 31: Web 2.0 Performance and Reliability: How to Run Large Web Apps

Something is wrong

• Don’t worry, data warehouse

Page 32: Web 2.0 Performance and Reliability: How to Run Large Web Apps

tcpdump / Wireshark

• If you suspect the network
• Don’t just suspect
• LOOK AT IT
• tcpdump / Wireshark will tell you
  – If your packets are lost, delayed or corrupted
  – If your windowing is wrong

Page 33: Web 2.0 Performance and Reliability: How to Run Large Web Apps

Rule 4: Divide and Conquer

• Look at the problems in turn
• Split between people
• Go in the order you suspect is the most likely

Page 34: Web 2.0 Performance and Reliability: How to Run Large Web Apps

Rule 5: Change one thing at a time

• I cannot stress this enough
• IF YOU DO NOT THEN YOU HAVE FAILED TO IDENTIFY THE PROBLEM

Page 35: Web 2.0 Performance and Reliability: How to Run Large Web Apps

Rule 6: Keep an audit trail

• You might be making things worse
• Good for the root cause analysis
• Have your shell log all commands
  – Good practice anyway
• Version control

Page 36: Web 2.0 Performance and Reliability: How to Run Large Web Apps

Rule 9: If you didn’t fix it, it ain’t fixed

• You must do something to fix a problem
• Or it will bite you again
• And again
• And again
• They don’t just appear and disappear
• Except BGP route convergence :)

Page 37: Web 2.0 Performance and Reliability: How to Run Large Web Apps

Process

• You need a little
• Don’t worry

Page 38: Web 2.0 Performance and Reliability: How to Run Large Web Apps

Don’t forget

Page 39: Web 2.0 Performance and Reliability: How to Run Large Web Apps

Complexity kills

• Design against it
• Reuse components
• Define standards
• Have a few images that all machines look like - reimage machines every now and then for the heck of it
  – EC2 forces you to do this

Page 40: Web 2.0 Performance and Reliability: How to Run Large Web Apps

MTBF: Mean Time Between Failures

• Actually mostly irrelevant
• Dealing with failure is more important
• Target the right uptime
  – Complexity scales exponentially with required uptime
• Don’t kid yourself, you don’t need 5 nines
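
To see why five nines is such a demanding target, it helps to turn each level into a yearly downtime budget. The sketch below is plain arithmetic; the function name is hypothetical and a 365-day year is assumed.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600, ignoring leap years


def downtime_budget_minutes(nines):
    """Minutes of downtime per year allowed by an uptime of N nines.

    Two nines means 99%, three nines 99.9%, and so on.
    """
    unavailability = 10 ** -nines
    return MINUTES_PER_YEAR * unavailability


for n in range(2, 6):
    print(f"{n} nines: {downtime_budget_minutes(n):8.1f} minutes per year")
```

Four nines leaves a little under an hour a year; five nines leaves a little over five minutes, which is less than most manual recoveries take.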

Page 41: Web 2.0 Performance and Reliability: How to Run Large Web Apps

MTTR: Mean Time To Recovery

• Important
• No one cares if you fail once a minute
  – If you recover in 50 ms
• If you are down 1 minute a week, you are still going to hit 4 nines (99.99%)
• Failures happen, plan how to deal with them
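
The two examples on this slide can be checked with a few lines of arithmetic. This is only a sketch of the calculation; the helper name `availability` is made up here.

```python
def availability(downtime_seconds, period_seconds):
    """Fraction of the period during which the site was up."""
    return 1.0 - downtime_seconds / period_seconds


WEEK = 7 * 24 * 60 * 60

# One minute of downtime per week, the slide's example.
one_minute_a_week = availability(60, WEEK)

# Failing once a minute but recovering in 50 ms each time.
flapping = availability(0.050, 60)

print(f"1 minute per week:       {one_minute_a_week:.4%}")
print(f"50 ms outage per minute: {flapping:.4%}")
```

One minute a week works out to just over 99.99%. The once-a-minute case is about 99.92% on paper, which is lower, yet each outage is so short that few users would notice. That is a reasonable reading of why recovery time matters more than failure count.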

Page 42: Web 2.0 Performance and Reliability: How to Run Large Web Apps

Problem found

• If it is critical, start a phone conversation
• Use IRC to communicate technical data
• One person liaises with non-technical staff
• One person specifically in command
• Sleep scheduling (audit log important)

Page 43: Web 2.0 Performance and Reliability: How to Run Large Web Apps

Post crisis

• Root cause analysis
  – Just find out what went wrong
  – And how to avoid it
  – Or fix it faster next time if you can’t
• Keep track of your uptime

Page 44: Web 2.0 Performance and Reliability: How to Run Large Web Apps

Automation

• All machines are created equal
• Seriously
• If you manually make changes
• You are wrong
  – Unless you know what you are doing

Page 45: Web 2.0 Performance and Reliability: How to Run Large Web Apps

Best practices

• Version control
• Gold images
• Centralised authentication
• Time sync (NTP)
• Central logging
• (All of this applies to virtual machines too!)

Page 46: Web 2.0 Performance and Reliability: How to Run Large Web Apps

cfengine

• Standard automation tool
• Written in C
• Not much support
• Very good
• Very annoying

Page 47: Web 2.0 Performance and Reliability: How to Run Large Web Apps

control:

   site           = ( mysite )
   domain         = ( mysite.country )
   sysadm         = ( mark )
   netmask        = ( 255.255.255.0 )
   actionsequence = ( mountall mountinfo addmounts mountall links )
   mountpattern   = ( /$(site)/$(host) )
   homepattern    = ( u? )

Page 48: Web 2.0 Performance and Reliability: How to Run Large Web Apps

Puppet

• New hip kid on the block
• Written in Ruby
• Better support?
• Much nicer syntax
• Easier to extend

Page 49: Web 2.0 Performance and Reliability: How to Run Large Web Apps

define yumrepo (enabled = true) {
    configfile { "/etc/yum.repos.d/$name.repo":
        mode   => 644,
        source => "/yum/repos/$name.repo",
        ensure => $enabled ? {
            true    => file,
            default => absent
        }
    }
}

Page 50: Web 2.0 Performance and Reliability: How to Run Large Web Apps

cobbler

• Automatic PXE installer
  – Uses kickstart files
• Red Hat Enterprise
• CentOS
• Fedora
• Some support for Debian

Page 51: Web 2.0 Performance and Reliability: How to Run Large Web Apps

cobbler

cobbler system add --name=xen8 --mac=00:19:B9:EE:6D:0A --ip=10.10.30.208 --profile=Centos-5-x86_64 --kopts='ksdevice=00:19:B9:EE:6D:0A console=ttyS1,57600 console=tty0'

Page 53: Web 2.0 Performance and Reliability: How to Run Large Web Apps

koan

• Client install tool
  – Xen
  – Or OS re-image

koan --server=10.10.30.205 --virt --profile=virt_fc6 --virt-name=otrs

Page 54: Web 2.0 Performance and Reliability: How to Run Large Web Apps

Your datacenter

• Keep it tidy
  – Label things, keep cables as short as possible
  – Have a switch in each rack
• If you are small without dedicated DC staff you need
  – Remote control power switches
  – Remote console!

Page 55: Web 2.0 Performance and Reliability: How to Run Large Web Apps

Virtualization

• Please use it
• Managing becomes much easier
• Power consumption
• Need a new test box
  – The requestor can have it in minutes

Page 56: Web 2.0 Performance and Reliability: How to Run Large Web Apps

Power consumption

• Maybe not as important in Europe
• 8 core machines are more efficient than 1 core
• But memcache uses 1 core and all RAM
• Get more RAM and virtualise

Page 57: Web 2.0 Performance and Reliability: How to Run Large Web Apps

Our network admin boxes

• 1 Xen CPU for Vyatta
• 1 Xen CPU for LVS
• 1 Xen CPU for Squid - CARP
• 1 Xen CPU for Squid
• 1 Xen CPU for monitoring
• 1 Xen CPU for network tasks
• We can have more of these and a loss of one affects us less

Page 58: Web 2.0 Performance and Reliability: How to Run Large Web Apps

Vyatta

• Open source router
  – Really like it
  – No need to use Cisco

Page 59: Web 2.0 Performance and Reliability: How to Run Large Web Apps

LVS

• Linux Virtual Server
• Low level load balancer
• HA
• Fast
• Doesn’t inspire people to put things in the only place that is hard to scale

Page 60: Web 2.0 Performance and Reliability: How to Run Large Web Apps

Squid Carp

• Squids configured to hash the URLs and send them to a specific backend
• Very little configuration done
• Logging over UDP - no disk IO
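
The idea of hashing a URL to pick a backend can be sketched in a few lines. This is not Squid's actual CARP implementation, which uses its own hash function and per-peer load factors; it is a simplified highest-score (rendezvous) hash with MD5 as a stand-in, and the backend names are hypothetical.

```python
import hashlib


def pick_backend(url, backends):
    """Choose one backend for a URL by highest combined hash score.

    Every front-end that knows the same backend list picks the same
    backend for the same URL, so each object is cached in one place.
    Removing a backend only remaps the URLs that were assigned to it.
    """
    if not backends:
        raise ValueError("no backends configured")

    def score(backend):
        digest = hashlib.md5(f"{backend}|{url}".encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big")

    return max(backends, key=score)
```

Because the choice depends only on the URL and the backend list, the front-end layer needs no shared state, which matches the "very little configuration" point above.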

Page 61: Web 2.0 Performance and Reliability: How to Run Large Web Apps

Squid

• As a reverse web accelerator
• 90% of our hits served from RAM in less than 1 ms
• Same as Wikipedia
• We only use RAM cache (unlike Wikipedia)
• Cached per user
• If not cacheable - cache for a second to reduce backend effect
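
The "cache for a second" trick is easy to illustrate outside Squid. The class below is a hypothetical in-process sketch, not Squid configuration: it keeps each response for one second, so a page requested a thousand times a second reaches the backend roughly once a second.

```python
import time


class MicroCache:
    """Cache each key's response for a very short time (default 1 s)."""

    def __init__(self, fetch, ttl_seconds=1.0, clock=time.monotonic):
        self._fetch = fetch          # function that hits the backend
        self._ttl = ttl_seconds
        self._clock = clock          # injectable so tests can fake time
        self._entries = {}           # key -> (expires_at, value)

    def get(self, key):
        now = self._clock()
        entry = self._entries.get(key)
        if entry is not None and entry[0] > now:
            return entry[1]          # still fresh: backend is not touched
        value = self._fetch(key)
        self._entries[key] = (now + self._ttl, value)
        return value
```

The trade-off is that users may see content up to one second stale, which is usually acceptable for pages that would otherwise not be cached at all.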

Page 62: Web 2.0 Performance and Reliability: How to Run Large Web Apps

App servers

• 1 Xen CPU for memcache (5 GB RAM)
• 1 Xen CPU for Squid (5 GB RAM)
• 6 Xen CPUs for Apache (6 GB RAM)
• More power efficient, less affected by loss
• Applications can’t affect each other

Page 63: Web 2.0 Performance and Reliability: How to Run Large Web Apps

Databases

• Keep developers on a short leash
• Report bad queries
• Fear object relational mappers

Page 64: Web 2.0 Performance and Reliability: How to Run Large Web Apps

Outsourcing

• As much as possible
• The younger you are as a company the less risk
  – When you have no users, you have no value
• VCs don’t like having their money go into Capex

Page 65: Web 2.0 Performance and Reliability: How to Run Large Web Apps

What I want from Vendors

• They do what they tell me
• They do what I tell them
• No annoying up sells, no premium services
  – I know more about what you are selling than you

Page 66: Web 2.0 Performance and Reliability: How to Run Large Web Apps

Services we use

• Amazon EC2 and S3
• Panther Express

Page 67: Web 2.0 Performance and Reliability: How to Run Large Web Apps

Panther Express

• Fantastic Content Distribution Network
• Cheap, simple price list
  – Take note Akamai
• Cut delivery time to Europe by 70%
• We let our images be cached 1 second to reduce load

Page 68: Web 2.0 Performance and Reliability: How to Run Large Web Apps

EC2 and S3

• We save all our binlogs to S3
• We save database dumps to S3
• We have monitors running from EC2
• We plan to build a data warehouse cluster on EC2

Page 69: Web 2.0 Performance and Reliability: How to Run Large Web Apps

EC2 Requires Automation

• Machine is blank when you bring it up
• Download database dump from S3 and replicate up - automatically
• Use Puppet
• Amazon saves you hardware headaches
  – But complexity is still a problem

Page 70: Web 2.0 Performance and Reliability: How to Run Large Web Apps

Thank you