google sre: chasing uptime

15
Google SRE: Chasing Uptime What do Google clusters look like? How do we manage them?

Upload: doantruc

Post on 02-Jan-2017

229 views

Category:

Documents


1 download

TRANSCRIPT

Page 1: Google SRE: Chasing Uptime

Google SRE: Chasing Uptime

What do Google clusters look like?How do we manage them?

Page 2: Google SRE: Chasing Uptime

Site Reliability Engineering

• Manage Google’s serving infrastructure• Plan and execute new capacity

deployment (e.g. new datacenters)• Performance tuning• Handle serious system problems and

outages

[email protected]

Page 3: Google SRE: Chasing Uptime

Commodity Hardware

• IDE Drives, Midrange CPUs, non-redundant power supplies

• Cheap and readily available parts• Outstanding bang for the buck

[email protected]

Page 4: Google SRE: Chasing Uptime

Google, circa 1996

Page 5: Google SRE: Chasing Uptime

Much improved, Google rack, but…

Page 6: Google SRE: Chasing Uptime

…tangled wires, cardboard, cork, bent motherboards!

Page 7: Google SRE: Chasing Uptime

Commodity Hardware

• IDE Drives, Midrange CPUs, non-redundant power supplies

• Cheap and readily available parts• Outstanding bang for the buck• Unreliable, temperamental, flaky

[email protected]

Page 8: Google SRE: Chasing Uptime

Site Reliability Engineering

• Automate common failure cases: Disk, memory, CPU errors, misconfiguration

• …and the dreaded “unexplained server down”

• Deploy and maintain monitoring and automation infrastructure

[email protected]

Page 9: Google SRE: Chasing Uptime

Strength in numbers: shard

• Dataset is huge, but divided up among many machines, giving us:

• Subset of data fits on one machine• Splitting processing gives lower latency• More CPUs gives higher throughput

[email protected]

Page 10: Google SRE: Chasing Uptime

Site Reliability Engineering

• How is the decision to divide up data among machines made?

• Correct for uneven workload and changes in query mix

• Design, deploy, and maintain automation

[email protected]

Page 11: Google SRE: Chasing Uptime

Strength in numbers: clone

• Each server has a number of clones that serve the same subset of data

• Many queries to be processed in parallel, giving scalability

• If any one clone goes down, others pick up, giving reliability

[email protected]

Page 12: Google SRE: Chasing Uptime

Sally’s carwash

Joe Speedcleaner2005 World speed car washing champion

$400K/year(coach, masseuse, 6 year contract, bad attitude)

Sally CleancarCEO

Page 13: Google SRE: Chasing Uptime

Sally’s carwash

Sam CleancarNepotistic Beneficiary

$60K/year

Kelly KleansidesWashes car doors

$25K/year

John WashdoorCan wash doors fast

$35K/year

Cathy WhitewallSolid cleaning

$25K/year

Bob ShinyrubberSpray and wipe

$20K/year

Joan BlacktireKnows tires$25K/year

Fred Fore O'NineMostly streak-free

$30K/year

Windy WendexEntirely streak-free

$35K/year

Mike ClearglassCleans windows

$25K/year

Susy SpongeClassy Dessicant

$40K/year

Dave DesertAbsorbance Aide

$25K/year

Sandy DrysteelChamois champ

$50K/year

Sally CleancarCEO

Employee cost: $395K/year

Page 14: Google SRE: Chasing Uptime

Site Reliability Engineering

• Diagnose and fix performance problems on live serving systems

• Plan and execute deployment of new capacity

• Work with new projects deployments to ensure they meet production criteria

• …and much more!

[email protected]

Page 15: Google SRE: Chasing Uptime

Questions?

A few starters:• How can I learn more?• Is Google hiring?

• Contacts:Angus Lees [email protected] Pollmann [email protected] Lo [email protected] Pindiproli [email protected]