flashback: qcon san francisco 2012
Sergejus Barinovas
Why San Francisco?
Learn how others operate at scale
Learn what problems others have
Learn whether their solutions apply to us
Learn whether their problems apply to us
Silicon Valley based companies:
- Netflix
- Quora
- tons of others...
NoSQL: Past, Present, Future
Eric Brewer – author of CAP theorem
CP vs. AP is a choice only on a time-out (failure); the rest of the time you can have both
Real-time web
node.js – the de-facto standard for the real-time web
open a connection for the user and leave it open
web sockets are great, but use fallbacks
- many mobile devices don't support web sockets
- long polling, infinite frame, etc.
more companies are moving to the SPDY protocol
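The long-polling fallback mentioned above can be sketched as a minimal channel (my own illustrative Python sketch, not from the talk): the server holds each poll request open until a message arrives or a timeout expires, approximating a persistent connection without web sockets.

```python
import queue
import threading

class LongPollChannel:
    """Minimal long-polling fallback: instead of a persistent WebSocket,
    the client repeatedly issues requests, and the server blocks each
    request until data arrives or a timeout elapses."""

    def __init__(self):
        self._messages = queue.Queue()

    def publish(self, msg):
        # Server side: push a message to any waiting poll.
        self._messages.put(msg)

    def poll(self, timeout=30.0):
        # One long-poll request: block until a message or the timeout.
        try:
            return self._messages.get(timeout=timeout)
        except queue.Empty:
            return None  # client simply re-issues the request

channel = LongPollChannel()
# Simulate a message arriving 0.1 s after the client starts polling.
threading.Timer(0.1, channel.publish, args=("new answer posted",)).start()
print(channel.poll(timeout=2.0))  # -> new answer posted
```

The client loop would call `poll` in a cycle, treating `None` as "nothing yet, ask again".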
Quora on mobile
first iPhone app
- a mobile app is like an old app shipped on a CD
- hybrid application
- native code for controls and navigation
- HTML for viewing Q&A from the site
- separate mobile-optimized HTML layout of the web page
Quora on mobile
second Android app
- created a clone of the iPhone app – failed!
- UI that feels natural on iPhone is alien on Android
- bought Android devices and learned their philosophy
- used the new Google Android UI design guidelines
- created a new app with a native Android look & feel
- users in India pay per MB, so traffic had to be optimized
- the same optimizations were then applied to the iPhone app and the web page
Quora on mobile
mobile first experience
- mobile has unique requirements
- if you're good on mobile, you're good anywhere
- don't reuse the mobile app on tablets; create a separate one or use the web
Continuous delivery
Jesse Robbins – co-founder of Opscode (Chef)
infrastructure as code
- full stack automation
- datacenter API (for provisioning VMs, etc.)
- infrastructure is a product and the app is its customer
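The infrastructure-as-code idea can be sketched as follows (a toy illustration of the principle, not Chef's actual model): desired state is plain data, and an idempotent converger touches the datacenter API only where reality differs.

```python
# Toy sketch of infrastructure as code: desired state is data, and
# convergence is idempotent (a second run changes nothing).
def converge(desired, actual):
    """Make `actual` match `desired`; return the actions taken."""
    actions = []
    for resource, state in desired.items():
        if actual.get(resource) != state:
            actual[resource] = state  # here you would call the datacenter API
            actions.append((resource, state))
    return actions

actual = {}
desired = {"web_vms": 4, "nginx": "installed"}
print(converge(desired, actual))  # applies both changes
print(converge(desired, actual))  # -> [] already converged, no-op
```

Idempotence is the key property: the same description can be applied repeatedly, so automation is safe to re-run.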
Continuous delivery
applications as services
- service orientation
- software resiliency
- deep instrumentation
dev / ops as teams
- service owners
- shared metrics / monitoring
- continuous integration / deployment
Release engineering at Facebook
Chuck Rossi – release engineering manager
deployment process
- teams do not deploy to production by themselves
- IRC is used for communication during deployment
- if a team member is not connected to IRC, their release is skipped
- BitTorrent is used for deployments
- powerful app monitoring and profiling (instrumentation)
Release engineering at Facebook
deployment process
- ability to release to a subset of servers
- very powerful feature-flag mechanism (target by IP, gender, age, …)
- karma points for developers, with a down-vote button
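A feature-flag check of that kind can be sketched like this (a hypothetical mini example; the flag name and rule fields are made up, not Facebook's actual system): a feature ships dark and is enabled per request only when every targeting rule passes.

```python
# Hypothetical feature-flag sketch: a flag is a set of targeting rules;
# the feature is on for a request only if every rule passes.
def make_flag(**rules):
    def enabled(user):
        return all(check(user.get(field)) for field, check in rules.items())
    return enabled

# Illustrative flag: roll out only to US adults.
new_feed = make_flag(
    country=lambda c: c == "US",
    age=lambda a: a is not None and a >= 18,
)

assert new_feed({"country": "US", "age": 30})
assert not new_feed({"country": "DE", "age": 30})
assert not new_feed({"country": "US", "age": 15})
```

Because the rules are evaluated per request, a bad feature can be switched off instantly without a redeploy.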
facebook.com
- continuously deployed internally
- employees always access latest facebook.com
- easy to report a bug from the internal facebook.com
Scaling Pinterest
everything in Amazon cloud
before
- had every possible ‘hot’ technology, including MySQL, Cassandra, Mongo, Redis, Memcached, Membase, Elastic Search – FAIL
- keep it simple, major re-architecting in late 2011
Scaling Pinterest
January 2012
- Amazon EC2 + S3 + Akamai, ELB
- 90 Web Engines + 50 API Engines
- 66 sharded MySQL DBs + 66 slave replicas
- 59 Redis
- 51 Memcache
- 1 Redis task queue + 25 task processors
- sharded Solr
- 6 engineers
Scaling Pinterest
now
- Amazon EC2 + S3 + Akamai, Level3, EdgeCast, ELB
- 180 Web Engines + 240 API Engines
- 80 sharded MySQL DBs + 80 slave replicas
- 110 Redis
- 200 Memcache
- 4 Redis task queues + 80 task processors
- sharded Solr
- 40 engineers
Scaling Pinterest
schemaless DB design
- no foreign keys
- no joins
- denormalized data (id + JSON data)
- users, user_has_boards, boards, board_has_pins, pins
- read slaves
- heavy use of cache for speed & better consistency
thinking of moving to their own DC
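The "id + JSON data" pattern can be sketched like this (a made-up mini example using SQLite; the table names follow the slide, everything else is illustrative): rows are just a primary key plus a serialized document, and "joins" happen in application code against the mapping tables.

```python
import json
import sqlite3

# Sketch of the schemaless "id + JSON blob" pattern: no foreign keys,
# no SQL joins; relationships live in plain mapping tables.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE pins (id INTEGER PRIMARY KEY, data TEXT)")
db.execute("CREATE TABLE board_has_pins (board_id INTEGER, pin_id INTEGER)")

def put_pin(pin_id, doc):
    db.execute("REPLACE INTO pins VALUES (?, ?)", (pin_id, json.dumps(doc)))

def get_pin(pin_id):
    row = db.execute("SELECT data FROM pins WHERE id = ?", (pin_id,)).fetchone()
    return json.loads(row[0]) if row else None

put_pin(1, {"url": "http://example.com/cat.jpg", "likes": 3})
db.execute("INSERT INTO board_has_pins VALUES (?, ?)", (42, 1))

# The "join" is done by the application: fetch mapping rows, then pins.
pin_ids = [r[1] for r in db.execute(
    "SELECT board_id, pin_id FROM board_has_pins WHERE board_id = ?", (42,))]
print([get_pin(p)["url"] for p in pin_ids])
```

Keeping every row self-contained is what makes sharding by id straightforward: any row can live on any shard without cross-shard joins.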
Architectural patterns for HA
Adrian Cockcroft – director of architecture at Netflix
architecture
- everything in Amazon cloud in 3 availability zones
- Chaos Gorilla and Latency Monkey for resilience testing
- service-based architecture, stateless micro-services
- high attention to service resilience
- handle dependent-service unavailability or increased latency
started open-sourcing to improve quality of the code
Architectural patterns for HA
Cassandra usage
- 2 dedicated Cassandra teams
- over 50 Cassandra clusters, over 500 nodes, over 30 TB of data; the biggest cluster has 72 nodes
- mostly write operations; a Memcache layer is used for reads
- moved to SSDs in Amazon instead of spinning disks and cache
- for ETL: read Cassandra backup files using Hadoop
- can scale zero-to-500 instances in 8 minutes
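That read path (writes go to Cassandra, reads hit a Memcache layer first) follows the common cache-aside pattern; a minimal sketch, with plain dicts standing in for Memcache and Cassandra:

```python
# Minimal cache-aside sketch: dicts stand in for Memcache and Cassandra.
cache = {}
store = {"movie:42": {"title": "Example", "views": 7}}

def read(key):
    value = cache.get(key)
    if value is None:               # cache miss
        value = store.get(key)      # fall back to the datastore
        if value is not None:
            cache[key] = value      # populate the cache for the next read
    return value

def write(key, value):
    store[key] = value
    cache.pop(key, None)            # invalidate so readers don't see stale data

assert read("movie:42")["views"] == 7   # first read misses, then is cached
assert "movie:42" in cache
```

Invalidating on write (rather than updating the cache in place) is the usual choice because it avoids races between concurrent writers.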
Timelines at scale
Raffi Krikorian – director of Twitter's platform services
core architecture
- pull (timeline & search) and push (mobile, streams) use cases
- 300K QPS for timeline
- on write, a fan-out process copies the data for each use case
- timeline cache in Redis
- when you tweet and you have 200 followers, there are 200 inserts, one into each follower's timeline
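The fan-out-on-write step can be sketched with in-memory structures standing in for the Redis timeline cache (my own sketch; the 800-entry cap is an illustrative assumption, not a figure from the talk):

```python
from collections import defaultdict, deque

# Sketch of write-time fan-out: a tweet id is copied into every
# follower's cached timeline at write time, making reads a cheap lookup.
followers = defaultdict(set)                        # author -> followers
timelines = defaultdict(lambda: deque(maxlen=800))  # user -> recent tweet ids

def follow(follower, followee):
    followers[followee].add(follower)

def tweet(author, tweet_id):
    # One insert per follower: 200 followers means 200 timeline writes.
    for f in followers[author]:
        timelines[f].appendleft(tweet_id)

follow("alice", "bob")
follow("carol", "bob")
tweet("bob", "t1")
print(list(timelines["alice"]))  # -> ['t1']
```

The trade-off is paying at write time so that reading a timeline never needs to query every followed user.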
Timelines at scale
core architecture
- Hadoop for batch compute and recommendations
- code is heavily instrumented (load times, latencies, etc.)
- uses Cassandra, but moving off it due to read times
More info
Slides - http://qconsf.com/sf2012
Videos - http://www.infoq.com/