Subversion and Git High Availability
TRANSCRIPT
High Availability
• What is meant by “High Availability”?
• How is “High Availability” measured?
• What does it take to make a system “Highly Available”?
• Ways to go about it…
High Availability Defined
• Wikipedia – “High availability is a system design approach and associated service implementation that ensures a prearranged level of operational performance will be met during a contractual measurement period.”
• Say what?
– Defined only in the context of a service contract?!
– Reasonable, since costs have a way of getting out of hand.
Interpreting “High Availability”
• What do the users hear? The users’ management?
– Always Available, Five Nines (99.999%), etc.
• What does accounting hear?
– Money, and lots of it
• What do you need to do?
– Balance
– Document
– Communicate
What About Maintenance?
• Some measures of High Availability do not include the time to perform maintenance in the service contract
– Every project today needs to do maintenance, right?
– So why include it in the measurement?
• Simple answer:
– Because the users will!
• Multiple reasons: hardware, OS, application, …
– Doesn’t matter to the users – it’s still an “outage”
• Scaling centralization ever larger pits the business schedules of multiple teams against the time to perform maintenance
– With a sufficiently large number of teams spread out worldwide, there is zero time to do maintenance without impacting users
– Don’t forget automation
• Subversion is a centralized application
• Git “upstream” is centralized
The Ecosystem
• Software is layered over software… only eventually over hardware
• You can build a robust system over faulty subsystems
– But the robust system will not perform well
– And the robust system will be blamed for the bad performance
• Start from the ground up
– Test layer by layer
• Critical requirements for subsystems:
– Make sure your logging can provide the evidence of bad underlying behavior
– Make sure your monitoring automatically reports the bad behavior
Hardening
• “Old School”: the standard “data center” approach
• Test components and select for the longest “mean time between failures” (MTBF)
• Duplicate critical components:
– Power distribution and UPS
– Networking
– SAN
Hardening Issues
• Lots of money and diminishing returns
– Failures still occur
– Systems still go down
• Software bugs and race conditions
• Hardware still fails – it’s just statistics: MTBF!
– Outages last as long as it takes:
• To determine and correct the failing component(s)
• To fix up the system and component meta-data and data
– There are a lot of components and a lot of data
• Failures are going to happen
Clustering
• “Modern School”: added when hardening still had too large a “Time to Fix” (TTF)
• Take multiple machines (VMs?!) and, through the wonders of H/W and S/W, make them look like one “machine”
– Shared SAN
– Multi-IP NIC
• Concepts:
– Cluster Package – a way to wrap an application
• Package IP Address – separate from the Host IP Address
– Hot Standby – a machine that sits idle, waiting to run when the Primary fails
– Failover – when the Cluster Package IP moves to another cluster server
• For instance, to the Hot Standby
• Each cluster package runs in one location
– Or in multiple locations – depends on the package definition
Clustering Benefits
• Subversion/Git service moves from a broken server to a running server without manual intervention
– Only transactions “in progress” at the failed node are lost
– Near-immediate restoration of Subversion service
– User outage measured in seconds
• Enables scheduled maintenance with only a minor disruption
– Patch the “hot standby”
– Fail over; service is now up on the patched server after a ~30-second outage
– Patch the “failed from” server, make it the new “hot standby”
• This all works because the users contact the service via the “package IP address” – NOT the server IP address
– The package IP address is always on the host that is running/hosting the package
Clustering: High Level “Who?”
• Established generic clustering solutions:
– Linux-HA (open source)
– Cluster Server (Veritas)
– High Availability (Red Hat)
– ServiceGuard (HP)
– Windows Clustering (Microsoft)
Clustering: “How to?”
• Solutions supply significant infrastructure to enable:
– Easy creation of a “package” to house an instance of a software service (application)
– Automatic:
• Detection of server availability
• Import/export of volume groups
• Mount/unmount of file systems
• Management of the package IP address
• Monitoring of package availability
• Failover from one server to another
• Normally in the LAN; some can support MAN/WAN
– MAN or WAN require GSLB (global server load balancing)
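As a minimal sketch of what a cluster package looks like in practice, here is a Pacemaker resource group in the Red Hat High Availability / Linux-HA lineage, built with the “pcs” tool; the IP address, volume, and paths shown are hypothetical:

    # A resource group is the "cluster package": members start in order,
    # stop in reverse, and always fail over together to the same server.
    pcs resource create svn_ip ocf:heartbeat:IPaddr2 \
        ip=10.10.10.50 cidr_netmask=24 --group svn_pkg        # package IP
    pcs resource create svn_fs ocf:heartbeat:Filesystem \
        device=/dev/vg_svn/lv_repos directory=/srv/svn \
        fstype=ext4 --group svn_pkg                           # shared SAN volume
    pcs resource create svn_httpd ocf:heartbeat:apache \
        configfile=/etc/httpd/conf/httpd.conf --group svn_pkg # Apache + mod_dav_svn

Users connect only to 10.10.10.50; on failover the whole group, IP included, restarts on the standby. The rolling maintenance described earlier then amounts to patching the standby and running “pcs resource move svn_pkg <standby>”.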
Clustering: Issues
• Still have outages due to:
– SAN infrastructure
• Harden this as much as possible
• Some SAN vendors offer data replication at the SAN level as an add-on (at additional cost)
• Keep the OS updated with the latest fixes to avoid kernel issues preventing failover
– WAN failures (partitions)
• Use redundant routing
• Get it fixed fast
– Still essentially centralized
• Latency is problematic for remote users
Distributing
• “New School”, or “Application Specific”
• Enable an application in multiple locations
– Must have sufficient coordination to maintain data consistency
• Application designed from the ground up with this requirement, OR
• Sophisticated middleware to do the same
– Normally there are tradeoffs between the implementations
Distributing: Impact of Latency
• TCP is “bandwidth limited” by latency
– How long does it take to transfer:
• 1 GiB from USA to India (0.3 secs latency)?
• 30 GiB from USA to India (0.3 secs latency)?
• Yes, you can change the kernel TCP tunables
– But then you end up with retransmit costs due to lost packets
• Every RPC is slowed end-to-end by latency
– Parallelism helps – up to bandwidth limits
– Smaller numbers help – always
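For a feel for those numbers, here is a rough worked example, assuming a default 64 KiB TCP receive window, treating the quoted 0.3 s as the round-trip time, and ignoring slow start and packet loss:

    throughput <= window / RTT = 65,536 B / 0.3 s  ~  213 KiB/s
    1 GiB  / 213 KiB/s  ~   4,900 s  (about 82 minutes)
    30 GiB / 213 KiB/s  ~ 147,000 s  (about 41 hours)

Raising the window via the tunables lifts that ceiling, but, as noted above, each lost packet then costs more to retransmit.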
Distributing Benefits for SCMs
• Finally, back to Subversion and Git ☺
• Normal repo usage
– 80% to 90% of operations are read-only
• Some repos are up to 99% reads!
• Global Distribution
– Having repositories “latency-close” to your users means much faster read-only operations
• Local Distribution
– Enables the use of load balancers for single-site, nearly seamless failover
– And much higher read-only capacity
Git is already distributed…
• Why would I need to distribute Git?
– Cloning is proportional to repo size and latency
– Pushing is proportional to push size and latency
• How much? Hold that thought…
Distributing: Git via “--mirror”
• Distributed set of read-only clone mirrors:
– Create on the mirror host by:
• git clone --mirror https://<centralSvr>/gitrepos/repo.git
– Keep updated by a post-update hook on the central server (see the sketch below):
• Avoid polling
• Single account with R/W permissions
• Either push from the central server:
– git push --mirror https://<remoteSvr>/gitrepos/repo.git
• Or trigger a fetch on the mirror host:
– ssh account@<remoteSvr> bin/do_update
– And have “do_update” run, inside the mirror clone:
» git remote update --prune
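Putting those pieces together, a minimal post-update hook on the central server might look like the following; the mirror host list, the “gitmirror” account, and stored credentials for it are all assumptions:

    #!/bin/sh
    # hooks/post-update in the central bare repo: runs once per accepted push.
    # Update all mirrors in parallel so the pushing user waits only for the
    # slowest mirror rather than for the sum of all of them.
    for mirror in remote1.example.com remote2.example.com; do
        git push --mirror "https://gitmirror@$mirror/gitrepos/repo.git" &
    done
    wait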
Distributing: Git mirror tradeoff
• How “stale” a mirror is depends on the frequency of pushes vs. the frequency of mirroring
– The post-hook approach minimizes the lag
• Serialized updating of multiple mirrors increases the lag
• Delivery is not guaranteed: use a pre-hook to set a timeout on the post-operation
– Adds to user confusion
• Adds complexity to end-user repo setup
– Different remotes for fetch and push (see the sketch below)
• No centralized dashboard
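The “different remotes” item above usually means pointing fetch at the nearby mirror and push at the central server, for example (with the same hypothetical hosts as before):

    # Fetch from the local read-only mirror, push to the central RW server.
    git remote set-url origin https://<remoteSvr>/gitrepos/repo.git
    git remote set-url --push origin https://<centralSvr>/gitrepos/repo.git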
Distributing Git
• Effects of latency:
– [Chart: Git operation time in seconds vs. average round-trip latency: 32 ms (coast-to-coast USA), 64 ms (trans-Atlantic), 128 ms (trans-Pacific), 256 ms (US to India)]
Distributing: WANdisco Git
• Fully supported commercial product
• Full Paxos-based solution
– Not a proxy-based solution
– Consensus on transaction ordering via Paxos
– Each repo replica is an exact copy
• Like --mirror, except that each replica can be pushed to
– Automatic peer-to-peer synchronization at the “update” level
– All replicas RW
• If the local server is down, users can pull/push from/to any replica
• Can front with a load balancer to distribute load (best in a Cluster setup)
• Combined with WANdisco Access Control Plus:
– Enables both HTTPS and SSH access paths
– Enables fine-grained Authorization control
– Team-based administration common to both Git and Subversion
• Central Administration and Dashboard
Git MultiSite
Distributing: SVN via “svnsync”
• Subversion has an application-specific mechanism for maintaining read-only (RO) repo copies: “svnsync”
– Use hooks on repo write operations to copy the data from the read-write (RW) repo
• Need to forward commits and changes to non-versioned revision properties
• Use PRE-hooks to register “future work” with a daemon
– Critical for the pre-revprop-change hook
• Avoid the simple POST-hooks-only implementation trap
– It’s possible they’ll never fire, causing embarrassing delays in synchronization
– Or “lost” rev-prop changes
• Coordinate pushes so that revisions show up before rev-props…
– Use Apache to forward write operations through from the RO repo to the RW repo:
• All but “GET”, “PROPFIND”, “OPTIONS”, “REPORT” should go to the RW repo
• Use “RewriteCond”, “RewriteRule” and “ProxyPassReverse” to make it happen (see the sketch below)
– Make sure that UUIDs are identical for all {RW, RO1, RO2, …} in a set
– Use pre-hooks at the RO repos to prevent modification except from the “svnsync” account
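To make the moving parts concrete, here is a minimal svnsync mirror setup; the paths and the “syncuser” account are assumptions, and a production setup still needs the daemon-based hook forwarding described above:

    # On the mirror host: create the RO repo and gate revision property
    # changes so only the replication account can make them (svnsync
    # requires a pre-revprop-change hook that allows its user).
    svnadmin create /srv/svn/repo
    cat > /srv/svn/repo/hooks/pre-revprop-change <<'EOF'
    #!/bin/sh
    [ "$3" = "syncuser" ] && exit 0          # $3 is the acting user
    echo "revprop changes only via the svnsync account" >&2
    exit 1
    EOF
    chmod +x /srv/svn/repo/hooks/pre-revprop-change

    # Register the mirror against the RW master, then copy the history:
    svnsync initialize file:///srv/svn/repo https://<centralSvr>/svn/repo \
        --sync-username syncuser
    svnsync synchronize file:///srv/svn/repo

    # Sketch of the Apache write-through forwarding named above, in the
    # RO repo's vhost; write methods are proxied to the RW master:
    cat >> /etc/httpd/conf.d/svn_mirror.conf <<'EOF'
    RewriteEngine On
    RewriteCond %{REQUEST_METHOD} !^(GET|PROPFIND|OPTIONS|REPORT)$
    RewriteRule ^/svn/(.*)$ https://<centralSvr>/svn/$1 [P]
    ProxyPassReverse /svn/ https://<centralSvr>/svn/
    EOF

On the RW master, post-commit and post-revprop-change hooks (via the daemon) then run “svnsync synchronize” and “svnsync copy-revprops” against each mirror.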
Distributing: svnsync tradeoffs
• Missing entries in “db/locks” at RO repos for objects locked via “svn lock”
– Trying to copy them can cause svnsync failures
– Without them, the RO repos cannot be used to replace the RW repo if the RW repo is lost
• At least not without the loss of all of the locks
• Manual “svnsync lock” resolution via the “--steal-lock” option (see below)
– Must verify before using it that no other svnsync is running
– The manual operation causes an interruption (delay) of service
• Serial push from the RW repo
– Current OpenSource implementations iterate serially from the RW repo (“master”)
– Delays propagation in the WAN
• Use of post-event hooks
– Current OpenSource implementations can fail due to lost post-event hooks
– Example: a non-versioned revision property change never shows up at the RO repos
• Does not handle “authz” file change propagation
• No centralized administration or dashboard
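The stale-lock recovery mentioned above is, under the same hypothetical paths as before, a single manual command – but only after verifying that no other svnsync run is still alive:

    # Break an abandoned svnsync lock and resume mirroring (svnsync 1.7+):
    svnsync synchronize file:///srv/svn/repo --steal-lock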
Distributing: WANdisco SVN
• Fully supported commercial product
• Full Paxos-based solution
– Not a proxy-based solution
– Consensus on transaction ordering via Paxos
– Each repo replica is an exact copy
• Including db/locks entries, time stamps, etc. (locks are handled properly)
– Automatic peer-to-peer synchronization at the FSFS level
• Distribution of commit/rev-prop changes from any peer to any peer
• Speeds up WAN synchronization
– All replicas RW
• If the local server is down, users can “svn switch --relocate” to another
• Can front with a load balancer to distribute load (best in a Cluster setup)
• Handles “authz” file propagation
– Supports both Apache and svnserve concurrently
• Central Administration and Dashboard
SVN MultiSite Plus
Distributing: Investments
• Obvious:
– Hardware, Power, Cooling, Storage at each site, Administrative Staffing
– Make versus buy?
• Non-obvious:
– Time to deliver the data to the distributed sites
• WAN bandwidth
• WAN latency
• WAN packet loss
– Coordinating data
• Time to coordinate updates at all sites
Harden, Cluster, Distribute
• I know – advice is cheap
• Think of it as jump-starting your investigation
– Nothing more
Advice
Real World: Use all Three!
• The following slides are some recommendations on when to use:
– Hardening
– Clustering
– Distributing
• Design based on your user population
– Locations
– Cost structures at those locations
– Staff density at those locations
– Include automated demand loading (automation)
• Each situation is different – it may require a different mix
– Take an iterative approach
Real World: Hardening
• Choose system components based on the cost/reliability tradeoff
– Get your vendor to provide their failure rates
– Find a new vendor if they won’t help
• Always use Chipkill ECC memory in servers
– Replace DIMMs when uncorrectable errors occur
• Aside: why does parity memory still exist???
• Use SAN where the ROI is positive
– Use direct-attached RAID when SAN is overkill
• Never use motherboard-based RAID – just too iffy, poorly supported, etc.
– At the bottom end, use software RAID
• Expensive in CPU
• Competes with SVN’s MD5 computations
– At the very bottom, use single drives – and expect failures
• Buy server drives with 24x7 duty cycles (not 8x5)
Real World: Clustering
• You don’t have to have clusters everywhere
– Good thing – they’re relatively expensive
• Set up a cluster in critical areas to distribute the load
– Centralized global presence
– High staff density
– High burden of automated Continuous Integration
• Beware automated commits – see below
• Set up a single machine or VM in lower-use areas
– Monitor use and host performance
– Daily reports on use/performance showing historical trends
– Convert to clusters when loading and ROI justify it
Real World: Distributing SCMs
• Repos where binaries get transmitted are critical
– Even tiny “source only” repos only ever grow
• Distribute to development sites where latency from the central office is large
– Consider it when latency is 50 milliseconds or larger
– Absolutely do it when over 100 milliseconds
• If you distribute some repos, distribute them all
– Live copies are better than backups
– Keep it simple for the developers
• Each site’s staff only needs to remember their server
SVN/Git is Free!
• Fight the mindset that SCM should be cheap…
– It houses your core intellectual property!
• Mindset: investing in SCM provides positive ROI
– Remind them what the last outage/recovery cost in terms of:
• Missed schedules
• Missed market windows
• Lost data
– Some help with the justification:
• http://www.wandisco.com/roi-calculator
• Gather data
– Gather data
• Gather data
• Then generate reports!
Why is this costing money?
Abusive Use Cases: Read-Only
• Examples:
– Runaway user scripting
• git clone/fetch/pull
• svn checkout
– User scripting with out-of-date credentials
• You will need solid monitoring and reporting to find these (see the sketch below)
– Reporting will need to summarize the number and size of requests by account
– Flag heavy AuthN failures
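As a starting point, that report can be pulled straight from the Apache access logs; this sketch assumes the Common Log Format (account in field 3, status in field 9, response size in field 10) and a hypothetical log path:

    # Requests and bytes served per account, busiest first:
    awk '{ req[$3]++; bytes[$3] += $10 }
         END { for (u in req) printf "%8d req %14d bytes  %s\n", req[u], bytes[u], u }' \
        /var/log/httpd/svn_access_log | sort -rn

    # Accounts racking up authentication failures (HTTP 401s):
    awk '$9 == 401 { fail[$3]++ } END { for (u in fail) print fail[u], u }' \
        /var/log/httpd/svn_access_log | sort -rn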
Abusive Use Cases: Read-Write
• Continuous Integration (CI) with updates/modifications
– Can behave like 100 normal users working 24 hours per day
– Engage with the CI designers right at the beginning
– Reads are inexpensive, writes are expensive
• Try to minimize writes
• Massive check-ins
– CDs, DVDs, video clips, chip designs, etc.
– Distributing those types of artifacts can take hours over the WAN
– Roll out non-SCM artifact repositories/managers with or before your SCM
• Removing artifacts from Subversion is extremely painful (offline for dump/filter/load)
• Removing artifacts from Git upstream is extremely painful (filter-branch, non-fast-forward pushes)
– “Artifact Repository” examples:
• Archiva
• Artifactory
• Nexus
Thank You [email protected]