Subversion and Git High Availability
TRANSCRIPT
High Availability
• What is meant by “High Availability”?
• How is “High Availability” measured?
• What does it take to make a system “Highly Available”?
• Ways to go about it…
High Availability Defined
• Wikipedia – “High availability is a system design approach and associated service implementation that ensures a prearranged level of operational performance will be met during a contractual measurement period.”
• Say what?
– Defined only in the context of a service contract?!
– Reasonable, since costs have a way of getting out of hand.
Interpreting “High Availability”
• What do the users hear? The users’ management?
– Always Available, Five Nines (99.999%), etc.
• What does accounting hear?
– Money, and lots of it
• What do you need to do?
– Balance
– Document
– Communicate
What About Maintenance?
• Some measures of High Availability do not include the time to perform maintenance in the service contract
– Every project today needs to do maintenance, right?
– So why include it in the measurement?
• Simple answer:
– Because the users will!
• Multiple reasons: hardware, OS, application, …
– Doesn’t matter to the users – it’s still an “outage”
• Scaling centralization ever larger pits the business schedules of multiple teams against the time to perform maintenance
– With a sufficiently large number of teams spread out worldwide, there is zero time to do maintenance without impacting users
– Don’t forget automation
• Subversion is a centralized application
• Git “upstream” is centralized
The Ecosystem
• Software is layered over software… only eventually over hardware
• You can build a robust system over faulty subsystems
– But the robust system will not perform well
– And the robust system will be blamed for the bad performance
• Start from the ground up
– Test layer by layer
• Critical requirements for subsystems:
– Make sure your logging can provide the evidence of bad underlying behavior
– Make sure your monitoring automatically reports the bad behavior
Hardening
• “Old School”: the standard “data center” approach
• Test components and select for the longest “mean time between failures” (MTBF)
• Duplicate critical components:
– Power distribution and UPS
– Networking
– SAN
Hardening Issues
• Lots of money and diminishing returns
– Failures still occur
– Systems still go down
• Software bugs and race conditions
• Hardware still fails – it’s just statistics: MTBF!
– Outages last as long as it takes:
• To determine and correct the failing component(s)
• To fix up the system and component meta-data and data
– There are a lot of components and a lot of data
• Failures are going to happen
Clustering
• “Modern School”: added when hardening still had too large a “Time to Fix” (TTF)
• Take multiple machines (VMs?!) and, through the wonders of H/W and S/W, make them look like one “machine”
– Shared SAN
– Multi-IP NIC
• Concepts:
– Cluster Package – a way to wrap an application
• Package IP Address – separate from the Host IP Address
– Hot Standby – a machine that sits idle, waiting to run when the Primary fails
– Failover – when the Cluster Package IP moves to another cluster server
• For instance, to the Hot Standby
• Each cluster package runs in one location
– Or in multiple locations – depends on the package definition
Clustering Benefits
• Subversion/Git service moves from a broken server to a running server without manual intervention
– Only transactions “in progress” at the failed node are lost
– Near-immediate restoration of Subversion service
– User outage measured in seconds
• Enables scheduled maintenance with only a minor disruption
– Patch the “hot standby”
– Fail over; service is now up on the patched server after a ~30-second outage
– Patch the “failed from” server, make it the new “hot standby”
• This all works because the users contact the service via the “package IP address” – NOT the server IP address
– The package IP address is always on the host that is running/hosting the package
Clustering: High Level “Who?”
• Established generic clustering solutions:
– Linux-HA (open source)
– Cluster Server (Veritas)
– High Availability (Red Hat)
– ServiceGuard (HP)
– Windows Clustering (Microsoft)
Clustering: “How to?”
• Solutions supply significant infrastructure to enable:
– Easy creation of a “package” to house an instance of a software service (application)
– Automatic:
• Detection of server availability
• Import/export of volume groups
• Mount/unmount of file systems
• Management of the package IP address
• Monitoring of package availability
• Failover from one server to another
• Normally in the LAN; some can support MAN/WAN
– MAN or WAN require GSLB (global server load balancing)
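As a minimal sketch of what a cluster package looks like in practice, here is a Pacemaker resource group in the Red Hat High Availability / Linux-HA lineage, built with the “pcs” tool; the IP address, volume, and paths shown are hypothetical:

    # A resource group is the "cluster package": members start in order,
    # stop in reverse, and always fail over together to the same server.
    pcs resource create svn_ip ocf:heartbeat:IPaddr2 \
        ip=10.10.10.50 cidr_netmask=24 --group svn_pkg        # package IP
    pcs resource create svn_fs ocf:heartbeat:Filesystem \
        device=/dev/vg_svn/lv_repos directory=/srv/svn \
        fstype=ext4 --group svn_pkg                           # shared SAN volume
    pcs resource create svn_httpd ocf:heartbeat:apache \
        configfile=/etc/httpd/conf/httpd.conf --group svn_pkg # Apache + mod_dav_svn

Users connect only to 10.10.10.50; on failover the whole group, IP included, restarts on the standby. The rolling maintenance described earlier then amounts to patching the standby and running “pcs resource move svn_pkg <standby>”.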
Clustering: Issues
• Still have outages due to:
– SAN infrastructure
• Harden this as much as possible
• Some SAN vendors offer data replication at the SAN level as an add-on (at additional cost)
• Keep the OS updated with the latest fixes to avoid kernel issues preventing failover
– WAN failures (partitions)
• Use redundant routing
• Get it fixed fast
– Still essentially centralized
• Latency is problematic for remote users
Distributing
• “New School”, or “Application Specific”
• Enable an application in multiple locations
– Must have sufficient coordination to maintain data consistency
• Application designed from the ground up with this requirement, OR
• Sophisticated middleware to do the same
– Normally there are tradeoffs between the implementations
Distributing: Impact of Latency
• TCP is “bandwidth limited” by latency
– How long does it take to transfer:
• 1 GiB from USA to India (0.3 secs latency)?
• 30 GiB from USA to India (0.3 secs latency)?
• Yes, you can change the kernel TCP tunables
– But then you end up with retransmit costs due to lost packets
• Every RPC is slowed end-to-end by latency
– Parallelism helps – up to bandwidth limits
– Smaller numbers help – always
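For a feel for those numbers, here is a rough worked example, assuming a default 64 KiB TCP receive window, treating the quoted 0.3 s as the round-trip time, and ignoring slow start and packet loss:

    throughput <= window / RTT = 65,536 B / 0.3 s  ~  213 KiB/s
    1 GiB  / 213 KiB/s  ~   4,900 s  (about 82 minutes)
    30 GiB / 213 KiB/s  ~ 147,000 s  (about 41 hours)

Raising the window via the tunables lifts that ceiling, but, as noted above, each lost packet then costs more to retransmit.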
Distributing Benefits for SCMs
• Finally, back to Subversion and Git ☺
• Normal repo usage
– 80% to 90% of operations are read-only
• Some repos are up to 99% reads!
• Global Distribution
– Having repositories “latency-close” to your users means much faster read-only operations
• Local Distribution
– Enables the use of load balancers for single-site, nearly seamless failover
– And much higher read-only capacity
Git is already distributed…
• Why would I need to distribute Git?
– Cloning is proportional to repo size and latency
– Pushing is proportional to push size and latency
• How much? Hold that thought…
Distributing: Git via “--mirror”
• Distributed set of read-only clone mirrors:
– Create on the mirror host by:
• git clone --mirror https://<centralSvr>/gitrepos/repo.git
– Keep updated by a post-update hook on the central server (see the sketch below):
• Avoid polling
• Single account with R/W permissions
• Either push from the central server:
– git push --mirror https://<remoteSvr>/gitrepos/repo.git
• Or trigger a fetch on the mirror host:
– ssh account@<remoteSvr> bin/do_update
– And have “do_update” run, inside the mirror clone:
» git remote update --prune
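Putting those pieces together, a minimal post-update hook on the central server might look like the following; the mirror host list, the “gitmirror” account, and stored credentials for it are all assumptions:

    #!/bin/sh
    # hooks/post-update in the central bare repo: runs once per accepted push.
    # Update all mirrors in parallel so the pushing user waits only for the
    # slowest mirror rather than for the sum of all of them.
    for mirror in remote1.example.com remote2.example.com; do
        git push --mirror "https://gitmirror@$mirror/gitrepos/repo.git" &
    done
    wait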
Distributing: Git mirror tradeoff
• How “stale” a mirror is depends on the frequency of pushes vs. the frequency of mirroring
– The post-hook approach minimizes the lag
• Serialized updating of multiple mirrors increases the lag
• Delivery is not guaranteed: use a pre-hook to set a timeout on the post-operation
– Adds to user confusion
• Adds complexity to end-user repo setup
– Different remotes for fetch and push (see the sketch below)
• No centralized dashboard
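The “different remotes” item above usually means pointing fetch at the nearby mirror and push at the central server, for example (with the same hypothetical hosts as before):

    # Fetch from the local read-only mirror, push to the central RW server.
    git remote set-url origin https://<remoteSvr>/gitrepos/repo.git
    git remote set-url --push origin https://<centralSvr>/gitrepos/repo.git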
Distributing Git
• Effects of latency:
– [Chart: Git operation time in seconds vs. average round-trip latency: 32 ms (coast-to-coast USA), 64 ms (trans-Atlantic), 128 ms (trans-Pacific), 256 ms (US to India)]
Distributing: WANdisco Git
• Fully supported commercial product
• Full Paxos-based solution
– Not a proxy-based solution
– Consensus on transaction ordering via Paxos
– Each repo replica is an exact copy
• Like --mirror, except that each replica can be pushed to
– Automatic peer-to-peer synchronization at the “update” level
– All replicas RW
• If the local server is down, users can pull/push from/to any replica
• Can front with a load balancer to distribute load (best in a Cluster setup)
• Combined with WANdisco Access Control Plus:
– Enables both HTTPS and SSH access paths
– Enables fine-grained Authorization control
– Team-based administration common to both Git and Subversion
• Central Administration and Dashboard
Git MultiSite
Distributing: SVN via “svnsync”
• Subversion has an application-specific mechanism for maintaining read-only (RO) repo copies: “svnsync”
– Use hooks on repo write operations to copy the data from the read-write (RW) repo
• Need to forward commits and changes to non-versioned revision properties
• Use PRE-hooks to register “future work” with a daemon
– Critical for the pre-revprop-change hook
• Avoid the simple POST-hooks-only implementation trap
– It’s possible they’ll never fire, causing embarrassing delays in synchronization
– Or “lost” rev-prop changes
• Coordinate pushes so that revisions show up before rev-props…
– Use Apache to forward write operations through from the RO repo to the RW repo:
• All but “GET”, “PROPFIND”, “OPTIONS”, “REPORT” should go to the RW repo
• Use “RewriteCond”, “RewriteRule” and “ProxyPassReverse” to make it happen (see the sketch below)
– Make sure that UUIDs are identical for all {RW, RO1, RO2, …} in a set
– Use pre-hooks at the RO repos to prevent modification except from the “svnsync” account
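To make the moving parts concrete, here is a minimal svnsync mirror setup; the paths and the “syncuser” account are assumptions, and a production setup still needs the daemon-based hook forwarding described above:

    # On the mirror host: create the RO repo and gate revision property
    # changes so only the replication account can make them (svnsync
    # requires a pre-revprop-change hook that allows its user).
    svnadmin create /srv/svn/repo
    cat > /srv/svn/repo/hooks/pre-revprop-change <<'EOF'
    #!/bin/sh
    [ "$3" = "syncuser" ] && exit 0          # $3 is the acting user
    echo "revprop changes only via the svnsync account" >&2
    exit 1
    EOF
    chmod +x /srv/svn/repo/hooks/pre-revprop-change

    # Register the mirror against the RW master, then copy the history:
    svnsync initialize file:///srv/svn/repo https://<centralSvr>/svn/repo \
        --sync-username syncuser
    svnsync synchronize file:///srv/svn/repo

    # Sketch of the Apache write-through forwarding named above, in the
    # RO repo's vhost; write methods are proxied to the RW master:
    cat >> /etc/httpd/conf.d/svn_mirror.conf <<'EOF'
    RewriteEngine On
    RewriteCond %{REQUEST_METHOD} !^(GET|PROPFIND|OPTIONS|REPORT)$
    RewriteRule ^/svn/(.*)$ https://<centralSvr>/svn/$1 [P]
    ProxyPassReverse /svn/ https://<centralSvr>/svn/
    EOF

On the RW master, post-commit and post-revprop-change hooks (via the daemon) then run “svnsync synchronize” and “svnsync copy-revprops” against each mirror.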
Distributing: svnsync tradeoffs
• Missing entries in “db/locks” at RO repos for objects locked via “svn lock”
– Trying to copy them can cause svnsync failures
– Without them, the RO repos cannot be used to replace the RW repo if the RW repo is lost
• At least not without the loss of all of the locks
• Manual “svnsync lock” resolution via the “--steal-lock” option (see below)
– Must verify before using it that no other svnsync is running
– The manual operation causes an interruption (delay) of service
• Serial push from the RW repo
– Current OpenSource implementations iterate serially from the RW repo (“master”)
– Delays propagation in the WAN
• Use of post-event hooks
– Current OpenSource implementations can fail due to lost post-event hooks
– Example: a non-versioned revision property change never shows up at the RO repos
• Does not handle “authz” file change propagation
• No centralized administration or dashboard
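The stale-lock recovery mentioned above is, under the same hypothetical paths as before, a single manual command – but only after verifying that no other svnsync run is still alive:

    # Break an abandoned svnsync lock and resume mirroring (svnsync 1.7+):
    svnsync synchronize file:///srv/svn/repo --steal-lock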
Distributing: WANdisco SVN
• Fully supported commercial product
• Full Paxos-based solution
– Not a proxy-based solution
– Consensus on transaction ordering via Paxos
– Each repo replica is an exact copy
• Including db/locks entries, time stamps, etc. (locks are handled properly)
– Automatic peer-to-peer synchronization at the FSFS level
• Distribution of commit/rev-prop changes from any peer to any peer
• Speeds up WAN synchronization
– All replicas RW
• If the local server is down, users can “svn switch --relocate” to another
• Can front with a load balancer to distribute load (best in a Cluster setup)
• Handles “authz” file propagation
– Supports both Apache and svnserve concurrently
• Central Administration and Dashboard
SVN MultiSite Plus
Distributing: Investments
• Obvious:
– Hardware, Power, Cooling, Storage at each site, Administrative Staffing
– Make versus buy?
• Non-obvious:
– Time to deliver the data to the distributed sites
• WAN bandwidth
• WAN latency
• WAN packet loss
– Coordinating data
• Time to coordinate updates at all sites
Harden, Cluster, Distribute
• I know – advice is cheap
• Think of it as jump-starting your investigation
– Nothing more
Advice
Real World: Use all Three!
• The following slides are some recommendations on when to use:
– Hardening
– Clustering
– Distributing
• Design based on your user population
– Locations
– Cost structures at those locations
– Staff density at those locations
– Include automated demand loading (automation)
• Each situation is different – it may require a different mix
– Take an iterative approach
Real World: Hardening
• Choose system components based on the cost/reliability tradeoff
– Get your vendor to provide their failure rates
– Find a new vendor if they won’t help
• Always use Chipkill ECC memory in servers
– Replace DIMMs when uncorrectable errors occur
• Aside: why does parity memory still exist???
• Use SAN where the ROI is positive
– Use direct-attached RAID when SAN is overkill
• Never use motherboard-based RAID – just too iffy, poorly supported, etc.
– At the bottom end, use software RAID
• Expensive in CPU
• Competes with SVN’s MD5 computations
– At the very bottom, use single drives – and expect failures
• Buy server drives with 24x7 duty cycles (not 8x5)
Real World: Clustering
• You don’t have to have clusters everywhere
– Good thing – they’re relatively expensive
• Set up a cluster in critical areas to distribute the load
– Centralized global presence
– High staff density
– High burden of automated Continuous Integration
• Beware automated commits – see below
• Set up a single machine or VM in lower-use areas
– Monitor use and host performance
– Daily reports on use/performance showing historical trends
– Convert to clusters when loading and ROI justify it
Real World: Distributing SCMs
• Repos where binaries get transmitted are critical
– Even tiny “source only” repos only ever grow
• Distribute to development sites where latency from the central office is large
– Consider it when latency is 50 milliseconds or larger
– Absolutely do it when over 100 milliseconds
• If you distribute some repos, distribute them all
– Live copies are better than backups
– Keep it simple for the developers
• Each site’s staff only needs to remember their server
SVN/Git is Free!
• Fight the mindset that SCM should be cheap…
– It houses your core intellectual property!
• Mindset: investing in SCM provides positive ROI
– Remind them what the last outage/recovery cost in terms of:
• Missed schedules
• Missed market windows
• Lost data
– Some help with the justification:
• http://www.wandisco.com/roi-calculator
• Gather data
– Gather data
• Gather data
• Then generate reports!
Why is this costing money?
Abusive Use Cases: Read-Only
• Examples:
– Runaway user scripting
• git clone/fetch/pull
• svn checkout
– User scripting with out-of-date credentials
• You will need solid monitoring and reporting to find these (see the sketch below)
– Reporting will need to summarize the number and size of requests by account
– Flag heavy AuthN failures
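As a starting point, that report can be pulled straight from the Apache access logs; this sketch assumes the Common Log Format (account in field 3, status in field 9, response size in field 10) and a hypothetical log path:

    # Requests and bytes served per account, busiest first:
    awk '{ req[$3]++; bytes[$3] += $10 }
         END { for (u in req) printf "%8d req %14d bytes  %s\n", req[u], bytes[u], u }' \
        /var/log/httpd/svn_access_log | sort -rn

    # Accounts racking up authentication failures (HTTP 401s):
    awk '$9 == 401 { fail[$3]++ } END { for (u in fail) print fail[u], u }' \
        /var/log/httpd/svn_access_log | sort -rn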
Abusive Use Cases: Read-Write
• Continuous Integration (CI) with updates/modifications
– Can behave like 100 normal users working 24 hours per day
– Engage with the CI designers right at the beginning
– Reads are inexpensive, writes are expensive
• Try to minimize writes
• Massive check-ins
– CDs, DVDs, video clips, chip designs, etc.
– Distributing those types of artifacts can take hours over the WAN
– Roll out non-SCM artifact repositories/managers with or before your SCM
• Removing artifacts from Subversion is extremely painful (offline for dump/filter/load)
• Removing artifacts from Git upstream is extremely painful (filter-branch, non-fast-forward pushes)
– “Artifact Repository” examples:
• Archiva
• Artifactory
• Nexus
Thank You [email protected]