Five Signs You Have Outgrown Cassandra (and What to Do About It) WHITE PAPER



Executive Summary

Cassandra is a well-known NoSQL database, maintained under the Apache Foundation and commercialized by a number of companies. While it's easy for organizations such as yours to start with Cassandra, you are (or soon will be) facing increasingly large costs and growing complexity in day-to-day operations as your application load grows.

This impacts not only your Line of Business (LOB) budget, but also your operational stability and, further, your customer experience. Your Cassandra infrastructure hampers your organization's ability to be agile, to compete, and to bring new products and services to market. Aerospike, the leading enterprise-grade NoSQL database, can save you 5x or more in Total Cost of Ownership (TCO) while providing proven, unparalleled uptime and availability. Aerospike is used in production and trusted by industry-leading organizations for their mission-critical applications.

Five Signs

What are the five signs that your company may have outgrown Cassandra?


About Aerospike

Founded in 2009, Aerospike has diligently focused on building a mission-critical, highly available, distributed, and record-oriented key-value NoSQL database. Aerospike powers the AdTech industry; its customers include AdForm, Applovin, AppNexus, BlueKai, InMobi, Rubicon Project, TradeDesk, and many others. Aerospike also drives innovation in a number of other sectors, including Telecommunications (with Nokia, HP Enterprise, Airtel, NTT, and Viettel), Financial Services, Gaming (with King, DraftKings, and Curse), and eCommerce (with Williams-Sonoma and Kayak).

Designed and built to exploit the characteristics of Flash/SSD and poised to take advantage of storage class memory, Aerospike provides unprecedented value to its customers. Our technology is driving fundamental changes in how people think about, store, and access their data; it's the key ingredient for building rich, engaging applications and services. Aerospike is driving digital transformation across many industries by enabling our customers to build relevant systems of engagement; this includes better recommendation engines in retail and marketing, fraud prevention in payment processing and cybercrime detection, and billing and service enablement in telecommunications. Aerospike's combination of extraordinary uptime, high availability, and consistent performance all but eliminates service disruptions for your customers.

Five Signs You Have Outgrown Cassandra

There are many business and technical demands driving your organization: delivering new applications faster, reducing costs, providing a reliable and engaging experience, maintaining your Net Promoter Score, driving digital transformations, and more. How do these map to the signs that you have outgrown your Cassandra cluster?

Sign #1: Your Cassandra Clusters Are Growing at an Unexpected Rate & You're Worried about TCO

It's a dirty little secret: the NoSQL community and vendors have encouraged you to build big - really big - database clusters. It became a matter of honor to be running thousands of nodes (and in Apple's case, 75,000 nodes [1]). But what are the consequences of such expansive clusters? In a large cluster, the actual probability and incidence of having a node fail on you goes from theory to a daily - if not hourly - occurrence. It's dozens of hard drives or SSDs per server, over hundreds or thousands of servers. Vendors directly benefit from your big clusters. Ever notice how they price by the node? There's no incentive for them to reduce their number.

[1] http://cassandra.apache.org/


To illustrate our point, let's use the data on drive failures presented by Google at FAST '16. This research showed a failure rate of 1-2% for SSDs and 2-20% for HDDs. What does this mean in practical terms? Let's start with one server with one drive and work our way up to 75,000 servers each with 4 drives, as you would see in very large deployments. The failure rates in increasingly large server deployments are illustrated in Table 1 below:

Table 1. Observable failure rates in server deployments

The larger the cluster, the more components you have; hence, hardware failure goes from a mere possibility to a routine event that occurs daily, if not hourly. Server sprawl creates more hardware failures that your operations teams need to deal with. By contrast, smaller clusters mean fewer components, which reduces the number of actual failures with which you must deal.
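The arithmetic behind this observation can be sketched in a few lines. The snippet below is only an illustration: it assumes the quoted percentages are annualized failure rates and that drives fail independently, and the function name and all fleet sizes other than 75,000 servers with 4 drives are hypothetical.

```python
def expected_drive_failures_per_day(servers: int, drives_per_server: int,
                                    annual_failure_rate: float) -> float:
    """Expected drive failures per day for a fleet.

    Assumes independent drives and an annualized failure rate; both are
    assumptions of this sketch, not statements from the cited study.
    """
    total_drives = servers * drives_per_server
    return total_drives * annual_failure_rate / 365.0


if __name__ == "__main__":
    # Fleet sizes are illustrative; only the 75,000 x 4 case comes from the text.
    fleets = [(1, 1), (100, 4), (1_000, 4), (75_000, 4)]
    for servers, drives in fleets:
        low = expected_drive_failures_per_day(servers, drives, 0.01)   # 1% (SSD, low end)
        high = expected_drive_failures_per_day(servers, drives, 0.20)  # 20% (HDD, high end)
        print(f"{servers:>6} servers x {drives} drives: "
              f"{low:.3f} to {high:.3f} expected failures/day")
```

Under these assumptions, 300,000 drives at a 1% annual rate already work out to roughly eight expected drive failures per day, which is the "daily, if not hourly" regime described above.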

Cassandra does a great job of horizontal scaling: you simply add more nodes. The more important question is, are you able to fully utilize each node before you need to buy, provision and manage another... and another... and another? You know the answer: you have found that Cassandra cannot fully utilize a database node. Thus, when you hit a resource limit - either CPU, storage IOPs, or DRAM for the JVM heap - your only alternative is to scale out. Yet each node you add creates more complexity. Each node you add also results in significantly greater cost, because Cassandra vendors like to charge by the node. Further, reliability suffers: the law of large numbers means that you will see actual failures, and see them more frequently than you can imagine.


Because Cassandra leaves you unable to utilize the full performance of your servers, you are forced to perform some unnatural acts, best described in the following way by the Cassandra community and its committers:

“Instead of scaling the compute side over the metal, we do silly things like run multiple instances per box…” [2]

Indeed, running multiple Cassandra instances per box is so common that DataStax, one of the Cassandra vendors, created a Multi-Instance feature as part of their Enterprise version to automate this deployment topology. However, running multiple instances per server just adds further operational complexity and compounding failure modes when a server goes down.

Why is Cassandra so inefficient with compute resources? The use of Java - with multiple JVM providers, and with a number of garbage collection (GC) strategies (e.g., HotSpot's CMS, G1, etc.) - creates many variables that developers and ops people can try to optimize and tune [3][4]. To get the most out of a node, you need to carefully read the logs, adjust JVM parameters, debug [5], look at thread dumps [6], etc. Naturally, you need to do so for each different workload and cluster configuration, especially if the hardware is different. And when you upgrade the hardware on your existing cluster? Yes, you need to retune all over again. What if you add a new workload to existing data? You've guessed it - you need to retune.

This tuning is difficult, and changes are prone to error [7]. Most ops teams find it easier to expand the cluster using the same configs and hardware profile. It's a practical - though costly - approach for the Line of Business owner, IT budget owner, or whoever has to pay the bill.

Think back on how you initially sized your cluster. How did you accomplish this task? You followed the best practices from numerous community blogs, the Apache wiki, or documentation from one or more of the vendors. Hopefully, you took into account the additional storage space needed depending on your choice of compaction strategy (STCS vs. LCS). You may have taken into account space for snapshots. Were you then surprised when the application was deployed and used a huge amount of additional storage space? This is where you needed to know with precision which features the application team used when the application was constructed, as noted by one community user:

“...we attempted to use a CQL Map to store analytics data, we saw 30X data size overhead vs. using a simpler storage format and Cassandra's old storage format, now called COMPACT STORAGE. Ah, that's where the name comes from: COMPACT, as in small, lightweight. Put another way, Cassandra and CQL's new default storage format is NOT COMPACT, that is, large and heavyweight.” [8]

[2] https://issues.apache.org/jira/browse/CASSANDRA-7486

[3] https://issues.apache.org/jira/browse/CASSANDRA-8150

[4] https://issues.apache.org/jira/browse/CASSANDRA-7486

[5] https://alexzeng.wordpress.com/2013/05/25/debug-cassandrar-jvm-thread-100-cpu-usage-issue/

[6] https://support.datastax.com/hc/en-us/articles/204226009-Taking-Thread-dumps-to-Troubleshoot-High-CPU-Utilization

[7] https://tobert.github.io/pages/als-cassandra-21-tuning-guide.html

[8] http://blog.parsely.com/post/1928/cass/


As we will describe later, how you model data - and which features you use in Cassandra - dramatically affects your utilization, reliability and response times.

Finally, your compaction and backup strategy can also have a huge impact on your CAPEX. Because you are reliant on compaction to reduce storage requirements, you may end up backing up older generations of the data again and again until the compactions can catch up. This may require significant additional storage capacity, as was noted by Rohit Shekhar of Datos.io in his team's experiments:

“Case in point: [...] secondary storage was as high as 12 times the primary storage for level compaction.” [9]

Sign #2: Peak Loads Are Causing Service Disruptions

Ingesting huge amounts of data - either periodically or as part of the regular usage of the application - can be critical in many applications. However, as the data is written, the application will need to read and modify the same data; such mixed workloads constitute the normal pattern for applications like activity streams, profile stores, trade stores, etc. Write-once workloads, where the data is never modified, like log streams, are not the norm.

If mixed reads and writes are such a common use case, why is this pattern such a problem for Cassandra? Quite simply, this is due to an architectural choice made by the designers of Cassandra: namely, its log-structured file system and the eventual consistency of data.

Kyle Kingsbury's (a.k.a. @aphyr's) post about Cassandra [10] states that without vector clocks, Cassandra has to rely on a very precise usage model from its users. If users do not adhere to this model, Cassandra will lose acknowledged writes, meaning that there are few guarantees of reading the correct information. As a developer, you can try to code around the problem with various consistency models [11], so you can at least get a quorum across the copies held on the nodes of the cluster for reads and writes. This adds unpredictable latency to any operation, as the operation can only be as fast as the slowest node it must wait for. The most common solution is to cache more of the data to avoid disk reads; this leads to larger clusters and more DRAM, violating many of the tuning guides regarding the size of the JVM heap.
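The latency cost of quorum operations can be illustrated with a simplified model. The sketch below is not Cassandra code: it assumes a coordinator that returns as soon as a majority of replicas have answered, and it ignores the coordinator hop, retries, and read repair. The function names and latency figures are hypothetical.

```python
def quorum_size(replication_factor: int) -> int:
    """Smallest majority of the replicas."""
    return replication_factor // 2 + 1


def quorum_latency_ms(replica_latencies_ms: list[float]) -> float:
    """Latency of a quorum operation in this simplified model: the request
    completes when the slowest of the fastest `quorum_size` replicas answers."""
    needed = quorum_size(len(replica_latencies_ms))
    return sorted(replica_latencies_ms)[needed - 1]


def reads_see_latest_write(read_replicas: int, write_replicas: int,
                           replication_factor: int) -> bool:
    """Read and write sets overlap in at least one replica when R + W > N."""
    return read_replicas + write_replicas > replication_factor


if __name__ == "__main__":
    healthy = [2.0, 3.0, 250.0]     # one slow replica is masked by the quorum
    degraded = [2.0, 180.0, 250.0]  # two slow replicas are not
    print(quorum_latency_ms(healthy), quorum_latency_ms(degraded))
    print(reads_see_latest_write(2, 2, 3), reads_see_latest_write(1, 1, 3))
```

In this model a single slow replica is hidden, but as soon as a second replica slows down (for example during compaction or garbage collection) the whole operation inherits its latency, which is one way to read the "unpredictable latency" described above.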

That's not the only challenge for mixed workloads, as was noted by the venerable Oracle corporation:

“Cassandra uses consistent hashing over a peer-to-peer architecture where every node in the system can handle any read-write request, so arbitrary nodes become coordinators of requests when they do not actually hold the data involved in the request operation. That means both an extra network hop (minimum) for each call and it means the failure of a single node can have systemwide performance impacts as other arbitrary nodes change their behavior in response to the failed node.” [12]

[9] https://datos.io/backup-challenges-cassandra-compaction/

[10] https://aphyr.com/posts/294-jepsen-cassandra

[11] https://docs.datastax.com/en/cassandra/2.0/cassandra/dml/dml_config_consistency_c.html

One Cassandra vendor also acknowledged this behavior in their documentation:

“Client read or write requests can be sent to any node in the cluster. When a client connects to a node with a request, that node serves as the coordinator for that particular client operation. The coordinator acts as a proxy between the client application and the nodes that own the data being requested. The coordinator determines which nodes in the ring should get the request based on how the cluster is configured.” [13]

Thus, even in healthy clusters, there are inevitable network hops to service the simplest of requests. These compounding factors lead to a wide variance in read latencies. Choosing a sensible partition key is only a partial solution: it simply limits the number of nodes that must be checked rather than eliminating the need in the first place.

During any situation where nodes become unavailable, further memory pressure [14] is applied to the coordinator node, since it needs to keep track of any hinted handoff for writes that will need to be re-applied later. This memory pressure can lead to instability throughout the cluster, as we will see in the next sign.

Compactions add another layer of complexity. In any log-structured merge file system, you need to periodically prune and compress the trees, removing older and redundant versions of the data and cleaning tombstones (deleted records). Cassandra has both major and minor compactions, which the community spends a lot of time figuring out how to tune for the given workload and hardware [15]. This is a significant problem: during the time compactions run, they adversely affect the read and write latency and throughput of operations, and impact your SLAs (again). You know this when your log files start to get sprinkled with the following types of error messages:

How can you better deal with peak load, then? You will want to expand your Cassandra cluster as part of your regular capacity planning, or to deal with seasonal events like holiday sales. But don't wait until the last moment to expand your cluster, or you risk being too late. Some guides, such as the Threat Stack blog, state:

“Budget days to bring a node into the cluster. If you've vertically scaled [with fewer large nodes], then it will take over a week.” [16]

[12] http://www.oracle.com/technetwork/database/nosqldb/overview/ondb-cassandra-hbase-2014-2344569.pdf

[13] http://docs.datastax.com/en/cassandra/3.0/cassandra/architecture/archIntro.html

[14] http://www.datastax.com/dev/blog/modern-hinted-handoff

[15] https://medium.com/@foundev/how-i-tune-cassandra-compaction-7c16fb0b1d99#.78lo047w7

[16] http://blog.threatstack.com/scaling-cassandra-lessons-learned


And as you expand your Cassandra cluster, expect that this will have operational impact and that your application will miss SLAs. As one community user remarked:

“The bottom line, is that your queries do have a higher chance of failing before the new node is fully-streamed.” [17]

Sign #3: You've Learned to Live With Cascading Failures

For mission-critical systems, availability is the most crucial aspect of a data layer. After all, isn't availability why you picked AP from the CAP theorem and selected Apache Cassandra in the first place? You've chosen a system that has distribution and replication of data, so when a node becomes unavailable - momentarily or permanently - the data lives elsewhere. Right?

Wrong, actually. Data distribution sounds well and good (and is the correct solution), but if a single node outage causes a cascading failure across your cluster, every node becomes a single point of failure for the cluster. And cascading failures are common [18][19], especially when CPU pressure causes nodes to stop self-reporting [20] and triggers cluster rebalances, causing further CPU and I/O pressure on the surviving nodes. As one Cassandra user noted, “Cassandra seems to have two modes: fine and catastrophic”. [21]

The failure of one node has often been observed to cause cascading failures [22][23] across the whole cluster. Research papers have shown these problems to be systemic with Cassandra [24].

What, then, are the typical sources of cascading failures? They include:

• Memory pressure caused by hinted handoff during failover
• Compactions trashing the row cache
• I/O and memory pressure from memtable flushes during high load
• Compactions causing I/O, and thus, CPU pressure
• Compactions not occurring fast enough, causing memory pressure
• Memory use causing frequent garbage collection, and thus, CPU pressure
• A large number of tables, causing memory pressure

Let us tackle each of these in turn.

Memory pressure caused by hinted handoff during failover - As noted by one Cassandra vendor in several pages of their documentation, the cause of this is clear. If you rely on Cassandra's ability to store writes on a coordinator node to replay later when the designated node returns to the cluster, this functionality consumes large quantities of memory; thus, another node's outage causes significant memory pressure on all the other nodes:

"If this happens on many nodes at once this could become [sic] substantial memory pressure on the coordinator. So the coordinator tracks how many hints it is currently writing, and if this number gets too high it will temporarily refuse writes (with) whose replicas include the misbehaving nodes." [25]

[17] http://stackoverflow.com/questions/37283424/best-way-to-add-multiple-nodes-to-existing-cassandra-cluster

[18] http://danluu.com/postmortem-lessons/

[19] https://www.usenix.org/conference/osdi14/technical-sessions/presentation/yuan

[20] https://moz.com/devblog/cassandra-in-production-things-we-learned/

[21] http://www.slideshare.net/planetcassandra/pd-melting-cass/12?src=clipshare

[22] http://mail-archives.apache.org/mod_mbox/cassandra-user/201106.mbox/%[email protected]%3E

[23] http://www.stackdriver.com/post-mortem-october-23-stackdriver-outage/

[24] http://ucare.cs.uchicago.edu/pdf/socc14-cbs.pdf

This is not just a theoretical problem; it's a very real one that's a function of your data usage and data design with hinted handoff, as was noted by one Cassandra user:

“Serializing the big rows causes high memory pressure…” [26]

Users of Cassandra often recommend disabling hinted handoffs - and thus reducing availability - to avoid cascading failures:

“Don't use hinted handoffs (ANY or LOCAL_ANY quorum). In fact, just disable them in the configuration. It's too easy to lose data during a prolonged outage or load spike, and if a node went down because of the load spike you're just going to pass the problem around the ring, eventually taking multiple or all nodes down” [27]

Compactions trashing the row cache - As noted by another Cassandra vendor, compactions, which increase during a single node outage as greater load is applied on surviving nodes, also increase pressure on the I/O, memory, and CPU of those nodes:

“Cassandra compaction thrashes the [O/S] page cache, because it reads and writes everything, and after compaction the most frequently used data is likely to no longer be in the cache.” [28]

A Cassandra cluster is often sized using assumptions about the effectiveness of the row cache; an ineffective row cache leads to a greater number of connections and transactions in flight. This causes difficulty (at the very least, performance issues) for surviving nodes. However, the effects of the cache are negated when the blocks being cached are paged out by another database feature.

I/O and memory pressure from memtable flushes during high load - Flushing memtables is critical because writes are blocked until the flush succeeds [29]. But there is a cascading effect if flushing is not tuned correctly:

“...proper tuning of these thresholds is important in making the most of available system memory, without bringing the node down for lack of memory.” [30]

Indeed, during a node outage, your carefully selected tuning becomes invalid.

[25] http://www.datastax.com/dev/blog/modern-hinted-handoff

[26] http://java.cz/dwn/1003/72451_CassandraCZJUG_horky.pdf

[27] http://blog.threatstack.com/scaling-cassandra-lessons-learned

[28] http://www.scylladb.com/technology/memory/

[29] https://docs.datastax.com/en/cassandra/2.1/cassandra/dml/dml_write_path_c.html

[30] https://wiki.apache.org/cassandra/MemtableThresholds


Compactions causing I/O, and thus, CPU pressure - Choosing Leveled vs. Size-Tiered compactions will dramatically change I/O and CPU pressure. This is a function of the reads and writes at this moment in time: a predominantly read-heavy application will get the adverse effects of a data load job, causing pressure on both I/O and CPU.

As was noted by one vendor:

“Since SSTables are immutable, this process puts a lot of pressure on disk io as SSTables are read from disk, combined and written back to disk.” [31]

Compactions not occurring fast enough, causing memory pressure - The immutable nature of Cassandra's log structure means that the process of compactions is not only inevitable; depending on your data access patterns, it may downright imprison you. Indeed, when “the compaction is not able to complete” [32], this causes unavailability. As was also expressed on Target's tech blog:

“The nodes would OOM frequently when compacting a specific column family… What I discovered is that Cassandra was reading a lot of tombstones each time, and this was putting lots of extra data on the heap. This would just snowball when the cluster was under load, and blow the heap.” [33]

Memory use causing frequent garbage collection, and thus, CPU pressure - With a Java-based codebase, garbage collection is inevitable and uncontrollable. Tuning is possible, but often described as a “dark art”. Unfortunately, the side effects of garbage collection are real, as users report:

“...garbage collection was happening 20+ times a second, even when Cassandra was under tiny load.” [34]

“In both cases the C* nodes end up doing garbage collection for ~90 secs per sweep” [35]

This impacts the latency of responses and throughput of the system, as vital system resources are used to manage memory.

A large number of tables, causing memory pressure - The way in which you constructed the data schema can also impact the memory pressure, and changes to application design and use can radically change hardware requirements. As noted by Ryan Svihla, a Solution Architect at DataStax:

“There is in general a cluster max effectively [sic] limit on table counts. Anything over 300 starts to create significant heap pressure.” [36]

[31] http://www.planetcassandra.org/blog/impact-of-shared-storage-on-cassandra/

[32] http://stackoverflow.com/questions/29273276/cassandra-node-heap-pressure-during-compaction-after-bulk-load

[33] http://target.github.io/infrastructure/tuning-cassandra

[34] http://target.github.io/infrastructure/tuning-cassandra

[35] http://stackoverflow.com/questions/29273276/cassandra-node-heap-pressure-during-compaction-after-bulk-load

[36] https://medium.com/@foundev/domain-modeling-around-deletes-1cc9b6da0d24#.goi7cxibs


Sign #4: Your Operations Team Is Growing Disproportionately & The Cost of Support Is Concerning

The number of aspects that an operational team must consider at cluster provisioning time is large. This leads to a time-consuming process for provisioning each cluster. Worse, the operations team must continually monitor Cassandra clusters for changes in application patterns, and re-tune the clusters on a frequent basis. Failure to retune leads not only to poor performance, but also (eventually) to CPU pressure, which limits garbage collection capabilities; this causes memory pressure, and in turn, outages.

The community and committers are well aware that Cassandra cannot utilize compute resources effectively, as can be seen on Cassandra's own issue tracking system:

“Instead of scaling the compute side over the metal, we do silly things like run multiple instances per box. It's not really silly if it gets results, but it is an example of where we do something tactically, get so used to it as a necessary complexity, and then just keep taking for granted that this is how we do it.” [37]

In order to increase utilization, clusters are forced to get wider, and multiple Cassandra nodes are commonly required on the same compute node. This pattern greatly increases operational complexity, necessitating not just more time from your operations staff to plan and deploy, but also more complex and compounding failure modes to diagnose and fix.

The setup of the JVM and other tuning parameters for the specific workload and hardware means there is no “out of the box” setting that will consistently work. Tuning is inevitable, as the committers of Cassandra themselves have noted:

“...there's a bunch of different workloads and a bunch of different hardware that C* runs on, and the idea of having a default that's optimal for everyone is unrealistic. It may very well be that G1 is a better "good enough" default for most distributions, large heap or no, and that's the conversation on IRC…” [38]

The fact that tuning is a necessity can have a significant impact on day-to-day operations, as Daniel Parker, an Engineer at Target, observed:

“We were having to restart nodes frequently to clear the heap. This was not an option to continue doing, especially pre-peak when we expect nearly 10x the traffic volume.” [39]

[37] https://issues.apache.org/jira/browse/CASSANDRA-7486

[38] https://issues.apache.org/jira/browse/CASSANDRA-10403

[39] http://target.github.io/infrastructure/tuning-cassandra


Sign #5: Hiring Dedicated Cassandra Experts Has Become Unavoidable and Difficult

Technical resources are always hard to find, employ and retain. Finding staff with deep expertise - even Apache code committers - is a step beyond most organizations' capabilities; yet, this is often recommended to effectively operate Cassandra clusters. Is it truly a badge of honor to have a team of committers like Netflix or Apple [40], or even to employ and motivate a team of “Cassandra whisperers” [41]? Does your business cost structure allow several hundreds of thousands of dollars per year in head-count to maintain free software? Can you justify buying a company [42] just for its Cassandra expertise? As Ted Wallace, VP of Data Delivery at BlueKai, noted:

“We ultimately found out that to do Cassandra you need people who are focused on keeping Cassandra alive and running. We didn't want to invest in creating a team of ‘Cassandra whisperers’. We didn't want to be experts at managing Cassandra” [43]

Your operations staff must also understand the differences between the multiple versions of Cassandra, and be able to tune effectively for different versions. Will your staff choose the generic Apache Cassandra distribution, or a vendored version created by your Cassandra distributor? If your team selects a Cassandra distributor, will they provide current releases? Will they keep up with the community for features, bug fixes, vulnerabilities and currency, or will your operations staff need to patch a vendor distribution with fixes from the open source version? Even a core committer like DataStax will defer the migration from Apache Cassandra 2.1 to 3.0 in their enterprise releases; is your organization willing to wait? As narrated by DataStax themselves in 2016:

“Today, it typically takes DataStax four to six months to certify a new, major version of open source Cassandra and ensure it is ready for enterprise deployments. This time may shorten as the tick-tock process drives down defect rates.” [44]

[40] http://www.planetcassandra.org/mvps/

[41] http://www.aerospike.com/blog/bluekai-nosql-speed-scale-simplicity/

[42] http://appleinsider.com/articles/15/03/25/apple-acquires-big-data-analytics-firm-acunu

[43] http://www.aerospike.com/blog/bluekai-nosql-speed-scale-simplicity/

[44] http://web.archive.org/web/20160322221453/http://www.datastax.com/2016/01/comparing-open-source-apache-cassandra-and-datastax-enterprise-release-models


The reality is that it often takes a minimum of eight months to vendor in Apache code. Here is an example of the release timeline for the DataStax Enterprise Edition:

Figure 1. Vendoring in Cassandra releases from Apache

You will be left with a choice between the following options: wait until a patch is vendored in, adopt a newer Apache release yourself (and abandon your support subscription), or fix the codebase you have.

Committing changes between distributions is just one more complex task for your operations staff to undertake. Your staff's harder challenge occurred much earlier in the process - when the application developers designed the schema, chose the primary and partition keys, and decided which features to use. As Juan Valencia, Principal Engineer at ShareThis, offered:

“We've made a lot of mistakes in data modeling over the course of development. Setting up our data models correctly was tricky.” [45]

Distributed counters seem like a reasonable feature for a distributed database, especially when performing real-time analytics, and the application team likely chose them. But as noted by Andrew Montalenti, CTO of Parse.ly:

“When I'm in a good mood, I sometimes ask questions about Counters in the Cassandra IRC channel, and if I'm lucky, long-time developers don't laugh me out of the room. Sometimes, they just call me a “brave soul”... All of this is to say: Cassandra Counters - it's a trap! Run!” [46]

Your operations staff can't rest by just creating a checklist of features used; they must know how the data is used, and what its life cycle is. Consider something simple like a queue, where you need to maintain order and also expunge data as soon as it's processed. That simple data model design leads to operational problems down the road, as Cassandra then needs to manage large numbers of tombstones (deleted records). As Ryan Svihla, a Solution Architect at DataStax, remarked:

[45] http://www.informationweek.com/strategic-cio/why-we-picked-cassandra-for-big-data/a/d-id/1318250

[46] http://blog.parsely.com/post/1928/cass/

“You realize that based on your queue workflow instead of 5 records you'll end up with millions and millions per day for your short lived queue, your query times end up missing SLA and you realize this won't work for your tiny cluster.” [47]

The use of TTL (Time-To-Live) as a mechanism to remove data after a specified time period is another potential trap. As one Cassandra user observed in their own attempts to use TTLs:

“...[TTL auto-expire] was able to effect a denial of service for all logins through creating a large amount of garbage [tombstone] records. Once the records for these failed logins had expired, all queries to this table started timing out.” [48]

Finally, your operations staff must absorb the consequences of your development staff picking the wrong primary or partition key. A poor choice inevitably ends up with hot spot nodes, which can cause one or more of the following: high memory, CPU pressure, or I/O pressure. This leads to the kind of cascading failures described above. One story from the community demonstrates this chain reaction:

“The failing node, in fact, was a hot spot! Because of an error in a primary key of one of a high loaded table! Table's data was not properly distributed across all cluster nodes. And the large portion of data was concentrated on that node! This led to two problems: 1 - significant amount of queries (read/write) were addressed to that node; 2 - huge keys - about ~5 megs per key; These two problems led to node load and, due to huge keys, instability (high pressure on GC).” [49]
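How a poor partition key produces a hot spot can be shown with a small simulation. This is a simplified sketch, not Cassandra's partitioner: it assumes keys are mapped to nodes by hashing modulo the node count (real clusters use token ranges and virtual nodes), and the workload, key names, and node count are hypothetical.

```python
import hashlib
from collections import Counter


def node_for_key(partition_key: str, node_count: int) -> int:
    """Map a partition key to a node by hashing (a simplified stand-in for a
    token ring; an assumption of this sketch)."""
    digest = hashlib.sha256(partition_key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % node_count


def load_per_node(partition_keys, node_count: int) -> Counter:
    """Count how many writes land on each node."""
    return Counter(node_for_key(key, node_count) for key in partition_keys)


if __name__ == "__main__":
    nodes = 6
    # Hypothetical workload: 10,000 events from 10,000 users on a single day.
    events = [(f"user-{i}", "2016-05-01") for i in range(10_000)]

    by_day = load_per_node((day for _, day in events), nodes)     # low-cardinality key
    by_user = load_per_node((user for user, _ in events), nodes)  # high-cardinality key

    print("partitioned by day :", sorted(by_day.items()))
    print("partitioned by user:", sorted(by_user.items()))
```

With the day as the partition key, every write in this example lands on one node, while partitioning by user spreads the same writes across all nodes. The first case is the kind of concentration the community story above describes.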

[47] https://medium.com/@foundev/domain-modeling-around-deletes-1cc9b6da0d24#.goi7cxibs

[48] https://www.tildedave.com/2014/03/01/application-failure-scenarios-with-cassandra.html

[49] https://www.reddit.com/r/cassandra/comments/3uzlnp/cassandra_high_gc_pressure/


Aerospike: The Enterprise-Grade NoSQL

As noted earlier, Aerospike was designed and built from the ground up to take advantage of modern computing architectures. Aerospike undertook an alternative approach, building from a clean sheet a technology stack and fundamental IP that has enabled its customers to obtain and enjoy an unprecedented low TCO (Total Cost of Ownership), unparalleled uptime, high availability, and reduced database infrastructure complexity. Let's explore how Aerospike has achieved this.

Reduced TCO

Aerospike is written in C by a development team with deep expertise in networking, storage and databases. By removing the layers of file system, block, and page caches, and instead building a proprietary log-structured file system designed for the way flash devices work, Aerospike can deliver unprecedented resource utilization. The bottom line is that Aerospike clusters are sized to have (on average) at least 5 times fewer servers than the equivalent Cassandra cluster. For instance, AdForm was able to decrease its number of nodes from 32 with Cassandra to 3 with Aerospike, and achieve a fourfold expansion of data [50]. Aerospike enabled the company to sustain the identical throughput as with their old Cassandra cluster, but with lower - and consistent - latency, and unmatched availability. As Jakob Bak, AdForm's CTO, notes:

“With Aerospike, we have been able to drastically cut down on the number of Cassandra servers, which provided a great cost reduction.” [51]

Lower cost. Higher availability. Predictable performance. You get to pick all three with Aerospike. The recently published YCSB benchmark comparing Aerospike and Cassandra shows in great detail how to prove to yourself this reduction in costs. Using the standard YCSB benchmark, we observed the following gains:

Table 2. Summary of YCSB Benchmark, Aerospike vs. Cassandra

[50] https://vimeo.com/101290545 and http://www.aerospike.com/adform-divorces-cassandra-scales-4x-with-2x-reduced-servers/

[51] http://www.aerospike.com/industry/adtech/adform-divorces-cassandra-scales-performance-by-4x-with-2x-fewer-servers/


Table 2 shows Aerospike's ability to fully utilize the hardware. But it's not just speeds and feeds that prove Aerospike's superiority. Accordingly, let's illustrate a simple TCO calculation - based on actual Aerospike customer data - with the savings our customers were able to achieve (Table 3):

Table 3. Cost Comparison, Aerospike vs. Cassandra

Table 3 represents a composite of multiple Cassandra replacements derived from actual customer implementations. The table depicts a 3-year TCO comparison - for exactly the same problem set - using a Cassandra solution vs. an Aerospike solution. To generate this table, we first estimated the size of the Cassandra cluster required; we then estimated the size of the required Aerospike cluster under the same assumptions. The existing Cassandra clusters used HDDs, while the Aerospike cluster was sized to use SSDs. Despite the cost difference between both drive types (SSDs cost more), the cumulative 3-year TCO savings obtained by using an Aerospike solution are very clear and real. Even if your Cassandra costs are sunk costs in the short term (for example, because you have pre-purchased a year's worth of instance hours with a cloud provider), switching to Aerospike right now starts the process of saving money. Using the data above, this would still result in a savings of nearly $6M by Year 3.

As Aerospike is a native C implementation, there are no inefficiencies from a Java runtime. The primary key index is stored as a parentless red-black tree, enabling ultra-fast key lookups in DRAM; the data is then retrieved from the proprietary log-structured file system. This file system allows parallelization across all the devices on the chassis; a typical Aerospike node will have 8-12 SSDs - sometimes as many as 16. By reading and writing in parallel to all devices, and removing the block and page caches, Aerospike can fully utilize all the IOPs and disk slots available before running out of CPU.

Aerospike's implementation is not faster simply because it uses C. It uses performance-tuned libraries, such as re-implementations of msgpack, and specific tested versions of JEMalloc. The code is a highly optimized, reference-counted multithreaded implementation, requiring certain developer skills in which Aerospike specializes.

Predictable Performance

Since Aerospike has master-based replication, operations are forwarded to the primary node for the record. This is equivalent to having a specific coordinator node for each portion of the data. The Primary Key is generated by a cryptographic hash algorithm, RIPEMD-160 (also used by Bitcoin), which has had zero detected hash collisions in any use. The first twelve bits of this generated hash identify the partition where the record resides.
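The mapping from a hashed key to a partition can be sketched as follows. This is an illustration rather than Aerospike's implementation: the bit order (the most significant twelve bits of the first two bytes) is an assumption, and SHA-1 is used as a stand-in 160-bit digest because RIPEMD-160 is not guaranteed to be available in every Python `hashlib` build. The function names are hypothetical.

```python
import hashlib

PARTITION_BITS = 12                     # "first twelve bits" from the text
PARTITION_COUNT = 1 << PARTITION_BITS   # 4096 possible partitions


def partition_id(digest: bytes) -> int:
    """Return the partition selected by the first twelve bits of a digest.

    Bit order is an assumption of this sketch: the twelve most significant
    bits of the first two bytes, read big-endian.
    """
    if len(digest) < 2:
        raise ValueError("digest must be at least 2 bytes long")
    first_16_bits = int.from_bytes(digest[:2], "big")
    return first_16_bits >> (16 - PARTITION_BITS)


def key_digest(key: str) -> bytes:
    """Stand-in 160-bit digest. The text names RIPEMD-160; SHA-1 is used here
    only because it is also 160 bits and always present in hashlib."""
    return hashlib.sha1(key.encode("utf-8")).digest()


if __name__ == "__main__":
    for key in ("user-1", "user-2", "user-3"):
        print(key, "-> partition", partition_id(key_digest(key)))
```

Twelve bits give 4,096 possible partitions, so every record maps deterministically to one of a fixed number of partitions regardless of how many nodes are in the cluster.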

Aerospike supports all the popular languages (C/C++, C#, Java, Python, Go, Node.js, PHP, Ruby, Perl, and Erlang) with a vendor-supported language client. These high-performance native clients provide translation of native data types to and from Aerospike, as well as interoperability between languages, greatly improving developer productivity.

Unlike the proxy model used in Cassandra, where requests are rerouted by the Coordinator node, Aerospike clients maintain a dynamic partition map. This map identifies the master node for each partition, which enables the client to route a read or write request directly to the correct node, removing the extra network hop that Cassandra's Coordinator introduces. Because the data is written synchronously to all copies of the data, there is no need to do any form of quorum read across the cluster to get a consistent version of the data. As AdForm’s CTO, Jakob Bak, opines:

“Even more important is the superfast key-value store and extraordinary predictability we get with Aerospike, providing the responsiveness our clients require to compete in the crowded Internet and mobile markets.”52
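The client-side routing described before this quotation can be sketched as a lookup table from partition to master node. This is a hedged illustration only: the `PartitionMap` class, the `build_map` helper, the round-robin assignment, and the node names are all hypothetical, and SHA-1 is used as a stand-in for the RIPEMD-160 digest.

```python
import hashlib
from dataclasses import dataclass, field

N_PARTITIONS = 4096


def partition_id(key: bytes) -> int:
    # Stand-in hash (SHA-1); keep the 12 most significant bits of the digest.
    digest = hashlib.sha1(key).digest()
    return int.from_bytes(digest[:2], "big") >> 4


@dataclass
class PartitionMap:
    """Client-side view of which node is master for each partition."""
    masters: dict = field(default_factory=dict)

    def node_for(self, key: bytes) -> str:
        # One local lookup; no coordinator node is asked where the key lives.
        return self.masters[partition_id(key)]


def build_map(nodes: list) -> PartitionMap:
    # Hypothetical round-robin assignment, used here only to fill the map.
    return PartitionMap({pid: nodes[pid % len(nodes)] for pid in range(N_PARTITIONS)})


if __name__ == "__main__":
    pmap = build_map(["node-a", "node-b", "node-c"])
    print("send request for user:42 directly to", pmap.node_for(b"user:42"))
```

In this sketch the client resolves the target node locally, so each request costs a single network round trip to the node that owns the record.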

52 http://www.aerospike.com/industry/adtech/adform-divorces-cassandra-scales-performance-by-4x-with-2x-fewer-servers/


This approach enables Aerospike to excel at mixed read/write workloads, without the unnecessary complications and latency impact of eventually consistent systems. To this point, Valery Vybornov, Head of R&D at IMHO Vi, notes:

“It met our demands for random access response time (mixed reads/writes) under our typical load of several hundred writes/some thousand reads per second.”53

Yet Aerospike is about more than maintaining predictable performance for a short burst of time, or shining in 5-minute benchmarks. Aerospike enables predictable performance over days and months - the type of performance that allows your business and applications to grow seamlessly as you broaden your presence in local and international markets. As Ko Baryimes, Kayak’s SVP of Technology, observes:

“As we continue to rapidly expand into international markets, we needed a solution that was reliable and could scale to serve offers across our network. Aerospike enabled us to achieve multi-key gets in less than 3 milliseconds, deploy with ease and scale with very low jitter.”54

Predictable scaling was key to the success of Aerospike at marketing analytics firm [x+1], as their CTO, Patrick DeAngelis, expresses:

“We’ve seen Aerospike scale to a few billion key values with no compromise to performance, and we’ve even seen response times under 1 ms, which is phenomenal.”55

53 http://www.aerospike.com/blog/scaling-to-meet-russias-rapid-internet-ad-trajectory/

54 http://www.aerospike.com/press-releases/kayak-selects-aerospike-delivers-personalized-offers-at-scale/

55 http://s3-us-west-1.amazonaws.com/aerospike-fd/wp-content/uploads/2015/02/x-1casestudy_2012.pdf


As we saw above, the recent Cassandra benchmark results showed the overall TCO savings that are possible with Aerospike. But this is just part of the story. A comparison of the variance of Aerospike vs. Cassandra - that is, the range of response times and throughput, rather than a simple average - tells a very interesting tale:

Figure 2. Measured variance in read/write throughput and latency

As Figure 2 illustrates, during the twelve hours that the benchmark ran, throughput and latency for both read and write operations varied hugely for Cassandra (marked in blue). In contrast, this variance remained in a very narrow band for Aerospike (marked in red). What this translates to is consistent response and throughput characteristics for your applications: Aerospike is very predictable, making it trivial for you to meet your application SLAs not just now, but also in the future.
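To see why variance matters more than a simple average, consider the short sketch below. The two latency series are invented purely for illustration and are not taken from the benchmark or from either database; the names `steady`, `bursty`, and `summarize` are hypothetical. Both series share the same mean, yet their spread is very different.

```python
from statistics import mean, pstdev

# Hypothetical latency samples in milliseconds (illustrative values only,
# not measurements of any real system).
steady = [10, 11, 9, 10, 10]
bursty = [2, 25, 3, 18, 2]


def summarize(samples):
    """Return the average alongside two simple measures of spread."""
    return {
        "mean": mean(samples),
        "stdev": pstdev(samples),          # population standard deviation
        "range": max(samples) - min(samples),
    }


if __name__ == "__main__":
    print("steady:", summarize(steady))
    print("bursty:", summarize(bursty))
```

An SLA is normally written against worst-case or high-percentile behavior, so the series with the wider range is the harder one to plan capacity for, even though both report the same average.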

Proven Reliability

Aerospike uses a shared-nothing architecture, where all nodes are peers, without any node having a specific role. It is built with a unique master-based cluster algorithm. For each partition, a single node is the owner, or primary node. If a replication factor is defined, then N other nodes hold a copy of each record in that partition for reliability. If you lose a node, then you have another copy. And unlike other systems, Aerospike writes synchronously across all copies of the data.
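The synchronous write path described above can be sketched as follows. This is a toy, in-process model written under stated assumptions rather than a description of Aerospike's internals: the `Node` class and the `replicated_write` function are hypothetical, and real replication happens over the network with failure handling that is omitted here.

```python
class Node:
    """A toy storage node holding records in a dictionary."""

    def __init__(self, name):
        self.name = name
        self.records = {}

    def apply(self, key, value):
        # Store the record and acknowledge the write.
        self.records[key] = value
        return True


def replicated_write(master, replicas, key, value):
    """Write to the master and every replica before reporting success.

    The caller is told the write succeeded only once all copies have
    acknowledged it, which is what 'synchronous' means in this sketch.
    """
    copies = [master, *replicas]
    acks = [node.apply(key, value) for node in copies]
    return all(acks)


if __name__ == "__main__":
    a, b, c = Node("node-a"), Node("node-b"), Node("node-c")
    ok = replicated_write(a, [b, c], "user:42", {"score": 7})
    print("acknowledged by all copies:", ok)
```

Because every copy holds the record by the time the write is acknowledged, a later read from the master does not need to consult the other copies to find the latest version.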

When a node fails or is removed from the cluster, any node that has a secondary copy of the partition can be instantly promoted to be the master of that partition, without the typical delays imposed by a consensus algorithm. Aerospike uses the Paxos algorithm to ensure consensus across the cluster, but since each node is equal in its role and equally current with the state of the data, any of the secondary nodes can be promoted. This has allowed an organization like AppNexus to minimize downtime, as expressed by its CTO, Geir Magnusson:

“2.5 million impressions a second at peak, although we can go much higher, and we see north of 90 Billion impressions per day and this is a 24×7 business with 100% uptime with Aerospike.”56

Valery Vybornov, Head of R&D at IMHO Vi, has a very similar story:

“We also considered the Aerospike database’s maturity and stability - most notably that it has been running in production non-stop for almost four years.”57
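The promotion of a secondary copy on node failure, described earlier in this section, can be illustrated with a small sketch. It assumes a hypothetical `replica_map` in which each partition lists its owners in order, master first; the function name `promote_on_failure` and the node names are also hypothetical, and the real cluster logic is considerably more involved.

```python
def promote_on_failure(replica_map, failed):
    """Return a new map with the failed node removed from every partition.

    Owners are listed master-first, so dropping a failed master leaves the
    next surviving copy at the front of the list, i.e. promoted to master.
    """
    return {
        pid: [node for node in owners if node != failed]
        for pid, owners in replica_map.items()
    }


if __name__ == "__main__":
    # Three partitions, replication factor 2, owners listed master-first.
    replica_map = {
        0: ["node-a", "node-b"],
        1: ["node-b", "node-c"],
        2: ["node-c", "node-a"],
    }
    after = promote_on_failure(replica_map, "node-b")
    for pid, owners in after.items():
        print("partition", pid, "master is now", owners[0])
```

Since the surviving copy already holds the same data as the failed master, the promotion in this model is a bookkeeping change and requires no data transfer before the partition can serve requests again.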

Beyond “fair-weather” performance and availability, you also need a system that can deal with human error (e.g., “fat-fingering”) or natural disasters (e.g., severe weather). Aerospike has the ability to ensure availability across regions using Cross Datacenter Replication (XDR). During Superstorm Sandy, which hit the East Coast of the United States in 2012, this feature allowed adMarketplace to maintain availability without a hitch, as the company’s CTO, Mike Yudin, narrates:

“We do a 100% uptime. We lost one of our data centers in the flood [Hurricane Sandy], and it’s not just the data center itself that lost power, it’s the entire network infrastructure of the tri-state major area, all the backbones... How did we do this? We do this by having redundant, not only redundant equipment within the data center, but also the globally load-balanced infrastructure across multiple locations. If one gets flooded, then traffic just gets shifted into the data center that survives. The trick here of course is to make sure that your location has all the same data and all the same intelligence as the system that got destroyed.”58

56 http://www.aerospike.com/why-appnexus-uses-aerospike/

57 http://www.aerospike.com/blog/scaling-to-meet-russias-rapid-internet-ad-trajectory/

58 http://www.aerospike.com/blog/super-storm-sandy-and-100-uptime/


Peopleware

Hiring and retaining staff is always a critical task in any organization. You need to ensure that the services and applications you build and deploy meet the time to market (TTM) needs of the organization. Yet people costs should never be excluded from this calculation. The long-term care and maintenance of your infrastructure is a real cost which you need to drive down - if only to spend your company's energy on unique business functions, not database maintenance. From an operational perspective, Aerospike radically simplifies running and maintaining a distributed database, as expressed by BlueKai’s VP of Delivery, Ted Wallace:

“It just works and then you move on. We’ve done that repeatedly over the course of the past 2.5 years and it’s always just worked. That’s been awesome.”59

Contrast this with the typical advice from the Cassandra community - for instance, Sam Bisbee, CTO of Threat Stack:

“Don’t understaff Cassandra. This is hard as a startup, but recognize going in that it could require 1 to 2 FTEs as you ramp up, maybe more depending on how quickly you scale up.”60

Operational Simplicity

The true value of complex technology is its simplicity of usage, especially in challenging production environments. You do not want to consume time and energy as the cluster expands, or when you refresh the hardware or migrate to a new data center. You want to use your database infrastructure like a utility, adding capacity as and when you need to, without the need to perform extensive planning for maintenance windows. As Tapad’s CEO, Dag Liodden, states:

“Aerospike makes upgrading simple. That’s the beauty of this product. There’s no planning required. You can take servers down, and still have the system running.”61

Aerospike achieves this with the operational simplicity of self-forming and self-healing clusters, which are both rack-aware and data center-aware. No downtime is required to add or remove nodes from a cluster; the cluster automatically redistributes partitions of data to ensure the new compute resources are efficiently used. Rack awareness also ensures that data is correctly separated to avoid compounding failures. Using a proprietary algorithm, nodes can be restarted - for example, to perform a software upgrade - with their memory contents preserved. The process can therefore restart in just seconds, without the need to warm the memory buffers and caches, as the data was never evicted from DRAM.
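The automatic redistribution of partitions when a node joins can be illustrated with a short sketch. Aerospike's own placement algorithm is not specified in this paper, so the example below swaps in rendezvous (highest-random-weight) hashing as a well-known stand-in; the functions `weight`, `succession`, and `assign`, and the node names, are hypothetical. The property worth noticing is that adding a node only moves the partitions that the new node takes over.

```python
import hashlib

N_PARTITIONS = 4096


def weight(node: str, pid: int) -> int:
    # Deterministic pseudo-random score for a (node, partition) pair.
    digest = hashlib.sha1(f"{node}:{pid}".encode()).digest()
    return int.from_bytes(digest[:8], "big")


def succession(nodes, pid):
    # Rank nodes for this partition; the first is master, the rest replicas.
    return sorted(nodes, key=lambda n: weight(n, pid), reverse=True)


def assign(nodes, replication_factor=2):
    """Return {partition: [master, replica, ...]} for the given node list."""
    return {
        pid: succession(nodes, pid)[:replication_factor]
        for pid in range(N_PARTITIONS)
    }


if __name__ == "__main__":
    before = assign(["node-a", "node-b", "node-c"])
    after = assign(["node-a", "node-b", "node-c", "node-d"])
    moved = [pid for pid in before if before[pid][0] != after[pid][0]]
    print(len(moved), "of", N_PARTITIONS, "partition masters moved to the new node")
```

With this scheme every node can compute the same assignment independently from the list of live nodes, so no central directory needs updating when the cluster grows or shrinks.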

59 http://www.aerospike.com/blog/bluekai-nosql-speed-scale-simplicity/

60 http://blog.threatstack.com/scaling-cassandra-lessons-learned

61 http://s3-us-west-1.amazonaws.com/aerospike-fd/wp-content/uploads/2015/02/Tapad_CaseStudy_101012.pdf


As Ted Wallace from BlueKai opines:

“In the case of Aerospike, when we need more capacity in our cluster, we get a machine ready, add it to the cluster and step away. It just works. That is very empowering and very rewarding when you don’t have to have somebody spend days creating a documented process, to get approvals, you don’t have to send notifications out to your customers because there is downtime because you have to do some massive database maintenance. It just works and then you move on.”62

Amey Patil, Big Data Engineer at Crowdfire, adds:

“Due to its master-master replication model, we do not have to worry about rebalancing, failover or recovery! This has definitely pleased our DevOps team. ;-)”63

Complex technology does not have to be complex to use and operate. Period.

62 http://www.aerospike.com/blog/bluekai-nosql-speed-scale-simplicity/

63 https://crowdfire.engineering/why-we-chose-aerospike-over-other-databases-1dfa2d66a292#.27jfii8t1


Summary

In Table 4 below, we contrast the five signs you’ve outgrown Cassandra with their corresponding solution using Aerospike:

Table 4. Characteristics of Aerospike vs. Cassandra across key attributes


Aerospike believes in three core principles:

1. Being built for modern computing architectures, and ready for the next generation of hardware
2. Master-based clustering, which means simple scaling and failover
3. Simple Developer Experience (DX)

Being built for modern computing architectures, and ready for the next generation of hardware - By designing early and understanding the deep and significant technology trends, Aerospike has positioned itself as the only NoSQL database system designed and equipped to fully utilize Flash/SSD and storage class memory systems. By being able to fully parallelize reads and writes to storage, Aerospike can drive 12, 14, even 16 SSD/Flash devices per chassis; it runs out of IOPS before you run out of CPU cycles. This is a huge cost savings over trying to use DRAM and cache-based solutions64. These characteristics make cloud-based deployments viable for high transactional loads typically only reserved for on-premise deployments. High-memory instances and in-memory solutions are not a cost-effective way to go; you need the ability to blend the speed of DRAM access with the cost effectiveness of Flash/SSD (as Aerospike does). Increasingly, cloud providers are moving to Flash/SSD-based systems to drive better utilization and reduce power and cooling costs versus traditional HDD systems. They are a leading indicator of what compute architectures will look like for everybody in the next 18-24 months. You need a solution designed and optimized for modern computing architectures.

Master-based clustering, which means simple scaling and failover - Availability is a key ingredient for most applications. It's a form of brand insurance, because any time your infrastructure (of which your database is a part) is unavailable, this impacts your brand perception. Whether this manifests as a decline when swiping a credit card because the system cannot process a “rainy day” load, or as the inability to provide suitable recommendations (and the resulting lost opportunity cost) - it comes down to brand and perception. It's no longer simply getting an empty or system maintenance page when the website is unavailable - it's all about ensuring your audience remains engaged, regardless of whether your infrastructure is having a rainy day or not. Master-based clustering and Paxos consensus algorithms make up the core technical reasons why Aerospike provides near-instantaneous failover. Aerospike can therefore sustain mixed read/write workloads with ease and with predictable throughput and latency, even on rainy days. This not only satisfies your key availability requirements; it also significantly reduces your overall TCO when compared to eventually consistent systems like Cassandra. As you have seen, Aerospike has unmatched availability. And availability is a core attribute of your customers’ perception of your services and offerings. This is no lab experiment - these are real-world examples of applications handling the most demanding workloads, 24x7.

64 http://www.aerospike.com/blog/bluekai-flashssd-speed-at-scale/


Simple Developer Experience (DX) - Developers have become the early adopters of technology, often choosing a technology before the operations (or “DevOps”) team is aware of a new project. APIs have to be natural, simple and current. You don’t want your vendor to be a laggard - which is why Aerospike published its DX Manifesto in 2015. We support the languages and frameworks you need to build efficient and flexible applications, cutting the time to market that your business is demanding, and ensuring that you remain competitive.

Aerospike is the next-generation, enterprise-grade NoSQL solution. Aerospike has a fundamentally unique architecture that has helped customers like BlueKai, Applovin, ShareThis, AdForm, InMobi, PubMatic, NexTag and Curse cost-effectively convert significant applications from Cassandra to Aerospike. Converting to Aerospike has allowed these organizations to achieve more predictable performance, improve uptime and availability, and significantly decrease TCO.

Our recent YCSB benchmark goes into great detail on the performance gains of an Aerospike solution. If you’re concerned about your Cassandra implementation, contact Aerospike at (+1) 408-462-AERO (408-462-2376), or fill in the form at http://www.aerospike.com/contact-us.

Aerospike is the high-performance NoSQL database that delivers Speed at Scale. Aerospike is purpose-built for the real-time transactional workloads that support mission-critical applications. These workloads have the mandate to deliver informed and immediate decisions for verticals like Financial Services, AdTech, and eCommerce. The unique combination of speed, scale, and reliability can deliver up to 10x performance or 1/10th the cost compared to most other databases.

2525 E. Charleston Road, Suite 201, Mountain View, CA 94043

Tel: 408.462.AERO (2376)
www.aerospike.com

© 2017 Aerospike, Inc. All rights reserved. Aerospike and the Aerospike logo are trademarks or registered trademarks of Aerospike. All other names and trademarks are for identification purposes and are the property of their respective owners. (WP04-101317)

Appendix

Product Comparison

Please refer to the chart below for a summary of key product differences between Aerospike and Cassandra:65

65 Platforms support through Apache http://www.planetcassandra.org/cassandra/