8/3/2019 Thinking Clearly (Paper)
© 2010 Method R Corporation. All rights reserved.
Thinking Clearly About Performance
Cary Millsap, Method R Corporation, Southlake, Texas, USA
Revised 2010/05/11
Creating high performance as an attribute of complex software is extremely difficult business for developers, technology administrators, architects, system analysts, and project managers. However, by understanding some fundamental principles, performance problem solving and prevention can be made far simpler and more reliable. This paper describes those principles, linking them together in a coherent journey covering the goals, the terms, the tools, and the decisions that you need to maximize your application's chance of having a long, productive, high-performance life. Examples in this paper touch upon Oracle experiences, but the scope of the paper is not restricted to Oracle products.
TABLE OF CONTENTS
Thinking Clearly About Performance
1 An Axiomatic Approach
2 What is Performance?
3 Response Time vs. Throughput
4 Percentile Specifications
5 Problem Diagnosis
6 The Sequence Diagram
7 The Profile
8 Amdahl's Law
9 Skew
10 Minimizing Risk
11 Efficiency
12 Load
13 Queueing Delay
14 The Knee
15 Relevance of the Knee
16 Capacity Planning
17 Random Arrivals
18 Coherency Delay
19 Performance Testing
20 Measuring
21 Performance is a Feature
22 Acknowledgments
23 About the Author
24 Epilog: Open Debate about Knees
1 AN AXIOMATIC APPROACH
When I joined Oracle Corporation in 1989, performance (what everyone called "Oracle tuning") was difficult. Only a few people claimed they could do it very well, and those people commanded nice, high consulting rates. When circumstances thrust me into the Oracle tuning arena, I was quite unprepared. Recently, I've been introduced into the world of "MySQL tuning," and the situation seems very similar to what I saw in Oracle over twenty years ago.
It reminds me a lot of how difficult I would have told you that beginning algebra was, if you had asked me when I was about 13 years old. At that age, I had to appeal heavily to my "mathematical instincts" to solve equations like 3x + 4 = 13. The problem with that is that many of us didn't have mathematical instincts. I can remember looking at a problem like "3x + 4 = 13; find x" and basically stumbling upon the answer x = 3 using trial and error.
The trial-and-error method of feeling my way through algebra problems worked (albeit slowly and uncomfortably) for easy equations, but it didn't scale as the problems got tougher, like "3x + 4 = 14." Now what? My problem was that I wasn't thinking clearly yet about algebra. My introduction at age fifteen to James R. Harkey put me on the road to solving that problem.
Mr. Harkey taught us what he called an axiomatic approach to solving algebraic equations. He showed us a set of steps that worked every time (and he gave us plenty of homework to practice on). In addition to working every time, by executing those steps, we necessarily documented our thinking as we worked. Not only were we thinking clearly, using a reliable and repeatable sequence of steps, we were proving to anyone who read our work that we were thinking clearly.
Our work for Mr. Harkey looked like this:
    3.1x + 4 = 13            problem statement
    3.1x + 4 - 4 = 13 - 4    subtraction property of equality
    3.1x = 9                 additive inverse property, simplification
    3.1x / 3.1 = 9 / 3.1     division property of equality
    x ≈ 2.903                multiplicative inverse property, simplification
This was Mr. Harkey's axiomatic approach to algebra, geometry, trigonometry, and calculus: one small, logical, provable, and auditable step at a time. It's the first time I ever really got mathematics.
Naturally, I didn't realize it at the time, but of course proving was a skill that would be vital for my success in the world after school. In life, I've found that, of course, knowing things matters. But proving those things to other people matters more. Without good proving skills, it's difficult to be a good consultant, a good leader, or even a good employee.
My goal since the mid-1990s has been to create a similarly rigorous approach to Oracle performance optimization. Lately, I am expanding the scope of that goal beyond Oracle, to: "Create an axiomatic approach to computer software performance optimization." I've found that not many people really like it when I talk like that, so let's say it like this:
    My goal is to help you think clearly about how to optimize the performance of your computer software.
2 WHAT IS PERFORMANCE?
If you google for the word performance, you get over a half a billion hits on concepts ranging from bicycle racing to the dreaded employee review process that many companies these days are learning to avoid. When I googled for performance, most of the top hits relate to the subject of this paper: the time it takes for computer software to perform whatever task you ask it to do.
And that's a great place to begin: the task. A task is a business-oriented unit of work. Tasks can nest: print invoices is a task; print one invoice (a sub-task) is also a task. When a computer user talks about performance he usually means the time it takes for the system to execute some task. Response time is the execution duration of a task, measured in time per task, like "seconds per click." For example, my Google search for the word performance had a response time of 0.24 seconds. The Google web page rendered that measurement right in my browser. That is evidence, to me, that Google values my perception of Google performance.
Some people are interested in another performance measure: Throughput is the count of task executions that complete within a specified time interval, like "clicks per second." In general, people who are responsible for the performance of groups of people worry more about throughput than people who work in a solo contributor role. For example, an individual accountant is usually more concerned about whether the response time of a daily report will require him to stay late after work today. The manager of a group of accountants is additionally concerned about whether the system is capable of processing all the data that all of her accountants will be processing today.
3 RESPONSE TIME VS. THROUGHPUT
Throughput and response time have a generally reciprocal type of relationship, but not exactly. The real relationship is subtly complex.
Example: Imagine that you have measured your throughput at 1,000 tasks per second for some benchmark. What, then, is your users' average response time?
It's tempting to say that your average response time was 1/1,000 = .001 seconds per task. But it's not necessarily so.
Imagine that your system processing this throughput had 1,000 parallel, independent, homogeneous service channels inside it (that is, it's a system with 1,000 independent, equally competent service providers inside it, each awaiting your request for service). In this case, it is possible that each request consumed exactly 1 second.
Now, we can know that average response time was somewhere between 0 seconds per task and 1 second per task. But you cannot derive response time exclusively¹ from a throughput measurement. You have to measure it separately.
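
The arithmetic behind this example can be sketched in a few lines of Python. This is only an illustration under a strong assumption that is not a measurement: every one of the parallel service channels stays continuously busy, so that throughput equals channels divided by response time. The function name `response_time_if_all_busy` is hypothetical.

```python
def response_time_if_all_busy(throughput_per_sec, channels):
    """Response time (seconds per task) implied by a throughput figure
    ASSUMING all `channels` parallel service channels are continuously busy."""
    return channels / throughput_per_sec


# The same 1,000 tasks per second is consistent with very different
# response times, depending on how many channels did the work.
for channels in (1, 10, 1000):
    r = response_time_if_all_busy(1000, channels)
    print(f"{channels:5d} channel(s): {r:.3f} seconds per task")
```

With one channel the implied response time is .001 seconds; with 1,000 channels it is a full second. The throughput number alone cannot tell the two cases apart.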
¹ I carefully include the word exclusively in this statement, because there are mathematical models that can compute response time for a given throughput, but the models require more input than just throughput.
The subtlety works in the other direction, too. You can certainly flip the example I just gave around and prove it. However, a scarier example will be more fun.
Example: Your client requires a new task that you're programming to deliver a throughput of 100 tasks per second on a single-CPU computer. Imagine that the new task you've written executes in just .001 seconds on the client's system. Will your program yield the throughput that the client requires?
It's tempting to say that if you can run the task once in just a thousandth of a second, then surely you'll be able to run that task at least a hundred times in the span of a full second. And you're right, if the task requests are nicely serialized, for example, so that your program can process all 100 of the client's required task executions inside a loop, one after the other.
But what if the 100 tasks per second come at your system at random, from 100 different users logged into your client's single-CPU computer? Then the gruesome realities of CPU schedulers and serialized resources (like Oracle latches and locks and writable access to buffers in memory) may restrict your throughput to quantities much less than the required 100 tasks per second.
It might work. It might not. You cannot derive throughput exclusively from a response time measurement. You have to measure it separately.
Response time and throughput are not necessarily reciprocals. To know them both, you need to measure them both.
So, which is more important: response time, or throughput? For a given situation, you might answer legitimately in either direction. In many circumstances, the answer is that both are vital measurements requiring management. For example, a system owner may have a business requirement that response time must be 1.0 seconds or less for a given task in 99% or more of executions and the system must support a sustained throughput of 1,000 executions of the task within a 10-minute interval.
4 PERCENTILE SPECIFICATIONS
In the prior section, I used the phrase "in 99% or more of executions" to qualify a response time expectation. Many people are more accustomed to statements like, "average response time must be r seconds or less." The percentile way of stating requirements maps better, though, to the human experience.
Example: Imagine that your response time tolerance is 1 second for some task that you execute on your computer every day. Further imagine that the lists of numbers shown in Exhibit 1 represent the measured response times of ten executions of that task. The average response time for each list is 1.000 seconds. Which one do you think you'd like better?
          List A    List B
     1      .924      .796
     2      .928      .798
     3      .954      .802
     4      .957      .823
     5      .961      .919
     6      .965      .977
     7      .972     1.076
     8      .979     1.216
     9      .987     1.273
    10     1.373     1.320
Exhibit 1. The average response time for each of these two lists is 1.000 seconds.
You can see that although the two lists have the same average response time, the lists are quite different in character. In List A, 90% of response times were 1 second or less. In List B, only 60% of response times were 1 second or less. Stated in the opposite way, List B represents a set of user experiences of which 40% were dissatisfactory, but List A (having the same average response time as List B) represents only a 10% dissatisfaction rate.
In List A, the 90th percentile response time is .987 seconds. In List B, the 90th percentile response time is 1.273 seconds. These statements about percentiles are more informative than merely saying that each list represents an average response time of 1.000 seconds.
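
The figures quoted for Exhibit 1 can be checked with a short Python sketch. The helper names are hypothetical, and the percentile uses the simple nearest-rank definition (an assumption; other percentile definitions interpolate and can give slightly different values).

```python
LIST_A = [.924, .928, .954, .957, .961, .965, .972, .979, .987, 1.373]
LIST_B = [.796, .798, .802, .823, .919, .977, 1.076, 1.216, 1.273, 1.320]


def mean(values):
    return sum(values) / len(values)


def fraction_within(values, tolerance):
    """Fraction of executions whose response time met the tolerance."""
    return sum(1 for v in values if v <= tolerance) / len(values)


def percentile_nearest_rank(values, pct):
    """Nearest-rank percentile; `pct` is an integer from 1 to 100."""
    ordered = sorted(values)
    rank = -(-pct * len(ordered) // 100)  # ceiling of pct * n / 100
    return ordered[max(rank, 1) - 1]


for name, data in (("List A", LIST_A), ("List B", LIST_B)):
    print(name,
          f"mean={mean(data):.3f}",
          f"within 1 s={fraction_within(data, 1.0):.0%}",
          f"p90={percentile_nearest_rank(data, 90):.3f}")
```

Both lists report the same mean, while the share of executions within tolerance and the 90th percentile differ, which is the point of specifying requirements as percentiles.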
As GE says, "Our customers feel the variance, not the mean."² Expressing response time goals as percentiles makes for much more compelling requirement specifications that match with end user expectations:
    The Track Shipment task must complete in less than .5 seconds in at least 99.9% of executions.
5 PROBLEM DIAGNOSIS
In nearly every performance problem I've been invited to repair, the problem statement has been a statement about response time. "It used to take less than a second to do X; now it sometimes takes 20+." Of course, a specific problem statement like that is often buried behind veneers of other problem statements, like, "Our whole [adjectives deleted] system is so slow we can't use it."³
² General Electric Company: "What Is Six Sigma? The Roadmap to Customer Impact" at http://www.ge.com/sixsigma/SixSigma.pdf
But just because something has happened a lot for me doesn't mean that it's what will happen next for you. The most important thing for you to do first is state the problem clearly, so that you can then think about it clearly.
A good way to begin is to ask, what is the goal state that you want to achieve? Find some specifics that you can measure to express this. For example, "Response time of X is more than 20 seconds in many cases. We'll be happy when response time is 1 second or less in at least 95% of executions."
That sounds good in theory, but what if your user doesn't have a specific quantitative goal like "1 second or less in at least 95% of executions"? There are two quantities right there (1 and 95); what if your user doesn't know either one of them? Worse yet, what if your user does have specific ideas about his expectations, but those expectations are impossible to meet? How would you know what "possible" or "impossible" even is?
Let's work our way up to those questions.
6 THE SEQUENCE DIAGRAM
A sequence diagram is a type of graph specified in the Unified Modeling Language (UML), used to show the interactions between objects in the sequential order that those interactions occur. The sequence diagram is an exceptionally useful tool for visualizing response time. Exhibit 2 shows a standard UML sequence diagram for a simple application system composed of a browser, an application server, and a database.
³ Cary Millsap, 2009. "My whole system is slow. Now what?" at http://carymillsap.blogspot.com/2009/12/my-whole-system-is-slow-now-what.html
Exhibit 2. This UML sequence diagram shows the interactions among a browser, an application server, and a database.
Imagine now drawing the sequence diagram to scale, so that the distance between each "request" arrow coming in and its corresponding "response" arrow going out were proportional to the duration spent servicing the request. I've shown such a diagram in Exhibit 3.
Exhibit 3. A UML sequence diagram drawn to scale, showing the response time consumed at each tier in the system.
With Exhibit 3, you have a good graphical representation of how the components represented in your diagram are spending your user's time. You can "feel" the relative contribution to response time by looking at the picture.
Sequence diagrams are just right for helping people conceptualize how their response time is consumed on a given system, as one tier hands control of the task to the next. Sequence diagrams also work well to show how simultaneous threads of processing work in parallel. Sequence diagrams are good tools for analyzing performance outside of the information technology business, too.⁴
The sequence diagram is a good conceptual tool for talking about performance, but to think clearly about performance, we need something else, too. Here's the problem. Imagine that the task you're supposed to fix has a response time of 2,468 seconds (that's 41 minutes 8 seconds). In that roughly 41 minutes, running that task causes your application server to execute 322,968 database calls. Exhibit 4 shows what your sequence diagram for that task would look like.
Exhibit 4. This UML sequence diagram shows 322,968 database calls executed by the application server.
There are so many request and response arrows between the application and database tiers that you can't see any of the detail. Printing the sequence diagram on a very long scroll isn't a useful solution, because it would take us weeks of visual inspection before we'd be able to derive useful information from the details we'd see.
The sequence diagram is a good tool for conceptualizing flow of control and the corresponding flow of time. But to think clearly about response time, we need something else.
7 THE PROFILE
The sequence diagram doesn't scale well. To deal with tasks that have huge call counts, we need a convenient aggregation of the sequence diagram so that we understand the most important patterns in how our time has been spent. Exhibit 5 shows an example of a table called a profile, which does the trick. A profile is a tabular decomposition of response time, typically listed in descending order of component response time contribution.
⁴ Cary Millsap, 2009. "Performance optimization with Global Entry. Or not?" at http://carymillsap.blogspot.com/2009/11/performance-optimization-with-global.html
       Function call              R (sec)      Calls
    1  DB:fetch()               1,748.229    322,968
    2  App:await_db_netIO()       338.470    322,968
    3  DB:execute()               152.654     39,142
    4  DB:prepare()                97.855     39,142
    5  Other                       58.147     89,422
    6  App:render_graph()          48.274          7
    7  App:tabularize()            23.481          4
    8  App:read()                   0.890          2
       Total                    2,468.000
Exhibit 5. This profile shows the decomposition of a 2,468.000-second response time.
Example: The profile in Exhibit 5 is rudimentary, but it shows you exactly where your slow task has spent your user's 2,468 seconds. With the data shown here, for example, you can derive the percentage of response time contribution for each of the function calls identified in the profile. You can also derive the average response time for each type of function call during your task.
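
Both derivations can be sketched directly from the Exhibit 5 data. The following Python is a minimal illustration; the `PROFILE` structure and the `summarize` helper are hypothetical names chosen for this sketch, not part of any profiling tool.

```python
# (function call, response time in seconds, call count), from Exhibit 5
PROFILE = [
    ("DB:fetch()",           1748.229, 322968),
    ("App:await_db_netIO()",  338.470, 322968),
    ("DB:execute()",          152.654,  39142),
    ("DB:prepare()",           97.855,  39142),
    ("Other",                  58.147,  89422),
    ("App:render_graph()",     48.274,      7),
    ("App:tabularize()",       23.481,      4),
    ("App:read()",              0.890,      2),
]


def summarize(profile):
    """Return (total seconds, rows of (name, % of total, mean seconds per call))."""
    total = sum(seconds for _, seconds, _ in profile)
    rows = [(name, 100.0 * seconds / total, seconds / calls)
            for name, seconds, calls in profile]
    return total, rows


total, rows = summarize(PROFILE)
print(f"Total response time: {total:,.3f} s")
for name, pct, per_call in rows:
    print(f"{name:22s} {pct:5.1f}%  {per_call:10.6f} s/call")
```

Note how the per-call averages tell a different story from the totals: each DB:fetch() call is cheap on average, but there are hundreds of thousands of them, while App:render_graph() is called only 7 times at several seconds apiece.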
A profile shows you where your code has spent your time and (sometimes even more importantly) where it has not. There is tremendous value in not having to guess about these things.
From the data shown in Exhibit 5, you know that 70.8% of your user's response time is consumed by DB:fetch() calls. Furthermore, if you can drill down into the individual calls whose durations were aggregated to create this profile, you can know how many of those App:await_db_netIO() calls corresponded to DB:fetch() calls, and you can know how much response time each of those consumed.
With a profile, you can begin to formulate the answer to the question, "How long should this task run?" Which, by now, you know is an important question in the first step (section 5) of any good problem diagnosis.
8 AMDAHL'S LAW
Profiling helps you think clearly about performance. Even if Gene Amdahl hadn't given us Amdahl's Law back in 1967, you'd have probably come up with it yourself after the first few profiles you looked at.
Amdahl's Law states:
    Performance improvement is proportional to how much a program uses the thing you improved.
So if the thing you're trying to improve only contributes 5% to your task's total response time, then the maximum impact you'll be able to make is 5% of your total response time. This means that the closer to the top of a profile that you work (assuming that the profile is sorted in descending response time order), the bigger the benefit potential for your overall response time.
This doesn't mean that you always work a profile in top-down order, though, because you also need to consider the cost of the remedies you'll be executing.⁵
Example: Consider the profile in Exhibit 6. It's the same profile as in Exhibit 5, except here you can see how much time you think you can save by implementing the best remedy for each row in the profile, and you can see how much you think each remedy will cost to implement.
       Potential improvement % and cost of investment     R (sec)    R (%)
    1  34.5%  super expensive                           1,748.229    70.8%
    2  12.3%  dirt cheap                                  338.470    13.7%
    3  Impossible to improve                              152.654     6.2%
    4   4.0%  dirt cheap                                   97.855     4.0%
    5   0.1%  super expensive                              58.147     2.4%
    6   1.6%  dirt cheap                                   48.274     2.0%
    7  Impossible to improve                               23.481     1.0%
    8   0.0%  dirt cheap                                    0.890     0.0%
       Total                                            2,468.000
Exhibit 6. This profile shows the potential for improvement and the corresponding cost (difficulty) of improvement for each line item from Exhibit 5.
What remedy action would you implement first? Amdahl's Law says that implementing the repair on line 1 has the greatest potential benefit of saving about 851 seconds (34.5% of 2,468 seconds). But if it is truly "super expensive," then the remedy on line 2 may yield better net benefit (and that's the constraint to which we really need to optimize) even though the potential for response time savings is only about 305 seconds.
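
The potential savings for every line of Exhibit 6 follow from one multiplication. The Python sketch below is illustrative only: the `REMEDIES` list and `potential_savings` function are hypothetical names, "impossible to improve" is encoded as a 0% potential (an assumption), and because the percentages in Exhibit 6 are rounded, the computed seconds can differ by a second or two from the figures quoted in the text.

```python
TOTAL_R = 2468.000  # total response time in seconds (Exhibit 5)

# (profile line, potential improvement as % of total R, cost label), from Exhibit 6
REMEDIES = [
    (1, 34.5, "super expensive"),
    (2, 12.3, "dirt cheap"),
    (3,  0.0, "impossible to improve"),
    (4,  4.0, "dirt cheap"),
    (5,  0.1, "super expensive"),
    (6,  1.6, "dirt cheap"),
    (7,  0.0, "impossible to improve"),
    (8,  0.0, "dirt cheap"),
]


def potential_savings(total_seconds, improvement_pct):
    """Seconds saved if a remedy removes `improvement_pct` percent of total R."""
    return total_seconds * improvement_pct / 100.0


# List remedies by potential benefit; cost still has to be weighed by a human.
for line, pct, cost in sorted(REMEDIES, key=lambda r: r[1], reverse=True):
    saved = potential_savings(TOTAL_R, pct)
    print(f"line {line}: saves about {saved:7.1f} s  ({cost})")
```

Sorting by potential benefit puts line 1 first, but the cost column is what decides whether line 2 is the better first move.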
A tremendous value of the profile is that you can learn exactly how much improvement you should expect for a proposed investment. It opens the door to making much better decisions about what remedies to implement first. Your predictions give you a yardstick for measuring your own performance as an analyst. And finally, it gives you a chance to showcase your cleverness and intimacy with your technology as you find more efficient remedies for reducing response time more than expected, at lower-than-expected costs.
⁵ Cary Millsap, 2009. "On the importance of diagnosing before resolving" at http://carymillsap.blogspot.com/2009/09/on-importance-of-diagnosing-before.html
What remedy action you implement first really boils down to how much you trust your cost estimates. Does "dirt cheap" really take into account the risks that the proposed improvement may inflict upon the system? For example, it may seem "dirt cheap" to change that parameter or drop that index, but does that change potentially disrupt the good performance behavior of something out there that you're not even thinking about right now? Reliable cost estimation is another area in which your technological skills pay off.
Another factor worth considering is the political capital that you can earn by creating small victories. Maybe cheap, low-risk improvements won't amount to much overall response time improvement, but there's value in establishing a track record of small improvements that exactly fulfill your predictions about how much response time you'll save for the slow task. A track record of prediction and fulfillment ultimately (especially in the area of software performance, where myth and superstition have reigned at many locations for decades) gives you the credibility you need to influence your colleagues (your peers, your managers, your customers, ...) to let you perform increasingly expensive remedies that may produce bigger payoffs for the business.
A word of caution, however: Don't get careless as you rack up your successes and propose ever bigger, costlier, riskier remedies. Credibility is fragile. It takes a lot of work to build it up but only one careless mistake to bring it down.
9 SKEW
When you work with profiles, you repeatedly encounter sub-problems like this one:
Example: The profile in Exhibit 5 revealed that 322,968 DB:fetch() calls had consumed 1,748.229 seconds of response time. How much unwanted response time would we eliminate if we could eliminate half of those calls?
The answer is almost never, "Half of the response time." Consider this far simpler example for a moment:
Example: Four calls to a subroutine consumed four seconds. How much unwanted response time would we eliminate if we could eliminate half of those calls?
The answer depends upon the response times of the individual calls that we could eliminate. You might have assumed that each of the call durations was the average 4/4 = 1 second. But nowhere in the problem statement did I tell you that the call durations were uniform.
Example: Imagine the following two possibilities, where each list represents the response times of the four subroutine calls:
    A = {1, 1, 1, 1}
    B = {3.7, .1, .1, .1}
In list A, the response times are uniform, so no matter which half (two) of the calls we eliminate, we'll reduce total response time to 2 seconds. However, in list B, it makes a tremendous difference which two calls we eliminate. If we eliminate the first two calls, then the total response time will drop to .2 seconds (a 95% reduction). However, if we eliminate the final two calls, then the total response time will drop to 3.8 seconds (only a 5% reduction).
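
The best-case and worst-case outcomes for lists A and B can be computed mechanically. This Python sketch uses a hypothetical helper, `remaining_time_bounds`, which assumes the individual call durations are known: the best case eliminates the slowest calls, and the worst case eliminates the fastest ones.

```python
def remaining_time_bounds(durations, calls_eliminated):
    """Return (best, worst) total response time left after eliminating
    `calls_eliminated` calls from `durations`."""
    ordered = sorted(durations)
    keep = len(ordered) - calls_eliminated
    best = sum(ordered[:keep])               # the slowest calls were eliminated
    worst = sum(ordered[calls_eliminated:])  # the fastest calls were eliminated
    return best, worst


A = [1, 1, 1, 1]
B = [3.7, .1, .1, .1]

for name, calls in (("A", A), ("B", B)):
    best, worst = remaining_time_bounds(calls, 2)
    print(f"list {name}: between {best:.1f} s and {worst:.1f} s remain")
```

For the uniform list the two bounds coincide; for the skewed list they are far apart, which is why the call count alone cannot answer the question.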
Skew is a non-uniformity in a list of values. The possibility of skew is what prohibits you from providing a precise answer to the question that I asked you at the beginning of this section. Let's look again:
Example: The profile in Exhibit 5 revealed that 322,968 DB:fetch() calls had consumed 1,748.229 seconds of response time. How much unwanted response time would we eliminate if we could eliminate half of those calls?
Without knowing anything about skew, the only answer we can provide is, "Somewhere between 0 and 1,748.229 seconds." That is the most precise correct answer you can return.
Imagine, however, that you had the additional information available in Exhibit 7. Then you could formulate much more precise best-case and worst-case estimates. Specifically, if you had information like this, you'd be smart to try to figure out how specifically to eliminate the 47,444 calls with response times in the .01- to .1-second range.
Range{mine
Example: A middle tier program makes 100 database fetch calls (and thus 100 network I/O calls) to fetch 994 rows. It could have fetched 994 rows in 10 fetch calls (and thus 90 fewer network I/O calls).
Example: A SQL statement⁷ touches the database buffer cache 7,428,322 times to return a 698-row result set. An extra filter predicate could have returned the 7 rows that the end user really wanted to see, with only 52 touches upon the database buffer cache.
Certainly, if a system has some global problem that creates inefficiency for broad groups of tasks across the system (e.g., ill-conceived index, badly set parameter, poorly configured hardware), then you should fix it. But don't tune a system to accommodate programs that are inefficient.⁸ There is a lot more leverage in curing the program inefficiencies themselves. Even if the programs are commercial, off-the-shelf applications, it will benefit you more in the long run to work with your software vendor to make your programs efficient than it will to try to optimize your system to be as efficient as it can be with an inherently inefficient workload.
Improvements that make your program more efficient can produce tremendous benefits for everyone on the system. It's easy to see how top-line reduction of waste helps the response time of the task being repaired. What many people don't understand as well is that making one program more efficient creates a side-effect of performance improvement for other programs on the system that have no apparent relation to the program being repaired. It happens because of the influence of load upon the system.
12 LOAD
Load is competition for a resource induced by concurrent task executions. Load is the reason that the performance testing done by software developers doesn't catch all the performance problems that show up later in production.
One measure of load is utilization. Utilization is resource usage divided by resource capacity for a specified time interval. As utilization for a resource goes up, so does the response time a user will experience when requesting service from that resource. Anyone who has ridden in an automobile in a big city during rush hour has experienced this phenomenon. When the traffic is heavily congested, you have to wait longer at the toll booth.
⁷ My choice of article adjective here is a dead giveaway that I was introduced to SQL within the Oracle community.
⁸ Admittedly, sometimes you need a tourniquet to keep from bleeding to death. But don't use a stopgap measure as a permanent solution. Address the inefficiency.
With computer software, the software you use doesn't actually "go slower" like your car does when you're going 30 mph in heavy traffic instead of 60 mph on the open road. Computer software always goes the same speed, no matter what (a constant number of instructions per clock cycle), but certainly response time degrades as resources on your system get busier.
There are two reasons that systems get slower as load increases: queueing delay, and coherency delay. I'll address each as we continue.
13 QUEUEING DELAY
The mathematical relationship between load and response time is well known. One mathematical model called "M/M/m" relates response time to load in systems that meet one particularly useful set of specific requirements.⁹ One of the assumptions of M/M/m is that the system you are modeling has theoretically perfect scalability. Having "theoretically perfect scalability" is akin to having a physical system with "no friction," an assumption that so many problems in introductory Physics courses invoke.
Regardless of some overreaching assumptions like the one about perfect scalability, M/M/m has a lot to teach us about performance. Exhibit 8 shows the relationship between response time and load using M/M/m.
Exhibit 8. This curve relates response time as a function of utilization for an M/M/m system with m = 8 service channels.
⁹ Cary Millsap and Jeff Holt, 2003. Optimizing Oracle Performance. O'Reilly. Sebastopol CA.
[Exhibit 8 plot: response time R (vertical axis, 0 to 5) versus utilization (horizontal axis, 0.0 to 1.0) for an M/M/8 system.]
In Exhibit 8, you can see mathematically what you feel when you use a system under different load conditions. At low load, your response time is essentially the same as your response time was at no load. As load ramps up, you sense a slight, gradual degradation in response time. That gradual degradation doesn't really do much harm, but as load continues to ramp up, response time begins to degrade in a manner that's neither slight nor gradual. Rather, the degradation becomes quite unpleasant and, in fact, hyperbolic.
Response time, in the perfect scalability M/M/m model, consists of two components: service time and queueing delay. That is, R = S + Q. Service time (S) is the duration that a task spends consuming a given resource, measured in time per task execution, as in seconds per click. Queueing delay (Q) is the time that a task spends enqueued at a given resource, awaiting its opportunity to consume that resource. Queueing delay is also measured in time per task execution (e.g., seconds per click).
So, when you order lunch at Taco Tico, your response time (R) for getting your order is the queueing delay time (Q) that you spend queued in front of the counter waiting for someone to take your order, plus the service time (S) you spend waiting for your order to hit your hands once you begin talking to the order clerk. Queueing delay is the difference between your response time for a given task and the response time for that same task on an otherwise unloaded system (don't forget our perfect scalability assumption).
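
The relationship R = S + Q can be made concrete with a short Python sketch. The paper does not spell out the M/M/m formula, so the code below is an assumption on my part: it uses the standard Erlang C result from queueing theory, in which the mean queueing delay is Q = C(m, ρ) · S / (m(1 − ρ)) and C(m, ρ) is the probability that an arriving task must wait. The function names are hypothetical.

```python
def erlang_c(m, rho):
    """Probability that an arriving task has to queue in an M/M/m system
    with m service channels at utilization rho (0 <= rho < 1)."""
    if not 0.0 <= rho < 1.0:
        raise ValueError("utilization must be in [0, 1)")
    a = m * rho            # offered load
    term = 1.0             # running value of a**k / k!
    total = 0.0
    for k in range(m):
        total += term
        term *= a / (k + 1)
    tail = term / (1.0 - rho)   # term is now a**m / m!
    return tail / (total + tail)


def queueing_delay(m, rho, service_time=1.0):
    """Mean queueing delay Q for an M/M/m system."""
    return erlang_c(m, rho) * service_time / (m * (1.0 - rho))


def response_time(m, rho, service_time=1.0):
    """Mean response time R = S + Q."""
    return service_time + queueing_delay(m, rho, service_time)


# Response time for an 8-channel system as utilization ramps up.
for rho in (0.1, 0.5, 0.7, 0.8, 0.9, 0.95, 0.99):
    print(f"utilization {rho:4.2f}: R = {response_time(8, rho):7.3f} x S")
```

Printing R for m = 8 at increasing utilization shows the shape described for Exhibit 8: nearly flat at low load, then climbing steeply as utilization approaches 1.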
14 THE KNEE
When it comes to performance, you want two things from a system:
1. You want the best response time you can get: you don't want to have to wait too long for tasks to get done.
2. And you want the best throughput you can get: you want to be able to cram as much load as you possibly can onto the system so that as many people as possible can run their tasks at the same time.
Unfortunately, these two goals are contradictory. Optimizing to the first goal requires you to minimize the load on your system; optimizing to the other one requires you to maximize it. You can't do both simultaneously. Somewhere in between, at some load level (that is, at some utilization value), is the optimal load for the system.
The utilization value at which this optimal balance occurs is called the knee.¹⁰ The knee is the utilization value for a resource at which throughput is maximized with minimal negative impact to response times. Mathematically, the knee is the utilization value at which response time divided by utilization is at its minimum. One nice property of the knee is that it occurs at the utilization value where a line through the origin is tangent to the response time curve. On a carefully produced M/M/m graph, you can locate the knee quite nicely with just a straightedge, as shown in Exhibit 9.
Exhibit 9. The knee occurs at the utilization at which a line through the origin is tangent to the response time curve.
Another nice property of the M/M/m knee is that you only need to know the value of one parameter to compute it. That parameter is the number of parallel, homogeneous, independent service channels. A service channel is a resource that shares a single queue with other identical such resources, like a booth in a toll plaza or a CPU in an SMP computer.
The italicized lowercase m in the name M/M/m is the number of service channels in the system being modeled.¹¹ The M/M/m knee value for an arbitrary system is difficult to calculate, but I've done it for you. The knee values for some common service channel counts are shown in Exhibit 10.
¹⁰ I am engaged in an ongoing debate about whether it is appropriate to use the term knee in this context. For the time being, I shall continue to use it. See section 24 for details.
¹¹ By this point, you may be wondering what the other two 'M's stand for in the M/M/m queueing model name. They relate to assumptions about the randomness of the timing of your incoming requests and the randomness of your service times. See http://en.wikipedia.org/wiki/Kendall%27s_notation for more information, or Optimizing Oracle Performance for even more.
[Exhibit 9 plot: response time R (vertical axis, 0 to 10) versus utilization (horizontal axis, 0.0 to 1.0), with curves for M/M/4 (knee at utilization 0.665006) and M/M/16 (knee at utilization 0.810695).]
    Service channel count    Knee utilization
                        1    50%
                        2    57%
                        4    66%
                        8    74%
                       16    81%
                       32    86%
                       64    89%
                      128    92%
Exhibit 10. M/M/m knee values for common values of m.
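
Given the definition of the knee as the utilization that minimizes response time divided by utilization, knee values can be approximated numerically. The Python sketch below is one possible way to do it, not the author's method: it assumes the standard Erlang C formula for M/M/m response time and locates the minimum with a plain ternary search. All function names are hypothetical.

```python
def erlang_c(m, rho):
    """Probability of queueing in an M/M/m system at utilization rho."""
    a = m * rho
    term, total = 1.0, 0.0      # term tracks a**k / k!
    for k in range(m):
        total += term
        term *= a / (k + 1)
    tail = term / (1.0 - rho)
    return tail / (total + tail)


def response_time(m, rho, service_time=1.0):
    """Mean M/M/m response time R = S + Q."""
    return service_time * (1.0 + erlang_c(m, rho) / (m * (1.0 - rho)))


def knee(m, iterations=200):
    """Utilization in (0, 1) that minimizes R / rho, found by ternary search."""
    lo, hi = 1e-9, 1.0 - 1e-9
    for _ in range(iterations):
        left = lo + (hi - lo) / 3.0
        right = hi - (hi - lo) / 3.0
        if response_time(m, left) / left < response_time(m, right) / right:
            hi = right
        else:
            lo = left
    return (lo + hi) / 2.0


for m in (1, 2, 4, 8, 16, 32, 64, 128):
    print(f"m = {m:3d}: knee utilization is about {knee(m):.4f}")
```

For a single channel the search lands on 0.5, and for m = 4 on roughly 0.665, in line with the value printed in Exhibit 9. The percentages in Exhibit 10 appear to be these values with the fractional percent dropped.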
Why is the knee value so important? For systems with randomly timed service requests, allowing sustained resource loads in excess of the knee value results in response times and throughputs that will fluctuate severely with microscopic changes in load. Hence:
    On systems with random request arrivals, it is vital to manage load so that it will not exceed the knee value.
15 RELEVANCE OF THE KNEE
So, how important can this knee concept be, really? After all, as I've told you, the M/M/m model assumes this ridiculously utopian idea that the system you're thinking about scales perfectly. I know what you're thinking: It doesn't.
But what M/M/m gives us is the knowledge that even if your system did scale perfectly, you would still be stricken with massive performance problems once your average load exceeded the knee values I've given you in Exhibit 10. Your system isn't as good as the theoretical systems that M/M/m models. Therefore, the utilization values at which your system's knees occur will be more constraining than the values I've given you in Exhibit 10. (I said values and knees in plural form, because you can model your CPUs with one model, your disks with another, your I/O controllers with another, and so on.)
To recap:
• Each of the resources in your system has a knee.
• That knee for each of your resources is less than or equal to the knee value you can look up in Exhibit 10. The more imperfectly your system scales, the smaller (worse) your knee value will be.
• On a system with random request arrivals, if you allow your sustained utilization for any resource on your system to exceed your knee value for that resource, then you'll have performance problems.
• Therefore, it is vital that you manage your load so that your resource utilizations will not exceed your knee values.
16 CAPACITY PLANNING

Understanding the knee can collapse a lot of complexity out of your capacity planning process. It works like this:

1. Your goal capacity for a given resource is the amount at which you can comfortably run your tasks at peak times without driving utilizations beyond your knees.
2. If you keep your utilizations less than your knees, your system behaves roughly linearly: no big hyperbolic surprises.
3. However, if you're letting your system run any of its resources beyond their knee utilizations, then you have performance problems (whether you're aware of it or not).
4. If you have performance problems, then you don't need to be spending your time with mathematical models; you need to be spending your time fixing those problems by either rescheduling load, eliminating load, or increasing capacity.

That's how capacity planning fits into the performance management process.
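The first three steps amount to comparing each resource's peak utilization against its knee. Here is a minimal sketch of that comparison, assuming the Exhibit 10 values as upper bounds on the knees; the resource names and measurements are hypothetical.

```python
# Knee utilizations from Exhibit 10, keyed by service channel count.
# For a real system these are upper bounds; the true knees are lower.
EXHIBIT_10_KNEE = {1: 0.50, 2: 0.57, 4: 0.66, 8: 0.74,
                   16: 0.81, 32: 0.86, 64: 0.89, 128: 0.92}


def resources_beyond_knee(peak_utilization, channel_count):
    """Return the resources whose peak utilization exceeds the knee bound.

    peak_utilization: resource name -> sustained peak utilization (0..1)
    channel_count:    resource name -> number of service channels
    """
    flagged = []
    for name, utilization in sorted(peak_utilization.items()):
        if utilization > EXHIBIT_10_KNEE[channel_count[name]]:
            flagged.append(name)
    return flagged


# Hypothetical measurements, for illustration only.
peaks = {"cpu": 0.78, "disk": 0.70, "io_controller": 0.40}
channels = {"cpu": 8, "disk": 16, "io_controller": 2}
print(resources_beyond_knee(peaks, channels))   # only "cpu" exceeds its knee
```

Any resource the function returns falls under step 3: it already has a performance problem, and step 4 applies.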
17 RANDOM ARRIVALS

You might have noticed that several times now, I have mentioned the term "random arrivals." Why is that important?

Some systems have something that you probably don't have right now: completely deterministic job scheduling. Some systems (it's rare these days) are configured to allow service requests to enter the system in absolute robotic fashion, say, at a pace of one task per second. And by "one task per second," I don't mean at an average rate of one task per second (for example, 2 tasks in one second and 0 tasks in the next), I mean one task per second, like a robot might feed car parts into a bin on an assembly line.
If arrivals into your system behave completely deterministically (meaning that you know exactly when the next service request is coming), then you can run resource utilizations beyond their knee utilizations without necessarily creating a performance problem. On a system with deterministic arrivals, your goal is to run resource utilizations up to 100% without cramming so much workload into the system that requests begin to queue.
The reason the knee value is so important on a system with random arrivals is that random arrivals tend to cluster up on you and cause temporary spikes in utilization. Those spikes need enough spare capacity to consume so that users don't have to endure noticeable queueing delays (which cause noticeable fluctuations in response times) every time one of those spikes occurs.
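A small simulation can make the contrast between robotic and random arrivals concrete. This is a sketch under simplifying assumptions that are not from the paper: a single service channel, a constant service time of 0.9 seconds for every task, and exponentially distributed gaps for the random case. Both workloads average one task per second, so both run the resource at 90% utilization.

```python
import random


def mean_wait(interarrival_times, service_time):
    """Mean queueing delay for a single FIFO server (Lindley recursion)."""
    wait = 0.0
    total = 0.0
    for gap in interarrival_times:
        total += wait
        # The next request waits for whatever work is still unfinished.
        wait = max(0.0, wait + service_time - gap)
    return total / len(interarrival_times)


N = 100_000
SERVICE_TIME = 0.9          # seconds per task; utilization is 0.9 at 1 task/s

# Robotic arrivals: exactly one task every second.
deterministic = [1.0] * N

# Random arrivals: one task per second on average (exponential gaps).
rng = random.Random(42)
randomized = [rng.expovariate(1.0) for _ in range(N)]

print("deterministic mean wait:", mean_wait(deterministic, SERVICE_TIME))
print("random mean wait:       ", mean_wait(randomized, SERVICE_TIME))
```

With deterministic arrivals no task ever queues, even at 90% utilization. With random arrivals at the same average rate, tasks spend several seconds queueing on average, because clustered arrivals find the resource still busy.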
Temporary spikes in utilization beyond your knee value for a given resource are ok as long as they don't exceed just a few seconds in duration. How many seconds is too many? I believe (but have not yet tried to prove) that you should at least ensure that your spike durations do not exceed 8 seconds.[12] The answer is certainly that, if you're unable to meet your percentile-based response time promises or your throughput promises to your users, then your spikes are too long.
18 COHERENCY DELAY

Your system doesn't have theoretically perfect scalability. Even if I've never studied your system specifically, it's a pretty good bet that no matter what computer application system you're thinking of right now, it does not meet the M/M/m "theoretically perfect scalability" assumption. Coherency delay is the factor that you can use to model the imperfection.[13]

Coherency delay is the duration that a task spends communicating and coordinating access to a shared resource. Like response time, service time, and queueing delay, coherency delay is measured in time per task execution, as in seconds per click.
I won't describe here a mathematical model for predicting coherency delay. But the good news is that if you profile your software task executions, you'll see it when it occurs. In Oracle, timed events like the following are examples of coherency delay:

- enqueue
- buffer busy waits
- latch free

You can't model coherency delays like these with M/M/m. That's because M/M/m assumes that all m of your service channels are parallel, homogeneous, and independent. That means the model assumes that after you wait politely in a FIFO queue for long enough that all the requests that enqueued ahead of you have exited the queue for service, it'll be your turn to be serviced. However, coherency delays don't work like that.

[12] You'll recognize this number if you've heard of the "8-second rule," which you can learn about at http://en.wikipedia.org/wiki/Network_performance#8-second_rule.
[13] Neil Gunther, 1993. Universal Law of Computational Scalability, at http://en.wikipedia.org/wiki/Neil_J._Gunther#Universal_Law_of_Computational_Scalability.
Example: Imagine an HTML data entry form in which one button labeled "Update" executes a SQL update statement, and another button labeled "Save" executes a SQL commit statement. An application built like this would almost guarantee abysmal performance. That's because the design makes it possible (quite likely, actually) for a user to click Update, look at his calendar, realize uh-oh he's late for lunch, and then go to lunch for two hours before clicking Save later that afternoon.

The impact to other tasks on this system that wanted to update the same row would be devastating. Each task would necessarily wait for a lock on the row (or, on some systems, worse: a lock on the row's page) until the locking user decided to go ahead and click Save. Or until a database administrator killed the user's session, which of course would have unsavory side effects to the person who had thought he had updated a row.
In this case, the amount of time a task would wait on the lock to be released has nothing to do with how busy the system is. It would be dependent upon random factors that exist outside of the system's various resource utilizations. That's why you can't model this kind of thing in M/M/m, and it's why you can never assume that a performance test executed in a unit testing type of environment is sufficient for making a go/no-go decision about insertion of new code into a production system.
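The Update/Save example can be imitated in a few lines. This is only an analogy, assuming a Python thread lock as a stand-in for a database row lock and a 0.2-second sleep as a stand-in for the two-hour lunch; all names are hypothetical.

```python
import threading
import time

row_lock = threading.Lock()        # stands in for a database row lock
lock_is_held = threading.Event()


def user_who_goes_to_lunch(lunch_seconds):
    """Click Update (take the lock), wander off, then click Save (release)."""
    with row_lock:
        lock_is_held.set()
        time.sleep(lunch_seconds)  # the system is idle the whole time


def time_blocked_on_row():
    """Measure how long a second task waits to update the same row."""
    lock_is_held.wait()
    started = time.perf_counter()
    with row_lock:
        pass                       # the update itself takes almost no time
    return time.perf_counter() - started


holder = threading.Thread(target=user_who_goes_to_lunch, args=(0.2,))
holder.start()
blocked_for = time_blocked_on_row()
holder.join()
print(f"blocked for {blocked_for:.2f} s on an otherwise idle system")
```

The second task's delay is set entirely by how long the first user holds the lock. No resource is busy while it waits, which is why a utilization-based model cannot predict this kind of delay.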
19 PERFORMANCE TESTING

All this talk of queueing delays and coherency delays leads to a very difficult question. How can you possibly test a new application enough to be confident that you're not going to wreck your production implementation with performance problems?

You can model. And you can test.[14] However, nothing you do will be perfect. It is extremely difficult to create models and tests in which you'll foresee all your production problems in advance of actually encountering those problems in production.

[14] The Computer Measurement Group is a network of professionals who study these problems very, very seriously. You can learn about CMG at http://www.cmg.org.
Some people allow the apparent futility of this observation to justify not testing at all. Don't get trapped in that mentality. The following points are certain:

- You'll catch a lot more problems if you try to catch them prior to production than if you don't even try.
- You'll never catch all your problems in pre-production testing. That's why you need a reliable and efficient method for solving the problems that leak through your pre-production testing processes.
Somewhere in the middle between "no testing" and "complete production emulation" is the right amount of testing. The right amount of testing for aircraft manufacturers is probably more than the right amount of testing for companies that sell baseball caps. But don't skip performance testing altogether. At the very least, your performance test plan will make you a more competent diagnostician (and clearer thinker) when it comes time to fix the performance problems that will inevitably occur during production operation.
20 MEASURING

People feel throughput and response time. Throughput is usually easy to measure. Measuring response time is usually much more difficult. (Remember, throughput and response time are not reciprocals.) It may not be difficult to time an end-user action with a stopwatch, but it might be very difficult to get what you really need, which is the ability to drill down into the details of why a given response time is as large as it is.
Unfortunately, people tend to measure what's easy to measure, which is not necessarily what they should be measuring. It's a bug. When it's not easy to measure what we need to measure, we tend to turn our attention to measurements that are easy to get. Measures that aren't what you need, but that are easy enough to obtain and seem related to what you need, are called surrogate measures. Examples of surrogate measures include subroutine call counts and samples of subroutine call execution durations.
I'm ashamed that I don't have greater command over my native language than to say it this way, but here is a catchy, modern way to express what I think about surrogate measures:

Surrogate measures suck.
Here, unfortunately, "suck" doesn't mean "never work." It would actually be better if surrogate measures never worked. Then nobody would use them. The problem is that surrogate measures work sometimes. This inspires people's confidence that the measures they're using should work all the time, and then they don't. Surrogate measures have two big problems. They can tell you your system's ok when it's not. That's what statisticians call type II error, the false negative. And they can tell you that something is a problem when it's not. That's what statisticians call type I error, the false positive. I've seen each type of error waste years of people's time.
When it comes time to assess the specifics of a real system, your success is at the mercy of how good the measurements are that your system allows you to obtain. I've been fortunate to work in the Oracle market segment, where the software vendor at the center of our universe participates actively in making it possible to measure systems the right way. Getting application software developers to use the tools that Oracle offers is another story, but at least the capabilities are there in the product.
21 PERFORMANCE IS A FEATURE

Performance is a software application feature, just like recognizing that it's convenient for a string of the form "Case 1234" to automatically hyperlink over to case 1234 in your bug tracking system.[15] Performance, like any other feature, doesn't just happen; it has to be designed and built. To do performance well, you have to think about it, study it, write extra code for it, test it, and support it.
However, like many other features, you can't know exactly how performance is going to work out while it's still early in the project when you're writing, studying, designing, and creating the application. For many applications (arguably, for the vast majority), performance is completely unknown until the production phase of the software development life cycle. What this leaves you with is this:

Since you can't know how your application is going to perform in production, you need to write your application so that it's easy to fix performance in production.
As David Garvin has taught us, it's much easier to manage something that's easy to measure.[16] Writing an application that's easy to fix in production begins with an application that's easy to measure in production.

[15] FogBugz, which is software that I enjoy using, does this.
[16] David Garvin, 1993. "Building a Learning Organization" in Harvard Business Review, Jul. 1993.
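One way to make an application easy to measure is to have it record, for each business task, how much time each step consumed. The following is a minimal sketch of that idea, not Oracle's instrumentation or any Method R product; the task name, step names, and the `timed` helper are all hypothetical.

```python
import time
from collections import defaultdict
from contextlib import contextmanager

# task name -> step name -> [call count, total seconds]
profile = defaultdict(lambda: defaultdict(lambda: [0, 0.0]))


@contextmanager
def timed(task, step):
    """Record how long one step of one task execution takes."""
    started = time.perf_counter()
    try:
        yield
    finally:
        entry = profile[task][step]
        entry[0] += 1
        entry[1] += time.perf_counter() - started


def book_order():
    """Hypothetical business task with two instrumented steps."""
    with timed("book_order", "validate"):
        sum(range(1000))           # placeholder for real work
    with timed("book_order", "write_to_database"):
        time.sleep(0.01)           # placeholder for a database call


book_order()
book_order()
for step, (calls, seconds) in profile["book_order"].items():
    print(f"{step:20s} calls={calls} total={seconds:.4f}s")
```

Because the timings are attributed to a named task and its steps, the output answers the question of where that task's response time went, which a system-wide counter cannot do.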
Most times, when I mention the concept of production performance measurement, people drift into a state of worry about the measurement intrusion effect of performance instrumentation. They immediately enter a mode of data collection compromise, leaving only surrogate measures on the table. Won't software with extra code path to measure timings be slower than the same software without that extra code path?
I like an answer that I heard Tom Kyte give once in response to this question.[17] He estimated that the measurement intrusion effect of Oracle's extensive performance instrumentation is negative 10% or less.[18] He went on to explain to a now-vexed questioner that the product is at least 10% faster now because of the knowledge that Oracle Corporation has gained from its performance instrumentation code, more than making up for any overhead the instrumentation might have caused.
I think that vendors tend to spend too much time worrying about how to make their measurement code path efficient without figuring out how first to make it effective. It lands squarely upon the idea that Knuth wrote about in 1974 when he said that "premature optimization is the root of all evil."[19] The software designer who integrates performance measurement into his product is much more likely to create a fast application and (more importantly) an application that will become faster over time.
[17] Tom Kyte, 2009. "A couple of links and an advert" at http://tkyte.blogspot.com/2009/02/couple-of-links-and-advert.html.
[18] Where "or less" means "or better," as in -20%, -30%, etc.
[19] Donald Knuth, 1974. "Structured Programming with Go To Statements" in ACM Journal Computing Surveys, Vol. 6, No. 4, Dec. 1974, p268.

22 ACKNOWLEDGMENTS

Thank you Baron Schwartz for the email conversation in which you thought I was helping you, but in actual fact, you were helping me come to grips with the need for introducing coherency delay more prominently into my thinking. Thank you Jeff Holt, Ron Crisco, Ken Ferlita, and Harold Palacio for the daily work that keeps the company going and for the lunchtime conversations that keep my imagination going. Thank you Tom Kyte for your continued inspiration and support. Thank you Mark Farnham for your helpful suggestions. And thank you Neil Gunther for your patience and generosity in our ongoing discussions about knees.
23 ABOUT THE AUTHOR

Cary Millsap is well known in the global Oracle community as a speaker, educator, consultant, and writer. He is the founder and president of Method R Corporation (http://method-r.com), a small company devoted to genuinely satisfying software performance. Method R offers consulting services, education courses, and software tools (including the Method R Profiler, MR Tools, the Method R SLA Manager, and the Method R Instrumentation Library for Oracle) that help you optimize your software performance.

Cary is the author (with Jeff Holt) of Optimizing Oracle Performance (O'Reilly), for which he and Jeff were named Oracle Magazine's 2004 Authors of the Year. He is a co-author of Oracle Insights: Tales of the Oak Table (Apress). He is the former Vice President of Oracle Corporation's System Performance Group, and a co-founder of his former company called Hotsos. Cary is also an Oracle ACE Director and a founding partner of the Oak Table Network, an informal association of "Oracle scientists" that are well known throughout the Oracle community. Cary blogs at http://carymillsap.blogspot.com, and he tweets at http://twitter.com/CaryMillsap.
24 EPILOG: OPEN DEBATE ABOUT KNEES

In sections 14 through 16, I wrote about knees in performance curves, their relevance, and their application. However, there is open debate going back at least 20 years about whether it's even worthwhile to try to define the concept of knee, like I've done in this paper.

There is significant historical basis to the idea that such a thing that I've described as a knee in fact isn't really meaningful. In 1988, Stephen Samson argued that, at least for M/M/1 queueing systems, there is no knee in the performance curve. He wrote, "The choice of a guideline number is not easy, but the rule-of-thumb makers go right on. In most cases there is not a knee, no matter how much we wish to find one."[20]
The whole problem reminds me, as I wrote in 1999,[21] of the parable of the frog and the boiling water. The story says that if you drop a frog into a pan of boiling water, he will escape. But, if you put a frog into a pan of cool water and slowly heat it, then the frog will sit patiently in place until he is boiled to death.

[20] Stephen Samson, 1988. "MVS performance legends" in CMG 1988 Conference Proceedings. Computer Measurement Group, 148-159.
[21] Cary Millsap, 1999. "Performance management: myths and facts," available at http://method-r.com.
With utilization, just as with boiling water, there is clearly a "death zone," a range of values in which you can't afford to run a system with random arrivals. But where is the border of the death zone? If you are trying to implement a procedural approach to managing utilization, you need to know.
Recently, my friend Neil Gunther[22] has debated with me privately that, first, the term "knee" is completely the wrong word to use here, because "knee" is the wrong term to use in the absence of a functional discontinuity. Second, he asserts that the boundary value of .5 for an M/M/1 system is wastefully low, that you ought to be able to run such a system successfully at a much higher utilization value than .5. And, finally, he argues that any such special utilization value should be defined expressly as the utilization value beyond which your average response time exceeds your tolerance for average response time (Exhibit 11). Thus, Gunther argues that any useful not-to-exceed utilization value is derivable only from inquiries about human preferences, not from mathematics.
Exhibit 11. Gunther's maximum allowable utilization value ρ_T is defined as the utilization producing the average response time T.
The problem I see with this argument is illustrated in Exhibit 12. Imagine that your tolerance for average response time is T, which creates a maximum tolerated utilization value of ρ_T. Notice that even a tiny fluctuation in average utilization near ρ_T will result in a huge fluctuation in average response time.

[22] See http://en.wikipedia.org/wiki/Neil_J._Gunther for more information about Neil. See http://www.cmg.org/measureit/issues/mit62/m_62_15.html for more information about his argument.
Exhibit 12. Near the ρ_T value, small fluctuations in average utilization result in large response time fluctuations.
I believe, as I wrote in section 4, that your customers feel the variance, not the mean. Perhaps they say they will accept average response times up to T, but I don't believe that humans will be tolerant of performance on a system when a 1% change in average utilization over a 1-minute period results in, say, a ten-times increase in average response time over that period.
I do understand the perspective that the knee values I've listed in section 14 are below the utilization values that many people feel safe in exceeding, especially for lower order systems like M/M/1. However, I believe that it is important to avoid running resources at average utilization values where small fluctuations in utilization yield too-large fluctuations in response time.
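For the single-channel case, this sensitivity can be worked out in closed form. The derivation below is a sketch that assumes the standard M/M/1 result for mean response time, with S (a symbol introduced here for illustration) denoting mean service time and ρ denoting utilization:

```latex
\begin{align*}
R(\rho) &= \frac{S}{1-\rho} \\
\rho_T &= 1 - \frac{S}{T} \qquad \text{(from } R(\rho_T) = T\text{)} \\
\frac{dR}{d\rho} &= \frac{S}{(1-\rho)^2} \\
\left.\frac{dR}{d\rho}\right|_{\rho=0.5} &= 4S, \qquad
\left.\frac{dR}{d\rho}\right|_{\rho=0.9} = 100S
\end{align*}
```

With S = 1 and T = 10, ρ_T is 0.9, and the response time curve there is 25 times steeper than at the M/M/1 knee of 0.5. To first order, the same small fluctuation in utilization moves response time 25 times as far.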
Having said that, I don't yet have a good definition for what a too-large fluctuation is. Perhaps, like response time tolerances, different people have different tolerances for fluctuation. But perhaps there is a fluctuation tolerance factor that applies with reasonable universality across all human users. The Apdex standard, for example, assumes that the response time F at which users become frustrated is universally four times the response time T at which their attitude shifts from being satisfied to merely tolerating.[23]
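To show how the F = 4T assumption is used, here is a minimal sketch of an Apdex-style score. The scoring rule (satisfied samples count fully, tolerating samples count half, frustrated samples count zero) comes from the Apdex standard rather than from this paper, and the sample data is hypothetical.

```python
def apdex(response_times, t):
    """Apdex score for a list of response times and a target time t.

    Responses up to t count as satisfied, responses up to f = 4 * t
    count as tolerating (half weight), and the rest as frustrated.
    """
    f = 4.0 * t
    satisfied = sum(1 for r in response_times if r <= t)
    tolerating = sum(1 for r in response_times if t < r <= f)
    return (satisfied + tolerating / 2.0) / len(response_times)


# Hypothetical sample: one satisfied, one tolerating, one frustrated user.
print(apdex([0.5, 1.5, 5.0], t=1.0))   # (1 + 0.5) / 3 = 0.5
```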
The knee, regardless of how you define it or what we end up calling it, is an important parameter to the capacity planning procedure that I described in section 16, and I believe it is an important parameter to the daily process of computer system workload management.

I will keep studying.

[23] See http://www.apdex.org for more information about Apdex.

[Plots for Exhibits 11 and 12: response time R versus utilization. M/M/1 system, T = 10, with utilization axis marks at 0.5 and ρ_T = 0.900. M/M/8 system, T = 10, with utilization axis marks at 0.744997 and ρ_T = 0.987.]