thinking clearly (paper)

Upload: negrodeelite

Post on 06-Apr-2018

219 views

Category:

Documents


0 download

TRANSCRIPT

  • 8/3/2019 Thinking Clearly (Paper)

    1/15

    2010MethodRCorporation.Allrightsreserved.1

    ThinkingClearlyAboutPerformance

    CaryMillsapMethodRCorporation,Southlake,Texas,USA

    [email protected]

    Revised2010/05/11

    Creatinghighperformanceasanattributeofcomplexsoftwareisextremelydifficult

    business for developers, technology administrators, architects, system analysts, and

    project managers. However, by understanding some fundamental principles,

    performance problem solving and prevention can be made far simpler and more

    reliable. This paper describes those principles, linking them together in a coherent

    journey covering the goals, the terms, the tools,and the decisionsthat youneed to

    maximize yourapplications chance of having a long,productive,high-performance

    life.ExamplesinthispapertouchuponOracleexperiences,butthescopeofthepaper

    isnotrestrictedtoOracleproducts.

    TABLEOFCONTENTS

    ThinkingClearlyAboutPerformance..................................11 AnAxiomaticApproach ....................................................12 WhatisPerformance?............................................... ......... 23 ResponseTimevs.Throughput.....................................24 PercentileSpecifications .................................................. 35 ProblemDiagnosis ..................................................... ......... 36 TheSequenceDiagram.................. .................................... 4

    7 TheProfile ................................................... ........................... 58 AmdahlsLaw ............................................. ........................... 59 Skew ............................................. ............................................. 610 MinimizingRisk ................................................. .................. 711 Efficiency. ..................................................... ........................... 712 Load........... ..................................................... ........................... 813 QueueingDelay .................................................. .................. 814 TheKnee ............................................. .................................... 915 RelevanceoftheKnee .............................................. .......1016 CapacityPlanning..............................................................10 17 RandomArrivals................................................................10 18 CoherencyDelay................................................................11 19 PerformanceTesting ................................................ .......1120Measuring.............................................................................12

    21 PerformanceisaFeature...............................................12 22 Acknowledgments ..................................................... .......1323 AbouttheAuthor...............................................................13 24 Epilog:OpenDebateaboutKnees..............................13

    1 ANAXIOMATICAPPROACHWhenIjoinedOracleCorporationin1989,

    performancewhateveryonecalledOracletuning

    wasdifficult.Onlyafewpeopleclaimedtheycoulddo

    itverywell,andthosepeoplecommandednice,high

    consultingrates.Whencircumstancesthrustmeinto

    theOracletuningarena,Iwasquiteunprepared.

    Recently,Ivebeenintroducedintotheworldof

    MySQLtuning,andthesituationseemsverysimilar

    towhatIsawinOracleovertwentyyearsago.

    ItremindsmealotofhowdifficultIwouldhavetold

    youthatbeginningalgebrawas,ifyouhadaskedme

    whenIwasabout13yearsold.Atthatage,Ihadto

    appealheavilytomymathematicalinstinctstosolve

    equationslike3x+4=13.Theproblemwiththatis

    thatmanyofusdidnthavemathematicalinstincts.I

    canrememberlookingataproblemlike3x+4=13;

    findxandbasicallystumblingupontheanswer x=3

    usingtrialanderror.

    Thetrial-and-errormethodoffeelingmywaythrough

    algebraproblemsworkedalbeitslowlyand

    uncomfortablyforeasyequations,butitdidntscale

    astheproblemsgottougher,like3x+4=14.Now

    what?MyproblemwasthatIwasntthinkingclearly

    yetaboutalgebra.Myintroductionatagefifteento

    JamesR.Harkeyputmeontheroadtosolvingthat

    problem.

    Mr.Harkeytaughtuswhathecalledanaxiomatic

    approachtosolvingalgebraicequations.Heshowedus

  • 8/3/2019 Thinking Clearly (Paper)

    2/15

    2010MethodRCorporation.Allrightsreserved. 2

    asetofstepsthatworkedeverytime(andhegaveus

    plentyofhomeworktopracticeon).Inadditionto

    workingeverytime,byexecutingthosesteps,we

    necessarilydocumentedourthinkingasweworked.

    Notonlywerewethinkingclearly,usingareliableand

    repeatablesequenceofsteps,wewereprovingto

    anyonewhoreadourworkthatwewerethinkingclearly.

    OurworkforMr.Harkeylookedlikethis:

    3.1x+4 =13 problemstatement

    3.1x+44=134 subtractionpropertyofequality

    3.1x =9 additiveinverseproperty,

    simplification3.1x 3.1 =9 3.1 divisionpropertyofequalityx 2.903 multiplicativeinverseproperty,

    simplification

    ThiswasMr.Harkeysaxiomaticapproachtoalgebra,

    geometry,trigonometry,andcalculus:onesmall,

    logical,provable,andauditablestepatatime.Itsthe

    firsttimeIeverreallygotmathematics.

    Naturally,Ididntrealizeitatthetime,butofcourse

    provingwasaskillthatwouldbevitalformysuccess

    intheworldafterschool.Inlife,Ivefoundthat,of

    course,knowingthingsmatters.Butprovingthose

    thingstootherpeoplemattersmore.Withoutgood

    provingskills,itsdifficulttobeagoodconsultant,a

    goodleader,orevenagoodemployee.

    Mygoalsincethemid-1990shasbeentocreatea

    similarlyrigorousapproachtoOracleperformance

    optimization.Lately,Iamexpandingthescopeofthat

    goalbeyondOracle,to:Createanaxiomaticapproach

    tocomputersoftwareperformanceoptimization.Ive

    foundthatnotmanypeoplereallylikeitwhenItalk

    likethat,soletssayitlikethis:

    Mygoalistohelpyou thinkclearlyabouthowto

    optimizetheperformanceofyourcomputer

    software.

    2 WHATISPERFORMANCE?Ifyougoogleforthewordperformance,yougetovera

    halfabillionhitsonconceptsrangingfrombicycle

    racingtothedreadedemployeereviewprocessthat

    manycompaniesthesedaysarelearningtoavoid.WhenIgoogledforperformance,mostofthetophits

    relatetothesubjectofthispaper:thetimeittakesfor

    computersoftwaretoperformwhatevertaskyouask

    ittodo.

    Andthatsagreatplacetobegin:the task.Ataskisa

    business-orientedunitofwork.Taskscannest: print

    invoicesisatask;printoneinvoiceasub-taskisalso

    atask.Whenacomputerusertalksaboutperformance

    heusuallymeansthetimeittakesforthesystemto

    executesometask.Responsetimeistheexecution

    durationofatask,measuredintimepertask,like

    secondsperclick.Forexample,myGooglesearchfor

    thewordperformancehadaresponsetimeof

    0.24seconds.TheGooglewebpagerenderedthat

    measurementrightinmybrowser.Thatisevidence,tome,thatGooglevaluesmyperceptionofGoogle

    performance.

    Somepeopleareinterestedinanotherperformance

    measure:Throughputisthecountoftaskexecutions

    thatcompletewithinaspecifiedtimeinterval,like

    clickspersecond.Ingeneral,peoplewhoare

    responsiblefortheperformanceofgroupsofpeople

    worrymoreaboutthroughputthanpeoplewhowork

    inasolocontributorrole.Forexample,anindividual

    accountantisusuallymoreconcernedaboutwhether

    theresponsetimeofadailyreportwillrequirehimto

    staylateafterworktoday.Themanagerofagroupof

    accountsisadditionallyconcernedaboutwhetherthesystemiscapableofprocessingallthedatathatallof

    heraccountantswillbeprocessingtoday.

    3 RESPONSETIMEVS.THROUGHPUTThroughputandresponsetimehaveagenerally

    reciprocaltypeofrelationship,butnotexactly.The

    realrelationshipissubtlycomplex.

    Example:Imaginethatyouhavemeasuredyour

    throughputat1,000taskspersecondforsome

    benchmark.What,then,isyourusersaverage

    responsetime?

    Itstemptingtosaythatyouraverageresponsetimewas1/1,000=.001secondspertask.Butitsnot

    necessarilyso.

    Imaginethatyoursystemprocessingthisthroughput

    had1,000parallel,independent,homogeneous

    servicechannelsinsideit(thatis,itsasystemwith

    1,000independent,equallycompetentservice

    providersinsideit,eachawaitingyourrequestforservice).Inthiscase,itispossiblethateachrequest

    consumedexactly1second.

    Now,wecanknowthataverageresponsetimewas

    somewherebetween0secondspertaskand

    1secondpertask.Butyoucannotderiveresponse

    timeexclusively1fromathroughputmeasurement.Youhavetomeasureitseparately.

    1Icarefullyincludethewordexclusivelyinthisstatement,

    becausetherearemathematicalmodelsthatcancompute

    responsetimeforagiventhroughput,butthemodels

    requiremoreinputthanjustthroughput.

  • 8/3/2019 Thinking Clearly (Paper)

    3/15

    2010MethodRCorporation.Allrightsreserved.3

    Thesubtletyworksintheotherdirection,too.Youcan

    certainlyfliptheexampleIjustgavearoundandprove

    it.However,ascarierexamplewillbemorefun.

    Example:Yourclientrequiresanewtaskthatyoure

    programmingtodeliverathroughputof100taskspersecondonasingle-CPUcomputer.Imaginethat

    thenewtaskyouvewrittenexecutesinjust.001secondsontheclientssystem.Willyour

    programyieldthethroughputthattheclient

    requires?

    Itstemptingtosaythatifyoucanrunthetaskonce

    injustathousandthofasecond,thensurelyyoullbe

    abletorunthattaskatleastahundredtimesinthe

    spanofafullsecond.Andyoureright,ifthetaskrequestsarenicelyserialized,forexample,sothat

    yourprogramcanprocessall100oftheclients

    requiredtaskexecutionsinsidealoop,oneafterthe

    other.

    Butwhatifthe100taskspersecondcomeatyour

    systematrandom,from100differentusersloggedintoyourclientssingle-CPUcomputer?Thenthe

    gruesomerealitiesofCPUschedulersandserialized

    resources(likeOraclelatchesandlocksandwritable

    accesstobuffersinmemory)mayrestrictyour

    throughputtoquantitiesmuchlessthantherequired

    100taskspersecond.

    Itmightwork.Itmightnot.Youcannotderivethroughputexclusivelyfromaresponsetime

    measurement.Youhavetomeasureitseparately.

    Responsetimeandthroughputarenotnecessarily

    reciprocals.Toknowthemboth,youneedtomeasure

    themboth.

    So,whichismoreimportant:responsetime,orthroughput?Foragivensituation,youmightanswer

    legitimatelyineitherdirection.Inmany

    circumstances,theansweristhatbotharevital

    measurementsrequiringmanagement.Forexample,a

    systemownermayhaveabusinessrequirementthat

    responsetimemustbe1.0secondsorlessforagiven

    taskin99%ormoreofexecutions andthesystem

    mustsupportasustainedthroughputof

    1,000executionsofthetaskwithina10-minute

    interval.

    4 PERCENTILESPECIFICATIONSInthepriorsection,Iusedthephrasein99%ormore

    ofexecutionstoqualifyaresponsetimeexpectation.

    Manypeoplearemoreaccustomedtostatementslike,

    averageresponsetimemustbersecondsorless.The

    percentilewayofstatingrequirementsmapsbetter,

    though,tothehumanexperience.

    Example:Imaginethatyourresponsetimetolerance

    is1secondforsometaskthatyouexecuteonyour

    computereveryday.Furtherimaginethatthelistsof

    numbersshowninExhibit1representthemeasuredresponsetimesoftenexecutionsofthattask.The

    averageresponsetimeforeachlistis1.000seconds.

    Whichonedoyouthinkyoudlikebetter?

    ListA ListB

    1 .924 .7962 .928 .798

    3 .954 .802

    4 .957 .823

    5 .961 .919

    6 .965 .977

    7 .972 1.076

    8 .979 1.216

    9 .987 1.273

    10 1.373 1.320

    Exhibit1.Theaverageresponsetimeforeachof

    thesetwolistsis1.000seconds.

    Youcanseethatalthoughthetwolistshavethesame

    averageresponsetime,thelistsarequitedifferentincharacter.InListA,90%ofresponsetimeswere

    1secondorless.InListB,only60%ofresponse

    timeswere1secondorless.Statedintheopposite

    way,ListBrepresentsasetofuserexperiencesofwhich40%weredissatisfactory,butListA(having

    thesameaverageresponsetimeasListB)represents

    onlya10%dissatisfactionrate.

    InListA,the90thpercentileresponsetimeis.987seconds.InListB,the90thpercentileresponse

    timeis1.273seconds.Thesestatementsabout

    percentilesaremoreinformativethanmerelysaying

    thateachlistrepresentsanaverageresponsetimeof1.000seconds.

    AsGEsays,Ourcustomersfeelthevariance,notthe

    mean.2Expressingresponsetimegoalsaspercentiles

    makeformuchmorecompellingrequirement

    specificationsthatmatchwithenduserexpectations:

    TheTrackShipmenttaskmustcompleteinless

    than.5secondsinatleast99.9%ofexecutions.

    5 PROBLEMDIAGNOSISInnearlyeveryperformanceproblemIvebeeninvited

    torepair,theproblemstatementhasbeenastatement

    aboutresponsetime.Itusedtotakelessthana

    secondtodoX;nowitsometimestakes20+.Ofcourse,aspecificproblemstatementlikethatisoften

    buriedbehindveneersofotherproblemstatements,

    2GeneralElectricCompany:WhatIsSixSigma?The

    RoadmaptoCustomerImpactat

    http://www.ge.com/sixsigma/SixSigma.pdf

  • 8/3/2019 Thinking Clearly (Paper)

    4/15

    2010MethodRCorporation.Allrightsreserved. 4

    like,Ourwhole[adjectivesdeleted]systemissoslow

    wecantuseit.3

    Butjustbecausesomethinghashappenedalotforme

    doesntmeanthatitswhatwillhappennextforyou.

    Themostimportantthingforyoutodofirstisstate

    theproblemclearly,sothatyoucanthenthinkaboutit

    clearly.

    Agoodwaytobeginistoask,whatisthegoalstate

    thatyouwanttoachieve?Findsomespecificsthatyou

    canmeasuretoexpressthis.Forexample,Response

    timeofXismorethan20secondsinmanycases.Well

    behappywhenresponsetimeis1secondorlessinat

    least95%ofexecutions.

    Thatsoundsgoodintheory,butwhatifyouruser

    doesnthaveaspecificquantitativegoallike1second

    orlessinatleast95%ofexecutions?Therearetwo

    quantitiesrightthere(1and95);whatifyouruser

    doesntknoweitheroneofthem?Worseyet,whatif

    youruserdoeshavespecificideasabouthisexpectations,butthoseexpectationsareimpossibleto

    meet?Howwouldyouknowwhatpossibleor

    impossibleevenis?

    Letsworkourwayuptothosequestions.

    6 THESEQUENCEDIAGRAMAsequencediagram isatypeofgraphspecifiedinthe

    UnifiedModelingLanguage(UML),usedtoshowthe

    interactionsbetweenobjectsinthesequentialorder

    thatthoseinteractionsoccur.Thesequencediagramis

    anexceptionallyusefultoolforvisualizingresponse

    time.Exhibit2showsastandardUMLsequence

    diagramforasimpleapplicationsystemcomposedofa

    browser,anapplicationserver,andadatabase.

    3CaryMillsap,2009.Mywholesystemisslow.Nowwhat?

    athttp://carymillsap.blogspot.com/2009/12/my-whole-

    system-is-slow-now-what.html

    Exhibit2.ThisUMLsequencediagramshowsthe

    interactionsamongabrowser,anapplication

    server,andadatabase.

    Imaginenowdrawingthesequencediagramtoscale,

    sothatthedistancebetweeneachrequestarrowcominganditscorrespondingresponsearrowgoing

    outwereproportionaltothedurationspentservicing

    therequest.IveshownsuchadiagraminExhibit3.

    Exhibit3.AUMLsequencediagramdrawnto

    scale,showingtheresponsetimeconsumedat

    eachtierinthesystem.

    WithExhibit3,youhaveagoodgraphical

    representationofhowthecomponentsrepresentedin

    yourdiagramarespendingyouruserstime.Youcanfeeltherelativecontributiontoresponsetimeby

    lookingatthepicture.

    Sequencediagramsarejustrightforhelpingpeople

    conceptualizehowtheirresponseisconsumedona

    givensystem,asonetierhandscontrolofthetaskto

    thenext.Sequencediagramsalsoworkwelltoshow

    howsimultaneousthreadsofprocessingworkin

    parallel.Sequencediagramsaregoodtoolsfor

  • 8/3/2019 Thinking Clearly (Paper)

    5/15

    2010MethodRCorporation.Allrightsreserved.5

    analyzingperformanceoutsideoftheinformation

    technologybusiness,too. 4

    Thesequencediagramisagoodconceptualtoolfor

    talkingaboutperformance,buttothinkclearlyabout

    performance,weneedsomethingelse,too.Heresthe

    problem.Imaginethatthetaskyouresupposedtofix

    hasaresponsetimeof2,468seconds(thats41minutes8seconds).Inthatroughly41minutes,

    runningthattaskcausesyourapplicationserverto

    execute322,968databasecalls.Exhibit4showswhat

    yoursequencediagramforthattaskwouldlooklike.

    Exhibit4.ThisUMLsequencediagramshows

    322,968databasecallsexecutedbythe

    applicationserver.

    Therearesomanyrequestandresponsearrows

    betweentheapplicationanddatabasetiersthatyoucantseeanyofthedetail.Printingthesequence

    diagramonaverylongscrollisntausefulsolution,

    becauseitwouldtakeusweeksofvisualinspection

    beforewedbeabletoderiveusefulinformationfrom

    thedetailswedsee.

    Thesequencediagramisagoodtoolfor

    conceptualizingflowofcontrolandthecorresponding

    flowoftime.Buttothinkclearlyaboutresponsetime,

    weneedsomethingelse.

    7 THEPROFILEThesequencediagramdoesntscalewell.Todealwithtasksthathavehugecallcounts,weneedaconvenient

    aggregationofthesequencediagramsothatwe

    understandthemostimportantpatternsinhowour

    4CaryMillsap,2009.PerformanceoptimizationwithGlobal

    Entry.Ornot?at

    http://carymillsap.blogspot.com/2009/11/performance-

    optimization-with-global.html

    timehasbeenspent.Exhibit5showsanexampleofa

    tablecalledaprofile,whichdoesthetrick.Aprofileis

    atabulardecompositionofresponsetime,typically

    listedindescendingorderofcomponentresponse

    timecontribution.

    Functioncall R(sec) Calls

    1 DB:fetch() 1,748.229 322,9682 App:await_db_netIO() 338.470 322,968

    3 DB:execute() 152.654 39,142

    4 DB:prepare() 97.855 39,142

    5 Other 58.147 89,422

    6 App:render_graph() 48.274 7

    7 App:tabularize() 23.481 4

    8 App:read() 0.890 2

    Total 2,468.000

    Exhibit5.Thisprofileshowsthedecompositionof

    a2,468.000-secondresponsetime.

    Example:TheprofileinExhibit5isrudimentary,but

    itshowsyouexactlywhereyourslowtaskhasspent

    yourusers2,468seconds.Withthedatashownhere,forexample,youcanderivethepercentageof

    responsetimecontributionforeachofthefunction

    callsidentifiedintheprofile.Youcanalsoderivethe

    averageresponsetimeforeachtypeoffunctioncallduringyourtask.

    Aprofileshowsyouwhereyourcodehasspentyour

    timeandsometimesevenmoreimportantlywhere

    ithasnot.Thereistremendousvalueinnothavingto

    guessaboutthesethings.

    FromthedatashowninExhibit5,you knowthat

    70.8%ofyourusersresponsetimeisconsumedby

    DB:fetch()calls.Furthermore,ifyoucandrilldownintotheindividualcallswhosedurationswere

    aggregatedtocreatethisprofile,youcanknowhow

    manyofthoseApp:await_db_netIO()calls

    correspondedtoDB:fetch()calls,andyoucanknow

    howmuchresponsetimeeachofthoseconsumed.

    Withaprofile,youcanbegintoformulatetheanswer

    tothequestion,Howlong shouldthistaskrun?

    Which,bynow,youknowisanimportantquestion

    inthefirststep(section5)ofanygoodproblem

    diagnosis.

    8 AMDAHLSLAWProfilinghelpsyouthinkclearlyaboutperformance.

    EvenifGeneAmdahlhadntgivenusAmdahlsLawbackin1967,youdhaveprobablycomeupwithit

    yourselfafterthefirstfewprofilesyoulookedat.

    AmdahlsLawstates:

    Performanceimprovementisproportionalto

    howmuchaprogramusesthethingyou

    improved.

  • 8/3/2019 Thinking Clearly (Paper)

    6/15

    2010MethodRCorporation.Allrightsreserved. 6

    Soifthethingyouretryingtoimproveonly

    contributes5%toyourtaskstotalresponsetime,then

    themaximumimpactyoullbeabletomakeis5%of

    yourtotalresponsetime.Thismeansthatthecloserto

    thetopofaprofilethatyouwork(assumingthatthe

    profileissortedindescendingresponsetimeorder),

    thebiggerthebenefitpotentialforyouroverallresponsetime.

    Thisdoesntmeanthatyoualwaysworkaprofilein

    top-downorder,though,becauseyoualsoneedto

    considerthecostoftheremediesyoullbeexecuting,

    too.5

    Example:ConsidertheprofileinExhibit6.ItsthesameprofileasinExhibit5,excepthereyoucansee

    howmuchtimeyouthinkyoucansaveby

    implementingthebestremedyforeachrowinthe

    profile,andyoucanseehowmuchyouthinkeach

    remedywillcosttoimplement.

    Potentialimprovement%andcostofinvestment R(sec) R(%)

    1 34.5%superexpensive 1,748.229 70.8%

    2 12.3%dirtcheap 338.470 13.7%

    3 Impossibletoimprove 152.654 6.2%

    4 4.0%dirtcheap 97.855 4.0%

    5 0.1%superexpensive 58.147 2.4%

    6 1.6%dirtcheap 48.274 2.0%

    7 Impossibletoimprove 23.481 1.0%

    8 0.0%dirtcheap 0.890 0.0%

    Total 2,468.000

    Exhibit6.Thisprofileshowsthepotentialfor

    improvementandthecorrespondingcost

    (difficulty)ofimprovementforeachlineitem

    fromExhibit5.

    Whatremedyactionwouldyouimplementfirst?

    AmdahlsLawsaysthatimplementingtherepairon

    line1hasthegreatestpotentialbenefitofsavingabout851seconds(34.5%of2,468seconds).Butifit

    istrulysuperexpensive,thentheremedyonline2

    mayyieldbetternetbenefitandthatsthe

    constrainttowhichwereallyneedtooptimize

    eventhoughthepotentialforresponsetimesavings

    isonlyabout305seconds.

    Atremendousvalueoftheprofileisthatyoucanlearn

    exactlyhowmuchimprovementyoushouldexpectfor

    aproposedinvestment.Itopensthedoortomaking

    muchbetterdecisionsaboutwhatremediesto

    implementfirst.Yourpredictionsgiveyouayardstick

    formeasuringyourownperformanceasananalyst.

    Andfinally,itgivesyouachancetoshowcaseyour

    5CaryMillsap,2009.Ontheimportanceofdiagnosing

    beforeresolvingat

    http://carymillsap.blogspot.com/2009/09/on-importance-of-

    diagnosing-before.html

    clevernessandintimacywithyourtechnologyasyou

    findmoreefficientremediesforreducingresponse

    timemorethanexpected,atlower-than-expected

    costs.

    Whatremedyactionyouimplementfirstreallyboils

    downtohowmuchyoutrustyourcostestimates.Does

    dirtcheapreallytakeintoaccounttherisksthattheproposedimprovementmayinflictuponthesystem?

    Forexample,itmayseemdirtcheaptochangethat

    parameterordropthatindex,butdoesthatchange

    potentiallydisruptthegoodperformancebehaviorof

    somethingouttherethatyourenoteventhinking

    aboutrightnow?Reliablecostestimationisanother

    areainwhichyourtechnologicalskillspayoff.

    Anotherfactorworthconsideringisthepolitical

    capitalthatyoucanearnbycreatingsmallvictories.

    Maybecheap,low-riskimprovementswontamountto

    muchoverallresponsetimeimprovement,buttheres

    valueinestablishingatrackrecordofsmallimprovementsthatexactlyfulfillyourpredictions

    abouthowmuchresponsetimeyoullsavefortheslow

    task.Atrackrecordofpredictionandfulfillment

    ultimatelyespeciallyintheareaofsoftware

    performance,wheremythandsuperstitionhave

    reignedatmanylocationsfordecadesgivesyouthe

    credibilityyouneedtoinfluenceyourcolleagues(your

    peers,yourmanagers,yourcustomers,)toletyou

    performincreasinglyexpensiveremediesthatmay

    producebiggerpayoffsforthebusiness.

    Awordofcaution,however:Dontgetcarelessasyou

    rackupyoursuccessesandproposeeverbigger,

    costlier,riskierremedies.Credibilityisfragile.Ittakesalotofworktobuilditupbutonlyonecareless

    mistaketobringitdown.

    9 SKEWWhenyouworkwithprofiles,yourepeatedly

    encountersub-problemslikethisone:

    Example:TheprofileinExhibit5revealedthat

    322,968DB:fetch()callshadconsumed

    1,748.229secondsofresponsetime.Howmuchunwantedresponsetimewouldweeliminateifwe

    couldeliminatehalfofthosecalls?

    Theanswerisalmostnever,Halfoftheresponse

    time.Considerthisfarsimplerexampleforamoment:

    Example:Fourcallstoasubroutineconsumedfour

    seconds.Howmuchunwantedresponsetimewould

    weeliminateifwecouldeliminatehalfofthosecalls?

    Theanswerdependsupontheresponsetimesofthe

    individualcallsthatwecouldeliminate.Youmight

    haveassumedthateachofthecalldurationswasthe

  • 8/3/2019 Thinking Clearly (Paper)

    7/15

    2010MethodRCorporation.Allrightsreserved.7

    average4/4=1second.Butnowhereintheproblem

    statementdidItellyouthatthecalldurationswere

    uniform.

    Example:Imaginethefollowingtwopossibilities,

    whereeachlistrepresentstheresponsetimesofthefoursubroutinecalls:

    A={1,1,1,1}B={3.7,.1,.1,.1}

    InlistA,theresponsetimesareuniform,sonomatterwhichhalf(two)ofthecallsweeliminate,

    wellreducetotalresponsetimeto2seconds.

    However,inlistB,itmakesatremendousdifference

    whichtwocallsweeliminate.Ifweeliminatethefirst

    twocalls,thenthetotalresponsetimewilldropto

    .2seconds(a95%reduction).However,ifweeliminatethefinaltwocalls,thenthetotalresponse

    timewilldropto3.8seconds(onlya5%reduction).

    Skewisanon-uniformityinalistofvalues.The

    possibilityofskewiswhatprohibitsyoufrom

    providingapreciseanswertothequestionthatIaskedyouatthebeginningofthissection.Letslookagain:

    Example:TheprofileinExhibit5revealedthat

    322,968DB:fetch()callshadconsumed

    1,748.229secondsofresponsetime.Howmuch

    unwantedresponsetimewouldweeliminateifwe

    couldeliminatehalfofthosecalls?

    Withoutknowinganythingaboutskew,theonlyanswerwecanprovideis,Somewherebetween0

    and1,748.229seconds.Thatisthemostprecise

    correctansweryoucanreturn.

    Imagine,however,thatyouhadtheadditional

    informationavailableinExhibit7.Thenyoucouldformulatemuchmoreprecisebest-caseandworst-

    caseestimates.Specifically,ifyouhadinformationlike

    this,youdbesmarttotrytofigureouthow

    specificallytoeliminatethe47,444callswithresponse

    timesinthe.01-to.1-secondrange.

    Range{mine

  • 8/3/2019 Thinking Clearly (Paper)

    8/15

    2010MethodRCorporation.Allrightsreserved. 8

    Example:Amiddletierprogrammakes100database

    fetchcalls(andthus100networkI/Ocalls)tofetch994rows.Itcouldhavefetched994rowsin10fetch

    calls(andthus90fewernetworkI/Ocalls).

    Example:ASQLstatement7touchesthedatabase

    buffercache7,428,322timestoreturna698-row

    resultset.Anextrafilterpredicatecouldhave

    returnedthe7rowsthattheenduserreallywantedtosee,withonly52touchesuponthedatabasebuffer

    cache.

    Certainly,ifasystemhassomeglobalproblemthat

    createsinefficiencyforbroadgroupsoftasksacross

    thesystem(e.g.,ill-conceivedindex,badlyset

    parameter,poorlyconfiguredhardware),thenyou

    shouldfixit.Butdonttuneasystemtoaccommodate

    programsthatareinefficient.8Thereisalotmore

    leverageincuringtheprograminefficiencies

    themselves.Eveniftheprogramsarecommercial,off-

    the-shelfapplications,itwillbenefityoubetterinthe

    longruntoworkwithyoursoftwarevendortomakeyourprogramsefficient,thanitwilltotrytooptimize

    yoursystemtobeasefficientasitcanwithinherently

    inefficientworkload.

    Improvementsthatmakeyourprogrammoreefficient

    canproducetremendousbenefitsforeveryoneonthe

    system.Itseasytoseehowtop-linereductionof

    wastehelpstheresponsetimeofthetaskbeing

    repaired.Whatmanypeopledontunderstandaswell

    isthatmakingoneprogrammoreefficientcreatesa

    side-effectofperformanceimprovementforother

    programsonthesystemthathavenoapparent

    relationtotheprogrambeingrepaired.Ithappens

    becauseoftheinfluenceof loaduponthesystem.

    12 LOADLoadiscompetitionforaresourceinducedby

    concurrenttaskexecutions. Loadisthereasonthatthe

    performancetestingdonebysoftwaredevelopers

    doesntcatchalltheperformanceproblemsthatshow

    uplaterinproduction.

    Onemeasureofloadisutilization.Utilizationis

    resourceusagedividedbyresourcecapacityfora

    specifiedtimeinterval.Asutilizationforaresource

    goesup,sodoestheresponsetimeauserwill

    experiencewhenrequestingservicefromthat

    resource.Anyonewhohasriddeninanautomobilein

    abigcityduringrushhourhasexperiencedthis

    7MychoiceofarticleadjectivehereisadeadgiveawaythatIwasintroducedtoSQLwithintheOraclecommunity.

    8Admittedly,sometimesyouneedatourniquettokeepfrom

    bleedingtodeath.Butdontuseastopgapmeasureasa

    permanentsolution.Addresstheinefficiency.

    phenomenon.Whenthetrafficisheavilycongested,

    youhavetowaitlongeratthetollbooth.

    Withcomputersoftware,thesoftwareyouusedoesn't

    actuallygoslowerlikeyourcardoeswhenyoure

    going30mphinheavytrafficinsteadof60mphonthe

    openroad.Computersoftwarealwaysgoesthesame

    speed,nomatterwhat(aconstantnumberofinstructionsperclockcycle),butcertainlyresponse

    timedegradesasresourcesonyoursystemgetbusier.

    Therearetworeasonsthatsystemsgetslowerasload

    increases:queueingdelay,andcoherencydelay.Ill

    addresseachaswecontinue.

    13 QUEUEINGDELAYThemathematicalrelationshipbetweenloadand

    responsetimeiswellknown.Onemathematicalmodel

    calledM/M/mrelatesresponsetimetoloadin

    systemsthatmeetoneparticularlyusefulsetof

    specificrequirements.9Oneoftheassumptionsof

    M/M/misthatthesystemyouaremodelinghas

    theoreticallyperfectscalability.Havingtheoretically

    perfectscalabilityisakintohavingaphysicalsystem

    withnofriction,anassumptionthatsomany

    problemsinintroductoryPhysicscoursesinvoke.

    Regardlessofsomeoverreachingassumptionslikethe

    oneaboutperfectscalability,M/M/mhasalottoteach

    usaboutperformance.Exhibit8showsthe

    relationshipbetweenresponsetimeandloadusing

    M/M/m.

    Exhibit8.Thiscurverelatesresponsetimeasa

    functionofutilizationforanM/M/msystemwith

    m=8servicechannels.

    9CaryMillsapandJeffHolt,2003.OptimizingOracle

    Performance.OReilly.SebastopolCA.

    0.0 0.2 0.4 0.6 0.8 1.00

    1

    2

    3

    4

    5

    Utilization

    Responsetime

    R

    MM

    8 system

  • 8/3/2019 Thinking Clearly (Paper)

    9/15

    2010MethodRCorporation.Allrightsreserved.9

    InExhibit8,youcanseemathematicallywhatyoufeel

    whenyouuseasystemunderdifferentload

    conditions.Atlowload,yourresponsetimeis

    essentiallythesameasyourresponsetimewasatno

    load.Asloadrampsup,yousenseaslight,gradual

    degradationinresponsetime.Thatgradual

    degradationdoesntreallydomuchharm,butasloadcontinuestorampup,responsetimebeginstodegrade

    inamannerthatsneitherslightnorgradual.Rather,

    thedegradationbecomesquiteunpleasantand,infact,

    hyperbolic.

    Responsetime,intheperfectscalabilityM/M/m

    model,consistsoftwocomponents: servicetimeand

    queueingdelay.Thatis,R=S+Q.Servicetime(S)is

    thedurationthatataskspendsconsumingagiven

    resource,measuredintimepertaskexecution,asin

    secondsperclick.Queueingdelay(Q) isthetimethata

    taskspendsenqueuedatagivenresource,awaitingits

    opportunitytoconsumethatresource.Queueingdelay

    isalsomeasuredintimepertaskexecution(e.g.,secondsperclick).

    So,whenyouorderlunchatTacoTico,yourresponse

    time(R)forgettingyourorderisthequeueingdelay

    time(Q)thatyouspendqueuedinfrontofthecounter

    waitingforsomeonetotakeyourorder,plusthe

    servicetime(S)youspendwaitingforyourorderto

    hityourhandsonceyoubegintalkingtotheorder

    clerk.Queueingdelayisthedifferencebetweenyour

    responsetimeforagiventaskandtheresponsetime

    forthatsametaskonanotherwiseunloadedsystem

    (dontforgetourperfectscalabilityassumption).

    14 THEKNEEWhenitcomestoperformance,youwanttwothings

    fromasystem:

    1. Youwantthebestresponsetimeyoucanget:youdontwanttohavetowaittoolongfortaskstoget

    done.

    2. Andyouwantthebestthroughputyoucanget:youwanttobeabletocramasmuchloadasyou

    possiblycanontothesystemsothatasmany

    peopleaspossiblecanruntheirtasksatthesame

    time.

    Unfortunately,thesetwogoalsarecontradictory.

    Optimizingtothefirstgoalrequiresyoutominimize

    theloadonyoursystem;optimizingtotheotherone

    requiresyoutomaximizeit.Youcantdoboth

    simultaneously.Somewhereinbetweenatsomeload

    level(thatis,atsomeutilizationvalue)istheoptimal

    loadforthesystem.

    Theutilizationvalueatwhichthisoptimalbalance

    occursiscalledtheknee.10Thekneeistheutilization

    valueforaresourceatwhichthroughputismaximized

    withminimalnegativeimpacttoresponsetimes.

    Mathematically,thekneeistheutilizationvalueat

    whichresponsetimedividedbyutilizationisatits

    minimum.Onenicepropertyofthekneeisthatitoccursattheutilizationvaluewherealinethroughthe

    originistangenttotheresponsetimecurve.Ona

    carefullyproducedM/M/mgraph,youcanlocatethe

    kneequitenicelywithjustastraightedge,asshownin

    Exhibit9.

    Exhibit9.Thekneeoccursattheutilizationat

    whichalinethroughtheoriginistangenttothe

    responsetimecurve.

    AnothernicepropertyoftheM/M/mkneeisthatyouonlyneedtoknowthevalueofoneparameterto

    computeit.Thatparameteristhenumberofparallel,

    homogeneous,independentservicechannels.Aservice

    channelisaresourcethatsharesasinglequeuewith

    otheridenticalsuchresources,likeaboothinatoll

    plazaoraCPUinanSMPcomputer.

    TheitalicizedlowercaseminthenameM/M/misthe

    numberofservicechannelsinthesystembeing

    modeled.11TheM/M/mkneevalueforanarbitrary

    10Iamengagedinanongoingdebateaboutwhetheritis

    appropriatetousethetermkneeinthiscontext.Forthetimebeing,Ishallcontinuetouseit.Seesection24fordetails.

    11Bythispoint,youmaybewonderingwhattheothertwo

    MsstandforintheM/M/mqueueingmodelname.They

    relatetoassumptionsabouttherandomnessofthetimingof

    yourincomingrequestsandtherandomnessofyourservicetimes.See

    http://en.wikipedia.org/wiki/Kendall%27s_notation formore

    information,orOptimizingOraclePerformanceforeven

    more.

    0.0 0.2 0.4 0.6 0.8 1.00

    2

    4

    6

    8

    10

    Utilization

    Responsetim

    eR

    MM4, 0.665006

    MM16, 0.810695

  • 8/3/2019 Thinking Clearly (Paper)

    10/15

    2010MethodRCorporation.Allrightsreserved. 10

    systemisdifficulttocalculate,butIvedoneitforyou.

    Thekneevaluesforsomecommonservicechannel

    countsareshowninExhibit10.

    Servicechannelcount

    Kneeutilization

    1 50%

    2 57%4 66%

    8 74%

    16 81%

    32 86%64 89%

    128 92%

    Exhibit10.M/M/mkneevaluesforcommon

    valuesofm.

    Whyisthekneevaluesoimportant?Forsystemswith

    randomlytimedservicerequests,allowingsustained

    resourceloadsinexcessofthekneevalueresultsin

    responsetimesandthroughputsthatwillfluctuateseverelywithmicroscopicchangesinload.Hence:

    Onsystemswithrandomrequestarrivals,itisvitaltomanageloadsothatitwillnotexceed

    thekneevalue.

    15 RELEVANCEOFTHEKNEESo,howimportantcanthiskneeconceptbe,really?

    Afterall,asIvetoldyou,theM/M/mmodelassumes

    thisridiculouslyutopianideathatthesystemyoure

    thinkingaboutscalesperfectly.Iknowwhatyoure

    thinking:Itdoesnt.

    ButwhatM/M/mgivesusistheknowledgethateven

    ifyoursystemdidscaleperfectly,youwould stillbe

    strickenwithmassiveperformanceproblemsonceyouraverageloadexceededthekneevaluesIvegiven

    youinExhibit10.Yoursystemisntasgoodasthe

    theoreticalsystemsthatM/M/mmodels.Therefore,

    theutilizationvaluesatwhichyoursystemsknees

    occurwillbemoreconstrainingthanthevaluesIve

    givenyouinExhibit10.(Isaid valuesandkneesin

    pluralform,becauseyoucanmodelyourCPUswith

    onemodel,yourdiskswithanother,yourI/O

    controllerswithanother,andsoon.)

    Torecap:

    Eachoftheresourcesinyoursystemhasaknee. Thatkneeforeachofyourresourcesislessthan

    orequaltothekneevalueyoucanlookupin

    Exhibit10.Themoreimperfectlyyoursystem

    scales,thesmaller(worse)yourkneevaluewill

    be.

    Onasystemwithrandomrequestarrivals,ifyouallowyoursustainedutilizationforanyresource

    onyoursystemtoexceedyourkneevalueforthat

    resource,thenyoullhaveperformanceproblems.

    Therefore,itisvitalthatyoumanageyourloadsothatyourresourceutilizationswillnotexceedyourknee

    values.

    16 CAPACITYPLANNINGUnderstandingthekneecancollapsealotof

    complexityoutofyourcapacityplanningprocess.It

    workslikethis:

    1. Yourgoalcapacityforagivenresourceistheamountatwhichyoucancomfortablyrunyour

    tasksatpeaktimeswithoutdrivingutilizations

    beyondyourknees.

    2. Ifyoukeepyourutilizationslessthanyourknees,yoursystembehavesroughlylinearly:nobighyperbolicsurprises.

    3. However,ifyourelettingyoursystemrunanyofitsresourcesbeyondtheirkneeutilizations,then

    youhaveperformanceproblems(whetheryoure

    awareofitornot).

    4. Ifyouhaveperformanceproblems,thenyoudontneedtobespendingyourtimewithmathematical

    models;youneedtobespendingyourtimefixing

    thoseproblemsbyeitherreschedulingload,

    eliminatingload,orincreasingcapacity.

    Thatshowcapacityplanningfitsintotheperformance

    managementprocess.

    17 RANDOMARRIVALSYoumighthavenoticedthatseveraltimesnow,Ihave

    mentionedthetermrandomarrivals.Whyisthat

    important?

    Somesystemshavesomethingthatyouprobablydont

    haverightnow:completelydeterministicjob

    scheduling.Somesystemsitsrarethesedaysare

    configuredtoallowservicerequeststoenterthe

    systeminabsoluteroboticfashion,say,atapaceof

    onetaskpersecond.Andbyonetaskpersecond,I

    dontmeanatanaveragerateofonetaskpersecond(forexample,2tasksinonesecondand0tasksinthe

    next),Imeanonetaskpersecond,likearobotmight

    feedcarpartsintoabinonanassemblyline.

    Ifarrivalsintoyoursystembehavecompletely

    deterministicallymeaningthatyouknow exactly

    whenthenextservicerequestiscomingthenyou

    canrunresourceutilizationsbeyondtheirknee

    utilizationswithoutnecessarilycreatinga

  • 8/3/2019 Thinking Clearly (Paper)

    11/15

    2010MethodRCorporation.Allrightsreserved.11

    performanceproblem.Onasystemwithdeterministic

    arrivals,yourgoalistorunresourceutilizationsupto

    100%withoutcrammingsomuchworkloadintothe

    systemthatrequestsbegintoqueue.

    Thereasonthekneevalueissoimportantonasystem

    withrandomarrivalsisthatrandomarrivalstendto

    clusteruponyouandcausetemporaryspikesinutilization.Thosespikesneedenoughsparecapacity

    toconsumesothatusersdonthavetoendure

    noticeablequeueingdelays(whichcausenoticeable

    fluctuationsinresponsetimes)everytimeoneof

    thosespikesoccurs.

    Temporaryspikesinutilizationbeyondyourknee

    valueforagivenresourceareokaslongastheydont

    exceedjustafewsecondsinduration.Howmany

    secondsistoomany?Ibelieve(buthavenotyettried

    toprove)thatyoushouldatleastensurethatyour

    spikedurationsdonotexceed8seconds.12Theanswer

    iscertainlythat,ifyoureunabletomeetyourpercentile-basedresponsetimepromisesoryour

    throughputpromisestoyourusers,thenyourspikes

    aretoolong.

    18 COHERENCYDELAYYoursystemdoesnthavetheoreticallyperfect

    scalability.EvenifIveneverstudiedyoursystemspecifically,itsaprettygoodbetthatnomatterwhat

    computerapplicationsystemyourethinkingofright

    now,itdoesnotmeettheM/M/mtheoretically

    perfectscalabilityassumption.Coherencydelayisthe

    factorthatyoucanusetomodeltheimperfection. 13

    Coherencydelayisthedurationthatataskspends

    communicatingandcoordinatingaccesstoashared

    resource.Likeresponsetime,servicetime,and

    queueingdelay,coherencydelayismeasuredintime

    pertaskexecution,asinsecondsperclick.

    Iwontdescribehereamathematicalmodelfor

    predictingcoherencydelay.Butthegoodnewsisthat

    ifyouprofileyoursoftwaretaskexecutions,youllsee

    itwhenitoccurs.InOracle,timedeventslikethe

    followingareexamplesofcoherencydelay:

    enqueue

    bufferbusywaits

    12Youllrecognizethisnumberifyouveheardofthe8-

    secondrule,whichyoucanlearnaboutat

    http://en.wikipedia.org/wiki/Network_performance#8-

    second_rule.

    13NeilGunther,1993.UniversalLawofComputational

    Scalability,at

    http://en.wikipedia.org/wiki/Neil_J._Gunther#Universal_Law_

    of_Computational_Scalability.

    latchfree

    Youcantmodelcoherencydelayslikethesewith

    M/M/m.ThatsbecauseM/M/massumesthatallmof

    yourservicechannelsareparallel,homogeneous,and

    independent.Thatmeansthemodelassumesthatafter

    youwaitpolitelyinaFIFOqueueforlongenoughthat

    alltherequeststhatenqueuedaheadofyouhaveexitedthequeueforservice,itllbeyourturntobe

    serviced.However,coherencydelaysdontworklike

    that.

    Example:ImagineanHTMLdataentryforminwhich

    onebuttonlabeledUpdateexecutesaSQLupdate

    statement,andanotherbuttonlabeledSave

    executesaSQLcommitstatement.Anapplicationbuiltlikethiswouldalmostguaranteeabysmal

    performance.Thatsbecausethedesignmakesit

    possiblequitelikely,actuallyforausertoclick

    Update,lookathiscalendar,realizeuh-ohheslate

    forlunch,andthengotolunchfortwohoursbefore

    clickingSavelaterthatafternoon.Theimpacttoothertasksonthissystemthatwantedtoupdatethesamerowwouldbedevastating.Each

    taskwouldnecessarilywaitforalockontherow(or,

    onsomesystems,worse:alockontherowspage)

    untilthelockinguserdecidedtogoaheadandclick

    Save.Oruntiladatabaseadministratorkilledthe

    userssession,whichofcoursewouldhaveunsavorysideeffectstothepersonwhohadthoughthehad

    updatedarow.

    Inthiscase,theamountoftimeataskwouldwaiton

    thelocktobereleasedhasnothingtodowithhow

    busythesystemis.Itwouldbedependentupon

    randomfactorsthatexistoutsideofthesystemsvariousresourceutilizations.Thatswhyyoucant

    modelthiskindofthinginM/M/m,anditswhyyou

    canneverassumethataperformancetestexecutedin

    aunittestingtypeofenvironmentissufficientfora

    makingago/no-godecisionaboutinsertionofnew

    codeintoaproductionsystem.

    19 PERFORMANCETESTINGAllthistalkofqueueingdelaysandcoherencydelays

    leadstoaverydifficultquestion.Howcanyoupossibly

    testanewapplicationenoughtobeconfidentthat

    yourenotgoingtowreckyourproduction

    implementationwithperformanceproblems?

    Youcanmodel.Andyoucantest. 14However,nothing

    youdowillbeperfect.Itisextremelydifficulttocreate

    modelsandtestsinwhichyoullforeseeallyour

    14TheComputerMeasurementGroupisanetworkof

    professionalswhostudytheseproblemsvery,veryseriously.

    YoucanlearnaboutCMGathttp://www.cmg.org.

  • 8/3/2019 Thinking Clearly (Paper)

    12/15

    2010MethodRCorporation.Allrightsreserved. 12

    productionproblemsinadvanceofactually

    encounteringthoseproblemsinproduction.

    Somepeopleallowtheapparentfutilityofthis

    observationtojustifynottestingatall.Dontget

    trappedinthatmentality.Thefollowingpointsare

    certain:

    Youllcatchalotmoreproblemsifyoutrytocatchthempriortoproductionthanifyoudonteven

    try.

    Youllnevercatchallyourproblemsinpre-productiontesting.Thatswhyyouneedareliable

    andefficientmethodforsolvingtheproblemsthat

    leakthroughyourpre-productiontesting

    processes.

    Somewhereinthemiddlebetweennotestingand

    completeproductionemulationistherightamount

    oftesting.Therightamountoftestingforaircraft

    manufacturersisprobablymorethantherightamount

    oftestingforcompaniesthatsellbaseballcaps.But

    dontskipperformancetestingaltogether.Atthevery

    least,yourperformancetestplanwillmakeyoua

    morecompetentdiagnostician(andclearerthinker)

    whenitcomestimetofixtheperformanceproblems

    thatwillinevitablyoccurduringproductionoperation.

    20 MEASURINGPeoplefeelthroughputandresponsetime.

    Throughputisusuallyeasytomeasure.Measuring

    responsetimeisusuallymuchmoredifficult.

    (Remember,throughputandresponsetimearenot

    reciprocals.)Itmaynotbedifficulttotimeanend-useractionwithastopwatch,butitmightbeverydifficult

    togetwhatyoureallyneed,whichistheabilitytodrill

    downintothedetailsofwhyagivenresponsetimeis

    aslargeasitis.

    Unfortunately,peopletendtomeasurewhatseasyto

    measure,whichisnotnecessarilywhattheyshouldbe

    measuring.Itsabug.Whenitsnoteasytomeasure

    whatweneedtomeasure,wetendtoturnour

    attentiontomeasurementsthat areeasytoget.

    Measuresthatarentwhatyouneed,butthatareeasy

    enoughtoobtainandseemrelatedtowhatyouneed

    arecalledsurrogatemeasures.Examplesofsurrogatemeasuresincludesubroutinecallcountsandsamples

    ofsubroutinecallexecutiondurations.

    ImashamedthatIdonthavegreatercommandover

    mynativelanguagethantosayitthisway,buthereis

    acatchy,modernwaytoexpresswhatIthinkabout

    surrogatemeasures:

    Surrogatemeasuressuck.

    Here,unfortunately,suckdoesntmeannever

    work.Itwouldactuallybe betterifsurrogate

    measuresneverworked.Thennobodywoulduse

    them.Theproblemisthatsurrogatemeasureswork

    sometimes.Thisinspirespeoplesconfidencethatthe

    measurestheyreusingshouldworkallthetime,and

    thentheydont.Surrogatemeasureshavetwobigproblems.Theycantellyouyoursystemsokwhenits

    not.Thatswhatstatisticianscall typeIerror,thefalse

    positive.Andtheycantellyouthatsomethingisa

    problemwhenitsnot.Thatswhatstatisticianscall

    typeIIerror,thefalsenegative.Iveseeneachtypeof

    errorwasteyearsofpeoplestime.

    Whenitcomestimetoassessthespecificsofareal

    system,yoursuccessisatthemercyofhowgoodthe

    measurementsarethatyoursystemallowsyouto

    obtain.IvebeenfortunatetoworkintheOracle

    marketsegment,wherethesoftwarevendoratthe

    centerofouruniverseparticipatesactivelyinmaking

    itpossibletomeasuresystemstherightway.Gettingapplicationsoftwaredeveloperstousethetoolsthat

    Oracleoffersisanotherstory,butatleastthe

    capabilitiesarethereintheproduct.

    21 PERFORMANCEISAFEATUREPerformanceisasoftwareapplicationfeature,justlike

    recognizingthatitsconvenientforastringoftheform

    Case1234toautomaticallyhyperlinkoverto

    case1234inyourbugtrackingsystem.15Performance,

    likeanyotherfeature,doesntjusthappen;ithasto

    bedesignedandbuilt.Todoperformancewell,you

    havetothinkaboutit,studyit,writeextracodeforit,testit,andsupportit.

    However,likemanyotherfeatures,youcantknow

    exactlyhowperformanceisgoingtoworkoutwhile

    itsstillearlyintheprojectwhenyourewriting

    studying,designing,andcreatingtheapplication.For

    manyapplications(arguably,forthevastmajority),

    performanceiscompletelyunknownuntilthe

    productionphaseofthesoftwaredevelopmentlife

    cycle.Whatthisleavesyouwithisthis:

    Sinceyoucantknowhowyourapplicationis

    goingtoperforminproduction,youneedto

    writeyourapplicationsothatitseasytofixperformanceinproduction.

    AsDavidGarvinhastaughtus,itsmucheasierto

    managesomethingthatseasytomeasure.16Writing

    15FogBugz,whichissoftwarethatIenjoyusing,doesthis.

    16DavidGarvin,1993.BuildingaLearningOrganizationin

    HarvardBusinessReview,Jul.1993.

  • 8/3/2019 Thinking Clearly (Paper)

    13/15

    2010MethodRCorporation.Allrightsreserved.13

    anapplicationthatseasytofixinproductionbegins

    withanapplicationthatseasytomeasurein

    production.

    Mosttimes,whenImentiontheconceptofproduction

    performancemeasurement,peopledriftintoastateof

    worryaboutthemeasurementintrusioneffectof

    performanceinstrumentation.Theyimmediatelyenteramodeofdatacollectioncompromise,leavingonly

    surrogatemeasuresonthetable.Wontsoftwarewith

    extracodepathtomeasuretimingsbeslowerthanthe

    samesoftwarewithoutthatextracodepath?

    IlikeananswerthatIheardTomKytegiveoncein

    responsetothisquestion. 17Heestimatedthatthe

    measurementintrusioneffectofOraclesextensive

    performanceinstrumentationisnegative10%or

    less.18Hewentontoexplaintoanow-vexed

    questionerthattheproductisatleast10%fasternow

    becauseoftheknowledgethatOracleCorporationhas

    gainedfromitsperformanceinstrumentationcode,morethanmakingupforanyoverheadthe

    instrumentationmighthavecaused.

    Ithinkthatvendorstendtospendtoomuchtime

    worryingabouthowtomaketheirmeasurementcode

    pathefficientwithoutfiguringouthowfirsttomakeit

    effective.ItlandssquarelyupontheideathatKnuth

    wroteaboutin1974whenhesaidthatpremature

    optimizationistherootofallevil. 19Thesoftware

    designerwhointegratesperformancemeasurement

    intohisproductismuchmorelikelytocreateafast

    applicationandmoreimportantlyanapplication

    thatwillbecomefasterovertime.

    22 ACKNOWLEDGMENTSThankyouBaronSchwartzfortheemailconversation

    inwhichyouthoughtIwashelpingyou,butinactual

    fact,youwerehelpingmecometogripswiththeneed

    forintroducingcoherencydelaymoreprominently

    intomythinking.ThankyouJeffHolt,RonCrisco,Ken

    Ferlita,andHaroldPalacioforthedailyworkthat

    keepsthecompanygoingandforthelunchtime

    conversationsthatkeepmyimaginationgoing.Thank

    youTomKyteforyourcontinuedinspirationand

    support.ThankyouMarkFarnhamforyourhelpful

    suggestions.AndthankyouNeilGuntherforyour

    17TomKyte,2009.Acoupleoflinksandanadvertathttp://tkyte.blogspot.com/2009/02/couple-of-links-and-

    advert.html.

    18Whereorlessmeansorbetter,asin20%,30%,etc.

    19DonaldKnuth,1974.StructuredProgrammingwithGoTo

    StatementsinACMJournalComputingSurveys,Vol.6,No.4,

    Dec.1974,p268.

    patienceandgenerosityinourongoingdiscussions

    aboutknees.

    23 ABOUTTHEAUTHORCaryMillsapiswellknownintheglobalOracle

    communityasaspeaker,educator,consultant,and

    writer.HeisthefounderandpresidentofMethodR

    Corporation(http://method-r.com),asmallcompany

    devotedtogenuinelysatisfyingsoftwareperformance.

    MethodRoffersconsultingservices,education

    courses,andsoftwaretoolsincludingtheMethodR

    Profiler,MRTools,theMethodRSLAManager,andthe

    MethodRInstrumentationLibraryforOraclethat

    helpyouoptimizeyoursoftwareperformance.

    Caryistheauthor(withJeffHolt)of OptimizingOracle

    Performance(OReilly),forwhichheandJeffwere

    namedOracleMagazines2004AuthorsoftheYear.He

    isaco-authorofOracleInsights:TalesoftheOakTable

    (Apress).HeistheformerVicePresidentofOracleCorporationsSystemPerformanceGroup,andaco-

    founderofhisformercompanycalledHotsos.Caryis

    alsoanOracleACEDirectorandafoundingpartnerof

    theOakTableNetwork,aninformalassociationof

    Oraclescientiststhatarewellknownthroughoutthe

    Oraclecommunity.Caryblogsat

    http://carymillsap.blogspot.com,andhetweetsat

    http://twitter.com/CaryMillsap.

    24 EPILOG:OPENDEBATEABOUTKNEESInsections14through16,Iwroteaboutkneesin

    performancecurves,theirrelevance,andtheirapplication.However,thereisopendebategoingback

    atleast20yearsaboutwhetheritsevenworthwhile

    totrytodefinetheconceptof knee,likeIvedonein

    thispaper.

    Thereissignificanthistoricalbasistotheideathat

    suchathingthatIvedescribedasakneeinfactisnt

    reallymeaningful.In1988,StephenSamsonargued

    that,atleastforM/M/1queueingsystems,thereisno

    kneeintheperformancecurve.Hewrote,The

    choiceofaguidelinenumberisnoteasy,buttherule-

    of-thumbmakersgorighton.Inmostcasesthereis

    notaknee,nomatterhowmuchwewishtofind

    one.20

    Thewholeproblemremindsme,asIwrotein1999,21

    oftheparableofthefrogandtheboilingwater.The

    20StephenSamson,1988.MVSperformancelegendsinCMG1988ConferenceProceedings.ComputerMeasurement

    Group,148159.

    21CaryMillsap,1999.Performancemanagement:mythsand

    factsavailableathttp://method-r.com.

  • 8/3/2019 Thinking Clearly (Paper)

    14/15

    2010MethodRCorporation.Allrightsreserved. 14

    storysaysthatifyoudropafrogintoapanofboiling

    water,hewillescape.But,ifyouputafrogintoapan

    ofcoolwaterandslowlyheatit,thenthefrogwillsit

    patientlyinplaceuntilheisboiledtodeath.

    Withutilization,justaswithboilingwater,thereis

    clearlyadeathzone,arangeofvaluesinwhichyou

    cantaffordtorunasystemwithrandomarrivals.Butwhereistheborderofthedeathzone?Ifyouare

    tryingtoimplementaproceduralapproachto

    managingutilization,youneedtoknow.

    Recently,myfriendNeilGunther 22hasdebatedwith

    meprivatelythat,first,thetermkneeiscompletely

    thewrongwordtousehere,becausekneeisthe

    wrongtermtouseintheabsenceofafunctional

    discontinuity.Second,heassertsthattheboundary

    valueof.5foranM/M/1systemiswastefullylow,that

    yououghttobeabletorunsuchasystemsuccessfully

    atamuchhigherutilizationvaluethan.5.And,finally,

    hearguesthatanysuchspecialutilizationvalueshouldbedefinedexpresslyastheutilizationvalue

    beyondwhichyouraverageresponsetimeexceeds

    yourtoleranceforaverageresponsetime(Exhibit11).

    Thus,Guntherarguesthatanyusefulnot-to-exceed

    utilizationvalueisderivableonlyfrominquiriesabout

    humanpreferences,notfrommathematics.

    Exhibit11.Gunthersmaximumallowable

    utilizationvalueTisdefinedastheutilization

    producingtheaverageresponsetimeT.

    TheproblemIseewiththisargumentisillustratedinExhibit12.Imaginethatyourtoleranceforaverage

    responsetimeisT,whichcreatesamaximum

    toleratedutilizationvalueof T.Noticethatevenatiny

    22Seehttp://en.wikipedia.org/wiki/Neil_J._Guntherfor

    moreinformationaboutNeil.See

    http://www.cmg.org/measureit/issues/mit62/m_62_15

    .htmlfor more information about his argument.

    fluctuationinaverageutilizationnear Twillresultin

    ahugefluctuationinaverageresponsetime.

    Exhibit12.NearTvalue,smallfluctuationsin

    averageutilizationresultinlargeresponsetime

    fluctuations.

    Ibelieve,asIwroteinsection4,thatyourcustomers

    feelthevariance,notthemean.Perhapstheysaythey

    willacceptaverageresponsetimesuptoT,butIdont

    believethathumanswillbetolerantofperformance

    onasystemwhena1%changeinaverageutilization

    overa1-minuteperiodresultsin,say,aten-times

    increaseinaverageresponsetimeoverthatperiod.

    Idounderstandtheperspectivethatthekneevalues

    Ivelistedinsection14arebelowtheutilizationvalues

    thatmanypeoplefeelsafeinexceeding,especiallyfor

    lowerordersystemslikeM/M/1.However,Ibelieve

    thatitisimportanttoavoidrunningresourcesataverageutilizationvalueswheresmallfluctuationsin

    utilizationyieldtoo-largefluctuationsinresponse

    time.

    Havingsaidthat,Idontyethaveagooddefinitionfor

    whatatoo-largefluctuationis.Perhaps,like

    responsetimetolerances,differentpeoplehave

    differenttolerancesforfluctuation.Butperhapsthere

    isafluctuationtolerancefactorthatapplieswith

    reasonableuniversalityacrossallhumanusers.The

    Apdexstandard,forexample,assumesthatthe

    responsetimeFatwhichusersbecomefrustratedis

    universallyfourtimestheresponsetimeTatwhichtheirattitudeshiftsfrombeingsatisfiedtomerely

    tolerating.23

    Theknee,regardlessofhowyoudefineitorwhatwe

    endupcallingit,isanimportantparametertothe

    capacityplanningprocedurethatIdescribedin

    23Seehttp://www.apdex.orgformoreinformationabout

    Apdex.

    0 0.5 T 0.9000

    5

    10

    15

    20

    Utilization

    Responsetim

    eR

    MM1 system, T 10.

    0 0.744997 T 0.9870

    5

    10

    15

    20

    Utilization

    Responsetime

    R

    MM8 system, T 10.

  • 8/3/2019 Thinking Clearly (Paper)

    15/15

    2010MethodRCorporation.Allrightsreserved.15

    section16,andIbelieveitisanimportantparameter

    tothedailyprocessofcomputersystemworkload

    management.

    Iwillkeepstudying.