TRANSCRIPT
MapReduce/Bigtable for Distributed Optimization
Keith Hall, Scott Gilpin, Gideon Mann; presented by: Slav Petrov
Zurich, Thornton, New York
Google Inc.
Outline
• Large-scale Gradient Optimization
  – Distributed Gradient
  – Asynchronous Updates
  – Iterative Parameter Mixtures
• MapReduce and Bigtable
• Experimental Results
Gradient Optimization Setting

Goal: find
$\theta^* = \arg\min_\theta f(\theta)$

If $f$ is differentiable, then solve via gradient updates:
$\theta_{i+1} = \theta_i + \alpha \nabla f(\theta_i)$

Consider the case where $f$ is composed of a sum of differentiable functions $f_q$; then the gradient update can be written as:
$\theta_{i+1} = \theta_i + \alpha \sum_q \nabla f_q(\theta_i)$

This case is the focus of the talk.
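A minimal sketch of these two update rules, with a toy quadratic objective and a hand-picked step size (neither of which comes from the talk):

```python
import numpy as np

def gradient_step(theta, grad_f, alpha):
    # theta_{i+1} = theta_i + alpha * grad f(theta_i); this follows the slides'
    # sign convention, so alpha should be negative when minimizing f.
    return theta + alpha * grad_f(theta)

def summed_gradient_step(theta, grad_fqs, alpha):
    # Same update when f decomposes into a sum of per-instance functions f_q.
    return theta + alpha * sum(g(theta) for g in grad_fqs)

# Toy example: f(theta) = sum_q ||theta - c_q||^2, minimized at the mean of the c_q.
centers = [np.array([1.0, 0.0]), np.array([0.0, 3.0])]
grad_fqs = [lambda t, c=c: 2 * (t - c) for c in centers]
theta = np.zeros(2)
for _ in range(100):
    theta = summed_gradient_step(theta, grad_fqs, alpha=-0.1)  # descend
print(theta)  # approaches the mean of the centers, [0.5, 1.5]
```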
Maximum Entropy Models

$X$: input space; $Y$: output space
$S = ((x_1, y_1), \ldots, (x_m, y_m))$: training data
$\Phi : X \times Y \to \mathbb{R}^n$: feature mapping
$p_\theta[y \mid x] = \frac{1}{Z} \exp(\theta \cdot \Phi(x, y))$: probability model
The objective function is a summation of functions, each of which is computed on one data instance:
$\theta^* = \arg\min_\theta -\frac{1}{m} \sum_i \log p_\theta(y_i \mid x_i)$
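For a single training instance, the gradient of $\log p_\theta(y_i \mid x_i)$ is the observed feature vector minus the model's expected feature vector; a minimal sketch assuming a small, explicitly enumerable label set (the function names are illustrative, not from the talk):

```python
import numpy as np

def instance_log_likelihood_grad(theta, phi, x, y, labels):
    """Gradient of log p_theta(y|x) for one instance: observed features minus
    expected features under the model.  Negate it for the minimization form
    of the objective above."""
    scores = np.array([theta @ phi(x, y_prime) for y_prime in labels])
    probs = np.exp(scores - scores.max())        # numerically stable softmax
    probs /= probs.sum()
    expected = sum(p * phi(x, y_prime) for p, y_prime in zip(probs, labels))
    return phi(x, y) - expected
```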
Distributed Gradient

$\theta_{i+1} = \theta_i + \alpha \sum_q \nabla f_q(\theta_i)$

Observe that the gradient update can be broken down into three phases:
– Computing each $\nabla f_q(\theta_i)$: can be done in parallel.
– Summing over $q$: cheap to compute.
– Update step: depends on the number of features, not the data size.

At each iteration, the complete $\theta$ must be sent to each parallel worker.
[Chu et al. '07]
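A minimal sketch of the three phases, using Python's multiprocessing pool as a stand-in for the map and reduce workers (the real system runs on MapReduce; all names here are illustrative assumptions):

```python
from multiprocessing import Pool

import numpy as np

def shard_gradient(args):
    """Map phase: one worker sums the per-instance gradients over its shard.
    (Must be a module-level, picklable function, as must grad_fn.)"""
    theta, grad_fn, shard = args
    return sum(grad_fn(theta, example) for example in shard)

def distributed_gradient(grad_fn, shards, theta0, alpha, iterations):
    theta = theta0.copy()
    with Pool(processes=len(shards)) as pool:
        for _ in range(iterations):
            # Phase 1 (map): per-shard gradients in parallel; note that the
            # complete theta is shipped to every worker at every iteration.
            partials = pool.map(shard_gradient,
                                [(theta, grad_fn, shard) for shard in shards])
            # Phase 2 (reduce): the sum is cheap to compute.
            total = sum(partials)
            # Phase 3: the update step, whose cost depends on the number of
            # features, not on the data size.
            theta = theta + alpha * total
    return theta
```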
Stochastic Gradient

$\theta_{i+1} = \theta_i + \alpha \sum_q \nabla f_q(\theta_i)$

Alternatively, approximate the sum by a subset of the functions. If the subset has size 1, then the algorithm is an online one:
$\theta_{i+1} = \theta_i + \alpha \nabla f_q(\theta_i)$

Stochastic gradient approaches provably converge, and in practice are often much quicker than exact gradient calculation methods.
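A corresponding single-machine sketch of the online (subset of size 1) update, with a fixed step size and random visiting order as illustrative assumptions (the talk does not specify these details):

```python
import numpy as np

def stochastic_gradient(grad_fn, data, theta0, alpha, epochs=1, seed=0):
    """Online update: each step uses the gradient of a single f_q,
    chosen by visiting the data in a random order."""
    rng = np.random.default_rng(seed)
    theta = theta0.copy()
    for _ in range(epochs):
        for q in rng.permutation(len(data)):
            theta = theta + alpha * grad_fn(theta, data[q])
    return theta
```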
Asynchronous Updates

Asynchronous updates are a distributed extension of stochastic gradient. Each worker:
– gets the current $\theta_i$,
– computes $\nabla f_q(\theta_i)$,
– updates the global parameter vector.

Since the workers will not be computing in lock-step, some gradients will be based on old parameters. Nonetheless, this also converges.
[Langford et al. '09]
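A toy in-process version of this worker loop, using threads and a shared dictionary in place of the distributed parameter store (the names and the locking scheme are illustrative assumptions; the talk's actual shared store, Bigtable, is discussed later):

```python
import threading

import numpy as np

def async_worker(shared, lock, grad_fn, shard, alpha, steps):
    """Worker loop: read the current parameters, compute a gradient on local
    data, and push an update to the global parameter vector.  Because workers
    are not in lock-step, the snapshot read here may already be stale."""
    for i in range(steps):
        theta = shared["theta"].copy()              # possibly stale parameters
        grad = grad_fn(theta, shard[i % len(shard)])
        with lock:                                  # stand-in for the shared store
            shared["theta"] = shared["theta"] + alpha * grad

def asynchronous_updates(grad_fn, shards, theta0, alpha, steps):
    shared, lock = {"theta": theta0.copy()}, threading.Lock()
    workers = [threading.Thread(target=async_worker,
                                args=(shared, lock, grad_fn, shard, alpha, steps))
               for shard in shards]
    for w in workers:
        w.start()
    for w in workers:
        w.join()
    return shared["theta"]
```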
Iterative Parameter Mixtures

Separately estimate $\theta$ on different samples, and then combine. Iterative: take the resulting $\theta$ as the starting point for the next round and rerun.

For certain classes of objective functions (e.g. maxent and perceptron), convergence can also be shown.
[Mann et al. '09, McDonald et al. '10]

Little communication between workers.
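A minimal sketch of one way to realize this loop (uniform mixing weights and the train_fn signature are assumptions for illustration; the papers cited above also treat weighted mixtures and the convergence conditions):

```python
import numpy as np

def iterative_parameter_mixture(train_fn, shards, theta0, rounds):
    """Each round: train a separate model per shard, starting from the current
    mixture, then combine the resulting parameter vectors (uniform average)."""
    theta = theta0.copy()
    for _ in range(rounds):
        # The per-shard training runs are independent, so little communication
        # between workers is needed; only the mixed theta is exchanged.
        per_shard = [train_fn(theta, shard) for shard in shards]
        theta = sum(per_shard) / len(per_shard)
    return theta
```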
Typical Datacenter (at Google)

Commodity machines, with relatively few cores (e.g. <= 4).
Racks of machines connected by commodity ethernet switches.
No GPUs. No shared memory. No NAS. No massive multicore machines.
[Hölzle, Barroso '09]
Implications for algorithms

• Communication between workers is costly
  – No shared memory means it is very hard to maintain global state
  – Even ethernet communication can be a significant source of latency
• Each worker is relatively low-powered
  – No machines have massive amounts of RAM or disk
  – Simple isolated jobs performed in parallel will have significant gains
Outline
• Distributed Gradient Optimization
• MapReduce and Bigtable
  – Optimal Configurations
• Experimental Results
MapReduce

1. Backup map workers compensate for the failure of a map task.
2. The sorting phase requires all map jobs to be completed before the reduce phase starts.
MapReduce Implementations

Distributed Gradient and Iterative Parameter Mixtures are straightforward to implement in MapReduce:
Map: Compute Gradient
Reduce: Sum and Update

Distributed Gradient: each iteration has multiple mappers and one reducer.
Iterative Parameter Mixtures: one mapper/reducer pair; each runs to convergence, and then there is an outer mixture and loop.
Asynchronous updates need shared memory.
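The structural difference between the two MapReduce implementations can be summarized as two driver loops; this is an illustrative in-process sketch (plain list comprehensions stand in for the map phase, and the function arguments are assumptions), tying together the earlier per-method sketches:

```python
def distributed_gradient_driver(theta, shards, shard_gradient, alpha, iterations):
    """One MapReduce pass per gradient step: many mappers, a single reducer."""
    for _ in range(iterations):
        partials = [shard_gradient(theta, s) for s in shards]   # map phase
        theta = theta + alpha * sum(partials)                   # reduce: sum + update
    return theta

def iterative_parameter_mixture_driver(theta, shards, optimize_to_convergence, rounds):
    """One map/reduce pair per round: each mapper optimizes to convergence on
    its shard; the reducer mixes (averages) the resulting parameter vectors."""
    for _ in range(rounds):
        models = [optimize_to_convergence(theta, s) for s in shards]  # map phase
        theta = sum(models) / len(models)                             # reduce: mix
    return theta
```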
Bigtable

Distributed storage, built on GFS. Not a database: cells are indexed by row/column/time (e.g. one can recover the n most recent values).
Multiple layers of caching: machine and network level.
Transaction processing is optional.
Asynchronous Updates

Map: Update Parameters, Compute Gradient
Reduce: Sum and Update
Bigtable holds the shared parameters.

Sorting turned off. Transaction processing turned off: overwrites are possible.
Gradients collected into mini-batches.
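A toy, in-memory sketch of this configuration (ToyParameterTable is only a stand-in for the role Bigtable plays here, not its API; the batch size and names are assumptions):

```python
import numpy as np

class ToyParameterTable:
    """Stand-in for the shared Bigtable parameter store: plain reads and
    writes with no transactions, so concurrent writers can overwrite each
    other, which is exactly the behaviour the slide allows."""
    def __init__(self, theta):
        self._theta = theta
    def read(self):
        return self._theta.copy()
    def write(self, theta):
        self._theta = theta          # last writer wins

def async_minibatch_worker(table, grad_fn, shard, alpha, batch_size=32):
    """Collect gradients into a mini-batch before touching the table,
    trading extra staleness for far fewer reads and writes."""
    batch = []
    for example in shard:
        theta = table.read()                         # possibly stale parameters
        batch.append(grad_fn(theta, example))
        if len(batch) == batch_size:
            table.write(table.read() + alpha * sum(batch))
            batch = []
    if batch:                                        # flush the final partial batch
        table.write(table.read() + alpha * sum(batch))
```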
Outline
• Distributed Gradient Optimization
• MapReduce and Bigtable
• Experimental Results
Experimental Set-up

Click-through data
– A. 370 million training instances
  • Partitioned into 200 pieces
  • 240 worker machines
– B. 1.6 billion instances
  • Partitioned into 900 pieces
  • 600 worker machines
– Evaluation: 185 million instances (temporally occurring after the training instances)
Stopping Criterion Is Tricky

Complicated because the implementations are all slightly different:
• Gradient change:
  – there is no global gradient in asynchronous updates or iterative parameter mixtures
• Log-likelihood / objective function value:
  – the asynchronous log-likelihood fluctuates frequently
• Fixed number of iterations:
  – means different things for each implementation

Currently working on cleaner performance-result estimation; take the AUC results with a grain of salt.
370 Million Training Instances

                                          AUC     CPU (rel.)  Wallclock (rel.)  Network Usage (rel.)
Single-core SGD                           .8557   0.10        9.64              0.03
Distributed Gradient                      .8323   1.00        1.00              1.00
Asynchronous SGD                          .7283   2.17        7.50              82.47
Iterative Parameter Mixture: MaxEnt       .8557   0.12        0.13              0.10
Iterative Parameter Mixture: Perceptron   .8870   0.16        0.20              0.22

Observations: Even with tuning, Asynchronous Updates are impractical. Iterative Parameter Mixtures perform well.
IPM with 240 machines is ~70x as fast as a single core; with 10 machines it is 8x as fast as a single core.
1.6 Billion Training Instances

                                          AUC     CPU (rel.)  Wallclock (rel.)  Network Usage (rel.)
Distributed Gradient                      .8117   1.00        1.00              1.00
Asynchronous SGD                          .8525   3.71        14.33             133.93
Iterative Parameter Mixture: MaxEnt       .8904   0.17        0.15              0.15
Iterative Parameter Mixture: Perceptron   .8961   0.24        0.20              0.38

Observations: Network usage is much worse with more machines, and the asynchronous approach is much slower.
Iterative parameter mixtures are still effective with 900 partitions.
Conclusions

Asynchronous Updates are not suited to current datacenter configurations.
Iterative parameter mixtures give significant time improvements over distributed gradient.