
GBDT, LR & Deep Learning for Turn-based Strategy Game AI

Like Zhang, Hui Pan, Qi Fan, Changqing Ai, Yanqing Jing

Turing Lab, Tencent Inc. {likezhang, qfan, changqingai, frogjing}@tencent.com, hpan2010@my.fit.edu

Abstract: This paper proposes an approach for generating AI fighting strategies, implemented in the turn-based fighting game StoneAge 2 (SA2). Our aim is to develop an AI that chooses logical skills and targets for the player. The approach trains a logistic regression (LR) model and a deep neural network (DNN) model individually and combines their outputs at inference time. Meanwhile, to transform the features into a higher-dimensional binary vector without any manual intervention or prior knowledge, we feed all categorical features through a Gradient Boosted Decision Tree (GBDT) before the LR component. The main advantage of this procedure is that the approach combines the benefits of LR models (memorization of feature interactions) and deep learning (generalization to unseen feature combinations through low-dimensional dense features) for the AI fighting system. In our experiments, we evaluated our model, alongside several other AI strategies (Reinforcement Learning (RL), GBDT, LR, DNN), against a scripted robot. The results show that the AI players participating in the experiment are capable of using reasonable strategic skills on different targets; consequently, the win rate of our system against the robot script is higher than that of the others. Finally, we productionized and evaluated the system on SA2, a commercial mobile turn-based game.

Keywords: Machine Learning, Logistic Regression, Gradient Boosted Decision Tree, Deep Neural Network, Reinforcement Learning, turn-based games

1. Introduction

Since AlphaGo achieved its historic successes in the last couple of years, reinforcement learning and deep neural networks have opened a new era for games, and more and more AI technology has been applied to the game field. As early as 2013, DeepMind demonstrated that a convolutional neural network can learn successful control policies from raw video data in complex RL environments [18][19][20][21][22][23]. Recently, StarCraft has become a new challenge for deep reinforcement learning


research [2]. Compared to the early Atari games, which were built upon limited action choices [24], the action space in current commercial computer games is vast and diverse: the player selects actions among a combinatorial space of approximately hundreds of possibilities, leading to a rich set of challenges, and those challenges have not been overcome yet. These previous research efforts suggest that developing an approach for a specific goal in a game is more feasible than trying to build a general-purpose AI model.

In this paper, we design a system that, when a player temporarily leaves or goes offline, can choose a skill and apply it to a target reasonably. Meanwhile, we productionized and evaluated the system on the mobile turn-based game StoneAge 2 [1]. StoneAge 2 is a turn-based strategy game in which players build their own group of monsters, decide building strategies by distributing "building points", and arrange combinations of monsters to form a battle line. The fighting part is one of the most important parts for entertaining players (Fig. 1). In our game, each combat consists of twenty thirty-second rounds, and in each round the player must choose a skill from the skill list and apply it to a target (the target can be on the player's own side or the opponent's side) depending on the situation.

Figure 1. Screenshot of the SA2 combat board: the right side is the skill list, the top left is the opponent's side, and the bottom left is the player's side.

There are two different scenarios for battles: PvE (Boss Battle) and PvP (Player vs Player). In the PvE scenario, game developers usually have to create different behavior scripts for the NPCs (the Boss) and make sure the script-defined behaviors pose some challenge for players. In the PvP scenario, in most cases there is no automated behavior for either side, since the players themselves are expected to make all the moves.

These scripting approaches work well for traditional computer games, but exhibit drawbacks for mobile games like StoneAge 2. The major problems include:

- Mobile games have a much shorter development cycle than traditional computer games. Designers expect a simple solution for creating an NPC AI, not complicated action scripts.

- For mobile game players, there are many reasons to go offline. For example, a player in a battle could simply be waiting in line for the bus; when the bus arrives, he may just turn off the game and forget about the battle. The unpredictable behavior of mobile game players brings an extremely bad gameplay experience in PvP scenarios, and designers are eager to find a solution by replacing "offline" players with suitable AI models.

To solve the above problems, we explored different machine-learning-based algorithms and successfully deployed a mixed approach in StoneAge 2. At the beginning of our experiments, we also applied an RL environment to our game, but its performance was lower than our expectation. The reason is that we defined the percentage of the sum of the opponents' HPs as the reward function, so the RL AI tended to use attack-type skills rather than others. The shortcoming is that assistant roles or healers tried to use the normal attack skill instead of assistant skills such as the speed-up skill, control skill, etc. Therefore, we propose a mixed model which combines a decision tree, logistic regression, and deep learning, and obtains better results. The comparison results are shown in Section 5.

Our system can be viewed as a recommendation system, where the input query is the current battle status, such as each player's current HP, MP, buffs, role speed, etc., and the output is a list of skill-target pairs ranked by score. The scores are the probabilities of winning if the player applies that skill to that target. Given a status, the system finds all legal pairs of skills and targets and then ranks these pairs. Therefore, the traditional challenge in recommender systems is also present in ours: achieving both memorization and generalization. In our case, generalization can be defined as exploring, via transitivity, skills that have never or rarely been used in the past. On the other side, memorization is based on learning the frequent co-occurrence of skills and targets in the historical data. In previous work by H. Cheng et al. [3], a combination of a wide linear model and a DNN was proposed that focuses on achieving this goal. The model's prediction is:

$$P(Y = 1 \mid x) = \sigma\left(W_{wide}^{T}\,[x, \phi(x)] + W_{deep}^{T}\,a^{(l_f)} + b\right) \qquad (1)$$

where Y is the binary class label, σ(·) is the sigmoid function, W_wide and W_deep are the vectors of all wide and deep model weights, respectively, φ(x) are the cross-product transformations of the original features x, and b is the bias term. This model requires extra feature engineering effort or prior knowledge to define the cross-product features.

In this paper, we present the Tree, LR & DNN framework to achieve both memorization and generalization in one model. In the tree component, we use GBDT to transform the categorical features into a higher-dimensional, sparse space. Meanwhile, inspired by continuous bag-of-words language models [4], we learn high-dimensional embeddings for each player ID and skill ID in a fixed vocabulary and feed these embeddings into a feedforward neural network. The model architecture is illustrated in Figure 2.

The paper is organized as follows: a brief system overview is presented in Section 2. The GBDT, LR, and DNN components are explained in Section 3. In Section 4, we describe the stages of the skill-target recommendation pipeline: data generation, model training, and model serving. In Section 5, we show the experiment results under different configurations and evaluate different turn-based combat strategies. Finally, Section 6 presents our conclusions and lessons learned.

Figure 2. Tree, LR & DNN model architecture.

2. System Overview

The overall structure of our skill-target recommendation system is illustrated in Figure 3. A query, which includes each player's information and contextual features, is generated at the beginning of each round. There are 20 players in total in the battle, and each player has 6-8 active skills and 2 passive skills. In our experiment, we only considered the active skills.

Figure 3. Skill-target recommendation system architecture, demonstrating the "funnel" where candidate skills are retrieved and ranked before presenting only a few to the user.

Once the candidate generation system receives the query, it retrieves a small subset of skill-target pairs from the total 120-160 candidate pairs based on rules such as: the MP cost of a candidate skill must be less than or equal to the predicted player's current MP, and some skills cannot be used on opponents (or vice versa); a sketch of this filtering step is given below. After the candidate pairs have been reduced, the ranking system ranks all remaining skill-target pairs by their scores. The scores are usually P(y|x), the probability of a player action label y given the features x, which include player features (e.g., attack, defense, agility) and contextual features (e.g., last attack target, last used skill). We implemented the ranking system using the GBDT, LR & DL framework.
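The following is a minimal sketch of the kind of rule-based filtering the candidate generation stage performs. The field names (mp_cost, targets_allowed, side) are hypothetical and only illustrate the two rules mentioned above, not the actual SA2 data schema.

```python
def generate_candidates(skills, targets, player_mp):
    """Return all legal (skill, target) pairs for the current round."""
    candidates = []
    for skill in skills:
        # Rule 1: the skill's MP cost must not exceed the player's current MP.
        if skill["mp_cost"] > player_mp:
            continue
        for target in targets:
            # Rule 2: some skills may only be cast on allies, others on opponents.
            if target["side"] not in skill["targets_allowed"]:
                continue
            candidates.append((skill["id"], target["id"]))
    return candidates

# Example usage with toy data.
skills = [
    {"id": "heal", "mp_cost": 30, "targets_allowed": {"ally"}},
    {"id": "fireball", "mp_cost": 50, "targets_allowed": {"opponent"}},
]
targets = [{"id": "p1", "side": "ally"}, {"id": "e1", "side": "opponent"}]
print(generate_candidates(skills, targets, player_mp=40))  # [('heal', 'p1')]
```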

3. GBDT, LR & Deep Learning

In this section, we split our GBDT, LR & Deep Learning (GLD) model into three components and describe each component in its own subsection.

3.1 The GBDT Component

In the tree component, we use gradient boosted decision trees, as illustrated in Figure 4, to transform features into a higher-dimensional, sparse space, and then train a linear model (the LR component) on these features. Given GBDT training data X = {x_i}_{i=1}^{N} and their labels Y = {y_i}_{i=1}^{N} with y_i ∈ {0, 1}, the goal of this step is to choose a classification function F(x) that minimizes the aggregate of a specified loss function L(y_i, F(x_i)):

$$F^{*} = \arg\min_{F} \sum_{i=1}^{N} L\left(y_i, F(x_i)\right)$$

Gradient boosting considers the function estimate F in an additive form:

$$F(x) = \sum_{m=1}^{T} f_m(x)$$

where T is the number of iterations. At each iteration, GBDT builds a regression tree that fits the residuals from the previous trees. The {f_m(x)} are designed in an incremental fashion: at the m-th stage, the newly added function f_m is chosen to optimize the aggregated loss while keeping {f_j}_{j=1}^{m-1} fixed [6].

Once the GBDT training process has finished, we fit an ensemble of these gradient boosted trees on the training set. Each leaf of each tree in the ensemble is then assigned a fixed, arbitrary feature index in a new feature space, and these leaf indices are encoded in a one-hot fashion. Each sample goes through the decisions of every tree in the ensemble and ends up in one leaf per tree; the sample is encoded by setting the feature values for these leaves to 1 and the other feature values to 0. The resulting transformer has thus learned a supervised, sparse, high-dimensional categorical embedding of the data. For instance, suppose there are three subtrees as in Figure 4, where the first subtree has 4 leaves, the second 3 leaves, and the third 2 leaves. If an example ends up in leaf 3 of the first subtree, leaf 2 of the second subtree, and leaf 1 of the third subtree, the output will be the binary vector [0, 0, 1, 0, 0, 1, 0, 1, 0], which becomes the input to the LR component.
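A minimal sketch of this leaf-index encoding, using scikit-learn, is shown below. The dataset here is synthetic; in the paper the trees are trained on real battle logs with a win/loss label.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import OneHotEncoder

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)

# Tree component: each sample is routed to one leaf per tree.
gbdt = GradientBoostingClassifier(n_estimators=100, max_depth=3, random_state=0)
gbdt.fit(X, y)

# apply() returns the leaf index reached in every tree (shape: n_samples x n_trees).
leaf_indices = gbdt.apply(X)[:, :, 0]

# One-hot encode the leaf indices: this is the sparse, high-dimensional binary
# vector (e.g. [0,0,1,0, 0,1,0, 1,0] in the three-subtree example) fed to LR.
encoder = OneHotEncoder()
leaf_onehot = encoder.fit_transform(leaf_indices)

# LR component trained on the transformed features.
lr = LogisticRegression(max_iter=1000)
lr.fit(leaf_onehot, y)
print("train accuracy:", lr.score(leaf_onehot, y))
```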

Figure 4. Gradient boosted decision trees used to transform the categorical features into a higher-dimensional, sparse space.

3.2 The Logistic Regression (LR) Component

We expect the LR component (Figure 5) to learn the frequent co-occurrence of skills and targets in the historical data. The LR component operates on the supervised, sparse, high-dimensional categorical embedding of the features generated by the tree component, and the logistic regression model has the form:

$$y = \frac{1}{1 + e^{-W}}, \qquad W = w_0 + w_1 x_1 + w_2 x_2 + \cdots + w_d x_d + b$$

where y is the prediction, x = [x_1, x_2, ..., x_d] is the vector of features from the tree model, w = [w_1, w_2, ..., w_d] are the model parameters, and b is the bias.

Figure 5. The structure of the logistic regression model.

3.3 The Deep Learning (DL) Component

We utilize the DL component in our system to explore, via transitivity, skills that have never or rarely been used in the past. We feed hundreds of features into it after some preprocessing and feature engineering. The features are split roughly evenly between categorical features, continuous features, and IDs. Although the deep component alleviates the burden of engineering features by hand, the nature of our raw data does not easily lend itself to being input directly into neural networks, and we still expend considerable engineering resources transforming players and skills into useful features. Therefore, we build a vector embedding for every categorical type and normalize each continuous feature.

Embedding Categorical Features

In principle, a neural network can approximate any continuous function [14, 15] and piecewise continuous function [16]. However, it is not suitable for approximating arbitrary non-continuous functions, as it assumes a certain level of continuity in its general form [17]. The continuous nature of neural networks limits their applicability to categorical variables, so naively applying neural networks to structured data with integer representations for categorical variables does not work well. A common way to circumvent this problem is one-hot encoding, but it has two shortcomings: first, when there are many high-cardinality features, one-hot encoding often requires an unrealistic amount of computational resources; second, it treats different values of a categorical variable as completely independent of each other and often ignores the informative relations between them. Therefore, we learn high-dimensional embeddings for each skill in a fixed vocabulary and feed these embeddings into a feedforward neural network. A player's used-skill history is represented by a variable-length sequence of sparse skill IDs, which is mapped to a dense vector representation via the embeddings. The network requires fixed-size dense inputs, and simply averaging the embeddings performed best among the several strategies we tried (sum, component-wise max, etc.). The features are concatenated into a first layer, followed by several layers of fully connected Rectified Linear Units (ReLU) [6].
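The following Keras sketch illustrates this embedding scheme under stated assumptions: skill IDs in the usage history are embedded and averaged into a fixed-size dense vector, concatenated with the continuous features, and passed through fully connected ReLU layers. The vocabulary size, embedding dimension, history length, and number of continuous features are assumptions, not values from the paper.

```python
import tensorflow as tf

NUM_SKILLS = 500        # assumed skill vocabulary size
EMBED_DIM = 32          # assumed embedding dimension
MAX_HISTORY = 10        # skill history padded/truncated to a fixed length

skill_history = tf.keras.Input(shape=(MAX_HISTORY,), dtype="int32", name="skill_history")
continuous = tf.keras.Input(shape=(16,), name="continuous_features")

# Embed each skill ID, then average the sequence into one dense vector
# (mask_zero=True treats 0 as padding so it is excluded from the average).
emb = tf.keras.layers.Embedding(NUM_SKILLS + 1, EMBED_DIM, mask_zero=True)(skill_history)
avg = tf.keras.layers.GlobalAveragePooling1D()(emb)

# Concatenate the dense embedding with the normalized continuous features,
# then apply a tower of fully connected ReLU layers.
x = tf.keras.layers.Concatenate()([avg, continuous])
for units in (256, 128, 64):
    x = tf.keras.layers.Dense(units, activation="relu")(x)
out = tf.keras.layers.Dense(1, activation="sigmoid")(x)   # win-probability score

model = tf.keras.Model([skill_history, continuous], out)
model.compile(optimizer="adam", loss="binary_crossentropy")
model.summary()
```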

Figure 6. The deep neural network helps explore, via transitivity, skills that have never or rarely been used in the past.

Normalizing Continuous Features

Since the ranges of raw data values vary widely, the objective functions of some machine learning algorithms will not work properly without normalization. For example, many classifiers calculate the distance between two points using the Euclidean distance; if one of the features has a broad range of values, the distance will be governed by that particular feature. Therefore, the ranges of all features should be normalized so that each feature contributes approximately proportionately to the final distance. Meanwhile, gradient descent converges much faster with feature scaling than without it. Moreover, neural networks are notoriously sensitive to the scaling and distribution of their inputs [9]. We use the cumulative distribution to transform a continuous feature x with distribution f into x̃ by scaling the values such that the feature is equally distributed in [0, 1). The formula is:

$$\tilde{x} = \int_{-\infty}^{x} df$$
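In practice this integral can be approximated by the empirical CDF of the training data. A minimal sketch, assuming the empirical CDF is estimated on the training values of one feature:

```python
import numpy as np

def cdf_normalize(train_values, values):
    """Map values through the empirical CDF estimated on train_values,
    so the transformed feature is roughly uniform on [0, 1)."""
    sorted_train = np.sort(train_values)
    ranks = np.searchsorted(sorted_train, values, side="left")
    return ranks / len(sorted_train)

# Example: a skewed raw feature (e.g., HP) mapped into [0, 1).
train_hp = np.random.gamma(shape=2.0, scale=500.0, size=10_000)
print(cdf_normalize(train_hp, np.array([100.0, 1000.0, 5000.0])))
```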

These low-dimensional dense embedding vectors are then fed into the three hidden layers of the neural network in the forward pass. Specifically, each hidden layer performs the following computation:

$$a^{(l+1)} = f\left(W^{(l)} a^{(l)} + b^{(l)}\right)$$

where l is the layer number and f is the activation function, typically rectified linear units (ReLUs); a^(l), b^(l), and W^(l) are the activations, bias, and model weights at the l-th layer. We use the sigmoid function as the final activation function.

3.4 Training of the Tree & Deep Model

We trained the LR and DNN models in two different ways: ensemble training and joint training. Joint training optimizes all parameters simultaneously by taking both the LR and DNN parts, as well as the weights of their sum, into account at training time. The LR component and DNN component are combined using a weighted sum of their output log odds as the prediction, which is then fed to one common logistic loss function; joint training is performed by back-propagating the gradients from the output into both the wide and deep parts of the model simultaneously using mini-batch stochastic optimization [3]. In ensemble training, the individual models are trained separately without knowing about each other, and the prediction scores from the LR component and the DL component are combined at inference time rather than at training time. The structures of both joint training and ensemble training are shown in Fig. 7. The model's predictions for ensemble training and joint training are:

Ensemble training:

$$P(Y = 1 \mid x) = \alpha\,\sigma\!\left(W_{LR}^{T}\,GBDT(x) + b_0\right) + (1 - \alpha)\,\sigma\!\left(W_{DNN}^{T}\,a^{(l_f)} + b_1\right)$$

Joint training:

$$P(Y = 1 \mid x) = \sigma\!\left(W_{LR}^{T}\,GBDT(x) + W_{DNN}^{T}\,a^{(l_f)} + b_0\right)$$

where Y is the binary class label, σ(·) is the sigmoid function, GBDT(x) are the tuple features generated by the GBDT component, W_LR is the vector of all LR model weights, and W_DNN are the weights applied to the DNN's final activations a^(l_f). α is set to 0.5. In the experiments, we implemented both ensemble and joint training, and the results are shown in Section 5.

Figure 7. (a) The LR and DNN model weights are trained simultaneously (joint training). (b) The LR and DNN model weights are trained separately (ensemble training).
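The sketch below contrasts the two combination schemes numerically, assuming the GBDT+LR part and the DNN part each expose a pre-sigmoid logit for a skill-target pair. The logit values and variable names are illustrative only.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

lr_logit, dnn_logit = 1.2, -0.4    # toy logits from the two components
alpha = 0.5                         # ensemble weight used in the paper

# Ensemble training: each component is trained on its own; their probabilities
# are blended only at inference time.
p_ensemble = alpha * sigmoid(lr_logit) + (1 - alpha) * sigmoid(dnn_logit)

# Joint training: the logits are summed inside a single sigmoid, and gradients
# flow back into both components through one common logistic loss.
p_joint = sigmoid(lr_logit + dnn_logit)

print(f"ensemble score: {p_ensemble:.3f}, joint score: {p_joint:.3f}")
```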

4. System Implementation

The turn-based battle recommendation pipeline consists of four stages: data generation, model training, a validation module, and model serving, as shown in Figure 8.

Figure 8. Turn-based battle recommendation pipeline overview.

4.1 Data and Label Generation

In the training data generation module, each player's parameters, the current and previous round information, and the predicted player's skill list are used to generate the training examples. The label is the battle's final result: 1 if the player wins, and 0 otherwise. In order to reduce noise in the positive examples, we only take battles that finished with the player winning within 10 rounds as positive examples.
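A hypothetical sketch of this labeling rule follows. The field names (winner, rounds) are illustrative, not the real log schema, and dropping slow wins (rather than labeling them 0) is one plausible reading of the rule above.

```python
def label_battle(battle, player_id, max_rounds=10):
    """1 = quick win (positive), 0 = loss (negative), None = slow win (dropped as noise)."""
    if battle["winner"] == player_id:
        return 1 if battle["rounds"] <= max_rounds else None
    return 0

battles = [
    {"winner": "p7", "rounds": 8},    # quick win  -> 1
    {"winner": "p7", "rounds": 17},   # slow win   -> dropped
    {"winner": "p3", "rounds": 5},    # loss       -> 0
]
print([label_battle(b, "p7") for b in battles])   # [1, None, 0]
```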

4.2 Model Training

The model structure we used in the experiment is shown in Figure 7. We feed all categorical features into the tree models. The tree component implements non-linear and tuple feature transformations and can be understood as a supervised feature encoding that converts a real-valued vector into a compact binary-valued vector. The LR component consumes the features generated by the tree component. All continuous, ID, and categorical features are normalized and embedded, then fed into the DL part of the model with 3 ReLU layers: 256 -> 128 -> 64. Since we applied both ensemble and joint training, the outputs of the LR and DL components are combined either at inference time (ensemble training) or at training time (joint training). The GLD models are trained on over 200,000 battles; the batch size and number of epochs were set to 5000 and 20, respectively.

4.3 Robot Script Validation

The validation module includes an offline A/B test system and a robot validation system. To effectively validate the performance of different models, running a preliminary offline evaluation on historical data to iterate faster on new ideas is a well-known practice [10]. The second module is the robot validation system, which generates random robot roles with higher parameters than the AI's; the robot validation system's strategy is to apply a random skill to a random target. We use the win rate between the AI strategy and the robot script to validate the performance of the different models.

4.4 Model Serving

Once the model is trained and tested, we load it onto the online model servers. For each battle round, the servers receive a list of skill-target candidate pairs from the candidate generation system, together with the player features used to score each pair. After the candidate pairs have been reduced, the ranking system ranks all skill-target pairs by their scores, and the skill and target with the highest score are sent to the command system.
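A minimal sketch of this serving step, assuming a trained scorer and a feature-building helper as placeholders for the production services:

```python
def choose_action(model, candidate_pairs, build_features):
    """Return the (skill, target) pair with the highest predicted win probability."""
    best_pair, best_score = None, float("-inf")
    for skill, target in candidate_pairs:
        # Score each legal pair with the trained model: P(win | features).
        score = model.predict(build_features(skill, target))
        if score > best_score:
            best_pair, best_score = (skill, target), score
    # The top-scoring pair is issued to the command system.
    return best_pair
```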

5. Experiment Results

To evaluate the effectiveness of Tree & Deep learning, we compared our system with several other AI strategies, each playing against a robot script. The robot script picks a random skill with a random target, but the robot roles have higher parameters (more HP, more attack, etc.). Meanwhile, we also experimented with our system under different configurations. Finally, our battle action prediction system was productionized on SA2; the serving performance is given in subsection 5.5.

5.1 Positive Upsampling

Class imbalance has been studied by many researchers and has been shown to have a significant impact on the performance of the learned model [6]. Most classification models in fact do not yield a binary decision, but rather a continuous decision value. Using the decision values, we can rank test samples from 'almost certainly positive' to 'almost certainly negative', and based on the decision value we can always choose a cutoff that configures the classifier in such a way that a certain fraction of the data is labeled as positive. Determining an appropriate threshold can be done via the model's ROC or PR curves, and the decision threshold can be applied regardless of the balance used in the training set. Assuming the model is better than random, we can intuitively see that increasing the threshold for positive classification (which leads to fewer positive predictions) increases the model's precision, and vice versa. Therefore, in this part we investigate the use of positive upsampling to test the influence of class imbalance. We empirically experiment with different positive upsampling rates to test the prediction accuracy of the learned model, varying the rate in {0.1, 0.2, 0.3, ..., 0.8, 0.9}. The experiment result is shown in Figure 9. From the result, we can see that the positive upsampling rate has a significant effect on the performance of the trained model; the best performance is achieved with the positive upsampling rate set to 0.6.

Figure 9. Fraction of the positive samples employed for learning.
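As a simple sketch of the upsampling step itself, assuming numpy arrays of features and binary labels, positives can be resampled with replacement until they form the requested fraction of the training set:

```python
import numpy as np

def upsample_positives(X, y, positive_rate, seed=0):
    """Resample positives so they form `positive_rate` of the returned data."""
    rng = np.random.default_rng(seed)
    pos_idx = np.where(y == 1)[0]
    neg_idx = np.where(y == 0)[0]
    # Number of positives needed so that pos / (pos + neg) == positive_rate.
    n_pos = int(positive_rate / (1 - positive_rate) * len(neg_idx))
    sampled_pos = rng.choice(pos_idx, size=n_pos, replace=True)
    idx = np.concatenate([neg_idx, sampled_pos])
    rng.shuffle(idx)
    return X[idx], y[idx]

# Sweep the rate as in the experiment above (0.1 through 0.9),
# retraining the model on each upsampled set and measuring its win rate.
```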

5.2 Experiments with the GBDT Component

In the GBDT component, we fixed the number of estimators to 100, the subsample to 1, and the learning rate to 0.1, and varied the max depth of the individual regression estimators. The maximum depth limits the number of nodes in each tree. The results in Table 1 show that increasing the max depth gives the model better performance on the training examples, but performance drops at max depths of 6 and 7 on the test examples. The reason for this phenomenon is overfitting: a higher depth allows the model to learn relations very specific to a particular sample.

Table 1. Experiments with the GBDT component on max depth.

5.3 Experiments with the DNN Component

Hidden Layers

Table 2 shows the results of using the same structure with different hidden layer configurations in the DL module. These results show that increasing both the width and depth of the hidden layers improves results. The trade-off, however, is the server CPU time needed for inference. The configuration of 512 hidden ReLU units followed by 256 hidden ReLU units followed by 128 hidden ReLU units gave us the best results while enabling us to stay within our serving CPU budget.

For the 512 -> 256 -> 128 configuration, we also tried different normalization methods: min-max, standard, cumulative distribution, and no normalization. The cumulative distribution increased the win rate by 4%, while min-max increased it by 3.2% and standard normalization by 2.7%.

Table 2. Experiments with the DNN component on different hidden layers.

5.4 Experiments with Different Battle Strategies

We used the robot validation system as the opponent to evaluate our GBDT, LR & Deep Learning model. The robot validation system generates 10 random robot roles with higher parameters than the AI player roles, but applies a random skill to a random target. The value shown for each configuration ("win-rate") was obtained by considering both positive (wins by the AI) and negative (wins by the robots) outcomes. As shown in Table 3, the GBDT, LR & Deep Learning model had the best win rate compared with the random skill & random target strategy, the LR-only model, the GBDT-only model, the GBDT+LR model, the Deep Learning-only model, and the Reinforcement Learning (RL) model. For the RL model, we defined the percentage of the sum of the opponents' HPs as the reward function.

Table 3. Different models against the robot script.

Each experiment was run 100,000 times. The main reason for the gap is that some AIs, such as GBDT and LR, can only learn the frequent co-occurrence of skills and targets in the historical data. On the other side, the RL AI tends to use attack-type skills rather than others; the shortcoming is that assistant roles or healers tried to use the normal attack skill instead of assistant skills such as the speed-up skill, control skill, etc. The results also show that ensemble training performs better than joint training.

5.5 Serving Performance

Serving with high throughput and low latency is challenging at the high level of traffic faced by our commercial turn-based mobile game. At peak traffic, our recommender servers score over 100,000 skill-target pairs per second. With single threading, scoring all candidates in a single batch takes 3 ms on a CPU-only server.

6. Conclusion

We have described our battle action prediction system for the turn-based mobile game SA2. We treat our system as a recommendation ranking system that benefits in particular from specialized features describing past player behavior with used skills. We coalesce a gradient boosted decision tree model, logistic regression, and a deep neural network, combining the benefits of generalization (exploring skills that have never or rarely been used in the past) and memorization (learning the frequent co-occurrence of skills and targets in the historical data) to score the list of possible skill-target pairs. Moreover, the tree model can effectively transform the input features into tuple features. Unlike some other AI strategies, our system does not need any prior knowledge. We presented the Tree & Wide & Deep learning framework to combine the strengths of the three types of model. In the experiments, testing the resulting performance against the robot script showed that the GLD model was significantly better than the others.

7. Acknowledgements

We would like to thank many at Shanghai Luyou Network Technology, especially Dian Wu and Jun Qi, for the robot script and early feedback on the GLD model. We would also like to thank the Tencent K5 Cooperation Department, especially Qi Li, Zuoqiu Shen, and Qi Wang, for comments on the manuscript.

8. References

[1] StoneAge 2. http://sq.qq.com/
[2] Oriol Vinyals, Timo Ewalds, Sergey Bartunov, Petko Georgiev, Alexander Sasha Vezhnevets, Michelle Yeo, et al. StarCraft II: A New Challenge for Reinforcement Learning. DeepMind. arXiv:1708.04782, 2017.
[3] Heng-Tze Cheng, Levent Koc, Jeremiah Harmsen, Tal Shaked, Tushar Chandra, Hrishi Aradhye, Glen Anderson, Greg Corrado, et al. Wide & Deep Learning for Recommender Systems. DLRS 2016: Proceedings of the 1st Workshop on Deep Learning for Recommender Systems.
[4] Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg Corrado, Jeffrey Dean. Distributed Representations of Words and Phrases and their Compositionality. arXiv:1310.4546, 2013.
[5] Paul Covington, Jay Adams, Emre Sargin. Deep Neural Networks for YouTube Recommendations. Proceedings of the 10th ACM Conference on Recommender Systems, ACM, New York, NY, USA, 2016.
[6] Xinran He, Junfeng Pan, Ou Jin, Tianbing Xu, Bo Liu, Tao Xu, Yanxin Shi, Antoine Atallah, Ralf Herbrich, et al. Practical Lessons from Predicting Clicks on Ads at Facebook.
[7] Makoto Ishihara, Taichi Miyazaki, Pujana Paliyawan, Chun Yin Chu, Tomohiro Harada, Ruck Thawonmas. Investigating Kinect-based Fighting Game AIs That Encourage Their Players to Use Various Skills. Consumer Electronics (GCCE), 2015 IEEE 4th Global Conference on.
[8] Si Si, Huan Zhang, S. Sathiya Keerthi, Dhruv Mahajan, Inderjit S. Dhillon, Cho-Jui Hsieh. Gradient Boosted Decision Trees for High Dimensional Sparse Output. Proceedings of the 34th International Conference on Machine Learning, PMLR 70:3182-3190, 2017.
[9] S. Ioffe and C. Szegedy. Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift. CoRR, abs/1502.03167, 2015.
[10] Alexandre Gilotte, Clement Calauzenes, Thomas Nedelec, Alexandre Abraham, Simon Dolle (Criteo Research). Offline A/B Testing for Recommender Systems. arXiv:1801.07030v1, 2018.
[11] J. Duchi, E. Hazan, and Y. Singer. Adaptive Subgradient Methods for Online Learning and Stochastic Optimization. Journal of Machine Learning Research, 12:2121-2159, July 2011.
[12] M. Abadi, A. Agarwal, P. Barham, E. Brevdo, Z. Chen, C. Citro, G. S. Corrado, A. Davis, J. Dean, M. Devin, S. Ghemawat, I. Goodfellow, A. Harp, G. Irving, M. Isard, Y. Jia, R. Jozefowicz, L. Kaiser, M. Kudlur, J. Levenberg, D. Mane, R. Monga, S. Moore, D. Murray, C. Olah, M. Schuster, J. Shlens, B. Steiner, I. Sutskever, K. Talwar, P. Tucker, V. Vanhoucke, V. Vasudevan, F. Viegas, O. Vinyals, P. Warden, M. Wattenberg, M. Wicke, Y. Yu, and X. Zheng. TensorFlow: Large-Scale Machine Learning on Heterogeneous Systems, 2015. Software available from tensorflow.org.
[13] X. Amatriain. Building Industrial-Scale Real-World Recommender Systems. In Proceedings of the Sixth ACM Conference on Recommender Systems, RecSys.
[14] George Cybenko. Approximation by Superpositions of a Sigmoidal Function. Mathematics of Control, Signals and Systems, 2, 303-314.
[15] Michael Nielsen. Neural Networks and Deep Learning. Determination Press, 2015, Chap. 4.
[16] Bernardo Llanas, Sagrario Lantaron, and Francisco J. Sainz. Constructive Approximation of Discontinuous Functions by Neural Networks. Neural Processing Letters 27, 209-226, 2008.
[17] Cheng Guo, Felix Berkhahn. Entity Embeddings of Categorical Variables. arXiv:1604.06737, 2016.
[18] Yan Duan, Xi Chen, Rein Houthooft, John Schulman, and Pieter Abbeel. Benchmarking Deep Reinforcement Learning for Continuous Control. In International Conference on Machine Learning, pages 1329-1338, 2016.
[19] Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Alex Graves, Ioannis Antonoglou, Daan Wierstra, Martin Riedmiller. Playing Atari with Deep Reinforcement Learning. DeepMind. arXiv:1312.5602, 2013.
[20] M. Ishihara, T. Miyazaki, C. Y. Chu, T. Harada and R. Thawonmas. "Applying and Improving Monte-Carlo Tree Search in a Fighting Game AI." ACM Int. Conf. Advances in Computer Entertainment Technology, 2016.
[21] S. Yoshida, M. Ishihara, T. Miyazaki, Y. Nakagawa, T. Harada and R. Thawonmas. "Application of Monte-Carlo Tree Search in a Fighting Game AI." IEEE Global Conf. Consumer Electronics, 2016.
[22] J. E. Laird. "Using a Computer Game to Develop Advanced AI." Computer, vol. 34, no. 7, pp. 70-75, Jul. 2001. [Online]. Available: http://dx.doi.org/10.1109/2.933506
[23] M. Buro and T. M. Furtak. "RTS Games and Real-Time AI Research." In Proc. Behavior Representation Model. Simul. Conf., 2004, pp. 51-58.
[24] Leemon Baird. Residual Algorithms: Reinforcement Learning with Function Approximation. In Proceedings of the 12th International Conference on Machine Learning (ICML 1995), pages 30-37. Morgan Kaufmann, 1995.
[25] Marc G. Bellemare, Yavar Naddaf, Joel Veness, and Michael Bowling. The Arcade Learning Environment: An Evaluation Platform for General Agents. J. Artif. Intell. Res. (JAIR), 47:253-279, 2013.
