research article - hindawi publishing...

17
Research Article Optimal Decision-Making Approach for Cyber Security Defense Using Game Theory and Intelligent Learning Yuchen Zhang and Jing Liu Zhengzhou Information Science and Technology Institute, Zhengzhou 450001, China Correspondence should be addressed to Yuchen Zhang; [email protected] Received 28 June 2019; Revised 22 September 2019; Accepted 25 October 2019; Published 23 December 2019 Guest Editor: Akbar S. Namin Copyright © 2019 Yuchen Zhang and Jing Liu. is is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. Existing approaches of cyber attack-defense analysis based on stochastic game adopts the assumption of complete rationality, but in the actual cyber attack-defense, it is difficult for both sides of attacker and defender to meet the high requirement of complete rationality. For this aim, the influence of bounded rationality on attack-defense stochastic game is analyzed. We construct a stochastic game model. Aiming at the problem of state explosion when the number of network nodes increases, we design the attack-defense graph to compress the state space and extract network states and defense strategies. On this basis, the intelligent learning algorithm WoLF-PHC is introduced to carry out strategy learning and improvement. en, the defense decision-making algorithm with online learning ability is designed, which helps to select the optimal defense strategy with the maximum payoff from the candidate strategy set. e obtained strategy is superior to previous evolutionary equilibrium strategy because it does not rely on prior data. By introducing eligibility trace to improve WoLF-PHC, the learning speed is further improved and the defense timeliness is significantly promoted. 1. Introduction With the continuous strengthening of social informatiza- tion, cyber attacks are becoming more frequent, causing tremendous losses to defenders [1]. Because of the com- plexity of the network itself and the limitation of the de- fender’s ability, the network cannot achieve absolute security. It is urgent to have a technology which can analyze the attack-defense behavior and effectively compromise the network risk and security investment so that the defender can make reasonable decisions with limited resources. Game theory and cyber attack-defense have a high degree of op- position, non-cooperative relationship, and strategic de- pendence [2]. e research and application of game theory in cyber security are rising day by day [3]. e analysis of attack-defense confrontation based on stochastic game has become a hotspot. Stochastic game is a combination of game theory and Markov decision making. It not only extends the single state of traditional game to multistate but also characterizes the randomness of cyber attack-defense. At present, the cyber security analysis based on stochastic game has achieved some results, but there are still some short- comings and challenges [4–7]. e existing stochastic game of attack-defense is based on the assumption of complete rationality, through Nash equilibrium for attack prediction and defense guidance. Complete rationality includes many aspects of perfection requirements, such as rational con- sciousness (pursuit of maximum benefits), analytical rea- soning ability, identification and judgment ability, memory ability, and accurate behavior ability, among which any aspect of imperfection belongs to limited rationality [8]. e high requirement of complete rationality is too harsh for both sides of attack-defense, which makes it difficult for Nash equilibrium under the assumption of complete ra- tionality to appear in practice, and reduces the accuracy and guiding value of existing research results. To solve the above problems, this paper studies the defense decision-making approach based on stochastic game under the restriction of bounded rationality. Section 2 in- troduces the research status of defense decision making Hindawi Security and Communication Networks Volume 2019, Article ID 3038586, 16 pages https://doi.org/10.1155/2019/3038586

Upload: others

Post on 15-Jul-2020

4 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Research Article - Hindawi Publishing Corporationdownloads.hindawi.com/journals/scn/2019/3038586.pdf · Afrand and Das [15] established a repeated game model between the intrusion

Research ArticleOptimal Decision-Making Approach for Cyber Security DefenseUsing Game Theory and Intelligent Learning

Yuchen Zhang and Jing Liu

Zhengzhou Information Science and Technology Institute Zhengzhou 450001 China

Correspondence should be addressed to Yuchen Zhang 2744190810qqcom

Received 28 June 2019 Revised 22 September 2019 Accepted 25 October 2019 Published 23 December 2019

Guest Editor Akbar S Namin

Copyright copy 2019 Yuchen Zhang and Jing Liuis is an open access article distributed under the Creative Commons AttributionLicense which permits unrestricted use distribution and reproduction in any medium provided the original work isproperly cited

Existing approaches of cyber attack-defense analysis based on stochastic game adopts the assumption of complete rationality butin the actual cyber attack-defense it is dicult for both sides of attacker and defender to meet the high requirement of completerationality For this aim the inuence of bounded rationality on attack-defense stochastic game is analyzed We constructa stochastic game model Aiming at the problem of state explosion when the number of network nodes increases we design theattack-defense graph to compress the state space and extract network states and defense strategies On this basis the intelligentlearning algorithmWoLF-PHC is introduced to carry out strategy learning and improvementen the defense decision-makingalgorithm with online learning ability is designed which helps to select the optimal defense strategy with the maximum payofrom the candidate strategy set e obtained strategy is superior to previous evolutionary equilibrium strategy because it does notrely on prior data By introducing eligibility trace to improve WoLF-PHC the learning speed is further improved and the defensetimeliness is signicantly promoted

1 Introduction

With the continuous strengthening of social informatiza-tion cyber attacks are becoming more frequent causingtremendous losses to defenders [1] Because of the com-plexity of the network itself and the limitation of the de-fenderrsquos ability the network cannot achieve absolutesecurity It is urgent to have a technology which can analyzethe attack-defense behavior and eectively compromise thenetwork risk and security investment so that the defendercan make reasonable decisions with limited resources Gametheory and cyber attack-defense have a high degree of op-position non-cooperative relationship and strategic de-pendence [2] e research and application of game theoryin cyber security are rising day by day [3] e analysis ofattack-defense confrontation based on stochastic game hasbecome a hotspot Stochastic game is a combination of gametheory and Markov decision making It not only extends thesingle state of traditional game to multistate but alsocharacterizes the randomness of cyber attack-defense At

present the cyber security analysis based on stochastic gamehas achieved some results but there are still some short-comings and challenges [4ndash7] e existing stochastic gameof attack-defense is based on the assumption of completerationality through Nash equilibrium for attack predictionand defense guidance Complete rationality includes manyaspects of perfection requirements such as rational con-sciousness (pursuit of maximum benets) analytical rea-soning ability identication and judgment ability memoryability and accurate behavior ability among which anyaspect of imperfection belongs to limited rationality [8] ehigh requirement of complete rationality is too harsh forboth sides of attack-defense which makes it dicult forNash equilibrium under the assumption of complete ra-tionality to appear in practice and reduces the accuracy andguiding value of existing research results

To solve the above problems this paper studies thedefense decision-making approach based on stochastic gameunder the restriction of bounded rationality Section 2 in-troduces the research status of defense decision making

HindawiSecurity and Communication NetworksVolume 2019 Article ID 3038586 16 pageshttpsdoiorg10115520193038586

based on stochastic game Section 3 analyses the difficultiesof studying cyber attack-defense stochastic game underbounded rationality and the idea of solving the problem inthis paper Moreover Section 3 constructs attack-defensestochastic game model under bounded rationality con-straints and proposes a host-centered attack-defense graphmodel to extract network state and attack-defense action ingame model Bowling et al [9] first proposed WoLF-PHCfor multiagent learning Section 4 further improves WoLF-PHC algorithm based on eligibility trace for promoting thelearning speed of defenders as well as reducing the de-pendence of the algorithm on data Using the improvedintelligent learning algorithm WoLF-PHC (Wolf MountainClimbing Strategy) to analyze the stochastic game model inthe previous section we design the defense decision-makingalgorithm Section 5 verifies the effectiveness of the proposedapproach through experiments Section 6 summarizes thefull text and discusses future research

ere are three main contributions of this paper

(1) e extraction of network state and attack-defenseactions is one of the keys to the construction ofstochastic game model e network state of theexisting stochastic game model contains the securityelements of all nodes in the network and there isa ldquostate explosionrdquo problem In order to solve thisproblem a host-centered attack-defense graphmodel is proposed and an attack-defense graphgeneration algorithm is designed which effectivelycompresses the game state space

(2) Limited rationality means that both sides of attack-defense need to find the optimal strategy throughtrial and error and learning It is a key point todetermine the learning mechanism of players In thispaper reinforcement learning is introduced intostochastic game which expands stochastic gamefrom complete rationality to limited rationalityDefenders use WoLF-PHC to learn the game inadversarial attack-defense so as to make the bestchoice for current attackers Most of the existingbounded rationality games use biological evolu-tionary mechanism to learn and take the group as theresearch object Compared with the existing boun-ded rationality games the approach proposed in thispaper reduces the exchange of information amonggame players and is more suitable for guiding in-dividual defense decision making

(3) WoLF-PHC algorithm is improved based on eligi-bility trace [10] which speeds up the learning speedof defenders reduces the dependence of the algo-rithm on data and proves the effectiveness of theapproach through experiments

2 Related Works

Some progress has been made in cyber security researchbased on game theory at home and abroad but most of thecurrent studies are based on the assumption of completerationality [11] Under complete rationality according to the

decision-making times of both sides in the game process itcan be divided into single-stage game and multistage gamee research of single-stage cyber attack-defense gamestarted earlier Liu et al [12] used static game theory toanalyze the effectiveness of worm virus attack-defensestrategy Li et al [13] established a non-cooperative gamemodel between attackers and sensor trust nodes and gave theoptimal attack strategy based on Nash equilibrium In cyberattack-defense although part of the simple attack-defenseconfrontation belongs to single-stage game in most sce-narios the process of attack-defense often lasts for manystages so multistage cyber attack-defense game becomesa trend Zhang et al [14] regarded the defender as the sourceof signal transmission and the attacker as the receiver andconstructed a multistage attack-defense process using dif-ferential game Afrand and Das [15] established a repeatedgame model between the intrusion detection system andwireless sensor nodes and analyzed the forwarding strategyof node packets Although the above results can be used toanalyze multistage attack-defense confrontation the statetransition between stages is not only affected by attack-defense action but also by the interference of system op-erating environment and other external factors which hasrandomness e above results ignore this randomness andweaken its guiding value

Stochastic game is a combination of game theory andMarkov theory It is a multistage game model It can ac-curately analyze the impact of randomness on attack-defenseprocess by using the Markov process to describe the statetransition Wei et al [16] abstracted the cyber attack-defenseas a stochastic game problem and gave a more scientific andaccurate quantitative approach of attack-defense benefitsapplicable to the stochastic game model of attack-defenseWang et al [17] used stochastic game theory to study thenetwork confrontation problem Convex analysis theory wasused to prove the existence of equilibrium and the equi-librium solution was transformed into a nonlinear pro-gramming problem Based on incomplete informationstochastic game Liu et al [3] proposed the decision-makingapproach formoving targets defense All the aforementionedschemes are based on the assumption of complete ratio-nality which is too strict for both sides of attack-defense Inmost cases both sides of attack-defense are only limitedrationality level which leads to the deviation of the aboveresearch results in the analysis of attack-defense gameerefore it has important research value and practicalsignificance to explore the bounded rationality of cyberattack-defense game law

Limited rationality means that both sides of attack-de-fense will not find the optimal strategy at the beginningeywill learn the game of attack-defense in the game of attack-defense e appropriate learning mechanism is the key towin in the game At present the research of limited rationalattack-defense game is mainly centered on evolutionarygame [18] Hayel and Zhu [19] established an evolutionaryPoisson game model between malicious software and an-tivirus programs and used the replication dynamic equationto analyze the antivirus program strategy Huang and Zhang[20] improved the traditional replication dynamic equation

2 Security and Communication Networks

by introducing incentive coefficient and improved the cal-culation approach of replication dynamic rate Based on thisan evolutionary game model was constructed for defenseEvolutionary game takes the group as the research objectadopts the biological evolution mechanism and completesthe learning by imitating the advantage strategy of othermembers In evolutionary game there is too much in-formation exchange among players and it mainly studies theadjustment process trend and stability of attack-defensegroup strategy which is not conducive to guiding the real-time strategy selection of individual members

Reinforcement learning is a classic online intelligentlearning approach Its players learn independently throughenvironmental feedback Compared with evolutionary bi-ology reinforcement learning is more suitable for guidingindividual decision making is paper introduces re-inforcement-learning mechanism into stochastic gameexpands stochastic game from complete rationality to finiterationality and uses bounded rationality stochastic game toanalyze cyber attack-defense On the one hand comparedwith the existing attack-defense stochastic game this ap-proach uses bounded rationality hypothesis which is morerealistic On the other hand compared with evolutionarygame this approach uses reinforcement learning mecha-nism which is more suitable for guiding real-time defensedecision making

3 Modeling of Attack-Defense ConfrontationUsing Stochastic Game Theory

31 Description and Analysis of Cyber Attack-DefenseConfrontation Cyber attack-defense confrontation isa complex problem but from the level of strategy selection itcan be described as a stochastic game problem as depicted inFigure 1 We take the DDoS attack of using Sadmind vul-nerability of Soloris platform as an example e attack isimplemented through multiple steps including IP sweepSadmind ping Sadmind exploit Installing DDoS softwareand Conducting DDoS attack Each attack step can lead tochange of security state of network

Taking the first step as an example the initial networkstate is denoted as S0 (H1 none) It means that the attackerAlice does not have any privileges of host H1 en attackerAlice implemented an IP sweep attack onH1 through its openport 445 and gained the User privilege of H1 is networkstate is denoted as S1 (H1 User) Afterwards if the defenderBob selected and implemented a defense strategy from thecandidate strategy set Reinstall Listener program Installpatches Close unused port then the network state istransferred back to S0 otherwise the network may continueto evolve to another more dangerous state S3

e continuous time axis is divided into time slices andeach time slice contains only one network state e networkstatemay be the same in different time slices Each time slice isa game of attack-defense Both sides detect the current net-work state then select the attack-defense actions according tothe strategy and get immediate returns Attack-defensestrategies are related to network state e network systemtransfers from one state to another under the candidate action

of the attacking and defending sides e transition betweennetwork states is not only affected by attack-defense actionsbut also by factors such as system operating environment andexternal environment which is randome goal of this paperis to enable defenders to obtain higher long-term benefits inattack-defense stochastic game

Both sides of attack-defense can predict the existence ofNash equilibrium so Nash equilibrium is the best strategyfor both sides From the description of complete rationalityin the introduction we can see that the requirement ofcomplete rationality for both sides of attack-defense is toostrict and both sides of attacker and defender will beconstrained by limited rationality in practice Limited ra-tionality means that at least one of the attacking anddefending sides will not adopt Nash equilibrium strategy atthe beginning of the game which means that it is difficult forboth sides to find the optimal strategy in the early stage of thegame and they need to constantly adjust and improve thestrategy for their opponents It means that the game equi-librium is not the result of one choice but that both sides ofthe attack-defense sides are constantly learning to achieve inthe course of the attack-defense confrontation and becauseof the influence of learning mechanism may deviate againeven if it reaches equilibrium

From the above analysis we can see that learningmechanism is the key to win the game of limited rationalityFor defense decision making the learning mechanism ofattack-defense stochastic game under bounded rationalityneeds to satisfy the following two requirements (1) Con-vergence of learning algorithm attacker strategy underbounded rationality has dynamic change characteristics andbecause of the interdependence of attack-defense strategythe defender must learn the corresponding optimal strategywhen facing different attack strategies to ensure that he isinvincible (2) e learning process does not need too muchattacker information both sides of the cyber attack-defensehave opposition of objectives and non-cooperation andboth sides will deliberately hide their key information If toomuch opponent information is needed in the learningprocess the practicability of the learning algorithm will bereduced

WoLF-PHC algorithm is a typical strategy gradientintelligent learning approach which enables defenders tolearn through network feedback without too much in-formation exchange with attackers e introduction ofWoLF mechanism ensures the convergence of WoLF-PHCalgorithm [9] After the attacker learns to adopt Nashequilibrium strategy WoLF mechanism enables the de-fender to converge to the corresponding Nash equilibriumstrategy while the attacker has not yet learned Nash equi-librium strategy and WoLF mechanism enables the de-fender to converge to the corresponding optimal defensestrategy In conclusion WoLF-PHC algorithm can meet thedemand of attack-defense stochastic game under boundedrationality

32 StochasticGameModel forAttack-Defense emappingrelationship between cyber attack-defense and stochastic

Security and Communication Networks 3

game model is depicted in Figure 2 Stochastic game consistsof attack-defense game in each state and transition modelbetween states e two key elements of ldquoinformationrdquo andldquogame orderrdquo are assumed Constrained by bounded ra-tionality the attackerrsquos historical actions and the attackerrsquospayo function are set as the attackerrsquos private information

Herein we use the above example in Figure 1 to explainFigure 2 the security state corresponds to S0 (H1 none) andS1 (H1 User) in this case e candidate strategy set againstDDoS attack is Reinstall Listener program Install patchesClose unused port Network state is the common knowledgeof both sides Because of the non-cooperation between theattack and defense sides the two sides can only observe eachotherrsquos actions through the detection network which willdelay the execution time for at least one time slice so theattack-defense sides are acting at the same time in each timeslice e ldquosimultaneousrdquo here is a concept of informationrather than a concept of time that is the choice of attack-defense sides may not be based on the concept of time At thesame time because the attack-defense sides do not know theother sidersquos choice when choosing action they are consid-ered to be simultaneous action

Construct the network state transition model Useprobability to express the randomness of network statetransition Because the current network state is mainly re-lated to the previous network state the rst-order Markov isused to represent the state transition relationship in whichthe network state is the attack-defense action Because bothsides of attacker and defender are constrained by boundedrationality in order to increase the generality of the modelthe transfer probability is set as the unknown information ofboth sides of attack-defense

On the basis of the above a gamemodel is constructed tosolve the defense decision-making problem

Denition 1 e attack-defense stochastic gamemodel (AD-SGM) is a six-tuple AD minus SGM (N SD R Q π) in which

(1) N (attacker defender) are the two players whoparticipate in the game representing cyber attackersand defenders respectively

(2) S (s1 s2 sn) is a set of stochastic game stateswhich is composed of network states (see Section 33for the specic meaning and generation approach)

Optimalstrategy of

St1

Optimalstrategy of

Sti

Optimalstrategy of

Sprimet1

CurrentstateSprimet2

Optimalstrategy of

SPrimet2

Game Game

Game

t1 t2 ti Time

Defender

CurrentstateSt1

CurrentstateSPrimet2

Currentstate

StiAttacker

DefenderAttacker

DefenderAttacker DefenderAttacker

Evolution

Evolution

∙ ∙ ∙

∙ ∙ ∙

∙ ∙ ∙

∙ ∙ ∙

∙ ∙ ∙

∙ ∙ ∙

∙ ∙ ∙

Game

Figure 1 e game process and strategy selection of attack-defense confrontation

Gamestate

Securitystate

Candidatestrategy

Attack-defenseequilibrium

Cost and benefitanalysis

Attacker and defender

Game plan

Game equilibrium

Payoff function

Players

Cyber attack-defense Stochastic game model

Figure 2 e mapping relationship between cyber attack-defenseand stochastic game model

4 Security and Communication Networks

(3) D (D1 D2 Dn) is the action set of the de-fender in which Dk d1 d2 dm1113864 1113865 is the actionset of the defender in the game state Sk

(4) Rd(si d sj)is the immediate return from state si to sjafter the defender performs action d

(5) Qd(si d) is the state-action payoff function of thedefender indicating the expected payoff of the de-fender after taking action d in the state si

(6) πd(sk) is the defense strategy of the defender in thestate sk

Defense strategy and defense action are two differentconcepts Defense strategy is the rule of defense action notthe action itself For example πd(sk) (πd(sk d1)

πd(sk dm)) is the strategy of the defender in the networkstate sk where πd(sk dm) is the probability of selectingaction dm 1113936disinDk

πd(sk d) 1

33 Network State and Attack-Defense Action ExtractionApproach Based on Attack-Defense Graph Network stateand attack-defense action are important components ofstochastic game model Extraction of network state andattack-defense action is a key point in constructing at-tack-defense stochastic game model [21] In the currentattack-defense stochastic game when describing thenetwork state each network state contains the securityelements of all nodes in the current network e numberof network states is the power set of security elementswhich will produce a state explosion [22] ereforea host-centered attack-defense graph model is proposedEach state node only describes the host state It can ef-fectively reduce the size of state nodes [23] Using thisattack-defense graph to extract network state and attack-defense action is more conducive to cyber attack-defenseconfrontation analysis

Definition 2 attack-defense graph is a binary groupG (S E) in which S s1 s2 sn1113864 1113865 is a set of node se-curity states and si langhost privilegerang host is the uniqueidentity of the node and privilege none user root in-dicates that it does not have any privileges has ordinary userprivileges and has administrator privileges For directededge E (Ea Ed) it indicates that the occurrence of attackor defense action causes the transfer of node state andek (sr vd sd) k a d where sr is the source node and sd

is the destination nodee generation process of attack-defense map is shown

in Figure 3 Firstly target network scanning is used to ac-quire cyber security elements then attack instantiation iscombined with attack template and defense instantiation iscombined with defense template Finally attack-defensegraph is generated e state set of attack-defense stochasticgame model is extracted by attack-defense graph nodes and

the defense action set is extracted by the edge of attack-defense graph

331 Elements of Cyber Security e elements of cybersecurity NSE are composed of network connection Cvulnerability information V service information F andaccess rights P Matrix Csube host times host times port describes theconnection relationship between nodes the row of matrixrepresents the source node shost the list of matrix rep-resents the destination node dhost and the port accessrelationship is represented by matrix elements When portis empty it indicates that there is no connection re-lationship between shost nodes and dhost nodesV langhost service cveidrang indicates the vulnerability ofservices on nodesrsquo host including security vulnerabilitiesand improper configuration or misconfiguration of systemsoftware and application software F langhost servicerang in-dicates that a service is opened on a node hostP langhost privilegerang indicates that an attacker has accessrights privilege on a node host

332 Attack Template Attack template AM is the de-scription of vulnerability utilization whereAM langtid prec postcrang Among them tid is the identifi-cation of attack mode prec langP V C Frang describes the setof prerequisites for an attacker to use a vulnerability in-cluding the initial access rights privilege of the attacker onthe source node shost the vulnerability information cveidof the target node the network connection relationship Cand the running service of the node F Only when the set ofconditions is satisfied the attacker can succeed Use thisvulnerability postc langP C sdrang describes the conse-quences of an attackerrsquos successful use of a vulnerabilityincluding the increase of attackerrsquos access to the targetnode the change of network connection relationship andservice destruction

333 Defense Template Defense templates DM are theresponse measures taken by defenders after predicting oridentifying attacks where DM langtid dsetrang dset

langd1 postd1rang langdm postdmrang1113864 1113865 is the defense strategy setfor specific attacks postdi langF V PCrang describes the im-pact of defense strategy on cyber security elements in-cluding the impact on node service informationvulnerability information attacker privilege informationnode connection relationship and so on

In the process of attack-defense graph generation ifthere is a connection between two nodes and all the pre-requisites for attack occurring are satisfied the edges fromsource node to destination node are added If the attackchanges the security elements such as connectivity the cybersecurity elements should be updated in time If the defensestrategy is implemented the connection between nodes orthe existing rights of attackers should be changed As shownin Algorithm 1 the first step is to use cyber security elements

Security and Communication Networks 5

to generate all possible state nodes and initialize the edgesSteps 2ndash8 are to instantiate attacks and generate all attackedges Steps 9ndash15 are to instantiate defenses and generate alldefense edges Steps 16ndash20 are used to remove all isolatednodes And step 21 is to output attack-defense maps

Assuming the number of nodes in the target network is nand the number of vulnerabilities of each node is m themaximum number of nodes in the attack-defense graph is3n In the attack instantiation stage the computationalcomplexity of analyzing the connection relationship be-tween each two nodes is o(n2 minus n) e computationalcomplexity of matching the vulnerability of the nodes withthe connection relationship is o(m(n2 minus n)) In the defenseinstantiation stage we remove the isolated nodes and thecomputational complexity of traversing the edges of all thenodes is o(9n2 minus 3n) In summary the order of computa-tional complexity of the algorithm is o(n2) e node ofattack-defense graph G can extract network state and theedge of attack-defense graph G can extract attack-defenseaction

4 Stochastic Game Analysis and StrategySelection Based on WoLF-PHCIntelligent Learning

In the previous section cyber attack-defense is described asa bounded rational stochastic game problem and an attack-defense stochastic game model AD-SGM is constructed Inthis section reinforcement learning mechanism is in-troduced into finite rational stochastic game and WoLF-PHC algorithm is used to select defense strategies based onAD-SGM

41 Principle of WoLF-PHC

411 Q-Learning Algorithm Q-learning [24] is the basis ofWoLF-PHC algorithm and a typical model-free re-inforcement learning algorithm Its learning mechanism isshown in Figure 4 Agent in Q-learning obtains knowledgeof return and environment state transfer through interaction

Elements of networksecurity

Network connectionrelations

Node vulnerabilityinformation

Node serviceinformation

Node accessprivilege

Attack template

Attackinstantiation

Defenseinstantiation

Defense template

Atta

ck-d

efen

se g

raph Network

state

Attackaction

Defensiveaction

Figure 3 Attack-defense graph generation

Input Elements of Cyber security NSE Attack Template AM Defense Template DMOutput Attack graph G (S E)

(1) S⟵NSE E⟵emptylowast Generate all nodes lowast(2) for each S dolowast Attack instantiation to generate attack edges lowast(3) update NSE in slowast Updating Cyber security Elements lowast(4) if Cshost shost and CdhostV geAMprecV and CdhostFgeAMprecF and CdhostPprivilegegeAMprecPprivilege(5) srhost⟵Cshost(6) sdhost⟵Cdhost(7) sdprivilege⟵AMpostcPprivilege(8) Ea⟵Ea cup ea(srAMtid sd)1113864 1113865

(9) end if(10) end for(11) for each S dolowast Defense instantiation to generate defense edges lowast(12) if Easd s and DMtid Eatid(13) Ed⟵Ed cup ed(EasdDMdsetd Easr)1113864 1113865

(14) end if(15) end for(16) for each S dolowast Remove isolated nodes S lowast(17) if ea(s tid sd) empty and ed(sr d s) empty(18) S⟵ S minus s

(19) end if(20) end for(21) Return G

ALGORITHM 1 Attack-defense graph generation algorithm

6 Security and Communication Networks

with environment Knowledge is expressed by payoff Qd andlearned by updating Qd Qd is

Qd(s d) Qd(s d) + α Rd s d sprime( 1113857 + cmaxdprime

Qd sprime dprime( 1113857 minus Qd(s d)1113890 1113891

(1)

where α is payoff learning rate and c is the discount factore strategy of Q-learning is πd(s) argmaxdQd(s d)

412 PHC Algorithm e Policy Hill-Climbing algorithm[25] is a simple and practical gradient descent learning al-gorithm suitable for hybrid strategies which is an im-provement of Q-learning e state-action gain function Qd

of PHC is the same as Q-learning but the policy updateapproach of Q-learning is no longer followed but the hybridstrategy πd(sk) is updated by executing the hill-climbingalgorithm as shown in equations (2)ndash(4) In the formula thestrategy learning rate is

πd sk di( 1113857 πd sk di( 1113857 + Δskdi (2)

where

Δskdi

minus δskdi di ne argmax

dQd sk d( 1113857

1113944djnedi

δskdj others

⎧⎪⎪⎪⎨

⎪⎪⎪⎩

(3)

δskdi min πd sk di( 1113857

δDk minus 1

11138681113868111386811138681113868111386811138681113868

1113888 1113889 (4)

413 WoLF-PHC Algorithm WoLF-PHC algorithm is animprovement of PHC algorithm By introducing WoLFmechanism the defender has two different strategy learningrates low strategy learning rate δw when winning and highstrategy learning rate δl when losing as shown in formula (5)e two learning rates enable defenders to adapt quickly toattackersrsquo strategies when they perform worse than expectedand to learn cautiously when they perform better than expectede most important thing is the introduction of WoLFmechanism which guarantees the convergence of the algorithm[9] WoLF-PHC algorithm uses average strategy as the criterionof success and failure as shown in formulae (6) and (7)

δ

δw 1113944disinDk

πd skd( 1113857Qd skd( 1113857gt 1113944disinDk

πd skd( 1113857Qd skd( 1113857

δl others

⎧⎪⎪⎪⎨

⎪⎪⎪⎩

(5)

πd(sd) πd(sd) +1

C(s)πd(sd) minus πd(sd)( 1113857

(6)

C(s) C(s) + 1 (7)

42 Defense Decision-Making Algorithm Based on ImprovedWoLF-PHC e decision-making process of our approachis shown in Figure 5 which consists of five steps It receivestwo types of input data attack evidence and abnormal

Input AD minus SGM α δ λ and c

Output Defense action d(1) initialize AD minus SGM C(s) 0 e(s d) 0lowast Network state and attack-defense actions are extracted by Algorithm 1 lowast(2) slowast get(E) lowast Getting the current network state from Network E lowast(3) repeat(4) dlowast πd(slowast) lowast Select defense action lowast(5) Output dlowast lowast Feedback defense actions to defenders lowast(6) sprime get(E) lowast Get the status after the action dlowast is executed lowast(7) ρlowast Rd(slowast dlowast sprime) + cmaxdprimeQd(sprime dprime) minus Qd(slowast dlowast)

(8) ρg Rd(slowast dlowast sprime) + cmaxdprimeQd(sprime dprime) minus maxdQd(slowast d)

(9) for each state-action pair (s d) except (slowast dlowast) do(10) e(s d) cλe(s d)

(11) Qd(s d) Qd(s d) + αρge(s d)

(12) end forlowast Update noncurrent eligibility trace (slowast dlowast) and values Qd lowast(13) Qd(slowast dlowast) Qd(slowast dlowast) + αρlowastlowast Update Qd of (slowast dlowast) lowast(14) e(slowast dlowast) cλe(slowast dlowast) + 1 lowast Update track of (slowast dlowast)lowast(15) C(slowast) C(slowast) + 1(16) Updating average strategy based on formula (6)(17) Selecting the learning rate of strategies based on formula (5)(18) δslowastd min(πd(slowast d) δ(D(slowast) minus 1))foralld isin D(slowast)

(19) Δslowastd minus δslowastd dne argmaxdi

Qd(slowast di)

1113936djnedlowastδslowastdjOthers1113896

(20) πd(slowast d) πd(slowast d) + Δslowastdforalld isin D(slowast) lowast Update defense strategy lowast(21) slowast sprime(22) end repeat

ALGORITHM 2 Defense decision-making algorithm

Security and Communication Networks 7

evidence All these pieces of evidence come from real-timeintrusion detection systems After decision making theoptimal security strategy is determined against detectedintrusions

In order to improve the learning speed of WoLF-PHCalgorithm and reduce the dependence of the algorithm onthe amount of data the eligibility trace is introduced toimprove WoLF-PHC e eligibility trace can track specificstate-action trajectories of recent visits and then assigncurrent returns to the state-action of recent visits WoLF-PHC algorithm is an extension of Q-learning algorithm Atpresent there are many algorithms combining Q-learningwith eligibility trace is paper improves WoLF-PHC byusing the typical algorithm [10] e qualification trace ofeach state-action is defined as e(s a) Suppose slowast is thecurrent network state slowast and the eligibility trace is updatedin the way shown in formula (8) Among them the traceattenuation factor is λ

e(s d) cλe(s d) sne slowast

cλe(s d) + 1 s slowast1113896 (8)

WoLF-PHC algorithm is an extension of Q-learningalgorithm which belongs to off-policy algorithm It usesgreedy policy when evaluating defense actions for eachnetwork state and occasionally introduces nongreedy policywhen choosing to perform defense actions in order to learnIn order to maintain the off-policy characteristics of WoLF-PHC algorithm the state-action values are updated byformulae (9)ndash(12) in which the defense actions dlowast are

selected for execution slowast because only the recently visitedstatus-action pairs will have significantly more eligibilitytrace than 0 while most other status-action pairs will havealmost none eligibility trace In order to reduce the memoryand running time consumption caused by eligibility traceonly the latest status-action pair eligibility trace can be savedand updated in practical application

Qd slowast dlowast

( 1113857 Qd slowast dlowast

( 1113857 + αρlowast (9)

Qd(s d) Qd(s d) + αρge(s d) (10)

ρlowast Rd slowast d sprime( 1113857 + cmax

dprimeQd sprime dprime( 1113857 minus Qd s

lowast dlowast

( 1113857

(11)

ρg Rd slowast dlowast sprime( 1113857 + cmax

dprimeQd sprime dprime( 1113857 minus max

dQd slowast d( 1113857

(12)

In order to achieve better results the defense decision-making approach based on WoLF-PHC needs to set fourparameters α δ λ and c reasonably (1) e range of thepayoff learning rate is 0lt αlt 1e bigger the representativeα is the more important the cumulative reward is e fasterthe learning speed is the smaller α is and the better thestability of the algorithm is (2)e range of strategy learningrate 0lt δ lt 1 is obtained According to the experiment wecan get a better result when adopting δlδw 4 (3) eattenuation factor λ of eligibility trace is in the range of

Describe andanalyze cyberattack-defenseconfrontation

Model stochastic game

for attack-defense

Extract currentsecurity state andcandidate attack-

defense action

Evaluatepayoff

hellip

helliphellip

How to defend

Decision making

Figure 5 e process of our decision-making approach

Rt+1reward Rt

St+1state St

Ambient

Agent

Figure 4 Q-learning learning mechanism

8 Security and Communication Networks

0lt λlt 1 which is responsible for the credit allocation ofstatus-action It can be regarded as a time scale e greaterthe credit allocated λ to historical status-action the greaterthe credit allocated to historical status-action (4) e rangeof the discount factor 0lt clt 1 represents the defenderrsquospreference for immediate return and future return When c

approaches 0 it means that future returns are irrelevant andimmediate returns are more important When c approaches1 it means immediate returns are irrelevant and futurereturns are more important

Agent in WoLF-PHC is the defender in the stochasticgame model of attack-defense AD minus SGM the game state inagentrsquos state corresponds to AD minus SGM the defense actionin agentrsquos behavior corresponds to AD minus SGM the imme-diate return in agentrsquos immediate return corresponds toAD minus SGM and the defense strategy in agentrsquos strategycorresponds to AD minus SGM

On the basis of the above a specific defense decision-making approach as shown in Algorithm 2 is given e firststep of the algorithm is to initialize the stochastic gamemodel AD minus SGM of attack-defense and the related pa-rameters e network state and attack-defense actions areextracted by Algorithm 1 e second step is to detect thecurrent network state by the defender Steps 3ndash22 are tomake defense decisions and learn online Steps 4-5 are toselect defense actions according to the current strategy steps6ndash14 are to update the benefits by using eligibility traces andsteps 15ndash21 are the new payoffs using mountain climbingalgorithm to update defense strategy

e spatial complexity of the Algorithm 2 mainlyconcentrates on the storage of pairs Rd(s d sprime) such ase(s d) πd(s d) πd(s d) and Qd(s d) e number of statesis |S| |D| and |A| are the numbers of measures taken by thedefender and attacker in each state respectively ecomputational complexity of the proposed algorithm isO(4 middot |S| middot |D| + |S|2 middot |D|) Compared with the recentmethod using evolutionary game model with complexityO(|S|2 middot (|A| + |D|)3) [14] we greatly reduce the computa-tional complexity and increase the practicability of the al-gorithm since the proposed algorithm does not need to solvethe game equilibrium

5 Experimental Analysis

51 Experiment Setup In order to verify the effectiveness ofthis approach a typical enterprise network as shown inFigure 6 is built for experiment Attacks and defenses occuron the intranet with attackers coming from the extranet Asa defender network administrator is responsible for thesecurity of intranet Due to the setting of Firewall 1 andFirewall 2 legal users of the external network can only accessthe web server which can access the database server FTPserver and e-mail server

e simulation experiment was carried out on a PC withIntel Core i7-6300HQ 340GHz 32GB RAM memoryand Windows 10 64 bit operating system e Python 365emulator was installed and the vulnerability information inthe experimental network was scanned by Nessus toolkit asshown in Table 1 e network topology information was

collected by ArcGis toolkit We used the Python language towrite the project code During the experiment we set upabout 25000 times of attack-defense strategy studies eexperimental results were analyzed and displayed usingMatlab2018a as described in Section 53

Referring to MIT Lincoln Lab attack-defense behaviordatabase attack-defense templates are constructed Attack-defense maps are divided into attack maps and defense mapsby using attacker host A web server W database server DFTP server F and e-mail server E In order to facilitatedisplay and description attack-defense maps are dividedinto attack maps and defense maps as shown in Figures 7and 8 respectively e meaning of defense action in thedefense diagram is shown in Table 2

52 Construction of the Experiment Scenario AD-SGM(1) N (attacker defender) are players participating in

the game representing cyber attackers and defendersrespectively

Vulnerabilityscanning system

VDS

Internet

DMZ

FTP server

Firewall 1Trusted zone

Firewall 2Database

serverE-mailserver

Web server

IDS

Attacker

Extranet

Intranet

Figure 6 Experimental network topology

Table 1 Network vulnerability information

Attackidentifier Host CVE Target

privilegetid1 Web server CVE-2015-1635 usertid2 Web server CVE-2017-7269 roottid3 Web server CVE-2014-8517 roottid4 FTP server CVE-2014-3556 roottid5 E-mail server CVE-2014-4877 roottid6 Database server CVE-2013-4730 usertid7 Database server CVE-2016-6662 root

S0(A root)

S1(W user)

S2(W root)S5(D user)

S6(D root)

tid1

tid2

tid3

tid4

tid5

tid6

tid7

tid8

tid9

S3(F root)

S4(E root)

Figure 7 Attack graph

Security and Communication Networks 9

(2) e state set of stochastic game is S (s0 s1

s2 s3 s4 s5 s6) which consists of network state and isextracted from the nodes shown in Figures 7 and 8

(3) e action set of the defender is D (D0 D1

D2 D3 D4 D5 D6) D0 NULL D1 d11113864 d2D2

d3 d41113864 1113865D3 d1 d5 d61113864 1113865D4 d1 d5 d61113864 1113865D5 d11113864

d2 d7D6 d3 d41113864 1113865 and the edges are extractedfrom Figure 8

(4) Quantitative results of immediate returns Rd(si

d sj) of defenders [16 26] are

Rd s0NULL s0( 1113857 Rd s0NULL s1( 1113857 Rd s0NULL s2( 1113857( 1113857 (0 minus 40 minus 59)

Rd s1 d1 s0( 1113857 Rd s1 d1 s1( 1113857 Rd s1 d1 s2( 1113857 Rd s1 d2 s0( 1113857 Rd s1 d2 s1( 1113857 Rd s1 d2 s2( 1113857( 1113857

(40 0 minus 29 5 minus 15 minus 32)

( Rd s2 d3 s0( 1113857 Rd s2 d3 s1( 1113857 Rd s2 d3 s2( 1113857 Rd s2 d3 s3( 1113857 Rd s2 d3 s4( 1113857 Rd s2 d3 s5( 1113857

Rd s2 d4 s0( 1113857 Rd s2d4 s1( 1113857 Rd s2 d4 s2( 1113857 Rd s2 d4 s3( 1113857 Rd s2 d4 s4( 1113857 Rd s2 d4 s5( 11138571113857

(24 9 minus 15 minus 55 minus 49 minus 65 19 5 minus 21 minus 61 minus 72 minus 68)

( Rd s3 d1 s2( 1113857 Rd s3 d1 s3( 1113857 Rd s3 d1 s6( 1113857 Rd s3 d5 s2( 1113857 Rd s3 d5 s3( 1113857 Rd s3 d5 s6( 1113857

Rd s3 d6 s2( 1113857 Rd s3 d6 s3( 1113857 Rd s3 d6 s6( 11138571113857

(21 minus 16 minus 72 15 minus 23 minus 81 minus 21 minus 36 minus 81)

( Rd s4 d1 s2( 1113857 Rd s4 d1 s4( 1113857 Rd s4 d1 s6( 1113857 Rd s4 d5 s2( 1113857 Rd s4 d5 s4( 1113857 Rd s4 d5 s6( 1113857

Rd s4 d6 s2( 1113857 Rd s4 d6 s4( 1113857 Rd s4 d6 s6( 11138571113857

(26 0 minus 62 11 minus 23 minus 75 9 minus 25 minus 87)

( Rd s5 d1 s2( 1113857 Rd s5 d1 s5( 1113857 Rd s5 d1 s6( 1113857 Rd s5 d2 s2( 1113857 Rd s5 d2 s5( 1113857 Rd s5 d2 s6( 1113857

Rd s5 d7 s2( 1113857 Rd s5 d7 s5( 1113857 Rd s5 d7 s6( 11138571113857

(29 0 minus 63 11 minus 21 minus 76 2 minus 27 minus 88)

( Rd s6 d3 s3( 1113857 Rd s6 d3 s3( 1113857 Rd s6 d3 s5( 1113857 Rd s6 d3 s6( 1113857 Rd s6 d4 s3( 1113857

Rd s6 d4 s4( 1113857 Rd s6 d4 s5( 1113857 Rd s6 d4 s6( 11138571113857

(minus 23 minus 21 minus 19 minus 42 minus 28 minus 31 minus 24 minus 49)

(13)

Table 2 Defense action description

Atomic defense action d1 d2 d3 d4 d5 d6 d7

Renew root data Limit SYNICMPpackets Install Oracle patches Reinstall listener program Uninstall delete Trojan Limit access toMDSYS Restart database server Delete suspicious account Add physical resource Repair database Limit packets from ports

S0(A root)

S1(W user)

S2(W root)S5(D user)

S6(D root)

d3

d3

d4

d1d2d7

S3(F root)

S4(E root)d1d5d6

d1d5d6d4

d3

d1d2

Figure 8 Defense graph

10 Security and Communication Networks

(5) In order to detect the learning performance of theAlgorithm 2 more fully the defenderrsquos state-actionpayoffQd(si d) is initialized with a unified 0 withoutintroducing additional prior knowledge

(6) Defenderrsquos defense strategy adopts average strategyto initialize that is πd(sk d1) πd(sk d2) middot middot middot

πd(sk dm) 1113936disinD(sk)πd(sk d) 1forallsk isin S where noadditional prior knowledge is introduced

53 Testing and Analysis e experiment in this section hasthree purposes e first is to test the influence of differentparameter settings on the proposed Algorithm 2 so as to findout the experimental parameters suitable for this scenarioe second is to compare this approach with the existingtypical approaches to verify the advancement of this ap-proach e third is to test the effectiveness of WoLF-PHCalgorithm improvement based on eligibility trace

From Figures 7 and 8 we can see that the state of attack-defense strategy selection is the most complex and repre-sentative erefore the performance of the algorithm isanalyzed by the experimental state selection and the othernetwork state analysis approaches are the same

531 Parameter Test and Analysis Different parameters willaffect the speed and effect of learning At present there is norelevant theory to determine the specific parameters InSection 4 the relevant parameters are preliminarily ana-lyzed On this basis the different parameter settings arefurther tested to find the parameter settings suitable for thisattack-defense scenario Six different parameter settingswere tested Specific parameter settings are shown in Table 3In the experiment the attackerrsquos initial strategy is randomstrategy and the learning mechanism is the same as theapproach in this paper

e probability of the defenderrsquos choice of defense ac-tions and sums in state is shown in Figure 9 e learningspeed and convergence of the algorithm under differentparameter settings can be observed from Figure 9 whichshows that the learning speed of settings 1 3 and 6 is fasterand the best strategy can be obtained after learning less than1500 times under the three settings but convergence of 3and 6 is poor Although the best strategy can be learned bysettings 3 and 6 there will be oscillation afterwards and thestability of setting 1 is not suitable

Defense payoff can represent the degree of optimizationof the strategy In order to ensure that the payoff value doesnot reflect only one defense result the average of 1000defense gains is taken and the change of the average payoffper 1000 defense gains is shown in Figure 10 As can be seenfrom Figure 10 the benefits of Set 3 are significantly lowerthan those of other settings but the advantages and dis-advantages of other settings are difficult to distinguish Inorder to display more intuitively the average value of 25000defense gains calculated under different settings in Figure 10is shown in Figure 11 From Figure 11 we can see that theaverage value of settings 1 and 5 is higher For furthercomparison the standard deviation of settings 1 and 5 is

calculated one step on the basis of the average value to reflectthe discreteness of the gains As shown in Figure 12 thestandard deviations of setting 1 and setting 6 are smallMoreover the result of setting 1 is smaller than setting 6

In conclusion setting 1 of the six sets of parameters is themost suitable for this scenario Since setting 1 has achievedan ideal effect and can meet the experimental requirementsit is no longer necessary to further optimize the parameters

532 Comparisons In this section stochastic game [16] andevolutionary game [20] are selected to conduct comparativeexperiments with this approach According to the differenceof attackerrsquos learning ability this section designs two groupsof comparative experiments In the first group the attackerrsquoslearning ability is weak and will not make adjustments to theattack-defense results In the second group the attackerrsquoslearning ability is strong and adopts the same learningmechanism as the approach in this paper In both groupsthe initial strategies of attackers were random strategies

In the first group of experiments the defense strategy ofthis approach is as shown in Figure 9(a) e defense strategiescalculated by the approach [16] are πd(s2 d3) 07πd(s2 d4) 03 e defense strategies of [20] are evolu-tionarily stable and balanced And the defense strategies of [20]are πd(s2 d3) 08 πd(s2 d4) 02 Its average earnings per1000 times change as shown in Figure 13

From the results of the strategies and benefits of the threeapproaches we can see that the approach in this paper canlearn from the attackerrsquos strategies and adjust to the optimalstrategy so the approach in this paper can obtain the highestbenefits Wei et al [16] adopted a fixed strategy whenconfronting any attacker When the attacker is constrainedby bounded rationality and does not adopt Nash equilibriumstrategy the benefit of this approach is low Although thelearning factors of both attackers and defenders are takeninto account in [20] the parameters required in the modelare difficult to quantify accurately which results in thedeviation between the final results and the actual results sothe benefit of the approach is still lower than that of theapproach in this paper

In the second group of experiments the results of[16 20] are still as πd(s2 d3) 07 πd(s2 d4) 03 andπd(s2 d3) 08 πd(s2 d4) 02 e decision making ofthis approach is shown in Figure 14 After about 1800 timesof learning the approach achieves stability and converges tothe same defense strategy as that of [16] As can be seen fromFigure 15 the payoff of [20] is lower than that of other twoapproaches e average payoff of this approach in the first2000 defenses is higher than that of [26] and then it is almostthe same as that of [26] Combining Figures 14 and 15 wecan see that the learning attacker cannot get Nash equi-librium strategy at the initial stage and the approach in thispaper is better than that in [26] When the attacker learns toget Nash equilibrium strategy the approach in this paper canconverge to Nash equilibrium strategy At this time theperformance of this approach is the same as that in [26]

In conclusion when facing the attackers with weaklearning ability the approach in this paper is superior to that

Security and Communication Networks 11

Table 3 Dierent parameter settings

Set α δl δw λ c

1 001 0004 0001 001 0012 01 0004 0001 001 0013 001 0004 0001 001 014 001 0004 0001 01 0015 001 004 001 001 0016 001 0008 0001 001 001

0 2000 4000 6000s2 defense number

s2 d

efen

se re

turn

10

08

06

04

02

00

d3d4

(a)

s2 d

efen

se re

turn

10

08

06

04

02

000 2000 4000 6000

s2 defense number

d3d4

(b)

s2 d

efen

se re

turn

10

08

06

04

02

000 2000 4000 6000

s2 defense number

d3d4

(c)

s2 d

efen

se re

turn

10

08

06

04

02

000 2000 4000 6000

s2 defense number

d3d4

(d)

s2 d

efen

se re

turn

10

08

06

04

02

000 2000 4000 6000

s2 defense number

d3d4

(e)

s2 d

efen

se re

turn

10

08

06

04

02

000 2000 4000 6000

s2 defense number

d3d4

(f )

Figure 9 Defense decision making under dierent parameter settings (a) Defense decision under setting 1 (b) Defense decision undersetting 2 (c) Defense decision under setting 3 (d) Defense decision under setting 4 (e) Defense decision under setting 5 (f ) Defensedecision under setting 6

5 10 15 20 25s2 defense number (times1000)

s2 d

efen

se re

turn

ndash10

ndash15

ndash20

ndash25

ndash30

Set 1Set 2Set 3

Set 4Set 5Set 6

Figure 10 Defense payo under dierent parameter settings

12 Security and Communication Networks

in [16 20] When facing an attacker with strong learningability if the attacker has not obtained Nash equilibriumthrough learning this approach is still better than [16 20] Ifthe attacker obtains Nash equilibrium through learning thispaper can also obtain the same Nash equilibrium strategy as[26] and obtain its phase e same eect is superior to thatin [20]

533 Test Comparison with and without Eligibility Traceis section tests the actual eect of the eligibility traces onthe Algorithm 2 e eect of eligibility traces on strategyselection is shown in Figure 16 from which we can see thatthe learning speed of the algorithm is faster when thequalied trace is available After 1000 times of learning the

algorithm can converge to the optimal strategy When thequalied trace is not available the algorithm needs about2500 times of learning to converge

Average earnings per 1000 times change as shown inFigure 17 from which we can see that the benets of thealgorithm are almost the same when there is or not anyqualied trace after convergence From Figure 17 we can seethat 3000 defenses before convergence have higher returnsfrom qualied trails than those from unqualied trails Inorder to further verify this the average of the rst 3000defense gains under qualied trails and unqualied trails iscounted 10 times each respectively e results are shown inFigure 18 which further proves that in the preconvergencedefense phase qualied traces are better than unqualiedtraces

e addition of eligibility trace accelerates the learningspeed but also brings additional memory and computingoverhead In the experiment only 10 state-action pairs thatwere recently accessed were saved which eectively reduced

Set 1 Set 2 Set 3 Set 4 Set 5 Set 6

ndash80

ndash85

ndash90

Figure 11 Average return under dierent parameter settings

Set 1 Set 2 Set 3 Set 4 Set 5 Set 6

80

75

Figure 12 Standard deviation of defense payo under dierentparameter settings

ndash10

ndash20

ndash30

ndash40

s2 d

efen

se re

turn

5 10 15 20 25

Our methodReference [12]Reference [16]

s2 defense number (times1000)

Figure 13 Comparison of defense payo change

0 2000 4000 6000s2 defense number

s2 d

efen

se st

rate

gy

08

06

04

02

00

d3d4

Figure 14 Defense decision making of our approach

Our methodReference [12]Reference [16]

5 10 15 20 25s2 defense number (times1000)

s2 d

efen

se re

turn

ndash10

ndash20

ndash30

ndash40

Figure 15 Comparison of defense payo change

Security and Communication Networks 13

0 2000 4000 6000s2 defense number

s2 d

efen

se st

rate

gy

100

075

050

025

000

Eligibility trace d3Eligibility trace d4

No eligibility trace d3No eligibility trace d4

Figure 16 Comparison of defense decision making

Eligibility traceNo eligibility trace

0 5 10 15 20 25s2 defense number (times1000)

s2 d

efen

se re

turn

ndash10

ndash20

ndash15

ndash30

ndash25

Figure 17 Comparison of defense payo change

Eligibility traceNo eligibility trace

1 2 3 4 5 6 7 8 9 10s2 defense number (times3000)

s2 av

erag

e def

ense

retu

rn

ndash12

ndash14

ndash16

ndash18

ndash20

Figure 18 Average payo comparison of the rst 3000 defenses

14 Security and Communication Networks

the increase of memory consumption In order to test thecomputational cost of qualified trail the time of 100000defense decisions made by the algorithm was counted for 20times with and without qualified trail e average of 20times was 951s for qualified trail and 374 s for unqualifiedtrail Although the introduction of eligibility traces willincrease the decision-making time by nearly 25 times thetime required for 100000 decisions after the introduction ofeligibility traces is still only 951 s which can still meet thereal-time requirements

In summary the introduction of eligibility trace at theexpense of a small amount of memory and computingoverhead can effectively increase the learning speed of thealgorithm and improve the defense gains

54 Comprehensive Comparisons is approach is com-pared with some typical research results as shown in Table 4[3 12 14 16] is based on the assumption of complete ra-tionality e equilibrium strategy obtained by it is difficultto appear in practice and has a low guiding effect on actualdefense decision making [20] and this paper have morepracticability on the premise of bounded rationality hy-pothesis but [20] is based on the theory of biologicalevolution and mainly studies population evolution e coreof game analysis is not the optimal strategy choice of playersbut the strategy adjustment process trend and stability ofgroup members composed of bounded rational players andthe stability here refers to group members is approach isnot suitable for guiding individual real-time decisionmaking because the proportion of specific strategies isunchanged and not the strategy of a player On the contrarythe defender of the proposed approach adopts reinforcementlearning mechanism which is based on systematic feedbackto learn in the confrontation with the attacker and is moresuitable for the study of individual strategies

6 Conclusions and Future Works

In this paper cyber attack-defense confrontation is ab-stracted as a stochastic game problem under the restrictionof limited rationality A host-centered attack-defense graphmodel is proposed to extract network state and attack-de-fense action and an algorithm to generate attack-defensegraph is designed to effectively compress the game statespace e WoLF-PHC-based defense decision approach isproposed to overcome the problem which enables defendersunder bounded rationality to make optimal choices whenfacing different attackers e lattice improves the WoLF-

PHC algorithm speeds up the defenderrsquos learning speed andreduces the algorithmrsquos dependence on data is approachnot only satisfies the constraints of bounded rationality butalso does not require the defender to know too much in-formation about the attacker It is a more practical defensedecision-making approach

e future work is to further optimize the winning andlosing criteria of WoLF-PHC algorithm for specific attack-defense scenarios in order to speed up defense learning andincrease defense gains

Data Availability

e data that support the findings of this study are notpublicly available due to restrictions as the data containsensitive information about a real-world enterprise networkAccess to the dataset is restricted by the original ownerPeople who want to access the data should send a request tothe corresponding author who will apply for permission ofsharing the data from the original owner

Conflicts of Interest

e authors declare that they have no conflicts of interestregarding the publication of this paper

Acknowledgments

is work was supported by the National High TechnologyResearch and Development Program of China (863 Pro-gram) (2014AA7116082 and 2015AA7116040) and NationalNatural Science Foundation of China (61902427) In par-ticular we thank Junnan Yang and Hongqi Zhang of ourresearch group for their previous work on stochastic gameWe sincerely thank them for their help in publishing thispaper

References

[1] Z Tian W Shi Y Wang et al ldquoReal time lateral movementdetection based on evidence reasoning network for edgecomputing environmentrdquo IEEE Transactions on IndustrialInformatics vol 15 no 7 pp 4285ndash4294 2019

[2] Y Wang J Yu W Qu H Shen and X Cheng ldquoEvolutionarygame model and analysis approaches for network groupbehaviorrdquo Chinese Journal of Computers vol 38 no 2pp 282ndash300 2015

[3] J Liu H Zhang and Y Liu ldquoResearch on optimal selection ofmoving target defense policy based on dynamic game withincomplete informationrdquo Acta Electronica Sinica vol 46no 1 pp 82ndash29 2018

Table 4 Comprehensive comparison of existing approaches

Ref Game type Model assumption Learning mechanism Game process Applicable object Practicability[3] Stochastic game Rationality mdash Multistage Personal Bad[12] Static game Rationality mdash Single-stage Personal Bad[14] Dynamic game Rationality mdash Multistage Personal Bad[16] Stochastic game Rationality mdash Multistage Personal Bad[20] Evolutionary game Bounded rationality Biological evolution mdash Group GoodOur approach Stochastic game Bounded rationality Reinforcement learning Multistage Personal Good

Security and Communication Networks 15

[4] Z Tian X Gao S Su J Qiu X Du and M GuizanildquoEvaluating reputation management schemes of internet ofvehicles based on evolutionary game theoryrdquo IEEE Trans-actions on Vehicular Technology vol 68 no 6 pp 5971ndash59802019

[5] Q Zhu and S Rass ldquoGame theory meets network securitya tutorialrdquo in Proceedings of the ACM SIGSAC Conference onComputer and Communications Security pp 2163ndash2165Toronto Canada October 2018

[6] X Li C Zhou Y C Tian and Y Qin ldquoA dynamic decision-making approach for intrusion response in industrial controlsystemsrdquo IEEE Transactions on Industrial Informatics vol 15no 5 pp 2544ndash2554 2019

[7] H Hu Y Liu H Zhang and R Pan ldquoOptimal networkdefense strategy selection based on incomplete informationevolutionary gamerdquo IEEE Access vol 6 pp 29806ndash298212018

[8] J Li G Kendall and R John ldquoComputing Nash equilibriaand evolutionarily stable states of evolutionary gamesrdquo IEEETransactions on Evolutionary Computation vol 20 no 3pp 460ndash469 2016

[9] M Bowling and M Veloso Multiagent Learning Usinga Variable Learning Rate Elsevier Science Publishers LtdAmsterdam Netherlands 2002

[10] J Harb and D Precup ldquoInvestigating recurrence and eligi-bility traces in deep Q-networksrdquo httpsarxivorgabs170405495 2017

[11] C T Do N H Tran and C Hong ldquoGame theory for cybersecurity and privacyrdquo ACM Computing Surveys vol 50 no 2p 30 2017

[12] Y-L Liu D-G Feng L-HWu and Y-F Lian ldquoPerformanceevaluation of worm attack and defense strategies based onstatic Bayesian gamerdquo Chinese Journal of Software vol 23no 3 pp 712ndash723 2012

[13] Y Li D E Quevedo S Dey and L Shi ldquoA game-theoreticapproach to fake-acknowledgment attack on cyber-physicalsystemsrdquo IEEE Transactions on Signal and Information Pro-cessing over Networks vol 3 no 1 pp 1ndash11 2017

[14] H Zhang L Jiang S Huang J Wang and Y Zhang ldquoAttack-defense differential game model for network defense strategyselectionrdquo IEEE Access vol 7 pp 50618ndash50629 2019

[15] A Afrand and S K Das ldquoPreventing DoS attacks in wirelesssensor networks a repeated game theory approachrdquo InternetJournal of Cyber Security vol 5 no 2 pp 145ndash153 2007

[16] J Wei B Fang and Z Tian ldquoResearch on defense strategiesselection based on attack-defense stochastic game modelrdquoChinese Journal of Computer Research and Developmentvol 47 no 10 pp 1714ndash1723 2010

[17] C Wang X Cheng and Y Zhu ldquoA Markov game model ofcomputer network operationrdquo Chinese Systems Engineering-gteory and Practice vol 34 no 9 pp 2402ndash2410 2014

[18] J Hofbauer and K Sigmund ldquoEvolutionary game dynamicsrdquoBulletin of the American Mathematical Society vol 40 no 4pp 479ndash519 2011

[19] Y Hayel and Q Zhu ldquoEpidemic protection over heteroge-neous networks using evolutionary Poisson gamesrdquo IEEETransactions on Information Forensics and Security vol 12no 8 pp 1786ndash1800 2017

[20] J Huang and H Zhang ldquoImproving replicator dynamicevolutionary game model for selecting optimal defensestrategiesrdquo Chinese Journal on Communications vol 38 no 1pp 170ndash182 2018

[21] H Hu Y Liu H Zhang and Y Zhang ldquoSecurity metricmethods for network multistep attacks using AMC and big

data correlation analysisrdquo Security and CommunicationNetworks vol 2018 Article ID 5787102 14 pages 2018

[22] H Hu Y Liu Y Yang H Zhang and Y Zhang ldquoNew in-sights into approaches to evaluating intention and path fornetwork multistep attacksrdquo Mathematical Problems in En-gineering vol 2018 Article ID 4278632 13 pages 2018

[23] H Hu H Zhang Y Liu and Y Wang ldquoQuantitative methodfor network security situation based on attack predictionrdquoSecurity and Communication Networks vol 2017 Article ID3407642 19 pages 2017

[24] M Zimmer and S Doncieux ldquoBootstrapping Q-learning forrobotics from neuro-evolution resultsrdquo IEEE Transactions onCognitive and Developmental Systems vol 10 no 1pp 102ndash119 2018

[25] R S Sutton and A G Barto ldquoReinforcement learning anintroduction Bradford bookrdquo Machine Learning vol 16no 1 pp 285-286 2005

[26] H Zhang J Wang D Yu J Han and T Li ldquoActive defensestrategy selection based on static Bayesian gamerdquo in Pro-ceedings of IET International Conference on CyberspaceTechnology pp 1ndash7 London UK October 2016

16 Security and Communication Networks

International Journal of

AerospaceEngineeringHindawiwwwhindawicom Volume 2018

RoboticsJournal of

Hindawiwwwhindawicom Volume 2018

Hindawiwwwhindawicom Volume 2018

Active and Passive Electronic Components

VLSI Design

Hindawiwwwhindawicom Volume 2018

Hindawiwwwhindawicom Volume 2018

Shock and Vibration

Hindawiwwwhindawicom Volume 2018

Civil EngineeringAdvances in

Acoustics and VibrationAdvances in

Hindawiwwwhindawicom Volume 2018

Hindawiwwwhindawicom Volume 2018

Electrical and Computer Engineering

Journal of

Advances inOptoElectronics

Hindawiwwwhindawicom

Volume 2018

Hindawi Publishing Corporation httpwwwhindawicom Volume 2013Hindawiwwwhindawicom

The Scientific World Journal

Volume 2018

Control Scienceand Engineering

Journal of

Hindawiwwwhindawicom Volume 2018

Hindawiwwwhindawicom

Journal ofEngineeringVolume 2018

SensorsJournal of

Hindawiwwwhindawicom Volume 2018

International Journal of

RotatingMachinery

Hindawiwwwhindawicom Volume 2018

Modelling ampSimulationin EngineeringHindawiwwwhindawicom Volume 2018

Hindawiwwwhindawicom Volume 2018

Chemical EngineeringInternational Journal of Antennas and

Propagation

International Journal of

Hindawiwwwhindawicom Volume 2018

Hindawiwwwhindawicom Volume 2018

Navigation and Observation

International Journal of

Hindawi

wwwhindawicom Volume 2018

Advances in

Multimedia

Submit your manuscripts atwwwhindawicom

Page 2: Research Article - Hindawi Publishing Corporationdownloads.hindawi.com/journals/scn/2019/3038586.pdf · Afrand and Das [15] established a repeated game model between the intrusion

based on stochastic game Section 3 analyses the difficultiesof studying cyber attack-defense stochastic game underbounded rationality and the idea of solving the problem inthis paper Moreover Section 3 constructs attack-defensestochastic game model under bounded rationality con-straints and proposes a host-centered attack-defense graphmodel to extract network state and attack-defense action ingame model Bowling et al [9] first proposed WoLF-PHCfor multiagent learning Section 4 further improves WoLF-PHC algorithm based on eligibility trace for promoting thelearning speed of defenders as well as reducing the de-pendence of the algorithm on data Using the improvedintelligent learning algorithm WoLF-PHC (Wolf MountainClimbing Strategy) to analyze the stochastic game model inthe previous section we design the defense decision-makingalgorithm Section 5 verifies the effectiveness of the proposedapproach through experiments Section 6 summarizes thefull text and discusses future research

ere are three main contributions of this paper

(1) e extraction of network state and attack-defenseactions is one of the keys to the construction ofstochastic game model e network state of theexisting stochastic game model contains the securityelements of all nodes in the network and there isa ldquostate explosionrdquo problem In order to solve thisproblem a host-centered attack-defense graphmodel is proposed and an attack-defense graphgeneration algorithm is designed which effectivelycompresses the game state space

(2) Limited rationality means that both sides of attack-defense need to find the optimal strategy throughtrial and error and learning It is a key point todetermine the learning mechanism of players In thispaper reinforcement learning is introduced intostochastic game which expands stochastic gamefrom complete rationality to limited rationalityDefenders use WoLF-PHC to learn the game inadversarial attack-defense so as to make the bestchoice for current attackers Most of the existingbounded rationality games use biological evolu-tionary mechanism to learn and take the group as theresearch object Compared with the existing boun-ded rationality games the approach proposed in thispaper reduces the exchange of information amonggame players and is more suitable for guiding in-dividual defense decision making

(3) WoLF-PHC algorithm is improved based on eligi-bility trace [10] which speeds up the learning speedof defenders reduces the dependence of the algo-rithm on data and proves the effectiveness of theapproach through experiments

2 Related Works

Some progress has been made in cyber security researchbased on game theory at home and abroad but most of thecurrent studies are based on the assumption of completerationality [11] Under complete rationality according to the

decision-making times of both sides in the game process itcan be divided into single-stage game and multistage gamee research of single-stage cyber attack-defense gamestarted earlier Liu et al [12] used static game theory toanalyze the effectiveness of worm virus attack-defensestrategy Li et al [13] established a non-cooperative gamemodel between attackers and sensor trust nodes and gave theoptimal attack strategy based on Nash equilibrium In cyberattack-defense although part of the simple attack-defenseconfrontation belongs to single-stage game in most sce-narios the process of attack-defense often lasts for manystages so multistage cyber attack-defense game becomesa trend Zhang et al [14] regarded the defender as the sourceof signal transmission and the attacker as the receiver andconstructed a multistage attack-defense process using dif-ferential game Afrand and Das [15] established a repeatedgame model between the intrusion detection system andwireless sensor nodes and analyzed the forwarding strategyof node packets Although the above results can be used toanalyze multistage attack-defense confrontation the statetransition between stages is not only affected by attack-defense action but also by the interference of system op-erating environment and other external factors which hasrandomness e above results ignore this randomness andweaken its guiding value

Stochastic game is a combination of game theory andMarkov theory It is a multistage game model It can ac-curately analyze the impact of randomness on attack-defenseprocess by using the Markov process to describe the statetransition Wei et al [16] abstracted the cyber attack-defenseas a stochastic game problem and gave a more scientific andaccurate quantitative approach of attack-defense benefitsapplicable to the stochastic game model of attack-defenseWang et al [17] used stochastic game theory to study thenetwork confrontation problem Convex analysis theory wasused to prove the existence of equilibrium and the equi-librium solution was transformed into a nonlinear pro-gramming problem Based on incomplete informationstochastic game Liu et al [3] proposed the decision-makingapproach formoving targets defense All the aforementionedschemes are based on the assumption of complete ratio-nality which is too strict for both sides of attack-defense Inmost cases both sides of attack-defense are only limitedrationality level which leads to the deviation of the aboveresearch results in the analysis of attack-defense gameerefore it has important research value and practicalsignificance to explore the bounded rationality of cyberattack-defense game law

Limited rationality means that both sides of attack-de-fense will not find the optimal strategy at the beginningeywill learn the game of attack-defense in the game of attack-defense e appropriate learning mechanism is the key towin in the game At present the research of limited rationalattack-defense game is mainly centered on evolutionarygame [18] Hayel and Zhu [19] established an evolutionaryPoisson game model between malicious software and an-tivirus programs and used the replication dynamic equationto analyze the antivirus program strategy Huang and Zhang[20] improved the traditional replication dynamic equation

2 Security and Communication Networks

by introducing incentive coefficient and improved the cal-culation approach of replication dynamic rate Based on thisan evolutionary game model was constructed for defenseEvolutionary game takes the group as the research objectadopts the biological evolution mechanism and completesthe learning by imitating the advantage strategy of othermembers In evolutionary game there is too much in-formation exchange among players and it mainly studies theadjustment process trend and stability of attack-defensegroup strategy which is not conducive to guiding the real-time strategy selection of individual members

Reinforcement learning is a classic online intelligentlearning approach Its players learn independently throughenvironmental feedback Compared with evolutionary bi-ology reinforcement learning is more suitable for guidingindividual decision making is paper introduces re-inforcement-learning mechanism into stochastic gameexpands stochastic game from complete rationality to finiterationality and uses bounded rationality stochastic game toanalyze cyber attack-defense On the one hand comparedwith the existing attack-defense stochastic game this ap-proach uses bounded rationality hypothesis which is morerealistic On the other hand compared with evolutionarygame this approach uses reinforcement learning mecha-nism which is more suitable for guiding real-time defensedecision making

3 Modeling of Attack-Defense ConfrontationUsing Stochastic Game Theory

31 Description and Analysis of Cyber Attack-DefenseConfrontation Cyber attack-defense confrontation isa complex problem but from the level of strategy selection itcan be described as a stochastic game problem as depicted inFigure 1 We take the DDoS attack of using Sadmind vul-nerability of Soloris platform as an example e attack isimplemented through multiple steps including IP sweepSadmind ping Sadmind exploit Installing DDoS softwareand Conducting DDoS attack Each attack step can lead tochange of security state of network

Taking the first step as an example the initial networkstate is denoted as S0 (H1 none) It means that the attackerAlice does not have any privileges of host H1 en attackerAlice implemented an IP sweep attack onH1 through its openport 445 and gained the User privilege of H1 is networkstate is denoted as S1 (H1 User) Afterwards if the defenderBob selected and implemented a defense strategy from thecandidate strategy set Reinstall Listener program Installpatches Close unused port then the network state istransferred back to S0 otherwise the network may continueto evolve to another more dangerous state S3

e continuous time axis is divided into time slices andeach time slice contains only one network state e networkstatemay be the same in different time slices Each time slice isa game of attack-defense Both sides detect the current net-work state then select the attack-defense actions according tothe strategy and get immediate returns Attack-defensestrategies are related to network state e network systemtransfers from one state to another under the candidate action

of the attacking and defending sides e transition betweennetwork states is not only affected by attack-defense actionsbut also by factors such as system operating environment andexternal environment which is randome goal of this paperis to enable defenders to obtain higher long-term benefits inattack-defense stochastic game

Both sides of attack-defense can predict the existence ofNash equilibrium so Nash equilibrium is the best strategyfor both sides From the description of complete rationalityin the introduction we can see that the requirement ofcomplete rationality for both sides of attack-defense is toostrict and both sides of attacker and defender will beconstrained by limited rationality in practice Limited ra-tionality means that at least one of the attacking anddefending sides will not adopt Nash equilibrium strategy atthe beginning of the game which means that it is difficult forboth sides to find the optimal strategy in the early stage of thegame and they need to constantly adjust and improve thestrategy for their opponents It means that the game equi-librium is not the result of one choice but that both sides ofthe attack-defense sides are constantly learning to achieve inthe course of the attack-defense confrontation and becauseof the influence of learning mechanism may deviate againeven if it reaches equilibrium

From the above analysis we can see that learningmechanism is the key to win the game of limited rationalityFor defense decision making the learning mechanism ofattack-defense stochastic game under bounded rationalityneeds to satisfy the following two requirements (1) Con-vergence of learning algorithm attacker strategy underbounded rationality has dynamic change characteristics andbecause of the interdependence of attack-defense strategythe defender must learn the corresponding optimal strategywhen facing different attack strategies to ensure that he isinvincible (2) e learning process does not need too muchattacker information both sides of the cyber attack-defensehave opposition of objectives and non-cooperation andboth sides will deliberately hide their key information If toomuch opponent information is needed in the learningprocess the practicability of the learning algorithm will bereduced

WoLF-PHC algorithm is a typical strategy gradientintelligent learning approach which enables defenders tolearn through network feedback without too much in-formation exchange with attackers e introduction ofWoLF mechanism ensures the convergence of WoLF-PHCalgorithm [9] After the attacker learns to adopt Nashequilibrium strategy WoLF mechanism enables the de-fender to converge to the corresponding Nash equilibriumstrategy while the attacker has not yet learned Nash equi-librium strategy and WoLF mechanism enables the de-fender to converge to the corresponding optimal defensestrategy In conclusion WoLF-PHC algorithm can meet thedemand of attack-defense stochastic game under boundedrationality

32 StochasticGameModel forAttack-Defense emappingrelationship between cyber attack-defense and stochastic

Security and Communication Networks 3

game model is depicted in Figure 2 Stochastic game consistsof attack-defense game in each state and transition modelbetween states e two key elements of ldquoinformationrdquo andldquogame orderrdquo are assumed Constrained by bounded ra-tionality the attackerrsquos historical actions and the attackerrsquospayo function are set as the attackerrsquos private information

Herein we use the above example in Figure 1 to explainFigure 2 the security state corresponds to S0 (H1 none) andS1 (H1 User) in this case e candidate strategy set againstDDoS attack is Reinstall Listener program Install patchesClose unused port Network state is the common knowledgeof both sides Because of the non-cooperation between theattack and defense sides the two sides can only observe eachotherrsquos actions through the detection network which willdelay the execution time for at least one time slice so theattack-defense sides are acting at the same time in each timeslice e ldquosimultaneousrdquo here is a concept of informationrather than a concept of time that is the choice of attack-defense sides may not be based on the concept of time At thesame time because the attack-defense sides do not know theother sidersquos choice when choosing action they are consid-ered to be simultaneous action

Construct the network state transition model Useprobability to express the randomness of network statetransition Because the current network state is mainly re-lated to the previous network state the rst-order Markov isused to represent the state transition relationship in whichthe network state is the attack-defense action Because bothsides of attacker and defender are constrained by boundedrationality in order to increase the generality of the modelthe transfer probability is set as the unknown information ofboth sides of attack-defense

On the basis of the above a gamemodel is constructed tosolve the defense decision-making problem

Denition 1 e attack-defense stochastic gamemodel (AD-SGM) is a six-tuple AD minus SGM (N SD R Q π) in which

(1) N (attacker defender) are the two players whoparticipate in the game representing cyber attackersand defenders respectively

(2) S (s1 s2 sn) is a set of stochastic game stateswhich is composed of network states (see Section 33for the specic meaning and generation approach)

Optimalstrategy of

St1

Optimalstrategy of

Sti

Optimalstrategy of

Sprimet1

CurrentstateSprimet2

Optimalstrategy of

SPrimet2

Game Game

Game

t1 t2 ti Time

Defender

CurrentstateSt1

CurrentstateSPrimet2

Currentstate

StiAttacker

DefenderAttacker

DefenderAttacker DefenderAttacker

Evolution

Evolution

∙ ∙ ∙

∙ ∙ ∙

∙ ∙ ∙

∙ ∙ ∙

∙ ∙ ∙

∙ ∙ ∙

∙ ∙ ∙

Game

Figure 1 e game process and strategy selection of attack-defense confrontation

Gamestate

Securitystate

Candidatestrategy

Attack-defenseequilibrium

Cost and benefitanalysis

Attacker and defender

Game plan

Game equilibrium

Payoff function

Players

Cyber attack-defense Stochastic game model

Figure 2 e mapping relationship between cyber attack-defenseand stochastic game model

4 Security and Communication Networks

(3) D (D1 D2 Dn) is the action set of the de-fender in which Dk d1 d2 dm1113864 1113865 is the actionset of the defender in the game state Sk

(4) Rd(si d sj)is the immediate return from state si to sjafter the defender performs action d

(5) Qd(si d) is the state-action payoff function of thedefender indicating the expected payoff of the de-fender after taking action d in the state si

(6) πd(sk) is the defense strategy of the defender in thestate sk

Defense strategy and defense action are two differentconcepts Defense strategy is the rule of defense action notthe action itself For example πd(sk) (πd(sk d1)

πd(sk dm)) is the strategy of the defender in the networkstate sk where πd(sk dm) is the probability of selectingaction dm 1113936disinDk

πd(sk d) 1

33 Network State and Attack-Defense Action ExtractionApproach Based on Attack-Defense Graph Network stateand attack-defense action are important components ofstochastic game model Extraction of network state andattack-defense action is a key point in constructing at-tack-defense stochastic game model [21] In the currentattack-defense stochastic game when describing thenetwork state each network state contains the securityelements of all nodes in the current network e numberof network states is the power set of security elementswhich will produce a state explosion [22] ereforea host-centered attack-defense graph model is proposedEach state node only describes the host state It can ef-fectively reduce the size of state nodes [23] Using thisattack-defense graph to extract network state and attack-defense action is more conducive to cyber attack-defenseconfrontation analysis

Definition 2 attack-defense graph is a binary groupG (S E) in which S s1 s2 sn1113864 1113865 is a set of node se-curity states and si langhost privilegerang host is the uniqueidentity of the node and privilege none user root in-dicates that it does not have any privileges has ordinary userprivileges and has administrator privileges For directededge E (Ea Ed) it indicates that the occurrence of attackor defense action causes the transfer of node state andek (sr vd sd) k a d where sr is the source node and sd

is the destination nodee generation process of attack-defense map is shown

in Figure 3 Firstly target network scanning is used to ac-quire cyber security elements then attack instantiation iscombined with attack template and defense instantiation iscombined with defense template Finally attack-defensegraph is generated e state set of attack-defense stochasticgame model is extracted by attack-defense graph nodes and

the defense action set is extracted by the edge of attack-defense graph

331 Elements of Cyber Security e elements of cybersecurity NSE are composed of network connection Cvulnerability information V service information F andaccess rights P Matrix Csube host times host times port describes theconnection relationship between nodes the row of matrixrepresents the source node shost the list of matrix rep-resents the destination node dhost and the port accessrelationship is represented by matrix elements When portis empty it indicates that there is no connection re-lationship between shost nodes and dhost nodesV langhost service cveidrang indicates the vulnerability ofservices on nodesrsquo host including security vulnerabilitiesand improper configuration or misconfiguration of systemsoftware and application software F langhost servicerang in-dicates that a service is opened on a node hostP langhost privilegerang indicates that an attacker has accessrights privilege on a node host

332 Attack Template Attack template AM is the de-scription of vulnerability utilization whereAM langtid prec postcrang Among them tid is the identifi-cation of attack mode prec langP V C Frang describes the setof prerequisites for an attacker to use a vulnerability in-cluding the initial access rights privilege of the attacker onthe source node shost the vulnerability information cveidof the target node the network connection relationship Cand the running service of the node F Only when the set ofconditions is satisfied the attacker can succeed Use thisvulnerability postc langP C sdrang describes the conse-quences of an attackerrsquos successful use of a vulnerabilityincluding the increase of attackerrsquos access to the targetnode the change of network connection relationship andservice destruction

333 Defense Template Defense templates DM are theresponse measures taken by defenders after predicting oridentifying attacks where DM langtid dsetrang dset

langd1 postd1rang langdm postdmrang1113864 1113865 is the defense strategy setfor specific attacks postdi langF V PCrang describes the im-pact of defense strategy on cyber security elements in-cluding the impact on node service informationvulnerability information attacker privilege informationnode connection relationship and so on

In the process of attack-defense graph generation ifthere is a connection between two nodes and all the pre-requisites for attack occurring are satisfied the edges fromsource node to destination node are added If the attackchanges the security elements such as connectivity the cybersecurity elements should be updated in time If the defensestrategy is implemented the connection between nodes orthe existing rights of attackers should be changed As shownin Algorithm 1 the first step is to use cyber security elements

Security and Communication Networks 5

to generate all possible state nodes and initialize the edgesSteps 2ndash8 are to instantiate attacks and generate all attackedges Steps 9ndash15 are to instantiate defenses and generate alldefense edges Steps 16ndash20 are used to remove all isolatednodes And step 21 is to output attack-defense maps

Assuming the number of nodes in the target network is nand the number of vulnerabilities of each node is m themaximum number of nodes in the attack-defense graph is3n In the attack instantiation stage the computationalcomplexity of analyzing the connection relationship be-tween each two nodes is o(n2 minus n) e computationalcomplexity of matching the vulnerability of the nodes withthe connection relationship is o(m(n2 minus n)) In the defenseinstantiation stage we remove the isolated nodes and thecomputational complexity of traversing the edges of all thenodes is o(9n2 minus 3n) In summary the order of computa-tional complexity of the algorithm is o(n2) e node ofattack-defense graph G can extract network state and theedge of attack-defense graph G can extract attack-defenseaction

4 Stochastic Game Analysis and StrategySelection Based on WoLF-PHCIntelligent Learning

In the previous section cyber attack-defense is described asa bounded rational stochastic game problem and an attack-defense stochastic game model AD-SGM is constructed Inthis section reinforcement learning mechanism is in-troduced into finite rational stochastic game and WoLF-PHC algorithm is used to select defense strategies based onAD-SGM

41 Principle of WoLF-PHC

411 Q-Learning Algorithm Q-learning [24] is the basis ofWoLF-PHC algorithm and a typical model-free re-inforcement learning algorithm Its learning mechanism isshown in Figure 4 Agent in Q-learning obtains knowledgeof return and environment state transfer through interaction

Elements of networksecurity

Network connectionrelations

Node vulnerabilityinformation

Node serviceinformation

Node accessprivilege

Attack template

Attackinstantiation

Defenseinstantiation

Defense template

Atta

ck-d

efen

se g

raph Network

state

Attackaction

Defensiveaction

Figure 3 Attack-defense graph generation

Input Elements of Cyber security NSE Attack Template AM Defense Template DMOutput Attack graph G (S E)

(1) S⟵NSE E⟵emptylowast Generate all nodes lowast(2) for each S dolowast Attack instantiation to generate attack edges lowast(3) update NSE in slowast Updating Cyber security Elements lowast(4) if Cshost shost and CdhostV geAMprecV and CdhostFgeAMprecF and CdhostPprivilegegeAMprecPprivilege(5) srhost⟵Cshost(6) sdhost⟵Cdhost(7) sdprivilege⟵AMpostcPprivilege(8) Ea⟵Ea cup ea(srAMtid sd)1113864 1113865

(9) end if(10) end for(11) for each S dolowast Defense instantiation to generate defense edges lowast(12) if Easd s and DMtid Eatid(13) Ed⟵Ed cup ed(EasdDMdsetd Easr)1113864 1113865

(14) end if(15) end for(16) for each S dolowast Remove isolated nodes S lowast(17) if ea(s tid sd) empty and ed(sr d s) empty(18) S⟵ S minus s

(19) end if(20) end for(21) Return G

ALGORITHM 1 Attack-defense graph generation algorithm

6 Security and Communication Networks

with environment Knowledge is expressed by payoff Qd andlearned by updating Qd Qd is

Qd(s d) Qd(s d) + α Rd s d sprime( 1113857 + cmaxdprime

Qd sprime dprime( 1113857 minus Qd(s d)1113890 1113891

(1)

where α is payoff learning rate and c is the discount factore strategy of Q-learning is πd(s) argmaxdQd(s d)

412 PHC Algorithm e Policy Hill-Climbing algorithm[25] is a simple and practical gradient descent learning al-gorithm suitable for hybrid strategies which is an im-provement of Q-learning e state-action gain function Qd

of PHC is the same as Q-learning but the policy updateapproach of Q-learning is no longer followed but the hybridstrategy πd(sk) is updated by executing the hill-climbingalgorithm as shown in equations (2)ndash(4) In the formula thestrategy learning rate is

πd sk di( 1113857 πd sk di( 1113857 + Δskdi (2)

where

Δskdi

minus δskdi di ne argmax

dQd sk d( 1113857

1113944djnedi

δskdj others

⎧⎪⎪⎪⎨

⎪⎪⎪⎩

(3)

δskdi min πd sk di( 1113857

δDk minus 1

11138681113868111386811138681113868111386811138681113868

1113888 1113889 (4)

413 WoLF-PHC Algorithm WoLF-PHC algorithm is animprovement of PHC algorithm By introducing WoLFmechanism the defender has two different strategy learningrates low strategy learning rate δw when winning and highstrategy learning rate δl when losing as shown in formula (5)e two learning rates enable defenders to adapt quickly toattackersrsquo strategies when they perform worse than expectedand to learn cautiously when they perform better than expectede most important thing is the introduction of WoLFmechanism which guarantees the convergence of the algorithm[9] WoLF-PHC algorithm uses average strategy as the criterionof success and failure as shown in formulae (6) and (7)

δ

δw 1113944disinDk

πd skd( 1113857Qd skd( 1113857gt 1113944disinDk

πd skd( 1113857Qd skd( 1113857

δl others

⎧⎪⎪⎪⎨

⎪⎪⎪⎩

(5)

πd(sd) πd(sd) +1

C(s)πd(sd) minus πd(sd)( 1113857

(6)

C(s) C(s) + 1 (7)

42 Defense Decision-Making Algorithm Based on ImprovedWoLF-PHC e decision-making process of our approachis shown in Figure 5 which consists of five steps It receivestwo types of input data attack evidence and abnormal

Input AD minus SGM α δ λ and c

Output Defense action d(1) initialize AD minus SGM C(s) 0 e(s d) 0lowast Network state and attack-defense actions are extracted by Algorithm 1 lowast(2) slowast get(E) lowast Getting the current network state from Network E lowast(3) repeat(4) dlowast πd(slowast) lowast Select defense action lowast(5) Output dlowast lowast Feedback defense actions to defenders lowast(6) sprime get(E) lowast Get the status after the action dlowast is executed lowast(7) ρlowast Rd(slowast dlowast sprime) + cmaxdprimeQd(sprime dprime) minus Qd(slowast dlowast)

(8) ρg Rd(slowast dlowast sprime) + cmaxdprimeQd(sprime dprime) minus maxdQd(slowast d)

(9) for each state-action pair (s d) except (slowast dlowast) do(10) e(s d) cλe(s d)

(11) Qd(s d) Qd(s d) + αρge(s d)

(12) end forlowast Update noncurrent eligibility trace (slowast dlowast) and values Qd lowast(13) Qd(slowast dlowast) Qd(slowast dlowast) + αρlowastlowast Update Qd of (slowast dlowast) lowast(14) e(slowast dlowast) cλe(slowast dlowast) + 1 lowast Update track of (slowast dlowast)lowast(15) C(slowast) C(slowast) + 1(16) Updating average strategy based on formula (6)(17) Selecting the learning rate of strategies based on formula (5)(18) δslowastd min(πd(slowast d) δ(D(slowast) minus 1))foralld isin D(slowast)

(19) Δslowastd minus δslowastd dne argmaxdi

Qd(slowast di)

1113936djnedlowastδslowastdjOthers1113896

(20) πd(slowast d) πd(slowast d) + Δslowastdforalld isin D(slowast) lowast Update defense strategy lowast(21) slowast sprime(22) end repeat

ALGORITHM 2 Defense decision-making algorithm

Security and Communication Networks 7

evidence All these pieces of evidence come from real-timeintrusion detection systems After decision making theoptimal security strategy is determined against detectedintrusions

In order to improve the learning speed of WoLF-PHCalgorithm and reduce the dependence of the algorithm onthe amount of data the eligibility trace is introduced toimprove WoLF-PHC e eligibility trace can track specificstate-action trajectories of recent visits and then assigncurrent returns to the state-action of recent visits WoLF-PHC algorithm is an extension of Q-learning algorithm Atpresent there are many algorithms combining Q-learningwith eligibility trace is paper improves WoLF-PHC byusing the typical algorithm [10] e qualification trace ofeach state-action is defined as e(s a) Suppose slowast is thecurrent network state slowast and the eligibility trace is updatedin the way shown in formula (8) Among them the traceattenuation factor is λ

e(s d) cλe(s d) sne slowast

cλe(s d) + 1 s slowast1113896 (8)

WoLF-PHC algorithm is an extension of Q-learningalgorithm which belongs to off-policy algorithm It usesgreedy policy when evaluating defense actions for eachnetwork state and occasionally introduces nongreedy policywhen choosing to perform defense actions in order to learnIn order to maintain the off-policy characteristics of WoLF-PHC algorithm the state-action values are updated byformulae (9)ndash(12) in which the defense actions dlowast are

selected for execution slowast because only the recently visitedstatus-action pairs will have significantly more eligibilitytrace than 0 while most other status-action pairs will havealmost none eligibility trace In order to reduce the memoryand running time consumption caused by eligibility traceonly the latest status-action pair eligibility trace can be savedand updated in practical application

Qd slowast dlowast

( 1113857 Qd slowast dlowast

( 1113857 + αρlowast (9)

Qd(s d) Qd(s d) + αρge(s d) (10)

ρlowast Rd slowast d sprime( 1113857 + cmax

dprimeQd sprime dprime( 1113857 minus Qd s

lowast dlowast

( 1113857

(11)

ρg Rd slowast dlowast sprime( 1113857 + cmax

dprimeQd sprime dprime( 1113857 minus max

dQd slowast d( 1113857

(12)

In order to achieve better results the defense decision-making approach based on WoLF-PHC needs to set fourparameters α δ λ and c reasonably (1) e range of thepayoff learning rate is 0lt αlt 1e bigger the representativeα is the more important the cumulative reward is e fasterthe learning speed is the smaller α is and the better thestability of the algorithm is (2)e range of strategy learningrate 0lt δ lt 1 is obtained According to the experiment wecan get a better result when adopting δlδw 4 (3) eattenuation factor λ of eligibility trace is in the range of

Describe andanalyze cyberattack-defenseconfrontation

Model stochastic game

for attack-defense

Extract currentsecurity state andcandidate attack-

defense action

Evaluatepayoff

hellip

helliphellip

How to defend

Decision making

Figure 5 e process of our decision-making approach

Rt+1reward Rt

St+1state St

Ambient

Agent

Figure 4 Q-learning learning mechanism

8 Security and Communication Networks

0lt λlt 1 which is responsible for the credit allocation ofstatus-action It can be regarded as a time scale e greaterthe credit allocated λ to historical status-action the greaterthe credit allocated to historical status-action (4) e rangeof the discount factor 0lt clt 1 represents the defenderrsquospreference for immediate return and future return When c

approaches 0 it means that future returns are irrelevant andimmediate returns are more important When c approaches1 it means immediate returns are irrelevant and futurereturns are more important

Agent in WoLF-PHC is the defender in the stochasticgame model of attack-defense AD minus SGM the game state inagentrsquos state corresponds to AD minus SGM the defense actionin agentrsquos behavior corresponds to AD minus SGM the imme-diate return in agentrsquos immediate return corresponds toAD minus SGM and the defense strategy in agentrsquos strategycorresponds to AD minus SGM

On the basis of the above a specific defense decision-making approach as shown in Algorithm 2 is given e firststep of the algorithm is to initialize the stochastic gamemodel AD minus SGM of attack-defense and the related pa-rameters e network state and attack-defense actions areextracted by Algorithm 1 e second step is to detect thecurrent network state by the defender Steps 3ndash22 are tomake defense decisions and learn online Steps 4-5 are toselect defense actions according to the current strategy steps6ndash14 are to update the benefits by using eligibility traces andsteps 15ndash21 are the new payoffs using mountain climbingalgorithm to update defense strategy

e spatial complexity of the Algorithm 2 mainlyconcentrates on the storage of pairs Rd(s d sprime) such ase(s d) πd(s d) πd(s d) and Qd(s d) e number of statesis |S| |D| and |A| are the numbers of measures taken by thedefender and attacker in each state respectively ecomputational complexity of the proposed algorithm isO(4 middot |S| middot |D| + |S|2 middot |D|) Compared with the recentmethod using evolutionary game model with complexityO(|S|2 middot (|A| + |D|)3) [14] we greatly reduce the computa-tional complexity and increase the practicability of the al-gorithm since the proposed algorithm does not need to solvethe game equilibrium

5 Experimental Analysis

51 Experiment Setup In order to verify the effectiveness ofthis approach a typical enterprise network as shown inFigure 6 is built for experiment Attacks and defenses occuron the intranet with attackers coming from the extranet Asa defender network administrator is responsible for thesecurity of intranet Due to the setting of Firewall 1 andFirewall 2 legal users of the external network can only accessthe web server which can access the database server FTPserver and e-mail server

e simulation experiment was carried out on a PC withIntel Core i7-6300HQ 340GHz 32GB RAM memoryand Windows 10 64 bit operating system e Python 365emulator was installed and the vulnerability information inthe experimental network was scanned by Nessus toolkit asshown in Table 1 e network topology information was

collected by ArcGis toolkit We used the Python language towrite the project code During the experiment we set upabout 25000 times of attack-defense strategy studies eexperimental results were analyzed and displayed usingMatlab2018a as described in Section 53

Referring to MIT Lincoln Lab attack-defense behaviordatabase attack-defense templates are constructed Attack-defense maps are divided into attack maps and defense mapsby using attacker host A web server W database server DFTP server F and e-mail server E In order to facilitatedisplay and description attack-defense maps are dividedinto attack maps and defense maps as shown in Figures 7and 8 respectively e meaning of defense action in thedefense diagram is shown in Table 2

52 Construction of the Experiment Scenario AD-SGM(1) N (attacker defender) are players participating in

the game representing cyber attackers and defendersrespectively

Vulnerabilityscanning system

VDS

Internet

DMZ

FTP server

Firewall 1Trusted zone

Firewall 2Database

serverE-mailserver

Web server

IDS

Attacker

Extranet

Intranet

Figure 6 Experimental network topology

Table 1 Network vulnerability information

Attackidentifier Host CVE Target

privilegetid1 Web server CVE-2015-1635 usertid2 Web server CVE-2017-7269 roottid3 Web server CVE-2014-8517 roottid4 FTP server CVE-2014-3556 roottid5 E-mail server CVE-2014-4877 roottid6 Database server CVE-2013-4730 usertid7 Database server CVE-2016-6662 root

S0(A root)

S1(W user)

S2(W root)S5(D user)

S6(D root)

tid1

tid2

tid3

tid4

tid5

tid6

tid7

tid8

tid9

S3(F root)

S4(E root)

Figure 7 Attack graph

Security and Communication Networks 9

(2) e state set of stochastic game is S (s0 s1

s2 s3 s4 s5 s6) which consists of network state and isextracted from the nodes shown in Figures 7 and 8

(3) e action set of the defender is D (D0 D1

D2 D3 D4 D5 D6) D0 NULL D1 d11113864 d2D2

d3 d41113864 1113865D3 d1 d5 d61113864 1113865D4 d1 d5 d61113864 1113865D5 d11113864

d2 d7D6 d3 d41113864 1113865 and the edges are extractedfrom Figure 8

(4) Quantitative results of immediate returns Rd(si

d sj) of defenders [16 26] are

Rd s0NULL s0( 1113857 Rd s0NULL s1( 1113857 Rd s0NULL s2( 1113857( 1113857 (0 minus 40 minus 59)

Rd s1 d1 s0( 1113857 Rd s1 d1 s1( 1113857 Rd s1 d1 s2( 1113857 Rd s1 d2 s0( 1113857 Rd s1 d2 s1( 1113857 Rd s1 d2 s2( 1113857( 1113857

(40 0 minus 29 5 minus 15 minus 32)

( Rd s2 d3 s0( 1113857 Rd s2 d3 s1( 1113857 Rd s2 d3 s2( 1113857 Rd s2 d3 s3( 1113857 Rd s2 d3 s4( 1113857 Rd s2 d3 s5( 1113857

Rd s2 d4 s0( 1113857 Rd s2d4 s1( 1113857 Rd s2 d4 s2( 1113857 Rd s2 d4 s3( 1113857 Rd s2 d4 s4( 1113857 Rd s2 d4 s5( 11138571113857

(24 9 minus 15 minus 55 minus 49 minus 65 19 5 minus 21 minus 61 minus 72 minus 68)

( Rd s3 d1 s2( 1113857 Rd s3 d1 s3( 1113857 Rd s3 d1 s6( 1113857 Rd s3 d5 s2( 1113857 Rd s3 d5 s3( 1113857 Rd s3 d5 s6( 1113857

Rd s3 d6 s2( 1113857 Rd s3 d6 s3( 1113857 Rd s3 d6 s6( 11138571113857

(21 minus 16 minus 72 15 minus 23 minus 81 minus 21 minus 36 minus 81)

( Rd s4 d1 s2( 1113857 Rd s4 d1 s4( 1113857 Rd s4 d1 s6( 1113857 Rd s4 d5 s2( 1113857 Rd s4 d5 s4( 1113857 Rd s4 d5 s6( 1113857

Rd s4 d6 s2( 1113857 Rd s4 d6 s4( 1113857 Rd s4 d6 s6( 11138571113857

(26 0 minus 62 11 minus 23 minus 75 9 minus 25 minus 87)

( Rd s5 d1 s2( 1113857 Rd s5 d1 s5( 1113857 Rd s5 d1 s6( 1113857 Rd s5 d2 s2( 1113857 Rd s5 d2 s5( 1113857 Rd s5 d2 s6( 1113857

Rd s5 d7 s2( 1113857 Rd s5 d7 s5( 1113857 Rd s5 d7 s6( 11138571113857

(29 0 minus 63 11 minus 21 minus 76 2 minus 27 minus 88)

( Rd s6 d3 s3( 1113857 Rd s6 d3 s3( 1113857 Rd s6 d3 s5( 1113857 Rd s6 d3 s6( 1113857 Rd s6 d4 s3( 1113857

Rd s6 d4 s4( 1113857 Rd s6 d4 s5( 1113857 Rd s6 d4 s6( 11138571113857

(minus 23 minus 21 minus 19 minus 42 minus 28 minus 31 minus 24 minus 49)

(13)

Table 2 Defense action description

Atomic defense action d1 d2 d3 d4 d5 d6 d7

Renew root data Limit SYNICMPpackets Install Oracle patches Reinstall listener program Uninstall delete Trojan Limit access toMDSYS Restart database server Delete suspicious account Add physical resource Repair database Limit packets from ports

S0(A root)

S1(W user)

S2(W root)S5(D user)

S6(D root)

d3

d3

d4

d1d2d7

S3(F root)

S4(E root)d1d5d6

d1d5d6d4

d3

d1d2

Figure 8 Defense graph

10 Security and Communication Networks

(5) In order to detect the learning performance of theAlgorithm 2 more fully the defenderrsquos state-actionpayoffQd(si d) is initialized with a unified 0 withoutintroducing additional prior knowledge

(6) Defenderrsquos defense strategy adopts average strategyto initialize that is πd(sk d1) πd(sk d2) middot middot middot

πd(sk dm) 1113936disinD(sk)πd(sk d) 1forallsk isin S where noadditional prior knowledge is introduced

53 Testing and Analysis e experiment in this section hasthree purposes e first is to test the influence of differentparameter settings on the proposed Algorithm 2 so as to findout the experimental parameters suitable for this scenarioe second is to compare this approach with the existingtypical approaches to verify the advancement of this ap-proach e third is to test the effectiveness of WoLF-PHCalgorithm improvement based on eligibility trace

From Figures 7 and 8 we can see that the state of attack-defense strategy selection is the most complex and repre-sentative erefore the performance of the algorithm isanalyzed by the experimental state selection and the othernetwork state analysis approaches are the same

531 Parameter Test and Analysis Different parameters willaffect the speed and effect of learning At present there is norelevant theory to determine the specific parameters InSection 4 the relevant parameters are preliminarily ana-lyzed On this basis the different parameter settings arefurther tested to find the parameter settings suitable for thisattack-defense scenario Six different parameter settingswere tested Specific parameter settings are shown in Table 3In the experiment the attackerrsquos initial strategy is randomstrategy and the learning mechanism is the same as theapproach in this paper

e probability of the defenderrsquos choice of defense ac-tions and sums in state is shown in Figure 9 e learningspeed and convergence of the algorithm under differentparameter settings can be observed from Figure 9 whichshows that the learning speed of settings 1 3 and 6 is fasterand the best strategy can be obtained after learning less than1500 times under the three settings but convergence of 3and 6 is poor Although the best strategy can be learned bysettings 3 and 6 there will be oscillation afterwards and thestability of setting 1 is not suitable

Defense payoff can represent the degree of optimizationof the strategy In order to ensure that the payoff value doesnot reflect only one defense result the average of 1000defense gains is taken and the change of the average payoffper 1000 defense gains is shown in Figure 10 As can be seenfrom Figure 10 the benefits of Set 3 are significantly lowerthan those of other settings but the advantages and dis-advantages of other settings are difficult to distinguish Inorder to display more intuitively the average value of 25000defense gains calculated under different settings in Figure 10is shown in Figure 11 From Figure 11 we can see that theaverage value of settings 1 and 5 is higher For furthercomparison the standard deviation of settings 1 and 5 is

calculated one step on the basis of the average value to reflectthe discreteness of the gains As shown in Figure 12 thestandard deviations of setting 1 and setting 6 are smallMoreover the result of setting 1 is smaller than setting 6

In conclusion setting 1 of the six sets of parameters is themost suitable for this scenario Since setting 1 has achievedan ideal effect and can meet the experimental requirementsit is no longer necessary to further optimize the parameters

532 Comparisons In this section stochastic game [16] andevolutionary game [20] are selected to conduct comparativeexperiments with this approach According to the differenceof attackerrsquos learning ability this section designs two groupsof comparative experiments In the first group the attackerrsquoslearning ability is weak and will not make adjustments to theattack-defense results In the second group the attackerrsquoslearning ability is strong and adopts the same learningmechanism as the approach in this paper In both groupsthe initial strategies of attackers were random strategies

In the first group of experiments the defense strategy ofthis approach is as shown in Figure 9(a) e defense strategiescalculated by the approach [16] are πd(s2 d3) 07πd(s2 d4) 03 e defense strategies of [20] are evolu-tionarily stable and balanced And the defense strategies of [20]are πd(s2 d3) 08 πd(s2 d4) 02 Its average earnings per1000 times change as shown in Figure 13

From the results of the strategies and benefits of the threeapproaches we can see that the approach in this paper canlearn from the attackerrsquos strategies and adjust to the optimalstrategy so the approach in this paper can obtain the highestbenefits Wei et al [16] adopted a fixed strategy whenconfronting any attacker When the attacker is constrainedby bounded rationality and does not adopt Nash equilibriumstrategy the benefit of this approach is low Although thelearning factors of both attackers and defenders are takeninto account in [20] the parameters required in the modelare difficult to quantify accurately which results in thedeviation between the final results and the actual results sothe benefit of the approach is still lower than that of theapproach in this paper

In the second group of experiments the results of[16 20] are still as πd(s2 d3) 07 πd(s2 d4) 03 andπd(s2 d3) 08 πd(s2 d4) 02 e decision making ofthis approach is shown in Figure 14 After about 1800 timesof learning the approach achieves stability and converges tothe same defense strategy as that of [16] As can be seen fromFigure 15 the payoff of [20] is lower than that of other twoapproaches e average payoff of this approach in the first2000 defenses is higher than that of [26] and then it is almostthe same as that of [26] Combining Figures 14 and 15 wecan see that the learning attacker cannot get Nash equi-librium strategy at the initial stage and the approach in thispaper is better than that in [26] When the attacker learns toget Nash equilibrium strategy the approach in this paper canconverge to Nash equilibrium strategy At this time theperformance of this approach is the same as that in [26]

In conclusion when facing the attackers with weaklearning ability the approach in this paper is superior to that

Security and Communication Networks 11

Table 3 Dierent parameter settings

Set α δl δw λ c

1 001 0004 0001 001 0012 01 0004 0001 001 0013 001 0004 0001 001 014 001 0004 0001 01 0015 001 004 001 001 0016 001 0008 0001 001 001

0 2000 4000 6000s2 defense number

s2 d

efen

se re

turn

10

08

06

04

02

00

d3d4

(a)

s2 d

efen

se re

turn

10

08

06

04

02

000 2000 4000 6000

s2 defense number

d3d4

(b)

s2 d

efen

se re

turn

10

08

06

04

02

000 2000 4000 6000

s2 defense number

d3d4

(c)

s2 d

efen

se re

turn

10

08

06

04

02

000 2000 4000 6000

s2 defense number

d3d4

(d)

s2 d

efen

se re

turn

10

08

06

04

02

000 2000 4000 6000

s2 defense number

d3d4

(e)

s2 d

efen

se re

turn

10

08

06

04

02

000 2000 4000 6000

s2 defense number

d3d4

(f )

Figure 9 Defense decision making under dierent parameter settings (a) Defense decision under setting 1 (b) Defense decision undersetting 2 (c) Defense decision under setting 3 (d) Defense decision under setting 4 (e) Defense decision under setting 5 (f ) Defensedecision under setting 6

5 10 15 20 25s2 defense number (times1000)

s2 d

efen

se re

turn

ndash10

ndash15

ndash20

ndash25

ndash30

Set 1Set 2Set 3

Set 4Set 5Set 6

Figure 10 Defense payo under dierent parameter settings

12 Security and Communication Networks

in [16 20] When facing an attacker with strong learningability if the attacker has not obtained Nash equilibriumthrough learning this approach is still better than [16 20] Ifthe attacker obtains Nash equilibrium through learning thispaper can also obtain the same Nash equilibrium strategy as[26] and obtain its phase e same eect is superior to thatin [20]

533 Test Comparison with and without Eligibility Traceis section tests the actual eect of the eligibility traces onthe Algorithm 2 e eect of eligibility traces on strategyselection is shown in Figure 16 from which we can see thatthe learning speed of the algorithm is faster when thequalied trace is available After 1000 times of learning the

algorithm can converge to the optimal strategy When thequalied trace is not available the algorithm needs about2500 times of learning to converge

Average earnings per 1000 times change as shown inFigure 17 from which we can see that the benets of thealgorithm are almost the same when there is or not anyqualied trace after convergence From Figure 17 we can seethat 3000 defenses before convergence have higher returnsfrom qualied trails than those from unqualied trails Inorder to further verify this the average of the rst 3000defense gains under qualied trails and unqualied trails iscounted 10 times each respectively e results are shown inFigure 18 which further proves that in the preconvergencedefense phase qualied traces are better than unqualiedtraces

e addition of eligibility trace accelerates the learningspeed but also brings additional memory and computingoverhead In the experiment only 10 state-action pairs thatwere recently accessed were saved which eectively reduced

Set 1 Set 2 Set 3 Set 4 Set 5 Set 6

ndash80

ndash85

ndash90

Figure 11 Average return under dierent parameter settings

Set 1 Set 2 Set 3 Set 4 Set 5 Set 6

80

75

Figure 12 Standard deviation of defense payo under dierentparameter settings

ndash10

ndash20

ndash30

ndash40

s2 d

efen

se re

turn

5 10 15 20 25

Our methodReference [12]Reference [16]

s2 defense number (times1000)

Figure 13 Comparison of defense payo change

0 2000 4000 6000s2 defense number

s2 d

efen

se st

rate

gy

08

06

04

02

00

d3d4

Figure 14 Defense decision making of our approach

Our methodReference [12]Reference [16]

5 10 15 20 25s2 defense number (times1000)

s2 d

efen

se re

turn

ndash10

ndash20

ndash30

ndash40

Figure 15 Comparison of defense payo change

Security and Communication Networks 13

0 2000 4000 6000s2 defense number

s2 d

efen

se st

rate

gy

100

075

050

025

000

Eligibility trace d3Eligibility trace d4

No eligibility trace d3No eligibility trace d4

Figure 16 Comparison of defense decision making

Eligibility traceNo eligibility trace

0 5 10 15 20 25s2 defense number (times1000)

s2 d

efen

se re

turn

ndash10

ndash20

ndash15

ndash30

ndash25

Figure 17 Comparison of defense payo change

Eligibility traceNo eligibility trace

1 2 3 4 5 6 7 8 9 10s2 defense number (times3000)

s2 av

erag

e def

ense

retu

rn

ndash12

ndash14

ndash16

ndash18

ndash20

Figure 18 Average payo comparison of the rst 3000 defenses

14 Security and Communication Networks

the increase of memory consumption In order to test thecomputational cost of qualified trail the time of 100000defense decisions made by the algorithm was counted for 20times with and without qualified trail e average of 20times was 951s for qualified trail and 374 s for unqualifiedtrail Although the introduction of eligibility traces willincrease the decision-making time by nearly 25 times thetime required for 100000 decisions after the introduction ofeligibility traces is still only 951 s which can still meet thereal-time requirements

In summary the introduction of eligibility trace at theexpense of a small amount of memory and computingoverhead can effectively increase the learning speed of thealgorithm and improve the defense gains

54 Comprehensive Comparisons is approach is com-pared with some typical research results as shown in Table 4[3 12 14 16] is based on the assumption of complete ra-tionality e equilibrium strategy obtained by it is difficultto appear in practice and has a low guiding effect on actualdefense decision making [20] and this paper have morepracticability on the premise of bounded rationality hy-pothesis but [20] is based on the theory of biologicalevolution and mainly studies population evolution e coreof game analysis is not the optimal strategy choice of playersbut the strategy adjustment process trend and stability ofgroup members composed of bounded rational players andthe stability here refers to group members is approach isnot suitable for guiding individual real-time decisionmaking because the proportion of specific strategies isunchanged and not the strategy of a player On the contrarythe defender of the proposed approach adopts reinforcementlearning mechanism which is based on systematic feedbackto learn in the confrontation with the attacker and is moresuitable for the study of individual strategies

6 Conclusions and Future Works

In this paper cyber attack-defense confrontation is ab-stracted as a stochastic game problem under the restrictionof limited rationality A host-centered attack-defense graphmodel is proposed to extract network state and attack-de-fense action and an algorithm to generate attack-defensegraph is designed to effectively compress the game statespace e WoLF-PHC-based defense decision approach isproposed to overcome the problem which enables defendersunder bounded rationality to make optimal choices whenfacing different attackers e lattice improves the WoLF-

PHC algorithm speeds up the defenderrsquos learning speed andreduces the algorithmrsquos dependence on data is approachnot only satisfies the constraints of bounded rationality butalso does not require the defender to know too much in-formation about the attacker It is a more practical defensedecision-making approach

e future work is to further optimize the winning andlosing criteria of WoLF-PHC algorithm for specific attack-defense scenarios in order to speed up defense learning andincrease defense gains

Data Availability

e data that support the findings of this study are notpublicly available due to restrictions as the data containsensitive information about a real-world enterprise networkAccess to the dataset is restricted by the original ownerPeople who want to access the data should send a request tothe corresponding author who will apply for permission ofsharing the data from the original owner

Conflicts of Interest

e authors declare that they have no conflicts of interestregarding the publication of this paper

Acknowledgments

is work was supported by the National High TechnologyResearch and Development Program of China (863 Pro-gram) (2014AA7116082 and 2015AA7116040) and NationalNatural Science Foundation of China (61902427) In par-ticular we thank Junnan Yang and Hongqi Zhang of ourresearch group for their previous work on stochastic gameWe sincerely thank them for their help in publishing thispaper

References

[1] Z Tian W Shi Y Wang et al ldquoReal time lateral movementdetection based on evidence reasoning network for edgecomputing environmentrdquo IEEE Transactions on IndustrialInformatics vol 15 no 7 pp 4285ndash4294 2019

[2] Y Wang J Yu W Qu H Shen and X Cheng ldquoEvolutionarygame model and analysis approaches for network groupbehaviorrdquo Chinese Journal of Computers vol 38 no 2pp 282ndash300 2015

[3] J Liu H Zhang and Y Liu ldquoResearch on optimal selection ofmoving target defense policy based on dynamic game withincomplete informationrdquo Acta Electronica Sinica vol 46no 1 pp 82ndash29 2018

Table 4 Comprehensive comparison of existing approaches

Ref Game type Model assumption Learning mechanism Game process Applicable object Practicability[3] Stochastic game Rationality mdash Multistage Personal Bad[12] Static game Rationality mdash Single-stage Personal Bad[14] Dynamic game Rationality mdash Multistage Personal Bad[16] Stochastic game Rationality mdash Multistage Personal Bad[20] Evolutionary game Bounded rationality Biological evolution mdash Group GoodOur approach Stochastic game Bounded rationality Reinforcement learning Multistage Personal Good

Security and Communication Networks 15

[4] Z Tian X Gao S Su J Qiu X Du and M GuizanildquoEvaluating reputation management schemes of internet ofvehicles based on evolutionary game theoryrdquo IEEE Trans-actions on Vehicular Technology vol 68 no 6 pp 5971ndash59802019

[5] Q Zhu and S Rass ldquoGame theory meets network securitya tutorialrdquo in Proceedings of the ACM SIGSAC Conference onComputer and Communications Security pp 2163ndash2165Toronto Canada October 2018

[6] X Li C Zhou Y C Tian and Y Qin ldquoA dynamic decision-making approach for intrusion response in industrial controlsystemsrdquo IEEE Transactions on Industrial Informatics vol 15no 5 pp 2544ndash2554 2019

[7] H Hu Y Liu H Zhang and R Pan ldquoOptimal networkdefense strategy selection based on incomplete informationevolutionary gamerdquo IEEE Access vol 6 pp 29806ndash298212018

[8] J Li G Kendall and R John ldquoComputing Nash equilibriaand evolutionarily stable states of evolutionary gamesrdquo IEEETransactions on Evolutionary Computation vol 20 no 3pp 460ndash469 2016

[9] M Bowling and M Veloso Multiagent Learning Usinga Variable Learning Rate Elsevier Science Publishers LtdAmsterdam Netherlands 2002

[10] J Harb and D Precup ldquoInvestigating recurrence and eligi-bility traces in deep Q-networksrdquo httpsarxivorgabs170405495 2017

[11] C T Do N H Tran and C Hong ldquoGame theory for cybersecurity and privacyrdquo ACM Computing Surveys vol 50 no 2p 30 2017

[12] Y-L Liu D-G Feng L-HWu and Y-F Lian ldquoPerformanceevaluation of worm attack and defense strategies based onstatic Bayesian gamerdquo Chinese Journal of Software vol 23no 3 pp 712ndash723 2012

[13] Y Li D E Quevedo S Dey and L Shi ldquoA game-theoreticapproach to fake-acknowledgment attack on cyber-physicalsystemsrdquo IEEE Transactions on Signal and Information Pro-cessing over Networks vol 3 no 1 pp 1ndash11 2017

[14] H Zhang L Jiang S Huang J Wang and Y Zhang ldquoAttack-defense differential game model for network defense strategyselectionrdquo IEEE Access vol 7 pp 50618ndash50629 2019

[15] A Afrand and S K Das ldquoPreventing DoS attacks in wirelesssensor networks a repeated game theory approachrdquo InternetJournal of Cyber Security vol 5 no 2 pp 145ndash153 2007

[16] J Wei B Fang and Z Tian ldquoResearch on defense strategiesselection based on attack-defense stochastic game modelrdquoChinese Journal of Computer Research and Developmentvol 47 no 10 pp 1714ndash1723 2010

[17] C Wang X Cheng and Y Zhu ldquoA Markov game model ofcomputer network operationrdquo Chinese Systems Engineering-gteory and Practice vol 34 no 9 pp 2402ndash2410 2014

[18] J Hofbauer and K Sigmund ldquoEvolutionary game dynamicsrdquoBulletin of the American Mathematical Society vol 40 no 4pp 479ndash519 2011

[19] Y Hayel and Q Zhu ldquoEpidemic protection over heteroge-neous networks using evolutionary Poisson gamesrdquo IEEETransactions on Information Forensics and Security vol 12no 8 pp 1786ndash1800 2017

[20] J Huang and H Zhang ldquoImproving replicator dynamicevolutionary game model for selecting optimal defensestrategiesrdquo Chinese Journal on Communications vol 38 no 1pp 170ndash182 2018

[21] H Hu Y Liu H Zhang and Y Zhang ldquoSecurity metricmethods for network multistep attacks using AMC and big

data correlation analysisrdquo Security and CommunicationNetworks vol 2018 Article ID 5787102 14 pages 2018

[22] H Hu Y Liu Y Yang H Zhang and Y Zhang ldquoNew in-sights into approaches to evaluating intention and path fornetwork multistep attacksrdquo Mathematical Problems in En-gineering vol 2018 Article ID 4278632 13 pages 2018

[23] H Hu H Zhang Y Liu and Y Wang ldquoQuantitative methodfor network security situation based on attack predictionrdquoSecurity and Communication Networks vol 2017 Article ID3407642 19 pages 2017

[24] M Zimmer and S Doncieux ldquoBootstrapping Q-learning forrobotics from neuro-evolution resultsrdquo IEEE Transactions onCognitive and Developmental Systems vol 10 no 1pp 102ndash119 2018

[25] R S Sutton and A G Barto ldquoReinforcement learning anintroduction Bradford bookrdquo Machine Learning vol 16no 1 pp 285-286 2005

[26] H Zhang J Wang D Yu J Han and T Li ldquoActive defensestrategy selection based on static Bayesian gamerdquo in Pro-ceedings of IET International Conference on CyberspaceTechnology pp 1ndash7 London UK October 2016

16 Security and Communication Networks

International Journal of

AerospaceEngineeringHindawiwwwhindawicom Volume 2018

RoboticsJournal of

Hindawiwwwhindawicom Volume 2018

Hindawiwwwhindawicom Volume 2018

Active and Passive Electronic Components

VLSI Design

Hindawiwwwhindawicom Volume 2018

Hindawiwwwhindawicom Volume 2018

Shock and Vibration

Hindawiwwwhindawicom Volume 2018

Civil EngineeringAdvances in

Acoustics and VibrationAdvances in

Hindawiwwwhindawicom Volume 2018

Hindawiwwwhindawicom Volume 2018

Electrical and Computer Engineering

Journal of

Advances inOptoElectronics

Hindawiwwwhindawicom

Volume 2018

Hindawi Publishing Corporation httpwwwhindawicom Volume 2013Hindawiwwwhindawicom

The Scientific World Journal

Volume 2018

Control Scienceand Engineering

Journal of

Hindawiwwwhindawicom Volume 2018

Hindawiwwwhindawicom

Journal ofEngineeringVolume 2018

SensorsJournal of

Hindawiwwwhindawicom Volume 2018

International Journal of

RotatingMachinery

Hindawiwwwhindawicom Volume 2018

Modelling ampSimulationin EngineeringHindawiwwwhindawicom Volume 2018

Hindawiwwwhindawicom Volume 2018

Chemical EngineeringInternational Journal of Antennas and

Propagation

International Journal of

Hindawiwwwhindawicom Volume 2018

Hindawiwwwhindawicom Volume 2018

Navigation and Observation

International Journal of

Hindawi

wwwhindawicom Volume 2018

Advances in

Multimedia

Submit your manuscripts atwwwhindawicom

Page 3: Research Article - Hindawi Publishing Corporationdownloads.hindawi.com/journals/scn/2019/3038586.pdf · Afrand and Das [15] established a repeated game model between the intrusion

by introducing incentive coefficient and improved the cal-culation approach of replication dynamic rate Based on thisan evolutionary game model was constructed for defenseEvolutionary game takes the group as the research objectadopts the biological evolution mechanism and completesthe learning by imitating the advantage strategy of othermembers In evolutionary game there is too much in-formation exchange among players and it mainly studies theadjustment process trend and stability of attack-defensegroup strategy which is not conducive to guiding the real-time strategy selection of individual members

Reinforcement learning is a classic online intelligentlearning approach Its players learn independently throughenvironmental feedback Compared with evolutionary bi-ology reinforcement learning is more suitable for guidingindividual decision making is paper introduces re-inforcement-learning mechanism into stochastic gameexpands stochastic game from complete rationality to finiterationality and uses bounded rationality stochastic game toanalyze cyber attack-defense On the one hand comparedwith the existing attack-defense stochastic game this ap-proach uses bounded rationality hypothesis which is morerealistic On the other hand compared with evolutionarygame this approach uses reinforcement learning mecha-nism which is more suitable for guiding real-time defensedecision making

3 Modeling of Attack-Defense ConfrontationUsing Stochastic Game Theory

31 Description and Analysis of Cyber Attack-DefenseConfrontation Cyber attack-defense confrontation isa complex problem but from the level of strategy selection itcan be described as a stochastic game problem as depicted inFigure 1 We take the DDoS attack of using Sadmind vul-nerability of Soloris platform as an example e attack isimplemented through multiple steps including IP sweepSadmind ping Sadmind exploit Installing DDoS softwareand Conducting DDoS attack Each attack step can lead tochange of security state of network

Taking the first step as an example the initial networkstate is denoted as S0 (H1 none) It means that the attackerAlice does not have any privileges of host H1 en attackerAlice implemented an IP sweep attack onH1 through its openport 445 and gained the User privilege of H1 is networkstate is denoted as S1 (H1 User) Afterwards if the defenderBob selected and implemented a defense strategy from thecandidate strategy set Reinstall Listener program Installpatches Close unused port then the network state istransferred back to S0 otherwise the network may continueto evolve to another more dangerous state S3

e continuous time axis is divided into time slices andeach time slice contains only one network state e networkstatemay be the same in different time slices Each time slice isa game of attack-defense Both sides detect the current net-work state then select the attack-defense actions according tothe strategy and get immediate returns Attack-defensestrategies are related to network state e network systemtransfers from one state to another under the candidate action

of the attacking and defending sides e transition betweennetwork states is not only affected by attack-defense actionsbut also by factors such as system operating environment andexternal environment which is randome goal of this paperis to enable defenders to obtain higher long-term benefits inattack-defense stochastic game

Both sides of attack-defense can predict the existence ofNash equilibrium so Nash equilibrium is the best strategyfor both sides From the description of complete rationalityin the introduction we can see that the requirement ofcomplete rationality for both sides of attack-defense is toostrict and both sides of attacker and defender will beconstrained by limited rationality in practice Limited ra-tionality means that at least one of the attacking anddefending sides will not adopt Nash equilibrium strategy atthe beginning of the game which means that it is difficult forboth sides to find the optimal strategy in the early stage of thegame and they need to constantly adjust and improve thestrategy for their opponents It means that the game equi-librium is not the result of one choice but that both sides ofthe attack-defense sides are constantly learning to achieve inthe course of the attack-defense confrontation and becauseof the influence of learning mechanism may deviate againeven if it reaches equilibrium

From the above analysis we can see that learningmechanism is the key to win the game of limited rationalityFor defense decision making the learning mechanism ofattack-defense stochastic game under bounded rationalityneeds to satisfy the following two requirements (1) Con-vergence of learning algorithm attacker strategy underbounded rationality has dynamic change characteristics andbecause of the interdependence of attack-defense strategythe defender must learn the corresponding optimal strategywhen facing different attack strategies to ensure that he isinvincible (2) e learning process does not need too muchattacker information both sides of the cyber attack-defensehave opposition of objectives and non-cooperation andboth sides will deliberately hide their key information If toomuch opponent information is needed in the learningprocess the practicability of the learning algorithm will bereduced

WoLF-PHC algorithm is a typical strategy gradientintelligent learning approach which enables defenders tolearn through network feedback without too much in-formation exchange with attackers e introduction ofWoLF mechanism ensures the convergence of WoLF-PHCalgorithm [9] After the attacker learns to adopt Nashequilibrium strategy WoLF mechanism enables the de-fender to converge to the corresponding Nash equilibriumstrategy while the attacker has not yet learned Nash equi-librium strategy and WoLF mechanism enables the de-fender to converge to the corresponding optimal defensestrategy In conclusion WoLF-PHC algorithm can meet thedemand of attack-defense stochastic game under boundedrationality

32 StochasticGameModel forAttack-Defense emappingrelationship between cyber attack-defense and stochastic

Security and Communication Networks 3

game model is depicted in Figure 2 Stochastic game consistsof attack-defense game in each state and transition modelbetween states e two key elements of ldquoinformationrdquo andldquogame orderrdquo are assumed Constrained by bounded ra-tionality the attackerrsquos historical actions and the attackerrsquospayo function are set as the attackerrsquos private information

Herein we use the above example in Figure 1 to explainFigure 2 the security state corresponds to S0 (H1 none) andS1 (H1 User) in this case e candidate strategy set againstDDoS attack is Reinstall Listener program Install patchesClose unused port Network state is the common knowledgeof both sides Because of the non-cooperation between theattack and defense sides the two sides can only observe eachotherrsquos actions through the detection network which willdelay the execution time for at least one time slice so theattack-defense sides are acting at the same time in each timeslice e ldquosimultaneousrdquo here is a concept of informationrather than a concept of time that is the choice of attack-defense sides may not be based on the concept of time At thesame time because the attack-defense sides do not know theother sidersquos choice when choosing action they are consid-ered to be simultaneous action

Construct the network state transition model Useprobability to express the randomness of network statetransition Because the current network state is mainly re-lated to the previous network state the rst-order Markov isused to represent the state transition relationship in whichthe network state is the attack-defense action Because bothsides of attacker and defender are constrained by boundedrationality in order to increase the generality of the modelthe transfer probability is set as the unknown information ofboth sides of attack-defense

On the basis of the above a gamemodel is constructed tosolve the defense decision-making problem

Denition 1 e attack-defense stochastic gamemodel (AD-SGM) is a six-tuple AD minus SGM (N SD R Q π) in which

(1) N (attacker defender) are the two players whoparticipate in the game representing cyber attackersand defenders respectively

(2) S (s1 s2 sn) is a set of stochastic game stateswhich is composed of network states (see Section 33for the specic meaning and generation approach)

Optimalstrategy of

St1

Optimalstrategy of

Sti

Optimalstrategy of

Sprimet1

CurrentstateSprimet2

Optimalstrategy of

SPrimet2

Game Game

Game

t1 t2 ti Time

Defender

CurrentstateSt1

CurrentstateSPrimet2

Currentstate

StiAttacker

DefenderAttacker

DefenderAttacker DefenderAttacker

Evolution

Evolution

∙ ∙ ∙

∙ ∙ ∙

∙ ∙ ∙

∙ ∙ ∙

∙ ∙ ∙

∙ ∙ ∙

∙ ∙ ∙

Game

Figure 1 e game process and strategy selection of attack-defense confrontation

Gamestate

Securitystate

Candidatestrategy

Attack-defenseequilibrium

Cost and benefitanalysis

Attacker and defender

Game plan

Game equilibrium

Payoff function

Players

Cyber attack-defense Stochastic game model

Figure 2 e mapping relationship between cyber attack-defenseand stochastic game model

4 Security and Communication Networks

(3) D (D1 D2 Dn) is the action set of the de-fender in which Dk d1 d2 dm1113864 1113865 is the actionset of the defender in the game state Sk

(4) Rd(si d sj)is the immediate return from state si to sjafter the defender performs action d

(5) Qd(si d) is the state-action payoff function of thedefender indicating the expected payoff of the de-fender after taking action d in the state si

(6) πd(sk) is the defense strategy of the defender in thestate sk

Defense strategy and defense action are two differentconcepts Defense strategy is the rule of defense action notthe action itself For example πd(sk) (πd(sk d1)

πd(sk dm)) is the strategy of the defender in the networkstate sk where πd(sk dm) is the probability of selectingaction dm 1113936disinDk

πd(sk d) 1

33 Network State and Attack-Defense Action ExtractionApproach Based on Attack-Defense Graph Network stateand attack-defense action are important components ofstochastic game model Extraction of network state andattack-defense action is a key point in constructing at-tack-defense stochastic game model [21] In the currentattack-defense stochastic game when describing thenetwork state each network state contains the securityelements of all nodes in the current network e numberof network states is the power set of security elementswhich will produce a state explosion [22] ereforea host-centered attack-defense graph model is proposedEach state node only describes the host state It can ef-fectively reduce the size of state nodes [23] Using thisattack-defense graph to extract network state and attack-defense action is more conducive to cyber attack-defenseconfrontation analysis

Definition 2 attack-defense graph is a binary groupG (S E) in which S s1 s2 sn1113864 1113865 is a set of node se-curity states and si langhost privilegerang host is the uniqueidentity of the node and privilege none user root in-dicates that it does not have any privileges has ordinary userprivileges and has administrator privileges For directededge E (Ea Ed) it indicates that the occurrence of attackor defense action causes the transfer of node state andek (sr vd sd) k a d where sr is the source node and sd

is the destination nodee generation process of attack-defense map is shown

in Figure 3 Firstly target network scanning is used to ac-quire cyber security elements then attack instantiation iscombined with attack template and defense instantiation iscombined with defense template Finally attack-defensegraph is generated e state set of attack-defense stochasticgame model is extracted by attack-defense graph nodes and

the defense action set is extracted by the edge of attack-defense graph

331 Elements of Cyber Security e elements of cybersecurity NSE are composed of network connection Cvulnerability information V service information F andaccess rights P Matrix Csube host times host times port describes theconnection relationship between nodes the row of matrixrepresents the source node shost the list of matrix rep-resents the destination node dhost and the port accessrelationship is represented by matrix elements When portis empty it indicates that there is no connection re-lationship between shost nodes and dhost nodesV langhost service cveidrang indicates the vulnerability ofservices on nodesrsquo host including security vulnerabilitiesand improper configuration or misconfiguration of systemsoftware and application software F langhost servicerang in-dicates that a service is opened on a node hostP langhost privilegerang indicates that an attacker has accessrights privilege on a node host

332 Attack Template Attack template AM is the de-scription of vulnerability utilization whereAM langtid prec postcrang Among them tid is the identifi-cation of attack mode prec langP V C Frang describes the setof prerequisites for an attacker to use a vulnerability in-cluding the initial access rights privilege of the attacker onthe source node shost the vulnerability information cveidof the target node the network connection relationship Cand the running service of the node F Only when the set ofconditions is satisfied the attacker can succeed Use thisvulnerability postc langP C sdrang describes the conse-quences of an attackerrsquos successful use of a vulnerabilityincluding the increase of attackerrsquos access to the targetnode the change of network connection relationship andservice destruction

333 Defense Template Defense templates DM are theresponse measures taken by defenders after predicting oridentifying attacks where DM langtid dsetrang dset

langd1 postd1rang langdm postdmrang1113864 1113865 is the defense strategy setfor specific attacks postdi langF V PCrang describes the im-pact of defense strategy on cyber security elements in-cluding the impact on node service informationvulnerability information attacker privilege informationnode connection relationship and so on

In the process of attack-defense graph generation ifthere is a connection between two nodes and all the pre-requisites for attack occurring are satisfied the edges fromsource node to destination node are added If the attackchanges the security elements such as connectivity the cybersecurity elements should be updated in time If the defensestrategy is implemented the connection between nodes orthe existing rights of attackers should be changed As shownin Algorithm 1 the first step is to use cyber security elements

Security and Communication Networks 5

to generate all possible state nodes and initialize the edgesSteps 2ndash8 are to instantiate attacks and generate all attackedges Steps 9ndash15 are to instantiate defenses and generate alldefense edges Steps 16ndash20 are used to remove all isolatednodes And step 21 is to output attack-defense maps

Assuming the number of nodes in the target network is nand the number of vulnerabilities of each node is m themaximum number of nodes in the attack-defense graph is3n In the attack instantiation stage the computationalcomplexity of analyzing the connection relationship be-tween each two nodes is o(n2 minus n) e computationalcomplexity of matching the vulnerability of the nodes withthe connection relationship is o(m(n2 minus n)) In the defenseinstantiation stage we remove the isolated nodes and thecomputational complexity of traversing the edges of all thenodes is o(9n2 minus 3n) In summary the order of computa-tional complexity of the algorithm is o(n2) e node ofattack-defense graph G can extract network state and theedge of attack-defense graph G can extract attack-defenseaction

4 Stochastic Game Analysis and StrategySelection Based on WoLF-PHCIntelligent Learning

In the previous section cyber attack-defense is described asa bounded rational stochastic game problem and an attack-defense stochastic game model AD-SGM is constructed Inthis section reinforcement learning mechanism is in-troduced into finite rational stochastic game and WoLF-PHC algorithm is used to select defense strategies based onAD-SGM

41 Principle of WoLF-PHC

411 Q-Learning Algorithm Q-learning [24] is the basis ofWoLF-PHC algorithm and a typical model-free re-inforcement learning algorithm Its learning mechanism isshown in Figure 4 Agent in Q-learning obtains knowledgeof return and environment state transfer through interaction

Elements of networksecurity

Network connectionrelations

Node vulnerabilityinformation

Node serviceinformation

Node accessprivilege

Attack template

Attackinstantiation

Defenseinstantiation

Defense template

Atta

ck-d

efen

se g

raph Network

state

Attackaction

Defensiveaction

Figure 3 Attack-defense graph generation

Input Elements of Cyber security NSE Attack Template AM Defense Template DMOutput Attack graph G (S E)

(1) S⟵NSE E⟵emptylowast Generate all nodes lowast(2) for each S dolowast Attack instantiation to generate attack edges lowast(3) update NSE in slowast Updating Cyber security Elements lowast(4) if Cshost shost and CdhostV geAMprecV and CdhostFgeAMprecF and CdhostPprivilegegeAMprecPprivilege(5) srhost⟵Cshost(6) sdhost⟵Cdhost(7) sdprivilege⟵AMpostcPprivilege(8) Ea⟵Ea cup ea(srAMtid sd)1113864 1113865

(9) end if(10) end for(11) for each S dolowast Defense instantiation to generate defense edges lowast(12) if Easd s and DMtid Eatid(13) Ed⟵Ed cup ed(EasdDMdsetd Easr)1113864 1113865

(14) end if(15) end for(16) for each S dolowast Remove isolated nodes S lowast(17) if ea(s tid sd) empty and ed(sr d s) empty(18) S⟵ S minus s

(19) end if(20) end for(21) Return G

ALGORITHM 1 Attack-defense graph generation algorithm

6 Security and Communication Networks

with environment Knowledge is expressed by payoff Qd andlearned by updating Qd Qd is

Qd(s d) Qd(s d) + α Rd s d sprime( 1113857 + cmaxdprime

Qd sprime dprime( 1113857 minus Qd(s d)1113890 1113891

(1)

where α is payoff learning rate and c is the discount factore strategy of Q-learning is πd(s) argmaxdQd(s d)

412 PHC Algorithm e Policy Hill-Climbing algorithm[25] is a simple and practical gradient descent learning al-gorithm suitable for hybrid strategies which is an im-provement of Q-learning e state-action gain function Qd

of PHC is the same as Q-learning but the policy updateapproach of Q-learning is no longer followed but the hybridstrategy πd(sk) is updated by executing the hill-climbingalgorithm as shown in equations (2)ndash(4) In the formula thestrategy learning rate is

πd sk di( 1113857 πd sk di( 1113857 + Δskdi (2)

where

Δskdi

minus δskdi di ne argmax

dQd sk d( 1113857

1113944djnedi

δskdj others

⎧⎪⎪⎪⎨

⎪⎪⎪⎩

(3)

δskdi min πd sk di( 1113857

δDk minus 1

11138681113868111386811138681113868111386811138681113868

1113888 1113889 (4)

413 WoLF-PHC Algorithm WoLF-PHC algorithm is animprovement of PHC algorithm By introducing WoLFmechanism the defender has two different strategy learningrates low strategy learning rate δw when winning and highstrategy learning rate δl when losing as shown in formula (5)e two learning rates enable defenders to adapt quickly toattackersrsquo strategies when they perform worse than expectedand to learn cautiously when they perform better than expectede most important thing is the introduction of WoLFmechanism which guarantees the convergence of the algorithm[9] WoLF-PHC algorithm uses average strategy as the criterionof success and failure as shown in formulae (6) and (7)

δ

δw 1113944disinDk

πd skd( 1113857Qd skd( 1113857gt 1113944disinDk

πd skd( 1113857Qd skd( 1113857

δl others

⎧⎪⎪⎪⎨

⎪⎪⎪⎩

(5)

πd(sd) πd(sd) +1

C(s)πd(sd) minus πd(sd)( 1113857

(6)

C(s) C(s) + 1 (7)

42 Defense Decision-Making Algorithm Based on ImprovedWoLF-PHC e decision-making process of our approachis shown in Figure 5 which consists of five steps It receivestwo types of input data attack evidence and abnormal

Input AD minus SGM α δ λ and c

Output Defense action d(1) initialize AD minus SGM C(s) 0 e(s d) 0lowast Network state and attack-defense actions are extracted by Algorithm 1 lowast(2) slowast get(E) lowast Getting the current network state from Network E lowast(3) repeat(4) dlowast πd(slowast) lowast Select defense action lowast(5) Output dlowast lowast Feedback defense actions to defenders lowast(6) sprime get(E) lowast Get the status after the action dlowast is executed lowast(7) ρlowast Rd(slowast dlowast sprime) + cmaxdprimeQd(sprime dprime) minus Qd(slowast dlowast)

(8) ρg Rd(slowast dlowast sprime) + cmaxdprimeQd(sprime dprime) minus maxdQd(slowast d)

(9) for each state-action pair (s d) except (slowast dlowast) do(10) e(s d) cλe(s d)

(11) Qd(s d) Qd(s d) + αρge(s d)

(12) end forlowast Update noncurrent eligibility trace (slowast dlowast) and values Qd lowast(13) Qd(slowast dlowast) Qd(slowast dlowast) + αρlowastlowast Update Qd of (slowast dlowast) lowast(14) e(slowast dlowast) cλe(slowast dlowast) + 1 lowast Update track of (slowast dlowast)lowast(15) C(slowast) C(slowast) + 1(16) Updating average strategy based on formula (6)(17) Selecting the learning rate of strategies based on formula (5)(18) δslowastd min(πd(slowast d) δ(D(slowast) minus 1))foralld isin D(slowast)

(19) Δslowastd minus δslowastd dne argmaxdi

Qd(slowast di)

1113936djnedlowastδslowastdjOthers1113896

(20) πd(slowast d) πd(slowast d) + Δslowastdforalld isin D(slowast) lowast Update defense strategy lowast(21) slowast sprime(22) end repeat

ALGORITHM 2 Defense decision-making algorithm

Security and Communication Networks 7

evidence All these pieces of evidence come from real-timeintrusion detection systems After decision making theoptimal security strategy is determined against detectedintrusions

In order to improve the learning speed of WoLF-PHCalgorithm and reduce the dependence of the algorithm onthe amount of data the eligibility trace is introduced toimprove WoLF-PHC e eligibility trace can track specificstate-action trajectories of recent visits and then assigncurrent returns to the state-action of recent visits WoLF-PHC algorithm is an extension of Q-learning algorithm Atpresent there are many algorithms combining Q-learningwith eligibility trace is paper improves WoLF-PHC byusing the typical algorithm [10] e qualification trace ofeach state-action is defined as e(s a) Suppose slowast is thecurrent network state slowast and the eligibility trace is updatedin the way shown in formula (8) Among them the traceattenuation factor is λ

e(s d) cλe(s d) sne slowast

cλe(s d) + 1 s slowast1113896 (8)

WoLF-PHC algorithm is an extension of Q-learningalgorithm which belongs to off-policy algorithm It usesgreedy policy when evaluating defense actions for eachnetwork state and occasionally introduces nongreedy policywhen choosing to perform defense actions in order to learnIn order to maintain the off-policy characteristics of WoLF-PHC algorithm the state-action values are updated byformulae (9)ndash(12) in which the defense actions dlowast are

selected for execution slowast because only the recently visitedstatus-action pairs will have significantly more eligibilitytrace than 0 while most other status-action pairs will havealmost none eligibility trace In order to reduce the memoryand running time consumption caused by eligibility traceonly the latest status-action pair eligibility trace can be savedand updated in practical application

Qd slowast dlowast

( 1113857 Qd slowast dlowast

( 1113857 + αρlowast (9)

Qd(s d) Qd(s d) + αρge(s d) (10)

ρlowast Rd slowast d sprime( 1113857 + cmax

dprimeQd sprime dprime( 1113857 minus Qd s

lowast dlowast

( 1113857

(11)

ρg Rd slowast dlowast sprime( 1113857 + cmax

dprimeQd sprime dprime( 1113857 minus max

dQd slowast d( 1113857

(12)

In order to achieve better results the defense decision-making approach based on WoLF-PHC needs to set fourparameters α δ λ and c reasonably (1) e range of thepayoff learning rate is 0lt αlt 1e bigger the representativeα is the more important the cumulative reward is e fasterthe learning speed is the smaller α is and the better thestability of the algorithm is (2)e range of strategy learningrate 0lt δ lt 1 is obtained According to the experiment wecan get a better result when adopting δlδw 4 (3) eattenuation factor λ of eligibility trace is in the range of

Describe andanalyze cyberattack-defenseconfrontation

Model stochastic game

for attack-defense

Extract currentsecurity state andcandidate attack-

defense action

Evaluatepayoff

hellip

helliphellip

How to defend

Decision making

Figure 5 e process of our decision-making approach

Rt+1reward Rt

St+1state St

Ambient

Agent

Figure 4 Q-learning learning mechanism

8 Security and Communication Networks

0lt λlt 1 which is responsible for the credit allocation ofstatus-action It can be regarded as a time scale e greaterthe credit allocated λ to historical status-action the greaterthe credit allocated to historical status-action (4) e rangeof the discount factor 0lt clt 1 represents the defenderrsquospreference for immediate return and future return When c

approaches 0 it means that future returns are irrelevant andimmediate returns are more important When c approaches1 it means immediate returns are irrelevant and futurereturns are more important

Agent in WoLF-PHC is the defender in the stochasticgame model of attack-defense AD minus SGM the game state inagentrsquos state corresponds to AD minus SGM the defense actionin agentrsquos behavior corresponds to AD minus SGM the imme-diate return in agentrsquos immediate return corresponds toAD minus SGM and the defense strategy in agentrsquos strategycorresponds to AD minus SGM

On the basis of the above a specific defense decision-making approach as shown in Algorithm 2 is given e firststep of the algorithm is to initialize the stochastic gamemodel AD minus SGM of attack-defense and the related pa-rameters e network state and attack-defense actions areextracted by Algorithm 1 e second step is to detect thecurrent network state by the defender Steps 3ndash22 are tomake defense decisions and learn online Steps 4-5 are toselect defense actions according to the current strategy steps6ndash14 are to update the benefits by using eligibility traces andsteps 15ndash21 are the new payoffs using mountain climbingalgorithm to update defense strategy

e spatial complexity of the Algorithm 2 mainlyconcentrates on the storage of pairs Rd(s d sprime) such ase(s d) πd(s d) πd(s d) and Qd(s d) e number of statesis |S| |D| and |A| are the numbers of measures taken by thedefender and attacker in each state respectively ecomputational complexity of the proposed algorithm isO(4 middot |S| middot |D| + |S|2 middot |D|) Compared with the recentmethod using evolutionary game model with complexityO(|S|2 middot (|A| + |D|)3) [14] we greatly reduce the computa-tional complexity and increase the practicability of the al-gorithm since the proposed algorithm does not need to solvethe game equilibrium

5 Experimental Analysis

51 Experiment Setup In order to verify the effectiveness ofthis approach a typical enterprise network as shown inFigure 6 is built for experiment Attacks and defenses occuron the intranet with attackers coming from the extranet Asa defender network administrator is responsible for thesecurity of intranet Due to the setting of Firewall 1 andFirewall 2 legal users of the external network can only accessthe web server which can access the database server FTPserver and e-mail server

e simulation experiment was carried out on a PC withIntel Core i7-6300HQ 340GHz 32GB RAM memoryand Windows 10 64 bit operating system e Python 365emulator was installed and the vulnerability information inthe experimental network was scanned by Nessus toolkit asshown in Table 1 e network topology information was

collected by ArcGis toolkit We used the Python language towrite the project code During the experiment we set upabout 25000 times of attack-defense strategy studies eexperimental results were analyzed and displayed usingMatlab2018a as described in Section 53

Referring to MIT Lincoln Lab attack-defense behaviordatabase attack-defense templates are constructed Attack-defense maps are divided into attack maps and defense mapsby using attacker host A web server W database server DFTP server F and e-mail server E In order to facilitatedisplay and description attack-defense maps are dividedinto attack maps and defense maps as shown in Figures 7and 8 respectively e meaning of defense action in thedefense diagram is shown in Table 2

52 Construction of the Experiment Scenario AD-SGM(1) N (attacker defender) are players participating in

the game representing cyber attackers and defendersrespectively

Vulnerabilityscanning system

VDS

Internet

DMZ

FTP server

Firewall 1Trusted zone

Firewall 2Database

serverE-mailserver

Web server

IDS

Attacker

Extranet

Intranet

Figure 6 Experimental network topology

Table 1 Network vulnerability information

Attackidentifier Host CVE Target

privilegetid1 Web server CVE-2015-1635 usertid2 Web server CVE-2017-7269 roottid3 Web server CVE-2014-8517 roottid4 FTP server CVE-2014-3556 roottid5 E-mail server CVE-2014-4877 roottid6 Database server CVE-2013-4730 usertid7 Database server CVE-2016-6662 root

S0(A root)

S1(W user)

S2(W root)S5(D user)

S6(D root)

tid1

tid2

tid3

tid4

tid5

tid6

tid7

tid8

tid9

S3(F root)

S4(E root)

Figure 7 Attack graph

Security and Communication Networks 9

(2) e state set of stochastic game is S (s0 s1

s2 s3 s4 s5 s6) which consists of network state and isextracted from the nodes shown in Figures 7 and 8

(3) e action set of the defender is D (D0 D1

D2 D3 D4 D5 D6) D0 NULL D1 d11113864 d2D2

d3 d41113864 1113865D3 d1 d5 d61113864 1113865D4 d1 d5 d61113864 1113865D5 d11113864

d2 d7D6 d3 d41113864 1113865 and the edges are extractedfrom Figure 8

(4) Quantitative results of immediate returns Rd(si

d sj) of defenders [16 26] are

Rd s0NULL s0( 1113857 Rd s0NULL s1( 1113857 Rd s0NULL s2( 1113857( 1113857 (0 minus 40 minus 59)

Rd s1 d1 s0( 1113857 Rd s1 d1 s1( 1113857 Rd s1 d1 s2( 1113857 Rd s1 d2 s0( 1113857 Rd s1 d2 s1( 1113857 Rd s1 d2 s2( 1113857( 1113857

(40 0 minus 29 5 minus 15 minus 32)

( Rd s2 d3 s0( 1113857 Rd s2 d3 s1( 1113857 Rd s2 d3 s2( 1113857 Rd s2 d3 s3( 1113857 Rd s2 d3 s4( 1113857 Rd s2 d3 s5( 1113857

Rd s2 d4 s0( 1113857 Rd s2d4 s1( 1113857 Rd s2 d4 s2( 1113857 Rd s2 d4 s3( 1113857 Rd s2 d4 s4( 1113857 Rd s2 d4 s5( 11138571113857

(24 9 minus 15 minus 55 minus 49 minus 65 19 5 minus 21 minus 61 minus 72 minus 68)

( Rd s3 d1 s2( 1113857 Rd s3 d1 s3( 1113857 Rd s3 d1 s6( 1113857 Rd s3 d5 s2( 1113857 Rd s3 d5 s3( 1113857 Rd s3 d5 s6( 1113857

Rd s3 d6 s2( 1113857 Rd s3 d6 s3( 1113857 Rd s3 d6 s6( 11138571113857

(21 minus 16 minus 72 15 minus 23 minus 81 minus 21 minus 36 minus 81)

( Rd s4 d1 s2( 1113857 Rd s4 d1 s4( 1113857 Rd s4 d1 s6( 1113857 Rd s4 d5 s2( 1113857 Rd s4 d5 s4( 1113857 Rd s4 d5 s6( 1113857

Rd s4 d6 s2( 1113857 Rd s4 d6 s4( 1113857 Rd s4 d6 s6( 11138571113857

(26 0 minus 62 11 minus 23 minus 75 9 minus 25 minus 87)

( Rd s5 d1 s2( 1113857 Rd s5 d1 s5( 1113857 Rd s5 d1 s6( 1113857 Rd s5 d2 s2( 1113857 Rd s5 d2 s5( 1113857 Rd s5 d2 s6( 1113857

Rd s5 d7 s2( 1113857 Rd s5 d7 s5( 1113857 Rd s5 d7 s6( 11138571113857

(29 0 minus 63 11 minus 21 minus 76 2 minus 27 minus 88)

( Rd s6 d3 s3( 1113857 Rd s6 d3 s3( 1113857 Rd s6 d3 s5( 1113857 Rd s6 d3 s6( 1113857 Rd s6 d4 s3( 1113857

Rd s6 d4 s4( 1113857 Rd s6 d4 s5( 1113857 Rd s6 d4 s6( 11138571113857

(minus 23 minus 21 minus 19 minus 42 minus 28 minus 31 minus 24 minus 49)

(13)

Table 2 Defense action description

Atomic defense action d1 d2 d3 d4 d5 d6 d7

Renew root data Limit SYNICMPpackets Install Oracle patches Reinstall listener program Uninstall delete Trojan Limit access toMDSYS Restart database server Delete suspicious account Add physical resource Repair database Limit packets from ports

S0(A root)

S1(W user)

S2(W root)S5(D user)

S6(D root)

d3

d3

d4

d1d2d7

S3(F root)

S4(E root)d1d5d6

d1d5d6d4

d3

d1d2

Figure 8 Defense graph

10 Security and Communication Networks

(5) In order to detect the learning performance of theAlgorithm 2 more fully the defenderrsquos state-actionpayoffQd(si d) is initialized with a unified 0 withoutintroducing additional prior knowledge

(6) Defenderrsquos defense strategy adopts average strategyto initialize that is πd(sk d1) πd(sk d2) middot middot middot

πd(sk dm) 1113936disinD(sk)πd(sk d) 1forallsk isin S where noadditional prior knowledge is introduced

53 Testing and Analysis e experiment in this section hasthree purposes e first is to test the influence of differentparameter settings on the proposed Algorithm 2 so as to findout the experimental parameters suitable for this scenarioe second is to compare this approach with the existingtypical approaches to verify the advancement of this ap-proach e third is to test the effectiveness of WoLF-PHCalgorithm improvement based on eligibility trace

From Figures 7 and 8 we can see that the state of attack-defense strategy selection is the most complex and repre-sentative erefore the performance of the algorithm isanalyzed by the experimental state selection and the othernetwork state analysis approaches are the same

531 Parameter Test and Analysis Different parameters willaffect the speed and effect of learning At present there is norelevant theory to determine the specific parameters InSection 4 the relevant parameters are preliminarily ana-lyzed On this basis the different parameter settings arefurther tested to find the parameter settings suitable for thisattack-defense scenario Six different parameter settingswere tested Specific parameter settings are shown in Table 3In the experiment the attackerrsquos initial strategy is randomstrategy and the learning mechanism is the same as theapproach in this paper

e probability of the defenderrsquos choice of defense ac-tions and sums in state is shown in Figure 9 e learningspeed and convergence of the algorithm under differentparameter settings can be observed from Figure 9 whichshows that the learning speed of settings 1 3 and 6 is fasterand the best strategy can be obtained after learning less than1500 times under the three settings but convergence of 3and 6 is poor Although the best strategy can be learned bysettings 3 and 6 there will be oscillation afterwards and thestability of setting 1 is not suitable

Defense payoff can represent the degree of optimizationof the strategy In order to ensure that the payoff value doesnot reflect only one defense result the average of 1000defense gains is taken and the change of the average payoffper 1000 defense gains is shown in Figure 10 As can be seenfrom Figure 10 the benefits of Set 3 are significantly lowerthan those of other settings but the advantages and dis-advantages of other settings are difficult to distinguish Inorder to display more intuitively the average value of 25000defense gains calculated under different settings in Figure 10is shown in Figure 11 From Figure 11 we can see that theaverage value of settings 1 and 5 is higher For furthercomparison the standard deviation of settings 1 and 5 is

calculated one step on the basis of the average value to reflectthe discreteness of the gains As shown in Figure 12 thestandard deviations of setting 1 and setting 6 are smallMoreover the result of setting 1 is smaller than setting 6

In conclusion setting 1 of the six sets of parameters is themost suitable for this scenario Since setting 1 has achievedan ideal effect and can meet the experimental requirementsit is no longer necessary to further optimize the parameters

532 Comparisons In this section stochastic game [16] andevolutionary game [20] are selected to conduct comparativeexperiments with this approach According to the differenceof attackerrsquos learning ability this section designs two groupsof comparative experiments In the first group the attackerrsquoslearning ability is weak and will not make adjustments to theattack-defense results In the second group the attackerrsquoslearning ability is strong and adopts the same learningmechanism as the approach in this paper In both groupsthe initial strategies of attackers were random strategies

In the first group of experiments the defense strategy ofthis approach is as shown in Figure 9(a) e defense strategiescalculated by the approach [16] are πd(s2 d3) 07πd(s2 d4) 03 e defense strategies of [20] are evolu-tionarily stable and balanced And the defense strategies of [20]are πd(s2 d3) 08 πd(s2 d4) 02 Its average earnings per1000 times change as shown in Figure 13

From the results of the strategies and benefits of the threeapproaches we can see that the approach in this paper canlearn from the attackerrsquos strategies and adjust to the optimalstrategy so the approach in this paper can obtain the highestbenefits Wei et al [16] adopted a fixed strategy whenconfronting any attacker When the attacker is constrainedby bounded rationality and does not adopt Nash equilibriumstrategy the benefit of this approach is low Although thelearning factors of both attackers and defenders are takeninto account in [20] the parameters required in the modelare difficult to quantify accurately which results in thedeviation between the final results and the actual results sothe benefit of the approach is still lower than that of theapproach in this paper

In the second group of experiments the results of[16 20] are still as πd(s2 d3) 07 πd(s2 d4) 03 andπd(s2 d3) 08 πd(s2 d4) 02 e decision making ofthis approach is shown in Figure 14 After about 1800 timesof learning the approach achieves stability and converges tothe same defense strategy as that of [16] As can be seen fromFigure 15 the payoff of [20] is lower than that of other twoapproaches e average payoff of this approach in the first2000 defenses is higher than that of [26] and then it is almostthe same as that of [26] Combining Figures 14 and 15 wecan see that the learning attacker cannot get Nash equi-librium strategy at the initial stage and the approach in thispaper is better than that in [26] When the attacker learns toget Nash equilibrium strategy the approach in this paper canconverge to Nash equilibrium strategy At this time theperformance of this approach is the same as that in [26]

In conclusion when facing the attackers with weaklearning ability the approach in this paper is superior to that

Security and Communication Networks 11

Table 3 Dierent parameter settings

Set α δl δw λ c

1 001 0004 0001 001 0012 01 0004 0001 001 0013 001 0004 0001 001 014 001 0004 0001 01 0015 001 004 001 001 0016 001 0008 0001 001 001

0 2000 4000 6000s2 defense number

s2 d

efen

se re

turn

10

08

06

04

02

00

d3d4

(a)

s2 d

efen

se re

turn

10

08

06

04

02

000 2000 4000 6000

s2 defense number

d3d4

(b)

s2 d

efen

se re

turn

10

08

06

04

02

000 2000 4000 6000

s2 defense number

d3d4

(c)

s2 d

efen

se re

turn

10

08

06

04

02

000 2000 4000 6000

s2 defense number

d3d4

(d)

s2 d

efen

se re

turn

10

08

06

04

02

000 2000 4000 6000

s2 defense number

d3d4

(e)

s2 d

efen

se re

turn

10

08

06

04

02

000 2000 4000 6000

s2 defense number

d3d4

(f )

Figure 9 Defense decision making under dierent parameter settings (a) Defense decision under setting 1 (b) Defense decision undersetting 2 (c) Defense decision under setting 3 (d) Defense decision under setting 4 (e) Defense decision under setting 5 (f ) Defensedecision under setting 6

5 10 15 20 25s2 defense number (times1000)

s2 d

efen

se re

turn

ndash10

ndash15

ndash20

ndash25

ndash30

Set 1Set 2Set 3

Set 4Set 5Set 6

Figure 10 Defense payo under dierent parameter settings

12 Security and Communication Networks

in [16 20] When facing an attacker with strong learningability if the attacker has not obtained Nash equilibriumthrough learning this approach is still better than [16 20] Ifthe attacker obtains Nash equilibrium through learning thispaper can also obtain the same Nash equilibrium strategy as[26] and obtain its phase e same eect is superior to thatin [20]

533 Test Comparison with and without Eligibility Traceis section tests the actual eect of the eligibility traces onthe Algorithm 2 e eect of eligibility traces on strategyselection is shown in Figure 16 from which we can see thatthe learning speed of the algorithm is faster when thequalied trace is available After 1000 times of learning the

algorithm can converge to the optimal strategy When thequalied trace is not available the algorithm needs about2500 times of learning to converge

Average earnings per 1000 times change as shown inFigure 17 from which we can see that the benets of thealgorithm are almost the same when there is or not anyqualied trace after convergence From Figure 17 we can seethat 3000 defenses before convergence have higher returnsfrom qualied trails than those from unqualied trails Inorder to further verify this the average of the rst 3000defense gains under qualied trails and unqualied trails iscounted 10 times each respectively e results are shown inFigure 18 which further proves that in the preconvergencedefense phase qualied traces are better than unqualiedtraces

e addition of eligibility trace accelerates the learningspeed but also brings additional memory and computingoverhead In the experiment only 10 state-action pairs thatwere recently accessed were saved which eectively reduced

Set 1 Set 2 Set 3 Set 4 Set 5 Set 6

ndash80

ndash85

ndash90

Figure 11 Average return under dierent parameter settings

Set 1 Set 2 Set 3 Set 4 Set 5 Set 6

80

75

Figure 12 Standard deviation of defense payo under dierentparameter settings

ndash10

ndash20

ndash30

ndash40

s2 d

efen

se re

turn

5 10 15 20 25

Our methodReference [12]Reference [16]

s2 defense number (times1000)

Figure 13 Comparison of defense payo change

0 2000 4000 6000s2 defense number

s2 d

efen

se st

rate

gy

08

06

04

02

00

d3d4

Figure 14 Defense decision making of our approach

Our methodReference [12]Reference [16]

5 10 15 20 25s2 defense number (times1000)

s2 d

efen

se re

turn

ndash10

ndash20

ndash30

ndash40

Figure 15 Comparison of defense payo change

Security and Communication Networks 13

0 2000 4000 6000s2 defense number

s2 d

efen

se st

rate

gy

100

075

050

025

000

Eligibility trace d3Eligibility trace d4

No eligibility trace d3No eligibility trace d4

Figure 16 Comparison of defense decision making

Eligibility traceNo eligibility trace

0 5 10 15 20 25s2 defense number (times1000)

s2 d

efen

se re

turn

ndash10

ndash20

ndash15

ndash30

ndash25

Figure 17 Comparison of defense payo change

Eligibility traceNo eligibility trace

1 2 3 4 5 6 7 8 9 10s2 defense number (times3000)

s2 av

erag

e def

ense

retu

rn

ndash12

ndash14

ndash16

ndash18

ndash20

Figure 18 Average payo comparison of the rst 3000 defenses

14 Security and Communication Networks

the increase of memory consumption In order to test thecomputational cost of qualified trail the time of 100000defense decisions made by the algorithm was counted for 20times with and without qualified trail e average of 20times was 951s for qualified trail and 374 s for unqualifiedtrail Although the introduction of eligibility traces willincrease the decision-making time by nearly 25 times thetime required for 100000 decisions after the introduction ofeligibility traces is still only 951 s which can still meet thereal-time requirements

In summary the introduction of eligibility trace at theexpense of a small amount of memory and computingoverhead can effectively increase the learning speed of thealgorithm and improve the defense gains

54 Comprehensive Comparisons is approach is com-pared with some typical research results as shown in Table 4[3 12 14 16] is based on the assumption of complete ra-tionality e equilibrium strategy obtained by it is difficultto appear in practice and has a low guiding effect on actualdefense decision making [20] and this paper have morepracticability on the premise of bounded rationality hy-pothesis but [20] is based on the theory of biologicalevolution and mainly studies population evolution e coreof game analysis is not the optimal strategy choice of playersbut the strategy adjustment process trend and stability ofgroup members composed of bounded rational players andthe stability here refers to group members is approach isnot suitable for guiding individual real-time decisionmaking because the proportion of specific strategies isunchanged and not the strategy of a player On the contrarythe defender of the proposed approach adopts reinforcementlearning mechanism which is based on systematic feedbackto learn in the confrontation with the attacker and is moresuitable for the study of individual strategies

6 Conclusions and Future Works

In this paper cyber attack-defense confrontation is ab-stracted as a stochastic game problem under the restrictionof limited rationality A host-centered attack-defense graphmodel is proposed to extract network state and attack-de-fense action and an algorithm to generate attack-defensegraph is designed to effectively compress the game statespace e WoLF-PHC-based defense decision approach isproposed to overcome the problem which enables defendersunder bounded rationality to make optimal choices whenfacing different attackers e lattice improves the WoLF-

PHC algorithm speeds up the defenderrsquos learning speed andreduces the algorithmrsquos dependence on data is approachnot only satisfies the constraints of bounded rationality butalso does not require the defender to know too much in-formation about the attacker It is a more practical defensedecision-making approach

e future work is to further optimize the winning andlosing criteria of WoLF-PHC algorithm for specific attack-defense scenarios in order to speed up defense learning andincrease defense gains

Data Availability

e data that support the findings of this study are notpublicly available due to restrictions as the data containsensitive information about a real-world enterprise networkAccess to the dataset is restricted by the original ownerPeople who want to access the data should send a request tothe corresponding author who will apply for permission ofsharing the data from the original owner

Conflicts of Interest

e authors declare that they have no conflicts of interestregarding the publication of this paper

Acknowledgments

is work was supported by the National High TechnologyResearch and Development Program of China (863 Pro-gram) (2014AA7116082 and 2015AA7116040) and NationalNatural Science Foundation of China (61902427) In par-ticular we thank Junnan Yang and Hongqi Zhang of ourresearch group for their previous work on stochastic gameWe sincerely thank them for their help in publishing thispaper

References

[1] Z Tian W Shi Y Wang et al ldquoReal time lateral movementdetection based on evidence reasoning network for edgecomputing environmentrdquo IEEE Transactions on IndustrialInformatics vol 15 no 7 pp 4285ndash4294 2019

[2] Y Wang J Yu W Qu H Shen and X Cheng ldquoEvolutionarygame model and analysis approaches for network groupbehaviorrdquo Chinese Journal of Computers vol 38 no 2pp 282ndash300 2015

[3] J Liu H Zhang and Y Liu ldquoResearch on optimal selection ofmoving target defense policy based on dynamic game withincomplete informationrdquo Acta Electronica Sinica vol 46no 1 pp 82ndash29 2018

Table 4 Comprehensive comparison of existing approaches

Ref Game type Model assumption Learning mechanism Game process Applicable object Practicability[3] Stochastic game Rationality mdash Multistage Personal Bad[12] Static game Rationality mdash Single-stage Personal Bad[14] Dynamic game Rationality mdash Multistage Personal Bad[16] Stochastic game Rationality mdash Multistage Personal Bad[20] Evolutionary game Bounded rationality Biological evolution mdash Group GoodOur approach Stochastic game Bounded rationality Reinforcement learning Multistage Personal Good

Security and Communication Networks 15

[4] Z Tian X Gao S Su J Qiu X Du and M GuizanildquoEvaluating reputation management schemes of internet ofvehicles based on evolutionary game theoryrdquo IEEE Trans-actions on Vehicular Technology vol 68 no 6 pp 5971ndash59802019

[5] Q Zhu and S Rass ldquoGame theory meets network securitya tutorialrdquo in Proceedings of the ACM SIGSAC Conference onComputer and Communications Security pp 2163ndash2165Toronto Canada October 2018

[6] X Li C Zhou Y C Tian and Y Qin ldquoA dynamic decision-making approach for intrusion response in industrial controlsystemsrdquo IEEE Transactions on Industrial Informatics vol 15no 5 pp 2544ndash2554 2019

[7] H Hu Y Liu H Zhang and R Pan ldquoOptimal networkdefense strategy selection based on incomplete informationevolutionary gamerdquo IEEE Access vol 6 pp 29806ndash298212018

[8] J Li G Kendall and R John ldquoComputing Nash equilibriaand evolutionarily stable states of evolutionary gamesrdquo IEEETransactions on Evolutionary Computation vol 20 no 3pp 460ndash469 2016

[9] M Bowling and M Veloso Multiagent Learning Usinga Variable Learning Rate Elsevier Science Publishers LtdAmsterdam Netherlands 2002

[10] J Harb and D Precup ldquoInvestigating recurrence and eligi-bility traces in deep Q-networksrdquo httpsarxivorgabs170405495 2017

[11] C T Do N H Tran and C Hong ldquoGame theory for cybersecurity and privacyrdquo ACM Computing Surveys vol 50 no 2p 30 2017

[12] Y-L Liu D-G Feng L-HWu and Y-F Lian ldquoPerformanceevaluation of worm attack and defense strategies based onstatic Bayesian gamerdquo Chinese Journal of Software vol 23no 3 pp 712ndash723 2012

[13] Y Li D E Quevedo S Dey and L Shi ldquoA game-theoreticapproach to fake-acknowledgment attack on cyber-physicalsystemsrdquo IEEE Transactions on Signal and Information Pro-cessing over Networks vol 3 no 1 pp 1ndash11 2017

[14] H Zhang L Jiang S Huang J Wang and Y Zhang ldquoAttack-defense differential game model for network defense strategyselectionrdquo IEEE Access vol 7 pp 50618ndash50629 2019

[15] A Afrand and S K Das ldquoPreventing DoS attacks in wirelesssensor networks a repeated game theory approachrdquo InternetJournal of Cyber Security vol 5 no 2 pp 145ndash153 2007

[16] J Wei B Fang and Z Tian ldquoResearch on defense strategiesselection based on attack-defense stochastic game modelrdquoChinese Journal of Computer Research and Developmentvol 47 no 10 pp 1714ndash1723 2010

[17] C Wang X Cheng and Y Zhu ldquoA Markov game model ofcomputer network operationrdquo Chinese Systems Engineering-gteory and Practice vol 34 no 9 pp 2402ndash2410 2014

[18] J Hofbauer and K Sigmund ldquoEvolutionary game dynamicsrdquoBulletin of the American Mathematical Society vol 40 no 4pp 479ndash519 2011

[19] Y Hayel and Q Zhu ldquoEpidemic protection over heteroge-neous networks using evolutionary Poisson gamesrdquo IEEETransactions on Information Forensics and Security vol 12no 8 pp 1786ndash1800 2017

[20] J Huang and H Zhang ldquoImproving replicator dynamicevolutionary game model for selecting optimal defensestrategiesrdquo Chinese Journal on Communications vol 38 no 1pp 170ndash182 2018

[21] H Hu Y Liu H Zhang and Y Zhang ldquoSecurity metricmethods for network multistep attacks using AMC and big

data correlation analysisrdquo Security and CommunicationNetworks vol 2018 Article ID 5787102 14 pages 2018

[22] H Hu Y Liu Y Yang H Zhang and Y Zhang ldquoNew in-sights into approaches to evaluating intention and path fornetwork multistep attacksrdquo Mathematical Problems in En-gineering vol 2018 Article ID 4278632 13 pages 2018

[23] H Hu H Zhang Y Liu and Y Wang ldquoQuantitative methodfor network security situation based on attack predictionrdquoSecurity and Communication Networks vol 2017 Article ID3407642 19 pages 2017

[24] M Zimmer and S Doncieux ldquoBootstrapping Q-learning forrobotics from neuro-evolution resultsrdquo IEEE Transactions onCognitive and Developmental Systems vol 10 no 1pp 102ndash119 2018

[25] R S Sutton and A G Barto ldquoReinforcement learning anintroduction Bradford bookrdquo Machine Learning vol 16no 1 pp 285-286 2005

[26] H Zhang J Wang D Yu J Han and T Li ldquoActive defensestrategy selection based on static Bayesian gamerdquo in Pro-ceedings of IET International Conference on CyberspaceTechnology pp 1ndash7 London UK October 2016

16 Security and Communication Networks

International Journal of

AerospaceEngineeringHindawiwwwhindawicom Volume 2018

RoboticsJournal of

Hindawiwwwhindawicom Volume 2018

Hindawiwwwhindawicom Volume 2018

Active and Passive Electronic Components

VLSI Design

Hindawiwwwhindawicom Volume 2018

Hindawiwwwhindawicom Volume 2018

Shock and Vibration

Hindawiwwwhindawicom Volume 2018

Civil EngineeringAdvances in

Acoustics and VibrationAdvances in

Hindawiwwwhindawicom Volume 2018

Hindawiwwwhindawicom Volume 2018

Electrical and Computer Engineering

Journal of

Advances inOptoElectronics

Hindawiwwwhindawicom

Volume 2018

Hindawi Publishing Corporation httpwwwhindawicom Volume 2013Hindawiwwwhindawicom

The Scientific World Journal

Volume 2018

Control Scienceand Engineering

Journal of

Hindawiwwwhindawicom Volume 2018

Hindawiwwwhindawicom

Journal ofEngineeringVolume 2018

SensorsJournal of

Hindawiwwwhindawicom Volume 2018

International Journal of

RotatingMachinery

Hindawiwwwhindawicom Volume 2018

Modelling ampSimulationin EngineeringHindawiwwwhindawicom Volume 2018

Hindawiwwwhindawicom Volume 2018

Chemical EngineeringInternational Journal of Antennas and

Propagation

International Journal of

Hindawiwwwhindawicom Volume 2018

Hindawiwwwhindawicom Volume 2018

Navigation and Observation

International Journal of

Hindawi

wwwhindawicom Volume 2018

Advances in

Multimedia

Submit your manuscripts atwwwhindawicom

Page 4: Research Article - Hindawi Publishing Corporationdownloads.hindawi.com/journals/scn/2019/3038586.pdf · Afrand and Das [15] established a repeated game model between the intrusion

game model is depicted in Figure 2 Stochastic game consistsof attack-defense game in each state and transition modelbetween states e two key elements of ldquoinformationrdquo andldquogame orderrdquo are assumed Constrained by bounded ra-tionality the attackerrsquos historical actions and the attackerrsquospayo function are set as the attackerrsquos private information

Herein we use the above example in Figure 1 to explainFigure 2 the security state corresponds to S0 (H1 none) andS1 (H1 User) in this case e candidate strategy set againstDDoS attack is Reinstall Listener program Install patchesClose unused port Network state is the common knowledgeof both sides Because of the non-cooperation between theattack and defense sides the two sides can only observe eachotherrsquos actions through the detection network which willdelay the execution time for at least one time slice so theattack-defense sides are acting at the same time in each timeslice e ldquosimultaneousrdquo here is a concept of informationrather than a concept of time that is the choice of attack-defense sides may not be based on the concept of time At thesame time because the attack-defense sides do not know theother sidersquos choice when choosing action they are consid-ered to be simultaneous action

Construct the network state transition model Useprobability to express the randomness of network statetransition Because the current network state is mainly re-lated to the previous network state the rst-order Markov isused to represent the state transition relationship in whichthe network state is the attack-defense action Because bothsides of attacker and defender are constrained by boundedrationality in order to increase the generality of the modelthe transfer probability is set as the unknown information ofboth sides of attack-defense

On the basis of the above a gamemodel is constructed tosolve the defense decision-making problem

Denition 1 e attack-defense stochastic gamemodel (AD-SGM) is a six-tuple AD minus SGM (N SD R Q π) in which

(1) N (attacker defender) are the two players whoparticipate in the game representing cyber attackersand defenders respectively

(2) S (s1 s2 sn) is a set of stochastic game stateswhich is composed of network states (see Section 33for the specic meaning and generation approach)

Optimalstrategy of

St1

Optimalstrategy of

Sti

Optimalstrategy of

Sprimet1

CurrentstateSprimet2

Optimalstrategy of

SPrimet2

Game Game

Game

t1 t2 ti Time

Defender

CurrentstateSt1

CurrentstateSPrimet2

Currentstate

StiAttacker

DefenderAttacker

DefenderAttacker DefenderAttacker

Evolution

Evolution

∙ ∙ ∙

∙ ∙ ∙

∙ ∙ ∙

∙ ∙ ∙

∙ ∙ ∙

∙ ∙ ∙

∙ ∙ ∙

Game

Figure 1 e game process and strategy selection of attack-defense confrontation

Gamestate

Securitystate

Candidatestrategy

Attack-defenseequilibrium

Cost and benefitanalysis

Attacker and defender

Game plan

Game equilibrium

Payoff function

Players

Cyber attack-defense Stochastic game model

Figure 2 e mapping relationship between cyber attack-defenseand stochastic game model

4 Security and Communication Networks

(3) D (D1 D2 Dn) is the action set of the de-fender in which Dk d1 d2 dm1113864 1113865 is the actionset of the defender in the game state Sk

(4) Rd(si d sj)is the immediate return from state si to sjafter the defender performs action d

(5) Qd(si d) is the state-action payoff function of thedefender indicating the expected payoff of the de-fender after taking action d in the state si

(6) πd(sk) is the defense strategy of the defender in thestate sk

Defense strategy and defense action are two differentconcepts Defense strategy is the rule of defense action notthe action itself For example πd(sk) (πd(sk d1)

πd(sk dm)) is the strategy of the defender in the networkstate sk where πd(sk dm) is the probability of selectingaction dm 1113936disinDk

πd(sk d) 1

33 Network State and Attack-Defense Action ExtractionApproach Based on Attack-Defense Graph Network stateand attack-defense action are important components ofstochastic game model Extraction of network state andattack-defense action is a key point in constructing at-tack-defense stochastic game model [21] In the currentattack-defense stochastic game when describing thenetwork state each network state contains the securityelements of all nodes in the current network e numberof network states is the power set of security elementswhich will produce a state explosion [22] ereforea host-centered attack-defense graph model is proposedEach state node only describes the host state It can ef-fectively reduce the size of state nodes [23] Using thisattack-defense graph to extract network state and attack-defense action is more conducive to cyber attack-defenseconfrontation analysis

Definition 2 attack-defense graph is a binary groupG (S E) in which S s1 s2 sn1113864 1113865 is a set of node se-curity states and si langhost privilegerang host is the uniqueidentity of the node and privilege none user root in-dicates that it does not have any privileges has ordinary userprivileges and has administrator privileges For directededge E (Ea Ed) it indicates that the occurrence of attackor defense action causes the transfer of node state andek (sr vd sd) k a d where sr is the source node and sd

is the destination nodee generation process of attack-defense map is shown

in Figure 3 Firstly target network scanning is used to ac-quire cyber security elements then attack instantiation iscombined with attack template and defense instantiation iscombined with defense template Finally attack-defensegraph is generated e state set of attack-defense stochasticgame model is extracted by attack-defense graph nodes and

the defense action set is extracted by the edge of attack-defense graph

331 Elements of Cyber Security e elements of cybersecurity NSE are composed of network connection Cvulnerability information V service information F andaccess rights P Matrix Csube host times host times port describes theconnection relationship between nodes the row of matrixrepresents the source node shost the list of matrix rep-resents the destination node dhost and the port accessrelationship is represented by matrix elements When portis empty it indicates that there is no connection re-lationship between shost nodes and dhost nodesV langhost service cveidrang indicates the vulnerability ofservices on nodesrsquo host including security vulnerabilitiesand improper configuration or misconfiguration of systemsoftware and application software F langhost servicerang in-dicates that a service is opened on a node hostP langhost privilegerang indicates that an attacker has accessrights privilege on a node host

332 Attack Template Attack template AM is the de-scription of vulnerability utilization whereAM langtid prec postcrang Among them tid is the identifi-cation of attack mode prec langP V C Frang describes the setof prerequisites for an attacker to use a vulnerability in-cluding the initial access rights privilege of the attacker onthe source node shost the vulnerability information cveidof the target node the network connection relationship Cand the running service of the node F Only when the set ofconditions is satisfied the attacker can succeed Use thisvulnerability postc langP C sdrang describes the conse-quences of an attackerrsquos successful use of a vulnerabilityincluding the increase of attackerrsquos access to the targetnode the change of network connection relationship andservice destruction

333 Defense Template Defense templates DM are theresponse measures taken by defenders after predicting oridentifying attacks where DM langtid dsetrang dset

langd1 postd1rang langdm postdmrang1113864 1113865 is the defense strategy setfor specific attacks postdi langF V PCrang describes the im-pact of defense strategy on cyber security elements in-cluding the impact on node service informationvulnerability information attacker privilege informationnode connection relationship and so on

In the process of attack-defense graph generation ifthere is a connection between two nodes and all the pre-requisites for attack occurring are satisfied the edges fromsource node to destination node are added If the attackchanges the security elements such as connectivity the cybersecurity elements should be updated in time If the defensestrategy is implemented the connection between nodes orthe existing rights of attackers should be changed As shownin Algorithm 1 the first step is to use cyber security elements

Security and Communication Networks 5

to generate all possible state nodes and initialize the edgesSteps 2ndash8 are to instantiate attacks and generate all attackedges Steps 9ndash15 are to instantiate defenses and generate alldefense edges Steps 16ndash20 are used to remove all isolatednodes And step 21 is to output attack-defense maps

Assuming the number of nodes in the target network is nand the number of vulnerabilities of each node is m themaximum number of nodes in the attack-defense graph is3n In the attack instantiation stage the computationalcomplexity of analyzing the connection relationship be-tween each two nodes is o(n2 minus n) e computationalcomplexity of matching the vulnerability of the nodes withthe connection relationship is o(m(n2 minus n)) In the defenseinstantiation stage we remove the isolated nodes and thecomputational complexity of traversing the edges of all thenodes is o(9n2 minus 3n) In summary the order of computa-tional complexity of the algorithm is o(n2) e node ofattack-defense graph G can extract network state and theedge of attack-defense graph G can extract attack-defenseaction

4 Stochastic Game Analysis and StrategySelection Based on WoLF-PHCIntelligent Learning

In the previous section cyber attack-defense is described asa bounded rational stochastic game problem and an attack-defense stochastic game model AD-SGM is constructed Inthis section reinforcement learning mechanism is in-troduced into finite rational stochastic game and WoLF-PHC algorithm is used to select defense strategies based onAD-SGM

41 Principle of WoLF-PHC

411 Q-Learning Algorithm Q-learning [24] is the basis ofWoLF-PHC algorithm and a typical model-free re-inforcement learning algorithm Its learning mechanism isshown in Figure 4 Agent in Q-learning obtains knowledgeof return and environment state transfer through interaction

Elements of networksecurity

Network connectionrelations

Node vulnerabilityinformation

Node serviceinformation

Node accessprivilege

Attack template

Attackinstantiation

Defenseinstantiation

Defense template

Atta

ck-d

efen

se g

raph Network

state

Attackaction

Defensiveaction

Figure 3 Attack-defense graph generation

Input Elements of Cyber security NSE Attack Template AM Defense Template DMOutput Attack graph G (S E)

(1) S⟵NSE E⟵emptylowast Generate all nodes lowast(2) for each S dolowast Attack instantiation to generate attack edges lowast(3) update NSE in slowast Updating Cyber security Elements lowast(4) if Cshost shost and CdhostV geAMprecV and CdhostFgeAMprecF and CdhostPprivilegegeAMprecPprivilege(5) srhost⟵Cshost(6) sdhost⟵Cdhost(7) sdprivilege⟵AMpostcPprivilege(8) Ea⟵Ea cup ea(srAMtid sd)1113864 1113865

(9) end if(10) end for(11) for each S dolowast Defense instantiation to generate defense edges lowast(12) if Easd s and DMtid Eatid(13) Ed⟵Ed cup ed(EasdDMdsetd Easr)1113864 1113865

(14) end if(15) end for(16) for each S dolowast Remove isolated nodes S lowast(17) if ea(s tid sd) empty and ed(sr d s) empty(18) S⟵ S minus s

(19) end if(20) end for(21) Return G

ALGORITHM 1 Attack-defense graph generation algorithm

6 Security and Communication Networks

with environment Knowledge is expressed by payoff Qd andlearned by updating Qd Qd is

Qd(s d) Qd(s d) + α Rd s d sprime( 1113857 + cmaxdprime

Qd sprime dprime( 1113857 minus Qd(s d)1113890 1113891

(1)

where α is payoff learning rate and c is the discount factore strategy of Q-learning is πd(s) argmaxdQd(s d)

412 PHC Algorithm e Policy Hill-Climbing algorithm[25] is a simple and practical gradient descent learning al-gorithm suitable for hybrid strategies which is an im-provement of Q-learning e state-action gain function Qd

of PHC is the same as Q-learning but the policy updateapproach of Q-learning is no longer followed but the hybridstrategy πd(sk) is updated by executing the hill-climbingalgorithm as shown in equations (2)ndash(4) In the formula thestrategy learning rate is

πd sk di( 1113857 πd sk di( 1113857 + Δskdi (2)

where

Δskdi

minus δskdi di ne argmax

dQd sk d( 1113857

1113944djnedi

δskdj others

⎧⎪⎪⎪⎨

⎪⎪⎪⎩

(3)

δskdi min πd sk di( 1113857

δDk minus 1

11138681113868111386811138681113868111386811138681113868

1113888 1113889 (4)

413 WoLF-PHC Algorithm WoLF-PHC algorithm is animprovement of PHC algorithm By introducing WoLFmechanism the defender has two different strategy learningrates low strategy learning rate δw when winning and highstrategy learning rate δl when losing as shown in formula (5)e two learning rates enable defenders to adapt quickly toattackersrsquo strategies when they perform worse than expectedand to learn cautiously when they perform better than expectede most important thing is the introduction of WoLFmechanism which guarantees the convergence of the algorithm[9] WoLF-PHC algorithm uses average strategy as the criterionof success and failure as shown in formulae (6) and (7)

δ

δw 1113944disinDk

πd skd( 1113857Qd skd( 1113857gt 1113944disinDk

πd skd( 1113857Qd skd( 1113857

δl others

⎧⎪⎪⎪⎨

⎪⎪⎪⎩

(5)

πd(sd) πd(sd) +1

C(s)πd(sd) minus πd(sd)( 1113857

(6)

C(s) C(s) + 1 (7)

42 Defense Decision-Making Algorithm Based on ImprovedWoLF-PHC e decision-making process of our approachis shown in Figure 5 which consists of five steps It receivestwo types of input data attack evidence and abnormal

Input AD minus SGM α δ λ and c

Output Defense action d(1) initialize AD minus SGM C(s) 0 e(s d) 0lowast Network state and attack-defense actions are extracted by Algorithm 1 lowast(2) slowast get(E) lowast Getting the current network state from Network E lowast(3) repeat(4) dlowast πd(slowast) lowast Select defense action lowast(5) Output dlowast lowast Feedback defense actions to defenders lowast(6) sprime get(E) lowast Get the status after the action dlowast is executed lowast(7) ρlowast Rd(slowast dlowast sprime) + cmaxdprimeQd(sprime dprime) minus Qd(slowast dlowast)

(8) ρg Rd(slowast dlowast sprime) + cmaxdprimeQd(sprime dprime) minus maxdQd(slowast d)

(9) for each state-action pair (s d) except (slowast dlowast) do(10) e(s d) cλe(s d)

(11) Qd(s d) Qd(s d) + αρge(s d)

(12) end forlowast Update noncurrent eligibility trace (slowast dlowast) and values Qd lowast(13) Qd(slowast dlowast) Qd(slowast dlowast) + αρlowastlowast Update Qd of (slowast dlowast) lowast(14) e(slowast dlowast) cλe(slowast dlowast) + 1 lowast Update track of (slowast dlowast)lowast(15) C(slowast) C(slowast) + 1(16) Updating average strategy based on formula (6)(17) Selecting the learning rate of strategies based on formula (5)(18) δslowastd min(πd(slowast d) δ(D(slowast) minus 1))foralld isin D(slowast)

(19) Δslowastd minus δslowastd dne argmaxdi

Qd(slowast di)

1113936djnedlowastδslowastdjOthers1113896

(20) πd(slowast d) πd(slowast d) + Δslowastdforalld isin D(slowast) lowast Update defense strategy lowast(21) slowast sprime(22) end repeat

ALGORITHM 2 Defense decision-making algorithm

Security and Communication Networks 7

evidence All these pieces of evidence come from real-timeintrusion detection systems After decision making theoptimal security strategy is determined against detectedintrusions

In order to improve the learning speed of WoLF-PHCalgorithm and reduce the dependence of the algorithm onthe amount of data the eligibility trace is introduced toimprove WoLF-PHC e eligibility trace can track specificstate-action trajectories of recent visits and then assigncurrent returns to the state-action of recent visits WoLF-PHC algorithm is an extension of Q-learning algorithm Atpresent there are many algorithms combining Q-learningwith eligibility trace is paper improves WoLF-PHC byusing the typical algorithm [10] e qualification trace ofeach state-action is defined as e(s a) Suppose slowast is thecurrent network state slowast and the eligibility trace is updatedin the way shown in formula (8) Among them the traceattenuation factor is λ

e(s d) cλe(s d) sne slowast

cλe(s d) + 1 s slowast1113896 (8)

WoLF-PHC algorithm is an extension of Q-learningalgorithm which belongs to off-policy algorithm It usesgreedy policy when evaluating defense actions for eachnetwork state and occasionally introduces nongreedy policywhen choosing to perform defense actions in order to learnIn order to maintain the off-policy characteristics of WoLF-PHC algorithm the state-action values are updated byformulae (9)ndash(12) in which the defense actions dlowast are

selected for execution slowast because only the recently visitedstatus-action pairs will have significantly more eligibilitytrace than 0 while most other status-action pairs will havealmost none eligibility trace In order to reduce the memoryand running time consumption caused by eligibility traceonly the latest status-action pair eligibility trace can be savedand updated in practical application

Qd slowast dlowast

( 1113857 Qd slowast dlowast

( 1113857 + αρlowast (9)

Qd(s d) Qd(s d) + αρge(s d) (10)

ρlowast Rd slowast d sprime( 1113857 + cmax

dprimeQd sprime dprime( 1113857 minus Qd s

lowast dlowast

( 1113857

(11)

ρg Rd slowast dlowast sprime( 1113857 + cmax

dprimeQd sprime dprime( 1113857 minus max

dQd slowast d( 1113857

(12)

In order to achieve better results the defense decision-making approach based on WoLF-PHC needs to set fourparameters α δ λ and c reasonably (1) e range of thepayoff learning rate is 0lt αlt 1e bigger the representativeα is the more important the cumulative reward is e fasterthe learning speed is the smaller α is and the better thestability of the algorithm is (2)e range of strategy learningrate 0lt δ lt 1 is obtained According to the experiment wecan get a better result when adopting δlδw 4 (3) eattenuation factor λ of eligibility trace is in the range of

Describe andanalyze cyberattack-defenseconfrontation

Model stochastic game

for attack-defense

Extract currentsecurity state andcandidate attack-

defense action

Evaluatepayoff

hellip

helliphellip

How to defend

Decision making

Figure 5 e process of our decision-making approach

Rt+1reward Rt

St+1state St

Ambient

Agent

Figure 4 Q-learning learning mechanism

8 Security and Communication Networks

0lt λlt 1 which is responsible for the credit allocation ofstatus-action It can be regarded as a time scale e greaterthe credit allocated λ to historical status-action the greaterthe credit allocated to historical status-action (4) e rangeof the discount factor 0lt clt 1 represents the defenderrsquospreference for immediate return and future return When c

approaches 0 it means that future returns are irrelevant andimmediate returns are more important When c approaches1 it means immediate returns are irrelevant and futurereturns are more important

Agent in WoLF-PHC is the defender in the stochasticgame model of attack-defense AD minus SGM the game state inagentrsquos state corresponds to AD minus SGM the defense actionin agentrsquos behavior corresponds to AD minus SGM the imme-diate return in agentrsquos immediate return corresponds toAD minus SGM and the defense strategy in agentrsquos strategycorresponds to AD minus SGM

On the basis of the above a specific defense decision-making approach as shown in Algorithm 2 is given e firststep of the algorithm is to initialize the stochastic gamemodel AD minus SGM of attack-defense and the related pa-rameters e network state and attack-defense actions areextracted by Algorithm 1 e second step is to detect thecurrent network state by the defender Steps 3ndash22 are tomake defense decisions and learn online Steps 4-5 are toselect defense actions according to the current strategy steps6ndash14 are to update the benefits by using eligibility traces andsteps 15ndash21 are the new payoffs using mountain climbingalgorithm to update defense strategy

e spatial complexity of the Algorithm 2 mainlyconcentrates on the storage of pairs Rd(s d sprime) such ase(s d) πd(s d) πd(s d) and Qd(s d) e number of statesis |S| |D| and |A| are the numbers of measures taken by thedefender and attacker in each state respectively ecomputational complexity of the proposed algorithm isO(4 middot |S| middot |D| + |S|2 middot |D|) Compared with the recentmethod using evolutionary game model with complexityO(|S|2 middot (|A| + |D|)3) [14] we greatly reduce the computa-tional complexity and increase the practicability of the al-gorithm since the proposed algorithm does not need to solvethe game equilibrium

5 Experimental Analysis

51 Experiment Setup In order to verify the effectiveness ofthis approach a typical enterprise network as shown inFigure 6 is built for experiment Attacks and defenses occuron the intranet with attackers coming from the extranet Asa defender network administrator is responsible for thesecurity of intranet Due to the setting of Firewall 1 andFirewall 2 legal users of the external network can only accessthe web server which can access the database server FTPserver and e-mail server

e simulation experiment was carried out on a PC withIntel Core i7-6300HQ 340GHz 32GB RAM memoryand Windows 10 64 bit operating system e Python 365emulator was installed and the vulnerability information inthe experimental network was scanned by Nessus toolkit asshown in Table 1 e network topology information was

collected by ArcGis toolkit We used the Python language towrite the project code During the experiment we set upabout 25000 times of attack-defense strategy studies eexperimental results were analyzed and displayed usingMatlab2018a as described in Section 53

Referring to MIT Lincoln Lab attack-defense behaviordatabase attack-defense templates are constructed Attack-defense maps are divided into attack maps and defense mapsby using attacker host A web server W database server DFTP server F and e-mail server E In order to facilitatedisplay and description attack-defense maps are dividedinto attack maps and defense maps as shown in Figures 7and 8 respectively e meaning of defense action in thedefense diagram is shown in Table 2

52 Construction of the Experiment Scenario AD-SGM(1) N (attacker defender) are players participating in

the game representing cyber attackers and defendersrespectively

Vulnerabilityscanning system

VDS

Internet

DMZ

FTP server

Firewall 1Trusted zone

Firewall 2Database

serverE-mailserver

Web server

IDS

Attacker

Extranet

Intranet

Figure 6 Experimental network topology

Table 1 Network vulnerability information

Attackidentifier Host CVE Target

privilegetid1 Web server CVE-2015-1635 usertid2 Web server CVE-2017-7269 roottid3 Web server CVE-2014-8517 roottid4 FTP server CVE-2014-3556 roottid5 E-mail server CVE-2014-4877 roottid6 Database server CVE-2013-4730 usertid7 Database server CVE-2016-6662 root

S0(A root)

S1(W user)

S2(W root)S5(D user)

S6(D root)

tid1

tid2

tid3

tid4

tid5

tid6

tid7

tid8

tid9

S3(F root)

S4(E root)

Figure 7 Attack graph

Security and Communication Networks 9

(2) e state set of stochastic game is S (s0 s1

s2 s3 s4 s5 s6) which consists of network state and isextracted from the nodes shown in Figures 7 and 8

(3) e action set of the defender is D (D0 D1

D2 D3 D4 D5 D6) D0 NULL D1 d11113864 d2D2

d3 d41113864 1113865D3 d1 d5 d61113864 1113865D4 d1 d5 d61113864 1113865D5 d11113864

d2 d7D6 d3 d41113864 1113865 and the edges are extractedfrom Figure 8

(4) Quantitative results of immediate returns Rd(si

d sj) of defenders [16 26] are

Rd s0NULL s0( 1113857 Rd s0NULL s1( 1113857 Rd s0NULL s2( 1113857( 1113857 (0 minus 40 minus 59)

Rd s1 d1 s0( 1113857 Rd s1 d1 s1( 1113857 Rd s1 d1 s2( 1113857 Rd s1 d2 s0( 1113857 Rd s1 d2 s1( 1113857 Rd s1 d2 s2( 1113857( 1113857

(40 0 minus 29 5 minus 15 minus 32)

( Rd s2 d3 s0( 1113857 Rd s2 d3 s1( 1113857 Rd s2 d3 s2( 1113857 Rd s2 d3 s3( 1113857 Rd s2 d3 s4( 1113857 Rd s2 d3 s5( 1113857

Rd s2 d4 s0( 1113857 Rd s2d4 s1( 1113857 Rd s2 d4 s2( 1113857 Rd s2 d4 s3( 1113857 Rd s2 d4 s4( 1113857 Rd s2 d4 s5( 11138571113857

(24 9 minus 15 minus 55 minus 49 minus 65 19 5 minus 21 minus 61 minus 72 minus 68)

( Rd s3 d1 s2( 1113857 Rd s3 d1 s3( 1113857 Rd s3 d1 s6( 1113857 Rd s3 d5 s2( 1113857 Rd s3 d5 s3( 1113857 Rd s3 d5 s6( 1113857

Rd s3 d6 s2( 1113857 Rd s3 d6 s3( 1113857 Rd s3 d6 s6( 11138571113857

(21 minus 16 minus 72 15 minus 23 minus 81 minus 21 minus 36 minus 81)

( Rd s4 d1 s2( 1113857 Rd s4 d1 s4( 1113857 Rd s4 d1 s6( 1113857 Rd s4 d5 s2( 1113857 Rd s4 d5 s4( 1113857 Rd s4 d5 s6( 1113857

Rd s4 d6 s2( 1113857 Rd s4 d6 s4( 1113857 Rd s4 d6 s6( 11138571113857

(26 0 minus 62 11 minus 23 minus 75 9 minus 25 minus 87)

( Rd s5 d1 s2( 1113857 Rd s5 d1 s5( 1113857 Rd s5 d1 s6( 1113857 Rd s5 d2 s2( 1113857 Rd s5 d2 s5( 1113857 Rd s5 d2 s6( 1113857

Rd s5 d7 s2( 1113857 Rd s5 d7 s5( 1113857 Rd s5 d7 s6( 11138571113857

(29 0 minus 63 11 minus 21 minus 76 2 minus 27 minus 88)

( Rd s6 d3 s3( 1113857 Rd s6 d3 s3( 1113857 Rd s6 d3 s5( 1113857 Rd s6 d3 s6( 1113857 Rd s6 d4 s3( 1113857

Rd s6 d4 s4( 1113857 Rd s6 d4 s5( 1113857 Rd s6 d4 s6( 11138571113857

(minus 23 minus 21 minus 19 minus 42 minus 28 minus 31 minus 24 minus 49)

(13)

Table 2 Defense action description

Atomic defense action d1 d2 d3 d4 d5 d6 d7

Renew root data Limit SYNICMPpackets Install Oracle patches Reinstall listener program Uninstall delete Trojan Limit access toMDSYS Restart database server Delete suspicious account Add physical resource Repair database Limit packets from ports

S0(A root)

S1(W user)

S2(W root)S5(D user)

S6(D root)

d3

d3

d4

d1d2d7

S3(F root)

S4(E root)d1d5d6

d1d5d6d4

d3

d1d2

Figure 8 Defense graph

10 Security and Communication Networks

(5) In order to detect the learning performance of theAlgorithm 2 more fully the defenderrsquos state-actionpayoffQd(si d) is initialized with a unified 0 withoutintroducing additional prior knowledge

(6) Defenderrsquos defense strategy adopts average strategyto initialize that is πd(sk d1) πd(sk d2) middot middot middot

πd(sk dm) 1113936disinD(sk)πd(sk d) 1forallsk isin S where noadditional prior knowledge is introduced

53 Testing and Analysis e experiment in this section hasthree purposes e first is to test the influence of differentparameter settings on the proposed Algorithm 2 so as to findout the experimental parameters suitable for this scenarioe second is to compare this approach with the existingtypical approaches to verify the advancement of this ap-proach e third is to test the effectiveness of WoLF-PHCalgorithm improvement based on eligibility trace

From Figures 7 and 8 we can see that the state of attack-defense strategy selection is the most complex and repre-sentative erefore the performance of the algorithm isanalyzed by the experimental state selection and the othernetwork state analysis approaches are the same

531 Parameter Test and Analysis Different parameters willaffect the speed and effect of learning At present there is norelevant theory to determine the specific parameters InSection 4 the relevant parameters are preliminarily ana-lyzed On this basis the different parameter settings arefurther tested to find the parameter settings suitable for thisattack-defense scenario Six different parameter settingswere tested Specific parameter settings are shown in Table 3In the experiment the attackerrsquos initial strategy is randomstrategy and the learning mechanism is the same as theapproach in this paper

e probability of the defenderrsquos choice of defense ac-tions and sums in state is shown in Figure 9 e learningspeed and convergence of the algorithm under differentparameter settings can be observed from Figure 9 whichshows that the learning speed of settings 1 3 and 6 is fasterand the best strategy can be obtained after learning less than1500 times under the three settings but convergence of 3and 6 is poor Although the best strategy can be learned bysettings 3 and 6 there will be oscillation afterwards and thestability of setting 1 is not suitable

Defense payoff can represent the degree of optimizationof the strategy In order to ensure that the payoff value doesnot reflect only one defense result the average of 1000defense gains is taken and the change of the average payoffper 1000 defense gains is shown in Figure 10 As can be seenfrom Figure 10 the benefits of Set 3 are significantly lowerthan those of other settings but the advantages and dis-advantages of other settings are difficult to distinguish Inorder to display more intuitively the average value of 25000defense gains calculated under different settings in Figure 10is shown in Figure 11 From Figure 11 we can see that theaverage value of settings 1 and 5 is higher For furthercomparison the standard deviation of settings 1 and 5 is

calculated one step on the basis of the average value to reflectthe discreteness of the gains As shown in Figure 12 thestandard deviations of setting 1 and setting 6 are smallMoreover the result of setting 1 is smaller than setting 6

In conclusion setting 1 of the six sets of parameters is themost suitable for this scenario Since setting 1 has achievedan ideal effect and can meet the experimental requirementsit is no longer necessary to further optimize the parameters

532 Comparisons In this section stochastic game [16] andevolutionary game [20] are selected to conduct comparativeexperiments with this approach According to the differenceof attackerrsquos learning ability this section designs two groupsof comparative experiments In the first group the attackerrsquoslearning ability is weak and will not make adjustments to theattack-defense results In the second group the attackerrsquoslearning ability is strong and adopts the same learningmechanism as the approach in this paper In both groupsthe initial strategies of attackers were random strategies

In the first group of experiments the defense strategy ofthis approach is as shown in Figure 9(a) e defense strategiescalculated by the approach [16] are πd(s2 d3) 07πd(s2 d4) 03 e defense strategies of [20] are evolu-tionarily stable and balanced And the defense strategies of [20]are πd(s2 d3) 08 πd(s2 d4) 02 Its average earnings per1000 times change as shown in Figure 13

From the results of the strategies and benefits of the threeapproaches we can see that the approach in this paper canlearn from the attackerrsquos strategies and adjust to the optimalstrategy so the approach in this paper can obtain the highestbenefits Wei et al [16] adopted a fixed strategy whenconfronting any attacker When the attacker is constrainedby bounded rationality and does not adopt Nash equilibriumstrategy the benefit of this approach is low Although thelearning factors of both attackers and defenders are takeninto account in [20] the parameters required in the modelare difficult to quantify accurately which results in thedeviation between the final results and the actual results sothe benefit of the approach is still lower than that of theapproach in this paper

In the second group of experiments the results of[16 20] are still as πd(s2 d3) 07 πd(s2 d4) 03 andπd(s2 d3) 08 πd(s2 d4) 02 e decision making ofthis approach is shown in Figure 14 After about 1800 timesof learning the approach achieves stability and converges tothe same defense strategy as that of [16] As can be seen fromFigure 15 the payoff of [20] is lower than that of other twoapproaches e average payoff of this approach in the first2000 defenses is higher than that of [26] and then it is almostthe same as that of [26] Combining Figures 14 and 15 wecan see that the learning attacker cannot get Nash equi-librium strategy at the initial stage and the approach in thispaper is better than that in [26] When the attacker learns toget Nash equilibrium strategy the approach in this paper canconverge to Nash equilibrium strategy At this time theperformance of this approach is the same as that in [26]

In conclusion when facing the attackers with weaklearning ability the approach in this paper is superior to that

Security and Communication Networks 11

Table 3 Dierent parameter settings

Set α δl δw λ c

1 001 0004 0001 001 0012 01 0004 0001 001 0013 001 0004 0001 001 014 001 0004 0001 01 0015 001 004 001 001 0016 001 0008 0001 001 001

0 2000 4000 6000s2 defense number

s2 d

efen

se re

turn

10

08

06

04

02

00

d3d4

(a)

s2 d

efen

se re

turn

10

08

06

04

02

000 2000 4000 6000

s2 defense number

d3d4

(b)

s2 d

efen

se re

turn

10

08

06

04

02

000 2000 4000 6000

s2 defense number

d3d4

(c)

s2 d

efen

se re

turn

10

08

06

04

02

000 2000 4000 6000

s2 defense number

d3d4

(d)

s2 d

efen

se re

turn

10

08

06

04

02

000 2000 4000 6000

s2 defense number

d3d4

(e)

s2 d

efen

se re

turn

10

08

06

04

02

000 2000 4000 6000

s2 defense number

d3d4

(f )

Figure 9 Defense decision making under dierent parameter settings (a) Defense decision under setting 1 (b) Defense decision undersetting 2 (c) Defense decision under setting 3 (d) Defense decision under setting 4 (e) Defense decision under setting 5 (f ) Defensedecision under setting 6

5 10 15 20 25s2 defense number (times1000)

s2 d

efen

se re

turn

ndash10

ndash15

ndash20

ndash25

ndash30

Set 1Set 2Set 3

Set 4Set 5Set 6

Figure 10 Defense payo under dierent parameter settings

12 Security and Communication Networks

in [16 20] When facing an attacker with strong learningability if the attacker has not obtained Nash equilibriumthrough learning this approach is still better than [16 20] Ifthe attacker obtains Nash equilibrium through learning thispaper can also obtain the same Nash equilibrium strategy as[26] and obtain its phase e same eect is superior to thatin [20]

533 Test Comparison with and without Eligibility Traceis section tests the actual eect of the eligibility traces onthe Algorithm 2 e eect of eligibility traces on strategyselection is shown in Figure 16 from which we can see thatthe learning speed of the algorithm is faster when thequalied trace is available After 1000 times of learning the

algorithm can converge to the optimal strategy When thequalied trace is not available the algorithm needs about2500 times of learning to converge

Average earnings per 1000 times change as shown inFigure 17 from which we can see that the benets of thealgorithm are almost the same when there is or not anyqualied trace after convergence From Figure 17 we can seethat 3000 defenses before convergence have higher returnsfrom qualied trails than those from unqualied trails Inorder to further verify this the average of the rst 3000defense gains under qualied trails and unqualied trails iscounted 10 times each respectively e results are shown inFigure 18 which further proves that in the preconvergencedefense phase qualied traces are better than unqualiedtraces

e addition of eligibility trace accelerates the learningspeed but also brings additional memory and computingoverhead In the experiment only 10 state-action pairs thatwere recently accessed were saved which eectively reduced

Set 1 Set 2 Set 3 Set 4 Set 5 Set 6

ndash80

ndash85

ndash90

Figure 11 Average return under dierent parameter settings

Set 1 Set 2 Set 3 Set 4 Set 5 Set 6

80

75

Figure 12 Standard deviation of defense payo under dierentparameter settings

ndash10

ndash20

ndash30

ndash40

s2 d

efen

se re

turn

5 10 15 20 25

Our methodReference [12]Reference [16]

s2 defense number (times1000)

Figure 13 Comparison of defense payo change

0 2000 4000 6000s2 defense number

s2 d

efen

se st

rate

gy

08

06

04

02

00

d3d4

Figure 14 Defense decision making of our approach

Our methodReference [12]Reference [16]

5 10 15 20 25s2 defense number (times1000)

s2 d

efen

se re

turn

ndash10

ndash20

ndash30

ndash40

Figure 15 Comparison of defense payo change

Security and Communication Networks 13

0 2000 4000 6000s2 defense number

s2 d

efen

se st

rate

gy

100

075

050

025

000

Eligibility trace d3Eligibility trace d4

No eligibility trace d3No eligibility trace d4

Figure 16 Comparison of defense decision making

Eligibility traceNo eligibility trace

0 5 10 15 20 25s2 defense number (times1000)

s2 d

efen

se re

turn

ndash10

ndash20

ndash15

ndash30

ndash25

Figure 17 Comparison of defense payo change

Eligibility traceNo eligibility trace

1 2 3 4 5 6 7 8 9 10s2 defense number (times3000)

s2 av

erag

e def

ense

retu

rn

ndash12

ndash14

ndash16

ndash18

ndash20

Figure 18 Average payo comparison of the rst 3000 defenses

14 Security and Communication Networks

the increase of memory consumption In order to test thecomputational cost of qualified trail the time of 100000defense decisions made by the algorithm was counted for 20times with and without qualified trail e average of 20times was 951s for qualified trail and 374 s for unqualifiedtrail Although the introduction of eligibility traces willincrease the decision-making time by nearly 25 times thetime required for 100000 decisions after the introduction ofeligibility traces is still only 951 s which can still meet thereal-time requirements

In summary the introduction of eligibility trace at theexpense of a small amount of memory and computingoverhead can effectively increase the learning speed of thealgorithm and improve the defense gains

54 Comprehensive Comparisons is approach is com-pared with some typical research results as shown in Table 4[3 12 14 16] is based on the assumption of complete ra-tionality e equilibrium strategy obtained by it is difficultto appear in practice and has a low guiding effect on actualdefense decision making [20] and this paper have morepracticability on the premise of bounded rationality hy-pothesis but [20] is based on the theory of biologicalevolution and mainly studies population evolution e coreof game analysis is not the optimal strategy choice of playersbut the strategy adjustment process trend and stability ofgroup members composed of bounded rational players andthe stability here refers to group members is approach isnot suitable for guiding individual real-time decisionmaking because the proportion of specific strategies isunchanged and not the strategy of a player On the contrarythe defender of the proposed approach adopts reinforcementlearning mechanism which is based on systematic feedbackto learn in the confrontation with the attacker and is moresuitable for the study of individual strategies

6 Conclusions and Future Works

In this paper cyber attack-defense confrontation is ab-stracted as a stochastic game problem under the restrictionof limited rationality A host-centered attack-defense graphmodel is proposed to extract network state and attack-de-fense action and an algorithm to generate attack-defensegraph is designed to effectively compress the game statespace e WoLF-PHC-based defense decision approach isproposed to overcome the problem which enables defendersunder bounded rationality to make optimal choices whenfacing different attackers e lattice improves the WoLF-

PHC algorithm speeds up the defenderrsquos learning speed andreduces the algorithmrsquos dependence on data is approachnot only satisfies the constraints of bounded rationality butalso does not require the defender to know too much in-formation about the attacker It is a more practical defensedecision-making approach

e future work is to further optimize the winning andlosing criteria of WoLF-PHC algorithm for specific attack-defense scenarios in order to speed up defense learning andincrease defense gains

Data Availability

e data that support the findings of this study are notpublicly available due to restrictions as the data containsensitive information about a real-world enterprise networkAccess to the dataset is restricted by the original ownerPeople who want to access the data should send a request tothe corresponding author who will apply for permission ofsharing the data from the original owner

Conflicts of Interest

e authors declare that they have no conflicts of interestregarding the publication of this paper

Acknowledgments

is work was supported by the National High TechnologyResearch and Development Program of China (863 Pro-gram) (2014AA7116082 and 2015AA7116040) and NationalNatural Science Foundation of China (61902427) In par-ticular we thank Junnan Yang and Hongqi Zhang of ourresearch group for their previous work on stochastic gameWe sincerely thank them for their help in publishing thispaper

References

[1] Z Tian W Shi Y Wang et al ldquoReal time lateral movementdetection based on evidence reasoning network for edgecomputing environmentrdquo IEEE Transactions on IndustrialInformatics vol 15 no 7 pp 4285ndash4294 2019

[2] Y Wang J Yu W Qu H Shen and X Cheng ldquoEvolutionarygame model and analysis approaches for network groupbehaviorrdquo Chinese Journal of Computers vol 38 no 2pp 282ndash300 2015

[3] J Liu H Zhang and Y Liu ldquoResearch on optimal selection ofmoving target defense policy based on dynamic game withincomplete informationrdquo Acta Electronica Sinica vol 46no 1 pp 82ndash29 2018

Table 4 Comprehensive comparison of existing approaches

Ref Game type Model assumption Learning mechanism Game process Applicable object Practicability[3] Stochastic game Rationality mdash Multistage Personal Bad[12] Static game Rationality mdash Single-stage Personal Bad[14] Dynamic game Rationality mdash Multistage Personal Bad[16] Stochastic game Rationality mdash Multistage Personal Bad[20] Evolutionary game Bounded rationality Biological evolution mdash Group GoodOur approach Stochastic game Bounded rationality Reinforcement learning Multistage Personal Good

Security and Communication Networks 15

[4] Z Tian X Gao S Su J Qiu X Du and M GuizanildquoEvaluating reputation management schemes of internet ofvehicles based on evolutionary game theoryrdquo IEEE Trans-actions on Vehicular Technology vol 68 no 6 pp 5971ndash59802019

[5] Q Zhu and S Rass ldquoGame theory meets network securitya tutorialrdquo in Proceedings of the ACM SIGSAC Conference onComputer and Communications Security pp 2163ndash2165Toronto Canada October 2018

[6] X Li C Zhou Y C Tian and Y Qin ldquoA dynamic decision-making approach for intrusion response in industrial controlsystemsrdquo IEEE Transactions on Industrial Informatics vol 15no 5 pp 2544ndash2554 2019

[7] H Hu Y Liu H Zhang and R Pan ldquoOptimal networkdefense strategy selection based on incomplete informationevolutionary gamerdquo IEEE Access vol 6 pp 29806ndash298212018

[8] J Li G Kendall and R John ldquoComputing Nash equilibriaand evolutionarily stable states of evolutionary gamesrdquo IEEETransactions on Evolutionary Computation vol 20 no 3pp 460ndash469 2016

[9] M Bowling and M Veloso Multiagent Learning Usinga Variable Learning Rate Elsevier Science Publishers LtdAmsterdam Netherlands 2002

[10] J Harb and D Precup ldquoInvestigating recurrence and eligi-bility traces in deep Q-networksrdquo httpsarxivorgabs170405495 2017

[11] C T Do N H Tran and C Hong ldquoGame theory for cybersecurity and privacyrdquo ACM Computing Surveys vol 50 no 2p 30 2017

[12] Y-L Liu D-G Feng L-HWu and Y-F Lian ldquoPerformanceevaluation of worm attack and defense strategies based onstatic Bayesian gamerdquo Chinese Journal of Software vol 23no 3 pp 712ndash723 2012

[13] Y Li D E Quevedo S Dey and L Shi ldquoA game-theoreticapproach to fake-acknowledgment attack on cyber-physicalsystemsrdquo IEEE Transactions on Signal and Information Pro-cessing over Networks vol 3 no 1 pp 1ndash11 2017

[14] H Zhang L Jiang S Huang J Wang and Y Zhang ldquoAttack-defense differential game model for network defense strategyselectionrdquo IEEE Access vol 7 pp 50618ndash50629 2019

[15] A Afrand and S K Das ldquoPreventing DoS attacks in wirelesssensor networks a repeated game theory approachrdquo InternetJournal of Cyber Security vol 5 no 2 pp 145ndash153 2007

[16] J Wei B Fang and Z Tian ldquoResearch on defense strategiesselection based on attack-defense stochastic game modelrdquoChinese Journal of Computer Research and Developmentvol 47 no 10 pp 1714ndash1723 2010

[17] C Wang X Cheng and Y Zhu ldquoA Markov game model ofcomputer network operationrdquo Chinese Systems Engineering-gteory and Practice vol 34 no 9 pp 2402ndash2410 2014

[18] J Hofbauer and K Sigmund ldquoEvolutionary game dynamicsrdquoBulletin of the American Mathematical Society vol 40 no 4pp 479ndash519 2011

[19] Y Hayel and Q Zhu ldquoEpidemic protection over heteroge-neous networks using evolutionary Poisson gamesrdquo IEEETransactions on Information Forensics and Security vol 12no 8 pp 1786ndash1800 2017

[20] J Huang and H Zhang ldquoImproving replicator dynamicevolutionary game model for selecting optimal defensestrategiesrdquo Chinese Journal on Communications vol 38 no 1pp 170ndash182 2018

[21] H Hu Y Liu H Zhang and Y Zhang ldquoSecurity metricmethods for network multistep attacks using AMC and big

data correlation analysisrdquo Security and CommunicationNetworks vol 2018 Article ID 5787102 14 pages 2018

[22] H Hu Y Liu Y Yang H Zhang and Y Zhang ldquoNew in-sights into approaches to evaluating intention and path fornetwork multistep attacksrdquo Mathematical Problems in En-gineering vol 2018 Article ID 4278632 13 pages 2018

[23] H Hu H Zhang Y Liu and Y Wang ldquoQuantitative methodfor network security situation based on attack predictionrdquoSecurity and Communication Networks vol 2017 Article ID3407642 19 pages 2017

[24] M Zimmer and S Doncieux ldquoBootstrapping Q-learning forrobotics from neuro-evolution resultsrdquo IEEE Transactions onCognitive and Developmental Systems vol 10 no 1pp 102ndash119 2018

[25] R S Sutton and A G Barto ldquoReinforcement learning anintroduction Bradford bookrdquo Machine Learning vol 16no 1 pp 285-286 2005

[26] H Zhang J Wang D Yu J Han and T Li ldquoActive defensestrategy selection based on static Bayesian gamerdquo in Pro-ceedings of IET International Conference on CyberspaceTechnology pp 1ndash7 London UK October 2016

16 Security and Communication Networks

International Journal of

AerospaceEngineeringHindawiwwwhindawicom Volume 2018

RoboticsJournal of

Hindawiwwwhindawicom Volume 2018

Hindawiwwwhindawicom Volume 2018

Active and Passive Electronic Components

VLSI Design

Hindawiwwwhindawicom Volume 2018

Hindawiwwwhindawicom Volume 2018

Shock and Vibration

Hindawiwwwhindawicom Volume 2018

Civil EngineeringAdvances in

Acoustics and VibrationAdvances in

Hindawiwwwhindawicom Volume 2018

Hindawiwwwhindawicom Volume 2018

Electrical and Computer Engineering

Journal of

Advances inOptoElectronics

Hindawiwwwhindawicom

Volume 2018

Hindawi Publishing Corporation httpwwwhindawicom Volume 2013Hindawiwwwhindawicom

The Scientific World Journal

Volume 2018

Control Scienceand Engineering

Journal of

Hindawiwwwhindawicom Volume 2018

Hindawiwwwhindawicom

Journal ofEngineeringVolume 2018

SensorsJournal of

Hindawiwwwhindawicom Volume 2018

International Journal of

RotatingMachinery

Hindawiwwwhindawicom Volume 2018

Modelling ampSimulationin EngineeringHindawiwwwhindawicom Volume 2018

Hindawiwwwhindawicom Volume 2018

Chemical EngineeringInternational Journal of Antennas and

Propagation

International Journal of

Hindawiwwwhindawicom Volume 2018

Hindawiwwwhindawicom Volume 2018

Navigation and Observation

International Journal of

Hindawi

wwwhindawicom Volume 2018

Advances in

Multimedia

Submit your manuscripts atwwwhindawicom

Page 5: Research Article - Hindawi Publishing Corporationdownloads.hindawi.com/journals/scn/2019/3038586.pdf · Afrand and Das [15] established a repeated game model between the intrusion

(3) D (D1 D2 Dn) is the action set of the de-fender in which Dk d1 d2 dm1113864 1113865 is the actionset of the defender in the game state Sk

(4) Rd(si d sj)is the immediate return from state si to sjafter the defender performs action d

(5) Qd(si d) is the state-action payoff function of thedefender indicating the expected payoff of the de-fender after taking action d in the state si

(6) πd(sk) is the defense strategy of the defender in thestate sk

Defense strategy and defense action are two differentconcepts Defense strategy is the rule of defense action notthe action itself For example πd(sk) (πd(sk d1)

πd(sk dm)) is the strategy of the defender in the networkstate sk where πd(sk dm) is the probability of selectingaction dm 1113936disinDk

πd(sk d) 1

33 Network State and Attack-Defense Action ExtractionApproach Based on Attack-Defense Graph Network stateand attack-defense action are important components ofstochastic game model Extraction of network state andattack-defense action is a key point in constructing at-tack-defense stochastic game model [21] In the currentattack-defense stochastic game when describing thenetwork state each network state contains the securityelements of all nodes in the current network e numberof network states is the power set of security elementswhich will produce a state explosion [22] ereforea host-centered attack-defense graph model is proposedEach state node only describes the host state It can ef-fectively reduce the size of state nodes [23] Using thisattack-defense graph to extract network state and attack-defense action is more conducive to cyber attack-defenseconfrontation analysis

Definition 2 attack-defense graph is a binary groupG (S E) in which S s1 s2 sn1113864 1113865 is a set of node se-curity states and si langhost privilegerang host is the uniqueidentity of the node and privilege none user root in-dicates that it does not have any privileges has ordinary userprivileges and has administrator privileges For directededge E (Ea Ed) it indicates that the occurrence of attackor defense action causes the transfer of node state andek (sr vd sd) k a d where sr is the source node and sd

is the destination nodee generation process of attack-defense map is shown

in Figure 3 Firstly target network scanning is used to ac-quire cyber security elements then attack instantiation iscombined with attack template and defense instantiation iscombined with defense template Finally attack-defensegraph is generated e state set of attack-defense stochasticgame model is extracted by attack-defense graph nodes and

the defense action set is extracted by the edge of attack-defense graph

331 Elements of Cyber Security e elements of cybersecurity NSE are composed of network connection Cvulnerability information V service information F andaccess rights P Matrix Csube host times host times port describes theconnection relationship between nodes the row of matrixrepresents the source node shost the list of matrix rep-resents the destination node dhost and the port accessrelationship is represented by matrix elements When portis empty it indicates that there is no connection re-lationship between shost nodes and dhost nodesV langhost service cveidrang indicates the vulnerability ofservices on nodesrsquo host including security vulnerabilitiesand improper configuration or misconfiguration of systemsoftware and application software F langhost servicerang in-dicates that a service is opened on a node hostP langhost privilegerang indicates that an attacker has accessrights privilege on a node host

332 Attack Template Attack template AM is the de-scription of vulnerability utilization whereAM langtid prec postcrang Among them tid is the identifi-cation of attack mode prec langP V C Frang describes the setof prerequisites for an attacker to use a vulnerability in-cluding the initial access rights privilege of the attacker onthe source node shost the vulnerability information cveidof the target node the network connection relationship Cand the running service of the node F Only when the set ofconditions is satisfied the attacker can succeed Use thisvulnerability postc langP C sdrang describes the conse-quences of an attackerrsquos successful use of a vulnerabilityincluding the increase of attackerrsquos access to the targetnode the change of network connection relationship andservice destruction

333 Defense Template Defense templates DM are theresponse measures taken by defenders after predicting oridentifying attacks where DM langtid dsetrang dset

langd1 postd1rang langdm postdmrang1113864 1113865 is the defense strategy setfor specific attacks postdi langF V PCrang describes the im-pact of defense strategy on cyber security elements in-cluding the impact on node service informationvulnerability information attacker privilege informationnode connection relationship and so on

In the process of attack-defense graph generation ifthere is a connection between two nodes and all the pre-requisites for attack occurring are satisfied the edges fromsource node to destination node are added If the attackchanges the security elements such as connectivity the cybersecurity elements should be updated in time If the defensestrategy is implemented the connection between nodes orthe existing rights of attackers should be changed As shownin Algorithm 1 the first step is to use cyber security elements

Security and Communication Networks 5

to generate all possible state nodes and initialize the edgesSteps 2ndash8 are to instantiate attacks and generate all attackedges Steps 9ndash15 are to instantiate defenses and generate alldefense edges Steps 16ndash20 are used to remove all isolatednodes And step 21 is to output attack-defense maps

Assuming the number of nodes in the target network is nand the number of vulnerabilities of each node is m themaximum number of nodes in the attack-defense graph is3n In the attack instantiation stage the computationalcomplexity of analyzing the connection relationship be-tween each two nodes is o(n2 minus n) e computationalcomplexity of matching the vulnerability of the nodes withthe connection relationship is o(m(n2 minus n)) In the defenseinstantiation stage we remove the isolated nodes and thecomputational complexity of traversing the edges of all thenodes is o(9n2 minus 3n) In summary the order of computa-tional complexity of the algorithm is o(n2) e node ofattack-defense graph G can extract network state and theedge of attack-defense graph G can extract attack-defenseaction

4 Stochastic Game Analysis and StrategySelection Based on WoLF-PHCIntelligent Learning

In the previous section cyber attack-defense is described asa bounded rational stochastic game problem and an attack-defense stochastic game model AD-SGM is constructed Inthis section reinforcement learning mechanism is in-troduced into finite rational stochastic game and WoLF-PHC algorithm is used to select defense strategies based onAD-SGM

41 Principle of WoLF-PHC

411 Q-Learning Algorithm Q-learning [24] is the basis ofWoLF-PHC algorithm and a typical model-free re-inforcement learning algorithm Its learning mechanism isshown in Figure 4 Agent in Q-learning obtains knowledgeof return and environment state transfer through interaction

Elements of networksecurity

Network connectionrelations

Node vulnerabilityinformation

Node serviceinformation

Node accessprivilege

Attack template

Attackinstantiation

Defenseinstantiation

Defense template

Atta

ck-d

efen

se g

raph Network

state

Attackaction

Defensiveaction

Figure 3 Attack-defense graph generation

Input Elements of Cyber security NSE Attack Template AM Defense Template DMOutput Attack graph G (S E)

(1) S⟵NSE E⟵emptylowast Generate all nodes lowast(2) for each S dolowast Attack instantiation to generate attack edges lowast(3) update NSE in slowast Updating Cyber security Elements lowast(4) if Cshost shost and CdhostV geAMprecV and CdhostFgeAMprecF and CdhostPprivilegegeAMprecPprivilege(5) srhost⟵Cshost(6) sdhost⟵Cdhost(7) sdprivilege⟵AMpostcPprivilege(8) Ea⟵Ea cup ea(srAMtid sd)1113864 1113865

(9) end if(10) end for(11) for each S dolowast Defense instantiation to generate defense edges lowast(12) if Easd s and DMtid Eatid(13) Ed⟵Ed cup ed(EasdDMdsetd Easr)1113864 1113865

(14) end if(15) end for(16) for each S dolowast Remove isolated nodes S lowast(17) if ea(s tid sd) empty and ed(sr d s) empty(18) S⟵ S minus s

(19) end if(20) end for(21) Return G

ALGORITHM 1 Attack-defense graph generation algorithm

6 Security and Communication Networks

with environment Knowledge is expressed by payoff Qd andlearned by updating Qd Qd is

Qd(s d) Qd(s d) + α Rd s d sprime( 1113857 + cmaxdprime

Qd sprime dprime( 1113857 minus Qd(s d)1113890 1113891

(1)

where α is payoff learning rate and c is the discount factore strategy of Q-learning is πd(s) argmaxdQd(s d)

412 PHC Algorithm e Policy Hill-Climbing algorithm[25] is a simple and practical gradient descent learning al-gorithm suitable for hybrid strategies which is an im-provement of Q-learning e state-action gain function Qd

of PHC is the same as Q-learning but the policy updateapproach of Q-learning is no longer followed but the hybridstrategy πd(sk) is updated by executing the hill-climbingalgorithm as shown in equations (2)ndash(4) In the formula thestrategy learning rate is

πd sk di( 1113857 πd sk di( 1113857 + Δskdi (2)

where

Δskdi

minus δskdi di ne argmax

dQd sk d( 1113857

1113944djnedi

δskdj others

⎧⎪⎪⎪⎨

⎪⎪⎪⎩

(3)

δskdi min πd sk di( 1113857

δDk minus 1

11138681113868111386811138681113868111386811138681113868

1113888 1113889 (4)

413 WoLF-PHC Algorithm WoLF-PHC algorithm is animprovement of PHC algorithm By introducing WoLFmechanism the defender has two different strategy learningrates low strategy learning rate δw when winning and highstrategy learning rate δl when losing as shown in formula (5)e two learning rates enable defenders to adapt quickly toattackersrsquo strategies when they perform worse than expectedand to learn cautiously when they perform better than expectede most important thing is the introduction of WoLFmechanism which guarantees the convergence of the algorithm[9] WoLF-PHC algorithm uses average strategy as the criterionof success and failure as shown in formulae (6) and (7)

δ

δw 1113944disinDk

πd skd( 1113857Qd skd( 1113857gt 1113944disinDk

πd skd( 1113857Qd skd( 1113857

δl others

⎧⎪⎪⎪⎨

⎪⎪⎪⎩

(5)

πd(sd) πd(sd) +1

C(s)πd(sd) minus πd(sd)( 1113857

(6)

C(s) C(s) + 1 (7)

42 Defense Decision-Making Algorithm Based on ImprovedWoLF-PHC e decision-making process of our approachis shown in Figure 5 which consists of five steps It receivestwo types of input data attack evidence and abnormal

Input AD minus SGM α δ λ and c

Output Defense action d(1) initialize AD minus SGM C(s) 0 e(s d) 0lowast Network state and attack-defense actions are extracted by Algorithm 1 lowast(2) slowast get(E) lowast Getting the current network state from Network E lowast(3) repeat(4) dlowast πd(slowast) lowast Select defense action lowast(5) Output dlowast lowast Feedback defense actions to defenders lowast(6) sprime get(E) lowast Get the status after the action dlowast is executed lowast(7) ρlowast Rd(slowast dlowast sprime) + cmaxdprimeQd(sprime dprime) minus Qd(slowast dlowast)

(8) ρg Rd(slowast dlowast sprime) + cmaxdprimeQd(sprime dprime) minus maxdQd(slowast d)

(9) for each state-action pair (s d) except (slowast dlowast) do(10) e(s d) cλe(s d)

(11) Qd(s d) Qd(s d) + αρge(s d)

(12) end forlowast Update noncurrent eligibility trace (slowast dlowast) and values Qd lowast(13) Qd(slowast dlowast) Qd(slowast dlowast) + αρlowastlowast Update Qd of (slowast dlowast) lowast(14) e(slowast dlowast) cλe(slowast dlowast) + 1 lowast Update track of (slowast dlowast)lowast(15) C(slowast) C(slowast) + 1(16) Updating average strategy based on formula (6)(17) Selecting the learning rate of strategies based on formula (5)(18) δslowastd min(πd(slowast d) δ(D(slowast) minus 1))foralld isin D(slowast)

(19) Δslowastd minus δslowastd dne argmaxdi

Qd(slowast di)

1113936djnedlowastδslowastdjOthers1113896

(20) πd(slowast d) πd(slowast d) + Δslowastdforalld isin D(slowast) lowast Update defense strategy lowast(21) slowast sprime(22) end repeat

ALGORITHM 2 Defense decision-making algorithm

Security and Communication Networks 7

evidence All these pieces of evidence come from real-timeintrusion detection systems After decision making theoptimal security strategy is determined against detectedintrusions

In order to improve the learning speed of WoLF-PHCalgorithm and reduce the dependence of the algorithm onthe amount of data the eligibility trace is introduced toimprove WoLF-PHC e eligibility trace can track specificstate-action trajectories of recent visits and then assigncurrent returns to the state-action of recent visits WoLF-PHC algorithm is an extension of Q-learning algorithm Atpresent there are many algorithms combining Q-learningwith eligibility trace is paper improves WoLF-PHC byusing the typical algorithm [10] e qualification trace ofeach state-action is defined as e(s a) Suppose slowast is thecurrent network state slowast and the eligibility trace is updatedin the way shown in formula (8) Among them the traceattenuation factor is λ

e(s d) cλe(s d) sne slowast

cλe(s d) + 1 s slowast1113896 (8)

WoLF-PHC algorithm is an extension of Q-learningalgorithm which belongs to off-policy algorithm It usesgreedy policy when evaluating defense actions for eachnetwork state and occasionally introduces nongreedy policywhen choosing to perform defense actions in order to learnIn order to maintain the off-policy characteristics of WoLF-PHC algorithm the state-action values are updated byformulae (9)ndash(12) in which the defense actions dlowast are

selected for execution slowast because only the recently visitedstatus-action pairs will have significantly more eligibilitytrace than 0 while most other status-action pairs will havealmost none eligibility trace In order to reduce the memoryand running time consumption caused by eligibility traceonly the latest status-action pair eligibility trace can be savedand updated in practical application

Qd slowast dlowast

( 1113857 Qd slowast dlowast

( 1113857 + αρlowast (9)

Qd(s d) Qd(s d) + αρge(s d) (10)

ρlowast Rd slowast d sprime( 1113857 + cmax

dprimeQd sprime dprime( 1113857 minus Qd s

lowast dlowast

( 1113857

(11)

ρg Rd slowast dlowast sprime( 1113857 + cmax

dprimeQd sprime dprime( 1113857 minus max

dQd slowast d( 1113857

(12)

In order to achieve better results the defense decision-making approach based on WoLF-PHC needs to set fourparameters α δ λ and c reasonably (1) e range of thepayoff learning rate is 0lt αlt 1e bigger the representativeα is the more important the cumulative reward is e fasterthe learning speed is the smaller α is and the better thestability of the algorithm is (2)e range of strategy learningrate 0lt δ lt 1 is obtained According to the experiment wecan get a better result when adopting δlδw 4 (3) eattenuation factor λ of eligibility trace is in the range of

Describe andanalyze cyberattack-defenseconfrontation

Model stochastic game

for attack-defense

Extract currentsecurity state andcandidate attack-

defense action

Evaluatepayoff

hellip

helliphellip

How to defend

Decision making

Figure 5 e process of our decision-making approach

Rt+1reward Rt

St+1state St

Ambient

Agent

Figure 4 Q-learning learning mechanism

8 Security and Communication Networks

0lt λlt 1 which is responsible for the credit allocation ofstatus-action It can be regarded as a time scale e greaterthe credit allocated λ to historical status-action the greaterthe credit allocated to historical status-action (4) e rangeof the discount factor 0lt clt 1 represents the defenderrsquospreference for immediate return and future return When c

approaches 0 it means that future returns are irrelevant andimmediate returns are more important When c approaches1 it means immediate returns are irrelevant and futurereturns are more important

Agent in WoLF-PHC is the defender in the stochasticgame model of attack-defense AD minus SGM the game state inagentrsquos state corresponds to AD minus SGM the defense actionin agentrsquos behavior corresponds to AD minus SGM the imme-diate return in agentrsquos immediate return corresponds toAD minus SGM and the defense strategy in agentrsquos strategycorresponds to AD minus SGM

On the basis of the above a specific defense decision-making approach as shown in Algorithm 2 is given e firststep of the algorithm is to initialize the stochastic gamemodel AD minus SGM of attack-defense and the related pa-rameters e network state and attack-defense actions areextracted by Algorithm 1 e second step is to detect thecurrent network state by the defender Steps 3ndash22 are tomake defense decisions and learn online Steps 4-5 are toselect defense actions according to the current strategy steps6ndash14 are to update the benefits by using eligibility traces andsteps 15ndash21 are the new payoffs using mountain climbingalgorithm to update defense strategy

e spatial complexity of the Algorithm 2 mainlyconcentrates on the storage of pairs Rd(s d sprime) such ase(s d) πd(s d) πd(s d) and Qd(s d) e number of statesis |S| |D| and |A| are the numbers of measures taken by thedefender and attacker in each state respectively ecomputational complexity of the proposed algorithm isO(4 middot |S| middot |D| + |S|2 middot |D|) Compared with the recentmethod using evolutionary game model with complexityO(|S|2 middot (|A| + |D|)3) [14] we greatly reduce the computa-tional complexity and increase the practicability of the al-gorithm since the proposed algorithm does not need to solvethe game equilibrium

5 Experimental Analysis

51 Experiment Setup In order to verify the effectiveness ofthis approach a typical enterprise network as shown inFigure 6 is built for experiment Attacks and defenses occuron the intranet with attackers coming from the extranet Asa defender network administrator is responsible for thesecurity of intranet Due to the setting of Firewall 1 andFirewall 2 legal users of the external network can only accessthe web server which can access the database server FTPserver and e-mail server

e simulation experiment was carried out on a PC withIntel Core i7-6300HQ 340GHz 32GB RAM memoryand Windows 10 64 bit operating system e Python 365emulator was installed and the vulnerability information inthe experimental network was scanned by Nessus toolkit asshown in Table 1 e network topology information was

collected by ArcGis toolkit We used the Python language towrite the project code During the experiment we set upabout 25000 times of attack-defense strategy studies eexperimental results were analyzed and displayed usingMatlab2018a as described in Section 53

Referring to MIT Lincoln Lab attack-defense behaviordatabase attack-defense templates are constructed Attack-defense maps are divided into attack maps and defense mapsby using attacker host A web server W database server DFTP server F and e-mail server E In order to facilitatedisplay and description attack-defense maps are dividedinto attack maps and defense maps as shown in Figures 7and 8 respectively e meaning of defense action in thedefense diagram is shown in Table 2

52 Construction of the Experiment Scenario AD-SGM(1) N (attacker defender) are players participating in

the game representing cyber attackers and defendersrespectively

Vulnerabilityscanning system

VDS

Internet

DMZ

FTP server

Firewall 1Trusted zone

Firewall 2Database

serverE-mailserver

Web server

IDS

Attacker

Extranet

Intranet

Figure 6 Experimental network topology

Table 1 Network vulnerability information

Attackidentifier Host CVE Target

privilegetid1 Web server CVE-2015-1635 usertid2 Web server CVE-2017-7269 roottid3 Web server CVE-2014-8517 roottid4 FTP server CVE-2014-3556 roottid5 E-mail server CVE-2014-4877 roottid6 Database server CVE-2013-4730 usertid7 Database server CVE-2016-6662 root

S0(A root)

S1(W user)

S2(W root)S5(D user)

S6(D root)

tid1

tid2

tid3

tid4

tid5

tid6

tid7

tid8

tid9

S3(F root)

S4(E root)

Figure 7 Attack graph

Security and Communication Networks 9

(2) e state set of stochastic game is S (s0 s1

s2 s3 s4 s5 s6) which consists of network state and isextracted from the nodes shown in Figures 7 and 8

(3) e action set of the defender is D (D0 D1

D2 D3 D4 D5 D6) D0 NULL D1 d11113864 d2D2

d3 d41113864 1113865D3 d1 d5 d61113864 1113865D4 d1 d5 d61113864 1113865D5 d11113864

d2 d7D6 d3 d41113864 1113865 and the edges are extractedfrom Figure 8

(4) Quantitative results of immediate returns Rd(si

d sj) of defenders [16 26] are

Rd s0NULL s0( 1113857 Rd s0NULL s1( 1113857 Rd s0NULL s2( 1113857( 1113857 (0 minus 40 minus 59)

Rd s1 d1 s0( 1113857 Rd s1 d1 s1( 1113857 Rd s1 d1 s2( 1113857 Rd s1 d2 s0( 1113857 Rd s1 d2 s1( 1113857 Rd s1 d2 s2( 1113857( 1113857

(40 0 minus 29 5 minus 15 minus 32)

( Rd s2 d3 s0( 1113857 Rd s2 d3 s1( 1113857 Rd s2 d3 s2( 1113857 Rd s2 d3 s3( 1113857 Rd s2 d3 s4( 1113857 Rd s2 d3 s5( 1113857

Rd s2 d4 s0( 1113857 Rd s2d4 s1( 1113857 Rd s2 d4 s2( 1113857 Rd s2 d4 s3( 1113857 Rd s2 d4 s4( 1113857 Rd s2 d4 s5( 11138571113857

(24 9 minus 15 minus 55 minus 49 minus 65 19 5 minus 21 minus 61 minus 72 minus 68)

( Rd s3 d1 s2( 1113857 Rd s3 d1 s3( 1113857 Rd s3 d1 s6( 1113857 Rd s3 d5 s2( 1113857 Rd s3 d5 s3( 1113857 Rd s3 d5 s6( 1113857

Rd s3 d6 s2( 1113857 Rd s3 d6 s3( 1113857 Rd s3 d6 s6( 11138571113857

(21 minus 16 minus 72 15 minus 23 minus 81 minus 21 minus 36 minus 81)

( Rd s4 d1 s2( 1113857 Rd s4 d1 s4( 1113857 Rd s4 d1 s6( 1113857 Rd s4 d5 s2( 1113857 Rd s4 d5 s4( 1113857 Rd s4 d5 s6( 1113857

Rd s4 d6 s2( 1113857 Rd s4 d6 s4( 1113857 Rd s4 d6 s6( 11138571113857

(26 0 minus 62 11 minus 23 minus 75 9 minus 25 minus 87)

( Rd s5 d1 s2( 1113857 Rd s5 d1 s5( 1113857 Rd s5 d1 s6( 1113857 Rd s5 d2 s2( 1113857 Rd s5 d2 s5( 1113857 Rd s5 d2 s6( 1113857

Rd s5 d7 s2( 1113857 Rd s5 d7 s5( 1113857 Rd s5 d7 s6( 11138571113857

(29 0 minus 63 11 minus 21 minus 76 2 minus 27 minus 88)

( Rd s6 d3 s3( 1113857 Rd s6 d3 s3( 1113857 Rd s6 d3 s5( 1113857 Rd s6 d3 s6( 1113857 Rd s6 d4 s3( 1113857

Rd s6 d4 s4( 1113857 Rd s6 d4 s5( 1113857 Rd s6 d4 s6( 11138571113857

(minus 23 minus 21 minus 19 minus 42 minus 28 minus 31 minus 24 minus 49)

(13)

Table 2 Defense action description

Atomic defense action d1 d2 d3 d4 d5 d6 d7

Renew root data Limit SYNICMPpackets Install Oracle patches Reinstall listener program Uninstall delete Trojan Limit access toMDSYS Restart database server Delete suspicious account Add physical resource Repair database Limit packets from ports

S0(A root)

S1(W user)

S2(W root)S5(D user)

S6(D root)

d3

d3

d4

d1d2d7

S3(F root)

S4(E root)d1d5d6

d1d5d6d4

d3

d1d2

Figure 8 Defense graph

10 Security and Communication Networks

(5) In order to detect the learning performance of theAlgorithm 2 more fully the defenderrsquos state-actionpayoffQd(si d) is initialized with a unified 0 withoutintroducing additional prior knowledge

(6) Defenderrsquos defense strategy adopts average strategyto initialize that is πd(sk d1) πd(sk d2) middot middot middot

πd(sk dm) 1113936disinD(sk)πd(sk d) 1forallsk isin S where noadditional prior knowledge is introduced

53 Testing and Analysis e experiment in this section hasthree purposes e first is to test the influence of differentparameter settings on the proposed Algorithm 2 so as to findout the experimental parameters suitable for this scenarioe second is to compare this approach with the existingtypical approaches to verify the advancement of this ap-proach e third is to test the effectiveness of WoLF-PHCalgorithm improvement based on eligibility trace

From Figures 7 and 8 we can see that the state of attack-defense strategy selection is the most complex and repre-sentative erefore the performance of the algorithm isanalyzed by the experimental state selection and the othernetwork state analysis approaches are the same

531 Parameter Test and Analysis Different parameters willaffect the speed and effect of learning At present there is norelevant theory to determine the specific parameters InSection 4 the relevant parameters are preliminarily ana-lyzed On this basis the different parameter settings arefurther tested to find the parameter settings suitable for thisattack-defense scenario Six different parameter settingswere tested Specific parameter settings are shown in Table 3In the experiment the attackerrsquos initial strategy is randomstrategy and the learning mechanism is the same as theapproach in this paper

e probability of the defenderrsquos choice of defense ac-tions and sums in state is shown in Figure 9 e learningspeed and convergence of the algorithm under differentparameter settings can be observed from Figure 9 whichshows that the learning speed of settings 1 3 and 6 is fasterand the best strategy can be obtained after learning less than1500 times under the three settings but convergence of 3and 6 is poor Although the best strategy can be learned bysettings 3 and 6 there will be oscillation afterwards and thestability of setting 1 is not suitable

Defense payoff can represent the degree of optimizationof the strategy In order to ensure that the payoff value doesnot reflect only one defense result the average of 1000defense gains is taken and the change of the average payoffper 1000 defense gains is shown in Figure 10 As can be seenfrom Figure 10 the benefits of Set 3 are significantly lowerthan those of other settings but the advantages and dis-advantages of other settings are difficult to distinguish Inorder to display more intuitively the average value of 25000defense gains calculated under different settings in Figure 10is shown in Figure 11 From Figure 11 we can see that theaverage value of settings 1 and 5 is higher For furthercomparison the standard deviation of settings 1 and 5 is

calculated one step on the basis of the average value to reflectthe discreteness of the gains As shown in Figure 12 thestandard deviations of setting 1 and setting 6 are smallMoreover the result of setting 1 is smaller than setting 6

In conclusion setting 1 of the six sets of parameters is themost suitable for this scenario Since setting 1 has achievedan ideal effect and can meet the experimental requirementsit is no longer necessary to further optimize the parameters

532 Comparisons In this section stochastic game [16] andevolutionary game [20] are selected to conduct comparativeexperiments with this approach According to the differenceof attackerrsquos learning ability this section designs two groupsof comparative experiments In the first group the attackerrsquoslearning ability is weak and will not make adjustments to theattack-defense results In the second group the attackerrsquoslearning ability is strong and adopts the same learningmechanism as the approach in this paper In both groupsthe initial strategies of attackers were random strategies

In the first group of experiments the defense strategy ofthis approach is as shown in Figure 9(a) e defense strategiescalculated by the approach [16] are πd(s2 d3) 07πd(s2 d4) 03 e defense strategies of [20] are evolu-tionarily stable and balanced And the defense strategies of [20]are πd(s2 d3) 08 πd(s2 d4) 02 Its average earnings per1000 times change as shown in Figure 13

From the results of the strategies and benefits of the threeapproaches we can see that the approach in this paper canlearn from the attackerrsquos strategies and adjust to the optimalstrategy so the approach in this paper can obtain the highestbenefits Wei et al [16] adopted a fixed strategy whenconfronting any attacker When the attacker is constrainedby bounded rationality and does not adopt Nash equilibriumstrategy the benefit of this approach is low Although thelearning factors of both attackers and defenders are takeninto account in [20] the parameters required in the modelare difficult to quantify accurately which results in thedeviation between the final results and the actual results sothe benefit of the approach is still lower than that of theapproach in this paper

In the second group of experiments the results of[16 20] are still as πd(s2 d3) 07 πd(s2 d4) 03 andπd(s2 d3) 08 πd(s2 d4) 02 e decision making ofthis approach is shown in Figure 14 After about 1800 timesof learning the approach achieves stability and converges tothe same defense strategy as that of [16] As can be seen fromFigure 15 the payoff of [20] is lower than that of other twoapproaches e average payoff of this approach in the first2000 defenses is higher than that of [26] and then it is almostthe same as that of [26] Combining Figures 14 and 15 wecan see that the learning attacker cannot get Nash equi-librium strategy at the initial stage and the approach in thispaper is better than that in [26] When the attacker learns toget Nash equilibrium strategy the approach in this paper canconverge to Nash equilibrium strategy At this time theperformance of this approach is the same as that in [26]

In conclusion when facing the attackers with weaklearning ability the approach in this paper is superior to that

Security and Communication Networks 11

Table 3 Dierent parameter settings

Set α δl δw λ c

1 001 0004 0001 001 0012 01 0004 0001 001 0013 001 0004 0001 001 014 001 0004 0001 01 0015 001 004 001 001 0016 001 0008 0001 001 001

0 2000 4000 6000s2 defense number

s2 d

efen

se re

turn

10

08

06

04

02

00

d3d4

(a)

s2 d

efen

se re

turn

10

08

06

04

02

000 2000 4000 6000

s2 defense number

d3d4

(b)

s2 d

efen

se re

turn

10

08

06

04

02

000 2000 4000 6000

s2 defense number

d3d4

(c)

s2 d

efen

se re

turn

10

08

06

04

02

000 2000 4000 6000

s2 defense number

d3d4

(d)

s2 d

efen

se re

turn

10

08

06

04

02

000 2000 4000 6000

s2 defense number

d3d4

(e)

s2 d

efen

se re

turn

10

08

06

04

02

000 2000 4000 6000

s2 defense number

d3d4

(f )

Figure 9 Defense decision making under dierent parameter settings (a) Defense decision under setting 1 (b) Defense decision undersetting 2 (c) Defense decision under setting 3 (d) Defense decision under setting 4 (e) Defense decision under setting 5 (f ) Defensedecision under setting 6

5 10 15 20 25s2 defense number (times1000)

s2 d

efen

se re

turn

ndash10

ndash15

ndash20

ndash25

ndash30

Set 1Set 2Set 3

Set 4Set 5Set 6

Figure 10 Defense payo under dierent parameter settings

12 Security and Communication Networks

in [16 20] When facing an attacker with strong learningability if the attacker has not obtained Nash equilibriumthrough learning this approach is still better than [16 20] Ifthe attacker obtains Nash equilibrium through learning thispaper can also obtain the same Nash equilibrium strategy as[26] and obtain its phase e same eect is superior to thatin [20]

533 Test Comparison with and without Eligibility Traceis section tests the actual eect of the eligibility traces onthe Algorithm 2 e eect of eligibility traces on strategyselection is shown in Figure 16 from which we can see thatthe learning speed of the algorithm is faster when thequalied trace is available After 1000 times of learning the

algorithm can converge to the optimal strategy When thequalied trace is not available the algorithm needs about2500 times of learning to converge

Average earnings per 1000 times change as shown inFigure 17 from which we can see that the benets of thealgorithm are almost the same when there is or not anyqualied trace after convergence From Figure 17 we can seethat 3000 defenses before convergence have higher returnsfrom qualied trails than those from unqualied trails Inorder to further verify this the average of the rst 3000defense gains under qualied trails and unqualied trails iscounted 10 times each respectively e results are shown inFigure 18 which further proves that in the preconvergencedefense phase qualied traces are better than unqualiedtraces

e addition of eligibility trace accelerates the learningspeed but also brings additional memory and computingoverhead In the experiment only 10 state-action pairs thatwere recently accessed were saved which eectively reduced

Set 1 Set 2 Set 3 Set 4 Set 5 Set 6

ndash80

ndash85

ndash90

Figure 11 Average return under dierent parameter settings

Set 1 Set 2 Set 3 Set 4 Set 5 Set 6

80

75

Figure 12 Standard deviation of defense payo under dierentparameter settings

ndash10

ndash20

ndash30

ndash40

s2 d

efen

se re

turn

5 10 15 20 25

Our methodReference [12]Reference [16]

s2 defense number (times1000)

Figure 13 Comparison of defense payo change

0 2000 4000 6000s2 defense number

s2 d

efen

se st

rate

gy

08

06

04

02

00

d3d4

Figure 14 Defense decision making of our approach

Our methodReference [12]Reference [16]

5 10 15 20 25s2 defense number (times1000)

s2 d

efen

se re

turn

ndash10

ndash20

ndash30

ndash40

Figure 15 Comparison of defense payo change

Security and Communication Networks 13

0 2000 4000 6000s2 defense number

s2 d

efen

se st

rate

gy

100

075

050

025

000

Eligibility trace d3Eligibility trace d4

No eligibility trace d3No eligibility trace d4

Figure 16 Comparison of defense decision making

Eligibility traceNo eligibility trace

0 5 10 15 20 25s2 defense number (times1000)

s2 d

efen

se re

turn

ndash10

ndash20

ndash15

ndash30

ndash25

Figure 17 Comparison of defense payo change

Eligibility traceNo eligibility trace

1 2 3 4 5 6 7 8 9 10s2 defense number (times3000)

s2 av

erag

e def

ense

retu

rn

ndash12

ndash14

ndash16

ndash18

ndash20

Figure 18 Average payo comparison of the rst 3000 defenses

14 Security and Communication Networks

the increase of memory consumption In order to test thecomputational cost of qualified trail the time of 100000defense decisions made by the algorithm was counted for 20times with and without qualified trail e average of 20times was 951s for qualified trail and 374 s for unqualifiedtrail Although the introduction of eligibility traces willincrease the decision-making time by nearly 25 times thetime required for 100000 decisions after the introduction ofeligibility traces is still only 951 s which can still meet thereal-time requirements

In summary the introduction of eligibility trace at theexpense of a small amount of memory and computingoverhead can effectively increase the learning speed of thealgorithm and improve the defense gains

54 Comprehensive Comparisons is approach is com-pared with some typical research results as shown in Table 4[3 12 14 16] is based on the assumption of complete ra-tionality e equilibrium strategy obtained by it is difficultto appear in practice and has a low guiding effect on actualdefense decision making [20] and this paper have morepracticability on the premise of bounded rationality hy-pothesis but [20] is based on the theory of biologicalevolution and mainly studies population evolution e coreof game analysis is not the optimal strategy choice of playersbut the strategy adjustment process trend and stability ofgroup members composed of bounded rational players andthe stability here refers to group members is approach isnot suitable for guiding individual real-time decisionmaking because the proportion of specific strategies isunchanged and not the strategy of a player On the contrarythe defender of the proposed approach adopts reinforcementlearning mechanism which is based on systematic feedbackto learn in the confrontation with the attacker and is moresuitable for the study of individual strategies

6 Conclusions and Future Works

In this paper cyber attack-defense confrontation is ab-stracted as a stochastic game problem under the restrictionof limited rationality A host-centered attack-defense graphmodel is proposed to extract network state and attack-de-fense action and an algorithm to generate attack-defensegraph is designed to effectively compress the game statespace e WoLF-PHC-based defense decision approach isproposed to overcome the problem which enables defendersunder bounded rationality to make optimal choices whenfacing different attackers e lattice improves the WoLF-

PHC algorithm speeds up the defenderrsquos learning speed andreduces the algorithmrsquos dependence on data is approachnot only satisfies the constraints of bounded rationality butalso does not require the defender to know too much in-formation about the attacker It is a more practical defensedecision-making approach

e future work is to further optimize the winning andlosing criteria of WoLF-PHC algorithm for specific attack-defense scenarios in order to speed up defense learning andincrease defense gains

Data Availability

e data that support the findings of this study are notpublicly available due to restrictions as the data containsensitive information about a real-world enterprise networkAccess to the dataset is restricted by the original ownerPeople who want to access the data should send a request tothe corresponding author who will apply for permission ofsharing the data from the original owner

Conflicts of Interest

e authors declare that they have no conflicts of interestregarding the publication of this paper

Acknowledgments

is work was supported by the National High TechnologyResearch and Development Program of China (863 Pro-gram) (2014AA7116082 and 2015AA7116040) and NationalNatural Science Foundation of China (61902427) In par-ticular we thank Junnan Yang and Hongqi Zhang of ourresearch group for their previous work on stochastic gameWe sincerely thank them for their help in publishing thispaper

References

[1] Z Tian W Shi Y Wang et al ldquoReal time lateral movementdetection based on evidence reasoning network for edgecomputing environmentrdquo IEEE Transactions on IndustrialInformatics vol 15 no 7 pp 4285ndash4294 2019

[2] Y Wang J Yu W Qu H Shen and X Cheng ldquoEvolutionarygame model and analysis approaches for network groupbehaviorrdquo Chinese Journal of Computers vol 38 no 2pp 282ndash300 2015

[3] J Liu H Zhang and Y Liu ldquoResearch on optimal selection ofmoving target defense policy based on dynamic game withincomplete informationrdquo Acta Electronica Sinica vol 46no 1 pp 82ndash29 2018

Table 4 Comprehensive comparison of existing approaches

Ref Game type Model assumption Learning mechanism Game process Applicable object Practicability[3] Stochastic game Rationality mdash Multistage Personal Bad[12] Static game Rationality mdash Single-stage Personal Bad[14] Dynamic game Rationality mdash Multistage Personal Bad[16] Stochastic game Rationality mdash Multistage Personal Bad[20] Evolutionary game Bounded rationality Biological evolution mdash Group GoodOur approach Stochastic game Bounded rationality Reinforcement learning Multistage Personal Good

Security and Communication Networks 15

[4] Z Tian X Gao S Su J Qiu X Du and M GuizanildquoEvaluating reputation management schemes of internet ofvehicles based on evolutionary game theoryrdquo IEEE Trans-actions on Vehicular Technology vol 68 no 6 pp 5971ndash59802019

[5] Q Zhu and S Rass ldquoGame theory meets network securitya tutorialrdquo in Proceedings of the ACM SIGSAC Conference onComputer and Communications Security pp 2163ndash2165Toronto Canada October 2018

[6] X Li C Zhou Y C Tian and Y Qin ldquoA dynamic decision-making approach for intrusion response in industrial controlsystemsrdquo IEEE Transactions on Industrial Informatics vol 15no 5 pp 2544ndash2554 2019

[7] H Hu Y Liu H Zhang and R Pan ldquoOptimal networkdefense strategy selection based on incomplete informationevolutionary gamerdquo IEEE Access vol 6 pp 29806ndash298212018

[8] J Li G Kendall and R John ldquoComputing Nash equilibriaand evolutionarily stable states of evolutionary gamesrdquo IEEETransactions on Evolutionary Computation vol 20 no 3pp 460ndash469 2016

[9] M Bowling and M Veloso Multiagent Learning Usinga Variable Learning Rate Elsevier Science Publishers LtdAmsterdam Netherlands 2002

[10] J Harb and D Precup ldquoInvestigating recurrence and eligi-bility traces in deep Q-networksrdquo httpsarxivorgabs170405495 2017

[11] C T Do N H Tran and C Hong ldquoGame theory for cybersecurity and privacyrdquo ACM Computing Surveys vol 50 no 2p 30 2017

[12] Y-L Liu D-G Feng L-HWu and Y-F Lian ldquoPerformanceevaluation of worm attack and defense strategies based onstatic Bayesian gamerdquo Chinese Journal of Software vol 23no 3 pp 712ndash723 2012

[13] Y Li D E Quevedo S Dey and L Shi ldquoA game-theoreticapproach to fake-acknowledgment attack on cyber-physicalsystemsrdquo IEEE Transactions on Signal and Information Pro-cessing over Networks vol 3 no 1 pp 1ndash11 2017

[14] H Zhang L Jiang S Huang J Wang and Y Zhang ldquoAttack-defense differential game model for network defense strategyselectionrdquo IEEE Access vol 7 pp 50618ndash50629 2019

[15] A Afrand and S K Das ldquoPreventing DoS attacks in wirelesssensor networks a repeated game theory approachrdquo InternetJournal of Cyber Security vol 5 no 2 pp 145ndash153 2007

[16] J Wei B Fang and Z Tian ldquoResearch on defense strategiesselection based on attack-defense stochastic game modelrdquoChinese Journal of Computer Research and Developmentvol 47 no 10 pp 1714ndash1723 2010

[17] C Wang X Cheng and Y Zhu ldquoA Markov game model ofcomputer network operationrdquo Chinese Systems Engineering-gteory and Practice vol 34 no 9 pp 2402ndash2410 2014

[18] J Hofbauer and K Sigmund ldquoEvolutionary game dynamicsrdquoBulletin of the American Mathematical Society vol 40 no 4pp 479ndash519 2011

[19] Y Hayel and Q Zhu ldquoEpidemic protection over heteroge-neous networks using evolutionary Poisson gamesrdquo IEEETransactions on Information Forensics and Security vol 12no 8 pp 1786ndash1800 2017

[20] J Huang and H Zhang ldquoImproving replicator dynamicevolutionary game model for selecting optimal defensestrategiesrdquo Chinese Journal on Communications vol 38 no 1pp 170ndash182 2018

[21] H Hu Y Liu H Zhang and Y Zhang ldquoSecurity metricmethods for network multistep attacks using AMC and big

data correlation analysisrdquo Security and CommunicationNetworks vol 2018 Article ID 5787102 14 pages 2018

[22] H Hu Y Liu Y Yang H Zhang and Y Zhang ldquoNew in-sights into approaches to evaluating intention and path fornetwork multistep attacksrdquo Mathematical Problems in En-gineering vol 2018 Article ID 4278632 13 pages 2018

[23] H Hu H Zhang Y Liu and Y Wang ldquoQuantitative methodfor network security situation based on attack predictionrdquoSecurity and Communication Networks vol 2017 Article ID3407642 19 pages 2017

[24] M Zimmer and S Doncieux ldquoBootstrapping Q-learning forrobotics from neuro-evolution resultsrdquo IEEE Transactions onCognitive and Developmental Systems vol 10 no 1pp 102ndash119 2018

[25] R S Sutton and A G Barto ldquoReinforcement learning anintroduction Bradford bookrdquo Machine Learning vol 16no 1 pp 285-286 2005

[26] H Zhang J Wang D Yu J Han and T Li ldquoActive defensestrategy selection based on static Bayesian gamerdquo in Pro-ceedings of IET International Conference on CyberspaceTechnology pp 1ndash7 London UK October 2016

16 Security and Communication Networks

International Journal of

AerospaceEngineeringHindawiwwwhindawicom Volume 2018

RoboticsJournal of

Hindawiwwwhindawicom Volume 2018

Hindawiwwwhindawicom Volume 2018

Active and Passive Electronic Components

VLSI Design

Hindawiwwwhindawicom Volume 2018

Hindawiwwwhindawicom Volume 2018

Shock and Vibration

Hindawiwwwhindawicom Volume 2018

Civil EngineeringAdvances in

Acoustics and VibrationAdvances in

Hindawiwwwhindawicom Volume 2018

Hindawiwwwhindawicom Volume 2018

Electrical and Computer Engineering

Journal of

Advances inOptoElectronics

Hindawiwwwhindawicom

Volume 2018

Hindawi Publishing Corporation httpwwwhindawicom Volume 2013Hindawiwwwhindawicom

The Scientific World Journal

Volume 2018

Control Scienceand Engineering

Journal of

Hindawiwwwhindawicom Volume 2018

Hindawiwwwhindawicom

Journal ofEngineeringVolume 2018

SensorsJournal of

Hindawiwwwhindawicom Volume 2018

International Journal of

RotatingMachinery

Hindawiwwwhindawicom Volume 2018

Modelling ampSimulationin EngineeringHindawiwwwhindawicom Volume 2018

Hindawiwwwhindawicom Volume 2018

Chemical EngineeringInternational Journal of Antennas and

Propagation

International Journal of

Hindawiwwwhindawicom Volume 2018

Hindawiwwwhindawicom Volume 2018

Navigation and Observation

International Journal of

Hindawi

wwwhindawicom Volume 2018

Advances in

Multimedia

Submit your manuscripts atwwwhindawicom

Page 6: Research Article - Hindawi Publishing Corporationdownloads.hindawi.com/journals/scn/2019/3038586.pdf · Afrand and Das [15] established a repeated game model between the intrusion

to generate all possible state nodes and initialize the edgesSteps 2ndash8 are to instantiate attacks and generate all attackedges Steps 9ndash15 are to instantiate defenses and generate alldefense edges Steps 16ndash20 are used to remove all isolatednodes And step 21 is to output attack-defense maps

Assuming the number of nodes in the target network is nand the number of vulnerabilities of each node is m themaximum number of nodes in the attack-defense graph is3n In the attack instantiation stage the computationalcomplexity of analyzing the connection relationship be-tween each two nodes is o(n2 minus n) e computationalcomplexity of matching the vulnerability of the nodes withthe connection relationship is o(m(n2 minus n)) In the defenseinstantiation stage we remove the isolated nodes and thecomputational complexity of traversing the edges of all thenodes is o(9n2 minus 3n) In summary the order of computa-tional complexity of the algorithm is o(n2) e node ofattack-defense graph G can extract network state and theedge of attack-defense graph G can extract attack-defenseaction

4 Stochastic Game Analysis and StrategySelection Based on WoLF-PHCIntelligent Learning

In the previous section cyber attack-defense is described asa bounded rational stochastic game problem and an attack-defense stochastic game model AD-SGM is constructed Inthis section reinforcement learning mechanism is in-troduced into finite rational stochastic game and WoLF-PHC algorithm is used to select defense strategies based onAD-SGM

41 Principle of WoLF-PHC

411 Q-Learning Algorithm Q-learning [24] is the basis ofWoLF-PHC algorithm and a typical model-free re-inforcement learning algorithm Its learning mechanism isshown in Figure 4 Agent in Q-learning obtains knowledgeof return and environment state transfer through interaction

Elements of networksecurity

Network connectionrelations

Node vulnerabilityinformation

Node serviceinformation

Node accessprivilege

Attack template

Attackinstantiation

Defenseinstantiation

Defense template

Atta

ck-d

efen

se g

raph Network

state

Attackaction

Defensiveaction

Figure 3 Attack-defense graph generation

Input Elements of Cyber security NSE Attack Template AM Defense Template DMOutput Attack graph G (S E)

(1) S⟵NSE E⟵emptylowast Generate all nodes lowast(2) for each S dolowast Attack instantiation to generate attack edges lowast(3) update NSE in slowast Updating Cyber security Elements lowast(4) if Cshost shost and CdhostV geAMprecV and CdhostFgeAMprecF and CdhostPprivilegegeAMprecPprivilege(5) srhost⟵Cshost(6) sdhost⟵Cdhost(7) sdprivilege⟵AMpostcPprivilege(8) Ea⟵Ea cup ea(srAMtid sd)1113864 1113865

(9) end if(10) end for(11) for each S dolowast Defense instantiation to generate defense edges lowast(12) if Easd s and DMtid Eatid(13) Ed⟵Ed cup ed(EasdDMdsetd Easr)1113864 1113865

(14) end if(15) end for(16) for each S dolowast Remove isolated nodes S lowast(17) if ea(s tid sd) empty and ed(sr d s) empty(18) S⟵ S minus s

(19) end if(20) end for(21) Return G

ALGORITHM 1 Attack-defense graph generation algorithm

6 Security and Communication Networks

with environment Knowledge is expressed by payoff Qd andlearned by updating Qd Qd is

Qd(s d) Qd(s d) + α Rd s d sprime( 1113857 + cmaxdprime

Qd sprime dprime( 1113857 minus Qd(s d)1113890 1113891

(1)

where α is payoff learning rate and c is the discount factore strategy of Q-learning is πd(s) argmaxdQd(s d)

412 PHC Algorithm e Policy Hill-Climbing algorithm[25] is a simple and practical gradient descent learning al-gorithm suitable for hybrid strategies which is an im-provement of Q-learning e state-action gain function Qd

of PHC is the same as Q-learning but the policy updateapproach of Q-learning is no longer followed but the hybridstrategy πd(sk) is updated by executing the hill-climbingalgorithm as shown in equations (2)ndash(4) In the formula thestrategy learning rate is

πd sk di( 1113857 πd sk di( 1113857 + Δskdi (2)

where

Δskdi

minus δskdi di ne argmax

dQd sk d( 1113857

1113944djnedi

δskdj others

⎧⎪⎪⎪⎨

⎪⎪⎪⎩

(3)

δskdi min πd sk di( 1113857

δDk minus 1

11138681113868111386811138681113868111386811138681113868

1113888 1113889 (4)

413 WoLF-PHC Algorithm WoLF-PHC algorithm is animprovement of PHC algorithm By introducing WoLFmechanism the defender has two different strategy learningrates low strategy learning rate δw when winning and highstrategy learning rate δl when losing as shown in formula (5)e two learning rates enable defenders to adapt quickly toattackersrsquo strategies when they perform worse than expectedand to learn cautiously when they perform better than expectede most important thing is the introduction of WoLFmechanism which guarantees the convergence of the algorithm[9] WoLF-PHC algorithm uses average strategy as the criterionof success and failure as shown in formulae (6) and (7)

δ

δw 1113944disinDk

πd skd( 1113857Qd skd( 1113857gt 1113944disinDk

πd skd( 1113857Qd skd( 1113857

δl others

⎧⎪⎪⎪⎨

⎪⎪⎪⎩

(5)

πd(sd) πd(sd) +1

C(s)πd(sd) minus πd(sd)( 1113857

(6)

C(s) C(s) + 1 (7)

42 Defense Decision-Making Algorithm Based on ImprovedWoLF-PHC e decision-making process of our approachis shown in Figure 5 which consists of five steps It receivestwo types of input data attack evidence and abnormal

Input AD minus SGM α δ λ and c

Output Defense action d(1) initialize AD minus SGM C(s) 0 e(s d) 0lowast Network state and attack-defense actions are extracted by Algorithm 1 lowast(2) slowast get(E) lowast Getting the current network state from Network E lowast(3) repeat(4) dlowast πd(slowast) lowast Select defense action lowast(5) Output dlowast lowast Feedback defense actions to defenders lowast(6) sprime get(E) lowast Get the status after the action dlowast is executed lowast(7) ρlowast Rd(slowast dlowast sprime) + cmaxdprimeQd(sprime dprime) minus Qd(slowast dlowast)

(8) ρg Rd(slowast dlowast sprime) + cmaxdprimeQd(sprime dprime) minus maxdQd(slowast d)

(9) for each state-action pair (s d) except (slowast dlowast) do(10) e(s d) cλe(s d)

(11) Qd(s d) Qd(s d) + αρge(s d)

(12) end forlowast Update noncurrent eligibility trace (slowast dlowast) and values Qd lowast(13) Qd(slowast dlowast) Qd(slowast dlowast) + αρlowastlowast Update Qd of (slowast dlowast) lowast(14) e(slowast dlowast) cλe(slowast dlowast) + 1 lowast Update track of (slowast dlowast)lowast(15) C(slowast) C(slowast) + 1(16) Updating average strategy based on formula (6)(17) Selecting the learning rate of strategies based on formula (5)(18) δslowastd min(πd(slowast d) δ(D(slowast) minus 1))foralld isin D(slowast)

(19) Δslowastd minus δslowastd dne argmaxdi

Qd(slowast di)

1113936djnedlowastδslowastdjOthers1113896

(20) πd(slowast d) πd(slowast d) + Δslowastdforalld isin D(slowast) lowast Update defense strategy lowast(21) slowast sprime(22) end repeat

ALGORITHM 2 Defense decision-making algorithm

Security and Communication Networks 7

evidence All these pieces of evidence come from real-timeintrusion detection systems After decision making theoptimal security strategy is determined against detectedintrusions

In order to improve the learning speed of WoLF-PHCalgorithm and reduce the dependence of the algorithm onthe amount of data the eligibility trace is introduced toimprove WoLF-PHC e eligibility trace can track specificstate-action trajectories of recent visits and then assigncurrent returns to the state-action of recent visits WoLF-PHC algorithm is an extension of Q-learning algorithm Atpresent there are many algorithms combining Q-learningwith eligibility trace is paper improves WoLF-PHC byusing the typical algorithm [10] e qualification trace ofeach state-action is defined as e(s a) Suppose slowast is thecurrent network state slowast and the eligibility trace is updatedin the way shown in formula (8) Among them the traceattenuation factor is λ

e(s d) cλe(s d) sne slowast

cλe(s d) + 1 s slowast1113896 (8)

WoLF-PHC algorithm is an extension of Q-learningalgorithm which belongs to off-policy algorithm It usesgreedy policy when evaluating defense actions for eachnetwork state and occasionally introduces nongreedy policywhen choosing to perform defense actions in order to learnIn order to maintain the off-policy characteristics of WoLF-PHC algorithm the state-action values are updated byformulae (9)ndash(12) in which the defense actions dlowast are

selected for execution slowast because only the recently visitedstatus-action pairs will have significantly more eligibilitytrace than 0 while most other status-action pairs will havealmost none eligibility trace In order to reduce the memoryand running time consumption caused by eligibility traceonly the latest status-action pair eligibility trace can be savedand updated in practical application

Qd slowast dlowast

( 1113857 Qd slowast dlowast

( 1113857 + αρlowast (9)

Qd(s d) Qd(s d) + αρge(s d) (10)

ρlowast Rd slowast d sprime( 1113857 + cmax

dprimeQd sprime dprime( 1113857 minus Qd s

lowast dlowast

( 1113857

(11)

ρg Rd slowast dlowast sprime( 1113857 + cmax

dprimeQd sprime dprime( 1113857 minus max

dQd slowast d( 1113857

(12)

In order to achieve better results the defense decision-making approach based on WoLF-PHC needs to set fourparameters α δ λ and c reasonably (1) e range of thepayoff learning rate is 0lt αlt 1e bigger the representativeα is the more important the cumulative reward is e fasterthe learning speed is the smaller α is and the better thestability of the algorithm is (2)e range of strategy learningrate 0lt δ lt 1 is obtained According to the experiment wecan get a better result when adopting δlδw 4 (3) eattenuation factor λ of eligibility trace is in the range of

Describe andanalyze cyberattack-defenseconfrontation

Model stochastic game

for attack-defense

Extract currentsecurity state andcandidate attack-

defense action

Evaluatepayoff

hellip

helliphellip

How to defend

Decision making

Figure 5 e process of our decision-making approach

Rt+1reward Rt

St+1state St

Ambient

Agent

Figure 4 Q-learning learning mechanism

8 Security and Communication Networks

0lt λlt 1 which is responsible for the credit allocation ofstatus-action It can be regarded as a time scale e greaterthe credit allocated λ to historical status-action the greaterthe credit allocated to historical status-action (4) e rangeof the discount factor 0lt clt 1 represents the defenderrsquospreference for immediate return and future return When c

approaches 0 it means that future returns are irrelevant andimmediate returns are more important When c approaches1 it means immediate returns are irrelevant and futurereturns are more important

Agent in WoLF-PHC is the defender in the stochasticgame model of attack-defense AD minus SGM the game state inagentrsquos state corresponds to AD minus SGM the defense actionin agentrsquos behavior corresponds to AD minus SGM the imme-diate return in agentrsquos immediate return corresponds toAD minus SGM and the defense strategy in agentrsquos strategycorresponds to AD minus SGM

On the basis of the above a specific defense decision-making approach as shown in Algorithm 2 is given e firststep of the algorithm is to initialize the stochastic gamemodel AD minus SGM of attack-defense and the related pa-rameters e network state and attack-defense actions areextracted by Algorithm 1 e second step is to detect thecurrent network state by the defender Steps 3ndash22 are tomake defense decisions and learn online Steps 4-5 are toselect defense actions according to the current strategy steps6ndash14 are to update the benefits by using eligibility traces andsteps 15ndash21 are the new payoffs using mountain climbingalgorithm to update defense strategy

e spatial complexity of the Algorithm 2 mainlyconcentrates on the storage of pairs Rd(s d sprime) such ase(s d) πd(s d) πd(s d) and Qd(s d) e number of statesis |S| |D| and |A| are the numbers of measures taken by thedefender and attacker in each state respectively ecomputational complexity of the proposed algorithm isO(4 middot |S| middot |D| + |S|2 middot |D|) Compared with the recentmethod using evolutionary game model with complexityO(|S|2 middot (|A| + |D|)3) [14] we greatly reduce the computa-tional complexity and increase the practicability of the al-gorithm since the proposed algorithm does not need to solvethe game equilibrium

5 Experimental Analysis

51 Experiment Setup In order to verify the effectiveness ofthis approach a typical enterprise network as shown inFigure 6 is built for experiment Attacks and defenses occuron the intranet with attackers coming from the extranet Asa defender network administrator is responsible for thesecurity of intranet Due to the setting of Firewall 1 andFirewall 2 legal users of the external network can only accessthe web server which can access the database server FTPserver and e-mail server

e simulation experiment was carried out on a PC withIntel Core i7-6300HQ 340GHz 32GB RAM memoryand Windows 10 64 bit operating system e Python 365emulator was installed and the vulnerability information inthe experimental network was scanned by Nessus toolkit asshown in Table 1 e network topology information was

collected by ArcGis toolkit We used the Python language towrite the project code During the experiment we set upabout 25000 times of attack-defense strategy studies eexperimental results were analyzed and displayed usingMatlab2018a as described in Section 53

Referring to MIT Lincoln Lab attack-defense behaviordatabase attack-defense templates are constructed Attack-defense maps are divided into attack maps and defense mapsby using attacker host A web server W database server DFTP server F and e-mail server E In order to facilitatedisplay and description attack-defense maps are dividedinto attack maps and defense maps as shown in Figures 7and 8 respectively e meaning of defense action in thedefense diagram is shown in Table 2

52 Construction of the Experiment Scenario AD-SGM(1) N (attacker defender) are players participating in

the game representing cyber attackers and defendersrespectively

Vulnerabilityscanning system

VDS

Internet

DMZ

FTP server

Firewall 1Trusted zone

Firewall 2Database

serverE-mailserver

Web server

IDS

Attacker

Extranet

Intranet

Figure 6 Experimental network topology

Table 1 Network vulnerability information

Attackidentifier Host CVE Target

privilegetid1 Web server CVE-2015-1635 usertid2 Web server CVE-2017-7269 roottid3 Web server CVE-2014-8517 roottid4 FTP server CVE-2014-3556 roottid5 E-mail server CVE-2014-4877 roottid6 Database server CVE-2013-4730 usertid7 Database server CVE-2016-6662 root

S0(A root)

S1(W user)

S2(W root)S5(D user)

S6(D root)

tid1

tid2

tid3

tid4

tid5

tid6

tid7

tid8

tid9

S3(F root)

S4(E root)

Figure 7 Attack graph

Security and Communication Networks 9

(2) e state set of stochastic game is S (s0 s1

s2 s3 s4 s5 s6) which consists of network state and isextracted from the nodes shown in Figures 7 and 8

(3) e action set of the defender is D (D0 D1

D2 D3 D4 D5 D6) D0 NULL D1 d11113864 d2D2

d3 d41113864 1113865D3 d1 d5 d61113864 1113865D4 d1 d5 d61113864 1113865D5 d11113864

d2 d7D6 d3 d41113864 1113865 and the edges are extractedfrom Figure 8

(4) Quantitative results of immediate returns Rd(si

d sj) of defenders [16 26] are

Rd s0NULL s0( 1113857 Rd s0NULL s1( 1113857 Rd s0NULL s2( 1113857( 1113857 (0 minus 40 minus 59)

Rd s1 d1 s0( 1113857 Rd s1 d1 s1( 1113857 Rd s1 d1 s2( 1113857 Rd s1 d2 s0( 1113857 Rd s1 d2 s1( 1113857 Rd s1 d2 s2( 1113857( 1113857

(40 0 minus 29 5 minus 15 minus 32)

( Rd s2 d3 s0( 1113857 Rd s2 d3 s1( 1113857 Rd s2 d3 s2( 1113857 Rd s2 d3 s3( 1113857 Rd s2 d3 s4( 1113857 Rd s2 d3 s5( 1113857

Rd s2 d4 s0( 1113857 Rd s2d4 s1( 1113857 Rd s2 d4 s2( 1113857 Rd s2 d4 s3( 1113857 Rd s2 d4 s4( 1113857 Rd s2 d4 s5( 11138571113857

(24 9 minus 15 minus 55 minus 49 minus 65 19 5 minus 21 minus 61 minus 72 minus 68)

( Rd s3 d1 s2( 1113857 Rd s3 d1 s3( 1113857 Rd s3 d1 s6( 1113857 Rd s3 d5 s2( 1113857 Rd s3 d5 s3( 1113857 Rd s3 d5 s6( 1113857

Rd s3 d6 s2( 1113857 Rd s3 d6 s3( 1113857 Rd s3 d6 s6( 11138571113857

(21 minus 16 minus 72 15 minus 23 minus 81 minus 21 minus 36 minus 81)

( Rd s4 d1 s2( 1113857 Rd s4 d1 s4( 1113857 Rd s4 d1 s6( 1113857 Rd s4 d5 s2( 1113857 Rd s4 d5 s4( 1113857 Rd s4 d5 s6( 1113857

Rd s4 d6 s2( 1113857 Rd s4 d6 s4( 1113857 Rd s4 d6 s6( 11138571113857

(26 0 minus 62 11 minus 23 minus 75 9 minus 25 minus 87)

( Rd s5 d1 s2( 1113857 Rd s5 d1 s5( 1113857 Rd s5 d1 s6( 1113857 Rd s5 d2 s2( 1113857 Rd s5 d2 s5( 1113857 Rd s5 d2 s6( 1113857

Rd s5 d7 s2( 1113857 Rd s5 d7 s5( 1113857 Rd s5 d7 s6( 11138571113857

(29 0 minus 63 11 minus 21 minus 76 2 minus 27 minus 88)

( Rd s6 d3 s3( 1113857 Rd s6 d3 s3( 1113857 Rd s6 d3 s5( 1113857 Rd s6 d3 s6( 1113857 Rd s6 d4 s3( 1113857

Rd s6 d4 s4( 1113857 Rd s6 d4 s5( 1113857 Rd s6 d4 s6( 11138571113857

(minus 23 minus 21 minus 19 minus 42 minus 28 minus 31 minus 24 minus 49)

(13)

Table 2 Defense action description

Atomic defense action d1 d2 d3 d4 d5 d6 d7

Renew root data Limit SYNICMPpackets Install Oracle patches Reinstall listener program Uninstall delete Trojan Limit access toMDSYS Restart database server Delete suspicious account Add physical resource Repair database Limit packets from ports

S0(A root)

S1(W user)

S2(W root)S5(D user)

S6(D root)

d3

d3

d4

d1d2d7

S3(F root)

S4(E root)d1d5d6

d1d5d6d4

d3

d1d2

Figure 8 Defense graph

10 Security and Communication Networks

(5) In order to detect the learning performance of theAlgorithm 2 more fully the defenderrsquos state-actionpayoffQd(si d) is initialized with a unified 0 withoutintroducing additional prior knowledge

(6) Defenderrsquos defense strategy adopts average strategyto initialize that is πd(sk d1) πd(sk d2) middot middot middot

πd(sk dm) 1113936disinD(sk)πd(sk d) 1forallsk isin S where noadditional prior knowledge is introduced

53 Testing and Analysis e experiment in this section hasthree purposes e first is to test the influence of differentparameter settings on the proposed Algorithm 2 so as to findout the experimental parameters suitable for this scenarioe second is to compare this approach with the existingtypical approaches to verify the advancement of this ap-proach e third is to test the effectiveness of WoLF-PHCalgorithm improvement based on eligibility trace

From Figures 7 and 8 we can see that the state of attack-defense strategy selection is the most complex and repre-sentative erefore the performance of the algorithm isanalyzed by the experimental state selection and the othernetwork state analysis approaches are the same

531 Parameter Test and Analysis Different parameters willaffect the speed and effect of learning At present there is norelevant theory to determine the specific parameters InSection 4 the relevant parameters are preliminarily ana-lyzed On this basis the different parameter settings arefurther tested to find the parameter settings suitable for thisattack-defense scenario Six different parameter settingswere tested Specific parameter settings are shown in Table 3In the experiment the attackerrsquos initial strategy is randomstrategy and the learning mechanism is the same as theapproach in this paper

e probability of the defenderrsquos choice of defense ac-tions and sums in state is shown in Figure 9 e learningspeed and convergence of the algorithm under differentparameter settings can be observed from Figure 9 whichshows that the learning speed of settings 1 3 and 6 is fasterand the best strategy can be obtained after learning less than1500 times under the three settings but convergence of 3and 6 is poor Although the best strategy can be learned bysettings 3 and 6 there will be oscillation afterwards and thestability of setting 1 is not suitable

Defense payoff can represent the degree of optimizationof the strategy In order to ensure that the payoff value doesnot reflect only one defense result the average of 1000defense gains is taken and the change of the average payoffper 1000 defense gains is shown in Figure 10 As can be seenfrom Figure 10 the benefits of Set 3 are significantly lowerthan those of other settings but the advantages and dis-advantages of other settings are difficult to distinguish Inorder to display more intuitively the average value of 25000defense gains calculated under different settings in Figure 10is shown in Figure 11 From Figure 11 we can see that theaverage value of settings 1 and 5 is higher For furthercomparison the standard deviation of settings 1 and 5 is

calculated one step on the basis of the average value to reflectthe discreteness of the gains As shown in Figure 12 thestandard deviations of setting 1 and setting 6 are smallMoreover the result of setting 1 is smaller than setting 6

In conclusion setting 1 of the six sets of parameters is themost suitable for this scenario Since setting 1 has achievedan ideal effect and can meet the experimental requirementsit is no longer necessary to further optimize the parameters

532 Comparisons In this section stochastic game [16] andevolutionary game [20] are selected to conduct comparativeexperiments with this approach According to the differenceof attackerrsquos learning ability this section designs two groupsof comparative experiments In the first group the attackerrsquoslearning ability is weak and will not make adjustments to theattack-defense results In the second group the attackerrsquoslearning ability is strong and adopts the same learningmechanism as the approach in this paper In both groupsthe initial strategies of attackers were random strategies

In the first group of experiments the defense strategy ofthis approach is as shown in Figure 9(a) e defense strategiescalculated by the approach [16] are πd(s2 d3) 07πd(s2 d4) 03 e defense strategies of [20] are evolu-tionarily stable and balanced And the defense strategies of [20]are πd(s2 d3) 08 πd(s2 d4) 02 Its average earnings per1000 times change as shown in Figure 13

From the results of the strategies and benefits of the threeapproaches we can see that the approach in this paper canlearn from the attackerrsquos strategies and adjust to the optimalstrategy so the approach in this paper can obtain the highestbenefits Wei et al [16] adopted a fixed strategy whenconfronting any attacker When the attacker is constrainedby bounded rationality and does not adopt Nash equilibriumstrategy the benefit of this approach is low Although thelearning factors of both attackers and defenders are takeninto account in [20] the parameters required in the modelare difficult to quantify accurately which results in thedeviation between the final results and the actual results sothe benefit of the approach is still lower than that of theapproach in this paper

In the second group of experiments the results of[16 20] are still as πd(s2 d3) 07 πd(s2 d4) 03 andπd(s2 d3) 08 πd(s2 d4) 02 e decision making ofthis approach is shown in Figure 14 After about 1800 timesof learning the approach achieves stability and converges tothe same defense strategy as that of [16] As can be seen fromFigure 15 the payoff of [20] is lower than that of other twoapproaches e average payoff of this approach in the first2000 defenses is higher than that of [26] and then it is almostthe same as that of [26] Combining Figures 14 and 15 wecan see that the learning attacker cannot get Nash equi-librium strategy at the initial stage and the approach in thispaper is better than that in [26] When the attacker learns toget Nash equilibrium strategy the approach in this paper canconverge to Nash equilibrium strategy At this time theperformance of this approach is the same as that in [26]

In conclusion when facing the attackers with weaklearning ability the approach in this paper is superior to that

Security and Communication Networks 11

Table 3 Dierent parameter settings

Set α δl δw λ c

1 001 0004 0001 001 0012 01 0004 0001 001 0013 001 0004 0001 001 014 001 0004 0001 01 0015 001 004 001 001 0016 001 0008 0001 001 001

0 2000 4000 6000s2 defense number

s2 d

efen

se re

turn

10

08

06

04

02

00

d3d4

(a)

s2 d

efen

se re

turn

10

08

06

04

02

000 2000 4000 6000

s2 defense number

d3d4

(b)

s2 d

efen

se re

turn

10

08

06

04

02

000 2000 4000 6000

s2 defense number

d3d4

(c)

s2 d

efen

se re

turn

10

08

06

04

02

000 2000 4000 6000

s2 defense number

d3d4

(d)

s2 d

efen

se re

turn

10

08

06

04

02

000 2000 4000 6000

s2 defense number

d3d4

(e)

s2 d

efen

se re

turn

10

08

06

04

02

000 2000 4000 6000

s2 defense number

d3d4

(f )

Figure 9 Defense decision making under dierent parameter settings (a) Defense decision under setting 1 (b) Defense decision undersetting 2 (c) Defense decision under setting 3 (d) Defense decision under setting 4 (e) Defense decision under setting 5 (f ) Defensedecision under setting 6

5 10 15 20 25s2 defense number (times1000)

s2 d

efen

se re

turn

ndash10

ndash15

ndash20

ndash25

ndash30

Set 1Set 2Set 3

Set 4Set 5Set 6

Figure 10 Defense payo under dierent parameter settings

12 Security and Communication Networks

in [16 20] When facing an attacker with strong learningability if the attacker has not obtained Nash equilibriumthrough learning this approach is still better than [16 20] Ifthe attacker obtains Nash equilibrium through learning thispaper can also obtain the same Nash equilibrium strategy as[26] and obtain its phase e same eect is superior to thatin [20]

533 Test Comparison with and without Eligibility Traceis section tests the actual eect of the eligibility traces onthe Algorithm 2 e eect of eligibility traces on strategyselection is shown in Figure 16 from which we can see thatthe learning speed of the algorithm is faster when thequalied trace is available After 1000 times of learning the

algorithm can converge to the optimal strategy When thequalied trace is not available the algorithm needs about2500 times of learning to converge

Average earnings per 1000 times change as shown inFigure 17 from which we can see that the benets of thealgorithm are almost the same when there is or not anyqualied trace after convergence From Figure 17 we can seethat 3000 defenses before convergence have higher returnsfrom qualied trails than those from unqualied trails Inorder to further verify this the average of the rst 3000defense gains under qualied trails and unqualied trails iscounted 10 times each respectively e results are shown inFigure 18 which further proves that in the preconvergencedefense phase qualied traces are better than unqualiedtraces

e addition of eligibility trace accelerates the learningspeed but also brings additional memory and computingoverhead In the experiment only 10 state-action pairs thatwere recently accessed were saved which eectively reduced

Set 1 Set 2 Set 3 Set 4 Set 5 Set 6

ndash80

ndash85

ndash90

Figure 11 Average return under dierent parameter settings

Set 1 Set 2 Set 3 Set 4 Set 5 Set 6

80

75

Figure 12 Standard deviation of defense payo under dierentparameter settings

ndash10

ndash20

ndash30

ndash40

s2 d

efen

se re

turn

5 10 15 20 25

Our methodReference [12]Reference [16]

s2 defense number (times1000)

Figure 13 Comparison of defense payo change

0 2000 4000 6000s2 defense number

s2 d

efen

se st

rate

gy

08

06

04

02

00

d3d4

Figure 14 Defense decision making of our approach

Our methodReference [12]Reference [16]

5 10 15 20 25s2 defense number (times1000)

s2 d

efen

se re

turn

ndash10

ndash20

ndash30

ndash40

Figure 15 Comparison of defense payo change

Security and Communication Networks 13

0 2000 4000 6000s2 defense number

s2 d

efen

se st

rate

gy

100

075

050

025

000

Eligibility trace d3Eligibility trace d4

No eligibility trace d3No eligibility trace d4

Figure 16 Comparison of defense decision making

Eligibility traceNo eligibility trace

0 5 10 15 20 25s2 defense number (times1000)

s2 d

efen

se re

turn

ndash10

ndash20

ndash15

ndash30

ndash25

Figure 17 Comparison of defense payo change

Eligibility traceNo eligibility trace

1 2 3 4 5 6 7 8 9 10s2 defense number (times3000)

s2 av

erag

e def

ense

retu

rn

ndash12

ndash14

ndash16

ndash18

ndash20

Figure 18 Average payo comparison of the rst 3000 defenses

14 Security and Communication Networks

the increase of memory consumption In order to test thecomputational cost of qualified trail the time of 100000defense decisions made by the algorithm was counted for 20times with and without qualified trail e average of 20times was 951s for qualified trail and 374 s for unqualifiedtrail Although the introduction of eligibility traces willincrease the decision-making time by nearly 25 times thetime required for 100000 decisions after the introduction ofeligibility traces is still only 951 s which can still meet thereal-time requirements

In summary the introduction of eligibility trace at theexpense of a small amount of memory and computingoverhead can effectively increase the learning speed of thealgorithm and improve the defense gains

54 Comprehensive Comparisons is approach is com-pared with some typical research results as shown in Table 4[3 12 14 16] is based on the assumption of complete ra-tionality e equilibrium strategy obtained by it is difficultto appear in practice and has a low guiding effect on actualdefense decision making [20] and this paper have morepracticability on the premise of bounded rationality hy-pothesis but [20] is based on the theory of biologicalevolution and mainly studies population evolution e coreof game analysis is not the optimal strategy choice of playersbut the strategy adjustment process trend and stability ofgroup members composed of bounded rational players andthe stability here refers to group members is approach isnot suitable for guiding individual real-time decisionmaking because the proportion of specific strategies isunchanged and not the strategy of a player On the contrarythe defender of the proposed approach adopts reinforcementlearning mechanism which is based on systematic feedbackto learn in the confrontation with the attacker and is moresuitable for the study of individual strategies

6 Conclusions and Future Works

In this paper cyber attack-defense confrontation is ab-stracted as a stochastic game problem under the restrictionof limited rationality A host-centered attack-defense graphmodel is proposed to extract network state and attack-de-fense action and an algorithm to generate attack-defensegraph is designed to effectively compress the game statespace e WoLF-PHC-based defense decision approach isproposed to overcome the problem which enables defendersunder bounded rationality to make optimal choices whenfacing different attackers e lattice improves the WoLF-

PHC algorithm speeds up the defenderrsquos learning speed andreduces the algorithmrsquos dependence on data is approachnot only satisfies the constraints of bounded rationality butalso does not require the defender to know too much in-formation about the attacker It is a more practical defensedecision-making approach

e future work is to further optimize the winning andlosing criteria of WoLF-PHC algorithm for specific attack-defense scenarios in order to speed up defense learning andincrease defense gains

Data Availability

e data that support the findings of this study are notpublicly available due to restrictions as the data containsensitive information about a real-world enterprise networkAccess to the dataset is restricted by the original ownerPeople who want to access the data should send a request tothe corresponding author who will apply for permission ofsharing the data from the original owner

Conflicts of Interest

e authors declare that they have no conflicts of interestregarding the publication of this paper

Acknowledgments

is work was supported by the National High TechnologyResearch and Development Program of China (863 Pro-gram) (2014AA7116082 and 2015AA7116040) and NationalNatural Science Foundation of China (61902427) In par-ticular we thank Junnan Yang and Hongqi Zhang of ourresearch group for their previous work on stochastic gameWe sincerely thank them for their help in publishing thispaper

References

[1] Z Tian W Shi Y Wang et al ldquoReal time lateral movementdetection based on evidence reasoning network for edgecomputing environmentrdquo IEEE Transactions on IndustrialInformatics vol 15 no 7 pp 4285ndash4294 2019

[2] Y Wang J Yu W Qu H Shen and X Cheng ldquoEvolutionarygame model and analysis approaches for network groupbehaviorrdquo Chinese Journal of Computers vol 38 no 2pp 282ndash300 2015

[3] J Liu H Zhang and Y Liu ldquoResearch on optimal selection ofmoving target defense policy based on dynamic game withincomplete informationrdquo Acta Electronica Sinica vol 46no 1 pp 82ndash29 2018

Table 4 Comprehensive comparison of existing approaches

Ref Game type Model assumption Learning mechanism Game process Applicable object Practicability[3] Stochastic game Rationality mdash Multistage Personal Bad[12] Static game Rationality mdash Single-stage Personal Bad[14] Dynamic game Rationality mdash Multistage Personal Bad[16] Stochastic game Rationality mdash Multistage Personal Bad[20] Evolutionary game Bounded rationality Biological evolution mdash Group GoodOur approach Stochastic game Bounded rationality Reinforcement learning Multistage Personal Good

Security and Communication Networks 15

[4] Z Tian X Gao S Su J Qiu X Du and M GuizanildquoEvaluating reputation management schemes of internet ofvehicles based on evolutionary game theoryrdquo IEEE Trans-actions on Vehicular Technology vol 68 no 6 pp 5971ndash59802019

[5] Q Zhu and S Rass ldquoGame theory meets network securitya tutorialrdquo in Proceedings of the ACM SIGSAC Conference onComputer and Communications Security pp 2163ndash2165Toronto Canada October 2018

[6] X Li C Zhou Y C Tian and Y Qin ldquoA dynamic decision-making approach for intrusion response in industrial controlsystemsrdquo IEEE Transactions on Industrial Informatics vol 15no 5 pp 2544ndash2554 2019

[7] H Hu Y Liu H Zhang and R Pan ldquoOptimal networkdefense strategy selection based on incomplete informationevolutionary gamerdquo IEEE Access vol 6 pp 29806ndash298212018

[8] J Li G Kendall and R John ldquoComputing Nash equilibriaand evolutionarily stable states of evolutionary gamesrdquo IEEETransactions on Evolutionary Computation vol 20 no 3pp 460ndash469 2016

[9] M Bowling and M Veloso Multiagent Learning Usinga Variable Learning Rate Elsevier Science Publishers LtdAmsterdam Netherlands 2002

[10] J Harb and D Precup ldquoInvestigating recurrence and eligi-bility traces in deep Q-networksrdquo httpsarxivorgabs170405495 2017

[11] C T Do N H Tran and C Hong ldquoGame theory for cybersecurity and privacyrdquo ACM Computing Surveys vol 50 no 2p 30 2017

[12] Y-L Liu D-G Feng L-HWu and Y-F Lian ldquoPerformanceevaluation of worm attack and defense strategies based onstatic Bayesian gamerdquo Chinese Journal of Software vol 23no 3 pp 712ndash723 2012

[13] Y Li D E Quevedo S Dey and L Shi ldquoA game-theoreticapproach to fake-acknowledgment attack on cyber-physicalsystemsrdquo IEEE Transactions on Signal and Information Pro-cessing over Networks vol 3 no 1 pp 1ndash11 2017

[14] H Zhang L Jiang S Huang J Wang and Y Zhang ldquoAttack-defense differential game model for network defense strategyselectionrdquo IEEE Access vol 7 pp 50618ndash50629 2019

[15] A Afrand and S K Das ldquoPreventing DoS attacks in wirelesssensor networks a repeated game theory approachrdquo InternetJournal of Cyber Security vol 5 no 2 pp 145ndash153 2007

[16] J Wei B Fang and Z Tian ldquoResearch on defense strategiesselection based on attack-defense stochastic game modelrdquoChinese Journal of Computer Research and Developmentvol 47 no 10 pp 1714ndash1723 2010

[17] C Wang X Cheng and Y Zhu ldquoA Markov game model ofcomputer network operationrdquo Chinese Systems Engineering-gteory and Practice vol 34 no 9 pp 2402ndash2410 2014

[18] J Hofbauer and K Sigmund ldquoEvolutionary game dynamicsrdquoBulletin of the American Mathematical Society vol 40 no 4pp 479ndash519 2011

[19] Y Hayel and Q Zhu ldquoEpidemic protection over heteroge-neous networks using evolutionary Poisson gamesrdquo IEEETransactions on Information Forensics and Security vol 12no 8 pp 1786ndash1800 2017

[20] J Huang and H Zhang ldquoImproving replicator dynamicevolutionary game model for selecting optimal defensestrategiesrdquo Chinese Journal on Communications vol 38 no 1pp 170ndash182 2018

[21] H Hu Y Liu H Zhang and Y Zhang ldquoSecurity metricmethods for network multistep attacks using AMC and big

data correlation analysisrdquo Security and CommunicationNetworks vol 2018 Article ID 5787102 14 pages 2018

[22] H Hu Y Liu Y Yang H Zhang and Y Zhang ldquoNew in-sights into approaches to evaluating intention and path fornetwork multistep attacksrdquo Mathematical Problems in En-gineering vol 2018 Article ID 4278632 13 pages 2018

[23] H Hu H Zhang Y Liu and Y Wang ldquoQuantitative methodfor network security situation based on attack predictionrdquoSecurity and Communication Networks vol 2017 Article ID3407642 19 pages 2017

[24] M Zimmer and S Doncieux ldquoBootstrapping Q-learning forrobotics from neuro-evolution resultsrdquo IEEE Transactions onCognitive and Developmental Systems vol 10 no 1pp 102ndash119 2018

[25] R S Sutton and A G Barto ldquoReinforcement learning anintroduction Bradford bookrdquo Machine Learning vol 16no 1 pp 285-286 2005

[26] H Zhang J Wang D Yu J Han and T Li ldquoActive defensestrategy selection based on static Bayesian gamerdquo in Pro-ceedings of IET International Conference on CyberspaceTechnology pp 1ndash7 London UK October 2016

16 Security and Communication Networks

International Journal of

AerospaceEngineeringHindawiwwwhindawicom Volume 2018

RoboticsJournal of

Hindawiwwwhindawicom Volume 2018

Hindawiwwwhindawicom Volume 2018

Active and Passive Electronic Components

VLSI Design

Hindawiwwwhindawicom Volume 2018

Hindawiwwwhindawicom Volume 2018

Shock and Vibration

Hindawiwwwhindawicom Volume 2018

Civil EngineeringAdvances in

Acoustics and VibrationAdvances in

Hindawiwwwhindawicom Volume 2018

Hindawiwwwhindawicom Volume 2018

Electrical and Computer Engineering

Journal of

Advances inOptoElectronics

Hindawiwwwhindawicom

Volume 2018

Hindawi Publishing Corporation httpwwwhindawicom Volume 2013Hindawiwwwhindawicom

The Scientific World Journal

Volume 2018

Control Scienceand Engineering

Journal of

Hindawiwwwhindawicom Volume 2018

Hindawiwwwhindawicom

Journal ofEngineeringVolume 2018

SensorsJournal of

Hindawiwwwhindawicom Volume 2018

International Journal of

RotatingMachinery

Hindawiwwwhindawicom Volume 2018

Modelling ampSimulationin EngineeringHindawiwwwhindawicom Volume 2018

Hindawiwwwhindawicom Volume 2018

Chemical EngineeringInternational Journal of Antennas and

Propagation

International Journal of

Hindawiwwwhindawicom Volume 2018

Hindawiwwwhindawicom Volume 2018

Navigation and Observation

International Journal of

Hindawi

wwwhindawicom Volume 2018

Advances in

Multimedia

Submit your manuscripts atwwwhindawicom

Page 7: Research Article - Hindawi Publishing Corporationdownloads.hindawi.com/journals/scn/2019/3038586.pdf · Afrand and Das [15] established a repeated game model between the intrusion

with environment Knowledge is expressed by payoff Qd andlearned by updating Qd Qd is

Qd(s d) Qd(s d) + α Rd s d sprime( 1113857 + cmaxdprime

Qd sprime dprime( 1113857 minus Qd(s d)1113890 1113891

(1)

where α is payoff learning rate and c is the discount factore strategy of Q-learning is πd(s) argmaxdQd(s d)

412 PHC Algorithm e Policy Hill-Climbing algorithm[25] is a simple and practical gradient descent learning al-gorithm suitable for hybrid strategies which is an im-provement of Q-learning e state-action gain function Qd

of PHC is the same as Q-learning but the policy updateapproach of Q-learning is no longer followed but the hybridstrategy πd(sk) is updated by executing the hill-climbingalgorithm as shown in equations (2)ndash(4) In the formula thestrategy learning rate is

πd sk di( 1113857 πd sk di( 1113857 + Δskdi (2)

where

Δskdi

minus δskdi di ne argmax

dQd sk d( 1113857

1113944djnedi

δskdj others

⎧⎪⎪⎪⎨

⎪⎪⎪⎩

(3)

δskdi min πd sk di( 1113857

δDk minus 1

11138681113868111386811138681113868111386811138681113868

1113888 1113889 (4)

413 WoLF-PHC Algorithm WoLF-PHC algorithm is animprovement of PHC algorithm By introducing WoLFmechanism the defender has two different strategy learningrates low strategy learning rate δw when winning and highstrategy learning rate δl when losing as shown in formula (5)e two learning rates enable defenders to adapt quickly toattackersrsquo strategies when they perform worse than expectedand to learn cautiously when they perform better than expectede most important thing is the introduction of WoLFmechanism which guarantees the convergence of the algorithm[9] WoLF-PHC algorithm uses average strategy as the criterionof success and failure as shown in formulae (6) and (7)

δ

δw 1113944disinDk

πd skd( 1113857Qd skd( 1113857gt 1113944disinDk

πd skd( 1113857Qd skd( 1113857

δl others

⎧⎪⎪⎪⎨

⎪⎪⎪⎩

(5)

πd(sd) πd(sd) +1

C(s)πd(sd) minus πd(sd)( 1113857

(6)

C(s) C(s) + 1 (7)

42 Defense Decision-Making Algorithm Based on ImprovedWoLF-PHC e decision-making process of our approachis shown in Figure 5 which consists of five steps It receivestwo types of input data attack evidence and abnormal

Input AD minus SGM α δ λ and c

Output Defense action d(1) initialize AD minus SGM C(s) 0 e(s d) 0lowast Network state and attack-defense actions are extracted by Algorithm 1 lowast(2) slowast get(E) lowast Getting the current network state from Network E lowast(3) repeat(4) dlowast πd(slowast) lowast Select defense action lowast(5) Output dlowast lowast Feedback defense actions to defenders lowast(6) sprime get(E) lowast Get the status after the action dlowast is executed lowast(7) ρlowast Rd(slowast dlowast sprime) + cmaxdprimeQd(sprime dprime) minus Qd(slowast dlowast)

(8) ρg Rd(slowast dlowast sprime) + cmaxdprimeQd(sprime dprime) minus maxdQd(slowast d)

(9) for each state-action pair (s d) except (slowast dlowast) do(10) e(s d) cλe(s d)

(11) Qd(s d) Qd(s d) + αρge(s d)

(12) end forlowast Update noncurrent eligibility trace (slowast dlowast) and values Qd lowast(13) Qd(slowast dlowast) Qd(slowast dlowast) + αρlowastlowast Update Qd of (slowast dlowast) lowast(14) e(slowast dlowast) cλe(slowast dlowast) + 1 lowast Update track of (slowast dlowast)lowast(15) C(slowast) C(slowast) + 1(16) Updating average strategy based on formula (6)(17) Selecting the learning rate of strategies based on formula (5)(18) δslowastd min(πd(slowast d) δ(D(slowast) minus 1))foralld isin D(slowast)

(19) Δslowastd minus δslowastd dne argmaxdi

Qd(slowast di)

1113936djnedlowastδslowastdjOthers1113896

(20) πd(slowast d) πd(slowast d) + Δslowastdforalld isin D(slowast) lowast Update defense strategy lowast(21) slowast sprime(22) end repeat

ALGORITHM 2 Defense decision-making algorithm

Security and Communication Networks 7

evidence All these pieces of evidence come from real-timeintrusion detection systems After decision making theoptimal security strategy is determined against detectedintrusions

In order to improve the learning speed of WoLF-PHCalgorithm and reduce the dependence of the algorithm onthe amount of data the eligibility trace is introduced toimprove WoLF-PHC e eligibility trace can track specificstate-action trajectories of recent visits and then assigncurrent returns to the state-action of recent visits WoLF-PHC algorithm is an extension of Q-learning algorithm Atpresent there are many algorithms combining Q-learningwith eligibility trace is paper improves WoLF-PHC byusing the typical algorithm [10] e qualification trace ofeach state-action is defined as e(s a) Suppose slowast is thecurrent network state slowast and the eligibility trace is updatedin the way shown in formula (8) Among them the traceattenuation factor is λ

e(s d) cλe(s d) sne slowast

cλe(s d) + 1 s slowast1113896 (8)

WoLF-PHC algorithm is an extension of Q-learningalgorithm which belongs to off-policy algorithm It usesgreedy policy when evaluating defense actions for eachnetwork state and occasionally introduces nongreedy policywhen choosing to perform defense actions in order to learnIn order to maintain the off-policy characteristics of WoLF-PHC algorithm the state-action values are updated byformulae (9)ndash(12) in which the defense actions dlowast are

selected for execution slowast because only the recently visitedstatus-action pairs will have significantly more eligibilitytrace than 0 while most other status-action pairs will havealmost none eligibility trace In order to reduce the memoryand running time consumption caused by eligibility traceonly the latest status-action pair eligibility trace can be savedand updated in practical application

Qd slowast dlowast

( 1113857 Qd slowast dlowast

( 1113857 + αρlowast (9)

Qd(s d) Qd(s d) + αρge(s d) (10)

ρlowast Rd slowast d sprime( 1113857 + cmax

dprimeQd sprime dprime( 1113857 minus Qd s

lowast dlowast

( 1113857

(11)

ρg Rd slowast dlowast sprime( 1113857 + cmax

dprimeQd sprime dprime( 1113857 minus max

dQd slowast d( 1113857

(12)

In order to achieve better results the defense decision-making approach based on WoLF-PHC needs to set fourparameters α δ λ and c reasonably (1) e range of thepayoff learning rate is 0lt αlt 1e bigger the representativeα is the more important the cumulative reward is e fasterthe learning speed is the smaller α is and the better thestability of the algorithm is (2)e range of strategy learningrate 0lt δ lt 1 is obtained According to the experiment wecan get a better result when adopting δlδw 4 (3) eattenuation factor λ of eligibility trace is in the range of

Describe andanalyze cyberattack-defenseconfrontation

Model stochastic game

for attack-defense

Extract currentsecurity state andcandidate attack-

defense action

Evaluatepayoff

hellip

helliphellip

How to defend

Decision making

Figure 5 e process of our decision-making approach

Rt+1reward Rt

St+1state St

Ambient

Agent

Figure 4 Q-learning learning mechanism

8 Security and Communication Networks

0lt λlt 1 which is responsible for the credit allocation ofstatus-action It can be regarded as a time scale e greaterthe credit allocated λ to historical status-action the greaterthe credit allocated to historical status-action (4) e rangeof the discount factor 0lt clt 1 represents the defenderrsquospreference for immediate return and future return When c

approaches 0 it means that future returns are irrelevant andimmediate returns are more important When c approaches1 it means immediate returns are irrelevant and futurereturns are more important

Agent in WoLF-PHC is the defender in the stochasticgame model of attack-defense AD minus SGM the game state inagentrsquos state corresponds to AD minus SGM the defense actionin agentrsquos behavior corresponds to AD minus SGM the imme-diate return in agentrsquos immediate return corresponds toAD minus SGM and the defense strategy in agentrsquos strategycorresponds to AD minus SGM

On the basis of the above a specific defense decision-making approach as shown in Algorithm 2 is given e firststep of the algorithm is to initialize the stochastic gamemodel AD minus SGM of attack-defense and the related pa-rameters e network state and attack-defense actions areextracted by Algorithm 1 e second step is to detect thecurrent network state by the defender Steps 3ndash22 are tomake defense decisions and learn online Steps 4-5 are toselect defense actions according to the current strategy steps6ndash14 are to update the benefits by using eligibility traces andsteps 15ndash21 are the new payoffs using mountain climbingalgorithm to update defense strategy

e spatial complexity of the Algorithm 2 mainlyconcentrates on the storage of pairs Rd(s d sprime) such ase(s d) πd(s d) πd(s d) and Qd(s d) e number of statesis |S| |D| and |A| are the numbers of measures taken by thedefender and attacker in each state respectively ecomputational complexity of the proposed algorithm isO(4 middot |S| middot |D| + |S|2 middot |D|) Compared with the recentmethod using evolutionary game model with complexityO(|S|2 middot (|A| + |D|)3) [14] we greatly reduce the computa-tional complexity and increase the practicability of the al-gorithm since the proposed algorithm does not need to solvethe game equilibrium

5 Experimental Analysis

51 Experiment Setup In order to verify the effectiveness ofthis approach a typical enterprise network as shown inFigure 6 is built for experiment Attacks and defenses occuron the intranet with attackers coming from the extranet Asa defender network administrator is responsible for thesecurity of intranet Due to the setting of Firewall 1 andFirewall 2 legal users of the external network can only accessthe web server which can access the database server FTPserver and e-mail server

e simulation experiment was carried out on a PC withIntel Core i7-6300HQ 340GHz 32GB RAM memoryand Windows 10 64 bit operating system e Python 365emulator was installed and the vulnerability information inthe experimental network was scanned by Nessus toolkit asshown in Table 1 e network topology information was

collected by ArcGis toolkit We used the Python language towrite the project code During the experiment we set upabout 25000 times of attack-defense strategy studies eexperimental results were analyzed and displayed usingMatlab2018a as described in Section 53

Referring to MIT Lincoln Lab attack-defense behaviordatabase attack-defense templates are constructed Attack-defense maps are divided into attack maps and defense mapsby using attacker host A web server W database server DFTP server F and e-mail server E In order to facilitatedisplay and description attack-defense maps are dividedinto attack maps and defense maps as shown in Figures 7and 8 respectively e meaning of defense action in thedefense diagram is shown in Table 2

52 Construction of the Experiment Scenario AD-SGM(1) N (attacker defender) are players participating in

the game representing cyber attackers and defendersrespectively

Vulnerabilityscanning system

VDS

Internet

DMZ

FTP server

Firewall 1Trusted zone

Firewall 2Database

serverE-mailserver

Web server

IDS

Attacker

Extranet

Intranet

Figure 6 Experimental network topology

Table 1 Network vulnerability information

Attackidentifier Host CVE Target

privilegetid1 Web server CVE-2015-1635 usertid2 Web server CVE-2017-7269 roottid3 Web server CVE-2014-8517 roottid4 FTP server CVE-2014-3556 roottid5 E-mail server CVE-2014-4877 roottid6 Database server CVE-2013-4730 usertid7 Database server CVE-2016-6662 root

S0(A root)

S1(W user)

S2(W root)S5(D user)

S6(D root)

tid1

tid2

tid3

tid4

tid5

tid6

tid7

tid8

tid9

S3(F root)

S4(E root)

Figure 7 Attack graph

Security and Communication Networks 9

(2) e state set of stochastic game is S (s0 s1

s2 s3 s4 s5 s6) which consists of network state and isextracted from the nodes shown in Figures 7 and 8

(3) e action set of the defender is D (D0 D1

D2 D3 D4 D5 D6) D0 NULL D1 d11113864 d2D2

d3 d41113864 1113865D3 d1 d5 d61113864 1113865D4 d1 d5 d61113864 1113865D5 d11113864

d2 d7D6 d3 d41113864 1113865 and the edges are extractedfrom Figure 8

(4) Quantitative results of immediate returns Rd(si

d sj) of defenders [16 26] are

Rd s0NULL s0( 1113857 Rd s0NULL s1( 1113857 Rd s0NULL s2( 1113857( 1113857 (0 minus 40 minus 59)

Rd s1 d1 s0( 1113857 Rd s1 d1 s1( 1113857 Rd s1 d1 s2( 1113857 Rd s1 d2 s0( 1113857 Rd s1 d2 s1( 1113857 Rd s1 d2 s2( 1113857( 1113857

(40 0 minus 29 5 minus 15 minus 32)

( Rd s2 d3 s0( 1113857 Rd s2 d3 s1( 1113857 Rd s2 d3 s2( 1113857 Rd s2 d3 s3( 1113857 Rd s2 d3 s4( 1113857 Rd s2 d3 s5( 1113857

Rd s2 d4 s0( 1113857 Rd s2d4 s1( 1113857 Rd s2 d4 s2( 1113857 Rd s2 d4 s3( 1113857 Rd s2 d4 s4( 1113857 Rd s2 d4 s5( 11138571113857

(24 9 minus 15 minus 55 minus 49 minus 65 19 5 minus 21 minus 61 minus 72 minus 68)

( Rd s3 d1 s2( 1113857 Rd s3 d1 s3( 1113857 Rd s3 d1 s6( 1113857 Rd s3 d5 s2( 1113857 Rd s3 d5 s3( 1113857 Rd s3 d5 s6( 1113857

Rd s3 d6 s2( 1113857 Rd s3 d6 s3( 1113857 Rd s3 d6 s6( 11138571113857

(21 minus 16 minus 72 15 minus 23 minus 81 minus 21 minus 36 minus 81)

( Rd s4 d1 s2( 1113857 Rd s4 d1 s4( 1113857 Rd s4 d1 s6( 1113857 Rd s4 d5 s2( 1113857 Rd s4 d5 s4( 1113857 Rd s4 d5 s6( 1113857

Rd s4 d6 s2( 1113857 Rd s4 d6 s4( 1113857 Rd s4 d6 s6( 11138571113857

(26 0 minus 62 11 minus 23 minus 75 9 minus 25 minus 87)

( Rd s5 d1 s2( 1113857 Rd s5 d1 s5( 1113857 Rd s5 d1 s6( 1113857 Rd s5 d2 s2( 1113857 Rd s5 d2 s5( 1113857 Rd s5 d2 s6( 1113857

Rd s5 d7 s2( 1113857 Rd s5 d7 s5( 1113857 Rd s5 d7 s6( 11138571113857

(29 0 minus 63 11 minus 21 minus 76 2 minus 27 minus 88)

( Rd s6 d3 s3( 1113857 Rd s6 d3 s3( 1113857 Rd s6 d3 s5( 1113857 Rd s6 d3 s6( 1113857 Rd s6 d4 s3( 1113857

Rd s6 d4 s4( 1113857 Rd s6 d4 s5( 1113857 Rd s6 d4 s6( 11138571113857

(minus 23 minus 21 minus 19 minus 42 minus 28 minus 31 minus 24 minus 49)

(13)

Table 2 Defense action description

Atomic defense action d1 d2 d3 d4 d5 d6 d7

Renew root data Limit SYNICMPpackets Install Oracle patches Reinstall listener program Uninstall delete Trojan Limit access toMDSYS Restart database server Delete suspicious account Add physical resource Repair database Limit packets from ports

S0(A root)

S1(W user)

S2(W root)S5(D user)

S6(D root)

d3

d3

d4

d1d2d7

S3(F root)

S4(E root)d1d5d6

d1d5d6d4

d3

d1d2

Figure 8 Defense graph

10 Security and Communication Networks

(5) In order to detect the learning performance of theAlgorithm 2 more fully the defenderrsquos state-actionpayoffQd(si d) is initialized with a unified 0 withoutintroducing additional prior knowledge

(6) Defenderrsquos defense strategy adopts average strategyto initialize that is πd(sk d1) πd(sk d2) middot middot middot

πd(sk dm) 1113936disinD(sk)πd(sk d) 1forallsk isin S where noadditional prior knowledge is introduced

53 Testing and Analysis e experiment in this section hasthree purposes e first is to test the influence of differentparameter settings on the proposed Algorithm 2 so as to findout the experimental parameters suitable for this scenarioe second is to compare this approach with the existingtypical approaches to verify the advancement of this ap-proach e third is to test the effectiveness of WoLF-PHCalgorithm improvement based on eligibility trace

From Figures 7 and 8 we can see that the state of attack-defense strategy selection is the most complex and repre-sentative erefore the performance of the algorithm isanalyzed by the experimental state selection and the othernetwork state analysis approaches are the same

531 Parameter Test and Analysis Different parameters willaffect the speed and effect of learning At present there is norelevant theory to determine the specific parameters InSection 4 the relevant parameters are preliminarily ana-lyzed On this basis the different parameter settings arefurther tested to find the parameter settings suitable for thisattack-defense scenario Six different parameter settingswere tested Specific parameter settings are shown in Table 3In the experiment the attackerrsquos initial strategy is randomstrategy and the learning mechanism is the same as theapproach in this paper

e probability of the defenderrsquos choice of defense ac-tions and sums in state is shown in Figure 9 e learningspeed and convergence of the algorithm under differentparameter settings can be observed from Figure 9 whichshows that the learning speed of settings 1 3 and 6 is fasterand the best strategy can be obtained after learning less than1500 times under the three settings but convergence of 3and 6 is poor Although the best strategy can be learned bysettings 3 and 6 there will be oscillation afterwards and thestability of setting 1 is not suitable

Defense payoff can represent the degree of optimizationof the strategy In order to ensure that the payoff value doesnot reflect only one defense result the average of 1000defense gains is taken and the change of the average payoffper 1000 defense gains is shown in Figure 10 As can be seenfrom Figure 10 the benefits of Set 3 are significantly lowerthan those of other settings but the advantages and dis-advantages of other settings are difficult to distinguish Inorder to display more intuitively the average value of 25000defense gains calculated under different settings in Figure 10is shown in Figure 11 From Figure 11 we can see that theaverage value of settings 1 and 5 is higher For furthercomparison the standard deviation of settings 1 and 5 is

calculated one step on the basis of the average value to reflectthe discreteness of the gains As shown in Figure 12 thestandard deviations of setting 1 and setting 6 are smallMoreover the result of setting 1 is smaller than setting 6

In conclusion setting 1 of the six sets of parameters is themost suitable for this scenario Since setting 1 has achievedan ideal effect and can meet the experimental requirementsit is no longer necessary to further optimize the parameters

532 Comparisons In this section stochastic game [16] andevolutionary game [20] are selected to conduct comparativeexperiments with this approach According to the differenceof attackerrsquos learning ability this section designs two groupsof comparative experiments In the first group the attackerrsquoslearning ability is weak and will not make adjustments to theattack-defense results In the second group the attackerrsquoslearning ability is strong and adopts the same learningmechanism as the approach in this paper In both groupsthe initial strategies of attackers were random strategies

In the first group of experiments the defense strategy ofthis approach is as shown in Figure 9(a) e defense strategiescalculated by the approach [16] are πd(s2 d3) 07πd(s2 d4) 03 e defense strategies of [20] are evolu-tionarily stable and balanced And the defense strategies of [20]are πd(s2 d3) 08 πd(s2 d4) 02 Its average earnings per1000 times change as shown in Figure 13

From the results of the strategies and benefits of the threeapproaches we can see that the approach in this paper canlearn from the attackerrsquos strategies and adjust to the optimalstrategy so the approach in this paper can obtain the highestbenefits Wei et al [16] adopted a fixed strategy whenconfronting any attacker When the attacker is constrainedby bounded rationality and does not adopt Nash equilibriumstrategy the benefit of this approach is low Although thelearning factors of both attackers and defenders are takeninto account in [20] the parameters required in the modelare difficult to quantify accurately which results in thedeviation between the final results and the actual results sothe benefit of the approach is still lower than that of theapproach in this paper

In the second group of experiments the results of[16 20] are still as πd(s2 d3) 07 πd(s2 d4) 03 andπd(s2 d3) 08 πd(s2 d4) 02 e decision making ofthis approach is shown in Figure 14 After about 1800 timesof learning the approach achieves stability and converges tothe same defense strategy as that of [16] As can be seen fromFigure 15 the payoff of [20] is lower than that of other twoapproaches e average payoff of this approach in the first2000 defenses is higher than that of [26] and then it is almostthe same as that of [26] Combining Figures 14 and 15 wecan see that the learning attacker cannot get Nash equi-librium strategy at the initial stage and the approach in thispaper is better than that in [26] When the attacker learns toget Nash equilibrium strategy the approach in this paper canconverge to Nash equilibrium strategy At this time theperformance of this approach is the same as that in [26]

In conclusion when facing the attackers with weaklearning ability the approach in this paper is superior to that

Security and Communication Networks 11

Table 3 Dierent parameter settings

Set α δl δw λ c

1 001 0004 0001 001 0012 01 0004 0001 001 0013 001 0004 0001 001 014 001 0004 0001 01 0015 001 004 001 001 0016 001 0008 0001 001 001

0 2000 4000 6000s2 defense number

s2 d

efen

se re

turn

10

08

06

04

02

00

d3d4

(a)

s2 d

efen

se re

turn

10

08

06

04

02

000 2000 4000 6000

s2 defense number

d3d4

(b)

s2 d

efen

se re

turn

10

08

06

04

02

000 2000 4000 6000

s2 defense number

d3d4

(c)

s2 d

efen

se re

turn

10

08

06

04

02

000 2000 4000 6000

s2 defense number

d3d4

(d)

s2 d

efen

se re

turn

10

08

06

04

02

000 2000 4000 6000

s2 defense number

d3d4

(e)

s2 d

efen

se re

turn

10

08

06

04

02

000 2000 4000 6000

s2 defense number

d3d4

(f )

Figure 9 Defense decision making under dierent parameter settings (a) Defense decision under setting 1 (b) Defense decision undersetting 2 (c) Defense decision under setting 3 (d) Defense decision under setting 4 (e) Defense decision under setting 5 (f ) Defensedecision under setting 6

5 10 15 20 25s2 defense number (times1000)

s2 d

efen

se re

turn

ndash10

ndash15

ndash20

ndash25

ndash30

Set 1Set 2Set 3

Set 4Set 5Set 6

Figure 10 Defense payo under dierent parameter settings

12 Security and Communication Networks

in [16 20] When facing an attacker with strong learningability if the attacker has not obtained Nash equilibriumthrough learning this approach is still better than [16 20] Ifthe attacker obtains Nash equilibrium through learning thispaper can also obtain the same Nash equilibrium strategy as[26] and obtain its phase e same eect is superior to thatin [20]

533 Test Comparison with and without Eligibility Traceis section tests the actual eect of the eligibility traces onthe Algorithm 2 e eect of eligibility traces on strategyselection is shown in Figure 16 from which we can see thatthe learning speed of the algorithm is faster when thequalied trace is available After 1000 times of learning the

algorithm can converge to the optimal strategy When thequalied trace is not available the algorithm needs about2500 times of learning to converge

Average earnings per 1000 times change as shown inFigure 17 from which we can see that the benets of thealgorithm are almost the same when there is or not anyqualied trace after convergence From Figure 17 we can seethat 3000 defenses before convergence have higher returnsfrom qualied trails than those from unqualied trails Inorder to further verify this the average of the rst 3000defense gains under qualied trails and unqualied trails iscounted 10 times each respectively e results are shown inFigure 18 which further proves that in the preconvergencedefense phase qualied traces are better than unqualiedtraces

e addition of eligibility trace accelerates the learningspeed but also brings additional memory and computingoverhead In the experiment only 10 state-action pairs thatwere recently accessed were saved which eectively reduced

Set 1 Set 2 Set 3 Set 4 Set 5 Set 6

ndash80

ndash85

ndash90

Figure 11 Average return under dierent parameter settings

Set 1 Set 2 Set 3 Set 4 Set 5 Set 6

80

75

Figure 12 Standard deviation of defense payo under dierentparameter settings

ndash10

ndash20

ndash30

ndash40

s2 d

efen

se re

turn

5 10 15 20 25

Our methodReference [12]Reference [16]

s2 defense number (times1000)

Figure 13 Comparison of defense payo change

0 2000 4000 6000s2 defense number

s2 d

efen

se st

rate

gy

08

06

04

02

00

d3d4

Figure 14 Defense decision making of our approach

Our methodReference [12]Reference [16]

5 10 15 20 25s2 defense number (times1000)

s2 d

efen

se re

turn

ndash10

ndash20

ndash30

ndash40

Figure 15 Comparison of defense payo change

Security and Communication Networks 13

0 2000 4000 6000s2 defense number

s2 d

efen

se st

rate

gy

100

075

050

025

000

Eligibility trace d3Eligibility trace d4

No eligibility trace d3No eligibility trace d4

Figure 16 Comparison of defense decision making

Eligibility traceNo eligibility trace

0 5 10 15 20 25s2 defense number (times1000)

s2 d

efen

se re

turn

ndash10

ndash20

ndash15

ndash30

ndash25

Figure 17 Comparison of defense payo change

Eligibility traceNo eligibility trace

1 2 3 4 5 6 7 8 9 10s2 defense number (times3000)

s2 av

erag

e def

ense

retu

rn

ndash12

ndash14

ndash16

ndash18

ndash20

Figure 18 Average payo comparison of the rst 3000 defenses

14 Security and Communication Networks

the increase of memory consumption In order to test thecomputational cost of qualified trail the time of 100000defense decisions made by the algorithm was counted for 20times with and without qualified trail e average of 20times was 951s for qualified trail and 374 s for unqualifiedtrail Although the introduction of eligibility traces willincrease the decision-making time by nearly 25 times thetime required for 100000 decisions after the introduction ofeligibility traces is still only 951 s which can still meet thereal-time requirements

In summary the introduction of eligibility trace at theexpense of a small amount of memory and computingoverhead can effectively increase the learning speed of thealgorithm and improve the defense gains

54 Comprehensive Comparisons is approach is com-pared with some typical research results as shown in Table 4[3 12 14 16] is based on the assumption of complete ra-tionality e equilibrium strategy obtained by it is difficultto appear in practice and has a low guiding effect on actualdefense decision making [20] and this paper have morepracticability on the premise of bounded rationality hy-pothesis but [20] is based on the theory of biologicalevolution and mainly studies population evolution e coreof game analysis is not the optimal strategy choice of playersbut the strategy adjustment process trend and stability ofgroup members composed of bounded rational players andthe stability here refers to group members is approach isnot suitable for guiding individual real-time decisionmaking because the proportion of specific strategies isunchanged and not the strategy of a player On the contrarythe defender of the proposed approach adopts reinforcementlearning mechanism which is based on systematic feedbackto learn in the confrontation with the attacker and is moresuitable for the study of individual strategies

6 Conclusions and Future Works

In this paper cyber attack-defense confrontation is ab-stracted as a stochastic game problem under the restrictionof limited rationality A host-centered attack-defense graphmodel is proposed to extract network state and attack-de-fense action and an algorithm to generate attack-defensegraph is designed to effectively compress the game statespace e WoLF-PHC-based defense decision approach isproposed to overcome the problem which enables defendersunder bounded rationality to make optimal choices whenfacing different attackers e lattice improves the WoLF-

PHC algorithm speeds up the defenderrsquos learning speed andreduces the algorithmrsquos dependence on data is approachnot only satisfies the constraints of bounded rationality butalso does not require the defender to know too much in-formation about the attacker It is a more practical defensedecision-making approach

e future work is to further optimize the winning andlosing criteria of WoLF-PHC algorithm for specific attack-defense scenarios in order to speed up defense learning andincrease defense gains

Data Availability

e data that support the findings of this study are notpublicly available due to restrictions as the data containsensitive information about a real-world enterprise networkAccess to the dataset is restricted by the original ownerPeople who want to access the data should send a request tothe corresponding author who will apply for permission ofsharing the data from the original owner

Conflicts of Interest

e authors declare that they have no conflicts of interestregarding the publication of this paper

Acknowledgments

is work was supported by the National High TechnologyResearch and Development Program of China (863 Pro-gram) (2014AA7116082 and 2015AA7116040) and NationalNatural Science Foundation of China (61902427) In par-ticular we thank Junnan Yang and Hongqi Zhang of ourresearch group for their previous work on stochastic gameWe sincerely thank them for their help in publishing thispaper

References

[1] Z Tian W Shi Y Wang et al ldquoReal time lateral movementdetection based on evidence reasoning network for edgecomputing environmentrdquo IEEE Transactions on IndustrialInformatics vol 15 no 7 pp 4285ndash4294 2019

[2] Y Wang J Yu W Qu H Shen and X Cheng ldquoEvolutionarygame model and analysis approaches for network groupbehaviorrdquo Chinese Journal of Computers vol 38 no 2pp 282ndash300 2015

[3] J Liu H Zhang and Y Liu ldquoResearch on optimal selection ofmoving target defense policy based on dynamic game withincomplete informationrdquo Acta Electronica Sinica vol 46no 1 pp 82ndash29 2018

Table 4 Comprehensive comparison of existing approaches

Ref Game type Model assumption Learning mechanism Game process Applicable object Practicability[3] Stochastic game Rationality mdash Multistage Personal Bad[12] Static game Rationality mdash Single-stage Personal Bad[14] Dynamic game Rationality mdash Multistage Personal Bad[16] Stochastic game Rationality mdash Multistage Personal Bad[20] Evolutionary game Bounded rationality Biological evolution mdash Group GoodOur approach Stochastic game Bounded rationality Reinforcement learning Multistage Personal Good

Security and Communication Networks 15

[4] Z Tian X Gao S Su J Qiu X Du and M GuizanildquoEvaluating reputation management schemes of internet ofvehicles based on evolutionary game theoryrdquo IEEE Trans-actions on Vehicular Technology vol 68 no 6 pp 5971ndash59802019

[5] Q Zhu and S Rass ldquoGame theory meets network securitya tutorialrdquo in Proceedings of the ACM SIGSAC Conference onComputer and Communications Security pp 2163ndash2165Toronto Canada October 2018

[6] X Li C Zhou Y C Tian and Y Qin ldquoA dynamic decision-making approach for intrusion response in industrial controlsystemsrdquo IEEE Transactions on Industrial Informatics vol 15no 5 pp 2544ndash2554 2019

[7] H Hu Y Liu H Zhang and R Pan ldquoOptimal networkdefense strategy selection based on incomplete informationevolutionary gamerdquo IEEE Access vol 6 pp 29806ndash298212018

[8] J Li G Kendall and R John ldquoComputing Nash equilibriaand evolutionarily stable states of evolutionary gamesrdquo IEEETransactions on Evolutionary Computation vol 20 no 3pp 460ndash469 2016

[9] M Bowling and M Veloso Multiagent Learning Usinga Variable Learning Rate Elsevier Science Publishers LtdAmsterdam Netherlands 2002

[10] J Harb and D Precup ldquoInvestigating recurrence and eligi-bility traces in deep Q-networksrdquo httpsarxivorgabs170405495 2017

[11] C T Do N H Tran and C Hong ldquoGame theory for cybersecurity and privacyrdquo ACM Computing Surveys vol 50 no 2p 30 2017

[12] Y-L Liu D-G Feng L-HWu and Y-F Lian ldquoPerformanceevaluation of worm attack and defense strategies based onstatic Bayesian gamerdquo Chinese Journal of Software vol 23no 3 pp 712ndash723 2012

[13] Y Li D E Quevedo S Dey and L Shi ldquoA game-theoreticapproach to fake-acknowledgment attack on cyber-physicalsystemsrdquo IEEE Transactions on Signal and Information Pro-cessing over Networks vol 3 no 1 pp 1ndash11 2017

[14] H Zhang L Jiang S Huang J Wang and Y Zhang ldquoAttack-defense differential game model for network defense strategyselectionrdquo IEEE Access vol 7 pp 50618ndash50629 2019

[15] A Afrand and S K Das ldquoPreventing DoS attacks in wirelesssensor networks a repeated game theory approachrdquo InternetJournal of Cyber Security vol 5 no 2 pp 145ndash153 2007

[16] J Wei B Fang and Z Tian ldquoResearch on defense strategiesselection based on attack-defense stochastic game modelrdquoChinese Journal of Computer Research and Developmentvol 47 no 10 pp 1714ndash1723 2010

[17] C Wang X Cheng and Y Zhu ldquoA Markov game model ofcomputer network operationrdquo Chinese Systems Engineering-gteory and Practice vol 34 no 9 pp 2402ndash2410 2014

[18] J Hofbauer and K Sigmund ldquoEvolutionary game dynamicsrdquoBulletin of the American Mathematical Society vol 40 no 4pp 479ndash519 2011

[19] Y Hayel and Q Zhu ldquoEpidemic protection over heteroge-neous networks using evolutionary Poisson gamesrdquo IEEETransactions on Information Forensics and Security vol 12no 8 pp 1786ndash1800 2017

[20] J Huang and H Zhang ldquoImproving replicator dynamicevolutionary game model for selecting optimal defensestrategiesrdquo Chinese Journal on Communications vol 38 no 1pp 170ndash182 2018

[21] H Hu Y Liu H Zhang and Y Zhang ldquoSecurity metricmethods for network multistep attacks using AMC and big

data correlation analysisrdquo Security and CommunicationNetworks vol 2018 Article ID 5787102 14 pages 2018

[22] H Hu Y Liu Y Yang H Zhang and Y Zhang ldquoNew in-sights into approaches to evaluating intention and path fornetwork multistep attacksrdquo Mathematical Problems in En-gineering vol 2018 Article ID 4278632 13 pages 2018

[23] H Hu H Zhang Y Liu and Y Wang ldquoQuantitative methodfor network security situation based on attack predictionrdquoSecurity and Communication Networks vol 2017 Article ID3407642 19 pages 2017

[24] M Zimmer and S Doncieux ldquoBootstrapping Q-learning forrobotics from neuro-evolution resultsrdquo IEEE Transactions onCognitive and Developmental Systems vol 10 no 1pp 102ndash119 2018

[25] R S Sutton and A G Barto ldquoReinforcement learning anintroduction Bradford bookrdquo Machine Learning vol 16no 1 pp 285-286 2005

[26] H Zhang J Wang D Yu J Han and T Li ldquoActive defensestrategy selection based on static Bayesian gamerdquo in Pro-ceedings of IET International Conference on CyberspaceTechnology pp 1ndash7 London UK October 2016

16 Security and Communication Networks

International Journal of

AerospaceEngineeringHindawiwwwhindawicom Volume 2018

RoboticsJournal of

Hindawiwwwhindawicom Volume 2018

Hindawiwwwhindawicom Volume 2018

Active and Passive Electronic Components

VLSI Design

Hindawiwwwhindawicom Volume 2018

Hindawiwwwhindawicom Volume 2018

Shock and Vibration

Hindawiwwwhindawicom Volume 2018

Civil EngineeringAdvances in

Acoustics and VibrationAdvances in

Hindawiwwwhindawicom Volume 2018

Hindawiwwwhindawicom Volume 2018

Electrical and Computer Engineering

Journal of

Advances inOptoElectronics

Hindawiwwwhindawicom

Volume 2018

Hindawi Publishing Corporation httpwwwhindawicom Volume 2013Hindawiwwwhindawicom

The Scientific World Journal

Volume 2018

Control Scienceand Engineering

Journal of

Hindawiwwwhindawicom Volume 2018

Hindawiwwwhindawicom

Journal ofEngineeringVolume 2018

SensorsJournal of

Hindawiwwwhindawicom Volume 2018

International Journal of

RotatingMachinery

Hindawiwwwhindawicom Volume 2018

Modelling ampSimulationin EngineeringHindawiwwwhindawicom Volume 2018

Hindawiwwwhindawicom Volume 2018

Chemical EngineeringInternational Journal of Antennas and

Propagation

International Journal of

Hindawiwwwhindawicom Volume 2018

Hindawiwwwhindawicom Volume 2018

Navigation and Observation

International Journal of

Hindawi

wwwhindawicom Volume 2018

Advances in

Multimedia

Submit your manuscripts atwwwhindawicom

Page 8: Research Article - Hindawi Publishing Corporationdownloads.hindawi.com/journals/scn/2019/3038586.pdf · Afrand and Das [15] established a repeated game model between the intrusion

evidence All these pieces of evidence come from real-timeintrusion detection systems After decision making theoptimal security strategy is determined against detectedintrusions

In order to improve the learning speed of WoLF-PHCalgorithm and reduce the dependence of the algorithm onthe amount of data the eligibility trace is introduced toimprove WoLF-PHC e eligibility trace can track specificstate-action trajectories of recent visits and then assigncurrent returns to the state-action of recent visits WoLF-PHC algorithm is an extension of Q-learning algorithm Atpresent there are many algorithms combining Q-learningwith eligibility trace is paper improves WoLF-PHC byusing the typical algorithm [10] e qualification trace ofeach state-action is defined as e(s a) Suppose slowast is thecurrent network state slowast and the eligibility trace is updatedin the way shown in formula (8) Among them the traceattenuation factor is λ

e(s d) cλe(s d) sne slowast

cλe(s d) + 1 s slowast1113896 (8)

WoLF-PHC algorithm is an extension of Q-learningalgorithm which belongs to off-policy algorithm It usesgreedy policy when evaluating defense actions for eachnetwork state and occasionally introduces nongreedy policywhen choosing to perform defense actions in order to learnIn order to maintain the off-policy characteristics of WoLF-PHC algorithm the state-action values are updated byformulae (9)ndash(12) in which the defense actions dlowast are

selected for execution slowast because only the recently visitedstatus-action pairs will have significantly more eligibilitytrace than 0 while most other status-action pairs will havealmost none eligibility trace In order to reduce the memoryand running time consumption caused by eligibility traceonly the latest status-action pair eligibility trace can be savedand updated in practical application

Qd slowast dlowast

( 1113857 Qd slowast dlowast

( 1113857 + αρlowast (9)

Qd(s d) Qd(s d) + αρge(s d) (10)

ρlowast Rd slowast d sprime( 1113857 + cmax

dprimeQd sprime dprime( 1113857 minus Qd s

lowast dlowast

( 1113857

(11)

ρg Rd slowast dlowast sprime( 1113857 + cmax

dprimeQd sprime dprime( 1113857 minus max

dQd slowast d( 1113857

(12)

In order to achieve better results the defense decision-making approach based on WoLF-PHC needs to set fourparameters α δ λ and c reasonably (1) e range of thepayoff learning rate is 0lt αlt 1e bigger the representativeα is the more important the cumulative reward is e fasterthe learning speed is the smaller α is and the better thestability of the algorithm is (2)e range of strategy learningrate 0lt δ lt 1 is obtained According to the experiment wecan get a better result when adopting δlδw 4 (3) eattenuation factor λ of eligibility trace is in the range of

Describe andanalyze cyberattack-defenseconfrontation

Model stochastic game

for attack-defense

Extract currentsecurity state andcandidate attack-

defense action

Evaluatepayoff

hellip

helliphellip

How to defend

Decision making

Figure 5 e process of our decision-making approach

Rt+1reward Rt

St+1state St

Ambient

Agent

Figure 4 Q-learning learning mechanism

8 Security and Communication Networks

0lt λlt 1 which is responsible for the credit allocation ofstatus-action It can be regarded as a time scale e greaterthe credit allocated λ to historical status-action the greaterthe credit allocated to historical status-action (4) e rangeof the discount factor 0lt clt 1 represents the defenderrsquospreference for immediate return and future return When c

approaches 0 it means that future returns are irrelevant andimmediate returns are more important When c approaches1 it means immediate returns are irrelevant and futurereturns are more important

Agent in WoLF-PHC is the defender in the stochasticgame model of attack-defense AD minus SGM the game state inagentrsquos state corresponds to AD minus SGM the defense actionin agentrsquos behavior corresponds to AD minus SGM the imme-diate return in agentrsquos immediate return corresponds toAD minus SGM and the defense strategy in agentrsquos strategycorresponds to AD minus SGM

On the basis of the above a specific defense decision-making approach as shown in Algorithm 2 is given e firststep of the algorithm is to initialize the stochastic gamemodel AD minus SGM of attack-defense and the related pa-rameters e network state and attack-defense actions areextracted by Algorithm 1 e second step is to detect thecurrent network state by the defender Steps 3ndash22 are tomake defense decisions and learn online Steps 4-5 are toselect defense actions according to the current strategy steps6ndash14 are to update the benefits by using eligibility traces andsteps 15ndash21 are the new payoffs using mountain climbingalgorithm to update defense strategy

e spatial complexity of the Algorithm 2 mainlyconcentrates on the storage of pairs Rd(s d sprime) such ase(s d) πd(s d) πd(s d) and Qd(s d) e number of statesis |S| |D| and |A| are the numbers of measures taken by thedefender and attacker in each state respectively ecomputational complexity of the proposed algorithm isO(4 middot |S| middot |D| + |S|2 middot |D|) Compared with the recentmethod using evolutionary game model with complexityO(|S|2 middot (|A| + |D|)3) [14] we greatly reduce the computa-tional complexity and increase the practicability of the al-gorithm since the proposed algorithm does not need to solvethe game equilibrium

5 Experimental Analysis

51 Experiment Setup In order to verify the effectiveness ofthis approach a typical enterprise network as shown inFigure 6 is built for experiment Attacks and defenses occuron the intranet with attackers coming from the extranet Asa defender network administrator is responsible for thesecurity of intranet Due to the setting of Firewall 1 andFirewall 2 legal users of the external network can only accessthe web server which can access the database server FTPserver and e-mail server

e simulation experiment was carried out on a PC withIntel Core i7-6300HQ 340GHz 32GB RAM memoryand Windows 10 64 bit operating system e Python 365emulator was installed and the vulnerability information inthe experimental network was scanned by Nessus toolkit asshown in Table 1 e network topology information was

collected by ArcGis toolkit We used the Python language towrite the project code During the experiment we set upabout 25000 times of attack-defense strategy studies eexperimental results were analyzed and displayed usingMatlab2018a as described in Section 53

Referring to MIT Lincoln Lab attack-defense behaviordatabase attack-defense templates are constructed Attack-defense maps are divided into attack maps and defense mapsby using attacker host A web server W database server DFTP server F and e-mail server E In order to facilitatedisplay and description attack-defense maps are dividedinto attack maps and defense maps as shown in Figures 7and 8 respectively e meaning of defense action in thedefense diagram is shown in Table 2

52 Construction of the Experiment Scenario AD-SGM(1) N (attacker defender) are players participating in

the game representing cyber attackers and defendersrespectively

Vulnerabilityscanning system

VDS

Internet

DMZ

FTP server

Firewall 1Trusted zone

Firewall 2Database

serverE-mailserver

Web server

IDS

Attacker

Extranet

Intranet

Figure 6 Experimental network topology

Table 1 Network vulnerability information

Attackidentifier Host CVE Target

privilegetid1 Web server CVE-2015-1635 usertid2 Web server CVE-2017-7269 roottid3 Web server CVE-2014-8517 roottid4 FTP server CVE-2014-3556 roottid5 E-mail server CVE-2014-4877 roottid6 Database server CVE-2013-4730 usertid7 Database server CVE-2016-6662 root

S0(A root)

S1(W user)

S2(W root)S5(D user)

S6(D root)

tid1

tid2

tid3

tid4

tid5

tid6

tid7

tid8

tid9

S3(F root)

S4(E root)

Figure 7 Attack graph

Security and Communication Networks 9

(2) e state set of stochastic game is S (s0 s1

s2 s3 s4 s5 s6) which consists of network state and isextracted from the nodes shown in Figures 7 and 8

(3) e action set of the defender is D (D0 D1

D2 D3 D4 D5 D6) D0 NULL D1 d11113864 d2D2

d3 d41113864 1113865D3 d1 d5 d61113864 1113865D4 d1 d5 d61113864 1113865D5 d11113864

d2 d7D6 d3 d41113864 1113865 and the edges are extractedfrom Figure 8

(4) Quantitative results of immediate returns Rd(si

d sj) of defenders [16 26] are

Rd s0NULL s0( 1113857 Rd s0NULL s1( 1113857 Rd s0NULL s2( 1113857( 1113857 (0 minus 40 minus 59)

Rd s1 d1 s0( 1113857 Rd s1 d1 s1( 1113857 Rd s1 d1 s2( 1113857 Rd s1 d2 s0( 1113857 Rd s1 d2 s1( 1113857 Rd s1 d2 s2( 1113857( 1113857

(40 0 minus 29 5 minus 15 minus 32)

( Rd s2 d3 s0( 1113857 Rd s2 d3 s1( 1113857 Rd s2 d3 s2( 1113857 Rd s2 d3 s3( 1113857 Rd s2 d3 s4( 1113857 Rd s2 d3 s5( 1113857

Rd s2 d4 s0( 1113857 Rd s2d4 s1( 1113857 Rd s2 d4 s2( 1113857 Rd s2 d4 s3( 1113857 Rd s2 d4 s4( 1113857 Rd s2 d4 s5( 11138571113857

(24 9 minus 15 minus 55 minus 49 minus 65 19 5 minus 21 minus 61 minus 72 minus 68)

( Rd s3 d1 s2( 1113857 Rd s3 d1 s3( 1113857 Rd s3 d1 s6( 1113857 Rd s3 d5 s2( 1113857 Rd s3 d5 s3( 1113857 Rd s3 d5 s6( 1113857

Rd s3 d6 s2( 1113857 Rd s3 d6 s3( 1113857 Rd s3 d6 s6( 11138571113857

(21 minus 16 minus 72 15 minus 23 minus 81 minus 21 minus 36 minus 81)

( Rd s4 d1 s2( 1113857 Rd s4 d1 s4( 1113857 Rd s4 d1 s6( 1113857 Rd s4 d5 s2( 1113857 Rd s4 d5 s4( 1113857 Rd s4 d5 s6( 1113857

Rd s4 d6 s2( 1113857 Rd s4 d6 s4( 1113857 Rd s4 d6 s6( 11138571113857

(26 0 minus 62 11 minus 23 minus 75 9 minus 25 minus 87)

( Rd s5 d1 s2( 1113857 Rd s5 d1 s5( 1113857 Rd s5 d1 s6( 1113857 Rd s5 d2 s2( 1113857 Rd s5 d2 s5( 1113857 Rd s5 d2 s6( 1113857

Rd s5 d7 s2( 1113857 Rd s5 d7 s5( 1113857 Rd s5 d7 s6( 11138571113857

(29 0 minus 63 11 minus 21 minus 76 2 minus 27 minus 88)

( Rd s6 d3 s3( 1113857 Rd s6 d3 s3( 1113857 Rd s6 d3 s5( 1113857 Rd s6 d3 s6( 1113857 Rd s6 d4 s3( 1113857

Rd s6 d4 s4( 1113857 Rd s6 d4 s5( 1113857 Rd s6 d4 s6( 11138571113857

(minus 23 minus 21 minus 19 minus 42 minus 28 minus 31 minus 24 minus 49)

(13)

Table 2 Defense action description

Atomic defense action d1 d2 d3 d4 d5 d6 d7

Renew root data Limit SYNICMPpackets Install Oracle patches Reinstall listener program Uninstall delete Trojan Limit access toMDSYS Restart database server Delete suspicious account Add physical resource Repair database Limit packets from ports

S0(A root)

S1(W user)

S2(W root)S5(D user)

S6(D root)

d3

d3

d4

d1d2d7

S3(F root)

S4(E root)d1d5d6

d1d5d6d4

d3

d1d2

Figure 8 Defense graph

10 Security and Communication Networks

(5) In order to detect the learning performance of theAlgorithm 2 more fully the defenderrsquos state-actionpayoffQd(si d) is initialized with a unified 0 withoutintroducing additional prior knowledge

(6) Defenderrsquos defense strategy adopts average strategyto initialize that is πd(sk d1) πd(sk d2) middot middot middot

πd(sk dm) 1113936disinD(sk)πd(sk d) 1forallsk isin S where noadditional prior knowledge is introduced

53 Testing and Analysis e experiment in this section hasthree purposes e first is to test the influence of differentparameter settings on the proposed Algorithm 2 so as to findout the experimental parameters suitable for this scenarioe second is to compare this approach with the existingtypical approaches to verify the advancement of this ap-proach e third is to test the effectiveness of WoLF-PHCalgorithm improvement based on eligibility trace

From Figures 7 and 8 we can see that the state of attack-defense strategy selection is the most complex and repre-sentative erefore the performance of the algorithm isanalyzed by the experimental state selection and the othernetwork state analysis approaches are the same

531 Parameter Test and Analysis Different parameters willaffect the speed and effect of learning At present there is norelevant theory to determine the specific parameters InSection 4 the relevant parameters are preliminarily ana-lyzed On this basis the different parameter settings arefurther tested to find the parameter settings suitable for thisattack-defense scenario Six different parameter settingswere tested Specific parameter settings are shown in Table 3In the experiment the attackerrsquos initial strategy is randomstrategy and the learning mechanism is the same as theapproach in this paper

e probability of the defenderrsquos choice of defense ac-tions and sums in state is shown in Figure 9 e learningspeed and convergence of the algorithm under differentparameter settings can be observed from Figure 9 whichshows that the learning speed of settings 1 3 and 6 is fasterand the best strategy can be obtained after learning less than1500 times under the three settings but convergence of 3and 6 is poor Although the best strategy can be learned bysettings 3 and 6 there will be oscillation afterwards and thestability of setting 1 is not suitable

Defense payoff can represent the degree of optimizationof the strategy In order to ensure that the payoff value doesnot reflect only one defense result the average of 1000defense gains is taken and the change of the average payoffper 1000 defense gains is shown in Figure 10 As can be seenfrom Figure 10 the benefits of Set 3 are significantly lowerthan those of other settings but the advantages and dis-advantages of other settings are difficult to distinguish Inorder to display more intuitively the average value of 25000defense gains calculated under different settings in Figure 10is shown in Figure 11 From Figure 11 we can see that theaverage value of settings 1 and 5 is higher For furthercomparison the standard deviation of settings 1 and 5 is

calculated one step on the basis of the average value to reflectthe discreteness of the gains As shown in Figure 12 thestandard deviations of setting 1 and setting 6 are smallMoreover the result of setting 1 is smaller than setting 6

In conclusion setting 1 of the six sets of parameters is themost suitable for this scenario Since setting 1 has achievedan ideal effect and can meet the experimental requirementsit is no longer necessary to further optimize the parameters

532 Comparisons In this section stochastic game [16] andevolutionary game [20] are selected to conduct comparativeexperiments with this approach According to the differenceof attackerrsquos learning ability this section designs two groupsof comparative experiments In the first group the attackerrsquoslearning ability is weak and will not make adjustments to theattack-defense results In the second group the attackerrsquoslearning ability is strong and adopts the same learningmechanism as the approach in this paper In both groupsthe initial strategies of attackers were random strategies

In the first group of experiments the defense strategy ofthis approach is as shown in Figure 9(a) e defense strategiescalculated by the approach [16] are πd(s2 d3) 07πd(s2 d4) 03 e defense strategies of [20] are evolu-tionarily stable and balanced And the defense strategies of [20]are πd(s2 d3) 08 πd(s2 d4) 02 Its average earnings per1000 times change as shown in Figure 13

From the results of the strategies and benefits of the threeapproaches we can see that the approach in this paper canlearn from the attackerrsquos strategies and adjust to the optimalstrategy so the approach in this paper can obtain the highestbenefits Wei et al [16] adopted a fixed strategy whenconfronting any attacker When the attacker is constrainedby bounded rationality and does not adopt Nash equilibriumstrategy the benefit of this approach is low Although thelearning factors of both attackers and defenders are takeninto account in [20] the parameters required in the modelare difficult to quantify accurately which results in thedeviation between the final results and the actual results sothe benefit of the approach is still lower than that of theapproach in this paper

In the second group of experiments the results of[16 20] are still as πd(s2 d3) 07 πd(s2 d4) 03 andπd(s2 d3) 08 πd(s2 d4) 02 e decision making ofthis approach is shown in Figure 14 After about 1800 timesof learning the approach achieves stability and converges tothe same defense strategy as that of [16] As can be seen fromFigure 15 the payoff of [20] is lower than that of other twoapproaches e average payoff of this approach in the first2000 defenses is higher than that of [26] and then it is almostthe same as that of [26] Combining Figures 14 and 15 wecan see that the learning attacker cannot get Nash equi-librium strategy at the initial stage and the approach in thispaper is better than that in [26] When the attacker learns toget Nash equilibrium strategy the approach in this paper canconverge to Nash equilibrium strategy At this time theperformance of this approach is the same as that in [26]

In conclusion when facing the attackers with weaklearning ability the approach in this paper is superior to that

Security and Communication Networks 11

Table 3 Dierent parameter settings

Set α δl δw λ c

1 001 0004 0001 001 0012 01 0004 0001 001 0013 001 0004 0001 001 014 001 0004 0001 01 0015 001 004 001 001 0016 001 0008 0001 001 001

0 2000 4000 6000s2 defense number

s2 d

efen

se re

turn

10

08

06

04

02

00

d3d4

(a)

s2 d

efen

se re

turn

10

08

06

04

02

000 2000 4000 6000

s2 defense number

d3d4

(b)

s2 d

efen

se re

turn

10

08

06

04

02

000 2000 4000 6000

s2 defense number

d3d4

(c)

s2 d

efen

se re

turn

10

08

06

04

02

000 2000 4000 6000

s2 defense number

d3d4

(d)

s2 d

efen

se re

turn

10

08

06

04

02

000 2000 4000 6000

s2 defense number

d3d4

(e)

s2 d

efen

se re

turn

10

08

06

04

02

000 2000 4000 6000

s2 defense number

d3d4

(f )

Figure 9 Defense decision making under dierent parameter settings (a) Defense decision under setting 1 (b) Defense decision undersetting 2 (c) Defense decision under setting 3 (d) Defense decision under setting 4 (e) Defense decision under setting 5 (f ) Defensedecision under setting 6

5 10 15 20 25s2 defense number (times1000)

s2 d

efen

se re

turn

ndash10

ndash15

ndash20

ndash25

ndash30

Set 1Set 2Set 3

Set 4Set 5Set 6

Figure 10 Defense payo under dierent parameter settings

12 Security and Communication Networks

in [16 20] When facing an attacker with strong learningability if the attacker has not obtained Nash equilibriumthrough learning this approach is still better than [16 20] Ifthe attacker obtains Nash equilibrium through learning thispaper can also obtain the same Nash equilibrium strategy as[26] and obtain its phase e same eect is superior to thatin [20]

533 Test Comparison with and without Eligibility Traceis section tests the actual eect of the eligibility traces onthe Algorithm 2 e eect of eligibility traces on strategyselection is shown in Figure 16 from which we can see thatthe learning speed of the algorithm is faster when thequalied trace is available After 1000 times of learning the

algorithm can converge to the optimal strategy When thequalied trace is not available the algorithm needs about2500 times of learning to converge

Average earnings per 1000 times change as shown inFigure 17 from which we can see that the benets of thealgorithm are almost the same when there is or not anyqualied trace after convergence From Figure 17 we can seethat 3000 defenses before convergence have higher returnsfrom qualied trails than those from unqualied trails Inorder to further verify this the average of the rst 3000defense gains under qualied trails and unqualied trails iscounted 10 times each respectively e results are shown inFigure 18 which further proves that in the preconvergencedefense phase qualied traces are better than unqualiedtraces

e addition of eligibility trace accelerates the learningspeed but also brings additional memory and computingoverhead In the experiment only 10 state-action pairs thatwere recently accessed were saved which eectively reduced

Set 1 Set 2 Set 3 Set 4 Set 5 Set 6

ndash80

ndash85

ndash90

Figure 11 Average return under dierent parameter settings

Set 1 Set 2 Set 3 Set 4 Set 5 Set 6

80

75

Figure 12 Standard deviation of defense payo under dierentparameter settings

ndash10

ndash20

ndash30

ndash40

s2 d

efen

se re

turn

5 10 15 20 25

Our methodReference [12]Reference [16]

s2 defense number (times1000)

Figure 13 Comparison of defense payo change

0 2000 4000 6000s2 defense number

s2 d

efen

se st

rate

gy

08

06

04

02

00

d3d4

Figure 14 Defense decision making of our approach

Our methodReference [12]Reference [16]

5 10 15 20 25s2 defense number (times1000)

s2 d

efen

se re

turn

ndash10

ndash20

ndash30

ndash40

Figure 15 Comparison of defense payo change

Security and Communication Networks 13

0 2000 4000 6000s2 defense number

s2 d

efen

se st

rate

gy

100

075

050

025

000

Eligibility trace d3Eligibility trace d4

No eligibility trace d3No eligibility trace d4

Figure 16 Comparison of defense decision making

Eligibility traceNo eligibility trace

0 5 10 15 20 25s2 defense number (times1000)

s2 d

efen

se re

turn

ndash10

ndash20

ndash15

ndash30

ndash25

Figure 17 Comparison of defense payo change

Eligibility traceNo eligibility trace

1 2 3 4 5 6 7 8 9 10s2 defense number (times3000)

s2 av

erag

e def

ense

retu

rn

ndash12

ndash14

ndash16

ndash18

ndash20

Figure 18 Average payo comparison of the rst 3000 defenses

14 Security and Communication Networks

the increase of memory consumption In order to test thecomputational cost of qualified trail the time of 100000defense decisions made by the algorithm was counted for 20times with and without qualified trail e average of 20times was 951s for qualified trail and 374 s for unqualifiedtrail Although the introduction of eligibility traces willincrease the decision-making time by nearly 25 times thetime required for 100000 decisions after the introduction ofeligibility traces is still only 951 s which can still meet thereal-time requirements

In summary the introduction of eligibility trace at theexpense of a small amount of memory and computingoverhead can effectively increase the learning speed of thealgorithm and improve the defense gains

54 Comprehensive Comparisons is approach is com-pared with some typical research results as shown in Table 4[3 12 14 16] is based on the assumption of complete ra-tionality e equilibrium strategy obtained by it is difficultto appear in practice and has a low guiding effect on actualdefense decision making [20] and this paper have morepracticability on the premise of bounded rationality hy-pothesis but [20] is based on the theory of biologicalevolution and mainly studies population evolution e coreof game analysis is not the optimal strategy choice of playersbut the strategy adjustment process trend and stability ofgroup members composed of bounded rational players andthe stability here refers to group members is approach isnot suitable for guiding individual real-time decisionmaking because the proportion of specific strategies isunchanged and not the strategy of a player On the contrarythe defender of the proposed approach adopts reinforcementlearning mechanism which is based on systematic feedbackto learn in the confrontation with the attacker and is moresuitable for the study of individual strategies

6 Conclusions and Future Works

In this paper cyber attack-defense confrontation is ab-stracted as a stochastic game problem under the restrictionof limited rationality A host-centered attack-defense graphmodel is proposed to extract network state and attack-de-fense action and an algorithm to generate attack-defensegraph is designed to effectively compress the game statespace e WoLF-PHC-based defense decision approach isproposed to overcome the problem which enables defendersunder bounded rationality to make optimal choices whenfacing different attackers e lattice improves the WoLF-

PHC algorithm speeds up the defenderrsquos learning speed andreduces the algorithmrsquos dependence on data is approachnot only satisfies the constraints of bounded rationality butalso does not require the defender to know too much in-formation about the attacker It is a more practical defensedecision-making approach

e future work is to further optimize the winning andlosing criteria of WoLF-PHC algorithm for specific attack-defense scenarios in order to speed up defense learning andincrease defense gains

Data Availability

e data that support the findings of this study are notpublicly available due to restrictions as the data containsensitive information about a real-world enterprise networkAccess to the dataset is restricted by the original ownerPeople who want to access the data should send a request tothe corresponding author who will apply for permission ofsharing the data from the original owner

Conflicts of Interest

e authors declare that they have no conflicts of interestregarding the publication of this paper

Acknowledgments

is work was supported by the National High TechnologyResearch and Development Program of China (863 Pro-gram) (2014AA7116082 and 2015AA7116040) and NationalNatural Science Foundation of China (61902427) In par-ticular we thank Junnan Yang and Hongqi Zhang of ourresearch group for their previous work on stochastic gameWe sincerely thank them for their help in publishing thispaper

References

[1] Z Tian W Shi Y Wang et al ldquoReal time lateral movementdetection based on evidence reasoning network for edgecomputing environmentrdquo IEEE Transactions on IndustrialInformatics vol 15 no 7 pp 4285ndash4294 2019

[2] Y Wang J Yu W Qu H Shen and X Cheng ldquoEvolutionarygame model and analysis approaches for network groupbehaviorrdquo Chinese Journal of Computers vol 38 no 2pp 282ndash300 2015

[3] J Liu H Zhang and Y Liu ldquoResearch on optimal selection ofmoving target defense policy based on dynamic game withincomplete informationrdquo Acta Electronica Sinica vol 46no 1 pp 82ndash29 2018

Table 4 Comprehensive comparison of existing approaches

Ref Game type Model assumption Learning mechanism Game process Applicable object Practicability[3] Stochastic game Rationality mdash Multistage Personal Bad[12] Static game Rationality mdash Single-stage Personal Bad[14] Dynamic game Rationality mdash Multistage Personal Bad[16] Stochastic game Rationality mdash Multistage Personal Bad[20] Evolutionary game Bounded rationality Biological evolution mdash Group GoodOur approach Stochastic game Bounded rationality Reinforcement learning Multistage Personal Good

Security and Communication Networks 15

[4] Z Tian X Gao S Su J Qiu X Du and M GuizanildquoEvaluating reputation management schemes of internet ofvehicles based on evolutionary game theoryrdquo IEEE Trans-actions on Vehicular Technology vol 68 no 6 pp 5971ndash59802019

[5] Q Zhu and S Rass ldquoGame theory meets network securitya tutorialrdquo in Proceedings of the ACM SIGSAC Conference onComputer and Communications Security pp 2163ndash2165Toronto Canada October 2018

[6] X Li C Zhou Y C Tian and Y Qin ldquoA dynamic decision-making approach for intrusion response in industrial controlsystemsrdquo IEEE Transactions on Industrial Informatics vol 15no 5 pp 2544ndash2554 2019

[7] H Hu Y Liu H Zhang and R Pan ldquoOptimal networkdefense strategy selection based on incomplete informationevolutionary gamerdquo IEEE Access vol 6 pp 29806ndash298212018

[8] J Li G Kendall and R John ldquoComputing Nash equilibriaand evolutionarily stable states of evolutionary gamesrdquo IEEETransactions on Evolutionary Computation vol 20 no 3pp 460ndash469 2016

[9] M Bowling and M Veloso Multiagent Learning Usinga Variable Learning Rate Elsevier Science Publishers LtdAmsterdam Netherlands 2002

[10] J Harb and D Precup ldquoInvestigating recurrence and eligi-bility traces in deep Q-networksrdquo httpsarxivorgabs170405495 2017

[11] C T Do N H Tran and C Hong ldquoGame theory for cybersecurity and privacyrdquo ACM Computing Surveys vol 50 no 2p 30 2017

[12] Y-L Liu D-G Feng L-HWu and Y-F Lian ldquoPerformanceevaluation of worm attack and defense strategies based onstatic Bayesian gamerdquo Chinese Journal of Software vol 23no 3 pp 712ndash723 2012

[13] Y Li D E Quevedo S Dey and L Shi ldquoA game-theoreticapproach to fake-acknowledgment attack on cyber-physicalsystemsrdquo IEEE Transactions on Signal and Information Pro-cessing over Networks vol 3 no 1 pp 1ndash11 2017

[14] H Zhang L Jiang S Huang J Wang and Y Zhang ldquoAttack-defense differential game model for network defense strategyselectionrdquo IEEE Access vol 7 pp 50618ndash50629 2019

[15] A Afrand and S K Das ldquoPreventing DoS attacks in wirelesssensor networks a repeated game theory approachrdquo InternetJournal of Cyber Security vol 5 no 2 pp 145ndash153 2007

[16] J Wei B Fang and Z Tian ldquoResearch on defense strategiesselection based on attack-defense stochastic game modelrdquoChinese Journal of Computer Research and Developmentvol 47 no 10 pp 1714ndash1723 2010

[17] C Wang X Cheng and Y Zhu ldquoA Markov game model ofcomputer network operationrdquo Chinese Systems Engineering-gteory and Practice vol 34 no 9 pp 2402ndash2410 2014

[18] J Hofbauer and K Sigmund ldquoEvolutionary game dynamicsrdquoBulletin of the American Mathematical Society vol 40 no 4pp 479ndash519 2011

[19] Y Hayel and Q Zhu ldquoEpidemic protection over heteroge-neous networks using evolutionary Poisson gamesrdquo IEEETransactions on Information Forensics and Security vol 12no 8 pp 1786ndash1800 2017

[20] J Huang and H Zhang ldquoImproving replicator dynamicevolutionary game model for selecting optimal defensestrategiesrdquo Chinese Journal on Communications vol 38 no 1pp 170ndash182 2018

[21] H Hu Y Liu H Zhang and Y Zhang ldquoSecurity metricmethods for network multistep attacks using AMC and big

data correlation analysisrdquo Security and CommunicationNetworks vol 2018 Article ID 5787102 14 pages 2018

[22] H Hu Y Liu Y Yang H Zhang and Y Zhang ldquoNew in-sights into approaches to evaluating intention and path fornetwork multistep attacksrdquo Mathematical Problems in En-gineering vol 2018 Article ID 4278632 13 pages 2018

[23] H Hu H Zhang Y Liu and Y Wang ldquoQuantitative methodfor network security situation based on attack predictionrdquoSecurity and Communication Networks vol 2017 Article ID3407642 19 pages 2017

[24] M Zimmer and S Doncieux ldquoBootstrapping Q-learning forrobotics from neuro-evolution resultsrdquo IEEE Transactions onCognitive and Developmental Systems vol 10 no 1pp 102ndash119 2018

[25] R S Sutton and A G Barto ldquoReinforcement learning anintroduction Bradford bookrdquo Machine Learning vol 16no 1 pp 285-286 2005

[26] H Zhang J Wang D Yu J Han and T Li ldquoActive defensestrategy selection based on static Bayesian gamerdquo in Pro-ceedings of IET International Conference on CyberspaceTechnology pp 1ndash7 London UK October 2016

16 Security and Communication Networks

International Journal of

AerospaceEngineeringHindawiwwwhindawicom Volume 2018

RoboticsJournal of

Hindawiwwwhindawicom Volume 2018

Hindawiwwwhindawicom Volume 2018

Active and Passive Electronic Components

VLSI Design

Hindawiwwwhindawicom Volume 2018

Hindawiwwwhindawicom Volume 2018

Shock and Vibration

Hindawiwwwhindawicom Volume 2018

Civil EngineeringAdvances in

Acoustics and VibrationAdvances in

Hindawiwwwhindawicom Volume 2018

Hindawiwwwhindawicom Volume 2018

Electrical and Computer Engineering

Journal of

Advances inOptoElectronics

Hindawiwwwhindawicom

Volume 2018

Hindawi Publishing Corporation httpwwwhindawicom Volume 2013Hindawiwwwhindawicom

The Scientific World Journal

Volume 2018

Control Scienceand Engineering

Journal of

Hindawiwwwhindawicom Volume 2018

Hindawiwwwhindawicom

Journal ofEngineeringVolume 2018

SensorsJournal of

Hindawiwwwhindawicom Volume 2018

International Journal of

RotatingMachinery

Hindawiwwwhindawicom Volume 2018

Modelling ampSimulationin EngineeringHindawiwwwhindawicom Volume 2018

Hindawiwwwhindawicom Volume 2018

Chemical EngineeringInternational Journal of Antennas and

Propagation

International Journal of

Hindawiwwwhindawicom Volume 2018

Hindawiwwwhindawicom Volume 2018

Navigation and Observation

International Journal of

Hindawi

wwwhindawicom Volume 2018

Advances in

Multimedia

Submit your manuscripts atwwwhindawicom

Page 9: Research Article - Hindawi Publishing Corporationdownloads.hindawi.com/journals/scn/2019/3038586.pdf · Afrand and Das [15] established a repeated game model between the intrusion

0lt λlt 1 which is responsible for the credit allocation ofstatus-action It can be regarded as a time scale e greaterthe credit allocated λ to historical status-action the greaterthe credit allocated to historical status-action (4) e rangeof the discount factor 0lt clt 1 represents the defenderrsquospreference for immediate return and future return When c

approaches 0 it means that future returns are irrelevant andimmediate returns are more important When c approaches1 it means immediate returns are irrelevant and futurereturns are more important

Agent in WoLF-PHC is the defender in the stochasticgame model of attack-defense AD minus SGM the game state inagentrsquos state corresponds to AD minus SGM the defense actionin agentrsquos behavior corresponds to AD minus SGM the imme-diate return in agentrsquos immediate return corresponds toAD minus SGM and the defense strategy in agentrsquos strategycorresponds to AD minus SGM

On the basis of the above a specific defense decision-making approach as shown in Algorithm 2 is given e firststep of the algorithm is to initialize the stochastic gamemodel AD minus SGM of attack-defense and the related pa-rameters e network state and attack-defense actions areextracted by Algorithm 1 e second step is to detect thecurrent network state by the defender Steps 3ndash22 are tomake defense decisions and learn online Steps 4-5 are toselect defense actions according to the current strategy steps6ndash14 are to update the benefits by using eligibility traces andsteps 15ndash21 are the new payoffs using mountain climbingalgorithm to update defense strategy

e spatial complexity of the Algorithm 2 mainlyconcentrates on the storage of pairs Rd(s d sprime) such ase(s d) πd(s d) πd(s d) and Qd(s d) e number of statesis |S| |D| and |A| are the numbers of measures taken by thedefender and attacker in each state respectively ecomputational complexity of the proposed algorithm isO(4 middot |S| middot |D| + |S|2 middot |D|) Compared with the recentmethod using evolutionary game model with complexityO(|S|2 middot (|A| + |D|)3) [14] we greatly reduce the computa-tional complexity and increase the practicability of the al-gorithm since the proposed algorithm does not need to solvethe game equilibrium

5 Experimental Analysis

51 Experiment Setup In order to verify the effectiveness ofthis approach a typical enterprise network as shown inFigure 6 is built for experiment Attacks and defenses occuron the intranet with attackers coming from the extranet Asa defender network administrator is responsible for thesecurity of intranet Due to the setting of Firewall 1 andFirewall 2 legal users of the external network can only accessthe web server which can access the database server FTPserver and e-mail server

e simulation experiment was carried out on a PC withIntel Core i7-6300HQ 340GHz 32GB RAM memoryand Windows 10 64 bit operating system e Python 365emulator was installed and the vulnerability information inthe experimental network was scanned by Nessus toolkit asshown in Table 1 e network topology information was

collected by ArcGis toolkit We used the Python language towrite the project code During the experiment we set upabout 25000 times of attack-defense strategy studies eexperimental results were analyzed and displayed usingMatlab2018a as described in Section 53

Referring to MIT Lincoln Lab attack-defense behaviordatabase attack-defense templates are constructed Attack-defense maps are divided into attack maps and defense mapsby using attacker host A web server W database server DFTP server F and e-mail server E In order to facilitatedisplay and description attack-defense maps are dividedinto attack maps and defense maps as shown in Figures 7and 8 respectively e meaning of defense action in thedefense diagram is shown in Table 2

52 Construction of the Experiment Scenario AD-SGM(1) N (attacker defender) are players participating in

the game representing cyber attackers and defendersrespectively

Vulnerabilityscanning system

VDS

Internet

DMZ

FTP server

Firewall 1Trusted zone

Firewall 2Database

serverE-mailserver

Web server

IDS

Attacker

Extranet

Intranet

Figure 6 Experimental network topology

Table 1 Network vulnerability information

Attackidentifier Host CVE Target

privilegetid1 Web server CVE-2015-1635 usertid2 Web server CVE-2017-7269 roottid3 Web server CVE-2014-8517 roottid4 FTP server CVE-2014-3556 roottid5 E-mail server CVE-2014-4877 roottid6 Database server CVE-2013-4730 usertid7 Database server CVE-2016-6662 root

S0(A root)

S1(W user)

S2(W root)S5(D user)

S6(D root)

tid1

tid2

tid3

tid4

tid5

tid6

tid7

tid8

tid9

S3(F root)

S4(E root)

Figure 7 Attack graph

Security and Communication Networks 9

(2) e state set of stochastic game is S (s0 s1

s2 s3 s4 s5 s6) which consists of network state and isextracted from the nodes shown in Figures 7 and 8

(3) e action set of the defender is D (D0 D1

D2 D3 D4 D5 D6) D0 NULL D1 d11113864 d2D2

d3 d41113864 1113865D3 d1 d5 d61113864 1113865D4 d1 d5 d61113864 1113865D5 d11113864

d2 d7D6 d3 d41113864 1113865 and the edges are extractedfrom Figure 8

(4) Quantitative results of immediate returns Rd(si

d sj) of defenders [16 26] are

Rd s0NULL s0( 1113857 Rd s0NULL s1( 1113857 Rd s0NULL s2( 1113857( 1113857 (0 minus 40 minus 59)

Rd s1 d1 s0( 1113857 Rd s1 d1 s1( 1113857 Rd s1 d1 s2( 1113857 Rd s1 d2 s0( 1113857 Rd s1 d2 s1( 1113857 Rd s1 d2 s2( 1113857( 1113857

(40 0 minus 29 5 minus 15 minus 32)

( Rd s2 d3 s0( 1113857 Rd s2 d3 s1( 1113857 Rd s2 d3 s2( 1113857 Rd s2 d3 s3( 1113857 Rd s2 d3 s4( 1113857 Rd s2 d3 s5( 1113857

Rd s2 d4 s0( 1113857 Rd s2d4 s1( 1113857 Rd s2 d4 s2( 1113857 Rd s2 d4 s3( 1113857 Rd s2 d4 s4( 1113857 Rd s2 d4 s5( 11138571113857

(24 9 minus 15 minus 55 minus 49 minus 65 19 5 minus 21 minus 61 minus 72 minus 68)

( Rd s3 d1 s2( 1113857 Rd s3 d1 s3( 1113857 Rd s3 d1 s6( 1113857 Rd s3 d5 s2( 1113857 Rd s3 d5 s3( 1113857 Rd s3 d5 s6( 1113857

Rd s3 d6 s2( 1113857 Rd s3 d6 s3( 1113857 Rd s3 d6 s6( 11138571113857

(21 minus 16 minus 72 15 minus 23 minus 81 minus 21 minus 36 minus 81)

( Rd s4 d1 s2( 1113857 Rd s4 d1 s4( 1113857 Rd s4 d1 s6( 1113857 Rd s4 d5 s2( 1113857 Rd s4 d5 s4( 1113857 Rd s4 d5 s6( 1113857

Rd s4 d6 s2( 1113857 Rd s4 d6 s4( 1113857 Rd s4 d6 s6( 11138571113857

(26 0 minus 62 11 minus 23 minus 75 9 minus 25 minus 87)

( Rd s5 d1 s2( 1113857 Rd s5 d1 s5( 1113857 Rd s5 d1 s6( 1113857 Rd s5 d2 s2( 1113857 Rd s5 d2 s5( 1113857 Rd s5 d2 s6( 1113857

Rd s5 d7 s2( 1113857 Rd s5 d7 s5( 1113857 Rd s5 d7 s6( 11138571113857

(29 0 minus 63 11 minus 21 minus 76 2 minus 27 minus 88)

( Rd s6 d3 s3( 1113857 Rd s6 d3 s3( 1113857 Rd s6 d3 s5( 1113857 Rd s6 d3 s6( 1113857 Rd s6 d4 s3( 1113857

Rd s6 d4 s4( 1113857 Rd s6 d4 s5( 1113857 Rd s6 d4 s6( 11138571113857

(minus 23 minus 21 minus 19 minus 42 minus 28 minus 31 minus 24 minus 49)

(13)

Table 2 Defense action description

Atomic defense action d1 d2 d3 d4 d5 d6 d7

Renew root data Limit SYNICMPpackets Install Oracle patches Reinstall listener program Uninstall delete Trojan Limit access toMDSYS Restart database server Delete suspicious account Add physical resource Repair database Limit packets from ports

S0(A root)

S1(W user)

S2(W root)S5(D user)

S6(D root)

d3

d3

d4

d1d2d7

S3(F root)

S4(E root)d1d5d6

d1d5d6d4

d3

d1d2

Figure 8 Defense graph

10 Security and Communication Networks

(5) In order to detect the learning performance of theAlgorithm 2 more fully the defenderrsquos state-actionpayoffQd(si d) is initialized with a unified 0 withoutintroducing additional prior knowledge

(6) Defenderrsquos defense strategy adopts average strategyto initialize that is πd(sk d1) πd(sk d2) middot middot middot

πd(sk dm) 1113936disinD(sk)πd(sk d) 1forallsk isin S where noadditional prior knowledge is introduced

53 Testing and Analysis e experiment in this section hasthree purposes e first is to test the influence of differentparameter settings on the proposed Algorithm 2 so as to findout the experimental parameters suitable for this scenarioe second is to compare this approach with the existingtypical approaches to verify the advancement of this ap-proach e third is to test the effectiveness of WoLF-PHCalgorithm improvement based on eligibility trace

From Figures 7 and 8 we can see that the state of attack-defense strategy selection is the most complex and repre-sentative erefore the performance of the algorithm isanalyzed by the experimental state selection and the othernetwork state analysis approaches are the same

531 Parameter Test and Analysis Different parameters willaffect the speed and effect of learning At present there is norelevant theory to determine the specific parameters InSection 4 the relevant parameters are preliminarily ana-lyzed On this basis the different parameter settings arefurther tested to find the parameter settings suitable for thisattack-defense scenario Six different parameter settingswere tested Specific parameter settings are shown in Table 3In the experiment the attackerrsquos initial strategy is randomstrategy and the learning mechanism is the same as theapproach in this paper

e probability of the defenderrsquos choice of defense ac-tions and sums in state is shown in Figure 9 e learningspeed and convergence of the algorithm under differentparameter settings can be observed from Figure 9 whichshows that the learning speed of settings 1 3 and 6 is fasterand the best strategy can be obtained after learning less than1500 times under the three settings but convergence of 3and 6 is poor Although the best strategy can be learned bysettings 3 and 6 there will be oscillation afterwards and thestability of setting 1 is not suitable

Defense payoff can represent the degree of optimizationof the strategy In order to ensure that the payoff value doesnot reflect only one defense result the average of 1000defense gains is taken and the change of the average payoffper 1000 defense gains is shown in Figure 10 As can be seenfrom Figure 10 the benefits of Set 3 are significantly lowerthan those of other settings but the advantages and dis-advantages of other settings are difficult to distinguish Inorder to display more intuitively the average value of 25000defense gains calculated under different settings in Figure 10is shown in Figure 11 From Figure 11 we can see that theaverage value of settings 1 and 5 is higher For furthercomparison the standard deviation of settings 1 and 5 is

calculated one step on the basis of the average value to reflectthe discreteness of the gains As shown in Figure 12 thestandard deviations of setting 1 and setting 6 are smallMoreover the result of setting 1 is smaller than setting 6

In conclusion setting 1 of the six sets of parameters is themost suitable for this scenario Since setting 1 has achievedan ideal effect and can meet the experimental requirementsit is no longer necessary to further optimize the parameters

532 Comparisons In this section stochastic game [16] andevolutionary game [20] are selected to conduct comparativeexperiments with this approach According to the differenceof attackerrsquos learning ability this section designs two groupsof comparative experiments In the first group the attackerrsquoslearning ability is weak and will not make adjustments to theattack-defense results In the second group the attackerrsquoslearning ability is strong and adopts the same learningmechanism as the approach in this paper In both groupsthe initial strategies of attackers were random strategies

In the first group of experiments the defense strategy ofthis approach is as shown in Figure 9(a) e defense strategiescalculated by the approach [16] are πd(s2 d3) 07πd(s2 d4) 03 e defense strategies of [20] are evolu-tionarily stable and balanced And the defense strategies of [20]are πd(s2 d3) 08 πd(s2 d4) 02 Its average earnings per1000 times change as shown in Figure 13

From the results of the strategies and benefits of the threeapproaches we can see that the approach in this paper canlearn from the attackerrsquos strategies and adjust to the optimalstrategy so the approach in this paper can obtain the highestbenefits Wei et al [16] adopted a fixed strategy whenconfronting any attacker When the attacker is constrainedby bounded rationality and does not adopt Nash equilibriumstrategy the benefit of this approach is low Although thelearning factors of both attackers and defenders are takeninto account in [20] the parameters required in the modelare difficult to quantify accurately which results in thedeviation between the final results and the actual results sothe benefit of the approach is still lower than that of theapproach in this paper

In the second group of experiments the results of[16 20] are still as πd(s2 d3) 07 πd(s2 d4) 03 andπd(s2 d3) 08 πd(s2 d4) 02 e decision making ofthis approach is shown in Figure 14 After about 1800 timesof learning the approach achieves stability and converges tothe same defense strategy as that of [16] As can be seen fromFigure 15 the payoff of [20] is lower than that of other twoapproaches e average payoff of this approach in the first2000 defenses is higher than that of [26] and then it is almostthe same as that of [26] Combining Figures 14 and 15 wecan see that the learning attacker cannot get Nash equi-librium strategy at the initial stage and the approach in thispaper is better than that in [26] When the attacker learns toget Nash equilibrium strategy the approach in this paper canconverge to Nash equilibrium strategy At this time theperformance of this approach is the same as that in [26]

In conclusion when facing the attackers with weaklearning ability the approach in this paper is superior to that

Security and Communication Networks 11

Table 3 Dierent parameter settings

Set α δl δw λ c

1 001 0004 0001 001 0012 01 0004 0001 001 0013 001 0004 0001 001 014 001 0004 0001 01 0015 001 004 001 001 0016 001 0008 0001 001 001

0 2000 4000 6000s2 defense number

s2 d

efen

se re

turn

10

08

06

04

02

00

d3d4

(a)

s2 d

efen

se re

turn

10

08

06

04

02

000 2000 4000 6000

s2 defense number

d3d4

(b)

s2 d

efen

se re

turn

10

08

06

04

02

000 2000 4000 6000

s2 defense number

d3d4

(c)

s2 d

efen

se re

turn

10

08

06

04

02

000 2000 4000 6000

s2 defense number

d3d4

(d)

s2 d

efen

se re

turn

10

08

06

04

02

000 2000 4000 6000

s2 defense number

d3d4

(e)

s2 d

efen

se re

turn

10

08

06

04

02

000 2000 4000 6000

s2 defense number

d3d4

(f )

Figure 9 Defense decision making under dierent parameter settings (a) Defense decision under setting 1 (b) Defense decision undersetting 2 (c) Defense decision under setting 3 (d) Defense decision under setting 4 (e) Defense decision under setting 5 (f ) Defensedecision under setting 6

5 10 15 20 25s2 defense number (times1000)

s2 d

efen

se re

turn

ndash10

ndash15

ndash20

ndash25

ndash30

Set 1Set 2Set 3

Set 4Set 5Set 6

Figure 10 Defense payo under dierent parameter settings

12 Security and Communication Networks

in [16 20] When facing an attacker with strong learningability if the attacker has not obtained Nash equilibriumthrough learning this approach is still better than [16 20] Ifthe attacker obtains Nash equilibrium through learning thispaper can also obtain the same Nash equilibrium strategy as[26] and obtain its phase e same eect is superior to thatin [20]

533 Test Comparison with and without Eligibility Traceis section tests the actual eect of the eligibility traces onthe Algorithm 2 e eect of eligibility traces on strategyselection is shown in Figure 16 from which we can see thatthe learning speed of the algorithm is faster when thequalied trace is available After 1000 times of learning the

algorithm can converge to the optimal strategy When thequalied trace is not available the algorithm needs about2500 times of learning to converge

Average earnings per 1000 times change as shown inFigure 17 from which we can see that the benets of thealgorithm are almost the same when there is or not anyqualied trace after convergence From Figure 17 we can seethat 3000 defenses before convergence have higher returnsfrom qualied trails than those from unqualied trails Inorder to further verify this the average of the rst 3000defense gains under qualied trails and unqualied trails iscounted 10 times each respectively e results are shown inFigure 18 which further proves that in the preconvergencedefense phase qualied traces are better than unqualiedtraces

e addition of eligibility trace accelerates the learningspeed but also brings additional memory and computingoverhead In the experiment only 10 state-action pairs thatwere recently accessed were saved which eectively reduced

Set 1 Set 2 Set 3 Set 4 Set 5 Set 6

ndash80

ndash85

ndash90

Figure 11 Average return under dierent parameter settings

Set 1 Set 2 Set 3 Set 4 Set 5 Set 6

80

75

Figure 12 Standard deviation of defense payo under dierentparameter settings

ndash10

ndash20

ndash30

ndash40

s2 d

efen

se re

turn

5 10 15 20 25

Our methodReference [12]Reference [16]

s2 defense number (times1000)

Figure 13 Comparison of defense payo change

0 2000 4000 6000s2 defense number

s2 d

efen

se st

rate

gy

08

06

04

02

00

d3d4

Figure 14 Defense decision making of our approach

Our methodReference [12]Reference [16]

5 10 15 20 25s2 defense number (times1000)

s2 d

efen

se re

turn

ndash10

ndash20

ndash30

ndash40

Figure 15 Comparison of defense payo change

Security and Communication Networks 13

0 2000 4000 6000s2 defense number

s2 d

efen

se st

rate

gy

100

075

050

025

000

Eligibility trace d3Eligibility trace d4

No eligibility trace d3No eligibility trace d4

Figure 16 Comparison of defense decision making

Eligibility traceNo eligibility trace

0 5 10 15 20 25s2 defense number (times1000)

s2 d

efen

se re

turn

ndash10

ndash20

ndash15

ndash30

ndash25

Figure 17 Comparison of defense payo change

Eligibility traceNo eligibility trace

1 2 3 4 5 6 7 8 9 10s2 defense number (times3000)

s2 av

erag

e def

ense

retu

rn

ndash12

ndash14

ndash16

ndash18

ndash20

Figure 18 Average payo comparison of the rst 3000 defenses

14 Security and Communication Networks

the increase of memory consumption In order to test thecomputational cost of qualified trail the time of 100000defense decisions made by the algorithm was counted for 20times with and without qualified trail e average of 20times was 951s for qualified trail and 374 s for unqualifiedtrail Although the introduction of eligibility traces willincrease the decision-making time by nearly 25 times thetime required for 100000 decisions after the introduction ofeligibility traces is still only 951 s which can still meet thereal-time requirements

In summary the introduction of eligibility trace at theexpense of a small amount of memory and computingoverhead can effectively increase the learning speed of thealgorithm and improve the defense gains

54 Comprehensive Comparisons is approach is com-pared with some typical research results as shown in Table 4[3 12 14 16] is based on the assumption of complete ra-tionality e equilibrium strategy obtained by it is difficultto appear in practice and has a low guiding effect on actualdefense decision making [20] and this paper have morepracticability on the premise of bounded rationality hy-pothesis but [20] is based on the theory of biologicalevolution and mainly studies population evolution e coreof game analysis is not the optimal strategy choice of playersbut the strategy adjustment process trend and stability ofgroup members composed of bounded rational players andthe stability here refers to group members is approach isnot suitable for guiding individual real-time decisionmaking because the proportion of specific strategies isunchanged and not the strategy of a player On the contrarythe defender of the proposed approach adopts reinforcementlearning mechanism which is based on systematic feedbackto learn in the confrontation with the attacker and is moresuitable for the study of individual strategies

6 Conclusions and Future Works

In this paper cyber attack-defense confrontation is ab-stracted as a stochastic game problem under the restrictionof limited rationality A host-centered attack-defense graphmodel is proposed to extract network state and attack-de-fense action and an algorithm to generate attack-defensegraph is designed to effectively compress the game statespace e WoLF-PHC-based defense decision approach isproposed to overcome the problem which enables defendersunder bounded rationality to make optimal choices whenfacing different attackers e lattice improves the WoLF-

PHC algorithm speeds up the defenderrsquos learning speed andreduces the algorithmrsquos dependence on data is approachnot only satisfies the constraints of bounded rationality butalso does not require the defender to know too much in-formation about the attacker It is a more practical defensedecision-making approach

e future work is to further optimize the winning andlosing criteria of WoLF-PHC algorithm for specific attack-defense scenarios in order to speed up defense learning andincrease defense gains

Data Availability

e data that support the findings of this study are notpublicly available due to restrictions as the data containsensitive information about a real-world enterprise networkAccess to the dataset is restricted by the original ownerPeople who want to access the data should send a request tothe corresponding author who will apply for permission ofsharing the data from the original owner

Conflicts of Interest

e authors declare that they have no conflicts of interestregarding the publication of this paper

Acknowledgments

is work was supported by the National High TechnologyResearch and Development Program of China (863 Pro-gram) (2014AA7116082 and 2015AA7116040) and NationalNatural Science Foundation of China (61902427) In par-ticular we thank Junnan Yang and Hongqi Zhang of ourresearch group for their previous work on stochastic gameWe sincerely thank them for their help in publishing thispaper

References

[1] Z Tian W Shi Y Wang et al ldquoReal time lateral movementdetection based on evidence reasoning network for edgecomputing environmentrdquo IEEE Transactions on IndustrialInformatics vol 15 no 7 pp 4285ndash4294 2019

[2] Y Wang J Yu W Qu H Shen and X Cheng ldquoEvolutionarygame model and analysis approaches for network groupbehaviorrdquo Chinese Journal of Computers vol 38 no 2pp 282ndash300 2015

[3] J Liu H Zhang and Y Liu ldquoResearch on optimal selection ofmoving target defense policy based on dynamic game withincomplete informationrdquo Acta Electronica Sinica vol 46no 1 pp 82ndash29 2018

Table 4 Comprehensive comparison of existing approaches

Ref Game type Model assumption Learning mechanism Game process Applicable object Practicability[3] Stochastic game Rationality mdash Multistage Personal Bad[12] Static game Rationality mdash Single-stage Personal Bad[14] Dynamic game Rationality mdash Multistage Personal Bad[16] Stochastic game Rationality mdash Multistage Personal Bad[20] Evolutionary game Bounded rationality Biological evolution mdash Group GoodOur approach Stochastic game Bounded rationality Reinforcement learning Multistage Personal Good

Security and Communication Networks 15

[4] Z Tian X Gao S Su J Qiu X Du and M GuizanildquoEvaluating reputation management schemes of internet ofvehicles based on evolutionary game theoryrdquo IEEE Trans-actions on Vehicular Technology vol 68 no 6 pp 5971ndash59802019

[5] Q Zhu and S Rass ldquoGame theory meets network securitya tutorialrdquo in Proceedings of the ACM SIGSAC Conference onComputer and Communications Security pp 2163ndash2165Toronto Canada October 2018

[6] X Li C Zhou Y C Tian and Y Qin ldquoA dynamic decision-making approach for intrusion response in industrial controlsystemsrdquo IEEE Transactions on Industrial Informatics vol 15no 5 pp 2544ndash2554 2019

[7] H Hu Y Liu H Zhang and R Pan ldquoOptimal networkdefense strategy selection based on incomplete informationevolutionary gamerdquo IEEE Access vol 6 pp 29806ndash298212018

[8] J Li G Kendall and R John ldquoComputing Nash equilibriaand evolutionarily stable states of evolutionary gamesrdquo IEEETransactions on Evolutionary Computation vol 20 no 3pp 460ndash469 2016

[9] M Bowling and M Veloso Multiagent Learning Usinga Variable Learning Rate Elsevier Science Publishers LtdAmsterdam Netherlands 2002

[10] J Harb and D Precup ldquoInvestigating recurrence and eligi-bility traces in deep Q-networksrdquo httpsarxivorgabs170405495 2017

[11] C T Do N H Tran and C Hong ldquoGame theory for cybersecurity and privacyrdquo ACM Computing Surveys vol 50 no 2p 30 2017

[12] Y-L Liu D-G Feng L-HWu and Y-F Lian ldquoPerformanceevaluation of worm attack and defense strategies based onstatic Bayesian gamerdquo Chinese Journal of Software vol 23no 3 pp 712ndash723 2012

[13] Y Li D E Quevedo S Dey and L Shi ldquoA game-theoreticapproach to fake-acknowledgment attack on cyber-physicalsystemsrdquo IEEE Transactions on Signal and Information Pro-cessing over Networks vol 3 no 1 pp 1ndash11 2017

[14] H Zhang L Jiang S Huang J Wang and Y Zhang ldquoAttack-defense differential game model for network defense strategyselectionrdquo IEEE Access vol 7 pp 50618ndash50629 2019

[15] A Afrand and S K Das ldquoPreventing DoS attacks in wirelesssensor networks a repeated game theory approachrdquo InternetJournal of Cyber Security vol 5 no 2 pp 145ndash153 2007

[16] J Wei B Fang and Z Tian ldquoResearch on defense strategiesselection based on attack-defense stochastic game modelrdquoChinese Journal of Computer Research and Developmentvol 47 no 10 pp 1714ndash1723 2010

[17] C Wang X Cheng and Y Zhu ldquoA Markov game model ofcomputer network operationrdquo Chinese Systems Engineering-gteory and Practice vol 34 no 9 pp 2402ndash2410 2014

[18] J Hofbauer and K Sigmund ldquoEvolutionary game dynamicsrdquoBulletin of the American Mathematical Society vol 40 no 4pp 479ndash519 2011

[19] Y Hayel and Q Zhu ldquoEpidemic protection over heteroge-neous networks using evolutionary Poisson gamesrdquo IEEETransactions on Information Forensics and Security vol 12no 8 pp 1786ndash1800 2017

[20] J Huang and H Zhang ldquoImproving replicator dynamicevolutionary game model for selecting optimal defensestrategiesrdquo Chinese Journal on Communications vol 38 no 1pp 170ndash182 2018

[21] H Hu Y Liu H Zhang and Y Zhang ldquoSecurity metricmethods for network multistep attacks using AMC and big

data correlation analysisrdquo Security and CommunicationNetworks vol 2018 Article ID 5787102 14 pages 2018

[22] H Hu Y Liu Y Yang H Zhang and Y Zhang ldquoNew in-sights into approaches to evaluating intention and path fornetwork multistep attacksrdquo Mathematical Problems in En-gineering vol 2018 Article ID 4278632 13 pages 2018

[23] H Hu H Zhang Y Liu and Y Wang ldquoQuantitative methodfor network security situation based on attack predictionrdquoSecurity and Communication Networks vol 2017 Article ID3407642 19 pages 2017

[24] M Zimmer and S Doncieux ldquoBootstrapping Q-learning forrobotics from neuro-evolution resultsrdquo IEEE Transactions onCognitive and Developmental Systems vol 10 no 1pp 102ndash119 2018

[25] R S Sutton and A G Barto ldquoReinforcement learning anintroduction Bradford bookrdquo Machine Learning vol 16no 1 pp 285-286 2005

[26] H Zhang J Wang D Yu J Han and T Li ldquoActive defensestrategy selection based on static Bayesian gamerdquo in Pro-ceedings of IET International Conference on CyberspaceTechnology pp 1ndash7 London UK October 2016

16 Security and Communication Networks

International Journal of

AerospaceEngineeringHindawiwwwhindawicom Volume 2018

RoboticsJournal of

Hindawiwwwhindawicom Volume 2018

Hindawiwwwhindawicom Volume 2018

Active and Passive Electronic Components

VLSI Design

Hindawiwwwhindawicom Volume 2018

Hindawiwwwhindawicom Volume 2018

Shock and Vibration

Hindawiwwwhindawicom Volume 2018

Civil EngineeringAdvances in

Acoustics and VibrationAdvances in

Hindawiwwwhindawicom Volume 2018

Hindawiwwwhindawicom Volume 2018

Electrical and Computer Engineering

Journal of

Advances inOptoElectronics

Hindawiwwwhindawicom

Volume 2018

Hindawi Publishing Corporation httpwwwhindawicom Volume 2013Hindawiwwwhindawicom

The Scientific World Journal

Volume 2018

Control Scienceand Engineering

Journal of

Hindawiwwwhindawicom Volume 2018

Hindawiwwwhindawicom

Journal ofEngineeringVolume 2018

SensorsJournal of

Hindawiwwwhindawicom Volume 2018

International Journal of

RotatingMachinery

Hindawiwwwhindawicom Volume 2018

Modelling ampSimulationin EngineeringHindawiwwwhindawicom Volume 2018

Hindawiwwwhindawicom Volume 2018

Chemical EngineeringInternational Journal of Antennas and

Propagation

International Journal of

Hindawiwwwhindawicom Volume 2018

Hindawiwwwhindawicom Volume 2018

Navigation and Observation

International Journal of

Hindawi

wwwhindawicom Volume 2018

Advances in

Multimedia

Submit your manuscripts atwwwhindawicom

Page 10: Research Article - Hindawi Publishing Corporationdownloads.hindawi.com/journals/scn/2019/3038586.pdf · Afrand and Das [15] established a repeated game model between the intrusion

(2) e state set of stochastic game is S (s0 s1

s2 s3 s4 s5 s6) which consists of network state and isextracted from the nodes shown in Figures 7 and 8

(3) e action set of the defender is D (D0 D1

D2 D3 D4 D5 D6) D0 NULL D1 d11113864 d2D2

d3 d41113864 1113865D3 d1 d5 d61113864 1113865D4 d1 d5 d61113864 1113865D5 d11113864

d2 d7D6 d3 d41113864 1113865 and the edges are extractedfrom Figure 8

(4) Quantitative results of immediate returns Rd(si

d sj) of defenders [16 26] are

Rd s0NULL s0( 1113857 Rd s0NULL s1( 1113857 Rd s0NULL s2( 1113857( 1113857 (0 minus 40 minus 59)

Rd s1 d1 s0( 1113857 Rd s1 d1 s1( 1113857 Rd s1 d1 s2( 1113857 Rd s1 d2 s0( 1113857 Rd s1 d2 s1( 1113857 Rd s1 d2 s2( 1113857( 1113857

(40 0 minus 29 5 minus 15 minus 32)

( Rd s2 d3 s0( 1113857 Rd s2 d3 s1( 1113857 Rd s2 d3 s2( 1113857 Rd s2 d3 s3( 1113857 Rd s2 d3 s4( 1113857 Rd s2 d3 s5( 1113857

Rd s2 d4 s0( 1113857 Rd s2d4 s1( 1113857 Rd s2 d4 s2( 1113857 Rd s2 d4 s3( 1113857 Rd s2 d4 s4( 1113857 Rd s2 d4 s5( 11138571113857

(24 9 minus 15 minus 55 minus 49 minus 65 19 5 minus 21 minus 61 minus 72 minus 68)

( Rd s3 d1 s2( 1113857 Rd s3 d1 s3( 1113857 Rd s3 d1 s6( 1113857 Rd s3 d5 s2( 1113857 Rd s3 d5 s3( 1113857 Rd s3 d5 s6( 1113857

Rd s3 d6 s2( 1113857 Rd s3 d6 s3( 1113857 Rd s3 d6 s6( 11138571113857

(21 minus 16 minus 72 15 minus 23 minus 81 minus 21 minus 36 minus 81)

( Rd s4 d1 s2( 1113857 Rd s4 d1 s4( 1113857 Rd s4 d1 s6( 1113857 Rd s4 d5 s2( 1113857 Rd s4 d5 s4( 1113857 Rd s4 d5 s6( 1113857

Rd s4 d6 s2( 1113857 Rd s4 d6 s4( 1113857 Rd s4 d6 s6( 11138571113857

(26 0 minus 62 11 minus 23 minus 75 9 minus 25 minus 87)

( Rd s5 d1 s2( 1113857 Rd s5 d1 s5( 1113857 Rd s5 d1 s6( 1113857 Rd s5 d2 s2( 1113857 Rd s5 d2 s5( 1113857 Rd s5 d2 s6( 1113857

Rd s5 d7 s2( 1113857 Rd s5 d7 s5( 1113857 Rd s5 d7 s6( 11138571113857

(29 0 minus 63 11 minus 21 minus 76 2 minus 27 minus 88)

( Rd s6 d3 s3( 1113857 Rd s6 d3 s3( 1113857 Rd s6 d3 s5( 1113857 Rd s6 d3 s6( 1113857 Rd s6 d4 s3( 1113857

Rd s6 d4 s4( 1113857 Rd s6 d4 s5( 1113857 Rd s6 d4 s6( 11138571113857

(minus 23 minus 21 minus 19 minus 42 minus 28 minus 31 minus 24 minus 49)

(13)

Table 2 Defense action description

Atomic defense action d1 d2 d3 d4 d5 d6 d7

Renew root data Limit SYNICMPpackets Install Oracle patches Reinstall listener program Uninstall delete Trojan Limit access toMDSYS Restart database server Delete suspicious account Add physical resource Repair database Limit packets from ports

S0(A root)

S1(W user)

S2(W root)S5(D user)

S6(D root)

d3

d3

d4

d1d2d7

S3(F root)

S4(E root)d1d5d6

d1d5d6d4

d3

d1d2

Figure 8 Defense graph

10 Security and Communication Networks

(5) In order to detect the learning performance of theAlgorithm 2 more fully the defenderrsquos state-actionpayoffQd(si d) is initialized with a unified 0 withoutintroducing additional prior knowledge

(6) Defenderrsquos defense strategy adopts average strategyto initialize that is πd(sk d1) πd(sk d2) middot middot middot

πd(sk dm) 1113936disinD(sk)πd(sk d) 1forallsk isin S where noadditional prior knowledge is introduced

53 Testing and Analysis e experiment in this section hasthree purposes e first is to test the influence of differentparameter settings on the proposed Algorithm 2 so as to findout the experimental parameters suitable for this scenarioe second is to compare this approach with the existingtypical approaches to verify the advancement of this ap-proach e third is to test the effectiveness of WoLF-PHCalgorithm improvement based on eligibility trace

From Figures 7 and 8 we can see that the state of attack-defense strategy selection is the most complex and repre-sentative erefore the performance of the algorithm isanalyzed by the experimental state selection and the othernetwork state analysis approaches are the same

531 Parameter Test and Analysis Different parameters willaffect the speed and effect of learning At present there is norelevant theory to determine the specific parameters InSection 4 the relevant parameters are preliminarily ana-lyzed On this basis the different parameter settings arefurther tested to find the parameter settings suitable for thisattack-defense scenario Six different parameter settingswere tested Specific parameter settings are shown in Table 3In the experiment the attackerrsquos initial strategy is randomstrategy and the learning mechanism is the same as theapproach in this paper

e probability of the defenderrsquos choice of defense ac-tions and sums in state is shown in Figure 9 e learningspeed and convergence of the algorithm under differentparameter settings can be observed from Figure 9 whichshows that the learning speed of settings 1 3 and 6 is fasterand the best strategy can be obtained after learning less than1500 times under the three settings but convergence of 3and 6 is poor Although the best strategy can be learned bysettings 3 and 6 there will be oscillation afterwards and thestability of setting 1 is not suitable

Defense payoff can represent the degree of optimizationof the strategy In order to ensure that the payoff value doesnot reflect only one defense result the average of 1000defense gains is taken and the change of the average payoffper 1000 defense gains is shown in Figure 10 As can be seenfrom Figure 10 the benefits of Set 3 are significantly lowerthan those of other settings but the advantages and dis-advantages of other settings are difficult to distinguish Inorder to display more intuitively the average value of 25000defense gains calculated under different settings in Figure 10is shown in Figure 11 From Figure 11 we can see that theaverage value of settings 1 and 5 is higher For furthercomparison the standard deviation of settings 1 and 5 is

calculated one step on the basis of the average value to reflectthe discreteness of the gains As shown in Figure 12 thestandard deviations of setting 1 and setting 6 are smallMoreover the result of setting 1 is smaller than setting 6

In conclusion setting 1 of the six sets of parameters is themost suitable for this scenario Since setting 1 has achievedan ideal effect and can meet the experimental requirementsit is no longer necessary to further optimize the parameters

532 Comparisons In this section stochastic game [16] andevolutionary game [20] are selected to conduct comparativeexperiments with this approach According to the differenceof attackerrsquos learning ability this section designs two groupsof comparative experiments In the first group the attackerrsquoslearning ability is weak and will not make adjustments to theattack-defense results In the second group the attackerrsquoslearning ability is strong and adopts the same learningmechanism as the approach in this paper In both groupsthe initial strategies of attackers were random strategies

In the first group of experiments the defense strategy ofthis approach is as shown in Figure 9(a) e defense strategiescalculated by the approach [16] are πd(s2 d3) 07πd(s2 d4) 03 e defense strategies of [20] are evolu-tionarily stable and balanced And the defense strategies of [20]are πd(s2 d3) 08 πd(s2 d4) 02 Its average earnings per1000 times change as shown in Figure 13

From the results of the strategies and benefits of the threeapproaches we can see that the approach in this paper canlearn from the attackerrsquos strategies and adjust to the optimalstrategy so the approach in this paper can obtain the highestbenefits Wei et al [16] adopted a fixed strategy whenconfronting any attacker When the attacker is constrainedby bounded rationality and does not adopt Nash equilibriumstrategy the benefit of this approach is low Although thelearning factors of both attackers and defenders are takeninto account in [20] the parameters required in the modelare difficult to quantify accurately which results in thedeviation between the final results and the actual results sothe benefit of the approach is still lower than that of theapproach in this paper

In the second group of experiments the results of[16 20] are still as πd(s2 d3) 07 πd(s2 d4) 03 andπd(s2 d3) 08 πd(s2 d4) 02 e decision making ofthis approach is shown in Figure 14 After about 1800 timesof learning the approach achieves stability and converges tothe same defense strategy as that of [16] As can be seen fromFigure 15 the payoff of [20] is lower than that of other twoapproaches e average payoff of this approach in the first2000 defenses is higher than that of [26] and then it is almostthe same as that of [26] Combining Figures 14 and 15 wecan see that the learning attacker cannot get Nash equi-librium strategy at the initial stage and the approach in thispaper is better than that in [26] When the attacker learns toget Nash equilibrium strategy the approach in this paper canconverge to Nash equilibrium strategy At this time theperformance of this approach is the same as that in [26]

In conclusion when facing the attackers with weaklearning ability the approach in this paper is superior to that

Security and Communication Networks 11

Table 3 Dierent parameter settings

Set α δl δw λ c

1 001 0004 0001 001 0012 01 0004 0001 001 0013 001 0004 0001 001 014 001 0004 0001 01 0015 001 004 001 001 0016 001 0008 0001 001 001

0 2000 4000 6000s2 defense number

s2 d

efen

se re

turn

10

08

06

04

02

00

d3d4

(a)

s2 d

efen

se re

turn

10

08

06

04

02

000 2000 4000 6000

s2 defense number

d3d4

(b)

s2 d

efen

se re

turn

10

08

06

04

02

000 2000 4000 6000

s2 defense number

d3d4

(c)

s2 d

efen

se re

turn

10

08

06

04

02

000 2000 4000 6000

s2 defense number

d3d4

(d)

s2 d

efen

se re

turn

10

08

06

04

02

000 2000 4000 6000

s2 defense number

d3d4

(e)

s2 d

efen

se re

turn

10

08

06

04

02

000 2000 4000 6000

s2 defense number

d3d4

(f )

Figure 9 Defense decision making under dierent parameter settings (a) Defense decision under setting 1 (b) Defense decision undersetting 2 (c) Defense decision under setting 3 (d) Defense decision under setting 4 (e) Defense decision under setting 5 (f ) Defensedecision under setting 6

5 10 15 20 25s2 defense number (times1000)

s2 d

efen

se re

turn

ndash10

ndash15

ndash20

ndash25

ndash30

Set 1Set 2Set 3

Set 4Set 5Set 6

Figure 10 Defense payo under dierent parameter settings

12 Security and Communication Networks

in [16 20] When facing an attacker with strong learningability if the attacker has not obtained Nash equilibriumthrough learning this approach is still better than [16 20] Ifthe attacker obtains Nash equilibrium through learning thispaper can also obtain the same Nash equilibrium strategy as[26] and obtain its phase e same eect is superior to thatin [20]

533 Test Comparison with and without Eligibility Traceis section tests the actual eect of the eligibility traces onthe Algorithm 2 e eect of eligibility traces on strategyselection is shown in Figure 16 from which we can see thatthe learning speed of the algorithm is faster when thequalied trace is available After 1000 times of learning the

algorithm can converge to the optimal strategy When thequalied trace is not available the algorithm needs about2500 times of learning to converge

Average earnings per 1000 times change as shown inFigure 17 from which we can see that the benets of thealgorithm are almost the same when there is or not anyqualied trace after convergence From Figure 17 we can seethat 3000 defenses before convergence have higher returnsfrom qualied trails than those from unqualied trails Inorder to further verify this the average of the rst 3000defense gains under qualied trails and unqualied trails iscounted 10 times each respectively e results are shown inFigure 18 which further proves that in the preconvergencedefense phase qualied traces are better than unqualiedtraces

e addition of eligibility trace accelerates the learningspeed but also brings additional memory and computingoverhead In the experiment only 10 state-action pairs thatwere recently accessed were saved which eectively reduced

Set 1 Set 2 Set 3 Set 4 Set 5 Set 6

ndash80

ndash85

ndash90

Figure 11 Average return under dierent parameter settings

Set 1 Set 2 Set 3 Set 4 Set 5 Set 6

80

75

Figure 12 Standard deviation of defense payo under dierentparameter settings

ndash10

ndash20

ndash30

ndash40

s2 d

efen

se re

turn

5 10 15 20 25

Our methodReference [12]Reference [16]

s2 defense number (times1000)

Figure 13 Comparison of defense payo change

0 2000 4000 6000s2 defense number

s2 d

efen

se st

rate

gy

08

06

04

02

00

d3d4

Figure 14 Defense decision making of our approach

Our methodReference [12]Reference [16]

5 10 15 20 25s2 defense number (times1000)

s2 d

efen

se re

turn

ndash10

ndash20

ndash30

ndash40

Figure 15 Comparison of defense payo change

Security and Communication Networks 13

0 2000 4000 6000s2 defense number

s2 d

efen

se st

rate

gy

100

075

050

025

000

Eligibility trace d3Eligibility trace d4

No eligibility trace d3No eligibility trace d4

Figure 16 Comparison of defense decision making

Eligibility traceNo eligibility trace

0 5 10 15 20 25s2 defense number (times1000)

s2 d

efen

se re

turn

ndash10

ndash20

ndash15

ndash30

ndash25

Figure 17 Comparison of defense payo change

Eligibility traceNo eligibility trace

1 2 3 4 5 6 7 8 9 10s2 defense number (times3000)

s2 av

erag

e def

ense

retu

rn

ndash12

ndash14

ndash16

ndash18

ndash20

Figure 18 Average payo comparison of the rst 3000 defenses

14 Security and Communication Networks

the increase of memory consumption In order to test thecomputational cost of qualified trail the time of 100000defense decisions made by the algorithm was counted for 20times with and without qualified trail e average of 20times was 951s for qualified trail and 374 s for unqualifiedtrail Although the introduction of eligibility traces willincrease the decision-making time by nearly 25 times thetime required for 100000 decisions after the introduction ofeligibility traces is still only 951 s which can still meet thereal-time requirements

In summary the introduction of eligibility trace at theexpense of a small amount of memory and computingoverhead can effectively increase the learning speed of thealgorithm and improve the defense gains

54 Comprehensive Comparisons is approach is com-pared with some typical research results as shown in Table 4[3 12 14 16] is based on the assumption of complete ra-tionality e equilibrium strategy obtained by it is difficultto appear in practice and has a low guiding effect on actualdefense decision making [20] and this paper have morepracticability on the premise of bounded rationality hy-pothesis but [20] is based on the theory of biologicalevolution and mainly studies population evolution e coreof game analysis is not the optimal strategy choice of playersbut the strategy adjustment process trend and stability ofgroup members composed of bounded rational players andthe stability here refers to group members is approach isnot suitable for guiding individual real-time decisionmaking because the proportion of specific strategies isunchanged and not the strategy of a player On the contrarythe defender of the proposed approach adopts reinforcementlearning mechanism which is based on systematic feedbackto learn in the confrontation with the attacker and is moresuitable for the study of individual strategies

6 Conclusions and Future Works

In this paper cyber attack-defense confrontation is ab-stracted as a stochastic game problem under the restrictionof limited rationality A host-centered attack-defense graphmodel is proposed to extract network state and attack-de-fense action and an algorithm to generate attack-defensegraph is designed to effectively compress the game statespace e WoLF-PHC-based defense decision approach isproposed to overcome the problem which enables defendersunder bounded rationality to make optimal choices whenfacing different attackers e lattice improves the WoLF-

PHC algorithm speeds up the defenderrsquos learning speed andreduces the algorithmrsquos dependence on data is approachnot only satisfies the constraints of bounded rationality butalso does not require the defender to know too much in-formation about the attacker It is a more practical defensedecision-making approach

e future work is to further optimize the winning andlosing criteria of WoLF-PHC algorithm for specific attack-defense scenarios in order to speed up defense learning andincrease defense gains

Data Availability

e data that support the findings of this study are notpublicly available due to restrictions as the data containsensitive information about a real-world enterprise networkAccess to the dataset is restricted by the original ownerPeople who want to access the data should send a request tothe corresponding author who will apply for permission ofsharing the data from the original owner

Conflicts of Interest

e authors declare that they have no conflicts of interestregarding the publication of this paper

Acknowledgments

is work was supported by the National High TechnologyResearch and Development Program of China (863 Pro-gram) (2014AA7116082 and 2015AA7116040) and NationalNatural Science Foundation of China (61902427) In par-ticular we thank Junnan Yang and Hongqi Zhang of ourresearch group for their previous work on stochastic gameWe sincerely thank them for their help in publishing thispaper

References

[1] Z Tian W Shi Y Wang et al ldquoReal time lateral movementdetection based on evidence reasoning network for edgecomputing environmentrdquo IEEE Transactions on IndustrialInformatics vol 15 no 7 pp 4285ndash4294 2019

[2] Y Wang J Yu W Qu H Shen and X Cheng ldquoEvolutionarygame model and analysis approaches for network groupbehaviorrdquo Chinese Journal of Computers vol 38 no 2pp 282ndash300 2015

[3] J Liu H Zhang and Y Liu ldquoResearch on optimal selection ofmoving target defense policy based on dynamic game withincomplete informationrdquo Acta Electronica Sinica vol 46no 1 pp 82ndash29 2018

Table 4 Comprehensive comparison of existing approaches

Ref Game type Model assumption Learning mechanism Game process Applicable object Practicability[3] Stochastic game Rationality mdash Multistage Personal Bad[12] Static game Rationality mdash Single-stage Personal Bad[14] Dynamic game Rationality mdash Multistage Personal Bad[16] Stochastic game Rationality mdash Multistage Personal Bad[20] Evolutionary game Bounded rationality Biological evolution mdash Group GoodOur approach Stochastic game Bounded rationality Reinforcement learning Multistage Personal Good

Security and Communication Networks 15

[4] Z Tian X Gao S Su J Qiu X Du and M GuizanildquoEvaluating reputation management schemes of internet ofvehicles based on evolutionary game theoryrdquo IEEE Trans-actions on Vehicular Technology vol 68 no 6 pp 5971ndash59802019

[5] Q Zhu and S Rass ldquoGame theory meets network securitya tutorialrdquo in Proceedings of the ACM SIGSAC Conference onComputer and Communications Security pp 2163ndash2165Toronto Canada October 2018

[6] X Li C Zhou Y C Tian and Y Qin ldquoA dynamic decision-making approach for intrusion response in industrial controlsystemsrdquo IEEE Transactions on Industrial Informatics vol 15no 5 pp 2544ndash2554 2019

[7] H Hu Y Liu H Zhang and R Pan ldquoOptimal networkdefense strategy selection based on incomplete informationevolutionary gamerdquo IEEE Access vol 6 pp 29806ndash298212018

[8] J Li G Kendall and R John ldquoComputing Nash equilibriaand evolutionarily stable states of evolutionary gamesrdquo IEEETransactions on Evolutionary Computation vol 20 no 3pp 460ndash469 2016

[9] M Bowling and M Veloso Multiagent Learning Usinga Variable Learning Rate Elsevier Science Publishers LtdAmsterdam Netherlands 2002

[10] J Harb and D Precup ldquoInvestigating recurrence and eligi-bility traces in deep Q-networksrdquo httpsarxivorgabs170405495 2017

[11] C T Do N H Tran and C Hong ldquoGame theory for cybersecurity and privacyrdquo ACM Computing Surveys vol 50 no 2p 30 2017

[12] Y-L Liu D-G Feng L-HWu and Y-F Lian ldquoPerformanceevaluation of worm attack and defense strategies based onstatic Bayesian gamerdquo Chinese Journal of Software vol 23no 3 pp 712ndash723 2012

[13] Y Li D E Quevedo S Dey and L Shi ldquoA game-theoreticapproach to fake-acknowledgment attack on cyber-physicalsystemsrdquo IEEE Transactions on Signal and Information Pro-cessing over Networks vol 3 no 1 pp 1ndash11 2017

[14] H Zhang L Jiang S Huang J Wang and Y Zhang ldquoAttack-defense differential game model for network defense strategyselectionrdquo IEEE Access vol 7 pp 50618ndash50629 2019

[15] A Afrand and S K Das ldquoPreventing DoS attacks in wirelesssensor networks a repeated game theory approachrdquo InternetJournal of Cyber Security vol 5 no 2 pp 145ndash153 2007

[16] J Wei B Fang and Z Tian ldquoResearch on defense strategiesselection based on attack-defense stochastic game modelrdquoChinese Journal of Computer Research and Developmentvol 47 no 10 pp 1714ndash1723 2010

[17] C Wang X Cheng and Y Zhu ldquoA Markov game model ofcomputer network operationrdquo Chinese Systems Engineering-gteory and Practice vol 34 no 9 pp 2402ndash2410 2014

[18] J Hofbauer and K Sigmund ldquoEvolutionary game dynamicsrdquoBulletin of the American Mathematical Society vol 40 no 4pp 479ndash519 2011

[19] Y Hayel and Q Zhu ldquoEpidemic protection over heteroge-neous networks using evolutionary Poisson gamesrdquo IEEETransactions on Information Forensics and Security vol 12no 8 pp 1786ndash1800 2017

[20] J Huang and H Zhang ldquoImproving replicator dynamicevolutionary game model for selecting optimal defensestrategiesrdquo Chinese Journal on Communications vol 38 no 1pp 170ndash182 2018

[21] H Hu Y Liu H Zhang and Y Zhang ldquoSecurity metricmethods for network multistep attacks using AMC and big

data correlation analysisrdquo Security and CommunicationNetworks vol 2018 Article ID 5787102 14 pages 2018

[22] H Hu Y Liu Y Yang H Zhang and Y Zhang ldquoNew in-sights into approaches to evaluating intention and path fornetwork multistep attacksrdquo Mathematical Problems in En-gineering vol 2018 Article ID 4278632 13 pages 2018

[23] H Hu H Zhang Y Liu and Y Wang ldquoQuantitative methodfor network security situation based on attack predictionrdquoSecurity and Communication Networks vol 2017 Article ID3407642 19 pages 2017

[24] M Zimmer and S Doncieux ldquoBootstrapping Q-learning forrobotics from neuro-evolution resultsrdquo IEEE Transactions onCognitive and Developmental Systems vol 10 no 1pp 102ndash119 2018

[25] R S Sutton and A G Barto ldquoReinforcement learning anintroduction Bradford bookrdquo Machine Learning vol 16no 1 pp 285-286 2005

[26] H Zhang J Wang D Yu J Han and T Li ldquoActive defensestrategy selection based on static Bayesian gamerdquo in Pro-ceedings of IET International Conference on CyberspaceTechnology pp 1ndash7 London UK October 2016

16 Security and Communication Networks

International Journal of

AerospaceEngineeringHindawiwwwhindawicom Volume 2018

RoboticsJournal of

Hindawiwwwhindawicom Volume 2018

Hindawiwwwhindawicom Volume 2018

Active and Passive Electronic Components

VLSI Design

Hindawiwwwhindawicom Volume 2018

Hindawiwwwhindawicom Volume 2018

Shock and Vibration

Hindawiwwwhindawicom Volume 2018

Civil EngineeringAdvances in

Acoustics and VibrationAdvances in

Hindawiwwwhindawicom Volume 2018

Hindawiwwwhindawicom Volume 2018

Electrical and Computer Engineering

Journal of

Advances inOptoElectronics

Hindawiwwwhindawicom

Volume 2018

Hindawi Publishing Corporation httpwwwhindawicom Volume 2013Hindawiwwwhindawicom

The Scientific World Journal

Volume 2018

Control Scienceand Engineering

Journal of

Hindawiwwwhindawicom Volume 2018

Hindawiwwwhindawicom

Journal ofEngineeringVolume 2018

SensorsJournal of

Hindawiwwwhindawicom Volume 2018

International Journal of

RotatingMachinery

Hindawiwwwhindawicom Volume 2018

Modelling ampSimulationin EngineeringHindawiwwwhindawicom Volume 2018

Hindawiwwwhindawicom Volume 2018

Chemical EngineeringInternational Journal of Antennas and

Propagation

International Journal of

Hindawiwwwhindawicom Volume 2018

Hindawiwwwhindawicom Volume 2018

Navigation and Observation

International Journal of

Hindawi

wwwhindawicom Volume 2018

Advances in

Multimedia

Submit your manuscripts atwwwhindawicom

Page 11: Research Article - Hindawi Publishing Corporationdownloads.hindawi.com/journals/scn/2019/3038586.pdf · Afrand and Das [15] established a repeated game model between the intrusion

(5) In order to detect the learning performance of theAlgorithm 2 more fully the defenderrsquos state-actionpayoffQd(si d) is initialized with a unified 0 withoutintroducing additional prior knowledge

(6) Defenderrsquos defense strategy adopts average strategyto initialize that is πd(sk d1) πd(sk d2) middot middot middot

πd(sk dm) 1113936disinD(sk)πd(sk d) 1forallsk isin S where noadditional prior knowledge is introduced

53 Testing and Analysis e experiment in this section hasthree purposes e first is to test the influence of differentparameter settings on the proposed Algorithm 2 so as to findout the experimental parameters suitable for this scenarioe second is to compare this approach with the existingtypical approaches to verify the advancement of this ap-proach e third is to test the effectiveness of WoLF-PHCalgorithm improvement based on eligibility trace

From Figures 7 and 8 we can see that the state of attack-defense strategy selection is the most complex and repre-sentative erefore the performance of the algorithm isanalyzed by the experimental state selection and the othernetwork state analysis approaches are the same

531 Parameter Test and Analysis Different parameters willaffect the speed and effect of learning At present there is norelevant theory to determine the specific parameters InSection 4 the relevant parameters are preliminarily ana-lyzed On this basis the different parameter settings arefurther tested to find the parameter settings suitable for thisattack-defense scenario Six different parameter settingswere tested Specific parameter settings are shown in Table 3In the experiment the attackerrsquos initial strategy is randomstrategy and the learning mechanism is the same as theapproach in this paper

e probability of the defenderrsquos choice of defense ac-tions and sums in state is shown in Figure 9 e learningspeed and convergence of the algorithm under differentparameter settings can be observed from Figure 9 whichshows that the learning speed of settings 1 3 and 6 is fasterand the best strategy can be obtained after learning less than1500 times under the three settings but convergence of 3and 6 is poor Although the best strategy can be learned bysettings 3 and 6 there will be oscillation afterwards and thestability of setting 1 is not suitable

Defense payoff can represent the degree of optimizationof the strategy In order to ensure that the payoff value doesnot reflect only one defense result the average of 1000defense gains is taken and the change of the average payoffper 1000 defense gains is shown in Figure 10 As can be seenfrom Figure 10 the benefits of Set 3 are significantly lowerthan those of other settings but the advantages and dis-advantages of other settings are difficult to distinguish Inorder to display more intuitively the average value of 25000defense gains calculated under different settings in Figure 10is shown in Figure 11 From Figure 11 we can see that theaverage value of settings 1 and 5 is higher For furthercomparison the standard deviation of settings 1 and 5 is

calculated one step on the basis of the average value to reflectthe discreteness of the gains As shown in Figure 12 thestandard deviations of setting 1 and setting 6 are smallMoreover the result of setting 1 is smaller than setting 6

In conclusion setting 1 of the six sets of parameters is themost suitable for this scenario Since setting 1 has achievedan ideal effect and can meet the experimental requirementsit is no longer necessary to further optimize the parameters

532 Comparisons In this section stochastic game [16] andevolutionary game [20] are selected to conduct comparativeexperiments with this approach According to the differenceof attackerrsquos learning ability this section designs two groupsof comparative experiments In the first group the attackerrsquoslearning ability is weak and will not make adjustments to theattack-defense results In the second group the attackerrsquoslearning ability is strong and adopts the same learningmechanism as the approach in this paper In both groupsthe initial strategies of attackers were random strategies

In the first group of experiments the defense strategy ofthis approach is as shown in Figure 9(a) e defense strategiescalculated by the approach [16] are πd(s2 d3) 07πd(s2 d4) 03 e defense strategies of [20] are evolu-tionarily stable and balanced And the defense strategies of [20]are πd(s2 d3) 08 πd(s2 d4) 02 Its average earnings per1000 times change as shown in Figure 13

From the results of the strategies and benefits of the threeapproaches we can see that the approach in this paper canlearn from the attackerrsquos strategies and adjust to the optimalstrategy so the approach in this paper can obtain the highestbenefits Wei et al [16] adopted a fixed strategy whenconfronting any attacker When the attacker is constrainedby bounded rationality and does not adopt Nash equilibriumstrategy the benefit of this approach is low Although thelearning factors of both attackers and defenders are takeninto account in [20] the parameters required in the modelare difficult to quantify accurately which results in thedeviation between the final results and the actual results sothe benefit of the approach is still lower than that of theapproach in this paper

In the second group of experiments the results of[16 20] are still as πd(s2 d3) 07 πd(s2 d4) 03 andπd(s2 d3) 08 πd(s2 d4) 02 e decision making ofthis approach is shown in Figure 14 After about 1800 timesof learning the approach achieves stability and converges tothe same defense strategy as that of [16] As can be seen fromFigure 15 the payoff of [20] is lower than that of other twoapproaches e average payoff of this approach in the first2000 defenses is higher than that of [26] and then it is almostthe same as that of [26] Combining Figures 14 and 15 wecan see that the learning attacker cannot get Nash equi-librium strategy at the initial stage and the approach in thispaper is better than that in [26] When the attacker learns toget Nash equilibrium strategy the approach in this paper canconverge to Nash equilibrium strategy At this time theperformance of this approach is the same as that in [26]

In conclusion when facing the attackers with weaklearning ability the approach in this paper is superior to that

Security and Communication Networks 11

Table 3 Dierent parameter settings

Set α δl δw λ c

1 001 0004 0001 001 0012 01 0004 0001 001 0013 001 0004 0001 001 014 001 0004 0001 01 0015 001 004 001 001 0016 001 0008 0001 001 001

0 2000 4000 6000s2 defense number

s2 d

efen

se re

turn

10

08

06

04

02

00

d3d4

(a)

s2 d

efen

se re

turn

10

08

06

04

02

000 2000 4000 6000

s2 defense number

d3d4

(b)

s2 d

efen

se re

turn

10

08

06

04

02

000 2000 4000 6000

s2 defense number

d3d4

(c)

s2 d

efen

se re

turn

10

08

06

04

02

000 2000 4000 6000

s2 defense number

d3d4

(d)

s2 d

efen

se re

turn

10

08

06

04

02

000 2000 4000 6000

s2 defense number

d3d4

(e)

s2 d

efen

se re

turn

10

08

06

04

02

000 2000 4000 6000

s2 defense number

d3d4

(f )

Figure 9 Defense decision making under dierent parameter settings (a) Defense decision under setting 1 (b) Defense decision undersetting 2 (c) Defense decision under setting 3 (d) Defense decision under setting 4 (e) Defense decision under setting 5 (f ) Defensedecision under setting 6

5 10 15 20 25s2 defense number (times1000)

s2 d

efen

se re

turn

ndash10

ndash15

ndash20

ndash25

ndash30

Set 1Set 2Set 3

Set 4Set 5Set 6

Figure 10 Defense payo under dierent parameter settings

12 Security and Communication Networks

in [16 20] When facing an attacker with strong learningability if the attacker has not obtained Nash equilibriumthrough learning this approach is still better than [16 20] Ifthe attacker obtains Nash equilibrium through learning thispaper can also obtain the same Nash equilibrium strategy as[26] and obtain its phase e same eect is superior to thatin [20]

533 Test Comparison with and without Eligibility Traceis section tests the actual eect of the eligibility traces onthe Algorithm 2 e eect of eligibility traces on strategyselection is shown in Figure 16 from which we can see thatthe learning speed of the algorithm is faster when thequalied trace is available After 1000 times of learning the

algorithm can converge to the optimal strategy When thequalied trace is not available the algorithm needs about2500 times of learning to converge

Average earnings per 1000 times change as shown inFigure 17 from which we can see that the benets of thealgorithm are almost the same when there is or not anyqualied trace after convergence From Figure 17 we can seethat 3000 defenses before convergence have higher returnsfrom qualied trails than those from unqualied trails Inorder to further verify this the average of the rst 3000defense gains under qualied trails and unqualied trails iscounted 10 times each respectively e results are shown inFigure 18 which further proves that in the preconvergencedefense phase qualied traces are better than unqualiedtraces

e addition of eligibility trace accelerates the learningspeed but also brings additional memory and computingoverhead In the experiment only 10 state-action pairs thatwere recently accessed were saved which eectively reduced

Set 1 Set 2 Set 3 Set 4 Set 5 Set 6

ndash80

ndash85

ndash90

Figure 11 Average return under dierent parameter settings

Set 1 Set 2 Set 3 Set 4 Set 5 Set 6

80

75

Figure 12 Standard deviation of defense payo under dierentparameter settings

ndash10

ndash20

ndash30

ndash40

s2 d

efen

se re

turn

5 10 15 20 25

Our methodReference [12]Reference [16]

s2 defense number (times1000)

Figure 13 Comparison of defense payo change

0 2000 4000 6000s2 defense number

s2 d

efen

se st

rate

gy

08

06

04

02

00

d3d4

Figure 14 Defense decision making of our approach

Our methodReference [12]Reference [16]

5 10 15 20 25s2 defense number (times1000)

s2 d

efen

se re

turn

ndash10

ndash20

ndash30

ndash40

Figure 15 Comparison of defense payo change

Security and Communication Networks 13

0 2000 4000 6000s2 defense number

s2 d

efen

se st

rate

gy

100

075

050

025

000

Eligibility trace d3Eligibility trace d4

No eligibility trace d3No eligibility trace d4

Figure 16 Comparison of defense decision making

Eligibility traceNo eligibility trace

0 5 10 15 20 25s2 defense number (times1000)

s2 d

efen

se re

turn

ndash10

ndash20

ndash15

ndash30

ndash25

Figure 17 Comparison of defense payo change

Eligibility traceNo eligibility trace

1 2 3 4 5 6 7 8 9 10s2 defense number (times3000)

s2 av

erag

e def

ense

retu

rn

ndash12

ndash14

ndash16

ndash18

ndash20

Figure 18 Average payo comparison of the rst 3000 defenses

14 Security and Communication Networks

the increase of memory consumption In order to test thecomputational cost of qualified trail the time of 100000defense decisions made by the algorithm was counted for 20times with and without qualified trail e average of 20times was 951s for qualified trail and 374 s for unqualifiedtrail Although the introduction of eligibility traces willincrease the decision-making time by nearly 25 times thetime required for 100000 decisions after the introduction ofeligibility traces is still only 951 s which can still meet thereal-time requirements

In summary the introduction of eligibility trace at theexpense of a small amount of memory and computingoverhead can effectively increase the learning speed of thealgorithm and improve the defense gains

54 Comprehensive Comparisons is approach is com-pared with some typical research results as shown in Table 4[3 12 14 16] is based on the assumption of complete ra-tionality e equilibrium strategy obtained by it is difficultto appear in practice and has a low guiding effect on actualdefense decision making [20] and this paper have morepracticability on the premise of bounded rationality hy-pothesis but [20] is based on the theory of biologicalevolution and mainly studies population evolution e coreof game analysis is not the optimal strategy choice of playersbut the strategy adjustment process trend and stability ofgroup members composed of bounded rational players andthe stability here refers to group members is approach isnot suitable for guiding individual real-time decisionmaking because the proportion of specific strategies isunchanged and not the strategy of a player On the contrarythe defender of the proposed approach adopts reinforcementlearning mechanism which is based on systematic feedbackto learn in the confrontation with the attacker and is moresuitable for the study of individual strategies

6 Conclusions and Future Works

In this paper cyber attack-defense confrontation is ab-stracted as a stochastic game problem under the restrictionof limited rationality A host-centered attack-defense graphmodel is proposed to extract network state and attack-de-fense action and an algorithm to generate attack-defensegraph is designed to effectively compress the game statespace e WoLF-PHC-based defense decision approach isproposed to overcome the problem which enables defendersunder bounded rationality to make optimal choices whenfacing different attackers e lattice improves the WoLF-

PHC algorithm speeds up the defenderrsquos learning speed andreduces the algorithmrsquos dependence on data is approachnot only satisfies the constraints of bounded rationality butalso does not require the defender to know too much in-formation about the attacker It is a more practical defensedecision-making approach

e future work is to further optimize the winning andlosing criteria of WoLF-PHC algorithm for specific attack-defense scenarios in order to speed up defense learning andincrease defense gains

Data Availability

e data that support the findings of this study are notpublicly available due to restrictions as the data containsensitive information about a real-world enterprise networkAccess to the dataset is restricted by the original ownerPeople who want to access the data should send a request tothe corresponding author who will apply for permission ofsharing the data from the original owner

Conflicts of Interest

e authors declare that they have no conflicts of interestregarding the publication of this paper

Acknowledgments

is work was supported by the National High TechnologyResearch and Development Program of China (863 Pro-gram) (2014AA7116082 and 2015AA7116040) and NationalNatural Science Foundation of China (61902427) In par-ticular we thank Junnan Yang and Hongqi Zhang of ourresearch group for their previous work on stochastic gameWe sincerely thank them for their help in publishing thispaper

References

[1] Z Tian W Shi Y Wang et al ldquoReal time lateral movementdetection based on evidence reasoning network for edgecomputing environmentrdquo IEEE Transactions on IndustrialInformatics vol 15 no 7 pp 4285ndash4294 2019

[2] Y Wang J Yu W Qu H Shen and X Cheng ldquoEvolutionarygame model and analysis approaches for network groupbehaviorrdquo Chinese Journal of Computers vol 38 no 2pp 282ndash300 2015

[3] J Liu H Zhang and Y Liu ldquoResearch on optimal selection ofmoving target defense policy based on dynamic game withincomplete informationrdquo Acta Electronica Sinica vol 46no 1 pp 82ndash29 2018

Table 4 Comprehensive comparison of existing approaches

Ref Game type Model assumption Learning mechanism Game process Applicable object Practicability[3] Stochastic game Rationality mdash Multistage Personal Bad[12] Static game Rationality mdash Single-stage Personal Bad[14] Dynamic game Rationality mdash Multistage Personal Bad[16] Stochastic game Rationality mdash Multistage Personal Bad[20] Evolutionary game Bounded rationality Biological evolution mdash Group GoodOur approach Stochastic game Bounded rationality Reinforcement learning Multistage Personal Good

Security and Communication Networks 15

[4] Z Tian X Gao S Su J Qiu X Du and M GuizanildquoEvaluating reputation management schemes of internet ofvehicles based on evolutionary game theoryrdquo IEEE Trans-actions on Vehicular Technology vol 68 no 6 pp 5971ndash59802019

[5] Q Zhu and S Rass ldquoGame theory meets network securitya tutorialrdquo in Proceedings of the ACM SIGSAC Conference onComputer and Communications Security pp 2163ndash2165Toronto Canada October 2018

[6] X Li C Zhou Y C Tian and Y Qin ldquoA dynamic decision-making approach for intrusion response in industrial controlsystemsrdquo IEEE Transactions on Industrial Informatics vol 15no 5 pp 2544ndash2554 2019

[7] H Hu Y Liu H Zhang and R Pan ldquoOptimal networkdefense strategy selection based on incomplete informationevolutionary gamerdquo IEEE Access vol 6 pp 29806ndash298212018

[8] J Li G Kendall and R John ldquoComputing Nash equilibriaand evolutionarily stable states of evolutionary gamesrdquo IEEETransactions on Evolutionary Computation vol 20 no 3pp 460ndash469 2016

[9] M Bowling and M Veloso Multiagent Learning Usinga Variable Learning Rate Elsevier Science Publishers LtdAmsterdam Netherlands 2002

[10] J Harb and D Precup ldquoInvestigating recurrence and eligi-bility traces in deep Q-networksrdquo httpsarxivorgabs170405495 2017

[11] C T Do N H Tran and C Hong ldquoGame theory for cybersecurity and privacyrdquo ACM Computing Surveys vol 50 no 2p 30 2017

[12] Y-L Liu D-G Feng L-HWu and Y-F Lian ldquoPerformanceevaluation of worm attack and defense strategies based onstatic Bayesian gamerdquo Chinese Journal of Software vol 23no 3 pp 712ndash723 2012

[13] Y Li D E Quevedo S Dey and L Shi ldquoA game-theoreticapproach to fake-acknowledgment attack on cyber-physicalsystemsrdquo IEEE Transactions on Signal and Information Pro-cessing over Networks vol 3 no 1 pp 1ndash11 2017

[14] H Zhang L Jiang S Huang J Wang and Y Zhang ldquoAttack-defense differential game model for network defense strategyselectionrdquo IEEE Access vol 7 pp 50618ndash50629 2019

[15] A Afrand and S K Das ldquoPreventing DoS attacks in wirelesssensor networks a repeated game theory approachrdquo InternetJournal of Cyber Security vol 5 no 2 pp 145ndash153 2007

[16] J Wei B Fang and Z Tian ldquoResearch on defense strategiesselection based on attack-defense stochastic game modelrdquoChinese Journal of Computer Research and Developmentvol 47 no 10 pp 1714ndash1723 2010

[17] C Wang X Cheng and Y Zhu ldquoA Markov game model ofcomputer network operationrdquo Chinese Systems Engineering-gteory and Practice vol 34 no 9 pp 2402ndash2410 2014

[18] J Hofbauer and K Sigmund ldquoEvolutionary game dynamicsrdquoBulletin of the American Mathematical Society vol 40 no 4pp 479ndash519 2011

[19] Y Hayel and Q Zhu ldquoEpidemic protection over heteroge-neous networks using evolutionary Poisson gamesrdquo IEEETransactions on Information Forensics and Security vol 12no 8 pp 1786ndash1800 2017

[20] J Huang and H Zhang ldquoImproving replicator dynamicevolutionary game model for selecting optimal defensestrategiesrdquo Chinese Journal on Communications vol 38 no 1pp 170ndash182 2018

[21] H Hu Y Liu H Zhang and Y Zhang ldquoSecurity metricmethods for network multistep attacks using AMC and big

data correlation analysisrdquo Security and CommunicationNetworks vol 2018 Article ID 5787102 14 pages 2018

[22] H Hu Y Liu Y Yang H Zhang and Y Zhang ldquoNew in-sights into approaches to evaluating intention and path fornetwork multistep attacksrdquo Mathematical Problems in En-gineering vol 2018 Article ID 4278632 13 pages 2018

[23] H Hu H Zhang Y Liu and Y Wang ldquoQuantitative methodfor network security situation based on attack predictionrdquoSecurity and Communication Networks vol 2017 Article ID3407642 19 pages 2017

[24] M Zimmer and S Doncieux ldquoBootstrapping Q-learning forrobotics from neuro-evolution resultsrdquo IEEE Transactions onCognitive and Developmental Systems vol 10 no 1pp 102ndash119 2018

[25] R S Sutton and A G Barto ldquoReinforcement learning anintroduction Bradford bookrdquo Machine Learning vol 16no 1 pp 285-286 2005

[26] H Zhang J Wang D Yu J Han and T Li ldquoActive defensestrategy selection based on static Bayesian gamerdquo in Pro-ceedings of IET International Conference on CyberspaceTechnology pp 1ndash7 London UK October 2016

16 Security and Communication Networks

International Journal of

AerospaceEngineeringHindawiwwwhindawicom Volume 2018

RoboticsJournal of

Hindawiwwwhindawicom Volume 2018

Hindawiwwwhindawicom Volume 2018

Active and Passive Electronic Components

VLSI Design

Hindawiwwwhindawicom Volume 2018

Hindawiwwwhindawicom Volume 2018

Shock and Vibration

Hindawiwwwhindawicom Volume 2018

Civil EngineeringAdvances in

Acoustics and VibrationAdvances in

Hindawiwwwhindawicom Volume 2018

Hindawiwwwhindawicom Volume 2018

Electrical and Computer Engineering

Journal of

Advances inOptoElectronics

Hindawiwwwhindawicom

Volume 2018

Hindawi Publishing Corporation httpwwwhindawicom Volume 2013Hindawiwwwhindawicom

The Scientific World Journal

Volume 2018

Control Scienceand Engineering

Journal of

Hindawiwwwhindawicom Volume 2018

Hindawiwwwhindawicom

Journal ofEngineeringVolume 2018

SensorsJournal of

Hindawiwwwhindawicom Volume 2018

International Journal of

RotatingMachinery

Hindawiwwwhindawicom Volume 2018

Modelling ampSimulationin EngineeringHindawiwwwhindawicom Volume 2018

Hindawiwwwhindawicom Volume 2018

Chemical EngineeringInternational Journal of Antennas and

Propagation

International Journal of

Hindawiwwwhindawicom Volume 2018

Hindawiwwwhindawicom Volume 2018

Navigation and Observation

International Journal of

Hindawi

wwwhindawicom Volume 2018

Advances in

Multimedia

Submit your manuscripts atwwwhindawicom

Page 12: Research Article - Hindawi Publishing Corporationdownloads.hindawi.com/journals/scn/2019/3038586.pdf · Afrand and Das [15] established a repeated game model between the intrusion

Table 3 Dierent parameter settings

Set α δl δw λ c

1 001 0004 0001 001 0012 01 0004 0001 001 0013 001 0004 0001 001 014 001 0004 0001 01 0015 001 004 001 001 0016 001 0008 0001 001 001

0 2000 4000 6000s2 defense number

s2 d

efen

se re

turn

10

08

06

04

02

00

d3d4

(a)

s2 d

efen

se re

turn

10

08

06

04

02

000 2000 4000 6000

s2 defense number

d3d4

(b)

s2 d

efen

se re

turn

10

08

06

04

02

000 2000 4000 6000

s2 defense number

d3d4

(c)

s2 d

efen

se re

turn

10

08

06

04

02

000 2000 4000 6000

s2 defense number

d3d4

(d)

s2 d

efen

se re

turn

10

08

06

04

02

000 2000 4000 6000

s2 defense number

d3d4

(e)

s2 d

efen

se re

turn

10

08

06

04

02

000 2000 4000 6000

s2 defense number

d3d4

(f )

Figure 9 Defense decision making under dierent parameter settings (a) Defense decision under setting 1 (b) Defense decision undersetting 2 (c) Defense decision under setting 3 (d) Defense decision under setting 4 (e) Defense decision under setting 5 (f ) Defensedecision under setting 6

5 10 15 20 25s2 defense number (times1000)

s2 d

efen

se re

turn

ndash10

ndash15

ndash20

ndash25

ndash30

Set 1Set 2Set 3

Set 4Set 5Set 6

Figure 10 Defense payo under dierent parameter settings

12 Security and Communication Networks

in [16 20] When facing an attacker with strong learningability if the attacker has not obtained Nash equilibriumthrough learning this approach is still better than [16 20] Ifthe attacker obtains Nash equilibrium through learning thispaper can also obtain the same Nash equilibrium strategy as[26] and obtain its phase e same eect is superior to thatin [20]

533 Test Comparison with and without Eligibility Traceis section tests the actual eect of the eligibility traces onthe Algorithm 2 e eect of eligibility traces on strategyselection is shown in Figure 16 from which we can see thatthe learning speed of the algorithm is faster when thequalied trace is available After 1000 times of learning the

algorithm can converge to the optimal strategy When thequalied trace is not available the algorithm needs about2500 times of learning to converge

Average earnings per 1000 times change as shown inFigure 17 from which we can see that the benets of thealgorithm are almost the same when there is or not anyqualied trace after convergence From Figure 17 we can seethat 3000 defenses before convergence have higher returnsfrom qualied trails than those from unqualied trails Inorder to further verify this the average of the rst 3000defense gains under qualied trails and unqualied trails iscounted 10 times each respectively e results are shown inFigure 18 which further proves that in the preconvergencedefense phase qualied traces are better than unqualiedtraces

e addition of eligibility trace accelerates the learningspeed but also brings additional memory and computingoverhead In the experiment only 10 state-action pairs thatwere recently accessed were saved which eectively reduced

Set 1 Set 2 Set 3 Set 4 Set 5 Set 6

ndash80

ndash85

ndash90

Figure 11 Average return under dierent parameter settings

Set 1 Set 2 Set 3 Set 4 Set 5 Set 6

80

75

Figure 12 Standard deviation of defense payo under dierentparameter settings

ndash10

ndash20

ndash30

ndash40

s2 d

efen

se re

turn

5 10 15 20 25

Our methodReference [12]Reference [16]

s2 defense number (times1000)

Figure 13 Comparison of defense payo change

0 2000 4000 6000s2 defense number

s2 d

efen

se st

rate

gy

08

06

04

02

00

d3d4

Figure 14 Defense decision making of our approach

Our methodReference [12]Reference [16]

5 10 15 20 25s2 defense number (times1000)

s2 d

efen

se re

turn

ndash10

ndash20

ndash30

ndash40

Figure 15 Comparison of defense payo change

Security and Communication Networks 13

0 2000 4000 6000s2 defense number

s2 d

efen

se st

rate

gy

100

075

050

025

000

Eligibility trace d3Eligibility trace d4

No eligibility trace d3No eligibility trace d4

Figure 16 Comparison of defense decision making

Eligibility traceNo eligibility trace

0 5 10 15 20 25s2 defense number (times1000)

s2 d

efen

se re

turn

ndash10

ndash20

ndash15

ndash30

ndash25

Figure 17 Comparison of defense payo change

Eligibility traceNo eligibility trace

1 2 3 4 5 6 7 8 9 10s2 defense number (times3000)

s2 av

erag

e def

ense

retu

rn

ndash12

ndash14

ndash16

ndash18

ndash20

Figure 18 Average payo comparison of the rst 3000 defenses

14 Security and Communication Networks

the increase of memory consumption In order to test thecomputational cost of qualified trail the time of 100000defense decisions made by the algorithm was counted for 20times with and without qualified trail e average of 20times was 951s for qualified trail and 374 s for unqualifiedtrail Although the introduction of eligibility traces willincrease the decision-making time by nearly 25 times thetime required for 100000 decisions after the introduction ofeligibility traces is still only 951 s which can still meet thereal-time requirements

In summary the introduction of eligibility trace at theexpense of a small amount of memory and computingoverhead can effectively increase the learning speed of thealgorithm and improve the defense gains

54 Comprehensive Comparisons is approach is com-pared with some typical research results as shown in Table 4[3 12 14 16] is based on the assumption of complete ra-tionality e equilibrium strategy obtained by it is difficultto appear in practice and has a low guiding effect on actualdefense decision making [20] and this paper have morepracticability on the premise of bounded rationality hy-pothesis but [20] is based on the theory of biologicalevolution and mainly studies population evolution e coreof game analysis is not the optimal strategy choice of playersbut the strategy adjustment process trend and stability ofgroup members composed of bounded rational players andthe stability here refers to group members is approach isnot suitable for guiding individual real-time decisionmaking because the proportion of specific strategies isunchanged and not the strategy of a player On the contrarythe defender of the proposed approach adopts reinforcementlearning mechanism which is based on systematic feedbackto learn in the confrontation with the attacker and is moresuitable for the study of individual strategies

6 Conclusions and Future Works

In this paper cyber attack-defense confrontation is ab-stracted as a stochastic game problem under the restrictionof limited rationality A host-centered attack-defense graphmodel is proposed to extract network state and attack-de-fense action and an algorithm to generate attack-defensegraph is designed to effectively compress the game statespace e WoLF-PHC-based defense decision approach isproposed to overcome the problem which enables defendersunder bounded rationality to make optimal choices whenfacing different attackers e lattice improves the WoLF-

PHC algorithm speeds up the defenderrsquos learning speed andreduces the algorithmrsquos dependence on data is approachnot only satisfies the constraints of bounded rationality butalso does not require the defender to know too much in-formation about the attacker It is a more practical defensedecision-making approach

e future work is to further optimize the winning andlosing criteria of WoLF-PHC algorithm for specific attack-defense scenarios in order to speed up defense learning andincrease defense gains

Data Availability

e data that support the findings of this study are notpublicly available due to restrictions as the data containsensitive information about a real-world enterprise networkAccess to the dataset is restricted by the original ownerPeople who want to access the data should send a request tothe corresponding author who will apply for permission ofsharing the data from the original owner

Conflicts of Interest

e authors declare that they have no conflicts of interestregarding the publication of this paper

Acknowledgments

is work was supported by the National High TechnologyResearch and Development Program of China (863 Pro-gram) (2014AA7116082 and 2015AA7116040) and NationalNatural Science Foundation of China (61902427) In par-ticular we thank Junnan Yang and Hongqi Zhang of ourresearch group for their previous work on stochastic gameWe sincerely thank them for their help in publishing thispaper

References

[1] Z Tian W Shi Y Wang et al ldquoReal time lateral movementdetection based on evidence reasoning network for edgecomputing environmentrdquo IEEE Transactions on IndustrialInformatics vol 15 no 7 pp 4285ndash4294 2019

[2] Y Wang J Yu W Qu H Shen and X Cheng ldquoEvolutionarygame model and analysis approaches for network groupbehaviorrdquo Chinese Journal of Computers vol 38 no 2pp 282ndash300 2015

[3] J Liu H Zhang and Y Liu ldquoResearch on optimal selection ofmoving target defense policy based on dynamic game withincomplete informationrdquo Acta Electronica Sinica vol 46no 1 pp 82ndash29 2018

Table 4 Comprehensive comparison of existing approaches

Ref Game type Model assumption Learning mechanism Game process Applicable object Practicability[3] Stochastic game Rationality mdash Multistage Personal Bad[12] Static game Rationality mdash Single-stage Personal Bad[14] Dynamic game Rationality mdash Multistage Personal Bad[16] Stochastic game Rationality mdash Multistage Personal Bad[20] Evolutionary game Bounded rationality Biological evolution mdash Group GoodOur approach Stochastic game Bounded rationality Reinforcement learning Multistage Personal Good

Security and Communication Networks 15

[4] Z Tian X Gao S Su J Qiu X Du and M GuizanildquoEvaluating reputation management schemes of internet ofvehicles based on evolutionary game theoryrdquo IEEE Trans-actions on Vehicular Technology vol 68 no 6 pp 5971ndash59802019

[5] Q Zhu and S Rass ldquoGame theory meets network securitya tutorialrdquo in Proceedings of the ACM SIGSAC Conference onComputer and Communications Security pp 2163ndash2165Toronto Canada October 2018

[6] X Li C Zhou Y C Tian and Y Qin ldquoA dynamic decision-making approach for intrusion response in industrial controlsystemsrdquo IEEE Transactions on Industrial Informatics vol 15no 5 pp 2544ndash2554 2019

[7] H Hu Y Liu H Zhang and R Pan ldquoOptimal networkdefense strategy selection based on incomplete informationevolutionary gamerdquo IEEE Access vol 6 pp 29806ndash298212018

[8] J Li G Kendall and R John ldquoComputing Nash equilibriaand evolutionarily stable states of evolutionary gamesrdquo IEEETransactions on Evolutionary Computation vol 20 no 3pp 460ndash469 2016

[9] M Bowling and M Veloso Multiagent Learning Usinga Variable Learning Rate Elsevier Science Publishers LtdAmsterdam Netherlands 2002

[10] J Harb and D Precup ldquoInvestigating recurrence and eligi-bility traces in deep Q-networksrdquo httpsarxivorgabs170405495 2017

[11] C T Do N H Tran and C Hong ldquoGame theory for cybersecurity and privacyrdquo ACM Computing Surveys vol 50 no 2p 30 2017

[12] Y-L Liu D-G Feng L-HWu and Y-F Lian ldquoPerformanceevaluation of worm attack and defense strategies based onstatic Bayesian gamerdquo Chinese Journal of Software vol 23no 3 pp 712ndash723 2012

[13] Y Li D E Quevedo S Dey and L Shi ldquoA game-theoreticapproach to fake-acknowledgment attack on cyber-physicalsystemsrdquo IEEE Transactions on Signal and Information Pro-cessing over Networks vol 3 no 1 pp 1ndash11 2017

[14] H Zhang L Jiang S Huang J Wang and Y Zhang ldquoAttack-defense differential game model for network defense strategyselectionrdquo IEEE Access vol 7 pp 50618ndash50629 2019

[15] A Afrand and S K Das ldquoPreventing DoS attacks in wirelesssensor networks a repeated game theory approachrdquo InternetJournal of Cyber Security vol 5 no 2 pp 145ndash153 2007

[16] J Wei B Fang and Z Tian ldquoResearch on defense strategiesselection based on attack-defense stochastic game modelrdquoChinese Journal of Computer Research and Developmentvol 47 no 10 pp 1714ndash1723 2010

[17] C Wang X Cheng and Y Zhu ldquoA Markov game model ofcomputer network operationrdquo Chinese Systems Engineering-gteory and Practice vol 34 no 9 pp 2402ndash2410 2014

[18] J Hofbauer and K Sigmund ldquoEvolutionary game dynamicsrdquoBulletin of the American Mathematical Society vol 40 no 4pp 479ndash519 2011

[19] Y Hayel and Q Zhu ldquoEpidemic protection over heteroge-neous networks using evolutionary Poisson gamesrdquo IEEETransactions on Information Forensics and Security vol 12no 8 pp 1786ndash1800 2017

[20] J Huang and H Zhang ldquoImproving replicator dynamicevolutionary game model for selecting optimal defensestrategiesrdquo Chinese Journal on Communications vol 38 no 1pp 170ndash182 2018

[21] H Hu Y Liu H Zhang and Y Zhang ldquoSecurity metricmethods for network multistep attacks using AMC and big

data correlation analysisrdquo Security and CommunicationNetworks vol 2018 Article ID 5787102 14 pages 2018

[22] H Hu Y Liu Y Yang H Zhang and Y Zhang ldquoNew in-sights into approaches to evaluating intention and path fornetwork multistep attacksrdquo Mathematical Problems in En-gineering vol 2018 Article ID 4278632 13 pages 2018

[23] H Hu H Zhang Y Liu and Y Wang ldquoQuantitative methodfor network security situation based on attack predictionrdquoSecurity and Communication Networks vol 2017 Article ID3407642 19 pages 2017

[24] M Zimmer and S Doncieux ldquoBootstrapping Q-learning forrobotics from neuro-evolution resultsrdquo IEEE Transactions onCognitive and Developmental Systems vol 10 no 1pp 102ndash119 2018

[25] R S Sutton and A G Barto ldquoReinforcement learning anintroduction Bradford bookrdquo Machine Learning vol 16no 1 pp 285-286 2005

[26] H Zhang J Wang D Yu J Han and T Li ldquoActive defensestrategy selection based on static Bayesian gamerdquo in Pro-ceedings of IET International Conference on CyberspaceTechnology pp 1ndash7 London UK October 2016

16 Security and Communication Networks

International Journal of

AerospaceEngineeringHindawiwwwhindawicom Volume 2018

RoboticsJournal of

Hindawiwwwhindawicom Volume 2018

Hindawiwwwhindawicom Volume 2018

Active and Passive Electronic Components

VLSI Design

Hindawiwwwhindawicom Volume 2018

Hindawiwwwhindawicom Volume 2018

Shock and Vibration

Hindawiwwwhindawicom Volume 2018

Civil EngineeringAdvances in

Acoustics and VibrationAdvances in

Hindawiwwwhindawicom Volume 2018

Hindawiwwwhindawicom Volume 2018

Electrical and Computer Engineering

Journal of

Advances inOptoElectronics

Hindawiwwwhindawicom

Volume 2018

Hindawi Publishing Corporation httpwwwhindawicom Volume 2013Hindawiwwwhindawicom

The Scientific World Journal

Volume 2018

Control Scienceand Engineering

Journal of

Hindawiwwwhindawicom Volume 2018

Hindawiwwwhindawicom

Journal ofEngineeringVolume 2018

SensorsJournal of

Hindawiwwwhindawicom Volume 2018

International Journal of

RotatingMachinery

Hindawiwwwhindawicom Volume 2018

Modelling ampSimulationin EngineeringHindawiwwwhindawicom Volume 2018

Hindawiwwwhindawicom Volume 2018

Chemical EngineeringInternational Journal of Antennas and

Propagation

International Journal of

Hindawiwwwhindawicom Volume 2018

Hindawiwwwhindawicom Volume 2018

Navigation and Observation

International Journal of

Hindawi

wwwhindawicom Volume 2018

Advances in

Multimedia

Submit your manuscripts atwwwhindawicom

Page 13: Research Article - Hindawi Publishing Corporationdownloads.hindawi.com/journals/scn/2019/3038586.pdf · Afrand and Das [15] established a repeated game model between the intrusion

in [16 20] When facing an attacker with strong learningability if the attacker has not obtained Nash equilibriumthrough learning this approach is still better than [16 20] Ifthe attacker obtains Nash equilibrium through learning thispaper can also obtain the same Nash equilibrium strategy as[26] and obtain its phase e same eect is superior to thatin [20]

533 Test Comparison with and without Eligibility Traceis section tests the actual eect of the eligibility traces onthe Algorithm 2 e eect of eligibility traces on strategyselection is shown in Figure 16 from which we can see thatthe learning speed of the algorithm is faster when thequalied trace is available After 1000 times of learning the

algorithm can converge to the optimal strategy When thequalied trace is not available the algorithm needs about2500 times of learning to converge

Average earnings per 1000 times change as shown inFigure 17 from which we can see that the benets of thealgorithm are almost the same when there is or not anyqualied trace after convergence From Figure 17 we can seethat 3000 defenses before convergence have higher returnsfrom qualied trails than those from unqualied trails Inorder to further verify this the average of the rst 3000defense gains under qualied trails and unqualied trails iscounted 10 times each respectively e results are shown inFigure 18 which further proves that in the preconvergencedefense phase qualied traces are better than unqualiedtraces

e addition of eligibility trace accelerates the learningspeed but also brings additional memory and computingoverhead In the experiment only 10 state-action pairs thatwere recently accessed were saved which eectively reduced

Set 1 Set 2 Set 3 Set 4 Set 5 Set 6

ndash80

ndash85

ndash90

Figure 11 Average return under dierent parameter settings

Set 1 Set 2 Set 3 Set 4 Set 5 Set 6

80

75

Figure 12 Standard deviation of defense payo under dierentparameter settings

ndash10

ndash20

ndash30

ndash40

s2 d

efen

se re

turn

5 10 15 20 25

Our methodReference [12]Reference [16]

s2 defense number (times1000)

Figure 13 Comparison of defense payo change

0 2000 4000 6000s2 defense number

s2 d

efen

se st

rate

gy

08

06

04

02

00

d3d4

Figure 14 Defense decision making of our approach

Our methodReference [12]Reference [16]

5 10 15 20 25s2 defense number (times1000)

s2 d

efen

se re

turn

ndash10

ndash20

ndash30

ndash40

Figure 15 Comparison of defense payo change

Security and Communication Networks 13

0 2000 4000 6000s2 defense number

s2 d

efen

se st

rate

gy

100

075

050

025

000

Eligibility trace d3Eligibility trace d4

No eligibility trace d3No eligibility trace d4

Figure 16 Comparison of defense decision making

Eligibility traceNo eligibility trace

0 5 10 15 20 25s2 defense number (times1000)

s2 d

efen

se re

turn

ndash10

ndash20

ndash15

ndash30

ndash25

Figure 17 Comparison of defense payo change

Eligibility traceNo eligibility trace

1 2 3 4 5 6 7 8 9 10s2 defense number (times3000)

s2 av

erag

e def

ense

retu

rn

ndash12

ndash14

ndash16

ndash18

ndash20

Figure 18 Average payo comparison of the rst 3000 defenses

14 Security and Communication Networks

the increase of memory consumption In order to test thecomputational cost of qualified trail the time of 100000defense decisions made by the algorithm was counted for 20times with and without qualified trail e average of 20times was 951s for qualified trail and 374 s for unqualifiedtrail Although the introduction of eligibility traces willincrease the decision-making time by nearly 25 times thetime required for 100000 decisions after the introduction ofeligibility traces is still only 951 s which can still meet thereal-time requirements

In summary the introduction of eligibility trace at theexpense of a small amount of memory and computingoverhead can effectively increase the learning speed of thealgorithm and improve the defense gains

54 Comprehensive Comparisons is approach is com-pared with some typical research results as shown in Table 4[3 12 14 16] is based on the assumption of complete ra-tionality e equilibrium strategy obtained by it is difficultto appear in practice and has a low guiding effect on actualdefense decision making [20] and this paper have morepracticability on the premise of bounded rationality hy-pothesis but [20] is based on the theory of biologicalevolution and mainly studies population evolution e coreof game analysis is not the optimal strategy choice of playersbut the strategy adjustment process trend and stability ofgroup members composed of bounded rational players andthe stability here refers to group members is approach isnot suitable for guiding individual real-time decisionmaking because the proportion of specific strategies isunchanged and not the strategy of a player On the contrarythe defender of the proposed approach adopts reinforcementlearning mechanism which is based on systematic feedbackto learn in the confrontation with the attacker and is moresuitable for the study of individual strategies

6 Conclusions and Future Works

In this paper cyber attack-defense confrontation is ab-stracted as a stochastic game problem under the restrictionof limited rationality A host-centered attack-defense graphmodel is proposed to extract network state and attack-de-fense action and an algorithm to generate attack-defensegraph is designed to effectively compress the game statespace e WoLF-PHC-based defense decision approach isproposed to overcome the problem which enables defendersunder bounded rationality to make optimal choices whenfacing different attackers e lattice improves the WoLF-

PHC algorithm speeds up the defenderrsquos learning speed andreduces the algorithmrsquos dependence on data is approachnot only satisfies the constraints of bounded rationality butalso does not require the defender to know too much in-formation about the attacker It is a more practical defensedecision-making approach

e future work is to further optimize the winning andlosing criteria of WoLF-PHC algorithm for specific attack-defense scenarios in order to speed up defense learning andincrease defense gains

Data Availability

e data that support the findings of this study are notpublicly available due to restrictions as the data containsensitive information about a real-world enterprise networkAccess to the dataset is restricted by the original ownerPeople who want to access the data should send a request tothe corresponding author who will apply for permission ofsharing the data from the original owner

Conflicts of Interest

e authors declare that they have no conflicts of interestregarding the publication of this paper

Acknowledgments

is work was supported by the National High TechnologyResearch and Development Program of China (863 Pro-gram) (2014AA7116082 and 2015AA7116040) and NationalNatural Science Foundation of China (61902427) In par-ticular we thank Junnan Yang and Hongqi Zhang of ourresearch group for their previous work on stochastic gameWe sincerely thank them for their help in publishing thispaper

References

[1] Z Tian W Shi Y Wang et al ldquoReal time lateral movementdetection based on evidence reasoning network for edgecomputing environmentrdquo IEEE Transactions on IndustrialInformatics vol 15 no 7 pp 4285ndash4294 2019

[2] Y Wang J Yu W Qu H Shen and X Cheng ldquoEvolutionarygame model and analysis approaches for network groupbehaviorrdquo Chinese Journal of Computers vol 38 no 2pp 282ndash300 2015

[3] J Liu H Zhang and Y Liu ldquoResearch on optimal selection ofmoving target defense policy based on dynamic game withincomplete informationrdquo Acta Electronica Sinica vol 46no 1 pp 82ndash29 2018

Table 4 Comprehensive comparison of existing approaches

Ref Game type Model assumption Learning mechanism Game process Applicable object Practicability[3] Stochastic game Rationality mdash Multistage Personal Bad[12] Static game Rationality mdash Single-stage Personal Bad[14] Dynamic game Rationality mdash Multistage Personal Bad[16] Stochastic game Rationality mdash Multistage Personal Bad[20] Evolutionary game Bounded rationality Biological evolution mdash Group GoodOur approach Stochastic game Bounded rationality Reinforcement learning Multistage Personal Good

Security and Communication Networks 15

[4] Z Tian X Gao S Su J Qiu X Du and M GuizanildquoEvaluating reputation management schemes of internet ofvehicles based on evolutionary game theoryrdquo IEEE Trans-actions on Vehicular Technology vol 68 no 6 pp 5971ndash59802019

[5] Q Zhu and S Rass ldquoGame theory meets network securitya tutorialrdquo in Proceedings of the ACM SIGSAC Conference onComputer and Communications Security pp 2163ndash2165Toronto Canada October 2018

[6] X Li C Zhou Y C Tian and Y Qin ldquoA dynamic decision-making approach for intrusion response in industrial controlsystemsrdquo IEEE Transactions on Industrial Informatics vol 15no 5 pp 2544ndash2554 2019

[7] H Hu Y Liu H Zhang and R Pan ldquoOptimal networkdefense strategy selection based on incomplete informationevolutionary gamerdquo IEEE Access vol 6 pp 29806ndash298212018

[8] J Li G Kendall and R John ldquoComputing Nash equilibriaand evolutionarily stable states of evolutionary gamesrdquo IEEETransactions on Evolutionary Computation vol 20 no 3pp 460ndash469 2016

[9] M Bowling and M Veloso Multiagent Learning Usinga Variable Learning Rate Elsevier Science Publishers LtdAmsterdam Netherlands 2002

[10] J Harb and D Precup ldquoInvestigating recurrence and eligi-bility traces in deep Q-networksrdquo httpsarxivorgabs170405495 2017

[11] C T Do N H Tran and C Hong ldquoGame theory for cybersecurity and privacyrdquo ACM Computing Surveys vol 50 no 2p 30 2017

[12] Y-L Liu D-G Feng L-HWu and Y-F Lian ldquoPerformanceevaluation of worm attack and defense strategies based onstatic Bayesian gamerdquo Chinese Journal of Software vol 23no 3 pp 712ndash723 2012

[13] Y Li D E Quevedo S Dey and L Shi ldquoA game-theoreticapproach to fake-acknowledgment attack on cyber-physicalsystemsrdquo IEEE Transactions on Signal and Information Pro-cessing over Networks vol 3 no 1 pp 1ndash11 2017

[14] H Zhang L Jiang S Huang J Wang and Y Zhang ldquoAttack-defense differential game model for network defense strategyselectionrdquo IEEE Access vol 7 pp 50618ndash50629 2019

[15] A Afrand and S K Das ldquoPreventing DoS attacks in wirelesssensor networks a repeated game theory approachrdquo InternetJournal of Cyber Security vol 5 no 2 pp 145ndash153 2007

[16] J Wei B Fang and Z Tian ldquoResearch on defense strategiesselection based on attack-defense stochastic game modelrdquoChinese Journal of Computer Research and Developmentvol 47 no 10 pp 1714ndash1723 2010

[17] C Wang X Cheng and Y Zhu ldquoA Markov game model ofcomputer network operationrdquo Chinese Systems Engineering-gteory and Practice vol 34 no 9 pp 2402ndash2410 2014

[18] J Hofbauer and K Sigmund ldquoEvolutionary game dynamicsrdquoBulletin of the American Mathematical Society vol 40 no 4pp 479ndash519 2011

[19] Y Hayel and Q Zhu ldquoEpidemic protection over heteroge-neous networks using evolutionary Poisson gamesrdquo IEEETransactions on Information Forensics and Security vol 12no 8 pp 1786ndash1800 2017

[20] J Huang and H Zhang ldquoImproving replicator dynamicevolutionary game model for selecting optimal defensestrategiesrdquo Chinese Journal on Communications vol 38 no 1pp 170ndash182 2018

[21] H Hu Y Liu H Zhang and Y Zhang ldquoSecurity metricmethods for network multistep attacks using AMC and big

data correlation analysisrdquo Security and CommunicationNetworks vol 2018 Article ID 5787102 14 pages 2018

[22] H Hu Y Liu Y Yang H Zhang and Y Zhang ldquoNew in-sights into approaches to evaluating intention and path fornetwork multistep attacksrdquo Mathematical Problems in En-gineering vol 2018 Article ID 4278632 13 pages 2018

[23] H Hu H Zhang Y Liu and Y Wang ldquoQuantitative methodfor network security situation based on attack predictionrdquoSecurity and Communication Networks vol 2017 Article ID3407642 19 pages 2017

[24] M Zimmer and S Doncieux ldquoBootstrapping Q-learning forrobotics from neuro-evolution resultsrdquo IEEE Transactions onCognitive and Developmental Systems vol 10 no 1pp 102ndash119 2018

[25] R S Sutton and A G Barto ldquoReinforcement learning anintroduction Bradford bookrdquo Machine Learning vol 16no 1 pp 285-286 2005

[26] H Zhang J Wang D Yu J Han and T Li ldquoActive defensestrategy selection based on static Bayesian gamerdquo in Pro-ceedings of IET International Conference on CyberspaceTechnology pp 1ndash7 London UK October 2016

16 Security and Communication Networks

International Journal of

AerospaceEngineeringHindawiwwwhindawicom Volume 2018

RoboticsJournal of

Hindawiwwwhindawicom Volume 2018

Hindawiwwwhindawicom Volume 2018

Active and Passive Electronic Components

VLSI Design

Hindawiwwwhindawicom Volume 2018

Hindawiwwwhindawicom Volume 2018

Shock and Vibration

Hindawiwwwhindawicom Volume 2018

Civil EngineeringAdvances in

Acoustics and VibrationAdvances in

Hindawiwwwhindawicom Volume 2018

Hindawiwwwhindawicom Volume 2018

Electrical and Computer Engineering

Journal of

Advances inOptoElectronics

Hindawiwwwhindawicom

Volume 2018

Hindawi Publishing Corporation httpwwwhindawicom Volume 2013Hindawiwwwhindawicom

The Scientific World Journal

Volume 2018

Control Scienceand Engineering

Journal of

Hindawiwwwhindawicom Volume 2018

Hindawiwwwhindawicom

Journal ofEngineeringVolume 2018

SensorsJournal of

Hindawiwwwhindawicom Volume 2018

International Journal of

RotatingMachinery

Hindawiwwwhindawicom Volume 2018

Modelling ampSimulationin EngineeringHindawiwwwhindawicom Volume 2018

Hindawiwwwhindawicom Volume 2018

Chemical EngineeringInternational Journal of Antennas and

Propagation

International Journal of

Hindawiwwwhindawicom Volume 2018

Hindawiwwwhindawicom Volume 2018

Navigation and Observation

International Journal of

Hindawi

wwwhindawicom Volume 2018

Advances in

Multimedia

Submit your manuscripts atwwwhindawicom

Page 14: Research Article - Hindawi Publishing Corporationdownloads.hindawi.com/journals/scn/2019/3038586.pdf · Afrand and Das [15] established a repeated game model between the intrusion

0 2000 4000 6000s2 defense number

s2 d

efen

se st

rate

gy

100

075

050

025

000

Eligibility trace d3Eligibility trace d4

No eligibility trace d3No eligibility trace d4

Figure 16 Comparison of defense decision making

Eligibility traceNo eligibility trace

0 5 10 15 20 25s2 defense number (times1000)

s2 d

efen

se re

turn

ndash10

ndash20

ndash15

ndash30

ndash25

Figure 17 Comparison of defense payo change

Eligibility traceNo eligibility trace

1 2 3 4 5 6 7 8 9 10s2 defense number (times3000)

s2 av

erag

e def

ense

retu

rn

ndash12

ndash14

ndash16

ndash18

ndash20

Figure 18 Average payo comparison of the rst 3000 defenses

14 Security and Communication Networks

the increase of memory consumption In order to test thecomputational cost of qualified trail the time of 100000defense decisions made by the algorithm was counted for 20times with and without qualified trail e average of 20times was 951s for qualified trail and 374 s for unqualifiedtrail Although the introduction of eligibility traces willincrease the decision-making time by nearly 25 times thetime required for 100000 decisions after the introduction ofeligibility traces is still only 951 s which can still meet thereal-time requirements

In summary the introduction of eligibility trace at theexpense of a small amount of memory and computingoverhead can effectively increase the learning speed of thealgorithm and improve the defense gains

54 Comprehensive Comparisons is approach is com-pared with some typical research results as shown in Table 4[3 12 14 16] is based on the assumption of complete ra-tionality e equilibrium strategy obtained by it is difficultto appear in practice and has a low guiding effect on actualdefense decision making [20] and this paper have morepracticability on the premise of bounded rationality hy-pothesis but [20] is based on the theory of biologicalevolution and mainly studies population evolution e coreof game analysis is not the optimal strategy choice of playersbut the strategy adjustment process trend and stability ofgroup members composed of bounded rational players andthe stability here refers to group members is approach isnot suitable for guiding individual real-time decisionmaking because the proportion of specific strategies isunchanged and not the strategy of a player On the contrarythe defender of the proposed approach adopts reinforcementlearning mechanism which is based on systematic feedbackto learn in the confrontation with the attacker and is moresuitable for the study of individual strategies

6 Conclusions and Future Works

In this paper cyber attack-defense confrontation is ab-stracted as a stochastic game problem under the restrictionof limited rationality A host-centered attack-defense graphmodel is proposed to extract network state and attack-de-fense action and an algorithm to generate attack-defensegraph is designed to effectively compress the game statespace e WoLF-PHC-based defense decision approach isproposed to overcome the problem which enables defendersunder bounded rationality to make optimal choices whenfacing different attackers e lattice improves the WoLF-

PHC algorithm speeds up the defenderrsquos learning speed andreduces the algorithmrsquos dependence on data is approachnot only satisfies the constraints of bounded rationality butalso does not require the defender to know too much in-formation about the attacker It is a more practical defensedecision-making approach

e future work is to further optimize the winning andlosing criteria of WoLF-PHC algorithm for specific attack-defense scenarios in order to speed up defense learning andincrease defense gains

Data Availability

e data that support the findings of this study are notpublicly available due to restrictions as the data containsensitive information about a real-world enterprise networkAccess to the dataset is restricted by the original ownerPeople who want to access the data should send a request tothe corresponding author who will apply for permission ofsharing the data from the original owner

Conflicts of Interest

e authors declare that they have no conflicts of interestregarding the publication of this paper

Acknowledgments

is work was supported by the National High TechnologyResearch and Development Program of China (863 Pro-gram) (2014AA7116082 and 2015AA7116040) and NationalNatural Science Foundation of China (61902427) In par-ticular we thank Junnan Yang and Hongqi Zhang of ourresearch group for their previous work on stochastic gameWe sincerely thank them for their help in publishing thispaper

References

[1] Z Tian W Shi Y Wang et al ldquoReal time lateral movementdetection based on evidence reasoning network for edgecomputing environmentrdquo IEEE Transactions on IndustrialInformatics vol 15 no 7 pp 4285ndash4294 2019

[2] Y Wang J Yu W Qu H Shen and X Cheng ldquoEvolutionarygame model and analysis approaches for network groupbehaviorrdquo Chinese Journal of Computers vol 38 no 2pp 282ndash300 2015

[3] J Liu H Zhang and Y Liu ldquoResearch on optimal selection ofmoving target defense policy based on dynamic game withincomplete informationrdquo Acta Electronica Sinica vol 46no 1 pp 82ndash29 2018

Table 4 Comprehensive comparison of existing approaches

Ref Game type Model assumption Learning mechanism Game process Applicable object Practicability[3] Stochastic game Rationality mdash Multistage Personal Bad[12] Static game Rationality mdash Single-stage Personal Bad[14] Dynamic game Rationality mdash Multistage Personal Bad[16] Stochastic game Rationality mdash Multistage Personal Bad[20] Evolutionary game Bounded rationality Biological evolution mdash Group GoodOur approach Stochastic game Bounded rationality Reinforcement learning Multistage Personal Good

Security and Communication Networks 15

[4] Z Tian X Gao S Su J Qiu X Du and M GuizanildquoEvaluating reputation management schemes of internet ofvehicles based on evolutionary game theoryrdquo IEEE Trans-actions on Vehicular Technology vol 68 no 6 pp 5971ndash59802019

[5] Q Zhu and S Rass ldquoGame theory meets network securitya tutorialrdquo in Proceedings of the ACM SIGSAC Conference onComputer and Communications Security pp 2163ndash2165Toronto Canada October 2018

[6] X Li C Zhou Y C Tian and Y Qin ldquoA dynamic decision-making approach for intrusion response in industrial controlsystemsrdquo IEEE Transactions on Industrial Informatics vol 15no 5 pp 2544ndash2554 2019

[7] H Hu Y Liu H Zhang and R Pan ldquoOptimal networkdefense strategy selection based on incomplete informationevolutionary gamerdquo IEEE Access vol 6 pp 29806ndash298212018

[8] J Li G Kendall and R John ldquoComputing Nash equilibriaand evolutionarily stable states of evolutionary gamesrdquo IEEETransactions on Evolutionary Computation vol 20 no 3pp 460ndash469 2016

[9] M Bowling and M Veloso Multiagent Learning Usinga Variable Learning Rate Elsevier Science Publishers LtdAmsterdam Netherlands 2002

[10] J Harb and D Precup ldquoInvestigating recurrence and eligi-bility traces in deep Q-networksrdquo httpsarxivorgabs170405495 2017

[11] C T Do N H Tran and C Hong ldquoGame theory for cybersecurity and privacyrdquo ACM Computing Surveys vol 50 no 2p 30 2017

[12] Y-L Liu D-G Feng L-HWu and Y-F Lian ldquoPerformanceevaluation of worm attack and defense strategies based onstatic Bayesian gamerdquo Chinese Journal of Software vol 23no 3 pp 712ndash723 2012

[13] Y Li D E Quevedo S Dey and L Shi ldquoA game-theoreticapproach to fake-acknowledgment attack on cyber-physicalsystemsrdquo IEEE Transactions on Signal and Information Pro-cessing over Networks vol 3 no 1 pp 1ndash11 2017

[14] H Zhang L Jiang S Huang J Wang and Y Zhang ldquoAttack-defense differential game model for network defense strategyselectionrdquo IEEE Access vol 7 pp 50618ndash50629 2019

[15] A Afrand and S K Das ldquoPreventing DoS attacks in wirelesssensor networks a repeated game theory approachrdquo InternetJournal of Cyber Security vol 5 no 2 pp 145ndash153 2007

[16] J Wei B Fang and Z Tian ldquoResearch on defense strategiesselection based on attack-defense stochastic game modelrdquoChinese Journal of Computer Research and Developmentvol 47 no 10 pp 1714ndash1723 2010

[17] C Wang X Cheng and Y Zhu ldquoA Markov game model ofcomputer network operationrdquo Chinese Systems Engineering-gteory and Practice vol 34 no 9 pp 2402ndash2410 2014

[18] J Hofbauer and K Sigmund ldquoEvolutionary game dynamicsrdquoBulletin of the American Mathematical Society vol 40 no 4pp 479ndash519 2011

[19] Y Hayel and Q Zhu ldquoEpidemic protection over heteroge-neous networks using evolutionary Poisson gamesrdquo IEEETransactions on Information Forensics and Security vol 12no 8 pp 1786ndash1800 2017

[20] J Huang and H Zhang ldquoImproving replicator dynamicevolutionary game model for selecting optimal defensestrategiesrdquo Chinese Journal on Communications vol 38 no 1pp 170ndash182 2018

[21] H Hu Y Liu H Zhang and Y Zhang ldquoSecurity metricmethods for network multistep attacks using AMC and big

data correlation analysisrdquo Security and CommunicationNetworks vol 2018 Article ID 5787102 14 pages 2018

[22] H Hu Y Liu Y Yang H Zhang and Y Zhang ldquoNew in-sights into approaches to evaluating intention and path fornetwork multistep attacksrdquo Mathematical Problems in En-gineering vol 2018 Article ID 4278632 13 pages 2018

[23] H Hu H Zhang Y Liu and Y Wang ldquoQuantitative methodfor network security situation based on attack predictionrdquoSecurity and Communication Networks vol 2017 Article ID3407642 19 pages 2017

[24] M Zimmer and S Doncieux ldquoBootstrapping Q-learning forrobotics from neuro-evolution resultsrdquo IEEE Transactions onCognitive and Developmental Systems vol 10 no 1pp 102ndash119 2018

[25] R S Sutton and A G Barto ldquoReinforcement learning anintroduction Bradford bookrdquo Machine Learning vol 16no 1 pp 285-286 2005

[26] H Zhang J Wang D Yu J Han and T Li ldquoActive defensestrategy selection based on static Bayesian gamerdquo in Pro-ceedings of IET International Conference on CyberspaceTechnology pp 1ndash7 London UK October 2016

16 Security and Communication Networks

International Journal of

AerospaceEngineeringHindawiwwwhindawicom Volume 2018

RoboticsJournal of

Hindawiwwwhindawicom Volume 2018

Hindawiwwwhindawicom Volume 2018

Active and Passive Electronic Components

VLSI Design

Hindawiwwwhindawicom Volume 2018

Hindawiwwwhindawicom Volume 2018

Shock and Vibration

Hindawiwwwhindawicom Volume 2018

Civil EngineeringAdvances in

Acoustics and VibrationAdvances in

Hindawiwwwhindawicom Volume 2018

Hindawiwwwhindawicom Volume 2018

Electrical and Computer Engineering

Journal of

Advances inOptoElectronics

Hindawiwwwhindawicom

Volume 2018

Hindawi Publishing Corporation httpwwwhindawicom Volume 2013Hindawiwwwhindawicom

The Scientific World Journal

Volume 2018

Control Scienceand Engineering

Journal of

Hindawiwwwhindawicom Volume 2018

Hindawiwwwhindawicom

Journal ofEngineeringVolume 2018

SensorsJournal of

Hindawiwwwhindawicom Volume 2018

International Journal of

RotatingMachinery

Hindawiwwwhindawicom Volume 2018

Modelling ampSimulationin EngineeringHindawiwwwhindawicom Volume 2018

Hindawiwwwhindawicom Volume 2018

Chemical EngineeringInternational Journal of Antennas and

Propagation

International Journal of

Hindawiwwwhindawicom Volume 2018

Hindawiwwwhindawicom Volume 2018

Navigation and Observation

International Journal of

Hindawi

wwwhindawicom Volume 2018

Advances in

Multimedia

Submit your manuscripts atwwwhindawicom

Page 15: Research Article - Hindawi Publishing Corporationdownloads.hindawi.com/journals/scn/2019/3038586.pdf · Afrand and Das [15] established a repeated game model between the intrusion

the increase of memory consumption In order to test thecomputational cost of qualified trail the time of 100000defense decisions made by the algorithm was counted for 20times with and without qualified trail e average of 20times was 951s for qualified trail and 374 s for unqualifiedtrail Although the introduction of eligibility traces willincrease the decision-making time by nearly 25 times thetime required for 100000 decisions after the introduction ofeligibility traces is still only 951 s which can still meet thereal-time requirements

In summary the introduction of eligibility trace at theexpense of a small amount of memory and computingoverhead can effectively increase the learning speed of thealgorithm and improve the defense gains

54 Comprehensive Comparisons is approach is com-pared with some typical research results as shown in Table 4[3 12 14 16] is based on the assumption of complete ra-tionality e equilibrium strategy obtained by it is difficultto appear in practice and has a low guiding effect on actualdefense decision making [20] and this paper have morepracticability on the premise of bounded rationality hy-pothesis but [20] is based on the theory of biologicalevolution and mainly studies population evolution e coreof game analysis is not the optimal strategy choice of playersbut the strategy adjustment process trend and stability ofgroup members composed of bounded rational players andthe stability here refers to group members is approach isnot suitable for guiding individual real-time decisionmaking because the proportion of specific strategies isunchanged and not the strategy of a player On the contrarythe defender of the proposed approach adopts reinforcementlearning mechanism which is based on systematic feedbackto learn in the confrontation with the attacker and is moresuitable for the study of individual strategies

6 Conclusions and Future Works

In this paper cyber attack-defense confrontation is ab-stracted as a stochastic game problem under the restrictionof limited rationality A host-centered attack-defense graphmodel is proposed to extract network state and attack-de-fense action and an algorithm to generate attack-defensegraph is designed to effectively compress the game statespace e WoLF-PHC-based defense decision approach isproposed to overcome the problem which enables defendersunder bounded rationality to make optimal choices whenfacing different attackers e lattice improves the WoLF-

PHC algorithm speeds up the defenderrsquos learning speed andreduces the algorithmrsquos dependence on data is approachnot only satisfies the constraints of bounded rationality butalso does not require the defender to know too much in-formation about the attacker It is a more practical defensedecision-making approach

e future work is to further optimize the winning andlosing criteria of WoLF-PHC algorithm for specific attack-defense scenarios in order to speed up defense learning andincrease defense gains

Data Availability

e data that support the findings of this study are notpublicly available due to restrictions as the data containsensitive information about a real-world enterprise networkAccess to the dataset is restricted by the original ownerPeople who want to access the data should send a request tothe corresponding author who will apply for permission ofsharing the data from the original owner

Conflicts of Interest

e authors declare that they have no conflicts of interestregarding the publication of this paper

Acknowledgments

is work was supported by the National High TechnologyResearch and Development Program of China (863 Pro-gram) (2014AA7116082 and 2015AA7116040) and NationalNatural Science Foundation of China (61902427) In par-ticular we thank Junnan Yang and Hongqi Zhang of ourresearch group for their previous work on stochastic gameWe sincerely thank them for their help in publishing thispaper

References

[1] Z Tian W Shi Y Wang et al ldquoReal time lateral movementdetection based on evidence reasoning network for edgecomputing environmentrdquo IEEE Transactions on IndustrialInformatics vol 15 no 7 pp 4285ndash4294 2019

[2] Y Wang J Yu W Qu H Shen and X Cheng ldquoEvolutionarygame model and analysis approaches for network groupbehaviorrdquo Chinese Journal of Computers vol 38 no 2pp 282ndash300 2015

[3] J Liu H Zhang and Y Liu ldquoResearch on optimal selection ofmoving target defense policy based on dynamic game withincomplete informationrdquo Acta Electronica Sinica vol 46no 1 pp 82ndash29 2018

Table 4 Comprehensive comparison of existing approaches

Ref Game type Model assumption Learning mechanism Game process Applicable object Practicability[3] Stochastic game Rationality mdash Multistage Personal Bad[12] Static game Rationality mdash Single-stage Personal Bad[14] Dynamic game Rationality mdash Multistage Personal Bad[16] Stochastic game Rationality mdash Multistage Personal Bad[20] Evolutionary game Bounded rationality Biological evolution mdash Group GoodOur approach Stochastic game Bounded rationality Reinforcement learning Multistage Personal Good

Security and Communication Networks 15

[4] Z Tian X Gao S Su J Qiu X Du and M GuizanildquoEvaluating reputation management schemes of internet ofvehicles based on evolutionary game theoryrdquo IEEE Trans-actions on Vehicular Technology vol 68 no 6 pp 5971ndash59802019

[5] Q Zhu and S Rass ldquoGame theory meets network securitya tutorialrdquo in Proceedings of the ACM SIGSAC Conference onComputer and Communications Security pp 2163ndash2165Toronto Canada October 2018

[6] X Li C Zhou Y C Tian and Y Qin ldquoA dynamic decision-making approach for intrusion response in industrial controlsystemsrdquo IEEE Transactions on Industrial Informatics vol 15no 5 pp 2544ndash2554 2019

[7] H Hu Y Liu H Zhang and R Pan ldquoOptimal networkdefense strategy selection based on incomplete informationevolutionary gamerdquo IEEE Access vol 6 pp 29806ndash298212018

[8] J Li G Kendall and R John ldquoComputing Nash equilibriaand evolutionarily stable states of evolutionary gamesrdquo IEEETransactions on Evolutionary Computation vol 20 no 3pp 460ndash469 2016

[9] M Bowling and M Veloso Multiagent Learning Usinga Variable Learning Rate Elsevier Science Publishers LtdAmsterdam Netherlands 2002

[10] J Harb and D Precup ldquoInvestigating recurrence and eligi-bility traces in deep Q-networksrdquo httpsarxivorgabs170405495 2017

[11] C T Do N H Tran and C Hong ldquoGame theory for cybersecurity and privacyrdquo ACM Computing Surveys vol 50 no 2p 30 2017

[12] Y-L Liu D-G Feng L-HWu and Y-F Lian ldquoPerformanceevaluation of worm attack and defense strategies based onstatic Bayesian gamerdquo Chinese Journal of Software vol 23no 3 pp 712ndash723 2012

[13] Y Li D E Quevedo S Dey and L Shi ldquoA game-theoreticapproach to fake-acknowledgment attack on cyber-physicalsystemsrdquo IEEE Transactions on Signal and Information Pro-cessing over Networks vol 3 no 1 pp 1ndash11 2017

[14] H Zhang L Jiang S Huang J Wang and Y Zhang ldquoAttack-defense differential game model for network defense strategyselectionrdquo IEEE Access vol 7 pp 50618ndash50629 2019

[15] A Afrand and S K Das ldquoPreventing DoS attacks in wirelesssensor networks a repeated game theory approachrdquo InternetJournal of Cyber Security vol 5 no 2 pp 145ndash153 2007

[16] J Wei B Fang and Z Tian ldquoResearch on defense strategiesselection based on attack-defense stochastic game modelrdquoChinese Journal of Computer Research and Developmentvol 47 no 10 pp 1714ndash1723 2010

[17] C Wang X Cheng and Y Zhu ldquoA Markov game model ofcomputer network operationrdquo Chinese Systems Engineering-gteory and Practice vol 34 no 9 pp 2402ndash2410 2014

[18] J Hofbauer and K Sigmund ldquoEvolutionary game dynamicsrdquoBulletin of the American Mathematical Society vol 40 no 4pp 479ndash519 2011

[19] Y Hayel and Q Zhu ldquoEpidemic protection over heteroge-neous networks using evolutionary Poisson gamesrdquo IEEETransactions on Information Forensics and Security vol 12no 8 pp 1786ndash1800 2017

[20] J Huang and H Zhang ldquoImproving replicator dynamicevolutionary game model for selecting optimal defensestrategiesrdquo Chinese Journal on Communications vol 38 no 1pp 170ndash182 2018

[21] H Hu Y Liu H Zhang and Y Zhang ldquoSecurity metricmethods for network multistep attacks using AMC and big

data correlation analysisrdquo Security and CommunicationNetworks vol 2018 Article ID 5787102 14 pages 2018

[22] H Hu Y Liu Y Yang H Zhang and Y Zhang ldquoNew in-sights into approaches to evaluating intention and path fornetwork multistep attacksrdquo Mathematical Problems in En-gineering vol 2018 Article ID 4278632 13 pages 2018

[23] H Hu H Zhang Y Liu and Y Wang ldquoQuantitative methodfor network security situation based on attack predictionrdquoSecurity and Communication Networks vol 2017 Article ID3407642 19 pages 2017

[24] M Zimmer and S Doncieux ldquoBootstrapping Q-learning forrobotics from neuro-evolution resultsrdquo IEEE Transactions onCognitive and Developmental Systems vol 10 no 1pp 102ndash119 2018

[25] R S Sutton and A G Barto ldquoReinforcement learning anintroduction Bradford bookrdquo Machine Learning vol 16no 1 pp 285-286 2005

[26] H Zhang J Wang D Yu J Han and T Li ldquoActive defensestrategy selection based on static Bayesian gamerdquo in Pro-ceedings of IET International Conference on CyberspaceTechnology pp 1ndash7 London UK October 2016

16 Security and Communication Networks

International Journal of

AerospaceEngineeringHindawiwwwhindawicom Volume 2018

RoboticsJournal of

Hindawiwwwhindawicom Volume 2018

Hindawiwwwhindawicom Volume 2018

Active and Passive Electronic Components

VLSI Design

Hindawiwwwhindawicom Volume 2018

Hindawiwwwhindawicom Volume 2018

Shock and Vibration

Hindawiwwwhindawicom Volume 2018

Civil EngineeringAdvances in

Acoustics and VibrationAdvances in

Hindawiwwwhindawicom Volume 2018

Hindawiwwwhindawicom Volume 2018

Electrical and Computer Engineering

Journal of

Advances inOptoElectronics

Hindawiwwwhindawicom

Volume 2018

Hindawi Publishing Corporation httpwwwhindawicom Volume 2013Hindawiwwwhindawicom

The Scientific World Journal

Volume 2018

Control Scienceand Engineering

Journal of

Hindawiwwwhindawicom Volume 2018

Hindawiwwwhindawicom

Journal ofEngineeringVolume 2018

SensorsJournal of

Hindawiwwwhindawicom Volume 2018

International Journal of

RotatingMachinery

Hindawiwwwhindawicom Volume 2018

Modelling ampSimulationin EngineeringHindawiwwwhindawicom Volume 2018

Hindawiwwwhindawicom Volume 2018

Chemical EngineeringInternational Journal of Antennas and

Propagation

International Journal of

Hindawiwwwhindawicom Volume 2018

Hindawiwwwhindawicom Volume 2018

Navigation and Observation

International Journal of

Hindawi

wwwhindawicom Volume 2018

Advances in

Multimedia

Submit your manuscripts atwwwhindawicom

Page 16: Research Article - Hindawi Publishing Corporationdownloads.hindawi.com/journals/scn/2019/3038586.pdf · Afrand and Das [15] established a repeated game model between the intrusion

[4] Z Tian X Gao S Su J Qiu X Du and M GuizanildquoEvaluating reputation management schemes of internet ofvehicles based on evolutionary game theoryrdquo IEEE Trans-actions on Vehicular Technology vol 68 no 6 pp 5971ndash59802019

[5] Q Zhu and S Rass ldquoGame theory meets network securitya tutorialrdquo in Proceedings of the ACM SIGSAC Conference onComputer and Communications Security pp 2163ndash2165Toronto Canada October 2018

[6] X Li C Zhou Y C Tian and Y Qin ldquoA dynamic decision-making approach for intrusion response in industrial controlsystemsrdquo IEEE Transactions on Industrial Informatics vol 15no 5 pp 2544ndash2554 2019

[7] H Hu Y Liu H Zhang and R Pan ldquoOptimal networkdefense strategy selection based on incomplete informationevolutionary gamerdquo IEEE Access vol 6 pp 29806ndash298212018

[8] J Li G Kendall and R John ldquoComputing Nash equilibriaand evolutionarily stable states of evolutionary gamesrdquo IEEETransactions on Evolutionary Computation vol 20 no 3pp 460ndash469 2016

[9] M Bowling and M Veloso Multiagent Learning Usinga Variable Learning Rate Elsevier Science Publishers LtdAmsterdam Netherlands 2002

[10] J Harb and D Precup ldquoInvestigating recurrence and eligi-bility traces in deep Q-networksrdquo httpsarxivorgabs170405495 2017

[11] C T Do N H Tran and C Hong ldquoGame theory for cybersecurity and privacyrdquo ACM Computing Surveys vol 50 no 2p 30 2017

[12] Y-L Liu D-G Feng L-HWu and Y-F Lian ldquoPerformanceevaluation of worm attack and defense strategies based onstatic Bayesian gamerdquo Chinese Journal of Software vol 23no 3 pp 712ndash723 2012

[13] Y Li D E Quevedo S Dey and L Shi ldquoA game-theoreticapproach to fake-acknowledgment attack on cyber-physicalsystemsrdquo IEEE Transactions on Signal and Information Pro-cessing over Networks vol 3 no 1 pp 1ndash11 2017

[14] H Zhang L Jiang S Huang J Wang and Y Zhang ldquoAttack-defense differential game model for network defense strategyselectionrdquo IEEE Access vol 7 pp 50618ndash50629 2019

[15] A Afrand and S K Das ldquoPreventing DoS attacks in wirelesssensor networks a repeated game theory approachrdquo InternetJournal of Cyber Security vol 5 no 2 pp 145ndash153 2007

[16] J Wei B Fang and Z Tian ldquoResearch on defense strategiesselection based on attack-defense stochastic game modelrdquoChinese Journal of Computer Research and Developmentvol 47 no 10 pp 1714ndash1723 2010

[17] C Wang X Cheng and Y Zhu ldquoA Markov game model ofcomputer network operationrdquo Chinese Systems Engineering-gteory and Practice vol 34 no 9 pp 2402ndash2410 2014

[18] J Hofbauer and K Sigmund ldquoEvolutionary game dynamicsrdquoBulletin of the American Mathematical Society vol 40 no 4pp 479ndash519 2011

[19] Y Hayel and Q Zhu ldquoEpidemic protection over heteroge-neous networks using evolutionary Poisson gamesrdquo IEEETransactions on Information Forensics and Security vol 12no 8 pp 1786ndash1800 2017

[20] J Huang and H Zhang ldquoImproving replicator dynamicevolutionary game model for selecting optimal defensestrategiesrdquo Chinese Journal on Communications vol 38 no 1pp 170ndash182 2018

[21] H Hu Y Liu H Zhang and Y Zhang ldquoSecurity metricmethods for network multistep attacks using AMC and big

data correlation analysisrdquo Security and CommunicationNetworks vol 2018 Article ID 5787102 14 pages 2018

[22] H Hu Y Liu Y Yang H Zhang and Y Zhang ldquoNew in-sights into approaches to evaluating intention and path fornetwork multistep attacksrdquo Mathematical Problems in En-gineering vol 2018 Article ID 4278632 13 pages 2018

[23] H Hu H Zhang Y Liu and Y Wang ldquoQuantitative methodfor network security situation based on attack predictionrdquoSecurity and Communication Networks vol 2017 Article ID3407642 19 pages 2017

[24] M Zimmer and S Doncieux ldquoBootstrapping Q-learning forrobotics from neuro-evolution resultsrdquo IEEE Transactions onCognitive and Developmental Systems vol 10 no 1pp 102ndash119 2018

[25] R S Sutton and A G Barto ldquoReinforcement learning anintroduction Bradford bookrdquo Machine Learning vol 16no 1 pp 285-286 2005

[26] H Zhang J Wang D Yu J Han and T Li ldquoActive defensestrategy selection based on static Bayesian gamerdquo in Pro-ceedings of IET International Conference on CyberspaceTechnology pp 1ndash7 London UK October 2016

16 Security and Communication Networks

International Journal of

AerospaceEngineeringHindawiwwwhindawicom Volume 2018

RoboticsJournal of

Hindawiwwwhindawicom Volume 2018

Hindawiwwwhindawicom Volume 2018

Active and Passive Electronic Components

VLSI Design

Hindawiwwwhindawicom Volume 2018

Hindawiwwwhindawicom Volume 2018

Shock and Vibration

Hindawiwwwhindawicom Volume 2018

Civil EngineeringAdvances in

Acoustics and VibrationAdvances in

Hindawiwwwhindawicom Volume 2018

Hindawiwwwhindawicom Volume 2018

Electrical and Computer Engineering

Journal of

Advances inOptoElectronics

Hindawiwwwhindawicom

Volume 2018

Hindawi Publishing Corporation httpwwwhindawicom Volume 2013Hindawiwwwhindawicom

The Scientific World Journal

Volume 2018

Control Scienceand Engineering

Journal of

Hindawiwwwhindawicom Volume 2018

Hindawiwwwhindawicom

Journal ofEngineeringVolume 2018

SensorsJournal of

Hindawiwwwhindawicom Volume 2018

International Journal of

RotatingMachinery

Hindawiwwwhindawicom Volume 2018

Modelling ampSimulationin EngineeringHindawiwwwhindawicom Volume 2018

Hindawiwwwhindawicom Volume 2018

Chemical EngineeringInternational Journal of Antennas and

Propagation

International Journal of

Hindawiwwwhindawicom Volume 2018

Hindawiwwwhindawicom Volume 2018

Navigation and Observation

International Journal of

Hindawi

wwwhindawicom Volume 2018

Advances in

Multimedia

Submit your manuscripts atwwwhindawicom

Page 17: Research Article - Hindawi Publishing Corporationdownloads.hindawi.com/journals/scn/2019/3038586.pdf · Afrand and Das [15] established a repeated game model between the intrusion

International Journal of

AerospaceEngineeringHindawiwwwhindawicom Volume 2018

RoboticsJournal of

Hindawiwwwhindawicom Volume 2018

Hindawiwwwhindawicom Volume 2018

Active and Passive Electronic Components

VLSI Design

Hindawiwwwhindawicom Volume 2018

Hindawiwwwhindawicom Volume 2018

Shock and Vibration

Hindawiwwwhindawicom Volume 2018

Civil EngineeringAdvances in

Acoustics and VibrationAdvances in

Hindawiwwwhindawicom Volume 2018

Hindawiwwwhindawicom Volume 2018

Electrical and Computer Engineering

Journal of

Advances inOptoElectronics

Hindawiwwwhindawicom

Volume 2018

Hindawi Publishing Corporation httpwwwhindawicom Volume 2013Hindawiwwwhindawicom

The Scientific World Journal

Volume 2018

Control Scienceand Engineering

Journal of

Hindawiwwwhindawicom Volume 2018

Hindawiwwwhindawicom

Journal ofEngineeringVolume 2018

SensorsJournal of

Hindawiwwwhindawicom Volume 2018

International Journal of

RotatingMachinery

Hindawiwwwhindawicom Volume 2018

Modelling ampSimulationin EngineeringHindawiwwwhindawicom Volume 2018

Hindawiwwwhindawicom Volume 2018

Chemical EngineeringInternational Journal of Antennas and

Propagation

International Journal of

Hindawiwwwhindawicom Volume 2018

Hindawiwwwhindawicom Volume 2018

Navigation and Observation

International Journal of

Hindawi

wwwhindawicom Volume 2018

Advances in

Multimedia

Submit your manuscripts atwwwhindawicom