

    Blackjack:

Can ADP Bring Down the House?

    Cristian Figueroa

    December 14, 2012

    1 Introduction

The purpose of this project is to better understand the house advantage in Blackjack. Blackjack is a card game commonly played at most casinos; for the purpose of this document we establish the player as a male character and the dealer as a female character. The idea is the following:

The dealer gives 2 cards per player. The dealer has a facedown card and a face up card.

There is a set of actions each player can take: Hit, Stand, Double Down and Surrender.

Hit: The dealer gives an additional card to the player's hand; he can do this repeatedly.

    Stand: The player takes no further actions, ending his turn.

Double Down: The player increases his bet, but commits to Hit, and afterwards Stand.

Surrender: The player surrenders the game in exchange for recovering half of his bet. This can only be done at the beginning of the hand.

After the player is done with his actions, the dealer gets additional cards (if required) until her hand scores 17 or higher. The score of a hand is the sum of the scores of each particular card. The scores of the cards are as follows:


An ace can be counted as a 1 or as an 11, whichever is more beneficial.

Face cards (J, Q, or K) count as 10. Any other card scores according to its numerical value.

If the score of the player's cards goes beyond 21 then the player automatically loses. Otherwise the player closest to 21 wins.

Figure 1: Example of the outcome of a hand. The player on the top loses, scoring 22. The player on the bottom scores 18.
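To make the scoring and dealer rules concrete, the following is a minimal Java sketch (class and method names are ours, not the project's code); it counts one ace as 11 whenever that does not bust the hand, and plays a dealer hand until her score reaches 17 or higher. Drawing uniformly from the ten card values is a simplification; the real game draws from the remaining deck.

import java.util.ArrayList;
import java.util.List;
import java.util.Random;

// Minimal sketch of the scoring rules; illustrative only, not the project's classes.
public class HandScoring {

    // Card values: 1 = ace, 2-9 face value, 10 covers 10/J/Q/K.
    static int score(List<Integer> hand) {
        int sum = 0;
        boolean hasAce = false;
        for (int card : hand) {
            sum += card;
            if (card == 1) hasAce = true;
        }
        // Count one ace as 11 (add 10) whenever that does not push the hand over 21.
        if (hasAce && sum + 10 <= 21) sum += 10;
        return sum;
    }

    // The dealer draws until her score reaches 17 or higher.
    static int dealerPlay(List<Integer> hand, Random rng) {
        while (score(hand) < 17) {
            hand.add(1 + rng.nextInt(10)); // simplification: uniform draw instead of drawing from the deck
        }
        return score(hand);
    }

    public static void main(String[] args) {
        List<Integer> player = new ArrayList<>(List.of(1, 7)); // ace + 7 = soft 18
        System.out.println("Player score: " + score(player));
        List<Integer> dealer = new ArrayList<>(List.of(10, 6));
        System.out.println("Dealer final score: " + dealerPlay(dealer, new Random()));
    }
}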

Back in the 60s, Professor Edward O. Thorp proved that a card counting scheme could be used to overcome the house advantage in the game. By house advantage we refer to the fact that players using basic strategy are bound to lose in the long run with strictly average luck. This advantage comes from the fact that the dealer gets to play second, and hence it is possible for the player to bust and go over 21 before the dealer even begins playing.

Thorp, with the aid of an IBM 704, used simulations in order to compute probabilities and design a card counting scheme that would improve his odds. Thorp then decided to try his results in real casinos, earning as much as $11,000 in a single weekend; of course, he was then expelled by casino security. His book Beat the Dealer: A Winning Strategy for the Game of Twenty-One [4] quickly became a best-seller, selling more than 700,000 copies.

Thorp's methodology relied mostly on simulations, and at the end of his book one can appreciate many of the several tables used to compute his strategy. Blackjack can be seen as a stochastic dynamic programming problem. The problem is that the number of states can be really high: if we consider a single deck with 52 cards, each possible configuration of the deck (ignoring


suits), the dealer's card, and the player's score (if we define the score as the sum of the cards), we have on the order of 6.5 billion states¹, more than the number of breaths an average human takes in his lifetime. Even storing the optimal policy would be a challenge. Consider now that most casinos have 5 or 6 decks at a table.

Hence we use approximate dynamic programming (ADP) to solve the problem. The advantage of this method is that it allows a standard way of dealing with the problem and it is easy to adapt to different casino rules. The main disadvantage of the ADP methodology is that it requires simulation in order to compute the optimal action. A complementary work to be considered is the use of Machine Learning algorithms on the resulting policy in order to obtain rules that approximate it. The approximate dynamic programming scheme we use is the Smoothed Approximate Linear Program (S-ALP) as described in [5].

We compare three different policies: the Wikipedia policy, the simple S-ALP and the smart S-ALP. The Wikipedia policy is the result of a player following the strategy chart that can be found in the Wikipedia entry on Blackjack. The simple S-ALP is a policy with just a constant basis function, representing a player that just simulates the effect of the next stage. The smart S-ALP is a policy where the basis functions are chosen based on what would be important for the player.

The report is organized as follows. First, in Section 2 we describe the model we use, covering the controls, state space, probability matrix and dynamic programming formulation. Next, we describe the approximate dynamic programming scheme we use in some detail. The implementation, results and conclusions end the report.

    2 Blackjack Model

In order to understand the model, we begin by explaining the sequence of events in a blackjack game. First, the player must decide how much to bet. Once the player has made his bet, he is dealt two cards; the dealer is dealt a facedown card and a face up card. Here the player can take four different actions²: Hit, Stand, Double Down and Surrender.

¹ Deck configurations × dealer's face up card × player's sum of cards = (5^9 · 17) · 10 · 21.

² In reality there are five different actions, but the game was simplified to not include split for the purpose of this project.


The description of each action is as follows:

Hit: The dealer gives an additional card to the player's hand. This can be followed by a Hit, Stand or Double Down.

    Stand: The player takes no further actions, ending his turn.

Double Down: The player increases his bet, but commits to Hit, and afterwards Stand.

Surrender: The player surrenders the game in exchange for recovering half of his bet. This can only be done at the beginning of the hand.

Once the player is finished with his turn, the dealer checks whether the player went over 21 (busted) or not. If the player busted, then the dealer reveals her facedown card and collects the bet; otherwise the dealer gets any additional cards required until her score reaches 17 or higher. If the dealer goes over 21 or has a lower score than the player's, then the player is given twice his bet (1:1 payoff); else, if the dealer has a higher score than the player, then the dealer collects the player's bet. In case of a tie, the player collects his bet back. A sequence of events can be observed in Figure 2.

    Figure 2: Flow Chart of Blackjack.
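The settlement rule just described reduces to a short comparison once both scores are known. Below is a minimal sketch (our own illustration; it assumes surrender was handled earlier and that a doubled bet is already reflected in b):

public class Settlement {
    // Net gain for the player on a bet b, following the settlement rule above.
    // Assumes surrender was resolved earlier and a doubled-down bet is already included in b.
    static double settle(int playerScore, int dealerScore, double b) {
        if (playerScore > 21) return -b;                              // player busts: dealer collects the bet
        if (dealerScore > 21 || playerScore > dealerScore) return b;  // 1:1 payoff
        if (dealerScore > playerScore) return -b;                     // dealer wins
        return 0.0;                                                   // tie: the player collects his bet back
    }
}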


The uncertain number of stages between bets presents a major challenge for modeling blackjack. The main reason is that not all actions are allowed at all times: betting can only occur prior to knowing the cards to be dealt, and surrender can only be performed as the first action in a hand. Using a discounted scheme is challenging, as discounting profits based on the length of a game may result in undesired consequences, such as incentives for a player to stand just to prevent the discounting of profits.

To overcome this challenge we consider the following decisions for the player. At each hand the player decides what the first action is, as well as how much he will bet in the next hand. After choosing the first decision, he plays the hand by following the rules found in the Wikipedia entry on blackjack, under basic strategy. This set of rules is summarized in Figure 3. The idea is that separating the first play from the after-play means that in each stage the player has the same actions available: how much to bet in the next hand and what action to take as a first action in the current hand. An additional advantage of this model is that each stage is a hand, and hence discounting has the expected effect. A disadvantage of this model is that the next state is highly variable.

A justification for this approach is the following. The main contribution of ADP to the strategy is that we can be strategic in our betting in order to overcome the house advantage. This effect, as well as taking into account the first action in the hand, should be the most prevalent in obtaining higher profits. What happens after the first decision can be considered a second-order effect of some sort. In Figure 4 a flow chart of the model can be appreciated.

Now that the decisions to be modeled have been described, a detailed description of the states, controls, transitions and payoffs can be established.

    2.1 States

A state in the problem consists of the dealer's face up card, the player's cards, the deck configuration and the current bet.

The dealer's face up card is straightforward to describe; we use a numerical value, d, between 1 and 10 to describe it.

The player's cards can be stored as two values, (p, a), p being the sum of the cards in the player's hand, and a the number of aces held by the player. This significantly reduces the possible number of states as opposed to storing each of the cards received by the player.


Figure 3: Strategy followed after the first action. H: Hit, Su: Surrender, S: Stand, D: Double Down.


Deck configuration, c: we store the number of cards of each value still left in the deck. This requires an array of length 10 composed of numerical values. For example, if there are only 13 cards scoring 10 points and 5 cards scoring 4 points, the deck configuration is c = (0, 0, 0, 0, 5, 0, 0, 0, 0, 13).

    The current bet, b, can be stored by its numerical value.

    Therefore we can describe the state space as the following set:

S = {(d, (p, a), c, b) : d ∈ [1, 10], p ∈ [2, 20], a ∈ [0, 2], c ∈ [0, 4n]^9 × [0, 16n], b ∈ B}


Figure 4: Flow chart of the model. Purple represents the decision of the player; the rest is the transition process to a new state.

where B is the set of possible bets, and n is the number of decks used by the dealer. The size of the state space is considerable: for a dealer using three decks, we have |S| ≈ 10^13, a number larger than the number of stars estimated to be in the Milky Way.
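As an illustration of how compact this state is to store per hand, a possible Java container for (d, (p, a), c, b) is sketched below (field names are ours; the project's Player and Deck classes may organize this differently). The last entry of the deck-count array holds the ten-valued cards, following the example above.

// Illustrative container for a state (d, (p, a), c, b); names are ours, not the project's.
public class State {
    int dealerUpCard;    // d, a value in 1..10
    int playerSum;       // p, the raw sum of the player's cards
    int playerAces;      // a, the number of aces the player holds
    int[] deckCount;     // c, length 10: cards of each value left in the deck (last entry: ten-valued cards)
    double currentBet;   // b

    State(int d, int p, int a, int[] c, double b) {
        this.dealerUpCard = d;
        this.playerSum = p;
        this.playerAces = a;
        this.deckCount = c.clone();
        this.currentBet = b;
    }
}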

    2.2 Control

Having described the quantities required in order to completely characterize a game of blackjack, we can begin describing the set of controls. In our case it is the product of two decisions: how much to bet in the next game and what to do as a first action.

In order to have a finite set of states we restrict the betting decisions to a finite set B. The possible first actions have been described before; these are: Double Down, Hit, Stand and Surrender. The controls that a player can use are not restricted by the state the player has reached. Hence the set of controls U can be described as:

U = {(b', s) | b' ∈ B, s ∈ {Double Down, Hit, Stand, Surrender}}.


    2.3 Costs

There are two factors that contribute to the cost of a control given a state: the next bet, and the expected profit of the action taken. Hence the cost of a control u = (b', s) given a state x can be written as:

g(x, u) =
    b' - 2b·P(Win | x)        if s = Stand or Hit
    b' + b - 4b·P(Win | x)    if s = Double Down
    b' - 0.5b                 if s = Surrender

Here b' represents the next bet (part of the control), and b represents the previous bet.
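A direct transcription of this cost into code could look like the sketch below, where pWin stands for a simulation estimate of P(Win | x); the enum and method names are ours, not the project's.

public class StageCost {
    enum FirstAction { HIT, STAND, DOUBLE_DOWN, SURRENDER }

    // g(x, u) for the control u = (nextBet, action), with previous bet b and
    // pWin an estimate of P(Win | x) obtained by simulation.
    static double cost(double nextBet, FirstAction action, double b, double pWin) {
        switch (action) {
            case STAND:
            case HIT:
                return nextBet - 2.0 * b * pWin;
            case DOUBLE_DOWN:
                return nextBet + b - 4.0 * b * pWin;
            case SURRENDER:
                return nextBet - 0.5 * b;
            default:
                throw new IllegalArgumentException("unknown action");
        }
    }
}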

    2.4 Transitions

Given that the post-game actions, as well as the dealer's actions, can lead to many different states, an analytical expression for all the reachable states as well as their transition probabilities is hard to obtain. Given that we use simulation in order to approximate the values of the expected basis functions as well as the expected costs, it is not necessary to obtain an analytical expression.

    2.5 Stochastic Path vs. Discounted Profit

Both formulations, average cost or discounted cost, could be used to formulate the objective function of our dynamic programming problem. We decided to formulate the problem as a discounted maximization problem. The main reason for this is that it is the measure of choice in the paper whose methodology we follow [5] in order to solve the ADP. Also, after consulting with V. Farias about why Tetris is solved in [3] as a discounted profit problem rather than a stochastic path problem, he mentioned that the discounted cost setting has better numerical stability.

    2.6 Infinite Stages

In order to use the discounted profit setting, we need to have infinite stages. Of course this cannot happen with a finite number of decks unless we decide to shuffle the deck and reset the game. This is exactly what we do: for a fixed number of cards l, if the deck has less than or equal to l cards at the


beginning of the game, then the deck is reshuffled and the new game is drawn from a freshly reshuffled deck. It is interesting to examine how the profit of the player changes based on this number.
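In code, this reshuffle rule is a single check at the start of each hand. The sketch below (our own illustration, not the project's Deck class) rebuilds and shuffles an n-deck shoe whenever l or fewer cards remain.

import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Sketch of an n-deck shoe with reshuffle threshold l; names are ours.
public class Shoe {
    private final int nDecks;
    private final List<Integer> cards = new ArrayList<>();

    Shoe(int nDecks) { this.nDecks = nDecks; reset(); }

    void reset() {
        cards.clear();
        for (int d = 0; d < nDecks; d++)
            for (int value = 1; value <= 10; value++)
                // four cards per value, except sixteen ten-valued cards (10, J, Q, K) per deck
                for (int k = 0; k < (value == 10 ? 16 : 4); k++) cards.add(value);
        Collections.shuffle(cards);
    }

    // Called at the beginning of each hand: reshuffle once l or fewer cards remain.
    void maybeReshuffle(int l) {
        if (cards.size() <= l) reset();
    }

    int draw() { return cards.remove(cards.size() - 1); }

    int cardsLeft() { return cards.size(); }
}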

3 The Approximate Dynamic Programming Problem

Given the model described, we can now formulate the dynamic programming problem associated with it. In this case the Bellman equation is the following:

J((d, (p, a), c, b)) = min_{u ∈ U} { g((d, (p, a), c, b), u) + Σ_{s ∈ S((d,(p,a),c,b), u)} p_{(d,(p,a),c,b) s}(u) J(s) }

where S(r) denotes the set of states reachable from state r. Solving this problem exactly involves finding J for all the relevant states. Even storing this value is out of the question, as it involves billions of values; storing the transition matrix is also out of the question. Hence we decide to use approximate dynamic programming on the value function, in order to find coefficients r such that Σ_{i=1}^{n} r_i φ_i(s) ≈ J(s) for carefully chosen basis functions φ_i.

We begin by detailing the method chosen: Smoothed Approximate Linear Programming. We then follow with the basis functions chosen for our problem.

    3.1 Smoothed Approximate Linear Program: S-ALP

There are several methods in the approximate dynamic programming literature to compute the r coefficients when approximating the value function. The method we decided to use is the smoothed approximate linear program, following the paper by V. Farias and C. Moallemi [5]. The reason for this choice is that, rather than relying on projections and fixed points, it uses linear programming to compute the r coefficients. As MIT students we have access to IBM's CPLEX, and hence we can compute the r coefficients efficiently by using this software. The smoothed ALP (Approximate Linear Program) approach is based on the fact that the solution to the following problem gives a solution to the


    Bellman equation:

max J
s.t. J ≤ T J

where we follow the operator notation T J(x) = min_{u ∈ U} { g(x, u) + Σ_s p(x, s) J(s) }. Even though the inequality looks nonlinear, it is in fact linear, as we can replace it by the following set of inequalities:

J(x) ≤ g(x, u) + Σ_s p(x, s) J(s)    for all u ∈ U

The reason this delivers the solution of the Bellman equation is that a solution to the optimization problem yields the largest of the fixed points of T.

Moreover, we can replace the objective coefficients by any positive cost vector or probability distribution, and the solution to the previous LP still yields the solution to the Bellman equation. The ALP approach suggests solving the following problem:

max c'J
s.t. J ≤ T J
     J ∈ span(Φ)

where Φ is the matrix of the basis functions we want to use to approximate the value function. Equivalently, this problem can be written as the following LP:

max c'Φr
s.t. Φr ≤ TΦr

A good reference for the ALP is given by D. P. de Farias and B. Van Roy in [2]. Given the large number of constraints of this LP in contrast with the small number of variables, constraint sampling can be used. Hence we need to use only a small portion of the constraints in order to obtain a fairly good value for r. An improvement on this method is to bootstrap the policies and re-solve the ALP, as shown in [3].

Further improvements can be made to the solution; in [5], the smoothed linear program is used. The smoothed linear program solves:

max c'Φr
s.t. Φr ≤ TΦr + s
     π's ≤ θ
     s ≥ 0


Here π represents the constraint-violation distribution, and θ the constraint-violation budget. Hence it is a generalization of the previous approach. The parameter θ starts close to zero, resembling the ALP approach, and is then increased as long as it keeps improving the policy.
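To give a sense of how the Optimizer class might assemble this LP with the CPLEX Java API, a rough sketch follows. The data layout, the variable bounds, and the way the discount factor α enters the sampled constraints are our assumptions, not the project's actual code; each sampled state contributes one violation variable and one constraint per action, with the expected next-stage basis functions estimated by simulation.

import ilog.concert.IloException;
import ilog.concert.IloLinearNumExpr;
import ilog.concert.IloNumVar;
import ilog.cplex.IloCplex;

// Rough sketch of the S-ALP:  max c'Φr  s.t.  Φr ≤ TΦr + s,  π's ≤ θ,  s ≥ 0.
// phi[x][i]       : basis functions evaluated at sampled state x
// expPhi[x][u][i] : simulation estimate of E[φ_i(next state) | x, u]
// g[x][u]         : simulation estimate of the stage cost
// c[x], pi[x]     : state-relevance and violation weights; theta: violation budget; alpha: discount factor
public class SalpSolver {
    public static double[] solve(double[][] phi, double[][][] expPhi, double[][] g,
                                 double[] c, double[] pi, double theta, double alpha) throws IloException {
        int nStates = phi.length, nBasis = phi[0].length, nActions = g[0].length;
        IloCplex cplex = new IloCplex();
        IloNumVar[] r = cplex.numVarArray(nBasis, -1e6, 1e6);                 // basis-function weights
        IloNumVar[] s = cplex.numVarArray(nStates, 0.0, Double.MAX_VALUE);    // constraint violations

        IloLinearNumExpr obj = cplex.linearNumExpr();
        IloLinearNumExpr budget = cplex.linearNumExpr();
        for (int x = 0; x < nStates; x++) {
            for (int i = 0; i < nBasis; i++) obj.addTerm(c[x] * phi[x][i], r[i]);
            budget.addTerm(pi[x], s[x]);
            for (int u = 0; u < nActions; u++) {
                // (Φr)(x) - α·E[(Φr)(x') | x, u] - s_x ≤ g(x, u)
                IloLinearNumExpr lhs = cplex.linearNumExpr();
                for (int i = 0; i < nBasis; i++) lhs.addTerm(phi[x][i] - alpha * expPhi[x][u][i], r[i]);
                lhs.addTerm(-1.0, s[x]);
                cplex.addLe(lhs, g[x][u]);
            }
        }
        cplex.addMaximize(obj);
        cplex.addLe(budget, theta);   // π's ≤ θ
        if (!cplex.solve()) throw new IllegalStateException("S-ALP could not be solved");
        return cplex.getValues(r);
    }
}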

A case study with the game Tetris has been done in order to compare several of the popular ADP policies; the comparison table can be found in [5].

An additional advantage of this methodology is that, in terms of performance, given the large number of constraints against the small number of variables, the dual simplex method solves the problem in a matter of seconds.

3.2 Basis Functions φi

Based on knowledge of the game, we used two concepts to define our basis functions: deck load and cards-to-shuffle. Deck load is defined as the number of cards of high value remaining in the deck (A, K, Q, J, 10) minus the number of cards of low value remaining in the deck (2, 3, 4, 5, 6). A deck that has recently been shuffled has a deck load of zero. By cards-to-shuffle we mean the number of cards to be dealt before the deck is reshuffled.

Using these two notions we define our basis functions, φ0, ..., φ5, as:

φ0: constant.

φ1, ..., φ4: one for each combination of possible bet and whether the deck load is positive or not. For example, φ1(s) = 6 and φ2(s) = ... = φ4(s) = 0 if the player made a minimum bet in the previous round and the deck load is 6; φ2(s) = 2 and φ1(s) = φ3(s) = φ4(s) = 0 if the player made a minimum bet in the previous round and the deck load is -2.


φ5: cards-to-shuffle.

The reason we separated the combinations of bets and signs of the deck load is that we are going to try to interpret the coefficients that multiply the basis functions.

One of the reasons for settling on these basis functions, as opposed to functions that depend on the dealer's card or the player's cards, is that such values have very high variability, as they are basically an independent sample from the deck. They would therefore require a large amount of simulation in order to obtain reasonable stability of E[φi | (b, s)].

We compare the results we obtain by choosing these basis functions with the results we obtain by choosing only the constant basis function (meaning pure simulation).
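As an illustration, the two quantities and the resulting feature vector could be computed as in the sketch below. This is our own reading of the description above: we assume two bet levels (so the four bet/sign combinations map to φ1, ..., φ4), store the magnitude of the deck load in the active slot, assume aces sit in the first entry of the deck-count array and ten-valued cards in the last, and take cards-to-shuffle to be the cards remaining above the reshuffle threshold l. The project's code may differ on any of these points.

// Illustrative computation of the basis functions φ0..φ5; names and layout are our assumptions.
public class BasisFunctions {

    // Deck load: high cards remaining (A, 10, J, Q, K) minus low cards remaining (2-6).
    static int deckLoad(int[] deckCount) {
        int high = deckCount[0] + deckCount[9];                                             // aces + ten-valued cards
        int low = deckCount[1] + deckCount[2] + deckCount[3] + deckCount[4] + deckCount[5]; // values 2-6
        return high - low;
    }

    // Cards-to-shuffle: cards still to be dealt before the deck is reshuffled (threshold l).
    static int cardsToShuffle(int[] deckCount, int l) {
        int remaining = 0;
        for (int n : deckCount) remaining += n;
        return Math.max(0, remaining - l);
    }

    // φ0 constant; φ1..φ4 one per (minimum bet or not) x (positive deck load or not), holding |deck load|;
    // φ5 cards-to-shuffle. Only the slot matching the previous bet and the sign of the load is nonzero.
    static double[] phi(int[] deckCount, boolean minimumBetLastRound, int l) {
        double[] f = new double[6];
        f[0] = 1.0;
        int load = deckLoad(deckCount);
        int slot = (minimumBetLastRound ? 1 : 3) + (load > 0 ? 0 : 1);
        f[slot] = Math.abs(load);
        f[5] = cardsToShuffle(deckCount, l);
        return f;
    }
}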

    4 Implementation

Using Java we generate several classes that allow us to simulate a blackjack game and keep track of the relevant quantities. There are two types of classes considered for this purpose: the classes that model the game (Dealer, Deck, Player) and the classes that simulate games, recording states, shuffling the deck, etc.

In order to generate the S-ALP, we first play games following simple strategies in order to generate a fixed number of states, nStates. Then we create the constraints that go into the S-ALP. We start by choosing a fixed number of samples, nSamples, that we use in order to compute E[Φr | s, u] and g(s, u), given a state s and a control u. Then, we randomly choose 80% of the generated states and create the constraints for those states through simulation.
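Each of these per-(state, control) quantities is a plain Monte Carlo average over nSamples simulated hands. The sketch below shows the shape of that computation; the HandSimulator interface and all names here are ours, standing in for whatever the ALP class uses to play one hand from (s, u) and report the realized cost and the basis functions of the resulting state.

import java.util.Random;

// Sketch of the Monte Carlo estimates behind each S-ALP constraint; names are ours.
public class ConstraintSampler {

    // Plays a single hand from state s under control u and reports the outcome.
    interface HandSimulator<S, U> {
        Outcome playOnce(S state, U control, Random rng);
    }

    static class Outcome {
        double cost;       // realized stage cost g(s, u) for this sample
        double[] nextPhi;  // basis functions evaluated at the resulting state
    }

    static class Estimates {
        double expectedCost;
        double[] expectedPhi;
    }

    // Averages nSamples independent hands to estimate g(s, u) and E[φ(next state) | s, u].
    static <S, U> Estimates estimate(HandSimulator<S, U> sim, S state, U control,
                                     int nSamples, int nBasis, Random rng) {
        Estimates est = new Estimates();
        est.expectedPhi = new double[nBasis];
        for (int k = 0; k < nSamples; k++) {
            Outcome o = sim.playOnce(state, control, rng);
            est.expectedCost += o.cost / nSamples;
            for (int i = 0; i < nBasis; i++) est.expectedPhi[i] += o.nextPhi[i] / nSamples;
        }
        return est;
    }
}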

Then we call a method that, given a budget for constraint violations θ, generates the S-ALP and solves it using CPLEX. Afterwards we simulate the policy, and adjust the parameter θ in order to find a good value that improves the performance of the policy.

The diagram depicted in Figure 5 illustrates the process of finding a policy with this methodology. The total process took on the order of hours to solve. The details of the code used to solve the S-ALP can be found at:

    http://www.mit.edu/~cfiguero/Blackjack/


    Figure 5: Steps involved in the process of finding a policy through S-ALP.

    A brief description of the files:

Player, Deck, Dealer: These classes are straightforward object classes with methods for drawing cards, updating bets, and game actions. These objects are used in most of the other classes.

Simulator, ALP: These objects are used to simulate several games of blackjack occurring between a dealer and a player. The methods for generating states and constraints can be found in ALP. Methods for calculating the expected outcome of an action can also be found in ALP. Outputs of generating states are saved in states.txt, outputs of generating constraints are saved in constraints.txt, and a log of the development of the games used to generate the constraints is saved in runLog.txt.

Optimizer: This class reads the outputs from constraints.txt and, given a parameter θ, creates the S-ALP and solves it.

ALPSimulator: This class was designed to simulate the policy that is obtained through the ALP.

runner, runSol, NormRunner: Classes designed to use the previous classes in order to obtain results. The way to obtain a policy is by executing runner. To use this policy and see its performance, execute solRunner. To see the performance of an out-of-the-box policy, execute NormRunner.

    VisualTable: This is a graphical representation of the performance ofthe policy.


    5 Results and Analysis

As we find a policy and start simulating it, we find that policies with a large θ tend to make inappropriate decisions when playing. After careful consideration, we realized that this mainly occurred due to misestimating future costs. Given the variability of outcomes in the next game, it is hard to appropriately estimate the next state and future costs without exhaustive sampling. Using a value of θ that is small enough provides better results while still taking future costs into account.

    Figure 6: Profit collected after playing 1000 hands, for 100 different samples.

In Figure 6, the variability of the profits can be appreciated. Each of the dots represents the profit after playing a thousand hands. The variability of the profits is a well-known fact in the gambling community: independent of the level of skill (how good the policy is), positive profits are not guaranteed in the short run [1]. The performance of a policy can only be appreciated in the long term, and in this case this means more than 1000 hands. The inability to reliably quantify the performance of the policy prevents us from taking full advantage of the S-ALP, as it is not easy to decide whether or not to increase θ.

The average profit of the ADP policy with a small θ is 11.8 over 100 repetitions of 1000 games. The average profit of the policy found on the internet is -11.335 over 100 repetitions of 1000 games.


    6 Conclusions

Blackjack is a game with high variability between stages. This in turn requires a scheme that is able to take enough samples to obtain reliable estimates when trying to anticipate future costs. In our case we took enough samples that it would not take more than a few hours to solve, and a couple of hours to simulate.

It turns out that even with this number of samples, the variability in the results makes it very hard for the S-ALP policy to estimate future profits. This limits the benefits that come from the ADP solution. This is not to say that an ADP approach could not work; it just requires more simulation than was used in this project.

To counter the effects of a bad estimation of future profits, increased discounting of future profits can be used. By lowering the discount factor, a policy can be obtained that beats the best online policy, but it does not gain as big an advantage as one would desire. The low discount factor can be understood in the following way. In our model, the most important stages to consider when making a decision are the current stage and the one directly after it. The reason is that, in deciding the bet for the next stage, if the policy is too myopic it will always choose the smallest possible bet, but if the policy looks too far into future profits, it will make an inappropriate decision (like surrendering) in the current play; this could potentially be fixed through more simulation.

Using rollout policies or n-step lookahead seems to be a more appropriate ADP method, one that adjusts better to the problem. The reason for this is that there exist very good base policies that these methods can improve. Also, the limited lookahead of the methodology aligns well with the structure of the game, as mentioned before.

Several books about blackjack state that the advantage of a skilled player in blackjack is around 2% [1]; given the high variability, it seems unlikely that ADP can get an advantage much higher than that.

Blackjack turned out to be a very interesting game to study in terms of modeling and analysis of the policy. Much of the role of the different parameters (θ, the number of states sampled, and the expectations) came into play in analyzing the performance of the policy obtained through this methodology. It poses an interesting challenge to see whether a policy can be found, through careful modeling and parameter tuning, that obtains higher than a 10% return.


    References

[1] American Publishing Corporation. How to Beat the Dealer in Any Casino. American Publishing Corporation, 2000.

[2] D. P. de Farias and B. Van Roy. The linear programming approach to approximate dynamic programming. Operations Research, 2003.

[3] Vivek Farias and Benjamin Van Roy. Tetris: A study of randomized constraint sampling. In Probabilistic and Randomized Methods for Design Under Uncertainty.

[4] Edward O. Thorp. Beat the Dealer: A Winning Strategy for the Game of Twenty-One. Vintage, 1966.

[5] V. V. Desai, V. F. Farias, and C. C. Moallemi. Approximate dynamic programming via a smoothed approximate linear program. Operations Research, 2009.
