TRANSCRIPT
A Guided Tour of Chapter 1: Markov Process and Markov Reward Process
Ashwin Rao
ICME, Stanford University
Intuition on the concepts of Process and State
Process: time-sequenced random outcomes
Random outcome, e.g., price of a derivative, portfolio value, etc.
State: Internal Representation St driving future evolution
We are interested in P[St+1 | St, St−1, ..., S0]
Let us consider random walks of stock prices Xt = St
P[Xt+1 = Xt + 1] + P[Xt+1 = Xt − 1] = 1
We consider 3 examples of such processes
Markov Property - Stock Price Random Walk Process
Process is pulled towards level L with strength parameter α
P[Xt+1 = Xt + 1] = 1 / (1 + e^(−α1·(L − Xt)))
Notice how the probability of the next price depends only on the current price
“The future is independent of the past given the present”
P[Xt+1 | Xt, Xt−1, ..., X0] = P[Xt+1 | Xt] for all t ≥ 0
This makes the mathematics easier and the computation tractable
We call this the Markov Property of States
The state captures all relevant information from history
Once the state is known, the history may be thrown away
The state is a sufficient statistic of the future
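As an aside (this sketch is ours, not the book's code), this process is easy to simulate; the function name and parameters below are illustrative assumptions:

import numpy as np

def simulate_process1(start: int, level: float, alpha1: float,
                      num_steps: int, seed=None) -> list:
    # Process 1: up-move probability is a logistic pull toward `level`
    rng = np.random.default_rng(seed)
    xs = [start]
    for _ in range(num_steps):
        p_up = 1.0 / (1.0 + np.exp(-alpha1 * (level - xs[-1])))
        xs.append(xs[-1] + (1 if rng.random() < p_up else -1))
    return xs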
Logistic Functions f(x; α) = 1 / (1 + e^(−α·x))
[Figure: logistic curves]
Another Stock Price Random Walk Process
P[Xt+1 = Xt + 1] = 0.5 · (1 − α2 · (Xt − Xt−1)) if t > 0, and 0.5 if t = 0
Direction of Xt+1 − Xt is biased in the reverse direction of Xt − Xt−1
Extent of the bias is controlled by “pull-strength” parameter α2
St = Xt doesn't satisfy the Markov Property, but St = (Xt, Xt − Xt−1) does
P[(Xt+1, Xt+1 − Xt) | (Xt, Xt − Xt−1), (Xt−1, Xt−1 − Xt−2), ..., (X0, Null)]
= P[(Xt+1, Xt+1 − Xt) | (Xt, Xt − Xt−1)]
St = (X0, X1, ..., Xt) or St = (Xt, Xt−1) also satisfy the Markov Property
But we seek the "simplest/minimal" representation for the Markov State
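Here is a minimal sketch (ours, not the book's code) that carries exactly the Markov state (Xt, Xt − Xt−1); note that α2 must lie in [0, 1] for the expression to be a valid probability:

import random

def simulate_process2(start: int, alpha2: float, num_steps: int, seed=None) -> list:
    rng = random.Random(seed)
    x, prev_move = start, 0   # prev_move encodes Xt - Xt-1 (0 means t = 0)
    xs = [x]
    for t in range(num_steps):
        # bias the next move against the direction of the previous move
        p_up = 0.5 if t == 0 else 0.5 * (1 - alpha2 * prev_move)
        prev_move = 1 if rng.random() < p_up else -1
        x += prev_move
        xs.append(x)
    return xs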
Yet Another Stock Price Random Walk Process
Here, the probability of the next move depends on all past moves
Depends on # past up-moves Ut relative to # past down-moves Dt
P[Xt+1 = Xt + 1] = 1 / (1 + ((Ut + Dt)/Dt − 1)^α3) if t > 0, and 0.5 if t = 0
Direction of Xt+1 − Xt biased in the reverse direction of history
α3 is a “pull-strength” parameter
Most “compact” Markov State St = (Ut ,Dt)
P[(Ut+1, Dt+1) | (Ut, Dt), (Ut−1, Dt−1), ..., (U0, D0)]
= P[(Ut+1, Dt+1) | (Ut, Dt)]
Note that Xt is not part of St since Xt = X0 + Ut − Dt
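Again as a sketch of ours (not the book's code), simulation needs only the Markov state (Ut, Dt); the boundary cases below follow from taking limits of the formula:

import random

def simulate_process3(alpha3: float, num_steps: int, seed=None) -> list:
    rng = random.Random(seed)
    up, down = 0, 0           # the Markov state (Ut, Dt)
    moves = []
    for t in range(num_steps):
        if up + down == 0:
            p_up = 0.5        # the t = 0 case
        elif down == 0:
            p_up = 0.0        # all past moves were up: the limit pulls down
        else:
            p_up = 1.0 / (1.0 + ((up + down) / down - 1.0) ** alpha3)
        if rng.random() < p_up:
            up += 1
            moves.append(1)
        else:
            down += 1
            moves.append(-1)
    return moves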
Unit-Sigmoid Curves f(x; α) = 1 / (1 + (1/x − 1)^α)
[Figure: unit-sigmoid curves]
[Figure: single sampling traces for the 3 processes]
[Figure: terminal probability distributions for the 3 processes]
Definition for Discrete Time, Countable States
Definition
A Markov Process consists of:
A countable set of states S (known as the State Space) and a set T ⊆ S (known as the set of Terminal States)
A time-indexed sequence of random states St ∈ S for time steps t = 0, 1, 2, ..., with each state transition satisfying the Markov Property: P[St+1 | St, St−1, ..., S0] = P[St+1 | St] for all t ≥ 0
Termination: If an outcome for ST (for some time step T) is a state in the set T, then this sequence outcome terminates at time step T
The more commonly used term for Markov Process is Markov Chain
We refer to P[St+1|St ] as the transition probabilities for time t.
Non-terminal states: N = S − T
Classical Finance results are based on continuous-time Markov Processes
Some nuances of Markov Processes
Time-Homogeneous Markov Process: P[St+1|St ] independent of t
Time-Homogeneous MP specified with function P : N × S → [0, 1]
P(s, s′) = P[St+1 = s′ | St = s] for all t = 0, 1, 2, ...
P is the Transition Probability Function (source s → destination s′)
Can always convert to time-homogeneous: augment State with time
Default: Discrete-Time, Countable-States, Time-Homogeneous MP
Termination typically modeled with Absorbing States (we don’t!)
Separation between:
Specification of the Transition Probability Function P
Specification of the Probability Distribution of Start States µ : N → [0, 1]
Together (P and µ), we can produce Sampling Traces
Episodic versus Continuing Sampling Traces
The abstract class MarkovProcess
class MarkovProcess(ABC, Generic[S]):
    # State[S], NonTerminal[S] and Distribution are types from the
    # book's rl library; S is the generic state type

    @abstractmethod
    def transition(self, state: NonTerminal[S]) -> \
            Distribution[State[S]]:
        pass

    def simulate(
        self,
        start_state_distr: Distribution[NonTerminal[S]]
    ) -> Iterable[State[S]]:
        st: State[S] = start_state_distr.sample()
        yield st
        while isinstance(st, NonTerminal):
            st = self.transition(st).sample()
            yield st
Finite Markov Process
Finite State Space S = {s1, s2, ..., sn}, |N| = m ≤ n
We'd like a sparse representation for P
Conceptualize P : N × S → [0, 1] as N → (S → [0, 1])
Transition = Mapping[
    NonTerminal[S],
    FiniteDistribution[State[S]]
]
class FiniteMarkovProcess
class FiniteMarkovProcess(MarkovProcess[S]):

    nt_states: Sequence[NonTerminal[S]]
    tr_map: Transition[S]

    def __init__(self, tr: Mapping[S, FiniteDistribution[S]]):
        nt: Set[S] = set(tr.keys())
        # wrap each state as NonTerminal or Terminal, depending on
        # whether it appears as a source state in the given mapping
        self.tr_map = {
            NonTerminal(s): Categorical({
                (NonTerminal(s1) if s1 in nt else Terminal(s1)): p
                for s1, p in v.table().items()
            }) for s, v in tr.items()
        }
        self.nt_states = list(self.tr_map.keys())

    def transition(self, state: NonTerminal[S]) \
            -> FiniteDistribution[State[S]]:
        return self.tr_map[state]
Weather Finite Markov Process
{
    "Rain": Categorical({"Rain": 0.3, "Nice": 0.7}),
    "Snow": Categorical({"Rain": 0.4, "Snow": 0.6}),
    "Nice": Categorical({
        "Rain": 0.2,
        "Snow": 0.3,
        "Nice": 0.5
    })
}
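As a usage sketch (assuming the book's Constant distribution class for a deterministic start state), we can wrap this mapping in FiniteMarkovProcess and sample a trace:

import itertools

weather_mp = FiniteMarkovProcess({
    "Rain": Categorical({"Rain": 0.3, "Nice": 0.7}),
    "Snow": Categorical({"Rain": 0.4, "Snow": 0.6}),
    "Nice": Categorical({"Rain": 0.2, "Snow": 0.3, "Nice": 0.5})
})
# first 10 states of one (continuing) sampling trace starting at "Rain"
trace = list(itertools.islice(
    weather_mp.simulate(Constant(NonTerminal("Rain"))), 10
))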
[Figure: state-transition diagram of the Weather Finite Markov Process]
Order of Activity for Inventory Markov Process
α := On-Hand Inventory, β := On-Order Inventory, C := Store Capacity
Observe State St : (α, β) at 6pm store-closing
Order Quantity := max(C − (α + β), 0)
Receive Inventory at 6am if you had ordered 36 hours ago
Open the store at 8am
Experience random demand i with Poisson probabilities:
PMF f(i) = e^(−λ) · λ^i / i!, CDF F(i) = Σ_{j=0}^{i} f(j)
Inventory Sold is min(α + β, i)
Close the store at 6pm
Observe new state St+1 : (max(α + β − i , 0),max(C − (α + β), 0))
Inventory Markov Process States and Transitions
S := {(α, β) : 0 ≤ α + β ≤ C}
If St := (α, β), then St+1 := (α + β − i, C − (α + β)) for i = 0, 1, ..., α + β
P((α, β), (α + β − i, C − (α + β))) = f(i) for 0 ≤ i ≤ α + β − 1
P((α, β), (0, C − (α + β))) = Σ_{j=α+β}^{∞} f(j) = 1 − F(α + β − 1)
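The following self-contained sketch (plain dicts and tuples of our own, rather than the book's classes) builds exactly these transition probabilities:

from math import exp, factorial

def inventory_transition_map(capacity: int, poisson_lambda: float) -> dict:
    def f(i: int) -> float:                        # Poisson PMF of demand
        return exp(-poisson_lambda) * poisson_lambda ** i / factorial(i)

    mp = {}
    for alpha in range(capacity + 1):              # on-hand inventory
        for beta in range(capacity + 1 - alpha):   # on-order inventory
            ip = alpha + beta                      # inventory position
            beta1 = capacity - ip                  # next on-order quantity
            probs = {(ip - i, beta1): f(i) for i in range(ip)}
            # demand >= alpha + beta clears inventory: prob 1 - F(ip - 1)
            probs[(0, beta1)] = 1.0 - sum(f(j) for j in range(ip))
            mp[(alpha, beta)] = probs
    return mp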
[Figure: transition diagram of the Inventory Markov Process]
Stationary Distribution of a Markov Process
Definition
The Stationary Distribution of a (Time-Homogeneous) Markov Process with state space S = N and transition probability function P : N × N → [0, 1] is a probability distribution function π : N → [0, 1] such that:
π(s′) = Σ_{s∈N} π(s) · P(s, s′) for all s′ ∈ N
For a Time-Homogeneous MP with finite states S = {s1, s2, ..., sn} = N,
π(sj) = Σ_{i=1}^{n} π(si) · P(si, sj) for all j = 1, 2, ..., n
Turning P into a matrix, we get: π^T = π^T · P
P^T · π = π ⇒ π is an eigenvector of P^T with eigenvalue 1
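A small numpy sketch of this eigenvector calculation (illustrative, not the book's implementation):

import numpy as np

def stationary_distribution(P: np.ndarray) -> np.ndarray:
    # pick the eigenvector of P^T whose eigenvalue is (numerically) 1,
    # then normalize it into a probability distribution
    eig_vals, eig_vecs = np.linalg.eig(P.T)
    idx = int(np.argmin(np.abs(eig_vals - 1.0)))
    pi = np.real(eig_vecs[:, idx])
    return pi / pi.sum()

# e.g. the weather process, rows/columns ordered (Rain, Snow, Nice)
P_weather = np.array([[0.3, 0.0, 0.7],
                      [0.4, 0.6, 0.0],
                      [0.2, 0.3, 0.5]])
print(stationary_distribution(P_weather))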
MRP Definition for Discrete Time, Countable States
Definition
A Markov Reward Process (MRP) is a Markov Process, along with a time-indexed sequence of Reward random variables Rt ∈ D (a countable subset of R) for time steps t = 1, 2, ..., satisfying the Markov Property (including Rewards): P[(Rt+1, St+1) | St, St−1, ..., S0] = P[(Rt+1, St+1) | St] for all t ≥ 0.
S0, R1, S1, R2, S2, ..., ST−1, RT, ST
By default, we assume a time-homogeneous MRP, i.e., P[(Rt+1, St+1) | St] is independent of t
Time-Homogeneous MRP specified as: PR : N × D × S → [0, 1]
PR(s, r, s′) = P[(Rt+1 = r, St+1 = s′) | St = s] for all t = 0, 1, 2, ...
PR is known as the Transition Probability Function
The abstract class MarkovRewardProcess
class MarkovRewardProcess(MarkovProcess[S]):

    @abstractmethod
    def transition_reward(self, state: NonTerminal[S]) \
            -> Distribution[Tuple[State[S], float]]:
        pass

    def transition(self, state: NonTerminal[S]) \
            -> Distribution[State[S]]:
        distribution = self.transition_reward(state)

        # marginalize out the reward: sample a (state, reward)
        # pair and keep only the next state
        def next_state(distribution=distribution):
            next_s, _ = distribution.sample()
            return next_s

        return SampledDistribution(next_state)
MRP Reward Functions
The reward transition function RT : N × S → R is defined as:
RT(s, s′) = E[Rt+1 | St+1 = s′, St = s]
= Σ_{r∈D} (PR(s, r, s′) / P(s, s′)) · r
= Σ_{r∈D} (PR(s, r, s′) / Σ_{r′∈D} PR(s, r′, s′)) · r
The reward function R : N → R is defined as:
R(s) = E[Rt+1 | St = s]
= Σ_{s′∈S} P(s, s′) · RT(s, s′)
= Σ_{s′∈S} Σ_{r∈D} PR(s, r, s′) · r
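Both functions reduce to sums over the (reward, next state) table; here is a tabular sketch with an illustrative dict layout of our own, pr_table[s][(r, s1)] = PR(s, r, s1):

def reward_functions(pr_table: dict):
    R, RT = {}, {}
    for s, table in pr_table.items():
        R[s] = sum(p * r for (r, s1), p in table.items())
        num, denom = {}, {}
        for (r, s1), p in table.items():
            denom[s1] = denom.get(s1, 0.0) + p    # P(s, s1)
            num[s1] = num.get(s1, 0.0) + p * r    # Σ_r PR(s, r, s1) · r
        RT[s] = {s1: num[s1] / denom[s1] for s1 in denom}
    return RT, R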
Inventory MRP
Embellish Inventory Process with Holding Cost and Stockout Cost
Holding cost of h for each unit that remains overnight
Think of this as “interest on inventory”, also includes upkeep cost
Stockout cost of p for each unit of “missed demand”
For each customer demand you could not satisfy with store inventory
Think of this as lost revenue plus customer disappointment (p ≫ h)
Order of Activity for Inventory MRP
α := On-Hand Inventory, β := On-Order Inventory, C := Store Capacity
Observe State St : (α, β) at 6pm store-closing
Order Quantity := max(C − (α + β), 0)
Record any overnight holding cost (= h · α)
Receive Inventory at 6am if you had ordered 36 hours ago
Open the store at 8am
Experience random demand i with Poisson probabilities:
PMF f(i) = e^(−λ) · λ^i / i!, CDF F(i) = Σ_{j=0}^{i} f(j)
Inventory Sold is min(α + β, i)
Record any stockout cost due (= p ·max(i − (α + β), 0))
Close the store at 6pm
Register reward Rt+1 as negative sum of holding and stockout costs
Observe new state St+1 : (max(α + β − i , 0),max(C − (α + β), 0))
Finite Markov Reward Process
Finite State Space S = {s1, s2, ..., sn}, |N| = m ≤ n
Finite set of (next state, reward) transitions
We'd like a sparse representation for PR
Conceptualize PR : N × D × S → [0, 1] as N → (S × D → [0, 1])
StateReward = FiniteDistribution[
    Tuple[State[S], float]
]
RewardTransition = Mapping[
    NonTerminal[S],
    StateReward[S]
]
Return as “Accumulated Discounted Rewards”
Define the Return Gt from state St as:
Gt = Σ_{i=t+1}^{∞} γ^(i−t−1) · Ri = Rt+1 + γ · Rt+2 + γ^2 · Rt+3 + ...
γ ∈ [0, 1] is the discount factor. Why discount?
Mathematically convenient to discount rewards
Avoids infinite returns in cyclic Markov Processes
Uncertainty about the future may not be fully represented
If reward is financial, discounting due to interest rates
Animal/human behavior prefers immediate reward
If all sequences terminate (Episodic Processes), we can set γ = 1
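For an episodic trace, the return can be accumulated backwards from the end, as in this small sketch of ours:

def discounted_return(rewards: list, gamma: float) -> float:
    # rewards = [R_{t+1}, R_{t+2}, ..., R_T]; returns G_t
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g

# e.g. discounted_return([-1.0, -2.0, 10.0], 0.9) = -1 + 0.9·(-2) + 0.81·10 = 5.3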
Value Function of MRP
Identify states with high “expected accumulated discounted rewards”
Value Function V : N → R defined as:
V(s) = E[Gt | St = s] for all s ∈ N, for all t = 0, 1, 2, ...
Bellman Equation for MRP (based on recursion Gt = Rt+1 +γ ·Gt+1):
V(s) = R(s) + γ · Σ_{s′∈N} P(s, s′) · V(s′) for all s ∈ N
In vector form: V = R + γ · P · V
⇒ V = (Im − γ · P)^(−1) · R
where Im is the m × m identity matrix
If m is large, we need Dynamic Programming (or Approx. DP or RL)
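For modest m, the direct linear-algebra solution is a few lines of numpy (a sketch; solving the linear system is numerically preferable to forming the inverse explicitly):

import numpy as np

def mrp_value_function(P: np.ndarray, R: np.ndarray, gamma: float) -> np.ndarray:
    # solve (I_m - γ·P) V = R, i.e. the Bellman Equation V = R + γ·P·V
    m = P.shape[0]
    return np.linalg.solve(np.eye(m) - gamma * P, R)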
[Figure: visualization of the MRP Bellman Equation]
Key Takeaways from this Chapter
Markov Property: Enables us to reason effectively & compute efficiently in practical systems involving sequential uncertainty
Bellman Equation: Recursive expression of the Value Function; this equation (and its MDP version) is the core idea within all Dynamic Programming and Reinforcement Learning algorithms.