

A Guided Tour of Chapter 1: Markov Process and Markov Reward Process

Ashwin Rao

ICME, Stanford University


Intuition on the concepts of Process and State

Process: time-sequenced random outcomes

Random outcome, e.g., price of a derivative, portfolio value, etc.

State: Internal Representation $S_t$ driving future evolution

We are interested in $P[S_{t+1} \mid S_t, S_{t-1}, \ldots, S_0]$

Let us consider random walks of stock prices $X_t = S_t$

$P[X_{t+1} = X_t + 1] + P[X_{t+1} = X_t - 1] = 1$

We consider 3 examples of such processes


Markov Property - Stock Price Random Walk Process

Process is pulled towards level $L$ with strength parameter $\alpha_1$ (a short simulation sketch follows below)

$P[X_{t+1} = X_t + 1] = \frac{1}{1 + e^{-\alpha_1 (L - X_t)}}$

Notice how the probability of next price depends only on current price

“The future is independent of the past given the present”

$P[X_{t+1} \mid X_t, X_{t-1}, \ldots, X_0] = P[X_{t+1} \mid X_t]$ for all $t \geq 0$

This makes the mathematics easier and the computation tractable

We call this the Markov Property of States

The state captures all relevant information from history

Once the state is known, the history may be thrown away

The state is a sufficient statistic of the future
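A minimal simulation sketch of this first process, using only the standard library (the function and variable names are illustrative, not from the book's code):

import math
import random

def simulate_process1(start: int, level: int, alpha1: float, steps: int) -> list:
    # Pull toward `level`: up-move probability is a logistic function of (level - X_t)
    xs = [start]
    for _ in range(steps):
        p_up = 1.0 / (1.0 + math.exp(-alpha1 * (level - xs[-1])))
        xs.append(xs[-1] + (1 if random.random() < p_up else -1))
    return xs

print(simulate_process1(start=100, level=100, alpha1=0.25, steps=10))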


Logistic Functions $f(x; \alpha) = \frac{1}{1 + e^{-\alpha x}}$


Another Stock Price Random Walk Process

$P[X_{t+1} = X_t + 1] = \begin{cases} 0.5 \cdot (1 - \alpha_2 (X_t - X_{t-1})) & \text{if } t > 0 \\ 0.5 & \text{if } t = 0 \end{cases}$

Direction of $X_{t+1} - X_t$ is biased in the reverse direction of $X_t - X_{t-1}$

Extent of the bias is controlled by “pull-strength” parameter $\alpha_2$

$S_t = X_t$ doesn't satisfy the Markov Property; $S_t = (X_t, X_t - X_{t-1})$ does:

$P[(X_{t+1}, X_{t+1} - X_t) \mid (X_t, X_t - X_{t-1}), (X_{t-1}, X_{t-1} - X_{t-2}), \ldots, (X_0, \text{Null})]$
$= P[(X_{t+1}, X_{t+1} - X_t) \mid (X_t, X_t - X_{t-1})]$

$S_t = (X_0, X_1, \ldots, X_t)$ or $S_t = (X_t, X_{t-1})$ also satisfy the Markov Property

But we seek the “simplest/minimal” representation for the Markov State (see the sketch below)
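A sketch of this second process that carries the Markov state $(X_t, X_t - X_{t-1})$ explicitly (illustrative names, standard library only):

import random

def simulate_process2(start: int, alpha2: float, steps: int) -> list:
    # Markov state is (X_t, X_t - X_{t-1}); the first move is unbiased.
    xs = [start]
    prev_move = 0  # X_t - X_{t-1}, treated as 0 at t = 0 so that p_up = 0.5
    for _ in range(steps):
        p_up = 0.5 * (1 - alpha2 * prev_move)
        move = 1 if random.random() < p_up else -1
        xs.append(xs[-1] + move)
        prev_move = move
    return xs

print(simulate_process2(start=100, alpha2=0.6, steps=10))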


Yet Another Stock Price Random Walk Process

Here, probability of next move depends on all past moves

Depends on the number of past up-moves $U_t$ relative to the number of past down-moves $D_t$

$P[X_{t+1} = X_t + 1] = \begin{cases} \frac{1}{1 + \left(\frac{U_t + D_t}{D_t} - 1\right)^{\alpha_3}} & \text{if } t > 0 \\ 0.5 & \text{if } t = 0 \end{cases}$

Direction of Xt+1 − Xt biased in the reverse direction of history

α3 is a “pull-strength” parameter

Most “compact” Markov State: $S_t = (U_t, D_t)$

$P[(U_{t+1}, D_{t+1}) \mid (U_t, D_t), (U_{t-1}, D_{t-1}), \ldots, (U_0, D_0)]$
$= P[(U_{t+1}, D_{t+1}) \mid (U_t, D_t)]$

Note that $X_t$ is not part of $S_t$ since $X_t = X_0 + U_t - D_t$ (a simulation sketch in terms of $(U_t, D_t)$ follows)
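A sketch of this third process in terms of the compact state $(U_t, D_t)$ (illustrative code, not from the book; the down-count-zero branch uses the limiting value of the formula):

import random

def simulate_process3(alpha3: float, steps: int) -> list:
    # Markov state is (U_t, D_t): counts of past up-moves and down-moves.
    up, down = 0, 0
    moves = []
    for t in range(steps):
        if t == 0:
            p_up = 0.5
        elif down == 0:
            p_up = 0.0   # limit of the formula when all past moves were up-moves
        else:
            p_up = 1.0 / (1.0 + ((up + down) / down - 1) ** alpha3)
        if random.random() < p_up:
            moves.append(1)
            up += 1
        else:
            moves.append(-1)
            down += 1
    return moves

print(simulate_process3(alpha3=1.0, steps=10))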


Unit-Sigmoid Curves $f(x; \alpha) = \frac{1}{1 + \left(\frac{1}{x} - 1\right)^{\alpha}}$


Single Sampling Traces for the 3 Processes


Terminal Probability Distributions for the 3 Processes


Definition for Discrete Time, Countable States

Definition

A Markov Process consists of:

A countable set of states $S$ (known as the State Space) and a set $T \subseteq S$ (known as the set of Terminal States)

A time-indexed sequence of random states $S_t \in S$ for time steps $t = 0, 1, 2, \ldots$, with each state transition satisfying the Markov Property: $P[S_{t+1} \mid S_t, S_{t-1}, \ldots, S_0] = P[S_{t+1} \mid S_t]$ for all $t \geq 0$

Termination: if an outcome for $S_T$ (for some time step $T$) is a state in the set $T$, then this sequence outcome terminates at time step $T$

The more commonly used term for Markov Process is Markov Chain

We refer to P[St+1|St ] as the transition probabilities for time t.

Non-terminal states: $N = S - T$

Classical Finance results are based on continuous-time Markov Processes


Some nuances of Markov Processes

Time-Homogeneous Markov Process: P[St+1|St ] independent of t

Time-Homogeneous MP specified with function P : N × S → [0, 1]

P(s, s ′) = P[St+1 = s ′|St = s] for all t = 0, 1, 2, . . .

P is the Transition Probability Function (source s → destination s ′)

Can always convert to time-homogeneous: augment State with time

Default: Discrete-Time, Countable-States, Time-Homogeneous MP

Termination typically modeled with Absorbing States (we don’t!)

Separation between:

Specification of the Transition Probability Function $P$

Specification of the Probability Distribution of Start States $\mu : N \rightarrow [0, 1]$

Together (P and µ), we can produce Sampling Traces

Episodic versus Continuing Sampling Traces


The abstract class MarkovProcess

from abc import ABC, abstractmethod
from typing import Generic, Iterable, TypeVar

# S is the generic state type. Distribution, State and NonTerminal are the
# supporting classes defined elsewhere in the book's accompanying code.
S = TypeVar('S')


class MarkovProcess(ABC, Generic[S]):

    @abstractmethod
    def transition(self, state: NonTerminal[S]) -> \
            Distribution[State[S]]:
        pass

    def simulate(
        self,
        start_state_distribution: Distribution[NonTerminal[S]]
    ) -> Iterable[State[S]]:
        st: State[S] = start_state_distribution.sample()
        yield st
        while isinstance(st, NonTerminal):
            st = self.transition(st).sample()
            yield st


Finite Markov Process

Finite State Space S = {s1, s2, . . . , sn}, |N | = m ≤ n

We'd like a sparse representation for $P$

Conceptualize $P : N \times S \rightarrow [0, 1]$ as $N \rightarrow (S \rightarrow [0, 1])$

Transition = Mapping[NonTerminal[S], FiniteDistribution[State[S]]]


class FiniteMarkovProcess

# Sequence, Set, Mapping come from typing; Categorical and Terminal from the
# book's supporting code.
class FiniteMarkovProcess(MarkovProcess[S]):

    nt_states: Sequence[NonTerminal[S]]
    tr_map: Transition[S]

    def __init__(self, tr: Mapping[S, FiniteDistribution[S]]):
        # Wrap raw states into NonTerminal/Terminal depending on whether
        # they have outgoing transitions in the given mapping.
        nt: Set[S] = set(tr.keys())
        self.tr_map = {
            NonTerminal(s): Categorical(
                {(NonTerminal(s1) if s1 in nt else Terminal(s1)): p
                 for s1, p in v.table().items()}
            ) for s, v in tr.items()
        }
        self.nt_states = list(self.tr_map.keys())

    def transition(self, state: NonTerminal[S]) \
            -> FiniteDistribution[State[S]]:
        return self.tr_map[state]


Weather Finite Markov Process

{
    "Rain": Categorical({"Rain": 0.3, "Nice": 0.7}),
    "Snow": Categorical({"Rain": 0.4, "Snow": 0.6}),
    "Nice": Categorical({
        "Rain": 0.2,
        "Snow": 0.3,
        "Nice": 0.5
    })
}
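A minimal usage sketch for this weather process, assuming the classes from the earlier slides are importable from the book's accompanying code; the module paths and the `.state` attribute used below are assumptions, not shown on the slides:

import itertools
# Assumed import paths for the classes shown on the earlier slides.
from rl.distribution import Categorical
from rl.markov_process import FiniteMarkovProcess, NonTerminal

weather_mp = FiniteMarkovProcess({
    "Rain": Categorical({"Rain": 0.3, "Nice": 0.7}),
    "Snow": Categorical({"Rain": 0.4, "Snow": 0.6}),
    "Nice": Categorical({"Rain": 0.2, "Snow": 0.3, "Nice": 0.5})
})

# Draw a 10-step sampling trace starting (with certainty) in "Rain".
start = Categorical({NonTerminal("Rain"): 1.0})
trace = list(itertools.islice(weather_mp.simulate(start), 10))
print([s.state for s in trace])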


Weather Finite Markov Process


Order of Activity for Inventory Markov Process

α := On-Hand Inventory, β := On-Order Inventory, C := Store Capacity

Observe State St : (α, β) at 6pm store-closing

Order Quantity := max(C − (α + β), 0)

Receive Inventory at 6am if you had ordered 36 hrs ago

Open the store at 8am

Experience random demand $i$ with Poisson probabilities:

PMF $f(i) = \frac{e^{-\lambda} \lambda^i}{i!}$, CMF $F(i) = \sum_{j=0}^{i} f(j)$

Inventory Sold is min(α + β, i)

Close the store at 6pm

Observe new state $S_{t+1} : (\max(\alpha + \beta - i, 0), \max(C - (\alpha + \beta), 0))$ (a single-step simulation sketch follows)
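A single-step sketch of these dynamics in plain Python with NumPy's Poisson sampler (function and parameter names are illustrative):

import numpy as np

def inventory_step(alpha: int, beta: int, capacity: int, poisson_lambda: float,
                   rng: np.random.Generator) -> tuple:
    # One 24-hour cycle: order up to capacity, receive the prior order, face demand.
    order = max(capacity - (alpha + beta), 0)
    demand = rng.poisson(poisson_lambda)
    next_alpha = max(alpha + beta - demand, 0)  # on-hand after selling min(alpha+beta, demand)
    next_beta = order                           # tonight's order is next period's on-order
    return next_alpha, next_beta

rng = np.random.default_rng(0)
print(inventory_step(alpha=1, beta=0, capacity=2, poisson_lambda=1.0, rng=rng))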


Inventory Markov Process States and Transitions

S := {(α, β) : 0 ≤ α + β ≤ C}

If St := (α, β), St+1 := (α + β − i ,C − (α + β)) for i = 0, 1, . . . , α + β

P((α, β), (α + β − i ,C − (α + β))) = f (i) for 0 ≤ i ≤ α + β − 1

$P((\alpha, \beta), (0, C - (\alpha + \beta))) = \sum_{j=\alpha+\beta}^{\infty} f(j) = 1 - F(\alpha + \beta - 1)$
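A sketch that tabulates these transition probabilities exactly, assuming SciPy is available for the Poisson PMF/CDF (the dict-of-dicts representation is illustrative, not the book's data structure):

from itertools import product
from scipy.stats import poisson

def inventory_transition_map(capacity: int, poisson_lambda: float) -> dict:
    # Maps each state (alpha, beta) to {next_state: probability}, per the formulas above.
    d = {}
    for alpha, beta in product(range(capacity + 1), repeat=2):
        if alpha + beta > capacity:
            continue
        ip = alpha + beta                      # inventory position
        beta1 = capacity - ip                  # next on-order quantity
        probs = {(ip - i, beta1): poisson.pmf(i, poisson_lambda) for i in range(ip)}
        probs[(0, beta1)] = 1 - poisson.cdf(ip - 1, poisson_lambda)
        d[(alpha, beta)] = probs
    return d

print(inventory_transition_map(capacity=2, poisson_lambda=1.0)[(0, 1)])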


Inventory Markov Process


Stationary Distribution of a Markov Process

Definition

The Stationary Distribution of a (Time-Homogeneous) Markov Process with state space $S = N$ and transition probability function $P : N \times N \rightarrow [0, 1]$ is a probability distribution function $\pi : N \rightarrow [0, 1]$ such that:

$\pi(s') = \sum_{s \in N} \pi(s) \cdot P(s, s')$ for all $s' \in N$

For a Time-Homogeneous MP with finite states $S = \{s_1, s_2, \ldots, s_n\} = N$,

$\pi(s_j) = \sum_{i=1}^{n} \pi(s_i) \cdot P(s_i, s_j)$ for all $j = 1, 2, \ldots, n$

Turning $P$ into a matrix $\mathbf{P}$, we get: $\pi^T = \pi^T \cdot \mathbf{P}$

$\mathbf{P}^T \cdot \pi = \pi \Rightarrow \pi$ is an eigenvector of $\mathbf{P}^T$ with eigenvalue 1 (see the numerical sketch below)
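A numerical sketch of that eigenvector calculation using NumPy; the matrix here is the weather process from the earlier slide, with states ordered ["Rain", "Snow", "Nice"]:

import numpy as np

# Transition matrix for the weather process (rows are source states).
P = np.array([
    [0.3, 0.0, 0.7],
    [0.4, 0.6, 0.0],
    [0.2, 0.3, 0.5]
])

eig_vals, eig_vecs = np.linalg.eig(P.T)
# Pick the eigenvector for eigenvalue 1 and normalize it into a probability distribution.
idx = np.argmin(np.abs(eig_vals - 1.0))
pi = np.real(eig_vecs[:, idx])
pi = pi / pi.sum()
print(pi)            # stationary distribution
print(pi @ P - pi)   # ~0 vector: pi is unchanged by one transition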


MRP Definition for Discrete Time, Countable States

Definition

A Markov Reward Process (MRP) is a Markov Process, along with a time-indexed sequence of Reward random variables $R_t \in D$ (a countable subset of $\mathbb{R}$) for time steps $t = 1, 2, \ldots$, satisfying the Markov Property (including Rewards): $P[(R_{t+1}, S_{t+1}) \mid S_t, S_{t-1}, \ldots, S_0] = P[(R_{t+1}, S_{t+1}) \mid S_t]$ for all $t \geq 0$.

$S_0, R_1, S_1, R_2, S_2, \ldots, S_{T-1}, R_T, S_T$

By default, we assume a time-homogeneous MRP, i.e., $P[(R_{t+1}, S_{t+1}) \mid S_t]$ is independent of $t$

Time-Homogeneous MRP specified as: $P_R : N \times D \times S \rightarrow [0, 1]$

$P_R(s, r, s') = P[(R_{t+1} = r, S_{t+1} = s') \mid S_t = s]$ for all $t = 0, 1, 2, \ldots$

$P_R$ is known as the Transition Probability Function


The abstract class MarkovRewardProcess

# Tuple comes from typing; SampledDistribution from the book's supporting code.
class MarkovRewardProcess(MarkovProcess[S]):

    @abstractmethod
    def transition_reward(self, state: NonTerminal[S]) \
            -> Distribution[Tuple[State[S], float]]:
        pass

    def transition(self, state: NonTerminal[S]) \
            -> Distribution[State[S]]:
        # Derive the state-transition distribution by sampling
        # (next state, reward) pairs and discarding the reward.
        distribution = self.transition_reward(state)

        def next_state(distribution=distribution):
            next_s, _ = distribution.sample()
            return next_s

        return SampledDistribution(next_state)


MRP Reward Functions

The reward transition function $R_T : N \times S \rightarrow \mathbb{R}$ is defined as:

$R_T(s, s') = E[R_{t+1} \mid S_{t+1} = s', S_t = s] = \sum_{r \in D} \frac{P_R(s, r, s')}{P(s, s')} \cdot r = \sum_{r \in D} \frac{P_R(s, r, s')}{\sum_{r \in D} P_R(s, r, s')} \cdot r$

The reward function $R : N \rightarrow \mathbb{R}$ is defined as:

$R(s) = E[R_{t+1} \mid S_t = s] = \sum_{s' \in S} P(s, s') \cdot R_T(s, s') = \sum_{s' \in S} \sum_{r \in D} P_R(s, r, s') \cdot r$
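A small sketch of the reward function calculation, with $P_R$ represented as a nested dict {s: {(r, s'): probability}} (an illustrative representation, not the book's data structure):

def reward_function(pr: dict, s) -> float:
    # R(s) = sum over (r, s') of PR(s, r, s') * r
    return sum(p * r for (r, s1), p in pr[s].items())

# Toy example: from state "A", reward 1.0 w.p. 0.6 (to "B"), reward -2.0 w.p. 0.4 (to "C").
pr = {"A": {(1.0, "B"): 0.6, (-2.0, "C"): 0.4}}
print(reward_function(pr, "A"))  # 0.6 * 1.0 + 0.4 * (-2.0) = -0.2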


Inventory MRP

Embellish Inventory Process with Holding Cost and Stockout Cost

Holding cost of h for each unit that remains overnight

Think of this as “interest on inventory”, also includes upkeep cost

Stockout cost of p for each unit of “missed demand”

For each customer demand you could not satisfy with store inventory

Think of this as lost revenue plus customer disappointment ($p \gg h$)


Order of Activity for Inventory MRP

α := On-Hand Inventory, β := On-Order Inventory, C := Store Capacity

Observe State St : (α, β) at 6pm store-closing

Order Quantity := max(C − (α + β), 0)

Record any overnight holding cost (= h · α)

Receive Inventory at 6am if you had ordered 36 hours ago

Open the store at 8am

Experience random demand $i$ with Poisson probabilities:

PMF $f(i) = \frac{e^{-\lambda} \lambda^i}{i!}$, CMF $F(i) = \sum_{j=0}^{i} f(j)$

Inventory Sold is min(α + β, i)

Record any stockout cost due (= p ·max(i − (α + β), 0))

Close the store at 6pm

Register reward Rt+1 as negative sum of holding and stockout costs

Observe new state $S_{t+1} : (\max(\alpha + \beta - i, 0), \max(C - (\alpha + \beta), 0))$ (a reward-calculation sketch follows)
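A sketch of the reward calculation for one such cycle (illustrative names, complementing the earlier single-step sketch):

def inventory_reward(alpha: int, beta: int, demand: int,
                     holding_cost: float, stockout_cost: float) -> float:
    # Reward R_{t+1} = -(holding cost on overnight on-hand inventory
    #                    + stockout cost on missed demand)
    holding = holding_cost * alpha
    missed = max(demand - (alpha + beta), 0)
    return -(holding + stockout_cost * missed)

print(inventory_reward(alpha=1, beta=1, demand=3, holding_cost=1.0, stockout_cost=10.0))
# -(1.0 * 1 + 10.0 * 1) = -11.0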


Finite Markov Reward Process

Finite State Space S = {s1, s2, . . . , sn}, |N | = m ≤ n

Finite set of (next state, reward) transitions

We'd like a sparse representation for $P_R$

Conceptualize $P_R : N \times D \times S \rightarrow [0, 1]$ as $N \rightarrow (S \times D \rightarrow [0, 1])$

StateReward = FiniteDistribution[Tuple[State[S], float]]

RewardTransition = Mapping[NonTerminal[S], StateReward[S]]


Return as “Accumulated Discounted Rewards”

Define the Return Gt from state St as:

$G_t = \sum_{i=t+1}^{\infty} \gamma^{i-t-1} \cdot R_i = R_{t+1} + \gamma \cdot R_{t+2} + \gamma^2 \cdot R_{t+3} + \ldots$

γ ∈ [0, 1] is the discount factor. Why discount?

Mathematically convenient to discount rewards
Avoids infinite returns in cyclic Markov Processes
Uncertainty about the future may not be fully represented
If the reward is financial, discounting is due to interest rates
Animal/human behavior prefers immediate reward

If all sequences terminate (Episodic Processes), we can set $\gamma = 1$ (a short return-calculation sketch follows)
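For a terminating (episodic) trace, the return is just this finite discounted sum; a one-line illustrative sketch:

def episodic_return(rewards: list, gamma: float) -> float:
    # G_t = R_{t+1} + gamma * R_{t+2} + gamma^2 * R_{t+3} + ...
    return sum(gamma ** i * r for i, r in enumerate(rewards))

print(episodic_return([1.0, 1.0, 1.0], gamma=0.5))  # 1 + 0.5 + 0.25 = 1.75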


Value Function of MRP

Identify states with high “expected accumulated discounted rewards”

Value Function V : N → R defined as:

V (s) = E[Gt |St = s] for all s ∈ N , for all t = 0, 1, 2, . . .

Bellman Equation for MRP (based on the recursion $G_t = R_{t+1} + \gamma \cdot G_{t+1}$):

$V(s) = R(s) + \gamma \cdot \sum_{s' \in N} P(s, s') \cdot V(s')$ for all $s \in N$

In Vector form: $V = R + \gamma \mathbf{P} \cdot V$

$\Rightarrow V = (I_m - \gamma \mathbf{P})^{-1} \cdot R$

where $I_m$ is the $m \times m$ identity matrix (a direct-solve sketch follows below)

If m is large, we need Dynamic Programming (or Approx. DP or RL)
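A direct-solve sketch of that formula for small m using NumPy; the transition matrix and reward vector below are illustrative, not from the slides:

import numpy as np

# Illustrative 3-state MRP: transition matrix over non-terminal states and reward vector.
P = np.array([
    [0.3, 0.0, 0.7],
    [0.4, 0.6, 0.0],
    [0.2, 0.3, 0.5]
])
R = np.array([1.0, -2.0, 0.5])
gamma = 0.9

# V = (I - gamma * P)^{-1} * R, solved as a linear system rather than via an explicit inverse.
V = np.linalg.solve(np.eye(len(R)) - gamma * P, R)
print(V)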


Visualization of MRP Bellman Equation


Key Takeaways from this Chapter

Markov Property: Enables us to reason effectively & compute efficiently in practical systems involving sequential uncertainty

Bellman Equation: Recursive expression of the Value Function; this equation (and its MDP version) is the core idea within all Dynamic Programming and Reinforcement Learning algorithms.
