

A Guided Tour of Chapter 1: Markov Process and Markov Reward Process

Ashwin Rao

ICME, Stanford University


Intuition on the concepts of Process and State

Process: time-sequenced random outcomes

Random outcome, e.g., price of a derivative, portfolio value, etc.

State: Internal Representation $S_t$ driving future evolution

We are interested in $P[S_{t+1} \mid S_t, S_{t-1}, \ldots, S_0]$

Let us consider random walks of stock prices $X_t = S_t$

$P[X_{t+1} = X_t + 1] + P[X_{t+1} = X_t - 1] = 1$

We consider 3 examples of such processes


Markov Property - Stock Price Random Walk Process

Process is pulled towards level $L$ with strength parameter $\alpha_1$ (a short simulation sketch follows below)

$P[X_{t+1} = X_t + 1] = \frac{1}{1 + e^{-\alpha_1 (L - X_t)}}$

Notice how the probability of next price depends only on current price

“The future is independent of the past given the present”

$P[X_{t+1} \mid X_t, X_{t-1}, \ldots, X_0] = P[X_{t+1} \mid X_t]$ for all $t \geq 0$

This makes the mathematics easier and the computation tractable

We call this the Markov Property of States

The state captures all relevant information from history

Once the state is known, the history may be thrown away

The state is a sufficient statistic of the future
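A minimal simulation sketch of this first process, using only the standard library (the function and variable names are illustrative, not from the book's code):

import math
import random

def simulate_process1(start: int, level: int, alpha1: float, steps: int) -> list:
    # Pull toward `level`: up-move probability is a logistic function of (level - X_t)
    xs = [start]
    for _ in range(steps):
        p_up = 1.0 / (1.0 + math.exp(-alpha1 * (level - xs[-1])))
        xs.append(xs[-1] + (1 if random.random() < p_up else -1))
    return xs

print(simulate_process1(start=100, level=100, alpha1=0.25, steps=10))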


Logistic Functions $f(x; \alpha) = \frac{1}{1 + e^{-\alpha x}}$


Another Stock Price Random Walk Process

$P[X_{t+1} = X_t + 1] = \begin{cases} 0.5 \cdot (1 - \alpha_2 (X_t - X_{t-1})) & \text{if } t > 0 \\ 0.5 & \text{if } t = 0 \end{cases}$

Direction of $X_{t+1} - X_t$ is biased in the reverse direction of $X_t - X_{t-1}$

Extent of the bias is controlled by “pull-strength” parameter $\alpha_2$

$S_t = X_t$ doesn't satisfy the Markov Property; $S_t = (X_t, X_t - X_{t-1})$ does:

$P[(X_{t+1}, X_{t+1} - X_t) \mid (X_t, X_t - X_{t-1}), (X_{t-1}, X_{t-1} - X_{t-2}), \ldots, (X_0, \text{Null})]$
$= P[(X_{t+1}, X_{t+1} - X_t) \mid (X_t, X_t - X_{t-1})]$

$S_t = (X_0, X_1, \ldots, X_t)$ or $S_t = (X_t, X_{t-1})$ also satisfy the Markov Property

But we seek the “simplest/minimal” representation for the Markov State (see the sketch below)
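A sketch of this second process that carries the Markov state $(X_t, X_t - X_{t-1})$ explicitly (illustrative names, standard library only):

import random

def simulate_process2(start: int, alpha2: float, steps: int) -> list:
    # Markov state is (X_t, X_t - X_{t-1}); the first move is unbiased.
    xs = [start]
    prev_move = 0  # X_t - X_{t-1}, treated as 0 at t = 0 so that p_up = 0.5
    for _ in range(steps):
        p_up = 0.5 * (1 - alpha2 * prev_move)
        move = 1 if random.random() < p_up else -1
        xs.append(xs[-1] + move)
        prev_move = move
    return xs

print(simulate_process2(start=100, alpha2=0.6, steps=10))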


Yet Another Stock Price Random Walk Process

Here, probability of next move depends on all past moves

Depends on the number of past up-moves $U_t$ relative to the number of past down-moves $D_t$

$P[X_{t+1} = X_t + 1] = \begin{cases} \frac{1}{1 + \left(\frac{U_t + D_t}{D_t} - 1\right)^{\alpha_3}} & \text{if } t > 0 \\ 0.5 & \text{if } t = 0 \end{cases}$

Direction of Xt+1 − Xt biased in the reverse direction of history

α3 is a “pull-strength” parameter

Most “compact” Markov State: $S_t = (U_t, D_t)$

$P[(U_{t+1}, D_{t+1}) \mid (U_t, D_t), (U_{t-1}, D_{t-1}), \ldots, (U_0, D_0)]$
$= P[(U_{t+1}, D_{t+1}) \mid (U_t, D_t)]$

Note that $X_t$ is not part of $S_t$ since $X_t = X_0 + U_t - D_t$ (a simulation sketch in terms of $(U_t, D_t)$ follows)
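A sketch of this third process in terms of the compact state $(U_t, D_t)$ (illustrative code, not from the book; the down-count-zero branch uses the limiting value of the formula):

import random

def simulate_process3(alpha3: float, steps: int) -> list:
    # Markov state is (U_t, D_t): counts of past up-moves and down-moves.
    up, down = 0, 0
    moves = []
    for t in range(steps):
        if t == 0:
            p_up = 0.5
        elif down == 0:
            p_up = 0.0   # limit of the formula when all past moves were up-moves
        else:
            p_up = 1.0 / (1.0 + ((up + down) / down - 1) ** alpha3)
        if random.random() < p_up:
            moves.append(1)
            up += 1
        else:
            moves.append(-1)
            down += 1
    return moves

print(simulate_process3(alpha3=1.0, steps=10))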


Unit-Sigmoid Curves $f(x; \alpha) = \frac{1}{1 + \left(\frac{1}{x} - 1\right)^{\alpha}}$


Single Sampling Traces for the 3 Processes


Terminal Probability Distributions for the 3 Processes


Definition for Discrete Time, Countable States

Definition

A Markov Process consists of:

A countable set of states $S$ (known as the State Space) and a set $T \subseteq S$ (known as the set of Terminal States)

A time-indexed sequence of random states $S_t \in S$ for time steps $t = 0, 1, 2, \ldots$, with each state transition satisfying the Markov Property: $P[S_{t+1} \mid S_t, S_{t-1}, \ldots, S_0] = P[S_{t+1} \mid S_t]$ for all $t \geq 0$

Termination: if an outcome for $S_T$ (for some time step $T$) is a state in the set $T$, then this sequence outcome terminates at time step $T$

The more commonly used term for Markov Process is Markov Chain

We refer to P[St+1|St ] as the transition probabilities for time t.

Non-terminal states: $N = S - T$

Classical Finance results are based on continuous-time Markov Processes


Some nuances of Markov Processes

Time-Homogeneous Markov Process: P[St+1|St ] independent of t

Time-Homogeneous MP specified with function P : N × S → [0, 1]

P(s, s ′) = P[St+1 = s ′|St = s] for all t = 0, 1, 2, . . .

P is the Transition Probability Function (source s → destination s ′)

Can always convert to time-homogeneous: augment State with time

Default: Discrete-Time, Countable-States, Time-Homogeneous MP

Termination typically modeled with Absorbing States (we don’t!)

Separation between:

Specification of the Transition Probability Function $P$

Specification of the Probability Distribution of Start States $\mu : N \rightarrow [0, 1]$

Together (P and µ), we can produce Sampling Traces

Episodic versus Continuing Sampling Traces


The abstract class MarkovProcess

from abc import ABC, abstractmethod
from typing import Generic, Iterable, TypeVar

# S is the generic state type. Distribution, State and NonTerminal are the
# supporting classes defined elsewhere in the book's accompanying code.
S = TypeVar('S')


class MarkovProcess(ABC, Generic[S]):

    @abstractmethod
    def transition(self, state: NonTerminal[S]) -> \
            Distribution[State[S]]:
        pass

    def simulate(
        self,
        start_state_distribution: Distribution[NonTerminal[S]]
    ) -> Iterable[State[S]]:
        st: State[S] = start_state_distribution.sample()
        yield st
        while isinstance(st, NonTerminal):
            st = self.transition(st).sample()
            yield st


Finite Markov Process

Finite State Space S = {s1, s2, . . . , sn}, |N | = m ≤ n

We'd like a sparse representation for $P$

Conceptualize $P : N \times S \rightarrow [0, 1]$ as $N \rightarrow (S \rightarrow [0, 1])$

Transition = Mapping[NonTerminal[S], FiniteDistribution[State[S]]]


class FiniteMarkovProcess

# Sequence, Set, Mapping come from typing; Categorical and Terminal from the
# book's supporting code.
class FiniteMarkovProcess(MarkovProcess[S]):

    nt_states: Sequence[NonTerminal[S]]
    tr_map: Transition[S]

    def __init__(self, tr: Mapping[S, FiniteDistribution[S]]):
        # Wrap raw states into NonTerminal/Terminal depending on whether
        # they have outgoing transitions in the given mapping.
        nt: Set[S] = set(tr.keys())
        self.tr_map = {
            NonTerminal(s): Categorical(
                {(NonTerminal(s1) if s1 in nt else Terminal(s1)): p
                 for s1, p in v.table().items()}
            ) for s, v in tr.items()
        }
        self.nt_states = list(self.tr_map.keys())

    def transition(self, state: NonTerminal[S]) \
            -> FiniteDistribution[State[S]]:
        return self.tr_map[state]


Weather Finite Markov Process

{
    "Rain": Categorical({"Rain": 0.3, "Nice": 0.7}),
    "Snow": Categorical({"Rain": 0.4, "Snow": 0.6}),
    "Nice": Categorical({
        "Rain": 0.2,
        "Snow": 0.3,
        "Nice": 0.5
    })
}
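A minimal usage sketch for this weather process, assuming the classes from the earlier slides are importable from the book's accompanying code; the module paths and the `.state` attribute used below are assumptions, not shown on the slides:

import itertools
# Assumed import paths for the classes shown on the earlier slides.
from rl.distribution import Categorical
from rl.markov_process import FiniteMarkovProcess, NonTerminal

weather_mp = FiniteMarkovProcess({
    "Rain": Categorical({"Rain": 0.3, "Nice": 0.7}),
    "Snow": Categorical({"Rain": 0.4, "Snow": 0.6}),
    "Nice": Categorical({"Rain": 0.2, "Snow": 0.3, "Nice": 0.5})
})

# Draw a 10-step sampling trace starting (with certainty) in "Rain".
start = Categorical({NonTerminal("Rain"): 1.0})
trace = list(itertools.islice(weather_mp.simulate(start), 10))
print([s.state for s in trace])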


Weather Finite Markov Process


Order of Activity for Inventory Markov Process

α := On-Hand Inventory, β := On-Order Inventory, C := Store Capacity

Observe State St : (α, β) at 6pm store-closing

Order Quantity := max(C − (α + β), 0)

Receive Inventory at 6am if you had ordered 36 hrs ago

Open the store at 8am

Experience random demand $i$ with Poisson probabilities:

PMF $f(i) = \frac{e^{-\lambda} \lambda^i}{i!}$, CMF $F(i) = \sum_{j=0}^{i} f(j)$

Inventory Sold is min(α + β, i)

Close the store at 6pm

Observe new state $S_{t+1} : (\max(\alpha + \beta - i, 0), \max(C - (\alpha + \beta), 0))$ (a single-step simulation sketch follows)
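A single-step sketch of these dynamics in plain Python with NumPy's Poisson sampler (function and parameter names are illustrative):

import numpy as np

def inventory_step(alpha: int, beta: int, capacity: int, poisson_lambda: float,
                   rng: np.random.Generator) -> tuple:
    # One 24-hour cycle: order up to capacity, receive the prior order, face demand.
    order = max(capacity - (alpha + beta), 0)
    demand = rng.poisson(poisson_lambda)
    next_alpha = max(alpha + beta - demand, 0)  # on-hand after selling min(alpha+beta, demand)
    next_beta = order                           # tonight's order is next period's on-order
    return next_alpha, next_beta

rng = np.random.default_rng(0)
print(inventory_step(alpha=1, beta=0, capacity=2, poisson_lambda=1.0, rng=rng))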


Inventory Markov Process States and Transitions

S := {(α, β) : 0 ≤ α + β ≤ C}

If St := (α, β), St+1 := (α + β − i ,C − (α + β)) for i = 0, 1, . . . , α + β

P((α, β), (α + β − i ,C − (α + β))) = f (i) for 0 ≤ i ≤ α + β − 1

$P((\alpha, \beta), (0, C - (\alpha + \beta))) = \sum_{j=\alpha+\beta}^{\infty} f(j) = 1 - F(\alpha + \beta - 1)$
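A sketch that tabulates these transition probabilities exactly, assuming SciPy is available for the Poisson PMF/CDF (the dict-of-dicts representation is illustrative, not the book's data structure):

from itertools import product
from scipy.stats import poisson

def inventory_transition_map(capacity: int, poisson_lambda: float) -> dict:
    # Maps each state (alpha, beta) to {next_state: probability}, per the formulas above.
    d = {}
    for alpha, beta in product(range(capacity + 1), repeat=2):
        if alpha + beta > capacity:
            continue
        ip = alpha + beta                      # inventory position
        beta1 = capacity - ip                  # next on-order quantity
        probs = {(ip - i, beta1): poisson.pmf(i, poisson_lambda) for i in range(ip)}
        probs[(0, beta1)] = 1 - poisson.cdf(ip - 1, poisson_lambda)
        d[(alpha, beta)] = probs
    return d

print(inventory_transition_map(capacity=2, poisson_lambda=1.0)[(0, 1)])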


Inventory Markov Process


Stationary Distribution of a Markov Process

Definition

The Stationary Distribution of a (Time-Homogeneous) Markov Process with state space $S = N$ and transition probability function $P : N \times N \rightarrow [0, 1]$ is a probability distribution function $\pi : N \rightarrow [0, 1]$ such that:

$\pi(s') = \sum_{s \in N} \pi(s) \cdot P(s, s')$ for all $s' \in N$

For a Time-Homogeneous MP with finite states $S = \{s_1, s_2, \ldots, s_n\} = N$,

$\pi(s_j) = \sum_{i=1}^{n} \pi(s_i) \cdot P(s_i, s_j)$ for all $j = 1, 2, \ldots, n$

Turning $P$ into a matrix $\mathbf{P}$, we get: $\pi^T = \pi^T \cdot \mathbf{P}$

$\mathbf{P}^T \cdot \pi = \pi \Rightarrow \pi$ is an eigenvector of $\mathbf{P}^T$ with eigenvalue 1 (see the numerical sketch below)
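A numerical sketch of that eigenvector calculation using NumPy; the matrix here is the weather process from the earlier slide, with states ordered ["Rain", "Snow", "Nice"]:

import numpy as np

# Transition matrix for the weather process (rows are source states).
P = np.array([
    [0.3, 0.0, 0.7],
    [0.4, 0.6, 0.0],
    [0.2, 0.3, 0.5]
])

eig_vals, eig_vecs = np.linalg.eig(P.T)
# Pick the eigenvector for eigenvalue 1 and normalize it into a probability distribution.
idx = np.argmin(np.abs(eig_vals - 1.0))
pi = np.real(eig_vecs[:, idx])
pi = pi / pi.sum()
print(pi)            # stationary distribution
print(pi @ P - pi)   # ~0 vector: pi is unchanged by one transition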


MRP Definition for Discrete Time, Countable States

Definition

A Markov Reward Process (MRP) is a Markov Process, along with a time-indexed sequence of Reward random variables $R_t \in D$ (a countable subset of $\mathbb{R}$) for time steps $t = 1, 2, \ldots$, satisfying the Markov Property (including Rewards): $P[(R_{t+1}, S_{t+1}) \mid S_t, S_{t-1}, \ldots, S_0] = P[(R_{t+1}, S_{t+1}) \mid S_t]$ for all $t \geq 0$.

$S_0, R_1, S_1, R_2, S_2, \ldots, S_{T-1}, R_T, S_T$

By default, we assume a time-homogeneous MRP, i.e., $P[(R_{t+1}, S_{t+1}) \mid S_t]$ is independent of $t$

Time-Homogeneous MRP specified as: $P_R : N \times D \times S \rightarrow [0, 1]$

$P_R(s, r, s') = P[(R_{t+1} = r, S_{t+1} = s') \mid S_t = s]$ for all $t = 0, 1, 2, \ldots$

$P_R$ is known as the Transition Probability Function


The abstract class MarkovRewardProcess

# Tuple comes from typing; SampledDistribution from the book's supporting code.
class MarkovRewardProcess(MarkovProcess[S]):

    @abstractmethod
    def transition_reward(self, state: NonTerminal[S]) \
            -> Distribution[Tuple[State[S], float]]:
        pass

    def transition(self, state: NonTerminal[S]) \
            -> Distribution[State[S]]:
        # Derive the state-transition distribution by sampling
        # (next state, reward) pairs and discarding the reward.
        distribution = self.transition_reward(state)

        def next_state(distribution=distribution):
            next_s, _ = distribution.sample()
            return next_s

        return SampledDistribution(next_state)


MRP Reward Functions

The reward transition function $R_T : N \times S \rightarrow \mathbb{R}$ is defined as:

$R_T(s, s') = E[R_{t+1} \mid S_{t+1} = s', S_t = s] = \sum_{r \in D} \frac{P_R(s, r, s')}{P(s, s')} \cdot r = \sum_{r \in D} \frac{P_R(s, r, s')}{\sum_{r \in D} P_R(s, r, s')} \cdot r$

The reward function $R : N \rightarrow \mathbb{R}$ is defined as:

$R(s) = E[R_{t+1} \mid S_t = s] = \sum_{s' \in S} P(s, s') \cdot R_T(s, s') = \sum_{s' \in S} \sum_{r \in D} P_R(s, r, s') \cdot r$
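A small sketch of the reward function calculation, with $P_R$ represented as a nested dict {s: {(r, s'): probability}} (an illustrative representation, not the book's data structure):

def reward_function(pr: dict, s) -> float:
    # R(s) = sum over (r, s') of PR(s, r, s') * r
    return sum(p * r for (r, s1), p in pr[s].items())

# Toy example: from state "A", reward 1.0 w.p. 0.6 (to "B"), reward -2.0 w.p. 0.4 (to "C").
pr = {"A": {(1.0, "B"): 0.6, (-2.0, "C"): 0.4}}
print(reward_function(pr, "A"))  # 0.6 * 1.0 + 0.4 * (-2.0) = -0.2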


Inventory MRP

Embellish Inventory Process with Holding Cost and Stockout Cost

Holding cost of h for each unit that remains overnight

Think of this as “interest on inventory”, also includes upkeep cost

Stockout cost of p for each unit of “missed demand”

For each customer demand you could not satisfy with store inventory

Think of this as lost revenue plus customer disappointment ($p \gg h$)


Order of Activity for Inventory MRP

α := On-Hand Inventory, β := On-Order Inventory, C := Store Capacity

Observe State St : (α, β) at 6pm store-closing

Order Quantity := max(C − (α + β), 0)

Record any overnight holding cost (= h · α)

Receive Inventory at 6am if you had ordered 36 hours ago

Open the store at 8am

Experience random demand $i$ with Poisson probabilities:

PMF $f(i) = \frac{e^{-\lambda} \lambda^i}{i!}$, CMF $F(i) = \sum_{j=0}^{i} f(j)$

Inventory Sold is min(α + β, i)

Record any stockout cost due (= p ·max(i − (α + β), 0))

Close the store at 6pm

Register reward Rt+1 as negative sum of holding and stockout costs

Observe new state $S_{t+1} : (\max(\alpha + \beta - i, 0), \max(C - (\alpha + \beta), 0))$ (a reward-calculation sketch follows)
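A sketch of the reward calculation for one such cycle (illustrative names, complementing the earlier single-step sketch):

def inventory_reward(alpha: int, beta: int, demand: int,
                     holding_cost: float, stockout_cost: float) -> float:
    # Reward R_{t+1} = -(holding cost on overnight on-hand inventory
    #                    + stockout cost on missed demand)
    holding = holding_cost * alpha
    missed = max(demand - (alpha + beta), 0)
    return -(holding + stockout_cost * missed)

print(inventory_reward(alpha=1, beta=1, demand=3, holding_cost=1.0, stockout_cost=10.0))
# -(1.0 * 1 + 10.0 * 1) = -11.0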


Finite Markov Reward Process

Finite State Space S = {s1, s2, . . . , sn}, |N | = m ≤ n

Finite set of (next state, reward) transitions

We'd like a sparse representation for $P_R$

Conceptualize $P_R : N \times D \times S \rightarrow [0, 1]$ as $N \rightarrow (S \times D \rightarrow [0, 1])$

StateReward = FiniteDistribution[Tuple[State[S], float]]

RewardTransition = Mapping[NonTerminal[S], StateReward[S]]


Return as “Accumulated Discounted Rewards”

Define the Return Gt from state St as:

$G_t = \sum_{i=t+1}^{\infty} \gamma^{i-t-1} \cdot R_i = R_{t+1} + \gamma \cdot R_{t+2} + \gamma^2 \cdot R_{t+3} + \ldots$

γ ∈ [0, 1] is the discount factor. Why discount?

Mathematically convenient to discount rewards
Avoids infinite returns in cyclic Markov Processes
Uncertainty about the future may not be fully represented
If the reward is financial, discounting is due to interest rates
Animal/human behavior prefers immediate reward

If all sequences terminate (Episodic Processes), we can set $\gamma = 1$ (a short return-calculation sketch follows)
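For a terminating (episodic) trace, the return is just this finite discounted sum; a one-line illustrative sketch:

def episodic_return(rewards: list, gamma: float) -> float:
    # G_t = R_{t+1} + gamma * R_{t+2} + gamma^2 * R_{t+3} + ...
    return sum(gamma ** i * r for i, r in enumerate(rewards))

print(episodic_return([1.0, 1.0, 1.0], gamma=0.5))  # 1 + 0.5 + 0.25 = 1.75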


Value Function of MRP

Identify states with high “expected accumulated discounted rewards”

Value Function V : N → R defined as:

V (s) = E[Gt |St = s] for all s ∈ N , for all t = 0, 1, 2, . . .

Bellman Equation for MRP (based on the recursion $G_t = R_{t+1} + \gamma \cdot G_{t+1}$):

$V(s) = R(s) + \gamma \cdot \sum_{s' \in N} P(s, s') \cdot V(s')$ for all $s \in N$

In Vector form: $V = R + \gamma \mathbf{P} \cdot V$

$\Rightarrow V = (I_m - \gamma \mathbf{P})^{-1} \cdot R$

where $I_m$ is the $m \times m$ identity matrix (a direct-solve sketch follows below)

If m is large, we need Dynamic Programming (or Approx. DP or RL)
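A direct-solve sketch of that formula for small m using NumPy; the transition matrix and reward vector below are illustrative, not from the slides:

import numpy as np

# Illustrative 3-state MRP: transition matrix over non-terminal states and reward vector.
P = np.array([
    [0.3, 0.0, 0.7],
    [0.4, 0.6, 0.0],
    [0.2, 0.3, 0.5]
])
R = np.array([1.0, -2.0, 0.5])
gamma = 0.9

# V = (I - gamma * P)^{-1} * R, solved as a linear system rather than via an explicit inverse.
V = np.linalg.solve(np.eye(len(R)) - gamma * P, R)
print(V)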


Visualization of MRP Bellman Equation


Key Takeaways from this Chapter

Markov Property: Enables us to reason effectively & compute efficiently in practical systems involving sequential uncertainty

Bellman Equation: Recursive expression of the Value Function; this equation (and its MDP version) is the core idea within all Dynamic Programming and Reinforcement Learning algorithms.
