
A Guided Tour of Chapter 1: Markov Process and Markov Reward Process

Ashwin Rao

ICME, Stanford University


Intuition on the concepts of Process and State

Process: time-sequenced random outcomes

Random outcome, e.g., price of a derivative, portfolio value, etc.

State: Internal Representation $S_t$ driving the future evolution

We are interested in $P[S_{t+1} \mid S_t, S_{t-1}, \ldots, S_0]$

Let us consider random walks of stock prices $X_t = S_t$

$P[X_{t+1} = X_t + 1] + P[X_{t+1} = X_t - 1] = 1$

We consider 3 examples of such processes


Markov Property - Stock Price Random Walk Process

Process is pulled towards level $L$ with strength parameter $\alpha_1$

$P[X_{t+1} = X_t + 1] = \frac{1}{1 + e^{-\alpha_1 (L - X_t)}}$
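As a rough illustration (not code from the book), here is a minimal Python sketch of this pulled random walk; the function name simulate_pulled_walk and the parameters x0, level, alpha1 and steps are illustrative choices:

import math
import random

def simulate_pulled_walk(x0, level, alpha1, steps):
    # State is just the current price; probability of an up-move is the
    # logistic function of (level - current price), as in the formula above.
    xs = [x0]
    for _ in range(steps):
        p_up = 1.0 / (1.0 + math.exp(-alpha1 * (level - xs[-1])))
        xs.append(xs[-1] + (1 if random.random() < p_up else -1))
    return xs

# Example: start at 80, pull toward 100 with alpha1 = 0.25, 100 time steps.
print(simulate_pulled_walk(80, 100.0, 0.25, 100))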

Notice how the probability of the next price depends only on the current price

“The future is independent of the past given the present”

$P[X_{t+1} \mid X_t, X_{t-1}, \ldots, X_0] = P[X_{t+1} \mid X_t]$ for all $t \geq 0$

This makes the mathematics easier and the computation tractable

We call this the Markov Property of States

The state captures all relevant information from history

Once the state is known, the history may be thrown away

The state is a sufficient statistic of the future


Logistic Functions $f(x; \alpha) = \frac{1}{1 + e^{-\alpha x}}$


Another Stock Price Random Walk Process

$P[X_{t+1} = X_t + 1] = \begin{cases} 0.5 \cdot (1 - \alpha_2 (X_t - X_{t-1})) & \text{if } t > 0 \\ 0.5 & \text{if } t = 0 \end{cases}$

Direction of $X_{t+1} - X_t$ is biased in the reverse direction of $X_t - X_{t-1}$

Extent of the bias is controlled by "pull-strength" parameter $\alpha_2$

$S_t = X_t$ doesn't satisfy the Markov Property, $S_t = (X_t, X_t - X_{t-1})$ does:

$P[(X_{t+1}, X_{t+1} - X_t) \mid (X_t, X_t - X_{t-1}), (X_{t-1}, X_{t-1} - X_{t-2}), \ldots, (X_0, \text{Null})]$
$= P[(X_{t+1}, X_{t+1} - X_t) \mid (X_t, X_t - X_{t-1})]$

$S_t = (X_0, X_1, \ldots, X_t)$ or $S_t = (X_t, X_{t-1})$ also satisfy the Markov Property

But we seek the "simplest/minimal" representation for the Markov State
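A minimal one-step sketch (again, not the book's code) that makes the Markov state explicit as the pair (current price, previous move); step_process2 and its arguments are hypothetical names:

import random

def step_process2(state, alpha2):
    # state is (X_t, X_t - X_{t-1}); use a previous move of 0 at t = 0,
    # which reproduces the 0.5 up-probability of the t = 0 case.
    x, prev_move = state
    p_up = 0.5 * (1 - alpha2 * prev_move)
    move = 1 if random.random() < p_up else -1
    return (x + move, move)   # the new state again carries the latest move

# Example: price 100 that just moved up; with alpha2 = 0.7 a down-move is more likely.
print(step_process2((100, 1), 0.7))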


Yet Another Stock Price Random Walk Process

Here, the probability of the next move depends on all past moves

It depends on the number of past up-moves $U_t$ relative to the number of past down-moves $D_t$

$P[X_{t+1} = X_t + 1] = \begin{cases} \frac{1}{1 + (\frac{U_t + D_t}{D_t} - 1)^{\alpha_3}} & \text{if } t > 0 \\ 0.5 & \text{if } t = 0 \end{cases}$

Direction of $X_{t+1} - X_t$ is biased in the reverse direction of history

$\alpha_3$ is a "pull-strength" parameter

Most "compact" Markov State $S_t = (U_t, D_t)$:

$P[(U_{t+1}, D_{t+1}) \mid (U_t, D_t), (U_{t-1}, D_{t-1}), \ldots, (U_0, D_0)]$
$= P[(U_{t+1}, D_{t+1}) \mid (U_t, D_t)]$

Note that $X_t$ is not part of $S_t$ since $X_t = X_0 + U_t - D_t$
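A similar hedged sketch for this process, with the Markov state being the up/down counts $(U_t, D_t)$; step_process3 is a hypothetical name, and the $D_t = 0$ branch is treated as the limiting case of the formula above:

import random

def step_process3(state, alpha3):
    # state is (U_t, D_t): counts of past up-moves and down-moves
    u, d = state
    if u + d == 0:
        p_up = 0.5            # the t = 0 case
    elif d == 0:
        p_up = 0.0            # limiting case of the formula as D_t -> 0 (alpha3 > 0)
    else:
        p_up = 1.0 / (1.0 + ((u + d) / d - 1) ** alpha3)
    return (u + 1, d) if random.random() < p_up else (u, d + 1)

# Example: 3 past up-moves, 1 past down-move, alpha3 = 1 => a down-move is more likely.
print(step_process3((3, 1), 1.0))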


Unit-Sigmoid Curves $f(x; \alpha) = \frac{1}{1 + (\frac{1}{x} - 1)^{\alpha}}$


Single Sampling Traces for the 3 Processes


Terminal Probability Distributions for the 3 Processes


Definition for Discrete Time, Countable States

Definition

A Markov Process consists of:

A countable set of states $\mathcal{S}$ (known as the State Space) and a set $\mathcal{T} \subseteq \mathcal{S}$ (known as the set of Terminal States)

A time-indexed sequence of random states $S_t \in \mathcal{S}$ for time steps $t = 0, 1, 2, \ldots$, with each state transition satisfying the Markov Property: $P[S_{t+1} \mid S_t, S_{t-1}, \ldots, S_0] = P[S_{t+1} \mid S_t]$ for all $t \geq 0$

Termination: If an outcome for $S_T$ (for some time step $T$) is a state in the set $\mathcal{T}$, then this sequence outcome terminates at time step $T$

The more commonly used term for Markov Process is Markov Chain

We refer to $P[S_{t+1} \mid S_t]$ as the transition probabilities for time $t$.

Non-terminal states: $\mathcal{N} = \mathcal{S} - \mathcal{T}$

Classical Finance results are based on continuous-time Markov Processes


Some nuances of Markov Processes

Time-Homogeneous Markov Process: $P[S_{t+1} \mid S_t]$ independent of $t$

Time-Homogeneous MP specified with function $\mathcal{P} : \mathcal{N} \times \mathcal{S} \to [0, 1]$

$\mathcal{P}(s, s') = P[S_{t+1} = s' \mid S_t = s]$ for all $t = 0, 1, 2, \ldots$

$\mathcal{P}$ is the Transition Probability Function (source $s \to$ destination $s'$)

Can always convert to time-homogeneous: augment State with time

Default: Discrete-Time, Countable-States, Time-Homogeneous MP

Termination typically modeled with Absorbing States (we don’t!)

Separation between:

Specification of Transition Probability Function $\mathcal{P}$

Specification of Probability Distribution of Start States $\mu : \mathcal{N} \to [0, 1]$

Together (P and µ), we can produce Sampling Traces

Episodic versus Continuing Sampling Traces


The @abstractclass MarkovProcess

from abc import ABC, abstractmethod
from typing import Generic, Iterable

# S (a TypeVar), State, NonTerminal and Distribution come from the book's
# accompanying code (rl.markov_process / rl.distribution).
class MarkovProcess(ABC, Generic[S]):

    @abstractmethod
    def transition(self, state: NonTerminal[S]) -> \
            Distribution[State[S]]:
        pass

    def simulate(
        self,
        start_state_distr: Distribution[NonTerminal[S]]
    ) -> Iterable[State[S]]:
        st: State[S] = start_state_distr.sample()
        yield st
        while isinstance(st, NonTerminal):
            st = self.transition(st).sample()
            yield st


Finite Markov Process

Finite State Space $\mathcal{S} = \{s_1, s_2, \ldots, s_n\}$, $|\mathcal{N}| = m \leq n$

We'd like a sparse representation for $\mathcal{P}$

Conceptualize $\mathcal{P} : \mathcal{N} \times \mathcal{S} \to [0, 1]$ as $\mathcal{N} \to (\mathcal{S} \to [0, 1])$

Transition = Mapping[
    NonTerminal[S],
    FiniteDistribution[State[S]]
]


class FiniteMarkovProcess

class FiniteMarkovProcess(MarkovProcess[S]):

    nt_states: Sequence[NonTerminal[S]]
    tr_map: Transition[S]

    def __init__(self, tr: Mapping[S, FiniteDistribution[S]]):
        nt: Set[S] = set(tr.keys())
        # Wrap each destination state as NonTerminal if it has outgoing
        # transitions of its own, else as Terminal
        self.tr_map = {
            NonTerminal(s): Categorical({
                (NonTerminal(s1) if s1 in nt else Terminal(s1)): p
                for s1, p in v.table().items()
            }) for s, v in tr.items()
        }
        self.nt_states = list(self.tr_map.keys())

    def transition(self, state: NonTerminal[S]) \
            -> FiniteDistribution[State[S]]:
        return self.tr_map[state]



Weather Finite Markov Process

{
    "Rain": Categorical({"Rain": 0.3, "Nice": 0.7}),
    "Snow": Categorical({"Rain": 0.4, "Snow": 0.6}),
    "Nice": Categorical({
        "Rain": 0.2,
        "Snow": 0.3,
        "Nice": 0.5
    })
}
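A hedged usage sketch, assuming the FiniteMarkovProcess, Categorical and NonTerminal definitions shown earlier (as in the book's accompanying code); since this process never terminates, we truncate the sampling trace with itertools.islice:

import itertools

weather_mp = FiniteMarkovProcess({
    "Rain": Categorical({"Rain": 0.3, "Nice": 0.7}),
    "Snow": Categorical({"Rain": 0.4, "Snow": 0.6}),
    "Nice": Categorical({"Rain": 0.2, "Snow": 0.3, "Nice": 0.5})
})

# Start every trace in "Nice" weather and sample the first 10 states.
start = Categorical({NonTerminal("Nice"): 1.0})
print(list(itertools.islice(weather_mp.simulate(start), 10)))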


Weather Finite Markov Process


Order of Activity for Inventory Markov Process

α := On-Hand Inventory, β := On-Order Inventory, C := Store Capacity

Observe State St : (α, β) at 6pm store-closing

Order Quantity := max(C − (α + β), 0)

Receive Inventory at 6am if you had ordered 36 hrs ago

Open the store at 8am

Experience random demand $i$ with Poisson probabilities:

PMF $f(i) = \frac{e^{-\lambda} \lambda^i}{i!}$, CMF $F(i) = \sum_{j=0}^{i} f(j)$

Inventory Sold is min(α + β, i)

Close the store at 6pm

Observe new state $S_{t+1} = (\max(\alpha + \beta - i, 0), \max(C - (\alpha + \beta), 0))$


Inventory Markov Process States and Transitions

$\mathcal{S} := \{(\alpha, \beta) : 0 \leq \alpha + \beta \leq C\}$

If $S_t := (\alpha, \beta)$, $S_{t+1} := (\alpha + \beta - i, C - (\alpha + \beta))$ for $i = 0, 1, \ldots, \alpha + \beta$

$\mathcal{P}((\alpha, \beta), (\alpha + \beta - i, C - (\alpha + \beta))) = f(i)$ for $0 \leq i \leq \alpha + \beta - 1$

$\mathcal{P}((\alpha, \beta), (0, C - (\alpha + \beta))) = \sum_{j=\alpha+\beta}^{\infty} f(j) = 1 - F(\alpha + \beta - 1)$
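A minimal sketch (not the book's construction) of building this transition map as a plain dict of dicts; the names poisson_pmf and inventory_transition_map are hypothetical, with parameters capacity and lam standing in for $C$ and $\lambda$:

from math import exp, factorial

def poisson_pmf(i, lam):
    return exp(-lam) * lam ** i / factorial(i)

def inventory_transition_map(capacity, lam):
    tr = {}
    for alpha in range(capacity + 1):
        for beta in range(capacity + 1 - alpha):
            order = capacity - (alpha + beta)
            probs = {}
            # Demand i = 0, ..., alpha+beta-1 leaves alpha+beta-i units on hand
            for i in range(alpha + beta):
                probs[(alpha + beta - i, order)] = poisson_pmf(i, lam)
            # Any demand >= alpha+beta empties the shelf: probability 1 - F(alpha+beta-1)
            probs[(0, order)] = 1.0 - sum(probs.values())
            tr[(alpha, beta)] = probs
    return tr

# Example: capacity 2, mean daily demand 1.0; transitions out of state (0, 2)
print(inventory_transition_map(2, 1.0)[(0, 2)])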


Inventory Markov Process


Stationary Distribution of a Markov Process

Definition

The Stationary Distribution of a (Time-Homogeneous) Markov Process with state space $\mathcal{S} = \mathcal{N}$ and transition probability function $\mathcal{P} : \mathcal{N} \times \mathcal{N} \to [0, 1]$ is a probability distribution function $\pi : \mathcal{N} \to [0, 1]$ such that:

$\pi(s') = \sum_{s \in \mathcal{N}} \pi(s) \cdot \mathcal{P}(s, s')$ for all $s' \in \mathcal{N}$

For a Time-Homogeneous MP with finite states $\mathcal{S} = \{s_1, s_2, \ldots, s_n\} = \mathcal{N}$,

$\pi(s_j) = \sum_{i=1}^{n} \pi(s_i) \cdot \mathcal{P}(s_i, s_j)$ for all $j = 1, 2, \ldots, n$

Turning $\mathcal{P}$ into a matrix $\mathbf{P}$, we get: $\pi^T = \pi^T \cdot \mathbf{P}$

$\mathbf{P}^T \cdot \pi = \pi \Rightarrow \pi$ is an eigenvector of $\mathbf{P}^T$ with eigenvalue of 1
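For a finite MP, that eigenvector can be computed numerically; a minimal numpy sketch using the Weather transition probabilities above, with an assumed state ordering (Rain, Snow, Nice):

import numpy as np

# Row-stochastic transition matrix: rows = source states, columns = destinations
P = np.array([
    [0.3, 0.0, 0.7],   # Rain -> Rain, Snow, Nice
    [0.4, 0.6, 0.0],   # Snow -> Rain, Snow, Nice
    [0.2, 0.3, 0.5],   # Nice -> Rain, Snow, Nice
])

eigvals, eigvecs = np.linalg.eig(P.T)
idx = np.argmin(np.abs(eigvals - 1.0))   # eigenvector with eigenvalue (numerically) 1
pi = np.real(eigvecs[:, idx])
pi = pi / pi.sum()                        # normalize to a probability distribution
print(pi)                                 # stationary probabilities for Rain, Snow, Nice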


MRP Definition for Discrete Time, Countable States

Definition

A Markov Reward Process (MRP) is a Markov Process, along with a time-indexed sequence of Reward random variables $R_t \in \mathcal{D}$ (a countable subset of $\mathbb{R}$) for time steps $t = 1, 2, \ldots$, satisfying the Markov Property (including Rewards): $P[(R_{t+1}, S_{t+1}) \mid S_t, S_{t-1}, \ldots, S_0] = P[(R_{t+1}, S_{t+1}) \mid S_t]$ for all $t \geq 0$.

$S_0, R_1, S_1, R_2, S_2, \ldots, S_{T-1}, R_T, S_T$

By default, we assume a time-homogeneous MRP, i.e., $P[(R_{t+1}, S_{t+1}) \mid S_t]$ is independent of $t$

Time-Homogeneous MRP specified as: $\mathcal{P}_R : \mathcal{N} \times \mathcal{D} \times \mathcal{S} \to [0, 1]$

$\mathcal{P}_R(s, r, s') = P[(R_{t+1} = r, S_{t+1} = s') \mid S_t = s]$ for all $t = 0, 1, 2, \ldots$

$\mathcal{P}_R$ is known as the Transition Probability Function

@abstractclass MarkovRewardProcess

class MarkovRewardProcess(MarkovProcess[S]):

    @abstractmethod
    def transition_reward(self, state: NonTerminal[S]) \
            -> Distribution[Tuple[State[S], float]]:
        pass

    def transition(self, state: NonTerminal[S]) \
            -> Distribution[State[S]]:
        distribution = self.transition_reward(state)

        # Sample a (next state, reward) pair and keep only the next state
        def next_state(distribution=distribution):
            next_s, _ = distribution.sample()
            return next_s

        return SampledDistribution(next_state)


MRP Reward Functions

The reward transition function $\mathcal{R}_T : \mathcal{N} \times \mathcal{S} \to \mathbb{R}$ is defined as:

$\mathcal{R}_T(s, s') = E[R_{t+1} \mid S_{t+1} = s', S_t = s] = \sum_{r \in \mathcal{D}} \frac{\mathcal{P}_R(s, r, s')}{\mathcal{P}(s, s')} \cdot r = \sum_{r \in \mathcal{D}} \frac{\mathcal{P}_R(s, r, s')}{\sum_{r \in \mathcal{D}} \mathcal{P}_R(s, r, s')} \cdot r$

The reward function $\mathcal{R} : \mathcal{N} \to \mathbb{R}$ is defined as:

$\mathcal{R}(s) = E[R_{t+1} \mid S_t = s] = \sum_{s' \in \mathcal{S}} \mathcal{P}(s, s') \cdot \mathcal{R}_T(s, s') = \sum_{s' \in \mathcal{S}} \sum_{r \in \mathcal{D}} \mathcal{P}_R(s, r, s') \cdot r$


Inventory MRP

Embellish Inventory Process with Holding Cost and Stockout Cost

Holding cost of h for each unit that remains overnight

Think of this as “interest on inventory”, also includes upkeep cost

Stockout cost of p for each unit of “missed demand”

For each customer demand you could not satisfy with store inventory

Think of this as lost revenue plus customer disappointment (p ≫ h)


Order of Activity for Inventory MRP

α := On-Hand Inventory, β := On-Order Inventory, C := Store Capacity

Observe State St : (α, β) at 6pm store-closing

Order Quantity := max(C − (α + β), 0)

Record any overnight holding cost (= h · α)

Receive Inventory at 6am if you had ordered 36 hours ago

Open the store at 8am

Experience random demand $i$ with Poisson probabilities:

PMF $f(i) = \frac{e^{-\lambda} \lambda^i}{i!}$, CMF $F(i) = \sum_{j=0}^{i} f(j)$

Inventory Sold is min(α + β, i)

Record any stockout cost due (= p ·max(i − (α + β), 0))

Close the store at 6pm

Register reward Rt+1 as negative sum of holding and stockout costs

Observe new state $S_{t+1} = (\max(\alpha + \beta - i, 0), \max(C - (\alpha + \beta), 0))$
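A minimal sketch (not the book's SimpleInventory code) of sampling one (next state, reward) pair for this MRP; sample_transition_reward is a hypothetical helper, with hypothetical holding cost h, stockout cost p and Poisson demand rate lam:

import numpy as np

def sample_transition_reward(state, capacity, lam, h, p):
    alpha, beta = state                       # on-hand, on-order at store closing
    order = max(capacity - (alpha + beta), 0)
    holding_cost = h * alpha                  # overnight holding cost on on-hand units
    demand = np.random.poisson(lam)           # random demand during the day
    missed = max(demand - (alpha + beta), 0)  # unmet demand
    stockout_cost = p * missed
    next_state = (max(alpha + beta - demand, 0), order)
    reward = -(holding_cost + stockout_cost)  # reward = negative total cost
    return next_state, reward

# Example: state (1, 1), capacity 2, lam = 1.0, h = 1.0, p = 10.0
print(sample_transition_reward((1, 1), 2, 1.0, 1.0, 10.0))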


Finite Markov Reward Process

Finite State Space $\mathcal{S} = \{s_1, s_2, \ldots, s_n\}$, $|\mathcal{N}| = m \leq n$

Finite set of (next state, reward) transitions

We'd like a sparse representation for $\mathcal{P}_R$

Conceptualize $\mathcal{P}_R : \mathcal{N} \times \mathcal{D} \times \mathcal{S} \to [0, 1]$ as $\mathcal{N} \to (\mathcal{S} \times \mathcal{D} \to [0, 1])$

StateReward = FiniteDistribution[Tuple[State[S], float]]

RewardTransition = Mapping[NonTerminal[S], StateReward[S]]
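As a hedged sketch, the reward function $\mathcal{R}(s)$ from the earlier slide could be extracted from such a RewardTransition map; this assumes the FiniteDistribution.table() interface used in the code above, and reward_function is a hypothetical helper name:

def reward_function(reward_transition):
    # R(s) = sum over (s', r) of PR(s, r, s') * r, for each non-terminal s
    return {
        s: sum(p * r for (s1, r), p in d.table().items())
        for s, d in reward_transition.items()
    }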


Return as “Accumulated Discounted Rewards”

Define the Return $G_t$ from state $S_t$ as:

$G_t = \sum_{i=t+1}^{\infty} \gamma^{i-t-1} \cdot R_i = R_{t+1} + \gamma \cdot R_{t+2} + \gamma^2 \cdot R_{t+3} + \ldots$

$\gamma \in [0, 1]$ is the discount factor. Why discount?

Mathematically convenient to discount rewards

Avoids infinite returns in cyclic Markov Processes

Uncertainty about the future may not be fully represented

If reward is financial, discounting due to interest rates

Animal/human behavior prefers immediate reward

If all sequences terminate (Episodic Processes), we can set γ = 1
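A small sketch of computing returns backward over a finite (episodic) reward trace using the recursion $G_t = R_{t+1} + \gamma \cdot G_{t+1}$; the function name returns and the reward values below are made-up for illustration:

def returns(rewards, gamma):
    # rewards = [R_1, R_2, ..., R_T]; out[t] is the return G_t following state S_t
    g = 0.0
    out = []
    for r in reversed(rewards):
        g = r + gamma * g
        out.append(g)
    return list(reversed(out))

print(returns([1.0, -2.0, 0.5, 3.0], gamma=0.9))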


Value Function of MRP

Identify states with high “expected accumulated discounted rewards”

Value Function $V : \mathcal{N} \to \mathbb{R}$ defined as:

$V(s) = E[G_t \mid S_t = s]$ for all $s \in \mathcal{N}$, for all $t = 0, 1, 2, \ldots$

Bellman Equation for MRP (based on the recursion $G_t = R_{t+1} + \gamma \cdot G_{t+1}$):

$V(s) = \mathcal{R}(s) + \gamma \cdot \sum_{s' \in \mathcal{N}} \mathcal{P}(s, s') \cdot V(s')$ for all $s \in \mathcal{N}$

In vector form: $V = \mathcal{R} + \gamma \mathcal{P} \cdot V$

$\Rightarrow V = (I_m - \gamma \mathcal{P})^{-1} \cdot \mathcal{R}$

where $I_m$ is the $m \times m$ identity matrix

If m is large, we need Dynamic Programming (or Approx. DP or RL)
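For small $m$, the direct linear-algebra solution can be sketched with numpy; the values of $\mathcal{P}$, $\mathcal{R}$ and $\gamma$ below are made-up illustrative numbers, not from the text:

import numpy as np

P = np.array([[0.3, 0.0, 0.7],
              [0.4, 0.6, 0.0],
              [0.2, 0.3, 0.5]])        # transition probabilities among non-terminal states
R = np.array([1.0, -2.0, 0.5])         # expected reward R(s) for each state
gamma = 0.9                            # discount factor

# Solve (I_m - gamma * P) V = R instead of explicitly inverting the matrix
V = np.linalg.solve(np.eye(len(R)) - gamma * P, R)
print(V)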


Visualization of MRP Bellman Equation


Key Takeaways from this Chapter

Markov Property: Enables us to reason effectively & compute efficiently in practical systems involving sequential uncertainty

Bellman Equation: Recursive Expression of the Value Function. This equation (and its MDP version) is the core idea within all Dynamic Programming and Reinforcement Learning algorithms.

