dpa51 dynamic programming applications lecture 5

DPA5 1

Dynamic Programming Applications

Lecture 5

DPA5 2

Preview

Last time:

Structural properties.

Today:

Optimal stopping & the OLA rule

(Secretary problem, Asset selling)

Next time:

Infinite horizon.

DPA5 3

The RM problem

$J_t(x,i) = \max\{J_{t-1}(x),\ R_i + J_{t-1}(x-1)\} = (R_i - OC_{t-1}(x))^+ + J_{t-1}(x)$

Optimal policy: accept class $i$ iff $R_i \ge OC_{t-1}(x) = J_{t-1}(x) - J_{t-1}(x-1)$

Results:

1. $J_t(x)$ is increasing in $x$ (by induction)

2. $OC_t(x)$ is decreasing in $x$ (single crossing)

3. $OC_t(x)$ is increasing in $t$ (by induction and result 2):

$J_t(x) = \sum_i p_i (R_i - OC_{t-1}(x))^+ + J_{t-1}(x)$

$J_t(x-1) = \sum_i p_i (R_i - OC_{t-1}(x-1))^+ + J_{t-1}(x-1)$

$OC_t(x) - OC_{t-1}(x) = \sum_i p_i \left[(R_i - OC_{t-1}(x))^+ - (R_i - OC_{t-1}(x-1))^+\right] \ge 0$
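The three results can be checked numerically on a small instance. The sketch below is only an illustration under stated assumptions: $t$ counts periods remaining, $J_0(x) = 0$ and $J_t(0) = 0$, at most one request arrives per period, and $p_i$ is the probability that it is a class-$i$ request. The function names and the fare data are hypothetical.

```python
def rm_values(revenues, probs, capacity, horizon):
    """Return J with J[t][x] = value with t periods to go and x units left."""
    # Assumed boundary conditions: J[0][x] = 0 and J[t][0] = 0.
    J = [[0.0] * (capacity + 1) for _ in range(horizon + 1)]
    for t in range(1, horizon + 1):
        for x in range(1, capacity + 1):
            oc = J[t - 1][x] - J[t - 1][x - 1]  # OC_{t-1}(x)
            # J_t(x) = sum_i p_i (R_i - OC_{t-1}(x))^+ + J_{t-1}(x)
            gain = sum(p * max(r - oc, 0.0) for r, p in zip(revenues, probs))
            J[t][x] = J[t - 1][x] + gain
    return J


def opportunity_cost(J, t, x):
    """OC_t(x) = J_t(x) - J_t(x-1), defined for x >= 1."""
    return J[t][x] - J[t][x - 1]


# Hypothetical data: three classes, request probabilities summing to 0.9.
J = rm_values(revenues=[100.0, 60.0, 30.0], probs=[0.2, 0.3, 0.4],
              capacity=5, horizon=10)
```

Comparing `opportunity_cost(J, t, x)` across `x` and across `t` on this table is one way to see results 2 and 3 on a concrete instance.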

DPA5 4

The RM problem - results

• The optimal policy is characterized by threshold levels $b_t^i$ as follows:

Accept class $i$ at time $t$ iff $0 \le x < b_t^i$

where $b_t^i = \min\{x \mid OC_{t-1}(x) > R_i\}$

• Moreover, $b_t^1 \ge \dots \ge b_t^m$, where $R_1 \ge \dots \ge R_m$

DPA5 5

Optimal Stopping

At each stage a control is available that stops the evolution of the system.

At stage $k$ there are 2 options:

1. Stop the process (get a certain reward)

2. Continue the process, perhaps at a certain cost, and select one of the next available choices.

If there is only one other choice besides stopping, the policy is characterized by the set of stopping states.

DPA5 6

Secretary Problems

• Cayley 1875

• Interview $N$ candidates for a job

• Must accept/reject at end of interview

• Objectives:

– Maximize expected ‘score’

– Maximize P(get the best)

(you risk hiring nobody!)

DPA5 7

Archetype problem

Make an irrevocable choice from a fixed number of opportunities whose values are revealed sequentially.

• Asset selling

• Purchasing with a deadline

• Exercising stock options (in your next HW)

DPA5 8

Max P(get best)

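• $W_t$ = history of relative ranks of candidates seen by time $t$ (inclusive)

• $x_t = 1$ if the $t$th candidate is the best seen so far, $0$ otherwise

• Relevant state: $t$ and $x_t$

• Fact: the event $x_t = 1$ and $W_{t-1}$ are statistically independent: $P(x_t = 1 \mid W_{t-1}) = P(x_t = 1) = 1/t$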

DPA5 9

Objective

$J_t$ = P(under the optimal policy we select the best candidate, given that we have rejected the first $t-1$)

$J_t(0)$ = P(under the optimal policy we select the best candidate, given that we have seen $t$ so far and the last one was NOT the best so far)

$J_t(1)$ = …

$P(\text{best of } N \mid \text{best of first } t) = ?$
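One way to work out the last probability is a short conditional-probability argument, sketched here under the assumption that all $N!$ arrival orders are equally likely. The event that candidate $t$ is the best of all $N$ is contained in the event that candidate $t$ is the best of the first $t$, so

```latex
P(\text{best of } N \mid \text{best of first } t)
  = \frac{P(\text{candidate } t \text{ is best of } N)}
         {P(\text{candidate } t \text{ is best of first } t)}
  = \frac{1/N}{1/t}
  = \frac{t}{N}
```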

DPA5 10

DP equation

$J_{N+1} = 0$

$J_t = \frac{t-1}{t} J_t(0) + \frac{1}{t} J_t(1)$

$J_t(0) = J_{t+1}$ (must continue)

$J_t(1) = \max(t/N,\ J_{t+1})$ (accept or continue)

Fact 1: $J_{t-1} \ge J_t$

Fact 2: $J_t$ is decreasing in $t$ and $t/N$ is increasing in $t$ => single crossing

Define: $t^* = \min\{t \mid J_{t+1} \le t/N\}$

DPA5 11

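Recursion

$J_t = J_{t^*}$, if $t < t^*$

$J_t = \frac{t-1}{t} J_{t+1} + \frac{1}{N}$, if $t \ge t^*$

Dividing the second case by $t-1$: $\frac{J_t}{t-1} = \frac{J_{t+1}}{t} + \frac{1}{N(t-1)}$

Therefore: $J_{t+1} = \frac{t}{N} \sum_{s=t}^{N-1} \frac{1}{s}$ (after telescoping, using $J_{N+1} = 0$)

By definition, $t^*$ is the smallest $t$ s.t. $J_{t+1} \le t/N$, so

$t^* = \min\{t \mid \sum_{s=t}^{N-1} \frac{1}{s} \le 1\} = ?$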
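The recursion can be run backward from $J_{N+1} = 0$ to find $t^*$ for a given $N$. The following is a minimal sketch, not a definitive implementation; the function names are hypothetical, and the second function evaluates the harmonic-sum characterization of $t^*$ so the two can be compared.

```python
def secretary_dp(N):
    """Backward recursion for J_t, t = 1..N+1, and the threshold t*."""
    J = [0.0] * (N + 2)              # J[N+1] = 0; index 0 is unused
    for t in range(N, 0, -1):
        J0 = J[t + 1]                # candidate t is not best so far: must continue
        J1 = max(t / N, J[t + 1])    # candidate t is best so far: accept or continue
        J[t] = (t - 1) / t * J0 + J1 / t
    t_star = min(t for t in range(1, N + 1) if J[t + 1] <= t / N)
    return J, t_star


def t_star_from_sum(N):
    """t* = min{t : sum_{s=t}^{N-1} 1/s <= 1}."""
    return min(t for t in range(1, N + 1)
               if sum(1 / s for s in range(t, N)) <= 1)


J, t_star = secretary_dp(10)
```

For $N = 10$ this gives $t^* = 4$, that is, the first three candidates are rejected and $J_1$ is the overall probability of success.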

DPA5 12

Policy

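• For large $N$: $\sum_{s=t_0}^{N-1} \frac{1}{s} \approx \log_e(N/t_0)$

• Therefore $t_0 \approx N/e$

• Policy: interview about $N/e$ candidates and reject them, then select the first candidate who is the best seen so far.

• P(success) $= J(t_0) \approx t_0/N \approx 1/e \approx 0.3679$

• Empirical validation?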
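A quick Monte Carlo check of the cutoff policy is sketched below. It assumes candidates arrive in uniformly random order and uses a fixed seed for reproducibility; `simulate_cutoff` and its parameters are hypothetical names chosen for this illustration.

```python
import math
import random


def simulate_cutoff(N, cutoff, trials, seed=0):
    """Estimate P(select the best) when the first `cutoff` candidates are rejected."""
    rng = random.Random(seed)
    wins = 0
    for _ in range(trials):
        ranks = list(range(N))           # rank 0 is the best candidate
        rng.shuffle(ranks)
        best_seen = min(ranks[:cutoff], default=N)
        chosen = None                    # stays None if nobody is hired
        for r in ranks[cutoff:]:
            if r < best_seen:            # first candidate better than all seen before
                chosen = r
                break
        wins += (chosen == 0)
    return wins / trials


rate = simulate_cutoff(N=50, cutoff=round(50 / math.e), trials=20000)
```

With these settings the estimated success rate should land close to $1/e$; the exact value for finite $N$ is slightly higher than the limit.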

DPA5 13

The Last Shall be First

“..The last person interviewed for a job gets it 55.8% of the time according to Runzheimer Canada, Inc. Early applicants are hired only 17.6% of the time; the management consulting firm suggests that job-seekers who find they are among the first to be grilled ‘tactfully ask to be rescheduled for a later date’. Mondays are also poor days to be interviewed and any day just before quitting time is also bad.”

(The Globe and Mail, Sept. 12, 1990, pg. A22)

DPA5 14

Asset selling

• Like maximizing interview score, but with discounting/investment

• Offers: $w_0, w_1, \dots, w_{N-1}$ i.i.d. with a fixed, known distribution (if not known: inference, learning)

• Stage $k$ choices:

1. Accept, and invest the offered amount $w_k$ at rate $r$

2. Reject, and wait until stage $k+1$

• Objective: maximize revenue at the end of period $N$

DPA5 15

Formulation

State:

• $x_k \ne T$: asset has not been sold, current offer is $x_k$

• $x_k = T$: asset has been sold

Decision:

• $u_k = u$: sell; $u_k = u'$: don't sell

Plant equation:

$x_{k+1} = T$, if $x_k = T$, or if $x_k \ne T$ and $u_k = u$ (sell)

$x_{k+1} = w_k$, otherwise

DPA5 16

Costs

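$g_N(x_N) = x_N$ if $x_N \ne T$; $0$ otherwise

$g_k(x_k, u_k) = (1+r)^{N-k} x_k$ if $x_k \ne T$ and $u_k = u$; $0$ otherwise

$J_N(x_N) = x_N$ if $x_N \ne T$; $0$ otherwise

$J_k(x_k) = \max\left((1+r)^{N-k} x_k,\ E_w\{J_{k+1}(w_k)\}\right)$ if $x_k \ne T$; $0$ otherwise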

DPA5 17

Policy

• Accept offer $x_k$ if $x_k > a_k$

• Reject offer $x_k$ if $x_k < a_k$

• Indifferent if $x_k = a_k$

The optimal policy is determined by the sequence $a_k$:

• $a_k = E_w\{J_{k+1}(w_k)\} / (1+r)^{N-k}$

DPA5 18

Structural properties

Fact: $a_k \ge a_{k+1}$ for all $k$

Intuition:

if an offer is good enough to be acceptable at time $k$, it should also be acceptable at time $k+1$.
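The thresholds $a_k$ and their monotonicity can be computed for a discrete offer distribution. This is a sketch under assumptions: offers take finitely many nonnegative values with known probabilities, and `asset_selling_thresholds` and the sample data are hypothetical. It tracks $E_w\{J_k(w)\}$ backward from $E_w\{J_N(w)\} = E[w]$.

```python
def asset_selling_thresholds(offers, probs, r, N):
    """Return [a_0, ..., a_{N-1}] with a_k = E_w{J_{k+1}(w)} / (1+r)^(N-k)."""
    EJ = [0.0] * (N + 1)                 # EJ[k] = E_w{J_k(w)}
    EJ[N] = sum(p * w for w, p in zip(offers, probs))   # J_N(x) = x
    a = [0.0] * N
    for k in range(N - 1, -1, -1):
        growth = (1 + r) ** (N - k)      # value at time N of $1 invested at stage k
        a[k] = EJ[k + 1] / growth
        # J_k(x) = max((1+r)^(N-k) x, E_w{J_{k+1}(w)}), averaged over the offer x
        EJ[k] = sum(p * max(growth * w, EJ[k + 1]) for w, p in zip(offers, probs))
    return a


# Hypothetical data: four equally likely offers, 5% rate, 6 stages.
a = asset_selling_thresholds(offers=[1.0, 2.0, 3.0, 4.0], probs=[0.25] * 4,
                             r=0.05, N=6)
```

In this example the last threshold is $a_{N-1} = E[w]/(1+r)$, and earlier thresholds are at least as large, so the seller becomes less demanding as the deadline approaches.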

DPA5 19

General stopping & OLA

• Stopping mandatory at or before stage $N$

• Stationary: the state, control, and disturbance spaces (sets) and the cost per stage are constant over time

• Extra action: go to the termination state at cost $t(x_k)$

DP algorithm:

$J_N(x_N) = t(x_N)$

$J_k(x_k) = \min\left(t(x_k),\ \min_{u_k} E_w\{g(x_k,u_k,w_k) + J_{k+1}(f(x_k,u_k,w_k))\}\right)$

DPA5 20

Stopping set

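It is optimal to stop at time $k$ for states $x$ in the set:

$T_k = \{x \mid t(x) \le \min_u E\{g(x,u,w) + J_{k+1}(f(x,u,w))\}\}$

Fact: $J_{N-1}(x) \le J_N(x)$, so $J_{k-1}(x) \le J_k(x)$ for all $k$, $x$.

Cor.: $T_0 \subseteq \dots \subseteq T_k \subseteq T_{k+1} \subseteq \dots \subseteq T_{N-1}$

Question: how to guarantee equality?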

DPA5 21

Absorbance

Condition: $T_{N-1}$ is absorbing if, whenever $x \in T_{N-1}$ and termination is not selected, the next state is in $T_{N-1}$.

That is, $f(x,u,w) \in T_{N-1}$ for all $x \in T_{N-1}$, $u \in U(x)$, $w$.

Intuition: if you reach a state that’s optimal to stop at, but you don’t stop, then you move to a state that’s also optimal to stop at.

Theorem: If $T_{N-1}$ is absorbing then $T_k = T_{N-1}$ for all $k$.

OLA (one-step look-ahead) policy: stop iff $x \in T_{N-1}$, the 1-step stopping set; this is optimal when $T_{N-1}$ is absorbing.
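The theorem can be illustrated on a small stopping problem. The example below is hypothetical and not taken from the lecture: the state $x$ is the best offer retained so far, stopping costs $t(x) = -x$, continuing costs $c$ per stage (the only control besides stopping), and the next state is $\max(x, w)$. Exact fractions are used so that the comparison $t(x) \le E\{\cdot\}$ is not affected by rounding.

```python
from fractions import Fraction


def stopping_sets(values, probs, c, N):
    """Return {k: T_k} for k = 0..N-1 via the stopping DP recursion."""
    states = sorted(set(values) | {0})   # 0 stands for "no offer yet"
    J = {x: -x for x in states}          # J_N(x) = t(x) = -x
    sets = {}
    for k in range(N - 1, -1, -1):
        new_J, T_k = {}, set()
        for x in states:
            stop = -x
            cont = c + sum(p * J[max(x, w)] for w, p in zip(values, probs))
            if stop <= cont:             # stopping is optimal at (k, x)
                T_k.add(x)
            new_J[x] = min(stop, cont)
        sets[k], J = T_k, new_J
    return sets


def is_absorbing(T, values):
    """True if every possible next state from a state in T stays in T."""
    return all(max(x, w) in T for x in T for w in values)


values = list(range(11))                 # offers 0..10, equally likely
probs = [Fraction(1, 11)] * 11
sets = stopping_sets(values, probs, c=1, N=8)
```

Here the one-step stopping set `sets[7]` is absorbing because the retained offer never decreases, and every earlier `sets[k]` coincides with it, as the theorem predicts.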
