number of symbol comparisons in quicksort and …poulalho/alea09/slides/vallee.pdf · comparisons...
TRANSCRIPT
NUMBER OF SYMBOL COMPARISONS
IN QUICKSORT AND QUICKSELECT
Brigitte Vallée
(CNRS and Université de Caen, France)
Joint work with Julien Clément, Jim Fill and Philippe Flajolet
Plan of the talk.
– Presentation of the study
– Statement of the results
– The general model of source
– The main steps of the method
– Sketch of the proof.
– What is a tamed source ?
The classical framework for sorting.
The main sorting algorithms or searching algorithms
e.g., QuickSort, BST-Search, InsertionSort,...
deal with n (distinct) keys U1, U2, . . . , Un of the same ordered set Ω.
They perform comparisons and exchanges between keys.
The unit cost is the key–comparison.
The behaviour of the algorithm (wrt key comparisons)
only depends on the relative order between the keys.
It is sufficient to restrict to the case when Ω = [1..n]. The input set is then Sn, with uniform probability.
Then, the analysis of all these algorithms is very well known,
with respect to the number of key–comparisons performed
in the worst-case, or in the average case.
Here, realistic analysis of the two algorithms QuickSort and QuickSelect
QuickSort(n, A): sorts the array A
    Choose a pivot;
    (k, A−, A+) := Partition(A);
    QuickSort(k − 1, A−); QuickSort(n − k, A+).

QuickSelect(n, m, A): returns the value of the element of rank m in A
    Choose a pivot;
    (k, A−, A+) := Partition(A);
    If m = k then QuickSelect := pivot
    else if m < k then QuickSelect(k − 1, m, A−)
    else QuickSelect(n − k, m − k, A+);
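A minimal executable sketch of the QuickSelect recursion above (an illustration, assuming distinct keys and a uniformly random pivot; not the slides' own code):

```python
import random

def quickselect(arr, m):
    """Return the element of rank m (1-based) among the distinct keys in arr,
    following the slide's scheme: partition around a random pivot, then
    recurse into the side that contains rank m."""
    pivot = random.choice(arr)
    lower = [x for x in arr if x < pivot]   # A-
    upper = [x for x in arr if x > pivot]   # A+
    k = len(lower) + 1                      # rank of the pivot
    if m == k:
        return pivot
    elif m < k:
        return quickselect(lower, m)
    else:
        return quickselect(upper, m - k)
```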
Mean number Kn of key comparisons
Classical results for various algorithms of the class
QuickMin(n) := QuickSelect(1, n) finds the minimum
QuickMax(n) := QuickSelect (n, n) finds the maximum.
QuickQuantα(n) := QuickSelect(⌊αn⌋, n) finds the α-quantile
QuickMed(n) := QuickSelect(⌊n/2⌋, n) finds the median
QuickRand(n) := QuickSelect(m, n) for a uniform random rank m ∈ [1..n]

QuickSort(n):    Kn ∼ 2 n log n
QuickMin(n):     Kn ∼ 2n
QuickMax(n):     Kn ∼ 2n
QuickRand(n):    Kn ∼ 3n
QuickQuantα(n):  Kn ∼ 2 [1 + Hα] n
QuickMed(n):     Kn ∼ 2 (1 + log 2) n

Hα = the entropy function = α |log α| + (1 − α) |log(1 − α)|
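The first of these asymptotics is easy to check numerically. A small sketch (an illustration, not from the slides), using the classical recurrence for the mean under a uniform random pivot, C_n = n − 1 + (2/n) Σ_{k<n} C_k:

```python
import math

def mean_quicksort_comparisons(n_max):
    """Exact mean number of key comparisons of QuickSort on n random
    distinct keys, via the recurrence C_n = n - 1 + (2/n) * sum_{k<n} C_k."""
    C = [0.0] * (n_max + 1)
    acc = 0.0                          # running sum C_0 + ... + C_{n-1}
    for n in range(1, n_max + 1):
        C[n] = (n - 1) + 2.0 * acc / n
        acc += C[n]
    return C

n = 10_000
C = mean_quicksort_comparisons(n)
H = sum(1.0 / k for k in range(1, n + 1))       # harmonic number H_n
# Closed form: C_n = 2(n+1)H_n - 4n; the ratio C_n / (2 n log n) tends to 1 (slowly).
print(C[n], C[n] / (2 * n * math.log(n)))
```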
A more realistic framework for sorting.
Keys are viewed as words. The domain Ω of keys is a subset of Σ∞,
Σ∞ = the set of infinite words on some ordered alphabet Σ. The words are compared wrt the lexicographic order.
The realistic unit cost is now the symbol comparison.
The realistic cost of the comparison between two words A and B,
A = a1 a2 a3 . . . ai . . . and B = b1 b2 b3 . . . bi . . . ,
equals k + 1, where k is the length of their longest common prefix,
k := max{i : ∀j ≤ i, aj = bj} = the coincidence.
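The coincidence and the resulting symbol cost can be sketched as (illustration only):

```python
def coincidence(a, b):
    """k = length of the longest common prefix of the words a and b."""
    k = 0
    while k < min(len(a), len(b)) and a[k] == b[k]:
        k += 1
    return k

def symbol_cost(a, b):
    """Number of symbol comparisons used to compare a and b lexicographically."""
    return coincidence(a, b) + 1

# Six common symbols; the comparison is decided on the seventh.
assert coincidence("abbbbbaaabab", "abbbbbbaabaa") == 6
assert symbol_cost("abbbbbaaabab", "abbbbbbaabaa") == 7
```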
We are interested in this new cost for each algorithm:
the number of symbol comparisons... and its mean value Sn (for n words).
How does Sn compare to Kn? That is the question...
An initial question asked by Sedgewick in 2000...
... in order to also compare with text algorithms based on tries.
An example. Sixteen words drawn from the memoryless source p(a) = 1/3, p(b) = 2/3.
We keep the prefixes of length 12.
A = abbbbbaaabab B = abbbbbbaabaa C = baabbbabbbba
D = bbbababbbaab E = bbabbaababbb F = abbbbbbbbabb
G = bbaabbabbaba H = ababbbabbbab I = bbbaabbbbbbb
J = abaabbbbaabb K = bbbabbbbbbaa L = aaaabbabaaba
M = bbbaaabbbbbb N = abbbbbbabbaa O = abbabababbbb
P = bbabbbaaaabb
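As an illustrative experiment (not on the slides), one can quicksort these sixteen prefixes while tallying both costs; the first-element pivot is an arbitrary choice:

```python
words = ["abbbbbaaabab", "abbbbbbaabaa", "baabbbabbbba", "bbbababbbaab",
         "bbabbaababbb", "abbbbbbbbabb", "bbaabbabbaba", "ababbbabbbab",
         "bbbaabbbbbbb", "abaabbbbaabb", "bbbabbbbbbaa", "aaaabbabaaba",
         "bbbaaabbbbbb", "abbbbbbabbaa", "abbabababbbb", "bbabbbaaaabb"]

key_cmps = sym_cmps = 0

def less(a, b):
    """Lexicographic comparison that tallies its symbol cost (lcp + 1)."""
    global key_cmps, sym_cmps
    key_cmps += 1
    k = 0
    while k < min(len(a), len(b)) and a[k] == b[k]:
        k += 1
    sym_cmps += k + 1
    return a < b

def quicksort(arr):
    if len(arr) <= 1:
        return list(arr)
    pivot, lower, upper = arr[0], [], []
    for w in arr[1:]:
        (lower if less(w, pivot) else upper).append(w)
    return quicksort(lower) + [pivot] + quicksort(upper)

result = quicksort(words)
assert result == sorted(words)
print(key_cmps, sym_cmps)   # sym_cmps is noticeably larger than key_cmps
```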
Plan of the talk.
– Presentation of the study
– Statement of the results
– The general model of source
– The main steps of the method
– Sketch of the proof
– What is a tamed source ?
What is already known about the mean number of symbol comparisons?
For QuickSort(n), a first answer given by Janson and Fill (’06), only for
the classical binary source.
For QuickSelect(n), this was studied by Fill and Nakama (’07), only in
the case of the binary source, only for QuickMin, QuickMax, QuickRand
– A quite involved form for the constants of the analysis.
– More than twenty pages of computation....
Our work. We answer all these questions,
in the case of a general source and a general algorithm of the class.
– There are precise restrictive hypotheses on the source, and we provide
sufficient conditions under which these hypotheses hold.
– We provide a closed form for the constants of the analysis, for any source
of the previous type.
– We use different methods, with (almost) no computation...
Case of QuickSort(n) [CFFV 08]
Theorem 1. For any tamed source, the mean number Sn of symbol comparisons used by QuickSort(n) satisfies
Sn ∼ (1/hS) n log² n,
and involves the entropy hS of the source S, defined as
hS := lim_{k→∞} (−1/k) Σ_{w∈Σ^k} pw log pw,
where pw is the probability that a word begins with prefix w.
Compared to Kn ∼ 2 n log n, there is an extra factor equal to (1/(2hS)) log n.
Compared to Tn ∼ (1/hS) n log n, there is an extra factor of log n.
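For a memoryless source, the block average −(1/k) Σ_{w∈Σ^k} pw log pw already equals hS for every k, which a short sketch can confirm (an illustration using the example source p(a) = 1/3, p(b) = 2/3):

```python
import math
from itertools import product

def block_entropy(p, k):
    """-(1/k) * sum_{w in Sigma^k} p_w log p_w for a memoryless source
    with symbol probabilities p (dict: symbol -> probability)."""
    total = 0.0
    for w in product(p, repeat=k):
        pw = math.prod(p[s] for s in w)
        total += pw * math.log(pw)
    return -total / k

p = {"a": 1 / 3, "b": 2 / 3}
h = -sum(q * math.log(q) for q in p.values())   # h_S = sum p_i log(1/p_i)
for k in (1, 2, 6):
    assert abs(block_entropy(p, k) - h) < 1e-9
```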
Case of QuickMin(n), QuickMax(n) [CFFV 08]
Theorem 2. For any weakly tamed source, the mean numbers of symbol comparisons used by QuickMin(n) and QuickMax(n),
T(−)n ∼ c(−)S n and T(+)n ∼ c(+)S n,
involve the constants c(ε)S (ε = ±), which depend on the probabilities pw and p(ε)w:
c(ε)S := 2 Σ_{w∈Σ*} pw [ 1 − (p(ε)w / pw) log(1 + pw / p(ε)w) ].
Here p(−)w, p(+)w, pw are the probabilities that a word begins with a prefix w′
with |w′| = |w| and w′ < w, w′ > w, or w′ = w, respectively.
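As a sanity check (an illustration, not from the slides), this constant, with its leading factor 2, can be evaluated for the standard binary source, where the k-th prefix of length ℓ (k = 0, …, 2^ℓ − 1) has pw = 2^{−ℓ} and p(−)w = k·2^{−ℓ}; the sum over prefixes is truncated at length L:

```python
import math

def c_eps_binary(L):
    """Truncated evaluation of c(eps)_S for the standard binary source
    (the bracket tends to 1 as p(eps)_w -> 0, so that case contributes p_w)."""
    total = 0.0
    for l in range(L + 1):
        pw = 2.0 ** (-l)
        for k in range(2 ** l):        # k-th prefix of length l: p(-)_w = k * pw
            q = k * pw
            if q == 0.0:
                total += pw            # limit of the bracket as q -> 0 is 1
            else:
                total += pw * (1.0 - (q / pw) * math.log1p(pw / q))
    return 2.0 * total

print(c_eps_binary(18))   # close to the value 5.2793782... quoted later in the talk
```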
Case of QuickRand(n) [CFFV 08]
Theorem 3. For any weakly tamed source, the mean number of symbol comparisons used by QuickRand(n) (randomized wrt rank) satisfies Tn ∼ cS n, with
cS = Σ_{w∈Σ*} pw² [ 2 + 1/pw + Σ_{ε=±} ( log(1 + p(ε)w / pw) − (p(ε)w / pw)² log(1 + pw / p(ε)w) ) ].
Here p(−)w, p(+)w, pw are the probabilities that a word begins with a prefix w′
with |w′| = |w| and w′ < w, w′ > w, or w′ = w, respectively.
Case of QuickQuantα(n) [CFFV 09] (work in progress)
Theorem 4. For any weakly tamed source, the mean number of symbol comparisons used by QuickQuantα(n) satisfies q(α)n ∼ ρS(α) n, with
ρS(α) = 2 Σ_{w∈S(α)} [ pw + pw log pw − p(α,+)w log p(α,+)w − p(α,−)w log p(α,−)w ]
      + 2 Σ_{w∈R(α)} pw [ 1 + (p(α,−)w / pw − 1) log(1 − pw / p(α,−)w) ]
      + 2 Σ_{w∈L(α)} pw [ 1 + (p(α,+)w / pw − 1) log(1 − pw / p(α,+)w) ].
Three parts, depending on the position of the probabilities p(ε)w wrt α:
w ∈ L(α) iff p(+)w ≥ 1 − α;  w ∈ R(α) iff p(−)w ≥ α;  w ∈ S(α) iff p(−)w < α < 1 − p(+)w;
with p(α,+)w = 1 − α − p(+)w and p(α,−)w = α − p(−)w.
The constants of the analysis for the binary source.
hB = log 2,  c(+)B = c(−)B =: c(ε)B,
c(ε)B = 4 + 2 Σ_{ℓ≥0} (1/2^ℓ) Σ_{k=1}^{2^ℓ−1} [ 1 − k log(1 + 1/k) ],
cB = 14/3 + 2 Σ_{ℓ≥0} (1/2^{2ℓ}) Σ_{k=1}^{2^ℓ−1} [ k + 1 + log(k + 1) − k² log(1 + 1/k) ].
Numerically,
c(ε)B = 5.2793782410809583738656270377858153808364114934903...
cB = 8.20731...
To be compared to the constants of the number of key comparisons,
κ = 2 or κ = 3.
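Both series converge geometrically, so a direct truncation reproduces these values (a numerical sketch; the truncation depth L = 18 is an arbitrary choice, with geometrically small tails):

```python
import math

def c_eps_B(L):
    """Truncated evaluation of c(eps)_B = 4 + 2 sum_l 2^-l sum_k [1 - k log(1+1/k)]."""
    s = 0.0
    for l in range(L + 1):
        inner = sum(1.0 - k * math.log1p(1.0 / k) for k in range(1, 2 ** l))
        s += inner / 2 ** l
    return 4.0 + 2.0 * s

def c_B(L):
    """Truncated evaluation of c_B = 14/3 + 2 sum_l 4^-l sum_k [k+1+log(k+1)-k^2 log(1+1/k)]."""
    s = 0.0
    for l in range(L + 1):
        inner = sum(k + 1 + math.log(k + 1) - k * k * math.log1p(1.0 / k)
                    for k in range(1, 2 ** l))
        s += inner / 4 ** l
    return 14.0 / 3.0 + 2.0 * s

c1, c2 = c_eps_B(18), c_B(18)
print(c1)   # close to 5.27937824...
print(c2)   # close to 8.20731...
```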
Plan of the talk.
– Presentation of the study
– Statement of the results
– The general model of source
– The main steps of the method
– Sketch of the proof
– What is a tamed source ?
The (general) model of source (I)
A general source S produces infinite words on an ordered alphabet Σ.
For w ∈ Σ*, pw := probability that a word begins with the prefix w.
Define
aw := Σ_{w′ : |w′| = |w|, w′ < w, pw′ ≠ 0} pw′,   bw := Σ_{w′ : |w′| = |w|, w′ ≤ w, pw′ ≠ 0} pw′.
Then: ∀u ∈ [0, 1], ∀k ≥ 1, ∃w = Mk(u) ∈ Σ^k such that u ∈ [aw, bw[.
Example: p0 = 1/3, p1 = 2/3 ⇒ M1(1/2) = 1, M1(1/4) = 0;
p00 = 1/12, p01 = 3/12, p10 = 1/2, p11 = 1/6 ⇒ M2(1/2) = 10, M2(1/4) = 01.
If pw → 0 for |w| → ∞, the sequences (aw) and (bw) are adjacent.
They define an infinite word M(u) := lim_{k→∞} Mk(u). Then, the source is alternatively defined by a mapping M : [0, 1] → Σ∞.
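A sketch of Mk(u) on the slide's example (the helper M_k and the level lists are hypothetical names, not from the slides):

```python
# The intervals [a_w, b_w[ of one level tile [0, 1), and M_k(u) is the
# prefix whose interval contains u.
def M_k(u, level_probs):
    """level_probs: list of (prefix, p_w) in increasing lexicographic order."""
    a = 0.0
    for w, p in level_probs:
        if a <= u < a + p:
            return w
        a += p
    raise ValueError("u outside [0, 1)")

level1 = [("0", 1 / 3), ("1", 2 / 3)]
level2 = [("00", 1 / 12), ("01", 3 / 12), ("10", 1 / 2), ("11", 1 / 6)]
assert M_k(1 / 2, level1) == "1" and M_k(1 / 4, level1) == "0"
assert M_k(1 / 2, level2) == "10" and M_k(1 / 4, level2) == "01"
```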
The (general) model of source (II)
For a prefix w ∈ Σ*, the interval I = [0, 1] is divided into three intervals:
I(−)w := [0, aw] ∼ {u : M(u) < w},
I(+)w := [bw, 1] ∼ {u : M(u) > w},
Iw := [aw, bw] ∼ {u : M(u) begins with w}.
The interval Iw is the fundamental interval relative to the prefix w.
The measure of each interval, denoted by pw or p(±)w, is the probability that M(u) begins with a prefix w′ with
|w′| = |w| and w′ = w, w′ < w, or w′ > w, respectively.
Remark: p(−)w = aw, p(+)w = 1 − bw, pw = bw − aw.
Natural instances of sources: Dynamical sources
With a shift map T : I → I and an encoding map τ : I → Σ,
the emitted word is M(x) = (τx, τTx, τT²x, . . . , τT^k x, . . .).
[Figure: a dynamical system with Σ = {a, b, c}, showing the orbit x, Tx, T²x, T³x and the emitted word M(x) = (c, b, a, c, . . .).]
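As a concrete illustration (not from the slides), the binary memoryless source p(0) = p(1) = 1/2 arises from the shift T(x) = 2x mod 1 with τ(x) = 0 iff x < 1/2; exact rationals avoid rounding trouble:

```python
from fractions import Fraction

def emit(x, n):
    """First n symbols of M(x) = (tau x, tau Tx, tau T^2 x, ...)."""
    word = []
    for _ in range(n):
        word.append(0 if x < Fraction(1, 2) else 1)
        x = (2 * x) % 1          # the shift map T
    return word

# M(x) is then just the binary expansion of x: 5/8 = 0.101 in base 2.
assert emit(Fraction(5, 8), 5) == [1, 0, 1, 0, 0]
```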
Memoryless sources or Markov chains = dynamical sources with affine branches...
[Figure: two plots on [0, 1] × [0, 1] of shift maps with affine branches.]
The dynamical framework leads to more general sources.
The curvature of the branches entails correlation between symbols.
Example: the Continued Fraction source.
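For the Continued Fraction source, the shift is T(x) = {1/x} and the emitted symbol is ⌊1/x⌋; a small sketch (an illustration, using exact rational arithmetic):

```python
from fractions import Fraction

def cf_digits(x, n):
    """First n continued-fraction digits of x in (0, 1), via the shift T(x) = {1/x}."""
    digits = []
    for _ in range(n):
        if x == 0:
            break
        inv = 1 / x                  # Fraction arithmetic, exact
        digits.append(int(inv))      # emitted symbol: floor(1/x)
        x = inv - int(inv)           # the shift T(x) = {1/x}
    return digits

# 7/10 = 1/(1 + 1/(2 + 1/3)): continued-fraction digits 1, 2, 3.
assert cf_digits(Fraction(7, 10), 10) == [1, 2, 3]
```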
Fundamental intervals and fundamental triangles.
[Figure: two plots on [0, 1] × [0, 1] illustrating the fundamental intervals and the fundamental triangles.]
Plan of the talk.
– Presentation of the study
– Statement of the results
– The general model of source
– The main steps of the method
– Sketch of the proof
– What is a tamed source ?
Three main steps for the analysis
of the mean number Sn of symbol comparisons
(A) The Poisson model PZ does not deal with a fixed number n of keys.
The number N of keys is now a random variable which follows a Poisson law of parameter Z.
We first obtain nice expressions for S(Z)...
(B) It is then possible to return to the model where the number of keys is fixed. We obtain a nice exact formula for Sn...,
from which it is not easy to obtain the asymptotics...
(C) Then, the Rice formula provides the asymptotics of Sn (n → ∞), as soon as the source is "tamed".
Plan of the talk.
– Presentation of the study
– Statement of the results
– The general model of source
– The main steps of the method
– Sketch of the proof
– What is a tamed source ?
(A) Dealing with the Poisson Model.
In the PZ model, where the number N of keys follows the Poisson law
Pr[N = n] = e^{−Z} Z^n / n!,
the mean number S(Z) of symbol comparisons for QuickX is
S(Z) = ∫_T [γ(u, t) + 1] π(u, t) du dt,
where T := {(u, t) : 0 ≤ u ≤ t ≤ 1} is the unit triangle,
γ(u, t) := coincidence between M(u) and M(t),
π(u, t) du dt := mean number of key comparisons performed by QuickX between M(u′) and M(t′), with u′ ∈ [u, u + du] and t′ ∈ [t − dt, t].
First Step in the Poisson model: the coincidence γ(u, t).
An (easy) alternative expression for S(Z):
S(Z) = ∫_T [γ(u, t) + 1] π(u, t) du dt = Σ_{w∈Σ*} ∫_{Tw} π(u, t) du dt.
(Indeed, γ(u, t) + 1 counts the prefixes w shared by M(u) and M(t), i.e., the number of fundamental triangles Tw that contain (u, t).)
It involves the fundamental triangles
and separates the roles of the source and the algorithm.
It is then sufficient to
– study the key probability π(u, t) of the algorithm (the second step);
– take its integral on each fundamental triangle (the third step).
Study of the key probability π(u, t) of the algorithm (I)
π(u, t) du dt := mean number of key comparisons between M(u′) and M(t′), with u′ ∈ [u, u + du] and t′ ∈ [t − dt, t].
Case of QuickSort. M(u) and M(t) are compared
iff the first pivot chosen in {M(v), v ∈ [u, t]} is M(u) or M(t).

QuickSort(n, A): sorts the array A
    Choose a pivot;
    (k, A−, A+) := Partition(A);
    QuickSort(k − 1, A−); QuickSort(n − k, A+).
Study of the key probability π(u, t) of the algorithm (II)
Case of QuickSelect
M(u) and M(t) are compared in QuickMin
iff the first pivot chosen in {M(v), v ∈ [0, t]} is M(u) or M(t).
M(u) and M(t) are compared in QuickMax
iff the first pivot chosen in {M(v), v ∈ [u, 1]} is M(u) or M(t).
And for QuickQuantα?
The idea is to compare QuickSelect with a dual algorithm, the QuickVal algorithm.
Here, we introduce another algorithm, the QuickVal algorithm,
the dual algorithm of QuickSelect,
QuickVal(n, a, A): returns the rank of the element a in B = A ∪ {a}
    B := A ∪ {a}; QV(n, a, B);

QV(n, a, B):
    Choose a pivot in B;
    (k, B−, B+) := Partition(B);
    If a = pivot then QV := k
    else if a < pivot then QV := QV(k − 1, a, B−)
    else QV := k + QV(n − k, a, B+);
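A minimal executable sketch of QuickVal (an illustration, assuming distinct keys and a random pivot; not the slides' own code):

```python
import random

def quickval(arr, a):
    """Rank (1-based) of a within arr + [a]: the dual of QuickSelect,
    following the slide's QV routine."""
    def qv(b):
        pivot = random.choice(b)
        lower = [x for x in b if x < pivot]   # B-
        upper = [x for x in b if x > pivot]   # B+
        k = len(lower) + 1                    # rank of the pivot in b
        if a == pivot:
            return k
        elif a < pivot:
            return qv(lower)
        else:
            return k + qv(upper)
    return qv(arr + [a])

assert quickval([0.3, 0.9, 0.1], 0.5) == 3   # 0.5 is third in {0.1, 0.3, 0.5, 0.9}
```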
Study of the key probability π(u, t) of the algorithm (III)
QuickValα := the algorithm where the key of interest is the word M(α).
There are three facts:
– the QuickValα algorithm and the QuickQuantα algorithm are dual;
– their probabilistic behaviours are close;
– the QuickValα algorithm is easy to deal with, since
M(u) and M(t) are compared in QuickValα
iff the first pivot chosen in {M(v), v ∈ [x, y]} is M(u) or M(t),
where the interval [x, y] is the smallest interval that contains u, t and α:
[x(u, t), y(u, t)] := [α, t] if u > α (case I) ∼ QuickMin
                      [u, α] if t < α (case II) ∼ QuickMax
                      [u, t] if u < α < t (case III) ∼ QuickSort
The three domains for the definition of the interval [x, y].
[Diagram: the unit triangle with corners (0, 0) and (1, 1); the point (α, α) separates the three regions (I), (II), (III).]
[x(u, t), y(u, t)] := [α, t] if u > α (case I)
                      [u, α] if t < α (case II)
                      [u, t] if u < α < t (case III)
Study of the key probability π(u, t) of the algorithm (IV)
For each algorithm QuickSort, QuickMin, QuickMax, QuickValα, the key probability π(u, t), defined as
π(u, t) du dt := mean number of key comparisons between M(u′) and M(t′), with u′ ∈ [u, u + du] and t′ ∈ [t − dt, t],
equals
π(u, t) du dt = Z du · Z dt · E[ 2 / (2 + N_{[x(u,t), y(u,t)]}) ].
Here, N_{[x,y]} is the number of words M(v) with v ∈ [x, y];
it follows a Poisson law of parameter Z(y − x).
Then: π(u, t) = 2 Z² f1(Z(y − x)) with f1(θ) := θ^{−2} [e^{−θ} − 1 + θ].
This ends the second step in the Poisson model.
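The identity E[2/(2 + N)] = 2 f1(θ) for a Poisson variable N of parameter θ can be checked directly against the series definition (a numerical sketch, not from the slides):

```python
import math

def f1(theta):
    """f1(theta) = theta^-2 * (exp(-theta) - 1 + theta)."""
    return (math.exp(-theta) - 1.0 + theta) / theta ** 2

def expectation(theta, n_max=200):
    """E[2/(2+N)] for N ~ Poisson(theta), summed term by term;
    the Poisson weight is updated iteratively to avoid huge factorials."""
    total, term = 0.0, math.exp(-theta)   # term = P[N = n], starting at n = 0
    for n in range(n_max + 1):
        total += 2.0 / (2 + n) * term
        term *= theta / (n + 1)
    return total

for theta in (0.5, 1.5, 4.0):
    assert abs(expectation(theta) - 2.0 * f1(theta)) < 1e-12
```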
The final expression for the mean number of symbol comparisons for each algorithm, in the Poisson model PZ:
S(Z) = 2 Z² Σ_{w∈Σ*} ∫_{Tw} f1(Z(y − x)) du dt.
[Diagram: the unit triangle with the point (α, α) and regions (I), (II), (III); a fundamental triangle Tw sits above the fundamental interval [aw, bw].]
(B) Return to the model where n is fixed. With the expansion of f1,
S(Z) = Σ_{k≥2} (−1)^k ϖ(−k) Z^k / k!
is expressed with a series ϖ(s) of Dirichlet type, which depends both on the algorithm and the source:
ϖ(s) = 2 Σ_{w∈Σ*} ∫_{Tw} (y − x)^{−(s+2)} du dt.
Since Sn/n! = [Z^n] (e^Z · S(Z)), there is an exact formula for Sn:
Sn = Σ_{k=2}^{n} (−1)^k (n choose k) ϖ(−k).
(C) Using the Rice formula
As soon as ϖ(s) is "weakly tamed" in ℜ(s) < σ0, with σ0 > −2,
the residue formula transforms the sum into an integral:
Sn = Σ_{k=2}^{n} (−1)^k (n choose k) ϖ(−k) = (1/(2iπ)) ∫_{d−i∞}^{d+i∞} ϖ(s) · n! / (s(s + 1) · · · (s + n)) ds,
with −2 < d < min(−1, σ0).
Where are the leftmost singularities of ϖ(s)?
For QuickSort: ϖ(s) = 2 Λ(s) / (s(s + 1)),
where Λ(s) := Σ_{w∈Σ*} pw^{−s} always has a singularity at s = −1.
For all the QuickSelect algorithms,
ϖ(s) is weakly tamed in ℜ(s) < σ0 with σ0 > −1.
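For instance, for the standard binary source (a check, not on the slide), Λ(s) sums in closed form:

```latex
\Lambda(s) \;=\; \sum_{w\in\Sigma^{*}} p_w^{-s}
        \;=\; \sum_{\ell\ge 0} 2^{\ell}\,\bigl(2^{-\ell}\bigr)^{-s}
        \;=\; \sum_{\ell\ge 0} 2^{\ell(1+s)}
        \;=\; \frac{1}{1-2^{\,1+s}}, \qquad \Re(s) < -1,
```

which is indeed singular at s = −1: a simple pole, with residue of modulus 1/log 2 = 1/hB.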
Plan of the talk.
– Presentation of the study
– Statement of the results
– The general model of source
– The main steps of the method
– Sketch of the proof
– What is a tamed source?
What can be expected about Λ(s)?
— For any source, Λ(s) has a singularity at s = −1.
— For a tamed source S, the dominant singularity of Λ(s) is located at s = −1; it is a simple pole, whose residue equals 1/hS.
— In this case, there is a triple pole at s = −1 for
ϖ(s)/(s + 1) = 2 Λ(s) / (s(s + 1)²), and
ϖ(s)/(s + 1) ∼ (2/hS) · 1/(s + 1)³ as s → −1.
For shifting the integral to the right, past d = −1,
other properties of Λ(s) are needed on ℜ(s) ≥ −1 – more subtle –.
Different behaviours of Λ(s) for ℜ(s) ≥ −1, where one can push past d = −1...
In the colored domains, Λ(s) is meromorphic and of polynomial growth for |s| → ∞.
For dynamical sources, we provide sufficient conditions
(of geometric or arithmetic type) under which these behaviours hold.
For a memoryless source, they depend on the approximability of the ratios log pi / log pj.
Conclusions.
If the source S is tamed, then Sn ∼ (1/hS) · n log² n. We are done!
— QuickSort must be compared to the algorithm which uses tries for
sorting n words. We studied its mean cost Tn in 2001.
For a tamed source, Tn ∼ (1/hS) · n log n.
— Our methods apply to all the QuickSelect algorithms, and the
hypotheses needed on the source are different (less strong).
They are related to properties of Λ(s) for ℜ(s) < −1.
— It is easy to adapt our results to intermittent sources, which emit
“long” sequences of the same symbol. In this case,
Sn = Θ(n log³ n), Tn = Θ(n log² n).
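The Sn ∼ (1/hS) n log² n conclusion can be observed experimentally. The sketch below is my own (not the talk's code): it runs a plain QuickSort on n random binary words from the unbiased memoryless source (hS = log 2), counting both key comparisons and symbol comparisons, where one key comparison between words u, v costs lcp(u, v) + 1 symbol comparisons. All names here are mine.

```python
import math, random

# Sketch: count key- and symbol-comparisons of QuickSort on n random
# binary words from the unbiased memoryless source (h_S = log 2).
random.seed(1)

key_cmps = 0
sym_cmps = 0

def less(u, v):
    # One key comparison costs lcp(u, v) + 1 symbol comparisons.
    global key_cmps, sym_cmps
    key_cmps += 1
    k = 0
    while k < min(len(u), len(v)) and u[k] == v[k]:
        k += 1
    sym_cmps += k + 1
    return u < v

def quicksort(a):
    if len(a) <= 1:
        return a
    pivot = a[0]
    left, right = [], []
    for x in a[1:]:
        (left if less(x, pivot) else right).append(x)
    return quicksort(left) + [pivot] + quicksort(right)

n = 2000
words = [''.join(random.choice('01') for _ in range(64)) for _ in range(n)]
out = quicksort(words)

assert out == sorted(words)
assert sym_cmps >= key_cmps      # each key comparison reads ≥ 1 symbol
# Ratio against the predicted order (1/h_S)·n·log²n; convergence to the
# constant is slow, so this is only indicative at moderate n.
print(sym_cmps / (n * math.log(n) ** 2 / math.log(2)))
```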
Open problems.
— What about the distribution of the average search cost in a BST?
Is it asymptotically normal?
We know this is the case if one counts the number of key comparisons.
We already know that, for a tamed source,
the average depth of a trie is asymptotically normal (Cesaratto-V, ’07).
Long-term research projects...
— Revisit the complexity results of the main classical algorithms,
taking into account the number of symbol comparisons
instead of the number of key comparisons.
— Provide a sharp “analytic” classification of sources:
transfer geometric properties of sources into analytic properties of Λ(s).