Estimating the Sortedness of a Data Stream Parikshit Gopalan U T Austin T. S. Jayram IBM Almaden Robert Krauthgamer IBM Almaden Ravi Kumar Yahoo! Research


Post on 26-Mar-2015


Estimating the Sortedness of a Data Stream

Parikshit Gopalan U T Austin

T. S. Jayram IBM Almaden

Robert Krauthgamer IBM Almaden

Ravi Kumar Yahoo! Research


Data Stream Model of Computation

[Figure: the input stream X1 X2 X3 … Xn is read into a small Storage.]

• Computing with massive data sets.

• Sequential access.

• Small storage space and update time.

[Alon-Matias-Szegedy, …]


Sorting on Data-Streams

• Cannot sort efficiently on a data stream.

• Can we tell if the data needs to be sorted? [Ergun-Kannan-Kumar-Rubinfeld-Viswanathan, Ajtai-Jayram-Kumar-Sivakumar, Gupta-Zane, Cormode-Muthukrishnan-Sahinalp, Liben-Nowell-Vee-Zhu, Ailon-Chazelle-Comandur-Liu]

• Measuring distance from sortedness:
  – Kendall tau distance
  – Spearman footrule distance
  – Ulam distance


Candidate metrics

1. Spearman’s footrule [ℓ1 distance]: the sum over positions i of |σ(i) − i|.

σ: 3 5 7 9 10 4 1 2 6 8
sorted: 1 2 3 4 5 6 7 8 9 10

Easy to compute.
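As a minimal sketch of why the footrule is easy to compute, the function below accumulates |σ(i) − i| in a single pass with one counter. The function name and the convention that the input is a permutation of 1..n are assumptions made for illustration.

```python
def spearman_footrule(perm):
    """Sum of |perm[i] - (i + 1)| over 0-based positions i.

    Assumes perm is a permutation of 1..n, so the sorted order puts
    the value v at 1-based position v.
    """
    total = 0
    for index, value in enumerate(perm):
        total += abs(value - (index + 1))
    return total


example = [3, 5, 7, 9, 10, 4, 1, 2, 6, 8]
print(spearman_footrule(example))  # prints 38
```

Only the running total and the current position are kept, which is what makes this metric straightforward in the streaming setting.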


2. Kendall Tau distance [No. of Inversions]

3 5 7 9 10 4 1 2 6 8

Inversions: Positions i < j where σ(i) > σ(j)


Within a factor of 2 of Spearman’s footrule. [Diaconis-Graham]

An O(log n) space, 1-pass (1 + ε)-approximation algorithm. [Ajtai-Jayram-Kumar-Sivakumar]
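The sketch below counts inversions by brute force and checks the factor-2 relation to the footrule on the example sequence. It is an offline O(n²) illustration, not the streaming algorithm cited above, and the function names are hypothetical.

```python
def count_inversions(perm):
    """Number of pairs i < j with perm[i] > perm[j] (Kendall tau distance to sorted)."""
    n = len(perm)
    return sum(1 for i in range(n) for j in range(i + 1, n) if perm[i] > perm[j])


def spearman_footrule(perm):
    """Sum of |perm[i] - (i + 1)| for a permutation of 1..n."""
    return sum(abs(value - (index + 1)) for index, value in enumerate(perm))


example = [3, 5, 7, 9, 10, 4, 1, 2, 6, 8]
k = count_inversions(example)
f = spearman_footrule(example)
print(k, f)  # prints 21 38
# Diaconis-Graham: inversions <= footrule <= 2 * inversions
print(k <= f <= 2 * k)  # prints True
```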


3. Ulam distance [Edit Distance]:

Ed(σ): Number of deletions needed to sort σ.

Ulam: Fastest way to sort a bridge hand.

Edit Distance and the LIS

Ed(σ): Number of deletions needed to sort σ.

σ: 5 7 8 1 10 4 2 3 6 9
sorted: 1 2 3 4 5 6 7 8 9 10

[Figure: σ is sorted by deleting numbers and re-inserting them in place; labels: Delete, Insert, 5 7 8 10.]


Edit Distance and the LIS

Ed(σ): Number of deletions needed to sort σ.

LIS(σ): Length of the longest increasing subsequence.

Ed(σ) + LIS(σ) = n

Studied in statistics, biology, computer science …

Both take a global view of the sequence.

Hard for models like streaming, sketching, property-testing.

[Figure: a sequence drawn as blocks 51 … 80, 151 … 190, 81 … 100.]
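To make the identity Ed(σ) + LIS(σ) = n concrete, here is a small offline sketch: a quadratic dynamic program recovers one longest increasing subsequence, and everything outside it is what must be deleted. The function name and the choice of a quadratic DP are assumptions for illustration; this is not a streaming algorithm.

```python
def longest_increasing_subsequence(seq):
    """Return one longest strictly increasing subsequence of seq (O(n^2) DP)."""
    n = len(seq)
    if n == 0:
        return []
    best = [1] * n    # best[i]: length of the longest increasing subsequence ending at i
    prev = [-1] * n   # prev[i]: index of the previous element in that subsequence
    for i in range(n):
        for j in range(i):
            if seq[j] < seq[i] and best[j] + 1 > best[i]:
                best[i] = best[j] + 1
                prev[i] = j
    end = max(range(n), key=best.__getitem__)
    out = []
    while end != -1:
        out.append(seq[end])
        end = prev[end]
    return out[::-1]


sigma = [5, 7, 8, 1, 10, 4, 2, 3, 6, 9]
lis = longest_increasing_subsequence(sigma)
ed = len(sigma) - len(lis)  # deletions needed: everything outside the LIS
print(lis, ed)  # prints [1, 2, 3, 6, 9] 5
```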


Prior Work

• Exact computation of Ed(σ) and LIS(σ):
  – Patience Sorting [Ross, Mallows]

Patience Sorting

Input: 5 7 8 1 10 4 2 3 6 9 0

[Animation: the numbers are dealt one at a time onto a row of piles. Each number goes on the leftmost pile whose top is larger than it, or starts a new pile on the right if there is none.]

Final piles (in the order the numbers were placed):
Pile 1: 5 1 0
Pile 2: 7 4 2
Pile 3: 8 3
Pile 4: 10 6
Pile 5: 9

Number in place i: smallest last element of an increasing subsequence of length i.

Number of piles: Length of LIS.
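The dealing rule can be sketched in a few lines. This is a minimal illustration assuming distinct numbers; the function name is hypothetical, and the binary search via `bisect_left` is one common way to find the target pile.

```python
from bisect import bisect_left


def patience_piles(stream):
    """Deal the numbers onto piles; return the piles in placement order.

    tops[k] is the current top of pile k. The tops stay sorted, so the
    leftmost pile whose top is at least x can be found by binary search.
    """
    piles = []
    tops = []
    for x in stream:
        k = bisect_left(tops, x)
        if k == len(piles):
            piles.append([x])   # no pile top is large enough: start a new pile
            tops.append(x)
        else:
            piles[k].append(x)  # x becomes the new, smaller top of pile k
            tops[k] = x
    return piles


piles = patience_piles([5, 7, 8, 1, 10, 4, 2, 3, 6, 9, 0])
print(piles)       # prints [[5, 1, 0], [7, 4, 2], [8, 3], [10, 6], [9]]
print(len(piles))  # prints 5, the length of the LIS
```

A one-pass algorithm only needs the array of tops, so the space used is proportional to the length of the LIS, which can be as large as n.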


Prior Work

• Exact computation of Ed(σ) and LIS(σ):
  – Patience Sorting [Ross, Mallows]
  – O(n) space, 1-pass streaming algorithm.
  – Ω(√n) space lower bound. [Liben-Nowell-Vee-Zhu]

• Approximating Ed(σ) and LIS(σ):
  – No sub-linear space algorithms, no lower bounds. [Ajtai et al, Cormode et al, Liben-Nowell et al]

• LIS algorithms parametrized by length of LIS: [Liben-Nowell-Vee-Zhu, Sun-Woodruff]

• Computing Ed(σ) in other models:
  – Property Testing [Ergun et al, Ailon et al]
  – Sketching [Charikar-Krauthgamer]


Our Results

• Approximating Ed(σ):
  – An O(log² n) space, randomized 4-approximation for Ed(σ).
  – An O(√n) space, deterministic (1 + ε)-approximation for Ed(σ).

• Approximating the LIS:
  – An O(√n) space, deterministic (1 + ε)-approximation for LIS(σ).

• Exact computation of Ed(σ) and LIS(σ):
  – An Ω(n) space lower bound for randomized algorithms.
  – Independently proved by [Sun-Woodruff].

• Lower bounds for approximating the LIS:
  – Conjecture: Deterministic algorithms require Ω(√n) space for a (1 + ε)-approximation.


Computing the Edit Distance

Thm: For any ε > 0, there is a one-pass randomized algorithm using O(ε⁻² log² n) space and update time that gives a (4 + ε)-approximation to Ed(σ).

1. Combinatorial measure that approximates Ulam distance. Builds on [Ergun et al, Ailon et al].

2. Sampling scheme to compute this measure in one pass.

A Voting Scheme [Ergun et al.]

Combinatorial measure called Unpopularity.

Neighborhoods of σ(i): Intervals starting or ending at i.

[Figure: example sequence 3 7 8 6 5 91 2.]

Deciding if σ(i) is unpopular:

• For every neighborhood of σ(i), every number in the neighborhood votes on “Is σ(i) out of order?”

• If a majority in some neighborhood votes against σ(i), it is marked unpopular.

Let U(σ) denote the number of unpopular numbers.

[Ergun et al]: Ed(σ) ≤ U(σ)
[Ailon et al]: U(σ) ≤ 2 Ed(σ)


A Voting Scheme [Ergun et al.]

Can we estimate U(σ) using a streaming algorithm?

4 5 3 7 1 2

Impossible to decide if σ(i) is unpopular before seeing the entire input.


A New Voting Scheme

• Neighborhoods of σ(i): Intervals ending at i.
• If a majority in some neighborhood votes against σ(i), it is marked unpopular.
• Unpopularity is based only on the past, not the future.

Thm: Let V(σ) denote the number of unpopular numbers. Then

Ed(σ)/2 ≤ V(σ) ≤ 2 Ed(σ)
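A brute-force sketch of the one-sided voting rule may help fix the definition. It rests on assumptions the slide does not spell out: a neighborhood of position i is taken to be an interval of earlier positions j..i−1, a number votes against σ(i) when it is larger, and "majority" means strictly more than half. All names are hypothetical, and the quadratic loop is for checking small inputs, not for streaming.

```python
from bisect import bisect_left


def lis_length(seq):
    """Length of the longest strictly increasing subsequence (patience sorting)."""
    tops = []
    for x in seq:
        k = bisect_left(tops, x)
        if k == len(tops):
            tops.append(x)
        else:
            tops[k] = x
    return len(tops)


def unpopular_positions(seq):
    """0-based positions i for which some interval seq[j:i], j < i,
    has a strict majority of values larger than seq[i]."""
    marked = []
    for i, value in enumerate(seq):
        against = 0
        for j in range(i - 1, -1, -1):  # grow the interval leftwards from i - 1
            if seq[j] > value:
                against += 1
            if 2 * against > i - j:     # the interval seq[j:i] has i - j voters
                marked.append(i)
                break
    return marked


sigma = [4, 5, 3, 7, 1, 2]
v = len(unpopular_positions(sigma))
ed = len(sigma) - lis_length(sigma)
print(unpopular_positions(sigma), v, ed)  # prints [2, 4, 5] 3 3
```

On this input the numbers 3, 1 and 2 are marked, so V(σ) = 3, which lies between Ed(σ)/2 and 2 Ed(σ).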

A Voting Scheme

• Let Ed(σ) = k. Then V(σ) ≤ 2k.
• Fix an optimal Bad set of size k to delete; the remaining numbers are Good.

How many numbers can be Unpopular?

Partition Unpopular into Good and Bad.

Good numbers form an increasing sequence.

Good never votes against Good.

Good + Unpopular ⇒ Bad neighborhood: some neighborhood of that number has a Bad majority.

If k numbers are Bad, at most k are Good + Unpopular.

Bad numbers might all be Unpopular.

Hence V(σ) ≤ 2k.


A Voting Scheme

• Let Ed(σ) = k. Then V(σ) ≤ 2k.

• Bound can be tight.

100 99 98 … 91 1 2 3 … 10 11 12 … 90


A Voting Scheme

• Let V(σ) = k. Then Ed(σ) ≤ 2k.
• Fix the set of k Unpopular elements.

Algorithm to produce an increasing sequence:

1. Scan right to left.

2. Delete Unpopular elements + Inversions w.r.t. the last number kept in the sequence.

At least half of deletions are Unpopular numbers.

What remains is an increasing sequence.
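The right-to-left scan can be sketched as follows. As assumptions for illustration, unpopularity uses intervals of earlier positions with a strict-majority rule, a number is an inversion when it is larger than the last number kept, and the names are hypothetical.

```python
def unpopular_positions(seq):
    """Set of 0-based positions i where some interval seq[j:i], j < i,
    has a strict majority of values larger than seq[i]."""
    marked = set()
    for i, value in enumerate(seq):
        against = 0
        for j in range(i - 1, -1, -1):
            if seq[j] > value:
                against += 1
            if 2 * against > i - j:
                marked.add(i)
                break
    return marked


def delete_to_sort(seq):
    """Scan right to left; keep a number only if it is not unpopular and is
    smaller than the last number kept. Returns (kept, number_deleted)."""
    unpopular = unpopular_positions(seq)
    kept = []
    last = float("inf")  # sentinel to the right of the sequence
    for i in range(len(seq) - 1, -1, -1):
        if i in unpopular or seq[i] > last:
            continue     # deleted: unpopular, or an inversion w.r.t. last
        kept.append(seq[i])
        last = seq[i]
    kept.reverse()
    return kept, len(seq) - len(kept)


kept, deleted = delete_to_sort([4, 5, 3, 7, 1, 2])
print(kept, deleted)  # prints [4, 5, 7] 3
```

Each kept number is not unpopular, so in the stretch deleted just before it at most half the numbers can be inversions against it; the rest must be unpopular, which is why the deletions number at most 2 V(σ).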


A Voting Scheme

• Let V(σ) = k. Then Ed(σ) ≤ 2k.
• Bound can be tight.

11 … 50 91 92 93 … 100 1 2 3 … 10 51 … 90

A New Voting Scheme

Thm: Ed(σ)/2 ≤ V(σ) ≤ 2 Ed(σ)

Can we estimate V(σ) efficiently?


Outline of Sampling Scheme

Taking a vote in one neighborhood:
– Take O(log n) samples, take the (approx.) majority.
– Reservoir Sampling [Vitter].

[Figure: example sequence 3 7 8 6 5 91 2.]

Computing V(σ): Need O(log n) samples from every neighborhood.
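For readers unfamiliar with reservoir sampling, here is a minimal sketch of the classic one-pass scheme together with a sampled majority vote for a single neighborhood. The function names, the sample size parameter `k`, and the rule that a sampled number votes against a value when it is larger are assumptions for illustration. The sketch handles one neighborhood in isolation and does not show how samples are shared across neighborhoods.

```python
import random


def reservoir_sample(stream, k, rng):
    """Uniform sample of k items from a stream of unknown length, in one pass."""
    sample = []
    for seen, item in enumerate(stream, start=1):
        if len(sample) < k:
            sample.append(item)       # fill the reservoir first
        else:
            slot = rng.randrange(seen)  # uniform in 0 .. seen - 1
            if slot < k:
                sample[slot] = item   # keep the new item with probability k / seen
    return sample


def sampled_vote(neighborhood, value, k, rng):
    """Approximate majority: do most sampled numbers vote against value?"""
    sample = reservoir_sample(neighborhood, k, rng)
    against = sum(1 for x in sample if x > value)
    return 2 * against > len(sample)


rng = random.Random(0)
print(sampled_vote([9, 8, 7], 1, 3, rng))  # prints True: every voter is larger than 1
```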

Key observation: Don’t need samples across intervals to be independent!

Roughly O(log² n) samples suffice.


Deterministic Algorithm for LIS

Thm: For any ε > 0, there is a one-pass deterministic algorithm using O(√(n/ε)) space and update time that gives a (1 − ε)-approximation to LIS(σ).

Based on a multiplayer communication protocol for LIS:

[Figure: the input split among players: 32 … 80 | 10 51 … 19 | 15 … 50.]

• Algorithm simulates protocol for √n players.


Two-Player Protocol for LIS

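[Figure: the first player holds n/2 numbers (3245 4582 … 8021) and the second player holds the rest (1000 5123 … 1319). The first player runs Patience Sorting, which yields k values (6 24 … 1000), and keeps only the entries at multiples of εk (6 … 1000), which is 1/ε values.]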
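One plausible reading of this figure can be sketched in code. Everything about the message format here is an assumption: the first player sends pairs (length, smallest last element) taken from its patience-sorting array at multiples of a step of about εk, and the second player extends each pair using only its own numbers. The function names are hypothetical.

```python
from bisect import bisect_left
import random


def patience_tops(seq):
    """tops[i - 1] is the smallest last element of an increasing subsequence of length i."""
    tops = []
    for x in seq:
        k = bisect_left(tops, x)
        if k == len(tops):
            tops.append(x)
        else:
            tops[k] = x
    return tops


def first_player_message(first_half, eps):
    """Keep only the entries at multiples of step, about eps * k, so roughly 1/eps pairs."""
    tops = patience_tops(first_half)
    step = max(1, int(eps * len(tops)))
    return [(i, tops[i - 1]) for i in range(step, len(tops) + 1, step)]


def second_player_estimate(message, second_half):
    """Best length obtainable by extending a sent prefix with the second half."""
    best = len(patience_tops(second_half))  # option: ignore the first half
    for length, last in message:
        tail = [x for x in second_half if x > last]
        best = max(best, length + len(patience_tops(tail)))
    return best


rng = random.Random(0)
sigma = rng.sample(range(1, 201), 200)
eps = 0.25
estimate = second_player_estimate(first_player_message(sigma[:100], eps), sigma[100:])
exact = len(patience_tops(sigma))
print((1 - eps) * exact <= estimate <= exact)  # prints True
```

Every candidate the second player considers is the length of a real increasing subsequence, so the estimate never exceeds the true LIS. Rounding a prefix length down to a multiple of the step loses fewer than εk elements, and k is at most the true LIS, which gives the (1 − ε) factor under these assumptions.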


Approximating the LIS

Conjecture: For some ε0 > 0, every 1-pass deterministic algorithm that gives a (1 + ε0)-approximation to LIS(σ) requires Ω(√n) space.

Consider a k-player communication protocol for LIS:

[Figure: the input split among players: 32 … 80 | 10 51 … 19 | 15 … 50.]

• As k increases, maximum message size increases.

Proving the conjecture requires analyzing k ≥ √n.

Lower Bounds for approximating the LIS

Conjecture: For some ε0 > 0, every 1-pass deterministic algorithm that gives a (1 + ε0)-approximation to LIS(σ) requires Ω(√n) space.

Candidate Hard Instances?

[Figure: two 4 × 4 grids of numbers, with the labels No and Yes below them.]

First grid:
1.8 2.9 3.7 4.9
1.6 2.8 3.5 4.6
1.3 2.5 3.3 4.5
1 2 3.2 4.2

Second grid:
1.7 2.8 3.4 4.8
1.6 2.6 3.5 4.6
1.3 2.5 3.6 4.5
1.1 2.1 3.9 4.2


Open Problems

• Estimate the edit distance between two permutations.

• Tight bounds for approximation:
  – Show an Ω(√n) lower bound for deterministic algorithms.
  – Randomized algorithm for LIS?