So Much Data, So Little Time
Bernard Chazelle, Princeton University


Post on 16-Mar-2016


DESCRIPTION

So Much Data. So Little Time. So Many Slides (before lunch). Bernard Chazelle, Princeton University. math. algorithms. experimentation. computation. 2006.

TRANSCRIPT

Page 1: So Much Data

So Much Data

Bernard Chazelle, Princeton University

So Little Time

Page 2: So Much Data

So Many Slides

Bernard Chazelle, Princeton University

So Little Time

(before lunch)

Page 3: So Much Data

computation

math experimentation

algorithms

Page 4: So Much Data

Computers have two problems

Page 5: So Much Data

1. They don’t have steering wheels

Page 6: So Much Data
Page 7: So Much Data

2. End of Moore’s Law

party’s over !

Page 8: So Much Data

computation

algorithms experimentation

Page 9: So Much Data

32 x 17

224 + 320

= 544

This is not me

Page 10: So Much Data

FFT

RSA

Page 11: So Much Data
Page 12: So Much Data
Page 13: So Much Data

noisy

low entropy

uncertain

unevenly priced

big

Page 14: So Much Data

noisy

low entropy

uncertain

unevenly priced

big

Page 15: So Much Data

Biomedical imaging

Sloan Digital Sky Survey

4 petabytes (~1MG)

10 petabytes/yr

150 petabytes/yr

Page 16: So Much Data

Collected works of Micha Sharir

My A(9,9)-th paper

Page 17: So Much Data

massive input output

Sublinear Algorithms

Sample tiny fraction

Page 18: So Much Data

Shortest Paths [C-Liu-Magen ’03]

New York

Delphi

Page 19: So Much Data

Ray Shooting

Volume Intersection Point location

Page 20: So Much Data

Approximate MST [C-Rubinfeld-Trevisan ’01]

Page 21: So Much Data

Reduces to counting connected components

Page 22: So Much Data

E[X] = no. connected components

var[X] << (no. connected components)^2

whp, X is a good estimator of # connected components
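A minimal sketch of how sampling can estimate the number of connected components. It relies on the identity that summing 1/|C(u)| over all vertices u, where C(u) is the component containing u, gives exactly the number of components. The function names, the adjacency-dict representation, and the `cap` parameter are assumptions made for illustration; this is a simplified version of the idea, not the exact procedure of the cited paper.

```python
import random
from collections import deque

def truncated_component_size(adj, start, cap):
    """Return min(size of start's component, cap) via a BFS that stops early."""
    seen = {start}
    queue = deque([start])
    while queue and len(seen) < cap:
        v = queue.popleft()
        for w in adj[v]:
            if w not in seen:
                seen.add(w)
                if len(seen) >= cap:
                    break
                queue.append(w)
    return len(seen)

def estimate_components(adj, samples, cap, rng=None):
    """Estimate the number of connected components of an undirected graph.

    adj     : dict mapping each vertex to a list of its neighbours
    samples : number of random vertices to inspect
    cap     : each BFS gives up after seeing `cap` vertices
    """
    rng = rng or random.Random()
    vertices = list(adj)
    total = 0.0
    for _ in range(samples):
        u = rng.choice(vertices)
        # 1/|C(u)|, slightly overestimated when the BFS is truncated
        total += 1.0 / truncated_component_size(adj, u, cap)
    # Scale the sample average back up to all n vertices.
    return len(vertices) * total / samples
```

Truncating each search at `cap` vertices bounds the work per sample and biases the estimate upward by at most n/cap, so the cost depends on `samples` and `cap` rather than on the size of the graph.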

Page 23: So Much Data

worst case

input space

average case (uniform)

Page 24: So Much Data

worst case

Page 25: So Much Data

average case = actuarial view

Page 26: So Much Data

“OK, if you elect NOT to have the surgery, the insurance company offers 6 days and 7 nights in Barbados.”

Page 27: So Much Data

arbitrary, unknown random source

Self-Improving Algorithms

Page 28: So Much Data

Yes ! This could be YOU, too !

Page 29: So Much Data

E[T_k] → optimal expected time for random source

time T1 time T2 time T3 time T4

Page 30: So Much Data

Clustering [Ailon-C-Liu-Comandur ’05]

K-median over Hamming cube

Page 31: So Much Data

minimize sum of distances

Page 32: So Much Data

minimize sum of distances
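To make the objective concrete, here is a small sketch of the k-median cost over 0/1 vectors under Hamming distance. For the special case k = 1, a coordinate-wise majority vote is an optimal center, because the sum of Hamming distances splits into independent per-coordinate terms. The function names are illustrative assumptions; this shows only the objective being minimized, not the clustering algorithm from the talk.

```python
def hamming(a, b):
    """Hamming distance between two equal-length 0/1 tuples."""
    return sum(x != y for x, y in zip(a, b))

def kmedian_cost(points, centers):
    """Sum, over all points, of the distance to the nearest center."""
    return sum(min(hamming(p, c) for c in centers) for p in points)

def one_median(points):
    """Coordinate-wise majority vote: an optimal single center (k = 1).

    Ties are broken toward 0; either choice gives the same cost.
    """
    n = len(points)
    d = len(points[0])
    return tuple(1 if 2 * sum(p[i] for p in points) > n else 0
                 for i in range(d))
```

For general k the centers interact through the nearest-center assignment, so the problem no longer decomposes coordinate by coordinate.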

Page 33: So Much Data

[Kumar-Sabharwal-Sen ’04]

COST ≤ OPT (1 + ε)

Page 34: So Much Data

How to achieve linear limiting time?

Input space {0,1}^dn

prob < O(dn)/KSS

Identify core

Tail:

Use KSS

Page 35: So Much Data

Store sample of precomputed KSS

Nearest neighbor

Incremental algorithm

Page 36: So Much Data

Main difficulty: How to spot the tail?

Page 37: So Much Data
Page 38: So Much Data

encode

Page 39: So Much Data

decode

Page 40: So Much Data
Page 41: So Much Data

Data inaccessible before noise

What makes you think it’s wrong?

Page 42: So Much Data

Data inaccessible before noise

must satisfy some property (e.g., convex, bipartite) but does not quite

Page 43: So Much Data

f(x) = ?

x

f(x)

data

f = access function

Page 44: So Much Data

f(x) = ?

x

f(x)

f = access function

Page 45: So Much Data

f(x) = ?

x

f(x)

But life being what it is…

Page 46: So Much Data

f(x) = ?

x

f(x)

Page 47: So Much Data

)(O

Humans

Define distance from any object to data class

Page 48: So Much Data

f(x) = ?

x

g(x)

x1, x2,…

f(x1), f(x2),…

filter

g is access function for:

Page 49: So Much Data

Online Data Reconstruction

Page 50: So Much Data

Monotone function: [n] → R^d

Filter requires polylog (n) lookups

[Ailon-C-Liu-Comandur ’04]
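As a concrete instance of "distance from an object to a data class", consider the one-dimensional case (d = 1, an assumption made here for simplicity): the distance of a sequence to the class of monotone sequences is the fewest entries that must change to make it nondecreasing, which equals its length minus the length of a longest nondecreasing subsequence. The sketch below computes that distance offline by reading the whole input; it is not the polylog-lookup filter of the cited work, and the function name is illustrative.

```python
from bisect import bisect_right

def distance_to_monotone(values):
    """Fewest entries to change so that `values` becomes nondecreasing.

    Equals len(values) minus the length of a longest nondecreasing
    subsequence, found with the standard patience-sorting technique.
    """
    tails = []  # tails[i] = smallest last element of a subsequence of length i + 1
    for v in values:
        i = bisect_right(tails, v)  # bisect_right allows equal neighbours
        if i == len(tails):
            tails.append(v)
        else:
            tails[i] = v
    return len(values) - len(tails)
```

A filter in the sense of the slides would answer individual queries g(x) consistently with some monotone g that differs from f in roughly this few positions, without ever reading all of f.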

Page 51: So Much Data

Convex polygon

Filter requires : lookups

[C-Comandur ’06 ]

Page 52: So Much Data

Convex terrain

lookups

Filter requires :

Page 53: So Much Data

Iterated planar separator theorem

Page 54: So Much Data

Iterated planar separator theorem

Page 55: So Much Data

Iterated (weak) planar separator theorem in sublinear time!

Page 56: So Much Data

Using epsilon-nets in spaces of unbounded VC dimension

reconstruct

Page 57: So Much Data

bipartite graph

k-connectivity expander

Page 58: So Much Data

denoising low-dim attractor sets

Page 59: So Much Data

Priced computation & accuracy

spectrometry/cloning/gene chip
PCR/hybridization/chromatography
gel electrophoresis/blotting

001100001010001111110011001101011100001100000101111o1o1100001100

Linear programming

Page 60: So Much Data

Pricing data

Factoring is easy. Here’s why…

Gaussian mixture sample: 00100101001001101010101….

Page 61: So Much Data

Collaborators: Nir Ailon, Seshadri Comandur, Ding Liu, Avner Magen, Ronitt Rubinfeld, Luca Trevisan