
Post on 02-Apr-2015


So Much Data

Bernard Chazelle

Princeton University

So Little Time

So Many Slides

So Little Time (before lunch)

computation

math experimentation

algorithms

Computers have two problems

1. They don’t have steering wheels

2. End of Moore’s Law

party’s over!

computation

algorithms experimentation

32 × 17 = 224 + 320 = 544

This is not me

FFT

RSA

noisy

low entropy

uncertain

unevenly priced

big


Biomedical imaging

Sloan Digital Sky Survey

4 petabytes (~1MG)

10 petabytes/yr

150 petabytes/yr

Collected works of Micha Sharir

My A(9,9)-th paper

massive input

output

Sublinear Algorithms

Sample tiny fraction
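To make "sample tiny fraction" concrete, here is a minimal sketch, not taken from the talk, of the simplest sublinear estimator: the fraction of 1s in a huge 0/1 array, estimated from a random sample whose size comes from the standard Hoeffding bound. The function names and the parameters `eps` (additive error) and `delta` (failure probability) are assumptions introduced for illustration.

```python
import math
import random

def hoeffding_sample_size(eps, delta):
    """Samples needed so the sample mean of values in [0, 1] is within eps
    of the true mean with probability at least 1 - delta (Hoeffding bound)."""
    return math.ceil(math.log(2.0 / delta) / (2.0 * eps * eps))

def estimate_fraction(data, eps, delta, rng=random):
    """Estimate the fraction of 1s in a 0/1 sequence by sampling with replacement."""
    m = hoeffding_sample_size(eps, delta)
    hits = sum(data[rng.randrange(len(data))] for _ in range(m))
    return hits / m
```

The number of lookups depends only on `eps` and `delta`, not on the length of `data`, which is what makes the approach sublinear.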

Shortest Paths [C-Liu-Magen ’03]

New York

Delphi

Ray Shooting

Volume Intersection Point location

Approximate MST [C-Rubinfeld-Trevisan ’01]

Reduces to counting connected components

$\mathrm{E}[\hat{c}] = c$, where $c$ = no. connected components

$\mathrm{var}[\hat{c}] \ll c^2$

whp, $\hat{c}$ is a good estimator of # connected components
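A simplified sketch of how such a component-count estimator can work, assuming an adjacency-list graph. It relies on the identity that summing 1/(size of u's component) over all vertices u gives the number of components, so sampling vertices and averaging that quantity estimates the count. This is not the exact procedure of [C-Rubinfeld-Trevisan ’01]: the fixed BFS cutoff `cap` and all names here are assumptions made for illustration, and the cutoff makes the estimate undercount by at most n/cap in expectation.

```python
import random
from collections import deque

def component_size_capped(adj, start, cap):
    """BFS from start; return the component size, or None as soon as
    more than cap vertices have been discovered."""
    seen = {start}
    queue = deque([start])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in seen:
                seen.add(v)
                if len(seen) > cap:
                    return None
                queue.append(v)
    return len(seen)

def estimate_components(adj, samples, cap, rng=random):
    """Estimate the number of connected components from sampled vertices.

    Each sampled vertex in a component of size s <= cap contributes 1/s;
    vertices in larger components contribute 0 (the truncation)."""
    n = len(adj)
    total = 0.0
    for _ in range(samples):
        u = rng.randrange(n)
        size = component_size_capped(adj, u, cap)
        if size is not None:
            total += 1.0 / size
    return n * total / samples
```

Each sample costs at most about `cap` vertex visits (plus the edges scanned from them), so the total work is governed by `samples` and `cap` rather than by the size of the graph.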

worst case

input space

average case (uniform)

worst case

average case = actuarial view

“OK, if you elect NOT to have the surgery, the insurance company offers 6 days and 7 nights in Barbados.”

arbitrary, unknown random source

Self-Improving Algorithms

Yes! This could be YOU, too!

$\mathrm{E}[T_k] \to$ optimal expected time for random source

time T1

time T2

time T3

time T4

Clustering [Ailon-C-Liu-Comandur ’05]

K-median over Hamming cube

minimize sum of distances

[Kumar-Sabharwal-Sen ’04]

$\mathrm{COST} \le \mathrm{OPT}\,(1 + \varepsilon)$

How to achieve linear limiting time?

Input space $\{0,1\}^{dn}$

prob < O(dn)/KSS

Identify core

Tail:

Use KSS

Store sample of precomputed KSS

Nearest neighbor

Incremental algorithm

Main difficulty: How to spot the tail?

encode

decode

Data inaccessible before noise

What makes you think it’s wrong?

must satisfy some property (e.g., convex, bipartite)

but does not quite

f(x) = ?

x

f(x)

data

f = access function

But life being what it is…

Humans

Define distance from any object to data class

f(x) = ?

x

g(x)

x1, x2,…

f(x1), f(x2),…

filter

g is access function for:

Online Data Reconstruction

Monotone function: $[n] \to \mathbb{R}^d$

Filter requires polylog(n) lookups

[Ailon-C-Liu-Comandur ’04]
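As a small illustration of "distance from any object to data class" in the monotone case, the sketch below computes how many values of a one-dimensional sequence must be changed to make it nondecreasing: n minus the length of its longest nondecreasing subsequence. This is an offline computation that reads every value, restricted to d = 1; it is not the polylog-lookup filter of [Ailon-C-Liu-Comandur ’04], and the function name is an assumption made for illustration.

```python
from bisect import bisect_right

def distance_to_monotone(values):
    """Minimum number of entries to change so the sequence is nondecreasing.

    Equals len(values) minus the length of the longest nondecreasing
    subsequence, computed with patience sorting in O(n log n)."""
    tails = []  # tails[k] = smallest possible last value of a nondecreasing subsequence of length k + 1
    for x in values:
        i = bisect_right(tails, x)  # bisect_right lets equal values extend a run
        if i == len(tails):
            tails.append(x)
        else:
            tails[i] = x
    return len(values) - len(tails)
```

For example, `distance_to_monotone([1, 5, 2, 3, 4])` counts the single out-of-place value 5; dividing the result by n gives the relative distance that a reconstruction filter would aim to stay within a constant factor of.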

Convex polygon

Filter requires : lookups

[C-Comandur ’06]

Convex terrain

Filter requires : lookups

Iterated planar separator theorem

Iterated (weak) planar separator theorem

in sublinear time!

Using epsilon-nets in spaces of unbounded VC dimension

reconstruct

bipartite graph

k-connectivity

expander

denoising low-dim attractor sets

Priced computation & accuracy

spectrometry/cloning/gene chip

PCR/hybridization/chromatography

gel electrophoresis/blotting

001100001010001111110011001101011100001100000101111o1o1100001100

Linear programming

Pricing data

Factoring is easy. Here’s why…

Gaussian mixture sample: 00100101001001101010101….

Collaborators: Nir Ailon, Seshadri Comandur, Ding Liu,

Avner Magen, Ronitt Rubinfeld, Luca Trevisan
