So Much Data, So Little Time
Bernard Chazelle, Princeton University, 2006
So Much Data
Bernard Chazelle, Princeton University
So Little Time
So Many Slides
Bernard Chazelle, Princeton University
So Little Time
(before lunch)
computation
math experimentation
algorithms
Computers have two problems
1. They don’t have steering wheels
2. End of Moore’s Law
party’s over !
computation
algorithms experimentation
32 × 17
224 + 320
= 544
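The worked product above is the grade-school method: one partial product per digit of the multiplier, shifted by its place value and then summed. A minimal Python sketch of that procedure follows; the function name `schoolbook_multiply` is an illustrative choice and does not come from the slides.

```python
def schoolbook_multiply(a: int, b: int) -> tuple[list[int], int]:
    """Multiply non-negative integers digit by digit, as on paper.

    Returns the shifted partial products and their sum.
    """
    partials = []
    shift = 0
    while b > 0:
        digit = b % 10                      # current digit of the multiplier
        partials.append(a * digit * 10 ** shift)
        b //= 10
        shift += 1
    return partials, sum(partials)


partials, product = schoolbook_multiply(32, 17)
print(partials, product)  # [224, 320] 544
```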
This is not me
FFT
RSA
noisy
low entropy
uncertain
unevenly priced
big
Biomedical imaging
Sloan Digital Sky Survey
4 petabytes (~1MG)
10 petabytes/yr
150 petabytes/yr
Collected works of Micha Sharir
My A(9,9)-th paper
massive input output
Sublinear Algorithms
Sample tiny fraction
Shortest Paths [C-Liu-Magen ’03]
New York
Delphi
Ray Shooting
Volume Intersection Point location
Approximate MST [C-Rubinfeld-Trevisan ’01]
Reduces to counting connected components
E[estimator] = no. connected components
var[estimator] << (no. connected components)^2
whp, the estimator is a good estimate of # connected components
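One way to see why sampling can count connected components: the number of components equals the sum, over all vertices u, of 1 / (size of u's component). The sketch below estimates that sum from a random sample of vertices, exploring each sampled vertex's component only up to a cap. It is a simplified illustration in the spirit of this slide, not the algorithm of [C-Rubinfeld-Trevisan ’01] verbatim; the names `component_size_capped`, `estimate_components`, `samples`, and `cap` are assumptions made here. Capping can overestimate the count by at most n / cap, since every vertex in a large component contributes 1 / cap instead of a smaller value.

```python
import random
from collections import deque


def component_size_capped(adj, start, cap):
    """BFS from `start`, stopping once `cap` vertices have been seen."""
    seen = {start}
    queue = deque([start])
    while queue and len(seen) < cap:
        u = queue.popleft()
        for v in adj[u]:
            if v not in seen:
                seen.add(v)
                queue.append(v)
                if len(seen) >= cap:
                    break
    return len(seen)


def estimate_components(adj, samples, cap, rng):
    """Estimate the number of connected components from sampled vertices.

    Uses the identity: #components = sum over u of 1 / |component(u)|.
    """
    n = len(adj)
    total = 0.0
    for _ in range(samples):
        u = rng.randrange(n)
        total += 1.0 / component_size_capped(adj, u, cap)
    return n * total / samples


# Toy input: 300 vertices forming 100 disjoint triangles.
adj = [[] for _ in range(300)]
for i in range(0, 300, 3):
    for a, b in ((i, i + 1), (i + 1, i + 2), (i, i + 2)):
        adj[a].append(b)
        adj[b].append(a)

est = estimate_components(adj, samples=50, cap=10, rng=random.Random(0))
print(round(est))  # 100
```

On this toy graph every sampled vertex reports a component of size 3, so the estimate is exact after looking at only 50 of the 300 vertices; on general graphs the estimate is random and only concentrates around the true count.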
worst case
input space
average case (uniform)
worst case
average case = actuarial view
“OK, if you elect NOT to have the surgery, the insurance company offers 6 days and 7 nights in Barbados.”
arbitrary, unknown random source
Self-Improving Algorithms
Yes ! This could be YOU, too !
E[T_k] → optimal expected time for random source
time T1 time T2 time T3 time T4
Clustering [Ailon-C-Liu-Comandur ’05]
K-median over Hamming cube
minimize sum of distances
[Kumar-Sabharwal-Sen ’04]
COST ≤ OPT (1 + ε)
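To make the objective concrete: in k-median over the Hamming cube, each point pays its Hamming distance to the nearest center, and the cost is the sum of those payments. The sketch below computes that cost, finds OPT by brute force on a tiny instance, and checks a (1 + ε) guarantee. It assumes the bound on this slide reads COST ≤ (1 + ε)·OPT; the function names and the example points are illustrative and not from the slides. The brute-force search is exponential in d and is only there to check small cases.

```python
from itertools import combinations, product


def hamming(p, q):
    """Number of coordinates where two equal-length bit strings differ."""
    return sum(a != b for a, b in zip(p, q))


def kmedian_cost(points, centers):
    """Sum over points of the Hamming distance to the nearest center."""
    return sum(min(hamming(p, c) for c in centers) for p in points)


def brute_force_opt(points, k):
    """Optimal k-median cost, trying every set of k centers in {0,1}^d."""
    d = len(points[0])
    cube = ["".join(bits) for bits in product("01", repeat=d)]
    return min(kmedian_cost(points, cs) for cs in combinations(cube, k))


def within_factor(cost, opt, eps):
    """Check the assumed guarantee COST <= (1 + eps) * OPT."""
    return cost <= (1 + eps) * opt


points = ["0000", "0001", "1111", "1110"]
cost = kmedian_cost(points, ["0000", "1111"])
opt = brute_force_opt(points, k=2)
print(cost, opt, within_factor(cost, opt, eps=0.1))  # 2 2 True
```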
How to achieve linear limiting time?
Input space {0,1}^{dn}
prob < O(dn)/KSS
Identify core
Tail:
Use KSS
Store sample of precomputed KSS
Nearest neighbor
Incremental algorithm
Main difficulty: How to spot the tail?
encode
decode
Data inaccessible before noise
What makes you think it’s wrong?
Data inaccessible before noise
must satisfy some property (e.g., convex, bipartite) but does not quite
f(x) = ?
x
f(x)
data
f = access function
But life being what it is…
f(x) = ?
x
f(x)
)(O
Humans
Define distance from any object to data class
f(x) = ?
x
g(x)
x1, x2,…
f(x1), f(x2),…
filter
g is access function for:
Online Data Reconstruction
Monotone function: [n] → R^d
Filter requires polylog(n) lookups
[Ailon-C-Liu-Comandur ’04]
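For the one-dimensional case (d = 1), the distance from a data sequence to the class of monotone functions has a simple offline characterization: the fewest entries that must change to make the sequence nondecreasing is n minus the length of its longest nondecreasing subsequence. The sketch below computes that distance. It reads the whole input, so it is not the sublinear online filter of this slide; it only illustrates what "close to monotone" measures. The function name `distance_to_monotone` is an assumption made here.

```python
from bisect import bisect_right


def distance_to_monotone(values):
    """Fewest entries to change so the sequence becomes nondecreasing.

    Equals n minus the length of the longest nondecreasing subsequence,
    found here by patience sorting in O(n log n) time.
    """
    tails = []  # tails[i] = smallest possible tail of a subsequence of length i + 1
    for v in values:
        i = bisect_right(tails, v)
        if i == len(tails):
            tails.append(v)
        else:
            tails[i] = v
    return len(values) - len(tails)


print(distance_to_monotone([1, 2, 9, 3, 4, 5]))  # 1
print(distance_to_monotone([3, 2, 1]))           # 2
```

In the first example only the outlier 9 has to change, which is the kind of small corruption a reconstruction filter is meant to repair.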
Convex polygon
Filter requires : lookups
[C-Comandur ’06]
Convex terrain
Filter requires : lookups
Iterated planar separator theorem
Iterated (weak) planar separator theorem in sublinear time!
Using epsilon-nets in spaces of unbounded VC dimension
reconstruct
bipartite graph
k-connectivity expander
denoising low-dim attractor sets
Priced computation & accuracy
spectrometry/cloning/gene chip
PCR/hybridization/chromatography
gel electrophoresis/blotting
001100001010001111110011001101011100001100000101111o1o1100001100
Linear programming
Pricing data
Factoring is easy. Here’s why…
Gaussian mixture sample: 00100101001001101010101….
Collaborators: Nir Ailon, Seshadri Comandur, Ding Liu, Avner Magen, Ronitt Rubinfeld, Luca Trevisan