Surviving The Information Avalanche Jim Gray Microsoft Research Talk @ Adobe Developers Conference 26 April 2004 http://research.microsoft.com/~gray/ talks


Page 1:

Surviving The Information Avalanche

Jim Gray, Microsoft Research

Talk @ Adobe Developers Conference, 26 April 2004

http://research.microsoft.com/~gray/talks

Page 2:

Outline

Historical trends imply that in 20 years:
1. We can store everything in cyberspace. The personal petabyte.
2. Computers will have natural interfaces: speech recognition/synthesis, vision, object recognition beyond OCR.

Implications
1. The information avalanche will only get worse.
2. The user interface will change: less typing, more writing, talking, gesturing, more seeing and hearing.
3. Organizing, summarizing, prioritizing information is a key technology.

[Figure: storage scale from Kilo through Mega, Giga, Tera, Peta, Exa, Zetta to Yotta, with a "We are here" marker.]

Page 3:

Things Have Changed

• IBM 305 RAMAC

• 10 MB disk

• ~1M$ (y2004 $)

1956

Page 4:

The Next 50 years will see MORE CHANGE

ops/s/$ Had Three Growth Curves 1890-1990

• 1890-1945: Mechanical, relay. 7-year doubling (chart label: doubles every 7.5 years)
• 1945-1985: Tube, transistor, ... 2.3-year doubling
• 1985-2004: Microprocessor. 1.0-year doubling

[Figure: ops per second/$ (log scale, 1.E-06 to 1.E+09) versus year (1880 to 2000), with the three doubling rates marked.]

Combination of Hans Moravec + Larry Roberts + Gordon Bell. WordSize*ops/s/sysprice
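
To make the three curves comparable, a doubling time can be converted into an improvement factor per decade. The following is a minimal sketch, not part of the talk; the function name and the choice of a 10-year span are illustrative assumptions.

```python
def growth_factor(doubling_years: float, span_years: float) -> float:
    """Improvement factor after span_years if capability doubles every doubling_years."""
    return 2.0 ** (span_years / doubling_years)

# Doubling times quoted on the slide's chart (years), keyed by era.
eras = {"mechanical/relay": 7.5, "tube/transistor": 2.3, "microprocessor": 1.0}

for name, doubling in eras.items():
    print(f"{name}: {growth_factor(doubling, 10):,.1f}x per decade")
```

Under these numbers a 1.0-year doubling time compounds to about 1,000x per decade, while the 7.5-year curve gives only about 2.5x.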

Page 5:

Constant Cost or Constant Function?

• 100x improvement per decade
• Same function 100x cheaper
• 100x more function for same price

[Figure: machine classes (Mainframe, Mini, Workstation, PDA; SMP, Constellation, Cluster; Graphics/storage; Camera/browser) arranged as "Constant Price" versus "Lower Price – New Category".]

Page 6:

Growth Comes From NEW Apps

• The 10M$ computer of 1980 costs 1k$ today
• If we were still doing the same things, IT would be a 0 B$/y industry
• NEW things absorb the new capacity

Page 7:

The Surprise-Free Future in 20 years.

• 10,000x more power for same price
– Personal supercomputer
– Personal petabyte stores
• Same function for 10,000x less cost.
– Smart dust -- the penny PC?
– The 10 peta-op computer (for 1,000$).
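
The 10,000x figure follows from compounding the 100x-per-decade rate quoted earlier in the talk over two decades. A small sketch of that arithmetic, with the implied doubling time added as an illustration (the variable names are assumptions):

```python
import math

PER_DECADE = 100.0   # improvement factor per decade, from the earlier slide
YEARS = 20

total = PER_DECADE ** (YEARS / 10)                         # compounding over two decades
doubling_years = 10 * math.log(2) / math.log(PER_DECADE)   # implied doubling time

print(f"{total:,.0f}x in {YEARS} years")
print(f"implied doubling time: {doubling_years:.2f} years")
```

The implied doubling time is about 1.5 years, which sits between the 1.0-year and 2.3-year curves on the growth chart.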

Page 8:

10,000x would change things

• Human computer interface
– Decent computer vision
– Decent computer speech recognition
– Decent computer speech synthesis
• Vast information stores
• Ability to search and abstract the stores.

Page 9:

How Good is HCI Today?

• Surprisingly good.
– Demo of making faces http://research.microsoft.com/research/pubs/view.aspx?pubid=290
– Demo of speech synthesis: Daisy, Hal; synthetic voice
– Speech recognition is improving fast
– Vision getting better
– Pen computing finally a reality.
– Displays improving fast (compared to last 30 years)

Page 10:

Outline

Historical trends imply that in 20 years:
1. We can store everything in cyberspace. The personal petabyte.
2. Computers will have natural interfaces: speech recognition/synthesis, vision, object recognition beyond OCR.

Implications
1. The information avalanche will only get worse.
2. The user interface will change: less typing, more writing, talking, gesturing, more seeing and hearing.
3. Organizing, summarizing, prioritizing information is a key technology.

[Figure: storage scale from Kilo through Mega, Giga, Tera, Peta, Exa, Zetta to Yotta, with a "We are here" marker.]

Page 11:

How much information is there?

• Almost everything is recorded digitally.
• Most bytes are never seen by humans.
• Data summarization, trend detection, anomaly detection are key technologies

See Mike Lesk: How much information is there: http://www.lesk.com/mlesk/ksg97/ksg.html
See Lyman & Varian: How much information http://www.sims.berkeley.edu/research/projects/how-much-info/

[Figure: storage scale from Kilo to Yotta, annotated with A Photo, A Book, Movie, All books (words), All Books MultiMedia, and Everything Recorded.]

Page 12:

And >90% in Cyberspace Because:

• Low rent: min $/byte
• Shrinks time: now or later
• Shrinks space: here or there
• Automate processing: knowbots

[Figure: grid of "Point-to-Point OR Broadcast" versus "Immediate OR Time Delayed"; labels: Locate, Process, Analyze, Summarize.]

Page 13:

MyLifeBits: The guinea pig

• Gordon Bell is digitizing his life
• Has now scanned virtually all:
– Books written (and read when possible)
– Personal documents (correspondence, memos, email, bills, legal, …)
– Photos
– Posters, paintings, photo of things (artifacts, … medals, plaques)
– Home movies and videos
– CD collection
– And, of course, all PC files
• Recording: phone, radio, TV, web pages… conversations
• Paperless throughout 2002. 12” scanned, 12’ discarded.
• Only 30 GB excluding videos
• Video is 2+ TB and growing fast

Page 14:

Capture and encoding

Page 15:

I mean everything

Page 16:

25Kday life ~ Personal Petabyte

[Figure: lifetime storage in TB (log scale, 0.001 to 1000) by media type: Msgs, web pages, Tifs, Books, jpegs, 1KBps sound, music, Videos. Lifetime Storage: 1PB.]

Will anyone look at web pages in 2020? Probably new modalities & media will dominate then.
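
As a rough check on what a personal petabyte means per day, the sketch below spreads 1 PB evenly over a 25,000-day life. Decimal units and an even spread are assumptions made for illustration, not claims from the talk.

```python
DAYS = 25_000            # "25Kday life" from the slide title
LIFETIME_BYTES = 1e15    # 1 PB, decimal units assumed

bytes_per_day = LIFETIME_BYTES / DAYS
bits_per_second = bytes_per_day * 8 / 86_400   # averaged over all 24 hours

print(f"{DAYS / 365.25:.0f} years")
print(f"{bytes_per_day / 1e9:.0f} GB/day")
print(f"{bits_per_second / 1e6:.1f} Mb/s sustained")
```

That works out to about 40 GB per day, or a few Mb/s around the clock, which is a video-class data rate.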

Page 17:

Challenges

• Capture: Get the bits in

• Organize: Index them

• Manage: No worries about loss or space

• Curate/Annotate: automate where possible

• Privacy: Keep safe from theft.

• Summarize: Give thumbnail summaries

• Interface: how to ask/anticipate questions

• Present: show it in understandable ways.

Page 18:

Memex

As We May Think, Vannevar Bush, 1945

“A memex is a device in which an individual stores all his books, records, and communications, and which is mechanized so that it may be consulted with exceeding speed and flexibility”

“yet if the user inserted 5000 pages of material a day it would take him hundreds of years to fill the repository, so that he can be profligate and enter material freely”

Page 19:

Too much storage? Try to fill a terabyte in a year

Item                           Items/TB   Items/day
300 KB JPEG                    3 M        9,800
1 MB Doc                       1 M        2,900
1 hour 256 kb/s MP3 audio      9 K        26
1 hour 1.5 Mb/s MPEG video     290        0.8

Petabyte volume has to be some form of video.
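
The table's arithmetic can be sketched as follows. The binary terabyte (2^40 bytes) and the per-item sizes are assumptions chosen for illustration; the talk does not state which units it used.

```python
TB = 2 ** 40  # binary terabyte; an assumption that reproduces the slide's first rows

# Assumed size in bytes of one item of each kind.
items = {
    "300 KB JPEG": 300 * 1024,
    "1 MB document": 2 ** 20,
    "1 hour of 256 kb/s MP3": 256_000 * 3600 // 8,
    "1 hour of 1.5 Mb/s MPEG": 1_500_000 * 3600 // 8,
}

def fill_rates(size_bytes: int) -> tuple[float, float]:
    """Return (items per TB, items per day needed to fill 1 TB in a year)."""
    per_tb = TB / size_bytes
    return per_tb, per_tb / 365

for name, size in items.items():
    per_tb, per_day = fill_rates(size)
    print(f"{name:26s} {per_tb:12,.0f} per TB {per_day:10,.1f} per day")
```

Under these assumptions the JPEG, document, and MP3 rows land close to the slide's figures. The video row computes to roughly 1,600 hours per TB, so the slide's 290 and 0.8 presumably assume a higher bit rate than 1.5 Mb/s.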

Page 20:

How Will We Find Anything?

• Need Queries, Indexing, Pivoting, Scalability, Backup, Replication, Online update, Set-oriented access
• If you don’t use a DBMS, you will implement one!
• Simple logical structure:
– Blob and link is all that is inherent
– Additional properties (facets == extra tables) and methods on those tables (encapsulation)
• More than a file system
• Unifies data and meta-data

[Figure labels: SQL++, DBMS]
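
One way to picture the "blob and link plus facets" structure is a pair of core tables with one extra table per facet. This is a minimal sketch using Python's built-in sqlite3 module; the table names, columns, and sample rows are hypothetical and are not the MyLifeBits schema.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
    -- The two inherent tables: opaque items and links between them.
    CREATE TABLE blob (id INTEGER PRIMARY KEY, content BLOB NOT NULL);
    CREATE TABLE link (src INTEGER REFERENCES blob(id),
                       dst INTEGER REFERENCES blob(id),
                       kind TEXT);
    -- One facet = one extra table keyed by blob id (hypothetical photo facet).
    CREATE TABLE photo_facet (blob_id INTEGER PRIMARY KEY REFERENCES blob(id),
                              taken TEXT, caption TEXT);
""")

photo = db.execute("INSERT INTO blob (content) VALUES (?)", (b"<jpeg bytes>",)).lastrowid
story = db.execute("INSERT INTO blob (content) VALUES (?)", (b"<story text>",)).lastrowid
db.execute("INSERT INTO link VALUES (?, ?, 'uses')", (story, photo))
db.execute("INSERT INTO photo_facet VALUES (?, '2000-07-07', 'farewell lunch')", (photo,))

# Set-oriented access: every captioned photo that some story links to.
rows = db.execute("""
    SELECT p.caption, p.taken
    FROM photo_facet AS p JOIN link AS l ON l.dst = p.blob_id
    WHERE l.kind = 'uses'
""").fetchall()
print(rows)
```

Adding a new kind of metadata means adding a facet table, while the blob and link tables stay unchanged.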

Page 21:

Photos

Page 22:

Searching: the most useful app?

• Challenge: What questions for useful results?

• Many ways to present answers

Page 23:

Page 24:

Detail view

Page 25:

Resource explorer

Ancestor (collections), annotations, descendant & preview panes turned on

Page 26:

Synchronized timelines with histogram guide

Page 27:

Value of media depends on annotations

• “It’s just bits until it is annotated”

Page 28:

System annotations provide base level of value

• Date 7/7/2000

Page 29:

Tracking usage – even better

• Date 7/7/2000. Opened 30 times, emailed to 10 people (it’s valued by the user!)

Page 30:

Getting the user to say a little something is a big jump

• Date 7/7/2000. Opened 30 times, emailed to 10 people. “BARC dim sum intern farewell Lunch”

Page 31:

Getting the user to tell a story is the ultimate in media value

• A story is a “layout” in time and space
• Most valuable content (by selection, and by being well annotated)
• Stories must include links to any media they use (for future navigation/search – “transclusion”).
• Cf: MovieMaker; Creative Memories PhotoAlbums

[Example story captions:]
Dapeng was an intern at BARC for the summer of 2000
We took him to lunch at our favorite Dim Sum place to say farewell
At table L-R: Dapeng, Gordon, Tom, Jim, Don, Vicky, Patrick, Jim

Page 32:

Value of media depends on annotations

• Auto-annotate whenever possible, e.g. GPS cameras
• Make manual annotation as easy as possible: XP photo capture, voice, photos with voice, etc.
• Support gang annotation
• Make stories easy

“It’s just bits until it is annotated”

[Example story captions:]
Dapeng was an intern at BARC for the summer of 2000
We took him to lunch at our favorite Dim Sum place to say farewell
At table L-R: Dapeng, Gordon, Tom, Jim, Don, Vicky, Patrick, Jim

Page 33:

80% of data is personal / individual. But, what about the other 20%?

• Business
– Wal-Mart online: 1 PB and growing…
– Paradox: most “transaction” systems < 1 PB.
– Have to go to image/data monitoring for big data
• Government
– Government is the biggest business.
• Science
– LOTS of data.

Page 34:

Instruments: CERN – LHC
Peta Bytes per Year

Looking for the Higgs Particle

• Sensors: 1000 GB/s (1 TB/s ~ 30 EB/y)
• Events: 75 GB/s
• Filtered: 5 GB/s
• Reduced: 0.1 GB/s ~ 2 PB/y
• Data pyramid: 100GB : 1TB : 100TB : 1PB : 10PB

[Figure label: CERN Tier 0]
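
The per-year figures follow from multiplying each rate by the number of seconds in a year. A minimal sketch, assuming decimal units and continuous running (an assumption; a real accelerator does not take data all year):

```python
SECONDS_PER_YEAR = 365.25 * 24 * 3600   # about 3.16e7 s, assuming continuous running

# Stage name -> data rate in GB/s, as listed on the slide.
stages = {"sensors": 1000, "events": 75, "filtered": 5, "reduced": 0.1}

def per_year_pb(gb_per_s: float) -> float:
    """Yearly volume in petabytes for a constant rate in GB/s (decimal units)."""
    return gb_per_s * 1e9 * SECONDS_PER_YEAR / 1e15

for name, rate in stages.items():
    print(f"{name:9s} {rate:7.1f} GB/s  {per_year_pb(rate):12,.1f} PB/y")
```

The sensor rate gives about 31,600 PB/y, matching the slide's ~30 EB/y. The reduced rate gives about 3 PB/y at full duty, so the slide's ~2 PB/y is consistent with the detector running only part of the year.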

Page 35:

Information Avalanche

• Both
– Better observational instruments and
– Better simulations
are producing a data avalanche
• Examples
– Turbulence: 100 TB simulation, then mine the information
– BaBar: grows 1 TB/day; 2/3 simulation information, 1/3 observational information
– CERN: LHC will generate 1 GB/s ~ 10 PB/y
– VLBA (NRAO) generates 1 GB/s today
– NCBI: “only ½ TB” but doubling each year, very rich dataset.
– Pixar: 100 TB/Movie

Image courtesy of C. Meneveau & A. Szalay @ JHU

Page 36:

Q: Where will the Data Come From?
A: Sensor Applications

• Earth Observation
– 15 PB by 2007
• Medical Images & Information + Health Monitoring
– Potential 1 GB/patient/y, 1 EB/y
• Video Monitoring
– ~1E8 video cameras @ 1E5 MBps, 10 TB/s, 100 EB/y, filtered???
• Airplane Engines
– 1 GB sensor data/flight,
– 100,000 engine hours/day
– 30 PB/y
• Smart Dust: ?? EB/y

http://robotics.eecs.berkeley.edu/~pister/SmartDust/
http://www-bsac.eecs.berkeley.edu/~shollar/macro_motes/macromotes.html
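
Two of the estimates above can be unpacked with simple multiplication. This is only a sketch: the slide gives engine data per flight, so the 1 GB per engine hour used here is an assumption made to connect the stated numbers.

```python
GB, PB, EB = 1e9, 1e15, 1e18

# Medical: number of patients implied by 1 GB/patient/y adding up to 1 EB/y.
patients = 1 * EB / (1 * GB)

# Engines: 100,000 engine hours/day, ASSUMING about 1 GB per engine hour
# (the slide says 1 GB per flight; flight length is not given).
engine_pb_per_year = 100_000 * 365 * 1 * GB / PB

print(f"{patients:.0e} patients, {engine_pb_per_year:.1f} PB/y from engines")
```

The medical figure implies about a billion monitored patients, and the engine figure comes out in the same range as the slide's 30 PB/y.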

Page 37:

The Big Picture

[Figure: Experiments & Instruments, Simulations, Literature, and Other Archives contribute facts; questions go in and answers come out.]

The Big Problems

• Data ingest
• Managing a petabyte
• Common schema
• How to organize it?
• How to reorganize it
• How to coexist with others
• Query and Vis tools
• Support/training
• Performance
– Execute queries in a minute
– Batch query scheduling

Page 38:

FTP - GREP

• Download (FTP and GREP) are not adequate
– You can GREP 1 MB in a second
– You can GREP 1 GB in a minute
– You can GREP 1 TB in 2 days
– You can GREP 1 PB in 3 years.
• Oh!, and 1 PB ~ 3,000 disks
• At some point we need indices to limit search, parallel data search and analysis
• This is where databases can help
• Next generation technique: Data Exploration
– Bring the analysis to the data!
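
The GREP times above all correspond to a roughly constant sequential scan rate, which is why the time grows in step with the data. A small sketch that backs out the implied rate for each claim (decimal units assumed; the names are illustrative):

```python
MB, GB, TB, PB = 1e6, 1e9, 1e12, 1e15
MINUTE, DAY, YEAR = 60, 86_400, 365 * 86_400

# (bytes scanned, seconds taken) pairs quoted on the slide.
claims = {
    "1 MB in a second": (MB, 1),
    "1 GB in a minute": (GB, MINUTE),
    "1 TB in 2 days": (TB, 2 * DAY),
    "1 PB in 3 years": (PB, 3 * YEAR),
}

def implied_mb_per_s(size_bytes: float, seconds: float) -> float:
    """Sequential scan rate implied by a (size, time) claim, in MB/s."""
    return size_bytes / seconds / MB

for label, (size, secs) in claims.items():
    print(f"{label:18s} -> {implied_mb_per_s(size, secs):5.1f} MB/s")
```

Every claim implies a rate between about 1 and 17 MB/s, so a single sequential scan cannot keep up once the data reaches terabytes.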

Page 39:

The Speed Problem

• Many users want to search the whole DB: ad hoc queries, often combinatorial
• Want ~ 1 minute response
• Brute force (parallel search):
– 1 disk = 50 MBps => ~1M disks/PB ~ 300M$/PB
• Indices (limit search, do column store)
– 1,000x less equipment: 1M$/PB
• Pre-compute answer
– No one knows how to do it for all questions.
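
The brute-force estimate can be sketched as the number of disks that must read in parallel to cover a petabyte in one minute. The helper name is illustrative, and the 1,000x pruning factor for indices is taken from the slide's "1,000x less equipment".

```python
import math

PB = 1e15
DISK_BYTES_PER_S = 50e6     # 50 MBps per disk, from the slide
RESPONSE_S = 60             # the ~1 minute target

def disks_for_scan(data_bytes: float, seconds: float, disk_rate: float) -> int:
    """Disks that must read in parallel to scan data_bytes within seconds."""
    return math.ceil(data_bytes / (disk_rate * seconds))

brute_force = disks_for_scan(PB, RESPONSE_S, DISK_BYTES_PER_S)
indexed = disks_for_scan(PB / 1000, RESPONSE_S, DISK_BYTES_PER_S)  # index prunes 1,000x
print(brute_force, indexed)
```

With these inputs the brute-force count is a few hundred thousand disks, the same order of magnitude as the slide's rounded "~1M disks/PB".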

Page 40:

Next-Generation Data Analysis

• Looking for
– Needles in haystacks – the Higgs particle
– Haystacks: Dark matter, Dark energy
• Needles are easier than haystacks
• Global statistics have poor scaling
– Correlation functions are N^2, likelihood techniques N^3
• As data and computers grow at same rate, we can only keep up with N log N
• A way out?
– Relax notion of optimal (data is fuzzy, answers are approximate)
– Don’t assume infinite computational resources or memory
• Combination of statistics & computer science
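
The scaling argument can be made concrete by growing the data and the compute speed by the same factor and comparing runtimes. A minimal sketch; the starting size and the growth factor are hypothetical values chosen for illustration.

```python
import math

def relative_runtime(cost, n0: float, growth: float) -> float:
    """Runtime after data AND compute speed both grow by `growth`, relative to today."""
    return cost(n0 * growth) / (cost(n0) * growth)

costs = {
    "N log N": lambda n: n * math.log2(n),
    "N^2": lambda n: n ** 2,
    "N^3": lambda n: n ** 3,
}

N0, GROWTH = 1e9, 100.0   # hypothetical: a billion objects today, 100x growth
for name, cost in costs.items():
    print(f"{name:8s} runtime x{relative_runtime(cost, N0, GROWTH):,.2f}")
```

An N log N method slows by only about 1.2x, while the N^2 and N^3 methods slow by 100x and 10,000x even though the computers are 100x faster.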

Page 41:

Analysis and Databases

• Much statistical analysis deals with
– Creating uniform samples
– Data filtering
– Assembling relevant subsets
– Estimating completeness
– Censoring bad data
– Counting and building histograms
– Generating Monte-Carlo subsets
– Likelihood calculations
– Hypothesis testing
• Traditionally these are performed on files
• Most of these tasks are much better done inside a database
• Move Mohamed to the mountain, not the mountain to Mohamed.
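
As an illustration of doing such tasks inside a database, the sketch below filters out bad rows and builds a histogram with a single query instead of exporting files. It uses Python's built-in sqlite3 module; the table, columns, and synthetic data are hypothetical.

```python
import random
import sqlite3

random.seed(0)
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE obs (id INTEGER PRIMARY KEY, mag REAL, good INTEGER)")
# Hypothetical observations: a magnitude between 10 and 20 and a quality flag.
db.executemany(
    "INSERT INTO obs (mag, good) VALUES (?, ?)",
    [(random.uniform(10, 20), int(random.random() > 0.1)) for _ in range(10_000)],
)

# Censor bad rows and count per 1-magnitude bin without exporting any data.
histogram = db.execute("""
    SELECT CAST(mag AS INTEGER) AS bin, COUNT(*) AS n
    FROM obs
    WHERE good = 1
    GROUP BY bin
    ORDER BY bin
""").fetchall()
for bin_start, n in histogram:
    print(f"{bin_start:2d}-{bin_start + 1:2d}: {n}")
```

Only the handful of histogram rows leaves the database, which is the sense in which the analysis moves to the data.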

Page 42:

DataGrid Computing

• Store exabytes twice (for redundancy)
• Access them from anywhere
• Implies huge archive/data centers
• Supercomputer centers become super data centers
• Examples: Google, Yahoo!, Hotmail, BaBar, CERN, Fermilab, SDSC, …

Page 43:

Outline

Historical trends imply that in 20 years:
1. We can store everything in cyberspace. The personal petabyte.
2. Computers will have natural interfaces: speech recognition/synthesis, vision, object recognition beyond OCR.

Implications
1. The information avalanche will only get worse.
2. The user interface will change: less typing, more writing, talking, gesturing, more seeing and hearing.
3. Organizing, summarizing, prioritizing information is a key technology.

[Figure: storage scale from Kilo through Mega, Giga, Tera, Peta, Exa, Zetta to Yotta, with a "We are here" marker.]