Dealing with MASSIVE Data Feifei Li [email protected] Dept Computer Science, FSU Sep 9, 2008


Page 1: Dealing with MASSIVE Data

Dealing with MASSIVE Data

Feifei Li

[email protected]

Dept Computer Science, FSU

Sep 9, 2008

Page 2: Dealing with MASSIVE Data

Brief Bio

• B.A.S. in computer engineering from Nanyang Technological University in 2002

• Ph.D. in computer science from Boston University in 2007

• Research Interns/Visitors at AT&T Labs, IBM T. J. Watson Research Center, Microsoft Research.

• Now: Assistant Professor in CS Department at FSU

Page 3: Dealing with MASSIVE Data

Research Areas

• Algorithms and data structures

– I/O-efficient algorithms

– streaming algorithms

– computational geometry

– misc.

• Database applications

– spatial databases

– indexing

– query processing

– data security and privacy

– Geographic Information Systems

– data streams

– probabilistic data

Page 4: Dealing with MASSIVE Data

Massive Data

• Massive datasets are being collected everywhere

• Storage management software is a billion-dollar industry

Examples (2002):

• Phone: AT&T 20TB phone call database, wireless tracking

• Consumer: WalMart 70TB database, buying patterns

• WEB: Web crawl of 200M pages and 2000M links, Google’s huge indexes

• Geography: NASA satellites generate 1.2TB per day

Page 5: Dealing with MASSIVE Data


Example: LIDAR Terrain Data

• Massive (irregular) point sets (1-10m resolution)

– Becoming relatively cheap and easy to collect

• Appalachian Mountains between 50GB and 5TB

• Exceeds memory limit and needs to be stored on disk

Page 6: Dealing with MASSIVE Data

Example: Network Flow Data

• AT&T IP backbone generates 500 GB per day

• Gigascope: A data stream management system

– Compute certain statistics

• Can we do computation without storing the data?

Page 7: Dealing with MASSIVE Data


Traditional Random Access Machine Model

• Standard theoretical model of computation:

– Infinite memory (how nice!)

– Uniform access cost

• Simple model crucial for success of computer industry


Page 8: Dealing with MASSIVE Data

How to Deal with MASSIVE Data?

when there is not enough memory

Page 9: Dealing with MASSIVE Data


Solution 1: Buy More Memory

• Expensive

• (Probably) not scalable

– Growth rate of data is higher than the growth of memory

Page 10: Dealing with MASSIVE Data

Solution 2: Cheat! (by random sampling)

• Provides approximate solutions for some problems

– average, frequency of an element, etc.

• What if we want the exact result?

• Many problems can’t be solved by sampling

– maximum, and all problems mentioned later
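A minimal sketch of the point above, on synthetic data (the data set, seed, and sample size are assumptions made only for illustration): a small random sample tracks the average well, but it will usually miss the one element that determines the maximum.

```python
import random

rng = random.Random(0)  # fixed seed so the sketch is repeatable

# Synthetic data (an assumption for illustration): many small values
# with one large value hidden among them.
data = [rng.randint(0, 100) for _ in range(100_000)]
data[12_345] = 1000

sample = rng.sample(data, 1_000)  # inspect only 1% of the data

true_mean = sum(data) / len(data)
sample_mean = sum(sample) / len(sample)
true_max = max(data)
sample_max = max(sample)

print(f"mean: true={true_mean:.2f}, sampled={sample_mean:.2f}")
print(f"max:  true={true_max}, sampled={sample_max}")
```

The sampled maximum is correct only if the sample happens to contain the hidden value, which a 1% sample rarely does.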

Page 11: Dealing with MASSIVE Data

Solution 3: Using the Right Computation Model

• External Memory Model

• Streaming Model

• Probabilistic Model (brief)

Page 12: Dealing with MASSIVE Data

Computation Model for Massive Data (1): External Memory Model

Internal memory is limited but fast

External memory is unlimited but slow

Page 13: Dealing with MASSIVE Data


Memory Hierarchy

• Modern machines have complicated memory hierarchy

– Levels get larger and slower further away from CPU

– Block sizes and memory sizes are different!

• There have been a few attempts to model the full hierarchy, but none has been successful

– They are too complicated!

[Figure: memory hierarchy with L1 cache, L2 cache, and RAM]

Page 14: Dealing with MASSIVE Data

Slow I/O

• Disk access is $10^6$ times slower than main memory access

– Disk systems try to amortize the large access time by transferring large contiguous blocks of data (8-16 KB)

• Important to store/access data to take advantage of blocks (locality)

[Figure: disk with track, magnetic surface, read/write arm, and read/write head]

“The difference in speed between modern CPU and disk technologies is analogous to the difference in speed in sharpening a pencil using a sharpener on one’s desk or by taking an airplane to the other side of the world and using a sharpener on someone else’s desk.” (D. Comer)

Page 15: Dealing with MASSIVE Data

Puzzle #1: Majority Counting

• A huge file of characters stored on disk

• Question: Is there a character that appears > 50% of the time?

• Solution 1: sort + scan

– A few passes ($O(\log_{M/B} N)$): will come to it later

• Solution 2: divide-and-conquer

– Load a chunk into memory: N/M chunks

– Count them, return majority

– The overall majority must be the majority in >50% chunks

– Iterate until < M

– Very few passes ($O(\log_M N)$), geometrically decreasing

• Solution 3: O(1) memory, 2 passes (answer to be posted later)

b a e c a d a a d a a e a b a a f a g b
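The slides do not give the answer to Solution 3. One well-known algorithm that fits the description (O(1) memory, 2 passes) is the Boyer–Moore majority vote, sketched below. It is offered as an assumption about what was intended, not as the author's posted answer. The helper names are hypothetical.

```python
def majority_candidate(stream):
    """Pass 1: pair off differing elements; a true majority always survives."""
    candidate, count = None, 0
    for x in stream:
        if count == 0:
            candidate, count = x, 1
        elif x == candidate:
            count += 1
        else:
            count -= 1
    return candidate


def majority(make_stream):
    """make_stream() must return a fresh iterator over the data for each pass."""
    candidate = majority_candidate(make_stream())
    total = occurrences = 0
    for x in make_stream():  # Pass 2: verify the candidate by counting it
        total += 1
        occurrences += (x == candidate)
    return candidate if 2 * occurrences > total else None


chars = "b a e c a d a a d a a e a b a a f a g b".split()
print(majority(lambda: iter(chars)))
```

On the example string from the slide, 'a' appears exactly 10 times out of 20, which is not strictly more than 50%, so the function returns None. The second pass is what catches this case.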

Page 16: Dealing with MASSIVE Data

External Memory Model [AV88]

N = # of items in the problem instance

B = # of items per disk block

M = # of items that fit in main memory

I/O: Move block between memory and disk

Performance measure: # of I/Os performed by algorithm

We assume (for convenience) that $M > B^2$

[Figure: disk (D), main memory (M), and processor (P), connected by block I/O]

Page 17: Dealing with MASSIVE Data


Sorting in External Memory

• Break all N elements into N/M chunks of size M each

• Sort each chunk individually in memory

• Merge them together

• Can merge <M/B sorted lists (queues) at once

M/B blocks in main memory

Page 18: Dealing with MASSIVE Data

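Sorting in External Memory

• Merge sort:

– Create N/M memory-sized sorted lists

– Repeatedly merge lists together, Θ(M/B) at a time

– Number of lists after each phase: $\frac{N}{M} \to \frac{N/M}{M/B} \to \frac{N/M}{(M/B)^2} \to \cdots \to 1$

⇒ $O(\log_{M/B} \frac{N}{M})$ phases using $O(\frac{N}{B})$ I/Os each ⇒ $O(\frac{N}{B} \log_{M/B} \frac{N}{B})$ I/Os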
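The merge sort above can be sketched as a small simulation. This is a toy under stated assumptions: Python lists stand in for disk files, M and B are counted in items, and the function name is hypothetical. It shows the run-formation step and the repeated multi-way merges, and it counts the merge phases.

```python
import heapq


def external_merge_sort(items, M, B):
    """Simulated external merge sort; in-memory lists stand in for disk files."""
    # Run formation: cut the input into N/M chunks and sort each in "memory".
    runs = [sorted(items[i:i + M]) for i in range(0, len(items), M)]
    # Memory holds M/B blocks: one buffer per input run plus one for output.
    fan_in = max(2, M // B - 1)
    phases = 0
    while len(runs) > 1:
        # Each phase merges groups of up to fan_in runs into longer runs.
        runs = [list(heapq.merge(*runs[i:i + fan_in]))
                for i in range(0, len(runs), fan_in)]
        phases += 1
    return (runs[0] if runs else []), phases


data = list(range(1000, 0, -1))
result, phases = external_merge_sort(data, M=50, B=5)
print(result[:5], phases)
```

With N = 1000, M = 50, B = 5 there are 20 initial runs and a fan-in of 9, so two merge phases suffice (20 → 3 → 1), in line with the $\log_{M/B} \frac{N}{M}$ phase count.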

Page 19: Dealing with MASSIVE Data

External Searching: B-Tree

• Each node (except root) has fan-out between B/2 and B

• Size: O(N/B) blocks on disk

• Search: $O(\log_B N)$ I/Os following a root-to-leaf path

• Insertion and deletion: $O(\log_B N)$ I/Os
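To get a feel for the $\log_B N$ search bound, the following sketch counts tree levels with integer arithmetic. The values N = 10^9 and B = 1000 are assumptions chosen for illustration, and `levels` is a hypothetical helper.

```python
def levels(n, fanout):
    """Smallest h with fanout**h >= n, using integers only (no float logs)."""
    h, reach = 0, 1
    while reach < n:
        reach *= fanout
        h += 1
    return h


N = 10**9
print(levels(N, 2))     # binary search tree: 30 levels
print(levels(N, 1000))  # B-tree with full fan-out B = 1000: 3 levels
print(levels(N, 500))   # B-tree at minimum fan-out B/2 = 500: 4 levels
```

If every level costs one I/O, a search touches 3 or 4 blocks instead of about 30.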

Page 20: Dealing with MASSIVE Data

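Fundamental Bounds (internal vs. external memory)

• Scanning: internal $N$; external $\frac{N}{B}$

• Sorting: internal $N \log N$; external $\frac{N}{B} \log_{M/B} \frac{N}{B}$

• Searching: internal $\log_2 N$; external $\log_B N$

More Results

• List ranking: internal $N$; external $\frac{N}{B} \log_{M/B} \frac{N}{B}$

• Minimal spanning tree: internal $N \log N$; external $\frac{N}{B} \log_{M/B} \frac{N}{B} \cdot \log \log B$

• Offline union-find: internal $N$; external $\frac{N}{B} \log_{M/B} \frac{N}{B}$

• Interval searching: internal $\log N + T$; external $\log_B N + \frac{T}{B}$

• Rectangle enclosure: internal $\log N + T$; external $\log N + \frac{T}{B}$

• R-tree search: internal $\sqrt{N} + T$; external $\sqrt{N/B} + \frac{T}{B}$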

Page 21: Dealing with MASSIVE Data

Does All the Theory Matter?

• Programs developed in the RAM model still run even when there is not enough memory

– They run on large datasets because the OS moves blocks as needed

• OS utilizes paging and prefetching strategies

– But if the program makes scattered accesses, even a good OS cannot take advantage of block access

Thrashing!

[Figure: running time vs. data size]

Page 22: Dealing with MASSIVE Data

Toy Experiment: Permuting

• Problem:

– Input: N elements out of order: 6, 7, 1, 3, 2, 5, 10, 9, 4, 8

* Each element knows its correct position

– Output: Store them on disk in the right order

• Internal memory solution:

– Just scan the original sequence and move every element to the right place!

– O(N) time, O(N) I/Os

• External memory solution:

– Use sorting

– O(N log N) time, $O(\frac{N}{B} \log_{M/B} \frac{N}{B})$ I/Os
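A rough numeric comparison of the two I/O counts for permuting, ignoring constant factors. The parameter values (N = 10^9, M = 10^6, B = 10^3 items) are assumptions chosen for illustration, and the function names are hypothetical.

```python
import math


def naive_permute_ios(N):
    """Internal-memory approach: up to one random block access per element."""
    return N


def sort_ios(N, M, B):
    """Sorting bound: (N/B) * log_{M/B}(N/B), constants ignored."""
    n_blocks = N / B
    return n_blocks * math.log(n_blocks, M / B)


N, M, B = 10**9, 10**6, 10**3
naive = naive_permute_ios(N)
by_sorting = sort_ios(N, M, B)
print(f"naive: {naive:.3g} I/Os, sorting: {by_sorting:.3g} I/Os")
print(f"ratio: {naive / by_sorting:.0f}x")
```

With these numbers the sorting approach needs about 2 × 10^6 I/Os against 10^9, even though it does more CPU work. At an assumed 10 ms per I/O that is hours instead of months.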

Page 23: Dealing with MASSIVE Data

A Practical Example on Real Data

• Computing persistence on large terrain data

Page 24: Dealing with MASSIVE Data

Takeaways

• Need to be very careful when your program’s space usage exceeds physical memory size

• If program mostly makes highly localized accesses

– Let the OS handle it automatically

• If program makes many non-localized accesses

– Need I/O-efficient techniques

• Three common techniques (recall the majority counting puzzle):

– Convert to sort + scan

– Divide-and-conquer

– Other tricks

Page 25: Dealing with MASSIVE Data

Want to know more about I/O-efficient algorithms?

A course on I/O-efficient algorithms is offered as CIS5930 (Advanced Topics in Data Management)

Page 26: Dealing with MASSIVE Data

Computation Model for Massive Data (2): Streaming Model

You get to look at each element only once!

Cannot / don’t want to / can’t wait to store data and do further processing

Page 27: Dealing with MASSIVE Data

Streaming Algorithms: Applications

[Figure: DSL/cable networks, enterprise networks, the PSTN, and peer networks feed a Network Operations Center (NOC); a back-end data warehouse DBMS (Oracle, DB2) supports off-line analysis, which is slow and expensive]

Example queries at the NOC:

• What are the top (most frequent) 1000 (source, dest) pairs seen over the last month?

• SQL join query:

SELECT COUNT (R1.source, R2.dest)
FROM R1, R2
WHERE R1.dest = R2.source

• Set-expression query: How many distinct (source, dest) pairs have been seen?

Other applications:

• Sensor networks

• Network security

• Financial applications

• Web logs and clickstreams

Page 28: Dealing with MASSIVE Data


Puzzle #2: Find Missing Card

• How to find the missing tile by making one pass over everything?

– Assuming you can’t memorize everything (of course)

• Assign a number to each type of tile (the slide shows three example tiles numbered 8, 14, and 22)

• Compute the sum of all remaining tiles

– (1+…+9+11+…+19+21+…+29)*4 – sum = missing tile!

Mahjong tile
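The one-pass trick above can be sketched as follows. The tile numbering (1-9, 11-19, 21-29, four copies of each) is taken from the formula on the slide. The shuffling and the function name are assumptions for illustration.

```python
import random

# Numbering from the slide's formula: 1-9, 11-19, 21-29, four copies of each.
tile_values = list(range(1, 10)) + list(range(11, 20)) + list(range(21, 30))
full_set = tile_values * 4
expected_total = sum(full_set)


def find_missing(tile_stream):
    """One pass, one running sum: O(1) memory regardless of tile order."""
    running_sum = 0
    for tile in tile_stream:
        running_sum += tile
    return expected_total - running_sum


tiles = full_set.copy()
random.shuffle(tiles)
removed = tiles.pop()  # take one tile away
print(find_missing(iter(tiles)), removed)
```

The only state kept while scanning is a single integer, which is what makes this a streaming algorithm.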

Page 29: Dealing with MASSIVE Data

A Research Problem: Count # Distinct Elements

• Unfortunately, there is a lower bound saying you can’t do this exactly without using Ω(n) memory

• But if we allow some error, then we can approximate it well

b a e c a d a a d a a e a b a a f a g b

# distinct elements = 7

Page 30: Dealing with MASSIVE Data

Solution: FM Sketch [FM85, AMS99]

• Take a (pseudo) random hash function $h : \{1,\dots,n\} \to \{1,\dots,2^d\}$, where $2^d > n$

• For each incoming element x, compute h(x)

– e.g., h(5) = 10101100010000

– Count how many trailing zeros

– Remember the maximum number of trailing zeroes in any h(x)

• Let Y be the maximum number of trailing zeroes

– Can show $E[2^Y]$ = # distinct elements

* 2 elements, “on average” there is one h(x) with 1 trailing zero

* 4 elements, “on average” there is one h(x) with 2 trailing zeroes

* 8 elements, “on average” there is one h(x) with 3 trailing zeroes

* …
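A minimal sketch of the single-copy estimator described above. The hash function is an assumption: a 32-bit value taken from BLAKE2b stands in for the pseudo-random h, and the function names are hypothetical. A single copy only ever returns a power of two, so it is a rough estimate at best.

```python
import hashlib


def trailing_zeros(v, width=32):
    """Number of trailing zero bits of v (width if v is 0)."""
    if v == 0:
        return width
    return (v & -v).bit_length() - 1


def h(x, seed=0):
    """Stand-in for the pseudo-random hash: 32 bits derived from BLAKE2b."""
    digest = hashlib.blake2b(repr((seed, x)).encode(), digest_size=4).digest()
    return int.from_bytes(digest, "big")


def fm_estimate(stream, seed=0):
    """One pass, one small integer of state: Y = max trailing zeros seen."""
    Y = 0
    for x in stream:
        Y = max(Y, trailing_zeros(h(x, seed)))
    return 2 ** Y


chars = "b a e c a d a a d a a e a b a a f a g b".split()
print(fm_estimate(chars))
```

Duplicates hash to the same value, so repeats never change Y. This is why the sketch counts distinct elements rather than stream length.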

Page 31: Dealing with MASSIVE Data

Counting Paintballs

• Imagine the following scenario:

– A bag of n paintballs is emptied at the top of a long staircase.

– At each step, each paintball either bursts and marks the step, or bounces to the next step. 50/50 chance either way.

Looking only at the pattern of marked steps, what was n?

Page 32: Dealing with MASSIVE Data

Counting Paintballs (cont)

• What does the distribution of paintball bursts look like?

– The number of bursts at each step follows a binomial distribution.

– The expected number of bursts drops geometrically.

– Few bursts after $\log_2 n$ steps

[Figure: staircase; bursts at the 1st step ~ $B(n, 1/2)$, at the 2nd step ~ $B(n, 1/4)$, at the Y-th step ~ $B(n, 1/2^Y)$]
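The paintball experiment is easy to simulate. This is a toy sketch with an assumed n = 1024 and a fixed seed. The function name is hypothetical.

```python
import random


def drop_paintballs(n, rng):
    """Return bursts, where bursts[i] is the number of bursts on step i + 1."""
    bursts = []
    remaining = n
    while remaining > 0:
        # Each remaining paintball bursts on this step with probability 1/2.
        burst_here = sum(rng.random() < 0.5 for _ in range(remaining))
        bursts.append(burst_here)
        remaining -= burst_here
    return bursts


rng = random.Random(1)
bursts = drop_paintballs(1024, rng)
last_marked = max(i + 1 for i, b in enumerate(bursts) if b > 0)
print(bursts)
print("last marked step:", last_marked)
```

The burst counts roughly halve from one step to the next, and the last marked step typically lands near log2(1024) = 10. That is the quantity the FM sketch records as Y.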

Page 33: Dealing with MASSIVE Data

Solution: FM Sketch [FM85, AMS99]

• So $2^Y$ is an unbiased estimator for # distinct elements

• However, it has a large variance

– Use $O(\frac{1}{\varepsilon^2} \log \frac{1}{\delta})$ copies to guarantee a good estimator that, with probability 1–δ, is within relative error ε

• Applications:

– How many distinct IP addresses used a given link to send their traffic from the beginning of the day?

– How many new IP addresses appeared today that didn’t appear before?

Page 34: Dealing with MASSIVE Data

Finding Heavy Hitters

• Which elements appeared in the stream more than 10% of the time?

• Applications:

– Networking

* Finding IP addresses sending most traffic

– Databases

* Iceberg queries

– Data mining

* Finding “hot” items (item sets) in transaction data

• Solution

– Exact solution is difficult

– If we allow an approximation error of ε

* Use O(1/ε) space and O(1) time per element in stream
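The slide does not say which algorithm achieves this bound. One well-known option is the Misra–Gries summary, sketched below as an illustration and not as the author's method. With k - 1 counters (k ≈ 1/ε), every element occurring more than n/k times is guaranteed to survive, and each reported count is too low by at most n/k. The update cost is O(1) amortized.

```python
def misra_gries(stream, k):
    """Keep at most k - 1 counters; any item with frequency > n/k survives."""
    counters = {}
    for x in stream:
        if x in counters:
            counters[x] += 1
        elif len(counters) < k - 1:
            counters[x] = 1
        else:
            # No free counter: decrement all of them and drop the new item.
            for key in list(counters):
                counters[key] -= 1
                if counters[key] == 0:
                    del counters[key]
    return counters


chars = "b a e c a d a a d a a e a b a a f a g b".split()
summary = misra_gries(chars, k=4)
print(summary)
```

The survivors are only candidates. A second pass, or accepting the ε error, is needed to decide which of them are true heavy hitters.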

Page 35: Dealing with MASSIVE Data

Streaming in a Distributed World

• Large-scale querying/monitoring: Inherently distributed!

– Streams physically distributed across remote sites, e.g., stream of UDP packets through a subset of edge routers

• Challenge is “holistic” querying/monitoring

– Queries over the union of distributed streams Q(S1 ∪ S2 ∪ …)

– Streaming data is spread throughout the network
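As one illustration of a query over a union of streams, the FM-style distinct-count sketch from earlier merges cheaply: if every site uses the same hash function, the sketch of S1 ∪ S2 ∪ … is the maximum of the per-site values, so each site ships one small integer and no raw data. The hash choice and function names below are assumptions, as are the sample addresses.

```python
import hashlib


def zeros(x):
    """Trailing zero bits of a 32-bit hash shared by all sites (assumed hash)."""
    digest = hashlib.blake2b(repr(x).encode(), digest_size=4).digest()
    v = int.from_bytes(digest, "big")
    return 32 if v == 0 else (v & -v).bit_length() - 1


def site_sketch(stream):
    """Computed locally at each remote site in one pass."""
    return max((zeros(x) for x in stream), default=0)


def merge(sketches):
    """Computed at the query site from the per-site sketches alone."""
    return max(sketches, default=0)


S1 = ["10.0.0.1", "10.0.0.2", "10.0.0.3"]
S2 = ["10.0.0.3", "10.0.0.4"]
merged = merge([site_sketch(S1), site_sketch(S2)])
print(2 ** merged)  # rough estimate of distinct elements in S1 ∪ S2
```

Elements seen at several sites are not double counted, because a duplicate hashes to the same value everywhere.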

[Figure: remote sites S1, ..., S6 each observe a bit stream; a query site at the Network Operations Center (NOC) evaluates the query Q(S1 ∪ S2 ∪ …)]

Page 36: Dealing with MASSIVE Data


Streaming in a Distributed World

• Need timely, accurate, and efficient query answers

• Additional complexity over centralized data streaming!

• Need space/time- and communication-efficient solutions

– Minimize network overhead

– Maximize network lifetime (e.g., sensor battery life)

– Cannot afford to “centralize” all streaming data


Page 37: Dealing with MASSIVE Data

Want to know more about streaming algorithms?

A graduate-level course on streaming algorithms will be approximately offered in the next next next semester with an error guarantee of 5%!

Or, talk to me tomorrow!

Page 38: Dealing with MASSIVE Data

Top-k Queries

• Extremely useful in information retrieval

– top-k sellers, popular movies, etc.

– Google

Input:

tuple   score
t1      65
t2      30
t3      100
t4      80
t5      87

Sorted by score:

tuple   score
t3      100
t5      87
t4      80
t1      65
t2      30

top-2 = {t3, t5}

Related work: Threshold Algorithm, RankSQL
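For a single score list held in memory, the top-k answer from the table can be reproduced in a few lines. This is only a baseline sketch using a heap. It is not the Threshold Algorithm or RankSQL named on the slide.

```python
import heapq

# Scores from the slide's table.
scores = {"t1": 65, "t2": 30, "t3": 100, "t4": 80, "t5": 87}

# nlargest keeps a heap of size k, so memory stays O(k) for any input length.
top2 = heapq.nlargest(2, scores, key=scores.get)
print(top2)  # ['t3', 't5']
```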

Page 39: Dealing with MASSIVE Data

Top-k Queries on Uncertain Data

tuple   score   confidence
t3      100     0.2
t5      87      0.8
t4      80      0.9
t1      65      0.5
t2      30      0.6

Examples of (score, confidence) pairs: (sensor reading, reliability), (page rank, how well the page matches the query)

The top-k answer depends on the interplay between score and confidence

Page 40: Dealing with MASSIVE Data

Top-k Definition: U-Topk

The k tuples with the maximum probability of being the top-k

tuple   score   confidence
t3      100     0.2
t5      87      0.8
t4      80      0.9
t1      65      0.5
t2      30      0.6

{t3, t5}: 0.2*0.8 = 0.16

{t3, t4}: 0.2*(1-0.8)*0.9 = 0.036

{t5, t4}: (1-0.2)*0.8*0.9 = 0.576

...

Potential problem: top-k could be very different from top-(k+1)
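The probabilities above can be checked by brute force: enumerate every possible world, assuming tuples exist independently (the basic model), and add up the probability of each top-k set. This is exponential in the number of tuples, so it is only a checking tool for tiny inputs. The function name is hypothetical.

```python
from itertools import product

# (name, score, confidence) from the slide's table.
tuples = [("t3", 100, 0.2), ("t5", 87, 0.8), ("t4", 80, 0.9),
          ("t1", 65, 0.5), ("t2", 30, 0.6)]


def u_topk(tuples, k):
    """Return the most probable top-k set and all top-k set probabilities."""
    answer_prob = {}
    for world in product([True, False], repeat=len(tuples)):
        p = 1.0
        present = []
        for (name, score, conf), exists in zip(tuples, world):
            p *= conf if exists else 1.0 - conf
            if exists:
                present.append((score, name))
        top = tuple(name for _, name in sorted(present, reverse=True)[:k])
        if len(top) == k:  # worlds with fewer than k tuples have no top-k
            answer_prob[top] = answer_prob.get(top, 0.0) + p
    best = max(answer_prob, key=answer_prob.get)
    return best, answer_prob


best, answer_prob = u_topk(tuples, 2)
print(best, round(answer_prob[best], 3))
```

This reproduces the three values on the slide and selects {t5, t4}, even though t3 has the highest score.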

Page 41: Dealing with MASSIVE Data

Top-k Definition: U-kRanks

The i-th tuple is the one with the maximum probability of being at rank i, i=1,...,k

tuple   score   confidence
t3      100     0.2
t5      87      0.8
t4      80      0.9
t1      65      0.5
t2      30      0.6

Rank 1:

t3: 0.2

t5: (1-0.2)*0.8 = 0.64

t4: (1-0.2)*(1-0.8)*0.9 = 0.144

...

Rank 2:

t3: 0

t5: 0.2*0.8 = 0.16

t4: 0.9*(0.2*(1-0.8)+(1-0.2)*0.8) = 0.612

Potential problem: duplicated tuples in top-k
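The rank probabilities can be verified the same brute-force way, again assuming independent tuples. For each possible world, the tuple at rank i receives that world's probability. The function name is hypothetical, and the approach is exponential, so it suits only small examples.

```python
from itertools import product

# (name, score, confidence) from the slide's table.
tuples = [("t3", 100, 0.2), ("t5", 87, 0.8), ("t4", 80, 0.9),
          ("t1", 65, 0.5), ("t2", 30, 0.6)]


def rank_probabilities(tuples, k):
    """prob[i][name] = probability that `name` is at rank i + 1."""
    prob = [dict.fromkeys((t[0] for t in tuples), 0.0) for _ in range(k)]
    for world in product([True, False], repeat=len(tuples)):
        p = 1.0
        present = []
        for (name, score, conf), exists in zip(tuples, world):
            p *= conf if exists else 1.0 - conf
            if exists:
                present.append((score, name))
        ranked = sorted(present, reverse=True)[:k]
        for rank, (_, name) in enumerate(ranked):
            prob[rank][name] += p
    return prob


prob = rank_probabilities(tuples, 2)
u_kranks = [max(p, key=p.get) for p in prob]
print(u_kranks)  # ['t5', 't4']
```

On this table the two ranks pick different tuples. The duplication problem noted on the slide arises when the same tuple is the most probable at more than one rank.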

Page 42: Dealing with MASSIVE Data

Uncertain Data Models

• An uncertain data model represents a probability distribution of database instances (possible worlds)

• Basic model: mutual independence among all tuples

• Complete models: able to represent any distribution of possible worlds

– Atomic independent random Boolean variables

– Each tuple corresponds to a Boolean formula, appears iff the formula evaluates to true

– Exponential complexity

Page 43: Dealing with MASSIVE Data

Uncertain Data Model: x-relations

Each x-tuple represents a discrete probability distribution of tuples

x-tuples are mutually independent, and disjoint

U-Top2: {t1,t2}

U-2Ranks: (t1, t3)

single-alternative

multi-alternative

Page 44: Dealing with MASSIVE Data

Want to know more about uncertainty data management?

A graduate-level course on uncertainty data management will be (likely probably) offered

in the next next next next next semester

Or, talk to me tomorrow!

Page 45: Dealing with MASSIVE Data

Recap

• External memory model

– Main memory is fast but limited

– External memory slow but unlimited

– Aim to optimize I/O performance

• Streaming model

– Main memory is fast but small

– Can’t store, not willing to store, or can’t wait to store data

– Compute the desired answers in one pass

• Probabilistic data model

– Can’t store or query the exponentially many possible worlds

– Compute the desired answers in the succinct representation of the probabilistic data (efficiently!! Possibly allow some errors)

Page 46: Dealing with MASSIVE Data

Thanks!

Questions?