
Page 1:

The Challenges Ahead for Visualizing and Analyzing Massive Data Sets

Hank Childs
Lawrence Berkeley National Laboratory
February 26, 2010

27B element Rayleigh-Taylor Instability (MIRANDA, BG/L)
2 trillion element mesh
2 billion element Thermal hydraulics (Nek5000, BG/P)

Page 2:

Overview of This Mini-Symposium

• Peterka: we can visualize the results on the supercomputer itself

• Bremer: we can understand and gain insight from these massive data sets

• Childs: visualization and analysis will be a crucial problem on the next generation of supercomputers

• Pugmire: we can make our algorithms work at massive scale

Page 3:

How does the {peta-, exa-} scale affect visualization?

• Large # of time steps
• Large ensembles
• High-res meshes
• Large # of variables

Page 4:

The soon-to-be “good ole days” … how visualization is done right now*

[Diagram: a parallel simulation code writes pieces of data (P0-P9) to disk; visualization processors 0-2 each run a Read → Process → Render pipeline over those pieces, forming a parallelized visualization data flow network.]

* = Your mileage may vary:
• Are you running the full machine?
• How much data do you output?

Page 5:

Pure parallelism performance is based on # bytes to process and I/O rates.

Vis is almost always >50% I/O and sometimes 98% I/O

Amount of data to visualize is typically O(total mem)

Relative I/O (ratio of total memory and I/O) is key

[Chart: relative FLOPs, memory, and I/O for a terascale machine vs. a "petascale machine".]

Page 6:

Anecdotal evidence: relative I/O is getting slower.

Machine name        Main memory   I/O rate   Time to write memory to disk
ASC Purple          49.0 TB       140 GB/s   5.8 min
BGL-init            32.0 TB       24 GB/s    22.2 min
BGL-cur             69.0 TB       30 GB/s    38.3 min
Petascale machine   ??            ??         >40 min
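The last column is just main memory divided by I/O rate. A quick check of the arithmetic, using the figures from the table:

```python
# Time to write all of main memory to disk = main memory / I/O rate.
machines = {
    "ASC Purple": (49.0e12, 140e9),  # (memory in bytes, I/O rate in bytes/s)
    "BGL-init":   (32.0e12, 24e9),
    "BGL-cur":    (69.0e12, 30e9),
}
for name, (memory, rate) in machines.items():
    print(f"{name}: {memory / rate / 60:.1f} min")
# ASC Purple: 5.8 min, BGL-init: 22.2 min, BGL-cur: 38.3 min
```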

Page 7:

Why is relative I/O getting slower?

• "I/O doesn't pay the bills," and I/O is becoming a dominant cost in the overall supercomputer procurement.
• Simulation codes aren't as exposed.

Page 8:

Recent runs of trillion cell data sets provide further evidence that I/O dominates

• Weak scaling study: ~62.5M cells/core

Machine     Type        Problem size   # cores
Purple      AIX         0.5 TZ         8K
Ranger      Sun Linux   1 TZ           16K
Juno        Linux       1 TZ           16K
JaguarPF    Cray XT5    2 TZ           32K
Dawn        BG/P        4 TZ           64K
Franklin    Cray XT4    1 TZ, 2 TZ     16K, 32K

2T cells, 32K procs on Jaguar; 2T cells, 32K procs on Franklin:
- Approx I/O time: 2-5 minutes
- Approx processing time: 10 seconds
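The weak-scaling setup is easy to verify: problem size divided by core count is identical across the runs. A quick check, taking 1 TZ as 10^12 zones and the K suffix as 1024:

```python
# Weak scaling: problem size grows with core count, so cells/core stays fixed.
runs = {
    "Purple":   (0.5e12, 8 * 1024),
    "Ranger":   (1e12, 16 * 1024),
    "JaguarPF": (2e12, 32 * 1024),
    "Dawn":     (4e12, 64 * 1024),
}
for name, (cells, cores) in runs.items():
    print(f"{name}: {cells / cores / 1e6:.1f}M cells/core")
# Every run lands at ~61M cells/core, i.e. the stated ~62.5M within rounding.
```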

Page 9:

Visualization works because it uses the brain’s highly effective visual processing system.

Trillions of data points

Millions of pixels

But is this still a good idea at the peta-/exascale?

• (Note that visualization is often reducing the data … so we are frequently *not* trying to render all of the data points.)

Page 10:

Visualization works because it uses the brain’s highly effective visual processing system.

Trillions of data points

One idea: add more pixels!

35M pixel powerwall
• Bonus: big displays act as collaboration centers.

Page 11:

Visualization works because it uses the brain’s highly effective visual processing system.

Trillions of data points

One idea: add more pixels!

35M pixel powerwall

Source: Sawant & Healey, NC State

Visual acuity of the human eye is <30M pixels!!

Page 12:

Summary: what are the challenges?

• Scale: We can't read all of the data at full resolution any more. What can we do?
• Insight: There is a lot more data than pixels. How are we going to understand it?

Page 13:

How can we deal with so many cells per pixel?

• What should the color of this pixel be?
  - "Random" among the 9 colors?
  - An average value of the 9 colors? (brown)
  - The color of the minimum value?
  - The color of the maximum value?

• We need infrastructure to allow users to have confidence in the pictures we deliver.

A single pixel

Data insight often goes far beyond pictures (see Bremer talk)
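The candidate strategies above can be sketched as a tiny reduction function (illustrative only; the function and colormap names are mine, not from any vis tool):

```python
import random

def pixel_color(cell_values, colormap, strategy):
    """Reduce the values of many cells covered by one pixel to one color."""
    if strategy == "random":    # pick any one of the covered cells
        return colormap(random.choice(cell_values))
    if strategy == "average":   # may yield a color (e.g. brown) present nowhere in the data
        return colormap(sum(cell_values) / len(cell_values))
    if strategy == "minimum":
        return colormap(min(cell_values))
    if strategy == "maximum":
        return colormap(max(cell_values))
    raise ValueError(f"unknown strategy: {strategy}")

gray = lambda v: (v, v, v)  # toy colormap: scalar in [0, 1] -> gray level
cells = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]  # 9 cells behind one pixel
print(pixel_color(cells, gray, "maximum"))  # (0.9, 0.9, 0.9)
```

Each choice yields a defensible but different picture, which is why users need infrastructure that tells them which reduction they are looking at.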

Page 14:

Multi-resolution techniques use coarse representations then refine.

[Diagram: the same parallelized read/process/render pipeline as before, but the visualization processors initially read coarse representations of the pieces and then refine selected pieces (e.g., P2 and P4) at higher resolution.]

Page 15:

Multi-resolution: pros and cons

• Summary:
  - "Dive" into the data
  - Enough diving results in the original data
• Pros:
  - Avoids I/O & memory requirements
  - Confidence in pictures; a multi-res hierarchy addresses the "many cells to one pixel" issue
• Cons:
  - Is it meaningful to process a simplified version of the data?
  - How do we generate hierarchical representations? What costs do they incur?
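As a concrete (hypothetical) illustration of the "dive" idea: build coarsened levels by block averaging, read the coarsest level for an overview, and fetch full resolution only for a small window. All names here are mine; in practice the hierarchy lives on disk, which is where the generation cost noted in the cons comes from.

```python
import numpy as np

def build_hierarchy(data, levels=3):
    """Coarsen a 2D field by 2x2 block averaging; index 0 is the coarsest level."""
    pyramid = [data]
    for _ in range(levels - 1):
        d = pyramid[-1]
        pyramid.append(d.reshape(d.shape[0] // 2, 2, d.shape[1] // 2, 2).mean(axis=(1, 3)))
    return pyramid[::-1]  # coarsest first

def dive(pyramid, level, region):
    """Fetch only the requested window at the requested refinement level."""
    r0, r1, c0, c1 = region
    return pyramid[level][r0:r1, c0:c1]

field = np.random.rand(64, 64)           # stand-in for one piece of a large mesh
pyramid = build_hierarchy(field)         # shapes: 16x16, 32x32, 64x64
overview = pyramid[0]                    # whole domain at 1/16 of the data
window = dive(pyramid, 2, (0, 8, 0, 8))  # original resolution, but only an 8x8 window
```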

Page 16:

In situ processing does visualization as part of the simulation.

[Diagram: transition from the pure-parallelism pipeline (pieces of data on disk, read/process/render on separate processors) toward processing inside the parallel simulation code itself.]

Page 17:

In situ processing does visualization as part of the simulation.

[Diagram: each of processors 0-9 runs GetAccessToData → Process → Render inside the parallel simulation code; the parallelized visualization data flow network operates on in-memory data, with no disk in the loop.]

Page 18:

In situ: pros and cons

• Pros:
  - No I/O!
  - Lots of compute power available
• Cons:
  - Very memory constrained
  - Many operations not possible
  - Once the simulation has advanced, you cannot go back and analyze it
  - User must know what to look for a priori
  - Expensive resource to hold hostage!
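In code, in situ coupling amounts to the simulation's time loop calling a visualization hook on its live, in-memory state. A toy sketch (all names are mine, not a real coupling API):

```python
def simulate(steps, vis_hook=None, vis_stride=10):
    """Toy time loop. In situ: every vis_stride steps, hand the live state to
    the visualization routine directly; nothing is written to disk."""
    state = [0.0] * 8
    frames = []
    for step in range(steps):
        state = [x + 1.0 for x in state]  # stand-in for one solver step
        if vis_hook is not None and step % vis_stride == 0:
            # The hook sees simulation memory directly; whatever it does not
            # extract now is gone once the simulation advances.
            frames.append(vis_hook(step, state))
    return frames

render = lambda step, state: f"step {step}: max={max(state)}"  # stand-in renderer
print(simulate(25, vis_hook=render))
```

The memory constraint in the cons list shows up here too: the hook must do its work inside whatever memory the solver leaves free.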

Page 19:

Now we know the tools … what problem are we trying to solve?

• Three primary use cases:
  - Exploration. Examples: scientific discovery, debugging
  - Confirmation. Examples: data analysis, images / movies, comparison
  - Communication. Examples: data analysis, images / movies

Page 20:

Notional decision process

Need all data at full resolution?
- No → Multi-resolution (debugging & scientific discovery)
- Yes → Do you know what you want to do a priori?
  - Yes → In situ (data analysis & images / movies)
  - No → Pure parallelism (anything & esp. comparison)

(The diagram ties these branches to the three use cases: exploration, confirmation, and communication.)

There are also roles for more minor techniques that weren't discussed, such as streaming and data subsetting.
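The decision process above reduces to two questions; a sketch (the labels are mine, paraphrasing the slide):

```python
def choose_technique(need_full_resolution, know_goal_a_priori):
    """Encode the notional decision process for picking a vis strategy."""
    if not need_full_resolution:
        return "multi-resolution"   # debugging & scientific discovery
    if know_goal_a_priori:
        return "in situ"            # data analysis & images / movies
    return "pure parallelism"       # anything & esp. comparison

print(choose_technique(need_full_resolution=True, know_goal_a_priori=False))
# pure parallelism
```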

Page 21:

Prepare for difficult conversations in the future.

• Multi-resolution:
  - Do you understand what a multi-resolution hierarchy should look like for your data?
  - Who do you trust to generate it?
  - Are you comfortable with your I/O routines generating these hierarchies while they write?
  - How much overhead are you willing to tolerate on your dumps? 33+%?
  - Willing to accept that your visualizations are not the "real" data?

Page 22:

Prepare for difficult conversations in the future.

• In situ:
  - How much memory are you willing to give up for visualization?
  - Will you be angry if the vis algorithms crash?
  - Do you know what you want to generate a priori?
    • Can you re-run simulations if necessary?

Page 23:

Summary

• Is there a problem with massive data?
  - Yes, I/O is a major problem
  - Yes, obtaining insight is a major problem
• Why is there a problem? Whose fault is it?
  - As we scale up, some things get cheap, other things (like I/O) stay expensive
• What can we do about it?
  - Multi-res / in situ
• Will it hurt?
  - Yes.
• Can we do it?
  - Yes, see next three talks

Page 24:

• Questions?

• Hank Childs, LBL & UC Davis
• Contact info:
  - [email protected] / [email protected]
  - http://vis.lbl.gov/~hrchilds