combinatorial scientific computing: a view to the future

Sandia is a multiprogram laboratory operated by Sandia Corporation, a Lockheed Martin Company, for the United States Department of Energy’s National Nuclear Security Administration under contract DE-AC04-94AL85000.

Combinatorial Scientific Computing: A View to the Future

Bruce Hendrickson

Senior Manager for Math & Computer Science

Sandia National Laboratories, Albuquerque, NM

University of New Mexico, Computer Science Dept.

Combinatorial Scientific Computing

• The development, application and analysis of combinatorial algorithms to enable scientific and engineering computations

• Highlighted areas from a survey talk I composed in 2003

– Sparse matrices (direct & iterative methods)

– Optimization & derivatives

– Parallel computing

– Mesh generation

– Statistical physics

– Chemistry

– Biology

A Brief History of CSC

• Grew out of a series of minisymposia at SIAM meetings

– Deeper origins in

• Sparse direct methods community (1950s and onward)

• Statistical physics – graphs and Ising models (1940s & 1950s)

• Chemical classification (1800s, Cayley)

– Recognition of common esthetic, techniques and goals among researchers who were far apart in traditional scientific taxonomy

• Name selected in 2002

– After lengthy email discussion among ~30 people.

– Now, ~3000 hits for “combinatorial scientific computing” on Google.

Previous Milestones

• This is the 4th major CSC workshop

– SIAM ’04 (with Parallel Processing)

• Organizers J. Gilbert, B. Hendrickson, A. Pothen, H. Simon, S. Toledo

– CERFACS ’05

– SIAM ’07 (with Computational Science & Engineering)

– Coming soon:

• SIAM ’09 (with Applied Linear Algebra)

• SIAM ’11 (with Optimization) (?)

• Special issue of ETNA in 2004

• Importance recognized by scientific community and funding agencies

Invited Speakers from Earlier CSC Workshops

• Richard Brualdi (combinatorial matrix theory)

• Dan Gusfield (computational biology)

• Shang-Hua Teng (smoothed analysis of algorithms)

• Stan Eisenstat (sparse direct methods)

• Dan Halperin (geometric algorithms)

• Denis Trystram (parallel scheduling)

• Iain Duff (sparse direct methods)

• Phil Duxbury (statistical physics)

Outline

• A look back:

– A brief history of a brief history

• A look ahead:

– New application opportunities: data-centric computing

• Graph models of information retrieval

• Emerging science of complex networks

– Architectural revolution: challenges and promise

• Challenges of near-future machines

• Potential architectures for discrete problems

• Conclusions

Data-Centric Computing

• Many science disciplines generate huge data sets

– Biology, astronomy, high-energy physics, environmental science, social sciences (internet data), etc.

• Important scientific knowledge lurks within this data

• What abstractions and algorithms are needed?

• Claim:

– Combinatorial algorithms have an important role to play

– “Combinatorial problems generated by challenges in data mining and related topics are now central to computational science.”

• [I. Beichl & F. Sullivan, 2008]

Example 1: Information Retrieval

• Consider a document corpus

– Each document is a “bag of words”

– Represent as non-negative term/document matrix

– A(i,j) encodes frequency of term i in document j

– A set of terms in a query can be thought of as a vector q

– Large entries in $A^T q$ identify good matches for retrieval

– A has t rows (terms) and d columns (documents)
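The retrieval step above can be sketched in a few lines. This is a minimal illustration, assuming raw term counts as the matrix entries and whitespace tokenization; the function names `term_document_matrix` and `query_scores` and the three sample documents are invented for the example.

```python
from collections import Counter

def term_document_matrix(docs):
    """Return (terms, A) where A[i][j] is the count of term i in document j."""
    bags = [Counter(doc.lower().split()) for doc in docs]   # one bag of words per document
    terms = sorted(set().union(*bags))                      # row labels of A
    A = [[bag[t] for bag in bags] for t in terms]
    return terms, A

def query_scores(terms, A, query):
    """Compute A^T q: one score per document for a bag-of-words query."""
    q_bag = Counter(query.lower().split())
    q = [q_bag[t] for t in terms]                           # query as a term vector
    n_docs = len(A[0]) if A else 0
    return [sum(A[i][j] * q[i] for i in range(len(terms))) for j in range(n_docs)]

docs = [
    "sparse matrix ordering for direct methods",
    "graph partitioning for parallel computing",
    "sparse graph algorithms",
]
terms, A = term_document_matrix(docs)
scores = query_scores(terms, A, "sparse graph")
best = max(range(len(docs)), key=scores.__getitem__)       # index of the best match
```

Practical systems usually weight the counts (for example by inverse document frequency) and store A in a sparse format; both are omitted here.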

Latent Semantic Analysis

• LSA uses truncated SVD for dimension reduction

– $A \approx U_k \Sigma_k V_k^T$

• Retrieval query now becomes

– $A^T q \approx V_k \Sigma_k U_k^T q$

• Widely used idea to reduce noise and reduce query expense

– [Deerwester, et al., 1990]

• Basic idea has many applications

– Image recognition, machine translation, pattern recognition, etc.
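The rank-k query can be written out step by step. This is a sketch under the usual conventions, which the slide does not spell out: A is t × d, the factors have the dimensions shown, and $\Sigma_k$ is diagonal.

```latex
\begin{aligned}
A &\approx U_k \Sigma_k V_k^T,
  \qquad U_k \in \mathbb{R}^{t \times k},\quad
         \Sigma_k \in \mathbb{R}^{k \times k},\quad
         V_k \in \mathbb{R}^{d \times k}, \\
A^T q &\approx \left( U_k \Sigma_k V_k^T \right)^T q
       = V_k \Sigma_k^T U_k^T q
       = V_k \Bigl( \Sigma_k \bigl( U_k^T q \bigr) \Bigr).
\end{aligned}
```

Evaluated right to left, the query is first projected into k dimensions, then scaled, then expanded to one score per document. With dense factors this costs roughly k(t + d) multiply-adds.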

Graph Based Alternative

• View the term-document matrix as a bipartite graph

– Terms and documents have weighted links if they are related

• Embed the graph in a low dimensional space using (for example) Laplacian eigenvectors

• Given a query vector, map it to same space and look for nearby documents

– Fiedler retrieval [H., 2007]

• Algebraically, this involves low eigenvectors of the matrix $L = \begin{pmatrix} D_1 & -A \\ -A^T & D_2 \end{pmatrix}$

• Note that LSA involves low eigenvectors of $\begin{pmatrix} 0 & -A \\ -A^T & 0 \end{pmatrix}$
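The Laplacian of the term/document bipartite graph can be assembled directly from A. This is a minimal sketch, assuming the standard graph Laplacian (weighted degrees on the diagonal, negated edge weights off it) and a small dense matrix; the helper name `bipartite_laplacian` is invented for the example.

```python
def bipartite_laplacian(A):
    """Laplacian [[D1, -A], [-A^T, D2]] of the term/document bipartite graph.

    A is a t-by-d list of lists of non-negative weights.
    """
    t, d = len(A), len(A[0])
    n = t + d                         # terms first, then documents
    L = [[0.0] * n for _ in range(n)]
    for i in range(t):
        for j in range(d):
            w = A[i][j]
            L[i][t + j] = -w          # -A block
            L[t + j][i] = -w          # -A^T block
            L[i][i] += w              # D1: weighted degree of term i
            L[t + j][t + j] += w      # D2: weighted degree of document j
    return L

A = [[1, 0, 2],
     [0, 3, 1]]
L = bipartite_laplacian(A)
row_sums = [sum(row) for row in L]    # all zero: the constant vector is in the null space
```

Because every row sums to zero, the lowest eigenvector is constant and carries no information; an embedding would use the next few eigenvectors, computed with a sparse eigensolver for a corpus of realistic size.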

Advantages of Graph Representation

• Terms & documents live in same space

– Principled method for adding doc-doc or term-term similarities

• E.g. term-term from a dictionary, doc-doc from citation analysis or hyperlinks

• Unified text and link analysis

• Supports more complex queries

– “similar to these documents and these terms”

• Supports extensions to more classes of objects

– E.g., instead of just term-document, could do term-document-author.

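$L = \begin{pmatrix} G_1 & -A \\ -A^T & G_2 \end{pmatrix}$

$L = \begin{pmatrix} D_1 & -A & -C \\ -A^T & D_2 & -B \\ -C^T & -B^T & D_3 \end{pmatrix}$, with blocks for terms (t), documents (d) and authors (a)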

Example II: Network Science

• Graphs are ideal for representing entities and relationships

• Rapidly growing use in social, environmental, and other sciences

The way it was … Zachary’s karate club (|V|=34)

The way it is now … Twitter social network (|V|≈20K)

New Questions

• New algorithms

– Community detection, centrality, graph generation, etc.

– Right set of questions and concepts still emerging.

• New issues

– Noisy, error-filled data. What can we conclude robustly?

– Semantic graphs with edges and vertices of different types.

• E.g. people, organizations, events

• How should this be exploited algorithmically?

– Multilinear instead of linear algebra?

• New paradigms:

– E.g. graph evolves over time

– Temporal analysis, dynamics, streaming algorithms on graphs, etc.

• Enormous opportunities for combinatorial algorithms
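As one small illustration of the streaming setting mentioned above, degree centrality can be maintained as edges arrive or disappear, without storing the graph. This is a sketch only: the class name `StreamingDegree` and the sample events are invented, and degree is the simplest of many centrality measures.

```python
from collections import defaultdict

class StreamingDegree:
    """Maintain vertex degrees of an evolving graph, one edge event at a time."""

    def __init__(self):
        self.degree = defaultdict(int)

    def add_edge(self, u, v):
        self.degree[u] += 1
        self.degree[v] += 1

    def remove_edge(self, u, v):
        self.degree[u] -= 1
        self.degree[v] -= 1

    def most_central(self):
        # Degree centrality: the vertex with the most incident edges so far.
        return max(self.degree, key=self.degree.get)

tracker = StreamingDegree()
for u, v in [("ann", "bob"), ("bob", "cat"), ("bob", "dan"), ("cat", "dan")]:
    tracker.add_edge(u, v)
leader = tracker.most_central()
```

Each event costs constant time and the memory is proportional to the number of vertices, which is the kind of budget streaming algorithms aim for.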


A Renaissance in Architecture Research

• Good news

– Moore’s Law marches on

– Real estate on a chip is essentially free

• Major paradigm change – huge opportunity for innovation

• Bad news

– Power considerations limit the improvement in clock speed

• Eventual consequences are unclear

• Current response, multicore processors

– Computation/Communication ratio will get worse

• Makes life harder for applications

Applications Also Getting More Complex

• Leading edge scientific applications increasingly include:

– Adaptive, unstructured data structures

– Complex, multiphysics simulations

– Multiscale computations in space and time

– Complex synchronizations (e.g. discrete events)

• Significant parallelization challenges on today’s machines

– Finite degree of coarse-grained parallelism

– Load balancing and memory hierarchy optimization

• Dramatically harder on millions of cores

• Huge need for new algorithmic ideas – CSC will be critical

Architectural Challenges for Graph Algorithms

• Runtime is dominated by latency

– Particularly true for data-centric applications

– Random accesses to global address space

– Perhaps many at once – fine-grained parallelism

• Essentially no computation to hide access time

• Access pattern is data dependent

– Prefetching unlikely to help

– Usually only want small part of cache line

• Potentially abysmal locality at all levels of memory hierarchy
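A breadth-first search over a compressed sparse row (CSR) graph shows where the data-dependent accesses come from. This is a generic textbook sketch, not code from the talk; the helper names `to_csr` and `bfs_levels` and the sample graph are invented.

```python
from collections import deque

def to_csr(n, edges):
    """Build CSR arrays (offsets, targets) for an undirected graph on n vertices."""
    deg = [0] * n
    for u, v in edges:
        deg[u] += 1
        deg[v] += 1
    offsets = [0] * (n + 1)
    for v in range(n):
        offsets[v + 1] = offsets[v] + deg[v]
    targets = [0] * offsets[n]
    fill = offsets[:-1].copy()        # next free slot in each vertex's neighbor list
    for u, v in edges:
        targets[fill[u]] = v
        fill[u] += 1
        targets[fill[v]] = u
        fill[v] += 1
    return offsets, targets

def bfs_levels(offsets, targets, source):
    """Breadth-first search; returns the level of each vertex (-1 if unreached)."""
    level = [-1] * (len(offsets) - 1)
    level[source] = 0
    frontier = deque([source])
    while frontier:
        u = frontier.popleft()
        for k in range(offsets[u], offsets[u + 1]):
            v = targets[k]            # the value just loaded ...
            if level[v] == -1:        # ... decides which address is read next
                level[v] = level[u] + 1
                frontier.append(v)
    return level

edges = [(0, 5), (5, 2), (2, 7), (0, 3), (3, 6)]
offsets, targets = to_csr(8, edges)
levels = bfs_levels(offsets, targets, 0)
```

The inner loop does one comparison per memory reference, and the address of `level[v]` is unknown until `targets[k]` has been loaded. That leaves almost no arithmetic to overlap with the access, and a hardware prefetcher has no regular stride to follow.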

Locality Challenges

[Figure: locality of applications, with regions labeled “What we traditionally care about”, “What industry cares about” and “Emerging Codes”]

From: Murphy and Kogge, On the Memory Access Patterns of Supercomputer Applications: Benchmark Selection and Its Implications, IEEE Transactions on Computers, July 2007

Example: AMD Opteron

[Figure: AMD Opteron chip diagram, annotated in stages. L2 cache, L1 I-cache and L1 D-cache: memory (latency avoidance). Memory controller. I-fetch/scan/align, load/store unit and out-of-order execution: load/store, memory and coherency (latency tolerance). Bus, DDR and HT: memory and I/O interfaces. FPU execution and integer execution: COMPUTER.]


Thanks to Thomas Sterling

Architectural Wish List for Graphs

• Low latency / high bandwidth

– For small messages!

• Latency tolerant

• Light-weight synchronization mechanisms

• Global address space

– No graph partitioning required

– Avoid memory-consuming profusion of ghost-nodes

– No local/global numbering conversions

• One machine with these properties is the Cray MTA-2

– And successor XMT

How Does the MTA Work?

• Latency tolerance via massive multi-threading

– Context switch in a single tick

– Global address space, hashed to reduce hot-spots

– No cache or local memory

– Multiple outstanding loads

• Remote memory request doesn’t stall processor

– Other streams work while your request gets fulfilled

• Light-weight, word-level synchronization

– Minimizes conflicts, enables parallelism

• Flexible dynamic load balancing

• Notes:

– 220 MHz clock

– Largest machine is 40 processors
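The effect of multithreading on latency can be seen in a toy model. This is a sketch under strong assumptions, not a description of the actual hardware: every instruction is a memory reference, switching streams is free, and a stream that issues at cycle c cannot issue again until cycle c + latency. The function name `utilization` and the numbers used are illustrative only.

```python
def utilization(num_streams, latency, cycles=10_000):
    """Fraction of cycles in which the processor issues an instruction."""
    ready_at = [0] * num_streams      # earliest cycle at which each stream may issue
    issued = 0
    nxt = 0                           # round-robin starting point
    for cycle in range(cycles):
        for k in range(num_streams):
            s = (nxt + k) % num_streams
            if ready_at[s] <= cycle:
                ready_at[s] = cycle + latency   # stream now waits on its memory request
                issued += 1
                nxt = (s + 1) % num_streams
                break                 # at most one issue per cycle
    return issued / cycles

one_stream = utilization(1, latency=100)
some_streams = utilization(50, latency=100)
many_streams = utilization(128, latency=100)
```

In this model a single stream keeps the processor busy 1% of the time, 50 streams reach 50%, and any stream count above the latency keeps it fully busy. No cache is needed, provided the application exposes that much parallelism.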

Case Study I: MTA-2 vs. BlueGene

• With LLNL, implemented S-T shortest paths in MPI

• Ran on IBM/LLNL BlueGene/L, world’s fastest computer

• Finalist for 2005 Gordon Bell Prize

– 4B vertex, 20B edge, Erdős-Rényi random graph

– Analysis: touches about 200K vertices

– Time: 1.5 seconds on 32K processors

• Ran similar problem on MTA-2

– 32 million vertices, 128 million edges

– Measured: touches about 23K vertices

– Time: 0.7 seconds on one processor, 0.09 seconds on 10 processors

• Conclusion: 4 MTA-2 processors = 32K BlueGene/L processors

– [Berry, H., Kahan, Konecny, 2007]
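The slide does not say how the figure of 4 processors was obtained. One plausible reading, sketched below, normalizes by the number of vertices touched. It assumes MTA-2 run time grows linearly with vertices touched and that MTA-2 processors give perfect speedup; both are assumptions of this example, not claims from the talk.

```python
# Figures quoted on the slide.
bgl_time_s = 1.5            # BlueGene/L, 32K processors
bgl_touched = 200_000       # vertices touched (analysis)
mta_time_s_1proc = 0.7      # MTA-2, one processor
mta_time_s_10proc = 0.09    # MTA-2, ten processors
mta_touched = 23_000        # vertices touched (measured)

# Assumption: MTA-2 time scales linearly with vertices touched.
mta_time_same_work = mta_time_s_1proc * bgl_touched / mta_touched

# Assumption: perfect speedup across MTA-2 processors.
mta_procs_to_match = mta_time_same_work / bgl_time_s

# Measured speedup from 1 to 10 processors, for comparison with the ideal 10.
measured_speedup_10 = mta_time_s_1proc / mta_time_s_10proc
```

Under these assumptions one MTA-2 processor would need about 6.1 seconds for the larger search, so about 4 processors would match 1.5 seconds. The measured speedup on 10 processors is about 7.8, so the perfect-speedup assumption is somewhat optimistic.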

Case Study II: Single Source Shortest Path

[Figure: time (s) versus number of processors for PBGL SSSP and MTA SSSP]

• Parallel Boost Graph Library (PBGL)

– Lumsdaine, et al., on Opteron cluster

– Some graph algorithms can scale on some inputs

• PBGL - MTA Comparison on SSSP

– Erdős-Rényi random graph (|V|=2^28)

– PBGL SSSP can scale on non-power law graphs

– Order of magnitude speed difference

– 2 orders of magnitude efficiency difference

• Big difference in power consumption

– [Lumsdaine, Gregor, H., Berry, 2007]
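For readers unfamiliar with the benchmark, single source shortest path (SSSP) computes the distance from one vertex to every other vertex. The sketch below is a sequential reference using Dijkstra's algorithm; it is neither the PBGL nor the MTA implementation. The generator `random_graph` picks random endpoints for each edge, which only approximates an Erdős-Rényi graph and may produce self-loops and duplicate edges. Both helper names are invented.

```python
import heapq
import random

def random_graph(n, m, seed=0):
    """Adjacency lists for an undirected random graph with m weighted edges."""
    rng = random.Random(seed)
    adj = [[] for _ in range(n)]
    for _ in range(m):
        u, v = rng.randrange(n), rng.randrange(n)
        w = rng.random()              # edge weight in [0, 1)
        adj[u].append((v, w))
        adj[v].append((u, w))
    return adj

def sssp(adj, source):
    """Dijkstra's algorithm; returns the shortest distance from source to each vertex."""
    dist = [float("inf")] * len(adj)
    dist[source] = 0.0
    heap = [(0.0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist[u]:
            continue                  # stale heap entry, already improved
        for v, w in adj[u]:
            nd = d + w
            if nd < dist[v]:
                dist[v] = nd
                heapq.heappush(heap, (nd, v))
    return dist

adj = random_graph(n=1 << 10, m=1 << 12)
dist = sssp(adj, source=0)
```

The priority queue serializes the computation, which is one reason parallel SSSP codes typically use relaxed orderings such as delta-stepping instead.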

Longer Term Architectural Opportunities

• Near future trends

– Multithreading to tolerate latencies

– XMT-like capability on commodity machines?

• Potential big impact on latency-dominated applications (e.g. graphs)

• Further out

– Application-specific circuitry

• E.g. hashing, feature detection, etc.

– Reconfigurable hardware?

• Adapt circuits to the application at run time

• Lots of new combinatorial problems in these alternative computing models

Conclusions

• CSC is in robust health

– Growing in breadth, depth, impact and visibility

• Trends in science play to our strengths

– Growing complexity of traditional applications requires more CSC

• Unstructured, adaptive meshes; bigger problems; multiphysics; optimization; etc.

– New science domains with combinatorial needs are emerging

• Social sciences, ecology, structural biology, etc.

– Many sciences are becoming more data-rich

– Complex computers require new discrete algorithms

• We can help applications on multicore nodes, and maybe influence future architectures

• Enormous need for new models and algorithmic improvements

• It’s a great time to be doing CSC!

Thanks

• Cevdet Aykanat, Jon Berry, Rob Bisseling, Erik Boman, Bill Carlson, Ümit Çatalürek, Edmond Chow, Karen Devine, Iain Duff, Danny Dunlavy, Alan Edelman, Jean-Loup Faulon, John Gilbert, Assefaw Gebremedhin, Mike Heath, Paul Hovland, Simon Kahan, Pat Knupp, Tammy Kolda, Gary Kumfert, Fredrik Manne, Mike Merrill, Richard Murphy, Esmond Ng, Ali Pınar, Cindy Phillips, Steve Plimpton, Alex Pothen, Robert Preis, Padma Raghavan, Steve Reinhardt, Suzanne Rountree, Rob Schreiber, Viral Shah, Jonathan Shewchuk, Horst Simon, Dan Spielman, Shang-Hua Teng, Sivan Toledo, Keith Underwood, etc.
