Software and Hardware Requirements for Next-Generation Data Analytics John Feo Center for Adaptive Supercomputing Software Pacific Northwest National Laboratory October, 2010


Page 1: Software and Hardware Requirements for Next-Generation Data Analytics John Feo Center for Adaptive Supercomputing Software Pacific Northwest National Laboratory

Software and Hardware Requirements for Next-Generation Data Analytics

John Feo

Center for Adaptive Supercomputing Software

Pacific Northwest National Laboratory

October, 2010

Page 2: Graphs are everywhere in science

Astrophysics
Problem: Outlier detection.
Challenges: Massive datasets, temporal variations.
Graph problems: Clustering, matching.

Bioinformatics
Problem: Identifying drug target proteins.
Challenges: Data heterogeneity, quality.
Graph problems: Centrality, clustering.

Social Informatics
Problem: Discover emergent communities, model spread of information.
Challenges: New analytics routines, uncertainty in data.
Graph problems: Clustering, shortest paths, flows.

Page 3: … and in commerce

Sample queries:
Allegiance switching: identify entities that switch communities.
Community structure: identify the genesis and dissipation of communities.
Phase change: identify significant change in the network structure.
Thought leaders: identify influential individuals that drive events.

Graph features:
Topology: interaction graph is low-diameter and has no good separators.
Irregularity: communities are not uniform in size.
Overlap: individuals are members of one or more communities.

1000x growth in 3 years!

has more than 300 million active users

Page 4: Small-world and scale-free

Low diameter (small-world):
Work explodes
Difficult to partition/load-balance
High % of nodes are visited quickly
“Six degrees of separation”

Scale-free (power-law):
Difficult to partition/load-balance
Work concentrates in a few nodes

Figure: Ratio of edges cut (y-axis, ticks at 0.25, 0.50, 1.00) versus number of partitions (x-axis) for Block and k-way partitioning of an RMAT graph with a million vertices.
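The difficulty of partitioning such graphs can be reproduced on a small scale. The following is a minimal sketch, not the experiment behind the slide: it generates an R-MAT edge list and measures the fraction of edges cut by a contiguous block partition. The quadrant probabilities (0.57, 0.19, 0.19, 0.05), the graph size, and the function names `rmat_edges` and `block_cut_ratio` are assumptions chosen for illustration, and vertex IDs are left unpermuted.

```python
import random


def rmat_edges(scale, num_edges, a=0.57, b=0.19, c=0.19, seed=0):
    """Generate directed R-MAT edges on 2**scale vertices.

    Each edge picks one adjacency-matrix quadrant per bit level; the
    remaining probability 1 - a - b - c goes to the bottom-right quadrant.
    """
    rng = random.Random(seed)
    edges = []
    for _ in range(num_edges):
        u = v = 0
        for _ in range(scale):
            r = rng.random()
            u <<= 1
            v <<= 1
            if r < a:
                pass              # top-left quadrant: both bits stay 0
            elif r < a + b:
                v |= 1            # top-right quadrant
            elif r < a + b + c:
                u |= 1            # bottom-left quadrant
            else:
                u |= 1            # bottom-right quadrant
                v |= 1
        edges.append((u, v))
    return edges


def block_cut_ratio(edges, num_vertices, parts):
    """Fraction of edges whose endpoints land in different contiguous blocks."""
    block = -(-num_vertices // parts)  # ceiling division
    cut = sum(1 for u, v in edges if u // block != v // block)
    return cut / len(edges)


if __name__ == "__main__":
    scale = 12  # far smaller than the million-vertex graph on the slide
    edges = rmat_edges(scale, 8 * 2**scale)
    for parts in (2, 4, 8, 16, 32):
        print(parts, round(block_cut_ratio(edges, 2**scale, parts), 3))
```

With these assumed parameters the cut ratio grows as the number of blocks grows, which is the qualitative behavior the slide points at; a k-way partitioner such as Metis would need to be run separately to compare.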

Page 5: Grids, Erdős–Rényi, and Scale-Free Graphs

Figure: Communication traces from execution of ½-approx weighted matching (data distributed using Metis). Panels: USA Roadmap, Erdős–Rényi, Scale-Free.

Page 6: Challenges

Problem size:
Ton of bytes, not ton of flops
Little data locality
Have only parallelism to tolerate latencies

Low computation to communication ratio:
Single word access
Threads limited by loads and stores

Synchronization points are simple elements:
Node, edge, record

Work tends to be dynamic and imbalanced:
Let any processor execute any thread

Page 7: System requirements

Global shared memory:
No simple data partitions
Local storage for thread private data

Network support for single word accesses:
Transfer multiple words when locality exists

Multi-threaded processors:
Hide latency with parallelism
Single cycle context switching
Multiple outstanding loads and stores per thread

Full-and-empty bits:
Efficient synchronization
Wait in memory

Message driven operations:
Dynamic work queues
Hardware support for thread migration

Cray XMT
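The full-and-empty-bit requirement can be illustrated in software. The sketch below emulates a single memory word carrying a full/empty flag, using a condition variable in place of hardware support; the class `FullEmptyCell` and its method names are hypothetical and are not the API of any real machine. A reader blocks until the word is full, a writer blocks until it is empty, so the word itself acts as the synchronization point.

```python
import threading


class FullEmptyCell:
    """One memory word guarded by a full/empty flag (software emulation)."""

    def __init__(self):
        self._cond = threading.Condition()
        self._full = False
        self._value = None

    def write_ef(self, value):
        """Wait until the word is empty, store the value, mark it full."""
        with self._cond:
            self._cond.wait_for(lambda: not self._full)
            self._value = value
            self._full = True
            self._cond.notify_all()

    def read_fe(self):
        """Wait until the word is full, return the value, mark it empty."""
        with self._cond:
            self._cond.wait_for(lambda: self._full)
            value = self._value
            self._full = False
            self._cond.notify_all()
            return value

    def read_ff(self):
        """Wait until the word is full, return the value, leave it full."""
        with self._cond:
            self._cond.wait_for(lambda: self._full)
            return self._value


def producer_consumer(n):
    """Pass n integers through one cell; returns them in arrival order."""
    cell = FullEmptyCell()
    received = []

    def consume():
        for _ in range(n):
            received.append(cell.read_fe())

    consumer = threading.Thread(target=consume)
    consumer.start()
    for i in range(n):
        cell.write_ef(i)
    consumer.join()
    return received
```

In this emulation a waiting thread still occupies an operating-system thread; the slide's point is that hardware can instead let the request wait in memory, at the granularity of a single node, edge, or record.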

Page 8: Center for Adaptive Supercomputing Software

Driving development of next-generation multithreaded architectures and methods for irregular problems

Figure labels: DATA (Scientific Simulations, Sensor Networks, Internet, Databases); Data Analytics; Knowledge Discovery; Trend Analysis; Science; Policy; Commerce

Sponsored by DOD

Page 9: Partners

Page 10: Analytic methods and applications

Community thought leaders

Blog Analysis

Community Activities

Facebook - 300 M users

Connect-the-dots

Figure (example graph) node labels: Bus, Hayashi, Zaire, Train, Anthrax, Money, Endo

National Security

People, Places, & Actions

Semantic Web

Anomaly detection

Security

N-x contingency analysis

SmartGrid

Page 11: Research focus areas

Layers (figure axis labels): Architecture, Runtime System, Languages, Methods, Applications

Items shown in the figure:
Next generation multithreaded architectures
Communication software for hybrid systems
Performance analysis and tools
Compiler and runtime system
Chapel for hybrid systems
MapReduce
Mesh generation
N-x contingency analysis
Bayesian networks
Clustering
SmartGrid
Sensor Networks
Semantic Databases
Social networks
Computer Security
Bioinformatics

Page 12: Methods for data analytics

Paths:
Shortest path
Betweenness
Min/max flow

Structures:
Spanning trees
Connected components
Graph isomorphism

Groups:
Matching/Coloring
Partitioning
Equivalence

Influential factors:
Degree distribution: normal or scale-free
Planar or non-planar
Static or dynamic
Weighted or unweighted; weight distribution
Typed or untyped edges

Figure callouts: Load imbalance; Non-planar; Concurrent inserts and deletions; Difficult to partition

Page 13: Systems for large-scale analytics

Cray XMT

Graph resides in XMT memory

RDBS runs on cluster

Netezza TwinFin

Page 14: Dynamic Bayesian Network Model for Atmospheric Sensor Network Validation

Figure: Bayesian network over the sensor variables vap, wspd_va, tbsky 31, sky ir temp, precip-tbrg, percent_opaque, radar7, radar13, radar19 (the node set appears three times in the figure).

Replicate per time step

Add dependencies across time steps (not shown)

Page 15: DBN to Junction Tree Conversion

Convert dynamic Bayesian network to junction tree for inferencing

Each node in the junction tree is a clique or super node containing several nodes from the original Bayesian network

Junction-tree-based “evidence propagation” is an efficient method of propagating the effect of any variable’s state to every other variable in the BN

Figure: the node set vap, wspd_va, tbsky 31, sky ir temp, precip-tbrg, percent_opaque, radar7, radar13, radar19 appears three times in the figure.

Page 16: Evidence Propagation is highly irregular

Compute per node is unbalanced
Degree per node is irregular
Data moves up and down

Loop parallelism intra-node
Task parallelism inter-node (recursion, futures)
Data flow scheduling
Data synchronization

SMALL SYSTEMS HAVE 100S OF MILLIONS OF NODES
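The up-and-down data movement can be sketched with a two-pass traversal of a tree. This is only a stand-in for junction-tree evidence propagation: each node holds a plain number and messages are sums, where a real implementation would exchange clique potentials. The function `propagate` and the dictionary-based tree layout are assumptions for illustration.

```python
def propagate(tree, values, root):
    """Two-pass message passing on a rooted tree.

    tree maps a node to a list of its children; values maps a node to a
    number. After a collect (upward) pass and a distribute (downward)
    pass, every node holds the combined value of the whole tree.
    """
    up = {}

    def collect(node):
        # Upward pass: a node's message is its value plus its children's messages.
        total = values[node]
        for child in tree.get(node, []):
            total += collect(child)
        up[node] = total
        return total

    result = {}

    def distribute(node, from_parent):
        # Downward pass: combine the parent's message with the subtree total.
        result[node] = from_parent + up[node]
        for child in tree.get(node, []):
            # The child receives everything that lies outside its own subtree.
            distribute(child, result[node] - up[child])

    collect(root)
    distribute(root, 0)
    return result


if __name__ == "__main__":
    # An irregular tree: node degrees and subtree sizes differ.
    tree = {"a": ["b", "c", "d"], "b": ["e"], "e": ["f", "g"]}
    values = {"a": 1, "b": 2, "c": 3, "d": 4, "e": 5, "f": 6, "g": 7}
    print(propagate(tree, values, "a"))
```

The recursive calls over the children are independent of one another, which is where the inter-node task parallelism (recursion, futures) noted on the slide would apply; the sketch runs them sequentially and uses plain recursion, so it is only suitable for small trees.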

Page 17: Atmospheric Sensor Network Validation Framework

Page 18: Semantic analysis

Understanding the relationships among data

Data intensive science

National security

Commerce

Data and relationships best expressed as triples and graphs

<John owns Dog>

PNNL, SNL, Cray

Patient | Blue bumps | Pink rash | High fever
John | Yes | _ | Yes
Alice | _ | Yes | _
Mary | _ | _ | Yes

The same data as a graph of triples:
<John has symptom Blue bumps>
<John has symptom High fever>
<Alice has symptom Pink rash>
<Mary has symptom High fever>

Mayo Clinic’s patient database has 650K columns
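The move from a sparse table to triples can be sketched in a few lines. The example below uses the three-patient table from this slide; the function names `table_to_triples` and `patients_with` are hypothetical. Only the cells marked "Yes" produce a triple, so a table with very many mostly empty columns shrinks to one triple per observed fact.

```python
COLUMNS = ["Blue bumps", "Pink rash", "High fever"]
ROWS = [
    ("John", ["Yes", "_", "Yes"]),
    ("Alice", ["_", "Yes", "_"]),
    ("Mary", ["_", "_", "Yes"]),
]


def table_to_triples(columns, rows, predicate="has symptom"):
    """Emit one (subject, predicate, object) triple per filled-in cell."""
    return [
        (patient, predicate, column)
        for patient, cells in rows
        for column, cell in zip(columns, cells)
        if cell == "Yes"
    ]


def patients_with(triples, symptom):
    """Return the subjects linked to the given symptom, sorted by name."""
    return sorted(s for s, _, o in triples if o == symptom)


if __name__ == "__main__":
    triples = table_to_triples(COLUMNS, ROWS)
    for triple in triples:
        print(triple)
    print(patients_with(triples, "High fever"))
```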

Page 19: XMT’s potential for semantic analysis

Machine | Programming model | Performance (inferences per sec) | Author
X86, 32 nodes, 128 cores | MPI | ~600K | Weaver and Hendler (ISWC 2009)
X86, 64 nodes, 256 cores | Hadoop | ~550K – 800K | Urbani et al. (ESWC 2010)
256 Threadstorm processors | C++ | ~2.2M w/ read time; ~13M w/o read time

RDFS closure

Inferring new relationships and attributes

Rule based

Original Diagram from Urbani et al. "Scalable Distributed Reasoning using MapReduce" ISWC 2009

JOB 3: Delete Duplicates

JOB 0: Transitive Closure

<John studied under Jim Browne> + <Jim Browne teaches at UT Austin> → <John attended UT Austin>

865 million triples
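Rule-based closure can be sketched as a fixpoint loop: apply the rules to the current triple set, add whatever is new, and stop when nothing changes. The sketch below applies two standard RDFS entailment rules, subclass transitivity and type inheritance through subclasses. It is a naive in-memory illustration, not the implementation measured in the table, and the sample triples about Rex are invented for the example.

```python
SUBCLASS = "rdfs:subClassOf"
TYPE = "rdf:type"


def rdfs_closure(triples):
    """Return the closure of a triple set under two RDFS rules.

    Rule 1: (A subClassOf B), (B subClassOf C)  =>  (A subClassOf C)
    Rule 2: (x type A), (A subClassOf B)        =>  (x type B)
    """
    closure = set(triples)
    while True:
        sub = [(s, o) for s, p, o in closure if p == SUBCLASS]
        typ = [(s, o) for s, p, o in closure if p == TYPE]
        new = set()
        for a, b in sub:
            for c, d in sub:
                if b == c:
                    new.add((a, SUBCLASS, d))
        for x, a in typ:
            for c, d in sub:
                if a == c:
                    new.add((x, TYPE, d))
        if new <= closure:
            # Fixpoint reached: the rules produce nothing unseen.
            return closure
        closure |= new


if __name__ == "__main__":
    facts = {
        ("Dog", SUBCLASS, "Mammal"),
        ("Mammal", SUBCLASS, "Animal"),
        ("Rex", TYPE, "Dog"),
    }
    for triple in sorted(rdfs_closure(facts)):
        print(triple)
```

Storing the closure in a set removes duplicate inferences as they are produced, which is the role the "Delete Duplicates" job appears to play in the diagram. The nested loops are quadratic in the number of triples, so a dataset with hundreds of millions of triples would need indexed joins and parallel execution.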

Page 20: Summary

The new HPC is irregular and sparse

Bad news: we need new architectures

Good news: there are commercial and consumer applications

Shared memory is necessary, but not sufficient

Need processors that can fill the memory system with requests

Need memory systems that support millions of simultaneous requests

Need fine-grain hardware synchronization in memory