towards a digital human kathy yelick eecs department u.c. berkeley

Towards a Digital Human

Kathy Yelick

EECS Department

U.C. Berkeley

The 20+ Year Vision

• Imagine a “digital body double” – 3D image-based medical record– Includes diagnostic, pathologic, and other

information

• Used for:– Diagnosis– Less invasive surgery-by-robot– Experimental treatments

• Digital Human Effort– Lead by the Federation of American Scientists

Digital Human Today: Imaging

• The Visible Human Project– 18,000 digitized sections of the body

• Male: 1mm sections, released in 1994• Female: .33mm sections, released in 1995

– Goals• study of human anatomy• testing medical imaging algorithms

– Current applications: • educational, diagnostic, treatment planning,

virtual reality, artistic, mathematical and industrial

• Used by > 1,400 licensees in 42 countries

Image Source: www.madsci.org

Digital Human Roadmap

1995 2000 2005 2010

1 organ 1 model

scalable implementations

1 organ multiple models

multiple organs

3D model construction

better algorithms

organ system

coupled models

faster computers

improved programmability

digital human

Building 3D Models from Images

Image source: John Sullivan et al, WPI

Image data from • visible human• MRI• Laboratory experiments

Automatic construction• Surface mesh• Volume mesh• John Sullivan et al, WPI

Simulation: From Genome to Function (James B.Bassingthwaighte, UW)

HealthOrganism

Tissue

Molecule

The Physiome Projecthttp://www.physiome.org

Structure to Function:• Experiments, Databases• Problem Formulation• Engineering the Solutions • Quantitative System Modeling• Archiving & Dissemination

Organ Simulation

Cardiac cells/muscles– SDSC, Auckland, UW, Utah,

Cardiac flow

– NYU,…

Lung transport– Vanderbilt

Lung flow– ORNL

Cochlea

– Caltech, UM, UCB

Kidney mesh generation

– Dartmouth

Electrocardiography– Johns Hopkins,…Skeletal mesh

generation

– ElIisman

Just a few of the efforts at understanding and simulating parts of the human body

Immersed Boundaries within the Body

• Fluid flow within the body is one of the major challenges, e.g., – Blood through the heart– Coagulation of platelets in clots– Movement of bacteria

• A key problem is modeling an elastic structure immersed in a fluid– Irregular moving boundaries– Wide range of scales– Vary by structure, connectivity, viscosity, external

forced, internally-generated forces, etc.

Heart Simulation

Developed by Peskin and McQueen at NYU– On a Cray C90: 1 heart-beat in 100 CPU hours– 1283 point fluid mesh; coarser mesh incorrect – Heart represented by muscle fibers

–Applications: • Understanding

structural abnormalities

• Evaluating artificial heart valves

• Eventually, artificial hearts

Source: www.psc.org

Heart Simulation

Source: www.psc.org

Animation of lower portion of the heart

Platelet Simulation

• Simulation of blood clotting– Developed by Aaron Fogelson and Charlie Peskin– Cells are represented by polygons

• Rules added to simulate adhesion

– Artery walls represented as fibers– 2D Simulation

• Main implementation in F77 for vectors• Our implemented in Split-C for distributed memory

Cochlea Simulation

– Simulates fluid-structure interactions due to incoming sound waves

– Potential applications: design of cochlear implants

• Simulation–OpenMP code

• E. Givelberg• J. Bunn• M. Rajan

–256x256x128 fluid grid

–18 hours on HP Superdome

Cochlea Simulation

• Embedded structures are– Elastic membranes– Shell model– Variable elasticity

• Capture shape of cochlea

• Shown here is a plate– Implemented in a

few weeks within Titanium code Source: Titanium simulation on a T3E

Bacteria Simulation

• Simulation of bacteria in a fluid– Requires smaller scale– Small animal swimming

• Lisa Fauchy, Tulane and

• James Sullivan, Tulane

• E coli simulation– Video © James A. Sullivan, Cells Alive!

• Brownian motion extension– Under development by Peter Kramer, RPI– microscopic processes where thermal fluctuation is

important

Insect Flight Simulation

• Wings are– Immersed 2D

structure

• Under development – UW and NYU

• Applications to– Insect robot design

Source: Dickenson, UCB

Other Applications

• The immersed boundary method has also been used, or is being applied to– Flags and parachutes– Flagella – Embryo growth– Valveless pumping– Paper making

Immersed Boundary Method Structure

Fiber activation & force calculation

InterpolateVelocity

Navier-Stokes Solver

SpreadForce

4 steps in each timestep

Fiber Points

Interaction

Fluid Lattice

Challenges to Parallelization

• Irregular fiber lists need to interact with regular fluid lattice. – Trade-off between load balancing of fibers

and minimizing communication• Efficient “scatter-gather” across processors

• Need a scalable elliptic solver– Plan to uses multigrid – Eventually add Adaptive Mesh Refinement

• New algorithms under development by Colella’s group at LBNL and Baden at UCSD

Software Architecture

Application Models

Generic Immersed Boundary Method (Titanium)

Heart(Titanium)

CochleaFlagellateSwimming

Spectral(Titanium)

Multigrid MLC(KeLP)

Extensible Simulation

SolversMultigrid(Titanium)

– Based on Generic Immersed Boundary Method– Can plug in new models by extending fiber points– Uses Java inheritance for simplicity

Immersed Boundaries in Titanium

–Heartbeat simulation • Experiment 800• 1/5th of heartbeat done• Using to validate code

• Recent improvements– FFTW in solver: 10x faster– Study of fluid/fiber

communication• Better load balancing

Source: Titanium simulation on Millennium

• Immersed Boundary Method in Titanium is complete– Download from www.cs.berkeley.edu/~smyau

Reduced Solver Bottleneck

• Performance improvements in past year– About 5x overall; ~10 seconds/ts on 64 T3E nodes– Bottleneck has changed to interaction phases

Fiber14%

Solve23%

Spread30%

Inter-polate27%

Spread27%

Solve44%

Fiber2%

2000 Performance 2002 Performance

Load Balance and Locality

• Tried various assignments of fiber points

Optimized for Locality

Optimized for Load Balance

• Optimizing for load balance generally better– Application (fiber structure) dependent– Somewhat constrained by spectral solver– Multigrid will give more options

Scallop: New Multigrid Poisson Solver

• A latency tolerant elliptic solver library– Based on domain decomposition– Trades off extra computation for fewer messages– Will be used to build Navier-Stokes Solver

• Work by Scott Baden and Greg Balls, UCSD– Based on Balls/Colella algorithm– Implemented in C++/KeLP, with a simple interface

• Solver not yet integrated– Algorithm is complete– Performance tuning is ongoing– Interface between Titanium and C++/KeLP

developed

Tools for High Performance

Challenges to parallel simulation of a digital human are generic

• Parallel machines are too hard to program– Users “left behind” with each new major generation

• Efficiency is too low– Even after a large programming effort – Single digit efficiency numbers are common

• Approach– Titanium: A modern (Java-based) language that

provides performance transparency– Sparsity: Self-tuning scientific kernels

Titanium Overview

Object-oriented language based on Java with:• Scalable parallelism

– Single Program Multiple Data (SPMD) model of parallelism, 1 thread per processor

• Global address space– Processors can read/write memory on other

processor– Pointer dereference can cause communication

• Intermediate point between message passing and shared memory

Performance Challenges

• Two separate performance problems– Serial Performance

• Can Java be used for high performance?• What if we compile to machine code, rather

than using the JVM?

– Communication Performance• Global address space encourages the use of

small messages• Overlapping used to tolerate latencies• How well does this work on current networks?

SciMark Benchmark

• Numerical benchmark for Java, C/C++• Five kernels:

– FFT (complex, 1D)– Successive Over-Relaxation (SOR)– Monte Carlo integration (MC)– Sparse matrix multiply – dense LU factorization

• Results are reported in Mflops• Download and run on your machine from:

– http://math.nist.gov/scimark2– C and Java sources also provided

Roldan Pozo, NIST, http://math.nist.gov/~Rpozo

SciMark: Java vs. C(Sun UltraSPARC 60)

FFT SOR MC Sparse LU

* Sun JDK 1.3 (HotSpot) , javac -0; Sun cc -0; SunOS 5.7

SciMark: Java vs. C(Intel PIII 500MHz, Win98)

FFT SOR MC Sparse LU

* Sun JDK 1.2, javac -0; Microsoft VC++ 5.0, cl -0; Win98

Can we do better without the JVM?

• Pure Java with a JVM (and JIT)– Within 2x of C and sometimes better

• OK for many users, even those using high end machines

– Depends on quality of both compilers

• We can try to do better using a traditional compilation model – E.g., Titanium compiler at Berkeley

• Compiles Java extension to C• Bounds checking of arrays is crucial – compiler

flag to turn this off

Java Compiled by Titanium Compiler

Performance on a Pentium IV (1.5GHz)

100150200250300350400450

Overall FFT SOR MC Sparse LU

java C (gcc -O6) Ti Ti -nobc

Java Compiled by Titanium Compiler

Performance on a Sun Ultra 4

Overall FFT SOR MC Sparse LU

Java C Ti Ti -nobc

Language Support for Performance

• Multidimensional arrays– Contiguous storage– Support for sub-array operations without copying

• Support for small objects– E.g., complex numbers– Called “immutables” in Titanium– Sometimes called “value” classes

• Semi-automatic memory management– Create named “regions” for new and delete– Avoids distributed garbage collection

Performance Challenges

• Two separate performance problems– Serial Performance

• Can Java be used for high performance?• What if we compile to machine code, rather

than using the JVM?

– Communication Performance• Global address space encourages the use of

small messages• Overlapping used to tolerate latencies• How well does this work on current networks?

The Network Parameters

• EEL – End to end latency or time spent sending a short message between two processes.

• Parameters of the LogP Model– L – “Latency”or time spent on the network

• During this time, processor can be doing other work

– O – “Overhead” or processor busy time on the sending or receiving side.

– G – “gap” the rate at which messages can be pushed onto the network.

– P – the number of processors

• Overhead is the most important parameter – cannot be hidden

LogP Parameters: Overhead & Latency• Non-overlapping

overhead• Send and recv

overhead can overlap

EEL = osend + L + orecv

EEL < osend + orecv

Performance of Remote Read/Write

Send Overhead (alone) Send & Rec Overhead Rec Overhead (alone) Added Latency

8-byte messages

LogP Parameters: gap

• The Gap is the delay between sending messages

• Gap may be larger than send overhead, e.g., – NIC may be busy processing the last

message.– Flow control on the network may

prevent the NIC from accepting the next message.

• If the gap is higher than overhead– Need to overlap communication with

computation, not other communication

osendgap

Results: Gap and Overhead

1.2 0.2

8.2 9.5

6.510.3

7.84.6

Gap Send Overhead Receive Overhead

Send Overhead Over Time

• Overhead has not improved significantly– Lack of integration; lack of attention in software

Myrinet2K

Dolphin

Cenju4

MeikoParagon

Dolphin

Myrinet

Compaq

NCube/2

1990 1992 1994 1996 1998 2000 2002Year (approximate)

Summary

Several categories–Application development

• Heart and cochlea-component simulation

–Application-level package• Generic immersed boundary method• Parallel for shared and distributed memory• Enables new larger-scale simulations; finer grid

–Titanium language and compiler• High level, high performance programming

–Communication runtime layer• Lightweight communication and benchmarks

applicationdata

softwaresystems

Collaborators• Susan Graham

• Paul Hilfinger

• Jim Demmel

• Alex Aiken• Phillip Colella• Peter McQuorquodale• Mike Welcome• Sabrina Merchant• Kaushik Datta

• Dan Bonachea• Rich Vuduc• Jason Duell• Paul Hargrove• Wei Chen

• Eun-Jin Im• Szu-Huey Chang• Carrie Fei• Ben Liblit• Robert Lin• Geoff Pike• Jimmy Su• Ellen Tsai• Siu Man Yau• Shaoib Kamil• Benjamin Lee• Rajesh Nishtala• Costin Iancu

http://titanium.cs.berkeley.edu/

towards a digital human kathy yelick eecs department u.c. berkeley

heart slide

human body slide

wpi slide

heart simulation source

berkeley slide

cochlea simulation

digital human today

digital human roadmap

Documents

eecs systems research in the post- pc era david culler u.c....

cs61b l07 specification and documentation (1)garcia / yelick...

source-destination routing optimal strategies eric chi...

global address space applications kathy yelick nersc/lbnl...

defects and ˝damage - eecs instructional support...

statistical modeling of feedback data in an automatic tuning...

titanium: from java to high performance computing katherine...

latency vs. bandwidth which matters more? katherine yelick...

cs61b l13 lists (1)garcia / yelick fall 2003 © ucb...

use of a high level language in high performance...

scientific applications on multi-pim systems wimps 2002...

9/12/2007cs194 lecture1 shared memory hardware: case study...

cardiac blood flow alpha project impact kathy yelick eecs...

cs294, yelick consensus revisited, p1 cs 294-8 consensus...

eecs 170c lecture week 1 spring 2014 eecs 170c prof. m....

1 titanium review: ti parallel benchmarks kaushik datta...

cs61b l01 introduction (1)garcia / yelick fall 2003 © ucb...

impact of the cardiac heart flow alpha project kathy yelick...

slide 1 adaptive compilers and runtime systems kathy yelick...

cs61b l11 testing & static (1)garcia / yelick fall 2003 ©...