SALSA SALSA Childhood Obesity Studies with Multicore Robust Data Mining Proposal Review Meeting with CTSI Translating Research Into Practice Project Development Team, July 8, 2009, IUPUI Gil Liu, Judy Qiu, Craig Stewart Contact [email protected] www.infomall.org/salsa Research Technology, UITS Community Grids Laboratory, PTI Children’s Health Service Indiana University

Page 1

Childhood Obesity Studies with Multicore Robust Data Mining

Proposal Review Meeting with CTSI Translating Research Into Practice Project Development Team, July 8, 2009, IUPUI

Gil Liu, Judy Qiu, Craig Stewart

Contact [email protected] www.infomall.org/salsa

Research Technology, UITS

Community Grids Laboratory, PTI

Children’s Health Service

Indiana University

Page 2

Obesogenic Environment

• Environmental factors that increase caloric intake and decrease energy expenditure: “…so manifold and so basic as to be inseparable from the way we live.” Margaret Talbot (New America Foundation)

• “The current U.S. environment is characterized by an essentially unlimited supply of convenient, inexpensive, palatable, energy-dense foods coupled with a lifestyle requiring negligible amounts of physical activity for subsistence.” Hill & Peters 2001

• “Genes load the gun, and environment pulls the trigger.” G Bray 1998

Page 3

Page 4

Distribution of Visits by Year and Frequency

Year    # of visits
2004    43005
2005    45271
2006    45300
2007    54707

# of visits per patient    Percent
1 only                     44%
2 or more                  46%
3 or more                  22%
4 or more                  11%
5 or more                  6%

Page 5

Page 6

Zones of Analysis Centered on Subject’s Residence

Page 7

[Map legend: Generalized Land Use Categories. Residential density in units/acre: very low density (0-2), low density (2-5), medium density (5-15), high density (> 15). Other categories: commercial light, commercial office, commercial heavy, industrial light, industrial heavy, special use, parks, roads, water, interstates, vacant / agricultural. Scale bar: 0 to 2 miles.]

Page 8

The Environment

Variables of the Built Environment Selected for Study:

• GREENNESS

• Normalized Difference Vegetation Index (NDVI)

• Healthy green biomass
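NDVI is conventionally computed per pixel from near-infrared (NIR) and red reflectance as (NIR - Red) / (NIR + Red). A minimal sketch of that standard definition follows; the function name `ndvi`, the zero-denominator convention, and the sample reflectance values are illustrative assumptions, not taken from the study.

```python
def ndvi(nir: float, red: float) -> float:
    """Normalized Difference Vegetation Index for one pixel.

    Standard definition: (NIR - Red) / (NIR + Red), which lies in [-1, 1]
    for non-negative reflectances. Returning 0.0 when both bands are zero
    is an assumed convention for this sketch.
    """
    total = nir + red
    if total == 0:
        return 0.0
    return (nir - red) / total


# Hypothetical reflectance values, chosen only to show the contrast
# between dense vegetation and sparsely vegetated ground.
dense_vegetation = ndvi(nir=0.50, red=0.08)
sparse_ground = ndvi(nir=0.30, red=0.25)
print(round(dense_vegetation, 3), round(sparse_ground, 3))
```

Higher values indicate more healthy green biomass, which is why the index serves as the greenness measure here.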

Note (gcliu): after the basics, mention previous work showing greater explanatory power compared to NDVI, TVI, & Tasseled Cap

Page 9

Variables

• Dependent
– 2-year change in BMI z-score (t2 - t1)

• Covariates
– Age, race/ethnicity, sex
– Baseline z-BMI (linear, quadratic, cubic)
– Health insurance status
– Census tract median family income (log)
– Index year
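The derived terms in this variable list can be sketched in a few lines. This is only an illustration of how the dependent variable and the transformed covariates might be built for one child; the function name `model_row`, the dictionary keys, and the choice of natural logarithm for income are assumptions, since the slide does not specify them.

```python
import math


def model_row(z_bmi_t1: float, z_bmi_t2: float, tract_median_income: float) -> dict:
    """Build the outcome and transformed covariates for one subject.

    Categorical covariates (race/ethnicity, sex, insurance status, index
    year) are omitted; they would enter a regression as indicator columns.
    """
    return {
        # Dependent variable: 2-year change in BMI z-score (t2 - t1).
        "delta_z_bmi": z_bmi_t2 - z_bmi_t1,
        # Baseline z-BMI entered as linear, quadratic and cubic terms.
        "z1": z_bmi_t1,
        "z1_sq": z_bmi_t1 ** 2,
        "z1_cu": z_bmi_t1 ** 3,
        # Log of census tract median family income (natural log assumed).
        "log_income": math.log(tract_median_income),
    }


# Hypothetical subject, for illustration only.
print(model_row(z_bmi_t1=1.0, z_bmi_t2=1.5, tract_median_income=50000.0))
```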

Page 10

Linear Regression Models of 2-year change in z-BMI

                      NDVI Only    Residential Density Only    NDVI and Residential Density
                      B            B                           B
NDVI                  -0.52 ***                                -0.69 ***
Residential Density                -0.01                       -0.01 **

*** p < .01
** p >= .01 and <= .05
a Standard errors adjusted for neighborhood-level clustering
b Controlled for age, race/ethnicity, baseline zBMI (linear, quadratic, cubic terms), sex, health insurance status, census tract median family income, index year

Page 11

Potential Pathways and Mechanisms

• Places that promote outside play and physical activity

• “Territorial personalization”

• Improved mental health, self-esteem, reduced stress

Page 12

Collaboration of SALSA Project

Indiana University IT SALSA Team: Geoffrey Fox, Xiaohong Qiu, Scott Beason, Seung-Hee Bae, Jaliya Ekanayake, Jong Youl Choi, Yang Ruan

Microsoft Research, Industry Technology Collaboration:
• Dryad: Roger Barga
• CCR: George Chrysanthakopoulos
• DSS: Henrik Frystyk Nielsen

Application Collaborators:
• Bioinformatics, CGB: Haixu Tang, Mina Rho, Qunfeng Dong
• IU Medical School: Gilbert Liu
• IUPUI Polis Center (GIS): Neil Devadasan
• Cheminformatics: Rajarshi Guha, David Wild

PTI/UITS RT: Craig Stewart, William Barnett, Scott McCaulay

Page 13

Components of Data Intensive Computing System

[Diagram labels: Hardware, Application, Software, Data]

Data: Developing and applying parallel and distributed cyberinfrastructure to support large-scale data analysis.

• Childhood Obesity Studies (314,932 patient records / 188 dimensions)
• Indiana census 2000 (65,535 GIS records / 54 dimensions)
• Biology gene sequence alignments (640 million / 300 to 400 base pairs)
• Particle physics LHC (1 terabyte of data placed in the IU Data Capacitor)

Page 15

Components of Data Intensive Computing System

[Diagram labels: Hardware, Application, Data, Software]

Software: The exponentially growing volumes of data require robust high-performance tools.

• Parallelization frameworks
– MPI for high-performance clusters of multicore systems
– MapReduce for Cloud/Grid systems (Hadoop, Dryad)

• Data mining algorithms and tools
– Deterministic Annealing Clustering (VDAC)
– Pairwise Clustering
– Multi Dimensional Scaling (Dimension Reduction)
– Visualization (Plotviz)

Page 16

Components of Data Intensive Computing System

[Diagram labels: Hardware, Software, Data, Application]

Application: Data Intensive (Science) Applications

• Health
• Biology
• Chemistry
• Particle Physics LHC
• GIS

Page 17

Deterministic Annealing Clustering of Indiana Census Data

Decrease temperature (distance scale) to discover more clusters

Distance Scale (Temperature): 0.5

Red is coarse resolution with 10 clusters

Blue is finer resolution with 30 clusters

Clusters find cities in Indiana

Distance Scale is Temperature
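The idea that lowering the temperature reveals more clusters can be illustrated with a toy example. The following is a deliberately simplified one-dimensional sketch of deterministic annealing with Gibbs soft assignments, in the spirit of the Rose reference; it is not the SALSA or VDAC implementation. The function names, the cooling schedule, the jitter used to let coincident centers separate, and the sample data are all assumptions made for illustration.

```python
import math


def soft_assign(x, centers, temperature):
    """Gibbs probabilities p(k|x) proportional to exp(-(x - y_k)^2 / T)."""
    dists = [(x - c) ** 2 for c in centers]
    d_min = min(dists)  # subtract the minimum for numerical stability
    weights = [math.exp(-(d - d_min) / temperature) for d in dists]
    total = sum(weights)
    return [w / total for w in weights]


def count_distinct(centers, tol=0.5):
    """Count centers more than `tol` apart; coincident centers count once."""
    ordered = sorted(centers)
    return 1 + sum(1 for a, b in zip(ordered, ordered[1:]) if b - a > tol)


def anneal(data, n_centers, t_start, t_stop, cooling=0.9, sweeps=50, jitter=1e-3):
    """Cool from t_start towards t_stop, re-estimating centers at each step."""
    mean = sum(data) / len(data)
    centers = [mean] * n_centers  # at high temperature everything sits at the mean
    history = []
    t = t_start
    while t > t_stop:
        # Tiny deterministic offsets let coincident centers split
        # once the temperature is low enough.
        centers = [c + jitter * (k - (n_centers - 1) / 2) for k, c in enumerate(centers)]
        for _ in range(sweeps):
            probs = [soft_assign(x, centers, t) for x in data]
            for k in range(n_centers):
                weight = sum(p[k] for p in probs)
                if weight > 0:
                    centers[k] = sum(p[k] * x for p, x in zip(probs, data)) / weight
        history.append((t, count_distinct(centers)))
        t *= cooling
    return sorted(centers), history


# Hypothetical data: two well-separated groups on a line.
data = [-0.2, 0.0, 0.2, 9.8, 10.0, 10.2]
centers, history = anneal(data, n_centers=2, t_start=200.0, t_stop=0.5)
print("distinct clusters at the first temperature:", history[0][1])
print("distinct clusters at the last temperature:", history[-1][1])
```

At high temperature every point belongs almost equally to every center, so the centers coincide at the data mean; as the temperature (the distance scale) drops below the spread of the data, the centers separate. Allowing more centers and a lower final temperature would expose finer structure, as in the 10-cluster and 30-cluster views.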

Page 18

Various Sequence Clustering Results

• 4500 Points: Pairwise Aligned

• 4500 Points: Clustal MSA, map distances to 4D sphere before MDS

• 3000 Points: Clustal MSA, Kimura2 distance

Page 19

Initial Obesity Patient Data Analysis

• 2000 records, 6 clusters

• 4000 records, 8 clusters (refinement of 3 of the clusters at left into 5)

Page 20

[Figure: PWDA, Parallel Pairwise data clustering by Deterministic Annealing, run on a 24-core computer (June 11 2009). Y-axis: Parallel Overhead, -0.4 to 0.9. X-axis: Parallel Pattern (Thread x Process x Node), with patterns 1x1x1, 2x1x1, 4x1x1, 8x1x1, 16x1x1, 24x1x1, 1x2x1, 1x4x1, 1x8x1, 1x16x1, 1x24x1, 1x1x2, 1x1x4, 1x1x8, 1x1x16, 1x1x24, in groups labeled Threading, Intra-node MPI, and Inter-node MPI. Series: Patient2000, Patient4000, Patient10000.]

Page 21

[Figure: Parallel Pairwise Clustering PWDA Speedup Tests on eight 16-core Systems (6 Clusters, 10,000 Patient Records), June 11 2009. Threading with Short Lived CCR Threads. Y-axis: Parallel Overhead, -0.6 to 0.2. X-axis: Parallel Patterns (# Thread/process) x (# MPI process/node) x (# node), grouped by total parallelism: 2-way, 4-way, 8-way, 16-way, 32-way, 48-way, 64-way, 128-way.]

Page 22

Pairwise Sequence Distance Calculation

• Perform all possible pairwise sequence alignments for a given set of genomic sequences.

• Alignments are performed using the Smith-Waterman (local) sequence alignment algorithm.

• Currently we are able to perform ~640 million alignments (300 to 400 base pairs) in ~4 hours using the Tempest cluster.

• This represents one of the largest datasets we have analyzed.
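For readers unfamiliar with the method, here is a minimal sketch of Smith-Waterman local alignment scoring together with the all-pairs count. It is not the code used on the cluster: the scoring values (match +2, mismatch -1, linear gap -1) and the function names are assumptions chosen for illustration, and only the score is computed, not the traceback.

```python
def smith_waterman_score(a: str, b: str, match: int = 2, mismatch: int = -1, gap: int = -1) -> int:
    """Best local alignment score between a and b with a linear gap penalty.

    Keeps only two rows of the dynamic-programming matrix, which is enough
    when the alignment itself (the traceback) is not needed.
    """
    prev = [0] * (len(b) + 1)
    best = 0
    for ca in a:
        curr = [0]
        for j, cb in enumerate(b, start=1):
            diag = prev[j - 1] + (match if ca == cb else mismatch)
            # Local alignment: scores never drop below zero.
            score = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            curr.append(score)
            best = max(best, score)
        prev = curr
    return best


def pair_count(n: int) -> int:
    """Number of unordered pairs among n sequences: n * (n - 1) / 2."""
    return n * (n - 1) // 2


print(smith_waterman_score("TTACGTTT", "ACGT"))
print(pair_count(1000))
```

The pair count grows quadratically with the number of sequences, which is what makes the all-pairs computation expensive. As a side note, `pair_count(1000)` equals 499,500, the per-run alignment total reported in the timing table, which is consistent with test runs over 1000 sequences (the slides do not state the sequence count).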

Pattern   Parallelism  Total pairwise alignments  Actual time (ms)  Overhead       Nodes  Processes  Threads  ms/alignment  Days per 640 million alignments
1x1x1     1            499500                     7496846           0              1      1          1        15.0087       111.1756
1x8x1     8            499500                     925544            -0.012337722   1      8          1        1.852941      13.72549
1x4x2     8            499500                     983639            0.049656349    1      4          2        1.969247      14.58702
1x2x4     8            499500                     1048946           0.119346456    1      2          4        2.099992      15.5555
1x1x8     8            499500                     1332675           0.422118048    1      1          8        2.668018      19.7631
1x16x1    16           499500                     499500            0.066048309    1      16         1        1             7.407407
1x8x2     16           499500                     515269            0.099702995    1      8          2        1.03157       7.641256
1x4x4     16           499500                     556739            0.188209548    1      4          4        1.114593      8.256241
1x2x8     16           499500                     772563            0.648827787    1      2          8        1.546673      11.45683
1x1x16    16           499500                     1266255           1.702480483    1      1          16       2.535045      18.77811
1x24x1    24           499500                     436759            0.398216797    1      24         1        0.874392      6.476981
1x1x24    24           499500                     1242180           2.976648313    1      1          24       2.486847      18.42109
32x1x24   768          499500                     50155             4.138032714    32     1          24       0.10041       0.743781
32x24x1   768          499500                     22359             1.290524842    32     24         1        0.044763      0.331576
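The derived columns of this table can be reproduced from the raw timings. The slides do not state the formulas, but the overhead values match f = (P * T(P) - T(1)) / T(1), where P is the parallelism and T the actual time, and the last two columns follow from simple unit conversion. The sketch below uses that inferred definition; the function names are illustrative.

```python
MS_PER_DAY = 24 * 60 * 60 * 1000  # 86,400,000 milliseconds in one day


def parallel_overhead(t1_ms: float, tp_ms: float, parallelism: int) -> float:
    """Overhead f = (P * T(P) - T(1)) / T(1); zero means perfect scaling."""
    return (parallelism * tp_ms - t1_ms) / t1_ms


def speedup(t1_ms: float, tp_ms: float) -> float:
    """Speedup T(1) / T(P), which equals P / (1 + f) for the overhead above."""
    return t1_ms / tp_ms


def ms_per_alignment(time_ms: float, alignments: int) -> float:
    return time_ms / alignments


def days_for(ms_per_align: float, n_alignments: float) -> float:
    """Projected wall-clock days to run n_alignments at the given rate."""
    return ms_per_align * n_alignments / MS_PER_DAY


# Timings taken from the 1x1x1 and 1x8x1 rows of the table.
t1, t8 = 7496846, 925544
f = parallel_overhead(t1, t8, parallelism=8)
rate = ms_per_alignment(t1, 499500)
print(f, speedup(t1, t8), rate, days_for(rate, 640e6))
```

A negative overhead, as in the 1x8x1 row, corresponds to slightly better than linear speedup, while the large values for the thread-heavy patterns (for example 1x1x24) indicate that threading scaled much worse than MPI processes in these runs.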

[Figure: Parallel Pattern vs. Overhead. X-axis: Pattern (nodes x processes x threads), from 1x1x1 to 32x1x24. Y-axis: Overhead, -0.5 to 4.5.]

Page 23

• MDS of 635 Census Blocks with 97 Environmental Properties

• Shows expected correlation with the principal component: color varies from greenish to reddish as the projection of the leading eigenvector changes value

• Ten color bins used

Page 24

Canonical Correlation

• Choose vectors $a$ and $b$ such that the random variables $U = a^{T}X$ and $V = b^{T}Y$ maximize the correlation $\rho = \mathrm{cor}(a^{T}X, b^{T}Y)$.

• $X$: Environmental Data

• $Y$: Patient Data

• Use R to calculate $\rho = 0.76$
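For reference, the maximization can be written out in terms of covariance matrices. This is the standard textbook formulation of canonical correlation analysis, not something stated on the slide; the symbols $\Sigma_{XX}$, $\Sigma_{YY}$ and $\Sigma_{XY}$ (the covariance of $X$, the covariance of $Y$, and their cross-covariance, with $\Sigma_{YX} = \Sigma_{XY}^{T}$) are introduced here for illustration.

```latex
\rho \;=\; \max_{a,\,b}\;
  \frac{a^{T}\Sigma_{XY}\,b}
       {\sqrt{a^{T}\Sigma_{XX}\,a}\;\sqrt{b^{T}\Sigma_{YY}\,b}},
\qquad
\Sigma_{XX}^{-1}\,\Sigma_{XY}\,\Sigma_{YY}^{-1}\,\Sigma_{YX}\;a \;=\; \rho^{2}\,a,
\qquad
b \;\propto\; \Sigma_{YY}^{-1}\,\Sigma_{YX}\,a .
```

The first canonical correlation is therefore the square root of the largest eigenvalue of the matrix in the middle expression, which is the quantity a routine such as R's `cancor` reports first.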

Page 25

MDS and Canonical Correlation

• Projection of First Canonical Coefficient between Environment and Patient Data onto Environmental MDS

• Keep smallest 30% (green-blue) and top 30% (red-orchid) in numerical value

• Remove small values (< 5% of the mean in absolute value)

Page 26

References

• K. Rose, "Deterministic Annealing for Clustering, Compression, Classification, Regression, and Related Optimization Problems," Proceedings of the IEEE, vol. 86, pp. 2210-2239, November 1998.

• T. Hofmann and J. M. Buhmann, "Pairwise data clustering by deterministic annealing," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 19, pp. 1-13, 1997.

• H. Klock and J. M. Buhmann, "Data visualization by multidimensional scaling: a deterministic annealing approach," Pattern Recognition, vol. 33, no. 4, pp. 651-669, April 2000.

• R. A. Granat, "Regularized Deterministic Annealing EM for Hidden Markov Models," Ph.D. thesis, University of California, Los Angeles, 2004. (We use this for earthquake prediction.)

• G. Fox, S.-H. Bae, J. Ekanayake, X. Qiu, and H. Yuan, "Parallel Data Mining from Multicore to Cloudy Grids," Proceedings of HPC 2008 High Performance Computing and Grids Workshop, Cetraro, Italy, July 3, 2008.

• Project website: www.infomall.org/salsa