Page 1

Applied Statistics for the Office of Science

Understanding Variability and Bringing Rigor to Scientific Investigation

George Ostrouchov

Statistics and Data Sciences Group

Computer Science and Mathematics Division

Oak Ridge National Laboratory

Page 2

Statistics and Data Sciences

OAK RIDGE NATIONAL LABORATORY

U. S. DEPARTMENT OF ENERGY

George Ostrouchov

Filling a Gap in Statistics to Address Office of Science Needs

ASCR Strategic Plan

“[AMR] weaknesses include an underinvestment or lack of investment in several critical areas: . . . Underinvestment in statistics”

“The following gaps in the [AMR] program have been identified:
  Multiscale mathematics
  Ultrascale algorithms
  Discrete mathematics
  Statistics – investments in this area are required to deal with extracting knowledge from the oceans of data that large-scale simulations will produce.
  Multiphysics”

Through Applied Statistics, ASCR has the opportunity to engage the dominant segment of Applied Mathematics for its goals.

Office of Science Response to the Data Challenge:

The Office of Science will initiate a long-term research program to address the “Curse of Dimensionality.”

Raymond L. Orbach, AAAS, Feb. 19, 2006

U.S. Department of Energy

Office of Science

ORNL Applied Statistics program can address the curse of dimensionality and other Office of Science goals.

Page 3

Statistics Brings Rigor and Efficiency to Scientific Investigation and Technology

Conrad Habicht, Maurice Solovine, and Albert Einstein, the self-styled Olympia Academy, in about 1903. At Einstein’s suggestion, the first book read was Pearson’s “The Grammar of Science.”

CREDIT: IMAGE ARCHIVE ETH-BIBLIOTHEK, ZÜRICH

Karl Pearson (1857-1936)
  “The Grammar of Science” (1892) – Relativity
  First Department of Statistics (1911), UCL
  Founding editor of Biometrika

EXPERIMENTAL

Page 4

Common Evolutionary Steps: Experimental Science and Computational Science

Early computational science relies largely on intuitive design and visual validation
  Computational experiments are expensive
  Petascale data sets are nearly as opaque as real systems – statistical analysis must select what to visualize
  Uncertainty analysis is in its infancy

Statistics is a major partner in bringing computational science to the rigor and efficiency standards of experimental science
  Methods to see through, examine, and classify variability
  Uncertainty quantification
  Statistical design of experiments
  Fusion of data and computational experiment

Page 5

Statistics: the Study of Variability

The discipline concerned with the study of variability, with the study of uncertainty, and with the study of decision-making in the face of uncertainty.

Large scale user of mathematical and computational tools with a focused scientific agenda

Inherently interdisciplinary

Source: [NSF2004] Jon Kettenring, Bruce Lindsay, and David Siegmund, editors, 2004. Statistics: Challenges and Opportunities for the Twenty-First Century,

Cuts through the fog of variability and brings efficiency to science.

Page 6

Mathematics is Biology’s Next Microscope, Only Better

Here are five mathematical challenges that would contribute to the progress of biology.

(1) Understand computation. Find more effective ways to gain insight and prove theorems from numerical or symbolic computations and agent-based models. We recall Hamming: “The purpose of computing is insight, not numbers” (Hamming 1971, p. 31).

(2) Find better ways to model multi-level systems, for example, cells within organs within people in human communities in physical, chemical, and biotic ecologies.

(3) Understand probability, risk, and uncertainty. Despite three centuries of great progress, we are still at the very beginning of a true understanding. Can we understand uncertainty and risk better by integrating frequentist, Bayesian, subjective, fuzzy, and other theories of probability, or is an entirely new approach required?

(4) Understand data mining, simultaneous inference, and statistical de-identification (Miller 1981). Are practical users of simultaneous statistical inference doomed to numerical simulations in each case, or can general theory be improved? What are the complementary limits of data mining and statistical de-identification in large linked databases with personal information?

(5) Set standards for clarity, performance, publication and permanence of software and computational results.

Mathematics, Computer Science, and Statistics are Biology’s Next Microscope, Only Better

Statistics

Multiscale Math

Statistics

Computer Science

Computer Science and Mathematics

Cohen JE (2004). PLoS Biol 2(12): e439

Chemistry’s, Materials’, Astrophysics’, Particle Physics’

Telescope, Device

Fellow AAAS, Fellow AmPhilSoc, Member NAS

Page 7

Particle Physics Embraces Statistics

“… since 1900 … statistics … takes over field after field … [as] … the methodology of choice …

… people in astronomy and physics … are starting to use statistics a lot more for the simple reason that they have to be efficient now.

… I don't see any area where it's being resisted much.”

Bradley Efron
Chair, Department of Statistics, Stanford University, and Max H. Stein Professor of Humanities and Sciences
2005 National Medal of Science Recipient

Page 8

Citations to Statistics Comprise the Dominant Group within Mathematics

Highly Cited Journals in Mathematics

Rank  Journal                        1991-2001 Citations
1.    J. American Statistical Assn.  16,457
2.    Biometrics                     10,854
3.    J. Math. Analysis              9,845
4.    Annals of Statistics           9,702
5.    Proc. Amer. Math Soc.          9,237
6.    C.R. Acad. Sci. Ser. I Math.   9,153
7.    Trans. Amer. Math. Soc.        8,586
8.    Journal of Algebra             8,531
9.    J. Functional Analysis         7,999
10.   Biometrika                     7,911
11.   SIAM J. Numer. Anal.           7,383
12.   Inventiones Mathematicae       7,382
13.   J. Royal Stat. Soc. B          6,575
14.   Mathemat. Programming          6,444
15.   Linear Algebra Appl.           6,112

SOURCE: ISI Essential Science Indicators, Sci. Citation Index (300 Journals in pure mathematics, applied mathematics, statistics and probability)

Highly Cited Authors in Mathematics for period 1991-2001

Rank  Name                   Affiliation               Department / Field  Papers  Citations
1.    Pierre-Louis Lions     University of Paris 9     Mathematics         75      1207
2.    David L. Donoho        Stanford University       Statistics          27      1182
3.    Adrian F.M. Smith      Univ. London              Statistics          40      1026
4.    Elizabeth A. Thompson  U. Washington             Biostatistics       11      973
5.    Iain M Johnstone       Stanford University       Statistics          17      968
6.    Jianqing Fan           Chinese U. Hong Kong      Statistics          53      901
7.    Donald B. Rubin        Harvard University        Statistics          38      854
8.    Ingrid Daubechies      Princeton University      Mathematics         20      807
9.    Adrian E. Raftery      U. Washington             Statistics/Sociol.  31      804
10.   Alan E. Gelfand        U. Connecticut            Statistics          35      747
11.   Sun-Wei Guo            Med. Coll. Wisconsin      Biostatistics       6       737
12.   Scott L. Zeger         Johns Hopkins Univ.       Biostatistics       23      723
13.   Peter J. Green         University of Bristol     Statistics          14      667
14.   Bradley P. Carlin      University of Minnesota   Biostatistics       28      663
15.   J. Stephen Marron      U. North Carolina         Statistics          43      618
16.   David G. Clayton       MRC, Cambridge            Biostatistics       4       598
17.   Gareth O. Roberts      Lancaster Univ.           Statistics          41      598
18.   Albert Cohen           University of Paris       Mathematics         61      572
19.   Michael Rockner        Univ. Bielefeld, Germany  Mathematics         69      572
20.   Yangbo Ye              University of Iowa        Mathematics         42      567
21.   Jinchao Xu             Pennsylvania St. U.       Mathematics         22      566
22.   Xiao-Li Meng           University of Chicago     Statistics          27      561
23.   Matthew P. Wand        Harvard University        Biostatistics       31      558
24.   Wally R. Gilks         MRC                       Biostatistics       16      551
25.   M. Chris Jones         Open University           Statistics          52      542

19 of the top 25 most cited mathematics authors are from Statistics or Biostatistics!

Statistics is Highly Interdisciplinary!
Citations per paper:
  Statistics and Biostatistics – 27
  Rest of Mathematics – 15
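
The citations-per-paper figures appear to follow directly from the Highly Cited Authors table on this slide. The sketch below is one way to check them. It assumes the (field, papers, citations) triples transcribed from that table and treats every field other than "Mathematics" as Statistics or Biostatistics. The names `AUTHORS` and `citations_per_paper` are illustrative, not from the original deck.

```python
# (field, papers, citations) for ranks 1-25, transcribed from the slide's table.
AUTHORS = [
    ("Mathematics", 75, 1207), ("Statistics", 27, 1182),
    ("Statistics", 40, 1026), ("Biostatistics", 11, 973),
    ("Statistics", 17, 968), ("Statistics", 53, 901),
    ("Statistics", 38, 854), ("Mathematics", 20, 807),
    ("Statistics/Sociol.", 31, 804), ("Statistics", 35, 747),
    ("Biostatistics", 6, 737), ("Biostatistics", 23, 723),
    ("Statistics", 14, 667), ("Biostatistics", 28, 663),
    ("Statistics", 43, 618), ("Biostatistics", 4, 598),
    ("Statistics", 41, 598), ("Mathematics", 61, 572),
    ("Mathematics", 69, 572), ("Mathematics", 42, 567),
    ("Mathematics", 22, 566), ("Statistics", 27, 561),
    ("Biostatistics", 31, 558), ("Biostatistics", 16, 551),
    ("Statistics", 52, 542),
]


def citations_per_paper(rows):
    """Pooled ratio: total citations divided by total papers."""
    total_papers = sum(papers for _, papers, _ in rows)
    total_citations = sum(citations for _, _, citations in rows)
    return total_citations / total_papers


# Assumption: anything not labeled "Mathematics" counts as statistics.
stat_rows = [row for row in AUTHORS if row[0] != "Mathematics"]
math_rows = [row for row in AUTHORS if row[0] == "Mathematics"]

print(len(stat_rows), "of", len(AUTHORS), "authors are in statistics fields")
print("Statistics and Biostatistics:", round(citations_per_paper(stat_rows)))
print("Rest of Mathematics:", round(citations_per_paper(math_rows)))
```

Under these assumptions the pooled ratios round to the slide's 27 and 15, and the field count gives 19 of 25.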

Page 9

Statistics Disseminates Data Analysis Ideas Across Science Domains

Of 500 recent citations of Efron’s “Bootstrap” paper, 348 were outside statistics. [NSF2004]

Mitchell’s “Detmax Algorithm” paper: 200+ citations (funded by AMR at ORNL). Citations marked in red are outside statistics.

Page 10

Statistics Core Research Disseminates and Unifies Data Analysis Ideas

Tames the explosion of data analytic methods by
  Providing portability between science domains
  Deriving properties of new data analytic methods
  Building bridges between data analytic methods

Examples:
  Latent Semantic Indexing (Dumais+ 1991) and Correspondence Analysis (Benzecri 1969, 1980, 1992, Greenacre 1984)
  Empirical Orthogonal Functions (Lorenz 1956) and a climate time series application of Principal Components Analysis (Pearson 1901, Hotelling 1933)
  Support Vector Machines (Vapnik 1995) and Logistic Regression (Cox 1970) via hinge loss function (Hastie+ 2001)
  FastMap approximation to Principal Components (Faloutsos+ 1995): Bridge to Convex Hull and new methods, RobustMap (Ostrouchov+ 2005) and to right Householder transformations (Ostrouchov+ 2006)
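
The bridge between Support Vector Machines and Logistic Regression in the examples above can be illustrated by writing both as loss functions of the same quantity, the margin y·f(x) with labels y in {-1, +1}. The sketch below is a minimal illustration under that standard formulation, not code from the original talk. The function names are hypothetical.

```python
import math


def hinge_loss(margin):
    """SVM loss: zero once the margin y*f(x) reaches 1, linear below that."""
    return max(0.0, 1.0 - margin)


def logistic_loss(margin):
    """Logistic regression loss (binomial deviance): log(1 + exp(-margin)).

    Written in a numerically stable form for large negative margins.
    """
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# Compare the two losses over a range of margins.
for margin in (-2.0, -1.0, 0.0, 1.0, 2.0, 4.0):
    print(f"margin={margin:5.1f}  hinge={hinge_loss(margin):.3f}  "
          f"logistic={logistic_loss(margin):.3f}")
```

Both losses penalize misclassified points (negative margin) roughly linearly and become small for confidently correct ones. The main difference is that the hinge loss is exactly zero beyond margin 1, while the logistic loss only decays toward zero.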

Addressing the Curse of Dimensionality

Page 11

Statistics Core

Science Applications

“I … emphasize the symbiotic relationship … between the Statisticians and Astrophysicists …. It is now … clear that there are core common problems …” Bob Nichol (CMU Physics)

Miller, CJ; Genovese, C; Nichol, RC; et al. Controlling the false-discovery rate in astrophysical data analysis. ASTRONOMICAL JOURNAL, 122 (6): 3492-3505 DEC 2001

Miller, CJ; Nichol, RC; Batuski, DJ. Acoustic oscillations in the early universe and today. SCIENCE, 292 (5525): 2302-2303 JUN 22 2001

Science publication on Big Bang while others still plow through plethora of data

Quantitative Rigor for Science: Transfer From Medicine via Core Statistics to Big Bang

False Discovery Rate: “Interdisciplinary” “Decision-making in the face of uncertainty”

Family-wise error rate of statistical tests (each at level 0.05, assuming independent tests):
  One test: 0.05 probability of a false positive
  Fifty tests: 0.92 probability of at least one false positive; need simultaneous inference (SI)
  Thousand tests: SI too conservative; need FDR
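
The family-wise error rates above follow from 1 − (1 − α)^m for m independent tests at level α. The sketch below reproduces that arithmetic and contrasts two corrections. Bonferroni stands in for simultaneous inference, and the Benjamini-Hochberg step-up procedure stands in for FDR control. Both choices are assumptions for illustration, since the slide does not name specific procedures. The p-values are made up for the example.

```python
def family_wise_error_rate(num_tests, alpha=0.05):
    """P(at least one false positive) for independent tests, all nulls true."""
    return 1.0 - (1.0 - alpha) ** num_tests


def bonferroni_reject(p_values, alpha=0.05):
    """Simultaneous inference: test each hypothesis at level alpha / m."""
    m = len(p_values)
    return [p <= alpha / m for p in p_values]


def benjamini_hochberg_reject(p_values, q=0.05):
    """FDR control: reject the k smallest p-values, where k is the largest
    rank with p_(k) <= k * q / m."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    cutoff_rank = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank * q / m:
            cutoff_rank = rank
    reject = [False] * m
    for idx in order[:cutoff_rank]:
        reject[idx] = True
    return reject


for m in (1, 50, 1000):
    print(f"{m:5d} tests: FWER = {family_wise_error_rate(m):.3f}")

# Hypothetical p-values, for illustration only.
p_values = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.216]
print("Bonferroni rejections:", sum(bonferroni_reject(p_values)))
print("Benjamini-Hochberg rejections:", sum(benjamini_hochberg_reject(p_values)))
```

On these ten p-values Bonferroni rejects one hypothesis and Benjamini-Hochberg rejects two. That gap is the sense in which simultaneous inference grows too conservative as the number of tests increases.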

Statistics core is the hub that disseminates and unifies data analysis ideas.

Critical mass engagement is needed to reap short term and long term returns.

Source: [NSF2004] Jon Kettenring, Bruce Lindsay, and David Siegmund, editors, 2004. Statistics: Challenges and Opportunities for the Twenty-First Century,

Page 12

Engage Core Statistics for OASCR Goals

A gap exists between statistics research and simulation science
  Engage statistics with leadership computing
  Engage statistics with simulation science data
  Engage statistics with Office of Science experimental data (neutron science)

Statistics Core

Science Applications

Computational Chemistry

Climate Simulation

Fusion Simulation

Combustion Simulation

Superscalable Algorithms

Neutron Science

Astrophysics Simulation

Genome Science

Tuning Leadership Facilities

Ontologies for Energy