new software

28
New Software Bob Nichol ICG, Portsmouth Thanks to all my colleagues in SDSS, AG2, Votech, GRIST & PiCA Special thanks to Chris Miller, Alex Gray, Gordon Richards, Brent Bryan, Chris Genovese, Ryan Scranton, Larry Wasserman, Jeff Schneider

Upload: bedros

Post on 28-Jan-2016

40 views

Category:

Documents


0 download

DESCRIPTION

New Software. Bob Nichol ICG, Portsmouth. Thanks to all my colleagues in SDSS, AG2, Votech, GRIST & PiCA Special thanks to Chris Miller, Alex Gray, Gordon Richards, Brent Bryan, Chris Genovese, Ryan Scranton, Larry Wasserman, Jeff Schneider. Outline. - PowerPoint PPT Presentation

TRANSCRIPT

Page 1: New Software

New Software

Bob Nichol

ICG, Portsmouth

Thanks to all my colleagues in SDSS, AG2, Votech, GRIST & PiCA

Special thanks to Chris Miller, Alex Gray, Gordon Richards, Brent Bryan, Chris Genovese, Ryan Scranton, Larry

Wasserman, Jeff Schneider

Page 2: New Software

Phystat 2005 Oxford

Outline1. Cosmological data is exploding in both size and

complexity (see Szalay’s talk)

2. Can present software handle this data quality and quantity?

3. Also present models lack detail to understand the data which drives us towards non-parametric techniques.

4. Important synergies between CS, statistics and domain sciences

Jim reviewed existing statistical software, I will discuss new software we may need

Page 3: New Software

Gravity correlates everythingThe 2-point function ((r)) has a long history in cosmology (Peebles 1980). It is the excess joint

probability (dP12) of a pair of points over that expected from a Poisson process.

dP12 = n2 dV1 dV2 [1 + (r)]

dP123=n3dV1dV2dV3[1+23(r)+13(r)+12(r)+123(r)]

dV1 dV2r

Also, marked correlation functions

Page 4: New Software

Same 2pt, different 3pt

Motivation for the N-point functions: Measure of the topology of the large-scale structure in universe

Page 5: New Software

Multi-resolutional KD-treesMulti-resolutional KD-treesScale to n-dimensions (although for very high

dimensions use new tree structures)Use Cached Representation (store at each node

summary sufficient statistics). Compute counts from these statistics

Prune the tree which is stored in memory! (Moore et al. 2001 astro-ph/0012333)

Many applications; suite of algorithms!See Alex Gray’s talk tomorrow

Page 6: New Software

Top Level

1st Level

2nd Level

5th Level

Page 7: New Software

Just a set of range searchesJust a set of range searches

Prune cells outside range

Also Prune cells inside!Greater saving in time

Page 8: New Software

Phystat 2005 Oxford

N1 dmax

dmin

Usually binned into annuli

rmin< r < rmax

Thus, for each r transverse both trees and prune pairs of nodes

No count

dmin < rmax or dmax < rmin

N1 x N2

rmin > dmin and rmax< dmax

N2

Therefore, only need to calculate pairs cutting the boundaries.

Scales to n-point functions also do all r values at once

Dual Tree AlgorithmDual Tree Algorithm

Page 9: New Software

Can also parallelize on the Grid!

bins

Surveysize

Page 10: New Software

Phystat 2005 Oxford

Code Available• The code is freely available

http://www.autonlab.org/autonweb/showSoftware/127/• Added “Monte carlo sampling” to code for faster answers:

approximate answers• Added to AstroGrid suite of algorithms: “aimed at building a data-grid

for UK astronomy, which will form the UK contribution to a global Virtual Observatory”

• VOtech is an EU-funded R&D project to explore the use of advanced algorithms in the IOVA infrastructure

• Broker developed to handle submission of jobs to National Grid Service (NGS) - the UK's core production computational and data grid service. Also using US Teragrid (5000 correlations in one weekend).

Page 11: New Software

Phystat 2005 Oxford

Page 12: New Software

Phystat 2005 Oxford

EuroVO• The Euro-VO Data Centre Alliance (DCA):

– A collaborative and operational network of European data centres who publish data and metadata to the Euro-VO and who provide a research infrastructure of GRID-enabled processing and storage facilities.

• The Euro-VO Facility Centre (VOFC): – An organization that provides the Euro-VO for centralized

resource registry, standards definition and promotion as well as community support for VO technology take-up and scientific program support using VO technologies and resources.

• The Euro-VO Technology Centre (VOTC): – A distributed organisation that coordinates a set of research and

development projects on VO technology, systems and tools.

Page 13: New Software

Phystat 2005 Oxford

EuroVO: VOTech Project

• 6.6M Euro Design Study under EU FP6, • Aims:

– Complete all technical preparatory work necessary for the construction of the European Virtual Observatory,

– Responsible for development of infrastructure and tools:

• Intelligent resource discovery (ontology and the semantic web), data interoperability, data mining, and visualisation capabilities.

– Provide the ability to offload mass scale computational process onto the Enabling Grids for E-sciencE (EGEE) backbone.

Page 14: New Software

Phystat 2005 Oxford

Existing infrastructure

• VOTech is tasked with building upon existing infrastructure:

• In particular:– IVOA for standards,–AstroGrid for middleware:

• Web Services based,

• Presumably IVOA will continue to look towards other standards bodies:

–World Wide Web Consortium (W3C),–Global Grid Forum (GGF),–…

Page 15: New Software

Phystat 2005 Oxford

IVOA Standards (Recommendations)

VOTable Format Definition Version 1.1:– An XML language,

• Flexible storage and exchange format for tabular data: Emphasis on astronomical tables,

– Allows meta data and data to be stored separately with links to remote data.

• Resource Metadata for the VO Version 1.01:– For describing what data and computational facilities are

available, and once identified how to use them.• Unified Content Descriptor (UCD) (Proposed):

– A formal (and restricted) vocabulary for astronomical data.

• IVOA Identifiers Version 1.10 (Proposed):– Syntax for globally unique resource names.

Page 16: New Software

Phystat 2005 Oxford

AstroGrid Components• MySpace: Distributed file store for workflows,results,• Common Execution Architecture (CEA):

– Codes need wrapping before use,– Take command line apps and present as a Web Service.

• Algorithm Registry:– Meta data from wrapped codes are published in a yellow pages, for

searching.• Portal:

– Web interface for interacting with preceding services,– Workflow: Coordinate data flow/control of components within a larger

system of work,– Submit jobs and observe status, and access files in MySpace.

• Dashboard/Workbench:– Interact with MySpace, Registry, CEA from any language that provides

XML-RPC library. Web Start application.

Page 17: New Software

Phystat 2005 Oxford

AG Portal

AG rollout and access via portal

Page 18: New Software

Phystat 2005 Oxford

Non-parametric Techniques• The complexicity and wealth of the data

demands non-parametric techniques, ie., can one describe phenomena using the least amount of assumptions?

• The challenges here are computational as well as psychological

Page 19: New Software

Phystat 2005 Oxford

CMB Power Spectrum

Before WMAPWMAP data

Are the 2nd and 3rd

peaks detected?

Page 20: New Software

Phystat 2005 Oxford

In parametric models of the CMB power spectrum the answer is likely “yes” as all CMB models have multiple peaks. But that has not really answered our question!

Can we answer the question non-parametrically e.g.,Yi = f(Xi) + ci

Where Yi is the observed data, f(Xi) is an orthogonal function (icos(iXi)), ci is the covariance matrix. The challenge is to “shrink” f(Xi), we use

• Beran (2000) to strink f(Xi) to N terms equal to the number of data points - optimal for all smooth functions and provides valid confidence intervals

• Monotonic shrinkage of i - specifically nested subset

selection (NSS)

See Genovese et al. (2004) astro-ph/0410104

Page 21: New Software

Phystat 2005 Oxford

Results(optimal smoothing through bias-variance trade-off)

ConcordanceOur f(Xi)

Note, WMAP only fit is not same as concordance model

Page 22: New Software

Phystat 2005 Oxford

Testing models• The main advantage of this method is that we can construct a

“confidence ball” (in N dimensions) around f(Xi) and thus perform non-parametric interferences e.g. is the second peak detected?

Not at 95% confidence!

Page 23: New Software

Phystat 2005 Oxford

Information Contentfh(Xi) = f(Xi) + b*h

Beyond here thereis little information

Page 24: New Software

Phystat 2005 Oxford

Using CMBfast we can make parametric models (11 parameters) and test if they are within the “confidence ball”. Varying b we get a range of 0.0169 to 0.0287

Gray are models in the 95% confidence ball

ASA “Outstanding Application of the year” (2005)

Page 25: New Software

Phystat 2005 Oxford

Testing in high D

• Now we can now jointly search all 11 parameters in the parametric models and determine which models fit in the confidence ball (at 95%).

• Traditionally this is done by marginalising over the other parameters to gain confidence intervals on each parameter separately. This is a problem in high-D where the likelihood function could be degenerate, ill-defined and under-identified

• This is computational intense as billions of models need to searched, each of which takes ~minute to run

• Use Kriging methods to predict where the surface of the confidence ball exists and test models there.

Page 26: New Software

Phystat 2005 Oxford

7D parameter space

400000 samples

Cyan : 0.5Purple: 1Blue : 1.5Red : 2Green : >2

Bimodal dist. for several parameters

Page 27: New Software

Phystat 2005 Oxford

Other applications

• Physics of CMB is well-understood and people counter that parametric analyses are better (including Bayesian methods) [however, concerns about CI]

• Other areas of astrophysics have similar data problems, but the physics is less developed– Galaxy and quasar spectra (models are still

rudimentary) – Galaxy clustering (non-linear gravitational effects

are not confidently modeled)– Galaxy properties (e.g. star-formation rate)

Page 28: New Software

Phystat 2005 Oxford

Summary

• Existing statistical software can’t scale-up to the next generation of datasets. Nor does it exploit the “Grid”

• Need to explore non-parametric statistics• Software will be made available via AstroGrid

and IVOA - seemless access to data, computational resources and algorithms