24/25 October 2002 SDMIV workshop – Julian Gallop 1

Potential applications in CLRC/RAL collaborations

Julian Gallop

October 2002


commercial / scientific

• Data mining well known in commercial applications

– should the own brand cornflakes be located next to the beer

• Less well known in scientific applications

• Among scientists, it’s common to find

– “not sure that what I need is data mining, but instead ….”

• Perhaps data mining is regarded too narrowly


Definitions

• an early (1991) definition of Knowledge Discovery in Databases (KDD) was given as:

– "the non-trivial extraction of implicit, previously unknown, and potentially useful information from data" (Frawley et al. 1991).

• this was subsequently (1996) revised to:

– "the non-trivial process of identifying valid, potentially useful and ultimately understandable patterns in data" (Fayyad et al. 1996).

• data mining is one step in the KDD process - concerned with applying computational techniques to find patterns in data


CLRC scientific fields and collaborations

• Sciences: space, earth observation, particle physics, microstructures, synchrotron radiation . . .

• Holds (or provides access to) significant data collections

• Partnerships between E-science centre, BITD, computational science and science departments

• E-science projects include:

– Ones that are mainly CLRC (e.g. Data Portal)

– UK e-science collaborations (e.g. Astrogrid, NERC Data Grid, gViz)

– EU collaborations (e.g. DataGrid)

– And also the UK Grid Support Centre


Sample CLRC e-science project – Data Portal

• Data Portal project – pilot project within CLRC:

– To enable a scientist to discover, explore and retrieve disparate datasets through one interface, independent of the data location.

– CLRC sciences - space science, synchrotron science and neutron science - as well as e-science and IT.

– Part of the work is the development of a scientific metadata model


Sample e-science projects involving CLRC

• Astrogrid (UK)

– Building a virtual observatory

– Ideas on data mining:

• Finding: association rules; deviations from a rule; similarity; clustering and classification

• DataGrid (EU): aims to enable next-generation scientific exploration, which requires intensive computation and analysis of shared large-scale databases (millions of gigabytes) across widely distributed scientific communities.

– Applications are: biomedical, earth observation, particle physics

• NERC Data Grid (UK)


NERC Data Grid

• Funded by NERC & UK e-science core programme

• Involves:

– CLRC (RAL & DL – including British Atmospheric Data Centre)

– Program for Climate Model Diagnosis and Intercomparison (PCMDI) (U.S. Lawrence Livermore National Laboratory)

• Relevant to:

– energy; water management; food chain; health; weather risk


NERC Data Grid – relevance to knowledge discovery

• Aims to address the problems that

– at present searching metadata to discover and retrieve what you want is a manual process

– Datasets in multiple locations involve multiple logins and retrieval in multiple formats

• Indicators of success:

– It will be possible to find, reformat and visualize disparate datasets from disparate organisations within one organisation

– Ability to test data and comparison ideas without learning foreign formats and establishing personal relationships every time

• Clearly will provide a basis for knowledge discovery if successful


Earth observation instruments

• For example ENVISAT

• Instrument AATSR

• Low orbit, 14 orbits/day

• Returns to same place every 3 days

• Picture shows plume from Mt Etna in 2001 (previous instrument ATSR2)

• NASA AQUA: TBs of data per day


Earth observation patterns

• For particular location, what patterns emerge on:

– A daily basis

– Or a yearly basis

• Knowing the conventional pattern day by day, can observe out of the ordinary events e.g. an oil slick
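The idea of comparing each day against the conventional pattern can be sketched as a simple per-location deviation test. This is only an illustration, not a method taken from the talk: the function name `find_anomalies`, the three-standard-deviation threshold and the sample readings are all assumptions.

```python
from statistics import mean, stdev

def find_anomalies(history, today, threshold=3.0):
    """Return indices of locations whose reading today deviates from that
    location's own historical mean by more than `threshold` standard deviations."""
    anomalies = []
    for i, (past, value) in enumerate(zip(history, today)):
        mu = mean(past)
        sigma = stdev(past)
        if sigma > 0 and abs(value - mu) > threshold * sigma:
            anomalies.append(i)
    return anomalies

# Hypothetical readings for three locations over five earlier days.
history = [
    [10.0, 10.2, 9.9, 10.1, 9.8],
    [20.0, 19.8, 20.1, 20.2, 19.9],
    [15.0, 15.1, 14.9, 15.0, 15.2],
]
today = [10.1, 14.0, 15.1]  # location 1 is far from its usual value

print(find_anomalies(history, today))  # [1]
```

A real earth observation archive would also need to separate the daily and yearly cycles before a test like this is meaningful.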


climateprediction.net

• Makes use of spare compute capacity on office and home PCs to run a climate prediction model

• Different PCs run different parameters and collectively run a Monte Carlo simulation

• Results will be studied to find out which subsets of the parameter space correspond to observation

• Better understanding of uncertainties

• Public understanding of climate change

• Oxford University, CLRC RAL, Reading University, with the Met Office and the Open University
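The Monte Carlo approach described above (many PCs, each running a different parameter choice, followed by a search for the parameter subsets that correspond to observation) can be sketched in miniature. Everything here is a stand-in: `toy_model` is a one-line placeholder for the real climate model, and the parameter range, observed value and tolerance are invented for illustration.

```python
import random

def toy_model(sensitivity, forcing=1.0):
    """Placeholder for a climate model: response = sensitivity * forcing."""
    return sensitivity * forcing

def run_ensemble(n_runs, seed=0):
    """Each simulated 'PC' draws its own parameter value and runs the model once."""
    rng = random.Random(seed)
    runs = []
    for _ in range(n_runs):
        sensitivity = rng.uniform(0.0, 2.0)
        runs.append((sensitivity, toy_model(sensitivity)))
    return runs

def consistent_runs(runs, observed, tolerance):
    """Keep only the runs whose output lies within `tolerance` of the observation."""
    return [(p, out) for p, out in runs if abs(out - observed) <= tolerance]

runs = run_ensemble(1000)
kept = consistent_runs(runs, observed=0.8, tolerance=0.1)
low = min(p for p, _ in kept)
high = max(p for p, _ in kept)
print(f"{len(kept)} of {len(runs)} runs match; parameter range {low:.2f} to {high:.2f}")
```

The surviving parameter range is a crude picture of the uncertainty that remains once the observation has been taken into account.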


Data in climateprediction.net

• base

– Latitude 96

– Longitude 72

– Levels 19

– Timesteps calculated every 30 min / 1 hr, with output for every day over a period of 50 years

17000 registered in advance of launch

• variables

– Horizontal velocity

– Temperature

– Surface pressure

– Water vapour (atmosphere)

– Salinity (ocean)

• Possible others, such as ocean carbon content and atmospheric ozone and sulphates
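A rough size estimate for one run follows from the grid figures above. The arithmetic below covers a single three-dimensional variable with daily output; the 365-day year and the 4-byte value size are assumptions, not figures from the talk.

```python
# Grid sizes as given on the slide.
N_LAT, N_LON, N_LEVELS = 96, 72, 19
YEARS = 50

DAYS_PER_YEAR = 365   # assumption: the model calendar may differ
BYTES_PER_VALUE = 4   # assumption: 32-bit floating point values

cells = N_LAT * N_LON * N_LEVELS
daily_outputs = YEARS * DAYS_PER_YEAR
values_per_variable = cells * daily_outputs
bytes_per_variable = values_per_variable * BYTES_PER_VALUE

print(cells)                               # 131328
print(values_per_variable)                 # 2396736000
print(round(bytes_per_variable / 1e9, 2))  # 9.59 (GB per variable per run)
```

Multiplied by several variables and many thousands of participating PCs, this suggests why only a subset of each run's data is likely to be returned to a central server.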


parameters in climateprediction.net

• Physics parameters that may be varied between one run and another:

– Representation of cloud variability

– Rate at which water droplets collide and cohere

– Number of nucleation particles for cloud droplet formation

– Light scattering in the atmosphere

– Cloud convection

– Surface processes such as rate of transpiration by plants

• Also, runs will be duplicated to detect tampering
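One way duplicated runs could expose tampering is to compare digests of the results returned for the same parameter set. The sketch below is an assumption about how such a check might look, not the project's actual mechanism: the function names and sample results are hypothetical, and it assumes duplicate runs are bit-for-bit reproducible, which may not hold across different hardware.

```python
import hashlib
from collections import defaultdict

def digest(result_bytes):
    """SHA-256 digest of one returned result."""
    return hashlib.sha256(result_bytes).hexdigest()

def find_disputed_runs(submissions):
    """submissions: iterable of (parameter_set_id, result_bytes) pairs.
    Returns the sorted ids whose duplicate results disagree."""
    digests = defaultdict(set)
    for param_id, result in submissions:
        digests[param_id].add(digest(result))
    return sorted(pid for pid, seen in digests.items() if len(seen) > 1)

# Hypothetical results: 'run-b' was returned with two different contents.
submissions = [
    ("run-a", b"1.50,2.25"),
    ("run-a", b"1.50,2.25"),
    ("run-b", b"0.75,1.10"),
    ("run-b", b"0.75,9.99"),
]
print(find_disputed_runs(submissions))  # ['run-b']
```

If results are not exactly reproducible, the comparison would have to use a numerical tolerance rather than a digest.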


Data distribution in climateprediction.net

• Results dataset will be distributed at several (possibly 20) climate modelling institutions

• A subset of data is returned from a PC to a data server. Remainder is therefore kept on the (home or office) PC and available – if the owner so chooses.

• A program attempting to mine the data needs to be isolated from these details by an appropriate portal, metadata and/or catalogue
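The isolation described above can be pictured as a small catalogue layer: the mining code asks for a variable, and the catalogue decides where each dataset lives and how to fetch it. All names here (`CatalogueEntry`, `Catalogue`, the location labels and the in-memory fetchers) are hypothetical and stand in for whatever portal or metadata service is actually used.

```python
from dataclasses import dataclass

@dataclass
class CatalogueEntry:
    dataset_id: str
    location: str      # hypothetical label, e.g. "data-server" or "home-pc"
    variables: tuple   # names of the variables the dataset holds

class Catalogue:
    """Hides where each dataset lives behind find() and fetch()."""

    def __init__(self, entries, fetchers):
        self._entries = {e.dataset_id: e for e in entries}
        self._fetchers = fetchers  # maps location label -> fetch function

    def find(self, variable):
        """Return the sorted ids of datasets that hold `variable`."""
        return sorted(e.dataset_id for e in self._entries.values()
                      if variable in e.variables)

    def fetch(self, dataset_id, variable):
        """Fetch the values of `variable` from wherever the dataset is stored."""
        entry = self._entries[dataset_id]
        return self._fetchers[entry.location](dataset_id, variable)

def mean_over(catalogue, variable):
    """Mining code: it never needs to know where any dataset is stored."""
    values = [v for ds in catalogue.find(variable)
              for v in catalogue.fetch(ds, variable)]
    return sum(values) / len(values)

# In-memory stand-ins for a central data server and a participant's PC.
fetchers = {
    "data-server": lambda ds, var: [14.0, 15.0],
    "home-pc": lambda ds, var: [16.0, 17.0],
}
entries = [
    CatalogueEntry("run-001", "data-server", ("temperature",)),
    CatalogueEntry("run-002", "home-pc", ("temperature", "salinity")),
]
catalogue = Catalogue(entries, fetchers)
print(mean_over(catalogue, "temperature"))  # 15.5
```

Swapping a fetcher for a remote call would leave `mean_over` unchanged.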


Climateprediction.net questions

• Some questions that need to be askable

– What features of the response are robust as we change the physics?

– What kind of changes have similar effects to each other?

– Which models that are consistent with current observations give changes in extreme events in the future?

• Unclear whether this is data mining in the strict sense, but it certainly calls for multivariate statistical techniques


Summing up

• NERC Data Grid project, for example, exposes current difficulties of doing data mining on large scientific datasets

– In a commercial situation, data is warehoused under single operational control

– In science, access is needed to different datasets which are under different managements

– Multiple logins, multiple metadata systems

• Current e-science projects are providing a mechanism, which future data mining could use

• Applications include: earth observation; particle physics; astronomy; biology; . . . . .