24/25 October 2002 SDMIV workshop – Julian Gallop 1
Potential applications in CLRC/RAL collaborations
Julian Gallop
October 2002
commercial / scientific
• Data mining well known in commercial applications
– should the own-brand cornflakes be located next to the beer?
• Less well known in scientific applications
• Among scientists, it’s common to find
– “not sure that what I need is data mining, but instead ….”
• Perhaps data mining is regarded too narrowly
Definitions
• An early (1991) definition of Knowledge Discovery in Databases (KDD) was given as:
– "the non-trivial extraction of implicit, previously unknown, and potentially useful information from data" (Frawley et al. 1991).
• This was subsequently (1996) revised to:
– "the non-trivial process of identifying valid, potentially useful and ultimately understandable patterns in data" (Fayyad et al. 1996).
• data mining is one step in the KDD process - concerned with applying computational techniques to find patterns in data
CLRC scientific fields and collaborations
• Sciences: space, earth observation, particle physics, microstructures, synchrotron radiation . . .
• Holds (or provides access to) significant data collections
• Partnerships between E-science centre, BITD, computational science and science departments
• E-science projects include:
– Ones that are mainly CLRC (e.g. Data Portal)
– UK e-science collaborations (e.g. Astrogrid, NERC Data Grid, gViz)
– EU collaborations (e.g. DataGrid)
– And also the UK Grid Support Centre
Sample CLRC e-science project – Data Portal
• Data Portal project – pilot project within CLRC:
– To enable a scientist to discover, explore and retrieve disparate datasets through one interface, independent of the data location.
– CLRC sciences - space science, synchrotron science and neutron science - as well as e-science and IT.
– Part of the work is the development of a scientific metadata model
Sample e-science projects involving CLRC
• Astrogrid (UK)
– Building a virtual observatory
– Ideas on data mining:
• Finding: association rules; deviations from a rule; similarity; clustering and classification
• DataGrid (EU): aims to enable next generation scientific exploration which requires intensive computation and analysis of shared large-scale databases, millions of Gigabytes, across widely distributed scientific communities.
– Applications are: biomedical, earth observation, particle physics
• NERC Data Grid (UK)
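The association-rule idea listed under Astrogrid above can be illustrated with a minimal sketch. It computes the two usual measures of a rule "A implies B": support (how often A and B occur together) and confidence (how often B occurs when A does). The record contents and property names (`radio_loud`, `x_ray_source`, `variable`) are hypothetical and only serve as an example; they are not taken from any Astrogrid catalogue.

```python
def rule_stats(transactions, antecedent, consequent):
    """Return (support, confidence) for the rule antecedent -> consequent."""
    antecedent, consequent = set(antecedent), set(consequent)
    n = len(transactions)
    # Records that contain every item of the antecedent.
    n_ante = sum(1 for t in transactions if antecedent <= set(t))
    # Records that contain the antecedent and the consequent together.
    n_both = sum(1 for t in transactions if (antecedent | consequent) <= set(t))
    support = n_both / n if n else 0.0
    confidence = n_both / n_ante if n_ante else 0.0
    return support, confidence


# Hypothetical records: each set lists properties observed for one object.
records = [
    {"radio_loud", "x_ray_source", "variable"},
    {"radio_loud", "x_ray_source"},
    {"radio_loud"},
    {"x_ray_source", "variable"},
]
support, confidence = rule_stats(records, {"radio_loud"}, {"x_ray_source"})
print(f"support={support:.2f} confidence={confidence:.2f}")
```

The same function covers the commercial case mentioned earlier: replace the records with shopping baskets and ask about the rule cornflakes -> beer.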
NERC Data Grid
• Funded by NERC & UK e-science core programme
• Involves:
– CLRC (RAL & DL – including British Atmospheric Data Centre)
– Program for Climate Model Diagnosis and Intercomparison (PCMDI) (U.S. Lawrence Livermore National Laboratory)
• Relevant to:
– energy; water management; food chain; health; weather risk
NERC Data Grid – relevance to knowledge discovery
• Aims to address the problems that:
– at present searching metadata to discover and retrieve what you want is a manual process
– Datasets in multiple locations involve multiple logins and retrieval in multiple formats
• indicators of success:
– Ability to find, reformat and visualize disparate datasets held by different organisations from within one organisation
– Ability to test data and comparison ideas without learning foreign formats and establishing personal relationships every time
• Clearly will provide a basis for knowledge discovery if successful
Earth observation instruments
• For example ENVISAT
• Instrument AATSR
• Low orbit, 14 orbits/day
• Returns to same place every 3 days
• Picture shows plume from Mt Etna in 2001 (previous instrument ATSR2)
• NASA AQUA: TBs of data per day
Earth observation patterns
• For particular location, what patterns emerge on:
– A daily basis
– Or a yearly basis
• Knowing the conventional pattern day by day, can observe out of the ordinary events e.g. an oil slick
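A minimal sketch of the idea in the last bullet: build a "conventional" pattern for one location from earlier measurements, then flag a new measurement that lies far outside it. The function name `flag_unusual`, the threshold of three standard deviations and the sample values are all assumptions made for illustration; a real instrument pipeline would need calibration, cloud screening and so on.

```python
from statistics import mean, stdev


def flag_unusual(history, todays_value, threshold=3.0):
    """Return True if todays_value deviates from the historical pattern.

    history: earlier measurements for the same location taken under
             comparable conditions (same time of day or day of year).
    """
    if len(history) < 2:
        return False  # too little data to define a conventional pattern
    mu = mean(history)
    sigma = stdev(history)
    if sigma == 0:
        return todays_value != mu
    # Flag values more than `threshold` standard deviations from the mean.
    return abs(todays_value - mu) / sigma > threshold


# Hypothetical values for one pixel on the same date in earlier years.
history = [10.1, 9.8, 10.3, 10.0, 9.9]
print(flag_unusual(history, 10.2))  # close to the usual pattern
print(flag_unusual(history, 14.0))  # far outside it
```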
climateprediction.net
• Makes use of spare compute capacity on office and home PCs to run a climate prediction model
• Different PCs run different parameters and collectively run a Monte Carlo simulation
• Results will be studied to find out which subsets of the parameter space correspond to observation
• Better understanding of uncertainties
• Public understanding of climate change
• Oxford U, CLRC RAL, Reading U, with Met Office and OU
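The Monte Carlo approach described above can be sketched in a few lines: each run draws its own parameter values, and afterwards only the runs whose result agrees with an observation are kept. Everything here is a stand-in: `toy_model`, the parameter names `sensitivity` and `offset`, their ranges, and the observed value are hypothetical and bear no relation to the real climate model.

```python
import random


def toy_model(sensitivity, offset):
    """Stand-in for one model run; NOT a real climate model."""
    return offset + 2.0 * sensitivity


def run_ensemble(n_runs, seed=0):
    """Draw random parameters for each run and record (params, result)."""
    rng = random.Random(seed)
    runs = []
    for _ in range(n_runs):
        params = {
            "sensitivity": rng.uniform(0.0, 5.0),
            "offset": rng.uniform(-1.0, 1.0),
        }
        runs.append((params, toy_model(**params)))
    return runs


def consistent_with_observation(runs, observed, tolerance):
    """Keep the parameter sets whose result lies within tolerance of the observation."""
    return [params for params, result in runs if abs(result - observed) <= tolerance]


runs = run_ensemble(1000)
kept = consistent_with_observation(runs, observed=6.0, tolerance=0.5)
print(len(kept), "of", len(runs), "runs are consistent with the observation")
```

In the project each run would execute on a different PC; the filtering step is what identifies the subsets of parameter space that correspond to observation.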
Data in climateprediction.net
• base
– Latitude 96
– Longitude 72
– Levels 19
– Timesteps calculated every 30 min / 1 hr and output for every day over a period of 50 years
• 17,000 registered in advance of launch
• variables
– Horizontal velocity
– Temperature
– Surface pressure
– Water vapour (atmosphere)
– Salinity (ocean)
• Possible others, such as ocean carbon content and atmospheric ozone and sulphates
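The figures above give a feel for the data volume of a single run. The following back-of-envelope calculation is only a rough estimate under stated assumptions: 365 output days per year, 4 bytes per stored value, and each of the five listed variables treated as a full three-dimensional field (which overstates surface pressure and ignores that salinity lives on the ocean grid).

```python
LATITUDE_POINTS = 96
LONGITUDE_POINTS = 72
LEVELS = 19
YEARS = 50
DAYS_PER_YEAR = 365   # assumption; the model calendar may differ
VARIABLES = 5         # the five variables listed on the slide
BYTES_PER_VALUE = 4   # assumption: single-precision floating point

grid_points = LATITUDE_POINTS * LONGITUDE_POINTS * LEVELS
daily_outputs = YEARS * DAYS_PER_YEAR
total_values = grid_points * daily_outputs * VARIABLES
total_bytes = total_values * BYTES_PER_VALUE

print(f"{grid_points} grid points")
print(f"{total_bytes / 1e9:.1f} GB per run under these assumptions")
```

Under these assumptions one run produces tens of gigabytes, which helps explain why only a subset of each run's output is returned to a data server.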
parameters in climateprediction.net
• Physics parameters that may be varied between one run and another:
– Representation of cloud variability
– Rate at which water droplets collide and cohere
– Number of nucleation particles for cloud droplet formation
– Light scattering in the atmosphere
– Cloud convection
– Surface processes such as rate of transpiration by plants
• Also, runs will be duplicated to detect tampering
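One way duplicated runs could expose tampering is to compare a digest of the results returned for the same parameter set. This is a sketch of that idea only, not the project's actual mechanism; the function names and the submission format are hypothetical. It also assumes that honest duplicate runs produce bit-identical output, which may not hold across different hardware, so a real check would probably compare results within a tolerance.

```python
import hashlib
from collections import defaultdict


def digest(result_bytes):
    """SHA-256 hex digest of one returned result."""
    return hashlib.sha256(result_bytes).hexdigest()


def suspicious_parameter_sets(submissions):
    """Return ids of parameter sets whose duplicate runs disagree.

    submissions: iterable of (parameter_set_id, result_bytes) pairs.
    """
    digests = defaultdict(set)
    for param_id, result in submissions:
        digests[param_id].add(digest(result))
    # More than one distinct digest means the duplicates do not match.
    return sorted(pid for pid, seen in digests.items() if len(seen) > 1)


# Hypothetical submissions from different PCs.
submissions = [
    ("p1", b"result-A"),
    ("p1", b"result-A"),
    ("p2", b"result-B"),
    ("p2", b"result-B-modified"),
    ("p3", b"result-C"),
]
print(suspicious_parameter_sets(submissions))
```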
Data distribution in climateprediction.net
• Results dataset will be distributed at several (possibly 20) climate modelling institutions
• A subset of data is returned from a PC to a data server. Remainder is therefore kept on the (home or office) PC and available – if the owner so chooses.
• A program attempting to mine the data needs to be isolated from these details by an appropriate portal, metadata and/or catalogue
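The isolation described in the last bullet can be pictured as a catalogue that maps a logical dataset identifier to wherever the data happens to be stored, so the mining program never hard-codes a location. The sketch below is purely illustrative: the class names, the identifier format and the host names are hypothetical, not part of any climateprediction.net interface.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Location:
    host: str   # hypothetical server or participant PC
    path: str   # hypothetical path on that host


class Catalogue:
    """Maps logical dataset ids to the places where copies are held."""

    def __init__(self):
        self._entries = {}

    def register(self, dataset_id, location):
        self._entries.setdefault(dataset_id, []).append(location)

    def locate(self, dataset_id):
        """Return all known locations; an empty list if the id is unknown."""
        return list(self._entries.get(dataset_id, []))


catalogue = Catalogue()
catalogue.register("run-0001/summary", Location("data-server.example", "/runs/0001/summary"))
catalogue.register("run-0001/full", Location("participant-pc.example", "/cpdn/0001/full"))

# The mining code asks for data by logical id only.
for location in catalogue.locate("run-0001/summary"):
    print(location.host, location.path)
```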
Climateprediction.net questions
• Some questions that need to be askable
– What features of the response are robust as we change the physics?
– What kind of changes have similar effects to each other?
– Which models that are consistent with current observations give changes in extreme events in the future?
• Unclear whether this is data mining in the strict sense, but it certainly involves multivariate statistical techniques
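The first question, about which features of the response are robust, can be made concrete with a very simple statistic: the fraction of ensemble members that agree on the sign of a change. This is one possible measure chosen for illustration, not the project's method, and the feature names and numbers are invented.

```python
def sign_agreement(responses):
    """Fraction of ensemble members whose response has the majority sign."""
    n = len(responses)
    if n == 0:
        return 0.0
    positives = sum(1 for r in responses if r > 0)
    negatives = sum(1 for r in responses if r < 0)
    return max(positives, negatives) / n


# Hypothetical responses of two features across four runs with different physics.
ensemble = {
    "temperature_change": [2.1, 3.0, 1.5, 4.2],
    "rainfall_change": [0.3, -0.2, 0.1, -0.4],
}
for feature, responses in ensemble.items():
    print(feature, sign_agreement(responses))
```

A feature with agreement near 1 keeps its direction however the physics is varied; one near 0.5 does not.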
Summing up
• NERC Data Grid project, for example, exposes current difficulties of doing data mining on large scientific datasets
– In a commercial situation, data is warehoused under single operational control
– In science, access is needed to different datasets which are under different managements
– Multiple logins, multiple metadata systems
• Current e-science projects are providing a mechanism, which future data mining could use
• Applications include: earth observation; particle physics; astronomy; biology; . . .