data assimilation vs data mining
TRANSCRIPT
7/23/2019 Data Assimilation vs Data Mining
http://slidepdf.com/reader/full/data-assimilation-vs-data-mining 1/28
Data Mining vs. Data Assimilation
S. Lakshmivarahan
School of Computer Science
University of Oklahoma
Norman, Oklahoma
S. Lakshmivarahan Data Mining vs. Data Assimilation
7/23/2019 Data Assimilation vs Data Mining
http://slidepdf.com/reader/full/data-assimilation-vs-data-mining 2/28
Data Mining(DM) - early beginnings
Much of what we know in physical sciences had their originsin Astronomy - with observations of celestial objects
Thanks to the Herculean efforts of:
Copernicus (1473-1543)Galileo (1544-1642)Kepler (1571-1630)Newton (1643-1727)
This is only a small sampling from a long list of pioneers
S. Lakshmivarahan Data Mining vs. Data Assimilation
7/23/2019 Data Assimilation vs Data Mining
http://slidepdf.com/reader/full/data-assimilation-vs-data-mining 3/28
Discovery of simple laws from observations
Observations collected over decades were meticulouslyanalyzed by hand to formulate new laws of nature
Examples:
Heliocentric systemThe Four laws of Kepler
Law of gravitation by NewtonThree Newtons laws
Within the context of physical sciences these are some of theearliest examples of data mining
Note: In Chemical, Biological and other Sciences there are
instances such as the above that are re pleat with historicalfacts that can illustrate the use of data mining in each of these disciplines
S. Lakshmivarahan Data Mining vs. Data Assimilation
7/23/2019 Data Assimilation vs Data Mining
http://slidepdf.com/reader/full/data-assimilation-vs-data-mining 4/28
What is Data Mining
DM is the process extracting the structure or patterns that areinherent in the data/observations
These patterns provide clues about the data generatingprocess
Ultimate goal of DM is to understand and quantify the datagenerating process
Since the motion of celestial objects inherently followedcertain laws, early pioneers with their hard work and ingenuity
could discover the laws that laid the foundation of thephysical sciences and engineering as we know today
S. Lakshmivarahan Data Mining vs. Data Assimilation
7/23/2019 Data Assimilation vs Data Mining
http://slidepdf.com/reader/full/data-assimilation-vs-data-mining 5/28
Abundance of data - revival of DM
Volume of data collected doubles in every three years -Thanksto technology
ComputersLarge scale storage device technologyCommunication and sensor technologies
Today interest in DM include:Physical sciences, Biological sciences, Medical SciencesSpace exploration, All branches of Engineering,Environmental Sciences, EcologyEconomics, Social Sciences, Finance, Banking and Commerce,
Sports and recreationGovernments, private companies
More about DM a bit later. Back to early Astronomy
S. Lakshmivarahan Data Mining vs. Data Assimilation
7/23/2019 Data Assimilation vs Data Mining
http://slidepdf.com/reader/full/data-assimilation-vs-data-mining 6/28
Development of Calculus and discovery of dynamic models
Introduction of mathematical models by combining concurrentdevelopments in
Physical laws - Newton’s lawsCalculus by Newton (1643-1727) and Leibnitz (1646-1716)
among others
Naturally lead to the development of dynamic models todescribe the motion of planets around the sun
With the availability of models, the potential for forecast or
prediction became very clear
S. Lakshmivarahan Data Mining vs. Data Assimilation
7/23/2019 Data Assimilation vs Data Mining
http://slidepdf.com/reader/full/data-assimilation-vs-data-mining 7/28
Discovery of Least squares- Beginnings of DataAssimilation (DA)
Gauss (1777-1855) (when he was only 24 years old) using theknown models of his time, undertook the challenging problemof predicting when the celestial object called Ceres willreappear on the telescope
The model had unknown parameters that needed to beestimated
By combining the model with observations in the least squaressense, Gauss, estimated the unknown parameters - created thefirst assimilated model
He then used this assimilated model to accurately predict of the time and location of reappearance of the lost astronomicalobject
S. Lakshmivarahan Data Mining vs. Data Assimilation
7/23/2019 Data Assimilation vs Data Mining
http://slidepdf.com/reader/full/data-assimilation-vs-data-mining 8/28
Gauss laid the foundation for DA
This work leads the development of the method of leastsquares as we know it today
Method of least squares still continues to dominate the theory
and practice of estimation of unknown parametersBy this time Gauss had also invented the notion of statisticalanalysis relating to the distribution observational errorsfollowing the bell shaped curve which we now call as thenormal or Gaussian distribution
S. Lakshmivarahan Data Mining vs. Data Assimilation
7/23/2019 Data Assimilation vs Data Mining
http://slidepdf.com/reader/full/data-assimilation-vs-data-mining 9/28
What is Data Assimilation?
Fusion of model with dataModels are general descriptions of the underlying physicalprocesses in question
Model represents a class - suitably parametrized
Examples:Static regression models have unknown coefficientsDynamic model has unknown initial/boundary conditions +physical parameters such as Reynolds number, coefficient of thermal expansion of water, specific heat of water, etc
Data/observations reveal all the secrets of or the truth aboutthe process that model tries to capture
S. Lakshmivarahan Data Mining vs. Data Assimilation
DA f i f d l i h d
7/23/2019 Data Assimilation vs Data Mining
http://slidepdf.com/reader/full/data-assimilation-vs-data-mining 10/28
DA - fusion of models with data
By combining models and data - estimating the unknownparameters of the models using the data - we can get aspecialized instantiation of the model called the assimilatedmodel
This assimilated model is a good tool for creating forecast orprediction
One of the standard tools for the fusion of model and data isbased on the method of least squares
The discipline of DA primarily deals with development of methods for assimilating models with data
S. Lakshmivarahan Data Mining vs. Data Assimilation
G l f DA d f / di i
7/23/2019 Data Assimilation vs Data Mining
http://slidepdf.com/reader/full/data-assimilation-vs-data-mining 11/28
Goal of DA - generate good forecast/prediction
Predict the path of a hurricane, tornado- using one of severalmodels + data collected using satellites, Radars, specialplanes that fly into the hurricanes twice a day
From the crime scene data, reconstruct the case - CSI, Miami
NTSB estimate the causes of failure using the data from thedebris
Predict the potential tax revenues so that a Government candevelop its budget for the next year
Medical diagnosis - from symptoms to the cure
S. Lakshmivarahan Data Mining vs. Data Assimilation
Di t I bl A l ifi ti
7/23/2019 Data Assimilation vs Data Mining
http://slidepdf.com/reader/full/data-assimilation-vs-data-mining 12/28
Direct vs. Inverse problems - A classification
To further explore the relation between DM and DA -introduce an useful classification
Scientific and Engineering problems can be classified into oneof two typesDirect problems - Examples
Given a polynomial p (x ), evaluate it at x = 1.0Given a differential equation and the initial condition, find the
solutionGiven a matrix A and a vector x, compute the vector b = Ax
Inverse problems - ExamplesGiven a polynomial p(x), solve for the roots of p (x ) = 0Given a differential equation and a particular solution, find the
initial condition that corresponds to the solutionGiven a matrix A and a vector b, find the solution x such thatAx = b
It turns out that DM and DA naturally correspond to twotypes of inverse problems and prediction is a direct problem
S. Lakshmivarahan Data Mining vs. Data Assimilation
Fi t l l f i bl Th C f D t Mi i
7/23/2019 Data Assimilation vs Data Mining
http://slidepdf.com/reader/full/data-assimilation-vs-data-mining 13/28
First level of inverse problems - The Core of Data Mining
At the highest level, Data Mining relates to solving the
important class of inverse problems leading to the discoveryof basic laws/models that are implied by the data
Examples of discovery of laws/models from data include:
Basic laws in early Astronomy -Kepler, Newton,Atom models in early 1900s
Higgs Boson, the so called God particle in 2012Theory of evolution by C. DarwinBuilding models to identify credit card fraudBased on the observed structure of the autocorrelation of atime series, decide on the class and the type of model that
might be capture the observed autocorrelationData Mining has been and still continues to be the basis forthe advancement of knowledge in all of Sciences andEngineering
S. Lakshmivarahan Data Mining vs. Data Assimilation
S d l l f i s bl s Th C f D t
7/23/2019 Data Assimilation vs Data Mining
http://slidepdf.com/reader/full/data-assimilation-vs-data-mining 14/28
Second level of inverse problems - The Core of DataAssimilation
Assume now that the newly discovered mathematical laws areexpressed in the form of a class of models
The problem then becomes one of data assimilation thatrelates to solving a second level of inverse problem that dealswith the estimation of the unknown parameters of the modelusing the same or similar data
Determination of the weights for links connecting the neuronsin an Artificial Neural Network - minimize classification errorEstimate the sea surface temperature using satelliteobservations - based on Planck/Stefan’s law of radiation
Estimate the amount of rain in a cloud system using radarobservation - based on an empirical lawEstimate the structure of the earth - based on the anomaly of the local gravitational field - basis for geophysical exploration
S. Lakshmivarahan Data Mining vs. Data Assimilation
Third level involves Prediction a direct problem
7/23/2019 Data Assimilation vs Data Mining
http://slidepdf.com/reader/full/data-assimilation-vs-data-mining 15/28
Third level involves Prediction - a direct problem
Once an assimilated model is made available, interest thenshifts to the direct problem of generation of short termprediction
Predict lunar/solar eclipsePrediction of total revenue by a state treasuryPrediction of how snow will fall in Boston due to a coastal lowpressure systemPrediction of the amount of green houses in the atmosphere by2025
S. Lakshmivarahan Data Mining vs. Data Assimilation
Is it DM or DA?
7/23/2019 Data Assimilation vs Data Mining
http://slidepdf.com/reader/full/data-assimilation-vs-data-mining 16/28
Is it DM or DA?
First phase: At its core DM relates to the discovery of basicknowledge - Remember Kepler and Newton
This knowledge is often expressed as a law which isencapsulated in a (mathematical) model
Emphasis then shifts to testing the goodness of a model
Second phase: At its core DA deals with the problem of estimating the unknowns by fitting the model to data -Remember Gauss
Third phase: Using the assimilated model generate forecast
products for public consumption
DM and DA are the two parts of a continuum
S. Lakshmivarahan Data Mining vs. Data Assimilation
A classification of models
7/23/2019 Data Assimilation vs Data Mining
http://slidepdf.com/reader/full/data-assimilation-vs-data-mining 17/28
A classification of models
Models: Based on causality (Motion of a Hurricane) vscorrelation (ARMA model in time series)
Models: Explicit (ARMA model) vs implicit (Neural Networks)
Static (Regression) vs. Dynamic (ODE/PDE)
Models: Deterministic (motion of a planet) vs. stochastic(evolution stock prices)
Model: Linear vs. nonlinear
Model Time: Discrete (unemployment) vs. continuous(temperature)
Model Space: Discrete (Markov chain) vs. continuous (rainfall)
S. Lakshmivarahan Data Mining vs. Data Assimilation
Forms of Data
7/23/2019 Data Assimilation vs Data Mining
http://slidepdf.com/reader/full/data-assimilation-vs-data-mining 18/28
Forms of Data
Data arise in various forms:
Time series data - annual rain fall, total monthly sales
Data martix m × n - n objects (columns) and m attributes(rows)
Cross Sectional data - Tabular forms
Practical problems: Missing data, outliers, Data qualitycontrol
Note: In Science and Engineering, data are often of thequantitative type (permiting full blown arithmetic
operations). In Economics, Social Sciences etc., data could bea mixture of both quantitative and qualitative types.Algorithms for mining/assimialtion qualitative data differ fromthose of quantitative data sets
S. Lakshmivarahan Data Mining vs. Data Assimilation
Estimation - Over vs under determined problems
7/23/2019 Data Assimilation vs Data Mining
http://slidepdf.com/reader/full/data-assimilation-vs-data-mining 19/28
Estimation - Over vs. under determined problems
Two scenarios arise depending on the cost of collectingobservations
Over determined (OD) case - abundance of observation muchlarger than the number of unknowns to be estimated - Oncedeployed, satellites, radars will deliver large amounts of datafor quite a long time
Under determined (UD) case - less number of observationscompared with the number of unknowns - Exploration forminerals, natural gas, oil, etc.,
In the OD case there is no solution and in the UD case thereare infinitely many solutions
These cases are the motivation for the definition of solution inthe least squares sense
S. Lakshmivarahan Data Mining vs. Data Assimilation
Framework for DA
7/23/2019 Data Assimilation vs Data Mining
http://slidepdf.com/reader/full/data-assimilation-vs-data-mining 20/28
Framework for DA
Estimation problem is recast as a constrained minimizationproblem
Constraints arise naturally:
Positivity of certain physical parameters - inequality constraint
Model itself acts as a constraint - equality constraint
Strict enforcement of constraints - Strong constraintformulation -Lagrangian multiplier technique
Weak enforcement of constraints - Weak constraint
formulation - Penalty function technique
S. Lakshmivarahan Data Mining vs. Data Assimilation
Well-posed vs ill-posed problems
7/23/2019 Data Assimilation vs Data Mining
http://slidepdf.com/reader/full/data-assimilation-vs-data-mining 21/28
Well posed vs. ill posed problems
In a well posed problem solution exits and is uniqueIn an ill-posed problem solution may not exist or it may haveinfinitely many solutions
Many of the inverse problems are ill-posed
These are solved by using some form of regularizationtechniques - Tikhonov regularization
Using regularization we solve the nearest well-posed version of a given ill-posed problem
Example: Solving (A + αI )X (α) = b instead of AX = b forsome small positive α for which (A + αI ) is positive definite
S. Lakshmivarahan Data Mining vs. Data Assimilation
Methods for estimation
7/23/2019 Data Assimilation vs Data Mining
http://slidepdf.com/reader/full/data-assimilation-vs-data-mining 22/28
Methods for estimation
Parametric vs. non-parametric methods
Least squares - two versions
Unweighed least squares - orthogonal projectionWeighted least squares - oblique projections
Generalized method of moments
Maximum likelihood methods
Bayesian methods where we combine a known prior withconditional distributions
S. Lakshmivarahan Data Mining vs. Data Assimilation
Optimization problems
7/23/2019 Data Assimilation vs Data Mining
http://slidepdf.com/reader/full/data-assimilation-vs-data-mining 23/28
Optimization problems
Unimodal vs. multi modal problems
Continuous vs. discrete optimization problems
Continuous,Unimodal problems solved using:
Gradient method
Conjugate gradient methodQuasi-Newton method
Continuous multi modal and discrete optimization problemssolved using randomized techniques:
Simulated annealing
Genetic algorithms
S. Lakshmivarahan Data Mining vs. Data Assimilation
Methods for DM/DA - I
7/23/2019 Data Assimilation vs Data Mining
http://slidepdf.com/reader/full/data-assimilation-vs-data-mining 24/28
/
Time series analysisSignal processing in EEMedicineEconometrics, Finance
The goal is to build stochastic dynamic models in discretetime by exploiting the underlying correlation, seasonalityproperties of the data set
In Finance model both level and volatility
Autoregressive, integrated, moving average (ARIMA) models
This is one of the well developed areas in empirical modeling
S. Lakshmivarahan Data Mining vs. Data Assimilation
Methods for DM/DA - II
7/23/2019 Data Assimilation vs Data Mining
http://slidepdf.com/reader/full/data-assimilation-vs-data-mining 25/28
/
Multivariate regression analysis (1800s) Statistics
Data reduction using PCA (1940), ICA (1990) - StatisticsClassification using
Clustering (1950s)Neural networks (1950s),Pattern recognition (1950s),
Support Vector Machines (SVM) (1980s)
Association rules
Image processing, voice recognition
Decision trees (1960)
Probabilistic reasoning in networks (1990s) - J. Pearl TuringAward in 2012
Random field - Spatial data analysis
S. Lakshmivarahan Data Mining vs. Data Assimilation
Commonality of approaches in DM, DA, AI,Machine
7/23/2019 Data Assimilation vs Data Mining
http://slidepdf.com/reader/full/data-assimilation-vs-data-mining 26/28
y pp , , ,Learning
Supervised learning
Learning with a teacher - Learning in Neural Networks
Learning with a probabilistic teacher - using impreciseknowledge
Unsupervised learning/Learning without a teacher -Clustering, Adaptive Control
S. Lakshmivarahan Data Mining vs. Data Assimilation
Summary
7/23/2019 Data Assimilation vs Data Mining
http://slidepdf.com/reader/full/data-assimilation-vs-data-mining 27/28
y
At the first level Data Mining seeks to uncover the basic lawsthat are hidden in the data. These laws are presented bymodels of some kind with unknown parameters
At the second level Data Assimilation deals with the task of
fusing data with models to produce an assimilated model - byestimating the unknown parameters
At the third level, using the given assimilated model producevarious forecast products for public consumption
DM, DA and Forecasting are the three parts of a continuumin knowledge discovery
S. Lakshmivarahan Data Mining vs. Data Assimilation
References
7/23/2019 Data Assimilation vs Data Mining
http://slidepdf.com/reader/full/data-assimilation-vs-data-mining 28/28
J. M. Lewis, S. Lakshmivarahan and S. K. Dhall (2006)Dynamic Data Assimilation: a least squares approach, Volume104, Encyclopedia of Mathematics and its Applications,Cambridge University Press, 654 pages
J. D. Hamilton (1994) Time Series Analysis, PrincetonUniversity Press, 799 pages
P. Tang, M. Steinbach and V. Kumar (2006) Introduction toData Mining, Addison Wesley
S. Lakshmivarahan Data Mining vs. Data Assimilation