Kriging - Introduction


TRANSCRIPT

Slide 1

Kriging - Introduction
• Method invented in the 1950s by South African geologist Daniel Krige (1919-) for predicting the distribution of minerals.
• Became very popular for fitting surrogates to expensive computer simulations in the 21st century.
• It is one of the best surrogates available.
• It probably became popular late mostly because of the high computational cost of fitting it to data.

Kriging was invented in the 1950s by Daniel Krige, a South African geologist, for the purpose of predicting the distribution of minerals on the basis of samples. It became very popular for fitting surrogates to expensive computer simulations only about 40-50 years later. This is at least partly because it is a computationally expensive surrogate, and it had to wait until computers became fast enough.

My experience is that of all currently popular surrogates, it has the highest chance of being the most accurate for a given problem. However, in my experience that chance is still less than 50%.

Krige, D. G. (1951). "A statistical approach to some basic mine valuation problems on the Witwatersrand." Journal of the Chemical, Metallurgical and Mining Society of South Africa, 52(6): 119-139.

Slide 2

Kriging philosophy
• We assume that the data are sampled from an unknown function that obeys simple correlation rules.
• The value of the function at a point is correlated with the values at neighboring points based on their separation in different directions.
• The correlation is strong with nearby points and weak with faraway points, but its strength does not change based on location.
• Normally kriging is used with the assumption that there is no noise, so that it interpolates the function values exactly.
• It works out to be a local surrogate, and it uses radial basis functions.

For linear regression (see lecture) we normally assume that we know the functional shape (e.g., a polynomial), and the data are used to find the coefficients that minimize the root mean square error at the data points. Kriging takes a very different approach. It assumes that we don't know much about the function, except for the form of the correlation between the values of the function at nearby points. In particular, the correlation depends only on the distance between points and decays as they move farther apart.

Kriging is usually used as an interpolator, so that it fits the data exactly, and we cannot use the rms error as a way to find its parameters. That is, we assume that there is no noise in the data. There is a version of kriging with noise (often called kriging with a nugget), but it is rarely used.

Because of the correlation decay, kriging is a local surrogate with shape functions that are similar to those used for radial basis surrogates, except that the decay can be different in different directions.

Slide 3

Reminder: Covariance and Correlation
• Covariance of two random variables X and Y: Cov(X,Y) = E[(X - E[X])(Y - E[Y])].

• The covariance of a random variable with itself is the square of its standard deviation.
• The covariance matrix for a vector contains the covariances of its components.
• Correlation: rho(X,Y) = Cov(X,Y) / (sigma_X * sigma_Y).

The correlation matrix has ones on the diagonal.
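As a quick numerical illustration of these definitions (a sketch with made-up data; the variable names are ours), MATLAB's cov and corrcoef reproduce them directly:

% Two correlated samples
x = randn(100,1);
y = 0.8*x + 0.6*randn(100,1);
C   = cov(x,y);        % 2-by-2 covariance matrix
rho = corrcoef(x,y);   % 2-by-2 correlation matrix, ones on the diagonal
% Check: the off-diagonal correlation is the covariance scaled by the
% two standard deviations
rho12 = C(1,2) / (std(x)*std(y));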

Slide 4

Correlation between function values at nearby points for sin(x)
Generate 10 random numbers, translate them by a bit (0.1) and by more (1.0):
x = 10*rand(1,10)
  8.147 9.058 1.267 9.134 6.324 0.975 2.785 5.469 9.575 9.649
xnear = x + 0.1; xfar = x + 1;
Calculate the sine function at the three sets:
y = sin(x)
  0.9573 0.3587 0.9551 0.2869 0.0404 0.8279 0.3491 -0.7273 -0.1497 -0.2222
ynear = sin(xnear)
  0.9237 0.2637 0.9799 0.1899 0.1399 0.8798 0.2538 -0.6551 -0.2477 -0.3185
yfar = sin(xfar)
  0.2740 -0.5917 0.7654 -0.6511 0.8626 0.9193 -0.5999 0.1846 -0.9129 -0.9405
Compare correlations:
r = corrcoef(y,ynear);    % off-diagonal entry: 0.9894
rfar = corrcoef(y,yfar);  % off-diagonal entry: 0.4229
The correlation decays to about 0.4 over one sixth of the wavelength.

To illustrate the values of correlations that are expected between function values, we generate random numbers between 0 and 10 and evaluate the sine function at these points. We also translate the points by a small amount compared to the wavelength (0.1) and a larger amount (1.0) and calculate the correlation coefficients between the function values in the original and translated sets.

The correlation coefficient is about 0.99 with the nearby points and 0.42 with the set further away. This reflects the change in function values, as illustrated by the pair of points marked in red.

In kriging, finding the rate of correlation decay is part of the fitting process. This example shows us that with a wavy function we can expect the correlation to decay to about 0.4 over one sixth of the wavelength.

Slide 5

Gaussian correlation function

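As a minimal MATLAB sketch, the Gaussian correlation function in its standard anisotropic form (assumed here, with one decay rate theta(k) per coordinate direction; the function name is ours) is:

function R = gausscorr(x1, x2, theta)
% Gaussian correlation between the points x1 and x2 (row vectors);
% theta(k) controls how fast the correlation decays in direction k.
R = exp(-sum(theta .* (x1 - x2).^2));
end

The correlation is 1 when x1 = x2 and decays smoothly toward 0 as the points move apart, faster in directions with larger theta(k). This matches the philosophy slide: strong correlation nearby, weak far away, and a decay rate that may differ by direction.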

Slide 6

Universal Kriging
• The linear trend function is most often a low-order polynomial.
• We will cover ordinary kriging, where the linear trend is just a constant to be estimated from the data.
• There is also simple kriging, where the constant is assumed to be known.
• Assumption: the systematic departures Z(x) are correlated.
• The kriging prediction comes with a normal distribution of the uncertainty in the prediction.

[Figure: kriging fit to sampled data points, decomposed into a linear trend model plus a systematic departure.]
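As a sketch in standard notation (assumed here, matching the figure's decomposition): the universal kriging model writes the response as y(x) = sum_i beta_i * xi_i(x) + Z(x), where the sum is the linear trend model (just a constant beta_0 in ordinary kriging) and Z(x) is the systematic departure, a zero-mean correlated random process.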

Slide 7

Notation
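For reference, here is common ordinary-kriging notation, assumed in the sketches below (the original slide's symbols may differ):
• n data points x(1), ..., x(n) with sampled values y = (y1, ..., yn)'.
• R: the n-by-n correlation matrix, with R(i,j) the correlation between Z(x(i)) and Z(x(j)), e.g., from gausscorr above.
• r: the n-vector of correlations between Z at the prediction point x and Z at each data point.
• mu and sigma^2: the constant trend and the process variance, both estimated from the data.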

Slide 8

Prediction and shape functions
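A minimal MATLAB sketch of the standard ordinary-kriging predictor in the notation above (the function name okpredict is ours, and this is one common formulation rather than necessarily the slide's exact derivation):

function yhat = okpredict(x, X, y, theta)
% Ordinary kriging prediction at the row vector x, given samples X (n-by-d),
% sampled values y (n-by-1), and correlation decay rates theta (1-by-d).
n = size(X,1);
R = zeros(n,n); r = zeros(n,1);
for i = 1:n
    for j = 1:n
        R(i,j) = exp(-sum(theta .* (X(i,:) - X(j,:)).^2));  % Gaussian correlation
    end
    r(i) = exp(-sum(theta .* (X(i,:) - x).^2));
end
one = ones(n,1);
mu = (one' * (R \ y)) / (one' * (R \ one));   % estimated constant trend
yhat = mu + r' * (R \ (y - one*mu));          % trend plus weighted departures
end

The weight vector R\r acts like a set of shape functions evaluated at x: each data point contributes a radial-basis-like term, and at a data point the prediction reproduces the sampled value exactly, which is the interpolation property mentioned earlier.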

Slide 9

Fitting the data
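Because kriging interpolates the data, the decay rates theta cannot be found by minimizing the rms error; the standard approach (assumed in this sketch) is maximum likelihood, with the trend and variance eliminated analytically:

function nll = negloglik(theta, X, y)
% Concentrated negative log-likelihood for ordinary kriging (sketch).
n = size(X,1);
R = zeros(n,n);
for i = 1:n
    for j = 1:n
        R(i,j) = exp(-sum(theta .* (X(i,:) - X(j,:)).^2));
    end
end
one = ones(n,1);
mu = (one' * (R \ y)) / (one' * (R \ one));        % ML estimate of the trend
sigma2 = (y - one*mu)' * (R \ (y - one*mu)) / n;   % ML estimate of the variance
nll = 0.5 * (n*log(sigma2) + log(det(R)));         % minimize this over theta
end

A typical call is theta = exp(fminsearch(@(t) negloglik(exp(t), X, y), zeros(1,size(X,2)))), searching over log(theta) to keep the rates positive. Each evaluation factors an n-by-n matrix, which is the "high computational cost of fitting" mentioned in the introduction.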


Slide 10

Top hat question
Comparing linear regression with kriging, which of the following statements are correct?
• Linear regression assumes that the response is a linear combination of given shape functions; kriging does not.
• Linear regression minimizes the rms of the residuals; kriging does not.
• Linear regression is much cheaper than kriging.
• Linear regression typically works with fewer parameters than data points, while kriging has more unknown parameters than data points.

Slide 11

Prediction variance
• The square root of the variance is called the standard error.
• The uncertainty at any x is normally distributed.
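A sketch of the corresponding ordinary-kriging prediction variance, using the quantities from the earlier sketches (an assumed, commonly used form):

% Prediction variance at x; R, r, one, sigma2 as in okpredict/negloglik.
% The last term accounts for the uncertainty in the estimated trend mu.
s2 = sigma2 * (1 - r'*(R\r) + (1 - one'*(R\r))^2 / (one'*(R\one)));
s = sqrt(s2);    % the standard error; it shrinks to zero at the data points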

Slide 12

Kriging fit and the improvement question
• First we sample the function and fit a kriging model.
• We note the present best solution (PBS).
• At every x there is some chance of improving on the PBS.
• Then we ask: assuming an improvement over the PBS, where is it likely to be largest?

EGO was developed with kriging, even though it is applicable to any surrogate with an uncertainty model, so in this lecture we will assume that the surrogate is kriging. Given a sample of function values at data points, we fit a kriging model. The first step for EGO is to identify the best sample value, which for minimization is the lowest point. That value is called the present best solution (PBS). Note that it is not the lowest point of the surrogate prediction, which can be even lower (though in the figure they are the same).

Note that at every point (every value of x) the red curve is the center of a normal distribution that extends from plus to minus infinity. So at every point we have some chance of the function being below the PBS. EGO selects the next point to be sampled by asking the following question: assuming that at point x we will see an improvement on the PBS, at which point is the improvement likely to be largest?

Slide 13

What is expected improvement?

Consider the point x=0.8, and the random variable Y, which represents the possible values of the function there. Its mean is the kriging prediction, which is slightly above zero.
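The improvement at x is I = max(yPBS - Y, 0), and for a normal Y its expectation has a well-known closed form, sketched here in base MATLAB (the function and variable names are ours):

function ei = expimp(yhat, s, ybest)
% Expected improvement over the best sample value ybest (minimization),
% where the prediction at x is normal with mean yhat and standard error s.
u   = (ybest - yhat) / s;
Phi = 0.5 * (1 + erf(u/sqrt(2)));   % standard normal CDF
phi = exp(-u^2/2) / sqrt(2*pi);     % standard normal PDF
ei  = (ybest - yhat) * Phi + s * phi;
end

The first term rewards low predictions (exploitation) and the second rewards large standard errors (exploration), the balance discussed on the next slide.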

Slide 14

Exploration and exploitation
• EGO maximizes E[I(x)] to find the next point to be sampled.
• The expected improvement balances exploration and exploitation: it can be high either due to high uncertainty or due to a low surrogate prediction.
• When can we say that the next point is exploration?

Global optimization algorithms are said to balance exploration and exploitation. Exploration is the search in regions that are sparsely sampled, and exploitation is the search in regions that are close to good solutions. Maximizing the expected improvement balances the two because the expected improvement can be high due to large uncertainty in a sparsely sampled region, or due to low kriging predictions.

The expected improvement function is graphed in the bottom region of the figure, and the highest value is seen to be near x=0.2. This is clearly exploration, because the kriging prediction there is not low, but the uncertainty is large because it is far from a sample point. On the other hand, the peaks near x=0.6 and x=0.8 are exploitation peaks, because their main attribute is that they are close to the best point.

Note that this example has the somewhat unusual property for a sparse sample (only 4 points) that the best sample point is very close to the best prediction of the kriging surrogate. As a consequence, EGO will start with exploration. In most cases, the first kriging fit will predict a minimum not so close to a data point, so that minimum or a point very near it will likely be the next sample point, starting EGO with exploitation rather than exploration. Of course, if the new sample point is close to the prediction, so that the fit does not change much, we will then be close to the situation depicted here, and EGO will follow with an exploration move.

Slide 15

Constraint boundary estimation
• When we optimize subject to constraints, evaluating the constraints is often computationally expensive.
• Following the references in the notes, we denote the constraint function as g.
• When we evaluate the constraint, we do not mind having poor accuracy when the constraint is far from its critical value, but accuracy is important when it is nearly critical.

The material on constraint boundary estimation is taken from Viana, F.A.C., Haftka, R.T., and Watson, L.T. (2012). "Sequential sampling for contour estimation with concurrent function evaluation." Structural and Multidisciplinary Optimization, 45(4): 615-618.

The methodology is based on Bichon, B.J., Eldred, M.S., Swiler, L.P., Mahadevan, S., and McFarland, J. (2008). "Efficient global reliability analysis for nonlinear implicit performance functions." AIAA Journal, 46(10): 2459-2468.

Their paper, in turn, is based on Ranjan, P., Bingham, D., and Michailidis, G. (2008). "Sequential experiment design for contour estimation from complex computer codes." Technometrics, 50(4): 527-541, and Ranjan, P., Bingham, D., and Michailidis, G. (2011). "Errata." Technometrics, 53(1): 109-110.

Slide 16

Feasibility function
We define a feasibility function.

G is random due to the uncertainty in the surrogate that is fitted to g; a tolerance band represents this uncertainty, and here we will use twice the standard error. We will add points to maximize the expected feasibility.

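As a sketch, the expected feasibility can be estimated by Monte Carlo, assuming the band-type feasibility function F = max(tol - |zbar - G|, 0) with G normal (see Bichon et al., 2008, for a closed form; the names zbar and nmc are ours):

function ef = expfeas(yhat, s, zbar, nmc)
% Expected feasibility where the surrogate of g predicts mean yhat with
% standard error s; zbar is the critical value of the constraint.
g   = yhat + s*randn(nmc,1);             % samples of the random constraint G
tol = 2*s;                               % band width: twice the standard error
ef  = mean(max(tol - abs(zbar - g), 0));
end

Expected feasibility is largest where the predicted constraint value is near critical and the surrogate is uncertain, so added points concentrate near the constraint boundary, exactly where accuracy matters.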

Slide 17

Branin-Hoo example
Constraint function

mf is the fraction of points misclassified on a grid of 10,000 points.

[Figure: convergence of the misclassification fraction mf.]