
Curve Fitting
Fitting Functions to Data

Rex Boggs
1997 Raybould Fellow

From the Exploring Data website - http://curriculum.qed.qld.gov.au/kla/eda/
© Education Queensland, 1997


Curve Fitting

Introduction

A major part of the Maths B course is the study of functions. If you choose to use real problems with real data when teaching applications of these functions (which I hope you do) then you have to introduce some concepts of linear and nonlinear regression that are outside of our syllabus. The main topics are scatterplots, least squares regression, correlation, normal probability plots and residual plots.

Scatterplots are simple to construct, and the least squares regression line and correlation coefficient only need to be understood at a conceptual level. It is certainly not necessary for a student to be able to find the equation of the line or the value of r by hand. A normal probability plot is optional, though it is a nice application of the normal distribution and can be taught as such, while a residual plot is a simple extension to a scatterplot. The time needed to cover these topics is not great and is balanced by the richer maths course you are able to offer your students.

Curve fitting, as nonlinear regression is often called, nicely integrates algebra and statistics and has the potential to integrate Maths B with other senior subjects, notably Physics, Chemistry and Biology.

Using Technology

The TI-82 graphical calculator allows students to fit a range of non-linear functions to a set of data. The data and a function can be plotted on the same axes so the user can see how well the function fits the data, and the calculator can give a correlation coefficient (or an $r^2$ value) which can assist in deciding if the model is appropriate. The TI-83 calculator will also produce a residual plot to inform the decision about the appropriateness of the model and to assist in looking for underlying patterns. The TI-83 also extends the choice of functions by including the logistic and sinusoidal functions.

Statistics packages such as NCSS Jr and Minitab are both less powerful and more powerful than graphical calculators. With these packages the user applies a transformation to the data to ‘straighten’ it and then finds a linear regression line through the data. The gradient and y-intercept of this line are then used (with some algebra) to find the equation of a non-linear function that fits the original data. Graphical calculators follow the same process with polynomial, exponential and logarithmic functions, but carry out the process automatically.

Statistics packages provide an immense amount of information about the fit of the function, far more than the graphical calculator. NCSS Jr for example will display a number of different residual plots, each of which tells something about how the function fits the data. Statistics packages also allow the user to work with data where there is more than one explanatory variable (called multiple regression). Note that some datasets, eg a dataset that exhibits periodic behaviour, can’t be straightened and hence can’t be analysed with a statistics program.

There are also programs designed specifically for fitting functions to data. One of these is CurveExpert. It contains over twenty-five common classes of functions (plus user-defined functions) and a specialised method of fitting a function to the data that doesn’t require the data be straightened first. CurveExpert provides an r value, the standard error and a residual plot to assist the user in choosing which function best fits the data, but doesn’t provide the range of other information given by a statistics program. There is a danger with CurveExpert that a user will go ‘function shopping’ and end up choosing a function based solely on the largest r value or the smallest standard error. The screen shot below, where a polynomial of degree 13 is applied to a dataset with 15 values, shows the danger in this.



An awareness of the properties of the functions studied in Maths is critical if students are to make an informed choice about which functions are possible candidates for a particular dataset. A knowledge of the effect of each parameter on the graph of the function is equally valuable.

There is value in students carrying out the entire process of transforming a dataset to ‘straighten’ it (using logarithms, for example), using technology to find a least squares regression line and then doing the algebra to transform back to get the nonlinear function that fits the original data, for at least one or two datasets. Past that I would allow students to use whichever technology they feel is most appropriate. Students should justify the decision about the technology they have used, the function chosen to model the data and the accuracy of the model, and they need to discuss if extrapolating to values larger or smaller than those in the dataset is appropriate.



Using Statistics in Biology - The Outbreak of the Gypsy Moth

Biological populations can grow exponentially if not restrained by predators or lack of food or space. Here are some data on an outbreak of the gypsy moth, which devastated forests in Massachusetts in the US. Rather than count the number of moths, the number of acres defoliated by the moths was counted. This data was supplied by Chuck Schwalbe, U.S. Dept of Agriculture.

Year     1978      1979      1980      1981
Acres    63 042    226 260   907 075   2 826 095

1. Plot the number of acres defoliated y against the year x. Does the pattern of growth appear to be exponential?

2. Use a graphical calculator to find the exponential model that best fits this data.

3. Use this model to predict the number of acres defoliated in 1982. The actual number for 1982 was 1 383 265. Give a reason why the predicted value and the actual value could be so different.

The Outbreak of the Gypsy Moth : Solution

Year     1978      1979      1980      1981
Acres    63 042    226 260   907 075   2 826 095

If we use this data as is, we run into a problem. We are trying to find the values of $A_0$ and $k$ to fit the equation $A = A_0 e^{kY}$, where A is the number of acres and Y is the year. The numbers as given are outside the range of values that the calculator can handle for exponential regression. Since the actual years are not important, we will modify the data as follows.

Year             1     2     3     4
Acres (1000s)    63    226   907   2 826

1. Draw a scatterplot of Acres(1000s) versus Year.

2. The data looks exponential. Use your graphical calculator to fit an exponential curve to this data.

An exponential model seems to fit this data very well.



3. Draw a residual plot. With only four data values, showing that a pattern exists will be difficult. There is no obvious pattern to this data.

We conclude that an exponential function is a good model for this data. The model given by the graphical calculator is:

$A = 17.8 \times 3.60^{Y}$

Students should alter this to the form $A = A_0 e^{kY}$, which is more common for biological models.

4. The predicted value for the following year is 10 723 000. The actual value was 1 383 000. There was a viral infection in the gypsy moths which reduced their numbers drastically.

Extrapolate with care!
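The calculator's result can be checked by hand or with a few lines of code. The following is a minimal sketch in Python (an assumption; the original uses only a graphical calculator). It assumes the calculator straightens the data with logarithms before fitting a line, as described under Using Technology, and the helper name `fit_line` is made up for this illustration.

```python
import math

def fit_line(xs, ys):
    """Least squares line y = intercept + slope * x (hypothetical helper)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

years = [1, 2, 3, 4]             # 1978 to 1981 recoded as 1 to 4
acres = [63, 226, 907, 2826]     # thousands of acres defoliated

# Straighten the data: ln(A) = ln(A0) + k*Y, then fit a straight line.
intercept, k = fit_line(years, [math.log(a) for a in acres])
A0 = math.exp(intercept)
base = math.exp(k)               # calculator form: A = A0 * base**Y

# Extrapolate one year past the data (Year 5 is 1982).
predicted_1982 = A0 * math.exp(k * 5)

print(f"A = {A0:.1f} * {base:.2f}^Y  or  A = {A0:.1f} * e^({k:.3f} Y)")
print(f"Predicted for Year 5: {predicted_1982:.0f} thousand acres")
```

The value of k is simply the natural logarithm of the base in the calculator's form of the model, which is the algebra students are asked to do when they alter the equation.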

The Size of Alligators

Many wildlife populations are monitored by taking aerial photographs. Information about the number of animals and their whereabouts is important to protecting certain species and to ensuring the safety of surrounding human populations.

In addition, it is sometimes possible to monitor certain characteristics of the animals. The length of an alligator can be estimated quite accurately from aerial photographs or from a boat. However, the alligator's weight is much more difficult to determine. In the example below, data on the length (in inches) and weight (in pounds) of alligators captured in central Florida are used to develop a model from which the weight of an alligator can be predicted from its length.

Weight   Length      Weight   Length
130      94          83       86
51       74          70       88
640      147         61       72
28       58          54       74
80       86          44       61
110      94          106      90
33       63          84       89
90       86          39       68
36       69          42       76
38       72          197      114
366      128         102      90
84       85          57       78
80       82

A scatterplot of weight against length reveals that the relationship between these variables is not linear but curved. A successful model must take into account this non-linear relationship.



Some Possible Models

1. Assume that alligators 'scale up' nicely, ie an alligator that is twice as long is also twice as wide, twice as thick, each tooth is twice as long, etc. Then the model would be a cubic power function, $W = k L^3$. To find k, we plot W vs $L^3$ and find the least squares regression line. The slope of the line is k.

2. Assume that the model is a power function of the form $W = k L^b$. Plot ln(weight) vs ln(length) and find the least squares regression line. It’s a nice application of log laws to find values for k and b algebraically. As this is the method that the TI-82/3 uses, the values for k and b should match those given by these calculators when using the power regression function.

3. Assume that the model is a power function of the form $W = k L^b$ and use CurveExpert to find values for k and b. CurveExpert uses an iterative method to find values for the parameters that doesn’t require ‘straightening’ the data first, and finds values for k and b that minimise the standard error.

4. Assume that the model is a general cubic of the form $W = aL^3 + bL^2 + cL + d$. Use either CurveExpert or the TI-83 to find the best fit.

Some Possible Models - Results

First Model - weight vs length^3

This model assumes the relationship between weight and length is a cubic one. Here is the output from the Regression section of NCSS Jr., a freeware statistics program ( http://WWW.NCSS.com/ ).

Regression Equation Section
Independent   Regression      Standard    T-Value     Prob       Decision    Power
Variable      Coefficient     Error       (Ho: B=0)   Level      (5%)        (5%)
Intercept     -34.81806       6.620628    -5.2590     0.000025   Reject Ho   0.998927
l_cubed       1.973203E-04    0           0.0000      1.000000   Accept Ho   0.050000
R-Squared     0.973043

Plots Section

[Plots: histogram of the residuals of weight; normal probability plot of the residuals of weight; residuals vs predicted values; scatterplot of weight vs l_cubed]


Analysis

The scatterplot of weight vs l_cubed indicates the transformed data is approximately linear. The regression output indicates k = .0001973 and the y-intercept is -34.8, ie the regression equation is $W = .0001973 L^3 - 34.8$. The $r^2$ value of 0.973 indicates that the fit of the straight line to the linearised data is quite good. However the residual plot shows some disturbing patterns. For all of the alligators except the largest three, there is a notable negative gradient on the residuals, while the largest three show the opposite trend. Additionally the scatterplot of weight vs L-cubed shows that the lengths of the largest alligators are influential values as their x-values (the cube of the length) are much larger than the others. This is due in part to the data being cubed, as any differences in length would be exaggerated by this operation. Finally the normal probability plot of the residuals shows that the residuals may not be normally distributed which again indicates that this model may not be satisfactory.

There may be a temptation to remove these large values as they appear to be atypical, but that would be a big mistake! After all it is the largest alligators that are the most important in terms of their impact on nearby human populations.

If we chose to use this model, there may be a case for using a piecewise function - one branch for the typical alligators and another for larger alligators.
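The first model can be reproduced without NCSS Jr. The sketch below is in Python (an assumption; the original uses NCSS Jr), with a made-up helper `fit_line` that returns the intercept, the slope and $r^2$ of an ordinary least squares line. It regresses weight on the cube of length for the 25 alligators in the table.

```python
def fit_line(xs, ys):
    """Least squares line y = intercept + slope*x, plus r squared (hypothetical helper)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope, sxy ** 2 / (sxx * syy)

# Alligator data from the table: length in inches, weight in pounds.
lengths = [94, 74, 147, 58, 86, 94, 63, 86, 69, 72, 128, 85, 82,
           86, 88, 72, 74, 61, 90, 89, 68, 76, 114, 90, 78]
weights = [130, 51, 640, 28, 80, 110, 33, 90, 36, 38, 366, 84, 80,
           83, 70, 61, 54, 44, 106, 84, 39, 42, 197, 102, 57]

# Straighten the data by cubing the lengths, then fit W = k * L^3 + c.
l_cubed = [L ** 3 for L in lengths]
intercept, k, r_squared = fit_line(l_cubed, weights)
print(f"W = {k:.7f} * L^3 + ({intercept:.1f}),  r^2 = {r_squared:.3f}")

# Residuals (observed minus predicted) for the three longest alligators.
residuals = sorted((L, w - (intercept + k * L ** 3))
                   for L, w in zip(lengths, weights))
for L, res in residuals[-3:]:
    print(f"length {L}: residual {res:.1f}")
```

Printing the residuals for the three longest alligators gives a quick way to look at the influential values discussed in the analysis.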

Second Model - ln(weight) vs ln(length)

This model assumes the relationship between weight and length is a power function. To straighten this data we plot ln(weight) vs ln(length). Here is the output from NCSS Jr.

Regression Equation Section
Independent   Regression    Standard     T-Value     Prob       Decision    Power
Variable      Coefficient   Error        (Ho: B=0)   Level      (5%)        (5%)
Intercept     -10.1746      0.7316143    -13.9071    0.000000   Reject Ho   1.000000
ln_len        3.285993      0.1653929    19.8678     0.000000   Reject Ho   1.000000
R-Squared     0.944940



Plots Section

[Plots: histogram of the residuals of ln_wht; normal probability plot of the residuals of ln_wht; residuals vs predicted values; scatterplot of ln_wht vs ln_len]

Analysis

From the scatterplot the transformed data appear to be approximately linear. The original model is $W = k L^b$ and the transformed model is $\ln(W) = a \ln(L) + c$. From the regression output we have $\ln(W) = 3.286 \ln(L) - 10.175$. It is a nice application of logarithm laws to obtain the function for W in terms of L.

$$
\begin{aligned}
\ln(W) &= 3.286 \ln(L) - 10.175 \\
\ln(W) &= \ln(L^{3.286}) - \ln(26239) \\
\ln(W) &= \ln\left(\frac{L^{3.286}}{26239}\right) \\
W &= .0000381 \, L^{3.286}
\end{aligned}
$$

It is worth noting that the TI-82 and TI-83 give the same values if power regression is applied to this data. The residual plot looks much better with only a weak pattern to the data. Except for one extreme value the normal plot of the residuals shows there are no problems with this model. While the $r^2$ value is a bit lower at .945, the model still fits the data very well.
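The same straighten, fit and transform back process can be sketched in a few lines of code. This is an illustration in Python (an assumption; the original uses NCSS Jr and the TI calculators), and the helper name `fit_line` is made up. The slope of the line through the logged data is the power b, and the constant k is recovered by exponentiating the intercept.

```python
import math

def fit_line(xs, ys):
    """Least squares line y = intercept + slope*x, plus r squared (hypothetical helper)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope, sxy ** 2 / (sxx * syy)

# Alligator data from the table: length in inches, weight in pounds.
lengths = [94, 74, 147, 58, 86, 94, 63, 86, 69, 72, 128, 85, 82,
           86, 88, 72, 74, 61, 90, 89, 68, 76, 114, 90, 78]
weights = [130, 51, 640, 28, 80, 110, 33, 90, 36, 38, 366, 84, 80,
           83, 70, 61, 54, 44, 106, 84, 39, 42, 197, 102, 57]

# Straighten: ln(W) = ln(k) + b * ln(L).
ln_len = [math.log(L) for L in lengths]
ln_wht = [math.log(w) for w in weights]
ln_k, b, r_squared = fit_line(ln_len, ln_wht)

# Transform back: W = k * L^b with k = e^(intercept).
k = math.exp(ln_k)
print(f"ln(W) = {b:.3f} ln(L) + ({ln_k:.3f}),  r^2 = {r_squared:.3f}")
print(f"W = {k:.7f} * L^{b:.3f}")
```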



Third Model - Power Function Using CurveExpert

CurveExpert is a curve-fitting program that uses iterative methods of finding a ‘curve of best fit’ to a dataset (the shareware version is available from http://www.ebicom.net/~dhyams/cvxpt.htm). As it doesn’t linearise the data first, its results generally don’t agree with those generated by NCSS Jr or graphical calculators. It is a matter of judgement as to which model should be used.

Here is the output data from CurveExpert. ‘S’ is the standard error and ‘r’ the correlation coefficient.

Power Fit: y=ax^b

Coefficient Data:
a = 3.4045285e-006    S = 13.50
b = 3.8140161         r = .9949

Below is the plot of the data with the line of best fit, and the residual plot.

[Plot: the data with the fitted power curve; S = 13.50301636, r = 0.99486573]

[Plot: Residuals]

Analysis

The model that CurveExpert has found is $W = .0000034 L^{3.81}$. The correlation coefficient r = .995 and the standard error S = 13.50. While it may seem that a method that fits the data directly to a model would have to be preferable to one that linearises the data first, that isn’t necessarily the case.

The benefit of using a statistics program like NCSS is the wealth of information available about the validity of the fit. All that CurveExpert supplies is a residual plot, valuable to be sure but in some cases maybe not sufficient.

It is important that the student doesn’t go ‘curve shopping’, i.e. choosing the model that gives the smallest standard error. There should be a reason why a particular model is chosen and a physical interpretation of each of the parameters. In this problem it is reasonable to expect the power to be approximately 3 since weight correlates strongly with volume. A value of 3.81 appears too high for this situation.

In our previous model we were able to reduce the influence of the largest alligators by taking the logarithm of the lengths. This isn’t possible with CurveExpert, so the three largest lengths are much larger than the others and they have undue influence on the model. One way to determine the influence of a data value is to remove it and recalculate the regression equation. Removing the largest alligator from the dataset gave the following values:

Power Fit: y=ax^b

Coefficient Data:
a = 9.491332e-06
b = 3.5899794

Removing a single data value has markedly altered the values of both parameters. Removing the three large alligators gave this output:

Power Fit: y=ax^b

Coefficient Data:
a = 4.6383263e-05
b = 3.236518

The implication is that this model is not very robust and hence should not be used.
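The remove-and-refit check can be sketched in code. The original does not say which iterative method CurveExpert uses, so the Python below is only a stand-in under stated assumptions: it minimises the sum of squared errors of W = a*L^b on the untransformed data, using the fact that for a fixed b the best a has a closed form, and searching over b with a golden-section search. The function name `fit_power` and the search interval are made up for this illustration, and its output need not match CurveExpert to every digit.

```python
def fit_power(lengths, weights, lo=1.0, hi=6.0, tol=1e-7):
    """Direct least squares fit of W = a * L**b (no straightening).

    For a fixed b, the a that minimises the squared error is
    sum(W * L^b) / sum(L^(2b)), so only b needs to be searched.
    """
    def best_a(b):
        num = sum(w * L ** b for L, w in zip(lengths, weights))
        den = sum(L ** (2 * b) for L in lengths)
        return num / den

    def sse(b):
        a = best_a(b)
        return sum((w - a * L ** b) ** 2 for L, w in zip(lengths, weights))

    ratio = (5 ** 0.5 - 1) / 2          # golden-section ratio
    while hi - lo > tol:
        c = hi - ratio * (hi - lo)
        d = lo + ratio * (hi - lo)
        if sse(c) < sse(d):
            hi = d                      # minimum lies in [lo, d]
        else:
            lo = c                      # minimum lies in [c, hi]
    b = (lo + hi) / 2
    return best_a(b), b

# Alligator data from the table: length in inches, weight in pounds.
lengths = [94, 74, 147, 58, 86, 94, 63, 86, 69, 72, 128, 85, 82,
           86, 88, 72, 74, 61, 90, 89, 68, 76, 114, 90, 78]
weights = [130, 51, 640, 28, 80, 110, 33, 90, 36, 38, 366, 84, 80,
           83, 70, 61, 54, 44, 106, 84, 39, 42, 197, 102, 57]

pairs = sorted(zip(lengths, weights))   # shortest to longest
fits = {}
for dropped in (0, 1, 3):               # drop none, the longest, the three longest
    kept = pairs[:len(pairs) - dropped]
    fits[dropped] = fit_power([L for L, _ in kept], [w for _, w in kept])
    a, b = fits[dropped]
    print(f"dropped {dropped} longest: a = {a:.4e}, b = {b:.3f}")
```

If the fitted power moves a long way when one or three alligators are removed, that is the lack of robustness described above.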

Fourth Model - Fitting a General Cubic Equation

Either a graphics calculator or CurveExpert can be used to fit a general cubic equation to the dataset. Here is the output from CurveExpert:

3rd degree Polynomial Fit: y=a+bx+cx^2+dx^3...

Coefficient Data:
a = -277.82231        S = 11.36
b = 11.473033         R = .9966
c = -0.15420092
d = 0.00080703423


[Plot: the data with the fitted cubic curve; S = 11.36027521, r = 0.99668494]

[Plot: Residuals]

Analysis

The regression equation is $W = .000807 L^3 - .154 L^2 + 11.5 L - 278$. The standard error = 11.36 and the r value = .9966. Based on the standard error and correlation coefficient, one may think that this model is satisfactory.

However what is happening here is that this model is including some of the sample error into the model itself by wiggling its way through the data values. A small standard error does not mean the model is a valid one! I would reject this model on the basis that there is no physical reason for including all of these terms of the cubic function.
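One way to see the lack of a physical basis is to evaluate the cubic outside the range of the data. The sketch below is in Python (an assumption, as are the function names `cubic` and `power`). It uses the cubic coefficients from the CurveExpert output and, for comparison, the power function obtained from the ln(weight) vs ln(length) regression. The shortest alligator in the dataset is 58 inches, so a length of 30 inches is an extrapolation.

```python
def cubic(L):
    """General cubic with the coefficients from the CurveExpert output."""
    return (-277.82231 + 11.473033 * L
            - 0.15420092 * L ** 2 + 0.00080703423 * L ** 3)

def power(L):
    """Power model from the ln-ln regression: W = .0000381 * L^3.286."""
    return 0.0000381 * L ** 3.286

# 30 is below the data range; 58 and 147 are the shortest and longest alligators.
for L in (30, 58, 100, 147):
    print(f"L = {L:3d}:  cubic = {cubic(L):7.1f}   power = {power(L):7.1f}")
```

Within the data the two models give similar weights, but for a short alligator the cubic gives a negative weight, while a power function can never do so.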

Decision

Having tested a number of models I will choose the power function $W = .000038 L^{3.3}$. I feel that two significant digits is reasonable accuracy given that the sample size is small. This model was less influenced by the large data values, and analysis of this model showed a reasonable residual plot and normal plot. This model can of course be modified if more alligators are able to be captured and measured.



World Oil Production

Mathematical models which describe physical phenomena are often very accurate, reflecting the simple underlying formula that links the variables and the ability to measure such variables precisely. We usually aren't so fortunate when we model activities that involve nature and biological processes, while those involving people are the most difficult of all to model accurately.

The data in the table is the world oil production measured in millions of barrels. Your task is to find a function to model this data, discussing limitations of your model, and its usefulness as a predictor of future production.

Construct a residual plot and discuss any interesting features in the plot.

Year    Mbbl
1880    30
1890    77
1900    149
1905    215
1910    328
1915    432
1920    689
1925    1069
1930    1412
1935    1655
1940    2150
1945    2595
1950    3803
1955    5626
1960    7674
1962    8882
1964    10310
1966    12016
1968    14104
1970    16690
1972    18584
1974    20389
1976    20188
1978    21922
1980    21722
1982    19411
1984    19837
1986    20246
1988    21338



Solution - World Oil Production

The first thing we do with data is to graph it, in this case a scatterplot with time as the predictor variable and oil production as the response variable.

[Scatterplot: Year vs Mbbl, with Year on the horizontal axis]

The data is clearly nonlinear. Assuming a constant percentage growth gives rise to an exponential model. To test this we can plot Year vs ln(Mbbl) and see how linear the data appears to be.

[Scatterplot: Year vs Ln_MBBI, with Year on the horizontal axis]

The data is approximately linear so the model is worth pursuing. The original scatterplot does indicate that something peculiar happened in the early 1970s to alter the exponential pattern. It was of course the war in the Mid-East that disrupted oil production. Looking back at the original data, obviously the model isn’t applicable after 1972 so the next step in the analysis is to delete the last 8 rows of data. Here is a new plot with this data removed, and the least squares regression line added.



[Scatterplot: Year vs Ln_MBBI for 1880 to 1972, with the least squares regression line]

Before using NCSS to do a regression analysis on this data it is useful to make 1880 the base year (by subtracting 1880 from each year), otherwise the y-intercept represents an estimate for Ln(Mbbl) in the year 0, which is quite outside the range of values we are considering.

Both the regression equation and the plots output are given below.

Regression Equation Section
Independent   Regression      Standard        T-Value     Prob       Decision    Power
Variable      Coefficient     Error           (Ho: B=0)   Level      (5%)        (5%)
Intercept     3.702236        6.431098E-02    57.5677     0.000000   Reject Ho   1.000000
Year_adj      6.649784E-02    1.025216E-03    64.8623     0.000000   Reject Ho   1.000000
R-Squared     0.995504

Plots

[Plots: histogram of the residuals of Ln_MBBI; normal probability plot of the residuals of Ln_MBBI; residuals vs Year_adj]

A bit of algebra gives Mbbl (b) in terms of Year Since 1880 (y).

$$
\begin{aligned}
\ln(b) &= 3.702 + .0665\,y \\
b &= e^{3.702 + .0665 y} \\
b &= 40.53\, e^{.0665 y}
\end{aligned}
$$
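The regression and the algebra can be checked with a short script. This is a sketch in Python (an assumption; the original uses NCSS), with a made-up helper `fit_line`. It keeps the rows up to 1972, subtracts 1880 from each year, fits a line to ln(Mbbl) and transforms back. The results can be compared with the slope (6.649784E-02) and intercept (3.702236) in the NCSS output.

```python
import math

def fit_line(xs, ys):
    """Least squares line y = intercept + slope * x (hypothetical helper)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

# World oil production (year, millions of barrels) from the table.
production = [
    (1880, 30), (1890, 77), (1900, 149), (1905, 215), (1910, 328),
    (1915, 432), (1920, 689), (1925, 1069), (1930, 1412), (1935, 1655),
    (1940, 2150), (1945, 2595), (1950, 3803), (1955, 5626), (1960, 7674),
    (1962, 8882), (1964, 10310), (1966, 12016), (1968, 14104),
    (1970, 16690), (1972, 18584), (1974, 20389), (1976, 20188),
    (1978, 21922), (1980, 21722), (1982, 19411), (1984, 19837),
    (1986, 20246), (1988, 21338),
]

# Delete the last 8 rows (after 1972) and make 1880 the base year.
kept = [(year - 1880, mbbl) for year, mbbl in production if year <= 1972]
year_adj = [y for y, _ in kept]
ln_mbbl = [math.log(m) for _, m in kept]

intercept, c = fit_line(year_adj, ln_mbbl)
A = math.exp(intercept)             # b = A * e^(c*y)
annual_growth = math.exp(c) - 1     # constant percentage growth per year

print(f"ln(b) = {intercept:.3f} + {c:.4f} y")
print(f"b = {A:.2f} * e^({c:.4f} y), growth per year = {annual_growth:.1%}")
```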

Report

The scatterplot of Mbbl vs Year Since 1880 shows a non-linear pattern until the early 1970s, after which the underlying pattern is obviously changed. We restrict our domain to 1880 - 1972 by removing the last 8 pairs of data.

The scatterplot of ln(Mbbl) vs year shows that the relationship is approximately linear and hence the function is of the form $b = A e^{cy}$. The regression analysis followed by a bit of algebra gives us the equation $b = 40.53 e^{.0665y}$.

It is worth noting that with real-life data there may be underlying patterns that are the result of external factors and hence can’t be accounted for by the mathematical model. The residual plot and the normal probability plot together indicate some underlying patterns to this dataset.

The residual plot shows that growth was faster than that predicted by the model prior to 1930, slower in the 1930s and 1940s, and then faster again in the 1950s to the 1970s. The obvious factors linked to the slowdown are the Great Depression and World War II. A history lesson reflected in a residual plot!

Student Generated Data

In addition to using data gathered by researchers, students should generate their own data, either by experiment or observation, and then find a model that gives an acceptable fit to the data. Gathering good data is not always an easy task, and students are best made aware of this by going through this process. The first set of ideas comes from an email from Alice Hankla, Galloway School, Atlanta, Georgia, USA. I have edited her email slightly.

**

With exponentials, we use kid-collected data (say radioactivity or Newton's law of cooling) and graph on semilog paper first, then use a graphing program. For log-log plots, we do Kepler's law concerning period vs average distance from sun. One is power of 2 and the other power of 3, hence log-log to straighten out.

About the cubic and quartic and higher powers of polynomials. One can fit anything to a high power and it is meaningless. This is the time for the lesson in the meaning of the parameters of an equation vs best fit. They must think about the model, and about extrapolating values. A good problem is the stopping distance of cars at various speeds (data from Georgia Drivers Manual, which they all have). Fit to quadratic or cubic and the extrapolated distance for 100 mph is outrageous. Point made.

Another outrageous extrapolation is from cumulated number of cases of AIDS diagnosed (downloaded from CDC webpage). The last year shows a tad of a decrease and decrease is predicted for the future. (It is being modelled via diff equations; my class attended a lecture by a man currently doing this.) Another interesting dataset is the number of TB cases, which was exponential decay until 1988 then the graph turns upward (data from CDC by phone and I have and can send via this listserve if anyone wants it).

We also do the number of hamburgers that McDonald's sold annually and the number of locations opened cumulatively. I got this data from the web about 4 years ago and can send it to you all.

Increase in population world-wide and in US is in the newspaper from time to time - exponential, of course.

From a pizza menu, graph area vs price for various toppings to see the parallel and to see best buy.

Nonlinear - Mostly Physics

Time same size cans for different content - cream soup, thin soup, vege soup - as each rolls down the same incline. Also can time one can for different distances.

On a gallon milk jug, make marks one inch apart. Fill with water to each mark and time exit from a hole near the bottom until level is at a specified point.

Measure height of bounce of a tennis ball until it stops. Measure first height of bounce when dropping at different heights.

Time a steel ball falling same distance in different liquids - motor oil, shampoo, vinegar. Meaningless but fun. Different sizes of same metal balls in same liquid does give coefficient of viscosity.

Time students running up a set of stairs. Plot time vs weight. Separate into gender. Can get work by figuring height of steps - mgh and horsepower by mgh/time.

Newton's law of cooling - Record water temp at 3-min intervals for 2 hours. Graph time vs change in temperature and time vs temperature.

Torque on a meterstick: Set a meterstick with its 30 cm mark over the fulcrum. Balance it by hanging a known weight on one location, say at 5 cm from fulcrum. Record weight and location. Repeat for different weights and distances from fulcrum.

Measure and graph object distance vs real image distance for a convex lens - hyperbola. For virtual images one gets the other half of the hyperbola. Unique experiment.



Other Ideas

Here are a few other non-linear modelling applications that I have heard about.

Hang a rope across the classroom before the students arrive, with the ends at different heights. The students are allowed to measure whatever they want, but only to the middle of the room. They must use their data to create a mathematical model that predicts how high up the wall the other end of the rope is attached. Then test it.

Grow a spinach plant, and measure its height periodically. This gives a nice logistic equation, at least it does if you water regularly!

Fill a plastic drum with water, place it on a stand, punch a hole in the side (near the bottom) and measure how far the water squirts out, as a function of time.

Bounce a wet superball down a footpath (ie sidewalk) and measure the distance between wet spots. It gives a remarkably good decaying exponential function.

One that I tried - Using a CBL unit with a TI-83, we took a park bench (you can use any long straight thing with a groove that will keep a ball on track) and gave it a slight tilt (eg a brick under one end). We rolled a ball (actually a small globe because that was all we could find) down the incline, using the CBL and a motion detector probe to measure distance vs time.

We plotted the data, and applied a quadratic model. After deleting the first few data points (since we reckon my mate John pushed the globe slightly rather than just letting it roll), we had a correlation coefficient of over 0.999. I was very surprised it was so high.

The WorldWatch Datadisk

Worldwatch tracks trends in key factors that affect our environment, with many of their datasets starting around 1950. They publish a very interesting book every year, called Vital Signs, which discusses the changes in the trends over the past year. More importantly for statistics education, they publish a disk with all of the data given in the tables and graphs in Vital Signs. To find out more about the data disk, visit the WorldWatch website at http://www.worldwatch.org/

Further Problems of Linear and Non-Linear Regression

Statistics and Nutrition

A study of nutrition in developing countries collected data from the Egyptian village of Nahya. Here are the mean weights for 170 infants in Nahya who were weighed each month during their first year of life.

(Data from Zeinab E. M. Afifi, “Principal components analysis of growth of Nahya infants: Size, velocity and two physique factors,” Human Biology, 57 (1985), pp. 659-669.)

Age (Months)   1     2     3     4     5     6     7     8     9     10    11    12
Weight (kg)    4.3   5.1   5.7   6.3   6.8   7.1   7.2   7.2   7.2   7.2   7.5   7.8



1. Plot the mean weight against time. Compute the least squares regression line. Plot this line on your graph. Is it an acceptable summary of the overall pattern of growth?

2. Plot residuals against age. Describe what this output tells you.

3. Describe a better model for weight against age (Hint: there may be different functions needed for different ages).

Erosion

A study of erosion produced the following data on the rate (in litres per second) at which water flows across a soil test bed and the weight (in kilograms) of soil washed away. (Data from G.R. Foster, W.R. Ostercamp, and L.J. Lane, “Effects of discharge rate on rill erosion,” paper presented at the 1982 Winter Meeting of the American Society of Agricultural Engineers.)

Flow Rate      .31    .85    1.26   2.47   3.75
Eroded Soil    .82    1.95   2.18   3.01   6.07

Find a mathematical model for this data. Determine its validity. Comment on the strength of its predictive value.

Mystery Dataset

from: Bruce King, New Milford, CT

I have enjoyed asking my students, occasionally, to deal with a "mystery" data set. That is, I want them to arrive at conclusions, however tentative, based only on the characteristics of the data, without taking into account any contextual information. Afterwards, I tell them what the variables measure; sometimes this can be a bit of fun.

For example, once some years ago I grabbed about two dozen hard-cover books from my shelves, and measured such things as weight (Y), area of cover, number of pages, thickness, etc., and number of letters in the author's (or first author's) last name. I would ask students to construct a model for Y in terms of the other variables. (This was in a unit on multiple regression. In a more elementary course, I might ask them to identify, however tentatively, which X-variable(s) are strongly related to Y; and which are not related to Y.) Only after we had done everything we could would I tell them what the Xs and Y were.

Here's another "mystery" data set I've found useful about this time of the year (or a bit later) in a "Moore-oriented" course (i.e., one that relies on BPS or IPS). It involves transformations, which appear first in BPS on p.110; and in IPS, on p.150.



X        Y
.3871    .2409
.7323    .6152
1.000    1.000
1.524    1.881
5.203    11.86
9.555    29.46
19.22    84.01
30.11    164.8
39.81    247.7

The idea, of course, is to find a reasonable model for Y as a function of X.

If you haven't seen this before, you might like to try it before looking at some feedback, which I'll put in a separate, second, message.

Answer: Kepler may have recognised this data, as it is the ‘average’ distance of each planet from the Sun (using the Earth’s distance as 1) and the length of the planet’s year (with the Earth’s year as 1).
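A log-log plot straightens this dataset, and the slope of the fitted line gives the power. The sketch below is in Python (an assumption; the helper name `fit_line` is made up) and uses the values exactly as listed in the table. Kepler's third law says the square of the period is proportional to the cube of the distance, so the slope should be close to 3/2.

```python
import math

def fit_line(xs, ys):
    """Least squares line y = intercept + slope * x (hypothetical helper)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

# Mystery data as listed: X = distance from the Sun, Y = length of year.
x = [.3871, .7323, 1.000, 1.524, 5.203, 9.555, 19.22, 30.11, 39.81]
y = [.2409, .6152, 1.000, 1.881, 11.86, 29.46, 84.01, 164.8, 247.7]

# Straighten with logs: ln(Y) = ln(k) + p * ln(X).
ln_k, p = fit_line([math.log(v) for v in x], [math.log(v) for v in y])
print(f"Y = {math.exp(ln_k):.3f} * X^{p:.3f}")
```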



Sample Assignment

Galileo's Gravity and Motion Experiments

Over 400 years ago, Galileo conducted a series of experiments on the paths of projectiles, attempting to find a mathematical description of falling bodies. Two of his experiments are the basis of this assignment.

The experiments consisted of rolling a ball down a grooved ramp that was placed at a fixed height above the floor and inclined at a fixed angle to the horizontal. In one experiment the ball left the end of the ramp and descended to the floor. In a related experiment a horizontal shelf was placed at the end of the ramp, and the ball would travel along this shelf before descending to the floor. In each experiment Galileo altered the release height of the ball (h) and measured the distance (d) the ball travelled before landing. The units of measurement were called 'punti'. A page from Galileo's notes is shown below.

The data from these experiments is given in the following two tables.

Table 1 - Ramp Only

Release Height Above Table (h)    Horizontal Distance (d)
1000                              573
800                               534
600                               495
450                               451
300                               395
200                               337
100                               253

Table 2 - Ramp and Shelf

Release Height Above Table (h)    Horizontal Distance (d)
1000                              1500
828                               1340
800                               1328
650                               1172
300                               800

Source: Drake, S. (1978), Galileo at Work, Chicago: University of Chicago Press.



Take Note of:

Ockham's Razor: A maxim that whenever possible choose a simple model over a more complicated one. It just seems to be the way the world often works.

1. Use the Ramp and Shelf data to find a mathematical model for the horizontal distance travelled as a function of release height. In particular:

a. Test at least two different mathematical models. Show any scatterplots and statistical analyses used in these tests.

b. Decide which mathematical model you feel best represents the data. Justify your decision.

c. Discuss how accurately your model fits the given data.

d. Would your mathematical model give sensible answers if the ball is released at greater heights? How well does your model work if the release height is 0?

2. According to Jeffreys and Berger in an erratum to an article in American Scientist (1992), the model for the Ramp Only data is of the form

$$h = \frac{a d^{2}}{1 + b d}$$

where a and b are parameters to be determined.

a. Find the values of a and b, using a non-linear regression software program such as CurveExpert.

Note that this program requires you to enter initial estimates for these parameters. You can find good initial estimates of the values of a and b by choosing two pairs of data from the Ramp Only data table, substituting into the above function and solving the resulting simultaneous equations.

b. Discuss how accurately your function fits the data.

c. What is the domain of d, given these values for a and b? What is the physical interpretation of this domain?
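The note about initial estimates can be sketched as follows. Multiplying the model out gives a*d^2 - b*h*d = h, which is linear in a and b, so two data pairs give two simultaneous linear equations. The Python below is only an illustration (the language and the function name `initial_estimates` are assumptions), and the choice of the first and last rows of the Ramp Only table is arbitrary. These are starting values for the regression program, not the fitted parameters.

```python
def initial_estimates(pair1, pair2):
    """Solve h = a*d**2 / (1 + b*d) exactly through two (h, d) pairs.

    Rearranged, each pair gives a*d**2 - b*h*d = h, a 2x2 linear
    system in a and b, solved here with Cramer's rule.
    """
    (h1, d1), (h2, d2) = pair1, pair2
    # | d1^2  -h1*d1 | |a|   |h1|
    # | d2^2  -h2*d2 | |b| = |h2|
    det = d1 ** 2 * (-h2 * d2) - (-h1 * d1) * d2 ** 2
    a = (h1 * (-h2 * d2) - (-h1 * d1) * h2) / det
    b = (d1 ** 2 * h2 - d2 ** 2 * h1) / det
    return a, b

# First and last rows of Table 1 - Ramp Only, as (h, d).
a, b = initial_estimates((1000, 573), (100, 253))
print(f"initial estimates: a = {a:.6f}, b = {b:.6f}")

# The curve through these estimates reproduces the two chosen points.
for h, d in ((1000, 573), (100, 253)):
    print(f"d = {d}: model h = {a * d ** 2 / (1 + b * d):.1f}, observed h = {h}")
```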

References:

Dickey, D.A. and Arnold, J.T. (1995). Teaching Statistics with Data of Historic Significance: Galileo's Gravity and Motion Experiments, Journal of Statistics Education, v.3, n.1.

Drake, S. (1978). Galileo at Work, Chicago: University of Chicago Press.

Jeffreys, W. H., and Berger, J. O. (1992). Ockham's Razor and Bayesian Analysis, American Scientist, 80, 64-72 (Erratum, p. 116).
