matlab data analysis

156
MATLAB ® Data Analysis R2014b

Upload: others

Post on 11-Sep-2021

16 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: MATLAB Data Analysis

MATLAB®

Data Analysis

R2014b

Page 2: MATLAB Data Analysis

How to Contact MathWorks

Latest news: www.mathworks.com

Sales and services: www.mathworks.com/sales_and_services

User community: www.mathworks.com/matlabcentral

Technical support: www.mathworks.com/support/contact_us

Phone: 508-647-7000

The MathWorks, Inc.3 Apple Hill DriveNatick, MA 01760-2098

MATLAB® Data Analysis© COPYRIGHT 2005–2014 by The MathWorks, Inc.The software described in this document is furnished under a license agreement. The software may be usedor copied only under the terms of the license agreement. No part of this manual may be photocopied orreproduced in any form without prior written consent from The MathWorks, Inc.FEDERAL ACQUISITION: This provision applies to all acquisitions of the Program and Documentationby, for, or through the federal government of the United States. By accepting delivery of the Programor Documentation, the government hereby agrees that this software or documentation qualifies ascommercial computer software or commercial computer software documentation as such terms are usedor defined in FAR 12.212, DFARS Part 227.72, and DFARS 252.227-7014. Accordingly, the terms andconditions of this Agreement and only those rights specified in this Agreement, shall pertain to andgovern the use, modification, reproduction, release, performance, display, and disclosure of the Programand Documentation by the federal government (or other entity acquiring for or through the federalgovernment) and shall supersede any conflicting contractual terms or conditions. If this License failsto meet the government's needs or is inconsistent in any respect with federal procurement law, thegovernment agrees to return the Program and Documentation, unused, to The MathWorks, Inc.

Trademarks

MATLAB and Simulink are registered trademarks of The MathWorks, Inc. Seewww.mathworks.com/trademarks for a list of additional trademarks. Other product or brandnames may be trademarks or registered trademarks of their respective holders.Patents

MathWorks products are protected by one or more U.S. patents. Please seewww.mathworks.com/patents for more information.

Page 3: MATLAB Data Analysis

Revision History

September 2005 Online only New for MATLAB 7.1 (Release 14SP3)March 2006 Online only Revised for MATLAB Version 7.2 (Release

2006a)September 2006 Online only Revised for MATLAB Version 7.3 (Release

2006b)March 2007 Online only Revised for MATLAB Version 7.4 (Release

2007a)September 2007 Online only Revised for MATLAB Version 7.5 (Release

2007b)March 2008 Online only Revised for MATLAB Version 7.6 (Release

2008a)October 2008 Online only Revised for MATLAB Version 7.7 (Release

2008b)March 2009 Online only Revised for MATLAB 7.8 (Release 2009a)September 2009 Online only Revised for MATLAB 7.9 (Release 2009b)March 2010 Online only Revised for MATLAB 7.10 (Release 2010a)September 2010 Online only Revised for MATLAB Version 7.11 (R2010b)April 2011 Online only Revised for MATLAB Version 7.12 (R2011a)September 2011 Online only Revised for MATLAB Version 7.13 (R2011b)March 2012 Online only Revised for MATLAB Version 7.14 (R2012a)September 2012 Online only Revised for MATLAB Version 8.0 (R2012b)March 2013 Online only Revised for MATLAB Version 8.1 (R2013a)September 2013 Online only Revised for MATLAB Version 8.2 (R2013b)March 2014 Online only Revised for MATLAB Version 8.3 (R2014a)October 2014 Online only Revised for MATLAB Version 8.4 (R2014b)

Page 4: MATLAB Data Analysis
Page 5: MATLAB Data Analysis

v

Contents

Data Processing1

Importing and Exporting Data . . . . . . . . . . . . . . . . . . . . . . . . 1-2Importing Data into the Workspace . . . . . . . . . . . . . . . . . . . 1-2Exporting Data from the Workspace . . . . . . . . . . . . . . . . . . . 1-2

Plotting Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1-3Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1-3Load and Plot Data from Text File . . . . . . . . . . . . . . . . . . . . 1-3

Missing Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1-6Representing Missing Data Values . . . . . . . . . . . . . . . . . . . . 1-6Calculating with NaNs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1-6Removing NaNs from Data . . . . . . . . . . . . . . . . . . . . . . . . . . 1-7Interpolating Missing Data . . . . . . . . . . . . . . . . . . . . . . . . . . 1-8

Inconsistent Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1-9

Filtering Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1-11Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1-11Filter Function . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1-11Moving Average Filter . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1-12Discrete Filter . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1-13

Convolution Filter to Smooth Data . . . . . . . . . . . . . . . . . . . . 1-16

Detrending Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1-21Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1-21Remove Linear Trends from Data . . . . . . . . . . . . . . . . . . . . 1-21

Descriptive Statistics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1-25Functions for Calculating Descriptive Statistics . . . . . . . . . 1-25Example: Using MATLAB Data Statistics . . . . . . . . . . . . . . 1-27

Page 6: MATLAB Data Analysis

vi Contents

Interactive Data Exploration2

What Is Interactive Data Exploration? . . . . . . . . . . . . . . . . . . 2-2Interacting with MATLAB Data Graphs . . . . . . . . . . . . . . . . 2-2

Marking Up Graphs with Data Brushing . . . . . . . . . . . . . . . . 2-4What Is Data Brushing? . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2-4How to Brush Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2-5Effects of Brushing on Data . . . . . . . . . . . . . . . . . . . . . . . . . 2-8Other Data Brushing Aspects . . . . . . . . . . . . . . . . . . . . . . . 2-10

Making Graphs Responsive with Data Linking . . . . . . . . . . 2-12What Is Data Linking? . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2-12Why Use Linked Plots? . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2-13How to Link Plots . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2-13How Linked Plots Behave . . . . . . . . . . . . . . . . . . . . . . . . . . 2-14Linking vs. Refreshing Plots . . . . . . . . . . . . . . . . . . . . . . . . 2-16Using Linked Plot Controls . . . . . . . . . . . . . . . . . . . . . . . . . 2-18

Interacting with Graphed Data . . . . . . . . . . . . . . . . . . . . . . . 2-21Data Brushing with the Variables Editor . . . . . . . . . . . . . . 2-21Using Data Tips to Explore Graphs . . . . . . . . . . . . . . . . . . 2-22Example — Visually Exploring Demographic Statistics . . . . 2-23

Regression Analysis3

Linear Correlation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-2Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-2Covariance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-3Correlation Coefficients . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-4

Linear Regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-6Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-6Residuals and Goodness of Fit . . . . . . . . . . . . . . . . . . . . . . . 3-7Fitting Data with Curve Fitting Toolbox Functions . . . . . . . 3-11

Page 7: MATLAB Data Analysis

vii

Interactive Fitting . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-12The Basic Fitting GUI . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-12Preparing for Basic Fitting . . . . . . . . . . . . . . . . . . . . . . . . . 3-12Opening the Basic Fitting GUI . . . . . . . . . . . . . . . . . . . . . . 3-13Example: Using Basic Fitting GUI . . . . . . . . . . . . . . . . . . . 3-14

Programmatic Fitting . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-32MATLAB Functions for Polynomial Models . . . . . . . . . . . . . 3-32Linear Model with Nonpolynomial Terms . . . . . . . . . . . . . . 3-37Multiple Regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-39Programmatic Fitting . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-41

Time Series Analysis4

What Are Time Series? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4-2

Time Series Objects . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4-3Types of Time Series and Their Uses . . . . . . . . . . . . . . . . . . 4-3Time Series Data Sample . . . . . . . . . . . . . . . . . . . . . . . . . . . 4-3Example: Time Series Objects and Methods . . . . . . . . . . . . . 4-5Time Series Constructor . . . . . . . . . . . . . . . . . . . . . . . . . . . 4-27Time Series Collection Constructor . . . . . . . . . . . . . . . . . . . 4-27

Page 8: MATLAB Data Analysis

viii

Page 9: MATLAB Data Analysis

1

Data Processing

• “Importing and Exporting Data” on page 1-2• “Plotting Data” on page 1-3• “Missing Data” on page 1-6• “Inconsistent Data” on page 1-9• “Filtering Data” on page 1-11• “Convolution Filter to Smooth Data” on page 1-16• “Detrending Data” on page 1-21• “Descriptive Statistics” on page 1-25

Page 10: MATLAB Data Analysis

1 Data Processing

1-2

Importing and Exporting Data

In this section...

“Importing Data into the Workspace” on page 1-2“Exporting Data from the Workspace” on page 1-2

Importing Data into the Workspace

The first step in analyzing data is to import it into the MATLAB workspace. See“Methods for Importing Data” for information about importing data from specific fileformats.

Exporting Data from the Workspace

When you analyze your data, you might create new variables or modified importedvariables. You can export variables from the MATLAB workspace to various file formats,both character-based and binary. You can, for example, create HDF and Microsoft®

Excel® files containing your data. For details, see the documentation on “Supported FileFormats for Import and Export”.

Page 11: MATLAB Data Analysis

Plotting Data

1-3

Plotting Data

In this section...

“Introduction” on page 1-3“Load and Plot Data from Text File” on page 1-3

Introduction

After you import data into the MATLAB workspace, it is a good idea to plot the data sothat you can explore its features. An exploratory plot of your data enables you to identifydiscontinuities and potential outliers, as well as the regions of interest.

The MATLAB figure window displays plots. See “Types of MATLAB Plots” for a fulldescription of the figure window. It also discusses the various interactive tools availablefor editing and customizing MATLAB graphics.

Load and Plot Data from Text File

This example uses sample data in count.dat, a space-delimited text file. Thefile consists of three sets of hourly traffic counts, recorded at three different townintersections over a 24-hour period. Each data column in the file represents data for oneintersection.

Load the count.dat Data

Import data into the workspace using the load function.

load count.dat

Loading this data creates a 24-by-3 matrix called count in the MATLAB workspace.

Get the size of the data matrix.

[n,p] = size(count)

n =

Page 12: MATLAB Data Analysis

1 Data Processing

1-4

24

p =

3

n represents the number of rows, and p represents the number of columns.

Plot the count.dat Data

Create a time vector, t, containing integers from 1 to n.

t = 1:n;

Plot the data as a function of time, and annotate the plot.

plot(t,count),

legend('Location 1','Location 2','Location 3',2)

xlabel('Time'), ylabel('Vehicle Count')

title('Traffic Counts at Three Intersections')

Page 13: MATLAB Data Analysis

Plotting Data

1-5

See Alsolegend | load | plot | size | title | xlabel | ylabel

More About• “Types of MATLAB Plots”

Page 14: MATLAB Data Analysis

1 Data Processing

1-6

Missing Data

In this section...

“Representing Missing Data Values” on page 1-6“Calculating with NaNs” on page 1-6“Removing NaNs from Data” on page 1-7“Interpolating Missing Data” on page 1-8

Representing Missing Data Values

Often, you represent missing or unavailable data values in MATLAB code with thespecial value, NaN, which stands for Not-a-Number.

The IEEE® floating-point arithmetic convention defines NaN as the result of an undefinedoperation, such as 0/0.

Calculating with NaNs

When you perform calculations on a IEEE variable that contains NaNs, the NaN valuesare propagated to the final result. This behavior might render the result useless.

For example, consider a matrix containing the 3-by-3 magic square with its centerelement replaced with NaN:

a = magic(3); a(2,2) = NaN

a =

8 1 6

3 NaN 7

4 9 2

Compute the sum for each column in the matrix:

sum(a)

ans =

15 NaN 15

Page 15: MATLAB Data Analysis

Missing Data

1-7

Notice that the sum of the elements in the middle column is a NaN value because thatcolumn contains a NaN.

If you do not want to have NaNs in your final results, remove these values from your data.For more information, see “Removing NaNs from Data” on page 1-7.

Removing NaNs from Data

Use the IEEE function isnan to identify NaNs in the data, and then remove them usingthe techniques in the following table.

Note: Use the function isnan to identify NaNs. By IEEE arithmetic convention, thelogical comparison NaN == NaN always produces 0 (that is, it never evaluates to true).Therefore, you cannot use x(x==NaN) = [] to remove NaNs from your data.

Code Description

i = find(~isnan(x));

x = x(i)

Find the indices of elements in a vector xthat are not NaNs. Keep only the non-NaNelements.

x = x(~isnan(x)); Remove NaNs from a vector x.x(isnan(x)) = []; Remove NaNs from a vector x (alternative

method).X(any(isnan(X),2),:) = []; Remove any rows containing NaNs from a

matrix X.

If you remove NaNs frequently, consider creating a small function that you can call. Forexample:

function X = exciseRows(X)

X(any(isnan(X),2),:) = [];

After you remove all rows containing NaNs, use the following command to compute thecorrelation coefficients of X :

C = corrcoef(excise(X));

For more information about correlation coefficients, see “Linear Correlation” on page3-2.

Page 16: MATLAB Data Analysis

1 Data Processing

1-8

Interpolating Missing Data

Use interpolation to find intermediate points in your data. The simplest function forperforming interpolation is interp1, which is a 1-D interpolation function.

By default, the interpolation method is 'linear', which fits a straight line betweena pair of existing data points to calculate the intermediate value. The complete setof available methods, which you can specify as arguments in the interp1 function,includes the following:

• 'nearest' — Nearest neighbor interpolation• 'linear' — Linear interpolation• 'spline' — Piecewise cubic spline interpolation• 'pchip' or 'cubic' — Shape-preserving piecewise cubic interpolation• 'v5cubic' — Cubic interpolation from MATLAB Version 5. This method does not

extrapolate, and it issues a warning and uses 'spline' if X is not equally spaced.

For more information about interp1, see the MATLAB documentation or type helpinterp1 at the MATLAB prompt.

Page 17: MATLAB Data Analysis

Inconsistent Data

1-9

Inconsistent Data

When you examine a data plot, you might find that some points appear to differdramatically from the rest of the data. In some cases, it is reasonable to consider suchpoints outliers, or data values that appear to be inconsistent with the rest of the data.

The following example illustrates how to remove outliers from three data sets in the 24-by-3 matrix count. In this case, an outlier is defined as a value that is more than threestandard deviations away from the mean.

Caution Be cautious about changing data unless you are confident that you understandthe source of the problem you want to correct. Removing an outlier has a greater effecton the standard deviation than on the mean of the data. Deleting one such point leads toa smaller new standard deviation, which might result in making some remaining pointsappear to be outliers!

% Import the sample data

load count.dat;

% Calculate the mean and the standard deviation

% of each data column in the matrix

mu = mean(count)

sigma = std(count)

The Command Window displays

mu =

32.0000 46.5417 65.5833

sigma =

25.3703 41.4057 68.0281

When an outlier is considered to be more than three standard deviations away from themean, use the following syntax to determine the number of outliers in each column of thecount matrix:

[n,p] = size(count);

% Create a matrix of mean values by

% replicating the mu vector for n rows

MeanMat = repmat(mu,n,1);

% Create a matrix of standard deviation values by

% replicating the sigma vector for n rows

Page 18: MATLAB Data Analysis

1 Data Processing

1-10

SigmaMat = repmat(sigma,n,1);

% Create a matrix of zeros and ones, where ones indicate

% the location of outliers

outliers = abs(count - MeanMat) > 3*SigmaMat;

% Calculate the number of outliers in each column

nout = sum(outliers)

The procedure returns the following number of outliers in each column:

nout =

1 0 0

There is one outlier in the first data column of count and none in the other two columns.

To remove an entire row of data containing the outlier, type

count(any(outliers,2),:) = [];

Here, any(outliers,2) returns a 1 when any of the elements in the outliersvector is a nonzero number. The argument 2 specifies that any works down the seconddimension of the count matrix—its columns.

Page 19: MATLAB Data Analysis

Filtering Data

1-11

Filtering Data

In this section...

“Introduction” on page 1-11“Filter Function” on page 1-11“Moving Average Filter” on page 1-12“Discrete Filter” on page 1-13

Introduction

Various MATLAB IEEE functions help you work with difference equations and filtersto shape the variations in the raw data. These functions operate on both vectors andmatrices. Filter data to smooth out high-frequency fluctuations or remove periodic trendsof a specific frequency.

A vector input represents a single, sampled data signal (or sequence). For a matrix input,each signal corresponds to a column in the matrix and each data sample is a row.

Filter Function

The function

y = filter(b,a,x)

creates filtered data y by processing the data in vector x with the filter described byvectors a and b.

The filter function is a general tapped delay-line filter, described by the differenceequation

a y n b x n b x n b N x n Nb b( ) ( ) ( ) ( ) ( ) ( ) ( ) ( )1 1 2 1 1= + - + + - +…

- - - - - +a y n a N y n Na a( ) ( ) ( ) ( )2 1 1…

Here, n is the index of the current sample, Na is the order of the polynomial described by

vector a, and Nb is the order of the polynomial described by vector b. The output y(n) is

Page 20: MATLAB Data Analysis

1 Data Processing

1-12

a linear combination of current and previous inputs, x(n) x(n – 1)..., and previous outputs,y(n – 1) y(n – 2)... .

Moving Average Filter

This example shows how to smooth the data in count.dat using a moving-average filterto see the average traffic flow over a 4-hour window (covering the current hour and theprevious 3 hours). This is represented by the following difference equation:

Create the corresponding vectors.

a = 1;

b = [1/4 1/4 1/4 1/4];

Import the data from count.dat using the load function.

load count.dat

Loading this data creates a 24-by-3 matrix called count in the MATLAB® workspace.

Extract the first column of count and assign it to the vector x.

x = count(:,1);

Calculate the 4-hour moving average of the data.

y = filter(b,a,x);

Plot the original data and the filtered data.

figure

t = 1:length(x);

plot(t,x,'--',t,y,'-'),grid on

legend('Original Data','Smoothed Data',2)

title('Plot of Original and Smoothed Data')

Page 21: MATLAB Data Analysis

Filtering Data

1-13

The filtered data, represented by the solid line in the plot, is the 4-hour moving averageof the count data. The original data is represented by the dashed line.

Discrete Filter

This example shows how to use the discrete filter to shape data by applying a transferfunction to an input signal.

Depending on your objectives, the transfer function you choose might alter both theamplitude and the phase of the variations in the data at different frequencies to produceeither a smoother or a rougher output.

Taking the z -transform of the difference equation

Page 22: MATLAB Data Analysis

1 Data Processing

1-14

results in the following transfer function:

Here Y(z) is the z-transform of the filtered output y(n). The coefficients, b and a, areunchanged by the z-transform.

In digital signal processing (DSP), it is customary to write transfer functions as rationalexpressions in and to order the numerator and denominator terms in ascendingpowers of .

Consider the transfer function:

The following code defines and applies this transfer function to the data in count.dat.

Load the matrix count into the workspace.

load count.dat

Extract the first column and assign it to x.

x = count(:,1);

Enter the coefficients of the denominator ordered in ascending powers of to represent.

a = [1 0.2];

Enter the coefficients of the numerator to represent .

b = [2 3];

Call the filter function.

Page 23: MATLAB Data Analysis

Filtering Data

1-15

y = filter(b,a,x);

Compare the original data and the shaped data with an overlaid plot of the two curves.

t = 1:length(x);

plot(t,x,'-.',t,y,'-'), grid on

legend('Original Data','Shaped Data',2)

title('Plot of Original and Shaped Data')

The plot shows this filter primarily modifies the amplitude of the original data.

Page 24: MATLAB Data Analysis

1 Data Processing

1-16

Convolution Filter to Smooth Data

This example shows how to use a convolution filter to remove high-frequency componentsfrom a matrix to smooth the matrix for contour plotting.

The conv2 and filter functions can remove high-frequency components from amatrix representing a continuous surface or field to make the underlying data easier tovisualize.

Create a function of two variables and plot the contour lines at a specified, fixed interval.

Z = peaks(100);

clevels = -7:1:10; % contour levels for all plots

figure

contour(Z,clevels)

axis([0,100,0,100])

title('Peaks Surface (underlying data)')

Page 25: MATLAB Data Analysis

Convolution Filter to Smooth Data

1-17

Add uniform random noise with a mean of zero to the surface and plot the resultingcontours. Irregularities in the contours tend to obscure the trend of the data.

ZN = Z + rand(100) - .5;

figure

contour(ZN,clevels)

axis([0,100,0,100])

title('Peaks Surface (noise added)')

Page 26: MATLAB Data Analysis

1 Data Processing

1-18

Specify a 3-by-3 convolution kernal, F, for smoothing the matrix and use the conv2function to attenuate high spatial frequencies in the surface data. Plot the contour lines.

F = [.05 .1 .05; .1 .4 .1; .05 .1 .05];

ZC = conv2(ZN,F,'same');

figure

contour(ZC, clevels)

axis([0,100,0,100])

title('Noisy Surface (smoothed once)')

Page 27: MATLAB Data Analysis

Convolution Filter to Smooth Data

1-19

Smooth the surface one more time using the same operator and plot the contour lines. Alarger or more uniform kernal can achieve a smoother surface in one pass.

ZC2 = conv2(ZC,F,'same');

figure

contour(ZC2, clevels)

axis([0,100,0,100])

title('Noisy Surface (smoothed twice)')

Page 28: MATLAB Data Analysis

1 Data Processing

1-20

Page 29: MATLAB Data Analysis

Detrending Data

1-21

Detrending Data

In this section...

“Introduction” on page 1-21“Remove Linear Trends from Data” on page 1-21

Introduction

The MATLAB function detrend subtracts the mean or a best-fit line (in the least-squares sense) from your data. If your data contains several data columns, detrendtreats each data column separately.

Removing a trend from the data enables you to focus your analysis on the fluctuationsin the data about the trend. A linear trend typically indicates a systematic increase ordecrease in the data. A systematic shift can result from sensor drift, for example. Whiletrends can be meaningful, some types of analyses yield better insight once you removetrends.

Whether it makes sense to remove trend effects in the data often depends on theobjectives of your analysis.

Remove Linear Trends from Data

This example shows how to remove a linear trend from daily closing stock prices toemphasize the price fluctuations about the overall increase. If the data does have atrend, detrending it forces its mean to zero and reduces overall variation. The examplesimulates stock price fluctuations using a distribution taken from the gallery function.

Create a simulated data set and compute its mean. sdata represents the daily pricechanges of a stock.

t = 0:300;

dailyFluct = gallery('normaldata',size(t),2);

sdata = cumsum(dailyFluct) + 20 + t/100;

Find the average of the data.

mean(sdata)

Page 30: MATLAB Data Analysis

1 Data Processing

1-22

ans =

39.4851

Plot and label the data. Notice the systematic increase in the stock prices that the datadisplays.

figure

plot(t,sdata);

legend('Original Data','Location','northwest');

xlabel('Time (days)');

ylabel('Stock Price (dollars)');

Page 31: MATLAB Data Analysis

Detrending Data

1-23

Apply detrend, which performs a linear fit to sdata and then removes the trend from it.Subtracting the output from the input yields the computed trend line.

detrend_sdata = detrend(sdata);

trend = sdata - detrend_sdata;

Find the average of the detrended data.

mean(detrend_sdata)

ans =

1.1425e-14

As expected, the detrended data has a mean very close to 0.

Display the results by adding the trend line, the detrended data, and its mean to thegraph.

hold on

plot(t,trend,':r')

plot(t,detrend_sdata,'m')

plot(t,zeros(size(t)),':k')

legend('Original Data','Trend','Detrended Data',...

'Mean of Detrended Data','Location','northwest')

xlabel('Time (days)');

ylabel('Stock Price (dollars)');

Page 32: MATLAB Data Analysis

1 Data Processing

1-24

See Alsocumsum | detrend | gallery | plot

Page 33: MATLAB Data Analysis

Descriptive Statistics

1-25

Descriptive Statistics

In this section...

“Functions for Calculating Descriptive Statistics” on page 1-25“Example: Using MATLAB Data Statistics” on page 1-27

If you need more advanced statistics features, you might want to use the StatisticsToolbox™ software.

Functions for Calculating Descriptive Statistics

Use the following MATLAB functions to calculate the descriptive statistics for your data.

Note: For matrix data, descriptive statistics for each column are calculatedindependently.

Statistics Function Summary

Function Description

max Maximum valuemean Average or mean valuemedian Median valuemin Smallest valuemode Most frequent valuestd Standard deviationvar Variance, which measures the spread or dispersion of the values

The following examples apply MATLAB functions to calculate descriptive statistics:

• “Example 1 — Calculating Maximum, Mean, and Standard Deviation” on page1-26

• “Example 2 — Subtracting the Mean” on page 1-27

Page 34: MATLAB Data Analysis

1 Data Processing

1-26

Example 1 — Calculating Maximum, Mean, and Standard Deviation

This example shows how to use MATLAB functions to calculate the maximum, mean,and standard deviation values for a 24-by-3 matrix called count. MATLAB computesthese statistics independently for each column in the matrix.

% Load the sample data

load count.dat

% Find the maximum value in each column

mx = max(count)

% Calculate the mean of each column

mu = mean(count)

% Calculate the standard deviation of each column

sigma = std(count)

The results are

mx =

114 145 257

mu =

32.0000 46.5417 65.5833

sigma =

25.3703 41.4057 68.0281

To get the row numbers where the maximum data values occur in each data column,specify a second output parameter indx to return the row index. For example:

[mx,indx] = max(count)

These results are

mx =

114 145 257

indx =

20 20 20

Here, the variable mx is a row vector that contains the maximum value in each of thethree data columns. The variable indx contains the row indices in each column thatcorrespond to the maximum values.

Page 35: MATLAB Data Analysis

Descriptive Statistics

1-27

To find the minimum value in the entire count matrix, reshape this 24-by-3 matrix intoa 72-by-1 column vector by using the syntax count(:). Then, to find the minimum valuein the single column, use the following syntax:

min(count(:))

ans =

7

Example 2 — Subtracting the Mean

Subtract the mean from each column of the matrix by using the following syntax:

% Get the size of the count matrix

[n,p] = size(count)

% Compute the mean of each column

mu = mean(count)

% Create a matrix of mean values by

% replicating the mu vector for n rows

MeanMat = repmat(mu,n,1)

% Subtract the column mean from each element

% in that column

x = count - MeanMat

Note: Subtracting the mean from the data is also called detrending. For moreinformation about removing the mean or the best-fit line from the data, see “DetrendingData” on page 1-21.

Example: Using MATLAB Data Statistics

The Data Statistics dialog box helps you calculate and plot descriptive statistics withthe data. This example shows how to use MATLAB Data Statistics to calculate and plotstatistics for a 24-by-3 matrix, called count. The data represents how many vehiclespassed by traffic counting stations on three streets.

This section contains the following topics:

• “Calculating and Plotting Descriptive Statistics” on page 1-28• “Formatting Data Statistics on Plots” on page 1-30• “Saving Statistics to the MATLAB Workspace” on page 1-32

Page 36: MATLAB Data Analysis

1 Data Processing

1-28

• “Generating Code Files” on page 1-33

Note: MATLAB Data Statistics is available for 2-D plots only.

Calculating and Plotting Descriptive Statistics

1 Load and plot the data:

load count.dat

[n,p] = size(count);

% Define the x-values

t = 1:n;

% Plot the data and annotate the graph

plot(t,count)

legend('Station 1','Station 2','Station 3','Location','northwest')

xlabel('Time')

ylabel('Vehicle Count')

Page 37: MATLAB Data Analysis

Descriptive Statistics

1-29

Note: The legend contains the name of each data set, as specified by the legendfunction: Station 1, Station 2, and Station 3. A data set refers to each columnof data in the array you plotted. If you do not name the data sets, default names areassigned: data1, data2, and so on.

2 In the Figure window, select Tools > Data Statistics .

The Data Statistics dialog box opens and displays descriptive statistics for the X- andY-data of the Station 1 data set.

Note: The Data Statistics GUI calculates the range, which is the difference betweenthe minimum and maximum values in the selected data set. The Data Statistics GUIdoes not display the range on the plot.

3 Select a different data set in the Statistics for list: Station 2.

This displays the statistics for the X and Y data of the Station 2 data set.4 Select the check box for each statistic you want to display on the plot, and then click

Save to workspace.

For example, to plot the mean of Station 2, select the mean check box in the Ycolumn.

This plots a horizontal line to represent the mean of Station 2 and updates thelegend to include this statistic.

Page 38: MATLAB Data Analysis

1 Data Processing

1-30

Formatting Data Statistics on Plots

The Data Statistics GUI uses colors and line styles to distinguish statistics from the dataon the plot. This portion of the example shows how to customize the display of descriptivestatistics on a plot, such as the color, line width, line style, or marker.

Note: Do not edit display properties of statistics until you finish plotting all the statisticswith the data. If you add or remove statistics after editing plot properties, the changes toplot properties are lost.

Page 39: MATLAB Data Analysis

Descriptive Statistics

1-31

To modify the display of data statistics on a plot:

1 In the MATLAB Figure window, click the (Edit Plot) button in the toolbar.

This step enables plot editing.2 Double-click the statistic on the plot for which you want to edit display properties.

For example, double-click the horizontal line representing the mean of Station 2.

This step opens the Property Editor below the MATLAB Figure window, where youcan modify the appearance of the line used to represent this statistic.

Page 40: MATLAB Data Analysis

1 Data Processing

1-32

3 In the Property Editor, specify the Line and Marker styles, sizes, and colors.

Tip Alternatively, right-click the statistic on the plot, and select an option from theshortcut menu.

Saving Statistics to the MATLAB Workspace

This portion of the example shows how to save statistics in the Data Statistics GUI to theMATLAB workspace.

Note: When your plot contains multiple data sets, save statistics for each data setindividually. To display statistics for a different data set, select it from the Statistics forlist in the Data Statistics GUI.

1 In the Data Statistics dialog box, click the Save to workspace button.2 In the Save Statistics to Workspace dialog box, select options to save statistics for

either X data, Y data, or both. Then, enter the corresponding variable names.

In this example, save only the Y data. Enter the variable name as Loc2countstats.

3 Click OK.

This step saves the descriptive statistics to a structure. The new variable is added tothe MATLAB workspace.

To view the new structure variable, type the variable name at the MATLAB prompt:

Loc2countstats

Loc2countstats =

min: 9

Page 41: MATLAB Data Analysis

Descriptive Statistics

1-33

max: 145

mean: 46.5417

median: 36

mode: 9

std: 41.4057

range: 136

Generating Code Files

This portion of the example shows how to generate a file containing MATLAB code thatreproduces the format of the plot and the plotted statistics with new data.

1 In the Figure window, select File > Generate Code.

This step creates a function code file and displays it in the MATLAB Editor. The codecan programmatically reproduce what you did interactively with the Data StatisticsGUI and the Property Editor.

2 Change the name of the function on the first line of the file from createfigure tosomething more specific, like countplot. Save the file to your current folder withthe file name countplot.m.

3 Generate some new, random count data:

randcount = 300*rand(24,3);

4 Reproduce the plot with the new data and the recomputed statistics:

countplot(t,randcount)

Page 42: MATLAB Data Analysis

1 Data Processing

1-34

Page 43: MATLAB Data Analysis

2

Interactive Data Exploration

• “What Is Interactive Data Exploration?” on page 2-2• “Marking Up Graphs with Data Brushing” on page 2-4• “Making Graphs Responsive with Data Linking” on page 2-12• “Interacting with Graphed Data” on page 2-21

Page 44: MATLAB Data Analysis

2 Interactive Data Exploration

2-2

What Is Interactive Data Exploration?

Interacting with MATLAB Data Graphs

The MATLAB data analysis and graphics tools for visual data exploration leverage itsHandle Graphics® capabilities. In addition to the presentation techniques described inthe following section, they include:

• Highlighting and editing observations on graphs with data brushing• Connecting data graphs with variables with data linking• Finding, adding, removing, and changing data values with the “Data Brushing with

the Variables Editor” on page 2-21• Describing observations on graphs with data tips

Used alone or together, these tools help you to perceive trends, noise, and relationshipsin data sets, and understand aspects of the phenomena you model. Ways to use them arepresented in the following sections. To learn more, you can also view a video tutorial thatdescribes these and related features.

Understanding Data Using Graphic Presentations

Finding patterns in numbers is a mathematical and an intuitive undertaking. Whenpeople collect data to analyze, they often want to see how models, variables, andconstants explain hypotheses. Sometimes they see patterns by scanning tables or sets ofstatistics, other times by contemplating graphical representations of models and data.An analyst's powers of pattern recognition can lead to insights into data’s distribution,outliers, curvilinearity, associations between variables, goodness-of-fit to models, andmore. Computers amplify those powers greatly.

Graphically exploring digital data interactively generally requires:

• Data displays for charts, graphs, and maps• A graphical user interface (GUI) capable of directly manipulating the displays• Software that categorizes selected data performs operations on the categories, and

then updates or creates new data displays

This approach to understanding is often called exploratory data analysis (EDA), a termcoined during the infancy of computer graphics in the 1970s and generally attributed tostatistician John Tukey (who also invented the box plot). EDA complements statistical

Page 45: MATLAB Data Analysis

What Is Interactive Data Exploration?

2-3

methods and tools to help analysts check hypotheses and validate models. An EDA GUIusually lets analysts divide observations of variables on data plots into subsets usingmouse gestures, and then analyze further or eliminate selected observations.

Part of EDA is simply looking at data graphics with an informed eye to observe patternsor lack of them. What makes EDA especially powerful, however, are interactive tools thatlet analysts probe, drill down, map, and spin data sets around, and select observationsand trace them through plots, tables, and models.

Well before digital tool sets like the MATLAB environment developed, curiousquantitative types plotted graphs, maps, and other data diagrams to trigger insights intowhat their collections of numbers might mean. If you are curious about what data mightmean and like to reflect on data graphics, MATLAB provides many options:

• Plotting data — scatter, line, area, bar, histogram and other types of graphs• Plotting thematic maps to show spatial relationships of point, lines and area data• Plotting N-D point, vector, contour, surface, and volume shapes• Overlaying other variables on points, lines, and surfaces (e.g. texture-maps)• Rendering portions of a 3-D display with transparency• Animating any of the above

All of these options generate static or dynamic displays that may reveal meaning in data.In many environments, however, users cannot interact with them; they can only changedata or parameters and redisplay the same or different data graphics. MATLAB toolsenable users to directly manipulate data displays to explore correlations and anomaliesin data sets, as the following sections explain.

Page 46: MATLAB Data Analysis

2 Interactive Data Exploration

2-4

Marking Up Graphs with Data Brushing

In this section...

“What Is Data Brushing?” on page 2-4“How to Brush Data” on page 2-5“Effects of Brushing on Data” on page 2-8“Other Data Brushing Aspects” on page 2-10

What Is Data Brushing?

When you brush data, you manually select observations on an interactive data displayin the course of assessing validity, testing hypotheses, or segregating observations forfurther processing. You can brush data on 2-D graphs, 3-D graphs, and surfaces. Most ofthe MATLAB high-level plotting functions allow you to brush on their displays. For a listof restrictions, see “Plot Types You Cannot Brush” in the brush function reference page,which also illustrates the types of graphs you can brush.

Data brushing is a MATLAB figure interactive mode like zooming, panning or plotediting. You can use data brushing mode to select, remove, and replace individual datavalues.

Activate data brushing in any of these ways:

•Click on the figure toolbar.

• Select Tools > Brush.• Right-click a cell in the Variables editor and select Brushing > Brushing on.• Call the brush function.

The figure toolbar data brushing button contains two parts:

•Data brushing button that toggles data brushing on and off.

• Data brushing button arrow ▼ that displays a drop-down menu for choosing thebrushing color.

You also can set the color with the brush function; it accepts ColorSpec names andRGB triplets. For example:

Page 47: MATLAB Data Analysis

Marking Up Graphs with Data Brushing

2-5

brush magenta

brush([.1 .3 .5])

How to Brush Data

To brush observations on graphs and surface plots,

1To enter brushing mode, select the Data Brushing button in the figure toolbar.You also can select a brushing color with the Data Brushing button arrow ▼.

2 Drag a selection rectangle to highlight observations on a graph in the currentbrushing color.Instead of dragging out a rectangle, you can click any observation to select it.Double-clicking selects all the observations in a series.

3 To add other observations to the highlighted set, hold down the Shift key and brushthem.

4 Shift+clicking or Shift+dragging highlighted observations eliminates theirhighlighting and removes them from the selection set; this lets you select any set ofobservations.

The following figures show a scatter plot before and after brushing some outlyingobservations; the left-hand plot displays the Data Brushing tool palette for choosing abrush color.

Page 48: MATLAB Data Analysis

2 Interactive Data Exploration

2-6

Brushed observations remain brushed even in other modes (pan, zoom, edit) until youdeselect them by brushing an empty area or by selecting Clear all brushing from thecontext menu. You can add and remove data tips to a brushed plot without disturbing itsbrushing.

Once you have brushed observations from one or more graphed variables, you canperform several tasks with the brushing set, either from the Tools menu or by right-clicking any brushed observation:

• Remove all brushed observations from the plot.• Remove all unbrushed observations from the plot.• Replace the brushed observations with NaN or constant values.• Copy the brushed data values to the clipboard.• Paste the brushed data values to the command window• Create a variable to hold the brushed data values• Clear brushing marks from the plot (context menu only)

The two following figures show a lineseries plot of a variable, along with constantlines showing its mean and two standard deviations. On the left, the user is brushingobservations that lie beyond two standard deviations from the mean. On the right, theuser has eliminated these extreme values by selecting Brushing > Remove brushedfrom the Tools (or context) menu. The plot immediately redisplays with three fewer x-and y-values. The original workspace variable, however, remains unchanged.

Page 49: MATLAB Data Analysis

Marking Up Graphs with Data Brushing

2-7

Before removing the extreme values, you can save them as a new workspace variablewith Tools > Brushing > Create new variable. Doing this opens a dialog box for youto declare a variable name.

Typing extremevals to name the variable and pressing OK to dismiss the dialogproduces

extremevals =

9.0000 3.5784

12.0000 3.0349

35.0000 -2.9443

The new variable contains one row per observation selected. The first column containsthe x-values and the second column contains the y-values, copied from the lineseries’XData and YData. In graphs where multiple series are brushed, the Create New

Page 50: MATLAB Data Analysis

2 Interactive Data Exploration

2-8

Variable dialog box helps you identify what series the new variable should represent,allowing you to select and name one at a time.

Effects of Brushing on Data

Brushing simply highlights data points in a graph, without affecting data on which theplot is based. If you remove brushed or unbrushed observations or replace them withNaN values, the change applies to the XData, YData, and possibly ZData propertiesof the plot itself, but not to variables in the workspace. You can undo such changes.However, if you replot a brushed graph using the same workspace variables, not only doits brushing marks go away, all removed or replaced values are restored and you cannotundo it. If you want brushing to affect the underlying workspace data, you must link theplot to the variables it displays. See “Making Graphs Responsive with Data Linking” onpage 2-12 for more information.

Brushed 3-D Plots

When an axes displays three-dimensional graphics, brushing defines a region of interest(ROI) as an unbounded rectangular prism. The central axis of the prism is a lineperpendicular to the plane of the screen. Opposite corners of the prism pass throughpoints defined by the CurrentPoint associated with the initial mouse click and thevalue of CurrentPoint during the drag. All vertices lying within the rectangular prismROI highlight as you brush them, even those that are hidden from view.

The next figure contains two views of a brushed ROI on a peaks surface plot. On theleft plot, only the cross-section of the rectangular prism is visible (the brown rectangle)because the central axis of the prism is perpendicular to the viewing plane. When theviewpoint rotates by about 90 degrees clockwise (right-hand plot), you see that theprism extends along the initial axis of view and that the brushed region conforms to thesurface.

Page 51: MATLAB Data Analysis

Marking Up Graphs with Data Brushing

2-9

Brushed Multiple Plots

When the same x-, y- or z-variable appears in several plots, brushing observations inone plot highlights the related observations in the other plots when they are linked. Ifthe brushed variables are open in the Variables editor, the rows containing the brushedobservations are highlighted. For more information, see “Data Brushing with theVariables Editor” on page 2-21.

Organizing Plots for Brushing

Data brushing usually involves creating multiple views of related variables on graphsand in tables. Just as computer users organize their virtual desktops in many differentways, you can use various strategies for viewing sets of plots:

• Multiple overlapping figure windows• Tiled figure windows• Tabbed figure windows• Subplots presenting multiple views

When MATLAB figures are created, by default, they appear as separate windows. Manyusers keep them as such, arranging, overlapping, hiding and showing them as their workrequires. Any figure, however, can dock inside a figure group, which itself can float ordock in the MATLAB desktop. Once docked in a figure group, you can float and overlap

Page 52: MATLAB Data Analysis

2 Interactive Data Exploration

2-10

the individual plots, tile them in various arrangements, or use tabs to show and hidethem.

Note: For more in formation on managing figure windows, see “Document Layout” and“Plotting Basics”.

Another way of organizing plots is to arrange them as subplots within a single figurewindow, as illustrated in the example for “Linking vs. Refreshing Plots” on page 2-16.You create and organize subplots with the subplot function, for which there is no GUIas there is for figure groups. Subplots are useful when you have an idea of how manygraphs you want to work with simultaneously and how you want to arrange them (theydo not need to be all the same size).

Note: You can easily set up MATLAB code files to create subplots; see subplot for moreinformation.

Other Data Brushing Aspects

Not all types of graphs can be brushed, and each type that you can brush is markedup in a particular way. To be brushable, a graphic object must have XDataSource,YDataSource, and where applicable, ZDataSource properties. The one exception isthe patch objects produced by the hist function, which are brushable due to the specialhandling they receive. In order to brush a histogram, you must put the figure containingit into a linked state.

The brush function reference page explains how to apply brushing to different graphtypes, describes how to use different mouse gestures for brushing, and lists graph typesthat you can and cannot brush. See the following sections:

• “Types of Plots You Can Brush”• “Plot Types You Cannot Brush”• “Mouse Gestures for Data Brushing”

Keep in mind that data brushing is a mode that operates on entire figures, like zoom,pan, or other modes. This means that some figures can be in data brushing mode atthe same time other figures are in other modes. When you dock multiple figures into afigure group, there is only one toolbar, which reflects the state or mode of whatever figure

Page 53: MATLAB Data Analysis

Marking Up Graphs with Data Brushing

2-11

docked in the group you happen to select. Thus, even when docked, some graphs may bein data brushing mode while others are not.

If an axes contains a plot type that cannot be brushed, such as an image object, you canselect the figure's Data Brushing tool and trace out a rectangle by dragging it, but nobrush marks appear. When you lay out graphs in subplots within a single figure andenter data brushing mode, all the subplot axes become brushable as long as the graphicobjects they contain are brushable. If the figure is also in a linked state, brushing onesubplot marks any other in the figure that shares a data source with it. Although thisalso happens when separate figures are linked and brushed, you can prevent individualfigures from being brushed by unlinking them from data sources.

Page 54: MATLAB Data Analysis

2 Interactive Data Exploration

2-12

Making Graphs Responsive with Data Linking

In this section...

“What Is Data Linking?” on page 2-12“Why Use Linked Plots?” on page 2-13“How to Link Plots” on page 2-13“How Linked Plots Behave” on page 2-14“Linking vs. Refreshing Plots” on page 2-16“Using Linked Plot Controls” on page 2-18

What Is Data Linking?

Linked plots are graphs in figure windows that visibly respond to changes in the currentworkspace variables they display and vice versa. This differs from the default behavior ofgraphs, which contain copies of variables they represent (their XData/YData/ZData) andmust be explicitly replotted in order to update them when a displayed variable changes.For example, if variable y in the workspace appears in a linked plot and y is modified inthe Command Window, the graphic representation of y in the linked plot updates withinhalf a second to reflect the change.

If you use the Variables editor, you might be familiar with data linking. When variableschange or go out of scope, the Variables editor updates itself. It continuously updatesvariables in the workspace when you add, change, or delete values. The Variables editorworks the same way with linked plots.

You can programmatically update a plot after the elements in one variable change. Forexample, the following code calls refreshdata to update the plot after y changes.

x = 0:.1:8*pi;

y = sin(x);

h = plot(x,y)

set(h,'XDataSource','x');

set(h,'YDataSource','y');

y = sin(x.^3);

refreshdata

For more information on this manual technique, see the refreshdata reference page.Prior to data linking, you need to explicitly update your plots to reflect changes in yourworkspace variables, as illustrated in “Linking vs. Refreshing Plots” on page 2-16.

Page 55: MATLAB Data Analysis

Making Graphs Responsive with Data Linking

2-13

Why Use Linked Plots?

If the same variable appears in plots in multiple figures, you can link any of the plots tothe variable. You can use linked plots in concert with “Marking Up Graphs with DataBrushing” on page 2-4, but also on their own. Linking plots lets you

• Make graphs respond to changes in variables in the base workspace or within afunction

• Make graphs respond when you change variables in the Variables editor andCommand Line

• Modify variables through data brushing that affect different graphicalrepresentations of them at once

• Create graphical “watch windows” for debugging purposes

Watch windows are useful if you program in the MATLAB language. For example, whenrefining a data processing algorithm to step through your code, you can see graphsrespond to changes in variables as a function executes statements.

How to Link Plots

When you create a figure, by default, data linking is off. You can put a figure into alinked state in any of three ways:

•Click the Data Linking tool button on the figure toolbar.

• Select Link from the figure Tools menu.• Call the linkdata MATLAB function, e.g., linkdata on.• To disable data linking, click the Data Linking tool button, deselect Tools > Link, or

type linkdata off.

Once a figure is linked, its appearance changes; an information bar, called the linkedplot information bar, appears beneath the figure toolbar to reflect its new linked state.It identifies all linked variables and gives you an opportunity to unlink or relink any ofthem. The linked plot information bar identifies a figure as being linked and displaysrelationships between graphic objects and the workspace variables they represent. Clickthe circular down arrow icon on its left side to display a legend that identifies the datasource for each graphic object in a graph.

For example, execute this code at the command line:

y = randn(10,3);

Page 56: MATLAB Data Analysis

2 Interactive Data Exploration

2-14

plot(y)

Then, click the Data Linking tool button , and click the circular down arrow icon onthe left side of the linked plot information bar.

Dropping down the linked plot legend is useful when many data sources are linkedto a graph at once. Like legends created with the legend function, it identifies graphcomponents with variable expressions.

How Linked Plots Behave

Once linked to its data source(s), a figure acts as if you called the MATLAB functionrefreshdata every time a workspace variable it displays changes. That is, any series orgroup graphic objects contained in the figure can update its own XData, YData, or ZDataproperties and redraw itself when one of its data sources is modified. If the linked stateis set to 'off' using the linkdata function, by deselecting the Data Linking toolbarbutton, or by deselecting Link on the figure's Tools menu, automatic refreshing stops.

When you turn linking on for a figure, the linking mechanism can usually identify thedata sources for displayed graphs, but sometimes ambiguity exists about what variable

Page 57: MATLAB Data Analysis

Making Graphs Responsive with Data Linking

2-15

or range of a variable has been plotted. At such times, the Linked Plot information barinforms you that graphics have no data sources and gives you a chance to identify them.Click fix it to open a dialog box where you can specify the variables and ranges of any orall plotted variables.

For example, create a matrix of random data and plot a histogram of the fourth column.

x = rand(10);

hist(x(:,4))

Then, click the Data Linking tool button and click fix it in the Linked Plotinformation bar. Use the drop down menu under YDataSource to choose the source ofthe data, x(:,4).

Page 58: MATLAB Data Analysis

2 Interactive Data Exploration

2-16

Note: You can create graphs that have no data sources. For example,plot(randn(100,1)) generates a line graph that has neither an XDataSource (thex-values are implicit) nor a YDataSource (no variable for y-values exists). Therefore,while you can brush such graphs, you cannot link them to data sources, because linkingrequires workspace data. Similarly, if you create a variable, graph it, and then clear thevariable from the workspace you will be unable to link that plot.

When you brush a graph that is not linked to data sources, you brush the graphics only.The brushing affects only the figure you interact with. However, when you brush a linkedplot, you are brushing the underlying variables. In this case, your brush marks alsodisplay on all linked plots that have the same data sources you brushed, as well as anydisplay of that data which you have opened in the Variables editor. The color of the brushmarks in all displays is the brush color you have selected for the figure in which you arebrushing. This color can differ from the brush colors you have chosen to use in othersdisplay, and overrides those colors.

Linking vs. Refreshing Plots

Besides the linked plots feature, other MATLAB mechanisms connect graphic objects todata sources (workspace variables). The main techniques are:

• Directly update the XData/YData/ZData properties of a graph.• Set a graph’s XDataSource/YDataSource/ZDataSource and indirectly update

XData/YData/ZData by calling refreshdata.

For an example of updating object properties to animate graphics, see “Trace MarkerAlong Line” in the MATLAB Graphics documentation. Data linking is not a methodintended for animating data graphs.

Linking plots automates these tasks and keeps graphs continuously in sync with thevariables they depict, making it the easiest technique to use. Data sources must stillexist in the workspace, but you do not need to explicitly declare them for linked plotsunless some ambiguity exists. The following code examples iteratively approximate pi,and illustrate the difference between declaring and refreshing data sources yourself andletting the linkdata function handle it for you.

Updating a Graph with refreshdata Updating a Graph with linkdata

x1= [1 2]; x2= [1 2];

Page 59: MATLAB Data Analysis

Making Graphs Responsive with Data Linking

2-17

Updating a Graph with refreshdata Updating a Graph with linkdatay1 = [4 4];

ntimes = 100;

denom = 1;

k = -1;

subplot(1,2,1)

hp1 = plot(x1,y1);

xlabel('Updated with REFRESHDATA')

ylabel('\pi')

set(gca,'Xlim',[0 ntimes],...

'Ylim',[2.5 4])

set(hp1,'XDataSource', 'x1')

set(hp1,'YDataSource', 'y1')

for t = 3:ntimes

denom = denom + 2;

x1(t) = t;

y1(t) = 4*(y1(t-1)/4 + k/denom);

refreshdata

drawnow

k = -k;

end

line([0 ntimes], [pi pi],'color','c')

y2 = [4 4];

ntimes = 100;

denom = 1;

k = -1;

subplot(1,2,2)

plot(x2,y2);

xlabel('Updated with LINKDATA')

ylabel('\pi')

set(gca,'Xlim',[0 ntimes],...

'Ylim',[2.5 4])

linkdata on

for t = 3:ntimes

denom = denom + 2;

x2(t) = t;

y2(t) = 4*(y2(t-1)/4 + k/denom);

k = -k;

end

line([0 ntimes], [pi pi],'color','c')

Differences are shown in italics. When you execute the code on the left, which usesrefreshdata, it animates the approximation process. The code on the right useslinkdata and does not animate; it runs much faster. (A drawnow command is notneeded, because data linking buffers update and refresh the graph at half-secondintervals.) The graphic results, shown in the next image, are identical. Because bothplots are in axes in the same figure, linking the second graph also links the first graph toits variables.

Page 60: MATLAB Data Analysis

2 Interactive Data Exploration

2-18

Using Linked Plot Controls

To minimize the Linked Plot information bar while remaining in linked mode, click the

hide/show button on its right side; the button flips direction and the bar is hidden.Clicking the button again flips the arrow back and restores the Linked Plot informationbar. Turning off linking cuts all data source connections and removes the Linked Plotinformation bar from the figure. However, the data source properties remain set, and thebar reappears whenever a linked state is restored by selecting Tools > Link, depressingthe Linked Plot button, or calling the linkdata function. Whatever data sources wereestablished previously will then reconnect (assuming those variables still exist in thesame form).

Page 61: MATLAB Data Analysis

Making Graphs Responsive with Data Linking

2-19

The Data Source Button

The down arrow button on the left side of the Linked Plot information bar dropsdown a legend (similar to what the legend function produces but without DisplayNames). The legend identifies workspace variables associated with plot objects for theentire figure (legend works on a per-axes basis), such as these linked lineseries from theprevious example, shown in the next image.

The drop-down legend names variable linked to the graphic objects in the figure.For items to appear there, a graph must have an XDataSource, YDataSource, or aZDataSource property that MATLAB can evaluate without error. The icon for each listentry reflects the Color, Linestyle and Marker of the corresponding graphic object,making clear which graphic objects link to which variables. The drop-down legend isinformational only; you can only dismiss it after reading it by clicking anywhere else onthe figure.

The Edit Button

Clicking the Edit link on the information bar opens the Specify Data Source Propertiesmodal dialog box for you to set the DisplayName, XDataSource, YDataSource, andZDataSource properties of plot objects in the figure to columns or vectors of workspacevariables. Changing a DisplayName updates text on a legend, if present for the variable,and has no other effects. The three columns on the right contain drop-down lists ofworkspace variables. You can also type variable names and ranges, or a MATLABexpression. When you change variables or their ranges on the fly with this dialog box,variables plotted against one another must be compatible types and have the samenumber of observations (as in any bivariate graph).

If you attempt to link a plot and linkdata can identify more than one possibleworkspace variable for one or more plot objects, the Specify Data Source Properties

Page 62: MATLAB Data Analysis

2 Interactive Data Exploration

2-20

dialog box appears for you to resolve the ambiguity. If you choose not to or are unable todo so and cancel the dialog box, data linking is not established for those graphic objects.

When Data Links Fail

Updating a linked plot can fail if the strings in the XDataSource, YDataSource, orZDataSource properties are incompatible with what is in the current workspace.Consequently, the corresponding XData, YData, and ZData cannot be updated. Thishappens most often because variables are cleared or no longer exist when the workspacechanges (e.g., when you are debugging).

However, failing links do not affect the visual appearance of the object in the graph.Instead, a warning icon and message appears on the Linked Plot information bar whenthis occurs for any plotted data in the figure. The failing link warning is general, but you

can identify which variables are affected by clicking the Data Source button. If you

hide the Linked Plot information bar (by clicking its Hide button), the bar reappearswhen a data links fails, alerting you to the issue.

Page 63: MATLAB Data Analysis

Interacting with Graphed Data

2-21

Interacting with Graphed Data

In this section...

“Data Brushing with the Variables Editor” on page 2-21“Using Data Tips to Explore Graphs” on page 2-22“Example — Visually Exploring Demographic Statistics” on page 2-23

Data Brushing with the Variables Editor

To brush data in the Variables editor, link the figure windows associated with variable.Then right-click on a cell in the Variables editor and select Brushing > Brushing onin the context menu. Select one or more cells to brush elements in the variable. Thecorresponding points on your plots highlight simultaneously.

You can brush observations that appear in multiple linked plots at the same time. Youcan do this only when your observations are in a matrix with the plot variables runningalong separate columns. For example, you can create two separate plots of observationsin a matrix called data, which contains system response measurements at 50 different(x, y) points. The first column, data(:,1), contains the x-coordinates, data(:,2)contains y-coordinates, and data(:,3) contains the measured response at each point.The left plot below shows the response versus x. The plot on the right shows the responseversus y. If you brush a point in one plot, the corresponding point in the other plothighlights at the same time. Furthermore, if you have the Variables editor open, thecorresponding data row is highlighted whenever you brush a point.

Page 64: MATLAB Data Analysis

2 Interactive Data Exploration

2-22

For more information about the using the Variables editor, see the openvar referencepage.

Using Data Tips to Explore Graphs

A data tip is a small display associated with an axes that reads out individual dataobservation values from a 2-D or 3-D graph. You create data tips by mouse clicks on

graphs using the Data Cursor tool from the figure toolbar. When you select thistool, you are in data cursor mode—signified by a hollow cross-hair cursor—in which youidentify x-, y-, and z-values of data points you click. Like data points you brush, exportsuch values to the workspace.

For descriptions of data cursor properties and how to use them, see

• “Display Data Values Interactively” and Using Data Cursors with Histograms in theMATLAB Graphics documentation

• The MATLAB function reference page for datacursormode

Page 65: MATLAB Data Analysis

Interacting with Graphed Data

2-23

The default behavior of data tips is to simply display the XData, YData, and ZDatavalues of the selected observations as text in a box. Sometimes this information is nothelpful by itself, and you might want to replace or augment it with other information.You can modify this behavior to display other facts connected to observations. Youcustomize data tip behavior by constructing a data tip text update function (in MATLABcode) to construct text strings for display in data tips and then instructing data cursormode to use your function instead of the default one.

Customize data cursor update functions to display information such as

• Names associated with x-, y-, and z-values• Weights associated with x-, y-, and z-values• Differences in x-, y-, and z-values from the mean or their neighbors• Transformations of values (e.g., normalizations or to different units of measure)• Related variables

You can create data tip text update functions to display such information and changetheir behavior on the fly. You can even make the update function behave differently fordistinct observations in the same graph if your update function or the code calling it candistinguish groups of them. The next section contains an example of coding and using acustomized data cursor update function.

Example — Visually Exploring Demographic Statistics

• “The Data Tip Text Update Function” on page 2-24• “Preparing, Plotting, and Annotating the Data” on page 2-25• “Explore the Graph with the Custom Data Cursor” on page 2-28• “Plot and Link a Histogram of a Related Variable” on page 2-29• “Explore the Linked Graphs with Data Brushing” on page 2-30• “Plot the Observations on a Linked Map” on page 2-31

The extended example that follows begins by using data tips to explore the incidenceof fatal traffic accidents tabulated for U.S. states, with respect to state populations.The example extends this analysis to brush, link, and map the data to discover spatialpatterns in the data. Each section of the example has four or fewer steps. By executingthem all, you gain insight into the data set and become familiar with useful graphicaldata exploration techniques.

Page 66: MATLAB Data Analysis

2 Interactive Data Exploration

2-24

Censuses of population and other national government statistics are valuable sourcesof demographic and socioeconomic data. An important aspect of census data is itsgeography, i.e., the regions to which a given set of statistics applies, and at what levelof granularity. When exploring census data, you frequently need to identify whatgeographic unit any given observation represents.

This example uses data tips to show place names and statistics for individualobservations. You pass place names and the data matrix to a custom text update functionto enable this. The place names are for U.S. states and the District of Columbia. If allthese names were placed as labels on the x-axis, they would be too small or too crowdedto be legible, but they are readable one at a time as data tips.

The example also illustrates how sorting a data matrix by rows can enhanceinterpretation when the original ordering (in this case alphabetical by state) provides nospecial insight into relationships among observations and variables.

The Data Tip Text Update Function

Data tips can present other information beyond x-, y- and z-values. Read through theexample function labeldtips, which takes three more parameters than a defaultcallback, and displays the following information:

• Its y-value• Deviation from an expected y-value• Percent deviation from the expected y-value• The observation's label (state name)

Because it customizes data tips, the function must be a code file that you invoke fromthe Command Window or from a script. This file, labeldtips.m, and the MAT-filesaccidents.mat and usapolygon.mat that the following examples also use, exist onthe MATLAB path. Here is the code for the labeldtips data cursor callback function.

function output_txt = labeldtips(obj,event_obj,...

xydata,labels,xymean)

% Display an observation's Y-data and label for a data tip

% obj Currently not used (empty)

% event_obj Handle to event object

% xydata Entire data matrix

% labels State names identifying matrix row

% xymean Ratio of y to x mean (avg. for all obs.)

% output_txt Datatip text (string or string cell array)

% This datacursor callback calculates a deviation from the

Page 67: MATLAB Data Analysis

Interacting with Graphed Data

2-25

% expected value and displays it, Y, and a label taken

% from the cell array 'labels'; the data matrix is needed

% to determine the index of the x-value for looking up the

% label for that row. X values could be output, but are not.

pos = get(event_obj,'Position');

x = pos(1); y = pos(2);

output_txt = {['Y: ',num2str(y,4)]};

ydev = round((y - x*xymean));

ypct = round((100 * ydev) / (x*xymean));

output_txt{end+1} = ['Yobs-Yexp: ' num2str(ydev) ...

'; Pct. dev: ' num2str(ypct)];

idx = find(xydata == x,1); % Find index to retrieve obs. name

% The find is reliable only if there are no duplicate x values

[row,col] = ind2sub(size(xydata),idx);

output_txt{end+1} = cell2mat(labels(row));

The portion of the example called “Explore the Graph with the Custom Data Cursor” onpage 2-28 sets up data cursor mode and declares this function as a callback using thefollowing code:

hdt = datacursormode;

set(hdt,'UpdateFcn',{@labeldtips,hwydata,statelabel,usmean})

The call to datacursormode puts the current figure in data cursor mode. hdt is thehandle of a data cursor mode object for the figure you want to explore. The function nameand its three formal arguments are a cell array.

Preparing, Plotting, and Annotating the Data

The following steps show how you load statistical data for U.S. states, plot some of it, andenter data cursor mode to explore the data:

Note: To help you interpret graphs created in this example, the hwydata data matrixand its row labels have been presorted by rows to be in ascending order by total statepopulation. The 51-by-1 vector hwyidx contains indices from the presorting (the datawere originally in alphabetic order)

If you ever want to resort the data array and state labels alphabetically, you can sort onthe first column of the hwydata matrix, which contains Census Bureau state IDs thatascend in alphabetical order, as follows:

[hwydata hwyidx] = sortrows(hwydata,1);

Page 68: MATLAB Data Analysis

2 Interactive Data Exploration

2-26

statelabel = statelabel(hwyidx);

If you do resort the data, to make the graph easier to interpret you might plot it usingmarkers rather than lines. To do this, change the call to plot in section 2, below, to thefollowing:

plot(hwydata(:,14),hwydata(:,4),'.')

1 Load U.S. state data statistics from the National Transportation Safety HighwayAdministration and the Bureau of the Census and look at the variables:

load 'accidents.mat'

whos

Name Size Bytes Class

datasources 3x1 2568 cell

hwycols 1x1 8 double

hwydata 51x17 6936 double

hwyheaders 1x17 1874 cell

hwyidx 51x1 408 double

hwyrows 1x1 8 double

statelabel 51x1 3944 cell

ushwydata 1x17 136 double

uslabel 1x1 86 cell

The data set has 51 observations for 17 variables.

• The state-by-state statistics; the double 51-by-17 matrix hwydata• The variable (column) names; the 1-by-17 text cell array hwyheaders• The state names; the 51-by-1 text cell array statelabel• Values for the entire United States for the 17 variables; the 1-by-17 matrix

ushwydata

• The label for the US values; the 1-by-1 cell array uslabel• Metadata describing data sources; the 3-by-1 cell array datasources

2 Plot a line graph of the population by state as x versus the number of traffic fatalitiesper state as y:

hf1 = figure;

plot(hwydata(:,14),hwydata(:,4));

xlabel(hwyheaders(14))

ylabel(hwyheaders(4))

Page 69: MATLAB Data Analysis

Interacting with Graphed Data

2-27

Because the state observations are sorted by population size, the graph is monotonicin x. The larger a population a state has, the more variation in traffic accidentfatalities it tends to show.

3 Compute the per capita rate of traffic fatalities for the entire United States; in thenext part of this example, the data cursor update function uses this average tocompute an expected value for each state you query:

usmean = ushwydata(4)/ushwydata(14)

usmean =

1.5150e-004

The statistic shows that nationally, about 150 per 100,0000 people die in trafficaccidents every year.

Use usmean to compute the smallest and largest expected values by multiplying it bythe smallest and largest state populations, and draw a line connecting them:

line([min(hwydata(:,14)) max(hwydata(:,14))],...

[min(hwydata(:,14))*usmean max(hwydata(:,14)*usmean)],...

'Color','m');

Page 70: MATLAB Data Analysis

2 Interactive Data Exploration

2-28

Note: The magenta line is not a regression line; it is a trend line that plots the numberof traffic deaths that a state of a given size would have if all states obeyed the nationalaverage.

Explore the Graph with the Custom Data Cursor

You can now explore the graphed data with the example custom data cursor updatefunction labeldtips (which must be on the MATLAB path or in the current folder).labeldtips displays state names and y-deviations.

1 Turn on data cursor mode and invoke the custom callback:

hdt = datacursormode;

set(hdt,'DisplayStyle','window');

% Declare a custom datatip update function

% to display state names:

set(hdt,'UpdateFcn',{@labeldtips,hwydata,statelabel,usmean})

Page 71: MATLAB Data Analysis

Interacting with Graphed Data

2-29

The data cursor 'window' display style sends data tip output to a small windowthat you can move anywhere within the figure. This display style is best suited todata tips that contain more text than just x-, y-, and z-values. The labeldtipscallback remains active for that figure until you use set to replace it with anotherfunction (or empty, to restore the default data cursor behavior). Click the right-mostpoint on the blue graph.

The data tip shows that California has the largest population and the largestnumber of traffic fatalities, 4120. However, it had 1012, or 20%, fewer fatalities thanpredicted by the national average. The next data point to the left depicts Texas.Click that data point or press the left arrow to show its data tip. To see results fromother states, move the data tip by dragging the black square or using the left or rightarrow to step it along the graph. If you know a little about U.S. geography, you mightobserve a pattern.

Plot and Link a Histogram of a Related Variable

The ninth column of hwydata, labeled "Fatalities per 100K Licensed Drivers,” is relatedto population. Plot a histogram of this variable to see which states have fewer or morefatalities per driver. To do this, link the plots to their data, and brush either of them.

Page 72: MATLAB Data Analysis

2 Interactive Data Exploration

2-30

1 Open a new figure and plot a histogram of Fatalities per 100K Licensed Drivers in it:

hf2 = figure

hist(hwydata(:,9),5)

xlabel(hwyheaders(9))

2 Link both the line graph and the histogram to their data sources in hwydata:

linkdata(hf1)

linkdata(hf2)

You can also click the Data Linking tool on the two figures. The first figurelinks automatically; the histogram does not because linkdata cannot determinewith certainty the YDataSource for histograms. The Linked Plot information bar ontop of the histogram informs you No Graphics have data sources. Cannot linkplot: fix it.

3 Click fix it to open the Specify Data Source Properties dialog box. Typehwydata(:,9) into the YDataSource edit box and click OK.

The Linked Plot information bar displays the data source you identified.

Explore the Linked Graphs with Data Brushing

Now that you have linked both graphs to a common data set, you can brush portions ofone to see the effect on the other.

1 It isn't necessary, but you might want to dock the plots in a figure group so you cansee them side by side.

Page 73: MATLAB Data Analysis

Interacting with Graphed Data

2-31

2Select the Data Brushing tool on the histogram plot. Brush the three right-most bars in the histogram; they represent higher values that range from 25 to 48fatalities per 100,000 drivers.

Notice which observations light up on the line graph. Not only are these states withsmaller populations, they are also states with above-average numbers of trafficfatalities.

3 Click the line graph to make it the active figure and select its Data Brushing tool.Click all the observations you can that fall below the straight line average. Youneed to hold the Shift key down to make multiple selections, whether by clicking ordragging. You might want to zoom in on the left side of the graph to brush properlythere. What do you see happening on the histogram?

Plot the Observations on a Linked Map

The hwydata matrix contains geographic location information in the form of latitude-longitude coordinates of a centroid for each state. You can make a crude map bygenerating a scatter plot of these coordinates, using longitude as x and latitude as y. Ifyou link the scatter plot, you can brush all the plots at once.

Page 74: MATLAB Data Analysis

2 Interactive Data Exploration

2-32

1 To provide a context for the map, plot an outline map of the conterminous UnitedState. Obtain the latitude and longitude coordinates required from the MAT-fileusapolygon.mat:

hf3 = figure;

load usapolygon

patch(uslon,uslat,[1 .9 .8],'Edgecolor','none');

hold on

2 Map the centroid longitude and latitude as a scatter plot with filled circles. Plot arectangle over part of the map, as follows:

scatter(hwydata(:,2),hwydata(:,3),36,'b','filled');

xlabel('Longitude')

ylabel('Latitude')

rectangle('Position',[-115,25,115-77,36-25],...

'EdgeColor',[.75 .75 .75])

The x- and y-limits change, shrinking the map, because the data matrix containsobservations for Alaska and Hawaii, but the map outline file does not include thesestates.

Page 75: MATLAB Data Analysis

Interacting with Graphed Data

2-33

3 Dock the map underneath the other two figures. Brush the map after turning on theData Linking and Data Brushing tools for its figure. Drag across the gray rectanglewith the Data Brushing tool to highlight just the southeastern and southwesternstates. What you see should look like this.

Data brushing and linking reveals that almost all the states with above-averagetraffic fatality rates are in the southern part of the U.S.

Using graphic data exploration, you have identified some intriguing regularities in thisdata. However, you have not identified any causes for the patterns you found. That willtake more work on with the data, and possibly additional data sets, along with somehypotheses and models.

Page 76: MATLAB Data Analysis

2-34

Page 77: MATLAB Data Analysis

3

Regression Analysis

• “Linear Correlation” on page 3-2• “Linear Regression” on page 3-6• “Interactive Fitting” on page 3-12• “Programmatic Fitting” on page 3-32

Page 78: MATLAB Data Analysis

3 Regression Analysis

3-2

Linear Correlation

In this section...

“Introduction” on page 3-2“Covariance” on page 3-3“Correlation Coefficients” on page 3-4

Introduction

Correlation quantifies the strength of a linear relationship between two variables. Whenthere is no correlation between two variables, then there is no tendency for the values ofthe variables to increase or decrease in tandem. Two variables that are uncorrelated arenot necessarily independent, however, because they might have a nonlinear relationship.

You can use linear correlation to investigate whether a linear relationship exists betweenvariables without having to assume or fit a specific model to your data. Two variablesthat have a small or no linear correlation might have a strong nonlinear relationship.However, calculating linear correlation before fitting a model is a useful way to identifyvariables that have a simple relationship. Another way to explore how variables arerelated is to make scatter plots of your data.

Covariance quantifies the strength of a linear relationship between two variables inunits relative to their variances. Correlations are standardized covariances, giving adimensionless quantity that measures the degree of a linear relationship, separate fromthe scale of either variable.

The following three MATLAB functions compute sample correlation coefficients andcovariance. These sample coefficients are estimates of the true covariance and correlationcoefficients of the population from which the data sample is drawn.

Function Description

corrcoef Correlation coefficient matrixcov Covariance matrixxcorr (a SignalProcessingToolbox™ function)

Cross-correlation sequence of a random process (includesautocorrelation)

Page 79: MATLAB Data Analysis

Linear Correlation

3-3

Covariance

Use the MATLAB cov function to calculate the sample covariance matrix for a datamatrix (where each column represents a separate quantity).

The sample covariance matrix has the following properties:

• cov(X) is symmetric.• diag(cov(X)) is a vector of variances for each data column. The variances represent

a measure of the spread or dispersion of data in the corresponding column. (The varfunction calculates variance.)

• sqrt(diag(cov(X))) is a vector of standard deviations. (The std functioncalculates standard deviation.)

• The off-diagonal elements of the covariance matrix represent the covariances betweenthe individual data columns.

Here, X can be a vector or a matrix. For an m-by-n matrix, the covariance matrix is n-by-n.

For an example of calculating the covariance, load the sample data in count.dat thatcontains a 24-by-3 matrix:

load count.dat

Calculate the covariance matrix for this data:

cov(count)

MATLAB responds with the following result:

ans =

1.0e+003 *

0.6437 0.9802 1.6567

0.9802 1.7144 2.6908

1.6567 2.6908 4.6278

The covariance matrix for this data has the following form:

Page 80: MATLAB Data Analysis

3 Regression Analysis

3-4

s s s

s s s

s s s

s sij ji

2

11

2

12

2

13

2

21

2

22

2

23

2

31

2

32

2

33

2 2

È

Î

ÍÍÍ

˘

˚

˙˙˙

=

Here, s2ij is the sample covariance between column i and column j of the data. Because

the count matrix contains three columns, the covariance matrix is 3-by-3.

Note: In the special case when a vector is the argument of cov, the function returns thevariance.

Correlation Coefficients

The MATLAB function corrcoef produces a matrix of sample correlation coefficientsfor a data matrix (where each column represents a separate quantity). The correlationcoefficients range from -1 to 1, where

• Values close to 1 indicate that there is a positive linear relationship between the datacolumns.

• Values close to -1 indicate that one column of data has a negative linear relationshipto another column of data (anticorrelation).

• Values close to or equal to 0 suggest there is no linear relationship between the datacolumns.

For an m-by-n matrix, the correlation-coefficient matrix is n-by-n. The arrangementof the elements in the correlation coefficient matrix corresponds to the location of theelements in the covariance matrix, as described in “Covariance” on page 3-3.

For an example of calculating correlation coefficients, load the sample data incount.dat that contains a 24-by-3 matrix:

load count.dat

Type the following syntax to calculate the correlation coefficients:

corrcoef(count)

Page 81: MATLAB Data Analysis

Linear Correlation

3-5

This results in the following 3-by-3 matrix of correlation coefficients:

ans =

1.0000 0.9331 0.9599

0.9331 1.0000 0.9553

0.9599 0.9553 1.0000

Because all correlation coefficients are close to 1, there is a strong positive correlationbetween each pair of data columns in the count matrix.

Page 82: MATLAB Data Analysis

3 Regression Analysis

3-6

Linear Regression

In this section...

“Introduction” on page 3-6“Residuals and Goodness of Fit” on page 3-7“Fitting Data with Curve Fitting Toolbox Functions” on page 3-11

Introduction

A data model explicitly describes a relationship between predictor and response variables.Linear regression fits a data model that is linear in the model coefficients. The mostcommon type of linear regression is a least-squares fit, which can fit both lines andpolynomials, among other linear models.

Before you model the relationship between pairs of quantities, it is a good idea to performcorrelation analysis to establish if a linear relationship exists between these quantities.Be aware that variables can have nonlinear relationships, which correlation analysiscannot detect. For more information, see “Linear Correlation” on page 3-2.

The MATLAB Basic Fitting GUI helps you to fit your data, so you can calculate modelcoefficients and plot the model on top of the data. For an example, see “Example: UsingBasic Fitting GUI” on page 3-14. You also can use the MATLAB polyfit andpolyval functions to fit your data to a model that is linear in the coefficients. For anexample, see “Programmatic Fitting” on page 3-41.

If you need to fit data with a nonlinear model, transform the variables to make therelationship linear. Alternatively, try to fit a nonlinear function directly using either theStatistics Toolbox nlinfit function, the Optimization Toolbox™ lsqcurvefit function,or by applying functions in the Curve Fitting Toolbox™.

This topic explains how to:

• Use correlation analysis to determine whether two quantities are related to justifyfitting the data.

• Fit a linear model to the data.• Evaluate the goodness of fit by plotting residuals and looking for patterns.• Calculate measures of goodness of fit R2 and adjusted R2

Page 83: MATLAB Data Analysis

Linear Regression

3-7

Residuals and Goodness of Fit

Residuals are the difference between the observed values of the response (dependent)variable and the values that a model predicts. When you fit a model that is appropriatefor your data, the residuals approximate independent random errors. That is, thedistribution of residuals ought not to exhibit a discernible pattern.

Producing a fit using a linear model requires minimizing the sum of the squares ofthe residuals. This minimization yields what is called a least-squares fit. You can gaininsight into the “goodness” of a fit by visually examining a plot of the residuals. If theresidual plot has a pattern (that is, residual data points do not appear to have a randomscatter), the randomness indicates that the model does not properly fit the data.

Evaluate each fit you make in the context of your data. For example, if your goalof fitting the data is to extract coefficients that have physical meaning, then it isimportant that your model reflect the physics of the data. Understanding what your datarepresents, how it was measured, and how it is modeled is important when evaluatingthe goodness of fit.

One measure of goodness of fit is the coefficient of determination, or R2 (pronounced r-square). This statistic indicates how closely values you obtain from fitting a model matchthe dependent variable the model is intended to predict. Statisticians often define R2

using the residual variance from a fitted model:R2 = 1 – SSresid / SStotal

SSresid is the sum of the squared residuals from the regression. SStotal is the sum of thesquared differences from the mean of the dependent variable (total sum of squares). Bothare positive scalars.

To learn how to compute R2 when you use the Basic Fitting tool, see “Derive R2, theCoefficient of Determination” on page 3-20. To learn more about calculating the R2

statistic and its multivariate generalization, continue reading here.

Example: Computing R2 from Polynomial Fits

You can derive R2 from the coefficients of a polynomial regression to determine how muchvariance in y a linear model explains, as the following example describes:

1 Create two variables, x and y, from the first two columns of the count variable inthe data file count.dat:

Page 84: MATLAB Data Analysis

3 Regression Analysis

3-8

load count.dat

x = count(:,1);

y = count(:,2);

2 Use polyfit to compute a linear regression that predicts y from x:

p = polyfit(x,y,1)

p =

1.5229 -2.1911

p(1) is the slope and p(2) is the intercept of the linear predictor. You can alsoobtain regression coefficients using the Basic Fitting GUI.

3 Call polyval to use p to predict y, calling the result yfit:

yfit = polyval(p,x);

Using polyval saves you from typing the fit equation yourself, which in this caselooks like:

yfit = p(1) * x + p(2);

4 Compute the residual values as a vector of signed numbers:

yresid = y - yfit;

5 Square the residuals and total them to obtain the residual sum of squares:

SSresid = sum(yresid.^2);

6 Compute the total sum of squares of y by multiplying the variance of y by thenumber of observations minus 1:

SStotal = (length(y)-1) * var(y);

7 Compute R2 using the formula given in the introduction of this topic:

rsq = 1 - SSresid/SStotal

rsq =

0.8707

This demonstrates that the linear equation 1.5229 * x -2.1911 predicts 87% ofthe variance in the variable y.

Page 85: MATLAB Data Analysis

Linear Regression

3-9

Computing Adjusted R2 for Polynomial Regressions

You can usually reduce the residuals in a model by fitting a higher degree polynomial.When you add more terms, you increase the coefficient of determination, R2. You get acloser fit to the data, but at the expense of a more complex model, for which R2 cannotaccount. However, a refinement of this statistic, adjusted R2, does include a penaltyfor the number of terms in a model. Adjusted R2, therefore, is more appropriate forcomparing how different models fit to the same data. The adjusted R2 is defined as:R2

adjusted = 1 - (SSresid / SStotal)*((n-1)/(n-d-1))where n is the number of observations in your data, and d is the degree of thepolynomial. (A linear fit has a degree of 1, a quadratic fit 2, a cubic fit 3, and so on.)

The following example repeats the steps of the previous example, “Example: ComputingR2 from Polynomial Fits” on page 3-7, but performs a cubic (degree 3) fit instead of alinear (degree 1) fit. From the cubic fit, you compute both simple and adjusted R2 valuesto evaluate whether the extra terms improve predictive power:

1 Create two variables, x and y, from the first two columns of the count variable inthe data file count.dat:

load count.dat

x = count(:,1);

y = count(:,2);

2 Call polyfit to generate a cubic fit to predict y from x:

p = polyfit(x,y,3)

p =

-0.0003 0.0390 0.2233 6.2779

p(4) is the intercept of the cubic predictor. You can also obtain regressioncoefficients using the Basic Fitting GUI.

3 Call polyval to use the coefficients in p to predict y, naming the result yfit:

yfit = polyval(p,x);

polyval evaluates the explicit equation you could manually enter as:

yfit = p(1) * x.^3 + p(2) * x.^2 + p(3) * x + p(4);

4 Compute the residual values as a vector of signed numbers:

Page 86: MATLAB Data Analysis

3 Regression Analysis

3-10

yresid = y - yfit;

5 Square the residuals and total them to obtain the residual sum of squares:

SSresid = sum(yresid.^2);

6 Compute the total sum of squares of y by multiplying the variance of y by thenumber of observations minus 1:

SStotal = (length(y)-1) * var(y);

7 Compute simple R2 for the cubic fit using the formula given in the introduction ofthis topic:

rsq = 1 - SSresid/SStotal

rsq =

0.9083

8 Finally, compute adjusted R2 to account for degrees of freedom:

rsq_adj = 1 - SSresid/SStotal * (length(y)-1)/(length(y)-length(p))

rsq_adj =

0.8945

The adjusted R2, 0.8945, is smaller than simple R2, .9083. It provides a more reliableestimate of the power of your polynomial model to predict.

In many polynomial regression models, adding terms to the equation increases both R2

and adjusted R2. In the preceding example, using a cubic fit increased both statisticscompared to a linear fit. (You can compute adjusted R2 for the linear fit for yourself todemonstrate that it has a lower value.) However, it is not always true that a linear fit isworse than a higher-order fit: a more complicated fit can have a lower adjusted R2 than asimpler fit, indicating that the increased complexity is not justified. Also, while R2 alwaysvaries between 0 and 1 for the polynomial regression models that the Basic Fitting toolgenerates, adjusted R2 for some models can be negative, indicating that a model that hastoo many terms.

Correlation does not imply causality. Always interpret coefficients of correlationand determination cautiously. The coefficients only quantify how much variance ina dependent variable a fitted model removes. Such measures do not describe howappropriate your model—or the independent variables you select—are for explaining thebehavior of the variable the model predicts.

Page 87: MATLAB Data Analysis

Linear Regression

3-11

Fitting Data with Curve Fitting Toolbox Functions

The Curve Fitting Toolbox software extends core MATLAB functionality by enabling thefollowing data-fitting capabilities:

• Linear and nonlinear parametric fitting, including standard linear least squares,nonlinear least squares, weighted least squares, constrained least squares, and robustfitting procedures

• Nonparametric fitting• Statistics for determining the goodness of fit• Extrapolation, differentiation, and integration• GUI that facilitates data sectioning and smoothing• Saving fit results in various formats, including MATLAB code files, MAT-files, and

workspace variables

For more information, see the Curve Fitting Toolbox documentation.

Page 88: MATLAB Data Analysis

3 Regression Analysis

3-12

Interactive Fitting

In this section...

“The Basic Fitting GUI” on page 3-12“Preparing for Basic Fitting” on page 3-12“Opening the Basic Fitting GUI” on page 3-13“Example: Using Basic Fitting GUI” on page 3-14

The Basic Fitting GUI

The MATLAB Basic Fitting GUI allows you to interactively:

• Model data using a spline interpolant, a shape-preserving interpolant, or a polynomialup to the tenth degree

• Plot one or more fits together with data• Plot the residuals of the fits• Compute model coefficients• Compute the norm of the residuals (a statistic you can use to analyze how well a

model fits your data)• Use the model to interpolate or extrapolate outside of the data• Save coefficients and computed values to the MATLAB workspace for use outside of

the GUI• Generate MATLAB code to recompute fits and reproduce plots with new data

Note: The Basic Fitting GUI is only available for 2-D plots. For more advanced fittingand regression analysis, see the Curve Fitting Toolbox documentation and the StatisticsToolbox documentation.

Preparing for Basic Fitting

The Basic Fitting GUI sorts your data in ascending order before fitting. If your data setis large and the values are not sorted in ascending order, it will take longer for the BasicFitting GUI to preprocess your data before fitting.

Page 89: MATLAB Data Analysis

Interactive Fitting

3-13

You can speed up the Basic Fitting GUI by first sorting your data. To create sortedvectors x_sorted and y_sorted from data vectors x and y, use the MATLAB sortfunction:

[x_sorted, i] = sort(x);

y_sorted = y(i);

Opening the Basic Fitting GUI

To use the Basic Fitting GUI, you must first plot your data in a figure window, using anyMATLAB plotting command that produces (only) x and y data.

To open the Basic Fitting GUI, select Tools > Basic Fitting from the menus at the topof the figure window.

When you fully expand it by twice clicking the arrow button in the lower rightcorner, the window displays three panels. Use these panels to:

• Select a model and plotting options• Examine and export model coefficients and norms of residuals• Examine and export interpolated and extrapolated values.

Page 90: MATLAB Data Analysis

3 Regression Analysis

3-14

To expand or collapse panels one-by-one, click the arrow button in the lower right cornerof the interface.

Example: Using Basic Fitting GUI

This example shows how to use the Basic Fitting GUI to fit, visualize, analyze, save, andgenerate code for polynomial regressions.

• “Load and Plot Census Data” on page 3-15• “Predict the Census Data with a Cubic Polynomial Fit” on page 3-15• “View and Save the Cubic Fit Parameters” on page 3-19• “Derive R2, the Coefficient of Determination” on page 3-20

Page 91: MATLAB Data Analysis

Interactive Fitting

3-15

• “Interpolate and Extrapolate Population Values” on page 3-25• “Generate a Code File to Reproduce the Result” on page 3-28• “Learn How the Basic Fitting Tool Computes Fits” on page 3-29

Load and Plot Census Data

The file, census.mat, contains U.S. population data for the years 1790 through 1990 at10 year intervals.

To load and plot the data, type the following commands at the MATLAB prompt:

load census

plot(cdate,pop,'ro')

The load command adds the following variables to the MATLAB workspace:

• cdate — A column vector containing the years from 1790 to 1990 in increments of 10.It is the predictor variable.

• pop — A column vector with U.S. population for each year in cdate. It is theresponse variable.

The data vectors are sorted in ascending order, by year. The plot shows the population asa function of year.

Now you are ready to fit an equation the data to model population growth over time.

Predict the Census Data with a Cubic Polynomial Fit

1 Open the Basic Fitting dialog box by selecting Tools > Basic Fitting in the Figurewindow.

2 In the Plot fits area of the Basic Fitting dialog box, select the cubic check box to fita cubic polynomial to the data.

MATLAB uses your selection to fit the data, and adds the cubic regression line to thegraph as follows.

Page 92: MATLAB Data Analysis

3 Regression Analysis

3-16

In computing the fit, MATLAB encounters problems and issues the followingwarning:

Polynomial is badly conditioned.

Add points with distinct X values,

select a polynomial with a lower degree,

or select "Center and scale X data."

Page 93: MATLAB Data Analysis

Interactive Fitting

3-17

This warning indicates that the computed coefficients for the model are sensitiveto random errors in the response (the measured population). It also suggests somethings you can do to get a better fit.

3 Continue to use a cubic fit. As you cannot add new observations to the census data,improve the fit by transforming the values you have to z-scores before recomputinga fit. Select the Center and scale X data check box in the GUI to make the BasicFitting tool perform the transformation.

To learn how centering and scaling data works, see “Learn How the Basic FittingTool Computes Fits” on page 3-29.

4 Now view the equations and display residuals. In addition to selecting the Centerand scale X data and cubic check boxes, select the following options:

• Show equations• Plot residuals• Show norm of residuals

Selecting Plot residuals creates a subplot of them as a bar graph. The following figuredisplays the results of the Basic Fitting GUI options you selected.

Page 94: MATLAB Data Analysis

3 Regression Analysis

3-18

The cubic fit is a poor predictor before the year 1790, where it indicates a decreasingpopulation. The model seems to approximate the data reasonably well after 1790.However, a pattern in the residuals shows that the model does not meet the assumptionof normal error, which is a basis for the least-squares fitting. The data 1 line identifiedin the legend are the observed x (cdate) and y (pop) data values. The cubic regressionline presents the fit after centering and scaling data values. Notice that the figure showsthe original data units, even though the tool computes the fit using transformed z-scores.

For comparison, try fitting another polynomial equation to the census data by selecting itin the Plot fits area.

Page 95: MATLAB Data Analysis

Interactive Fitting

3-19

Tip You can change the default plot settings and rename data series with the “PropertyEditor”.

View and Save the Cubic Fit Parameters

In the Basic Fitting dialog box, click the arrow button to display the estimatedcoefficients and the norm of the residuals in the Numerical results panel.

To view a specific fit, select it from the Fit list. This displays the coefficients in the BasicFitting dialog box, but does not plot the fit in the figure window.

Page 96: MATLAB Data Analysis

3 Regression Analysis

3-20

Note: If you also want to display a fit on the plot, you must select the corresponding Plotfits check box.

Save the fit data to the MATLAB workspace by clicking the Save to workspace buttonon the Numerical results panel. The Save Fit to Workspace dialog box opens.

With all check boxes selected, click OK to save the fit parameters as a MATLABstructure:

fit

fit =

type: 'polynomial degree 3'

coeff: [0.9210 25.1834 73.8598 61.7444]

Now, you can use the fit results in MATLAB programming, outside of the Basic FittingGUI.

Derive R2, the Coefficient of Determination

You can get an indication of how well a polynomial regression predicts your observeddata by computing the coefficient of determination, or R-square (written as R2). The R2

statistic, which ranges from 0 to 1, measures how useful the independent variable is inpredicting values of the dependent variable:

• An R2 value near 0 indicates that the fit is not much better than the model y =constant.

• An R2 value near 1 indicates that the independent variable explains most of thevariability in the dependent variable.

To compute R2, first compute a fit, and then obtain residuals from it. A residual is thesigned difference between an observed dependent value and the value your fit predictsfor it.residuals = yobserved - yfittedThe Basic Fitting tool can generate residuals for any fit it calculates. To view a graph ofresiduals, select the Plot residuals check box. You can view residuals as a bar, line orscatter plot.

After you have residual values, you can save them to the workspace, where you cancompute R2. Complete the preceding part of this example to fit a cubic polynomial to thecensus data, and then perform these steps:

Page 97: MATLAB Data Analysis

Interactive Fitting

3-21

Compute Residual Data and R2 for a Cubic Fit

1 Click the arrow button at the lower right to open the Numerical results tab if itis not already visible.

2 From the Fit drop-down menu, select cubic if it does not already show.3 Save the fit coefficients, norm of residuals, and residuals by clicking Save to

Workspace.

The Save Fit to Workspace dialog box opens with three check boxes and three textfields.

4 Select all three check boxes to save the fit coefficients, norm of residuals, andresidual values.

5 Identify the saved variables as belonging to a cubic fit. Change the variable namesby adding a 3 to each default name (for example, fit3, normresid3, and resids3).The dialog box should look like this figure.

6 Click OK. Basic Fitting saves residuals as a column vector of numbers, fitcoefficients as a struct, and the norm of residuals as a scalar.

Notice that the value that Basic Fitting computes for norm of residuals is 12.2380.This number is the square root of the sum of squared residuals of the cubic fit.

7 Optionally, you can verify the norm-of-residuals value that the Basic Fitting toolprovided. Compute the norm-of-residuals yourself from the resids3 array that youjust saved:

mynormresid3 = sum(resids3.^2)^(1/2)

Page 98: MATLAB Data Analysis

3 Regression Analysis

3-22

mynormresid3 =

12.2380

8 Compute the total sum of squares of the dependent variable, pop to compute R2.Total sum of squares is the sum of the squared differences of each value from themean of the variable. For example, use this code:

SSpop = (length(pop)-1) * var(pop)

SSpop =

1.2356e+005

var(pop) computes the variance of the population vector. You multiply it by thenumber of observations after subtracting 1 to account for degrees of freedom. Boththe total sum of squares and the norm of residuals are positive scalars.

9 Now, compute R2, using the square of normresid3 and SSpop:

rsqcubic = 1 - normresid3^2 / SSpop

rsqcubic =

0.9988

10 Finally, compute R2 for a linear fit and compare it with the cubic R2 value that youjust derived. The Basic Fitting GUI also provides you with the linear fit results. Toobtain the linear results, repeat steps 2-6, modifying your actions as follows:

• To calculate least-squares linear regression coefficients and statistics, in the Fitdrop-down on the Numerical results pane, select linear instead of cubic.

• In the Save to Workspace dialog, append 1 to each variable name to identify it asderiving from a linear fit, and click OK. The variables fit1, normresid1, andresids1 now exist in the workspace.

• Use the variable normresid1 (98.778) to compute R2 for the linear fit, as youdid in step 9 for the cubic fit:

rsqlinear = 1 - normresid1^2 / SSpop

rsqlinear =

0.9210

This result indicates that a linear least-squares fit of the population data explains92.1% of its variance. As the cubic fit of this data explains 99.9% of that variance,the latter seems to be a better predictor. However, because a cubic fit predicts usingthree variables (x, x2, and x3), a basic R2 value does not fully reflect how robust the

Page 99: MATLAB Data Analysis

Interactive Fitting

3-23

fit is. A more appropriate measure for evaluating the goodness of multivariate fits isadjusted R2. For information about computing and using adjusted R2, see “Residualsand Goodness of Fit” on page 3-7.

Caution R2 measures how well your polynomial equation predicts the dependent variable,not how appropriate the polynomial model is for your data. When you analyze inherentlyunpredictable data, a small value of R2 indicates that the independent variable does notpredict the dependent variable precisely. However, it does not necessarily mean thatthere is something wrong with the fit.

Compute Residual Data and R2 for a Linear Fit

In this next example, use the Basic Fitting GUI to perform a linear fit, save the resultsto the workspace, and compute R2 for the linear fit. You can then compare linear R2 withthe cubic R2 value that you derive in the example “Compute Residual Data and R2 for aCubic Fit” on page 3-21.

1 Click the arrow button at the lower right to open the Numerical results tab if itis not already visible.

2 Select the linear check box in the Plot fits area.3 From the Fit drop-down menu, select linear if it does not already show. The

Coefficients and norm of residuals area displays statistics for the linear fit.4 Save the fit coefficients, norm of residuals, and residuals by clicking Save to

Workspace.

The Save Fit to Workspace dialog box opens with three check boxes and three textfields.

5 Select all three check boxes to save the fit coefficients, norm of residuals, andresidual values.

6 Identify the saved variables as belonging to a linear fit. Change the variable namesby adding a 1 to each default name (for example, fit1, normresid1, and resids1).

7 Click OK. Basic Fitting saves residuals as a column vector of numbers, fitcoefficients as a struct, and the norm of residuals as a scalar.

Notice that the value that Basic Fitting computes for norm of residuals is 98.778.This number is the square root of the sum of squared residuals of the linear fit.

Page 100: MATLAB Data Analysis

3 Regression Analysis

3-24

8 Optionally, you can verify the norm-of-residuals value that the Basic Fitting toolprovided. Compute the norm-of-residuals yourself from the resids3 array that youjust saved:

mynormresid1 = sum(resids1.^2)^(1/2)

mynormresid3 =

98.7783

9 Compute the total sum of squares of the dependent variable, pop to compute R2.Total sum of squares is the sum of the squared differences of each value from themean of the variable. For example, use this code:

SSpop = (length(pop)-1) * var(pop)

SSpop =

1.2356e+005

var(pop) computes the variance of the population vector. You multiply it by thenumber of observations after subtracting 1 to account for degrees of freedom. Boththe total sum of squares and the norm of residuals are positive scalars.

10 Now, compute R2, using the square of normresid1 and SSpop:

rsqlinear = 1 - normresid1^2 / SSpop

rsqcubic =

0.9210

This result indicates that a linear least-squares fit of the population data explains92.1% of its variance. As the cubic fit of this data explains 99.9% of that variance,the latter seems to be a better predictor. However, a cubic fit has four coefficients(x, x2, x3, and a constant), while a linear fit has two coefficients (x and a constant).A simple R2 statistic does not account for the different degrees of freedom. A moreappropriate measure for evaluating polynomial fits is adjusted R2. For informationabout computing and using adjusted R2, see “Residuals and Goodness of Fit” on page3-7.

Caution R2 measures how well your polynomial equation predicts the dependent variable,not how appropriate the polynomial model is for your data. When you analyze inherentlyunpredictable data, a small value of R2 indicates that the independent variable does not

Page 101: MATLAB Data Analysis

Interactive Fitting

3-25

predict the dependent variable precisely. However, it does not necessarily mean thatthere is something wrong with the fit.

Interpolate and Extrapolate Population Values

Suppose you want to use the cubic model to interpolate the U.S. population in 1965 (adate not provided in the original data).

1 In the Basic Fitting dialog box, click the button to specify a vector of x values atwhich to evaluate the current fit.

2 In the Enter value(s)... field, type the following value:

1965

Note: Use unscaled and uncentered x values. You do not need to center and scalefirst, even though you selected to scale x values to obtain the coefficients in “Predictthe Census Data with a Cubic Polynomial Fit” on page 3-15. The Basic Fittingtool makes the necessary adjustments behind the scenes.

3 Click Evaluate.

The x values and the corresponding values for f(x) computed from the fit anddisplayed in a table, as shown below:

Page 102: MATLAB Data Analysis

3 Regression Analysis

3-26

4 Select the Plot evaluated results check box to display the interpolated value as adiamond marker:

Page 103: MATLAB Data Analysis

Interactive Fitting

3-27

5 Save the interpolated population in 1965 to the MATLAB workspace by clickingSave to workspace.

This opens the following dialog box, where you specify the variable names:

Page 104: MATLAB Data Analysis

3 Regression Analysis

3-28

6 Click OK, but keep the Figure window open if you intend to follow the steps in thenext section, “Generate a Code File to Reproduce the Result” on page 3-28.

Generate a Code File to Reproduce the Result

After completing a Basic Fitting session, you can generate MATLAB code thatrecomputes fits and reproduces plots with new data.

1 In the Figure window, select File > Generate Code.

This creates a function and displays it in the MATLAB Editor. The code shows youhow to programmatically reproduce what you did interactively with the Basic Fittingdialog box.

2 Change the name of the function on the first line from createfigure to somethingmore specific, like censusplot. Save the code file to your current folder with the filename censusplot.m The function begins with:

function censusplot(X1, Y1, valuesToEvaluate1)

3 Generate some new, randomly perturbed census data:

randpop = pop + 10*randn(size(pop));

4 Reproduce the plot with the new data and recompute the fit:

censusplot(cdate,randpop,1965)

You need three input arguments: x,y values (data 1) plotted in the original graph,plus an x-value for a marker.

The following figure displays the plot that the generated code produces. The new plotmatches the appearance of the figure from which you generated code except for the ydata values, the equation for the cubic fit, and the residual values in the bar graph,as expected.

Page 105: MATLAB Data Analysis

Interactive Fitting

3-29

Learn How the Basic Fitting Tool Computes Fits

The Basic Fitting tool calls the polyfit function to compute polynomial fits. It calls thepolyval function to evaluate the fits. polyfit analyzes its inputs to determine if thedata is well conditioned for the requested degree of fit.

When it finds badly conditioned data, polyfit computes a regression as well as it can,but it also returns a warning that the fit could be improved. The Basic Fitting examplesection “Predict the Census Data with a Cubic Polynomial Fit” on page 3-15 displaysthis warning.

One way to improve model reliability is to add data points. However, adding observationsto a data set is not always feasible. An alternative strategy is to transform the predictor

Page 106: MATLAB Data Analysis

3 Regression Analysis

3-30

variable to normalize its center and scale. (In the example, the predictor is the vector ofcensus dates.)

The polyfit function normalizes by computing z-scores:

zx

=- m

s

where x is the predictor data, μ is the mean of x, and σ is the standard deviation of x.The z-scores give the data a mean of 0 and a standard deviation of 1. In the Basic FittingGUI, you transform the predictor data to z-scores by selecting the Center and scale xdata check box.

After centering and scaling, model coefficients are computed for the y data as a functionof z. These are different (and more robust) than the coefficients computed for y as afunction of x. The form of the model and the norm of the residuals do not change. TheBasic Fitting GUI automatically rescales the z-scores so that the fit plots on the samescale as the original x data.

To understand the way in which the centered and scaled data is used as an intermediaryto create the final plot, run the following code in the Command Window:

close

load census

x = cdate;

y = pop;

z = (x-mean(x))/std(x); % Compute z-scores of x data

plot(x,y,'ro') % Plot data as red markers

hold on % Prepare axes to accept new graph on top

zfit = linspace(z(1),z(end),100);

pz = polyfit(z,y,3); % Compute conditioned fit

yfit = polyval(pz,zfit);

xfit = linspace(x(1),x(end),100);

plot(xfit,yfit,'b-') % Plot conditioned fit vs. x data

The centered and scaled cubic polynomial plots as a blue line, as shown here:

Page 107: MATLAB Data Analysis

Interactive Fitting

3-31

In the code, computation of z illustrates how to normalize data. The polyfit functionperforms the transformation itself if you provide three return arguments when calling it:

[p,S,mu] = polyfit(x,y,n)

The returned regression parameters, p, now are based on normalized x. The returnedvector, mu, contains the mean and standard deviation of x. For more information, see thepolyfit reference page.

Page 108: MATLAB Data Analysis

3 Regression Analysis

3-32

Programmatic Fitting

In this section...

“MATLAB Functions for Polynomial Models” on page 3-32“Linear Model with Nonpolynomial Terms” on page 3-37“Multiple Regression” on page 3-39“Programmatic Fitting” on page 3-41

MATLAB Functions for Polynomial Models

Two MATLAB functions can model your data with a polynomial.

Polynomial Fit Functions

Function Description

polyfit polyfit(x,y,n) finds the coefficients of a polynomial p(x)of degree n that fits the y data by minimizing the sum of thesquares of the deviations of the data from the model (least-squares fit).

polyval polyval(p,x) returns the value of a polynomial of degree nthat was determined by polyfit, evaluated at x.

This example shows how to model data with a polynomial.

Measure a quantity y at several values of time t.

t = [0 0.3 0.8 1.1 1.6 2.3];

y = [0.6 0.67 1.01 1.35 1.47 1.25];

plot(t,y,'o')

title('Plot of y Versus t')

Page 109: MATLAB Data Analysis

Programmatic Fitting

3-33

You can try modeling this data using a second-degree polynomial function,

The unknown coefficients, , , and , are computed by minimizing the sum of thesquares of the deviations of the data from the model (least-squares fit).

Use polyfit to find the polynomial coefficients.

p = polyfit(t,y,2)

p =

Page 110: MATLAB Data Analysis

3 Regression Analysis

3-34

-0.2942 1.0231 0.4981

MATLAB calculates the polynomial coefficients in descending powers.

The second-degree polynomial model of the data is given by the equation

Evaluate the polynomial at uniformly spaced times, t2. Then, plot the original data andthe model on the same plot.

t2 = 0:0.1:2.8;

y2 = polyval(p,t2);

figure

plot(t,y,'o',t2,y2)

title('Plot of Data (Points) and Model (Line)')

Page 111: MATLAB Data Analysis

Programmatic Fitting

3-35

Evaluate model at the data time vector

y2 = polyval(p,t);

Calculate the residuals.

res = y - y2;

Plot the residuals.

figure, plot(t,res,'+')

title('Plot of the Residuals')

Page 112: MATLAB Data Analysis

3 Regression Analysis

3-36

Notice that the second-degree fit roughly follows the basic shape of the data, but does notcapture the smooth curve on which the data seems to lie. There appears to be a patternin the residuals, which indicates that a different model might be necessary. A fifth-degreepolynomial (shown next) does a better job of following the fluctuations in the data.

Repeat the exercise, this time using a fifth-degree polynomial from polyfit.

p5 = polyfit(t,y,5)

p5 =

0.7303 -3.5892 5.4281 -2.5175 0.5910 0.6000

Page 113: MATLAB Data Analysis

Programmatic Fitting

3-37

Evaluate the polynomial at t2 and plot the fit on top of the data in a new figure window.

y3 = polyval(p5,t2);

figure

plot(t,y,'o',t2,y3)

title('Fifth-Degree Polynomial Fit')

Note: If you are trying to model a physical situation, it is always important to considerwhether a model of a specific order is meaningful in your situation.

Linear Model with Nonpolynomial Terms

Page 114: MATLAB Data Analysis

3 Regression Analysis

3-38

This example shows how to fit data with a linear model containing nonpolynomial terms.

When a polynomial function does not produce a satisfactory model of your data, you cantry using a linear model with nonpolynomial terms. For example, consider the followingfunction that is linear in the parameters , , and , but nonlinear in the data:

You can compute the unknown coefficients , , and by constructing and solvinga set of simultaneous equations and solving for the parameters. The following syntaxaccomplishes this by forming a design matrix, where each column represents a variableused to predict the response (a term in the model) and each row corresponds to oneobservation of those variables.

Enter t and y as column vectors.

t = [0 0.3 0.8 1.1 1.6 2.3]';

y = [0.6 0.67 1.01 1.35 1.47 1.25]';

Form the design matrix.

X = [ones(size(t)) exp(-t) t.*exp(-t)];

Calculate model coefficients.

a = X\y

a =

1.3983

-0.8860

0.3085

Therefore, the model of the data is given by

Now evaluate the model at regularly spaced points and plot the model with the originaldata.

Page 115: MATLAB Data Analysis

Programmatic Fitting

3-39

T = (0:0.1:2.5)';

Y = [ones(size(T)) exp(-T) T.*exp(-T)]*a;

plot(T,Y,'-',t,y,'o'), grid on

title('Plot of Model and Original Data')

Multiple Regression

This example shows how to use multiple regression to model data that is a function ofmore than one predictior variable.

When y is a function of more than one predictor variable, the matrix equations thatexpress the relationships among the variables must be expanded to accommodate theadditional data. This is called multiple regression.

Page 116: MATLAB Data Analysis

3 Regression Analysis

3-40

Measure a quantity for several values of and . Store these values in vectors x1, x2,and y, respectively.

x1 = [.2 .5 .6 .8 1.0 1.1]';

x2 = [.1 .3 .4 .9 1.1 1.4]';

y = [.17 .26 .28 .23 .27 .24]';

A model of this data is of the form

Multiple regression solves for unknown coefficients , , and by minimizing the sumof the squares of the deviations of the data from the model (least-squares fit).

Construct and solve the set of simultaneous equations by forming a design matrix, X.

X = [ones(size(x1)) x1 x2];

Solve for the parameters by using the backslash operator.

a = X\y

a =

0.1018

0.4844

-0.2847

The least-squares fit model of the data is

To validate the model, find the maximum of the absolute value of the deviation of thedata from the model.

Y = X*a;

MaxErr = max(abs(Y - y))

Page 117: MATLAB Data Analysis

Programmatic Fitting

3-41

MaxErr =

0.0038

This value is much smaller than any of the data values, indicating that this modelaccurately follows the data.

Programmatic Fitting

This example shows how to use MATLAB functions to:

• “Calculate Correlation Coefficients” on page 3-42• “Fit a Polynomial to the Data” on page 3-43• “Plot and Calculate Confidence Bounds” on page 3-45

Load sample census data from census.mat, which contains U.S. population data fromthe years 1790 to 1990.

load census

This adds the following two variables to the MATLAB workspace.

• cdate is a column vector containing the years 1790 to 1990 in increments of 10.• pop is a column vector with the U.S. population numbers corresponding to each year

in cdate.

Plot the data.

plot(cdate,pop,'ro')

title('U.S. Polulation from 1790 to 1990')

Page 118: MATLAB Data Analysis

3 Regression Analysis

3-42

The plot shows a strong pattern, which indicates a high correlation between thevariables.

Calculate Correlation Coefficients

In this portion of the example, you determine the statistical correlation between thevariables cdate and pop to justify modeling the data. For more information aboutcorrelation coefficients, see “Linear Correlation” on page 3-2.

Calculate the correlation-coefficient matrix.

corrcoef(cdate,pop)

Page 119: MATLAB Data Analysis

Programmatic Fitting

3-43

ans =

1.0000 0.9597

0.9597 1.0000

The diagonal matrix elements represent the perfect correlation of each variable withitself and are equal to 1. The off-diagonal elements are very close to 1, indicating thatthere is a strong statistical correlation between the variables cdate and pop.

Fit a Polynomial to the Data

This portion of the example applies the polyfit and polyval MATLAB functions tomodel the data.

Calculate fit parameters.

[p,ErrorEst] = polyfit(cdate,pop,2);

Evaluate the fit.

pop_fit = polyval(p,cdate,ErrorEst);

Plot the data and the fit.

plot(cdate,pop_fit,'-',cdate,pop,'+');

title('U.S. Polulation from 1790 to 1990')

legend('Polynomial Model','Data','Location','NorthWest');

xlabel('Census Year');

ylabel('Population (millions)');

Page 120: MATLAB Data Analysis

3 Regression Analysis

3-44

The plot shows that the quadratic-polynomial fit provides a good approximation to thedata.

Calculate the residuals for this fit.

res = pop - pop_fit;

figure, plot(cdate,res,'+')

title('Residuals for the Quadratic Polynomial Model')

Page 121: MATLAB Data Analysis

Programmatic Fitting

3-45

Notice that the plot of the residuals exhibits a pattern, which indicates that a second-degree polynomial might not be appropriate for modeling this data.

Plot and Calculate Confidence Bounds

Confidence bounds are confidence intervals for a predicted response. The width of theinterval indicates the degree of certainty of the fit.

This portion of the example applies polyfit and polyval to the census sample data toproduce confidence bounds for a second-order polynomial model.

The following code uses an interval of ±2D , which corresponds to a 95% confidenceinterval for large samples.

Page 122: MATLAB Data Analysis

3 Regression Analysis

3-46

Evaluate the fit and the prediction error estimate (delta).

[pop_fit,delta] = polyval(p,cdate,ErrorEst);

Plot the data, the fit, and the confidence bounds.

plot(cdate,pop,'+',...

cdate,pop_fit,'g-',...

cdate,pop_fit+2*delta,'r:',...

cdate,pop_fit-2*delta,'r:');

xlabel('Census Year');

ylabel('Population (millions)');

title('Quadratic Polynomial Fit with Confidence Bounds')

grid on

Page 123: MATLAB Data Analysis

Programmatic Fitting

3-47

The 95% interval indicates that you have a 95% chance that a new observation will fallwithin the bounds.

Page 124: MATLAB Data Analysis

3-48

Page 125: MATLAB Data Analysis

4

Time Series Analysis

• “What Are Time Series?” on page 4-2• “Time Series Objects” on page 4-3

Page 126: MATLAB Data Analysis

4 Time Series Analysis

4-2

What Are Time Series?

Time series are data vectors sampled over time, in order, often at regular intervals. Theyare distinguished from randomly sampled data, which form the basis of many other dataanalyses. Time series represent the time-evolution of a dynamic population or process.The linear ordering of time series gives them a distinctive place in data analysis, with aspecialized set of techniques.

Time series analysis is concerned with:

• Identifying patterns• Modeling patterns• Forecasting values

Several dedicated MATLAB functions perform time series analysis. This sectionintroduces objects and interactive tools for time series analysis.

Page 127: MATLAB Data Analysis

Time Series Objects

4-3

Time Series Objects

In this section...

“Types of Time Series and Their Uses” on page 4-3“Time Series Data Sample” on page 4-3“Example: Time Series Objects and Methods” on page 4-5“Time Series Constructor” on page 4-27“Time Series Collection Constructor” on page 4-27

Types of Time Series and Their Uses

MATLAB time series objects are of two types:

• timeseries — Stores data and time values, as well as the metadata informationthat includes units, events, data quality, and interpolation method

• tscollection — Stores a collection of timeseries objects that share a commontime vector, convenient for performing operations on synchronized time series withdifferent units

This section discusses the following topics:

• Using time series constructors to instantiate time series classes• Modifying object properties using set methods or dot notation• Calling time series functions and methods

To get a quick overview of programming with timeseries and tscollection objects,follow the steps in “Example: Time Series Objects and Methods” on page 4-5.

Time Series Data Sample

To properly understand the description of timeseries object properties and methodsin this documentation, it is important to clarify some terms related to storing data in atimeseries object—the difference between a data value and a data sample.

A data value is a single, scalar value recorded at a specific time. A data sample consistsof one or more values associated with a specific time in the timeseries object. Thenumber of data samples in a time series is the same as the length of the time vector.

Page 128: MATLAB Data Analysis

4 Time Series Analysis

4-4

For example, consider data that consists of three sensor signals: two signals representthe position of an object in meters, and the third represents its velocity in meters/second.

To enter the data matrix, type the following at the MATLAB prompt:

x = [-0.2 -0.3 13;

-0.1 -0.4 15;

NaN 2.8 17;

0.5 0.3 NaN;

-0.3 -0.1 15]

The NaN value represents a missing data value. MATLAB displays the following 5-by-3matrix:

x=

-0.2000 -0.3000 13.0000

-0.1000 -0.4000 15.0000

NaN 2.8000 17.0000

0.5000 0.3000 NaN

-0.3000 -0.1000 15.0000

The first two columns of x contain quantities with the same units and you can create amultivariate timeseries object to store these two time series. For more informationabout creating timeseries objects, see “Time Series Constructor” on page 4-27. Thefollowing command creates a timeseries object ts_pos to store the position values:

ts_pos = timeseries(x(:,1:2), 1:5, 'name', 'Position')

MATLAB responds by displaying the following properties of ts_pos:

timeseries

Common Properties:

Name: 'Position'

Time: [5x1 double]

TimeInfo: [1x1 tsdata.timemetadata]

Data: [5x2 double]

DataInfo: [1x1 tsdata.datametadata]

More properties, Methods

The Length of the time vector, which is 5 in this example, equals the number of datasamples in the timeseries object. Find the size of the data sample in ts_pos by typingthe following at the MATLAB prompt:

Page 129: MATLAB Data Analysis

Time Series Objects

4-5

getdatasamplesize(ts_pos)

ans =

1 2

Similarly, you can create a second timeseries object to store the velocity data:

ts_vel = timeseries(x(:,3), 1:5, 'name', 'Velocity');

Find the size of each data sample in ts_vel by typing the following:

getdatasamplesize(ts_vel)

ans =

1 1

Notice that ts_vel has one data value in each data sample and ts_pos has two datavalues in each data sample.

Note: In general, when the time series data is an M-by-N-by-P-by-... multidimensionalarray with M samples, the size of each data sample is N-by-P-by-... .

If you want to perform operations on the ts_pos and ts_vel timeseries objects whilekeeping them synchronized, group them in a time series collection. For more information,see “Time Series Collection Constructor Syntax” on page 4-28.

Example: Time Series Objects and Methods

• “Creating Time Series Objects” on page 4-6• “Viewing Time Series Objects” on page 4-7• “Modifying Time Series Units and Interpolation Method” on page 4-10• “Defining Events” on page 4-11• “Creating Time Series Collection Objects” on page 4-14• “Resampling a Time Series Collection Object” on page 4-16• “Adding a Data Sample to a Time Series Collection Object” on page 4-19

Page 130: MATLAB Data Analysis

4 Time Series Analysis

4-6

• “Removing and Interpolating Missing Data” on page 4-20• “Removing a Time Series from a Time Series Collection” on page 4-22• “Displaying Time Vector Values as Date Strings” on page 4-22• “Plotting Time Series Collection Members” on page 4-23

Creating Time Series Objects

This portion of the example illustrates how to create several timeseries objects from anarray. For more information about the timeseries object, see “Time Series Constructor”on page 4-27.

Import the sample data from count.dat to the MATLAB workspace.

load count.dat

This adds the 24-by-3 matrix, count, to the workspace. Each column of count representshourly vehicle counts at each of three town intersections.

View the count matrix.

count

count =

11 11 9

7 13 11

14 17 20

11 13 9

43 51 69

38 46 76

61 132 186

75 135 180

38 88 115

28 36 55

12 12 14

18 27 30

18 19 29

17 15 18

19 36 48

32 47 10

42 65 92

Page 131: MATLAB Data Analysis

Time Series Objects

4-7

57 66 151

44 55 90

114 145 257

35 58 68

11 12 15

13 9 15

10 9 7

Create three timeseries objects to store the data collected at each intersection.

count1 = timeseries(count(:,1), 1:24,'name', 'intersection1');

count2 = timeseries(count(:,2), 1:24,'name', 'intersection2');

count3 = timeseries(count(:,3), 1:24,'name', 'intersection3');

Note: In the above construction, timeseries objects have both a variable name (e.g.,count1) and an internal object name (e.g., intersection1). The variable name is usedwith MATLAB functions. The object name is a property of the object, accessed with objectmethods. For more information on timeseries object properties and methods, see “TimeSeries Properties” on page 4-27 and “Time Series Methods” on page 4-27.

By default, a time series has a time vector having units of seconds and a start time of0 sec. The example constructs the count1, count2, and count3 time series objectswith start times of 1 sec, end times of 24 sec, and 1-sec increments. You will change thetime units to hours in “Modifying Time Series Units and Interpolation Method” on page4-10.

Note: If you want to create a timeseries object that groups the three data columns incount, use the following syntax:

count_ts = timeseries(count, 1:24,'name','traffic_counts')

This is useful when all time series have the same units and you want to keep themsynchronized during calculations.

Viewing Time Series Objects

After creating a timeseries object, as described in “Creating Time Series Objects” onpage 4-6, you can view it in the Variables editor.

Page 132: MATLAB Data Analysis

4 Time Series Analysis

4-8

To view a timeseries object like count1 in the Variables editor, use either of thefollowing methods:

• Type open('count1') at the command prompt.• On the Home tab, in the Variable section, click Open Variable and select count1.

Page 133: MATLAB Data Analysis

Time Series Objects

4-9

Page 134: MATLAB Data Analysis

4 Time Series Analysis

4-10

Modifying Time Series Units and Interpolation Method

After creating a timeseries object, as described in “Creating Time Series Objects” onpage 4-6, you can modify its units and interpolation method using dot notation.

View the current properties of count1.

get(count1)

ans =

Events: []

Name: 'intersection1'

UserData: []

Data: [24x1 double]

DataInfo: [1x1 tsdata.datametadata]

Time: [24x1 double]

TimeInfo: [1x1 tsdata.timemetadata]

Quality: []

QualityInfo: [1x1 tsdata.qualmetadata]

IsTimeFirst: 1

TreatNaNasMissing: 1

Length: 24

MATLAB displays the current property values of the count1 timeseries object.

View the current DataInfo properties using dot notation.

count1.DataInfo

tsdata.datametadata

Package: tsdata

Common Properties:

Units: ''

Interpolation: linear (tsdata.interpolation)

Change the data units for count1 to 'cars'.

count1.DataInfo.Units = 'cars';

Set the interpolation method for count1 to zero-order hold.

Page 135: MATLAB Data Analysis

Time Series Objects

4-11

count1.DataInfo.Interpolation = tsdata.interpolation('zoh');

Verify that the DataInfo properties have been modified.

count1.DataInfo

tsdata.datametadata

Package: tsdata

Common Properties:

Units: 'cars'

Interpolation: zoh (tsdata.interpolation)

Modify the time units to be 'hours' for the three time series.

count1.TimeInfo.Units = 'hours';

count2.TimeInfo.Units = 'hours';

count3.TimeInfo.Units = 'hours';

Defining Events

This portion of the example illustrates how to define events for a timeseries objectby using the tsdata.event auxiliary object. Events mark the data at specific times.When you plot the data, event markers are displayed on the plot. Events also provide aconvenient way to synchronize multiple time series.

Add two events to the data that mark the times of the AM commute and PM commute.

Construct and add the first event to all time series. The first event occurs at 8 AM.

e1 = tsdata.event('AMCommute',8);

e1.Units = 'hours'; % Specify the units for time

count1 = addevent(count1,e1); % Add the event to count1

count2 = addevent(count2,e1); % Add the event to count2

count3 = addevent(count3,e1); % Add the event to count3

Construct and add the second event to all time series. The second event occurs at 6 PM.

e2 = tsdata.event('PMCommute',18);

e2.Units = 'hours'; % Specify the units for time

count1 = addevent(count1,e2); % Add the event to count1

count2 = addevent(count2,e2); % Add the event to count2

count3 = addevent(count3,e2); % Add the event to count3

Page 136: MATLAB Data Analysis

4 Time Series Analysis

4-12

Plot the time series, count1.

figure

plot(count1)

When you plot any of the time series, the plot method defined for time series objectsdisplays events as markers. By default markers are red filled circles.

The plot reflects that count1 uses zero-order-hold interpolation.

Plot count2.

plot(count2)

Page 137: MATLAB Data Analysis

Time Series Objects

4-13

If you plot time series count2, it replaces the count1 display. You see its events andthat it uses linear interpolation.

Overlay time series plots by setting hold on.

hold on

plot(count3)

Page 138: MATLAB Data Analysis

4 Time Series Analysis

4-14

When you hold the plot and add new data to it, the title, data units and time units donot display. The plot method cannot determine if the units are the same, so it does notattempt to display x and y axis labels.

Creating Time Series Collection Objects

This portion of the example illustrates how to create a tscollection object. Eachindividual time series in a collection is called a member. For more information about thetscollection object, see “Time Series Collection Constructor” on page 4-27.

Note: Typically, you use the tscollection object to group synchronized time seriesthat have different units. In this simple example, all time series have the same units andthe tscollection object does not provide an advantage over grouping the three time

Page 139: MATLAB Data Analysis

Time Series Objects

4-15

series in a single timeseries object. For an example of how to group several time seriesin one timeseries object, see “Creating Time Series Objects” on page 4-6.

Create a tscollection object nameed count_coll and use the constructor syntax toimmediately add two of the three time series currently in the MATLAB workspace (youwill add the third time series later).

tsc = tscollection({count1 count2},'name', 'count_coll')

Time Series Collection Object: count_coll

Time vector characteristics

Start time 1 hours

End time 24 hours

Member Time Series Objects:

intersection1

intersection2

Note: The time vectors of the timeseries objects you are adding to the tscollectionmust match.

Notice that the Name property of the timeseries objects is used to name the collectionmembers as intersection1 and intersection2.

Add the third timeseries object in the workspace to the tscollection.

tsc = addts(tsc, count3)

Time Series Collection Object: count_coll

Time vector characteristics

Start time 1 hours

End time 24 hours

Page 140: MATLAB Data Analysis

4 Time Series Analysis

4-16

Member Time Series Objects:

intersection1

intersection2

intersection3

All three members in the collection are listed.

Resampling a Time Series Collection Object

This portion of the example illustrates how to resample each member in atscollection using a new time vector. The resampling operation is used to eitherselect existing data at specific time values, or to interpolate data at finer intervals. If thenew time vector contains time values that did not exist in the previous time vector, thenew data values are calculated using the default interpolation method you associatedwith the time series.

Resample the time series to include data values every 2 hours instead of every hour andsave it as a new tscollection object.

tsc1 = resample(tsc,1:2:24)

Time Series Collection Object: count_coll

Time vector characteristics

Start time 1 hours

End time 23 hours

Member Time Series Objects:

intersection1

intersection2

intersection3

In some cases you might need a finer sampling of information than you currently haveand it is reasonable to obtain it by interpolating data values.

Page 141: MATLAB Data Analysis

Time Series Objects

4-17

Interpolate values at each half-hour mark.

tsc1 = resample(tsc,1:0.5:24)

Time Series Collection Object: count_coll

Time vector characteristics

Start time 1 hours

End time 24 hours

Member Time Series Objects:

intersection1

intersection2

intersection3

To add values at each half-hour mark, the default interpolation method of a timeseries is used. For example, the new data points in intersection1 are calculated byusing the zero-order hold interpolation method, which holds the value of the previoussample constant. You set the interpolation method for intersection1 as described in“Modifying Time Series Units and Interpolation Method” on page 4-10.

The new data points in intersection2 and intersection3 are calculated usinglinear interpolation, which is the default method.

Plot the members of tsc1 with markers to see the results of interpolating.

hold off % Allow axes to clear before plotting

plot(tsc1.intersection1,'-xb','Displayname','Intersection 1')

Page 142: MATLAB Data Analysis

4 Time Series Analysis

4-18

You can see that data points have been interpolated at half-hour intervals, and thatIntersection 1 uses zero-order-hold interpolation, while the other two members use linearinterpolation.

Maintain the graph in the figure while you add the other two members to the plot.Because the plot method suppresses the axis labels while hold is on, also add a legendto describe the three series.

hold on

plot(tsc1.intersection2,'-.xm','Displayname','Intersection 2')

plot(tsc1.intersection3,':xr','Displayname','Intersection 3')

legend('show','Location','NorthWest')

Page 143: MATLAB Data Analysis

Time Series Objects

4-19

Adding a Data Sample to a Time Series Collection Object

This portion of the example illustrates how to add a data sample to a tscollection.

Add a data sample to the intersection1 collection member at 3.25 hours (i.e., 15minutes after the hour).

tsc1 = addsampletocollection(tsc1,'time',3.25,...

'intersection1',5);

There are three members in the tsc1 collection, and adding a data sample to onemember adds a data sample to the other two members at 3.25 hours. However, becauseyou did not specify the data values for intersection2 and intersection3 in thenew sample, the missing values are represented by NaNs for these members. To learn

Page 144: MATLAB Data Analysis

4 Time Series Analysis

4-20

how to remove or interpolate missing data values, see “Removing Missing Data” on page4-20 and “Interpolating Missing Data” on page 4-21.

tsc1 Data from 2.0 to 3.5 Hours

Hours Intersection 1 Intersection 2 Intersection 3

2.0 7 13 112.5 7 15 15.53.0 14 17 203.25 5 NaN NaN

3.5 14 15 14.5

To view all intersection1 data (including the new sample at 3.25 hours), type

tsc1.intersection1

Similarly, to view all intersection2 data (including the new sample at 3.25 hourscontaining a NaN value), type

tsc1.intersection2

Removing and Interpolating Missing Data

Time series objects use NaNs to represent missing data. This portion of the exampleillustrates how to either remove missing data or interpolate values for it by using theinterpolation method you specified for that time series. In “Adding a Data Sample to aTime Series Collection Object” on page 4-19, you added a new data sample to thetsc1 collection at 3.25 hours.

As the tsc1 collection has three members, adding a data sample to one member addeda data sample to the other two members at 3.25 hours. However, because you did notspecify the data values for the intersection2 and intersection3 members at 3.25hours, they currently contain missing values, represented by NaNs.

Removing Missing Data

Find and remove the data samples containing NaN values in the tsc1 collection.

tsc1 = delsamplefromcollection(tsc1,'index',...

find(isnan(tsc1.intersection2.Data)));

Page 145: MATLAB Data Analysis

Time Series Objects

4-21

This command searches one tscollection member at a time—in this case,intersection2. When a missing value is located in intersection2, the data at thattime is removed from all members of the tscollection.

Note: Use dot-notation syntax to access the Data property of the intersection2member in the tsc1 collection:

tsc1.intersection2.Data

For a complete list of timeseries properties, see “Time Series Properties” on page4-27.

Interpolating Missing Data

For the sake of this example, reintroduce NaN values in intersection2 andintersection3.

tsc1 = addsampletocollection(tsc1,'time',3.25,...

'intersection1',5);

Interpolate the missing values in tsc1 using the current time vector (tsc1.Time).

tsc1 = resample(tsc1,tsc1.Time);

This replaces the NaN values in intersection2 and intersection3 by using linearinterpolation—the default interpolation method for these time series.

Note: Dot notation tsc1.Time is used to access the Time property of the tsc1 collection.For a complete list of tscollection properties, see “Time Series Collection Properties”on page 4-28.

To view intersection2 data after interpolation, for example, type

tsc1.intersection2

New tsc1 Data from 2.0 to 3.5 Hours

Hours Intersection 1 Intersection 2 Intersection 3

2.0 7 13 11

Page 146: MATLAB Data Analysis

4 Time Series Analysis

4-22

Hours Intersection 1 Intersection 2 Intersection 3

2.5 7 15 15.53.0 14 17 203.25 5 16 17.33.5 14 15 14.5

Removing a Time Series from a Time Series Collection

Remove the intersection3 time series from the tscollection object tsc1.

tsc1 = removets(tsc1,'intersection3')

Time Series Collection Object: count_coll

Time vector characteristics

Start time 1 hours

End time 24 hours

Member Time Series Objects:

intersection1

intersection2

Two time series as members in the collection are now listed.

Displaying Time Vector Values as Date Strings

This portion of the example illustrates how to control the format in which numerical timevector display, using MATLAB date strings. For a complete list of the MATLAB date-string formats supported for timeseries and tscollection objects, see the definitionof time vector definition in the timeseries reference page.

To use date strings, you must set the StartDate field of the TimeInfo property. Allvalues in the time vector are converted to date strings using StartDate as a referencedate.

Suppose the reference date occurs on December 25, 2009.

tsc1.TimeInfo.Units = 'hours';

Page 147: MATLAB Data Analysis

Time Series Objects

4-23

tsc1.TimeInfo.StartDate = 'DEC-25-2009 00:00:00';

Similarly to what you did with the count1, count2, and count3 time series objects, setthe data units to of the tsc1 members to the string 'car count'.

tsc1.intersection1.DataInfo.Units = 'car count';

tsc1.intersection2.DataInfo.Units = 'car count';

Plotting Time Series Collection Members

To plot data in a time series collection, you plot its members one at a time.

First graph tsc1 member intersection1.

hold off

plot(tsc1.intersection1);

Page 148: MATLAB Data Analysis

4 Time Series Analysis

4-24

When you plot a member of a time series collection, its time units display on the x-axisand its data units display on the y-axis. The plot title is displayed as 'Time SeriesPlot:<member name>'.

If you use the same figure to plot a different member of the collection, no annotationsdisplay. The time series plot method does not attempt to update labels and titles whenhold is on because the descriptors for the series can be different.

Plot intersection1 and intersection2 in the same figure. Prevent overwriting theplot, but remove axis labels and title. Add a legend and set the DisplayName property ofthe line series to label each member.

plot(tsc1.intersection1,'-xb','Displayname','Intersection 1')

hold on

plot(tsc1.intersection2,'-.xm','Displayname','Intersection 2')

legend('show','Location','NorthWest')

Page 149: MATLAB Data Analysis

Time Series Objects

4-25

The plot now includes the two time series in the collection: intersection1 andintesection2. Plotting the second graph erased the labels on the first graph.

Finally, change the date strings on the x-axis to hours and plot the two time seriescollection members again with a legend.

Specify time units to be 'hours' for the collection.

tsc1.TimeInfo.Units = 'hours';

Specify the format for displaying time.

tsc1.TimeInfo.Format = 'HH:MM';

Recreate the last plot with new time units.

Page 150: MATLAB Data Analysis

4 Time Series Analysis

4-26

hold off

plot(tsc1.intersection1,'-xb','Displayname','Intersection 1')

% Prevent overwriting plot, but remove axis labels and title.

hold on

plot(tsc1.intersection2,'-.xm','Displayname','Intersection 2')

legend('show','Location','NorthWest')

% Restore the labels with the |xlabel| and |ylabel| commmands and overlay a

% data grid.

xlabel('Time (hours)')

ylabel('car count')

grid on

For more information on plotting options for time series, see timeseries.

Page 151: MATLAB Data Analysis

Time Series Objects

4-27

Time Series Constructor

Before implementing the various MATLAB functions and methods specifically designedto handle time series data, you must create a timeseries object to store the data. Seetimeseries for the timeseries object constructor syntax.

For an example of using the constructor, see “Creating Time Series Objects” on page4-6.

Time Series Properties

See timeseries for a description of all the timeseries object properties. Youcan specify the Data, IsTimeFirst, Name, Quality, and Time properties as inputarguments in the constructor. To assign other properties, use the set function or dotnotation.

Note: To get property information from the command line, type help timeseries/tsprops at the MATLAB prompt.

For an example of editing timeseries object properties, see “Modifying Time SeriesUnits and Interpolation Method” on page 4-10.

Time Series Methods

For a description of all the time series methods, see timeseries.

Time Series Collection Constructor

• “Introduction” on page 4-27• “Time Series Collection Constructor Syntax” on page 4-28• “Time Series Collection Properties” on page 4-28• “Time Series Collection Methods” on page 4-29

Introduction

The MATLAB object, called tscollection, is a MATLAB variable that groups severaltime series with a common time vector. The timeseries objects that you include in thetscollection object are called members of this collection, and possess several methodsfor convenient analysis and manipulation of timeseries.

Page 152: MATLAB Data Analysis

4 Time Series Analysis

4-28

Time Series Collection Constructor Syntax

Before you implement the MATLAB methods specifically designed to operate on acollection of timeseries objects, you must create a tscollection object to store thedata.

The following table summarizes the syntax for using the tscollection constructor. Foran example of using this constructor, see “Creating Time Series Collection Objects” onpage 4-14.

Time Series Collection Syntax Descriptions

Syntax Description

tsc = tscollection(ts) Creates a tscollection object tsc thatincludes one or more timeseries objects.

The ts argument can be one of the following:

• Single timeseries object in the MATLABworkspace

• Cell array of timeseries objects in theMATLAB workspace

The timeseries objects share the same timevector in the tscollection.

tsc = tscollection(Time) Creates an empty tscollection object with thetime vector Time.

When time values are date strings, you mustspecify Time as a cell array of date strings.

tsc = tscollection(Time,

TimeSeries, 'Parameter',

Value, ...)

Optionally enter the following parameter-value pairs after the Time and TimeSeriesarguments:

• Name (see “Time Series Collection Properties”on page 4-28)

Time Series Collection Properties

This table lists the properties of the tscollection object. You can specify the Name,Time, and TimeInfo properties as input arguments in the tscollection constructor.

Page 153: MATLAB Data Analysis

Time Series Objects

4-29

Time Series Collection Property Descriptions

Property Description

Name tscollection object name entered as a string. This namecan differ from the name of the tscollection variable in theMATLAB workspace.

Time A vector of time values.

When TimeInfo.StartDate is empty, the numerical Timevalues are measured relative to 0 in specified units. WhenTimeInfo.StartDate is defined, the time values representdate strings measured relative to StartDate in specified units.

The length of Time must match either the first or the lastdimension of the Data property of each tscollection member.

TimeInfo Uses the following fields to store contextual information aboutTime:

• Units — Time units with the following values:'weeks', 'days', 'hours', 'minutes', 'seconds','milliseconds', 'microseconds', and 'nanoseconds'

• Start — Start time• End — End time (read-only)• Increment — Interval between two subsequent time values.

The increment is NaN when times are not uniformly sampled.• Length — Length of the time vector (read-only)• Format — String defining the date string display format.

See the MATLAB datestr function reference page for moreinformation.

• StartDate — Date string defining the reference date. Seethe MATLAB setabstime (tscollection) function referencepage for more information.

• UserData — Stores any additional user-defined information

Time Series Collection Methods

• “General Time Series Collection Methods” on page 4-30

Page 154: MATLAB Data Analysis

4 Time Series Analysis

4-30

• “Data and Time Manipulation Methods” on page 4-30

General Time Series Collection Methods

Use the following methods to query and set object properties, and plot the data.

Methods for Querying Properties

Method Description

get (tscollection) Query tscollection object property values.isempty (tscollection) Evaluate to true for an empty tscollection

object.length (tscollection) Return the length of the time vector.plot Plot the time series in a collection.set (tscollection) Set tscollection property values.size (tscollection) Return the size of a tscollection object.

Data and Time Manipulation Methods

Use the following methods to add or delete data samples, and manipulate thetscollection object.

Methods for Manipulating Data and Time

Method Description

addts Add a timeseries object to a tscollectionobject.

addsampletocollection Add data samples to a tscollection object.delsamplefromcollection Delete one or more data samples from a

tscollection object.getabstime (tscollection) Extract a date-string time vector from a

tscollection object into a cell array.getsampleusingtime (tscollection) Extract data samples from an existing

tscollectionobject into a new tscollectionobject.

gettimeseriesnames Return a cell array of time series names in atscollection object.

Page 155: MATLAB Data Analysis

Time Series Objects

4-31

Method Description

horzcat (tscollection) Horizontal concatenation of tscollection objects.Combines several timeseries objects with thesame time vector into one time series collection.

removets Remove one or more timeseries objects from atscollection object.

resample (tscollection) Select or interpolate data in a tscollection objectusing a new time vector.

setabstime (tscollection) Set the time values in the time vector of atscollection object as date strings.

settimeseriesnames Change the name of the selected timeseries objectin a tscollection object.

vertcat (tscollection) Vertical concatenation of tscollection objects.Joins several tscollection objects along the timedimension.

Page 156: MATLAB Data Analysis

4-32