modern hydrology and sustainable water development (gupta/modern hydrology and sustainable water...



Page 1: Modern Hydrology and Sustainable Water Development (Gupta/Modern Hydrology and Sustainable Water Development) || Statistical Analyses in Hydrology

P1: SFK/UKS P2: SFK

c08 BLBK276-Gupta August 28, 2010 15:31 Trim: 246mm X 189mm Printer Name: Yet to Come

8 Statistical analyses in hydrology

Without realizing it we inadvertently use statistics in our day-to-day life. For example, we make an estimate in our mind of the time required to cover a given distance using a particular mode of transport or how much money we need to carry when leaving home each morning. Similarly, while investing money, we weigh risk versus gain when choosing a particular investment option – all based on previous realizations. Professional analysts use statistics for collection, analysis, interpretation, and presentation of numerical data with a view to understanding a variety of phenomena. Many hydrologic data represent random phenomena – atmospheric temperature, rainfall, wind, etc. – measured using various instruments or derived from other measurements.

All measurements can be placed into one of two overriding categories, namely continuous and discontinuous (or discrete). Continuous measurements are those where fractions can be included; they invariably contain some error in the recorded value, because one can always measure them more accurately by using a device capable of measuring in smaller units – say in millimetres rather than centimetres for length measurement. Flow rate, rainfall intensity, water-surface elevation, etc. are some other examples of continuous measurements in hydrology. Discontinuous or discrete measurements can only take certain values, usually integers. The number of storm events occurring in a specified time period, and the number of overtopping flood events per year for a levee on a stream, are some examples of discrete measurements in hydrology. Values derived from such data should be rounded off to the nearest whole number.

In addition to the above, there are other categories in which data may fall: (i) Qualitative measurements, which describe a quality rather than a quantity, for example, type of species (male/female) or colour; normally the data that follow are numbers/counts for each of these categories; (ii) Quantitative measurements, which normally indicate a variable that can be measured quantitatively, for example, length and weight; and (iii) Derived variables, which are not measured directly but calculated from other measurements. The commonly derived variables are proportions (ratios and percentages).

In statistics, any set of people or objects with some attribute in common is defined as a population. One can have a population of students, monthly rainfall at a certain station, storm events in a given region during a given time interval – anything one may be interested in studying can be defined as a population. The objective of statistical analysis is to characterize a given population, and draw inferences or identify trends inherent in it. For example, one may want to test whether a new drug is effective for a specific group of people. Most of the time, it is not possible to administer a new drug to everyone. Instead the drug is administered to a sample group of people from the population to see if it is effective. This subset of the population is called a sample. In this sense all types of hydrological data, limited in number of realizations/measurements, are samples of their respective parent populations. When something is measured or inferred about a population, it is called a parameter. When the same thing is measured in a sample, it is called a statistic. For example, if one could measure the average age of the entire Indian population, it would be a parameter. When the same is measured in a sample of Indians, it would be called a statistic. Thus, a population is to a parameter as a sample is to a statistic. This distinction between the sample and the population is important because the aim is to draw inferences about populations from different samples. The symbols used for populations and samples differ too: Greek letters denote population parameters, and Roman letters denote a measure (statistic) from a sample.

Modern Hydrology and Sustainable Water Development S. K. Gupta

© 2011 S. K. Gupta. ISBN: 978-1-405-17124-3

8.1 Descriptive statistics

Before analysing data, one needs to know some details about the datasets (samples) obtained. First, one needs to know about the frequency pattern (distribution) inherent in the data. Second, one also needs to know how closely the sample fits the population. Descriptive statistical methods are used mainly to characterize the sample. These are used, in the first instance, to get a feel for the data and then to ascertain its suitability for carrying out relevant statistical tests on it. One also needs to indicate the errors associated with the results derived from the statistical tests that were applied.

8.1.1 Measures of central tendency

8.1.1.1 Mean

The most common description of the central tendency is the mean or average value of a number of measurements, defined as the sum of all the individual values divided by the number of measurements carried out. Thus:

$$\bar{x} \equiv \frac{1}{N}\sum_{i=1}^{N} x_i \equiv (x_1 + x_2 + x_3 + \cdots + x_N)/N \tag{8.1}$$

where $N$ is the number of measurements; $i$ denotes the serial number of an arbitrary measurement; and $x$ is the parameter measured. The summation sign is usually written as $\sum x_i$, omitting $i = 1$ and $N$. As an example, the mean value of a set of 7 measurements of a variable having values 28.5, 18.75, 22.9, 25.4, 24.55, 23.7, and 23.9, using Eqn 8.1, is 23.96. Since, in practice, the number of measurements is limited, a better estimate of the population mean can be obtained if the sample size is large. If we could increase this number to infinity, we would end up with the mean parameter of the population, defined as:

$$\mu \equiv \lim_{N \to \infty}\left(\frac{1}{N}\sum x_i\right) \tag{8.2}$$

If parameter values $x_i$ occur $f_i$ times, respectively (i.e. occur with frequencies $f_i$), the mean is:

$$\bar{x} = \frac{\sum f_i x_i}{\sum f_i} = \frac{\sum f_i x_i}{N} \tag{8.3}$$

where $N = \sum f_i$ is the total frequency (i.e. total number of cases/number of measurements).

While summarizing large volumes of raw data, it is often useful to classify the data into classes, or categories, and to determine the number of individual data belonging to each class, called the class frequency. A tabular arrangement of data by classes together with corresponding frequencies is called the frequency distribution, or frequency table. Data organized and summarized as a frequency distribution are often called grouped data. Although in the grouping process much of the original detail is generally lost, a clear 'overall' picture and some vital relationships may emerge. The size or width of a class interval is the difference between the lower and upper end members of the class and is referred to as class width, class size, or class length. If all class intervals of a frequency distribution have equal widths, this common width is denoted by $\Delta x$. The class mark or class midpoint is the midpoint of the class interval and is obtained by adding the lower and upper class limits and dividing by 2, or by adding $\Delta x/2$ to the lower class limit. In subsequent mathematical analysis, all observations belonging to a given class are assumed to coincide with the class mark. Thus, for example, in Fig. 8.1, the frequency in the class interval 19.5–20.5 is considered to correspond to 20. A graphical representation of a frequency distribution with a set of rectangles having (a) base on the horizontal axis (the x-axis) with centre at the class mark and width equal to the class width, and (b) area proportional to class frequency is called a histogram or frequency histogram. The line connecting the midpoints of the tops of the rectangles of the histogram is the frequency polygon (Fig. 8.1).

[Figure 8.1: histogram; x-axis: Temperature (°C); y-axis: No. of observations; Surface Temperature 18.6 ± 8.3 °C; Bottom Temperature 6.0 ± 0.9 °C]

Fig. 8.1 Frequency histogram of epilimnetic and hypolimnetic temperatures measured in a lake. The dotted line connecting the midpoints of the tops of the rectangles of the histogram is the frequency polygon. The values of mean (6.0 and 18.6) and standard deviation (0.9 and 8.3) show, as is also seen from the histogram, that it is significantly cooler at the bottom, with much less scatter than for the epilimnetic temperatures. Based on http://waterontheweb.org/curricula/ws/unit 05/Mod 17/mod17-1.ppt.

If, in constructing the histogram, the class width $\Delta x$ chosen is too large, nearly all data may lie within a single column, suggesting good statistical certainty but bad resolution. If, on the other hand, $\Delta x$ is made too small, that is, the resolution is increased, few data values fall within any given column and the reliability appears to be less (scattering histogram).
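The grouping procedure and the effect of the class width can be sketched in a few lines of Python. This is only an illustration: the function names `frequency_table` and `grouped_mean` and the temperature readings are hypothetical, not taken from the text. Each observation is assigned to its class mark, and the mean of the grouped data follows Eqn 8.3.

```python
import math
from collections import Counter

def frequency_table(data, width, origin=0.0):
    """Return {class mark: class frequency} for classes of equal width."""
    counts = Counter()
    for x in data:
        k = math.floor((x - origin) / width)   # index of the class containing x
        mark = origin + (k + 0.5) * width      # class mark = lower limit + width / 2
        counts[mark] += 1
    return dict(sorted(counts.items()))

def grouped_mean(table):
    """Mean of grouped data, Eqn 8.3, with the class marks as x_i."""
    total = sum(table.values())
    return sum(f * mark for mark, f in table.items()) / total

# Hypothetical temperature readings (deg C), for illustration only
readings = [18.2, 19.6, 19.9, 20.1, 20.4, 21.3, 21.8, 22.7, 24.9, 19.7]

narrow = frequency_table(readings, width=1.0, origin=0.5)   # classes 17.5-18.5, 18.5-19.5, ...
wide = frequency_table(readings, width=50.0, origin=0.5)    # one class swallows every reading

print(narrow)
print(wide)
print(grouped_mean(narrow), sum(readings) / len(readings))
```

With the 1 °C classes, the grouped mean stays within half a class width of the mean of the raw readings; with the 50 °C class, all detail about the distribution is lost.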

8.1.1.2 Median

The mean, as defined above, can distort the picture if there are a few extreme yet genuine values. The median can help in this case and is found by locating the 'middle' value. The median is defined as the value such that half the measurements of the dataset are below it and the other half are above it. In the set of seven values given above, arranged in ascending order (18.75, 22.9, 23.7, 23.9, 24.55, 25.4, 28.5), the median is 23.9. If $N$ is an even number, the median is the mean of the two middle values. For a symmetrical distribution, the mean and the median values are the same.
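As a quick check of the two numbers quoted so far, the mean (Eqn 8.1) and the median of the seven measurements can be computed with Python's standard library. This is a minimal sketch; the variable names are illustrative.

```python
import statistics

# The seven measurements used in the text for Eqn 8.1
values = [28.5, 18.75, 22.9, 25.4, 24.55, 23.7, 23.9]

mean_value = sum(values) / len(values)       # Eqn 8.1
median_value = statistics.median(values)     # middle value of the sorted data

print(round(mean_value, 2), median_value)    # prints: 23.96 23.9
```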

8.1.1.3 Mode

Mode is the value that occurs in a dataset most often and may not exist in many datasets, including the one above. Unlike mean or median, many datasets may not have a unique value of mode. For example, the mode of the sample (1, 3, 6, 6, 6, 6, 7, 7, 12, 12, and 17) is 6 (unique). Given the list of data (1, 1, 2, 4, 4), the mode is not unique.
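The two samples above can be checked with `statistics.multimode` from the Python standard library (Python 3.8 or later), which returns every value that attains the highest frequency. A minimal sketch:

```python
from statistics import multimode

sample_a = [1, 3, 6, 6, 6, 6, 7, 7, 12, 12, 17]
sample_b = [1, 1, 2, 4, 4]

modes_a = multimode(sample_a)   # a single mode
modes_b = multimode(sample_b)   # two values tie, so the mode is not unique

print(modes_a, modes_b)         # prints: [6] [1, 4]
```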

8.1.1.4 Geometric mean

The geometric mean, $G$, of a set of $N$ positive numbers, $x_i$, is the $N$th root of the product of all the numbers:

$$G = \sqrt[N]{x_1 x_2 x_3 \cdots x_N} \tag{8.4}$$

8.1.1.5 Harmonic mean

The harmonic mean, $H$, of a set of $N$ numbers, $x_i$, is the reciprocal of the arithmetic mean of the reciprocals of the numbers:

$$H = \frac{1}{\dfrac{1}{N}\sum_{i=1}^{N}\dfrac{1}{x_i}} = \frac{N}{\sum \dfrac{1}{x_i}} \tag{8.5}$$


8.1.1.6 The root mean square (RMS)

The root mean square (RMS) or the quadratic mean of a set of numbers, $x_i$, is defined as:

$$\mathrm{RMS} = \sqrt{\frac{\sum_{i=1}^{N} x_i^2}{N}} = \sqrt{\frac{\sum x_i^2}{N}} \tag{8.6}$$
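Eqns 8.4 to 8.6 translate directly into code. The sketch below uses hypothetical helper names and a two-value dataset chosen so the results are easy to verify by hand; the geometric mean is computed through logarithms to avoid overflow in the product for long series.

```python
import math

def geometric_mean(xs):
    """Eqn 8.4, evaluated as exp(mean of logs); requires positive numbers."""
    return math.exp(sum(math.log(x) for x in xs) / len(xs))

def harmonic_mean(xs):
    """Eqn 8.5: N divided by the sum of reciprocals."""
    return len(xs) / sum(1.0 / x for x in xs)

def root_mean_square(xs):
    """Eqn 8.6: square root of the mean of squares."""
    return math.sqrt(sum(x * x for x in xs) / len(xs))

data = [2.0, 8.0]   # illustrative values only
arithmetic = sum(data) / len(data)

print(harmonic_mean(data), geometric_mean(data), arithmetic, root_mean_square(data))
```

For positive numbers these measures are ordered as harmonic ≤ geometric ≤ arithmetic ≤ RMS; for the pair (2, 8) they are 3.2, 4, 5, and about 5.83.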

8.1.1.7 Quartiles, deciles, and percentiles

Extending the idea of median, one can think of values that divide the given dataset into four equal parts. These values, denoted by $Q_1$, $Q_2$, and $Q_3$, are called the first, second, and third quartiles, respectively; the value $Q_2$ equals the median.

Similarly, the values that divide the data into 10 equal parts are called deciles and are denoted by $D_1, D_2, \ldots, D_9$, while the values dividing the data into 100 equal parts are called percentiles and are denoted by $P_1, P_2, \ldots, P_{99}$. The 5th decile and the 50th percentile correspond to the median. The 25th and 75th percentiles correspond to the 1st and the 3rd quartiles, respectively.
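Quartiles, deciles, and percentiles can be obtained with `statistics.quantiles` from the Python standard library (Python 3.8 or later). Several interpolation conventions exist for small samples, and the text does not prescribe one; the sketch below simply uses the library's default ('exclusive') method on the seven measurements of Section 8.1.1.1, so values other than the median may differ slightly under another convention.

```python
import statistics

values = [28.5, 18.75, 22.9, 25.4, 24.55, 23.7, 23.9]

quartiles = statistics.quantiles(values, n=4)       # [Q1, Q2, Q3]
deciles = statistics.quantiles(values, n=10)        # [D1, ..., D9]
percentiles = statistics.quantiles(values, n=100)   # [P1, ..., P99]

q1, q2, q3 = quartiles
print(q1, q2, q3)
print(deciles[4], percentiles[49])   # 5th decile and 50th percentile
```

The second quartile, the 5th decile, and the 50th percentile all coincide with the median, 23.9.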

8.1.2 Measures of dispersion and variability

One can characterize data in a better way using some other parameters, such as Range and Standard Deviation. Range is given by the highest and lowest values in a dataset. In the above dataset, the range is 18.75–28.5. Standard Deviation (SD) is useful to measure the variation of a sample around the mean value. To appreciate the significance of mean and standard deviation, let us consider measurements of epilimnetic (surface) and hypolimnetic (bottom) temperatures of a lake (Fig. 8.1). The mean values indicate that the lake is significantly cooler at the bottom. But it is also seen that the variation at the bottom is much less and so is the standard deviation – the value being 0.9 for the hypolimnetic temperatures compared to 8.3 for the epilimnetic temperatures.

Since, by definition, the average deviation of the measured values from the mean equals zero, the mean of the squares of the deviations of individual values from the mean is taken to characterize the distribution. The resulting value is referred to as the variance of the population and designated by $\sigma^2$:

$$\sigma^2 \equiv \lim_{N \to \infty}\left[\frac{1}{N}\sum (x_i - \mu)^2\right] = \lim_{N \to \infty}\left(\frac{1}{N}\sum x_i^2\right) - \mu^2 \tag{8.7}$$

If we now consider a real set of measurements (sample), the standard deviation of the set (sample) is:

$$s = \sqrt{\frac{1}{N - 1}\sum (x_i - \bar{x})^2} \tag{8.8}$$

The fact that $(N - 1)$ instead of $N$ appears in the denominator is due to the loss of one degree of freedom while estimating $\bar{x}$. The need to do this can be appreciated by considering the extreme case of only one measurement. A single measurement cannot give any idea of the precision of the measurement; in accordance with this, using $(N - 1)$ in the denominator of Eqn 8.8 gives the indeterminate form 0/0, so that $s$ is undefined.

If parameter values, $x_i$, occur in the dataset with corresponding frequencies, $f_i$, the standard deviation can be written as:

$$s = \sqrt{\frac{1}{N - 1}\sum f_i (x_i - \bar{x})^2} \tag{8.9}$$

The mean and SD are often combined to characterize a given dataset as $\bar{x} \pm s$. For a large dataset, $s \approx \sigma$ and the distinction between the two is often ignored.

Coefficient of Variation is yet another parameter to show how much variation occurs in a given dataset. It is calculated using:

$$\text{Coefficient of variation} = \frac{s}{\bar{x}} \times 100 \tag{8.10}$$

The higher its value, the more data points one needs to collect to be confident that the sample is representative of the population. It can also be used to compare variation between different datasets.
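The sample standard deviation (Eqn 8.8), its frequency-weighted form (Eqn 8.9), and the coefficient of variation (Eqn 8.10) can be sketched as follows. The function names and the readings are hypothetical; the point of the example is that the raw and grouped forms give the same result, and that a single measurement is rejected because $N - 1 = 0$.

```python
import math

def sample_std(values):
    """Sample standard deviation with N - 1 in the denominator (Eqn 8.8)."""
    n = len(values)
    if n < 2:
        raise ValueError("at least two measurements are needed")
    mean = sum(values) / n
    return math.sqrt(sum((x - mean) ** 2 for x in values) / (n - 1))

def grouped_std(table):
    """Same quantity from a {value: frequency} table (Eqn 8.9)."""
    n = sum(table.values())
    mean = sum(f * x for x, f in table.items()) / n
    return math.sqrt(sum(f * (x - mean) ** 2 for x, f in table.items()) / (n - 1))

def coefficient_of_variation(values):
    """Coefficient of variation in percent (Eqn 8.10)."""
    return sample_std(values) / (sum(values) / len(values)) * 100.0

# Hypothetical readings, for illustration only
raw = [4.0, 4.0, 5.0, 7.0, 7.0, 7.0, 9.0]
table = {4.0: 2, 5.0: 1, 7.0: 3, 9.0: 1}   # the same data as value: frequency

print(sample_std(raw), grouped_std(table), coefficient_of_variation(raw))
```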

Example 8.1. A set of 8 annual rainfall values recorded from a mountainous catchment are: 15, 20, 21, 20, 36, 15, 25, and 15. Estimate various measures of central tendency described above.


Solution:

 i    x_i   x_i (arranged)   (x_i)^2   x_i − x̄    (x_i − x̄)^2   1/x_i
 1    15    15               225       −5.875     34.51563      0.066667
 2    20    15               400       −0.875     0.765625      0.05
 3    21    15               441       0.125      0.015625      0.047619
 4    20    20               400       −0.875     0.765625      0.05
 5    36    20               1296      15.125     228.7656      0.027778
 6    15    21               225       −5.875     34.51563      0.066667
 7    25    25               625       4.125      17.01563      0.04
 8    15    36               225       −5.875     34.51563      0.066667
 Sum  167                    3837      0          350.875       0.415397

With reference to calculations shown in the table above:

Mean ($\bar{x}$) = 167/8 = 20.875; Median = 20; Mode = 15

Geometric mean ($G$) = $\sqrt[8]{15 \times 20 \times 21 \times 20 \times 36 \times 15 \times 25 \times 15}$ = 19.992

Harmonic mean ($H$) = 8/0.415397 = 19.259

RMS = $\sqrt{3837/8}$ = 21.900

Range = 15 to 36 (36 − 15 = 21)

Std. Deviation ($s$) = $\sqrt{350.875/7}$ = 7.080; Variance = $s^2$ = 50.125

Although one can calculate these univariate statistical parameters by hand, it gets quite tedious when we have more than a few values and a number of variables. However, most statistical programs are capable of calculating them easily.
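As one illustration of such a program, the results of Example 8.1 can be reproduced with Python's standard `statistics` module (Python 3.8 or later). This is a sketch rather than a prescribed workflow; the values in the comments are those derived in the solution above.

```python
import math
import statistics

rain = [15, 20, 21, 20, 36, 15, 25, 15]   # annual rainfall values of Example 8.1

mean = statistics.mean(rain)                             # 20.875
median = statistics.median(rain)                         # 20
mode = statistics.mode(rain)                             # 15
g_mean = statistics.geometric_mean(rain)                 # about 19.992
h_mean = statistics.harmonic_mean(rain)                  # about 19.259
rms = math.sqrt(sum(x * x for x in rain) / len(rain))    # about 21.900
data_range = max(rain) - min(rain)                       # 21
std_dev = statistics.stdev(rain)                         # about 7.080, N - 1 in the denominator
variance = statistics.variance(rain)                     # 50.125

print(mean, median, mode, g_mean, h_mean, rms, data_range, std_dev, variance)
```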

8.1.3 Measures of shape and distribution

8.1.3.1 Moments

If $x_i$ are the $N$ values assumed by a variable, $x$, in a distribution, the quantity obtained by raising all values of $x_i$ to the power $r$, adding them, and then dividing the sum by the total number of $x_i$, i.e.:

$$\overline{x^r} = \frac{x_1^r + x_2^r + \cdots + x_N^r}{N} = \frac{\sum_{i=1}^{N} x_i^r}{N} = \frac{\sum x_i^r}{N} \tag{8.11}$$

is called the $r$th moment of the distribution. The first moment with $r = 1$ gives the arithmetic mean $\bar{x}$.

The $r$th moment about the mean $\bar{x}$ is defined as:

$$m_r = \frac{\sum_{i=1}^{N} (x_i - \bar{x})^r}{N} = \frac{\sum (x_i - \bar{x})^r}{N} = \overline{(x_i - \bar{x})^r} \tag{8.12}$$

If $r = 1$, then $m_1 = 0$. If $r = 2$, then $m_2 = s^2$, the variance.

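For grouped data with parameters, $x_i$, occurring with frequencies, $f_i$, the above moments are given by:

$$\overline{x^r} = \frac{f_1 x_1^r + f_2 x_2^r + \cdots + f_K x_K^r}{N} = \frac{\sum_{i=1}^{K} f_i x_i^r}{N} = \frac{\sum f_i x_i^r}{N} \tag{8.13}$$

$$m_r = \frac{\sum_{i=1}^{K} f_i (x_i - \bar{x})^r}{N} = \frac{\sum f_i (x_i - \bar{x})^r}{N} = \overline{(x_i - \bar{x})^r} \tag{8.14}$$

where $N = \sum_{i=1}^{K} f_i = \sum f_i$.

To generalize, one can define dimensionless moments about the mean that are independent of any specific system of units as:

$$a_r = \frac{m_r}{s^r} = \frac{m_r}{(\sqrt{m_2})^r} = \frac{m_r}{\sqrt{m_2^r}} \tag{8.15}$$

where $s = \sqrt{m_2}$ is the standard deviation. Since $m_1 = 0$ and $m_2 = s^2$, we have $a_1 = 0$ and $a_2 = 1$.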
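The moments about the mean (Eqn 8.12) and their dimensionless counterparts (Eqn 8.15) can be sketched as below. The function names are hypothetical, and, following Eqn 8.12, the divisor is $N$ rather than $N - 1$. The data are the rainfall values of Example 8.1.

```python
def central_moment(values, r):
    """r-th moment about the mean, m_r, with N in the denominator (Eqn 8.12)."""
    n = len(values)
    mean = sum(values) / n
    return sum((x - mean) ** r for x in values) / n

def dimensionless_moment(values, r):
    """a_r = m_r / (sqrt(m_2))**r (Eqn 8.15)."""
    return central_moment(values, r) / central_moment(values, 2) ** (r / 2)

rain = [15, 20, 21, 20, 36, 15, 25, 15]

a1 = dimensionless_moment(rain, 1)   # zero up to rounding error
a2 = dimensionless_moment(rain, 2)   # one by construction
a3 = dimensionless_moment(rain, 3)   # moment-based asymmetry
a4 = dimensionless_moment(rain, 4)   # moment-based peakedness

print(a1, a2, a3, a4)
```

Because $a_r$ is a ratio of quantities with the same units, rescaling the data (for example, from centimetres to millimetres) leaves it unchanged.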

8.1.3.2 Skewness

Skewness is a measure of the degree of asymmetry, or departure from symmetry, of a distribution around the mean/central maximum. If the smoothed frequency curve of a distribution has a longer tail to the right of the central maximum than to the left, the distribution is said to be skewed to the right or to have positive skewness. If the reverse is true, it is said to be skewed to the left or to have negative skewness (Fig. 8.2a).

Fig. 8.2 Types of frequency curves based on (a) Skewness and (b) Kurtosis.

For skewed distributions the mean tends to lie on the same side of the mode as the longer tail, that is, to the right of the mode for right skewed and to the left of the mode for left skewed. Thus a measure of asymmetry is provided by the difference (mean – mode). This can be made dimensionless if one divides it by a measure of dispersion, such as standard deviation, leading to the definition:

$$\text{Skewness} = \frac{\text{mean} - \text{mode}}{\text{standard deviation}} = \frac{\bar{x} - \text{mode}}{s} \tag{8.16}$$

The following empirical relation is found to hold for unimodal frequency distribution curves that are moderately skewed:

$$\text{mean} - \text{mode} = 3(\text{mean} - \text{median}) \tag{8.17}$$

Using Eqn 8.17, the coefficient of skewness can be defined as:

$$\text{Skewness} = \frac{3(\text{mean} - \text{median})}{\text{standard deviation}} = \frac{3(\bar{x} - \text{median})}{s} \tag{8.18}$$

Eqn 8.16 and Eqn 8.18 are called Pearson's first and second coefficients of skewness.
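Both Pearson coefficients can be evaluated for the rainfall data of Example 8.1 with a short sketch like the one below (variable names are illustrative). Since Eqn 8.17 is only an empirical relation for moderately skewed unimodal curves, the two coefficients need not agree numerically for a small sample; what they share here is the sign.

```python
import statistics

rain = [15, 20, 21, 20, 36, 15, 25, 15]   # data of Example 8.1

mean = statistics.mean(rain)
median = statistics.median(rain)
mode = statistics.mode(rain)
s = statistics.stdev(rain)                # sample standard deviation, Eqn 8.8

pearson_first = (mean - mode) / s         # Eqn 8.16
pearson_second = 3 * (mean - median) / s  # Eqn 8.18

print(pearson_first, pearson_second)
```

Both values are positive, indicating a longer tail to the right, which is consistent with the single large value of 36 in the dataset.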

8.1.3.3 Kurtosis

Kurtosis indicates the peakedness of a distribution, usually taken relative to the normal distribution (also called Gaussian distribution) – a symmetrical bell-shaped curve. A distribution having a relatively high peak, such as in Fig. 8.2b, is called leptokurtic. The curve that is less peaked or flat topped is called platykurtic. The normal distribution, which is neither very peaked nor very flat topped, is called mesokurtic.

One measure of the kurtosis employs the 4th moment about the mean expressed in a dimensionless form and is given by:

$$\text{Moment coefficient of kurtosis} = k = a_4 = \frac{m_4}{s^4} = \frac{m_4}{m_2^2} \tag{8.19}$$

which is often denoted by $k$ or $b_2$. For a normal distribution, $k = b_2 = a_4 = 3$. For this reason, $k > 3$ defines a leptokurtic distribution; $k < 3$ defines a platykurtic distribution.
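The moment coefficient of kurtosis (Eqn 8.19) can be illustrated with synthetic samples. In the sketch below, the function name, the seed, and the sample size are arbitrary choices; a large normal sample should give a value close to 3, while a uniform (flat-topped) sample gives a value near 1.8, the known kurtosis of the uniform distribution.

```python
import random

def moment_kurtosis(values):
    """k = m_4 / m_2**2 with N in the denominators (Eqns 8.12 and 8.19)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((x - mean) ** 2 for x in values) / n
    m4 = sum((x - mean) ** 4 for x in values) / n
    return m4 / m2 ** 2

rng = random.Random(42)   # fixed seed so the run is repeatable
normal_sample = [rng.gauss(0.0, 1.0) for _ in range(50000)]
uniform_sample = [rng.uniform(-1.0, 1.0) for _ in range(50000)]

k_normal = moment_kurtosis(normal_sample)     # close to 3: mesokurtic
k_uniform = moment_kurtosis(uniform_sample)   # well below 3: platykurtic

print(k_normal, k_uniform)
```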

8.2 Probability theory

Due to the vast complexities of nature, hydrologic events such as rainfall and stream flow, which affect our day-to-day life and are closely linked to our climate, are highly variable. As a result, probability theory and statistics have to be employed to predict occurrences of these events, or many other natural phenomena. For example, water resources managers may want to know the probability that the spring and summer stream flow of major rivers in the year 2010 will be, say, 30% lower than their respective long-term average values, or the probability that there will be a warm December month in 2010. Probability is defined as the number of times a specific event occurs out of a total number of observed events (that defines the sample size):

$$P(X) = \lim_{n \to \infty} \frac{n_x}{n} \tag{8.20}$$

where
$P(X)$ = probability of occurrence of event $X(t)$;
$n_x$ = number of occurrences of event $X(t)$;
$n$ = sample size.

[Figure 8.3: four schematic panels: (a) $p_x(x)$ against $x$ at the points $x_1, x_2, x_3, \ldots, x_{k-1}, x_k$; (b) $P_x(x)$ against $x$, rising from 0 to 1; (c) $f_x(x)$ against $x$; (d) $F_x(x)$ against $x$, rising from 0 to 1]

Fig. 8.3 (a) Probability mass function (PMF), $p_x$. (b) Cumulative distribution function (CDF), $P_x$, of a discrete random variable. (c) Probability density function (PDF), $f_x$, and (d) Cumulative distribution function (CDF), $F_x$, of a continuous random variable. Based on Tung et al. (2006).
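The limiting relative frequency in Eqn 8.20 can be illustrated by simulation. The sketch below assumes a hypothetical event with a true probability of 0.3 and counts how often it occurs in samples of increasing size; the function name, the probability, and the seed are illustrative choices only.

```python
import random

def relative_frequency(n_trials, p_event, seed=0):
    """Return n_x / n for n simulated observations of an event with probability p_event."""
    rng = random.Random(seed)
    n_x = sum(1 for _ in range(n_trials) if rng.random() < p_event)
    return n_x / n_trials

for n in (10, 1000, 100000):
    print(n, relative_frequency(n, 0.3))
```

As the sample size grows, the ratio $n_x/n$ settles near 0.3, whereas for small samples it can deviate noticeably, which is why short hydrological records give only rough probability estimates.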

A hydrologic variable is called a random variable because its occurrence is associated with a certain degree of uncertainty or probability. A probability distribution represents the relative frequency of occurrence of a population. Often, if a sample is large enough, one can plot histograms of the data with small class intervals. By fitting the histograms with a curve, an approximate probability distribution of the data $X$ occurring in each class interval is obtained. This curve is often called the probability density function (PDF). The shape of this curve gives an idea about the nature of the PDF. For example, if it is bell-shaped and symmetric, it is likely to be a Gaussian (normal) distribution. Probability distributions commonly encountered in hydrology are shown in Fig. 8.5, which also shows interrelationships amongst them. Some of these distributions are discussed in Sections 8.2.2 and 8.2.3.

The cumulative distribution function (CDF), or simply distribution function (DF), of a random variable $X$ is defined as:

$$F_x(x) = P(X \le x) \tag{8.21}$$

The CDF, $F_x(x)$, gives the non-exceedance probability, which is a non-decreasing function of the argument $x$, that is, $F_x(a) \le F_x(b)$ for $a < b$. For a discrete random variable $X$, the probability mass function (PMF) is defined as:

$$p_x(x) = P(X = x) \tag{8.22}$$

The PMF of a discrete random variable and its associated CDF are sketched schematically in Fig. 8.3(a, b). For a continuous random variable, the PDF, $f_x(x)$, is defined as:

$$f_x(x) = \frac{dF_x(x)}{dx} \tag{8.23}$$

Similar to the discrete case, the PDF of a continuous random variable must satisfy two conditions: (i) $f_x(x) \ge 0$; and (ii) $\int f_x(x)\,dx = 1$. From Eqn 8.23, the PDF of a continuous random variable, $f_x(x)$, is the slope of its corresponding CDF. Graphical representations of PDF and CDF are shown in Fig. 8.3(c, d). Given the PDF of a random variable $X$, its CDF can be obtained as:

$$F_x(x) = \int_{-\infty}^{x} f_x(u)\,du \tag{8.24}$$

in which $u$ is a dummy variable of integration. It should be noted that the PDF has meaning only when it is integrated between two points. The probability of a continuous random variable taking on a particular value is zero, whereas this may not be the case for discrete random variables.
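The relation between a PDF and its CDF (Eqn 8.24) can be checked numerically. The sketch below uses the standard normal PDF as a stand-in for $f_x$ (an assumption made for illustration), integrates it with the trapezoidal rule from a lower bound that substitutes for $-\infty$, and compares the result with the closed-form normal CDF written in terms of `math.erf`. The helper names are hypothetical.

```python
import math

def normal_pdf(x):
    """Standard normal PDF, used here only as an example of f_x(x)."""
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)

def cdf_from_pdf(pdf, x, lower=-10.0, steps=20000):
    """Trapezoidal estimate of Eqn 8.24; 'lower' stands in for minus infinity."""
    h = (x - lower) / steps
    total = 0.5 * (pdf(lower) + pdf(x))
    for i in range(1, steps):
        total += pdf(lower + i * h)
    return total * h

def normal_cdf(x):
    """Closed-form standard normal CDF for comparison."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

a, b = -1.0, 1.0
prob_between = cdf_from_pdf(normal_pdf, b) - cdf_from_pdf(normal_pdf, a)   # P(a < X <= b)

print(cdf_from_pdf(normal_pdf, 1.0), normal_cdf(1.0), prob_between)
```

The probability of falling between two points is the difference of two CDF values, while the probability of any single value (an interval of zero width) is zero.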

The $r$th-order moment of a random variable $X$ about a reference point $X = x_0$ is defined, for the continuous case, as:

$$E[(X - x_0)^r] = \int_{-\infty}^{\infty} (x - x_0)^r f_x(x)\,dx = \int_{-\infty}^{\infty} (x - x_0)^r\,dF_x(x) \tag{8.25}$$

whereas for the discrete case:

$$E[(X - x_0)^r] = \sum_{k=1}^{N} (x_k - x_0)^r p_x(x_k) \tag{8.26}$$

where $E[\cdot]$ is the statistical expectation operator. The first three moments ($r = 1, 2, 3$) as above are used to describe the central tendency, variability, and asymmetry of the distribution, respectively. It may be noted that Eqn 8.26 is identically equal to Eqn 8.13 and Eqn 8.14 with the values of $x_0$ taken as zero and $\bar{x}$, respectively. As in the case of a discrete variable, two types of moments are commonly used for continuous variables also, that is, moments about the origin, where $x_0 = 0$, and the central moments, where $x_0 = \mu_x$, with $\mu_x = E[X]$. The $r$th-order central moment is denoted as $\mu_r = E[(X - \mu_x)^r]$, whereas the $r$th-order moment about the origin is denoted as $\mu'_r = E(X^r)$.

From Eqn 8.25, it is easy to see that the central tendency of a continuous random variable $X$ can be represented by its expectation, which is the first-order moment about the origin (or mean):

$$E(X) = \mu_x = \int_{-\infty}^{\infty} x f_x(x)\,dx = \int_{-\infty}^{\infty} x\,dF_x(x) \tag{8.27}$$

For a non-negative random variable, the mean can also be written as $E(X) = \int_{0}^{\infty} [1 - F_x(x)]\,dx$.

The following two operational properties of the expectation are useful:

1. The expectation of the sum of several random variables (regardless of their dependence) equals the sum of the expectations of the individual random variables, i.e.:

$$E\left(\sum_{k=1}^{K} a_k X_k\right) = \sum_{k=1}^{K} a_k \mu_k \tag{8.28}$$

in which $\mu_k = E(X_k)$, for $k = 1, 2, \ldots, K$.

2. The expectation of the product of several independent random variables equals the product of the expectations of the individual random variables, i.e.:

$$E\left(\prod_{k=1}^{K} X_k\right) = \prod_{k=1}^{K} \mu_k \tag{8.29}$$
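Both properties can be verified exactly for a simple discrete case. The sketch below uses two independent fair dice as a hypothetical pair of random variables (not an example from the text) and evaluates the expectations with Eqn 8.26 at $x_0 = 0$, using exact fractions to avoid rounding; the constants $a_1 = 2$ and $a_2 = 3$ are arbitrary.

```python
from fractions import Fraction
from itertools import product

faces = range(1, 7)
p_joint = Fraction(1, 36)        # each (x1, x2) pair is equally likely for independent fair dice

mu = Fraction(sum(faces), 6)     # E(X) for a single die, 7/2
a1, a2 = 2, 3                    # arbitrary constants

# Eqn 8.28: expectation of a weighted sum
e_sum = sum(p_joint * (a1 * x1 + a2 * x2) for x1, x2 in product(faces, repeat=2))

# Eqn 8.29: expectation of a product of independent variables
e_product = sum(p_joint * x1 * x2 for x1, x2 in product(faces, repeat=2))

print(e_sum, a1 * mu + a2 * mu)   # both 35/2
print(e_product, mu * mu)         # both 49/4
```

The product rule relies on independence: for the fully dependent case $X_2 = X_1$, $E(X_1 X_2) = E(X_1^2) = 91/6$, which differs from $\mu^2 = 49/4$, whereas the sum rule would still hold.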

The median, $x_{md}$, of a continuous random variable divides the distribution into two equal halves and is also the 50th percentile, satisfying the condition:

$$F_x(x_{md}) = \int_{-\infty}^{x_{md}} f_x(x)\,dx = 0.5 \tag{8.30}$$

The mode, $x_{mo}$, at which the value of a PDF peaks, can be obtained by solving the following equation:

$$\left[\frac{\partial f_x(x)}{\partial x}\right]_{x = x_{mo}} = 0 \tag{8.31}$$

The variance is the second-order central moment (i.e. about the mean) and for the continuous case is defined as:

$$\mathrm{Var}(X) = \mu_2 = \sigma_x^2 = E[(X - \mu_x)^2] = \int_{-\infty}^{\infty} (x - \mu_x)^2 f_x(x)\,dx \tag{8.32}$$

Three important properties of the variance are:

1. $\mathrm{Var}(a) = 0$ when $a$ is constant;
2. $\mathrm{Var}(X) = E(X^2) - E^2(X) = \mu'_2 - \mu_x^2$;
3. The variance of the sum of several independent random variables equals the sum of variances of the individual random variables, that is:

$$\mathrm{Var}\left(\sum_{k=1}^{K} a_k X_k\right) = \sum_{k=1}^{K} a_k^2 \sigma_k^2 \tag{8.33}$$

where $a_k$ is constant, and $\sigma_k$ is the standard deviation of random variable $X_k$, $k = 1, 2, \ldots, K$.
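The three properties can be checked exactly on a small discrete example. The sketch below again assumes two independent fair dice as hypothetical random variables and evaluates every expectation by enumerating the 36 equally likely outcomes with exact fractions; the helper names and the constants are illustrative.

```python
from fractions import Fraction
from itertools import product

faces = range(1, 7)

def expectation(g):
    """E[g(X1, X2)] for two independent fair dice (all 36 pairs equally likely)."""
    return sum(Fraction(1, 36) * g(x1, x2) for x1, x2 in product(faces, repeat=2))

def variance(g):
    """Second-order central moment of g(X1, X2)."""
    mean = expectation(g)
    return expectation(lambda x1, x2: (g(x1, x2) - mean) ** 2)

var_constant = variance(lambda x1, x2: 5)                  # property 1
var_die = variance(lambda x1, x2: x1)                      # variance of one die
shortcut = expectation(lambda x1, x2: x1 * x1) - expectation(lambda x1, x2: x1) ** 2   # property 2

a1, a2 = 2, 3
var_combination = variance(lambda x1, x2: a1 * x1 + a2 * x2)   # property 3, Eqn 8.33

print(var_constant, var_die, shortcut, var_combination)
```

The variance of a single die is 35/12, and the variance of $2X_1 + 3X_2$ equals $(2^2 + 3^2) \times 35/12 = 455/12$.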


Two measures of the skewness coefficient that indicate the asymmetry of a distribution are defined by Eqn 8.16 and Eqn 8.18. Another measure of the asymmetry of the PDF of a random variable is related to the third-order central moment as:

$$\gamma_x = \frac{\mu_3}{\mu_2^{1.5}} = \frac{E[(X - \mu_x)^3]}{\sigma_x^3} \tag{8.34}$$

The sign of the skewness coefficient indicates the direction of asymmetry of the probability distribution function. If $\gamma_x = 0$, the distribution is symmetric about its mean. The distribution has a long tail to the right when $\gamma_x > 0$, whereas $\gamma_x < 0$ indicates that the distribution has a long tail to the left (Fig. 8.2a).

Kurtosis, $\kappa_x$, is a measure of the peakedness of a distribution. As in Eqn 8.19, it is related to the fourth-order central moment of a random variable and may also be written as:

$$\kappa_x = \frac{\mu_4}{\mu_2^2} = \frac{E[(X - \mu_x)^4]}{\sigma_x^4} \tag{8.35}$$

with $\kappa_x > 0$. For a random variable having a normal distribution, its kurtosis equals 3. For all possible distribution functions, the skewness coefficient and kurtosis must satisfy the following inequality:

$$\gamma_x^2 + 1 \le \kappa_x \tag{8.36}$$
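The inequality in Eqn 8.36 can be explored numerically by treating a dataset as a discrete distribution in which every observation has probability $1/N$. In the sketch below (hypothetical function name), a two-valued dataset reaches the bound exactly, while the rainfall data of Example 8.1 satisfy it with room to spare.

```python
def skewness_and_kurtosis(values):
    """Return (gamma, kappa) from central moments with N in the denominators."""
    n = len(values)
    mean = sum(values) / n
    m2, m3, m4 = (sum((x - mean) ** r for x in values) / n for r in (2, 3, 4))
    return m3 / m2 ** 1.5, m4 / m2 ** 2

two_valued = [0, 0, 0, 1]                  # takes only two distinct values
rain = [15, 20, 21, 20, 36, 15, 25, 15]    # data of Example 8.1

g_two, k_two = skewness_and_kurtosis(two_valued)
g_rain, k_rain = skewness_and_kurtosis(rain)

print(g_two ** 2 + 1, k_two)     # equal: the bound is attained
print(g_rain ** 2 + 1, k_rain)   # left side smaller than right side
```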

Example 8.2. The time to failure, $T$, of a pump in a water distribution system is a continuous random variable having the PDF:

$$f_t(t) = \exp(-t/1250)/\beta \quad \text{for } t \ge 0,\ \beta > 0$$

in which $t$ is the time elapsed (in hours) before the pump fails; and $\beta$ is the parameter of the distribution function. Determine the constant $\beta$ and the probability that the operating life of the pump is longer than 200 h. Determine the first two moments about the origin for the time to failure of the pump. Then calculate the first two central moments. Also find values of mean, mode, median, and 10th percentile for the random time to failure, $T$.

Solution. The shape of the PDF is shown in Fig. 8.9a (for $\alpha = 1$). The PDF must satisfy two conditions, that is: (i) $f_t(t) \ge 0$, for all $t$; and (ii) the area under $f_t(t)$ must equal unity. Compliance with condition (i) can be proved easily. Condition (ii) is used to determine the value of the constant $\beta$ as:

$$\int_{0}^{\infty} f_t(t)\,dt = 1 = \int_{0}^{\infty} \frac{e^{-t/1250}}{\beta}\,dt = \left[\frac{-1250\,e^{-t/1250}}{\beta}\right]_{0}^{\infty} = \frac{1250}{\beta}$$

Therefore, the constant $\beta$ = 1250 h/failure. This particular PDF is that of an exponential distribution. To determine the probability that the operational life of the pump would exceed 200 h, one has to calculate $P(T \ge 200)$:

$$P(T \ge 200) = \int_{200}^\infty \frac{e^{-t/1250}}{1250}\,dt = \left[-e^{-t/1250}\right]_{200}^\infty = e^{-200/1250} = 0.852$$

According to Eqn 8.25, moments about the origin are:

$$E(T^r) = \mu'_r = \int_0^\infty t^r \left(\frac{e^{-t/\beta}}{\beta}\right) dt$$

Performing integration by parts, one obtains:

for $r = 1$: $\mu'_1 = E(T) = \mu_t = \beta = 1250$ h

for $r = 2$: $\mu'_2 = E(T^2) = 2\beta^2 = 3{,}125{,}000$ h$^2$

The central moments can be determined using Eqn 8.25, with $x_0 = \mu_t$:

for $r = 1$: $\mu_1 = E(T - \mu_t) = 0$

for $r = 2$: $\mu_2 = E[(T - \mu_t)^2] = \beta^2 = 1{,}562{,}500$ h$^2$

The mean value of the time to failure, called the mean time to failure (MTTF), is the first-order moment about the origin, which is $\mu_t$ = 1250 h. From the shape of the PDF for the exponential distribution, as shown in Fig. 8.9a (for $\alpha = 1$), one can readily see that the mode, representing the most likely time of pump failure, is at the beginning of pump operation, that is, $t_{mo}$ = 0 h.

To determine the median time to failure of the pump, one can first derive an expression for the CDF from the given exponential PDF as:

$$F_t(t) = P(T \le t) = \int_0^t \frac{e^{-u/1250}}{1250}\,du = 1 - e^{-t/1250} \quad \text{for } t \ge 0$$

in which $u$ is the dummy variable of integration. The median time to failure, $t_{md}$, can be obtained from Eqn 8.30:

$$F_t(t_{md}) = 1 - e^{-t_{md}/1250} = 0.5$$

which yields $t_{md}$ = 866.43 h.

Similarly, the 10 percentile value, $t_{0.1}$, that gives the time elapsed over which the pump would fail with a probability of 0.1, can be found in the same way as the median, except that the value of the CDF here is 0.1, that is:

$$F_t(t_{0.1}) = 1 - e^{-t_{0.1}/1250} = 0.1$$

which yields $t_{0.1}$ = 131.7 h.

Fig. 8.4 Some sample cases of correlation between two random variables X and Y: (a) linear negative correlation (ρ = −1.0); (b) strong linear correlation (ρ = 0.8); (c) no correlation (ρ = 0.0); (d) correlated nonlinearly but uncorrelated linearly (ρ = 0.0). Redrawn from Tung et al. (2006).
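The results of Example 8.2 can be cross-checked numerically. The following Python sketch uses the closed-form CDF of the exponential distribution and its analytical inverse; the function and variable names are illustrative assumptions, not part of the original example.

```python
import math

BETA = 1250.0  # scale parameter (h), the value found in Example 8.2


def exp_cdf(t, beta=BETA):
    """CDF of the exponential distribution, F(t) = 1 - exp(-t / beta)."""
    return 1.0 - math.exp(-t / beta) if t >= 0 else 0.0


def exp_quantile(p, beta=BETA):
    """Solve F(t) = p analytically: t = -beta * ln(1 - p)."""
    return -beta * math.log(1.0 - p)


p_exceed_200 = 1.0 - exp_cdf(200.0)    # P(T >= 200)
mean = BETA                            # first moment about the origin
second_moment = 2.0 * BETA ** 2        # E(T^2)
variance = second_moment - mean ** 2   # second central moment
median = exp_quantile(0.5)             # 50 percentile
t_10 = exp_quantile(0.1)               # 10 percentile
```

The variance follows from the two moments about the origin as $\mu'_2 - (\mu'_1)^2$, which reproduces the value $\beta^2$ quoted in the example.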

8.2.1 Covariance and correlation coefficient

When a problem involves two dependent random variables, the degree of linear dependence between the two is obtained from the correlation coefficient, $\rho_{x,y}$, which is defined as:

$$\mathrm{Corr}(X, Y) = \rho_{x,y} = \frac{\mathrm{Cov}(X, Y)}{\sigma_x \sigma_y} \tag{8.37}$$

where $\mathrm{Cov}(X, Y)$ is the covariance between random variables $X$ and $Y$, defined as:

$$\mathrm{Cov}(X, Y) = E[(X - \mu_x)(Y - \mu_y)] = E(XY) - \mu_x \mu_y \tag{8.38}$$

The correlation coefficient defined by Eqn 8.37 is called the Pearson product-moment correlation coefficient, or simply the correlation coefficient in common usage. It is easily seen that $\mathrm{Cov}(X'_1, X'_2) = \mathrm{Corr}(X_1, X_2)$, with $X'_1$ and $X'_2$ being the normalized random variables. In the realm of statistics, a random variable can be normalized as:

$$X' = \frac{X - \mu_x}{\sigma_x} \tag{8.39}$$

Hence a normalized random variable has zero mean and unit variance. Normalization does not affect the skewness coefficient and kurtosis of a random variable, because these parameters are dimensionless. Some commonly encountered cases of correlation between two random variables are shown schematically in Fig. 8.4.
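A small Python sketch can make Eqns 8.37 to 8.39 concrete. It treats a short list of equally likely values as the whole population; the data and function names are hypothetical and chosen only to mimic cases (a) and (d) of Fig. 8.4.

```python
import math


def mean(values):
    return sum(values) / len(values)


def covariance(xs, ys):
    """Population covariance, E[(X - mu_x)(Y - mu_y)] (Eqn 8.38)."""
    mx, my = mean(xs), mean(ys)
    return mean([(x - mx) * (y - my) for x, y in zip(xs, ys)])


def std(values):
    return math.sqrt(covariance(values, values))


def correlation(xs, ys):
    """Pearson correlation coefficient (Eqn 8.37)."""
    return covariance(xs, ys) / (std(xs) * std(ys))


def normalize(values):
    """Normalized variable X' = (X - mu_x) / sigma_x (Eqn 8.39)."""
    m, s = mean(values), std(values)
    return [(v - m) / s for v in values]


x = [-3.0, -2.0, -1.0, 0.0, 1.0, 2.0, 3.0]
y_linear = [2.0 - 0.5 * v for v in x]   # exact negative linear dependence
y_parabola = [v * v for v in x]         # fully dependent on x, but not linearly

rho_linear = correlation(x, y_linear)
rho_parabola = correlation(x, y_parabola)
cov_normalized = covariance(normalize(x), normalize(y_linear))
```

Under these assumptions `rho_linear` is −1, `rho_parabola` is zero even though `y_parabola` is completely determined by `x`, and the covariance of the normalized variables equals the correlation coefficient of the original ones.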

8.2.2 Some discrete univariate probability distributions

As mentioned above, probability distributions are classified into two types, discrete and continuous, based on the nature of the random variable. In this section, two discrete distributions that are commonly encountered in hydrosystems, namely the binomial distribution and the Poisson distribution, are described. Several frequently used univariate continuous distributions are described below. The inter-relationships between the various distributions discussed in this chapter, as well as some other distributions not discussed here, are shown in Fig. 8.5.

Fig. 8.5 Inter-relationships amongst different univariate distributions. After Leemis (1986). © American Statistical Association.

8.2.2.1 Binomial distribution

The binomial distribution is applicable to random processes with only two types of outcomes, for example, a system is either functioning (i.e. successful) or non-functioning (i.e. failure). Consider an experiment involving a total of $n$ independent trials, with each trial having two possible outcomes, say success or failure. If the probability of a successful outcome in each trial is $p$, the probability of having $k$ successes in $n$ trials can be computed as:

$$p_x(k) = C_{n,k}\, p^k q^{n-k} \tag{8.40}$$

where $C_{n,k}$ is the binomial coefficient; and $q = (1 - p)$ is the probability of failure in each trial. Computationally, it is more convenient to use the following recursive formula to evaluate the binomial PMF (Drane et al. 1993):

$$p_x(k \mid n, p) = \left(\frac{n + 1 - k}{k}\right)\left(\frac{p}{q}\right) p_x(k - 1 \mid n, p) \tag{8.41}$$

for $k = 1, 2, \ldots, n$, with the initial probability $p_x(k = 0 \mid n, p) = q^n$. A simple recursive scheme for computing the binomial cumulative probability was given by Tietjen (1994).

Fig. 8.6 Probability mass functions (PMFs) of (a) Binomial random variables for different values of p; and (b) Poisson random variables for different parameter values. The parameter k represents the number of realizations of the event. The panels show binomial PMFs for n = 10 with p = 0.1 and p = 0.5, and Poisson PMFs for ν = 1, 5 and 9. Based on Tung et al. (2006) with permission from McGraw-Hill.

A random variable $X$ having a binomial distribution with parameters $n$ and $p$ has the expectation $E(X) = np$ and variance $\mathrm{Var}(X) = npq$. The shape of the PMF of a binomial random variable depends on the values of $p$ and $q$. The skewness coefficient of a binomial random variable is $(q - p)/\sqrt{npq}$. Hence the PMF is positively skewed if $p < q$, symmetric if $p = q = 0.5$, and negatively skewed if $p > q$. Plots of binomial PMFs for different values of $p$ with a fixed value of $n$ are shown in Fig. 8.6a.
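The recursion of Eqn 8.41 can be sketched in Python as follows. The sketch starts from $p_x(0) = q^n$, builds the full PMF, compares it with a direct evaluation of Eqn 8.40, and recovers the mean, variance and skewness from the PMF. The function names are illustrative assumptions, and the recursion as written requires $0 < p < 1$.

```python
import math


def binomial_pmf_recursive(n, p):
    """Return [p_x(0), ..., p_x(n)] using the recursion of Eqn 8.41."""
    q = 1.0 - p
    pmf = [q ** n]                      # starting value p_x(0) = q^n
    for k in range(1, n + 1):
        pmf.append(pmf[-1] * (n + 1 - k) / k * (p / q))
    return pmf


def binomial_pmf_direct(n, p, k):
    """Direct evaluation of Eqn 8.40 for comparison."""
    return math.comb(n, k) * p ** k * (1.0 - p) ** (n - k)


n, p = 10, 0.1                          # hypothetical parameter choice
pmf = binomial_pmf_recursive(n, p)
mean = sum(k * pk for k, pk in enumerate(pmf))
variance = sum((k - mean) ** 2 * pk for k, pk in enumerate(pmf))
skewness = sum((k - mean) ** 3 * pk for k, pk in enumerate(pmf)) / variance ** 1.5
```

The recursion avoids evaluating large factorials for every term, which is the practical reason for preferring it over Eqn 8.40 when the whole PMF is needed.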

8.2.2.2 Poisson distribution

The PMF of a Poisson distribution is:

$$p_x(k \mid \nu) = \frac{e^{-\nu}\,\nu^k}{k!} \quad \text{for } k = 0, 1, 2, \ldots \tag{8.42}$$

where the parameter, $\nu > 0$, represents the mean of a Poisson random variable. Unlike binomial random variables, Poisson random variables have no upper bound. A recursive formula (Drane et al. 1993) for calculating the Poisson PMF is:

$$p_x(k \mid \nu) = \left(\frac{\nu}{k}\right) p_x(k - 1 \mid \nu) \quad \text{for } k = 1, 2, \ldots \tag{8.43}$$

with $p_x(k = 0 \mid \nu) = e^{-\nu}$.

For a Poisson random variable, the mean and the variance are both equal to $\nu$. Plots of Poisson PMFs corresponding to different values of $\nu$ are shown in Fig. 8.6b. The skewness coefficient of a Poisson random variable is $1/\sqrt{\nu}$, indicating that the shape of the distribution tends to become symmetrical as $\nu$ becomes large.

The Poisson distribution has been applied widely in modelling the number of occurrences of a random event within a specified time or space interval. Eqn 8.42 can be modified as:

$$p_x(k \mid \lambda, t) = \frac{e^{-\lambda t}(\lambda t)^k}{k!} \quad \text{for } k = 0, 1, 2, \ldots \tag{8.44}$$

in which the parameter $\lambda$ can be interpreted as the average rate of occurrence of the random event in the time interval $(0, t)$.
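The Poisson recursion of Eqn 8.43 can be sketched in the same way. In the Python fragment below, the rate and duration are hypothetical values, the PMF is truncated at an assumed upper index large enough for the neglected tail to be negligible, and the function names are illustrative.

```python
import math


def poisson_pmf_recursive(nu, k_max):
    """Return [p_x(0), ..., p_x(k_max)] using the recursion of Eqn 8.43."""
    pmf = [math.exp(-nu)]               # starting value p_x(0) = e^{-nu}
    for k in range(1, k_max + 1):
        pmf.append(pmf[-1] * nu / k)
    return pmf


def poisson_pmf_direct(nu, k):
    """Direct evaluation of Eqn 8.42 for comparison."""
    return math.exp(-nu) * nu ** k / math.factorial(k)


rate = 0.5        # hypothetical lambda: average number of events per year
duration = 10.0   # hypothetical t: length of the interval in years
nu = rate * duration                    # mean number of events, as in Eqn 8.44

pmf = poisson_pmf_recursive(nu, 60)     # truncated where the tail is negligible
mean = sum(k * pk for k, pk in enumerate(pmf))
variance = sum((k - mean) ** 2 * pk for k, pk in enumerate(pmf))
skewness = sum((k - mean) ** 3 * pk for k, pk in enumerate(pmf)) / variance ** 1.5
```

With these inputs the computed mean and variance both come out close to $\nu = 5$ and the skewness close to $1/\sqrt{5}$, in line with the properties stated above.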


Example 8.3. A bridge is designed to pass a flood with a return period of 50 years. In other words, the annual probability of the bridge being overtopped by the flood is 1-in-50, or 1/50 = 0.02. What is the probability that the bridge would be overtopped during its expected useful life of 100 years? Also, use the Poisson distribution to compute the same result.

Solution. The random variable $X$ gives the number of times the bridge will be overtopped over a 100-year period. Each year is an independent trial in which the bridge is either overtopped or passes the flood normally without overtopping. Because the outcome of each ‘trial’ is binary, the binomial distribution is applicable.

The probability of overtopping of the bridge in each trial (i.e. each year) is 0.02. The period of 100 years represents 100 trials. Hence, in the binomial distribution model, the parameters are $p = 0.02$ and $n = 100$. The probability that overtopping occurs in a period of 100 years can be calculated, using Eqn 8.40, as:

P(overtopping in a 100-year period) = P(overtopping occurs at least once in a 100-year period) = $P(X \ge 1 \mid n = 100, p = 0.02)$

$$= \sum_{k=1}^{100} p_x(k) = \sum_{k=1}^{100} C_{100,k}\,(0.02)^k (0.98)^{100-k}$$

This requires evaluation of 100 binomial terms, which is quite cumbersome. The problem is much simpler when viewed in terms of the complementary event, the non-occurrence of overtopping:

P(overtopping in a 100-year period) = 1 − P(no overtopping in a 100-year period) = $1 - p_x(0 \mid n = 100, p = 0.02) = 1 - (0.98)^{100} = 1 - 0.1326 = 0.8674$

Using the Poisson distribution, one has to determine the average number of overtopping events in a given 100-year period. For a 50-year event, the average rate of overtopping is $\lambda$ = 0.02/year. The average number of overtopping events in the 100-year period is therefore $\nu = (0.02)(100) = 2$. As with the binomial distribution, the probability of overtopping in a 100-year period is obtained by estimating the probability of no overtopping and subtracting it from unity:

P(overtopping in a 100-year period) = 1 − P(no overtopping occurs in a 100-year period) = $1 - p_x(0 \mid \nu = 2) = 1 - e^{-2} = 1 - 0.1353 = 0.8647$

Comparing this with the result obtained using the binomial distribution, it is seen that use of the Poisson distribution yields a slightly smaller risk of overtopping.
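The two risk estimates of Example 8.3 can be reproduced with a few lines of Python. This is a minimal sketch; the function name and its arguments are assumptions introduced here for illustration.

```python
import math


def overtopping_risk(return_period, service_life):
    """Probability of at least one exceedance during the service life.

    Returns the binomial estimate, 1 - (1 - p)^n, and the Poisson
    estimate, 1 - exp(-n * p), with p = 1 / return_period.
    """
    p = 1.0 / return_period
    risk_binomial = 1.0 - (1.0 - p) ** service_life
    risk_poisson = 1.0 - math.exp(-service_life * p)
    return risk_binomial, risk_poisson


risk_binomial, risk_poisson = overtopping_risk(return_period=50, service_life=100)
```

Both estimates rely on the assumption, made in the example, that the annual events are independent and that the annual exceedance probability does not change over the service life.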

8.2.3 Some continuous univariate probability distributions

Several continuous PDFs are used frequently in hydrological reliability analysis. These include the normal, lognormal, gamma, Weibull, and exponential distributions. Less commonly used distributions include the beta and extremal distributions.

Many probability distributions are not a single distribution but comprise, in fact, a family of distributions. This is due to the distribution having one or more location, scale, and shape parameters. A location parameter shifts the graph to the left or right on the horizontal axis. The effect of a scale parameter greater than unity is to stretch the PDF. A shape parameter allows a distribution to take on a variety of shapes, depending on its value. These distributions are particularly useful in applications, since they are flexible enough to model a variety of datasets.

8.2.3.1 The Gaussian or normal distribution

The Gaussian or normal distribution is the best-known probability distribution. It involves two parameters, namely, the mean and the variance. A normal random variable having the mean, $\mu_x$, and variance, $\sigma_x^2$, is denoted here as $X \sim N(\mu_x, \sigma_x)$, with the PDF:

$$f_x(x \mid \mu_x, \sigma_x^2) = \frac{1}{\sqrt{2\pi}\,\sigma_x} \exp\left[-\frac{1}{2}\left(\frac{x - \mu_x}{\sigma_x}\right)^2\right] \quad \text{for } -\infty < x < \infty \tag{8.45}$$

The normal distribution is bell-shaped and symmetric with respect to the mean, $\mu_x$ (Fig. 8.7).

Fig. 8.7 Histogram (shaded) of the normalized distribution of a variable x classified with a class width Δx, that is, between x_i and x_i + Δx. The variable is normalized by dividing the deviations of the actual observations from the mean value (x̄) by the standard deviation (s) of the dataset. The normalized variable, therefore, has a mean equal to zero and a standard deviation equal to unity. The normalized frequency, that is, the number of observations within a specific class interval divided by the total number of observations (n_i/N), is plotted on the y-axis against (x_i − x̄)/s on the x-axis. The smooth curve gives the Gaussian distribution, which is a hypothetical result corresponding to an infinite number of measurements. It also represents the probability distribution (P) of data around the mean value. On top of the graph, the integral or summed probabilities are shown, giving the probability of observing values between (x̄ − s) and (x̄ + s) as 68%, between (x̄ − 2s) and (x̄ + 2s) as 95%, and between (x̄ − 3s) and (x̄ + 3s) as 99.7%. Redrawn from Mook (2000) © UNESCO/Springer.

Therefore, the skewness coefficient of a normal random variable is zero. Owing to the symmetry of the PDF, all odd-ordered central moments are zero. The kurtosis of a normal random variable is $\kappa_x = 3.0$. A linear function of several normal random variables is also normal, that is, the linear combination of $K$ normal random variables $W = a_1X_1 + a_2X_2 + \ldots + a_KX_K$, with $X_k \sim N(\mu_k, \sigma_k)$ for $k = 1, 2, \ldots, K$, is also a normal random variable with the mean, $\mu_w$, and the variance, $\sigma_w^2$, respectively, given by:

$$\mu_w = \sum_{k=1}^{K} a_k \mu_k, \qquad \sigma_w^2 = \sum_{k=1}^{K} a_k^2 \sigma_k^2 + 2\sum_{k=1}^{K-1}\sum_{k'=k+1}^{K} a_k a_{k'}\,\mathrm{Cov}(X_k, X_{k'}) \tag{8.46}$$
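Eqn 8.46 translates directly into a double loop over the upper triangle of the covariance matrix. The Python sketch below is illustrative only; the coefficients, means and covariance matrix are hypothetical numbers, and the function name is an assumption.

```python
def linear_combination_moments(a, mu, cov):
    """Mean and variance of W = sum(a_k * X_k) following Eqn 8.46.

    cov is the full covariance matrix, with the variances on the diagonal.
    """
    K = len(a)
    mean_w = sum(a[k] * mu[k] for k in range(K))
    var_w = sum(a[k] ** 2 * cov[k][k] for k in range(K))
    for k in range(K - 1):
        for j in range(k + 1, K):
            var_w += 2.0 * a[k] * a[j] * cov[k][j]
    return mean_w, var_w


# Hypothetical example: W = X1 - 2 * X2.
a = [1.0, -2.0]
mu = [3.0, 1.0]
cov = [[4.0, 1.0],
       [1.0, 9.0]]
mean_w, var_w = linear_combination_moments(a, mu, cov)
```

For these inputs the mean is 1 and the variance is 36; the positive covariance reduces the variance here because the two coefficients have opposite signs.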

The normal distribution often provides a practical alternative for approximating the probability of a non-normal random variable. The accuracy of such an approximation, of course, depends on how close the given distribution is to the normal distribution.

An important theorem relating to the sum of independent random variables is the central limit theorem, which roughly states that the distribution of the sum of a number of independent random variables, regardless of their individual distributions, can be approximated by a normal distribution as long as none of the variables has a dominant effect on the sum. The larger the number of random variables involved in the summation, the better the approximation. Because many natural processes can be thought of as a combination of a large number of independent component processes, none dominating the others, the normal distribution is a reasonable approximation for modelling these processes. Dowson and Wragg (1973) showed that when only the mean and the variance are specified, the maximum-entropy distribution on the interval $(-\infty, +\infty)$ is the normal distribution. Thus, when only the first two moments are specified, use of a normal distribution implies a closer approximation to the true nature of the underlying process than any other distribution.

For a normal distribution, the location and scale parameters correspond to the mean and the standard deviation. Therefore, probability computations for normal random variables are made by first transforming the original variable to a standardized normal variable, $Z$, which removes the effect of these parameters:

$$Z = \frac{X - \mu_x}{\sigma_x} \tag{8.47}$$

in which $Z$ has a zero mean and a variance of unity. Since $Z$ is a linear function of the normal random variable $X$, it is also normally distributed, that is, $Z \sim N(\mu_z = 0, \sigma_z = 1)$. The PDF of $Z$, called the standard normal distribution, can be obtained readily from Eqn 8.45:

$$\phi(z) = \frac{1}{\sqrt{2\pi}} \exp\left(-\frac{z^2}{2}\right) \quad \text{for } -\infty < z < \infty \tag{8.48}$$

General expressions for the product-moments of the standard normal random variable, using Eqn 8.25, are:

$$E(Z^{2r}) = \frac{(2r)!}{2^r\, r!} \quad \text{and} \quad E(Z^{2r+1}) = 0 \tag{8.49}$$

Computations of probability for $X \sim N(\mu_x, \sigma_x)$ can be made as:

$$P(X \le x) = P(Z \le z) = \Phi(z) \tag{8.50}$$

where $z = (x - \mu_x)/\sigma_x$ and $\Phi(z)$ is the standard normal CDF defined as:

$$\Phi(z) = \int_{-\infty}^{z} \phi(u)\,du \tag{8.51}$$

The shape of the PDF of the standard normal random variable is shown in Fig. 8.7. The integral in Eqn 8.51 cannot be evaluated in closed form. A table of the standard normal CDF can be found in many textbooks on statistics (Abramowitz and Stegun 1972; Blank 1980; Devore 1999; Haan 1974). For numerical computation, several highly accurate approximations are available for determining $\Phi(z)$. One such approximation is the polynomial approximation (Abramowitz and Stegun 1972):

$$\Phi(z) = 1 - \phi(z)\left(b_1 t + b_2 t^2 + b_3 t^3 + b_4 t^4 + b_5 t^5\right) \quad \text{for } z \ge 0 \tag{8.52}$$

in which $t = 1/(1 + 0.2316419z)$; $b_1 = 0.31938153$; $b_2 = -0.356563782$; $b_3 = 1.781477937$; $b_4 = -1.821255978$; and $b_5 = 1.33027443$. The maximum absolute error of the approximation is $7.5 \times 10^{-8}$, which is sufficiently accurate for most practical applications. Note that Eqn 8.52 is applicable only to non-negative values of $z$. For $z < 0$, the value of the standard normal CDF can be computed as $\Phi(z) = 1 - \Phi(|z|)$, using the symmetry of the function $\phi(z)$.

The inverse operation of finding the standard normal percentile, $z_p$, for a specified probability level, $p$, can be done easily using the table of the standard normal CDF and carrying out some interpolation. However, for practical algebraic computations using a computer, the following rational approximation can be used (Abramowitz and Stegun, 1972):

$$z_p = t - \frac{c_0 + c_1 t + c_2 t^2}{1 + d_1 t + d_2 t^2 + d_3 t^3} \quad \text{for } 0.5 \le p < 1 \tag{8.53}$$

in which $p = \Phi(z_p)$; $t = \sqrt{-2\ln(1 - p)}$; $c_0 = 2.515517$, $c_1 = 0.802853$, $c_2 = 0.010328$, $d_1 = 1.432788$, $d_2 = 0.189269$; and $d_3 = 0.001308$. The corresponding maximum absolute error introduced by this rational approximation is $4.5 \times 10^{-4}$. Eqn 8.53 is valid for values of $\Phi(z)$ in the interval $[0.5, 1)$. When $p < 0.5$, Eqn 8.53 can still be used by letting $t = \sqrt{-2\ln(p)}$ and attaching a minus sign to the computed percentile value.
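The two approximations, Eqn 8.52 and Eqn 8.53, can be coded compactly. The Python sketch below uses the constants quoted in the text and compares the results with a reference CDF built on the error function from the standard library; the function names are illustrative assumptions.

```python
import math

B = (0.31938153, -0.356563782, 1.781477937, -1.821255978, 1.33027443)
C = (2.515517, 0.802853, 0.010328)
D = (1.432788, 0.189269, 0.001308)


def phi(z):
    """Standard normal PDF (Eqn 8.48)."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)


def norm_cdf(z):
    """Polynomial approximation of the standard normal CDF (Eqn 8.52)."""
    if z < 0.0:
        return 1.0 - norm_cdf(-z)       # symmetry for negative arguments
    t = 1.0 / (1.0 + 0.2316419 * z)
    poly = sum(b * t ** (i + 1) for i, b in enumerate(B))
    return 1.0 - phi(z) * poly


def norm_ppf(p):
    """Rational approximation of the standard normal percentile (Eqn 8.53)."""
    if not 0.0 < p < 1.0:
        raise ValueError("p must lie strictly between 0 and 1")
    if p < 0.5:
        return -norm_ppf(1.0 - p)       # lower half by symmetry
    t = math.sqrt(-2.0 * math.log(1.0 - p))
    num = C[0] + C[1] * t + C[2] * t * t
    den = 1.0 + D[0] * t + D[1] * t * t + D[2] * t ** 3
    return t - num / den


def norm_cdf_reference(z):
    """Reference CDF from the error function, used only for checking."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))
```

As a usage example, `norm_cdf(1.0)` gives approximately 0.8413 and `norm_ppf(0.99)` gives approximately 2.33, the values used in the worked examples of this section.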

Example 8.4. Referring to Example 8.3, determine the probability of more than five overtopping events over a 100-year period using a normal approximation.

Solution. For this problem, the random variable $X$ of interest is the number of overtopping events in a given 100-year period. The exact distribution of $X$ is binomial with parameters $n = 100$ and $p = 0.02$; alternatively, the Poisson distribution with a parameter $\nu = 2$ can also be used. The exact probability of having more than five occurrences of overtopping in 100 years can be computed as:

$$\begin{aligned}
P(X > 5) = P(X \ge 6) &= \sum_{x=6}^{100} \binom{100}{x} (0.02)^x (0.98)^{100-x} \\
&= 1 - P(X \le 5) \\
&= 1 - \sum_{x=0}^{5} \binom{100}{x} (0.02)^x (0.98)^{100-x} \\
&= 1 - 0.9845 = 0.0155
\end{aligned}$$

There are a total of six terms under the last summation sign on the right-hand side. Although this probability can be computed manually, a normal approximation is often used as a shortcut. Using a normal probability approximation, the mean and variance of $X$ are:

$$\mu_x = np = (100)(0.02) = 2.0$$

$$\sigma_x^2 = npq = (100)(0.02)(0.98) = 1.96$$

The above binomial probability can be approximated as:

$$\begin{aligned}
P(X \ge 6) \approx P(X \ge 5.5) &= 1 - P(X < 5.5) \\
&= 1 - P\left[Z < (5.5 - 2.0)/\sqrt{1.96}\right] \\
&= 1 - \Phi(2.5) = 1 - 0.9938 = 0.0062
\end{aligned}$$

DeGroot (1975) showed that when $n p^{1.5} > 1.07$, the error of using the normal distribution to approximate the binomial probability does not exceed 0.05. The error is progressively reduced as the value of $n p^{1.5}$ becomes larger. For this example, $n p^{1.5} = 0.283 < 1.07$, and the accuracy of the approximation is not satisfactory, as seen above from the difference between the values computed using the binomial and normal distributions (0.0155 against 0.0062).
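The comparison made in Example 8.4 can be reproduced with the Python sketch below, which evaluates the exact binomial tail, the normal approximation with the half-unit continuity correction, and the quantity $n p^{1.5}$. Variable names are illustrative assumptions; the normal CDF is evaluated through the error function of the standard library.

```python
import math


def norm_cdf(z):
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


n, p = 100, 0.02
q = 1.0 - p

# Exact binomial probability of six or more overtopping events.
p_exact = 1.0 - sum(math.comb(n, k) * p ** k * q ** (n - k) for k in range(6))

# Normal approximation with continuity correction at 5.5.
mu = n * p
sigma = math.sqrt(n * p * q)
p_normal = 1.0 - norm_cdf((5.5 - mu) / sigma)

criterion = n * p ** 1.5    # compared with the threshold 1.07
```

Here the normal approximation underestimates the exact tail probability by more than a factor of two, which is consistent with `criterion` falling well below 1.07.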

Example 8.5. The magnitude of the annual maximum flood in a river has a normal distribution with a mean of 600 m$^3$ s$^{-1}$ and a standard deviation of 400 m$^3$ s$^{-1}$. (a) What is the annual probability that the flood magnitude would exceed 1000 m$^3$ s$^{-1}$? (b) Determine the magnitude of the flood with a return period of 100 years.

Solution. (a) Let $X$ be the random annual maximum flood magnitude. Since $X$ has a normal distribution with a mean $\mu_x$ = 600 m$^3$ s$^{-1}$ and standard deviation $\sigma_x$ = 400 m$^3$ s$^{-1}$, the probability of the annual maximum flood magnitude exceeding 1000 m$^3$ s$^{-1}$ is obtained by using the Table of Standard Normal Probability, $\Phi(z) = P(Z \le z)$, or Eqn 8.52:

$$\begin{aligned}
P(X > 1000) &= 1 - P[Z \le (1000 - 600)/400] \\
&= 1 - \Phi(1.00) = 1 - 0.8413 = 0.1587
\end{aligned}$$

(b) A flood event with a 100-year return period is the event whose magnitude has an annual probability of 0.01 of being exceeded. Thus $P(X \ge q_{100}) = 0.01$, in which $q_{100}$ is the magnitude of the 100-year flood. Therefore, the problem is to determine $q_{100}$ from:

$$P(X \le q_{100}) = 1 - P(X \ge q_{100}) = 0.99$$

where

$$\begin{aligned}
P(X \le q_{100}) &= P[Z \le (q_{100} - \mu_x)/\sigma_x] \\
&= P[Z \le (q_{100} - 600)/400] \\
&= \Phi[(q_{100} - 600)/400] = 0.99
\end{aligned}$$

From the Table of Standard Normal Probability, $\Phi(z) = P(Z \le z)$, or from Eqn 8.53, $\Phi(2.33) = 0.99$. Therefore:

$$(q_{100} - 600)/400 = 2.33$$

which gives the magnitude of the 100-year flood event as $q_{100}$ = 1532 m$^3$ s$^{-1}$.
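Example 8.5 can be checked with the `NormalDist` class from the Python standard library (`statistics` module, Python 3.8 or later). This is a minimal sketch; the variable names are assumptions made for illustration.

```python
from statistics import NormalDist

# Annual maximum flood, assumed normal with mean 600 and std 400 (m^3/s).
flood = NormalDist(mu=600.0, sigma=400.0)

# (a) Annual probability of exceeding 1000 m^3/s.
p_exceed_1000 = 1.0 - flood.cdf(1000.0)

# (b) Flood magnitude with a 100-year return period, i.e. the value
#     exceeded with annual probability 1/100.
q100 = flood.inv_cdf(1.0 - 1.0 / 100.0)
```

The computed quantile is about 1530.5 m$^3$ s$^{-1}$ rather than 1532 m$^3$ s$^{-1}$, because the worked example rounds the standard normal percentile to 2.33.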

8.2.3.2 Lognormal distribution

The lognormal distribution is a commonly used continuous distribution for positively valued random variables that are closely related to normal random variables. A random variable $X$ is said to have a lognormal distribution if its logarithmic transform $Y = \ln(X)$ has a normal distribution with mean, $\mu_{\ln x}$, and variance, $\sigma^2_{\ln x}$. From the central limit theorem, if a natural process can be thought of as a product of a large number of independent component processes, none dominating the others, the lognormal distribution is a reasonable approximation for these natural processes. With reference to Eqn 8.45, the PDF of a lognormal random variable can be written as:

$$f_{LN}(x \mid \mu_{\ln x}, \sigma^2_{\ln x}) = \frac{1}{\sqrt{2\pi}\,\sigma_{\ln x}\,x} \exp\left\{-\frac{1}{2}\left[\frac{\ln(x) - \mu_{\ln x}}{\sigma_{\ln x}}\right]^2\right\} \quad \text{for } x > 0 \tag{8.54}$$

Fig. 8.8 Shapes of lognormal probability density functions for different values of mean (µx) and normalized standard deviation (Ωx = σx/µx): (a) µx = 1.0, with curves for Ωx = 0.3, 0.6 and 1.3; (b) Ωx = 1.30, with curves for µx = 1.65, 2.25 and 4.5. Based on Tung et al. (2006) with permission from McGraw-Hill.

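Statistical properties of a lognormal random variable in the original scale can be computed from those of the log-transformed variable as:

$$\mu_x = \exp\left(\mu_{\ln x} + \frac{\sigma^2_{\ln x}}{2}\right) \tag{8.55}$$

$$\sigma_x^2 = \mu_x^2\left[\exp\left(\sigma^2_{\ln x}\right) - 1\right] \tag{8.56}$$

or

$$\Omega_x^2 = \sigma_x^2/\mu_x^2 = \exp\left(\sigma^2_{\ln x}\right) - 1 \tag{8.57}$$

and skewness coefficient:

$$\gamma_x = \Omega_x^3 + 3\,\Omega_x \tag{8.58}$$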

Conversely, the statistical moments of $\ln(X)$ can be computed from those of $X$ by using Eqn 8.55 and Eqn 8.56:

$$\mu_{\ln x} = \ln(\mu_x) - \frac{1}{2}\sigma^2_{\ln x} = \frac{1}{2}\ln\left[\frac{\mu_x^2}{1 + \Omega_x^2}\right] \tag{8.59}$$

$$\sigma^2_{\ln x} = \ln\left(1 + \Omega_x^2\right) \tag{8.60}$$

One can see from Eqn 8.58 that the shape of a lognormal PDF is always positively skewed (Fig. 8.8), because $\Omega_x > 0$. Eqn 8.55 and Eqn 8.56 can be derived easily from the moment-generating function (Tung and Yen 2005).

It is interesting to note from Eqn 8.57 that the variance of a log-transformed variable ($\sigma^2_{\ln x}$) is dimensionless. Since the sum of normal random variables is normally distributed, the product of lognormal random variables is also lognormally distributed (Fig. 8.5). This useful property of lognormal random variables can be stated as follows: if $X_1, X_2, \ldots, X_K$ are independent lognormal random variables, then $W = b_0 \prod_{k=1}^{K} X_k^{b_k}$ has a lognormal distribution, with the mean and variance of $\ln(W)$ given by:

$$\mu_{\ln w} = \ln(b_0) + \sum_{k=1}^{K} b_k\,\mu_{\ln x_k} \quad \text{and} \quad \sigma^2_{\ln w} = \sum_{k=1}^{K} b_k^2\,\sigma^2_{\ln x_k} \tag{8.61}$$


Example 8.6. Redo Example 8.5 for the case when the annual maximum flood magnitude in the river follows a lognormal distribution.

Solution. (a) Since $X$ has a lognormal distribution, $\ln(X)$ is normally distributed with a mean and variance that can be computed from Eqn 8.57, Eqn 8.60, and Eqn 8.59, respectively, as:

$$\Omega_x^2 = \sigma_x^2/\mu_x^2, \quad \text{or} \quad \Omega_x = 400/600 = 0.667$$

$$\sigma^2_{\ln x} = \ln(1 + 0.667^2) = 0.368$$

$$\mu_{\ln x} = \ln(600) - 0.368/2 = 6.213$$

The probability of the magnitude of the annual maximum flood exceeding 1000 m$^3$ s$^{-1}$ is, therefore, given by:

$$\begin{aligned}
P(X > 1000) &= P[\ln X > \ln(1000)] \\
&= 1 - P\left[Z \le (6.908 - 6.213)/\sqrt{0.368}\right] \\
&= 1 - \Phi(1.145) = 1 - 0.8739 = 0.1261
\end{aligned}$$

(b) A 100-year flood $q_{100}$ represents an event, the magnitude of which corresponds to $P(X \ge q_{100}) = 0.01$, which can be estimated from:

$$P(X \le q_{100}) = 1 - P(X \ge q_{100}) = 0.99$$

$$\begin{aligned}
P(X \le q_{100}) &= P(\ln X \le \ln q_{100}) \\
&= P[Z \le (\ln q_{100} - \mu_{\ln x})/\sigma_{\ln x}] \\
&= P\left[Z \le (\ln q_{100} - 6.213)/\sqrt{0.368}\right] \\
&= \Phi\left[(\ln q_{100} - 6.213)/\sqrt{0.368}\right] = 0.99
\end{aligned}$$

From the Table of Standard Normal Probability, $\Phi(z) = P(Z \le z)$, or from Eqn 8.53, one finds that $\Phi(2.33) = 0.99$. Therefore, $(\ln q_{100} - 6.213)/\sqrt{0.368} = 2.33$, which yields $\ln(q_{100}) = 7.626$. The magnitude of the 100-year flood event is, therefore, $q_{100} = \exp(7.626)$ = 2051 m$^3$ s$^{-1}$.
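The steps of Example 8.6 can be sketched in Python as follows: convert the mean and standard deviation in the original scale to the parameters of $\ln(X)$ with Eqns 8.59 and 8.60, then work with a normal distribution in log space. The function and variable names are illustrative assumptions; `NormalDist` is part of the standard `statistics` module (Python 3.8 or later).

```python
import math
from statistics import NormalDist


def lognormal_log_params(mean, std):
    """Mean and variance of ln(X) from the mean and std of X (Eqns 8.59, 8.60)."""
    omega_sq = (std / mean) ** 2            # squared coefficient of variation
    var_ln = math.log(1.0 + omega_sq)
    mu_ln = math.log(mean) - 0.5 * var_ln
    return mu_ln, var_ln


mu_ln, var_ln = lognormal_log_params(600.0, 400.0)
log_flood = NormalDist(mu=mu_ln, sigma=math.sqrt(var_ln))

# (a) Annual probability of exceeding 1000 m^3/s.
p_exceed_1000 = 1.0 - log_flood.cdf(math.log(1000.0))

# (b) 100-year flood: back-transform the 0.99 quantile of ln(X).
q100 = math.exp(log_flood.inv_cdf(0.99))

# Consistency check with Eqn 8.55: the original mean is recovered.
mean_back = math.exp(mu_ln + 0.5 * var_ln)
```

Without intermediate rounding the sketch gives a 100-year flood of roughly 2046 m$^3$ s$^{-1}$, slightly below the 2051 m$^3$ s$^{-1}$ obtained above with the rounded percentile 2.33. Either way, the lognormal assumption yields a markedly larger 100-year flood than the normal assumption of Example 8.5, reflecting its longer right tail.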

8.2.3.3 Gamma distribution and its variation

The gamma distribution is a versatile continuous distribution associated with a positive-valued random variable. The two-parameter gamma distribution has a PDF defined as:

$$f_G(x \mid \alpha, \beta) = \frac{1}{\beta\,\Gamma(\alpha)}\left(\frac{x}{\beta}\right)^{\alpha-1} \exp\left(-\frac{x}{\beta}\right) \quad \text{for } x > 0 \tag{8.62}$$

in which $\beta > 0$ and $\alpha > 0$ are the scale and shape parameters, respectively, and $\Gamma(\cdot)$ denotes the gamma function defined as:

$$\Gamma(\alpha) = \int_0^\infty t^{\alpha-1} e^{-t}\,dt \tag{8.63}$$

The mean, variance, and skewness coefficient of a gamma random variable having the PDF given by Eqn 8.62 are:

$$\mu_x = \alpha\beta, \qquad \sigma_x^2 = \alpha\beta^2, \qquad \gamma_x = 2/\sqrt{\alpha} \tag{8.64}$$

When the lower bound of a gamma random variable is a positive quantity, the above two-parameter gamma PDF can be modified to a three-parameter gamma PDF with the additional location parameter ($\xi$) as:

$$f_G(x \mid \xi, \alpha, \beta) = \frac{1}{\beta\,\Gamma(\alpha)}\left[\frac{x - \xi}{\beta}\right]^{\alpha-1} \exp\left[-(x - \xi)/\beta\right] \quad \text{for } x > \xi \tag{8.65}$$

where $\xi$ is the lower bound. The two-parameter gamma distribution can be reduced to a simpler form by letting $Y = X/\beta$, and the resulting one-parameter gamma PDF (called the standard gamma distribution) is:

$$f_G(y \mid \alpha) = \frac{1}{\Gamma(\alpha)}\,y^{\alpha-1} \exp(-y) \quad \text{for } y > 0 \tag{8.66}$$

Tables of the cumulative probability of the standard gamma distribution can be found in Dudewicz (1976). The versatility of the gamma distribution can be seen from the shapes illustrated in Fig. 8.9a. If $\alpha$ is a positive integer in Eqn 8.66, the distribution is called an Erlang distribution.

When $\alpha = 1$, the two-parameter gamma distribution reduces to an exponential distribution with the PDF given by:

$$f_{EXP}(x \mid \beta) = e^{-x/\beta}/\beta \quad \text{for } x > 0 \tag{8.67}$$

An exponential random variable with the PDF given by Eqn 8.67 has its mean and standard deviation both equal to $\beta$ (Example 8.2). Therefore, the coefficient of variation of an exponential random variable is equal to unity. The exponential distribution is commonly used for describing the life span of various electronic and mechanical components.
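The moment relations of Eqn 8.64 and the reduction to the exponential distribution for $\alpha = 1$ can be checked numerically. The Python sketch below integrates the gamma PDF with a simple midpoint rule; the shape and scale values, the integration limits, and the function names are all hypothetical choices for illustration, and the PDF is written with the negative exponent, $\exp(-x/\beta)$.

```python
import math


def gamma_pdf(x, alpha, beta):
    """Two-parameter gamma PDF (Eqn 8.62), with the negative exponent."""
    if x <= 0.0:
        return 0.0
    return (x / beta) ** (alpha - 1.0) * math.exp(-x / beta) / (beta * math.gamma(alpha))


def raw_moments(alpha, beta, upper=150.0, steps=50000):
    """First three moments about the origin by the midpoint rule on (0, upper)."""
    h = upper / steps
    m1 = m2 = m3 = 0.0
    for i in range(steps):
        x = (i + 0.5) * h
        w = gamma_pdf(x, alpha, beta) * h
        m1 += x * w
        m2 += x * x * w
        m3 += x ** 3 * w
    return m1, m2, m3


alpha, beta = 4.0, 2.0                      # hypothetical shape and scale
m1, m2, m3 = raw_moments(alpha, beta)
variance = m2 - m1 ** 2
skewness = (m3 - 3.0 * m1 * m2 + 2.0 * m1 ** 3) / variance ** 1.5

# With alpha = 1 the gamma PDF coincides with the exponential PDF (Eqn 8.67).
exp_check = gamma_pdf(3.0, 1.0, 1250.0) - math.exp(-3.0 / 1250.0) / 1250.0
```

For the assumed values $\alpha = 4$ and $\beta = 2$, the numerical results are close to $\mu_x = 8$, $\sigma_x^2 = 16$ and $\gamma_x = 1$, as Eqn 8.64 predicts.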

Fig. 8.9 (a) Various shapes of two-parameter gamma probability density functions. The standard gamma distribution, also called one-parameter gamma distribution, is obtained by letting Y = X/β. When α = 1, the two-parameter gamma distribution reduces to an exponential distribution. (b) Probability density function of extreme-value Type-1 random variables, with curves for the maximum and the minimum. The random variable is transformed using y = (x − ξ)/β. Based on Tung et al. (2006) with permission from McGraw-Hill.

Two variations of the gamma distribution used frequently in hydrologic frequency analysis are the Pearson and log-Pearson Type-3 distributions. In particular, the log-Pearson Type-3 distribution is recommended by the US Water Resources Council (USWRC 1982) as the standard distribution for flood frequency analysis. A Pearson Type-3 random variable has a PDF of the form:

$$f_{P3}(x \mid \xi, \alpha, \beta) = \frac{1}{|\beta|\,\Gamma(\alpha)}\left(\frac{x - \xi}{\beta}\right)^{\alpha-1} \exp\left[-(x - \xi)/\beta\right] \tag{8.68}$$

with $\alpha > 0$ and $x \ge \xi$ when $\beta > 0$, and with $\alpha > 0$ and $x \le \xi$ when $\beta < 0$. When $\beta > 0$, the Pearson Type-3 distribution is identical to the three-parameter gamma distribution. However, the Pearson Type-3 distribution has the flexibility to model negatively skewed random variables, corresponding to $\beta < 0$.

Similar to the relationship between the normal and lognormal distributions, the PDF of a log-Pearson Type-3 random variable is given by:

$$f_{LP3}(x \mid \xi, \alpha, \beta) = \frac{1}{x\,|\beta|\,\Gamma(\alpha)}\left[\frac{\ln(x) - \xi}{\beta}\right]^{\alpha-1} \exp\left\{-[\ln(x) - \xi]/\beta\right\} \tag{8.69}$$

with $\alpha > 0$ and $x \ge e^{\xi}$ when $\beta > 0$, and with $\alpha > 0$ and $x \le e^{\xi}$ when $\beta < 0$. Kite (1988), Stedinger et al. (1993) and Rao and Hamed (2000) provide good accounts of the Pearson Type-3 and log-Pearson Type-3 distributions. Evaluation of the probability of gamma random variables involves computation of the gamma function, which can be made by using the following recursive formula:

$$\Gamma(\alpha) = (\alpha - 1)\,\Gamma(\alpha - 1) \tag{8.70}$$

When the argument $\alpha$ is an integer, then $\Gamma(\alpha) = (\alpha - 1)! = (\alpha - 1)(\alpha - 2) \ldots 1$. However, when $\alpha$ is a non-integer real number, the recursive relation leads to $\Gamma(\alpha')$ as the smallest term, with $1 < \alpha' < 2$. The value of $\Gamma(\alpha')$ can be determined from the table of the gamma function or by numerical integration of Eqn 8.63. Alternatively, the following approximation could be applied (Abramowitz and Stegun, 1972):

$$\Gamma(\alpha') = \Gamma(x + 1) = 1 + \sum_{i=1}^{5} a_i x^i \quad \text{for } 0 < x < 1 \tag{8.71}$$

in which $a_1 = -0.5748646$; $a_2 = 0.9512363$; $a_3 = -0.6998588$; $a_4 = 0.4245549$; and $a_5 = -0.1010678$. The maximum absolute error associated with Eqn 8.71 is $5 \times 10^{-5}$. As a check, the coefficients sum to zero, so that the polynomial returns $\Gamma(2) = 1$ at $x = 1$.
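The combination of the recursion in Eqn 8.70 with a polynomial for $\Gamma(x + 1)$ on $0 \le x \le 1$ can be sketched in Python as below. The coefficients used are the five-term values attributed to Abramowitz and Stegun (1972) as best recalled here; treat them as an assumption and verify them against the handbook before relying on them. The function names are illustrative, and the result is compared with `math.gamma` from the standard library.

```python
import math

# Assumed five-term polynomial coefficients for Gamma(x + 1), 0 <= x <= 1.
A = (-0.5748646, 0.9512363, -0.6998588, 0.4245549, -0.1010678)


def gamma_poly(x):
    """Polynomial approximation of Gamma(x + 1) for 0 <= x <= 1 (Eqn 8.71)."""
    return 1.0 + sum(a * x ** (i + 1) for i, a in enumerate(A))


def gamma_recursive(alpha):
    """Gamma(alpha) for alpha > 0 using Eqn 8.70 to reach the range [1, 2]."""
    if alpha <= 0.0:
        raise ValueError("alpha must be positive")
    factor = 1.0
    while alpha > 2.0:                  # Gamma(a) = (a - 1) * Gamma(a - 1)
        alpha -= 1.0
        factor *= alpha
    while alpha < 1.0:                  # Gamma(a) = Gamma(a + 1) / a
        factor /= alpha
        alpha += 1.0
    return factor * gamma_poly(alpha - 1.0)


gamma_3_5 = gamma_recursive(3.5)        # 2.5 * 1.5 * Gamma(1.5)
gamma_5 = gamma_recursive(5.0)          # integer argument: 4! = 24
```

Because the recursion only multiplies or divides the polynomial value by exact factors, the relative error of the final result stays at the level of the polynomial approximation itself.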

8.2.3.4 Extreme-value distributions

In many hydrosystems, the focus is on extremeevents such as floods and droughts. Statistics of ex-tremes is concerned with the statistical characteris-tics of Xmax,n = max(X1, X2, . . . , Xn) and/or Xmin,n

= min(X1, X2, . . . , Xn), in which X1, X2, . . . , Xn

are observed values of random processes. The ex-act distributions of extremes are, in fact, functionsof the parent distribution that generate the randomobservations X1, X2, . . . , Xn and the number (n) ofobservations.

Of practical interest are the asymptotic distributions of extremes, which are the limiting forms of Fmax,n(y) or Fmin,n(y) as the number of observations, n, approaches infinity. As a result, asymptotic distributions of extremes turn out to be independent of the sample size, n, and of the parent distribution of the random observations, that is:

$$\lim_{n\to\infty} F_{\max,n}(y) = F_{\max}(y) \quad\text{and}\quad \lim_{n\to\infty} F_{\min,n}(y) = F_{\min}(y)$$

These asymptotic distributions of the extremes depend largely on the behaviour of the tail of the parent distribution in the direction of the extreme concerned. The central portion of the parent distribution has little significance for defining the asymptotic distribution of extremes.

Three types of asymptotic distribution of extremes are derived, based on different characteristics of the underlying distribution (Haan 1977), as under:

1. Type-1. Parent distributions are unbounded in the direction of both extremes and all statistical moments exist. Examples of this type of parent distribution are the normal (for both the largest and the smallest extremes), lognormal, and gamma distributions (for the largest extreme).

2. Type-2. Parent distributions are unbounded in the direction of both extremes but not all moments exist. One such distribution is the Cauchy distribution (Table 8.2).

3. Type-3. Parent distributions are bounded in the direction of the desired extreme. Examples of this type of underlying distribution are the beta distribution (for both the largest and the smallest extremes) and the lognormal and gamma distributions (for the smallest extreme).

Owing to the fact that Xmin,n = −max(−X1, −X2, …, −Xn), the asymptotic distribution functions of Xmax,n and Xmin,n satisfy the following relation (Leadbetter et al. 1983):

$$F_{\min}(y) = 1 - F_{\max}(-y) \tag{8.72}$$

Consequently, the asymptotic distribution of Xmin can be obtained directly from that of Xmax. The three types of asymptotic distributions of the extremes are listed in Table 8.1.

8.2.3.4.1 Extreme-value Type-1 distribution

Table 8.1 Three types of asymptotic cumulative distribution functions (CDFs) of extremes. Source: Tung et al. (2006).

Type | Maxima | Range | Minima | Range
1 | exp(−e^(−y)) | −∞ < y < ∞ | 1 − exp(−e^y) | −∞ < y < ∞
2 | exp(−y^α) | α < 0, y > 0 | 1 − exp[−(−y)^α] | α < 0, y < 0
3 | exp[−(−y)^α] | α > 0, y < 0 | 1 − exp(−y^α) | α > 0, y > 0

Note: y = (x − ξ)/β.

Other names given to this distribution are the Gumbel distribution, the Fisher-Tippett distribution, and the double exponential distribution. The CDF and PDF of the extreme-value Type-1 (EV1) distribution have, respectively, the following forms:

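$$F_{EV1}(x \mid \xi, \beta) = \begin{cases} \exp\left\{-\exp\left[-\left(\dfrac{x - \xi}{\beta}\right)\right]\right\} & \text{for maxima} \\[2ex] 1 - \exp\left\{-\exp\left[+\left(\dfrac{x - \xi}{\beta}\right)\right]\right\} & \text{for minima} \end{cases} \tag{8.73}$$

$$f_{EV1}(x \mid \xi, \beta) = \begin{cases} \dfrac{1}{\beta}\exp\left\{-\left(\dfrac{x - \xi}{\beta}\right) - \exp\left[-\left(\dfrac{x - \xi}{\beta}\right)\right]\right\} & \text{for maxima} \\[2ex] \dfrac{1}{\beta}\exp\left\{+\left(\dfrac{x - \xi}{\beta}\right) - \exp\left[+\left(\dfrac{x - \xi}{\beta}\right)\right]\right\} & \text{for minima} \end{cases} \tag{8.74}$$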

for −∞ < x < ∞, with the location parameter −∞ < ξ < ∞ and the scale parameter β > 0. The shapes of the EV1 distribution are shown in Fig. 8.9b, in which the transformed random variable Y = (X − ξ)/β is used. As can be seen, the PDF associated with the largest extreme is a mirror image of that of the smallest extreme with respect to the vertical line passing through the common mode, which happens to be the parameter ξ. The first three product-moments of an EV1 random variable are:

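$$\begin{aligned} \mu_x &= \xi + 0.5772\beta && \text{for the largest extreme} \\ &= \xi - 0.5772\beta && \text{for the smallest extreme} \\ \sigma_x^2 &= 1.645\beta^2 && \text{for both types} \\ \gamma_x &= 1.13955 && \text{for the largest extreme} \\ &= -1.13955 && \text{for the smallest extreme} \end{aligned} \tag{8.75}$$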

Shen and Bryson (1979) showed that if a random variable has an EV1 distribution, the following relationship is satisfied when ξ is small:

$$x_{T_1} \approx \left[\frac{\ln(T_1)}{\ln(T_2)}\right] x_{T_2} \tag{8.76}$$

where xT is the percentile corresponding to the exceedance probability of 1/T.

Example 8.7. Redo Example 8.5 by assuming that the annual maximum flood follows the EV1 distribution.

Solution. Based on the values of the mean (600 m³ s⁻¹) and standard deviation (400 m³ s⁻¹), the distribution parameters ξ and β can be determined as follows. For the maxima, β is computed from Eqn 8.75 as:

$$\beta = \frac{\sigma_x}{\sqrt{1.645}} = \frac{400}{1.2826} = 311.872\ \mathrm{m^3\,s^{-1}}$$

$$\xi = \mu_x - 0.577\beta = 600 - 0.577(311.872) = 420.05\ \mathrm{m^3\,s^{-1}}$$

1. The probability of the flood discharge exceeding 1000 m³ s⁻¹, according to Eqn 8.73, is:

$$\begin{aligned} P(X > 1000) &= 1 - F_{EV1}(1000) \\ &= 1 - \exp\left[-\exp\left(-\frac{1000 - 420.05}{311.872}\right)\right] \\ &= 1 - \exp[-\exp(-1.860)] \\ &= 1 - 0.8558 = 0.1442 \end{aligned}$$

2. On the other hand, the magnitude of a 100-year flood event can be calculated as:

$$y_{100} = \frac{q_{100} - \xi}{\beta} = -\ln[-\ln(1 - 0.01)] = 4.60$$

Hence:

$$q_{100} = 420.05 + 4.60 \times 311.87 = 1855\ \mathrm{m^3\,s^{-1}}$$
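The steps of Example 8.7 can be reproduced in a few lines. The sketch below assumes method-of-moments fitting through Eqn 8.75 (with π²/6 in place of the rounded 1.645) and inverts Eqn 8.73 for the quantile; the function names are illustrative.

```python
import math

EULER_GAMMA = 0.5772156649  # Euler's constant, the 0.5772 in Eqn 8.75

def ev1_fit_moments(mean, std):
    """EV1 (maxima) parameters from the mean and standard deviation (Eqn 8.75)."""
    beta = std * math.sqrt(6.0) / math.pi   # sigma^2 = (pi^2 / 6) beta^2
    xi = mean - EULER_GAMMA * beta
    return xi, beta

def ev1_cdf_max(x, xi, beta):
    """EV1 CDF for maxima (Eqn 8.73)."""
    return math.exp(-math.exp(-(x - xi) / beta))

def ev1_quantile_max(T, xi, beta):
    """Magnitude with return period T, obtained by inverting Eqn 8.73."""
    return xi - beta * math.log(-math.log(1.0 - 1.0 / T))

xi, beta = ev1_fit_moments(600.0, 400.0)
p_exceed = 1.0 - ev1_cdf_max(1000.0, xi, beta)
q100 = ev1_quantile_max(100.0, xi, beta)
print(f"xi = {xi:.1f} m3/s, beta = {beta:.1f} m3/s")
print(f"P(X > 1000) = {p_exceed:.4f}, q100 = {q100:.0f} m3/s")
```

The results agree with the hand calculation to within rounding of the constants.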

8.2.3.4.2 Extreme-value Type-3 distribution

As seen from Table 8.1, the parent distributions are bounded in the direction of the desired extreme in the extreme-value Type-3 (EV3) distribution. For many hydrologic variables, however, the lower bound is zero and the upper bound is infinity. Therefore, the EV3 distribution for the maxima has limited applications.

Fig. 8.10 (a) Probability density functions of Weibull random variables for different parameters. (b) Shapes of standard beta probability density functions for different parameter values. Based on Tung et al. (2006) with permission from McGraw-Hill.

On the other hand, the EV3 distribution of the minima is used widely for modelling the smallest extreme, such as drought or low-flow conditions. The EV3 distribution for the minima is also known as the Weibull distribution, having a PDF defined as:

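$$f_W(x \mid \xi, \alpha, \beta) = \frac{\alpha}{\beta}\left(\frac{x - \xi}{\beta}\right)^{\alpha - 1}\exp\left[-\left(\frac{x - \xi}{\beta}\right)^{\alpha}\right] \quad \text{for } x \ge \xi \text{ and } \alpha, \beta > 0 \tag{8.77}$$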

Various shapes of the Weibull distribution are generated using the shape parameter α. When ξ = 0 and α = 1, the Weibull distribution reduces to the exponential distribution.

Fig. 8.10a illustrates the versatility of the Weibull distribution for different parameter values. The CDF of Weibull random variables is:

$$F_W(x \mid \xi, \alpha, \beta) = 1 - \exp\left[-\left(\frac{x - \xi}{\beta}\right)^{\alpha}\right] \tag{8.78}$$

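The mean and variance of a Weibull random variable can be derived as:

$$\mu_x = \xi + \beta\,\Gamma\left(1 + \frac{1}{\alpha}\right) \quad\text{and}\quad \sigma_x^2 = \beta^2\left[\Gamma\left(1 + \frac{2}{\alpha}\right) - \Gamma^2\left(1 + \frac{1}{\alpha}\right)\right] \tag{8.79}$$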
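As a quick check of Eqns 8.78 and 8.79, the sketch below evaluates the Weibull CDF and moments; with ξ = 0 and α = 1 the results should coincide with those of an exponential distribution with mean β. The function names and the parameter values are illustrative assumptions.

```python
import math

def weibull_cdf(x, xi, alpha, beta):
    """Weibull CDF (Eqn 8.78); zero below the lower bound xi."""
    if x < xi:
        return 0.0
    return 1.0 - math.exp(-((x - xi) / beta) ** alpha)

def weibull_mean_var(xi, alpha, beta):
    """Mean and variance of a Weibull random variable (Eqn 8.79)."""
    g1 = math.gamma(1.0 + 1.0 / alpha)
    g2 = math.gamma(1.0 + 2.0 / alpha)
    return xi + beta * g1, beta ** 2 * (g2 - g1 ** 2)

# Exponential special case: xi = 0, alpha = 1, so mean = beta and variance = beta^2.
mean, var = weibull_mean_var(0.0, 1.0, 2.0)
print(mean, var, weibull_cdf(2.0, 0.0, 1.0, 2.0))
```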

8.2.3.4.3 Generalized extreme-value distribution

The Generalized Extreme Value (GEV) distribution combines the Type-1, Type-2, and Type-3 extreme-value distributions into a single family, allowing a continuous range of possible shapes. The CDF of a random variable corresponding to the maximum with a GEV distribution is:

$$F_{GEV}(x \mid \xi, \alpha, \beta) = \exp\left\{-\left[1 - \frac{\alpha(x - \xi)}{\beta}\right]^{1/\alpha}\right\} \quad \text{for } \alpha \neq 0 \tag{8.80}$$

In the limit α → 0, Eqn 8.80 reduces to Eqn 8.73 for the Gumbel or EV1 distribution. For α < 0, it corresponds to the EV2 distribution having a lower bound x > ξ + β/α. On the other hand, for α > 0, it corresponds to the Weibull or EV3 distribution having an upper bound x < ξ + β/α. For |α| < 0.3, the shape of the GEV distribution is similar to that of the Gumbel distribution, except that the right-hand tail is thicker for α < 0 and thinner for α > 0 (Stedinger et al. 1993). The first three moments of the GEV distribution, respectively, are:

$$\mu_x = \xi + \left(\frac{\beta}{\alpha}\right)\left[1 - \Gamma(1 + \alpha)\right] \tag{8.81}$$

$$\sigma_x^2 = \left(\frac{\beta}{\alpha}\right)^2\left[\Gamma(1 + 2\alpha) - \Gamma^2(1 + \alpha)\right] \tag{8.82}$$

$$\gamma_x = \operatorname{sign}(\alpha) \times \left(\frac{-\Gamma(1 + 3\alpha) + 3\,\Gamma(1 + 2\alpha)\,\Gamma(1 + \alpha) - 2\,\Gamma^3(1 + \alpha)}{\left[\Gamma(1 + 2\alpha) - \Gamma^2(1 + \alpha)\right]^{1.5}}\right) \tag{8.83}$$

where sign(α) is +1 or −1, depending on the sign of α. From Eqn 8.82 and Eqn 8.83, one notes that the variance of the GEV distribution exists when α > −0.5, and the skewness coefficient exists when α > −0.33. In recent years the GEV distribution has been used frequently for modelling random hydrologic extremes, such as precipitation and floods.
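The behaviour of the GEV family near α = 0 can be checked numerically. The following sketch, with illustrative function names and parameter values, evaluates Eqns 8.80 to 8.82 and compares a nearly zero shape parameter with the Gumbel values of Eqns 8.73 and 8.75.

```python
import math

def gev_cdf_max(x, xi, alpha, beta):
    """GEV CDF for maxima (Eqn 8.80), valid for alpha != 0."""
    arg = 1.0 - alpha * (x - xi) / beta
    if arg <= 0.0:
        # Outside the support: below the lower bound (alpha < 0)
        # or above the upper bound (alpha > 0).
        return 0.0 if alpha < 0.0 else 1.0
    return math.exp(-arg ** (1.0 / alpha))

def gev_mean_var(xi, alpha, beta):
    """Mean and variance of the GEV distribution (Eqns 8.81 and 8.82)."""
    if alpha == 0.0 or alpha <= -0.5:
        raise ValueError("requires alpha != 0 and alpha > -0.5")
    g1 = math.gamma(1.0 + alpha)
    g2 = math.gamma(1.0 + 2.0 * alpha)
    mean = xi + (beta / alpha) * (1.0 - g1)
    var = (beta / alpha) ** 2 * (g2 - g1 ** 2)
    return mean, var

# With a very small shape parameter the results approach the Gumbel case.
mean, var = gev_mean_var(0.0, 1e-4, 1.0)
print(mean, var)                        # close to 0.5772 and 1.645
print(gev_cdf_max(1.0, 0.0, 1e-4, 1.0)) # close to exp(-exp(-1))
```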

8.2.3.4.4 Beta distributions

In hydrosystems, random variables that are bounded on both limits, such as reservoir storage and groundwater tables for unconfined aquifers, are often best described by the beta distribution. The general form of the beta PDF is:

$$f_{NB}(x \mid a, b, \alpha, \beta) = \frac{1}{B(\alpha, \beta)\,(b - a)^{\alpha + \beta - 1}}\,(x - a)^{\alpha - 1}(b - x)^{\beta - 1} \quad \text{for } a \le x \le b \tag{8.84}$$

in which a and b are the lower and upper bounds of the beta random variable, respectively; the two shape parameters are α > 0 and β > 0; and B(α, β) is the beta function defined as:

$$B(\alpha, \beta) = \int_0^1 t^{\alpha - 1}(1 - t)^{\beta - 1}\,dt = \frac{\Gamma(\alpha)\,\Gamma(\beta)}{\Gamma(\alpha + \beta)} \tag{8.85}$$

The general beta PDF can be transformed to the standard beta PDF using a new variable, Y = (X − a)/(b − a), as:

$$f_B(y \mid \alpha, \beta) = \frac{1}{B(\alpha, \beta)}\,y^{\alpha - 1}(1 - y)^{\beta - 1} \quad \text{for } 0 < y < 1 \tag{8.86}$$

The CDF of the standard beta distribution is given by:

$$F(y \mid \alpha, \beta) = \frac{B_y(\alpha, \beta)}{B(\alpha, \beta)} = I_y(\alpha, \beta) \tag{8.87}$$

where By(α, β) is the incomplete beta function, a generalization of the beta function in which the fixed upper limit of the integral in Eqn 8.85 is replaced by the variable limit y, i.e.:

$$B_y(\alpha, \beta) = \int_0^y t^{\alpha - 1}(1 - t)^{\beta - 1}\,dt \tag{8.88}$$

and Iy(α, β) is the regularized incomplete beta function.

The beta distribution is also versatile and can have many shapes, as shown in Fig. 8.10b. The mean and variance of the standard beta random variable, Y, respectively, are:

$$\mu_y = \frac{\alpha}{\alpha + \beta} \quad\text{and}\quad \sigma_y^2 = \frac{\alpha\beta}{(\alpha + \beta + 1)(\alpha + \beta)^2} \tag{8.89}$$

When α = β = 1, the beta distribution reduces to a uniform distribution as:

$$f_U(x) = \frac{1}{b - a} \quad \text{for } a \le x \le b \tag{8.90}$$
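The beta relations above lend themselves to a short numerical illustration. The sketch below evaluates Eqn 8.84 using the gamma-function form of Eqn 8.85 and approximates the regularized incomplete beta function of Eqns 8.87 and 8.88 with a simple midpoint rule; the function names and the number of integration steps are assumptions made for illustration.

```python
import math

def beta_function(alpha, beta):
    """B(alpha, beta) from gamma functions (Eqn 8.85)."""
    return math.gamma(alpha) * math.gamma(beta) / math.gamma(alpha + beta)

def beta_pdf(x, alpha, beta, a=0.0, b=1.0):
    """General beta PDF on [a, b] (Eqn 8.84); a = 0, b = 1 gives Eqn 8.86.

    Note: evaluation exactly at a bound fails when the matching shape parameter is < 1.
    """
    if not a <= x <= b:
        return 0.0
    return ((x - a) ** (alpha - 1.0) * (b - x) ** (beta - 1.0)
            / (beta_function(alpha, beta) * (b - a) ** (alpha + beta - 1.0)))

def beta_cdf_standard(y, alpha, beta, n=2000):
    """I_y(alpha, beta) by midpoint-rule integration of Eqn 8.88 (numerical sketch)."""
    h = y / n
    total = 0.0
    for i in range(n):
        t = (i + 0.5) * h                      # midpoint of the i-th sub-interval
        total += t ** (alpha - 1.0) * (1.0 - t) ** (beta - 1.0)
    return total * h / beta_function(alpha, beta)

print(beta_pdf(3.0, 1.0, 1.0, a=2.0, b=6.0))   # uniform case of Eqn 8.90: 1 / (b - a)
print(beta_cdf_standard(0.5, 2.0, 2.0))        # symmetric case, close to 0.5
```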

8.2.3.5 Distributions related to normal random variables

The Gaussian or normal distribution has played an important role in the development of statistical theories. In the following, two distributions related to functions of normal random variables are briefly described.

8.2.3.5.1 χ2 (chi-square) distribution

The sum of the squares of K independent standard normal random variables results in a χ² (chi-square) random variable with K degrees of freedom, denoted as χ²_K. In other words:

$$\sum_{k=1}^{K} Z_k^2 \sim \chi_K^2 \tag{8.91}$$

in which the Zk are independent standard normal random variables. The PDF of a χ² random variable with K degrees of freedom is:

$$f_{\chi^2}(x \mid K) = \frac{1}{2^{K/2}\,\Gamma(K/2)}\,x^{(K/2 - 1)}\,e^{-x/2} \quad \text{for } x > 0 \tag{8.92}$$


Fig. 8.11 (a) Shapes of chi-square (χ²) probability density functions (PDFs) for 5, 10, and 20 degrees of freedom (d.f.). As the number of degrees of freedom increases, so do the mean and variance of the distribution. (b) Shapes of t-distributions for different values of d.f. (1, 5, and infinity). Based on Tung et al. (2006) with permission from McGraw-Hill.

Comparing Eqn 8.92 with Eqn 8.62, one notes that the χ² distribution is a special case of the two-parameter gamma distribution with α = K/2 and β = 2. The mean, variance, and skewness coefficient of a χ²_K random variable, respectively, are:

$$\mu_x = K, \qquad \sigma_x^2 = 2K, \qquad \gamma_x = \frac{2}{\sqrt{K/2}} \tag{8.93}$$

Thus, as the value of K increases, the χ² distribution approaches a symmetric distribution. Fig. 8.11a shows some χ² distributions with various degrees of freedom. If X1, X2, …, XK are independent normal random variables with common mean µx and variance σx², the χ² distribution is related to the sample of normal random variables as follows:

1. The sum of K squared standardized normal variables Zk = (Xk − X̄)/σx, k = 1, 2, …, K, has a χ² distribution with (K − 1) degrees of freedom.

2. The quantity (K − 1)s²/σx² has a χ² distribution with (K − 1) degrees of freedom, in which s² is the unbiased sample variance.
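The second relation can be checked by simulation. The sketch below is an illustration rather than a proof: it repeatedly draws K normal variates, forms (K − 1)s²/σx², and compares the simulated mean and variance with the values K − 1 and 2(K − 1) implied by Eqn 8.93. The seed, sample size, and normal parameters are arbitrary assumptions.

```python
import random
import statistics

def scaled_sample_variance(K, mu, sigma, rng):
    """Draw K normal variates and return (K - 1) s^2 / sigma^2."""
    sample = [rng.gauss(mu, sigma) for _ in range(K)]
    # statistics.variance is the unbiased sample variance s^2.
    return (K - 1) * statistics.variance(sample) / sigma ** 2

rng = random.Random(12345)   # fixed seed so the run is repeatable
K = 5
values = [scaled_sample_variance(K, 10.0, 3.0, rng) for _ in range(10000)]

sim_mean = statistics.fmean(values)
sim_var = statistics.variance(values)
print(f"simulated mean = {sim_mean:.2f} (theory {K - 1})")
print(f"simulated variance = {sim_var:.2f} (theory {2 * (K - 1)})")
```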

8.2.3.5.2 t-distribution

A random variable having a t-distribution results from the ratio of a standard normal random variable to the square root of a χ² random variable divided by its degrees of freedom, that is:

$$T_K = \frac{Z}{\sqrt{\chi_K^2 / K}} \tag{8.94}$$

in which TK is a t-distributed random variable with K degrees of freedom. The PDF of TK can be expressed as:

$$f_T(x \mid K) = \frac{\Gamma[(K + 1)/2]}{\sqrt{\pi K}\,\Gamma(K/2)}\left(1 + \frac{x^2}{K}\right)^{-(K + 1)/2} \quad \text{for } -\infty < x < \infty \tag{8.95}$$

A t-distribution is symmetric about zero, which is also its mean µx when K ≥ 2. Its shape is similar to that of the standard normal density φ(z), except that the tail portions of its PDF are thicker than those of φ(z). However, as K → ∞, the PDF of a t-distributed random variable approaches the standard normal distribution. Fig. 8.11b shows some PDFs for t-random variables with different degrees of freedom. It may be noted that when K = 1, the t-distribution reduces to the Cauchy distribution, for which the product-moments do not exist. The mean and variance of a t-distributed random variable with K degrees of freedom are:

$$\mu_x = 0, \qquad \sigma_x^2 = \frac{K}{K - 2} \quad \text{for } K \ge 3 \tag{8.96}$$


When the population variance of normal random variables is known, the sample mean X̄ of K normal random samples from N(µx, σx²) has a normal distribution with mean µx and variance σx²/K. However, when the population variance is unknown but is estimated by s², the quantity √K (X̄ − µx)/s, which is the standardized sample mean using the sample variance, has a t-distribution with (K − 1) degrees of freedom.
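The two limiting cases of the t-distribution mentioned above (Cauchy for K = 1 and standard normal as K → ∞) can be verified directly from Eqn 8.95. The sketch below works with logarithms of the gamma function to avoid overflow for large K; the function name is illustrative.

```python
import math

def t_pdf(x, K):
    """PDF of the t-distribution with K degrees of freedom (Eqn 8.95)."""
    log_const = (math.lgamma((K + 1) / 2.0) - math.lgamma(K / 2.0)
                 - 0.5 * math.log(math.pi * K))
    return math.exp(log_const - 0.5 * (K + 1) * math.log1p(x * x / K))

x = 0.5
cauchy = 1.0 / (math.pi * (1.0 + x * x))                     # Cauchy density
normal = math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)   # standard normal density
print(t_pdf(x, 1), cauchy)        # K = 1 reproduces the Cauchy density
print(t_pdf(x, 10 ** 6), normal)  # large K approaches the normal density
```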

8.3 Hydrologic frequency analysis

The aim of hydrosystems analysis is not to eliminate all hydro-hazards altogether but to reduce the frequency of their occurrence and thereby minimize the resulting damage. Therefore, the probabilities of the investigated event must be evaluated correctly. This is inherently complex because in many cases the ‘input’ is controlled by nature and therefore has limited predictability. For example, variations in the amount, timing, and spatial distribution of precipitation and its resulting effects, such as runoff, are far from being perfectly understood.

Therefore, a statistical or probabilistic approach is often used, as this does not require a complete understanding of the hydrologic phenomenon involved but examines the relationship between event and frequency of occurrence with the aim of finding some statistical pattern between these variables. In effect, past trends are extrapolated into the future, assuming that whatever complex physical interactions and processes govern nature do not change with time. Therefore, the historical record can be used as a basis for predicting future events. In other words, the data are assumed to satisfy statistical stationarity, by which the underlying properties of the distribution remain invariant with time. The hydrologic data commonly analysed in this manner are rainfall and stream flow records. An obvious example that violates the assumed statistical stationarity is progressive urbanization within a watershed, which could result in a tendency for the magnitude of peak flow to increase over time. Global change due to a variety of human activities, including greenhouse gas emissions, may also be contributing to violation of statistical stationarity.

Basic probability concepts and theories useful for frequency analysis have already been described in Section 8.2. In general, there is no a priori physical reasoning that stipulates the use of a particular distribution in the frequency analysis of geophysical data. However, since the maximum or the minimum values of hydrological events are usually of interest, extreme-value-related distributions have been found useful.

There are three basic types of data series extractable from geophysical events:

1. A complete series, which includes all the available data concerning the magnitude of a phenomenon. Such a data series is usually very large and, since in some instances the interest is only in the extremes of the distribution (e.g. floods, droughts, wind speeds, and wave heights), the following data series are often more relevant.

2. An extreme-value series is one that contains the largest (or smallest) value of the data for individual equal time intervals. If, for example, the largest value of the data in the record of each year is used, the extreme-value series is called an annual maximum series. If the smallest value is used, the series is called an annual minimum series.

3. A partial-duration series consists of all data above or below a base value. For example, one might consider only floods in a river with a magnitude greater than 1000 m³ s⁻¹. When the base value is selected such that the number of events included in the data series equals the number of years of record, the resulting series is called an annual exceedance series. This series contains the n largest or n smallest values in n years of record.

Another issue related to selection of data series for frequency analysis is adequacy of the record length. The US Water Resources Council (USWRC 1967) recommended that at least 10 years of data should be available before a frequency analysis can be done. It has, however, been shown that if a frequency analysis is done using 10 years of record, a high degree of uncertainty can be expected in the estimate of high-return-period events.


The third issue related to the data series used for frequency analysis concerns the problem of data homogeneity. This refers to the different accuracies with which events of different magnitude are recorded, primarily due to difficulties in measuring high-magnitude events. As an example, the probability distribution of measured floods can be greatly distorted with respect to the parent population. This further contributes to the uncertainty in flood frequency analysis.

8.3.1 Return period

The concept of return period (or recurrence interval) is commonly used in hydrosystems engineering. It is a substitute for probability and lends it some physical interpretation. The return period of a given event is defined as the long-term average interval of time between occurrences in which the event is equalled or exceeded. Hence, on average, an event with a two-year return period will be equalled or exceeded once every two years. The relationship between the probability and return period is given by:

$$T = \frac{1}{P(X \ge x_T)} = \frac{1}{1 - P(X < x_T)} \tag{8.97}$$

in which xT is the value of the variate corresponding to a T-year return period. For example, if the probability that a flood will be equalled or exceeded in a single year is 0.1, that is P(X ≥ xT) = 0.1, the corresponding return period is 1/P(X ≥ xT) = 1/0.1 = 10 years. Note that P(X ≥ xT) is the probability that the event is equalled or exceeded in any one year and is the same for each year, regardless of the magnitudes that occurred in prior years. This is because the events are independent and the long-term probabilities are used without regard to the order in which they may occur.

A common misconception is to assume, for example, that if the 100-year magnitude event occurs this year, it will not recur for the next 100 years. In fact, it may occur again the very next year and then may not be repeated for several hundred years. This misconception resulted in considerable public outcry in the United States when the Phoenix area experienced two 50-year floods and one 100-year flood in a span of 18 months in 1978–1979, and when the Milwaukee area experienced 100-year floods in two consecutive years, June 1997 and June 1998.

The common unit used for the return period is the year, although semi-annual, monthly, or any other time period may be used. The unit used for the time series is also the unit assigned to the return period. Thus an annual series has a return period in years and a monthly series has a return period in months.

8.3.2 Probability estimates for data series: plotting positions (rank-order probability)

In order to fit a probability distribution to a data series, estimates of probability must be assigned to each term in the data series. Consider a data series consisting of the entire population of N values for a particular variable. If this series were ranked according to decreasing magnitude, the probability of the largest variate being equalled or exceeded is 1/N, where N is the total number of variates. Similarly, the exceedance probability of the second largest variate is 2/N, and so on. In general:

$$P(X \ge x_{(m)}) = \frac{1}{T_m} = \frac{m}{N} \tag{8.98}$$

in which m is the rank of the data in descending order; x(m) is the mth largest variate in a data series of size N; and Tm is the return period associated with x(m). As the entire population is either not used or is not available, only an estimate of the exceedance probability based on a sample can be made. Eqn 8.98, giving the rank-order probability, is called a plotting-position formula because it provides an estimate of probability so that the data series can be plotted (as magnitude versus probability). Some modifications are made to avoid theoretical inconsistency when it is applied to a sample data series. For example, Eqn 8.98 assigns an exceedance probability of 1.0 to the smallest variate, implying that all values must be equal to or larger than it. A number of plotting-position formulas have been introduced. Perhaps the most popular plotting-position formula is the Weibull formula:

$$P(X \ge x_{(m)}) = \frac{1}{T_m} = \frac{m}{n + 1} \tag{8.99}$$

n being the sample size.
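Assigning plotting positions amounts to ranking the series and applying the chosen formula. The sketch below uses the Weibull formula in the form m/(n + 1); the discharge values are hypothetical and serve only to illustrate the ranking.

```python
def weibull_plotting_positions(series):
    """Return (rank, value, exceedance probability, return period) for each value.

    Values are ranked in descending order and the Weibull formula
    P = m / (n + 1) is applied, with T = 1 / P.
    """
    ranked = sorted(series, reverse=True)
    n = len(ranked)
    rows = []
    for m, x in enumerate(ranked, start=1):
        p = m / (n + 1)
        rows.append((m, x, p, 1.0 / p))
    return rows

# Hypothetical annual maximum discharges (m3/s), for illustration only.
peaks = [412.0, 655.0, 298.0, 530.0, 871.0, 344.0, 476.0, 603.0, 389.0]
rows = weibull_plotting_positions(peaks)
for m, x, p, T in rows:
    print(f"rank {m}: x = {x:6.1f}  P = {p:.2f}  T = {T:5.2f}")
```

Unlike Eqn 8.98, the smallest value in the sample receives an exceedance probability below 1.0, here 9/10.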

8.3.3 Graphical approach

After identifying the data series, ranking the events and calculating the plotting position, a graph of the magnitude of x versus its probability [P(X ≥ x), P(X < x), or T] can be plotted and a distribution fitted graphically. To facilitate this procedure, one often uses specially designed probability graph papers that have the probability scale chosen such that the resulting plot is a straight line if the chosen distribution fits the data perfectly. Probability graph papers are available for Gaussian (normal), lognormal, Gumbel, and some other distributions. By plotting the data using a particular probability scale and constructing the best-fit straight line through the data, a graphical fit is made to the distribution used in constructing the probability scale. This is a graphical approach to estimating the statistical parameters of the distribution.

8.3.4 Analytical approaches

An alternative to the graphical technique is to estimate the statistical parameters of a distribution from the sample data. The distribution model can then be used to solve for the variate value corresponding to any desired return period or probability as:

$$x_T = F_x^{-1}\left(1 - \frac{1}{T}\,\Big|\,\theta\right) \tag{8.100}$$

in which Fx⁻¹(·|θ) is the inverse cumulative distribution function with the model parameter vector θ. Eqn 8.100 is applicable when the inverse distribution function is analytically tractable, as for the Gumbel, generalized extreme value, etc.
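As an illustration of Eqn 8.100, solving the GEV CDF of Eqn 8.80 for x gives xT = ξ + (β/α){1 − [−ln(1 − 1/T)]^α}, with the Gumbel quantile xT = ξ − β ln[−ln(1 − 1/T)] as the α → 0 limit. The sketch below implements both branches; the function name is illustrative, ξ and β are borrowed from Example 8.7, and the non-zero α values are hypothetical.

```python
import math

def gev_quantile(T, xi, alpha, beta):
    """x_T from Eqn 8.100 using the inverse of the GEV CDF (Eqn 8.80, maxima)."""
    if T <= 1.0:
        raise ValueError("return period must exceed 1")
    y = -math.log(1.0 - 1.0 / T)           # -ln F, with F = 1 - 1/T
    if alpha == 0.0:
        return xi - beta * math.log(y)     # Gumbel (EV1) limit
    return xi + (beta / alpha) * (1.0 - y ** alpha)

# Illustrative parameters only: the alpha values are hypothetical.
for alpha in (-0.1, 0.0, 0.1):
    q = gev_quantile(100.0, 420.05, alpha, 311.872)
    print(f"alpha = {alpha:+.1f}: q100 = {q:.0f} m3/s")
```

The three results show the effect of tail thickness: a negative α (thicker right-hand tail) gives a larger 100-year value than the Gumbel case, and a positive α gives a smaller one.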

Example 8.8. Consider that the annual maximum floods follow a lognormal distribution with a mean of 490 m³ s⁻¹ and a standard deviation of 80 m³ s⁻¹. Determine the flood magnitude with a 1-in-100 chance of being exceeded in any given year.

Solution. From Eqn 8.57 and the table of standard normal probability Φ(z) = P(Z ≤ z), the parameters of the lognormal distribution for the annual maximum flood Q can be obtained as:

$$\sigma_{\ln Q} = \sqrt{\ln\left[\left(\frac{\sigma_Q}{\mu_Q}\right)^2 + 1\right]} = \sqrt{\ln\left[\left(\frac{80}{490}\right)^2 + 1\right]} = 0.1622$$

$$\mu_{\ln Q} = \ln(\mu_Q) - \frac{1}{2}\sigma_{\ln Q}^2 = \ln(490) - \frac{1}{2}(0.1622)^2 = 6.1812$$

Since ln(Q) follows a normal distribution with a mean of µlnQ = 6.1812 and a standard deviation of σlnQ = 0.1622, the magnitude of the log-transformed 100-year flood can be calculated from:

$$\frac{\ln(q_{100}) - \mu_{\ln Q}}{\sigma_{\ln Q}} = \Phi^{-1}\left(1 - \frac{1}{100}\right) = \Phi^{-1}(0.99) = 2.326$$

Hence

$$\ln(q_{100}) = 2.326 \times \sigma_{\ln Q} + \mu_{\ln Q} = 6.5585$$

and the corresponding 100-year flood magnitude is:

$$q_{100} = \exp(6.5585) = 705.2\ \mathrm{m^3\,s^{-1}}$$
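The calculation in Example 8.8 can be scripted as follows. This is a sketch that assumes the standard library's `statistics.NormalDist` for the inverse normal CDF, which returns Φ⁻¹(0.99) ≈ 2.326; hand calculations that read a rounded z value from a table can differ from the scripted result by a few m³ s⁻¹. The function names are illustrative.

```python
import math
from statistics import NormalDist

def lognormal_params(mean, std):
    """Mean and standard deviation of ln(Q) from the moments of Q."""
    var_ln = math.log((std / mean) ** 2 + 1.0)
    return math.log(mean) - 0.5 * var_ln, math.sqrt(var_ln)

def lognormal_quantile(T, mean, std):
    """Magnitude exceeded with probability 1/T in any given year."""
    mu_ln, sigma_ln = lognormal_params(mean, std)
    z = NormalDist().inv_cdf(1.0 - 1.0 / T)   # standard normal quantile
    return math.exp(mu_ln + z * sigma_ln)

mu_ln, sigma_ln = lognormal_params(490.0, 80.0)
q100 = lognormal_quantile(100.0, 490.0, 80.0)
print(f"mu_lnQ = {mu_ln:.4f}, sigma_lnQ = {sigma_ln:.4f}, q100 = {q100:.1f} m3/s")
```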

For some distributions, such as Pearson Type-3 or log-Pearson Type-3, the appropriate probability paper or inverse CDF form is unavailable. In such a case, an analytical approach using the frequency factor KT is applied:

$$x_T = \mu_x + K_T \times \sigma_x \tag{8.101}$$

in which xT is the variate corresponding to a return period of T; µx and σx are the mean and standard deviation of the random variable, respectively; and KT is the frequency factor, which is a function of the return period T or P(X ≥ xT) and of higher moments, if required. The xT versus KT plot of Eqn 8.101 is a straight line with slope σx and intercept µx.


In order for Eqn 8.101 to be applicable, the functional relationship between KT and exceedance probability or return period must be determined for the distribution to be used. In fact, the frequency factor KT = (xT − µx)/σx is identical to a standardized variate corresponding to the exceedance probability of 1/T for the particular distribution model under consideration. For example, if the normal distribution is considered, KT = zT = Φ⁻¹(1 − T⁻¹). The same applies to the lognormal distribution when the mean and standard deviation of log-transformed random variables are used. Hence the standard normal probability table provides values of the frequency factor for sample data from normal and lognormal distributions. Once this relationship is known, a nonlinear probability or return-period scale can be constructed to replace the linear KT scale, and thus a special graph paper can be constructed for any distribution so that the plot of xT versus P (or T) is linear.

The procedure for using the frequency-factor method is outlined as follows:

1. Compute the sample mean x̄, standard deviation sx, and skewness coefficient γx (if needed) for the sample.

2. For the desired return period, determine the corresponding value of KT for the distribution.

3. Compute the desired percentile value using Eqn 8.101 with x̄ replacing µx and sx replacing σx, that is:

$$\hat{x}_T = \bar{x} + K_T \times s_x \tag{8.102}$$
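The three steps above can be sketched for the normal distribution, for which KT = Φ⁻¹(1 − 1/T). The function names and the sample values are hypothetical; other distributions would need their own frequency-factor function, which is not shown here.

```python
import statistics
from statistics import NormalDist

def frequency_factor_normal(T):
    """K_T for the normal distribution: the standard normal quantile z_T."""
    return NormalDist().inv_cdf(1.0 - 1.0 / T)

def quantile_frequency_factor(sample, T, k_func=frequency_factor_normal):
    """Estimate x_T from a sample with the frequency-factor method (Eqn 8.102)."""
    x_bar = statistics.fmean(sample)   # step 1: sample mean
    s_x = statistics.stdev(sample)     #         sample standard deviation
    k_t = k_func(T)                    # step 2: frequency factor for T
    return x_bar + k_t * s_x           # step 3: Eqn 8.102

# Hypothetical annual peak discharges (m3/s), for illustration only.
peaks = [412.0, 655.0, 298.0, 530.0, 871.0, 344.0, 476.0, 603.0, 389.0]
for T in (2, 10, 100):
    print(T, round(quantile_frequency_factor(peaks, T), 1))
```

For a lognormal fit the same routine can be applied to the logarithms of the data, followed by exponentiation of the result.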

It should be noted that the basic difference between the graphical and analytical approaches lies in estimating the statistical parameters of the distribution being used. In the graphical approach, a best-fit line is constructed, which determines the statistical parameters. In the analytical approach, the statistical parameters are first computed from the sample and the straight-line fit thus obtained is used. The straight line obtained in the analytical approach is in general a poorer fit to the observed data than that obtained from the graphical approach, especially if curve-fitting procedures are adopted. However, the US Water Resources Council (USWRC 1967) recommended use of the analytical approach because:

1. Graphical least-squares methods are avoided to reduce incorporation of random characteristics of the particular dataset (especially in the light of the difficulty in selecting an appropriate plotting-position formula).

2. The generally larger variance of the analytical approach is believed to help compensate for the errors that arise in the case of typically small data sets.

8.3.5 Limitations of hydrologic frequency analysis

Several probability distributions have been proposed for application to hydrologic data. Some of them were proposed because the underlying concept of the distribution coincided with the goal of hydrologic frequency analysis. For example, the extremal distributions discussed in Section 8.2.3.4 have favourable properties for hydrologic frequency analysis.

In the 1960s, a working group of hydrology experts was formed by the US Water Resources Council to evaluate the best/preferred approach to flood frequency analysis. The following key results emerged from the study conducted by this group (Benson 1968):

1. From physical considerations there is no a priori requirement that dictates use of a specific distribution in the analysis of hydrologic data.

2. Intuitively, there is no reason to expect that a single distribution will apply globally to all streams.

3. No single method of testing the computed results against the original data was acceptable to the working group. The statistical experts also could not offer a mathematically rigorous procedure.

Subsequent to this study, the USWRC (1967) recommended use of the log-Pearson Type-3 distribution for flood frequency analyses in the United States and this has become the officially recommended distribution for all flood frequency studies in that country. There is, however, no physical basis for application of this distribution to hydrologic data. It does, however, have added flexibility over two-parameter distributions (Gumbel, lognormal) because the skewness coefficient is the third independent parameter and the use of three parameters generally results in a better fit to the data.

Often frequency analysis is applied for the purpose of estimating the magnitude of rare events, for example, a 100-year flood, on the basis of a short data series. Viessman et al. (1977), however, noted ‘as a general rule, frequency analysis should be avoided in estimating frequencies of expected hydrologic events greater than twice the record length.’ This general rule is rarely followed in practice because of the regulatory requirement to estimate the 100-year flood; for example, the USWRC (1967) gave its approval for frequency analyses using as little as 10 years of peak flow data. In order to estimate the 100-year flood on the basis of a short record, the analyst must rely on extrapolation. Klemes (1986) noted that there are many known causes of non-stationarity, ranging from the dynamics of the Earth’s motion to anthropogenic changes in land use. In this context, Klemes (1986) reasoned that the notion of a 100-year flood has no meaning in terms of average return period, and thus the 100-year flood really provides a reference for design considerations rather than a true reflection of the frequency of occurrence of an event.

8.4 Nonparametric density estimation methods

In flood frequency analysis, one of the more important issues is the choice of the best probability distribution. The true distribution is always unknown in practice and often an arbitrary choice of, or preference for, a given distribution increases the estimation uncertainty. Some countries have tried to find standard distributions for flood frequency analysis in order to avoid arbitrariness in selection of the type of distribution adopted. Many distributions have been used to estimate flood flow frequencies from observed annual flood series. The generalized extreme value (GEV) distribution is recommended as a base model in the United Kingdom. After appraising many distributions, the USWRC issued a series of bulletins recommending the log-Pearson Type-3 distribution as a base method for use by all US Federal Agencies. The distributions most commonly used in Canada are the three-parameter lognormal, the generalized extreme value, and the log-Pearson Type-3.

Recently, nonparametric density estimation methods have gained popularity in many fields of science, including hydrology. These methods have several advantages. The shapes of nonparametric density functions are determined directly from the data (Faucher et al. 2001). They do not require any a priori assumptions about the distribution of the population of interest (Adamowski 1989), nor do they require estimation of parameters (i.e. mean, variance, and skewness). Parametric distributions are limited to certain shapes, whereas nonparametric densities can adapt to the often irregular empirical distribution of random variables found in nature (Faucher et al. 2001).

The probability density function $\hat{f}(x)$ estimated by a nonparametric method is given by (Adamowski 2000):

$$\hat{f}(x) = \frac{1}{nh} \sum_{j=1}^{n} K\left(\frac{x - x_j}{h}\right) \qquad (8.103)$$

where $x_1$ to $x_n$ are the observations; $K(\cdot)$ is a kernel function, itself a probability density function; and $h$ is a bandwidth or smoothing factor to be estimated from the data.

The following conditions are imposed on the kernel (Adamowski 2000):

$$\int K(z)\,dz = 1 \qquad (8.104)$$

$$\int z\,K(z)\,dz = 0 \qquad (8.105)$$

$$\int z^2 K(z)\,dz = C \neq 0 \qquad (8.106)$$

where $C$ is the kernel variance.

The kernel distribution function is obtained by integrating the density of Eqn 8.103 from $-\infty$ to $x$ (Adamowski 2000):

$$F(x) = \frac{1}{n} \sum_{j=1}^{n} K_l\left(\frac{x - x_j}{h}\right) \qquad (8.107)$$

where:

$$K_l(u) = \int_{-\infty}^{u} K(w)\,dw.$$

The kernel distribution function may be used to estimate percentiles corresponding to a given probability of exceedance. The flood quantile $x_T$ with a return period of $T$ years is obtained from the kernel distribution function as (Adamowski 2000):

$$x_T = F^{-1}\left(1 - \frac{1}{T}\right) \qquad (8.108)$$

where $F^{-1}(\cdot)$ represents the inverse of $F(\cdot)$. The value of $x_T$ can be determined by solving Eqn 8.107 numerically.
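The numerical solution mentioned above can be sketched in a few lines of Python. This is a minimal illustration, not the procedure of any cited author: it assumes a Gaussian kernel, a bandwidth `h` supplied by the user, and a simple bisection search for $x_T$. The peak-flow values and the function names are hypothetical.

```python
import math

def gaussian_kernel_cdf(u):
    """Integrated Gaussian kernel, i.e. K_l(u) in Eqn 8.107."""
    return 0.5 * (1.0 + math.erf(u / math.sqrt(2.0)))

def kernel_density(x, data, h):
    """Kernel density estimate of Eqn 8.103 with a Gaussian kernel."""
    n = len(data)
    total = sum(math.exp(-0.5 * ((x - xj) / h) ** 2) for xj in data)
    return total / (n * h * math.sqrt(2.0 * math.pi))

def kernel_cdf(x, data, h):
    """Kernel distribution function of Eqn 8.107."""
    return sum(gaussian_kernel_cdf((x - xj) / h) for xj in data) / len(data)

def kernel_quantile(T, data, h, tol=1e-8):
    """Solve F(x_T) = 1 - 1/T (Eqn 8.108) by bisection; requires T > 1."""
    p = 1.0 - 1.0 / T
    lo = min(data) - 10.0 * h   # F is practically 0 here
    hi = max(data) + 10.0 * h   # F is practically 1 here
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if kernel_cdf(mid, data, h) < p:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

# Hypothetical annual peak flows (m^3/s) and an arbitrarily chosen bandwidth
peaks = [310.0, 455.0, 390.0, 520.0, 610.0, 340.0, 480.0, 705.0, 430.0, 560.0]
h = 60.0
x100 = kernel_quantile(100.0, peaks, h)
print("Estimated 100-year quantile:", round(x100, 1))
```

Because the kernel distribution function increases monotonically with $x$, bisection is sufficient here; the estimate remains sensitive to the value chosen for `h`.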

In nonparametric frequency analysis, the choice of kernel function is not critical, since various kernels lead to comparable estimates (Adamowski 2000). However, the calculation and choice of the bandwidth, $h$, in Eqn 8.103 is critical. While the Gaussian kernel is often used, Table 8.2 shows some of the other commonly used kernel functions.

Several bandwidth estimation techniques are based on minimization of an estimate of the mean square error (MSE) or of the integrated mean square error (IMSE) function. IMSE is obtained by integrating the MSE function over the entire domain of $x$.

The criterion of optimality is based on minimizing the IMSE given by Guo et al. (1996):

$$\mathrm{IMSE} = \int_{-\infty}^{\infty} \left[\hat{f}(x) - f(x)\right]^2 dx \qquad (8.109)$$

where $\hat{f}(x)$ is an estimate of the density function $f(x)$. An optimal value of $h$ can be obtained by minimizing the IMSE for a given density $f(x)$ and sample. In practice, the value of $h$ has to be derived empirically from the observed data. The optimal value of $h$ can be determined by differentiating the objective function (Eqn 8.109) with respect to $h$ and equating the derivative to zero, as given by Guo et al. (1996):

$$h = \frac{\sum_{i=2}^{n} \sum_{j=1}^{i-1} (x_i - x_j)}{\sqrt{5}\, n\,(n - 10/3)} \qquad (8.110)$$

Table 8.2 Some commonly used kernel functions, where $t = h^{-1}(x - x_i)$.

Kernel          $K(t)$
Epanechnikov    $K(t) = \frac{3}{4}(1 - t^2)$, $|t| < 1$
Rectangular     $K(t) = \frac{1}{2}$, $|t| < 1$
Biweight        $K(t) = \frac{15}{16}(1 - t^2)^2$, $|t| < 1$
Gaussian        $K(t) = \frac{1}{\sqrt{2\pi}}\, e^{-t^2/2}$
Cauchy          $K(t) = \frac{1}{\pi (1 + t^2)}$
EV1             $K(t) = e^{-t - e^{-t}}$

8.5 Error analysis

In the foregoing, it was implicitly assumed that the experimental observations have no uncertainty and that every observation of a variable under consideration (e.g. daily stream flow) represents its true value. However, in real life most observations have a certain degree of uncertainty due to a variety of factors. As already mentioned in Section 1.4, the term accuracy indicates closeness to the ‘true’ value, whereas the term precision indicates reproducibility of measurements irrespective of how close the result is to the ‘true’ value. In a way, accuracy and precision are related, as described below.

Precision is a measure of the magnitude of random errors. If one is able to reduce random errors, for instance by employing better equipment or procedures of data collection/analysis, results are more precise and reproducible. Random errors are the most common type of errors. These arise from limitations of the quality of instrumentation employed. These can only be partly overcome by refining the instrumentation/analytical method(s) employed, by repeating the measurements (reading temperature or pH several times), or extending the observation time (e.g. when using radio-isotopic measurements).

Accuracy of a measurement, on the other hand, is directly affected by systematic errors; avoiding or eliminating systematic errors makes the result more accurate and reliable. Systematic errors are reproducible discrepancies, often resulting from failure/fault in the instrumentation or a consistent mathematical inadequacy in the data analysis. These are more difficult to detect but may be found by repeating the analysis with different equipment or by recalculation using an alternative statistical procedure.

Fig. 8.12 An example to illustrate precision and accuracy, showing results of 2 series from 19 measurements. (a) The data are imprecise but accurate, giving the correct average value of 13 units. The grey area refers to the 1σ confidence level; that is, 68% of the data lie within this range. (b) The data are precise but inaccurate, probably because of a systematic error in the measurement, as the average value for this case is less than the correct value of 13 units. Redrawn from Mook (2000) © UNESCO/IAEA/IHPV.

For studying and eventually reducing systematic errors it is important to collect data with small random errors, that is, with relatively high precision. On the other hand, it is no use increasing the precision if systematic errors are large. The difference between precision and accuracy is graphically depicted in Fig. 8.12.

8.5.1 Propagation of errors

8.5.1.1 Standard deviation

It is often required to determine a quantity, say $A$, which is a function of one or more variables, each with its own uncertainty. The uncertainty in each of these variables contributes to the overall uncertainty of $A$. If the uncertainty is due to statistical or random fluctuations and $A = f(x, y, z, \ldots)$:

$$\sigma_A^2 = \sigma_x^2 \left(\frac{\partial A}{\partial x}\right)^2 + \sigma_y^2 \left(\frac{\partial A}{\partial y}\right)^2 + \sigma_z^2 \left(\frac{\partial A}{\partial z}\right)^2 + \cdots \qquad (8.111)$$

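If the uncertainties are estimated on the basis of instrumental uncertainties, similar equations are to be used for calculating the uncertainty in the final result. For the general relation $A = f(x, y, z, \ldots)$ with the estimated instrumental uncertainties $\Delta x$, $\Delta y$, $\Delta z$, ..., the uncertainty in $A$ is:

$$\Delta A^2 = \Delta x^2 \left(\frac{\partial A}{\partial x}\right)^2 + \Delta y^2 \left(\frac{\partial A}{\partial y}\right)^2 + \Delta z^2 \left(\frac{\partial A}{\partial z}\right)^2 + \cdots \qquad (8.112)$$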

Equations for propagation of uncertainty to the dependent variable $A$ for some common functions of two independent variables $x$ and $y$, derived by using Eqn 8.112, are given in Table 8.3.


Table 8.3 Propagation of uncertainty in some commonly used functions for the dependent variable $A$ with known uncertainty in independent variables $x$ and $y$.

Function $A = f(x, y)$                      Equation for $\Delta A$
$A = ax + by$, and $A = ax - by$            $\Delta A^2 = a^2 \Delta x^2 + b^2 \Delta y^2$
$A = \pm a\,xy$, and $A = \pm a\,x/y$       $\Delta A^2 / A^2 = \Delta x^2 / x^2 + \Delta y^2 / y^2$
$A = a\,e^{\pm bx}$                         $\Delta A / A = \pm b\,\Delta x$
$A = a \ln(\pm bx)$                         $\Delta A = a\,\Delta x / x$
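As a rough numerical check of the propagation rule, the following Python sketch evaluates the sum of squared contributions (uncertainty times partial derivative) with central-difference derivatives and compares the result with the product rule for relative uncertainties. The discharge example, its numbers, and the function names are hypothetical and serve only as an illustration.

```python
import math

def propagate_uncertainty(f, values, uncertainties, rel_step=1e-6):
    """Combine uncertainties in quadrature, each weighted by the
    central-difference partial derivative of f with respect to that variable."""
    variance = 0.0
    for k, (v, dv) in enumerate(zip(values, uncertainties)):
        step = rel_step * (abs(v) if v != 0 else 1.0)
        up = list(values)
        down = list(values)
        up[k] = v + step
        down[k] = v - step
        partial = (f(*up) - f(*down)) / (2.0 * step)
        variance += (dv * partial) ** 2
    return math.sqrt(variance)

# Hypothetical example: discharge Q = mean velocity * cross-sectional area
def discharge(velocity, area):
    return velocity * area

v, dv = 1.2, 0.06        # m/s (assumed 5% uncertainty)
area, darea = 15.0, 0.45  # m^2 (assumed 3% uncertainty)

Q = discharge(v, area)
dQ = propagate_uncertainty(discharge, [v, area], [dv, darea])

# Product rule: (dQ/Q)^2 = (dv/v)^2 + (darea/area)^2
dQ_product_rule = Q * math.sqrt((dv / v) ** 2 + (darea / area) ** 2)
print(Q, dQ, dQ_product_rule)
```

The numerical route is convenient when the function is too involved for the partial derivatives to be written down by hand, although it inherits the usual step-size sensitivity of finite differences.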

8.5.1.2 Weighted mean

While calculating average values using Eqn 8.1, all the numbers are implicitly assumed to have the same precision and thus the same weight. If each number is assigned its own uncertainty, expressed as a standard deviation ($\sigma$), the weight of each result is inversely proportional to the square of the standard deviation ($1/\sigma^2$), which is referred to as the weighting factor. Accordingly, the mean is calculated as:

$$\bar{x} = \sum \frac{x_i}{\sigma_i^2} \Big/ \sum \frac{1}{\sigma_i^2} \qquad (8.113)$$

while the standard deviation of the mean is obtained from:

$$\sigma_{\bar{x}} = \frac{1}{\sqrt{\sum (1/\sigma_i)^2}} \qquad (8.114)$$

If all the standard deviation values, $\sigma_i$, are equal, the expression for $\sigma_{\bar{x}}$ reduces to:

$$\sigma_{\bar{x}}^2 = 1 \Big/ \sum (1/\sigma_i)^2 = 1/(N/\sigma_i^2) = \sigma_i^2/N \quad \text{or} \quad \sigma_{\bar{x}} = \sigma_i/\sqrt{N} \qquad (8.115)$$
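A short Python sketch of the weighted mean and its standard deviation is given below. The measurement values are hypothetical; the second call illustrates the reduction to $\sigma/\sqrt{N}$ when all standard deviations are equal.

```python
import math

def weighted_mean(values, sigmas):
    """Weighted mean with weights 1/sigma_i^2 and its standard deviation."""
    weights = [1.0 / s ** 2 for s in sigmas]
    mean = sum(w * x for w, x in zip(weights, values)) / sum(weights)
    sigma_mean = 1.0 / math.sqrt(sum(weights))
    return mean, sigma_mean

# Hypothetical repeated determinations of the same quantity
values = [10.2, 9.8, 10.5, 10.0]
sigmas = [0.2, 0.1, 0.4, 0.2]
mean_w, sigma_w = weighted_mean(values, sigmas)

# Equal standard deviations: arithmetic mean and sigma / sqrt(N)
mean_eq, sigma_eq = weighted_mean(values, [0.2] * len(values))
print(mean_w, sigma_w, mean_eq, sigma_eq)
```

The most precise value (here the second one) dominates the weighted mean, which is the intended effect of the weighting factor.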

8.5.2 Least squares fit to a straight line

One is often interested in obtaining the best description of data in terms of some theory, which involves parameters whose values are initially unknown. Often the interest is to see if the relation between the dependent and the independent variables in a dataset is linear, that is, whether the data lie on a straight line $y = a + bx$, and if so, what are the values of the gradient ($b$) and the intercept ($a$) of the best-fit line. We proceed by assuming that the data consist of a series of points $(x_i, y_i \pm \sigma_i)$ with no uncertainty in $x$. The $\sigma_i$ is, in principle, the theoretical error that one would have expected. In practice, the observed experimental error for a given data point is used. The deviation of any value of $y$ (i.e. $y_i$) from the straight line is given by:

$$\Delta y_i = y_i - f(x_i) = y_i - a - b x_i \qquad (8.116)$$

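Minimizing the simple sum of these deviations is not useful, because positive and negative deviations cancel and the sum can be zero even for a poor fit. Adding the absolute values of $\Delta y_i$ likewise does not lead to a useful mathematical procedure for estimating the coefficients $a$ and $b$ that characterize the straight line. Instead, the weighted sum of the squares of the deviations, $S = \sum (\Delta y_i/\sigma_i)^2$, is minimized by requiring:

$$\frac{\partial}{\partial a} \sum \left(\frac{\Delta y_i}{\sigma_i}\right)^2 = \frac{\partial}{\partial a} \sum \left[\frac{(y_i - a - b x_i)}{\sigma_i}\right]^2 = 0 \qquad (8.117)$$

and

$$\frac{\partial}{\partial b} \sum \left(\frac{\Delta y_i}{\sigma_i}\right)^2 = \frac{\partial}{\partial b} \sum \left[\frac{(y_i - a - b x_i)}{\sigma_i}\right]^2 = 0 \qquad (8.118)$$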

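Resulting values of $a$ and $b$, therefore, are:

$$a = \frac{1}{\Delta} \left( \sum \frac{x_i^2}{\sigma_i^2} \sum \frac{y_i}{\sigma_i^2} - \sum \frac{x_i}{\sigma_i^2} \sum \frac{x_i y_i}{\sigma_i^2} \right) \qquad (8.119)$$

and

$$b = \frac{1}{\Delta} \left( \sum \frac{1}{\sigma_i^2} \sum \frac{x_i y_i}{\sigma_i^2} - \sum \frac{x_i}{\sigma_i^2} \sum \frac{y_i}{\sigma_i^2} \right) \qquad (8.120)$$

with

$$\Delta = \sum \frac{1}{\sigma_i^2} \sum \frac{x_i^2}{\sigma_i^2} - \left( \sum \frac{x_i}{\sigma_i^2} \right)^2 \qquad (8.121)$$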

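If the standard deviations of $y$ are all equal, the values of $a$ and $b$ are:

$$a = \frac{1}{\Delta} \left( \sum x_i^2 \sum y_i - \sum x_i \sum x_i y_i \right) \qquad (8.122)$$

$$b = \frac{1}{\Delta} \left( N \sum x_i y_i - \sum x_i \sum y_i \right) \qquad (8.123)$$

$$\Delta = N \sum x_i^2 - \left( \sum x_i \right)^2 \qquad (8.124)$$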


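The standard deviations of the coefficients $a$ and $b$ are:

$$\sigma_a = \sqrt{\frac{1}{\Delta} \sum \frac{x_i^2}{\sigma_i^2}} \qquad (8.125)$$

and

$$\sigma_b = \sqrt{\frac{1}{\Delta} \sum \frac{1}{\sigma_i^2}} \qquad (8.126)$$

where $\Delta$ is given by Eqn 8.121.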
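The weighted straight-line fit can be sketched in Python as follows. The sketch solves the two normal equations that follow from Eqns 8.117 and 8.118 for $y = a + bx$ and also returns the standard deviations of the coefficients; the data points and the function name are hypothetical.

```python
import math

def weighted_line_fit(x, y, sigma):
    """Weighted least-squares fit of y = a + b*x.

    Returns (a, b, sigma_a, sigma_b), assuming no uncertainty in x."""
    w = [1.0 / s ** 2 for s in sigma]
    S = sum(w)
    Sx = sum(wi * xi for wi, xi in zip(w, x))
    Sy = sum(wi * yi for wi, yi in zip(w, y))
    Sxx = sum(wi * xi * xi for wi, xi in zip(w, x))
    Sxy = sum(wi * xi * yi for wi, xi, yi in zip(w, x, y))
    delta = S * Sxx - Sx ** 2          # determinant of the normal equations
    a = (Sxx * Sy - Sx * Sxy) / delta  # intercept
    b = (S * Sxy - Sx * Sy) / delta    # gradient
    sigma_a = math.sqrt(Sxx / delta)
    sigma_b = math.sqrt(S / delta)
    return a, b, sigma_a, sigma_b

# Hypothetical data lying exactly on y = 2 + 3x, with equal uncertainties
x = [0.0, 1.0, 2.0, 3.0]
y = [2.0, 5.0, 8.0, 11.0]
sigma = [0.5, 0.5, 0.5, 0.5]
a, b, sigma_a, sigma_b = weighted_line_fit(x, y, sigma)
print(a, b, sigma_a, sigma_b)
```

With equal $\sigma_i$ the weights cancel in $a$ and $b$, so the same function also covers the unweighted case; only the coefficient uncertainties then depend on the common $\sigma$.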

As above, the least-squares fit to any arbitrary curve, for instance a quadratic or second-degree polynomial, a harmonic, or an exponential, can also be obtained by minimizing the weighted sum of squares ($S$) of deviations between the observed values of $y_i$ and the values computed for the particular $x_i$ from the chosen curve, with the constants of the curve (e.g. $a$ and $b$ above) being the quantities to be optimized:

$$S = \sum_i \left(\frac{\Delta y_i}{\sigma_i}\right)^2 = \sum_i \left(\frac{y_i^{\mathrm{obs}} - y_i^{\mathrm{calc}}}{\sigma_i}\right)^2 \qquad (8.127)$$

Several computer programs are available to obtain the optimized values of the parameters for fitting any arbitrarily chosen curve.

8.5.3 Chi-square test for testing of distributions

Choosing an equation to fit a set of observations implies hypothesizing a particular form for the parent distribution of the dataset. One would like to know if this hypothesis concerning the form of the parent distribution is correct. In fact, it is not possible to give an unambiguous ‘yes’ or ‘no’ answer to this question, but one would still like to state how confidently the hypothesis can either be accepted or rejected.

Distributions are tested by the $\chi^2$ method. When the experimentally observed $y_i^{\mathrm{obs}}$ for each experimental point is normally distributed with mean $y_i^{\mathrm{th}}$ and variance $\sigma_i^2$, the $S$ defined in Eqn 8.127, after replacing $y_i^{\mathrm{calc}}$ with $y_i^{\mathrm{th}}$, is distributed as $\chi^2$ (defined in Section 8.2.3.5.1). Therefore, in order to test a hypothesis that the observed distribution is consistent with the expected distribution, one needs to:

1. Construct $S$ using Eqn 8.127 and minimize it with respect to the free parameters;

2. Determine the number of degrees of freedom ($K$) from:

$$K = n - p \qquad (8.128)$$

where $n$ is the number of data points included in the summation for $S$, and $p$ is the number of free parameters that are varied to arrive at the value of $S_{\min}$;

3. Refer to the relevant probability tables for the given degrees of freedom, $K$, to see if the tabulated value of $\chi^2$ at the chosen significance level is greater than or equal to $S_{\min}$. In such a case, the assumed distribution is likely to be consistent with the dataset.

As the term $\sigma_i^2$ appears in the denominator of Eqn 8.127, the magnitude of errors on individual points also determines whether the hypothesis that the data are consistent with the expected distribution is reasonable.

More useful than the χ2 distribution itself, is:

Py(c) = PK (χ2 > c) (8.129)

which gives the probability that, for the given degrees of freedom, the value of $\chi^2$ will exceed a specified value of $c$. In deciding whether or not to reject a hypothesis, one can make the following two kinds of incorrect decisions.

1. Error of the first kind – In this case the hypothesis is rejected even though it is correct. Thus it is erroneously concluded that the data are inconsistent with the distribution whereas, in fact, they are consistent. This happens in a known fraction of tests, determined by the maximum accepted value of $S_{\min}$. The number of errors of the first kind can be reduced simply by increasing the limit on $S_{\min}$, the cut-off limit beyond which the hypothesis is rejected. But this is liable to increase the number of errors of the second kind and, therefore, some compromise value of the limit must be chosen.

2. Error of the second kind – In this case one fails to reject the hypothesis even though it is false and some other hypothesis may be applicable. This means that one fails to detect that the data are inconsistent with the distribution. This happens because the value of $S_{\min}$ accidentally turns out to be small even though the data are actually inconsistent with the distribution. In general, it is quite difficult to estimate how frequently this effect is likely to occur. It, however, depends on the magnitude of the cut-off used for $S_{\min}$ and on the size of errors on the individual points.

Though arbitrary, as a matter of practice, if the probability of $\chi^2$ exceeding $S_{\min}$ for the given degrees of freedom is less than 5% or 1%, the hypothesis is rejected at the 5% or 1% significance level (i.e. at the 95% or 99% confidence level), respectively; otherwise it is accepted at that level.

Example 8.9. Consider a situation where one is testing whether a set of (say 12) data points follows a linear relation. When the expression $y = a + bx$ is fitted to the data, a value of 20.0 is obtained for $S_{\min}$ using Eqn 8.127. In this case there are 10 degrees of freedom (12 points minus 2, corresponding to the parameters $a$ and $b$). From the table of percentage area in the tails of $\chi^2$ distributions for various degrees of freedom, it may be seen that the probability of getting a value of 20 or more is about 3%. Since this is less than 5% but more than 1%, the linear relation is rejected at the 5% significance level but cannot be rejected at the 1% significance level.
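The tail probability quoted in Example 8.9 can be checked without tables. For an even number of degrees of freedom $K$, the upper-tail probability of the $\chi^2$ distribution has the closed form $e^{-c/2} \sum_{i=0}^{K/2-1} (c/2)^i / i!$. The Python sketch below uses this standard identity; the function name is an arbitrary choice and the routine is restricted to even $K$.

```python
import math

def chi2_tail_even_dof(c, K):
    """P(chi-square with K degrees of freedom > c), valid for even K only."""
    if K <= 0 or K % 2 != 0:
        raise ValueError("closed form used here requires a positive even K")
    half = c / 2.0
    term = 1.0    # (c/2)^0 / 0!
    total = 1.0
    for i in range(1, K // 2):
        term *= half / i      # builds (c/2)^i / i! recursively
        total += term
    return math.exp(-half) * total

# Values of Example 8.9: S_min = 20.0 with K = 12 - 2 = 10
p = chi2_tail_even_dof(20.0, 10)
print(round(p, 4))  # → 0.0293
```

The computed value of about 0.029 agrees with the figure of about 3% read from the tables in Example 8.9.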

8.5.4 Student’s t-test

Student’s t-test is a statistical test of the null hypothesis that the mean values of two normally distributed populations are equal. Given two datasets, each characterized by its own mean, standard deviation, and number of data points, one can use the t-test to determine whether the two mean values are distinct, provided that the underlying distributions can be assumed to be normal.

It has been shown that a t-distribution has a mean equal to 0 and a variance equal to $K/(K - 2)$ for $K > 2$, where $K$ is the degrees of freedom (Eqn 8.96). The variance is always greater than 1, although it is close to unity when there are many degrees of freedom. For infinite degrees of freedom, the t-distribution is the same as the standard normal distribution.

As with the $\chi^2$ distribution, one can define 95%, 99%, or other confidence intervals by using the table of the t-distribution. As mentioned above, the probability of being wrong in rejecting the hypothesis (i.e. making a Type-1 error) equals the pre-assigned significance level (0.05, 0.01, or any other value), which is the area under the tail of the distribution beyond the corresponding critical value; the computed t-test statistic is compared with this critical value. If two random samples of sizes $N_1$ and $N_2$, with sample means $\bar{X}_1$ and $\bar{X}_2$ and sample standard deviations $s_1$ and $s_2$, respectively, are drawn from normal populations whose standard deviations are equal ($\sigma_1 = \sigma_2$), the test statistic $t$ is given by:

$$t = \frac{\bar{X}_1 - \bar{X}_2}{\sigma \sqrt{1/N_1 + 1/N_2}} \quad \text{where} \quad \sigma = \sqrt{\frac{N_1 s_1^2 + N_2 s_2^2}{N_1 + N_2 - 2}} \qquad (8.130)$$

The degrees of freedom of the t-distribution are $K = N_1 + N_2 - 2$.
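A minimal Python sketch of the two-sample statistic of Eqn 8.130 follows. It assumes that the sample variances $s_1^2$ and $s_2^2$ are computed with divisor $N$ (as the factors $N_1 s_1^2$ and $N_2 s_2^2$ in the pooled estimate suggest); the sample values and the function name are hypothetical.

```python
import math

def pooled_t_statistic(sample1, sample2):
    """Two-sample t statistic with a pooled standard deviation.

    Assumption: s^2 is the sample variance with divisor N, so that
    N1*s1^2 + N2*s2^2 is the total sum of squared deviations."""
    n1, n2 = len(sample1), len(sample2)
    mean1 = sum(sample1) / n1
    mean2 = sum(sample2) / n2
    s1_sq = sum((v - mean1) ** 2 for v in sample1) / n1
    s2_sq = sum((v - mean2) ** 2 for v in sample2) / n2
    dof = n1 + n2 - 2
    sigma = math.sqrt((n1 * s1_sq + n2 * s2_sq) / dof)
    t = (mean1 - mean2) / (sigma * math.sqrt(1.0 / n1 + 1.0 / n2))
    return t, dof

# Hypothetical small samples
t, K = pooled_t_statistic([1.0, 2.0, 3.0], [2.0, 3.0, 4.0])
print(t, K)
```

The computed $t$ is then compared with the tabulated critical value for $K$ degrees of freedom at the chosen significance level.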

8.5.5 The F-test

Just as the sampling distribution of the difference in the means ($\bar{X}_1 - \bar{X}_2$) of the two samples is important in some applications, one may also need the sampling distribution of the difference in variances ($s_1^2 - s_2^2$). It, however, turns out that this distribution is rather complicated. One may, therefore, consider the statistic $s_1^2/s_2^2$, since either a large or a small ratio would indicate a large difference between the two, while a ratio close to unity would correspondingly indicate a small difference. The sampling distribution in this case is called the F-distribution.

More precisely, suppose that one has two samples of sizes $N_1$ and $N_2$ drawn from two normal (or approximately normal) populations having variances $\sigma_1^2$ and $\sigma_2^2$. The statistic $F$ can be defined as:

$$F = \frac{\hat{s}_1^2/\sigma_1^2}{\hat{s}_2^2/\sigma_2^2} = \frac{N_1 s_1^2 / [(N_1 - 1)\,\sigma_1^2]}{N_2 s_2^2 / [(N_2 - 1)\,\sigma_2^2]} \quad \text{where} \quad \hat{s}_1^2 = \frac{N_1 s_1^2}{N_1 - 1} \quad \text{and} \quad \hat{s}_2^2 = \frac{N_2 s_2^2}{N_2 - 1} \qquad (8.131)$$


Fig. 8.13 An ensemble of $m$ time series ($X_1, X_2, \ldots, X_m$). The values of $X$ at $t$ and $(t+\tau)$ for any series $k$ are designated by $x_k(t)$ and $x_k(t+\tau)$, respectively. The mean and variance across the $m$ series at $t$ and $(t+\tau)$ are given by $\bar{x}(t)$, $\sigma_t^2$ and $\bar{x}(t+\tau)$, $\sigma_{t+\tau}^2$, respectively. The covariance of $x(t)$ and $x(t+\tau)$ across the $m$ series is designated by $\mathrm{cov}(x_t, x_{t+\tau})$. Redrawn from Yevjevich (1972). © Water Resources Publications.

The degrees of freedom of the F-distribution are $K_1 = (N_1 - 1)$ and $K_2 = (N_2 - 1)$. The PDF of the F-distribution is given by:

$$f(F) = \frac{C\,F^{(K_1/2) - 1}}{(K_1 F + K_2)^{(K_1 + K_2)/2}} \qquad (8.132)$$

where $C$ is a constant depending on $K_1$ and $K_2$ such that the total area under the curve is unity. The percentile values for which the areas in the right-hand tail are 0.05 and 0.01, denoted by $F_{.95}$ and $F_{.99}$ respectively, can be read from the tables corresponding to the appropriate degrees of freedom $K_1$ and $K_2$. These represent the 5% and 1% significance levels respectively, and can be used to determine whether or not the variance of sample 1 is significantly larger than that of sample 2. In practice, the sample with the larger variance is chosen as sample 1.
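The variance-ratio statistic can be sketched in Python as below. The sketch assumes the null hypothesis $\sigma_1^2 = \sigma_2^2$, so that $F$ reduces to the ratio of the two variance estimates with divisor $(N - 1)$, and it places the larger estimate in the numerator as recommended above. The samples and the function name are hypothetical.

```python
def variance_ratio(sample_a, sample_b):
    """Return (F, K1, K2) with the larger variance estimate in the numerator.

    Assumes equal population variances under the null hypothesis, so that
    F is the ratio of the variance estimates with divisor (N - 1)."""
    def variance_estimate(sample):
        n = len(sample)
        mean = sum(sample) / n
        return sum((v - mean) ** 2 for v in sample) / (n - 1)

    var_a = variance_estimate(sample_a)
    var_b = variance_estimate(sample_b)
    if var_a >= var_b:
        return var_a / var_b, len(sample_a) - 1, len(sample_b) - 1
    return var_b / var_a, len(sample_b) - 1, len(sample_a) - 1

# Hypothetical samples; the second has the larger spread
F, K1, K2 = variance_ratio([1.0, 2.0, 3.0, 4.0, 5.0],
                           [2.0, 4.0, 6.0, 8.0, 10.0])
print(F, K1, K2)
```

The resulting $F$ is compared with the tabulated percentile for $(K_1, K_2)$ degrees of freedom at the chosen significance level.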

8.6 Time series analysis

A time series (TS) is an ordered sequence of data, measured/derived typically at discrete time intervals. Unlike analyses of random samples of observations in relation to various statistics, the analysis of TS is based on the assumption that successive values in the data represent consecutive measurements taken at discrete time steps/intervals. If for any reason some time steps are skipped or no data are available for a certain time step (or number of time steps), it may sometimes be possible to interpolate the values of the parameter for the missing time steps. Hydrologic models can also generate TS data.

TS data are usually plotted as a line or bar graph. An ensemble of TS, each of size $N$ in the discrete case, or length $T$ in the continuous case, is a set of such TS. An example is shown in Fig. 8.13. This ensemble may be thought of as the magnitude of a parameter measured at the same place and under the same conditions, but for successive time steps $T$.

Time series are used to describe several aspects of the hydrologic cycle. To a hydrologist or a hydraulic engineer, the TS approach is useful to characterize the resources and/or conditions of a water basin. Hydrologists use TS methodology for displaying the amount of rainfall that has taken place within a catchment on various time scales – during the previous day, past year, or 10 years. As an example, Fig. 8.14 is a time series of annual rainfall over India since 1872. This information, combined with additional TS data (e.g. on El Nino and La Nina years, as given in Fig. 8.14), is used by meteorologists to find cause and effect relationships between the monsoon and the El Nino/La Nina, whereas hydrologists are interested in calculating the amount of storm water runoff and estimating the total discharge from a given catchment. Hydraulic engineers use TS of measured discharge and water level of a river for designing new water control structures/better management of the existing river development projects. TS data are also used for planning any input or withdrawal of water from the river. For example, TS data are used to describe the amount of water that is being released from a sewage treatment plant into a river system or the amount of water being taken out of a river system for supplying an irrigation canal network.

Fig. 8.14 A time series of annual rainfall over India since 1872, plotted as rainfall deviation from the mean (%) against years, with above-average and below-average years and El Nino and La Nina years marked. Redrawn from Kumar (2007).

There are two main objectives of TS analysis: (i) identifying the nature of the phenomenon represented by the sequence of observations; and (ii) forecasting (predicting future values of the TS variable). Both these objectives require that the pattern in the observed TS data is identified and more or less formally described. Once the pattern is identified, one can extrapolate it to predict future events. Time series analysis is also used in many other applications, such as economic and sales forecasting, budgetary analysis, stock market analysis, agricultural yield projections, process and quality control, inventory monitoring, workload projections in industry, utility studies, census analysis, and many more.

As with most other hydrologic data analyses, the TS data are assumed to comprise: (i) a systematic pattern (usually a set of identifiable frequency components); and (ii) random noise (error) components. Most TS analysis techniques involve some form of filtering out of noise in order to make the pattern more conspicuous/prominent. Methods for TS analyses can be divided into two broad classes: (i) time-domain methods; and (ii) frequency-domain methods. A time-domain analysis aims to describe the pattern of the series over time. A frequency-domain analysis, on the other hand, aims to determine the strength/power of the periodicity inherent in the series within each given frequency band over a range of frequencies. In the following we first consider the time-domain methods of TS analysis.

8.6.1 General aspects of time series patterns

Time series patterns, in general, can be described in terms of two basic classes of components: trend and periodicity. The former represents a general systematic linear, or more often nonlinear, component that varies over time and does not repeat, at least within the time range encompassed by most parts of the data. The latter component (i.e. periodicity) may formally have a similar nature, but it repeats itself at periodic time intervals. The terms cyclic or seasonal are often used for components when the series is influenced by periodic factors that recur on a regular basis. Fig. 8.15a shows the various components schematically.

Fig. 8.15 (a) Combining trend, cycles, and noise in a time series; the panels show trend only, cycles only, and noise only, together with the combinations trend + noise, cycles + noise, trend + cycles, and trend + cycles + noise. (b) Some important features to look out for during preliminary analysis of a time series, such as a spike and a step. Redrawn from Wild and Seber (2000). © John Wiley & Sons Inc.

Sometimes the term cyclic is used only when the data exhibit rises and falls that are not of any fixed period or season. The various TS components may co-exist in real-life data. Thus, in terms of its component parts, a TS comprises trend, cycle, seasonal, and irregular (or random error) components and may be written as:

$$x_t = T_t + S_t + C_t + E_t \qquad (8.133)$$

for an additive model and as:

$$x_t = T_t \times S_t \times C_t \times E_t \qquad (8.134)$$

for a multiplicative model. The symbols are: $T_t$ = trend component; $S_t$ = seasonality component; $C_t$ = cyclic component; and $E_t$ = random error component. If the magnitude of the fluctuations in the TS varies with the level of the series, one should opt for a multiplicative model in preference to an additive model. The multiplicative model can be processed as an additive model by taking logarithms of the various terms in the series. This decomposition may enable one to study the TS components separately, to de-trend the series, or to make seasonal adjustments if needed for further analysis.
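The statement that a multiplicative model becomes additive after taking logarithms can be illustrated with a small Python sketch. The component values below are hypothetical and chosen only so that all factors are positive, which the logarithm requires.

```python
import math

# Hypothetical components for eight time steps (illustrative values only)
trend    = [100.0 + 2.0 * t for t in range(8)]
seasonal = [1.10, 0.95, 0.90, 1.05, 1.10, 0.95, 0.90, 1.05]
cyclic   = [1.00, 1.02, 1.03, 1.02, 1.00, 0.98, 0.97, 0.98]
error    = [1.01, 0.99, 1.00, 1.02, 0.98, 1.00, 1.01, 0.99]

def multiplicative_series(T, S, C, E):
    """x_t = T_t * S_t * C_t * E_t (multiplicative model)."""
    return [t * s * c * e for t, s, c, e in zip(T, S, C, E)]

def sum_of_log_components(T, S, C, E):
    """log T_t + log S_t + log C_t + log E_t (additive model in log space)."""
    return [math.log(t) + math.log(s) + math.log(c) + math.log(e)
            for t, s, c, e in zip(T, S, C, E)]

x = multiplicative_series(trend, seasonal, cyclic, error)
log_x = [math.log(v) for v in x]
log_sum = sum_of_log_components(trend, seasonal, cyclic, error)

# The two agree to rounding error, so additive tools apply to log(x_t)
max_diff = max(abs(a - b) for a, b in zip(log_x, log_sum))
print(max_diff)
```

Any method developed for the additive form can therefore be applied to the logarithms of the series, with the results transformed back by exponentiation.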

8.6.1.1 Stationarity and ergodicity

If the properties of a time series do not change with time, the series is called stationary (time-invariant or non-evolutive). If the properties of a sequence do not change with position on the line, the series is called homogeneous (line-invariant, non-evolutive).

A sample of a discrete series of size $N$, or a sample of a continuous series of length $T$, is a realization of the underlying stochastic process. It is also called a sample function. Consider an ensemble of $m$ series, $i = 1, 2, \ldots, m$, each of length $T$ (or size $N$) for a continuous (or discrete) TS, as shown in Fig. 8.13. Let $x(t)$ and $x(t+\tau)$ designate the values of $x$ at $t$ and $(t+\tau)$, respectively. The mean of $x(t)$ at the position $t$ from the origin of each series, determined over the entire ensemble, is:

$$\bar{x}_t = \frac{1}{m} \sum_{i=1}^{m} x_i(t) \qquad (8.135)$$

and that of $x(t+\tau)$ at the position $(t+\tau)$ is:

$$\bar{x}_{t+\tau} = \frac{1}{m} \sum_{i=1}^{m} x_i(t+\tau) \qquad (8.136)$$


The covariance of $x(t)$ and $x(t+\tau)$ is:

$$\mathrm{cov}(x_t, x_{t+\tau}) = \frac{1}{m} \sum_{i=1}^{m} x_i(t)\,x_i(t+\tau) - \bar{x}_t\,\bar{x}_{t+\tau} \qquad (8.137)$$

If the means of $x(t)$ and $x(t+\tau)$, given as $\bar{x}_t$ and $\bar{x}_{t+\tau}$ by Eqn 8.135 and Eqn 8.136, converge to the same population mean, $\mu$, with probability equal to one as $m \to \infty$, regardless of the position of $t$ between zero and $T$, the time series $x(t)$ is said to be stationary in the mean, or first-order stationary.

If the covariance of Eqn 8.137 is independent of position, $t$, but dependent on the lag $\tau$, and is a constant for a given lag, i.e.:

$$\mathrm{cov}(x_t, x_{t+\tau}) = \rho_\tau \sigma_x^2, \quad \text{for } m \to \infty \qquad (8.138)$$

in which $\sigma_x^2$ is the population variance of $x_t$ and $\rho_\tau$ is the population autocorrelation coefficient for lag $\tau$, the TS is said to be stationary in the covariance. For $\tau = 0$, $\rho_\tau = 1$, which implies that the series is at the same time stationary in the variance. When the series is stationary both in the mean and in the covariance, it is said to be second-order stationary. If all higher moments (3rd, 4th, 5th, ...) of $x(t)$, $x(t+\tau_1)$, $x(t+\tau_2)$, ... are independent of $t$ but dependent on the lags $\tau_1, \tau_2, \ldots$, and they all converge with probability equal to 1 to the corresponding higher-order population moments as $m \to \infty$, the series is higher-order stationary. This implies that the series is also stationary in the mean and covariance. Stationarity of this kind is called strong stationarity, or stationarity in the strict sense, as opposed to weak stationarity, when the series is only second-order stationary. In hydrological practice, one is often satisfied with second-order stationarity of TS.
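The ensemble statistics of Eqns 8.135 to 8.137 can be computed directly, as in the Python sketch below. The ensemble is a hypothetical one, generated here as independent standard normal values (white noise), for which the ensemble mean should be close to zero at every position and the lag-one covariance close to zero; the function names are illustrative only.

```python
import random

def ensemble_mean(ensemble, t):
    """Mean across the m series at position t (Eqn 8.135)."""
    return sum(series[t] for series in ensemble) / len(ensemble)

def ensemble_covariance(ensemble, t, lag):
    """Covariance across the m series between positions t and t + lag (Eqn 8.137)."""
    m = len(ensemble)
    cross = sum(series[t] * series[t + lag] for series in ensemble) / m
    return cross - ensemble_mean(ensemble, t) * ensemble_mean(ensemble, t + lag)

# Hypothetical ensemble: m series of N independent standard normal values
random.seed(1)
m, N = 2000, 6
ensemble = [[random.gauss(0.0, 1.0) for _ in range(N)] for _ in range(m)]

means = [ensemble_mean(ensemble, t) for t in range(N)]
variances = [ensemble_covariance(ensemble, t, 0) for t in range(N)]
lag1_cov = [ensemble_covariance(ensemble, t, 1) for t in range(N - 1)]
print(means)
print(variances)
print(lag1_cov)
```

If the means, variances, and lag covariances show no systematic dependence on the position $t$ beyond sampling fluctuation, the ensemble is consistent with second-order stationarity.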

Statistical parameters of each particular series of the ensemble, denoted here as the $i$-series, $i = 1, 2, \ldots, m$, are:

$$\bar{x}_i = \frac{1}{T} \int_0^T x_t\,dt \qquad (8.139)$$

$$\mathrm{cov}(x_t, x_{t+\tau}) = \frac{1}{T - \tau} \int_0^{T-\tau} (x_t - \bar{x}_t)(x_{t+\tau} - \bar{x}_{t+\tau})\,dt \qquad (8.140)$$

in which $x_t$ and $x_{t+\tau}$ refer to the $i$th series only, and similarly for higher-order moments. If for any series $\bar{x}_i \to \mu$ (the population mean, with probability unity for $T \to \infty$), and $\mathrm{cov}(x_t, x_{t+\tau}) = f(\tau) = \rho_\tau \sigma_x^2$ for any $\tau$, even as $m \to \infty$, and similarly for higher-order moments, the series in the ensemble is said to be ergodic. In the opposite case, the series is non-ergodic.

8.6.1.1.1 Tests of stationarity and ergodicity

In practice, both the number of terms in a series and the length of a series are always finite. Statistical tests, therefore, must be performed to ascertain whether the stochastic process is stationary as well as ergodic. The following scheme of an ensemble of discrete series is assumed for discussion of these tests:

$$\begin{array}{cccc|c}
x_{1,1} & x_{1,j} & \ldots & x_{1,N} & \alpha_{1,j} \\
x_{i,1} & x_{i,j} & \ldots & x_{i,N} & \alpha_{i,j} \\
\ldots & \ldots & \ldots & \ldots & \ldots \\
x_{m,1} & x_{m,j} & \ldots & x_{m,N} & \alpha_{m,j} \\
\hline
\alpha_{i,1} & \alpha_{i,j} & \ldots & \alpha_{i,N} &
\end{array}$$

Any parameter $\alpha$ is estimated either across the ensemble of the series as $\alpha_{i,1}, \alpha_{i,2}, \ldots, \alpha_{i,N}$, or along the discrete series as $\alpha_{1,j}, \alpha_{2,j}, \ldots, \alpha_{m,j}$.

Because of sampling fluctuations, neither $\alpha_{i,1}, \alpha_{i,2}, \ldots, \alpha_{i,N}$ nor $\alpha_{1,j}, \alpha_{2,j}, \ldots, \alpha_{m,j}$ are likely to be identical, but they may be shown by a proper test (say the $\chi^2$ test) to be statistically indistinguishable from each other. In the first case, the stochastic process is inferred to be stationary in $\alpha$, and if that is accepted, in the second case the process is inferred to be ergodic in $\alpha$ as well. When hydrologic processes are inherently stationary and ergodic, or made stationary and ergodic by eliminating non-stationarity and non-ergodicity where possible, the mathematical analysis and description become much simpler.

8.6.1.1.2 Self-stationarity

Hydrologic stochastic processes are usually described by a single series. The only test to be performed would then be whether the properties of the series remain unchanged as one considers the higher-order terms. In other words, the test is to ascertain whether parts of the series have the same properties as the total population, within the limits of sampling fluctuation. For example, if the series is divided into two parts and it is shown that the parameters of both parts are not significantly different from each other, both parts are considered as originating from the same process. If this is repeated for still smaller parts and the same results are obtained, the stochastic process is said to be self-stationary. In general, if the sub-series parameters are confined to within 95% tolerance limits around the corresponding value of the parameter for the entire series, the process is inferred to be self-stationary.

Example 8.10. Each year of 12 monthly precipitation values is considered a sub-series, so that n years of record give an ensemble of n series. If the values of mean, variance, covariance, and parameters related to other moments are obtained for each month, it can easily be shown that the hydrologic series of monthly precipitation does not follow a stationary stochastic process. The reason is the presence of a periodic annual component in such series.

8.6.1.2 Trend analysis

If a time series has a monotonic trend (consistently increasing or decreasing), that part of the data analysis is generally not difficult, and the first step in trend identification is often smoothing.

8.6.1.2.1 Smoothing

Smoothing involves some form of local averaging of data such that the non-systematic components of individual observations cancel each other. Thus smoothing removes random variations from the data and brings out trends and cyclic components, if present. The most common technique is moving average smoothing, which replaces each element of the series by either the simple or weighted average of the surrounding n elements, where n is the width of the smoothing ‘window’ (Box and Jenkins 1976):

$$\bar{x}_t = M_t^{[1]} = \frac{x_t + x_{t-1} + x_{t-2} + \cdots + x_{t-n+1}}{n}$$

or

$$M_t^{[1]} = M_{t-1}^{[1]} + \frac{x_t - x_{t-n}}{n} \qquad (8.141)$$

where $M_t^{[1]}$ is the moving average at time t and n is the width, or the number of terms, of the moving average. At each successive time interval, the most recent observation is included and the earliest observation is excluded in computing the average. The simple moving average is intended for series that fluctuate about a constant level, that is, without any trend. For linear or quadratic trends, a double moving average can be calculated. To calculate it, the values $M_t^{[1]}$ are simply treated as individual data points over time and a moving average of these averages is obtained. As an example, simple decadal moving average values of the annual rainfall series of India are shown in Fig. 8.14.

Median values can be used instead of mean values. The main advantage of median smoothing, as compared to moving average smoothing, is that its results are less biased by outliers (within the smoothing window). Thus, if there are outliers in the data (e.g. due to measurement errors), median smoothing typically produces smoother or at least more ‘reliable’ curves than a moving average based on the same window width. The main disadvantage of median smoothing is that, in the absence of clear outliers, it may produce more ‘jagged’ curves than the moving average, and it does not allow for weighting. There are several other moving average models in use.
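The two forms of Eqn 8.141 and the median alternative can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function names and the rainfall values (including the deliberate outlier of 2400 mm) are hypothetical, and a trailing window of width n is assumed.

```python
from statistics import mean, median

def moving_average(x, n):
    """Trailing simple moving average of width n (direct form of Eqn 8.141)."""
    return [mean(x[t - n + 1 : t + 1]) for t in range(n - 1, len(x))]

def moving_average_recursive(x, n):
    """Same averages via the update M_t = M_{t-1} + (x_t - x_{t-n}) / n."""
    out = [sum(x[:n]) / n]
    for t in range(n, len(x)):
        out.append(out[-1] + (x[t] - x[t - n]) / n)
    return out

def moving_median(x, n):
    """Trailing moving median of width n (no weighting possible)."""
    return [median(x[t - n + 1 : t + 1]) for t in range(n - 1, len(x))]

# Hypothetical annual rainfall (mm); the 2400 mm value mimics a recording error
rain = [820, 790, 845, 810, 2400, 805, 830, 795, 815, 840]

ma = moving_average(rain, 3)            # pulled upwards around the outlier
mr = moving_average_recursive(rain, 3)  # identical to ma up to rounding
md = moving_median(rain, 3)             # largely unaffected by the outlier
```

The recursive form needs only one addition and one subtraction per step, which matters for long series; the median version shows why outliers have less influence on the smoothed curve.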

8.6.1.2.2 Fitting a function

Many monotonic TS data can be adequately approximated by a linear function; if there is a clear monotonic nonlinear component, the data first need to be transformed to remove the nonlinearity. Usually a logarithmic, exponential, or (less often) a polynomial function can be used for this purpose.

8.6.1.3 Analysis of seasonality

Cyclic dependence (also called seasonality, by analogy with the seasons) is another general component of the TS pattern. The concept is mathematically defined as correlational dependence of order τ between each tth element of the series and the (t + τ)th element, as measured by autocorrelation (i.e. a correlation between the two terms); τ is usually called the lag. If the measurement error is small, seasonality can be visually identified in the series as a pattern that repeats after every k elements (Fig. 8.15a).

The correlation between the pair $(x_t, x_{t+\tau})$ is given by the autocorrelation function (ACF), obtained as:

$$r_\tau = \frac{\sum_{t=1}^{n-\tau} (x_t - \bar{x})(x_{t+\tau} - \bar{x})}{\sum_{t=1}^{n} (x_t - \bar{x})^2} \qquad (8.142)$$

It ranges from −1 to +1. The correlogram (autocorrelogram) displays graphically (as bar charts) and numerically the $r_\tau$ for consecutive lags in a specified range of lags, as shown in Fig. 8.16. The degree of autocorrelation is of primary interest, as one is usually interested only in very strong (and thus highly significant) autocorrelations.
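As an illustration, the sample ACF of Eqn 8.142 can be computed directly from its definition. The sketch below assumes that the numerator sum runs over the n − τ available pairs; the function name `acf` and the synthetic 12-step sine series are illustrative choices, not part of the original text.

```python
import math

def acf(x, max_lag):
    """Sample autocorrelation r_0 ... r_max_lag following Eqn 8.142."""
    n = len(x)
    xbar = sum(x) / n
    denom = sum((v - xbar) ** 2 for v in x)
    return [
        sum((x[t] - xbar) * (x[t + lag] - xbar) for t in range(n - lag)) / denom
        for lag in range(max_lag + 1)
    ]

# Synthetic series with a pure 12-step cycle (e.g. monthly data, 10 years)
series = [math.sin(2 * math.pi * t / 12) for t in range(120)]
r = acf(series, 12)
# r[6] is strongly negative (half a cycle apart), r[12] strongly positive
```

For a periodic series the correlogram itself oscillates with the period of the cycle, which is how seasonality shows up in a plot such as Fig. 8.16.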

A partial autocorrelation function (PACF) is used to denote the degree of association between $x_t$ and $x_{t+\tau}$ when the cyclic effects of other time lags 1, 2, …, τ − 1 are already removed. In other words, the partial autocorrelation is similar to autocorrelation except that, when calculating it, the (auto)correlations with all the elements within the lag period are removed.

Cyclic dependence for a particular lag τ can be removed by differencing the series, that is, converting each (t + τ)th element of the series into its difference from the tth element. There are two major reasons for effecting such a transformation.

First, one can identify the nature of hidden seasonal dependence in the series. Because autocorrelations for consecutive lags are interdependent, removing some of the autocorrelations modifies other autocorrelation values, that is, it may eliminate them or make some other seasonality more apparent.

The other reason for removing seasonal dependence is to make the series stationary, which is necessary for ARIMA and other techniques, as discussed subsequently (Section 8.6.2).
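Differencing at a seasonal lag can be illustrated with a short sketch. The helper name `difference` and the synthetic monthly series (a linear trend of 0.5 units per month plus a 12-month sine cycle) are assumptions made for the example only.

```python
import math

def difference(x, lag=1):
    """Return x_t - x_{t-lag} for t = lag, ..., len(x) - 1."""
    return [x[t] - x[t - lag] for t in range(lag, len(x))]

# Hypothetical monthly series: linear trend plus a 12-month cycle
series = [0.5 * t + 10.0 * math.sin(2 * math.pi * t / 12) for t in range(60)]

# Lag-12 differencing cancels the cycle; the trend leaves a constant 12 * 0.5
deseason = difference(series, lag=12)
```

Here the differenced series is constant (6 units), so both the cyclic dependence and the changing mean have been removed, at the cost of losing the first 12 values.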

[Figure 8.16: four correlogram panels, each plotted against lag (0 to 25) with a vertical axis from −1 to 1: (a) ACF, (b) PACF, (c) ACF, (d) PACF.]

Fig. 8.16 (a) Autocorrelation Function (ACF) showing significant values for a large number of lags. However, the autocorrelations at lag 2 and above are possibly due to propagation of the autocorrelation at lag 1. This is confirmed by the Partial Autocorrelation Function (PACF) plot in (b) that shows a significant spike only at lag 1, implying that all the higher-order autocorrelations are effectively explained by the lag 1 autocorrelation, suggesting one order of non-seasonal differencing, that is, ARIMA (0, 1, 0) for making the series stationary; (c) ACF obtained for the series of (a) by applying the ARIMA (0, 1, 0) model; (d) PACF of (c). Note from (c) that the correlation at lag 1 is significant and positive, and from (d) that the PACF shows a sharper ‘cut-off’ than the ACF. In particular, the PACF has only two significant spikes, while the ACF has four. Redrawn from http://www.duke.edu/~rnau/411arim3.htm.

8.6.1.3.1 Hypothesis test on r1

The first-order autocorrelation coefficient is especially important because dependence of physical systems on past values is likely to be strongest for the most recent past. The first-order autocorrelation coefficient, $r_1$, can be tested against the null hypothesis that the corresponding population value $\rho_1 = 0$. The critical value of $r_1$ for a given significance level (e.g. 95%) depends on whether the test is one-tailed or two-tailed. For the one-tailed test, the alternative hypothesis is usually that the true first-order autocorrelation is greater than zero:

$$H: \rho_1 > 0 \qquad (8.143)$$

For the two-tailed test, the alternative hypothesis is that the true first-order autocorrelation is different from 0, with no specification as to positive or negative:

$$H: \rho_1 \neq 0 \qquad (8.144)$$

Which alternative hypothesis to use depends on the problem. If there is some reason to expect positive autocorrelation (e.g. with tree rings, from carryover food storage in trees), the one-sided test is appropriate. Otherwise, the two-sided test should be used.

For the one-sided test, the World Meteorological Organization (WMO 1966) recommends that the 95% significance level for $r_1$ be computed by using the equation:

$$r_1(95\%) = \frac{-1 + 1.645\sqrt{n - 2}}{n - 1} \qquad (8.145)$$

where n is the sample size. More generally, following Salas et al. (1980), the probability limits on the correlogram of an independent series are:

$$r_\tau(95\%) = \frac{-1 + 1.645\sqrt{n - \tau - 1}}{n - \tau} \quad \text{for one sided} \qquad (8.146)$$

$$r_\tau(95\%) = \frac{-1 \pm 1.96\sqrt{n - \tau - 1}}{n - \tau} \quad \text{for two sided} \qquad (8.147)$$

where n is the sample size and τ is the lag.
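The limits of Eqns 8.145 to 8.147 are easy to evaluate numerically. The following sketch uses a hypothetical helper, `r_critical`, and an assumed record length of n = 100; it simply codes the formulas above.

```python
import math

def r_critical(n, lag=1, two_sided=False):
    """95% limits for the lag-`lag` autocorrelation of an independent series.

    One-sided: upper limit of Eqn 8.146 (Eqn 8.145 when lag = 1).
    Two-sided: (lower, upper) limits of Eqn 8.147.
    """
    z = 1.96 if two_sided else 1.645
    root = math.sqrt(n - lag - 1)
    upper = (-1 + z * root) / (n - lag)
    if two_sided:
        lower = (-1 - z * root) / (n - lag)
        return lower, upper
    return upper

one_sided = r_critical(100)                     # about 0.154 for n = 100
lower, upper = r_critical(100, two_sided=True)  # about (-0.206, 0.186)
```

A sample $r_1$ larger than the one-sided limit would lead to rejection of the null hypothesis $\rho_1 = 0$ in favour of positive autocorrelation. Note that the two-sided limits are not symmetric about zero because of the −1 in the numerator.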

8.6.1.3.2 Effective sample size

If a time series of length N is autocorrelated, the number of independent observations is fewer than N. Essentially, the series is not random in the time domain, and the information in each observation is not totally independent of the information in other observations. The reduction in the number of independent observations has implications for hypothesis testing.

Some standard statistical tests that depend on the assumption of random samples can still be applied to a time series, despite the autocorrelation in the series. The way of circumventing the problem of autocorrelation is to adjust the sample size for autocorrelation. The number of independent samples after adjustment is fewer than the number of observations of the series.

Below is an equation for computing the so-called ‘effective’ sample size, or sample size adjusted for autocorrelation. More on the adjustment can be found elsewhere (Dawdy and Matalas 1964; WMO 1966). The equation was derived based on the assumption that the autocorrelation in the series represents first-order autocorrelation (dependence on lag 1 only). In other words, the governing process is first-order autoregressive, or Markovian. Computation of the effective sample size requires only the sample size and the first-order sample autocorrelation coefficient. The ‘effective’ sample size is given by:

$$n' = n\,\frac{1 - r_1}{1 + r_1} \qquad (8.148)$$

where n is the sample size; n′ is the effective sample size; and $r_1$ is the first-order autocorrelation coefficient. For example, a series with a sample size of 100 years and a first-order autocorrelation of 0.50 has an adjusted sample size of:

$$n' = 100\,\frac{1 - 0.5}{1 + 0.5} = 100 \times \frac{0.5}{1.5} \approx 33 \text{ years} \qquad (8.149)$$
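Eqn 8.148 translates into a one-line function. The name `effective_sample_size` is illustrative, and the additional values of $r_1$ are included only to show how quickly the effective record length shrinks as persistence increases.

```python
def effective_sample_size(n, r1):
    """Sample size adjusted for lag-1 autocorrelation (Eqn 8.148)."""
    return n * (1.0 - r1) / (1.0 + r1)

n_eff = effective_sample_size(100, 0.5)   # the worked case of Eqn 8.149

# Effect of increasing persistence on a 100-year record
table = {r1: effective_sample_size(100, r1) for r1 in (0.0, 0.2, 0.5, 0.8)}
```

With $r_1 = 0$ the adjustment leaves the sample size unchanged, as expected for an independent series.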

8.6.2 ARIMA modelling

The modelling and forecasting procedures discussed above necessitate knowledge of the mathematical model of the process. However, in real-life situations, patterns in the data are often not clear and individual observations may involve considerable errors. One still needs not only to uncover the hidden patterns in the data but also to generate forecasts. The ARIMA methodology developed by Box and Jenkins (1976) enables one to do just this.

The following section introduces the basic ideas of this methodology.

8.6.2.1 Two commonly employed processes

8.6.2.1.1 Autoregressive process

Most TS consist of elements that are serially dependent in the sense that one can estimate a coefficient or a set of coefficients that describe elements of the series from specific, time-lagged (previous) elements. This can be summarized by the equation:

$$x_t = \xi + \phi_1 x_{t-1} + \phi_2 x_{t-2} + \phi_3 x_{t-3} + \cdots + \phi_p x_{t-p} + \varepsilon_t \qquad (8.150)$$

where ξ is a constant, usually equal to the mean; and $\phi_1, \phi_2, \phi_3, \ldots, \phi_p$ are the autoregressive model parameters. Expressed in words, each observation consists of a random error component $\varepsilon_t$ (also called random shock) and a linear combination of prior observations. The value of p is called the order of the AR model. AR models can be analysed by one of various methods, including standard linear least squares techniques, and have a straightforward interpretation.

Some formulations transform the series by subtracting the mean of the series from each data point. This yields a series with a mean value of zero. With this transformation, Eqn 8.150 can be rewritten as:

$$x_t = \phi_1 x_{t-1} + \phi_2 x_{t-2} + \phi_3 x_{t-3} + \cdots + \phi_p x_{t-p} + \varepsilon_t \qquad (8.151)$$

An autoregressive process is stable if the parameters lie within a certain range; for example, if there is only one autoregressive parameter, then it must lie within the interval $-1 < \phi_1 < 1$. Otherwise, past effects would accumulate and the values of successive $x_t$ would move towards infinity, that is, the series would not be stationary. If there is more than one autoregressive parameter, similar (general) restrictions on the parameter values can be specified.
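The stability condition can be made concrete by simulating the first-order case of Eqn 8.151 inside and outside the interval $-1 < \phi_1 < 1$. This is only a sketch: the function name `simulate_ar1`, the Gaussian shocks, the seed, and the two parameter values are assumptions chosen for illustration.

```python
import random

def simulate_ar1(phi, n, sigma=1.0, seed=0):
    """Generate n values of x_t = phi * x_{t-1} + eps_t, starting from x_0 = 0."""
    rng = random.Random(seed)
    x = [0.0]
    for _ in range(n - 1):
        x.append(phi * x[-1] + rng.gauss(0.0, sigma))
    return x

stable = simulate_ar1(0.5, 500)      # fluctuates about zero with bounded spread
explosive = simulate_ar1(1.05, 500)  # past shocks accumulate and grow without bound

largest_stable = max(abs(v) for v in stable)
largest_explosive = max(abs(v) for v in explosive)
```

With $\phi_1 = 0.5$ the influence of a shock halves at every step, whereas with $\phi_1 = 1.05$ it grows by 5% per step, so after a few hundred steps the series is many orders of magnitude larger than the shocks that drive it.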

8.6.2.1.2 Moving average process

Independent of the autoregressive process, each element in the series can also be affected by the past error (or random shock) that cannot be accounted for by the autoregressive component, i.e.:

$$x_t = \mu + \varepsilon_t - \theta_1 \varepsilon_{t-1} - \theta_2 \varepsilon_{t-2} - \theta_3 \varepsilon_{t-3} - \cdots - \theta_q \varepsilon_{t-q} \qquad (8.152)$$

where µ is the series mean and $\theta_1, \theta_2, \theta_3, \ldots, \theta_q$ are the moving average model parameters. This means that each observation is made up of a random error component (random shock, ε) and a linear combination of prior random shocks. Thus, a moving average model is conceptually a linear regression of the current value of the series against the white noise or random shocks of one or more prior values of the series. The value of q is called the order of the MA model.

If the series is transformed by subtracting the mean, Eqn 8.152 can be rewritten as:

$$x_t = \varepsilon_t - \theta_1 \varepsilon_{t-1} - \theta_2 \varepsilon_{t-2} - \theta_3 \varepsilon_{t-3} - \cdots - \theta_q \varepsilon_{t-q} \qquad (8.153)$$

Without going into detail, there is a ‘duality’ between the moving average process and the autoregressive process, that is, the moving average equation given above can be rewritten (inverted) into an autoregressive form (of infinite order). However, analogous to the stationarity condition described for the autoregressive model, this can only be done if the moving average parameters satisfy certain conditions, i.e. if the model is invertible. Otherwise, the infinite-order autoregressive representation does not converge.
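As a worked illustration of this duality, consider the first-order case of Eqn 8.153 with a single parameter θ. The derivation below is a sketch obtained by repeated substitution; it is not taken from the original text, but it shows where the invertibility condition comes from.

```latex
\begin{aligned}
x_t &= \varepsilon_t - \theta\,\varepsilon_{t-1},
\qquad \varepsilon_{t-1} = x_{t-1} + \theta\,\varepsilon_{t-2} \\
x_t &= \varepsilon_t - \theta\,x_{t-1} - \theta^{2}\varepsilon_{t-2} \\
    &= \varepsilon_t - \theta\,x_{t-1} - \theta^{2}x_{t-2} - \theta^{3}\varepsilon_{t-3} \\
    &\;\;\vdots \\
x_t &= \varepsilon_t - \sum_{j=1}^{k} \theta^{j} x_{t-j} - \theta^{k+1}\varepsilon_{t-k-1}
\end{aligned}
```

The remainder term $\theta^{k+1}\varepsilon_{t-k-1}$ vanishes as $k \to \infty$ only if $|\theta| < 1$. In that case the MA(1) process is equivalent to an autoregressive process of infinite order with coefficients $-\theta^{j}$, which is the invertibility condition for this simplest case.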

8.6.2.1.3 Lag operator

In TS analysis, the lag operator or backshift operator operates on an element of a TS to produce the previous element. For example, given a TS $x = \{x_1, x_2, \ldots\}$:

$$L\,x_t = x_{t-1} \quad \text{for all } t > 1 \qquad (8.154)$$

where L is the lag operator. Sometimes the symbol B, for backshift, is used. Note that the lag operator can be raised to arbitrary integer powers so that:

$$L^{-1} x_t = x_{t+1} \quad \text{and} \quad L^{k} x_t = x_{t-k} \qquad (8.155)$$

Polynomials of the lag operator can also be used, and this is a common notation for ARMA models.

For example:

$$\varepsilon_t = x_t - \sum_{i=1}^{p} \phi_i x_{t-i} = \left(1 - \sum_{i=1}^{p} \phi_i L^{i}\right) x_t \qquad (8.156)$$

specifies an AR(p) model.

A polynomial of the lag operator is called a lag polynomial so that, for example, the ARMA model can be mathematically specified as:

$$\phi\, x_t = \theta\, \varepsilon_t \qquad (8.157)$$

where φ and θ, respectively, represent the lag polynomials:

$$\phi = 1 - \sum_{i=1}^{p} \phi_i L^{i} \quad \text{and} \quad \theta = 1 + \sum_{i=1}^{q} \theta_i L^{i} \qquad (8.158)$$

In TS analysis, the first difference operator Δ is a special case of a lag polynomial:

$$\Delta x_t = x_t - x_{t-1} = (1 - L)\,x_t \qquad (8.159)$$

Similarly, the second difference operator:

$$
\begin{aligned}
\Delta(\Delta x_t) &= \Delta x_t - \Delta x_{t-1} \\
\Delta^{2} x_t &= (1 - L)\,\Delta x_t \\
&= (1 - L)(1 - L)\,x_t \\
&= (1 - L)^{2} x_t
\end{aligned}
\qquad (8.160)
$$

The above approach can be generalized to give the ith difference operator as:

$$\Delta^{i} x_t = (1 - L)^{i} x_t \qquad (8.161)$$
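The operator identity for the second difference can be checked numerically: expanding $(1 - L)^2 = 1 - 2L + L^2$ gives $x_t - 2x_{t-1} + x_{t-2}$. In the sketch below the helper name `diff` and the quadratic test series are arbitrary choices made for illustration.

```python
def diff(x, d=1):
    """Apply the first difference operator (1 - L) to the list x, d times."""
    for _ in range(d):
        x = [x[t] - x[t - 1] for t in range(1, len(x))]
    return x

x = [t ** 2 for t in range(8)]          # 0, 1, 4, 9, ... (quadratic trend)

first = diff(x, 1)                      # linear: 1, 3, 5, ...
second = diff(x, 2)                     # constant: 2, 2, 2, ...

# Expanded form of (1 - L)^2 x_t = x_t - 2 x_{t-1} + x_{t-2}
expanded = [x[t] - 2 * x[t - 1] + x[t - 2] for t in range(2, len(x))]
```

Each pass of differencing shortens the series by one value and lowers the degree of a polynomial trend by one, which is why a quadratic trend needs two passes.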

8.6.2.2 Autoregressive integrated moving average (ARIMA) model

The acronym ARIMA stands for ‘Auto-Regressive Integrated Moving Average.’ Lags of the differenced series appearing in the forecasting equation are called ‘auto-regressive’ terms, lags of the forecast errors are called ‘moving average’ terms, and a time series that needs to be differenced to make it stationary is said to be an ‘integrated’ version of a stationary series.

The non-seasonal ARIMA model introduced by Box and Jenkins (1976) includes three types of parameters in the model – the autoregressive parameters (p), the number of differencing passes (d), and the moving average parameters (q). In the notation introduced by Box and Jenkins, models are summarized as ARIMA (p, d, q); thus, for example, a model described as (0, 1, 2) contains 0 (zero) autoregressive parameters and 2 moving average parameters that are computed for the series after it is differenced once. For instance, given a TS process $x_t$, a first-order autoregressive process is denoted by ARIMA (1, 0, 0) or simply AR (1) and is given by:

$$x_t = \phi_1 x_{t-1} + \varepsilon_t \qquad (8.162)$$

and the first-order moving average process is denoted by ARIMA (0, 0, 1) or simply MA (1) and is given by:

$$x_t = \varepsilon_t - \theta_1 \varepsilon_{t-1} \qquad (8.163)$$

The ultimate model may be a combination of these processes and of higher orders as well. Thus, an ARMA (p, q) process defined by Eqn 8.157 can be expanded as:

$$\left(1 - \sum_{i=1}^{p} \phi_i L^{i}\right) x_t = \left(1 + \sum_{i=1}^{q} \theta_i L^{i}\right) \varepsilon_t \qquad (8.164)$$

where L is the lag operator, the $\phi_i$ are the parameters of the autoregressive part of the model, the $\theta_i$ are the parameters of the moving average part, and the $\varepsilon_t$ are the error terms. The error terms $\varepsilon_t$ are generally assumed to be independent, identically distributed variables sampled from a normal distribution with zero mean.

An ARIMA (p, d, q) process is obtained by integrating an ARMA (p, q) process, i.e.:

$$\left(1 - \sum_{i=1}^{p} \phi_i L^{i}\right) (1 - L)^{d} x_t = \left(1 + \sum_{i=1}^{q} \theta_i L^{i}\right) \varepsilon_t \qquad (8.165)$$

where d is a non-negative integer that controls the level of differencing (if d = 0, this model is equivalent to an ARMA model). Conversely, applying term-by-term differencing d times to an ARMA (p, q) process gives an ARIMA (p, d, q) process. Note that it is only necessary to difference the AR side of the ARMA representation, because the MA component is always I(0).

It should be noted that not all choices of parameters produce well-behaved models. In particular, if the model is required to be stationary, then conditions on these parameters must be satisfied.

Some well-known special cases arise naturally. For example, an ARIMA (0, 1, 0) model is given by:

$$x_t = x_{t-1} + \varepsilon_t \qquad (8.166)$$

which is simply a random walk model. A number of variations of the ARIMA model are commonly used.
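The random walk of Eqn 8.166 also illustrates what ‘integrated’ means: the series is the running sum of its shocks, and one pass of differencing recovers them. The sketch below assumes Gaussian shocks and a starting level of zero; the function name `random_walk` and the seed are arbitrary.

```python
import random

def random_walk(n, sigma=1.0, seed=1):
    """Simulate x_t = x_{t-1} + eps_t (ARIMA (0, 1, 0)) from a zero start."""
    rng = random.Random(seed)
    shocks = [rng.gauss(0.0, sigma) for _ in range(n)]
    x, level = [], 0.0
    for eps in shocks:
        level += eps            # integration: accumulate the shocks
        x.append(level)
    return x, shocks

x, shocks = random_walk(200)

# First differencing (d = 1) undoes the integration and returns the shocks
recovered = [x[t] - x[t - 1] for t in range(1, len(x))]
```

The walk itself wanders without a fixed mean, whereas the differenced series is plain white noise and therefore stationary.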

ARIMA models can be extended to include seasonal autoregressive and seasonal moving average terms. Although this complicates the notation and mathematics involved in the model, the underlying concepts for seasonal autoregressive and seasonal moving average terms are similar to those for the non-seasonal autoregressive and moving average terms.

The most general form of the ARIMA model includes difference operators, autoregressive terms, moving average terms, seasonal difference operators, seasonal autoregressive terms, and seasonal moving average terms. In general, as with any modelling approach, only necessary terms should be included in the model.

8.6.2.2.1 Identification

As already mentioned, the input series for ARIMA needs to be stationary, that is, it should have a constant mean, variance, and autocorrelation over time. Therefore, the series usually first needs to be differenced until it is stationary (this also often requires log-transforming the data to stabilize the variance). The number of times the series needs to be differenced to achieve stationarity is reflected in the d parameter. In order to determine the necessary level of differencing, one should examine the plot of the data and the autocorrelogram (ACF). Significant changes in level (large upward or downward changes) usually require first-order differencing (lag 1); large changes of slope usually require second-order differencing. Seasonal patterns require respective seasonal differencing. If the estimated autocorrelation coefficients decline slowly at longer lags, first-order differencing is usually needed. However, one should keep in mind that some TS may require little or no differencing, and that an over-differenced series produces less stable coefficient estimates.

At this stage (usually called the identification phase) one also needs to decide how many autoregressive (p) and moving average (q) parameters are necessary to yield an effective but still parsimonious model (meaning a model with the least number of parameters and the largest number of degrees of freedom) of the process represented by the TS data.

The major tools used in the identification phase are plots of the series, correlograms of the ACF, and the partial autocorrelation function (PACF). The decision is not straightforward and in some cases may require not only experience but also a good deal of experimentation with alternative models (as well as with the technical parameters of ARIMA). However, the majority of empirical time series patterns can be sufficiently well approximated using one of the five basic models that can be identified based on the shape of the ACF and the PACF. The following brief summary is based on the practical recommendations of Pankratz (1983). Also, since the number of parameters (to be estimated) of each kind does not normally exceed two, it is often practical to try alternative models on the same data:

One autoregressive (p) parameter: ACF – exponential decay; PACF – spike at lag 1; no correlation for other lags.

Two autoregressive (p) parameters: ACF – a sine-wave shape pattern or a set of exponential decays; PACF – spikes at lags 1 and 2, and no correlation for other lags.

One moving average (q) parameter: ACF – spike at lag 1, and no correlation for other lags; PACF – damps out exponentially.

Two moving average (q) parameters: ACF – spikes at lags 1 and 2, and no correlation for other lags; PACF – a sinusoidal pattern or a set of exponential decays.

One autoregressive (p) and one moving average (q) parameter: ACF – exponential decay starting at lag 1; PACF – exponential decay starting at lag 1.
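The first and third patterns in the list above can be reproduced from theoretical autocorrelations. The sketch below converts an ACF into a PACF with the Durbin-Levinson recursion, a standard algorithm that the original text does not mention and that is used here only for illustration. It relies on two textbook results: for an AR(1) process $\rho_k = \phi_1^k$, and for an MA(1) process written as in Eqn 8.163, $\rho_1 = -\theta_1 / (1 + \theta_1^2)$ with zero autocorrelation at higher lags. The function name `pacf_from_acf` and the parameter values are hypothetical.

```python
def pacf_from_acf(rho):
    """Partial autocorrelations for lags 1..K from rho[0..K] (Durbin-Levinson)."""
    K = len(rho) - 1
    pacf, prev = [], []
    for k in range(1, K + 1):
        if k == 1:
            phi_kk = rho[1]
            cur = [phi_kk]
        else:
            num = rho[k] - sum(prev[j] * rho[k - 1 - j] for j in range(k - 1))
            den = 1.0 - sum(prev[j] * rho[j + 1] for j in range(k - 1))
            phi_kk = num / den
            cur = [prev[j] - phi_kk * prev[k - 2 - j] for j in range(k - 1)]
            cur.append(phi_kk)
        pacf.append(phi_kk)
        prev = cur
    return pacf

# AR(1) with phi_1 = 0.6: the ACF decays exponentially
rho_ar1 = [0.6 ** k for k in range(6)]
pacf_ar1 = pacf_from_acf(rho_ar1)       # spike at lag 1, then (numerically) zero

# MA(1) with theta_1 = 0.5: the ACF has a single spike at lag 1
theta = 0.5
rho_ma1 = [1.0, -theta / (1 + theta ** 2), 0.0, 0.0, 0.0, 0.0]
pacf_ma1 = pacf_from_acf(rho_ma1)       # damps out gradually
```

In practice the same comparison is made on sample correlograms, where sampling noise blurs the picture, so the limits of Eqns 8.146 and 8.147 are needed to judge which spikes are significant.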

8.6.2.2.2 Estimation and forecasting

In the next step, the parameters are estimated (using function minimization procedures) so that the sum of the squares of the residuals is minimized. The estimates of the parameters are used in the last stage (i.e. forecasting) for calculation of new values of the series (beyond those included in the input dataset) together with the confidence intervals for the predicted values. The estimation process is performed on the transformed (differenced) data. Before the forecasts are generated, the series needs to be integrated (integration is the inverse of differencing) so that the values generated by forecasts are compatible with the input data. This automatic integration feature is represented by including the letter ‘I’ in the name of the methodology (ARIMA = Auto-Regressive Integrated Moving Average).

In addition to the standard autoregressive and moving average parameters, ARIMA models may also include a constant, as described above (Eqn 8.150 and Eqn 8.152). The interpretation of the (statistically significant) constant depends on the model that fits the given data well. Specifically: (i) if there are no autoregressive parameters in the model, then the expected value of the constant is µ, the mean of the series; (ii) if there are autoregressive parameters in the series, then the constant represents the intercept. If the series is differenced, then the constant represents the mean or intercept of the differenced series. For example, if the series is differenced once, and there are no autoregressive parameters in the model, then the constant represents the mean of the differenced series and, therefore, the slope of the linear trend of the un-differenced series.

8.6.3 Single spectrum (Fourier) analysis

Spectrum analysis is a frequency-domain method of TS analysis and is concerned with the exploration of cyclical patterns of data. The purpose of the analysis is to decompose a complex time series with cyclical components into a few underlying sinusoidal (sine and cosine) functions of particular frequencies. In essence, performing spectrum analysis on a TS is like putting white light through a prism in order to identify the wavelengths and the underlying cyclic components. As a result of a successful analysis, one might uncover just a few recurring cycles of different lengths in the time series of interest, which at first sight would have looked more or less like random noise. A frequently cited example of spectrum analysis is the cyclic nature of sunspot activity. It turns out that sunspot activity has a dominant 11-year cycle. Other examples to demonstrate this technique include analysis of celestial phenomena, weather patterns, fluctuations in commodity prices, economic activities, etc. To contrast this technique with ARIMA, the purpose of spectrum analysis is to identify the seasonal fluctuations of different lengths, while in the ARIMA model, the length of the seasonal component is usually known (or guessed) a priori and then included in some theoretical model of moving averages or autocorrelations.

A sine or cosine function is typically expressed in terms of the number of cycles per unit time (frequency), often denoted by the Greek letter ν (nu) (some textbooks use f). Thus, if the unit of time is 1 year and n cycles are observed per year, then ν = n (for monthly cycles n = 12, i.e. ν = 12 cycles per year). Of course, there are other likely cycles with different frequencies. For example, there might be annual cycles (ν = 1) and weekly cycles (ν = 52 cycles per year).

The period T of a sine or cosine function is defined as the length of time required for one full cycle. Thus, it is the reciprocal of the frequency, that is, T = 1/ν. The period of a monthly cycle, expressed in years, is equal to 1/12 = 0.0833. In other words, there is a period in the series of length 0.0833 years.

One way to decompose the original series into its component sine and cosine functions of different frequencies is to formulate the problem as a linear multiple regression model, in which the dependent variable is the observed time series and the independent variables are the sine and cosine functions of all possible (discrete) frequencies. Such a model may be written as:

$$x_t = a_0 + \sum_{k=1}^{\infty} \left[ a_k \cos\left(\frac{\pi k}{L}\,t\right) + b_k \sin\left(\frac{\pi k}{L}\,t\right) \right] \qquad (8.167)$$

where the basic approach is to represent any function $x_t$ of period 2L as a combination of periodic components, namely sines and cosines.

A term-by-term integration between negative and positive L values allows one to determine the coefficients of Eqn 8.167:

$$
\begin{aligned}
a_0 &= \frac{1}{2L} \int_{-L}^{L} x_t \, dt \\
a_k &= \frac{1}{L} \int_{-L}^{L} x_t \cos\left(\frac{\pi k}{L}\,t\right) dt \\
b_k &= \frac{1}{L} \int_{-L}^{L} x_t \sin\left(\frac{\pi k}{L}\,t\right) dt
\end{aligned}
\qquad (8.168)
$$

Eqn 8.167 is known as the Fourier series of $x_t$, with Eqn 8.168 giving its Fourier coefficients. Series with $b_k = 0$ for all k are known as Fourier cosine series, whereas series with $a_k = 0$ for all k are known as Fourier sine series. One can thus estimate any periodic function $x_t$ as a combination of sine and cosine functions, with the coefficients given by Eqn 8.168.

Example 8.11. Assume that we have monthly means and that we average over all years to obtain $\bar{f}_n$, the monthly averages corresponding to each month of the year. Δt is then 1 month, and N = 12. Typically, we can represent the annual cycle with at least two harmonics, to allow for a lack of symmetry between winter and summer:

$$
f_n^{aC} = a_0 + a_1 \cos\left(\frac{2\pi}{12/1}\,n\right) + b_1 \sin\left(\frac{2\pi}{12/1}\,n\right) + a_2 \cos\left(\frac{2\pi}{12/2}\,n\right) + b_2 \sin\left(\frac{2\pi}{12/2}\,n\right)
$$

The $a_0$ term represents the annual average (with zero frequency); the $a_1$ and $b_1$ terms represent the periodic component with a period of 12 months (fundamental frequency); and the $a_2$ and $b_2$ terms represent a periodic component with a period of 6 months (first harmonic). The coefficients $a_0, a_1, a_2, b_1, b_2$ can be obtained as before, using the discrete analogue of Eqn 8.168 (with the period 2L corresponding to N = 12 points), e.g.:

$$b_2 = \frac{2}{12} \sum_{n=1}^{12} \bar{f}_n \sin\left(\frac{2\pi}{12/2}\,n\right)$$

where the bar represents the monthly average over several years. Once the coefficients are obtained, the annual cycle can be subtracted from the time series in order to deal with anomalies.
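The harmonic fit of Example 8.11 can be sketched as follows. The normalization 2/N for the harmonics (and 1/N for $a_0$) is assumed here as the discrete counterpart of the 1/L and 1/(2L) factors of Eqn 8.168 for a period of 2L. The function names and the synthetic monthly means are hypothetical; real monthly averages would replace `f_bar`.

```python
import math

def harmonic_fit(f, n_harmonics=2):
    """Return a0 and [(a_k, b_k)] for k = 1..n_harmonics from N equally spaced means."""
    N = len(f)
    a0 = sum(f) / N
    coeffs = []
    for k in range(1, n_harmonics + 1):
        # 2*pi*k*n/N is the same angle as 2*pi*n/(N/k) in the text
        ak = 2.0 / N * sum(f[n - 1] * math.cos(2 * math.pi * k * n / N)
                           for n in range(1, N + 1))
        bk = 2.0 / N * sum(f[n - 1] * math.sin(2 * math.pi * k * n / N)
                           for n in range(1, N + 1))
        coeffs.append((ak, bk))
    return a0, coeffs

def annual_cycle(a0, coeffs, N=12):
    """Evaluate the fitted cycle for n = 1..N."""
    return [a0 + sum(ak * math.cos(2 * math.pi * (k + 1) * n / N) +
                     bk * math.sin(2 * math.pi * (k + 1) * n / N)
                     for k, (ak, bk) in enumerate(coeffs))
            for n in range(1, N + 1)]

# Synthetic monthly means: annual mean 50, 12-month and 6-month components
f_bar = [50 + 30 * math.cos(2 * math.pi * n / 12) + 10 * math.sin(2 * math.pi * 2 * n / 12)
         for n in range(1, 13)]

a0, coeffs = harmonic_fit(f_bar)
anomalies = [obs - fit for obs, fit in zip(f_bar, annual_cycle(a0, coeffs))]
```

For this synthetic input the fit returns the amplitudes used to build it, so the anomalies vanish. With observed data the anomalies are what remains after the smooth annual cycle is removed.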

8.6.3.1 Non-periodic functions: Fourier integrals and transforms

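For a non-periodic function, the approach is to assume that the function $x_t$ is periodic in the limit as the period becomes infinitely long ($L \to \infty$). Instead of writing $x_t$ as a summation of sines and cosines, the following integral representation is introduced:

$$x_t = \int_{0}^{\infty} \left[ A(\omega) \cos \omega t + B(\omega) \sin \omega t \right] d\omega \qquad (8.169)$$

with

$$
\begin{aligned}
A(\omega) &= \frac{1}{\pi} \int_{-\infty}^{\infty} x(\nu) \cos \omega\nu \, d\nu \\
B(\omega) &= \frac{1}{\pi} \int_{-\infty}^{\infty} x(\nu) \sin \omega\nu \, d\nu
\end{aligned}
\qquad (8.170)
$$

Eqn 8.169 is equivalent to Eqn 8.167 and Eqn 8.170 is equivalent to Eqn 8.168. So far no assumption is made about the periodicity of the function $x_t$. It is sometimes useful to represent Eqn 8.169 in complex form using the Euler equation:

$$\exp(it) = \cos t + i \sin t \qquad (8.171)$$

Substituting Eqn 8.170 into Eqn 8.169, one can derive the complex Fourier integral as:

$$x_t = \frac{1}{2\pi} \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} x(\nu) \exp\left[ i\omega (t - \nu) \right] d\nu \, d\omega \qquad (8.172)$$

Rearranging the double integral in Eqn 8.172 gives:

$$x_t = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{\infty} \left[ \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{\infty} x(\nu) \exp(-i\omega\nu) \, d\nu \right] \exp(i\omega t) \, d\omega \qquad (8.173)$$

In Eqn 8.173, the expression in the square brackets is the Fourier transform of $x_t$, as given by:

$$X(\omega) = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{\infty} x(t) \exp(-i\omega t) \, dt \qquad (8.174)$$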

and thus Eqn 8.173 becomes the inverse Fourier transform of $X(\omega)$:

$$x_t = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{\infty} X(\omega) \exp(i\omega t) \, d\omega \qquad (8.175)$$

8.6.3.2 Discrete sampling effects

The usual method of describing a physical processis in the time domain, that is, the value of somefunction, xt, is described as a function of time, thatis as x(t). An alternate method is to describe pro-cesses in the frequency domain, such that a func-tion, X, is described as a function of frequency, ω,i.e. as X(ω). The two functions, x(t) and X(ω), arerelated through the Fourier transform Eqn 8.174and Eqn. 8.175.

These, however, refer to continuous functions.In reality, the experimental measurements are gen-erally discrete, made only at certain time intervals.An important consequence of the discrete natureof measurements is the introduction of a minimumresolvable frequency. If a time series is formed witha sampling interval of 10 seconds, the minimum re-solvable period is 20 seconds, as at least 2 pointsare required to determine a sine wave. This meansthat a time series collected at 0.1 Hz (1/10 s) canonly resolve frequencies less than 0.05 Hz (1/20 s).In this example, 0.05 Hz is known as the Nyquistor cut-off frequency, defined as:

$$f_c = \frac{1}{2\Delta t} \tag{8.176}$$

where $\Delta t$ is the sampling interval.

Sampling at discrete intervals for a certain period of time gives N measurements. The frequencies at which one can calculate the Fourier transform of x are limited to a set of N frequencies, determined by:

$$f_n = \frac{n}{N\Delta t}, \quad n = -\frac{N}{2}, \cdots, \frac{N}{2} \tag{8.177}$$

Therefore, one needs to replace the continuous representation of the Fourier transform, Eqn 8.174, with a discrete representation, in keeping with the discrete nature of sampling:

$$X(k) = \Delta t \sum_{n=0}^{N-1} x(n) \exp(-i 2\pi n k / N) \tag{8.178}$$

The angular frequency ω in Eqn 8.174 is replaced by 2πf, with t in Eqn 8.174 replaced by nΔt. A discretized form of f is also substituted, giving:

$$f_k = \frac{k}{N\Delta t}, \quad k = 0, \cdots, \frac{N}{2} \tag{8.179}$$
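The discrete relations in Eqns 8.176 to 8.179 can be illustrated with a minimal Python sketch. It evaluates the sum in Eqn 8.178 directly (an O(N²) calculation, not a fast Fourier transform); the function names and the synthetic 0.02 Hz sine series are assumptions made for illustration only and do not come from the text.

```python
import cmath
import math


def nyquist_frequency(dt):
    """Nyquist (cut-off) frequency f_c = 1 / (2 * dt), as in Eqn 8.176."""
    return 1.0 / (2.0 * dt)


def positive_frequencies(n_samples, dt):
    """Frequencies f_k = k / (N * dt) for k = 0, ..., N/2 (Eqn 8.179)."""
    return [k / (n_samples * dt) for k in range(n_samples // 2 + 1)]


def discrete_fourier_transform(x, dt):
    """Direct evaluation of Eqn 8.178 for k = 0, ..., N - 1."""
    n_samples = len(x)
    return [
        dt * sum(x[n] * cmath.exp(-2j * math.pi * n * k / n_samples)
                 for n in range(n_samples))
        for k in range(n_samples)
    ]


# Hypothetical series: a 0.02 Hz sine wave sampled every 10 s (100 samples).
dt = 10.0
n_samples = 100
x = [math.sin(2.0 * math.pi * 0.02 * n * dt) for n in range(n_samples)]

freqs = positive_frequencies(n_samples, dt)
X = discrete_fourier_transform(x, dt)

# Index of the largest spectral amplitude among the resolvable frequencies.
peak_k = max(range(len(freqs)), key=lambda k: abs(X[k]))
print("Nyquist frequency (Hz):", nyquist_frequency(dt))
print("Frequency of largest amplitude (Hz):", freqs[peak_k])
```

With a 10 s sampling interval the highest frequency in `freqs` equals the Nyquist frequency of 0.05 Hz, so the 0.02 Hz test signal is resolvable, whereas a signal above 0.05 Hz would not be.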

8.6.3.3 Calculation of the power spectrum and estimation of the periodogram

To determine the ‘dominant’ frequencies in a time series, the power spectral density is defined as:

$$G(\omega) = \frac{2}{T} |X(\omega)|^2 \tag{8.180}$$

where T is the length of the time series, and X(ω) is defined in Eqn 8.174. This is the continuous representation of the power spectral density, which gives an estimate of the ‘power’ in the signal x(t) at the frequency ω. The discrete representation of the power spectral density is:

$$G(f) = \frac{2\Delta t}{N} |X(f)|^2 \tag{8.181}$$

Analysis of the power spectral density G(f) enables one to investigate the dominant frequencies in a signal, as the dominant frequencies are likely to be important in characterizing a physical process.
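As a hedged illustration, the sketch below evaluates Eqn 8.181 for a synthetic series and picks out the dominant frequency. It assumes that X(f) in Eqn 8.181 denotes the raw discrete sum without the Δt factor, which makes Eqn 8.181 agree with Eqn 8.180 when T = NΔt; the text does not state this explicitly. The function name and the test series (a strong 8-day cycle plus a weaker 4-day cycle) are hypothetical.

```python
import cmath
import math


def power_spectral_density(x, dt):
    """One-sided discrete PSD, G(f_k) = (2 * dt / N) * |X_k|^2 (Eqn 8.181).

    Assumption: X_k is the raw DFT sum (no dt factor), so that the result
    matches Eqn 8.180 with T = N * dt.
    """
    n_samples = len(x)
    freqs, psd = [], []
    for k in range(n_samples // 2 + 1):
        raw = sum(x[n] * cmath.exp(-2j * math.pi * n * k / n_samples)
                  for n in range(n_samples))
        freqs.append(k / (n_samples * dt))
        psd.append(2.0 * dt / n_samples * abs(raw) ** 2)
    return freqs, psd


# Hypothetical daily series (dt = 1 day): strong 8-day and weak 4-day cycles.
dt = 1.0
x = [3.0 * math.sin(2.0 * math.pi * n / 8.0) + math.cos(2.0 * math.pi * n / 4.0)
     for n in range(64)]

freqs, psd = power_spectral_density(x, dt)
dominant = max(range(len(psd)), key=lambda k: psd[k])
print("Dominant frequency (cycles per day):", freqs[dominant])
print("Dominant period (days):", 1.0 / freqs[dominant])
```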

A periodogram estimate of the power spectrum is used, which is defined at (N/2 + 1) frequencies as:

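$$
\begin{aligned}
P(f_0) &= \frac{1}{N^2} |F_0|^2 \\
P(f_k) &= \frac{1}{N^2} \left( |F_k|^2 + |F_{N-k}|^2 \right), \quad k = 1, 2, \cdots, \left( \frac{N}{2} - 1 \right) \\
P(f_c) &= \frac{1}{N^2} |F_{N/2}|^2
\end{aligned}
\tag{8.182}
$$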
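A minimal Python sketch of Eqn 8.182 follows. It assumes that $F_k$ denotes the raw discrete Fourier coefficient of the series (the sum in Eqn 8.178 without the Δt factor) and that N is even; neither is stated explicitly in the text. Under that assumption the periodogram values add up to the mean square of the series, which serves as a simple check. The function name and test series are hypothetical.

```python
import cmath
import math


def periodogram(x):
    """Periodogram estimate at N/2 + 1 frequencies (Eqn 8.182), N even."""
    n_samples = len(x)
    if n_samples % 2 != 0:
        raise ValueError("this sketch assumes an even number of samples")
    # F[k]: raw discrete Fourier coefficients (assumed meaning of F_k).
    F = [sum(x[n] * cmath.exp(-2j * math.pi * n * k / n_samples)
             for n in range(n_samples))
         for k in range(n_samples)]
    half = n_samples // 2
    P = [abs(F[0]) ** 2 / n_samples ** 2]                      # zero frequency
    for k in range(1, half):                                   # interior bins
        P.append((abs(F[k]) ** 2 + abs(F[n_samples - k]) ** 2) / n_samples ** 2)
    P.append(abs(F[half]) ** 2 / n_samples ** 2)               # Nyquist bin
    return P


# Hypothetical series: mean of 2 plus a unit sine with 4 cycles in 32 samples.
x = [2.0 + math.sin(2.0 * math.pi * 4 * n / 32) for n in range(32)]
P = periodogram(x)
mean_square = sum(v * v for v in x) / len(x)
print("Sum of periodogram values:", sum(P))
print("Mean square of the series:", mean_square)
```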

The periodogram estimate of the power at each frequency, $f_k$, is representative of some kind of average value over a frequency ‘bin’ that extends from midway to the preceding frequency to midway to the next one. This is not always particularly accurate, due to the phenomenon known as ‘leakage’, that is, there is a leakage of energy from one frequency to another in the periodogram estimate.

Fig. 8.17 Window functions (square, Welch, Hanning and cosine taper), plotted as amplitude against bin number. Windowing functions that preserve all the power at the centre of the bin and allow little power at the extremities of the bin are used to reduce frequency leakage during spectral analysis of a time series. Some common windows and their leakage functions are shown. Redrawn from http://www.cwr.uwa.edu.au/∼antenucc/spectral analysis/spectral analysis.html#Coherence and phase.html.

One can reduce the effects of frequency leakage through a technique known as ‘data windowing’, which aims to preserve the entire power at the centre of the bin and virtually none of the power at its extremities, with a gradual transition between the two. The introduction of a data window results in transformation of Eqn 8.178 to:

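$$W(k) = \Delta t \sum_{n=0}^{N-1} x(n)\, w(n) \exp(-i 2\pi n k / N) \tag{8.183}$$

where $w(n)$ is the window function, written as $w_j$ (with $j = n$) in the definitions that follow. Several types of window function are in common use. Some of these are: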

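Square window:

$$w_j = 1$$

Hanning or cosine squared window:

$$w_j = \frac{1}{2} \left[ 1 - \cos\left( \frac{2\pi j}{N + 1} \right) \right]$$

Welch window:

$$w_j = 1 - \left( \frac{j - 0.5N}{0.5N} \right)^2$$

The cosine taper window:

$$
w_j =
\begin{cases}
\dfrac{1}{2} \left[ 1 - \cos\left( \dfrac{\pi j}{m} \right) \right], & j = 0, \cdots, m - 1 \\
1, & j = m, \cdots, N - m - 1 \\
\dfrac{1}{2} \left[ 1 - \cos\left( \dfrac{\pi (N - j)}{m} \right) \right], & j = N - m, \cdots, N - 1
\end{cases}
$$

where m is usually equal to 0.1N.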

The leakage functions of these windows are shown in Fig. 8.17.
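The window definitions above can be sketched in Python as follows. The index j is assumed to run from 0 to N − 1, the cosine taper uses m = 0.1N unless told otherwise, and all function names are hypothetical; this is an illustration of the formulas rather than a reference implementation.

```python
import math


def square_window(n_samples):
    """Square window: w_j = 1 for every sample."""
    return [1.0] * n_samples


def hanning_window(n_samples):
    """Hanning window: w_j = 0.5 * (1 - cos(2*pi*j / (N + 1)))."""
    return [0.5 * (1.0 - math.cos(2.0 * math.pi * j / (n_samples + 1)))
            for j in range(n_samples)]


def welch_window(n_samples):
    """Welch window: w_j = 1 - ((j - 0.5*N) / (0.5*N))**2."""
    half = 0.5 * n_samples
    return [1.0 - ((j - half) / half) ** 2 for j in range(n_samples)]


def cosine_taper_window(n_samples, m=None):
    """Cosine taper: cosine ramps over the first and last m samples, 1 between."""
    if m is None:
        m = max(1, round(0.1 * n_samples))  # m = 0.1 N, as suggested in the text
    w = [1.0] * n_samples
    for j in range(m):
        w[j] = 0.5 * (1.0 - math.cos(math.pi * j / m))
    for j in range(n_samples - m, n_samples):
        w[j] = 0.5 * (1.0 - math.cos(math.pi * (n_samples - j) / m))
    return w


def apply_window(x, w):
    """Multiply the series by the window before transforming (Eqn 8.183)."""
    return [xv * wv for xv, wv in zip(x, w)]


n_samples = 250
for name, w in [("square", square_window(n_samples)),
                ("Hanning", hanning_window(n_samples)),
                ("Welch", welch_window(n_samples)),
                ("cosine taper", cosine_taper_window(n_samples))]:
    print(name, "first:", round(w[0], 4), "centre:", round(w[n_samples // 2], 4))
```

Each tapered window starts at zero and rises to a value at or near one at the centre of the record, whereas the square window leaves the series unchanged.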

8.6.3.4 Coherence and phase relationships between signals

It is sometimes of interest to investigate the joint structure of two series, that is, their mutual dependence or the coherence of one series with the other. Since one can observe relationships only at the same frequency in both series, the cross-correlation spectrum may be defined as:

$$G_{ij}(\omega) = \frac{2}{T} \left[ X_i(\omega) X_j^{*}(\omega) \right] \tag{8.184}$$

The ‘∗’ superscript refers to the complex conjugate of the function, which for a complex function, y, is simply:

conj(y) = real(y) − i imag(y) (8.185)

Unlike the power spectrum, the cross-spectrum is complex, as it contains both amplitude and phase information. The real part of the cross-spectrum is known as the coincident spectrum (or co-spectrum), and the imaginary part of the cross-spectrum is known as the quadrature spectrum (or quad-spectrum). From the cross-spectrum, one can estimate the amplitude and phase relationships between the signals. The amplitude relationship is quantified through the coherency squared, calculated as:

$$S_{ij}^2(\omega) = \frac{|G_{ij}(\omega)|^2}{G_i(\omega)\, G_j(\omega)} \tag{8.186}$$

A coherency squared value of unity indicates total dependence of one signal on another, whereas a coherency squared value of zero refers to non-dependence of one signal on another. Two signals can only be coherent at the same frequency.

Phase is calculated directly from the co-spectrum and quadrature spectrum as follows:

$$\phi_{ij}(\omega) = \arctan\left[ \frac{\operatorname{imag}(G_{ij})}{\operatorname{real}(G_{ij})} \right] \tag{8.187}$$

The phase shift estimates are measures of the extent to which each frequency component of one series leads the other.
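The sketch below shows one way Eqns 8.184, 8.186 and 8.187 might be evaluated in practice. If the spectra are computed from a single, unaveraged transform of each series, the ratio in Eqn 8.186 is identically one at every frequency, so the sketch averages the spectra over consecutive segments before forming the ratio. That averaging step, the segment length, the function names and the synthetic noisy sine series are all assumptions for illustration and are not taken from the text. `math.atan2` is used in place of a plain arctangent so that the phase falls in the correct quadrant.

```python
import cmath
import math
import random


def half_dft(x):
    """Raw DFT sums for k = 0, ..., N/2 (direct O(N^2) evaluation)."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * math.pi * t * k / n) for t in range(n))
            for k in range(n // 2 + 1)]


def coherence_and_phase(x_i, x_j, segment_length, dt=1.0):
    """Segment-averaged coherency squared (Eqn 8.186) and phase (Eqn 8.187)."""
    n_segments = len(x_i) // segment_length
    n_freq = segment_length // 2 + 1
    scale = 2.0 * dt / segment_length  # equals (2 / T) * dt**2 with T = N * dt
    G_i = [0.0] * n_freq
    G_j = [0.0] * n_freq
    G_ij = [0j] * n_freq
    for s in range(n_segments):
        lo, hi = s * segment_length, (s + 1) * segment_length
        X_i = half_dft(x_i[lo:hi])
        X_j = half_dft(x_j[lo:hi])
        for k in range(n_freq):
            G_i[k] += scale * abs(X_i[k]) ** 2 / n_segments
            G_j[k] += scale * abs(X_j[k]) ** 2 / n_segments
            G_ij[k] += scale * X_i[k] * X_j[k].conjugate() / n_segments  # Eqn 8.184
    freqs = [k / (segment_length * dt) for k in range(n_freq)]
    coherence = [abs(G_ij[k]) ** 2 / (G_i[k] * G_j[k]) if G_i[k] * G_j[k] > 0.0 else 0.0
                 for k in range(n_freq)]
    phase = [math.atan2(G_ij[k].imag, G_ij[k].real) for k in range(n_freq)]
    return freqs, coherence, phase


# Hypothetical data: two noisy sine waves with an 8-sample period; the second
# lags the first by a quarter of pi radians.
rng = random.Random(42)
n_total, segment_length = 256, 32
x_i = [math.sin(2.0 * math.pi * t / 8.0) + rng.gauss(0.0, 0.3)
       for t in range(n_total)]
x_j = [math.sin(2.0 * math.pi * t / 8.0 - math.pi / 4.0) + rng.gauss(0.0, 0.3)
       for t in range(n_total)]

freqs, coherence, phase = coherence_and_phase(x_i, x_j, segment_length)
k_signal = 4  # bin of the 8-sample cycle when the segment length is 32
print("frequency:", freqs[k_signal])
print("coherency squared:", coherence[k_signal])
print("phase (radians):", phase[k_signal])
```

At the frequency of the shared cycle the coherency squared should be close to one and the phase close to π/4, while at other frequencies, where only independent noise is present, the coherency squared is expected to be much lower.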

8.7 Tutorial

Ex 8.1 What is the chance that exactly three 50-year floods will be equalled or exceeded in a given 100-year period? What is the chance that three or more floods will occur?

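Table 8.4 Deviation ($y_t$) from the average global temperature, in °C.

| Year | $y_t$ | Year | $y_t$ | Year | $y_t$ | Year | $y_t$ |
|------|-------|------|-------|------|-------|------|-------|
| 1975 | −0.09 | 1980 | 0.19 | 1985 | 0.09 | 1990 | 0.39 |
| 1976 | −0.22 | 1981 | 0.27 | 1986 | 0.17 | 1991 | 0.39 |
| 1977 | 0.11 | 1982 | 0.09 | 1987 | 0.30 | 1992 | 0.17 |
| 1978 | 0.04 | 1983 | 0.30 | 1988 | 0.34 | 1993 | 0.21 |
| 1979 | 0.10 | 1984 | 0.12 | 1989 | 0.25 | 1994 | 0.31 |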

Solution The general formula for the probability of exactly k chance events out of n trials is given by Eqn 8.40. Substituting 100 for n, 3 for k, and 0.02 (that is, 1/50) for p, we obtain P = 0.183, or about 1 chance in 5 or 6. To obtain the answer to the second question, one could substitute 3, 4, 5, etc. up to 100 in the formula and add the probabilities of these mutually exclusive events (if one occurs, the others cannot), or cut short the process by substituting 2, 1, and 0 in the formula and subtracting the sum of probabilities from 1.00 (certainty), since it is certain that either (a) 3 or more or (b) 2 or fewer events must occur. Substituting 2, 1, and 0 in the formula, we obtain 0.274, 0.271, and 0.133 in turn for P; 1 minus their sum is 0.322, or about 1 chance in 3.

Ex 8.2 What is the chance that a 1000-year flood will be exceeded in the estimated 50-year operational lifetime of a project?

Solution One can interpret this to include the chance that the flood will be exceeded exactly once or more than once. One should, therefore, solve for the chance that it will not occur (k = 0) and subtract that probability from 1.00, as explained in Ex 8.1. Note that for k = 0, Eqn 8.40 simplifies to:

$$P = (1 - p)^n$$

Substituting 0.001 for p and 50 for n, we obtain 0.952 for P and, therefore, 0.048 for the answer, that is, about 1 chance in 20.
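The arithmetic in Ex 8.1 and Ex 8.2 can be checked with a short Python sketch. It assumes that Eqn 8.40, which is not reproduced in this part of the chapter, is the binomial formula $P = \binom{n}{k} p^k (1 - p)^{n-k}$; this is consistent with the simplification to $(1 - p)^n$ for k = 0 quoted above. The function names are hypothetical. The computed values agree with the quoted figures to within about 0.001.

```python
import math


def prob_exactly(k, n, p):
    """Probability of exactly k exceedances in n years (assumed binomial form)."""
    return math.comb(n, k) * p ** k * (1.0 - p) ** (n - k)


def prob_at_least(k, n, p):
    """Probability of k or more exceedances: 1 minus P(0) + ... + P(k - 1)."""
    return 1.0 - sum(prob_exactly(i, n, p) for i in range(k))


# Ex 8.1: 50-year flood (p = 1/50) in a 100-year period.
print("exactly three:", round(prob_exactly(3, 100, 0.02), 3))
print("three or more:", round(prob_at_least(3, 100, 0.02), 3))

# Ex 8.2: 1000-year flood (p = 1/1000) exceeded at least once in 50 years.
print("at least once:", round(prob_at_least(1, 50, 0.001), 3))
```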

Ex 8.3 Table 8.4 gives 20 years of data on deviations of the global temperature from its average. Plot the time series and comment on any trends or other features of the data.

Ex 8.4 Table 8.5 gives hypothetical data on reforestation of a catchment and the runoff of a stream flowing out of it. Draw time series plots for each of the variables ‘Average runoff’ and ‘Fractional forested area’. Do these plots show that ‘Fractional forested area’ has affected the runoff, under the assumption of near constant rainfall?


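Table 8.5

| Year | Avg. runoff (m³ s⁻¹) | Forested area (%) | Year | Avg. runoff (m³ s⁻¹) | Forested area (%) | Year | Avg. runoff (m³ s⁻¹) | Forested area (%) |
|------|------|------|------|------|------|------|------|------|
| 1973 | 0.26 | 30.5 | 1980 | 0.26 | 24.2 | 1987 | 0.21 | 34.1 |
| 1974 | 0.27 | 27.8 | 1981 | 0.26 | 23.9 | 1988 | 0.21 | 34.8 |
| 1975 | 0.28 | 25.6 | 1982 | 0.26 | 24.8 | 1989 | 0.16 | 41.2 |
| 1976 | 0.28 | 25.1 | 1983 | 0.25 | 26.3 | 1990 | 0.17 | 43 |
| 1977 | 0.28 | 25.4 | 1984 | 0.26 | 25.8 | 1991 | 0.15 | 47.5 |
| 1978 | 0.27 | 24.1 | 1985 | 0.23 | 26.7 | 1992 | 0.16 | 46.5 |
| 1979 | 0.27 | 24.1 | 1986 | 0.21 | 30.4 | 1993 | 0.15 | 45.8 |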

Ex 8.5 One of the methods for removing a linear trend is by differencing, that is, by plotting the series with terms $z_t = y_{t+1} - y_t$. Verify that if this differencing is applied to the straight line $y_t = a + bt$, then $z_t$ does not contain a trend.
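Algebraically, $z_t = [a + b(t + 1)] - [a + bt] = b$, a constant. A minimal numerical check in Python is sketched below; the values of a and b and the function name are arbitrary choices for illustration.

```python
def difference(y):
    """First differences z_t = y_(t+1) - y_t of a series."""
    return [y[t + 1] - y[t] for t in range(len(y) - 1)]


# Hypothetical straight line y_t = a + b * t.
a, b = 2.0, 0.5
y = [a + b * t for t in range(10)]

z = difference(y)
print(z)  # every term equals b, so no trend remains
```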

Ex 8.6 Table 8.6 gives quarterly rainfall at a particular meteorological station. Determine if the data show any seasonality.

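Table 8.6

| Year | Quarter | $y_t$ | 4-point smoothing ($y'_{t+.5}$) | 2-point smoothing $z_t$ of $y'_{t+.5}$ |
|------|---------|-------|------|------|
| 2002 | Jan–Mar | 9.9 | | |
| | Apr–Jun | 9.5 | 9.1 | |
| | Jul–Sep | 8.3 | 9.1 | 9.1 |
| | Oct–Dec | 8.7 | 8.925 | 9.0125 |
| 2003 | Jan–Mar | 9.9 | 8.6 | 8.7625 |
| | Apr–Jun | 8.8 | 8.4 | 8.5 |
| | Jul–Sep | 7.0 | 8.25 | 8.325 |
| | Oct–Dec | 7.9 | 7.925 | 8.0875 |
| 2004 | Jan–Mar | 9.3 | 7.9 | 7.9125 |
| | Apr–Jun | 7.5 | 7.65 | 7.775 |
| | Jul–Sep | 6.9 | | |
| | Oct–Dec | 6.9 | | |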

A 4-point moving average gives the column labelled $y'_{t+.5}$; for the first four points, the average corresponds to a time midway between the second and third points. A 2-point average brings this smoothed series back in step with the original series, giving the column labelled $z_t$. Plot the smoothed series $z_t$ along with individual points of the series $y_t$. You should check some of the values of $z_t$ for yourself. Does this exercise result in $z_t$ being approximately linear? Have the seasonal and random variations been largely smoothed out? Is it possible to conclude that the above method can be used for smoothing out any cycle that repeats itself every c months or years? We simply take a c-point moving average. The key step is to determine the period c of the cycle.
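The two smoothing passes of Table 8.6 can be reproduced with the following Python sketch, which applies a 4-point moving average followed by a 2-point moving average to the quarterly values of $y_t$. The function name is hypothetical.

```python
def moving_average(values, window):
    """Simple moving average over consecutive runs of `window` points."""
    return [sum(values[i:i + window]) / window
            for i in range(len(values) - window + 1)]


# Quarterly rainfall y_t from Table 8.6 (2002 to 2004).
y = [9.9, 9.5, 8.3, 8.7, 9.9, 8.8, 7.0, 7.9, 9.3, 7.5, 6.9, 6.9]

four_point = moving_average(y, 4)   # centred midway between quarters
z = moving_average(four_point, 2)   # back in step with the original series

# z[0] belongs to the third quarter of 2002 (Jul-Sep), z[1] to the fourth, etc.
print([round(v, 4) for v in four_point])
print([round(v, 4) for v in z])
```

For a cycle of period c, the same function would be called with `window=c`; the second 2-point pass is needed only when c is even, to re-centre the smoothed values on the original time steps.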

Ex 8.7 De-seasonalize the values of $y_t$ from Ex 8.6.

Solution Removing the seasonal effect from a series, while leaving any trend and the random ups and downs in the data, may give us a clearer picture of the situation. The resulting series is the de-seasonalized data. To achieve this in a reasonable manner, we need a suitable model of the process producing the original time series. Our model takes the form:

data = trend + cycle + error


To remove a quarterly cycle, for example, we begin by averaging all the first quarters, namely $(y_1 + y_5 + y_9 + \cdots)/(\text{no. of years})$, then averaging all the second quarters, all the third quarters and finally all the fourth quarters, giving us just four numbers $s_1$, $s_2$, $s_3$ and $s_4$. We then subtract the mean $\bar{s}$ of these four numbers (which is the same as the mean $\bar{y}$ of the original series) to get $s_t - \bar{s}$. The de-seasonalized series is then given by $z_t = y_t - (s_t - \bar{s})$, where the definition of $s_t$ is extended beyond the first year by simply repeating the same four numbers.

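Table 8.7

| Year | Quarter | $y_t$ | Quarter mean $s_t$ | Adjustment ($s_t - \bar{s}$) | Deseasoned ($z_t$) |
|------|---------|-------|------|-------|------|
| 2002 | Jan–Mar | 9.9 | 9.70 | 1.32 | 8.58 |
| | Apr–Jun | 9.5 | 8.60 | 0.22 | 9.28 |
| | Jul–Sep | 8.3 | 7.40 | −0.98 | 9.28 |
| | Oct–Dec | 8.7 | 7.83 | −0.55 | 9.25 |
| 2003 | Jan–Mar | 9.9 | 9.70 | 1.32 | 8.58 |
| | Apr–Jun | 8.8 | 8.60 | 0.22 | 8.58 |
| | Jul–Sep | 7.0 | 7.40 | −0.98 | 7.98 |
| | Oct–Dec | 7.9 | 7.83 | −0.55 | 8.45 |
| 2004 | Jan–Mar | 9.3 | 9.70 | 1.32 | 7.98 |
| | Apr–Jun | 7.5 | 8.60 | 0.22 | 7.28 |
| | Jul–Sep | 6.9 | 7.40 | −0.98 | 7.88 |
| | Oct–Dec | 6.9 | 7.83 | −0.55 | 7.45 |

Mean: 8.38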

Table 8.7 demonstrates the calculations, with reference to $y_t$ in Ex 8.6, using:

$$s_1 = s_5 = s_9 = \frac{1}{3}(y_1 + y_5 + y_9) = \frac{1}{3}(9.9 + 9.9 + 9.3) = 9.7$$

$$\bar{s} = 8.38; \quad s_1 - \bar{s} = 9.7 - 8.38 = 1.32$$

and

$$z_1 = y_1 - (s_1 - \bar{s}) = 8.58$$

Plot the series $z_t$ along with the points for $y_t$ and note that without the seasonal cycle there appears to be an almost linear trend.
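The de-seasonalizing steps of this solution can be sketched in Python as follows. The sketch assumes the series covers a whole number of cycles (here three years of four quarters); the function name is hypothetical. Its output reproduces the last column of Table 8.7 to two decimal places.

```python
def deseasonalize(y, period):
    """Subtract the seasonal adjustment (s_t - s_bar) from each value of y.

    Assumes len(y) is a whole multiple of `period`.
    """
    n_cycles = len(y) // period
    # s_1, ..., s_period: average of each season over all cycles.
    seasonal = [sum(y[q + period * c] for c in range(n_cycles)) / n_cycles
                for q in range(period)]
    s_bar = sum(seasonal) / period
    # z_t = y_t - (s_t - s_bar), with s_t repeating every `period` steps.
    return [y[t] - (seasonal[t % period] - s_bar) for t in range(len(y))]


# Quarterly rainfall y_t from Table 8.6 (2002 to 2004).
y = [9.9, 9.5, 8.3, 8.7, 9.9, 8.8, 7.0, 7.9, 9.3, 7.5, 6.9, 6.9]

z = deseasonalize(y, period=4)
print([round(v, 2) for v in z])
```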