sma322: numerical analysis 1sma322: numerical analysis 1 contents 1 introduction 1 2 number system...

SMA322: NUMERICAL ANALYSIS 1

Contents

1 Introduction 1

2 Number System and Errors 2

3 Taylors Series 9

4 Solution to Nonlinear Equations: Root Finding 24

5 Numerical Differentiation 31

6 Numerical Integration 36

7 More on Numerical Differentiation and Integration 41

1 Introduction

Number systems. Approximations by Taylor’s Series. Introduction to errors.Finite differences. Interpolation based on finite differences. Numerical differ-entiation: extrapolation to the limit. Numerical integration, Newton-Cotes,Romberg formulae. Solution of the non-linear equation: f(x) = 0 includingpolynomial equations. Systems of non-linear equations. Ordinary differentialequations: initial value problems; boundary value problems; convergence andstability.

1

2 Number System and Errors

2

Chapter 1

IEEE Arithmetic1.1 Definitions

Bit = 0 or 1Byte = 8 bitsWord = Reals: 4 bytes (single precision)

8 bytes (double precision)= Integers: 1, 2, 4, or 8 byte signed

1, 2, 4, or 8 byte unsigned

1.2 Numbers with a decimal or binary point

⇤ ⇤ ⇤ ⇤ · ⇤ ⇤ ⇤ ⇤Decimal: 103 102 101 100 10�1 10�2 10�3 10�4

Binary: 23 22 21 20 2�1 2�2 2�3 2�4

1.3 Examples of binary numbers

Decimal Binary1 12 103 114 100

0.5 0.11.5 1.1

1.4 Hex numbers

{0, 1, 2, 3, . . . , 9, 10, 11, 12, 13, 14, 15} = {0, 1, 2, 3.......9, a,b,c,d,e,f}

1.5 4-bit unsigned integers as hex numbers

Decimal Binary Hex1 0001 12 0010 23 0011 3...

......

10 1010 a...

......

15 1111 f

1

1.6. IEEE SINGLE PRECISION FORMAT:

1.6 IEEE single precision format:

sz}|{⇤0

ez }| {⇤1⇤2⇤3⇤4⇤5⇤6⇤7⇤8

fz }| {⇤9

· · · · · · · ·⇤31

# = (�1)s ⇥ 2e�127 ⇥ 1.f

where s = signe = biased exponentp=e-127 = exponent1.f = significand (use binary point)

1.7 Special numbers

Smallest exponent: e = 0000 0000, represents denormal numbers (1.f ! 0.f)Largest exponent: e = 1111 1111, represents ±•, if f = 0

e = 1111 1111, represents NaN, if f 6= 0

Number Range: e = 1111 1111 = 28 - 1 = 255 reservede = 0000 0000 = 0 reserved

so, p = e - 127 is1 - 127 p 254-127-126 p 127

Smallest positive normal number= 1.0000 0000 · · · · ·· 0000⇥ 2�126

' 1.2 ⇥ 10�38

bin: 0000 0000 1000 0000 0000 0000 0000 0000hex: 00800000MATLAB: realmin(’single’)

Largest positive number= 1.1111 1111 · · · · ·· 1111⇥ 2127

= (1 + (1 � 2�23)) ⇥ 2127

' 2128 ' 3.4 ⇥ 1038

bin: 0111 1111 0111 1111 1111 1111 1111 1111hex: 7f7fffffMATLAB: realmax(’single’)

Zerobin: 0000 0000 0000 0000 0000 0000 0000 0000hex: 00000000

Subnormal numbersAllow 1.f ! 0.f (in software)Smallest positive number = 0.0000 0000 · · · · · 0001 ⇥ 2�126

= 2�23 ⇥ 2�126 ' 1.4 ⇥ 10�45

2 CHAPTER 1. IEEE ARITHMETIC

1.8. EXAMPLES OF COMPUTER NUMBERS

1.8 Examples of computer numbers

What is 1.0, 2.0 & 1/2 in hex ?

1.0 = (�1)0 ⇥ 2(127�127) ⇥ 1.0Therefore, s = 0, e = 0111 1111, f = 0000 0000 0000 0000 0000 000bin: 0011 1111 1000 0000 0000 0000 0000 0000hex: 3f80 0000

2.0 = (�1)0 ⇥ 2(128�127) ⇥ 1.0Therefore, s = 0, e = 1000 0000, f = 0000 0000 0000 0000 0000 000bin: 0100 00000 1000 0000 0000 0000 0000 0000hex: 4000 0000

1/2 = (�1)0 ⇥ 2(126�127) ⇥ 1.0Therefore, s = 0, e = 0111 1110, f = 0000 0000 0000 0000 0000 000bin: 0011 1111 0000 0000 0000 0000 0000 0000hex: 3f00 0000

1.9 Inexact numbers

Example:13

= (�1)0 ⇥ 14⇥ (1 +

13),

so that p = e � 127 = �2 and e = 125 = 128� 3, or in binary, e = 0111 1101. How isf = 1/3 represented in binary? To compute binary number, multiply successivelyby 2 as follows:

0.333 . . . 0.0.666 . . . 0.01.333 . . . 0.010.666 . . . 0.0101.333 . . . 0.0101

etc.

so that 1/3 exactly in binary is 0.010101 . . . . With only 23 bits to represent f , thenumber is inexact and we have

f = 01010101010101010101011,

where we have rounded to the nearest binary number (here, rounded up). Themachine number 1/3 is then represented as

00111110101010101010101010101011

or in hex

3eaaaaab.

CHAPTER 1. IEEE ARITHMETIC 3

1.10. MACHINE EPSILON

1.9.1 Find smallest positive integer that is not exact in single pre-cision

Let N be the smallest positive integer that is not exact. Now, I claim that

N � 2 = 223 ⇥ 1.11 . . . 1,

andN � 1 = 224 ⇥ 1.00 . . . 0.

The integer N would then require a one-bit in the 2�24 position, which is not avail-able. Therefore, the smallest positive integer that is not exact is 224 + 1 = 16 777 217.In MATLAB, single(224) has the same value as single(224 + 1). Since single(224 + 1)is exactly halfway between the two consecutive machine numbers 224 and 224 + 2,MATLAB rounds to the number with a final zero-bit in f, which is 224.

1.10 Machine epsilon

Machine epsilon (emach) is the distance between 1 and the next largest number. If0 d < emach/2, then 1 + d = 1 in computer math. Also since

x + y = x(1 + y/x),

if 0 y/x < emach/2, then x + y = x in computer math.

Find emach

The number 1 in the IEEE format is written as

1 = 20 ⇥ 1.000 . . . 0,

with 23 0’s following the binary point. The number just larger than 1 has a 1 in the23rd position after the decimal point. Therefore,

emach = 2�23 ⇡ 1.192 ⇥ 10�7.

What is the distance between 1 and the number just smaller than 1? Here, thenumber just smaller than one can be written as

2�1 ⇥ 1.111 . . . 1 = 2�1(1 + (1 � 2�23)) = 1 � 2�24

Therefore, this distance is 2�24 = emach/2.The spacing between numbers is uniform between powers of 2, with logarithmic

spacing of the powers of 2. That is, the spacing of numbers between 1 and 2 is 2�23,between 2 and 4 is 2�22, between 4 and 8 is 2�21, etc. This spacing changes fordenormal numbers, where the spacing is uniform all the way down to zero.

Find the machine number just greater than 5

A rough estimate would be 5(1 + emach) = 5 + 5emach, but this is not exact. Theexact answer can be found by writing

5 = 22(1 +14),

so that the next largest number is

22(1 +14

+ 2�23) = 5 + 2�21 = 5 + 4emach.


1.11. IEEE DOUBLE PRECISION FORMAT

1.11 IEEE double precision format

Most computations take place in double precision, where round-off error is re-duced, and all of the above calculations in single precision can be repeated fordouble precision. The format is

sz}|{⇤0

ez }| {⇤1⇤2⇤3⇤4⇤5⇤6⇤7⇤8⇤9⇤10⇤11

fz }| {⇤12

· · · · · · · ·⇤63

# = (�1)s ⇥ 2e�1023 ⇥ 1.f

where s = signe = biased exponentp=e-1023 = exponent1.f = significand (use binary point)

1.12 Roundoff error example

Consider solving the quadratic equation

x2 + 2bx � 1 = 0,

where b is a parameter. The quadratic formula yields the two solutions

x± = �b ±p

b2 + 1.

Consider the solution with b > 0 and x > 0 (the x+ solution) given by

x = �b +p

b2 + 1. (1.1)

As b ! •,

x = �b +p

b2 + 1

= �b + bp

1 + 1/b2

= b(p

1 + 1/b2 � 1)

⇡ b✓

1 +1

2b2 � 1◆

=12b

.

Now in double precision, realmin ⇡ 2.2⇥ 10�308 and we would like x to be accurateto this value before it goes to 0 via denormal numbers. Therefore, x should becomputed accurately to b ⇡ 1/(2 ⇥ realmin) ⇡ 2 ⇥ 10307. What happens if wecompute (1.1) directly? Then x = 0 when b2 + 1 = b2, or 1 + 1/b2 = 1. That is1/b2 = emach/2, or b =

p2/

pemach ⇡ 108.

CHAPTER 1. IEEE ARITHMETIC 5

1.12. ROUNDOFF ERROR EXAMPLE

For a subroutine written to compute the solution of a quadratic for a generaluser, this is not good enough. The way for a software designer to solve this problemis to compute the solution for x as

x =1

b(1 +p

1 + 1/b2).

In this form, if 1 + 1/b2 = 1, then x = 1/2b which is the correct asymptotic form.


3 Taylors Series

9

3DXOV�2QOLQH�1RWHV�+RPH��&DOFXOXV�,,��6HULHV��6HTXHQFHV��7D\ORU�6HULHV

¬

6HFWLRQ��7D\ORU�6HULHV,Q�WKH�SUHYLRXV�VHFWLRQ�ZH�VWDUWHG�ORRNLQJ�DW�ZULWLQJ�GRZQ�D�SRZHU�VHULHV�UHSUHVHQWDWLRQ�RI�D�IXQFWLRQ��7KH�SUREOHP�ZLWK�WKH�DSSURDFK�LQ�WKDWVHFWLRQ�LV�WKDW�HYHU\WKLQJ�FDPH�GRZQ�WR�QHHGLQJ�WR�EH�DEOH�WR�UHODWH�WKH�IXQFWLRQ�LQ�VRPH�ZD\�WR

DQG�ZKLOH�WKHUH�DUH�PDQ\�IXQFWLRQV�RXW�WKHUH�WKDW�FDQ�EH�UHODWHG�WR�WKLV�IXQFWLRQ�WKHUH�DUH�PDQ\�PRUH�WKDW�VLPSO\�FDQ·W�EH�UHODWHG�WR�WKLV�

6R��ZLWKRXW�WDNLQJ�DQ\WKLQJ�DZD\�IURP�WKH�SURFHVV�ZH�ORRNHG�DW�LQ�WKH�SUHYLRXV�VHFWLRQ��ZKDW�ZH�QHHG�WR�GR�LV�FRPH�XS�ZLWK�D�PRUH�JHQHUDOPHWKRG�IRU�ZULWLQJ�D�SRZHU�VHULHV�UHSUHVHQWDWLRQ�IRU�D�IXQFWLRQ�

6R��IRU�WKH�WLPH�EHLQJ��OHW·V�PDNH�WZR�DVVXPSWLRQV��)LUVW��OHW·V�DVVXPH�WKDW�WKH�IXQFWLRQ� �GRHV�LQ�IDFW�KDYH�D�SRZHU�VHULHV�UHSUHVHQWDWLRQDERXW� �

1H[W��ZH�ZLOO�QHHG�WR�DVVXPH�WKDW�WKH�IXQFWLRQ�� KDV�GHULYDWLYHV�RI�HYHU\�RUGHU�DQG�WKDW�ZH�FDQ�LQ�IDFW�ÀQG�WKHP�DOO�

1RZ�WKDW�ZH·YH�DVVXPHG�WKDW�D�SRZHU�VHULHV�UHSUHVHQWDWLRQ�H[LVWV�ZH�QHHG�WR�GHWHUPLQH�ZKDW�WKH�FRHIÀFLHQWV�� DUH��7KLV�LV�HDVLHU�WKDQ�LWPLJKW�DW�ÀUVW�DSSHDU�WR�EH��/HW·V�ÀUVW�MXVW�HYDOXDWH�HYHU\WKLQJ�DW� ��7KLV�JLYHV�

6R��DOO�WKH�WHUPV�H[FHSW�WKH�ÀUVW�DUH�]HUR�DQG�ZH�QRZ�NQRZ�ZKDW� �LV��8QIRUWXQDWHO\��WKHUH�LVQ·W�DQ\�RWKHU�YDOXH�RI� �WKDW�ZH�FDQ�SOXJ�LQWRWKH�IXQFWLRQ�WKDW�ZLOO�DOORZ�XV�WR�TXLFNO\�ÀQG�DQ\�RI�WKH�RWKHU�FRHIÀFLHQWV��+RZHYHU��LI�ZH�WDNH�WKH�GHULYDWLYH�RI�WKH�IXQFWLRQ��DQG�LWV�SRZHUVHULHV��WKHQ�SOXJ�LQ� �ZH�JHW�

DQG�ZH�QRZ�NQRZ� �

/HW·V�FRQWLQXH�ZLWK�WKLV�LGHD�DQG�ÀQG�WKH�VHFRQG�GHULYDWLYH�

6R��LW�ORRNV�OLNH�

8VLQJ�WKH�WKLUG�GHULYDWLYH�JLYHV�

8VLQJ�WKH�IRXUWK�GHULYDWLYH�JLYHV�

+RSHIXOO\�E\�WKLV�WLPH�\RX·YH�VHHQ�WKH�SDWWHUQ�KHUH��,W�ORRNV�OLNH��LQ�JHQHUDO��ZH·YH�JRW�WKH�IROORZLQJ�IRUPXOD�IRU�WKH�FRHIÀFLHQWV�

7KLV�HYHQ�ZRUNV�IRU� �LI�\RX�UHFDOO�WKDW� �DQG�GHÀQH� �

6R��SURYLGHG�D�SRZHU�VHULHV�UHSUHVHQWDWLRQ�IRU�WKH�IXQFWLRQ� �DERXW� �H[LVWV�WKH�7D\ORU�6HULHV�IRU� �DERXW� �LV�

7D\ORU�6HULHV

,I�ZH�XVH� ��VR�ZH�DUH�WDONLQJ�DERXW�WKH�7D\ORU�6HULHV�DERXW� ��ZH�FDOO�WKH�VHULHV�D�0DFODXULQ�6HULHV�IRU� �RU�

0DFODXULQ�6HULHV

%HIRUH�ZRUNLQJ�DQ\�H[DPSOHV�RI�7D\ORU�6HULHV�ZH�ÀUVW�QHHG�WR�DGGUHVV�WKH�DVVXPSWLRQ�WKDW�D�7D\ORU�6HULHV�ZLOO�LQ�IDFW�H[LVW�IRU�D�JLYHQIXQFWLRQ��/HW·V�VWDUW�RXW�ZLWK�VRPH�QRWDWLRQ�DQG�GHÀQLWLRQV�WKDW�ZH·OO�QHHG�

7R�GHWHUPLQH�D�FRQGLWLRQ�WKDW�PXVW�EH�WUXH�LQ�RUGHU�IRU�D�7D\ORU�VHULHV�WR�H[LVW�IRU�D�IXQFWLRQ�OHW·V�ÀUVW�GHÀQH�WKH�QWK�GHJUHH�7D\ORUSRO\QRPLDO�RI� �DV�

1RWH�WKDW�WKLV�UHDOO\�LV�D�SRO\QRPLDO�RI�GHJUHH�DW�PRVW� ��,I�ZH�ZHUH�WR�ZULWH�RXW�WKH�VXP�ZLWKRXW�WKH�VXPPDWLRQ�QRWDWLRQ�WKLV�ZRXOG�FOHDUO\�EHDQ�QWK�GHJUHH�SRO\QRPLDO��:H·OO�VHH�D�QLFH�DSSOLFDWLRQ�RI�7D\ORU�SRO\QRPLDOV�LQ�WKH�QH[W�VHFWLRQ�

1RWLFH�DV�ZHOO�WKDW�IRU�WKH�IXOO�7D\ORU�6HULHV�

WKH� WK�GHJUHH�7D\ORU�SRO\QRPLDO�LV�MXVW�WKH�SDUWLDO�VXP�IRU�WKH�VHULHV�

1H[W��WKH�UHPDLQGHU�LV�GHÀQHG�WR�EH�

6R��WKH�UHPDLQGHU�LV�UHDOO\�MXVW�WKH�HUURU�EHWZHHQ�WKH�IXQFWLRQ� �DQG�WKH� WK�GHJUHH�7D\ORU�SRO\QRPLDO�IRU�D�JLYHQ� �

:LWK�WKLV�GHÀQLWLRQ�QRWH�WKDW�ZH�FDQ�WKHQ�ZULWH�WKH�IXQFWLRQ�DV�

:H�QRZ�KDYH�WKH�IROORZLQJ�7KHRUHP�

7KHRUHP

6XSSRVH�WKDW� ��7KHQ�LI�

IRU� �WKHQ�

RQ� �

,Q�JHQHUDO��VKRZLQJ�WKDW

LV�D�VRPHZKDW�GLIÀFXOW�SURFHVV�DQG�VR�ZH�ZLOO�EH�DVVXPLQJ�WKDW�WKLV�FDQ�EH�GRQH�IRU�VRPH� �LQ�DOO�RI�WKH�H[DPSOHV�WKDW�ZH·OO�EH�ORRNLQJ�DW�

1RZ�OHW·V�ORRN�DW�VRPH�H[DPSOHV�

([DPSOH��)LQG�WKH�7D\ORU�6HULHV�IRU� �DERXW� �

+LGH�6ROXWLRQ�¥

7KLV�LV�DFWXDOO\�RQH�RI�WKH�HDVLHU�7D\ORU�6HULHV�WKDW�ZH·OO�EH�DVNHG�WR�FRPSXWH��7R�ÀQG�WKH�7D\ORU�6HULHV�IRU�D�IXQFWLRQ�ZH�ZLOO�QHHG�WRGHWHUPLQH�D�JHQHUDO�IRUPXOD�IRU� ��7KLV�LV�RQH�RI�WKH�IHZ�IXQFWLRQV�ZKHUH�WKLV�LV�HDV\�WR�GR�ULJKW�IURP�WKH�VWDUW�

7R�JHW�D�IRUPXOD�IRU� �DOO�ZH�QHHG�WR�GR�LV�UHFRJQL]H�WKDW�

DQG�VR�

7KHUHIRUH��WKH�7D\ORU�VHULHV�IRU� �DERXW� �LV�



7KHUH�DUH�WZR�ZD\V�WR�GR�WKLV�SUREOHP��%RWK�DUH�IDLUO\�VLPSOH��KRZHYHU�RQH�RI�WKHP�UHTXLUHV�VLJQLÀFDQWO\�OHVV�ZRUN��:H·OO�ZRUN�ERWKVROXWLRQV�VLQFH�WKH�ORQJHU�RQH�KDV�VRPH�QLFH�LGHDV�WKDW�ZH·OO�VHH�LQ�RWKHU�H[DPSOHV�

6ROXWLRQ��$V�ZLWK�WKH�ÀUVW�H[DPSOH�ZH·OO�QHHG�WR�JHW�D�IRUPXOD�IRU� ��+RZHYHU��XQOLNH�WKH�ÀUVW�RQH�ZH·YH�JRW�D�OLWWOH�PRUH�ZRUN�WR�GR��/HW·V�ÀUVWWDNH�VRPH�GHULYDWLYHV�DQG�HYDOXDWH�WKHP�DW� �

$IWHU�D�FRXSOH�RI�FRPSXWDWLRQV�ZH�ZHUH�DEOH�WR�JHW�JHQHUDO�IRUPXODV�IRU�ERWK� �DQG� ��:H�RIWHQ�ZRQ·W�EH�DEOH�WR�JHW�D�JHQHUDOIRUPXOD�IRU� �VR�GRQ·W�JHW�WRR�H[FLWHG�DERXW�JHWWLQJ�WKDW�IRUPXOD��$OVR��DV�ZH�ZLOO�VHH�LW�ZRQ·W�DOZD\V�EH�HDV\�WR�JHW�D�JHQHUDO�IRUPXODIRU� �

6R��LQ�WKLV�FDVH�ZH·YH�JRW�JHQHUDO�IRUPXODV�VR�DOO�ZH�QHHG�WR�GR�LV�SOXJ�WKHVH�LQWR�WKH�7D\ORU�6HULHV�IRUPXOD�DQG�EH�GRQH�ZLWK�WKH�SUREOHP�

6ROXWLRQ��7KH�SUHYLRXV�VROXWLRQ�ZDVQ·W�WRR�EDG�DQG�ZH�RIWHQ�KDYH�WR�GR�WKLQJV�LQ�WKDW�PDQQHU��+RZHYHU��LQ�WKLV�FDVH�WKHUH�LV�D�PXFK�VKRUWHU�VROXWLRQPHWKRG��,Q�WKH�SUHYLRXV�VHFWLRQ�ZH�XVHG�VHULHV�WKDW�ZH·YH�DOUHDG\�IRXQG�WR�KHOS�XV�ÀQG�D�QHZ�VHULHV��/HW·V�GR�WKH�VDPH�WKLQJ�ZLWK�WKLV�RQH�:H�DOUHDG\�NQRZ�D�7D\ORU�6HULHV�IRU� �DERXW� �DQG�LQ�WKLV�FDVH�WKH�RQO\�GLIIHUHQFH�LV�ZH·YH�JRW�D�´� µ�LQ�WKH�H[SRQHQW�LQVWHDG�RI�MXVWDQ� �

6R��DOO�ZH�QHHG�WR�GR�LV�UHSODFH�WKH� �LQ�WKH�7D\ORU�6HULHV�WKDW�ZH�IRXQG�LQ�WKH�ÀUVW�H[DPSOH�ZLWK�´� µ�

7KLV�LV�D�PXFK�VKRUWHU�PHWKRG�RI�DUULYLQJ�DW�WKH�VDPH�DQVZHU�VR�GRQ·W�IRUJHW�DERXW�XVLQJ�SUHYLRXVO\�FRPSXWHG�VHULHV�ZKHUH�SRVVLEOH��DQGDOORZHG�RI�FRXUVH��



)RU�WKLV�H[DPSOH��ZH�ZLOO�WDNH�DGYDQWDJH�RI�WKH�IDFW�WKDW�ZH�DOUHDG\�KDYH�D�7D\ORU�6HULHV�IRU� �DERXW� ��,Q�WKLV�H[DPSOH��XQOLNH�WKHSUHYLRXV�H[DPSOH��GRLQJ�WKLV�GLUHFWO\�ZRXOG�EH�VLJQLÀFDQWO\�ORQJHU�DQG�PRUH�GLIÀFXOW�

7R�WKLV�SRLQW�ZH·YH�RQO\�ORRNHG�DW�7D\ORU�6HULHV�DERXW� ��DOVR�NQRZQ�DV�0DFODXULQ�6HULHV��VR�OHW·V�WDNH�D�ORRN�DW�D�7D\ORU�6HULHV�WKDW�LVQ·WDERXW� ��$OVR��ZH·OO�SLFN�RQ�WKH�H[SRQHQWLDO�IXQFWLRQ�RQH�PRUH�WLPH�VLQFH�LW�PDNHV�VRPH�RI�WKH�ZRUN�HDVLHU��7KLV�ZLOO�EH�WKH�ÀQDO�7D\ORU6HULHV�IRU�H[SRQHQWLDOV�LQ�WKLV�VHFWLRQ�



)LQGLQJ�D�JHQHUDO�IRUPXOD�IRU� �LV�IDLUO\�VLPSOH�

7KH�7D\ORU�6HULHV�LV�WKHQ�

2ND\��ZH�QRZ�QHHG�WR�ZRUN�VRPH�H[DPSOHV�WKDW�GRQ·W�LQYROYH�WKH�H[SRQHQWLDO�IXQFWLRQ�VLQFH�WKHVH�ZLOO�WHQG�WR�UHTXLUH�D�OLWWOH�PRUH�ZRUN�



)LUVW��ZH·OO�QHHG�WR�WDNH�VRPH�GHULYDWLYHV�RI�WKH�IXQFWLRQ�DQG�HYDOXDWH�WKHP�DW� �

,Q�WKLV�H[DPSOH��XQOLNH�WKH�SUHYLRXV�RQHV��WKHUH�LV�QRW�DQ�HDV\�IRUPXOD�IRU�HLWKHU�WKH�JHQHUDO�GHULYDWLYH�RU�WKH�HYDOXDWLRQ�RI�WKH�GHULYDWLYH�+RZHYHU��WKHUH�LV�D�FOHDU�SDWWHUQ�WR�WKH�HYDOXDWLRQV��6R��OHW·V�SOXJ�ZKDW�ZH·YH�JRW�LQWR�WKH�7D\ORU�VHULHV�DQG�VHH�ZKDW�ZH�JHW�

6R��ZH�RQO\�SLFN�XS�WHUPV�ZLWK�HYHQ�SRZHUV�RQ�WKH� ·V��7KLV�GRHVQ·W�UHDOO\�KHOS�XV�WR�JHW�D�JHQHUDO�IRUPXOD�IRU�WKH�7D\ORU�6HULHV��+RZHYHU�OHW·V�GURS�WKH�]HURHV�DQG�´UHQXPEHUµ�WKH�WHUPV�DV�IROORZV�WR�VHH�ZKDW�ZH�FDQ�JHW�

%\�UHQXPEHULQJ�WKH�WHUPV�DV�ZH�GLG�ZH�FDQ�DFWXDOO\�FRPH�XS�ZLWK�D�JHQHUDO�IRUPXOD�IRU�WKH�7D\ORU�6HULHV�DQG�KHUH�LW�LV�

7KLV�LGHD�RI�UHQXPEHULQJ�WKH�VHULHV�WHUPV�DV�ZH�GLG�LQ�WKH�SUHYLRXV�H[DPSOH�LVQ·W�XVHG�DOO�WKDW�RIWHQ��EXW�RFFDVLRQDOO\�LV�YHU\�XVHIXO��7KHUH�LVRQH�PRUH�VHULHV�ZKHUH�ZH�QHHG�WR�GR�LW�VR�OHW·V�WDNH�D�ORRN�DW�WKDW�VR�ZH�FDQ�JHW�RQH�PRUH�H[DPSOH�GRZQ�RI�UHQXPEHULQJ�VHULHV�WHUPV�



$V�ZLWK�WKH�ODVW�H[DPSOH�ZH·OO�VWDUW�RII�LQ�WKH�VDPH�PDQQHU�

6R��ZH�JHW�D�VLPLODU�SDWWHUQ�IRU�WKLV�RQH��/HW·V�SOXJ�WKH�QXPEHUV�LQWR�WKH�7D\ORU�6HULHV�

,Q�WKLV�FDVH�ZH�RQO\�JHW�WHUPV�WKDW�KDYH�DQ�RGG�H[SRQHQW�RQ� �DQG�DV�ZLWK�WKH�ODVW�SUREOHP�RQFH�ZH�LJQRUH�WKH�]HUR�WHUPV�WKHUH�LV�D�FOHDUSDWWHUQ�DQG�IRUPXOD��6R�UHQXPEHULQJ�WKH�WHUPV�DV�ZH�GLG�LQ�WKH�SUHYLRXV�H[DPSOH�ZH�JHW�WKH�IROORZLQJ�7D\ORU�6HULHV�

:H�UHDOO\�QHHG�WR�ZRUN�DQRWKHU�H[DPSOH�RU�WZR�LQ�ZKLFK� �LVQ·W�DERXW� �



+HUH�DUH�WKH�ÀUVW�IHZ�GHULYDWLYHV�DQG�WKH�HYDOXDWLRQV�

1RWH�WKDW�ZKLOH�ZH�JRW�D�JHQHUDO�IRUPXOD�KHUH�LW�GRHVQ·W�ZRUN�IRU� ��7KLV�ZLOO�KDSSHQ�RQ�RFFDVLRQ�VR�GRQ·W�ZRUU\�DERXW�LW�ZKHQ�LW�GRHV�

,Q�RUGHU�WR�SOXJ�WKLV�LQWR�WKH�7D\ORU�6HULHV�IRUPXOD�ZH·OO�QHHG�WR�VWULS�RXW�WKH� �WHUP�ÀUVW�

1RWLFH�WKDW�ZH�VLPSOLÀHG�WKH�IDFWRULDOV�LQ�WKLV�FDVH��<RX�VKRXOG�DOZD\V�VLPSOLI\�WKHP�LI�WKHUH�DUH�PRUH�WKDQ�RQH�DQG�LW·V�SRVVLEOH�WR�VLPSOLI\WKHP�

$OVR��GR�QRW�JHW�H[FLWHG�DERXW�WKH�WHUP�VLWWLQJ�LQ�IURQW�RI�WKH�VHULHV��6RPHWLPHV�ZH�QHHG�WR�GR�WKDW�ZKHQ�ZH�FDQ·W�JHW�D�JHQHUDO�IRUPXOD�WKDWZLOO�KROG�IRU�DOO�YDOXHV�RI� �



$JDLQ��KHUH�DUH�WKH�GHULYDWLYHV�DQG�HYDOXDWLRQV�

1RWLFH�WKDW�DOO�WKH�QHJDWLYH�VLJQV�ZLOO�FDQFHO�RXW�LQ�WKH�HYDOXDWLRQ��$OVR��WKLV�IRUPXOD�ZLOO�ZRUN�IRU�DOO� ��XQOLNH�WKH�SUHYLRXV�H[DPSOH�

+HUH�LV�WKH�7D\ORU�6HULHV�IRU�WKLV�IXQFWLRQ�

1RZ��OHW·V�ZRUN�RQH�RI�WKH�HDVLHU�H[DPSOHV�LQ�WKLV�VHFWLRQ��7KH�SUREOHP�IRU�PRVW�VWXGHQWV�LV�WKDW�LW�PD\�QRW�DSSHDU�WR�EH�WKDW�HDV\��RU�PD\EHLW�ZLOO�DSSHDU�WR�EH�WRR�HDV\��DW�ÀUVW�JODQFH�



+HUH�DUH�WKH�GHULYDWLYHV�IRU�WKLV�SUREOHP�

7KLV�7D\ORU�VHULHV�ZLOO�WHUPLQDWH�DIWHU� ��7KLV�ZLOO�DOZD\V�KDSSHQ�ZKHQ�ZH�DUH�ÀQGLQJ�WKH�7D\ORU�6HULHV�RI�D�SRO\QRPLDO��+HUH�LV�WKH7D\ORU�6HULHV�IRU�WKLV�RQH�

:KHQ�ÀQGLQJ�WKH�7D\ORU�6HULHV�RI�D�SRO\QRPLDO�ZH�GRQ·W�GR�DQ\�VLPSOLÀFDWLRQ�RI�WKH�ULJKW�KDQG�VLGH��:H�OHDYH�LW�OLNH�LW�LV��,Q�IDFW��LI�ZH�ZHUHWR�PXOWLSO\�HYHU\WKLQJ�RXW�ZH�MXVW�JHW�EDFN�WR�WKH�RULJLQDO�SRO\QRPLDO�

:KLOH�LW·V�QRW�DSSDUHQW�WKDW�ZULWLQJ�WKH�7D\ORU�6HULHV�IRU�D�SRO\QRPLDO�LV�XVHIXO�WKHUH�DUH�WLPHV�ZKHUH�WKLV�QHHGV�WR�EH�GRQH��7KH�SUREOHP�LVWKDW�WKH\�DUH�EH\RQG�WKH�VFRSH�RI�WKLV�FRXUVH�DQG�VR�DUHQ·W�FRYHUHG�KHUH��)RU�H[DPSOH��WKHUH�LV�RQH�DSSOLFDWLRQ�WR�VHULHV�LQ�WKH�ÀHOG�RI'LIIHUHQWLDO�(TXDWLRQV�ZKHUH�WKLV�QHHGV�WR�EH�GRQH�RQ�RFFDVLRQ�

6R��ZH·YH�VHHQ�TXLWH�D�IHZ�H[DPSOHV�RI�7D\ORU�6HULHV�WR�WKLV�SRLQW�DQG�LQ�DOO�RI�WKHP�ZH�ZHUH�DEOH�WR�ÀQG�JHQHUDO�IRUPXODV�IRU�WKH�VHULHV��7KLVZRQ·W�DOZD\V�EH�WKH�FDVH��7R�VHH�DQ�H[DPSOH�RI�RQH�WKDW�GRHVQ·W�KDYH�D�JHQHUDO�IRUPXOD�FKHFN�RXW�WKH�ODVW�H[DPSOH�LQ�WKH�QH[W�VHFWLRQ�

%HIRUH�OHDYLQJ�WKLV�VHFWLRQ�WKHUH�DUH�WKUHH�LPSRUWDQW�7D\ORU�6HULHV�WKDW�ZH·YH�GHULYHG�LQ�WKLV�VHFWLRQ�WKDW�ZH�VKRXOG�VXPPDUL]H�XS�LQ�RQHSODFH��,Q�P\�FODVV�,�ZLOO�DVVXPH�WKDW�\RX�NQRZ�WKHVH�IRUPXODV�IURP�WKLV�SRLQW�RQ�

��3DXO�'DZNLQV 3DJH�/DVW�0RGLÀHG��

4 Solution to Nonlinear Equations: Root Find-ing

24

Chapter 2

Root FindingSolve f (x) = 0 for x, when an explicit analytical solution is impossible.

2.1 Bisection Method

The bisection method is the easiest to numerically implement and almost alwaysworks. The main disadvantage is that convergence is slow. If the bisection methodresults in a computer program that runs too slow, then other faster methods maybe chosen; otherwise it is a good choice of method.

We want to construct a sequence x0, x1, x2, . . . that converges to the root x = rthat solves f (x) = 0. We choose x0 and x1 such that x0 < r < x1. We say that x0and x1 bracket the root. With f (r) = 0, we want f (x0) and f (x1) to be of oppositesign, so that f (x0) f (x1) < 0. We then assign x2 to be the midpoint of x0 and x1,that is x2 = (x0 + x1)/2, or

x2 = x0 +x1 � x0

2.

The sign of f (x2) can then be determined. The value of x3 is then chosen as eitherthe midpoint of x0 and x2 or as the midpoint of x2 and x1, depending on whetherx0 and x2 bracket the root, or x2 and x1 bracket the root. The root, therefore,stays bracketed at all times. The algorithm proceeds in this fashion and is typicallystopped when the increment to the left side of the bracket (above, given by (x1 �x0)/2) is smaller than some required precision.

2.2 Newton’s Method

This is the fastest method, but requires analytical computation of the derivative off (x). Also, the method may not always converge to the desired root.

We can derive Newton’s Method graphically, or by a Taylor series. We againwant to construct a sequence x0, x1, x2, . . . that converges to the root x = r. Considerthe xn+1 member of this sequence, and Taylor series expand f (xn+1) about the pointxn. We have

f (xn+1) = f (xn) + (xn+1 � xn) f 0(xn) + . . . .

To determine xn+1, we drop the higher-order terms in the Taylor series, and assumef (xn+1) = 0. Solving for xn+1, we have

xn+1 = xn �f (xn)

f 0(xn).

Starting Newton’s Method requires a guess for x0, hopefully close to the root x = r.

2.3 Secant Method

The Secant Method is second best to Newton’s Method, and is used when a fasterconvergence than Bisection is desired, but it is too difficult or impossible to take an

7

2.3. SECANT METHOD

analytical derivative of the function f (x). We write in place of f 0(xn),

f 0(xn) ⇡ f (xn) � f (xn�1)

xn � xn�1.

Starting the Secant Method requires a guess for both x0 and x1.

2.3.1 Estimatep

2 = 1.41421356 using Newton’s Method

Thep

2 is the zero of the function f (x) = x2 � 2. To implement Newton’s Method,we use f 0(x) = 2x. Therefore, Newton’s Method is the iteration

xn+1 = xn �x2

n � 22xn

.

We take as our initial guess x0 = 1. Then

x1 = 1 � �12

=32

= 1.5,

x2 =32�

94 � 2

3=

1712

= 1.416667,

x3 =1712

�172

122 � 2176

=577408

= 1.41426.

2.3.2 Example of fractals using Newton’s Method

Consider the complex roots of the equation f (z) = 0, where

f (z) = z3 � 1.

These roots are the three cubic roots of unity. With

ei2pn = 1, n = 0, 1, 2, . . .

the three unique cubic roots of unity are given by

1, ei2p/3, ei4p/3.

Witheiq = cos q + i sin q,

and cos (2p/3) = �1/2, sin (2p/3) =p

3/2, the three cubic roots of unity are

r1 = 1, r2 = �12

+

p3

2i, r3 = �1

2�

p3

2i.

The interesting idea here is to determine which initial values of z0 in the complexplane converge to which of the three cubic roots of unity.

Newton’s method is

zn+1 = zn �z3

n � 13z2

n.

If the iteration converges to r1, we color z0 red; r2, blue; r3, green. The result willbe shown in lecture.

8 CHAPTER 2. ROOT FINDING

2.4. ORDER OF CONVERGENCE

2.4 Order of convergence

Let r be the root and xn be the nth approximation to the root. Define the error as

en = r � xn.

If for large n we have the approximate relationship

|en+1| = k|en|p,

with k a positive constant, then we say the root-finding numerical method is oforder p. Larger values of p correspond to faster convergence to the root. The orderof convergence of bisection is one: the error is reduced by approximately a factor of2 with each iteration so that

|en+1| =12|en|.

We now find the order of convergence for Newton’s Method and for the SecantMethod.

2.4.1 Newton’s Method

We start with Newton’s Method

xn+1 = xn �f (xn)

f 0(xn).

Subtracting both sides from r, we have

r � xn+1 = r � xn +f (xn)

f 0(xn),

or

en+1 = en +f (xn)

f 0(xn). (2.1)

We use Taylor series to expand the functions f (xn) and f 0(xn) about the root r,using f (r) = 0. We have

f (xn) = f (r) + (xn � r) f 0(r) +12(xn � r)2 f 00(r) + . . . ,

= �en f 0(r) +12

e2n f 00(r) + . . . ;

f 0(xn) = f 0(r) + (xn � r) f 00(r) +12(xn � r)2 f 000(r) + . . . ,

= f 0(r) � en f 00(r) +12

e2n f 000(r) + . . . .

(2.2)

To make further progress, we will make use of the following standard Taylor series:

11 � e

= 1 + e + e2 + . . . , (2.3)

CHAPTER 2. ROOT FINDING 9


which converges for |e| < 1. Substituting (2.2) into (2.1), and using (2.3) yields

en+1 = en +f (xn)

f 0(xn)

= en +�en f 0(r) + 1

2 e2n f 00(r) + . . .

f 0(r) � en f 00(r) + 12 e2

n f 000(r) + . . .

= en +�en + 1

2 e2n

f 00(r)f 0(r) + . . .

1 � enf 00(r)f 0(r) + . . .

= en +

✓�en +

12

e2n

f 00(r)f 0(r)

+ . . .◆✓

1 + enf 00(r)f 0(r)

+ . . .◆

= en +

✓�en + e2

n

✓12

f 00(r)f 0(r)

� f 00(r)f 0(r)

◆+ . . .

◆

= �12

f 00(r)f 0(r)

e2n + . . .

Therefore, we have shown that

|en+1| = k|en|2

as n ! •, with

k =12

��f 00(r)f 0(r)

�� ,

provided f 0(r) 6= 0. Newton’s method is thus of order 2 at simple roots.

2.4.2 Secant Method

Determining the order of the Secant Method proceeds in a similar fashion. We startwith

xn+1 = xn �(xn � xn�1) f (xn)

f (xn) � f (xn�1).

We subtract both sides from r and make use of

xn � xn�1 = (r � xn�1) � (r � xn)

= en�1 � en,

and the Taylor series

f (xn) = �en f 0(r) +12

e2n f 00(r) + . . . ,

f (xn�1) = �en�1 f 0(r) +12

e2n�1 f 00(r) + . . . ,

so that

f (xn) � f (xn�1) = (en�1 � en) f 0(r) +12(e2

n � e2n�1) f 00(r) + . . .

= (en�1 � en)

✓f 0(r) � 1

2(en�1 + en) f 00(r) + . . .

◆.



We therefore have

en+1 = en +�en f 0(r) + 1

2 e2n f 00(r) + . . .

f 0(r) � 12 (en�1 + en) f 00(r) + . . .

= en � en1 � 1

2 enf 00(r)f 0(r) + . . .

1 � 12 (en�1 + en) f 00(r)

f 0(r) + . . .

= en � en

✓1 � 1

2en

f 00(r)f 0(r)

+ . . .◆✓

1 +12

(en�1 + en)f 00(r)f 0(r)

+ . . .◆

= �12

f 00(r)f 0(r)

en�1en + . . . ,

or to leading order

|en+1| =12

��f 00(r)f 0(r)

�� |en�1||en|. (2.4)

The order of convergence is not yet obvious from this equation, and to determinethe scaling law we look for a solution of the form

|en+1| = k|en|p.

From this ansatz, we also have

|en| = k|en�1|p,

and therefore|en+1| = kp+1|en�1|p2

.

Substitution into (2.4) results in

kp+1|en�1|p2=

k2

��f 00(r)f 0(r)

�� |en�1|p+1.

Equating the coefficient and the power of en�1 results in

kp =12

��f 00(r)f 0(r)

�� ,

andp2 = p + 1.

The order of convergence of the Secant Method, given by p, therefore is determinedto be the positive root of the quadratic equation p2 � p � 1 = 0, or

p =1 +

p5

2⇡ 1.618,

which coincidentally is a famous irrational number that is called The Golden Ra-tio, and goes by the symbol F. We see that the Secant Method has an order ofconvergence lying between the Bisection Method and Newton’s Method.

CHAPTER 2. ROOT FINDING 11

5 Numerical Differentiation

31

Chapter 7

Ordinary differential equationsWe now discuss the numerical solution of ordinary differential equations. These

include the initial value problem, the boundary value problem, and the eigenvalueproblem. Before proceeding to the development of numerical methods, we reviewthe analytical solution of some classic equations.

7.1 Examples of analytical solutions

7.1.1 Initial value problemOne classic initial value problem is the RC circuit. With R the resistor and C thecapacitor, the differential equation for the charge q on the capacitor is given by

Rdqdt

+qC

= 0. (7.1)

If we consider the physical problem of a charged capacitor connected in a closedcircuit to a resistor, then the initial condition is q(0) = q0, where q0 is the initialcharge on the capacitor.

The differential equation (7.1) is separable, and separating and integrating fromtime t = 0 to t yields Z q

q0

dqq

= � 1RC

Z t

0dt,

which can be integrated and solved for q = q(t):

q(t) = q0e�t/RC.

The classic second-order initial value problem is the RLC circuit, with differen-tial equation

Ld2qdt2 + R

dqdt

+qC

= 0.

Here, a charged capacitor is connected to a closed circuit, and the initial conditionssatisfy

q(0) = q0,dqdt

(0) = 0.

The solution is obtained for the second-order equation by the ansatz

q(t) = ert,

which results in the following so-called characteristic equation for r:

Lr2 + Rr +1C

= 0.

If the two solutions for r are distinct and real, then the two found exponentialsolutions can be multiplied by constants and added to form a general solution. Theconstants can then be determined by requiring the general solution to satisfy thetwo initial conditions. If the roots of the characteristic equation are complex ordegenerate, a general solution to the differential equation can also be found.

41

7.1. EXAMPLES OF ANALYTICAL SOLUTIONS

7.1.2 Boundary value problems

The dimensionless equation for the temperature y = y(x) along a linear heat-conducting rod of length unity, and with an applied external heat source f (x),is given by the differential equation

� d2ydx2 = f (x), (7.2)

with 0 x 1. Boundary conditions are usually prescribed at the end points ofthe rod, and here we assume that the temperature at both ends are maintained atzero so that

y(0) = 0, y(1) = 0.

The assignment of boundary conditions at two separate points is called a two-point boundary value problem, in contrast to the initial value problem where theboundary conditions are prescribed at only a single point. Two-point boundaryvalue problems typically require a more sophisticated algorithm for a numericalsolution than initial value problems.

Here, the solution of (7.2) can proceed by integration once f (x) is specified. Weassume that

f (x) = x(1 � x),

so that the maximum of the heat source occurs in the center of the rod, and goes tozero at the ends.

The differential equation can then be written as

d2ydx2 = �x(1 � x).

The first integration results in

dydx

=Z

(x2 � x)dx

=x3

3� x2

2+ c1,

where c1 is the first integration constant. Integrating again,

y(x) =Z ✓ x3

3� x2

2+ c1

◆dx

=x4

12� x3

6+ c1x + c2,

where c2 is the second integration constant. The two integration constants are de-termined by the boundary conditions. At x = 0, we have

0 = c2,

and at x = 1, we have

0 =112

� 16

+ c1,

so that c1 = 1/12. Our solution is therefore

y(x) =x4

12� x3

6+

x12

=1

12x(1 � x)(1 + x � x2).

42 CHAPTER 7. ORDINARY DIFFERENTIAL EQUATIONS

7.2. NUMERICAL METHODS: INITIAL VALUE PROBLEM

The temperature of the rod is maximum at x = 1/2 and goes smoothly to zero atthe ends.

7.1.3 Eigenvalue problem

The classic eigenvalue problem obtained by solving the wave equation by separationof variables is given by

d2ydx2 + l2y = 0,

with the two-point boundary conditions y(0) = 0 and y(1) = 0. Notice thaty(x) = 0 satisfies both the differential equation and the boundary conditions. Othernonzero solutions for y = y(x) are possible only for certain discrete values of l.These values of l are called the eigenvalues of the differential equation.

We proceed by first finding the general solution to the differential equation. Itis easy to see that this solution is

y(x) = A cos lx + B sin lx.

Imposing the first boundary condition at x = 0, we obtain

A = 0.

The second boundary condition at x = 1 results in

B sin l = 0.

Since we are searching for a solution where y = y(x) is not identically zero, wemust have

l = p, 2p, 3p, . . . .

The corresponding negative values of l are also solutions, but their inclusion onlychanges the corresponding values of the unknown B constant. A linear superposi-tion of all the solutions results in the general solution

y(x) =•

Ân=1

Bn sin npx.

For each eigenvalue np, we say there is a corresponding eigenfunction sin npx.When the differential equation can not be solved analytically, a numerical methodshould be able to solve for both the eigenvalues and eigenfunctions.

7.2 Numerical methods: initial value problem

We begin with the simple Euler method, then discuss the more sophisticated Runge-Kutta methods, and conclude with the Runge-Kutta-Fehlberg method, as imple-mented in the MATLAB function ode45.m. Our differential equations are for x =x(t), where the time t is the independent variable, and we will make use of thenotation x = dx/dt. This notation is still widely used by physicists and descendsdirectly from the notation originally used by Newton.

CHAPTER 7. ORDINARY DIFFERENTIAL EQUATIONS 43

7.2. NUMERICAL METHODS: INITIAL VALUE PROBLEM

7.2.1 Euler method

The Euler method is the most straightforward method to integrate a differentialequation. Consider the first-order differential equation

x = f (t, x), (7.3)

with the initial condition x(0) = x0. Define tn = nDt and xn = x(tn). A Taylorseries expansion of xn+1 results in

xn+1 = x(tn + Dt)

= x(tn) + Dtx(tn) + O(Dt2)

= x(tn) + Dt f (tn, xn) + O(Dt2).

The Euler Method is therefore written as

xn+1 = x(tn) + Dt f (tn, xn).

We say that the Euler method steps forward in time using a time-step Dt, startingfrom the initial value x0 = x(0). The local error of the Euler Method is O(Dt2).The global error, however, incurred when integrating to a time T, is a factor of 1/Dtlarger and is given by O(Dt). It is therefore customary to call the Euler Method afirst-order method.

7.2.2 Modified Euler method

This method is of a type that is called a predictor-corrector method. It is also thefirst of what are Runge-Kutta methods. As before, we want to solve (7.3). The ideais to average the value of x at the beginning and end of the time step. That is, wewould like to modify the Euler method and write

xn+1 = xn +12

Dt�

f (tn, xn) + f (tn + Dt, xn+1)�.

The obvious problem with this formula is that the unknown value xn+1 appearson the right-hand-side. We can, however, estimate this value, in what is called thepredictor step. For the predictor step, we use the Euler method to find

xpn+1 = xn + Dt f (tn, xn).

The corrector step then becomes

xn+1 = xn +12

Dt�

f (tn, xn) + f (tn + Dt, xpn+1)

�.

The Modified Euler Method can be rewritten in the following form that we willlater identify as a Runge-Kutta method:

k1 = Dt f (tn, xn),k2 = Dt f (tn + Dt, xn + k1),

xn+1 = xn +12(k1 + k2).

(7.4)

44 CHAPTER 7. ORDINARY DIFFERENTIAL EQUATIONS

6 Numerical Integration

36

Chapter 6

IntegrationWe want to construct numerical algorithms that can perform definite integrals

of the form

I =Z b

af (x)dx. (6.1)

Calculating these definite integrals numerically is called numerical integration, nu-merical quadrature, or more simply quadrature.

6.1 Elementary formulas

We first consider integration from 0 to h, with h small, to serve as the building blocksfor integration over larger domains. We here define Ih as the following integral:

Ih =Z h

0f (x)dx. (6.2)

To perform this integral, we consider a Taylor series expansion of f (x) about thevalue x = h/2:

f (x) = f (h/2) + (x � h/2) f 0(h/2) +(x � h/2)2

2f 00(h/2)

+(x � h/2)3

6f 000(h/2) +

(x � h/2)4

24f 0000(h/2) + . . .

6.1.1 Midpoint ruleThe midpoint rule makes use of only the first term in the Taylor series expansion.Here, we will determine the error in this approximation. Integrating,

Ih = h f (h/2) +Z h

0

✓(x � h/2) f 0(h/2) +

(x � h/2)2

2f 00(h/2)

+(x � h/2)3

6f 000(h/2) +

(x � h/2)4

24f 0000(h/2) + . . .

◆dx.

Changing variables by letting y = x� h/2 and dy = dx, and simplifying the integraldepending on whether the integrand is even or odd, we have

Ih = h f (h/2)

+Z h/2

�h/2

✓y f 0(h/2) +

y2

2f 00(h/2) +

y3

6f 000(h/2) +

y4

24f 0000(h/2) + . . .

◆dy

= h f (h/2) +Z h/2

0

✓y2 f 00(h/2) +

y4

12f 0000(h/2) + . . .

◆dy.

The integrals that we need here are

Z h2

0y2dy =

h3

24,

Z h2

0y4dy =

h5

160.

35

6.2. COMPOSITE RULES

Therefore,

Ih = h f (h/2) +h3

24f 00(h/2) +

h5

1920f 0000(h/2) + . . . . (6.3)

6.1.2 Trapezoidal rule

From the Taylor series expansion of f (x) about x = h/2, we have

f (0) = f (h/2) � h2

f 0(h/2) +h2

8f 00(h/2) � h3

48f 000(h/2) +

h4

384f 0000(h/2) + . . . ,

and

f (h) = f (h/2) +h2

f 0(h/2) +h2

8f 00(h/2) +

h3

48f 000(h/2) +

h4

384f 0000(h/2) + . . . .

Adding and multiplying by h/2 we obtain

h2�

f (0) + f (h)�

= h f (h/2) +h3

8f 00(h/2) +

h5

384f 0000(h/2) + . . . .

We now substitute for the first term on the right-hand-side using the midpoint ruleformula:

h2�

f (0) + f (h)�

=

✓Ih �

h3

24f 00(h/2) � h5

1920f 0000(h/2)

◆

+h3

8f 00(h/2) +

h5

384f 0000(h/2) + . . . ,

and solving for Ih, we find

Ih =h2�

f (0) + f (h)�� h3

12f 00(h/2) � h5

480f 0000(h/2) + . . . . (6.4)

6.1.3 Simpson’s ruleTo obtain Simpson’s rule, we combine the midpoint and trapezoidal rule to elimi-nate the error term proportional to h3. Multiplying (6.3) by two and adding to (6.4),we obtain

3Ih = h✓

2 f (h/2) +12

( f (0) + f (h))

◆+ h5

✓2

1920� 1

480

◆f 0000(h/2) + . . . ,

or

Ih =h6�

f (0) + 4 f (h/2) + f (h)�� h5

2880f 0000(h/2) + . . . .

Usually, Simpson’s rule is written by considering the three consecutive points 0, hand 2h. Substituting h ! 2h, we obtain the standard result

I2h =h3�

f (0) + 4 f (h) + f (2h)�� h5

90f 0000(h) + . . . . (6.5)

6.2 Composite rules

We now use our elementary formulas obtained for (6.2) to perform the integralgiven by (6.1).

36 CHAPTER 6. INTEGRATION

6.2. COMPOSITE RULES

6.2.1 Trapezoidal rule

We suppose that the function f (x) is known at the n + 1 points labeled as x0, x1, . . . , xn,with the endpoints given by x0 = a and xn = b. Define

fi = f (xi), hi = xi+1 � xi.

Then the integral of (6.1) may be decomposed as

Z b

af (x)dx =

n�1

Âi=0

Z xi+1

xi

f (x)dx

=n�1

Âi=0

Z hi

0f (xi + s)ds,

where the last equality arises from the change-of-variables s = x � xi. Applying thetrapezoidal rule to the integral, we have

Z b

af (x)dx =

n�1

Âi=0

hi2

( fi + fi+1) . (6.6)

If the points are not evenly spaced, say because the data are experimental values,then the hi may differ for each value of i and (6.6) is to be used directly.

However, if the points are evenly spaced, say because f (x) can be computed, wehave hi = h, independent of i. We can then define

xi = a + ih, i = 0, 1, . . . , n;

and since the end point b satisfies b = a + nh, we have

h =b � a

n.

The composite trapezoidal rule for evenly space points then becomes

Z b

af (x)dx =

h2

n�1

Âi=0

( fi + fi+1)

=h2

( f0 + 2 f1 + · · · + 2 fn�1 + fn) . (6.7)

The first and last terms have a multiple of one; all other terms have a multiple oftwo; and the entire sum is multiplied by h/2.

6.2.2 Simpson’s rule

We here consider the composite Simpson’s rule for evenly space points. We applySimpson’s rule over intervals of 2h, starting from a and ending at b:

Z b

af (x)dx =

h3

( f0 + 4 f1 + f2) +h3

( f2 + 4 f3 + f4) + . . .

+h3

( fn�2 + 4 fn�1 + fn) .

CHAPTER 6. INTEGRATION 37

6.3. LOCAL VERSUS GLOBAL ERROR

Note that n must be even for this scheme to work. Combining terms, we have

Z b

af (x)dx =

h3

( f0 + 4 f1 + 2 f2 + 4 f3 + 2 f4 + · · · + 4 fn�1 + fn) .

The first and last terms have a multiple of one; the even indexed terms have amultiple of 2; the odd indexed terms have a multiple of 4; and the entire sum ismultiplied by h/3.

6.3 Local versus global error

Consider the elementary formula (6.4) for the trapezoidal rule, written in the form

Z h

0f (x)dx =

h2�

f (0) + f (h)�� h3

12f 00(x),

where x is some value satisfying 0 x h, and we have used Taylor’s theoremwith the mean-value form of the remainder. We can also represent the remainderas

� h3

12f 00(x) = O(h3),

where O(h3) signifies that when h is small, halving of the grid spacing h decreasesthe error in the elementary trapezoidal rule by a factor of eight. We call the termsrepresented by O(h3) the Local Error.

More important is the Global Error which is obtained from the composite formula(6.7) for the trapezoidal rule. Putting in the remainder terms, we have

Z b

af (x)dx =

h2

( f0 + 2 f1 + · · · + 2 fn�1 + fn) � h3

12

n�1

Âi=0

f 00(xi),

where xi are values satisfying xi xi xi+1. The remainder can be rewritten as

� h3

12

n�1

Âi=0

f 00(xi) = �nh3

12⌦

f 00(xi)↵,

where⌦

f 00(xi)↵

is the average value of all the f 00(xi)’s. Now,

n =b � a

h,

so that the error term becomes

�nh3

12⌦

f 00(xi)↵

= � (b � a)h2

12⌦

f 00(xi)↵

= O(h2).

Therefore, the global error is O(h2). That is, a halving of the grid spacing onlydecreases the global error by a factor of four.

Similarly, Simpson’s rule has a local error of O(h5) and a global error of O(h4).

38 CHAPTER 6. INTEGRATION

7 More on Numerical Differentiation and Inte-gration

41

CHAPTER 5

Numerical Differentiation and Integration

Numerical differentiation is explained in these notes using thesymmetrical increments formula and showing how this can be gen-eralized for higher order derivatives. A brief glimpse into the insta-bility of the problem is given.

Integration is dealt with in deeper detail (because it is easier andmuch more stable) but also briefly. A simple definition of quadratureformula is given and the easiest ones (the trapezoidal and Simpson’srules) are explained.

1. Numerical Differentiation

Sometimes (for example, when approximating the solution of adifferential equation) one has to approximate the value of the deriv-ative of a function at point. A symbolic approach may be unavailable(either because of the software or because of its computational cost)and a numerical recipe may be required. Formulas for approximat-ing the derivative of a function at a point are available (and also forhigher order derivatives) but one also needs, in general, bounds forthe error incurred. In these notes we shall only show the symmetricrule and explain why the naive approximation is suboptimal.? instability?

1.1. The Symmetric Formula for the Derivative. The first (sim-ple) idea for approximating the derivative of f at x is to use the def-inition of derivative as a limit, that is, the following formula:

(20) f 0(x) ' f (x + h)� f (x)

h,

where h is a small increment of the x variable.However, the very expression in formula (20) shows its weak-

ness: should one take h positive or negative? This is not irrelevant.Assume f (x) = 1/x and try to compute its derivative at x = 2. Weshall take |h| = .01. Obviously f 0(0.5) = 0.25. Using the “natural”approximation one has, for h > 0:

f (x + .01)� f (x)

.01=

12.01 � 1

2.01

= �0.248756+

69

70 5. NUMERICAL DIFFERENTIATION AND INTEGRATION

x0 � h x0 x0 + h

f (x0 � h)

f (x0)

f (x0 + h)

FIGURE 1. Approximate derivative: on the right & left(dotted) and symmetric (dashed). The symmetric oneis much similar to the tangent (solid straight line).

whereas, for h < 0:

f (x� .01)� f (x)

�.01= �

11.99 � 1

2.01

= �0.251256+

which are already different at the second decimal digit. Which ofthe two is to be preferred? There is no abstract principle leading tochoosing one or the other.

Whenever there are two values which approximate a third oneand there is no good reason for preferring one to the other, it is rea-sonable to expect that the mean value should be a better approxima-tion than either of the two. In this case:

12

✓f (x + h)� f (x)

h+

f (x� h)� f (x)

�h

◆= �0.2500062+

which is an approximation to the real value 0.25 to five significantfigures.

This is, actually, the correct way to state the approximation prob-lem to numerical differentiation: to use the symmetric differencearound the point and divide by the double of the width of the inter-val. This is exactly the same as taking the mean value of the “right”and “left” hand derivative.

THEOREM 8. The naive approximation to the derivative has order 1precision, whereas the symmetric formula has order 2.

2. NUMERICAL INTEGRATION—QUADRATURE FORMULAS 71

PROOF. Let f be a three times differentiable function on [x �h, x + h]. Using the Taylor polynomial of degree 1, one has

f (x + h) = f (x) + f 0(x)h +f 00(x)

2h2,

for some x in the interval, so that, solving f 0(x), one gets

f 0(x) =f (x + h)� f (x)

h� f 00(x)

2h,

which is exactly the meaning of “having order 1”. Notice that therightmost term cannot be eliminated.

However, for the symmetric formula one can use the Taylor poly-nomial of degree 2, twice:

f (x + h) = f (x) + f 0(x)h +f 00(x)

2h2 +

f 3)(x)

6h3,

f (x� h) = f (x)� f 0(x)h +f 00(x)

2h2 � f 3)(z)

6h3

for some x 2 [x� h, x + h] and z 2 [x� h, x + h]. Subtracting:

f (x + h)� f (x� h) = 2 f 0(x)h + K(x, z)h3

where K is a sum of the degree 3 coefficients in the previous equa-tion, so that

f 0(x) =f (x + h)� f (x� h)

2h+

K(x, z)

2h2,

which is the meaning of “having order 2”. ⇤

2. Numerical Integration—Quadrature Formulas

Numerical integration, being a problem in which errors accumu-late (a “global” problem) is more stable than differentiation, surpris-ing as it may seem (even though taking derivatives is much simpler,symbolically, than integration).

In general, one wishes to state an abstract formula for perform-ing numerical integration: an algebraic expression such that, givena function f : [a, b] ! R whose integral is to be computed, one cansubstitute f in the expression and get a value —the approximated in-tegral.

DEFINITION 16. A simple quadrature formula for an interval [a, b] isa family of points x1 < x2 < · · · < xn+1 2 [a, b] and coefficients (also


2 3 3.5 4 5

0

1

2f (x)

FIGURE 2. Midpoint rule: the area under f is approxi-mated by the rectangle.

called weights) a1, . . . , an+1. The approximate integral of a function f on[a, b] using that formula is the expression

a1 f (x1) + a2 f (x2) + · · · + an+1 f (xn+1).

That is, a quadrature formula is no more than “a way to approx-imate an integral using intermediate points and weights”. The for-mula is closed if x1 = a and xn+1 = b; if neither a nor b are part of theintermediate points, then it is open.

If the xi are evenly spaced, then the formula is a Newton-Coatesformula. These are the only ones we shall explain in this course.

Obviously, one wants the formulas which best approach the inte-grals of known functions.

DEFINITION 17. A quadrature formula with coefficients a1, . . . , an+1is of order m if for any polynomial of degree m, one has

Z b

aP(x) dx = a1P(x1) + a2P(x2) + · · · + an+1P(xn+1).

That is, if the formula is exact for polynomials of degree m.

The basic quadrature formulas are: the midpoint formula (open,n = 0), the trapezoidal rule (closed, two points, n = 1) and Simp-son’s formula (closed, three points, n = 2).

We first show the simple versions and then generalize them totheir composite versions.


2.1. The Midpoint Rule. A coarse but quite natural way to ap-proximate an integral is to multiply the value of the function at themidpoint by the width of the interval. This is the midpoint formula:

DEFINITION 18. The midpoint quadrature formula corresponds tox1 = (a + b)/2 and a1 = (b� a). That is, the approximation

Z b

af (x) dx ' (b� a) f

✓a + b

2

◆.

given by the area of the rectangle having a horizontal side at f ((b�a)/2).

One checks easily that the midpoint rule is of order 1: it is exactfor linear polynomials but not for quadratic ones.

2.2. The Trapezoidal Rule. The next natural approximation (whichis not necessarily better) is to use two points. As there are two al-ready given (a and b), the naive idea is to use them.

Given a and b, one has the values of f at them. One could in-terpolate f linearly (using a line) and approximate the value of theintegral with the area under the line. Or one could use the meanvalue between two rectangles (one with height f (a) and the otherwith height f (b)). The fact is that both methods give the same value.

DEFINITION 19. The trapezoidal rule for [a, b] corresponds to x1 =a, x2 = b and weights a1 = a2 = (b� a)/2. That is, the approxima-tion Z b

af (x) dx ' b� a

2( f (a) + f (b))

for the integral of f using the trapeze with a side on [a, b], joining(a, f (a)) with (b, f (b)) and parallel to OY.

Even though it uses one point more than the midpoint rule, thetrapezoidal rule is also of order 1.

2.3. Simpson’s Rule. The next natural step involves 3 points in-stead of 2 and using a parabola instead of a straight line. This methodis remarkably precise (it has order 3) and is widely used. It is calledSimpson’s Rule.

DEFINITION 20. Simpson’s rule is the quadrature formula corre-sponding to the nodes x1 = a, x2 = (a + b)/2 and x3 = b, andthe weights, corresponding to the correct interpolation of a degree


2 3 4 5

0

1

2f (x)

FIGURE 3. trapezoidal rule: approximate the area un-der f by that of the trapeze.

2 polynomial. That is1, a1 = b�a6 , a2 = 4(b�a)

6 and a3 = b�a6 . Hence, it

is the approximation of the integral of f byZ b

af (x) dx ' b� a

6

✓f (a) + 4 f

✓a + b

2

◆+ f (b)

◆,

which is a weighted mean of the areas of three intermediate rectan-gles. This rule must be memorized: one sixth of the length times thevalues of the function at the endpoints and midpoint with weights1, 4, 1.

The remarkable property of Simpson’s rule is that it has order 3:even though a parabola is used, the rule integrates correctly poly-nomials of degree up to 3. Notice that one does not need to know theequation of the parabola: the values of f and the weights (1/6, 4/6, 1/6)are enough.

2.4. Composite Formulas. Composite quadrature formulas areno more than “applying the simple ones in subintervals”. For ex-ample, instead of using the trapezoidal rule for approximating anintegral, like

Z b

af (x) dx ' b� a

2( f (a) + f (b))

one subdivides [a, b] into subintervals and performs the approxima-tion on each of these.

1This is an easy computation which the reader should perform by himself.


2 3 3.5 4 5

0

1

2f (x)

FIGURE 4. Simpson’s rule: approximating the area un-der f by the parabola passing through the endpointsand the midpoint (black).

2 2.5 3 3.5 4 4.5 5

0

1

2f (x)

FIGURE 5. Composite trapezoidal rule: just repeat thesimple one on each subinterval.

One might do this mechanically, one subinterval after another.The fact is that, due to their additive nature, it is always simpler toapply the general formula than to apply the simple one step by step,for closed Newton-Coates formulas (not for open ones, for whichthe computations are the same one way or the other).

2.4.1. Composite Trapezoidal Rule. If there are two consecutive in-tervals [a, b] and [b, c] of equal length (that is, b� a = c� b) and the


trapezoidal rule is applied on both in order to approximate the inte-gral of f on [a, c], one gets:

Z c

af (x) dx =

b� a2

( f (b) + f (a)) +c� b

2( f (c) + f (b)) =

h2

( f (a) + 2 f (b) + f (c)) ,

where h = b� a is the width of each subinterval, either [a, b] or [b, c]:the formula just a weighted mean of the values of f at the endpointsand the midpoint. In the general case, when there are more than twosubintervals, one gets an analogous formula:

DEFINITION 21 (Composite trapezoidal rule). Given an interval[a, b] and n + 1 nodes, the composite trapezoidal rule for [a, b] with n + 1nodes is given by the nodes x0 = a, x1 = a + h, . . . , xn = b, the valueh = (b� a)/n and the approximation

Z c

af (x) dx ' h

2( f (a) + 2 f (x2) + 2 f (x3) + · · · + 2 f (xn) + f (b)) .

That is, half the width of the subintervals times the sum of the valuesat the endpoints and the double of the values at the interior points.

2.4.2. Composite Simpson’s Rule. In a similar way, the compositeSimpson’s rule is just the application of Simpson’s rule in a sequenceof subintervals (which, for the Newton-Coates formula, have all thesame width). As before, as the left endpoint of each subinterval isthe right endpoint for the next one, the whole composite formula issomewhat simpler to implement than the mere addition of the sim-ple formula on each subinterval.

DEFINITION 22 (Composite Simpson’s rule). Given an interval[a, b], divided into n subintervals (so, given 2n + 1 evenly distributednodes), Simpson’s composite rule for [a, b] with 2n + 1 nodes is given bythe nodes x0 = a, x1 = a + h, . . . , x2n = b, (where h = (b� a)/(2n))and the approximation

Z c

af (x) dx ' b� a

6n( f (a) + 4 f (x2) + 2 f (x3) + 4 f (x3) + . . .

· · · + 2 f (x2n�1) + 4 f (x2n) + f (b)).

This means that the composite Simpson’s rule is the same as thesimple version taking into account that one multiplies all by h/6,where h is the width of each subinterval and the coefficients are 1for the two endpoints, 4 for the inner midpoints and 2 for the innerendpoints.


2 2.75 3.5 4.25 5

0

20

40

60

80 f (x)

FIGURE 6. Composite Simpson’s rule: conceptually, itis just the repetition of the simple rule on each subin-terval.

CHAPTER 6

Differential Equations

There is little doubt that the main application of numerical meth-ods is for integrating differential equations (which is the technical termfor “solving them numerically”).

1. Introduction

A differential equation is a special kind of equation: one in whichone of the unknowns is a function. We have already studied some:any integral is a differential equation (it is the simplest kind). Forexample,

y0 = x

is an equation in which one seeks a function y(x) whose derivativeis x. It is well-known that the solution is not unique: there is anintegration constant and the general solution is written as

y =x2

2+ C.

This integration constant can be better understood graphically.When computing the primitive of a function, one is trying to finda function whose derivative is known. A function can be thoughtof as a graph on the X, Y plane. The integration constant specifiesat which height the graph is. This does not change the derivative,obviously. On the other hand, if one is given the specific problem

y0 = x, y(3) = 28,

then one is trying to find a function whose derivative is x, with a con-dition at a point: that its value at 3 be 28. Once this value is fixed, thereis only one graph having that shape and passing through (3, 28). The con-dition y(3) = 28 is called an initial condition. One imposes that thegraph of f passes through a point and then there is only one f solv-ing the integration problem, which means there is only one suitableconstant C. As a matter of fact, C can be computed by substitution:

28 = 32 + C ) C = 19.79

80 6. DIFFERENTIAL EQUATIONS

The same idea gives rise to the term initial condition for a differentialequation.

Consider the equation

y0 = y

(whose general solution should be known). This equation means:find a function y(x) whose derivative is equal to the same functiony(x) at every point. One tends to think of the solution as y(x) = ex

but. . . is this the only possible solution? A geometrical approach maybe more useful; the equation means “the function y(x) whose deriv-ative is equal to the height y(x) at each point.” From this point ofview it seems obvious that there must be more than one solution tothe problem: at each point one should be able to draw the corre-sponding tangent, move a little to the right and do the same. Thereis nothing special on the points (x, ex) for them to be the only solu-tion to the problem. Certainly, the general solution to the equationis

y(x) = Cex,

where C is an integration constant. If one also specifies an initial con-dition, say y(x0) = y0, then necessarily

y0 = Cex0

so that C = y0/ex0 is the solution to the initial value problem (noticethat the denominator is not 0).

This way, in order to find a unique solution to a differential equa-tion, one needs to specify at least one initial condition. As a matterof fact, the number of these must be the same as the order of theequation.

Consider

y00 = �y,

whose solutions should also be known: the functions of the formy(x) = a sin(x) + b cos(x), for two constants a, b 2 R. In order forthe solution to be unique, two initial conditions must be specified.They are usually stated as the value of y at some point, and the valueof the derivative at that same place. For example, y(0) = 1 andy0(0) = �1. In this case, the solution is y(x) = � sin(x) + cos(x).

This chapter deals with approximate solutions to differential equa-tions. Specifically, ordinary differential equations (i.e. with functionsy(x) of a single variable x).

2. THE BASICS 81

2. The Basics

The first definition is that of differential equation

DEFINITION 23. An ordinary differential equation is an equality A =B in which the only unknown is a function of one variable whose de-rivative of some order appears explicitly.

The adjective ordinary is the condition on the unknown of beinga function of a single variable (there are no partial derivatives).

EXAMPLE 18. We have shown some examples above. Differentialequations can get many forms:

y0 = sin(x)

xy = y0 � 1

(y0)2 � 2y00 + x2y = 0

y0

y� xy = cos(y)

etc.

In this chapter, the unknown in the equation will always be de-noted with the letter y. The variable on which it depends will usuallybe either x or t.

DEFINITION 24. A differential equation is of order n if n is thehighest order derivative of y appearing in it.

The specific kind of equations we shall study in this chapter arethe solved ones (which does not mean that they are already solved,but that they are written in a specific way):

y0 = f (x, y).

DEFINITION 25. An initial value problem is a differential equationtogether with an initial condition of the form y(x0) = y0, wherex0, y0 2 R.

DEFINITION 26. The general solution to a differential equation Eis a family of functions f (x, c), where c is one (or several) constantssuch that:

• Any solution of E has the form f (x, c) for some c.• Any expression f (x, c) is a solution of E,

except for possibly a finite number of values of c.


If integrating functions of a real variable is already a complexproblem, the exact integration a differential equation is, in general,impossible. That is, the explicit computation of the symbolic solutionto a differential equation is a problem which is usually not tackled.What one seeks is to know an approximate solution and a reasonablygood bound for the error incurred when using that approximationinstead of the “true” solution.

Notice that, in reality, most of the numbers appearing in the equa-tion describing a problem will already be inexact, so trying to get an“exact” solution is already a mistake.

3. Discretization

We shall assume a two-variable function f (x, y) is given, whichis defined on a region x 2 [x0, xn], y 2 [a, b], and which satisfies thefollowing condition (which the reader is encouraged to forget):

DEFINITION 27. A function f (x, y) defined on a set X 2 R2 satis-fies Lipschitz’s condition if there exists K > 0 such that

| f (x1)� f (x2)| K|x1 � x2|for any x1, x2,2 X, where | | denotes the absolute value of a number.

This is a kind of “strong continuity condition” (i.e. it is easier fora function to be continuous than to be Lipschitz). What matters isthat this condition has a very important consequence for differentialequations: it guarantees the uniqueness of the solution. Let X bea set [x0, xn] ⇥ [a, b] (a strip, or a rectangle) and f (x, y) : X ! R afunction on X which satisfies Lipschitz’s condition. Then

THEOREM 9 (Cauchy-Kovalevsky). Under the conditions above, anydifferential equation y0 = f (x, y) with an initial condition y(x0) = y0 fory0 2 (a, b) has a unique solution y = y(x) defined on [x0, x0 + t] for somet 2 R greater than 0.

Lipschitz’s condition is not so strange. As a matter of fact, poly-nomials and all the “analytic functions” (exponential, logarithms,trigonometric functions, etc. . . ) and their inverses (where they aredefined and continuous) satisfy it. An example which does not isf (x) =

px on an interval containing 0, because f at that point has

a “vertical” tangent line. The reader should not worry about thiscondition (only if he sees a derivative becoming infinity or a point ofdiscontinuity, but we shall not discuss them in these notes). We givejust an example:

3. DISCRETIZATION 83

EXAMPLE 19 (Bourbaki “Functions of a Real Variable”, Ch. 4, §1).The differential equation y0 = 2

p|y| with initial condition y(0) = 0

has an infinite number of solutions. For example, any of the follow-ing, for a, b > 0:

(1) y(t) = 0 for any interval (�b, a),(2) y(t) = �(t + b)2 for t �b,(3) y(t) = (t� a)2 for t � a

is a solution of that equation. This is because the function on theright hand side,

p|y|, is not Lipschitz near y = 0. (The reader is

suggested to verify both assertions).

In summary, any “normal” initial value problem has a uniquesolution. What is difficult is finding this.

And what about an approximation?

3.1. The derivative as an arrow. One is usually told that the de-rivative of a function of a real variable is the slope of its graph at thecorresponding point. However, a more useful idea for the presentchapter is to think of it as the Y coordinate of the velocity vector of thegraph.

When plotting a function, one should imagine that one is draw-ing a curve with constant horizontal speed (because the OX�axis ishomogeneous, one goes from left to right at uniform speed). Thisway, the graph of f (x) is actually the plane curve (x, f (x)). Its tan-gent vector at any point is (1, f 0(x)): the derivative f 0(x) of f (x) isthe vertical component of this vector.

From this point of view, a differential equation in solved formy0 = f (x, y) can be interpreted as the statement “find a curve (x, y(x))such that the velocity vector at each point is (1, f (x, y)).” One canthen draw the family of “velocity” vectors on the plane (x, y) givenby (1, f (x, y)) (the function f (x, y) is known, remember). This visu-alization, like in Figure 1, already gives an idea of the shape of thesolution.

Given the arrows —the vectors (1, f (x, y))— on a plane, draw-ing a curve whose tangents are those arrows should not be too hard.Even more, if what one needs is just an approximation, instead ofdrawing a curve, one could draw “little segments going in the di-rection of the arrows.” If these segments have a very small x�coor-dinate, one reckons that an approximation to the solution will beobtained.

This is exactly Euler’s idea.


FIGURE 1. Arrows representing vectors of a solveddifferential equation y0 = f (x, y). Notice how the hor-izontal component is constant while the vertical oneis not. Each arrow corresponds to a vector (1, f (x, y))where f (x, y) ' x cos(y).

Given a plane “filled with arrows” indicating the velocity vectorsat each point, the simplest idea for drawing the corresponding curvethat:

• Passes through a specified point (initial condition y(x0) =y0)

• Whose y�velocity component is f (x, y) at each point. . .consists in discretizing the x�coordinates. Starting at x0, one as-sumes that OX is quantized in intervals of width h (constant, bynow) “small enough.” Now, instead of drawing a smooth curve, oneapproximates it by small steps (of width h) on the x�variable.

As the solution has to verify y(x0) = y0, one needs only• Draw the point (x0, y0) (the initial condition).• Compute f (x0, y0), the vertical value of the velocity vector of

the solution at the initial point.• As the x�coordinate is quantized, the next point of the ap-

proximation will have x�coordinate equal to x0 + h.• As the velocity at (x0, y0) is (1, f (x0, y0)), the simplest ap-

proximation to the displacement of the curve in an interval ofwidth h on the x�coordinate is (h, h f (x0, y0)). Let x1 = x0 + hand y1 = y0 + h f (x0, y0).

• Draw the segment from (x0, y0) to (x1, y1): this is the first“approximate segment” of the solution.

4. SOURCES OF ERROR: TRUNCATION AND ROUNDING 85

• At this point one is in the same situation as at the beginningbut with (x1, y1) instead of (x0, y0). Repeat.

And so on as many times as one desires. The above is Euler’s algo-rithm for numerical integration of differential equations.

4. Sources of error: truncation and rounding

It is obvious that the solution to an equation y0 = f (x, y) willnever (or practically so) be a line composed of straight segments.When using Euler’s method, there is an intrinsic error which canbe analyzed, for example, with Taylor’s formula. Assume f (x, y)admits a sufficient number of partial derivatives. Then y(x) will bedifferentiable also and

y(x0 + h) = y(x0) + hy0(x0) +h2

2y00(q)

for some q 2 [x0, x0 + h]. What Euler’s method does is to remove thelast term, so that the error incurred is exactly that: a term of order 2in h in the first iteration. This error, which arises from truncating theTaylor expansion is called truncation error in this setting.

In general, one assumes that the interval along which the inte-gration is performed has width of magnitude h�1. Usually, startingfrom an interval [a, b] one posits a number of subintervals n (or inter-mediate points, n� 1) and takes x0 = a, xn = b and h = (b� a)/n, sothat going from x0 to xn requires n iterations and the global truncationerror has order h�1O(h2) ' O(h).

However, truncation is not the only source of error. Floating-point operations incur always rounding errors.

4.1. Details on the truncation and rounding errors. Specifically,if using IEEE-754, one considers that the smallest significant quantity(what is called, the epsilon) is e = 2�52 ' 2.2204 ⇥ 10�16. Thus,the rounding error in each operation is less than e. If “operation”means a step in the Euler algorithm (which may or may not be thecase), then each step from xi to xi+1 incurs an error of at most e andhence, the global rounding error is bounded by e/h. This is very trickybecause it implies that the smaller the interval h, the larger the roundingerror incurred, which is counterintutive. So, making h smaller doesnot necessarily improve the accuracy of the method!

Summing up, the addition of the truncation and rounding errorscan be approximated as

E(h) ' e

h+ h,


which grows both when h becomes larger and when it gets smaller.The minimum E(h) is given by h ' pe: which means that there isno point in using intervals of width less than

pe in the Euler method

(actually, ever). Taking smaller intervals may perfectly lead to hugeerrors.

This explanation about truncation and rounding errors is rele-vant for any method, not just Euler’s. One needs to bound both forevery method and know how to choose the best h. There are evenproblems for which a single h is not useful and it has to be modifiedduring the execution of the algorithm. We shall not deal with theseproblems here (the interested reader should look for the term “stiffdifferential equation”).

5. Quadratures and Integration

Assume that in the differential equation y0 = f (x, y), the func-tion f depends only on the variable x. Then, the equation would bewritten

y0 = f (x),

expression which means y is the function whose derivative is f (x). Thatis, the problem is that of computing a primitive. We have alreadydealt with it in Chapter 5 but it is inherently related to what we aredoing for differential equations.

When f (x, y) depends also on y, the relation between the equa-tion and a primitive is harder to perceive but we know (by the Fun-damental Theorem of Calculus) that, if y(x) is the solution to theinitial value problem y0 = f (x, y) with y(x0) = y0, then, integratingboth sides, one gets

y(x) =Z x

x0

f (t, y(t)) dt + y0,

which is not a primitive but looks like it. In this case, there is no way toapproximate the integral using the values of f at intermediate pointsbecause one does not know the value of y(t). But one can take a similarapproach.

6. Euler’s Method: Integrate Using the Left Endpoint

One can state Euler’s method as in Algorithm 10. If one readscarefully, the gist of each step is to approximate each value yi+1 as theprevious one yi plus f evaluated at (xi, yi) times the interval width.

7. MODIFIED EULER: THE MIDPOINT RULE 87

That is: Z xi+1

xi

f (t, y(t)) dt ' (xi+1 � xi) f (xi, yi) = h f (xi, yi).

If f (x, y) were independent of y, then one would be performing thefollowing approximation (for an interval [a, b]):

Z b

af (t) dt = (b� a) f (a),

which, for lack of a better name, could be called the left endpoint rule:the integral is approximated by the area of the rectangle of heightf (a) and width (b� a).

Algorithm 10 Euler’s algorithm assuming exact arithmetic.

Input: A function f (x, y), an initial condition (x0, y0), an interval[a, b] = [x0, xn] and a step h = (xn � x0)/nOutput: A family of values y0, y1, . . . , yn (which approximate thesolution to y0 = f (x, y) on the net x0, . . . , xn)

?STARTi 0while i n do

yi+1 yi + h f (xi, yi)i i + 1

end whilereturn (y0, . . . , yn)

One might try (as an exercise) to solve the problem “using theright endpoint”: this gives rise to what are called the implicit meth-ods which we shall not study (but which usually perform better thanthe explicit ones we are going to explain).

7. Modified Euler: the Midpoint Rule

Instead of using the left endpoint of [xi, xi+1] for integrating andcomputing yi+1, one might use (and this would be better, as thereader should verify) the midpoint rule somehow. As there is noway to know the intermediate values of y(t) further than x0, somekind of guesswork has to be done. The method goes as follows:

• Use a point near Pi = (xi, yi) whose x�coordinate is themidpoint of [xi, xi+1].

• For lack of a better point, the first approximation is done us-ing Euler’s algorithm and one takes a point Qi = [xi+1, yi +h f (xi, yi)].


• Compute the midpoint of the segment PiQi, which is (xi +h/2, yi + h/2 f (xi, yi)). Let k be its y�coordinate.

• Use the value of f at that point in order to compute yi+1: thisgives the formula yi+1 = yi + h f (xi + h/2, k).

If, as above, f (x, y) did not depend on y, one verifies easily thatthe corresponding approximation for the integral is

Z b

af (x) dx ' (b� a) f (

a + b2

),

which is exactly the midpoint rule of numerical integration.This method is called modified Euler’s method and its order is 2, the

same as Euler’s, which implies that the accrued error at xn is O(h).It is described formally in Algorithm 11.

xi xi +h2

xi+1

yi

yi+1z2

z2 + k2yi + k1

FIGURE 2. Modified Euler’s Algorithm. Instead ofusing the vector (h, h f (xi, yi)) (dashed), one sums at(xi, yi) the dotted vector, which is the tangent vector atthe midpoint.

As shown in Figure 2, this method consist in first using Euler’s inorder to get an initial guess, then computing the value of f (x, y) atthe midpoint between (x0, y0) and the guess and using this value off as the approximate slope of the solution at (x0, y0). There is still anerror but, as the information is gathered “a bit to the right”, it is lessthan the one of Euler’s method, analogue to the trapezoidal rule.

The next idea is, instead of using a single vector, compute two ofthem and use a “mean value:” as a matter of fact, the mean betweenthe vector at the origin and the vector at the end of Euler’s method.

8. HEUN’S METHOD: THE TRAPEZOIDAL RULE 89

Algorithm 11 Modified Euler’s algorithm, assuming exact arith-metic.

Input: A function f (x, y), an initial condition (x0, y0), an interval[x0, xn] and a step h = (xn � x0)/nOutput: A family of values y0, y1, . . . , yn (which approximate thesolution of y0 = f (x, y) on the net x0, . . . , xn)


k1 f (xi, yi)z2 yi +

h2 k1

k2 f (xi +h2 , z2)

yi+1 yi + hk2i i + 1

end whilereturn (y0, . . . , yn)

8. Heun’s Method: the Trapezoidal Rule

Instead of using the midpoint of the segment of Euler’s method,one can take the vector corresponding to Euler’s method and alsothe vector at the endpoint of Euler’s method and compute the meanof both vectors and use this mean for approximating. This way, oneis using the information at the point and some information “lateron.” This improves Euler’s method and is called accordingly improvedEuler’s method or Heun’s method. It is described in Algorithm 12.

At each step one has to perform the following operations:• Compute k1 = f (xi, yi).• Compute z2 = yj + hk1. This is the coordinate yi+1 in Euler’s

method.• Compute k2 = f (xi+1, z2). This would be the slope at (xi+1, yi+1)

with Euler’s method.• Compute the mean of k1 and k2: k1+k2

2 and use this value as“slope”. That is, set yi+1 = yi +

h2 (k1 + k2).

Figure 3 shows a graphical representation of Heun’s method.Figures 4 and 5 show the approximate solutions to the equations

y0 = y and y0 = �y + cos(x), respectively, using the methods ex-plained in the text, together with the exact solution.


xi xi+1 xi+1 + h

yi

yi + hk2

yi+1

yi + hk1

FIGURE 3. Heun’s method. At (xi, yi) one uses themean displacement (solid) between the vectors at(xi, yi) (dashed) and at the next point (xi+1, yi + hk1)in Euler’s method (dotted).

Algorithm 12 Heun’s algorithm assuming exact arithmetic.

Input: A function f (x, y), an initial condition (x0, y0), an interval[x0, xn] and a step h = (xn � x0)/nOutput: A family of points y0, y1, . . . , yn( which approximate thesolution to y0 = f (x, y) on the net x0, . . . , xn)


k1 f (xi, yi)z2 yi + hk1k2 f (xi + h, z2)yi+1 yi +

h2 (k1 + k2)

i i + 1end whilereturn (y0, . . . , yn)

8. HEUN’S METHOD: THE TRAPEZOIDAL RULE 91

�2 �1 0 1 2

1

2

3

4

5 EulerModifiedSolution

FIGURE 4. Comparison between Euler’s and ModifiedEuler’s methods and the true solution to the ODE y0 =y for y(�2) = 10�1.

0 1 2 3 4 5 6

�1.5

�1

�0.5

0

0.5

1

EulerModified

HeunSolution

FIGURE 5. Comparison between Euler’s, Modified Eu-ler’s and Heun’s methods and the true solution to theODE y0 = �y + cos(x) for y(0) = �1.

sma322: numerical analysis 1sma322: numerical analysis 1 contents 1 introduction 1 2 number system...

Documents