
Algebraic Statistics

Karl-Heinz Zimmermann


Hamburg University of Technology

Prof. Dr. Karl-Heinz Zimmermann
Hamburg University of Technology
21071 Hamburg
Germany

All rights reserved
© 2009, 2015 Karl-Heinz Zimmermann, author

urn:nbn:de:gbv:830-88213556

For my Teachers

Thomas Beth†

Adalbert Kerber

Sun-Yuan Kung
Horst Muller


Preface

Algebraic statistics brings together ideas from algebraic geometry, commutative algebra, and combinatorics to address problems in statistics and its applications. Computer algebra provides powerful tools for the study of algorithms and software. However, these tools are rarely prepared to address statistical challenges, and therefore new algebraic results often need to be developed. This interplay between algebra and statistics enriches both disciplines.

Algebraic statistics is a relatively new branch of mathematics that has developed and changed rapidly over the last ten years. The seminal work in this field was the paper of Diaconis and Sturmfels (1998) introducing the notion of Markov bases for toric statistical models and showing the connection to commutative algebra. Later on, the connection between algebra and statistics spread to a number of different areas including parametric inference, phylogenetic invariants, and algebraic tools for maximum likelihood estimation. These connections were highlighted in the celebrated book Algebraic Statistics for Computational Biology of Pachter and Sturmfels (2005) and subsequent publications.

In this report, statistical models for discrete data are viewed as solutions of systems of polynomial equations. This makes it possible to treat statistical models for sequence alignment, hidden Markov models, and phylogenetic tree models. These models are connected in the sense that, if they are interpreted in the tropical algebra, the famous dynamic programming algorithms (Needleman-Wunsch, Viterbi, and Felsenstein) occur in a natural manner. More generally, if the models are interpreted in a higher-dimensional analogue of the tropical algebra, the polytope algebra, parametric versions of these dynamic programming algorithms can be established.

Markov bases make it possible to sample data in a given fibre using Markov chain Monte Carlo algorithms. In this way, Markov bases provide a means to increase the sample size and make statistical tests in inferential statistics more reliable. We will calculate Markov bases using Groebner bases in commutative polynomial rings.

The manuscript grew out of lectures on algebraic statistics held for Master students of Computer Science at the Hamburg University of Technology. It appears that the first lecture held in the summer term 2008 was the first course of this kind in Germany. The current manuscript is the basis of a four-hour introductory course. The use of computer algebra systems is at the heart of the course. Maple is employed for symbolic computations, Singular for algebraic computations, and R for statistical computations. The second edition at hand is just a streamlined version of the first one.

Hamburg, Nov. 2015 Karl-Heinz Zimmermann

Contents

Part I Algebraic and Combinatorial Methods

1 Commutative Algebra
  1.1 Polynomial Rings
  1.2 Ideals
  1.3 Monomial Orders
  1.4 Division Algorithm
  1.5 Groebner Bases
  1.6 Computation of Groebner Bases
  1.7 Reduced Groebner Bases
  1.8 Toric Ideals

2 Algebraic Geometry
  2.1 Affine Varieties
  2.2 Ideal-Variety Correspondence
  2.3 Zariski Topology
  2.4 Irreducible Affine Varieties
  2.5 Elimination Theory
  2.6 Geometry of Elimination
  2.7 Implicit Representation

3 Combinatorial Geometry
  3.1 Tropical Algebra
  3.2 Shortest Paths Problem
  3.3 Geometric Zoo
  3.4 Geometry of Polytopes
  3.5 Polytope Algebra
  3.6 Newton Polytopes
  3.7 Parametric Shortest Path Problem

Part II Algebraic Statistics

4 Basic Algebraic Statistical Models
  4.1 Introductory Example
  4.2 General Algebraic Statistical Model
  4.3 Linear Models
  4.4 Toric Models
  4.5 Markov Chain Model
  4.6 Maximum Likelihood Estimation
  4.7 Model Invariants
  4.8 Statistical Inference

5 Sequence Alignment
  5.1 Sequence Alignment
  5.2 Scoring Schemes
  5.3 Pair Hidden Markov Model
  5.4 Sum-Product Decomposition
  5.5 Optimal Alignment
  5.6 Needleman-Wunsch Algorithm
  5.7 Parametric Sequence Alignment

6 Hidden Markov Models
  6.1 Fully Observed Markov Model
  6.2 Hidden Markov Model
  6.3 Sum-Product Decomposition
  6.4 Viterbi Algorithm
  6.5 Expectation Maximization
  6.6 Finding CpG Islands

7 Tree Markov Models
  7.1 Data and General Models
  7.2 Fully Observed Tree Markov Model
  7.3 Hidden Tree Markov Model
  7.4 Sum-Product Decomposition
  7.5 Felsenstein Algorithm
  7.6 Evolutionary Models
  7.7 Group-Based Evolutionary Models

8 Computational Statistics
  8.1 Markov Bases
  8.2 Markov Chains
  8.3 Metropolis Algorithm
  8.4 Contingency Tables
  8.5 Hardy-Weinberg Model
  8.6 Logistic Regression

A Computational Statistics in R
  A.1 Descriptive Statistics
  A.2 Random Variables and Probability
  A.3 Some Discrete Distributions
  A.4 Some Continuous Distributions
  A.5 Statistics
  A.6 Method of Moments
  A.7 Maximum-Likelihood Estimation
  A.8 Ordinary Least Squares
  A.9 Parameter Optimization

B Spectral Analysis of Ranked Data
  B.1 Data Analysis
  B.2 Representation Theory for Partial Rankings

C Representation Theory of the Symmetric Group
  C.1 The Symmetric Group
  C.2 Diagrams, Tableaux, and Tabloids
  C.3 Permutation Modules
  C.4 Specht Modules
  C.5 Standard Basis of Specht Modules
  C.6 Young's Rule
  C.7 Representations
  C.8 Characters
  C.9 Characters of the Symmetric Group
  C.10 Dimension of Specht Modules

Index

Part I

Algebraic and Combinatorial Methods

1 Commutative Algebra

Commutative algebra is a branch of abstract algebra that studies commutative rings and their ideals. Both algebraic geometry and algebraic number theory are built on commutative algebra. Ideals in polynomial rings are usually studied via their Groebner bases. The latter can be used to tackle important problems like testing membership in ideals and solving polynomial equations.

1.1 Polynomial Rings

Let K be a field. A monomial in a collection of variables or unknowns X1, . . . , Xn over K is a product

X^α = X1^{α1} · · · Xn^{αn},   α1, . . . , αn ∈ N_0.   (1.1)

The total degree of a monomial X^α is the sum of the exponents |α| = α1 + . . . + αn. For instance, X1^2 X3^3 X4 is a monomial of total degree 6 in the variables X1, X2, X3, X4, since α = (2, 0, 3, 1) and |α| = 6.

We can form linear combinations of monomials with coefficients in K. The resulting objects are polynomials in X1, . . . , Xn over K. A general polynomial f in X1, . . . , Xn with coefficients in K has the form

f = ∑_α c_α X^α,   c_α ∈ K,   (1.2)

where the sum is over a finite number of elements α ∈ N_0^n. A nonzero product c_α X^α involved in a polynomial is called a term and the scalar c_α is called the coefficient of the term. For instance, taking K to be the field Q of rational numbers and using the variables X, Y, Z instead of subscripts, f = X^2 + YZ − 1 is a polynomial containing three terms.

The set of all polynomials in X1, . . . , Xn with coefficients in K is denoted by K[X1, . . . , Xn]. The polynomials in K[X1, . . . , Xn] can be added and multiplied as usual,

(∑_α c_α X^α) + (∑_β d_β X^β) = ∑_α (c_α + d_α) X^α,   (1.3)

(∑_α c_α X^α) · (∑_β d_β X^β) = ∑_{α,β} (c_α d_β) X^{α+β}.   (1.4)

Thus K[X1, . . . , Xn] forms a commutative ring with identity, called the polynomial ring in X1, . . . , Xn over K. Moreover, the addition of polynomials in K[X1, . . . , Xn] suggests that K[X1, . . . , Xn] forms an infinite-dimensional K-vector space with the monomials as a K-basis.

Each nonzero polynomial f in K[X1, . . . , Xn] has a degree, denoted by deg(f). This is the largest total degree of a monomial occurring in f with a nonzero coefficient. For instance, f = 4X^3 + 3Y^5Z − Z^4 is a polynomial of degree 6 in Q[X, Y, Z]. The nonzero elements of K are the polynomials of degree 0. For any nonzero polynomials f and g in K[X1, . . . , Xn], we have deg(fg) = deg(f) + deg(g) by comparing monomials of largest degree. Thus the polynomial ring K[X1, . . . , Xn] is an integral domain (i.e., it has no zero divisors) and only nonzero constant polynomials have multiplicative inverses in K[X1, . . . , Xn]. Hence, K[X1, . . . , Xn] is not a field.

A polynomial f in K[X1, . . . , Xn] is called homogeneous if all involved monomials have the same total degree. For instance, f = 3X^4 + 5YZ^3 − X^2Z^2 is a homogeneous polynomial of total degree 4 in Q[X, Y, Z]. It is clear that each polynomial f in K[X1, . . . , Xn] can be written as a sum of homogeneous polynomials, called the homogeneous components of f. For instance, f = 3X^4 + 5YZ^3 + X^2 − Y^2 − 1 in Q[X, Y, Z] is the sum of the homogeneous components f^(4) = 3X^4 + 5YZ^3, f^(2) = X^2 − Y^2, and f^(0) = −1.

Example 1.1 (Singular). Polynomial rings can be generated over different fields. The polynomial ring Q[X, Y, Z] is defined as

> ring r1 = 0, (x,y,z), dp;

> poly f = x2y-z2;

> f*f-f;

x4y2-2x2yz2+z4-x2y+z2

Polynomials can be written in short (e.g., x2y-z2) or long (e.g., x^2*y-z^2) notation. The definition of polynomial rings over other fields follows the same pattern, such as the polynomial ring over the finite field Z5,

> ring r2 = 5, (x,y,z), dp;

the polynomial ring over the finite Galois field GF(8),

> ring r3 = (2^3,a), (x,y,z), dp; // primitive element a

> number n = a2+1; // element of GF(8)

> n*n;

a5

and the polynomial ring over the extension field Q(a, b),

> ring r4 = (0,a,b), (x,y,z), dp;

> number n = 2a+1/b2; // element of Q(a,b)

> n*n;

(4a2b4+4ab2+1)/(b4)


1.2 Ideals

Ideals are the most prominent structures studied in polynomial rings. A nonempty subset I of the polynomial ring K[X1, . . . , Xn] is an ideal if

• for each f, g ∈ I, we have −f ∈ I and f + g ∈ I, and
• for each f ∈ I and g ∈ K[X1, . . . , Xn], we have f · g ∈ I.

The first condition ensures that I is an additive subgroup of K[X1, . . . , Xn] and is equivalent to the subgroup criterion, which says that for each f, g ∈ I, we have f − g ∈ I.

Lemma 1.2. Let f1, . . . , fs be polynomials in K[X1, . . . , Xn]. Then the set

〈f1, . . . , fs〉 = { ∑_{i=1}^{s} hi fi | h1, . . . , hs ∈ K[X1, . . . , Xn] }   (1.5)

is an ideal of K[X1, . . . , Xn], the smallest ideal of K[X1, . . . , Xn] containing f1, . . . , fs.

Proof. Let f, g ∈ 〈f1, . . . , fs〉. Write f = h1f1 + . . . + hsfs and g = h′1f1 + . . . + h′sfs, where hi, h′i ∈ K[X1, . . . , Xn], 1 ≤ i ≤ s. Then f − g = (h1 − h′1)f1 + . . . + (hs − h′s)fs and thus f − g ∈ 〈f1, . . . , fs〉. Moreover, if h ∈ K[X1, . . . , Xn], then f · h = (h1h)f1 + . . . + (hsh)fs and thus f · h ∈ 〈f1, . . . , fs〉. In view of the last assertion, note that each ideal of K[X1, . . . , Xn] that contains f1, . . . , fs must also contain 〈f1, . . . , fs〉. ⊓⊔

The ideal 〈f1, . . . , fs〉 is called the ideal generated by f1, . . . , fs. The set {f1, . . . , fs} is sometimes called a basis of the ideal. In particular, 〈∅〉 = {0} and 〈1〉 = K[X1, . . . , Xn] are the trivial ideals.

There are several ways to construct new ideals from given ones.

Proposition 1.3. Let I and J be ideals of K[X1, . . . , Xn]. The sum of I and J is the set

I + J = {f + g | f ∈ I, g ∈ J}.   (1.6)

• The sum I + J is an ideal of K[X1, . . . , Xn].
• The sum I + J is the smallest ideal containing I ∪ J.
• If I = 〈f1, . . . , fr〉 and J = 〈g1, . . . , gs〉, then

  I + J = 〈f1, . . . , fr, g1, . . . , gs〉.   (1.7)

Proof. Let f, f′ ∈ I and g, g′ ∈ J. Then (f + g) − (f′ + g′) = (f − f′) + (g − g′) lies in I + J. Moreover, let h ∈ K[X1, . . . , Xn]. Then (f + g) · h = (f · h) + (g · h) ∈ I + J. Hence, I + J is an ideal.

Let L be an ideal of K[X1, . . . , Xn] containing I ∪ J. If f ∈ I and g ∈ J, then f + g ∈ L and thus L contains I + J.

Let h ∈ 〈f1, . . . , fr, g1, . . . , gs〉. Then h = h1f1 + . . . + hrfr + h′1g1 + . . . + h′sgs, where hi, h′j ∈ K[X1, . . . , Xn], 1 ≤ i ≤ r, 1 ≤ j ≤ s. Thus h is of the form f + g, where f ∈ I and g ∈ J, and hence h ∈ I + J. Conversely, the ideal 〈f1, . . . , fr, g1, . . . , gs〉 contains I ∪ J and thus by the second assertion must be equal to I + J. ⊓⊔

Proposition 1.4. Let I and J be ideals of K[X1, . . . , Xn]. The product of I and J is the ideal

I · J = 〈f · g | f ∈ I, g ∈ J〉.   (1.8)

• The intersection I ∩ J is an ideal in K[X1, . . . , Xn].
• The product I · J is contained in the intersection I ∩ J.
• If I = 〈f1, . . . , fr〉 and J = 〈g1, . . . , gs〉, then

  I · J = 〈fi · gj | 1 ≤ i ≤ r, 1 ≤ j ≤ s〉.   (1.9)

Proof. Let f, g ∈ I ∩ J. Then f − g ∈ I and f − g ∈ J and so f − g ∈ I ∩ J. Let f ∈ I ∩ J and h ∈ K[X1, . . . , Xn]. Since I and J are ideals, f · h ∈ I and f · h ∈ J. Thus f · h ∈ I ∩ J.

Let f ∈ I and g ∈ J. Then f · g is contained in both I and J and thus belongs to I ∩ J. That is, I · J ⊆ I ∩ J.

Since fi · gj belongs to I · J, it follows that I · J contains 〈fi · gj | 1 ≤ i ≤ r, 1 ≤ j ≤ s〉. Conversely, let h ∈ I · J. Then h can be written in terms of generators f · g, where f ∈ I and g ∈ J. But the constituents f and g of these generators can be written with respect to the bases f1, . . . , fr and g1, . . . , gs, respectively. Thus the polynomial h belongs to the ideal 〈fi · gj | 1 ≤ i ≤ r, 1 ≤ j ≤ s〉. ⊓⊔

Example 1.5 (Singular). The above ideal operations in Q[X, Y, Z] can be carried out as follows,

> ring r = 0, (x,y,z), dp;

> ideal i = xyz, x2-y2;

> ideal j = x2-1, y2-z2;

> i+j;

_[1]=xyz

_[2]=x2-y2

_[3]=x2-1

_[4]=y2-z2

> i*j;

_[1]=x3yz-xyz

_[2]=xy3z-xyz3

_[3]=x4-x2y2-x2+y2

_[4]=x2y2-y4-x2z2+y2z2
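Continuing the session above, the intersection I ∩ J from Proposition 1.4 can be computed as well; the following is a minimal sketch using Singular's built-in intersect command (output omitted):

> ideal k = intersect(i,j); // intersection of the two ideals
> k;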

♦

Proposition 1.6. Let I be an ideal of K[X1, . . . , Xn]. The set

√I = {f ∈ K[X1, . . . , Xn] | f^m ∈ I for some integer m ≥ 1}   (1.10)

is an ideal of K[X1, . . . , Xn] containing I, called the radical of I, with √(√I) = √I.

Proof. We have I ⊆ √I, since f ∈ I, i.e., f^1 ∈ I, implies f ∈ √I.

Claim that √I is an ideal. Indeed, let f, g ∈ √I. By definition, there are positive integers k and l such that f^k, g^l ∈ I. Expanding (f + g)^{k+l−1} by the binomial theorem shows that each term is a multiple of some f^m g^{m′} with m + m′ = k + l − 1. Thus either m ≥ k or m′ ≥ l, and so f^m or g^{m′} lies in I. Thus all terms in (f + g)^{k+l−1} belong to I and hence f + g lies in √I.

Let f ∈ √I and g ∈ K[X1, . . . , Xn]. By definition, f^m ∈ I for some m ≥ 1. Thus (fg)^m = f^m g^m belongs to I and hence fg lies in √I. It follows that √I is an ideal.

Claim that √(√I) = √I. Indeed, we have already shown that √I lies in √(√I). Conversely, let f ∈ √(√I). Then f^m ∈ √I for some positive integer m and thus (f^m)^l ∈ I for some positive integer l. Thus f ∈ √I and hence the claim follows. ⊓⊔


An ideal I is called radical if √I = I. For instance, the above assertion shows that √I is radical.

Example 1.7 (Singular). The computation of the radical of an ideal requires the loading of a library.

> LIB "primdec.lib"; // load library for radical

> ring r = 0, (x,y,z), dp;

> ideal i = xy, x2, y3-y5;

> radical(i);

_[1]=x

_[2]=y3-y

An ideal I of K[X1, . . . , Xn] is prime if I ≠ K[X1, . . . , Xn] and for every pair of elements f, g ∈ K[X1, . . . , Xn], fg ∈ I implies f ∈ I or g ∈ I.

An ideal I of K[X1, . . . , Xn] is maximal if I ≠ K[X1, . . . , Xn] and I is maximal among the proper ideals of K[X1, . . . , Xn] with respect to set inclusion.

Lemma 1.8. Each maximal ideal m of K[X1, . . . , Xn] is prime.

Proof. Let f, g ∈ K[X1, . . . , Xn] with fg ∈ m. Suppose f ∉ m. Then m + 〈f〉 = K[X1, . . . , Xn], since m is maximal. Then m + af = 1 for some m ∈ m and a ∈ K[X1, . . . , Xn]. Thus mg + afg = g and hence g ∈ m. ⊓⊔

Example 1.9. For any field K, every maximal ideal m of K[X1, . . . , Xn] is given as follows: take a finite algebraic extension field L of K and a point (a1, . . . , an) ∈ L^n, consider the ideal 〈X1 − a1, . . . , Xn − an〉 of L[X1, . . . , Xn], and put

m = 〈X1 − a1, . . . , Xn − an〉 ∩K[X1, . . . , Xn].

In particular, if K is algebraically closed, every maximal ideal of K[X1, . . . , Xn] has the form

m = 〈X1 − a1, . . . , Xn − an〉

for some a1, . . . , an ∈ K. ♦

Example 1.10. In the polynomial ring K[X1, . . . , Xn], every ideal 〈S〉 generated by a subset S of the set of variables {X1, . . . , Xn} is prime; in particular, if S = ∅, then 〈S〉 = {0}. The only maximal ideal among these prime ideals is 〈X1, . . . , Xn〉. ♦

1.3 Monomial Orders

We study several ways to order the terms of a polynomial. For this, we first consider orders on the set N_0^n of n-tuples of natural numbers. The set N_0^n forms a monoid with the component-wise addition

(α1, . . . , αn) + (β1, . . . , βn) = (α1 + β1, . . . , αn + βn)

and the zero vector 0 = (0, . . . , 0) as the identity element. A monomial ordering on N_0^n is a total ordering > on N_0^n satisfying the following properties:


1. If α, β ∈ N_0^n with α > β and γ ∈ N_0^n, then α + γ > β + γ.
2. If α ∈ N_0^n and α ≠ 0, then α > 0.

The first condition shows that the ordering is compatible with the addition in N_0^n and the second condition means that 0 is the smallest element of the ordering. Both conditions imply that for all α, β ∈ N_0^n with β ≠ 0, we have α + β > α.

For the monoid N_0, there is only one monomial ordering,

0 < 1 < 2 < 3 < . . . ,

but in the monoids N_0^n with n ≥ 2 there are infinitely many monomial orderings.

Example 1.11. The following orderings depend on the ordering of the variables X1, . . . , Xn.

• Lexicographical ordering (lp):

α >lp β :⇐⇒ ∃1 ≤ i ≤ n : α1 = β1, . . . , αi−1 = βi−1, αi > βi.

• Degree lexicographical ordering (Dp):

α >Dp β :⇐⇒ |α| > |β| ∨ (|α| = |β| ∧ α >lp β).

• Degree reverse lexicographical ordering (dp):

α >dp β :⇐⇒ |α| > |β| ∨ (|α| = |β| ∧ ∃1 ≤ i ≤ n : αn = βn, . . . , αi+1 = βi+1, αi < βi).

In all three orderings, (1, 0, . . . , 0), . . . , (0, . . . , 0, 1) > 0. For instance, (3, 0, 0) >lp (2, 2, 0), but (2, 2, 0) >Dp (3, 0, 0) and (2, 2, 0) >dp (3, 0, 0). Moreover, (2, 1, 2) >Dp (1, 3, 1), but (1, 3, 1) >dp (2, 1, 2). ♦

In the following, we require the natural component-wise ordering on N_0^n given by

(α1, . . . , αn) ≤nat (β1, . . . , βn) :⇐⇒ α1 ≤ β1, . . . , αn ≤ βn.

For instance, (1, 1, 2) ≤nat (2, 1, 2) ≤nat (2, 1, 4).

Theorem 1.12. (Dickson's Lemma) Let A be a subset of N_0^n. There is a finite subset B of A such that for each α ∈ A there is a β ∈ B with β ≤nat α.

The set B is called a Dickson basis of A (Fig. 1.1).

Proof. For n = 1, take the smallest element of A ⊆ N_0 as the only element of B.

For n ≥ 1, A ⊆ N_0^{n+1}, and i ∈ N_0 define

A_i = {α′ ∈ N_0^n | (α′, i) ∈ A} ⊆ N_0^n.

By induction, A_i has a Dickson basis B_i. Furthermore, by induction, ⋃_{i∈N_0} B_i has a Dickson basis B′. Since B′ is finite, there is an index j such that B′ ⊆ B_1 ∪ . . . ∪ B_j.

Claim that a Dickson basis of A is given by

B = {(β′, i) ∈ N_0^{n+1} | 0 ≤ i ≤ j, β′ ∈ B_i}.

Indeed, let (α′, k) ∈ A. Then α′ ∈ A_k. Since B_k is a Dickson basis of A_k, there is an element β′ ∈ B_k such that β′ ≤nat α′. If k ≤ j, then (β′, k) ∈ B and (β′, k) ≤nat (α′, k). Otherwise, there are γ′ ∈ B′ and i ≤ j such that γ′ ≤nat β′ and (γ′, i) ∈ B_i. Then (γ′, i) ∈ B and (γ′, i) ≤nat (α′, k). ⊓⊔


Fig. 1.1. A subset A of N_0^2 and a Dickson basis of A (encircled points).

Corollary 1.13. Each monomial ordering on N_0^n is a well-ordering.

Proof. Let > be a monomial ordering on N_0^n and A be a nonempty subset of N_0^n. By Dickson's lemma, the set A has a Dickson basis B. Let α ∈ A. Then there is an element β ∈ B with β ≤nat α. Thus there is an element γ ∈ N_0^n with α = β + γ. Since 0 ≤ γ, it follows that β ≤ β + γ = α. Hence, the smallest element of the Dickson basis B with respect to the monomial ordering is the smallest element of A. Therefore, the monomial ordering is a well-ordering. ⊓⊔

Corollary 1.14. For any monomial ordering > on N_0^n, each decreasing chain of elements of N_0^n

α(1) > α(2) > . . . > α(k) > . . .

becomes stationary (i.e., there is some j0 such that α(j) = α(j0) for all j ≥ j0).

Proof. Put A = {α(i) | i ∈ N}. By Corollary 1.13, A has a smallest element and hence the sequence must become stationary. ⊓⊔


A monomial ordering > on N_0^n carries forward to a monomial ordering on the set of monomials of the polynomial ring K[X1, . . . , Xn]. For this, define for all α, β ∈ N_0^n,

X^α > X^β :⇐⇒ α > β.

Since any monomial ordering is total, the terms that are involved in a polynomial of K[X1, . . . , Xn] can be uniquely written in increasing or decreasing order. A polynomial f in K[X1, . . . , Xn] whose terms are written in decreasing order is in canonical form, i.e.,

f = c_0 X^{α(0)} + . . . + c_m X^{α(m)},   c_i ∈ K*,

where α(0) > . . . > α(m). Note that polynomials stored in canonical form can be efficiently tested for equality.

For polynomials in a polynomial ring K[X] with one unknown, there is only one monomial ordering,

1 < X < X2 < X3 < . . . ,

but in polynomial rings with several unknowns there are infinitely many monomial orderings.

Example 1.15 (Singular). Polynomials are stored and printed in canonical form.

> ring r1 = 0, (x,y,z), lp;

> poly f = x3yz+y5; f;

x3yz+y5

> ring r2 = 0, (x,y,z), Dp;

> poly f = imap(r1,f); f;

x3yz+y5

> ring r2 = 0, (x,y,z), dp;

> poly f = imap(r1,f); f;

y5+x3yz

The leading data of a polynomial f in K[X1, . . . , Xn] are defined as follows:

• leading term lt>(f) = c_0 X^{α(0)},
• leading coefficient lc>(f) = c_0, and
• leading monomial lm>(f) = X^{α(0)}.

A polynomial f is called monic if its leading coefficient is equal to 1.

Example 1.16. Consider the polynomial f = 4XY^2Z + 4Z^2 − 5X^3 + 7X^2Z^2 in Q[X, Y, Z], where X corresponds to X^{(1,0,0)}, Y to X^{(0,1,0)}, and Z to X^{(0,0,1)}. Thus

f = 4X^{(1,2,1)} + 4X^{(0,0,2)} − 5X^{(3,0,0)} + 7X^{(2,0,2)}.

In the lp ordering, (3, 0, 0) ≥ (2, 0, 2) ≥ (1, 2, 1) ≥ (0, 0, 2) and the canonical form is

f = −5X^3 + 7X^2Z^2 + 4XY^2Z + 4Z^2.

In the Dp ordering, (2, 0, 2) ≥ (1, 2, 1) ≥ (3, 0, 0) ≥ (0, 0, 2) and the canonical form is

f = 7X^2Z^2 + 4XY^2Z − 5X^3 + 4Z^2.

In the dp ordering, (1, 2, 1) ≥ (2, 0, 2) ≥ (3, 0, 0) ≥ (0, 0, 2) and the canonical form is

f = 4XY^2Z + 7X^2Z^2 − 5X^3 + 4Z^2.

♦

Example 1.17 (Singular). The leading data of a polynomial can be obtained as follows.

> ring r = 0, (x,y,z), lp;

> poly f = (xy-z)*(x2-yz);

> f;

x3y-x2z-xy2z+yz2

> leadmonom(f);

x3y

> leadexp(f);

3,1,0

> leadcoef(f);

1

> lead(f);

x3y

> f-lead(f); // tail

-x2z-xy2z+yz2

1.4 Division Algorithm

The ordinary division algorithm for polynomials in one variable carries forward to the multivariate case by making use of a monomial ordering.

Theorem 1.18. Let > be a monomial ordering on N_0^n. Let f be a nonzero polynomial in K[X1, . . . , Xn] and let F = (f1, . . . , fm) be a sequence of nonzero polynomials in K[X1, . . . , Xn]. There are polynomials h1, . . . , hm and r in K[X1, . . . , Xn] such that

f = h1f1 + · · · + hmfm + r   (1.11)

and either r = 0 or none of the terms in r is divisible by lt>(f1), . . . , lt>(fm). Moreover, if hifi ≠ 0, then lt>(hifi) ≤ lt>(f), 1 ≤ i ≤ m.

The proof is constructive and mimics the division algorithm (Alg. 1.1).

Proof. First, put h1 = . . . = hm = 0, r = 0, and s = f . Then we have

f = h1f1 + · · ·+ hmfm + (r + s). (1.12)

This equation serves as an invariant throughout the algorithm, which proceeds in iterative steps. If s = 0, the algorithm terminates. Otherwise, there are two cases:


• Reduction step: If lt>(s) is divisible by some lt>(fi), 1 ≤ i ≤ m, then take the smallest index i with this property and put

  s = s − (lt>(s)/lt>(fi)) · fi   and   hi = hi + lt>(s)/lt>(fi).   (1.13)

• Shifting step: If lt>(s) is not divisible by any of the lt>(fi), 1 ≤ i ≤ m, then put

  r = r + lt>(s)   and   s = s − lt>(s).   (1.14)

In both cases, equation (1.12) still holds. Moreover, if r ≠ 0, then the assertion that no term of r is divisible by lt>(fi), 1 ≤ i ≤ m, holds inductively. The leading term of the polynomial s strictly decreases with respect to the monomial ordering after each of the assignments (1.13) and (1.14). Thus the sequence formed by the leading terms of s in successive steps is strictly decreasing. By Corollary 1.13, the monomial ordering is a well-ordering and hence the sequence becomes stationary. Therefore, the division algorithm terminates with s = 0.

In view of the inequalities, the leading term of s decreases in each step and is either added to some product hifi (reduction step) or to the remainder r (shifting step). Moreover, in the reduction step, the leading term of s added to the product hifi is the largest term added. Since lt>(s) = lt>(f) at the start of the computation, the inequalities follow. ⊓⊔

The remainder on the division of f by F is often denoted by r = f^F.

Algorithm 1.1 Division algorithm.

Require: nonzero polynomials f and f1, . . . , fm in K[X1, . . . , Xn]
Ensure: polynomials h1, . . . , hm and r in K[X1, . . . , Xn] as in Thm. 1.18
  h1 ← 0, . . . , hm ← 0
  r ← 0
  s ← f
  while s ≠ 0 do
    i ← 1
    division occurred ← false
    while i ≤ m and division occurred = false do
      if lt(fi) divides lt(s) then
        hi ← hi + lt(s)/lt(fi)
        s ← s − (lt(s)/lt(fi)) ∗ fi
        division occurred ← true
      else
        i ← i + 1
      end if
    end while
    if division occurred = false then
      r ← r + lt(s)
      s ← s − lt(s)
    end if
  end while


Example 1.19. Consider the polynomials f = X^2Y + XY^2 + Y^2, f1 = Y^2 − 1, and f2 = X − Y in Q[X, Y] using the lp ordering with X > Y. Initially, we have h1 = h2 = 0, r = 0, and s = f. First, lt>(s) = X^2Y is divisible by lt>(f2) = X and so

s = s − (X^2Y/X) · (X − Y) = 2XY^2 + Y^2   and   h2 = h2 + X^2Y/X = XY.

Second, lt>(s) = 2XY^2 is divisible by lt>(f1) = Y^2. Thus

s = s − (2XY^2/Y^2) · (Y^2 − 1) = Y^2 + 2X   and   h1 = h1 + 2XY^2/Y^2 = 2X.

Third, lt>(s) = 2X is divisible by lt>(f2) = X. So

s = s − (2X/X) · (X − Y) = Y^2 + 2Y   and   h2 = h2 + 2X/X = XY + 2.

Fourth, lt>(s) = Y^2 is divisible by lt>(f1) = Y^2. Thus

s = s − (Y^2/Y^2) · (Y^2 − 1) = 2Y + 1   and   h1 = 2X + 1.

Fifth, lt>(s) = 2Y is not divisible by lt>(f1) = Y^2 or lt>(f2) = X. It follows that

r = 2Y   and   s = 1.

Sixth, lt>(s) = 1 is not divisible by lt>(f1) = Y^2 or lt>(f2) = X. Consequently,

r = 2Y + 1   and   s = 0.

Therefore,

f = (2X + 1) · (Y^2 − 1) + (XY + 2) · (X − Y) + (2Y + 1)   and   f^F = 2Y + 1.

Example 1.20 (Singular). The expression of a polynomial as a linear combination with remainder according to the division theorem is provided by the command division, while the command reduce only yields the remainder upon division.


> ring r = 0, (x,y), lp;

> ideal i = y2-1, x-y;

> poly f = x2y+xy2+y2;

> reduce(f,std(i)); // reduction by standard basis of i

2y+1

> division(f,i); // division with remainder

[1]:

_[1,1]=2x+1

_[2,1]=xy+2

[2]:

_[1]=2y+1

[3]:

_[1,1]=1

1.5 Groebner Bases

Groebner bases are specific generating sets of polynomial ideals.

Example 1.21. Take the polynomials f = XY^2 − X, f1 = Y^2 − 1, and f2 = XY + 1 in Q[X, Y] using the lp ordering with X > Y. First, the division of f into F = (f1, f2) yields

f = X · (Y 2 − 1) + 0 · (XY + 1).

Second, the division of f into F ′ = (f2, f1) gives

f = Y · (XY + 1) + 0 · (Y 2 − 1) + (−X − Y ).

Thus the division depends on the ordering of the polynomials in the sequence F. Moreover, the first representation shows that the polynomial f lies in the ideal I = 〈f1, f2〉, while this cannot be deduced from the second representation. ♦
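The two divisions of Example 1.21 can be replayed in Singular; the following is a minimal sketch (output omitted). Note that Singular's division command follows its own selection strategy, so the computed quotients and remainders need not coincide term by term with the hand computation.

> ring r = 0, (x,y), lp;
> poly f = xy2-x;
> ideal i1 = y2-1, xy+1; // generators ordered as in the first division
> ideal i2 = xy+1, y2-1; // generators ordered as in the second division
> division(f, i1);
> division(f, i2);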

Let > be a monomial ordering on K[X1, . . . , Xn] and let I be an ideal of K[X1, . . . , Xn]. A Groebner basis of I with respect to > is a finite set of polynomials G = {g1, . . . , gs} in I such that for each nonzero polynomial f ∈ I, lt>(f) is divisible by lt>(gi) for some 1 ≤ i ≤ s. Groebner bases were invented by Bruno Buchberger in the 1960s and named after his advisor Wolfgang Groebner (1899-1980).

Example 1.22 (Singular). Consider the ideal I = 〈Y^2 − 1, XY + 1〉 in Q[X, Y] using the lp ordering with X > Y. A Groebner basis of the ideal I can be computed by using the command groebner or std. The latter command is more general and can be applied to calculate standard bases of polynomial ideals.

> ring r = 0, (x,y), lp;

> ideal i = y2-1, xy+1;

> ideal j = std(i);

> j;

j[1]=y2-1

j[2]=x+y


The computed Groebner basis of I is {Y^2 − 1, X + Y}. ♦

Theorem 1.23. Each ideal I of K[X1, . . . , Xn] has a Groebner basis with respect to any monomial ordering.

Proof. Let > be a monomial ordering on K[X1, . . . , Xn]. Consider the set

A = {α ∈ N_0^n | X^α = lm>(f) for some f ∈ I}

of exponents of all leading monomials of the polynomials in the ideal I. By Dickson's lemma, the set A has a Dickson basis B = {β1, . . . , βs}, where X^{βi} = lm>(gi) for some gi ∈ I, 1 ≤ i ≤ s. Let f be a nonzero polynomial of I with lm>(f) = X^α. Then α = βi + γ for some 1 ≤ i ≤ s and γ ∈ N_0^n. Thus X^α = X^{βi} X^γ and hence the leading term of f is divisible by the leading term of gi. It follows that {g1, . . . , gs} is a Groebner basis of I. ⊓⊔

Proposition 1.24 (Ideal Membership Test). Let > be a monomial ordering on K[X1, . . . , Xn] and let I = 〈g1, . . . , gs〉 be an ideal of K[X1, . . . , Xn]. If G = {g1, . . . , gs} is a Groebner basis of I, then for each polynomial f in K[X1, . . . , Xn], we have f ∈ I if and only if f^G = 0.

Proof. Let f ∈ K[X1, . . . , Xn] and let its division into G yield f = h1g1 + . . . + hsgs + f^G. If f^G = 0, then f ∈ I by definition of I.

Conversely, let f ∈ I. Then f^G = f − (h1g1 + . . . + hsgs) belongs to I. Assume that f^G ≠ 0. Then lt>(f^G) is divisible by some lt>(gi), since G is a Groebner basis of I. But this contradicts the division algorithm, since none of the terms in the remainder f^G is divisible by any of the terms lt>(gi). ⊓⊔
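As a quick illustration of the membership test, here is a minimal Singular sketch reusing the ideal from Example 1.21; since f = X(Y^2 − 1) lies in the ideal, its remainder modulo a Groebner basis is 0:

> ring r = 0, (x,y), lp;
> ideal i = y2-1, xy+1;
> poly f = xy2-x;    // f = x*(y2-1), hence f lies in the ideal
> reduce(f, std(i)); // remainder modulo a Groebner basis; 0 certifies membership
0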

Corollary 1.25. Let > be a monomial ordering on K[X1, . . . , Xn] and let I be an ideal of K[X1, . . . , Xn]. If G = {g1, . . . , gs} is a Groebner basis of I with respect to >, then I = 〈g1, . . . , gs〉.

Proof. By definition, g1, . . . , gs ∈ I and thus 〈g1, . . . , gs〉 ⊆ I. Conversely, let f ∈ I. The division of f into G yields f = h1g1 + . . . + hsgs + f^G. Thus by Prop. 1.24, f^G = 0 and hence f ∈ 〈g1, . . . , gs〉. ⊓⊔

Proposition 1.26. Let G = {g1, . . . , gs} be a Groebner basis in K[X1, . . . , Xn] with respect to any monomial ordering >. For each polynomial f in K[X1, . . . , Xn], the remainder f^G is uniquely determined and independent of the order of the elements in G.

Proof. Let f be a polynomial in K[X1, . . . , Xn] and let I = 〈g1, . . . , gs〉. First, assume that there are two expressions f = h1g1 + . . . + hsgs + r and f = h′1g1 + . . . + h′sgs + r′ as given by the division theorem. Then r′ − r = (h1 − h′1)g1 + . . . + (hs − h′s)gs lies in I. Suppose that r′ − r ≠ 0. Since G is a Groebner basis of I, the leading term of r′ − r is divisible by the leading term of some gi, 1 ≤ i ≤ s. But this contradicts the fact that r and r′ are remainders and so none of their terms is divisible by any of the lt>(gi), 1 ≤ i ≤ s. Hence, r = r′.

Second, let G′ be a permutation of the Groebner basis G. Then the division algorithm yields f = h′1g1 + . . . + h′sgs + f^{G′}. But the remainder is uniquely determined and therefore f^G = f^{G′}. ⊓⊔

The remainder on division of a polynomial f by a Groebner basis of an ideal I is a uniquely determined normal form of f modulo I, depending only on the monomial ordering and not on how the division is performed.

Theorem 1.27. (Hilbert Basis Theorem) Each ideal I of K[X1, . . . , Xn] is finitely generated.


The proof follows directly from Thm. 1.23 and Cor. 1.25.

A ring R is Noetherian if each ideal of R is finitely generated.

Theorem 1.28. The following conditions on a ring R are equivalent:

1. Each ideal of R is finitely generated (that is, R is Noetherian).
2. Each ascending chain of ideals I1 ⊆ I2 ⊆ · · · in R becomes stationary (that is, there is an index j0 such that Ij = Ij0 for all j ≥ j0).
3. Each nonempty set of ideals in R contains a maximal element (with respect to inclusion).

Proof. Suppose each ideal of R is finitely generated. Assume that I1 ⊆ I2 ⊆ · · · is an ascending chain of ideals in R. Then I = ⋃_j Ij is an ideal in R and by hypothesis has a finite generating set G. Since G is finite, there is an index j0 such that G ⊆ Ij0, and then Ij = Ij0 for all j ≥ j0.

Suppose that each ascending chain of ideals in R becomes stationary. Assume that S is a nonempty set of ideals in R. If I1 ∈ S is not maximal in S, then there exists an ideal I2 in S that properly contains I1. Continuing like this gives an ascending chain of ideals in S that must become stationary. Then Ij = Ij0 for all j ≥ j0 and Ij0 is maximal in S.

Suppose that each nonempty set of ideals in R contains a maximal element. Let I be an ideal of R, and let S be the set of ideals J ⊆ I of R that are finitely generated. Then S is nonempty and by hypothesis contains a maximal element J0 = 〈f1, . . . , fs〉. Assume that I ≠ J0. Then there is an element f ∈ I \ J0 and so 〈f, f1, . . . , fs〉 is a finitely generated ideal contained in I that properly contains J0. This contradicts the maximality of J0. Hence, I is finitely generated. ⊓⊔

By the Hilbert basis theorem and the above result, we obtain the following.

Corollary 1.29. The polynomial ring K[X1, . . . , Xn] is Noetherian.

1.6 Computation of Groebner Bases

The basic algorithm for the computation of a Groebner basis of an ideal in K[X1, . . . , Xn] is due to Buchberger.

Let > be a monomial ordering on K[X1, . . . , Xn] and let f, g ∈ K[X1, . . . , Xn] \ {0} with lm>(f) = X^α and lm>(g) = X^β, respectively. The least common multiple of α and β w.r.t. the natural ordering on N_0^n is

γ = lcm(α, β) = (max{α1, β1}, . . . , max{αn, βn}).

Then the least common multiple of X^α and X^β w.r.t. the relation of divisibility is

X^γ = lcm(X^α, X^β).

Define the S-polynomial of f and g as

S(f, g) = (X^γ/lt>(f)) · f − (X^γ/lt>(g)) · g.   (1.15)

Note that S(f, g) lies in the ideal generated by f and g. Moreover, in S(f, g) the leading terms of f and g cancel and thus S(f, g) exhibits a new leading term.


Example 1.30. Consider the polynomials f = 2Y^2 + Z^2 and g = 3X^2Y + YZ in Q[X, Y, Z] with respect to the lp ordering with X > Y > Z. Then lm>(f) = Y^2, lm>(g) = X^2Y, and lcm(lt>(f), lt>(g)) = X^2Y^2. Thus

S(f, g) = (X^2Y^2/(2Y^2)) · f − (X^2Y^2/(3X^2Y)) · g = (1/2)X^2Z^2 − (1/3)Y^2Z.
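This computation can be checked with a minimal, self-contained Singular sketch; the spoly command is provided by the teachstd.lib library, as in Example 1.33 below (output omitted, and the result should agree with the hand computation up to the normalization used by the procedure).

> LIB "teachstd.lib";     // provides the spoly command
> ring r = 0, (x,y,z), lp;
> spoly(2y2+z2, 3x2y+yz); // S-polynomial of f and g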

♦

Theorem 1.31. (Buchberger's S-Criterion) Let > be a monomial ordering on K[X1, . . . , Xn]. A set G = {g1, . . . , gs} of polynomials in K[X1, . . . , Xn] is a Groebner basis of the ideal I = 〈g1, . . . , gs〉 if and only if S(gi, gj)^G = 0 for all pairs i ≠ j.

Proof. Let G be a Groebner basis of I. Since each S-polynomial S(gi, gj) belongs to I, it follows from Prop. 1.24 that S(gi, gj)^G = 0.

Conversely, assume that S(gi, gj)^G = 0 for all pairs i ≠ j. Let f ∈ I. Write

f = h1g1 + . . .+ hsgs,

where h1, . . . , hs ∈ K[X1, . . . , Xn]. Let lt>(f) = cX^α, lt>(gi) = ci X^{αi}, and lt>(hi) = di X^{βi}, 1 ≤ i ≤ s. Define δ = max>{αi + βi | 1 ≤ i ≤ s}. The above equation shows that the leading term of f is a K-linear combination of the leading terms lt>(higi), higi ≠ 0, and therefore α ≤ δ.

If δ = α, we can assume that δ = α1 + β1 = . . . = αr + βr, where r ≤ s and higi ≠ 0 for 1 ≤ i ≤ r. Then

cX^α = (c1d1 + . . . + crdr)X^δ.

Thus lt>(g1) = c1X^{α1} divides lt>(f) = cX^α. If all nonzero polynomials f of I have this property, then G is a Groebner basis of I.

If α < δ, the maximal leading terms on the right-hand side of the representation of f must cancel.

By the above notation, we obtain

c1d1 + . . .+ crdr = 0. (1.16)

Write the polynomial f in the form

f = C + (h1 − lt>(h1))g1 + . . . + (hr − lt>(hr))gr + hr+1gr+1 + . . . + hsgs,

where C = lt>(h1)g1 + . . . + lt>(hr)gr. By putting ki = X^{βi} gi/ci, 1 ≤ i ≤ r, we obtain

C = c1d1k1 + . . . + crdrkr   (1.17)
  = c1d1(k1 − k2) + (c1d1 + c2d2)(k2 − k3) + (c1d1 + c2d2 + c3d3)(k3 − k4) + . . .
    + (c1d1 + . . . + cr−1dr−1)(kr−1 − kr) + (c1d1 + . . . + crdr)kr.

Thus C is a linear combination of the ki − kj, 1 ≤ i < j ≤ r. Define X^{αi,j} as the least common multiple of X^{αi} and X^{αj}. Then there exists ξ ∈ N_0^n so that ξ + αi,j = αi + βi = αj + βj, 1 ≤ i < j ≤ r. We have

ki − kj = X^{βi}gi/ci − X^{βj}gj/cj = X^ξ ( X^{αi,j}gi/(ci X^{αi}) − X^{αi,j}gj/(cj X^{αj}) ) = X^ξ S(gi, gj)


and lt>(ki − kj) < δ, 1 ≤ i < j ≤ r. It follows from (1.16) and (1.17) that

C = c′1 X^{ξ1} S(g1, g2) + . . . + c′r−1 X^{ξr−1} S(gr−1, gr),

where c′1, . . . , c′r−1 ∈ K and ξ1, . . . , ξr−1 ∈ N_0^n. By hypothesis,

S(gi, gj) = h^{ij}_1 g1 + . . . + h^{ij}_s gs

for some polynomials h^{ij}_1, . . . , h^{ij}_s with lt>(h^{ij}_l) ≤ lt>(S(gi, gj)), 1 ≤ i < j ≤ s and 1 ≤ l ≤ s. It follows that the polynomial C can be written as a linear combination of the polynomials g1, . . . , gs. Thus by (1.17), the polynomial f can be expressed as a linear combination of the polynomials g1, . . . , gs,

f = h′1g1 + . . .+ h′sgs,

where max>{lt>(h′igi) | h′igi ≠ 0, 1 ≤ i ≤ s} < δ. Since each monomial ordering is a well-ordering, we obtain by continuing in this way an expression

f = h′′1g1 + . . .+ h′′sgs,

where the maximal leading monomial X^δ on the right-hand side equals lm>(f). Then the case α = δ will establish the result. ⊓⊔

Buchberger’s S-criterion can be used to calculate a Groebner basis of a given ideal (Alg. 1.2).

Algorithm 1.2 Buchberger’s algorithm.

Require: I = 〈f1, . . . , fm〉 ideal of K[X1, . . . , Xn], F = {f1, . . . , fm}
Ensure: Groebner basis G of I with F ⊆ G
  G ← F
  repeat
    G′ ← G
    for each pair f ≠ g in G′ do
      S ← S(f, g)^{G′}
      if S ≠ 0 then
        G ← G ∪ {S}
      end if
    end for
  until G = G′

Theorem 1.32. Buchberger’s algorithm terminates and the output is a Groebner basis.

Proof. First, we prove correctness. Claim that at each step G ⊆ I. Indeed, this is true at the start of the algorithm. Suppose G ⊆ I holds at the beginning of some pass and put G = {g1, . . . , gs}. Then for all f, g ∈ G, S(f, g) ∈ 〈f, g〉 ⊆ I. Moreover, the division algorithm gives S(f, g) = h1g1 + . . . + hsgs + S(f, g)^G. Thus S(f, g)^G ∈ I and hence G ⊆ I after each pass. Upon termination, the remainders of the S-polynomials divided by the current set G are 0. In this case, Buchberger's S-criterion shows that the set G is a Groebner basis.


Second, we show termination. For this, consider the ideal of leading terms of G = {g1, . . . , gs} given by

〈LT(G)〉 = 〈lt(g1), . . . , lt(gs)〉.

In each pass, the old set G′ is replaced by a new set G. If G ≠ G′, there is at least one remainder r = S(f, g)^{G′} with f, g ∈ G′ which has been added to G. Since no term of r is divisible by any of the leading terms of the polynomials in G′, we have

〈LT(G′)〉 ⊂ 〈LT(G)〉.

This gives an ascending chain of ideals of K[X1, . . . , Xn]. But the polynomial ring K[X1, . . . , Xn] is Noetherian and so the chain becomes stationary; that is, at some pass

〈LT(G′)〉 = 〈LT(G)〉.

Thus G = G′ by the way G is constructed from G′ and hence the algorithm stops. ⊓⊔

Example 1.33. Consider the ideal I = 〈Y^2 + Z^2, X^2Y + YZ〉 in Q[X, Y, Z] with respect to the lp ordering with X > Y > Z. The following session provides a Groebner basis of I according to Buchberger's algorithm:

> LIB "teachstd.lib"; // library for command spoly

> ring r = 0, (x,y,z), lp;

> ideal i = y2+z2, x2y+yz;

> reduce(spoly(y2+z2, x2y+yz), i);

x2z2+z3

> ideal j = y2+z2, x2y+yz, x2z2+z3;

> reduce(spoly(y2+z2, x2y+yz), j);

0

> reduce(spoly(y2+z2, x2z2+z3), j);

0

> reduce(spoly(x2y+yz, x2z2+z3), j);

0

It follows that {Y 2 + Z2, X2Y + Y Z,X2Z2 + Z3} is a Groebner basis of I. ♦
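For comparison, Singular's built-in Groebner basis engine can be applied directly to the same ideal; a minimal sketch is given below (output omitted, and the basis returned by std may be presented in a different but equivalent form).

> ring r = 0, (x,y,z), lp;
> ideal i = y2+z2, x2y+yz;
> std(i); // Groebner basis computed directly by Singular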

1.7 Reduced Groebner Bases

Groebner bases are not unique, since a Groebner basis remains a Groebner basis if an arbitrary polynomial of the ideal is added. It will be shown that reduced Groebner bases are unique.

Let > be a monomial ordering on K[X1, . . . , Xn]. A Groebner basis G = {g1, . . . , gs} in K[X1, . . . , Xn] is minimal if the polynomials g1, . . . , gs are monic and lt>(gi) is not divisible by lt>(gj) for any pair i ≠ j.

Proposition 1.34. Each nonzero ideal I of K[X1, . . . , Xn] has a minimal Groebner basis with respect to any monomial ordering.


Proof. Let G = {g1, . . . , gs} be a Groebner basis of the ideal I with respect to the monomial ordering >. We may assume that each generator gi is monic by multiplying gi with the inverse of its leading coefficient, 1 ≤ i ≤ s.

Suppose G is not minimal. We may assume that lt>(g1) is divisible by lt>(gi) for some 2 ≤ i ≤ s. By reduction, the polynomial

h = g1 − (lt>(g1)/lt>(gi)) gi   (1.18)

belongs to I and, by Prop. 1.24, its division into G yields h^G = 0. But the leading term of g1 cancels in (1.18) and is larger than the leading term of h. Thus the polynomial g1 cannot be used during the division of h by the basis G. Hence, the polynomial h is a linear combination of g2, . . . , gs. It follows by (1.18) that the generator g1 is also a linear combination of g2, . . . , gs. Therefore, G′ = {g2, . . . , gs} also generates the ideal I. Moreover, G′ is a Groebner basis since if the leading term of a polynomial f ∈ I is divisible by lt>(g1), then it is also divisible by lt>(gi).

Repeating the above argument leads to a minimal Groebner basis in a finite number of steps. ⊓⊔

Let > be a monomial ordering on K[X1, . . . , Xn]. A Groebner basis G = {g1, . . . , gs} in K[X1, . . . , Xn] is reduced if G is a minimal Groebner basis and no term in gi is divisible by lt>(gj) for any pair i ≠ j.

Proposition 1.35. Each nonzero ideal I of K[X1, . . . , Xn] has a unique reduced Groebner basis with respect to any monomial ordering.

Proof. Let {f1, . . . , fr} and {g1, . . . , gs} be reduced Groebner bases of I with respect to the monomial ordering >.

Claim that r = s and, after reordering, lt>(f1) = lt>(g1), . . . , lt>(fs) = lt>(gs). Indeed, by definition of Groebner bases, lt>(g1) is divisible by some lt>(fi), 1 ≤ i ≤ r. We may assume that i = 1. Moreover, lt>(f1) is divisible by some lt>(gj), 1 ≤ j ≤ s. Then lt>(gj) divides lt>(g1). By minimality, we have j = 1. Since f1 and g1 are monic, it follows that lt>(f1) = lt>(g1). The same argument applies to the other generators. In this way, we obtain the desired result.

Claim that f1 = g1, . . . , fs = gs. Indeed, consider the polynomial f1 − g1. The first assertion shows that the leading terms of f1 and g1 cancel. From this and the definition of reduced Groebner bases it follows that no term in f1 − g1 is divisible by lt>(f1) = lt>(g1), lt>(f2) = lt>(g2), . . . , lt>(fs) = lt>(gs). Thus if f1 − g1 is divided into (f1, . . . , fs), it is already the remainder. But f1 − g1 ∈ I and so it follows from Prop. 1.24 that (f1 − g1)^G = 0, where G = {f1, . . . , fs}. Hence, f1 = g1. The same procedure applies to the other generators and the claim follows.

Finally, claim that the ideal I has a reduced Groebner basis. Indeed, the ideal I has a minimal Groebner basis {g1, . . . , gs} by Prop. 1.34. First, replace g1 by the remainder of g1 modulo (g2, . . . , gs). By the division algorithm, none of the terms of the new g1 is divisible by lt>(g2), . . . , lt>(gs). Moreover, by minimality, the leading term of the original g1 is carried over to the new g1. Second, substitute g2 by the remainder of g2 modulo (g1, g3, . . . , gs). This procedure is continued until gs is replaced by the remainder of gs modulo (g1, . . . , gs−1). Then the leading terms of the original generators g1, . . . , gs survive and thus the new generators g1, . . . , gs still form a Groebner basis. Furthermore, by construction, none of the terms in gi is divisible by lt>(gj) for any pair i ≠ j. Hence, we end up with a reduced Groebner basis as claimed. ⊓⊔

Example 1.36 (Singular). The commands groebner and std compute reduced Groebner bases with respect to (global) monomial orderings.


> ring r = 0, (x,y,z), dp;

> ideal i = xyz, xy-yz, xz-y2;

> ideal j = std(i);

> j;

j[1]=y2-xz

j[2]=xy-yz

j[3]=yz2

j[4]=x2z-xz2

j[5]=xz3

1.8 Toric Ideals

Toric ideals represent algebraic relations between monomials and arise naturally in algebraic statistics.

Let K be a field and let X1, . . . , Xn and Y1, . . . , Ym be variables over K. Let t1, . . . , tn be monomials

in K[Y1, . . . , Ym]. Consider the K-algebra homomorphism φ : K[X1, . . . , Xn] → K[Y1, . . . , Ym] given by

φ : Xi ↦ ti,  1 ≤ i ≤ n.   (1.19)

The kernel of this map, ker φ = {a ∈ K[X1, . . . , Xn] | φ(a) = 0}, is called the toric ideal associated with t1, . . . , tn and is denoted by I(t1, . . . , tn).

Proposition 1.37. Let R be a ring and let R[X1, . . . , Xn] be a polynomial ring over R. Let t1, . . . , tn ∈ R and let ψ : R[X1, . . . , Xn] → R be an R-homomorphism defined by ψ(Xi) = ti, 1 ≤ i ≤ n.

• For each element f ∈ R[X1, . . . , Xn], there exist elements h1, . . . , hn ∈ R[X1, . . . , Xn] and r ∈ R such that

  f = ∑_{i=1}^{n} hi · (Xi − ti) + r.

• The kernel of the R-homomorphism ψ is the ideal 〈Xi − ti | 1 ≤ i ≤ n〉 in R[X1, . . . , Xn].

Proof.

• Divide the polynomial f ∈ R[X1, . . . , Xn] by X1 − t1, . . . , Xn − tn. The division yields the desired representation of f.

• We have ψ(Xi − ti) = ψ(Xi) − ψ(ti) = ψ(Xi) − ti = 0, 1 ≤ i ≤ n, and thus the ideal 〈Xi − ti | 1 ≤ i ≤ n〉 belongs to the kernel of ψ. Conversely, assume that f ∈ R[X1, . . . , Xn] lies in the kernel of ψ. By the first assertion, we have f = ∑_{i=1}^{n} hi · (Xi − ti) + r. Then 0 = ψ(f) = f(t1, . . . , tn) = r and thus the polynomial f lies in the ideal 〈Xi − ti | 1 ≤ i ≤ n〉. ⊓⊔

Proposition 1.38. Let t1, . . . , tn be monomials in K[Y1, . . . , Ym] and let J = 〈Xi − ti | 1 ≤ i ≤ n〉 be an ideal of K[X1, . . . , Xn, Y1, . . . , Ym].

• We have I(t1, . . . , tn) = J ∩ K[X1, . . . , Xn].
• If G is a Groebner basis of J with respect to an elimination ordering with Y1 > . . . > Ym > X1 > . . . > Xn, then G ∩ K[X1, . . . , Xn] is a Groebner basis of I(t1, . . . , tn).


Proof. Take the K-algebra homomorphism ψ : K[X1, . . . , Xn, Y1, . . . , Ym] → K[Y1, . . . , Ym] given by ψ(Yi) = Yi, 1 ≤ i ≤ m, and ψ(Xi) = ti, 1 ≤ i ≤ n. Consider K[X1, . . . , Xn] as a subring of the polynomial ring K[X1, . . . , Xn, Y1, . . . , Ym] and observe that the K-algebra homomorphism φ given in (1.19) is the restriction of ψ to K[X1, . . . , Xn]; that is, φ(f) = ψ(f) for each f ∈ K[X1, . . . , Xn]. Thus by Prop. 1.37, we have ker ψ = J and hence I(t1, . . . , tn) = ker φ = ker ψ|_{K[X1, . . . , Xn]} = J ∩ K[X1, . . . , Xn].

The second assertion follows directly from the Elimination theorem. ⊓⊔

A polynomial in K[X1, . . . , Xn] is a binomial if it is given by the difference of two monomials. Thus a binomial is of the form X^α − X^β, where α, β ∈ N_0^n. A binomial X^α − X^β is called pure if gcd(X^α, X^β) = 1. For instance, the binomial X1X2^3 − X3X4^2 is pure, while X1X2^3 − X1X3X4^2 is not.

Theorem 1.39. Consider a grading on the polynomial ring K[X1, . . . , Xn, Y1, . . . , Ym] where the degrees of the variables Y1, . . . , Ym are arbitrary and deg Xi = deg ti, 1 ≤ i ≤ n. Then the toric ideal I = I(t1, . . . , tn) is prime, generated by pure binomials, and homogeneous.

Proof. Let fg ∈ I. Then 0 = φ(fg) = φ(f)φ(g). But K[Y1, . . . , Ym] is an integral domain and so φ(f) = 0 or φ(g) = 0. Thus f ∈ I or g ∈ I and hence I is a prime ideal.

The ideal J = 〈Xi − ti | 1 ≤ i ≤ n〉 of K[X1, . . . , Xn, Y1, . . . , Ym] is generated by binomials. Groebner basis theory implies that all elements in any reduced Groebner basis of J are also binomials. Thus by Prop. 1.38, the ideal I is generated by binomials. Let Xα − Xβ be a binomial in I. Suppose it is not pure. Then Xα − Xβ = Xγ(Xδ − Xε) for some γ, δ, ε ∈ N0^n and so 0 = φ(Xα − Xβ) = t^γ · φ(Xδ − Xε), where t^γ = φ(Xγ) is a nonzero monomial. Since K[Y1, . . . , Ym] is an integral domain, φ(Xδ − Xε) = 0. Thus Xδ − Xε ∈ I and hence I is generated by pure binomials.

The ideal J is homogeneous, since it is generated by homogeneous polynomials. Let f be a polynomial in J ∩ K[X1, . . . , Xn]. Since J is homogeneous, all homogeneous components of f lie in J and therefore all homogeneous components belong to K[X1, . . . , Xn]. Thus all homogeneous components are in I and hence I is homogeneous. ⊓⊔

Example 1.40 (Singular). The toric ideal I = I(Y1³Y2³, Y1², Y2²) is the kernel of the Q-algebra homomorphism φ : Q[X1, X2, X3] → Q[Y1, Y2] given by

φ(X1) = Y1³Y2³, φ(X2) = Y1², and φ(X3) = Y2².

Consider the ideal J = 〈X1 − Y1³Y2³, X2 − Y1², X3 − Y2²〉 of Q[X1, X2, X3, Y1, Y2]. The following computation provides a Groebner basis of the elimination ideal J ∩ Q[X1, X2, X3],

> ring r = 0, (y(1..2), x(1..3)), dp;

> ideal j = x(1)-y(1)^3*y(2)^3, x(2)-y(1)^2, x(3)-y(2)^2;

> eliminate( std(j),y(1)*y(2) );

_[1]=x(2)^3*x(3)^3-x(1)^2

The toric ideal I = J ∩ Q[X1, X2, X3] has the reduced Groebner basis {X1² − X2³X3³}. ♦
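As a sanity check, the generator can be applied to the defining monomial map and should be sent to zero. The following Singular session is only a small sketch of this verification; the ring and map names are chosen for illustration:

> ring rx = 0, (x(1..3)), dp;
> poly f = x(1)^2 - x(2)^3*x(3)^3;
> ring ry = 0, (y(1..2)), dp;
> map phi = rx, y(1)^3*y(2)^3, y(1)^2, y(2)^2;  // images of X1, X2, X3
> phi(f);                                       // expected to print 0, i.e. f lies in ker(phi)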

Toric ideals often arise by using integral matrices. To see this, let A = (aij) be an integral m × nmatrix with non-negative entries. The columns of the matrix A give rise to the monomials

ti = Y1^{a1i} · · · Ym^{ami}, 1 ≤ i ≤ n, (1.20)

in the polynomial ring K[Y1, . . . , Ym]. The toric ideal associated to A is the toric ideal I(t1, . . . , tn) inK[X1, . . . , Xn], which is also denoted by I(A).


Proposition 1.41. Let A = (aij) ∈ Z^{m×n}_{≥0}. The toric ideal I(A) equals the ideal

IA = 〈Xα − Xβ | Aα = Aβ, α, β ∈ N0^n〉.

Proof. Let α ∈ N0^n. The K-algebra homomorphism φ : K[X1, . . . , Xn] → K[Y1, . . . , Ym] given by φ(Xi) = ti with ti as in (1.20), 1 ≤ i ≤ n, assigns to the monomial Xα the monomial Y^{Aα}.

Let Xα − Xβ lie in IA. Then φ(Xα − Xβ) = φ(Xα) − φ(Xβ) = Y^{Aα} − Y^{Aβ} = 0 and thus Xα − Xβ lies in I(A). Conversely, by Thm. 1.39, the toric ideal I(A) is generated by binomials Xα − Xβ, α, β ∈ N0^n. For each such binomial, 0 = φ(Xα − Xβ) = φ(Xα) − φ(Xβ) = Y^{Aα} − Y^{Aβ}. Thus Aα = Aβ and hence Xα − Xβ belongs to IA. ⊓⊔

Example 1.42 (Singular). The integral matrix

A = [ 0 1 0 1 ]
    [ 0 0 1 1 ]
    [ 2 1 1 0 ]

gives the toric ideal I(A) = I(Y3², Y1Y3, Y2Y3, Y1Y2). A reduced Groebner basis of this ideal is {X1X4 − X2X3}, as can be seen from the following calculation,

> ring r = 0, (x(1..4),y(1..3)), lp;

> ideal i = x(1)-y(3)^2, x(2)-y(1)*y(3), x(3)-y(2)*y(3), x(4)-y(1)*y(2);

> ideal j = std(i);

> eliminate( j, y(1)*y(2)*y(3) );

_[1]=x(1)*x(4)-x(2)*x(3)


2 Algebraic Geometry

Algebraic geometry is the study of algebraic varieties, which are the zero sets of systems of multivariate polynomials. Algebraic varieties are geometric objects that can be described algebraically by commutative algebra. This chapter provides a dictionary which allows to translate geometric objects into algebraic ones.

2.1 Affine Varieties

Let K be a field. The set Kn = {(a1, . . . , an) | a1, . . . , an ∈ K} is the affine n-dimensional space over K. Each polynomial f in K[X1, . . . , Xn] defines a polynomial function f : Kn → K, where the value at the point a = (a1, . . . , an) ∈ Kn is obtained by substituting Xi = ai, 1 ≤ i ≤ n, and evaluating the resulting expression in K. More precisely, if f = ∑α cα X^α, cα ∈ K, then

f(a1, . . . , an) = ∑α cα a^α,   a^α = a1^{α1} · · · an^{αn}. (2.1)

This amounts to a ring homomorphism which assigns to each polynomial f ∈ K[X1, . . . , Xn] its polynomial function f : Kn → K.

Proposition 2.1. Let K be an infinite field. A polynomial f ∈ K[X1, . . . , Xn] is the zero polynomial ifand only if the corresponding polynomial function f : Kn → K is the zero function.

Proof. The zero polynomial f = 0 gives rise to the zero polynomial function.

Conversely, we need to show that if f is the zero polynomial function, i.e., f(a) = 0 for all a ∈ Kn, then f is the zero polynomial. In the case n = 1, we use the fact that each nonzero polynomial f ∈ K[X] of positive degree m has at most m roots in K. Since K is infinite, the assumption that f(a) = 0 for all a ∈ K is only satisfied by the zero polynomial.

Let n ≥ 1. Take a polynomial f in K[X1, . . . , Xn, Xn+1]. Write f as a polynomial in Xn+1 with coefficients in K[X1, . . . , Xn]; that is,

f = ∑_{i=0}^{N} hi(X1, . . . , Xn) · X_{n+1}^i,


where hi ∈ K[X1, . . . , Xn]. Let (a1, . . . , an) ∈ Kn. Then f(a1, . . . , an, Xn+1) ∈ K[Xn+1]. In view ofthe case n = 1, f(a1, . . . , an, Xn+1) is the zero polynomial. Thus hi(a1, . . . , an) = 0 for 0 ≤ i ≤ N .Since (a1, . . . , an) was chosen arbitrarily, each hi is the zero function. By induction, each hi is the zeropolynomial. Hence, f is the zero polynomial. ⊓⊔

Corollary 2.2. Let K be an infinite field and let f, g ∈ K[X1, . . . , Xn]. Then f = g in K[X1, . . . , Xn]if and only if f, g define the same polynomial functions.

Proof. Suppose f, g ∈ K[X1, . . . , Xn] give rise to the same polynomial function. Then the polynomialf − g vanishes at all points in Kn. By Prop. 2.1, f − g is the zero polynomial and hence f = g. Theconverse is clear. ⊓⊔

The situation is different for finite fields. For instance, all elements of the finite field Fq with q elements are zeros of the polynomial X^q − X.

The objects studied in affine algebraic geometry are the subsets of the affine space defined by one or more polynomial equations. For instance, in the Euclidean space R3, consider the cone given by the set of triples (x, y, z) that satisfy the equation X² + Y² = Z² (Fig. 2.1).

Fig. 2.1. Cone in Euclidean 3-space.

Note that any polynomial equation f = g can be rewritten as f − g = 0. Thus it will be customaryto write all equations in the form f = 0. More generally, the simultaneous solutions of a system ofpolynomial equations are considered.

Let S be a set of polynomials in K[X1, . . . , Xn]. The set of all simultaneous solutions (a1, . . . , an) ∈Kn of the system of equations

f(a1, . . . , an) = 0, f ∈ S, (2.2)

is the affine variety defined by S and is denoted by V(S). In particular, if S = {f1, . . . , fs} is a finiteset, we also write V(S) = V(f1, . . . , fs). A subset V of Kn is an affine variety if V = V(S) for some setS of polynomials in K[X1, . . . , Xn]. For instance, we have V({1}) = ∅ and V({0}) = Kn, and thus boththe empty set and the affine space Kn are affine varieties (Fig. 2.2).


Fig. 2.2. Intersection of cone and plane in Euclidean 3-space.

Example 2.3 (Maple). The cubic plane curve Y 2 = X2(X + 1) in R2 can be generated by using thecommand (Fig. 2.3)

> with(plots):

> plot([sqrt(x^2*(x+1)), -sqrt(x^2*(x+1))], x=-1..10);

Fig. 2.3. Cubic plane curve.

Proposition 2.4. If S and S′ are subsets of K[X1, . . . , Xn] such that S ⊆ S′, then V(S′) ⊆ V(S).


If there is more than one defining equation, the resulting affine variety can be considered as anintersection of other varieties.

On the other hand, the set W = R \ {0, 1, 2, 3} is not an affine variety. Indeed, a polynomial f ∈ R[X] that vanishes at every point in W has infinitely many roots. Since a polynomial in K[X] of degree n ≥ 1 has at most n zeros in K, the polynomial f must be the zero polynomial. Hence, the smallest affine variety in R that contains W is the whole real line.

The study of affine varieties depends heavily on the base field. In particular, algebraic geometry overthe field of real numbers has some unpleasant surprises. For instance, we have V(X2 + 1) = ∅ if takenover R. On the other hand, each polynomial in C[X] factors completely by the Fundamental theoremof algebra and we find that V(X2 + 1) = {±i}.

Proposition 2.5. If V is an affine variety in Kn, there is an ideal I of K[X1, . . . , Xn] such thatV = V(I).

Proof. By definition, there is a subset S of K[X1, . . . , Xn] such that V = V(S). Let I be the idealof K[X1, . . . , Xn] generated by the elements of S. So each element f ∈ I can be written as f =h1f1 + . . .+ hsfs, where f1, . . . , fs ∈ S and h1, . . . , hs ∈ K[X1, . . . , Xn]. Thus for each point a ∈ V(S),f1(a) = . . . = fs(a) = 0 and thus f(a) = 0. Hence, V(S) ⊆ V(I). Conversely, S is a subset of I andthus V(S) ⊇ V(I). ⊓⊔

Theorem 2.6 (Weak Nullstellensatz). Let K be an algebraically closed field. If I is a proper idealof K[X1, . . . , Xn], the affine variety V(I) is nonempty.

Proof. Each proper ideal I in K[X] is generated by a single polynomial f ∈ K[X]; this can be shown by using the division theorem. Thus we have I = 〈f〉. Since K is algebraically closed, each nonconstant polynomial f has a zero in K. It follows that V(I) ≠ ∅.

Assume that the result holds for the proper ideals of K[X2, . . . , Xn]. Take an ideal I of K[X1, . . . , Xn]for which V(I) = ∅. By Hilbert’s basis theorem, I is finitely generated and so I = 〈f1, . . . , fs〉 for somef1, . . . , fs ∈ K[X1, . . . , Xn]. Suppose f1 6= 0 is not a constant polynomial; otherwise, I = K[X1, . . . , Xn].Write f1 as a polynomial in K[X2, . . . , Xn][X1]; that is,

f1(X1, . . . , Xn) = c·X1^N + terms in which X1 has degree < N,

where 0 6= c ∈ K[X2, . . . , Xn]. Consider the following nonsingular linear change of coordinates,

X1 = Y1,

X2 = Y2 + a2Y1, a2 ∈ K,

...

Xn = Yn + anY1, an ∈ K.

By this setting, we obtain

f1(X1, . . . , Xn) = f1(Y1, Y2 + a2Y1, . . . , Yn + anY1)
                  = c(a2, . . . , an)·Y1^N + terms in which Y1 has degree < N,

where c(a2, . . . , an) is a polynomial expression in a2, . . . , an which is not identically zero. Since K is algebraically closed, K is infinite. Thus by Prop. 2.1, a2, . . . , an can be chosen such that c(a2, . . . , an) ≠ 0.


Under this linear transformation, each polynomial f ∈ K[X1, . . . , Xn] becomes a polynomial f′ ∈ K[Y1, . . . , Yn]. Moreover, the ideal I passes to the ideal I′ = 〈f1′, . . . , fs′〉, which also satisfies V(I′) = ∅. Furthermore, the polynomial f1 transforms into

f1′(Y1, . . . , Yn) = c(a2, . . . , an)·Y1^N + terms in which Y1 has degree < N,

where c(a2, . . . , an) ≠ 0.

Take the projection mapping π1 : Kn → Kn−1 : (a1, a2, . . . , an) ↦ (a2, . . . , an) and put I1 = I′ ∩ K[Y2, . . . , Yn]. As the leading coefficient of the polynomial f1′ is a constant, the Extension theorem implies that partial solutions in Kn−1 always extend; that is, V(I1) = π1(V(I′)). It follows that V(I1) = π1(V(I′)) = π1(∅) = ∅. By induction, we have I1 = K[Y2, . . . , Yn]. Thus 1 ∈ I1 ⊆ I′ and hence I = K[X1, . . . , Xn]. ⊓⊔

Example 2.7 (Singular). The substitution defined in the proof is a ring homomorphism given asfollows:

> ring r = 0, (x,y,z), dp;

> poly f = x2yz+xy+z2;

> ring s = 0, (u,v,w), dp;

> map F = r, u, 2u+v, 3u+w;   // map F from ring r to ring s
                              // x -> u, y -> 2u+v, z -> 3u+w
> poly g = F(f); // apply F
> g;
6u4+3u3v+2u3w+u2vw+11u2+uv+6uw+w2

Example 2.8 (Singular). Consider the ideal I = 〈XY − Y, Y 2 −X2, X − Y 3〉 in Q[X,Y ].

> ring r = 0, (x,y), dp;

> ideal i = xy-y, y2-x2, x-y3;

> std(i);

_[1]=x-y

_[2]=y2-y

The reduced Groebner basis is G = {X − Y, Y² − Y} and thus the variety V(I) consists of two points in C², namely V(I) = V(G) = {(0, 0), (1, 1)}. ♦

2.2 Ideal-Variety Correspondence

We set up a dictionary that allows to relate geometric properties to algebraic ones.

Proposition 2.9. If I and J are ideals of K[X1, . . . , Xn], then V(I + J) = V(I) ∩ V(J).

Proof. By Prop. 1.3, I +J is an ideal. Since I and J are subsets of I +J , it follows from Prop. 2.4 thatV(I + J) ⊆ V(I) and V(I + J) ⊆ V(J). Therefore, V(I + J) ⊆ V(I) ∩ V(J).

Conversely, let (a1, . . . , an) ∈ V(I) ∩ V(J) and let h ∈ I + J . Then there are polynomials f ∈ Iand g ∈ J such that h = f + g. By hypothesis, f(a1, . . . , an) = 0 and g(a1, . . . , an) = 0. Thush(a1, . . . , an) = 0 and hence (a1, . . . , an) ∈ V(I + J). ⊓⊔


Example 2.10 (Singular). In the Euclidean space R3, consider the surfaces V(Y − X²) and V(Z − X³). Their intersection yields an interesting curve, the twisted cubic V = V(Y − X², Z − X³), where

V = V(Y − X², Z − X³) = V(Y − X²) ∩ V(Z − X³)
  = {(x, x², z) | x, z ∈ R} ∩ {(x, y, x³) | x, y ∈ R}
  = {(x, x², x³) | x ∈ R}.

The latter representation is a parametrization of V which provides a way to draw the curve (Fig. 2.4). However, not every affine variety can be parametrized in this way. ♦

Fig. 2.4. Twisted cubic.
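The intersection can also be carried out computationally: by Prop. 2.9 it corresponds to the sum of the defining ideals. The following Singular session is only a minimal sketch of this computation, with ring and ideal names chosen for illustration:

> ring r = 0, (x,y,z), dp;
> ideal i = y-x2;       // the surface V(Y - X^2)
> ideal j = z-x3;       // the surface V(Z - X^3)
> ideal k = i + j;      // sum of ideals; V(i+j) = V(i) intersected with V(j) by Prop. 2.9
> std(k);

The resulting Groebner basis coincides with the one computed for the ideal 〈Y − X², Z − X³〉 in Example 2.20 below.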

Proposition 2.11. If I and J are ideals of K[X1, . . . , Xn], then V(I · J) = V(I) ∪ V(J).

Proof. Let (a1, . . . , an) ∈ V(I · J). By definition, the ideal I · J is generated by elements of the formf · g, where f ∈ I and g ∈ J . It follows that (f · g)(a1, . . . , an) = f(a1, . . . , an) · g(a1, . . . , an) = 0. Thus(a1, . . . , an) ∈ V(I) or (a1, . . . , an) ∈ V(J) and hence (a1, . . . , an) ∈ V(I) ∪ V(J).

Conversely, let (a1, . . . , an) ∈ V(I)∪V(J). Assume that (a1, . . . , an) ∈ V(I). Then f(a1, . . . , an) = 0for all polynomials f ∈ I and so f(a1, . . . , an) · g(a1, . . . , an) = 0 for each polynomial g ∈ J . It followsthat (f · g)(a1, . . . , an) = 0. But the ideal I · J is generated by elements of the form f · g, where f ∈ Iand g ∈ J . Therefore, (a1, . . . , an) ∈ V(I · J). ⊓⊔

Example 2.12 (Singular). The ideals I = 〈X, Y〉 and J = 〈Z〉 give rise to the z-axis V(X, Y) and the (x, y)-plane V(Z), respectively. The product ideal IJ = 〈XZ, YZ〉 provides the union of the z-axis and the (x, y)-plane.

> ring r = 0, (x,y,z), dp;

> ideal i = x,y;


> ideal j = z;

> ideal ij = i*j;

> std(ij);

_[1]=yz

_[2]=xz

Proposition 2.13. If I and J are ideals of K[X1, . . . , Xn], then V(I ∩ J) = V(I) ∪ V(J).

Proof. Let (a1, . . . , an) ∈ V(I)∪V(J). Assume that (a1, . . . , an) ∈ V(I). Then f(a1, . . . , an) = 0 for eachpolynomial f ∈ I. Thus f(a1, . . . , an) = 0 for all polynomials f ∈ I∩J and hence (a1, . . . , an) ∈ V(I∩J).

Conversely, we have I · J ⊆ I ∩ J by Prop. 1.4 and thus V(I ∩ J) ⊆ V(I · J) by Prop. 2.4. ButV(I · J) = V(I) ∪ V(J) by Prop. 2.11 and so V(I ∩ J) ⊆ V(I) ∪ V(J). ⊓⊔
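In Singular, the intersection of ideals can be computed with the intersect command. The following sketch revisits the ideals of Example 2.12; in accordance with Prop. 2.13 (and since here I ∩ J happens to coincide with I·J), it should return the same generators as the product ideal computed there:

> ring r = 0, (x,y,z), dp;
> ideal i = x,y;              // cuts out the z-axis
> ideal j = z;                // cuts out the (x,y)-plane
> std(intersect(i,j));        // generators of the intersection, cutting out V(I) u V(J)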

Proposition 2.14. If K is algebraically closed, each finite subset of the affine space Kn is an affinevariety.

Proof. By Ex. 1.9, each maximal ideal of K[X1, . . . , Xn] has the form ma = 〈X1 − a1, . . . , Xn − an〉 for some point a = (a1, . . . , an) ∈ Kn. Take the ideal I given by the intersection of the maximal ideals ma, where a runs over the elements of V. Since V(ma) = {a} for each point a ∈ Kn, Prop. 2.13 yields V = V(I). ⊓⊔

We associate to each affine variety V in Kn the collection of polynomials I(V ) that vanish at everypoint of V , i.e.,

I(V ) = {f ∈ K[X1, . . . , Xn] | f(a1, . . . , an) = 0 for all (a1, . . . , an) ∈ V } (2.3)

The set I(V) is called the ideal of V.

Proposition 2.15. If V is an affine variety in Kn, then I(V ) is an ideal of K[X1, . . . , Xn].

Proof. Let f, g ∈ I(V ) and h ∈ K[X1, . . . , Xn]. For each (a1, . . . , an) ∈ V , we have (f−g)(a1, . . . , an) =f(a1, . . . , an)− g(a1, . . . , an) = 0 and (f · h)(a1, . . . , an) = f(a1, . . . , an) · h(a1, . . . , an) = 0. Therefore,f − g and f · h ∈ I(V ). ⊓⊔

For instance, we have I(∅) = K[X1, . . . , Xn]. If K is infinite, then by Prop. 2.1, I(Kn) = {0}. If Kis algebraically closed, for each point (a1, . . . , an) ∈ Kn, I({(a1, . . . , an)}) = 〈X1 − a1, . . . , Xn − an〉 bythe proof of Prop. 2.14. For example, I({i}) = 〈x− i〉 over C, but I({i}) = 〈x2 + 1〉 over R.

Proposition 2.16. If V and V ′ are affine varieties in Kn such that V ⊆ V ′, then I(V ′) ⊆ I(V ).

If V = V(I), is it always true that I(V ) = I? The answer is no, as the following simple exampledemonstrates. Consider the ideal I = 〈X2〉 in R[X,Y ] that consists of all polynomials divisible by X2.The corresponding affine variety V = V(X2) is given by the x-axis. Therefore, the ideal I(V ) = 〈X〉is generated by the polynomial X and hence the ideal I(V(I)) is strictly larger than I. The reason isthat the ideal 〈X2〉 is not radical, but 〈X〉 has this property.
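This can be checked in Singular with the radical procedure from the library primdec.lib; the following lines are a small sketch for the ideal 〈X²〉, with the ring name chosen for illustration:

> LIB "primdec.lib";
> ring r = 0, (x,y), dp;
> ideal i = x2;
> radical(i);      // expected to return the ideal generated by x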

Proposition 2.17. If V is an affine variety in Kn, the ideal I(V ) is radical.


Proof. Write I = I(V). Let f ∈ K[X1, . . . , Xn] such that f^m ∈ I for some integer m ≥ 1. Then the polynomial f^m vanishes at each point of V. But 0 = f^m(a) = f(a)^m for each point a ∈ V and so f vanishes at each point of V. Thus f ∈ I and hence the ideal I is radical. ⊓⊔

The properties of the field K also affect the relation between an ideal I in K[X1, . . . , Xn] and thecorresponding ideal I(V(I)). For instance, over the field of real numbers, we have V(X2 + 1) = ∅ andthus I(V(X2+1)) = R[X]. On the other hand, if we take the field of complex numbers, each polynomialin C[X] factors completely by the Fundamental theorem of algebra. We find that V(X2 + 1) = {±i}and thus I(V(X2 + 1)) = 〈X2 + 1〉.

Proposition 2.18. If I is an ideal of K[X1, . . . , Xn], then √I ⊆ I(V(I)).

Proof. Let f ∈ √I. Then there exists an integer m ≥ 1 such that f^m ∈ I. Thus for each point a ∈ V(I), we have 0 = f^m(a) = f(a)^m and thus f(a) = 0. Hence, f ∈ I(V(I)). ⊓⊔

Theorem 2.19 (Strong Nullstellensatz). If K is an algebraically closed field and I is an ideal of K[X1, . . . , Xn], then

I(V(I)) = √I. (2.4)

The proof is based on the Rabinowitsch trick, a short way to derive the strong form of the Nullstellensatz from the weak one.

Proof. Let I be an ideal of K[X1, . . . , Xn]. By Hilbert’s basis theorem, there are polynomials f1, . . . , fs ∈ K[X1, . . . , Xn] such that I = 〈f1, . . . , fs〉. Take f ∈ I(V(I)). Then f(a1, . . . , an) = 0 for each point (a1, . . . , an) ∈ V(I). Consider the ideal Ĩ = 〈f1, . . . , fs, 1 − Y · f〉 of K[X1, . . . , Xn, Y].

Claim that V(Ĩ) = ∅. Indeed, let (a1, . . . , an, an+1) ∈ Kn+1. If (a1, . . . , an) ∈ V(I), then f(a1, . . . , an) = 0. It follows that 1 − Y · f, evaluated at the point (a1, . . . , an, an+1), has the value 1 − an+1·f(a1, . . . , an) = 1 and so (a1, . . . , an, an+1) ∉ V(Ĩ). If (a1, . . . , an) ∉ V(I), there is an index i, 1 ≤ i ≤ s, such that fi(a1, . . . , an) ≠ 0. Think of fi as a polynomial in n + 1 variables. Then fi(a1, . . . , an, an+1) ≠ 0 and hence (a1, . . . , an, an+1) ∉ V(Ĩ). This proves the claim.

Since V(Ĩ) = ∅, the Weak Nullstellensatz implies that 1 ∈ Ĩ. Thus there are polynomials h, h1, . . . , hs in K[X1, . . . , Xn, Y] such that

1 = h1 · f1 + . . .+ hs · fs + h · (1− Y · f).

Put Y = 1/f. Then we obtain an equation of rational functions

1 = ∑i hi(X1, . . . , Xn, 1/f) · fi.

Multiply both sides by a power f^m, where m is sufficiently large to clear the denominators. The resulting equation has the form f^m = ∑i hi·fi ∈ I, where hi ∈ K[X1, . . . , Xn], 1 ≤ i ≤ s. Thus f lies in the radical ideal √I and hence we have shown that I(V(I)) is a subset of √I. By Prop. 2.18, the result follows. ⊓⊔

Example 2.20 (Singular). The ideal I = 〈Y −X2, Z −X3〉 in Q[X,Y, Z] defining the twisted cubiccurve is radical as the following computation shows.


> ring r = 0, (x,y,z), dp;

> ideal i = y-x2, z-x3;

> LIB "primdec.lib";

> std(i);

_[1]=y2-xz

_[2]=xy-z

_[3]=x2-y

> ideal j = radical(std(i));

> std(j);

_[1]=y2-xz

_[2]=xy-z

_[3]=x2-y

Theorem 2.21 (Ideal-Variety Correspondence).

• Let K be an arbitrary field. The maps

  I : {affine varieties in Kn} → {ideals in K[X1, . . . , Xn]}

  and

  V : {ideals in K[X1, . . . , Xn]} → {affine varieties in Kn}

  are inclusion-reversing, and V(I(V)) = V for each affine variety V in Kn.

• Let K be an algebraically closed field. The maps

  I : {affine varieties in Kn} → {radical ideals in K[X1, . . . , Xn]}

  and

  V : {radical ideals in K[X1, . . . , Xn]} → {affine varieties in Kn}

  are inclusion-reversing bijections and inverses of each other. In particular, I(V(I)) = I for each radical ideal I of K[X1, . . . , Xn].

Proof. First, Prop. 2.4 and 2.16 show that the maps are inclusion-reversing.Second, let V be an affine variety in Kn. Then there is an ideal I in K[X1, . . . , Xn] such that

V = V(I). By Hilbert’s basis theorem, there are polynomials f1, . . . , fs in K[X1, . . . , Xn] such thatI = 〈f1, . . . , fs〉. Let (a1, . . . , an) ∈ V . Then f1(a1, . . . , an) = · · · = fs(a1, . . . , an) = 0. But eachelement h ∈ I is of the form h = h1f1 + . . . + hsfs, where hi ∈ K[X1, . . . , Xn], 1 ≤ i ≤ s. Thush(a1, . . . , an) = 0 and hence I is contained in I(V ). Then by Prop. 2.16, V(I(V )) is contained inV = V(I). Conversely, let (a1, . . . , an) ∈ V . Then f(a1, . . . , an) = 0 for each polynomial f ∈ I(V ) andso (a1, . . . , an) ∈ V(I(V )). It follows that V is a subset of V(I(V )). Hence, V(I(V )) = V .

Third, the Strong Nullstellensatz says that I(V(I)) = I for each radical ideal I in K[X1, . . . , Xn].Moreover, the first part exhibits that V(I(V )) = V for each affine variety V in Kn. This shows thatthe mappings are bijective and inverses of each other. ⊓⊔


2.3 Zariski Topology

The affine varieties of the affine space Kn form the closed sets of a topology on Kn. Recall that a familyof subsets Ξ of a set X is a topology on X if both the empty set and the whole set X are elements ofΞ, any intersection of finitely many elements of Ξ is an element of Ξ, and any union of elements ofΞ is an element of Ξ. The members of Ξ are the open sets in X and their complements in X are theclosed sets in X.

Proposition 2.22. Let I and J be ideals and let (Ij) be a family of ideals of K[X1, . . . , Xn].

• V(0) = Kn and V(K[X1, . . . , Xn]) = ∅.
• V(IJ) = V(I) ∪ V(J).
• V(∑j Ij) = ⋂j V(Ij).

Proof. The first assertion is obvious. The second assertion is Prop. 2.11. Finally, the ideal ∑j Ij consists of all finite sums of the form ∑j fj, where fj ∈ Ij. It follows that V(∑j Ij) is equal to the intersection of all affine varieties V(Ij). ⊓⊔

These assertions show that the affine varieties in Kn are the closed sets in Kn: The whole space and theempty set are closed, the finite union of closed sets is closed, and the arbitrary intersection of closedsets is closed. This topology is the Zariski topology on Kn. By Prop. 2.14, each finite subset of Kn isclosed.

Proposition 2.23. If W is a subset of Kn, then the set V(I(W )) is the smallest affine variety thatcontains W .

Proof. By Prop. 2.15, the set I(W ) is an ideal of K[X1, . . . , Xn] and thus by definition V(I(W )) is anaffine variety in Kn.

Claim thatW is a subset of V(I(W )). Indeed, if (a1, . . . , an) ∈W , each polynomial in I(W ) vanishesat the point (a1, . . . , an) and thus the point (a1, . . . , an) belongs to V(I(W )).

Claim that each affine variety V in Kn with W ⊆ V satisfies V(I(W )) ⊆ V . Indeed, if W ⊆ V , thenby Prop. 2.16, I(V ) ⊆ I(W ) and by Prop. 2.4, V(I(W )) ⊆ V(I(V )). But V is an affine variety andtherefore by Thm. 2.21, V(I(V )) = V . ⊓⊔

The Zariski closure of a subset W of Kn is the smallest affine variety containing it. By Prop. 2.23,the Zariski closure of W equals V(I(W )). For instance, we have seen that the Zariski closure of the setW = R\{0, 1, 2, 3} is the real line R. By the ideal-variety correspondence, if V is an affine variety, thenV(I(V )) = V and hence each affine variety equals its Zariski closure.

2.4 Irreducible Affine Varieties

Affine varieties can be decomposed into irreducible components which can then be studied separately.An affine variety V in the affine space Kn is irreducible if in each expression of V as a union of affinevarieties V = V1 ∪ V2, either V = V1 or V = V2.

Proposition 2.24. An affine variety V in Kn is irreducible if and only if the ideal I(V ) is prime inK[X1, . . . , Xn].


Proof. Let V be irreducible and let fg ∈ I(V ). Put V1 = V ∩ V(f) and V2 = V ∩ V(g). By Prop. 2.9,the intersection of affine varieties is an affine variety and thus V1 and V2 are also affine varieties. ByProp. 2.4, V(fg) ⊇ V(I(V )). Moreover, since V ⊆ V(I(V )), we have V = V ∩ V(fg). It follows byProp. 2.11, V = V ∩ (V(f) ∪ V(g)) = (V ∩ V(f)) ∪ (V ∩ V(g)) = V1 ∪ V2. But V is irreducible and soV = V1 or V = V2. Without loss of generality, let V = V1. Then the polynomial f vanishes on V andthus f ∈ I(V ). Hence, the ideal I(V ) is prime.

Conversely, suppose V is reducible. Then there are affine varieties V1 and V2 properly contained in V such that V = V1 ∪ V2. By Prop. 2.16, we have I(V) ⊆ I(V1) and I(V) ⊆ I(V2). By the ideal-variety correspondence, we have V1 = V(I(V1)), V2 = V(I(V2)), and V = V(I(V)). Since V1 ≠ V and V2 ≠ V, it follows that I(V1) ≠ I(V) and I(V2) ≠ I(V). Take polynomials f ∈ I(V1) \ I(V) and g ∈ I(V2) \ I(V). Then fg ∈ I(V1 ∪ V2) = I(V) and hence I(V) is not prime. ⊓⊔

Example 2.25. The ideals I1 = 〈X, Y〉 and I2 = 〈Z〉 are prime in R[X, Y, Z]. Thus the corresponding affine varieties, the z-axis V(I1) = {(0, 0, z) | z ∈ R} and the (x, y)-plane V(I2) = {(x, y, 0) | x, y ∈ R}, are irreducible. ♦

Proposition 2.26. Each affine variety V in Kn can be written uniquely (up to permutation) in the form

V = V1 ∪ . . . ∪ Vm,

where V1, . . . , Vm are irreducible affine varieties with Vi ⊄ Vj for all i ≠ j.

Proof. Let V be an affine variety that cannot be written as a finite union of irreducible affine varieties. Then V is reducible with V = V1 ∪ V1′ such that V1 ≠ V and V1′ ≠ V. Furthermore, at least one of V1 and V1′ cannot be described as a union of irreducible affine varieties. Assume that V1 is not a union of irreducible affine varieties. Then again write V1 = V2 ∪ V2′, where V2 ≠ V1 and V2′ ≠ V1. Continuing in this way, we obtain an infinite descending sequence of affine varieties V ⊃ V1 ⊃ V2 ⊃ . . .. By the ideal-variety correspondence, the operator I is one-to-one and thus gives rise to an ascending sequence of ideals I(V) ⊂ I(V1) ⊂ I(V2) ⊂ . . .. By the ascending chain condition 1.28, this chain must become stationary, contradicting the infiniteness of the sequence of affine varieties.

Assume there are two such expressions V = U1 ∪ . . . ∪ Ur = W1 ∪ . . . ∪ Ws. Consider U1 = V ∩ U1 = (W1 ∩ U1) ∪ . . . ∪ (Ws ∩ U1). Since U1 is irreducible, there is an index j such that U1 = Wj ∩ U1; that is, U1 ⊆ Wj. Likewise, there is an index k such that Wj ⊆ Uk. It follows that U1 ⊆ Uk, which implies by hypothesis that k = 1 and so U1 = Wj. Continuing in this way, we see that r = s and that one decomposition is only a renumbering of the other. ⊓⊔

Example 2.27. The affine variety V = {(x, y, z) | xz = yz = 0, x, y, z ∈ R} in R3 is reducible,since V decomposes into the union of the (x, y)-plane V1 = {(x, y, 0) | x, y ∈ R} and the z-axisV2 = {(0, 0, z) | z ∈ R}. Both are irreducible affine varieties. ♦
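The decomposition into irreducible components can also be computed algebraically: by Prop. 2.24, the components correspond to the minimal primes of the defining ideal. A small Singular sketch using primdec.lib (the procedure minAssGTZ returns the minimal associated primes):

> LIB "primdec.lib";
> ring r = 0, (x,y,z), dp;
> ideal i = xz, yz;
> minAssGTZ(i);    // expected: the two minimal primes <z> and <x,y>,
                   // cutting out the (x,y)-plane and the z-axis, respectively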

2.5 Elimination Theory

We provide a straightforward method for solving systems of polynomial equations based on the elimi-nation orderings.

Let I be an ideal in K[X1, . . . , Xn] and let k ≥ 0 be an integer. The kth elimination ideal of I isgiven as


Ik = I ∩K[Xk+1, . . . , Xn]. (2.5)

Note that the 0-th elimination ideal is I0 = I. Clearly, Ik is an ideal of K[Xk+1, . . . , Xn].

A monomial ordering > on K[X1, . . . , Xn] has the elimination property for X1, . . . , Xk if f ∈

K[X1, . . . , Xn] and lm>(f) ∈ K[Xk+1, . . . , Xn] implies f ∈ K[Xk+1, . . . , Xn]. That is, monomials whichcontain one of X1, . . . , Xk are always larger than monomials which contain none of X1, . . . , Xk. A mono-mial ordering > on K[X1, . . . , Xn] is an elimination ordering for X1, . . . , Xk if it has the eliminationproperty for X1, . . . , Xk.

For instance, the lp ordering has the elimination property for any sequence X1, . . . , Xk, k ≥ 0. Product orderings provide a large class of elimination orderings. For this, let >1 be a monomial ordering on K[X1, . . . , Xm] and >2 be a monomial ordering on K[Y1, . . . , Yn]. Then the product ordering > on K[X1, . . . , Xm, Y1, . . . , Yn], denoted by (>1, >2), is defined by

Xα Yβ > Xγ Yδ  :⟺  Xα >1 Xγ ∨ (Xα = Xγ ∧ Yβ >2 Yδ).

The product ordering > on K[X1, . . . , Xm, Y1, . . . , Yn] is an elimination ordering for X1, . . . , Xm.

Example 2.28 (Singular). Product orderings can be specified by the ring definitions.

> ring r = 0, (w,x,y,z), (dp(2),Dp(2)); // mixed product ordering

> poly f = wx2z+w2x2yz+wx+yz2+y2z;

> f;

w2x2yz+wx2z+wx+y2z+yz2

Theorem 2.29. (Elimination) Let I be an ideal of K[X1, . . . , Xn] and let k ≥ 0 be an integer. If Gis a Groebner basis of I with respect to an elimination ordering > for X1, . . . , Xk, the k-th eliminationideal Ik of I has the Groebner basis

Gk = G ∩K[Xk+1, . . . , Xn].

Proof. Let G = {g1, . . . , gs} be a Groebner basis of I. Assume that the first r ≤ s elements of G lie inK[Xk+1, . . . , Xn].

Claim that Gk = {g1, . . . , gr} is a generating set of Ik. Indeed, by definition, Gk ⊆ Ik and so〈g1, . . . , gr〉 ⊆ Ik. Conversely, let f ∈ Ik. Then divide f into g1, . . . , gs giving the remainder fG = 0. Inview of the given elimination ordering, the leading terms of the (eliminated) polynomials gr+1, . . . , gsmust involve at least one of the variables X1, . . . , Xk and these terms are greater than any term in f .It follows that the division of f into g1, . . . , gs does not involve gr+1, . . . , gs and therefore f is of theform

f = h1g1 + . . .+ hrgr + 0 · gr+1 + . . .+ 0 · gs + 0.

Hence, f ∈ 〈g1, . . . , gr〉 as required.Claim thatGk = {g1, . . . , gr} is a Groebner basis of Ik. Indeed, divide the S-polynomial S(gi, gj) ∈ Ik

into Gk for each pair i 6= j, 1 ≤ i, j ≤ r. The previous paragraph shows that the remainder S(gi, gj)Gk

is zero. Thus by the Buchberger S-criterion, Gk is a Groebner basis of Ik. ⊓⊔

Example 2.30 (Singular).


> ring r = 0, (w,x,y,z), lp;

> ideal i = w2,x4,y5,z3,wxyz;

> eliminate(i,z);

_[1]=w2

_[2]=x4

_[3]=y5

> eliminate(i,yz);

_[1]=w2

_[2]=x4

> eliminate(i,xyz);

_[1]=w2

An ideal I of K[X1, . . . , Xn] generated by f1, f2, . . . , fs provides a system of polynomial equations

f1 = 0, f2 = 0, . . . , fs = 0.

Any point (a1, . . . , an) ∈ V(I) is a solution of the system of equations, and any point (ak+1, . . . , an) in V(Ik) is a partial solution of the system of equations. Each solution truncates to a partial solution, but not each partial solution extends to a solution. This is where the following Extension result comes into play. For this, note that each polynomial f in Ik−1 can be written as a polynomial in Xk whose coefficients are polynomials in Xk+1, . . . , Xn,

f = c·Xk^N + terms in which Xk has degree < N, (2.6)

where 0 ≠ c ∈ K[Xk+1, . . . , Xn] is called the leading coefficient polynomial of f.

Theorem 2.31. (Extension) Let K be an algebraically closed field. A partial solution (ak+1, . . . , an) in V(Ik) extends to a partial solution (ak, ak+1, . . . , an) in V(Ik−1) if the leading coefficient polynomials of the elements of a Groebner basis of Ik−1 with respect to an elimination ordering with X1 > . . . > Xn do not all vanish at (ak+1, . . . , an).

Note that the condition of the theorem is particularly fulfilled if the leading coefficient polynomialsare constants in K. The Elimination theorem shows that a Groebner basis G of the given ideal I withrespect to the lp ordering eliminates successively more and more variables. This gives the followingstrategy for finding all solutions of the system of equations: Start with the polynomials in G with thefewest variables, solve them, and then extend these partial solutions to solutions of the whole systemadding one variable at a time.

Example 2.32 (Singular). Consider the system of equations

X2 + Y 2 + Z2 = 2,

X2 + 2Y 2 = 3,

XZ = 1.

Take the ideal I = 〈X2+Y 2+Z2− 2, X2 +2Y 2− 3, XZ − 1〉 in C[X,Y, Z]. First, compute a Groebnerbasis of I with respect to an elimination ordering.


> ring r = 0, (x,y,z), lp; // lexicographical ordering

> ideal i = x2+y2+z2-2, x2+2y2-3, xz-1;

> ideal j = std(i); j;

j[1]=2z4-z2+1

j[2]=y2-z2-1

j[3]=x+2y2z-3z

The corresponding Groebner basis is G = {2Z4 − Z2 + 1, Y 2 − Z2 − 1, X + 2Y 2Z − 3Z}. The secondelimination ideal I2 = I∩C[Z] has the Groebner basis G2 = {2Z4−Z2+1}. The generating polynomialis irreducible which can be tested by Maple.

> with(PolynomialTools):

> factor( 2*z^4 - z^2 + 1 );

The zeros of this polynomial can be numerically found as follows.

> LIB "solve.lib";

> ring r2 = 0, (z), lp;

> ideal i2 = 2z4-z2+1;

> solve (i2,6);

[1]:
   (-0.691776-i*0.478073)
[2]:
   (-0.691776+i*0.478073)
[3]:
   (0.691776-i*0.478073)
[4]:
   (0.691776+i*0.478073)

However, these roots are algebraic numbers and can be symbolically established by Maple.

> with(PolynomialTools):

> solve( 2*z^4 - z^2 + 1, z );

This produces the four solutions

±(1/2)·√(1 + i√7),   ±(1/2)·√(1 − i√7).

Note that each algebraic number has a degree, which is the degree of its minimal polynomial over Q; for instance, the above algebraic numbers have degree 4. By elimination, the first elimination ideal I1 = I ∩ C[Y, Z] is generated by the polynomials Y² − Z² − 1 and 2Z⁴ − Z² + 1. The leading coefficient polynomial of Y² − Z² − 1 ∈ C[Z][Y] is the coefficient of Y², which is a nonzero constant. Thus by extension, each partial solution in V(I2) extends to a solution in V(I1). There are eight such points. To find them, substitute a root of the generator 2Z⁴ − Z² + 1 for Z and solve the resulting equation for Y. For instance, entering the Groebner basis as a list G in Maple, the command

> subs(Z=(1/2)*sqrt(1+I*sqrt(7)), G);

produces


(1/8)·(1 + i√7)² − (1/4)·i√7 + 3/4,   Y² − 5/4 − (1/4)·i√7,   X − (1/2)·√(1 + i√7) + (1/4)·(1 + i√7)^{3/2}.

We can check that the first expression is a zero by using Maple

> evalf(%);

which yields as output

[0 + 0 · i,−1.250000000− 0.6614378278 · i+ Y 2,−0.9783183438 + 0.6760967252 · i+X].

The second expression shows that

Y = ±√(5/4 + (1/4)·i√7).

Finally, the leading coefficient polynomial of X + 2Z³ − Z ∈ C[Y, Z][X] is the coefficient of the term X, which is a nonzero constant. By extension, each partial solution in V(I1) can be extended to a point in V(I). For instance, for the above value of Z we obtain

X = (1/2)·√(1 + i√7) − (1/4)·(1 + i√7)^{3/2}.

This gives rise to the following solutions of the system of equations,

( (1/2)·√(1 + i√7) − (1/4)·(1 + i√7)^{3/2},  ±√(5/4 + (1/4)·i√7),  (1/2)·√(1 + i√7) ).

All other solutions can be derived in the same way. ♦

Example 2.33 (Singular). Consider the system of equations

XY = 1, (2.7)

XZ = 1. (2.8)

This gives the ideal I = 〈XY − 1, XZ − 1〉 in C[X,Y, Z]. First, calculate a Groebner basis of I withrespect to an elimination ordering.

> ring r = 0, (x,y,z), lp;

> ideal i = xy-1, xz-1;

> ideal j = std(i); j;

j[1]=y-z

j[2]=xz-1

The associated Groebner basis is G = {Y − Z,XZ − 1}. The first elimination ideal I1 = I ∩ C[Y,Z]has the Groebner basis G1 = {Y −Z}. The zeros of this generator are the pairs (a, a) with a ∈ C; thatis, V(I1) = {(a, a) | a ∈ C}.

The leading coefficient polynomial of XZ − 1 equals Z. By extension, each partial solution (a, a) with a ≠ 0 extends to a solution (X, Y, Z) = (1/a, a, a) in V(I) and thus solves the system of equations. Note that the partial solution (0, 0) cannot be extended. ♦


The above examples are rather simple because the coordinates of the solutions can all be expressed in terms of roots of complex numbers. Unfortunately, general systems of polynomial equations are rarely this nice. For instance, it is known that there are no general formulae involving only the field operations in K and extraction of roots, forming so-called radicals, for solving single-variable polynomial equations of degree 5 or higher. This is a famous result due to Evariste Galois (1811–1832). Thus if elimination leads to a one-variable equation of degree 5 or higher, we may not be able to give radical formulae for the roots.

Example 2.34 (Maple). Consider the system of equations

X5 + Y 2 + Z2 = 2,

X2 + 2Y 2 = 3,

XZ = 1.

To solve these equations, we first compute a Groebner basis of the ideal I = 〈X5 + Y 2 + Z2 − 2, X2 +2Y 2 − 3, XZ − 1〉 with respect to the lp ordering.

> with(Groebner):

> F2 := [x^5+y^2+z^2-2, x^2+2*y^2-3, x*z-1]:

> G2 := gbasis(F2, plex(x,y,z));

This gives the output

[2Z7 − Z5 − Z3 + 2, 4Y 2 − 2Z5 + Z3 + Z − 6, 2X + 2Z6 − Z4 − Z2].

By elimination, the second elimination ideal I2 = I ∩C[Z] is generated by the polynomial 2Z7 − Z5 −Z3 + 2. This generator is irreducible over Q. In this situation, we need to decide what kind of answeris required.

If we want a purely algebraic description of the solutions, then Maple can represent solutions ofsystems like this by the solve command. Entering

> solve(convert(G2, set), {x,y,z});

gives the output

X = (1/2)·RootOf(2_Z⁷ − _Z⁵ − _Z³ + 2)² + (1/2)·RootOf(2_Z⁷ − _Z⁵ − _Z³ + 2)⁴ − RootOf(2_Z⁷ − _Z⁵ − _Z³ + 2)⁶,

Y = ±(1/2)·√( 2·RootOf(2_Z⁷ − _Z⁵ − _Z³ + 2)⁵ − RootOf(2_Z⁷ − _Z⁵ − _Z³ + 2)³ − RootOf(2_Z⁷ − _Z⁵ − _Z³ + 2) + 6 ),

Z = RootOf(2_Z⁷ − _Z⁵ − _Z³ + 2).

Here RootOf(2_Z⁷ − _Z⁵ − _Z³ + 2) stands for any of the roots of the polynomial equation 2_Z⁷ − _Z⁵ − _Z³ + 2 = 0 in the dummy variable _Z.

On the other hand, in many practical situations where equations must be solved, knowing a numer-ical approximation to a real or complex solution is often more useful and perfectly acceptable provided


that the results are sufficiently accurate. The command fsolve finds numerical approximations to allreal or complex roots of a polynomial by a combination of root location and numerical techniques. Forinstance,

> fsolve(2*z^7-z^5-z^3+2);

computes approximate values for the real roots of the polynomial. The output is

−1.160417997.

Using this approximate value Z = −1.160417997 as partial solution in V(I2), we can substitute thisnumber into the Groebner basis using

> L := subs (z = -1.160417997, G2);

and obtain

[−7 · 10⁻⁹, −4.514744785 + 4Y², 1.723516882 + 2X].

It shows that the value of the first polynomial is not exactly 0. Nevertheless, we can extend thisapproximate partial solution as follows

> y := solve(L[2]);

> x := solve(L[3]);

In this way, we obtain two approximate solutions of the system,

(x, y, z) = (−0.8617584410,±1.062396440,−1.160417997).

Checking one of these by substituting into the Groebner basis using

> subs([x=-0.8617584410, y=1.062396440, z=-1.160417997], G2);

we find that

[−7 · 10⁻⁹, −1 · 10⁻⁹, 0].

Thus we have a reasonably good approximate solution in the sense that the values are very close to 0.The remaining solutions can be derived in the same way. ♦

2.6 Geometry of Elimination

In this section, it will be shown that elimination can be interpreted as the projection of an affinevariety onto a lower-dimensional subspace. For this, take integers k, n with 0 ≤ k ≤ n and consider theprojection mapping

πk : Kn → Kn−k : (a1, . . . , an) 7→ (ak+1, . . . , an).

Lemma 2.35. Let I be an ideal of K[X1, . . . , Xn] and let V = V(I) be the corresponding affine variety.For each 0 ≤ k ≤ n, the k-th elimination ideal Ik of I satisfies

πk(V ) ⊆ V(Ik).


Proof. Let f ∈ Ik. Since f ∈ I, it follows that for any point (a1, . . . , an) ∈ V , we have f(a1, . . . , an) = 0.But f only involves the coordinates Xk+1, . . . , Xn and so f(πk(a1, . . . , an)) = f(ak+1, . . . , an) = 0.Hence, f vanishes at all points of πk(V ). ⊓⊔

It follows that the projection πk(V) can be described inside V(Ik) as

πk(V) = {(ak+1, . . . , an) ∈ V(Ik) | ∃ a1, . . . , ak ∈ K : (a1, . . . , an) ∈ V}.

Thus πk(V) consists exactly of the partial solutions that extend to complete solutions. However, πk(V) is generally not an affine variety.

Example 2.36. Reconsider the system of equations XY = 1 and XZ = 1 in C[X,Y, Z] (Ex. 2.33).The first elimination ideal I1 is generated by the polynomial Y − Z and the associated affine varietyV(I1) = {(a, a) | a ∈ C} is a line in the (y, z)-plane.

On the other hand, the projected set π1(V) = {(a, a) | a ∈ C, a ≠ 0} is not an affine variety. It misses the point (0, 0), since there is no point (a, 0, 0) ∈ V(I) for any a ∈ C. ♦

The gap between the projected set πk(V) and the affine variety V(Ik) can be determined by using the Extension theorem.

Theorem 2.37. Let K be an algebraically closed field, let I be an ideal of K[X1, . . . , Xn], and letV = V(I) be the corresponding affine variety. Let G1 = {g1, . . . , gs} be a Groebner basis of the firstelimination ideal I1 of I with respect to an elimination ordering with X1 > . . . > Xn and let hi denotethe leading coefficient polynomial of gi, 1 ≤ i ≤ s. Then we have

V(I1) = π1(V) ∪ [V(h1, . . . , hs) ∩ V(I1)].

Proof. By Lemma 2.35, the set on the right-hand side lies in V(I1). Conversely, let (a2, . . . , an) ∈ V(I1). If (a2, . . . , an) ∉ V(h1, . . . , hs), then by the Extension theorem there exists a1 ∈ K such that (a1, a2, . . . , an) ∈ V(I) and thus π1(a1, a2, . . . , an) = (a2, . . . , an) ∈ π1(V). Otherwise, (a2, . . . , an) lies in V(h1, . . . , hs) ∩ V(I1). ⊓⊔

The relationship between the projected set πk(V ) and the affine variety V(Ik) can be explained asfollows.

Theorem 2.38. (Closure) Let K be an algebraically closed field, let I = 〈f1, . . . , fs〉 be an ideal inK[X1, . . . , Xn], and let V = V(I) be the corresponding affine variety in Kn. For each 0 ≤ k ≤ n, theaffine variety V(Ik) is the Zariski closure of πk(V ).

Proof. By Prop. 2.23, we have to show that V(Ik) = V(I(πk(V))).

By Lemma 2.35, we have πk(V) ⊆ V(Ik). But by Prop. 2.23, V(I(πk(V))) is the smallest affine variety containing πk(V) and so V(I(πk(V))) ⊆ V(Ik).

Conversely, let f ∈ I(πk(V)); that is, f(ak+1, . . . , an) = 0 for all (ak+1, . . . , an) ∈ πk(V). Consider f as an element of K[X1, . . . , Xn]. Then f(a1, . . . , an) = 0 for all (a1, . . . , an) ∈ V; that is, f ∈ I(V(I)). By the Strong Nullstellensatz, f ∈ √I. But f lies in K[Xk+1, . . . , Xn] and so f ∈ √Ik. It follows that I(πk(V)) ⊆ √Ik. Therefore, by Prop. 2.4, V(Ik) = V(√Ik) ⊆ V(I(πk(V))). ⊓⊔

Example 2.39. Reconsider the system of equations XY = 1 and XZ = 1 in C[X, Y, Z] (Ex. 2.36). The first elimination ideal I1 has the associated affine variety V(I1) = {(a, a) | a ∈ C}, which is a line in the (y, z)-plane. On the other hand, the projected set π1(V) = {(a, a) | a ∈ C, a ≠ 0} is not an affine variety. By the Closure theorem, the affine variety V(I1) is the Zariski closure of π1(V). ♦


2.7 Implicit Representation

An affine variety is defined as the set of solutions of a system of polynomial equations. There is anotherway to represent an affine variety, namely, by a system of parametric equations such that its elementscan be explicitly written down. This representation can be used for drawing an affine variety, but notevery affine variety can be described in this way.

Let V be an affine variety in the affine space Kn. An implicit representation of V describes the set V as the set of solutions of a system of polynomial equations,

f1 = . . . = fm = 0, (2.9)

where f1, . . . , fm are polynomials in K[X1, . . . , Xn]. On the other hand, a parametric representation ofV describes the set V as the Zariski closure of the set

{(f1(t1, . . . , tm), . . . , fn(t1, . . . , tm)) | t1, . . . , tm ∈ K}, (2.10)

where f1, . . . , fn are polynomials in K[T1, . . . , Tm] or rational functions in K(T1, . . . , Tm). The implicitrepresentation is useful to test whether or not a point lies in the variety, while the parametric repre-sentation is useful for plotting the variety.

Example 2.40. The affine variety V given by the solutions of the equation

X2 − Y = 0

can be equivalently described by the polynomial parametrization

X = T, Y = T 2.

Take polynomials f1, . . . , fn ∈ K[T1, . . . , Tm] and consider the following system of equations inK[T1, . . . , Tm, X1, . . . , Xn] given as

Xi = fi(T1, . . . , Tm), 1 ≤ i ≤ n. (2.11)

The polynomials f1, . . . , fn give rise to the mapping F : Km → Kn defined as

F : (t1, . . . , tm) 7→ (f1(t1, . . . , tm), . . . , fn(t1, . . . , tm)). (2.12)

The set F (Km) is a subset of Kn that is parametrized by the equations (2.11). But F (Km) may notbe an affine variety and thus we search for the smallest affine variety that contains F (Km); that is, theZariski closure of F (Km). For this, we relate implicitization to elimination. To this end, observe thatthe system of equations (2.11) defines the affine variety

V = V(X1 − f1, . . . , Xn − fn) ⊆ Km+n. (2.13)

The points of V can be written in the form

(t1, . . . , tm, f1(t1, . . . , tm), . . . , fn(t1, . . . , tm)), t1, . . . , tm ∈ K. (2.14)


Define the embedding ιn : Km → Km+n by

ιn : (t1, . . . , tm) 7→ (t1, . . . , tm, f1(t1, . . . , tm), . . . , fn(t1, . . . , tm)), (2.15)

and the projection πm : Km+n → Kn by

πm : (t1, . . . , tm, x1, . . . , xn) 7→ (x1, . . . , xn). (2.16)

These maps fit into a commutative triangle formed by the embedding ιn : Km → Km+n, the projection πm : Km+n → Kn, and the map F : Km → Kn.

That is, the map F can be written as the composition

F = πm ◦ ιn. (2.17)

By definition, we have

ιn(Km) = V. (2.18)

Thus we obtain

F (Km) = πm(ιn(Km)) = πm(V ). (2.19)

Therefore, the image of the parametrization equals the projection of the affine variety. Thus the Closuretheorem immediately implies the following result.

Theorem 2.41. (Polynomial Implicitization) Let K be an algebraically closed field, let F : Km →Kn be a map determined by the polynomial parametrization (2.11), and let I = 〈X1 − f1, . . . , Xn − fn〉be an ideal in K[T1, . . . , Tm, X1, . . . , Xn]. Then for the m-th elimination ideal Im = I ∩K[X1, . . . , Xn],the affine variety V(Im) is the Zariski closure of F (Km).

The following algorithm solves the polynomial implicitization problem: Given a system of equations

Xi = fi, 1 ≤ i ≤ n,

where f1, . . . , fn are polynomials in K[T1, . . . , Tm]. Consider the ideal I = 〈X1 − f1, . . . , Xn − fn〉 in K[T1, . . . , Tm, X1, . . . , Xn]. Compute a Groebner basis of I with respect to an elimination ordering with T1 > . . . > Tm > X1 > . . . > Xn. Then the elements of the Groebner basis which do not involve T1, . . . , Tm form a Groebner basis of the m-th elimination ideal Im. By the Implicitization theorem, this basis defines the affine variety in Kn containing the parametrization.

Example 2.42 (Singular). Consider the parametric surface

S = {(uv, uv2, u2) | u, v ∈ C}


that is given by the system of equations

X = UV,

Y = UV 2,

Z = U2.

The surface S can be illustrated by the Maple code (Fig. 2.5)

> with(plots):

> plot3d([u*v, u*v^2, u^2], u=-5..5, v=-5..5, grid=[20,20]);

Fig. 2.5. Parametric surface.

Take the ideal I = 〈X − UV, Y − UV 2, Z − U2〉 in C[U, V,X, Y, Z] and compute a Groebner basis of Iwith respect to the lp ordering with U > V > X > Y > Z.

> ring r = 0, (u,v,x,y,z), lp;

> ideal i = x-uv, y-uv2, z-u2;

> std(i);

_[1]=x4-y2z

_[2]=vyz-x3

_[3]=vx-y

_[4]=v2z-x2

_[5]=uy-v2z

_[6]=ux-vz

_[7]=uv-x

_[8]=u2-z

Thus the second elimination ideal I2 has the Groebner basis {X4 − Y 2Z}. By the Implicitizationtheorem, the affine variety V = V(X4 − Y 2Z) is the Zariski closure of the parametric surface S. ♦


Second, consider a parametric representation of an affine variety V in Kn as the Zariski closure of the set

{ ( f1(t1, . . . , tm)/g1(t1, . . . , tm), . . . , fn(t1, . . . , tm)/gn(t1, . . . , tm) ) | t1, . . . , tm ∈ K }, (2.20)

where f1, . . . , fn and g1, . . . , gn are polynomials in K[T1, . . . , Tm].

Example 2.43 (Maple). Consider a curve in C² parametrized by rational functions

X = a(T)/c(T),  Y = b(T)/c(T),

where a, b, c ∈ K[T] are polynomials such that c ≠ 0 and gcd(a, b, c) = 1. In particular, consider the curve

C = { ( (2T² + 4T + 5)/(T² + 2T + 3), (3T² + T + 4)/(T² + 2T + 3) ) | T ∈ C }.

This curve can be drawn by using the Maple code (Fig. 2.6)

> with(plots):

> plot([(2*t^2+4*t+5)/(t^2+2*t+3), (3*t^2+t+4)/(t^2+2*t+3), t=-10..10]);

Fig. 2.6. Parametric curve.

Parametrizations of this form play an important role in computer-aided geometric design. A questionof particular interest is the implicitization problem, which asks how the equation f(X,Y ) = 0 of theunderlying curve is obtained from the parametrization. ♦

Take polynomials f1, . . . , fn and g1, . . . , gn in K[T1, . . . , Tm] and consider the system of equations


Xi = fi(T1, . . . , Tm) / gi(T1, . . . , Tm), 1 ≤ i ≤ n. (2.21)

These polynomials give rise to the mapping F : Km → Kn defined as

F : (t1, . . . , tm) ↦ ( f1(t1, . . . , tm)/g1(t1, . . . , tm), . . . , fn(t1, . . . , tm)/gn(t1, . . . , tm) ). (2.22)

The mapping F may not be defined at all points in Km because of the denominators. Therefore, we putW = V(g1, . . . , gn) and obtain the mapping F : Km \W → Kn. In order to control the denominators,we put g = g1 · · · gn, introduce an additional variable Y , and consider the ideal

I = 〈g1X1 − f1, . . . , gnXn − fn, Y g − 1〉. (2.23)

The equation 1− Y g = 0 means that the denominators g1, . . . , gn never vanish on V(I).Define the embedding ιn : Km \W → Km+n+1 by

ιn : (t1, . . . , tm) ↦ ( 1/g(t1, . . . , tm), t1, . . . , tm, f1(t1, . . . , tm)/g1(t1, . . . , tm), . . . , fn(t1, . . . , tm)/gn(t1, . . . , tm) ), (2.24)

and the projection πm+1 : Km+n+1 → Kn by

πm+1 : (y, t1, . . . , tm, x1, . . . , xn) 7→ (x1, . . . , xn). (2.25)

These maps fit into a commutative triangle formed by the embedding ιn : Km \ W → Km+n+1, the projection πm+1 : Km+n+1 → Kn, and the map F : Km \ W → Kn.

That is, the mapping F can be written as the composition F = πm+1 ◦ ιn. By definition, we have ιn(Km \ W) = V(I) and thus

F(Km \ W) = πm+1(ιn(Km \ W)) = πm+1(V(I)). (2.26)

Therefore, the image of the parametrization equals the projection of the affine variety. The Closuretheorem yields the following result.

Theorem 2.44. (Rational Implicitization) Let K be an algebraically closed field. Let F : Km \ W → Kn be the mapping determined by the rational parametrization (2.21) and let I = 〈g1X1 − f1, . . . , gnXn − fn, 1 − Yg〉 be an ideal in K[Y, T1, . . . , Tm, X1, . . . , Xn], where g = g1 · · · gn. Then for the (m + 1)-th elimination ideal Im+1 = I ∩ K[X1, . . . , Xn], the affine variety V(Im+1) is the Zariski closure of F(Km \ W).

The following algorithm solves the rational implicitization problem: Given a system of equations

Xi = fi(T1, . . . , Tm) / gi(T1, . . . , Tm), 1 ≤ i ≤ n,


where f1, . . . , fn and g1, . . . , gn are polynomials in K[T1, . . . , Tm]. Take a new variable Y and consider the ideal I = 〈g1X1 − f1, . . . , gnXn − fn, 1 − Yg〉 of K[Y, T1, . . . , Tm, X1, . . . , Xn], where g = g1 · · · gn. Compute a Groebner basis with respect to an elimination ordering with Y > T1 > . . . > Tm > X1 > . . . > Xn. By the Elimination theorem, the elements of the Groebner basis not involving Y, T1, . . . , Tm form a Groebner basis of the (m + 1)-th elimination ideal Im+1. By the Implicitization theorem, this Groebner basis defines the affine variety in Kn containing the parametrization.
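For the curve of Example 2.43, this algorithm can be carried out in Singular along the following lines; the session is only a sketch, and the auxiliary variable is named w here (instead of Y) so that it does not clash with the curve coordinate y:

> ring r = 0, (w,t,x,y), lp;
> poly g = t2+2t+3;                                // common denominator of the parametrization
> ideal i = g*x-(2t2+4t+5), g*y-(3t2+t+4), 1-w*g^2;
> eliminate(i, w*t);                               // eliminate w and t

Up to a scalar factor, the single generator returned by eliminate is the conic derived by hand in Example 2.45 below.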

Example 2.45. Consider a curve in K² parametrized by the rational functions

X = (2T² + 4T + 5)/(T² + 2T + 3),  Y = (3T² + T + 4)/(T² + 2T + 3).

This parametrization gives the following ideal in C[Z, T, X, Y],

I = 〈(T² + 2T + 3)X − (2T² + 4T + 5), (T² + 2T + 3)Y − (3T² + T + 4), (T² + 2T + 3)²Z − 1〉.

Computing a Groebner basis of I with respect to the lp ordering with Z > T > X > Y and keeping only the elements involving neither Z nor T yields, by the Elimination theorem, the second elimination ideal

I2 = 〈50X² + Y² − 175X − 6Y + 159〉.

Thus the underlying curve is the conic given by f(X, Y) = 50X² + Y² − 175X − 6Y + 159. ♦

3 Combinatorial Geometry

In this chapter we examine some interesting recently discovered connections between polynomials andthe geometry of convex polytopes. This will naturally lead to the polytope algebra that can be viewedas a multi-dimensional generalization of the tropical algebra.

3.1 Tropical Algebra

A semiring is an algebraic structure similar to a ring, but without the requirement that each elementmust have an additive inverse. A prominent example of a semiring is the so-called tropical algebra.

A semiring is a non-empty set R together with two binary operations, addition + and multipli-cation ·, such that (R,+) is a commutative monoid with identity element 0, (R, ·) is a monoid withidentity element 1, multiplication distributes over addition, i.e., for all a, b, c ∈ R,

a · (b+ c) = (a · b) + (a · c) and (a+ b) · c = (a · c) + (b · c),

and multiplication with 0 annihilates R, i.e., for all a ∈ R, a · 0 = 0 = 0 · a. A commutative semiring isa semiring whose multiplication is commutative. An idempotent semiring is a semiring whose additionis idempotent, i.e., for all a ∈ R, a+ a = a.

Example 3.1. Each ring is also a semiring. The set of natural numbers N0 forms a commutative semiring with the ordinary addition and multiplication. Likewise, the non-negative rational numbers and the non-negative real numbers form commutative semirings. ♦

Example 3.2. The set R ∪ {∞} together with the operations

x ⊕ y = min{x, y} and x ⊙ y = x + y,  x, y ∈ R ∪ {∞},

forms a commutative, idempotent semiring with additive identity ∞ and multiplicative identity 0. Note that additive and multiplicative inverses may not exist. For instance, the equations 3 ⊕ x = 10 and ∞ ⊙ x = 1 have no solutions x ∈ R ∪ {∞}. ♦

This semiring is also known as the min-plus algebra or tropical algebra. The attribute "tropical" was coined by French scholars (1998) in honor of the Brazilian mathematician Imre Simon who studied the tropical semiring.
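For instance, in the tropical algebra one computes 3 ⊕ 7 = min{3, 7} = 3 and 3 ⊙ 7 = 3 + 7 = 10. Idempotency reads 5 ⊕ 5 = 5, and distributivity can be checked directly: 3 ⊙ (4 ⊕ 6) = 3 + min{4, 6} = 7 = min{3 + 4, 3 + 6} = (3 ⊙ 4) ⊕ (3 ⊙ 6).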


Proposition 3.3. The mapping φ : R≥0 → R ∪ {∞} : x 7→ − log x is an antitone, bijective mappingsuch that φ(0) = ∞, φ(1) = 0, and

φ(x · y) = φ(x)⊙ φ(y), x, y ∈ R≥0. (3.1)

Thus the mapping φ is a monoid isomorphism from (R≥0, ·) onto (R∪ {∞},⊙). This mapping is calledthe tropicalization of the ordinary semiring (R≥0,+, ·) (Fig. 3.1).

Fig. 3.1. The function x ↦ −log x.

3.2 Shortest Paths Problem

We illustrate an important problem in graph theory that makes use of the tropical algebra. For this, let G = (V, E) be a digraph with vertex set V = {1, . . . , n}. Each edge (i, j) in G has an associated length dij given by a positive real number. We put dii = 0, 1 ≤ i ≤ n, and dij = +∞ if (i, j), i ≠ j, is not an edge in G. We represent the digraph G by the n × n adjacency matrix DG = (dij). Consider the (n − 1)th power of the matrix DG in the tropical algebra (R ∪ {∞}, ⊕, ⊙),

DG^{⊙(n−1)} = DG ⊙ DG ⊙ · · · ⊙ DG   (n − 1 times). (3.2)

Proposition 3.4. Let G be a digraph on n vertices with n × n adjacency matrix DG. The entry of the matrix DG^{⊙(n−1)} in row i and column j equals the length of the shortest path from vertex i to vertex j.

Proof. Let dij^(r) denote the minimum length of any path from vertex i to vertex j which uses at most r edges in G. Clearly, we have dij^(1) = dij. The shortest path from vertex i to vertex j visits each vertex at most once, because the weights are assumed to be nonnegative. Thus the shortest path uses at most n − 1 edges and hence the length of a shortest path from i to j equals dij^(n−1).


Observe that a shortest path from vertex i to vertex j which uses at most r ≥ 2 edges consists of a shortest path from vertex i to some vertex k, which uses at most r − 1 edges, and the edge (k, j). That is, the shortest paths from vertex i to vertex j satisfy the equation

dij^(r) = min{ dik^(r−1) + dkj | 1 ≤ k ≤ n },  2 ≤ r ≤ n − 1. (3.3)

The tropicalization of this equation yields

dij^(r) = ⊕_{k=1}^{n} dik^(r−1) ⊙ dkj,  2 ≤ r ≤ n − 1. (3.4)

The right-hand side is the tropical product of the ith row of DG^{⊙(r−1)} and the jth column of DG. Thus the left-hand side is the (i, j)th entry of the matrix DG^{⊙r}. Hence, the assertion follows. ⊓⊔

The iterative evaluation of Eq. (3.3) yields an algorithm for finding the shortest paths between each pair of vertices in a digraph; it is closely related to the Floyd-Warshall algorithm for the all-pairs shortest path problem.

Example 3.5. Consider the digraph G on the vertex set {1, 2, 3, 4} with the edges 1 → 2, 2 → 3, 3 → 1, and 4 → 1, each of length 1.

The corresponding adjacency matrix is

DG =
[ 0  1  ∞  ∞ ]
[ ∞  0  1  ∞ ]
[ 1  ∞  0  ∞ ]
[ 1  ∞  ∞  0 ]

and the tropical matrix products are

DG^{⊙2} =
[ 0  1  2  ∞ ]
[ 2  0  1  ∞ ]
[ 1  2  0  ∞ ]
[ 1  2  ∞  0 ]

and

DG^{⊙3} =
[ 0  1  2  ∞ ]
[ 2  0  1  ∞ ]
[ 1  2  0  ∞ ]
[ 1  2  3  0 ].

♦

3.3 Geometric Zoo

We consider the Euclidean n-space Rn equipped with the ordinary scalar product


〈u, v〉 = u1v1 + · · ·+ unvn, u, v ∈ Rn. (3.5)

The Euclidean distance between two points u and v in Rn is defined as

‖u − v‖ = √〈u − v, u − v〉. (3.6)

A set C in Rn is called convex if it contains the line segment connecting any two points in C. Theline segment between two points u and v in Rn is given as

[u, v] = {λu+ (1− λ)v | 0 ≤ λ ≤ 1}. (3.7)

Simple examples of convex sets are the singleton sets {v}, where v ∈ Rn, and the Euclidean space Rn.There is a simple way to construct new convex sets from given ones.

Proposition 3.6. The intersection of an arbitrary collection of convex sets is convex.

Proof. If a line segment belongs to every set in the collection, it also belongs to the intersection. ⊓⊔

If a set is not itself convex, its convex hull is the smallest convex set containing it (Fig. 3.2). The convex hull of a set S in R^n is denoted by conv(S).

Fig. 3.2. A set and its convex hull, a convex polygon.

Proposition 3.7. If S is a subset of R^n, then its convex hull is

conv(S) = { λ_1 s_1 + . . . + λ_m s_m | m ≥ 0, s_i ∈ S, λ_i ≥ 0, ∑_i λ_i = 1 }.  (3.8)

Moreover,

• for each subset S of R^n, conv(conv(S)) = conv(S);
• if S_1 and S_2 are subsets of R^n, then conv(conv(S_1) ∪ conv(S_2)) = conv(S_1 ∪ S_2), and if S_1 ⊆ S_2, then conv(S_1) ⊆ conv(S_2).

Proof. Let u, v ∈ S. By definition, for each λ, 0 ≤ λ ≤ 1, the point λu + (1 − λ)v belongs to conv(S), and so conv(S) is a convex set. Moreover, for each s ∈ S, s = 1 · s ∈ conv(S) and thus conv(S) contains the set S.

Let C be a convex set in R^n containing S. We show that conv(S) lies in C by using induction on the size (i.e., number of terms) of the linear combinations. Each element of S lies in C. Let s_1, . . . , s_{m+1} be elements of S. Consider the convex combination

s = λ_1 s_1 + . . . + λ_m s_m + λ_{m+1} s_{m+1},

where λ_i ≥ 0, 1 ≤ i ≤ m + 1, and ∑_i λ_i = 1. If λ_{m+1} = 1, then s = s_{m+1} ∈ C, and if λ_{m+1} = 0, then by induction we have s ∈ C. Otherwise, we have

s = (1 − λ_{m+1}) ∑_{i=1}^{m} (λ_i / (1 − λ_{m+1})) s_i + λ_{m+1} s_{m+1},

where

λ_i / (1 − λ_{m+1}) ≥ 0, 1 ≤ i ≤ m,   and   ∑_{i=1}^{m} λ_i / (1 − λ_{m+1}) = 1.

Thus, by induction, the convex combination

s′ = ∑_{i=1}^{m} (λ_i / (1 − λ_{m+1})) s_i

belongs to C. Since C is convex and contains S, it follows that the element s = (1 − λ_{m+1}) s′ + λ_{m+1} s_{m+1} belongs to C, as required. The remaining assertions are left to the reader. ⊓⊔

A linear combination of the form λ_1 s_1 + . . . + λ_m s_m, where s_i ∈ S, λ_i ≥ 0, and ∑_i λ_i = 1, is called a convex combination.

A polytope is the convex hull of a finite set in R^n. If the set is S = {s_1, . . . , s_m} in R^n, then by Prop. 3.7, the corresponding polytope can be expressed as

conv(S) = { λ_1 s_1 + . . . + λ_m s_m | λ_i ≥ 0, ∑_i λ_i = 1 }.  (3.9)

In lower dimensions, polytopes are familiar geometric figures: A polytope in R is a line segment, apolytope in R2 is a line segment or a convex polygon (Fig. 3.2), and a polytope in R3 is a line segment,a convex polygon lying in a plane, or a convex polyhedron. In particular, a lattice polytope is a polytopegiven by the convex hull of a set of integral points.

Example 3.8. The mathematical software polymake was designed to work with polytopes. Each polytope in polymake is treated as an object and is given by a file storing the data. The program polymake allows to construct polytopes from scratch or by applying constructions to existing polytopes.

Consider the lattice polytope given by the convex hull of the points (0,8), (0,7), (0,6), (0,5), (1,6), (1,5), (1,4), (1,3), (2,4), (2,3), and (3,2) (Fig. 5.7). In polymake, this polytope can be specified in a text file, say dude, containing the following information

POINTS
1 0 8
1 0 7
1 0 6
1 0 5
1 1 6
1 1 5
1 1 4
1 1 3
1 2 4
1 2 3
1 3 2

The points are always represented in homogeneous coordinates, where the first coordinate is used for homogenization. ♦

In particular, an n-dimensional simplex or n-simplex is the convex hull of n + 1 points m_1, . . . , m_{n+1} in R^n such that the vectors m_2 − m_1, . . . , m_{n+1} − m_1 form a basis of R^n. An n-simplex can be constructed from an (n − 1)-simplex in R^{n−1} by adding one point in the n-th dimension and connecting the point with all points of the (n − 1)-simplex. In this way, one obtains inductively simplices that are singleton points, line segments, triangles, tetrahedra (Fig. 3.3), and so on. A simple numerical check of the basis condition is sketched below.
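Whether n + 1 given points form an n-simplex can be checked by testing the rank condition of the definition. A small illustrative Python/numpy sketch (an aside, not part of the book's software setup):

import numpy as np

def is_simplex(points):
    # points: a list of n+1 points in R^n; test that m_2 - m_1, ..., m_{n+1} - m_1 form a basis
    m = np.asarray(points, dtype=float)
    diffs = m[1:] - m[0]
    n = m.shape[1]
    return diffs.shape[0] == n and np.linalg.matrix_rank(diffs) == n

print(is_simplex([(0, 0, 0), (1, 0, 0), (0, 1, 0), (0, 0, 1)]))   # a tetrahedron: True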

Fig. 3.3. A tetrahedron.

Each polytope has a well-defined dimension. To see this, we need to develop the theory of affine subspaces. An affine subspace of R^n is a subset A of R^n with the property that if m ≥ 0 and s_1, . . . , s_m ∈ A, then λ_1 s_1 + . . . + λ_m s_m ∈ A for all λ_1, . . . , λ_m ∈ R with ∑_i λ_i = 1. Linear combinations of the form λ_1 s_1 + . . . + λ_m s_m, where s_i ∈ A and ∑_i λ_i = 1, are called affine combinations.

Given a subset S of Rn and a vector v ∈ Rn, the translate of S by v is the set

v + S = {v + s | s ∈ S}. (3.10)

Proposition 3.9. Each affine subspace of Rn is a translate of a unique linear subspace of Rn.

Proof. Let A be an affine subspace of R^n and v ∈ A. Consider the translate

−v + A = { λ_1 a_1 + . . . + λ_m a_m | m ≥ 0, a_i ∈ A, ∑_i λ_i = 0 }.

It is easy to check that the translate −v + A is a linear subspace of R^n. Since A = v + (−v + A), it follows that A is a translate of the linear subspace −v + A. Moreover, if v, w ∈ A, then the above representation of a translate shows that the linear subspaces −v + A and −w + A are equal. It follows that the affine subspace A is a translate of a unique linear subspace of R^n. ⊓⊔

The dimension of an affine subspace in Rn is defined as the dimension of the linear subspace of Rn

corresponding to it as in Prop. 3.9.

Proposition 3.10. The translate of a polytope is a polytope.

Proof. Let P be a polytope in R^n and let v ∈ R^n. By definition, there is a subset S = {s_1, . . . , s_m} of R^n such that P = conv(S). We claim that v + conv(S) = conv(v + S). Indeed, let w ∈ P. Write w = ∑_i λ_i s_i, where λ_i ≥ 0, 1 ≤ i ≤ m, and ∑_i λ_i = 1. Then

v + ∑_i λ_i s_i = ∑_i λ_i (v + s_i).

The left-hand side is a point in v + conv(S) and the right-hand side is a point in conv(v + S). This proves the claim. Thus the translate v + P is the convex hull of the set v + S and hence v + P is a polytope. ⊓⊔

If a set is not itself an affine subspace of R^n, its affine hull is the smallest affine subspace containing it. The affine hull of a set S in R^n is denoted by aff(S).

Proposition 3.11. If S is a subset of R^n, then

aff(S) = { λ_1 s_1 + . . . + λ_m s_m | m ≥ 0, s_i ∈ S, λ_i ∈ R, ∑_i λ_i = 1 }.  (3.11)

Moreover,

• for each subset S of R^n, aff(aff(S)) = aff(S);
• if S_1 and S_2 are subsets of R^n such that S_1 ⊆ S_2, then aff(S_1) ⊆ aff(S_2).

Proof. By definition, the set aff(S) is an affine subspace of R^n. Moreover, for each s ∈ S, s = 1 · s ∈ aff(S) and thus aff(S) contains the set S.

Let A be an affine subspace of R^n containing S. We show that aff(S) lies in A by using induction on the size (i.e., number of terms) of the linear combinations. Each element of S lies in A. Let s_1, . . . , s_{m+1} be elements of S. Consider the affine combination

s = λ_1 s_1 + . . . + λ_m s_m + λ_{m+1} s_{m+1},

where ∑_i λ_i = 1. If λ_{m+1} = 1, then s = s_{m+1} ∈ A, and if λ_{m+1} = 0, then by induction we have s ∈ A. Otherwise, we have

s = (1 − λ_{m+1}) ∑_{i=1}^{m} (λ_i / (1 − λ_{m+1})) s_i + λ_{m+1} s_{m+1}

and

∑_{i=1}^{m} λ_i / (1 − λ_{m+1}) = 1.

Thus, by induction, the affine combination

s′ = ∑_{i=1}^{m} (λ_i / (1 − λ_{m+1})) s_i

belongs to A. Since A is an affine subspace and contains S, it follows that the element s = (1 − λ_{m+1}) s′ + λ_{m+1} s_{m+1} belongs to A, as required. The remaining assertions are left to the reader. ⊓⊔

For instance, the affine hull of a point is a point, the affine hull of a line segment is a line, the affine hull of a convex polygon is a plane, and the affine hull of a convex polyhedron is the Euclidean 3-space.

The dimension of a polytope in R^n is defined as the dimension of its affine hull. This is the smallest affine space containing the polytope. For instance, a point has dimension 0, a line segment has dimension 1, a convex polygon has dimension 2, and a convex polyhedron has dimension 3.

Proposition 3.12. Each n-simplex in Rn has dimension n.

Proof. Let C be an n-simplex in R^n given by the points m_1, . . . , m_{n+1}. The translate −m_1 + aff(C) contains the vectors m_2 − m_1, . . . , m_{n+1} − m_1. These vectors form a basis of R^n and thus the affine hull has dimension at least n. But aff(C) is an affine subspace of R^n and so has dimension at most n. ⊓⊔

The unbounded counterparts of polytopes are cones. A cone in R^n is a subset C of R^n with the property that if m ≥ 0 and s_1, . . . , s_m ∈ C, then λ_1 s_1 + . . . + λ_m s_m ∈ C whenever λ_i ≥ 0, 1 ≤ i ≤ m.

If a set is not itself a cone, its positive hull is the smallest cone containing it. The positive hull of a set S in R^n is denoted by pos(S).

Proposition 3.13. If S is a subset of Rn, then the positive hull of S is

pos(S) = {λ1s1 + . . .+ λmsm | m ≥ 0, si ∈ S, λi ≥ 0}. (3.12)

Proof. Let u, v ∈ S. By definition, the set pos(S) is a cone in Rn. Moreover, for each s ∈ S, s = 1 · s ∈pos(S) and thus pos(S) contains the set S.

Finally, let C be a cone in R^n containing S. We show that pos(S) lies in C by using induction on the size (i.e., number of terms) of the linear combinations. Each element of S lies in C. Let s_1, . . . , s_{m+1} be elements of S. Consider the linear combination

s = λ_1 s_1 + . . . + λ_m s_m + λ_{m+1} s_{m+1},

where λ_i ≥ 0, 1 ≤ i ≤ m + 1. If λ_{m+1} = 0, then by induction we have s ∈ C. Otherwise, we have

s = λ_{m+1} ∑_{i=1}^{m} (λ_i / λ_{m+1}) s_i + λ_{m+1} s_{m+1}

and λ_i / λ_{m+1} ≥ 0, 1 ≤ i ≤ m. Thus, by induction, the linear combination

s′ = ∑_{i=1}^{m} (λ_i / λ_{m+1}) s_i

belongs to C. Since C is a cone and contains S, it follows that the element s = λ_{m+1} s′ + λ_{m+1} s_{m+1} belongs to C, as required. ⊓⊔


For instance, quadrants in R2 are cones, octants in R3 are cones, and half-spaces are cones.

Proposition 3.14. The positive hull of a set is convex.

Proof. By definition, convex combinations are nonnegative linear combinations and so the positive hull of a set is convex. ⊓⊔

3.4 Geometry of Polytopes

The geometric structure of polytopes will be studied in more detail.

An affine hyperplane is an affine subspace of codimension 1 in the Euclidean space R^n. In Cartesian coordinates, an affine hyperplane is given by a single linear equation (with not all w_i equal to 0)

v_1 w_1 + . . . + v_n w_n = α.

More specifically, an affine hyperplane in R^n is defined as

H_{w,α} = { v ∈ R^n | ⟨v, w⟩ = α },  (3.13)

where w ≠ 0 is a vector in R^n and α is a real number. Note that two affine hyperplanes H_{w,α} and H_{w,β} for different values α and β are parallel to each other. Moreover, the sets

H^+_{w,α} = { v | ⟨v, w⟩ ≥ α }   and   H^−_{w,α} = { v | ⟨v, w⟩ ≤ α }

are called the half-spaces bounded by the affine hyperplane.

For instance, a point is an affine hyperplane in Euclidean 1-space, a line is an affine hyperplane in Euclidean 2-space (Fig. 3.4), and a plane is an affine hyperplane in Euclidean 3-space.

Fig. 3.4. Affine hyperplane 2x + y = 6 in Euclidean 2-space.


Proposition 3.15. An affine hyperplane in Rn is an affine subspace of Rn with dimension n− 1.

Proof. Let H = { v | ⟨v, w⟩ = α } be an affine hyperplane in R^n. Consider the linear mapping φ : R^n → R : v ↦ ⟨v, w⟩ given by the scalar product with fixed w. The kernel of this mapping is the linear subspace U = { u | ⟨u, w⟩ = 0 } and the image is R, since by definition w ≠ 0. The dimension formula gives dim R^n = dim ker φ + dim im φ = dim U + dim R, and so dim U = n − 1. But the affine hyperplane H is a translate of the linear subspace U given by v + U, where v ∈ H. It follows that the hyperplane H also has dimension n − 1. ⊓⊔

Let P be a polytope in R^n, let w ≠ 0 be a vector in R^n, and let α be a real number. Define the number

ρ_P(w) = min{ ⟨v, w⟩ | v ∈ P }.  (3.14)

The number ρ_P(w) always exists, since the linear map P → R : v ↦ ⟨v, w⟩ given by the scalar product with fixed w is continuous and P is closed and bounded. Thus the minimum on P is attained. The corresponding affine hyperplane

H_{P,w} = { v | ⟨v, w⟩ = ρ_P(w) }  (3.15)

is called a supporting hyperplane of P, and we call w the outward pointing normal (Fig. 3.5).

Fig. 3.5. A polytope P given as a square, two supporting hyperplanes, and associated outward pointing normals.

Proposition 3.16. If H_{P,w} is a supporting hyperplane of a polytope P in R^n, then the intersection P_w = P ∩ H_{P,w} is a non-empty polytope and P is contained in the half-space H^+_{P,w}.

Proof. The set P_w is non-empty, since the number ρ_P(w) always exists.

Let u, v ∈ P_w. Then u, v ∈ P and ⟨u, w⟩ = ⟨v, w⟩ = ρ_P(w). Since P is a polytope, for each λ with 0 ≤ λ ≤ 1, λu + (1 − λ)v ∈ P. Moreover, ⟨λu + (1 − λ)v, w⟩ = λ⟨u, w⟩ + (1 − λ)⟨v, w⟩ = ρ_P(w) and so λu + (1 − λ)v ∈ H_{P,w}. Thus λu + (1 − λ)v ∈ P_w and hence P_w is convex.

Assume that P is the convex hull of the points a_1, . . . , a_m in R^n. Suppose that the points a_1, . . . , a_l lie on the hyperplane H_{P,w}, while the points a_{l+1}, . . . , a_m do not. Equivalently, there are positive real numbers β_{l+1}, . . . , β_m such that

⟨a_i, w⟩ = ρ_P(w) for 1 ≤ i ≤ l,   and   ⟨a_i, w⟩ = ρ_P(w) + β_i for l + 1 ≤ i ≤ m.

Let v be a point in P. By definition, the point v is a convex combination of the points a_1, . . . , a_m,

v = λ_1 a_1 + . . . + λ_m a_m,

where λ_i ≥ 0, 1 ≤ i ≤ m, and ∑_i λ_i = 1. It follows that

⟨v, w⟩ = ∑_{i=1}^{m} λ_i ⟨a_i, w⟩ = ∑_{i=1}^{m} λ_i ρ_P(w) + ∑_{i=l+1}^{m} λ_i β_i = ρ_P(w) + ∑_{i=l+1}^{m} λ_i β_i.

Therefore, v ∈ H_{P,w} if and only if λ_{l+1} β_{l+1} + . . . + λ_m β_m = 0. By hypothesis, this is equivalent to λ_{l+1} = . . . = λ_m = 0. Equivalently, the point v is a convex combination of the points a_1, . . . , a_l. It follows that P_w = P ∩ H_{P,w} = conv({a_1, . . . , a_l}) is a polytope.

Finally, by definition, P belongs to the half-space H^+_{P,w}. ⊓⊔

We call the non-empty polytope Pw the face of P determined by w. That is, the face Pw is the set ofall points in P at which the linear map P → R : v → 〈v, w〉 given by the scalar product with fixed wattains its minimum.

Proposition 3.17. Each polytope has only finitely many faces.

Proof. Let P be a polytope in R^n. By definition, there is a finite set A = {a_1, . . . , a_m} such that P = conv(A). The proof of Prop. 3.16 shows that for each supporting hyperplane H of P, the corresponding face P ∩ H is the convex hull of a subset of A. But there are only finitely many subsets of a finite set and so the result follows. ⊓⊔

Proposition 3.18. Each face of a polytope P in Rn has dimension less than dimP .

Proof. We may assume that P is a polytope in R^n with dimension n. For each supporting hyperplane H of P, the affine hull of the face P ∩ H is contained in the affine space aff(H) = H. But, by Prop. 3.15, the affine subspace H has dimension n − 1 and so, by Prop. 3.11, the face P ∩ H has dimension at most n − 1. ⊓⊔

Since each face is a polytope, it can be assigned a dimension. A k-face of a polytope P is a face ofP with dimension k. A 0-face of P is called a vertex of P and a 1-face of P is called an edge of P . IfP has dimension n, then a (n − 2)-face of P is called a ridge of P and a (n − 1)-face of P is called afacet of P . We write fk(P ) for the number of k-faces of a polytope P in Rn, 0 ≤ k ≤ n − 1. If P hasdimension n, the vector f(P ) = (f0(P ), f1(P ), . . . , fn−1(P )) is termed the f-vector of P . For instance,a tetrahedron (Fig. 3.3) has 4 facets, 6 edges, and 4 vertices and so its f-vector is (4, 6, 4).

Lemma 3.19. The vertices of a polytope P are precisely the points in P that cannot be written as convex combinations of other points in P.

Proof. Let v ∈ P be a vertex given by the supporting hyperplane H_{P,w} of P. Write the point v as a convex combination v = ∑_i λ_i a_i of points a_i ∈ P, where λ_i ≥ 0 and ∑_i λ_i = 1. Then, by definition,

ρ_P(w) = ⟨v, w⟩ = ∑_i λ_i ⟨a_i, w⟩ ≥ ∑_i λ_i ρ_P(w) = ρ_P(w).

Thus ⟨a_i, w⟩ = ρ_P(w) for each index i with λ_i > 0; that is, a_i ∈ P ∩ H_{P,w} for each index i with λ_i > 0. But v is a vertex of P and so v = a_i for each index i with λ_i > 0.

Conversely, suppose v ∈ P is not a vertex. Then v lies in a k-face P_w = H ∩ P for some k ≥ 1 and supporting hyperplane H = H_{P,w}. The polytope P_w has a finite set A = {a_1, . . . , a_m} such that P_w = conv(A). Write the point v as a convex combination v = λ_1 a_1 + . . . + λ_m a_m. If v = a_i for each index i with λ_i > 0, then v would be a vertex. Thus, by hypothesis, the point v can be written as a nontrivial convex combination of points in P_w. ⊓⊔

Proposition 3.20. Each polytope is the convex hull of its vertices.

Proof. Let P be a polytope in R^n. By definition, there is a finite set A = {a_1, . . . , a_m} such that P = conv(A). The proof of Prop. 3.16 shows that each vertex of P is of the form {a}, where a ∈ A. Suppose P_i = {a_i}, 1 ≤ i ≤ l, are the vertices of P. Then, by Prop. 3.7, conv(P_1 ∪ . . . ∪ P_l) = conv({a_1, . . . , a_l}) ⊆ conv(P) = P.

Conversely, we may successively eliminate points from A that can be written as convex combinations of other points in A. For instance, assume that a_m = ∑_{i=1}^{m−1} λ_i a_i, where λ_i ≥ 0, 1 ≤ i ≤ m − 1, and ∑_i λ_i = 1. Then for each point v ∈ P,

v = ∑_{i=1}^{m} µ_i a_i = ∑_{i=1}^{m−1} (µ_i + λ_i µ_m) a_i,

where µ_i ≥ 0, 1 ≤ i ≤ m, and ∑_i µ_i = 1. But µ_i + λ_i µ_m ≥ 0, 1 ≤ i ≤ m − 1, and ∑_{i=1}^{m−1} (µ_i + λ_i µ_m) = 1, and so the point v can be written as a convex combination of the points a_1, . . . , a_{m−1}. It follows that conv({a_1, . . . , a_m}) = conv({a_1, . . . , a_{m−1}}).

Assume that all points that can be written as convex combinations of other points in A have been eliminated from A. We claim that the remaining points in A are vertices of P. Suppose the point a_m can be written as a convex combination of points v_1, . . . , v_k in P; that is, a_m = ∑_{j=1}^{k} µ_j v_j, where µ_j ≥ 0, 1 ≤ j ≤ k, and ∑_j µ_j = 1. Write v_j = ∑_{i=1}^{m} λ_{ji} a_i, where λ_{ji} ≥ 0, 1 ≤ i ≤ m, and ∑_i λ_{ji} = 1. Then

a_m = ∑_{i=1}^{m} ∑_{j=1}^{k} µ_j λ_{ji} a_i

and so

(1 − ∑_{j=1}^{k} µ_j λ_{jm}) a_m = ∑_{i=1}^{m−1} ∑_{j=1}^{k} µ_j λ_{ji} a_i.

But by hypothesis, the point a_m cannot be written as a convex combination of the points a_1, . . . , a_{m−1}, and so ∑_{j=1}^{k} µ_j λ_{jm} = 1. Thus λ_{jm} = 1 for each index j with µ_j > 0; that is, v_j = a_m for each index j with µ_j > 0. By Lemma 3.19, the points in A are the vertices of P. ⊓⊔

Example 3.21. Consider the set of points A = {(0, 0), (1, 1), (2, 0), (0, 3)} in R^2. The corresponding lattice polytope conv(A) is the triangle with the vertices (0, 0), (2, 0), and (0, 3). The point (1, 1) is a convex combination of the triangle's vertices,

(1, 1) = (1/6)(0, 0) + (1/2)(2, 0) + (1/3)(0, 3).

⊓⊔

A polyhedral set in R^n is the intersection of a finite number of half-spaces in R^n.

Proposition 3.22. A bounded polyhedral set in Rn is a polytope in Rn, and vice versa.

Proof. Let P be a polyhedral set in R^n given by the intersection of m ≥ 1 half-spaces. Then it is easy to check that P = { v ∈ R^n | Av ≥ b } for some matrix A ∈ R^{m×n} and vector b ∈ R^m. The set P is convex and, since P is bounded, it is a polytope in R^n.

Conversely, let P be an n-dimensional polytope in R^n with facets F_1, . . . , F_l having outward pointing normals w_1, . . . , w_l, respectively. Then it is easy to check that

P = { v ∈ R^n | ⟨v, w_j⟩ ≥ ρ_P(w_j), 1 ≤ j ≤ l }.

Hence, P is a bounded polyhedral set. ⊓⊔

Example 3.23. The square P = conv({(0, 0), (0, 1), (1, 0), (1, 1)}) in R^2 has four facets that are given by the inequalities

〈v, w1〉 ≥ 0, 〈v,−w2〉 ≥ −1, 〈v, w3〉 ≥ 0, 〈v,−w4〉 ≥ −1,

where e_1 = w_1 = w_2 and e_2 = w_3 = w_4 are the unit vectors (Fig. 3.5). ♦

It follows that each convex polytope can be represented either as the set of convex combinations of a finite number of points (vertices), or as an intersection of a finite number of half-spaces. These representations are referred to as V-polytopes and H-polytopes, respectively. Both representations are useful in their own respect. For instance, the representation as V-polytopes is preferable if one wants to show that every projection of a polytope is a polytope. On the other hand, the representation as H-polytopes is preferable if one has to prove that every intersection of a polytope with an affine subspace is a polytope.

Example 3.24. Consider the polytope in Ex. 3.8. The vertices of this polytope are produced by thepolymake command

> polymake dude VERTICES

VERTICES

1 0 8

1 0 5

1 1 3

1 3 2

The system also provides the vertex normals (i.e., the i-th row is the normal vector of a hyperplaneseparating the i-th vertex from the remaining ones),

VERTEX_NORMALS

0 1/3 1/3

0 -3 -1

0 1 -1

0 1 0


Furthermore, the updated file dude yields the representation of the polytope as a polyhedral set,

FACETS

-5 2 1

0 1 0

8 -2 -1

-7 1 2

AFFINE HULL

This output tells us that the polytope is defined by four linear inequalities

−5 + 2x1 + x2 ≥ 0,

x1 ≥ 0,

8− 2x1 − x2 ≥ 0,

−7 + x1 + 2x2 ≥ 0,

while there is no affine hull contribution which would provide additional linear equalities.

The command DIM confirms that the polytope is two-dimensional,

> polymake dude DIM

DIM

2

The f-vector of our polytope is

> polymake dude F_VECTOR

F_VECTOR

4 4

Inspecting the updated file dude again shows which vertices lie on each of the four facets of the polytope,

VERTICES_IN_FACETS

{1 2}

{0 1}

{0 3}

{2 3}
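The vertices found by polymake can also be recovered with general-purpose software. The following Python sketch (an illustration only; it assumes scipy is available and is not part of the book's polymake workflow) computes the convex hull of the eleven lattice points of Ex. 3.8:

import numpy as np
from scipy.spatial import ConvexHull

points = np.array([(0, 8), (0, 7), (0, 6), (0, 5), (1, 6), (1, 5),
                   (1, 4), (1, 3), (2, 4), (2, 3), (3, 2)], dtype=float)
hull = ConvexHull(points)
print(points[hull.vertices])   # the vertices (0,8), (0,5), (1,3), (3,2), in some cyclic order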

A fan in Rn is a family F = {C1, C2, . . . , Cm} of nonempty cones with the following properties:

• Each non-empty face of a cone in F is also a cone in F.
• The intersection of any two cones in F is a face of both.

A fan F in Rn is complete if the union⋃F = C1 ∪ . . .∪Cm equals Rn. A fan F in Rn is pointed if {0}

is a cone in F and therefore a face of each cone in F .

Example 3.25. The pointed fan in Fig. 3.6 in R2 has m = 11 cones, of which 5 are full dimensional.♦


Fig. 3.6. A fan in R2.

Let P be a polytope in Rn and let F be a face of P . The normal cone of P at F is defined as

NP (F ) = {w ∈ Rn | F = P ∩HP,w}. (3.16)

That is, NP (F ) consists of all vectors w ∈ Rn with the property that F is the set of all points at whichthe linear map P → R : x 7→ 〈x,w〉 given by the scalar product with fixed w attains the minimum.In particular, if F = {v} is a vertex of P , then its normal cone NP (v) consists of all linear mapsP → R : x 7→ 〈x,w〉 that attain the minimum at the point v.

Example 3.26. Linear programming is a method to minimize or maximize a linear function over a convex set. The canonical form of a linear program in Euclidean n-space is

min c^T x
s.t. Ax ≥ b
and x ≥ 0,

where A ∈ R^{m×n}, b ∈ R^m, and c ∈ R^n are given and x is the vector of variables. The objective function R^n → R : x ↦ ⟨c, x⟩ has to be minimized over the convex set P = { x ∈ R^n | Ax ≥ b, x ≥ 0 }. Suppose the minimum is attained at the face F of P. Then the normal cone N_P(F) consists of all vectors c which attain the minimum at F. This amounts to an inverse problem of linear programming. ♦
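A linear program in this canonical form can be solved numerically, for instance with scipy's linprog. The sketch below is only an illustration with hypothetical data A, b, c (not taken from the text); since linprog expects constraints of the form A_ub x ≤ b_ub with default bounds x ≥ 0, the constraint Ax ≥ b is passed as −Ax ≤ −b.

import numpy as np
from scipy.optimize import linprog

c = np.array([1.0, 2.0])                 # hypothetical objective vector
A = np.array([[1.0, 1.0], [2.0, 1.0]])   # hypothetical constraint matrix, Ax >= b
b = np.array([4.0, 5.0])

res = linprog(c, A_ub=-A, b_ub=-b)       # default bounds give x >= 0
print(res.x, res.fun)                    # an optimal point; it lies on a face of P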

Proposition 3.27. Let P be a polytope in R^n and let F be a face of P. The normal cone N_P(F) is a cone in R^n with dimension dim N_P(F) = n − dim F.

Proof. For each w ∈ R^n, let f_w : P → R : x ↦ ⟨x, w⟩ be the scalar product with fixed w. Let v, w ∈ R^n be such that the linear mappings f_v and f_w attain their minimum at F, and let λ ≥ 0 and µ ≥ 0. Then the linear mapping f_{λv+µw} attains its minimum at F and so λv + µw belongs to N_P(F).

Let F be a k-face. Then the face F is determined by n − k linearly independent linear equations and the cone N_P(F) is determined by k linearly independent linear equations. Hence, the dimension formula follows. ⊓⊔

Example 3.28. Consider the square P in R^2 as given in Fig. 3.5. We may assume that it has a facet F = {(x, 0) | 0 ≤ x ≤ p} for some real number p > 0. The affine subspace U = aff(F) is thus the real line given by the x-axis. Take the linear subspace W of R^2 orthogonal to U; that is, U ⊕ W = R^2 as linear spaces. Then W is the real line given by the y-axis. For each w = (0, t) ∈ W with t ≥ 0, we have ρ_P(w) = 0, attained exactly on F, and so N_P(F) = {(0, t) | t ≥ 0} is a half-line in W. Therefore, dim N_P(F) = dim W = dim R^2 − dim U = 1, as required. ♦


The collection of all non-empty normal cones N_P(F), as F runs over all faces of P, is called the normal fan of P and is denoted by N(P).

Proposition 3.29. Let P be a polytope of Rn. The normal fan N (P ) is a complete fan of Rn.

Proof. Let w ∈ R^n. If we put F = P ∩ H_{P,w}, then w ∈ N_P(F). It follows that the normal cones are non-empty and their union is the Euclidean n-space. ⊓⊔

Example 3.30. Consider the triangle P in R^2 as given in Fig. 3.7. The normal cone of each vertex v is a two-dimensional cone N_P(v) and the normal cone of an edge e is a half-line N_P(e). The normal fan consists of seven cones, of which three are full-dimensional (Fig. 3.8). ♦

Fig. 3.7. A triangle and its normal cones.

Fig. 3.8. The normal fan of the triangle in Fig. 3.7.

3.5 Polytope Algebra

We introduce important constructions of new polytopes from given ones. For this, the vector space structure of the Euclidean n-space is used.


Let P and Q be polytopes in Rn. The Minkowski sum of P and Q is given as

P +Q = {p+ q | p ∈ P, q ∈ Q}, (3.17)

where p + q denotes the addition in Rn. Moreover, let λ ≥ 0 be a real number. The polytope λP isdefined by

λP = {λp | p ∈ P}, (3.18)

where λp denotes the λ-multiple of p ∈ P .

Example 3.31. The Minkowski sum of the two polytopes P and Q in Fig. 3.9 can be obtained byplacing a copy of P at every point of Q. This works because P contains the origin. ♦

Fig. 3.9. Two polytopes and their Minkowski sum.

Proposition 3.32. If P and Q are polytopes in Rn, then the Minkowski sum P +Q is also a polytopein Rn.

Proof. Let p, p′ ∈ P and q, q′ ∈ Q. For each λ, 0 ≤ λ ≤ 1, we have

λ(p + q) + (1 − λ)(p′ + q′) = [λp + (1 − λ)p′] + [λq + (1 − λ)q′] ∈ P + Q.

Thus the Minkowski sum P + Q is convex.

Let P = conv(A) for some finite subset A = {a_1, . . . , a_m} of R^n, and let p ∈ P and q ∈ Q. Write p = ∑_i λ_i a_i, where λ_i ≥ 0, 1 ≤ i ≤ m, and ∑_i λ_i = 1. Then

p + q = (∑_i λ_i a_i) + q = ∑_i λ_i (a_i + q) ∈ conv(⋃_i (a_i + Q)).

Conversely, by Prop. 3.7 and the fact that P + Q is convex, we obtain

conv(⋃_i (a_i + Q)) ⊆ conv(P + Q) = P + Q.

It follows that

P + Q = conv(⋃_i (a_i + Q)).

Let Q = conv(B) for some finite subset B of R^n. Then by Prop. 3.10, a_i + Q = conv(a_i + B), 1 ≤ i ≤ m, and by Prop. 3.7,

P + Q = conv(⋃_i conv(a_i + B)) = conv(⋃_i (a_i + B)).

Thus P + Q has a finite generating set and hence is a polytope. ⊓⊔

Proposition 3.33. Let P and Q be polytopes in R^n, and let w ≠ 0 be a vector in R^n. We have

ρP+Q(w) = ρP (w) + ρQ(w) and (P +Q)w = Pw +Qw.

Proof. First, we have

ρ_{P+Q}(w) = min{ ⟨v, w⟩ | v ∈ P + Q } = min{ ⟨p, w⟩ + ⟨q, w⟩ | p ∈ P, q ∈ Q }
           = min{ ⟨p, w⟩ | p ∈ P } + min{ ⟨q, w⟩ | q ∈ Q }
           = ρ_P(w) + ρ_Q(w).

Second, by the first assertion,

(P + Q)_w = (P + Q) ∩ H_{P+Q,w} = { p + q | ⟨p + q, w⟩ = ρ_{P+Q}(w), p ∈ P, q ∈ Q }
          = { p + q | ⟨p, w⟩ + ⟨q, w⟩ = ρ_P(w) + ρ_Q(w), p ∈ P, q ∈ Q }
          = { p | ⟨p, w⟩ = ρ_P(w), p ∈ P } + { q | ⟨q, w⟩ = ρ_Q(w), q ∈ Q }
          = (P ∩ H_{P,w}) + (Q ∩ H_{Q,w}) = P_w + Q_w.  ⊓⊔

The polytope algebra on Rn is a triple (Pn,⊕,⊙) that consists of the set of all polytopes in Rn,denoted by Pn, and two arithmetic operations ⊕ and ⊙, called addition and multiplication, defined as

P ⊕Q = conv(P ∪Q) and P ⊙Q = P +Q. (3.19)

By Prop. 3.32, the multiplication is well-defined.

Proposition 3.34. If P = conv(A) and Q = conv(B) are polytopes in Rn, then

P ⊕Q = conv(A ∪B).


Proof. By Prop. 3.7, we have conv(A ∪ B) ⊆ conv(P ∪ Q) = P ⊕ Q. Conversely, let v ∈ P ⊕ Q. Write the point v as a convex combination of points in P ∪ Q; that is, v = ∑_i λ_i p_i + ∑_j µ_j q_j for some points p_i ∈ P and q_j ∈ Q. But by definition, the points p_i and q_j are convex combinations of the points in the generating sets A and B, respectively. It follows that v ∈ conv(A ∪ B). ⊓⊔

Example 3.35. Consider the non-collinear line segments P = {(x, 0) | 0 ≤ x ≤ p} and Q = {(0, y) |0 ≤ y ≤ q} in R2. Their sum and product are illustrated in Fig. 3.10. ♦

Fig. 3.10. Two line segments P and Q in R2 and their sum P ⊕ Q and product P ⊙ Q.

Proposition 3.36. The polytope algebra (Pn,⊕,⊙) on Rn is a commutative, idempotent semiring.

Proof. It is easy to see that (Pn,⊕) is a commutative monoid with identity element ∅ and (Pn,⊙) is acommutative monoid with identity element {0}. Moreover, by Prop. 3.7, the addition is idempotent andmultiplication with the empty set annihilates Pn. To see that the distributive law holds, take p ∈ P ,q ∈ Q, and r ∈ R. Then for each λ with 0 ≤ λ ≤ 1, we have

p+ (λq + (1− λ)r) = λ(p+ q) + (1− λ)(p+ r).

The left-hand side is a point in P ⊙ (Q⊕R) and the right-hand side is a point in (P ⊙Q)⊕ (P ⊙R). ⊓⊔

Example 3.37. Consider the polytope algebra P1 on the Euclidean 1-space. The elements of P1 areexactly the line segments [a, b] = {λa+ (1− λ)b | 0 ≤ λ ≤ 1}, where a, b ∈ R. The sum and product oftwo line segments [a, b] and [c, d] are given by

[a, b]⊕ [c, d] = [min{a, c},max{b, d}] and [a, b]⊙ [c, d] = [a+ c, b+ d].
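The one-dimensional polytope algebra is simple enough to be coded directly. A minimal Python sketch of the two operations of Example 3.37 (an aside; segments are represented as pairs (a, b) with a ≤ b):

def seg_add(s, t):    # [a,b] ⊕ [c,d] = [min{a,c}, max{b,d}]
    return (min(s[0], t[0]), max(s[1], t[1]))

def seg_mul(s, t):    # [a,b] ⊙ [c,d] = [a+c, b+d], the Minkowski sum
    return (s[0] + t[0], s[1] + t[1])

print(seg_add((1, 3), (2, 5)))   # (1, 5)
print(seg_mul((1, 3), (2, 5)))   # (3, 8)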

Proposition 3.38. The mapping f : P1 → R ∪ {∞} : [a, b] 7→ a is an epimorphism from the polytopealgebra P1 onto the tropical algebra.


Proof. The mapping is well-defined and we have f(∅) = ∞ and f({0}) = f([0, 0]) = 0. For any two line segments [a, b] and [c, d] in P_1,

f([a, b] ⊕ [c, d]) = f([min{a, c}, max{b, d}]) = min{a, c} = a ⊕ c = f([a, b]) ⊕ f([c, d])

and

f([a, b] ⊙ [c, d]) = f([a + c, b + d]) = a + c = a ⊙ c = f([a, b]) ⊙ f([c, d]).

⊓⊔

In this way, the polytope algebra on Rn can be viewed as a natural higher-dimensional generalizationof the tropical algebra.

3.6 Newton Polytopes

We establish an interesting connection between lattice polytopes and polynomials. For this, take a polynomial f in the polynomial ring K[X_1, . . . , X_n] and write

f = ∑_{α ∈ N_0^n} c_α X^α.

The Newton polytope of f, denoted as NP(f), is the lattice polytope

NP(f) = conv({ α ∈ N_0^n | c_α ≠ 0 }).

That is, the Newton polytope is generated by the exponents of the monomials involved in the polyno-mial. It is a measure of shape or sparsity of a polynomial. Note that the actual values of the coefficientsdo not matter in the definition of the Newton polytope.

Example 3.39. Any polynomial in K[X,Y ] of the form

f = aXY + bX2 + cY 3 + d,

where a, b, c, d are nonzero elements of K, has the Newton polytope equal to the triangle

P = conv({(1, 1), (2, 0), (0, 3), (0, 0)}).

By Ex. 3.21, polynomials of the above form with a = 0 have the same Newton polytope. ♦

We can also go the other way, from exponents to polynomials. Suppose we have a finite set ofexponents A = {α1, . . . , αl} in Nn0 . Then let L(A) be the set of all polynomials whose terms all havethe exponents in A,

L(A) = {c1Xα1 + · · ·+ clXαl | c1, . . . , cl ∈ K}.

Note that L(A) is a vector space over K with basis {Xα1 , . . . , Xαl} and dimension l. The followingresult is immediate from the definitions.

Proposition 3.40. Let A be a finite subset of Nn0 . For each polynomial f in L(A), we have NP(f) ⊆conv(A).


The formation of Minkowski sums is compatible with polynomial multiplication. To see this, let w ≠ 0 be a vector in R^n and f = ∑_α c_α X^α be a polynomial in K[X_1, . . . , X_n]. Define the number

π_f(w) = min{ ⟨α, w⟩ | c_α ≠ 0 }.

This number exists, since each polynomial has only a finite number of terms. The initial form of f with respect to w is the subsum in_w(f) of all terms c_α X^α, c_α ≠ 0, such that ⟨α, w⟩ is minimal. That is,

in_w(f) = ∑ { c_α X^α | c_α ≠ 0, ⟨α, w⟩ = π_f(w) }.

Example 3.41. Any polynomial in R[X,Y ] of the form

f = aXY + bX2 + cY 3 + d,

where a, b, c, d are nonzero real numbers, has with respect to w = (−2, 2) the initial form

inw(f) = bX2.
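Computing an initial form only requires the exponent vectors and the weight vector w. The following Python sketch (illustrative only; coefficients are kept symbolic as strings) reproduces Example 3.41:

def initial_form(coeffs, w):
    # coeffs: dict mapping exponent tuples alpha to nonzero coefficients c_alpha
    weight = lambda alpha: sum(a * wi for a, wi in zip(alpha, w))
    pi = min(weight(alpha) for alpha in coeffs)           # pi_f(w)
    return {alpha: c for alpha, c in coeffs.items() if weight(alpha) == pi}

f = {(1, 1): 'a', (2, 0): 'b', (0, 3): 'c', (0, 0): 'd'}  # a*XY + b*X^2 + c*Y^3 + d
print(initial_form(f, (-2, 2)))                           # {(2, 0): 'b'}, i.e. in_w(f) = b*X^2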

Proposition 3.42. Let f and g be polynomials in K[X_1, . . . , X_n], and let w ≠ 0 be a vector in R^n. We have

in_w(f · g) = in_w(f) · in_w(g),
π_f(w) = ρ_{NP(f)}(w),
NP(in_w(f)) = NP(f)_w.

Proof. Let f = ∑_α c_α X^α and g = ∑_β d_β X^β be polynomials in K[X_1, . . . , X_n]. First, we have

π_{f·g}(w) = min{ ⟨α + β, w⟩ | c_α ≠ 0, d_β ≠ 0 }
           = min{ ⟨α, w⟩ | c_α ≠ 0 } + min{ ⟨β, w⟩ | d_β ≠ 0 }
           = π_f(w) + π_g(w).

Then we obtain

in_w(f · g) = ∑ { c_α d_β X^{α+β} | c_α ≠ 0, d_β ≠ 0, ⟨α + β, w⟩ = π_{f·g}(w) }
            = ∑ { c_α X^α | c_α ≠ 0, ⟨α, w⟩ = π_f(w) } · ∑ { d_β X^β | d_β ≠ 0, ⟨β, w⟩ = π_g(w) }
            = in_w(f) · in_w(g).

Second, we have π_f(w) ≥ ρ_{NP(f)}(w) by definition. On the other hand, the Newton polytope NP(f) is the convex hull of the set A = { α | c_α ≠ 0 }. By the proof of Prop. 3.16, the vertices of NP(f) belong to this set. Let a_1, . . . , a_m ∈ A be the vertices of NP(f). We may assume that ⟨a_1, w⟩ ≤ ⟨a_i, w⟩ for each i, 1 ≤ i ≤ m. Let v ∈ NP(f) be such that ρ_{NP(f)}(w) = ⟨v, w⟩. By Prop. 3.20, v = ∑_i λ_i a_i, where λ_i ≥ 0, 1 ≤ i ≤ m, and ∑_i λ_i = 1. Then ⟨v, w⟩ = ∑_i λ_i ⟨a_i, w⟩ ≥ ∑_i λ_i ⟨a_1, w⟩ = ⟨a_1, w⟩. It follows that ρ_{NP(f)}(w) ≥ π_f(w).

Third, by using the last assertion, we obtain

NP(f)_w = NP(f) ∩ H_{NP(f),w}
        = conv{ α | c_α ≠ 0 } ∩ { v ∈ NP(f) | ⟨v, w⟩ = ρ_{NP(f)}(w) }
        = conv{ α | c_α ≠ 0 } ∩ { v ∈ NP(f) | ⟨v, w⟩ = π_f(w) }
        = conv{ α | c_α ≠ 0, ⟨α, w⟩ = π_f(w) }
        = NP(in_w(f)).  ⊓⊔

Lemma 3.43. If f and g are polynomials in K[X_1, . . . , X_n] and w ≠ 0 is a vector in R^n, then

NP(in_w(f) · in_w(g)) = NP(f)_w + NP(g)_w.  (3.20)

Proof. Let f = ∑_α c_α X^α and g = ∑_β d_β X^β be polynomials in K[X_1, . . . , X_n]. We have

NP(in_w(f) · in_w(g)) = conv{ α + β | c_α ≠ 0, d_β ≠ 0, ⟨α, w⟩ = π_f(w), ⟨β, w⟩ = π_g(w) }
                      = conv{ α | c_α ≠ 0, ⟨α, w⟩ = π_f(w) } + conv{ β | d_β ≠ 0, ⟨β, w⟩ = π_g(w) }
                      = NP(f)_w + NP(g)_w.  ⊓⊔

Theorem 3.44. Let f and g be polynomials in K[X_1, . . . , X_n]. Then

NP(f · g) = NP(f) ⊙ NP(g).

Proof. Let w ≠ 0 be a vector in R^n. By Prop. 3.42, Lemma 3.43, and Prop. 3.33, we have

NP(f · g)_w = NP(in_w(f · g)) = NP(in_w(f) · in_w(g)) = NP(f)_w + NP(g)_w = (NP(f) + NP(g))_w.

This equality shows that the polytopes NP(f · g) and NP(f) ⊙ NP(g) have the same set of vertices. But by Prop. 3.20, each polytope is the convex hull of its vertices and so the result follows. ⊓⊔

Theorem 3.45. Let f and g be polynomials in K[X1, . . . , Xn]. Then

NP(f + g) ⊆ NP(f)⊕NP(g).

Equality holds, if all coefficients in the polynomials f and g are positive.

Proof. Let f = ∑_α c_α X^α and g = ∑_β d_β X^β be polynomials in K[X_1, . . . , X_n]. By Prop. 3.34, we have

NP(f) ⊕ NP(g) = conv(conv({ α | c_α ≠ 0 }) ∪ conv({ β | d_β ≠ 0 })) = conv({ α, β | c_α ≠ 0, d_β ≠ 0 }).

On the other hand, we have f + g = ∑_γ (c_γ + d_γ) X^γ and so

NP(f + g) = conv({ γ | c_γ + d_γ ≠ 0 }).

Let c_γ + d_γ ≠ 0. Then c_γ ≠ 0 or d_γ ≠ 0 and so γ ∈ { α, β | c_α ≠ 0, d_β ≠ 0 }. This proves the inclusion.

Finally, suppose that all coefficients in f and g are positive. Then c_γ ≠ 0 or d_γ ≠ 0 implies that c_γ + d_γ ≠ 0. Thus the other inclusion also holds and hence both polytopes are equal. ⊓⊔

Example 3.46. In R[X,Y ] consider the polynomials

f = Xp + 1 and g = Y q + 1,

where p and q are positive integers. The corresponding Newton polytopes are line segments in R^2 given as

NP(f) = conv({(0, 0), (p, 0)}) = {(x, 0) | 0 ≤ x ≤ p}

and

NP(g) = conv({(0, 0), (0, q)}) = {(0, y) | 0 ≤ y ≤ q}.

The sum f + g has the Newton polytope

NP(f + g) = NP(X^p + Y^q + 2) = conv({(p, 0), (0, q), (0, 0)}),

which is a triangle with vertices (0, 0), (p, 0), and (0, q), and the product f · g has the Newton polytope

NP(f · g) = NP(X^p Y^q + X^p + Y^q + 1) = conv({(p, q), (p, 0), (0, q), (0, 0)}),

which is a rectangle with vertices (0, 0), (p, 0), (0, q), and (p, q) (Ex. 3.35). ♦
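Theorem 3.44 can be checked on this example by forming all sums of exponent vectors and taking their convex hull. A hedged Python sketch (it assumes scipy; p and q are chosen arbitrarily for the illustration):

import numpy as np
from scipy.spatial import ConvexHull

p, q = 3, 2                                  # arbitrary choice for the illustration
expo_f = [(0, 0), (p, 0)]                    # exponents of f = X^p + 1
expo_g = [(0, 0), (0, q)]                    # exponents of g = Y^q + 1

expo_fg = [(a1 + b1, a2 + b2) for (a1, a2) in expo_f for (b1, b2) in expo_g]
hull = ConvexHull(np.array(expo_fg, dtype=float))
print(np.array(expo_fg)[hull.vertices])      # the rectangle with vertices (0,0),(p,0),(p,q),(0,q)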

3.7 Parametric Shortest Path Problem

The problem of finding shortest paths in a network can be extended by making use of the polytope algebra. For this, let G = (V, E) be a digraph with vertex set V = {1, . . . , n} and edge set E. Assume that each edge (i, j) in G has an associated polytope P_{ij} in the Euclidean d-space. We put P_{ii} = {0}, 1 ≤ i ≤ n, and P_{ij} = ∅ if (i, j), i ≠ j, is not an edge in G. We represent the digraph G by the n × n matrix D_G = (P_{ij}) of polytopes in the polytope algebra P_d.

Each vector w ∈ R^d allows us to assign scalar values to the edges (i, j) in G by linear programming on the polytope P_{ij}:

d_{ij} = d_{ij}(w) = min{ ⟨w, p⟩ | p ∈ P_{ij} },   (i, j) ∈ E.  (3.21)

In particular, we have d_{ii} = 0, 1 ≤ i ≤ n, and d_{ij} = ∞ if (i, j), i ≠ j, is not an edge in G. Thus each vector w ∈ R^d gives rise to an n × n adjacency matrix D_{G,w} = (d_{ij}) with respect to w. We show that the lengths of the shortest paths in G given by the adjacency matrix D_{G,w} with respect to w can be derived by computation in the polytope algebra P_d.


Proposition 3.47. Let G be a digraph on n vertices with n × n adjacency matrix D_G and let w ∈ R^d. The length of the shortest path from vertex i to vertex j in the digraph G is given by

d^{(n−1)}_{ij} = min{ ⟨w, p⟩ | p ∈ P^{(n−1)}_{ij} },

where (P^{(n−1)}_{ij}) is the (i, j)-th entry of the (n−1)-th power of the matrix D_G computed in the polytope algebra P_d.

Proof. The proof of Prop. 3.4 shows that the lengths of the shortest paths satisfy the recursion formula

d^{(r)}_{ij} = min{ d^{(r−1)}_{ik} + d_{kj} | 1 ≤ k ≤ n },   2 ≤ r ≤ n − 1.

We put P^{(1)}_{ij} = P_{ij}, 1 ≤ i, j ≤ n. We claim that for 1 ≤ r ≤ n − 1,

d^{(r)}_{ij} = min{ ⟨w, p⟩ | p ∈ P^{(r)}_{ij} }.

Indeed, this assertion holds by definition for r = 1. For 2 ≤ r ≤ n − 1, we have

d^{(r)}_{ij} = min{ d^{(r−1)}_{ik} + d_{kj} | 1 ≤ k ≤ n }
            = min{ min{ ⟨w, p⟩ | p ∈ P^{(r−1)}_{ik} } + min{ ⟨w, p⟩ | p ∈ P^{(1)}_{kj} } | 1 ≤ k ≤ n }
            = min{ min{ ⟨w, p⟩ | p ∈ P^{(r−1)}_{ik} ⊙ P_{kj} } | 1 ≤ k ≤ n }
            = min{ ⟨w, p⟩ | p ∈ ⊕_{k=1}^{n} P^{(r−1)}_{ik} ⊙ P_{kj} }
            = min{ ⟨w, p⟩ | p ∈ P^{(r)}_{ij} }.

The second equality follows from the induction hypothesis, the third from the definition of multiplication in the polytope algebra, and the fourth from the definition of addition in the polytope algebra and the fact that the minimum is attained at a vertex, which is, by Prop. 3.34, a vertex of one of the involved polytopes. ⊓⊔

The Floyd-Warshall algorithm for finding shortest paths in a weighted digraph can be extended to this parametric setting. If the parameter d is kept fixed, the algorithm still runs in polynomial time.

Example 3.48. Reconsider the directed graph G in Ex. 3.5. Suppose the adjacency matrix of G is defined over the polytope algebra P_d as follows,

D_G =
  [ {0}  P    ∅    ∅   ]
  [ ∅    {0}  P    ∅   ]
  [ P    ∅    {0}  ∅   ]
  [ P    ∅    ∅    {0} ],

where P is a polytope in R^d. Then we have

D_G^{⊙2} =
  [ {0}     P       P^{⊙2}  ∅   ]
  [ P^{⊙2}  {0}     P       ∅   ]
  [ P       P^{⊙2}  {0}     ∅   ]
  [ P       P^{⊙2}  ∅       {0} ]

and D_G^{⊙3} =
  [ {0}     P       P^{⊙2}  ∅   ]
  [ P^{⊙2}  {0}     P       ∅   ]
  [ P       P^{⊙2}  {0}     ∅   ]
  [ P       P^{⊙2}  P^{⊙3}  {0} ].

Part II

Algebraic Statistics

4

Basic Algebraic Statistical Models

Statistics is the study of the collection, analysis, interpretation, presentation, and organization of data. Statistics builds models of the process that generated the data. In descriptive statistics, data are summarized and measured by indexes such as the mean and standard deviation, while in inferential statistics, conclusions about the data are drawn subject to random variation, for example by confidence intervals and hypothesis testing. In this chapter, some basic algebraic statistical models are introduced which will serve as a basis for the subsequent chapters.

4.1 Introductory Example

We consider a statistical model called DiaNA that produces sequences of symbols over the DNA alphabet{A, C, G, T} such as

CTCACGTGATGAGAGCATTCTCAGACCGTGACGCGTGTAGCAGCGGCTC. (4.1)

DiaNA uses three tetrahedral dice to generate DNA sequences. The first two dice are loaded and thethird die is fair (Table 4.1). DiaNA first picks one of her three dice at random, where the first die(GC-rich) is picked with probability θ1, the second die (GC-poor) is picked with probability θ2, and thethird die is picked with probability 1− θ1 − θ2.

Table 4.1. The three tetrahedral dice of DiaNA.

              A     C     G     T
first die    0.15  0.33  0.36  0.16
second die   0.27  0.24  0.23  0.26
third die    0.25  0.25  0.25  0.25

DiaNA uses the following probabilities to generate the four symbols:

pA = −0.10 · θ1 + 0.02 · θ2 + 0.25,


pC = 0.08 · θ1 − 0.01 · θ2 + 0.25, (4.2)

pG = 0.11 · θ1 − 0.02 · θ2 + 0.25,

pT = −0.09 · θ1 + 0.01 · θ2 + 0.25.

We have

pA + pC + pG + pT = 1,

and the three distributions in the rows of Table 4.1 are obtained by specializing (θ1, θ2) to (1,0), (0,1),and (0,0), respectively.

Consider the likelihood of observing the data (4.1). For this, note that the data contain 10 A's, 14 C's, 15 G's, and 10 T's. Assume that all symbols were independently generated. Then the likelihood of observing the data is given by

L = p_A^{10} · p_C^{14} · p_G^{15} · p_T^{10}.

The likelihood function L = L(θ1, θ2) is a real-valued function on the triangle

Θ = {(θ1, θ2) | θ1 > 0, θ2 > 0, θ1 + θ2 < 1}.

Equivalently, the likelihood of observing the data can be described by the log-likelihood function

ℓ(θ1, θ2) = logL(θ1, θ2)

= 10 · log pA(θ1, θ2) + 14 · log pC(θ1, θ2) + 15 · log pG(θ1, θ2) + 10 · log pT (θ1, θ2).

The parameters θ_1 and θ_2 can be estimated by maximizing this likelihood function. For this, we equate the two partial derivatives of the function to zero:

∂ℓ/∂θ_1 = (10/p_A)·(∂p_A/∂θ_1) + (14/p_C)·(∂p_C/∂θ_1) + (15/p_G)·(∂p_G/∂θ_1) + (10/p_T)·(∂p_T/∂θ_1) = 0,

∂ℓ/∂θ_2 = (10/p_A)·(∂p_A/∂θ_2) + (14/p_C)·(∂p_C/∂θ_2) + (15/p_G)·(∂p_G/∂θ_2) + (10/p_T)·(∂p_T/∂θ_2) = 0.

We use Maple to solve these equations:

> pA := -0.10*x + 0.02*y + 0.25:

> pC := 0.08*x - 0.01*y + 0.25:

> pG := 0.11*x - 0.02*y + 0.25:

> pT := -0.09*x + 0.01*y + 0.25:

> L := pA^10 * pC^14 * pG^15 * pT^10:

> l := log( L ):

> lx := diff(l, x):

> ly := diff(l, y):

> fsolve( {lx=0, ly=0}, {x,y}, {x=0..1}, {y=0..1} );


The fsolve command provides the critical point

θ = (θ1, θ2) = (0.5191263945, 0.2172513326).

The corresponding probability distribution is

(pA, pC , pG, pT ) = (0.202432, 0.289358, 0.302759, 0.205451).

This distribution lies very close to the empirical distribution

(1/49) · (10, 14, 15, 10) = (0.204082, 0.285714, 0.306122, 0.204082).

To determine the nature of the critical point θ̂, we examine the corresponding Hessian matrix

H = [ ∂²ℓ/∂θ_1²     ∂²ℓ/∂θ_1∂θ_2 ]
    [ ∂²ℓ/∂θ_2∂θ_1  ∂²ℓ/∂θ_2²    ].

At the critical point θ = θ̂, the Hessian matrix equals

[ −7.409465471   1.195056562  ]
[  1.195056562  −0.2034803046 ].

Since the Hessian matrix is a real-valued symmetric matrix, its eigenvalues are real-valued

−7.602486025, −0.01045975018.

As the eigenvalues are negative, the Hessian matrix is negative definite. Thus the critical point θ̂ is a local maximum of the log-likelihood function ℓ(θ). These calculations can be carried out in Maple as follows:

> x := 0.5191263945: y := 0.2172513326:

> with( linalg ):

> H := matrix( [[ diff(diff(l,x),x), diff(diff(l,x),y) ],

[ diff(diff(l,y),x), diff(diff(l,y),y) ]] );

> eigenvalues ( H );
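The same maximum likelihood estimate can also be obtained by direct numerical optimization. The following Python sketch (an aside next to the Maple session; it assumes numpy and scipy are available) minimizes the negative log-likelihood over the box [0, 1]²:

import numpy as np
from scipy.optimize import minimize

u = np.array([10, 14, 15, 10])                 # observed counts of A, C, G, T

def probs(theta):
    t1, t2 = theta
    return np.array([-0.10*t1 + 0.02*t2 + 0.25,
                      0.08*t1 - 0.01*t2 + 0.25,
                      0.11*t1 - 0.02*t2 + 0.25,
                     -0.09*t1 + 0.01*t2 + 0.25])

def neg_log_likelihood(theta):
    return -np.sum(u * np.log(probs(theta)))

res = minimize(neg_log_likelihood, x0=[0.3, 0.3], bounds=[(0.0, 1.0), (0.0, 1.0)])
print(res.x)    # approximately (0.5191, 0.2173), as found with fsolve above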

4.2 General Algebraic Statistical Model

The above example exhibits all characteristics of an algebraic statistical model. In general, one considers a state space given by the first m positive integers,

[m] := {1, . . . , m}.  (4.3)

A probability distribution on the set [m] is a point in the probability simplex

∆_{m−1} = { (p_1, . . . , p_m) ∈ [0, 1]^m | ∑_i p_i = 1 }.  (4.4)


An algebraic statistical model is defined by a polynomial map f : Rd → Rm given by

(θ1, . . . , θd) 7→ (f1(θ), . . . , fm(θ)), (4.5)

where θ_1, . . . , θ_d are the model parameters and f_1, . . . , f_m are polynomials (or rational functions) in R[X_1, . . . , X_d]. Note that the number of parameters d is usually much smaller than the size of the state space m.

The parameter vector (θ_1, . . . , θ_d) ranges over a suitable nonempty open subset Θ of R^d, the parameter space of f. We assume that the parameter space Θ satisfies

Θ ⊆ { θ ∈ R^d | f_i(θ) > 0, 1 ≤ i ≤ m }.  (4.6)

Thus we have

f(Θ) ⊆ ∆_{m−1} ⇐⇒ f_1(θ) + . . . + f_m(θ) = 1.  (4.7)

The right-hand side is an identity of polynomial functions in which all nonconstant terms cancel and the constant terms add up to 1. If (4.7) holds, then the model is simply the set f(Θ).

However, not all algebraic statistical models satisfy (4.7). In this case, the vectors in f(Θ) can be scaled to obtain a family of probability distributions on [m],

(1 / ∑_i f_i(θ)) · (f_1(θ), . . . , f_m(θ)),   θ ∈ Θ.  (4.8)

The denominator polynomial ∑_i f_i(θ) is known as the partition function of the model.

The sample data are typically given by a sequence of values from the state space,

i1, i2, i3, . . . , iN . (4.9)

The integer N is the sample size. Assume that the values are independent and identically distributed. Then the data can be summarized by the frequency vector

u = (u_1, u_2, . . . , u_m),  (4.10)

where u_i is the number of occurrences of i ∈ [m] in the data, 1 ≤ i ≤ m. It follows that

u_1 + u_2 + . . . + u_m = N  (4.11)

and the empirical distribution corresponding to the data is given by the scaled vector

(1/N)(u_1, u_2, . . . , u_m),  (4.12)

which belongs to the probability simplex ∆_{m−1}. The coordinates u_i/N are the observed relative frequencies of the outcomes.

Consider an algebraic statistical model f : R^d → R^m for the data. The probability of observing the data (4.9) is given by

L(θ) = f_{i_1}(θ) f_{i_2}(θ) · · · f_{i_N}(θ) = f_1(θ)^{u_1} · · · f_m(θ)^{u_m}.  (4.13)


If the frequency vector u is kept fixed, the likelihood function L is a function from the parameter space Θ to the positive real numbers.

By reordering the data (4.9), we obtain the same frequency vector u. Thus the probability of observing the frequency vector u is given by

(N! / (u_1! · · · u_m!)) · L(θ).  (4.14)

The frequency vector is a sufficient statistic for the model f, since the likelihood function L(θ) depends on the data only through u (and not through the data itself).

We may run into numerical problems when multiplying many probabilities. For this, we use the log transformation and represent the likelihood function by the log-likelihood function

ℓ(θ) = log L(θ) = u_1 · log f_1(θ) + . . . + u_m · log f_m(θ).  (4.15)

The log-likelihood function ℓ(θ) is a function from the parameter space Θ to the negative real numbers.

The problem of maximum likelihood estimation is to maximize the likelihood function L(θ) or, equivalently, the scaled likelihood function (4.14), or, equivalently, the log-likelihood function ℓ(θ), over the parameter space:

max ℓ(θ)   s.t. θ ∈ Θ.  (4.16)

A solution to this optimization problem is called a maximum likelihood estimate of θ with respect to the model f and the data u. The simplest algebraic statistical models are the linear and toric models, since they easily allow to establish maximum likelihood estimates.

4.3 Linear Models

An algebraic statistical model f : R^d → R^m is called a linear model if its coordinate functions f_i(θ), 1 ≤ i ≤ m, are linear functions. That is, there are real numbers a_{i1}, . . . , a_{id} and b_i, 1 ≤ i ≤ m, such that

f_i(θ) = ∑_{j=1}^{d} a_{ij} θ_j + b_i,   1 ≤ i ≤ m.  (4.17)

For instance, DiaNA is a linear model f : R2 → R4 given by the coordinate functions

f1(θ) = −0.10 · θ1 + 0.02 · θ2 + 0.25,

f2(θ) = 0.08 · θ1 − 0.01 · θ2 + 0.25,

f3(θ) = 0.11 · θ1 − 0.02 · θ2 + 0.25,

f4(θ) = −0.09 · θ1 + 0.01 · θ2 + 0.25.

Proposition 4.1. For any linear model f : R^d → R^m and sufficient statistic u ∈ N_0^m, the log-likelihood function

ℓ(θ) = ∑_{i=1}^{m} u_i log f_i(θ)

is concave. If the linear map f is one-to-one and the data u_i, 1 ≤ i ≤ m, are positive, then the log-likelihood function ℓ(θ) is strictly concave.

Proof. Consider the Hessian matrix of the log-likelihood function,

H = ( ∂²ℓ / ∂θ_j ∂θ_k )_{j,k}.

The log-likelihood function ℓ(θ) is concave if and only if the Hessian matrix H is negative semi-definite for each θ ∈ Θ. Since the Hessian matrix is real-valued and symmetric, its eigenvalues are real-valued. It follows that the Hessian matrix H is negative semi-definite if and only if the eigenvalues of H are non-positive.

Taking partial derivatives of the coordinate functions gives

∂f_i/∂θ_j = a_{ij},   1 ≤ i ≤ m, 1 ≤ j ≤ d.

Thus the partial derivatives of the log-likelihood function are

∂ℓ/∂θ_j = ∑_{i=1}^{m} u_i a_{ij} / f_i(θ),   1 ≤ j ≤ d.

Taking partial derivatives again yields

∂²ℓ/∂θ_j ∂θ_k = − ∑_{i=1}^{m} u_i a_{ij} a_{ik} / f_i(θ)²,   1 ≤ j, k ≤ d.

Thus the Hessian matrix is given by the matrix product

H = −A^T · diag( u_1/f_1(θ)², . . . , u_m/f_m(θ)² ) · A,

where A is the m × d matrix with entries a_{ij}. Hence, the eigenvalues of H are non-positive.

If the mapping f is one-to-one, the matrix A has full rank d, and if the data u_i, 1 ≤ i ≤ m, are strictly positive, then by the above factorization all eigenvalues of the Hessian matrix are strictly negative. Hence, the log-likelihood function is strictly concave. ⊓⊔

Maximum likelihood estimates for a linear model are given by the critical points of the log-likelihood function.

Corollary 4.2. If the linear model f : R^d → R^m is one-to-one and the data u_i, 1 ≤ i ≤ m, are positive, then each critical point of the log-likelihood function ℓ(θ) is a local maximum.


Consider the simple linear regression model given by n real-valued data points (x_i, y_i), 1 ≤ i ≤ n. Suppose the relation between the coordinates of these data points is described by the linear expressions

y_i = θ_1 x_i + θ_0 + ε_i,   1 ≤ i ≤ n,

where θ_0, θ_1 ∈ R and ε_i is an N(0, σ²) error. The objective is to find the equation of the straight line

y = θ_1 x + θ_0

which provides the best fit for the data points in the sense of least-squares minimization; i.e., ∑_i ε_i² is minimal. The ordinary least-squares method gives

θ̂_1 = ∑_{i=1}^{n} (x_i − x̄)(y_i − ȳ) / ∑_{i=1}^{n} (x_i − x̄)² = s_{xy} / s_x²

and

θ̂_0 = ȳ − θ̂_1 x̄.

Example 4.3 (R). The computation of the statistics of a linear model in R can be accomplished bythe function lm (Fig. 4.1)

# relation between age and specific blood value of 20 persons

> age <- c(46,20,52,30,57,25,28,36,22,43,57,33,22,63,40,48,28,49,52,58)

> bv <- c(3.5,1.9,4.0,2.6,4.5,3.0,2.9,3.8,2.1,3.8,4.1,3.0,2.5,4.6,3.2,4.2,

+ 2.3,4.0,2.3,4.0,4.3,3.9)

> srm <- lm( bv ~ age ) # linear model

> summary( srm )

Call:

lm(formula = bv ~ age)

Residuals:

Min 1Q Median 3Q Max

-0.48979 -0.22844 -0.02445 0.20009 0.63844

Coefficients:

Estimate Std. Error t value P(>|t|)

(Intercept) 1.15174 0.22257 5.175 6.37e-05 ***

age 0.05583 0.00522 10.695 3.15e-09 ***

---

...

Multiple R-squared: 0.864,

...

The parameter estimates are θ̂_0 = 1.15 (intercept) and θ̂_1 = 0.056 (age). The standard errors of the parameter estimates are se(θ̂_0) = 0.223 and se(θ̂_1) = 0.005. The R² value of 0.86 indicates that about 86% of the variance of the blood values can be explained by the model. ♦


Fig. 4.1. Simple linear regression.

4.4 Toric Models

The toric models form another class of simple algebraic statistical models. To define a toric model, take a matrix A = (α_{ij}) ∈ N_0^{d×m} whose column sums are all equal:

∑_{i=1}^{d} α_{i1} = . . . = ∑_{i=1}^{d} α_{im}.  (4.18)

Let the j-th column vector α_j of the matrix A represent the monomial

θ^{α_j} = θ_1^{α_{1j}} · · · θ_d^{α_{dj}},   1 ≤ j ≤ m.  (4.19)

By (4.18), these monomials all have the same degree. The matrix A provides an algebraic statistical model f : R^d → R^m defined as

f : θ ↦ (θ^{α_1}, . . . , θ^{α_m}),  (4.20)

which is called the toric model associated with A. The parameter space of this toric model is given by

Θ = { θ ∈ R^d | θ^{α_j} > 0, ∑_j θ^{α_j} = 1 }.  (4.21)

Let u ∈ N_0^m be a frequency vector with sample size N = ∑_i u_i. The likelihood function of this model has the form

L(θ) = f_1(θ)^{u_1} · · · f_m(θ)^{u_m}
     = (θ^{α_1})^{u_1} · · · (θ^{α_m})^{u_m}
     = (∏_{i=1}^{d} θ_i^{α_{i1} u_1}) · · · (∏_{i=1}^{d} θ_i^{α_{im} u_m})  (4.22)
     = ∏_{i=1}^{d} θ_i^{α_{i1} u_1 + α_{i2} u_2 + . . . + α_{im} u_m}
     = θ^{Au}.

The vector b = Au is a sufficient statistic for the model. Maximum likelihood estimation for the toric model means solving the optimization problem

max θ^b   s.t. θ ∈ Θ.  (4.23)

Proposition 4.4. Let f : R^d → R^m be the toric model associated with a matrix A ∈ N_0^{d×m} and let u ∈ N_0^m be a frequency vector. If θ̂ is a local maximum of the optimization problem (4.23), then

A · p̂ = (1/N) · b,

where b = Au is the sufficient statistic, N = u_1 + . . . + u_m is the sample size, and p̂ = f(θ̂).

Proof. We introduce a Lagrange multiplier λ. Each local optimum of (4.23) is a critical point of the following function in the variables θ_1, . . . , θ_d and λ,

θ^b + λ · (1 − ∑_{j=1}^{m} θ^{α_j}).

If this function is subjected to the (scaled) gradient operator

θ · ∇_θ = (θ_1 ∂/∂θ_1, . . . , θ_d ∂/∂θ_d)^T,

we obtain the expression

(b_1 · θ^b, . . . , b_d · θ^b)^T − λ · ∑_{j=1}^{m} (α_{1j}, . . . , α_{dj})^T θ^{α_j}.

We abbreviate the left vector by the expression θ^b · b. If we put p = (θ^{α_1}, . . . , θ^{α_m})^T, the critical equation obtained by equating this expression to zero becomes

θ^b · b = λ · ∑_{j=1}^{m} θ^{α_j} · α_j = λ · A · p.

For each critical point θ̂ with p̂ = (θ̂^{α_1}, . . . , θ̂^{α_m})^T, we obtain

θ̂^b · b = λ · A · p̂.

Thus the vector A · p̂ is a scalar multiple of the vector b = A · u. But the matrix A has the all-one vector (1, . . . , 1) in its row space and ∑_j p̂_j = 1. Hence the scalar factor must be 1/N. ⊓⊔

Example 4.5 (Maple). Take the matrix

A = [ 2 1 0 ]
    [ 0 1 2 ].

The associated toric model is given as

f : R^2 → R^3 : (θ_1, θ_2) ↦ (θ_1², θ_1θ_2, θ_2²).

Suppose u = (11, 17, 23) is a frequency vector. The sample size is N = 51 and we have

b = Au = (39, 63)^T.

The problem is to maximize the likelihood function L(θ) = θ_1^{39} θ_2^{63} over all positive real vectors (θ_1, θ_2) that satisfy θ_1² + θ_1θ_2 + θ_2² = 1. For this, by Prop. 4.4, we consider the system of equations

( 2θ_1² + θ_1θ_2 )           ( 39 )
( θ_1θ_2 + 2θ_2² )  = (1/51)·( 63 ).  (4.24)

We use Maple to solve these equations. To this end, the toric model is described by the matrix

> with(linalg):

> d := 2: m := 3:

> A := matrix( d, m, [2,1,0,0,1,2] );

Take the frequency vector

> u := vector( [11,17,23] );

Using the matrix A and the vector u, we obtain

> N := 0: for j from 1 to m do N := N + u[j] od:

> b := scalarmul( multiply( A, u), 1/N);

> p := vector( [ x^A[1,1] * y^A[2,1],

x^A[1,2] * y^A[2,2],

x^A[1,3] * y^A[2,3] ]);

> v := multiply( A, p);

We solve (4.24) by the floating-point solver fsolve

> fsolve( {v[1] = b[1], v[2] = b[2]}, {x,y}, {x=0..1}, {y=0..1} );


and obtain the unique solution

θ1 = 0.4718898804, θ2 = 0.6767378939.

The corresponding probability distribution satisfies

p = (0.2226800592, 0.3193457638, 0.4579741770),

which lies close to the empirical distribution

(1/N) · u = (0.2156862745, 0.3333333333, 0.4509803922).

♦

A particularly simple toric model is the so-called independence model. To describe an independence model for two random variables, we take positive integers m_1 and m_2, and put d = m_1 + m_2 and m = m_1 · m_2. Consider the d × m matrix

A =
  [ 1 · · · 1                                ]
  [            1 · · · 1                     ]
  [                        · · ·             ]
  [                              1 · · · 1   ]
  [ I_{m_2}     I_{m_2}    · · ·   I_{m_2}   ],  (4.25)

whose first m_1 rows successively contain the all-one vector of length m_2 and whose last m_2 rows are comprised of m_1 successive m_2 × m_2 identity matrices (written above as blocks I_{m_2}). The column sums in this matrix are all equal to 2. Thus the matrix A defines a toric model f : R^d → R^m given by

f : (θ1, . . . , θd) 7→ (θiθj+m1)i∈[m1],j∈[m2]. (4.26)

The probability distribution p = (pij) has the form

pij = θiθj+m1, 1 ≤ i ≤ m1, 1 ≤ j ≤ m2, (4.27)

and satisfies

A · p = ( θ_1θ_{m_1+1} + . . . + θ_1θ_{m_1+m_2},
          . . . ,
          θ_{m_1}θ_{m_1+1} + . . . + θ_{m_1}θ_{m_1+m_2},
          θ_1θ_{m_1+1} + . . . + θ_{m_1}θ_{m_1+1},
          . . . ,
          θ_1θ_{m_1+m_2} + . . . + θ_{m_1}θ_{m_1+m_2} )^T.  (4.28)

This model can be considered as the independence model for two random variables. To this end, let X_1 and X_2 be random variables on the state sets [m_1] and [m_2], respectively. These random variables are independent if

p_{ij} = P(X_1 = i ∧ X_2 = j) = P(X_1 = i) · P(X_2 = j),   1 ≤ i ≤ m_1, 1 ≤ j ≤ m_2.  (4.29)

By putting P(X_1 = i) = θ_i and P(X_2 = j) = θ_{j+m_1}, we see the analogy to the algebraic statistical model.

model.Given a frequency vector u = (uij) ∈ Nm0 with sample length N =

∑ij uij . The sufficient statistic

of this model is

b = A · u =

u1,1 + . . .+ u1,m2

...um1,1 + . . .+ um1,m2

u1,1 + . . .+ um1,1

...u1,m2

+ . . .+ um1,m2

. (4.30)

By Prop. 4.4, we obtain the following result.

Proposition 4.6. Let f : R^d → R^m be the independence model associated with the matrix A ∈ N_0^{d×m} in (4.25) and let u = (u_{ij}) ∈ N_0^m be a frequency vector. A local maximum θ̂ for these data is given by

θ̂_i = (1/N) ∑_{j=1}^{m_2} u_{ij},   1 ≤ i ≤ m_1,

and

θ̂_{j+m_1} = (1/N) ∑_{i=1}^{m_1} u_{ij},   1 ≤ j ≤ m_2.

Example 4.7. Consider the independence model for a binary and a ternary random variable (m_1 = 2 and m_2 = 3) given by the matrix

A =
  [ 1 1 1 0 0 0 ]
  [ 0 0 0 1 1 1 ]
  [ 1 0 0 1 0 0 ]
  [ 0 1 0 0 1 0 ]
  [ 0 0 1 0 0 1 ].

The matrix A gives rise to the toric model f : R^5 → R^{2×3} defined as

(θ_1, θ_2, θ_3, θ_4, θ_5) ↦ (θ_1θ_3, θ_1θ_4, θ_1θ_5, θ_2θ_3, θ_2θ_4, θ_2θ_5).

Thus we obtain

A · p = ( θ_1θ_3 + θ_1θ_4 + θ_1θ_5,
          θ_2θ_3 + θ_2θ_4 + θ_2θ_5,
          θ_1θ_3 + θ_2θ_3,
          θ_1θ_4 + θ_2θ_4,
          θ_1θ_5 + θ_2θ_5 )^T.

Given a frequency vector u = (u11, u12, u13, u21, u22, u23), we derive the sufficient statistic

b = A · u = ( u_{11} + u_{12} + u_{13},
              u_{21} + u_{22} + u_{23},
              u_{11} + u_{21},
              u_{12} + u_{22},
              u_{13} + u_{23} )^T,

and the likelihood function

L(θ) = (θ_1θ_3)^{u_{11}} (θ_1θ_4)^{u_{12}} (θ_1θ_5)^{u_{13}} (θ_2θ_3)^{u_{21}} (θ_2θ_4)^{u_{22}} (θ_2θ_5)^{u_{23}}
     = θ_1^{u_{11}+u_{12}+u_{13}} θ_2^{u_{21}+u_{22}+u_{23}} θ_3^{u_{11}+u_{21}} θ_4^{u_{12}+u_{22}} θ_5^{u_{13}+u_{23}}.

By Prop. 4.6, the maximum likelihood estimates for the frequency vector u are

θ̂_1 = (1/N)(u_{11} + u_{12} + u_{13}),
θ̂_2 = (1/N)(u_{21} + u_{22} + u_{23}),
θ̂_3 = (1/N)(u_{11} + u_{21}),
θ̂_4 = (1/N)(u_{12} + u_{22}),
θ̂_5 = (1/N)(u_{13} + u_{23}).
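These estimates are just scaled row and column sums of the frequency table and are easy to verify numerically. An illustrative Python sketch with a hypothetical 2 × 3 frequency table (not data from the text):

import numpy as np

u = np.array([[4, 2, 2],          # hypothetical frequency table u = (u_ij), N = 20
              [1, 5, 6]])
N = u.sum()

theta_rows = u.sum(axis=1) / N            # estimates of theta_1, theta_2
theta_cols = u.sum(axis=0) / N            # estimates of theta_3, theta_4, theta_5
p_hat = np.outer(theta_rows, theta_cols)  # fitted distribution p_ij = theta_i * theta_{j+m1}
print(theta_rows, theta_cols, p_hat.sum())  # the fitted p_ij sum to one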

4.5 Markov Chain Model

A basic algebraic statistical model that is more complex than the linear and toric models is the Markovchain model.

First, we introduce the toric Markov chain model. For this, we take an alphabet Σ with l symbolsand fix a positive integer n. We consider words τ = τ1 . . . τn of length n over Σ and count the numberof occurrences in τ of length-2 words σ = σ1σ2. The number of such occurrences is denoted by aσ,τ .For instance, we have aCG,ACGACG = 2 and aCC,CGACG = 0.

We record all possible occurrences by a matrix Al,n = (aσ,τ). Note that the matrix Al,n has d = l² rows labelled by the length-2 words σ over Σ and m = lⁿ columns labelled by the length-n words τ over Σ. The matrix Al,n has the property that the sum of each of its columns is n − 1, because each word of length n consists of n − 1 consecutive length-2 words. Thus the matrix Al,n defines a toric model f = fl,n : Rd → Rm given by

θ = (θσ)σ∈Σ2 7→ (pτ )τ∈Σn , (4.31)

where

pτ = (1/l) θτ1τ2 · θτ2τ3 · · · θτn−1τn ,   τ = τ1 . . . τn ∈ Σn. (4.32)


The leading coefficient indicates that we assume a uniform initial distribution on the states in the alphabet Σ as described by (7.1). The parameter space of the model is the set of positive l × l matrices Θ = R^{l×l}_{>0} and the state space is the set of all words over Σ of length n. This model is called the toric Markov chain model.

Example 4.8. Take the binary alphabet Σ = {0, 1} and n = 4. We have l = 2, d = 2² = 4, and m = 2⁴ = 16, and the 4 × 16 matrix A2,4 is defined as

     0000 0001 0010 0011 0100 0101 0110 0111 1000 1001 1010 1011 1100 1101 1110 1111
 00    3    2    1    1    1    0    0    0    2    1    0    0    1    0    0    0
 01    0    1    1    1    1    2    1    1    0    1    1    1    0    1    0    0
 10    0    0    1    0    1    1    1    0    1    1    2    1    1    1    1    0
 11    0    0    0    1    0    0    1    2    0    0    0    1    1    1    2    3  .

The matrix A2,4 provides the toric Markov chain model given by the mapping

f2,4 : R4 → R16 : (θ00, θ01, θ10, θ11) ↦ (p0000, p0001, . . . , p1111),

where

pτ1τ2τ3τ4 = (1/2) θτ1τ2 · θτ2τ3 · θτ3τ4 ,   τ1, τ2, τ3, τ4 ∈ Σ.

For instance, we have p0000 = (1/2)θ00³, p0001 = (1/2)θ00²θ01, and p0110 = (1/2)θ01θ11θ10. ♦
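The matrix Al,n can be generated mechanically from its definition: every length-n word contributes one column whose entries count its length-2 factors. A short sketch (Python, purely illustrative; the function name is ours) that rebuilds Al,n for arbitrary l and n and checks the column-sum property:

from itertools import product

def toric_markov_matrix(l, n):
    """Matrix A_{l,n}: rows = length-2 words, columns = length-n words over {0,...,l-1}."""
    pairs = list(product(range(l), repeat=2))          # row labels sigma = sigma1 sigma2
    words = list(product(range(l), repeat=n))          # column labels tau
    A = [[sum(1 for i in range(n - 1) if (tau[i], tau[i + 1]) == sigma)
          for tau in words] for sigma in pairs]
    return A, pairs, words

A, pairs, words = toric_markov_matrix(2, 4)
assert all(sum(col) == 3 for col in zip(*A))           # every column sums to n - 1
# column of the word 0110: occurrences of 00, 01, 10, 11
print([row[words.index((0, 1, 1, 0))] for row in A])   # -> [0, 1, 1, 1], as in Ex. 4.8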

Second, we introduce the Markov chain model as a submodel of the toric Markov chain model. For this, the parameter space of the toric Markov chain model is restricted to the set of all matrices θ ∈ R^{l×l}_{>0} whose rows sum up to 1. The parameter space of the Markov chain model is thus a subset Θ1 of R^{l×l}_{>0}, and the number of parameters is d = l · (l − 1). The entries of the matrices θ ∈ Θ1 can be viewed as transition probabilities. That is, θσ1σ2 can be interpreted as the probability of transitioning from state σ1 to state σ2 in one step. The Markov chain model is given by the map fl,n : Rd → R^{lⁿ} restricted to the parameter space Θ1. Each point p in the image fl,n(Θ1) is called a Markov chain.

Example 4.9. Reconsider the toric Markov chain model in Ex. 4.8. The parameter space Θ1 of the Markov chain model can be viewed as the set of all pairs (θ0, θ1) ∈ R²_{>0} which give rise to the probability matrices

θ = ( θ0        1 − θ0 )
    ( 1 − θ1    θ1     ) .        (4.33)

The Markov chains in f2,4(Θ1) are as follows:

p0000 = (1/2) θ0³,                      p0001 = (1/2) θ0²(1 − θ0),
p0010 = (1/2) θ0(1 − θ0)(1 − θ1),       p0011 = (1/2) θ0(1 − θ0)θ1,
p0100 = (1/2) θ0(1 − θ0)(1 − θ1),       p0101 = (1/2) (1 − θ0)²(1 − θ1),
p0110 = (1/2) (1 − θ0)θ1(1 − θ1),       p0111 = (1/2) (1 − θ0)θ1²,
p1000 = (1/2) (1 − θ1)θ0²,              p1001 = (1/2) θ0(1 − θ0)(1 − θ1),
p1010 = (1/2) (1 − θ0)(1 − θ1)²,        p1011 = (1/2) (1 − θ0)θ1(1 − θ1),
p1100 = (1/2) θ0θ1(1 − θ1),             p1101 = (1/2) (1 − θ0)θ1(1 − θ1),
p1110 = (1/2) θ1²(1 − θ1),              p1111 = (1/2) θ1³.


Let u = (uτ) ∈ N0^m be a frequency vector representing N observed sequences in Σn. That is, uτ = uτ1...τn counts the number of times the sequence τ = τ1 . . . τn was observed. Hence, Στ uτ = N. The sufficient statistic v = Al,n · u can be regarded as an l × l matrix with entries vσ1σ2, where σ1, σ2 ∈ Σ. The entry vσ1σ2 equals the number of occurrences of σ1σ2 ∈ Σ² as a consecutive pair in any of the N observed sequences.

Example 4.10. Reconsider the Markov chain model in Ex. 4.9. The sufficient statistic is given as

v00 = 3u0000 + 2u0001 + u0010 + u0011 + u0100 + 2u1000 + u1001 + u1100,

v01 = u0001 + u0010 + u0011 + u0100 + 2u0101 + u0110 + u0111 + u1001 + u1010 + u1011 + u1101,

v10 = u0010 + u0100 + u0101 + u0110 + u1000 + u1001 + 2u1010 + u1011 + u1100 + u1101 + u1110,

v11 = u0011 + u0110 + 2u0111 + u1011 + u1100 + u1101 + 2u1110 + 3u1111.

Proposition 4.11. In the Markov chain model fl,n, the maximum likelihood estimate of the frequency data u ∈ N0^{lⁿ} with sufficient statistic v = Al,n · u is given by the l × l matrix θ = (θσ1σ2) in Θ1 such that

θσ1σ2 = vσ1σ2 / Σσ∈Σ vσ1σ ,   σ1, σ2 ∈ Σ.

Proof. Let Σ = {1, . . . , l}. The likelihood function for the toric Markov chain model is given by

L(θ) = θ^{Al,n·u} = θ^v = Π_{ij∈Σ²} θij^{vij}

and so the log-likelihood function equals

ℓ(θ) = Σ_{ij∈Σ²} vij log θij = Σ_{i∈Σ} ( vi1 log θi1 + · · · + v_{i,l−1} log θ_{i,l−1} + vil log(1 − Σ_{k=1}^{l−1} θik) ).

For any length-2 word ij ∈ Σ² with j ≠ l, we obtain

∂ℓ/∂θij = vij/θij − vil/(1 − Σ_{k=1}^{l−1} θik).

Equating these expressions to zero yields the unique critical point with coordinates

θij = vij/(vi1 + · · · + vil),   ij ∈ Σ². ⊓⊔

Example 4.12. Reconsider the Markov chain model in Ex. 4.9. Suppose there is a sample of size N = 91 given by the frequency vector


u = (7, 2, 8, 10, 7, 9, 7, 10, 4, 2, 5, 7, 4, 3, 2, 4)T .

Then the sufficient statistic is v = A2,4 · u = (64, 79, 63, 67)ᵀ. The likelihood function is given by

L(θ) = θ0^64 · (1 − θ0)^79 · θ1^67 · (1 − θ1)^63

and thus the log-likelihood function equals

ℓ(θ) = 64 · log θ0 + 79 · log(1 − θ0) + 67 · log θ1 + 63 · log(1 − θ1).

By Prop. 4.11, the maximum likelihood estimate of the data u is

θ0 = 64/(64 + 79) = 0.447552   and   θ1 = 67/(67 + 63) = 0.515385.
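The sufficient statistic and the maximum likelihood estimates of this example can be checked mechanically; a short sketch (Python, illustrative only):

from itertools import product

words = list(product((0, 1), repeat=4))                      # 0000, 0001, ..., 1111
u = dict(zip(words, [7, 2, 8, 10, 7, 9, 7, 10, 4, 2, 5, 7, 4, 3, 2, 4]))

# sufficient statistic v_{ab} = number of occurrences of the pair ab in the sample
v = {(a, b): sum(cnt * sum(1 for i in range(3) if (w[i], w[i + 1]) == (a, b))
                 for w, cnt in u.items())
     for a in (0, 1) for b in (0, 1)}
print(v)                                       # (0,0): 64, (0,1): 79, (1,0): 63, (1,1): 67

# maximum likelihood estimates of Prop. 4.11 (row-wise normalization)
theta0 = v[(0, 0)] / (v[(0, 0)] + v[(0, 1)])   # ~ 0.4476
theta1 = v[(1, 1)] / (v[(1, 0)] + v[(1, 1)])   # ~ 0.5154
print(theta0, theta1)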

4.6 Maximum Likelihood Estimation

Maximum likelihood estimation is a popular statistical method used for fitting a statistical model to the data and providing estimates for the model parameters. Consider an algebraic statistical model f : Cd → Cm given by

f : θ 7→ (f1(θ), . . . , fm(θ)). (4.34)

Here the ambient spaces are taken over the complex numbers, but the coordinates f1, . . . , fm are polynomials in Q[θ1, . . . , θd]. It is assumed that the parameter space Θ is an open subset of Rd and that the image f(Θ) of the parameter space is a subset of R^m_{>0}.

Suppose a sample is summarized by the frequency vector u = (u1, u2, . . . , um) of positive integers. The probability of observing these data is given by the likelihood function

Lu(θ) = fi1(θ) fi2(θ) · · · fiN(θ) = f1(θ)^{u1} · · · fm(θ)^{um}, (4.35)

where i1, . . . , iN are the observed states.

Equivalently, the likelihood function can be described by the log-likelihood function

ℓu(θ) = logLu(θ) = u1 · log f1(θ) + . . .+ um · log fm(θ). (4.36)

The problem of maximum likelihood estimation is to maximize the (log-)likelihood function. Each maximum of the log-likelihood function is a solution of the critical equations

∂ℓu/∂θi = 0,   1 ≤ i ≤ d. (4.37)

The derivative of ℓu with respect to the variable θi is the rational function

∂ℓu/∂θi = Σ_{j=1}^{m} (uj / fj(θ)) · ∂fj/∂θi ,   1 ≤ i ≤ d. (4.38)


We can use Groebner bases to compute the critical points. For this, consider the polynomial ring Q[Z1, . . . , Zm, θ1, . . . , θd] and take the ideal

Ju = 〈 Z1f1 − 1, . . . , Zmfm − 1, Σ_{j=1}^{m} ujZj ∂fj/∂θ1 , . . . , Σ_{j=1}^{m} ujZj ∂fj/∂θd 〉. (4.39)

A point (z, θ) ∈ C^{d+m} lies in the affine variety V(Ju) if and only if θ is a critical point of the log-likelihood function, where fj(θ) ≠ 0 and zj = 1/fj(θ) for 1 ≤ j ≤ m.

Consider the m-th elimination ideal of Ju with respect to an elimination ordering for Z1, . . . , Zm; that is,

Iu = Ju ∩ C[θ1, . . . , θd]. (4.40)

The ideal Iu is called the likelihood ideal and the variety V(Iu) is called the likelihood variety of the model f with respect to the data u.

Proposition 4.13. A point θ ∈ Cd with fj(θ) ≠ 0 for 1 ≤ j ≤ m lies in the likelihood variety V(Iu) if and only if θ is a critical point of the log-likelihood function ℓu.

Proof. Let θ be a critical point of ℓu. Put zj = 1/fj(θ), 1 ≤ j ≤ m. Then (z, θ) ∈ V(Ju) and so by the Closure theorem θ = πm(z, θ) ∈ V(Iu).

Conversely, we make use of the Extension theorem. First, we extend by the variable Z1. Consider the generator Z1f1 − 1. Since f1(θ) ≠ 0 and θ ∈ V(Iu), it follows that the solution θ can be extended to a solution (θ, z1), z1 = 1/f1(θ), of the ideal 〈Z1f1 − 1〉 + Iu. By continuing in this way, we obtain an element (θ, z) of V(Ju). Then θ is a critical point of ℓu. ⊓⊔

The problem of maximum likelihood estimation can be solved by computing the likelihood variety V(Iu) in Cd, intersecting the variety with the preimage f−1(∆) of the probability simplex ∆m−1, and identifying all local maxima among the points in V(Iu) ∩ f−1(∆). Equivalently, the maximum likelihood estimates can be obtained by augmenting the ideal Ju with the polynomial f1 + · · · + fm − 1; that is,

Ju = 〈 f1 + · · · + fm − 1, Z1f1 − 1, . . . , Zmfm − 1, Σ_{j=1}^{m} ujZj ∂fj/∂θ1 , . . . , Σ_{j=1}^{m} ujZj ∂fj/∂θd 〉. (4.41)

Then the likelihood variety V(Iu) is intersected with the preimage f−1(R^m_{>0}) and all local maxima among the points in V(Iu) ∩ f−1(R^m_{>0}) are determined.

Example 4.14 (Singular). We compute the likelihood variety of the DiaNA model. For this, take the algebraic statistical model f : C2 → C4 given by

> ring bigring = real, (t(1..2),z(1..4)), lp;

> poly f1 = -0.10*t(1) + 0.02*t(2) + 0.25;

> poly f2 = 0.08*t(1) - 0.01*t(2) + 0.25;

> poly f3 = 0.11*t(1) - 0.02*t(2) + 0.25;

> poly f4 = -0.09*t(1) + 0.01*t(2) + 0.25;

Suppose the frequency vector is


> int u1 = 10;

> int u2 = 14;

> int u3 = 15;

> int u4 = 10;

The ideal Ju in the big ring Q[Z1, Z2, Z3, Z4, θ1, θ2] is defined as

> ideal Ju = f1+f2+f3+f4-1,

z(1)*f1-1, z(2)*f2-1, z(3)*f3-1, z(4)*f4-1,

u1*z(1)*diff(f1,t(1)) + u2*z(2)*diff(f2,t(1))

+ u3*z(3)*diff(f3,t(1)) + u4*z(4)*diff(f4,t(1)),

u1*z(1)*diff(f1,t(2)) + u2*z(2)*diff(f2,t(2))

+ u3*z(3)*diff(f3,t(2)) + u4*z(4)*diff(f4,t(2));

The likelihood ideal Iu is obtained from Ju by elimination:

> ideal Iu = eliminate (Ju, z(1)*z(2)*z(3)*z(4));

> ring smallring = real, (t(1..2)), lp;

> ideal Iu = fetch (bigring, Iu);

> std(Iu);

Iu_[1]=t(2)^3-(8.071e+01)*t(2)^2-(3.202e+04)*t(2)+(6.959e+03)

Iu_[2]=t(1)+(2.110e-04)*t(2)^2-(1.627e-01)*t(2)-(4.838e-01)

Finally, the zeros of the reduced Groebner basis of Iu are computed as follows:

> LIB "solve.lib";

> solve (Iu, 10);

The zeros are

[1]:

[1]:

-27.1481605843

[2]:

-143.2004435005

[2]:

[1]:

0.5191557516

[2]:

0.2174490559

[3]:

[1]:

26.3283769148

[2]:

223.6954677454

The second zero is the maximum likelihood estimate (Section 4.1). ♦
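The reported critical point can also be verified numerically without Groebner bases by evaluating the gradient (4.38) of the log-likelihood. A short sketch (plain Python, illustrative; the coordinates are the ones printed by solve above):

# DiaNA model coordinates f_j and data u from the Singular session above
f = [lambda t1, t2: -0.10*t1 + 0.02*t2 + 0.25,
     lambda t1, t2:  0.08*t1 - 0.01*t2 + 0.25,
     lambda t1, t2:  0.11*t1 - 0.02*t2 + 0.25,
     lambda t1, t2: -0.09*t1 + 0.01*t2 + 0.25]
df1 = [-0.10, 0.08, 0.11, -0.09]       # partial derivatives w.r.t. theta_1
df2 = [ 0.02, -0.01, -0.02, 0.01]      # partial derivatives w.r.t. theta_2
u = [10, 14, 15, 10]

t1, t2 = 0.5191557516, 0.2174490559    # second zero reported by solve
grad1 = sum(uj / fj(t1, t2) * d for uj, fj, d in zip(u, f, df1))
grad2 = sum(uj / fj(t1, t2) * d for uj, fj, d in zip(u, f, df2))
print(grad1, grad2)                    # both values are numerically ~ 0, so (4.37) holds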


4.7 Model Invariants

Each algebraic statistical model gives rise to model invariants that describe the relationships between the probabilities. Consider an algebraic statistical model f : Cd → Cm given by

f : (θ1, . . . , θd) 7→ (f1(θ), . . . , fm(θ)). (4.42)

Here the ambient spaces are taken over the complex numbers, but the coordinates f1, . . . , fm are polynomials in Q[θ1, . . . , θd]. We study the image f(Cd) by the polynomial parametrization

pi = fi(θ1, . . . , θd), 1 ≤ i ≤ m. (4.43)

The Implicitization theorem yields the following result.

Proposition 4.15. Consider the ideal I = 〈p1 − f1, . . . , pm − fm〉 in C[θ1, . . . , θd, p1, . . . , pm]. For the d-th elimination ideal Id = I ∩ C[p1, . . . , pm], the affine variety V(Id) is the Zariski closure of the image f(Cd).

The polynomials in the elimination ideal Id are called invariants of the model f. By the Elimination theorem, these invariants can be established by computing the reduced Groebner basis of the ideal I with respect to an elimination ordering for θ1 > . . . > θd > p1 > . . . > pm; the basis elements lying in C[p1, . . . , pm] generate Id.

Example 4.16 (Singular). Consider the mapping f : C2 → C3 : (θ1, θ2) ↦ (θ1, θ1θ2, θ1θ2). The image of f is a (dense) subset of a plane in three-space,

f(C2) = {(x, y, z) ∈ C3 | y = z ∧ (x = 0 ⇒ y = 0)}
      = [V(Y − Z) \ V(X, Y − Z)] ∪ V(X, Y, Z).

This is a Boolean combination of affine varieties, but not an affine variety. In view of the ideal I = 〈p1 − θ1, p2 − θ1θ2, p3 − θ1θ2〉 in C[θ1, θ2, p1, p2, p3], the reduced Groebner basis with respect to the lp ordering with θ1 > θ2 > p1 > p2 > p3 can be calculated as follows,

> ring r = 0, (t(1..2),p(1..3)), lp;

> ideal i = p(1)-t(1), p(2)-t(1)*t(2), p(3)-t(1)*t(2);

> std(i);

_[1]=p(2)-p(3)

_[2]=t(2)*p(1)-p(3)

_[3]=t(1)-p(1)

Thus the reduced Groebner basis of the second elimination ideal I2 = I ∩ Q[p1, p2, p3] is {p2 − p3}. Hence the Zariski closure of the image f(C2) is V(p2 − p3) and p2 − p3 is a model invariant for f. ♦

Example 4.17 (Singular). Reconsider the toric model f : C2 → C3 : (θ1, θ2) ↦ (θ1², θ1θ2, θ2²) studied in Ex. 4.5. Take the ideal I = 〈p1 − θ1², p2 − θ1θ2, p3 − θ2²〉 in C[θ1, θ2, p1, p2, p3] and calculate a Groebner basis of I with respect to the lp ordering with θ1 > θ2 > p1 > p2 > p3.

> ring r = 0, (t(1..2),p(1..3)), lp;

> ideal i = p(1)-t(1)^2, p(2)-t(1)*t(2), p(3)-t(2)^2;

> std(i);


_[1]=p(1)*p(3)-p(2)^2

_[2]=t(2)^2-p(3)

_[3]=t(1)*p(3)-p(2)^2

_[4]=t(1)*p(2)-t(2)*p(1)

_[5]=t(1)*t(2)-p(2)

_[6]=t(1)^2-p(1)

The first element provides the Groebner basis of the second elimination ideal I2 and yields the model invariant p1p3 − p2². ♦

Example 4.18 (Singular). Reconsider the DiaNA model. The polynomial parametrization of the DiaNA model is given in (4.2). Take the ideal I in C[θ1, θ2, p1, p2, p3, p4] generated by

p1 − (−0.10 · θ1 + 0.02 · θ2 + 0.25),
p2 − (0.08 · θ1 − 0.01 · θ2 + 0.25),
p3 − (0.11 · θ1 − 0.02 · θ2 + 0.25),
p4 − (−0.09 · θ1 + 0.01 · θ2 + 0.25).

A Groebner basis of I with respect to the lp ordering θ1 > θ2 > p1 > p2 > p3 > p4 is

−0.5500000002 + 1.399999999 · p2 − 0.1999999995 · p3 + 1.0000 · p4,
−0.05059523811 + 0.2380952383 · p4 + 0.9523809525 · p3 + 0.8333333334 · p1,
−0.4285714291 + 0.9428571435 · p4 + 0.7714285717 · p3 + 0.6000000000e−2 · θ2,
−1.071428573 + 2.857142859 · p4 + 1.428571429 · p3 + 0.100 · θ1.

The first two polynomials generate the second elimination ideal I2 and thus provide invariants of the DiaNA model. ♦

Example 4.19 (Singular). Consider the dishonest casino in Ex. 6.1. Assume that the dealer always starts with the fair coin and then eventually switches to the loaded one. If a game consists of n = 4 coin tosses, the probability of an outcome τ ∈ Σ′⁴ is

pτ = pFFFF,τ + pFFFL,τ + pFFLL,τ + pFLLL,τ .

The invariants for this model can be computed as follows.

> ring r = 0, (FF,FL,LL,Fh,Ft,Lh,Lt,p(0..15)), dp;
> ideal i =
// hhhh
p(0) - Fh*FF*Fh*FF*Fh*FF*Fh - Fh*FF*Fh*FF*Fh*FL*Lh - Fh*FF*Fh*FL*Lh*LL*Lh
     - Fh*FL*Lh*LL*Lh*LL*Lh,
// hhht
p(1) - Fh*FF*Fh*FF*Fh*FF*Ft - Fh*FF*Fh*FF*Fh*FL*Lt - Fh*FF*Fh*FL*Lh*LL*Lt
     - Fh*FL*Lh*LL*Lh*LL*Lt,
// hhth
p(2) - Fh*FF*Fh*FF*Ft*FF*Fh - Fh*FF*Fh*FF*Ft*FL*Lh - Fh*FF*Fh*FL*Lt*LL*Lh
     - Fh*FL*Lh*LL*Lt*LL*Lh,
// hhtt
p(3) - Fh*FF*Fh*FF*Ft*FF*Ft - Fh*FF*Fh*FF*Ft*FL*Lt - Fh*FF*Fh*FL*Lt*LL*Lt
     - Fh*FL*Lh*LL*Lt*LL*Lt,
// hthh
p(4) - Fh*FF*Ft*FF*Fh*FF*Fh - Fh*FF*Ft*FF*Fh*FL*Lh - Fh*FF*Ft*FL*Lh*LL*Lh
     - Fh*FL*Lt*LL*Lh*LL*Lh,
// htht
p(5) - Fh*FF*Ft*FF*Fh*FF*Ft - Fh*FF*Ft*FF*Fh*FL*Lt - Fh*FF*Ft*FL*Lh*LL*Lt
     - Fh*FL*Lt*LL*Lh*LL*Lt,
// htth
p(6) - Fh*FF*Ft*FF*Ft*FF*Fh - Fh*FF*Ft*FF*Ft*FL*Lh - Fh*FF*Ft*FL*Lt*LL*Lh
     - Fh*FL*Lt*LL*Lt*LL*Lh,
// httt
p(7) - Fh*FF*Ft*FF*Ft*FF*Ft - Fh*FF*Ft*FF*Ft*FL*Lt - Fh*FF*Ft*FL*Lt*LL*Lt
     - Fh*FL*Lt*LL*Lt*LL*Lt,
// thhh
p(8) - Ft*FF*Fh*FF*Fh*FF*Fh - Ft*FF*Fh*FF*Fh*FL*Lh - Ft*FF*Fh*FL*Lh*LL*Lh
     - Ft*FL*Lh*LL*Lh*LL*Lh,
// thht
p(9) - Ft*FF*Fh*FF*Fh*FF*Ft - Ft*FF*Fh*FF*Fh*FL*Lt - Ft*FF*Fh*FL*Lh*LL*Lt
     - Ft*FL*Lh*LL*Lh*LL*Lt,
// thth
p(10)- Ft*FF*Fh*FF*Ft*FF*Fh - Ft*FF*Fh*FF*Ft*FL*Lh - Ft*FF*Fh*FL*Lt*LL*Lh
     - Ft*FL*Lh*LL*Lt*LL*Lh,
// thtt
p(11)- Ft*FF*Fh*FF*Ft*FF*Ft - Ft*FF*Fh*FF*Ft*FL*Lt - Ft*FF*Fh*FL*Lt*LL*Lt
     - Ft*FL*Lh*LL*Lt*LL*Lt,
// tthh
p(12)- Ft*FF*Ft*FF*Fh*FF*Fh - Ft*FF*Ft*FF*Fh*FL*Lh - Ft*FF*Ft*FL*Lh*LL*Lh
     - Ft*FL*Lt*LL*Lh*LL*Lh,
// ttht
p(13)- Ft*FF*Ft*FF*Fh*FF*Ft - Ft*FF*Ft*FF*Fh*FL*Lt - Ft*FF*Ft*FL*Lh*LL*Lt
     - Ft*FL*Lt*LL*Lh*LL*Lt,
// ttth
p(14)- Ft*FF*Ft*FF*Ft*FF*Fh - Ft*FF*Ft*FF*Ft*FL*Lh - Ft*FF*Ft*FL*Lt*LL*Lh
     - Ft*FL*Lt*LL*Lt*LL*Lh,
// tttt
p(15)- Ft*FF*Ft*FF*Ft*FF*Ft - Ft*FF*Ft*FF*Ft*FL*Lt - Ft*FF*Ft*FL*Lt*LL*Lt
     - Ft*FL*Lt*LL*Lt*LL*Lt;
> ideal j = std(i);
> eliminate(j, FF*FL*LL*Fh*Ft*Lh*Lt);

The output is a list of 53 generating invariants. ♦


4.8 Statistical Inference

We explain the concept of statistical inference for algebraic statistical models with observed and hidden random variables. In this kind of model, we know the content of the observed data but nothing about the content of the hidden data. The task is then to find the most likely set of data of the hidden random variables given the set of data of the observed random variables. This problem is known as the statistical inference problem, and the most likely set of data of the hidden random variables is referred to as an explanation of the observed data.

Consider an algebraic statistical model with hidden and observed variables such that the probability of the observed sequence τ is given by

pτ = Σσ pσ,τ , (4.44)

where pσ,τ is the probability of having the data σ at the hidden variables and the data τ at the observed variables. Thus the probability of the observed sequence τ is the marginalization over all possible values of the hidden variables. Finding an explanation of the model means identifying the set of hidden data σ̂ with maximum a posteriori probability of generating the observed data τ,

σ̂ = argmaxσ {pσ,τ}. (4.45)

By putting wσ,τ = − log(pσ,τ), the tropicalization of the marginal probability (4.44) yields

wτ = ⊕σ wσ,τ . (4.46)

Thus the explanation σ̂ is given by evaluation in the tropical algebra,

σ̂ = argminσ {wσ,τ}. (4.47)

We generalize the machinery of maximum a posteriori probability estimation by allowing the observed data to include parameters. For this, consider an algebraic statistical model

f : Rd → Rm : θ 7→ (f1(θ), . . . , fm(θ)).

Suppose there is an associated density function

g(θ) = Σ_{i=1}^{M} θ1^{vi1} · · · θd^{vid} , (4.48)

where vi = (vi1, . . . , vid) ∈ N0^d, 1 ≤ i ≤ M. For a fixed value of θ ∈ R^d_{>0}, the problem is to find a term θ1^{vj1} · · · θd^{vjd}, 1 ≤ j ≤ M, in the expression g(θ) with maximum value,

j = argmaxi {θ1^{vi1} · · · θd^{vid}}. (4.49)

Each such solution is called an explanation of the model. By putting wi = − log θi and w = (w1, . . . , wd), we obtain

− log(θ1^{vi1} · · · θd^{vid}) = −[vi1 log(θ1) + · · · + vid log(θd)] = 〈vi, w〉. (4.50)

This amounts to finding a vector vj that minimizes the linear expression

〈vj, w〉 = Σ_{i=1}^{d} wi vji ,   1 ≤ j ≤ M. (4.51)

This minimization problem is equivalent to the linear programming problem

min 〈x, w〉   s.t.   x ∈ NP(g). (4.52)

To see this, observe that the Newton polytope NP(g) of the polynomial g is the convex hull of the points vi, 1 ≤ i ≤ M, and the vertices of this polytope form a subset of these points. But the minimal value of a linear functional x ↦ 〈x, w〉 over a polytope is attained at a vertex of the polytope. Thus we have shown the following assertion.

Proposition 4.20. For a fixed parameter w, the problem of solving the statistical inference problem (4.49) is equivalent to the linear programming problem of minimizing the linear functional x ↦ 〈x, w〉 over the Newton polytope NP(g).

The parametric version of this problem asks for the set of parameters w for which the vertex vj gives the explanation. That is, we seek the set of all points w such that the linear functional x ↦ 〈x, w〉 attains its minimum at the point vj. By definition, this set is given by N_{NP(g)}(vj), the normal cone of the polytope NP(g) at the vertex vj.

Proposition 4.21. The set of all parameters w for which the vertex vj provides the explanation in the algebraic statistical model given by the density (4.48) is equal to the normal cone of the polytope NP(g) at the vertex vj.

The normal cones associated with the vertices of the Newton polytope NP(g) are part of the normal fan N_{NP(g)} of the Newton polytope NP(g). The normal fan provides a decomposition of the parameter space into regions, but only the regions corresponding to the vertices of the polytope are relevant for statistical inference. An example of parametric statistical inference is given in the section on parametric sequence alignment.
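For a fixed parameter point w the explanation can be computed directly by scanning the exponent vectors; the Newton polytope is only needed when w is allowed to vary. A tiny sketch (Python; the exponent vectors below are made up for illustration):

# exponent vectors v_i of a hypothetical density g and a fixed parameter point
V = [(3, 0), (2, 1), (1, 3), (0, 4)]      # points whose convex hull is NP(g)
w = (0.7, 0.4)                            # w_i = -log(theta_i) for some theta in R^d_{>0}

# the explanation is the exponent vector minimizing the linear functional <v, w>
explanation = min(V, key=lambda v: sum(vi * wi for vi, wi in zip(v, w)))
print(explanation)                        # -> (0, 4) for this choice of w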


5

Sequence Alignment

A fundamental task in computational biology is the alignment of DNA or amino acid sequences. The primary objective of biological sequence alignment is to find positions in the sequences that are homologous; that is, the symbols at those positions are derived from the same position in some ancestral sequence. Due to evolution, two homologous positions may have different states and may be located at different positions. An alignment is biologically correct if it matches up all positions that are truly homologous. Unfortunately, the biological truth is in most cases unknown. Therefore, a guess is made by treating sequence alignment as an optimization problem. The corresponding objective function assigns a score to each alignment according to a scoring scheme. Then an alignment is sought that maximizes this score. But which scoring scheme should be used to predict biologically correct alignments? For this, the scoring scheme needs to be analyzed over all possible parameter values.

This chapter introduces an algebraic statistical model for pairwise sequence alignment. The interpretation of its marginal probabilities in the tropical algebra will lead to a formalization of the alignment problem as an optimization problem, and the interpretation in the polytope algebra will allow us to analyze scoring schemes over all possible parameter values.

5.1 Sequence Alignment

We take a finite alphabet Σ with l letters and an additional symbol “−”, called the blank, and call Σ ∪ {−} the extended alphabet. We consider two sequences σ1 = σ1_1 . . . σ1_m and σ2 = σ2_1 . . . σ2_n over the alphabet Σ.

An alignment of the sequences σ1 and σ2 is a pair of aligned sequences (µ1, µ2) over the extended alphabet Σ ∪ {−} such that both sequences µ1 and µ2 have the same length and are copies of σ1 and σ2 with inserted blanks, respectively. An alignment (µ1, µ2) does not allow blanks at the same position. It follows that the aligned sequences have length at most m + n.

Example 5.1. Consider the sequences σ1 = ACGTAGC and σ2 = ACCGAGACC. An alignment of these sequences is given by

µ1 = A C − G − T A − G C

µ2 = A C C G A G A C − C


An alignment of maximal length is

A C G T A G C − − − − − − − − −
− − − − − − − A C C G A G A C C

An alignment of a pair of sequences (σ1, σ2) can also be represented by a string h over the edit alphabet {H, I, D}. The string h is called an edit string, and the letters of the edit alphabet stand for homology (H), insertion (I), and deletion (D). A letter I stands for an insertion (indel) in the first sequence σ1, a letter D stands for a deletion (indel) in the first sequence σ1, and a letter H stands for a character change (mutation or mismatch) including the identity change (match). We write #H, #I, and #D for the respective numbers of instances of H, I, and D in an edit string for an alignment of the pair (σ1, σ2). Then we have

#H +#D = m and #H +#I = n. (5.1)

Example 5.2. Reconsider the sequences σ1 = ACGTAGC and σ2 = ACCGAGACC. An alignment of these sequences including the edit string is given by

h  = H H I H I H H I D H
µ1 = A C − G − T A − G C
µ2 = A C C G A G A C − C

We have #H = 6, #I = 3, and #D = 1. ♦

Proposition 5.3. A string over the edit alphabet {H, I, D} represents an alignment of an m-letter sequence σ1 and an n-letter sequence σ2 if and only if (5.1) holds.

Proof. Given an alignment of the pair (σ1, σ2), we form an edit string h from left to right. Each symbol in σ1 either corresponds to a symbol in σ2, in which case we record an H in the edit string, or it gets deleted, in which case we record a D. This shows that the first equation in (5.1) holds. Each symbol in σ2 either corresponds to a symbol in σ1, in which case we have already recorded an H in the edit string, or it gets inserted, in which case we record an I. This shows that the second equation in (5.1) holds.

Conversely, each edit string h with the property (5.1), when read from left to right, produces an alignment of the pair (σ1, σ2). ⊓⊔

We write Am,n for the set of all strings over the edit alphabet {H, I, D} that satisfy the equations (5.1). We call Am,n the set of all alignments of the sequences σ1 and σ2 in spite of the fact that it only depends on m and n and not on the specific sequences. The cardinality of the set Am,n is called a Delannoy number (Fig. 5.2).

Proposition 5.4. The cardinality of the set Am,n can be computed as the coefficient of the monomial x^m y^n in the generating function 1/(1 − x − y − xy).

Proof. Consider the expansion of the generating function

1/(1 − x − y − xy) = Σ_{m=0}^{∞} Σ_{n=0}^{∞} am,n x^m y^n. (5.2)

5.1 Sequence Alignment 101

The coefficients are characterized by the linear recurrence

am,n = am−1,n + am,n−1 + am−1,n−1, m ≥ 0, n ≥ 0,m+ n ≥ 1, (5.3)

with initial conditions a0,0 = 1, am,−1 = 0, and a−1,n = 0. The same recurrence holds for the cardinality of Am,n. To see this, note that for nonnegative integers m and n with m + n ≥ 1, each string in Am,n is either a string in Am−1,n−1 followed by an H, or a string in Am−1,n followed by a D, or a string in Am,n−1 followed by an I (Fig. 5.1). Moreover, A0,0 has only one element, the empty string, and Am,n is the empty set if m < 0 or n < 0. Thus the coefficient am,n and the cardinality of Am,n satisfy the same initial conditions and the same recurrence. It follows that they must be equal. ⊓⊔
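The recurrence (5.3) translates directly into a table computation. A short sketch (Python, illustrative) that reproduces the values in Fig. 5.2, e.g. |A3,3| = 63:

def delannoy(m, n):
    """Number of alignments |A_{m,n}| via the recurrence (5.3)."""
    a = [[0] * (n + 1) for _ in range(m + 1)]
    a[0][0] = 1
    for i in range(m + 1):
        for j in range(n + 1):
            if i + j >= 1:
                a[i][j] = ((a[i-1][j] if i > 0 else 0)
                           + (a[i][j-1] if j > 0 else 0)
                           + (a[i-1][j-1] if i > 0 and j > 0 else 0))
    return a[m][n]

print(delannoy(3, 3), delannoy(4, 4))   # 63 and 321, as in Fig. 5.2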

Fig. 5.1. Three possibilities for strings in Am,n.

am,n     0    1     2      3      4       5        6        7         8         9        10
 0       1    1     1      1      1       1        1        1         1         1         1
 1       1    3     5      7      9      11       13       15        17        19        21
 2       1    5    13     25     41      61       85      113       145       181       221
 3       1    7    25     63    129     231      377      575       833      1159      1561
 4       1    9    41    129    321     681     1289     2241      3649      5641      8361
 5       1   11    61    231    681    1683     3653     7183     13073     22363     36365
 6       1   13    85    377   1289    3653     8989    19825     40081     75517    134245
 7       1   15   113    575   2241    7183    19825    48639    108545    224143    433905
 8       1   17   145    833   3649   13073    40081   108545    265729    598417   1256465
 9       1   19   181   1159   5641   22363    75517   224143    598417   1462563   3317445
10       1   21   221   1561   8361   36365   134245   433905   1256465   3317445   8097453

Fig. 5.2. The first hundred Delannoy numbers.

The alignment graph of an m-letter sequence and an n-letter sequence is a directed graph Gm,n on the set of nodes {0, 1, . . . , m} × {0, 1, . . . , n} with three classes of edges: edges (i, j) → (i, j + 1) are labelled I, edges (i, j) → (i + 1, j) are labelled D, and edges (i, j) → (i + 1, j + 1) are labelled H.

Proposition 5.5. The set of all alignments Am,n corresponds one-to-one with the set of all paths from the node (0, 0) to the node (m, n) in the alignment graph Gm,n.

Proof. Given an alignment in Am,n by its edit string h, the string h provides a path in Gm,n starting from the node (0, 0). By (5.1), this path terminates at the node (m, n).

Conversely, given a path in Gm,n from (0, 0) to (m, n), the labelling of the path provides a string h over the edit alphabet that satisfies (5.1). By Prop. 5.3, the string h is an edit string corresponding to an alignment in Am,n. ⊓⊔


Example 5.6. Consider the sequences σ1 = ACG and σ2 = ACC. The edit string h = HHID provides the alignment

h  = H H I D
µ1 = A C − G
µ2 = A C C −

This alignment can be traced by the solid path in the alignment graph G3,3 (Fig. 5.3). ♦

Fig. 5.3. The alignment graph G3,3 and the path corresponding to the alignment in Ex. 5.6.

5.2 Scoring Schemes

We introduce scores for alignments. For this, we need a scoring scheme defined by a pair of mappings

w : (Σ ∪ {−}) × (Σ ∪ {−}) → R   and   w′ : {H, I, D} × {H, I, D} → R. (5.4)

Take two sequences σ1 and σ2 over the alphabet Σ. An alignment of the pair (σ1, σ2) is given by a pair of sequences (µ1, µ2) over the extended alphabet that can be fully represented by an edit string h over the edit alphabet. The weight of the alignment h is defined as

W(h) = Σ_{i=1}^{|h|} w(µ1_i, µ2_i) + Σ_{i=2}^{|h|} w′(h_{i−1}, h_i), (5.5)

where |h| denotes the length of the string h. Thus the weight of an alignment is given by the sum of the column scores of the aligned sequences plus the sum of the scores of consecutive letters of the edit string.
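Definition (5.5) is a plain sum over the columns of the aligned pair and over consecutive letters of the edit string. A minimal sketch (Python; w and w′ are stored as dictionaries and only the entries needed for the alignment of Ex. 5.2 are filled in, with made-up numerical values):

def weight(mu1, mu2, h, w, wp):
    """Weight W(h) of an alignment, cf. (5.5)."""
    score = sum(w[(a, b)] for a, b in zip(mu1, mu2))       # column scores
    score += sum(wp[(x, y)] for x, y in zip(h, h[1:]))     # consecutive edit letters
    return score

# alignment of Ex. 5.2; the numerical scores below are hypothetical
mu1, mu2, h = "AC-G-TA-GC", "ACCGAGAC-C", "HHIHIHHIDH"
w = {("A","A"): -3, ("C","C"): -3, ("G","G"): -3, ("T","G"): 1,
     ("-","C"): 2, ("-","A"): 2, ("G","-"): 2}
wp = {(x, y): 0 for x in "HID" for y in "HID"}             # w' = 0 for simplicity
print(weight(mu1, mu2, h, w, wp))                          # prints -6 for these values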


Assume that the sequences σ1 and σ2 are defined over the DNA alphabet Σ = {A, C, G, T}. We may represent the scoring scheme (w, w′) by a pair of matrices,

w = ( wA,A wA,C wA,G wA,T wA,−
      wC,A wC,C wC,G wC,T wC,−
      wG,A wG,C wG,G wG,T wG,−
      wT,A wT,C wT,G wT,T wT,−
      w−,A w−,C w−,G w−,T      )        (5.6)

and

w′ = ( w′H,H w′H,I w′H,D
       w′I,H w′I,I w′I,D
       w′D,H w′D,I w′D,D ) .        (5.7)

The lower right entry w−,− in the matrix w is left out because it is never used by our convention. It follows that the total number of parameters in the alignment problem is 24 + 9 = 33. We identify the parameter space with the Euclidean space R33. Thus each alignment h ∈ Am,n gives rise to a mapping W(h) : R33 → R.

Example 5.7. Consider the alignment of the sequences σ1 = ACGTAGC and σ2 = ACCGAGACC given by the edit string h = HHIHIHHIDH (Ex. 5.2). The weight of this alignment is the linear expression

W(h) = 2 · wA,A + 2 · wC,C + 1 · wG,G + 1 · wT,G + 2 · w−,C + 1 · w−,A + 1 · wG,−
       + 2 · w′H,H + 3 · w′H,I + 2 · w′I,H + 1 · w′I,D + 1 · w′D,H.

Example 5.8 (Maple). We compute symbolically the weight of an alignment. For simplicity, we put w′ = 0 and consider the sequences

s1 := [A,C,G]: s2 := [A,C,C]:

The scoring scheme is given by the matrix w and can be defined as

w := array([ [ tAA, tAC, tAG, tAT, t_A ],

[ tCA, tCC, tCG, tCT, t_C ],

[ tGA, tGC, tGG, tGT, t_G ],

[ tTA, tTC, tTG, tTT, t_T ],

               [ tA_, tC_, tG_, tT_, 0 ] ]):   # the (-,-) entry is never used

Assume that the alignment is given by the edit string

h := [H,H,D,I]:

First, it must be checked that the string h describes an alignment. For this, we need to show that (5.1) holds:


l := nops(h): m := nops(s1): n := nops(s2):

nH := 0: nI := 0: nD := 0:

for i from 1 to l do

if h[i] = H then nH := nH + 1

elif h[i] = D then nD := nD + 1

else nI := nI + 1

end if

end do;

if nH + nD = m and nH + nI = n then

print("h defines alignment");

end if:

Second, the aligned sequences are established

i1 := 1: i2 := 1:

a1 := []; a2 := [];

for i from 1 to l do

if h[i] = H then

a1 := [op(a1), s1[i1]]; i1 := i1 + 1;

a2 := [op(a2), s2[i2]]; i2 := i2 + 1

elif h[i] = D then

a1 := [op(a1), s1[i1]]; i1 := i1 + 1;

a2 := [op(a2), _]

else   # h[i] = I

a1 := [op(a1), _];

a2 := [op(a2), s2[i2]]; i2 := i2 + 1

end if

end do:

a1, a2;

The last command provides the aligned sequences

[A,C,G,_]

[A,C,_,C]

Third, the weight of the alignment is calculated

u1 := subs( { A=1, C=2, G=3, T=4, _=5 }, a1);
u2 := subs( { A=1, C=2, G=3, T=4, _=5 }, a2);

W_h := 0:

for i from 1 to l do

W_h := W_h + w[u1[i],u2[i]]

end do:

expand (W_h);

The last command outputs the alignment weight

tAA + tCC + t_G + tC_


Let σ1 and σ2 be sequences of length m and n over the alphabet Σ, respectively. Given a scoring scheme (w, w′), the alignment problem is to compute alignments h ∈ Am,n of the pair (σ1, σ2) that have minimum weight W(h) among all alignments in Am,n. These alignments are called optimal. The problem is thus to solve the optimization problem

min W(h)   s.t.   h ∈ Am,n. (5.8)

Sometimes we simplify the alignment problem by assuming that w′ = 0. Then the weight of an alignment (µ1, µ2) given by the edit string h is the linear functional

W(h) = Σ_{i=1}^{|h|} w(µ1_i, µ2_i). (5.9)

The alignment problem can be interpreted in terms of the alignment graph. For this, the edges of the alignment graph Gm,n are weighted by scores:

(i, j) → (i, j + 1) with weight w_{−,σ2_{j+1}},   (i, j) → (i + 1, j) with weight w_{σ1_{i+1},−},   (i, j) → (i + 1, j + 1) with weight w_{σ1_{i+1},σ2_{j+1}}.

This decorated graph is called the weighted alignment graph with respect to the scoring scheme (w, w′ = 0).

Proposition 5.9. Let σ1 and σ2 be sequences over Σ of length m and n, respectively. The problem of finding the optimal alignments for the pair of sequences (σ1, σ2) with respect to (w, w′ = 0) is equivalent to finding the minimum weight paths from the node (0, 0) to the node (m, n) in the weighted alignment graph Gm,n with respect to (w, w′ = 0).

Proof. By Prop. 5.5, the alignments of the pair of sequences (σ1, σ2) correspond one-to-one to the paths from (0, 0) to (m, n) in the graph Gm,n. Moreover, by definition, the weight of an alignment given in (5.9) equals the weight of the corresponding path from (0, 0) to (m, n) in the weighted graph. ⊓⊔

Example 5.10. Take the sequences σ1 = ACG and σ2 = ACC, and use the scoring scheme given by

w = (  3 −1 −1 −1 −2
      −1  3 −1 −1 −2
      −1 −1  3 −1 −2
      −1 −1 −1  3 −2
      −2 −2 −2 −2    )   and   w′ = 0.

That is, matches are scored with +3, mismatches are scored with −1, and indels are scored with −2. The alignment (µ1, µ2) = (AC−G, ACC−) is given by the edit string h = HHID (Ex. 5.6) and has the score

W(h) = w(A, A) + w(C, C) + w(−, C) + w(G, −) = 3 + 3 − 2 − 2 = 2.

The weighted alignment graph and the path corresponding to this alignment are shown in Fig. 5.4. ♦


Fig. 5.4. A weighted alignment graph and the path corresponding to the alignment in Ex. 5.6.

5.3 Pair Hidden Markov Model

We show that the sequence alignment problem can be interpreted as an algebraic statistical model. For simplicity, we restrict our attention to the DNA alphabet Σ = {A, C, G, T}. The pair hidden Markov model for the set of alignments Am,n over Σ is the algebraic statistical model

f : R33 → R^{4^{m+n}} : (θ, θ′) ↦ (fσ1,σ2). (5.10)

The model has 4^{m+n} states that correspond to all pairs of sequences σ1 and σ2 of length m and n over Σ, respectively. Moreover, the model has 24 + 9 = 33 parameters that are written as a pair of matrices (θ, θ′) as follows,

θ = ( θA,A θA,C θA,G θA,T θA,−
      θC,A θC,C θC,G θC,T θC,−
      θG,A θG,C θG,G θG,T θG,−
      θT,A θT,C θT,G θT,T θT,−
      θ−,A θ−,C θ−,G θ−,T      )        (5.11)

and

θ′ = ( θ′H,H θ′H,I θ′H,D
       θ′I,H θ′I,I θ′I,D
       θ′D,H θ′D,I θ′D,D ) ,        (5.12)

where the lower right entry θ−,− in the matrix θ is left out as it is never used by our convention. The parameter space of the model is the product of six simplices of dimensions 15, 3, 3, 2, 2, and 2,

Θ = ∆15 ×∆3 ×∆3 ×∆2 ×∆2 ×∆2 ⊆ R33. (5.13)


The big simplex ∆15 consists of all non-negative 4 × 4 matrices (θij)i,j∈Σ whose entries sum up to 1. The two tetrahedra ∆3 come from the equalities

θ−,A + θ−,C + θ−,G + θ−,T = θA,− + θC,− + θG,− + θT,− = 1. (5.14)

The three triangles ∆2 provide the equalities

θ′H,H + θ′H,I + θ′H,D = θ′I,H + θ′I,I + θ′I,D = θ′D,H + θ′D,I + θ′D,D = 1. (5.15)

The coordinate function fσ1,σ2 of the pair hidden Markov model represents the marginal probability of observing the aligned pair of sequences σ1 and σ2 and is given by

fσ1,σ2 = Σ_{h∈Am,n} Π_{i=1}^{|h|} θ_{µ1_i,µ2_i} · Π_{i=2}^{|h|} θ′_{h_{i−1},h_i} , (5.16)

where (µ1, µ2) is the pair of aligned sequences over Σ ∪ {−} which corresponds to the edit string h ∈ Am,n.

Example 5.11. Consider the alignment of the sequences σ1 = ACGTAGC and σ2 = ACCGAGACC given by the edit string h = HHIHIHHIDH (Ex. 5.2). The string h corresponds to the following term in the marginal probability fσ1,σ2,

θ²A,A · θ²C,C · θ²−,C · θG,G · θ−,A · θT,G · θG,− · (θ′H,H)² · (θ′H,I)³ · (θ′I,H)² · θ′I,D · θ′D,H. ♦

Proposition 5.12. The alignment problem (5.8) for the pair of sequences (σ1, σ2) is the tropicalization of the marginal probability fσ1,σ2 of the pair hidden Markov model.

Proof. We apply the tropicalization map to the marginal probability fσ1,σ2. For this, we put wij = − log θij and w′XY = − log θ′XY, where i, j ∈ Σ ∪ {−} and X, Y ∈ {H, D, I}, and replace the outer sum by a tropical sum and the inner products by tropical products. In this way, we obtain the tropical polynomial

trop(fσ1,σ2) = ⊕_{h∈Am,n} ( ⊙_{i=1}^{|h|} w_{µ1_i,µ2_i} ⊙ ⊙_{i=2}^{|h|} w′_{h_{i−1},h_i} ). (5.17)

For each alignment h ∈ Am,n, the corresponding tropical product in (5.17) equals the weight of the alignment h,

W(h) = ⊙_{i=1}^{|h|} w_{µ1_i,µ2_i} ⊙ ⊙_{i=2}^{|h|} w′_{h_{i−1},h_i}. (5.18)

Since tropical addition is associated with the formation of minima, the tropical polynomial trop(fσ1,σ2) corresponds to the alignment problem for the pair of sequences (σ1, σ2),

trop(fσ1,σ2) = min_{h∈Am,n} W(h). (5.19)

⊓⊔

It follows that the evaluation of the marginal probability fσ1,σ2 solves the alignment problem for the sequences σ1 and σ2. However, this is only practical for short sequences.


5.4 Sum-Product Decomposition

We show that the marginal probabilities of the pair hidden Markov model can be efficiently calculated by a sum-product decomposition. For this, let σ1 = σ1_1 . . . σ1_m and σ2 = σ2_1 . . . σ2_n be DNA sequences. Let σ1_{≤i} denote the prefix σ1_1 . . . σ1_i of σ1, 1 ≤ i ≤ m, and let σ2_{≤j} be the prefix σ2_1 . . . σ2_j of σ2, 1 ≤ j ≤ n. Let MX(i, j) be the probability of observing the aligned pair of sequences σ1_{≤i} and σ2_{≤j} such that X is the last symbol in the corresponding edit string. Then the marginal probability fσ1,σ2 can be decomposed as follows,

fσ1,σ2 = Σ_X MX(m, n), (5.20)

where

MI(i, j) = θ_{−,σ2_j} · Σ_X MX(i, j − 1) · θ′_{X,I}, (5.21)
MD(i, j) = θ_{σ1_i,−} · Σ_X MX(i − 1, j) · θ′_{X,D}, (5.22)
MH(i, j) = θ_{σ1_i,σ2_j} · Σ_X MX(i − 1, j − 1) · θ′_{X,H}, (5.23)

and

MX(0, 0) = 1,   X ∈ {H, I, D}, (5.24)
MX(0, j) = 1,   X ∈ {H, D}, (5.25)
MX(i, 0) = 1,   X ∈ {H, I}, (5.26)
MI(0, j) = θ_{−,σ2_1} · Π_{k=2}^{j} θ′_{I,I} · θ_{−,σ2_k}, (5.27)
MD(i, 0) = θ_{σ1_1,−} · Π_{k=2}^{i} θ′_{D,D} · θ_{σ1_k,−}. (5.28)

Note that three cases can occur for the alignment of the prefixes σ1_{≤i} and σ2_{≤j}, as given by the equations (5.21)–(5.23) (see Fig. 5.5).

Fig. 5.5. The alignment of the prefixes σ1_{≤i} and σ2_{≤j}.

Proposition 5.13. The evaluation of the marginal probability fσ1,σ2 in (5.20) requires O(mn) steps.


Proof. The array MX(i, j), 0 ≤ i ≤ m, 0 ≤ j ≤ n, X ∈ {H, D, I}, has 3(m + 1)(n + 1) entries and each entry is computed by a constant number of operations. ⊓⊔

Example 5.14 (Maple). We calculate the marginal probability fσ1,σ2 by using Maple. For simplicity, we put w′ = 0 and consider the model f : R24 → R^{4^{m+n}}. Then the matrix θ′ specializes to

θ′ = ( 1 1 1
       1 1 1
       1 1 1 ) .

Take the sequences

s1 := [A,C,G]: s2 := [A,C,C]:

and provide the 24 parameters by the matrix θ defined as

T := array([ [ tAA, tAC, tAG, tAT, t_A ],

[ tCA, tCC, tCG, tCT, t_C ],

[ tGA, tGC, tGG, tGT, t_G ],

[ tTA, tTC, tTG, tTT, t_T ],

              [ tA_, tC_, tG_, tT_, 0 ] ]):   # the (-,-) entry is never used

We initialize

m := nops(s1):

n := nops(s2):

u1 := subs( { A=1, C=2, G=3, T=4 }, s1);

u2 := subs( { A=1, C=2, G=3, T=4 }, s2);

blank := 5:

and obtain by ordinary arithmetics on polynomials:

M := array( [], 0..m, 0..n):

M[0,0] := 1;

for i from 1 to m do

M[i,0] := M[i-1,0] * T[u1[i],blank]

od:

for j from 1 to n do

M[0,j] := M[0,j-1] * T[blank,u2[j]]

od:

for i from 1 to m do

for j from 1 to n do

M[i,j] := M[i-1,j] * T[u1[i],blank]

+ M[i,j-1] * T[blank,u2[j]]

+ M[i-1,j-1] * T[u1[i],u2[j]]

od:

od:

lprint( expand(M[m,n]) );


This code produces the marginal probability fACG,ACC:

20*tC_^2*t_A*t_C*t_G*tA_

+ 6*tC_^2*t_G*t_C*tAA

+ 3*tC_^2*t_G*t_A*tCA

+ tC_^2*t_A*t_C*tGA

+ 4*tC_*t_G*t_C*tA_*tAC

+ 7*tC_*t_G*tCC*t_A*tA_

+ 3*tC_*t_G*tCC*tAA

+ 9*tC_*tGC*t_A*t_C*tA_

+ 3*tC_*tGC*t_C*tAA

+ 2*t_C*tGC*tA_*tCA

+ t_G*tCC*tA_*tAC

+ tGC*t_C*tA_*tAC

+ 2*tGC*tCC*t_A*tA_

+ tGC*tCC*tAA.

This polynomial has 14 terms and each term stands for an alignment. Moreover, the sum of all coefficients equals the total number of alignments, |A3,3| = 63. For instance, the term 2*t_C*tGC*tA_*tCA indicates the alignments

A C G −        A C − G
− A C C   and  − A C C

5.5 Optimal Alignment

The tropicalized marginal probability fσ1,σ2 of the pair hidden Markov model corresponds to the alignment problem for the pair of sequences (σ1, σ2). We can compute the tropicalized marginal probability fσ1,σ2 by tropicalizing its sum-product decomposition. For this, we put ΦX(i, j) = − log MX(i, j), wij = − log θij, and w′XY = − log θ′XY for X, Y ∈ {H, D, I}. By replacing sums by tropical sums and products by tropical products, we obtain

trop(fσ1,σ2) = ⊕_X ΦX(m, n), (5.29)

where

ΦI(i, j) = w_{−,σ2_j} ⊙ ⊕_X ΦX(i, j − 1) ⊙ w′_{X,I}, (5.30)
ΦD(i, j) = w_{σ1_i,−} ⊙ ⊕_X ΦX(i − 1, j) ⊙ w′_{X,D}, (5.31)
ΦH(i, j) = w_{σ1_i,σ2_j} ⊙ ⊕_X ΦX(i − 1, j − 1) ⊙ w′_{X,H}, (5.32)

and

ΦX(0, 0) = 0,   X ∈ {H, I, D}, (5.33)
ΦX(0, j) = 0,   X ∈ {H, D}, (5.34)
ΦX(i, 0) = 0,   X ∈ {H, I}, (5.35)
ΦI(0, j) = w_{−,σ2_1} ⊙ ⊙_{k=2}^{j} w′_{I,I} ⊙ w_{−,σ2_k}, (5.36)
ΦD(i, 0) = w_{σ1_1,−} ⊙ ⊙_{k=2}^{i} w′_{D,D} ⊙ w_{σ1_k,−}. (5.37)

5.6 Needleman-Wunsch Algorithm

The tropicalized sum-product decomposition of the marginal probability fσ1,σ2 corresponds to the alignment problem for the pair of sequences (σ1, σ2). This decomposition directly provides an efficient algorithm for computing the optimal alignments of the pair (σ1, σ2). The Needleman-Wunsch algorithm is the special case of this algorithm in which the 3 × 3 matrix w′ is zero (Alg. 5.1).

Algorithm 5.1 Needleman-Wunsch algorithm.

Require: sequences σ1 ∈ Σm, σ2 ∈ Σn and scoring scheme w ∈ R24
Ensure: alignment h ∈ Am,n with minimal weight W(h)
M ← matrix[0..m, 0..n]
M[0, 0] ← 0
for i ← 1 to m do
  M[i, 0] ← M[i − 1, 0] + w_{σ1_i,−}
end for
for j ← 1 to n do
  M[0, j] ← M[0, j − 1] + w_{−,σ2_j}
end for
for i ← 1 to m do
  for j ← 1 to n do
    M[i, j] ← min{ M[i − 1, j − 1] + w_{σ1_i,σ2_j}, M[i − 1, j] + w_{σ1_i,−}, M[i, j − 1] + w_{−,σ2_j} }
    color the edges directed to (i, j) that attain the minimum
  end for
end for
Trace a path in the backward direction from (m, n) to (0, 0) by following an arbitrary sequence of colored edges.
Output the edge labels in {H, I, D} of the given path in the forward direction.

Proposition 5.15. Let σ1 and σ2 be sequences over Σ of length m and n, respectively. The Needleman-Wunsch algorithm computes an optimal alignment of the pair (σ1, σ2) with respect to the scoring scheme (w, w′ = 0) and has running time O(mn).


Proof. By Prop. 5.12, the alignment problem for the pair (σ1, σ2) equals the tropicalization of the marginal probability fσ1,σ2. The sum-product decomposition of this tropicalized probability for the scoring scheme (w, w′ = 0) given by the equations (5.29) to (5.37) corresponds one-to-one with the Needleman-Wunsch algorithm.

The computation of the (m + 1) × (n + 1) array M requires O(mn) steps. The coloring of the edges takes constant time, and both the tracing of a colored path and the output of the aligned sequences take O(m + n) steps. ⊓⊔

The Needleman-Wunsch algorithm follows the paradigm of dynamic programming introduced by Richard Bellman in the 1950s. This is a method for solving a complex problem by breaking it down into a collection of subproblems. It exploits subproblem overlap and optimal substructure to solve a problem in much less time than methods that cannot take advantage of these characteristics.
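The following sketch (Python, independent of the Maple session in the next example) implements Alg. 5.1 with w′ = 0, including the backward trace; for the scoring matrix of Ex. 5.16 it reports the optimal score −5 for the sequences ACG and ACC:

def needleman_wunsch(s1, s2, w, gap):
    """Alg. 5.1 with w' = 0: minimize the weight; w[a][b] scores column (a, b), gap scores an indel."""
    m, n = len(s1), len(s2)
    M = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        M[i][0] = M[i - 1][0] + gap
    for j in range(1, n + 1):
        M[0][j] = M[0][j - 1] + gap
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            M[i][j] = min(M[i - 1][j - 1] + w[s1[i - 1]][s2[j - 1]],   # H
                          M[i - 1][j] + gap,                            # D
                          M[i][j - 1] + gap)                            # I
    # trace one optimal path backwards from (m, n) to (0, 0)
    h, i, j = [], m, n
    while i > 0 or j > 0:
        if i > 0 and j > 0 and M[i][j] == M[i - 1][j - 1] + w[s1[i - 1]][s2[j - 1]]:
            h.append("H"); i, j = i - 1, j - 1
        elif i > 0 and M[i][j] == M[i - 1][j] + gap:
            h.append("D"); i = i - 1
        else:
            h.append("I"); j = j - 1
    return M[m][n], "".join(reversed(h))

w = {a: {b: (-3 if a == b else 1) for b in "ACGT"} for a in "ACGT"}
print(needleman_wunsch("ACG", "ACC", w, gap=2))   # (-5, 'HHH')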

Example 5.16 (Maple). Take the sequences

s1 := [A,C,G]: s2 := [A,C,C]:

and provide the 24 parameters by the matrix w defined as

T := array([ [ -3, 1, 1, 1, 2 ],

[ 1, -3, 1, 1, 2 ],

[ 1, 1, -3, 1, 2 ],

[ 1, 1, 1, -3, 2 ],

              [ 2, 2, 2, 2, 0 ] ]):   # the (-,-) entry is never used

We initialize

m := nops(s1):

n := nops(s2):

u1 := subs( { A=1, C=2, G=3, T=4 }, s1);

u2 := subs( { A=1, C=2, G=3, T=4 }, s2);

blank := 5:

and obtain by ordinary arithmetics

M := array( [], 0..m, 0..n):

M[0,0] := 0;

for i from 1 to m do

M[i,0] := M[i-1,0] + T[u1[i],blank]

od:

for j from 1 to n do

M[0,j] := M[0,j-1] + T[blank,u2[j]]

od:

for i from 1 to m do

for j from 1 to n do

M[i,j] := min( M[i-1,j] + T[u1[i],blank],

M[i,j-1] + T[blank,u2[j]],

M[i-1,j-1] + T[u1[i],u2[j]] )

od:

od:

lprint( M[m,n] );


The last command provides the score −5. The entries of the table M can be read out by the loop

for i from 1 to m do

for j from 1 to n do

printf("%2d ", M[i,j]);

od;

printf("\n");

od;

The weighted alignment graph in Fig. 5.6 illustrates the table output together with the colored edges. The graph exhibits exactly one colored path from (0, 0) to (3, 3): (0, 0) → (1, 1) → (2, 2) → (3, 3). This path amounts to the optimal alignment with score −5:

h  = H H H
µ1 = A C G
µ2 = A C C

Fig. 5.6. The weighted alignment graph G3,3. The nodes provide the values of the array M and the colored edges are indicated in bold.

5.7 Parametric Sequence Alignment

We have formalized the problem of DNA sequence alignment by a pair hidden Markov model with 33 parameters. If the scoring scheme is fixed, the sum-product decomposition given by the equations (5.29) to (5.37) can be used to compute the optimal alignments for any pair of DNA sequences. Now we want the scoring scheme to vary over all possible parameter values. We will see that the parameter space can be subdivided into regions such that parameter values in the same region give rise to the same optimal alignment. This subdivision can be attained by evaluating the coordinate functions of the pair hidden Markov model in the polytope algebra.

More concretely, we replace the parameter space R33 by the polynomial ring R[θ, θ′] in the 33 variables (θ, θ′) and consider each marginal probability fσ1,σ2 as a polynomial in this ring. Each such polynomial can be assigned a Newton polytope in the polytope algebra P33. Since each marginal probability fσ1,σ2 can be computed according to the sum-product decomposition given by the equations (5.20) to (5.28), the associated Newton polytope can be calculated by a corresponding sum-product decomposition in the polytope algebra P33. The evaluation of such a Newton polytope is called polytope propagation.

Let σ1 and σ2 be DNA sequences of length m and n, respectively. The polytope propagation algorithm for evaluating the marginal probability fσ1,σ2 in the polytope algebra P33 is given as follows:

NP(fσ1,σ2) = ⊕_X PX(m, n), (5.38)

where

PI(i, j) = NP(θ_{−,σ2_j}) ⊙ ⊕_X PX(i, j − 1) ⊙ NP(θ′_{X,I}), (5.39)
PD(i, j) = NP(θ_{σ1_i,−}) ⊙ ⊕_X PX(i − 1, j) ⊙ NP(θ′_{X,D}), (5.40)
PH(i, j) = NP(θ_{σ1_i,σ2_j}) ⊙ ⊕_X PX(i − 1, j − 1) ⊙ NP(θ′_{X,H}), (5.41)

and

PX(0, 0) = {0},   X ∈ {H, I, D}, (5.42)
PX(0, j) = {0},   X ∈ {H, D}, (5.43)
PX(i, 0) = {0},   X ∈ {H, I}, (5.44)
PI(0, j) = NP(θ_{−,σ2_1}) ⊙ ⊙_{k=2}^{j} NP(θ′_{I,I}) ⊙ NP(θ_{−,σ2_k}), (5.45)
PD(i, 0) = NP(θ_{σ1_1,−}) ⊙ ⊙_{k=2}^{i} NP(θ′_{D,D}) ⊙ NP(θ_{σ1_k,−}). (5.46)

By Prop. 4.21, the normal cones of the polytope NP(fσ1,σ2) corresponding to the vertices provide the explanations of the marginal probability fσ1,σ2.

Example 5.17 (Maple). We consider a simplified pair hidden Markov model for the alignment of DNA sequences with two parameters X and Y that correspond to matches, mismatches, and indels as follows:

θa,a = X,   a ∈ {A, C, G, T},
θa,b = Y,   a, b ∈ {A, C, G, T, −}, a ≠ b,
θ′X,Y = 1,   X, Y ∈ {H, I, D}.


Let σ1 and σ2 be DNA sequences of length m and n, respectively. We view X and Y as variables over R and write PX and PY for the corresponding Newton polytopes NP(X) = {(1, 0)} and NP(Y) = {(0, 1)}, respectively. In this case, the marginal probability fσ1,σ2 can be viewed as a polynomial in the polynomial ring R[X, Y]. The polytope propagation algorithm for evaluating the marginal probability fσ1,σ2 in the polytope algebra P2 has the shape

NP(fσ1,σ2) = P(m, n), (5.47)

where

P(i, j) = (P(i − 1, j − 1) ⊙ NP(θ_{σ1_i,σ2_j})) ⊕ (P(i − 1, j) ⊙ PY) ⊕ (P(i, j − 1) ⊙ PY) (5.48)

and

P(0, 0) = {(0, 0)}, (5.49)
P(i, 0) = P(i − 1, 0) ⊙ PY, (5.50)
P(0, j) = P(0, j − 1) ⊙ PY. (5.51)

Note that the polytopes P(i, 0) and P(0, j) are translates of polytopes by unit vectors, and the polytope P(i, j) is given by the convex hull of three translates of polytopes by unit vectors, 1 ≤ i ≤ m and 1 ≤ j ≤ n.

We implement this polytope propagation algorithm by using Maple. For this, we take the sequences

s1 := [A,T,C,G]: s2 := [T,C,G,G]:

use the linear algebra package

with(linalg):

and initialize as follows:

m := nops( s1 ):

n := nops( s2 ):

u1 := subs( { A=1, C=2, G=3, T=4 }, s1 ):

u2 := subs( { A=1, C=2, G=3, T=4 }, s2 ):

Px := vector( [1,0] ):

Py := vector( [0,1] ):

Each polytope in R2 is given by the convex hull of a finite set of points in R2, and this set or any superset of it can be viewed as a generating set of the polytope. In the Maple code, we represent each polytope P by a generating set.

The operations in the polytope algebra P2 can be implemented by using generating sets. To see this, let P and Q be polytopes in R2 with generating sets A and B, respectively. The sum P ⊕ Q has the generating set A ∪ B. Note that the union of two sets can be formed by the Maple operation union. The product P ⊙ Q has the generating set {a + b | a ∈ A, b ∈ B} and can be obtained as follows:

MinkowskiSum := proc ( P, Q )
  local R, p, q;
  R := {};
  for p in P do
    for q in Q do
      R := R union { matadd(p,q) }
    od
  od;
  return R;
end proc:

The Newton polytope of the marginal probability fATCG,TCGG can be established by polytope arithmetic as follows:

M := array( [], 0..m, 0..n ):
M[0,0] := { vector([0,0]) }:
for i from 1 to m do
  M[i,0] := MinkowskiSum( M[i-1,0], { Py } )
od:
for j from 1 to n do
  M[0,j] := MinkowskiSum( M[0,j-1], { Py } )
od:
for i from 1 to m do
  for j from 1 to n do
    M[i,j] := MinkowskiSum( M[i,j-1], { Py } ):
    M[i,j] := M[i,j] union MinkowskiSum( M[i-1,j], { Py } ):
    if u1[i] = u2[j] then
      M[i,j] := M[i,j] union MinkowskiSum( M[i-1,j-1], { Px } )
    else
      M[i,j] := M[i,j] union MinkowskiSum( M[i-1,j-1], { Py } )
    end if
  od
od:
M[m,n];

The last statement prints a generating set of the polytope P = NP(fATCG,TCGG):

[0,8],

[0,7],

[0,6], [1,6],

[0,5], [1,5],

[1,4], [2,4],

[1,3], [2,3],

[3,2].

The polytope P illustrated in Fig. 5.7 has the vertices (0, 8), (0, 5), (1, 3), and (3, 2). The normal cones of the vertices are exhibited in Fig. 5.8. The normal fan of the polytope decomposes the Euclidean 2-space into four cones corresponding to the vertices and four half rays associated to the edges (Fig. 5.9).

The normal cones of the vertices yield the optimal sequence alignments. For instance, the cone of the vertex (3, 2) is given by the intersection of two half-spaces defined by the inequalities −X + 2Y < 0


Fig. 5.7. The polytope P = NP(fATCG,TCGG).

and −2X + Y < 0. We take an arbitrary point in this cone, say (1, 0), and calculate an optimal alignment for the associated scoring scheme with X = 1 (matches) and Y = 0 (mismatches and indels) by the Needleman-Wunsch algorithm. This is an optimal alignment for all scoring schemes located in this normal cone. Note that this alignment is uniquely determined up to column permutations, since alignments are represented by monomials that are elements of a commutative polynomial ring (Ex. 5.14). All four optimal alignments are illustrated in Table 5.1. ♦


Fig. 5.8. The normal cones at the vertices of the polytope P = NP(fATCG,TCGG).

Table 5.1. The optimal alignments of the sequences ATCG and TCGG.

vertex   normal cone                   scoring scheme   alignment                      score
(0, 8)   −x + 2y > 0,  y > 0           (0, 1)           A T C − G / − T C G G             2
(0, 5)   −x + 2y > 0,  y < 0           (−3, −1)         A T C − G / − T C G G           −11
(1, 3)   −x + 2y < 0,  −2x + y > 0     (−2, −2)         A T C G − − − − / − − − − T C G G   −16
(3, 2)   −x + 2y < 0,  −2x + y < 0     (1, 0)           A T C G − − / − − T C G G          0


Fig. 5.9. The normal fan of the polytope NP(fATCG,TCGG), with rays −x + 2y = 0, −2x + y = 0, and y = 0.


6

Hidden Markov Models

The hidden Markov model is a statistical model in which the system modelled is a Markov chain with unknown parameters, and the challenge is to determine the hidden parameters from the observable data. Hidden Markov models were introduced for speech recognition in the 1960s and are now widely used in temporal pattern recognition.

We first introduce the fully observed Markov model in which the states are visible to the observer. Then we proceed to the hidden Markov model where the states are not observable, but each state has a probability distribution over the generated output data. This information can be used to determine the most likely state sequence that generated the output data.

6.1 Fully Observed Markov Model

We introduce a variant of the Markov chain model that will serve as a preliminary model for the hidden Markov model.

For this, we take an alphabet Σ with l symbols, an alphabet Σ′ with l′ symbols, and a positive integer n. We consider words σ = σ1 . . . σn and τ = τ1 . . . τn over Σ and Σ′ of length n, respectively. These words are used to label the entries of an integral block matrix

A(l,l′),n = (Aσ,τ )σ∈Σn,τ∈Σ′n , (6.1)

whose entries Aσ,τ are pairs of matrices (w, w′) such that w = (wrs) is an l × l matrix and w′ = (w′st) is an l × l′ matrix. The entry wrs = wrs(σ) counts the number of occurrences in σ of the length-2 word rs, and the entry w′st = w′st(σ, τ) counts the number of indices i, 1 ≤ i ≤ n, such that σi = s and τi = t.

We may view the matrices (w, w′) as columns of the matrix A(l,l′),n. Then the matrix A(l,l′),n has d = l · l + l · l′ = l² + l · l′ rows labelled by the length-2 words rs in Σ² and in turn by the length-2 words st in Σ × Σ′. Moreover, the matrix has m = lⁿ · l′ⁿ columns labelled by the pairs of length-n words σ and τ over Σ and Σ′, respectively. The matrix A(l,l′),n has the property that the sum of each of its columns is (n − 1) + n = 2n − 1, since each word of length n has n − 1 consecutive length-2 words and two words of length n pair in n positions. Thus the matrix A(l,l′),n defines a toric model f = f(l,l′),n : Rd → Rm given as

(θ, θ′) 7→ (pσ,τ )σ∈Σn,τ∈Σ′n , (6.2)


where

pσ,τ = (1/l) θ′σ1,τ1 θσ1,σ2 θ′σ2,τ2 θσ2,σ3 · · · θσn−1,σn θ′σn,τn. (6.3)

Here we assume a uniform initial distribution on the states in the alphabet Σ as described by (7.1). All terms pσ,τ have n + (n − 1) = 2n − 1 factors. The parameter space Θ of the model is the cartesian product of the set of positive l × l matrices θ and the set of positive l × l′ matrices θ′; that is, Θ = R^{l×l}_{>0} × R^{l×l′}_{>0}. The matrix θ encodes a toric Markov chain, while the matrix θ′ encodes the interplay between the two alphabets. The state space of the model is Σⁿ × Σ′ⁿ. This model is called a fully observed toric Markov model.

Example 6.1. Consider a dishonest dealer in a casino tossing coins. We know that she may use a fair or a loaded coin, the latter of which is supposed to have probability 0.75 of showing heads. We also know that she tends not to change coins; a change happens with probability 0.1 (Fig. 6.1). Given a sequence of coin tosses, we wish to determine when she used the loaded and the fair coin.

Fig. 6.1. Transition graph of the casino model: the fair coin F emits heads and tails with probability 0.5 each, the loaded coin L emits heads with probability 0.75 and tails with probability 0.25, and the dealer stays with the current coin with probability 0.9 and switches with probability 0.1.

This model can be described by a toric Markov model consisting of the alphabets Σ = {F, L}, where F stands for fair and L stands for loaded, and Σ′ = {h, t}, where h stands for heads and t stands for tails. The corresponding 8 × 256 matrix A(2,2),4 for sequences of length n = 4 consists of the columns Aσ,τ, where σ and τ range over all words of length 4 over Σ and Σ′, respectively; e.g.,

   A_{FFLL,htht}: FF = 1, FL = 1, LF = 0, LL = 1, Fh = 1, Ft = 1, Lh = 1, Lt = 1,

   A_{FFFL,hhhh}: FF = 2, FL = 1, LF = 0, LL = 0, Fh = 3, Ft = 0, Lh = 1, Lt = 0.

The model has d = 8 parameters given by the matrices

   θ = | θ_{FF}  θ_{FL} |        θ′ = | θ′_{Fh}  θ′_{Ft} |
       | θ_{LF}  θ_{LL} |             | θ′_{Lh}  θ′_{Lt} |.


Suppose the dealer tosses four coins in a row such that we consider sequences of length n = 4. Then themodel has m = (2 · 2)4 = 256 states. The fully observed toric Markov model is defined by the mapping

f : R8 → R256 : (θ, θ′) 7→ (pσ,τ )σ∈Σ4,τ∈Σ′4 ,

where

   p_{σ1σ2σ3σ4, τ1τ2τ3τ4} = (1/2) · θ′_{σ1,τ1} θ_{σ1,σ2} θ′_{σ2,τ2} θ_{σ2,σ3} θ′_{σ3,τ3} θ_{σ3,σ4} θ′_{σ4,τ4}.

For instance, in view of the above pairs of sequences, we obtain

   p_{FFLL,htht} = (1/2) · θ′_{Fh} θ_{FF} θ′_{Ft} θ_{FL} θ′_{Lh} θ_{LL} θ′_{Lt}
                 = (1/2) · θ_{FF} θ_{FL} θ_{LL} θ′_{Fh} θ′_{Ft} θ′_{Lh} θ′_{Lt}

and

   p_{FFFL,hhhh} = (1/2) · θ′_{Fh} θ_{FF} θ′_{Fh} θ_{FF} θ′_{Fh} θ_{FL} θ′_{Lh}
                 = (1/2) · θ²_{FF} θ_{FL} (θ′_{Fh})³ θ′_{Lh}.

♦

Second, we introduce the fully observed Markov model as a submodel of the toric Markov model.

For this, the parameter space of the fully observed toric Markov model is restricted to the set of pairs of matrices (θ, θ′) whose row sums are equal to 1. The parameter space of the fully observed Markov model is thus a subset Θ1 of R^{l×(l−1)}_{>0} × R^{l×(l′−1)}_{>0}, and the number of parameters is d = l·(l−1) + l·(l′−1) = l·(l + l′ − 2). A pair of matrices (θ, θ′) in Θ1 provides an l × l matrix θ describing transition probabilities and an l × l′ matrix θ′ providing emission probabilities. The value θ_{ij} represents the probability to transit from state i ∈ Σ to state j ∈ Σ in one step, and the value θ′_{ij} is the probability to emit the symbol j ∈ Σ′ in state i ∈ Σ. The fully observed Markov model is given by the mapping f_{(l,l′),n} : R^d → R^m restricted to the parameter space Θ1. Each point p in the image f_{(l,l′),n}(Θ1) is called a marginal probability. We usually assume that the initial distribution at the first state in Σ is uniform.

Let u = (u_{σ,τ}) ∈ N^{l^n × l′^n}_0 be a frequency vector representing N observed sequence pairs in Σ^n × Σ′^n. That is, u_{σ,τ} counts the number of times the pair (σ, τ) is observed. Thus, we have Σ_{σ,τ} u_{σ,τ} = N. The sufficient statistic v = A_{(l,l′),n} · u can be regarded as a pair of matrices (v, v′), where v = (v_{rs}) is an l × l matrix whose entries v_{rs} are the numbers of occurrences of rs ∈ Σ² as a consecutive pair in any of the sequences σ occurring in the observed pairs (σ, τ), and v′ = (v′_{st}) is an l × l′ matrix whose entries v′_{st} are the numbers of occurrences of st ∈ Σ × Σ′ at the same position in any of the observed sequence pairs (σ, τ).

The likelihood function of the fully observed Markov model is given as

   L(θ, θ′) = (θ, θ′)^{A_{(l,l′),n}·u} = θ^v · (θ′)^{v′},   (θ, θ′) ∈ Θ1.   (6.4)

Proposition 6.2. In the fully observed Markov chain model f_{(l,l′),n}, the maximum likelihood estimate of the frequency data u ∈ N^{l^n × l′^n}_0 with sufficient statistic v = A_{(l,l′),n} · u is the matrix pair (θ, θ′) in Θ1 such that

   θ_{σ1σ2} = v_{σ1σ2} / Σ_{σ∈Σ} v_{σ1σ}   and   θ′_{σ1τ1} = v′_{σ1τ1} / Σ_{τ∈Σ′} v′_{σ1τ},   σ1, σ2 ∈ Σ, τ1 ∈ Σ′.


The proof is analogous to that of Prop. 4.11, since the log-likelihood function similarly decouples into independent parts.
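To make Prop. 6.2 concrete, here is a small Python sketch (an illustration added for this text, not taken from the Maple code below; the helper names are ad hoc) that builds the sufficient statistic of a list of observed pairs (σ, τ) and normalizes its rows to obtain the estimates:

def mle_fully_observed(pairs, states, symbols):
    # sufficient statistic: transition counts v and emission counts v2
    v  = {r: {s: 0 for s in states}  for r in states}
    v2 = {s: {t: 0 for t in symbols} for s in states}
    for sigma, tau in pairs:
        for a, b in zip(sigma, sigma[1:]):
            v[a][b] += 1
        for s, t in zip(sigma, tau):
            v2[s][t] += 1
    # row-wise normalization yields the maximum likelihood estimates of Prop. 6.2
    theta  = {r: {s: v[r][s]  / sum(v[r].values())  for s in states}  for r in states}
    theta2 = {s: {t: v2[s][t] / sum(v2[s].values()) for t in symbols} for s in states}
    return theta, theta2

theta, theta2 = mle_fully_observed([("FFLL", "htht"), ("FFFL", "hhhh")], "FL", "ht")
print(theta["F"]["F"], theta2["L"]["h"])   # 0.6 and 0.666...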

Example 6.3 (Maple). We reconsider the dealer’s example. The parameter space Θ1 of the fullyobserved Markov model can be viewed as the set of all pairs of probability matrices

   θ = | θ_{FF}       1 − θ_{FF} |        θ′ = | θ′_{Fh}       1 − θ′_{Fh} |
       | 1 − θ_{LL}   θ_{LL}     |             | 1 − θ′_{Lt}   θ′_{Lt}     |,

where θ_{F,F} = θ_{L,L} = 0.9 is the probability to stay with a fair or a loaded coin, θ′_{F,h} = 0.5 is the probability to observe heads for a fair coin, and θ′_{L,h} = 0.75 is the probability to observe heads for a loaded coin. This model has only d = 4 parameters.

Suppose a game involves tossing the coin n = 4 times. Then the model has m = (2 · 2)^4 = 256 states. This Markov model is given by the mapping f_{(2,2),4} : R^4 → R^256 with marginal probabilities

   p_{σ1σ2σ3σ4, τ1τ2τ3τ4} = (1/2) · θ′_{σ1,τ1} θ_{σ1,σ2} θ′_{σ2,τ2} θ_{σ2,σ3} θ′_{σ3,τ3} θ_{σ3,σ4} θ′_{σ4,τ4}.

For instance, in a game the fair coin was used two times and then the loaded coin was taken two times,and each time heads was observed. The corresponding probability is given by

   p_{FFLL,hhhh} = (1/2) · θ′_{Fh} θ_{FF} θ′_{Fh} (1 − θ_{FF}) (1 − θ′_{Lt}) θ_{LL} (1 − θ′_{Lt}).

We compute symbolically the likelihood function. For this, we take the packages

with(combinat): with(linalg):

and initialize

n := 4: l := 2: l’ := 2: m := (l * l’)^n:

T := array([ [tFF, tFL], [tLF, tLL] ]):

E := array([ [tFh, tFt], [tLh, tLt] ]):

P := array([], 1..l^n, 1..l’^n):

The marginal values are computed by the following code,

R := powerset([1,2,3,4]):

S := powerset([1,2,3,4]):

for i from 1 to nops(R) do

x := vector( [1,1,1,1] ):

for u from 1 to nops(R[i]) do

x[R[i,u]] := 2;

od:

for j from 1 to nops(S) do

y := vector( [1,1,1,1] ):

for v from 1 to nops(S[j]) do

y[S[j,v]] := 2;

od:

P[i,j] := 1/2 * E[x[1],y[1]] * T[x[1],x[2]] * E[x[2],y[2]]


* T[x[2],x[3]] * E[x[3],y[3]] * T[x[3],x[4]]

* E[x[4],y[4]];

od

od:

Consider a frequency vector, whose entries are randomly chosen integers between 1 and 5:

roll := rand (1..5):

u := randmatrix ( l^n, l’^n, entries = roll ):

The likelihood function can be calculated as

L := 1:

for i from 1 to 16 do

for j from 1 to 16 do

L := L * P[i,j]^u[i,j]

od

od:

The output (up to a constant) is given by the expression

tFh^799 tFF^577 tFt^742 tLh^806 tLF^579 tLt^753 tFL^579 tLL^590

Thus the maximum likelihood estimates are

   θ_{FF} = 1 − θ_{FL} = 577/(577 + 579),
   θ_{LF} = 1 − θ_{LL} = 579/(579 + 590),
   θ′_{Fh} = 1 − θ′_{Ft} = 799/(799 + 742),
   θ′_{Lh} = 1 − θ′_{Lt} = 806/(806 + 753).

6.2 Hidden Markov Model

A fully observed Markov model gives rise to a hidden Markov model by observing only the emitted sequences. This can be formally described by a so-called marginalization mapping. Consider the fully observed Markov model

   F : R^{l×(l−1)} × R^{l×(l′−1)} → R^{l^n × (l′)^n}   (6.5)

and the marginalization mapping

   ρ : R^{l^n × (l′)^n} → R^{(l′)^n}   (6.6)

that maps each l^n × (l′)^n matrix to the vector of column sums. The algebraic statistical model given by the composition f = ρ ◦ F is called a hidden Markov model,

   f : R^{l×(l−1)} × R^{l×(l′−1)} → R^{(l′)^n}.   (6.7)

Both the hidden Markov model and the underlying fully observed Markov model have the same parameter space Θ1 ⊆ R^{l×(l−1)}_{>0} × R^{l×(l′−1)}_{>0}. For each pair of matrices (θ, θ′) ∈ Θ1, we have

   f(θ, θ′) = (p_τ)_{τ∈Σ′^n},   (6.8)

where

   p_τ = Σ_{σ∈Σ^n} p_{σ,τ}   (6.9)
       = (1/l) Σ_{σ1∈Σ} · · · Σ_{σn∈Σ} θ′_{σ1,τ1} θ_{σ1,σ2} θ′_{σ2,τ2} θ_{σ2,σ3} · · · θ_{σn−1,σn} θ′_{σn,τn}.

In the hidden Markov model, the states are not observable. What is observable is the output of the current state (Fig. 6.2).
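The marginalization can be spelled out directly for small n. The following Python sketch (an added illustration with ad hoc names, not the book's Maple code) computes p_τ for the casino model by brute-force summation over all hidden sequences σ; the forward algorithm of Section 6.3 computes the same quantity more efficiently:

from itertools import product

theta  = {"F": {"F": 0.9, "L": 0.1}, "L": {"F": 0.1, "L": 0.9}}    # transition probabilities
theta2 = {"F": {"h": 0.5, "t": 0.5}, "L": {"h": 0.75, "t": 0.25}}  # emission probabilities

def p_tau(tau, states="FL"):
    # p_tau = (1/l) * sum over all hidden sequences sigma of p_{sigma,tau}
    total = 0.0
    for sigma in product(states, repeat=len(tau)):
        p = 1.0 / len(states)
        for i, (s, t) in enumerate(zip(sigma, tau)):
            p *= theta2[s][t]
            if i + 1 < len(sigma):
                p *= theta[s][sigma[i + 1]]
        total += p
    return total

print(p_tau("hhtt"))   # marginal probability of observing h, h, t, t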

[Figure: a start node leads to hidden states σ1 → σ2 → σ3 → · · · with transition probabilities θ_{σ1,σ2}, θ_{σ2,σ3}, . . .; each hidden state σi emits the observed symbol τi with probability θ′_{σi,τi}.]

Fig. 6.2. Hidden Markov model.

Example 6.4 (Maple). We reconsider the dealer's example. In the corresponding hidden Markov model, the dealer's coin tosses are observed, but not whether she chooses a fair or a loaded coin. This model is given by the mapping (n = 4)

f : R4 → R16 : (θ, θ′) 7→ (pτ )τ∈Σ′4 ,

whose marginal probabilities are

   p_{τ1τ2τ3τ4} = (1/2) Σ_{σ1∈Σ} Σ_{σ2∈Σ} Σ_{σ3∈Σ} Σ_{σ4∈Σ} θ′_{σ1,τ1} θ_{σ1,σ2} θ′_{σ2,τ2} θ_{σ2,σ3} θ′_{σ3,τ3} θ_{σ3,σ4} θ′_{σ4,τ4}.

Suppose the game is observed N times. Let u = (u_τ) ∈ N^16 be the corresponding frequency vector. That is, u_τ counts the number of times the sequence τ ∈ Σ′^4 is observed. Then we have Σ_τ u_τ = N.

The goal is to maximize the likelihood function of the model,

6.3 Sum-Product Decomposition 127

   ℓ(θ_{FF}, θ_{LL}, θ′_{Fh}, θ′_{Lt}) = Π_{τ∈Σ′^4} p_τ^{u_τ}.

The Maple code for the computation of the likelihood function for the fully observed Markov modelcan be easily modified to calculate the likelihood function for the associated hidden Markov model. Forthis, the marginal probabilities are computed as follows,

M := vector (l’^n, 0):

for j from 1 to l’^n do

M[j] := 0;

for i from 1 to l^n do

M[j] := M[j] + P[i,j]

od

od:

By taking a randomly chosen frequency vector

roll := rand(1..5):

u := randvector (l’^n, entries = roll):

here u = [5, 2, 5, 2, 3, 4, 4, 5, 3, 1, 5, 2, 3, 2, 2, 4], we obtain the likelihood function

L := 1:

for i from 1 to l’^n do

L := L * M[i]^u[i]

od

simplify(L);

The likelihood function (up to a constant) is given as follows,

( tFh^4 tFF^3 + 3 tLt tLL tFh^3 tFF^2 + 3 tLt^2 tLL^2 tFh^2 tFF
+ tLt^3 tLL^3 tFh + tFh^3 tFF^3 tLt + 3 tLt^2 tLL tFh^2 tFF^2
+ 3 tLt^3 tLL^2 tFh tFF + tLt^4 tLL^3 )^52

6.3 Sum-Product Decomposition

We show that the marginal probabilities of the hidden Markov model can be efficiently calculated by a sum-product decomposition. For this, consider a hidden Markov model of length n with state set Σ of l symbols and emission set Σ′ of l′ symbols. The model parameters are the transition probability matrix θ ∈ R^{l×(l−1)} and the emission probability matrix θ′ ∈ R^{l×(l′−1)}. If we assume a uniform initial distribution on the states, the probability of occurrence of the sequence (σ, τ), σ ∈ Σ^n and τ ∈ Σ′^n, is given as

   p_{σ,τ} = (1/l) θ′_{σ1,τ1} θ_{σ1,σ2} θ′_{σ2,τ2} θ_{σ2,σ3} · · · θ_{σn−1,σn} θ′_{σn,τn}.   (6.10)

The marginal probability of the observed sequence τ is then given by

128 6 Hidden Markov Models

   p_τ = Σ_{σ∈Σ^n} p_{σ,τ}.   (6.11)

This probability has the sum-product decomposition

   p_τ = (1/l) Σ_{σn∈Σ} θ′_{σn,τn} Σ_{σn−1∈Σ} θ_{σn−1,σn} θ′_{σn−1,τn−1} ( · · · ( Σ_{σ1∈Σ} θ_{σ1,σ2} θ′_{σ1,τ1} ) · · · ).   (6.12)

This expression can be evaluated by an (n − 1) × l matrix M defined as follows,

   M_{1,σ} = Σ_{σ1∈Σ} θ_{σ1,σ} θ′_{σ1,τ1},   σ ∈ Σ,
   M_{k,σ} = Σ_{σk∈Σ} θ_{σk,σ} θ′_{σk,τk} · M_{k−1,σk},   2 ≤ k ≤ n − 1, σ ∈ Σ,
   p_τ = (1/l) Σ_{σn∈Σ} θ′_{σn,τn} · M_{n−1,σn}.

The computation of the marginal probability p_τ by using this decomposition is called the forward algorithm of the hidden Markov model. Its time complexity is O(l²n), because the matrix M has O(ln) entries and each entry is evaluated in O(l) steps.
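The following Python sketch (added for illustration; the dictionaries theta and theta2 hold the transition and emission probabilities of the casino model) implements this forward recursion and returns the same value as the brute-force summation, but in O(l²n) time:

def forward(tau, theta, theta2, states):
    # M[s] holds M_{k,s} for the current k; M_{0,s} = 1
    M = {s: 1.0 for s in states}
    for k in range(len(tau) - 1):
        # M_{k+1,s} = sum_r theta_{r,s} * theta'_{r,tau_{k+1}} * M_{k,r}
        M = {s: sum(theta[r][s] * theta2[r][tau[k]] * M[r] for r in states)
             for s in states}
    # p_tau = (1/l) * sum_s theta'_{s,tau_n} * M_{n-1,s}
    return sum(theta2[s][tau[-1]] * M[s] for s in states) / len(states)

theta  = {"F": {"F": 0.9, "L": 0.1}, "L": {"F": 0.1, "L": 0.9}}
theta2 = {"F": {"h": 0.5, "t": 0.5}, "L": {"h": 0.75, "t": 0.25}}
print(forward("hhtt", theta, theta2, "FL"))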

Example 6.5 (Maple). We reconsider the occasionally dishonest casino. We compute the marginalprobability pτ for the observed sequence τ = hhtt. For this, we put

t := [0,0,1,1]:

and provide the parameters in a hidden Markov model by the matrices

T := array([ [tFF, tFL], [tLF, tLL] ]):

E := array([ [tFh, tFt], [tLh, tLt] ]):

We initialize

n := nops(t):

l := 2:

u := subs( {0=1, 1=2 }, t):

and obtain by symbolic computation,

M := array( [], 0..n, 1..l):

for i from 1 to l do

M[0,i] := 1

od:

for k from 1 to n-1 do

for i from 1 to l do

M[k,i] := 0;

for j from 1 to l do

M[k,i] := M[k,i] + T[j,i] * E[j,u[k]] * M[k-1,j]

od:


od:

od:

p := 0:

for j from 1 to l do

p := p + 1/2 * E[j,u[n]] * M[n-1,j]

od:

expand(p);

This code produces the marginal probability pτ . ♦

6.4 Viterbi Algorithm

The sum-product decomposition of the marginal probabilities can be used to find an explanation for a given sequence of observed data τ ∈ Σ′^n. Finding an explanation means identifying a hidden state sequence σ with maximum a posteriori probability that generated the observed data τ; that is,

σ = argmaxσ{pσ,τ}. (6.13)

By putting w_τ = − log p_τ and w_{σ,τ} = − log p_{σ,τ}, the tropicalization of the marginal probability (6.9) yields

   w_τ = ⊕_σ w_{σ,τ}.   (6.14)

The explanation σ is given by evaluation in the tropical algebra,

   σ = argmin_σ {w_{σ,τ}}.   (6.15)

The value w_τ can be efficiently computed by tropicalizing the sum-product decomposition of the marginal probability p_τ. For this, we put u_{ij} = − log θ_{ij} and v_{ij} = − log θ′_{ij}. By replacing the sums by tropical sums and the products by tropical products in the sum-product decomposition (6.12), we obtain

   w_τ = ⊕_{σn} v_{σn,τn} ⊙ ⊕_{σn−1} u_{σn−1,σn} ⊙ v_{σn−1,τn−1} ⊙ ( · · · ( ⊕_{σ1} u_{σ1,σ2} ⊙ v_{σ1,τ1} ) · · · ).   (6.16)

This gives us the following result.

Proposition 6.6. Let τ ∈ Σ′^n. The tropicalization w_τ of the marginal probability p_τ provides an explanation for the observed data τ.

The tropicalized term w_τ can be computed by evaluating iteratively the parentheses in (6.16):

   M[0, σ] := 0,   σ ∈ Σ,
   M[k, σ] := ⊕_{σ′∈Σ} ( u_{σ′,σ} ⊙ v_{σ′,τk} ⊙ M[k − 1, σ′] ),   σ ∈ Σ, 1 ≤ k ≤ n − 1,   (6.17)
   M[n, σ] := v_{σ,τn} ⊙ M[n − 1, σ],   σ ∈ Σ,
   w_τ := ⊕_{σ∈Σ} M[n, σ].

130 6 Hidden Markov Models

This algorithm is known as the Viterbi algorithm, and the computed explanation is called a Viterbi sequence. The Viterbi algorithm consists of a forward algorithm evaluating the data as given by Alg. 6.1 and a backward algorithm yielding an optimal state sequence comprised of the symbols σ which attain the minimum in each minimization step. This information can be recorded by the forward algorithm similarly to the forward algorithm for sequence alignment (Alg. 5.1). The time complexity of the Viterbi algorithm is O(l²n), as described in the previous section.

Algorithm 6.1 Viterbi forward algorithm.

Require: sequence τ ∈ Σ′^n, scores (u_{ij}) and (v_{ij})
Ensure: tropicalized term w_τ
M ← matrix[0..n, 1..l]
for σ ← 1 to l do
   M[0, σ] ← 0
end for
for k ← 1 to n − 1 do
   for σ ← 1 to l do
      M[k, σ] ← ∞
      for σ′ ← 1 to l do
         M[k, σ] ← min{M[k, σ], u_{σ′,σ} + v_{σ′,τk} + M[k − 1, σ′]}
      end for
   end for
end for
for σ ← 1 to l do
   M[n, σ] ← v_{σ,τn} + M[n − 1, σ]
end for
w_τ ← ∞
for σ ← 1 to l do
   w_τ ← min{w_τ, M[n, σ]}
end for
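As an added illustration (not part of the original text), the following Python sketch implements the forward recursion of Alg. 6.1 together with the backtracking step; the dictionaries u_score and v_score stand for the tropicalized parameters (u_{ij}) and (v_{ij}):

def viterbi(tau, u_score, v_score, states):
    n = len(tau)
    M = {0: {s: 0.0 for s in states}}     # M[0, sigma] = 0
    back = {}                              # back[k][sigma] = minimizing predecessor
    for k in range(1, n):
        M[k], back[k] = {}, {}
        for s in states:
            best = min(states, key=lambda r: u_score[r][s] + v_score[r][tau[k - 1]] + M[k - 1][r])
            back[k][s] = best
            M[k][s] = u_score[best][s] + v_score[best][tau[k - 1]] + M[k - 1][best]
    # final step adds the emission score of the last observed symbol
    M[n] = {s: v_score[s][tau[-1]] + M[n - 1][s] for s in states}
    last = min(states, key=lambda s: M[n][s])
    # backtracking yields a Viterbi sequence
    path = [last]
    for k in range(n - 1, 0, -1):
        path.append(back[k][path[-1]])
    return M[n][last], "".join(reversed(path))

u_score = {"F": {"F": 1, "L": 3}, "L": {"F": 3, "L": 1}}   # transition weights of Fig. 6.3
v_score = {"F": {"h": 2, "t": 2}, "L": {"h": 1, "t": 3}}   # emission weights of Fig. 6.3
print(viterbi("hhth", u_score, v_score, "FL"))   # (9, 'LLLL'), matching Example 6.7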

Example 6.7. Reconsider the dealer’s example. Take the weights as given in Fig. 6.3 and the outputsequence τ = hhth. The calculation of the Viterbi algorithm is given in Fig. 6.4. The solid lines show

[Figure: weighted transition graph with states F and L; staying in a state has weight 1, switching has weight 3; F outputs 2(h), 2(t) and L outputs 1(h), 3(t).]

Fig. 6.3. Weighted transition graph of casino model.

where the minima are attained. Tracing back gives the explanation σ = LLLL. ♦

6.4 Viterbi Algorithm 131

          h        h        t        h
   F   0 +2     3 +2     6 +2     9 +2   →  11
   L   0 +1     2 +1     4 +3     8 +1   →   9

[Each column adds the emission weight of the observed symbol; the arrows between columns carry the transition weights +1 for staying and +3 for switching.]

Fig. 6.4. Trellis for the output sequence τ = hhth.

Example 6.8 (Maple). Reconsider the occasionally dishonest casino. We implement the Viterbi al-gorithm by using Maple. For this, take the observed sequence τ = hhtt encoded as

t := [0,0,1,1]:

and provide the minus-log probability matrices

T := array([ [-log(0.9), -log(0.1)], [-log(0.1), -log(0.9)] ]):

E := array([ [-log(0.5), -log(0.5)], [-log(0.75), -log(0.25)] ]):

We initialize

l := 2:

n := nops(t):

u := subs( {0=1, 1=2 }, t):

and obtain by tropical arithmetics,

M := array( [], 0..n, 1..l ):

for i from 1 to l do

M[0,i] := 0

od:

for k from 1 to n-1 do

for i from 1 to l do

M[k,i] := min( T[1,i] + E[1,u[k]] + M[k-1,1],

T[2,i] + E[2,u[k]] + M[k-1,2] ):

od:

od:

w := min( E[1,u[n]] + M[n-1,1], E[2,u[n]] + M[n-1,2] ):

print(M); print(w);

This code produces the table M and the minimum negative log probability w_{0011} = 3.088670.

   M[1, F] = min{0.798508 (F), 2.590267 (L)} = 0.798508
   M[1, L] = min{2.995732 (F), 0.393043 (L)} = 0.393043
   M[2, F] = min{1.597015 (F), 2.983310 (L)} = 1.597015
   M[2, L] = min{3.794240 (F), 0.786085 (L)} = 0.786085
   M[3, F] = min{2.395523 (F), 4.474965 (L)} = 2.395523
   M[3, L] = min{4.592748 (F), 2.277740 (L)} = 2.277740
   w = min{3.088670 (F), 3.664034 (L)} = 3.088670.

A corresponding Viterbi sequence σ can be obtained by tracing back the optimal decisions made in each step. Here the optimal path turns out to be M[4, F] → M[3, F] → M[2, F] → M[1, F], giving rise to the explanation σ = FFFF.

In view of the emission sequence τ = hhht, we obtain the table

   M[1, F] = min{0.798508 (F), 2.590267 (L)} = 0.798508
   M[1, L] = min{2.995732 (F), 0.393043 (L)} = 0.393043
   M[2, F] = min{1.597015 (F), 2.983310 (L)} = 1.597015
   M[2, L] = min{3.794240 (F), 0.786085 (L)} = 0.786085
   M[3, F] = min{2.395523 (F), 3.376352 (L)} = 2.395523
   M[3, L] = min{4.592748 (F), 1.179128 (L)} = 1.179128
   w = min{3.088670 (F), 2.565422 (L)} = 2.565422.

The resulting explanation is σ = LLLL. ♦

6.5 Expectation Maximization

The linear and toric models have the property that the likelihood function has at most one local maximum. However, this property fails for most other algebraic statistical models, including those used in computational biology. In these cases, the numerical optimization technique called expectation maximization (EM) is widely used. Under some conditions it provides a local maximum of the likelihood function.

Consider the hidden model F : R^d → R^{m×n} given by

   F : (θ1, . . . , θd) ↦ (f_{ij}(θ)).   (6.18)

Assume that the sum of all the f_{ij}(θ) equals 1 and that there is an open subset Θ ⊆ R^d such that f_{ij}(θ) > 0 for all θ ∈ Θ. We assume that the hidden model F has an easy and reliable algorithm for solving the maximum likelihood problem.

Consider the linear mapping ρ : R^{m×n} → R^m that takes an m × n matrix to its vector of row sums,

   ρ : (g_{ij}) ↦ ( Σ_{j=1}^n g_{1j}, . . . , Σ_{j=1}^n g_{mj} ).   (6.19)

The observed model is the composition f = ρ ◦ F : R^d → R^m defined as

   f : θ ↦ ( Σ_{j=1}^n f_{1j}(θ), . . . , Σ_{j=1}^n f_{mj}(θ) ).   (6.20)

6.5 Expectation Maximization 133

We put

   f_i(θ) = Σ_{j=1}^n f_{ij}(θ),   1 ≤ i ≤ m.   (6.21)

Suppose we have a data vector u = (u1, . . . , um) ∈ N^m for the observed model. The problem is to maximize the log-likelihood function for these data with respect to the observed model,

   max ℓ_obs(θ) = u1 · log f1(θ) + · · · + um · log fm(θ)
   s.t. θ ∈ Θ.   (6.22)

This problem is usually hard to tackle due to multiple local solutions. It would be much easier to solve the corresponding problem for the hidden model,

   max ℓ_hid(θ) = u11 · log f11(θ) + · · · + umn · log fmn(θ)
   s.t. θ ∈ Θ.   (6.23)

But here we do not know the hidden data; that is, the matrix U = (u_{ij}) ∈ N^{m×n}. All we know is the marginalized data ρ(U) = u.

Algorithm 6.2 EM algorithm for observed model

Require: m × n matrix of polynomials f_{ij}(θ) representing the hidden model and observed data u ∈ N^m
Ensure: Maximum likelihood estimate θ ∈ Θ of the log-likelihood function ℓ_obs(θ) for the observed model
[Init] Threshold ǫ > 0 and parameter θ ∈ Θ
[E-Step] Define the matrix U = (u_{ij}) ∈ R^{m×n} with
   u_{ij} = u_i · f_{ij}(θ) / f_i(θ)
[M-Step] Compute a solution θ* ∈ Θ of the maximization problem in the hidden model
[Comp] If ℓ_obs(θ*) − ℓ_obs(θ) > ǫ, set θ := θ* and resume with the E-step
Output θ := θ*
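To illustrate the alternation of E- and M-steps, here is a small Python sketch of Alg. 6.2 (added for this text; it assumes the hidden model is supplied as a function f(theta) returning the m × n matrix of values f_{ij}(θ), and that an M-step routine m_step(U) for the hidden model is available — both are placeholders, not functions of any particular library):

import math

def em(f, m_step, theta, u, eps=1e-8, max_iter=1000):
    # log-likelihood of the observed model: sum_i u_i * log f_i(theta)
    def l_obs(th):
        return sum(ui * math.log(sum(row)) for ui, row in zip(u, f(th)))
    for _ in range(max_iter):
        vals = f(theta)
        # E-step: distribute each observed count u_i over the hidden cells
        U = [[ui * fij / sum(row) for fij in row] for ui, row in zip(u, vals)]
        # M-step: maximum likelihood estimate in the easy hidden model
        theta_new = m_step(U)
        if l_obs(theta_new) - l_obs(theta) <= eps:
            return theta_new
        theta = theta_new
    return theta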

Theorem 6.9. During each iteration of the EM algorithm (Alg. 6.2), the value of the log-likelihood function weakly increases; that is, ℓ_obs(θ*) ≥ ℓ_obs(θ). If ℓ_obs(θ*) = ℓ_obs(θ), then θ* is a critical point of the log-likelihood function.

Proof. We have

   ℓ_obs(θ*) − ℓ_obs(θ) = Σ_{i=1}^m u_i · [log f_i(θ*) − log f_i(θ)]
      = Σ_{i=1}^m Σ_{j=1}^n u_{ij} · [log f_{ij}(θ*) − log f_{ij}(θ)]   (6.24)
      + Σ_{i=1}^m u_i · [ log( f_i(θ*)/f_i(θ) ) − Σ_{j=1}^n (u_{ij}/u_i) · log( f_{ij}(θ*)/f_{ij}(θ) ) ].


The first term equals ℓ_hid(θ*) − ℓ_hid(θ) and is non-negative due to the M-step. We show that the second term is also non-negative. By the E-step, the parenthesized expression equals

   log( f_i(θ*)/f_i(θ) ) − Σ_{j=1}^n (u_{ij}/u_i) · log( f_{ij}(θ*)/f_{ij}(θ) )
      = log( f_i(θ*)/f_i(θ) ) + Σ_{j=1}^n ( f_{ij}(θ)/f_i(θ) ) · log( f_{ij}(θ)/f_{ij}(θ*) ).   (6.25)

This expression can be rewritten as

   Σ_{j=1}^n ( f_{ij}(θ)/f_i(θ) ) · log( f_i(θ*)/f_i(θ) ) + Σ_{j=1}^n ( f_{ij}(θ)/f_i(θ) ) · log( f_{ij}(θ)/f_{ij}(θ*) )   (6.26)

and thus amounts to

   Σ_{j=1}^n ( f_{ij}(θ)/f_i(θ) ) · log( ( f_i(θ*) · f_{ij}(θ) ) / ( f_{ij}(θ*) · f_i(θ) ) ).   (6.27)

Take the non-negative quantities

   π_j = f_{ij}(θ)/f_i(θ)   and   σ_j = f_{ij}(θ*)/f_i(θ*),   1 ≤ j ≤ n.   (6.28)

We have π1 + . . . + πn = 1 = σ1 + . . . + σn. Thus the vectors π and σ are probability distributions on the set [n]. The expression (6.27) equals the Kullback-Leibler distance between the probability distributions π and σ,

   H(π‖σ) = Σ_{j=1}^n π_j · log(π_j/σ_j) = Σ_{j=1}^n (−π_j) · log(σ_j/π_j)
          ≥ Σ_{j=1}^n π_j · (1 − σ_j/π_j) = 0,   (6.29)

where we used the inequality log x ≤ x − 1 for all x ∈ R_{>0}. Let ℓ_obs(θ*) = ℓ_obs(θ). Then the two terms in (6.24) are both zero. Moreover, the Kullback-Leibler distance satisfies H(π‖σ) = 0 if and only if π = σ. Thus we obtain

   f_{ij}(θ)/f_i(θ) = f_{ij}(θ*)/f_i(θ*),   1 ≤ i ≤ m, 1 ≤ j ≤ n.   (6.30)

Therefore,

   0 = ∂ℓ_hid(θ*)/∂θ_k = Σ_{i=1}^m Σ_{j=1}^n ( u_{ij}/f_{ij}(θ*) ) · ∂f_{ij}(θ*)/∂θ_k
     = Σ_{i=1}^m Σ_{j=1}^n ( u_i/f_i(θ*) ) · (∂f_{ij}/∂θ_k)(θ*)
     = Σ_{i=1}^m ( u_i/f_i(θ*) ) · (∂/∂θ_k)( Σ_{j=1}^n f_{ij} )(θ*)
     = Σ_{i=1}^m ( u_i/f_i(θ*) ) · (∂f_i/∂θ_k)(θ*) = ∂ℓ_obs(θ*)/∂θ_k,   1 ≤ k ≤ d,

where in the third equation we used the E-step and (6.30). ⊓⊔


The EM technique can in particular be used to provide maximum likelihood estimates for the hidden Markov model, because the hidden Markov model is the composition f = ρ ◦ F of a fully observed toric model F and a marginalization mapping ρ. A version of the EM algorithm for the hidden Markov model is given by Alg. 6.3. In the E-step, the Viterbi algorithm can be used to compute the quantities p_{σ,τ}, and in the M-step, Prop. 6.2 is used to compute the locally maximal estimates θ* and θ′*.

Algorithm 6.3 EM algorithm for hidden Markov model

Require: Hidden Markov model f : R^{l×(l−1)} × R^{l×(l′−1)} → R^{l′^n} with parameter space Θ1 and observed data u = (u_τ) ∈ N^{l′^n}
Ensure: Maximum likelihood estimate (θ, θ′) ∈ Θ1
[Init] Threshold ǫ > 0 and parameters (θ, θ′) ∈ Θ1
[E-Step] Define the matrix U = (u_{σ,τ}) ∈ R^{l^n × l′^n} with
   u_{σ,τ} = u_τ · p_{σ,τ}(θ, θ′) / p_τ(θ, θ′),   σ ∈ Σ^n, τ ∈ Σ′^n
[M-Step] Compute a solution (θ*, θ′*) ∈ Θ1 of the maximization problem in the fully observed Markov model
[Comp] If ℓ(θ*, θ′*) − ℓ(θ, θ′) > ǫ, set θ := θ* and θ′ := θ′* and resume with the E-step
Output θ := θ*, θ′ := θ′*

6.6 Finding CpG Islands

The CpG sites are regions of DNA in a linear DNA strand where a cytosine is next to a guanine linked by one phosphate group. The notation "CpG" is used to distinguish this linear sequence from the CG base pairs where cytosine and guanine are on different DNA strands linked by hydrogen bonds. Cytosines in CpG can be methylated to form 5-methylcytosine. In mammals, 70 % to 80 % of the CpG cytosines are methylated. The methylated cytosines within a gene can change the gene's expression. Gene expression is the mechanism by which a gene is transcribed and translated into a protein.

CpG islands are regions with a high frequency of CpG sites. An objective definition of CpG islandis lacking. The usual formal definition is that a CpG island is a region with at least 200 base pairs inlength, a CG percentage greater than 50 %, and the observed-to-expected CpG ratio greater than 60 %,where the observed-to-expected CpG ratio is given by the ratio between an observed part (i.e., numberof CpG times length of sequence) and an expected part (i.e., number of C times number of G).

In mammals, many genes have CpG islands at the start of a gene (promoter regions). In mammaliangenomes, CpG islands are usually 300 to 3,000 base pairs in length and occur in about 40 % of promoterregions. In particular, the promoter regions in human genomes have a CpG content of about 70 %. Overtime, methylated cytosines in CpG sites tend to turn into thymines because of spontaneous deamination.

The methylation of CpG sites within promoter regions can lead to the silencing of the gene. Silencingis a phenomenon which can be found in a number of human cancers such as the silencing of tumorsuppressor genes. Age has a strong impact on DNA methylation levels on tens of thousands of CpG sites.

In computational biology, two questions about CpG islands arise. First, decide whether a short stretch of a genomic linear strand lies inside of a CpG island. Second, find the CpG regions of a long stretch of a genomic linear strand.


We begin with the first question. From a set of human DNA sequences, a total of 48 putative CpG islands were extracted. From the regions labelled as CpG islands the + model was derived, and from the remaining regions the − model was established. The transition probabilities for each model were calculated using the equations

   θ+_{XY} = c+_{XY} / Σ_Z c+_{XZ}   and   θ−_{XY} = c−_{XY} / Σ_Z c−_{XZ},

where c+_{XY} and c−_{XY} are the numbers of times the nucleotide Y followed the nucleotide X in a CpG island and in a non-island, respectively. In this way, the transition probabilities of the two Markov chain models are given in Fig. 6.5.

   + model     A      C      G      T
   A        0.180  0.274  0.426  0.120
   C        0.171  0.368  0.274  0.188
   G        0.161  0.339  0.375  0.125
   T        0.079  0.355  0.384  0.182

   − model     A      C      G      T
   A        0.300  0.205  0.285  0.210
   C        0.322  0.298  0.078  0.302
   G        0.248  0.246  0.298  0.208
   T        0.177  0.239  0.292  0.292

Fig. 6.5. Transition probabilities for + model and − model.

Consider a DNA sequence w. We calculate the probabilities p+(w) and p−(w) in both Markov chain models. For discrimination purposes, the log-odds ratio is used,

   S(w) = log( p+(w) / p−(w) ) = Σ_i log( θ+_{wi,wi+1} / θ−_{wi,wi+1} ).

If the value of S(w) is positive, there is a high chance that the DNA sequence represents a CpG island; otherwise, it likely does not.
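The following Python sketch (an added illustration; the tables are copied from Fig. 6.5) computes the log-odds score S(w) of a DNA string:

import math

plus = {"A": {"A": 0.180, "C": 0.274, "G": 0.426, "T": 0.120},
        "C": {"A": 0.171, "C": 0.368, "G": 0.274, "T": 0.188},
        "G": {"A": 0.161, "C": 0.339, "G": 0.375, "T": 0.125},
        "T": {"A": 0.079, "C": 0.355, "G": 0.384, "T": 0.182}}

minus = {"A": {"A": 0.300, "C": 0.205, "G": 0.285, "T": 0.210},
         "C": {"A": 0.322, "C": 0.298, "G": 0.078, "T": 0.302},
         "G": {"A": 0.248, "C": 0.246, "G": 0.298, "T": 0.208},
         "T": {"A": 0.177, "C": 0.239, "G": 0.292, "T": 0.292}}

def log_odds(w):
    # S(w) = sum_i log( theta+_{w_i w_{i+1}} / theta-_{w_i w_{i+1}} )
    return sum(math.log(plus[a][b] / minus[a][b]) for a, b in zip(w, w[1:]))

print(log_odds("CGCGCG"))   # positive score suggests a CpG island
print(log_odds("ATATAT"))   # negative score suggests a non-island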

Finally, we study the second question. For this, we build a hidden Markov model for the entire DNA sequence that incorporates both of the above Markov chain models. To this end, the states are relabelled such that A+, C+, G+, and T+ determine CpG island areas, while A−, C−, G−, and T− provide non-island areas. For simplicity, we assume that there is a uniform transition probability of switching between island and non-island. The transition probabilities are given in Fig. 6.6, where p+ and p− = 1 − p+ are the probabilities for staying inside and outside of a CpG island.

For instance, we have θ_{T+A+} = 0.079 p+ and θ_{T+A−} = (1 − p+)/4. The transition probabilities in this model can be set so that within each region they are close to the transition probabilities of the original component model, but there is also a chance of switching into the other region.

Finally, the emission probabilities are all 0 and 1. The state X+ or X− outputs the symbol X with certainty; that is,

   θ′_{X+,Y} = θ′_{X−,Y} = 1 if X = Y, and 0 if X ≠ Y.

There are three canonical problems associated with a hidden Markov model. First, given the parameters of the model, a DNA sequence (output), and a state sequence, calculate the probability of


   θ      A+        C+        G+        T+        A−        C−        G−        T−
   A+  0.180p+   0.274p+   0.426p+   0.120p+   (1−p+)/4  (1−p+)/4  (1−p+)/4  (1−p+)/4
   C+  0.171p+   0.368p+   0.274p+   0.188p+   (1−p+)/4  (1−p+)/4  (1−p+)/4  (1−p+)/4
   G+  0.161p+   0.339p+   0.375p+   0.125p+   (1−p+)/4  (1−p+)/4  (1−p+)/4  (1−p+)/4
   T+  0.079p+   0.355p+   0.384p+   0.182p+   (1−p+)/4  (1−p+)/4  (1−p+)/4  (1−p+)/4
   A−  (1−p−)/4  (1−p−)/4  (1−p−)/4  (1−p−)/4  0.300p−   0.205p−   0.285p−   0.210p−
   C−  (1−p−)/4  (1−p−)/4  (1−p−)/4  (1−p−)/4  0.322p−   0.298p−   0.078p−   0.302p−
   G−  (1−p−)/4  (1−p−)/4  (1−p−)/4  (1−p−)/4  0.248p−   0.246p−   0.298p−   0.208p−
   T−  (1−p−)/4  (1−p−)/4  (1−p−)/4  (1−p−)/4  0.177p−   0.239p−   0.292p−   0.292p−

Fig. 6.6. Transition probabilities for the hidden Markov model.

the output sequence when the model runs through the states. For instance, the DNA sequence CGCG generated by the state sequence C+G+C+G+ has the probability

   p_{C+G+C+G+, CGCG} = (1/8) θ′_{C+,C} θ_{C+,G+} θ′_{G+,G} θ_{G+,C+} θ′_{C+,C} θ_{C+,G+} θ′_{G+,G}.

Second, given the parameters of the model and a DNA sequence (output), find the maximum a posteriori probability of generating the output sequence. This problem is tackled by the Viterbi algorithm, which finds the most probable path. When this path goes through the + states, a CpG island is predicted. For instance, consider an output sequence and a corresponding a posteriori state sequence,

A C C C G C C G A A T A T T C G G G C C G A A T A

A− C+ C+ C+ G+ C+ C+ G+ A− A− T− A− T− T− C+ G+ G+ G+ C+ C+ G+ A− A− T− A−

The state sequence would indicate that the strand has two CpG islands. Third, given a set of DNA sequences (output), find the most likely set of transition probabilities. This amounts to discovering the parameters of the hidden Markov model given the data set. This problem can be tackled by the EM algorithm.


7

Tree Markov Models

Phylogenetics is a branch of biology that seeks to reconstruct evolutionary history. Inferring a phylogenyis an estimation procedure that is to provide the best estimate of history based on the incompleteinformation contained in the observed data. Ultimately, we would like to reconstruct the entire tree oflife that describes the course of evolution leading to all present day species.

Phylogenetic reconstruction has a long history. The classical reconstruction has been based on theobservation and measurement of morphological similarities between taxa with the possible adjunction ofsimilar evidence from the fossil record. However, with the recent advances in technology for sequencingof genomic data, reconstruction based on the huge amount of available DNA sequence data and isnow by far the most commonly used technique. Moreover, reconstruction from DNA sequence data canoperate automatically on well-defined digital data sets that fit into the framework of classical statistics,rather than proceeding from a somewhat ill-defined mixture of qualitative and quantitative data withthe need for expert oversight to adjust for difficulties such as morphological similarity.

This chapter is mainly devoted to two approaches for phylogenetic reconstruction, the maximumlikelihood method that evaluates a hypothesis about evolutionary history in terms of the probabilityand the algebraic method of phylogenetic invariants.

7.1 Data and General Models

Phylogenetic reconstruction makes use of the structure of trees. A tree is a cycle-free connected graphT = (N,E) with node set N = N(T ) and edge set E = E(T ).

An unrooted tree is considered as an undirected graph; that is, the edges are 2-subsets of the nodeset. Each edge {k, l} is written as a word kl with kl = lk since there is no ordering on the nodes. Theedges are also called branches. An unrooted binary tree (or trivalent tree) contains only nodes of degreeone (terminal nodes or leaves) and degree three (internal nodes).

A rooted tree is considered as a directed graph; that is, the edges are ordered pairs. Each edge (k, l)is denoted as a word kl with kl 6= lk since there is an ordering on the nodes. A rooted binary treecontains besides nodes of degree one and three also one node of degree two, the socalled root. The edgesare always directed away from the root.

140 7 Tree Markov Models

A tree is labelled if its leaves are labelled. In phylogenies, trees are built from data corresponding tothe leaves. These data are called taxa, while the data at the inner nodes are called intermediates. Taxaand intermediates are both called individuals.

In a phylogenetic tree, the arrow of time points away from the root (if any), paths down throughthe tree represent lineages (lines of descent), any point on a lineage corresponds to a point of time inthe life of some ancestor of a taxon, inner nodes represent times at which lineages diverge, and the root(if any) corresponds to the most common ancestor of all the taxa.

The basic data in phylogenetics are DNA sequences corresponding one-to-one with the taxa thathave been preprocessed in some suitable way. These sequences are assumed to be aligned. For simplicity,we suppose that we are dealing with segments of DNA without indels such that all taxa share the samecommon positions, and differences between nucleotides at these positions are due to substitutions.

Suppose we have the following four aligned DNA sequences,

   taxon 1   A G A C G T T A C G T A . . .
   taxon 2   A G A G C A A C T T T G . . .
   taxon 3   A A T C G A T A C G C A . . .
   taxon 4   T C T A G T A A C C C C . . .

A standard assumption is that the behavior at widely separated positions on the genome is statistically independent. With this assumption, the modelling problem reduces to the modelling of the nucleotides observed at a given position. For this, define a pattern σ to be the sequence of symbols that we get when we look at a single site (column) in the aligned sequence data. For instance, the third column gives rise to the pattern AATT. A tree that might describe the course of evolution based on this pattern is shown in Fig. 7.1. Any tree with four leaves labelled by the pattern in some order is a potential candidate

[Figure: rooted binary tree with root 1; node 1 has children 4 : A and 2, node 2 has children 5 : A and 3, and node 3 has the leaves 6 : T and 7 : T.]

Fig. 7.1. Rooted binary tree with labelled leaves.

that describes the course of evolution. To this end, we need to capture the various tree topologies andlabellings.

Two trees T and T ′ are isomorphic if there is a bijective mapping φ : N → N ′ between the nodesets which is compatible with the edges; that is, for each pair of nodes k, l in T , kl is an edge in T ifand only if φ(k)φ(l) is an edge in T ′. The mapping φ is also called an isomorphism.


Let Σ denote the DNA alphabet (or any other alphabet used for bioinformatics data, like RNAor amino acids). Let T be a tree with set of leaves L. A labelling of T by DNA data is a mappingψ : L → Σ. A tree equipped with such a labelling is called labelled. Two labelled trees T and T ′ areequivalent if there is an isomorphism φ : N → N ′ which is compatible with the labelling of the leaves;that is, if ψ : L → Σ is a labelling of T and ψ′ : L′ → Σ is a labelling of T ′, then ψ(l) = ψ′(φ(l)) foreach leaf l ∈ L.

Example 7.1. There are two non-isomorphic rooted binary trees with four leaves (Fig. 7.4). The firsthas twelve labelled trees (Fig. 7.2) and the second has three labelled trees (Fig. 7.3). ♦

   leaves  4 5 6 7      leaves  4 5 6 7
           A C G T              G A C T
           A G C T              G C A T
           A T C G              G T A C
           C A G T              T A C G
           C G A T              T C A G
           C T A G              T G A C

Fig. 7.2. Labellings of the first rooted tree in Fig. 7.4.

   leaves  4 5 6 7
           A C G T
           A G C T
           A T C G

Fig. 7.3. Labellings of the second rooted tree in Fig. 7.4.

Proposition 7.2. A labelled unrooted binary tree with n ≥ 3 leaves has n − 2 internal nodes and 2n − 3 edges. The number of inequivalent labelled unrooted binary trees with n ≥ 3 leaves is

   Π_{i=3}^n (2i − 5).

Proof. Let g_n denote the number of inequivalent labelled unrooted binary trees with n ≥ 3 leaves. There is one labelled unrooted binary tree with n = 3 leaves, i.e., g_3 = 1. This tree has n − 2 = 1 internal nodes and 2n − 3 = 3 edges (Fig. 7.5).

Consider a labelled unrooted binary tree T with n ≥ 3 leaves. Choose an edge of T, bisect it by introducing an internal node, and connect this node to a new leaf (Fig. 7.6). By induction, the new tree has n + 1 leaves, (n − 2) + 1 = (n + 1) − 2 internal nodes, and (2n − 3) + 2 = 2(n + 1) − 3 edges. Each


[Figure: the two non-isomorphic rooted binary trees with four leaves. In the first, root 1 has children 4 and 2, node 2 has children 5 and 3, and node 3 has the leaves 6 and 7. In the second, root 1 has children 2 and 3, node 2 has the leaves 4 and 5, and node 3 has the leaves 6 and 7.]

Fig. 7.4. Two rooted binary trees with four leaves.

'&%$ !"#3✁✁✁✁✁✁✁✁✁

'&%$ !"#1 ��������

'&%$ !"#2

❂❂❂❂❂❂❂❂❂

Fig. 7.5. Labelled unrooted binary tree with three leaves.

labelled unrooted binary tree with n + 1 leaves can be constructed in this way, and the constructed trees are all pairwise inequivalent. Since each labelled unrooted binary tree with n leaves has 2n − 3 edges, we have g_{n+1} = g_n · (2n − 3) = g_n · (2(n + 1) − 5), as required. ⊓⊔

'&%$ !"#4 '&%$ !"#3✁✁✁✁✁✁✁✁✁

'&%$ !"#1 �������� ��������

'&%$ !"#2

❂❂❂❂❂❂❂❂❂

Fig. 7.6. Labelled unrooted binary tree with four leaves.

Proposition 7.3. A labelled rooted binary tree with n ≥ 2 leaves has n − 1 internal nodes and 2n − 2 edges. The number of inequivalent labelled rooted binary trees with n ≥ 2 leaves is

   Π_{i=2}^n (2i − 3).

Proof. Let h_n denote the number of inequivalent labelled rooted binary trees with n ≥ 2 leaves. There is one labelled rooted binary tree with n = 2 leaves, i.e., h_2 = 1. This tree has n − 1 = 1 internal nodes and 2n − 2 = 2 edges (Fig. 7.7).

[Figure: rooted binary tree consisting of a root joined to the two leaves 1 and 2.]

Fig. 7.7. Labelled rooted binary tree with two leaves.

Each labelled unrooted binary tree T with n ≥ 3 leaves can be converted into a labelled rooted binary tree with n leaves by bisecting one of its edges and taking this new node as the root (Fig. 7.8). By Prop. 7.2, each such constructed tree has (n − 2) + 1 = n − 1 internal nodes and (2n − 3) + 1 = 2n − 2 edges. Each labelled rooted binary tree with n leaves can be constructed in this way, and the constructed trees are pairwise inequivalent. Since the number of labelled unrooted binary trees with n leaves is g_n and each such tree has 2n − 3 edges, the number of labelled rooted binary trees with n leaves is h_n = g_n · (2n − 3) = Π_{i=2}^{n−1} (2i − 3) · (2n − 3) = Π_{i=2}^n (2i − 3), as required. ⊓⊔

'&%$ !"#3✁✁✁✁✁✁✁✁✁

'&%$ !"#1 �������� ��������

'&%$ !"#2

❂❂❂❂❂❂❂❂❂

Fig. 7.8. Labelled rooted binary tree with three leaves.

The proofs are constructive and provide a method for generating both all labelled unrooted binary trees and all labelled rooted binary trees with a small number of leaves.
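As a quick numerical check of Props. 7.2 and 7.3, the following Python sketch (added for illustration) evaluates the two products:

def unrooted(n):
    # number of inequivalent labelled unrooted binary trees, n >= 3
    result = 1
    for i in range(3, n + 1):
        result *= 2 * i - 5
    return result

def rooted(n):
    # number of inequivalent labelled rooted binary trees, n >= 2
    result = 1
    for i in range(2, n + 1):
        result *= 2 * i - 3
    return result

print(unrooted(4), rooted(4))   # 3 and 15
print(unrooted(10))             # 2027025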

7.2 Fully Observed Tree Markov Model

We assume that not only the nucleotides for the taxa can be observed but also the nucleotides for the intermediates represented by the interior nodes of a phylogenetic tree. Two individuals in a phylogenetic tree share the same lineage up to their most recent common ancestor. After the split in lineages, it is a reasonable first approximation to assume that the random mechanisms by which substitutions occur operate independently on the genomes that are no longer shared. Equivalently, the nucleotides exhibited by the two individuals are conditionally independent given the corresponding nucleotide exhibited by their most recent common ancestor.

Example 7.4. Consider the four taxa tree in Fig. 7.1. Let Xi denote the random variable over theDNA alphabet corresponding to the individual i, 1 ≤ i ≤ 7. In view of the dependence structure givenby the tree, a joint probability such as

P (X1 = A, X2 = G, X3 = T, X4 = A, X5 = A, X6 = T, X7 = T)

can be computed as

P (X1 = A)

·P (X4 = A | X1 = A)P (X2 = G | X1 = A) (7.1)

·P (X5 = A | X2 = G)P (X3 = T | X2 = G)

·P (X6 = T | X3 = T)P (X7 = T | X3 = T).

Thus, for a given tree, the joint probabilities of the individuals exhibiting a particular set of nucleotidesare determined by the probability distribution of the root and the transition probabilities correspondingto the edges. ♦

We formally describe the tree model. For this, let T be a rooted binary tree with root r. Each node i ∈ N(T) corresponds to a random variable X_i with values in an alphabet Σ_i. Each edge kl ∈ E(T) is associated with a matrix θ^{kl} with positive entries, where the rows are indexed by Σ_k and the columns are indexed by Σ_l. The parameter space Θ of the tree model is given by the collection of the matrices θ^{kl} with kl ∈ E(T). Thus the dimension of the parameter space is d = Σ_{kl∈E(T)} |Σ_k| · |Σ_l|. A state σ of the tree model is a labelling of the nodes of the tree; i.e., σ = (σ_i)_{i∈N(T)}, where σ_i ∈ Σ_i. The state space of the model is the Cartesian product of the alphabets Σ_i with i ∈ N(T). Thus the cardinality of the state space is m = Π_{i∈N(T)} |Σ_i|.

The fully observed toric tree Markov model for the tree T is the mapping F_T : R^d → R^m defined as

   F_T : θ = (θ^{kl})_{kl∈E(T)} ↦ p = (p_σ),   (7.2)

where

   p_σ = (1/|Σ_r|) · Π_{kl∈E(T)} θ^{kl}_{σ_k σ_l}   (7.3)

for each state σ = (σ_i)_{i∈N(T)}. Note that the factor 1/|Σ_r| corresponds to a uniform distribution of states at the root (Ex. 7.4).

We consider a subset Θ1 of the parameter space Θ given by all positive matrices θ^{kl} whose row sums are 1. The matrices θ^{kl} can then be viewed as transition probability matrices along the branches. The dimension of the parameter space Θ1 is therefore d = Σ_{kl∈E(T)} |Σ_k| · (|Σ_l| − 1). The fully observed tree Markov model for the tree T is given by the restriction of the mapping F_T : R^d → R^m to the parameter space Θ1.

Tree models in phylogenetics usually have the same alphabet Σ at all nodes, but the transition probability matrices remain distinct and independent.


Example 7.5. Consider the 1,n claw tree T in Fig. 7.9. This is a tree with no internal nodes otherthan the root r and n leaves; that is, N(T ) = {r, 1, . . . , n} and E(T ) = {r1, . . . , rn}.

Assume that all nodes have the alphabet Σ = {0, 1}. The fully observed toric tree Markov model F_T has d = 4 · n parameters given by the 2 × 2 matrices

   θ^{ri} = | θ^{ri}_{00}  θ^{ri}_{01} |
            | θ^{ri}_{10}  θ^{ri}_{11} |,   1 ≤ i ≤ n.

The model has m = 2^{n+1} states, which are given by the binary strings σ_r σ_1 . . . σ_n ∈ Σ^{n+1}.

The fully observed tree Markov model F_T has d = 2 · n parameters defined by the 2 × 2 matrices

   θ^{ri} = | θ^{ri}_{00}  1 − θ^{ri}_{00} |
            | θ^{ri}_{10}  1 − θ^{ri}_{10} |,   1 ≤ i ≤ n.

The coordinates of the mapping F_T : R^{2n} → R^{2^{n+1}} are the marginal probabilities

   p_{σ_r σ_1 . . . σ_n} = (1/2) θ^{r1}_{σ_r σ_1} · · · θ^{rn}_{σ_r σ_n}.

[Figure: the claw tree with root r joined to the three leaves 1, 2, 3.]

Fig. 7.9. The 1,3 claw tree.

We provide maximum likelihood estimates for the fully observed tree Markov model. For this, let T be a rooted binary tree with n leaves. Note that the tree has 2n − 1 nodes. Assume that all individuals share a common alphabet Σ of cardinality q. Given is a sequence of observations σ^1, σ^2, . . . , σ^N in Σ^{2n−1}. In the corresponding data vector u = (u_σ), the entry u_σ provides the number of occurrences of the state σ ∈ Σ^{2n−1}. Thus we have Σ_σ u_σ = N. Let v^{kl}_{ij} denote the number of occurrences of ij ∈ Σ² as a consecutive pair for the edge kl in the tree (Fig. 7.10). The vector v = (v^{kl}_{ij}) provides the sufficient statistic of the model (Prop. 4.11).

Proposition 7.6. The maximum likelihood estimate of the data vector u in the fully observed tree Markov model is the parameter vector θ = (θ^{kl}_{ij}) in Θ1 with coordinates

   θ^{kl}_{ij} = v^{kl}_{ij} / Σ_{s∈Σ} v^{kl}_{is},   kl ∈ E(T), ij ∈ Σ².

Proof. The log-likelihood function of the toric model can be written as follows,


[Figure: a single directed edge from node k, labelled i, to node l, labelled j.]

Fig. 7.10. Labelled edge k → l.

   ℓ(θ) = Σ_{kl} Σ_i ( v^{kl}_{i1} log θ^{kl}_{i1} + . . . + v^{kl}_{iq} log θ^{kl}_{iq} ).

The log-likelihood function of the fully observed tree Markov model is obtained by restriction to the setΘ1 of positive matrices whose row sums are all equal to one. Therefore, ℓ(θ) is the sum of expressions

   v^{kl}_{i1} log θ^{kl}_{i1} + . . . + v^{kl}_{i,q−1} log θ^{kl}_{i,q−1} + v^{kl}_{iq} log( 1 − Σ_{s=1}^{q−1} θ^{kl}_{is} ).

These expressions have disjoint sets of unknowns for different sets of the indices k, l, and i. To maximize ℓ(θ) over Θ1, it is sufficient to maximize the above concave functions over a (q − 1)-dimensional simplex consisting of all non-negative vectors (θ^{kl}_{it})_t with coordinates summing to 1. By equating the partial derivatives of these expressions to zero, the unique critical point has the required coordinates. ⊓⊔

Example 7.7 (Maple). Consider the rooted binary tree T in Fig. 7.11. Assume that each node is associated with the binary alphabet Σ = {1, 2}. For this, we initialize

> restart: with(combinat): with(linalg):

> V12 := array(1..2,1..2);

> V13 := array(1..2,1..2);

> V24 := array(1..2,1..2);

> V25 := array(1..2,1..2);

> V36 := array(1..2,1..2);

> V37 := array(1..2,1..2);

> T12 := array([[0,0],[0,0]]);

> T13 := array([[0,0],[0,0]]);

> T24 := array([[0,0],[0,0]]);

> T25 := array([[0,0],[0,0]]);

> T36 := array([[0,0],[0,0]]);

> T37 := array([[0,0],[0,0]]);

generate 128 states σ ∈ Σ7 uniformly at random,

digs := rand(1..2): M := randmatrix(128, 7, entries = digs):

and calculate the sufficient statistic

> for i from 1 to 128 do

> T12[M[i,1],M[i,2]] := T12[M[i,1],M[i,2]] + 1;


> T13[M[i,1],M[i,3]] := T13[M[i,1],M[i,3]] + 1;

> T24[M[i,2],M[i,4]] := T24[M[i,2],M[i,4]] + 1;

> T25[M[i,2],M[i,5]] := T25[M[i,2],M[i,5]] + 1;

> T36[M[i,3],M[i,6]] := T36[M[i,3],M[i,6]] + 1;

> T37[M[i,3],M[i,7]] := T37[M[i,3],M[i,7]] + 1;

> od:

> print(T12,T13,T24,T25,T36,T37);

   T12 = [35 37; 26 30],  T13 = [37 35; 28 28],  T24 = [37 28; 28 35],
   T25 = [23 42; 34 29],  T36 = [30 27; 43 28],  T37 = [28 29; 34 37].

This allows us to estimate the parameters according to Prop. 7.6,

V12 := array([[T12[1,1]/(T12[1,1]+T12[1,2]),

T12[1,2]/(T12[1,1]+T12[1,2])],

[T12[2,1]/(T12[2,1]+T12[2,2]),

T12[2,2]/(T12[2,1]+T12[2,2])]]);

V13 := array([[T13[1,1]/(T13[1,1]+T13[1,2]),

T13[1,2]/(T13[1,1]+T13[1,2])],

[T13[2,1]/(T13[2,1]+T13[2,2]),

T13[2,2]/(T13[2,1]+T13[2,2])]]);

V24 := array([[T24[1,1]/(T24[1,1]+T24[1,2]),

T24[1,2]/(T24[1,1]+T24[1,2])],

[T24[2,1]/(T24[2,1]+T24[2,2]),

T24[2,2]/(T24[2,1]+T24[2,2])]]);

V25 := array([[T25[1,1]/(T25[1,1]+T25[1,2]),

T25[1,2]/(T25[1,1]+T25[1,2])],

[T25[2,1]/(T25[2,1]+T25[2,2]),

T25[2,2]/(T25[2,1]+T25[2,2])]]);

V36 := array([[T36[1,1]/(T36[1,1]+T36[1,2]),

T36[1,2]/(T36[1,1]+T36[1,2])],

[T36[2,1]/(T36[2,1]+T36[2,2]),

T36[2,2]/(T36[2,1]+T36[2,2])]]);

V37 := array([[T37[1,1]/(T37[1,1]+T37[1,2]),

T37[1,2]/(T37[1,1]+T37[1,2])],

[T37[2,1]/(T37[2,1]+T37[2,2]),

T37[2,2]/(T37[2,1]+T37[2,2])]]);

> print(V12,V13,V24,V25,V36,V37);

The parameter estimates are

   V12 = [35/72 37/72; 13/28 15/28],  V13 = [37/72 35/72; 1/2 1/2],   V24 = [37/65 28/65; 4/9 5/9],
   V25 = [23/65 42/65; 34/63 29/63],  V36 = [10/19 9/19; 43/71 28/71], V37 = [28/57 29/57; 34/71 37/71].


[Figure: rooted binary tree with root 1; node 1 has children 2 and 3, node 2 has the leaves 4 and 5, and node 3 has the leaves 6 and 7.]

Fig. 7.11. Rooted tree with four leaves.

7.3 Hidden Tree Markov Model

A fully observed tree Markov model exhibits the nucleotides of all involved individuals (taxa and intermediates). By taking the marginal probability distribution over the taxa, we obtain the corresponding hidden tree Markov model, in which nucleotides are only exhibited by the taxa.

Example 7.8. Reconsider the tree with four taxa (Fig. 7.1) in Ex. 7.4. The joint probability distributionfor the taxa is given by

   P(X4 = A, X5 = A, X6 = T, X7 = T) = Σ_{B1} Σ_{B2} Σ_{B3} P(X1 = B1, X2 = B2, X3 = B3, X4 = A, X5 = A, X6 = T, X7 = T),

where the sums are taken over all nucleotides for the intermediates. ♦

We formally describe the hidden tree Markov model. For this, let T be a rooted binary tree with root r and leaf set [n]. The hidden tree Markov model for the tree T is obtained from the fully observed tree Markov model F_T by summing the marginal probabilities over the internal nodes of the tree. The parameter space remains the same. However, the state space is Σ1 × Σ2 × . . . × Σn, the product of the alphabets associated with the leaves of T. Thus the state space has the cardinality m′ = |Σ1| · |Σ2| · · · |Σn|. The restriction of the fully observed model to the hidden model is defined formally by the marginalization mapping ρ_T : R^m → R^{m′}, which takes real-valued functions on Π_{i∈N(T)} Σ_i to real-valued functions on Π_{i=1}^n Σ_i, such that the hidden tree Markov model is given by the marginalization of the fully observed tree model, f_T = ρ_T ◦ F_T.

Note that the EM technique can be used to provide maximum likelihood estimates for the hidden tree Markov model. For this, Prop. 7.6 can be employed to yield maximum likelihood estimates for the corresponding fully observed tree Markov model. Note that the hidden Markov model is a specialization of the hidden tree Markov model in which the tree is the caterpillar tree (Fig. 6.2).

Example 7.9 (Singular). Reconsider the 1,n claw tree T in Ex. 7.5. The corresponding hidden tree Markov model has d = 2 · n parameters and m = 2^n states given by the binary strings σ1 . . . σn ∈ Σ^n whose letters correspond one-to-one with the taxa. The coordinates of the mapping f_T : R^{2n} → R^{2^n} are the marginal probabilities


   p_{σ1...σn} = (1/2) θ^{r1}_{0σ1} · · · θ^{rn}_{0σn} + (1/2) θ^{r1}_{1σ1} · · · θ^{rn}_{1σn}.

The corresponding EM algorithm is illustrated in Alg. 7.1. Note that the sufficient statistic v ∈ Z^{4n}_{≥0} can

Algorithm 7.1 EM algorithm for Markov model given by the 1,n claw tree.

Require: Hidden 1,n claw tree Markov model f : R^{2n} → R^{2^n} with parameter space Θ1 and observed data u = (u_σ), σ ∈ Σ^n, Σ = {0, 1} binary alphabet
Ensure: Maximum likelihood estimate θ ∈ Θ1
[Init] Threshold ǫ > 0 and parameters θ ∈ Θ1
[E-Step] Define the matrix U = (u_{σr,σ})_{σr∈Σ, σ∈Σ^n} with
   u_{σr,σ} = u_σ · p_{σr,σ}(θ) / p_σ(θ),   σr ∈ Σ, σ ∈ Σ^n
[M-Step] Compute a solution θ* ∈ Θ1 of the maximization problem in the fully observed 1,n claw tree Markov model
[Comp] If ℓ(θ*) − ℓ(θ) > ǫ, set θ := θ* and resume with the E-step
Output θ := θ*.

be established from the data U ∈ Z^{2^{n+1}}_{≥0} in the E-step by a linear transformation v = AU, where the matrix A ∈ Z^{4n×2^{n+1}}_{≥0} is given as follows (n = 3),

                0000 0001 0010 0011 0100 0101 0110 0111 1000 1001 1010 1011 1100 1101 1110 1111
   θ^{r1}_{00}    1    1    1    1    0    0    0    0    0    0    0    0    0    0    0    0
   θ^{r1}_{01}    0    0    0    0    1    1    1    1    0    0    0    0    0    0    0    0
   θ^{r1}_{10}    0    0    0    0    0    0    0    0    1    1    1    1    0    0    0    0
   θ^{r1}_{11}    0    0    0    0    0    0    0    0    0    0    0    0    1    1    1    1
   θ^{r2}_{00}    1    1    0    0    1    1    0    0    0    0    0    0    0    0    0    0
   θ^{r2}_{01}    0    0    1    1    0    0    1    1    0    0    0    0    0    0    0    0
   θ^{r2}_{10}    0    0    0    0    0    0    0    0    1    1    0    0    1    1    0    0
   θ^{r2}_{11}    0    0    0    0    0    0    0    0    0    0    1    1    0    0    1    1
   θ^{r3}_{00}    1    0    1    0    1    0    1    0    0    0    0    0    0    0    0    0
   θ^{r3}_{01}    0    1    0    1    0    1    0    1    0    0    0    0    0    0    0    0
   θ^{r3}_{10}    0    0    0    0    0    0    0    0    1    0    1    0    1    0    1    0
   θ^{r3}_{11}    0    0    0    0    0    0    0    0    0    1    0    1    0    1    0    1

Phylogenetic invariants of the 1, 3 claw tree with common binary alphabet Σ = {0, 1} are calculatedas follows.

> ring r = 0, (x(1..3),y(1..3),p(1..8)), dp;

# x(i) encodes theta_00^ri, y(i) encodes theta_10^ri, i=1,2,3

# 1-x(i) encodes theta_01^ri, 1-y(i) encodes theta_11^ri, i=1,2,3

> ideal i0 = p(1) - (x(1)*x(2)*x(3)+y(1)*y(2)*y(3)); # 000

> ideal i1 = p(2) - (x(1)*x(2)*(1-x(3))+y(1)*y(2)*(1-y(3))); # 001

> ideal i2 = p(3) - (x(1)*(1-x(2))*x(3)+y(1)*(1-y(2))*y(3)); # 010

> ideal i3 = p(4) - (x(1)*(1-x(2))*(1-x(3))+y(1)*(1-y(2))*(1-y(3))); # 011

150 7 Tree Markov Models

> ideal i4 = p(5) - ((1-x(1))*x(2)*x(3)+(1-y(1))*y(2)*y(3)); # 100

> ideal i5 = p(6) - ((1-x(1))*x(2)*(1-x(3))+(1-y(1))*y(2)*(1-y(3))); # 101

> ideal i6 = p(7) - ((1-x(1))*(1-x(2))*x(3)+(1-y(1))*(1-y(2))*y(3)); # 110

> ideal i7 = p(8) - ((1-x(1))*(1-x(2))*(1-x(3))+(1-y(1))*(1-y(2))*(1-y(3))); # 111

> ideal i = i0+i1+i2+i3+i4+i5+i6+i7;

> ideal j = std(i);

> ideal k = eliminate( j, x(1)*x(2)*x(3)*y(1)*y(2)*y(3) );

The output yields two phylogenetic invariants in the ring R[p1, . . . , p8], one of which is p1 + p2 + p3 + p4 + p5 + p6 + p7 + p8 − 2, while the other is rather large. ♦

Example 7.10 (Maple). Consider the hidden tree Markov model for the tree T in Fig. 7.11. This model has d = 2 · 6 = 12 parameters and m = 2^4 = 16 states. The coordinates of the associated mapping f_T : R^12 → R^16 are the marginal probabilities

   p_{σ4σ5σ6σ7} = (1/2) Σ_{σ1∈Σ} Σ_{σ2∈Σ} Σ_{σ3∈Σ} θ^{12}_{σ1σ2} θ^{13}_{σ1σ3} θ^{24}_{σ2σ4} θ^{25}_{σ2σ5} θ^{36}_{σ3σ6} θ^{37}_{σ3σ7}.

These probabilities can be computed as follows,

P := vector(16):

R := powerset([1,2,3,4,5,6,7]):

for i from 1 to nops(R) do

r := vector( [1,1,1,1,1,1,1] ):

for j from 1 to nops(R[i]) do

r[R[i,j]] := 2

od;

k := 1+(r[4]-1)+(r[5]-1)*2+(r[6]-1)*2^2+(r[7]-1)*2^3:

P[k] := 1/2*T12[r[1],r[2]]*T13[r[1],r[3]]*T24[r[2],r[4]]

*T25[r[2],r[5]]*T36[r[3],r[6]]*T37[r[3],r[7]];

od:

7.4 Sum-Product Decomposition

We show that the marginal probabilities of the hidden tree Markov model can be efficiently computed by a sum-product decomposition. For this, consider a rooted binary tree T with n leaves. The probability of occurrence of the pattern σ ∈ Σ^n labelling the leaves of the tree is given by

   p_σ = Σ_τ p_{τ,σ},   (7.4)

where τ runs over all states of the internal nodes of the tree and p_{τ,σ} is the probability that the tree is decorated by the state (τ, σ), where τ labels the interior nodes and σ labels the leaves.

Let r be the root of the tree T and let a and b denote the descendants of the root. The marginal probability of the pattern σ then decomposes into a nested sum


   p_σ = (1/|Σ_r|) Σ_{τr} [ ( Σ_{τa} θ^{ra}_{τr τa} p^a_{σ1...σs} ) · ( Σ_{τb} θ^{rb}_{τr τb} p^b_{σs+1...σn} ) ],   (7.5)

where p^a_{σ1...σs} and p^b_{σs+1...σn} denote the marginal probabilities of the subtrees of T with roots a and b and decorations of the leaves given by σ1, . . . , σs and σs+1, . . . , σn, respectively (Fig. 7.12).

[Figure: a rooted binary tree with root r; the subtree rooted at the child a carries the leaf labels σ1, . . . , σs, and the subtree rooted at the child b carries the leaf labels σs+1, . . . , σn.]

Fig. 7.12. Decomposition of a rooted binary tree into two subtrees.

Example 7.11. Reconsider the hidden tree Markov model in Ex. 7.10. The marginal probabilities aregiven by

   p_{σ1σ2σ3σ4} = (1/2) · Σ_{τ1∈Σ} Σ_{τ2∈Σ} Σ_{τ3∈Σ} θ^{12}_{τ1τ2} θ^{13}_{τ1τ3} θ^{24}_{τ2σ1} θ^{25}_{τ2σ2} θ^{36}_{τ3σ3} θ^{37}_{τ3σ4},

whose sum-product decomposition amounts to

   p_{σ1σ2σ3σ4} = (1/2) · Σ_{τ1∈Σ} [ ( Σ_{τ2∈Σ} θ^{12}_{τ1τ2} p^2_{σ1σ2} ) · ( Σ_{τ3∈Σ} θ^{13}_{τ1τ3} p^3_{σ3σ4} ) ]
                = (1/2) · Σ_{τ1∈Σ} [ ( Σ_{τ2∈Σ} θ^{12}_{τ1τ2} (θ^{24}_{τ2σ1} θ^{25}_{τ2σ2}) ) · ( Σ_{τ3∈Σ} θ^{13}_{τ1τ3} (θ^{36}_{τ3σ3} θ^{37}_{τ3σ4}) ) ].   (7.6)
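The following Python sketch (added here; the 2 × 2 dictionaries t12, . . . , t37 are assumed to hold the six transition matrices over Σ = {0, 1}) evaluates (7.6) by the nested decomposition rather than by the full triple sum:

def p_leaves(s1, s2, s3, s4, t12, t13, t24, t25, t36, t37, states=(0, 1)):
    # nested sum-product evaluation of the marginal probability (7.6)
    total = 0.0
    for r1 in states:
        left  = sum(t12[r1][r2] * t24[r2][s1] * t25[r2][s2] for r2 in states)
        right = sum(t13[r1][r3] * t36[r3][s3] * t37[r3][s4] for r3 in states)
        total += left * right
    return total / 2

# purely illustrative: the same symmetric matrix reused on every edge
m = {0: {0: 0.8, 1: 0.2}, 1: {0: 0.2, 1: 0.8}}
print(p_leaves(0, 0, 1, 1, m, m, m, m, m, m))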

7.5 Felsenstein Algorithm

In the early 1980s, Felsenstein was the first who brought the maximum likelihood framework tonucleotide-based phylogenetic inference.

The Felsenstein algorithm provides an explanation for a sequence of observed data σ ∈ Σ^n of a tree T with n leaves. Finding an explanation means identifying the hidden data τ with maximum a posteriori probability that generated the observed data σ; that is,

τ = argmaxτ{pτ,σ}. (7.7)

By putting wτ,σ = − log pτ,σ, this equation becomes


τ = argminτ{wτ,σ}. (7.8)

The sequence τ provides a labelling of the internal nodes and is the solution of the marginalization (7.4) performed in the tropical algebra,

   w_σ = ⊕_τ w_{σ,τ}.   (7.9)

The value w_σ can be efficiently computed by tropicalization of the sum-product decomposition of the marginal probability p_σ. For this, we put w^{kl}_{ij} = − log θ^{kl}_{ij} for each parameter and w^a_{σ′} = − log p^a_{σ′} for each tree node a and each sequence σ′ decorating the leaves of the subtree of T with root a. Then the sum-product decomposition (7.5) performed in the tropical algebra becomes

   w_σ = ⊕_{τr} [ ( ⊕_{τa} w^{ra}_{τr τa} ⊙ w^a_{σ1...σs} ) ⊙ ( ⊕_{τb} w^{rb}_{τr τb} ⊙ w^b_{σs+1...σn} ) ].   (7.10)

The computation of this expression is known as the Felsenstein algorithm. It consists of a forward algorithm that evaluates the expression w_σ (Alg. 7.2) and a backward algorithm that traces back the optimal decisions made in each step; that is, the symbols for which the minimum is attained. The computed explanation τ is called a Felsenstein sequence.

Proposition 7.12. The tropicalization of the marginal probability p_σ, σ ∈ Π_{i=1}^n Σ_i, of the hidden tree model provides an explanation for the observed data σ in O(l²n) steps, where l is an upper bound on the alphabet size.

Proof. The computation can be carried out by using a (2n − 1) × l table M, where

   w_σ = ⊕_τ M[r, τ],
   M[r, τ] = ( ⊕_{τa} w^{ra}_{τ τa} ⊙ M[a, τa] ) ⊙ ( ⊕_{τb} w^{rb}_{τ τb} ⊙ M[b, τb] ),   τ ∈ Σ,

and

   M[ℓ, τ] = 0 for each leaf ℓ, τ ∈ Σ.

The table M has size O(ln) and each entry is evaluated in O(l) time steps. ⊓⊔

Example 7.13. Reconsider the hidden tree model in Ex. 7.11. The tropicalization of the marginalprobability (7.6) gives

   w_{σ1σ2σ3σ4} = ⊕_{τ1} [ ( ⊕_{τ2∈Σ} w^{12}_{τ1τ2} ⊙ w^2_{σ1σ2} ) ⊙ ( ⊕_{τ3∈Σ} w^{13}_{τ1τ3} ⊙ w^3_{σ3σ4} ) ]
                = ⊕_{τ1∈Σ} [ ( ⊕_{τ2∈Σ} w^{12}_{τ1τ2} ⊙ (w^{24}_{τ2σ1} ⊙ w^{25}_{τ2σ2}) ) ⊙ ( ⊕_{τ3∈Σ} w^{13}_{τ1τ3} ⊙ (w^{36}_{τ3σ3} ⊙ w^{37}_{τ3σ4}) ) ].   (7.11)


Algorithm 7.2 Felsenstein forward algorithm.

Require: Rooted binary tree T with leaf set [n] and root r, observed sequence σ ∈ Σ^n, Σ common node alphabet
Ensure: Evaluation of table M
for each leaf ℓ in T do
   M[ℓ, τ] ← 0
   mark node ℓ
end for
repeat
   take an unmarked node v in T whose descendants a, b are already marked
   for each symbol τ in Σ do
      M[v, τ] ← ( ⊕_{τa} w^{va}_{τ τa} ⊙ M[a, τa] ) ⊙ ( ⊕_{τb} w^{vb}_{τ τb} ⊙ M[b, τb] )
   end for
   mark node v
until root r is marked
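As an added illustration, the following Python sketch evaluates the tropical recursion on the four-leaf tree of Fig. 7.11; it assumes the usual convention that a leaf is constrained to its observed symbol (score 0 for that symbol and ∞ otherwise), and the weight dictionaries w12, . . . , w37 stand for the tropicalized edge parameters w^{kl}_{ij}:

import math

def felsenstein(sigma, w12, w13, w24, w25, w36, w37, states=(0, 1)):
    inf = math.inf
    # leaf tables: 0 for the observed symbol, infinity otherwise
    leaf = lambda obs: {t: (0.0 if t == obs else inf) for t in states}
    M4, M5, M6, M7 = (leaf(s) for s in sigma)          # leaves 4, 5, 6, 7
    # internal nodes 2 and 3 combine their two children tropically
    M2 = {t: min(w24[t][a] + M4[a] for a in states) +
             min(w25[t][b] + M5[b] for b in states) for t in states}
    M3 = {t: min(w36[t][a] + M6[a] for a in states) +
             min(w37[t][b] + M7[b] for b in states) for t in states}
    # root 1
    M1 = {t: min(w12[t][a] + M2[a] for a in states) +
             min(w13[t][b] + M3[b] for b in states) for t in states}
    return min(M1.values())          # w_sigma, up to the constant root term

w = {0: {0: 0.1, 1: 2.3}, 1: {0: 2.3, 1: 0.1}}   # one illustrative weight matrix
print(felsenstein((0, 0, 1, 1), w, w, w, w, w, w))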

One problem with the maximum likelihood approach is model selection. Suppose we have n taxa, d parameters, m states, and a vector u ∈ N^m of observed data. Then we may consider all rooted binary trees with n leaves. Each such tree leads to an algebraic statistical model f : R^d → R^m. The task is then to select a "good" model for the data; that is, a model whose likelihood function attains the largest value. However, the number of trees grows exponentially with the number of taxa, and hence this approach is only viable for a handful of taxa.

7.6 Evolutionary Models

The above general model for the observed nucleotides allows the substitution matrices to be arbi-trary. However, there are practical reasons for constraining the form of these matrices. This leads toevolutionary models that are described by time-continuous Markov chains.

A time-continuous Markov chain is based on a matrix of rates which describes the substitution of nucleotides in an infinitesimally small time interval. A rate matrix is an n × n real-valued matrix Q = (q_{ij}) with rows and columns indexed by a common alphabet Σ, such as the DNA alphabet Σ = {A, C, G, T}, that satisfies the following conditions:

• all off-diagonal entries are non-negative: q_{ij} ≥ 0, i ≠ j,
• all row sums are zero: Σ_{j∈Σ} q_{ij} = 0, i ∈ Σ,
• all diagonal entries are negative: q_{ii} < 0, i ∈ Σ.

A rate matrix is an infinitesimal generator matrix for a time-continuous Markov chain capturing thenotion of instantaneous rate of mutation.


A rate matrix Q gives rise to a probability distribution matrix Φ(t), t ≥ 0, by exponentiation. The matrix exponential for the matrix Qt is the n × n real-valued matrix

   Φ(t) = exp(Qt) = Σ_{k=0}^∞ (1/k!) Q^k t^k,   t ≥ 0.   (7.12)

The entry of Φ(t) in row i and column j equals the conditional probability that the substitution i → j occurs in a time interval of length t ≥ 0.
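This can be checked numerically. The following Python sketch (added here) builds the symmetric 3 × 3 rate matrix used in the Maple example below and computes Φ(t) with scipy.linalg.expm; the row sums of the result are 1, in line with the stochasticity property stated in the theorem below:

import numpy as np
from scipy.linalg import expm

Q = np.array([[-2.0,  1.0,  1.0],
              [ 1.0, -2.0,  1.0],
              [ 1.0,  1.0, -2.0]])

Phi = expm(Q * 0.5)          # transition probabilities for time t = 0.5
print(Phi)
print(Phi.sum(axis=1))       # each row sums to 1, so Phi is stochastic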

Theorem 7.14. Each rate matrix Q has the following properties:

• Chapman-Kolmogorov equation: Φ(s + t) = Φ(s) · Φ(t), s, t ≥ 0.
• The matrix Φ(t) is the unique solution of the differential equation Φ′(t) = Φ(t) · Q = Q · Φ(t), Φ(0) = I, t ≥ 0.
• Higher derivatives at the origin: Φ^{(k)}(0) = Q^k, k ≥ 0.
• The matrix Φ(t) is stochastic (non-negative entries with row sums equal to one) for each t ≥ 0.

Proof. The matrix exponential exp(A) is well-defined because the series converges componentwise forany square matrix A. If two n×n matrices A and B commute, we obtain (AB)k = AkBk, k ≥ 0. Thenthe basic exponentiation identity holds

exp(A+B) = exp(A) exp(B). (7.13)

But tQ and sQ commute for any values s, t and thus Φ(s+ t) = Φ(s) +Φ(t).The matrix exponential exp(tQ) can be differentiated term-by-term, since the power series exp(tQ)

has infinite radius of convergence. We have

Φ′(t) =

∞∑

k=1

tk−1Qk

(k − 1)!= Φ(t) ·Q = Q ·Φ(t), t ≥ 0.

The uniqueness is a standard result for systems of ordinary linear differential equations. Iterated dif-ferentiation leads to the identity in item three.

Finally, when t approches 0, the Taylor series expansion gives Φ(t) = I + tQ+O(t2). Thus, when tis sufficiently small, qij(t) ≥ 0 implies Φij ≥ 0 for all i 6= j. But by the first part, Φ(t) = Φ(t/m)m forall m and thus qij(t) ≥ 0 implies Φij ≥ 0, i 6= j, for all t ≥ 0. Moreover Q has row sums equal to zeroand thus Qm has the same property for each integer m ≥ 1. Hence, the matrix Φ(t) is stochastic. ⊓⊔

Example 7.15 (Maple). The matrix exponential for a matrix can be calculated as follows,

> with(LinearAlgebra):

> Q := Matrix( [[-2,1,1],[1,-2,1],[1,1,-2]] ):

> MatrixExponential(Q,t);

7.6 Evolutionary Models 155

The resulting matrix is

exp(Qt) =

23e

−3t + 13

13 − 2

3e−3t 1

3 − 23e

−3t

13 − 2

3e−3t 2

3e−3t + 1

313 − 2

3e−3t

13 − 2

3e−3t 1

3 − 23e

−3t 23e

−3t + 13

.

♦The matrix exponential exp(A) for a n×n complex-valued matrixA is invertible, since by the Chapman-Kolmogorov equation, exp(A)·exp(−A) = exp(A+(−A)) = exp(0) = I and thus the inverse of exp(A)is exp(−A). The matrix exponential provides a mapping exp :Mn(C) → Gln(C) from the vector spaceof all n × n complex matrices (Mn(C),+,0) to the general linear group of degree n, i.e., the group ofall n× n invertible complex matrices (GLn(C), ·, I). This mapping is surjective which shows that eachinvertible complex matrix can be written as the exponential of some other complex matrix.

Moreover, the matrix exponential exp(A) for a n×n real-valued symmetric matrix A can be easilycomputed. For this, notice that any n × n real-valued symmetric matrix can be diagonalized by anorthogonal matrix; that is, there is a real-valued orthogonal matrix U such that D = UTAU is adiagonal matrix. For each diagnonal matrix D with main diagonal entries d1, . . . , dn, the correspondingmatrix exponential exp(D) is a diagonal matrix with main diagonal entries ed1 , . . . , edn . Since UUT =I, it follows from the definitions that exp(A) = U exp(D)UT . Since the formation of determinantscommutes with matrix multiplication, we obtain

det(exp(A)) = det(exp(D)) =∏

i

edi = etr(A), (7.14)

where tr(A) denotes the trace of the matrix A. By taking the natural logarithm, we obtain

tr(A) = log(det(exp(A))).

An evolutionary model for n taxa over an alphabet Σ is specified by a rooted binary tree T with nleaves together with a rate matrix Q over the alphabet Σ and an initial distribution for the root of thetree (which is often assumed to be uniform on the alphabet). The transition probability matrix for theedge kl of the tree T is given by the matrix exponential exp(Qtkl), where the socalled branch lengthstkl are to be determined. The corresponding algebraic statistical model Rd → Rm is a specialization ofthe hidden tree Markov model to these matrices.

The Felsenstein hierarchy is a nested family of evolutionary models. It can be seen as a cumulativeresult of experimentation and development of many special time-continuous Markov models with ratematrices that incorporate biologically meaningful parameters.

The simplest model is the Jukes-Cantor model (JC) developed in the late 1960s. This model is highlystructured with equal transition probabilites among the nucleotides and with uniform root distribution.It is based on the Jukes-Cantor rate matrix

Q =

−3α α α αα −3α α αα α −3α αα α α −3α

, (7.15)

where the parameter α > 0 indicates that all substitutions occur at the same rate. The correspondingtransition probability matrix is

156 7 Tree Markov Models

exp(tQ) =1

4

1 + 3e−4αt 1− e−4αt 1− e−4αt 1− e−4αt

1− e−4αt 1 + 3e−4αt 1− e−4αt 1− e−4αt

1− e−4αt 1− e−4αt 1 + 3e−4αt 1− e−4αt

1− e−4αt 1− e−4αt 1− e−4αt 1 + 3e−4αt

, t ≥ 0. (7.16)

Note that all transition probabilities converge to 14 as t approaches infinity. Thus in the equilibrium

distribution all nucleotides are substituted with the same rate. Put a = e−4αt. Then the transitionprobability matrix has the form

1− 3a a a aa 1− 3α a aa a 1− 3a aa a a 1− 3a

, (7.17)

where a is the probability of a mutation from nucleotide i to another nucleotide j and 1 − 3a is theprobability of staying at the nucleotide. Since the rows must sum to 1, we have 0 < a < 1/3.

The expected number of mutations over time t is the quantity

3αt = −1

4· tr(Q) · t = −1

4· log(det(Φ(t))), (7.18)

where the last equation follows from (7.14). This number is called the branch length of the model andis used to label the edges of a phylogenetic tree. Although the JC model does not capture the biologyvery well, it is easy to work with and is often used for quick calculations.

We make a change of variables in the JC model. For this, assume that the tree T has m edgesand the leaves are indexed by the set [n]. Let Φ(i) = Φ(ti) denote the transition probability matrixassociated to the i-th edge, 1 ≤ i ≤ m. Instead of using the parameters αi and ti, we introduce newparameters

πi =1

4(1− e−4αiti) and µi =

1

4(1 + 3e−4αiti). (7.19)

The parameters are the entries of the transition probability matrix for the i-th edge,

Φ(i) = exp(tiQ) =

µi πi πi πiπi µi πi πiπi πi µi πiπi πi πi µi

. (7.20)

The entries satisfy the linear constraint

µi + 3πi = 1. (7.21)

The branch length of the i-th edge can be recovered from the matrix Φ(i) as follows,

3αiti = −1

4· log(det(Φ(i))). (7.22)

The JC model hasm parameters by (7.21) and 4n states and is thus given by the mapping f : Rm → R4n .

7.6 Evolutionary Models 157

Example 7.16. Reconsider the 1,3 claw tree T (Fig. 7.9). We take the JC model with uniform rootdistribution. This model is given by the mapping f : R3 → R64 with five distinct marginal distributions(Fig. 7.13).

The probability of observing the same symbol at all three leaves is

p123 =1

4(µ1µ2µ3 + 3π1π2π3). (7.23)

The probabilities of seeing the same nucleotide at two distinct leaves i, j and a different one at the thirdleaf are

p12 =1

4(µ1µ2π3 + π1π2µ3 + 2π1π2π3), (7.24)

p13 =1

4(µ1π2µ3 + π1µ2π3 + 2π1π2π3), (7.25)

p23 =1

4(π1µ2µ3 + µ1π2π3 + 2π1π2π3). (7.26)

The probability of noticing three distinct letters at the leaves is

pdis =1

4(µ1π2π3 + π1µ2π3 + π1π2µ3 + π1π2π3). (7.27)

?>=<89:;r

❋❋❋❋

❋❋❋❋

❋❋

①①①①①①①①①①

?>=<89:;r

❋❋❋❋

❋❋❋❋

❋❋

①①①①①①①①①①

?>=<89:;r

❋❋❋❋

❋❋❋❋

❋❋

①①①①①①①①①①

GFED@ABC1 : A GFED@ABC2 : A GFED@ABC3 : A GFED@ABC1 : A GFED@ABC2 : A GFED@ABC3 : C GFED@ABC1 : A GFED@ABC2 : C GFED@ABC3 : G

Fig. 7.13. Labellings of the 1,3 claw tree for p123, p12, and pdis, respectively.

The nucleotides fall biochemically into two families: purines (adenine and guanine) and pyrimidines(cytosine and thymine). Substitutions within a family are called transitions and substitutions betweenfamilies are called transversions.

The Kimura two-parameter model K2P (1980) is an extension of the Jukes-Cantor model thatdistinguishes between transitions and transversions by assigning a common rate to transversions anda different common rate to transitions but still assumes an equal root distribution. The model K2P isbased on the rate matrix

A C G T

A . β α βC β . β αG α β . βT β α β .

, (7.28)

158 7 Tree Markov Models

where α, β > 0 and the diagonal entries are determined by the constraint that the row sums are equalto zero.

The Kimura three-parameter model K3P (1981) is a generalization of the K2P model that allowstwo classes of transversions to occur at different rates. Its rate matrix is of the form

A C G T

A . β α γC β . γ αG α γ . βT γ α β .

, (7.29)

where α, β, γ > 0. The corresponding matrix exponential is

rt st ut vtst rt vt utut vt rt stvt ut st rt

, t ≥ 0, (7.30)

where

rt =1

4exp(−2(α+ β)t) +

1

4exp(−2(α+ γ)t) +

1

4exp(−2(β + γ)t) +

1

4

st = −1

4exp(−2(α+ β)t)− 1

4exp(−2(α+ γ)t) +

1

4exp(−2(β + γ)t) +

1

4

ut = −1

4exp(−2(α+ β)t) +

1

4exp(−2(α+ γ)t)− 1

4exp(−2(β + γ)t) +

1

4

vt =1

4exp(−2(α+ β)t)− 1

4exp(−2(α+ γ)t)− 1

4exp(−2(β + γ)t) +

1

4.

The strand symmetric model CS05 (2004) extends the K2P model by assuming that the root dis-tribution fulfills πA = πT and πC = πG. The CS05 model has the rate matrix

. βπC απC βπAβπA . βπC απAαπA βπC . βπAβπA απC βπC .

, (7.31)

where α, β > 0.The Felsenstein model F81 (1981) is an extension of the JC model that allows a non-uniform root

distribution. The rate matrix is

. απC απG απTαπA . απG απTαπA απC . απTαπA απC απG .

, (7.32)

where α > 0.The Hasegawa model HKY85 (1985) is a compromise between the models F81 and CS05. Its rate

matrix is

7.6 Evolutionary Models 159

. βπC απG βπTβπA . βπG απTαπA βπC . βπTβπA απC βπG .

, (7.33)

where α, β > 0.The Tamura-Nei model TN93 (1993) is an extension of the HKY85 model including an extra pa-

rameter for the two types of transversions. It has rate matrix

. βπC απG βπTβπA . βπG γπTαπA βπC . βπTβπA γπC βπG .

, (7.34)

where α, β, γ > 0.The symmetric model SYM (1994) is given by a symmetric rate matrix and assumes uniform root

distribution. Its rate matrix is

. α β γα . δ ǫβ δ . φγ ǫ φ .

, (7.35)

where α, β, γ, δ, ǫ, φ > 0.The REV model (1984) is the most general DNA model. Its only restriction is symmetry. The rate

matrix is

. απC βπG γπTαπA . δπG ǫπTβπA δπC . φπTγπA ǫπC φπG .

, (7.36)

where α, β, γ, δ, ǫ, φ > 0.

Example 7.17 (Singular). Reconsider the JC model of the 1,3 claw tree f : R3 → R64 studied inEx. 7.16. Since the model has only five different marginal probabilities, we may consider the JC modelas the image of the simplified mapping f ′ : R3 → R5 with marginal probabilities given by (7.23)-(7.27).The following program computes invariants of the simplified model,

> ring r = 0, (x(1..3),y(1..3),p(1..5)), dp;

> ideal i1 = p(1)-(x(1)*x(2)*x(3)+3*y(1)*y(2)*y(3));

> ideal i2 = p(2)-(x(1)*x(2)*y(3)+y(1)*y(2)*x(3)+2*y(1)*y(2)*y(3));

> ideal i3 = p(3)-(x(1)*y(2)*x(3)+y(1)*x(2)*y(3)+2*y(1)*y(2)*y(3));

> ideal i4 = p(4)-(y(1)*x(2)*x(3)+x(1)*y(2)*y(3)+2*y(1)*y(2)*y(3));

> ideal i5 = p(5)-(x(1)*y(2)*y(3)+y(1)*x(2)*y(3)+y(1)*y(2)*x(3)+y(1)*y(2)*y(3));

> ideal i = i1+i2+i3+i4+i5, x(1)+3*y(1)-1, x(2)+3*y(2)-1, x(3)+3*y(3)-1;

> ideal j = std(i);

> ideal k = eliminate(j,x(1)*x(2)*x(3)*y(1)*y(2)*y(3));

160 7 Tree Markov Models

The output provides three (large) phylogenetic invariants in the ring R[p1, . . . , p5]. ♦

Given a set of observed DNA sequences. We may ask the question whether or not these sequencescome from a particular evolutionary model M given by a tree T and a root distribution π. For this,we compute the pattern frequencies (pσ) from the observed taxa. These pattern frequencies can serveas estimates for the marginal probabilities (pσ) of the model. Then we evaluate each of the modelinvariants using the pattern frequencies. If the sequence data come from the model, we will expect theevaluated invariants not to differ significantly from zero. These evaluated invariants give us a score ofthe model. Then we may choose the evolutionary model with the minimal score among all models asthe ”true” evolutionary model.

7.7 Group-Based Evolutionary Models

The Jukes-Cantor model for either binary or DNA sequences and the Kimura models with two or threeparameters belong to the class of group based models. These models have the property that a linearchange of coordinates by using discrete Fourier transform translates the ideal of phylogenetic invariantsinto a toric model. More specifically, the symbols of the alphabet in a group-based model can be labelledby the elements of a finite group in such a way that the probability of translating group elements (fromg to h) depends only on their difference (g−h). By replacing the original coordinates pi1,...,in by Fouriercoordinates qi1,...,in , the ideal of phylogenetic invariantes becomes toric.

An evolutionary model on the state space [n] is called group-based if there is an abelian group Gwith elements g1, . . . , gn and a mapping ψ : G → R such that the n × n instantaneous rate matrixQ = (Qij) satisfies the condition

Qij = ψ(gj − gi), 1 ≤ i, j ≤ n. (7.37)

Example 7.18. Consider the cyclic group G = Z4. The group table for the differences g − h of groupelements g, h ∈ G has the form

− 0 1 2 30 0 3 2 11 1 0 3 22 2 1 0 33 3 2 1 0

(7.38)

and can be mapped onto the entries of the instantaneous rate matrix for the Kimura’s two-parametermodel K80,

Q =

. α β αα . α ββ α . αα β α .

. (7.39)

7.7 Group-Based Evolutionary Models 161

Example 7.19. Take the Klein group G = Z2 ×Z2. The group table for the differences g− h of groupelements g, h ∈ G has the form

− (0, 0) (0, 1) (1, 0) (1, 1)(0, 0) (0, 0) (0, 1) (1, 0) (1, 1)(0, 1) (0, 1) (0, 0) (1, 1) (1, 0)(1, 0) (1, 0) (1, 1) (0, 0) (0, 1)(1, 1) (1, 1) (1, 0) (0, 1) (0, 0)

(7.40)

and can be mapped onto the entries of the instantaneous rate matrix for the Kimura’s three-parametermodel K81,

Q =

. α β γα . γ ββ γ . αγ β α .

. (7.41)

We show that the substitution matrices share the same dependencies. To this end, we need tosummarize some basic facts about characters and discrete Fourier transforms. For this, let G be a finiteabelian group. We denote the group operation by addition, the neutral element by 0, and the l-foldmultiple of a group element g ∈ G as l · g = g + · · · + g (l times). Let C∗ denote the set of non-zerocomplex numbers; we can regard C∗ as an abelian group with ordinary multiplication.

A character of G is a group homomorphism mapping G into C∗; that is, χ : G→ C∗ is a characterif χ(g1 + g2) = χ(g1)χ(g2) for all elements g1, g2 ∈ G. The set of characters of G is denoted by G. Theset G is non-empty, since it contains the trivial character ǫ defined by ǫ(g) = 1 for all g ∈ G.

Lemma 7.20. The set of characters of G forms an abelian group under pointwise multiplication; thatis,

(χχ′)(g) = χ(g)χ′(g), χ, χ′ ∈ G, g ∈ G.

The groups G and G are isomorphic.

Proof. First, the defined operation on G is associative and commutative which follows directly fromassociativity and commutativity of complex multiplication. Thus G forms a commutative semigroup.The trivial character ǫ is the neutral element of G and so G forms a commutative monoid.

Suppose the groupG has order n. Then for each character χ ∈ G, we have χ(g)n = χ(gn) = χ(1) = 1.Thus the image of G under χ lies in the group of n-th roots of unity. The norm of an n-th root of unity

ζ is |ζ| =√ζζ = 1, where z denotes the conjugate of a complex number z. Thus ζζ = 1 and hence

ζ = ζ−1. Therefore, the inverse of a character χ is given by χ−1(g) = χ(g) for all g ∈ G. It follows thatG forms an abelian group.

Second, the fundamental theorem of abelian groups says that each finite abelian group G is adirect sum of cyclic groups Z1, . . . , Zk. Let zi be a generating element of Zi with order ni; that is,Zi = {l · zi | 0 ≤ l ≤ ni − 1}, 1 ≤ i ≤ k. Let ζi denote a primitive ni-th root of unity; that is, ζni

i = 1

and ζji 6= 1 for 1 ≤ j ≤ ni − 1.

Define χi ∈ G such that χi(li · zi) = ζlii , where 0 ≤ li ≤ ni − 1, 1 ≤ i ≤ k, and extend such that

162 7 Tree Markov Models

χi(l1 · z1 + · · ·+ lk · zk) = ζlii , 0 ≤ li ≤ ni − 1, 1 ≤ i ≤ k.

Let g ∈ G. Write g = l1 · z1 + · · ·+ lk · zk, where 0 ≤ lj ≤ nj − 1, 1 ≤ j ≤ k, and put

φ(g) = χl11 · · ·χlkk . (7.42)

For each g, h ∈ G, we have by definition φ(g + h) = φ(g)φ(h). Hence, φ is a group homomorphism.Let g ∈ G such that φ(g) = ǫ. Writing g = l1 · z1 + · · ·+ lk · zk gives 1 = ǫ(zi) = (χl11 · · ·χlkk )(zi) =

χ1(zi)l1 · · ·χk(zi)lk = χi(zi)

li = ζlii , 1 ≤ i ≤ k. It follows that ni is a divisor of li for each 1 ≤ i ≤ kand thus g = 1. Hence, the mapping is one-to-one.

Let χ ∈ G. Since χ(zi) is an ni-th root of unity, we have χ(zi) = ζeii = χi(zi)ei for some 0 ≤ ei ≤

ni − 1, 1 ≤ i ≤ k. Thus χ = χe11 · · ·χekk and hence the mapping is onto. ⊓⊔

The group G is called the dual group or character group of G.

Lemma 7.21. Let G and H be finite abelian groups. The dual group of the direct product G × H ={(g, h) | g ∈ G, h ∈ H} is isomorphic to G× H.

Proof. Let χ be a character of G×H. The restriction of χ to G is a character of G and the restrictionto H is a character of H; we denote the restricted characters by χG and χH , respectively. This givesχ(g, h) = χG(g) · χH(h) for all (g, h) ∈ G × H. The mapping χ 7→ (χG, χH) provides the requiredisomorphism. ⊓⊔

Example 7.22. The dual group of the cyclic group G = Zn = {0, 1, . . . , n− 1} is the group G = {χb |0 ≤ b ≤ n− 1} with

χb(a) = ζab, 0 ≤ a, b ≤ n− 1, (7.43)

where ζ is a primitive n-th root of unit. ♦

Example 7.23. The dual group of the additive group G = Zk2 of order 2k has the characters

(χa11 · · ·χakk )(b1 · z1 + · · ·+ bk · zk) =k∏

i=1

χaii (bi · zi) =k∏

i=1

(−1)aibi = (−1)〈a,b〉. (7.44)

where 0 ≤ ai, bi ≤ 1, 1 ≤ i ≤ k. ♦

Let G be a finite abelian group and let L2(G) = {f | f : G → C} be the set of all complex-valuedfunctions on G. This set becomes a complex vector space by defining addition

(f1 + f2)(g) = f1(g) + f2(g), f1, f2 ∈ L2(G), g ∈ G, (7.45)

and scalar multiplication

(af)(g) = a · f(g), f ∈ L2(G), g ∈ G, a ∈ C. (7.46)

Define the delta functions δg, g ∈ G, on G by

7.7 Group-Based Evolutionary Models 163

δg(h) =

{1, if g = h,0, otherwise.

The vector space L2(G) has the delta functions as a C-basis, since each function f ∈ L2(G) has theFourier expansion

f(g) =∑

h

f(h)δh(g), g ∈ G.

A multiplication on the C-space L2(G) is given as

(f1 ∗ f2)(g) =∑

h∈G

f1(h)f2(g − h), g ∈ G, f1, f2 ∈ L2(G).

This operation is associative and is called convolution or Hadamard product.An inner product on the vector space L2(G) is defined by

〈f1, f2〉 =∑

g∈G

f1(g)f2(g), f1, f2 ∈ L2(G). (7.47)

The delta functions on G define an orthonormal basis of L2(G) with respect to this inner product, sincewe have

〈δg, δh〉 =∑

l

δg(l)δh(l) =

{1, if g = h,0, otherwise,

g, h ∈ G.

Theorem 7.24 (Orthogonality Relations). Let G be a finite abelian group.

• For all characters χ and ψ of G, we have

〈χ, ψ〉 ={|G|, if χ = ψ,0, otherwise.

(7.48)

• For all elements g and h in G, we have

χ

χ(g)χ(h) =

{|G|, if g = h,0, otherwise.

(7.49)

Proof. First, we have

〈χ, ψ〉 =∑

g

χ(g)ψ(g) =∑

g

(χψ−1)(g)ǫ(g) = 〈χψ−1, ǫ〉,

since ψ = ψ−1. Thus we can reduce to the case ψ = ǫ. Put

S = 〈χ, ǫ〉 =∑

g

χ(g).

If χ is the trival character, the result will follow. Otherwise, there is a group element h ∈ G withχ(h) 6= 1. By multiplying the above equation with χ(h), we obtain

164 7 Tree Markov Models

χ(h)S = χ(h)∑

g

χ(g) =∑

g

χ(h+ g) =∑

g

χ(g) = S.

Thus we have χ(h)S = S with χ(h) 6= 1 and hence S = 0. This proves the first assertion.Second, we have ∑

χ

χ(g)χ(h) =∑

χ

χ(g − h), g, h ∈ G.

If g = h, the result will follow. Otherwise, there is a character ψ with ψ(l) 6= 1 for some l ∈ G. DefineS =

∑χ χ(l). Thus ψ(l)S =

∑χ(ψχ)(l) = S and hence S = 0. ⊓⊔

The discrete Fourier transform (DFT) F : L2(G) → L2(G) assigns to each function f ∈ L2(G) a

function Ff = f defined by

f(χ) =∑

g∈G

f(g)χ(g), χ ∈ G. (7.50)

In particular, the DFT of the delta function δg, g ∈ G, is given by

δg(χ) =∑

h

χ(h)δg(h) = χ(g), χ ∈ G. (7.51)

Theorem 7.25. Let G be a finite abelian group.

• Linearity: The DFT F : L2(G) → L2(G) is a C-space isomorphism.• Convolution: The DFT turns convolution into multiplication,

f1 ∗ f2(χ) = f1(χ) · f2(χ), χ ∈ G, f1, f2 ∈ L2(G).

• Inversion: For each function f ∈ L2(G),

f(g) =1

|G|∑

χ∈G

χ(g)f(χ), g ∈ G.

• Parseval identity: For all functions f1, f2 ∈ L2(G),

〈f1, f2〉 =1

|G| 〈f1, f2〉.

• Translation: For each h ∈ G and f ∈ L2(G), define fh(g) = f(h+ g). Then we have

fh(χ) = χ(h)f(χ), f ∈ L2(G), χ ∈ G, h ∈ G.

Proof. First, the DFT provides a linear map, since

(f1 + f2)(χ) = f1(χ) + f2(χ) =∑

g

χ(g)[f1(g) + f2(g)]

=∑

g

χ(g)[f1 + f2](g) = f1 + f2(χ)

7.7 Group-Based Evolutionary Models 165

and

af(χ) =∑

g

(af)(g)χ(g) =∑

g

a · f(g)χ(g) = a · f(χ), a ∈ C.

The inversion formula implies that this mapping is one-to-one. Since both spaces have the same dimen-sion, it follows by linear algebra that the mapping is also onto. Hence, the mapping is a vector spaceisomorphism.

Second, we have

f1 ∗ f2(χ) =∑

g

χ(g)f1 ∗ f2(g) =∑

g

χ(g)∑

h

f1(h)f2(g − h)

=∑

h

l

χ(h+ l)f1(h)f2(l), l = g − h,

=∑

h

l

χ(h)f1(h)χ(l)f2(l)

=∑

h

χ(h)f1(h)∑

l

χ(l)f2(l)

= f1(χ) · f2(χ).

Third, due to linearity, we may consider only the basis elements δh, h ∈ G. By the orthogonalityrelations (7.49) and (7.51), the right-hand side gives

1

|G|∑

χ

χ(g)δh(χ) =1

|G|∑

χ

χ(g)χ(h) =

{1, if g = h,0, otherwise,

which is equal to δh(g), as required.Fourth, the orthogonality relations (7.49) give

〈f1, f2〉 =∑

χ

f1(χ)f2(χ)

=∑

χ

(∑

g

f1(g)χ(g) ·∑

h

f2(h)χ(h)

)

=∑

χ

g

h

f1(g)f2(h)χ(g)χ(h)

=∑

g

h

f1(g)f2(h)∑

χ

χ(g)χ(h)

= |G| ·∑

g

f1(g)f2(g)

= |G| · 〈f1, f2〉.

Finally, we have

166 7 Tree Markov Models

fh(χ) =∑

g

fh(g)χ(g)

=∑

g

f(g + h)χ(g)

=∑

l

f(l)χ(l − h), l = g + h,

=∑

l

f(l)χ(l)χ(h)

= χ(h)f(χ).

⊓⊔

Example 7.26. Consider the cyclic group G = Zn. By taking the basis of L2(G) given by the deltafunctions, the proof of Lemma 7.20 and (7.51) show that

δa(χb) = χb(a) = 1/χb(a) = ζ−ab, 0 ≤ a, b ≤ n− 1, (7.52)

where ζ = exp(2πi/n) is a primitive n-th root of unity. Thus the matrix of the DFT is given by

An =(ζ−(a−1)(b−1)

)a,b∈[n]

. (7.53)

In the binary case n = 2, we obtain

A2 =

(1 11 −1

). (7.54)

The DFT of the function f = (f0, f1) is given by

f =

(f0 + f1f0 − f1

)=

(1 11 −1

)(f0f1

). (7.55)

Moreover, the inversion formula gives

f0 =1

2(f0 + f1) and f1 =

1

2(f0 − f1). (7.56)

In the quaternary case n = 4, we obtain

A4 =

1 1 1 11 −i −1 11 −1 1 −11 i −1 −i

. (7.57)

Example 7.27 (Maple). The constant function f : Zn → C : a 7→ 1 has the DFT

7.7 Group-Based Evolutionary Models 167

f(χb) =

{n if b = 0,0 otherwise.

To see this, we have

f(χb) =∑

a

f(a)ζ−ab.

Clearly, if b = 0 then f(1) = n. On the other hand, if b 6= 0, then by the formula for geometricprogression,

f(χb) = 1 + ζ + . . .+ ζn−1 =1− ζn

1− ζ= 0.

Take the function f(a) = 12 (δ1(a) + δ−1(a)) for all a ∈ Zn; here we assume that n is odd and so we can

identify Zn with the set {−(n− 1)/2, . . . ,−1, 0, 1, . . . , (n− 1)/2}. The DFT of the function f is

f(χb) = cos

(2πb

n

), a ∈ Zn.

Indeed, we have

f(χb) =∑

a

f(a)ζ−ab =1

2exp(−2πib/n) +

1

2exp(2πib/n)

=1

2

[cos

(−2πb

n

)+ i · sin

(−2πb

n

)]+

1

2

[cos

(2πb

n

)+ i · sin

(2πb

n

)]

= cos

(2πb

n

).

The function f : Zn → C : a 7→ a3 has the DFT shown in Fig. 7.14. It can be computed by thefollowing Maple code:

> with(DiscreteTransform):

> with(plots):

> Z := Vector ( 9, x -> evalf(x^3)):

> F := FourierTransform ( Z, normalization = full):

> ptlist := convert ( F, ’list’):

> complexplot ( ptlist, x = -50..225, style = point);

Example 7.28. Consider the additive group G = Zk2 of order 2k. The DFT of a function f ∈ L2(G)can be specified as

f(χa11 · · ·χakk ) =∑

b∈G

(−1)〈a,b〉f(b), 0 ≤ a1, . . . , ak ≤ 1. (7.58)

This mapping is provided by the 2k × 2k Hadamard matrix

H2k = ((−1)〈a,b〉).

168 7 Tree Markov Models

Fig. 7.14. DFT of cubic function (n = 9).

These matrices can be recursively defined as follows,

H2 =

(1 11 −1

), H2k+1 =

(H2k H2k

H2k −H2k

)= H2k ⊗H2, k ≥ 1. (7.59)

We will see that if the instantaneous rate matrix Q is group-based, the corresponding substitutionmatrices exp(Qt) will also be group-based.

Lemma 7.29. Let G be an abelian group of order n. The eigenvalues of a n× n group-based instanta-neous rate matrix Q satisfying (7.37) are

λχ =∑

h∈G

χ(h)ψ(h), χ ∈ G. (7.60)

The transition probabilities of the corresponding time-continuous Markov model are

Pgh(t) =1

|G|∑

χ∈G

χ(h− g)eλχt, t ≥ 0. (7.61)

Proof. First, define the n× n matrix B = (χ(g))χ,g. We have

7.7 Group-Based Evolutionary Models 169

(BQ)χ,g =∑

h∈G

χ(h)ψ(g − h)

=∑

l∈G

χ(g − l)ψ(l)

= χ(g)∑

l∈G

χ(l)ψ(l)

= χ(g)λχ.

Thus if we put G = {g1, . . . , gn}, then

(χ(g1), . . . , χ(gn))Q = λχ(χ(g1), . . . , χ(gn)). (7.62)

Hence, the rows of B are the left eigenvectors of Q. This shows the first assertion.Second, let D be the n × n diagonal matrix with diagonal entries λχ. By (7.62), we have Q =

B−1DB. But the orthogonality relations (7.49) imply that B−1 = 1|G|B

∗, where B∗ = (χ(g))g,χ is the

Hermitian matrix of B. Thus

(exp(Qt))g,h =1

|G| (B∗ exp(Dt)B)g,h

=1

|G|∑

χ

χ(g)eλχtχ(h)

=1

|G|∑

χ

χ(h− g)eλχt.

This proves the second assertion. ⊓⊔

Example 7.30 (Singular). Consider the binary JC model for the 1,3 claw tree (Fig. 7.15). Let π =

?>=<89:;r

γ

❄❄❄❄

❄❄❄❄

α

⑧⑧⑧⑧⑧⑧⑧⑧⑧

β

?>=<89:;1 ?>=<89:;2 ?>=<89:;3

Fig. 7.15. The 1,3 claw tree.

(π0, π1) denote the probability distribution of the root and let the transition probability matrices alongthe branches be given as

P(r1) =

(α0 α1

α1 α0

), P(r2) =

(β0 β1β1 β0

), P(r3) =

(γ0 γ1γ1 γ0

).

Then the algebraic statistical model is defined by the mapping f : R4 → R8 with marginal probabilities

170 7 Tree Markov Models

p000 = π0α0β0γ0 + π1α1β1γ1,

p001 = π0α0β0γ1 + π1α1β1γ0,

p010 = π0α0β1γ0 + π1α1β0γ1,

p011 = π0α0β1γ1 + π1α1β0γ0,

p100 = π0α1β0γ0 + π1α0β1γ1,

p101 = π0α1β0γ1 + π1α0β1γ0,

p110 = π0α1β1γ0 + π1α0β0γ1,

p111 = π0α1β1γ1 + π1α0β0γ0.

The discrete Fourier transform gives a linear change of coordinates in the parameter space by us-ing (7.56),

π0 = 12 (r0 + r1), α0 = 1

2 (a0 + a1), β0 = 12 (b0 + b1), γ0 = 1

2 (c0 + c1),π1 = 1

2 (r0 − r1), α1 = 12 (a0 − a1), β1 = 1

2 (b0 − b1), γ1 = 12 (c0 − c1).

Simultaneously, it provides a linear change of coordinates in the probability space by making useof (7.58),

qijk =

1∑

r=0

1∑

s=0

1∑

t=0

(−1)ir+js+ktprst.

More specifically, we obtain

q000 = p000 + p001 + p010 + p011 + p100 + p101 + p110 + p111,

q001 = p000 − p001 + p010 − p011 + p100 − p101 + p110 − p111,

q010 = p000 + p001 − p010 − p011 + p100 + p101 − p110 − p111,

q011 = p000 − p001 − p010 + p011 + p100 − p101 − p110 + p111,

q100 = p000 + p001 + p010 + p011 − p100 − p101 − p110 − p111,

q101 = p000 − p001 + p010 − p011 − p100 + p101 − p110 + p111,

q110 = p000 + p001 − p010 − p011 − p100 − p101 + p110 + p111,

q111 = p000 − p001 − p010 + p011 − p100 + p101 + p110 − p111.

After these coordinate changes, the model has the monomial representation

q000 = r0a0b0c0,

q001 = r1a0b0c1,

q010 = r1a0b1c0,

q011 = r0a0b1c1,

q100 = r1a1b0c0,

q101 = r0a1b0c1,

q110 = r0a1b1c0,

q111 = r1a1b1c1.

This model is toric and the phylogenetic invariants are given by binomials that can be established bythe following program,

7.7 Group-Based Evolutionary Models 171

> ring r = 0, (r(0..1),a(0..1),b(0..1),c(0..1),q(0..7)), dp;

> ideal i0 = q(0)-r(0)*a(0)*b(0)*c(0);

> ideal i1 = q(1)-r(1)*a(0)*b(0)*c(1);

> ideal i2 = q(2)-r(1)*a(0)*b(1)*c(0);

> ideal i3 = q(3)-r(0)*a(0)*b(1)*c(1);

> ideal i4 = q(4)-r(1)*a(1)*b(0)*c(0);

> ideal i5 = q(5)-r(0)*a(1)*b(0)*c(1);

> ideal i6 = q(6)-r(0)*a(1)*b(1)*c(0);

> ideal i7 = q(7)-r(1)*a(1)*b(1)*c(1);

> ideal i = i0+i1+i2+i3+i4+i5+i6+i7;

> ideal j = std(i);

> eliminte (j, r(0)*r(1)*a(0)*a(1)*b(0)*b(1)*c(0)*c(1));

The output provides the following invariants,

_[1]=q(1)*q(6)-q(0)*q(7)

_[2]=q(2)*q(5)-q(0)*q(7)

_[3]=q(0)*q(4)-q(0)*q(7)

_[4]=q(2)*q(4)*q(6)-q(2)*q(6)*q(7)

_[5]=q(1)*q(4)*q(6)-q(1)*q(6)*q(7)

Example 7.31. Consider the 1, n claw tree with root r and n ≥ 1 leaves and take an abelian groupG of order n. Let π denote the probability distribution of the root and let the transition probabilitymatrices P(ri), 1 ≤ i ≤ n, along the branches be given as

P(ri)(Xi = g | Xr = h) = f (ri)(g − h), g, h ∈ G, 1 ≤ i ≤ n. (7.63)

The joint probability of the group based model is then given by

p(g1, . . . , gn) = P (X1 = g1, . . . , Xn = gn) =∑

h∈G

π(h)P(ri)(Xi = gi | Xr = h)

=∑

h∈G

π(h)

n∏

i=1

f (ri)(gi − h).

In order to find the discrete Fourier transform of this probability density with respect to the group Gn,the root distribution is replaced by the new function π : Gn → C as follows,

π(h1, . . . , hn) =

{π(h1), if h1 = . . . = hn,0, otherwise.

(7.64)

This definition gives

p(g1, . . . , gn) =∑

(h1,...,hn)∈Gn

π(h1, . . . , hn)

n∏

i=1

f (ri)(gi − hi). (7.65)

If we define

172 7 Tree Markov Models

f(g1, . . . , gn) =

n∏

i=1

f (ri)(gi), g1, . . . , gn ∈ G, (7.66)

the joint probability distribution p can be written as convolution of two functions on Gn,

p(g1, . . . , gn) = (π ∗ f) (g1, . . . , gn), g1, . . . , gn ∈ G. (7.67)

Taking the discrete Fourier transform yields

q(χ1, . . . , χn) = ˆπ(χ1, . . . , χn) · f(χ1, . . . , χn). (7.68)

In particular, the discrete Fourier transform of the function f has the form

f(χ1, . . . , χn) =∑

(g1,...,gn)∈Gn

f(g1, . . . , gn)(χ1, . . . , χn)(g1, . . . , gn)

=∑

(g1,...,gn)∈Gn

n∏

i=1

f (ri)(gi)

n∏

i=1

χi(gi)

=

n∏

i=1

gi∈G

f (ri)(gi)χi(gi)

=

n∏

i=1

f (ri)(χi). (7.69)

Moreover, the discrete Fourier transform of the root distribution is

ˆπ(χ1, . . . , χn) =∑

(g1,...,gn)∈Gn

π(g1, . . . , gn)(χ1, . . . , χn)(g1, . . . , gn)

=∑

g∈G

π(g)(χ1, . . . , χn)(g, . . . , g)

=∑

g∈G

π(g)

n∏

i=1

χi(g)

= π(χ1 · · ·χn). (7.70)

It follows that the discrete Fourier transform of the joint probabilities has the monomial representation

q(χ1, . . . , χn) = π(χ1 · · ·χn) ·n∏

i=1

f (ri)(χi). (7.71)

This example is the base case for the induction in the general case. Given a rooted binary treeT with root r and n leaves. For each node v in T different from the root, write a(v) for the unique

parent of v in T . The transition from a(v) to v is given by the substitution matrix P(v). Suppose the

7.7 Group-Based Evolutionary Models 173

states of the random variables are the elements of a finite abelian group G. Then the joint probabilitydistribution of the labelling of the leaves can be written as

p(g1, . . . , gn) =∑

π(gr)∏

v∈V (T )v 6=r

P(v)ga(v),gv

, (7.72)

where the sum extends over all states of the interior nodes of the tree T . We assume that the transitionmatrix entries P(v)

ga(v),gvdepend only on the difference of the group elements ga(v) and gv. We denote

this entry by f (v)(ga(v) − gv). Thus the group based model has the joint probability distribution

p(g1, . . . , gn) =∑

π(gr)∏

v∈V (T )v 6=r

f (v)(ga(v) − gv). (7.73)

Theorem 7.32. Given the joint probability distribution p(g1, . . . , gn) of a group-based model parametrizedin (7.73). The corresponding discrete Fourier transform has the form

q(χ1, . . . , χn) = π(χ1 · · ·χn) ·∏

v∈V (T )v 6=r

f (v)(∏

l∈Λ(v)

χl), (7.74)

where Λ(v) is the set of leaves which have the node v as a common ancestor.

The formula (7.72) is a polynomial representation of the evolutionary model, while formula (7.74)provides a monomial representation of the same model. Since the groups G and G are isomorphic, themonomial representation can be rewritten as follows,

qg1,...,gn 7→ π(g1 + . . .+ gn) ·∏

v∈V (T )v 6=r

f (v)(∑

l∈Λ(v)

gl). (7.75)

We can regard this formula as the monomial mapping from a polynomial ring in |G|n unknowns

qg1,...,gn = q(g1, . . . , gn) (7.76)

to the polynomial ring in the unknowns π(g) and f (v)(g), which are indexed by the nodes of T and theelements of G.

174 7 Tree Markov Models

8

Computational Statistics

Computational statistics is a fast growing field in statistical research and applications. This chapter isconcerned with the study of ideals defining statistical models for discrete random variables. The binomialgenerating sets of these ideals are known as Markov bases. Besides giving an algebraic description ofthe set of probability distributions coming from the model, the major use of Markov bases is as a tool inMarkov chain Monte Carlo algorithms for generating random samples from a statistical model. Severalimportant statistical models are addressed such as contingency tables, the Hardy-Weinberg law, andlogistic regression.

8.1 Markov Bases

Markov bases are used in algebraic statistics to estimate the goodness of a fit of empirical data to astatistical model. In algebraic geometry Markov bases are equivalent to generating sets of toric ideals.

In order to introduce Markov bases, let A ∈ Zd×n≥0 be an integral matrix. Note that the matrix A

provides a Z-module homomorphism φ : Zn → Zd : x 7→ Ax. The kernel of this mapping,

kerφ = {u ∈ Zn | Au = 0},

is a free Z-submodule of Zn.Let b be a vector in Zd. The set of all non-negative integral vectors u that have the marginal b = Au

is the b-fibre of A, denoted as A−1[b]; i.e.,

A−1[b] = {u ∈ Zn≥0 | Au = b}. (8.1)

In statistical terms, the vector b is called a sufficient statistic.A vector m ∈ Zn is a move for A if Am = 0, i.e., if m lies in the kernel of the linear mapping

induced by A.Each integral vector m ∈ Zn decomposes into a unique difference

m = m+ −m−,

where m+ ≥ 0 and m− ≥ 0 are the positive and negative parts of m, respectively. We have

176 8 Computational Statistics

m+i = max{mi, 0} and m−

i = −min{mi, 0}, 1 ≤ i ≤ n. (8.2)

For instance, (2,−3,−1) = (2, 0, 0)− (0, 3, 1). If m is a move for A, then Am = 0 and thus the positivepart m+ and the negative part m− have the same marginal: Am+ = Am−.

Example 8.1 (Maple). The matrix A =(1 2 3

)has the kernel Zb1 ⊕ Zb2 with b1 = (−3, 0, 1)t and

b2 = (−2, 1, 0)t. It follows that for any marginal b ∈ Z3, the b-fibre has the generating set {b+b1, b+b2}.♦

A move m for A is called applicable to a vector u in the b-fibre of A if u+m ≥ 0. Let B be a finiteset of moves for A, and let u and v be elements in the b-fibre of A. We say that u is connected to v inthe b-fibre of A if there exists a sequence of moves m1, . . . ,mk in B such that

v = u+

k∑

i=1

mi and u+

h∑

i=1

mi ≥ 0, 1 ≤ h ≤ k. (8.3)

That is, it is possible to pass from u to v by moves in B without causing negative entries on the way.Obviously, the notion of connectedness is reflexive and transitive. It is symmetric if with each move

m in B also −m is a move in B. If connectedness by B is also symmetric, then it forms an equivalencerelation on the b-fibre of A and thus the b-fibre is partitioned into disjoint equivalence classes by movesin B. These classes are called the B-equivalence classes of the b-fibre of A. The B-equivalence classescan be considered as a graph; this is an undirected graph G = Gb,B with set of vertices V = A−1[b]and edges between nodes u, v ∈ A−1[b] if v = u+m for some move m ∈ B.

Example 8.2. Take the matrix A =(1 2 3

)and the following set of moves for A,

B = {±(0, 3,−2)t,±(1,−2, 1)t,±(1, 1,−1)t,±(2,−1, 0)t}.

The points u = (3, 2, 7)t and v = (6, 5, 4)t are connected, since

v = u+ (0, 3,−2)t + (1,−2, 1)t + 2 · (1, 1,−1)t.

A finite set of moves B is called a Markov basis of A if for each marginal b ∈ Zn, the b-fibre of Aitself constitutes a single B-equivalence class; that is, the graph Gb,B is connected for all marginals b.

A Markov basis B is called minimal if no proper subset of B is a Markov basis. A minimal Markovbasis always exists, because from any Markov basis we can remove redundant elements one by one,until none of the remaining elements can be removed any further.

Let B be a finite set of moves for A. Define the corresponding ideal of moves in the polynomial ringQ[X1, . . . , Xn] as

IB = 〈Xm+ −Xm− | m ∈ B〉. (8.4)

We study the relationship between the ideal IB and the toric ideal IA associated with the matrix A.By Prop. 1.41, the latter ideal is defined as

IA = 〈Xv −Xu | Av = Au, v, u ∈ Zn≥0〉.

8.1 Markov Bases 177

Proposition 8.3. We have IB ⊆ IA.

Proof. Let m ∈ B be a move with m = m+ −m−. Then 0 = Am = A(m+ −m−) = Am+ −Am− and

thus Xm+ −Xm−

belongs to IA. ⊓⊔

Proposition 8.4. Let u and v be elements of Zn≥0. If u and v are connected in a fibre of A, the binomialXu −Xv lies in the ideal IB.

Proof. Let m be a move in B such that v = u+m. Then Xv −Xu = Xu−m−

(Xm+ −Xm−

) lies in IB .

Let u and v be connected as in (8.3). Then put w = u+∑k−1i=1 mi. By induction hypothesis, Xv−Xw

and Xw −Xu lie in IB and hence their sum Xv −Xu belongs to IB . ⊓⊔

Theorem 8.5. If B is a Markov basis of A, then IB = IA.

Proof. Let u, v ∈ Zn≥0 and let Xu −Xv be a generating binomial of IA. Then Au = Av and thus thevectors u and v are in the same b-fibre of A. Since B is a Markov basis of A, the points u and v areconnected in the fibre. Hence, by Prop. 8.4, the binomial Xv −Xu belongs to IB . The other inclusionhas already been given in Prop. 8.3. ⊓⊔

Proposition 8.6. Let {Xm+−Xm− | m ∈ B} be a generating set of the toric ideal IA in Q[X1, . . . , Xn].Then the set ±B is a Markov basis of A.

Proof. Let u, v ∈ Zn≥0 and let Xu −Xv be a generating binomial of IA. By hypothesis, we can write

Xv −Xu =

k∑

i=1

Xwi(Xm+i −Xm−

i ), mi ∈ B, wi ∈ Zn≥0, 1 ≤ i ≤ k.

Claim that the vectors m1, . . . ,mk connect u to v. Indeed, in case of k = 1, we have

Xv −Xu = Xw1(Xm+1 −Xm−

1 ), m1 ∈ B, w1 ∈ Zn≥0.

By comparing the binomials, we have v = w1 +m+1 and u = w1 +m−

1 . Thus w1 = u−m−1 and hence

v = u−m−1 +m+

1 = u+m, as required.

Let k > 1. The case k = 1 gives Xw−Xu = Xw1(Xm+1 −Xm−

1 ) and w = u+m1 ≥ 0, where m1 ∈ Band w1 ∈ Zn≥0. Thus

Xv −Xu = (Xw −Xu) +

k∑

i=2

Xwi(Xm+i −Xm−

i ), mi ∈ B, wi ∈ Zn≥0, 2 ≤ i ≤ k,

and hence

Xv −Xw =k∑

i=2

Xwi(Xm+i −Xm−

i ), mi ∈ B, wi ∈ Zn≥0, 2 ≤ i ≤ k.

By induction hypothesis, we have v = w +∑ki=2mi and w +

∑wi=2mi ≥ 0 for each 2 ≤ h ≤ k. Thus

m1 connects u to w and m2, . . . ,mk connect w to v. The claim follows. ⊓⊔

178 8 Computational Statistics

Finally, the problem is to compute a Markov basis of an integral matrix A ∈ Zd×n≤0 . For this, considerthe following ideal in Q[X1, . . . , Xn, Y1, . . . , Yd],

J = 〈Xi − Y a1i1 · · ·Y adid | 1 ≤ i ≤ n〉. (8.5)

In view of section 1.8, we have the following result.

Proposition 8.7. A Markov basis of the matrix A is given by computing a Groebner basis of the ideal Jwith respect to an elimination ordering for Y1, . . . , Yd and then taking only those elements which involveonly the indeterminates X1, . . . , Xn.

Example 8.8 (Singular). Reconsider the matrix A =(1 2 3

). Take the corresponding ideal J =

〈X1 − Y,X2 − Y 2, X3 − Y 3〉 in Q[X1, X2, X3, Y ]. A reduced Groebner basis of the elimination ideal IAof J with respect to an elimination ordering for Y can be computed as follows,

> ring r = 0, (y, x(1..3)), dp;

> ideal j = x(1)-y, x(2)-y^2, x(3)-y^3;

> eliminate(std(j),y);

_[1]=x(2)^2-x(1)*x(3)

_[2]=x(1)*x(2)-x(3)

_[3]=x(1)^2-x(2)

The reduced Groebner basis of IA is G = {X22 − X1X3, X1X2 − X3, X

21 − X2}. Thus the associated

Markov basis of A is {±(1,−2, 1)t,±(1, 1,−1)t,±(2,−1, 0)t}. ♦

8.2 Markov Chains

We consider discrete time, discrete state space Markov chains. A Markov chain is an infinite sequenceof random variables (Xt) indexed by time t ≥ 0. The set of all possible values of Xt is the state space.The state space is assumed to be a finite or countable set S. The sequence (Xt) is a Markov chain if itsatisfies the Markov property,

P (Xt+1 = j|X0 = i0, . . . , Xt−1 = it−1, Xt = i) = P (Xt+1 = j|Xt = i), (8.6)

for all states i, j ∈ S and t ≥ 0. That is, the transition probabilities depend only on the current state,but not on the past.

A Markov chain is homogeneous if the probabilities of going from one state to another in one stepare independent of the current step; that is,

P (Xt+1 = j|Xt = i) = P (Xt = j|Xt−1 = i), i, j ∈ S, t ≥ 1. (8.7)

Let (Xt) be a homogeneous Markov chain and let the state space S be finite; we put S = [n]. Thenthe transition probabilities P (Xt+1|Xt) can be represented by an n × n transition probability matrixP = (pij), where the entry pij is the probability that the chain provides a transition from state i to statej in one step. Note that the conservation of probability requires that the matrix P is row stochastic,i.e.,

∑j pij = 1.

8.2 Markov Chains 179

Let p(k)ij denote the probability that the chain moves from state i to state j in k ≥ 1 steps. The

k-step transition probabilities satisfy the Chapman-Kolmogorov equation

p(k)ij =

z∈S

p(l)iz p

(k−l)zj , i, j ∈ S, 0 < l < k. (8.8)

It follows that the k-step transition probabilities are the entries of the matrix Pk; i.e., P(k) = (p(k)ij ) =

Pk, where P is the k-th power of P.Let Q = (Q(i)) be a distribution of the state space. Then the distribution Q′ = (Q′(i)) of the state

space at the next time instant is given by

Q′(j) =∑

i∈S

pijQ(i), j ∈ S, (8.9)

or in shorthand notation,

Q′ = PQ, (8.10)

where P can be viewed as an operator on the space of probability distributions on the state set S.An induction argument shows that the evolution of the Markov chain through t ≥ 1 time steps is

given by

P (Xt = j) =∑

i∈S

p(t)ij P (X0 = i), j ∈ S.

Starting from the initial distribution Q0 given by a column vector q0 = (q0(1), . . . , q0(n)) with q0(i) =P (X0 = i), i ∈ S, the marginal distribution Qt after t time steps given by the column vector qt =(qt(1), . . . , qt(n)) with qt(i) = P (Xt = i), i ∈ S, satisfies the equation

qt = Ptq0, t ≥ 1. (8.11)

Thus the operator on the space of probability distribution on the state set S is linear.A distribution q = (q(i)) as a column vector of the state set S is stationary if it satisfies the matrix

equation

q = Pq. (8.12)

Thus a stationary distribution is an eigenvector of the transition matrix with the eigenvalue 1. Theproblem of finding the stationary distributions of a homogeneous Markov chain is a non-trivial task.

Example 8.9 (Maple). Consider the transition probability matrix for the Jukes-Cantor model

P =

1− 3a a a aa 1− 3a a aa a 1− 3a aa a a 1− 3a

.

Since the matrix must be row stochastic, we have 0 < a < 1/3.

180 8 Computational Statistics

Taking a = 0.1, the 2-step and 16-step transition matrices are

P2 =

0.52 0.16 0.16 0.160.16 0.52 0.16 0.160.16 0.16 0.52 0.160.16 0.16 0.2458 0.52

and P16 =

0.2502 0.2499 0.2499 0.24990.2499 0.2502 0.2499 0.24990.2499 0.2499 0.2502 0.24990.2499 0.2499 0.2458 0.2502

.

The transition probabilities in each row converge to the same stationary distribution π on the fourstates given by π(i) = 1

4 for 1 ≤ i ≤ 4 (Sect. 7.6).In view of the initial distribution (0.1, 0.2, 0.2, 0.5), we obtain after 2 steps and after 16 steps

the distributions (0.1960, 0.2320, 0.2320, 0.3400) and (0.2499, 0.2499, 0.2500, 0.2500), respectively. Thecorresponding Maple code is

> with(LinearAlgebra);

> A := Matrix([1-3*a,a,a],[a,1-3*a,a],[a,a,1-3*a,a],[a,a,a,1-3*a]]);

> a := 0.1;

> A^2; A^16;

> v := Vector([0.1,0.2,0.2,0.5]);

> (A^2).v;

> (A^16).v;

We study the convergence of Markov chains. For this, we consider a homogeneous Markov chainwith finite state space S and transition probability matrix P = (pij). To this end, we need a metric onthe space of probability distributions on the state space.

Proposition 8.10. A metric on the space of probability distributions on the state space S is given as

d(Q,Q′) =∑

s∈S

|Q(s)−Q′(s)|. (8.13)

Proof. Clearly, we have for all probability distributions Q, Q′, and Q′′ on S, d(Q,Q) = 0, d(Q,Q′) > 0if Q 6= Q′, and d(Q,Q′) = d(Q′, Q). Finally, the triangle inequality d(Q,Q′) ≤ d(Q,Q′′) + d(Q′′, Q′)holds, since

d(Q,Q′) =∑

s∈S

|Q(s)−Q′′(s) +Q′′(s)−Q′(s)|

≤∑

s∈S

|Q(s)−Q′′(s)|+∑

s∈S

|Q′′(s)−Q′(s)|

= d(Q,Q′′) + d(Q′′, Q′).

We assume that the transition probabilities satisfies the strong ergodic condition; that is, all transi-tion probabilities pij are positive. First, we provide a fixpoint result that will guarantee the existenceand uniqueness of fixed points.

8.2 Markov Chains 181

Theorem 8.11. If the transition probabilities P satisfy the strong ergodic condition, they define acontraction mapping with respect to the metric; that is, there is a non-negative real number α < 1 suchthat for all probability distributions Q and Q′ on the state space S,

d(PQ,PQ′) ≤ α · d(Q,Q′). (8.14)

Proof. Let Q and Q′ be distinct distributions on S. Define

∆Q(s) = Q(s)−Q′(s), s ∈ S.

We have

d(PQ,PQ′) =∑

s

|PQ(s)− PQ′(s)|

=∑

s

∣∣∣∣∣∑

t

P (s | t)Q(t)− P (s | t)Q′(t)

∣∣∣∣∣

=∑

s

∣∣∣∣∣∑

t

P (s | t)∆Q(t)

∣∣∣∣∣ .

We decompose the sum∑t into to partial sums

∑t+ +

∑t− such that t+ and t− are the states where

∆Q is positive or negative, respectively. This gives us

d(PQ,PQ′) =∑

s

∣∣∣∣∣∑

t+

P (s | t+)∆Q(t+) +∑

t−

P (s | t−)∆Q′(t−)

∣∣∣∣∣

=∑

s

(∑

t+

P (s | t+)∆Q(t+)−∑

t−

P (s | t−)∆Q′(t−)

)

−2∑

s

min

{|∑

t+

P (s | t+)∆Q(t+)|, |∑

t−

P (s | t−)∆Q(t−)|}

=∑

s

t

P (s | t)|∆Q(t)|

−2∑

s

min

{|∑

t+

P (s | t+)∆Q(t+)|, |∑

t−

P (s | t−)∆Q(t−)|}

=∑

t

|∆Q(t)| − 2∑

s

min

{|∑

t+

P (s | t+)∆Q(t+)|, |∑

t−

P (s | t−)∆Q(t−)|}

≤∑

t

|∆Q(t)| − 2∑

s

P smin min

{|∑

t+

∆Q(t+)|, |∑

t−

∆Q(t−)|}

where we used |(|x| − |y|)| = |x| + |y| − 2min{|x|, |y|} and P smin = min{P (s | t) | t ∈ S} > 0. But wehave

182 8 Computational Statistics

t+

∆Q(t+) +∑

t−

∆Q(t−) =∑

t

Q(t)−∑

t

Q′(t) = 0

and thus ∣∣∣∣∣∑

t+

∆Q(t+)

∣∣∣∣∣ =∣∣∣∣∣∑

t−

∆Q(t−)

∣∣∣∣∣ =1

2

t

|∆Q(t)|.

It follows that

d(PQ,PQ′) ≤(1−

s

P smin

)d(Q,Q′).

If we put α = 1− P smin for some s ∈ S, we obtain the desired result. ♦

Theorem 8.12. Let the transition probabilities P satisfy the strong ergodic condition. For each prob-ability distribution Q on the state space, the sequence (Q,PQ,P 2Q, . . .) has the property that for eachnumber ǫ > 0, there exists a positive integer N such that for all steps n and m larger than N ,

0 ≤ d(PnQ,PmQ) < ǫ. (8.15)

The sequence (Q,PQ,P 2Q, . . .) converges to the probability distribution

Q∗ = limn→∞

PnQ (8.16)

such that Q∗ is the unique fixed point of P .

Proof. First, by repeated application of the triangle inequality and Thm. 8.11, we obtain

d(PnQ,PmQ) ≤ d(PNQ,PnQ) + d(PNQ,PmQ), min{m, } ≥ N ≥ 0,

≤n−N−1∑

k=0

d(PN+kQ,PN+k+1Q) +

m−N−1∑

l=0

d(PN+lQ,PN+l+1Q)

≤n−N−1∑

k=0

αN+kd(Q,PQ) +m−N−1∑

l=0

αN+ld(Q,PQ)

= αNd(Q,PQ)

(n−N−1∑

k=0

αk +m−N−1∑

l=0

αl

)

≤ αNd(Q,PQ)

(2− αn−N − αm−N

1− α

).

Thus

d(PnQ,PmQ) <2αN

1− αd(Q,PQ).

Hence, if we put

N ≥ logα

(ǫ(1− α)

2d(Q,PQ)

),

we have

8.2 Markov Chains 183

d(PnQ,PmQ) < ǫ.

Thus the series (PnQ) is a Cauchy sequence and hence is convergent by completeness of Rn.Second, let Q∗ denote the limiting point. We have

0 ≤ d(Pn+1Q,PQ∗) ≤ α · d(PnQ,Q∗), n ≥ 0.

But α · d(PnQ,Q∗) tends to 0 as n goes to infinity. Thus PnQ goes to PQ∗ when n tends to infinity.Since limits are unique, it follows that Q∗ = PQ∗.

Third, assume there is another distribution Q satisfying PQ = Q. Then 0 ≤ d(Q,Q∗) =d(PQ,PQ∗) ≤ α · d(Q,Q∗). Thus 0 ≤ (1 − α)d(Q,Q∗) ≤ 0 which shows that d(Q,Q∗) = 0 andhence Q = Q∗. ♦

A probability distribution Q on the state space S is called a fixed point or a stationary distribution ofthe operator P provided that PQ = Q.

Example 8.13 (R). A one-dimensional random walk is a discrete-time Markov chain whose state spaceis given by the set of integers S = Z. For some number p with 0 < p < 1, the transition probabilities(the probability pi,j of moving from state i to state j) are given by

pi,j =

p if j = i+ 1,1− p if j = i− 1,0 otherwise,

for all i, j ∈ Z.

In a random walk, at each transition a step of unit length is made at random to the right with probabilityp and to the left with probability 1− p. A random walk can be interpreted as the betting of a gamblerwho bets 1 Euro on a sequence of p-Bernoulli trials and wins or loses 1 Euro at each transition; ifX0 = 0, the state of the process at time n is her gain or loss after n trials. The probability to start instate 0 and return to state 0 in 2n steps for the first time is

p(2n)00 =

(2n

n

)pn(1− p)n.

It can be shown that∑∞n=1 p

(2n)00 <∞ if and only if p 6= 1/2. Thus the expected number of returns to 0

is finite if and only if p 6= 1/2. A random walk with Bernoulli probability p = 0.7 can be generated overa short time span as follows.

> n <- 400

# n p-Bernoulli trials

> X <- sample( c(-1,1), size=n, replace = TRUE, p=(0.3,0.7) )

# coerce to a data frame

> D <- as.integer( c(0,cumsum(X)) )

> plot ( 0..n, D, type="l", main="", xlab = "i" )

A trajectory of the process starting at state 0 is given in Fig. 8.1.Another way to define a one-dimensional random walk is to take a sequence (Xt)t≥1 of independent,

identically distributed random variables, where each variable has state space {±1}. Put S0 = 0 andconsider the partial sum Sn =

∑ni=1Xi. The sequence (St)t≥0 is a simple random walk on Z. The series

184 8 Computational Statistics

Fig. 8.1. Partial realization of a random walk with p = 0.7.

given by the sum of sequences of 1’s and −1’s provides the walking distance if each part of the walk isof unit length. In case of p = 1/2 we speak of a symmetric random walk. In a symmetric random walk,all states are recurrent; i.e., the chain returns to each state with probability 1. A symmetric randomwalk can be generated over a short time span as follows.

> n <- 400

# n Bernoulli trials with p=1/2

> X <- sample( c(-1,1), size=n, replace = TRUE )

# coerce to a data frame

> S <- as.integer( c(0,cumsum(X)) )

> plot ( 0..n, S, type="l", main="", xlab = "i" )

A trajectory of the process starting at S0 = 0 is given in Fig. 8.2. The process returns to 0 severaltimes within the given time span. This can be seen by invoking the command which(S==0). ♦

8.3 Metropolis Algorithm

The Metropolis algorithm can be used to generate random samples from a given probability distribution.The idea is to generate a Markov chain (Xt)t≥0 whose stationary distribution is the target distribution.The algorithm must specify for a given state Xt how to generate the next state Xt+1. For this, acandidate point Y is generated from a proposed distribution g(·|Xt). If the candidate point is accepted,the chain moves to state Y at time t+ 1. Otherwise, the chain stays in state Xt and Xt+1 = Xt.

First, we formulate the Metropolis algorithm in the context of Markov bases. For this, let B ={m1, . . . ,ml} be a Markov basis of a matrix A ∈ Zd×n≥0 . Given a marginal b ∈ Zd, the state set is theb-fibre of A consisting of all vectors in Zn≥0 with marginal b,

8.3 Metropolis Algorithm 185

Fig. 8.2. Partial realization of a symmetric random walk.

A−1[b] = {u ∈ Zn≥0 | Au = b}. (8.17)

Consider the Markov chain P ′ given by the transition probabilities

p′(v|u) ={1/2l if v = u+m ≥ 0 for some m ∈ B,0 otherwise.

(8.18)

Clearly, the Markov chain P ′ is a random walk based on the Markov basis with a uniform stationarydistribution.

Fix a positive function f : A−1[b] → Z>0 on the b-fibre of A. The function f is assumed to beproportional to a target distribution. The Metropolis construction provides a Markov chain P = (Xt)on the state space A[b−1] with transition probabilities defined as

p(v|u) =

p′(v|u), if v 6= u and f(v) ≥ f(u),

p′(v|u) f(v)f(u) , if v 6= u and f(v) < f(u),

p′(u|u) +∑f(v)<f(u) p′(v|u)

(1− f(v)

f(u)

), otherwise (u = v).

(8.19)

We have∑v p(v|u) = 1 for each state u and thus P forms a Markov chain. This chain can be imple-

mented as follows:

Take the state u and choose a new state v from the Markov chain P ′. If f(v) ≥ f(u), moveto the state v with certainty. If f(v) < f(u), perform a random experiment such that the newstate v is accepted with rejection probability f(v)/f(u). (The random experient is implementedby generating a number r ∈ [0, 1] uniformly at random and checking if r < f(v)/f(u). If so, thenew state v is accepted.) Otherwise, stay at the state u.

186 8 Computational Statistics

Proposition 8.14. The Markov chain P on the b-fibre of A defined by the transition probabilities (8.19)satisfies the condition of detailled balance:

f(u)p(v|u) = f(v)p(u|v), u, v ∈ A−1[b]. (8.20)

Proof. The equation is clear for u = v. Let v 6= u. If f(v) ≥ f(u), then p(v|u)f(v)/f(u) = p(u|v) andif f(v) < f(u), then p(v|u) = p(u|v)f(v)/f(u) as required. ⊓⊔

In view of the algorithm, only quotiens of f are considered and therefore it is sufficient to know thetarget distribution f up to a constant. The condition of detailed balance is sufficient such that theMarkov chain becomes stationary.

Proposition 8.15. The Markov chain P on the b-fibre of A defined by the transition probabilities (8.19)has a stationary distribution proportional to f .

Proof. Let v ∈ A−1[b]. We have

u

f(u)p(v|u) =∑

u

f(v)p(u|v) = f(v), (8.21)

where the first equations follows from the condition of detailed balance and the second from the con-servation of probabilities. Thus the distribution proportional to (f(u)) is a fixed point of the transitionmatrix. ⊓⊔

Example 8.16 (R). Reconsider the matrix A =(1 2 3

). The corresponding Markov basis B =

{±(1,−2, 1)t,±(1, 1,−1)t,±(2,−1, 0)t}. Take the fibre b = A(1, 1, 1)t = 1 + 2 + 3 = 6, and use useas target distribution the multinomial distribution

f(u1, u2, u3) =n!

u1!u2!u3!pu11 pu2

2 pu33

with probabilities p1, p2, p3 > 0, p1 + p2 + p3 = 1, u1 + u2 + u3 = n. sufficient to choose f(u1, u2, u3) =pu11 pu2

2 pu33 since n = 6 fixed.

> B <- matrix( c(1,-2,1, 1,1,-1, 2,-1,0, -1,2,-1, -1,-1,1, -2,1,0), nrow=6, ncol=3)

> f <- function (u) {

p <- c(0.1,0.3,0.6)

return (p[1]^u[1]*p[2]^u[2]*p[3]^u[3])

}

> k <- 0

> N <- 100

> z <- runif(N) # random numbers between 0 and 1

> u <- c(1,1,1) # starting point

> for (j in 2:N) {

# generate random integer between 1 and 6

i <- sample(1:6, 1)

v <- c(B[i,1]+u[1],B[i,2]+u[2],B[i,3]+u[3])

if (f(v) > f(u))

8.3 Metropolis Algorithm 187

u <- v

else {

if ( z[j] - f(v)/f(u) < 0 )

u <- v

else

k <- k+1 # v is rejected

}

}

> print(k) # number of rejections

[1] 62

♦In general terms, given a target distribution f we choose a Markov chain (Xt) and a proposal

distribution g(·|Xt) which fulfills some regularity conditions (irreducibility, positive recurrence, andaperiodicity). Then the Metropolis algorithm given in Fig. 8.1 will converge to the given target distri-bution. Note that the candidate point Y is accepted with probability

r(Xt, Y ) = min{1, f(Y )g(Xt|Y )

f(Xt)g(Y |Xt)}.

Since only quotients of the target distribution occur, it is sufficient to know the target distribution upto a constant.

Algorithm 8.1 Metropolis algorithm.

Require: S state set, f target distribution on SEnsure: Markov chain (Xt)t≥0 on S with stationary distribution f .

Choose proposal distribution g(·|Xt) on SGenerate X0 from distribution gwhile (Chain not converged to stationary distribution according to some criterion) do

Generate Y from g(·|Xt)Generate U from uniform distribution on [0, 1]

if U ≤ f(Y )g(Xt|Y )f(Xt)g(Y |Xt)

then

Xt+1 ← Yelse

Xt+1 ← Xt

end if

t← t+ 1end while

Example 8.17 (R). A random walk can be implemented by the Metropolis algorithm. For this, a can-didate point Y is generated from a symmetrical proposal distribution g(Y |Xt) = g(|Xt−Y |) dependingon the distance between the points Xt and Y . At each iteration, a random increment Z is generatedfrom the proposal distribution g and the candidate point is set to Y = Xt+Z. The target distributionis the Student’s t distribution with ν degrees of freedom and the proposal distribution is the normaldistribution N(Xt, σ

2). Note that the t(ν) density is proportional to

188 8 Computational Statistics

f(x) = (1 + x2/ν)−(ν+1)/2.

Thus by the symmetry of the proposal distribution, we have

r(xt, y) =f(y)

f(xt)=

(1 + y2/ν)−(ν+1)/2

(1 + x2t/ν)−(ν+1)/2

.

The Metropolis algorithm then looks as follows.

> metropolis <- function(nu, sigma, x0, N) {

> x <- numeric(N)

> x[1] <- x0

> z <- runif(N)

> k <- 0

> for (j in 2:N) {

y <- rnorm(1, x[j-1], sigma)

if (dt(y,nu) > dt(x[j-1],nu))

x[j] <- y

else {

if ( z[j] - dt(y,nu)/dt(x[j-1],nu) < 0 )

x[j] <- y

else {

x[j] <- x[j-1]

k <- k+1

}

}

}

return(list(x=x,k=k))

}

The convergence of the random walk is sensitive to the parameter choice. To this end, we have providedrandom walks with several choices of σ (Fig. 8.3).

> nu <- 4 # degrees of freedom of target Student’s t distribution

> N <- 1000

> sigma <- c(0.1, 0.2, 0.5, 1.0, 1.5, 2.0, 3.0, 5.0, 10.0)

> x0 <- 20

> m1 <- metropolis( nu, sigma[1], x0, N )

> m2 <- metropolis( nu, sigma[2], x0, N )

> m3 <- metropolis( nu, sigma[3], x0, N )

> m4 <- metropolis( nu, sigma[4], x0, N )

> m5 <- metropolis( nu, sigma[5], x0, N )

> m6 <- metropolis( nu, sigma[6], x0, N )

> m7 <- metropolis( nu, sigma[7], x0, N )

> m8 <- metropolis( nu, sigma[8], x0, N )

> m9 <- metropolis( nu, sigma[9], x0, N )

# number of candidate points rejected

> print( c( m1$k, m2$k, m3$k, m4$k, m5$k, m6$k, m7$k, m8$k, m9$k ) )

[1]   1   3   7  18  28  48  71 119 169

Fig. 8.3. Random walk with σ = 0.5.

8.4 Contingency Tables

In statistics, contingency tables are matrices that record the multivariate frequency distribution of two or more discrete categorical variables. They form a basic tool in business intelligence and survey research. They give a basic picture of the interrelation between two or more random variables and are used to find interactions between them.

Example 8.18. Consider the 4 × 4 contingency table shown in Table 8.1, which presents a classification of 592 people according to eye color and hair color. A basic question of interest for this table is whether eye color and hair color are independent features. ♦

Let X and Y be random variables with state sets [r] and [c], respectively. An r × c contingency table displays the frequencies of random selections from these two variables (Table 8.2). All probabilistic information about the random variables X and Y is contained in the joint probabilities

p_ij = P(X = i, Y = j),  1 ≤ i ≤ r, 1 ≤ j ≤ c.

The way in which the sample data are acquired is of central importance.


Table 8.1. Eye color vs. hair color for 592 subjects. (The right-hand column and the bottom row contain the marginal totals, and the bottom right-hand corner cell is the grand total.)

                     Hair Color
Eye Color   Black  Brunette  Red  Blonde  Total
Brown          68       119   26       7    220
Blue           20        84   17      94    215
Hazel          15        54   14      10     93
Green           5        29   14      16     64
Total         108       286   71     127    592

Table 8.2. Scheme of an r × c contingency table for two categorical random variables.

X\Y      Y = 1  ...  Y = j  ...  Y = c   Total
X = 1     n_11  ...   n_1j  ...   n_1c    n_1+
 ...       ...         ...         ...     ...
X = i     n_i1  ...   n_ij  ...   n_ic    n_i+
 ...       ...         ...         ...     ...
X = r     n_r1  ...   n_rj  ...   n_rc    n_r+
Total     n_+1  ...   n_+j  ...   n_+c    n_++

• Unrestricted sampling: Suppose the number of observations n = n_++ is not fixed (e.g., connection between car brand and exceeding the speed limit in traffic control). Then the frequencies n_ij can be viewed as realizations of independent Poisson distributed random variables with mean λ_ij, that is,

P(n_ij) = e^{−λ_ij} λ_ij^{n_ij} / n_ij! .

The likelihood function is given by

L(λ; n) = ∏_{i=1}^{r} ∏_{j=1}^{c} e^{−λ_ij} λ_ij^{n_ij} / n_ij! ,

and the maximum likelihood estimates of the means λ_ij are

λ̂_ij = n_ij,  1 ≤ i ≤ r, 1 ≤ j ≤ c.

• Multinomial sampling: Suppose the number of observations n = n_++ is fixed (e.g., connection between eye color and hair color of n persons). Then the common density is given by a multinomial distribution

f(n_ij | n) = n! / ( ∏_{i=1}^{r} ∏_{j=1}^{c} n_ij! ) · ∏_{i=1}^{r} ∏_{j=1}^{c} p_ij^{n_ij}.

The likelihood function is

L(p; n) = n! ∏_{i=1}^{r} ∏_{j=1}^{c} p_ij^{n_ij} / n_ij!

and the maximum likelihood estimates are

p̂_ij = n_ij / n,  1 ≤ i ≤ r, 1 ≤ j ≤ c.

• Hypergeometric sampling: Suppose the row and column sums are fixed (e.g., the classical tea tasting test). Take the row and column sums

n_i+ = ∑_{j=1}^{c} n_ij,  1 ≤ i ≤ r,   and   n_+j = ∑_{i=1}^{r} n_ij,  1 ≤ j ≤ c,

respectively, and the vectors of row and column sums

n_·+ = (n_1+, ..., n_r+)^t   and   n_+· = (n_+1, ..., n_+c)^t,

respectively. Then the common density is given by the hypergeometric distribution

f(n_ij | n_·+, n_+·) = ( ∏_{i=1}^{r} n_i+! · ∏_{j=1}^{c} n_+j! ) / ( n! ∏_{i=1}^{r} ∏_{j=1}^{c} n_ij! )
                     = ∏_{j=1}^{c} \binom{n_+j}{n_1j, ..., n_rj} / \binom{n}{n_1+, ..., n_r+}.        (8.22)

In particular, in view of a 2 × 2 contingency table,

f(n_11 | n_·+, n_+·) = \binom{n_+1}{n_11} \binom{n − n_+1}{n_1+ − n_11} / \binom{n}{n_1+}.
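As a quick numerical check (a sketch with made-up margins), the 2 × 2 formula agrees with R's built-in hypergeometric density:

> n1p <- 10; n2p <- 15; np1 <- 8; n <- n1p + n2p     # hypothetical margins
> n11 <- 5
> choose(np1, n11) * choose(n - np1, n1p - n11) / choose(n, n1p)
> dhyper( n11, n1p, n2p, np1 )                       # same value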

An important problem in multivariate statistics is finding the dependence structure between categorical random variables. In inferential statistics, a test of independence assesses whether the observed features expressed in a contingency table are independent of each other. The null hypothesis refers to the general assertion that there is no relationship between the measured phenomena. In view of two random variables X and Y with state sets [r] and [c], respectively, we assume that the frequencies n_ij in the r × c contingency table follow a multinomial distribution. Define the marginal probabilities

p_i+ = P(X = i),  1 ≤ i ≤ r,   and   p_+j = P(Y = j),  1 ≤ j ≤ c.

Then the null hypothesis states that the random variables X and Y are independent,

H_0 : p_ij = P(X = i, Y = j) = p_i+ p_+j,  1 ≤ i ≤ r, 1 ≤ j ≤ c,

and the alternative hypothesis H_a states that the opposite holds. Under the null hypothesis, the theoretical frequencies of the outcomes are estimated by

p̂_i+ = n_i+ / n,  1 ≤ i ≤ r,   and   p̂_+j = n_+j / n,  1 ≤ j ≤ c,

and

n̂_ij = n p̂_ij = n p̂_i+ p̂_+j = n_i+ n_+j / n,  1 ≤ i ≤ r, 1 ≤ j ≤ c.


Proposition 8.19. The random variables X and Y are independent if and only if the r × c matrix P = (p_ij) has rank 1.

Proof. Suppose X and Y are independent. Then the matrix P can be written as the product of the column vector (p_i+) and the row vector (p_+j). It follows that the matrix has rank 1.

Conversely, let P have rank 1. Then the matrix has the form P = ab^T, where a ∈ R^r and b ∈ R^c. All entries of the matrix are non-negative and so the vectors can be chosen to have non-negative entries as well. We have p_ij = a_i b_j, 1 ≤ i ≤ r, 1 ≤ j ≤ c. Let a_+ and b_+ be the sums of the entries in a and b, respectively. Then p_i+ = a_i b_+, p_+j = a_+ b_j, and a_+ b_+ = 1, 1 ≤ i ≤ r, 1 ≤ j ≤ c. It follows that p_ij = a_i b_j = a_i b_+ a_+ b_j = p_i+ p_+j, 1 ≤ i ≤ r, 1 ≤ j ≤ c. ♦

The test of independence can be assessed by Pearson's chi-squared test. For this, the statistical evaluation of the difference between the observed frequencies and the expected frequencies under the null hypothesis is based on Pearson's cumulative test statistic

χ²_ν = ∑_{i=1}^{r} ∑_{j=1}^{c} (n_ij − n̂_ij)² / n̂_ij.

This test statistic is approximately chi-squared distributed with ν degrees of freedom. The number of degrees of freedom is determined by the dependencies among the data: ∑_{i=1}^{r} p_i+ = 1, ∑_{j=1}^{c} p_+j = 1, and ∑_{i=1}^{r} ∑_{j=1}^{c} p_ij = 1. Thus we have

ν = (r · c − 1) − ((r − 1) + (c − 1)) = r · c − r − c + 1 = (r − 1)(c − 1).
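As a minimal R sketch using the data of Table 8.1, the expected frequencies and the test statistic can be computed directly from the margins; chisq.test in Example 8.20 below performs the same computation.

> eh <- matrix( c(68,119,26,7, 20,84,17,94, 15,54,14,10, 5,29,14,16), nrow=4, byrow=TRUE )
> n <- sum(eh)
> expected <- outer( rowSums(eh), colSums(eh) ) / n     # n_i+ n_+j / n
> sum( (eh - expected)^2 / expected )                   # approx. 138.29
> (nrow(eh)-1) * (ncol(eh)-1)                           # 9 degrees of freedom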

Note that for small values of n and small expected frequencies, the χ²_ν distribution is only a coarse approximation.

The significance level of a statistical test is a probability threshold below which the null hypothesis will be rejected. Common values are 1% or 5%. The rejection of the null hypothesis when it is actually true is known as a type I error or false positive determination. For instance, if the null hypothesis is rejected at the significance level of α = 0.05, then on average in 5 of 100 cases a type I error will be committed.

The distribution of a test statistic T = χ² under the null hypothesis and according to the significance level α decomposes the possible values of T into two regions: one for which the null hypothesis is rejected, the so-called critical region, and one for which it is not. The two regions are divided according to the (1 − α)-quantile χ²_{ν;1−α} of the chi-squared distribution. The probability of the critical region is α (Fig. 8.4).

Decide to reject the null hypothesis in favor of the alternative if the calculated value χ² is larger than the (1 − α)-quantile χ²_{ν;1−α}, and accept the null hypothesis otherwise (Table 8.3).

Example 8.20 (R). Reconsider the 4 × 4 contingency table relating eye color and hair color for n = 592 persons (Table 8.1). The test of independence is conducted using R as follows.

> eh <- matrix( c(68,119,26,7,20,84,17,94,15,54,14,10,5,29,14,16), ncol=4 )

> chisq.test( eh, correct=TRUE)

Pearson’s Chi-squared test


Fig. 8.4. Critical region.

Table 8.3. The 0.95-quantiles of the χ²_ν distribution for small degrees of freedom ν with 1 ≤ ν ≤ 9.

ν             1     2     3     4      5      6      7      8      9
χ²_{ν;0.95}  3.84  5.99  7.81  9.49  11.07  12.59  14.07  15.51  16.92

data: eh

X-squared: 138.29, df = 9, p-value < 2.2e-16

> qchisq( 0.95, 9 )

[1] 16.91898

> qchisq( 0.99, 9 )

[1] 21.66599

The test statistic yields χ² = 138.29 and the 95%-quantile of the chi-squared distribution with ν = 9 degrees of freedom is χ²_{9;0.95} = 16.91898. Thus the null hypothesis must be strongly rejected at the significance level of 5%. A similar result holds for the significance level of 1%. ♦

The p-value is a statistic defined as the probability of obtaining a result equal to or more extreme than what was actually observed, under the assumption that the null hypothesis is true. If the p-value is smaller than the significance level, the observed data are inconsistent with the null hypothesis and therefore the null hypothesis must be rejected.

Finally, we consider two-way contingency tables with fixed row and column sums. The probability of such an r × c two-way contingency table is given by (8.22). Generally, it is difficult to provide a complete list of all those tables. An alternative is sampling in the fibre of a given table by using the Metropolis algorithm and Markov bases. In order to find a Markov basis for two-way r × c contingency tables with fixed row and column sums, consider the integral (r + c) × rc matrix

A_{r,c} = ( I_r ⊗ 1_c^T
            1_r^T ⊗ I_c ),                                          (8.23)

where 1_r is the all-one vector of length r, I_r is the r × r identity matrix, and ⊗ denotes the Kronecker product. The upper block I_r ⊗ 1_c^T consists of the r row-sum rows and the lower block 1_r^T ⊗ I_c of the c column-sum rows; see the matrix A_{3,2} in (8.26) for an explicit instance.

If n denotes an r × c contingency table with vectors of row and column sums n_+ and n_−, respectively, and n is written row-wise as a column vector, then we have

A_{r,c} n = (n_+, n_−)^t.                                            (8.24)

The vector on the right-hand side,

b = (n_+, n_−)^t,

forms a sufficient statistic of the model.
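The relation (8.24) can be verified directly in R; the following sketch builds A_{3,4} with Kronecker products and applies it to a random table written row-wise.

> nr <- 3; nc <- 4
> A <- rbind( kronecker( diag(nr), t(rep(1, nc)) ),     # row-sum block I_r (x) 1_c^t
              kronecker( t(rep(1, nr)), diag(nc) ) )    # column-sum block 1_r^t (x) I_c
> n <- matrix( sample(0:5, nr*nc, replace=TRUE), nrow=nr )
> A %*% as.vector( t(n) )                               # equals c( rowSums(n), colSums(n) )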

Proposition 8.21. A reduced Groebner basis of the ideal I_{A_{r,c}} in the polynomial ring Q[X_{ij} : i ∈ [r], j ∈ [c]] is given by

G_{r,c} = { X_{il} X_{jk} − X_{ik} X_{jl} | 1 ≤ i < j ≤ r, 1 ≤ k < l ≤ c }.

Let e_{ij} denote the standard unit table, which has a 1 in the (i, j) position and zeros elsewhere. Then by Prop. 8.6, we obtain the following result.

Proposition 8.22. The minimal Markov basis of the matrix A_{r,c} corresponding to the Groebner basis G_{r,c} consists of 2 · \binom{r}{2} \binom{c}{2} moves given by

B_{r,c} = { ±(e_{il} + e_{jk} − e_{ik} − e_{jl}) | 1 ≤ i < j ≤ r, 1 ≤ k < l ≤ c }.

It follows that the Markov basis consists of all tables which have one of the following 2 × 2 minors and zeros elsewhere,

( +1 −1 )         ( −1 +1 )
( −1 +1 )   and   ( +1 −1 ).                                          (8.25)
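As a sketch, the moves of Prop. 8.22 can be enumerated in R; each move is an r × c table containing one such ±(2 × 2) minor. For r = 3 and c = 2 this reproduces the three ± pairs of Example 8.23 below.

> markov_basis <- function(nr, nc) {
    moves <- list()
    for (i in 1:(nr-1)) for (j in (i+1):nr)
      for (k in 1:(nc-1)) for (l in (k+1):nc) {
        m <- matrix(0, nr, nc)
        m[i,l] <- 1; m[j,k] <- 1; m[i,k] <- -1; m[j,l] <- -1
        moves <- c(moves, list(m), list(-m))
      }
    moves
  }
> length( markov_basis(3, 2) )    # 2 * choose(3,2) * choose(2,2) = 6 moves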


Example 8.23 (Singular). The 3 × 2 contingency tables with fixed row and column sums are described by the matrix

A_{3,2} = ( 1 1 0 0 0 0
            0 0 1 1 0 0
            0 0 0 0 1 1
            1 0 1 0 1 0
            0 1 0 1 0 1 ).                                            (8.26)

Then we have

A_{3,2} (u_11, u_12, u_21, u_22, u_31, u_32)^t = (u_1+, u_2+, u_3+, u_+1, u_+2)^t.      (8.27)

The reduced Groebner basis of the ideal I_{A_{3,2}} is

G_{3,2} = { X_12 X_21 − X_11 X_22,  X_12 X_31 − X_11 X_32,  X_22 X_31 − X_21 X_32 }

which can be calculated as follows,

> ring r = 0, (x(1..6),y(1..5)), dp;

> ideal i = x(1)-y(1)*y(4), x(2)-y(1)*y(5), x(3)-y(2)*y(4),
x(4)-y(2)*y(5), x(5)-y(3)*y(4), x(6)-y(3)*y(5);

> ideal j = std(i);

> eliminate( j, y(1)*y(2)*y(3)*y(4)*y(5) );

_[1]=x(4)*x(5)-x(3)*x(6)

_[2]=x(2)*x(5)-x(1)*x(6)

_[3]=x(2)*x(3)-x(1)*x(4)

Therefore, the corresponding Markov basis is

B3,2 = {±(e12 + e21 − (e11 + e22)),±(e12 + e31 − (e11 + e32)),±(e22 + e31 − (e21 + e32))},

where

±(e_12 + e_21 − e_11 − e_22) = ± ( −1 +1
                                   +1 −1
                                    0  0 ),

±(e_12 + e_31 − e_11 − e_32) = ± ( −1 +1
                                    0  0
                                   +1 −1 ),

±(e_22 + e_31 − e_21 − e_32) = ± (  0  0
                                   −1 +1
                                   +1 −1 ).


The usual approach to perform hypothesis testing for contingency tables is the asymptotic one, which involves chi-squared distributions. In many cases, especially when the table is sparse, the chi-squared approximation may not be adequate. If so, we can approximate the test statistic via the Metropolis algorithm, drawing contingency tables in the fibre of b according to the hypergeometric distribution.

Suppose we are given an r × c contingency table n in the fibre of b determined by the vectors (n_+, n_−). First, generate a Markov move m ∈ B_{r,c} uniformly at random. This means: pick two rows and two columns uniformly at random and take a sign ε = ±1 with probability 1/2 such that one of the following two 2 × 2 minors is obtained,

( +1 −1 )        ( −1 +1 )
( −1 +1 )   or   ( +1 −1 ).                                            (8.28)

Second, calculate the new table n′ = n + ε · m. If the new table satisfies n′ ≥ 0, move to the new table with probability min{1, f(n′ | n_+, n_−) / f(n | n_+, n_−)}; otherwise, stay at the table n.

The acceptance ratio can be computed as follows,

f(n + m | n_+, n_−) / f(n | n_+, n_−)
  = [ ∏_{j=1}^{c} \binom{n_+j}{n_1j + m_1j, ..., n_rj + m_rj} / \binom{n}{n_1+, ..., n_r+} ]
    · [ ∏_{j=1}^{c} \binom{n_+j}{n_1j, ..., n_rj} / \binom{n}{n_1+, ..., n_r+} ]^{−1}
  = ∏_{j=1}^{c} \binom{n_+j}{n_1j + m_1j, ..., n_rj + m_rj} / \binom{n_+j}{n_1j, ..., n_rj}
  = ∏_{j=1}^{c} ∏_{i=1}^{r} n_ij! / (n_ij + m_ij)!                                          (8.29)
  = ∏_{i,j: m_ij ≠ 0} n_ij! / (n_ij + m_ij)!
  = ∏_{i,j: m_ij = +1} n_ij! / (n_ij + 1)!  ·  ∏_{i,j: m_ij = −1} n_ij! / (n_ij − 1)!
  = ∏_{i,j: m_ij = +1} (n_ij + 1)^{−1}  ·  ∏_{i,j: m_ij = −1} n_ij,

where the last term involves only four numbers.
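A minimal R sketch of one such fibre update, using the simplified ratio above (the function name fibre_step is illustrative; n is any non-negative integer table):

> fibre_step <- function(n) {
    r <- nrow(n); c <- ncol(n)
    ij <- sort( sample(1:r, 2) ); kl <- sort( sample(1:c, 2) )
    eps <- sample( c(-1, 1), 1 )
    m <- matrix(0, r, c)
    m[ij[1], kl[1]] <- eps;  m[ij[2], kl[2]] <- eps
    m[ij[1], kl[2]] <- -eps; m[ij[2], kl[1]] <- -eps
    np <- n + m
    if (min(np) < 0) return(n)                              # move would leave the fibre
    ratio <- prod( 1 / np[m == 1] ) * prod( n[m == -1] )    # eq. (8.29), four cells only
    if (runif(1) < min(1, ratio)) np else n
  }

Iterating fibre_step keeps the row and column sums fixed, as in the Maple run of Example 8.24 below.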

Example 8.24 (Maple). Reconsider the 4 × 4 contingency table n that provides eye color versus hair color for N = 592 persons (Table 8.1). As we have seen, the test statistic yields χ² = 138.29 and the null hypothesis must be rejected at the significance level of 5%. Diaconis and Efron (1985) labored long and hard to determine the proportion of tables with the same row and column sums as the given table having test statistic ≤ 138.29. Their best estimate was "about 10%". The following Metropolis run illustrates that the estimate "about 20%" is more realistic. The corresponding Maple code is as follows (Fig. 8.5).

> restart:

> with(Statistics): infolevel[Statistics]:=0:

> with(RandomTools):


> X := Matrix([[68, 119, 26, 7],

[20, 84, 17, 94],

[15, 54, 14, 10],

[5, 29, 14, 16]]):

> Z := Matrix([[0, 0, 0, 0],

[0, 0, 0, 0],

[0, 0, 0, 0],

[0, 0, 0, 0]]):

> roll4 := rand(1..4): roll2 := rand(0..1):

> count := 0:

> L := []:

> st0 := ChiSquareIndependenceTest( X, level=0.05, output=’statistic’);

138.2898416

> L := [op(L), round(st0)]:

> while (nops(L)<20000) do

i := roll4():

j := roll4():

if j<i then h:=j; j:=i; i:=h end if;

u := roll4():

v := roll4():

if v<u then h:=v; v:=u; u:=h end if;

r := roll2():

if i<>j and u<>v then

Y := Matrix([[0, 0, 0, 0],

[0, 0, 0, 0],

[0, 0, 0, 0],

[0, 0, 0, 0]]);

Y[i,u] := 2*r-1;

Y[j,v] := 2*r-1;

Y[i,v] := -(2*r-1);

Y[j,u] := -(2*r-1);

Z := X + Y;

if (Z[i,u] > 0 and Z[j,v] > 0 and Z[i,v] > 0 and Z[j,u] > 0) then

st1 := ChiSquareIndependenceTest( Z, level=0.1, output=’statistic’ );

L := [op(L),round(st1)];

if (st1 < st0) then

X := Z;

count := count + 1;

else

rn := Generate(rational(range=0..1));

if r = 1 then mv := X[i,v]*X[j,u] / ((X[i,u]+1)*(X[j,v]+1))

else mv := X[i,u]*X[j,v] / ((X[i,v]+1)*(X[j,u]+1))

end if;

if rn < mv then

X := Z

198 8 Computational Statistics

end if;

end if;

end if;

end if;

end:

> count;

4321

> nops(L);

20000

> evalf(count/nops(L));

0.2160500000

> Histogram(L,discrete=true);

Fig. 8.5. Histogram of Metropolis run (20,000 iterations).

8.5 Hardy-Weinberg Model

The Hardy-Weinberg law is a milestone of population genetics. Suppose a population of diploid organisms mates randomly and that there is no selection or mutation affecting the gene frequencies. Under these conditions, the Hardy-Weinberg law states that the frequency of allele combinations (genotypes) will remain constant across generations, and it gives a formula for these frequencies. These frequencies apply to infinite populations, while for finite populations the question of interest is whether or not the finite population is a random subset of a population that follows the Hardy-Weinberg law. Testing whether a finite population obeys the proportions of the Hardy-Weinberg law is an important first step towards the analysis of a population.

Consider the Hardy-Weinberg model of a two-allele locus with alleles Y and y. Suppose that in a population the alleles Y and y occur with probabilities p and q, respectively. Then we have

p + q = 1.                                                             (8.30)

The Hardy-Weinberg law states that in the offspring, the genotypes Yy and yY (heterozygote) occur together with probability 2pq, and the genotypes YY and yy (homozygote) occur with probabilities p² and q², respectively. A population with these genotype frequencies is said to be in Hardy-Weinberg equilibrium at the locus and the genotype frequencies are known as Hardy-Weinberg proportions. These genotype frequencies satisfy (Fig. 8.6)

1 = (p + q)² = p² + 2pq + q².                                          (8.31)

In population genetics one is particularly interested in the prevalence of a particular allele or genotype in a population. The Hardy-Weinberg law also explains why recessive phenotypes persist over time. We assume that p² is the probability of the homozygous dominant genotype, 2pq is the probability of the heterozygous genotype, and q² is the probability of the homozygous recessive genotype.

        Y (p)      y (q)
Y (p)   YY (p²)    Yy (pq)
y (q)   yY (qp)    yy (q²)

Fig. 8.6. Hardy-Weinberg Punnett square.

Example 8.25. Consider a population of mice in Hardy-Weinberg equilibrium.

First, suppose 16% of the mice in the population are homozygous recessive. How many mice in the population are homozygous dominant? We have q² = 0.16. Then q = 0.40 and so p = 1 − q = 0.60. Thus p² = 0.36 and hence 36% of the mice are homozygous dominant.

Second, suppose 19% of the mice in the population show the dominant phenotype. What is the dominant allele frequency? We have p² + 2pq = 0.19. Then q² = 1 − (p² + 2pq) = 0.81 and so q = 0.90. Thus p = 1 − q = 0.10 and hence the dominant allele frequency is 10%.

Third, suppose 0.01% of the mice in the population suffer from a genetic disorder. How many mice in the population are carriers? We have q² = 0.0001. Then q = 0.01 and so p = 1 − q = 0.99. Thus 2pq = 0.0198 and hence approximately 2% of the mice are carriers.

For instance, in the population of 300 million US Americans, about 0.01% suffer from cystic fibrosis. By the above calculation, 3 · 10^8 · 0.0198 = 5,940,000 US Americans are carriers. ♦

More generally, let m ≥ 2 be an integer and consider the Hardy-Weinberg model for an m-allele locus with alleles A_1, ..., A_m. Suppose the allele A_i occurs with probability p_i, 1 ≤ i ≤ m, and the probabilities satisfy

p_1 + ... + p_m = 1.                                                   (8.32)

The Hardy-Weinberg law states that in the offspring, the genotype (heterozygote) A_i A_j, i < j, occurs with probability 2 p_i p_j and the genotype (homozygote) A_i A_i has probability p_i². It is assumed that the resulting \binom{m+1}{2} genotypes are phenotypically distinguishable. A population with these genotype frequencies is said to be in Hardy-Weinberg equilibrium at the locus and the genotype frequencies are the Hardy-Weinberg proportions. These genotype frequencies fulfill

1 = (p_1 + ... + p_m)² = ∑_{i=1}^{m} p_i² + ∑_{i<j} 2 p_i p_j.         (8.33)

Let R = {(i, j) | 1 ≤ i ≤ j ≤ m} and let p_ij be the probability that the genotype A_i A_j is observed, 1 ≤ i ≤ j ≤ m. Suppose that n genotypes are observed with the frequencies u = (u_ij), where u_ij is the number of outcomes of the genotype A_i A_j, (i, j) ∈ R. Then we have

n = ∑_{(i,j)∈R} u_ij.                                                  (8.34)

Since the number of observations n is fixed, the common density is given by a multinomial distribution

f(u_ij | n) = n! / ∏_{(i,j)∈R} u_ij! · ∏_{(i,j)∈R} p_ij^{u_ij}.        (8.35)

Testing the deviation from the Hardy-Weinberg proportions can be considered as a hypothesis testing problem. For this, the null hypothesis states that the population conforms with the Hardy-Weinberg proportions,

H_0 : p_ij = p_i²       if i = j,
             2 p_i p_j   if i ≠ j.                                     (8.36)

The alternative hypothesis H_a assumes that it does not. The common approach for this kind of testing is a goodness-of-fit test.

Example 8.26 (R). Take a random sample of 10 six-sided dice that are thrown 40 times. The outcomes are given by the vector dice below. We test the null hypothesis that all outcomes have equal probabilities.

> dice <- c( 71, 69, 74, 54, 66, 67 )

# default: uniform distribution

> sum( dice )

[1] 400

> chisq.test( dice, correct=TRUE )

Chi-squared test for given probabilities

data: dice

X-squared: 3.53, df = 5, p-value = 0.6189

> qchisq( 0.95, 5 )

[1] 11.0705

8.5 Hardy-Weinberg Model 201

The test statistic yields χ² = 3.53 and the 95%-quantile of the chi-squared distribution with ν = 5 degrees of freedom is χ²_{5;0.95} = 11.0705. Thus the null hypothesis cannot be rejected at the significance level of 5%. ♦

Example 8.27 (R). In a course of Statistics 101, there are 26 freshmen, 33 sophomores, 20 juniors, and 22 seniors. We test the null hypothesis that freshmen, sophomores, juniors, and seniors are represented according to the probabilities 1/3, 1/3, 1/6, and 1/6, respectively.

> students <- c( 26, 33, 20, 22 )

> sum(students)

[1] 101

> null.probs <- c( 1/3, 1/3, 1/6, 1/6 )

> chisq.test( students, p = null.probs, correct=TRUE )

Chi-squared test for given probabilities

data: students

X-squared: 3.9406, df = 3, p-value = 0.268

> qchisq( 0.95, 3 )

[1] 7.814728

The test statistic yields χ² = 3.9406 and the 95%-quantile of the chi-squared distribution with ν = 3 degrees of freedom is χ²_{3;0.95} = 7.814728. Since 3.9406 < 7.814728, the null hypothesis cannot be rejected at the significance level of 5%.

Next, we test the null hypothesis that freshmen, sophomores, juniors, and seniors are represented according to the probabilities 0.3, 0.3, 0.2, and 0.2, respectively.

> students <- c( 26, 33, 20, 22 )

> sum(students)

[1] 101

> null.probs <- c( 0.3, 0.3, 0.2, 0.2 )

> chisq.test( students, p = null.probs, correct=TRUE )

Chi-squared test for given probabilities

data: students

X-squared: 1.0132, df = 3, p-value = 0.7981

Here the null hypothesis cannot be rejected at the significance level of 5%. ♦

In view of the Hardy-Weinberg model, the number of alleles A_i in the sample is

u_i+ = u_1i + ... + 2 u_ii + ... + u_im,  1 ≤ i ≤ m.                   (8.37)

Under the null hypothesis, the allele probabilities are estimated as

p̂_i = u_i+ / (2n),  1 ≤ i ≤ m,                                         (8.38)

since each genotype contributes two alleles. Thus the expected number of outcomes of the genotype A_i A_j can be estimated for heterozygotes as

û_ij = 2 n p̂_i p̂_j,  1 ≤ i < j ≤ m,                                    (8.39)

and for homozygotes as

û_ii = n p̂_i²,  1 ≤ i ≤ m.                                             (8.40)

In view of the goodness-of-fit test, the test statistic measures the statistical difference between the observed frequencies and the expected frequencies under the null hypothesis,

χ²_ν = ∑_{(i,j)∈R} (u_ij − û_ij)² / û_ij.                              (8.41)

This test statistic is approximately chi-squared distributed with ν degrees of freedom. The number of degrees of freedom is ν = \binom{m}{2}, since the homozygotic frequencies u_ii can be obtained from u_i+ and the heterozygotic frequencies u_ij.

A level-α chi-squared test rejects the null hypothesis if the test statistic satisfies χ² > χ²_{ν;1−α}, where χ²_{ν;1−α} is the (1 − α)-quantile of the chi-squared distribution with ν degrees of freedom and 0 < α < 1.

Example 8.28. Consider phenotype data from Scarlet tiger moths. After collapsing the data of n = 1612 moths into two groups of alleles, we observe u_11 = 1469 (white-spotted), u_12 = 138 (intermediate), and u_22 = 5 (little spotting). The estimates of the allele probabilities are

p̂_1 = (2 · 1469 + 138) / (2 · 1612) = 0.954   and   p̂_2 = (2 · 5 + 138) / (2 · 1612) = 0.046.

Then we obtain

û_11 = n p̂_1² = 1467.397,
û_12 = 2 n p̂_1 p̂_2 = 141.206,
û_22 = n p̂_2² = 3.397.

The test statistic gives

χ² = (u_11 − û_11)² / û_11 + (u_12 − û_12)² / û_12 + (u_22 − û_22)² / û_22 = 0.831.

At the significance level of α = 0.05, the 0.95-quantile of the chi-squared distribution is χ²_{1;0.95} = 3.84. Thus the null hypothesis cannot be rejected at the 5% significance level for one degree of freedom. ♦
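The computation of Example 8.28 can be reproduced with a short R sketch; the quantile call just recovers the threshold 3.84 used above.

> u <- c(1469, 138, 5); n <- sum(u)            # observed genotype counts
> p1 <- (2*u[1] + u[2]) / (2*n); p2 <- 1 - p1
> e <- c( n*p1^2, 2*n*p1*p2, n*p2^2 )          # expected counts under H0
> sum( (u - e)^2 / e )                         # approx. 0.83
> qchisq( 0.95, 1 )                            # 3.841459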

The Hardy-Weinberg model amounts to a toric statistical model that can be described by the m × \binom{m+1}{2} matrix

A = ( A_m  A_{m−1}  ...  A_1 ),                                        (8.42)

where the k-th block matrix is the m × k matrix

A_k = ( 0   O_{m−k,k−1}
        2   1 ⋯ 1
        0   I_{k−1} ),   1 ≤ k ≤ m,                                    (8.43)

whose first column has a single entry 2 in row m − k + 1; here O_{m−k,k−1} denotes the zero matrix and I_{k−1} the identity matrix.

In view of the observed frequencies

u = (u_11, u_12, ..., u_1m, u_22, u_23, ..., u_2m, u_33, ..., u_mm)^t   (8.44)

and the marginal frequencies

u_+ = (u_1+, ..., u_m+)^t,                                              (8.45)

we obtain

A u = u_+.                                                              (8.46)

Thus the marginal frequencies u_+ form a sufficient statistic of the model.
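The matrix A of (8.42)-(8.43) can be generated for any m in R; the following sketch (with the illustrative helper name hw_matrix) orders the genotype counts as in (8.44) and checks (8.46).

> hw_matrix <- function(m) {
    gt <- NULL                                   # genotype pairs (i,j) with i <= j
    for (i in 1:m) for (j in i:m) gt <- cbind(gt, c(i, j))
    A <- matrix(0, m, ncol(gt))
    for (k in 1:ncol(gt)) {
      A[gt[1,k], k] <- A[gt[1,k], k] + 1
      A[gt[2,k], k] <- A[gt[2,k], k] + 1         # homozygote columns receive a 2
    }
    A
  }
> A <- hw_matrix(4)           # the matrix of Example 8.31 below
> u <- 1:10                   # arbitrary genotype counts in the order (8.44)
> A %*% u                     # allele counts u_+ as in (8.46)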

Proposition 8.29. A reduced Groebner basis of the ideal I_A in the polynomial ring Q[{X_{ij}}_{(i,j)∈R}] is given as

G_R = { X_{i1,i2} X_{j1,j2} − X_{k1,k3} X_{k2,k4} | (k1, k2, k3, k4) = sort(i1, j1, i2, j2) },

where sort denotes the sorting of quadruples over the alphabet {1, ..., m}.

Note that the Groebner basis of the ideal I_A consists of three types of binomials:

• X_{i1,i2} X_{i3,i4} − X_{i1,i3} X_{i2,i4}, where {i1, ..., i4} is a 4-element subset of [m].
• X_{i1,i1} X_{i2,i3} − X_{i1,i2} X_{i1,i3}, where {i1, i2, i3} is a 3-element subset of [m].
• X_{i1,i1} X_{i2,i2} − X_{i1,i2}^2, where {i1, i2} is a 2-element subset of [m].

By Prop. 8.6, we obtain the following result.

Proposition 8.30. The minimal Markov basis for the matrix A corresponding to the Groebner basis G_R is given by

B_R = { ±(e_{i1,i2} + e_{j1,j2} − e_{k1,k3} − e_{k2,k4}) | (k1, k2, k3, k4) = sort(i1, j1, i2, j2) }.

Example 8.31 (Singular). In view of the Hardy-Weinberg model for four alleles, the toric model is given by the matrix

A = ( 2 1 1 1 0 0 0 0 0 0
      0 1 0 0 2 1 1 0 0 0
      0 0 1 0 0 1 0 2 1 0
      0 0 0 1 0 0 1 0 1 2 )

and we have

A (u_11, u_12, u_13, u_14, u_22, u_23, u_24, u_33, u_34, u_44)^t = (u_1+, u_2+, u_3+, u_4+)^t.

A reduced Groebner basis of the ideal I_A consists of the following elements,

X_12 X_34 − X_13 X_24,  X_12 X_34 − X_14 X_23,

X_11 X_23 − X_12 X_13,  X_11 X_24 − X_12 X_14,  X_11 X_34 − X_13 X_14,
X_22 X_13 − X_12 X_23,  X_22 X_14 − X_12 X_24,  X_22 X_34 − X_23 X_24,
X_33 X_12 − X_13 X_23,  X_33 X_14 − X_13 X_34,  X_33 X_24 − X_23 X_34,
X_44 X_12 − X_14 X_24,  X_44 X_13 − X_14 X_34,  X_44 X_23 − X_24 X_34,

X_11 X_22 − X_12^2,  X_11 X_33 − X_13^2,  X_11 X_44 − X_14^2,
X_22 X_33 − X_23^2,  X_22 X_44 − X_24^2,  X_33 X_44 − X_34^2.

This can be seen from the following computation.

> ring r = 0, (x(1..10),y(1..4)), dp;

> ideal i = x(1)-y(1)^2, x(2)-y(1)*y(2), x(3)-y(1)*y(3), x(4)-y(1)*y(4),

x(5)-y(2)^2, x(6)-y(2)*y(3), x(7)-y(2)*y(4),

x(8)-y(3)^2, x(9)-y(3)*y(4),

x(10)-y(4)^2;

> ideal j = std(i);

> eliminate( j, y(1)*y(2)*y(3)*y(4) );

_[1]=x(9)^2-x(8)*x(10)

_[2]=x(7)*x(9)-x(6)*x(10)

_[3]=x(4)*x(9)-x(3)*x(10)

_[4]=x(7)*x(8)-x(6)*x(9)

_[5]=x(4)*x(8)-x(3)*x(9)

_[6]=x(7)^2-x(5)*x(10)

_[7]=x(6)*x(7)-x(5)*x(9)

_[8]=x(4)*x(7)-x(2)*x(10)

_[9]=x(3)*x(7)-x(2)*x(9)

_[10]=x(6)^2-x(5)*x(8)

_[11]=x(4)*x(6)-x(2)*x(9)

_[12]=x(3)*x(6)-x(2)*x(8)

_[13]=x(4)*x(5)-x(2)*x(7)

_[14]=x(3)*x(5)-x(2)*x(6)

_[15]=x(4)^2-x(1)*x(10)

8.6 Logistic Regression 205

_[16]=x(3)*x(4)-x(1)*x(9)

_[17]=x(2)*x(4)-x(1)*x(7)

_[18]=x(3)^2-x(1)*x(8)

_[19]=x(2)*x(3)-x(1)*x(6)

_[20]=x(2)^2-x(1)*x(5)

For Hardy-Weinberg models with low genotype counts, the asymptotic assumption of the chi-squared distribution may not hold and the chi-squared tests may fail. We can obtain approximations of the test statistic by the Metropolis algorithm, drawing samples in the fibre A^{−1}[u_+] according to the hypergeometric distribution

P(u_ij | u_+) = \binom{n}{{u_ij}_{(i,j)∈R}} · 2^{∑_{i<j} u_ij} / \binom{2n}{u_1+, ..., u_m+}.   (8.47)

To see this, note that

P(u_ij | n) = \binom{n}{{u_ij}_{(i,j)∈R}} ∏_{(i,j)∈R} p_ij^{u_ij}.                              (8.48)

But under the null hypothesis, we have

∏_{(i,j)∈R} p_ij^{u_ij} = 2^{∑_{i<j} u_ij} · ∏_{i=1}^{m} p_i^{u_i+}.                            (8.49)

Moreover, we have

P(u_+ | n) = \binom{2n}{u_1+, ..., u_m+} ∏_{i=1}^{m} p_i^{u_i+}.                                (8.50)

Now the result follows.

8.6 Logistic Regression

In statistics, logistic regression is a regression model in which the dependent variable is categorical. We consider the case of binary dependent variables. The binary logistic model is used to estimate the probability of a binary response, like win/lose, pass/fail, or alive/dead, based on one or more independent variables.

Suppose there is a binary indicator given by a random variable Y with state set {0, 1} and a set of observable covariates z that lie in a finite subset A of Z^d. A logistic model specifies a log-linear relation of the form

P(Y = 1 | z) = e^{z·θ} / (1 + e^{z·θ})   and   P(Y = 0 | z) = 1 / (1 + e^{z·θ}),               (8.51)

where the parameter vector θ ∈ R^d is to be estimated. For instance, if z = (1, i)^t ∈ Z^2, the model becomes

P(Y = 1 | z) = e^{θ_1 + iθ_2} / (1 + e^{θ_1 + iθ_2})   and   P(Y = 0 | z) = 1 / (1 + e^{θ_1 + iθ_2}).

This might be appropriate if the probability depends on a distance, dose, or educational level that arises in equally spaced intervals.

Consider a data set of N pairs (y_1, z_1), ..., (y_N, z_N). For each covariate z ∈ A, let W(z) denote the number of samples with z_i = z and let W_k(z) be the number of samples with z_i = z and Y_i = k, where k ∈ {0, 1}. Then we have

W(z) = W_0(z) + W_1(z).                                                                         (8.52)

The probability of seeing such data is given by the likelihood function

L(Y_i = y_i | z_i) = ∏_{i=1}^{N} e^{y_i (z_i·θ)} / (1 + e^{z_i·θ})                              (8.53)
                   = ∏_{z} 1 / (1 + e^{z·θ})^{W(z)} · e^{θ · ∑_z z W_1(z)}.

From this description it follows that a sufficient statistic for θ is

(W(z))_z   and   ∑_z z W_1(z).                                                                   (8.54)

If A = {z_1, ..., z_n}, the sample data are usually summarized as the 2 × n matrix

( W_0(z_1)  W_0(z_2)  ...  W_0(z_n)
  W_1(z_1)  W_1(z_2)  ...  W_1(z_n) ).                                                           (8.55)

i        0  1  2  3   4   5   6   7    8   9  10  11   12
W_1(i)   4  2  4  6   5  13  25  27   75  29  32  36  115
W(i)     6  2  4  9  10  20  34  42  124  58  77  95  360

Fig. 8.7. Men’s response to ”Women should run their homes and leave men to run the country” (1974).

Example 8.32 (Maple). Consider data from the US social science survey on men's response to the statement "Women should run their homes and leave men to run the country" in 1974 (Fig. 8.7). Let Y = 1 if the respondent "approves" and Y = 0 otherwise. For each respondent, the number i of years in school is reported, 0 ≤ i ≤ 12. The proportions p(i) = W_1(i)/W(i) seem to decrease with years of education. It is natural to fit a logistic model of the form

P(Y = 1 | i) = e^{θ_1 + iθ_2} / (1 + e^{θ_1 + iθ_2}),                                           (8.56)

where A = {(1, 0), (1, 1), ..., (1, 12)} is a subset of Z^2. Here the likelihood function amounts to

∏_{i=0}^{12} 1 / (1 + e^{θ_1 + iθ_2})^{W(i)} · e^{∑_{i=0}^{12} W_1(i)θ_1 + i W_1(i)θ_2}.

The maximum likelihood estimates of the parameter θ for the likelihood function are θ_1 = 2.0545 and θ_2 = −0.2305. This gives estimates p̂(i) for the probabilities P(Y = 1 | i). A chi-squared test statistic for goodness-of-fit compares the expected counts W(i)p̂(i) with the observed counts W_1(i), taking into account the likelihood of the joint event,

χ²_ν = ∑_{i=0}^{12} (W(i)p̂(i) − W_1(i))² / (W(i)p̂(i)).                                          (8.57)

Here is the corresponding Maple code,

> restart: with(Statistics): infolevel[Statistics]:=1:

> W1 := Vector([4,2,4,6,5,13,25,27,75,29,32,36,115]):

> W := Vector([6,2,4,9,10,20,34,42,124,58,77,95,360]):

> for i from 1 to 13 do

> W[i] := W[i]*exp(2.0545-(i-1)*0.2305)/(1+exp(2.0545-(i-1)*0.2305))

> end:

> ChiSquareGoodnessOfFitTest(W1, W, level = 0.05);

The Maple program yields the output

Chi-Square Test for Goodness-of-Fit

-----------------------------------

Null Hypothesis:Observed sample does not differ from

expected sample

Alt. Hypothesis:Observed sample differs from

expected sample

Categories: 13

Distribution: ChiSquare(12)

Computed statistic: 2.84515

Computed pvalue: 0.996542

Critical value: 21.02606982

Result: [Accepted]
There is no statistical evidence against the null hypothesis

The uneven nature of the counts, with some counts small, gives cause for worry about the classical approximation. Therefore, approximations of the test statistic by the Metropolis algorithm eventually lead to more reliable results. ♦
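For comparison, the maximum likelihood fit of the logistic model can also be obtained in R with glm; this is only a sketch, and the reported coefficients should be close to the values θ_1 = 2.0545 and θ_2 = −0.2305 quoted above.

> W1 <- c(4,2,4,6,5,13,25,27,75,29,32,36,115)
> W  <- c(6,2,4,9,10,20,34,42,124,58,77,95,360)
> i  <- 0:12
> fit <- glm( cbind(W1, W - W1) ~ i, family = binomial )
> coef(fit)     # intercept = theta_1, slope = theta_2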

The logistic regression model amounts to a toric statistical model that can be described by the (d + n) × 2n matrix

A = ( 1 1
          1 1
              ⋱
                  1 1
      0 z_1  0 z_2  ...  0 z_n ).                                                               (8.58)

Let w_{y_i}(z_i) denote the number of samples (y_i, z_i) in the sample set, 1 ≤ i ≤ N. By taking the sample vector w = (w_0(z_1), w_1(z_1), ..., w_0(z_n), w_1(z_n))^t, we obtain

A w = ( w_0(z_1) + w_1(z_1), ..., w_0(z_n) + w_1(z_n), ∑_z z w_1(z) )^t
    = ( w(z_1), ..., w(z_n), ∑_z z w_1(z) )^t.                                                  (8.59)

A simpler toric model is given by the (d + 1) × n matrix

A = ( 1   1   ...  1
      z_1 z_2 ...  z_n ).                                                                        (8.60)

By taking the sample vector w = (w_1(z_1), ..., w_1(z_n))^t, we obtain

A w = ( ∑_z w_1(z), ∑_z z w_1(z) )^t.                                                            (8.61)

This toric model fixes the counts

∑_z w_1(z)   and   ∑_z z w_1(z).                                                                 (8.62)

These counts also provide a sufficient statistic for the model of logistic regression when the counts W(z) are fixed at the beginning, since then the original sufficient statistic (8.54) can be recovered by retrieving the counts W_0(z) according to the equation W_0(z) = W(z) − W_1(z).

Proposition 8.33. Given the 3 × (n + 1) matrix

A = ( 1 1 ... 1
      1 1 ... 1
      0 1 ... n ).                                                                               (8.63)

A reduced Groebner basis G of the ideal I_A in the polynomial ring Q[X_0, X_1, ..., X_n] consists of the binomials

X_i X_l − X_j X_k,

where i + l = j + k and 0 ≤ i ≤ j ≤ k ≤ l ≤ n, and further binomials of the form X_i^2 − X_j X_k, where 1 ≤ i ≤ n − 1 and i, j, k are pairwise distinct.

By Prop. 8.6, we obtain the following result.

Proposition 8.34. A minimal Markov basis B for the matrix A in Prop. 8.33 consists of the moves

±(e_i + e_l − e_j − e_k),

where i + l = j + k and 0 ≤ i ≤ j ≤ k ≤ l ≤ n, and further moves of the form ±(2e_i − e_j − e_k), where 1 ≤ i ≤ n − 1 and i, j, k are pairwise distinct.

Example 8.35 (Singular). The Groebner basis for the case n = 6 is given by the following computation.

> ring r = 0, (y(1..3),x(0..6)), dp;

> ideal i = x(0)-y(1)*y(2), x(1)-y(1)*y(2)*y(3), x(2)-y(1)*y(2)*y(3)^2,

x(3)-y(1)*y(2)*y(3)^3, x(4)-y(1)*y(2)*y(3)^4, x(5)-y(1)*y(2)*y(3)^5,

x(6)-y(1)*y(2)*y(3)^6;

> ideal j = std(i);

> eliminate( j, y(1)*y(2)*y(3) );

_[1]=x(5)^2-x(4)*x(6)

_[2]=x(4)*x(5)-x(3)*x(6)

_[3]=x(3)*x(5)-x(2)*x(6)

_[4]=x(2)*x(5)-x(1)*x(6)

_[5]=x(1)*x(5)-x(0)*x(6)

_[6]=x(4)^2-x(2)*x(6)

_[7]=x(3)*x(4)-x(1)*x(6)

_[8]=x(2)*x(4)-x(0)*x(6)

_[9]=x(1)*x(4)-x(0)*x(5)

_[10]=x(3)^2-x(0)*x(6)

_[11]=x(2)*x(3)-x(0)*x(5)

_[12]=x(1)*x(3)-x(0)*x(4)

_[13]=x(2)^2-x(0)*x(4)

_[14]=x(1)*x(2)-x(0)*x(3)

_[15]=x(1)^2-x(0)*x(2)

A Computational Statistics in R

Computational statistics is a fast growing area in statistical research and applications. This supplementary chapter provides a basic introduction to computational statistics using the statistical language R. It encompasses descriptive statistics, important discrete and continuous distributions, the method of moments, and maximum likelihood estimation. Further topics of computational statistics are treated in Chapter 8.

A.1 Descriptive Statistics

Descriptive statistics is a field of mathematical statistics that concentrates on the description of the main features of a collection of sample data.

Univariate data analysis focusses on the description of the distribution of a single random variable, including central tendencies like mean, median, and mode, dispersion like range and quantiles, measures of spread like variance and standard deviation, and measures of shape like skewness and kurtosis. The characteristics of the distribution of a random variable are often described in graphical or tabular format, including plots, histograms, and stem-and-leaf displays.

The (sample) mean of real-valued data x_1, ..., x_n is defined as

x̄ = (1/n) ∑_{i=1}^{n} x_i.

The following R code shows the display of discrete data and the calculation of the mean (Fig. A.1).

> data <- c(45,50,55,75)

> names(data) <- c("IIW","ET","TM","CS")

> data

IIW ET TM CS

45 50 55 75

> total <- sum(data); total

[1] 225

> mean( data )

212 A Computational Statistics in R

[1] 56.25

> relative <- data / total; round(relative, 2)

IIW ET TM CS

0.20 0.22 0.24 0.33

> percent <- relative * 100; round(percent, 1)

IIW ET TM CS

20.0 22.2 24.4 33.3

> pie(data)

> barplot(data)

Fig. A.1. Pie plot and barplot.

The standard deviation of a set of real-valued data x_1, ..., x_n is defined as

s = sqrt( ∑_{i=1}^{n} (x_i − x̄)^2 / (n − 1) ).

The square of the standard deviation s is the variance s^2, which can also be computed as

s^2 = 1/(2n(n − 1)) ∑_{i} ∑_{j} (x_i − x_j)^2.
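The pairwise-difference form of the variance can be checked numerically with a short sketch (any sample will do):

> x <- c(25.1, 17.7, 35.5, 27.7, 28.2)
> n <- length(x)
> var(x)                                          # built-in sample variance
> sum( outer(x, x, "-")^2 ) / (2 * n * (n - 1))   # pairwise-difference form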

The following R code shows the computation of the mean, standard deviation, and variance of a data set.

# Body mass index (bmi) of 10 persons

> bmi <- c(25.1, 17.7, 35.5, 27.7, 28.2, 22.5, 24.3, 27.9, 21.2, 20.5)

> n <- length(bmi); n

[1] 10

> sum(bmi)

A.1 Descriptive Statistics 213

[1] 250.6

> mean(bmi) # mean

[1] 25.06

> mean(bmi, trim=0.1) # trimmed mean by 10%

[1] 24.675

> sd(bmi) # standard deviation

[1] 5.064956

> var(bmi) # variance

[1] 25.65378

The trimmed mean is obtained from the sample mean by excluding some of the extreme values.

Quantiles are cutpoints that divide a sample set into equally sized groups. Suppose the sample data x_1, x_2, ..., x_n are ordered such that x_(1) ≤ x_(2) ≤ ... ≤ x_(n). The median (or 2-quantile) of the data set is the middle element in the ordering if the number of data points is odd. Otherwise, the median is the mean of the two middle values in the ordering. That is,

x̃ = x_(k)                   if n is odd, k = (n + 1)/2,
    (x_(k) + x_(k+1)) / 2    otherwise, k = n/2.

More generally, for any number 0 < α < 1, the α-quantile is a value that cuts off the fraction α of the ordered data. The 4-quantiles are the quartiles given by α = 1/4, 2/4, 3/4 and the 10-quantiles are the deciles defined by α = k/10 for k = 1, ..., 9. That is,

x̃_α = x_(k)                   if n · α is not an integer, k = ⌈n · α⌉,
      (x_(k) + x_(k+1)) / 2    otherwise, k = n · α.

If the cumulative distribution function F of a random variable is known, the q-quantiles are given by the preimages of 1/q, 2/q, ..., (q − 1)/q under F. The following code illustrates the computation of the quartiles in R (Fig. A.2).

# Pain perception before and after therapy on a scale from 1 (low) to 10 (high)

> before <- c(3, 4, 4, 7, 9, 8, 4, 4, 7, 9)

> after <- c(2, 2, 4, 5, 6, 7, 5, 4, 2, 1)

> before; sort(before)

[1] 3 4 4 7 9 8 4 4 7 9

[1] 3 4 4 4 4 7 7 8 9 9

> after; sort(after)

[1] 2 2 4 5 6 7 5 4 2 1

[1] 1 2 2 2 4 4 5 5 6 7

> median(before); median(after)

[1] 5.5 4

> quantile(before, c(0.25,0.5,0.75))

25% 50% 75%

4.00 5.50 7.75

> quantile(after,c(0.25,0.5,0.75))

25% 50% 75%

2 4 5

> boxplot(before,after,names=c("before","after"))

214 A Computational Statistics in R

Fig. A.2. Boxplot.

A histogram is a graphical representation of the distribution of numerical data. An alternative representation is a stem-and-leaf plot. It contains two columns separated by a vertical line. The left column contains the stem and the right one the leaves. A histogram and a stem-and-leaf plot are given by the following R code (Fig. A.3).

# Body mass index (bmi) of 10 persons

> hist(bmi, c(15,17.5,20,22.5,25,27.5,30,32.5,35,37.5,40))

> stem(bmi)

The decimal point is 1 digit(s) to the right of the |

1 | 8

2 | 1134

2 | 5888

3 |

3 | 6

Bivariate analysis involves the distribution of two random variables. The relationship between pairsof variables can be described among others by scatterplots and quantitative measures of dependence.A dot plot in R can be obtained as follows (Fig. A.4).

# age versus height of teenagers in African countries

> age <- c(10:19); age

[1] 10 11 12 13 14 15 16 17 18 19

> height <- c(133, 139, 147, 154, 165, 170, 175, 177, 180, 182); height

[1] 133 139 147 154 165 170 175 177 180 182

> plot(age, height, xlim=c(10,19), ylim=c(130,190))

A.1 Descriptive Statistics 215

Fig. A.3. Histogram.

Fig. A.4. Age versus height plot.

Let x_1, ..., x_n and y_1, ..., y_n be real-valued sample data from two random variables. The covariance is a measure of how much two random variables change together and is defined as

s_xy = ∑_{i=1}^{n} (x_i − x̄)(y_i − ȳ) / (n − 1).

If the larger values of one variable correspond to the larger values of the other variable and the same holds for the smaller values, the covariance is positive. In the opposite case, the covariance is negative.

The correlation coefficient is a measure of the linear correlation between two random variables and is given by

r = s_xy / (s_x s_y) = ∑_{i=1}^{n} (x_i − x̄)(y_i − ȳ) / sqrt( ∑_{i=1}^{n} (x_i − x̄)^2 · ∑_{i=1}^{n} (y_i − ȳ)^2 ).

The correlation coefficient gives a value between +1 and −1, where +1 is total positive correlation, 0 is no correlation, and −1 is total negative correlation.

# Economic growth in %

> x <- c( 2.1,2.5,4.0,3.5 )

# rate of return in %

> y <- c( 8,12,14,10 )

> cov( x, y ) # covariance

[1] 1.533333

> cor( x, y ) # correlation coefficient

[1] 0.6625739

Linear regression is used in statistics for modeling the relationship between an explanatory (in-dependent) random variable and a scalar dependent variable. The following example shows a lineardependence between two data sets introduced above (Fig. A.5).

> lm(height ~ age) # linear model

Coefficients:

(Intercept) age

79.067 5.733

The linear prediction function is given by f(x) = 79.067 + 5.733 · x.

Fig. A.5. Linear regression.

A.2 Random Variables and Probability 217

A.2 Random Variables and Probability

A random variable is a variable whose values are subject to change due to randomness in a mathematical sense. A random variable is discrete if it can take on values from a finite or countable set, and a random variable is continuous if it can take on numerical values from an interval or a collection of intervals.

The cumulative distribution function (cdf) of a random variable X is F_X defined as

F_X(x) = P(X ≤ x),  x ∈ R,

where P denotes the probability of the argument. The subscript of F_X is omitted if it is clear from the context. The cdf of a random variable X has the following properties:

• F_X is non-decreasing.
• F_X is right-continuous; i.e., lim_{ε→0+} F_X(x + ε) = F_X(x), x ∈ R.
• F_X has the limiting values lim_{x→−∞} F_X(x) = 0 and lim_{x→∞} F_X(x) = 1.

A random variable X is continuous if the cdf F_X is a continuous function. A random variable X is discrete if the cdf F_X is a step function. A discrete cdf is given by a probability mass function (pmf) p_X(x) = P(X = x). The discontinuities in the cdf are the points where the pmf is positive, and p_X(x) = F_X(x) − F_X(x−).

If a random variable X is discrete, the cdf of X is given by

F_X(x) = P(X ≤ x) = ∑_{y ≤ x, p_X(y) > 0} p_X(y).

For a continuous random variable X, the probability density function (pdf) of X is f_X(x) = F'_X(x) for all x ∈ R if F_X is differentiable. In this case, by the fundamental theorem of calculus,

F_X(x) = P(X ≤ x) = ∫_{−∞}^{x} f_X(t) dt.

The joint density of continuous random variables X and Y is f_{X,Y} and the cdf of the pair (X, Y) is

F_{X,Y}(x, y) = P(X ≤ x, Y ≤ y) = ∫_{−∞}^{y} ∫_{−∞}^{x} f_{X,Y}(s, t) ds dt.

The marginal probability densities of X and Y are given as

f_X(x) = ∫_{−∞}^{∞} f_{X,Y}(x, y) dy   and   f_Y(y) = ∫_{−∞}^{∞} f_{X,Y}(x, y) dx.

The formulae for discrete random variables are defined analogously, with integrals replaced by sums. In the following, f_X denotes both the pdf of X if X is continuous and the pmf of X if X is discrete.

The mean of a random variable X is the mathematical expectation (or expected value) of the variable and is denoted by E[X]. If X is continuous with pdf f_X, the mean of X is

E[X] = ∫_{−∞}^{∞} x f_X(x) dx.

If X is discrete with pmf f_X, the mean of X is

E[X] = ∑_{x: f_X(x) > 0} x f_X(x).

We assume that E[X] is finite whenever E[X] appears in a formula. The mathematical expectation of a function g(X) of a continuous random variable X with pdf f_X is

E[g(X)] = ∫_{−∞}^{∞} g(x) f_X(x) dx.

Thus E[X] is the mathematical expectation of the identity function on R. The value µ_X = E[X] is the first moment of X. For any integer r ≥ 1, the r-th moment of X is E[X^r]. Thus if X is continuous, then

E[X^r] = ∫_{−∞}^{∞} x^r f_X(x) dx.

The variance of a random variable X is the second central moment given by

Var(X) = E[(X − E[X])^2].

Since E[(X − E[X])^2] = E[X^2] − (E[X])^2, we have

Var(X) = E[X^2] − (E[X])^2 = E[X^2] − µ_X^2.

The variance is also denoted by σ_X^2. The square root of the variance is the standard deviation and the reciprocal of the variance is the precision.

The mathematical expectation of the product of two continuous random variables X and Y with joint density f_{X,Y} is

E[XY] = ∫_{−∞}^{∞} ∫_{−∞}^{∞} x y f_{X,Y}(x, y) dx dy.

The covariance of X and Y is

Cov(X, Y) = E[(X − µ_X)(Y − µ_Y)] = E[XY] − E[X]E[Y] = E[XY] − µ_X µ_Y.

The covariance of X and Y is also denoted by σ_{X,Y}. In particular, we have Cov(X, X) = Var(X). The correlation between two continuous random variables X and Y is

ρ(X, Y) = Cov(X, Y) / sqrt( Var(X) Var(Y) ) = σ_{X,Y} / (σ_X σ_Y).

Two random variables X and Y are uncorrelated if ρ(X, Y) = 0. Two random variables X and Y are independent if

f_{X,Y}(x, y) = f_X(x) f_Y(y),  x, y ∈ R,

or equivalently,

F_{X,Y}(x, y) = F_X(x) F_Y(y),  x, y ∈ R.

More generally, the random variables X_1, ..., X_n are independent if the joint pdf f of X_1, ..., X_n equals the product of the marginal densities; i.e.,

f(x_1, ..., x_n) = ∏_{i=1}^{n} f_{X_i}(x_i),  (x_1, ..., x_n) ∈ R^n.

If two random variables X and Y are independent, then Cov(X, Y) = 0 and thus ρ(X, Y) = 0. The converse is generally false. However, if X and Y are jointly normally distributed with Cov(X, Y) = 0, then X and Y are independent.

The random variables X_1, ..., X_n are a random sample from a distribution F_X if X_1, ..., X_n are independent and identically distributed with distribution F_X. Thus the joint pdf of X_1, ..., X_n is

f(x_1, ..., x_n) = ∏_{i=1}^{n} f_X(x_i),  (x_1, ..., x_n) ∈ R^n.

Proposition A.1. Let X and Y be random variables and a, b ∈ R.

1. E[aX + bY] = aE[X] + bE[Y].
2. E[aX + b] = aE[X] + b.
3. If X and Y are independent, then E[XY] = E[X]E[Y].
4. Var(aX + b) = a^2 Var(X).
5. Var(X + Y) = Var(X) + Var(Y) + 2Cov(X, Y).
6. If X and Y are independent, then Var(X + Y) = Var(X) + Var(Y).

Corollary A.2. If X_1, ..., X_n are independent and identically distributed random variables, then

E[X_1 + ... + X_n] = nµ_X   and   Var(X_1 + ... + X_n) = nσ_X^2.

It follows that the sample mean

X̄ = (1/n) ∑_{i=1}^{n} X_i

has the expected value µ_X and the variance σ_X^2 / n.

The conditional probability of an event A given that the event B has taken place is

P(A | B) = P(AB) / P(B),

where AB = A ∩ B is the intersection of the events A and B. Two events A and B are independent if P(AB) = P(A)P(B). In this case, P(A | B) = P(A).

Let X and Y be random variables with joint density f_{X,Y}. Then the conditional density of X given Y = y, y ∈ R, is

f_{X|Y=y}(x) = f_{X,Y}(x, y) / f_Y(y),  x ∈ R.

In the same way, the conditional density of Y given X = x, x ∈ R, is

f_{Y|X=x}(y) = f_{X,Y}(x, y) / f_X(x),  y ∈ R.

Thus the joint density f_{X,Y} has the form

f_{X,Y}(x, y) = f_{X|Y=y}(x) f_Y(y) = f_{Y|X=x}(y) f_X(x),  x, y ∈ R.

The conditional expected value of X given Y = y, y ∈ R, is

E[X | Y = y] = ∫_{−∞}^{∞} x f_{X|Y=y}(x) dx,

where f_{X|Y=y} is assumed to be continuous.

Proposition A.3. Let X and Y be random variables.

1. Conditional expectation rule: E[X] = E[E[X | Y]].
2. Conditional variance formula: Var(X) = E[Var(X | Y)] + Var(E[X | Y]).

A.3 Some Discrete Distributions

The most important discrete distributions are counting distributions, which are used to model the frequencies or waiting times of events.

In the statistical language R, the probability mass functions (pmf) or densities (pdf), cumulative distribution functions (cdf), quantile functions, and random number generators of many commonly used probability distributions are made available. The first letter of the function name always denotes the function type:

d  density function
p  cumulative distribution function
q  quantile function
r  random number generator

In a uniform distribution, there is a finite number of values that are equally likely to be observed. A uniformly distributed random variable X with state set [m] has the pmf

P(X = x) = 1/m,  x ∈ [m].

The cumulative distribution function is the step function

F_X(x) = x/m,  x ∈ [m].

The mean and variance of X are respectively given by

E[X] = (m + 1)/2   and   Var(X) = (m^2 − 1)/12.

The uniform distribution in R can be analyzed by the functions ddiscrete, pdiscrete, qdiscrete, and rdiscrete. This requires loading the library e1071.

Several important discrete distributions can be formulated in terms of Bernoulli trials. A Bernoulli experiment has two possible outcomes, success (1) or failure (0). A Bernoulli random variable X has the pmf

P(X = 1) = p   and   P(X = 0) = 1 − p,

where p is the probability of success. The mean and variance of X are respectively given by

E[X] = p   and   Var(X) = p(1 − p).

Let X be a random variable that counts the number of successes in n independent, identically distributed Bernoulli trials with success probability p. Then X has the binomial distribution with parameters n and p if

P(X = x) = \binom{n}{x} p^x (1 − p)^{n−x},  0 ≤ x ≤ n.

Since the binomial variable X is the sum of n independent, identically distributed Bernoulli variables, we have

E[X] = np   and   Var(X) = np(1 − p).

The following R code provides three binomial distributions (Fig. A.6).

> f1 = dbinom ( 0:6, 6, 0.25 ) # pmf

> f2 = dbinom ( 0:6, 6, 0.50 )

> f3 = dbinom ( 0:6, 6, 0.75 )

> matplot( 0:6, cbind(f1,f2,f3), type="l", pch=c("red","black","green") )

> F1 = pbinom ( 0:6, 6, 0.25 ) # cdf

> F2 = pbinom ( 0:6, 6, 0.50 )

> F3 = pbinom ( 0:6, 6, 0.75 )

> matplot( 0:6, cbind(F1,F2,F3), type="l", pch=c("red","black","green") )

Example A.4 (R). Consider the treatment of inpatients with a specific drug. Suppose the probability of successful treatment is p = 0.75. Then the probability that sixteen out of twenty inpatients are successfully treated is

P = \binom{20}{16} 0.75^{16} 0.25^{4} = 0.1896855.

This can be computed using R as follows,

> p <- 0.75

> choose(20,16) * p^16 * (1-p)^4

[1] 0.1896855

> dbinom( 16, 20, p )

[1] 0.1896855

The following code provides the probability of successful treatment of n = 20 inpatients (Fig. A.7).

222 A Computational Statistics in R

Fig. A.6. Binomial distribution.

> n <- 20

> p <- 0.75

> P <- rep( NA, n )

> for (i in 1:n) P[i] <- dbinom( i, n, p )

> plot( 1:n, P, ylab="binom(pmf)" )

> Q <- rep( NA, n )

> for (i in 1:n) Q[i] <- pbinom( i, n, p )

> plot( 1:n, Q, type="l" )

Fig. A.7. Successful treatment of twenty inpatients: pdf and cdf.

♦

The multinomial distribution generalizes the binomial distribution. To see this, consider k mutually exclusive and exhaustive events A_1, ..., A_k which can happen at any trial of the experiment, where each event occurs with probability p(A_i) = p_i, 1 ≤ i ≤ k. Then p_1 + ... + p_k = 1. Let X_i be a random variable which counts the number of events A_i in a sequence of n independent and identical trials. Then the random vector X = (X_1, ..., X_k) has the multinomial distribution with joint pdf

f(x_1, ..., x_k) = \binom{n}{x_1, ..., x_k} p_1^{x_1} ··· p_k^{x_k},

where x_1 + ... + x_k = n.

For each random variable X_i, we have E[X_i] = np_i and Var(X_i) = np_i(1 − p_i). Moreover, the covariance and the correlation of the random variables X_i and X_j for i ≠ j are respectively

Cov(X_i, X_j) = −n p_i p_j   and   ρ(X_i, X_j) = −sqrt( p_i p_j / ((1 − p_i)(1 − p_j)) ).

The following example shows the use of the multinomial distribution in R.

> n <- 4

> p <- c(0.4,0.3,0.3)

> m <- c(0,0,4, 0,1,3, 0,2,2, 0,3,1, 0,4,0,

+ 1,0,3, 1,1,2, 1,2,1, 1,3,0, 2,0,2,

+ 2,1,1, 2,2,0, 3,0,1, 3,1,0, 4,0,0)

> M <- matrix( m, nrow=3 )

> ncol(M)

[1] 15

> M

     [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11] [,12] [,13] [,14] [,15]
[1,]    0    0    0    0    0    1    1    1    1     2     2     2     3     3     4
[2,]    0    1    2    3    4    0    1    2    3     0     1     2     0     1     0
[3,]    4    3    2    1    0    3    2    1    0     2     1     0     1     0     0

> P <- rep( NA, ncol(M) )

> for (i in 1:ncol(M)) P[i] <- dmultinom( M[,i], prob=p ) # pmf

> plot( 1:ncol(M), P, ylab="multinom(pmf)" )

Example A.5 (R). Consider a box with 100 colored beads of which 50 are red, 30 are green, and 20 are yellow. How large is the probability to draw 10 beads (with replacement) such that 4 are red, 3 are green, and 3 are yellow? The probabilities to draw one red, one green, and one yellow bead are p_1 = 0.5, p_2 = 0.3, and p_3 = 0.2, respectively. Thus the probability to draw 4 red beads, 3 green beads, and 3 yellow beads is

P = \binom{10}{4, 3, 3} 0.5^4 0.3^3 0.2^3 = 0.0567.

This can be computed using R as follows,

> dmultinom( c(4,3,3), prob=c(0.5,0.3,0.2) )

[1] 0.0567

The geometric distribution is a variant of the binomial distribution. To see this, take a sequence of Bernoulli trials with success probability p. Let X be a random variable counting the number of trials until the first success happens. Then we have

P(X = x) = p(1 − p)^{x−1},  x ≥ 1.

A random variable X with this pmf is geometrically distributed. If X has the geometric distribution with success probability p, the cdf of X is

F_X(x) = P(X ≤ x) = 1 − (1 − p)^x,  x ≥ 1.

A geometrically distributed random variable X has mean and variance

E[X] = 1/p   and   Var(X) = (1 − p)/p^2.

Note that a geometrically distributed random variable X satisfies the relation

P(X = n + k | X > n) = P(X = n + k) / P(X > n) = p(1 − p)^{n+k−1} / (1 − p)^n = p(1 − p)^{k−1} = P(X = k),  k ≥ 1.

Thus when X is interpreted as the waiting time for an event to occur, the above property exhibits the memorylessness of the geometric distribution.
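A quick numerical check of the memoryless property (a sketch in the trial-count convention; R's dgeom counts failures, so the pmf of the trial count X is dgeom(x-1, p)):

> p <- 1/6; n <- 3; k <- 4
> dgeom( n + k - 1, p ) / (1 - p)^n     # P(X = n+k | X > n)
> dgeom( k - 1, p )                     # P(X = k), the same value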

Example A.6 (R). Consider the tossing of a six-sided fair die. The probability of the first occurrence of a "six" can be described by the geometric distribution. We have p = 1/6, and the probability of rolling the first "six" on the x-th trial is p(1 − p)^{x−1}. The pmf and cdf can be computed by R as follows (Fig. A.8); note that dgeom and pgeom in R count the number of failures before the first success.

> n <- 20

> y <- dgeom( 1:n, prob=1/6 )

> plot( 1:n, y, ylab="geometric(pmf)" )

> z <- pgeom( 1:n, prob=1/6 )

> plot( 1:n, z, ylab="geometric(df)" )

♦

The negative binomial distribution is a generalization of the geometric distribution in the sense that one is interested in the number of failures until the r-th success. Let X be a random variable that counts the number of failures until the r-th success. If X = x failures occur before the r-th success, the r-th success happens at the (x + r)-th trial and in the first x + r − 1 trials there are r − 1 successes and x failures. This takes place in \binom{x+r−1}{r−1} = \binom{x+r−1}{x} different ways and each case has probability p^r (1 − p)^x. Thus the random variable X has the pmf

P(X = x) = \binom{x + r − 1}{r − 1} p^r (1 − p)^x,  x ≥ 0.


Fig. A.8. Probability of success after a number of failures: pdf and cdf.

A random variable X has the negative binomial distribution with parameters r and p if

P(X = x) = Γ(x + r) / (Γ(r) Γ(x + 1)) · p^r (1 − p)^x,  x ≥ 0,

where Γ denotes the complete gamma function defined as

Γ(r) = ∫_{0}^{∞} t^{r−1} e^{−t} dt,  r ≠ 0, −1, −2, ....

Note that Γ(n) = (n − 1)! for each integer n ≥ 1. It follows that both probabilities are identical when r ≥ 1 is an integer. Let X be a random variable with a negative binomial distribution with parameters r and p. Then X + r, the total number of trials, is the sum of r independent, identically distributed geometric random variables with parameter p. Thus the mean and variance of X are respectively given by

E[X] = r(1 − p)/p   and   Var(X) = r(1 − p)/p^2.

Example A.7 (R). Consider the number of failures in a lottery until the third success if the probability of success is p = 0.25. For instance, the probability of at most n draws until the third success, that is, of at most n − 3 failures, is given by

∑_{i=0}^{n−3} \binom{i + 3 − 1}{i} 0.25^3 0.75^i.

The following code provides the probability distribution for at most n = 20 draws until the third success (Fig. A.9).

> n <- 20

> p <- 0.25

226 A Computational Statistics in R

> P <- rep( NA, 20)

> for (i in 1:n) P[i] <- dnbinom( i, 3, p )

> plot( 1:n, P, ylab="negbinom(pmf)" )

> Q <- rep( NA, 20)

> for (i in 1:n) Q[i] <- pnbinom( i, 3, p)

> plot( 1:n, Q, ylab="negbinom(cdf)" )

Fig. A.9. Probability of the third success after a number of failures: pdf and cdf.

♦

The hypergeometric distribution describes the probability of x successes in n draws without replacement from a finite population of size N which contains exactly K successes. In view of this setting, a random variable X follows the hypergeometric distribution if the pmf is given by

P(X = x) = \binom{K}{x} \binom{N − K}{n − x} / \binom{N}{n}.

The pmf is positive if max(0, n + K − N) ≤ x ≤ min(K, n). The mean and variance of a hypergeometrically distributed random variable X are respectively given as

E[X] = nK/N = np   and   Var(X) = np(1 − p)(N − n)/(N − 1).

Example A.8 (R). Consider a collection of 100 items which contains 5% defective items. How large is the probability of drawing 50 items of which i items are defective? The probabilities are given by a hypergeometric distribution which can be computed in R as follows (Fig. A.10).

# call: dhyper(x, K, N-K, n)

> P <- rep( NA, 5 )

> for (i in 1:5) P[i] <- dhyper( 50-i, 95, 5, 50 )

> plot( 1:5, P, xlab="defective items", ylab="hypergeom(pmf)" )

Fig. A.10. Hypergeometric distribution.

♦

The Poisson distribution is another important discrete distribution that is applicable to systems with a large number of possible events, each of which is very rare. A random variable X has the Poisson distribution with parameter λ > 0 if the pmf of X is given by

P(X = x) = e^{−λ} λ^x / x!,  x ≥ 0.

It expresses the probability of a given number of events x occurring in a fixed interval of time or space if these events happen with a known average rate and independently of the time since the last event. Examples are the number of phone calls received by a call center per hour and the number of decay events per second from a radioactive source. The pmf follows the recursion p(x + 1) = p(x) · λ/(x + 1). If X is a random variable with the Poisson distribution with parameter λ > 0, the mean and the variance of X are respectively given as

E[X] = λ   and   Var(X) = λ.

The use of the poisson distribution in R is exemplified by the following code (Fig. A.11).

> f1 <- dpois( 1:20, 1 ) # lambda = 1

> f2 <- dpois( 1:20, 5 )

> f3 <- dpois( 1:20, 10 )

> matplot( 1:20, cbind(f1,f2,f3), type="l", col=c("red","green","black") )

> F1 <- ppois( 1:20, 1 ) # lambda = 1


> F2 <- ppois( 1:20, 5 )

> F3 <- ppois( 1:20, 10 )

> matplot( 1:20, cbind(F1,F2,F3), type="l", col=c("red","green","black") )

Fig. A.11. Poisson distribution.

Note that for large n, small p, and fixed λ = np (i.e., n → ∞, p → 0, and np → λ), the binomial distribution converges to the Poisson distribution with parameter λ; the short check below illustrates this approximation.
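
A quick numerical check of the approximation can be done in R; the values n = 1000 and p = 0.005 below are chosen purely for illustration.

> n <- 1000; p <- 0.005; lambda <- n*p
> x <- 0:15
> # maximum absolute difference between binomial and Poisson probabilities is close to zero
> max( abs( dbinom( x, n, p ) - dpois( x, lambda ) ) )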

Example A.9 (R). Suppose the probability that a person suffers from a drug intolerance is p = 0.001. How large is the probability that x persons out of n = 2000 persons suffer from the drug intolerance? Since n is large and p is small, the Poisson approximation with λ = n · p = 2000 · 0.001 = 2 applies, and the probability that x persons suffer from the drug intolerance is approximately
\[ P(x) = \frac{\lambda^x e^{-\lambda}}{x!} = \frac{2^x e^{-2}}{x!}. \]

This quantity can be computed by the following R code (Fig. A.12).

> l <- 2

> P <- rep( NA, 11 )

> for (i in 0:10) P[i+1] <- dpois( i, l )

> plot( 0:10, P, type="l", xlab="x", ylab="P(x)" )

A.4 Some Continuous Distributions

The most basic continuous distribution is the uniform distribution, in which all elements of an interval are equally probable. The pdf of the uniform distribution for the real-valued interval (a, b) is defined by


Fig. A.12. Probability of drug intolerance.

\[ f(x) = \begin{cases} \frac{1}{b-a} & \text{if } a < x < b, \\ 0 & \text{otherwise.} \end{cases} \]

The cdf of the pdf f is given as
\[ F(x) = \begin{cases} 0 & \text{if } x \le a, \\ \frac{x-a}{b-a} & \text{if } a < x < b, \\ 1 & \text{if } x \ge b. \end{cases} \]

The respective mean and variance are
\[ \mu = \frac{a+b}{2} \quad\text{and}\quad \sigma^2 = \frac{(b-a)^2}{12}. \]
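
A minimal R sketch, with the interval (0, 2) chosen purely for illustration, compares simulated uniform draws with the formulas above.

> a <- 0; b <- 2
> u <- runif( 10000, min=a, max=b )
> mean(u); (a+b)/2          # sample mean vs. (a+b)/2
> var(u); (b-a)^2/12        # sample variance vs. (b-a)^2/12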

The normal distribution is one of the most important continuous distributions in statistics. It is often used to represent real-valued random variables whose distributions are unknown. The pdf of the normal distribution with mean µ and standard deviation σ > 0, denoted N(µ, σ²), is defined as

\[ f(x) = \frac{1}{\sigma\sqrt{2\pi}} \exp\left[ -\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^2 \right], \quad x \in \mathbb{R}. \]

If µ = 0 and σ = 1, the distribution is the standard normal distribution, denoted by N(0, 1), and is given by the standard normal pdf
\[ f(x) = \frac{1}{\sqrt{2\pi}} \exp\left[ -\frac{1}{2}x^2 \right], \quad x \in \mathbb{R}. \]

The cdf of the standard normal distribution is the integral

\[ \Phi(x) = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{x} e^{-t^2/2}\,dt, \quad x \in \mathbb{R}. \]


Moreover, for a generic normal distribution given by a pdf f with mean µ and standard deviation σ, the cdf is given by
\[ F(x) = \Phi\left(\frac{x-\mu}{\sigma}\right). \]

The pdf of a normal distribution can be drawn as follows (Fig. A.13).

> mu <- 50

> sig <- 6

> low <- mu-3.5*sig; upp <- mu+3.5*sig

> x <- seq( low, upp, by=0.1 )

> f <- dnorm( x, mean=mu, sd=sig )

> plot( x, f, type="l", xlim=c(low,upp) )

Fig. A.13. Normal distribution.

The normal distribution has several key properties. Let X be a random variable with normal distribution N(µ, σ²) and let a, b ∈ R. Then the linear transformation Y = aX + b gives a random variable which is N(aµ + b, a²σ²). In particular, if X is N(µ, σ²), then Z = (X − µ)/σ is N(0, 1). Moreover, the standard normal distribution has the symmetry property
\[ \Phi(-z) = P(Z \le -z) = P(Z \ge z) = 1 - P(Z \le z) = 1 - \Phi(z). \]

Example A.10 (R). Suppose the fasting blood sugar (mg/dl) is given by a normal random variable X with mean µ = 100 and standard deviation σ = 10. How large is the probability that a randomly chosen person has fasting blood sugar (a) at most 70 mg/dl, (b) between 90 and 120 mg/dl, or (c) larger than 140 mg/dl? In view of (a), we have

P(X ≤ 70) = P(Z ≤ −3) = 0.001349898.


In view of (b), we have

P(90 ≤ X ≤ 120) = P(−1 ≤ Z ≤ 2) = P(Z ≤ 2) − P(Z ≤ −1) = 0.8185946.

In view of (c), we have

P(X > 140) = P(Z > 4) = 1 − P(Z ≤ 4) = 3.167124e−05.

The calculation in R is as follows.

> pnorm( 70, mean=100, sd=10 )

[1] 0.001349898

> pnorm( 140, mean=100, sd=10, lower.tail=FALSE )

[1] 3.167124e-05

> pnorm( 120, mean=100, sd=10 ) - pnorm( 90, mean=100, sd=10 )

[1] 0.8185946

♦

Let X1, . . . , Xn be independent normal random variables, where Xi is N(µi, σi²), 1 ≤ i ≤ n, and let a1, . . . , an ∈ R. Then the random variable given by the linear combination Y = a1X1 + . . . + anXn has a normal distribution with mean and variance respectively given by
\[ \mu = a_1\mu_1 + \ldots + a_n\mu_n \quad\text{and}\quad \sigma^2 = a_1^2\sigma_1^2 + \ldots + a_n^2\sigma_n^2. \]

In particular, if X1, . . . , Xn are independent and identically distributed random variables which are N(µ, σ²), the random variable given by the sum X = X1 + . . . + Xn is N(nµ, nσ²). The usefulness of the normal distribution comes from the following result.

Theorem A.11 (Central Limit Theorem). If the random variables X1, . . . , Xn are independent and identically distributed with mean µ and variance σ², the limiting distribution of the random variables
\[ Z_n = \frac{\sum_{i=1}^n X_i - n\mu}{\sigma\sqrt{n}} \]
as n becomes large is the standard normal distribution.
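
A minimal simulation sketch of the central limit theorem in R; the exponential distribution and the sample size n = 50 are chosen here only for illustration.

> set.seed(1)
> n <- 50
> # 2000 replications of Z_n for exponential(1) samples (mu = 1, sigma = 1)
> z <- replicate( 2000, ( sum( rexp(n, rate=1) ) - n*1 ) / ( 1*sqrt(n) ) )
> hist( z, freq=FALSE, breaks=30 )
> curve( dnorm(x), add=TRUE )   # standard normal density for comparison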

Continuous random variables X1, . . . , Xd have a multivariate (or d-variate) normal distribution, abbreviated Nd(µ, Σ), if the joint pdf is given by
\[ f(x_1, \ldots, x_d) = \frac{1}{(2\pi)^{d/2}|\Sigma|^{1/2}} \exp\left[ -\frac{1}{2}(x-\mu)^t \Sigma^{-1} (x-\mu) \right], \]
where Σ = (σij) is the d × d nonsingular covariance matrix of X1, . . . , Xd, µ = (µ1, . . . , µd)^t is the vector of means, and x = (x1, . . . , xd)^t ∈ R^d. Note that the one-dimensional marginal distributions are normal with mean µi and variance σi², 1 ≤ i ≤ d. The normal random variables X1, . . . , Xd are independent if and only if the covariance matrix Σ is a diagonal matrix. A linear transformation of a multivariate normal random vector X = (X1, . . . , Xd)^t is again multivariate normal. More specifically, if A is an l × d real-valued matrix and b = (b1, . . . , bl)^t ∈ R^l, then Y = AX + b has an l-variate normal distribution with mean vector Aµ + b and covariance matrix AΣA^t.


The exponential distribution is a continuous probability distribution that describes the time between events in a Poisson process. This is a stochastic process in which events occur continuously and independently at a constant average rate. It is the continuous analogue of the geometric distribution, and its key property is memorylessness. The pdf of a random variable X which is exponentially distributed with rate parameter λ is given by
\[ f(x; \lambda) = \begin{cases} \lambda e^{-\lambda x} & x \ge 0, \\ 0 & x < 0. \end{cases} \]

The inverse parameter T = 1/λ is referred to as the characteristic lifetime, or the mean time between failures. The corresponding cdf is given as
\[ F(x; \lambda) = \begin{cases} 1 - e^{-\lambda x} & x \ge 0, \\ 0 & x < 0. \end{cases} \]

A random variable X that is exponentially distributed with rate parameter λ has respectively the mean and variance
\[ E[X] = \frac{1}{\lambda} \quad\text{and}\quad \operatorname{Var}(X) = \frac{1}{\lambda^2}. \]

The use of exponentially distributed random variables is shown by the following R code (Fig. A.14).

> x <- seq( 0, 20, by=0.1 )

> e1 <- dexp( x, rate=1 ) # lambda = 1

> e2 <- dexp( x, rate=5 )

> e3 <- dexp( x, rate=10 )

> matplot( x, cbind(e1,e2,e3), type="l", col = c("red","black","green") )

> E1 <- pexp( x, rate=1 )

> E2 <- pexp( x, rate=5 )

> E3 <- pexp( x, rate=10 )

> matplot( x, cbind(E1,E2,E3), type="l", col = c("red","black","green") )

Note that an exponentially distributed random variable X satisfies the relation
\[ P(X > s+t \mid X > s) = \frac{P(X > s+t)}{P(X > s)} = \frac{e^{-\lambda(s+t)}}{e^{-\lambda s}} = e^{-\lambda t} = P(X > t), \quad s, t \ge 0. \]

Thus, when X is interpreted as the waiting time for an event to occur relative to some initial time, the above property exhibits the memorylessness of the exponential distribution.
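
This property can be checked numerically in R; the values λ = 0.5, s = 2, and t = 3 are chosen here only for illustration.

> lambda <- 0.5; s <- 2; t <- 3
> # P(X > s+t | X > s) as a ratio of survival probabilities
> pexp( s+t, rate=lambda, lower.tail=FALSE ) / pexp( s, rate=lambda, lower.tail=FALSE )
> pexp( t, rate=lambda, lower.tail=FALSE )   # equals P(X > t)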

Example A.12. Suppose the average lifespan of a brand of light bulbs is 100 hours. How large is the probability that a randomly chosen light bulb lasts for at least 110 hours? The rate parameter is λ = 0.01 and we have

P(X > 110) = 1 − P(X ≤ 110) = 1 − (1 − e^{−110·0.01}) = e^{−1.1} = 0.3328711.

The corresponding computation in R is as follows.


Fig. A.14. Exponential distributions.

> 1 - pexp( 110, rate=0.01 )

[1] 0.3328711

♦

The gamma distribution is a two-parameter family of probability distributions. It is used to model the waiting time until the occurrence of the r-th event. The density of a random variable that is gamma-distributed with shape parameter r > 0 and rate parameter λ > 0 is given by
\[ f(x; r, \lambda) = \frac{\lambda^r}{\Gamma(r)}\, x^{r-1} e^{-\lambda x}, \quad x > 0. \]

The corresponding cdf is the regularized gamma function
\[ F(x; r, \lambda) = \int_0^x f(y; r, \lambda)\,dy = \frac{\gamma(r, \lambda x)}{\Gamma(r)}, \]

where γ is the lower incomplete gamma function. Here are some basic properties of the gamma function:

\[
\begin{aligned}
\Gamma(0.5) &= \sqrt{\pi}, \\
\Gamma(1) &= \Gamma(2) = 1, \\
\Gamma(3) &= 2, \\
\Gamma(x) &\to \infty \ \text{as } x \to \infty, \\
\Gamma(x+1) &= x\,\Gamma(x), \quad x > 0, \\
\frac{\Gamma(m+n)}{\Gamma(m)\Gamma(n+1)} &= \binom{m+n-1}{n}, \quad m, n \in \mathbb{N}.
\end{aligned}
\]

The form of the gamma distribution depends on both the shape and the rate parameter. For 0 < r ≤ 1, the density decreases monotonically, and for r > 1 the density is a shifted bell curve with f(0) = 0 and maximum at (r − 1)/λ.


If r = 1, the gamma density is the density of the exponential distribution. If λ = 1/2 and r = ν/2, where ν is a positive integer, the gamma density is the density of the chi-squared distribution with ν degrees of freedom.

The use of gamma distribution is shown by the following R code (Fig. A.15).

> x <- seq( 0, 20, by=0.1 )

> g1 <- dgamma( x, shape=0.5, rate=3 ) # lambda = 3

> g2 <- dgamma( x, shape=1, rate=3 )

> g3 <- dgamma( x, shape=10, rate=3 )

> matplot( x, cbind(g1,g2,g3), type="l", col = c("red","black","green") )

> G1 <- pgamma( x, shape=0.5, rate=3 ) # lambda = 3

> G2 <- pgamma( x, shape=1, rate=3 )

> G3 <- pgamma( x, shape=10, rate=3 )

> matplot( x, cbind(G1,G2,G3), type="l", col = c("red","black","green") )

Fig. A.15. Gamma distributions with r = 0.5, 1, and 10, and λ = 3.

The mean and variance of a random variable X which is gamma-distributed with shape parameter r > 0 and rate parameter λ > 0 are respectively
\[ E[X] = \frac{r}{\lambda} \quad\text{and}\quad \operatorname{Var}(X) = \frac{r}{\lambda^2}. \]

The parameters of a gamma distribution can be estimated by the method of moments,
\[ \hat r = \frac{n\,\bar x^2}{\sum_{i=1}^n (x_i - \bar x)^2} \quad\text{and}\quad \hat\lambda = \frac{n\,\bar x}{\sum_{i=1}^n (x_i - \bar x)^2}. \]


Example A.13 (R). Consider the durability (in hours) of ten pressure vessels under working condi-tions. The sample data are given by the vector time.

> time <- c( 12.5, 13.7, 17.2, 10.4, 11.4, 18.1, 19.0, 20.1, 12.7, 11.9 )

> n <- length(time)

[1] 10

> m <- mean(time)

[1] 14.7

> r.hat <- (n * m^2)/(sum((time-m)^2))

[1] 19.20459

> l.hat <- (n * m)/(sum((time-m)^2))

[1] 1.306434

The calculation provides a gamma distribution with estimated parameters r = 19.20459 and λ =1.306434. ♦

Finally, we consider two continuous distributions that are used for statistical testing purposes. The chi-squared distribution with ν degrees of freedom, denoted by χ²(ν), is the distribution of a sum of squares of ν independent standard normal random variables. It is one of the most widely used probability distributions in inferential statistics, both for hypothesis testing and for the construction of confidence intervals. The pdf of a χ²(ν) random variable X is
\[ f(x) = \frac{1}{\Gamma(\nu/2)\,2^{\nu/2}}\, x^{(\nu/2)-1} e^{-x/2}, \quad x > 0, \ \nu \ge 1, \]

where Γ(ν/2) denotes the gamma function, which has closed-form values for integer and half-integer arguments. The cdf of the chi-squared distribution with ν degrees of freedom is
\[ F(x; \nu) = \frac{\gamma(\tfrac{\nu}{2}, \tfrac{x}{2})}{\Gamma(\tfrac{\nu}{2})}, \]

where γ is the lower incomplete gamma function. Tables of the chi-squared cdf are widely available in all statistical packages. For instance, the 0.95-quantiles of the chi-squared distributions with degrees of freedom ν, where 1 ≤ ν ≤ 10, can be calculated in R as follows,

> for (i in 1:10) print( qchisq( 0.95, i ))

[1] 3.841459

[1] 5.991465

[1] 7.814726

[1] 9.487729

[1] 11.0705

[1] 12.59159

[1] 14.06714

[1] 15.50731

[1] 16.91898

[1] 18.30704


A χ2(ν) random variable X has respectively the mean and variance

E[X] = ν and Var(X) = 2ν.

The use of χ2(ν) random variables is shown by the following R code (Fig. A.16).

> x <- seq( 0, 20, by=0.1 )

> f1 <- dchisq( x, 2 ) # nu = 2

> f2 <- dchisq( x, 5 )

> f3 <- dchisq( x, 10 )

> matplot( x, cbind(f1,f2,f3), type="l", col = c("red","black","green") )

> F1 <- pchisq( x, 2 )

> F2 <- pchisq( x, 5 )

> F3 <- pchisq( x, 10 )

> matplot( x, cbind(F1,F2,F3), type="l", col = c("red","black","green") )

Fig. A.16. Chi-square distributions.

The sum of independent chi-squared random variables is also chi-squared distributed. More specifically, if X1, . . . , Xn are independent chi-squared random variables with degrees of freedom ν1, . . . , νn, respectively, then Z = X1 + . . . + Xn is a chi-squared random variable with ν1 + . . . + νn degrees of freedom. Consider n independent and identically distributed chi-squared random variables X1, . . . , Xn, each with ν degrees of freedom. Then the sample mean
\[ \bar X = \frac{1}{n} \sum_{i=1}^n X_i \]
is distributed according to a gamma distribution with shape nν/2 and scale 2/n. As n goes to infinity, by the central limit theorem the sample mean is approximately normal with mean ν and variance 2ν/n.
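
A short simulation sketch of the additivity property in R; the degrees of freedom 2 and 5 are chosen only for illustration.

> set.seed(1)
> z <- rchisq( 10000, df=2 ) + rchisq( 10000, df=5 )   # sum of independent chi-squares
> mean(z); var(z)                 # close to 7 and 14, as expected for chi-squared(7)
> ks.test( z, "pchisq", df=7 )    # compare against the chi-squared(7) cdf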


The Student's t distribution emerges when the mean of a normally distributed population is to be estimated in situations where the sample size is small and the population standard deviation is unknown. Let Z be an N(0, 1) random variable and V be a χ²(ν) random variable. If Z and V are independent, the random variable
\[ T = \frac{Z}{\sqrt{V/\nu}} \]
has the Student's t distribution with ν degrees of freedom, abbreviated t(ν). The density of a t(ν) random variable X is given by
\[ f(x) = \frac{\Gamma((\nu+1)/2)}{\Gamma(\nu/2)}\, \frac{1}{\sqrt{\nu\pi}}\, \frac{1}{(1 + x^2/\nu)^{(\nu+1)/2}}, \quad x \in \mathbb{R}, \ \nu \ge 1. \]

The mean and variance of a t(ν) random variable X are respectively
\[ E[X] = 0, \quad \nu > 1, \quad\text{and}\quad \operatorname{Var}(X) = \frac{\nu}{\nu - 2}, \quad \nu > 2. \]

The use of Student’s t random variables is shown by the following R code (Fig. A.17).

> x <- seq( -6, 6, by=0.1 )

> f1 <- dt( x, 3 ) # nu = 3

> f2 <- dt( x, 7 )

> f3 <- dt( x, 20 )

> matplot( x, cbind(f1,f2,f3), type="l", col = c("red","black","green") )

> F1 <- pt( x, 3 ) # nu = 3

> F2 <- pt( x, 7 )

> F3 <- pt( x, 20 )

> matplot( x, cbind(F1,F2,F3), type="l", col = c("red","black","green") )

A.5 Statistics

Let X1, . . . , Xn be a random sample from a distribution with cdf F_X(x) = P(X ≤ x), pdf or pmf f_X, mean E[X] = µ_X, and variance Var(X) = σ²_X. Note that lowercase letters x1, . . . , xn denote an observed random sample.

A statistic is a function Tn = Tn(X1, . . . , Xn) of a sample. Important statistics are the sample mean
\[ \bar X = \frac{1}{n} \sum_{i=1}^n X_i, \]
the sample variance
\[ S^2 = \frac{1}{n-1} \sum_{i=1}^n (X_i - \bar X)^2, \]
and the sample standard deviation S = \sqrt{S^2}.


Fig. A.17. Student’s t distributions.

The empirical cumulative distribution function (ecdf) of F_X(x) = P(X ≤ x) is the proportion of sample points which fall into the interval (−∞, x]. That is, the ecdf of an observed sample x1, x2, . . . , xn with ordering x(1) ≤ x(2) ≤ . . . ≤ x(n) is given by
\[ F_n(x) = \begin{cases} 0, & \text{if } x < x_{(1)}, \\ \frac{i}{n}, & \text{if } x_{(i)} \le x < x_{(i+1)}, \ 1 \le i \le n-1, \\ 1, & \text{if } x_{(n)} \le x. \end{cases} \]

A statistic Tn is an unbiased estimator of a parameter θ if E[Tn] = θ. An estimator Tn is asymptotically unbiased for the parameter θ if
\[ \lim_{n\to\infty} E[T_n] = \theta. \]
The bias of an estimator Tn for the parameter θ is given by bias(Tn) = E[Tn] − θ.

Proposition A.14. The sample mean X is an unbiased estimator of the mean µ = E[X], and thesample variance S2 is an unbiased estimator of the variance σ2 = Var(X).

Proof. Since the random variables X1, . . . , Xn are independent and identically distributed with mean µ, in view of the sample mean we have
\[ E[\bar X] = \frac{1}{n} \sum_i E[X_i] = \frac{1}{n}\,n\mu = \mu. \]

In view of the sample variance, we have
\[
\begin{aligned}
E[S^2] &= \frac{1}{n-1} E\Big[ \sum_{i=1}^n (X_i - \bar X)^2 \Big] \\
&= \frac{1}{n-1} E\Big[ \sum_{i=1}^n (X_i - \mu)^2 - n(\bar X - \mu)^2 \Big] \\
&= \frac{1}{n-1} \Big[ \sum_{i=1}^n E[(X_i - \mu)^2] - n E[(\bar X - \mu)^2] \Big] \\
&= \frac{1}{n-1} \Big[ n\sigma^2 - n\,\frac{\sigma^2}{n} \Big] \\
&= \frac{1}{n-1} \big[ (n-1)\sigma^2 \big] \\
&= \sigma^2.
\end{aligned}
\]

⊓⊔

However, the statistic S is generally a biased estimator of σ. Indeed, for any random variable X with mean µ, we have Var(X) = E[(X − µ)²] = E[X²] − µ². Thus Var(S) = E[S²] − E[S]² = σ² − E[S]² ≥ 0 and hence E[S] ≤ σ.

The mean-squared error (MSE) of an estimator T = Tn for a parameter θ is defined as
\[ \operatorname{MSE}(T) = E[(T - \theta)^2]. \]

Note that for an unbiased estimator the MSE equals by definition the variance of the estimator. However, if T is a biased estimator of θ, the MSE is larger than the variance. To see this, we compute
\[
\begin{aligned}
E[(T - \theta)^2] &= E[(T - E[T] + E[T] - \theta)^2] \\
&= E[(T - E[T])^2] + 2\,E[T - E[T]]\,(E[T] - \theta) + (E[T] - \theta)^2 \\
&= \operatorname{Var}(T) + (E[T] - \theta)^2 \\
&= \operatorname{Var}(T) + \operatorname{bias}(T)^2,
\end{aligned}
\]
where the cross term vanishes since E[T − E[T]] = 0, and bias(T) = E[T] − θ is the bias of T for θ.
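
The decomposition can be illustrated by a small simulation in R; the biased variance estimator with denominator n (instead of n − 1) serves here merely as an illustrative example.

> set.seed(1)
> sigma2 <- 4   # true variance of N(0, 2^2)
> T <- replicate( 10000, { x <- rnorm( 10, sd=2 ); mean( (x - mean(x))^2 ) } )
> mean( (T - sigma2)^2 )              # Monte Carlo estimate of MSE(T)
> var(T) + ( mean(T) - sigma2 )^2     # Var(T) + bias(T)^2, approximately equal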

Example A.15 (R). The function fitdistr can be used to provide maximum likelihood fitting of thegeometric distribution.

> library(MASS)

> set.seed(123)

# generate random numbers

> x <- rgeom( 1000, prob=0.6 )

# maximum likelihood estimation

> z <- fitdistr( x, "geometric")

prob

0.61050061

(0.01204868)

# squared error of the estimate

> (z$estimate - 0.6)^2

[1] 0.0001102628


A.6 Method of Moments

The method of moments is a technique for the estimation of parameters. This is achieved by derivingequations that relate the estimated moments of a sample set to the parameters of interest.

Let X be a (discrete) random variable with pmf f_X and let k be a positive integer. Then the k-th moment of X is defined as
\[ \mu_k = E[X^k] = \sum_x x^k f_X(x). \]

Let X1, . . . , Xn be a random sample of independent and identically distributed random variables with pmf f_X and realizations x1, . . . , xn. Then the value
\[ \hat\mu_k = \frac{1}{n} \sum_{i=1}^n x_i^k \]

is the k-th sample moment, an estimate of µ_k. For instance, the first moment E[X] is the expected value and is estimated by
\[ \hat\mu_1 = \bar x = \frac{1}{n} \sum_{i=1}^n x_i. \]

Moreover, the second moment E[X²] is estimated by
\[ \hat\mu_2 = \frac{1}{n} \sum_{i=1}^n x_i^2. \]

Thus the variance σ² = Var(X) = E[X²] − E[X]² is estimated by
\[ \hat\mu_2 - \hat\mu_1^2 = \frac{1}{n} \sum_{i=1}^n x_i^2 - \left( \frac{1}{n} \sum_{i=1}^n x_i \right)^2. \]

Example A.16 (R). The method of moments is used to estimate the mean value and the variance ofa sample set that has a standard normal distribution.

> n <- 1000

> data <- rnorm( n, mean=0, sd=1 )

> m1.hat <- sum(data)/n

[1] 0.02451518

> m2.hat <- sum(data^2)/n

[1] 1.039528

> var.hat <- m2.hat - m1.hat^2

[1] 1.038927

The method of moments can be used as the first approximation to the solution of the likelihoodequations and further improved approximations can be derived by numerical methods. On the otherhand, the maximum likelihood method may lead to better results. But in some cases the maximumlikelihood method may be intractable whereas estimators from the method of moments can be efficientlyestablished.


A.7 Maximum-Likelihood Estimation

Maximum-likelihood estimation is a method of estimating the parameters of a statistical model givensample data.

Let X1, . . . , Xn be random variables with parameter or parameter vector θ ∈ Θ, where Θ is theparameter space of possible parameters. The likelihood function L(θ) of random variables X1, . . . , Xn

with realizations x1, . . . , xn is defined as the joint density

L(θ) = f(x1, . . . , xn|θ).

If X1, . . . , Xn are a random sample of independent and identically distributed random variables with density f(x|θ), then
\[ L(\theta) = \prod_{i=1}^n f(x_i \mid \theta). \]

A maximum likelihood estimate of θ is a value or vector θ̂ which maximizes L(θ). That is, θ̂ is a solution of the maximization problem
\[ L(\hat\theta) = f(x_1, \ldots, x_n \mid \hat\theta) = \max\{ f(x_1, \ldots, x_n \mid \theta) \mid \theta \in \Theta \}. \]

If the estimate θ̂ is uniquely determined, then θ̂ is the maximum likelihood estimator (MLE) of θ. If the parameter space Θ is a real-valued interval in R and the function L(θ) is differentiable and attains a maximum on Θ, then θ̂ is a solution of
\[ \frac{d}{d\theta} L(\theta) = 0. \]

Since the logarithm function is monotonous and differentiable, it is often easier to consider the log-likelihood function

ℓ(θ) = logL(θ).

The maximum likelihood estimates of L(θ) and ℓ(θ) are the same. In particular, if X1, . . . , Xn are a random sample of independent and identically distributed random variables with density f(x|θ), then the log-likelihood function becomes
\[ \ell(\theta) = \log L(\theta) = \sum_{i=1}^n \log f(x_i \mid \theta). \]

The maximum likelihood estimates can often be determined analytically. If not, numerical optimization or other computational methods like heuristics can be used.

Example A.17 (R). Let X1, . . . , Xn be a random sample with density
\[ f(x \mid \theta) = \frac{\theta}{2}\, e^{-\theta|x|}, \quad x \in \mathbb{R}, \]
and non-negative realizations x1, . . . , xn. The likelihood function is


\[ L(\theta) = \prod_{i=1}^n \frac{\theta}{2}\, e^{-\theta x_i} = \frac{\theta^n}{2^n}\, e^{-\theta(x_1 + \ldots + x_n)} \]

and the log-likelihood function is

\[ \ell(\theta) = n \log\theta - \theta(x_1 + \ldots + x_n) - \log 2^n. \]

Setting the first derivative of ℓ(θ) to zero gives the equation
\[ \frac{d}{d\theta}\ell(\theta) = \frac{n}{\theta} - (x_1 + \ldots + x_n) = 0. \]

The unique solution provides the MLE
\[ \hat\theta = \frac{n}{x_1 + \ldots + x_n}, \]

which amounts to the reciprocal sample mean. A numerical solution in R can be obtained as follows.

# numerical optimization

> x <- c( 0.25,0.41,0.37,0.54 )

# minus log-likelihood of density, initial value theta=1

> mlog <- function(theta=1) { return( -(length(x) * log(theta) - theta * sum(x) )) }

> library( stats4 )

> y <- mle( mlog )

> summary( y )

Call:

mle(minuslogl = mlog)

Coefficients

theta

2.547728

# analytic optimization (above)

> opt.theta <- length(x) / sum(x); opt.theta

[1] 2.547771

Example A.18 (R). Let X1, . . . , Xn be a random sample of independent and identically Poisson-distributed random variables with parameter λ > 0 and realizations x1, . . . , xn. The likelihood function is
\[ L(\lambda) = e^{-n\lambda}\, \frac{\lambda^{x_1 + \ldots + x_n}}{x_1! \cdots x_n!} \]
and the log-likelihood function is
\[ \ell(\lambda) = \log L(\lambda) = -n\lambda + (x_1 + \ldots + x_n)\log\lambda - \log(x_1! \cdots x_n!). \]

Then


\[ \frac{d}{d\lambda}\ell(\lambda) = -n + (x_1 + \ldots + x_n)\,\frac{1}{\lambda} = 0 \]

implies

\[ \hat\lambda = \frac{x_1 + \ldots + x_n}{n}. \]

The function fitdistr provides maximum likelihood fitting of the poisson distribution.

> library(MASS)

> set.seed(123)

# generate random numbers

> x <- rpois( 1000, lambda=5 )

# maximum likelihood estimation

> fitdistr( x, "poisson" )

# output: estimate, sd, vcov, loglik

lambda

5.01000000

(0.07078153)

> mean(x)

[1] 5.01

Suppose the density is a multivariate function f(x1, . . . , xn|θ), where θ is a vector in R^d, the parameter space Θ is an open subset of R^d, and the partial derivatives of the likelihood function L(θ) exist in all coordinates. Then the maximum likelihood estimate θ̂ has to fulfill simultaneously the d equations
\[ \frac{\partial}{\partial\theta_i} L(\theta) = 0, \quad 1 \le i \le d. \]
This gives a system of d equations in d unknowns.

Example A.19 (R). Let X1, . . . , Xn be a random sample of independent and identically distributed normal random variables with parameters µ and σ, and let x1, . . . , xn be a realization. Then the likelihood function is
\[ L(\mu, \sigma) = \prod_{i=1}^n \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left( -\frac{(x_i - \mu)^2}{2\sigma^2} \right) \]
and the log-likelihood function is
\[ \ell(\mu, \sigma) = -\frac{n}{2}\log(2\pi) - \frac{n}{2}\log\sigma^2 - \frac{1}{2\sigma^2} \sum_{i=1}^n (x_i - \mu)^2. \]

Then the partial derivatives give

\[ \frac{\partial\ell}{\partial\mu} = \frac{1}{\sigma^2} \sum_{i=1}^n (x_i - \mu) = 0 \]

and


\[ \frac{\partial\ell}{\partial\sigma} = -\frac{n}{\sigma} + \frac{1}{\sigma^3} \sum_{i=1}^n (x_i - \mu)^2 = 0. \]

Thus the maximum likelihood estimates are
\[ \hat\mu = \frac{x_1 + \ldots + x_n}{n} \quad\text{and}\quad \hat\sigma^2 = \frac{1}{n} \sum_{i=1}^n (x_i - \hat\mu)^2. \]

The function fitdistr provides maximum likelihood fitting of the normal distribution.

> library(MASS)

> set.seed(123)

# generate random numbers

> x <- rnorm( 1000, mean=80, sd=15 )

# maximum likelihood estimation

> fitdistr( x, "normal")

# output: estimate, sd, vcov, loglik

mean sd

80.3936556 15.1467157

(0.4789812) (0.3386909)

A.8 Ordinary Least Squares

In statistics, ordinary least squares (OLS) is a method for estimating the unknown parameters in a linear or nonlinear regression model.

Example A.20 (R). In the classical linear regression model, consider sample data consisting of n observations (xi, yi), 1 ≤ i ≤ n. The response variable is modelled as a linear function of the independent variable,
\[ y_i = \alpha + \beta x_i, \quad 1 \le i \le n. \]
The ordinary least squares approximation seeks parameters α and β such that the following expression becomes minimal,
\[ S(\alpha, \beta) = \sum_{i=1}^n \big( y_i - (\alpha + \beta x_i) \big)^2. \]

We have
\[ \frac{\partial S}{\partial\alpha} = -2 \sum_{i=1}^n (y_i - \alpha - \beta x_i) \]

and


\[ \frac{\partial S}{\partial\beta} = -2 \sum_{i=1}^n (y_i - \alpha - \beta x_i)\,x_i. \]

By setting both derivatives to zero, we obtain
\[ \hat\beta = \frac{\sum_{i=1}^n (x_i - \bar x)(y_i - \bar y)}{\sum_{i=1}^n (x_i - \bar x)^2} \quad\text{and}\quad \hat\alpha = \bar y - \hat\beta\,\bar x. \]

The following data are perturbed with white noise and the parameters are estimated by fitting a linear model.

> x <- seq( 0, 10, by=0.1 )

> n <- length( x )

> set.seed(210)

> e <- rnorm(n, mean=0, sd=1 ) # white noise

> y <- 10 + 2*x + e

> lm( y ~ x ) # linear model

Call:

lm(formula = y ~ x)

Coefficients:

(Intercept)           x

     10.160       1.971

Thus we obtain α̂ = 10.160 and β̂ = 1.971 (Fig. A.18); the closed-form estimates can also be computed directly, as the short check below shows. ♦
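
For comparison, the closed-form estimates from the normal equations above can be computed directly; this check assumes the vectors x and y from the code above are still in the workspace.

> beta.hat <- sum( (x - mean(x)) * (y - mean(y)) ) / sum( (x - mean(x))^2 )
> alpha.hat <- mean(y) - beta.hat * mean(x)
> c( alpha.hat, beta.hat )   # agrees with the coefficients reported by lm()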

Example A.21 (R). In the nonlinear regression model, take sample data consisting of n observations (xi, yi), 1 ≤ i ≤ n. The response variable is modelled as an exponential function of the independent variable,
\[ y_i = p_1 e^{-p_2 x_i}, \quad 1 \le i \le n. \]
The least squares approximation seeks parameters p1 and p2 such that the following expression becomes minimal,
\[ E(p_1, p_2) = \sum_{i=1}^n \big( y_i - p_1 e^{-p_2 x_i} \big)^2. \]

The following data are perturbed with white noise and the parameters are estimated by fitting a nonlinear model.

> x <- seq( 0, 10, by=0.1 )

> n <- length( x )

> set.seed(210)

> e <- rnorm(n, mean=0, sd=1 ) # white noise

> y <- 10/exp(0.5*x) + e

246 A Computational Statistics in R

> nls( y ~ p1/exp(p2*x) , start=list(p1=1,p2=1) ) # nonlinear model

Nonlinear regression model

model: y ~ p1/exp(p2 * x)

data: parent.frame()

p1 p2

10.2135 0.1044

residual sum-of-squares: 107.1

Number of iterations to converge: 7

Achieved convergence tolerance: 5.338e-08

Thus we obtain p1 = 10.2135 and p2 = 0.1044 (Fig. A.18). ♦

Fig. A.18. Sample data from linear and nonlinear regression model with white noise.

A.9 Parameter Optimization

The language R provides several functions for one- and two-dimensional parameter optimization. Forthe optimization of univariate functions, the function optimize can be used.

Example A.22 (R). Consider the real-valued function
\[ f(x) = \frac{\log(2 + \log(x))}{\log(2 + x)}, \quad x \ge 1. \]

The graph of the function f(x) is shown in Fig. A.19. The drawing of the function in the interval [1, 10]can be done by the following code.


Fig. A.19. Graph of function f(x).

> x <- seq( 1, 10, 0.01 )

> y <- log(2+log(x))/log(2+x)

> plot( x, y, type="l", xlab="x",ylab="P(x)" )

As can be seen, the maximum lies in the interval [2, 4]. Therefore, we apply the function optimize to this interval. The default is to minimize the function; to maximize f(x), set the argument maximum to TRUE. The following code yields the optimal parameter value x = 2.090302 and the optimal function value f(x) = 0.714867.

> f <- function( x )

+ log(2+log(x))/log(2+x)

> optimize( f, lower=2, upper=4, maximum=TRUE )

$maximum

[1] 2.090302

$objective

[1] 0.714867

♦

For the optimization of bivariate functions, the function optim can be used.

Example A.23 (R). Let X1, . . . , Xn be a random sample of independent and identically distributed random variables from the gamma(r, λ) distribution with shape parameter r > 0 and rate parameter λ > 0, and let x1, . . . , xn be a non-negative realization. Then the likelihood function is
\[ L(r, \lambda) = \frac{\lambda^{nr}}{\Gamma(r)^n} \prod_{i=1}^n x_i^{r-1} \exp\Big( -\lambda \sum_{i=1}^n x_i \Big) \]

and the log-likelihood function is


\[ \ell(r, \lambda) = nr\log\lambda - n\log\Gamma(r) + (r-1) \sum_{i=1}^n \log x_i - \lambda \sum_{i=1}^n x_i. \]

The problem to maximize the log-likelihood function with respect to r and λ is a two-dimensionalproblem. The log-likelihood function can be implemented as follows.

LogL <- function( theta, sx, slogx, n ) {

+ r <- theta[1]

+ lambda <- theta[2]

+ val <- n*r*log(lambda)+(r-1)*slogx - lambda*sx - n*log(gamma(r))

+ -val

+ }

Note that function optim performs minimization by default and therefore the return value is −ℓ(r, λ).Initial values need to be chosen with care. For this problem, the method of moments can be used forthe initial values of the parameters. For simplicity, the initial values are set to r = 1 and λ = 1. Inthe following, x denotes a random sample of length n = 200. Then the parameter estimation can beachieved as follows.

> n <- 200

> r <- 5; lambda <- 2

> x <- rgamma( n, shape=r, rate=lambda )

> optim( c(1,1), LogL, sx=sum(x), slogx=sum(log(x)), n=n )

$par

[1] 5.094346 2.052238

$value

[1] 289.0687

$counts

function gradient

75 NA

$convergence

[1] 0

$message

NULL

The result shows that the optimization method (Nelder-Mead as default) converges successfully to the

maximum likelihood estimates r = 5.094346 and λ = 2.052238. The error code $convergence is 0 forsuccessful runs and otherwise indicates a problem.

Now the procedure is repeated 1,000 times to provide better parameter estimates.

> test <- replicate(1000, expr = {

+ x <- rgamma(200, shape=5, rate=2)

+ optim( c(1,2), LogL, sx=sum(x), slogx=sum(log(x)), n=n )$par

+ })

> colMeans(t(test))

[1] 5.042200 2.019213

The output exhibits the average estimated values which are slightly better than the parameter valuesestablished in one run. ♦

B

Spectral Analysis of Ranked Data

Humans, and particularly Americans, seem to be unable to avoid ranking things. Top Five/Ten/Twenty/Hundred lists are plentiful: Best (Worst) Dressed, Best Suburbs, Most Watched Movies and Videos, Best Football Teams, and so on. Rankings have very serious uses as well. Companies need to know which products consumers prefer, social and political leaders need to know what society values, and elections need to be conducted. Data consisting of rankings appear in Psychology, Animal Science, Educational Testing, Sociology, Economics, and Biology. Indeed, for almost any situation where there are data it can be helpful to transform the data into ranks. This chapter is aimed at preference rankings.

B.1 Data Analysis

Data sometimes come in the form of ranks or preferences. A group of people may be asked to rank orderfive brands of chocolate chip cookies. Each person tastes the cookies and ranks all five. This results ina ranking π(1), π(2), π(3), π(4), and π(5), with π(i) the rank given the ith brand. The collection ofrankings then makes up the data set. Elections are also sometimes based on rankings. For instance, theAmerican Psychological Association asks its members to rank order five candidates for president. Ananalysis of the data from one such election is presented in this chapter.

Almost anyone analyzing ranked data looks at simple averages such as the proportion of times eachitem was ranked first or last and the average rank for each item. These are first order statistics sincethey are linear combinations of the number of times item i was ranked in position j. There are alsonatural second order statistics based on the number of times items i and i′ are ranked in positions j andj′. These come in ordered and unordered modes. Similarly, there are third and higher order statisticsof various types.

A basic paradigm of data analysis is to take out some found structure and to look at the rest. Thus tolook at second order statistics it is natural to subtract away the observed first order structure. This leadsto a natural decomposition of the original data into orthogonal parts. The decomposition is somewhatmore complicated than standard analysis of variance decompositions because of the dependence inherentin the permutation structure.

Example B.1. In a 1989 study, 2,262 German citizens were asked to rank order the desirability of four political goals:


1. maintain order;
2. give people more say in government;
3. fight rising prices;
4. protect freedom of speech.

The data appear as

1234  137    2134   48    3124  330    4123   21
1243   29    2143   23    3142  294    4132   30
1324  309    2314   61    3214  117    4213   29
1342  255    2341   55    3241   69    4231   52
1423   52    2413   33    3412   70    4312   35
1432   93    2431   39    3421   34    4321   27

      875           279          914          194    2262

Thus 137 people ranked (1) first, (2) second, (3) third, and (4) fourth. The marginal totals show that people thought item (3) most important (ranked first by 914). The first order summary is the 4 × 4 matrix
\[ F = \begin{pmatrix} 875 & 279 & 914 & 194 \\ 746 & 433 & 742 & 341 \\ 345 & 773 & 419 & 725 \\ 296 & 777 & 187 & 1002 \end{pmatrix}. \]

The first row shows the number of people ranking a given item first, while the last row exhibits the number of people ranking a given item last. The data were collected in part to study whether the population could be usefully broken into "liberals", who might favor items (2) and (4), and "conservatives", who might favor items (1) and (3). ♦

Example B.2. The American Psychological Association (APA) is a large professional organization of academicians, clinicians, and all shades in between. The APA elects a president every year by asking each member to rank order a slate of five candidates. There were about 50,000 APA members in 1980 and about 15,000 members voted. Many members cast incomplete ballots, voting for their favorite q of the five candidates, 1 ≤ q ≤ 3. These ballots are illustrated in Table B.1. For instance, 1022 members ranked candidate 5 first and left the others unranked, while 142 members ranked candidate 5 first and candidate 4 second; zeros and blanks indicate unranked candidates. Moreover, the 5738 complete ballots are tabulated in Table B.2. For instance, 29 members ranked candidate 5 first, candidate 4 second, candidate 3 third, candidate 2 fourth, and candidate 1 fifth. In the next section, we will mainly consider useful summaries of these data given by simple averages (first order effects).

The APA chooses a winner by the Hare system, also known as proportional voting: if one of the five candidates is ranked first by more than half of the voters, the candidate wins. If not, the candidate with the fewest first-place votes is eliminated, each of the remaining candidates is reranked in relative order, and the method is applied inductively; a small sketch of this procedure is given below. Candidate 1 is the eventual winner here. ♦
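
A compact R sketch of the elimination step of the Hare system on a small made-up set of ballots; the ballot matrix below is purely illustrative and not the APA data.

> # each column is one ballot listing candidates from most to least preferred (toy data)
> ballots <- matrix( c(1,2,3, 1,3,2, 2,1,3, 3,2,1, 3,1,2), nrow=3 )
> hare <- function( ballots ) {
+   repeat {
+     first <- ballots[1, ]                        # current first choices
+     tab <- table( factor(first, levels=unique(as.vector(ballots))) )
+     if ( max(tab) > ncol(ballots)/2 ) return( as.integer(names(which.max(tab))) )
+     loser <- as.integer(names(which.min(tab)))   # fewest first-place votes
+     # remove the loser and rerank the remaining candidates on each ballot
+     ballots <- matrix( apply( ballots, 2, function(b) b[b != loser] ),
+                        ncol = ncol(ballots) )
+   }
+ }
> hare( ballots )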

B.2 Representation Theory for Partial Rankings

Consider a list of n items. Let λ be a partition of n. A partial ranking of shape λ is specified accordingto the following instructions: Choose your favorite λ1 items from the list but do not bother to rank


Table B.1. APA election data; incomplete ballots; 5,141 votes (q = 1) and 2,462 votes (q = 2).

q = 1                q = 2
ranking  votes       ranking  votes
1         1022       21         143
10        1145       12         196
100       1198       201         64
1000       881       210         48
10000      895       102         93
                     120         56
                     2001        70
                     2010       114
                     2100        89
                     1002        80
                     1020        87
                     1200        51
                     20001      117
                     20010      104
                     20100      547
                     21000       72
                     10002       72
                     10020       74
                     10200      302
                     12000       83

within. Then choose your next λ2 favorite items from the list but do not rank within, and so on. Such partial rankings can be written as λ-tabloids such as
\[ \begin{array}{ccc} 1 & 3 & 5 \\ 2 & 4 & \end{array}\,, \]

where the items 1, 3, 5 are ranked first and the items 2, 4 are ranked second.

With this interpretation, the set of all partial rankings of shape λ forms a basis of the permutation module M^λ. Moreover, the elements of the R-space M^λ can be considered as the set of all real-valued functions f on partial rankings, given as linear combinations of partial rankings,
\[ f = \sum_{\{T\}} f_{\{T\}}\, \{T\}, \quad f_{\{T\}} \in \mathbb{R}. \]

There is a simple method for computing the projection of a permutation module onto its isotypicsubspaces. This involves the character table of the group.

Theorem B.3. Let λ and µ be partitions of n. The orthogonal projection of an element f ∈ M^λ onto the isotypic subspace V^µ_λ is the function
\[ f^\mu(x) = \frac{\chi^\mu(\mathrm{id})}{n!} \sum_{\pi \in S_n} \chi^\mu(\pi)\, f(\pi^{-1}(x)). \]


Table B.2. APA election data; complete ballots; 5,738 votes.

ranking votes   ranking votes   ranking votes   ranking votes
54321    29     43521    91     32541    41     21543    36
54312    67     43512    84     32514    64     21534    42
54231    37     43251    30     32451    34     21453    24
54213    24     43215    35     32415    75     21435    26
54132    43     43152    38     32154    82     21354    30
54123    28     43125    35     32145    74     21345    40
53421    57     42531    58     31542    30     15432    40
53412    49     42513    66     31524    34     15423    35
53241    22     42351    24     31452    40     15342    36
53214    22     42315    51     31425    42     15324    17
53142    34     42153    52     31254    30     15243    70
53124    26     42135    40     31245    34     15234    50
52431    54     41532    50     25431    35     14532    52
52413    44     41523    45     25413    34     14523    48
52341    26     41352    31     25341    40     14352    51
52314    24     41325    23     25314    21     14325    24
52143    35     41253    22     25143   106     14253    70
52134    50     41235    16     25134    79     14235    45
51432    50     35421    71     24531    63     13542    35
51423    46     35412    61     24513    53     13524    28
51342    25     35241    41     24351    44     13452    37
51324    19     35214    27     24315    28     13425    35
51243    11     35142    45     24153   162     13254    95
51234    29     35124    36     24135    96     13245   102
45321    31     34521   107     23541    45     12543    34
45312    54     34512   133     23514    52     12534    35
45231    34     34251    62     23451    53     12453    29
45213    24     34215    28     23415    52     12435    27
45132    38     34152    87     23154   186     12354    28
45123    30     34125    35     23145   172     12345    30

B.4. First Order Projection – Winner-takes-all. The simple choice of 1 item out of n leads to a partial ranking of shape (n−1, 1). The real-valued functions on the corresponding permutation module M^(n−1,1) are of the form
\[ f = \sum_{i=1}^n f(i)\, \begin{array}{ccc} \cdot & \ldots & \cdot \\ i & & \end{array}, \quad f(i) \in \mathbb{R}. \]

The module M^(n−1,1) has the splitting
\[ M^{(n-1,1)} = S^{(n)} \oplus S^{(n-1,1)}. \]

The data vector f ∈ M^(n−1,1) has the decomposition f = F + (f − F), with F = (f(1) + . . . + f(n))/n. In view of the APA election data with q = 1 and 5,141 ballots given in Table B.1, we obtain F = 5,141/5 ≈ 1,028. The projection onto the isotypic submodule S^(4,1) merely amounts to subtracting the number of rankers divided by 5 from the original data vector:


Candidate   Projection onto S^(4,1)
1           −133
2           −147
3            170
4            117
5             −6

It follows that candidate 3 is the most popular, followed by candidate 4; the short sketch below reproduces this computation. ♦
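
This first order projection can be reproduced with a few lines of R; the vote counts are read off Table B.1 (q = 1), with candidates ordered 1 to 5.

> votes <- c( 895, 881, 1198, 1145, 1022 )   # first-place votes for candidates 1..5 (q = 1)
> round( votes - sum(votes)/5 )              # projection onto S^(4,1): -133 -147 170 117 -6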

B.5. Second Order Projection – Unordered Pairs. The choice of an unordered pair out of n amounts to a partial ranking of shape (n−2, 2). The real-valued functions on the corresponding permutation module M^(n−2,2) are given as
\[ f = \sum_{i,j} f(\{i,j\})\, \begin{array}{ccc} \cdot & \ldots & \cdot \\ i & j & \end{array}, \quad f(\{i,j\}) \in \mathbb{R}. \]

The associated permutation module has the decomposition
\[ M^{(n-2,2)} = S^{(n)} \oplus S^{(n-1,1)} \oplus S^{(n-2,2)}. \]

The projection onto the trivial submodule S^(n) is the mean F = Σ_{i,j} f({i,j}) / (n(n−1)). Moreover, the projection onto the submodule S^(n−1,1) is given by f(i) = Σ_j f({i,j}) − F₁ with F₁ = Σ_{i,j} f({i,j})/n, 1 ≤ i ≤ n. The projection onto the submodule S^(n−2,2) is what is left after the mean and the popularity of the individual items are taken out. The first order analysis gives

Candidate   Projection onto S^(n−1,1)
1           939
2           154
3           758
4           839
5           343

Thus there is a slight difference in the first order statistics when compared with the previous statistics.Candidate 1 is the most popular followed by candidate 4, while candidate 3 is only third ranked. ♦

B.6. Second Order Projection – Ordered Pairs. The choice of an ordered pair out of n gives rise to a partial ranking of shape (n−2, 1²). The real-valued functions on the corresponding permutation module M^(n−2,1²) are defined as
\[ f = \sum_{i,j} f(i,j)\, \begin{array}{ccc} \cdot & \ldots & \cdot \\ i & & \\ j & & \end{array}, \quad f(i,j) \in \mathbb{R}. \]

The corresponding permutation module splits as follows,
\[ M^{(n-2,1^2)} = S^{(n)} \oplus 2 \cdot S^{(n-1,1)} \oplus S^{(n-2,2)} \oplus S^{(n-2,1^2)}. \]


The two copies of the space S^(n−1,1) capture the effect of an item in first and in second position. The projection onto the subspace S^(n−2,2) describes an unordered pair effect, while the projection onto S^(n−2,1²) gives an ordered pair effect.

The projection onto the trivial submodule S^(n) is the mean F = Σ_{i,j} f(i,j) / (n(n−1)). Moreover, the projection onto the two submodules S^(n−1,1) accounts for positions one and two. Furthermore, the projection onto the submodule S^(n−2,2) provides an unordered pair effect, while the projection onto the submodule S^(n−2,1²) yields an ordered pair effect after the mean, first order, and unordered pair effects have been removed.

In view of the APA election data with q = 2 and 2,462 ballots given in Table B.1, we obtain F = 2,462/20 ≈ 123. The projection onto the first and second copies of the submodule S^(4,1) amounts to counting the number of votes for candidate i to be ranked first and second, respectively, and subtracting the number of rankers divided by 5:

Candidate   Projection onto 1st S^(n−1,1)   Projection onto 2nd S^(n−1,1)
1             39                              348
2           −201                             −135
3            291                              −26
4            −29                              −31
5            −97                              −50

The first order statistics ranks candidate 3 first and candidate 1 second. ♦

B.7. Complete Rankings. The partition (1ⁿ) with all parts equal to 1 corresponds to complete rankings. The associated permutation module M^(1ⁿ) is isomorphic to the group algebra of the group Sₙ. Thus the real-valued functions on complete rankings can be written as
\[ f = \sum_{\pi \in S_n} f(\pi)\,\pi, \quad f(\pi) \in \mathbb{R}. \]

The permutation module has the decomposition
\[ M^{(1^n)} = \bigoplus_{\lambda} (\dim S^\lambda)\, S^\lambda. \]

For instance, the decomposition of the permutation module for the complete rankings of n = 5 items is given as
\[ M^{(1^5)} = S^{(5)} \oplus 4 \cdot S^{(4,1)} \oplus 5 \cdot S^{(3,2)} \oplus 6 \cdot S^{(3,1^2)} \oplus 5 \cdot S^{(2^2,1)} \oplus 4 \cdot S^{(2,1^3)} \oplus S^{(1^5)}. \]

The four copies of the submodule S^(4,1) provide the effect of an item in each of the five positions; only four of these positions are independent.

In view of the APA election data, there are 5,738 complete ballots given in Table B.2. The projection onto the trivial module S^(5) is simply the mean, and the projection onto the four copies of the submodule S^(4,1), providing the first order effects, is given in Table B.3. This table shows the percentage of voters ranking candidate i in position j. Thus, candidate 3 is the most popular, being ranked first by 28% of the voters, but candidate 3 also had some hate vote. Candidate 1 is strongest in the second position; she


Table B.3. First order statistic: Percentage of voters ranking candidate i in position j.

                      rank
candidate    1    2    3    4    5
1           18   26   23   17   15
2           14   19   25   24   18
3           28   17   14   18   23
4           20   17   19   20   23
5           20   21   20   19   20

has no hate vote and a lower average rank than candidate 3. The voters seem indifferent on candidate 5.


C

Representation Theory of the Symmetric Group

C.1 The Symmetric Group

A function from the set [n] = {1, . . . , n} onto itself is called a permutation of n, and the set of allpermutations of n, together with the usual composition of functions, is the symmetric group of degreen, which will be denoted by Sn. The symmetric group Sn has n! elements. If X is a subset of [n], thenSX denotes the subgroup of Sn that fixes every number outside of X.

A permutation π of Sn is generally written in two-row format

\[ \pi = \begin{pmatrix} 1 & 2 & 3 & \ldots & n \\ \pi(1) & \pi(2) & \pi(3) & \ldots & \pi(n) \end{pmatrix}. \]

By considering the orbits {i, π(i), π²(i), . . .}, i ∈ [n], of the group generated by a permutation π, it follows that π can be written as a product of disjoint cycles, as in the example
\[ \pi = \begin{pmatrix} 1 & 2 & 3 & 4 & 5 & 6 & 7 \\ 2 & 1 & 4 & 5 & 6 & 3 & 7 \end{pmatrix} = (12)(3456)(7). \]

We usually suppress the 1-cycles. A permutation π that interchanges different numbers a, b and leavesthe other number fixed, is called a transposition and is written as π = (ab). Since the cycle (i1i2 . . . ik)equals the product (i1i2)(i1i3) . . . (i1ik), any cycle, and hence any permutation, can be written as aproduct of transpositions.

Moreover, if π = σ1σ2 . . . σk = τ1τ2 . . . τl are two ways of writing π as a product of transpositions, then it can be proved that k − l is even. Hence, the signature function sgn : Sn → {±1} such that sgn(π) = (−1)^k, when π is a product of k transpositions, is well-defined. We have sgn(1) = 1. For any two permutations σ and τ of n, which are products of k and l transpositions, respectively, sgn(στ) = (−1)^{k+l} = (−1)^k · (−1)^l = sgn(σ)sgn(τ). Thus the sign mapping is a homomorphism. It follows that for any permutation π of n, we have 1 = sgn(1) = sgn(π⁻¹π) = sgn(π⁻¹)sgn(π) and thus sgn(π)⁻¹ = sgn(π⁻¹).

Two permutations σ and τ of n are called conjugate if there exists a permutation π of n suchthat τ = πσπ−1. Conjugation is an equivalence relation on the set of permutations of n, and thecorresponding equivalence classes are called the conjugacy classes of n.


A sequence λ = (λ1, λ2, . . .) of non-negative integers such that Σᵢ λᵢ = n is called an improper partition of n. An improper partition λ of n is called proper if the entries are weakly decreasing; that is, λ1 ≥ λ2 ≥ . . . . Proper partitions are also simply called partitions. A partition is usually written as a finite sequence in which trailing zeros are deleted, such as (4, 2, 2, 1, 0, 0, . . .) = (4, 2, 2, 1) = (4, 2², 1). If λ is a partition of n such that λ_m > 0 and λ_{m+s} = 0 for all s ≥ 1, then λ is called a partition of n into m parts.

Let λ be a partition of n. A permutation π is said to have cycle-type λ if the orbits of the groupgenerated by π have lengths λ1 ≥ λ2 ≥ . . .. For instance, the permutation (12)(3456)(7) has thecycle-type (4, 2, 1).

Let Pn denote the set of all partitions of n. Consider the map that assigns to each permutationπ of n the corresponding cycle-type in Pn. This map is onto since each partition λ of n gives rise toa permutation π = (1, . . . , λ1)(λ1 + 1, . . . , λ1 + λ2) . . . of n that has cycle-type λ. For instance, thepartition (3, 2, 1) gives rise to the permutation (123)(45)(6). Moreover, for each cycle (i1, . . . , ik) andeach permutation π of n, we have

π(i1, . . . , ik)π−1 = (π(i1), . . . , π(ik)).

Thus a permutation of n and all of its conjugate permutations have the same cycle-type. Conversely, suppose σ and τ are permutations of n having the same cycle-type λ. Write σ = (. . .) . . . (i1, . . . , ik) . . . (. . .) and τ = (. . .) . . . (j1, . . . , jk) . . . (. . .). Then the permutation π of n that assigns iₗ ↦ jₗ has the property τ = πσπ⁻¹, and thus σ and τ are conjugate. It follows that the map is also one-to-one. Thus we have shown the following.

Proposition C.1. The number of conjugacy classes of Sn equals the number of partitions of n.

Let K be a field. The group algebra KSₙ is a vector space over K with the elements of Sₙ as a basis. The elements of KSₙ are written as linear combinations of the elements in Sₙ with coefficients in K,
\[ \sum_{\pi \in S_n} a_\pi \pi, \quad a_\pi \in K. \tag{C.1} \]

The addition of two elements in the group algebra is given as

\[ \Big( \sum_{\pi \in S_n} a_\pi \pi \Big) + \Big( \sum_{\pi \in S_n} b_\pi \pi \Big) = \sum_{\pi \in S_n} (a_\pi + b_\pi)\,\pi. \tag{C.2} \]

This vector space has the dimension |Sn| = n!. The multiplication in the group algebra is defined bylinear extension of the group multiplication as

\[ \Big( \sum_{\pi \in S_n} a_\pi \pi \Big) \Big( \sum_{\sigma \in S_n} b_\sigma \sigma \Big) = \sum_{\pi, \sigma \in S_n} a_\pi b_\sigma\, \pi\sigma = \sum_{\sigma \in S_n} \Big( \sum_{\pi \in S_n} a_\pi b_{\pi^{-1}\sigma} \Big) \sigma. \tag{C.3} \]

This multiplication is associative and turns the group algebra KSn into a unitary ring. Notice that thegroup algebra KG of any finite group G is similarly defined.


C.2 Diagrams, Tableaux, and Tabloids

Let λ be a partition of n. The diagram of λ is the set

[λ] = {(i, j) | i ≥ 1, 1 ≤ j ≤ λi}.

Each element (i, j) in [λ] is called a node. The k-th row (column) of a diagram consists of those nodes whose first (second) coordinate is k. For instance, the diagram of the partition λ = (4, 2², 1) is
\[ [\lambda] = \begin{array}{cccc} \times & \times & \times & \times \\ \times & \times & & \\ \times & \times & & \\ \times & & & \end{array}. \]

Let λ and µ be partitions of n. We say that λ dominates µ, briefly λ ⊵ µ, provided that for all k ≥ 1,
\[ \sum_{i=1}^k \lambda_i \ge \sum_{i=1}^k \mu_i. \]

For instance, we have (6) ⊵ (5, 1) ⊵ (3, 3) ⊵ (3, 2, 1) ⊵ (3, 1³) ⊵ (2², 1²) ⊵ (2, 1⁴) ⊵ (1⁶). Moreover, we write λ > µ if the least number j for which λⱼ ≠ µⱼ satisfies λⱼ > µⱼ. This is a total order, called the dictionary order on partitions of n. The dictionary order contains the dominance order in the sense that λ ⊵ µ implies λ > µ. For instance, we have (3, 1³) > (2³) but (3, 1³) ⋭ (2³).

Let [λ] be a diagram. The conjugate diagram [λ′] is obtained by interchanging the rows and columns in the diagram [λ]. The corresponding partition λ′ of n is called the partition conjugate to λ. For instance, if λ = (4, 2², 1) then λ′ = (4, 3, 1²). Notice that the part λ′ᵢ of λ′ is the number of parts λⱼ of λ that are greater than or equal to i. The conjugate partition of a proper partition is also proper.

Proposition C.2. We have λ ⊵ µ if and only if µ′ ⊵ λ′.

Proof. Let λ ⊵ µ. There exists an index k such that λᵢ = µᵢ for 1 ≤ i ≤ k − 1 and λₖ > µₖ. Then λ′ₖ, the number of parts λⱼ with λⱼ ≥ k, is less than µ′ₖ, the number of parts µⱼ with µⱼ ≥ k. Thus µ′ ⊵ λ′. The converse follows by using the identity (λ′)′ = λ. ♦

Let λ be a partition of n. A λ-tableau is a bijection T : [λ] → [n]. Graphically, a λ-tableau is an array

of integers obtained by replacing each node in the diagram [λ] by an integer in [n] without repeats. For instance, two (4, 2², 1)-tableaux are
\[ \begin{array}{cccc} 1 & 2 & 3 & 4 \\ 5 & 6 & & \\ 7 & 8 & & \\ 9 & & & \end{array} \qquad\text{and}\qquad \begin{array}{cccc} 2 & 6 & 3 & 1 \\ 9 & 4 & & \\ 8 & 5 & & \\ 7 & & & \end{array}. \]

A tableau is called row-standard if the entries increase along the rows. For instance, the first of theabove tableaux is row-standard, the second is not.

C.3. Basic Combinatorial Lemma. Let λ and µ be partitions of n, and suppose T is a λ-tableau and T′ is a µ-tableau. If for each i ≥ 1 the entries in the i-th row of T′ belong to different columns of T, then λ ⊵ µ.


Proof. Imagine we can place the µ1 numbers from the first row of T′ into the diagram [λ] such that no two numbers are in the same column. Then [λ] must have at least µ1 columns; that is, λ1 ≥ µ1. Next we insert the µ2 numbers from the second row of T′ into different columns. To have space to do this, we require that λ1 + λ2 ≥ µ1 + µ2. By continuing in this way, we obtain λ ⊵ µ. ♦

The symmetric group Sn acts on the set of λ-tableaux in a natural way. Given a λ-tableau T anda permutation π of n, the composition of the functions T and π gives the λ-tableau πT . For instance,the permutation (1264)(3)(5978) sends the first of the above tableaux to the second.

Let T be a λ-tableau. The row stabilizer R(T) of T is the subgroup of Sₙ that keeps the rows of T fixed setwise; that is,
\[ R(T) = \{ \sigma \in S_n \mid \forall i:\ i \text{ and } \sigma(i) \text{ belong to the same row of } T \}. \]
The column stabilizer C(T) of T is similarly defined. For instance, the tableau
\[ T = \begin{array}{cccc} 2 & 6 & 3 & 1 \\ 9 & 4 & & \\ 8 & 5 & & \\ 7 & & & \end{array} \]
has the row stabilizer R(T) = S_{1,2,3,6} ⊕ S_{4,9} ⊕ S_{5,8} ⊕ S_{7} and the column stabilizer C(T) = S_{2,7,8,9} ⊕ S_{4,5,6} ⊕ S_{3} ⊕ S_{1}.

The row and column stabilizers are Young subgroups of Sₙ. More generally, let X = {X₁, . . . , Xₖ} be a partition of the set [n] into disjoint non-empty subsets. The Young subgroup S_X is given by the product group
\[ S_X = S_{X_1} \oplus \ldots \oplus S_{X_k}, \]
where S_{Xᵢ} is the subgroup of Sₙ leaving all elements outside of Xᵢ element-wise fixed. In particular, each partition λ of n has an associated canonical Young subgroup
\[ S_\lambda = S_{\{1,\ldots,\lambda_1\}} \oplus S_{\{\lambda_1+1,\ldots,\lambda_1+\lambda_2\}} \oplus \ldots. \]

Lemma C.4. Let T be a λ-tableau.

• For each permutation π of n,
  C(πT) = πC(T)π⁻¹ and R(πT) = πR(T)π⁻¹.
• C(T) ∩ R(T) = 1.
• The row and column stabilizers of T are subgroups of Sₙ that are conjugate to S_λ and S_{λ′}, respectively.

Define an equivalence relation on the set of all λ-tableaux: two λ-tableaux T₁ and T₂ are called equivalent if there is some permutation π ∈ R(T₁) such that πT₁ = T₂. The equivalence classes are called tabloids, and the equivalence class containing the tableau T is denoted by {T}. A tabloid can be considered as a tableau with totally ordered row entries. For instance, the (3, 2)-tabloids are
\[
\begin{array}{c} 1\,2\,3 \\ 4\,5 \end{array},\quad
\begin{array}{c} 1\,2\,4 \\ 3\,5 \end{array},\quad
\begin{array}{c} 1\,2\,5 \\ 3\,4 \end{array},\quad
\begin{array}{c} 1\,3\,4 \\ 2\,5 \end{array},\quad
\begin{array}{c} 1\,3\,5 \\ 2\,4 \end{array},
\]
\[
\begin{array}{c} 1\,4\,5 \\ 2\,3 \end{array},\quad
\begin{array}{c} 2\,3\,4 \\ 1\,5 \end{array},\quad
\begin{array}{c} 2\,3\,5 \\ 1\,4 \end{array},\quad
\begin{array}{c} 2\,4\,5 \\ 1\,3 \end{array},\quad
\begin{array}{c} 3\,4\,5 \\ 1\,2 \end{array}.
\]


The group Sn acts on the set of λ-tabloids by

π{T} = {πT}, π ∈ Sn, T λ-tableau.

This action is well-defined, since {T₁} = {T₂} implies T₂ = σT₁ for some permutation σ ∈ R(T₁). By Lemma C.4, for each permutation π ∈ Sₙ, we have πσπ⁻¹ ∈ R(πT₁) and thus {πT₁} = {(πσπ⁻¹)(πT₁)} = {πT₂}.

C.3 Permutation Modules

Let K be a field. For each partition λ of n, we consider the vector space M^λ over K whose basis elements are the λ-tabloids; that is,
\[ M^\lambda = \bigoplus_{\{T\}} K\{T\}. \]

The action of the symmetric group Sn defined on tabloids can be linearly extended on the space Mλ

as follows,
\[ \Big( \sum_{\pi} k_\pi \pi \Big) \Big( \sum_{\{T\}} k'_{\{T\}} \{T\} \Big) = \sum_{\pi} \sum_{\{T\}} k_\pi k'_{\{T\}}\, \pi\{T\}. \]

This action turns M^λ into a KSₙ-module.

Let T be a λ-tableau whose row stabilizer is S_λ. Take a set of left coset representatives σ₁, . . . , σₗ of the subgroup S_λ in Sₙ. This gives the decomposition
\[ S_n = \sigma_1 S_\lambda \cup \ldots \cup \sigma_l S_\lambda. \]
Form the l elements
\[ \sigma_i \{T\}, \quad 1 \le i \le l. \]
We claim that these elements are linearly independent over K. Indeed, given a linear combination Σᵢ kᵢσᵢ{T} = 0, kᵢ ∈ K, 1 ≤ i ≤ l, each term kᵢσᵢ{T} must be zero, since the tabloids involved are pairwise different. Moreover, we claim that these elements form a basis of M^λ. Indeed, it suffices to show that for any element π ∈ Sₙ, πσᵢ{T} is a linear combination of the elements of the set. For this, observe that πσᵢ = σⱼτ for some τ ∈ S_λ and some unique σⱼ, so that πσᵢ{T} = σⱼτ{T} = σⱼ{T}. This proves the claim.

Proposition C.5. The space M^λ is a cyclic KSₙ-module generated by any one λ-tabloid, and the dimension of M^λ is
\[ \dim_K M^\lambda = \binom{n}{\lambda_1, \lambda_2, \ldots}. \]

Proof. In view of the above basis of M^λ, we have M^λ = KSₙ{T} for any λ-tabloid {T}; that is, M^λ is cyclic. The dimension of M^λ equals the number of coset representatives of S_λ in Sₙ. ♦


For instance, the space M^(3,2) has dimension 10; the (3,2)-tabloids were given in the previous section. A bilinear form on the space M^λ can be defined by setting
\[ \langle \{T_1\}, \{T_2\} \rangle = \begin{cases} 1 & \{T_1\} = \{T_2\}, \\ 0 & \text{otherwise.} \end{cases} \]
This bilinear form is symmetric and makes the set of λ-tabloids an orthonormal basis of M^λ. The form is non-singular (i.e., for each non-zero element m in M^λ there is an element m′ in M^λ with 〈m, m′〉 ≠ 0) and Sₙ-invariant (i.e., for all tabloids {T₁}, {T₂} and permutations π of n, 〈π{T₁}, π{T₂}〉 = 〈{T₁}, {T₂}〉). It follows that the group Sₙ operates as a group of orthogonal transformations on the space M^λ.

C.4 Specht Modules

Let T be a λ-tableau. The signed column sum of T is the element of the group algebra KSₙ that is obtained by summing the elements in the column stabilizer of T, attaching the signature to each permutation,
\[ \kappa_T = \sum_{\sigma \in C(T)} \operatorname{sgn}(\sigma)\,\sigma. \]

The polytabloid associated with the tableau T is the linear combination of λ-tabloids that is obtained by multiplying the tabloid {T} with the signed column sum,
\[ e_T = \kappa_T \{T\} = \sum_{\sigma \in C(T)} \operatorname{sgn}(\sigma)\,\sigma\{T\}. \]

The polytabloid eT depends on the tableau T not just on the tabloid {T}.

Example C.6. Take the tableau
\[ T = \begin{array}{ccc} 1 & 3 & 5 \\ 2 & 4 & \end{array}. \]
The corresponding signed column sum is
\[ \kappa_T = (1 - (12))(1 - (34)) = (1) - (12) - (34) + (12)(34) \]
and the associated polytabloid is
\[ e_T = \begin{array}{c} 1\,3\,5 \\ 2\,4 \end{array} - \begin{array}{c} 2\,3\,5 \\ 1\,4 \end{array} - \begin{array}{c} 1\,4\,5 \\ 2\,3 \end{array} + \begin{array}{c} 2\,4\,5 \\ 1\,3 \end{array}. \]

The practical way of writing down the polytabloid eT is to permute the numbers in the columns ofthe tableau T in all possible ways, attaching the signature to each permutation, and then permutingthe positions of T accordingly.

The Specht module for the partition λ is the submodule S^λ of M^λ spanned as a vector space by the polytabloids,
\[ S^\lambda = \sum_{T\ \lambda\text{-tableau}} K e_T. \]


Proposition C.7. The Specht module Sλ is a cyclic KSn-module generated by any one polytabloid.

Proof. Let T be a λ-tableau. For each permutation π of n, we have

\[
\begin{aligned}
e_{\pi T} &= \sum_{\sigma \in C(\pi T)} \operatorname{sgn}(\sigma)\,\sigma\{\pi T\} \\
&= \sum_{\sigma \in C(T)} \operatorname{sgn}(\pi\sigma\pi^{-1})\,\pi\sigma\pi^{-1}\{\pi T\} \\
&= \pi \sum_{\sigma \in C(T)} \operatorname{sgn}(\sigma)\,\sigma\{T\} \\
&= \pi e_T.
\end{aligned}
\]

It follows that S^λ is a cyclic module. ♦

Lemma C.8. Let λ and µ be partitions of n. If T is a λ-tableau and T′ is a µ-tableau such that κ_T{T′} ≠ 0, then λ ⊵ µ. In particular, if λ = µ, then κ_T{T′} = ±κ_T{T} = ±e_T.

Proof. Let i and j be two numbers in the same row of T′. Then we have

[1− (ij)]{T ′} = {T ′} − (ij){T ′} = 0.

Suppose i and j are in the same column of T. Then we can find coset representatives σ₁, . . . , σₖ for the subgroup U = {1, (ij)} of the column stabilizer of T such that
\[ \kappa_T = [\operatorname{sgn}(\sigma_1)\sigma_1 + \ldots + \operatorname{sgn}(\sigma_k)\sigma_k][1 - (ij)]. \]
Then it follows that κ_T{T′} = 0, contradicting the hypothesis. Thus we have shown that the numbers in the same row of T′ belong to different columns of T. Hence, the basic combinatorial lemma gives λ ⊵ µ.

If λ = µ, then by construction {T′} is one of the tabloids involved in κ_T{T}. Thus {T′} = σ{T} for some column permutation σ in C(T). It follows that κ_T{T′} = κ_Tσ{T} = ±κ_T{T}. ♦

Corollary C.9. If m is an element of M^λ and T is a λ-tableau, then κ_T m is a multiple of e_T.

Proof. By Lemma C.8, for each λ-tabloid {T′}, κ_T{T′} is a multiple of e_T. But m is a linear combination of λ-tabloids, and thus κ_T m is a multiple of e_T. ♦

Let m, m′ ∈ M^λ and let T be a µ-tableau. By the properties of the bilinear form on M^λ, we have
\[
\begin{aligned}
\langle \kappa_T m, m' \rangle &= \sum_{\sigma \in C(T)} \langle \operatorname{sgn}(\sigma)\sigma m, m' \rangle \\
&= \sum_{\sigma \in C(T)} \langle m, \operatorname{sgn}(\sigma)\sigma^{-1} m' \rangle \\
&= \sum_{\sigma \in C(T)} \langle m, \operatorname{sgn}(\sigma)\sigma m' \rangle \\
&= \langle m, \kappa_T m' \rangle.
\end{aligned} \tag{C.4}
\]


C.10. Submodule Theorem. If U is a KSₙ-submodule of M^λ, then either U ⊇ S^λ or U ⊆ S^λ⊥.

Proof. Let u ∈ U and let T be a λ-tableau. By Corollary C.9, κ_T u is a multiple of e_T. If κ_T u ≠ 0, then e_T is an element of U. But S^λ is generated by e_T, and thus U ⊇ S^λ. Otherwise, we have κ_T u = 0 for each u ∈ U and each λ-tableau T. Then we have
\[ 0 = \langle \kappa_T u, \{T\} \rangle = \langle u, \kappa_T \{T\} \rangle = \langle u, e_T \rangle. \]
It follows that u belongs to S^λ⊥ and thus U ⊆ S^λ⊥. ♦

Theorem C.11. The Specht module S^λ over Q is irreducible; that is, there is no nonzero proper QSₙ-submodule of S^λ.

Proof. By the submodule theorem, any submodule U of S^λ either equals S^λ or is contained in S^λ ∩ S^λ⊥. But S^λ ∩ S^λ⊥ = 0 over Q, and thus the result follows. ♦

Lemma C.12. If φ : M^λ → M^µ is a KSₙ-homomorphism such that S^λ ⊄ ker φ, then λ ⊵ µ. In particular, if λ = µ, then the restriction of φ to S^λ amounts to multiplication by a constant.

Proof. Let T be a λ-tableau. By hypothesis, e_T ∉ ker φ and thus
\[ 0 \ne \varphi(e_T) = \kappa_T\, \varphi(\{T\}). \]

Proof. Over the field Q, we have

Sλ ∩ Sλ⊥ = 0 and Mλ = Sλ ⊕ Sλ⊥. (C.5)

Any homomorphism φ : Sλ →Mµ can thus be extended to a homomorphism φ :Mλ →Mµ by letting

it be zero on Sλ⊥. Now Lemma C.12 yields the result. ♦

Theorem C.14. The Specht modules over Q provide all irreducible QSn-modules.

Proof. If two Specht modules Sλ and Sµ are isomorphic, an isomorphism gives rise to a nonzero homomorphism φ : Sλ → Sµ ⊆ Mµ. Then by Corollary C.13, λ ⊵ µ. Similarly, µ ⊵ λ and thus λ = µ.

For any finite group G, the number of irreducible QG-modules equals the number of conjugacy classes of G. But, by Proposition C.1, the number of conjugacy classes of Sn equals the number of partitions of n. Thus by Theorem C.11, the Specht modules provide all irreducible QSn-modules. ♦

Theorem C.15. The permutation module Mµ is a direct sum of Specht modules Sλ with λ ⊵ µ (possibly with repeats). In particular, the Specht module Sµ appears exactly once in this sum.

Proof. By Maschke's Theorem, for any finite group G, each nonzero QG-module M is completely reducible; that is, M can be written as a direct sum of irreducible QG-modules. Thus we have

Mµ = ⊕_λ Sλ   (with multiplicities).

For each direct summand Sλ of Mµ, there is a nonzero QSn-homomorphism φ : Sλ → Mµ. By Corollary C.13, it follows that λ ⊵ µ. Moreover, by Eq. (C.5), the Specht module Sµ appears exactly once in this decomposition. ♦


C.5 Standard Basis of Specht Modules

A λ-tableau T is called standard if the numbers increase along the rows and down the columns of T. A λ-tabloid {T} is called standard if there is a standard tableau in its equivalence class. A λ-polytabloid eT is called standard if the tableau T is standard.

Example C.16. The standard (3,2)-tableaux are

1 2 3    1 2 4    1 2 5    1 3 4    1 3 5
4 5      3 5      3 4      2 5      2 4
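
Standard tableaux of a given shape can be enumerated by placing 1, 2, . . . , n one at a time; a minimal Python sketch (the function name standard_tableaux is an illustrative choice, not from the text) reproduces the five tableaux above:

```python
def standard_tableaux(shape):
    """Enumerate the standard tableaux of a given shape (partition) by placing
    the numbers 1, 2, ..., n one after another so that rows and columns increase."""
    n = sum(shape)
    tableau = [[None] * r for r in shape]

    def place(k):
        if k > n:
            yield [row[:] for row in tableau]
            return
        for i, row in enumerate(tableau):
            j = next((j for j, x in enumerate(row) if x is None), None)  # first free cell in row i
            if j is None:
                continue
            if i > 0 and tableau[i - 1][j] is None:  # the cell above must already be filled
                continue
            row[j] = k
            yield from place(k + 1)
            row[j] = None

    return list(place(1))

tabs = standard_tableaux((3, 2))
print(len(tabs))        # 5, as in Example C.16
for t in tabs:
    print(t)
```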

Define a total ordering on the set of λ-tabloids. Let {T} and {T′} be λ-tabloids. Write {T} < {T′} if there is a number i such that whenever j is a number with j > i, then j is in the same row of {T} and {T′}, and the number i is in a higher row of {T} than of {T′}. For instance, we have

3 4 5   <   2 4 5   <   1 4 5   <   2 3 5   <   1 3 5
1 2         1 3         2 3         1 4         2 4

Lemma C.17. Let m1, . . . , mk be elements of Mλ such that the tabloid {Ti} is the <-maximal tabloid involved in mi, 1 ≤ i ≤ k. If the tabloids {Ti} are all different, then m1, . . . , mk are linearly independent.

Proof. Assume that {T1} < . . . < {Tk}. Suppose a1m1 + . . . + akmk = 0 with a1, . . . , ak ∈ K. If aj+1 = . . . = ak = 0 for some 1 ≤ j ≤ k, then also aj = 0, since {Tj} is involved in mj but not in any mi with i < j. Therefore, a1 = . . . = ak = 0 by downward induction. ♦

Theorem C.18. The standard λ-polytabloids form a basis of the Specht module Sλ.

Proof. Let T be a standard λ-tableau. By definition of the total order on tabloids, for each tabloid {T′} involved in eT we have {T′} ≤ {T}. Thus by Lemma C.17, the set of standard λ-polytabloids is linearly independent.

Let T be a λ-tableau. Write [T] = {σT | σ ∈ C(T)} for the column equivalence class of T. The column equivalence classes are totally ordered in the same way as the row equivalence classes.

Suppose T is a non-standard λ-tableau. For each column permutation σ ∈ C(T), we have σeT = sgn(σ)eT and thus we may assume that the entries of T are increasing down the columns. Unless T is standard, there are two adjacent columns in T of the form

a1 b1
 .  .
au bu
 .  .
av bv
 .
aw

where a1 < . . . < aw, b1 < . . . < bv, and au > bu.


Take X = {au, . . . , aw} and Y = {b1, . . . , bu}. Let σ1, . . . , σk be coset representatives for the subgroup SX × SY in the group SX∪Y. The element

GX,Y = ∑_j sgn(σj) σj

in the group algebra KSn is called the Garnir element. We have GX,Y eT = 0. But GX,Y eT = ∑_j sgn(σj) eσjT and [σjT] < [T] for σj ≠ 1. Thus eT is a linear combination of polytabloids eσjT with σj ≠ 1. By induction, each polytabloid eσjT, σj ≠ 1, can be written as a linear combination of standard polytabloids. Hence, the polytabloid eT has the same property. ♦

Example C.19. To illustrate the proof, consider the tableau

T =  1 2
     4 3
     5

We have X = {4, 5} and Y = {2, 3}. This gives the Garnir element

GX,Y = 1 − (34) + (354) + (234) − (2354) + (24)(35).

The theorem implies that the dimension of the Specht module Sλ is independent of the ground field and equals the number of standard λ-tableaux. The proof shows that any λ-polytabloid can be written as an integral linear combination of standard λ-polytabloids.

Example C.20. By Example C.16, the Specht module S(3,2) has dimension 5 over any field. ♦

C.6 Young’s Rule

We have seen that each permutation module Mµ is a direct sum of Specht modules Sλ. By Theorem C.15, we have

Mµ = ⊕_{λ⊵µ} kλ,µ Sλ,

where kλ,µ denotes the number of irreducible submodules of Mµ that are isomorphic to Sλ. The direct sum of all submodules of Mµ that are isomorphic to Sλ is called the isotypic subspace belonging to the partition λ. This submodule is denoted by Vλ,µ. Thus, Vλ,µ is isomorphic to the kλ,µ-fold multiple of the Specht module Sλ. The multiplicities can be calculated by making use of a general result from representation theory.

Theorem C.21. For any finite group G, the multiplicity with which an irreducible CG-module S occurs in the irreducible decomposition of a CG-module M equals the dimension of the C-space HomCG(S, M).

As Q is a splitting field for the group Sn, the number we seek is the dimension of the space HomQSn(Sλ, Mµ). A basis of this space can be obtained by modifying the construction of the standard basis of the Specht module.

For this, it is convenient to introduce a new copy of the permutation module. To this end, we introduce tableaux with repeated entries. Let λ and µ be partitions of n. A λ-tableau t has type µ if for each number i, the number i occurs µi times in t. For instance, two (4,1)-tableaux of type (3,2) are

1 1 2 2        and        1 1 2 1
1                         2


The tableaux considered so far were all of type (1^n). Let T(λ, µ) denote the set of all λ-tableaux of type µ. In the following, let T0 be a fixed λ-tableau of type (1^n). For each tableau t ∈ T(λ, µ), let t(i) denote the entry of t that occurs in the same position as i in T0. The symmetric group Sn acts on the set T(λ, µ) by

(πt)(i) = t(π^{-1}(i)),   1 ≤ i ≤ n,  π ∈ Sn,  t ∈ T(λ, µ).

This action is simply a place permutation. For instance, if we put

T0 =  1 3 4 5        and        t =  2 2 1 1
      2                              1

then

(12) t =  1 2 1 1        and        (123) t =  2 1 1 1
          2                                    2

The symmetric group acts transitively on the set T(λ, µ), and there is an element whose stabilizer is the Young subgroup Sµ. Thus we may take Mµ to be the vector space spanned by the tableaux in T(λ, µ). For instance, in view of the (4,1)-tableau

T0 =  1 2 3 4
      5

we have the following correspondence between (3,2)-tabloids and (4,1)-tableaux of type (3,2):

1 2 3   ↔   1 1 1 2        1 3 4   ↔   1 2 1 1
4 5         2              2 5         2

Let t1 and t2 be λ-tableaux. We say that t1 and t2 are row equivalent if t2 = πt1 for some permutation π in the row stabilizer of the given λ-tableau T0; column equivalence is defined similarly.

For each λ-tableau t of type µ, define the map ϕt by

ϕt : κ{T0} ↦ κ ∑ t′,   κ ∈ KSn,

where the sum extends over all different λ-tableaux t′ of type µ that are row equivalent to t. The mapping ϕt belongs to HomKSn(Mλ, Mµ). For instance, in view of the (4,1)-tableau

T0 =  1 3 4 5
      2

and the (4,1)-tableau

t =  2 2 1 1
     1

of type (3,2),

ϕt{T0} =  2 2 1 1  +  2 1 2 1  +  2 1 1 2  +  1 2 2 1  +  1 2 1 2  +  1 1 2 2
          1           1           1           1           1           1

Let φ̂t denote the restriction of ϕt to the Specht module Sλ. We clearly have κT0 t = 0 if and only if some column of t contains two identical numbers. Thus the homomorphism φ̂t can sometimes be zero. To eliminate such trivial elements from the space HomKSn(Sλ, Mµ), we consider specific tableaux.

A λ-tableau t of type µ is called semistandard if the numbers are nondecreasing along the rows of t and strictly increasing down the columns of t. If t is a semistandard λ-tableau of type µ, then the homomorphism φ̂t is also called semistandard. For instance, there are two semistandard (4,1)-tableaux of type (2²,1),

1 1 2 2        and        1 1 2 3
3                         2


Theorem C.22. The semistandard homomorphisms φ̂t corresponding to the semistandard λ-tableaux t of type µ form a Q-basis of the space HomQSn(Sλ, Mµ).

Corollary C.23. The multiplicity of the Specht module Sλ in the permutation module Mµ equals the number of semistandard λ-tableaux of type µ.

Example C.24. The semistandard tableaux of type (3,2,2) are

1 1 1 2 2 3 3        1 1 1 2 2 3        1 1 1 2 3 3
                     3                  2

1 1 1 2 2        1 1 1 2 3        1 1 1 3 3
3 3              2 3              2 2

1 1 1 2        1 1 1 3        1 1 1 2 3
2 3 3          2 2 3          2
                              3

1 1 1 2        1 1 1 3        1 1 1        1 1 1
2 3            2 2            2 2 3        2 2
3              3              3            3 3

Thus we obtain the following decomposition of the permutation module M(3,2²) into isotypic subspaces:

M(3,2²) = 1·S(7) ⊕ 2·S(6,1) ⊕ 3·S(5,2) ⊕ 2·S(4,3) ⊕ 1·S(5,1²) ⊕ 2·S(4,2,1) ⊕ 1·S(3²,1) ⊕ 1·S(3,2²).
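
Corollary C.23 reduces such decompositions to counting semistandard tableaux, which is easy to do by brute force for small cases. A minimal Python sketch (the function name semistandard_tableaux is an illustrative choice) checks two of the multiplicities used above:

```python
from collections import Counter

def semistandard_tableaux(shape, content):
    """All semistandard tableaux of a given shape in which the entry i occurs
    content[i-1] times: rows weakly increase, columns strictly increase."""
    cells = [(i, j) for i, r in enumerate(shape) for j in range(r)]
    tableau = [[0] * r for r in shape]

    def fill(k):
        if k == len(cells):
            counts = Counter(x for row in tableau for x in row)
            if all(counts[i + 1] == c for i, c in enumerate(content)):
                yield [row[:] for row in tableau]
            return
        i, j = cells[k]
        for v in range(1, len(content) + 1):
            if j > 0 and v < tableau[i][j - 1]:    # rows weakly increasing
                continue
            if i > 0 and v <= tableau[i - 1][j]:   # columns strictly increasing
                continue
            tableau[i][j] = v
            yield from fill(k + 1)
        tableau[i][j] = 0

    return list(fill(0))

# multiplicity of S(4,1) in M(2,2,1) and of S(5,2) in M(3,2,2), cf. Corollary C.23
print(len(semistandard_tableaux((4, 1), (2, 2, 1))))   # 2
print(len(semistandard_tableaux((5, 2), (3, 2, 2))))   # 3
```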

The semistandard λ-tableaux of type µ = (1^n) are exactly the standard λ-tableaux. But the standard λ-polytabloids form a basis of the Specht module Sλ, and the permutation module M(1^n) is isomorphic to the group algebra of the group Sn.

Corollary C.25. The multiplicity of the Specht module Sλ in the group algebra QSn equals its dimension.

C.7 Representations

Let G be a finite group and K be a field. A homomorphism of the group G into a group of invertible n × n matrices over K is called a representation of G of degree n. This means that to each element g of G there is assigned a matrix ϕ(g), and if g and h are elements of G, then ϕ(gh) = ϕ(g)ϕ(h). A representation is called faithful if the homomorphism is one-to-one. Frobenius posed the problem of determining all matrix representations of a finite group.


First, there is always a representation of a finite group G. For this, take G = {g1, . . . , gn} and consider the mapping ϕ that assigns to each group element g the linear substitution gi ↦ gig, 1 ≤ i ≤ n. This linear substitution is represented by an n × n permutation matrix; that is, a matrix with entries 0 and 1 that has entry 1 in position (i, j) if and only if gig = gj. Since gig = gi if and only if g = 1, the representation is faithful. The representation ϕ is called regular. At the other extreme, there is the one-representation ι for which ι(g) = 1 for all g ∈ G.
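
As a toy illustration of this construction (not taken from the text), the following Python sketch builds the regular representation of the cyclic group of order 3 and checks the homomorphism property:

```python
G = ["e", "a", "a2"]                                # cyclic group of order 3
mult = {(G[i], G[j]): G[(i + j) % 3] for i in range(3) for j in range(3)}

def regular(g):
    """Permutation matrix with entry 1 in position (i, j) iff g_i * g = g_j."""
    return [[1 if mult[(gi, g)] == gj else 0 for gj in G] for gi in G]

def matmul(a, b):
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)] for i in range(n)]

# homomorphism property phi(gh) = phi(g)phi(h), and faithfulness
for g in G:
    for h in G:
        assert regular(mult[(g, h)]) == matmul(regular(g), regular(h))
assert regular("a") != regular("e")
print("regular representation of Z3 verified")
```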

Second, given any matrix representation of a group G we can find infinitely many. For this, let ϕ be a matrix representation of G of degree n and let B be an invertible n × n matrix. We can define

ψ(g) = Bϕ(g)B^{-1},   g ∈ G.

Clearly ψ is a matrix representation of G, since we have

ψ(gh) = Bϕ(gh)B^{-1} = Bϕ(g)ϕ(h)B^{-1} = Bϕ(g)B^{-1} Bϕ(h)B^{-1} = ψ(g)ψ(h),   g, h ∈ G.

The new representation is obtained from the given representation by a change of basis. Representations related as ϕ and ψ are called equivalent and are regarded as essentially the same representation.

Third, let ϕ and ψ be matrix representations of a group G of degree m and n, respectively. Consider the matrix

τ(g) = ( ϕ(g)   0   )
       (   0   ψ(g) )

Clearly, τ is a matrix representation of G of degree m + n, since we have

τ(g)τ(h) = ( ϕ(g)   0   ) ( ϕ(h)   0   )  =  ( ϕ(g)ϕ(h)      0      )  =  ( ϕ(gh)    0    )  =  τ(gh),   g, h ∈ G.
           (   0   ψ(g) ) (   0   ψ(h) )     (     0     ψ(g)ψ(h)  )     (    0    ψ(gh) )

The representation τ of G is called the direct sum of ϕ and ψ, and we write τ = ϕ ⊕ ψ. A representation of the form ϕ ⊕ ψ is said to be decomposable with components ϕ and ψ. A representation that is not decomposable is called indecomposable.

Fourth, the property of being indecomposable depends on the field underlying the representation. For this, consider the matrix group

G = { ( 1  0 ) ,  ( −1  1 ) ,  ( 0  −1 ) }
      ( 0  1 )    ( −1  0 )    ( 1  −1 )

This group is indecomposable over the rational field. To see this, observe that there is no 2 × 2 matrix B with rational coefficients such that

B ( −1  1 ) B^{-1}  =  ( a  0 ) ,   a, b ∈ Q.
  ( −1  0 )            ( 0  b )

If such a matrix existed, then equating traces and determinants on both sides would give a + b = −1 and ab = 1. This would lead to the quadratic equation a² + a + 1 = 0, whose solutions are primitive cube roots of unity and hence not rational. On the other hand, over the complex field we obtain


( ξ   1 ) ( −1  1 ) ( ξ   1 )^{-1}  =  ( ξ  0  )
( ξ²  1 ) ( −1  0 ) ( ξ²  1 )          ( 0  ξ² )

where ξ is a primitive cube root of unity. Thus the matrix group is decomposable over the complex field. As a second example, consider the matrix group

G = { ( 1  0 ) ,  ( 1  0 ) }
      ( 0  1 )    ( 1  1 )

whose entries lie in the field Z2 of characteristic 2. This matrix group G is also indecomposable; this can be proved along the same lines as in the previous example. However, unlike the previous example, the group G stays indecomposable even if the field is extended. A representation which remains indecomposable over every extension of the field is called absolutely indecomposable. Representation theory over fields of characteristic p > 0 is very different from the characteristic zero case when the order of the group is divisible by p.

Fifth, there is a weaker concept than decomposability. For this, let τ be a matrix representation of a group G such that for each group element g ∈ G,

τ(g) = ( A(g)   0   )
       ( I(g)  B(g) )

where A(g) and B(g) are s × s and t × t matrices, respectively. If we define ϕ(g) = A(g) and ψ(g) = B(g) for all g ∈ G, then it is easy to see that ϕ and ψ are matrix representations of G of degree s and t, respectively. The representation τ, and any representation equivalent to it, is called reducible. A representation that is not reducible is called irreducible. The representations ϕ and ψ are the constituents of τ. If either is irreducible, it is an irreducible constituent. A representation that is decomposable (I(g) = 0) is clearly reducible. The last example above shows that a reducible representation can be indecomposable. We will see by Maschke's Theorem that this cannot happen if the characteristic of the field is zero or is not a divisor of the group order. In this case, reducible representations are decomposable.

If ϕ and ψ are themselves reducible, they yield constituents of smaller degree. Continuing in this way, we obtain a set of irreducible representations ϕ1, . . . , ϕk as constituents of τ. It can be shown that these representations are uniquely determined up to equivalence for each representation τ.

Sixth, representations can be derived from representation modules; that is, modules over group algebras. For this, let G be a finite group and K be a field. A KG-module is a K-vector space V on which the group operates by K-linear transformations. Each KG-module V gives rise to a representation of the underlying group. To this end, observe that V is a vector space over K and thus has a K-basis {v1, . . . , vk}. The representation corresponding to V is the mapping ϕV that assigns to each group element g the matrix ϕV(g) = (k^g_ij) over K, whose entries are given by the action of g on the basis elements,

g vj = ∑_{i=1}^{k} k^g_ij vi,   1 ≤ j ≤ k.

A representation module V is called decomposable if the corresponding representation ϕV is decomposable. Thus a decomposable representation module can be written as a direct sum of representation modules. A representation module V is called reducible if the associated representation ϕV is reducible; otherwise, the module is called irreducible. Thus for a reducible representation module V there is a representation module that forms a proper nonzero KG-submodule of V. Equivalently, an irreducible representation module V has only two KG-submodules, the zero module and V itself.

C.26. Maschke's Theorem Let G be a finite group and K be a field whose characteristic is zero or not a divisor of the group order. For each KG-submodule U of a KG-module V, there is a KG-submodule W such that

V = U ⊕ W.

Proof. Take a K-basis v1, . . . , vr of V. There is a unique bilinear form φ on V given by

φ(vi, vj) = 1 if vi = vj, and 0 otherwise.

A new bilinear form on V can be defined by

⟨u, v⟩ = (1/|G|) ∑_{g∈G} φ(gu, gv),   u, v ∈ V.

This form is G-invariant in the sense that

⟨gu, gv⟩ = ⟨u, v⟩,   g ∈ G, u, v ∈ V.

Let U be a KG-submodule of V. Define U⊥ as the set of all elements v ∈ V such that ⟨u, v⟩ = 0 for all u ∈ U. Clearly, U⊥ is a K-subspace of V. For each u ∈ U, v ∈ U⊥, and g ∈ G, we have ⟨u, gv⟩ = ⟨g^{-1}u, v⟩ = 0, since g^{-1}u ∈ U. Thus U⊥ is a KG-submodule of V. Moreover, by hypothesis on K, for each nonzero element u ∈ U, ⟨u, u⟩ ≠ 0 in K. Hence, U ∩ U⊥ = 0. Finally, since U⊥ is the kernel of the linear map V → U* given by v ↦ ⟨·, v⟩ restricted to U, we have dim U⊥ ≥ dim V − dim U; together with U ∩ U⊥ = 0 this yields V = U ⊕ U⊥, so we may take W = U⊥. ♦
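
The averaging step in the proof is easy to see in a small example. The following Python sketch (an assumed toy case, not from the text; the helper names act, phi, and averaged are ours) averages a non-invariant bilinear form over S3 acting on K³ by permuting coordinates and checks that the result is G-invariant:

```python
from itertools import permutations

G = list(permutations((0, 1, 2)))                   # S3 acting on the coordinates of K^3

def act(g, v):
    """g sends the basis vector e_i to e_{g(i)}, so the coordinate v_i moves to slot g(i)."""
    w = [0, 0, 0]
    for i in range(3):
        w[g[i]] = v[i]
    return w

def phi(u, v):                                      # a starting form that is NOT G-invariant
    return sum((i + 1) * u[i] * v[i] for i in range(3))

def averaged(u, v):                                 # <u, v> = (1/|G|) sum_g phi(gu, gv)
    return sum(phi(act(g, u), act(g, v)) for g in G) / len(G)

u, v, g = [1, 2, 0], [0, 1, 3], (1, 2, 0)
print(phi(u, v), phi(act(g, u), act(g, v)))             # 4 and 6: phi is not invariant
print(averaged(u, v), averaged(act(g, u), act(g, v)))   # both 4.0: the averaged form is invariant
```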

Maschke's Theorem can be used to prove the following result by induction on the dimension of KG-modules.

Theorem C.27. Let G be a finite group and K be a field whose characteristic is zero or not a divisor of the group order. Each KG-module is a direct sum of irreducible KG-submodules.

Let H be a subgroup of a group G. Assume that g1, . . . , gk form a complete set of coset representatives for the subgroup H in G. Given a matrix representation ϕ of the group H of degree m, there is a matrix representation ϕGH of the group G of degree k·m defined by

ϕGH(g) = ( ϕ(g1 g g1^{-1})  . . .  ϕ(g1 g gk^{-1}) )
         (       ...        . . .        ...       )  ,   g ∈ G,
         ( ϕ(gk g g1^{-1})  . . .  ϕ(gk g gk^{-1}) )

where, in this matrix, ϕ(gi g gj^{-1}) is understood to be the zero m × m matrix whenever gi g gj^{-1} ∉ H.

We say that the matrix representation ϕGH of G is induced from the representation ϕ of H. Conversely, given a matrix representation ϕ of G, there is a matrix representation ϕH of the subgroup H, the restriction of ϕ to H, defined by

ϕH(h) = ϕ(h),   h ∈ H.

Finally, consider representation modules of the symmetric group. For this, let λ be a partition of n. The permutation module Mλ has the basis {σ1{T}, . . . , σl{T}}, where T is a λ-tableau with row stabilizer Sλ and the permutations σ1, . . . , σl are the left coset representatives of Sλ in Sn. For each permutation π ∈ Sn, we have πσj = σ_{ij} τ_{ij} for some τ_{ij} ∈ Sλ and a unique left coset representative σ_{ij}, and thus πσj{T} = σ_{ij}{T}. Hence, the representation of Mλ assigns to each permutation π the permutation matrix ϕ(π) that has entries 1 in exactly the positions (ij, j), 1 ≤ j ≤ l. This is the reason that the representation module Mλ is referred to as a permutation module. For instance, the Young subgroup S(2,1) = S{1,2} × S{3} has the coset representatives σ1 = (1), σ2 = (13), and σ3 = (23) in S3. The permutation module Mλ can be considered as being induced from the one-dimensional QSλ-module Q{T}, where the λ-tableau T has row stabilizer Sλ.

The Specht module Sλ forms an irreducible representation module of the group Sn. A basis of this module is given by the standard λ-polytabloids eT1, . . . , eTk. The representation corresponding to the Specht module Sλ is the mapping ϕλ that assigns to each group element π the matrix ϕλ(π) = (k^π_ij), whose entries are given by

π eTj = eπTj = ∑_{i=1}^{k} k^π_ij eTi,   1 ≤ j ≤ k.

As an example, consider the Specht module S(2,1), which forms an irreducible representation module of the group S3. The standard basis of S(2,1) is given by the two polytabloids eT1 and eT2 associated with the standard tableaux

T1 =  1 2        and        T2 =  1 3
      3                           2

namely

eT1 =  1 2  −  2 3        and        eT2 =  1 3  −  2 3
       3       1                            2       1

For the unit element π = (1), we have

(1) eT1 = eT1        and        (1) eT2 = eT2.

Thus, for the irreducible representation ϕλ,

ϕλ((1)) = ( 1  0 )
          ( 0  1 )

For the transposition π = (12), we have

(12) eT1 = e(12)T1 =  1 2  −  1 3  =  eT1 − eT2
                      3       2

and

(12) eT2 = e(12)T2 =  2 3  −  1 3  =  −eT2.
                      1       2

Thus, for the irreducible representation ϕλ,

ϕλ((12)) = (  1   0 )
           ( −1  −1 )

Moreover, for the permutation π = (123), we have

(123) eT1 = e(123)T1 =  2 3  −  1 3  =  −eT2
                        1       2

and

(123) eT2 = e(123)T2 =  1 2  −  1 3  =  eT1 − eT2.
                        3       2

Thus, for the irreducible representation ϕλ,

ϕλ((123)) = (  0   1 )
            ( −1  −1 )
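
As a quick sanity check, the two non-identity matrices just computed satisfy the defining relations s² = r³ = (sr)² = 1 of S3 for s = (12) and r = (123); a short Python verification (an illustrative sketch, not part of the text):

```python
def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(2)) for j in range(2)] for i in range(2)]

I2 = [[1, 0], [0, 1]]
s = [[1, 0], [-1, -1]]      # the matrix of (12) computed above
r = [[0, 1], [-1, -1]]      # the matrix of (123) computed above

assert matmul(s, s) == I2                           # s^2 = 1
assert matmul(r, matmul(r, r)) == I2                # r^3 = 1
assert matmul(matmul(s, r), matmul(s, r)) == I2     # (sr)^2 = 1
print("defining relations of S3 verified")
```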

C.8 Characters

The matrices representing a group allow the introduction of numerical functions on the group. These functions are called characters and play a vital role in the theory. Let A = (aij) be an n × n matrix over a field K. The trace of A is the sum of the diagonal entries,

tr A = ∑_{i=1}^{n} aii.

If B is another n × n matrix over K, then a direct calculation shows that

tr AB = tr BA.

Thus, if B is nonsingular, then

tr B^{-1}AB = tr A.

The character of a representation ϕ of a finite group G is the function χϕ from G to the field of representation K given by

χϕ(g) = tr ϕ(g),   g ∈ G.

If two representations ϕ and ψ of a group G are equivalent, then there is a nonsingular matrix B such that ψ(g) = Bϕ(g)B^{-1} for each element g ∈ G. But then

χψ(g) = tr ψ(g) = tr Bϕ(g)B^{-1} = tr ϕ(g) = χϕ(g),   g ∈ G.

Thus, equivalent representations have the same character. If τ is a reducible representation, then for some nonsingular matrix B,

Bτ(g)B^{-1} = ( ϕ(g)   0   )  ,   g ∈ G.
              ( I(g)  ψ(g) )

By taking traces on both sides, we obtain

χτ(g) = χϕ(g) + χψ(g),   g ∈ G.

Thus the character of each reducible representation is the sum of the characters of its constituents. By Theorem C.27, if G is a finite group and K is a field whose characteristic is zero or not a divisor of the group order, then each character of a representation of G is a sum of irreducible characters; that is, characters of irreducible representations.

Let ϕ be a representation of a finite group G, and let g and h be two conjugate elements of G; that is, h = tgt^{-1} for some group element t ∈ G. We have

χϕ(h) = χϕ(tgt^{-1}) = tr ϕ(tgt^{-1}) = tr ϕ(t)ϕ(g)ϕ(t)^{-1} = tr ϕ(g) = χϕ(g).

It follows that characters have the same value on elements of the same conjugacy class. For this reason, characters are class functions.

The character of the one-representation ι of a group G is the constant function with value 1; it is denoted by 1G and called the trivial character of G.

The character table of a finite group G is a matrix with columns indexed by the conjugacy classes of G and rows indexed by the inequivalent irreducible representations of G. The entry of the character table corresponding to an irreducible representation ϕ and a conjugacy class C is χϕ(g) for some g ∈ C; that is, the value of the character of ϕ on the conjugacy class. One row and one column of the character table can be filled in directly. For this, let C1 = {1} be the conjugacy class of the unit element 1 of the group G. For each representation ϕ, the matrix ϕ(1) at the unit element is the identity matrix whose size is given by the dimension of the corresponding representation module. Thus, the value χ(1) at the identity element equals the dimension of the representation module. Moreover, let V1 be the trivial KG-module corresponding to the one-representation; that is, gv = v for each v ∈ V1. The character of the one-representation is the trivial character ε given by ε(g) = 1 for each group element g in G. Thus the character table looks as follows:

        C1        C2      . . .   Cr
V1      1         1       . . .   1
V2      dim V2
...
Vs      dim Vs


For instance, the group S3 has three conjugacy classes, C1 = {(1)}, C2 = {(12), (13), (23)}, and C3 = {(123), (132)}. Moreover, there are three Specht modules corresponding to the partitions of 3, namely S(3), S(2,1), and S(1³). The corresponding character table is

            C1     C2     C3
S(3)         1      1      1
S(2,1)       2      0     −1
S(1³)        1     −1      1

The entries in the second row were calculated in the previous section.
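
The rows of this table satisfy the orthogonality relations of character theory (a standard fact, not derived in the text); checking them is a convenient way to validate such a table. A minimal Python sketch:

```python
classes = {"C1": 1, "C2": 3, "C3": 2}               # class sizes in S3
chi = {
    "S(3)":    {"C1": 1, "C2": 1,  "C3": 1},
    "S(2,1)":  {"C1": 2, "C2": 0,  "C3": -1},
    "S(1^3)":  {"C1": 1, "C2": -1, "C3": 1},
}
order = sum(classes.values())                       # |S3| = 6
for a in chi:
    for b in chi:
        inner = sum(classes[c] * chi[a][c] * chi[b][c] for c in classes) / order
        assert inner == (1 if a == b else 0)
print("row orthogonality of the S3 character table verified")
```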

Let H be a subgroup of a group G, and assume that g1, . . . , gk form a complete set of coset representatives for the subgroup H in G. Let ϕGH be a matrix representation of G that is induced from a representation ϕ of H. The character of the induced representation ϕGH can be expressed in terms of the character χH of the representation ϕ of H as follows,

χGH(g) = ∑_{i=1}^{k} χ′H(gi g gi^{-1}),   g ∈ G,

       = (1/|H|) ∑_{i=1}^{k} ∑_{h∈H} χ′H(h gi g gi^{-1} h^{-1})

       = (1/|H|) ∑_{x∈G} χ′H(x g x^{-1}),

where

χ′H(g) = χH(g) if g ∈ H, and 0 otherwise.

Notice that the second equation makes use of the identity χ′H(gi g gi^{-1}) = χ′H(h gi g gi^{-1} h^{-1}), since gi g gi^{-1} ∈ H if and only if h gi g gi^{-1} h^{-1} ∈ H, and the third equation uses the fact that each element of G can be uniquely written in the form h gi for some h ∈ H and 1 ≤ i ≤ k.
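
For a concrete case, the formula can be evaluated directly for the trivial character of the Young subgroup S(2,1) = S{1,2} × S{3} in S3; the values 3, 1, 0 obtained below are the character of the permutation module M(2,1) = S(3) ⊕ S(2,1). A minimal Python sketch (the helper names compose, inverse, and induced are ours, not from the text):

```python
from itertools import permutations

def compose(p, q):                    # (p*q)(i) = p(q(i)); permutations as tuples on {1, 2, 3}
    return tuple(p[q[i] - 1] for i in range(3))

def inverse(p):
    inv = [0, 0, 0]
    for i, v in enumerate(p):
        inv[v - 1] = i + 1
    return tuple(inv)

G = list(permutations((1, 2, 3)))
H = [p for p in G if p[2] == 3]                     # the Young subgroup S_{1,2} x S_{3}
chi_H = lambda g: 1 if g in H else 0                # trivial character of H, extended by zero

def induced(g):                                     # (1/|H|) * sum_{x in G} chi'_H(x g x^{-1})
    return sum(chi_H(compose(compose(x, g), inverse(x))) for x in G) // len(H)

for g in [(1, 2, 3), (2, 1, 3), (2, 3, 1)]:         # one element per conjugacy class of S3
    print(g, induced(g))                            # 3, 1, 0
```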

C.9 Characters of the Symmetric Group

We provide a procedure for determining the character table of the symmetric group. For this, we define a bilinear form on the set of characters of a finite group G. To this end, we assume that the field K is algebraically closed with characteristic 0, or with characteristic p > 0 such that the group order is not divisible by p. Let U and V be irreducible representation modules of G with characters χU and χV. We put

⟨χU, χV⟩ = 1 if U and V are equivalent, and 0 otherwise.

Since, by hypothesis, each representation module is a direct sum of irreducible representation modules, we can linearly extend the form to obtain a bilinear form on the set of characters of G.


C.28. Frobenius Reciprocity Theorem Let ϕ and ψ be absolutely irreducible representations of a group G and of a subgroup H, respectively. If χ and χ′ denote the characters of ϕ and ψ, respectively, then

⟨χ, χ′G⟩ = ⟨χH, χ′⟩.

The theorem says that the representation ψG induced by ψ contains the irreducible representation ϕ with the same multiplicity as the restriction ϕH of ϕ to H contains the irreducible representation ψ. The following calculations will make use of this theorem.

Let µ be a partition of n. By Theorem C.15, the permutation module Mµ decomposes into a direct sum of Specht modules,

Mµ = ⊕_{λ⊵µ} kλ,µ Sλ,

where kλ,µ ≥ 0 denotes the multiplicity with which the module Sλ occurs in the module Mµ. In particular, we have kλ,λ = 1. Let χλ denote the character of the Specht module Sλ and let 1µ ↑ Sn denote the character of the permutation module Mµ. It follows that

⟨1µ ↑ Sn, χλ⟩ = kλ,µ if λ ⊵ µ, and 0 otherwise.

Consider the matrix K = (kλ,µ). Since the dominance order can be embedded into the dictionary order, the matrix K is lower triangular. Consequently, the matrix B = (bλ,µ) defined by

bλ,µ = |Sµ| ⟨χλ, 1µ ↑ Sn⟩

is upper triangular. In particular, we have

bλ,λ = |Sλ| ⟨χλ, 1λ ↑ Sn⟩ = |Sλ| = ∏_i λi!.

Let Cµ denote the conjugacy class of Sn corresponding to the partition µ, and let A = (aλ,µ) be the matrix given by

aλ,µ = |Sλ ∩ Cµ|.

In particular, we have

aλ,λ = |Sλ ∩ Cλ| = ∏_i (λi − 1)!,

since the elements of Sλ ∩ Cλ are exactly the permutations consisting of a single λi-cycle on each block of the Young subgroup Sλ, such as π = (1, . . . , λ1)(λ1 + 1, . . . , λ1 + λ2) · · ·. Once the matrix A is known, the character table C = (cλ,µ) of Sn can be calculated by straightforward matrix manipulation. To see this, first note that

∑_µ cλ,µ aν,µ = ∑_µ χλ(Cµ) · |Sν ∩ Cµ|
             = |Sν| · ⟨χλ ↓ Sν, 1ν⟩
             = |Sν| · ⟨χλ, 1ν ↑ Sn⟩
             = bλ,ν.


Therefore, B = C A^T. Second, we have

∑_µ bµ,λ bµ,ν = |Sλ| · |Sν| · ⟨1λ ↑ Sn, 1ν ↑ Sn⟩
             = |Sλ| · |Sν| · ⟨1λ ↑ Sn ↓ Sν, 1ν⟩
             = |Sλ| · ∑_µ 1λ ↑ Sn(gµ) · |Sν ∩ Cµ|,   gµ ∈ Cµ,
             = ∑_µ (n!/|Cµ|) · |Sλ ∩ Cµ| · |Sν ∩ Cµ|
             = ∑_µ (n!/|Cµ|) · aλ,µ · aν,µ,

where the next-to-last equation uses the value of the permutation character, 1λ ↑ Sn(gµ) = [Sn : Sλ] · |Sλ ∩ Cµ| / |Cµ|. Thus if the matrix A is known, the matrix B can be calculated. Moreover, the matrix A is invertible and hence we obtain the character table of Sn as

C = B (A^T)^{-1}.

For instance, in the case n = 5, we have

K =
            (5)  (4,1)  (3,2)  (3,1²)  (2²,1)  (2,1³)  (1⁵)
[5]          1
[4][1]       1     1
[3][2]       1     1      1
[3][1]²      1     2      1      1
[2]²[1]      1     2      2      1       1
[2][1]³      1     3      3      3       2       1
[1]⁵         1     4      5      6       5       4       1

The decompositions like M(4,1) = S(5) ⊕ S(4,1) and M(3,2) = S(5) ⊕ S(4,1) ⊕ S(3,2) can be directly obtained from Corollary C.23. But Young's rule shows how to evaluate the matrix K directly. Moreover, we have

A =
            (5)  (4,1)  (3,2)  (3,1²)  (2²,1)  (2,1³)  (1⁵)
(5)          24    30     20     20      15      10      1
(4,1)               6      0      8       3       6      1
(3,2)                      2      2       3       4      1
(3,1²)                            2       0       3      1
(2²,1)                                    1       2      1
(2,1³)                                            1      1
(1⁵)                                                     1

B =
            (5)  (4,1)  (3,2)  (3,1²)  (2²,1)  (2,1³)  (1⁵)
(5)         120    24     12      6       4       2      1
(4)(1)             24     12     12       8       6      4
(3)(2)                    12      6       8       6      5
(3)(1)²                           6       4       6      6
(2)²(1)                                   4       4      5
(2)(1)³                                           2      4
(1)⁵                                                     1

and

C =
            (5)  (4,1)  (3,2)  (3,1²)  (2²,1)  (2,1³)  (1⁵)
(5)           1     1      1      1       1       1      1
(4)(1)       −1     0     −1      1       0       2      4
(3)(2)        0    −1      1     −1       1       1      5
(3)(1)²       1     0      0      0      −2       0      6
(2)²(1)       0     1     −1     −1       1      −1      5
(2)(1)³      −1     0      1      1       0      −2      4
(1)⁵          1    −1     −1      1       1      −1      1

where the first column lists the characters at the 5-cycles, the second at the products of a 4-cycle and a 1-cycle, the third at the products of a 3-cycle and a 2-cycle, the fourth at the products of a 3-cycle and two 1-cycles, the fifth at the products of two 2-cycles and a 1-cycle, the sixth at the products of a 2-cycle and three 1-cycles, and the last at the identity element, which has cycle type 1⁵. The last column entries χλ(id) are the dimensions of the Specht modules.
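
The columns of the table C satisfy the column orthogonality relations ∑_λ χλ(Cµ)χλ(Cν) = δµ,ν · n!/|Cµ| (standard character theory, not derived above); they provide a quick consistency check on the computed table. A minimal Python sketch:

```python
class_sizes = [24, 30, 20, 20, 15, 10, 1]   # |C_mu| for (5),(4,1),(3,2),(3,1^2),(2^2,1),(2,1^3),(1^5)
C = [
    [ 1,  1,  1,  1,  1,  1, 1],            # chi^(5)
    [-1,  0, -1,  1,  0,  2, 4],            # chi^(4,1)
    [ 0, -1,  1, -1,  1,  1, 5],            # chi^(3,2)
    [ 1,  0,  0,  0, -2,  0, 6],            # chi^(3,1^2)
    [ 0,  1, -1, -1,  1, -1, 5],            # chi^(2^2,1)
    [-1,  0,  1,  1,  0, -2, 4],            # chi^(2,1^3)
    [ 1, -1, -1,  1,  1, -1, 1],            # chi^(1^5)
]
for mu in range(7):
    for nu in range(7):
        s = sum(C[lam][mu] * C[lam][nu] for lam in range(7))
        assert s == (120 // class_sizes[mu] if mu == nu else 0)
print("column orthogonality of the S5 character table verified")
```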

C.10 Dimension of Specht Modules

Let µ be a partition of n. The permutation module Mµ decomposes into a direct sum of Specht modules,

Mµ = ⊕_{λ⊵µ} kλ,µ Sλ,

where kλ,µ ≥ 0 denotes the multiplicity with which the module Sλ occurs in the module Mµ. In particular, we have kλ,λ = 1. Since the matrix K = (kλ,µ) is lower triangular with 1's down the diagonal, it can be inverted; the inverse matrix is also lower triangular with 1's down the diagonal. To see what this means, write the above decomposition in the form

[µ1][µ2] . . . = ⊕_λ kλ,µ [λ];

then we obtain

[λ] = ∑_µ (k^{-1})λ,µ [µ1][µ2] . . . .

For instance, the inverse of the matrix K for the group S5 is given as


            [5]  [4][1]  [3][2]  [3][1]²  [2]²[1]  [2][1]³  [1]⁵
(5)           1
(4,1)        −1      1
(3,2)         0     −1       1
(3,1²)        1     −1      −1       1
(2²,1)        0      1      −1      −1        1
(2,1³)       −1      1       2      −1       −2        1
(1⁵)          1     −2      −2       3        3       −4       1

The inverse matrix of K can be calculated as follows.

C.29. The Determinantal Form If λ is a partition with k non-zero parts, then

[λ] = det ([λi − i + j])_{i,j=1,...,k},

where [m] = 0 if m < 0.

To form the determinant for [λ], put [λ1], [λ2], . . . in order down the diagonal, and then let the numbers increase by 1 in each row to the right of the diagonal and decrease by 1 in each row to the left of the diagonal. The element [0] serves as the multiplicative identity.

For instance, we have

[4, 1] = det ( [4]  [5] )  =  [4][1] − [5]
             ( [0]  [1] )

and

[3, 2] = det ( [3]  [4] )  =  [3][2] − [4][1].
             ( [1]  [2] )

If the determinantal form holds for partitions with two non-zero parts, it can be extended to partitions with three non-zero parts, say by expanding along the last column as follows:

[3, 1²] = det ( [3]   [4]  [5] )
              ( [0]   [1]  [2] )
              ( [−1]  [0]  [1] )

        = [5] det ( [0]  [1] )  −  [2] det ( [3]  [4] )  +  [1] det ( [3]  [4] )
                  (  0   [0] )             (  0   [0] )             ( [0]  [1] )

        = [5] − [3][2] + [3][1]² − [4][1].

Corollary C.30. The Specht module Sλ, where λ is a partition of n with k non-zero parts, has the dimension

dim Sλ = n! · det ( 1/(λi − i + j)! )_{i,j=1,...,k},

where 1/r! = 0 if r < 0.
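
For small partitions the determinant can be evaluated exactly over the rationals. A minimal Python sketch (the function name specht_dimension_det is an illustrative choice) reproduces dim S(3,2) = 5 and dim S(4,1) = 4:

```python
from fractions import Fraction
from math import factorial

def specht_dimension_det(shape):
    """dim S^lambda = n! * det(1/(lambda_i - i + j)!), with 1/r! = 0 for r < 0 (Corollary C.30)."""
    n, k = sum(shape), len(shape)
    inv_fact = lambda r: Fraction(0) if r < 0 else Fraction(1, factorial(r))
    m = [[inv_fact(shape[i] - (i + 1) + (j + 1)) for j in range(k)] for i in range(k)]

    def det(a):                                   # Laplace expansion along the first row
        if len(a) == 1:
            return a[0][0]
        return sum((-1) ** j * a[0][j] * det([row[:j] + row[j + 1:] for row in a[1:]])
                   for j in range(len(a)))

    return factorial(n) * det(m)

print(specht_dimension_det((3, 2)))   # 5
print(specht_dimension_det((4, 1)))   # 4
```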

The dimension of the Specht module Sλ can also be calculated by using hooks. The (i, j)-hook of the diagram [λ] consists of the (i, j)-node along with the λi − j nodes to the right of it, the hook's arm, and the λ′j − i nodes below it, the hook's leg. The length of the (i, j)-hook is the number of nodes involved,

hij = λi + λ′j + 1 − i − j.

If we replace the (i, j)-node in the diagram [λ] by the number hij for each node, we obtain the hook graph. For instance, the partition λ = (3, 2) has the diagram and hook graph

X X X        4 3 1
X X          2 1

C.31. The Hook Formula The dimension of the Specht module Sλ is given by

dim Sλ = n! / ∏ (hook lengths in [λ]).

For instance, by the above hook graph for the partition λ = (3, 2), the Specht module S(3,2) has the dimension

dim S(3,2) = 5! / (4 · 3 · 1 · 2 · 1) = 5.
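
Both the hook graph and the hook formula are straightforward to compute. A minimal Python sketch (the helper names hook_lengths and specht_dimension are ours, not from the text):

```python
from math import factorial, prod

def hook_lengths(shape):
    """Hook graph of a partition: h_ij = lambda_i + lambda'_j + 1 - i - j (1-based indices)."""
    conj = [sum(1 for r in shape if r > j) for j in range(shape[0])]   # conjugate partition
    return [[shape[i] + conj[j] - i - j - 1 for j in range(shape[i])] for i in range(len(shape))]

def specht_dimension(shape):
    """Hook formula: dim S^lambda = n! / product of all hook lengths."""
    hooks = [h for row in hook_lengths(shape) for h in row]
    return factorial(sum(shape)) // prod(hooks)

print(hook_lengths((3, 2)))       # [[4, 3, 1], [2, 1]]
print(specht_dimension((3, 2)))   # 5
```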

Proof. We show the result for partitions with three non-zero parts. Since the first-column hook lengths are hi1 = λi + 3 − i, the matrix entries of Corollary C.30 in row i are 1/(hi1 − 2)!, 1/(hi1 − 1)!, and 1/hi1!. Hence

dim Sλ / n! = det ( 1/(h11−2)!  1/(h11−1)!  1/h11! )
                  ( 1/(h21−2)!  1/(h21−1)!  1/h21! )
                  ( 1/(h31−2)!  1/(h31−1)!  1/h31! )

            = 1/(h11! h21! h31!) · det ( h11(h11−1)  h11  1 )
                                       ( h21(h21−1)  h21  1 )
                                       ( h31(h31−1)  h31  1 )

            = (h11 − h21)(h11 − h31)(h21 − h31) / (h11! h21! h31!)

            = 1/(h11! h21! h31!) · det ( (h11−1)(h11−2)  h11−1  1 )
                                       ( (h21−1)(h21−2)  h21−1  1 )
                                       ( (h31−1)(h31−2)  h31−1  1 )

            = 1/(h11 h21 h31) · det ( 1/(h11−3)!  1/(h11−2)!  1/(h11−1)! )
                                    ( 1/(h21−3)!  1/(h21−2)!  1/(h21−1)! )
                                    ( 1/(h31−3)!  1/(h31−2)!  1/(h31−1)! )

            = 1/(h11 h21 h31) · 1/∏ (hook lengths in [λ1 − 1, λ2 − 1, λ3 − 1])

            = 1/∏ (hook lengths in [λ]),

where the next-to-last equation follows by making use of the induction hypothesis. ♦
