
USING ULAM’S METHOD TO TEST FOR MIXING

By

AARON CARL SMITH

A DISSERTATION PRESENTED TO THE GRADUATE SCHOOL OF THE UNIVERSITY OF FLORIDA IN PARTIAL FULFILLMENT

OF THE REQUIREMENTS FOR THE DEGREE OF DOCTOR OF PHILOSOPHY

UNIVERSITY OF FLORIDA

2010


© 2010 Aaron Carl Smith


I dedicate this to Bridgett for her support while I pursued my goals.


ACKNOWLEDGMENTS

I thank Professor Boyland for his guidance, and members of my supervisory

committee for their mentoring. I thank Bridgett and Akiko for their love and patience. I

needed the support they gave to reach my goal.


TABLE OF CONTENTS

ACKNOWLEDGMENTS

LIST OF TABLES

ABSTRACT

CHAPTER

1 INTRODUCTION
   1.1 Hypotheses for Testing
   1.2 Testing Procedure

2 ERGODIC THEORY AND MARKOV SHIFTS
   2.1 Ergodic Theory
   2.2 Markov Shifts

3 STOCHASTIC AND DOUBLY-STOCHASTIC MATRICES
   3.1 Doubly-Stochastic Matrices
   3.2 Additional Properties of Stochastic Matrices

4 ESTIMATING THE RATE OF MIXING
   4.1 The Jordan Canonical Form of Stochastic Matrices
   4.2 Estimating Mixing Rate

5 PROBABILISTIC PROPERTIES OF DS-MATRICES
   5.1 Random DS-Matrices
   5.2 Metric Entropy of Markov Shifts with Random Matrices

6 PARTITION REFINEMENTS
   6.1 Equal Measure Refinements
   6.2 A Special Class of Refinements

7 PROBABILISTIC PROPERTIES OF PARTITION REFINEMENTS
   7.1 Entries of a Refinement Matrix
   7.2 The Central Tendency of Refinement Matrices
   7.3 Metric Entropy After Equal Measure Refinement

8 ULAM MATRICES
   8.1 Building the Stochastic Ulam Matrix
   8.2 Properties of Ulam Matrices

9 CONVERGENCE TO AN OPERATOR
   9.1 Stirring Protocols as Operators and Operator Eigenfunctions
   9.2 Convergence Results

10 DECAY OF CORRELATION
   10.1 Comparing Our Test to Decay of Correlation
   10.2 A Conjecture About Mixing Rate

11 CRITERIA FOR WHEN MORE DATA POINTS ARE NEEDED
   11.1 Our Main Criteria for When More Data Points Are Needed
   11.2 Other Criteria for When More Data Points Are Needed

12 PROBABILITY DISTRIBUTIONS OF DS-MATRICES
   12.1 Conditional Probability Distributions
   12.2 Approximating Probability Distributions
      12.2.1 The Dealer's Algorithm
      12.2.2 Full Convex Combinations
      12.2.3 Reduced Convex Combinations
      12.2.4 The DS-Greedy Algorithm
      12.2.5 Using the Greedy DS-Algorithm
      12.2.6 DS-Matrices Arising from Unitary Matrices

13 EXAMPLES
   13.1 The Reflection Map
   13.2 Arnold's Cat Map
   13.3 The Sine Flow Map (parameter 8/5)
   13.4 The Sine Flow Map (parameter 4/5)
   13.5 The Baker's Map
   13.6 The Chirikov Standard Map (parameter 0)

REFERENCES

BIOGRAPHICAL SKETCH


LIST OF TABLES

13-1 The Reflection Map

13-2 Arnold's Cat Map

13-3 The Sine Flow Map (parameter 8/5)

13-4 The Sine Flow Map (parameter 4/5)

13-5 The Baker's Map

13-6 The Chirikov Standard Map (parameter 0)


Abstract of Dissertation Presented to the Graduate School of the University of Florida in Partial Fulfillment of the Requirements for the Degree of Doctor of Philosophy

USING ULAM’S METHOD TO TEST FOR MIXING

By

Aaron Carl Smith

August 2010

Chair: Philip Boyland
Major: Mathematics

Ulam’s method is a popular technique to discretize and approximate a continuous

dynamical system. We propose a statistical test using Ulam’s method to evaluate the

hypothesis that a dynamical system with a measure-preserving map is weak-mixing.

The test statistic is the second largest eigenvalue of a Monte Carlo stochastic matrix that

approximates a doubly-stochastic Ulam matrix of the dynamical system. This eigenvalue

leads to a mixing rate estimate. Our approach requires only one experiment while the

most common method to evaluate the hypothesis, decay of correlation, requires iterated

experiments. Currently, time of computation determines how many Monte Carlo points

to use; we present a method based on desired accuracy and risk tolerance to decide

how many points should be used. Our test has direct application to mixing relatively

incompressible fluids, such as water and chocolate. Stirring protocols of compression

resistant fluids may be modeled with measure-preserving maps. Our test evaluates if a

stirring protocol mixes and can compare mixing rates between different protocols.


CHAPTER 1
INTRODUCTION

If we need to evaluate a stirring protocol’s ability to mix an incompressible fluid in a

closed system, but cannot do so analytically, Ulam’s method provides a Markov shift that

approximates the protocol. Since the fluid is incompressible, the stirring protocol defines

a measure preserving map where the volume of a region defines its measure. For a

function to be a stirring protocol, the function must be volume (measure) preserving.

Otherwise a closed system could have more or less mass after mixing than before.

Let D be the incompressible fluid we wish to mix (D is our domain); D is a bounded and connected subset of R^k, k ∈ N, B is the Borel σ-algebra of D, and µ is the uniform probability measure (rescaled Lebesgue measure) on (D,B). The function f : D → D is defined by the given stirring protocol. Our dynamical system is (D,B,µ, f).

If a stirring protocol mixes, then the concentration of an ingredient will become

constant as the protocol is iterated. When a solution is mixed, the amount of an

ingredient in a region becomes proportional to the volume of the region. If we partition

the fluid into n parts, we may use an n × n stochastic matrix to represent how the stirring

protocol moves an ingredient; if the protocol mixes, then the matrix will become a rank

one matrix with all rows being equal as the protocol is iterated (rows of the rank one

matrix correspond to the measure of partition sets). When the partition is a Markov

partition, we may use powers of the matrix to represent how iterations of the stirring

protocol move an ingredient; powers of a stochastic matrix converge to a rank one

matrix if and only if the second largest eigenvalue is less than one in magnitude. Since

the fluid is partitioned into n sets, it is natural to partition the fluid into sets with volume 1/n. When we partition a stirring protocol in this manner, the approximating matrix and its

transpose will be stochastic.
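As a quick numerical illustration (the 3 × 3 matrix below is an arbitrary doubly-stochastic example, not one from the text), the following Python/NumPy sketch checks both claims: the second largest eigenvalue magnitude is below one, and high powers of the matrix approach the rank one matrix with all rows equal.

    import numpy as np

    P = np.array([[0.6, 0.3, 0.1],
                  [0.2, 0.5, 0.3],
                  [0.2, 0.2, 0.6]])                       # illustrative ds-matrix
    lambda2 = np.sort(np.abs(np.linalg.eigvals(P)))[::-1][1]
    print(lambda2)                                        # < 1, so powers converge

    Pbar = np.full((3, 3), 1.0 / 3.0)                     # rank one matrix, all rows equal
    print(np.abs(np.linalg.matrix_power(P, 50) - Pbar).max())   # essentially 0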

The stochastic matrix that approximates the stirring protocol defines a Markov shift,

so the Markov shift approximates and models how stirring iterations move particles from


partition set to partition set. The Markov shift as a dynamical system approximates the

dynamical system defined by stirring, so we may use the Markov shift to make decisions

about (D,B,µ, f).

We will call the matrix generated by Ulam's method an Ulam matrix; after rescaling

the rows of an Ulam matrix we will call the resulting matrix an Ulam stochastic

matrix. We will use the Markov shift defined by the Ulam stochastic matrix to decide

if (D,B,µ, f ) is mixing. The magnitude of the second largest eigenvalue of the

Ulam stochastic matrix provides a test statistic to accept or reject the protocol as

weak-mixing. Also if we accept the protocol as mixing, the second largest eigenvalue

and the dimensions of the Ulam stochastic matrix provide an estimate of the rate of

mixing. From now on when we refer to eigenvalue size, we mean the magnitude of the

eigenvalues.

Ulam’s method has been used to study hyperbolic maps [1], find attractors [2],

and approximate random and forced dynamical systems [3]. Stan Ulam proposed what

we call Ulam's method in his 1964 book Problems in Modern Mathematics [4]; the

procedure gives a discretization of a Perron-Frobenius operator (transfer operator). The

procedure provides a superior method of estimating long term distributions and natural

invariant measures of deterministic systems [3]. Eigenvalues and their corresponding

eigenfunctions of hyperbolic maps can reveal important persistent structures of a dynamical system, such as almost-invariant sets [1, 5]. Ulam's finite approximation of

absolutely continuous invariant measures of systems defined by random compositions

of piecewise monotonic transformations converges [6]. Ulam’s method may be used

to estimate the measure-theoretic entropy of uniformly hyperbolic maps on smooth

manifolds and obtain numerical estimates of physical measures, Lyapunov exponents,

decay of correlation and escape rates for everywhere expanding maps, Anosov

maps, maps that are hyperbolic on an attracting invariant set, and hyperbolic on a

non-attracting invariant set [7].


In this paper, we present the testing procedure and hypotheses of testing early on; we do this to give an overview of the paper, help the reader understand the goals of this work, and provide easy reference. Background information about ergodic theory and

Markov shifts is provided to establish our notation and review critical concepts. Since

the stirring protocol is modeled with a doubly-stochastic matrix, a review of stochastic

and doubly-stochastic matrix properties is presented. When the approximating Markov

shift defines a mixing dynamical system, the rate at which the matrix converges to

a rank one matrix provides an estimate of (D,B,µ, f )’s mixing rate; construction of

the estimate is provided to justify its utility. When a decision is made based upon observations, statistics can establish confidence in the decision; the approximating matrix is treated as a random variable so that statistics may be used. Properties of

random doubly-stochastic matrices are given to illuminate the approximating matrix

as a random variable. How does the Markov shift change if the partition is refined?

Will a sequence of partitions lead to a sequence of Markov shift dynamical systems

that converge? What will the convergence rate be? As a first step to answering

these questions, we look at the relationship between Markov shifts before and after

a refinement, then investigate the probabilistic properties of random partitions. To

be consistent, we only consider refinements that have partition sets of equal volume. Ulam's method gives us an approximating matrix that usually cannot be observed directly;

numerical or statistical observations can approximate this matrix with a Monte Carlo

technique; proof that the Monte Carlo technique converges is provided. Ulam’s method

converges to an operator [6]; similarities between the approximating matrix and the

target operator are established, and a proof of convergence with respect to weak-mixing is given. Decay of correlation is a well-established measure of mixing; the second largest eigenvalue of the approximating matrix and decay of correlation are different measures of mixing. Decay of correlation is a better measure of mixing, but it requires a sequence of stirring iterations, while the second largest eigenvalue requires only one iteration. If the partition is a


Markov partition, then the two measures are equivalent. One must decide whether the sample size is sufficient when using a Monte Carlo technique; we propose using a modified chi-squared test to evaluate sufficiency after building the approximating matrix. Statistics requires probability distributions of observations, so several statistical and Monte Carlo methods of distribution estimation are given.

1.1 Hypotheses for Testing

The hypotheses for testing are

1. Ho : (D,B,µ, f ) is not ergodic (and hence not mixing).

2. Ha1 : (D,B,µ, f ) is ergodic but not weak-mixing.

3. Ha2 : (D,B,µ, f ) is weak-mixing (and hence ergodic).

We will define ergodic, weak-mixing and strong-mixing in the Ergodic Theory and

Markov Shifts chapter. We refer to Ho as the null hypothesis, Ha1 and Ha2 are called

alternative hypotheses. The procedure we present does not prove or disprove that (D,B,µ, f) is ergodic or weak-mixing; it provides a method to decide to accept or reject

hypotheses. We use logical arguments of the form

If statement A is correct, then statement B is correct or

If statement B is incorrect, then statement A is incorrect.

when proving a statement; in hypothesis testing we use

If our observation is unlikely, then the null hypothesis is probably incorrect.

So hypothesis testing uses an argument similar to a contrapositive. When making a

decision we can make two mistakes

1. We can reject Ho when Ho is correct. This is called a type I error.

2. We can fail to reject Ho when Ho is incorrect. This is called a type II error.

When we decide which hypothesis is the null and which are the alternatives, we

set the null hypothesis such that the consequences of a type I error are more severe


than the consequences of type II errors. Therefore it takes strong evidence to reject

Ho . In general, we cannot simultaneously control the type I error risk and the type II

error risk; by default we control the risk of a type I error. The maximum chance we are

willing to risk a type I error is called the significance level. To conduct proper hypothesis

testing, one must set the significance level before gathering observations. After setting

the significance level, one must establish which statistic to use (called the test statistic), which set of statistic values results in rejecting Ho, and which set results in failing to reject Ho (the boundary between these two sets is called the critical value(s)).

For our problem, P is a stochastic Ulam matrix that approximates P, a doubly-stochastic

matrix (ds-matrix). We will make our decision based on the following criteria.

1. Reject Ho in favor of Ha2 if | λ2(P) |< c2 where c2 is a critical value.

2. Reject Ho in favor of Ha1 if | λ2(P) − 1 |> c1 and | λ2(P) |≥ c2 where c1 is a critical

value.

3. Fail to reject Ho otherwise.

Since we are more concerned with a type I error, rejecting Ho is unlikely when

observing random events. To set the critical values we must have an estimate of the

probability distribution of the test statistic. Probability distributions of ds-matrices are difficult to work with; the Monte Carlo chapter provides ways to approximate probability

distributions of test statistics.

1.2 Testing Procedure

Notation 1.2.1. Let P̄ denote the n × n matrix with every entry equal to 1/n, and let I be the identity matrix. We will denote disjoint unions with ⊎.

1. Set the significance level.


2. Set n such that connected regions of measure 1/n are sufficiently small for the application. If an upper bound of (D,B,µ, f)'s entropy is known, call it h, and set n such that e^h < n.

3. Decide which conditional probability distributions of |λ2(P)| when |λ2(P)| = 1 to

use. We propose using a beta distribution with α ≥ 2 and β = 1 for Ha2.

4. Set critical values for Ha1,Ha2.

5. Partition D into n connected subsets with equal measure, {Di}, i = 1, ..., n, with D = D1 ⊎ D2 ⊎ ... ⊎ Dn and µ(Di) = 1/n.

6. Randomly select m sample points in D; call the points x1, x2, ..., xm. Let mi be the number of points in Di.

7. Run one iteration of the mixing protocol.

8. Let mij be the number of points such that xk ∈ Di and f(xk) ∈ Dj. Let M = (mij); M is called an Ulam matrix.

9. Let P = (mij/mi); P is an Ulam stochastic matrix. Compare the second largest eigenvalue of P, λ2(P), to the critical values.

10. If there are concerns about eigenvalue stability, confirm the results of Ho versus Ha2 with σ1((I − P̄)P).

11. Make a decision about the hypotheses of testing based on the critical values.

12. If we reject Ho in favor of Ha2, let the rate at which (N choose n−1)(λ2(P))^(N−n+1) → 0 as N → ∞ be our estimate of the rate of mixing.

13. Estimate the entropy of the dynamical system with

−(1/n) ∑_{i=1}^{n} ∑_{j=1}^{n} pij log pij.
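The following Python sketch walks through steps 5–9 and 13 for a hypothetical stirring map on the unit square; the grid partition, the sample size, and the example map are placeholder choices for illustration only, not part of the procedure itself.

    import numpy as np

    def ulam_test(f, n_cells, m_points, rng=None):
        """Sketch of steps 5-9 and 13 for a map f on the unit square [0,1)^2.
        The square is cut into an s-by-s grid, so n = s*s cells of measure 1/n."""
        rng = np.random.default_rng(0) if rng is None else rng
        s = int(np.sqrt(n_cells))
        n = s * s

        def cell(pts):
            ij = np.minimum((pts * s).astype(int), s - 1)   # grid cell of each point
            return ij[:, 0] * s + ij[:, 1]

        x = rng.random((m_points, 2))            # step 6: random sample points
        y = f(x)                                 # step 7: one stirring iteration
        M = np.zeros((n, n))                     # step 8: Ulam matrix
        np.add.at(M, (cell(x), cell(y)), 1)
        row = M.sum(axis=1, keepdims=True)
        P = M / np.where(row == 0, 1, row)       # step 9: Ulam stochastic matrix

        lambda2 = np.sort(np.abs(np.linalg.eigvals(P)))[::-1][1]   # test statistic
        with np.errstate(divide="ignore", invalid="ignore"):
            ent = -np.where(P > 0, P * np.log(P), 0.0).sum() / n   # step 13
        return lambda2, ent

    # Illustration with a made-up measure-preserving map (Arnold's cat map).
    cat = lambda z: np.column_stack(((2*z[:, 0] + z[:, 1]) % 1, (z[:, 0] + z[:, 1]) % 1))
    print(ulam_test(cat, n_cells=100, m_points=50_000))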

Definition 1.2.2. When we partition D into n connected subsets of equal measure, D = D1 ⊎ D2 ⊎ ... ⊎ Dn with µ(Di) = 1/n, we call the partition an n-partition.


After taking an n-partition, f maps some portion of state i to state j . Let P = (pij) be

the matrix where

pij = µ(x ∈ Dj ∧ f^{−1}(x) ∈ Di) / µ(f^{−1}(x) ∈ Di) = µ(x ∈ Dj | f^{−1}(x) ∈ Di) = µ(f(x) ∈ Dj | x ∈ Di).

Our measure µ is a probability measure on D, so by construction P is a stochastic matrix with pij = µ(f(x) ∈ Dj | x ∈ Di) and ∑_{j=1}^{n} pij = 1; each pij is a conditional probability.

When establishing an n-partition the physical action for the mixing protocol on

the domain should be considered. If the domain is the unit disk and the mixing protocol acts in a circular manner, then to check that regions closer to and farther from the origin are mixing we could partition the disk into rings with radii r1 = 1/√n, r_{k+1} = √(r_k² + 1/n). If the

domain is a rectangle and we are confident that the mixing protocol mixes horizontal

(vertical) sections, then we may partition the rectangle into n horizontal (vertical) smaller

rectangles to check that vertical (horizontal) regions mix. If we do not know about the

stirring protocol before partitioning, the partition should minimize subset diameter.
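As a quick check of the ring construction (a minimal sketch with the arbitrary choice n = 10), the recursion gives r_k = √(k/n), so every ring has the same area:

    import numpy as np

    n = 10                                    # number of rings (arbitrary)
    r = np.sqrt(np.arange(1, n + 1) / n)      # solves r_{k+1}^2 = r_k^2 + 1/n, r_1 = 1/sqrt(n)
    areas = np.pi * np.diff(np.concatenate(([0.0], r ** 2)))
    print(r[-1], np.allclose(areas, np.pi / n))   # outer radius 1; every ring has area pi/n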


CHAPTER 2
ERGODIC THEORY AND MARKOV SHIFTS

2.1 Ergodic Theory

In this section we provide theorems and definitions from ergodic theory and hint at

how Markov shifts relate to approximating (D,B,µ, f).

Definition 2.1.1. [8] If (D,B,µ) is a probability space, a measure-preserving transformation f is ergodic if, whenever B ∈ B and f^{−1}(B) = B, then µ(B) = 0 or µ(B^c) = 0.

This definition generalizes to σ-finite measure spaces, but we will only look at

probability spaces.

We will use the following theorem to define ergodic, weak-mixing and strong-mixing.

Theorem 2.1.2. [9] Let (D,B,µ) be a probability space and let S be a semi-algebra

that generates B. Let f : D→ D be a measure preserving transformation. Then

1. (D,B,µ, f) is ergodic if and only if for all A,B ∈ S,

lim_{N→∞} (1/N) ∑_{i=0}^{N−1} µ(f^{−i}(A) ∩ B) = µ(A)µ(B).

2. (D,B,µ, f) is weak-mixing if and only if for all A,B ∈ S,

lim_{N→∞} (1/N) ∑_{i=0}^{N−1} | µ(f^{−i}(A) ∩ B) − µ(A)µ(B) | = 0.

3. (D,B,µ, f) is strong-mixing if and only if for all A,B ∈ S,

lim_{N→∞} µ(f^{−N}(A) ∩ B) = µ(A)µ(B).

Notice that (D,B,µ, f) is strong-mixing ⇒ (D,B,µ, f) is weak-mixing ⇒ (D,B,µ, f) is ergodic.

Ergodic means that iterations of the function average out to be independent, strong-mixing means that iterations of the function converge to independence, and weak-mixing means that iterations of the function converge to independence except for rare occasions [8].


2.2 Markov Shifts

Definition 2.2.1. A vector

~p = (p1, p2, ..., pn)

is a probability vector if

0 ≤ pi ≤ 1,

for all i and

p1 + p2 + ... + pn = 1.

Definition 2.2.2. A matrix is a stochastic matrix if every row is a probability vector.

Notation 2.2.3. Let

Xn = {(xi) : i ∈ N, xi ∈ {1, 2, ..., n}} or

Xn = {(xi) : i ∈ Z, xi ∈ {1, 2, ..., n}}.

Notice that xi ∈ {1, 2, ..., n} instead of xi ∈ {0, 1, 2, ..., n − 1}. We do this for ease of

numerical indexing.

Definition 2.2.4. Subsets of Xn of the form

{(xi) ∈ Xn : x1 = a1, x2 = a2, ..., xk = ak}

for a given a1, a2, ..., ak are called cylinder sets.

If Σn is the σ-algebra generated by cylinder sets, ~p is a length n probability vector

and P is a stochastic matrix such that ~pP = ~p, then (Xn, Σn, (~p,P)) defines a globally

consistent measure space where the measure of {(xi) ∈ Xn : x1 = a1, x2 = a2, ..., xk = ak} is (p_{a1})(p_{a1 a2} p_{a2 a3} ... p_{a_{k−1} a_k}).

The measure of a cylinder set uses both ~p and P.


Definition 2.2.5. If P is a stochastic matrix and ~p is a probability vector such that

~p = ~pP, then ~p is called a stationary distribution of P.

Definition 2.2.6. Let fn : Xn → Xn, fn((xi)) = (xi+1); then fn is called the shift map.

The shift map is a (~p,P)-measure preserving map.

Remark 2.2.7. If pj = 0 for some j then the measure of {(xi) ∈ Xn : xk = j} is zero; without loss of generality say that pj > 0 for all j ∈ {1, 2, ..., n}. Otherwise, if pj = 0 then the set of sequences with xk = j for some k has zero measure.

Definition 2.2.8. The dynamical system (Xn, Σn, (~p,P), fn) is called a Markov shift,

with (~p,P) as the Markov measure. If Xn = {(xi) : i ∈ N, xi ∈ {1, 2, ..., n}} then it is a one-sided Markov shift. If Xn = {(xi) : i ∈ Z, xi ∈ {1, 2, ..., n}} then it is a two-sided

Markov shift.

Our goal is to use (Xn, Σn, (~p,P), fn) to approximate (D,B,µ, f ). If we look at a point

x ∈ D, iterate f , and let aN = i where f N(x) ∈ Di , (~p,P) gives the probability distribution

of cylinder sets. For our mixing problem an element of Xn represents the movement of

an ingredient particle while stirring, so we say that (Xn, Σn, (~p,P), fn) is a one-sided shift.

In the Doubly-Stochastic Matrices section, we will show that ~p = (1/n, 1/n, ..., 1/n) is

a stationary distribution for any n × n ds-matrix. For brevity we will let (~p,P) represent

Markov shifts.

With Markov shifts, weak-mixing is equivalent to strong-mixing [9], so we will refer

to ((1/n, 1/n, 1/n, ..., 1/n),P) as mixing or not mixing. Since our Markov shift is only an

approximation of (D,B,µ, f ), we may accept hypotheses of weak-mixing and not make

statements of strong-mixing when weak-mixing is accepted. Later we will show why

((1/n, 1/n, 1/n, ..., 1/n),P) is a reasonable approximation of (D,B,µ, f).

1. If ((1/n, 1/n, 1/n, ..., 1/n),P) is not ergodic, we will fail to reject the hypothesis that

(D,B,µ, f ) is not ergodic and not weak-mixing.


2. If ((1/n, 1/n, 1/n, ..., 1/n),P) is ergodic but not mixing, we will reject the hypothesis

that (D,B,µ, f ) is not ergodic in favor of the hypothesis that (D,B,µ, f ) is ergodic

but not weak-mixing.

3. If ((1/n, 1/n, 1/n, ..., 1/n),P) is mixing, we will reject the hypothesis that (D,B,µ, f) is not ergodic and not mixing in favor of the hypothesis that (D,B,µ, f) is ergodic

and weak-mixing.

The following lemma and theorems give us criteria for when

((1/n, 1/n, 1/n, ..., 1/n),P)

is ergodic or mixing.

Lemma 2.2.9. Let P be a stochastic matrix, having a strictly positive probability vector ~p

with ~pP = ~p, then

Q = lim_{N→∞} (1/N) ∑_{i=0}^{N−1} P^i

exists. The matrix Q is also stochastic and

PQ = QP = Q.

Any eigenvector of P for the eigenvalue 1 is also an eigenvector of Q. Also

Q^2 = Q.

Theorem 2.2.10. Let fn denote the (~p,P) Markov shift (one-sided or two-sided). We can

assume pi > 0, ∀i , where ~p = (p1, ..., pn) (P is n × n). Let Q be the matrix obtained in

the lemma above, the following are equivalent:

1. (Xn, Σn, (~p,P), fn) is ergodic.

2. All rows of the matrix Q are identical.

3. Every entry in Q is strictly positive.

4. P is irreducible.


5. 1 is a simple eigenvalue of P.

We will set ~p = (1/n, 1/n, 1/n, ..., 1/n), P = (pij), pij = µ(f (x) ∈ Dj |x ∈ Di). P is a

stochastic matrix; 1 is an eigenvalue of P. If 1 is a simple eigenvalue of P and all other

eigenvalues are not 1 in magnitude, then we will reject the null hypothesis in favor of the

hypothesis that (D,B,µ, f ) is ergodic.

Theorem 2.2.11. [9, 10] If fn is the (~p,P) Markov shift (either one-sided or two-sided)

the following are equivalent:

1. (Xn, Σn, (~p,P), fn) is weak-mixing.

2. (Xn, Σn, (~p,P), fn) is strong-mixing.

3. The matrix P is irreducible and aperiodic (i.e. ∃N > 0 such that the matrix P^N has

no zero entries).

4. For all states i, j we have (P^k)ij → pj.

5. 1 is the only eigenvalue of P with magnitude 1, and it is a simple root of P's

characteristic polynomial.

Say that λ1, λ2, λ3, ..., λn is the multiset of P's eigenvalues with

1 = λ1 ≥ |λ2| ≥ |λ3| ≥ ... ≥ |λn|.

Partially order the eigenvalues by magnitude, then by distance from 1. The previous

theorem implies that ((1/n, 1/n, 1/n, ..., 1/n),P) is mixing if and only if 1 > |λ2|. So if

|λ2| is smaller than one, then we will reject the null hypothesis in favor of the hypothesis

that (D,B,µ, f ) is weak-mixing. Since (Xn, Σn, ((1/n, 1/n, ..., 1/n),P), fn) is only an

approximation of (D,B,µ, f ) our test only checks for weak-mixing.

Theorem 2.2.12. [9] The Markov shift (~p,P), ~p = (pi), P = (pij), has metric entropy

−∑_{i=1}^{n} ∑_{j=1}^{n} pi pij log(pij),

where we define 0 log(0) := 0.


Corollary 2.2.13. The sum −(1/n) ∑_{i=1}^{n} ∑_{j=1}^{n} pij log(pij) gives an estimate of the entropy of

(D,B,µ, f ).

Proof. Set (~p,P) = ((1/n, 1/n, 1/n, ..., 1/n),P), pi = 1/n ∀i .
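A direct translation of the entropy formula (a minimal sketch; the matrix below is an illustrative ds-matrix, not one from the text) makes the 0 log(0) := 0 convention explicit; with the uniform stationary vector it reproduces the estimate of Corollary 2.2.13.

    import numpy as np

    def markov_entropy(p, P):
        """Metric entropy -sum_i sum_j p_i p_ij log(p_ij), with 0 log(0) := 0."""
        P = np.asarray(P, dtype=float)
        with np.errstate(divide="ignore", invalid="ignore"):
            terms = np.where(P > 0, P * np.log(P), 0.0)   # 0 log 0 -> 0
        return -float(np.asarray(p) @ terms.sum(axis=1))

    P = np.array([[0.5, 0.5, 0.0],
                  [0.25, 0.5, 0.25],
                  [0.25, 0.0, 0.75]])                      # illustrative ds-matrix
    print(markov_entropy(np.full(3, 1 / 3), P))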

Theorem 2.2.14. If (~p,P) is a Markov shift on n states (one-sided or two-sided) then its

entropy is less than or equal to log(n).

Corollary 2.2.15. If h is an upper bound for the entropy of (D,B,µ, f ) then we should

set

n > exp(h)

when we partition D.

Example 2.2.16. Here are some examples of Markov shifts with four states.

1. ~p = (1/4, 1/4, 1/4, 1/4) with transition matrix

   [ 1/6  5/6   0    0  ]
   [ 5/6  1/6   0    0  ]
   [  0    0   1/3  2/3 ]
   [  0    0   2/3  1/3 ]

   is a non-ergodic Markov shift with entropy approximately 0.544.

2. ~p = (1/4, 1/4, 1/4, 1/4) with transition matrix

   [ 0  1  0  0 ]
   [ 0  0  1  0 ]
   [ 0  0  0  1 ]
   [ 1  0  0  0 ]

   is an ergodic but non-mixing Markov shift with entropy 0.

3. ~p = (1/4, 1/4, 1/4, 1/4) with transition matrix

   [ 1/4  1/4  1/4  1/4 ]
   [ 1/4  1/4  1/4  1/4 ]
   [ 1/4  1/4  1/4  1/4 ]
   [ 1/4  1/4  1/4  1/4 ]

   is a mixing Markov shift. This Markov shift achieves the entropy upper bound of log 4.
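A short numerical sketch classifying the three examples with the eigenvalue criteria of Theorems 2.2.10 and 2.2.11 (1 a simple eigenvalue ⇒ ergodic; additionally no other eigenvalue of magnitude one ⇒ mixing):

    import numpy as np

    def classify(P, tol=1e-8):
        lam = np.linalg.eigvals(P)
        near_one = np.sum(np.abs(lam - 1) < tol)          # multiplicity of eigenvalue 1
        on_circle = np.sum(np.abs(np.abs(lam) - 1) < tol)
        ergodic = near_one == 1                           # Theorem 2.2.10(5)
        mixing = ergodic and on_circle == 1               # Theorem 2.2.11(5)
        return ergodic, mixing

    P1 = np.array([[1/6, 5/6, 0, 0], [5/6, 1/6, 0, 0],
                   [0, 0, 1/3, 2/3], [0, 0, 2/3, 1/3]])
    P2 = np.array([[0, 1, 0, 0], [0, 0, 1, 0], [0, 0, 0, 1], [1, 0, 0, 0]])
    P3 = np.full((4, 4), 0.25)
    for P in (P1, P2, P3):
        print(classify(P))   # (False, False), (True, False), (True, True)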


Example 2.2.17 (Special Case n = 2). If we have a 2-refinement of D, then

P =
[ p  q ]
[ q  p ],   q = 1 − p.

The characteristic polynomial of P is (λ − 1)(λ + 1 − 2p), so λ2(P) = 2p − 1.

((1/2, 1/2), P) is ergodic ⇐⇒ P ≠
[ 1  0 ]
[ 0  1 ].

((1/2, 1/2), P) is mixing ⇐⇒ P ≠
[ 1  0 ]
[ 0  1 ]
and P ≠
[ 0  1 ]
[ 1  0 ].

When n = 2, a doubly-stochastic matrix is symmetric and there are one or two distinct

entries, so it is easy to explicitly state all cases of ergodic and mixing Markov shifts.

When n > 2, the numbers of entries make describing such Markov shifts with matrix

entries difficult.
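A two-line numerical check of the n = 2 case (p = 0.7 is an arbitrary illustrative value):

    import numpy as np

    p = 0.7
    P = np.array([[p, 1 - p], [1 - p, p]])
    print(np.linalg.eigvals(P))   # eigenvalues 1 and 2p - 1 = 0.4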


CHAPTER 3
STOCHASTIC AND DOUBLY-STOCHASTIC MATRICES

3.1 Doubly-Stochastic Matrices

When we construct the stochastic Ulam matrix that approximates our stirring

protocol, it will approximate a doubly-stochastic matrix. In this section we establish

properties of stochastic matrices and doubly stochastic matrices. Some of the results

are not used in our hypothesis tests, but can confirm statistical results and provide

intuition into the structure of doubly stochastic matrices.

Definition 3.1.1. If P is an n × n stochastic matrix such that P^T is also a stochastic matrix, then P is a doubly-stochastic matrix. We will refer to such matrices as ds-matrices.

Notation 3.1.2. Let Mn denote the set of n × n matrices such that all rows sum to one

(no restriction on entries).

Let Pn denote the set of n × n stochastic matrices.

Let DSn denote the set of n × n ds-matrices.

Let Sn denote the set of n × n symmetric ds-matrices.

Notice that Sn ⊆ DSn ⊊ Pn ⊊ Mn.

Remark 3.1.3. If P is an n × n symmetric stochastic matrix then P is a ds-matrix, but the

converse is not true.

Proof. The matrix P is stochastic, and

P = PT .

Therefore P is doubly-stochastic.

For the converse look at the counterexample

P =
[ 1/3  1/3  1/3 ]
[ 1/2  1/6  1/3 ]
[ 1/6  1/2  1/3 ].


PT is stochastic, but P is not symmetric.

Theorem 3.1.4. If P is the stochastic matrix formed by taking an n-partition of

(D,B,µ, f ), then P is a ds-matrix.

Proof. By construction, P is a stochastic matrix. Since 0 ≤ pij ≤ 1 for all ij , it suffices to

show that each column sums to one. Since the µ(x ∈ Di) are all equal, µ(x ∈ Di) = 1/n.

∑_{i=1}^{n} pij = ∑_{i=1}^{n} µ(f(x) ∈ Dj | x ∈ Di)
            = ∑_{i=1}^{n} µ(f(x) ∈ Dj ∧ x ∈ Di) / µ(x ∈ Di)
            = ∑_{i=1}^{n} µ(f(x) ∈ Dj ∧ x ∈ Di) / (1/n)
            = n ∑_{i=1}^{n} µ(f(x) ∈ Dj ∧ x ∈ Di)
            = n µ(f(x) ∈ Dj ∧ x ∈ ⊎_{i=1}^{n} Di)
            = n µ(f(x) ∈ Dj) = n(1/n) = 1.

So ∑_{i=1}^{n} pij = 1. It follows that P^T is a stochastic matrix.

We skip the standard proofs of the following lemmas. The next lemma shows that all

of the eigenvalues of a stochastic matrix are on the unit disk of the complex plane.

Lemma 3.1.5. If P is an n × n stochastic matrix and λ is an eigenvalue of P then

| λ |≤ 1.

Next we show that the largest eigenvalue of any stochastic matrix is one with

an eigenvector of all ones. Because of this lemma, we may use (1/n, ..., 1/n) as a

stationary distribution of P. Since our goal is to measure mixing between n states each

with measure 1/n, (1/n, ..., 1/n) is intuitively the correct stationary distribution to use.


Without the lemma we would have the additional task of finding a stationary distribution

of P.

Lemma 3.1.6. If P is a stochastic matrix, then

(1, 1, ..., 1)

is a left eigenvector of P^T with eigenvalue 1. If P is an n × n ds-matrix, then

(1/n, 1/n, ..., 1/n)

is a stationary distribution.

The positive and negative entries of eigenvectors from an Ulam matrix can be used

to detect regions that do not mix or are slow to mix. The next theorem is critical to this

technique. In addition, the next theorem will help justify using eigenvectors from an Ulam

matrix to approximate eigenfunctions. Also the theorem and proof are very similar to an

operator result about eigenfunctions.

Theorem 3.1.7. If P is an n × n ds-matrix with eigenvector ~v = (vi), P~v = λ~v, then

λ = 1 or ∑_{i=1}^{n} vi = 0.

Proof. The matrix P is a ds-matrix, so every column sums to one. Since P~v = λ~v,

∑_{j=1}^{n} pij vj = λ vi for each i,

∑_{i=1}^{n} ∑_{j=1}^{n} pij vj = ∑_{i=1}^{n} λ vi,

∑_{j=1}^{n} vj (∑_{i=1}^{n} pij) = λ ∑_{i=1}^{n} vi,

∑_{j=1}^{n} vj (1) = λ ∑_{i=1}^{n} vi,

∑_{i=1}^{n} vi = λ ∑_{i=1}^{n} vi,

(1 − λ) ∑_{i=1}^{n} vi = 0.

Thus we have the result.

Theorem 3.1.8 (The Perron-Frobenius Theorem for Stochastic Matrices). If P is a

stochastic matrix with strictly positive entries then

1. P has 1 as a simple eigenvalue.

2. The eigenvector corresponding to 1 has strictly positive entries.

3. All other eigenvalues have magnitude less than 1.

4. No eigenvector corresponding to an eigenvalue other than 1 has all entries nonnegative.

If P is a stochastic matrix with nonnegative entries, then there is an eigenvector corresponding to 1 with all entries in [0,∞).

Corollary 3.1.9. If P (our Monte Carlo approximation of P) has strictly positive entries

then for each ij

µ(f (x) ∈ Dj ∧ x ∈ Di) > 0 almost surely.

It follows that P has strictly positive entries, so (Xn, Σn, ((1/n, ..., 1/n),P), fn) is weak-

mixing and we will conclude that (D,B,µ, f ) is weak-mixing.

When P is the matrix defined by

pij = µ(f (x) ∈ Dj | x ∈ Di),

the vector

(1/n, 1/n, 1/n, ..., 1/n)

is a left eigenvector and stationary distribution of P. Thus

((1/n, 1/n, 1/n, ..., 1/n),P)


defines a one-sided or two-sided Markov shift. Our goal is to use

(Xn, Σn, ((1/n, 1/n, 1/n, ..., 1/n),P), fn)

to approximate (D,B,µ, f ). If we look at a point x ∈ D, iterate f , and let aN = i where

f N(x) ∈ Di , ((1/n, 1/n, 1/n, ..., 1/n),P) as a Markov shift gives the probability distribution

of cylinder sets. So our approximation gives us information about iterations of f on D.

Knowing that P is a ds-matrix gives us a stationary distribution of P that is consistent

with µ(Di) = 1/n.

Now we take a look at the simplest case possible, n = 2. We will return to this

example to illustrate concepts.

Example 3.1.10 (Special Case n = 2). If we partition D into D1 and D2, µ(D1) = µ(D2) = 1/2, we get the following equations for P.

p11 + p12 = 1

p21 + p22 = 1

p11 + p21 = 1

p12 + p22 = 1

Thus p11 = p22, p12 = p21. If we set p = p11, q = 1− p = p12,

P =
[ p  q ]
[ q  p ].

Proposition 3.1.11. Sn = DSn ⇐⇒ n = 2

Proof. Break up the proof into three cases, n = 2, n = 3, and n > 3.


n=2: This case follows from the previous example. Every 2 × 2 ds-matrix is of the

form

P =
[ p      1 − p ]
[ 1 − p  p     ]

for some p ∈ [0, 1].

n=3: This case follows from a previous counterexample. The matrix

P =
[ 1/3  1/3  1/3 ]
[ 1/2  1/6  1/3 ]
[ 1/6  1/2  1/3 ]

is a 3× 3 ds-matrix that is not symmetric.

n > 3: Let P denote the matrix in the n = 3 case, Ik be the k × k identity matrix, and

0k be the k × k matrix with zero for all entries. Then for each n > 3,

[ P    03   ]
[ 03   In−3 ]

is an n × n ds-matrix that is not symmetric.

As n increases, the degrees of freedom for n × n ds-matrices increase and the

ways that a ds-matrix can deviate from being symmetric grow. Due to this and the

observation that randomly generated ds-matrices are symmetric less frequently for large

n, we propose the following conjecture.

Conjecture 3.1.12. If Mn is the set of n × n matrices whose rows and columns sum to one, g : Mn → R^{n²} with ‖M‖F = ‖g(M)‖2 for all M ∈ Mn, DSn is the set of n × n ds-matrices, and Sn is the set of symmetric n × n ds-matrices, then

lim_{n→∞} µ(g(Sn)) = 0

in the measure space (g(DSn), B, µ) where µ is Lebesgue measure.


If this conjecture is correct, then observing a symmetric ds-matrix becomes less

likely as n → ∞.

Definition 3.1.13. If V is a vector space with real scalars and we take a linear combination of elements from V, c1~v1 + c2~v2 + ... + cN~vN, such that 0 ≤ ci ≤ 1 and c1 + c2 + ... + cN = 1, then c1~v1 + c2~v2 + ... + cN~vN is called a convex combination. We refer to the coefficients c1, c2, ..., cN as convex coefficients.

Remark 3.1.14. If c1, c2, ..., cN are convex coefficients then ~c = (ci) is a probability

vector. Convex combinations are weighted averages.

Theorem 3.1.15 (Birkhoff-von Neumann). [11] An n× n matrix is doubly stochastic if and

only if it is a convex combination of n × n permutation matrices.

So DSn is a convex set with the permutation matrices being the extreme points of

the set. In fact, by Caratheodory’s convex set theorem, every n × n ds-matrix is a convex

combination of (n − 1)² + 1 or fewer permutation matrices.
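A greedy numerical sketch of a Birkhoff-von Neumann decomposition (this particular algorithm is a standard illustration and is not taken from this dissertation; it relies on SciPy's assignment solver to pick a permutation supported on the positive entries of the residual):

    import numpy as np
    from scipy.optimize import linear_sum_assignment

    def birkhoff_decomposition(P, tol=1e-12):
        """Greedily write a ds-matrix P as sum_k c_k * (permutation matrix)."""
        R = P.astype(float).copy()
        terms = []
        while R.max() > tol:
            # Max-weight perfect matching; for doubly-stochastic residuals this
            # typically lands on strictly positive entries (guarded below).
            rows, cols = linear_sum_assignment(R, maximize=True)
            c = R[rows, cols].min()
            if c <= tol:
                break
            Q = np.zeros_like(R)
            Q[rows, cols] = 1.0
            terms.append((c, Q))
            R -= c * Q
        return terms

    P = np.array([[0.5, 0.3, 0.2],
                  [0.2, 0.5, 0.3],
                  [0.3, 0.2, 0.5]])                 # illustrative ds-matrix
    terms = birkhoff_decomposition(P)
    print([c for c, _ in terms], np.allclose(sum(c * Q for c, Q in terms), P))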

Notation 3.1.16. Let P̄ denote the n × n matrix with every entry equal to 1/n,

P̄ =
[ 1/n  ...  1/n ]
[  .    .    .  ]
[ 1/n  ...  1/n ].

A quick computation shows that

P̄ = ∑_{i=1}^{n!} (1/n!) Pi,

where {Pi}_{i=1}^{n!} is the set of n × n permutation matrices.

Since P̄ is the average of all n × n permutation matrices, P̄ is the geometric center of DSn. If we apply the uniform probability measure to the set of permutation matrices, then P̄ is the mean.
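A throwaway numerical verification of this average for n = 4:

    import numpy as np
    from itertools import permutations

    n = 4
    perms = [np.eye(n)[list(p)] for p in permutations(range(n))]   # all n! permutation matrices
    Pbar = sum(perms) / len(perms)
    print(np.allclose(Pbar, 1.0 / n))   # every entry equals 1/n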


Example 3.1.17 (Special Case n = 2). If n = 2 then a ds-matrix is of the form

[ p  q ]
[ q  p ],   q = 1 − p.

[ p  q ]     [ 1  0 ]     [ 0  1 ]
[ q  p ] = p [ 0  1 ] + q [ 1  0 ].

Notice that 2 × 2 ds-matrices are always symmetric.

Example 3.1.18 (Special Case n = 3). If P is a 3 × 3 ds-matrix then P is a convex combination of

[ 1 0 0 ]   [ 1 0 0 ]   [ 0 1 0 ]   [ 0 0 1 ]   [ 0 1 0 ]   [ 0 0 1 ]
[ 0 1 0 ],  [ 0 0 1 ],  [ 1 0 0 ],  [ 0 1 0 ],  [ 0 0 1 ],  [ 1 0 0 ].
[ 0 0 1 ]   [ 0 1 0 ]   [ 0 0 1 ]   [ 1 0 0 ]   [ 1 0 0 ]   [ 0 1 0 ]

Since the first four permutation matrices are symmetric and the last two are not, the convex coefficients of

[ 0 1 0 ]       [ 0 0 1 ]
[ 0 0 1 ]  and  [ 1 0 0 ]
[ 1 0 0 ]       [ 0 1 0 ]

determine how far a ds-matrix is from symmetry.

The next theorem shows that DSn is a bounded set.

Theorem 3.1.19. If P is an n × n ds-matrix, then for the 1-norm, 2-norm, ∞-norm and Frobenius norm

‖P − P̄‖ ≤ ‖I − P̄‖.

Proof. By the Birkhoff-von Neumann theorem P is a convex combination of permutation

matrices. Assume that P = ∑_{i=1}^{n!} αi Pi where {Pi}_{i=1}^{n!} is the set of n × n permutation matrices. Then

‖P − P̄‖ = ‖ ∑_{i=1}^{n!} αi Pi − ∑_{i=1}^{n!} αi P̄ ‖
        = ‖ ∑_{i=1}^{n!} αi (Pi − P̄) ‖
        ≤ ∑_{i=1}^{n!} αi ‖Pi − P̄‖
        ≤ ∑_{i=1}^{n!} αi max_j ‖Pj − P̄‖
        ≤ (max_j ‖Pj − P̄‖) ∑_{i=1}^{n!} αi
        ≤ max_j ‖Pj − P̄‖.

Since P̄ has all equal entries, each Pj − P̄ is obtained from I − P̄ by permuting rows, which does not change these norms, so

max_j ‖Pj − P̄‖ = ‖I − P̄‖.

Thus we get the upper bound.

Corollary 3.1.20.

‖P − P̄‖1 ≤ 2(n − 1)/n < 2,
‖P − P̄‖2 ≤ 1,
‖P − P̄‖∞ ≤ 2(n − 1)/n < 2,
‖P − P̄‖F ≤ √(n − 1).

Notice that all upper bounds are achieved when P is a permutation matrix.

Proof. For the one, infinite and Frobenius norms the result follows from the definitions.

For the two norm, P̄ is a symmetric idempotent matrix, thus I − P̄ is a symmetric idempotent matrix. Idempotent matrices have eigenvalues 0 and 1. Since I − P̄ is not the zero matrix, it has one as its largest eigenvalue. Look at the largest singular value of I − P̄:

‖I − P̄‖2 = σ1(I − P̄)
         = √|λ1((I − P̄)^T (I − P̄))|
         = √|λ1((I − P̄)(I − P̄))|
         = √|λ1(I − P̄)|
         = 1.

Thus ‖P − P̄‖2 ≤ 1.

Example 3.1.21 (n = 2). If

P =
[ p      1 − p ]
[ 1 − p  p     ],

then

P − P̄ =
[ p − 1/2    −p + 1/2 ]
[ −p + 1/2   p − 1/2  ].

It follows that

‖P − P̄‖1 = |2p − 1| ≤ 1,
‖P − P̄‖2 = |2p − 1| ≤ 1,
‖P − P̄‖∞ = |2p − 1| ≤ 1,
‖P − P̄‖F = |2p − 1| ≤ 1.

The matrix P̄ is the center of the ds-matrices; how does ((1/n, ..., 1/n), P̄) compare to other measures? If our Markov shift has ((1/n, ..., 1/n), P̄) as its measure, then knowing which partition set x is in tells us nothing about which partition set f(x) is in. Thus P̄ is the matrix of optimal mixing. If |λ2(P)| < 1, then (Xn, Σn, ((1/n, ..., 1/n),P), fn) is a mixing dynamical system and P^k → P̄ as k → ∞ (we will show this result while we construct our mixing rate estimate). So a Markov shift, (Xn, Σn, ((1/n, ..., 1/n),P), fn), being a mixing dynamical system is equivalent to P^k → P̄ as k → ∞. We will need a Jordan canonical form of P to make an inference about the rate at which P^k → P̄ as k → ∞. We will use the rate at which P^k → P̄ as k → ∞ to estimate the mixing rate of (D,B,µ, f).

Lemma 3.1.22. The n × n matrix P̄ is an orthogonal projection [12] with characteristic polynomial x^{n−1}(x − 1). Furthermore, the Jordan canonical form of P̄ has a one on the diagonal and all other entries are zero.


Proof. The matrix P̄ is a ds-matrix, hence (1/n, 1/n, ..., 1/n) is a stationary distribution:

(1/n, 1/n, ..., 1/n) P̄ = (1/n, 1/n, ..., 1/n).

It follows that

P̄² = P̄.

Since P̄ is an orthogonal projection, all eigenvalues are 0 or 1. All columns and all rows are equal, so rank(P̄) = 1, thus only one eigenvalue does not equal zero. Since the rank

of a matrix is equal to the rank of its Jordan canonical form, the last part of the lemma

follows.

3.2 Additional Properties of Stochastic Matrices

Since we are using an eigenvalue of a stochastic matrix as a test statistic, the

probability distribution of eigenvalues is important. We may use the next few results

to eliminate some measures from consideration as the probability distribution of λ2(P)

when we have prior knowledge about P.

Proposition 3.2.1. If P is an n × n stochastic matrix then λ2 + ... + λn ∈ [−1, n − 1].

Notice that the upper bound is achieved when P is the identity matrix and the lower

bound is achieved when

P =
[ 0  1  0 ]
[ 0  0  1 ]
[ 1  0  0 ].

Proof. The matrix P is stochastic so all entries are in [0, 1], thus trace(P) ∈ [0, n]. Now λ1 + λ2 + ... + λn = trace(P), so λ1 + λ2 + ... + λn ∈ [0, n]. Since λ1 = 1, λ2 + ... + λn ∈ [−1, n − 1].

Proposition 3.2.2. If P is an n × n stochastic matrix then det(P) ∈ [−1, 1].

The upper and lower bounds are achieved by permutation matrices.


Proof. All stochastic matrices have real entries, so det(P) ∈ R.

det(P) = λ1λ2...λn

Taking absolute values of both sides, we get

| det(P) | =| λ1λ2...λn |

=| λ1 || λ2 | ... | λn |

≤ 1.

Thus we get the result.

Corollary 3.2.3. If P is an n × n stochastic matrix then λ2 · · · λn ∈ [−1, 1].

The next result shows that if we are working with a class of ds-matrices with large

entries on the diagonal, then the eigenvalues are not uniformly distributed on the unit

disk. If we write the Gershgorin circle theorem for stochastic matrices, we can quickly

find a region that will contain all eigenvalues.

Theorem 3.2.4. If P = (pij) is an n × n stochastic matrix then

|λ − min_i pii| ≤ 1 − min_i pii

for all eigenvalues.

Proof. By Gershgorin circle theorem there exists i such that

|λ − pii| ≤ ∑_{j≠i} pij.

P is a stochastic matrix, thus

∑_{j≠i} pij = 1 − pii.

It follows that

| λ− pii | ≤ 1− pii .


So each eigenvalue is contained in a closed disk centered at pii with radius 1 − pii for

some i . All of these disks have a diameter in [−1, 1] that contains 1. If we look at the real

numbers in [−1, 1] contained in the i-th disk,

|x − pii| ≤ 1 − pii
pii − 1 ≤ x − pii ≤ 1 − pii
2pii − 1 ≤ x ≤ 1
2 min_j pjj − 1 ≤ 2pii − 1 ≤ x ≤ 1
2 min_j pjj − 1 ≤ x ≤ 1
min_j pjj − 1 ≤ x − min_j pjj ≤ 1 − min_j pjj
|x − min_j pjj| ≤ 1 − min_j pjj.

It follows that the real numbers contained in | λ − pii |≤ 1 − pii are contained in the disk

defined by | λ−minj pjj |≤ 1−minj pjj . Since the center of each disk is contained in [0, 1],

all disks are contained in | λ−minj pjj |≤ 1−minj pjj .

Corollary 3.2.5. If P is an n × n stochastic matrix with all diagonal entries greater than 1/2, then P is invertible.

Proof. For an n × n matrix to be invertible, all eigenvalues must be nonzero.

|λ − pii| ≤ 1 − min_i pii < 1/2.

Since 1/2 < min_i pii, the open disk centered at pii with radius 1/2 does not contain zero.


If we have prior knowledge of a positive lower bound of trace(P), then we may use

the next theorem to exclude some distributions from consideration as the probability

distribution of λ2.

Theorem 3.2.6. If P is an n × n stochastic matrix then

(trace(P) − 1)/(n − 1) ≤ |λ2|.

Proof. P is a nonnegative matrix, so trace(P) ≥ 0,

trace(P) = 1 + λ2 + ... + λn

=| 1 + λ2 + ... + λn |

≤| 1 | + | λ2 | +...+ | λn |

≤ 1 + (n − 1) | λ2 | .

Therefore

(trace(P) − 1)/(n − 1) ≤ |λ2|.

Thus we get the result.

This bound is achieved when P is the identity matrix.

Theorem 3.2.7. If P is an n × n stochastic matrix then

|λn| ≤ |det(P)|^{1/(n−1)} ≤ |λ2|.

Proof. Since P is stochastic,

λ1 = 1.

Let’s use the relationship between determinants and eigenvalues,

det(P) = λ1λ2...λn

= λ2...λn.


Taking absolute values of both sides, we get

| det(P) | =| λ2...λn |=| λ2 | ... | λn | .

By how we defined the λi,

|λn|^{n−1} ≤ |λ2| · · · |λn| ≤ |λ2|^{n−1}.

Thus

|λn| ≤ |det(P)|^{1/(n−1)} ≤ |λ2|.

This gives us both inequalities.

Notice that |λn| = |det(P)|^{1/(n−1)} = |λ2| when P is a permutation matrix or P̄.

Corollary 3.2.8. If P is an n × n stochastic matrix then

max{ (trace(P) − 1)/(n − 1), |det(P)|^{1/(n−1)} } ≤ |λ2|.
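A quick numerical illustration of Corollary 3.2.8 (the matrix is an arbitrary example, not from the text):

    import numpy as np

    P = np.array([[0.6, 0.3, 0.1],
                  [0.2, 0.5, 0.3],
                  [0.2, 0.2, 0.6]])        # arbitrary stochastic example
    n = P.shape[0]
    lam2 = np.sort(np.abs(np.linalg.eigvals(P)))[::-1][1]
    bound = max((np.trace(P) - 1) / (n - 1),
                abs(np.linalg.det(P)) ** (1.0 / (n - 1)))
    print(bound, "<=", lam2)               # lower bound on |lambda_2|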

Proposition 3.2.9. If P is an n× n stochastic matrix then for any matrix norm induced by

a vector norm

1 ≤‖ P ‖

Proof. One is the largest eigenvalue for any stochastic matrix.

The identity matrix achieves the lower bound.

Proposition 3.2.10. If P is an n × n ds-matrix, then

‖P‖1 = 1,  ‖P‖2 = 1,  ‖P‖∞ = 1.


Now we look at a matrix that defines a linear map that we may use to evaluate the

eigenvalues of P. We may use this map when eigenvalue stability is uncertain.

Notation 3.2.11. Let T denote the n × n matrix

I − P̄.

Notation 3.2.12. Let < ~u,~v > denote the standard dot product on the vectors ~u and ~v .

The next theorem shows that λ2(P) = λ1(TP).

Theorem 3.2.13. If P is an n × n stochastic matrix, then

Λ(TP) = (Λ(P) ∪ {0}) − {1},

where Λ denotes the multiset of eigenvalues. Moreover, any eigenvector of P with λ ≠ 1 is an eigenvector of TP with the same eigenvalue.

Proof. Since P is a stochastic matrix,

~u = (1/√n)(1, 1, ..., 1)^T

is an eigenvector with eigenvalue one and has norm one. For any vector ~v = (vi),

~v − <~v, ~u>~u

is orthogonal to ~u. Now

~v − <~v, ~u>~u = ~v − <~v, (1/√n)(1, ..., 1)^T> (1/√n)(1, ..., 1)^T
             = I~v − (1/n) <~v, (1, ..., 1)^T> (1, ..., 1)^T
             = I~v − (1/n) (∑_{i=1}^{n} vi) (1, ..., 1)^T
             = I~v − P̄~v
             = (I − P̄)~v
             = T~v.

If ~v is an eigenvector of P, P~v = λ~v, then

TP~v = T(λ~v) = λT~v = λ( I~v − (1/n) <~v, (1, ..., 1)^T> (1, ..., 1)^T ).

If λ ≠ 1, then <~v, ~u> = 0, so

TP~v = λ~v.

If ~v is a scalar multiple of (1, ..., 1)^T, then

TP~v = (0, ..., 0)^T = 0~v.


If λ = 1 and ~v is not a multiple of (1, ..., 1)^T, then without loss of generality we may assume that <~v, ~u> = 0.

It follows that

TP~v =~v .

Since TP has the same eigenvectors as P, we get the result.

So we may use TP to describe the eigenvalues of P; furthermore, λ2(P) = λ1(TP).

Next we look at how the singular values of TP compare to the eigenvalues of P.

Notation 3.2.14. If M is an n × m matrix, let σ1(M),σ2(M), ... , σmin(m,n)(M) denote the

singular values of M;

σ1(M) ≥ σ2(M) ≥ ... ≥ σmin(m,n)(M).

Theorem 3.2.15. If P is an n × n stochastic matrix, then

|λ2(P)| ≤ σ1(TP) ≤ 1.

Proof. Use an eigenvalue and singular value inequality [12] and our previous result,

|λ2(P)| = |λ1(TP)| ≤ σ1(TP).

The upper bound of one follows from a straightforward computation.

If we are concerned about the stability of eigenvalues from our approximating

matrix, P, then we may use the eigenvalues of TP. If we do not trust the stability of

eigenvalues from either matrix, then we may use the first singular value of TP. Since

the first singular value of a matrix is very stable, σ1(TP) is a better statistic when

eigenvalue stability is questionable. Unfortunately, the probability distribution of σ1(TP)

will likely differ from the probability distribution of |λ2(P)|.
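A short sketch of the alternative statistic σ1(TP) next to |λ2(P)| (again with an arbitrary illustrative ds-matrix):

    import numpy as np

    P = np.array([[0.6, 0.3, 0.1],
                  [0.2, 0.5, 0.3],
                  [0.2, 0.2, 0.6]])                       # arbitrary ds-matrix
    n = P.shape[0]
    T = np.eye(n) - np.full((n, n), 1.0 / n)              # T = I - Pbar
    sigma1 = np.linalg.svd(T @ P, compute_uv=False)[0]
    lam2 = np.sort(np.abs(np.linalg.eigvals(P)))[::-1][1]
    print(lam2, "<=", sigma1, "<= 1")                     # Theorem 3.2.15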


CHAPTER 4
ESTIMATING THE RATE OF MIXING

Time is money, so if we have several mixing protocols we would prefer one that

mixes rapidly. After we conclude that a protocol mixes, the next question is ”How

fast does it mix?” In this section we establish a statistical measure of mixing rate for

(D,B,µ, f ) when

(Xn, Σn, ((1/n, 1/n, ..., 1/n),P), fn) is mixing.

4.1 The Jordan Canonical Form of Stochastic Matrices

Theorem 4.1.1. If P is an n × n ds-matrix such that |λ2(P)| < 1 then

P^N →
[ 1/n  ...  1/n ]
[  .    .    .  ]
[ 1/n  ...  1/n ]

as N → ∞.

Proof. Let J be a Jordan canonical form of P with (J)ii = λi and conjugating matrix E,

P = EJE^{−1},  and  P^N = EJ^N E^{−1}.

Also, let Bil denote an l × l Jordan block of J with eigenvalue λi.

If |λ2(P)| < 1, then 1 has multiplicity one, hence [1] is a Jordan block; all other blocks have diagonal entries with magnitude less than one. If we look at powers of Jordan blocks B^N_il (N > n), the diagonal entries are λ^N_i, the subdiagonals are zeros, and the d-th superdiagonal entries are λ^{N−d}_i (N choose d). When |λ2(P)| < 1 and i > 1,

∑_{N=d}^{∞} λ^{N−d}_i (N choose d)

passes the ratio test, so the sum converges. Thus all entries of B^N_il go to zero as N → ∞. It follows that

J^N →
[ 1  0  ...  0 ]
[ 0  0  ...  0 ]
[ .  .   .   . ]
[ 0  0  ...  0 ]

as N → ∞. So J^N goes to a rank one matrix as N → ∞ and P^N = EJ^N E^{−1}, thus P^N goes to

a rank one matrix as N → ∞. The left probability vector (1/n, 1/n, ..., 1/n) is a stationary

distribution for every ds-matrix, so

(1/n, 1/n, ..., 1/n)P = (1/n, 1/n, ..., 1/n).

Thus the statement follows.

The Markov shift ((1/n, ..., 1/n), P̄) intuitively gives the optimal mixing for a Markov shift. Knowing xi tells us nothing about xi+1, and for any probability vector (p1, p2, ..., pn),

(p1, p2, ..., pn) P̄ = (1/n, 1/n, ..., 1/n).

When we use (p1, p2, ..., pn) to represent a simple function approximating the initial concentration of an ingredient to be mixed in D, a stirring protocol that mixes the ingredient in one iteration will have (Xn, Σn, ((1/n, ..., 1/n), P̄), fn) as the approximating

Markov shift.

4.2 Estimating Mixing Rate

The dynamical system (Xn, Σn, ((1/n, ..., 1/n),PN), fn) is our approximation of how

N iterations of our stirring protocol act on D. So the rate at which PN goes to a rank one

matrix gives us a measure of the rate that f mixes D. Let Bil be an l × l Jordan block of

P with eigenvalue λi . The rate at which B2l goes to zero for the largest l determines the

rate at which PN goes to a rank one matrix. So if we know the Jordan canonical form of

42

Page 43: c 2010 Aaron Carl Smith - University of Floridaufdcimages.uflib.ufl.edu/UF/E0/04/19/15/00001/smith_a.pdf · 2013. 5. 31. · Aaron Carl Smith August 2010 Chair: Philip Boyland Major:

P, we have a measure of the mixing rate. What if n is so large that the Jordan canonical

form of P is not computable?

Theorem 4.2.1. The sequence defined by
$$\binom{N}{n-1}|\lambda_2|^{N-n+1}, \qquad N \in \mathbb{N},$$
provides an estimate of f's rate of mixing.

Proof. Let J be the Jordan canonical form of P with conjugating matrix E,
$$P = E^{-1}JE, \qquad P^N = E^{-1}J^N E.$$
If |λ2(P)| = 1, then J^N does not go to a rank one matrix as N → ∞, and we do not conclude that (D,B,µ,f) is weak-mixing.

If |λ2(P)| < 1, then 1 has multiplicity one, hence [1] is a Jordan block; all other blocks have diagonal entries with magnitude less than one. If we look at powers B_{il}^N of a Jordan block (N > n), the diagonal entries are λ_i^N, the subdiagonal entries are zero, and the entries on the d-th superdiagonal are λ_i^{N−d} \binom{N}{d}. So the upper right entry of a Jordan block is the slowest to converge to zero. When |λ2(P)| < 1 and i > 1,
$$\sum_{N=d}^{\infty} \lambda_i^{N-d}\binom{N}{d}$$
passes the ratio test, so the sum converges. Thus all entries of B_{il}^N go to zero as N → ∞. If we look at ratios of the upper right entries, we see that eigenvalue magnitude influences the rate of convergence more than block size. Thus the upper right entry of the largest block of λ2 converges most slowly to 0. The largest block size possible for λ2 is (n − 1) × (n − 1). Therefore the entries of J^N that converge to zero do so no more slowly than the rate at which λ_2^{N−n+1}\binom{N}{n-1} → 0. The equivalence of P and J shows that the rate at which λ_2^{N−n+1}\binom{N}{n-1} → 0 as N → ∞ gives an upper bound on the rate at which P^N goes to a rank one matrix as N → ∞.


Our mixing rate estimate is an upper bound on the rate at which P^N goes to a rank one matrix. Since (Xn, Σn, (~p, P), fn) only approximates the dynamical system, we use the rate at which
$$\binom{N}{n-1}\lambda_2^{N-n+1} \to 0$$
as our mixing rate estimate instead of as an upper bound on the rate of mixing.
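The estimate is simple to tabulate. A minimal sketch, assuming a value of n and an observed |λ2| (the numbers below are hypothetical):

```python
from math import comb

def mixing_rate_estimate(lam2_abs, n, N_max):
    # The sequence C(N, n-1) * |lambda_2|^(N - n + 1) for N = n-1, ..., N_max.
    return [(N, comb(N, n - 1) * lam2_abs ** (N - n + 1))
            for N in range(n - 1, N_max + 1)]

# Hypothetical values: a 4-state approximation with |lambda_2| = 0.25.
for N, value in mixing_rate_estimate(0.25, 4, 12):
    print(N, value)
```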


CHAPTER 5
PROBABILISTIC PROPERTIES OF DS-MATRICES

After the value of n is set we know everything about

(Xn, Σn, ((1/n, 1/n, ..., 1/n),P), fn)

except the ds-matrix P. We will treat P as a random event, and in this section we look

at properties of random doubly stochastic matrices. Most of the results presented apply

directly to our dynamical problem, others are presented to provide insight into random

ds-matrices. Since each entry of a random matrix defines a random variable, we will

spend quite a bit of time on the entries of random ds-matrices.

5.1 Random DS-Matrices

Definition 5.1.1. If x(ω) is an integrable function over the probability space (Ω,Σ,µ(ω)),

then the expected value of x is

E(x) =

∫x(ω)dµ(ω).

If (x − E(x))2 is an integrable function, then the variance of x is

V (x) = E((x − E(x))2).

If x(ω), y(ω) and x(ω)y(ω) are integrable functions over the probability space

(Ω,Σ,µ(ω))

then the covariance of x and y is

cov(x , y) = E((x − E(x))(y − E(y))).

Some standard results are that

E(c) = c


for any constant c ,

V (x) = E(x2)− (E(x))2

whenever E(x2) and E(x) are finite, and

cov(x , y) = E(xy)− E(x)E(y)

whenever E(xy), E(x), and E(y) are finite.

Theorem 5.1.2. If P = (pij) is a random stochastic matrix where the pij are identically distributed, then E(pij) = 1/n for all i, j.

Proof. Since P is a stochastic matrix and taking an expected value is a linear operation,
$$\sum_{j=1}^{n} E(p_{ij}) = 1, \qquad nE(p_{ij}) = 1, \qquad E(p_{ij}) = \frac{1}{n}.$$
Thus the statement holds.
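A quick Monte Carlo illustration of this fact (a sketch, assuming we sample ds-matrices by averaging random permutation matrices, which makes the entries identically distributed by symmetry):

```python
import numpy as np

rng = np.random.default_rng(1)

def random_ds_matrix(n, terms=20):
    # Averaging random permutation matrices gives a ds-matrix whose entries
    # are identically distributed by symmetry.
    P = np.zeros((n, n))
    for _ in range(terms):
        P += np.eye(n)[rng.permutation(n)]
    return P / terms

n, trials = 4, 2000
mean_entries = sum(random_ds_matrix(n) for _ in range(trials)) / trials
print(mean_entries)   # every entry should be close to 1/n = 0.25
```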

Theorem 5.1.3. If P = (pij) is a random n × n stochastic matrix where the pij are identically distributed and N ∈ ℕ, then
$$\frac{1}{n^N} \le E(p_{ij}^N) \le \frac{1}{n},$$
where p_ij^N = (p_ij)^N.

Proof. First we prove that E(p_ij^N) ≤ 1/n. For any stochastic matrix P,
$$p_{i1} + p_{i2} + \cdots + p_{in} = 1,$$


$$(p_{i1} + p_{i2} + \cdots + p_{in})^N = 1, \qquad p_{i1}^N + p_{i2}^N + \cdots + p_{in}^N + \theta = 1,$$
where θ is the sum of the remaining addends after expanding the multinomial. Then
$$\theta = 1 - p_{i1}^N - p_{i2}^N - \cdots - p_{in}^N \;\Rightarrow\; 0 \le \theta < 1,$$
$$p_{i1}^N + p_{i2}^N + \cdots + p_{in}^N = 1 - \theta \le 1,$$
$$E(p_{i1}^N + p_{i2}^N + \cdots + p_{in}^N) = E(1 - \theta) \le E(1),$$
$$nE(p_{ij}^N) = 1 - E(\theta) \le 1, \qquad E(p_{ij}^N) = \frac{1}{n} - \frac{E(\theta)}{n} \le \frac{1}{n}.$$
Now we prove that 1/n^N ≤ E(p_ij^N) by using Minkowski's inequality:
$$1 = p_{i1} + \cdots + p_{in} = (p_{i1} + \cdots + p_{in})^N = E\big((p_{i1} + \cdots + p_{in})^N\big) = \sqrt[N]{E\big((p_{i1} + \cdots + p_{in})^N\big)} \le \sqrt[N]{E(p_{i1}^N)} + \sqrt[N]{E(p_{i2}^N)} + \cdots + \sqrt[N]{E(p_{in}^N)}.$$
Since the pij are identically distributed, these expected values are all equal,
$$1 \le n\sqrt[N]{E(p_{ij}^N)}.$$
Dividing both sides by n gives
$$\frac{1}{n} \le \sqrt[N]{E(p_{ij}^N)}.$$
Finally, taking N-th powers of both sides gives the result,
$$\frac{1}{n^N} \le E(p_{ij}^N).$$


Hence 1/n^N ≤ E(p_ij^N) ≤ 1/n.

Corollary 5.1.4. If P = (pij) is a random n × n stochastic matrix where the pij are identically distributed, then
$$V(p_{ij}) \le \frac{1}{n} - \frac{1}{n^2}.$$

Proof.
$$V(p_{ij}) = E(p_{ij}^2) - E(p_{ij})^2 = E(p_{ij}^2) - \frac{1}{n^2} \le \frac{1}{n} - \frac{1}{n^2}.$$
Thus the statement follows.

The next few theorems give properties of the covariance between entries of a

random n × n ds-matrix. Since all rows and all columns sum to one, if one entry changes

then at least one other entry on the same row and one other entry on the same column

must change to maintain the sum. It follows that the entries of a random n × n ds-matrix

cannot be independent.

Theorem 5.1.5. If P = (pij) is a random n × n stochastic matrix where the pij are identically distributed, then
$$0 \le E(p_{ij}p_{i'j'}) \le \frac{1}{n}.$$

Proof. Since P is stochastic,
$$0 \le p_{ij} \le 1, \qquad 0 \le p_{ij}p_{i'j'} \le p_{i'j'}, \qquad E(0) \le E(p_{ij}p_{i'j'}) \le E(p_{i'j'}), \qquad 0 \le E(p_{ij}p_{i'j'}) \le \frac{1}{n}.$$


And so we get the result.

Theorem 5.1.6. If P = (pij) is a random n × n stochastic matrix where the pij are identically distributed and j ≠ j′, then
$$0 \le E(p_{ij}p_{ij'}) \le \frac{1}{n} - \frac{1}{n^2}.$$

Proof. The matrix P is stochastic and pij, pij′ are on the same row, thus
$$p_{ij} + p_{ij'} \le 1, \qquad (p_{ij} + p_{ij'})^2 \le p_{ij} + p_{ij'}, \qquad p_{ij}^2 + p_{ij'}^2 + 2p_{ij}p_{ij'} \le p_{ij} + p_{ij'}.$$
Next take expected values of both sides,
$$E(p_{ij}^2) + E(p_{ij'}^2) + 2E(p_{ij}p_{ij'}) \le E(p_{ij}) + E(p_{ij'}), \qquad 2E(p_{ij}^2) + 2E(p_{ij}p_{ij'}) \le \frac{2}{n}, \qquad E(p_{ij}^2) + E(p_{ij}p_{ij'}) \le \frac{1}{n}.$$
Now subtract E(p_{ij}^2) from both sides,
$$E(p_{ij}p_{ij'}) \le \frac{1}{n} - E(p_{ij}^2) \le \frac{1}{n} - \frac{1}{n^2}.$$
Use the fact that 0 ≤ pij pij′ to get
$$0 \le E(p_{ij}p_{ij'}) \le \frac{1}{n} - \frac{1}{n^2}.$$

So we get bounds on E(pijpij ′).

Corollary 5.1.7. If P = (pij) is a random n × n ds-matrix where the pij are identically distributed and i ≠ i′, then
$$0 \le E(p_{ij}p_{i'j}) \le \frac{1}{n} - \frac{1}{n^2}.$$


Proof. Apply the previous theorem to PT .

Theorem 5.1.8. If P = (pij) is a random n × n stochastic matrix where the pij are identically distributed, then
1. −1/n² ≤ cov(pij, pi′j′) ≤ 1/n − 1/n².
2. If j ≠ j′, then −1/n² ≤ cov(pij, pij′) ≤ 1/n − 2/n².
3. If i ≠ i′, then −1/n² ≤ cov(pij, pi′j) ≤ 1/n − 2/n².

Proof. For the first statement,
$$0 \le E(p_{ij}p_{i'j'}) \le \frac{1}{n},$$
$$-E(p_{ij})E(p_{i'j'}) \le E(p_{ij}p_{i'j'}) - E(p_{ij})E(p_{i'j'}) \le \frac{1}{n} - E(p_{ij})E(p_{i'j'}),$$
$$-\frac{1}{n^2} \le \operatorname{cov}(p_{ij}, p_{i'j'}) \le \frac{1}{n} - \frac{1}{n^2}.$$
The second statement follows from the slightly better upper bound on E(pij pij′). The third statement follows from applying the second statement to P^T.

Conjecture 5.1.9. If P = (pij) is a random n × n ds-matrix where the pij are identically distributed, then
cov(pij, pi′j′) ≤ 0 if i = i′, j ≠ j′,
cov(pij, pi′j′) ≤ 0 if i ≠ i′, j = j′,
cov(pij, pi′j′) ≥ 0 if i ≠ i′, j ≠ j′.


The sum of each row (column) is one, so if pij and pi ′j ′ are on the same row (column)

and pij increases, pi′j′ tends to decrease to maintain the sum. If i ≠ i′ and j ≠ j′, when

pij increases, pij ′ tends to decrease to maintain the row sum; when pij ′ decreases, pi ′j ′

tends to increase to maintain the column sum.

When P is a random matrix, det(P), trace(P), and λi(P) are random variables. We

look at properties of these random variables for the remainder of the chapter.

Theorem 5.1.10. If P = (pij) is an n × n matrix (n ≥ 2), and E(p1σ(1)p2σ(2)...pnσ(n)) is

constant for any length n permutations σ, then

E(det(P)) = 0.

Proof. We need to use the definition of the determinant in terms of permutations,
$$\det(P) = \sum_{\sigma} (-1)^{k_\sigma}\, p_{1\sigma(1)} p_{2\sigma(2)} \cdots p_{n\sigma(n)},$$
where the sum runs over all permutations σ of length n, k_σ = 1 if σ is an odd permutation and k_σ = 0 if σ is an even permutation. Then
$$E(\det(P)) = \sum_{\sigma} (-1)^{k_\sigma}\, E\big(p_{1\sigma(1)} p_{2\sigma(2)} \cdots p_{n\sigma(n)}\big) = \sum_{\sigma} (-1)^{k_\sigma}\, E(p_{11}p_{22}\cdots p_{nn}) = E(p_{11}p_{22}\cdots p_{nn}) \sum_{\sigma} (-1)^{k_\sigma} = 0,$$
since for n ≥ 2 half of the permutations are even and half are odd.

Corollary 5.1.11. If P = (pij) is an n × n matrix (n ≥ 2), and

E(p1σ(1)p2σ(2)...pnσ(n))


is constant for any length n permutations σ then

E(λ2λ3...λn) = 0.

Theorem 5.1.12. If P = (pij) is a random n × n stochastic matrix where pij are identically

distributed then E(λ2 + λ3 + ... + λn) = 0.

Proof. Since P is a stochastic matrix, λ1 = 1. By the definition of trace(P) and the

commutative property of the trace operator

trace(P) = 1 + λ2 + ... + λn and trace(P) = p11 + p22 + ... + pnn.

It follows that

1 + λ2 + ... + λn = p11 + p22 + ... + pnn.

Taking expected values of both sides gives
$$E(1 + \lambda_2 + \cdots + \lambda_n) = E(p_{11} + p_{22} + \cdots + p_{nn}),$$
$$1 + E(\lambda_2 + \cdots + \lambda_n) = E(p_{11}) + E(p_{22}) + \cdots + E(p_{nn}) = \frac{1}{n} + \frac{1}{n} + \cdots + \frac{1}{n} = 1.$$
Subtracting one from both sides gives the result.

Notation 5.1.13. If {P_{n_k}}_{k=1}^∞ is a sequence of n_k × n_k matrices, let λ_{k,i} denote λ_i(P_{n_k}).

We will use the next theorem to tell us about the eigenvalues of matrices that arise from taking a sequence of refinements of {D_i}_{i=1}^n.

Theorem 5.1.14. If {P_{n_k}}_{k=1}^∞ is a sequence of n_k × n_k stochastic matrices such that
E((P_{n_k})_{ij}) = 1/n_k for all i, j,


then
$$-1 \le E\Big(\liminf_{k\to\infty}\; \lambda_{k,2} + \lambda_{k,3} + \cdots + \lambda_{k,n_k}\Big) \le 0.$$

Proof. Stochastic matrices have entries in [0, 1], so trace(P_{n_k}) ∈ [0, n_k]. By Fatou's lemma,
$$0 \le E\big(\liminf_{k\to\infty} \operatorname{trace}(P_{n_k})\big) \le \liminf_{k\to\infty} E\big(\operatorname{trace}(P_{n_k})\big),$$
$$0 \le E\Big(\liminf_{k\to\infty}\; 1 + \lambda_{k,2} + \cdots + \lambda_{k,n_k}\Big) \le \liminf_{k\to\infty} E\Big(\sum_{i=1}^{n_k} (P_{n_k})_{ii}\Big),$$
$$0 \le 1 + E\Big(\liminf_{k\to\infty}\; \lambda_{k,2} + \cdots + \lambda_{k,n_k}\Big) \le \liminf_{k\to\infty} \sum_{i=1}^{n_k} E\big((P_{n_k})_{ii}\big) = \liminf_{k\to\infty} \sum_{i=1}^{n_k} \frac{1}{n_k} = 1,$$
$$-1 \le E\Big(\liminf_{k\to\infty}\; \lambda_{k,2} + \cdots + \lambda_{k,n_k}\Big) \le 0.$$

Thus we get the result.

The next example gives us an idea of how much we can expect P to differ from P̄, the matrix with all entries equal to 1/n, when n = 2.

Example 5.1.15. If P is a 2 × 2 ds-matrix, then
$$\|P - \bar{P}\|_1 = \|P - \bar{P}\|_2 = \|P - \bar{P}\|_\infty = \|P - \bar{P}\|_F = |2p - 1| \le 1,$$
where
$$P = \begin{pmatrix} p & 1-p \\ 1-p & p \end{pmatrix}.$$
For these particular norms, if p is a uniform[0, 1] random variable, then
$$E(\|P - \bar{P}\|) = E(|2p - 1|) = \frac{1}{2}.$$


5.2 Metric Entropy of Markov Shifts with Random Matrices

The next two theorems are important for estimating the metric entropy of
(Xn, Σn, ((1/n, ..., 1/n), P), fn)
when P is a random variable. The metric entropy of the dynamical system is
$$-\frac{1}{n}\sum_{i=1}^{n}\sum_{j=1}^{n} p_{ij}\log(p_{ij}),$$
where we define 0 log(0) = 0.
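For a given matrix this quantity is straightforward to evaluate; a sketch, with the convention 0 log 0 = 0 and a made-up 3 × 3 ds-matrix:

```python
import numpy as np

def markov_shift_entropy(P):
    # -(1/n) * sum_ij p_ij log(p_ij), with the convention 0 log 0 = 0,
    # for an n x n ds-matrix P and the uniform stationary distribution.
    P = np.asarray(P, dtype=float)
    n = P.shape[0]
    safe = np.where(P > 0, P, 1.0)          # log(1) = 0 handles the zero entries
    return -(P * np.log(safe)).sum() / n

P = np.array([[0.5, 0.3, 0.2],
              [0.2, 0.5, 0.3],
              [0.3, 0.2, 0.5]])             # made-up 3 x 3 ds-matrix
print(markov_shift_entropy(P))              # at most log(3), attained when all entries are 1/3
```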

Theorem 5.2.1. If P = (pij) is a random n × n stochastic matrix where the pij are identically distributed and 0 < pij almost surely, then
E(log(pij)) ≤ −log n, and −(1/n) log(n) ≤ E(pij log(pij)).

Proof. First we show the first inequality. Since f(x) = −log x is a convex function on (0, ∞), Jensen's inequality tells us
$$-\log E(p_{ij}) \le E(-\log p_{ij}),$$
$$\log n \le -E(\log p_{ij}),$$
$$E(\log p_{ij}) \le -\log n.$$
Now we show the second inequality. Since f(x) = x log x is convex on (0, ∞), Jensen's inequality tells us that
$$E(p_{ij})\log(E(p_{ij})) \le E(p_{ij}\log(p_{ij})),$$
$$\frac{1}{n}\log\Big(\frac{1}{n}\Big) \le E(p_{ij}\log(p_{ij})),$$


$$-\frac{1}{n}\log(n) \le E(p_{ij}\log(p_{ij})).$$

Thus we get the two inequalities.

The next theorem is useful if one wishes to use the harmonic mean of the entries of

a random stochastic matrix.

Theorem 5.2.2. If P = (pij) is a random n × n stochastic matrix where the pij are identically distributed and 0 < pij almost surely, then n ≤ E(1/pij).

Proof. Since f(x) = 1/x is a convex function on (0, ∞), Jensen's inequality tells us that
$$\frac{1}{E(p_{ij})} \le E\Big(\frac{1}{p_{ij}}\Big), \qquad \frac{1}{1/n} \le E\Big(\frac{1}{p_{ij}}\Big), \qquad n \le E\Big(\frac{1}{p_{ij}}\Big).$$

And the statement follows.


CHAPTER 6
PARTITION REFINEMENTS

6.1 Equal Measure Refinements

If we know P and we refine our partition of D, what does P tell us about our new Markov shift? Intuitively, we expect the refined subshift of finite type to be a better approximation to (D,B,µ,f). It is unreasonable to use a Markov shift from an n-partition to approximate a continuous dynamical system if the Markov shift does not provide information about new Markov shifts formed after refining the partition. Here we present the relationship between n-partition Markov shifts and nk-partition Markov shifts where the latter is a refinement of the former. Partition each Di into k connected subsets,
$$D_i = \biguplus_{\alpha=1}^{k} D_{i\alpha}, \qquad \mu(D_{i\alpha}) = \frac{1}{nk}.$$
Each subset has equal measure. We will refer to such refinements as k-refinements.

Notation 6.1.1. We will use the following notation when referring to refinement of

partitions.

1. Let Pn be our stochastic matrix before refinement with entries pij ,

pij = µ(f (x) ∈ Dj |x ∈ Di).

2. Let Pnk be our stochastic matrix after refinement with entries piαjβ ,

piαjβ = µ(f (x) ∈ Djβ |x ∈ Diα).

Arrange the rows and columns of Pnk in the order
11, 12, ..., 1k, 21, 22, ..., 2k, ..., n1, n2, ..., nk.

With this arrangement we can represent Pnk as a block matrix where each entry of Pn

corresponds to a k × k block of Pnk .


Theorem 6.1.2. If ((1/n, ..., 1/n), Pn) is the Markov shift that approximates (D,B,µ,f) after an n-partition, and Pnk is the stochastic matrix after a k-refinement, then
((1/nk, ..., 1/nk), Pnk)
is a Markov shift and
$$\sum_{\beta=1}^{k}\sum_{\alpha=1}^{k} p_{i\alpha j\beta} = k\,p_{ij}.$$

We will refer to these equations as the refinement equations.

Proof. After refining, Pnk is an nk × nk ds-matrix, so ((1/nk, ..., 1/nk), Pnk) is a Markov shift. Now let's look at the refinement equations:
$$\sum_{\beta=1}^{k}\sum_{\alpha=1}^{k} p_{i\alpha j\beta} = \sum_{\beta=1}^{k}\sum_{\alpha=1}^{k} \mu(f(x)\in D_{j\beta}\,|\,x\in D_{i\alpha}) = \sum_{\beta=1}^{k}\sum_{\alpha=1}^{k} \frac{\mu(f(x)\in D_{j\beta} \wedge x\in D_{i\alpha})}{\mu(x\in D_{i\alpha})} = \sum_{\beta=1}^{k}\sum_{\alpha=1}^{k} \frac{\mu(f(x)\in D_{j\beta} \wedge x\in D_{i\alpha})}{1/(nk)}$$
$$= nk \sum_{\beta=1}^{k}\sum_{\alpha=1}^{k} \mu(f(x)\in D_{j\beta} \wedge x\in D_{i\alpha}) = nk\,\mu\Big(f(x)\in \biguplus_{\beta=1}^{k} D_{j\beta} \wedge x\in \biguplus_{\alpha=1}^{k} D_{i\alpha}\Big) = nk\,\mu(f(x)\in D_j \wedge x\in D_i)$$
$$= k\,\frac{\mu(f(x)\in D_j \wedge x\in D_i)}{1/n} = k\,\frac{\mu(f(x)\in D_j \wedge x\in D_i)}{\mu(x\in D_i)} = k\,p_{ij}.$$
Thus we get the refinement equations.
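The refinement equations can be verified mechanically by summing the k × k blocks of Pnk. A sketch, assuming the block ordering of Notation 6.1.1 and using the block-constant refinement Pn ⊗ (J_k/k) (J_k the all-ones matrix) as a hypothetical example:

```python
import numpy as np

P_n = np.array([[0.2, 0.8],
                [0.8, 0.2]])                    # hypothetical 2 x 2 ds-matrix
n, k = 2, 3
P_nk = np.kron(P_n, np.ones((k, k)) / k)        # each p_ij spread evenly over a k x k block

# Refinement equations: every k x k block of P_nk sums to k * p_ij.
block_sums = P_nk.reshape(n, k, n, k).sum(axis=(1, 3))
print(np.allclose(block_sums, k * P_n))         # True
print(P_nk.sum(axis=0), P_nk.sum(axis=1))       # all rows and columns sum to 1
```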


The following is the simplest general example of a k-refinement. We present it to

show the relationships between the unrefined Markov shift and the refined Markov shift.

Example 6.1.3. If we have a 2-partition and apply a 2-refinement, what does P2 tell

us about P4? Here we are looking at the situation where P2 is a known matrix from a

Markov shift ((1/2, 1/2),P2) and we want to make an inference about the Markov shift

that results if we refine our partition, ((1/4, 1/4, 1/4, 1/4),P4). Let

P2 =

p q

q p

and

P4 =

p1111 p1112 p1121 p1122

p1211 p1212 p1221 p1222

p2111 p2112 p2121 p2122

p2211 p2212 p2221 p2222

.

First let’s look at P4 as a ds-matrix without considering it as a refinement of P2. The

matrix P4 is a stochastic matrix so all rows sum to one.

p1111 + p1112 + p1121 + p1122 = 1

p1211 + p1212 + p1221 + p1222 = 1

p2111 + p2112 + p2121 + p2122 = 1

p2211 + p2212 + p2221 + p2222 = 1.

Our matrix P4 is a ds-matrix so all columns sum to one and we get the additional

equations.

p1111 + p1211 + p2111 + p2211 = 1

p1112 + p1212 + p2112 + p2212 = 1

p1121 + p1221 + p2121 + p2221 = 1

p1122 + p1222 + p2122 + p2222 = 1.


If we set up these equations as a system of equations and take the reduced row

echelon form (pivots are in bold font) we get

1 0 0 0 0 0 −1 0 0 1 −1 0 0 0 −1 1 0

0 1 0 0 0 0 0 1 0 0 0 1 0 0 0 1 1

0 0 1 0 0 0 1 0 0 0 0 1 0 1 0 0 1

0 0 0 1 0 0 0 1 0 1 −1 1 −1 1 0 1 1

0 0 0 0 1 0 1 0 0 0 1 0 0 0 1 0 1

0 0 0 0 0 1 0 −1 0 0 1 −1 1 0 0 −1 0
0 0 0 0 0 0 0 0 1 −1 1 −1 1 −1 1 −1 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
.

Notice that from the eight equations the reduced row echelon form has seven pivots

and one zero-row.

The zero matrix is not a solution to the system of equations, thus the solution set is

not a vector space. If we convert the reduced row echelon form back to matrices, we see

that a solution is

0 0 0 1

0 0 1 0

0 1 0 0

1 0 0 0

.

This matrix added to any linear combination of the following matrices is a solution to

the system of equations.

1 0 0 −1
−1 0 1 0

0 −1 0 1

0 1 −1 0

,

0 0 0 0

−1 1 0 0

0 0 0 0

1 −1 0 0

,

0 0 0 0

0 0 0 0

0 −1 1 0

0 1 −1 0

,


0 0 0 0

0 0 0 0

0 1 0 −1
0 −1 0 1

,

0 0 1 −1
0 0 0 0

0 −1 0 1

0 1 −1 0

,

0 0 0 0

0 0 −1 1

0 1 0 −1
0 −1 1 0

,

0 0 0 0

0 0 0 0

1 −1 0 0
−1 1 0 0

,

0 1 0 −1
−1 0 1 0

0 −1 0 1

1 0 −1 0

,

0 0 0 0

−1 0 1 0

0 0 0 0

1 0 −1 0

.

Since P4 is a stochastic matrix, all entries are on [0, 1]. So any solution to the

system with a negative entry or an entry greater than one is not a ds-matrix. The set of

4× 4 ds-matrices is a strict subset of the system’s solution set.

The matrices above span all 4 × 4 ds-matrices. Now we look at how things change when we include the equations from P4 being a 2-refinement matrix of P2. If we take the refinement equations and include that
$$P_2 = \begin{pmatrix} p & 1-p \\ 1-p & p \end{pmatrix},$$

then we get the additional equations

p1111 + p1112 + p1211 + p1212 = 2p

p1121 + p1122 + p1221 + p1222 = 2(1− p)

p2111 + p2112 + p2211 + p2212 = 2(1− p)

p2121 + p2122 + p2221 + p2222 = 2p.


Taking the reduced row echelon form of the system of equations after including the

refinement equations (pivots are in bold font) we get

1 0 0 0 0 0 0 −1 0 1 0 0 0 0 −1 0 1−2p
0 1 0 0 0 0 0 1 0 0 0 1 0 0 0 1 1

0 0 1 0 0 0 0 1 0 0 −1 1 0 1 0 1 2p

0 0 0 1 0 0 0 1 0 1 −1 1 −1 1 0 1 1

0 0 0 0 1 0 0 1 0 0 0 0 0 0 1 1 2p

0 0 0 0 0 1 0 −1 0 0 1 −1 1 0 0 −1 0

0 0 0 0 0 0 1 −1 0 0 1 0 0 0 0 −1 1−2p
0 0 0 0 0 0 0 0 1 −1 1 −1 1 −1 1 −1 0

0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

.

From the twelve equations the reduced row echelon form has eight pivots and four

zero-rows. Thus including the refinement equations reduces the degrees of freedom by one.

The zero matrix is not a solution to the system of equations, thus the solution set is

still not a vector space. If we convert the reduced row echelon form to matrices, we see

that a solution is

0 0 0 1

2p 0 1−2p 0
0 1 0 0

1− 2p 0 2p 0

.


This matrix added to any linear combination of the following matrices is a solution to

the new system of equations.

1 0 0 −1
−1 0 1 0

0 −1 0 1

0 1 −1 0

,

0 0 0 0

−1 1 0 0

0 0 0 0

1 −1 0 0

,

0 0 0 0

0 0 0 0

0 −1 1 0

0 1 −1 0

,

0 0 0 0

0 0 0 0

0 1 0 −1
0 −1 0 1

,

0 0 1 −1
0 0 0 0

0 −1 0 1

0 1 −1 0

,

0 0 0 0

0 0 −1 1

0 1 0 −1
0 −1 1 0

,

0 0 0 0

0 0 0 0

1 −1 0 0
−1 1 0 0

,

0 1 0 −1
−1 0 1 0

0 −1 0 1

1 0 −1 0

.

So including the 2-refinement equations reduces the degrees of freedom by one.

Example 6.1.4 (Special Case of the Previous Example). Apply a 2-refinement to a

2-partition where we know that the diagonal entries all equal x, and the coefficient of each spanning matrix with zeros on the diagonal is y; then

P4 =

x y y 1−x−2y
2p−2x−y x 1−2p+x y

y 1− x − 2y x y

1− 2p + x y 2p − 2x − y x

.

Since P4 is a stochastic matrix, all entries must be on [0, 1]. Hence we get the

following restrictions on x and y ,

max{2p − 1, 0} ≤ x ≤ min{2p, 1}
0 ≤ y ≤ 1
max{2p − 1, 0} ≤ x + 2y ≤ 1
max{2p − 1, 0} ≤ 2x + y ≤ min{2p, 1}.


Theorem 6.1.5. If Pnk is an nk×nk ds-matrix, then Pnk is a k-refinement matrix for some

n × n ds-matrix.

Proof. The matrix Pnk is ds, so all rows and columns sum to one and all entries are nonnegative. Look at the block matrix formed by breaking Pnk into n² blocks of size k × k. Call the blocks {Bij}, 1 ≤ i ≤ n, 1 ≤ j ≤ n,
$$P_{nk} = \begin{pmatrix} B_{11} & \cdots & B_{1n} \\ \vdots & \ddots & \vdots \\ B_{n1} & \cdots & B_{nn} \end{pmatrix}.$$
Take the sum of each block's entries. Let
$$B = \begin{pmatrix} \sum B_{11} & \cdots & \sum B_{1n} \\ \vdots & \ddots & \vdots \\ \sum B_{n1} & \cdots & \sum B_{nn} \end{pmatrix}.$$
Notice that B is an n × n matrix. Since Pnk is ds, all rows and columns of B sum to k and all entries of B are nonnegative. Hence (1/k)B is an n × n ds-matrix.

If we take a sequence of refinement matrices {P_{n_k}}_{k=1}^∞, with n_k | n_{k+1} and λ_{k,2} = λ2(P_{n_k}), then {λ_{k,2}}_{k=1}^∞ measures mixing at each refinement. Since λ_{k+1,2} measures mixing on a finer partition than λ_{k,2}, one would expect {|λ_{k,2}|}_{k=1}^∞ to be a nondecreasing sequence, but this is not the case. There are examples of sequences that have |λ_{k+1,2}| < |λ_{k,2}|. The observed refinement matrices with decreasing eigenvalue magnitude had poor mixing between states of the form D_{i1}, D_{i2}, ..., D_{ik}; that is, the mixing was poor between states that were all contained in one prerefined state.

Proposition 6.1.6. If {P_{n_k}}_{k=1}^∞, n_k | n_{k+1}, is a sequence of (n_{k+1}/n_k)-refinement matrices, then {|λ_{k,2}|} is not necessarily a nondecreasing sequence.


Proof. Proof by counterexample. Let
$$P_4 = \begin{pmatrix} 4/33 & 5/33 & 5/11 & 3/11 \\ 3/11 & 4/33 & 10/33 & 10/33 \\ 1/33 & 20/33 & 2/11 & 2/11 \\ 19/33 & 4/33 & 2/33 & 8/33 \end{pmatrix}.$$
The eigenvalues of P4 are approximately
1, −0.246, −0.044 + 0.167i, −0.044 − 0.167i,
so the magnitudes of P4's eigenvalues are approximately
1, 0.246, 0.173, 0.173,
and λ_{4,2} ≈ −0.246. Using the refinement equations we see that
$$P_2 = \begin{pmatrix} 1/3 & 2/3 \\ 2/3 & 1/3 \end{pmatrix}.$$
The eigenvalues of P2 are 1 and −1/3, so λ_{2,2} = −1/3. It follows that |λ_{4,2}| < |λ_{2,2}|.
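The counterexample is easy to confirm numerically; a minimal sketch using the matrices above:

```python
import numpy as np

P4 = np.array([[ 4,  5, 15,  9],
               [ 9,  4, 10, 10],
               [ 1, 20,  6,  6],
               [19,  4,  2,  8]]) / 33.0

# Recover P2 from the refinement equations: sum each 2 x 2 block and divide by k = 2.
P2 = P4.reshape(2, 2, 2, 2).sum(axis=(1, 3)) / 2
print(P2)                                            # [[1/3, 2/3], [2/3, 1/3]]

print(np.sort(np.abs(np.linalg.eigvals(P4)))[::-1])  # approx [1, 0.246, 0.173, 0.173]
print(np.sort(np.abs(np.linalg.eigvals(P2)))[::-1])  # [1, 1/3]
```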

When we take a refinement, what do the eigenvalues and eigenvectors of Pn tell

us about the eigenvalues and eigenvectors of Pnk? For Pn to be of value, it needs to

capture the useful information from Pnk , if it does not, then Pn has no hope of being

useful in making a decision about (D,B,µ, f ). Since we are using an eigenvalue as

a test statistic, we present the next results describing the relationship between the

eigenvalues of Pn and Pnk .

Theorem 6.1.7. If Pnk is a k-refinement matrix of Pn and
(v1 ... v1 v2 ... v2 ... vn ... vn)^T


(each vi appears in the vector k times) is an eigenvector of Pnk with eigenvalue λ, then
(v1 v2 ... vn)^T
is an eigenvector of Pn with eigenvalue λ.

Proof. By the hypotheses,
$$\sum_{j=1}^{n}\sum_{\beta=1}^{k} p_{i\alpha j\beta}\, v_j = \lambda v_i.$$
Summing over α,
$$\sum_{\alpha=1}^{k}\sum_{j=1}^{n}\sum_{\beta=1}^{k} p_{i\alpha j\beta}\, v_j = \sum_{\alpha=1}^{k} \lambda v_i, \qquad \sum_{j=1}^{n} v_j \sum_{\alpha=1}^{k}\sum_{\beta=1}^{k} p_{i\alpha j\beta} = \lambda k v_i.$$
The matrix Pnk is a k-refinement matrix of Pn and k ≠ 0, thus
$$\sum_{j=1}^{n} v_j\, k\, p_{ij} = \lambda k v_i, \qquad \sum_{j=1}^{n} p_{ij}\, v_j = \lambda v_i.$$
Hence the statement holds.

Theorem 6.1.8. If Pn is an n × n ds-matrix, Pnk is a k-refinement of Pn, Pn~v = λ~v, ~v = (vi), then
$$\frac{\sum_{\alpha=1}^{k}\sum_{j=1}^{n}\sum_{\beta=1}^{k} p_{i\alpha j\beta}\, v_j}{k} = \lambda v_i.$$
Thus the vector
(v1 ... v1 v2 ... v2 ... vn ... vn)^T
(each vi appears k times) averages out over block rows to be like an eigenvector of Pnk.


Proof. Since ~v is an eigenvector of Pn, Pn~v = λ~v, so it follows that
$$\sum_{j=1}^{n} v_j\, p_{ij} = \lambda v_i.$$
By the refinement equations,
$$\sum_{\alpha=1}^{k}\sum_{j=1}^{n}\sum_{\beta=1}^{k} p_{i\alpha j\beta}\, v_j = \sum_{j=1}^{n} v_j \sum_{\alpha=1}^{k}\sum_{\beta=1}^{k} p_{i\alpha j\beta} = \sum_{j=1}^{n} v_j\, k\, p_{ij} = \lambda k v_i.$$
Divide both sides by k to get the result.

Remark 6.1.9. For any matrix M with left and right eigenvectors of λ, ~u and ~v (~u∗M =

λ~u∗ and M~v = λ~v),

~u∗M~v = λ~u∗~v = λ < ~u,~v > .

Where ∗ denotes conjugate transpose, and < ~u,~v > refers to the dot product.


Theorem 6.1.10. If Pn is an n × n ds-matrix, Pnk is a k-refinement of Pn, Pn~v = λ~v, ~v = (vi), ~u∗Pn = λ~u∗, ~u = (ui), then
$$(u_1 \ldots u_1\; u_2 \ldots u_2\; \ldots\; u_n \ldots u_n)\; P_{nk}\; (v_1 \ldots v_1\; v_2 \ldots v_2\; \ldots\; v_n \ldots v_n)^T = \lambda k \langle \vec{v}, \vec{u}\rangle.$$

Proof. The left hand side of the equation equals
$$\sum_{i=1}^{n}\sum_{\alpha=1}^{k}\sum_{j=1}^{n}\sum_{\beta=1}^{k} u_i\, p_{i\alpha j\beta}\, v_j = \sum_{i=1}^{n}\sum_{j=1}^{n} u_i v_j \sum_{\alpha=1}^{k}\sum_{\beta=1}^{k} p_{i\alpha j\beta} = \sum_{i=1}^{n}\sum_{j=1}^{n} u_i v_j\, k\, p_{ij} = k\sum_{i=1}^{n} u_i \sum_{j=1}^{n} p_{ij} v_j = k\sum_{i=1}^{n} u_i \lambda v_i = k\lambda \sum_{i=1}^{n} u_i v_i = \lambda k \langle \vec{v}, \vec{u}\rangle.$$

Thus we get the result.


6.2 A Special Class of Refinements

Now we look at a special class of k-partitions. These partitions are interesting

because the eigenvalue multiset after refinement contains the eigenvalue multiset before

refinement.

Definition 6.2.1. If Pnk is a k-refinement of Pn such that for every block matrix
$$\begin{pmatrix} p_{i_1j_1} & \cdots & p_{i_1j_k} \\ \vdots & \ddots & \vdots \\ p_{i_kj_1} & \cdots & p_{i_kj_k} \end{pmatrix}$$
there exists a k × k ds-matrix Dij such that
$$p_{ij}\, D_{ij} = \begin{pmatrix} p_{i_1j_1} & \cdots & p_{i_1j_k} \\ \vdots & \ddots & \vdots \\ p_{i_kj_1} & \cdots & p_{i_kj_k} \end{pmatrix},$$
then Pnk is called a Dk-refinement matrix of Pn.

Definition 6.2.2. If Pnk is a k-refinement of Pn such that for every block matrix
$$\begin{pmatrix} p_{i_1j_1} & \cdots & p_{i_1j_k} \\ \vdots & \ddots & \vdots \\ p_{i_kj_1} & \cdots & p_{i_kj_k} \end{pmatrix}$$
there exists a k × k permutation matrix Sij such that
$$p_{ij}\, S_{ij} = \begin{pmatrix} p_{i_1j_1} & \cdots & p_{i_1j_k} \\ \vdots & \ddots & \vdots \\ p_{i_kj_1} & \cdots & p_{i_kj_k} \end{pmatrix},$$
then Pnk is called an Sk-refinement matrix of Pn.

Remark 6.2.3. Every Sk -refinement is a Dk -refinement and if we take a sequence of

Sk -refinements then the matrices will become sparse.


Theorem 6.2.4 (The Boyland Theorem). If Pn is a ds-matrix, Pnk is a Dk-refinement of Pn, and
(v1 v2 ... vn)^T
is an eigenvector of Pn with eigenvalue λ, then
(v1 ... v1 v2 ... v2 ... vn ... vn)^T
is an eigenvector of Pnk with eigenvalue λ.

Proof. For any iα,
$$\sum_{j=1}^{n}\sum_{\beta=1}^{k} p_{i\alpha j\beta}\, v_j = \sum_{j=1}^{n} v_j \sum_{\beta=1}^{k} p_{i\alpha j\beta}.$$
The refinement matrix Pnk is a Dk-refinement of Pn, so every row of the block matrix that corresponds to pij sums to pij, thus
$$\sum_{j=1}^{n}\sum_{\beta=1}^{k} p_{i\alpha j\beta}\, v_j = \sum_{j=1}^{n} v_j\, p_{ij}.$$
Since we are working with an eigenvector of Pn,
$$\sum_{j=1}^{n}\sum_{\beta=1}^{k} p_{i\alpha j\beta}\, v_j = \lambda v_i.$$
Hence the statement follows.
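The Boyland Theorem can be observed directly for the block-constant refinement Pn ⊗ (J_k/k), which is a Dk-refinement with every Dij equal to J_k/k. A sketch with a hypothetical 3 × 3 ds-matrix:

```python
import numpy as np

P_n = np.array([[0.1, 0.6, 0.3],
                [0.3, 0.1, 0.6],
                [0.6, 0.3, 0.1]])          # hypothetical 3 x 3 ds-matrix
k = 2
# Block-constant refinement: every block is p_ij * (J_k / k), a D_k-refinement.
P_nk = np.kron(P_n, np.ones((k, k)) / k)

print(np.sort_complex(np.linalg.eigvals(P_n)))
print(np.sort_complex(np.linalg.eigvals(P_nk)))   # eigenvalues of P_n appear again, padded with zeros
```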

Corollary 6.2.5. If Pn is a ds-matrix, Pnk is a Dk -refinement of Pn, then

| λj(Pn) | ≤| λj(Pnk) | for all j ≤ n,

| λn(Pn) | ≥| λnk(Pnk) |,

| det(Pn) | ≥| det(Pnk) | .


In addition, by boundedness and monotonicity, if we take a sequence of Dk -refinements

and apply these functions to the refinement matrices, the resulting sequences will

converge.

Proof. Let Λ(Pn), Λ(Pnk) denote the eigenvalue multisets of Pn and Pnk. Since Pnk is a Dk-refinement of Pn,
Λ(Pn) ⊆ Λ(Pnk),
{1, λ2(Pn), ..., λn(Pn)} ⊆ {1, λ2(Pnk), ..., λn(Pnk), ..., λnk(Pnk)},
{|λ2(Pn)|, ..., |λn(Pn)|} ⊆ {|λ2(Pnk)|, ..., |λn(Pnk)|, ..., |λnk(Pnk)|}.
Since Λ(Pn) ⊆ Λ(Pnk), the j-th largest element of {|λ2(Pn)|, ..., |λn(Pn)|} is no larger than the j-th largest element of {|λ2(Pnk)|, ..., |λnk(Pnk)|}; hence |λj(Pn)| ≤ |λj(Pnk)| for all j ≤ n. Now if we take the minimum of both multisets,
min{|λ2(Pn)|, ..., |λn(Pn)|} ≥ min{|λ2(Pnk)|, ..., |λnk(Pnk)|},
so |λn(Pn)| ≥ |λnk(Pnk)|. When we take determinants of Pn and Pnk, we get
$$\det(P_n) = \prod_{\lambda\in\Lambda(P_n)} \lambda, \qquad \det(P_{nk}) = \prod_{\lambda\in\Lambda(P_{nk})} \lambda = \Big(\prod_{\lambda\in\Lambda(P_n)}\lambda\Big)\Big(\prod_{\lambda\in\Lambda(P_{nk})-\Lambda(P_n)}\lambda\Big).$$
The eigenvalue magnitudes of any stochastic matrix are bounded above by one, thus
$$\Big|\prod_{\lambda\in\Lambda(P_{nk})-\Lambda(P_n)}\lambda\Big| \le 1, \qquad |\det(P_{nk})| = \Big|\prod_{\lambda\in\Lambda(P_n)}\lambda\Big|\,\Big|\prod_{\lambda\in\Lambda(P_{nk})-\Lambda(P_n)}\lambda\Big| \le \Big|\prod_{\lambda\in\Lambda(P_n)}\lambda\Big| = |\det(P_n)|.$$


Hence the results hold.

Definition 6.2.6. [13] The bias of a point estimator W of a parameter θ is the difference between E(W) and θ; that is,
Bias_θ W = E(W) − θ.
If E(W) = θ, then we call W an unbiased estimator of θ. If E(W) ≠ θ, then we call W a biased estimator of θ.

Theorem 6.2.7. If Pn is an n × n ds-matrix with Dk -refinement matrix Pnk , then

|λ2(Pn)| ≤ E(|λ2(Pnk)|).

Furthermore |λ2(Pn)| is a biased estimator for |λ2(Pnk)| whenever the probability

distribution of |λ2(Pnk)| has support above |λ2(Pn)|. That is to say

|λ2(Pn)| < E(|λ2(Pnk)|).

Proof. Pnk is a Dk -refinement matrix of Pn so

|λ2(Pn)| ≤ |λ2(Pnk)|.

Hence

|λ2(Pn)| ≤ E(|λ2(Pnk)|).

Whenever the probability function of |λ2(Pnk)| has support above |λ2(Pn)|,

|λ2(Pn)| < E(|λ2(Pnk)|).

Thus we get the result.

When we refine {D_i}_{i=1}^n, the approximation of (D,B,µ,f) is finer, so the criterion for mixing over the refined partition is more stringent. Since |λ2(Pnk)| is our measure of mixing, we make the following conjecture.


Conjecture 6.2.8. If {P_{n_k}}_{k=1}^∞ is a sequence of ds-matrices where P_{n_{k+1}} is a refinement matrix of P_{n_k} for all k, then
{|λ2(P_{n_k})|}_{k=1}^∞
is a submartingale.

This conjecture was proven for Dk -refinements by the previous theorems.

Theorem 6.2.9. If Pnk is a k-refinement matrix of Pn, then

trace(Pnk) ≤ k(trace(Pn)).

Proof. By the refinement equations,
$$\sum_{\alpha=1}^{k}\sum_{\beta=1}^{k} p_{i\alpha i\beta} = k\, p_{ii}.$$
Thus
$$\sum_{i=1}^{n}\sum_{\alpha=1}^{k}\sum_{\beta=1}^{k} p_{i\alpha i\beta} = k\sum_{i=1}^{n} p_{ii},$$
$$\Big(\sum_{i=1}^{n}\sum_{\alpha=1}^{k} p_{i\alpha i\alpha}\Big) + \Big(\sum_{i=1}^{n}\sum_{\alpha\ne\beta} p_{i\alpha i\beta}\Big) = k\sum_{i=1}^{n} p_{ii}.$$
It follows that
$$\operatorname{trace}(P_{nk}) + \Big(\sum_{i=1}^{n}\sum_{\alpha\ne\beta} p_{i\alpha i\beta}\Big) = k\sum_{i=1}^{n} p_{ii}.$$
Since 0 ≤ piαiβ,
trace(Pnk) ≤ k(trace(Pn)).
Thus we get the result.


Remark 6.2.10. The upper bound in the previous theorem is achieved when Pn and Pnk

are identity matrices or derangement permutation matrices.


CHAPTER 7
PROBABILISTIC PROPERTIES OF PARTITION REFINEMENTS

If we know Pn of our n-partition and we apply a k-refinement, what do we expect of

Pnk? The matrix Pn provides all of our knowledge of Pnk , so in this section we make the

assumption that for fixed ij , the distribution of piαjβ is identical for each α, β.

7.1 Entries of a Refinement Matrix

First we look at probabilistic properties of the entries of a refinement matrix.

Theorem 7.1.1. If Pnk is a k-refinement matrix of Pn and the distribution of piαjβ is identical for each α, β, then
$$E(p_{i\alpha j\beta}) = \frac{p_{ij}}{k}.$$

Proof. The refinement matrix Pnk is a k-refinement matrix of Pn, so k and Pn are known. By the refinement equations,
$$\sum_{\alpha=1}^{k}\sum_{\beta=1}^{k} p_{i\alpha j\beta} = k\, p_{ij}, \qquad E\Big(\sum_{\alpha=1}^{k}\sum_{\beta=1}^{k} p_{i\alpha j\beta}\Big) = E(k\, p_{ij}), \qquad \sum_{\alpha=1}^{k}\sum_{\beta=1}^{k} E(p_{i\alpha j\beta}) = k\, p_{ij}.$$
The distribution of piαjβ is identical for each α, β, so
$$k^2 E(p_{i\alpha j\beta}) = k\, p_{ij}, \qquad E(p_{i\alpha j\beta}) = \frac{p_{ij}}{k}.$$
Thus we get the result.

Theorem 7.1.2. If Pn is an n × n ds-matrix, Pnk is a k-refinement of Pn, Pn = (pij), Pnk = (piαjβ), and the piαjβ are identically distributed for each fixed ij, then
$$E(p_{i\alpha j\beta}^q) \le k^{q-2}\, p_{ij}^q$$


for all q ∈ N.

Proof. By the refinement equations,
$$\sum_{\alpha=1}^{k}\sum_{\beta=1}^{k} p_{i\alpha j\beta} = k\, p_{ij}.$$
Taking powers of both sides gives
$$\Big(\sum_{\alpha=1}^{k}\sum_{\beta=1}^{k} p_{i\alpha j\beta}\Big)^q = k^q p_{ij}^q, \qquad \sum_{\alpha=1}^{k}\sum_{\beta=1}^{k} p_{i\alpha j\beta}^q + \gamma = k^q p_{ij}^q,$$
where γ collects the remaining terms from the expansion of the q-th power of the double sum. Since Pnk is a ds-matrix, 0 ≤ piαjβ, so 0 ≤ γ and
$$\sum_{\alpha=1}^{k}\sum_{\beta=1}^{k} p_{i\alpha j\beta}^q \le k^q p_{ij}^q.$$
Now take expected values,
$$\sum_{\alpha=1}^{k}\sum_{\beta=1}^{k} E(p_{i\alpha j\beta}^q) \le k^q p_{ij}^q, \qquad E(p_{i\alpha j\beta}^q) \le k^{q-2} p_{ij}^q.$$
So the statement holds.

Theorem 7.1.3. If Pn is an n × n ds-matrix, Pnk is a k-refinement of Pn, Pn = (pij), Pnk = (piαjβ), the piαjβ are identically distributed for each fixed ij, and pij/k ≤ t ≤ 1, then
$$P(p_{i\alpha j\beta} \ge t) \le \frac{p_{ij}}{kt}.$$

Proof. Apply Markov's inequality with E(piαjβ) = pij/k.


Since we are treating the entries of a refinement matrix as a random event, what

can we say about the variance of entries?

Theorem 7.1.4. If Pn is an n × n ds-matrix, Pnk is a k-refinement of Pn, Pn = (pij), Pnk = (piαjβ), and the piαjβ are identically distributed for each fixed ij, then
$$\max\Big(0,\; p_{ij}^2 + \frac{1}{k} - 1\Big) \le E(p_{i\alpha j\beta}^2) \le p_{ij}^2.$$

Proof. First note that Pnk is a ds-matrix, so 0 ≤ piαjβ ≤ 1 and 0 ≤ piαjβ² ≤ 1. By the refinement equations,
$$\sum_{\alpha=1}^{k}\sum_{\beta=1}^{k} p_{i\alpha j\beta} = k\, p_{ij}, \qquad \Big(\sum_{\alpha=1}^{k}\sum_{\beta=1}^{k} p_{i\alpha j\beta}\Big)^2 = k^2 p_{ij}^2,$$
$$\sum_{\alpha=1}^{k}\sum_{\beta=1}^{k} p_{i\alpha j\beta}^2 + \sum_{\alpha\ne\alpha' \text{ or } \beta\ne\beta'} p_{i\alpha j\beta}\, p_{i\alpha' j\beta'} = k^2 p_{ij}^2.$$
Now 0 ≤ piαjβ ≤ 1, so
$$0 \le \sum_{\alpha\ne\alpha' \text{ or } \beta\ne\beta'} p_{i\alpha j\beta}\, p_{i\alpha' j\beta'} \le k^2 - k.$$
So if we use these upper and lower bounds, we see that
$$\sum_{\alpha=1}^{k}\sum_{\beta=1}^{k} p_{i\alpha j\beta}^2 \le k^2 p_{ij}^2 \le k^2 - k + \sum_{\alpha=1}^{k}\sum_{\beta=1}^{k} p_{i\alpha j\beta}^2.$$
Taking expected values shows that
$$E\Big(\sum_{\alpha=1}^{k}\sum_{\beta=1}^{k} p_{i\alpha j\beta}^2\Big) \le E(k^2 p_{ij}^2) \le E\Big(k^2 - k + \sum_{\alpha=1}^{k}\sum_{\beta=1}^{k} p_{i\alpha j\beta}^2\Big),$$


$$k^2 E(p_{i\alpha j\beta}^2) \le k^2 p_{ij}^2 \le k^2 - k + k^2 E(p_{i\alpha j\beta}^2), \qquad E(p_{i\alpha j\beta}^2) \le p_{ij}^2 \le 1 - \frac{1}{k} + E(p_{i\alpha j\beta}^2).$$
Using the left inequality we get
$$E(p_{i\alpha j\beta}^2) \le p_{ij}^2 \le 1 - \frac{1}{k} + E(p_{i\alpha j\beta}^2) \le 1 - \frac{1}{k} + p_{ij}^2,$$
$$p_{ij}^2 + \frac{1}{k} - 1 \le E(p_{i\alpha j\beta}^2) \le p_{ij}^2.$$
Combining this statement with the fact that a squared value is nonnegative gives us the result.

Corollary 7.1.5. If Pn is an n × n ds-matrix, Pnk is a k-refinement of Pn, Pn = (pij), Pnk = (piαjβ), and the piαjβ are identically distributed for each fixed ij, then
$$V(p_{i\alpha j\beta}) \le \Big(1 - \frac{1}{k^2}\Big)p_{ij}^2.$$

Proof. A standard result is that V(piαjβ) = E(piαjβ²) − E(piαjβ)². Thus, using the previous theorem,
$$V(p_{i\alpha j\beta}) = E(p_{i\alpha j\beta}^2) - \frac{p_{ij}^2}{k^2} \le p_{ij}^2 - \frac{p_{ij}^2}{k^2} = \Big(1 - \frac{1}{k^2}\Big)p_{ij}^2.$$
So we have an upper bound on the variance.

Theorem 7.1.6. If Pnk is a k-refinement matrix of Pn, then

E(trace(Pnk)) = trace(Pn), and

E(λ2(Pnk) + λ3(Pnk) + ... + λnk(Pnk)) = λ2(Pn) + λ3(Pn) + ...λn(Pn).


Proof.
$$E(\operatorname{trace}(P_{nk})) = E\Big(\sum_{i=1}^{n}\sum_{\alpha=1}^{k} p_{i\alpha i\alpha}\Big) = \sum_{i=1}^{n}\sum_{\alpha=1}^{k} E(p_{i\alpha i\alpha}) = \sum_{i=1}^{n}\sum_{\alpha=1}^{k} \frac{p_{ii}}{k} = \sum_{i=1}^{n} p_{ii} = \operatorname{trace}(P_n).$$
The second statement follows from the relationship between the eigenvalues and the trace, and the fact that the largest eigenvalue of a probability matrix is one.

7.2 The Central Tendency of Refinement Matrices

Notation 7.2.1. Let P̄nk denote the k-refinement matrix of Pn with entries p̄iαjβ = pij/k.

Since the entries of P̄nk are the expected values of the piαjβ when the entries of Pnk are identically distributed, we take P̄nk to be the central tendency of refinement matrices.

Definition 7.2.2. For a given random variable, a central tendency is a location where the data tend to cluster. The most common measures of central tendency are population mean, population median, and population mode. If a real random variable has a symmetric distribution, the central tendency is usually measured by the mean. If a real random variable is highly skewed, then the central tendency is usually measured by the median. A mode is usually used when the mean and the median are not suitable. If x is a random variable over the probability space (ℝ, B, P) where P is defined by the probability distribution function g, P(A) = ∫_A g(x) dx, then the population mean is the expected value
$$E(x) = \int_{\mathbb{R}} x\, g(x)\, dx,$$


when it exists. The population median, m, is the value where half of the measure is on either side,
$$\int_{-\infty}^{m} g(x)\, dx = \int_{m}^{\infty} g(x)\, dx,$$
when it exists. A mode is an absolute maximum of g(x).

Remark 7.2.3. Since P̄nk has only n distinct columns, P̄nk is noninvertible for all k > 1.

The next theorem further justifies taking P̄nk as the central tendency of refinement matrices of Pn.

Theorem 7.2.4. If Pnk is a k-refinement matrix of the ds-matrix Pn, then
$$E\big(\|P_{nk} - \bar{P}_{nk}\|_F^2\big) \le E\big(\|P_{nk} - M\|_F^2\big)$$
for any nk × nk matrix M.

Proof. Since E(piαjβ) = pij/k, the value pij/k minimizes the function x ↦ E((piαjβ − x)²). By the definition of the Frobenius norm the statement holds.

The next theorem indicates how much we can expect Pnk to differ from P̄nk.

Theorem 7.2.5. If Pn is an n × n ds-matrix, Pnk is a k-refinement of Pn, Pn = (pij), Pnk = (piαjβ), and the piαjβ are identically distributed for each fixed i, j, then
$$E\big(\|P_{nk} - \bar{P}_{nk}\|_F^2\big) \le (k^2 - 1)\sum_{i=1}^{n}\sum_{j=1}^{n} p_{ij}^2.$$

Proof. Here we use our upper bound on the variance of the entries:
$$E\big(\|P_{nk} - \bar{P}_{nk}\|_F^2\big) = E\Big(\sum_{i=1}^{n}\sum_{j=1}^{n}\sum_{\alpha=1}^{k}\sum_{\beta=1}^{k}\Big(p_{i\alpha j\beta} - \frac{p_{ij}}{k}\Big)^2\Big) = \sum_{i=1}^{n}\sum_{j=1}^{n}\sum_{\alpha=1}^{k}\sum_{\beta=1}^{k} E\Big(\Big(p_{i\alpha j\beta} - \frac{p_{ij}}{k}\Big)^2\Big) \le \sum_{i=1}^{n}\sum_{j=1}^{n}\sum_{\alpha=1}^{k}\sum_{\beta=1}^{k}\Big(1 - \frac{1}{k^2}\Big)p_{ij}^2$$


$$= \sum_{i=1}^{n}\sum_{j=1}^{n} k^2\Big(1 - \frac{1}{k^2}\Big)p_{ij}^2 = (k^2 - 1)\sum_{i=1}^{n}\sum_{j=1}^{n} p_{ij}^2.$$
Thus the statement holds.

7.3 Metric Entropy After Equal Measure Refinement

Metric entropy is a fundamental concept in ergodic theory, so what do we expect of

our entropy estimate after refinement?

Theorem 7.3.1. If Pn is an n × n ds-matrix, Pnk is a k-refinement of Pn, Pn = (pij), Pnk = (piαjβ), and the piαjβ are identically distributed for each fixed ij; let h_n, h_{nk} denote the metric entropies of the Markov shifts defined by ((1/n, ..., 1/n), Pn) and ((1/nk, ..., 1/nk), Pnk); then
E(h_{nk}) ≤ h_n + log(k).

Proof. First define 0 log(0) := 0. Note that
$$h_n = -\sum_{i=1}^{n}\sum_{j=1}^{n}\frac{1}{n}\,p_{ij}\log(p_{ij}), \qquad h_{nk} = -\sum_{\alpha=1}^{k}\sum_{\beta=1}^{k}\sum_{i=1}^{n}\sum_{j=1}^{n}\frac{1}{nk}\,p_{i\alpha j\beta}\log(p_{i\alpha j\beta}).$$
If g(x) = x log x, then g is a convex function on (0, ∞). By Jensen's inequality,
$$g(E(p_{i\alpha j\beta})) \le E(g(p_{i\alpha j\beta})), \qquad E(p_{i\alpha j\beta})\log(E(p_{i\alpha j\beta})) \le E(p_{i\alpha j\beta}\log(p_{i\alpha j\beta})),$$


$$\frac{p_{ij}}{k}\log\Big(\frac{p_{ij}}{k}\Big) \le E(p_{i\alpha j\beta}\log(p_{i\alpha j\beta})).$$
Taking summations of both sides leads to
$$\sum_{\alpha=1}^{k}\sum_{\beta=1}^{k}\sum_{i=1}^{n}\sum_{j=1}^{n}\frac{p_{ij}}{k}\log\Big(\frac{p_{ij}}{k}\Big) \le \sum_{\alpha=1}^{k}\sum_{\beta=1}^{k}\sum_{i=1}^{n}\sum_{j=1}^{n} E(p_{i\alpha j\beta}\log(p_{i\alpha j\beta})),$$
$$k^2\sum_{i=1}^{n}\sum_{j=1}^{n}\frac{p_{ij}}{k}\log\Big(\frac{p_{ij}}{k}\Big) \le E\Big(\sum_{\alpha=1}^{k}\sum_{\beta=1}^{k}\sum_{i=1}^{n}\sum_{j=1}^{n} p_{i\alpha j\beta}\log(p_{i\alpha j\beta})\Big).$$
The formula for metric entropy has a negative sign, and the formula for h_{nk} includes a factor of 1/(nk); multiplying both sides by −1/(nk) gives
$$-\frac{1}{nk}E\Big(\sum_{\alpha=1}^{k}\sum_{\beta=1}^{k}\sum_{i=1}^{n}\sum_{j=1}^{n} p_{i\alpha j\beta}\log(p_{i\alpha j\beta})\Big) \le -\frac{1}{nk}\,k^2\sum_{i=1}^{n}\sum_{j=1}^{n}\frac{p_{ij}}{k}\log\Big(\frac{p_{ij}}{k}\Big),$$
$$E(h_{nk}) \le -\sum_{i=1}^{n}\sum_{j=1}^{n}\frac{1}{n}\,p_{ij}\log\Big(\frac{p_{ij}}{k}\Big) = -\sum_{i=1}^{n}\sum_{j=1}^{n}\frac{1}{n}\,p_{ij}\log(p_{ij}) + \sum_{i=1}^{n}\sum_{j=1}^{n}\frac{p_{ij}}{n}\log(k),$$
$$E(h_{nk}) \le h_n + \log(k).$$
Thus we get the bound.


CHAPTER 8
ULAM MATRICES

If we know P, then
(Xn, Σn, ((1/n, 1/n, ..., 1/n), P), fn)
provides an approximation of our dynamical system, but what if we do not know P? Ulam's method provides a Monte Carlo technique to statistically or numerically approximate P. We will denote our approximation of P by P̂ = (p̂ij). So we will approximate
(D,B,µ,f) with (Xn, Σn, ((1/n, 1/n, ..., 1/n), P), fn),
and we will approximate
(Xn, Σn, ((1/n, 1/n, ..., 1/n), P), fn) with (Xn, Σn, ((1/n, 1/n, ..., 1/n), P̂), fn).
Ultimately we approximate
(D,B,µ,f) with (Xn, Σn, ((1/n, 1/n, ..., 1/n), P̂), fn).

8.1 Building the Stochastic Ulam Matrix

Here we restate our procedure for testing in terms of the stochastic Ulam matrix P̂.

Algorithm 8.1.1. Using Ulam's Method to approximate (D,B,µ,f) with a Markov shift:

1. Apply an n-partition to D.

2. Randomly generate or statistically sample mi uniformly distributed independent points in Di for all i (mi ∈ ℕ). Apply f to the data points.

3. Set mij equal to the number of points such that x ∈ Di and f(x) ∈ Dj.

4. Let P̂ be the matrix with
p̂ij = mij / mi,


where mi is the number of points that start in state Di.

5. Let λ̂2 be P̂'s second largest eigenvalue in magnitude. If λ̂2 is not unique, pick an eigenvalue of minimal distance to one.

6. If |λ̂2| is sufficiently smaller than 1, reject the hypothesis that (D,B,µ,f) is not mixing in favor of the hypothesis that (D,B,µ,f) is weak-mixing.

7. If |λ̂2| is not sufficiently small, but |λ̂2 − 1| is sufficiently large, reject the hypothesis that (D,B,µ,f) is not ergodic in favor of the hypothesis that (D,B,µ,f) is ergodic.

8. If λ̂2 is close to one, fail to reject the hypothesis that (D,B,µ,f) is not ergodic.

9. If we accept the hypothesis that (D,B,µ,f) is weak-mixing, use the rate at which
$$\binom{N}{n-1}|\hat{\lambda}_2|^{N-n+1} \to 0 \quad\text{as } N \to \infty$$
as an estimate of the rate at which f mixes.

10. Let
$$-\frac{1}{n}\sum_{i=1}^{n}\sum_{j=1}^{n} \hat{p}_{ij}\log(\hat{p}_{ij})$$
be our estimate of the entropy. Note that by construction of n-partitions, a stationary distribution of P is (1/n, 1/n, ..., 1/n), so we use (1/n, ..., 1/n) for the stationary distribution of our approximating Markov shift.
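A compact sketch of steps 1–5 and 10, assuming a one-dimensional stand-in for the stirring protocol (the doubling map on [0, 1), used purely for illustration) and a partition into n equal subintervals:

```python
import numpy as np

rng = np.random.default_rng(2)

def ulam_matrix(f, n, points_per_cell=1000):
    # Steps 1-4: estimate P-hat with p-hat_ij = m_ij / m_i from uniform sample points.
    P_hat = np.zeros((n, n))
    for i in range(n):
        x = (i + rng.random(points_per_cell)) / n      # uniform points in D_i = [i/n, (i+1)/n)
        j = np.minimum((f(x) * n).astype(int), n - 1)   # cell containing f(x)
        P_hat[i] = np.bincount(j, minlength=n) / points_per_cell
    return P_hat

doubling = lambda x: (2.0 * x) % 1.0      # stand-in stirring protocol on [0, 1)
n = 8
P_hat = ulam_matrix(doubling, n)

# Step 5: the test statistic is the second largest eigenvalue magnitude.
print("|lambda_2| =", np.sort(np.abs(np.linalg.eigvals(P_hat)))[-2])

# Step 10: entropy estimate with the uniform stationary distribution.
safe = np.where(P_hat > 0, P_hat, 1.0)
print("entropy estimate =", -(P_hat * np.log(safe)).sum() / n)
```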

When we use the term sufficient, we compare the test statistic to the corresponding

critical value for hypothesis testing. The critical value comes from the level of significance

we set, and the test statistic’s probability distribution. When we make a decision, it is

better to conclude that a weak-mixing dynamical system is not mixing (type II error) than

to accept a non-mixing dynamical system as weak-mixing (type I error). We will discuss

probability distributions in a later chapter. We propose using a beta distribution with

α ≥ 2, and β = 1 to set the critical value for the weak-mixing hypothesis.

Definition 8.1.2. The matrix M = (mij) from the previous algorithm is called an Ulam

Matrix.


Some dynamical systems have atypical behavior on subsets of measure zero. If we use a mesh of points for sampling, we may unintentionally sample exclusively from an atypical subset. To avoid this type of sample bias, the points should have random coordinates. This way, if a subset has an atypical property, sampling will probably not be from that subset only. Usually it is easier and faster to generate m random points (m = Σ_{i=1}^n mi) in D and then count the mi, rather than generate mi points in each Di. The quotient in the proof that P̂ converges to P provides a way to decide how large m should be before generating data. If min_{1≤i≤n} mi is too small, generate more points and combine the sets of points until min_{1≤i≤n} mi is sufficient.

The function f is measure preserving, so if after applying the map to the points there is a Di with no data points, not enough points were used. If min_{1≤i≤n} mi is sufficiently large and close to being constant, the number of points in each Di before and after mapping will be about the same.

The matrix P is a ds-matrix; P̂ will be a stochastic matrix but may not be a ds-matrix. For (~v, P̂) to be a Markov shift, P̂ must be a stochastic matrix and ~v must be a left eigenvector of P̂ and a positive probability vector. So the criteria for ergodicity and mixing hold for our observations as long as P̂ has a left eigenvector that is a positive stationary distribution. We will show that P̂ converges to a ds-matrix as min_{1≤i≤n} mi → ∞, so if our observations do not provide such a vector, either not enough points were used, a hypothesis was violated, or an error was made.

8.2 Properties of Ulam Matrices

Theorem 8.2.1. Let P̂ be an n × n stochastic matrix from Ulam's method (n is fixed) where the sample points are independent uniform random variables. Then
p̂ij → pij almost surely as mi → ∞,
and
E(p̂ij) = pij.


Proof. The matrix P is a ds-matrix, so 0 ≤ pij ≤ 1, and mi ∈ ℕ; thus the pair (pij, mi) defines a binomial random variable where pij is the probability of success, mi is the number of trials, and mij equals the number of successes. By the strong law of large numbers,
mij / mi → pij almost surely.
Setting p̂ij = mij / mi gives us the result.

Since we are using an approximation of P, how far off is P̂? How many data points should be used to generate P̂?

Theorem 8.2.2. If P̂ is an approximation of P generated by Ulam's method with a fixed partition and the sample points are independent uniform random variables, then
$$\|\hat{P} - P\|_F \to 0 \ \text{ in probability as } \min_{1\le i\le n} m_i \to \infty.$$
Furthermore, for any given ε > 0,
$$P\big(\|\hat{P} - P\|_F > \varepsilon\big) \le \frac{n^2}{4\varepsilon^2 \min_{1\le i\le n} m_i}.$$

Proof. Let ε > 0 be given. Use Markov's inequality,
$$P\big(\|\hat{P} - P\|_F > \varepsilon\big) = P\Big(\sum_{i=1}^{n}\sum_{j=1}^{n}(\hat{p}_{ij} - p_{ij})^2 > \varepsilon^2\Big) \le \frac{E\big(\sum_{i=1}^{n}\sum_{j=1}^{n}(\hat{p}_{ij} - p_{ij})^2\big)}{\varepsilon^2} = \frac{\sum_{i=1}^{n}\sum_{j=1}^{n} E\big((\hat{p}_{ij} - p_{ij})^2\big)}{\varepsilon^2}.$$
Now
$$E\big((\hat{p}_{ij} - p_{ij})^2\big) = V(\hat{p}_{ij}) = \frac{p_{ij}(1 - p_{ij})}{m_i} \le \frac{1}{4m_i}$$


$$\le \frac{1}{4\min_{1\le i\le n} m_i}.$$
It follows that
$$P\big(\|\hat{P} - P\|_F > \varepsilon\big) \le \frac{n^2}{4\varepsilon^2 \min_{1\le i\le n} m_i}.$$
Both n and ε are fixed before we sample, so
$$\|\hat{P} - P\|_F \to 0$$
in probability as min_{1≤i≤n} mi → ∞.

The previous theorem showed convergence in probability and gives insight into the probability distribution of P̂. Our next theorem shows that
$$E\big(\|\hat{P} - P\|_F\big) \to 0 \quad\text{as } \min_{1\le i\le n} m_i \to \infty.$$
So the next theorem implies the previous convergence result; we showed the last theorem for its statement about the probability distribution.

Theorem 8.2.3. If P̂ is an approximation of P generated by Ulam's method with a fixed partition and the sample points are independent uniform random variables, then
$$E\big(\|\hat{P} - P\|_F\big) \to 0 \quad\text{as } \min_{1\le i\le n} m_i \to \infty.$$

Proof. By Jensen's inequality,
$$\Big(E\big(\|\hat{P} - P\|_F\big)\Big)^2 \le E\big(\|\hat{P} - P\|_F^2\big) = E\Big(\sum_{i=1}^{n}\sum_{j=1}^{n}(\hat{p}_{ij} - p_{ij})^2\Big) = \sum_{i=1}^{n}\sum_{j=1}^{n} E\big((\hat{p}_{ij} - p_{ij})^2\big) = \sum_{i=1}^{n}\sum_{j=1}^{n} V(\hat{p}_{ij})$$


$$\le \sum_{i=1}^{n}\sum_{j=1}^{n}\frac{1}{4m_i} \le \sum_{i=1}^{n}\sum_{j=1}^{n}\frac{1}{4\min_{1\le i\le n} m_i} = \frac{n^2}{4\min_{1\le i\le n} m_i}.$$
Taking square roots of both sides yields
$$E\big(\|\hat{P} - P\|_F\big) \le \frac{n}{2\sqrt{\min_{1\le i\le n} m_i}}.$$
Since n is fixed, taking the limit as min_{1≤i≤n} mi goes to infinity gives us the result.

We want λ̂2 to converge to the correct value; otherwise the second largest eigenvalue would make a poor test statistic.

Theorem 8.2.4. For a fixed n-partition, if P̂ is our observed matrix from Ulam's method for P, where the sample points are independent uniform random variables, then for all j
λ_j(P̂) → λ_j(P) in probability as min_{1≤i≤n} mi → ∞.

Proof. The roots of a polynomial depend continuously on its coefficients (as a set, in the Hausdorff topology). So eigenvalues depend continuously on the coefficients of the characteristic polynomial. By the definition of the characteristic polynomial, the coefficients of the characteristic polynomial of a matrix depend continuously on the entries of the matrix. So λ_j(P̂) depends continuously on the entries of P̂. By a previous theorem,
‖P̂ − P‖_F → 0 in probability as min_{1≤i≤n} mi → ∞.
So we get the result.


Definition 8.2.5. [13] If T is a function on empirical data X and θ is a parameter of

the probability distribution of X , then T (X ) is a sufficient statistic for θ if the conditional

distribution of X given T (X ) does not depend on θ.

Theorem 8.2.6. λ̂2 is a sufficient statistic for our hypothesis testing.

Proof. Let's look at the weak-mixing hypothesis:
P(((1/n, ..., 1/n), P̂) is mixing | λ̂2, λ2) = P(|λ̂2| < 1 | λ̂2, λ2) = P(|λ̂2| < 1 | λ̂2).
A similar result holds for ergodicity.

Once again we look at the simplest case; here we make inferences about the probability distribution of an Ulam approximation of a 2 × 2 matrix.

Example 8.2.7 (Special Case n = 2). If we take a 2-partition of (D,B,µ,f) and apply our algorithm where our sample points are independent uniform random variables,
$$P = \begin{pmatrix} p & q \\ q & p \end{pmatrix}, \qquad (p \in [0,1],\; q = 1-p).$$
Let p̂ be our estimate of p. For this special case, p11 = p22 = p and p12 = p21 = 1 − p = q, so we may count a success whenever a point does not leave its initial state. Say that we have m data points; thus (m, p) defines a binomial random variable with p̂ equal to the number of points that do not leave their initial state divided by m, and
$$\hat{P} = \begin{pmatrix} \hat{p} & \hat{q} \\ \hat{q} & \hat{p} \end{pmatrix}, \qquad (\hat{p} \in [0,1],\; \hat{q} = 1-\hat{p}).$$
The characteristic polynomial of P̂ is (λ − 1)(λ + 1 − 2p̂), so λ̂2 = 2p̂ − 1. Since 0 ≤ p̂ ≤ 1, −1 ≤ λ̂2 ≤ 1. So we may use the binomial distribution to establish critical values for hypothesis testing. Look at the cases where P or P̂ equals a permutation matrix:
P(p̂ = 1 | p) = P(mp̂ = m | p) = p^m,


P(p̂ = 0 | p) = P(mp̂ = 0 | p) = (1 − p)^m,
P(p̂ = 1 or 0 | p) = P(mp̂ = m or 0 | p) = p^m + (1 − p)^m,
P(p̂ ≠ 1 | p = 1) = P(mp̂ ≠ m | p = 1) = 0,
P(p̂ ≠ 0 | p = 0) = P(mp̂ ≠ 0 | p = 0) = 0.
So if n = 2 and P is a permutation matrix, then P̂ = P almost surely.

Also, if θ ∈ [0, 1],
$$P(|\hat{\lambda}_2| < \theta) = P(|2\hat{p} - 1| < \theta) = P\big((1-\theta)/2 < \hat{p} < (1+\theta)/2\big) = P\big(m(1-\theta)/2 < m\hat{p} < m(1+\theta)/2\big) = \sum_{m(1-\theta)/2 < k < m(1+\theta)/2} \binom{m}{k} p^k (1-p)^{m-k}.$$
Hence the binomial distribution gives us the probability distribution of λ̂2 when n = 2.
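A sketch of this computation, with hypothetical values of p, m, and θ:

```python
from math import comb

def prob_lambda2_below(theta, p, m):
    # P(|lambda_2-hat| < theta) for n = 2: sum the binomial probabilities with
    # m(1 - theta)/2 < k < m(1 + theta)/2.
    lo, hi = m * (1 - theta) / 2, m * (1 + theta) / 2
    return sum(comb(m, k) * p**k * (1 - p)**(m - k)
               for k in range(m + 1) if lo < k < hi)

# Hypothetical values: true p = 0.5, m = 200 sample points, threshold theta = 0.2.
print(prob_lambda2_below(0.2, 0.5, 200))
```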

Unfortunately, when n > 2 and P̂ is a random matrix, the probability distribution of |λ2(P̂)| is not so obvious.

Theorem 8.2.8. If P̂ is a stochastic Ulam matrix that approximates P, where the sample points are independent uniform random variables, and p̂ij > 0, then
pij > 0 almost surely.

Proof. Proof by contrapositive. The pair (pij, mi) defines a binomial random variable with mij as the number of successes. If pij = 0, then
$$P(m_{ij} > 0 \mid p_{ij} = 0) = \sum_{k=1}^{m_i} \binom{m_i}{k} p_{ij}^k (1 - p_{ij})^{m_i - k} = \sum_{k=1}^{m_i} \binom{m_i}{k} 0^k (1 - 0)^{m_i - k}$$


= 0.

We get the result by taking the contrapositive of the statement we just proved.

Corollary 8.2.9. If (Xn, Σn, ((1/n, ..., 1/n), P̂), fn) is a Markov shift approximating the dynamical system (D,B,µ,f) and P̂ is a positive stochastic Ulam matrix where the sample points are independent and uniformly distributed, then we will reject the hypothesis that (D,B,µ,f) is not weak-mixing in favor of the hypothesis that (D,B,µ,f) is weak-mixing.

Proof. The matrix P̂ is positive, thus for all i, j, pij > 0 and P is a nonnegative ds-matrix. By the Perron–Frobenius theorem it follows that
|λ2(P)| < 1
and (Xn, Σn, ((1/n, ..., 1/n), P̂), fn) is weak-mixing. Hence we reject the hypothesis that (D,B,µ,f) is not weak-mixing in favor of the hypothesis that (D,B,µ,f) is weak-mixing.

The Jordan canonical form of P provides a mixing rate estimate. What can we say about using the Jordan canonical form of P̂ to approximate P's canonical form?

Theorem 8.2.10. If P, P̂ are n × n matrices with the same conjugating matrix in their Jordan canonical forms (P = EJE^{−1}, P̂ = EĴE^{−1}), then for any matrix norm

P(‖ Ĵ − J ‖ ≤ k / (‖ E ‖ ‖ E^{−1} ‖)) ≤ P(‖ P̂ − P ‖ ≤ k).

Moreover, if E is unitary and the norm is the 2-norm or the Frobenius norm, then

P(‖ Ĵ − J ‖ ≤ k) = P(‖ P̂ − P ‖ ≤ k).


Proof.

P(k < ‖ P̂ − P ‖) = P(k < ‖ EĴE^{−1} − EJE^{−1} ‖)
                 = P(k < ‖ E(Ĵ − J)E^{−1} ‖)
                 ≤ P(k < ‖ E ‖ ‖ Ĵ − J ‖ ‖ E^{−1} ‖).

Multiply both sides by negative one and add one to both sides:

1 − P(k < ‖ P̂ − P ‖) ≥ 1 − P(k < ‖ E ‖ ‖ Ĵ − J ‖ ‖ E^{−1} ‖).

By the properties of probability measures we see that

P(‖ P̂ − P ‖ ≤ k) ≥ P(‖ E ‖ ‖ Ĵ − J ‖ ‖ E^{−1} ‖ ≤ k),
P(‖ P̂ − P ‖ ≤ k) ≥ P(‖ Ĵ − J ‖ ≤ k / (‖ E ‖ ‖ E^{−1} ‖)).

When E is unitary and we have the 2-norm or the Frobenius norm,

P(k < ‖ P̂ − P ‖) = P(k < ‖ EĴE^{−1} − EJE^{−1} ‖)
                 = P(k < ‖ E(Ĵ − J)E^{−1} ‖)
                 = P(k < ‖ Ĵ − J ‖).

Thus both statements hold.


CHAPTER 9
CONVERGENCE TO AN OPERATOR

9.1 Stirring Protocols as Operators and Operator Eigenfunctions

Notation 9.1.1. If (D, B, µ, f) is a dynamical system and f is measure preserving, let f* denote the operator

f* : L1(D, B, µ) → L1(D, B, µ),   f*(u) = u ∘ f.

For our problem, f represents our stirring protocol, u is a probability distribution function that measures the concentration of an ingredient before stirring, and u ∘ f^k is a probability distribution function that measures the concentration of the ingredient after running the protocol k times. If our stirring protocol mixes, then for any continuous initial concentration of our ingredient, the concentration should become constant as we run the stirring protocol. Mathematically we represent this situation as

∫_A u ∘ f^k dµ → µ(A) for all A ∈ B as k → ∞.

For mixing, we are concerned with u ∈ L1(D,B,µ) that define a probability

distribution function on D, i.e. u ≥ 0 and ‖ u ‖= 1. When u is constant, the

concentration of the ingredient is constant and the ingredient is mixed. The function f ∗

is a Perron-Frobenius operator defined by our stirring protocol. Stationary distributions for a stochastic matrix are left eigenvectors with eigenvalue one; likewise, we want to look at u ∈ L1(D, B, µ) that define stationary distributions for f*. We shall see that P is a Galerkin projection of f*.

If u is a nonconstant nonnegative eigenfunction for f ∗ with λ = 1 that defines a

probability distribution function, then when we stir an ingredient with initial distribution u,

the concentration of the ingredient does not change. Since u is nonconstant, our stirring

protocol is not mixing.


The next three theorems for eigenfunctions of f ∗ are similar to theorems regarding

eigenvectors of stochastic matrices and help justify using eigenvectors to approximate

eigenfunctions.

Proposition 9.1.2. If u is an integrable eigenfunction of f*, f*(u) = λu, and f is µ-measure preserving, then

λ = 1   or   ∫_D u dµ = 0.

Proof. The function f is measure preserving, so ∫_D u ∘ f dµ = ∫_D u dµ. Since u is an eigenfunction, f*(u) = λu; so u ∘ f = λu. Thus

∫_D u dµ = ∫_D λu dµ,
(1 − λ) ∫_D u dµ = 0.

Thus the statement holds.

So if u is an eigenfunction of f* and defines a probability distribution function on D, then λ = 1.

Proposition 9.1.3. If u is an eigenfunction of f*, f*(u) = λu, and f is µ-measure preserving, then

|λ| ≤ 1.

Proof. By hypothesis λu = u ∘ f, so if we integrate over any set A ∈ B,

∫_A λu dµ = ∫_A u ∘ f dµ,
λ ∫_A u dµ = ∫_{f^{−1}(A)} u dµ.

It follows that for any natural number k,

λ^k ∫_A u dµ = ∫_{f^{−k}(A)} u dµ.

Taking absolute values of both sides of the equation yields

|λ^k ∫_A u dµ| = |∫_{f^{−k}(A)} u dµ|,
|λ|^k |∫_A u dµ| ≤ ∫_{f^{−k}(A)} |u| dµ ≤ ∫_D |u| dµ,
|λ|^k |∫_A u dµ| ≤ ∫_D |u| dµ.

Now, u is an eigenfunction, thus u ≠ 0 and there exists B ∈ B such that ∫_B u dµ ≠ 0. So

|λ|^k ≤ (∫_D |u| dµ) / |∫_B u dµ|   for all k ∈ N.

Since |λ|^k is bounded above for all k ∈ N, |λ| ≤ 1.

Notation 9.1.4. If u ∈ L1(D, B, µ) and {{D_ki}_{i=1}^{n_k}}_{k=1}^∞ is a sequence of n_k-partitions (µ(D_ki) = 1/n_k for all k, i), let u_k denote the simple function over {D_ki}_{i=1}^{n_k} defined by

u_k(x) = ∑_{i=1}^{n_k} [ (∫_{D_ki} u dµ) / µ(D_ki) ] 1_{D_ki}(x) = n_k ∑_{i=1}^{n_k} [ ∫_{D_ki} u dµ ] 1_{D_ki}(x).

By construction

∫_{D_ki} u_k dµ = ∫_{D_ki} u dµ for all k, i,

and thus

∫_D u_k dµ = ∫_D u dµ for all k.

So u_k is our best approximation of u over the σ-algebra generated by {D_ki}_{i=1}^{n_k}. In fact u_k is a projection of u onto the set of simple functions over {D_ki}_{i=1}^{n_k} [14]. If {D_(k+1)i}_{i=1}^{n_{k+1}} is a refinement of {D_ki}_{i=1}^{n_k}, then the set of simple functions over {D_ki}_{i=1}^{n_k} is contained in the set of simple functions over {D_(k+1)i}_{i=1}^{n_{k+1}}; it follows that u_{k+1} is a better approximation of u than u_k. We use the entries from stationary distributions of {P_{n_k}}_{k=1}^∞ to construct approximations of stationary distributions of f*. If ~p_k = ~p_k P_{n_k} and ~p_k is a probability vector, then we look to see if

n_k ∑_{i=1}^{n_k} (~p_k)_i 1_{D_ki}(x)

approximates a nonconstant stationary distribution of f*.

Theorem 9.1.5. If u ∈ L1(D, B, µ) and {{D_ki}_{i=1}^{n_k}}_{k=1}^∞ is a sequence of n_k-partitions where {D_(k+1)i}_{i=1}^{n_{k+1}} is an (n_{k+1}/n_k)-refinement of {D_ki}_{i=1}^{n_k} (n_k | n_{k+1}), then

∫_{D_kj} u_{k+1} dµ = ∫_{D_kj} u_k dµ.

Proof.

∫_{D_kj} u_{k+1} dµ = ∫_{D_kj} n_{k+1} ∑_{i=1}^{n_{k+1}} [ ∫_{D_(k+1)i} u dµ ] 1_{D_(k+1)i}(x) dµ
                   = n_{k+1} ∑_{i=1}^{n_{k+1}} [ ∫_{D_(k+1)i} u dµ ] ∫_{D_kj} 1_{D_(k+1)i}(x) dµ
                   = n_{k+1} ∑_{i=1, D_(k+1)i ⊂ D_kj}^{n_{k+1}} [ ∫_{D_(k+1)i} u dµ ] (1/n_{k+1})
                   = ∑_{i=1, D_(k+1)i ⊂ D_kj}^{n_{k+1}} ∫_{D_(k+1)i} u dµ
                   = ∫_{⊎_{D_(k+1)i ⊂ D_kj} D_(k+1)i} u dµ
                   = ∫_{D_kj} u dµ
                   = ∫_{D_kj} u_k dµ.

We present the next proposition since the result holds both for f* and for stochastic matrices.

Proposition 9.1.6. If (D, B, µ, f) is a dynamical system and f is measure preserving, then u = 1 is an eigenfunction of f* with eigenvalue one.


If u = 1, then uk = 1, so uk = u. When u = 1, it represents an ingredient with

constant concentration, that is to say the ingredient is mixed throughout.

Notation 9.1.7. If {{D_ki}_{i=1}^{n_k}}_{k=1}^∞ is a sequence of n_k-partitions where {D_(k+1)i}_{i=1}^{n_{k+1}} is an (n_{k+1}/n_k)-refinement of {D_ki}_{i=1}^{n_k}, n_k | n_{k+1}, n_k < n_{k+1}, then let S_k and S_∞ denote the following σ-algebras:

S_k = σ({D_ki}_{i=1}^{n_k}),

the intersection of all σ-algebras containing the partition sets {D_ki}_{i=1}^{n_k};

S_∞ = σ(⋃_{k=1}^∞ S_k),

the intersection of all σ-algebras containing ⋃_{k=1}^∞ S_k.

Notice that S_k ⊊ S_{k+1}. The function u_k is S_k-measurable and integrable, and

∫_A u_k dµ = ∫_A u dµ for all A ∈ S_k

by construction. It follows that

u_k = E(u | S_k).

Definition 9.1.8. [15] Let X1, X2, ... be a sequence of random variables on a probability space (Ω, F, P), and let F1, F2, ... be a sequence of σ-algebras in F. The sequence {(X_k, F_k)}_{k=1}^∞ is a martingale if these four conditions hold:

1. F_k ⊆ F_{k+1};
2. X_k is measurable in F_k;
3. E(|X_k|) < ∞;
4. with probability one, E(X_{k+1} | F_k) = X_k.


The next theorem shows that uk defines a martingale. This shows in what sense

uk approximates u, and will lead to convergence when Doob’s martingale theorem is

applied.

Theorem 9.1.9. If u ∈ L1(D, B, µ) and {{D_ki}_{i=1}^{n_k}}_{k=1}^∞ is a sequence of n_k-partitions where {D_(k+1)i}_{i=1}^{n_{k+1}} is an (n_{k+1}/n_k)-refinement of {D_ki}_{i=1}^{n_k}, then for a given x ∈ D,

{(u_k(x), S_k)}_{k=1}^∞

is a martingale.

Proof. By the construction of S_k and u_k we only need to show that

E(u_{k+1}(x) | S_k) = u_k(x).

Let D_kj be the partition set containing x. So

E(u_{k+1}(x) | S_k) = E(u_{k+1}(x) | x ∈ D_kj)
                    = (∫_{D_kj} u_{k+1} dµ) / µ(D_kj)
                    = (∫_{D_kj} u dµ) / (1/n_k)
                    = n_k ∫_{D_kj} u dµ
                    = u_k(x).

Thus for a given x ∈ D, {(u_k(x), S_k)}_{k=1}^∞ is a martingale.

9.2 Convergence Results

Now we prove some convergence results. Our main theorem of this chapter is that when u ∘ f is continuous, our Galerkin projection of f* converges to f* as we take equal-measure refinements.


Theorem 9.2.1 (A martingale convergence theorem). [15] If F_1, F_2, ... is a sequence of σ-algebras satisfying F_1 ⊆ F_2 ⊆ ..., F_∞ = σ(⋃_{k=1}^∞ F_k), and Z is integrable, then

E(Z | F_k) → E(Z | F_∞) with probability one.

The next theorem gives us convergence of uk to u when S∞ = B. Thus it is

important to refine partitions such that the refinements generate the Borel σ-algebra in

the limit.

Theorem 9.2.2. If u ∈ L1(D, B, µ), {{D_ki}_{i=1}^{n_k}}_{k=1}^∞ is a sequence of n_k-partitions where {D_(k+1)i}_{i=1}^{n_{k+1}} is an (n_{k+1}/n_k)-refinement of {D_ki}_{i=1}^{n_k}, n_k | n_{k+1}, and S_∞ = B, then

u_k(x) → u(x) as k → ∞ for µ-almost every x ∈ D.

Proof. Here we use the previous martingale convergence theorem:

u_k(x) = E(u(x) | S_k) → E(u(x) | S_∞) = E(u | B).

Since u is B-measurable, E(u | B) = u. Thus u_k(x) → u(x).

Since we are working with a sequence of refinement partitions, u_k is contained in the set of simple functions over {D_(k+1)i}_{i=1}^{n_{k+1}}. Hence u_{k+1}(x) is no worse an approximation of u(x) than u_k(x).

Notation 9.2.3. If u ∈ L1(D, B, µ), (D, B, µ, f) is a µ-measure preserving dynamical system, {{D_ki}_{i=1}^{n_k}}_{k=1}^∞ is a sequence of n_k-partitions, and {D_(k+1)i}_{i=1}^{n_{k+1}} is an (n_{k+1}/n_k)-refinement of {D_ki}_{i=1}^{n_k}, let ~u_k ∈ R^{1×n_k} be the vector where

(~u_k)_i = ∫_{D_ki} u dµ.

So the entries of ~u_k correspond to the coefficients of u_k(x) (the coefficients of our Galerkin projection of u onto the simple functions over {D_ki}_{i=1}^{n_k}). Let P_{n_k} be the n_k × n_k matrix defined by

(P_{n_k})_ij = µ(f(x) ∈ D_kj | x ∈ D_ki).

Let P*_{n_k} be the operator that maps simple functions over {D_ki}_{i=1}^{n_k} to simple functions over {D_ki}_{i=1}^{n_k} such that P*_{n_k}(u_k(x)) corresponds to ~u_k P_{n_k}.

By construction

∑_{i=1}^{n_k} (~u_k)_i = ∫_D u dµ.

When u defines a probability distribution, ~u_k is a probability vector. We will use P*_{n_k}(u_k(x)) to approximate f*(u), and we will use left eigenvectors of P_{n_k} to construct approximations to eigenfunctions of f*. The next theorem shows that when f* acts on probability distribution functions and u ∘ f is continuous, P*_{n_k} converges to the Perron-Frobenius operator defined by our stirring protocol.
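As a concrete illustration (not part of the formal development), the following MATLAB sketch approximates ~u_k by Monte Carlo cell averages and applies an Ulam estimate of P_{n_k} to it; the map f, the density u, and the sizes nk and m are placeholders chosen only for the example.

% Sketch: Galerkin/Ulam action of f* on a density u over an equal-measure
% partition of [0,1) into nk half-open intervals. All names are illustrative.
nk = 64; m = 1e5;
f = @(x) mod(2*x, 1);               % example measure-preserving map (doubling map)
u = @(x) 2*x;                       % example density on [0,1), integrates to 1
x  = rand(m,1);                     % uniform sample points
ci = floor(nk*x) + 1;               % index of the cell containing x
cj = floor(nk*f(x)) + 1;            % index of the cell containing f(x)
Phat = accumarray([ci cj], 1, [nk nk]);
Phat = Phat ./ sum(Phat, 2);        % row-stochastic Ulam estimate of P_{nk}
uk = accumarray(ci, u(x), [nk 1]) / m;   % (~u_k)_i approximates the integral of u over D_ki
pushed = uk' * Phat;                % coefficients of P*_{nk}(u_k), approximating f*(u)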

Theorem 9.2.4. If u ∈ L1(D, B, µ), (D, B, µ, f) is a µ-measure preserving dynamical system, {{D_ki}_{i=1}^{n_k}}_{k=1}^∞ is a sequence of n_k-partitions, {D_(k+1)i}_{i=1}^{n_{k+1}} is an (n_{k+1}/n_k)-refinement of {D_ki}_{i=1}^{n_k}, u(x) defines a probability distribution, and u ∘ f is continuous, then

P*_{n_k}(u_k(x)) → f*(u) in L1 as k → ∞.

Proof. Since u defines a probability distribution on (D, B), u_k(x) defines a probability distribution and ~u_k is a probability vector. The matrix P_{n_k} is a doubly-stochastic matrix, so ~u_k P_{n_k} is a probability vector. Thus P*_{n_k}(u_k(x)) defines a probability distribution function on (D, B). If

P(x ∈ A) = ∫_A u dµ for all A ∈ B,

then

∫_{D_kj} u ∘ f dµ = P(f(x) ∈ D_kj)
                 = ∑_{i=1}^{n_k} P(x ∈ D_ki) P(f(x) ∈ D_kj | x ∈ D_ki)
                 = ∑_{i=1}^{n_k} ( ∫_{D_ki} u dµ ) p_ij
                 = ∑_{i=1}^{n_k} ( ∫_{D_ki} u_k dµ ) p_ij
                 = ∫_{D_kj} P*_{n_k}(u_k(x)) dµ.

It follows that

∫_{D_kj} P*_{n_k}(u_k(x)) dµ = ∫_{D_kj} u ∘ f dµ for all k, j.

Furthermore,

∫_A P*_{n_k}(u_k(x)) dµ = ∫_A u ∘ f dµ for all A ∈ S_k, for all k.

The function P*_{n_k}(u_k(x)) is a simple function and u ∘ f is continuous, so by the mean value theorem for integrals there exist points

⋃_{k=1}^∞ {x_k1, x_k2, ..., x_kn_k} ⊂ D

such that x_ki ∈ D_ki and u ∘ f(x_ki) = P*_{n_k}(u_k(x)) for all x ∈ D_ki. Thus

∫_D |u ∘ f(x) − P*_{n_k}(u_k(x))| dµ = ∑_{i=1}^{n_k} ∫_{D_ki} |u ∘ f(x) − P*_{n_k}(u_k(x))| dµ
    = ∑_{i=1}^{n_k} ∫_{D_ki} |u ∘ f(x) − u ∘ f(x_ki)| dµ
    ≤ ∑_{i=1}^{n_k} ∫_{D_ki} |sup_{D_ki}(u ∘ f(x)) − inf_{D_ki}(u ∘ f(x))| dµ
    ≤ ∑_{i=1}^{n_k} |sup_{D_ki}(u ∘ f(x)) − inf_{D_ki}(u ∘ f(x))| ∫_{D_ki} dµ
    ≤ ∑_{i=1}^{n_k} |sup_{D_ki}(u ∘ f(x)) − inf_{D_ki}(u ∘ f(x))| (1/n_k)
    ≤ (1/n_k) ∑_{i=1}^{n_k} |sup_{D_ki}(u ∘ f(x)) − inf_{D_ki}(u ∘ f(x))|.

Since u ∘ f is continuous, u ∘ f is Riemann integrable. Our sequence of partitions is a sequence of refinements, therefore

(1/n_k) ∑_{i=1}^{n_k} |sup_{D_ki}(u ∘ f(x)) − inf_{D_ki}(u ∘ f(x))| → 0 as k → ∞.

Hence

∫_D |u ∘ f(x) − P*_{n_k}(u_k(x))| dµ → 0 as k → ∞.

Hence the statement holds.


In the previous theorem, we used the hypothesis that u ∘ f was continuous so that the points

⋃_{k=1}^∞ {x_k1, x_k2, ..., x_kn_k}

existed and u ∘ f was Riemann integrable. If u ∘ f is Riemann integrable, then these points exist almost surely. So we may generalize the previous theorem to the case when u ∘ f is Riemann integrable.

Proposition 9.2.5. If

1. {F_k}_{k=1}^∞ is a sequence of σ-algebras over D,
2. F_∞ = σ(⋃_{k=1}^∞ F_k), and
3. (D, F_∞, µ, f) is a weak-mixing dynamical system,

then

lim_{N→∞} (1/N) ∑_{i=0}^{N−1} |µ(f^{−i}(A_k) ∩ B_k) − µ(A_k)µ(B_k)| = 0 for all A_k, B_k ∈ F_k.

Proof. Let A_k, B_k ∈ F_k; by the definition of F_∞, A_k, B_k ∈ F_∞. Since (D, F_∞, µ, f) is weak-mixing,

lim_{N→∞} (1/N) ∑_{i=0}^{N−1} |µ(f^{−i}(A_k) ∩ B_k) − µ(A_k)µ(B_k)| = 0.

Thus we get the result.

We are not saying that (D, F_k, µ, f) is a weak-mixing dynamical system; f does not need to be F_k-measurable. This proposition shows us that if we observe two sets A_k, B_k ∈ F_k such that

lim_{N→∞} (1/N) ∑_{i=0}^{N−1} |µ(f^{−i}(A_k) ∩ B_k) − µ(A_k)µ(B_k)| ≠ 0,

then we know (D, F_∞, µ, f) is not weak-mixing. So if we observe two states that do not mix, we are correct in failing to reject that (D, B, µ, f) is not weak-mixing.


Lemma 9.2.6. If (X, Σ, µ) is a measure space and A, B, C ∈ Σ, then

|µ(A) − µ(B)| ≤ µ(A △ B), and
|µ(A ∩ B) − µ(A ∩ C)| ≤ µ(B △ C).

Proof. For the first inequality,

|µ(A) − µ(B)| = |µ(A ∩ B^c) + µ(A ∩ B) − µ(A ∩ B) − µ(A^c ∩ B)|
             = |µ(A ∩ B^c) − µ(A^c ∩ B)|
             ≤ µ(A ∩ B^c) + µ(A^c ∩ B)
             ≤ µ(A △ B).

Now the second inequality,

|µ(A ∩ B) − µ(A ∩ C)| = |µ((A ∩ B) ∩ (A ∩ C)^c) − µ((A ∩ B)^c ∩ (A ∩ C))|
                     = |µ((A ∩ B) ∩ (A^c ∪ C^c)) − µ((A^c ∪ B^c) ∩ (A ∩ C))|
                     = |µ(A ∩ B ∩ C^c) − µ(A ∩ B^c ∩ C)|
                     ≤ µ(A ∩ B ∩ C^c) + µ(A ∩ B^c ∩ C)
                     ≤ µ(B ∩ C^c) + µ(B^c ∩ C)
                     ≤ µ(B △ C).

Hence we get both inequalities.

Our next theorem gives a criterion for weak-mixing.

Theorem 9.2.7. If

1. {F_k}_{k=1}^∞ is a sequence of σ-algebras of D,
2. F_k ⊆ F_{k+1} for all k,
3. |F_k| < ∞ for all k,
4. F_∞ = σ(⋃_{k=1}^∞ F_k),
5. (D, F_∞, µ) is a probability space,
6. (D, F_∞, µ, f) is a dynamical system,
7. f is measure preserving, and
8. lim_{N→∞} (1/N) ∑_{i=0}^{N−1} |µ(f^{−i}(A) ∩ B) − µ(A)µ(B)| = 0 for all A, B ∈ F_k, for all k,

then (D, F_∞, µ, f) is weak-mixing.

Proof. Let A, B ∈ F_∞ and choose A_k, B_k ∈ F_k such that

µ(A △ A_k) = min{µ(A △ S) : S ∈ F_k},
µ(B △ B_k) = min{µ(B △ S) : S ∈ F_k}.

Since |F_k| < ∞ and F_k ⊆ F_∞, the sets A_k and B_k exist. Then

|µ(f^{−i}(A) ∩ B) − µ(A)µ(B)| = |µ(f^{−i}(A) ∩ B) − µ(f^{−i}(A) ∩ B_k)
                                 + µ(f^{−i}(A) ∩ B_k) − µ(f^{−i}(A_k) ∩ B_k)
                                 + µ(f^{−i}(A_k) ∩ B_k) − µ(A_k)µ(B_k)
                                 + µ(A_k)µ(B_k) − µ(A)µ(B)|
   ≤ |µ(f^{−i}(A) ∩ B) − µ(f^{−i}(A) ∩ B_k)|
     + |µ(f^{−i}(A) ∩ B_k) − µ(f^{−i}(A_k) ∩ B_k)|
     + |µ(f^{−i}(A_k) ∩ B_k) − µ(A_k)µ(B_k)|
     + |µ(A_k)µ(B_k) − µ(A)µ(B)|
   ≤ |µ(B △ B_k)| + |µ(f^{−i}(A) △ f^{−i}(A_k))|
     + |µ(f^{−i}(A_k) ∩ B_k) − µ(A_k)µ(B_k)|
     + |µ(A_k)µ(B_k) − µ(A)µ(B)|
   = |µ(B △ B_k)| + |µ(A △ A_k)|
     + |µ(f^{−i}(A_k) ∩ B_k) − µ(A_k)µ(B_k)|
     + |µ(A_k)µ(B_k) − µ(A)µ(B)|.

If we take averages,

(1/N) ∑_{i=0}^{N−1} |µ(f^{−i}(A) ∩ B) − µ(A)µ(B)|
   ≤ (1/N) ∑_{i=0}^{N−1} ( |µ(B △ B_k)| + |µ(A △ A_k)|
       + |µ(f^{−i}(A_k) ∩ B_k) − µ(A_k)µ(B_k)|
       + |µ(A_k)µ(B_k) − µ(A)µ(B)| )
   = |µ(B △ B_k)| + |µ(A △ A_k)|
     + ( (1/N) ∑_{i=0}^{N−1} |µ(f^{−i}(A_k) ∩ B_k) − µ(A_k)µ(B_k)| )
     + |µ(A_k)µ(B_k) − µ(A)µ(B)|.

Let ε > 0 be given. Since F_∞ = σ(⋃_{k=1}^∞ F_k), there exist K_Aε, K_Bε, K_ABε ∈ N such that

|µ(A △ A_k)| < ε/4 whenever k ≥ K_Aε,
|µ(B △ B_k)| < ε/4 whenever k ≥ K_Bε,
|µ(A_k)µ(B_k) − µ(A)µ(B)| < ε/4 whenever k ≥ K_ABε.

Thus

(1/N) ∑_{i=0}^{N−1} |µ(f^{−i}(A) ∩ B) − µ(A)µ(B)|
   ≤ ( (1/N) ∑_{i=0}^{N−1} |µ(f^{−i}(A_k) ∩ B_k) − µ(A_k)µ(B_k)| ) + 3ε/4

whenever k ≥ max{K_Aε, K_Bε, K_ABε}. Now, by hypothesis,

lim_{N→∞} (1/N) ∑_{i=0}^{N−1} |µ(f^{−i}(A_k) ∩ B_k) − µ(A_k)µ(B_k)| = 0 for all A_k, B_k ∈ F_k.

Hence there exists a K ∈ N such that

(1/N) ∑_{i=0}^{N−1} |µ(f^{−i}(A_k) ∩ B_k) − µ(A_k)µ(B_k)| < ε/4 whenever N ≥ K.

Therefore

(1/N) ∑_{i=0}^{N−1} |µ(f^{−i}(A) ∩ B) − µ(A)µ(B)| < ε whenever N ≥ K.

Thus we get the result.


CHAPTER 10
DECAY OF CORRELATION

10.1 Comparing Our Test to Decay of Correlation

Definition 10.1.1. Two sequences of random variables, {X_k}_{k=1}^∞ and {Y_k}_{k=1}^∞, are said to have decay of correlation if

lim_{k→∞} [E(X_k Y_k) − E(X_k)E(Y_k)] = 0.

By the definition of covariance, this is the same as

lim_{k→∞} cov(X_k, Y_k) = 0.

Really we should call this decay of covariance. Decay of correlation is more general than the correlation function going to zero: the variance of both variables must be finite for the correlation function to be defined, while covariance only requires that E(X_k Y_k), E(X_k), and E(Y_k) be finite.

Decay of correlation is the standard for measuring mixing, but it requires a

sequence of experiments. We will show that decay of correlation is sensitive to

strong-mixing and is not sensitive to weak-mixing. Using λ2(P) is not as precise as

decay of correlation and does not detect strong-mixing, but only one experiment is

needed.

Proposition 10.1.2. A measure preserving dynamical system, (D, B, µ, f), is strong-mixing if and only if there is decay of correlation between 1_{f^{−k}(A)} and 1_B for all A, B ∈ B.

Proof. Let A, B ∈ B, and set

X_k = 1_{f^{−k}(A)},
Y_k = 1_B.


Since f is µ-measure preserving,

E(X_k Y_k) − E(X_k)E(Y_k) = ∫_D 1_{f^{−k}(A)} 1_B dµ − (∫_D 1_{f^{−k}(A)} dµ)(∫_D 1_B dµ)
                          = ∫_D 1_{f^{−k}(A) ∩ B} dµ − (∫_{f^{−k}(A)} dµ)(∫_B dµ)
                          = ∫_{f^{−k}(A) ∩ B} dµ − µ(f^{−k}(A))µ(B)
                          = µ(f^{−k}(A) ∩ B) − µ(A)µ(B).

Hence a dynamical system (D, B, µ, f) is strong-mixing if and only if there is decay of correlation between 1_{f^{−k}(A)} and 1_B for all A, B ∈ B.

Notation 10.1.3. For a given dynamical system (D, B, µ, f) and n-partition {D_i}_{i=1}^n, let P^(k) be the n × n matrix defined by

(P^(k))_ij = µ(f^k(x) ∈ D_j | x ∈ D_i),

and we denote P^(1) as P.

Proposition 10.1.4. For a given dynamical system (D, B, µ, f), if f is µ-measure preserving and {D_i}_{i=1}^n is an n-partition, then

‖ P^(k) − P̄ ‖_F = n ( ∑_{i=1}^n ∑_{j=1}^n (cov(1_{f^{−k}(D_i)}, 1_{D_j}))^2 )^{1/2},

where P̄ denotes the n × n matrix with every entry equal to 1/n.

Proof. For any partition sets D_i and D_j,

E(1_{f^{−k}(D_i) ∩ D_j}) − E(1_{f^{−k}(D_i)})E(1_{D_j}) = µ(f^{−k}(D_i) ∩ D_j) − µ(f^{−k}(D_i))µ(D_j)
    = µ(f^k(D_j) ∩ D_i) − µ(D_i)µ(D_j)
    = µ(D_i)µ(f^k(x) ∈ D_j | x ∈ D_i) − 1/n^2
    = (1/n)µ(f^k(x) ∈ D_j | x ∈ D_i) − 1/n^2
    = (1/n)[µ(f^k(x) ∈ D_j | x ∈ D_i) − 1/n]
    = (1/n)(P^(k) − P̄)_ij.

Multiply both sides by n,

n( E(1_{f^{−k}(D_i) ∩ D_j}) − E(1_{f^{−k}(D_i)})E(1_{D_j}) ) = (P^(k) − P̄)_ij.

Square both sides,

n^2 ( E(1_{f^{−k}(D_i) ∩ D_j}) − E(1_{f^{−k}(D_i)})E(1_{D_j}) )^2 = (P^(k) − P̄)_ij^2.

Take the sum of these terms over both i and j,

n^2 ∑_{i=1}^n ∑_{j=1}^n ( E(1_{f^{−k}(D_i) ∩ D_j}) − E(1_{f^{−k}(D_i)})E(1_{D_j}) )^2 = ∑_{i=1}^n ∑_{j=1}^n (P^(k) − P̄)_ij^2.

Take the square root of both sides to get the left-hand side to equal the Frobenius norm,

n ( ∑_{i=1}^n ∑_{j=1}^n ( E(1_{f^{−k}(D_i) ∩ D_j}) − E(1_{f^{−k}(D_i)})E(1_{D_j}) )^2 )^{1/2} = ( ∑_{i=1}^n ∑_{j=1}^n (P^(k) − P̄)_ij^2 )^{1/2} = ‖ P^(k) − P̄ ‖_F.

By the definition of covariance, we get the result.

So we may use the sequence {‖ P^(k) − P̄ ‖_F}_{k=1}^∞ to detect decay of correlation between simple functions over sets in σ({D_i}_{i=1}^n). Also, we may use the rate at which this sequence goes to zero to measure the rate at which the dynamical system strong-mixes over σ({D_i}_{i=1}^n).

Definition 10.1.5. Let (X, Σ, ν, g) be a dynamical system. A finite partition of X, {A_i}_{i=1}^n, is a generating partition of Σ if

Σ = σ( ⋃_{−∞<k<∞} ⋃_{i=1}^n g^k(A_i) ).

Since Markov partitions capture the dynamics of a system, we use the sequence

{‖ P^(k) − P^k ‖_F}_{k=2}^∞


to measure how close our n-partition {D_i}_{i=1}^n is to being Markov. This sequence indicates how well the partition represents the dynamics over itself. If {D_i}_{i=1}^n captures the dynamics perfectly, then

P^(k) = P^k for all k ∈ N.

Our next theorem says that when we conclude (D, B, µ, f) is weak-mixing, the system is strong-mixing over partition sets if and only if {D_i}_{i=1}^n is close to being a Markov partition.

Theorem 10.1.6. For a measure preserving dynamical system (D, B, µ, f) with an n-partition, if |λ2(P)| < 1, then

‖ P^(k) − P^k ‖_F → 0 ⟺ ‖ P^(k) − P̄ ‖_F → 0

as k → ∞.

Proof. In the derivation of our mixing rate estimate we showed that if |λ2(P)| < 1, then ‖ P^k − P̄ ‖_F → 0 as k → ∞. Now use the triangle inequality to prove both directions.

(⇒) ‖ P^(k) − P̄ ‖_F ≤ ‖ P^(k) − P^k ‖_F + ‖ P^k − P̄ ‖_F. By hypothesis ‖ P^(k) − P^k ‖_F and ‖ P^k − P̄ ‖_F both go to zero; hence ‖ P^(k) − P̄ ‖_F goes to zero.

(⇐) ‖ P^(k) − P^k ‖_F ≤ ‖ P^(k) − P̄ ‖_F + ‖ P̄ − P^k ‖_F. By hypothesis ‖ P^(k) − P̄ ‖_F and ‖ P̄ − P^k ‖_F both go to zero; hence ‖ P^(k) − P^k ‖_F goes to zero.
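The two Frobenius-norm sequences used in this section can be approximated from sample data. The MATLAB sketch below (not the author's code) estimates P and P^(k) for an example map on [0,1) with an equal-interval partition; the map f and the sizes n, m, kmax are illustrative placeholders, and P̄ is the flat matrix with all entries 1/n.

% Sketch: Monte Carlo estimates of P and P^(k), and the two norms of Section 10.1.
n = 16; m = 1e5; kmax = 10;
f = @(x) mod(3*x, 1);                        % example measure-preserving map
x0 = rand(m,1);
ci = floor(n*x0) + 1;                        % starting cells
Pbar = ones(n)/n;                            % flat ds-matrix, limit of P^k when |lambda2(P)| < 1
x = x0; normPk = zeros(1,kmax); normMarkov = zeros(1,kmax);
P1 = [];                                     % one-step Ulam estimate, set at k = 1
for k = 1:kmax
    x = f(x);
    cj = floor(n*x) + 1;
    Pk_obs = accumarray([ci cj], 1, [n n]);  % counts for the k-step matrix P^(k)
    Pk_obs = Pk_obs ./ sum(Pk_obs, 2);
    if k == 1, P1 = Pk_obs; end
    normPk(k)     = norm(Pk_obs - Pbar, 'fro');  % strong-mixing over the partition
    normMarkov(k) = norm(Pk_obs - P1^k, 'fro');  % how close the partition is to Markov
end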

10.2 A Conjecture About Mixing Rate

Numerical observations lead to the following conjectures.


Conjecture 10.2.1. For any measure preserving dynamical system (D, B, µ, f) that is strong-mixing with an n-partition,

‖ P^k − P̄ ‖_F ≤ ‖ P^(k) − P̄ ‖_F and
|λ2(P)|^k ≤ |λ2(P^(k))|.

If these conjectures are correct and the dynamical system is strong-mixing, then our weak-mixing rate estimate converges faster than the rate at which the dynamical system strong-mixes. That is to say,

(N choose n−1) |λ2(P)|^{N−n+1} → 0 faster than f strong-mixes.


CHAPTER 11
CRITERIA FOR WHEN MORE DATA POINTS ARE NEEDED

11.1 Our Main Criteria for When More Data Points Are Needed

We want to be confident that P̂ approximates P well; we cannot trust a decision based on a poor approximation. If our data points are insufficient, then P̂ will probably represent P poorly. We need some way to check whether we used enough data points in all partition sets; this chapter presents some criteria to decide if the set of data points is insufficient after observing P̂.

Our next theorem gives us an alternative estimate of pij.

Theorem 11.1.1. If {D_i}_{i=1}^n is an n-partition and f is µ-measure preserving, then

µ(f(x) ∈ D_j | x ∈ D_i) = µ(x ∈ D_i | f(x) ∈ D_j).

Proof. For an n-partition, the measure of every partition set is 1/n, and f is µ-measure preserving. So

µ(f(x) ∈ D_j | x ∈ D_i) = µ(f(x) ∈ D_j ∧ x ∈ D_i) / µ(x ∈ D_i)
                        = µ(f(x) ∈ D_j ∧ x ∈ D_i) / (1/n)
                        = µ(f(x) ∈ D_j ∧ x ∈ D_i) / µ(x ∈ D_j)
                        = µ(x ∈ D_i ∧ f(x) ∈ D_j) / µ(f(x) ∈ D_j)
                        = µ(x ∈ D_i | f(x) ∈ D_j).

Thus we get the equality.

It follows that

m_ij / (m_1j + ... + m_nj)

is an estimate of p_ij.


Theorem 11.1.2. If m_ij counts the points such that x ∈ D_i and f(x) ∈ D_j for our Ulam matrix, where the data points are independent uniform random variables, and m_1 = m_2 = ... = m_n, then

m_ij / (m_1j + ... + m_nj) → p_ij as m_1 → ∞ almost surely.

Proof. m_ik ~ binomial(p_ik, m_i) for each i, k, so by the strong law of large numbers

m_ik / m_i → p_ik as m_i → ∞ almost surely.

Now we take the limit of the quotient,

lim_{m_1→∞} m_ij / (m_1j + ... + m_nj) = lim_{m_1→∞} (m_ij/m_1) / ((m_1j + ... + m_nj)/m_1)
    = lim_{m_1→∞} (m_ij/m_1) / (m_1j/m_1 + ... + m_nj/m_1)
    = (lim_{m_1→∞} m_ij/m_1) / (lim_{m_1→∞} (m_1j/m_1 + ... + m_nj/m_1))
    = (lim_{m_1→∞} m_ij/m_1) / ((lim_{m_1→∞} m_1j/m_1) + ... + (lim_{m_1→∞} m_nj/m_1))
    = (lim_{m_i→∞} m_ij/m_i) / ((lim_{m_1→∞} m_1j/m_1) + ... + (lim_{m_n→∞} m_nj/m_n))
    = p_ij / (p_1j + ... + p_nj).

Now p_1j + ... + p_nj = 1, since we use an n-partition to generate our Ulam matrix. So

lim_{m_1→∞} m_ij / (m_1j + ... + m_nj) = p_ij.

Thus the quotient converges to the entry of P.

Corollary 11.1.3. E( m_ij / (m_1j + ... + m_nj) ) = p_ij.

Proof. Our measure space is a probability space and

0 ≤ m_ij / (m_1j + ... + m_nj) ≤ 1,

so the expected value exists. By the strong law of large numbers, and because the data points are independent and identically distributed, m_ij / (m_1j + ... + m_nj) converges to its expected value.

Remark 11.1.4. When we use m_ij / (m_1j + ... + m_nj) as a test statistic, we let

m_ij / (m_1j + ... + m_nj) = 0 whenever m_ij = 0.

We do this to prevent problems with zero denominators when f(x) ∉ D_j for all data points. By construction, data points start in each D_i, but it is possible to have a sample where no points land in a particular partition set.

Theorem 11.1.5. If the m_ij are our counts from our Ulam matrix and m_1 = m_2 = ... = m_n, then

V( m_ij / (m_1j + ... + m_nj) ) ≤ 1 − p_ij^2.

Proof.

( m_ij / (m_1j + ... + m_nj) )^2 ≤ 1.

Take the expected value,

E( ( m_ij / (m_1j + ... + m_nj) )^2 ) ≤ 1.

Subtract the term needed to make the left-hand side equal the variance,

E( ( m_ij / (m_1j + ... + m_nj) )^2 ) − ( E( m_ij / (m_1j + ... + m_nj) ) )^2 ≤ 1 − ( E( m_ij / (m_1j + ... + m_nj) ) )^2,
V( m_ij / (m_1j + ... + m_nj) ) ≤ 1 − p_ij^2.

So we have an upper bound for the variance of the quotient.


The next conjecture provides a hypothesis test to check whether min_i m_i is large enough after running Ulam's method. The conjecture is based on the χ2-test, where

∑ (observed − expected)^2 / expected

is the test statistic. Here we insert the entries of P̂ for the observed values and 1/n for the expected values in the denominator. Since we wish to check whether our data points are insufficient and P is a ds-matrix, we insert

m_ij / (m_1j + ... + m_nj)

for the expected values in the numerator. In the chapter of examples, we assume this conjecture is correct and give the results of this test.

Conjecture 11.1.6. If the m_ij are our counts from our Ulam matrix and m_1, m_2, ..., m_n are the number of points in D_1, D_2, ..., D_n before applying our map, then

n ∑_{i=1}^n ∑_{j=1}^n ( m_ij/(m_1j + ... + m_nj) − m_ij/(m_i1 + ... + m_in) )^2 ~ χ2_{(n−1)^2},

and this may be used as a test statistic for the hypothesis test

1. Ho: m_i is insufficient for some i.
2. Ha: m_i is sufficient for all i.

We will reject Ho in favor of Ha if

n ∑_{i=1}^n ∑_{j=1}^n ( m_ij/(m_1j + ... + m_nj) − m_ij/(m_i1 + ... + m_in) )^2

is smaller than a critical value.
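For illustration, a minimal MATLAB sketch of this statistic is given below; it assumes the conjecture above, takes the raw count matrix M (with M(i,j) = m_ij) as input, and is not part of the dissertation's own code. A chi-square critical value can be obtained from chi2inv in the Statistics Toolbox or from a table.

function T = data_sufficiency_stat(M)
% Sketch: the test statistic of Conjecture 11.1.6 from the Ulam counts M.
n = size(M,1);
colEst = M ./ sum(M,1);      % m_ij/(m_1j+...+m_nj)
rowEst = M ./ sum(M,2);      % m_ij/(m_i1+...+m_in), the entries of P-hat
colEst(M == 0) = 0;          % convention of Remark 11.1.4
T = n * sum(sum((colEst - rowEst).^2));
% Under the conjecture, T ~ chi-square with (n-1)^2 degrees of freedom;
% reject Ho (too few points) when T is below the chosen lower critical value.
end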


11.2 Other Criteria for When More Data Points Are Needed

Next, we propose alternative criteria for deciding if min_i m_i is too small; a small numerical sketch follows the list.

1. When we generate P̂, we know that it approximates an n × n ds-matrix, thus its column sums should be close to one. If

∑_{j=1}^n (1 − ∑_{i=1}^n p̂_ij)^2 = ∑_{j=1}^n (∑_{i=1}^n p̂_ij)^2 − n

is significantly larger than zero, then more points should be used.

2. Suppose we want to check how close P̂ is to being ds with respect to a particular norm. If the value of

‖ P̂^T (1, 1, ..., 1)^T − (1, 1, ..., 1)^T ‖ = ‖ (P̂^T − I)(1, 1, ..., 1)^T ‖

differs significantly from zero, then more points should be used.

3. All ds-matrices have (1/n, 1/n, ..., 1/n) as a stationary distribution, so if P̂ does not have a strictly positive stationary distribution, more points should be used.

4. All ds-matrices have (1/n, 1/n, ..., 1/n) as a stationary distribution, so if

sup{ √n |< ~u, (1/n, 1/n, ..., 1/n) >| / ‖ ~u ‖ : ~u P̂ = ~u, ~u is a probability vector }

is far from one for all strictly positive stationary distributions of P̂, then more points should be used.
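A minimal MATLAB sketch of the first two diagnostics, assuming an observed row-stochastic Ulam matrix Phat (a placeholder name), is:

function [crit1, crit2] = ds_defect(Phat)
% Sketch: criteria 1 and 2 above for an observed Ulam matrix Phat.
n = size(Phat, 1);
colSums = sum(Phat, 1);
crit1 = sum((1 - colSums).^2);                % criterion 1: squared column-sum defect
crit2 = norm((Phat' - eye(n)) * ones(n,1));   % criterion 2 (2-norm)
end

Large values of crit1 or crit2 suggest that more data points are needed.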


CHAPTER 12
PROBABILITY DISTRIBUTIONS OF DS-MATRICES

12.1 Conditional Probability Distributions

How do we set critical values for hypothesis testing with our Ulam matrix? We need to know the probability space that our test statistic comes from in order to set critical values. For the n = 2 case we may use the binomial distribution. When n > 2, which probability distribution should be used is an open question. We propose using a subfamily of the beta distribution family.

When we set the critical value for

1. Ho: (D, B, µ, f) is not ergodic (and hence not mixing).
2. Ha2: (D, B, µ, f) is weak-mixing (and hence ergodic).

we need some idea of

P(|λ2(P̂)| < t : |λ2(P)| = 1).

Since P̂ is a stochastic matrix, 0 ≤ |λ2(P̂)| ≤ 1. So we should look to distributions with support only on [0, 1]. Some common such distributions are

1. Uniform distribution:
   g(t) = 1_{[0,1]}(t)

2. Beta distribution:
   g_{αβ}(t) = (Γ(α+β) / (Γ(α)Γ(β))) t^{α−1} (1−t)^{β−1} 1_{[0,1]}(t),  α > 0, β > 0

3. Triangle distribution:
   g_a(t) is the piecewise linear function that forms a triangle with the interval [0, 1] as the base and the point (a, 2) as the apex.

When |λ2(P)| = 1, the correct distribution for |λ2(P̂)| probably has one as a central tendency. Since |λ2(P̂)| ≤ 1, the only way to have the mean or median equal one is to have the distribution

P(|λ2(P̂)| = 1) = 1,
P(|λ2(P̂)| = t) = 0 if t ≠ 1.

We cannot reject Ho in favor of Ha2 if we use this distribution, so hopefully this is not the correct distribution. Let's look at having |λ2(P̂)| = 1 as a mode of the distribution.

1. Uniform distribution: All values on [0, 1] are a mode of the distribution.
   g(t) = 1_{[0,1]}(t)

2. Beta distribution: To have t = 1 be a mode we must set α ≥ 1, β = 1.
   g_{α1}(t) = α t^{α−1} 1_{[0,1]}(t)

3. Triangle distribution: The mode of the distribution is a, so set a = 1.
   g_a(t) = 2t 1_{[0,1]}(t)

If α = 1, β = 1, then the beta distribution gives us the uniform distribution; if α = 2, β = 1, then the beta distribution gives us the triangle distribution with mode one. The beta family of distributions includes the uniform distribution and the triangle distribution with mode at one. So we propose using the beta distribution with α ≥ 1, β = 1:

g_{α1}(t) = α t^{α−1} 1_{[0,1]}(t).

The beta distribution provides the additional advantage that it is an exponential family of distributions. When α ≥ 1, β = 1, and a significance level is set, the critical value increases as α increases. If possible, before evaluating our stirring protocol, we should look at several stirring patterns in the same class as f and use maximum likelihood estimates or the method of moments to estimate α. If λ̂2 is a uniformly distributed random variable on the unit disk, then |λ̂2| is a beta random variable with α = 2, β = 1. We propose that if we have no insight into the value of α, then we should set it equal to two. When the beta distribution is the correct distribution and α ≥ 1, β = 1: if we set the critical value with a smaller alpha parameter, we will be less likely to make a Type I error; if we set the critical value with a larger alpha parameter, we will be more likely to make a Type I error. So it is better to use an alpha parameter that is too small rather than too big.
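As a small illustration (not from the dissertation itself): under the proposed beta(α, 1) model, the CDF of |λ̂2| is t^α, so the lower critical value at significance level s solves t^α = s. The values of alpha and s below are placeholders.

% Sketch: critical value for rejecting Ho (|lambda2(P)| = 1) in favor of
% weak-mixing, assuming |lambda2(P-hat)| ~ beta(alpha, 1) under Ho.
alpha = 2;              % proposed default when nothing else is known
s = 0.05;               % significance level (illustrative)
tcrit = s^(1/alpha);    % reject Ho if the observed |lambda2(P-hat)| < tcrit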

The next example describes the probability distribution function when n = 2 and p̂ is a beta(α, β) random variable.

Example 12.1.1 (n = 2). Say that

P =
[ p  q ]
[ q  p ],

and we observe a perturbed version of P,

P̂ =
[ p̂  q̂ ]
[ q̂  p̂ ],

p̂ = p + ε,
q̂ = 1 − p − ε,

where ε is a random variable such that p̂ is a beta(α, β) random variable. Then

P(p̂ < k | p) = P(p + ε < k | p) = ∫_0^k (Γ(α+β)/(Γ(α)Γ(β))) t^{α−1}(1−t)^{β−1} dt.

It follows that

if β = 1, then P(p̂ < k | p) = k^α;
if α = 1, then P(p̂ < k | p) = 1 − (1 − k)^β;
if α = 1 and β = 1, then P(p̂ < k | p) = k.


We have the uniform([0, 1]) distribution for p̂ when ε is a uniform([−p, 1 − p]) random variable.

The expected value of a beta(α, β) random variable is α/(α+β), and p is fixed, so if E(ε) = 0, then

E(p̂) = p = α/(α+β).

It follows that

V(p̂) = V(ε) = p^2(1−p)/(p+α).

So if E(ε) = 0, then V(p̂) decreases as α increases.

12.2 Approximating Probability Distributions

The rest of the chapter will look at other ways to approximate the probability distributions of ds-matrices when the central tendency of the distribution is not specified. We propose using one of the following Monte Carlo techniques to approximate probability distributions of statistic(s) from random ds-matrices when n > 2.

1. Use the dealer's algorithm to generate ds-matrices.
2. Take convex combinations of all n! permutation matrices to generate ds-matrices.
3. Take convex combinations of (n − 1)^2 + 1 or more permutation matrices to generate ds-matrices.
4. Use unitary matrices to generate ds-matrices.

12.2.1 The Dealer’s Algorithm

Say that a dealer has a deck with n suits and g cards in each suit (ng cards in the deck), and shuffles the cards so that the order of the cards is a uniformly distributed random variable. Then the dealer deals the entire deck to n distinct players. We can represent the number of cards of each suit that each player received with an n × n matrix: each player corresponds to a row, and each suit corresponds to a column. All rows and all columns will sum to g. We get a ds-matrix after rescaling the matrix by 1/g.


Algorithm 12.2.1 (The Dealer's Algorithm for DS-Matrices in MATLAB).

M = zeros(n, n);                 % M(i,j) will count player i's cards of suit j
deal = randperm(n * g);          % a uniformly random ordering of the n*g cards
for j = 1 : n*g
    % card deal(j) goes to player mod(j,n)+1; its suit is mod(deal(j),n)+1
    M(mod(j, n) + 1, mod(deal(j), n) + 1) = M(mod(j, n) + 1, mod(deal(j), n) + 1) + 1;
end
P = (1/g) * M;                   % rescale so that all row and column sums equal one

When we use Ulam's method to generate P̂, our target matrix, P, is a ds-matrix, so the column and row sums of M = (m_ij) should be close to constant if m_i is constant. The entries of P̂ will be rational by construction. If we set g = min m_i, then the dealer's algorithm gives us a way to generate a Monte Carlo approximation of the probability distribution of P̂. The other algorithms we present approximate the probability distribution of P. The dealer's algorithm should be used when

1. the number of sample points is smaller than we would like (min m_i is small);
2. we have no knowledge of the distribution beforehand;
3. we want to sample ds-matrices with entries from {a/g : a ∈ {0, 1, 2, ..., g}}.

Theorem 12.2.2. If P̂ is an n × n ds-matrix generated by the dealer's algorithm and P̄ is the n × n matrix with every entry equal to 1/n, then

‖ P̂ − P̄ ‖_F → 0 in probability as g → ∞, and
E(‖ P̂ − P̄ ‖_F) → 0 as g → ∞.

Proof. The second statement implies the first, so we just need to show convergence of the expected value. Let P̂ = (1/g)M, where M is the random matrix defined by the number of cards of each suit a player receives in the dealer's algorithm. The entries of M, m_ij, are marginally (nonindependent) binomial(1/n, g) random variables by construction. By Jensen's inequality,

E(‖ P̂ − P̄ ‖_F)^2 ≤ E(‖ P̂ − P̄ ‖_F^2)
                 = E( ∑_{i=1}^n ∑_{j=1}^n (p̂_ij − 1/n)^2 )
                 = ∑_{i=1}^n ∑_{j=1}^n E((p̂_ij − 1/n)^2)
                 = ∑_{i=1}^n ∑_{j=1}^n V(p̂_ij)
                 = ∑_{i=1}^n ∑_{j=1}^n (n − 1)/(g n^2)
                 = (n − 1)/g
                 < n/g.

Taking square roots of both sides leads to

E(‖ P̂ − P̄ ‖_F) < √n / √g.

We get the result when we take the limit as g goes to infinity.

The next example is an extreme case to show the futility of using too few points.

Example 12.2.3 (min m_i = 1). If we run the dealer's algorithm to approximate the probability distribution of ds-matrices for Ulam's method with min m_i = 1, then g = 1. All matrices in our Monte Carlo approximation will be ds-matrices with exactly one 1 in each column and each row; thus the matrices are permutation matrices. All eigenvalues of a permutation matrix have magnitude one. It follows that the subshifts of finite type arising from the Monte Carlo matrices are nonmixing. So if we use the Monte Carlo matrices to estimate the probability distribution of |λ2|, we will fail to reject that (D, B, µ, f) is not weak-mixing.


12.2.2 Full Convex Combinations

The Birkhoff-von Neumann theorem provides a technique for generating Monte Carlo probability distribution functions for real functions of ds-matrices. If we want to randomly generate a doubly stochastic matrix, then the Birkhoff-von Neumann theorem tells us that we may apply a randomly generated weighted average to the set of n × n permutation matrices to get a randomly generated ds-matrix. Recording the statistics of the random ds-matrices gives a Monte Carlo approximation of the desired probability distribution.

Proposition 12.2.4. If ~u is a length-N vector where u_i is a nonnegative random variable for all i and P(~u = ~0) = 0, then ~u / sum(~u) is a random convex combination almost surely.

We may take the absolute value of real random variables that are continuous at zero to get convex combinations. If we change the distribution of the u_i's, we change the distribution of the convex combinations and thus change the distribution of the doubly stochastic matrices. To verify this, compare the results when the u_i's are independent uniform([0, 1]) random variables and when u_i = v_i^2, where the v_i's are independent cauchy(0, 1) random variables.
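A minimal MATLAB sketch of this construction is given below (not from the dissertation); it builds one random ds-matrix as a full convex combination of all n! permutation matrices, with the convex vector formed as in Proposition 12.2.4. The size n and the weight distribution are illustrative, and the approach is practical only for small n since perms lists all n! permutations.

% Sketch: a random ds-matrix via the Birkhoff-von Neumann representation.
n = 4;
pt = perms(1:n);
f  = size(pt, 1);              % f = n!
u  = rand(f, 1);               % nonnegative weights (uniform case); for squared Cauchy
                               % weights use u = tan(pi*(rand(f,1)-0.5)).^2
w  = u / sum(u);               % random convex vector (Proposition 12.2.4)
P = zeros(n);
for k = 1:f
    Pk = zeros(n);
    Pk(sub2ind([n n], 1:n, pt(k,:))) = 1;   % permutation matrix for pt(k,:)
    P = P + w(k) * Pk;
end
% All row and column sums of P equal one.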

Notation 12.2.5. Let ~γ = (γ1, ..., γN) denote a convex vector.

Proposition 12.2.6. If ~γ is a random probability vector (convex combination) of length N and γ1, γ2, ..., γN are marginally identically distributed, then

E(γ_i) = 1/N for all i.

Remark 12.2.7. The convex coefficients of the ~γ from the previous proposition are not independent: if one entry increases (decreases), then the sum of the other terms must decrease (increase) to preserve γ1 + γ2 + ... + γN = 1.


Theorem 12.2.8. If c1, c2, ..., c_{N+1} are independent identically distributed gamma(α, β) random variables and ~γ is the probability vector given by

γ_i = c_i / (c_1 + ... + c_{N+1}),

then the marginal distribution of each γ_i is beta(α, Nα).

Proof. Since c1, c2, ..., c_{N+1} are gamma distributed, 0 < c_1, ..., c_{N+1} almost always. Since c1, c2, ..., c_{N+1} are iid, we may show the result for γ_{N+1} without loss of generality:

P(γ_{N+1} < k) = P( c_{N+1} / (c_1 + ... + c_N + c_{N+1}) < k )
              = P( c_{N+1} < (c_1 + ... + c_N + c_{N+1}) k )
              = P( (1 − k) c_{N+1} < (c_1 + ... + c_N) k )
              = P( c_{N+1} < (c_1 + ... + c_N) k / (1 − k) ).

Now c_1, c_2, ..., c_N are gamma; the moment generating function of c_1 + c_2 + ... + c_N shows us that c_1 + c_2 + ... + c_N is a gamma(Nα, β) random variable, so

P(γ_{N+1} < k) = ∫_0^∞ ∫_0^{ky/(1−k)} ( x^{α−1} e^{−x/β} / (Γ(α) β^α) ) ( y^{Nα−1} e^{−y/β} / (Γ(Nα) β^{Nα}) ) dx dy,

(d/dk) P(γ_{N+1} < k) = (d/dk) ∫_0^∞ ∫_0^{ky/(1−k)} ( x^{α−1} e^{−x/β} / (Γ(α) β^α) ) ( y^{Nα−1} e^{−y/β} / (Γ(Nα) β^{Nα}) ) dx dy.

Using the definition of the derivative and the dominated convergence theorem, we see that

(d/dk) P(γ_{N+1} < k) = ∫_0^∞ (d/dk) ∫_0^{ky/(1−k)} ( x^{α−1} e^{−x/β} / (Γ(α) β^α) ) ( y^{Nα−1} e^{−y/β} / (Γ(Nα) β^{Nα}) ) dx dy.

It follows that

(d/dk) P(γ_{N+1} < k) = ( k^{α−1} / (Γ(α) Γ(Nα) (1 − k)^{α+1} β^{(N+1)α}) ) ∫_0^∞ y^{(N+1)α−1} exp( −y / ((1 − k)β) ) dy.

Notice that the function inside the integral is the kernel of a gamma distribution. Hence

(d/dk) P(γ_{N+1} < k) = ( Γ((N + 1)α) / (Γ(Nα) Γ(α)) ) k^{α−1} (1 − k)^{Nα−1}.

This is the probability density function of a beta(α, Nα) random variable. By the independence of c1, c2, ..., c_{N+1}, we have the result.

Remark 12.2.9. If P is an n × n ds-matrix, then P's convex combination may or may not be unique.

Proof. Here we look at two examples that justify the statement.

1. If P is a permutation matrix, then the only convex coefficient that is not zero is the coefficient corresponding to P. So permutation matrices have unique convex coefficients.

2. If P is the n × n matrix with every entry equal to 1/n, then the convex vector (1/n!, ..., 1/n!) gives P. If P_k is the permutation matrix corresponding to

i → i + k mod n, gcd(k, n) = 1,

take the convex combination where the powers of P_k have coefficients 1/n and all other coefficients are zero. This convex combination also gives P.

So some ds-matrices result from unique convex combinations and some do not.

It would be nice to be able to extend a set of observed ds-matrices whenever

obtaining observations is difficult or expensive; Murali Rao [16] created a way to extend

a set of observations.


Algorithm 12.2.10 (Rao's Convex Data Extension). If {P_k}_{k=1}^{n!} are the n × n permutation matrices and ~γ is a length-n! convex vector, then for any permutation σ on (1, 2, ..., n!),

∑_{k=1}^{n!} γ_{σ(k)} P_k

is an n × n ds-matrix.

So if ~γ is a random convex vector, then we may extend our data by randomly selecting permutations to generate new ds-matrices.

12.2.3 Reduced Convex Combinations

Theorem 12.2.11. If P is an n × n ds-matrix, then P equals a convex combination of n × n permutation matrices with at most (n − 1)^2 + 1 nonzero coefficients.

Proof. There are n^2 entries in an n × n matrix; all rows and all columns of a ds-matrix sum to one, so there are (n − 1)^2 degrees of freedom. We may treat the set of matrices whose columns and rows sum to one as a set with dimension (n − 1)^2. By the Birkhoff-von Neumann theorem, the set of ds-matrices is convex with permutation matrices as corners. By Carathéodory's theorem for convex sets, every ds-matrix may be expressed as a convex combination of (n − 1)^2 + 1 permutation matrices.

We will refer to convex combinations of permutation matrices that use all n! permutation matrices almost surely as full convex combinations. We will call convex combinations that use (n − 1)^2 + 1 permutation matrices reduced convex combinations. The next result shows that full and reduced convex combinations sample from probability spaces with different measures.

Theorem 12.2.12. Let ~γ_f and ~γ_r be random convex vectors of length n!. All entries of ~γ_f are nonzero almost surely; ~γ_r has (n − 1)! or fewer nonzero entries. If {P_k}_{k=1}^{n!} is the set of n × n permutation matrices,

P_f = ∑_{k=1}^{n!} (~γ_f)_k P_k,
P_r = ∑_{k=1}^{n!} (~γ_r)_k P_k,

and for some i, j

P( (~γ_r)_k = 0 for all P_k such that (P_k)_ij = 1 ) > 0,

then P_f and P_r are random variables from probability spaces with different measures.

Proof. Without loss of generality say that

P( (~γ_r)_k = 0 for all P_k such that (P_k)_11 = 1 ) > 0.

Since all entries of ~γ_f are nonzero almost surely, the coefficient of the identity matrix is nonzero almost surely, thus

P( (~γ_f)_k = 0 for all P_k such that (P_k)_11 = 1 ) = 0.

Hence P_f and P_r come from probability spaces with different measures.

It would be nice to be able to extend a set of observed ds-matrices whenever

obtaining observations is difficult or expensive; Murali Rao created a way to extend a set

of observations.

Algorithm 12.2.13 (Rao's Reduced Convex Data Extension). If {P_k}_{k=1}^{n!} are the n × n permutation matrices and ~γ is a length-((n − 1)^2 + 1) random convex combination such that the γ_i are identically distributed, then for any permutation σ on (1, 2, ..., (n − 1)^2 + 1) and any set of (n − 1)^2 + 1 distinct permutation matrices {P_{k_i}}_{i=1}^{(n−1)^2+1},

∑_{i=1}^{(n−1)^2+1} γ_{σ(i)} P_{k_i}

is an n × n ds-matrix, and for any permutation σ on (1, 2, ..., n!),

∑_{i=1}^{(n−1)^2+1} γ_i P_{σ(k_i)}

is an n × n ds-matrix.

So if ~γ is a random convex vector, then we may extend our data by randomly selecting permutations to generate new ds-matrices.

12.2.4 The DS-Greedy Algorithm

We need to observe some convex vectors from ds-matrices to be able to estimate the vectors' probability distribution. We may use the following algorithm when we are given an n × n ds-matrix P and want to find a convex combination of n × n permutation matrices that equals P. Since P is a convex combination of permutation matrices, for each permutation matrix in {P_i}_{i=1}^{n!} we find the largest value c_i such that

P − c_i P_i

is nonnegative. Let

c_m1 = max_i c_i.

We will use c_m1 P_m1 in our convex combination. Repeat this with

(P − c_m1 P_m1)

to find c_m2 and P_m2. Repeat this process with

P − c_m1 P_m1 − ... − c_mj P_mj.

Then

P = c_m1 P_m1 + c_m2 P_m2 + ... + c_m((n−1)^2+1) P_m((n−1)^2+1),

and a convex vector of P is

(c_m1, c_m2, ..., c_m((n−1)^2+1))

with

1 ≥ c_m1 ≥ c_m2 ≥ ... ≥ c_m((n−1)^2+1) ≥ 0.

Since each step takes a maximal coefficient to construct a convex combination of a ds-matrix, we call this the ds-greedy algorithm.

f = factorial(n);
pt = perms(1:n);                  % perms generates all n! permutations
M = zeros(n, n, f);               % for each i, M(:,:,i) will be a permutation matrix
for i = 1:f                       % this loop stores the permutation matrices in M
    for j = 1:n
        M(j, pt(i,j), i) = 1;
    end
end
v = zeros(1, f);                  % v will be the vector of convex coefficients
S = P;
% This loop looks at S - c(i)*M(:,:,i), where M(:,:,i) is a permutation matrix
% and c(i) is the largest possible value such that S - c(i)*M(:,:,i) has no
% negative entries.
for j = 1:(1 + (n-1)^2)
    c = zeros(1, f);              % largest values s.t. S - c(i)*M(:,:,i) has no negative entries
    h = [];                       % h will indicate where in c the max coefficient is
    for i = 1:f
        c(i) = min((S .* M(:,:,i)) * ones(n,1));
    end
    l = max(c);
    for i = f:-1:1
        if c(i) == l
            h = i;
        end
    end
    v(h) = l;
    S = S - l*M(:,:,h);           % reduce S so the next largest coefficient may be detected
end

12.2.5 Using the Greedy DS-Algorithm

Suppose we have a sample of n × n ds-matrices that arise from convex combinations whose coefficients are marginally beta distributed with parameters α, (n − 1)^2 α, but we do not know the value of α. Then we may use the relationship between the gamma distribution and the beta distribution to generate new ds-matrices from a probability space that approximates the sampled space; a sketch of these steps follows.

1. Apply the greedy ds-algorithm to our observed ds-matrices.
2. Use the method of moments or maximum likelihood estimation to approximate the parameter α. Call the approximation α̂.
3. Use independent gamma(α̂, β) random variables to generate new ds-matrices. The researcher must decide what value of β is appropriate:

the c_k's are independent and identically distributed gamma(α̂, β),

γ_k = c_k / (c_1 + ... + c_{(n−1)^2+1}),

{P_(k)}_{k=1}^{(n−1)^2+1} are uniformly randomly selected n × n permutation matrices, and

P = ∑_{k=1}^{(n−1)^2+1} γ_k P_(k).
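A minimal MATLAB sketch of step 3 (not the author's code) is below; the values of n, alpha_hat, and b are assumed inputs chosen for illustration, and gamrnd requires the Statistics Toolbox.

% Sketch: generate one new ds-matrix from gamma(alpha_hat, b) weights and
% randomly selected permutation matrices, following the construction above.
n = 4; alpha_hat = 2; b = 1;            % illustrative values
N1 = (n-1)^2 + 1;                       % number of coefficients
c = gamrnd(alpha_hat, b, N1, 1);        % iid gamma(alpha_hat, b) weights
gam = c / sum(c);                       % marginally beta(alpha_hat, (n-1)^2*alpha_hat)
P = zeros(n);
for k = 1:N1
    Pk = zeros(n);
    Pk(sub2ind([n n], 1:n, randperm(n))) = 1;   % a random permutation matrix
    P = P + gam(k) * Pk;
end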

12.2.6 DS-Matrices Arising from Unitary Matrices

Every unitary matrix can be used to create a ds-matrix.

Proposition 12.2.14. If U is a unitary matrix and P is the matrix such that

P_ij = |U_ij|^2,

then P is a ds-matrix.

Proof. We must show that P is a nonnegative matrix whose columns and rows all sum to one. By how we define P, P is a nonnegative matrix. Since U is unitary,

I = U*U.

So if

~u = (u_1j, u_2j, ...)^T

is a column vector of U, then

1 = < ~u, ~u > = ∑_i |u_ij|^2.

Thus P^T is a stochastic matrix. If we repeat this argument with

I = UU*,

we see that the rows of P sum to one, thus P is a stochastic matrix. Hence P is a ds-matrix.


Definition 12.2.15. The ensemble of all n × n unitary matrices endowed with a probability measure that is invariant under every automorphism

U → VUW,

where V and W are n × n unitary matrices, is called CUE(n).

Berkolaiko showed that when ds-matrices arise from CUE(n) unitary matrices generated by Hurwitz parametrization [17], the expected value of the second largest eigenvalue of the ds-matrices goes to zero as n goes to infinity [18].

If A is a random n × m matrix with full column rank almost surely, U is a unitary qr-factor of A, and P is the matrix where

P_ij = |U_ij|^2,

then P is a random ds-matrix. Different probability measures for A will result in different probability measures for P. So any process that generates random matrices with full column rank may be used to generate ds-matrices.
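A minimal MATLAB sketch of this construction (not from the dissertation) is below; the size n and the complex Gaussian choice for A are illustrative.

% Sketch: a random ds-matrix from the unitary QR factor of a random complex matrix.
n = 8;
A = randn(n) + 1i*randn(n);    % random complex matrix, full rank almost surely
[U, ~] = qr(A);                % U is unitary
P = abs(U).^2;                 % P_ij = |U_ij|^2 is doubly stochastic
% sum(P,1) and sum(P,2) are all ones up to rounding error.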


CHAPTER 13
EXAMPLES

Here we demonstrate our procedure on well-known maps on the unit square. We partitioned the unit square into half-open subsquares and set m = 10^6 points. The points were uniformly distributed pseudorandom numbers generated with MATLAB's rand function on default settings. A goal was for the points to be approximately independent identically distributed uniform((0, 1) × (0, 1)) random variables. The number of partition sets is a power of four since we are partitioning the unit square into subsquares,

n = 4, 16, 64, 256, 1024, 4096.

When a map is defined on the standard torus, we treat the surface as the unit square with the edges identified. For each map we present one observed matrix when n = 4, since this is the easiest case to interpret; we omit the matrices for n = 16, 64, 256, 1024, 4096. We present the average observed |λ̂2|. Under the assumption that our conjectured test for when the data points are sufficient is correct, we present typical results from our χ2-test; the p-values are presented rather than the results for a particular significance level, so that the reader may draw conclusions using their own criteria.
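For reference, a minimal MATLAB sketch of the setup just described (not the author's original code) is below, using Arnold's cat map of Section 13.2 as the example; the subsquare count s^2 and the number of points m are placeholders.

% Sketch: Ulam matrix for a torus map with n = s^2 half-open subsquares,
% m uniform sample points, and the observed |lambda_2|.
s = 8; n = s^2; m = 1e6;
f = @(z) mod([2*z(:,1)+z(:,2), z(:,1)+z(:,2)], 1);   % Arnold's cat map
z  = rand(m, 2);                                     % points in the unit square
w  = f(z);
ci = floor(s*z(:,1)) + s*floor(s*z(:,2)) + 1;        % subsquare index of z
cj = floor(s*w(:,1)) + s*floor(s*w(:,2)) + 1;        % subsquare index of f(z)
Phat = accumarray([ci cj], 1, [n n]);
Phat = Phat ./ sum(Phat, 2);                         % row-stochastic Ulam matrix
ev = sort(abs(eig(Phat)), 'descend');
lambda2 = ev(2);                                     % observed |lambda_2(P-hat)|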

13.1 The Reflection Map

The reflection map, defined by

f(x, y) =
[ 0 1 ] [ x ]
[ 1 0 ] [ y ]  = (y, x),

reflects the unit square over the line x = y. The disk

{(x, y) : (x − 1/2)^2 + (y − 1/2)^2 < 1/4}

is mapped to itself and has measure π/4, thus the reflection map is not ergodic.


When we partition the unit square into four squares,

P =
[ 1 0 0 0 ]
[ 0 0 1 0 ]
[ 0 1 0 0 ]
[ 0 0 0 1 ],

which has characteristic polynomial

(λ − 1)^3 (λ + 1).

Our eigenvalues are 1, 1, 1, −1; P has four linearly independent eigenvectors, so it is diagonalizable. If we refine our partition into smaller squares, P will be a block diagonal matrix with √n blocks of the form [ 1 ] and (n − √n)/2 blocks of the form

[ 0 1 ]
[ 1 0 ]

(recall that n is a power of four). Since each block is diagonalizable, our refinement matrices are diagonalizable with characteristic polynomial

(λ − 1)^{(n+√n)/2} (λ + 1)^{(n−√n)/2}.

We ran our procedure 100 times with 10^6 pseudorandomly generated (uniform) points and saw the results listed in Table 13-1. For every partition P̂ = P, m = 10^6 appears to be sufficient, and we fail to reject Ho and correctly conclude that the map is not mixing.


Table 13-1. The Reflection Map

Number of States   Average |λ2(P̂)|   Typical p-value of χ2
4                  1                 0
16                 1                 0
64                 1                 0
256                1                 0
1024               1                 0
4096               1                 0

13.2 Arnold’s Cat Map

Look at the unit square as a torus and apply Arnold's cat map,

f(x, y) =
[ 2 1 ] [ x ]
[ 1 1 ] [ y ]  mod 1.

This map is strong-mixing. When we partition the unit torus into four squares,

P =
[ 1/4 1/4 1/4 1/4 ]
[ 1/4 1/4 1/4 1/4 ]
[ 1/4 1/4 1/4 1/4 ]
[ 1/4 1/4 1/4 1/4 ],

which has characteristic polynomial

λ^3 (λ − 1).

This matrix arises from our particular partition into four subsquares of the unit square. A way to confirm that this is the correct matrix is to draw the mapping of the subsquares on the xy-plane, then look at where the four mapped subsquares are on the torus. The eigenvalues are 1, 0, 0, 0. So when we partition the unit torus into four subsquares, Arnold's cat map sends one-fourth of each subsquare to each subsquare.

We ran our procedure 100 times with 10^6 pseudorandomly generated (uniform) points and saw the results listed in Table 13-2.


Table 13-2. Arnold's Cat Map

Number of States   Average |λ2(P̂)|   Typical p-value of χ2
4                  0.00              0.00
16                 0.09              0.00
64                 0.45              0.00
256                0.45              0.00
1024               0.45              0.00
4096               0.49              1.00

A typical P̂ for four states is

P̂ ≈
[ 0.2507 0.2496 0.2501 0.2496 ]
[ 0.2513 0.2484 0.2505 0.2497 ]
[ 0.2489 0.2511 0.2500 0.2500 ]
[ 0.2506 0.2504 0.2499 0.2491 ],

with

max(|P̂ − P|) ≈ 0.0016 and |λ2(P̂)| ≈ 0.0016.

At typical significance levels we conclude that we used enough points when n = 4, 16, 64, 256, 1024, but that m = 10^6 is not sufficient when n = 4096.

13.3 The Sine Flow Map (parameter 8/5)

The sine flow map is a well-studied area-preserving nonlinear map on the torus,

f(x, y) = (x + T sin(2πy), y + T sin(2π(x + T sin(2πy)))).

When T = 8/5, f is chaotic; it is conjectured that f is also chaotic when T = 4/5 [19]. For a dynamical system to be chaotic it must be topologically mixing; that is to say, for any two open sets A, B ⊂ D, there exists an N such that

f^n(A) ∩ B ≠ ∅

whenever n > N.

We ran our procedure 100 times with 10^6 pseudorandomly generated points (uniformly distributed) and saw the results listed in table 13-3.

Table 13-3. The Sine Flow Map (parameter 8/5)
Number of States    Average |λ2(P̂)|    Typical p-value of χ²
4                   0.2042              0.000
16                  0.2068              0.000
64                  0.3539              0.000
256                 0.5198              0.000
1024                0.6927              0.000
4096                0.7427              1.000

At typical significance levels we conclude that we used enough points for n = 4, 16, 64, 256, 1024, but m = 10^6 appears to be insufficient when n = 4096.

A typical P̂ for four states is

P̂ ≈
[ 0.1655 0.2341 0.2330 0.3673 ]
[ 0.2316 0.1656 0.3709 0.2318 ]
[ 0.2332 0.3691 0.1647 0.2331 ]
[ 0.3696 0.2326 0.2328 0.1650 ],

with |λ2(P̂)| ≈ 0.2048.

13.4 The Sine Flow Map (parameter 4/5)

Here we set the parameter of the sine flow map to 4/5.

f (x , y) = (x + (4/5) sin(2πy), y + (4/5) sin(2π(x + (4/5) sin(2πy))))

It is conjectured that this dynamical system is chaotic [19].

We ran our procedure 100 times with 10^6 pseudorandomly generated points (uniformly distributed) and saw the results listed in table 13-4.


Table 13-4. The Sine Flow Map (parameter 4/5)
Number of States    Average |λ2(P̂)|    Typical p-value of χ²
4                   0.1381              0.000
16                  0.2584              0.000
64                  0.4183              0.000
256                 0.5313              0.000
1024                0.5755              0.000
4096                0.6151              1.000

At typical significance levels we conclude that we used enough points when n = 4, 16, 64, 256, 1024, but that m = 10^6 is not sufficient when n = 4096.

A typical P̂ for four states is

P̂ ≈
[ 0.2015 0.2311 0.2294 0.3380 ]
[ 0.2293 0.2018 0.3394 0.2296 ]
[ 0.2307 0.3367 0.2010 0.2317 ]
[ 0.3391 0.2296 0.2298 0.2015 ],

with |λ2(P̂)| ≈ 0.1375.

It is believed that f mixes faster for larger T. Comparison of the eigenvalue magnitudes from the two sine flow map examples runs counter to this conjecture.
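One can reproduce this comparison directly; the sketch below again leans on the hypothetical ulam_matrix and second_eigenvalue helpers from the start of the chapter, so the numbers it prints are only expected to be near the tabulated averages, not to match them exactly.

import numpy as np

def sine_flow(T):
    # Sine flow map on the torus with parameter T
    def f(x, y):
        x1 = (x + T * np.sin(2 * np.pi * y)) % 1.0
        y1 = (y + T * np.sin(2 * np.pi * x1)) % 1.0
        return x1, y1
    return f

# Observed |lambda_2| for the two parameters on a 64-state partition.
for T in (8 / 5, 4 / 5):
    P_hat = ulam_matrix(sine_flow(T), n=64, m=10**6)
    print(T, round(second_eigenvalue(P_hat), 4))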

The baker's map example in the next section shows how eigenvalue instability can alter the observed eigenvalues.

13.5 The Baker’s Map

The baker's map defines a mixing dynamical system on the unit square, where

f(x, y) = (2x, y/2)            if 0 ≤ x < 1/2,
f(x, y) = (2 − 2x, 1 − y/2)    if 1/2 ≤ x ≤ 1.
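As with the previous examples, the map is easy to sample; the following sketch (again using the hypothetical ulam_matrix helper) produces an observed four-state matrix that should be close to the exact P given below.

import numpy as np

def baker(x, y):
    # Baker's map on the unit square
    if x < 0.5:
        return 2 * x, y / 2
    return 2 - 2 * x, 1 - y / 2

P_hat = ulam_matrix(baker, n=4, m=10**6)
print(np.round(P_hat, 3))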


When we partition the unit square into four subsquares, the exact transition matrix is

P =
[ 1/2 1/2 0   0   ]
[ 0   0   1/2 1/2 ]
[ 1/2 1/2 0   0   ]
[ 0   0   1/2 1/2 ],

which has characteristic polynomial λ³(λ − 1).

Our eigenvalues are 1, 0, 0, 0 and the rank of P is two, so a Jordan canonical form is

[ 1 0 0 0 ]
[ 0 0 1 0 ]
[ 0 0 0 0 ]
[ 0 0 0 0 ].

If we refine our partition into smaller squares we get characteristic polynomials of the form

λ^(4^k − 1) (λ − 1).

The graph of

p(x) = x^(4^k − 1) (x − 1)

is nearly flat near x = 0; Taylor's theorem shows that for large k, a small perturbation to P can greatly change λ2. Notice that λ2(P̂) is the most perturbed root of P's characteristic polynomial. The shape of these polynomials shows that the eigenvalue instability of λ = 0 will increase with refinement.
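A small numerical illustration of this instability, under the simplifying assumption that the perturbation acts on the constant coefficient of the characteristic polynomial rather than on P̂ itself: a perturbation of size ε moves the root at 0 to roots of magnitude roughly ε^(1/(N − 1)), which approaches 1 as the number of states N grows.

import numpy as np

# p(x) = x**(N-1) * (x - 1) is the characteristic polynomial of the refined
# baker's map matrices (N = 4**k states).  Perturb its constant coefficient
# by eps and watch the eigenvalue at 0 move.
eps = 1e-3
for N in (4, 16, 64):
    coeffs = np.zeros(N + 1)
    coeffs[0], coeffs[1] = 1.0, -1.0      # x**N - x**(N-1)
    coeffs[-1] = eps                      # perturbed constant term
    roots = np.roots(coeffs)
    second = np.sort(np.abs(roots))[-2]   # largest root apart from the one near 1
    print(N, round(second, 4), round(eps ** (1 / (N - 1)), 4))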

We ran our procedure 100 times with 10^6 pseudorandomly generated points (uniformly distributed) and saw the results listed in table 13-5.


Table 13-5. The Baker's Map
Number of States    Average |λ2(P̂)|    Typical p-value of χ²
4                   0.03                0.004
16                  0.17                0.00
64                  0.33                0.00
256                 0.44                0.00
1024                0.53                0.00
4096                0.61                1.00

A typical P̂ for four states is

P̂ ≈
[ 0.5013 0.4987 0      0      ]
[ 0      0      0.4995 0.5005 ]
[ 0.5011 0.4989 0      0      ]
[ 0      0      0.4997 0.5003 ],

with

max(|P̂ − P|) ≈ 0.0013 and |λ2(P̂)| ≈ 0.0079.

At typical significance levels we conclude that we used enough points when n = 4, 16, 64, 256, 1024, but that m = 10^6 is not sufficient when n = 4096. We know that the map is mixing and that the induced Markov shift is mixing, but the eigenvalues of the approximating matrices do not reflect the rate of mixing. This example demonstrates how eigenvalue instability and insufficient m can throw off an observation.

13.6 The Chirikov Standard Map (parameter 0)

The Chirikov standard map is a Lebesgue-measure-preserving function that maps the torus to itself,

f(x, y) = (x + k sin(2πy), y + x + k sin(2πy)).


Table 13-6. The Chirikov Standard Map (parameter 0)
Number of States    Average |λ2(P̂)|    Typical p-value of χ²
4                   1.00                0.00
16                  1.00                0.00
64                  1.00                0.00
256                 1.00                0.00
1024                1.00                0.00
4096                1.00                1.00

For this example we set k = 0 so that we may compute P,

f(x, y) = (x, x + y).

When we partition the unit square into four subsquares, the exact transition matrix is

P =
[ 1/2 0   1/2 0   ]
[ 0   1/2 0   1/2 ]
[ 1/2 0   1/2 0   ]
[ 0   1/2 0   1/2 ],

which has characteristic polynomial λ²(λ − 1)².

We ran our procedure 100 times with 10^6 pseudorandomly generated points (uniformly distributed) and saw the results listed in table 13-6.

A typical P̂ for four states is

P̂ ≈
[ 0.5012 0      0.4988 0      ]
[ 0      0.5000 0      0.5000 ]
[ 0.5006 0      0.4994 0      ]
[ 0      0.5010 0      0.4990 ].


If we relabel the subscripts of the Di we can get

P =
[ 1/2 1/2 0   0   ]
[ 1/2 1/2 0   0   ]
[ 0   0   1/2 1/2 ]
[ 0   0   1/2 1/2 ]

and

P̂ ≈
[ 0.5012 0.4988 0      0      ]
[ 0.5006 0.4994 0      0      ]
[ 0      0      0.5000 0.5000 ]
[ 0      0      0.5010 0.4990 ].

The graph of our Markov shift has two disjoint subgraphs; hence our subshifts of finite type are not ergodic. At typical significance levels we conclude that we used enough points when n = 4, 16, 64, 256, 1024, but that m = 10^6 is not sufficient when n = 4096.
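Relabeling the states is a permutation similarity, so it changes neither the eigenvalues nor the mixing conclusion; it only makes the two invariant blocks visible. A short check (the chosen ordering is just one relabeling that works) is:

import numpy as np

P = np.array([[0.5, 0.0, 0.5, 0.0],
              [0.0, 0.5, 0.0, 0.5],
              [0.5, 0.0, 0.5, 0.0],
              [0.0, 0.5, 0.0, 0.5]])
perm = [0, 2, 1, 3]                     # swap the labels of D2 and D3
Q = P[np.ix_(perm, perm)]
print(Q)                                # block diagonal with two 2 x 2 blocks
print(np.sort(np.abs(np.linalg.eigvals(P)))[::-1])   # 1, 1, 0, 0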


REFERENCES

[1] G. Froyland, On Ulam approximation of the isolated spectrum and eigenfunctions of hyperbolic maps, Discrete Contin. Dyn. Syst. 17 (3) (2007) 671–689 (electronic). URL http://dx.doi.org/10.3934/dcds.2007.17.671

[2] F. Y. Hunt, Unique ergodicity and the approximation of attractors and their invariant measures using Ulam's method, Nonlinearity 11 (2) (1998) 307–317. URL http://dx.doi.org/10.1088/0951-7715/11/2/007

[3] G. Froyland, K. Aihara, Ulam formulae for random and forced systems, Proceedings of the 1998 International Symposium on Nonlinear Theory and its Applications 2 (1998) 623–626.

[4] S. M. Ulam, Problems in modern mathematics, Science Editions John Wiley & Sons, Inc., New York, 1964.

[5] M. Dellnitz, G. Froyland, S. Sertl, On the isolated spectrum of the Perron-Frobenius operator, Nonlinearity 13 (4) (2000) 1171–1188. URL http://dx.doi.org/10.1088/0951-7715/13/4/310

[6] T. Y. Li, Finite approximation for the Frobenius-Perron operator. A solution to Ulam's conjecture, J. Approximation Theory 17 (2) (1976) 177–186.

[7] G. Froyland, Using Ulam's method to calculate entropy and other dynamical invariants, Nonlinearity 12 (1) (1999) 79–101. URL http://dx.doi.org/10.1088/0951-7715/12/1/006

[8] J. R. Brown, Ergodic theory and topological dynamics, Academic Press [Harcourt Brace Jovanovich Publishers], New York, 1976, Pure and Applied Mathematics, No. 70.

[9] P. Walters, An introduction to ergodic theory, Vol. 79 of Graduate Texts in Mathematics, Springer-Verlag, New York, 1982.

[10] V. Baladi, Positive transfer operators and decay of correlations, Vol. 16 of Advanced Series in Nonlinear Dynamics, World Scientific Publishing Co. Inc., River Edge, NJ, 2000.

[11] G. Birkhoff, Three observations on linear algebra, Univ. Nac. Tucuman. Revista A. 5 (1946) 147–151.

[12] G. H. Golub, C. F. Van Loan, Matrix computations, 3rd Edition, Johns Hopkins Studies in the Mathematical Sciences, Johns Hopkins University Press, Baltimore, MD, 1996.

[13] G. Casella, R. L. Berger, Statistical inference, The Wadsworth & Brooks/Cole Statistics/Probability Series, Wadsworth & Brooks/Cole Advanced Books & Software, Pacific Grove, CA, 1990.

[14] J. Ding, A. Zhou, Finite approximations of Frobenius-Perron operators. A solution of Ulam's conjecture to multi-dimensional transformations, Phys. D 92 (1-2) (1996) 61–68. URL http://dx.doi.org/10.1016/0167-2789(95)00292-8

[15] P. Billingsley, Probability and measure, 3rd Edition, Wiley Series in Probability and Mathematical Statistics, John Wiley & Sons Inc., New York, 1995, a Wiley-Interscience Publication.

[16] M. Rao, personal communication (2009).

[17] M. Pozniak, K. Zyczkowski, M. Kus, Composed ensembles of random unitary matrices, J. Phys. A 31 (3) (1998) 1059–1071. URL http://dx.doi.org/10.1088/0305-4470/31/3/016

[18] G. Berkolaiko, Spectral gap of doubly stochastic matrices generated from equidistributed unitary matrices, J. Phys. A 34 (22) (2001) L319–L326. URL http://dx.doi.org/10.1088/0305-4470/34/22/101

[19] M. Giona, S. Cerbelli, Connecting the spatial structure of periodic orbits and invariant manifolds in hyperbolic area-preserving systems, Phys. Lett. A 347 (4-6) (2005) 200–207. URL http://dx.doi.org/10.1016/j.physleta.2005.08.005


BIOGRAPHICAL SKETCH

Aaron Carl Smith was born in Portland, Indiana, and grew up in Ashland, Oregon.

After serving in the United States Army, Aaron used the Montgomery GI bill to attend the

University of Florida. He is married to the most beautiful woman in the world, Bridgett

Smith; they have a wonderful daughter, Akiko.
