
A Colorful Introduction to Linear Algebra
Zack Booth Simpson, May 2014

Table of Contents

    Introduction
    Learning to count
    Vector addition
    Vectors as geometry
    Solving for a new mixture
    Matrix multiplication in general
    Order matters
    Size matters
    Re-scaling and normalization
    Projection
    Matrices as operators
    Combining Operators
    Basis
    Homogeneous Matrices
    Inverses
    Rank
    Special Matrices
        Identity Matrix
        Diagonal, a.k.a. Scaling Matrix
        Orthogonal Matrix
        Orthonormal / Rotation Matrix
    Factorizations
        Eigen Decomposition
        LU decomposition
    Principal Component Analysis
        Covariance
    Vocabulary
    References
    Acknowledgments
    Appendix A
        Using Trigonometry
        Without Using Trigonometry
    Appendix B - Trigonometry Refresher
    Appendix C - The Algebra of PCA

Introduction

Linear Algebra is the math of mixing things. Linear algebra is incredibly useful; it has applications to practically every discipline of science and engineering ranging from quantum mechanics to video game design, to name but two examples of hundreds. This paper seeks to give you intuition for a variety of ideas from linear algebra. It is not a linear algebra textbook. Whereas a textbook would focus on proofs and formal definitions, this paper will instead focus on application and vocabulary. It is hoped that you will be able to generalize from the presented applications and be capable, through familiarity with the jargon, of referencing a textbook when needed.

Learning to count

When we teach children to count we might say: "one ball, two balls, three balls". Then we might say: "one block, two blocks, three blocks". The child might reply: "one ball, two blocks, three dolls" and we correct them: "No, no -- you have to add things of the same type!" It is meaningless to add balls with blocks with dolls. Or is it? If we want to count different kinds of things we have to keep them isolated in separate piles. Keeping track of these separate piles is what linear algebra is all about, so you might think that mathematicians would call these interesting piles of things "piles" or "collections" or some such term. But they don't; they call them "vectors." Linear algebra, like a lot of math, has tons of weird vocabulary for otherwise easy ideas. Sometimes this jargon can get in the way of understanding the ideas. That said, you can't really communicate with other people about an idea unless you learn the jargon. So, in this paper, I'm going to introduce the jargon slowly and always be sure to let you know where those words came from. Let's start with the word "vector." "Vector" comes from the Latin "vehere" meaning "to carry." It has the same root as "vehicle." [1] It makes a tiny little bit of sense -- a vector "carries" us to a destination. In the case of our child learning to count we might say that destination is something like: "3 balls, 2 blocks, 1 doll." Vectors are usually written vertically, like this:

    [ 3 ]   balls
    [ 2 ]   blocks
    [ 1 ]   dolls

Why are vectors conventionally written as columns? Because they are. Let's have a word about notation. All of the rules about how to write down a mathematical statement are one of the reasons that people get scared of math. The symbols, conventions, and Greek letters understandably make people believe that math is a magical language only to be spoken by high priests. This impression starts in grade school when we are taught these symbols without explanation of where they came from, as if the + and - symbols were handed down by God and we must not question them. In reality, notation is a shorthand that's been passed down from generation to generation and has typically evolved extensively. In other words, it is part arbitrary convention and part refined tool. You should remember that a notation represents a decision made by people just like you, and if you don't like a notation then you are free to invent your own. Of course, you do so at your peril -- you may have a hard time communicating your ideas to others who have different notations. That said, this is exactly where all the standard notation came from -- someone made it up and their ideas were good enough that other people started copying them. Maybe you'll get good enough at math that you can make up notation for your ideas and they'll stick too! Back to vectors. Sometimes it is inconvenient to write a vector as a column as it takes up a lot of paper. So, there's an alternative notation like this: (3,2,1)^T. That little "T" up there is a way of saying that this vector is lying on its side. The "T" stands for "transpose" which comes from Latin: "trans" = "over" + "pose" = "to put." So "transpose" literally means "to put on its side." [2]

I'm going to interchange between these two notations frequently. Let's write our first vector equation. Suppose that we have a bread recipe that calls for 2 cups of flour, 1 cup of sugar, and 1 stick of butter. Suppose that we want to make two loaves. We need to multiply. By the way, "multiply" comes from Latin "multi" = "many" + "ply" = "fold." A vector multiplied by 2 is written like this:

    2 * (2 flour, 1 sugar, 1 butter)^T = (4 flour, 2 sugar, 2 butter)^T

Multiplying a vector by a number is called "scaling" the vector, also known as a "scalar" multiply. The Anglo-Saxon word "scale" is interesting; it comes from the time when scales used to measure the weight of things were made of balanced cups hanging from strings. The word comes from the Norse word "skal" which means "skull", maybe because the Vikings used to use skulls as cups! So when you "scale" a vector you may be ghoulishly talking about drinking from a skull! [3]

When scaling a vector, it doesn’t matter on which side you write the multiply. This might seem obvious but it is very important as we will learn in a bit.

Vector addition

Suppose that you had two recipes: one for the bread that we've already discussed and one for a cake that calls for "1 cup flour, 2 cups sugar, 1 stick butter". If you want to go to the store and buy enough ingredients for both bread and cake, you'd want to add the two requirements. We write this as follows, simply adding up each row so that we're adding flour with flour, sugar with sugar, etc.

    (2, 1, 1)^T + (1, 2, 1)^T = (3, 3, 2)^T
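If you want to play with these ideas on a computer, here is a minimal sketch using Python and NumPy (the array names are my own, not from the text):

```python
import numpy as np

# Recipe vectors: (cups flour, cups sugar, sticks butter)
bread = np.array([2.0, 1.0, 1.0])
cake  = np.array([1.0, 2.0, 1.0])

# Scaling: two loaves of bread
two_loaves = 2 * bread          # -> [4. 2. 2.]

# Vector addition: shopping list for one bread plus one cake
shopping = bread + cake         # -> [3. 3. 2.]

print(two_loaves, shopping)
```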

Vectors as geometry

One way to think of vectors is to plot them. For example, the following is the flour and sugar components of our bread and cake purchase above, where we plot cups of flour on the horizontal axis and cups of sugar on the vertical.

When plotted, addition is the same thing as drawing the vectors so that the tail of one is on the head of the other. The result (how much we need to buy at the store for the combined recipe) is the blue vector that carries us to the end point. In this case, 3 cups flour and 3 cups of sugar. Visualizing vectors as geometry is very appealing when you have only two components (ingredients) but is much harder with three components, and impossible with even more. Sometimes people find that once they think of vectors as geometry, they can't think of them any other way. This is a mistake. The power of linear algebra is exactly that it works for vectors with lots of components, that is, with a high "dimension" (from "di" + "metri" = "to measure out" [4]). There are two good reasons to think of vectors as geometry. The first is when geometry is in fact the problem of interest, such as when one is modelling shapes in a video game or for a physics problem. The second reason is that geometric understanding leads to good visual intuition. But, as suggested, this can be limiting since your intuition for high dimensions is essentially zero and therefore relying on geometric intuition over algebra can lead to problems. Geometrically, when we scale a vector we're changing its size but not its direction. A bigger or smaller cake is still a cake -- its ingredient vectors point in the same direction but with different lengths. Here we plot 2 loaves of bread:

Solving for a new mixture

Consider all of the things you could make with flour and sugar, ignoring the other ingredients. This is called the "space" of flour and sugar. We've already given the names "bread" and "cake" to some of the vectors in this space; now let's create a third vector (2, 0.5)^T and call it "scone."


Suppose that we've bought a pre-mixed cake recipe. We know the ratio of ingredients in the cake mix (1 flour to 2 sugar), but the mix is just that -- mixed -- we can't un-mix it. Meanwhile we've also bought a pre-mixed scone recipe. Similarly we know its ratio of ingredients (2 flour to 0.5 sugar) but we can't un-mix them either. So, the question is, given these two pre-mixed bags of cake and scone, can we make bread? (Ignoring all the ingredients except for flour and sugar for simplicity). Yes we can! As we see below, there's some amount of scone mix plus some amount of cake mix that will make exactly one loaf of bread.

The fact that we can combine two pre-made mixtures into something new is not obvious. After all, we have these two mixtures and it seems as if we're somehow unmixing them to get something new. But, of course, we're not unmixing them, we're mixing two mixtures, and if we can do that then we can get what we want. This idea is the heart of linear algebra. But, we must be careful: can we make scone from bread and cake? No. That would require unmixing either the bread or the cake mixes. Mathematically this is saying we'd need a negative quantity of bread or cake mix, which is manifestly silly in this situation but not in all situations. The following demonstrates the problem; to get scone we'd need a cake vector pointing in the negative sugar and flour direction.

Given that we know we can make bread from cake and scone, since the amounts required are all positive numbers, let's solve for the right combination of cake and scone ingredients to make the bread. We can write an equation for the situation like this:

    C_scone * (2, 0.5)^T + C_cake * (1, 2)^T = (2, 1)^T


We're interested in finding the two unknowns: cups of scone mix (C_scone) plus cups of cake mix (C_cake); we know the mixtures of everything else. We're ready for the neatest trick in linear algebra notation. We will rewrite our equation using a table called a "matrix." (We'll explain why this rearrangement works and what the meaning of the word "matrix" is in the next section; just follow along for a minute.)

    [ 2    1 ]   [ C_scone ]   [ 2 ]
    [ 0.5  2 ] * [ C_cake  ] = [ 1 ]

The left-hand table ("matrix") is made by combining the column vectors from our equation into an orderly table. Meanwhile the two unknowns were moved into a column vector of their own. While this rearrangement is merely notation, it's some very clever notation, for it allows us to rewrite the equation in a form that is familiar from grade school. We'll use "M" for the "Mixing Matrix", "x" for the unknown values vector (C_scone, C_cake)^T, and "b" for the bread vector. (It is conventional to use upper case letters for matrices, bold lower case for vectors, and non-bold for scalars.)

    M * x = b

Isn't that elegant? It looks like a plain old algebra problem from 8th grade except that we know that it's really a two-dimensional set of equations. We know from 8th grade how to solve for x -- we divide by M.

    x = b / M

Except there's a problem; division isn't something we know how to do to a matrix! There's a way to do it by hand for a 2 by 2 matrix, but for matrices bigger than that it gets extremely involved, so we're not going to learn how to do it manually. The right answer in this modern age is that you simply ask a computer to do it for you. The way you ask a computer to do this is you ask it for the "inverse" of the matrix M. "Inverse" comes from Latin: "in-" = "opposite" + "vertere" = "turn". [5] Mathematicians represent the "inverse" using a "-1" in superscript, written M^-1. Why? That's a story we don't have time for.


    M * x = b                       Original equation
    M^-1 * M * x = M^-1 * b         Left multiply both sides by M^-1
    (M^-1 * M) * x = M^-1 * b       Visually group M^-1 * M
    x = M^-1 * b                    (M^-1 * M) cancels out

The (M^-1 * M) term cancels out. It's like saying with plain-old numbers that (M/M) cancels out because M/M = 1 and 1 times any number is that same number. So, all we have to do is ask a computer for M^-1, multiply it by b, and we will have our unknowns x! (A warning: we're later going to see that there's a different and superior way to do this which in general eliminates the need to use inverses. See "LU Decomposition" below.) Rearranging vectors into matrix collections as just demonstrated is probably the most important concept in linear algebra, so we're going to study it more. By the way, "matrix" means "womb" in Latin ("mater" = mother [6]). So, why do mathematicians call such tables wombs? Because a mathematician in 1850 imaginatively thought of the equations coming from the womb of a common parent. [7] Thus demonstrating the strange nature of mathematical etymology.
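Here is a small sketch of that computation in Python with NumPy. The numbers come from the scone (2, 0.5) and cake (1, 2) mixes and the bread recipe (2, 1) used above; the variable names are mine:

```python
import numpy as np

# Columns of the mixing matrix are the scone and cake mixes (flour, sugar).
M = np.array([[2.0, 1.0],
              [0.5, 2.0]])
b = np.array([2.0, 1.0])       # bread: 2 cups flour, 1 cup sugar

x = np.linalg.inv(M) @ b       # x = M^-1 * b
print(x)                       # ~[0.857, 0.286] cups of scone mix and cake mix

# In practice np.linalg.solve(M, b) is preferred; it avoids forming the inverse.
```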

Matrix multiplication in general

In the above section we rearranged an equation into a matrix times a vector. Let's look at that again in more detail because it's a technique we're going to use a lot:

Notice that multiplying a matrix (blue and orange) by a vector (purple) is the same thing as scaling the first column vector (blue) by the first component of the vector (C_scone) plus scaling the second column (orange) by the second component (C_cake). Stop now, and study that; it is particularly important to see that the right-hand side of the equation is a single vector -- specifically, it is the sum of two scalar/vector multiplications. When describing the sizes of matrices it is conventional to list the number of rows first and the number of columns second. So the above equation is a 2 x 2 matrix times a 2 x 1 vector = a 2 x 1 vector.


This pattern can extend to as many vectors as you want by simply adding more columns to the right side matrix. For example, here we have two vectors a and b combined into a matrix:

All we've done is show that we can extend the pattern to more than one vector. Compared to our previous equation, we've added another set of unknowns. Where before we had a single vector (C_scone, C_cake)^T, we now have two vectors (a1, a2)^T and (b1, b2)^T. And the right-hand result is now a two-column matrix instead of a vector. This notation is a bit tricky but is really just some rearrangements. You might want to review the above several times until you see the pattern; it is very important.

Order matters

The order in which we multiply matrices matters. A * B does not equal B * A. Attempting to reverse the multiply is like mixing up components the way our naive child did when learning to count -- you can't add balls to blocks. For example, first we do it the right way:


Now the other way around, swapping the order of the multiply results in nonsense.


So, in general you cannot swap the order of a matrix multiply. A * B is not the same as B * A, and furthermore its units are often meaningless. So, the lesson is you have to be very careful when you write down your equations.
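As a quick check of this, here is a tiny NumPy sketch (the matrices are chosen arbitrarily for illustration):

```python
import numpy as np

A = np.array([[1.0, 2.0],
              [3.0, 4.0]])
B = np.array([[0.0, 1.0],
              [1.0, 0.0]])   # swaps columns when used on the right, rows on the left

print(A @ B)   # [[2. 1.], [4. 3.]]
print(B @ A)   # [[3. 4.], [1. 2.]]  -- a different matrix: A*B != B*A
```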

Size matters

The arrangement of the components of a matrix matters too. We can only multiply matrices if they are compatible. For example, the following is meaningless:

It is simply meaningless to multiply a 2x2 by a 1x2 as demonstrated.


There’s a handy trick to remind yourself if a given matrix multiply is meaningful. Write out the sizes rows by columns times rows by columns. For example:

If the inner numbers are the same then it is a legal multiplication. The resulting matrix will have the dimensions of the outer numbers. So in the above example the inner numbers are both 2, so it's legal. The resulting matrix will be 1 x 3. The following is illegal, as the inner numbers are not the same.
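A short NumPy sketch of the same shape rule (the arrays here are arbitrary placeholders):

```python
import numpy as np

A = np.ones((1, 2))   # 1 x 2
B = np.ones((2, 3))   # 2 x 3

C = A @ B             # (1 x 2) * (2 x 3): inner numbers match, result is 1 x 3
print(C.shape)        # (1, 3)

try:
    B @ A             # (2 x 3) * (1 x 2): inner numbers 3 and 1 do not match
except ValueError as e:
    print("illegal multiply:", e)
```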

Re-scaling and normalization

Our first bread recipe called for b = (flour, sugar, butter)^T = (2, 1, 1)^T. Suppose that we made this recipe and found that it made 1.5 loaves in our pans. "What kind of stupid recipe makes 1.5 of something?", we ask ourselves. We want to scale the recipe so that it gives us exactly 1 loaf in our pan units. We want to "rescale" the recipe to length 1. We simply divide it by its length. This is a process called "normalization".

How do we get the length of a vector? The Pythagorean Theorem: sqrt(base^2 + height^2) for two dimensions or, more generally, sqrt( sum( C_i^2 ) ) where C_i is the i-th component of a higher dimensional system.


Above we see normalizing the bread vector (2, 1)^T. Notice that when the bread vector (2, 1)^T is divided by its length 2.2 it ends up at (0.9, 0.45)^T, which is 1 unit long as shown on the right.
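In NumPy this normalization is one line (the 2.2 above is the rounded value of sqrt(5) ~ 2.236):

```python
import numpy as np

bread = np.array([2.0, 1.0])          # (flour, sugar)
length = np.linalg.norm(bread)        # sqrt(2^2 + 1^2) ~= 2.236
bread_unit = bread / length           # ~[0.894, 0.447], length 1.0

print(length, bread_unit, np.linalg.norm(bread_unit))
```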

Projection

Consider the following dialog between two vectors:

So, who is right? The orange vector or the green vector? Neither, of course. All measurement is relative to some other measurement. Thus, we need a way to say: "How long is the green arrow compared to the orange arrow?" and vice versa. Such measurement is called "projecting" one vector onto another. In the above case our orange and green friends are pointing in the same direction, but in general they might not be. Consider this situation:

Now what do we do? They don’t even agree on which direction is the “true” direction! When projecting the orange vector onto the green vector we have to find the vector that points in the same direction as green but which is closest to orange.

The idea of projection is very important. It is common that we have one vector and we want to understand it in terms of some other vector. In this case we seek the vector v_projection, shown in blue, which is how the green vector b would understand the orange vector a. Note that the difference between the orange vector a and the blue projection vector is called the "residual" vector. "Residual" comes from Latin and means "the leftover part". [8]

The equation for projecting a onto b is:

    v_projection = ((a_x*b_x + a_y*b_y) / (b_x^2 + b_y^2)) * b

This is a somewhat mysterious equation; for a detailed understanding of this equation and its relationship to the angle between the vectors, see Appendix A. One thing to notice is that if a_x*b_x + a_y*b_y is zero then there is no projection of a onto b. How is that possible? It means that a and b are perpendicular to each other.
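A small NumPy sketch of that projection formula (the vectors are chosen arbitrarily):

```python
import numpy as np

a = np.array([3.0, 1.0])
b = np.array([2.0, 2.0])

v_projection = (np.dot(a, b) / np.dot(b, b)) * b   # (a.b / b.b) * b
residual = a - v_projection                        # the "leftover part"

print(v_projection)                 # [2. 2.]
print(np.dot(residual, b))          # ~0: the residual is perpendicular to b
```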


Matrices as operators

A very common thing to do with a matrix is to have it represent some transformation you wish to do to a bunch of vectors. For example, in video games, characters are modelled by positioning lots of points in 3D space and connecting them with triangles and quadrilateral faces as the below picture illustrates. This kind of model is called a "mesh" of vertices.

[9]

We use matrices to act on the vertices of the character in order to position, scale, and rotate the character in space. This is called using the matrix as an "operator." "Operator" comes from Latin "operari" = "to work." [10] Let's demonstrate what a matrix operator does by using a 2D mesh representing a house instead of a complicated 3D game character. The following shows our matrix of vertices (one per column) sitting next to a plot of those same vertices with connecting lines forming a house.


Let’s suppose that we’ve modeled the house but want it twice as wide (everyone wants a bigger house!). We make a matrix that contains our scaling as follows.

We make the first column of our scaling matrix be two units long in the horizontal direction. With this we’re saying: “the horizontal components of the space are twice as long.” Then we make the second column be one unit along the vertical direction so we’re keeping the height the same as it was. The result is a scale matrix that makes the house twice as fat but no taller.

SCALE

Now let's rotate the house, laying it on its side. This time we want the operation matrix's first column vector to be pointed up and the second column vector to now point left, as in the following.

And the result is the house laying on its side.


The following is an arbitrary rotation by 45 degrees. If cos(45) and sin(45) are mysterious to you, see the trigonometry refresher in Appendix B.

ROTATE

And finally, if we let the matrix operator have non-perpendicular column vectors it will deform the house, skewing it.

SKEW

Combining Operators

You can combine operators easily. Suppose you want to both scale and rotate. You simply multiply the operators to form the composite matrix operation, as the following demonstrates with a scale by 2 and a rotation by 90 degrees.
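The following sketch builds the scale and rotation operators described above and applies their composite to a few 2D vertices (the "house" coordinates here are placeholders, not the ones from the original figures):

```python
import numpy as np

# A few 2D vertices, one per column (placeholder "house" points).
house = np.array([[0.0, 1.0, 1.0, 0.0, 0.5],
                  [0.0, 0.0, 1.0, 1.0, 1.5]])

scale = np.array([[2.0, 0.0],      # first column: stretch horizontal by 2
                  [0.0, 1.0]])     # second column: keep vertical as is

theta = np.radians(90)
rotate = np.array([[np.cos(theta), -np.sin(theta)],
                   [np.sin(theta),  np.cos(theta)]])

composite = rotate @ scale          # applied to a vertex: scale first, then rotate
print(composite @ house)            # transformed vertices, one per column
```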


Basis

Notice that in the above five examples the column vectors of the operation matrix are plotted in orange and blue. Take note of how those operation vectors are related to the outcome of the house; when those orange/blue vectors are rotated, the house is rotated, when scaled, the house scales, etc. These orange and blue vectors -- the column vectors of the operation matrix -- are called a "basis." A basis is simply a set of vectors that represent the canonical directions of the operation. It is common to "change basis", meaning that you modify a set of vectors, such as the data vectors (the house vertices, for example), so that they are expressed in the new coordinate system. For example:

Homogeneous Matrices

In all of the examples above the house is anchored to the origin no matter what operator we use. What if we wanted to move the house to a different place? Sliding a set of vectors around without changing shape is called "translation", from Latin "trans-" = "across" + "latus" = "to carry." [11] The question is, how do we translate using a matrix operator? We don't have a multiplicative operator which means "add this amount to every vector," but there's a very clever trick we can do: we add another dimension.


In the above example a third column and row have been added to the matrix operator; that third column (2, 2, 1)^T results in a translation of (2, 2) because we've also added a row of 1's to our vertices at the bottom. (Note that the system is now three dimensional but I'm only plotting two of those dimensions.) So, for example, consider the first vertex (first column) of the house = (0, 0, 1)^T.

This is tricky matrix kung-fu called "matrix homogenization". If you plan on using linear algebra for, say, computer graphics, then this is a must-know method. In summary, the idea is that by adding a new dimension to our system we've added translation. The clever bit is that we added an extra dimension set to all 1 on the vertices of our house model so that when multiplied by the operator they pick up the translation in the third column vector. One way to visualize this is that the operator matrix is broken into two sub-parts as shown below. The green sub-matrix is the rotation/scale/skew operator and the blue sub-matrix is the translation. The bottom row is always set to 0, 0, 1.
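Here is a minimal sketch of that homogenization trick in NumPy (the translation of (2, 2) is from the text; the vertex values are placeholders):

```python
import numpy as np

# Homogeneous operator: identity rotation/scale part plus a translation of (2, 2).
T = np.array([[1.0, 0.0, 2.0],
              [0.0, 1.0, 2.0],
              [0.0, 0.0, 1.0]])

# 2D vertices with a row of 1's appended so they pick up the translation column.
verts = np.array([[0.0, 1.0, 1.0, 0.0],
                  [0.0, 0.0, 1.0, 1.0],
                  [1.0, 1.0, 1.0, 1.0]])

moved = T @ verts
print(moved[:2])   # every vertex slid by (2, 2); the bottom row stays all 1's
```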

Inverses

Almost anything you can do with a matrix you can undo. For example, if we rotate by 90 degrees and scale by 2 as we did above, then all we have to do to undo that is make a matrix operator that scales by 0.5 and rotates by -90 degrees (note the reversal of the order!). As described before, a matrix which undoes the operation of another is called its inverse.


Do all matrix operators have an inverse? No! Consider the following operation on our house model:

The above operator flattens the house onto a diagonal line. Once you've done that there's no inverse anymore. That is, when you flatten some aspect of a matrix operation you've lost information -- you can't go backward anymore. Matrices that can't be inverted are called "singular". This term probably comes from the fact that a singular matrix (non-invertible) "stands alone" -- it has no inverse partner to play with. Notice that a non-invertible matrix has some of its column vectors pointing in the same direction. That is, it is "flattened" in a way -- its parts don't span the whole space. The basis of this matrix has lost a dimension -- you can't combine the blue and orange column vectors and get to anywhere in the space anymore -- you're stuck on that line.

Rank

Rank is a metric of how many effective dimensions there are in a matrix. A full rank matrix is one that isn't singular -- its vectors allow you to combine them so that you can reach anywhere in the space. But a singular matrix like the one above has at least two of its column vectors pointing in the same direction, thus collapsing the space; its rank is less than full. In this case, it is a two-dimensional matrix but has only one direction among its two column vectors, so its rank is one. The directions you can't travel in (which prevent you from leaving the line that the blue and orange vectors are on) are called the "null space" of the matrix. For example, the circle in the below basis is in the null space -- there's no combination of the blue and orange basis vectors that gets to it.
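NumPy can report the rank directly; here is a sketch with a singular matrix whose columns point the same way (the values are placeholders):

```python
import numpy as np

full = np.array([[2.0, 1.0],
                 [0.5, 2.0]])
flat = np.array([[1.0, 2.0],
                 [1.0, 2.0]])   # second column is just 2x the first

print(np.linalg.matrix_rank(full))   # 2: full rank, invertible
print(np.linalg.matrix_rank(flat))   # 1: singular, collapses the plane onto a line
# np.linalg.inv(flat) would raise LinAlgError: Singular matrix
```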


Special Matrices

There are a lot of special matrix forms that have interesting properties. Here are a few:

Identity Matrix

A matrix that is all zeros except for having ones along the top-left to bottom-right diagonal is called the "identity matrix". It has the special property that as an operator it leaves things unchanged. It is abbreviated with a capital "I" for "Identity" (to differentiate it from lower case "i", the imaginary number).

In the above drawing, I'm only showing two dimensions, but the idea of the identity matrix extends to any number of dimensions -- just add 1's along the diagonal. Prove to yourself that this operator leaves things unchanged. Try plugging in your own values in the green vector below:

Note that the identity matrix is the result of a matrix times its own inverse: A^-1 * A = I.

Diagonal, a.k.a. Scaling Matrix

The identity matrix described above is a special case of a "diagonal matrix" -- one with all zeros except along the top-left to bottom-right diagonal. A diagonal matrix is one whose basis vectors are axis-aligned but which are not necessarily all 1's along the diagonal. When used as an operator, it is said that these column vectors act to "scale" the right matrix. We've seen an example of a "scaling" matrix earlier.


Orthogonal Matrix

The word "orthogonal" comes from Greek, "ortho" = "straight" and "gonia" = "angle"; "gonia" comes from "gony" = "knee". [12] Any matrix where all of its column vectors are at right angles to one another is orthogonal. The identity and diagonal matrices we just learned about are two examples of orthogonal matrices -- in those cases the basis is not only orthogonal but axis-aligned too. A more general orthogonal matrix might be rotated, like:

Note that the basis vectors don't all have to be the same length; as long as all the bases are perpendicular to each other, it's called orthogonal. We previously saw that when you have two vectors that are orthogonal to each other, the a_x*b_x + a_y*b_y term of their projection is zero. Note that this term is the same thing as taking the row vector a^T and multiplying it by the column vector b.

    a_x*b_x + a_y*b_y = a^T * b

This is true no matter how many dimensions there are (showing only 2 in the above equation). Thus, whenever a^T * b = 0 we know that a and b are orthogonal.

Orthonormal / Rotation Matrix

An orthogonal matrix that has all unit-length basis vectors is called "orthonormal" or a "rotation matrix". The "normal" part of the word means that it has been "normalized", i.e. made all unit-length. It is also called a rotation matrix because its effect as an operator is to simply rotate the space without resizing or skewing.

An orthonormal matrix has the property that its transpose is also its inverse. To see why:

Since we know that the matrix A is orthogonal, all the off-diagonal terms must be zero, because a^T * b is zero when two vectors are orthogonal. And the upper-left to lower-right diagonal must be all ones, since for each unit-length basis vector a, a^T * a equals 1 (its projection onto itself is its own length, which is 1). (Review projections above if you don't see that.)

Thus we see that A^T * A is equal to the identity matrix, meaning that A^T must be equal to the inverse. (See Identity Matrix above.) Therefore we've proved that for an orthonormal matrix, its transpose is its inverse.
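A quick numerical check of that fact with a rotation matrix (the angle is chosen arbitrarily):

```python
import numpy as np

theta = np.radians(30)
R = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])   # orthonormal: unit-length, perpendicular columns

print(np.allclose(R.T @ R, np.eye(2)))            # True: R^T * R = I
print(np.allclose(R.T, np.linalg.inv(R)))         # True: the transpose is the inverse
```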

Factorizations

A very important thing to understand about matrix multiplication is that you can combine or split the matrices into parts. Consider this simple scalar-valued equation:

3 * 5 * x = 15 * x


We can collapse the 3 * 5 into 15, or vice versa -- we can split the 15 into 3 * 5. When we split a product into parts, that's "factorization"; when we combine, that's "compositing" or just "multiplying". We can do the same with matrix multiplies. For example, we can composite A * B into a single (A*B) matrix.

    A * B * x = (A*B) * x

This is a very handy thing to do computationally because it means we save a potentially expensive loop over x. For example, the matrix A might be a scale operation and the matrix B might be a rotation. We might have many x's -- say, all the vertices of a game character. It is much more efficient to combine A * B into a single operation and then apply it to each x vertex than it is to apply B to all the vertices and then apply A to all of those results. Why loop through the x's more than once when you can combine the operators more cheaply? Factoring is equally handy but much more complicated. Suppose you have a composite rotation and scale (R*S) = M; the two operations have been mixed together and you might want to pull back out the unknown rotation and unknown scale given M. This is not so easy even for simple numbers (you basically have to try every prime number), much less for systems of coupled equations. We're now going to look at several such matrix factorizations.

LU decomposition

This factorization is used for solving matrix equations without computing an inverse. We previously solved an equation like the following to find the proportions of scone and cake mix we needed to make a bread recipe:

    M * x = b

and we learned that we could solve for x by left-multiplying both sides by the inverse:

    x = M^-1 * b

But calculating an inverse is a fairly computationally expensive operation, especially for large matrices. It is also fraught with numerical accuracy problems. Furthermore, suppose that we have a lot of different b vectors to solve for. We'd have to do an expensive matrix multiply for each b vector. There's a better way. First, observe the following. Suppose that we had a matrix that had the following form:


The matrix is in "upper triangular form", meaning that all of its non-zero entries are in the upper-right triangle. Everything in the lower-left is zero. Now consider the equation for the bottom row:

    0 * x_1 + 0 * x_2 + m_6 * x_3 = b_3

which is solvable for x_3 because those zeros make it simplify down to:

    m_6 * x_3 = b_3, therefore x_3 = b_3 / m_6

And now that we have the value for x_3 we can solve the next row up, because we can substitute x_3 into that equation and therefore have only one unknown. We repeat this up the rows until we have solved for all the unknowns. This process is called "back-substitution" and is the preferred way to solve a large matrix. Using an upper-triangular (U) matrix allows you to solve by back-substitution, but how do you get any old matrix M into U form? You use an algorithm called "LU", for "Lower-triangle / Upper-triangle" decomposition. It breaks M into L * U. You then use L to get the b vector into the right form (solving the lower-triangular system L * y = b by the mirror-image process, forward substitution) and then you apply back-substitution using U to solve U * x = y. For large matrices you have to solve them using LU for both time and numerical stability reasons. The advantages of LU over the inverse are so large that some computational linear algebra packages such as LAPACK don't even provide an inverse function -- you have to use a dedicated solver function which is internally using LU.
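A sketch of this workflow in Python, using SciPy's LU routines (the matrix and right-hand sides are placeholders); the factorization is done once and reused for each new b:

```python
import numpy as np
from scipy.linalg import lu_factor, lu_solve

M = np.array([[2.0, 1.0],
              [0.5, 2.0]])

lu, piv = lu_factor(M)                 # factor M = L * U once (with pivoting)

for b in [np.array([2.0, 1.0]),        # solve cheaply for many right-hand sides
          np.array([3.0, 3.0])]:
    x = lu_solve((lu, piv), b)         # forward substitution with L, back substitution with U
    print(x)
```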

Eigen Decomposition

For most matrices there exists a set of vectors that don't change direction when operated upon by the matrix. In all the above matrix operations we would input a set of vectors (for example, the house vertices) and we'd find that the output vectors were different from the input. But, for most matrices there are special vectors that don't change direction under the operation. These are the "eigenvectors". The term "eigen" comes from German and means, roughly, "own" or "self" -- which makes sense -- the eigenvectors map onto themselves under the operation. It's a little hard to visualize, so the best way to understand this is with several specific examples. First consider the following skew matrix from above.


Any vector that points in the blue direction (or opposite to blue) is going to keep pointing in the same blue direction under the operation. Any other vector will change which way it's pointing (it will be skewed right). The blue direction is an "eigenvector" of this operation. To motivate ourselves as to the utility of an eigen decomposition, let's look at a practical example borrowed from Linear Algebra and its Applications by David Lay. [13] In his example he models the movement of people to and from the cities. Let's suppose that in any given year 10% of the people in the rural areas move to the city and 5% move from the city to the country. That means that 90% of rural people stay in the rural area and 95% stay in the city. Let's make a matrix for that:

    M = [ 0.90  0.05 ]
        [ 0.10  0.95 ]

The first column represents the movement of the people who live in the countryside -- 90% go from countryside to countryside (i.e. stay put) and 10% move to the city. The second column represents the movement of the people who live in the cities: 5% move to the country and 95% stay put. Now let's use this matrix as an operator. Let's suppose that there are a million people in the country and half a million in the city in the first year. We can find out how many end up in each place after one year by applying the matrix operator.

As we can see, 75,000 = (1,000,000 - 925,000) people moved from the country to the city. Algebraically we could write the above operator using M as the matrix that quantifies the movement between country and city, p_0 as the population at time zero, and p_1 as the population after one year, as follows:


    M * p_0 = p_1

What if we apply the movement matrix for another year? We can write that as:

    M * p_1 = p_2

Notice that there are fewer people moving this year -- only 63,750 = (925,000 - 861,250). It makes sense that fewer people moved the second year -- after all, there were fewer people in the country to move out. We could write this two-year trend by concatenating the matrices as in:

    M * p_0 = p_1
    M * p_1 = p_2

Therefore:

    M * (M * p_0) = p_2
    M^2 * p_0 = p_2

What if this pattern repeats itself year after year for, say, n years? That is the same thing as raising M to the n-th power.

    M^n * p_0 = p_n

Eventually, when n is large, the number of people leaving the country will exactly match the number of people leaving the city. That is called the "equilibrium" of the system. "Equi" = "equal" + "libra" = "balance". (That's why the astrological symbol of Libra is represented holding a balance.) The equilibrium vector p_equilibrium doesn't change under the matrix operation. Stated algebraically:

    p_equilibrium = M * p_equilibrium

Because the equilibrium vector doesn't change under the matrix operator, it is an eigenvector of the matrix operator. This particular example of an eigenvector for the people-moving example is actually a very special eigenvector because it doesn't change length under the operation. In general, eigenvectors point in the same direction after the matrix operation but may change length. The fact that this one doesn't change length means it is a special case where the system isn't changing size (that is, we didn't have birth or death in our model, only movement). A movement operator such as this has a special name: "Markov matrix", named after an important Russian mathematician who studied probability. You can tell it has the special conserving property because all the columns add up to 1. Each eigenvector has an associated number called its "eigenvalue" that tells you by how much the eigenvector changes length (scales) under the operation. For a Markov matrix, the dominant eigenvalue is 1, meaning it doesn't scale. How do you solve for the eigenvectors and their eigenvalues? Like other linear algebra operations, you ask a computer to do it for you, but you have to know how to phrase the question. The eigen decomposition of M is written as follows:

    M * Q = Q * D    which is the same thing as    M = Q * D * Q^-1

where M is the transform matrix, Q is the matrix of eigenvectors (each in a column) and D is the diagonal matrix of eigenvalues. What the above equation says is: "If I operate with M on any of the special eigenvectors (the columns of Q) then I end up with that same special Q vector back, scaled by some amount (the amount is in the respective diagonal entry of D)." What other applications are there for the eigen factorization? There are a lot. In general they have to do with analysis of how something is changing when you mix or rotate in some way. In the Principal Component Analysis section below we will look at one of them.
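Here is a sketch of asking NumPy that question for the people-moving matrix above; the equilibrium is the eigenvector whose eigenvalue is 1, rescaled so the total population still sums to 1.5 million:

```python
import numpy as np

M = np.array([[0.90, 0.05],     # column 1: where country dwellers go
              [0.10, 0.95]])    # column 2: where city dwellers go

eigenvalues, Q = np.linalg.eig(M)        # M @ Q = Q @ np.diag(eigenvalues)
print(eigenvalues)                       # [1.0, 0.85]

# The eigenvector with eigenvalue 1 is the equilibrium direction.
v = Q[:, np.argmax(eigenvalues)]
equilibrium = v / v.sum() * 1_500_000    # rescale to the total population
print(equilibrium)                       # ~[500000, 1000000] (country, city)
```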


Principal Component Analysis

The above graph is a scatter plot of weight versus height. [14] Each data point represents a person. The vertical axis is their height, the horizontal their weight. As we can see, and as we would suspect, there is a relationship between these two variables, giving us a bottom-left to top-right slant. That is, the taller you are, in general the heavier you are. This pattern is called "correlation". ("Correlation" comes from "cor-" = "together" + "relation".) But, as we see clearly, it isn't a perfect correlation: there's scatter to the points. That means that there are some people who are much heavier than you'd expect given their height and others who are much skinnier than you'd expect given their height. A very handy thing to do to this graph would be to make a new set of axes that are better aligned to the data, giving us a trend line.


The red axis in the above plot shows us the trend -- how much do you weigh on average given your height (or vice versa -- how tall are you given your weight). Those individuals above the red line are those who are taller than the trend and those below are shorter. That is, your position along the green axis tells us how "fat" or "skinny" someone is, corrected for their height (or vice versa). Our goal is to solve for the new axes -- the ones labeled in red and green above. We will use a tool called "Principal Component Analysis" (PCA). The first "principal component" (shown in red) is the direction of maximum "variance", or "spread-out-iness". If you knew nothing else about a person other than their position along this red axis you'd still be able to guess at both their height and weight. Knowing the position of a data point along the most variable dimension tells you a lot about that data point. Knowing the position along the least variable dimension tells you very little. So, if you're interested in summarizing data, especially data that is embedded in a high dimension, then you use PCA to find the dimension(s) of highest variance, then record each data point along those few dimensions and simply toss out the rest of the data. This is a form of data compression called "dimensional reduction". To perform PCA you begin by centering the data by subtracting out the mean so that the center of the data is now at the origin.


We want to find the directions of maximum variance. To compute the variance you simply add up the square of all the residuals from the mean.

Note that variance is a squared quantity, which means that both positive and negative residuals make positive contributions to the sum (since negative times negative is positive). Computing the variance with matrix notation is very easy: we simply multiply a centered data matrix by its transpose. Variance = Data * Data^T.


Our data matrix was 2 dimensions (weight and height) by n data points. Transposing and multiplying gives a (2 x n) * (n x 2) = (2 x 2). What are the four elements of that 2x2? The top-left corner of our result is the variance along the first dimension. The lower-right corner is the variance along the second dimension. The lower-left corner is called the "covariance" of the 1st with the 2nd dimension. Note that the upper-right is the same thing as the lower-left. This makes sense: dotproduct(x, y) is the same thing as dotproduct(y, x).

Covariance

Variance of a single variable is fairly easy to understand, but covariance between two variables is a little trickier. Consider the following contrasting situations.


When a data series runs along a lower-left to upper-right diagonal we see that the x*y residual products will generally be positive, because for the points in the lower-left negative*negative = positive, and the products in the upper-right will also be positive. So the sum of all these numbers will be a large positive number. But in the opposite situation, when the data points trend upper-left to lower-right, we see that the residual products will be positive*negative = negative and thus the sum will tend towards a large negative number. What if there is no correlation? In that case (bottom) the positive values roughly cancel out the negative values and the result will be a small number, somewhere near zero.

The covariance between two components of a signal tells us how well they predict each other. A large positive score means they align well bottom-left to top-right and a large negative score tells us they line up well the other way around. In contrast, a small sum tells us that they don't line up well at all. Principal Component Analysis uses the covariance matrix to solve for the new axes, which lie along the directions of most variance to least variance. The procedure is both simple and profound -- the eigenvectors of the covariance matrix are the principal components. The eigenvector with the largest eigenvalue is the first component, the next largest is the next, and so forth. Why are the eigenvectors of the covariance matrix the principal components? That's a much harder question to answer without slogging through the algebra (see Appendix C). It's worth noting that the vast majority of mathematical theorems are only accessible through algebra. Try as we might to have intuition for why the PCA components equal the eigenvectors of the covariance matrix, there isn't an easy shortcut (believe me, I've tried!). While searching for an easier way to learn geometry, King Ptolemy is supposed to have been told by Euclid that there was no "Royal Road to geometry". [15] As usual, Euclid is right -- often you just have to make your way through the algebra to get somewhere useful.
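Putting the whole recipe together in NumPy -- center the data, form the covariance matrix, take its eigenvectors -- might look like the following sketch (the data here is randomly generated, not the height/weight set from the text):

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(size=(2, 200))                      # 2 dimensions x n points
data[1] += 0.8 * data[0]                              # make the two rows correlated

centered = data - data.mean(axis=1, keepdims=True)    # subtract out the mean
covariance = centered @ centered.T / centered.shape[1]

eigenvalues, eigenvectors = np.linalg.eigh(covariance)   # eigh: symmetric matrix
order = np.argsort(eigenvalues)[::-1]                     # largest variance first
principal_components = eigenvectors[:, order]

print(covariance)
print(principal_components[:, 0])    # first principal component: the trend direction
```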

Vocabulary

vector - a collection of measurements, one for each component of a system
scalar - a term which scales something under multiplication
matrix - a table of numbers that represents a linear system
basis - the column vectors of a matrix
operator - using a matrix to change the basis of a system
dimension - the rulers by which a system is measured
transpose - to lay a matrix on its side, making column vectors into row vectors and vice versa
projection - to measure one vector in terms of another; the shadow of a higher dimensional system on a lower one
inverse - the operator that undoes the operation of some other operator
normalize - to make unit length
orthogonal - perpendicular in every dimension
homogeneous matrix - a trick whereby an extra dimension is added to account for translation
eigen decomposition - a matrix factorization that finds the equilibrium vectors of a system
principal component analysis (PCA) - a technique to identify the principal axes of variation

References

[1] http://www.etymonline.com/index.php?term=vector
[2] http://www.etymonline.com/index.php?term=transpose
[3] http://www.etymonline.com/index.php?term=scale
[4] http://www.etymonline.com/index.php?term=dimension
[5] http://www.etymonline.com/index.php?term=invert
[6] http://www.etymonline.com/index.php?term=matrix
[7] Via Wikipedia. The Collected Mathematical Papers of James Joseph Sylvester, 1837-1853, p. 247. On Google Books: http://books.google.com/books?id=5GQPlxWrDiEC&pg=PA247
[8] http://www.etymonline.com/index.php?term=residual
[9] Image from http://www.polycount.com/forum/showthread.php?t=72623
[10] http://www.etymonline.com/index.php?term=operate


[11] http://www.etymonline.com/index.php?term=translate
[12] http://www.etymonline.com/index.php?term=orthogonal
[13] Lay, David; Linear Algebra and its Applications, 3rd Edition; p. 289
[14] Data from http://www.amstat.org/publications/jse/datasets/body.dat.txt
[15] Proclus, p. 15. http://books.google.com/books?id=JZEHj2fEmqAC&lpg=PP1&pg=PA57#v=onepage&q&f=false
[16] http://mathworld.wolfram.com/OrthogonalMatrix.html
[17] See theorem 4, http://www.math.vanderbilt.edu/~msapir/msapir/prtranspose.html

Acknowledgments

Thanks to Diane Whitmer, John H Davis, Andreas Matouschek, Edward Marcotte, Alberto Verjovsky, Benni Goetz, and Steve Moore for their help.


Appendix A

Using Trigonometry

First, we consider the situation using trigonometry. Note that there is a right triangle formed by v_projection on the base, v_residual on the opposite side, and a on the hypotenuse. First, make a unit vector out of the green b vector. Then, once we find the length of v_projection, we'll just scale that unit vector to the right length.

    b_unit = b / len(b)

Consider the angle θ between the two vectors. The cosine of that angle is what we want -- remember the cosine is the adjacent over the hypotenuse (see Appendix B).

    cos(θ) = len(v_projection) / len(a)          Definition of cosine
    len(v_projection) = cos(θ) * len(a)          Solve for len(v_projection)
    v_projection = cos(θ) * len(a) * b_unit      Multiply b_unit by the length to get the projection vector

Without Using Trigonometry

It turns out that there's a way to do this projection without angles. The reason that's interesting is that trig functions like cosine are computationally expensive. If there were a way to do vector projection without angles, that would be great. And there is; here's the equation:

    len(a) * len(b) * cos(θ) = a_x*b_x + a_y*b_y

Notice that the right-hand side has no reference to the angle but rather uses only the components of a and b. But it's hard to see how this magic equation works.

In this "proof" we see the two vectors a and b at the top and the angle θ between them. Two triangles, green and brown, have been formed from those vectors. We arrange pairs of the triangles made by those vectors into the rectangular arrangement seen on the left (follow the colors). We see that the beige area is equal to a_x*b_x + a_y*b_y. We then rearrange those triangles as shown in the center panel. Since we only moved the triangles around, it follows that the beige area in the center panel must equal the beige area in the left panel. We then cut along the a vector and move the resulting triangle up to form a rectangle whose area is len(a) * len(b) * cos(θ), which must equal a_x*b_x + a_y*b_y. So now we can rearrange the projection equation so that it doesn't involve an angle.

    v_projection = cos(θ) * len(a) * b_unit                          From above
    v_projection = cos(θ) * len(a) * (b / len(b))                    Substitute definition of b_unit
    v_projection = cos(θ) * len(a) * len(b) * b / (len(b) * len(b))  Multiply by len(b) / len(b)
    v_projection = (a_x*b_x + a_y*b_y) * b / (len(b) * len(b))       Substitute cos(θ)*len(a)*len(b)
    v_projection = ((a^T * b) / (b^T * b)) * b                       Simplify

So, finally, the projection can be written using vector multiplication as:

    v_projection = ((a^T * b) / (b^T * b)) * b


Appendix B - Trigonometry Refresher


Appendix C - The Algebra of PCA

We seek a rotation matrix (Q) that will align with the maximum variance of the data matrix (X).

    X = Q * Y

The columns of Y are the new coordinates in the Q basis. In other words, the Y column vectors, when rotated by Q, give us back X. We want to choose Q so that the covariation in this new rotation is all diagonal. That is, we want Q so that it removes all the cross-correlation terms and leaves us with just the diagonal variance terms in the Y * Y^T covariance matrix. We need to find the relationship between the original covariance matrix (X * X^T) and the new rotated one (Y * Y^T).

    (1)  X = Q * Y                                     Definition of Q and Y
    (2)  Q^-1 * X = Q^-1 * Q * Y                       Left multiply both sides by Q^-1
    (3)  Q^-1 * X = Y                                  Cancel Q^-1 * Q to isolate Y
    (4)  Q^T * X = Y                                   Q is orthonormal: its inverse is its transpose [16]
    (5)  Q^T * X * Y^T = Y * Y^T                       Right multiply both sides by Y^T
    (6)  X^T * Q = Y^T                                 From (4), solve for Y^T by the transposition theorem [17]
    (7)  Q^T * X * X^T * Q = Y * Y^T                   From (6), substitute X^T * Q for Y^T into (5)
    (8)  Q^T * (covar X) * Q = (covar Y)               Rewrite as "covar" as a reminder of what we're up to
    (9)  Q^-1 * (covar X) * Q = (covar Y)              Q is orthonormal: its transpose is its inverse [16]
    (10) Q * Q^-1 * (covar X) * Q = Q * (covar Y)      Left multiply both sides by Q
    (11) (covar X) * Q = Q * (covar Y)                 Now in eigen form: M * Q = Q * D

Now we have an equation we've seen before -- an eigen system. Remember that we seek (covar Y) to be all diagonal. So this equation is in eigen form where Q holds the eigenvectors of the covariance of X and the covariance of Y is diagonal. Thus we see that the eigenvectors of the covariance matrix of X are the principal components of the data matrix X. Not so obvious!
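As a sanity check, here is a short NumPy sketch (with random placeholder data) verifying step (8): rotating the covariance of X into the eigenvector basis Q leaves a (nearly) diagonal covariance for Y:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(2, 500))
X[1] += 0.7 * X[0]                         # correlate the rows
X -= X.mean(axis=1, keepdims=True)         # center the data

covar_X = X @ X.T
_, Q = np.linalg.eigh(covar_X)             # Q: orthonormal eigenvectors of covar X

covar_Y = Q.T @ covar_X @ Q                # step (8): Q^T * (covar X) * Q
print(np.round(covar_Y, 6))                # off-diagonal terms are ~0: all the covariance
                                           # now lives on the diagonal
```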
