  • Are the Folded F-test and Bartlett’s equivalent?

    I can think of two ways to solve this problem: (1) Look up what they actually are. (2) Simulate under many conditions and look closely at the results.

    Looking at a UCLA website at http://www.ats.ucla.edu/stat/sas/output/ttest.htm, the folded F test is defined by

    $$F' = \frac{\max(s_1^2, s_2^2)}{\min(s_1^2, s_2^2)}$$

    where $s_i^2$ refers to the sample variance for group $i$. For $k$ groups, the Bartlett test statistic (http://www.itl.nist.gov/div898/handbook/eda/section3/eda357.htm) is

    $$T = \frac{(N-k)\log s_p^2 - \sum_{i=1}^{k}(N_i-1)\log s_i^2}{1 + \frac{1}{3(k-1)}\left(\sum_{i=1}^{k}\frac{1}{N_i-1} - \frac{1}{N-k}\right)}$$

    where $N_i$ is the size of group $i$, $N = \sum_i N_i$, and $s_p^2 = \sum_i (N_i-1)s_i^2/(N-k)$ is the pooled variance.

    SAS Programming, November 20, 2014
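    To see how different the two statistics are on the same data, here is a minimal Python sketch (not SAS) that evaluates both formulas. The two small samples are made up for illustration, the function names are hypothetical, and the pooled variance is assumed to be the weighted average $\sum_i (N_i-1)s_i^2/(N-k)$ from the NIST page.

```python
import math
from statistics import variance


def folded_f(x, y):
    """Folded F statistic: larger sample variance over the smaller one."""
    v1, v2 = variance(x), variance(y)
    return max(v1, v2) / min(v1, v2)


def bartlett_t(groups):
    """Bartlett's statistic T for a list of samples (NIST formula)."""
    k = len(groups)
    sizes = [len(g) for g in groups]
    n_total = sum(sizes)
    variances = [variance(g) for g in groups]
    # Pooled variance: group variances weighted by their degrees of freedom.
    pooled = sum((n - 1) * v for n, v in zip(sizes, variances)) / (n_total - k)
    numerator = (n_total - k) * math.log(pooled) - sum(
        (n - 1) * math.log(v) for n, v in zip(sizes, variances)
    )
    correction = 1 + (
        sum(1 / (n - 1) for n in sizes) - 1 / (n_total - k)
    ) / (3 * (k - 1))
    return numerator / correction


# Made-up samples: 3 observations in one group, 4 in the other.
group1 = [1.0, 2.0, 3.0]
group2 = [2.0, 4.0, 6.0, 8.0]

f_stat = folded_f(group1, group2)
t_stat = bartlett_t([group1, group2])
print(f_stat, t_stat)
```

    The two numbers are on different scales, which is why the comparison has to be made through p-values rather than through the statistics themselves.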

  • Are the Folded F-test and Bartlett’s equivalent?

    As test statistics they look quite different, but it’s more difficult to tell whether they will generate different p-values, because they are referred to different distributions. Bartlett’s test is compared to a χ² distribution with k − 1 degrees of freedom. The Folded F test is based on an F distribution. It is called folded because the larger variance is always placed in the numerator, thus making the F statistic constrained to be greater than or equal to 1.

    Why are the p-values so close, you ask?

  • Are the Folded F-test and Bartlett’s equivalent?

    The F and χ² distributions are related: the F distribution has numerator and denominator degrees of freedom, m and n, and

    $$m\,F_{m,n} \xrightarrow{d} \chi^2_m \quad \text{as } n \to \infty.$$

    The factor m is needed because an F statistic divides the numerator chi-square by its degrees of freedom. The arrow means “convergence in distribution”, something slightly different from the usual limits you study in Calculus, but something you’ll study in Casella and Berger. Basically this means that $P(F_{m,n} > x) \approx P(\chi^2_m > mx)$ for large n, which also means that they generate very similar p-values, but they aren’t equivalent for finite n, even if they are close.

    If you are graphing, they might appear to be identical due to limited resolution, so it is good to look at the numbers themselves to see how close they are.
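    Looking at the numbers is easy in the special case of 2 numerator degrees of freedom, where both tail probabilities have closed forms: $P(F_{2,n} > x) = (1 + 2x/n)^{-n/2}$ and $P(\chi^2_2 > y) = e^{-y/2}$. Because an F statistic divides its numerator chi-square by its degrees of freedom, the comparison is between $P(F_{m,n} > x)$ and $P(\chi^2_m > mx)$. The Python sketch below (not SAS; the function names and the choice x = 3 are just for illustration) prints how the gap shrinks as n grows without ever reaching zero.

```python
import math


def f_tail_2_n(x, n):
    """P(F_{2,n} > x); this closed form holds only for 2 numerator df."""
    return (1 + 2 * x / n) ** (-n / 2)


def chi2_tail_2(y):
    """P(chi-square with 2 df > y)."""
    return math.exp(-y / 2)


x = 3.0
m = 2  # numerator degrees of freedom

# Absolute difference between the two tail probabilities for growing n.
gaps = {
    n: abs(f_tail_2_n(x, n) - chi2_tail_2(m * x))
    for n in (5, 50, 500, 5000)
}
for n, gap in gaps.items():
    print(n, f_tail_2_n(x, n), chi2_tail_2(m * x), gap)
```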

  • Are the Folded F-test and Bartlett’s equivalent?

    How to investigate this with simulation? We might guess that as the sample size increases, the tests are more likely to be similar, so we’ll try with a small sample size. Unequal sample sizes might also have an effect, so I’ll try three observations in one group and 10 in the other.

    Both definitions look like they easily generalize to more than two groups; for example, you could use

    $$\frac{\max(s_1^2, \ldots, s_k^2)}{\min(s_1^2, \ldots, s_k^2)}$$

    as a test statistic. If all variances are the same, this should still be close to 1. The distribution is now not clear, so you could either use theory to derive the distribution (might be hard!) or simulate the distribution.
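    Simulating the distribution of the max/min variance ratio takes only a few lines. Here is a minimal Python sketch (not SAS), assuming normal data with equal variances, three groups of five observations, and 2000 replicates; the group sizes, the seed, and the function names are arbitrary choices for illustration.

```python
import random
from statistics import variance


def variance_ratio(groups):
    """Largest sample variance divided by the smallest."""
    v = [variance(g) for g in groups]
    return max(v) / min(v)


def simulate_null(sizes, reps=2000, seed=1):
    """Sorted simulated values of the ratio when all groups are N(0, 1)."""
    rng = random.Random(seed)
    stats = []
    for _rep in range(reps):
        groups = [[rng.gauss(0, 1) for _obs in range(n)] for n in sizes]
        stats.append(variance_ratio(groups))
    return sorted(stats)


null_stats = simulate_null([5, 5, 5])
# Approximate 95th percentile: reject equal variances above this value.
critical_value = null_stats[int(0.95 * len(null_stats))]
print(critical_value)
```

    An observed ratio can then be compared with `critical_value`, or turned into a simulated p-value by counting how many entries of `null_stats` exceed it.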

  • Are the Folded F-test and Bartlett’s equivalent?

    Let’s simulate to see how different the p-values are for the Folded-F versus Bartlett’s.

  • Are the Folded F-test and Bartlett’s equivalent?

    So the type 1 error rate was different between the two tests. Note that although the Folded F was slightly elevated above 0.05, this was not significant since .056 − 1.96 ∗ sqrt(.05 ∗ .95/1000) = .042, which is still below 0.05.

    I also tried equal sample sizes of 10 per group, and there the p-values disagree only in the 4th decimal place. In 1000 iterations, it wouldn’t be surprising if they always agreed on whether or not to reject the null. Looking at the actual p-values instead of the power gives you more resolution for what is going on.
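    The arithmetic behind the “not significant” remark can be checked directly. A small Python sketch (not SAS), using the 0.056 rejection rate and the 1000 iterations quoted above:

```python
import math

alpha = 0.05        # nominal type 1 error rate
iterations = 1000   # number of simulated data sets
observed = 0.056    # observed rejection rate for the Folded F test

# Standard error of a proportion when the true rate is alpha.
standard_error = math.sqrt(alpha * (1 - alpha) / iterations)
lower = observed - 1.96 * standard_error
upper = observed + 1.96 * standard_error
print(round(lower, 3), round(upper, 3))
```

    Since the nominal rate 0.05 lies between `lower` and `upper`, the observed rate is consistent with a correctly sized test.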

  • Are the Folded F-test and Bartlett’s equivalent?

    If I use 3 observations in each group, then the tests appear to be closer, but still not identical.

  • HW hints: How to plot two identical curves on top of each other?

    Metallica recurrence data.

  • HW hints

    Rolling Stones recurrence data.

  • HW hints

    Megadeth recurrence data (much easier colors to read...)

  • PROC IML! PROC IML! PROC IML!

    SAS/IML is the Interactive Matrix Language. It is probably the last topic of programming to cover in the course. After this, we will look at other topics using more statistical procedures from SAS.

    SAS/IML, like macros, can be used To Boldly Go Where No SAS Proc Has Gone Before...that is, it is partly used to give SAS programmers more flexibility and to implement new techniques that haven’t found their way into SAS procedures yet.

  • PROC IML! PROC IML! PROC IML!

    Another reason to learn SAS/IML and to program with matrices is to improve your knowledge of statistics. This can be done with R or MATLAB just as easily as (or perhaps more easily than) with SAS. It is often too difficult to work out multiple regression estimates by hand, for instance, but instead of just relying on a canned procedure to get the answers for you, you can implement the matrices involved yourself and see if you can duplicate the results from programs like SAS or R or STATA. This can give you better appreciation for the methods involved and deeper knowledge of what’s going on under the hood of these programs.

  • PROC IML basics

    We’ll start by just doing some of the basics of how to use PROC IML. Despite the “interactive” title, it works pretty much like other SAS procedures, with typing in the Code editor and viewing the results in the Results viewer.

    The syntax for entering vectors and matrices is more similar to MATLAB than R. Here are some basic examples that show assigning values, vectors, and matrices and displaying their properties.

    Something that might be annoying is that you have to tell it proc iml; reset log print; each time you run it. If you just highlight your latest bit of code and run it, it will not know that you are using IML or know the previously defined variables.

  • PROC IML basics

    You can perform operations on matrices such as concatenation, either horizontally (A || B) or vertically (A // B).

  • PROC IML basics

    The transpose of a vector or matrix is handled by the left single-quote (the backquote, `), which looks a little weird. Some references say you can use the transpose function t(), which is how R does it, but it has generated errors for me.

  • PROC IML basics

    You can create arrays of numbers similarly to R, and even arrays of strings that have consecutive suffixes. Try the following in the Code window and then look at the Results viewer.

    proc iml;
    reset log print;
    index = 1:10;
    rows = (1:3)`;            /* column vector 1, 2, 3 */
    rindex = 10:1;            /* counts down from 10 to 1 */
    series = do(0,100,20);    /* 0, 20, 40, ..., 100 */
    strings = "x1":"x8";      /* a vector of the strings "x1", "x2", ..., "x8" */

  • PROC IML basics

    Try these also to set up special matrices used often in statistics.

    proc iml;
    reset log print;
    b = j(6,1); *6x1 matrix of ones;
    c = j(2,3); *2x3 matrix of ones;
    a = I(6); *6x6 identity matrix;
    d = diag( {1 2 4} ); * diagonal matrix;
    b = 1:3;
    d2 = diag(b);
    d3 = diag(1:5);

  • PROC IML basics

    Some other functions and operators that can be useful:

    B = A##2;          /* squares A elementwise */
    C = A <> B;        /* element-wise maximum */
    C = A < B;         /* element-wise 0-1 test */
    B = sqrt(A);       /* element-wise square root */
    D = block(A,B,C);  /* creates a block diagonal matrix using A, B, and C as matrices */
    c = NCOL(A);       /* number of columns of A */
    d = NROW(A);       /* number of rows of A */
    E = exp(A);        /* element-wise exponentiation */
    F = log(A);        /* element-wise log */
    d = det(A);        /* determinant */
    B = inv(A);        /* inverse of A */
    v = vecdiag(A);    /* v is a vector with the diagonal entries of A */

  • PROC IML basics

    You can extract individual elements or submatrices from a matrix:

    d1_12 = d1[1,2];      /* extract first row, second column */
    d1_ = d1[1,];         /* extract first row */
    d_2 = d1[,2];         /* extract second column */
    d = d1[1:2,{1 3}];    /* extract a 2x2 submatrix */
    mycols = {1 3};
    d = d1[1:2,mycols];   /* alternative gives the same results */

    Remember that to run this code you also need to run proc iml; reset log print; each time.

  • PROC IML basics

    To get row sums, column sums, and column sums of squares, there is special notation:

    csum = d1[+,];     /* column sums: a row vector */
    css = d1[##,];     /* column sums of squares: a row vector */
    rsum = d1[,+];     /* row sums: a column vector */
    rsum2 = d1[,+]`;   /* row sums transposed into a row vector */
    a = d1[+,+];       /* sum of all elements in the matrix */
    b = sum(d1);       /* a and b are the same but b is more efficient */

  • PROC IML: searching matrices

    Often we want an index of where certain values occur. For example, we might want to know which values in a vector or a matrix are less than 0, or where the maximum value occurs. This can be done using the LOC function. The LOC function is easier to understand with vectors. Here it returns a vector of indices of the vector satisfying the condition.

    For example, with d_2 = {-4 0 -4}, trying

    m = max(d_2);
    negvalue = loc(d_2 < 0);

  • PROC IML: searching matrices

    If you apply the LOC function to an entire r × c matrix, it counts the cells from 1 to r × c and gives a single number indexing the cells rather than the row and column number, and counts cells row-wise. Thus, for a 2 × 3 matrix like d1, d1[5] extracts element (2,2).
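    Converting such a single cell number back to a row and column is simple arithmetic. A minimal Python sketch (not SAS; the helper name is hypothetical) of the row-wise counting described above, with 1-based indices as in IML:

```python
def linear_to_row_col(index, n_cols):
    """Convert a 1-based row-wise cell number to a 1-based (row, column)."""
    row, col = divmod(index - 1, n_cols)
    return row + 1, col + 1


# Cell 5 of a 2 x 3 matrix, counting across rows.
print(linear_to_row_col(5, 3))
```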

  • PROC IML: missing values

    Matrices can be defined with missing values, which will sometimes appropriately create missing values when operated on, and other times missing values are ignored.

  • PROC IML: formatting and labels

    You can format values to have dollar signs, commas or any other format you like and can also label rows and columns.

  • PROC IML: Reading data to and from SAS data sets

    Typically, you’ll want to read in data from a SAS data set instead of starting from scratch and entering data into your matrix. You also might want to manipulate data using PROC IML then output it again to a SAS data set to use usual SAS PROCs.

  • PROC IML: Creating a SAS data set

    Note that I’ve changed the variable name from temperature to temp and also changed the 5th observation from age 73 to age 67. I needed the three statements, CREATE, APPEND, and CLOSE to create the data set. If you don’t CLOSE, then the data set is created but is still empty.

    Instead of creating a new data set, you can instead edit an existing data set using the EDIT statement. However, you can only input one data set at a time into PROC IML.

  • PROC IML: more on reading in data

    You can choose to read in less than all of the data in a data set. The syntax for the READ statement is

    READ <range> <VAR operand> <WHERE(expression)> <INTO name>;

    where items within angled brackets are options. Typically for the range you put all if you want to read in all of the data. You can also read in the next n observations using NEXT n, or a list of specific observations using point {3,10,11} (for example, to read in the 3rd, 10th, and 11th observations only).

  • PROC IML: more on reading in data

    VAR allows you to specify a list of which variables to read in if you don’t want to read them all in. The default is to only read in numeric variables, however you can use CHARACTER to specify that you only want to read in character variables.

    The WHERE option works similarly to WHERE statements in data steps and PROCs. A common example might be to only read in observations without missing data, e.g. where(var1 ^= .).

    An alternative to using the READ statement to control what is read in is to create a new data set using data steps to first create the subset of the data you want to manipulate in PROC IML, although this could be less efficient than going directly through PROC IML for large data sets.

  • PROC IML: logical expressions, loops

    Logical expressions using IF THEN, ELSE, and loops using DO follow the same syntax as in data steps. They allow you to loop through a matrix in a way that can be a bit easier than looping through a data set in a data step. If your loop is indexing observation (row) i, then you can examine row i − 1 without having to use lag functions and retain statements. You also have access to observation i + 1.

  • PROC IML: simple linear regression

    As an exercise, let’s use PROC IML to find the regression coefficients when we do a simple linear regression of temperature on age. For now, we’ll ignore the sex of the individuals. We want to use the model

    $$Y = X\beta + e, \quad \text{or} \quad y_i = \beta_0 + \beta_1 x_i + e_i, \quad i = 1, \ldots, 130$$

    First, we’ll read in the data, separate the data into the observations Y and the predictor, and create the design matrix. The design matrix will have a column of 1s and a column for the predictor. Note that X is 130 × 2 and the vector of coefficients in the regression model is $\beta = (\beta_0\ \beta_1)'$, which is a 2 × 1 matrix (or column vector). If we set $Y = X\beta + e$, where we use these matrix expressions, then the ith row of Y is $y_i = x_{i1}\beta_0 + x_{i2}\beta_1 + e_i$.

  • PROC IML: simple linear regression

    The idea is to solve for $\hat\beta$ using the equation $Y = X\hat\beta$. To do this we first multiply both sides by the transpose of X, so

    $$X'Y = X'X\hat\beta$$

    $$\Rightarrow (X'X)^{-1}X'Y = (X'X)^{-1}X'X\hat\beta$$

    $$\Rightarrow \hat\beta = (X'X)^{-1}X'Y$$

    Now we can use PROC IML to do these matrix calculations and see if they result in $\hat\beta$ that matches the output of SAS procedures.
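    The same matrix calculation can be written out by hand for the two-parameter case, which makes a useful check on any matrix language. Below is a minimal Python sketch (not SAS) that forms X′X and X′Y for a design matrix with a column of 1s and one predictor, inverts the 2 × 2 matrix explicitly, and solves for the coefficients. The four data points are made up (they are not the temperature data) and the function name is hypothetical.

```python
def simple_regression(x, y):
    """Return X'X and beta-hat = (X'X)^{-1} X'Y for the model y = b0 + b1*x."""
    n = len(x)
    sum_x = sum(x)
    sum_xx = sum(v * v for v in x)
    sum_y = sum(y)
    sum_xy = sum(a * b for a, b in zip(x, y))

    xtx = [[n, sum_x], [sum_x, sum_xx]]   # X'X for columns (1, x)
    xty = [sum_y, sum_xy]                 # X'Y

    # Inverse of a 2 x 2 matrix via its determinant.
    det = n * sum_xx - sum_x * sum_x
    inv = [[sum_xx / det, -sum_x / det], [-sum_x / det, n / det]]

    beta0 = inv[0][0] * xty[0] + inv[0][1] * xty[1]
    beta1 = inv[1][0] * xty[0] + inv[1][1] * xty[1]
    return xtx, (beta0, beta1)


# Made-up data lying exactly on the line y = 1 + 2x.
age = [1.0, 2.0, 3.0, 4.0]
temperature = [3.0, 5.0, 7.0, 9.0]

xtx, beta = simple_regression(age, temperature)
print(xtx, beta)
```

    With real data the explicit inverse is fine for two parameters, but a linear solver is usually preferred once there are more predictors.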

  • PROC IML: simple linear regression

    output from PROC IML and PROC REG

  • PROC IML: simple linear regression

    How do you interpret the matrix X′X? Think about how to multiply this. The matrix is p × p where p is the number of parameters (β terms), so it is 2 × 2 in this case. The (1,1) entry is

    $$1 \cdot 1 + 1 \cdot 1 + \cdots + 1 \cdot 1$$

    with 130 terms, so it is 130. You can also think of this as the sample size. The (1,2) entry is

    $$1 \cdot x_1 + 1 \cdot x_2 + \cdots + 1 \cdot x_{130} = \sum_{i=1}^{130} x_i$$

    The (2,1) entry is

    $$x_1 \cdot 1 + x_2 \cdot 1 + \cdots + x_{130} \cdot 1 = \sum_{i=1}^{130} x_i$$

    The (2,2) entry is

    $$x_1 \cdot x_1 + x_2 \cdot x_2 + \cdots + x_{130} \cdot x_{130} = \sum_{i=1}^{130} x_i^2$$

  • PROC REG

    You can also output the X ′X matrix from PROC REG using

    model temperature = age / xpx;

    which is an abbreviation for “x prime x”. Of course, PROC REG outputs much more than this, and you can use PROC IML to see if you can reproduce residuals, fitted values, F statistics, and so on.

  • Getting other quantities for regression

    Other quantities of interest for a linear model include the hat matrix, $H = X(X'X)^{-1}X'$, the predicted values, $\hat{Y} = X\hat\beta = HY$, $X'X$, etc., all of which can easily be obtained from PROC IML. Some of these can also be obtained from PROC REG. If a new diagnostic test is developed that is not implemented in PROC REG or PROC GLM, then it would be beneficial to have access to these matrices directly from PROC IML.

  • Comparing shapes

    An example where you might need something like PROC IML is if you have to rotate your data, which can be accomplished through matrix multiplication of your original data. This comes up in statistics when you want to compare two sets of points (usually either in 2D or 3D, but could be higher-dimensional), and the points are not oriented the same way or scaled the same way.

    This can come up particularly in shape analysis, where you want to determine whether two shapes are roughly equivalent, or you want to compare two photographs taken from slightly different positions. In addition to statistical testing, sometimes you just want to best visualize the difference between two sets of points, and this is best accomplished by lining up the points as nearly as possible.

  • Comparing shapes

    As an example, consider two photographs of hands.

  • Comparing shapes

    We have reference points on the hands, and we want to line up the reference points as closely as possible by rotating the images, rescaling if necessary (suppose you have photos that are cropped or zoomed in and still want to compare the shapes). In general we might also allow mirror images. In this case, we assume the points are two-dimensional so that they each have just an x and y coordinate. This would typically be the case for analyzing photographic images, although in general you can imagine having three-dimensional data as well.

  • Comparing shapes

    Applications for this are widespread. In medical imaging, you might take an x-ray of a patient over time to compare how their spine is changing with osteoporosis. The x-rays won’t be taken at identical distances, angles and so forth, so you need to align the images by stretching and rotating.

    Other examples will include MRI scans of the brain, where you might want to either compare the same individual at different time points, the left versus the right hemisphere to look for asymmetries, or two separate individuals to see how closely aligned two brains are. Here we want to ignore the fact that one brain might be slightly larger than the other.

    If eyes or fingerprints are used for ID, again it will be easiest to compare two images by rotation and rescaling.

  • Comparing shapes

    If you have satellite images of regions on earth, you might want to measure things like habitat loss. Successive photos of the same region won’t be exactly the same, so you might try to align two photos using certain geographical reference points. Once the photos are aligned, you can use differences in the area that is green as a measure of vegetation loss, for example.

    “The name Procrustes refers to a bandit from Greek mythology who made his victims fit his bed either by stretching their limbs or cutting them off.” (Wikipedia)

  • Procrustes illustrations

    You can find some interesting illustrations online....

  • Comparing shapes

    Back to the hand example. Here we have two sets of coordinates. We might call them

    $$X = \{(x_{11}, x_{12}), (x_{21}, x_{22}), \ldots, (x_{n1}, x_{n2})\}$$

    and

    $$Y = \{(y_{11}, y_{12}), \ldots, (y_{n1}, y_{n2})\}$$

    How should we align the points? If we use the distances between corresponding points, we can minimize the distance between points over all possible angles of rotation, rescalings, and reflections. To deal with reflections, it might help to center the points so that $(\bar{x}_1, \bar{x}_2) = (0, 0)$ and $(\bar{y}_1, \bar{y}_2) = (0, 0)$. For many problems (like with satellite photographs or the same patient over time), reflections won’t matter.

  • Comparing shapes

    The distance between two individual points $x_i = (x_{i1}, x_{i2})$ and $y_i = (y_{i1}, y_{i2})$ is naturally defined as

    $$d(x_i, y_i) = \sqrt{(x_{i1} - y_{i1})^2 + (x_{i2} - y_{i2})^2}$$

    This is the Euclidean distance between two points in the plane. We might define the overall squared distance from the set of points X to the set of points Y as

    $$d^2(X, Y) = \sum_{i=1}^{n} d^2(x_i, y_i)$$
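    As a concrete check of the definition, here is a minimal Python sketch (not SAS) of the overall squared distance between two sets of corresponding points. The two-point example and the function name are made up for illustration.

```python
def squared_distance(points_x, points_y):
    """Sum of squared Euclidean distances between corresponding 2D points."""
    return sum(
        (x1 - y1) ** 2 + (x2 - y2) ** 2
        for (x1, x2), (y1, y2) in zip(points_x, points_y)
    )


X = [(0, 0), (1, 0)]
Y = [(0, 1), (1, 2)]
total = squared_distance(X, Y)
print(total)
```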

  • Comparing shapes

    We can then minimize the sum of squared distances between points (this is equivalent to minimizing the distance $d(X, Y)$ itself. Why?). This is fairly similar as a criterion to what we do in regression, so hopefully doesn’t seem too weird. In other words, we want to minimize the sum over i of

    $$d^2(x_i, y'_i) = (x_{i1} - y'_{i1})^2 + (x_{i2} - y'_{i2})^2$$

    over all choices of θ and c, where $y'_i$ is the rotated and rescaled version of $y_i$.

    To do this, we need to write $y'_i$ as a function of $y_i$, θ, and a scaling factor c.

  • Example

    [Figure: plot with both axes running from −3 to 3.]

  • Example

    Lines connecting corresponding points.

    [Figure: plot with both axes running from −3 to 3.]

  • Example

    Rotating π/8 radians = 22.5 degrees clockwise, we get...

    [Figure: plot with both axes running from −3 to 3.]

  • Example

    How the squares were shifted by π/8 radians = 22.5 degrees

    [Figure: plot with both axes running from −3 to 3.]

  • Example

    Shifting by another π/8 radians (45 degrees total), we get...

    [Figure: plot with both axes running from −3 to 3.]

  • Rotating your data

    To rotate 2-dimensional data, you can use a rotation matrix. If a set of points is in an n × 2 matrix, then we need a 2 × 2 matrix to multiply this matrix. The rotation matrix (from linear algebra) is

    $$R = \begin{bmatrix} \cos\theta & -\sin\theta \\ \sin\theta & \cos\theta \end{bmatrix}$$
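    Applying the rotation matrix to each point written as a column vector is a one-liner. A minimal Python sketch (not SAS; the square and the function name are made up for illustration). With this convention a positive θ rotates counterclockwise, so a clockwise rotation uses a negative angle.

```python
import math


def rotate(points, theta):
    """Rotate 2D points counterclockwise by theta radians about the origin."""
    cos_t, sin_t = math.cos(theta), math.sin(theta)
    return [(cos_t * x - sin_t * y, sin_t * x + cos_t * y) for x, y in points]


square = [(1, 1), (-1, 1), (-1, -1), (1, -1)]
rotated = rotate(square, math.pi / 2)       # 90 degrees counterclockwise
clockwise = rotate(square, -math.pi / 8)    # 22.5 degrees clockwise
print(rotated)
```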

  • Minimizing squared distances

    Let X be the matrix for the first data set and Y the matrix for the second data set, each n × 2. To simplify the problem, we’ll consider only doing optimal rotations without worrying about scaling. We can think of rotating the observations Y to match X using a rotation matrix R. Since R multiplies column vectors, we work with the transposes X′ and Y′ (2 × n, one point per column) and want

    $$X' = RY'$$

    This usually can’t be solved exactly, so we want to find R for which

    $$X' - RY' \approx 0$$

    To minimize the sum of squared distances, we minimize

    $$\mathrm{tr}[(X' - RY')'(X' - RY')]$$

    where tr is the trace, or sum of the diagonals. Expanding, the terms tr(XX′) and tr(YR′RY′) = tr(YY′) do not depend on R, so after some matrix algebra, this is equivalent to maximizing

    $$\mathrm{tr}(RY'X)$$

  • Minimizing squared distances

    A technique for solving this problem involves the singular value decomposition, which again comes from linear algebra and which we won’t review, but can be used in SAS, Matlab, and other matrix-oriented languages. Briefly, the matrix Y′X can be decomposed into UDV′, where D is diagonal and U and V are orthogonal. The solution is

    $$R = VU'$$

    which minimizes the sum of the squared distances. If you have software which can do the singular value decomposition, then you can use it to get the optimal rotation.

  • Using PROC IML for Procrustes Rotation

    In this case, the rotation matrix has values near 1 and −1 for sin θ and −sin θ, which are the (2,1) and (1,2) entries, respectively, suggesting that the optimal rotation is near 90 degrees or π/2 radians. This means 90 degrees counterclockwise, and if you look at the photos, moving the right-hand photo 90 degrees counterclockwise will indeed line up the wrists (from top left to bottom left) and the rest of the hand. Since these photos look identical, the slight discrepancy might be due to truncating measurements in the positions of the points.

  • Using PROC IML for Procrustes Rotation

    Most of the work of the Procrustes rotation in terms of code was for centering the data. Once the data was centered, there were only three lines of code needed. It is possible to do the optimal rotation without using matrices, but this would involve more work in terms of coding.

    For the centering itself, this could have been done outside of PROC IML, and PROC IML could have read in a SAS dataset with X and Y already centered. If you had the original data in a SAS data set, how would you center the data using data step programming?

  • Centering the data

    Often if something is very tedious, there might be a procedure to help you out. Googling “z-score SAS” quickly reveals PROC STANDARD, which can center your data. If you have a data set called temperature, you can use something like

    proc standard data=temperature mean=0 std=1 out=ztemp;
    var degrees;
    run;

    Note that std=1 also rescales the variable to have standard deviation 1; leave it out if you only want to center.
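    For comparison, centering a set of 2D points by hand takes only a few lines. A minimal Python sketch (not SAS), with made-up points and a hypothetical helper name; it subtracts the mean of each coordinate and does no rescaling.

```python
def center(points):
    """Shift 2D points so that the mean of each coordinate is zero."""
    n = len(points)
    mean1 = sum(p[0] for p in points) / n
    mean2 = sum(p[1] for p in points) / n
    return [(p[0] - mean1, p[1] - mean2) for p in points]


centered = center([(1, 2), (3, 6)])
print(centered)
```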

  • Minimizing squared distances using calculus

    To minimize this distance, we can use calculus, but in this case, we need to minimize over rotations (angles) and rescalings. It helps to think of one data set as fixed, say X, and we rotate and rescale Y to match X as closely as possible. If we rotate the Y data by an angle θ and stretch their values by c, then

    $$\begin{bmatrix} \cos\theta & -\sin\theta \\ \sin\theta & \cos\theta \end{bmatrix} \begin{bmatrix} c\,y_{i1} \\ c\,y_{i2} \end{bmatrix} = c \begin{bmatrix} y_{i1}\cos\theta - y_{i2}\sin\theta \\ y_{i1}\sin\theta + y_{i2}\cos\theta \end{bmatrix}$$

    So we use $y'_{i1} = c(y_{i1}\cos\theta - y_{i2}\sin\theta)$ and $y'_{i2} = c(y_{i1}\sin\theta + y_{i2}\cos\theta)$ and plug these values into

    $$d^2(X, Y) = \sum_{i=1}^{n} \left[(x_{i1} - y'_{i1})^2 + (x_{i2} - y'_{i2})^2\right]$$

    Then the idea is to minimize with respect to θ and c. This can be done by taking partial derivatives with respect to θ and c and using the usual calculus techniques of optimization.

  • The calculus approach

    I’ll give the formulas in terms of summations for the optimal values for θ and c. This will be equivalent whether you use the calculus approach or the matrix approach. The optimal rotation angle is

    $$\theta = \arctan\frac{D}{B} + k\pi$$

    and

    $$c = \frac{B}{A}\cos\theta + \frac{D}{A}\sin\theta$$

    where k ∈ {0, 1} should be chosen to let c > 0. Here

    $$A = \sum_{i=1}^{n} \left(y_{i1}^2 + y_{i2}^2\right), \qquad B = \sum_{i=1}^{n} \left(x_{i1}y_{i1} + x_{i2}y_{i2}\right), \qquad D = \sum_{i=1}^{n} \left(x_{i2}y_{i1} - x_{i1}y_{i2}\right)$$
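    The summation formulas translate almost directly into code. Below is a minimal Python sketch (not SAS), assuming both point sets are already centered and taking $D = \sum_i (x_{i2}y_{i1} - x_{i1}y_{i2})$, which is the cross term that falls out of the rotation convention $y'_{i1} = c(y_{i1}\cos\theta - y_{i2}\sin\theta)$, $y'_{i2} = c(y_{i1}\sin\theta + y_{i2}\cos\theta)$. `math.atan2` stands in for arctan(D/B) + kπ because it selects the branch with c ≥ 0. The shapes and the function name are made up for illustration.

```python
import math


def procrustes_2d(points_x, points_y):
    """Angle theta and scale c that best rotate and stretch Y onto X."""
    pairs = list(zip(points_x, points_y))
    a = sum(y1 * y1 + y2 * y2 for _, (y1, y2) in pairs)
    b = sum(x1 * y1 + x2 * y2 for (x1, x2), (y1, y2) in pairs)
    d = sum(x2 * y1 - x1 * y2 for (x1, x2), (y1, y2) in pairs)
    theta = math.atan2(d, b)  # branch of arctan(D/B) + k*pi giving c >= 0
    c = (b * math.cos(theta) + d * math.sin(theta)) / a
    return theta, c


# Y is a centered square; X is Y rotated 90 degrees counterclockwise
# and stretched by a factor of 2.
Y = [(1, 1), (-1, 1), (-1, -1), (1, -1)]
X = [(-2, 2), (-2, -2), (2, -2), (2, 2)]

theta, c = procrustes_2d(X, Y)
print(theta, c)
```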

  • Using PROC IML for Procrustes Rotation

    Part of the advantage of the matrix approach to Procrustes rotations is that it generalizes more easily than the calculus approach. In particular, you can compare three-dimensional (or higher dimensional) shapes or patterns in the data as well as two-dimensional, which makes the rotations more complicated.

    We just illustrated rotations rather than scaling. Optimizing the scaling factor as well as rotations is sometimes called “extended Procrustes Analysis”. Generalized Procrustes also allows more than two data sets at a time to be used. For Generalized Procrustes, a “mean shape” is the set of points that minimizes the sum of the Procrustes distances to each of the input data sets.

  • Comparing data sets and Outlier detection

    In addition to rotating your data, Procrustes rotations give you a way to quantify how different two shapes are, for example using the minimum sum of squared distances. You might or might not want to rescale (stretch) the data depending on the application. Thus, given three data sets, you can look at all pairwise distances to determine which two datasets are most similar.

    The Procrustes rotation can also be used to look for outliers. Try removing one observation at a time and recomputing the squared distances each time. (This means you will do Procrustes rotations with n − 1 instead of n data points each time.) This gives you a measure of which observations contribute most to one dataset not being able to be rotated to match the other data set.
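    The leave-one-out idea can be sketched in a few lines for 2D data. This is a minimal Python sketch (not SAS) under several assumptions: rotation only (no rescaling or reflection), each pair of point sets re-centered after a point is removed, and the optimal angle taken from the closed form atan2(D, B) with $B = \sum_i (x_{i1}y_{i1} + x_{i2}y_{i2})$ and $D = \sum_i (x_{i2}y_{i1} - x_{i1}y_{i2})$. The shapes and function names are made up; X is Y rotated 90 degrees counterclockwise with the last point deliberately moved.

```python
import math


def center(points):
    """Shift 2D points so that the mean of each coordinate is zero."""
    n = len(points)
    m1 = sum(p[0] for p in points) / n
    m2 = sum(p[1] for p in points) / n
    return [(p[0] - m1, p[1] - m2) for p in points]


def rotation_distance(points_x, points_y):
    """Minimum sum of squared distances after centering and rotating Y."""
    pairs = list(zip(center(points_x), center(points_y)))
    b = sum(x1 * y1 + x2 * y2 for (x1, x2), (y1, y2) in pairs)
    d = sum(x2 * y1 - x1 * y2 for (x1, x2), (y1, y2) in pairs)
    theta = math.atan2(d, b)
    cos_t, sin_t = math.cos(theta), math.sin(theta)
    return sum(
        (x1 - (cos_t * y1 - sin_t * y2)) ** 2
        + (x2 - (sin_t * y1 + cos_t * y2)) ** 2
        for (x1, x2), (y1, y2) in pairs
    )


def leave_one_out(points_x, points_y):
    """Distance recomputed with each corresponding pair removed in turn."""
    return [
        rotation_distance(points_x[:i] + points_x[i + 1:],
                          points_y[:i] + points_y[i + 1:])
        for i in range(len(points_x))
    ]


Y = [(0, 0), (2, 0), (2, 1), (0, 1), (1, 3)]
X = [(0, 0), (0, 2), (-1, 2), (-1, 0), (-3, 2.5)]  # last point perturbed

distances = leave_one_out(X, Y)
print(distances)
```

    The observation whose removal gives the smallest remaining distance is the one that was doing the most to keep the two shapes from matching.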

  • Writing functions in PROC IML
