

Multivariate Statistics (MARS6300) - Homework 1 Name: ________KEY_________

Distributed: 01/23/2015 Due: Tuesday 02/06/2015

Instructions: Copy and paste your answers below and turn in a word file and two excel files before class via email to [email protected] using “MARS6300 hw#1” as the message subject. Label all files with your name (e.g., MARS6300_hw1_hyrenbach). You will be fined 0.5 points for failure to do so.

You are free to use any reference materials of your choice. While you are encouraged to work together, make sure you turn in your own assignment. This homework is worth 5 points.

Note: You can perform all of these calculations with Excel or with SPSS. If you use Excel, make sure you leave the formulas showing all of your calculations in the sheets, and explain your reasoning, to get partial credit.

The objectives of this homework are:

A) To review the following statistical principles:

- Calculation of full and partial correlation

- Calculation of partial regressions – similarity with partial correlations

- Calculation of critical values and p values checking for significance

B) To use partial correlations / regressions to test a causal model of large scale drivers of local upwelling dynamics in the California Current System.

C) To practice the documentation of the flow of data analysis.

D) To practice entering and evaluating data in PC-ORD.

To complete this homework, you will need:

- Instruction file: “MARS6300_hw1.doc” (open with Word) – turn in

- Correlations data file: “stocks.xls” (open with Excel) – turn in

- Causal model data file: “upwell.xls” (open with Excel) – turn in

Questions:

1) Use file “stocks.xls”, showing time series of three indices: NASDAQ, DOW and FTSE. The raw data are on sheet “raw”. Use the other sheets to make the required calculations and paste tables / figures of results into this word file. Rename the excel file with your results and turn in (e.g., stocks_hyrenbach.xls)

Plot of the three time series, to allow you to visualize their degree of co-variation before we make the calculations.


(A) Use the sheet “correlation_calculations” to calculate the correlation coefficient for all three pair-wise combinations of two variables using the formula of the Pearson correlation coefficient discussed in class (Hint: -1 ≤ r ≤ +1). I want you to do it using the spreadsheet, and check it against the “correl” calculation from Excel.

Copy and paste the results into the completed table 1, below:

Table 1. Pairwise correlation coefficients (r) above the diagonal; significance (p value) below the diagonal.

        NASDAQ     DOW        FTSE
NASDAQ  1          0.924      0.910
DOW     p < 0.001  1          0.921
FTSE    p < 0.001  p < 0.001  1
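For anyone who wants to check the spreadsheet outside Excel, here is a minimal sketch of the deviation-from-the-mean form of the Pearson formula. The function name and the toy series are hypothetical (they are not the values in “stocks.xls”); the result should match what “correl” returns for the same two columns.

```python
import math

def pearson_r(x, y):
    """Pearson correlation computed from deviations about the two means."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    dx = [xi - mean_x for xi in x]
    dy = [yi - mean_y for yi in y]
    sum_xy = sum(a * b for a, b in zip(dx, dy))   # sum of cross-products
    sum_xx = sum(a * a for a in dx)               # sum of squares for x
    sum_yy = sum(b * b for b in dy)               # sum of squares for y
    return sum_xy / math.sqrt(sum_xx * sum_yy)

# Hypothetical toy series, only to show the call
series_a = [1.0, 2.0, 3.0, 4.0, 5.0]
series_b = [2.1, 3.9, 6.2, 8.1, 9.8]
print(round(pearson_r(series_a, series_b), 4))
```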

(B) Calculate the p values for these correlations using the table provided in the sheet “correlation_results” and insert results in the table above. Answer these questions:

How many degrees of freedom are there?

The degrees of freedom are N-2 because the correlation coefficient is based on the deviations from the means for the two time series (x and y). Thus, we remove 1 degree of freedom for each variable. Since N = 752, then df = 750.

What is the critical r value for alpha = 0.05?

Because df = 750 does not appear in the probability table provided for the assignment, you need to interpolate the critical values from the tabulated degrees of freedom immediately larger (1000) and smaller (500) than the target value (750). Use a linear equation to work out the values. I show the slopes below:

                 Two-Tailed Probabilities
                 0.1       0.05      0.01      0.001
change_critical  0.022     0.026     0.034     0.043
change_df        500       500       500       500
Slope            0.000044  0.000052  0.000068  0.000086
Intercept        0.052     0.062     0.081     0.104
interpolation    0.063     0.075     0.098     0.1255
mid-point        0.063     0.075     0.098     0.1255

Notice that because 750 is the midpoint between 1000 and 500, the calculated values are the midpoints (averages) of the larger and smaller values.
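The interpolation can be sketched in a few lines. The function name is hypothetical; the critical values are the df = 500 and df = 1000 rows of the table provided for the assignment. Here the slope is measured from df = 500 upward, so it comes out negative (the table above lists its magnitude).

```python
def interpolate_critical_r(df, df_low, r_low, df_high, r_high):
    """Linear interpolation of a critical r between two tabulated df values."""
    slope = (r_high - r_low) / (df_high - df_low)
    return r_low + slope * (df - df_low)

# two-tailed alpha -> (critical r at df = 500, critical r at df = 1000)
CRITICAL_R = {
    0.1: (0.074, 0.052),
    0.05: (0.088, 0.062),
    0.01: (0.115, 0.081),
    0.001: (0.147, 0.104),
}

for alpha, (r_500, r_1000) in CRITICAL_R.items():
    value = interpolate_critical_r(750, 500, r_500, 1000, r_1000)
    print(alpha, round(value, 4))
```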

Table of critical values for the Pearson correlation (two-tailed probabilities)

df    0.1    0.05   0.01   0.001
100   0.165  0.197  0.256  0.324
200   0.117  0.139  0.182  0.231
300   0.095  0.113  0.149  0.189
500   0.074  0.088  0.115  0.147
750   0.063  0.075  0.098  0.1255
1000  0.052  0.062  0.081  0.104

Plot showing the contours of the correlation coefficient, as a function of the sample size and the associated p value.

What is the result: are these correlations statistically significant? (Please explain Why / Why not ?)

The critical values (for all the p values provided) are smaller than the calculated correlation coefficients. Thus, the p values are smaller than 0.001. This means that correlation coefficients this extreme would be obtained by chance, if the null hypothesis were true, fewer than 1 in 1,000 times. Because this rate is well below the standard 1/20 ratio used to assess significance, these correlations are “too rare” to attribute to chance. Thus, we reject the null hypotheses and accept the alternate hypotheses: there are correlations between the various pair-wise combinations of variables. In other words, the r values ARE different from 0.

(C) Calculate the partial pair-wise correlations using the sheet “partials”. Hint: I worked out one example for you. Fill-in and paste Table 2, below.

Table 2. Partialed (first order) and unpartialed (zero order) correlations.

                        Partialed (first order)       Unpartialed (zero order)
v1      v2    v3        r(12.3)  covariance (r^2)     r(12)  covariance (r^2)
NASDAQ  DOW   FTSE      0.534    0.285                0.924  0.854
NASDAQ  FTSE  DOW       0.396    0.157                0.910  0.828
FTSE    DOW   NASDAQ    0.503    0.253                0.921  0.848

- Which two variables are most strongly correlated, once you “remove” the effect of the third variable?

Interpret what influence the “control” (puppet-master) variable had on the other two variables, given the magnitude of the zero and the first order correlations.

To answer this question, you need to compare the full (order-zero) and the partial (order-one) correlations of each pairwise combination of variables.

The two variables most strongly correlated, after “removing” the influence of the third variable are: NASDAQ & DOW. Their partial correlation value (0.534) is the largest, and suggests that these two time series share 28.5% of their variance (co-variance).

Note how the order-zero correlation coefficient decreased from 0.924 to 0.534 after the third variable (FTSE) had been factored out. This suggests that FTSE has a large influence on the correlation of NASDAQ and DOW because it is strongly correlated with these two variables (as the respective order-zero correlations indicate).

- Which two variables are most weakly correlated, once you “remove” the effect of the third variable?

Interpret what influence the “control” (puppet-master) variable had on the other two variables, given the magnitude of the zero and the first order correlations.

The two variables that are most weakly correlated are NASDAQ & FTSE, once the influence of the third variable (DOW) has been removed from the correlation. NASDAQ and FTSE have a partial correlation value of 0.396, the lowest of the three pair-wise correlations. Therefore, NASDAQ and FTSE have the smallest shared variance (covariance) of 15.7%. Without removing the influence of the DOW variable, the correlation coefficient for NASDAQ and FTSE is 0.910.
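As a cross-check on the partial correlations, here is a minimal sketch that assumes the standard first-order formula, r(12.3) = (r12 - r13 r23) / sqrt((1 - r13^2)(1 - r23^2)). The function name is hypothetical, and I am assuming (not asserting) that the worked example in the sheet “partials” follows the same formula. The inputs are the rounded zero-order correlations from Table 1, so the results can differ from the spreadsheet values in the third decimal.

```python
import math

def partial_r(r12, r13, r23):
    """First-order partial correlation of variables 1 and 2, controlling for 3."""
    return (r12 - r13 * r23) / math.sqrt((1 - r13 ** 2) * (1 - r23 ** 2))

# Rounded zero-order correlations reported in Table 1
r_nasdaq_dow = 0.924
r_nasdaq_ftse = 0.910
r_dow_ftse = 0.921

# NASDAQ and DOW, controlling for FTSE
print(round(partial_r(r_nasdaq_dow, r_nasdaq_ftse, r_dow_ftse), 3))
# NASDAQ and FTSE, controlling for DOW
print(round(partial_r(r_nasdaq_ftse, r_nasdaq_dow, r_dow_ftse), 3))
# FTSE and DOW, controlling for NASDAQ
print(round(partial_r(r_dow_ftse, r_nasdaq_ftse, r_nasdaq_dow), 3))
```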

2) Use file “upwell.xls”, showing four time series of oceanographic indices spanning a 10-year period: SOI, NOI, Upwelling_36, Upwelling_39. The raw data are on sheet “raw”. Use the other sheets to make the required calculations and paste tables / figures of results into this word file. Rename the excel file with your results and turn in (e.g., upwell_hyrenbach.xls)

(A) Regress each variable against time (decimal years) to determine if there is a long-term significant trend. For each regression, report the following: R-squared, Significance, Sum of the residuals. Discuss the result of each regression (is there a significant trend, how much of the variance is explained?)

For each regression, calculate the “predicted values” by plugging the calculated (mean) coefficients into the equation of a straight line. Notice where the residuals are coming from.

NOTE: You can do this using the Excel “Data analysis” Add-in or a different statistics software program. However, make sure you can get the residuals from each regression.
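If you use a different program, the calculation can be sketched as below: fit a straight line by least squares, plug the coefficients back in to get the predicted values, and subtract to get the residuals. The function name and the toy series are hypothetical, not the data in “upwell.xls”. The residuals of a least-squares line with an intercept sum to zero apart from floating-point rounding, which is why the sums reported in the spreadsheet are tiny numbers rather than exact zeros.

```python
def fit_line(t, y):
    """Ordinary least-squares fit of y on t; returns slope, intercept, residuals, R-squared."""
    n = len(t)
    mean_t = sum(t) / n
    mean_y = sum(y) / n
    sxx = sum((ti - mean_t) ** 2 for ti in t)
    sxy = sum((ti - mean_t) * (yi - mean_y) for ti, yi in zip(t, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_t
    predicted = [intercept + slope * ti for ti in t]      # values on the line
    residuals = [yi - pi for yi, pi in zip(y, predicted)]  # observed minus predicted
    ss_res = sum(e * e for e in residuals)
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    r_squared = 1 - ss_res / ss_tot
    return slope, intercept, residuals, r_squared

# Hypothetical monthly series in decimal years (mid-month time stamps)
t = [1986 + (m + 0.5) / 12 for m in range(24)]
y = [(m % 5) - 2 + 0.1 * m for m in range(24)]

slope, intercept, residuals, r_squared = fit_line(t, y)
print(round(r_squared, 3), sum(residuals))
```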

Plot of the four series, to allow you to visualize their degree of co-variation before we make the calculations.

Notice that SOI and NOI do not significantly change over time (there is no linear trend). The two upwelling indices show a significant change over time. Yet, the linear trends explain a very small amount of the total observed variation in upwelling: 4.9% (0.049) and 3.4% (0.034) for upwelling36 and upwelling39, respectively.

Variable  df      F        p value  r squared  Sum Residuals  Outcome
SOI       1, 238  1.2731   0.2603   0.005      -3.29514E-13   Not Significant
NOI       1, 238  0.8789   0.3494   0.004      -6.48592E-13   Not Significant
Upwell36  1, 238  12.2999  0.0005   0.049      -3.18323E-12   Significant
Upwell39  1, 238  8.4621   0.0040   0.034      9.64064E-11    Significant

(B) Use linear regressions to partial out the effect of one variable on the others, as shown in lecture, to determine the strength of the correlation between upwelling at 36 and upwelling at 39, given the influence of SOI and NOI on these two local upwelling indices.

- Calculate the correlation between Upwelling at 36 and Upwelling at 39. Report a Pearson correlation coefficient and a p value (Note: you can use the “correl” command in the Excel “Data Analysis” Add-In, or another program of your choice). Hint: use the same table of critical Pearson values from question 1.

Pearson correlation (zero-order) between Upwelling36 and Upwelling39= 0.613

R squared = 0.378

- Using linear regressions, determine the correlation between Upwelling at 36 and Upwelling at 39, once you remove the effect of SOI on both variables. Paste the residuals into the sheet “residuals” and calculate their sum. Report the R squared and the p value. Finally, paste two plots showing the best-fit regression line and the residuals. To check that the regression and the correlation give you the same result, calculate the Pearson correlation between the residuals of Upwelling at 36 and the residuals of Upwelling at 39. Is the partialed (1-order) correlation stronger or weaker than the simple (0-order) correlation you calculated before?

Upwelling36 regressed on SOI:

r_squared  p value  sum of residuals
0.0004     0.7508   -5.96856E-13

Upwelling39 regressed on SOI:

r_squared  p value  sum of residuals
0.0024     0.4513   1.55254E-12

Correlation of Upwelling36 residuals and Upwelling39 residuals:

Pearson correlation of the residuals (order 1)
coefficient: 0.618
r squared: 0.378

Regression of residuals:
R squared: 0.378
p value: 2.47E-26

- Outcome: The regression and the correlation give you the same result.

- The partialed (1-order) correlation (0.618) is larger, but very similar to the 0-order correlation (0.613).

- Interpretation: The partialed correlation is slightly stronger than the unpartialed correlation, suggesting that when the influence of SOI is removed, Upwell39 and Upwell36 covary slightly more strongly. This result suggests that SOI was slightly suppressing the relationship between Upwell36 and Upwell39. Yet, this was a weak effect, since the 0-order and 1-order correlations were so similar (and both significant).

- Using linear regressions, determine the correlation between Upwelling at 36 and Upwelling at 39, once you remove the effect of NOI on both variables. Paste the residuals into the sheet “residuals” and calculate their sum. Report the R squared and the p value. Finally, paste two plots showing the best-fit regression line and the residuals. To check that the regression and the correlation give you the same result, calculate the Pearson correlation between the residuals of Upwelling at 36 and the residuals of Upwelling at 39. Is the partialed (1-order) correlation stronger or weaker than the simple (0-order) correlation you calculated before?

Upwelling36 regressed on NOI:

r_squared  p value  sum of residuals
0.0002     0.8193   7.38964E-13

Upwelling39 regressed on NOI:

r_squared  p value  sum of residuals
0.0181     0.0370   1.10845E-12

Correlation of Upwelling36 residuals and Upwelling39 residuals:

Pearson correlation of the residuals (order 1)
coefficient: 0.616
r squared: 0.380
p value: 1.6E-26

Regression of residuals:
R squared: 0.380

- Outcome: The regression and the correlation give you the same result.

- The partialed (1-order) correlation (0.616) is larger but very similar to the 0-order correlation (0.613).

- Interpretation: The partialed correlation is slightly stronger than the unpartialed correlation, suggesting that when the influence of NOI is removed, Upwell39 and Upwell36 covary slightly more strongly. This result suggests that NOI was slightly suppressing the relationship between Upwell36 and Upwell39. Yet, this was a weak effect, since the 0-order and 1-order correlations were so similar (and both significant).
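The residual-based procedure used in part (B) can be sketched as follows: regress each upwelling series on the driver, keep the residuals, and correlate the two sets of residuals. The result should equal the first-order partial correlation obtained from the three zero-order correlations. All names and numbers here are hypothetical stand-ins, not the values in “upwell.xls”.

```python
import math

def pearson_r(x, y):
    """Pearson correlation from deviations about the means."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def ols_residuals(y, x):
    """Residuals of y after a least-squares regression of y on x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    slope = sum((a - mx) * (b - my) for a, b in zip(x, y)) / sum((a - mx) ** 2 for a in x)
    intercept = my - slope * mx
    return [b - (intercept + slope * a) for a, b in zip(x, y)]

def partial_r_formula(r12, r13, r23):
    """First-order partial correlation from three zero-order correlations."""
    return (r12 - r13 * r23) / math.sqrt((1 - r13 ** 2) * (1 - r23 ** 2))

# Hypothetical stand-ins for a driver index and the two upwelling series
driver = [0.5, -1.2, 0.3, 1.8, -0.7, 0.9, -1.5, 0.2]
up36 = [10.0, 5.0, -20.0, 15.0, 30.0, -12.0, -8.0, 3.0]
up39 = [14.0, -2.0, -25.0, 30.0, 22.0, -5.0, -20.0, 8.0]

r_from_residuals = pearson_r(ols_residuals(up36, driver), ols_residuals(up39, driver))
r_from_formula = partial_r_formula(
    pearson_r(up36, up39), pearson_r(up36, driver), pearson_r(up39, driver)
)
print(round(r_from_residuals, 6), round(r_from_formula, 6))
```

The two printed numbers should agree to rounding error, which is the same check as comparing the regression of the residuals with their Pearson correlation.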

3) Data Interpretation: Draw (using paint or powerpoint, or pen and paper and take a digital photo) a conceptual diagram (with bubbles and arrows) of the cause and effect relationship between the two upwelling indices (the explanatory variables) and the two driver variables (NOI, SOI).

Show the following: partial correlation coefficient between both upwelling indices, the 0-order correlation between the two driver variables (NOI, SOI), and the correlation coefficients between the explanatory and the driver variables, calculated using the partial regressions. (Hint: your diagram should have 4 bubbles, and 6 arrows. Make sure you correctly label one-headed (cause-effect) and two-headed (covariation) arrows).

These are the zero-order cross-correlation coefficients:

           NOI    SOI     upwell_36  upwell_39
NOI        -
SOI        0.533  -
upwell_36  0.015  -0.021  -
upwell_39  0.135  0.049   0.613      -

This is the conceptual diagram, showing the correlation coefficients:

- Finally, use the diagram to interpret your results. Do you agree with the Schwing et al. (2002) assertion that the NOI is a better descriptor of upwelling dynamics in the California Current than the SOI? Explain Why / Why Not?

Yes, I do agree. The NOI does much better than the existing SOI at predicting the upwelling index at 39 degrees N. By contrast, both NOI and SOI do equally poorly at predicting upwelling at 36 degrees N. See the table of the % of variance explained by the zero-order correlations (on a scale from 0 to 100), below:

           NOI    SOI
upwell_36  0.023  0.044
upwell_39  1.823  0.240

4) Data Documentation: Create a flow chart (using paint or powerpoint, or using pen and paper and taking a digital photo or scanning) where you illustrate all of the steps you took to calculate the correlation values from each of the six arrows in the plot from figure 3. Make sure you include the 0-order / 1-order correlations.


This is Devon’s flow chart – Great Work!

This is Shannon’s flow chart – Great Work!


5) Data Manipulation: For this part of the homework, you will use PC-ORD.

Load the data from the file “upwell.xls”, located in the PC-ORD sheet. Note: you can use the “import” command in the “File” menu to open a single sheet from Excel. Note: This is the “main matrix”; there is no “second matrix”.

Explore the data using the following tools. Copy and paste screen shots (use print screen), submit requested “result” files, and answer specific questions.

- A) Command: ADVISOR > Show Current Profile

Run this command to check the profile of your data.

Focus on “skewness”. Define what it means?

Skewness quantifies the asymmetry of a probability distribution. The value can be zero (perfectly symmetrical distribution), positive (asymmetrical, with a longer right tail) or negative (asymmetrical, with a longer left tail).

For univariate data, skewness is commonly calculated as follows:

$$g_1 = \frac{\sum_{i=1}^{N} (Y_i - \bar{Y})^3 / N}{s^3}$$

where $\bar{Y}$ is the mean, $s$ is the standard deviation, and N is the number of data points. The skewness for a normal distribution is zero, and any symmetrical distribution should also have a skewness near zero.

The “rule of thumb” is that: If the value of skewness falls outside the range of -1 to 1, it indicates that the distribution is non-normal.
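To see how the sign of skewness behaves, here is a minimal sketch of the moment-based definition (third central moment divided by the cubed standard deviation, with N in both denominators). This is an assumption about the formula: PC-ORD may apply a small-sample correction, so its values need not match exactly. The function name and the toy data are hypothetical.

```python
import math

def skewness(values):
    """Moment-based skewness: mean cubed deviation divided by SD cubed (N denominators)."""
    n = len(values)
    mean = sum(values) / n
    sd = math.sqrt(sum((v - mean) ** 2 for v in values) / n)
    third_moment = sum((v - mean) ** 3 for v in values) / n
    return third_moment / sd ** 3

print(skewness([1, 2, 3, 4, 5]))      # symmetrical data
print(skewness([1, 1, 1, 2, 10]))     # long right tail, positive skewness
print(skewness([-10, -2, -1, -1, -1]))  # long left tail, negative skewness
```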

Copy and paste the skewness results below:

Columns: 5 vars

Skewness: Average +0.0, Maximum +0.8, Minimum -0.5

Focus on “outliers”. Define what it means to be an outlier?

Outliers are “extreme” values that, within a given data set, are too “far” (different) from the other values. These distances are based on the number of variables (axes) we are using to describe the dataset. Note that for community data, the distances amongst all pairs of samples are compared on the basis of the species recorded. On the other hand, the distances for the species are calculated on the basis of the samples. Outliers can occur by mistake (e.g., measurement or typographical error) or can be real (e.g., caused by ecological differences in community structure or species-specific distributions across environmental variables).

Copy and paste the outliers below:

Potential Outliers

SD   Item
6.3  s95.125
5.2  s99.458
4.7  s02.542
4.0  s99.375
2.8  s93.458
2.6  s99.542
2.6  s03.042
2.4  s01.625
2.3  s89.625
2.2  s03.625

What does SD mean? How can we use SD to measure the outliers?

SD = Standard Deviation. The distances are expressed in standard deviation units. This allows the reader to assess the magnitude of this deviation, on the basis of the underlying properties of a normal distribution. In other words, using the SD provides standardized outliers. For a normal distribution: ~68% of the observations will fall within +/- 1 SD of the mean, ~95% will fall within +/- 2 SD of the mean, and ~99.7% will fall within +/- 3 SD of the mean.
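The coverage percentages for a normal distribution can be checked with the error function from the Python standard library. This is a minimal sketch; the function name is hypothetical.

```python
import math

def coverage_within(k):
    """Fraction of a normal distribution lying within +/- k SD of the mean."""
    return math.erf(k / math.sqrt(2))

for k in (1, 2, 3):
    print(k, round(100 * coverage_within(k), 1))
```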

Use the Help button to read about what is the “Relativized Euclidean Distance”. Copy and paste explanation below:

This is the text output from the PC-ORD help:

“Relativized Euclidean distance (RED) is similar conceptually to Euclidean distance, except that the data are normalized such that all of the data points fall on the surface of a unit quarter hypersphere. Built into this distance measure is the adjustment so that the sums of squares for each row equals one. For sample unit x species data, this effectively removes differences in overall abundances among sample units, instead focusing the analysis on the differences in relative abundances among species. The formula in Ludwig and Reynolds (1988) is incorrect. Relative Euclidean distance eliminates differences in total abundance (totals of squared abundances) among sample units. The range of RED is 0 to the square root of 2 given all non-negative data. Visualize the SU's as being placed on the surface of a quarter hypersphere with a radius of one. RED is the chord distance between two points on this surface.

RED builds in a standardization. It puts differently scaled variables on the same footing, eliminating any signal other than relative abundance. Note that the correlation coefficient also accomplishes this standardization, but arc cos(r) gives the arc distance on the quarter hypersphere, not the chord distance. Also, with the correlation coefficient the surface is a full hypersphere, not a quarter hypersphere, and the center of the hypersphere is the centroid of the cloud of points, not the origin.”
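The description above can be sketched as: rescale each row so that its sum of squares equals one, then take the ordinary Euclidean (chord) distance between the rescaled rows. This is my reading of the Help text, not PC-ORD's code, and the function name is hypothetical.

```python
import math

def relativized_euclidean_distance(row_a, row_b):
    """Chord distance between two rows after scaling each to unit sum of squares."""
    norm_a = math.sqrt(sum(v * v for v in row_a))
    norm_b = math.sqrt(sum(v * v for v in row_b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("each row needs at least one non-zero value")
    return math.sqrt(
        sum((a / norm_a - b / norm_b) ** 2 for a, b in zip(row_a, row_b))
    )

# Same relative abundances, different totals: distance is (numerically) zero
print(relativized_euclidean_distance([1, 2, 3], [10, 20, 30]))
# No shared species: distance reaches the upper limit, the square root of 2
print(relativized_euclidean_distance([1, 0], [0, 5]))
```

The first call shows why RED removes differences in total abundance; the second shows the upper limit for non-negative data.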

- B) Command: SUMMARY > Row and Column Summary

Run this command to summarize the data by columns (vars) only.

Copy and paste output below:

Column_Summary

Summary of: 5 vars    N = 240 samples
---------------------------------------------------------------------------------------------------
Num. Name      Mean       Stand.Dev.  Sum          Minimum    Maximum    S      E      H      D`
---------------------------------------------------------------------------------------------------
1    time      1996.000   5.786       479040.0000  1986.042   2005.958   240    1.000  5.481  0.9958
2    MEI       -0.4816    2.817       -115.6       -11.75     8.678      240    0.993  5.445  0.9956
3    PDO       -0.3167    1.731       -76.01       -5.429     3.489      240    0.988  5.417  0.9954
4    upwell_3  5.371      49.488      1289.0000    -131.000   200.000    237    0.990  5.415  0.9953
5    upwell_3  21.150     64.645      5076.0000    -296.000   240.000    239    0.997  5.458  0.9957
---------------------------------------------------------------------------------------------------
AVERAGES:      404.3      24.89       0.9704E+05   308.4      491.6      239.2  0.994  5.443  0.9955
---------------------------------------------------------------------------------------------------

               Skewness  Kurtosis
----------------------------------
1    time      0.000     -1.162
2    MEI       -0.460    2.669
3    PDO       -0.140    -0.162
4    upwell_3  0.846     1.802
5    upwell_3  -0.021    2.934
----------------------------------
Averages:      0.045     1.216
---------------------------------------------

1200 cells in main matrix
Percent of cells empty = 0.333
Matrix total = 0.48521E+06
Matrix mean = 0.40434E+03
Variance of totals of vars = 0.45605E+11
CV of totals of vars = 220.06%

---------------------------------------------
S = Richness = number of non-zero elements in row
E = Evenness = H / ln (Richness)
H = Diversity = - sum (Pi*ln(Pi)) = Shannon`s diversity index
D = Simpson`s diversity index for infinite population = 1 - sum (Pi*Pi)
    where Pi = importance probability in element i (element i relativized by row total)

****************************** Analysis completed ******************************
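The S, E, H and D definitions in the legend can be sketched as below, under the assumption that all values are positive so that each Pi is a proper proportion (how PC-ORD treats negative values is not covered here). The function name is hypothetical.

```python
import math

def diversity_summary(values):
    """Richness S, evenness E, Shannon H and Simpson D, following the output legend."""
    nonzero = [v for v in values if v != 0]
    total = sum(nonzero)
    proportions = [v / total for v in nonzero]   # Pi = element relativized by the total
    richness = len(nonzero)
    shannon = -sum(p * math.log(p) for p in proportions)
    evenness = shannon / math.log(richness) if richness > 1 else 0.0
    simpson = 1 - sum(p * p for p in proportions)
    return richness, evenness, shannon, simpson

# 240 identical values: the perfectly even case
s, e, h, d = diversity_summary([1.0] * 240)
print(s, round(e, 3), round(h, 3), round(d, 4))
```

With 240 identical values, H equals ln(240) and E equals 1, which is close to what the summary reports for the nearly uniform “time” column.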

- C) Command: SUMMARY > Outlier Analysis

Run this command to identify outliers. Copy and paste output below:

What is the largest outlier in this dataset? What sample (month-year) is it?

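The largest outlier is sample s95.125 (6.3 SD). In decimal years, 1995.125 falls in mid-February, so this sample is February 1995.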

What does it mean to be an outlier, according to PC-ORD? (Hint – use the program’s Help)

In general, outliers are data points that are abnormally far from the mean. In multivariate statistics, this means calculating the distances between all pairs of samples and then determining how they relate to the mean distance. PC-ORD flags any sample that is at least 2 SD above or below the mean distance. However, the user can apply whatever criterion makes sense. For instance, it makes no sense to throw out samples that are too similar to all other samples (those more than 2 SD below the mean value).
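A rough sketch of this kind of outlier screen is shown below. It is an approximation built on assumptions, not PC-ORD's procedure: it uses plain Euclidean distance, averages each sample's distance to all other samples, and expresses those averages in SD units using the sample standard deviation. The function name and the toy samples are hypothetical.

```python
import math

def flag_outliers(samples, threshold_sd=2.0):
    """Return (name, SD score) for samples whose average distance is >= threshold_sd from the mean."""
    names = list(samples)
    avg_dist = {}
    for a in names:
        dists = [math.dist(samples[a], samples[b]) for b in names if b != a]
        avg_dist[a] = sum(dists) / len(dists)
    values = list(avg_dist.values())
    mean = sum(values) / len(values)
    sd = math.sqrt(sum((v - mean) ** 2 for v in values) / (len(values) - 1))
    scores = {name: (dist - mean) / sd for name, dist in avg_dist.items()}
    flagged = [(name, score) for name, score in scores.items() if abs(score) >= threshold_sd]
    return sorted(flagged, key=lambda item: -abs(item[1]))

# Hypothetical samples: nine clustered points and one far away
samples = {f"s{i}": [float(i % 3), float(i % 2)] for i in range(9)}
samples["far"] = [50.0, 50.0]
for name, score in flag_outliers(samples):
    print(name, round(score, 1))
```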

- D) Command: GRAPH > Scatterplot

Run this command to create scatterplots of pairs of variables.

Create a plot of Upwell_36 versus Upwell_39, reformat the figure (add tick marks and labels, change the axes labels, and add a figure title). Label the largest outlier (s95.125). Copy and paste figure below:

[Figure: scatterplot titled “Comparison of coastal upwelling index anomalies off California, at latitudes 36 and 39 degrees N”. X-axis: Upwelling index anomaly, 36 degrees N. Y-axis: Upwelling index anomaly, 39 degrees N. The largest outlier (s95.125) is labeled.]

- E) Command: GRAPH > Scatterplot Matrix Run this command to create scatterplots of all pairs of variables. Copy and paste figure below:

[Figure: Scatterplot_matrix, showing all pairs of the variables time, MEI, PDO, upwell_36 and upwell_39.]

- What does “jittering” do? Explain.

Jittering adds a small random number to each data value to allow the reader to visualize observations stacked on top of each other in a graph. The degree of jittering is expressed as a percentage of the total range of a given variable, and can be set in the PC-ORD “settings” between 1% and 20% for the X-axis and the Y-axis.
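The idea behind jittering can be sketched as follows. The function name, the uniform noise, and the way the percentage is applied (as a maximum offset of that fraction of the variable's range) are all assumptions for illustration; PC-ORD's own scaling may differ.

```python
import random

def jitter(values, percent=5.0, seed=None):
    """Add uniform random noise of up to +/- percent of the range to each value."""
    rng = random.Random(seed)
    span = max(values) - min(values)
    half_width = span * percent / 100.0
    return [v + rng.uniform(-half_width, half_width) for v in values]

# Hypothetical stacked observations: three points at 0 and two at 10
stacked = [0.0, 0.0, 0.0, 10.0, 10.0]
print(jitter(stacked, percent=5.0, seed=1))
```

After jittering, the stacked points no longer plot on top of each other, while each one stays within 5% of the range of its original position.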
