Summary of Remainder
Harry R. Erwin, PhD
School of Computing and Technology
University of Sunderland
Resources
• Crawley, MJ (2005) Statistics: An Introduction Using R. Wiley.
• Freund, RJ, and WJ Wilson (1998) Regression Analysis, Academic Press.
• Gentle, JE (2002) Elements of Computational Statistics. Springer.
• Gonick, L., and Woollcott Smith (1993) A Cartoon Guide to Statistics. HarperResource (for fun).
Topics
• Multiple Regression
• Contrasts
• Count Data
• Proportion Data
• Survival Data
• Binary Response
• Course Summary
Multiple Regression
• Two or more continuous explanatory variables.
• Your problems are not restricted to order. You often lack enough data to examine all the potential interactions and higher-order effects.
– To explore the possibility of a third-order interaction term with three explanatory variables (A:B:C) requires about 3 × 8 = 24 data values (roughly three for each of the eight parameters).
– If there’s potential for curvature, you need 3 × 3 = 9 more data values to pin that down.
• Be selective. If you are considering an interaction term, you have to consider all the lower-order interactions and the individual explanatory variables in it.
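The arithmetic above can be checked by counting the parameters a model formula implies. This is a minimal sketch: A, B and C are placeholder variable names, and the figure of three observations per parameter is an assumed rule of thumb rather than a hard requirement.

```r
# Count the parameters implied by a model formula (no data needed).
# A, B and C are placeholder names for three continuous explanatory variables.
full <- terms(y ~ A * B * C)
n_full <- length(attr(full, "term.labels")) + attr(full, "intercept")

# Add one quadratic term per variable to allow for curvature.
curved <- terms(y ~ A * B * C + I(A^2) + I(B^2) + I(C^2))
n_curved <- length(attr(curved, "term.labels")) + attr(curved, "intercept")

per_param <- 3  # assumed rule of thumb: about three observations per parameter
c(full = per_param * n_full,
  extra_for_curvature = per_param * (n_curved - n_full))
```

The full model has three main effects, three two-way interactions, one three-way interaction and an intercept, which is where the eight parameters come from.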
Issues to Consider
• Which explanatory variables to include.
• Curvature in the response to explanatory variables.
• Interactions between explanatory variables. (High-order interactions tend to be rare.)
• Correlation between explanatory variables.
• Over-parameterization. (Avoid!)
Contrasts
• Contrasts are the basis of hypothesis testing and model simplification in ANOVA.
• When you have more than two levels in a categorical variable, you need to know which levels are meaningful and which can be combined.
• Sometimes you know which ones to combine and sometimes not.
• First do the basic ANOVA to determine whether there are significant differences to be investigated.
Model Reduction in ANOVA
• You reduce a model in ANOVA by combining factor levels.
• Define your contrasts based on the science:
– Treatment versus control.
– Similar treatments versus other treatments.
– Treatment differences within similar treatments.
• You can also aggregate factor levels in steps.
• See me if you need to do this. R can automate the process.
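One way to combine factor levels and check whether the simpler model is adequate might look like the sketch below. The data are simulated and the level names (control, t1, t2, t3) are made up for illustration; which levels to merge should come from the science, not from this example.

```r
set.seed(1)
# Simulated data: a control and three treatments, two of which behave alike.
treatment <- factor(rep(c("control", "t1", "t2", "t3"), each = 10))
response <- rnorm(40, mean = rep(c(10, 12, 12, 15), each = 10), sd = 1)

full <- lm(response ~ treatment)

# Merge the two similar treatments into a single level.
merged <- treatment
levels(merged)[levels(merged) %in% c("t1", "t2")] <- "t1_t2"
reduced <- lm(response ~ merged)

# Compare the models; a non-significant F test suggests the merge is acceptable.
comparison <- anova(reduced, full)
comparison
```

Assigning the same name to two levels is what merges them, so the reduced model has one parameter fewer than the full model.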
Count Data
• With frequency data, we know how often something happened, but not how often it didn’t happen.
• Linear regression assumes constant variance and normal errors. This is not appropriate for count data:
1. Counts are non-negative.
2. Response variance usually increases with the mean.
3. Errors are not normally distributed.
4. Zeros are hard to transform.
Handling Count Data in R
• Use a glm model with family=poisson.
– This sets errors to Poisson, so variance is equal to the mean.
– This sets link to log, so fitted values are positive.
• If you have overdispersion (residual deviance greater than residual degrees of freedom), use family=quasipoisson instead.
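A minimal sketch of this workflow follows, using simulated counts that are overdispersed by construction. The variable names and the simulation settings are assumptions made for illustration.

```r
set.seed(2)
x <- runif(50, 0, 2)
# Negative binomial counts, so the variance exceeds the mean by construction.
counts <- rnbinom(50, mu = exp(0.5 + 0.8 * x), size = 2)

m_pois <- glm(counts ~ x, family = poisson)

# Rough overdispersion check: residual deviance divided by residual df.
# A value near 1 is what Poisson errors would give.
dispersion <- deviance(m_pois) / df.residual(m_pois)
dispersion

# Refit with quasipoisson: same coefficients, but the standard errors
# are scaled by the estimated dispersion.
m_quasi <- glm(counts ~ x, family = quasipoisson)
summary(m_quasi)$coefficients
```

Only the standard errors and tests change between the two fits; the fitted means are identical.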
Contingency Tables
• There is a risk of data aggregation over important explanatory variables (nuisance variables).
• So check the significance of the real part of the model before you eliminate nuisance variables.
Frequencies and Proportions
• With frequency data, you know how often something happened, but not how often it didn’t happen.
• With proportion data, you know both.
• Applied to:
– Mortality and infection rates
– Response to clinical treatment
– Voting
– Sex ratios
– Proportional response to experimental treatments
Working With Proportions
• Traditionally, proportion data was modelled by using the percentage as the response variable.
• This is bad for four reasons:
1. Errors are not normally distributed.
2. Non-constant variance.
3. Response is bounded by 0.0 and 1.0.
4. The size of the sample, n, is lost.
Testing Proportions
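• To compare a single binomial proportion to a constant, use binom.test with a vector of (successes, failures):
– y<-c(15,5)
– binom.test(y,p=0.5)
– y<-c(14,6)
– binom.test(y,p=0.5)
• To compare two samples, use prop.test with the success counts first and the sample sizes second (sample sizes of 20 are assumed here, since a count cannot exceed its sample size):
– prop.test(c(14,6),c(20,20))
• Only use glm methods for complex models:
– Regression tables
– Contingency tables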
GLM Models for Proportions
• Start with a generalized linear model (glm).
• family = binomial (i.e., unfair coin flip)
• Use two vectors, one of the success counts and the other of the failure counts.
• number of failures + number of successes = binomial denominator, n
• y<-cbind(successes, failures)
• model<-glm(y~whatever,binomial)
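The recipe above can be sketched end to end on simulated data. Here dose stands in for "whatever", and the batch size of 20 and the true coefficients are assumptions chosen only to make the example run.

```r
set.seed(3)
dose <- rep(1:5, each = 4)
n <- rep(20, length(dose))   # binomial denominator for each batch
successes <- rbinom(length(dose), size = n, prob = plogis(-2 + 0.7 * dose))
failures <- n - successes

y <- cbind(successes, failures)   # two-column response
model <- glm(y ~ dose, family = binomial)
summary(model)

# Fitted values are proportions on the 0 to 1 scale.
range(fitted(model))
```

Because the response carries both counts, each batch is weighted by its own sample size automatically.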
How R Handles Proportions
• Weighted regression (weighted by the individual sample sizes).
• logit link to ensure linearity.
• If percentage cover data (e.g., survey data):
– Do an arcsine transformation, followed by conventional modelling (normal errors, constant variance).
• If percentage change in a continuous measurement (e.g., growth):
– ANCOVA with final weight as the response and initial weight as a covariate, or
– Use the relative growth rate (log(final/initial)) as response.
– Both produce normal errors.
Count Data in Proportions
• R supports the traditional arcsine and probit transformations:
– arcsine makes the error distribution normal
– probit linearises the relationship between percentage mortality and log(dose)
• It is usually better to use the logit transformation and assume you have binomial data.
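For reference, the three transformations mentioned above can each be computed in one line of base R. The proportions below are arbitrary example values.

```r
p <- c(0.05, 0.25, 0.50, 0.75, 0.95)   # example proportions

arcsine <- asin(sqrt(p))   # arcsine (angular) transformation, in radians
probit  <- qnorm(p)        # probit: quantile of the standard normal
logit   <- qlogis(p)       # logit: log(p / (1 - p))

round(cbind(p, arcsine, probit, logit), 3)
```

Both probit and logit are symmetric about p = 0.5 and unbounded, whereas the arcsine transform stays between 0 and pi/2.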
Death and Failure Data
• Applications include:
– Time to death
– Time to failure
– Time to event
• This is a useful way to analyse performance when the process leading to a goal is complex, for example when a robot is performing a task.
Problems with Survival Data
• Non-constant variance, so standard methods are inappropriate.
• If errors are gamma distributed, the variance is proportional to the square of the mean.
• Use a glm with Gamma errors.
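A glm with Gamma errors might be fitted as in the sketch below. The data are simulated, the two group names are made up, and the log link is a choice made here for easier interpretation (the default link for Gamma in R is the inverse).

```r
set.seed(4)
group <- factor(rep(c("a", "b"), each = 30))
# Gamma-distributed times to failure; group b has the longer mean (10 versus 5).
time_to_failure <- rgamma(60, shape = 2, rate = 2 / rep(c(5, 10), each = 30))

m_gamma <- glm(time_to_failure ~ group, family = Gamma(link = "log"))
summary(m_gamma)

# With a log link, exp(intercept) is the mean time for group a and
# exp(groupb) is the ratio of mean times, group b over group a.
exp(coef(m_gamma))
```

This approach only applies when every event time is observed; it makes no allowance for trials that end before the event occurs.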
How do we deal with events that don’t happen during the study?
• In those trials, we don’t know when the event would occur. We just know the time would be greater than the end of the trial. Those trials are censored.
• The methods for handling censored data make up the field of survival analysis.
• (I used survival analysis in my PhD work. My wife does survival analysis for cancer data.)
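As an illustration of censoring, the sketch below simulates event times, cuts them off at an assumed study end, and fits a Kaplan-Meier curve with the survival package (a recommended package that ships with standard R installations). The sample size, rate and study length are invented for the example.

```r
library(survival)

set.seed(5)
true_time <- rexp(40, rate = 1 / 10)   # simulated event times, mean 10
study_end <- 15                        # assumed end of the study

time   <- pmin(true_time, study_end)           # what we actually observe
status <- as.integer(true_time <= study_end)   # 1 = event seen, 0 = censored

# Kaplan-Meier estimate that takes the censored trials into account.
km <- survfit(Surv(time, status) ~ 1)
print(km)

# The naive mean of the observed times can only underestimate the true mean.
c(naive = mean(time), true = mean(true_time))
```

The status vector is what tells the analysis which times are real event times and which are only lower bounds.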
Binary Response
• Very common:
– dead or alive
– occupied or empty
– male or female
– employed or unemployed
• Response variable is 0 or 1.
• R assumes a binomial trial with sample size 1.
When to use Binary Response Data
• Do a binary response analysis only when you have unique values of one or more explanatory variables for each and every possible individual case.
• Otherwise lump: aggregate to the point where you have unique values. Either:
– Analyse the data as a contingency table using Poisson errors, or
– Decide which explanatory variable is key, express the data as proportions, recode as a count of a two-level factor, and assume binomial errors.
Modelling Binary Response
• Single vector with the response variable
• Use glm with family = binomial.
• Think about a complementary log-log link (link = "cloglog" in R) instead of logit. Use the one that gives less deviance.
• Fit the usual way.
• Test significance using χ².
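Put together, the steps above might look like the following sketch on simulated 0/1 data. The variable names and the true coefficients are assumptions for illustration only.

```r
set.seed(6)
x <- runif(200, -2, 2)
occupied <- rbinom(200, size = 1, prob = plogis(0.3 + 1.2 * x))  # 0/1 response

m_logit   <- glm(occupied ~ x, family = binomial)   # logit is the default link
m_cloglog <- glm(occupied ~ x, family = binomial(link = "cloglog"))

# Prefer the link with the smaller residual deviance.
c(logit = deviance(m_logit), cloglog = deviance(m_cloglog))

# Test the explanatory variable by deletion, using a chi-squared
# test on the change in deviance.
m_null <- update(m_logit, . ~ 1)
anova(m_null, m_logit, test = "Chisq")
```

With a binary response the residual deviance cannot be used to judge overdispersion, so the comparison between models is what matters.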
Course Summary
• We’ve had an introduction to thinking critically about data.
• We’ve seen how to use a typical statistical analysis system (R).
• We’ve looked at our projects critically.
• We’ve discussed hypothesis testing.
• We’ve looked at statistical modelling.
Statistical Activities
• Data collection (ideally the statistician has a say on how they are collected)
• Description of a dataset:
– Averages
– Spreads
– Extreme points
• Inference within a model or collection of models
• Model selection
Why Model?
• Usually you do statistics to explore the structure of data. The questions you might ask are rather open-ended. Your understanding is facilitated by a model.
• A model embodies what you currently know about the data. You can formulate it either as a data-generating process or a set of rules for processing the data.
Structure-in-the-data
• Of most interest, for example:
– Modes
– Gaps
– Clusters
– Symmetry
– Shape
– Deviations from normality
• Plot the data to understand this.
Visualization
• Multiple views are necessary.
• Be able to zoom in on the data, as a few points can obscure the interesting structure.
• Scaling of the axes may be necessary, since our eyes are not perfect tools for detecting structure.
• Watch out for time-ordered or location-ordered data, particularly if time or location are not explicitly reported.
Plots
• Use simple plots to start with.
• Watch for rounded data—shown by horizontal strata in the data. That often signals other problems.
Bottom Line
• I am available for consulting (free).
• E-mail: [email protected]
• Phone: 515-3227 or extension 3227 from university phones.
• Plan on about an hour meeting to allow time to think intelligently about your data.