Advanced Regression Workshop - Kean University


Transcript of Advanced Regression Workshop - Kean University


Multiple Regression Kean University

February 12, 2013


Contents

1. Multiple Linear Regression

2. Logistic Regression

3. Statistical Definitions

4. Regression Models & SEM


Multiple Linear Regression

Regression techniques are primarily used to create an equation that can be used to predict values of the dependent variable for all members of the population. A secondary function of regression is that it can serve as a means of explaining causal relationships between variables.

Types of Linear Regression

Standard Multiple Regression - all independent variables are entered into the analysis simultaneously.

Sequential Multiple Regression (Hierarchical Multiple Regression) - independent variables are entered into the equation in a particular order decided by the researcher.

Stepwise Multiple Regression - typically used as an exploratory analysis, and used with large sets of predictors.

1. Forward Selection - bivariate correlations between all the IVs and the DV are calculated, and IVs are entered into the equation from the strongest correlate to the weakest.

2. Stepwise Selection - similar to forward selection; however, if, in combination with other predictors, an IV no longer appears to contribute much to the equation, it is dropped.

3. Backward Deletion - all IVs are entered into the equation. Partial F tests are calculated on each variable as if it were entered last, to determine its contribution to the overall prediction. The variable with the smallest partial F is removed, based on a predetermined criterion.

Variables

IV- Also referred to as predictor variables, one or more continuous variables

DV-Also referred to as the outcome variable, a single continuous variable

Assumptions that must be met:

1. Normality. All errors should be normally distributed, which can be tested by looking at the skewness, kurtosis, and histogram plots. Technically, normality is necessary only for the t-tests to be valid; estimation of the coefficients only requires that the errors be identically and independently distributed.

2. Independence. The errors associated with one observation are not correlated with the errors of any other

observation.

3. Linearity. The relationship between the IVs and DV should be linear.

4. Homoscedasticity. The variances of the residuals across all levels of the IVs should be consistent, which can

be tested by plotting the residuals.

5. Model specification. The model should be properly specified (including all relevant variables, and excluding

irrelevant variables)

Other important issues:

Influence - individual observations that exert undue influence on the coefficients. Are there covariates that you should be including in your model?

Collinearity - the predictor variables may be related, but should not be so strongly correlated that they are measuring the same thing (e.g., using both age and grade), which leads to multicollinearity. Multicollinearity misleadingly inflates the standard errors, and thus makes some variables statistically non-significant when they would otherwise be significant.

Unusual and Influential data

A single observation that is substantially different from all other observations can make a large difference in

the results of your regression analysis. If a single observation (or small group of observations) substantially

changes your results, you would want to know about this and investigate further. There are three ways that an

observation can be unusual.

Outliers: In linear regression, an outlier is an observation with a large residual. In other words, it is an observation whose dependent-variable value is unusual given its values on the predictor variables. An outlier may indicate a sample peculiarity, a data entry error, or some other problem.

Leverage: An observation with an extreme value on a predictor variable is called a point with high leverage.

Leverage is a measure of how far an observation deviates from the mean of that variable. These leverage points

can have an unusually large effect on the estimate of regression coefficients.

Influence: An observation is said to be influential if removing the observation substantially changes the

estimate of coefficients. Influence can be thought of as the product of leverage and outlierness.

Collinearity Diagnostics

VIF: Formally, variance inflation factors (VIF) measure how much the variance of the estimated coefficients is increased over the case of no correlation among the X variables. If no two X variables are correlated, then all the VIFs will be 1. If two or more variables have a VIF around or greater than 5 (some say up to 10 is acceptable), one of these variables should be removed from the regression model. To determine the best one to remove, remove each candidate individually and select the regression equation that explains the most variance (the highest R2).

Tolerance: Values should be greater than .10; a value less than .10 indicates a collinearity issue.
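These diagnostics can also be computed outside SPSS. Below is a minimal Python sketch using statsmodels; the data and variable names (age, grade, days) are invented purely for illustration and echo the age/grade collinearity example above.

# Minimal sketch: computing VIF and tolerance for a set of predictors.
# The DataFrame and its columns are illustrative, not the workshop data.
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

df = pd.DataFrame({
    "age":   [15, 16, 17, 18, 16, 15, 17, 19, 18, 16],
    "grade": [ 9, 10, 11, 12, 10,  9, 11, 12, 12, 10],
    "days":  [30, 45, 10, 60, 25, 40, 55, 20, 35, 50],
})

X = sm.add_constant(df)              # include the intercept column
for i, name in enumerate(X.columns):
    if name == "const":
        continue
    vif = variance_inflation_factor(X.values, i)
    print(f"{name}: VIF = {vif:.2f}, tolerance = {1 / vif:.2f}")

Because age and grade are nearly redundant in this toy data, their VIFs come out large (and tolerances small), which is exactly the pattern the rules of thumb above are meant to flag.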

Other informal signs of multicollinearity are:

• Regression coefficients change drastically when adding or deleting an X variable.

• A regression coefficient is negative when theoretically Y should increase with increasing values of that

X variable, or the regression coefficient is positive when theoretically Y should decrease with increasing

values of that X variable.

• None of the individual coefficients has a significant t statistic, but the overall F test for fit is

significant.

• A regression coefficient has a nonsignificant t statistic, even though on theoretical grounds that X

variable should provide substantial information about Y.

• High pairwise correlations between the X variables. (But three or more X variables can be multicollinear

together without having high pairwise correlations.)

How to deal with multicollinearity:

• Increasing the sample size is a common first step, but this only partially offsets the problem.

• The easiest solution: Remove the most intercorrelated variable(s) from analysis. This method is

misguided if the variables were there due to the theory of the model.

• Combine variables into a composite variable through building indexes. Remember: in order to create an

index, you need to have theoretical and empirical reasons to justify this action.

• Use centering: transform the offending independents by subtracting the mean from each case (see the sketch after this list).


• Drop the intercorrelated variables from analysis but substitute their crossproduct as an interaction

term, or in some other way combine the intercorrelated variables. This is equivalent to respecifying the

model by conceptualizing the correlated variables as indicators of a single latent variable. Note: if a

correlated variable is a dummy variable, other dummies in that set should also be included in the

combined variable in order to keep the set of dummies conceptually together.

• Leave one intercorrelated variable as is but then remove the variance in its covariates by regressing

them on that variable and using the residuals.

• Assign the common variance to each of the covariates by some probably arbitrary procedure.

• Treat the common variance as a separate variable and decontaminate each covariate by regressing them

on the others and using the residuals. That is, analyze the common variance as a separate variable.
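The centering and composite-index remedies above can be illustrated with a short pandas sketch; the column names (age, grade, anxiety, worry) and the data are hypothetical.

# Minimal sketch of two multicollinearity remedies: mean-centering and
# building a composite index. Column names and values are invented.
import pandas as pd

df = pd.DataFrame({
    "age":     [15, 16, 17, 18, 16],
    "grade":   [ 9, 10, 11, 12, 10],
    "anxiety": [20, 25, 22, 30, 27],
    "worry":   [18, 24, 21, 29, 26],
})

# Centering: subtract each variable's mean before forming product terms.
df["age_c"] = df["age"] - df["age"].mean()
df["grade_c"] = df["grade"] - df["grade"].mean()
df["age_x_grade"] = df["age_c"] * df["grade_c"]   # interaction of centered terms

# Composite index: average the standardized scores of the intercorrelated items
# (only justified when theory and reliability evidence support combining them).
z = (df[["anxiety", "worry"]] - df[["anxiety", "worry"]].mean()) / df[["anxiety", "worry"]].std()
df["distress_index"] = z.mean(axis=1)
print(df)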

To Conduct the Analysis in SPSS

1. Your dataset must be open. To run the analysis, click Analyze, then Regression, then Linear.

2. The Linear Regression window will open. Select the outcome variable, then click the right arrow to put the variable in the Dependent box. Highlight all of the independent variables, then click the right arrow to put the variables into the Independent(s) box.

3. Select the method of regression that is most appropriate for the data set:

a. Enter - enters all IVs into the model in a single block, regardless of significant contribution.

b. Stepwise - combines the Forward and Backward methods, and uses criteria for both entering and removing IVs from the equation.

c. Remove - first uses an Enter method; the specified variable(s) are then removed from the model and the Enter method is repeated.

d. Backward - enters all IVs and then removes them one at a time based on a predetermined level of significance for removal (default is p ≥ .01).

e. Forward - only enters IVs that significantly contribute to the model.



4. Click on the Statistics button to open the Statistics dialogue box. Check the appropriate statistics, which usually include Estimates, Model Fit, Descriptives, Part and Partial Correlations, and Collinearity Diagnostics. Note: if running a stepwise regression, also check R Squared Change.

5. Click on the Options button to open the Options dialogue box. Here you can change the inclusion and exclusion criteria (the probability or F value), depending on the method of regression used.

6. Optionally, click on the Plots button to add plots and histograms to the output. Clicking the Save button gives options to save the residuals, etc.

7. To create a syntax file, simply click on Paste.

Output

Run the syntax; the output should look similar to the tables below.

The Variables Entered box shows which variables have been included or excluded from the regression analysis, and the method by which they have been entered. Depending on the method of regression used, certain variables may be removed for failing to meet predetermined criteria.



The Model Summary box outlines the overall fit of the model. R is the correlation between the variables, which should be the same as shown in the Correlations table. The R Square value indicates the amount of variance in the dependent variable explained by the predictor variables. In this case, the predictor variables account for 9.8% of the variance in number of offenses. The Adjusted R Square is a more conservative estimate of variance explained and removes variability that is likely due to chance; however, this value is not often reported or interpreted. The ANOVA is used to test whether or not the model significantly predicts the outcome variable. In this example, the model does significantly predict the outcome variable, because p < .001.

The Coefficients box notes the degree and significance of the effect that each predictor has on the outcome variable. In this example, only whether or not one is incarcerated and entry age are significant predictors. When conducting regression analyses, it may be useful to run multiple combinations of predictor variables and regression methods.

Notes on the output:

• R Square indicates the amount of variance in the DV explained by the IVs. Use the Adjusted R Square when you have more than one predictor (IV).

• The F ratio and its significance tell the degree to which the model predicts the DV.

• Unstandardized B is the degree to which each predictor variable impacts the outcome variable (DV).

• Standardized Betas tell the amount of variance in the DV that is explained by each predictor variable individually.

• The t statistic and its significance tell whether or not each predictor is significant.
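For readers who want to reproduce these quantities outside SPSS, here is a minimal Python sketch using statsmodels. The data are simulated and the variable names only echo the workshop example; results.summary() reports R Square, Adjusted R Square, the F test, and the unstandardized B, t, and p values discussed above.

# Minimal sketch: fitting a multiple linear regression and reading off the
# statistics discussed above. Data and variable names are illustrative only.
import numpy as np
import pandas as pd
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 200
df = pd.DataFrame({
    "incarcerated": rng.integers(0, 2, n),
    "age_first":    rng.normal(14, 2, n),
    "days_placed":  rng.normal(90, 30, n),
})
df["offenses"] = (0.7 * df["incarcerated"] + 0.15 * df["age_first"]
                  + rng.normal(0, 1, n))

X = sm.add_constant(df[["incarcerated", "age_first", "days_placed"]])
results = sm.OLS(df["offenses"], X).fit()
print(results.summary())          # R-squared, adj. R-squared, F test, B, t, p

# Standardized betas: refit the model on z-scored variables.
zdf = (df - df.mean()) / df.std()
zX = sm.add_constant(zdf[["incarcerated", "age_first", "days_placed"]])
print(sm.OLS(zdf["offenses"], zX).fit().params)   # beta weights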


Sample Write Up & Table

A multiple regression was also conducted to predict the number of offenses based on the available independent variables. The predictors included incarcerated (vs. not incarcerated), the age at first offense, the number of days in placement, and race. The overall model was significant, F(5, 220) = 4.80, p < .001, and accounted for 9.8% of the variance. The results indicated that incarceration and the age at first offense were significant predictors of the number of offenses committed (see Table 9). The number of days in placement and race were not significant predictors of the number of offenses committed. Incarceration (compared to non-incarceration) was associated with an increase in the number of offenses committed (Beta = .243, p < .01). In addition, controlling for the other predictors, as age at first offense increased, the number of offenses also increased (Beta = .187, p < .01).

________________________________________________________________________

Table 9

Multiple Regression Analyses of Incarceration Status, Age, Days in Placement, and

Race on Number of Offenses (N = 226)

________________________________________________________________________

Unstandardized B    SE    Beta    t    p

Incarcerated .668 .24 .243 2.76 .006

Age at First Offense .147 .05 .187 2.71 .007

Days in Placement .000 .00 - .057 -0.66 .511

African American .065 .22 .022 0.30 .766

Caucasian - .212 .20 - .077 -1.03 .302

________________________________________________________________________

Note. F (5, 220) = 4.80, p < .001, R2 = .098


Logistic Regression

Logistic regression, which comes in binary and multinomial forms, is a type of predictive analysis that predicts a dichotomous (or, in the multinomial case, multi-category) dependent variable based on a set of independent variables.

Variables

DV-one dichotomous dependent variable (e.g. alive/dead, married/single, purchase/not purchase)

IVs-one or more independent variable that can be either continuous or categorical

Assumptions that must be met:

1. Sample Size. Reducing a continuous variable to a binary or categorical one loses information and attenuates effect sizes, reducing the power of the logistic procedure. Therefore, in many cases, a larger sample size is needed to ensure the power of the statistical procedure. It is recommended that the sample size be at least 30 times the number of parameters, or 10 cases per independent variable.

2. Meaningful coding. Logistic coefficients will be difficult to interpret if the variables are not coded meaningfully. The convention for binary logistic regression is to code the dependent class of greatest interest as 1 ("the event occurring") and the other class as 0 ("the event not occurring").

3. Proper specification of the model is particularly crucial; parameters may change magnitude and even

direction when variables are added to or removed from the model.

a. Inclusion of all relevant variables in the regression model: If relevant variables are omitted, the common

variance they share with included variables may be wrongly attributed to those variables, or the error

term may be inflated.

b. Exclusion of all irrelevant variables: If causally irrelevant variables are included in the model, the

common variance they share with included variables may be wrongly attributed to the irrelevant

variables. The more the correlation of the irrelevant variable(s) with other independents, the greater

the standard errors of the regression coefficients for these independents.

4. Linearity. Logistic regression does not require linear relationships between the independent factor or

covariates and the dependent, but it does assume a linear relationship between the independents and the log

odds (logit) of the dependent.

a. Box-Tidwell Transformation (Test): add to the logistic model interaction terms which are the crossproduct of each independent times its natural logarithm. If these terms are significant, then there is nonlinearity in the logit. This method is not sensitive to small nonlinearities (a rough sketch of this check appears after this list).

5. No outliers. As in linear regression, outliers can affect results significantly.
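Here is a rough Python sketch of the Box-Tidwell-style check described in 4a; the data are simulated and the single predictor x is hypothetical (it must be strictly positive for the log term to exist).

# Minimal sketch of a Box-Tidwell-style check for linearity in the logit:
# add an x * ln(x) term to the model and test whether it is significant.
# A significant x_lnx coefficient suggests nonlinearity in the logit.
import numpy as np
import pandas as pd
import statsmodels.api as sm

rng = np.random.default_rng(1)
n = 500
x = rng.uniform(1, 10, n)                    # predictor must be > 0
p = 1 / (1 + np.exp(-(-2 + 0.4 * x)))        # true model is linear in the logit
y = rng.binomial(1, p)

df = pd.DataFrame({"y": y, "x": x})
df["x_lnx"] = df["x"] * np.log(df["x"])      # Box-Tidwell interaction term

X = sm.add_constant(df[["x", "x_lnx"]])
fit = sm.Logit(df["y"], X).fit(disp=0)
print(fit.summary())    # check the p value of x_lnx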

Types of Logistic Regression

Binary Logistic Regression - treats all IVs as continuous covariates; categorical variables must be specified in SPSS.

Multinomial Logistic Regression - all IVs are explicitly entered as factors, and the reference category of the outcome variable must be set in SPSS.


To conduct this analysis in SPSS

1. Your data set must be open. To

run the analysis, click Analyze,

then Regression, then either

Binary or Multinomial Logistic.

2. Move the DV into the

“Dependent” box, and move the IVs

into the “Covariates” Box.

3. Select the method of regression

that is most appropriate for the

data set:

a. Enter - enters all IVs into the model in a single block, regardless of significant contribution.

b. Stepwise - combines the Forward and Backward methods, and uses criteria for both entering and removing IVs from the equation.

c. Remove - first uses an Enter method; the specified variable(s) are then removed from the model and the Enter method is repeated.

d. Backward - enters all IVs and then removes them one at a time based on a predetermined level of significance for removal (default is p ≥ .01).

e. Forward - only enters IVs that significantly contribute to the model.

4. Click on the “Options” box, and check the box next to “CI for exp(B).” Then click “Continue.”



5. Paste and run the syntax.

Output

Run the syntax; the output should look similar to the tables below.

The first box outlines how many cases were included in and excluded from the analysis; you will report the n included in the analysis in your write-up.

The dependent variable encoding box shows the

label for each coding. This is important to note, because SPSS

creates the regression equation based on the likelihood of having a

value of 1. In this case, SPSS is creating an equation to predict the

likelihood that an individual is not very satisfied.

The next set of tables falls under the heading of "Block 0: Beginning Block," and consists of three tables: Classification Table, Variables in the Equation, and Variables not in the Equation. This block provides a description of the null model and does not include the predictor variables. These values are not interpreted or reported.

Block 1 is what is interpreted and reported in the write-up. The Omnibus Test uses a chi-square to determine whether the model is statistically significant. The -2 Log likelihood is not interpreted or reported. Cox & Snell R Square and Nagelkerke R Square are measures of effect size; typically Nagelkerke is reported over Cox & Snell. Based on the regression equation created from the analysis, SPSS predicts which group individual cases will belong to, and then calculates the percentage of correct predictions.


Sample Write Up and Table

A logistic regression analysis was conducted to predict whether an individual was not very satisfied with his or her job; see Table 1. Overall, the model was significant, χ2(4) = 16.71, p = .002, Nagelkerke R2 = .025. Of all the predictor variables, only age was a significant predictor, p < .001, with an odds ratio of .980, indicating that as an individual's age increases he or she is less likely to be not very satisfied with his or her job. As a predictor, years of education was marginally significant, p = .085, with an odds ratio of .957, indicating that as years of education increase, there is a decrease in the likelihood of being not very satisfied with one's job. None of the remaining predictors (e.g., hours worked per week, number of siblings) were significant predictors of job satisfaction, ns.

________________________________________________________________________

Table 1

Summary of Logistic Regression Predicting Job Satisfaction Satisfied or Not Satisfied

________________________________________________________________________

β    Odds Ratio    95% CI Lower    95% CI Upper    p

Age -.021 .980 .97 .99 .000

Years of Education -.044 .957 .91 1.01 .085

Hours per Week -.004 .996 .99 1.01 .384

Number of Siblings .000 1.000 .95 1.05 .990

________________________________________________________________________

Note. χ2(4) = 16.71, p = .002, Nagelkerke R2 = .025.

Notes on the output: the estimated coefficient (β) and its standard error are reported but not interpreted. Report the confidence intervals for both the lower and upper limits, and report the significance of each predictor.

Exp(B) is the odds ratio for each predictor. As mentioned, SPSS is predicting the likelihood of the DV being a 1, in this case "Not Very Satisfied." When the odds ratio is less than 1, increasing values of the variable correspond to decreasing odds of the event's occurrence. When the odds ratio is greater than 1, increasing values of the variable correspond to increasing odds of the event's occurrence.
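To make the Exp(B) interpretation concrete, here is a small Python sketch with simulated data and illustrative variable names; it fits a binary logistic model and converts the coefficients into odds ratios with 95% confidence intervals, the same quantities reported in Table 1.

# Minimal sketch: binary logistic regression, odds ratios, and 95% CIs.
# Data are simulated and variable names are illustrative only.
import numpy as np
import pandas as pd
import statsmodels.api as sm

rng = np.random.default_rng(2)
n = 1000
df = pd.DataFrame({
    "age":  rng.normal(40, 12, n),
    "educ": rng.normal(13, 3, n),
})
log_odds = 1.0 - 0.02 * df["age"] - 0.04 * df["educ"]     # true log odds
df["not_satisfied"] = rng.binomial(1, 1 / (1 + np.exp(-log_odds)))

X = sm.add_constant(df[["age", "educ"]])
fit = sm.Logit(df["not_satisfied"], X).fit(disp=0)

odds_ratios = np.exp(fit.params)       # Exp(B)
ci = np.exp(fit.conf_int())            # 95% CI for the odds ratios
print(pd.concat([odds_ratios.rename("OR"), ci], axis=1))
# An OR below 1 (e.g. for age here) means higher values reduce the odds of
# being "not very satisfied"; an OR above 1 means higher values increase them.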


Statistical Definitions

Binary (dichotomous) variable

A binary variable has only two values, typically 0 or 1. Similarly, a dichotomous variable is a categorical variable

with only two values. Examples include success or failure; male or female; alive or dead.

Categorical variable

A variable that can be placed into separate categories based on some characteristic or attribute. Also referred

to as “qualitative”, “discrete”, or “nominal” variables. Examples include gender, drug treatments, race or

ethnicity, disease subtypes, dosage level.

Causal relationship

A causal relationship is one in which a change in one variable can be attributed to a change in another variable.

The study needs to be designed in a way that it is legitimate to infer cause. In most cases, the term “causal

conclusion” indicates findings from an experiment in which the subjects are randomly assigned to a control or

experimental group. For instance, causality cannot be determined from a correlational research design.

Furthermore, it is important to note that a significant finding (small p-value) does not signify causality. The

medical statistician, Austin B. Hill, outlined nine criteria to establish causality in epidemiological research:

temporal relationship, strength, dose-response relationship, consistency, plausibility, consideration of alternate explanations, experiment, specificity, and coherence.

Central Limit Theorem

The Central Limit Theorem is the foundation for many statistical techniques. The theorem proposes that the larger the sample size (> 30), the more closely the sampling distribution of the mean will approach a normal distribution. The mean of the sampling distribution of the mean will approach the true population mean, and its standard deviation will be σ / √n (the population standard deviation divided by the square root of n). The population from which the sample is drawn does not need to be normally distributed. Furthermore, the Central Limit Theorem explains why the approximation improves with larger samples, as well as why sampling error is smaller with larger samples than it is with smaller samples.
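The theorem can be illustrated with a short simulation; this sketch draws repeated samples from a decidedly non-normal (exponential) population and compares the spread of the sample means with σ / √n. The population and sample sizes are arbitrary choices for the demonstration.

# Minimal sketch: illustrating the Central Limit Theorem by simulation.
# The population is exponential (skewed, not normal), yet the sample means
# cluster around the population mean with standard deviation sigma / sqrt(n).
import numpy as np

rng = np.random.default_rng(3)
sigma = 2.0                       # an exponential with scale 2 has SD = 2
n = 50                            # sample size (> 30)
means = [rng.exponential(scale=2.0, size=n).mean() for _ in range(10_000)]

print("mean of sample means:", np.mean(means))    # close to the population mean (2.0)
print("SD of sample means:  ", np.std(means))     # close to sigma / sqrt(n)
print("sigma / sqrt(n):     ", sigma / np.sqrt(n))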

Confidence interval

A confidence interval is an interval estimate of a population parameter, consisting of a range of values bounded

by upper and lower confidence limits. The parameter is estimated as falling somewhere between the two values.

Researchers can assign a degree of confidence for the interval estimate (typically 90%, 95%, 99%), indicating

that the interval will include the population parameter that percentage of the time (i.e., 90%, 95%, 99%). The

wider the confidence interval, the higher the confidence level.

Confounding variable

A confounding variable is one that obscures the effects of another variable. In other words, a confounding

variable is one that is associated with both the independent and dependent (outcome) variables, and therefore

affects the results. Confounding variables are also called extraneous variables and are problematic because the

researcher cannot be sure if results are due to the independent variable, the confounding variable, or both.

Smoking, for instance, would be a confounding variable in the relationship between drinking alcohol and lung

cancer. Therefore, a researcher studying the relationship between alcohol consumption and lung cancer should

control for the effects of smoking. A positive confounder is related to the independent and dependent variables

in the same direction; a negative confounder displays an opposite relationship to the two variables. If there is a

confounding effect, researchers can use a stratified sample and/or a statistical model that controls for the

confounding variable (e.g., multiple regression, analysis of covariance).


Continuous variables

A variable that can take on any value within the limits of its range. Continuous variables are measured on

ratio or interval scales. Examples include age, temperature, height, and weight.

Control group

In experimental research and many types of clinical trials, the control group is the group of participants that

does not receive the treatment. The control group is used for comparison and is treated exactly like the

experimental group except that it does not receive the experimental treatment. In many clinical trials, one

group of patients will be given an experimental drug or treatment, while the control group is given either a

standard treatment for the illness or a placebo (ex: sugar pill).

Covariate

A covariate is a variable that is statistically controlled for using techniques such as multiple regression analysis

or analysis of covariance. Covariates are also known as control variables and in general, have a linear relationship

with the dependent variable. Using covariates in analyses allows the researcher to produce more precise estimates of the effect of the independent variable of interest. In order to determine whether the use of a covariate is legitimate, the effect of the covariate on the residual (error) variance should be examined. If the covariate reduces the error, then it is likely to improve the analysis.

Degrees of freedom

The degrees of freedom is usually abbreviated “df” and represents the number of values free to vary when

calculating a statistic. For instance, the degrees of freedom in a 2x2 crosstab table are calculated by

multiplying the number of rows minus 1 by the number of columns minus 1. Therefore, if the totals are fixed,

only one of the four cell counts is free to vary, and the df = (2-1) (2-1) = 1.

Dependent variable

The dependent variable is the effect of interest that is measured in the study. It is termed the “dependent”

variable because it “depends” on another variable. Also referred to as outcome variables or criterion variables.

Descriptive / Inferential Statistics

Descriptive Statistics. Descriptive statistics provide a summary of the available data. The descriptive

statistics are used to simplify large amounts of data by summarizing, organizing, and graphing quantitative

information. Typical descriptive statistics include measures of central tendency (mean, median, mode) and

measures of variability or spread (range, standard deviation, variance).

Inferential Statistics. Inferential statistics allow researchers to draw conclusions or inferences from the

data. Typically, inferential statistics are used to make inferences or claims about a population based on a sample

drawn from that population. Examples include independent t tests and Analysis of Variance (ANOVA)

techniques.

Effect size

An effect size is a measure of the strength of the relationship between two variables. Sample-based effect

sizes are distinguished from test statistics used in hypothesis testing, in that they estimate the strength of an

apparent relationship, rather than assigning a significance level reflecting whether the relationship could be due

to chance. The effect size does not determine the significance level, or vice-versa. Some fields using effect

sizes apply words such as "small", "medium", and "large" to the size of the effect. Whether an effect size should be interpreted as small, medium, or large depends on its substantive context and its operational definition. Some

common measures of effect size are Cohen’s D, Cramer’s V, Odds Ratios, Standardized Beta weights, Pearson’s

R, and partial Eta squared.


Independent variable

The independent variables are typically controlled or manipulated by the researcher. Independent variables are

also used to predict the values of another variable. Furthermore, researchers often use demographic variables

(e.g., gender, race, age) as independent variables in statistical analysis. Examples of independent variables

include the treatment given to groups, dosage level of an experimental drug, gender, and race.

Measures of Central Tendency: Mean, Median, Mode

Measures of central tendency are a way of summarizing data using the value which is most typical or

representative, including the mean, median, and mode.

Mean. The mean (strictly speaking arithmetic mean) is also known as the average. It is calculated by adding

up the values for each case and dividing by the total number of cases. It is often symbolized by M or X̄ ("X-bar"). The mean is influenced by outliers and also should not be used with skewed distributions.

Median. The median is the central value of a set of values, ranked in ascending (or descending) order. Since

50% of all scores fall at or below the 50th percentile, the median is therefore the score located at the

50th percentile. The median is not influenced by extreme scores and is the preferred measure of central

tendency for a skewed distribution.

Mode. The mode is the value which occurs most frequently in a set of scores. The mode is not influenced by

extreme values.

Measures of Dispersion: Variance, Standard deviation, range

Measures of dispersion include statistics that show the amount of variation or spread in the scores, or values

of, a variable. Widely scattered or variable data results in large measures of dispersion, whereas tightly

clustered data results in small measures of dispersion. Commonly used measures of dispersion include the

variance and the standard deviation.

Variance. A measure of the amount of variability in a set of scores. Variance is calculated as the square of

the standard deviation of scores. Larger values for the variance indicate that individual cases are further

away from the mean and a wider distribution. Smaller variances indicate that individual cases are closer to

the mean and a tighter distribution. The population variance is symbolized by σ2 and the sample variance is

symbolized by s2.

Standard Deviation. A measure of spread or dispersion in a set of scores. The standard deviation is the

square root of the variance. Similar to the variance, the more widely the scores are spread out, the larger

the standard deviation. Unlike the variance, which is expressed in squared units of measurement, the

standard deviation is expressed in the same units as the measurements of the original data. In the event

that the standard deviation is greater than the mean, the mean would be deemed inappropriate as a

representative measure of central tendency. The empirical rule states that for normal distributions,

approximately 68% of the distribution falls within ± 1 standard deviation of the mean, 95% of the

distribution falls within ± 2 standard deviations of the mean, and 99.7% of the distribution falls within ± 3

standard deviations of the mean. The standard deviation is symbolized by SD or s.

Multivariate / Bivariate / Univariate

Multivariate. Quantitative methods for examining multiple variables at the same time. For instance, designs

that involve two or more independent variables and two or more dependent variables would use multivariate

analytic techniques. Examples include multiple regression analysis, MANOVA, factor analysis, and

discriminant analysis.

Bivariate. Quantitative methods that involve two variables.

Univariate. Methods that involve only one variable. Often used to refer to techniques in which there is only

one outcome or dependent variable.


Normal distribution

The normal distribution is a bell-shaped, theoretical continuous probability distribution. The horizontal axis

represents all possible values of a variable and the vertical axis represents the probability of those values. The

scores on the variable are clustered around the mean in a symmetrical, unimodal fashion. The mean, median, and

mode are all the same in the normal distribution. The normal distribution is widely used in statistical inference.

Null / Alternative Hypotheses

Null Hypothesis. In general, the null hypothesis (H0) is a statement of no effect. The null hypothesis is set

up under the assumption that it is true, and is therefore tested for rejection.

Alternative Hypothesis. The hypothesis alternative to the one being tested (i.e., the alternative to the null

hypothesis). The alternative hypothesis is denoted by Ha or H1 and is also known as the research or

experimental hypothesis. Rejecting the null hypothesis (on the basis of some statistical test) indicates that

the alternative hypothesis may be true.

Parametric / non-Parametric

Parametric Statistics: Statistical techniques that require the data to have certain characteristics

(approximately normally distributed, interval/ ratio scale of measurement). Also called inferential statistics.

Non-Parametric Statistics: Statistical techniques designed for use when the data does not conform to the

characteristics required for parametric tests. Non-parametric statistics are also known as distribution-free

statistics. Examples include the Mann-Whitney U test, Kruskal-Wallis test and Wilcoxon's (T) test. In

general, parametric tests are more robust, more complicated to compute, and have greater power efficiency.

Population / Sample

The population is the group of persons or objects that the researcher is interested in studying. To generalize

about a population, the researcher studies a sample that is representative of the population.

Power

The power of a statistical test is the probability that the test will reject the null hypothesis when the null

hypothesis is false (i.e. that it will not make a Type II error, or a false negative decision). As the power

increases, the chances of a Type II error occurring decrease. Power analysis can be used to calculate the

minimum sample size required so that one can be reasonably likely to detect an effect of a given size. Power

analysis can also be used to calculate the minimum effect size that is likely to be detected in a study using a

given sample size. In addition, the concept of power is used to make comparisons between different statistical

testing procedures: for example, between a parametric and a nonparametric test of the same hypothesis.

p-value

The p-value stands for probability value and represents the likelihood that a result is due to chance alone. More

specifically, the p-value is the probability of obtaining a result at least as extreme as the one that was actually

observed, given that the null hypothesis is true. For instance, given a p-value of 0.05 (1/20) and repeated

experiments, we would expect that in approximately every 20 replications of the experiment, there would be

one in which the relationship between the variables would be equal to or more extreme than what was found.

The p-value is compared with the alpha value set by the researcher (usually .05) to determine if the result is

statistically significant. If the p-value is less than the alpha level, the result is significant and the null

hypothesis is rejected. If the p-value is greater than the alpha level, the result is non-significant and the

researcher fails to reject the null hypothesis. When interpreting the p-value, it is important to understand the

measurement as well as the practical significance of the results. The p-value indicates significance, but does not

reveal the size of the effect. In addition, a non-significant p-value does not necessarily mean that there is no

association; rather, the non-significant result could be due to a lack of power to detect an association. In

clinical trials, the level of statistical significance depends on the number of participants studied and the

observations made, as well as the magnitude of differences observed.


Skewed distribution and other distribution shapes (bimodal, J-shaped)

Skewed Distribution. A skewed distribution is a distribution of scores or measures that produces a

nonsymmetrical curve when plotted on a graph. The distribution may be positively skewed (infrequent scores

on the high or right side of the distribution) or negatively skewed (infrequent scores on the low or left side

of the distribution). The mean, mode, and median are not equal in a skewed distribution.

Bimodal. A bimodal distribution is a distribution that has two modes. The bimodal distribution has two

values that both occur with the highest frequency in the distribution. This distribution looks like it has two

peaks where the data centers on the two values more frequently than other neighboring values.

J-Shaped. A J-shaped distribution occurs when one of the first values on either end of the distribution

occurs with the most frequency with the following values occurring less and less frequently so that the

distribution is extremely asymmetrical and roughly resembles a “J” lying on its side.

Standard error

Standard error (SE) is a measure of the extent to which the sample mean deviates from the population mean.

Another name for standard error (SE) is standard error of the mean (SEM). This alternative name gives more

insight into the standard error statistic. The standard error is the standard deviation of the means of multiple

samples from the same population. In other words, multiple samples are taken from a population and the

standard error is the standard deviation of those sample means. The standard error can be

thought of as an index to how well the sample reflects the population. The smaller the standard error, the more

the sampling distribution resembles the population.

z-score

The z-score (aka standard score) is the statistic of the standard normal distribution. The standard normal

distribution has a mean of zero and a standard deviation of 1. Raw scores can be standardized into z-scores

(thus also known as standard scores). The z-score measures the location of a raw score by its distance from the

mean in standard deviation units. Since the mean of the standard normal distribution is zero, a z-score of 1

would reflect a raw score that falls exactly one standard deviation above the mean. In the same manner, a z-score of -1

would reflect a raw score that falls exactly one standard deviation below the mean. If we were reading

standardized IQ scores (raw mean = 100, SD = 15), for example, a z-score of 1 would reflect a raw score of 115

and a z-score of -1 would reflect a raw score of 85.
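The IQ example works out as follows in a short sketch; the mean of 100 and SD of 15 are the conventional standardized-IQ values used in the paragraph above.

# Minimal sketch: converting raw IQ scores to z-scores (mean 100, SD 15).
def z_score(raw, mean=100.0, sd=15.0):
    """Distance of a raw score from the mean, in standard deviation units."""
    return (raw - mean) / sd

print(z_score(115))   #  1.0 -> one SD above the mean
print(z_score(85))    # -1.0 -> one SD below the mean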


Regression Models & SEM

Bayesian Linear Regression

Bayesian linear regression is an approach to linear regression in which the statistical analysis is undertaken

within the context of Bayesian inference. When the regression model has errors that have a normal

distribution, and if a particular form of prior distribution is assumed, explicit results are available for the

posterior probability distributions of the model's parameters. Bayesian methods can be used for any probability

distribution.

Bootstrapped Estimates

Bootstrapped estimates assume the sample is representative of the universe and do not make parametric

assumptions about the data.

Canonical Correlation

A canonical correlation is the correlation of two canonical (latent) variables, one representing a set of

independent variables, the other a set of dependent variables. Canonical correlation is used for many-to-many

relationships. There may be more than one such linear correlation relating the two sets of variables, with each

such correlation representing a different dimension by which the independent set of variables is related to the

dependent set. The purpose of canonical correlation is to explain the relation of the two sets of variables, not

to model the individual variables.

Categorical Regression

The goal of categorical regression is to describe the relationship between a response variable and a set of

predictors. It is a variant of regression which can handle nominal independent variables, but it is now largely

replaced by generalized linear models. Scale values are assigned to each category of every variable such that

these values are optimal with respect to the regression.

Cox Regression

Cox regression may be used to analyze time-to-event as well as proximity, and preference data. Cox regression

is designed for analysis of time until an event or time between events. The classic univariate example is time

from diagnosis with a terminal illness until the event of death (hence survival analysis). The central statistical

output is the hazard ratio.

Curve Estimation

Curve estimation lets the researcher explore how linear regression compares to any of 10 nonlinear models, for

the case of one independent predicting one dependent, and thus is useful for exploring which procedures and

models may be appropriate for relationships in one's data. Curve fitting compares linear, logarithmic, inverse,

quadratic, cubic, power, compound, S-curve, logistic, growth, and exponential models based on their relative

goodness of fit for models where a single dependent variable is predicted by a single independent variable or by

a time variable.

Discriminant Function Analysis

Discriminant function analysis is used when the dependent variable is a dichotomy but other assumptions of

multiple regression can be met, making it more powerful than logistic regression for binary or multinomial

dependents. Discriminant function analysis, a.k.a. discriminant analysis or DA, is used to classify cases into the

values of a categorical dependent, usually a dichotomy. If discriminant function analysis is effective for a set of

data, the classification table of correct and incorrect estimates will yield a high percentage correct.


Multiple discriminant analysis (MDA) is an extension of discriminant analysis and a cousin of multiple analysis of

variance (MANOVA), sharing many of the same assumptions and tests. MDA is used to classify a categorical

dependent which has more than two categories, using as predictors a number of interval or dummy independent

variables. MDA is sometimes also called discriminant factor analysis or canonical discriminant analysis.

Dummy Coding

Dummy variables are a way of adding the values of a nominal or ordinal variable to a regression equation. The

standard approach to modeling categorical variables is to include the categorical variables in the regression

equation by converting each level of each categorical variable into a variable of its own, usually coded 0 or 1. For

instance, the categorical variable "region" may be converted into dummy variables such as "East," "West,"

"North," or "South." Typically "1" means the attribute of interest is present (ex., South = 1 means the case is

from the region South). Of course, once the conversion is made, if we know a case's value on all the levels of a

categorical variable except one, that last one is determined. We have to leave one of the levels out of the

regression model to avoid perfect multicollinearity (singularity; redundancy), which will prevent a solution (for

example, we may leave out "North" to avoid singularity). The omitted category is the reference category

because b coefficients must be interpreted with reference to it.

The interpretation of b coefficients is different when dummy variables are present. Normally, without dummy

variables, the b coefficient is the amount the dependent variable increases when the independent variable

associated with the b increases by one unit. When using a dummy variable such as "region" in the example above,

the b coefficient is how much more the dependent variable increases (or decreases if b is negative) when the

dummy variable increases one unit (thus shifting from 0=not present to 1=present, such as South=1=case is from

the South) compared to the reference category (North, in our example). Thus for the set of dummy variables

for "Region," assuming "North" is the reference category and education level is the dependent, a b of -1.5 for

the dummy "South" means that the expected education level for the South is 1.5 years less than the average of

"North" respondents.

Entry Terms – Forward/Backward/Stepwise/Blocking/Hierarchical

Forward selection starts with the constant-only model and adds variables one at a time in the order they are

best by some criterion (see below) until some cutoff level is reached (ex., until the step at which all variables

not in the model have a significance higher than .05).

Backward selection starts with all variables and deletes one at a time, in the order they are worst by some

criterion.

Stepwise multiple regression is a way of computing OLS regression in stages. In stage one, the independent

variable best correlated with the dependent is included in the equation. In the second stage, the remaining

independent with the highest partial correlation with the dependent, controlling for the first independent, is

entered. This process is repeated, at each stage partialing for previously-entered independents, until the

addition of a remaining independent does not increase R-squared by a significant amount (or until all variables

are entered, of course). Alternatively, the process can work backward, starting with all variables and eliminating

independents one at a time until the elimination of one makes a significant difference in R-squared.

Hierarchical multiple regression (not to be confused with hierarchical linear models) is similar to stepwise

regression, but the researcher, not the computer, determines the order of entry of the variables. F-tests are

used to compute the significance of each added variable (or set of variables) to the explanation reflected in R-

square. This hierarchical procedure is an alternative to comparing betas for purposes of assessing the

importance of the independents. In more complex forms of hierarchical regression, the model may involve a

series of intermediate variables which are dependents with respect to some other independents, but are

themselves independents with respect to the ultimate dependent. Hierarchical multiple regression may then

involve a series of regressions for each intermediate as well as for the ultimate dependent.
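Under the hood, the test of each added block amounts to an F test comparing nested models (the R-squared change test). Here is a minimal Python sketch with simulated data and hypothetical predictors x1-x3.

# Minimal sketch: hierarchical (blockwise) entry -- test whether adding a block
# of predictors significantly increases R-squared. Data are simulated.
import numpy as np
import pandas as pd
import statsmodels.api as sm

rng = np.random.default_rng(4)
n = 300
df = pd.DataFrame({"x1": rng.normal(size=n), "x2": rng.normal(size=n),
                   "x3": rng.normal(size=n)})
df["y"] = 0.5 * df["x1"] + 0.3 * df["x3"] + rng.normal(size=n)

block1 = sm.OLS(df["y"], sm.add_constant(df[["x1"]])).fit()
block2 = sm.OLS(df["y"], sm.add_constant(df[["x1", "x2", "x3"]])).fit()

f_stat, p_value, df_diff = block2.compare_f_test(block1)
print("R-squared change:", block2.rsquared - block1.rsquared)
print("F for the change:", f_stat, "p =", p_value)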


Exogenous and endogenous variables.

Exogenous variables in a path model are those with no explicit causes (no arrows going to them, other than the

measurement error term). If exogenous variables are correlated, this is indicated by a double headed arrow

connecting them. Endogenous variables, then, are those which do have incoming arrows. Endogenous variables

include intervening causal variables and dependents. Intervening endogenous variables have both incoming and

outgoing causal arrows in the path diagram. The dependent variable(s) have only incoming arrows.

Factor Analysis

Factor analysis is used to uncover the latent structure (dimensions) of a set of variables. It reduces attribute

space from a larger number of variables to a smaller number of factors and does not assume a dependent

variable is specified. Exploratory factor analysis (EFA) seeks to uncover the underlying structure of a relatively

large set of variables. The researcher's a priori assumption is that any indicator may be associated with any

factor. This is the most common form of factor analysis. There is no prior theory and one uses factor loadings

to intuit the factor structure of the data. Confirmatory factor analysis (CFA) seeks to determine if the number

of factors and the loadings of measured (indicator) variables on them conform to what is expected on the basis

of pre-established theory. Indicator variables are selected on the basis of prior theory and factor analysis is

used to see if they load as predicted on the expected number of factors. The researcher's a priori assumption is that each factor (the number and labels of which may be specified a priori) is associated with a specified

subset of indicator variables. A minimum requirement of confirmatory factor analysis is that one hypothesize

beforehand the number of factors in the model, but usually also the researcher will posit expectations about

which variables will load on which factors (Kim and Mueller, 1978b: 55). The researcher seeks to determine, for

instance, if measures created to represent a latent variable really belong together.

There are several different types of factor analysis, with the most common being principal components analysis

(PCA), which is preferred for purposes of data reduction. However, common factor analysis is preferred for

purposes of causal analysis and for confirmatory factor analysis in structural equation modeling, among other

settings.

Generalized Least Squares

Generalized least squares (GLS) is an adaptation of OLS to minimize the sum of the differences between

observed and predicted covariances rather than between estimates and scores. GLS works well even for non-

normal data when samples are large (n>2500).

General Linear Model (multivariate)

Although regression models may be run easily in GLM, as a practical matter univariate GLM is used primarily to

run analysis of variance (ANOVA) and analysis of covariance (ANCOVA) models. Multivariate GLM is used

primarily to run multiple analysis of variance (MANOVA) and multiple analysis of covariance (MANCOVA) models.

Multiple regression with just covariates (and/or with dummy variables) yields the same inferences as multiple

analysis of variance (MANOVA), to which it is statistically equivalent. GLM can implement regression models

with multiple dependents.

Generalized Linear Models/Generalized Estimating Equations

GZLM/GEE are the generalization of linear modeling to a form covering almost any dependent distribution with

almost any link function, thus supporting linear regression, Poisson regression, gamma regression, and many

others. Ordinary linear regression is the special case in which a normally distributed dependent variable is analyzed using an identity link function (that is, prediction is directly of the values of the dependent).


Linear Mixed Models

Linear mixed models (LMM) handle data where observations are not independent. That is, LMM correctly models

correlated errors, whereas procedures in the general linear model family (GLM) usually do not. (GLM includes

such procedures as t-tests, analysis of variance, correlation, regression, and factor analysis, to name a few.)

LMM is a further generalization of GLM to better support analysis of a continuous dependent for:

1. random effects: where the set of values of a categorical predictor variable are seen not as the complete

set but rather as a random sample of all values (ex., the variable "product" has values representing only 5 of

a possible 42 brands). Through random effects models, the researcher can make inferences over a wider

population in LMM than possible with GLM.

2. hierarchical effects: where predictor variables are measured at more than one level (ex., reading

achievement scores at the student level and teacher-student ratios at the school level).

3. repeated measures: where observations are correlated rather than independent (ex., before-after

studies, time series data, matched-pairs designs).

LMM uses maximum likelihood estimation to estimate these parameters and supports more variations and data

options. Hierarchical models in SPSS require LMM implementation. Linear mixed models include a variety of

multi-level modeling (MLM) approaches, including hierarchical linear models, random coefficients models (RC),

and covariance components models. Note that multi-level mixed models are based on a multi-level theory which

specifies expected direct effects of variables on each other within any one level, and which specifies cross-

level interaction effects between variables located at different levels. That is, the researcher must postulate

mediating mechanisms which cause variables at one level to influence variables at another level (ex., school-level

funding may positively affect individual-level student performance by way of recruiting superior teachers, made

possible by superior financial incentives). Multi-level modeling tests multi-level theories statistically,

simultaneously modeling variables at different levels without necessary recourse to aggregation or

disaggregation. It should be noted, though, that in practice some variables may represent aggregated scores.
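As an illustration of the random-effects idea (observations nested within groups), here is a minimal Python sketch using statsmodels MixedLM; the students-within-schools setup, variable names, and data are all hypothetical.

# Minimal sketch: linear mixed model with a random intercept for a grouping
# factor (students nested within schools). Simulated, illustrative data.
import numpy as np
import pandas as pd
import statsmodels.api as sm

rng = np.random.default_rng(6)
n_schools, per_school = 20, 30
school = np.repeat(np.arange(n_schools), per_school)
school_effect = rng.normal(0, 2, n_schools)[school]   # random intercepts
ratio = rng.normal(15, 3, n_schools)[school]          # school-level predictor
reading = 50 + 0.8 * ratio + school_effect + rng.normal(0, 5, school.size)

df = pd.DataFrame({"reading": reading, "ratio": ratio, "school": school})
model = sm.MixedLM.from_formula("reading ~ ratio", data=df, groups=df["school"])
print(model.fit().summary())   # fixed effect for ratio plus the school variance component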

Logistic regression/odds ratio

Logistic Regression. Logistic regression is a form of regression that is used with dichotomous dependent

variables (usually scored 0, 1) and continuous and/or categorical independent variables. It is usually used for

predicting if something will happen or not, for instance, pass/fail, heart disease, or anything that can be

expressed as an Event or Non-Event. Logistic regression models the natural logarithm of the odds of the outcome, which reduces nonlinearity. The technique estimates the odds of an event occurring by calculating changes in the log odds of the dependent variable. Logistic regression techniques do not assume linear relationships between the independent and dependent variables, do not require normally distributed variables, and do not assume homoscedasticity. However, the observations must be independent and the independent variables must be

linearly related to the logit of the dependent variable.

Odds Ratios. An odds ratio is the ratio of two odds. An odds ratio of 1.0 indicates that the independent has no

effect on the dependent and that the variables are statistically independent. An odds ratio greater than 1

indicates that the independent variable increases the likelihood of the event. The "event" depends on the coding

of the dependent variable. Typically, the dependent variable is coded as 0 or 1, with the 1 representing the

event of interest. Therefore, a unit increase in the independent variable is associated with an increase in the

odds that the dependent equals 1 in binomial logistic regression. An odds ratio less than 1 indicates that the

independent variable decreases the likelihood of the event. That is, a unit increase in the independent variable

is associated with a decrease in the odds of the dependent being 1.


Logit Regression

Logit regression uses log-linear techniques to predict one or more categorical dependent variables. Logit models

discriminate better than probit models for high and low potencies and are therefore more appropriate when the

binary dependent is seen as representing an underlying equal distribution (large tails). The logit model is

equivalent to binary logistic regression for grouped data. The logit is the value of the left-hand side of the

equation and is the natural log of the odds ratio, p/(1-p), where p is the probability of response. Thus, if the

probability is .025, the logit = ln(.025/.975) = -3.66; if the probability is .5, the logit = ln(.5/.5) = 0; etc.
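The worked values above can be checked with a few lines of Python; the logit and its inverse are standard definitions, not anything specific to this handout.

# Minimal sketch: the logit (log odds) and its inverse.
import numpy as np

def logit(p):
    return np.log(p / (1 - p))

def inv_logit(x):
    return 1 / (1 + np.exp(-x))

print(logit(0.025))       # about -3.66
print(logit(0.5))         # 0.0
print(inv_logit(-3.66))   # back to roughly 0.025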

Multicollinearity

Multicollinearity refers to excessive correlation of the predictor variables. When correlation is excessive (some

use the rule of thumb of r > .90), standard errors of the b and beta coefficients become large, making it

difficult or impossible to assess the relative importance of the predictor variables. Multicollinearity is less

important where the research purpose is sheer prediction since the predicted values of the dependent remain

stable, but multicollinearity is a severe problem when the research purpose includes causal modeling.
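
One common diagnostic is the variance inflation factor (VIF). A minimal sketch, assuming simulated data and the
statsmodels implementation (a VIF above about 10 is a frequently cited warning sign):

import numpy as np
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

# Simulated predictors, two of them deliberately near-duplicates (e.g. age and grade)
rng = np.random.default_rng(1)
n = 200
age = rng.normal(15, 2, n)
grade = age - 5 + rng.normal(0, 0.3, n)        # almost perfectly correlated with age
income = rng.normal(50, 10, n)

X = sm.add_constant(np.column_stack([age, grade, income]))
vifs = [variance_inflation_factor(X, i) for i in range(1, X.shape[1])]
print(vifs)    # large values flag predictors that are nearly redundant with the others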

Multiple Linear Regression

Multiple Linear Regression is employed to account for (predict) the variance in an interval dependent, based on

linear combinations of interval, dichotomous, or dummy independent variables. Multiple regression can establish

that a set of independent variables explains a proportion of the variance in a dependent variable at a significant

level (through a significance test of R2), and can establish the relative predictive importance of the independent

variables. Power terms can be added as independent variables to explore curvilinear effects. Cross-product

terms can be added as independent variables to explore interaction effects. One can test the significance of

difference of two R2's to determine if adding an independent variable to the model helps significantly. Using

hierarchical regression, one can see how much variance in the dependent can be explained by one or a set of new

independent variables, over and above that explained by an earlier set. Of course, the estimates (b coefficients

and constant) can be used to construct a prediction equation and generate predicted scores on a variable for

further analysis.

The multiple regression equation takes the form y = b1x1 + b2x2 + ... + bnxn + c. The b's are the regression

coefficients, representing the amount the dependent variable y changes when the corresponding independent

changes 1 unit. The c is the constant, where the regression line intercepts the y axis, representing the amount

the dependent y will be when all the independent variables are 0. The standardized versions of the b coefficients
are the beta weights, and the ratio of the beta coefficients is the ratio of the relative predictive power of the

independent variables. Associated with multiple regression is R2, multiple correlation, which is the percent of

variance in the dependent variable explained collectively by all of the independent variables. In addition, it is

important that the model being tested is correctly specified. The exclusion of important causal variables or the

inclusion of extraneous variables can change markedly the beta weights and hence the interpretation of the

importance of the independent variables.

Multinomial Logistic Regression

Logistic regression for a categorical dependent variable with more than two levels.

Negative Binomial Regression

This is similar to the Poisson distribution, also used for count data, but is used when the variance is larger than

the mean. Typically this is characterized by "there being too many 0's." It is not assumed all cases have an equal

probability of experiencing the rare event, but rather that events may cluster. The negative binomial model is

therefore sometimes called the "overdispersed Poisson model." Values must still be non-negative integers. The

negative binomial is specified by an ancillary (dispersion) parameter, k. When k=0, the negative binomial is

identical to the Poisson distribution. The researcher may specify k or allow it to be estimated by the program.
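
A minimal sketch, assuming simulated overdispersed counts and statsmodels (which labels the dispersion parameter
alpha rather than k):

import numpy as np
import statsmodels.api as sm

# Simulated overdispersed counts (variance larger than the mean)
rng = np.random.default_rng(3)
nobs = 500
x = rng.normal(size=nobs)
mu = np.exp(0.5 + 0.7 * x)
y = rng.negative_binomial(n=2, p=2 / (2 + mu))   # gamma-Poisson mixture with mean mu

X = sm.add_constant(x)
poisson = sm.Poisson(y, X).fit(disp=False)
negbin = sm.NegativeBinomial(y, X).fit(disp=False)

print(negbin.params)             # the last element is the estimated dispersion parameter (alpha)
print(poisson.aic, negbin.aic)   # the negative binomial should fit the overdispersed data better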

Page 23: Advanced Regression Workshop - Kean University

22

Nonlinear Regression

Nonlinear regression refers to algorithms for fitting complex and even arbitrary curves to one's data using

iterative estimation when the usual methods of dealing with nonlinearity fail. Simple curves can be implemented

in general linear models (GLM) and OLS regression and in models supported by the generalized linear modeling

because the dependent is transformed by some nonlinear link function). Nonlinear regression is used to fit

curves not amenable to transformation methods. That is, it is used when the nonlinear relationship is

intrinsically nonlinear because there is no possible transformation to linearize the relationship of the

independent(s) to the dependent. Common models for nonlinear regression include logistic population growth

models and asymptotic growth and decay models.
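
For instance, a logistic growth curve can be fit by iterative estimation with SciPy's curve_fit; the following is
only a sketch on simulated data, with starting values supplied by the analyst:

import numpy as np
from scipy.optimize import curve_fit

# Logistic population-growth curve: intrinsically nonlinear in its parameters
def logistic_growth(t, K, r, t0):
    return K / (1 + np.exp(-r * (t - t0)))

rng = np.random.default_rng(4)
t = np.linspace(0, 20, 60)
y = logistic_growth(t, K=100, r=0.6, t0=10) + rng.normal(0, 3, t.size)

# Iterative least-squares estimation; p0 gives the starting values
params, cov = curve_fit(logistic_growth, t, y, p0=[80, 0.5, 8])
print(params)   # estimated K, r, t0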

Ordinal Regression

Ordinal regression is a special case of generalized linear modeling (GZLM). Ordinal regression is used with

ordinal dependent (response) variables, where the independents may be categorical factors or continuous

covariates. Ordinal regression models are sometimes called cumulative logit models. Ordinal regression typically

uses the logit link function, though other link functions are available. Ordinal regression with a logit link is also

called a proportional odds model, since the parameters of the predictor variables may be converted to odds

ratios, as in logistic regression. Ordinal regression requires assuming that the effect of the independents is the

same for each level of the dependent. Thus if an independent is age, then the effect on the dependent for a 10

year increase in age should be the same whether the difference is from age 20 to age 30 or from age 50 to
age 60. The "test of parallel lines" checks this critical assumption, which should not be taken for

granted.
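
A sketch of a proportional-odds (cumulative logit) fit, assuming simulated data and the OrderedModel class
available in recent versions of statsmodels:

import numpy as np
import pandas as pd
from statsmodels.miscmodels.ordinal_model import OrderedModel

# Simulated ordinal outcome (low / medium / high) driven by one covariate
rng = np.random.default_rng(5)
n = 400
x = rng.normal(size=n)
latent = 0.9 * x + rng.logistic(size=n)
y = pd.Series(pd.cut(latent, bins=[-np.inf, -0.5, 0.5, np.inf],
                     labels=["low", "medium", "high"]))

model = OrderedModel(y, x[:, None], distr="logit")   # proportional-odds model
result = model.fit(method="bfgs", disp=False)
print(result.params)   # one slope plus threshold (cut-point) parameters;
                       # exponentiating the slope gives a proportional odds ratio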

Ordinary Least Squares Regression

This is the common form of multiple regression, used in early, stand-alone path analysis programs. It makes

estimates based on minimizing the sum of squared deviations of the linear estimates from the observed scores.

However, even for path modeling of one-indicator variables, MLE is still preferred in SEM because MLE

estimates are computed simultaneously for the model as a whole, whereas OLS estimates are computed

separately in relation to each endogenous variable. OLS assumes similar underlying distributions but not
multivariate normality; it is less restrictive than MLE and can be a better choice when MLE's multivariate
normality assumption is severely violated.
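
In matrix notation, with design matrix X and outcome vector y, the OLS estimate minimizes the sum of squared
residuals and has the familiar closed form:

\hat{b} = \arg\min_{b} \,(y - Xb)'(y - Xb) = (X'X)^{-1}X'y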

Path Analysis

Path analysis is an extension of the regression model, used to test the fit of the correlation matrix against two

or more causal models which are being compared by the researcher. The model is usually depicted in a circle-

and-arrow figure in which single-headed arrows indicate causation. A regression is done for each variable in the

model as a dependent on others which the model indicates are causes. The regression weights predicted by the

model are compared with the observed correlation matrix for the variables, and a goodness-of-fit statistic is

calculated. The best-fitting of two or more models is selected by the researcher as the best model for

advancement of theory. Path analysis requires the usual assumptions of regression. It is particularly sensitive to

model specification because failure to include relevant causal variables or inclusion of extraneous variables

often substantially affects the path coefficients, which are used to assess the relative importance of various

direct and indirect causal paths to the dependent variable. Such interpretations should be undertaken in the

context of comparing alternative models, after assessing their goodness of fit discussed in the section on

structural equation modeling. When the variables in the model are latent variables measured by multiple

observed indicators, path analysis is termed structural equation modeling.

Page 24: Advanced Regression Workshop - Kean University

23

Partial Least Squares Regression

Partial least squares (PLS) regression is sometimes called “soft modeling” because it makes relaxed assumptions about the data. PLS can support small

sample models, even where there are more variables than observations, but it is lower in power than SEM

approaches. The advantages of PLS include ability to model multiple dependents as well as multiple independents;

ability to handle multicollinearity among the independents; robustness in the face of data noise and missing

data; and creating independent latents directly on the basis of crossproducts involving the response variable(s),

making for stronger predictions. Disadvantages of PLS include greater difficulty of interpreting the loadings of

the independent latent variables (which are based on crossproduct relations with the response variables, not

based, as in common factor analysis, on covariances among the manifest independents), and the fact that, because
the distributional properties of the estimates are not known, the researcher cannot assess significance except
through bootstrapping.
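
A minimal sketch, assuming simulated data with more predictors than cases and scikit-learn's PLSRegression (the
number of latent components is chosen by the analyst):

import numpy as np
from sklearn.cross_decomposition import PLSRegression

# Simulated "more variables than observations" problem: 30 cases, 50 collinear predictors
rng = np.random.default_rng(6)
n, p = 30, 50
latent = rng.normal(size=(n, 2))
X = latent @ rng.normal(size=(2, p)) + rng.normal(0, 0.1, (n, p))
y = latent[:, 0] - 0.5 * latent[:, 1] + rng.normal(0, 0.1, n)

pls = PLSRegression(n_components=2)   # number of latent components kept
pls.fit(X, y)
print(pls.score(X, y))                # R-squared of the fitted model
print(pls.x_loadings_.shape)          # loadings of the predictors on the latent components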

Poisson Regression

Poisson regression is a form of regression analysis used to model count variables, including contingency table
counts and event counts in event history analysis. It has a very strong assumption: the conditional variance equals the conditional

mean. A Poisson regression model is sometimes known as a log-linear model, especially when used to model

contingency tables. Data appropriate for Poisson regression do not happen very often. Nevertheless, Poisson

regression is often used as a starting point for modeling count data and Poisson regression has many extensions.

A rule of thumb is to use a Poisson rather than binomial distribution when n is 100 or more and the probability is

.05 or less. Whereas the binomial distribution is used when the variable of interest is the count of successes in a
given number of trials, the Poisson distribution is used for the count of events in a given span of time.

The Poisson distribution is also used when "event occurs" can be counted but non-occurrence cannot be counted.
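
A minimal Poisson fit, assuming simulated count data and statsmodels' GLM interface; exponentiated coefficients
are multiplicative effects on the expected count:

import numpy as np
import statsmodels.api as sm

# Simulated event counts per fixed observation period
rng = np.random.default_rng(7)
n = 400
x = rng.normal(size=n)
y = rng.poisson(np.exp(0.2 + 0.5 * x))        # conditional mean equals conditional variance

X = sm.add_constant(x)
fit = sm.GLM(y, X, family=sm.families.Poisson()).fit()
print(fit.params)             # coefficients on the log scale
print(np.exp(fit.params[1]))  # multiplicative change in the expected count per unit of x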

Probit Algorithms

Probit models are similar to logistic models but use the probit transformation (the inverse of the standard normal
cumulative distribution) of the dependent variable. Where logit and logistic regression are appropriate when the categories of the

dependent are equal or well dispersed, probit may be recommended when the middle categories have greater

frequencies than the high and low tail categories, or with binomial dependents when an underlying normal

distribution is assumed. As a practical matter, probit and logistic models yield the same substantive conclusions

for the same data the great majority of the time.
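
To see this on simulated data, the same binary outcome can be fit with both links (a sketch assuming statsmodels;
the coefficient scales differ, but the substantive conclusions rarely do):

import numpy as np
from scipy.stats import norm
import statsmodels.api as sm

# The same simulated binary data fit with a probit and a logit link
rng = np.random.default_rng(8)
n = 600
x = rng.normal(size=n)
y = rng.binomial(1, norm.cdf(-0.3 + 0.9 * x))   # underlying normal model

X = sm.add_constant(x)
probit = sm.Probit(y, X).fit(disp=False)
logit = sm.Logit(y, X).fit(disp=False)

# Coefficients differ in scale, but signs, significance, and fit are usually comparable
print(probit.params, logit.params)
print(probit.prsquared, logit.prsquared)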

R2

Also called the coefficient of multiple determination (its square root, R, is the multiple correlation), R2 is the
percent of the variance in the dependent explained uniquely or jointly by the independents. R-squared can also be
interpreted as the proportionate reduction in error in estimating the dependent when knowing the independents.
That is, R2 reflects the error made when using the regression model to predict the value of the dependent,
relative to the total error made when using only the dependent's mean as the basis for estimating all cases.
Adjusted R-Square is an adjustment for the fact that when one has a large number of independents, R2 tends to be
inflated by chance; the adjustment penalizes R2 for the number of predictors relative to the number of cases.
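
A common form of the adjustment, with n cases and k predictors, is:

R^2_{adj} = 1 - (1 - R^2)\,\frac{n - 1}{n - k - 1}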

Recursive Partitioning

Recursive partitioning creates a decision tree that attempts to correctly classify members of the population
based on several dichotomous independent variables. It creates a formula that researchers can use to calculate
the probability that a participant belongs to a particular category. For example, to classify whether a patient
has a disease, recursive partitioning creates a rule such as 'If a patient has finding x, y, or z, they probably
have disease q.' Advantages include generating clinically intuitive models that do not require the user to perform
calculations, allowing misclassifications to be prioritized differently in order to create a decision rule with
more sensitivity or specificity, and potentially greater accuracy. Disadvantages include handling continuous
variables poorly and a tendency to overfit the data.
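
A sketch of the idea using scikit-learn's decision tree on simulated dichotomous findings (the names x, y, z, and
q above are of course hypothetical):

import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

# Simulated dichotomous findings (x, y, z) and a dichotomous disease outcome
rng = np.random.default_rng(9)
n = 500
findings = rng.binomial(1, 0.3, size=(n, 3))
disease = rng.binomial(1, np.where(findings.any(axis=1), 0.8, 0.1))

tree = DecisionTreeClassifier(max_depth=3)    # limiting depth helps guard against overfitting
tree.fit(findings, disease)
print(export_text(tree, feature_names=["x", "y", "z"]))  # readable decision rules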

Page 25: Advanced Regression Workshop - Kean University

24

Spuriousness

A given bivariate correlation or beta weight may be inflated because one has not yet introduced control

variables into the model by way of partial correlation. For instance, regressing height on hair length will

generate a significant b coefficient, but only when gender is left out of the model specification (women are

shorter and tend to have longer hair).

Structural Equation Modeling

Structural equation modeling (SEM) grows out of and serves purposes similar to multiple regression, but in a

more powerful way which takes into account the modeling of interactions, nonlinearities, correlated

independents, measurement error, correlated error terms, multiple latent independents each measured by

multiple indicators, and one or more latent dependents also each with multiple indicators. SEM may be used as a

more powerful alternative to multiple regression, path analysis, factor analysis, time series analysis, and analysis

of covariance. That is, these procedures may be seen as special cases of SEM, or, to put it another way, SEM is

an extension of the general linear model (GLM) of which multiple regression is a part. Advantages of SEM

compared to multiple regression include more flexible assumptions (particularly allowing interpretation even in

the face of multicollinearity), use of confirmatory factor analysis to reduce measurement error by having

multiple indicators per latent variable, the attraction of SEM's graphical modeling interface, the desirability of

testing models overall rather than coefficients individually, the ability to test models with multiple dependents,

the ability to model mediating variables rather than be restricted to an additive model, the ability to model
error terms, the ability to test coefficients across multiple between-subjects groups, and the ability to handle

difficult data (time series with autocorrelated error, non-normal data, incomplete data). Moreover, where

regression is highly susceptible to error of interpretation by misspecification, the SEM strategy of comparing

alternative models to assess relative model fit makes it more robust. SEM is usually viewed as a confirmatory

rather than exploratory procedure, using one of three approaches:

1. Strictly confirmatory approach

2. Alternative models approach

3. Model development approach

Regardless of approach, SEM cannot itself draw causal arrows in models or resolve causal ambiguities.

Theoretical insight and judgment by the researcher are still of utmost importance. SEM is a family of statistical

techniques which incorporates and integrates path analysis and factor analysis. In fact, use of SEM software

for a model in which each variable has only one indicator is a type of path analysis. Use of SEM software for a

model in which each variable has multiple indicators but there are no direct effects (arrows) connecting the

variables is a type of factor analysis. Usually, however, SEM refers to a hybrid model with both multiple

indicators for each variable (called latent variables or factors), and paths specified connecting the latent

variables. Synonyms for SEM are covariance structure analysis, covariance structure modeling, and analysis of

covariance structures. Although these synonyms rightly indicate that the analysis of covariance structures is the focus of

SEM, be aware that SEM can also analyze the mean structure of a model.

Suppression

Suppression occurs when the omitted variable has a positive causal influence on the included independent and a

negative influence on the included dependent (or vice versa), thereby masking the impact the independent would

have on the dependent if the third variable did not exist. Note that when the omitted variable has a suppressing

effect, coefficients in the model may underestimate rather than overestimate the effect of those variables on

the dependent.

Page 26: Advanced Regression Workshop - Kean University

25

Two-Stage Least-Squares Regression

Two-stage least squares regression (2SLS) is a method of extending regression to cover models which violate

ordinary least squares (OLS) regression's assumption of recursivity (all arrows flow one way, with no feedback

looping), specifically models where the researcher must assume that the disturbance term of the dependent

variable is correlated with the cause(s) of the independent variable(s). Second, 2SLS is used for the same

purpose to extend path analysis, except that in path models there may be multiple endogenous variables rather

than a single dependent variable. Third, 2SLS is an alternative to maximum likelihood estimation (MLE) in

estimating path parameters of non-recursive models with correlated error among the endogenous variables in

structural equation modeling (SEM). Fourth, 2SLS can be used to test for selection bias in quasi-experimental

studies involving a treatment group and a comparison group, in order to reject the hypothesis that self-

selection or other forms of selection into the two groups accounts for differences in the dependent variable.
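
A hand-rolled sketch of the two stages on simulated data, using ordinary OLS from statsmodels at each stage
(dedicated 2SLS routines also correct the second-stage standard errors, which this sketch does not):

import numpy as np
import statsmodels.api as sm

# Simulated endogeneity: x is correlated with the disturbance of y,
# and z is an instrument (related to x but not to the disturbance)
rng = np.random.default_rng(10)
n = 1000
z = rng.normal(size=n)
u = rng.normal(size=n)                 # shared disturbance creating the endogeneity
x = 0.8 * z + 0.6 * u + rng.normal(size=n)
y = 1.0 + 0.5 * x + u + rng.normal(size=n)

# Stage 1: regress the endogenous predictor on the instrument
stage1 = sm.OLS(x, sm.add_constant(z)).fit()
x_hat = stage1.fittedvalues

# Stage 2: replace x with its instrumented (predicted) values
stage2 = sm.OLS(y, sm.add_constant(x_hat)).fit()
print(stage2.params)   # slope should recover roughly 0.5; plain OLS of y on x would be biased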

Weight Estimation

One of the critical assumptions of OLS regression is homoscedasticity: that the variance of residual error

should be constant for all values of the independent(s). Weighted least squares (WLS) regression compensates

for violation of the homoscedasticity assumption by weighting cases differentially: cases whose values on the
dependent variable are associated with large error variance at their values of the independent variable(s) count
less, and those associated with small error variance count more, in estimating the regression coefficients. That
is, cases with greater weights

contribute more to the fit of the regression line. The result is that the estimated coefficients are usually very

close to what they would be in OLS regression, but under WLS regression their standard errors are smaller.

Apart from its main function in correcting for heteroscedasticity, WLS regression is sometimes also used to

adjust fit to give less weight to distant points and outliers, or to give less weight to observations thought to be

less reliable.
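
A minimal sketch on simulated heteroscedastic data, assuming statsmodels and weights equal to the inverse of the
assumed error variance:

import numpy as np
import statsmodels.api as sm

# Simulated heteroscedastic data: residual spread grows with x
rng = np.random.default_rng(11)
n = 300
x = rng.uniform(1, 10, n)
y = 2.0 + 0.5 * x + rng.normal(0, 0.4 * x)     # error standard deviation increases with x

X = sm.add_constant(x)
ols = sm.OLS(y, X).fit()
wls = sm.WLS(y, X, weights=1.0 / x**2).fit()   # weight = inverse of the (assumed) error variance

print(ols.params, wls.params)   # coefficients are usually similar...
print(ols.bse, wls.bse)         # ...but the WLS standard errors reflect the weighting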

Zero-Inflated Regression

Zero-inflated models attempt to account for excess zeros. In other words, two kinds of zeros are thought to

exist in the data, "true zeros" and "excess zeros". Zero-inflated models estimate two equations simultaneously,

one for the count model and one for the excess zeros. One common cause of over-dispersion is excess zeros
generated by an additional data-generating process. If the data-generating process does not allow for any 0s (such as the

number of days spent in the hospital), then a zero-truncated model may be more appropriate.
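
A sketch, assuming simulated data and the ZeroInflatedPoisson class in statsmodels' count-model module; the model
fits a logit equation for the excess zeros alongside the Poisson count equation:

import numpy as np
import statsmodels.api as sm
from statsmodels.discrete.count_model import ZeroInflatedPoisson

# Simulated counts with excess zeros: a second process generates "structural" zeros
rng = np.random.default_rng(12)
n = 800
x = rng.normal(size=n)
always_zero = rng.binomial(1, 0.3, n)                  # the excess-zero process
counts = rng.poisson(np.exp(0.4 + 0.6 * x))
y = np.where(always_zero == 1, 0, counts)

X = sm.add_constant(x)
zip_model = ZeroInflatedPoisson(y, X, exog_infl=np.ones((n, 1)), inflation="logit")
result = zip_model.fit(disp=False)
print(result.params)   # parameters for the inflation (excess-zero) part and the Poisson count part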

Page 27: Advanced Regression Workshop - Kean University

26