# Linear regression

- I want to perform a regression analysis. Do all my variables have to be continuous (quantitative)?
- My dependent variable is ordinal: measured on a Likert scale. Does this count as continuous?
- I want to do a multiple regression analysis. Which assumptions do I need to check?

- What are residual values?
- How and why do I check independence of the data?
- What should I do if independence of the data is violated?
- How and why do I check the assumptions of normality, homoscedasticity and linearity?
- What if normality is violated?
- I have some extreme values in my data. What do I do now?
- What if linearity is violated?
- What if homoscedasticity is violated?
- How do I check multicollinearity?
- How do I deal with multicollinearity?
- How can I check and avoid influential cases?

- When and how do I control for some predictors in a regression analysis?

## I want to perform a regression analysis. Do all my variables have to be continuous (quantitative)?

This differs between the independent and dependent variables. So we discuss them separately.

*Dependent variable (DV)*

For a linear regression analysis, your dependent variable must be continuous (quantitative): interval or ratio level. If your dependent variable is not continuous, you can’t do linear regression.

But fortunately, there is more than only linear regression.

If your dependent variable is categorical and it has only two categories (levels), this is called a dichotomous or binary variable. Then you can do a logistic regression. This is something you probably have not learned in your statistics courses, but you can find out more about it here in Dutch and here in English. Do discuss with your supervisor if it is required of you to learn this.

If your dependent variable is categorical with three or more categories, ordinal and multinomial regression analyses exist, but they are complicated and not what we would recommend. It is probably a better idea to choose a different type of analysis. Alternatively, you can combine categories so that only two remain and perform a logistic regression (see above).

*Independent variables (IVs)*

Your independent variables should be either continuous (quantitative) or dichotomous (categorical with two groups). If you have a dichotomous variable, make sure the values are numeric. A common way is to score one group as 0 in your variable and the other as 1. However, if you have interactions with the dichotomous variable in your model, interpretation gets easier if you center the variable, either by using the codes -0.5 and 0.5 for the two categories, or by centering or standardizing the variable.
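The effect of the two coding schemes on interpretation can be sketched in a few lines of Python. This is only an illustration with made-up data and hypothetical names (`x_a`, `y_a`, and so on). It relies on the fact that a model with a continuous predictor, a dichotomous variable and their interaction fits a separate straight line in each group, so its coefficients can be derived from the two group slopes.

```python
from statistics import mean

def slope(x, y):
    """Least-squares slope of y on x for a single predictor."""
    x_bar, y_bar = mean(x), mean(y)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    return sxy / sxx

# Made-up data for two groups (A and B) on predictor x and DV y.
x_a, y_a = [1, 2, 3, 4, 5], [2, 3, 4, 5, 6]     # slope 1 in group A
x_b, y_b = [1, 2, 3, 4, 5], [1, 4, 7, 10, 13]   # slope 3 in group B

slope_a, slope_b = slope(x_a, y_a), slope(x_b, y_b)

# Model: y = b0 + b1*x + b2*g + b3*x*g
b1_dummy = slope_a                     # g coded 0 (A) and 1 (B): slope in group A only
b1_centered = (slope_a + slope_b) / 2  # g coded -0.5 and 0.5: average of both slopes
b3 = slope_b - slope_a                 # interaction weight, identical in both codings

print(b1_dummy, b1_centered, b3)
```

With 0/1 coding, the weight of x describes only the group coded 0. With -0.5/0.5 coding, it describes the average slope across both groups, which is usually what you want to report as the "main effect".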

If you have both qualitative and quantitative variables, you may prefer to do an ANCOVA instead. Then you can also have more than 2 levels for your qualitative variable.

## My dependent variable is ordinal: measured on a Likert scale. Does this count as continuous?

As long as your scores are total scores (sum or mean) across multiple Likert items, there probably is no problem. Such scale scores are considered to be continuous (quantitative), although it depends on the number of items your total score is based on: the more, the better.

If you have only one item, then technically it is not quantitative, but only ordinal. In general, we would not recommend treating the scores of a single item as continuous (quantitative), so you might want to discuss this with your supervisor. But in some circumstances you might consider the scores on one item as continuous. If you want to do this, you should meet two criteria:

1) You should have at least 5 (but more is better) separate response categories;

2) It should be reasonable to consider the categories as equally spaced (equal step sizes between successive categories) **in a psychological sense**, at least roughly so. Yes, this is vague; you have to think about it for each specific case. But do give it a chance: for instance, for a standard Likert item running from ‘absolutely disagree’ to ‘absolutely agree’ through, say, 7 categories, it may not seem unreasonable to assume that participants base their response on a roughly equal-distance interpretation of the options offered.

If you decide to consider the single item score as a continuous (quantitative) variable and run a linear regression, you still have to check the other assumptions, of course. If the scores on the item turn out to be really skewed, there is a considerable risk that the residuals will violate the normality assumption. In such a case, you may be forced to transform the item scores into a dichotomous (0-1) variable anyway, and resort to a logistic regression instead.

## I want to do a multiple regression analysis. Which assumptions do I need to check?

A regression analysis only gives trustworthy results if the assumptions are met. You also need a few more checks. Specifically, you need to check each of the following:

- Data should be independent.
- Normality: residuals are normally distributed
- Homoscedasticity: residuals are homoscedastic. This is sometimes also called homogeneity of variance.
- Linearity: the relationship(s) between IV(s) and DV is linear.
- Avoid multicollinearity (i.e. highly correlated IVs)
- Check for influential cases

### What are residual values?

Many assumptions are based on the residual values, so you need to know what they are first. In short: residual values (also called residuals) are the parts of the observed scores that your model cannot explain.

In a regression analysis, the program makes a prediction of every participant’s score on the DV, based on their score on the IV(s). If there is a perfect linear relation (R2 = 1), this predicted score is exactly the same as the actual, observed score of that participant. But this is never the case: in real life data, your model isn’t perfect, so the observed scores of the participants are a bit higher or lower than predicted. For each participant, the difference between their predicted score by the model and their observed score is their residual score (sometimes called error score).

In the figure below the residual scores are visualized as red lines showing the distance between the observed and predicted scores. In this model, children’s reading scores were used to predict their math scores. The best fitting regression line is drawn in blue. Since the relation between reading and math is not perfect, the individual scores are mostly not exactly on the line but they are higher or lower. The red lines show how much higher or lower for each observation. In other words, the lengths of the red lines are the residual scores.

The aim of your model is to have small residuals, because that means your model predictions of the DV are accurate. The smaller your residuals, the higher your R2. A statistical program estimates the *b* weights in a regression analysis in such a way that the residual values are as small as possible. Or to be precise: the sum of the *squared* residuals is minimized.
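To make this concrete, here is a minimal Python sketch that fits a regression line with one predictor and computes the predicted scores, the residuals and R2. The data and the variable names (`reading`, `math_score`) are made up for illustration; SPSS does the same calculation for you.

```python
from statistics import mean

# Made-up example data: reading scores (IV) and math scores (DV).
reading = [2.0, 3.0, 5.0, 6.0, 8.0, 9.0]
math_score = [3.0, 4.0, 4.0, 7.0, 8.0, 10.0]

def fit_line(x, y):
    """Return the least-squares intercept and slope for one predictor."""
    x_bar, y_bar = mean(x), mean(y)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope

intercept, slope = fit_line(reading, math_score)

# Predicted score and residual (observed minus predicted) for each case.
predicted = [intercept + slope * xi for xi in reading]
residuals = [yi - pi for yi, pi in zip(math_score, predicted)]

# R2 = 1 - (sum of squared residuals) / (total sum of squares of the DV).
ss_residual = sum(e ** 2 for e in residuals)
ss_total = sum((yi - mean(math_score)) ** 2 for yi in math_score)
r_squared = 1 - ss_residual / ss_total

print(f"slope = {slope:.2f}, intercept = {intercept:.2f}, R2 = {r_squared:.3f}")
```

Any other choice of intercept and slope would give a larger sum of squared residuals for these data, and therefore a lower R2.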

### How and why do I check independence of the data?

Independence is an issue in the design of your study. The data of any two distinct cases (i.e., rows in your SPSS file) should be independent of each other, in the sense that the outcomes for one case do not give you any information about what to expect for the other case. You cannot test this assumption with your data. You can only take care of it while collecting your data: participants should not take part in your experiment twice and be entered in the data as if they were different people. Likewise, a father and a mother should not report about the same child and be entered in different rows. Note that it is fine if a father and a mother report in different variables on the same row! Then the data of both belong to the same case.

And also: if you do not study humans as participants in your thesis but you observe locations, you shouldn’t observe the same few locations repeatedly and enter these on different rows.

Independence is a very important assumption. You cannot trust the results of your analyses if data are dependent.

### What should I do if independence of the data is violated?

See if you can take care of this in your analysis, for instance with repeated measures ANOVA, or with the more advanced mixed effects models (but this is research master level). Otherwise: remove cases from your data set such that you only have independent cases left.

### How and why do I check the assumptions of normality, homoscedasticity and linearity?

Do not run statistical tests to check these assumptions. The main problem with statistical tests to check the assumptions is that they are sensitive to sample size in the wrong way. They miss problematic violations in small samples, and they do detect very small and harmless violations in large samples.

So instead, you make plots and inspect them. There are several possibilities for these plots. We show one of them below.

Note that, in contrast to common belief, the assumptions of *normality* and *homoscedasticity* (also called *homogeneity of variance*) are not about your variables themselves. Rather, these assumptions are about residual values (or in short: residuals). The assumption of *linearity* is about your variables themselves, but can also be checked with residuals, in the same plot as homoscedasticity. So the easiest way to go is to make plots with the residual values and check all three assumptions together.

*Making the plots*

While you are running your regression analysis in SPSS, click ‘plots’. You need two plots.

1) For normality: check the box with ‘histogram’. This makes a histogram of your residuals.

2) For homoscedasticity and linearity, make a scatterplot with the standardized predicted scores (called \*ZPRED) on the X-axis and the standardized residual values (called \*ZRESID) on the Y-axis.

*Checking the plots*

*Normality*

The histogram of the residual values should approximately look like a normal distribution. Note that normality is often stressed the most, but it is actually the least important assumption. Most regular analyses are quite robust against violations of normality, especially if your sample size is not too small (at least 30). That means your data don’t have to look perfectly normal.

*Homoscedasticity and linearity*

The scatter plot can be used to inspect both homoscedasticity and linearity.

Here you see the ideal case: the assumptions of linearity and homoscedasticity are met.

In order to meet the assumption of **homoscedasticity**, for every point on the x-axis, the variation of the residuals should be the same. In other words, the cloud should roughly look like a rectangle that has the same height everywhere from left to right.

For **linearity**, there should not be a curvilinear pattern. You can draw a Loess fit line in the plot to check this (double click on the plot; choose Add fit line at total; the Loess option gives you the Loess fit line). The Loess line is not restricted to be straight, but is allowed to curve when needed, to better fit the data. In the ideal linear world, the Loess fit line is a horizontal line at Y = 0. Small, local, random looking deviations are always there and are no problem. Bigger, systematic deviations from the Y = 0 line point to a violation of the linearity assumption.

Below you see an example of an SPSS plot with a Loess line. If the line looks like this, there is nothing to worry about.

The other panels show what violations look like. The violations are quite extreme, to give a clear example of what you should be looking for. In real life, you’ll rarely find such extreme cases. Or well, you should hope you never will.

Here you see what happens if the assumption of homoscedasticity (‘same variance’) is violated: in other words, there is heteroscedasticity. The shape of the cloud is a cone instead of a rectangle. (The red lines are added for clarity).

On the left, for low predicted scores, the residuals are small, meaning that here the model prediction is very good: low predicted scores are very close to the actual scores. But for high predicted scores the prediction quality is poor, with many large residuals.

Such a diverging fan pattern from left to right is the most common form of violation of homoscedasticity. But any pattern in which the spread of the residuals around 0 varies systematically from left to right, so that the cloud is not rectangular, constitutes a violation of homoscedasticity.

This panel shows a violation of linearity. The best line through this scatterplot is clearly not a horizontal line through 0; it is curved.

### What if normality is violated?

First, remember that your variables themselves don’t have to follow a normal distribution, only your residuals. Second, normality is the least important assumption. Most regular statistics techniques are robust against deviations from normality. Especially if your sample size is large, deviations from normality are not that problematic.

If normality of the residuals is strongly violated and your sample is small, you have three options. First, if the residuals have a skewed distribution, the dependent variable is probably skewed as well, and you can consider using a non-linear transformation of the dependent variable.

Second, if there is a distinct floor or ceiling effect (e.g., many participants have a score that is equal to or very close to the highest or lowest score), such a transformation will not help. In such a case you may consider recoding your DV into a binary variable. For example: recode into 0 = absence and 1 = presence of the characteristic measured by the DV. If you recode the DV into a binary variable like this, the planned linear regression has to be replaced by a logistic regression.

Finally, you can use a bootstrapping procedure. Without going into details what that means: just tell SPSS to bootstrap. It’s one of the options.

However, if you find an extremely skewed plot, meaning that many scores cluster at one of the extremes, that could be a reason for concern. Especially if the skewness arises because many scores are at or close to the lowest possible value (floor effect) or the maximum score (ceiling effect), you can recode your dependent variable into a binary variable. For instance: recode into 0 = absence and 1 = presence of a characteristic. You then proceed with a logistic regression analysis.

If your plot looks fine overall but you find one or a few extreme values, this means that these cases have a score that is very different from what the model predicted. See the section below.

### I have some extreme values in my data. What do I do now?

First of all, note that a few extreme values are fine. Any data set will have those. You should expect about 5% of your cases to be more extreme than 2 standard deviations away from the mean, and 0.3% will be more extreme than 3 standard deviations.
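These percentages follow from the normal distribution. As a small check, assuming the scores are (roughly) normally distributed, the two-sided tail proportions can be computed with Python's standard library. The function name `share_beyond` is just a label chosen for this sketch.

```python
import math

def share_beyond(k):
    """Proportion of a normal distribution lying more than k SDs from the mean (both tails)."""
    return math.erfc(k / math.sqrt(2))

for k in (2, 3):
    print(f"more than {k} SD from the mean: {share_beyond(k):.2%}")
# more than 2 SD from the mean: 4.55%
# more than 3 SD from the mean: 0.27%
```

So in a sample of 200 cases, you would expect roughly 9 cases beyond 2 standard deviations purely by chance.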

Always check whether those outliers are correct data, and that they didn’t arise because you made a mistake in data entry, or because you used 999 to label missing values but forgot to tell SPSS.

If the data are correct observations: you do not have to remove outliers by definition, only if you have a reason to do this. For example, in a reaction time task, you might delete "outlier" trials that are extremely short or extremely long, because you think that really short responses indicate a guess and really long responses indicate a lapse in attention instead of participants doing the task correctly.

But for other measures, outliers do not per se mean that participants should be excluded. You can have very slow participants, or very smart participants, or extremely extraverted participants. They do exist, so there is no direct reason to throw them out.

The advice with this kind of outlier is that you do the analysis twice: once with all cases included and once with the outliers removed. And then you hope you reach the same conclusions (see for example http://pss.sagepub.com/content/22/11/1359.full.pdf+html, p. 1363, requirement 5).

So you check if *p*-values and effect sizes differ substantially. If so, report both results. If not, you can mention briefly that you did this and that the results didn’t differ substantially, and report only the results from the full dataset in detail.

All of this assumes that the residuals look fine except for these outliers. If they show more general violations of normality, you have to counteract these first.

Note that instead of checking for outliers, it is often better to check whether you have influential cases: cases that have an overly large influence on your results. Outliers may be influential cases but they don’t have to be.

### What if linearity is violated?

This means that the relation between an IV and DV is not linear but takes a different shape. You can transform a variable (for instance, take the log, square or square root) to make the relation become more linear. If the non-linearity concerns one specific IV, transformation of this IV is recommended. If there is a more general pattern of non-linear DV-IV relationships, a transformation of the DV may work best, but only if the deviation from linearity is in the same ‘direction’ for all IVs: either a concave-up (cup-like) curvature for all, or a concave-down (cap-like) curvature for all.

But be careful, especially if you have interaction terms in your model. A transformation may introduce an interaction effect that wasn’t really there, or it may remove one that was there.

But also if you don’t have an interaction term, you need to take the transformation into account. So you have to report that there is a relation between X and the log of Y, for example.
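As an illustration of how a transformation can straighten a curved relation, the following Python sketch uses made-up data in which Y grows exponentially with X. It compares the Pearson correlation of X with Y to the correlation of X with the log of Y. The data and the helper `pearson_r` are assumptions for this example only.

```python
import math
from statistics import mean

def pearson_r(a, b):
    """Pearson correlation between two equally long lists."""
    a_bar, b_bar = mean(a), mean(b)
    sab = sum((ai - a_bar) * (bi - b_bar) for ai, bi in zip(a, b))
    saa = sum((ai - a_bar) ** 2 for ai in a)
    sbb = sum((bi - b_bar) ** 2 for bi in b)
    return sab / math.sqrt(saa * sbb)

# Made-up data: y grows exponentially with x, so the relation is curved.
x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [2 * math.exp(0.5 * xi) for xi in x]

log_y = [math.log(yi) for yi in y]   # transformed DV

r_raw = pearson_r(x, y)       # linear fit to a curved relation
r_log = pearson_r(x, log_y)   # relation between x and log(y) is exactly linear here

print(f"r(x, y) = {r_raw:.3f}, r(x, log y) = {r_log:.3f}")
```

In this constructed case the log transformation makes the relation perfectly linear. With real data the improvement will be smaller, and the results then describe the relation between X and the log of Y, not Y itself.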

### What if homoscedasticity is violated?

This is a problem. The estimate of the *b* weights is reliable but the *t*-values and *p*-values are not. So you cannot trust the significance of your results. In fact, the homoscedasticity assumption is far more important than the normality assumption.

In this case, sometimes a transformation of the dependent variable can help. For example, take the square root or the log of the dependent variable. Keep in mind, however, that the interpretation of the *b* weights is not straightforward anymore. And see above for more complications with transformation of the data.

Another option is to run a robust regression analysis. This paper contains a tutorial on how to do this in SPSS. Don’t get scared by the formulas: you can scroll through them. The tutorial part starts on page 11.

### How do I check multicollinearity?

If some of your IVs are highly correlated with other IVs, it becomes impossible to estimate the unique contribution of each variable. Multicollinearity results in unreliable regression coefficients (b’s).

When you run your analysis, click the ‘statistics’ button and ask for collinearity diagnostics. This gives you the tolerance and the VIF (Variance Inflation Factor), as additional columns in the Coefficients table.

The tolerance of a predictor tells you how much of the variance in this predictor cannot be explained by the other predictors. If this value is very low, there is multicollinearity.

The VIF gives in fact the same information. To be precise: VIF = 1/tolerance.

The rules of thumb here are that there is an issue when:

- Tolerance < .20 (so VIF > 5): multicollinearity: estimates are unreliable. It is wise to do something to fix this.
- Tolerance < .10 (so VIF > 10): strong multicollinearity: estimates are very unreliable. You really have to do something to fix this.
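The rules of thumb above can be tried out with a small Python sketch. It covers only the simplest case of two predictors, where the variance one predictor shares with the other is the squared correlation between them. With more predictors, the tolerance is 1 minus the R2 obtained when regressing that predictor on all the others, which is what SPSS reports. The data and names (`mother_report`, `father_report`) are made up.

```python
import math
from statistics import mean

def pearson_r(a, b):
    """Pearson correlation between two equally long lists."""
    a_bar, b_bar = mean(a), mean(b)
    sab = sum((ai - a_bar) * (bi - b_bar) for ai, bi in zip(a, b))
    saa = sum((ai - a_bar) ** 2 for ai in a)
    sbb = sum((bi - b_bar) ** 2 for bi in b)
    return sab / math.sqrt(saa * sbb)

# Made-up scores on two predictors that overlap strongly.
mother_report = [10, 12, 13, 15, 18, 20, 22, 25]
father_report = [11, 11, 14, 16, 17, 21, 21, 26]

r = pearson_r(mother_report, father_report)
tolerance = 1 - r ** 2   # share of variance NOT explained by the other predictor
vif = 1 / tolerance

if tolerance < 0.10:
    verdict = "strong multicollinearity"
elif tolerance < 0.20:
    verdict = "multicollinearity"
else:
    verdict = "no problem by these rules of thumb"

print(f"r = {r:.2f}, tolerance = {tolerance:.3f}, VIF = {vif:.1f}: {verdict}")
```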

### How do I deal with multicollinearity?

If you have multicollinearity in your data, you might combine multiple IVs in a combined score. Do check first if this makes sense, that is, if the IVs indeed measure a strongly similar construct and they can be combined. For example, if you have scores from mothers and from fathers and these are strongly correlated, you can combine the two into a parent score by taking the mean score. It is also possible to leave one of the two similar variables out of the analysis (e.g., use only the mother’s score).

### How can I check and avoid influential cases?

An influential case deviates so much from the general pattern that it has a large effect on the regression line. Without this case, the regression line would look very different. The data below are the same, except that in the right panel one influential case is added.

Especially when your sample is not that large, an extreme observation can be so extreme that it ‘pulls’ the best fitting line towards itself. In the example above, the influential case present in panel B ‘creates’ a clearly positive, possibly highly significant linear relation between two variables, which disappears if this single case is left out (see panel A). The effect of an influential case may also be in the reverse direction: it may ‘destroy’ a linear relationship that is present without this single case. The problem in general is that a single influential case substantially changes the pattern found in your model.

In the example above, you can find the influential value just by looking at the plots. Normally, you would look for influential cases with Cook’s Distance. When specifying your regression model in SPSS, you can ask for Cook’s Distance under SAVE. After running your model, you will now find a new column in your data set (so not in the output!) called ‘COO_##’. Cases with higher values have a larger influence.

There is no clear rule for when a value of Cook’s distance is too high. A guideline is to look for values that are larger than 4/n, where n = the number of cases in your data set. But it can also be useful to check whether there are one or a few values that are notably higher than the rest, regardless of the exact value. (You can easily find them if you sort your COO variable in descending order. If you wish, you can also create a histogram, for instance by using FREQUENCY > CHARTS.)
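For readers who want to see what is behind the saved column, here is a minimal Python sketch of Cook's distance for a regression with one predictor, using the standard textbook formula (squared residual, leverage and mean squared error). The data are made up, with one deliberately extreme last case, and the 4/n cutoff is the guideline mentioned above. This is an illustration, not a description of SPSS's internal code.

```python
from statistics import mean

# Made-up data; the last case is far away from the rest.
x = [1, 2, 3, 4, 5, 6, 7, 20]
y = [2.1, 1.9, 2.2, 2.0, 1.8, 2.1, 2.0, 9.0]

n, p = len(x), 2   # p = number of estimated coefficients (intercept + slope)
x_bar, y_bar = mean(x), mean(y)
sxx = sum((xi - x_bar) ** 2 for xi in x)
slope = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sxx
intercept = y_bar - slope * x_bar

residuals = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]
mse = sum(e ** 2 for e in residuals) / (n - p)

# Leverage: how unusual each case is on the predictor.
leverage = [1 / n + (xi - x_bar) ** 2 / sxx for xi in x]

# Cook's distance combines the size of the residual with the leverage.
cooks = [e ** 2 / (p * mse) * h / (1 - h) ** 2 for e, h in zip(residuals, leverage)]

cutoff = 4 / n
flagged = [i for i, d in enumerate(cooks) if d > cutoff]
print("cases above 4/n (0-based row index):", flagged)
```

Note that the extreme case does not have the largest residual here: it pulls the line towards itself. That is why Cook's distance, and not the residual alone, is used to find influential cases.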

If you have identified potentially influential cases, then do the same as with outliers: first check that you didn’t make a data entry mistake or forget to tell SPSS that 999 means missing. Then run your model with and without the potentially influential cases, and see whether your conclusions differ (i.e., if *p*-values and effect sizes differ substantially). If so, report both results. If not, you can mention that you did both analyses and that the results didn’t differ substantially, and then proceed to report the results of the analysis including all your data in detail.

## When and how do I control for some predictors in a regression analysis?

Suppose that you think that one or more variables other than the IV of interest might influence the DV. However, you are not really interested in these other variables; that is, they are not part of your research questions.

By just including these variables in your regression model, you already control for them. That is, the meaning of the coefficient of each predictor variable is what happens to the DV if the predictor variable increases by 1, while all the other variables remain constant. So you take into account their effects.

But maybe you also want to know whether adding your IV has additional value in the prediction of the DV. If so, you can use ‘Blocks’ in SPSS: hierarchical regression.

When performing your regression analysis, enter your control variable(s) in the first block. Then enter your main predictors in the second block. Ask for R2 change under Statistics.

This R2 change shows how much your explained variance improves when you add the variables of interest to the model, compared to your control variables only.
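The idea of R2 change can be sketched in Python as follows. The helper `ols_r_squared` is a hypothetical, bare-bones least-squares routine written for this example (it solves the normal equations directly), and the data on `age`, `practice` and `score` are made up. SPSS reports the same quantity, together with a significance test, when you ask for R2 change.

```python
def ols_r_squared(predictors, y):
    """R2 of a least-squares model with intercept; predictors is a list of columns."""
    n = len(y)
    rows = [[1.0] + [col[i] for col in predictors] for i in range(n)]
    k = len(rows[0])
    # Normal equations (X'X) b = X'y, stored as an augmented matrix.
    a = [[sum(r[i] * r[j] for r in rows) for j in range(k)]
         + [sum(r[i] * yi for r, yi in zip(rows, y))] for i in range(k)]
    # Gauss-Jordan elimination with partial pivoting.
    for c in range(k):
        pivot = max(range(c, k), key=lambda r: abs(a[r][c]))
        a[c], a[pivot] = a[pivot], a[c]
        for r in range(k):
            if r != c:
                f = a[r][c] / a[c][c]
                a[r] = [v - f * w for v, w in zip(a[r], a[c])]
    b = [a[i][k] / a[i][i] for i in range(k)]
    y_bar = sum(y) / n
    ss_res = sum((yi - sum(bj * xj for bj, xj in zip(b, r))) ** 2
                 for r, yi in zip(rows, y))
    ss_tot = sum((yi - y_bar) ** 2 for yi in y)
    return 1 - ss_res / ss_tot

# Made-up data: age is the control variable, practice is the IV of interest.
age = [8, 9, 10, 11, 12, 13, 14, 15]
practice = [3, 1, 4, 2, 6, 3, 7, 5]
score = [12, 11, 17, 15, 23, 19, 27, 25]

r2_block1 = ols_r_squared([age], score)            # block 1: control variable only
r2_block2 = ols_r_squared([age, practice], score)  # block 2: control + IV of interest
r2_change = r2_block2 - r2_block1

print(f"R2 block 1 = {r2_block1:.3f}, R2 block 2 = {r2_block2:.3f}, "
      f"R2 change = {r2_change:.3f}")
```

Because the second model contains everything the first model contains, R2 can never go down when you add a block. The question is only whether the increase is large enough to matter.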

## Still can't figure it out?

Then you can contact the Statistical Methodological Advice Center (SMAP).