Frequently asked questions about linear regression
The aim of your model is to have small residuals, because that means your model predictions of the DV are accurate. The smaller your residuals, the higher your R². A statistical program estimates the b weights in a regression analysis in such a way that the (absolute) residual values are as small as possible. Or to be precise: the sum of the squared residuals is minimized.
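If you want to see what that minimization looks like in code (outside SPSS), here is a minimal sketch in Python with statsmodels; the data set and variable names (df, dv, iv1, iv2) are made up purely for the illustration.

```python
# Minimal sketch (Python/statsmodels, not SPSS) of 'minimizing the sum of
# squared residuals'. All variable names here are illustrative.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(1)
df = pd.DataFrame({'iv1': rng.normal(size=100), 'iv2': rng.normal(size=100)})
df['dv'] = 2 + 0.5 * df['iv1'] - 0.3 * df['iv2'] + rng.normal(size=100)

model = smf.ols('dv ~ iv1 + iv2', data=df).fit()   # estimates the b weights
residuals = df['dv'] - model.fittedvalues          # observed minus predicted DV
print(np.sum(residuals ** 2))                      # the quantity OLS makes as small as possible
```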
How and why do I check independence of the data?
Independence is an issue in the design of your study. The data of any two distinct cases (i.e., rows in your SPSS file) should be independent of each other, in the sense that the particular outcomes for one case do not give you any information on what to expect as outcomes for the other case. You cannot test this assumption with your data. You can only take care of it while collecting your data: participants should not take part in your experiment twice and be entered in the data as if they are different people. And a father and a mother should not report about the same child and be entered in different rows. Note that it is perfectly fine if a father and a mother report in different variables on the same row! Then the data of both belong to the same case.
The same goes if you do not study human participants in your thesis but observe locations: you shouldn't observe the same few locations repeatedly and enter these observations on different rows.
Independence is a very important assumption. You cannot trust the results of your analyses if data are dependent.
What should I do if independence of the data is violated?
See if you can take care of this in your analysis, for instance with repeated measures ANOVA, or with the more advanced mixed effects models (but this is research master level). Otherwise: remove cases from your data set such that you only have independent cases left.
How and why do I check the assumptions of normality, homoscedasticity and linearity?
Do not run statistical tests to check these assumptions. The main problem with statistical tests of the assumptions is that they are sensitive to sample size in the wrong way: they miss problematic violations in small samples, and they detect very small and harmless violations in large samples.
So instead, you make plots and inspect them. There are several possibilities for these plots. We show one of them below.
Note that in contrast to common belief, the assumptions of normality and homoscedasticity (also called homogeneity of variance) are not about your variables themselves. Rather, these assumptions are about residual values (or in short: residuals). The assumption of linearity is about your variables themselves, but can also be checked with residuals, in the same plot for homoscedasticity. So the easiest way to go is to make plots with the residual values and check all three assumptions together.
Making the plots
While you are running your regression analysis in SPSS, click ‘plots’. You need two plots.
- For normality: check the box with ‘histogram’. This makes a histogram of your residuals.
- For homoscedasticity and linearity: make a scatterplot with the standardized predicted scores (*ZPRED) on the X-axis and the standardized residuals (*ZRESID) on the Y-axis.
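If you prefer to make these two plots outside SPSS, the sketch below shows one way to do it with matplotlib; it reuses the fitted model from the earlier Python sketch, and the z-scores are only an approximation of SPSS's *ZPRED and *ZRESID.

```python
# Sketch of the two diagnostic plots outside SPSS; 'model' is a fitted
# statsmodels OLS results object (see the earlier sketch).
import matplotlib.pyplot as plt

z_pred = (model.fittedvalues - model.fittedvalues.mean()) / model.fittedvalues.std()
z_resid = model.resid / model.resid.std()

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
ax1.hist(z_resid, bins=20)                       # normality: histogram of the residuals
ax1.set_title('Standardized residuals')
ax2.scatter(z_pred, z_resid)                     # homoscedasticity and linearity
ax2.axhline(0, linestyle='--')
ax2.set_xlabel('Standardized predicted value (*ZPRED)')
ax2.set_ylabel('Standardized residual (*ZRESID)')
plt.tight_layout()
plt.show()
```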
Checking the plots
Normality
The histogram of the residual values should approximately look like a normal distribution. Note that normality is often stressed the most, but it is actually the least important assumption. Most regular analyses are quite robust against violations of normality, especially if your sample size is not too small (at least 30). That means your data don’t have to look perfectly normal.
Homoscedasticity and linearity
The scatter plot can be used to inspect both homoscedasticity and linearity. Here you see the ideal case: the assumptions of linearity and homoscedasticity are met.
In order to meet the assumption of homoscedasticity, for every point on the x-axis, the variation of the residuals should be the same. In other words, the cloud should roughly look like a rectangle that has the same height everywhere from left to right.
For linearity, there should not be a curvilinear pattern. You can draw a Loess fit line in the plot to check this (double click on the plot; choose Add fit line at total; the Loess option gives you the Loess fit line). The Loess line is not restricted to be straight, but is allowed to curve when needed, to better fit the data. In the ideal linear world, the Loess fit line is a horizontal line at Y = 0. Small, local, random looking deviations are always there and are no problem. Bigger, systematic deviations from the Y = 0 line point to a violation of the linearity assumption.
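If you work outside SPSS, you can draw a comparable smoothed line yourself; the sketch below uses the lowess smoother in statsmodels and the z_pred and z_resid values from the earlier plot sketch.

```python
# Sketch: a lowess (locally weighted) fit line through the residual scatter,
# comparable to SPSS's Loess fit line. z_pred and z_resid as in the plot sketch.
import matplotlib.pyplot as plt
from statsmodels.nonparametric.smoothers_lowess import lowess

smooth = lowess(z_resid, z_pred)                   # (x, fitted) pairs, sorted by x
plt.scatter(z_pred, z_resid, alpha=0.5)
plt.plot(smooth[:, 0], smooth[:, 1], color='red')  # ideally hovers around y = 0
plt.axhline(0, linestyle='--')
plt.show()
```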
Below you see an example of an SPSS plot with a Loess line. If the line looks like this, there is nothing to worry about.
The other panels show what violations look like. The violations are quite extreme, to give a clear example of what you should be looking for. In real life, you'll never find such extreme cases. Or well, you should hope you never will.
Here you see what happens if the assumption of homoscedasticity (‘same variance’) is violated: in other words, there is heteroscedasticity. The shape of the cloud is a cone instead of a rectangle. (The red lines are added for clarity).
On the left, for low predicted scores, the residuals are small, meaning that here the model prediction is very good: low predicted scores are very close to the actual scores. But for high predicted scores the prediction quality is poor, with many large residuals.
Such a diverging fan pattern from left to right is the most common form of violation of homoscedasticity. But any pattern in which the spread of the residuals around 0 varies systematically from left to right (so no rectangular shape, but something cone-shaped or otherwise diagonal instead) constitutes a violation of homoscedasticity.
This panel shows a violation of linearity. The best line through this scatterplot is clearly not a horizontal line through 0; it is curved.
What if normality is violated?
First, remember that your variables themselves don’t have to follow a normal distribution, only your residuals. Second, normality is the least important assumption. Most regular statistics techniques are robust against deviations from normality. Especially if your sample size is large, deviations from normality are not that problematic.
If normality of the residuals is strongly violated and your sample is small, you have three options. First, if the residuals have a skewed distribution, the dependent variable will typically be very skewed as well, and you can consider using a non-linear transformation of the dependent variable.
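As a sketch of what such a transformation could look like outside SPSS (assuming the illustrative df from the earlier sketches and a DV with non-negative scores):

```python
# Sketch: a log transformation of a right-skewed DV; assumes the DV scores are
# non-negative (use np.log instead if they are strictly positive).
import numpy as np
import statsmodels.formula.api as smf

df['dv_log'] = np.log1p(df['dv'])                  # log(1 + dv) handles zeros
model_log = smf.ols('dv_log ~ iv1 + iv2', data=df).fit()
```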
Second, if there is a distinct floor or ceiling effect (e.g., many participants have a score that is equal to or very close to the highest or lowest score), such a transformation will not help. In such a case you may consider recoding your DV into a binary variable. For example: recode into 0 = absence and 1 = presence of the characteristic measured by the DV. If you recode a DV into an ordinal or a nominal variable, the planned linear regression has to be replaced by a logistic regression.
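A sketch of this recoding and the switch to logistic regression, again with the illustrative variables from the earlier sketches and a purely arbitrary cutoff:

```python
# Sketch: recoding the DV into 0 = absent / 1 = present and running a logistic
# regression instead. The cutoff used here is purely illustrative.
import statsmodels.formula.api as smf

df['dv_present'] = (df['dv'] > 0).astype(int)      # 1 = characteristic present
logit_model = smf.logit('dv_present ~ iv1 + iv2', data=df).fit()
print(logit_model.summary())
```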
Finally, you can use a bootstrapping procedure. Without going into details about what that means: just tell SPSS to bootstrap. It's one of the options.
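For those curious what bootstrapping roughly does, here is a sketch of a case-resampling bootstrap in Python; the number of resamples and the percentile interval are common choices, not fixed rules.

```python
# Sketch of a case-resampling bootstrap for one b weight, the idea behind the
# bootstrap option. Uses df from the earlier illustrative sketch.
import numpy as np
import statsmodels.formula.api as smf

boot_b = []
for _ in range(1000):
    resample = df.sample(n=len(df), replace=True)  # draw cases with replacement
    boot_b.append(smf.ols('dv ~ iv1 + iv2', data=resample).fit().params['iv1'])

lower, upper = np.percentile(boot_b, [2.5, 97.5])  # 95% percentile interval
print(f'bootstrap 95% CI for the b weight of iv1: [{lower:.3f}, {upper:.3f}]')
```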
However, if the histogram is extremely skewed, meaning that many scores cluster at one of the extremes, that can be a reason for concern. Especially when the skewness is caused by a floor effect (many scores at or close to the lowest possible value) or a ceiling effect (many scores at or close to the maximum), recoding the dependent variable into a binary variable and proceeding with a logistic regression, as described above, is the way to go.
If your plot looks fine overall but you find one or a few extreme values, this means that these cases have a score that is very different from what the model predicted. See the section below.
I have some extreme values in my data. What do I do now?
First of all, note that a few extreme values are fine. Any data set will have those. You should expect about 5% of your cases to be more extreme than 2 standard deviations away from the mean, and 0.3% will be more extreme than 3 standard deviations.
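You can verify these percentages for a normal distribution with a two-line check (Python/scipy):

```python
# Quick check of the percentages under a normal distribution.
from scipy import stats
print(2 * stats.norm.sf(2))   # ~0.046: roughly 5% of cases beyond 2 SDs
print(2 * stats.norm.sf(3))   # ~0.003: roughly 0.3% of cases beyond 3 SDs
```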
Always check whether those outliers are correct data and didn't arise because you made a mistake in data entry, or because you used 999 to label missing values but forgot to tell SPSS.
If the data are correct observations: you do not have to remove outliers by definition, only if you have a reason to do this. For example, in a reaction time task, you might delete "outlier" trials that are extremely short or extremely long, because you think that really short responses indicate a guess and really long responses indicate a lapse in attention instead of participants doing the task correctly.
But for other measures, outliers do not per se mean that participants should not be included. You can have very slow participants, or very smart participants, or extremely extraverted participants. They do exist, so there is no direct reason to throw them out.
The advice with this kind of outliers is to do the analysis twice: once with all cases included and once with the outliers removed. And then you hope you reach the same conclusions (see for example http://pss.sagepub.com/content/22/11/1359.full.pdf+html, p. 1363, requirement 5).
So you check if p-values and effect sizes differ substantially. If so, report both results. If not, you can mention briefly that you did this and that the results didn’t differ substantially, and report only the results from the full dataset in detail.
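A sketch of this 'analyze twice' approach outside SPSS, with an arbitrary cutoff of 3 standard deviations purely for the illustration:

```python
# Sketch: fitting the model with and without outlying cases. 'model' and 'df'
# are from the earlier illustrative sketch; the |z| > 3 cutoff is arbitrary.
import statsmodels.formula.api as smf

z = model.resid / model.resid.std()
model_trimmed = smf.ols('dv ~ iv1 + iv2', data=df[z.abs() < 3]).fit()

print(model.params, model.pvalues)                   # all cases included
print(model_trimmed.params, model_trimmed.pvalues)   # outliers removed
```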
All of this assumes that the residuals look fine except for these outliers. If they show more general violations of normality, you have to counteract these first.
Note that instead of checking for outliers, it is often better to check whether you have influential cases: cases that have an overly large influence on your results. Outliers may be influential cases, but they don't have to be.
What if linearity is violated?
This means that the relation between an IV and DV is not linear but takes a different shape. You can transform a variable (for instance, take the log, square or square root) to make the relation become more linear. If the non-linearity concerns one specific IV, transformation of this IV is recommended. If there is a more general pattern of non-linear DV-IV relationships, a transformation of the DV may work best, but only if the deviation from linearity is in the same ‘direction’ for all IVs: either a concave-up (cup-like) curvature for all, or a concave-down (cap-like) curvature for all.
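As an illustration of transforming an IV outside SPSS (using the made-up variables from the earlier sketches):

```python
# Sketch: squaring one IV inside the model formula as an example transformation;
# a log (np.log) or square root (np.sqrt) works similarly but requires positive
# scores on that IV.
import statsmodels.formula.api as smf

model_transformed = smf.ols('dv ~ I(iv1 ** 2) + iv2', data=df).fit()
print(model_transformed.summary())
```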
But be careful, especially if you have interaction terms in your model. A transformation may introduce an interaction effect that wasn’t really there, or it may remove one that was there.
But even if you don't have an interaction term, you need to take the transformation into account when interpreting and reporting your results. So you have to report that there is a relation between X and the log of Y, for example.
What if homoscedasticity is violated?
This is a problem. The estimate of the b weights is reliable but the t-values and p-values are not. So you cannot trust the significance of your results. In fact, the homoscedasticity assumption is far more important than the normality assumption.
In this case, sometimes a transformation of the dependent variable can help. For example, take the square root or the log of the dependent variable. Keep in mind, however, that the interpretation of the b weights is not straightforward anymore. And see above for more complications with transformations of the data.
Another option is to run a robust regression analysis. This paper contains a tutorial on how to do this in SPSS. Don't get scared by the formulas: you can scroll past them. The tutorial part starts on page 11.
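If you work outside SPSS, one common robust option (not necessarily the same procedure as in the tutorial) is to keep the ordinary b weights and use heteroscedasticity-consistent standard errors; a sketch with statsmodels:

```python
# Sketch: heteroscedasticity-consistent (HC3) standard errors. The b weights
# stay the same; the standard errors, t- and p-values are corrected.
import statsmodels.formula.api as smf

robust_model = smf.ols('dv ~ iv1 + iv2', data=df).fit(cov_type='HC3')
print(robust_model.summary())
```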
How do I check multicollinearity?
If some of your IVs are highly correlated with other IVs, it becomes impossible to estimate the unique contribution of each variable. Multicollinearity results in unreliable regression coefficients (b’s).
When you run your analysis, click the ‘statistics’ button and ask for collinearity diagnostics. This gives you the tolerance and the VIF (Variance Inflation Factor), as additional columns in the Coefficients table.
The tolerance of a predictor tells you how much of the variance in this predictor cannot be explained by the other predictors. If this value is very low, there is multicollinearity.
The VIF in fact gives the same information; to be precise, VIF = 1/tolerance.
The rules of thumb here are that there is an issue when:
- Tolerance < .20 (so VIF > 5): multicollinearity: estimates are unreliable. It is wise to do something to fix this.
- Tolerance < .10 (so VIF > 10): strong multicollinearity: estimates are very unreliable. You really have to do something to fix this.
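Outside SPSS you can compute the same diagnostics yourself; a sketch with statsmodels (the predictor columns iv1 and iv2 are illustrative):

```python
# Sketch: tolerance and VIF per predictor. A constant is added because the VIF
# function works on the full design matrix.
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

X = sm.add_constant(df[['iv1', 'iv2']])
for i, name in enumerate(X.columns):
    if name == 'const':
        continue
    vif = variance_inflation_factor(X.values, i)
    print(f'{name}: VIF = {vif:.2f}, tolerance = {1 / vif:.2f}')
```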
How do I deal with multicollinearity?
If you have multicollinearity in your data, you might combine multiple IVs in a combined score. Do check first if this makes sense, that is, if the IVs indeed measure a strongly similar construct and they can be combined. For example, if you have scores from mothers and from fathers and these are strongly correlated, you can combine the two into a parent score by taking the mean score. It is also possible to leave one of the two similar variables out of the analysis (e.g., use only the mother’s score).
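A one-line sketch of such a combined score, with made-up column names:

```python
# Sketch: combining two strongly correlated IVs into a single mean score; the
# columns mother_score and father_score are illustrative, not from the text.
df['parent_score'] = df[['mother_score', 'father_score']].mean(axis=1)
```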
How can I check and avoid influential cases?
An influential case deviates so much from the general pattern that it has a large effect on the regression line. Without this case, the regression line would look very different. The data below are the same, except that in the right panel one influential case is added.
Especially when your sample is not that large, an extreme observation can be so extreme that it 'pulls' the best fitting line towards itself. In the example above, the influential case present in panel B 'creates' a clearly positive, possibly highly significant linear relation between two variables, which disappears if this single case is left out (see panel A). The effect of an influential case may also be in the reverse direction: it may 'destroy' a linear relationship that is present without this single case. The problem in general is that a single influential case substantially changes the pattern found by your model.
In the example above, you can find the influential value just by looking at the plots. Normally, you would look for influential cases with Cook’s Distance. When specifying your regression model in SPSS, you can ask for Cook’s Distance under SAVE. After running your model, you will now find a new column in your data set (so not in the output!) called ‘COO_##’. Cases with higher values have a larger influence.
There is no clear rule for when a value of Cook's distance is too high. A guideline is to look for values that are larger than 4/n, where n = the number of cases in your data set. But it can also be useful to check whether there are one or a few values that are notably higher than the rest, regardless of the exact value. (You can easily find them if you sort your COO variable in descending order. If you wish, you can also create a histogram, for instance via FREQUENCIES > CHARTS.)
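Outside SPSS, a sketch of the same check with statsmodels (model and df as in the earlier sketches):

```python
# Sketch: Cook's distance and the 4/n guideline; 'model' is a fitted
# statsmodels OLS results object and 'df' the data it was fitted on.
import pandas as pd

cooks_d = pd.Series(model.get_influence().cooks_distance[0], index=df.index)
print(cooks_d.sort_values(ascending=False).head())   # most influential cases first
print(cooks_d[cooks_d > 4 / len(df)])                 # cases above the 4/n guideline
```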
If you have identified potentially influential cases, do the same as with outliers: first check that you didn't make a data entry mistake or forget to tell SPSS that 999 means missing. Then run your model with and without the potentially influential cases, and see whether your conclusions differ (i.e., whether p-values and effect sizes differ substantially). If so, report both results. If not, you can mention that you did both analyses and that the results didn't differ substantially, and then proceed to report the results of the analysis including all your data in detail.