Having discussed two assumptions of linear regression (linearity between the predictors and the response variable, and the bell-curvishness of the residuals), we need to introduce a third.
Linear regression also assumes that the residuals have a consistently shaped distribution around zero, regardless of the values of the predictors. This means that whatever the number of F500 headquarters and sub-500 employee businesses in a state, the magnitude of the residuals for lawyers (the difference between the actual number of private-practice lawyers in the state and what the model with the two predictor variables estimates) doesn’t change too much. In statistical terms, the variance of the residuals stays constant. Variance, since you ask, measures the dispersion of a set of numbers around their mean.
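For readers who prefer symbols, the constant-variance idea can be written out roughly as follows. The notation is an assumption made for illustration and does not come from the model output: e_i is the residual for state i, \bar{e} is the mean residual, n is the number of states, x_{i1} and x_{i2} are the two predictors, and \sigma^2 is one fixed number. The first expression is one common definition of variance (a sample version divides by n - 1 instead of n); the second says the spread is the same for every state.

```latex
\operatorname{Var}(e) = \frac{1}{n} \sum_{i=1}^{n} \left( e_i - \bar{e} \right)^2,
\qquad
\operatorname{Var}\left( e_i \mid x_{i1}, x_{i2} \right) = \sigma^2 \quad \text{for every state } i
```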
One way to eyeball whether the assumption holds is to plot the fitted response values on the bottom axis against their residuals (how far off each fitted value is) on the vertical axis. Cleverly called a “Residuals vs. Fitted” plot, it should display no pattern, and the spread of points around zero should be similar regardless of the fitted value.
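For anyone who wants to draw such a plot, here is a minimal sketch in Python. It is not the code behind the plot in this post. The data are made up (they are not the lawyer data set), and the function names `fit_and_residuals` and `plot_residuals_vs_fitted`, the variable names, and the output file name are all hypothetical. It assumes NumPy is installed, plus Matplotlib if you call the plotting function.

```python
import numpy as np


def fit_and_residuals(X, y):
    """Fit ordinary least squares with an intercept; return (fitted, residuals)."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    design = np.column_stack([np.ones(len(y)), X])  # prepend an intercept column
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    fitted = design @ coef
    return fitted, y - fitted


def plot_residuals_vs_fitted(fitted, residuals, path="residuals_vs_fitted.png"):
    """Scatter residuals against fitted values, with a dashed line at zero."""
    import matplotlib
    matplotlib.use("Agg")  # render to a file, so no display is needed
    import matplotlib.pyplot as plt

    fig, ax = plt.subplots()
    ax.scatter(fitted, residuals)
    ax.axhline(0, linestyle="--")
    ax.set_xlabel("Fitted values")
    ax.set_ylabel("Residuals")
    ax.set_title("Residuals vs. Fitted")
    fig.savefig(path)
    plt.close(fig)


# Made-up data, NOT the lawyer data set: 50 pretend states, two predictors.
rng = np.random.default_rng(0)
hq = rng.integers(0, 55, size=50)             # hypothetical F500 headquarters counts
small_biz = rng.uniform(10_000, 450_000, 50)  # hypothetical sub-500 employee businesses
lawyers = 500 + 300 * hq + 0.15 * small_biz + rng.normal(0, 2_000, 50)

fitted, residuals = fit_and_residuals(np.column_stack([hq, small_biz]), lawyers)
```

Calling `plot_residuals_vs_fitted(fitted, residuals)` writes the picture to a file. Because the made-up noise has the same spread everywhere, the points should form a shapeless band around the dashed zero line, which is what the assumption asks for.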
Basically, this assumption of linear regression asks how spread out the residuals are and whether that “spread” varies with the magnitude of the predictors. Cover your eyes if large words make you queasy, because what you want is a compact and even spread, with roughly similar numbers of points on both sides of a smoothed line: the mouth-filler known affectionately as homoscedasticity.
The plot for our model, however, tells us that the residuals lose that desirable compactness and consistency as the fitted number of lawyers reaches its largest values on the right. New York stands at the far top right, with an unusual combination of many F500 headquarters (54) and sub-500 employee companies (454,718) that apparently predicts a number quite different from the 96,000 lawyers in real life. On the left of the plot, with two exceptions, the residuals have relatively uniform variance, but not on the right. We added a smoothing curve, and since it veers off the zero line at two points, this is a sign of systematic under- or over-prediction in certain ranges: the errors depend on the size of the predicted value.
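An uneven spread like the one described above can also be checked with numbers instead of by eye. The sketch below is one rough way to do it, not a formal test: sort the observations by fitted value, cut them into a few groups, and compare the standard deviation of the residuals in each group. The function name `spread_by_bin` and the toy numbers are hypothetical, and it uses only Python's standard library.

```python
from statistics import pstdev


def spread_by_bin(fitted, residuals, n_bins=3):
    """Standard deviation of residuals within each bin of fitted values.

    Observations are sorted by fitted value and split into n_bins groups of
    nearly equal size, ordered from the smallest fitted values to the largest.
    Assumes there are at least n_bins observations.
    """
    pairs = sorted(zip(fitted, residuals))
    n = len(pairs)
    spreads = []
    for b in range(n_bins):
        start = b * n // n_bins
        stop = (b + 1) * n // n_bins
        group = [resid for _, resid in pairs[start:stop]]
        spreads.append(pstdev(group))
    return spreads


# Toy numbers for illustration only, not the lawyer data.
fitted_demo = list(range(1, 13))
steady = [1, -1] * 6  # same spread at every fitted value
fanning = [0.5 * f * s for f, s in zip(fitted_demo, steady)]  # spread grows with fitted value

steady_spreads = spread_by_bin(fitted_demo, steady)
fanning_spreads = spread_by_bin(fitted_demo, fanning)
```

Roughly equal values across the bins, as in `steady_spreads`, are consistent with constant variance. Values that climb from bin to bin, as in `fanning_spreads`, point to the fanning-out seen on the right of a plot like ours.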
Later we will discuss a possible solution for this kind of situation: transforming the response variable.