As discussed before, in a linear regression model that your firm or department can rely on, the relationship between predictor variables and the response variable must be linear. Additionally, the residuals of the model must be normally distributed.
We should unpack that last sentence.
A residual is the difference between the actual value of the response variable and the value the linear regression model estimates for it based on all of the predictor variables, such as F500 headquarters. All regression methods produce the line (or, when more than one predictor variable is in the formula, the hyperplane) that leaves the smallest total of squared residuals. We will explain later the mathematics that minimizes residuals.
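For readers who want to see the idea concretely, here is a minimal sketch in Python with made-up numbers (not data from the book) showing that a residual is simply the actual response value minus the value the fitted line predicts:

```python
import numpy as np

# Illustrative data (invented for this sketch): one predictor x,
# one response y -- say, F500 headquarters vs. some legal metric.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

# Fit y = b0 + b1*x by ordinary least squares.
b1, b0 = np.polyfit(x, y, 1)

# A residual is the actual response minus the model's prediction.
predicted = b0 + b1 * x
residuals = y - predicted

# Least squares picks b0 and b1 so the sum of the squared
# residuals is as small as possible.
print(np.round(residuals, 2))
```

One consequence of fitting with an intercept is that the residuals sum to zero: the line balances the points above it against the points below it.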
OK, so residuals are about how far off the real data points are from the regression line. What about a ‘normal distribution’ of those residuals? A distribution is statistics-speak for a group of numbers. If the numbers in a distribution are plotted on a graph by how many of them fall at each value, the distribution is normal if the shape is reasonably close to the often-seen bell curve: relatively few numbers far out on the tails to either side, and most of them clustered and piled up toward the middle.
Here is a histogram [For more on histograms, click here for my book on law firm data and graphics] of the residuals from two predictors: F500 headquarters and the number of enterprises in the state that have fewer than 500 employees. You have to imagine a three-dimensional cube where the bottom axis is one predictor, the axis going back is the other predictor, and the vertical axis is the response variable. Regression software creates a best-fit two-dimensional plane through the points, and the residual is the vertical distance from each data point to that plane.
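The same software step can be sketched in a few lines of Python. The numbers below are invented stand-ins for the two predictors and the response, used only to show how the best-fit plane and its residuals are computed:

```python
import numpy as np

# Hypothetical stand-ins (made-up numbers) for the two predictors:
# F500 headquarters and small enterprises in each state.
f500      = np.array([2.0, 5.0, 1.0, 8.0, 3.0, 6.0])
small_biz = np.array([40.0, 55.0, 30.0, 80.0, 45.0, 60.0])
response  = np.array([10.0, 18.0, 7.0, 30.0, 13.0, 21.0])

# Design matrix with an intercept column; lstsq finds the
# coefficients of the best-fit plane.
X = np.column_stack([np.ones_like(f500), f500, small_biz])
coefs, *_ = np.linalg.lstsq(X, response, rcond=None)

# Residual = vertical distance from each data point to the plane.
residuals = response - X @ coefs
print(np.round(residuals, 2))
```

A histogram of `residuals` is what the figure described above shows: one bar count for each range of residual values.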
You can make out a partial bell curve covering most of the residuals, except that three residuals stick far out on the right tail of the histogram. We should investigate those states’ values, because they could be outliers arising from a mistake in the data. Meanwhile, however, the shape reasonably resembles a bell and therefore satisfies the assumption that the residuals be normally distributed.
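Flagging residuals worth investigating can itself be automated. Here is one common screening approach (a robust rule based on the median, not a method the book prescribes), applied to made-up residuals shaped like the histogram described, with three values far out on the right tail:

```python
import numpy as np

# Invented residuals: a rough bell with three far-right stragglers.
residuals = np.array([-1.2, -0.8, -0.5, -0.3, 0.1, 0.2, 0.4, 0.6,
                      0.9, 1.1, 4.8, 5.3, 6.1])

# Robust rule of thumb: measure each residual's distance from the
# median, scaled by the median absolute deviation (MAD), which a
# few extreme values cannot inflate the way the mean and standard
# deviation can.
med = np.median(residuals)
mad = np.median(np.abs(residuals - med))
robust_z = np.abs(residuals - med) / (1.4826 * mad)

# Flag anything more than 3 scaled units from the median.
suspects = residuals[robust_z > 3]
print(suspects)
```

With these numbers, exactly the three right-tail values are flagged, and those are the states whose underlying data deserves a second look.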
As an aside, when your regression model has more than one predictor variable, it’s harder to visualize a best-fit “line”. If you have only two predictors, you can picture a plane as the best fit — as if a stiff piece of paper represents the ‘line.’ But with more predictors the mind boggles at visualizing a hyperplane. Software has no such frailties and will figure out the residual of each point no matter how many predictor variables there are.