Machine learning for lawyers: collinearity in linear regression

We have reached the fourth assumption for valid multiple linear regression: correlations between predictor variables cannot be too high. If the correlation between any pair of predictor variables is close to 1 or -1, problems arise. One is that it becomes difficult to separate out the individual effects of those variables on the response. Predictor variables (aka independent variables) should be independent of each other.

For example, the correlation between the number of F500 headquarters in a state and the number of business establishments with fewer than 500 employees is very high, at 0.90. The two counts track each other closely, which makes sense: states have more or less vigorous business activity whether measured at the huge-corporation stratum or the small-business stratum.
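If you want to check a number like that yourself, here is a minimal sketch in Python with pandas. The file name states.csv and the states DataFrame are hypothetical stand-ins for the real data; the column names F500 and Less500 match the variable labels used in the plots below.

```python
# Minimal sketch: the Pearson correlation between two state-level counts.
# Assumes a hypothetical CSV with one row per state and columns F500
# (Fortune 500 headquarters) and Less500 (businesses with fewer than
# 500 employees).
import pandas as pd

states = pd.read_csv("states.csv")  # hypothetical file of state-level data

r = states["F500"].corr(states["Less500"])  # Pearson correlation by default
print(round(r, 2))  # the post reports roughly 0.90
```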

The plot below shows how closely the values for each state of those two variables march together toward the upper right. The ellipse emphasizes that the association holds particularly strongly at the smaller values of the two variables.
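A scatter plot along those lines could be sketched with matplotlib as follows; the emphasizing ellipse is left out to keep the sketch short, and the states DataFrame is the same hypothetical stand-in as above.

```python
# Sketch of the scatter plot described above: each point is one state.
import matplotlib.pyplot as plt

fig, ax = plt.subplots()
ax.scatter(states["F500"], states["Less500"])
ax.set_xlabel("Fortune 500 headquarters (F500)")
ax.set_ylabel("Businesses with fewer than 500 employees (Less500)")
plt.show()
```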

Collinearity is the term for this undesirable situation: predictor variables rise or fall closely in unison. To the degree they do, one of them is redundant, and the effective number of predictors is smaller than the actual number of predictors. Other weaknesses that collinearity causes in a linear model we will leave until later.
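A standard numeric diagnostic for this redundancy, which we won't dwell on here, is the variance inflation factor (VIF). Here is a sketch with statsmodels, again assuming the hypothetical states DataFrame; the predictor names match the ones that appear in the plot below.

```python
# Sketch of a variance inflation factor (VIF) check. A VIF near 1 means
# a predictor is nearly independent of the others; values above roughly
# 5 or 10 are common warning signs of collinearity.
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

predictors = states[["F500", "enrollment", "Less500", "prison09", "HS"]]
X = sm.add_constant(predictors)  # the VIF calculation assumes an intercept

for i, name in enumerate(X.columns):
    if name != "const":
        print(name, round(variance_inflation_factor(X.values, i), 1))
```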

To see how closely predictors correlate with each other, you can unleash a correlation matrix plot. For various reasons, we haven't included some of the state data in the plot, such as population, area, and GDP.
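For the curious, here is one way such a plot could be sketched with seaborn. The layout (scatter plots in the lower triangle, density plots on the diagonal, correlations printed in the upper triangle) mirrors the three insights described next; the states DataFrame remains our hypothetical stand-in.

```python
# Sketch of a correlation matrix plot: scatter plots below the diagonal,
# densities on the diagonal, Pearson correlations above it.
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

cols = ["F500", "enrollment", "Less500", "prison09", "HS"]

def corr_text(x, y, **kwargs):
    # write the pair's Pearson correlation in the middle of the panel
    r = np.corrcoef(x, y)[0, 1]
    plt.gca().annotate(f"{r:.3f}", xy=(0.5, 0.5),
                       xycoords="axes fraction", ha="center")

g = sns.PairGrid(states[cols])
g.map_lower(sns.scatterplot)  # pairwise scatter plots
g.map_diag(sns.kdeplot)       # distribution of each predictor
g.map_upper(corr_text)        # pairwise correlations as numbers
plt.show()
```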

This intimidating plot offers three kinds of insights. First, it shows a scatter plot of each predictor variable against each of the other predictors. For example, in the first column (F500), the second cell down plots that variable on the horizontal axis against law school enrollment (enrollment, the second row) on the vertical axis.

Second, the diagonal from the top left down the middle contains density plots that display how each predictor variable is distributed. For example, in the column for the number of businesses that have fewer than 500 employees (Less500), the density plot bulges high on the left, which means that quite a few states have relatively few of those enterprises while a handful, stretching out to the far right, have many.

Third, the upper triangle prints the correlation of each predictor against the others. As an example, the number of prisoners in 2009 (prison09) hardly correlates at all with the percentage of high school graduates (HS), at -0.199.

A correlation matrix plot helps you figure out which predictor variables are too closely correlated with others to be in the same regression model.
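Continuing the sketch above, a few lines can flag the pairs that deserve scrutiny. The 0.8 cutoff is an arbitrary illustration, not a rule.

```python
# Flag predictor pairs whose correlation exceeds a chosen cutoff.
corr = states[cols].corr()
for i, a in enumerate(cols):
    for b in cols[i + 1:]:
        if abs(corr.loc[a, b]) > 0.8:
            print(f"{a} / {b}: {corr.loc[a, b]:.2f}")
```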
