Machine learning and lawyers: influential regression values

A data point has large influence only if it strongly affects the regression model. Leverage takes into account only the extremeness of the predictor variable values, so a high-leverage observation may or may not be influential. A high-leverage data point is influential if it materially changes the tilt of the best-fit line. Think of it as having both leverage (an extreme value among the other predictor variable values) and an outlying response value, such that it single-handedly alters the slope of the regression line considerably. Put differently, an influential point changes the coefficients that multiply the predictor values.
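
To make the distinction concrete, here is a minimal sketch on made-up data (not the lawyers-by-state data set). The helper names `fit_line` and `hat_values` are hypothetical, and the numbers are chosen only for illustration. The same extreme predictor value is added twice: once with a response that follows the trend, and once with a response far from it.

```python
import numpy as np

def fit_line(x, y):
    """Return (intercept, slope) of the least-squares line."""
    slope, intercept = np.polyfit(x, y, 1)
    return intercept, slope

def hat_values(x):
    """Leverage of each point; it depends on the predictor values only."""
    X = np.column_stack([np.ones_like(x), x])
    return np.einsum("ij,jk,ik->i", X, np.linalg.inv(X.T @ X), X)

# Synthetic data: 20 ordinary points scattered around y = 1 + 2x.
rng = np.random.default_rng(0)
x_base = rng.uniform(0, 10, size=20)
y_base = 1.0 + 2.0 * x_base + rng.normal(scale=1.0, size=20)

# One extra point with an extreme predictor value (high leverage).
x_all = np.append(x_base, 30.0)
y_on_line = np.append(y_base, 1.0 + 2.0 * 30.0)  # response follows the trend
y_off_line = np.append(y_base, 10.0)             # response far from the trend

_, slope_base = fit_line(x_base, y_base)
_, slope_on = fit_line(x_all, y_on_line)
_, slope_off = fit_line(x_all, y_off_line)
leverage = hat_values(x_all)

print(f"leverage of the extra point: {leverage[-1]:.2f}")
print(f"slope without it:            {slope_base:.2f}")
print(f"slope, point on the trend:   {slope_on:.2f}")
print(f"slope, point off the trend:  {slope_off:.2f}")
```

The extra point has the same leverage in both cases because leverage ignores the response. Only the off-trend version tilts the line noticeably, which is what makes it influential.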

Statisticians spot influential data points by calculating Cook’s distance, but those scores don’t provide information on how the data points affect the model. Understanding that effect is particularly challenging because it is very hard to visualize the impact of many predictors on the response.

Software computes the influence exerted by each observation (a row in the spreadsheet, such as each of our states) on the predicted number of lawyers. It looks at how much the fitted values for all the data points would change if a particular observation were excluded from the calculation of the regression coefficients. A large Cook’s distance indicates that excluding that state changes the coefficients substantially. A few states cross the threshold for a large Cook’s distance.
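
As a rough sketch of what the software is doing, the code below computes Cook’s distance two ways on made-up data: with the usual closed-form expression and by literally refitting the model with each observation left out. The function names and the data are assumptions for illustration, not the code or data behind the plots in this post.

```python
import numpy as np

def cooks_distance(X, y):
    """Closed-form Cook's distance; X must include the intercept column."""
    n, p = X.shape
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    s2 = resid @ resid / (n - p)  # estimated error variance
    # Hat values: diagonal of X (X'X)^-1 X'.
    hat = np.einsum("ij,jk,ik->i", X, np.linalg.inv(X.T @ X), X)
    return resid**2 / (p * s2) * hat / (1 - hat) ** 2

def cooks_distance_loo(X, y):
    """Same quantity by brute force: drop each row, refit, compare fits."""
    n, p = X.shape
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    s2 = resid @ resid / (n - p)
    d = np.empty(n)
    for i in range(n):
        keep = np.arange(n) != i
        beta_i, *_ = np.linalg.lstsq(X[keep], y[keep], rcond=None)
        shift = X @ (beta - beta_i)  # change in every fitted value
        d[i] = shift @ shift / (p * s2)
    return d

# Synthetic data with one planted influential point in the last row.
rng = np.random.default_rng(0)
x = np.append(rng.normal(size=20), 8.0)
y = 2.0 + 3.0 * x + rng.normal(scale=1.0, size=x.size)
y[-1] = 0.0  # far below what the trend predicts at x = 8
X = np.column_stack([np.ones_like(x), x])

d_formula = cooks_distance(X, y)
d_loo = cooks_distance_loo(X, y)
print("most influential row:", int(np.argmax(d_formula)))
```

The two versions agree, which shows that the closed-form score summarizes how far all the fitted values move when one row is dropped.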

Another plot combines findings about outliers, leverage, and influence. States above +2 or below -2 on the vertical axis (horizontal dotted lines) are considered outliers. States to the right of 0.38 on the horizontal axis (vertical dotted line) have high leverage. The size of each circle is proportional to the state’s Cook’s distance for influence.

This plot shows the “Studentized residuals” on the vertical axis. For now, take “studentized” as a form of standardizing all residuals so that a value above +2 or below -2 (the horizontal dotted lines) is statistically quite unusual.
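
For readers who want to see the standardization, here is a small sketch on made-up data. It computes both common variants, internally and externally studentized residuals; this post does not say which variant the plot uses, so treat the choice of the external version for flagging as an assumption. The function name is hypothetical.

```python
import numpy as np

def studentized_residuals(X, y):
    """Return (internal, external) studentized residuals for an OLS fit.

    X must already contain the intercept column.
    """
    n, p = X.shape
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    hat = np.einsum("ij,jk,ik->i", X, np.linalg.inv(X.T @ X), X)
    s2 = resid @ resid / (n - p)
    # Internal: scale each residual by its own estimated standard deviation.
    internal = resid / np.sqrt(s2 * (1 - hat))
    # External: same idea, but the error variance is estimated without row i.
    external = internal * np.sqrt((n - p - 1) / (n - p - internal**2))
    return internal, external

# Synthetic data with one planted outlier at index 10.
rng = np.random.default_rng(1)
x = np.linspace(0, 10, 30)
y = 1.0 + 2.0 * x + rng.normal(scale=1.0, size=x.size)
y[10] += 8.0
X = np.column_stack([np.ones_like(x), x])

internal, external = studentized_residuals(X, y)
flagged = np.flatnonzero(np.abs(external) > 2)  # outside the +2 / -2 band
print("rows outside the band:", flagged)
```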

The horizontal axis shows “hat values,” which are the most common measures of leverage. Once again, reading from the right, Texas [43], New York [34], Florida [9] and California [5] are high-influence observations: they pull strongly on the best-fit line’s angle and therefore significantly alter the model’s coefficients.
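
The three quantities in the plot can also be put side by side in a table. The sketch below does that for made-up data. The function name `influence_measures` is hypothetical, and the leverage cutoff of 2p/n is a common rule of thumb used here as an assumption; it is not necessarily how the 0.38 line in the plot was chosen.

```python
import numpy as np

def influence_measures(X, y):
    """Return (hat values, studentized residuals, Cook's distances).

    X must already contain the intercept column.
    """
    n, p = X.shape
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    hat = np.einsum("ij,jk,ik->i", X, np.linalg.inv(X.T @ X), X)
    s2 = resid @ resid / (n - p)
    internal = resid / np.sqrt(s2 * (1 - hat))
    rstudent = internal * np.sqrt((n - p - 1) / (n - p - internal**2))
    cooks = internal**2 * hat / (p * (1 - hat))
    return hat, rstudent, cooks

# Synthetic data: the last row is extreme in x and far from the trend.
rng = np.random.default_rng(0)
x = np.append(rng.uniform(0, 10, size=20), 30.0)
y = 1.0 + 2.0 * x + rng.normal(scale=1.0, size=x.size)
y[-1] = 10.0
X = np.column_stack([np.ones_like(x), x])

hat, rstudent, cooks = influence_measures(X, y)
n, p = X.shape
outlier = np.abs(rstudent) > 2    # vertical axis of the plot
high_leverage = hat > 2 * p / n   # horizontal axis, assumed cutoff

print(" row    hat  rstudent   cooks")
for i in np.flatnonzero(outlier | high_leverage):
    print(f"{i:4d} {hat[i]:6.2f} {rstudent[i]:9.2f} {cooks[i]:7.2f}")
```

A row that is flagged on both axes and also has a large Cook’s distance corresponds to a large circle in the upper-right or lower-right corner of the plot.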
