Machine learning (lawyers): leverage points

A high-leverage observation has predictor values that stick out from the other observations’ corresponding values. Unlike outlier status, leverage has nothing to do with the response variable (the number of practicing lawyers in our example). Rather, the leverage of a data point depends on how far the observation’s predictor values lie from the averages of those predictors. So, for example, perhaps the percentage of high school graduates in one state falls far below every other state’s percentage. Or perhaps an unusual combination of predictor values gives an observation high leverage.
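To make the idea concrete, here is a minimal sketch of how leverage can be calculated from the predictors alone. It is not the code behind this post: the data are made up, the function name `hat_values` is hypothetical, and the cutoff of twice the average leverage (2p/n) is a common rule of thumb assumed here for illustration.

```python
import numpy as np

def hat_values(X):
    """Return the leverage (diagonal of the hat matrix) of each row of X."""
    X = np.asarray(X, dtype=float)
    # Prepend an intercept column, as an ordinary regression would.
    design = np.column_stack([np.ones(len(X)), X])
    # With design = QR, the hat matrix is Q Q^T, so each leverage is the
    # sum of squares of the matching row of Q.
    q, _ = np.linalg.qr(design)
    return np.sum(q ** 2, axis=1)

# Made-up data: one predictor, five observations; the last sits far from the mean.
x = np.array([[1.0], [2.0], [3.0], [4.0], [20.0]])
h = hat_values(x)

p = x.shape[1] + 1            # number of coefficients, including the intercept
threshold = 2 * p / len(x)    # assumed rule-of-thumb cutoff: twice the average leverage
high = np.where(h > threshold)[0]

print("leverage:", np.round(h, 3))
print("high-leverage rows:", high)
```

Note that the response variable never appears in the calculation; only the predictor values matter.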

As with outliers, you can spot high leverage observations with calculations, graphs, or trial-and-error. Here is a graphic depiction using our own data set.

This plot takes each of the six predictor variables in the regression model, plots it against the model’s estimated number of lawyers in each state, and labels with an index value (the row number of the state in the spreadsheet) the observations that stand out as having high leverage. For example, the upper-left panel plots each state’s gross domestic product (gdp) against the estimated number of lawyers in the state. Four points have a number beside them because they qualify as high leverage: from the left, state 9 (Florida), 45 (Virginia), 34 (New York), and 43 (Texas).

If no high-leverage observation also has a large residual, the regression is relatively “safe.” That is, a data point may sit far out to the right or the left on the plot and yet hew close to the best-fit line. Florida (state 9) would be such a case.
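The check described above can be sketched as follows. This is an illustration under stated assumptions, not the analysis behind the plot: the data are invented, the function name `leverage_and_residuals` is hypothetical, and both cutoffs (leverage above 2p/n, absolute studentized residual above 2) are common rules of thumb rather than values taken from this post.

```python
import numpy as np

def leverage_and_residuals(x, y):
    """Fit y on x by least squares; return leverages and studentized residuals."""
    design = np.column_stack([np.ones(len(x)), x])
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    resid = y - design @ beta
    # Leverage: row sums of squares of Q from the QR decomposition.
    q, _ = np.linalg.qr(design)
    h = np.sum(q ** 2, axis=1)
    n, p = design.shape
    s2 = resid @ resid / (n - p)              # residual variance estimate
    student = resid / np.sqrt(s2 * (1 - h))   # internally studentized residuals
    return h, student

# Made-up data: the last point is far to the right but sits on the trend line.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 20.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1, 40.0])

h, student = leverage_and_residuals(x, y)
high = np.where(h > 2 * 2 / len(x))[0]              # assumed cutoff: 2p/n with p = 2
risky = [i for i in high if abs(student[i]) > 2]    # assumed cutoff: |residual| > 2

print("high-leverage rows:", high)
print("high leverage AND large residual:", risky)
```

In this invented example the far-right point has high leverage but a small residual, so it is not flagged as risky.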
