Machine learning (lawyers): leverage points

A high leverage observation has values in its predictor variables that stick out with respect to other observations’ corresponding values. Unlike outliers, leverage has nothing to do with the estimated response variable (number of practicing lawyers in our example). Rather, the leverage of a data point is based on how much the observation’s value differs from the average of that particular variable’s values. So, for example, perhaps the percentage of high school graduates in a state falls far below all the other states’ percentages. Or perhaps a combination of predictor variables leads to an observation with an extreme value having high leverage.

As with outliers, you can spot high leverage observations with calculations, graphs, or trial-and-error. Here is a graphic depiction using our own data set.

This plot takes each of the six predictor variables in the regression model, plots the predictor against the model’s estimated number of lawyers in each state, and identifies with an index value — the row number of the state in the spreadsheet — observations that stand out as exhibiting high leverage. For example, the upper-left plot uses each state’s gross domestic product (gdp) and models the estimated number of lawyers in the state. Four have a number beside them because they qualify as high leverage: from the left, state 9 (Florida), 45 (Virginia), 34 (New York), and 43 (Texas).

If no observations with high leverage have a large residual, the regression is relatively “safe.” What this means is that a single data point may be far out to the right or the left on the plot, but the point hews close to the best-fit line. Florida (state 9) would be such a case.
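Leverage can also be computed directly. The posts’ analysis uses R, but here is a minimal sketch in Python, with made-up gdp-like values for six hypothetical states, of the standard calculation: the diagonal of the hat matrix, plus the common 2p/n rule of thumb for flagging high-leverage points.

```python
import numpy as np

# Made-up predictor values (gdp-like figures) for six hypothetical states.
# The posts' model has six predictors; the leverage formula is the same.
X_raw = np.array([[51859.0], [183547.0], [109557.0],
                  [266891.0], [2003479.0], [274048.0]])

# Add an intercept column, then compute the hat matrix H = X (X'X)^-1 X'.
X = np.hstack([np.ones((X_raw.shape[0], 1)), X_raw])
H = X @ np.linalg.inv(X.T @ X) @ X.T
leverage = np.diag(H)   # each observation's leverage is a diagonal entry

# A common rule of thumb flags points whose leverage exceeds 2p/n,
# where p is the number of model parameters and n the number of observations.
p, n = X.shape[1], X.shape[0]
high = np.where(leverage > 2 * p / n)[0]
# The fifth "state," whose value sits far from the others, gets flagged.
```

The leverages always sum to the number of model parameters, which is a handy sanity check on the computation.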

Regression for lawyers: outliers, leverage and influence points

Before running a regression, it is prudent to study any of your data that exhibit unusual characteristics. The purpose of spotting and evaluating abnormal data is to make sure they are not mistakes in measurement, collection, data entry, or calculation nor are they data that unjustifiably warp your regression model. You want to scrutinize three varieties of unusual data: outliers, high leverage points, and influence points.

An outlier is an observation (a U.S. state in our example data) whose value of the response variable (number of lawyers in the state) is poorly predicted. That is to say, the model produces an unduly large miss from the actual number of lawyers when it estimates the number. With our data, New York stands as quite an outlier. Assuming the figures and facts we have for New York are correct, however, that’s real life; generally speaking, unless you have a solid reason to omit some observation, outlier data should be included in your model.

At least three techniques can help spot outliers: statistical tests, graphical plots, and repeated modeling.

Among the statistical tests, one calculates whether the largest residual of the response variable (the amount the model mis-estimated the number of lawyers) is “statistically significantly” off the mark; if no such unusual residual shows up, the data has no outliers. However, if the largest response residual is statistically significant and therefore is an outlier observation, analysts sometimes delete it and rerun the test (the third technique) to see whether other outliers are present.

This graphic plots residuals, after some mathematical adjustments to the scale, against the corresponding theoretical quantiles. Quantiles (akin to percentiles) are created when you sort data from low to high: 25% of the points fall below the first quartile; 50% fall below the second quartile — the median; and so forth. These are points in your data below which a certain proportion of your data fall. So, the horizontal axis above shows what a normal bell-curve distribution looks like when it is expressed as quantiles. The plot also draws a dotted-line band above and below the residuals to show a statistical form of confidence in the estimated value. The plot tells us that New York (34, top right corner) and California (5, lower left corner) are outliers to be scrutinized.

As alluded to above, a third technique helps if you are concerned about an observation being an outlier. You exclude the suspect observation (state) from the regression. If the model’s coefficients don’t change much, then you don’t have to worry.
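As a hedged sketch of this exclude-and-refit technique (made-up numbers, and Python rather than the R used in these posts):

```python
import numpy as np

# Illustrative figures only: a handful of "states" with F500 headquarters
# counts and lawyer counts, loosely shaped like the posts' data.
f500 = np.array([0.0, 1, 7, 5, 54, 9, 3, 2])
lawyers = np.array([1585.0, 7615, 4447, 8023, 85274, 11584, 6000, 4000])

def fit(x, y):
    """Least-squares intercept and slope for y = a + b*x."""
    X = np.column_stack([np.ones_like(x), x])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef

full = fit(f500, lawyers)

# Leave out the suspect observation (here, the 54-headquarters state)
# and refit the model on the remaining states.
mask = np.arange(len(f500)) != 4
reduced = fit(f500[mask], lawyers[mask])
# If full and reduced are close, the suspect point is not warping the model.
```

Comparing `full` with `reduced` coefficient by coefficient is the "don't have to worry" check the paragraph above describes.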

Machine learning for lawyers: collinearity in linear regression

We have reached the fourth assumption for valid multiple linear regression: correlations between predictor variables cannot be too high. If the correlation between any pair of predictor variables is close to 1 or -1, a problem arises: it becomes difficult to separate out the individual effects of those variables on the response. Predictor variables (aka independent variables) should be independent of each other.

For example, the correlation between the number of F500 headquarters in a state and the number of business establishments with fewer than 500 employees is very high, at 0.90. The two counts relate very closely to each other, which makes sense. States have more or less vigorous business activity whether measured at the huge corporate stratum or the small business stratum.

The plot below shows how closely the values for each state of those two variables march together toward the upper right. The ellipse emphasizes that the association holds particularly strongly at the smaller values of the two variables.

Collinearity is the term used for this undesirable situation: predictor variables rise or fall closely in unison. To the degree they do, one of them is redundant, and the effective number of predictors is smaller than the actual number of predictors. Other weaknesses in the linear model we will leave until later.

To see how closely predictors correlate with each other you can unleash a correlation matrix plot. In the plot, for various reasons we haven’t included some of the data for states, such as population, area and gdp.

This intimidating plot offers three kinds of insights. First, it shows a scatter plot of each predictor variable against each of the other predictors. For example, in the first column (F500), the second cell down shows a scatter plot with that variable on the horizontal axis and law school enrollment (enrollment, the second row) on the vertical axis.

Second, the diagonal from the top left down the middle contains density plots that display how the predictor variable is distributed. For example, in the column for the number of businesses that have fewer than 500 employees (Less500), the density plot bulges high on the left, which means that quite a few states have relatively few of those enterprises but a handful stretching out to the far right have many.

Third, the upper triangle prints the correlation of each predictor against the others. As an example, the number of prisoners in 2009 (prison09) hardly correlates at all with the percentage of high school graduates (HS), at -0.199.

A correlation matrix plot helps you figure out which predictor variables are too closely correlated with others to be in the same regression model.
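For readers who want the numbers behind such a plot, here is a minimal sketch, with made-up predictor columns, of computing the correlation matrix itself:

```python
import numpy as np

# Made-up values for three predictors across six hypothetical states.
f500     = np.array([0.0, 1, 7, 5, 54, 9])
less500  = np.array([60000.0, 90000, 70000, 130000, 454718, 140000])
hs_grads = np.array([91.0, 84, 85, 86, 82, 91])

# np.corrcoef treats each row as one variable; corr[i, j] is the
# correlation between predictor i and predictor j. The diagonal is 1
# and the matrix is symmetric, just like the plot's upper triangle.
corr = np.corrcoef(np.vstack([f500, less500, hs_grads]))
```

A pair of predictors whose entry approaches 1 or -1 is a collinearity candidate; one of the two may need to be dropped from the model.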

Similar distribution of response residuals (homoscedasticity)

Having discussed two assumptions of linear regression, linearity between predictors and the response variable and the bell-curvishness of the residuals, we now introduce a third.

Linear regression also assumes that the residuals have a consistently-shaped distribution around zero, regardless of the values of the predictors. This means that whatever the number of F500 headquarters and sub-500 employee businesses, the magnitude of the residuals for lawyers (the difference between the actual number of private-practice lawyers in a state and what the model with the two predictor variables estimates) doesn’t change too much. In statistical terms, the variance of the residuals stays constant. Variance, since you ask, measures the dispersion of a set of numbers.

A tool to eyeball whether the assumption holds is to plot the fitted response values on the bottom axis against their residual values (how far off they are). Cleverly called a “Residuals vs. Fitted” plot, it should display no pattern and the magnitude of the spread of points around zero should be similar regardless of the fitted value.
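The quantities behind such a plot are easy to compute. Here is a rough sketch with made-up numbers, comparing the residual spread at low fitted values against the spread at high fitted values:

```python
import numpy as np

# Made-up one-predictor data loosely shaped like the posts' lawyers ~ F500.
f500 = np.array([0.0, 1, 2, 3, 5, 7, 9, 12, 20, 54])
lawyers = np.array([1600.0, 3100, 4100, 5400, 7800,
                    10100, 12700, 16500, 26000, 70000])

# Fit the line, then form fitted values and residuals.
X = np.column_stack([np.ones_like(f500), f500])
coef, *_ = np.linalg.lstsq(X, lawyers, rcond=None)
fitted = X @ coef
resid = lawyers - fitted

# Split the residuals by fitted value and compare their spreads.
order = np.argsort(fitted)
low_half, high_half = resid[order[:5]], resid[order[5:]]
ratio = np.std(high_half) / np.std(low_half)
# A ratio far from 1 hints that variance grows (or shrinks) with the
# fitted value, the pattern the Residuals vs. Fitted plot would reveal.
```

With an intercept in the model, the residuals average out to zero; what homoscedasticity asks is that their spread stay steady across the range of fitted values.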

Basically, this assumption of linear regression looks at how spread out the residuals are and whether that spread varies with the magnitude of the predictor. Cover your eyes if large words make you queasy, because what you want is minimal spread and approximately similar numbers on both sides of a smoothed line, the mouth-filler known affectionately as homoscedasticity.

This plot, however, tells us that residuals lose desirable compactness and consistency as lawyers reach the largest values on the right. New York stands at the far right top, an unusual combination of many F500 headquarters (54) and sub-500 employee companies (454,718) that apparently would predict a quite different number than the 96,000 in real life. On the left of the plot, with two exceptions, the data has relatively uniform variance, but not on the right. We added a smoothing curve; since it veers off the zero line at two points, that is a sign of systematic under- or over-prediction in certain ranges: the errors are correlated with the dependent variable.

Later we will discuss how this kind of situation offers a possible solution: to transform the response variable.

Assumption of normally distributed residuals

As discussed before, in a linear regression model that your firm or department can rely on, the relationship between predictor variables and the response variable must be linear. Additionally, the residuals of the model must be normally distributed.

We should unpack that last sentence.

A residual is the difference between the actual value of the response variable, the number of lawyers, and the value estimated for it by the linear regression model based on all of the predictor variables. Regression methods produce the model, a line (or, when more than one predictor variable is in the formula, a plane or hyperplane), that leaves the smallest overall residuals. We will explain later the mathematics that minimizes residuals.
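As a tiny worked example, using the one-predictor equation that appears elsewhere in these posts (lawyers = 1343.92 + 1265.27 * F500) and New York’s figures:

```python
# The one-predictor model from these posts, used to show what a residual is.
def estimate(f500):
    return 1343.92 + 1265.27 * f500

actual_ny, f500_ny = 96000, 54   # New York figures cited in the posts
residual_ny = actual_ny - estimate(f500_ny)
# New York's actual count sits far above this one-predictor estimate;
# a residual that large is why it shows up as an outlier.
```

The residual is just actual minus estimated; a positive residual means the model under-predicted, a negative one means it over-predicted.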

OK, so residuals are about how far off real data points are from regression line points. What about a ‘normal distribution’ of those residuals? A distribution is statistics-speak for a group of numbers. If the numbers in a distribution are plotted on a graph by how many of the numbers are at each value, it is a normal distribution if the shape is reasonably close to the often-seen bell curve (relatively few numbers far to either side on the tails and most of them clustered and piled up toward the middle).

Here is a histogram [For more on histograms, click here for my book on law firm data and graphics] of the residuals from two predictors: F500 headquarters and the number of enterprises in the state that have fewer than 500 employees. You have to imagine a three-dimensional cube where the bottom axis of the cube is one predictor and the axis going back is the other predictor, while the left side of the cube is the response variable. Regression software creates a best-fit two-dimensional plane for the predictors and the residual is the distance from each pair of predictor points to that plane.

You can make out a partial bell curve, applying to most of the residuals, except that three residuals stick far out on the right tail of the histogram. We should investigate those states’ values, because they could be outliers arising from a mistake in the data. Meanwhile, however, the shape reasonably resembles a bell and therefore satisfies the assumption that the residuals be normally distributed.

As an aside, when your regression model has more than one predictor variable, it’s harder to visualize a best fit “line”. If you have only two predictors, you can consider a plane as the best fit — as if a stiff piece of paper represents the ‘line.’ But with more predictors the mind boggles at visualizing a hyperplane. Software has no such frailties and it will figure out the residual of each point no matter how many predictor variables.

Machine learning for lawyers: linear regression assumptions

To produce valid results, a linear regression model needs the relationship between the predictor and response variables to be linear. For example, multiply every additional F500 headquarters by 1,000 (and add the intercept value) to predict the number of practicing lawyers in a state. Since 1,000 is the same for every number of F500 headquarters and since it is not squaring the headquarters numbers or taking its logarithm or some other mathematical operation, the relationship can be plotted on a line: it’s linear.

Bear in mind that the software will blindly calculate a best-fit line even if the data is absolutely random, looks U-shaped, hockey-stick shaped, or exhibits a bizarre irregularity. The requirement for valid linear regression goes back to the slope of the best-fit line, which has a constant number — not a squared number or some varying number — that multiplies the values of the predictors to estimate the response variable.

Got it, but how can you tell if your data satisfies the criterion of linearity? Quite often you can simply eyeball a scatter plot. You plot each state’s predictor variable on the horizontal axis, increasing in numbers of F500 headquarters to the right. You plot the corresponding response variable on the vertical axis, increasing toward the top. If the pattern suggests a relationship — the two variables rise together or fall together, or if you were to sketch an ellipse around the bulk of the data it would look something like a tilted football — you have a linear relationship between the variables and satisfy the assumption for linear regression.

This plot suggests a roughly linear increase in the number of lawyers as F500 headquarters increase in that you can draw a straight line from the lower left mid-way through the points up to the upper right.

The plot below doesn’t look nearly as much like a linear pattern, so that predictor is probably less useful [unlikely to be statistically significant], but the distribution of points still is reasonably linear. Another clue to relative linearity is the correlation between lawyers and F500 headquarters, 0.9, whereas the correlation between lawyers and the urban population as a percent of total population is half that (0.45).

 

If the variables do not have a linear association, transformations of the variables are possible, such as using their square roots, but that’s the subject of another post. Also, other kinds of regression might fit a valid model, but those alternatives are too advanced for this overview. It is also important to check for outliers, very unusual and influential data points, which we will return to later.

Suggestions for spreadsheets used by lawyers in machine learning

Law firms and law departments can store their data for regression models, or any machine learning algorithm, in a spreadsheet. Spreadsheets are perfectly fine, and indeed Excel and like programs can perform linear regressions, but more powerful software such as Mathematica, Matlab, Python (open source), R (open source), SAS, SPSS, and Tableau typically take in data from a spreadsheet before they can begin their magic.

Here are some observations and good hygiene for spreadsheets that contain regression data.

Store the data in columns, not rows. So, in our data set, each state is a row and each column contains the values for a variable. Here are the first six rows of the data being used in this series of posts. <chr>, <dbl> and <fct> stand for a text variable, a numeric variable (double precision is what gives it the “dbl”), and a categorical or factor variable, respectively, as used by the R programming language.

 state population    area lawyers     gdp  F500 capital     region party
  <chr>      <dbl>   <dbl>   <dbl>   <dbl> <dbl> <chr>       <fct>  <fct>
1 AK        735132 665384.    1585   51859     0 Juneau      West   rep  
2 AL       4822023  52420.    7615  183547     1 Montgomery  South  rep  
3 AR       2949131  53179.    4447  109557     7 Little Rock South  rep  
4 AZ       6553255 113990.    8023  266891     5 Phoenix     West   rep  
5 CA      38041430 163696.   85274 2003479    54 Sacramento  West   dem  
6 CO       5187582 104094.   11584  274048     9 Denver      West   dem 

Put each observation on its own row; by convention, the observation’s identifier goes in the first column. In real life, observations might be associates, partners, offices, countries, law firms, client groups, matters or others.

The arrangement of the columns does not matter to the software nor does the order of the observations. Our data set has the observation (state) in the first column, the dependent variable, lawyers, in the fourth column and F500 headquarters in the fifth. It also doesn’t matter if some columns (or some rows) are not used in the model.

Try not to have extraneous rows, such as headers, summaries, or sub-tables. Leave out totals, explanatory text and blank rows. It doesn’t matter, we should note, if you have superfluous columns on the right or rows below the data, because your software can easily remove or ignore them.

Make sure you have only numbers in columns for numbers. That is to say, do not have commas, dollar signs or any text (17500 USD is a no-no). Decimal points in values are perfectly fine, as in “area” above.

If you are missing values, be consistent in how you identify them. The best approach is to put nothing in the cell; avoid hyphens or writing something, such as “no value.” If text and numbers are jumbled together, most software will treat the variable as text and cannot do math on the text.

Writing code is easier if you use one-word names for your columns and make them understandable. For example, use “hours” or “firm” or “type” rather than “billable hours” or “paid vendor”, or “x-1 code”. It’s fine to use camel case style like “StateGDP” or “AreaState”.
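The numeric-hygiene rules above can be sketched in code. This cleaning function and its behavior are illustrative, not a prescribed tool: it strips currency symbols, commas, and stray text from a “numbers only” column and treats empty cells as missing.

```python
import re

def clean_number(cell):
    """Turn a spreadsheet cell into a float, or None if it is empty."""
    cell = cell.strip()
    if cell == "":
        return None                       # consistent marker for missing data
    # Drop everything except digits, decimal points, and minus signs,
    # so "17500 USD" and "$2,003,479" become plain numbers.
    digits = re.sub(r"[^0-9.\-]", "", cell)
    return float(digits) if digits else None

raw = ["17500 USD", "$2,003,479", "  665384.0", ""]
cleaned = [clean_number(c) for c in raw]
```

Cleaning at import time like this spares the statistical software from misreading a numeric column as text.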

Effect size in regression, and holding variables constant

Each coefficient of a regression model measures what is called effect size. Returning to our data, the effect size tells how much the predicted number of lawyers changes with a change of one F500 headquarters. Since our single-variable coefficient is 1,265 it would mean that every additional F500 headquarters increases the estimated lawyers by 1,265; every F500 headquarters less would drop the estimated lawyers by 1,265. So, effect size indicates the influence of a one-unit change in a predictor on the response number. But we shouldn’t use that single effect size because we have data for other variables that contribute to the estimated number of lawyers.

To this point we have been conducting linear regression with only a single predictor. If we include more than one predictor in our model, we are using multiple linear regression.

To progress to multiple linear regression and to see more effect sizes and how coefficients change when there are additional predictor variables, let’s add to our model the predictor variable of state population. The resulting regression equation appears below.

                 lawyers = -275.8 + 767.689 * F500 + 0.001 * population + e

The coefficient for F500 headquarters has dropped dramatically, from 1,265 to 768.  Further, we see a tiny coefficient for state population.

But you can’t look at the absolute size of a coefficient and decide whether it’s more or less influential than another predictor’s coefficient, because if they are both statistically significant, you also must take into account the units of the predictor variable. Our units are one headquarters and one state resident. Intuitively, a change in one headquarters ought to make much more of a difference in the lawyer count than the change of one resident.

Both predictors are statistically significant, so what do the coefficients tell us about translating their variable values into the real world? On this two-predictor model, every increase or decrease in the number of F500 headquarters changes the estimated number of lawyers by 768, when we hold state population constant. Every increase or decrease in the population changes the estimated lawyers by 1/1000, holding F500 headquarters constant. Thus, for every thousand additional residents in a state, this model predicts an additional lawyer.
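That “holding constant” arithmetic can be checked directly from the two-predictor equation in this post:

```python
# The two-predictor equation from this post:
# lawyers = -275.8 + 767.689 * F500 + 0.001 * population
def estimate(f500, population):
    return -275.8 + 767.689 * f500 + 0.001 * population

# Holding population constant, one extra headquarters adds about 768 lawyers...
delta_hq = estimate(11, 5_000_000) - estimate(10, 5_000_000)

# ...and holding F500 constant, 1,000 extra residents add about one lawyer.
delta_pop = estimate(10, 5_001_000) - estimate(10, 5_000_000)
```

The population figure of five million is an arbitrary illustration; because the model is linear, the same one-unit changes apply at any baseline.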

Holding other predictors constant means that the software sets the other predictor variables at a fixed value, so they contribute nothing to changes in the response variable, the number of private practice lawyers. Doing so isolates the effect of the remaining predictor on the response variable.

We need to include as many predictor variables as we have available and evaluate that multiple linear regression model, which we will do in a later post.

Regression as machine learning: p-values and R-squared for lawyers

When software calculates a regression best-fit line and equation, it also bestows other insights. Here we consider two of them: (1) p-values, which with our data tell us whether the predictor variable, the number of F500 headquarters in a state, says something we can rely on about the estimated number of private lawyers in a state and (2) Adjusted R-squared, which tells us how much of the variation in the number of lawyers is accounted for by the predictor variable.

Start with p-values. Each predictor variable in a regression model has its own p-value. That value estimates the probability that the coefficient for the predictor — the number it is multiplied by in the regression equation — could have occurred by chance if the predictor had zero influence on the response variable.  If F500 headquarters have no bearing on the number of practicing lawyers in a state, the p-value for F500 would be high, such as 0.5 or 0.8.

But a p-value below 0.05 tells us there is less than a 5% chance of seeing a coefficient that large if the true relationship were zero. If the data in the model says something has happened that would happen less than one-out-of-twenty times (less than 5% of the time), it’s unusual enough for us to accept that “Something real is going on!”.

Our p-value for F500 headquarters, it turns out, is extremely tiny, below 0.001, so we have solid reason to believe that the number of F500 headquarters strongly and reliably relates to the number of practicing lawyers. Be careful: we cannot say that the number of F500 headquarters causes the number of lawyers, only that it is strongly associated with the number of lawyers.
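For intuition about what a p-value measures, here is a hedged sketch (made-up numbers, and a permutation test rather than the t-test that R actually reports): shuffle the predictor so it truly has zero influence, and count how often chance alone produces a slope as large as the observed one.

```python
import numpy as np

# Made-up data loosely shaped like lawyers ~ F500.
rng = np.random.default_rng(0)
f500 = np.array([0.0, 1, 2, 3, 5, 7, 9, 12, 20, 54])
lawyers = np.array([1600.0, 3100, 4100, 5400, 7800,
                    10100, 12700, 16500, 26000, 70000])

def slope(x, y):
    """Least-squares slope of y on x."""
    x_c = x - x.mean()
    return (x_c @ (y - y.mean())) / (x_c @ x_c)

observed = slope(f500, lawyers)

# Shuffle f500 against the lawyer counts 2,000 times; under each shuffle
# the predictor has zero real influence, so any large slope is pure chance.
hits = sum(abs(slope(rng.permutation(f500), lawyers)) >= abs(observed)
           for _ in range(2000))
p_value = hits / 2000
# A tiny p_value says a slope this large almost never arises by chance.
```

With data this strongly related, the shuffled slopes essentially never match the observed one, mirroring the “below 0.001” p-value the post reports for F500 headquarters.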

Next, let’s learn what the Adjusted R-squared result tells us. For this model, Adjusted R-squared tells us what percentage of the variation in the number of lawyers is accounted for by F500 headquarters. In our data, Adjusted R-squared is 81%, which is quite high; by comparison, other factors account for relatively little of the variation in the number of private lawyers in a state.

Adjusted R-squared gives a sense of the portion of the response variable’s variation explained by the model. It doesn’t directly indicate how well the model will perform in predictions based on new observations. It helps more when you compare different models for the same data set, such as when you try different predictor variables.
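The underlying formulas are short. A sketch with made-up numbers shows R-squared as 1 minus the ratio of residual to total sums of squares, and the adjustment that penalizes extra predictors:

```python
import numpy as np

# Made-up one-predictor data (n observations, k = 1 predictor).
f500 = np.array([0.0, 1, 2, 3, 5, 7, 9, 12, 20, 54])
lawyers = np.array([1500.0, 3200, 4000, 5600, 7700,
                    10300, 12500, 16800, 25500, 70500])

# Fit the line and compute residuals.
X = np.column_stack([np.ones_like(f500), f500])
coef, *_ = np.linalg.lstsq(X, lawyers, rcond=None)
resid = lawyers - X @ coef

# R-squared: share of total variation the model explains.
ss_res = (resid ** 2).sum()
ss_tot = ((lawyers - lawyers.mean()) ** 2).sum()
r2 = 1 - ss_res / ss_tot

# Adjusted R-squared shrinks r2 to account for the number of predictors.
n, k = len(lawyers), 1
adj_r2 = 1 - (1 - r2) * (n - 1) / (n - k - 1)
```

Because the adjustment multiplies the unexplained share by (n - 1)/(n - k - 1), adding a predictor that contributes little can lower Adjusted R-squared even as plain R-squared creeps up.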

Regression equations and prediction (machine learning)

What we can’t do with simple correlation, as discussed before, is predict the number of lawyers in a new “state” if we knew the number of Fortune 500 headquarters in that state. Once we create a regression model, however, we can fill in the equation to estimate one variable (predict it) when we know the other variable.

Here is the equation for our regression model of private practice lawyers (hereafter, “lawyers”) as influenced by the number of Fortune 500 headquarters in the state (hereafter, “F500 headquarters”):

                 lawyers = 1343.92 + 1265.27 * F500 + e

The equation tells us that if a state has, say, three F500 headquarters, then “lawyers are estimated to number 1,343.92 plus the product of 1265.27 times 3[F500 headquarters] plus a bit of slippage [e]” (more later on errors): an estimated 5,140 lawyers in private practice in that state.
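The arithmetic, spelled out:

```python
# Coefficients from the regression equation above.
intercept, slope = 1343.92, 1265.27

# Three F500 headquarters; the error term e averages out to zero.
estimate = intercept + slope * 3
# estimate is 5139.73, which rounds to the roughly 5,140 lawyers in the text.
```

The same two numbers, intercept and slope, produce an estimate for any headquarters count you plug in.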

Imagine a different situation where we only have data for 40 of the states. Regression would create a model from those 40 states, called the training set. This is what machine learning software does: it “learns” from the data given it and can apply that learning — the model — to new information. We could then predict the number of lawyers for any of the remaining 10 states, the test data. Because we know the actual numbers in the test set, we can assess the accuracy of our regression model by comparing the model’s estimates to reality.
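A sketch of that train-and-test workflow, with simulated rather than real state data:

```python
import numpy as np

# Simulate 50 "states": headquarters counts plus noisy lawyer counts
# built from the one-predictor equation in these posts.
rng = np.random.default_rng(1)
f500 = rng.integers(0, 55, size=50).astype(float)
lawyers = 1343.92 + 1265.27 * f500 + rng.normal(0, 2000, size=50)

# Train on 40 states, hold out 10 as the test set.
train, test = np.arange(40), np.arange(40, 50)
X = np.column_stack([np.ones(40), f500[train]])
coef, *_ = np.linalg.lstsq(X, lawyers[train], rcond=None)

# Predict the held-out states and compare with the known truth.
pred = coef[0] + coef[1] * f500[test]
mean_abs_error = np.abs(pred - lawyers[test]).mean()
```

Because the test states’ actual counts are known, the mean absolute error gives a direct read on how well the trained model generalizes.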

Any time a firm or law department has two or more variables for observations, if a handful of assumptions to be covered later are satisfied, linear regression will tell you more than you know now.

The linear regression methodology for prediction applies broadly. Let’s illustrate with a law firm that wants to predict an associate’s annual billable hours based on the number of partners that associate worked for during the year. The data set would be the firm’s associates. For each associate the number of hours he or she billed during the most recent year would be one variable and the number of partners who assigned him or her work would be the second variable. Linear regression would generate an equation and the firm could predict either variable for any associate who had missing data for the other variable. [For more on the terminology, see this post.]

With this particular illustration, the value of regression as a prediction tool may be low, but as a tool to understand the relationship between partner numbers and billable hours, it might be insightful. That relationship involves three concepts that we will consider in the next post: p-value, effect size, and R-squared.