Imputation for missing data when machine learning

Most collections of data have holes, missing data points. You don’t have the law school graduation year for this associate or the number of matters worked on for that associate. When you include those associates’ information in your regression modelling, the software may drop an associate entirely because one piece is missing. You don’t want that to happen because then you have also lost the remaining, valid data for that associate.

Likewise, to shift examples, if you are studying your firm’s fees charged for reviewing securities law filings and you have completed 65 such matters over the past few years, but 10 of them are missing a number for revenue of the client, you actually have shrunk the analyzable set to only 55 matters.

To see what’s missing, a picture is invaluable, as always with analyses. Here is a map of a data set with 500 observations that has some values missing in some of its 17 variables. A light, vertical line means that the observation on the horizontal axis had no value for that variable.

Make sure that no pattern explains missing data, such as if all the corporate department lawyers have no evaluation scores. But let’s assume that your data is missing at random, not for some identifiable reason like the Chicago office did not turn in its response sheet.

To counter the clobbering of good data caused by absent data, analysts resort to a range of methods to plug in plausible figures and thereby save the remaining data. These methods, called imputation, are an important step when you prepare data for analysis.

The simplest method plugs in the average or median of all the values for that variable. Doing this, the average or median year of law school graduation would be inserted for the unknown year of an associate. Many other methods are available, with increasing amounts of calculations needed but with imputed values that are likely to be closer to the actual unavailable data. For example, you can run a regression model based on what you know and predict the value(s) you don’t know. For more on data imputation, see my article, Rees W. Morrison, “Missing in Action: Impute Intelligently before Deciding Based on Data”, LegalTechnology News April 2017.
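As a minimal sketch of the simplest method, the Python below fills gaps with the median or the mean of the known values. The graduation years and the helper name `impute` are invented for illustration; real work would more likely use an imputation routine from a statistics package.

```python
from statistics import mean, median

# Hypothetical graduation years for seven associates; None marks a missing value.
grad_years = [2010, 2015, None, 2018, 2014, None, 2016]


def impute(values, method=median):
    """Replace each None with the median (or mean) of the observed values."""
    observed = [v for v in values if v is not None]
    fill = method(observed)
    return [fill if v is None else v for v in values]


imputed_median = impute(grad_years)        # gaps filled with the median year
imputed_mean = impute(grad_years, mean)    # gaps filled with the mean year
print(imputed_median)
print(imputed_mean)
```

Every associate keeps his or her row, at the price of two invented years that sit exactly in the middle of the known ones.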

Machine learning and lawyers: influential regression values

A data point has large influence only if it strongly affects the regression model. Leverage only takes into account the extremeness of the predictor variable values, but a high leverage observation may or may not be influential. A high-leverage data point is influential if it materially changes the tilt of the best-fit line. Think of it as having leverage (an extreme value among the other predictor variable values) and also an outlying response value, such that it singlehandedly alters the slope of the regression line considerably. Put differently, an influential point changes the constants that multiply the predictor values.

Statisticians spot influential data points by calculating Cook’s distance, but those scores don’t provide information on how the data points affect the model. This is particularly challenging, as it is very hard to visualize the impact of many predictors on the response.

Software computes the influence exerted by each observation (row in the spreadsheet, such as our states) on the predicted number of lawyers. It looks at how much the residuals of all the data points would change if any particular observation were excluded from the calculation of the regression coefficients. A large Cook’s distance indicates that excluding that state changes the coefficient substantially.  A few states cross that threshold.
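To make the calculation less of a black box, here is a rough sketch of Cook’s distance for a regression with a single predictor. It uses only the six state rows printed elsewhere in this series (F500 headquarters and lawyers), not the full data set, so the numbers will not match the plots discussed here. The function name is ours, and the formula is the standard textbook one: the squared residual, scaled by the residual variance and the number of coefficients, times h / (1 - h)^2, where h is the leverage.

```python
# Six state rows from the data table in this series (F500 headquarters, lawyers).
states = ["AK", "AL", "AR", "AZ", "CA", "CO"]
f500 = [0, 1, 7, 5, 54, 9]
lawyers = [1585, 7615, 4447, 8023, 85274, 11584]


def cooks_distances(x, y):
    """Cook's distance for each point of a one-predictor least-squares fit."""
    n, p = len(x), 2  # p counts the two coefficients: intercept and slope
    x_mean, y_mean = sum(x) / n, sum(y) / n
    sxx = sum((xi - x_mean) ** 2 for xi in x)
    slope = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y)) / sxx
    intercept = y_mean - slope * x_mean
    residuals = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]
    mse = sum(e ** 2 for e in residuals) / (n - p)         # residual variance
    hats = [1 / n + (xi - x_mean) ** 2 / sxx for xi in x]  # leverage of each point
    return [e ** 2 / (p * mse) * h / (1 - h) ** 2 for e, h in zip(residuals, hats)]


distances = cooks_distances(f500, lawyers)
# Print the states from most to least influential.
for state, d in sorted(zip(states, distances), key=lambda pair: -pair[1]):
    print(f"{state}: {d:.3f}")
```

With only six rows, California dominates because its count of F500 headquarters is so far from the rest; in the full 50-state model other states share the stage.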

Another plot combines findings about outliers, leverage, and influence. States above +2 or below -2 on the vertical axis (horizontal dotted lines) are considered outliers. States to the right of 0.38 on the horizontal axis (vertical dotted line) have high leverage. The size of each circle is proportional to the state’s Cook’s distance for influence.

This plot shows the “Studentized residuals” on the vertical axis. For now, take “studentized” as a form of standardizing all residuals so that a studentized residual beyond +2 or -2 (the horizontal dotted lines) is statistically quite unusual.

The horizontal axis shows “hat values,” which are the most common measures of leverage. Once again, reading from the right, Texas [43], New York [34], Florida [9] and California [5] are high-influence observations: they pull strongly on the best-fit line’s angle and therefore significantly alter the model’s coefficients.

Machine learning (lawyers): leverage points

A high leverage observation has values in its predictor variables that stick out with respect to other observations’ corresponding values. Unlike outliers, leverage has nothing to do with the estimated response variable (number of practicing lawyers in our example). Rather, the leverage of a data point is based on how much the observation’s value differs from the average of that particular variable’s values. So, for example, perhaps the percentage of high school graduates in a state falls far below all the other states’ percentages. Or perhaps a combination of predictor variables leads to an observation with an extreme value having high leverage.
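A short sketch may help show that leverage looks only at the predictor. With one predictor, the leverage (“hat value”) of an observation is 1/n plus its squared distance from the predictor’s mean, divided by the total of all such squared distances. The code uses the F500 counts of the six states printed in the data table of this series; the function name is ours, and with several predictors the software uses matrix algebra instead.

```python
# F500 headquarters for the six state rows shown in this series.
states = ["AK", "AL", "AR", "AZ", "CA", "CO"]
f500 = [0, 1, 7, 5, 54, 9]


def hat_values(x):
    """Leverage of each observation in a one-predictor regression."""
    n = len(x)
    x_mean = sum(x) / n
    sxx = sum((xi - x_mean) ** 2 for xi in x)
    return [1 / n + (xi - x_mean) ** 2 / sxx for xi in x]


leverage = hat_values(f500)
for state, h in zip(states, leverage):
    print(f"{state}: {h:.3f}")
```

Notice that the number of lawyers never enters the calculation. The hat values always add up to the number of coefficients in the model, which is two here (intercept and slope).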

As with outliers, you can spot high leverage observations with calculations, graphs, or trial-and-error. Here is a graphic depiction using our own data set.

This plot takes each of the six predictor variables in the regression model, shows with the predictor what the model estimates as the number of lawyers in that state, and identifies with an index value (the row number of the state in the spreadsheet) observations that stand out as exhibiting high leverage. For example, the plot at the upper left uses each state’s gross domestic product (gdp) and models the estimated number of lawyers in the state. Four have a number beside them because they qualify as high leverage: from the left, state 9 (Florida), 45 (Virginia), 34 (New York), and 43 (Texas).

If no observations with high leverage have a large residual, the regression is relatively “safe.” What this means is that a single data point may be far out to the right or the left on the plot, but the point hews close to the best-fit line. Florida (state 9) would be such a case.

Regression for lawyers: outliers, leverage and influence points

Before running a regression, it is prudent to study any of your data that exhibit unusual characteristics. The purpose of spotting and evaluating abnormal data is to make sure they are not mistakes in measurement, collection, data entry, or calculation nor are they data that unjustifiably warp your regression model. You want to scrutinize three varieties of unusual data: outliers, high leverage points, and influence points.

An outlier is an observation (a U.S. state in our example data) whose value for the response variable (number of lawyers in the state) is poorly predicted. That is to say, the model produces an unduly large miss from the actual number of lawyers when it estimates the number. With our data, New York stands as quite an outlier. Assuming the figures and facts we have for New York are correct, however, that’s real life; generally speaking, unless you have a solid reason to omit some observation, outlier data should be included in your model.

At least three techniques can help spot outliers: statistical tests, graphical plots, and repeated modeling.

Among the statistical tests, one calculates whether the largest residual of the response variable (the amount by which the model mis-estimated the number of lawyers) is “statistically significantly” off the mark; if no such unusual residual shows up, the data has no outliers. However, if the largest response residual is statistically significant and therefore marks an outlier observation, analysts sometimes delete it and rerun the test (the third technique) to see whether other outliers are present.

This graphic plots residuals, after some mathematical adjustments to the scale, against the corresponding theoretical quantiles. Quantiles (sometimes called ‘percentiles’) are created when you sort data from high to low and plot the point where 25% of the points are below the “first quartile”; 50% are below the second quartile — the median; and so forth. These are points in your data below which a certain proportion of your data fall. So, the horizontal axis above shows what a normal bell-curve distribution line looks like when it is based on the quantiles of such a distribution. It also generates a dotted-line band above and below the residuals to show a statistical form of confidence in the estimated value. The plot tells us that New York (34, top right corner) and California (5, lower left corner) are outliers to be scrutinized.
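For readers who want to see what sits behind such a plot, the sketch below pairs sorted, standardized residuals with the quantiles of a normal bell curve. The residual values are invented for illustration, and the plotting position (i - 0.5) / n is one common convention among several. Plotting software may also use studentized residuals and a slightly different reference distribution, so treat this as an approximation.

```python
from statistics import NormalDist, mean, stdev

# Hypothetical residuals, invented purely for illustration.
residuals = [-3.1, -1.2, -0.8, -0.3, 0.0, 0.4, 0.9, 1.5, 6.0]


def qq_points(values):
    """Pair each sorted, standardized value with its theoretical normal quantile."""
    n = len(values)
    centre, spread = mean(values), stdev(values)
    standardized = sorted((v - centre) / spread for v in values)
    # Quantile of the standard normal curve at each plotting position.
    theoretical = [NormalDist().inv_cdf((i - 0.5) / n) for i in range(1, n + 1)]
    return list(zip(theoretical, standardized))


points = qq_points(residuals)
for theo, obs in points:
    print(f"{theo:6.2f}  {obs:6.2f}")
```

If the residuals followed a bell curve, the two columns would track each other closely; a pair that diverges sharply at either end is a candidate outlier.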

As alluded to above, a third technique helps if you are concerned about an observation being an outlier. You exclude the suspect observation (state) from the regression. If the model’s coefficients don’t change much, then you don’t have to worry.
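Here is a minimal sketch of that third technique: drop each observation in turn, refit, and see how far the slope moves. It uses a one-predictor model (lawyers on F500 headquarters) and only the six state rows printed in this series, so it is an illustration rather than our actual model; the helper `fit_line` is ours.

```python
# Six state rows from the data table in this series.
states = ["AK", "AL", "AR", "AZ", "CA", "CO"]
f500 = [0, 1, 7, 5, 54, 9]
lawyers = [1585, 7615, 4447, 8023, 85274, 11584]


def fit_line(x, y):
    """Least-squares intercept and slope for a single predictor."""
    n = len(x)
    x_mean, y_mean = sum(x) / n, sum(y) / n
    sxx = sum((xi - x_mean) ** 2 for xi in x)
    slope = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y)) / sxx
    return y_mean - slope * x_mean, slope


_, full_slope = fit_line(f500, lawyers)

# Refit with each state left out and record how much the slope moves.
slope_changes = {}
for i, state in enumerate(states):
    x_rest = f500[:i] + f500[i + 1:]
    y_rest = lawyers[:i] + lawyers[i + 1:]
    _, slope = fit_line(x_rest, y_rest)
    slope_changes[state] = slope - full_slope

for state, change in slope_changes.items():
    print(f"without {state}: slope changes by {change:+.1f}")
```

In this tiny sample, leaving out California moves the slope far more than leaving out any other state, which is the signal that deserves scrutiny.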

Machine learning for lawyers: collinearity in linear regression

We have reached the fourth assumption for valid multiple linear regression: correlations between predictor variables cannot be too high. If the correlation between any pair of predictor variables is close to 1 or -1, a problem arises: it can be difficult to separate out the individual effects of those variables on the response. Predictor variables (aka independent variables) should be independent of each other.

For example, the correlation between the number of F500 headquarters in a state and the number of business establishments with fewer than 500 employees is very high, at 0.90. The two counts relate very closely to each other, which makes sense. States have more or less vigorous business activity whether measured at the huge corporate stratum or the small business stratum.

The plot below shows how closely the values for each state of those two variables march together toward the upper right. The ellipse emphasizes that the association holds particularly strongly at the smaller values of the two variables.

Collinearity is the term used for this undesirable situation: predictor variables rise or fall closely in unison. To the degree they do, one of them is redundant, and the effective number of predictors is smaller than the actual number of predictors. Other weaknesses in the linear model we will leave until later.

To see how closely predictors correlate with each other you can unleash a correlation matrix plot. In the plot, for various reasons we haven’t included some of the data for states, such as population, area and gdp.

This intimidating plot offers three kinds of insights. First, it shows a scatter plot of each predictor variable against each of the other predictors. For example, in the first column (F500), the second cell down shows a scatter plot of that variable on the horizontal axis and law school enrollment (enrollment — which is the second row with its axis being the vertical axis).

Second, the diagonal from the top left down the middle contains density plots that display how the predictor variable is distributed. For example, in the column for the number of businesses that have fewer than 500 employees (Less500), the density plot bulges high on the left, which means that quite a few states have relatively few of those enterprises but a handful stretching out to the far right have many.

Third, the upper triangle prints the correlation of each predictor against the others. As an example, the number of prisoners in 2009 (prison09) hardly correlates at all with the percentage of high school graduates (HS), at -0.199.

A correlation matrix plot helps you figure out which predictor variables are too closely correlated with others to be in the same regression model.
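The numbers in the upper triangle can be reproduced with a few lines of code. This sketch computes the Pearson correlation for every pair of three columns, using only the six state rows printed in the data table of this series, and flags pairs above a cut-off. The 0.9 cut-off and the function name are our assumptions for illustration, not a statistical rule.

```python
from itertools import combinations
from math import sqrt

# Three columns for the six state rows shown in this series.
columns = {
    "population": [735132, 4822023, 2949131, 6553255, 38041430, 5187582],
    "gdp": [51859, 183547, 109557, 266891, 2003479, 274048],
    "F500": [0, 1, 7, 5, 54, 9],
}


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(x)
    x_mean, y_mean = sum(x) / n, sum(y) / n
    cov = sum((a - x_mean) * (b - y_mean) for a, b in zip(x, y))
    var_x = sum((a - x_mean) ** 2 for a in x)
    var_y = sum((b - y_mean) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


THRESHOLD = 0.9  # hypothetical cut-off for "too closely correlated"

correlations = {
    (a, b): pearson(columns[a], columns[b]) for a, b in combinations(columns, 2)
}
flagged = [pair for pair, r in correlations.items() if abs(r) > THRESHOLD]

for (a, b), r in correlations.items():
    print(f"{a} vs {b}: {r:.3f}")
```

Any pair that lands in `flagged` is a candidate for dropping one of its two variables from the model.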

Similar distribution of response residuals (homoscedasticity)

Having discussed two assumptions of linear regression, linearity between predictors and the response variable and the bell-curvishness of the residuals, a third assumption needs to be introduced.

Linear regression also assumes that the residuals have a consistently-shaped distribution around zero, regardless of the values of the predictors. This means that at whatever the number of F500 headquarters and sub-500 employee businesses, the magnitude of the residuals for lawyers (the difference between the actual number of private-practice lawyers in a state and what the model with the two predictor variables estimates) doesn’t change too much. In statistical terms, the variance of the residuals stays constant. Variance, since you ask, measures the dispersion of a set of numbers.

A tool to eyeball whether the assumption holds is to plot the fitted response values on the bottom axis against their residual values (how far off they are). Cleverly called a “Residuals vs. Fitted” plot, it should display no pattern and the magnitude of the spread of points around zero should be similar regardless of the fitted value.
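If you prefer a number to an eyeball, one crude check is to compare the typical size of the residuals for the lower half of the fitted values with that for the upper half. The sketch below does so on invented data whose scatter deliberately grows with the predictor. It is a rough stand-in for the plot, not a formal statistical test, and the helper name is ours.

```python
# Invented data: the response scatters more widely as x grows.
x = list(range(1, 11))
y = [xi * (2.2 if xi % 2 == 0 else 1.8) for xi in x]


def fit_line(x, y):
    """Least-squares intercept and slope for a single predictor."""
    n = len(x)
    x_mean, y_mean = sum(x) / n, sum(y) / n
    sxx = sum((xi - x_mean) ** 2 for xi in x)
    slope = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y)) / sxx
    return y_mean - slope * x_mean, slope


intercept, slope = fit_line(x, y)
fitted = [intercept + slope * xi for xi in x]
residuals = [yi - fi for yi, fi in zip(y, fitted)]

# Sort by fitted value, then compare the average absolute residual in each half.
pairs = sorted(zip(fitted, residuals))
half = len(pairs) // 2
spread_low = sum(abs(r) for _, r in pairs[:half]) / half
spread_high = sum(abs(r) for _, r in pairs[half:]) / (len(pairs) - half)

print(f"typical miss, lower fitted values: {spread_low:.2f}")
print(f"typical miss, upper fitted values: {spread_high:.2f}")
```

When the second figure is much larger than the first, as it is with this contrived data, the residuals fan out and the constant-variance assumption is in doubt.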

Basically, this assumption of linear regression looks at how spread out the residuals are and whether the “spread” varies by the magnitude of the predictor. Cover your eyes if large words make you queasy, because what you want is minimal spread and approximately similar numbers on both sides of a smoothed line, the mouth-filler known affectionately as homoscedasticity.

This plot, however, tells us that residuals lose desirable compactness and consistency as lawyers reach the largest values on the right. New York stands at the far right top, an unusual set of many F500 headquarters (54) and sub-500 employee companies (454,718) that apparently would predict a quite different number than the 96,000 in real life. On the left of the plot, with two exceptions, the data has relatively uniform variance, but not on the right. We added a smoothing curve and since it veers off the zero line at two points, this is a sign of systematic under- or over-prediction in certain ranges: the errors are correlated with the dependent variable.

Later we will discuss how this kind of situation offers a possible solution: to transform the response variable.

Assumption of normally distributed residuals

As discussed before, in a linear regression model that your firm or department can rely on, the relationship between predictor variables and the response variable must be linear. Additionally, the residuals of the model must be normally distributed.

We should unpack that last sentence.

A residual is the difference between the actual value of the response variable (the number of lawyers in a state) and the value estimated for it by the linear regression model based on all of the predictor variables. Ordinary least-squares regression produces the best-fit line (or, when more than one predictor variable is in the formula, a plane or hyperplane) that leaves the smallest sum of squared residuals. We will explain later the mathematics that minimizes residuals.
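To make residuals concrete, the sketch below fits a one-predictor line to the six state rows printed in this series and lists each state’s residual: actual lawyers minus the lawyers the line estimates. It then nudges the slope up and down by 5% to show that the least-squares line leaves a smaller sum of squared residuals than its neighbors. The helper names are ours, and the six-row fit is an illustration, not our full model.

```python
# Six state rows from the data table in this series.
states = ["AK", "AL", "AR", "AZ", "CA", "CO"]
f500 = [0, 1, 7, 5, 54, 9]
lawyers = [1585, 7615, 4447, 8023, 85274, 11584]


def fit_line(x, y):
    """Least-squares intercept and slope for a single predictor."""
    n = len(x)
    x_mean, y_mean = sum(x) / n, sum(y) / n
    sxx = sum((xi - x_mean) ** 2 for xi in x)
    slope = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y)) / sxx
    return y_mean - slope * x_mean, slope


def sum_squared_residuals(intercept, slope, x, y):
    """Total of the squared misses for a given line."""
    return sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))


b0, b1 = fit_line(f500, lawyers)
residuals = [yi - (b0 + b1 * xi) for xi, yi in zip(f500, lawyers)]
for state, e in zip(states, residuals):
    print(f"{state}: residual {e:+.0f} lawyers")

best = sum_squared_residuals(b0, b1, f500, lawyers)
steeper = sum_squared_residuals(b0, b1 * 1.05, f500, lawyers)
flatter = sum_squared_residuals(b0, b1 * 0.95, f500, lawyers)
print(best < steeper and best < flatter)
```

The positive and negative residuals of a least-squares line cancel out, which is why the method works with their squares.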

OK, so residuals are about how far off real data points are from regression line points. What about a ‘normal distribution’ of those residuals? A distribution is statistics-speak for a group of numbers. If the numbers in a distribution are plotted on a graph by how many of the numbers are at each value, it is a normal distribution if the shape is reasonably close to the often-seen bell curve (relatively few numbers far to either side on the tails and most of them clustered and piled up toward the middle).

Here is a histogram [For more on histograms, click here for my book on law firm data and graphics] of the residuals from two predictors: F500 headquarters and the number of enterprises in the state that have fewer than 500 employees. You have to imagine a three-dimensional cube where the bottom axis of the cube is one predictor and the axis going back is the other predictor, while the left side of the cube is the response variable. Regression software creates a best-fit two-dimensional plane for the predictors and the residual is the vertical distance from each state’s actual response value to that plane.

You can make out a partial bell curve, applying to most of the residuals, except that three residuals stick far out on the right tail of the histogram. We should investigate those states’ values, because they could be outliers arising from a mistake in the data. Meanwhile, however, the shape reasonably resembles a bell and therefore satisfies the assumption that the residuals be normally distributed.

As an aside, when your regression model has more than one predictor variable, it’s harder to visualize a best fit “line”. If you have only two predictors, you can consider a plane as the best fit — as if a stiff piece of paper represents the ‘line.’ But with more predictors the mind boggles at visualizing a hyperplane. Software has no such frailties and it will figure out the residual of each point no matter how many predictor variables.

Machine learning for lawyers: linear regression assumptions

To produce valid results, a linear regression model needs the relationship between the predictor and response variables to be linear. For example, multiply every additional F500 headquarters by 1,000 (and add the intercept value) to predict the number of practicing lawyers in a state. Since 1,000 is the same for every number of F500 headquarters and since it is not squaring the headquarters numbers or taking its logarithm or some other mathematical operation, the relationship can be plotted on a line: it’s linear.

Bear in mind that the software will blindly calculate a best-fit line even if the data is absolutely random, looks U-shaped, hockey-stick shaped, or exhibits a bizarre irregularity. The requirement for valid linear regression goes back to the slope of the best-fit line, which has a constant number — not a squared number or some varying number — that multiplies the values of the predictors to estimate the response variable.

Got it, but how can you tell if your data satisfies the criterion of linearity? Quite often you can simply eyeball a scatter plot. You plot each state’s predictor variable on the horizontal axis, increasing in numbers of F500 headquarters to the right. You plot the corresponding response variable on the vertical axis, increasing toward the top. If the pattern suggests a relationship (the two variables rise together or fall together, or if you were to sketch an ellipse around the bulk of the data it would look something like a tilted football), you have a linear relationship between the variables and satisfy the assumption for linear regression.

This plot suggests a roughly linear increase in the number of lawyers as F500 headquarters increase in that you can draw a straight line from the lower left mid-way through the points up to the upper right.

The plot below doesn’t look nearly as much like a linear pattern, so that predictor is probably less useful [unlikely to be statistically significant], but the distribution of points still is reasonably linear. Another clue to relative linearity of two predictors is the correlation between lawyers and F500 headquarters, 0.9, whereas the correlation between lawyers and the urban population as a percent of total population is half that (0.45).

 

If the variables do not have a linear association, transformations of the variables are possible, such as using their square roots, but that’s the subject of another post. Also, other kinds of regression might fit a valid model, but those alternatives are too advanced for this overview. It is also important to check for outliers, very unusual and influential data points, which we will return to later.

Suggestions for spreadsheets used by lawyers in machine learning

Law firms and law departments can store their data for regression models, or any machine learning algorithm, in a spreadsheet. Spreadsheets are perfectly fine, and indeed Excel and like programs can perform linear regressions, but more powerful software such as Mathematica, Matlab, Python (open source), R (open source), SAS, SPSS, and Tableau typically take in data from a spreadsheet before they can begin their magic.

Here are some observations and good hygiene for spreadsheets that contain regression data.

Store the data in columns, not rows. So, in our data set, each state is a row and each column contains the values for a variable. Here are the first six rows of the data being used in this series of posts. <chr>, <dbl> and <fct> stand for a text variable, a numeric variable (double precision is what gives it the “dbl”), and a categorical or factor variable, respectively, as used by the R programming language.

  state population    area lawyers     gdp  F500 capital     region party
  <chr>      <dbl>   <dbl>   <dbl>   <dbl> <dbl> <chr>       <fct>  <fct>
1 AK        735132 665384.    1585   51859     0 Juneau      West   rep  
2 AL       4822023  52420.    7615  183547     1 Montgomery  South  rep  
3 AR       2949131  53179.    4447  109557     7 Little Rock South  rep  
4 AZ       6553255 113990.    8023  266891     5 Phoenix     West   rep  
5 CA      38041430 163696.   85274 2003479    54 Sacramento  West   dem  
6 CO       5187582 104094.   11584  274048     9 Denver      West   dem 

Put each observation on its own row; by convention, the name of the observation goes in the first column. In real life, observations might be associates, partners, offices, countries, law firms, client groups, matters or others.

The arrangement of the columns does not matter to the software nor does the order of the observations. Our data set has the observation (state) in the first column, the dependent variable, lawyers, in the fourth column and F500 headquarters in the sixth. It also doesn’t matter if some columns (or some rows) are not used in the model.

Try not to have extraneous rows, such as headers, summaries, or sub-tables. Leave out totals, explanatory text and blank rows. It doesn’t matter, we should note, if you have superfluous columns on the right or rows below the data, because your software can easily remove or ignore them.

Make sure you have only numbers in columns for numbers. That is to say, do not have commas, dollar signs or any text (17500 USD is a no-no). Decimal points in values are perfectly fine, as in “area” above.

If you are missing values, be consistent in how you identify them. The best approach is to put nothing in the cell; avoid hyphens or writing something, such as “no value.” If text and numbers are jumbled together, most software will treat the variable as text and cannot do math on the text.

Writing code is easier if you use one-word names for your columns and make them understandable. For example, use “hours” or “firm” or “type” rather than “billable hours” or “paid vendor”, or “x-1 code”. It’s fine to use camelback style like “StateGDP” or “AreaState”.
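To show why these habits matter, here is a small sketch of the clean-up that software must otherwise perform. It reads a tiny, invented spreadsheet saved as comma-separated text, strips dollar signs and commas from numbers, and treats blank cells or stray text as missing. The column names, the values and the helper `to_number` are hypothetical.

```python
import csv
import io

# An invented three-row export that breaks several of the rules above.
raw = (
    "firm,hours,fees\n"
    'Alpha,120,"$17,500"\n'
    "Beta,,9800\n"
    "Gamma,85,no value\n"
)


def to_number(cell):
    """Convert a cell to a float; return None for blanks or stray text."""
    cleaned = cell.replace("$", "").replace(",", "").strip()
    if cleaned == "":
        return None
    try:
        return float(cleaned)
    except ValueError:
        return None  # text such as "no value" is treated as missing


rows = []
for record in csv.DictReader(io.StringIO(raw)):
    rows.append(
        {
            "firm": record["firm"],
            "hours": to_number(record["hours"]),
            "fees": to_number(record["fees"]),
        }
    )

for row in rows:
    print(row)
```

A spreadsheet that already follows the suggestions above needs none of this code, which is the point of the hygiene.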

Effect size in regression, and holding variables constant

Each coefficient of a regression model measures what is called effect size. Returning to our data, the effect size tells how much the predicted number of lawyers changes with a change of one F500 headquarters. Since our single-variable coefficient is 1,265 it would mean that every additional F500 headquarters increases the estimated lawyers by 1,265; every F500 headquarters less would drop the estimated lawyers by 1,265. So, effect size indicates the influence of one plus-or-minus predictor unit on the response number. But we shouldn’t use that single effect size because we have data for other variables that contribute to the estimated number of lawyers.

To this point we have been conducting linear regression with only a single predictor. If we include in our model more than one predictor, by the way, we are using multiple linear regression.

To progress to multiple linear regression and to see more effect sizes and how coefficients change when there are additional predictor variables, let’s add to our model the predictor variable of state population. The resulting regression equation appears below.

[1] “lawyers = -275.8 + 767.689 * F500 + 0.001 * population + e”

The coefficient for F500 headquarters has dropped dramatically, from 1,265 to 768.  Further, we see a tiny coefficient for state population.

But you can’t look at the absolute size of a coefficient and decide whether it’s more or less influential than another predictor’s coefficient, because even if they are both statistically significant, you also must take into account the units of the predictor variable. Our units are one headquarters and one state resident. Intuitively, a change in one headquarters ought to make much more of a difference in the lawyer count than the change of one resident.

Both predictors are statistically significant, so what do the coefficients tell us about translating their variable values into the real world? On this two-predictor model, every increase or decrease in the number of F500 headquarters changes the estimated number of lawyers by 768, when we hold state population constant. Every increase or decrease in the population changes the estimated lawyers by 1/1000, holding F500 headquarters constant. Thus, for every thousand additional residents in a state, this model predicts an additional lawyer.

Holding other predictors constant means that the software keeps the other predictor variables at fixed values, so that changes in them cannot account for any change in the response variable, the number of private practice lawyers. Doing so isolates the effect of the one predictor being examined on the response variable.
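The arithmetic of holding a variable constant can be sketched with the two-predictor equation reported above. The coefficients come from that equation (the population coefficient is rounded there to 0.001); the example state with 9 headquarters and 5,000,000 residents is hypothetical, as is the function name.

```python
# Coefficients from the two-predictor equation reported in this post.
INTERCEPT = -275.8
COEF_F500 = 767.689
COEF_POPULATION = 0.001  # rounded in the reported equation


def predict_lawyers(f500, population):
    """Estimated lawyers from the two-predictor model."""
    return INTERCEPT + COEF_F500 * f500 + COEF_POPULATION * population


# A hypothetical state: 9 headquarters and 5,000,000 residents.
base = predict_lawyers(9, 5_000_000)

# Change one predictor at a time while the other stays fixed.
one_more_hq = predict_lawyers(10, 5_000_000) - base
thousand_more_residents = predict_lawyers(9, 5_001_000) - base

print(f"one more F500 headquarters: {one_more_hq:+.1f} lawyers")
print(f"one thousand more residents: {thousand_more_residents:+.1f} lawyers")
```

Whatever fixed value you choose for the other predictor, the difference comes out the same, which is what lets us read a coefficient as the effect of its own predictor alone.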

We need to include as many predictor variables as we have available and evaluate that multiple linear regression model, which we will do in a later post.