Imputation for missing data in machine learning

Most collections of data have holes: missing data points. You don’t have the law school graduation year for this associate or the number of matters worked on for that associate. When you include those associates’ information in your regression modelling, the software may drop an associate entirely because one piece is missing. You don’t want that to happen because then you have also lost the remaining, valid data for that associate.

Likewise, to shift examples, suppose you are studying your firm’s fees charged for reviewing securities law filings and you have completed 65 such matters over the past few years. If 10 of them are missing a figure for the client’s revenue, you have actually shrunk the analyzable set to only 55 matters.
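To make the shrinkage concrete, here is a minimal sketch in plain Python of what happens when software keeps only complete rows. The field names (`matter_id`, `fee`, `client_revenue`), the figures, and the `complete_cases` helper are all invented for illustration; only the counts of 65 matters and 10 missing revenue figures come from the example above.

```python
# Hypothetical data: 65 matters, of which every seventh one (10 in total)
# has no client revenue figure. All field names and amounts are made up.
matters = [
    {
        "matter_id": i,
        "fee": 10_000 + 500 * i,
        "client_revenue": None if i % 7 == 0 else 1_000_000 + 50_000 * i,
    }
    for i in range(65)
]


def complete_cases(rows):
    """Keep only rows in which every field has a value."""
    return [row for row in rows if all(v is not None for v in row.values())]


usable = complete_cases(matters)
print(len(matters), len(usable))  # 65 55
```

The ten dropped matters still had perfectly good fee data, which is exactly the loss that imputation tries to avoid.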

To see what’s missing, a picture is invaluable, as always with analyses. Here is a map of a data set with 500 observations that has some values missing in some of its 17 variables. A light, vertical line means that the observation on the horizontal axis had no value for that variable.
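A plotted map needs a graphics package, but the same idea can be roughed out as text. The sketch below is an assumption-laden stand-in, not the tool that produced the map: it uses five invented associates and three invented variables, prints one row per variable with observations running left to right, and marks a missing value with an `X`.

```python
# Hypothetical associates; None marks a missing value.
associates = [
    {"grad_year": 2012, "matters": 14, "eval_score": 4.1},
    {"grad_year": None, "matters": 9, "eval_score": 3.8},
    {"grad_year": 2015, "matters": None, "eval_score": None},
    {"grad_year": 2010, "matters": 22, "eval_score": 4.5},
    {"grad_year": 2018, "matters": 5, "eval_score": None},
]


def missing_map(rows, variables):
    """One text line per variable: '.' for a present value, 'X' for a missing one."""
    lines = []
    for var in variables:
        cells = "".join("X" if row[var] is None else "." for row in rows)
        lines.append(f"{var:<12}{cells}")
    return lines


def missing_counts(rows, variables):
    """Count the missing values in each variable."""
    return {var: sum(row[var] is None for row in rows) for var in variables}


variables = ["grad_year", "matters", "eval_score"]
for line in missing_map(associates, variables):
    print(line)
counts = missing_counts(associates, variables)
print(counts)
```

The counts per variable are often the first thing to read off such a map, since a variable with many gaps may not be worth imputing at all.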

Make sure that no pattern explains the missing data, as would be the case if all the corporate department lawyers had no evaluation scores. For the rest of this post, let’s assume that your data is missing at random, not for some identifiable reason such as the Chicago office failing to turn in its response sheet.
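One simple way to look for such a pattern is to compare the share of missing values across groups. The following is a minimal sketch with invented departments and scores; the `missing_rate_by_group` helper is hypothetical, and a large gap between groups is only a warning sign, not a formal test of randomness.

```python
from collections import defaultdict

# Hypothetical lawyers; None marks a missing evaluation score.
lawyers = [
    {"department": "Corporate", "eval_score": None},
    {"department": "Corporate", "eval_score": None},
    {"department": "Litigation", "eval_score": 4.2},
    {"department": "Litigation", "eval_score": None},
    {"department": "Litigation", "eval_score": 3.9},
    {"department": "Tax", "eval_score": 4.0},
    {"department": "Tax", "eval_score": 3.5},
]


def missing_rate_by_group(rows, group_key, value_key):
    """Return the fraction of rows with a missing value_key in each group."""
    totals = defaultdict(int)
    missing = defaultdict(int)
    for row in rows:
        totals[row[group_key]] += 1
        if row[value_key] is None:
            missing[row[group_key]] += 1
    return {group: missing[group] / totals[group] for group in totals}


rates = missing_rate_by_group(lawyers, "department", "eval_score")
for department, rate in rates.items():
    print(f"{department}: {rate:.0%} missing")
```

In this made-up data every Corporate score is missing, which is the kind of pattern that should stop you from treating the gaps as random.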

To counter the clobbering of good data caused by absent data, analysts resort to a range of methods to plug in plausible figures and thereby save the remaining data. These methods, called imputation, are an important step when you prepare data for analysis.

The simplest method plugs in the average or median of all the known values for that variable. With this method, the average or median year of law school graduation would be inserted for an associate whose year is unknown. Many other methods are available; they require more calculation, but their imputed values are likely to be closer to the actual, unavailable data. For example, you can run a regression model based on what you know and predict the value(s) you don’t know. For more on data imputation, see my article, Rees W. Morrison, “Missing in Action: Impute Intelligently before Deciding Based on Data”, LegalTechnology News, April 2017.
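Both approaches mentioned above can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not a recommended implementation: the data are invented, the helper names `impute_median` and `impute_by_regression` are hypothetical, and the regression version uses a single predictor that is assumed to be known for every observation.

```python
from statistics import mean, median


def impute_median(values):
    """Replace each None with the median of the known values."""
    known = [v for v in values if v is not None]
    fill = median(known)
    return [fill if v is None else v for v in values]


def impute_by_regression(xs, ys):
    """Fill missing ys from a least-squares line fitted to the complete (x, y) pairs.

    Assumes every x is known and at least two distinct x values are complete.
    """
    pairs = [(x, y) for x, y in zip(xs, ys) if y is not None]
    x_bar = mean(x for x, _ in pairs)
    y_bar = mean(y for _, y in pairs)
    slope = sum((x - x_bar) * (y - y_bar) for x, y in pairs) / sum(
        (x - x_bar) ** 2 for x, _ in pairs
    )
    intercept = y_bar - slope * x_bar
    return [intercept + slope * x if y is None else y for x, y in zip(xs, ys)]


# Hypothetical graduation years, one unknown.
grad_years = [2010, 2012, None, 2016, 2018]
filled_years = impute_median(grad_years)
print(filled_years)  # the gap is filled with 2014.0

# Hypothetical years of experience (always known) and matters worked (one unknown).
experience = [1, 2, 3, 4, 5]
matters_worked = [10, 20, None, 40, 50]
filled_matters = impute_by_regression(experience, matters_worked)
print(filled_matters)  # the gap is filled with 30.0
```

The median fill ignores everything else you know about the associate, while the regression fill uses the associate's experience to choose a value, which is why it tends to land closer to the truth when the predictor is informative.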
