Rooting around for meaning (RMSE)

After fitting a regression model, analysts look closely at how far the model’s estimates lie from the actuals. Such deviations, which we met before as residuals, permit several calculations that can assess the accuracy of the model.

The residuals (captured by the error term in a regression equation) represent unsystematic deviations from the ‘ground truth’ response values, the bouncing around of the estimates, the noise. The residuals average to zero because of the way the regression calculations are done: they range on both sides of the best-fit line, some negative and some positive, but overall they cancel each other out. Every observation (U.S. state) has a residual because the model estimates each state’s number of practicing lawyers, which can be compared to the real number. The gaps constitute the residuals.

Ordinary least squares (OLS) regression squares the residuals and then minimizes the sum of those squares, thereby setting the equation and the best-fit line.

Back to assessing a model’s accuracy. Software can calculate the square root of the average of those squared residuals: that square root is the Root Mean Squared Error (RMSE).
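If you want to see the arithmetic, here is a minimal sketch in R. The data frame name states and the response column lawyers are placeholders for illustration; Less500, enrollment, and urbanpct are the predictor names used in these posts.

```r
# Fit a model and compute the RMSE by hand (illustrative names only)
fit <- lm(lawyers ~ Less500 + enrollment + urbanpct, data = states)

rmse <- sqrt(mean(residuals(fit)^2))  # square root of the mean of the squared residuals
rmse
```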

So, what’s the big deal? The deal is that the RMSE lets you compare the accuracy of linear regression models that have different mixes of predictor variables (but not across data sets, because RMSE is scale-dependent). In general, a lower RMSE is better than a higher one. For example, the three predictors of Less500, enrollment, and urbanpct yield an RMSE for practicing lawyers of 4,708. If we drop urbanpct, the RMSE rises a tiny amount, to 4,713, meaning that the smaller model is slightly worse at predicting practicing lawyers.

Here’s the deeper deal. Recall that Adjusted R-squared measures the amount of variance in the response variable (practicing lawyers in a state) that can be explained by our model. It gives one view of the quality of the regression model.

Sometimes, however, we may be more interested in quantifying the residuals in the same measuring unit as the response variable. We want a figure for the plus-or-minus range of estimated practicing lawyers in a state. We could consider the average of the residuals of the model, except that linear regression residuals always average zero. So we need another way to quantify the residuals; the RMSE does the trick, and it has the same measuring unit as the response variable, lawyers.

Think of the RMSE as a measure of the width of the data cloud around the line of perfect prediction. The effect of each residual on RMSE is proportional to the size of its square; thus larger residuals have a disproportionately large effect on RMSE. Consequently, and problematically, RMSE is sensitive to outliers.

To solve the problem of large residuals skewing the RMSE, we can use the Mean Absolute Error (MAE), the average of the absolute values of the residuals (the absolute value treats the negative and the positive of a number the same). This measurement is more robust against (less susceptible to) large residuals because we are not squaring them. MAE is also probably easier to understand than the square root of the average of squared residuals. It is usually similar in magnitude to RMSE, but slightly smaller.
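Continuing the sketch above (same assumed model object fit), the MAE takes one line of R:

```r
mae <- mean(abs(residuals(fit)))  # average of the absolute residuals
mae
```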

Being sure about confidence intervals

Using our data, we can be 95% confident that the true change in practicing lawyers for adding or subtracting one student in a top-100 law school in the state lies between 2.5 and 7.8 lawyers. For a given confidence level (such as 95%), a narrower interval indicates a more precise estimate, whereas a broader interval indicates less precision. When the high boundary of the confidence interval is a multiple of the low boundary, we are less sure of the association between the predictor and the response variable.
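In R, the confint() function reports these intervals for every coefficient of a fitted model. A sketch, again using the assumed model object fit:

```r
confint(fit, level = 0.95)  # 95% confidence interval for each coefficient
```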

Note something else: because the confidence interval for the percentage of the state’s population living in an urban area contains zero, a change in it is insufficiently related to practicing lawyer counts, statistically speaking, holding the other predictors constant. In other words, if the change might be zero, there is no statistically meaningful effect from that predictor. On the other hand, because the law school enrollment interval does not straddle zero, that predictor has a statistically significant p-value.

A confidence interval is an interval of good estimates of the unknown true population parameter.

Here is a plot with confidence intervals around the best-fit line of the Less500 predictor (companies with fewer than 500 employees). The intervals show as the shaded portions above and below the line. You can be 95 percent confident that the vertical range contains the true number of private practice lawyers for a state with the corresponding predictor value on the horizontal axis. If the predictor is indeed associated with the response variable, the more data a plot has in an area, the narrower the confidence interval, as you are more and more sure of the estimate.
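A plot along those lines can be sketched in base R with predict() and its interval = "confidence" option. The data frame name states and the response column lawyers are assumptions; Less500 is the predictor discussed above.

```r
# One-predictor model and its 95% confidence band (illustrative names)
one  <- lm(lawyers ~ Less500, data = states)
grid <- data.frame(Less500 = seq(min(states$Less500), max(states$Less500),
                                 length.out = 100))
band <- predict(one, newdata = grid, interval = "confidence", level = 0.95)

plot(states$Less500, states$lawyers,
     xlab = "Companies with fewer than 500 employees",
     ylab = "Practicing lawyers")
lines(grid$Less500, band[, "fit"])           # best-fit line
lines(grid$Less500, band[, "lwr"], lty = 2)  # lower 95% bound
lines(grid$Less500, band[, "upr"], lty = 2)  # upper 95% bound
```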

One more aspect. We should explain how statisticians use the terms population and sample. The entire set of what you would like to count is the population; the portion of the population that you obtain is a sample from that population. So, for example, all the 45 associates in a firm comprise a population; the 15 selected at random to take a survey would be a sample from the population.

Statistics offers an impressive toolbox for making inferences from a partial sample to the entire population — and stating how likely those inferences are correct.

If we repeatedly sampled from the larger population (a different mix of 15 associates each time) and calculated a 95% confidence interval from each sample, about 95% of those intervals would contain the true population value of whatever we are estimating from the linear regression. In other words, there is a 95% chance of selecting a sample such that the 95% confidence interval calculated from that sample contains that correct value.

The confidence level does not express the chance that repeated sample estimates will fall into the confidence interval. Nor does it give the probability that the unknown mean for the response variable lies within the confidence interval.

Regression and categorical variables for dummies

If a predictor variable is categorical (aka a factor), a variable with a finite number of levels like male/female or Democrat/Republican/independent, linear regression can still flourish. Software will convert the factor into dummy variables, one fewer than the number of levels in the factor.

Typically the alphabetically first level becomes the reference level, the level coded zero on every dummy variable, while the other levels are the comparison levels. Since the intercept is the expected mean when the predictors equal zero, the intercept indicates the mean value for the reference group, the Democratic party (for an observation in the reference group, every dummy variable equals zero). The regression model’s output does not show a coefficient for the reference level; the coefficients of the other levels are measured with respect to the reference level.

The coefficient for each comparison level tells how much higher or lower it is than the reference level; it equals the difference between the mean of the level coded 1 and the mean of the reference group. For our data, the coefficient for Republican is -976, the average difference in the number of practicing lawyers between the reference category (Democrat) and the category coded 1 (Republican): the Republican mean is that much less than the intercept value (the Democrat mean).
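In R, declaring the column a factor is all it takes; the dummy coding happens behind the scenes. A sketch, where the data frame states and the column name party are assumptions for illustration:

```r
# Regression with a categorical predictor (illustrative names)
states$party <- factor(states$party)         # alphabetically first level becomes the reference
fit_party    <- lm(lawyers ~ party, data = states)
summary(fit_party)   # intercept = reference-group mean; partyRepublican = difference from it

# To pick a different reference level:
states$party <- relevel(states$party, ref = "Republican")
```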

Meet up with the regression model’s intercept

The intercept (sometimes referred to as the “constant”) shows up in a regression formula. The intercept is the expected average value of the response variable (practicing lawyers in a state) if all the predictor values were zero. If no predictor can plausibly equal zero, then the intercept has no real-world meaning and tells nothing about the relationship between the predictors and the response. With our two-predictor model, using only companies with fewer than 500 employees and total enrollment in top-100 law schools, the intercept is -1,917, which absurdly says that if all the states had no top law schools and no small companies, the model’s estimate for the number of practicing lawyers would be MINUS 1,917 lawyers. But of course, neither predictor is ever zero.

When no predictor equals zero you have a reason to center them. That means you re-scale the predictors so that their averages do equal zero (software subtracts the average of the predictor’s values from each value). Now the intercept has meaning: it is the average value of the estimated response variable at the average values of the predictor variables. Returning to our model, when the two predictor variables are centered, the new intercept estimates 13,612 lawyers: precisely the actual average of practicing lawyers across the states.
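Here is a sketch of centering in R, using the same assumed data frame and column names as earlier:

```r
# Center the two predictors, then refit (illustrative names)
states$Less500_c    <- states$Less500 - mean(states$Less500)
states$enrollment_c <- states$enrollment - mean(states$enrollment)

fit_centered <- lm(lawyers ~ Less500_c + enrollment_c, data = states)
coef(fit_centered)["(Intercept)"]   # roughly the average number of practicing lawyers
```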

Regardless, as we explain elsewhere, you need the intercept in the regression formula to calculate predicted values.

Quality of a linear regression model: the F-test statistic

One more nugget gleams from a linear regression model: the F-statistic.   That statistic compares your model to a model that has no predictors.  The stripped-down model relies only on the average of the response variable (for us, the average number of private practice lawyers in a state).

Stated differently, the F-test statistic is a ratio: the top of the ratio (the numerator) is the variance in the estimated response variable explained by your model’s predictor variables. That figure is divided by the bottom of the ratio (the denominator), the variance the model leaves unexplained. This is not quite correct, because the F-statistic also takes account of degrees of freedom, but it is close enough for us.

Each combination of an F-test statistic and its degrees of freedom corresponds to a p-value. The statistic and its p-value are to the overall regression model much the same as the t-statistic and its p-value are to each coefficient estimate. However, the p-value for the F-test statistic depends on both the value of the statistic and the degrees of freedom. The fewer the degrees of freedom, the higher the F-test statistic needs to be in order to return the same p-value.
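In R, summary() of a fitted model reports the F-statistic and both degrees of freedom, and pf() turns them into a p-value. A sketch with the assumed model object fit:

```r
fstat <- summary(fit)$fstatistic   # value, numerator df, denominator df
fstat
pf(fstat["value"], fstat["numdf"], fstat["dendf"], lower.tail = FALSE)  # overall p-value
```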

If an F-test statistic is statistically significant, it implies that all the predictor variables together explain the response variable to a degree you can rely on in the eyes of a statistician.

While another statistic we wrote about, R-squared, estimates the strength of the relationship between your model’s predictors and the response variable, it does not provide a formal hypothesis test for the relationship, a core statistical concept which we will consider later.  The F-test statistic does so. If the p-value for the F-test statistic is less than your significance level, such as 0.05, you can conclude that R-squared is statistically significantly different from zero.

With a model that uses only two predictor variables (we used companies with fewer than 500 employees and total enrollment in top-100 law schools), the F-test statistic is highly significant because its p-value falls well below 0.05. We can be quite confident that the model explains the variability of the dependent variable (practicing lawyers) around its average far better than using just the average itself.

Machine learning, regression and degrees of freedom

An important concept in machine learning involves the relationship between the number of observations and the number of predictors: degrees of freedom. For multiple linear regression, the degrees of freedom equals the number of observations minus one more than the number of predictors fit by the model. On our data set, using three predictors, we have 46 degrees of freedom: 50 states minus the three predictors and the intercept.
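A quick check in R, using the assumed three-predictor model sketched earlier:

```r
df.residual(fit)          # residual degrees of freedom reported by R
nrow(states) - (3 + 1)    # 50 observations minus three predictors and the intercept = 46
```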

With linear regression, statisticians want the degrees of freedom to be large compared to the number of predictor variables to avoid over-fitting. You lose one degree of freedom for every coefficient fitted, so if we had 49 different predictor variables we would have no degrees of freedom and a horribly over-fitted model. It would be useless for interpretation and for prediction.

Aside from over-fitting, here are two results of fitting a linear regression model where degrees of freedom play a role. One of the calculations that results from fitting a regression model is R-squared. It gives a sense of the portion of the variance in the response explained by the model. Its cousin, Adjusted R-squared, increases when you add more variables to a regression formula only if the new predictor variables improve the model more than you would expect by chance. So Adjusted R-squared takes into account degrees of freedom.

A second place where linear regression uses degrees of freedom is the F-test statistic, which we explain elsewhere.

Machine-learning statistics: variance and standard deviation

It would be nice if lawyers could embrace regression without a grasp of its statistical underpinnings. If partners or associate general counsel are content with arm-waving, vague notions of regression, that is a choice. But if they want to take part in discussions about regression and feel assured that they understand what it can and cannot do, they should learn some statistical concepts. Broadly stated, that is the goal of this blog: to make machine learning software comprehensible to lawyers.

Specifically, this post explains two statistical stalwarts — variance and standard deviation — that play roles explicitly or implicitly in many other posts.

Variance is a statistical measure of how far the numbers in a collection of numbers are scattered from the collection’s average. It tells you about the collection’s degree of dispersion.

Take a law firm that has many offices, but does environmental work in only four of them. In those four offices, the firm has one environmental partner, three environmental partners, five, and seven respectively. The variance of that collection of numbers (1, 3, 5, 7) is 6.67.

If we were to calculate the variance by hand we would start with the average number of partners in the offices (sum 1, 3, 5, and 7 and divide by 4). The sum being 16 across the four offices, the average is four per office. Next, we would subtract that average, four, from each office’s number of partners. We would then square the result of the subtraction for each office (multiply the result by the result) and add up all of those squared numbers: (1 - 4) squared = 9; (3 - 4) squared = 1; (5 - 4) squared = 1; (7 - 4) squared = 9. Finally, we would divide that total (20) by the number of offices minus 1 (3).

A single command in statistical analysis software such as R does all this instantly: the variance is 6.67 (squared partners).
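For example, in R:

```r
partners <- c(1, 3, 5, 7)   # environmental partners in the four offices
var(partners)               # 6.67 (squared partners)
```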

Now, what if the largest environmental office had 11 partners instead of seven? Intuitively, you should sense that the variance would be larger, because there is a wider spread in the set of partner numbers (1, 3, 5, 11). You would be right! The variance of this set of partners is 18.67 (squared partners). Larger variances represent greater dispersion.

Most people find it easier to think about dispersion measures when they are expressed in the same units as the data rather than in squared units. Here, partners holds meaning more comfortably than squared partners (whatever that is!).

To convert variance to the original units, you find its square root, the number which multiplied by itself equals the variance. That figure is the standard deviation of the collection of partner numbers. The square root of the variance in the first example of offices (6.67) is 2.58 (2.58 times 2.58 = 6.67, with rounding); the square root of the larger variance in the second example (18.67) is 4.32.
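Again in R, sd() does the conversion for both examples:

```r
sd(c(1, 3, 5, 7))    # 2.58, the square root of the variance 6.67
sd(c(1, 3, 5, 11))   # 4.32, the square root of the variance 18.67
```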

A way to put the standard deviation into context is to compare it to the average of the numbers. So, in the first example of offices the standard deviation is approximately 2.6 while the average is 4 (the standard deviation is 65% of the average); in the second example, because the fourth office has 11 partners instead of 7, the standard deviation rises to 4.3 while the average increases to 5. Now the standard deviation is 86% of the average, so it confirms a much more varied collection of partner numbers.

What most people are familiar with is the standard deviation of a bell-shaped distribution. There, a bit more than two-thirds (about 68%) of all the numbers fall within one standard deviation above and one standard deviation below the average. Two standard deviations on either side of the average cover around 95% of the values. Bear in mind, however, that most distributions of numbers do not exhibit a so-called normal distribution (we will return to this important concept later), so the standard deviation cannot always be translated into such neat percentages. What you should understand is that the larger the standard deviation (relative to the average), the greater the dispersion among the numbers and the less precisely the average represents them.

Regression and interaction terms predicting number of lawyers

In a regression model’s formula, an interaction term captures an interplay between two predictor variables, which happens when the effect of one predictor variable on the response variable is modulated by the other predictor variable. An interaction term belongs in a regression formula, and the resulting model, when the change in the estimated response due to the two predictors in combination differs from the sum of the changes due to each predictor alone.

Our data for U.S. states doesn’t appear to have variables that suggest an interaction, but if we knew which states had carried out death sentences in the past five years, or which states had three-strikes-and-you’re-out felony laws, it is possible that we would find an interaction term of either of those variables combined with the number of prisoners.
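In an R formula, an asterisk between two predictors adds the interaction term along with the main effects. The sketch below uses hypothetical variables (prisoners and threestrikes) that are not in our data set:

```r
# Hypothetical interaction between prisoner counts and a three-strikes law
fit_int <- lm(lawyers ~ prisoners * threestrikes, data = states)
summary(fit_int)   # the prisoners:threestrikes row reports the interaction term
```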

Among the tools that can help spot potential interaction terms, one is an interaction plot. Parallel lines in an interaction plot indicate no interaction. The greater the difference in slopes between the lines, therefore, the higher the degree of interaction.

One form of interaction plot would have a solid line at the top marking one standard deviation above the average of the response variable (practicing lawyers), a lightly dotted line at the bottom marking one standard deviation below it, and a dotted line midway indicating the average number of lawyers. However, an interaction plot doesn’t say whether the interaction is statistically significant.

Imputation for missing data when machine learning

Most collections of data have holes, missing data points. You don’t have the law school graduation year for this associate or the number of matters worked on for that associate. When you include those associates’ information in your regression modelling, the software may drop an associate entirely because one piece is missing. You don’t want that to happen, because then you have also lost the remaining, valid data for that associate.

Likewise, to shift examples, if you are studying your firm’s fees charged for reviewing securities law filings and you have completed 65 such matters over the past few years, but 10 of them are missing a number for revenue of the client, you actually have shrunk the analyzable set to only 55 matters.

To know what’s missing, a picture is invaluable, as it always is with analyses. Here is a map of a data set with 500 observations that has some values missing in some of its 17 variables. A light, vertical line means that the observation on the horizontal axis had no value for that variable.
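Such a map can be sketched in base R from nothing more than a logical matrix of missing values; dat is an assumed name for the data frame:

```r
# Minimal missing-data map: observations across, variables up
miss <- is.na(dat)
image(1:nrow(miss), 1:ncol(miss), miss * 1,
      xlab = "Observation", ylab = "Variable",
      col = c("grey90", "white"))   # the lighter cells mark missing values
```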

Make sure that no pattern explains the missing data, such as all the corporate department lawyers having no evaluation scores. But let’s assume that your data is missing at random, not for some identifiable reason like the Chicago office failing to turn in its response sheet.

To counter the clobbering of good data caused by absent data, analysts resort to a range of methods to plug in plausible figures and thereby save the remaining data. These methods, called imputation, are an important step when you prepare data for analysis.

The simplest method plugs in the average or median of all the values for that variable. Doing this, the average or median year of law school graduation would be inserted for the unknown year of an associate. Many other methods are available, requiring more calculation but producing imputed values that are likely to be closer to the actual, unavailable data. For example, you can run a regression model based on what you know and predict the value(s) you don’t know. For more on data imputation, see my article, Rees W. Morrison, “Missing in Action: Impute Intelligently before Deciding Based on Data”, LegalTechnology News, April 2017.
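A sketch of the simplest approach in R, with assumed names (assoc for the data frame, grad_year for the variable with holes):

```r
# Median imputation: fill each missing graduation year with the median of the known ones
assoc$grad_year[is.na(assoc$grad_year)] <- median(assoc$grad_year, na.rm = TRUE)
```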

Machine learning and lawyers: influential regression values

A data point has large influence only if it strongly affects the regression model. Leverage takes into account only the extremeness of the predictor variable values, so a high-leverage observation may or may not be influential. A high-leverage data point is influential if it materially changes the tilt of the best-fit line. Think of it as having leverage (an extreme value among the predictor variable values) and also an outlying value, such that it singlehandedly alters the slope of the regression line considerably. Put differently, an influential point changes the coefficients that multiply the predictor values.

Statisticians spot influential data points by calculating Cook’s distance, but those scores don’t provide information on how the data points affect the model. This is particularly challenging, as it is very hard to visualize the impact of many predictors on the response.

Software computes the influence exerted by each observation (row in the spreadsheet, such as our states) on the predicted number of lawyers. It looks at how much the residuals of all the data points would change if any particular observation were excluded from the calculation of the regression coefficients. A large Cook’s distance indicates that excluding that state changes the coefficients substantially. A few states cross that threshold.
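In R, cooks.distance() returns one score per observation for a fitted model; a sketch with the assumed model object fit:

```r
cd <- cooks.distance(fit)
sort(cd, decreasing = TRUE)[1:5]   # the five most influential states
```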

Another plot combines findings about outliers, leverage, and influence. States above +2 or below -2 on the vertical axis (horizontal dotted lines) are considered outliers. States to the right of 0.38 on the horizontal axis (vertical dotted line) have high leverage. The size of each circle is proportional to the state’s Cook’s distance for influence.

This plot shows the “Studentized residuals” on the vertical axis. For now, take “studentized” as a form of standardizing all the residuals so that a studentized residual beyond plus or minus 2 (the horizontal dotted lines) is statistically quite unusual.

The horizontal axis shows “hat values,” which are the most common measure of leverage. Once again, reading from the right, Texas [43], New York [34], Florida [9] and California [5] are high-influence observations: they pull strongly on the best-fit line’s angle and therefore significantly alter the model’s coefficients.
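A plot along these lines can be sketched with base R functions; fit is the assumed model object, and the 0.38 leverage cutoff is taken from the plot described above:

```r
lev <- hatvalues(fit)        # leverage
sr  <- rstudent(fit)         # studentized residuals
cd  <- cooks.distance(fit)   # influence

plot(lev, sr, cex = 1 + 5 * cd / max(cd),   # circle size grows with Cook's distance
     xlab = "Hat values (leverage)", ylab = "Studentized residuals")
abline(h = c(-2, 2), lty = 3)   # outlier thresholds
abline(v = 0.38, lty = 3)       # high-leverage threshold
```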