Regression and categorical variables for dummies

If a predictor variable is categorical (aka a factor), a variable with a finite number of levels like male/female or Democrat/Republican/independent, linear regression can still flourish. Software will convert the factor into dummy (0/1) variables, one fewer than the number of levels in the factor.

Typically the alphabetically first level becomes the reference level, the level coded as all zeros in the dummy variables, while the other levels are the comparison levels. Since the intercept is the expected mean of the response when the predictors equal zero, the intercept indicates the mean value for the reference group, here the Democratic party (each comparison level carries a 1 only for its own group, so a reference-group observation has zeros throughout). The regression model’s output does not show a coefficient for the reference level; the coefficients of the other levels are measured with respect to the reference level.

The coefficient for each comparison level tells how much higher or lower that level is than the reference level; it equals the difference between the mean of the level coded 1 and the mean of the reference group. For our data, the coefficient for Republican is -976: the average number of practicing lawyers in the Republican category is 976 less than the intercept value (the Democrat mean).
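To make the coding concrete, here is a minimal sketch in R with made-up party labels and lawyer counts (not our actual state data), showing the dummy columns the software builds behind the scenes and how the coefficients line up with the group means.

```r
# Minimal sketch with made-up data: how R dummy-codes a factor predictor.
df <- data.frame(
  lawyers = c(9000, 21000, 7500, 18000, 6000, 12000),
  party   = factor(c("Democrat", "Democrat", "Republican",
                     "Democrat", "Republican", "Independent"))
)

# model.matrix() shows the dummy (0/1) columns R creates; the alphabetically
# first level (Democrat) is absorbed into the intercept as the reference level.
model.matrix(~ party, data = df)

# In the fitted model, the intercept is the mean for the reference level and
# each party coefficient is that level's mean minus the reference mean.
fit <- lm(lawyers ~ party, data = df)
coef(fit)
```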

Meet up with the regression model’s intercept

The intercept (sometimes referred to as the “constant”) shows up in a regression formula. The intercept is the expected average value of the response variable (practicing lawyers in a state) if all the predictor values were zero. If no predictor ever equals zero, then the intercept has no real-world meaning and tells nothing about the relationship between the predictors and the response. With our two-predictor model, using only companies with fewer than 500 employees and total enrollment in top-100 law schools, the intercept is -1,917, which absurdly says that if all the states had no top-law-school enrollment and no small companies, the model’s estimate for the number of practicing lawyers would be MINUS 1,917 lawyers. But of course, neither predictor is ever zero.

When no predictor equals zero, you have a reason to center them. That means you re-scale them so that their averages do equal zero (software subtracts the average of the predictor’s values from each value). Now the intercept has meaning: it is the average value of the estimated response variable at the average of the predictor variables. Returning to our model, when the two predictor variables are centered, the new intercept estimates 13,612 lawyers: precisely the actual average of practicing lawyers across the states.
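For readers who want to try this, here is a sketch in R of centering two predictors by hand; the column names (lawyers, small_cos, top100_enroll) are stand-ins for however your own spreadsheet labels them.

```r
# Sketch: center two predictors so their averages equal zero, then refit.
states$small_cos_c     <- states$small_cos     - mean(states$small_cos)
states$top100_enroll_c <- states$top100_enroll - mean(states$top100_enroll)

fit_centered <- lm(lawyers ~ small_cos_c + top100_enroll_c, data = states)

# With centered predictors, the intercept equals the average of the response.
coef(fit_centered)["(Intercept)"]
mean(states$lawyers)
```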

Regardless, as we explain elsewhere, you need the intercept in the regression formula to calculate predicted values.

Quality of a linear regression model: the F-test statistic

One more nugget gleams from a linear regression model: the F-statistic.   That statistic compares your model to a model that has no predictors.  The stripped-down model relies only on the average of the response variable (for us, the average number of private practice lawyers in a state).

Stated differently, the F-test statistic is a ratio: the top of the ratio (the numerator) is the variance in the response explained by your model’s predictor variables. That figure is divided by the bottom of the ratio (the denominator), the variance the model leaves unexplained. This is not quite correct, because the F-statistic also takes account of degrees of freedom, but it is close enough for us.

Each combination of an F-test statistic and its degrees of freedom corresponds to a p-value. The statistic and its p-value are to the overall regression model much the same as the t-statistic and its p-value are to each coefficient estimate.  However, unlike a t-statistic, which pairs with a single degrees-of-freedom number, the F-test statistic’s p-value depends on the statistic and two degrees-of-freedom numbers (one for the model, one for the residuals). The fewer the degrees of freedom, the higher the F-test statistic needs to be in order to return the same p-value.
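If you like to see the machinery, here is a sketch in R of pulling the F-statistic and its two degrees-of-freedom numbers out of a fitted model and converting them into the p-value; fit stands for whatever lm() model you have built.

```r
# Sketch: the F-statistic and its degrees of freedom live in the model summary.
fstat <- summary(fit)$fstatistic        # components: value, numdf, dendf
fstat

# Same p-value the summary reports: the upper tail of the F distribution.
pf(fstat["value"], fstat["numdf"], fstat["dendf"], lower.tail = FALSE)
```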

If an F-test statistic is statistically significant, it implies that all the predictor variables together explain the response variable to a degree you can rely on in the eyes of a statistician.

While another statistic we wrote about, R-squared, estimates the strength of the relationship between your model’s predictors and the response variable, it does not provide a formal hypothesis test for the relationship, a core statistical concept which we will consider later.  The F-test statistic does so. If the p-value for the F-test statistic is less than your significance level, such as 0.05, you can conclude that R-squared is statistically significantly different from zero.

With a model that uses only two predictor variables (we used companies with fewer than 500 employees and total enrollment in top-100 law schools), the F-test statistic is highly significant because its p-value falls much less than 0.05. We can be quite confident that the model explains the variability of the dependent variable (practicing lawyers) around its average far better than using just the average itself.

Machine learning, regression and degrees of freedom

An important concept in machine learning involves the ratio between the number of observations and the number of predictors: degrees of freedom. For multiple linear regression, the degrees of freedom equals the number of observations minus one more than the number of predictors fit by the model. On our data set, using three predictors, we have 46 degrees of freedom: 50 states minus four (the three predictor variables plus the intercept).
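In R the same arithmetic takes one line; in this sketch, fit3 stands for a model with three predictors fit on the 50 states.

```r
# Residual degrees of freedom = observations - (predictors + intercept).
nobs(fit3) - length(coef(fit3))   # 50 - 4 = 46
df.residual(fit3)                 # same number, computed by R
```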

With linear regression, statisticians want the degrees of freedom to be large compared to the number of predictor variables to avoid over-fitting. You lose one degree of freedom for every coefficient fitted, so if we had 49 different predictor variables we would have no degrees of freedom and a horribly over-fitted model. It would be useless for interpretation and for prediction.

Aside from over-fitting, here are two results of fitting a linear regression model where degrees of freedom play a role. One calculation that results from fitting a regression model is R-squared, which gives a sense of the proportion of the variation in the response that the model explains. Plain R-squared creeps up whenever you add a variable to the regression formula, but Adjusted R-squared increases only if the new predictor variables improve the model more than you would expect by chance. Adjusted R-squared, in other words, takes into account degrees of freedom.
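Both figures sit in the model summary in R; a quick sketch, with fit again standing for any fitted lm() model:

```r
# Sketch: plain and adjusted R-squared from the same summary object.
summary(fit)$r.squared       # never falls as predictors are added
summary(fit)$adj.r.squared   # penalized for the degrees of freedom spent
```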

A second place where linear regression uses degrees of freedom is the F-test statistic, which we explained above.

Machine-learning statistics: variance and standard deviation

It would be nice if lawyers could embrace regression without a grasp of its statistical underpinnings. If partners or associate general counsel are content with arm-waving, vague notions of regression, that is a choice. But if they want to take part in discussions about regression and feel assured that they understand what it can and cannot do, they should learn some statistical concepts. Broadly stated, that is the goal of this blog: to make machine learning software comprehensible to lawyers.

Specifically, this post explains two statistical stalwarts — variance and standard deviation — that play roles explicitly or implicitly in many other posts.

Variance is a statistical measure of how far the numbers in a collection of numbers are scattered from the collection’s average. It tells you about the collection’s degree of dispersion.

Take a law firm that has many offices, but does environmental work in only four of them. In those four offices, the firm has one environmental partner, three environmental partners, five, and seven respectively. The variance of that collection of numbers (1, 3, 5, 7) is 6.67.

If we were to calculate the variance by hand we would start with the average number of partners in the offices (sum 1, 3, 5, and 7 and divide by 4). The sum being 16 across the four offices, the average is four per office. Next, we would subtract that average, four, from each office’s number of partners, square the result of the subtraction for each office (multiply the result by the result), and add up all of those squared numbers: (1 - 4) squared = 9, (3 - 4) squared = 1, (5 - 4) squared = 1, and (7 - 4) squared = 9. Finally, we would divide that total (20) by the number of offices minus 1 (that is, by 3).

A single command in statistical analysis software such as R does all this instantly: the variance is 6.67 (squared partners).
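Here is the same arithmetic as a sketch in R, first by hand and then with the one-line var() command.

```r
# The hand calculation above, then the one-line version with var().
partners <- c(1, 3, 5, 7)
sum((partners - mean(partners))^2) / (length(partners) - 1)   # 6.666667
var(partners)                                                 # same answer
```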

Now, what if the largest environmental office had 11 partners instead of seven? Intuitively, you should sense that the variance would be larger, because there is a wider spread in the set of partner numbers (1, 3, 5, 11). You would be right! The variance of this set of partners is 18.67 (squared partners). Larger variances represent greater dispersion.

Most people find it easier to think about dispersion measures when they are expressed in the same units as the data rather than in squared units. Here, partners holds meaning more comfortably than squared partners (whatever that is!).

To convert variance to the original units, you find its square root, the number which multiplied by itself equals the variance. That figure is the standard deviation of the collection of partner numbers. The square root of the first example of offices, which has a variance of 6.67, is 2.58 (2.58 times 2.58 = 6.67, with rounding); the square root of the second example, with the larger variance of 18.67, is 4.32.
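As a quick R sketch, sqrt() and sd() confirm those figures.

```r
# Standard deviation is the square root of the variance; sd() does both steps.
sqrt(var(c(1, 3, 5, 7)))    # 2.58
sd(c(1, 3, 5, 7))           # 2.58
sd(c(1, 3, 5, 11))          # 4.32
```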

A way to put the standard deviation into context is to compare it to the average of the numbers. So, in the first example of offices the standard deviation is approximately 2.6 while the average is 4 (the standard deviation is 65% of the average); in the second example, because the fourth office has 11 partners instead of 7, the standard deviation rises to 4.3 while the average increases to 5. Now the standard deviation is 86% of the average, so it confirms a much more varied collection of partner numbers.

What most people are familiar with is the standard deviation of a bell-shaped distribution. About 68% of the numbers in such a set, a bit more than two-thirds, fall within one standard deviation above and one standard deviation below the average. Two standard deviations on either side of the average cover around 95% of the values. Bear in mind, however, that most distributions of numbers do not exhibit a so-called normal distribution (we will return to this important concept later), so standard deviation can’t always be translated into such neat percentages. What you should understand is that the larger the standard deviation relative to the average, the greater the dispersion among the numbers and the less precisely the average represents them.

Regression and interaction terms predicting number of lawyers

In a regression model’s formula, an interaction term captures an interplay between two predictor variables, which happens when the effect of one predictor variable on the response variable depends on the value of the other predictor variable. An interaction term belongs in a regression formula, and in the resulting model, when the change in the estimated response due to the two predictors in combination differs from the sum of the changes due to each predictor alone.
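In R’s formula notation, an interaction is written with a colon or an asterisk; here is a sketch with placeholder names (y, x1, x2, dat), not our state data.

```r
# Sketch of the formula syntax only; x1:x2 adds just the interaction term,
# while x1 * x2 is shorthand for x1 + x2 + x1:x2.
fit_main  <- lm(y ~ x1 + x2, data = dat)
fit_inter <- lm(y ~ x1 * x2, data = dat)

# A significant x1:x2 coefficient (or a significant comparison of the two
# nested models) is evidence the interaction term earns its place.
anova(fit_main, fit_inter)
```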

Our data for U.S. states doesn’t appear to have variables that suggest an interaction, but if we knew which states had carried out death sentences in the past five years, or which states had three-strikes-and-you’re-out felony laws, it is possible that we would find an interaction term of either of those variables combined with the number of prisoners.

Among the tools that can help spot potential interaction terms is an interaction plot. Parallel lines in an interaction plot indicate no interaction; the greater the difference in slopes between the lines, the stronger the interaction.

One form of interaction plot has an upper solid line that marks one standard deviation above the average of the response variable (practicing lawyers), a lightly dotted line that marks one standard deviation below it, and a dotted line midway that indicates the average number of lawyers. However, an interaction plot doesn’t say whether the interaction is statistically significant.

Machine learning and lawyers: influential regression values

A data point has large influence only if it strongly affects the regression model. Leverage takes into account only the extremeness of the predictor variable values, so a high-leverage observation may or may not be influential. A high-leverage data point is influential if it materially changes the tilt of the best-fit line. Think of it as having leverage (an extreme value among the predictor variable values) and also an outlying response value, such that it singlehandedly alters the slope of the regression line considerably. Put differently, an influential point changes the coefficients that multiply the predictor values.

Statisticians spot influential data points by calculating Cook’s distance, but those scores don’t, by themselves, show how the data points affect the model. This is particularly challenging because it is very hard to visualize the impact of many predictors on the response.

Software computes the influence exerted by each observation (a row in the spreadsheet, such as one of our states) on the predicted number of lawyers. It looks at how much the residuals of all the data points would change if a particular observation were excluded from the calculation of the regression coefficients. A large Cook’s distance indicates that excluding that state changes the coefficients substantially. A few states cross that threshold.
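Here is a sketch in R of that calculation; fit stands for the fitted model, and the 4/n cutoff is just one common rule of thumb, not a hard rule.

```r
# Sketch: Cook's distance for every state, flagged against a 4/n rule of thumb.
cd <- cooks.distance(fit)
sort(cd, decreasing = TRUE)[1:5]      # the most influential states
which(cd > 4 / length(cd))            # states crossing the 4/n cutoff
```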

Another plot combines findings about outliers, leverage, and influence. States above +2 or below -2 on the vertical axis (horizontal dotted lines) are considered outliers. States to the right of 0.38 on the horizontal axis (vertical dotted line) have high leverage. The size of each circle is proportional to the state’s Cook’s distance for influence.

This plot shows the “Studentized residuals” on the vertical axis. For now, take “studentized” as a form of standardizing all the residuals so that a studentized residual beyond +2 or -2 (the horizontal dotted lines) is statistically quite unusual.

The horizontal axis shows “hat values,” which are the most common measure of leverage. Once again, reading from the right, Texas [43], New York [34], Florida [9] and California [5] are high-influence observations: they pull strongly on the best-fit line’s angle and therefore significantly alter the model’s coefficients.
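A chart of this kind can be drawn with the car package’s influencePlot() function; the sketch below assumes that package is installed and that fit is the regression model.

```r
# Sketch: studentized residuals vs. hat values, circles sized by Cook's distance.
library(car)
influencePlot(fit)
```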

Machine learning (lawyers): leverage points

A high-leverage observation has values in its predictor variables that stick out with respect to other observations’ corresponding values. Unlike outliers, leverage has nothing to do with the response variable (the number of practicing lawyers in our example). Rather, the leverage of a data point is based on how much the observation’s value differs from the average of that particular variable’s values. So, for example, perhaps the percentage of high school graduates in a state falls far below all the other states’ percentages. Or perhaps a combination of predictor variables leads to an observation with an extreme value having high leverage.

As with outliers, you can spot high leverage observations with calculations, graphs, or trial-and-error. Here is a graphic depiction using our own data set.

This plot takes each of the six predictor variables in the regression model, plots the predictor against the model’s estimate of the number of lawyers in each state, and identifies with an index value — the row number of the state in the spreadsheet — the observations that stand out as exhibiting high leverage. For example, the plot at the upper left uses each state’s gross domestic product (gdp) and models the estimated number of lawyers in the state. Four states have a number beside them because they qualify as high leverage: from the left, state 9 (Florida), 45 (Virginia), 34 (New York), and 43 (Texas).

If no observations with high leverage have a large residual, the regression is relatively “safe.” What this means is that a single data point may be far out to the right or the left on the plot, but the point hews close to the best-fit line. Florida (state 9) would be such a case.
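In R, hat values come straight from the fitted model; the sketch below uses a common rule of thumb (twice the average hat value) as the cutoff, which is an assumption on our part rather than a universal standard.

```r
# Sketch: hat values measure leverage; flag states above twice the average.
h <- hatvalues(fit)
cutoff <- 2 * mean(h)                  # mean(h) equals (predictors + 1) / n
sort(h[h > cutoff], decreasing = TRUE) # the high-leverage states
```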

Regression for lawyers: outliers, leverage and influence points

Before running a regression, it is prudent to study any of your data that exhibit unusual characteristics. The purpose of spotting and evaluating abnormal data is to make sure they are not mistakes in measurement, collection, data entry, or calculation, and that they do not unjustifiably warp your regression model. You want to scrutinize three varieties of unusual data: outliers, high leverage points, and influence points.

An outlier is an observation (a U.S. state in our example data) whose response value (the number of lawyers in the state) is poorly predicted by the model. That is to say, when the model estimates the number of lawyers, it produces an unduly large miss from the actual number. With our data, New York stands out as quite an outlier. Assuming the figures and facts we have for New York are correct, however, that’s real life; generally speaking, unless you have a solid reason to omit an observation, outlier data should be included in your model.

At least three techniques can help spot outliers: statistical tests, graphical plots, and repeated modeling.

Among the statistical tests, one calculates whether the largest residual of the response variable (the amount by which the model mis-estimated the number of lawyers) is “statistically significantly” off the mark; if no such unusual residual shows up, the data has no outliers. However, if the largest response residual is statistically significant and therefore marks an outlier observation, analysts sometimes delete it and rerun the test (the third technique) to see whether other outliers are present.

This graphic plots the residuals, after some mathematical adjustments to the scale, against the corresponding theoretical quantiles. Quantiles (sometimes called ‘percentiles’) are created when you sort data from low to high and mark the points below which a certain proportion of your data fall: 25% of the points lie below the “first quartile,” 50% below the second quartile (the median), and so forth. So the horizontal axis shows the quantiles that a normal bell-curve distribution would produce. The plot also draws a dotted-line band above and below the residuals to show a statistical form of confidence in the estimated values. The plot tells us that New York (34, top right corner) and California (5, lower left corner) are outliers to be scrutinized.

As alluded to above, a third technique helps if you are concerned about an observation being an outlier. You exclude the suspect observation (state) from the regression. If the model’s coefficients don’t change much, then you don’t have to worry.
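Here is a sketch in R of all three techniques; it assumes the car package is installed, that fit is the regression model, and that the data frame states uses state names as row names (adjust to however your own data are labeled).

```r
library(car)

outlierTest(fit)   # Bonferroni test on the largest studentized residual
qqPlot(fit)        # residuals vs. theoretical quantiles, with a dotted band

# Third technique: drop the suspect state and see whether coefficients move much.
fit_wo_ny <- lm(formula(fit), data = states[rownames(states) != "New York", ])
cbind(full = coef(fit), without_NY = coef(fit_wo_ny))
```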

Machine learning for lawyers: collinearity in linear regression

We have reached the fourth assumption for valid multiple linear regression: correlations between predictor variables cannot be too high. If the correlation between any pair of predictor variables is close to 1 or -1, problems arise. One is that it can be difficult to separate out the individual effects of those variables on the response. Predictor variables (aka independent variables) should be independent of each other.

For example, the correlation between the number of F500 headquarters in a state and the number of business establishments with fewer than 500 employees is very high, at 0.90. The two counts relate very closely to each other, which makes sense: states have more or less vigorous business activity whether measured at the huge-corporation stratum or the small-business stratum.
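In R, the pairwise figure (and a full table of correlations) comes from cor(); the column names below are the abbreviations that appear in the plots, and you would substitute whatever your own data frame uses.

```r
# Sketch: the pairwise correlation behind the 0.90 figure.
cor(states$F500, states$Less500)

# Correlations for every pair of predictors at once.
round(cor(states[ , c("F500", "Less500", "enrollment", "prison09", "HS")]), 2)
```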

The plot below shows how closely the values for each state of those two variables march together toward the upper right. The ellipse emphasizes that the association holds particularly strongly at the smaller values of the two variables.

Collinearity is the term used for this undesirable situation: predictor variables rise or fall closely in unison. To the degree they do, one of them is redundant, and the effective number of predictors is smaller than the actual number of predictors. Other weaknesses in the linear model we will leave until later.

To see how closely predictors correlate with each other you can unleash a correlation matrix plot. In the plot, for various reasons, we haven’t included some of the data for states, such as population, area and gdp.

This intimidating plot offers three kinds of insights. First, it shows a scatter plot of each predictor variable against each of the other predictors. For example, in the first column (F500), the second cell down shows a scatter plot with that variable on the horizontal axis and law school enrollment (enrollment, the second row) on the vertical axis.

Second, the diagonal from the top left down the middle contains density plots that display how the predictor variable is distributed. For example, in the column for the number of businesses that have fewer than 500 employees (Less500), the density plot bulges high on the left, which means that quite a few states have relatively few of those enterprises but a handful stretching out to the far right have many.

Third, the upper triangle prints the correlation of each predictor against the others. As an example, the number of prisoners in 2009 (prison09) hardly correlates at all with the percentage of high school graduates (HS), at -0.199.

A correlation matrix plot helps you figure out which predictor variables are too closely correlated with others to be in the same regression model.
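One way to produce a plot of this kind is the ggpairs() function in the GGally package, whose default layout matches the description above (scatter plots below the diagonal, density plots on it, correlations above it); this sketch assumes that package is installed and that the predictors sit in a data frame called states.

```r
# Sketch: a correlation matrix plot of the predictors kept in the model.
library(GGally)
ggpairs(states[ , c("F500", "Less500", "enrollment", "prison09", "HS")])
```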