Regression as machine learning: p-values and R-squared for lawyers

When software calculates a regression best-fit line and equation, it also bestows other insights. Here we consider two of them: (1) p-values, which with our data tell us whether the predictor variable, the number of F500 headquarters in a state, says something we can rely on about the estimated number of private lawyers in a state, and (2) Adjusted R-squared, which tells us how much of the variation in the number of lawyers is accounted for by the predictor variable.

Start with p-values. Each predictor variable in a regression model has its own p-value. That value estimates the probability that a coefficient for the predictor (the number it is multiplied by in the regression equation) at least as far from zero as the one we found could have occurred by chance if the predictor had zero influence on the response variable. If F500 headquarters have no bearing on the number of practicing lawyers in a state, the p-value for F500 would likely be high, such as 0.5 or 0.8.

By contrast, a p-value below 0.05 tells us that, if there were truly a zero relationship, a coefficient this large would turn up less than 5% of the time. If the data in the model show something that would happen less than one time out of twenty (less than 5% of the time) under a zero relationship, it’s unusual enough for us to accept that “Something real is going on!”

Our p-value for F500 headquarters, it turns out, is extremely tiny, below 0.001, so we have solid reason to believe that the number of F500 headquarters strongly and reliably relates to the number of practicing lawyers. Be careful: we cannot say that the number of F500 headquarters causes the number of lawyers, only that it is strongly associated with the number of lawyers.
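To make the "could have occurred by chance" idea concrete, here is a minimal sketch that estimates a p-value by shuffling. Please note the assumptions: the numbers are made up for illustration (they are not our state data), and the method is a permutation test, which is a stand-in for the t-test that regression software normally reports. The function names are ours.

```python
import random

# Made-up illustration data, NOT the article's state data:
# hq = number of headquarters, lawyers = number of lawyers.
hq = [0, 1, 2, 3, 5, 8, 12, 20]
lawyers = [1500, 2400, 4100, 5000, 7900, 11200, 16800, 26500]


def slope(x, y):
    """Least-squares slope of y on x."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    return sxy / sxx


def permutation_p_value(x, y, shuffles=2000, seed=1):
    """Share of shuffled data sets whose slope is at least as far from
    zero as the observed slope (with the usual +1 correction)."""
    rng = random.Random(seed)
    observed = abs(slope(x, y))
    shuffled = list(y)
    at_least_as_extreme = 0
    for _ in range(shuffles):
        rng.shuffle(shuffled)  # break any real link between x and y
        if abs(slope(x, shuffled)) >= observed:
            at_least_as_extreme += 1
    return (at_least_as_extreme + 1) / (shuffles + 1)


p_value = permutation_p_value(hq, lawyers)
print(f"observed slope: {slope(hq, lawyers):.1f}")
print(f"permutation p-value: {p_value:.4f}")
```

Each shuffle simulates a world where headquarters have zero influence on lawyers. If almost none of those worlds produce a slope as steep as the real one, the p-value is small.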

Next, let’s learn what the Adjusted R-squared result tells us. For this model, it tells us what percentage of the variation in the number of lawyers is accounted for by F500 headquarters. In our data, Adjusted R-squared is 81%, which is quite high; it leaves relatively little to be accounted for by other factors.

Adjusted R-squared gives a sense of the portion of the response variable explained by the model. It doesn’t directly indicate how well the model will perform in predictions based on new observations. It helps more when you compare different models for the same data set, such as when you try different predictor variables.
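For readers who want to see where a figure like 81% can come from, here is a small sketch of the standard adjustment formula. It starts from the correlation of 0.904 reported for this data elsewhere in the series and assumes 50 observations (one per state); that count is our assumption, as is the function name.

```python
def adjusted_r_squared(r_squared, n_observations, n_predictors):
    """Adjusted R-squared: R-squared penalized for the number of predictors."""
    return 1 - (1 - r_squared) * (n_observations - 1) / (
        n_observations - n_predictors - 1
    )


# With a single predictor, R-squared is the square of the correlation.
correlation = 0.904        # correlation reported for this data set
n_states = 50              # assumption: one observation per state
r_squared = correlation ** 2
adjusted = adjusted_r_squared(r_squared, n_states, n_predictors=1)

print(f"R-squared: {r_squared:.3f}")
print(f"Adjusted R-squared: {adjusted:.3f}")
```

The penalty grows as you add predictor variables, which is why the adjusted figure is the fairer one when comparing models with different numbers of predictors.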

Regression equations and prediction (machine learning)

What we can’t do with simple correlation, as discussed before, is predict the number of lawyers in a new “state” if we knew the number of Fortune 500 headquarters in that state. Once we create a regression model, however, we can fill in the equation to estimate one variable (predict it) when we know the other variable.

Here is the equation for our regression model of private practice lawyers (hereafter, “lawyers”) as influenced by the number of Fortune 500 headquarters in the state (hereafter, “F500 headquarters”):

                 lawyers = 1343.92 + 1265.27 * F500 + e

The equation tells us that if a state has, say, three F500 headquarters, then “lawyers are estimated to number 1,343.92 plus the product of 1265.27 times 3[F500 headquarters] plus a bit of slippage [e]” (more later on errors): an estimated 5,140 lawyers in private practice in that state.
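The same arithmetic as a few lines of code, using the two coefficients from the equation above. This is only a sketch: the function name is ours, and the error term e is left out because it is unknown for a new state.

```python
INTERCEPT = 1343.92   # estimated lawyers when F500 headquarters = 0
SLOPE = 1265.27       # additional estimated lawyers per F500 headquarters


def estimate_lawyers(f500_headquarters):
    """Estimate private-practice lawyers from the regression equation,
    leaving out the error term e."""
    return INTERCEPT + SLOPE * f500_headquarters


estimate = estimate_lawyers(3)
print(f"{estimate:,.0f}")  # prints 5,140
```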

Imagine a different situation where we only have data for 40 of the states. Those 40 states would be the training set, and regression would create a model from them. This is what machine learning software does: it “learns” from the data given it and can apply that learning (the model) to new information. We could then predict the number of lawyers for any of the remaining 10 states, the test data. Notice that when we make predictions for a test set whose actual numbers we know, we can assess the accuracy of our regression model by comparing the model’s estimates to reality.
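A minimal sketch of the train-then-test idea follows. The ten data points are made up for illustration (they are not our state data), the split is simply the first eight versus the last two, and the helper name fit_line is ours.

```python
# Made-up (headquarters, lawyers) pairs, NOT the article's state data.
observations = [
    (0, 1500), (1, 2400), (2, 4100), (3, 5000), (5, 7900),
    (8, 11200), (12, 16800), (20, 26500), (4, 6300), (10, 14100),
]
training_set = observations[:8]   # the model "learns" from these
test_set = observations[8:]       # held back to check the model


def fit_line(pairs):
    """Return (intercept, slope) of the least-squares line."""
    n = len(pairs)
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    sxx = sum((x - mean_x) ** 2 for x, _ in pairs)
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


intercept, slope = fit_line(training_set)

# Compare the model's estimates to the known test values.
errors = []
for hq, actual in test_set:
    predicted = intercept + slope * hq
    errors.append(abs(actual - predicted))
    print(f"{hq} headquarters: predicted {predicted:,.0f}, actual {actual:,}")

mean_absolute_error = sum(errors) / len(errors)
print(f"mean absolute error: {mean_absolute_error:,.0f}")
```

The mean absolute error is one simple way to summarize how far the estimates land from reality; other accuracy measures exist.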

Any time a firm or law department has two or more variables for observations, if a handful of assumptions to be covered later are satisfied, linear regression will tell you more than you know now.

The linear regression methodology for prediction applies broadly. Let’s illustrate with a law firm that wants to predict an associate’s annual billable hours based on the number of partners that associate worked for during the year. The data set would be the firm’s associates. For each associate the number of hours he or she billed during the most recent year would be one variable and the number of partners who assigned him or her work would be the second variable. Linear regression would generate an equation and the firm could predict either variable for any associate who had missing data for the other variable. [For more on the terminology, see this post.]

With this particular illustration, the value of regression as a prediction tool may be low, but as a tool to understand the relationship between partner numbers and billable hours, it might be insightful. That relationship involves three concepts that we will consider in the next post: p-value, effect size, and R-squared.

Best-fit lines and residuals in linear regression

Your software that creates a scatter plot of a single predictor variable and the response variable can also create a best-fit line. Based on our data, the plot below shows such a line. The plot sorts the Fortune 500 figures from the lowest on the left to the highest on the right.

Simplifying somewhat, such a line makes the distance between it and each of the data points as small as possible. In other words, the software minimizes the total of the vertical distances (strictly speaking, the total of their squares) between the data points and the best-fit line.

Now, do you remember that every straight line on a graph can be stated as an equation (“slope” equals “rise-over-run,” the amount of vertical change as the sorted horizontal numbers change)? When the best-fit line is calculated, the software figures out the equation and thereby produces the so-called coefficients for the regression equation.

While doing so, the software also calculates the difference between each actual point and its position vertically above or below on the best-fit line, called the error (or the residual). The plot shows an example of the distance for Florida, with approximately 43,400 lawyers and 16 Fortune 500 headquarters. Taking each error as the actual value minus the line’s estimate, errors for points above the line are positive; errors for points below the line are negative; all the errors added together equal zero.
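Here is a short sketch of a residual calculation. It takes the residual as the actual value minus the line's estimate (the common convention, under which a point above the line has a positive residual) and uses the coefficients of our regression equation, 1343.92 and 1265.27, with the approximate Florida figures. The second half uses made-up data, not our state data, to show residuals summing to zero.

```python
INTERCEPT = 1343.92
SLOPE = 1265.27


def residual(actual_lawyers, f500_headquarters):
    """Actual value minus the value on the best-fit line."""
    fitted = INTERCEPT + SLOPE * f500_headquarters
    return actual_lawyers - fitted


# Florida, using the approximate figures quoted above.
florida_residual = residual(43_400, 16)
print(f"Florida residual: {florida_residual:,.0f}")

# Made-up data to show that least-squares residuals sum to zero.
pairs = [(0, 1500), (1, 2400), (2, 4100), (3, 5000), (5, 7900)]
n = len(pairs)
mean_x = sum(x for x, _ in pairs) / n
mean_y = sum(y for _, y in pairs) / n
slope = sum((x - mean_x) * (y - mean_y) for x, y in pairs) / sum(
    (x - mean_x) ** 2 for x, _ in pairs
)
intercept = mean_y - slope * mean_x
residuals = [y - (intercept + slope * x) for x, y in pairs]
print(f"sum of residuals: {sum(residuals):.6f}")
```

Florida's residual comes out positive, which matches a point sitting well above the line.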

For our regression example, the equation says that the estimated number of private practice lawyers in a state is equal to a coefficient called the intercept, which is where the best-fit line crosses the vertical axis, plus some number multiplied by the number of Fortune 500 headquarters (often the intercept has no real-world meaning, because it assumes all the predictor variables equal zero, which is unlikely). The next post will flesh out that sentence.

Before regression, look at your data and check correlations

Let’s use the terms we just learned and predict the number of lawyers in a state, our dependent variable, by regressing only one predictor variable, the number of Fortune 500 companies with their headquarters in the state.

Importantly, however, it is a good practice to examine your data before you plunge into a regression analysis. One method looks at a scatter plot of the predictor variable, along the horizontal axis of the plot, and the dependent variable along the vertical, y-axis. On the plot, each state is represented by one point on those two coordinates.

From the top down, the right-most points are New York (54 headquarters and 96,000 lawyers), California (54 headquarters and 85,000 lawyers) and Texas (52 headquarters and 48,000 lawyers).

Moving to the right, states have more Fortune 500 headquarters; does it appear that they also have more private practice lawyers, moving up the plot?

Yes! Your eye tells you that the distribution of the points on the scatter plot drifts upwards toward the right roughly on a line: more headquarters, more lawyers. That conclusion makes intuitive sense to the extent that more Big Corporates probably generate more legal issues and the outside lawyers who handle those issues are likely to live in the state.

We can get a more precise, quantitative sense of the relationship between the two variables. When a scatter plot suggests a linear relationship (we’ll explain this term later), we can supplement it with the correlation coefficient, which measures the strength and direction of a linear relationship between two quantitative variables (quantitative variables are numbers, as compared to qualitative variables, called factors, such as state or region). Correlations range between -1 and 1. Values near -1 indicate a strong negative linear relationship, values near 0 indicate a weak linear relationship, and values near 1 indicate a strong positive linear relationship.

On our data the correlation is 0.904, which is formidable. Also, since the correlation is positive, it confirms what we said: more headquarters, more lawyers.

Correlation has meaning only for linear relationships, and it is sensitive to outliers (unusual, possibly erroneous values that might improperly skew the model; more later).
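For the curious, a minimal sketch of how a correlation coefficient is calculated, and of how a single outlier can move it. The data are made up for illustration (not our state data), the outlier is a hypothetical data-entry error, and the function name is ours.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length lists."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


# Made-up data, NOT the article's state data.
hq = [0, 1, 2, 3, 5, 8, 12, 20]
lawyers = [1500, 2400, 4100, 5000, 7900, 11200, 16800, 26500]
r_clean = pearson_r(hq, lawyers)
print(f"r = {r_clean:.3f}")

# One outlier (here a hypothetical data-entry error that dropped a digit)
# can move r a lot.
lawyers_with_outlier = lawyers[:-1] + [2650]
r_outlier = pearson_r(hq, lawyers_with_outlier)
print(f"r with outlier = {r_outlier:.3f}")
```

This is one reason to look at the scatter plot first: the plot shows an outlier at a glance, while the correlation figure alone hides it.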

Regression explained for lawyers

Your law firm or law department might want to learn even more from numbers you have collected. One software tool to do so carries out what is called regression. How to understand and benefit from regression will be the topic of a series of posts. The goal will be to use everyday language and a lawyer framework to explain how to use regression responsibly, how to intuitively understand the results of regression, and ways to make it real and nonthreatening to managers of lawyers.

How might managers of lawyers apply regression? A general counsel might investigate whether the size of law firms retained is associated with average effective billing rate. Or she might investigate whether assigning more matters to a firm is associated with lower effective billing rates. The managing partner of a law firm might look at several pieces of information about associates and use regression to estimate the likelihood of an associate making partner. Or the head of marketing might learn to what degree the number of participants in a survey influences how many times the report is downloaded. Countless examples exist of the ways regression can illuminate numbers in law firms and law departments.

To start, we need to settle on some terminology. All regression analyses need numbers, which statisticians call data. The example data in this series comes from the 2012 time-frame and consists of the number of private-practice lawyers in each state, the population of each state, the state’s “GDP”, and the number of Fortune 500 companies that have their headquarters in the state. The four numbers for each state are variables.

Here are some more terms you should feel comfortable with from regression. We will create a regression model that predicts the number of lawyers in a state (the number of lawyers is the response variable) from the state’s population, GDP and F500 number, called the predictor variables or the independent variables.

Regression estimates a response variable from predictor variables, which you can rephrase as estimating the dependent variable from the independent variables: in our model the number of lawyers in the state depends to an unknown degree on the number of people living in the state, the state’s economic productivity (its GDP), and how many huge companies call that state their home. You should think of a regression model as a condensed description of a set of numbers by means of an equation. Creating a useful model underlies much of what data analysts do, including many forms of machine learning.

Interaction between series and co-contributors

One analysis that every law firm can carry out on its survey data relies on a contingency table. A contingency table contains counts by levels of a factor (also known as a “categorical variable”), which is a nominal variable that has levels; for example, the variable “Position” might have levels of “GC,” “Direct Report,” “Other Lawyer” and “Staff.”

As an example of a contingency table, a survey might ask respondents about their company’s headquarters country and whether their company is publicly traded or privately held. Country would be a factor with two or more levels (let’s say the U.S. is one level and has 55 respondents while Canada is a second level, with 35 respondents); public or private would also be a factor but it has only two levels, perhaps 20 and 70, respectively.

So, a contingency table of counts (frequencies) for these two factors and their levels has a total of four categories (Public/US, Private/US, Public/Canada and Private/Canada) and would look something like this one:

Traded Stock      US   Canada
Public            12        8
Private           43       27

Easy to create, yet law firms miss many opportunities to deepen their analyses by exploring such contingency tables.
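As a sketch of how little work this takes, the snippet below tallies a contingency table from individual survey responses. The responses are hypothetical, constructed so that the counts match the example table above.

```python
from collections import Counter

# Hypothetical survey responses: (traded stock, headquarters country).
# The counts are chosen to match the example table above.
responses = (
    [("Public", "US")] * 12
    + [("Public", "Canada")] * 8
    + [("Private", "US")] * 43
    + [("Private", "Canada")] * 27
)

cell_counts = Counter(responses)                        # the four categories
row_totals = Counter(stock for stock, _ in responses)   # Public vs Private
column_totals = Counter(country for _, country in responses)

print(f"{'Traded Stock':<14}{'US':>6}{'Canada':>8}")
for stock in ("Public", "Private"):
    us = cell_counts[(stock, "US")]
    canada = cell_counts[(stock, "Canada")]
    print(f"{stock:<14}{us:>6}{canada:>8}")

print(dict(row_totals))       # totals by traded stock
print(dict(column_totals))    # totals by country
```

The row and column totals reproduce the level counts given earlier: 20 public and 70 private, 55 U.S. and 35 Canada.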

Here is an illustration that can also teach us about surveys identified so far. When law firms survey, they often decide to team with another organization, referred to here as a co-contributor. Separately from that decision, law firms also frequently conduct surveys on a topic more than once, referred to as a series. A question is whether co-contributors are more commonly associated with series. In other words, given a contingency table of the four counts, is there a statistical association between series and co-contributors?

Contributor       NotSeries   Series
Alone                    94       23
Co-Contributor           60       23

The contingency table above derives from 200 different surveys, where each series is treated as a single survey, for which I have located a published report. Reports are necessary to determine whether there was a co-contributor. Of that set, 23 are series in which a co-contributor took part (the bottom right count in the table); the same number are series where the law firm proceeded alone (top right). Of the “NotSeries” surveys, 94 were done by the law firm alone while 60 of them had a co-contributor. We must stress that this data is preliminary because we identified co-contributors at various times and may not have spotted all co-contributors and fully matched them to series and non-series.

We can learn from the data in the contingency table whether there is a statistical association between series and co-contributors. One methodology available to us is called a chi-square test.
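Here is a minimal sketch of that test on the four counts in the table above. It is the plain Pearson chi-square calculation without a continuity correction, which some software applies to 2x2 tables by default, so a statistics package may report a somewhat different figure.

```python
import math

# Observed counts from the table above:
# rows = Alone, Co-Contributor; columns = NotSeries, Series.
observed = [[94, 23], [60, 23]]

row_totals = [sum(row) for row in observed]
column_totals = [sum(column) for column in zip(*observed)]
grand_total = sum(row_totals)

# Expected count in each cell if the two factors were unrelated.
expected = [
    [r * c / grand_total for c in column_totals] for r in row_totals
]

chi_square = sum(
    (obs - exp) ** 2 / exp
    for obs_row, exp_row in zip(observed, expected)
    for obs, exp in zip(obs_row, exp_row)
)

# For a 2x2 table there is 1 degree of freedom, and the chi-square
# tail probability then equals erfc(sqrt(chi_square / 2)).
p_value = math.erfc(math.sqrt(chi_square / 2))

print(f"chi-square = {chi_square:.2f}, p-value = {p_value:.2f}")
```

On these preliminary counts the statistic works out to roughly 1.78 with a p-value of roughly 0.18. That is above the customary 0.05 threshold, so this calculation by itself would not support a claim of association between series and co-contributors.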

Number of words and words per page in survey reports

The graph that follows shows an aspect of reports: the number of words per report page. The results might be considered a measure of text density: how much text information is included on an average page. The reports were chosen alphabetically from the data set. Furthermore, the software that counted words includes words in graphs, headers, tables, footers, covers and back pages, which is more words than would normally be assumed to be ‘text.’ Still, since nearly all survey reports have those constituent elements, the numbers of words per report page produce comparable figures.


The total words in the reports varied by more than a factor of ten: from 610 words in the most terse to 6,983 in the most loquacious. This particular set of ten reports averaged 4,493 words with a median of 5,067.

As can be seen, Baker McKenzie Cloud 2017 has approximately 50 words for each of its pages. At the other extreme, Ashurst GreekNPL 2017 weighs in at 420 words for each page.

Increases in participants over life of survey series by law firms

The plot below shows data on participants per year of lengthy series conducted by five law firms. A facet plot, it gives the data in a separate pane for each firm (alphabetically from Carlton Fields to White & Case). Within each pane the left axis varies for the number of participants in the survey year. For example, DLA Piper (top right) ranges from below 100 participants to around 300 whereas Davies Ward, to its left, ranges from 500 to 1,200 participants. White & Case’s survey data is missing participants for 2013 and 2014 so the line breaks. The series in this group run for nine years at the maximum and six years at the minimum.

Generally speaking, the upward slope of the lines confirms that series gain participants as they continue over the years. The exception was Davies Ward, which declined from the initial burst of enthusiasm in 2005 but then began a recovery until the firm ceased sponsoring the series after 2011.

If a few more series of at least six years’ duration had full information on participants, we could more confidently assert that brand recognition and appreciation for a series build over time. Certainly this initial view suggests that to be the case.

Years of survey series and numbers of participants

Does the longevity of a survey series affect the average number of participants in the series? This is likely to be too crude a question, because the target populations of series differ significantly. Then too, firms might modify their questions as the series goes along rather than repeating the same questions, which could affect participation. A series might bring on different co-coordinators or change how it reaches out for participants. If we could control for factors such as these, which might swamp changes in participant numbers arising simply from annual invites, content, and publicity, we could make some headway on the question, but the data for that level of analysis is not available. Also, averaging participant numbers over the years of a survey series may conceal material ups and downs.

Moreover, of greater usefulness to law firms would be knowing whether numbers of participants tend to increase over the life of a series as it becomes better known and more relied on.

We plunge ahead anyway. To start, consider the series that have been sponsored by a law firm for four years or more. We know of 21, which are presented in the plot below. The color coding from the legend at the bottom corresponds to how many surveys have been in the series (some of which are ongoing). The color coding moves from midnight blue for the four-year series to the lightest (yellow) for the longest-running survey (13 years).

As we speculated above, a regression of how many years a survey has been conducted against average participants provides no insight. Factors other than the number of years a survey series has run have more influence on the number of participants.

Long series of surveys by law firms and their meta-topics

Several law firms have conducted (and may still be conducting) two different series. These firms include Baker McKenzie (Brexit and cloud computing), Berwin Leighton (hotels, two different geographies), Clifford Chance (M&A and European debt), Freshfields Bruckhaus (corporate crises and whistle blowers), Herbert Smith (M&A and finance), Jackson Lewis (workplace and dress codes), Miller Chevalier (Latin American corruption and tax policy), and Morrison Foerster (legal industry and M&A).

A few firms have done (or may still be conducting) three surveys on different topics: CMS (the legal industry, Brexit, M&A in Europe), DLA Piper (compliance, debt in Europe, M&A), and Pinsent Masons (Brexit and two on construction in different geographies).

We can also look at the broad topics where one or more firms have coordinated a series of at least five years’ length. We have coded the particular topics into broader meta-topics. The next chart tells us that three meta-topics on industries are included in these long-running series: construction, real estate, and private equity. Second, firms have also run five-plus-year series on disputes (litigation, class actions, and arbitration). Finally, the most popular subject for research surveys has been mergers and acquisitions, with three different meta-topics.