Suggestions for spreadsheets used by lawyers in machine learning

Law firms and law departments can store their data for regression models, or any machine learning algorithm, in a spreadsheet. Spreadsheets are perfectly fine, and indeed Excel and similar programs can perform linear regressions, but more powerful software such as Mathematica, Matlab, Python (open source), R (open source), SAS, SPSS, and Tableau typically takes in data from a spreadsheet before it can begin its magic.

Here are some observations and good hygiene for spreadsheets that contain regression data.

Store the data in columns, not rows. So, in our data set, each state is a row and each column contains the values for a variable. Here are the first six rows of the data being used in this series of posts. <chr>, <dbl> and <fct> stand for a text variable, a numeric variable (double precision is what gives it the “dbl”), and a categorical or factor variable, respectively, as used by the R programming language.

  state population    area lawyers     gdp  F500 capital     region party
  <chr>      <dbl>   <dbl>   <dbl>   <dbl> <dbl> <chr>       <fct>  <fct>
1 AK        735132 665384.    1585   51859     0 Juneau      West   rep  
2 AL       4822023  52420.    7615  183547     1 Montgomery  South  rep  
3 AR       2949131  53179.    4447  109557     7 Little Rock South  rep  
4 AZ       6553255 113990.    8023  266891     5 Phoenix     West   rep  
5 CA      38041430 163696.   85274 2003479    54 Sacramento  West   dem  
6 CO       5187582 104094.   11584  274048     9 Denver      West   dem 

Put each observation on its own row. By convention, the label that identifies each observation goes in the first column. In real life, observations might be associates, partners, offices, countries, law firms, client groups, matters or others.

The arrangement of the columns does not matter to the software, nor does the order of the observations. Our data set has the observation (state) in the first column, the dependent variable, lawyers, in the fourth column and F500 headquarters in the sixth. It also doesn’t matter if some columns (or some rows) are not used in the model.

Try not to have extraneous rows, such as extra header rows beyond the single row of column names, summaries, or sub-tables. Leave out totals, explanatory text and blank rows. Superfluous columns to the right of the data or rows below it are less of a problem, we should note, because your software can easily remove or ignore them.

Make sure you have only numbers in columns for numbers. That is to say, do not have commas, dollar signs or any text (17500 USD is a no-no). Decimal points in values are perfectly fine, as in “area” above.

If you are missing values, be consistent in how you identify them. The best approach is to put nothing in the cell; avoid hyphens or writing something, such as “no value.” If text and numbers are jumbled together, most software will treat the variable as text and cannot do math on the text.
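The two rules above (numbers only in numeric columns, empty cells for missing values) can be checked before the data ever reaches a regression. What follows is a minimal Python sketch, not a feature of any particular spreadsheet program; the function names and the sample column are hypothetical.

```python
from typing import List, Optional


def to_number(cell: str) -> Optional[float]:
    """Convert one spreadsheet cell to a float, or None if the cell is empty.

    Raises ValueError for cells such as "17,500", "17500 USD" or "-",
    so that problems surface before any modeling begins.
    """
    text = cell.strip()
    if text == "":
        return None  # an empty cell is the recommended marker for a missing value
    return float(text)  # float() rejects commas, currency symbols and most stray text


def check_column(cells: List[str]) -> List[int]:
    """Return the 0-based positions of cells that are neither empty nor numeric."""
    bad = []
    for i, cell in enumerate(cells):
        try:
            to_number(cell)
        except ValueError:
            bad.append(i)
    return bad


# Hypothetical column: two clean values, one empty cell, three problem cells.
column = ["735132", "665384.", "", "17,500", "17500 USD", "-"]
print(check_column(column))  # positions of the cells that need fixing
```

Running a check like this on each numeric column before loading the data makes it clear which cells would otherwise force the software to treat the whole column as text.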

Writing code is easier if you use one-word names for your columns and make them understandable. For example, use “hours” or “firm” or “type” rather than “billable hours” or “paid vendor”, or “x-1 code”. It’s fine to use camel case like “StateGDP” or “AreaState”.

Effect size in regression, and holding variables constant

Each coefficient of a regression model measures what is called effect size. Returning to our data, the effect size tells how much the predicted number of lawyers changes with a change of one F500 headquarters. Since our single-variable coefficient is 1,265 it would mean that every additional F500 headquarters increases the estimated lawyers by 1,265; every F500 headquarters less would drop the estimated lawyers by 1,265. So, effect size indicates the influence of one plus-or-minus predictor unit on the response number. But we shouldn’t use that single effect size because we have data for other variables that contribute to the estimated number of lawyers.

To this point we have been conducting linear regression with only a single predictor. If we include in our model more than one predictor, by the way, we are using multiple linear regression.

To progress to multiple linear regression and to see more effect sizes and how coefficients change when there are additional predictor variables, let’s add to our model the predictor variable of state population. The resulting regression equation appears below.

[1] “lawyers = -275.8 + 767.689 * F500 + 0.001 * population + e”

The coefficient for F500 headquarters has dropped dramatically, from 1,265 to 768.  Further, we see a tiny coefficient for state population.

But you can’t look at the absolute size of a coefficient and decide whether it’s more or less influential than another predictor’s coefficient. Even if both are statistically significant, you must also take into account the units of each predictor variable. Our units are one headquarters and one state resident. Intuitively, a change of one headquarters ought to make much more of a difference in the lawyer count than a change of one resident.

Both predictors are statistically significant, so what do the coefficients tell us about translating their variable values into the real world? On this two-predictor model, every increase or decrease of one F500 headquarters changes the estimated number of lawyers by 768, when we hold state population constant. Every increase or decrease of one resident changes the estimated lawyers by 1/1000, holding F500 headquarters constant. Thus, for every thousand additional residents in a state, this model predicts an additional lawyer.

Holding other predictors constant means that the software keeps the other predictor variables fixed at the same values, so that changes in them cannot influence the response variable (the number of private practice lawyers). Doing so isolates the effect of the one predictor being examined on the response variable.
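To make "holding constant" concrete, here is a small Python sketch that plugs numbers into the two-predictor equation given above. The state with 9 headquarters and 5,000,000 residents is hypothetical, and the population coefficient of 0.001 is the rounded figure from the equation, so the results are approximate.

```python
def predict_lawyers(f500: float, population: float) -> float:
    """Two-predictor equation from the text, leaving out the error term e."""
    return -275.8 + 767.689 * f500 + 0.001 * population


population = 5_000_000  # hypothetical state; held constant in the next two lines
base = predict_lawyers(f500=9, population=population)
one_more_hq = predict_lawyers(f500=10, population=population)
effect_hq = one_more_hq - base  # effect of one more headquarters

f500 = 9  # now hold headquarters constant and add 1,000 residents
effect_1000 = (predict_lawyers(f500, population + 1000)
               - predict_lawyers(f500, population))

print(round(effect_hq, 3), round(effect_1000, 3))
```

Whatever values are chosen for the predictor that is held constant, the two differences come out the same: about 768 lawyers per headquarters and about one lawyer per thousand residents.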

We need to include as many predictor variables as we have available and evaluate that multiple linear regression model, which we will do in a later post.

Regression as machine learning: p-values and R-squared for lawyers

When software calculates a regression best-fit line and equation, it also bestows other insights. Here we consider two of them: (1) p-values, which with our data say whether the predictor variable, the number of F500 headquarters in a state, tells us something we can rely on about the estimated number of private lawyers in a state and (2) Adjusted R-squared, which tells us how much of the variation in the number of lawyers is accounted for by the predictor variable.

Start with p-values. Each predictor variable in a regression model has its own p-value. That value estimates the probability that the coefficient for the predictor — the number it is multiplied by in the regression equation — could have occurred by chance if the predictor had zero influence on the response variable.  If F500 headquarters have no bearing on the number of practicing lawyers in a state, the p-value for F500 would be high, such as 0.5 or 0.8.

But a p-value below 0.05 tells us that, if there were truly no relationship, a coefficient this far from zero would turn up by chance less than 5% of the time. If the data in the model says something has happened that would happen less than one-out-of-twenty times (less than 5% of the time), it’s unusual enough for us to accept that “Something real is going on!”.

Our p-value for F500 headquarters, it turns out, is extremely tiny, below 0.001, so we have solid reason to believe that the number of F500 headquarters strongly and reliably relates to the number of practicing lawyers. Be careful: we cannot say that the number of F500 headquarters causes the number of lawyers, only that it is strongly associated with the number of lawyers.
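One way to build intuition for "could have occurred by chance" is a permutation test, sketched below in Python. This is a different technique from the calculation regression software uses to report p-values, and it runs only on the six rows shown in the data table of this series, so its result will not match the p-value quoted above. The idea: shuffle the lawyer counts among the states so that any real link to headquarters is broken, and count how often the shuffled data produces a slope at least as steep as the real one.

```python
from itertools import permutations

# First six rows of the data table (AK, AL, AR, AZ, CA, CO).
f500 = [0, 1, 7, 5, 54, 9]
lawyers = [1585, 7615, 4447, 8023, 85274, 11584]


def slope(x, y):
    """Least-squares slope of y on x."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    return sxy / sxx


observed = abs(slope(f500, lawyers))

# Reassign the lawyer counts to states in every possible order (720 ways)
# and count the orders whose slope is at least as steep as the observed one.
shuffles = list(permutations(lawyers))
as_extreme = sum(1 for y in shuffles if abs(slope(f500, y)) >= observed - 1e-9)
p_value = as_extreme / len(shuffles)

print(f"{as_extreme} of {len(shuffles)} shuffles; p = {p_value:.4f}")
```

With only six observations the smallest possible result is 1 in 720, which is one reason more data makes it easier to detect a real relationship.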

Next, let’s learn what the Adjusted R-squared result tells us. Adjusted R-squared tells us for this model what percentage of the variation in the number of lawyers is accounted for by F500 headquarters. In our data, Adjusted R-squared is 81%, which is quite high; everything else that bears on the number of private lawyers in a state accounts for the remaining 19% or so.

Adjusted R-squared gives a sense of the portion of the response variable explained by the model. Adjusted R-squared doesn’t directly indicate how well the model will perform in predictions based on new observations. It helps more when you compare different models for the same data set, such as when you try different predictor variables.
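For readers who want to see where a figure like 81% comes from, the Python sketch below applies the standard adjustment formula. It assumes 50 observations (one per state) and relies on the fact that, with a single predictor, R-squared equals the square of the correlation between the two variables; the correlation of 0.904 is the figure reported elsewhere in this series.

```python
def adjusted_r_squared(r_squared: float, n_obs: int, n_predictors: int) -> float:
    """Standard adjustment: shrink R-squared a little for each predictor used."""
    return 1 - (1 - r_squared) * (n_obs - 1) / (n_obs - n_predictors - 1)


r = 0.904           # correlation between F500 headquarters and lawyers
r_squared = r ** 2  # valid shortcut only for a one-predictor model
adj = adjusted_r_squared(r_squared, n_obs=50, n_predictors=1)  # 50 states assumed

print(round(r_squared, 3), round(adj, 3))
```

The adjustment matters more when a model has many predictors relative to the number of observations, which is why it is the better figure for comparing models with different numbers of predictors.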

Regression equations and prediction (machine learning)

What we can’t do with simple correlation, as discussed before, is predict the number of lawyers in a new “state” if we knew the number of Fortune 500 headquarters in that state. Once we create a regression model, however, we can fill in the equation to estimate one variable (predict it) when we know the other variable.

Here is the equation for our regression model of private practice lawyers (hereafter, “lawyers”) as influenced by the number of Fortune 500 headquarters in the state (hereafter, “F500 headquarters”):

                 lawyers = 1343.92 + 1265.27 * F500 + e

The equation tells us that if a state has, say, three F500 headquarters, then “lawyers are estimated to number 1,343.92 plus the product of 1265.27 times 3[F500 headquarters] plus a bit of slippage [e]” (more later on errors): an estimated 5,140 lawyers in private practice in that state.
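The same arithmetic can be written as a short Python function. This is only a sketch of plugging a value into the equation above; the function name is ours, and the error term e is left out because it is unknown for any new state.

```python
def predict_lawyers(f500_headquarters: float) -> float:
    """Single-predictor equation from the text, leaving out the error term e."""
    return 1343.92 + 1265.27 * f500_headquarters


estimate = predict_lawyers(3)  # a state with three F500 headquarters
print(round(estimate))
```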

Imagine a different situation where we only have data for 40 of the states. Those 40 states would be the training set, and regression would create a model from them. This is what machine learning software does: it “learns” from the data given it and can apply that learning (the model) to new information. We could then predict the number of lawyers for any of the remaining 10 states, the test set. Notice that because we know the actual numbers in the test set, we can assess the accuracy of our regression model by comparing the model’s estimates to reality.
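Here is a miniature Python version of that train-and-test idea. It uses only the six states shown in the data table of this series, with four for training and two held out; the choice of which two to hold out is arbitrary, and six rows are far too few for a trustworthy model. The point is only the workflow: fit on one set of rows, predict the others, compare with the known answers.

```python
# (state, F500 headquarters, lawyers) for the six rows shown in the data table.
rows = [("AK", 0, 1585), ("AL", 1, 7615), ("AR", 7, 4447),
        ("AZ", 5, 8023), ("CA", 54, 85274), ("CO", 9, 11584)]

test_states = {"AR", "CO"}  # arbitrary choice for this sketch
train = [r for r in rows if r[0] not in test_states]
test = [r for r in rows if r[0] in test_states]


def fit_line(pairs):
    """Return (intercept, slope) of the least-squares line through (x, y) pairs."""
    n = len(pairs)
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    sxx = sum((x - mean_x) ** 2 for x, _ in pairs)
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


# "Learn" the model from the training rows only.
intercept, slope = fit_line([(x, y) for _, x, y in train])

# Predict the held-out states and compare with the known counts.
for state, x, actual in test:
    predicted = intercept + slope * x
    print(f"{state}: predicted {predicted:,.0f}, actual {actual:,}, "
          f"error {actual - predicted:,.0f}")
```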

Any time a firm or law department has two or more variables for observations, if a handful of assumptions to be covered later are satisfied, linear regression will tell you more than you know now.

The linear regression methodology for prediction applies broadly. Let’s illustrate with a law firm that wants to predict an associate’s annual billable hours based on the number of partners that associate worked for during the year. The data set would be the firm’s associates. For each associate the number of hours he or she billed during the most recent year would be one variable and the number of partners who assigned him or her work would be the second variable. Linear regression would generate an equation and the firm could predict either variable for any associate who had missing data for the other variable. [For more on the terminology, see this post.]

With this particular illustration, the value of regression as a prediction tool may be low, but as a tool to understand the relationship between partner numbers and billable hours, it might be insightful. That relationship involves three concepts that we will consider in the next post: p-value, effect size, and R-squared.

Best-fit lines and residuals in linear regression

Your software that creates a scatter plot of a single predictor variable and the response variable can also create a best-fit line. Based on our data, the plot below shows such a line. The plot sorts the Fortune 500 figures from the lowest on the left to the highest on the right.

Simplifying somewhat, such a line makes the distance between it and each of the data points as small as possible. More precisely, the software minimizes the total of the squared vertical distances between the data points and the best-fit line, which is why the method is often called least squares.

Now, do you remember that every straight line on a graph can be stated as an equation (“slope” equals “rise-over-run,” the amount of vertical change as the sorted horizontal numbers change)? When the best-fit line is calculated, the software figures out the equation and thereby produces the so-called coefficients for the regression equation.

While doing so, the software also calculates the difference between each actual point and the position vertically above or below it on the best-fit line, called the error (or the residual). The plot shows an example of the distance for Florida, with approximately 43,400 lawyers and 16 Fortune 500 headquarters. Because an error is the actual value minus the value on the line, errors for points above the line are positive; errors for points below the line are negative; all the errors added together equal zero.
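The Python sketch below fits a least-squares line to the six states shown in the data table of this series and lists each residual, using the common convention that a residual is the actual value minus the value on the line. Because it uses six rows rather than the full data set, its line differs from the one in the plot; the property to notice is that the residuals sum to zero (apart from tiny rounding in the computer's arithmetic).

```python
states = ["AK", "AL", "AR", "AZ", "CA", "CO"]
f500 = [0, 1, 7, 5, 54, 9]
lawyers = [1585, 7615, 4447, 8023, 85274, 11584]

# Least-squares slope and intercept for a single predictor.
n = len(f500)
mean_x = sum(f500) / n
mean_y = sum(lawyers) / n
slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(f500, lawyers))
         / sum((x - mean_x) ** 2 for x in f500))
intercept = mean_y - slope * mean_x

# Residual = actual lawyers minus the value on the best-fit line.
fitted = [intercept + slope * x for x in f500]
residuals = [y - f for y, f in zip(lawyers, fitted)]

for state, res in zip(states, residuals):
    side = "above" if res > 0 else "below"
    print(f"{state}: residual {res:,.0f} ({side} the line)")

print("sum of residuals:", round(sum(residuals), 6))
print("sum of squared residuals:", round(sum(r * r for r in residuals)))
```

The last figure, the sum of squared residuals, is the quantity that the best-fit line makes as small as possible.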

For our regression example, the equation says that the estimated number of private practice lawyers in a state is equal to a coefficient called the intercept, which is where the best-fit line crosses the vertical axis, plus some number multiplied by the number of Fortune 500 headquarters (often the intercept has no real-world meaning, because it assumes all the predictor variables equal zero, which is unlikely). The next post will flesh out that sentence.

Before regression, look at your data and check correlations

Let’s use the terms we just learned and predict the number of lawyers in a state, our dependent variable, by regressing it on only one predictor variable, the number of Fortune 500 companies with their headquarters in the state.

Importantly, however, it is a good practice to examine your data before you plunge into a regression analysis. One method looks at a scatter plot of the predictor variable, along the horizontal axis of the plot, and the dependent variable along the vertical, y-axis. On the plot, each state is represented by one point on those two coordinates.

From the top down, the right-most points are New York (54 headquarters and 96,000 lawyers), California (54 headquarters and 85,000 lawyers) and Texas (52 headquarters and 48,000 lawyers).

As states have more Fortune 500 headquarters moving to the right, does it appear they have more private practice lawyers moving up the plot?

Yes! Your eye tells you that the distribution of the points on the scatter plot drifts upwards toward the right roughly on a line: more headquarters, more lawyers. That conclusion makes intuitive sense to the extent that more Big Corporates probably generate more legal issues and the outside lawyers who handle those issues are likely to live in the state.

We can get a more precise, quantitative sense of the relationship between the two variables. When a scatter plot suggests a linear relationship (we’ll explain this term later), we can supplement it with the correlation coefficient, which measures the strength and direction of a linear relationship between two quantitative variables (quantitative variables are numbers, as compared to qualitative variables, called factors like state or region). Correlations range between -1 and 1. Values near -1 indicate a strong negative linear relationship, values near 0 indicate a weak linear relationship, and values near 1 indicate a strong positive linear relationship.

On our data the correlation is 0.904, which is formidable. Also, since the correlation is positive, it means what we said, more headquarters, more lawyers.

Correlation has meaning only for linear relationships, and it is sensitive to outliers (unusual, possibly erroneous values that might improperly skew the model; more later).
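To show how a correlation coefficient is computed, and how much one unusual point can move it, here is a Python sketch using only the six states shown in the data table of this series. The result will therefore differ from the 0.904 found on the full data. California, with 54 headquarters, sits far from the other five states, so the sketch computes the correlation with and without it.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length lists of numbers."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


states = ["AK", "AL", "AR", "AZ", "CA", "CO"]
f500 = [0, 1, 7, 5, 54, 9]
lawyers = [1585, 7615, 4447, 8023, 85274, 11584]

r_all = pearson_r(f500, lawyers)

# Drop California to see how much a single far-out point matters.
keep = [i for i, s in enumerate(states) if s != "CA"]
r_without_ca = pearson_r([f500[i] for i in keep], [lawyers[i] for i in keep])

print(f"all six states: r = {r_all:.3f}")
print(f"without CA:     r = {r_without_ca:.3f}")
```

A large gap between the two figures is a signal to look at the scatter plot before trusting the coefficient.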

Regression explained for lawyers

Your law firm or law department might want to learn even more from numbers you have collected. One software tool to do so carries out what is called regression. How to understand and benefit from regression will be the topic of a series of posts. The goal will be to use everyday language and a lawyer framework to explain how to use regression responsibly, how to intuitively understand the results of regression, and ways to make it real and nonthreatening to managers of lawyers.

How might managers of lawyers apply regression? A general counsel might investigate whether the size of law firms retained is associated with average effective billing rate. Or she might test whether assigning more matters to a firm is associated with lower effective billing rates. The managing partner of a law firm might look at several pieces of information about associates and use regression to estimate the likelihood of an associate making partner. Or the head of marketing might learn to what degree the number of participants in a survey influences how many times the report is downloaded. Countless examples exist of the ways regression can illuminate numbers in law firms and law departments.

To start, we need to settle on some terminology. All regression analyses need numbers, which statisticians call data. The example data in this series comes from the 2012 time-frame and consists of the number of private-practice lawyers in each state, the population of each state, the state’s “GDP”, and the number of Fortune 500 companies that have their headquarters in the state. The four numbers for each state are variables.

Here are some more terms you should feel comfortable with from regression. We will create a regression model that predicts the number of lawyers in a state, called the response variable, from the state’s population, GDP and F500 number, called the predictor variables or the independent variables.

Regression estimates a response variable from predictor variables, which you can rephrase as estimating the dependent variable from the independent variables. In our model the number of lawyers in the state depends to an unknown degree on the number of people living in the state, the state’s economic productivity (its GDP), and how many huge companies call that state their home. You should think of a regression model as a condensed description of a set of numbers by means of an equation. Creating a useful model underlies much of what data analysts do, including many forms of machine learning.

Interaction between series and co-contributors

One analysis that every law firm can carry out on its survey data relies on a contingency table. A contingency table contains counts by levels of a factor (also known as a “categorical variable”), which is a nominal variable that has levels. For example, the variable “Position” might have levels of “GC,” “Direct Report,” “Other Lawyer” and “Staff.”

As an example of a contingency table, a survey might ask respondents about their company’s headquarters country and whether their company is publicly traded or privately held. Country would be a factor with two or more levels (let’s say the U.S. is one level and has 55 respondents while Canada is a second level, with 35 respondents); public or private would also be a factor but it has only two levels, perhaps 20 and 70, respectively.

So, a contingency table of counts (frequencies) for these two factors and their levels has a total of four categories (Public/US, Private/US, Public/Canada and Private/Canada) and would look something like this one:

Traded Stock   US   Canada
Public         12        8
Private        43       27

Such tables are easy to create, yet law firms miss many opportunities to deepen their analyses by exploring them.

Here is an illustration that can also teach us about surveys identified so far. When law firms survey, they often decide to team with another organization, referred to here as a co-contributor. Separately from that decision, law firms also frequently conduct surveys on a topic more than once, referred to as a series. A question is whether co-contributors are more commonly associated with series. In other words, given a contingency table of the four counts, is there a statistical association between series and co-contributors?

Contributor      NotSeries  Series
Alone                   94      23
Co-Contributor          60      23

The contingency table above derives from 200 different surveys, where each series is treated as a single survey, for which I have located a published report. Reports are necessary to determine whether there was a co-contributor. Of that set, 23 are series in which a co-contributor took part (the bottom right count in the table); the same number are series where the law firm proceeded alone (top right). Of the “NotSeries” surveys, 94 were done by the law firm alone while 60 of them had a co-contributor. We must stress that this data is preliminary because we identified co-contributors at various times and may not have spotted all co-contributors and fully matched them to series and non-series.

The data in the contingency table can tell us whether there is a statistical association between series and co-contributors. One methodology available to us is called a chi-square test.
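As a rough sketch of what a chi-square test does with the four counts above, the Python code below computes the counts we would expect if series and co-contributors were unrelated, measures how far the actual counts depart from them, and converts that departure into a p-value. It uses the plain Pearson chi-square statistic with no continuity correction, which is an assumption on our part; statistical packages sometimes apply a correction to two-by-two tables and would then report a somewhat different figure.

```python
from math import erfc, sqrt

observed = [[94, 23],   # Alone:          NotSeries, Series
            [60, 23]]   # Co-Contributor: NotSeries, Series

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
grand_total = sum(row_totals)

chi_square = 0.0
for i, row in enumerate(observed):
    for j, count in enumerate(row):
        # Count expected in this cell if the two factors were independent.
        expected = row_totals[i] * col_totals[j] / grand_total
        chi_square += (count - expected) ** 2 / expected

# A 2x2 table has one degree of freedom, for which the chi-square tail
# probability reduces to erfc(sqrt(chi_square / 2)).
p_value = erfc(sqrt(chi_square / 2))

print(f"chi-square = {chi_square:.3f}, p = {p_value:.3f}")
```

As with regression, the p-value is read against a threshold such as 0.05: a value below it suggests an association between the two factors, while a value above it means the counts are consistent with no association.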

Number of words and words per page in survey reports

The graph that follows shows an aspect of reports: the number of words per report page. The results might be considered a measure of text density: how much text information is included on an average page. The reports were chosen alphabetically from the data set. Furthermore, the software that counted words includes words in graphs, headers, tables, footers, covers and back pages, which is more words than would normally be assumed to be ‘text.’ Still, since nearly all survey reports have those constituent elements, the numbers of words per report page produce comparable figures.


The total words in the reports varied by more than a factor of 10: from 610 words in the tersest to 6,983 in the most loquacious. This particular set of ten reports averaged 4,493 words with a median of 5,067.

As can be seen, Baker McKenzie Cloud 2017 has approximately 50 words for each of its pages. At the other extreme, Ashurst GreekNPL 2017 weighs in at 420 words for each page.

Increases in participants over life of survey series by law firms

The plot below shows data on participants per year of lengthy series conducted by five law firms. A facet plot, it gives the data in a separate pane for each firm (alphabetically from Carlton Fields to White & Case). The scale of the left axis, the number of participants in the survey year, varies from pane to pane. For example, DLA Piper at top right ranges from below 100 participants to around 300 whereas Davies Ward to its left ranges from 500 to 1,200 participants. White & Case’s survey data is missing participants for 2013 and 2014 so the line breaks. The series in this group run for nine years at the maximum and six years at the minimum.

Generally speaking, the upward slope of the lines confirms that series gain participants as they continue over the years. The exception was Davies Ward, which declined from the initial burst of enthusiasm in 2005 but then began a recovery until the firm ceased sponsoring the series after 2011.

If a few more series of at least six years’ duration had full information on participants, we could more confidently assert that brand recognition and appreciation for a series build over time. Certainly this initial view suggests that to be the case.