What we can’t do with simple correlation, as discussed before, is predict the number of lawyers in a new “state” if we knew the number of Fortune 500 headquarters in that state. Once we create a regression model, however, we can fill in the equation to estimate one variable (predict it) when we know the other variable.

Here is the equation for our regression model of private practice lawyers (hereafter, “lawyers”) as influenced by the number of Fortune 500 headquarters in the state (hereafter, “F500 headquarters”):

lawyers = 1343.92 + 1265.27 * F500 + e

The equation tells us that if a state has, say, three F500 headquarters, then “lawyers are estimated to number 1,343.92 plus the product of 1,265.27 times 3 [F500 headquarters] plus a bit of slippage [e]” (more later on errors). Setting the error term aside, that is 1,343.92 + 3,795.81 = 5,139.73, or an estimated 5,140 lawyers in private practice in that state.
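For readers who like to see the arithmetic spelled out, here is a minimal sketch of plugging a value into the equation. The function name `estimate_lawyers` is made up for this illustration; the two coefficients are the ones from the equation above, and the error term e is left out because we cannot know it for a new state.

```python
# Coefficients taken from the regression equation in this post.
INTERCEPT = 1343.92   # estimated lawyers when a state has zero F500 headquarters
SLOPE = 1265.27       # estimated additional lawyers per F500 headquarters


def estimate_lawyers(f500_headquarters):
    """Return the model's point estimate of private practice lawyers.

    The error term e is omitted: it is unknown for a new state.
    """
    return INTERCEPT + SLOPE * f500_headquarters


estimate = estimate_lawyers(3)
print(round(estimate))  # 5140
```

The same function works for any number of headquarters, although estimates far outside the range of the original data deserve extra caution.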

Imagine a different situation where we have data for only 40 of the states. Those 40 states would be the training set, and regression would fit a model to them. This is what machine learning software does: it “learns” from the data given to it and can apply that learning (the model) to new information. We could then predict the number of lawyers for any of the remaining 10 states, the test set. Because we actually know the number of lawyers in those 10 states, we can assess the accuracy of our regression model by comparing the model’s estimates to reality.
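The 40-and-10 split can be sketched in a few lines of Python. Everything here is an assumption made for illustration: the 50 “states” are synthetic numbers produced by a random generator, not the real data behind this post, and the helper `fit_line` is a hand-rolled least-squares fit for one predictor rather than any particular package's routine.

```python
import random


def fit_line(xs, ys):
    """Ordinary least squares with one predictor; returns (intercept, slope)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


random.seed(0)

# Synthetic stand-in data: (F500 headquarters, lawyers) for 50 made-up states.
states = []
for _ in range(50):
    hq = random.randint(0, 50)
    lawyers = 1000 + 1200 * hq + random.gauss(0, 2000)
    states.append((hq, lawyers))

# Hold out 10 states as the test set; fit the model on the other 40.
random.shuffle(states)
train, test = states[:40], states[40:]

intercept, slope = fit_line([hq for hq, _ in train],
                            [law for _, law in train])

# Compare the model's estimates to the known values in the test set.
errors = [abs((intercept + slope * hq) - actual) for hq, actual in test]
mean_abs_error = sum(errors) / len(errors)

print(f"lawyers = {intercept:.2f} + {slope:.2f} * F500")
print(f"mean absolute error on the 10 test states: {mean_abs_error:.0f}")
```

The mean absolute error is only one of several ways to score the comparison; the point is that the test states played no part in fitting the line, so the score is an honest check on the model.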

Any time a firm or law department has recorded two or more variables for each observation, and a handful of assumptions to be covered later are satisfied, linear regression can estimate how those variables relate and let you predict one from the others.

The linear regression methodology for prediction applies broadly. Let’s illustrate with a law firm that wants to predict an associate’s annual billable hours based on the number of partners that associate worked for during the year. The data set would be the firm’s associates. For each associate, the number of hours he or she billed during the most recent year would be one variable, and the number of partners who assigned him or her work would be the second variable. Linear regression would generate an equation, and the firm could then estimate the billable hours of any associate for whom it knew the number of assigning partners but not the hours. [For more on the terminology, see this post.]

With this particular illustration, the value of regression as a prediction tool may be low, but as a tool to understand the relationship between partner numbers and billable hours, it might be insightful. That relationship involves three concepts that we will consider in the next post: p-value, effect size, and R-squared.