Regression as machine learning: p-values and R-squared for lawyers

When software calculates a regression best-fit line and equation, it also produces other insights. Here we consider two of them: (1) p-values, which for our data tell us whether the predictor variable, the number of F500 headquarters in a state, gives us something we can rely on about the estimated number of private lawyers in that state, and (2) Adjusted R-squared, which tells us how much of the variation in the number of lawyers is accounted for by the predictor variable.
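To make this concrete, here is a minimal sketch of how such a regression might be run in Python with the statsmodels library. The data values and column names (`f500_hq`, `lawyers`) are invented for illustration; they are not the post's actual dataset.

```python
import pandas as pd
import statsmodels.formula.api as smf

# Illustrative data only: one row per state, with the number of Fortune 500
# headquarters and the number of private-practice lawyers (made-up values).
df = pd.DataFrame({
    "f500_hq": [0, 1, 3, 6, 10, 17, 24, 35, 54],
    "lawyers": [900, 2100, 5200, 9800, 16000, 27000, 40000, 58000, 96000],
})

# Ordinary least squares regression: lawyers ~ f500_hq
model = smf.ols("lawyers ~ f500_hq", data=df).fit()

# The summary includes the best-fit equation's coefficients, each
# predictor's p-value, and Adjusted R-squared.
print(model.summary())
```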

Start with p-values. Each predictor variable in a regression model has its own p-value. That value estimates the probability that a coefficient for the predictor (the number it is multiplied by in the regression equation) at least as large as the one observed could have occurred by chance if the predictor had zero influence on the response variable. If F500 headquarters had no bearing on the number of practicing lawyers in a state, the p-value for F500 would typically be high, such as 0.5 or 0.8.

A p-value below 0.05, by contrast, tells us that if there were truly no relationship, a result like ours would occur by chance less than 5% of the time. If the data in the model show something that would happen less than one time out of twenty (less than 5% of the time), it’s unusual enough for us to accept that “Something real is going on!”

Our p-value for F500 headquarters turns out to be tiny, well below 0.001, so we have solid reason to believe that the number of F500 headquarters relates strongly and reliably to the number of practicing lawyers. Be careful: we cannot say that the number of F500 headquarters causes the number of lawyers, only that it is strongly associated with it.
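Continuing the sketch above (still with invented data), the p-value for the `f500_hq` coefficient can be pulled out of the fitted model and checked against the 0.05 threshold:

```python
# Continuing from the fitted `model` in the sketch above.
# model.pvalues is indexed by term name, including the intercept.
p_value = model.pvalues["f500_hq"]

if p_value < 0.05:
    print(f"p = {p_value:.4g}: very unlikely if F500 HQs had zero influence")
else:
    print(f"p = {p_value:.4g}: consistent with no real relationship")
```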

Next, let’s learn what the Adjusted R-squared result tells us. Adjusted R-squared tells us, for this model, what percentage of the variation in the number of lawyers across states is accounted for by F500 headquarters. In our data, Adjusted R-squared is 81%, which is quite high; compared to F500 headquarters, other factors account for relatively little of the state-to-state variation in private lawyers.

Adjusted R-squared gives a sense of the portion of the variation in the response variable that the model explains. It does not directly indicate how well the model will predict new observations. It helps most when you compare different models fit to the same data set, such as when you try different predictor variables.
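Again continuing the sketch, Adjusted R-squared is available directly on the fitted model, and the typical use is comparing it across candidate models fit to the same data. The `population` predictor below is purely hypothetical, added only to show the comparison.

```python
# Continuing from the sketch above.
print(f"Adjusted R-squared: {model.rsquared_adj:.2f}")

# To compare models on the same data, fit an alternative with a different
# (hypothetical) predictor and compare the adjusted values:
df["population"] = [0.6, 1.1, 2.3, 4.0, 6.5, 9.8, 13.0, 20.1, 39.5]  # millions, invented
model2 = smf.ols("lawyers ~ f500_hq + population", data=df).fit()
print(f"With population added: {model2.rsquared_adj:.2f}")
```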
