Machine learning for lawyers: linear regression assumptions

To produce valid results, a linear regression model needs the relationship between the predictor and response variables to be linear. For example, to predict the number of practicing lawyers in a state, multiply the state’s number of F500 headquarters by 1,000 and add the intercept value. Because the multiplier of 1,000 is the same no matter how many F500 headquarters a state has, and because the model does not square the headquarters count, take its logarithm, or apply some other mathematical operation, the relationship can be plotted as a straight line: it’s linear.
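To make the arithmetic concrete, here is a minimal sketch of that prediction rule. The slope of 1,000 comes from the example above; the intercept of 2,000 and the function name are made up for illustration, since the post does not give them.

```python
def predict_lawyers(f500_headquarters, slope=1000.0, intercept=2000.0):
    """Linear prediction: a constant slope times the predictor, plus the intercept.

    The intercept default is a hypothetical value chosen only for this sketch.
    """
    return intercept + slope * f500_headquarters


# Each additional headquarters adds the same number of predicted lawyers,
# whether the state starts with 0 headquarters or with 50.
step_low = predict_lawyers(1) - predict_lawyers(0)
step_high = predict_lawyers(51) - predict_lawyers(50)
print(step_low, step_high)
```

The two steps are equal because the slope never changes, which is exactly what "linear" means here.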

Bear in mind that the software will blindly calculate a best-fit line even if the data are completely random, U-shaped, hockey-stick shaped, or bizarrely irregular. The requirement for valid linear regression goes back to the slope of the best-fit line: a single constant number, not a squared number or some varying number, that multiplies the values of the predictor to estimate the response variable.
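The sketch below illustrates the "blindly calculate" point with invented, perfectly U-shaped data. The helper `fit_line` is a hypothetical name; it applies the standard least-squares formulas for one predictor and is not taken from any particular statistics package.

```python
def fit_line(xs, ys):
    """Ordinary least-squares slope and intercept for a single predictor."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Invented U-shaped data: the response is the square of the predictor.
xs = list(range(-5, 6))
ys = [x ** 2 for x in xs]

slope, intercept = fit_line(xs, ys)
print(slope, intercept)
```

The calculation runs without complaint and returns a flat line, even though the response obviously depends on the predictor. Nothing in the arithmetic warns you that a straight line is the wrong shape for this data.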

Got it, but how can you tell if your data satisfies the criterion of linearity? Quite often you can simply eyeball a scatter plot. You plot each state’s predictor variable on the horizontal axis, with the number of F500 headquarters increasing to the right. You plot the corresponding response variable on the vertical axis, increasing toward the top. If the pattern suggests a relationship (the two variables rise together or fall together, or an ellipse sketched around the bulk of the data would look something like a tilted football), you likely have a linear relationship between the variables and satisfy this assumption for linear regression.

This plot suggests a roughly linear increase in the number of lawyers as F500 headquarters increase in that you can draw a straight line from the lower left mid-way through the points up to the upper right.

The plot below doesn’t look nearly as linear, so that predictor is probably less useful (and unlikely to be statistically significant), but the distribution of points is still reasonably linear. Another clue to the relative linearity of the two predictors is correlation: the correlation between lawyers and F500 headquarters is 0.9, whereas the correlation between lawyers and urban population as a percent of total population is half that (0.45).
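For readers who want to see where a correlation figure comes from, here is a small sketch of the Pearson correlation coefficient. All of the numbers are invented for five imaginary states and will not reproduce the 0.9 and 0.45 reported above; `pearson_r` is a hypothetical helper name.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation: covariance scaled by both variables' spreads."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


# Invented data for five imaginary states (not the post's data set).
lawyers_thousands = [3, 5, 6, 9, 10]
f500_headquarters = [1, 2, 3, 4, 5]
urban_percent = [70, 60, 90, 65, 85]

r_headquarters = pearson_r(f500_headquarters, lawyers_thousands)
r_urban = pearson_r(urban_percent, lawyers_thousands)
print(round(r_headquarters, 2), round(r_urban, 2))
```

In this toy data the headquarters predictor tracks lawyers much more tightly than the urban percentage does, which mirrors the contrast between the two plots. Keep in mind that a correlation only measures how well a straight line summarizes the points, so it complements the scatter plot rather than replacing it.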

If the variables do not have a linear association, transformations of the variables are possible, such as taking their square roots, but that’s the subject of another post. Other kinds of regression might also fit a valid model, but those alternatives are too advanced for this overview. It is also important to check for outliers (very unusual and influential data points), which we will return to later.
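As a brief preview of the square-root idea, the sketch below uses invented data in which the response grows as the square of an evenly spaced predictor. The helper `steps` is a hypothetical name; it simply measures how much the response changes from one point to the next, which only works as a linearity check because the predictor values here are evenly spaced.

```python
import math


def steps(values):
    """Change from each value to the next one in the list."""
    return [later - earlier for earlier, later in zip(values, values[1:])]


# Invented data: the response is the square of an evenly spaced predictor.
predictor = [1, 2, 3, 4, 5, 6]
response = [x ** 2 for x in predictor]

raw_steps = steps(response)
sqrt_steps = steps([math.sqrt(y) for y in response])
print(raw_steps)
print(sqrt_steps)
```

The raw response climbs by a larger amount at every step, so no single slope describes it. After taking square roots, every step is the same size, which is the constant-slope pattern that linear regression assumes. Real data will rarely straighten out this neatly.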
