Before running a regression, it is prudent to study any of your data that exhibit unusual characteristics. The purpose of spotting and evaluating abnormal data is to make sure that they are not mistakes in measurement, collection, data entry, or calculation, and that they do not unjustifiably warp your regression model. You want to scrutinize three varieties of unusual data: outliers, high-leverage points, and influential points.
An outlier is an observation (a U.S. state in our example data) whose value of the response variable (number of lawyers in the state) is poorly predicted by the model. That is to say, the model's estimate misses the actual number of lawyers by an unduly large amount. With our data, New York stands out as quite an outlier. Assuming the figures and facts we have for New York are correct, however, that’s real life; generally speaking, unless you have a solid reason to omit an observation, outlier data should be included in your model.
At least three techniques can help spot outliers: statistical tests, graphical plots, and repeated modeling.
Among the statistical tests, one calculates whether the largest residual of the response variable (the amount by which the model mis-estimated the number of lawyers) is “statistically significantly” off the mark; if no such unusual residual shows up, the test finds no outliers in the data. However, if the largest residual is statistically significant, and the observation is therefore flagged as an outlier, analysts sometimes delete it and rerun the test (the third technique) to see whether other outliers are present.
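One common form of such a test, sketched below, applies a Bonferroni adjustment to the largest studentized residual; it may not be the exact test used for the lawyer data. The data are invented for illustration, the regression has a single predictor, and the helper `t_upper_tail` is a hypothetical stand-in (simple numerical integration) for the t-distribution function a statistics library would normally supply.

```python
import math

# Hypothetical toy data (NOT the lawyer data): one predictor, one response.
x = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
y = [2.5, 3.4, 6.4, 7.7, 15.0, 11.5, 14.3, 15.6, 18.5, 19.6]
n, p = len(x), 2  # p = number of coefficients (intercept and slope)

# Ordinary least squares fit of y = intercept + slope * x.
x_bar, y_bar = sum(x) / n, sum(y) / n
sxx = sum((xi - x_bar) ** 2 for xi in x)
slope = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sxx
intercept = y_bar - slope * x_bar

resid = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]
sse = sum(e * e for e in resid)
hat = [1 / n + (xi - x_bar) ** 2 / sxx for xi in x]  # leverage of each point

# Externally studentized residuals: each residual is scaled by an error
# estimate computed as if that observation had been left out of the fit.
df = n - p - 1
t_stats = [e * math.sqrt(df / (sse * (1 - h) - e * e))
           for e, h in zip(resid, hat)]

def t_upper_tail(t, df, steps=2000):
    """P(T > t) for t >= 0 under Student's t, via Simpson's rule (steps even)."""
    c = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    pdf = lambda u: c * (1 + u * u / df) ** (-(df + 1) / 2)
    h = t / steps
    total = pdf(0.0) + pdf(t)
    for k in range(1, steps):
        total += (4 if k % 2 else 2) * pdf(k * h)
    return max(0.0, 0.5 - total * h / 3)

# Test only the most extreme observation, then multiply the two-sided
# p-value by n to account for having picked the largest of n residuals.
worst = max(range(n), key=lambda i: abs(t_stats[i]))
p_raw = 2 * t_upper_tail(abs(t_stats[worst]), df)
p_bonf = min(1.0, n * p_raw)

print(f"most extreme observation: index {worst}")
print(f"studentized residual: {t_stats[worst]:.2f}")
print(f"Bonferroni-adjusted p-value: {p_bonf:.5f}")
```

A small adjusted p-value (below 0.05, say) flags the observation as a candidate outlier; whether to keep it is still a judgment call about the data, not something the test decides.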
The second technique, a graphical plot, charts the residuals, after some mathematical adjustments to the scale, against the corresponding theoretical quantiles. Quantiles are points in your data below which a certain proportion of the data fall (percentiles are quantiles expressed in hundredths). If you sort the data from low to high, 25% of the points lie below the “first quartile”; 50% lie below the second quartile, the median; and so forth. So, the horizontal axis shows where the residuals would be expected to fall if they followed a normal bell-curve distribution, based on the quantiles of such a distribution. The plot also draws a dotted-line band above and below the residuals to show a statistical form of confidence in the estimated values. The plot tells us that New York (labeled 34, top right corner) and California (labeled 5, lower left corner) are outliers to be scrutinized.
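To make the pairing of residuals with theoretical quantiles concrete, here is a minimal sketch that computes the coordinates such a plot would draw. The residuals are invented for illustration, and the plotting positions `(i + 0.5) / n` are one common convention assumed here; plotting software may use a different rule and a different reference distribution.

```python
from statistics import NormalDist

# Hypothetical rescaled residuals, invented for illustration (not the lawyer data).
residuals = [0.3, -1.2, 0.8, -0.4, 2.9, 0.1, -0.7, 1.1, -0.2]

ordered = sorted(residuals)  # sort from low to high
n = len(ordered)

# Plotting positions: an estimate of the proportion of the data that lies
# below each sorted point. The (i + 0.5) / n rule is an assumed convention.
positions = [(i + 0.5) / n for i in range(n)]

# Theoretical quantiles: where points at those positions would sit on a
# standard normal (bell-curve) distribution.
std_normal = NormalDist()
theoretical = [std_normal.inv_cdf(q) for q in positions]

# Each pair is one point on the plot: x = theoretical, y = observed residual.
for t, r in zip(theoretical, ordered):
    print(f"{t:6.2f}  {r:6.2f}")
```

If the residuals were roughly normal, the printed pairs would rise along a straight line; a pair whose observed value sits far from that line at either end corresponds to the corner points the plot singles out.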
As alluded to above, a third technique helps if you are concerned about an observation being an outlier. You exclude the suspect observation (state) from the regression and refit the model. If the model’s coefficients don’t change much, then you don’t have to worry.
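The refit-and-compare step can be sketched as follows. The data are invented, the model has a single predictor, and `fit_line` is a hypothetical helper written for this example; what counts as "not much" change is left to the analyst.

```python
# Hypothetical toy data (NOT the lawyer data); the last point is the suspect.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
y = [2.1, 3.9, 6.2, 8.1, 9.8, 12.1, 14.0, 30.0]

def fit_line(xs, ys):
    """Least-squares fit of y = intercept + slope * x; returns (intercept, slope)."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    sxx = sum((v - x_bar) ** 2 for v in xs)
    sxy = sum((v - x_bar) * (w - y_bar) for v, w in zip(xs, ys))
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope

def fit_without(xs, ys, drop):
    """Refit after excluding the observation at index `drop`."""
    keep = [i for i in range(len(xs)) if i != drop]
    return fit_line([xs[i] for i in keep], [ys[i] for i in keep])

full_intercept, full_slope = fit_line(x, y)
red_intercept, red_slope = fit_without(x, y, drop=7)

# Relative change in the slope when the suspect observation is removed.
slope_change = (red_slope - full_slope) / full_slope

print(f"all observations: intercept={full_intercept:.2f}, slope={full_slope:.2f}")
print(f"suspect excluded: intercept={red_intercept:.2f}, slope={red_slope:.2f}")
print(f"relative change in slope: {slope_change:.0%}")
```

In this invented example the slope drops by roughly a third once the suspect point is removed, which is the kind of shift that would warrant a closer look at that observation.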