# Before regression, look at your data and check correlations

Let’s use the terms we just learned to predict the number of lawyers in a state, our dependent variable, by regressing it on a single predictor variable: the number of Fortune 500 companies headquartered in the state.

First, however, it is good practice to examine your data before you plunge into a regression analysis. One method is a scatter plot, with the predictor variable along the horizontal x-axis and the dependent variable along the vertical y-axis. Each state appears as one point, positioned by its values on those two variables.

From the top down, the right-most points are New York (54 headquarters and 96,000 lawyers), California (54 headquarters and 85,000 lawyers) and Texas (52 headquarters and 48,000 lawyers).

Moving to the right, as states have more Fortune 500 headquarters, do the points also move up the plot, indicating more private practice lawyers?

Yes! Your eye tells you that the distribution of the points on the scatter plot drifts upwards toward the right roughly on a line: more headquarters, more lawyers. That conclusion makes intuitive sense to the extent that more Big Corporates probably generate more legal issues and the outside lawyers who handle those issues are likely to live in the state.

We can get a more precise, quantitative sense of the relationship between the two variables. When a scatter plot suggests a linear relationship (we’ll explain this term later), we can supplement it with the correlation coefficient, which measures the strength and direction of a linear relationship between two quantitative variables. Quantitative variables are numbers, as opposed to qualitative variables, also called factors, such as state or region. Correlations range between -1 and 1. Values near -1 indicate a strong negative linear relationship, values near 0 indicate a weak or absent linear relationship, and values near 1 indicate a strong positive linear relationship.

On our data the correlation is 0.904, which is very strong. Because the correlation is positive, it confirms what the scatter plot suggested: more headquarters, more lawyers.
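To make the calculation concrete, here is a minimal Python sketch of the Pearson correlation coefficient, which is presumably the measure behind the figure above. The function name `pearson_r` and the sample numbers are invented for illustration; they are not the state data used in this post.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient for two equal-length numeric sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length sequences with at least two values")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sum of cross-products of the deviations from each mean
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Sums of squared deviations for each variable
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined when a variable is constant")
    return sxy / math.sqrt(sxx * syy)

# Hypothetical values, NOT the post's state data
headquarters = [1, 3, 5, 8, 12, 20]
lawyers_thousands = [2, 4, 9, 11, 20, 31]

r = pearson_r(headquarters, lawyers_thousands)
print(round(r, 3))
```

A result close to 1 says the made-up points rise together along something close to a straight line, which is the same pattern the scatter plot shows for the states.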

Correlation has meaning only for linear relationships, and it is sensitive to outliers (unusual, possibly erroneous values that might improperly skew the model; more on these later).
