Law firms and law departments can store their data for regression models, or any machine learning algorithm, in a spreadsheet. Spreadsheets are perfectly fine, and indeed Excel and like programs can perform linear regressions, but more powerful software such as Mathematica, Matlab, Python (open source), R (open source), SASS, SPSS, and Tableau typically take in data from a spreadsheet before they can begin their magic.
Here are some observations and good hygiene for spreadsheets that contain regression data.
Store the data in columns, not rows. So, in our data set, each state is a row and each column contains the values for a variable. Here are the first six rows of the data being used in this series of posts. <chr>, <dbl> and <fct> stand for a text variable, a numeric variable (double precision is what gives it the “dbl”), and a categorical or factor variable, respectively, as used by the R programming language.
state population area lawyers gdp F500 capital region party <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <chr> <fct> <fct> 1 AK 735132 665384. 1585 51859 0 Juneau West rep 2 AL 4822023 52420. 7615 183547 1 Montgomery South rep 3 AR 2949131 53179. 4447 109557 7 Little Rock South rep 4 AZ 6553255 113990. 8023 266891 5 Phoenix West rep 5 CA 38041430 163696. 85274 2003479 54 Sacramento West dem 6 CO 5187582 104094. 11584 274048 9 Denver West dem
Put each observation on its own row and it is a convention to put the observations in the first column. In real life, observations might be associates, partners, offices, countries, law firms, client groups, matters or others.
The arrangement of the columns does not matter to the software nor does the order of the observations. Our data set has the observation (state) in the first column, the dependent variable, lawyers, in the fourth column and F500 headquarters in the fifth. It also doesn’t matter if some columns (or some rows) are not used in the model.
Try not to have extraneous rows, such as headers, summaries, or sub-tables. Leave out totals, explanatory text and blank rows. It doesn’t matter, we should note, if you have superfluous columns on the right or rows below the data, because your software can easily remove or ignore them.
Make sure you have only numbers in columns for numbers. That is to say, do not have commas, dollar signs or any text (17500 USD is a no-no). Decimal points in values are perfectly fine, as in “area” above.
If you are missing values, be consistent in how you identify them. The best approach is to put nothing in the cell; avoid hyphens or writing something, such as “no value.” If text and numbers are jumbled together, most software will treat the variable as text and cannot do math on the text.
Writing code is easier if you use one-word names for your columns and make them understandable. For example, use “hours” or “firm” or “type” rather than “billable hours” or “paid vendor”, or “x-1 code”. It’s fine to use camelback style like “StateGDP” or “AreaState”.