Suggestions for spreadsheets used by lawyers in machine learning

Law firms and law departments can store their data for regression models, or any machine learning algorithm, in a spreadsheet. Spreadsheets are perfectly fine, and indeed Excel and like programs can perform linear regressions, but more powerful software such as Mathematica, Matlab, Python (open source), R (open source), SASS, SPSS, and Tableau typically take in data from a spreadsheet before they can begin their magic.

Here are some observations and good hygiene for spreadsheets that contain regression data.

Store the data in columns, not rows. So, in our data set, each state is a row and each column contains the values for a variable. Here are the first six rows of the data being used in this series of posts. <chr>, <dbl> and <fct> stand for a text variable, a numeric variable (double precision is what gives it the “dbl”), and a categorical or factor variable, respectively, as used by the R programming language.

 state population    area lawyers     gdp  F500 capital     region party
  <chr>      <dbl>   <dbl>   <dbl>   <dbl> <dbl> <chr>       <fct>  <fct>
1 AK        735132 665384.    1585   51859     0 Juneau      West   rep  
2 AL       4822023  52420.    7615  183547     1 Montgomery  South  rep  
3 AR       2949131  53179.    4447  109557     7 Little Rock South  rep  
4 AZ       6553255 113990.    8023  266891     5 Phoenix     West   rep  
5 CA      38041430 163696.   85274 2003479    54 Sacramento  West   dem  
6 CO       5187582 104094.   11584  274048     9 Denver      West   dem 

Put each observation on its own row and it is a convention to put the observations in the first column. In real life, observations might be associates, partners, offices, countries, law firms, client groups, matters or others.

The arrangement of the columns does not matter to the software nor does the order of the observations. Our data set has the observation (state) in the first column, the dependent variable, lawyers, in the fourth column and F500 headquarters in the fifth. It also doesn’t matter if some columns (or some rows) are not used in the model.

Try not to have extraneous rows, such as headers, summaries, or sub-tables. Leave out totals, explanatory text and blank rows. It doesn’t matter, we should note, if you have superfluous columns on the right or rows below the data, because your software can easily remove or ignore them.

Make sure you have only numbers in columns for numbers. That is to say, do not have commas, dollar signs or any text (17500 USD is a no-no). Decimal points in values are perfectly fine, as in “area” above.

If you are missing values, be consistent in how you identify them. The best approach is to put nothing in the cell; avoid hyphens or writing something, such as “no value.” If text and numbers are jumbled together, most software will treat the variable as text and cannot do math on the text.

Writing code is easier if you use one-word names for your columns and make them understandable. For example, use “hours” or “firm” or “type” rather than “billable hours” or “paid vendor”, or “x-1 code”. It’s fine to use camelback style like “StateGDP” or “AreaState”.

Leave a Reply

Your email address will not be published. Required fields are marked *

This site uses Akismet to reduce spam. Learn how your comment data is processed.