Reproducible research as a desideratum for legal data analysis

What is termed ‘reproducible research’ urges all data scientists to keep careful track of their data’s source and transformations.  Each step of the way from the original numbers – the headwaters – through each addition, subtraction, calculation or revision should be recorded so that another person could reproduce the final data set – the mouth of the river.  That person should also be able to evaluate the appropriateness of the complete stream of alterations and manipulations.
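One way to record each step is to have the script keep its own log as it runs. The following is a minimal sketch, not a standard tool: the function name `run_with_log`, the step descriptions, and the sample numbers are all invented for illustration.

```python
from typing import Callable, Dict, List, Tuple

# A step is a plain-language description paired with the function that does it.
Step = Tuple[str, Callable[[List[float]], List[float]]]


def run_with_log(data: List[float], steps: List[Step]) -> Tuple[List[float], List[Dict]]:
    """Apply each step in order and record what it did to the row count."""
    log = []
    current = list(data)  # copy, so the original numbers stay untouched
    for description, transform in steps:
        result = transform(current)
        log.append(
            {"step": description, "rows_in": len(current), "rows_out": len(result)}
        )
        current = result
    return current, log


# Hypothetical workflow: amounts recorded in cents, with one bad entry.
steps = [
    ("drop negative values", lambda xs: [x for x in xs if x >= 0]),
    ("convert cents to dollars", lambda xs: [x / 100 for x in xs]),
]

final, log = run_with_log([1500, -20, 250], steps)
for entry in log:
    print(entry)
```

The log travels with the result, so a reader can see which step removed a row without rerunning the analysis.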

As to the provenance of data, the URL and the date on which information was scraped from a web site are crucial.  For data obtained from print, the publication, date and page are key.  How the original data was collected, such as by an export from an email system or by a survey, needs to be spelled out.  And so on.
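Provenance details like these can be stored in a small structured record that is saved next to the data. The sketch below is one possible layout, not an established schema: the class name `Provenance`, its field names, the URL, and the date are all placeholders.

```python
import json
from dataclasses import asdict, dataclass
from datetime import date


@dataclass(frozen=True)
class Provenance:
    """Where a data set came from (all field names are illustrative)."""

    source_type: str        # e.g. "web", "print", "export", "survey"
    location: str           # URL, or publication and page, or system name
    retrieved_on: str       # ISO date on which the data was obtained
    collection_method: str  # how the original data was collected


def provenance_to_json(record: Provenance) -> str:
    """Serialize the record so it can be saved alongside the data file."""
    return json.dumps(asdict(record), indent=2, sort_keys=True)


example = Provenance(
    source_type="web",
    location="https://example.com/court-filings",  # placeholder URL
    retrieved_on=date(2020, 1, 15).isoformat(),    # placeholder date
    collection_method="scraped HTML table",
)

print(provenance_to_json(example))
```

Writing the record to a file beside the data means the source description cannot drift away from the numbers it describes.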

In the first instance, programmers approach reproducibility with the code itself, which tells another programmer in the same language what is going on, such as turning a character variable into a numeric variable, multiplying a group of numbers by a constant, or choosing a subset of the data.  But code alone is often cryptic: the logic may not be clear, or the reasons for certain choices may be murky and difficult to recreate.

Liberal commenting by the programmer can fill the gap to create a roadmap for others.  Nearly every programming language has a simple method to say to the computer, “Ignore this line, it is a note to myself and others.”  Good programmers explain in comments what the following lines of code do, why the script is doing that, and any issues or decisions in play.  It is an excellent practice to write thorough comments that would allow a non-programmer to follow the origin, transformations, and outputs of a data workflow.  Such comments, by the way, greatly help the programmer later when she returns to the now-forgotten analysis and has to reconstruct it.
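As an illustration of comments that explain both the what and the why, here is a short sketch in Python, where a comment starts with `#`. The data, the function name `clean_fees`, and the 10% discount are all hypothetical, chosen only to show a subset, a text-to-number conversion, and a multiplication.

```python
# Hypothetical input: fees billed per matter. The amounts are stored as text
# because the (assumed) billing export quotes every field.
raw_rows = [
    {"matter": "A-101", "practice": "litigation", "fee_usd": "1200.50"},
    {"matter": "A-102", "practice": "tax", "fee_usd": "800"},
    {"matter": "A-103", "practice": "litigation", "fee_usd": "n/a"},
]

# Assumed for illustration: a negotiated 10% discount applies to every fee.
DISCOUNT = 0.9


def clean_fees(rows, practice):
    """Return discounted numeric fees for one practice area."""
    cleaned = []
    for row in rows:
        # Step 1: choose a subset. Only the requested practice area is
        # in scope for this analysis.
        if row["practice"] != practice:
            continue

        # Step 2: turn the character value into a number. Entries such as
        # "n/a" cannot be converted; we drop them rather than guess a value.
        try:
            fee = float(row["fee_usd"])
        except ValueError:
            continue

        # Step 3: multiply by the discount and round to whole cents.
        cleaned.append(
            {"matter": row["matter"], "fee_usd": round(fee * DISCOUNT, 2)}
        )
    return cleaned


litigation_fees = clean_fees(raw_rows, "litigation")
print(litigation_fees)
```

The comment on step 2 records a decision (dropping unparseable fees) that the code alone would not explain, which is exactly the kind of choice a later reader needs to evaluate.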

Beyond spelling out the source of the data, the programming calls themselves, and ample comments, what is known as ‘literate programming’ goes a step further: it interweaves the code with a narrative explanation in a single document, so that the program reads as an account written for people and the supplementary annotations sit alongside the code they describe.

In the legal industry, data analysts should strive for reproducible research: transparency at every step of their work.