Reproducible research as a desideratum for legal data analysis

The practice termed ‘reproducible research’ urges all data scientists to keep careful track of their data’s sources and transformations.  Every step from the original numbers – the headwaters – through each addition, subtraction, calculation, or revision should be recorded so that another person could reproduce the final data set – the mouth of the river – and evaluate the appropriateness of the complete stream of alterations and manipulations.
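The idea of recording every step can be sketched in code.  The following is a minimal illustration, assuming a hypothetical list of raw values and invented step descriptions; it is not any particular firm’s workflow:

```python
# A minimal sketch of a transformation log: every step from the
# "headwaters" (raw data) to the "mouth" (final data set) is recorded
# so another analyst can retrace the full stream of changes.
raw_hours = ["12", "7", "not recorded", "40"]  # hypothetical raw values

log = []  # an ordered record of every transformation applied

def apply_step(data, description, func):
    """Apply one transformation and record what was done."""
    result = func(data)
    log.append({"step": description,
                "rows_before": len(data),
                "rows_after": len(result)})
    return result

# Step 1: drop entries that are not plain digits
hours = apply_step(raw_hours, "drop non-numeric entries",
                   lambda d: [x for x in d if x.isdigit()])
# Step 2: convert the remaining character values to numbers
hours = apply_step(hours, "convert strings to integers",
                   lambda d: [int(x) for x in d])

for entry in log:
    print(entry)
```

Another analyst reading `log` can see exactly which rows were dropped and why the final data set is smaller than the original.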

As to the provenance of data: for information scraped from a web site, the URL and the date of retrieval are crucial; for data obtained from print, the publication, date, and page are key.  How the original data was collected – say, by an export from an email system or by a survey – needs to be spelled out.  And so on.
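Such provenance details can simply travel with the data as a small metadata record.  A sketch, with a placeholder URL and an invented date for illustration:

```python
# Provenance recorded alongside the data set itself.
# The URL, date, and notes below are placeholders, not real sources.
provenance = {
    "source_type": "web scrape",
    "url": "https://example.com/filings",   # hypothetical source URL
    "date_scraped": "2016-11-01",           # hypothetical retrieval date
    "notes": "HTML table parsed into CSV; no rows dropped at this stage",
}

print(provenance["url"], provenance["date_scraped"])
```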

In the first instance, programmers approach reproducibility through the code itself, which tells another programmer in the same language what is going on, such as turning a character variable into a numeric one, multiplying a group of numbers by some factor, or choosing a subset of the data.   But code alone can be cryptic: the logic may not be clear, and the reasons for certain choices may be murky and difficult to recreate.
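For instance, the three manipulations just named look like this in code (the fee values are invented for illustration):

```python
# Three common transformations, written plainly.
fees = ["250", "400", "175"]          # character values (hypothetical)
fees = [float(x) for x in fees]       # turn a character variable into a numeric one
fees = [x * 1.05 for x in fees]       # multiply the group of numbers by something
high = [x for x in fees if x > 200]   # choose a subset of the data
print(high)  # [262.5, 420.0]
```

A programmer can read each line and see what happens, but nothing here says why fees were multiplied by 1.05 or why 200 was the cutoff – which is exactly the gap comments must fill.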

Liberal commenting by the programmer can fill that gap and create a roadmap for others.  Every programming language has a simple way to tell the computer, “Ignore this line; it is a note to myself and others.”  Good programmers explain in comments what the following lines of code do, why the script is doing that, and any issues or decisions in play.   It is excellent practice to write comments full enough that a non-programmer could follow the origin, transformations, and outputs of a data workflow.  Such comments, by the way, greatly help the programmer later when she returns to the now-forgotten analysis and has to reconstruct it.
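A sketch of that commenting style, using invented matter dates; the what/why/decision notes are the point, not the calculation itself:

```python
from datetime import date
from statistics import median

# WHAT: compute each matter's duration in days, then take the median.
# WHY: the median resists distortion from a few unusually long matters.
# DECISION: matters with no close date (None) are excluded; they are
#           still open, so a duration cannot be computed yet.
# (All dates below are hypothetical.)
matters = [
    (date(2016, 1, 4), date(2016, 3, 1)),
    (date(2016, 2, 10), date(2016, 2, 24)),
    (date(2016, 5, 2), None),  # still open: excluded below
]

durations = [(end - start).days for start, end in matters if end is not None]
print(median(durations))  # typical matter length in days
```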

Beyond spelling out the source of the data, the programming calls themselves, and ample comments, what is known as ‘literate programming’ offers guidelines for how code should be divided up and indented and how supplementary annotations should be added.

In the legal industry, data analysts should strive for reproducible research: transparency in every step of their work.

Modest involvement with “AI software” according to ILTA survey

Signs are everywhere that the U.S. legal industry has started to recognize the potential for computer-assisted decision-making.  For example, the 2016 ILTA/InsideLegal Technology Purchasing Survey included a question on the topic: “Is your firm currently evaluating (or already utilizing) artificial intelligence technologies, systems or related strategies?”  The web-based survey was distributed to 1,231 ILTA member law firms, of which 14% responded (172 firms).

Only 13% of the respondents answered the AI question favorably: 2% were already utilizing such technologies and 11% were “currently evaluating” them.  Write-in responses cited IBM Watson, Kira Systems, RAVN, Lex Machina, and ROSS.  Not surprisingly, “half of the respondents that are currently evaluating AI come from Large Firms,” defined as firms with more than 200 lawyers (large firms comprised 19% of total respondents).
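The reported figures are internally consistent, as a quick check confirms (rounding to whole percentages and whole firms):

```python
# Arithmetic check of the survey figures quoted above.
surveyed = 1231
respondents = 172
print(round(respondents / surveyed * 100))  # response rate: 14 (%)

favorable = 2 + 11   # % already utilizing + % currently evaluating
print(favorable)     # 13 (%)

large_firms = round(respondents * 0.19)  # 19% of respondents were Large Firms
print(large_firms)   # roughly 33 firms
```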

What makes it impossible to assess the actual level of support for AI software is that “Response percentages are based on total responses per question, not overall survey participation” [emphasis added].  We therefore cannot say that 13% of 172 firms responded favorably, because the survey report does not state how many firms answered that particular question.