An introduction to software that identifies relevant documents in litigation

Software that analyzes text documents collected electronically in litigation discovery has come a long way in the past few years. Variously referred to as “predictive coding,” “technology-assisted review” (TAR), and “computer-assisted review,” the software’s steps can be explained simply.

You train the software on a “seed set” of documents that humans have coded as relevant or not relevant, and then you aim the software at a random sample of other documents that have also been coded for relevance, called the “validation set.”
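To make those two steps concrete, here is a minimal sketch of the workflow in Python, using scikit-learn as a stand-in for the proprietary classifiers inside commercial TAR products; the document snippets and relevance labels are entirely hypothetical.

```python
# A toy version of the seed-set training step. Real TAR tools use
# proprietary pipelines; scikit-learn stands in here for illustration.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Seed set: document text paired with human relevance calls (1 = relevant).
seed_docs = [
    "email discussing the merger timetable",       # hypothetical documents
    "cafeteria menu for the week of March 3",
    "draft term sheet circulated to the board",
    "IT notice about a scheduled password reset",
]
seed_labels = [1, 0, 1, 0]

vectorizer = TfidfVectorizer()
model = LogisticRegression()
model.fit(vectorizer.fit_transform(seed_docs), seed_labels)

# The separately coded random sample (the validation set) is then scored
# by the trained model so its calls can be compared with the human calls.
validation_docs = [
    "board minutes on the merger vote",
    "holiday party invitation",
]
model_calls = model.predict(vectorizer.transform(validation_docs))
print(model_calls)  # e.g., [1 0]
```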

Assuming the software has been trained sufficiently well on the seed set, its effectiveness is judged by “recall,” which is the percentage of the relevant documents in the random-sample validation set that the software correctly identified as relevant. As three lawyers at the law firm BuckleySandler pointed out in a recent article (LTN, Oct. 2016 at 40), courts have paid attention to how the seed set was constructed, but they have paid less attention to the accuracy of the coding of the validation set.
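In other words, recall is the count of documents that both the humans and the software marked relevant, divided by the count the humans marked relevant. A short illustration, with entirely hypothetical coding calls:

```python
# Recall on the validation set: of the documents the human coders marked
# relevant, what fraction did the software also flag? Labels are hypothetical.
human_calls = [1, 1, 0, 1, 0]  # coders' relevance decisions on the sample
model_calls = [1, 0, 0, 1, 0]  # software's decisions on the same documents

true_positives = sum(
    1 for human, model in zip(human_calls, model_calls)
    if human == 1 and model == 1
)
relevant_total = sum(human_calls)  # documents the humans deemed relevant

recall = true_positives / relevant_total
print(f"Recall: {recall:.0%}")  # 2 of 3 relevant documents found: 67%
```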

Once the level of recall satisfies both sides to the lawsuit, the software is unleashed on the full set of collected documents, and it dutifully identifies those the algorithm deems relevant and thus producible.
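That final pass might look like the following continuation of the earlier sketch, reusing its hypothetical model and vectorizer; the cutoff score of 0.5 is purely illustrative, since in practice the parties negotiate how the threshold is set.

```python
# Score the full collection with the model and vectorizer fitted in the
# first snippet above (names and documents are hypothetical).
all_docs = [
    "board minutes on the merger vote",
    "IT notice about a scheduled password reset",
]
# Probability that each document is relevant, per the trained model.
scores = model.predict_proba(vectorizer.transform(all_docs))[:, 1]

# Documents above the agreed cutoff go into the producible set.
producible = [doc for doc, score in zip(all_docs, scores) if score >= 0.5]
```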