Predicting a Court’s decision: the natural language processing steps

Earlier we introduced a machine learning study that predicted decisions of a European court regarding human rights violations.  The data preparation started with two kinds of text analysis.  My translation of their article borrows heavily from their writing.

Finding frequent words or combinations of words (N-gram features): “The Bag-of-Words (BOW) model is a popular semantic representation of text used in NLP.  In a BOW model, … text is represented as the bag … of its words (unigrams) or N-grams without taking into account grammar, syntax and word order.”

In short, as is common in Natural Language Processing (NLP), they decomposed the text of each decision into a “bag” of single words and short runs of consecutive words.  Doing so ignores the grammar, overall order and part of speech of the words in the original.  A “unigram” is a single word, whereas an N-gram is a sequence of N consecutive words: two, three, four or more.  The researchers went as far as 4-grams.  So, you could go through a U.S. Supreme Court decision and find each word (unigram), each pair of adjacent words, each triplet, and each four-in-a-row sequence (2-grams, 3-grams, and 4-grams) and put them all in a bag of words without noun designations, verb designations, etc.
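To make the bag concrete, here is a minimal sketch in Python of pulling every unigram through 4-gram out of a sentence.  The function name, the crude tokenizer and the sample sentence are my own assumptions for illustration; this is not the researchers' code.

```python
def ngrams(text, max_n=4):
    """Return every 1-gram through max_n-gram in text as space-joined strings."""
    # Crude tokenizer (an assumption): lowercase, split on spaces, strip punctuation.
    tokens = [w.strip(".,;:!?\"'()") for w in text.lower().split()]
    tokens = [w for w in tokens if w]
    bag = []
    for n in range(1, max_n + 1):
        # Slide a window of n consecutive words across the text.
        for i in range(len(tokens) - n + 1):
            bag.append(" ".join(tokens[i:i + n]))
    return bag


sample = "The court finds a violation."
bag = ngrams(sample)
print(len(bag))   # 5 unigrams + 4 bigrams + 3 trigrams + 2 four-grams = 14
print(bag)
```

Nothing in the bag records which word is a noun or a verb; all that survives is which short sequences of words appeared.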

Creating a vector space representation:  Then the researchers created what you might visualize as a spreadsheet with a row for each opinion and a column for each word, doublet, triplet, etc. found in the opinions.  If you think of a 0 in a cell where the particular opinion did NOT have the N-gram and a 1 where it DID contain the N-gram, you have what statisticians call a matrix (a rectangle of zeros and ones as long as your number of texts (opinions) and as wide as your total number of N-grams).  It would likely be called a “sparse matrix” because so many of the N-grams would show up only once or twice; lots and lots of cells would have a 0, hence the sparse descriptor.  As they succinctly stated this step, “That results in a vector space representation where documents are represented as m-dimensional variables over a set of m N-grams.”  The term “vector space representation” describes a huge multi-dimensional space in which each axis is one N-gram and each opinion is a single point, positioned by the values in its row.  Many N-grams would appear in only one opinion, so nearly every point sits at zero on those axes; some N-grams would appear in several opinions; and a few words (like articles and prepositions) would appear in nearly every opinion.  (Incidentally, NLP researchers usually strip out those “stop words” since they add little information.)

For machine learning, the case opinions (the rows of the matrix, or points in hyperspace) are usually called “observations” and the columns (corresponding to the axes in hyperspace) are usually called “features.”  As the researchers wrote, “N-gram features have been shown to be effective in various supervised learning tasks,” which refers to the machine learning algorithm described later and its task.
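The spreadsheet picture can be sketched in a few lines of Python.  The three “opinions” below are made-up toy sentences, and I stop at 2-grams to keep the table small; both choices are assumptions for illustration only.

```python
def ngram_set(tokens, max_n=2):
    """Distinct 1-grams through max_n-grams in a list of words."""
    return {" ".join(tokens[i:i + n])
            for n in range(1, max_n + 1)
            for i in range(len(tokens) - n + 1)}


# Hypothetical toy texts, not taken from the study.
opinions = [
    "the court finds a violation",
    "the court finds no violation",
    "the applicant was detained",
]
doc_grams = [ngram_set(text.split()) for text in opinions]

# Columns (features): every distinct N-gram seen anywhere, in sorted order.
vocabulary = sorted(set().union(*doc_grams))

# Rows (observations): 1 if the opinion contains the N-gram, else 0.
matrix = [[1 if gram in grams else 0 for gram in vocabulary]
          for grams in doc_grams]

zeros = sum(row.count(0) for row in matrix)
print(len(matrix), "rows x", len(vocabulary), "columns")
print(zeros, "of", len(matrix) * len(vocabulary), "cells are zero")
```

Even with three short sentences, more than half the cells are zero.  With hundreds of real opinions and thousands of N-grams, the share of zeros would be far higher, which is why the matrix is called sparse.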

“For each set of cases in our data set, we compute the top-2000 most frequent N-grams where N ∈ {1, 2, 3, 4}.”  Hence, they went only as far as sequences of four words and kept only the 2,000 N-grams used most often in the opinions.  [The ∈ symbol, the trident facing right, is set notation for “is a member of,” so the expression says “The values of N come from the set of 1, 2, 3, and 4.”]  “Each feature represents the normalized frequency of a particular N-gram in a case or a section of a case.”  “Normalized” means they converted the number of times each N-gram showed up to a common scale, such as from 0 to 1, which typically lets long and short texts be compared fairly.
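Here is a rough Python sketch of that step: count N-grams across all the texts, keep the most frequent ones, and turn each text's counts into normalized frequencies.  The article does not spell out its normalization formula in the passages quoted here, so dividing each row by its own total is my assumption, as are the function names and the toy texts.

```python
from collections import Counter


def ngram_counts(text, max_n=4):
    """Count every 1-gram through max_n-gram in one text."""
    tokens = text.lower().split()
    return Counter(" ".join(tokens[i:i + n])
                   for n in range(1, max_n + 1)
                   for i in range(len(tokens) - n + 1))


def feature_matrix(texts, top_k=2000):
    """Return (kept N-grams, one row of normalized frequencies per text)."""
    per_doc = [ngram_counts(t) for t in texts]
    total = Counter()
    for counts in per_doc:
        total.update(counts)
    # Keep only the top_k most frequent N-grams across all texts.
    vocab = [gram for gram, _ in total.most_common(top_k)]
    rows = []
    for counts in per_doc:
        raw = [counts[g] for g in vocab]   # a Counter gives 0 for missing N-grams
        denom = sum(raw) or 1              # avoid dividing by zero
        # Assumed normalization: each row's values sum to 1.
        rows.append([c / denom for c in raw])
    return vocab, rows


vocab, rows = feature_matrix(
    ["the court finds a violation", "the court finds no violation"], top_k=5)
print(vocab)
print(rows)
```

With top_k left at 2000 and the real opinions as input, each row would be one case and each of the 2,000 columns one N-gram feature.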

“This can be considered as a feature matrix, C ∈ ℝ^(c×m), where c is the number of the cases and m = 2,000.”  The researchers extracted N-gram features for each of five sections in the opinions, since all opinions of the Court are written in the same format, as well as for the full text.  They refer to this huge array of normalized frequencies as their C feature matrix (case features from N-grams).

An explanation of research on machine learning software predicting court decisions

Legal managers should be intrigued by the fledgling capabilities of software to predict decisions of courts.  Assessing the likelihood and power of such a capability in the real world, however, calls for a manager to understand the tools that data scientists might deploy in the prediction process.  Fortunately, the ABA Journal cited and discussed a study published in October 2016 that offers us a clear example.

Four researchers, three computer scientists and a lawyer, used machine learning software to examine 584 cases before the European Court of Human Rights.  They found that the court’s judgments on whether the plaintiff’s rights had been violated were more highly correlated with the facts than with the legal arguments.  Given only the facts, their model predicted the court’s decisions with an average accuracy of 79 percent.

The article’s software-laden explanation of their method and tools makes for very heavy reading, so what follows attempts humbly and respectfully and certainly not infallibly to translate their work into plainer English.  The starting point is their abstract’s summary.  They “formulate a binary classification task where the input of our classifiers is the textual content extracted from a case and the target output is the actual judgment as to whether there has been a violation of an article of the convention of human rights. Textual information is represented using contiguous word sequences, i.e., N-grams, and topics.”  Additionally, “The text, N-grams and topics, trained Support Vector Machine (SVM) classifiers… [which] applied a linear kernel function that facilitates the interpretation of models in a straightforward manner.”

First, let’s get a handle on how the prediction model dealt with the text of the opinions (Natural Language Processing).  Then we will look at the software that classified the text to predict a violation or not (Support Vector Machine).
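As a preview of why a linear kernel “facilitates the interpretation of models,” here is a toy sketch in Python.  It is not the researchers' software: it trains a linear SVM by plain subgradient descent on hinge loss with an L2 penalty, it leaves out the usual intercept term, and the two N-gram features, the four cases and all settings are invented for illustration.  Real projects would more likely rely on an established library.

```python
def train_linear_svm(X, y, lam=0.01, lr=0.1, epochs=200):
    """Fit one weight per feature by subgradient descent (hinge loss + L2 penalty)."""
    w = [0.0] * len(X[0])
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            margin = yi * sum(wj * xj for wj, xj in zip(w, xi))
            # The L2 penalty shrinks every weight slightly at each step.
            w = [wj * (1 - lr * lam) for wj in w]
            if margin < 1:
                # Wrong side of, or inside, the margin: nudge toward the right side.
                w = [wj + lr * yi * xj for wj, xj in zip(w, xi)]
    return w


def predict(w, x):
    """+1 if the weighted sum of features is non-negative, else -1."""
    return 1 if sum(wj * xj for wj, xj in zip(w, x)) >= 0 else -1


# Hypothetical features: normalized frequency of two made-up N-grams per case.
features = ["was detained", "domestic remedies"]
X = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
y = [1, 1, -1, -1]   # toy labels: +1 = violation, -1 = no violation

w = train_linear_svm(X, y)
for name, weight in zip(features, w):
    print(f"{name}: {weight:+.2f}")
print([predict(w, x) for x in X])
```

The point of the sketch is the printout of weights: with a linear model, each N-gram gets one number, and its sign shows which outcome that N-gram pushes toward.  That direct reading is what makes a linear kernel easier to interpret than a black box.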

Two drawbacks of machine learning algorithms

Legal managers need to be alert to marketing hype, which is markedly present in the scrum of “AI” for lawyers.  Machine learning can fall deep into that and be extolled as a powerful tool able to leap tall concepts at a single bound.  Well, to some, not quite so super.

One drawback (bit of kryptonite?) of machine learning is its black box nature.  “It is difficult to explain specifically why the system arrives at a particular conclusion, and to correct it if it is erroneous.” [italics in original]  The quote comes from KMWorld, Oct. 2016 at S19, by Daniel Mayer of Expert System Enterprise.  A neural net, for instance, conceals its inner analytical processes quite effectively and offers users only crude parameters to tweak.

A second drawback that Mayer points out is the labor-intensiveness of machine learning.  Its application requires “large training sets that need to be built and maintained over time to ensure quality results.”  True enough, but the same holds for the data sets behind natural language processing (NLP), which is his favored tool.  It may be true, however, that text requires less cleaning and maintenance than numeric data.

Data analytics (NLP) to boost knowledge management efforts

Knowledge management for law firms and law departments has been pursued for decades, but the overall success given the investment seems debatable.  It has proven difficult to collect the unstructured text of lawyers in a system that others find useful enough to justify the cost.

Perhaps machine learning and natural language processing will replace the older paradigm, in which lawyers contribute their work product (often with key words extracted, sometimes with full-text searching), with a new one in which software sifts through everything that is saved on a firm’s or law department’s servers and enriches it with semantic networks or taxonomies that the software itself creates.  Natural language processing (NLP) can create the infrastructure of knowledge without lawyers taking any of their time.  Stated differently, in the words of the lead articles of a recent publication, data analytics is potentially a “powerful force for increasing knowledge management by amplifying existing data.”  If you can parse and organize and enrich material collected in the ordinary course of legal business, you can boost KM efforts enormously.

These dots connected for me as I read KMWorld, Oct. 2016, at S18 of its white paper on cognitive computing best practices.