A Support Vector Machine algorithm finds court opinion topics that predict the decision

We have explained in Part I, Part II and Part III how researchers took the text of certain European court opinions, found how often words and combinations of words appeared in them, and coalesced those words that appeared relatively often together into named, broader topics. Next, they wanted to see if software could predict from the topics how the court would decide.   They relied on a machine learning algorithm called Support Vector Machines (SVM).

“An SVM is a machine learning algorithm that has shown particularly good results in text classification, especially using small data sets. We employ a linear kernel since that allows us to identify important features that are indicative of each class by looking at the weight learned for each feature. We label all the violation cases as +1, while no violation is denoted by −1. Therefore, features assigned with positive weights are more indicative of violation, while features with negative weights are more indicative of no violation.”
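To make the quoted description concrete, here is a minimal sketch of a linear SVM in plain Python. It is not the researchers' code: the two "topic" features and their values are invented, and the training loop (subgradient descent on the hinge loss, with hypothetical settings for `epochs`, `lr` and `lam`) is just one simple way to fit such a model. It shows only how +1 and −1 labels lead to positive and negative weights.

```python
# Toy illustration only: feature values and topic names are invented,
# and the optimizer is a simple stand-in, not the study's method.

def train_linear_svm(X, y, epochs=500, lr=0.01, lam=0.01):
    """Fit weights w and offset b by subgradient descent on the hinge loss."""
    w = [0.0] * len(X[0])
    b = 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            margin = yi * (sum(wj * xj for wj, xj in zip(w, xi)) + b)
            # L2 regularization shrinks every weight slightly on each step
            w = [wj - lr * lam * wj for wj in w]
            if margin < 1:
                # The case is misclassified or too close to the boundary:
                # nudge the hyperplane so the case moves to its correct side
                w = [wj + lr * yi * xj for wj, xj in zip(w, xi)]
                b += lr * yi
    return w, b

def predict(w, b, x):
    """Return +1 (violation) or -1 (no violation) for one case."""
    score = sum(wj * xj for wj, xj in zip(w, x)) + b
    return 1 if score >= 0 else -1

# Columns: hypothetical topic A, hypothetical topic B
X = [[3.0, 0.0], [2.0, 1.0], [4.0, 1.0], [0.0, 3.0], [1.0, 2.0], [1.0, 4.0]]
y = [1, 1, 1, -1, -1, -1]  # +1 = violation, -1 = no violation

w, b = train_linear_svm(X, y)
predictions = [predict(w, b, x) for x in X]
print("weights:", w, "offset:", b)
print("predictions:", predictions)
```

In this toy data, topic A is prominent in the violation cases and topic B in the others, so the learned weight for A should come out positive and the weight for B negative, which is the kind of reading the researchers describe.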

Whew!  A linear kernel is the simplest kernel an SVM can use. It leaves the data in its original feature space, which is already a multi-dimensional space (a “hyperspace,” which can be thought of as having not just an x-axis and a y-axis, but also an a-axis, b-axis and so on out to as many axes as there are data features), and it compares two cases by taking the dot product of their feature values. Other kernels transform the data into an even more complex space, but the researchers chose the linear one because its weights are easy to interpret.  In that hyperspace, the SVM algorithm finds the key data points (called “support vectors”) that lie closest to the dividing boundary and uses them to place the boundary with the widest possible margin between violation cases and non-violation cases.  The result is what is known as a “hyperplane” because it separates the classes in a hyperspace as well as possible (as a line can do in two dimensions and a plane in three dimensions).

The weights that the algorithm learns define the hyperplane and so enable it to classify cases.  Specifically, the weights are the coordinates of a vector that is orthogonal to the hyperplane (“orthogonal” can be imagined as perpendicular, generalized to a hyperspace; in statistics the term is also used for variables that are uncorrelated).  The vector’s direction gives the predicted class: take the dot product [more linear algebra] of any point with the weight vector, add the constant offset that the algorithm also learns, and the sign of the result tells you which side of the hyperplane the point is on. If the result is positive, the point belongs to the positive class; if it is negative, it belongs to the negative class.  The absolute size of a weight relative to the other weights indicates how important that feature was for separating the two classes.
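The sign test and the importance ranking described above can be sketched in a few lines of Python. Everything here is an assumption made for illustration: the topic names, the weights, the offset `bias` and the sample case are invented and are not values from the study.

```python
# Hypothetical weights for three invented topics; not values from the study.
weights = {"topic_detention": 1.4, "topic_property": -0.9, "topic_procedure": 0.2}
bias = -0.1  # hypothetical constant offset of the hyperplane

def decision_score(features):
    """Dot product of the case's features with the weight vector, plus the offset."""
    return sum(weights[name] * value for name, value in features.items()) + bias

def classify(features):
    """Positive score: violation (+1). Otherwise: no violation (-1)."""
    return 1 if decision_score(features) > 0 else -1

# One invented case, described by how strongly each topic appears in it
case = {"topic_detention": 0.5, "topic_property": 0.1, "topic_procedure": 0.3}
score = decision_score(case)
label = classify(case)

# Rank features by the absolute size of their weights, largest first
ranked = sorted(weights, key=lambda name: abs(weights[name]), reverse=True)

print("score:", score, "label:", label)
print("most influential first:", ranked)
```

The score for this case is positive, so it falls on the violation side. Note that the ranking uses absolute values: a large negative weight matters as much as a large positive one; it simply points toward the other class.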

An explanation of research on machine learning software predicting court decisions

Legal managers should be intrigued by the fledgling capabilities of software to predict decisions of courts.  To assess the likelihood and power of such a capability in the real world, however, calls for a manager to understand the tools that data scientists might deploy in the prediction process. Fortunately, the ABA Journal cited and discussed a study published in October 2016 that offers us a clear example.

Four researchers, including three computer scientists and a lawyer, used machine learning software to examine 584 cases before the European Court of Human Rights. They found that the court’s judgments of the plaintiff’s rights having been violated or not were more highly correlated to facts than to legal arguments.  Given only the facts, their model predicted the court’s decisions with an average accuracy of 79 percent.

The article’s software-laden explanation of their method and tools makes for very heavy reading, so what follows attempts humbly and respectfully and certainly not infallibly to translate their work into plainer English.  The starting point is their abstract’s summary.  They “formulate a binary classification task where the input of our classifiers is the textual content extracted from a case and the target output is the actual judgment as to whether there has been a violation of an article of the convention of human rights. Textual information is represented using contiguous word sequences, i.e., N-grams, and topics.”  Additionally, “The text, N-grams and topics, trained Support Vector Machine (SVM) classifiers… [which] applied a linear kernel function that facilitates the interpretation of models in a straightforward manner.”
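For readers who want to see what “contiguous word sequences” means in practice, the following short Python sketch counts two-word N-grams (bigrams) in an invented sentence. The helper name `ngrams` and the sample text are assumptions for illustration; the researchers' own text processing is not reproduced here.

```python
from collections import Counter

def ngrams(tokens, n):
    """Return every run of n consecutive words as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

# Invented sample sentence, not text from an actual judgment
text = "the applicant complained that the applicant was detained"
tokens = text.lower().split()

bigrams = ngrams(tokens, 2)
bigram_counts = Counter(bigrams)

for gram, count in bigram_counts.most_common(3):
    print(" ".join(gram), count)
```

Counts like these, collected across all the cases, are the raw material that a classifier takes as its input features.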

First, let’s get a handle on how the prediction model dealt with the text of the opinions (Natural Language Processing), then we will look at the software that classified the text to predict a violation or not (Support Vector Machine).