The term “predictive analytics” compared to “machine learning”

The term “machine learning” may be most common, but the alternative “predictive analytics” has much going for it.  This is the term Eric Siegel has promoted extensively, including in his Predictive Analytics (John Wiley 2016).

Siegel places “machine learning” mostly in academia and research papers.  It is a computer science term that connotes statistics and matrix algebra.  His term has more overtones of usefulness to business as it stresses the value of algorithms that take in data and predict numbers or classifications or most-similar values.

Finding out and improving the accuracy of a machine-learning model for court opinions

Machine learning models need to be validated, which entails running the model on new data to see how well the classification or prediction works.  In the research explained in Part I, Part II, Part III, and Part IV, topics were identified and used to predict a European court’s decisions.

In the validation of their model, the researchers tested how accurate their model was based on being trained on a subset of the case opinions.  “The models are trained and tested by applying a stratified 10-fold cross validation, which uses a held-out 10% of the data at each stage to measure predictive performance.”

In less technical words, they split the cases into ten equal parts (“folds”), then ran their model ten times, each time training it on the other 90 percent of the cases and using the model to predict the ruling on the held-out 10 percent.  They averaged the results of the ten runs so that extremes would be less influential.
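For readers who like to see the mechanics, here is a minimal sketch of stratified 10-fold cross-validation in Python with scikit-learn.  The data below are random stand-ins, not the researchers’ cases.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.svm import SVC

# Stand-in data: 100 "cases" described by 20 topic features,
# labeled +1 (violation found) or -1 (no violation).
rng = np.random.default_rng(0)
X = rng.random((100, 20))
y = rng.choice([-1, 1], size=100)

# Stratified 10-fold: split the cases into ten folds that each keep the
# same proportion of violation to no-violation cases, hold out one fold
# at a time for testing, and average the ten accuracy scores.
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
scores = cross_val_score(SVC(kernel="linear"), X, y, cv=cv)
print(scores.mean())
```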

That’s not all.  “The linear SVM [the machine learning algorithm employed to classify court decisions into violation found or not found] has a regularisation parameter of the error term C, which is tuned using grid-search.”  We will forgo a full explanation of this dense sentence, but it has to do with finding (“tuning”) the best controls or constraints on the SVM’s application (“parameters”) by systematically testing a predefined grid of candidate values and keeping the one that performs best (“grid-search”).
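Again for the mechanically curious, a sketch of tuning C by grid search with scikit-learn; the candidate values of C are illustrative, not the ones the researchers searched over.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.svm import SVC

# Stand-in data again: 100 "cases" with 20 features, labels +1 / -1.
rng = np.random.default_rng(0)
X = rng.random((100, 20))
y = rng.choice([-1, 1], size=100)

# Try every candidate value of C in the grid and keep the one with the
# best cross-validated accuracy (the candidate values are illustrative).
search = GridSearchCV(
    SVC(kernel="linear"),
    param_grid={"C": [0.01, 0.1, 1, 10, 100]},
    cv=StratifiedKFold(n_splits=10, shuffle=True, random_state=0),
)
search.fit(X, y)
print(search.best_params_)   # the value of C that performed best
```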

The article continues: “The linear kernel of the SVM model can be used to examine which topics are most important for inferring whether an article of the Convention has been violated or not by looking at their weights w.”  In this research, the weights calculated by the algorithm are a measure of how much a topic influences the Court’s decision.  Tables in the article present the six topics for the most positive and negative SVM weights for the parts of the opinions.
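A rough sketch of reading off the weights w from a fitted linear SVM follows; the topic names and data are invented placeholders, not the paper’s topics.

```python
import numpy as np
from sklearn.svm import SVC

# Stand-in data: 100 "cases" scored on six "topics" with invented names.
rng = np.random.default_rng(0)
X = rng.random((100, 6))
y = rng.choice([-1, 1], size=100)
topics = ["topic_%d" % i for i in range(6)]   # placeholder labels

model = SVC(kernel="linear").fit(X, y)
weights = model.coef_[0]                      # one learned weight w per topic

# Large positive weights lean toward "violation" (+1),
# large negative weights toward "no violation" (-1).
order = np.argsort(weights)
print("most negative:", [topics[i] for i in order[:3]])
print("most positive:", [topics[i] for i in order[-3:]])
```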

Thus ends our exegesis of a wonderful piece of applied machine learning relevant to the legal industry.

We welcome any and all questions or comments, especially those that will make even clearer the power of this method of artificial intelligence research and its application in legal management.

A Support Vector Machine algorithm finds court opinion topics that predict the decision

We have explained in Part I, Part II and Part III how researchers took the text of certain European court opinions, found how often words and combinations of words appeared in them, and coalesced those words that appeared relatively often together into named, broader topics. Next, they wanted to see if software could predict from the topics how the court would decide.   They relied on a machine learning algorithm called Support Vector Machines (SVM).

“An SVM is a machine learning algorithm that has shown particularly good results in text classification, especially using small data sets. We employ a linear kernel since that allows us to identify important features that are indicative of each class by looking at the weight learned for each feature. We label all the violation cases as +1, while no violation is denoted by −1. Therefore, features assigned with positive weights are more indicative of violation, while features with negative weights are more indicative of no violation.”

Whew!  A kernel is a method from linear algebra that an SVM uses to relate the data points to one another; some kernels implicitly project the data into a more complex, higher-dimensional space.  A linear kernel is the simplest choice: it leaves the data in its original feature space, which is already a “hyperspace” (think of having not just an x-axis and a y-axis, but also an a-axis, b-axis and so on out to as many axes as there are data features).  In that hyperspace, the SVM algorithm finds key data points (called “support vectors”) that define the widest boundary between violation cases and non-violation cases.  The result is what is known as a “hyperplane” because it separates the classes in a hyperspace as well as possible (as a line can do in two dimensions and a plane in three dimensions).

The weights that the algorithm learns for each topic define the hyperplane and enable it to classify cases.  The weights represent the hyperplane by giving the coordinates of a vector that is orthogonal to it (“orthogonal” can be imagined as perpendicular to the hyperplane; it also means there is no correlation between orthogonal vectors).  The vector’s direction gives the predicted class: take the dot product [more matrix algebra] of any point with the vector and you can tell on which side of the hyperplane it falls.  If the dot product is positive, the case belongs to the positive class; if it is negative, it belongs to the negative class.  You could say that the absolute size of a weight coefficient relative to the others gives an indication of how important that feature was for the separation achieved by the hyperplane.
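To make the geometry concrete, here is a toy sketch in two dimensions (made-up data, not the researchers’): fit a linear-kernel SVM, look at its support vectors and weight vector, and confirm that the sign of the dot product matches the predicted class.

```python
import numpy as np
from sklearn.svm import SVC

# Two made-up clusters in just two dimensions so the picture is easy to hold:
# class +1 (violation) around (2, 2), class -1 (no violation) around (-2, -2).
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(2.0, 1.0, size=(20, 2)),
               rng.normal(-2.0, 1.0, size=(20, 2))])
y = np.array([1] * 20 + [-1] * 20)

model = SVC(kernel="linear").fit(X, y)

w = model.coef_[0]               # the weight vector orthogonal to the hyperplane
b = model.intercept_[0]          # offset of the hyperplane
print(model.support_vectors_)    # the boundary-defining points

# The sign of the dot product (plus the offset) gives the predicted class.
new_point = np.array([1.5, 2.5])
print(np.sign(new_point @ w + b))    # +1 or -1 ...
print(model.predict([new_point]))    # ... agrees with the model's prediction
```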

IBM’s Watson takes on financial regulations

Machine learning has the potential to invade and disrupt the current market for lawyers in some of the most complex, high-end legal practices.  In any practice with large numbers of court opinions, briefs, law review articles, white papers, laws and regulations, and other textual material, IBM’s Watson looms as a tool to absorb it all, recognize patterns, and augment lawyers’ reasoning.  Remember, however, that Watson is a glutton for vast amounts of digitized documents.  Without that diet, the formidable Watson may wither like the Wicked Witch of the West when sprinkled with water.

Augmentation has a partnering ring, a positive valence, but the dark side lurks.  Associate-heavy research memos and the hours scores of partners bill while “coming up to speed” in an area of law will evaporate when software does the heavy lifting and prep work.  The experienced judgment of intelligent lawyers will forever be in demand, but software augmentation will limit leverage and slash the hours that a firm tackling an area of law new to it would otherwise have billed to clients.

A chilling glimpse of this future appears in the Economist, Oct. 22, 2016 at 64.  The “cognitive artificial intelligence platform [Watson] has begun categorizing the various [financial industry] regulations and matching them with the appropriate enforcement mechanisms.”  Experts in the regulatory web that ensnares and confounds financial firms vet the conclusions Watson derives from the mass of material available to it; “A dozen rules are now being assimilated weekly.”  The target is the estimated $270 billion or more spent each year on regulatory compliance – “of which $20 billion is spent simply on understanding the requirements.”  Who knows how much of that flows to law firms or how much the flow will slow once Watson has matured?

To the extent Watson and look-alikes can make sense out of the tangle of financial regulations, lawyers will see less demand for their tools and expertise.  It is unlikely that the elevation of legal work to more sophisticated levels, aided by software organization and analysis, will make up for the loss of billable hours at the lower end.

An example of NLP and machine learning to find “topics” in court opinions

Finding Topics:  In this, the third of a series begun here and continued here, we continue discussing a recently published research paper.  The researchers created “topics for each [opinion] by clustering together N-grams that are semantically similar by leveraging the distributional hypothesis suggesting that similar words appear in similar contexts.”  In other words, they assumed that related words tend to appear together and as a group suggest a “topic”.

They used the C feature matrix, “which is a distributional representation of the N-grams given the case as the context; each column vector of the matrix represents an N-gram.”

OK, so how did the software identify or create topics?  “Using this vector representation of words, we compute N-gram similarity using the cosine metric and create an N-gram by N-gram similarity matrix.”   Cosine is a common metric in text analysis to find which vectors (sets of numbers) are more like which others.  It measures the angle between two vectors, where a vector in this research is the set of values recording how one N-gram is distributed across the cases.  Roughly speaking, if you think of two such sets of values graphed on a sheet of paper, there is an angle of some number of degrees (thus also a cosine from trigonometry) between them: the smaller the angle, the more the two N-grams overlap, the more similar they are.  A similarity matrix shows the degree of “sameness” of each word (or word combo) to every other.
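A small sketch of the cosine step, using a tiny invented matrix of N-gram occurrence counts rather than the researchers’ distributional representation:

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Invented toy data: rows are N-grams, columns are cases,
# entries are how often each N-gram appears in each case.
ngram_vectors = np.array([
    [3, 0, 2, 1],   # "public hearing"   (invented)
    [2, 0, 3, 1],   # "fair trial"       (invented)
    [0, 4, 0, 2],   # "prison conditions" (invented)
])

# Cosine similarity of every N-gram with every other N-gram:
# values near 1.0 mean the vectors point the same way (a small angle),
# values near 0 mean the N-grams rarely appear in the same cases.
similarity = cosine_similarity(ngram_vectors)
print(np.round(similarity, 2))
```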

The researchers went further once they found with the cosine similarity calculation which N-grams were more closely associated with each other.  “We finally apply spectral clustering—which performs graph partitioning on the similarity matrix—to obtain 30 clusters of N-grams.”   Spectral clustering techniques make use of the spectrum (eigenvalues, a linear algebra construct beyond simple translation!) of the similarity matrix of the data to perform dimensionality reduction before clustering in fewer dimensions.   In other words, once you figure out that certain words are associated with each other, spectral clustering brings together groups of related words into “topics.”

A reduced number of dimensions allows the software to perform more efficiently and helps humans decipher the output.  [Note that the researchers created topics as hard clusters according to which an N-gram can only be part of a single topic.  Machine learning allows more flexibility but this was a plausible design choice.]
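Here is a minimal sketch of that clustering step with scikit-learn, assuming a precomputed N-gram-by-N-gram similarity matrix; the matrix is invented and only three clusters are requested instead of the paper’s 30.

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.cluster import SpectralClustering

# Invented occurrence counts for six N-grams across four cases.
rng = np.random.default_rng(2)
ngram_vectors = rng.integers(1, 5, size=(6, 4))

similarity = cosine_similarity(ngram_vectors)

# affinity="precomputed" tells SpectralClustering to treat the similarity
# matrix as the graph to partition; each N-gram lands in exactly one
# cluster, matching the "hard clusters" the researchers describe.
clustering = SpectralClustering(n_clusters=3, affinity="precomputed",
                                random_state=0)
labels = clustering.fit_predict(similarity)
print(labels)   # cluster number ("topic") assigned to each N-gram
```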

The researchers then gave over-arching names to the topics based on which words were most common to the cluster, which they refer to as a “representation.”  “A representation of a cluster is derived by looking at the most frequent N-grams it contains. The main advantages of using topics (sets of N-grams) instead of single N-grams is that it reduces the dimensionality of the feature space, which is essential for feature selection, [as] it limits over-fitting to training data, and also provides a more concise semantic representation….  Topics form a more abstract way of representing the information contained in each case and capture a more general gist of the cases.”   Stated differently by the authors, “Topics identify in a sufficiently robust manner patterns of fact scenarios that correspond to well-established trends in the Court’s case law.”  Over-fitting is a bugaboo of machine learning, but we will leave that to a later discussion.
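As a toy illustration of deriving a cluster’s representation from its most frequent members (the N-grams and counts are invented):

```python
from collections import Counter

# Invented: N-grams in one cluster and how often each appears across the cases.
cluster_counts = Counter({
    "prison": 40, "detention": 35, "prison conditions": 22,
    "cell": 18, "overcrowding": 9,
})

# The cluster's "representation" is its most frequent N-grams, which a
# human can then read and name (say, "detention conditions").
print(cluster_counts.most_common(3))
```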

The label “cognitive computing”

“Cognitive computing” may be just another marketing buzzword, but legal managers will encounter it.  According to KMWorld, Oct. 2016 and its white paper, “cognitive computing is all about machine learning, with some artificial intelligence and natural language processing.”  You can learn more from the Cognitive Computing Consortium, although that group does not yet support a legal sub-group.

In that regard, however, a LinkedIn user group called Artificial Intelligence for Legal Professionals has a couple of hundred members.

Predicting a Court’s decision: the natural language processing steps

Earlier we introduced a machine learning study that predicted decisions of a European court regarding human rights violations.  The data preparation started with two kinds of text analysis.  My translation of their article borrows heavily from their writing.

Finding frequent words or combinations of words — N-gram features: “The Bag-of-Words (BOW) model is a popular semantic representation of text used in NLP.  In a BOW model, … text is represented as the bag … of its words (unigrams) or N-grams without taking into account grammar, syntax and word order.”

In short, as is common when using Natural Language Processing [NLP], they decomposed the text of each decision into a “bag” of single words and small multi-word combinations.  Doing so ignores the actual meaning, order, and part of speech of the words in the original.  A “unigram” is a single word, whereas an N-gram is a sequence of N consecutive words (two, three, four, or more).  The researchers went as far as 4-grams.  So, you could go through a U.S. Supreme Court decision and find each word (unigram), each doublet of words, each triplet of words, and each four-in-a-row combination of words (2-grams, 3-grams, and 4-grams) and create a bag of words without noun designations, verb designations, etc.
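A plain-Python sketch of pulling 1- to 4-grams out of a made-up sentence shows how mechanical the step is:

```python
def ngrams(tokens, n):
    """All runs of n consecutive tokens, joined back into phrases."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "the court finds a violation of article three".split()

# The "bag": every 1-, 2-, 3-, and 4-gram, with grammar and word roles ignored.
bag = [g for n in (1, 2, 3, 4) for g in ngrams(tokens, n)]
print(bag[:6])
```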

Creating a vector space representation:  Then the researchers created what you might visualize as a row in a spreadsheet for each opinion and a column for each word, doublet, triplet, etc. in it.  If you think of a 0 in a cell where the particular opinion did NOT have the N-gram and a 1 where it DID contain the N-gram, you have what statisticians call a matrix (a rectangle of zeros and ones as long as your number of texts (opinions) and as wide as your total number of N-grams).   It would likely be called a “sparse matrix” because any one opinion contains only a small fraction of all the N-grams; lots and lots of cells would hold a 0, hence the “sparse” descriptor.  As they succinctly stated this step, “That results in a vector space representation where documents are represented as m-dimensional variables over a set of m N-grams.”   The term “vector space representation” describes a huge multi-dimensional space where each axis corresponds to one N-gram and each opinion is a point positioned according to the N-grams it contains.   Many axes would correspond to a word or phrase used in only one opinion; some would correspond to words used in several opinions; a few (like articles and prepositions) would correspond to words used in nearly every opinion (incidentally, NLP researchers usually strip out those “stop words” since they add little information).
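Here is a hedged sketch of building such a matrix with scikit-learn’s CountVectorizer; the three one-line “opinions” are invented stand-ins.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Invented stand-ins for opinions.
opinions = [
    "the applicant alleged a violation of article three",
    "the court finds no violation of the convention",
    "detention conditions violated article three",
]

# One row per opinion, one column per N-gram (here 1- to 4-grams).
vectorizer = CountVectorizer(ngram_range=(1, 4))
matrix = vectorizer.fit_transform(opinions)

print(matrix.shape)   # (number of opinions, number of distinct N-grams)
print(matrix.nnz)     # relatively few non-zero cells: a sparse matrix
```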

For machine learning, the case opinions (the points in hyperspace) are usually called “observations” and the columns (the axes of the hyperspace) are usually called “features.”  As the researchers wrote, “N-gram features have been shown to be effective in various supervised learning tasks,” which refers to the machine learning algorithm described later and its task.

“For each set of cases in our data set, we compute the top-2000 most frequent N-grams where N ∈ {1, 2, 3, 4}.”   Hence, they went only as far as combinations of four words and kept only the 2,000 N-grams used most often in the opinions.   [The symbol ∈ is set notation for saying “The values of N come from the set {1, 2, 3, 4}.”]  “Each feature represents the normalized frequency of a particular N-gram in a case or a section of a case.”  “Normalized” means they converted the number of times each N-gram showed up to a common scale, such as from 0 to 1.

“This can be considered as a feature matrix, C ∈ ℝ^(c×m), where c is the number of the cases and m = 2,000.”   The researchers extracted N-gram features for each of five sections in the opinions, since all opinions of the Court are written in the same format, as well as for the full text.  They refer to this huge array of normalized N-gram frequencies as their C feature matrix (case features from N-grams).
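A sketch of one way to produce a C-like matrix, keeping the 2,000 most frequent N-grams and scaling each case’s counts; the researchers’ exact normalization scheme may differ, and the “opinions” below are invented.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.preprocessing import normalize

# Invented stand-ins for opinions.
opinions = [
    "the applicant alleged a violation of article three",
    "the court finds no violation of the convention",
    "detention conditions violated article three of the convention",
]

# Keep only the 2,000 most frequent 1- to 4-grams across the corpus.
vectorizer = CountVectorizer(ngram_range=(1, 4), max_features=2000)
counts = vectorizer.fit_transform(opinions)

# One possible normalization: scale each case's row of counts to unit length.
C = normalize(counts)

print(C.shape)   # (number of cases c, up to m = 2000 N-grams)
```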

An explanation of research on machine learning software predicting court decisions

Legal managers should be intrigued by the fledgling capabilities of software to predict decisions of courts.  Assessing the likelihood and power of such a capability in the real world, however, calls for a manager to understand the tools that data scientists might deploy in the prediction process. Fortunately, the ABA Journal cited and discussed a study published in October 2016 that offers us a clear example.

Four researchers, including three computer scientists and a lawyer, used machine learning software to examine 584 cases before the European Court of Human Rights. They found that the court’s judgments as to whether the plaintiff’s rights had been violated correlated more highly with the facts of a case than with legal arguments.  Given only the facts, their model predicted the court’s decisions with an average accuracy of 79 percent.

The article’s software-laden explanation of their method and tools makes for very heavy reading, so what follows attempts humbly and respectfully and certainly not infallibly to translate their work into plainer English.  The starting point is their abstract’s summary.  They “formulate a binary classification task where the input of our classifiers is the textual content extracted from a case and the target output is the actual judgment as to whether there has been a violation of an article of the convention of human rights. Textual information is represented using contiguous word sequences, i.e., N-grams, and topics.”  Additionally, “The text, N-grams and topics, trained Support Vector Machine (SVM) classifiers… [which] applied a linear kernel function that facilitates the interpretation of models in a straightforward manner.”
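As a rough preview of how those pieces can fit together in code (a sketch with invented stand-in texts, not the researchers’ actual pipeline):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

# Invented stand-ins: short "opinions" and their outcomes
# (+1 violation found, -1 no violation). Real opinions are far longer.
violation = "the court finds a violation of article three prison conditions"
no_violation = "the court finds no violation the complaint is inadmissible"
opinions = [violation] * 10 + [no_violation] * 10
labels = [1] * 10 + [-1] * 10

# Text in, prediction out: N-gram features (1- to 4-grams) feeding a
# linear-kernel SVM, scored by cross-validation (3 folds here rather than
# the paper's stratified 10-fold, because the toy corpus is so small).
pipeline = make_pipeline(CountVectorizer(ngram_range=(1, 4)),
                         SVC(kernel="linear"))
scores = cross_val_score(pipeline, opinions, labels, cv=3)
print(scores.mean())
```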

First, let’s get a handle on how the prediction model dealt with the text of the opinions (Natural Language Processing), then we will look at the software that classified the text to predict a violation or not (Support Vector Machine).

Watson requires huge quantities of text to formulate learnings

In February 2016, the accounting giant KPMG announced that it had been working with IBM Watson, one of the most advanced artificial intelligence technology platforms available.  An article describes Watson briefly:  “It works by using natural language processing and machine learning to reveal insights and information from huge quantities of unstructured data.” [emphasis added]   Notably, over a period of a few years Watson has digested hundreds of thousands of medical research papers on cancer and thereafter shown itself capable of matching the diagnoses of experts and suggesting new therapies.  According to the TV show 60 Minutes, eight thousand cancer papers are published every day!

A handful of law firms have announced that they are using Watson’s algorithms.  One firm (Baker & Hostetler) has, it sounds to me, directed Watson to parse thousands of cases, law review articles, and briefs in the bankruptcy area.  Whether that corpus of documents provides enough grist for Watson’s mill, since it is an order of magnitude or two smaller than the oncology set, remains to be seen.

My point is that the vast pools of text necessary for Watson to hone its skills to a proficient level may be rare in the legal industry.  And, related to that point, experienced lawyers may need to devote hours and hours to coding some of the textual material so that Watson can pick out which patterns are most likely to be present in the coded results.

Two drawbacks of machine learning algorithms

Legal managers need to be alert to marketing hype, which is markedly present in the scrum of “AI” for lawyers.  Machine learning can fall deep into that hype, extolled as a powerful tool able to leap tall concepts in a single bound.  Well, to some, not quite so super.

One drawback (bit of kryptonite?) of machine learning is its black box nature.   “It is difficult to explain specifically why the system arrives at a particular conclusion, and to correct it if it is erroneous.” [italics in original]  The quote comes from KMWorld, Oct. 2016 at S19, by Daniel Mayer of Expert System Enterprise.   A neural net, for instance, conceals its inner analytical processes quite effectively and offers users crude parameters to tweak.

A second drawback that Mayer points out is the labor-intensiveness of machine learning.  Its application requires “large training sets that need to be built and maintained over time to ensure quality results.”  True enough, but so do the data sets that natural language processing (NLP), his favored tool, works on.  It may be true, however, that text requires less cleaning and maintenance than numeric data.