An example of NLP and machine learning to find “topics” in court opinions

Finding Topics:  In this, the third of a series begun here and continued here, we discussed a recently published research paper.   The researchers created “topics for each [opinion] by clustering together N-grams that are semantically similar by leveraging the distributional hypothesis suggesting that similar words appear in similar contexts.”  In other words, they assumed that related words tend to appear together and, as a group, suggest a “topic”.

They used the C feature matrix, “which is a distributional representation of the N-grams given the case as the context; each column vector of the matrix represents an N-gram.”

OK, so how did the software identify or create topics?  “Using this vector representation of words, we compute N-gram similarity using the cosine metric and create an N-gram by N-gram similarity matrix.”   Cosine similarity is a common metric in text analysis for finding which vectors (sets of numbers) are most like which others.  It measures the angle between two vectors; in this research, each vector is one N-gram’s column from the C matrix, that is, the pattern of values the N-gram takes across all the cases.  Roughly speaking, if you graph two such vectors on a sheet of paper, there is an angle of some number of degrees (and thus a cosine, from trigonometry) between them.  That angle measures how much the two overlap, how similar they are.  A similarity matrix records the degree of “sameness” of each word (or word combo) to every other.
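To make that step concrete, here is a minimal sketch in Python with scikit-learn.  The tiny vectors, and the example N-grams named in the comments, are invented for illustration; in the study the vectors come from the actual C feature matrix.

```python
# Hedged sketch: cosine similarity among N-gram vectors (toy data).
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical distributional vectors: one row per N-gram, one column per case.
# (In the paper these would be the columns of the C feature matrix, so that
# each N-gram is described by the cases it appears in.)
ngram_vectors = np.array([
    [1, 0, 1, 1, 0],   # e.g., "detention"
    [1, 0, 1, 0, 0],   # e.g., "pre-trial detention"
    [0, 1, 0, 0, 1],   # e.g., "freedom of expression"
])

# N-gram by N-gram similarity matrix; values near 1 mean two N-grams tend to
# appear in the same cases (a small angle between their vectors).
similarity = cosine_similarity(ngram_vectors)
print(np.round(similarity, 2))
```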

Once the cosine calculation told them which N-grams were most closely associated with each other, the researchers went further.  “We finally apply spectral clustering—which performs graph partitioning on the similarity matrix—to obtain 30 clusters of N-grams.”   Spectral clustering techniques use the spectrum (eigenvalues, a linear algebra construct beyond simple translation!) of the similarity matrix to perform dimensionality reduction before clustering in the smaller number of dimensions.   In other words, once you figure out that certain words are associated with each other, spectral clustering gathers groups of related words into “topics.”
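Here is a hedged sketch of that clustering step in Python.  The random matrix stands in for the real N-gram-by-N-gram cosine similarity matrix, and the variable names are my own; only the setting of 30 clusters comes from the paper.

```python
# Hedged sketch: spectral clustering on a precomputed similarity matrix.
import numpy as np
from sklearn.cluster import SpectralClustering

rng = np.random.default_rng(0)
sim = rng.random((100, 100))           # stand-in for 100 N-grams' similarities
similarity = (sim + sim.T) / 2         # make it symmetric, values in [0, 1]
np.fill_diagonal(similarity, 1.0)      # each N-gram is fully similar to itself

# affinity="precomputed" tells scikit-learn the matrix already holds
# similarities, so it only needs to partition the resulting graph.
clusterer = SpectralClustering(n_clusters=30, affinity="precomputed",
                               random_state=0)
topic_labels = clusterer.fit_predict(similarity)
print(topic_labels[:10])               # topic id assigned to the first 10 N-grams
```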

A reduced number of dimensions lets the software run more efficiently and helps humans decipher the output.  [Note that the researchers created topics as hard clusters, meaning an N-gram can belong to only one topic.  Machine learning allows softer assignments, but this was a plausible design choice.]

The researchers then gave each topic an over-arching name based on the N-grams most common in its cluster, which they refer to as a “representation.”  “A representation of a cluster is derived by looking at the most frequent N-grams it contains. The main advantages of using topics (sets of N-grams) instead of single N-grams is that it reduces the dimensionality of the feature space, which is essential for feature selection, [as] it limits over-fitting to training data, and also provides a more concise semantic representation….  Topics form a more abstract way of representing the information contained in each case and capture a more general gist of the cases.”   Stated differently by the authors, “Topics identify in a sufficiently robust manner patterns of fact scenarios that correspond to well-established trends in the Court’s case law.”  Over-fitting is a bugaboo of machine learning, but we will leave that to a later discussion.
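As a small, hypothetical illustration of deriving a “representation,” the snippet below lists the most frequent N-grams in each cluster.  The N-grams, counts, and cluster labels are made up; in practice they would come from the corpus counts and the spectral clustering sketched above.

```python
# Hedged sketch: characterize each topic by its most frequent member N-grams.
from collections import defaultdict

ngrams = ["detention", "pre-trial detention", "freedom of expression",
          "ill treatment", "domestic court"]
counts = [120, 45, 80, 60, 200]        # how often each N-gram appears (invented)
topic_labels = [0, 0, 1, 1, 0]         # cluster id from spectral clustering (invented)

by_topic = defaultdict(list)
for ngram, count, label in zip(ngrams, counts, topic_labels):
    by_topic[label].append((count, ngram))

for label, members in sorted(by_topic.items()):
    top = [ngram for _, ngram in sorted(members, reverse=True)[:3]]
    print(f"Topic {label}: {', '.join(top)}")
```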

The label “cognitive computing”

“Cognitive computing” may be just another marketing buzzword, but legal managers will encounter it.  According to KMWorld, Oct. 2016, and its white paper, “cognitive computing is all about machine learning, with some artificial intelligence and natural language processing.”  You can learn more from the Cognitive Computing Consortium, although that group does not yet support a legal sub-group.

In that regard, however, a LinkedIn user group called Artificial Intelligence for Legal Professionals has a couple of hundred members.

Common problems in pre-processing data so software can work with it

Based on structured interviews with a small convenience sample of seven visual analysts, the authors of an academic paper [Source: Victoria Lemieux et al., Meeting Big Data challenges with visual analytics, Records Mgt. J. 24(2), July 2014 at 127] identified these themes among the difficulties of pre-processing data.  I have added a gloss to each one.

• Unavailability of data [you can’t locate data or it was never compiled in the first place]

• Fragmentation of data  [locating relevant data distributed across multiple databases, database tables and/or files is  very time-consuming]

• Data quality [whoever gathered the information made mistakes or recorded something that has to be converted into a zero or a missing value indicator, or included extra spaces, for example]

– Missing values [it is not clear whether there were in fact no expenses on the matter or whether the person who recorded the data did not know the amount of the expenses]

– Data format [dates are notorious for being May 16, 1962 in one record, 05/16/62 in another, 05/16/1962 in a third and all kinds of other variations that require being standardized]

– Need for standardization [for example, some numbers have decimals, some have leading zeros, some are left justified with spaces at the right, some have commas, and so on]

• Data shaping [for example, in the R programming language the most common package to create plots is called “ggplot2”.  When you use it, the data ideally is in what is called “long form,” so you might need to shape the data before you plot it; a short code sketch after this list illustrates this and a few of the other chores above]

– For technical compatibility [perhaps this means that data stored as comma-separated values (.csv), for example, might need to be converted into an Access database structure for Access to work with it]

– For better analysis [it may be that the way the data was read into memory stored a variable as character strings whereas the data scientist wants that variable to be a factor that has a defined number of levels]

• Disconnect between creation/management and use [the general point could be that someone in the law firm tracks something, but it is not useful beyond a narrow purpose]

• Record-keeping [this may refer to the important step of keeping a record of each step in the data collection and cleaning, i.e., reproducibility of research]

– General expression of need for record-keeping [perhaps a firm-wide or law department-wide statement or policy that data has value and we need to shepherd it]

– Version control [keeping track of successive iterations of the software that works on the data]
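Here, as promised, is a short sketch (in Python with pandas, rather than R) of a few of those chores: standardizing dates, converting messy numbers and flagging missing values, and reshaping to long form before plotting.  The column names and values are invented.

```python
# Hedged sketch: a few common pre-processing chores on invented matter data.
import pandas as pd

raw = pd.DataFrame({
    "matter":   ["M-1",          "M-2",          "M-3"],
    "opened":   ["May 16, 1962", "16 May 1962",  "1962-05-16"],
    "fees":     ["1,200.50",     "",             "950"],
    "expenses": [300,            None,           125],
})

# Standardize the mixed date formats into one datetime type.
# (Two-digit years like 05/16/62 need extra care and are avoided here.)
raw["opened"] = raw["opened"].apply(pd.to_datetime)

# Strip commas and convert to numbers; the empty string becomes NaN (missing),
# which is different from a true zero and should be handled deliberately.
raw["fees"] = pd.to_numeric(raw["fees"].str.replace(",", ""), errors="coerce")

# Reshape to "long form": one row per matter per measure, ready for plotting.
long_form = raw.melt(id_vars=["matter", "opened"],
                     value_vars=["fees", "expenses"],
                     var_name="measure", value_name="amount")
print(long_form)
```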

Predicting a Court’s decision: the natural language processing steps

Earlier we introduced a machine learning study that predicted decisions of a European court regarding human rights violations.  The data preparation started with two kinds of text analysis.  My translation of their article borrows heavily from their writing.

Finding frequent words or combinations of words — N-gram features: “The Bag-of-Words (BOW) model is a popular semantic representation of text used in NLP.  In a BOW model, … text is represented as the bag … of its words (unigrams) or N-grams without taking into account grammar, syntax and word order.”

In short, as is common in Natural Language Processing [NLP], they decomposed the text of each decision into a “bag” of single words and short multi-word runs.  Doing so ignores the actual meaning, order, and parts of speech of the words in the original.  A “unigram” is a single word, whereas an N-gram is any run of N consecutive words: two, three, four, or more.  The researchers went as far as 4-grams.  So, you could go through a U.S. Supreme Court decision and find each word (unigram), each doublet of words, each triplet, and each four-in-a-row combination of words (2-grams, 3-grams, and 4-grams) and create a bag of words without noun designations, verb designations, and so on.
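A minimal sketch of that decomposition, in Python, appears below.  The sentence is invented; the real input would be the full text of an opinion, and the researchers’ own tokenization surely differed in its details.

```python
# Hedged sketch: turn a sentence into a "bag" of 1- to 4-grams.
from collections import Counter

def ngrams(tokens, n):
    """Return every run of n consecutive tokens, joined into one string."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

text = "the court finds a violation of article three"
tokens = text.lower().split()

bag = Counter()
for n in range(1, 5):                  # unigrams through 4-grams
    bag.update(ngrams(tokens, n))

print(bag.most_common(5))              # the most frequent N-grams in this "bag"
```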

Creating a vector space representation:  Then the researchers created what you might visualize as a row in a spreadsheet for each opinion and a column for each word, doublet, triplet, etc. in it.  If you think of a 0 in a cell where the particular opinion did NOT contain the N-gram and a 1 where it DID, you have what statisticians call a matrix (a rectangle of zeros and ones with as many rows as you have texts (opinions) and as many columns as your total number of N-grams).   It would likely be called a “sparse matrix” because so many of the N-grams would show up in only one or two opinions; lots and lots of cells would hold a 0, hence the “sparse” descriptor.  As they succinctly stated this step, “That results in a vector space representation where documents are represented as m-dimensional variables over a set of m N-grams.”   The term “vector space representation” describes a huge multi-dimensional space with one axis for each N-gram; each opinion is a single point in that space, positioned according to which N-grams it contains.   Most axes correspond to an N-gram that appears in only one or two opinions, so most of an opinion’s coordinates are zero; a few words (like articles and prepositions) appear in nearly every opinion (incidentally, NLP researchers usually strip out those “stop words” since they add little information).

For machine learning, the case opinions (the rows, or points in that hyperspace) are usually called “observations” and the N-gram columns (the axes, or dimensions) are usually called “features.”  As the researchers wrote, “N-gram features have been shown to be effective in various supervised learning tasks,” which refers to the machine learning algorithm described later and its task.

“For each set of cases in our data set, we compute the top-2000 most frequent N-grams where N ∈ {1, 2, 3, 4}.”   Hence, they went only as far as combinations of four unigrams and kept only the 2000 N-grams used most often in the opinions.   [The trident facing right is set notation for saying “The values of N come from the set of 1, 2, 3, and 4.”]  “Each feature represents the normalized frequency of a particular N-gram in a case or a section of a case.”  “Normalized” means they converted the number of times each N-gram showed up to a scale, such as from 0 to 1.

“This can be considered as a feature matrix, C ∈ ℝc×m, where c is the number of the cases and m = 2, 000.”   The researchers extracted N-gram features for each of five sections in the opinions, since all opinions of the Court are written in the same format, as well as for the full text.  They refer to this huge array of normalized frequencies as their C feature matrix (case features from N-grams).
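The sketch below shows how such a feature matrix might be built with scikit-learn.  The three toy “opinions,” the choice of CountVectorizer, and the L1 row normalization are my assumptions; the paper specifies only the 1- to 4-gram range, the top-2,000 cutoff, and that frequencies were normalized.

```python
# Hedged sketch: a cases-by-N-grams feature matrix with normalized frequencies.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.preprocessing import normalize

opinions = [
    "the applicant complained about the length of pre-trial detention",
    "the court finds a violation of article three of the convention",
    "the applicant alleged ill treatment during police custody",
]

vectorizer = CountVectorizer(ngram_range=(1, 4),   # unigrams through 4-grams
                             max_features=2000)    # keep the most frequent N-grams
counts = vectorizer.fit_transform(opinions)        # sparse matrix: cases x N-grams

C = normalize(counts, norm="l1")   # scale each row so its frequencies sum to 1
print(C.shape)                     # (number of cases, number of N-grams kept)
```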

An explanation of research on machine learning software predicting court decisions

Legal managers should be intrigued by the fledgling capabilities of software to predict decisions of courts.  Assessing the likelihood and power of such a capability in the real world, however, calls for a manager to understand the tools that data scientists might deploy in the prediction process.  Fortunately, the ABA Journal cited and discussed a study published in October 2016 that offers us a clear example.

Four researchers, three computer scientists and a lawyer, used machine learning software to examine 584 cases before the European Court of Human Rights.  They found that the court’s judgments, whether or not the plaintiff’s rights had been violated, correlated more strongly with the facts of a case than with the legal arguments.  Given only the facts, their model predicted the court’s decisions with an average accuracy of 79 percent.

The article’s software-laden explanation of their method and tools makes for very heavy reading, so what follows attempts humbly and respectfully and certainly not infallibly to translate their work into plainer English.  The starting point is their abstract’s summary.  They “formulate a binary classification task where the input of our classifiers is the textual content extracted from a case and the target output is the actual judgment as to whether there has been a violation of an article of the convention of human rights. Textual information is represented using contiguous word sequences, i.e., N-grams, and topics.”  Additionally, “The text, N-grams and topics, trained Support Vector Machine (SVM) classifiers… [which] applied a linear kernel function that facilitates the interpretation of models in a straightforward manner.”
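To preview the classification step, here is a hedged sketch of a linear-kernel SVM in Python with scikit-learn.  The tiny feature matrix and labels are placeholders for the real N-gram and topic features and the actual judgments.

```python
# Hedged sketch: a linear-kernel SVM predicting violation (1) or not (0).
import numpy as np
from sklearn.svm import SVC

X = np.array([[0.20, 0.00, 0.50],      # each row: one case's feature values
              [0.00, 0.40, 0.10],
              [0.30, 0.00, 0.60],
              [0.00, 0.50, 0.00]])
y = np.array([1, 0, 1, 0])             # 1 = violation found, 0 = no violation

clf = SVC(kernel="linear")             # the linear kernel keeps weights interpretable
clf.fit(X, y)

print(clf.predict([[0.25, 0.00, 0.55]]))   # predicted outcome for a new case
print(clf.coef_)                           # one weight per feature
```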

First, let’s get a handle on how the prediction model dealt with the text of the opinions (Natural Language Processing), then we will look at the software that classified the text to predict a violation or not (Support Vector Machine).

Few legal-industry surveys sample from a representative population

To respect and rely on the findings of a legal industry survey, legal managers should be able to find in the survey report the number of people who answered the survey (the sample, respondent number or sometimes just “N”), the number of people who were invited to answer the survey (the population), and how the surveyor developed that population of invitees.

Focus on that last disclosure, which basically concerns the representativeness of the survey population.  If a company that sells time and billing software to law firms writes to its customers and asks them “Do you find software technology valuable for your firm?,” no one should be surprised if the headline of the vendor’s report boasts “Nine out of ten law firms find software technology valuable!”  Aside from the binary choice of the question and the vendor’s blatant self-interest in promoting sales of software, the crucial skew in the results arises from the fact that the people invited to complete the survey hardly mirror people in law firms generally.  They have licensed, or at least know about, time and billing software.  The deck was stacked; the election was rigged.

Unfortunately, all too often vendor-sponsored surveys go out to invitees who have some connection with the vendor and therefore are hardly representative of law firm lawyers and staff as a whole.  The invitees will almost certainly be on the vendor’s contact list or its newsletter recipients or those who visit the vendor’s website and register.  Only sometimes will a vendor develop or rent a much larger mailing list and reach out to its names.  Even if they do, respondents will likely be self-selected because they use that kind of software or service or have some level of awareness of it.

Track and analyze the “surface area” of your lawyers’ contacts with individual clients

Legal managers look for available but overlooked data that can sharpen their business judgment.   One data set that might be new is “surface area”: how many individual clients interact with lawyers during a period of time, either within the organization for law departments or at organizational clients for law firms.  Surface area doesn’t just track senior clients; it tracks all clients.  The more clients who have dealings with a lawyer each quarter, the larger the contact surface area and, presumably, the better the law department or law firm both knows and responds to clients.  Widespread connections (a large surface area for the law department or law firm) suggest that clients are finding the lawyers valuable.  They also keep the lawyers more in touch with business realities, rather than lost in the myopia of purely legal developments.

True, the lawyers might need to tally a few individual clients on their own, but tools exist to capture much of the data.  What comes to mind is software that extracts names of clients from the lawyers’ emails.  For a partner in a firm, email traffic with [name]@[client].com would be fairly easy to pull out and keep track of; for an associate general counsel in a company, the same type of filter applied to internal email traffic would be even easier.  Another source could be invitation lists to meetings.
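A hedged sketch of that kind of counting follows.  The addresses, the client domain, and the quarter labels are invented; in practice the input would come from whatever system can export a lawyer’s message headers.

```python
# Hedged sketch: count distinct client contacts ("surface area") per quarter.
from collections import defaultdict

CLIENT_DOMAIN = "client.com"           # hypothetical client email domain

# (quarter, sender address) pairs pulled from a lawyer's email traffic
messages = [
    ("2016-Q3", "jane.doe@client.com"),
    ("2016-Q3", "r.smith@client.com"),
    ("2016-Q3", "jane.doe@client.com"),     # repeat contacts count once
    ("2016-Q4", "p.jones@client.com"),
    ("2016-Q4", "someone@othercorp.com"),   # not the client; ignored
]

contacts = defaultdict(set)
for quarter, address in messages:
    if address.lower().endswith("@" + CLIENT_DOMAIN):
        contacts[quarter].add(address.lower())

for quarter in sorted(contacts):
    print(quarter, "distinct client contacts:", len(contacts[quarter]))
```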

Analyses of data on client contacts would focus on changes over time and on distribution, and could also fuel social network insights.   For the network graphs, it would be useful to categorize clients by level or position.

Handling extreme values with Winsorizing and trimming means

Legal managers need to be sensitive to data that has extreme values.  Such very high or very low numbers in a distribution of numbers (meaning, the set of numbers) can result in a skewed representation of the average (arithmetic mean, in statistical terminology).  Those who analyze data have many ways to handle extreme values, with the best known one being to calculate the median of the distribution.  But let’s consider two others: Winsorizing the distribution and trimming the distribution.

We can return to Corp. Counsel, Oct. 2016 at 44, and its table that shows counts of U.S. law firms that “turn up the most in court documents.”  We added the number of lawyers in each firm and found that the arithmetic mean is 896.7 lawyers.

To lessen the influence of outliers, the distribution could be “Winsorized.”   When you Winsorize data, values in the tails are set equal to a specified percentile of the data (some analysts instead clip at a cutoff such as a few standard deviations from the mean). For a 90 percent Winsorization, the bottom 5 percent of the values are set equal to the value at the 5th percentile, while the top 5 percent are set equal to the value at the 95th percentile. This adjustment is different from throwing some of the extreme data away, which is what happens with trimmed means. Once you Winsorize your data, your median will not change but your average will.   The Winsorized mean of this data is 892.7.
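For the mechanics, here is a hedged sketch using SciPy.  The firm sizes are invented and do not reproduce the article’s 896.7 or 892.7 figures; only the 5-percent-per-tail setting mirrors a 90 percent Winsorization.

```python
# Hedged sketch: a 90 percent Winsorization of invented firm sizes.
import numpy as np
from scipy.stats.mstats import winsorize

lawyers = np.array([45, 60, 80, 120, 150, 200, 260, 300, 380, 450,
                    520, 600, 700, 800, 900, 1000, 1100, 1300, 1600, 4200])

# Replace the bottom 5% and top 5% of values with the nearest retained values.
wins = winsorize(lawyers, limits=(0.05, 0.05))

print(lawyers.mean())                        # ordinary mean, pulled up by 4,200
print(wins.mean())                           # Winsorized mean, closer to the middle
print(np.median(lawyers), np.median(wins))   # the median does not change
```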

A trimmed mean calculation lops off the designated percent of firms at the top (and the same percent of firms at the lowest end of the distribution of lawyer sizes).   In short, trimming is done by equal amounts at both ends to minimize the bias of the result.  The trimmed mean of this distribution, lopping off 5% at each end (rounding if necessary), is 880.7.
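And the corresponding hedged sketch for a trimmed mean, again with SciPy and the same invented firm sizes (which do not reproduce the article’s 880.7).

```python
# Hedged sketch: a trimmed mean that drops 5 percent of values at each end.
import numpy as np
from scipy.stats import trim_mean

lawyers = np.array([45, 60, 80, 120, 150, 200, 260, 300, 380, 450,
                    520, 600, 700, 800, 900, 1000, 1100, 1300, 1600, 4200])

print(lawyers.mean())                            # ordinary mean
print(trim_mean(lawyers, proportiontocut=0.05))  # mean of the middle 90 percent
```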

Try a snowball survey so you get more participants in a client satisfaction or legal industry survey

Most law departments, when inviting their clients to complete a satisfaction survey, select recipients at or above a certain level, such as all “Managers,” or “everyone above comp level 15.”  It would be interesting and enlightening for a department to try a “snowball survey.”

Send the questionnaire form (or an email with its online equivalent) to a relatively few, high-level clients. Ask them to complete the form and also to forward the blank form to three colleagues who have worked recently with the law department (or forward the email invitation to those colleagues).  Each recipient, in turn, is also invited to extend the survey’s reach, and thus the snowball grows.

A service provider in the legal industry could adopt the same tactic: invite everyone you can reach to take a survey, but urge them to send it on to others they know who would have something to say about the survey’s topic.  Now, some surveyors may reject the snowball approach because they want to control who can be in the participant group.  But a broader-minded, and arguably more objective, goal would be to reach as many participants as possible and thereby gain a more accurate understanding of the entire population.

Watson requires huge quantities of text to formulate learnings

In February 2016, the accounting giant KPMG announced that it had been working with IBM Watson, one of the most advanced artificial intelligence technology platforms available.  An article describes Watson briefly:  “It works by using natural language processing and machine learning to reveal insights and information from huge quantities of unstructured data.” [emphasis added]   Notably, over a period of a few years Watson has digested hundreds of thousands of medical research papers on cancer and thereafter shown itself capable of matching the diagnoses of experts and suggesting new therapies.  According to the TV show 60 Minutes, eight thousand cancer papers are published every day!

A handful of law firms have announced that they are using Watson’s algorithms.  One firm (Baker & Hostetler) has, it sounds to me, directed Watson to parse thousands of cases, law review articles, and briefs in the bankruptcy area.  Whether that corpus of documents provides enough grist for Watson’s mill, since it is an order of magnitude or two smaller than the oncology set, remains to be seen.

My point is that the vast pools of text necessary for Watson to hone its skills to a proficient level may be rare in the legal industry.  And, related to that point, experienced lawyers may need to devote hours and hours to coding some of the textual material so that Watson can pick out which patterns are most likely to be present in the coded results.