|To a data scientist, responsible data informs and guides. For legal managers, carefully curated data should change your mind if it shows you were unaware, mistaken or had an untenable belief. Unfortunately, no less than silken data often succumbs to wooly thinking.
The depressing reality is that you can wear yourself out gathering insightful numbers, but someone beholden to an ideology or enjoying privileges threatened by a finding will not only reject the insights but refuse even to acknowledge them. The NY Times, Nov. 4, 2016 at B4 emphasizes this shortcoming of humans in an article about partisan suspicion of government employment data. “Decades of psychological research have shown that people … tend to embrace information that confirms their existing beliefs and disregard data that contradicts them.”
When a general counsel is presented with data that shows a favored law firm’s effective billing rate is much higher than the firm’s peers or a managing partner is presented with data that a long-standing client is unprofitable, they can whisk out a sewing basket of tools to rend such “whole cloth” malarkey. We all find comfort in data that confirms what we believe and we disregard data that controverts our values, belief sets, or sense of self. We all believe we look good in what we put on.
Even so, data scientists in the data-tattered legal industry must persevere to support the thoughtful pursuit of enlightenment through numbers.
|Whenever a data scientist decides to merge two sets of data, there must be a common field (variable) for the software to merge on. The software needs to be able to instruct the computer: “Whenever the first data set has ‘Alaska’ in the State column, and the second data set has ‘Alaska’ in the Jurisdiction column, add on to the first data set any additional columns of information from the second data set.” The code has to tell the software that the State variable and the Jurisdiction variable are the common field for purposes of matching, and it has to use the right function for merging.
With the Law School Data Set, when I found data on admission rates in one source and data on numbers of Supreme Court Clerks in another, the common field was the name of the law school. A human can match school names instantly even if they vary a little in the precise name used.
That sounds like it should also be simple for a computer program, but to a computer “NYU Law” is completely different than “New York University Law”; “Columbia Law School” is not “Columbia School of Law”. The multitudinous ways publications name law schools mean that the data scientist has to settle on one version of the school’s name – sometimes referred to as the “canonical version” – and then spend much time transforming the alternative names to the canonical name. It’s slogging work, subject to errors, and adds no real value. But only once it is done can a merge function in the software achieve what you hope.
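To make the canonical-name chore concrete, here is a minimal sketch in plain Python. It is not the code I used on the Law School Data Set. The alias table, the column names (`School`, `Institution`), the choice of canonical spellings, and the figures are all made up for illustration. In practice a data library such as pandas, with its `merge` function, would handle the joining step, but the alias table still has to be built by hand.

```python
# Hypothetical alias table: every variant spelling maps to one canonical name.
CANONICAL = {
    "nyu law": "New York University School of Law",
    "new york university law": "New York University School of Law",
    "columbia law school": "Columbia Law School",
    "columbia school of law": "Columbia Law School",
}

def canonical_name(raw):
    """Return the canonical school name, or None if the variant is unknown."""
    key = " ".join(raw.lower().split())  # normalize case and whitespace
    return CANONICAL.get(key)

def merge_on_school(first, second, first_key, second_key):
    """Left-merge two lists of row dicts on the canonical school name."""
    lookup = {}
    for row in second:
        name = canonical_name(row[second_key])
        if name is not None:
            lookup[name] = row
    merged = []
    for row in first:
        name = canonical_name(row[first_key])
        extra = lookup.get(name, {}) if name is not None else {}
        combined = dict(row)
        # Add the second data set's columns, except its own name column.
        combined.update({k: v for k, v in extra.items() if k != second_key})
        combined["canonical"] = name
        merged.append(combined)
    return merged

# Made-up figures, for illustration only.
admissions = [
    {"School": "NYU Law", "admit_rate": 0.30},
    {"School": "Columbia School of Law", "admit_rate": 0.20},
]
clerks = [
    {"Institution": "New York University Law", "clerks": 5},
    {"Institution": "Columbia Law School", "clerks": 9},
]

merged = merge_on_school(admissions, clerks, "School", "Institution")
for row in merged:
    print(row)
```

A name that is missing from the alias table comes back as `None` rather than being guessed at, which makes the unmatched schools easy to spot and fix by hand.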
|To have a hefty data set that would both interest lawyers and be available to share publicly has long been a desire of mine. It would let me show how to work with data, and readers could download the data and follow along. While it is easy to make up data for what programmers call “toy data sets”, such data sets are abstract and uninteresting.
Even more importantly, made-up data lacks patterns and characteristics that can demonstrate machine learning capabilities in real life.
My benchmark data from law departments could not be shared, because it was all proprietary. My data collected during consulting projects for law departments and law firms also has to be kept strictly confidential. And some data that are in the public domain or have leaked into it, such as older AMLAW100 compilations on law firms, do not have a range of variables that can illustrate machine learning techniques, for example.
So, I created a data set on information about U.S. law schools. The first version started with the schools rated by U.S. News & World Report. Thereafter I successively added more data for the schools from about six other sources. I also added data about the population of the city each school was in and its state, and its state’s number of lawyers in private practice and some other variables about clerkships, etc.
The final step is a coming out party for this set of data about U.S. law schools!
|The legendary Prof. Edward Tufte gave a keynote presentation in September 2016 at Microsoft’s Machine Learning and Data Summit. Tufte’s ambitious subject was “The Future of Data Analysis”. You can listen to the 50-minute talk online. Early on he emphasized that you display data to assist reasoning (analytic thinking) and to enable smart comparisons.
Tufte frequently referred to data visualization as a method aimed to maximize “information throughput”, yet also to be interpretable by the reader. I took information throughput to be engineering jargon for “lots of data presented.”
Maximal information throughput, from the standpoint of legal managers, has almost no relevance. The data sets that could be analyzed by AI or machine learning techniques or visualized by Excel, Tableau, R and other software are simply too small to justify that “Big Data” orientation and terminology.
That distinction understood, legal managers should take away from Tufte’s model and recommendation that when you create a graph, you should strive to present as much of the underlying information as you can, as clearly as you can, so that the reader of the graph can come to her own interpretations.
|Machine learning models need to be validated, which entails running the model on new data to see how well the classification or prediction works. In the research explained in Part I, Part II, Part III, and Part IV, topics were identified and used to predict a European court’s decisions.
In the validation of their model, the researchers tested how accurate their model was based on being trained on a subset of the case opinions. “The models are trained and tested by applying a stratified 10-fold cross validation, which uses a held-out 10% of the data at each stage to measure predictive performance.”
In less technical words, they split the cases into ten equal parts and ran their model ten times, each time training it on nine of the parts (90 percent of the cases) and then using the model to predict the ruling on the left-out 10 percent of the cases. “Stratified” means that each part keeps roughly the same mix of violation and no-violation cases as the full set. They averaged the results of the multiple runs so that extremes would be less influential.
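For readers who want to see the mechanics, here is a rough sketch of a stratified 10-fold split in plain Python. It is my own illustration, not the researchers’ code. The function name `stratified_k_fold` and the toy labels are assumptions, and a library such as scikit-learn (its `StratifiedKFold` class) would normally do this work.

```python
import random
from collections import defaultdict

def stratified_k_fold(labels, k=10, seed=0):
    """Yield (train_idx, test_idx) pairs; each fold keeps the class mix."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        rng.shuffle(indices)
        # Deal each class's cases round-robin so every fold gets its share.
        for pos, idx in enumerate(indices):
            folds[pos % k].append(idx)
    for i in range(k):
        test_idx = sorted(folds[i])
        train_idx = sorted(
            j for f, fold in enumerate(folds) if f != i for j in fold
        )
        yield train_idx, test_idx

# Toy labels: 50 "violation" (+1) and 50 "no violation" (-1) cases.
labels = [1] * 50 + [-1] * 50
splits = list(stratified_k_fold(labels, k=10))
for train_idx, test_idx in splits:
    positives = sum(1 for i in test_idx if labels[i] == 1)
    print(len(train_idx), len(test_idx), positives)
```

Each of the ten passes would train a model on `train_idx`, score it on `test_idx`, and the ten scores would then be averaged.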
That’s not all. “The linear SVM [the machine learning algorithm employed to classify court decisions into violation found or not found] has a regularisation parameter of the error term C, which is tuned using grid-search.” We will forgo a full explanation of this dense sentence, but it has to do with finding (“tuning”) the best setting for a control or constraint on the SVM’s application (a “parameter”) through a method of systematically testing each value on a predefined list of candidates (“grid-search”).
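The idea of a grid search fits in a few lines. The sketch below is only an illustration: the list of candidate values for C is my assumption, and `toy_cv_accuracy` is a made-up stand-in for the real step of training an SVM with a given C and measuring its cross-validated accuracy. Libraries such as scikit-learn package the real thing as `GridSearchCV`.

```python
import math

def grid_search(candidates, score_fn):
    """Score every candidate on the grid and return (best_value, best_score)."""
    scored = [(score_fn(c), c) for c in candidates]
    best_score, best_value = max(scored)
    return best_value, best_score

def toy_cv_accuracy(C):
    # Stand-in for "mean cross-validated accuracy of an SVM trained with this C".
    # A made-up curve that peaks at C = 1; a real score would come from the folds.
    return 0.80 - 0.05 * abs(math.log10(C))

# A predefined list of values to try, each one scored in turn.
C_grid = [0.001, 0.01, 0.1, 1, 10, 100]
best_C, best_score = grid_search(C_grid, toy_cv_accuracy)
print(best_C, round(best_score, 2))
```

The search is exhaustive over the listed values, so the quality of the result depends on how sensibly the list was chosen.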
The article continues: “The linear kernel of the SVM model can be used to examine which topics are most important for inferring whether an article of the Convention has been violated or not by looking at their weights w.” In this research, the weights calculated by the algorithm are a measure of how much a topic influences the Court’s decision. Tables in the article present the six topics for the most positive and negative SVM weights for the parts of the opinions.
Thus ends our exegesis of a wonderful piece of applied machine learning relevant to the legal industry.
We welcome any and all questions or comments, especially those that will make even clearer the power of this method of artificial intelligence research and its application in legal management.
|All data appears because of underlying value judgments by someone. A vendor who conducts a survey of law firms or law departments privileges certain numbers that it asks for over all the other numbers not asked about. Just the wording, number, or order of questions reveals personal biases toward what is important to know and what isn’t. (“Bias” is not a pejorative term but rather connotes the leanings or predilections or unexamined assumptions of someone.) As Frank Bruni wrote in the NY Times, Oct. 30, 2016 at SR3 regarding the proliferation of college rankings, “all of them make subjective value judgments about what’s most important in higher education.” Some look at selectiveness of colleges, others at student satisfaction; some rankings elevate diversity where others focus on earnings of graduates. The decision of what data to emphasize in any survey is far from neutral.
In the legal industry, the client-law firm relationship stands higher than all other facets of the industry, as evidenced by the number and breadth of surveys. The subjective judgments of surveyors signal strongly that how a law department deals with its law firms economically is its defining attribute, rather than quality of advice or professional growth on the buyer or seller side, or independence or many other conceivable attributes. It is easier to collect data on a topic that has been promoted to the top and is suffused with money, power, and prestige.
Don’t read this as my saying that which law firms a law department pays how much for what kinds of services is unimportant. It is indeed pragmatic and very important. But I do want to highlight how easy it is to overlook that privileging certain sets of data automatically demotes other data. Legal managers need to keep in mind the subjective value judgments made everywhere in the data value chain and that different value judgments would result in different data and possibly different managerial decisions.
|We have explained in Part I, Part II and Part III how researchers took the text of certain European court opinions, found how often words and combinations of words appeared in them, and coalesced those words that appeared relatively often together into named, broader topics. Next, they wanted to see if software could predict from the topics how the court would decide. They relied on a machine learning algorithm called Support Vector Machines (SVM).
“An SVM is a machine learning algorithm that has shown particularly good results in text classification, especially using small data sets. We employ a linear kernel since that allows us to identify important features that are indicative of each class by looking at the weight learned for each feature. We label all the violation cases as +1, while no violation is denoted by −1. Therefore, features assigned with positive weights are more indicative of violation, while features with negative weights are more indicative of no violation.”
Whew! A linear kernel is a method from linear algebra that treats each case as a point in a multi-dimensional space (a “hyperspace,” which can be thought of as having not just an x-axis and a y-axis, but also an a-axis, b-axis and so on out to as many axes as there are data features). Unlike more elaborate kernels, it does not transform the data into a different space; it works with the features as given, which is why the learned weights can be read directly. In that hyperspace, the SVM algorithm finds key data points (called “support vectors”) that define the widest boundary between violation cases and non-violation cases. The result is what is known as a “hyperplane” because it separates the classes in a hyperspace as well as possible (as a line can do in two dimensions and a plane in three dimensions).
The weights that the algorithm identifies enable it to classify the cases and define the hyperplane. The weights represent the hyperplane by giving the coordinates of a vector which is orthogonal to it (“orthogonal” can be imagined as perpendicular to the hyperplane; two orthogonal vectors have a dot product of zero). The vector’s direction gives the predicted class, so if you take the dot product [more matrix algebra] of any point with the vector (and add the model’s constant offset term), you can tell on which side it is: if the result is positive, it belongs to the positive class; if it is negative, it belongs to the negative class. You could say that the absolute size of the (weight) coefficient relative to the other ones gives an indication of how important the feature was for separating the classes.
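The arithmetic is simpler than the vocabulary suggests. Here is a small sketch in plain Python with three hypothetical topics and invented weights (they are not values from the paper). The `bias` argument is the constant offset term that a trained SVM also learns; I have set it to zero to keep the example short.

```python
def dot(u, v):
    """Dot product: multiply matching entries and add them up."""
    return sum(a * b for a, b in zip(u, v))

def predict(weights, features, bias=0.0):
    """Return +1 (violation) or -1 (no violation) from the side of the hyperplane."""
    return 1 if dot(weights, features) + bias > 0 else -1

# Hypothetical weights for three topics; not values from the paper.
topics = ["topic A", "topic B", "topic C"]
weights = [2.0, -1.5, 0.25]

case_1 = [1.0, 0.0, 1.0]  # 2.0 + 0.25 = 2.25, the positive side
case_2 = [0.0, 1.0, 1.0]  # -1.5 + 0.25 = -1.25, the negative side

print(predict(weights, case_1))
print(predict(weights, case_2))

# Rank topics by the absolute size of their weights, largest first.
ranked = sorted(zip(topics, weights), key=lambda tw: abs(tw[1]), reverse=True)
print(ranked)
```

Topic A pulls hardest toward a finding of violation, topic B pulls toward no violation, and topic C barely matters either way.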
Machine learning has the potential to invade and disrupt the current market for lawyers in some of the most complex, high-end legal practices. In any practice where there are large numbers of court opinions, briefs, law review articles, white papers, laws and regulations, and other textual material, it appears that IBM’s Watson looms as a tool to absorb it, recognize patterns, and augment lawyers’ reasoning. Remember, however, that Watson is a glutton for vast amounts of digitized documents. Without that diet, the formidable Watson may wither like the Wicked Witch of Oz when sprinkled with water.
Augmentation has a partnering ring, a positive valence, but the dark side lurks. Associate-heavy research memos and scores of partners “coming up to speed” in an area of law will evaporate when software does the heavy lifting and prep work. The experienced judgment of intelligent lawyers will forever be in demand, but software augmentation will limit leverage and slash hours that would have been billed to clients for a law firm tackling a new area of law for that firm.
A chilling glimpse of this future appears in the Economist, Oct. 22, 2016 at 64. The “cognitive artificial intelligence platform [Watson] has begun categorizing the various [financial industry] regulations and matching them with the appropriate enforcement mechanisms.” Experts in the regulatory web that ensnares and confounds financial firms vet the conclusions Watson derives from the mass of material available to it; “A dozen rules are now being assimilated weekly.” The target is the estimated $270 billion or more spent each year on regulatory compliance – “of which $20 billion is spent simply on understanding the requirements.” Who knows how much flows to law firms or how much that flow will slow once Watson has matured?
To the extent Watson and look-alikes can make sense out of the tangle of financial regulations, lawyers will experience less demand for their tools and experience. It is unlikely that the projection of legal work to more sophisticated levels, aided by software organization and analysis, will replace the loss of billable hours at the lower end.
|Finding Topics: In this, the third of a series begun here and continued here, we discuss a recently-published research paper. The researchers created “topics for each [opinion] by clustering together N-grams that are semantically similar by leveraging the distributional hypothesis suggesting that similar words appear in similar contexts.” In other words, they assumed that related words tend to appear together and as a group suggest a “topic”.
They used the C feature matrix, “which is a distributional representation of the N-grams given the case as the context; each column vector of the matrix represents an N-gram.”
OK, so how did the software identify or create topics? “Using this vector representation of words, we compute N-gram similarity using the cosine metric and create an N-gram by N-gram similarity matrix.” Cosine is a common metric in text analysis to find which vectors (sets of numbers) are more like which others. It is based on the angle between two vectors, where a vector in this research consists of all the 0’s and 1’s recorded for one N-gram across the cases. Roughly speaking, if you think of two such sets of values graphed on a sheet of paper, there is an angle of some number of degrees between them, and thus also a cosine from trigonometry. The smaller the angle, the closer the cosine is to 1 and the more the two N-grams overlap; vectors that share nothing have a cosine of 0. A similarity matrix shows the degree of “sameness” of each word (or word combo) to each other.
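A tiny worked example may help. The sketch below computes cosine similarity in plain Python and builds a three-by-three similarity matrix. The N-grams and their 0/1 vectors are hypothetical, chosen only so the arithmetic is easy to follow. Each vector has one entry per case, set to 1 if the N-gram appears in that case.

```python
import math

def cosine(u, v):
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return sum(a * b for a, b in zip(u, v)) / (norm_u * norm_v)

# Hypothetical N-gram vectors: one entry per case, 1 if the N-gram appears in it.
ngrams = {
    "prison conditions": [1, 1, 0, 0],
    "detention cell":    [1, 1, 0, 1],
    "property rights":   [0, 0, 1, 0],
}

# N-gram by N-gram similarity matrix, as a list of rows.
names = list(ngrams)
similarity = [[cosine(ngrams[a], ngrams[b]) for b in names] for a in names]
for name, row in zip(names, similarity):
    print(name, [round(x, 2) for x in row])
```

The first two N-grams turn up in mostly the same cases and so score about 0.82, while “property rights” shares no cases with either and scores 0.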
The researchers went further once they found with the cosine similarity calculation which N-grams were more closely associated with each other. “We finally apply spectral clustering—which performs graph partitioning on the similarity matrix—to obtain 30 clusters of N-grams.” Spectral clustering techniques make use of the spectrum (eigenvalues, a linear algebra construct beyond simple translation!) of the similarity matrix of the data to perform dimensionality reduction before clustering in fewer dimensions. In other words, once you figure out that certain words are associated with each other, spectral clustering brings together groups of related words into “topics.”
A reduced number of dimensions allows the software to perform more efficiently and helps humans decipher the output. [Note that the researchers created topics as hard clusters according to which an N-gram can only be part of a single topic. Machine learning allows more flexibility but this was a plausible design choice.]
The researchers then gave over-arching names to the topics based on which words were most common to the cluster, which they refer to as a “representation.” “A representation of a cluster is derived by looking at the most frequent N-grams it contains. The main advantages of using topics (sets of N-grams) instead of single N-grams is that it reduces the dimensionality of the feature space, which is essential for feature selection, [as] it limits over-fitting to training data, and also provides a more concise semantic representation…. Topics form a more abstract way of representing the information contained in each case and capture a more general gist of the cases.” Stated differently by the author, “Topics identify in a sufficiently robust manner patterns of fact scenarios that correspond to well-established trends in the Court’s case law.” Over-fitting is a bugaboo of machine learning, but we will leave that to a later discussion.
|“Cognitive computing” may be just another marketing buzzword, but legal managers will encounter it. Based on KMWorld, Oct. 2016 and its white paper, “cognitive computing is all about machine learning, with some artificial intelligence and natural language processing.” You can learn more from the Cognitive Computing Consortium although that group does not yet support a legal sub-group.
In that regard, however, a LinkedIn user group called Artificial Intelligence for Legal Professionals has a couple of hundred members.