Assess and improve the accuracy of a machine-learning model for court opinions

Machine learning models need to be validated, which entails running the model on new data to see how well the classification or prediction works.  In the research explained in Part I, Part II, Part III, and Part IV, topics were identified and used to predict a European court’s decisions.

In validating their model, the researchers tested how accurately it performed when trained on a subset of the case opinions.  “The models are trained and tested by applying a stratified 10-fold cross validation, which uses a held-out 10% of the data at each stage to measure predictive performance.”

In less technical words, they ran their model ten times, each time training it on a different 90 percent of the cases (stratified so that each slice kept roughly the same proportion of outcomes) and then using the model to predict the rulings of the held-out 10 percent of the cases.  They averaged the results of the ten runs so that extremes would be less influential.
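As a sketch of that procedure, and only a sketch: the snippet below assumes scikit-learn, and its features and labels are synthetic stand-ins for the paper’s topic vectors and violation/no-violation outcomes.

```python
# Stratified 10-fold cross-validation with a linear SVM (scikit-learn).
# X and y are random placeholders, not the paper's data.
import numpy as np
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 30))      # 200 "cases", 30 "topic" features
y = rng.integers(0, 2, size=200)    # 1 = violation found, 0 = not found

cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
scores = cross_val_score(LinearSVC(C=1.0, max_iter=5000), X, y, cv=cv)

# Average the ten held-out accuracies so extremes count for less.
print(f"mean accuracy: {scores.mean():.2f} (+/- {scores.std():.2f})")
```

Each of the ten scores is the accuracy on one held-out tenth of the cases; the mean is the figure a paper would report.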

That’s not all.  “The linear SVM [the machine learning algorithm employed to classify court decisions into violation found or not found] has a regularisation parameter of the error term C, which is tuned using grid-search.”  We will forgo a full explanation of this dense sentence, but it has to do with finding (“tuning”) the best controls or constraints on the SVM’s application (“parameters”) by systematically trying every candidate value laid out in advance on a grid (“grid-search”) and keeping the one that scores best.
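A minimal sketch of that tuning step, again assuming scikit-learn and using invented candidate values for C:

```python
# Tune the SVM's regularisation parameter C by grid search:
# every candidate value on the grid is tried, none are chosen at random.
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.svm import LinearSVC

rng = np.random.default_rng(1)
X = rng.normal(size=(150, 20))      # placeholder topic features
y = rng.integers(0, 2, size=150)    # placeholder outcomes

search = GridSearchCV(
    LinearSVC(max_iter=5000),
    param_grid={"C": [0.01, 0.1, 1, 10, 100]},  # illustrative grid
    cv=5,
)
search.fit(X, y)
print("best C:", search.best_params_["C"])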

The article continues: “The linear kernel of the SVM model can be used to examine which topics are most important for inferring whether an article of the Convention has been violated or not by looking at their weights w.”  In this research, the weights calculated by the algorithm are a measure of how much a topic influences the Court’s decision.  Tables in the article present the six topics with the most positive and the six with the most negative SVM weights for each part of the opinions.
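In code, reading off those weights amounts to inspecting the fitted model’s coefficient vector.  The sketch below assumes scikit-learn; the topic names and data are invented for illustration.

```python
# After fitting a linear SVM, each feature (topic) gets one weight:
# strongly positive weights push toward "violation found",
# strongly negative weights push toward "no violation".
import numpy as np
from sklearn.svm import LinearSVC

rng = np.random.default_rng(2)
topics = [f"topic_{i}" for i in range(10)]   # hypothetical topic labels
X = rng.normal(size=(100, 10))
y = rng.integers(0, 2, size=100)

model = LinearSVC(max_iter=5000).fit(X, y)
w = model.coef_.ravel()                      # one weight per topic

order = np.argsort(w)
print("most negative:", [topics[i] for i in order[:3]])
print("most positive:", [topics[i] for i in order[-3:]])
```

Sorting the weights and taking the extremes is exactly how a table of “top six positive / top six negative topics” would be built.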

Thus ends our exegesis of a wonderful piece of applied machine learning relevant to the legal industry.

We welcome any and all questions or comments, especially those that will make even clearer the power of this method of artificial intelligence research and its application in legal management.

Surveys should weight respondents by their proportion in the larger population

Survey data provide considerable insight for legal managers, if the survey’s methodology is sound.  One of the methodological decisions the sponsor must make is whether to weight responses.  You do so by adjusting the responses so that they match the demographic characteristics of the population you have surveyed.

A survey of law departments, as an example, might weight the responses by the size of the law departments.  That means you adjust the responses you have in hand so that they more accurately represent the entire population.  You might have a category of 1-to-3 lawyers in the department, a second of 4-to-6, a third of 7-to-12, and a fourth category for all law departments that are larger.  Demographic data about law departments in the United States suggest that at least a third of them have three lawyers or fewer.

If the survey responses had only ten percent in the smallest category, the surveyor should multiply the weight of those responses by roughly three so that the unbalanced sample better represents all U.S. law departments.  The few in the sample need to count for more if you are going to generalize about all law departments in the population.
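The arithmetic is simple: each category’s weight is its population share divided by its sample share.  The shares below are invented to match the example in the text.

```python
# Weight = population share / sample share, per size category.
# Small departments are about a third of the population but only
# ten percent of this hypothetical sample, so they count triple.
population_share = {"1-3": 0.34, "4-6": 0.30, "7-12": 0.22, "13+": 0.14}
sample_share     = {"1-3": 0.10, "4-6": 0.35, "7-12": 0.30, "13+": 0.25}

weights = {cat: population_share[cat] / sample_share[cat]
           for cat in population_share}

print(round(weights["1-3"], 1))   # roughly 3: each small-dept response counts triple
```

Over-represented categories get weights below one by the same formula, so the weighted sample mirrors the population.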

The broader the categories, the less the surveyor needs to consider weighting responses, since the responses are more likely to distribute themselves in conformity with the population’s distribution.  But with narrow categories, a handful of responses might need to be weighted heavily (multiplied more), and therefore those few will be disproportionately influential in the overall results.  One prophylactic is to trim the weights, which prevents one or two respondents from being upweighted by more than some cap, such as 5 or 10 times.  An article in the New York Times, September 13, 2016, by Nate Cohn, helped make this point about survey weighting clear.
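Trimming is just a cap on the weights.  A toy sketch, with invented raw weights and a cap of 5:

```python
# Trim weights so no single respondent counts more than 5x.
raw_weights = [0.8, 1.1, 2.5, 9.0, 14.0]   # hypothetical raw weights
cap = 5.0
trimmed = [min(w, cap) for w in raw_weights]

print(trimmed)   # the 9x and 14x respondents are capped at 5x
```

The trade-off: trimming reduces the influence of rare respondents but makes the weighted sample match the population slightly less exactly.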

Surveys with fewer than 400 participants produce “ballpark” results at best

Findings from surveys can enlighten legal managers and sharpen their decisions, but only if the data reported by the organization that conducted the survey is credible.  Among the many imperfections that can mar survey results, an immediately obvious one is sample size and its inverse effect on the margin of error of the results.  Put simply, the smaller the sample of respondents, the more the results might diverge from the actual figure that would emerge if all the population could be polled – the margin of error balloons.  Or, lots of participants, small margin of error (results more likely to be representative of the whole population).

The New York Times of Oct. 15, 2016, at A15, refers to voter surveys, but the statistical caveat is the same for legal-industry surveys: “If the sample is less than 400, the result should be considered no more than a ballpark estimate.”
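The 400-respondent threshold comes from the standard margin-of-error approximation for a proportion near 50 percent at 95 percent confidence, which works out to roughly 1 divided by the square root of the sample size.  A quick sketch:

```python
# 95% margin of error for an estimated proportion near 50%:
# z * sqrt(p * (1 - p) / n), with z = 1.96 for 95% confidence.
import math

def margin_of_error(n, p=0.5, z=1.96):
    return z * math.sqrt(p * (1 - p) / n)

for n in (100, 400, 1600):
    # Print as percentage points: n = 400 gives about +/- 5 points.
    print(n, round(100 * margin_of_error(n), 1))
```

At 400 respondents the margin is about plus or minus 5 percentage points; quadrupling the sample to 1,600 only halves it, which is why small vendor surveys stay in ballpark territory.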

Sadly, many surveys by vendors to law firms and law departments fail to accumulate more than 400 participants.  Worse, quite a few survey reports say nothing about how many participants they obtained, even if they provide demographic data about them.  Their findings might be characterized as SWAGs (scientific wild-ass guesses), which might even then be giving them too much credit on the “scientific” side.  No one should base decisions on findings derived from a too-tiny group of survey respondents.

We leave for another post a further wrinkle that the Times highlights: if the data analysts weight the responses, they “don’t adjust their margins of error to account for the effect of weighting.”

Enrich client satisfaction data with weights by frequency of use

To make better decisions based on client-satisfaction survey results, break down client scores, such as by the frequency of their legal service use: low, medium, and high.  In other words, for the attribute “Knowledge of the business” you might report that infrequent users averaged 3.8 on a scale of 1 (poor) to 5 (good); that medium users (seeking legal advice once a quarter or more often, perhaps) averaged 3.9; and high-volume users (perhaps more than three times a month) averaged 4.1.  That would require an additional question on the survey regarding three choices for frequency of calling the law department, but it lets you gauge more finely the scores of different tranches of your clients.
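Computing those tranche averages is a simple group-by.  The responses below are invented for illustration:

```python
# Average one satisfaction attribute separately for each
# frequency-of-use tranche (low, medium, high users).
from collections import defaultdict
from statistics import mean

# (frequency tranche, score on the 1-to-5 scale) -- hypothetical data
responses = [("low", 4), ("low", 4), ("low", 3),
             ("medium", 4), ("medium", 4),
             ("high", 4), ("high", 4), ("high", 5)]

by_group = defaultdict(list)
for freq, score in responses:
    by_group[freq].append(score)

for freq in ("low", "medium", "high"):
    print(freq, round(mean(by_group[freq]), 1))
```

One extra survey question (the frequency tranche) is all the data this breakdown needs.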

Heavy users could be thought to be your main clients, thus most deserving of your attention and resources, although some people might argue that infrequent users may be avoiding your department, under-using your expertise, and running unwanted legal risks.  This is a complex topic, since a heavy user may be lazy, offloading work to the law department, thick as a brick, or too cautious to make decisions.

To go beyond tabulations of satisfaction ratings by frequency of use, and to introduce another way to weight each individual’s score, you could use the level of the person.  A Grade 2 (SVP level, maybe) response would be weighted more than a Grade 3 (AVP level), and so on.  Then the calculation of average scores can take into account respondents’ positions in the company in a single metric, rather than separate metrics for senior, medium, and junior levels.
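A single grade-weighted score is just a weighted average.  Both the grade-to-weight mapping and the responses below are invented for illustration:

```python
# One company-wide satisfaction metric, with senior respondents'
# scores counting more than junior ones (weights are hypothetical).
grade_weight = {2: 3.0, 3: 2.0, 4: 1.0}   # Grade 2 = SVP, 3 = AVP, ...

responses = [(2, 4.0), (3, 3.5), (3, 4.5), (4, 3.0)]   # (grade, score)

total = sum(grade_weight[g] * s for g, s in responses)
weight_sum = sum(grade_weight[g] for g, _ in responses)

print(round(total / weight_sum, 2))   # single weighted satisfaction score
```

The one number replaces three separate senior/medium/junior metrics, at the cost of baking the grade weights into the result.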