Weighting data from surveys by law firms

Surveyors sometimes weight their data to make the findings more representative of another set of information. For example, a law firm might realize that it has gotten too few responses from some demographic strata, such as manufacturers or companies with more than $5 billion in revenue. The firm might want to correct for the imbalance so that it can present conclusions about the entire population (remember, the survey captures but a sample from the population). The firm could weight more heavily the manufacturers or large companies that did respond, to create a sample more in line with reality.

How might such a transformation apply in surveys for the legal industry? Let’s assume that a firm knows roughly how many companies in the United States have revenue over $100 million by each major industry. Those known proportions enable weighting. If the participants materially under-represent some industry or revenue range, the proportions in each industry don’t match the proportions that we know to be true. One way to adjust (weight) the data set would be to replicate participants in industries (or revenue ranges) enough to make the survey data set more like the real data set.
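As a sketch of that adjustment (with invented industry shares), each respondent can carry a weight equal to the population share divided by the sample share for its industry; fractional weights generalize the whole-number replication described above:

```python
from collections import Counter

# Hypothetical industry mix: known population proportions vs. the survey sample.
population_share = {"manufacturing": 0.40, "services": 0.40, "tech": 0.20}
sample = (["manufacturing"] * 10) + (["services"] * 60) + (["tech"] * 30)

counts = Counter(sample)
n = len(sample)

# Weight per respondent: population share divided by sample share.
# A weight of 4.0 is the fractional equivalent of replicating a
# respondent four times.
weights = {ind: population_share[ind] / (counts[ind] / n) for ind in counts}

# After weighting, the sample's industry shares match the population's.
total = sum(weights[r] for r in sample)
weighted_share = {ind: counts[ind] * weights[ind] / total for ind in counts}
print(weighted_share)
```

Under-represented manufacturers get a weight of 4.0, while over-represented industries get weights below 1.0, pulling the weighted shares back to the known population proportions.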

In a rare example, CMS Nabarro HealthTech 2017 [pg. 19] states explicitly that the analysis applied no weightings.

King Spalding ClaimsProfs 2016 [pg. 10] explains that it calculated the “weighted average experience” for certain employees. This might mean that one company had fewer employees than the others, so the firm weighted each company’s numbers by headcount so that a small company’s figure would not count as much as a large company’s. In other words, they might have weighted the average by the number of employees in each of the companies. As a matter of good methodology, it would have been better for the firm to explain what they did in order to calculate the weighted average.
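If the firm did weight by headcount, the arithmetic might look like this sketch (company names and figures are invented):

```python
# Hypothetical: average employee experience (years) per company,
# weighted by each company's headcount.
companies = [
    {"name": "A", "employees": 500, "avg_experience": 12.0},
    {"name": "B", "employees": 50,  "avg_experience": 20.0},
]

total_employees = sum(c["employees"] for c in companies)
weighted_avg = sum(
    c["employees"] * c["avg_experience"] for c in companies
) / total_employees
# The small company's high average barely moves the overall figure.
print(round(weighted_avg, 2))
```

A simple (unweighted) average of the two company figures would be 16.0 years; the headcount-weighted average stays close to the large company’s 12.0.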

White Case Arbitration 2010 [pg. 15] writes that it “weighted the results to reveal the highest ranked influences.” This could mean that a “very important” rating was treated as a four, a “quite important” rating as a three, and so on down to zero. If every respondent had given one of the influences on choice of governing law the highest rating, a four, that would have been the maximum possible weighted score. The sum of the actual ratings could then be calculated as a percentage of that maximum. The table lists the responses in decreasing order according to that calculation. This is my supposition of the procedure, but again, it would have been much better had the firm explained how it calculated the “weighted rank.”
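A sketch of that supposed calculation, with invented ratings:

```python
# Hypothetical ratings on a 0-4 scale ("very important" = 4 ... 0)
# for one influence on choice of governing law, from five respondents.
ratings = [4, 3, 4, 2, 3]

# Maximum possible score: every respondent gives the top rating.
max_possible = 4 * len(ratings)

# Weighted score as a percentage of that maximum.
weighted_score = sum(ratings) / max_possible
print(f"{weighted_score:.0%}")
```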

Dykema Gossett MA 2015 [pg. 5] does not explain what “weighted rank” means in the following snippet, but the firm may have applied the same technique.

On one question, Seyfarth Shaw RE 2017 [pg. 10] explained a similar translation: “Question No. 3 used an inverse weighted ranking system to score each response. For example in No. 3, 1=10 points, 2=9 points, 3=8 points, 4=7 points, 5=6 points, 6=5 points, 7=4 points, 8=3 points, 9=2 points, 10=1 point”

Miller Chevalier TaxPolicy 2017 [pg. 6] asked respondents to rank the top three. The firm then used an inverse ranking to treat a 1 as 3, a 2 as 2, and a 3 as 1, and summed the points to reach a weighted rank (score).
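A sketch of that inverse scoring, with invented rankings of three hypothetical tax-policy issues:

```python
# Hypothetical: each respondent ranks their top three concerns.
# Inverse scoring: rank 1 -> 3 points, rank 2 -> 2, rank 3 -> 1.
rankings = [
    {"rates": 1, "repatriation": 2, "credits": 3},
    {"repatriation": 1, "rates": 2, "credits": 3},
    {"rates": 1, "credits": 2, "repatriation": 3},
]

scores: dict[str, int] = {}
for response in rankings:
    for issue, rank in response.items():
        scores[issue] = scores.get(issue, 0) + (4 - rank)  # inverse weight

# Sort issues by weighted score, highest first.
ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
print(ranked)
```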

Sometimes surveys use the term “weight” to mean “rank”. Here is an example from Berwin Leighton Risk 2014 [pg. 6].


Assess and improve the accuracy of a machine-learning model for court opinions

Machine learning models need to be validated, which entails running the model on new data to see how well the classification or prediction works.  In the research explained in Part I, Part II, Part III, and Part IV, topics were identified and used to predict a European court’s decisions.

In the validation of their model, the researchers tested how accurate their model was based on being trained on a subset of the case opinions.  “The models are trained and tested by applying a stratified 10-fold cross validation, which uses a held-out 10% of the data at each stage to measure predictive performance.”

In less technical words, they split the cases into ten folds (stratified, so each fold mirrors the overall mix of outcomes), then ran their model ten times, each time training it on 90 percent of the cases and using the model to predict the ruling on the held-out 10 percent.  They averaged the results of the ten runs so that extremes would be less influential.
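The mechanics of that rotation can be sketched with only the standard library. A trivial majority-class classifier stands in for the paper’s SVM, and the data are synthetic; the point is the 90/10 train/test rotation, not the classifier:

```python
import random

random.seed(0)
# Synthetic data: 50 cases of each outcome (0 = no violation, 1 = violation).
data = [(random.random(), label) for label in (0, 1) for _ in range(50)]

# Stratify: split each class separately into 10 folds, then recombine,
# so every fold keeps the overall class proportions.
folds = [[] for _ in range(10)]
for label in (0, 1):
    members = [d for d in data if d[1] == label]
    random.shuffle(members)
    for i, item in enumerate(members):
        folds[i % 10].append(item)

accuracies = []
for k in range(10):
    test = folds[k]                                    # held-out 10%
    train = [item for i, f in enumerate(folds) if i != k for item in f]
    # "Train": pick the majority label in the training 90%.
    majority = max((0, 1), key=lambda lab: sum(1 for _, y in train if y == lab))
    # "Test": score on the held-out fold.
    accuracies.append(sum(1 for _, y in test if y == majority) / len(test))

print(sum(accuracies) / len(accuracies))  # mean accuracy across the 10 folds
```

With perfectly balanced synthetic classes, the stand-in classifier scores 50 percent, as expected; a real model would replace the majority-label line.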

That’s not all.  “The linear SVM [the machine learning algorithm employed to classify court decisions into violation found or not found] has a regularisation parameter of the error term C, which is tuned using grid-search.”  We will forgo a full explanation of this dense sentence, but it has to do with finding (“tuning”) the best controls or constraints on the SVM’s application (“parameters”) by testing many variations in which the parameter values are varied systematically over a predefined grid (“grid-search”).
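A minimal sketch of the idea: try each candidate value of C on a coarse grid and keep the one with the best cross-validated score. The score function here is a hypothetical stand-in for running the SVM’s cross-validation at a given C:

```python
# Stand-in for cross-validated accuracy at a given C; in the paper this
# would mean retraining and re-testing the linear SVM. Here we assume a
# hypothetical score surface that peaks near C = 1.0.
def cv_score(c: float) -> float:
    return 1.0 / (1.0 + abs(c - 1.0))

# Candidate values, typically log-spaced; grid search tries them all.
grid = [0.01, 0.1, 1.0, 10.0, 100.0]
best_c = max(grid, key=cv_score)
print(best_c)
```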

The article continues: “The linear kernel of the SVM model can be used to examine which topics are most important for inferring whether an article of the Convention has been violated or not by looking at their weights w.”  In this research, the weights calculated by the algorithm are a measure of how much a topic influences the Court’s decision.  Tables in the article present the six topics for the most positive and negative SVM weights for the parts of the opinions.
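The ranking step behind those tables can be sketched as follows; the topic names and weights here are invented, whereas in the paper they come from the trained linear SVM:

```python
# Hypothetical topic weights from a linear model: positive weights push
# toward "violation", negative weights push toward "no violation".
topic_weights = {
    "detention conditions": 1.8,
    "prior case law": 0.9,
    "procedural history": -0.4,
    "sentencing terms": -1.6,
}

ranked = sorted(topic_weights.items(), key=lambda kv: kv[1], reverse=True)
most_positive = ranked[0]    # topic pushing hardest toward "violation"
most_negative = ranked[-1]   # topic pushing hardest toward "no violation"
print(most_positive, most_negative)
```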

Thus ends our exegesis of a wonderful piece of applied machine learning relevant to the legal industry.

We welcome any and all questions or comments, especially those that will make even clearer the power of this method of artificial intelligence research and its application in legal management.

Surveys should weight respondents by their proportion in the larger population

Survey data provides considerable insight for legal managers, if the survey’s methodology was sound.  One of the methodological decisions to be made by the sponsor is whether to weight responses. You do so by adjusting the responses to match the known demographic characteristics of the population from which the sample was drawn.

A survey of law departments, as an example, might weight the responses by the size of the law departments.  That means you adjust the responses you have in hand so that they more accurately represent the entire population. You might have a category of 1-to-3 lawyers in the department, a second of 4-to-6, a third of 7-to-12, and a fourth category for all larger law departments.  Demographic data about law departments in the United States suggest that at least a third of them have three lawyers or fewer.

If the survey responses had only ten percent in the smallest category, the surveyor should weight the responses it did receive by a factor of roughly three so that the unbalanced sample is more representative of all U.S. law departments.  The few in the sample need to be counted more if you are going to generalize about all law departments in the population.

The broader the categories, the less the surveyor needs to consider weighting responses, since the responses are more likely to distribute themselves in conformity with the population’s distribution. But with narrow categories, a handful of responses might need to be weighted heavily (multiplied more), and those few will therefore be disproportionately influential in the overall results. One prophylactic is to trim the weights, which prevents any one respondent from being up-weighted by more than some cap, such as 5 or 10 times.  An article in the New York Times, September 13, 2016, by Nate Cohn, helped make this point about survey weighting clear.
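A sketch of computing and then trimming such weights, with hypothetical population and sample shares for the four size categories above:

```python
# Hypothetical shares: what we believe is true of the population vs.
# what the survey actually captured.
population_share = {"1-3": 0.35, "4-6": 0.30, "7-12": 0.20, "13+": 0.15}
sample_share     = {"1-3": 0.05, "4-6": 0.35, "7-12": 0.35, "13+": 0.25}

CAP = 5.0  # trim: no respondent may count more than 5 times
weights = {
    size: min(population_share[size] / sample_share[size], CAP)
    for size in population_share
}
print(weights)
```

The untrimmed weight for the badly under-represented 1-to-3 category would be 7.0; the cap trims it to 5.0, trading a little representativeness for less dependence on a handful of respondents.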

Surveys with fewer than 400 participants produce “ballpark” results at best

Findings from surveys can enlighten legal managers and sharpen their decisions, but only if the data reported by the organization that conducted the survey is credible.  Among the many imperfections that can mar survey results, an immediately obvious one is sample size and its inverse effect on the margin of error of the results.  Put simply, the smaller the sample of respondents, the more the results might diverge from the actual figure that would emerge if all the population could be polled – the margin of error balloons.  Or, lots of participants, small margin of error (results more likely to be representative of the whole population).

The NY Times, Oct. 15, 2016 at A15 refers to voter surveys, but the statistical caveat is the same for legal-industry surveys.  “If the sample is less than 400, the result should be considered no more than a ballpark estimate.”
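The arithmetic behind that rule of thumb is the standard margin-of-error formula for a simple random sample, at 95 percent confidence and with the worst-case proportion of p = 0.5:

```python
import math

# MOE = 1.96 * sqrt(p * (1 - p) / n), maximized at p = 0.5.
def margin_of_error(n: int) -> float:
    return 1.96 * math.sqrt(0.25 / n)

for n in (100, 400, 1000):
    print(n, f"±{margin_of_error(n):.1%}")
```

At 400 participants the margin of error is about ±4.9 percent; at 100 it balloons to nearly ±10 percent, which is why smaller samples yield only ballpark estimates.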

Sadly, many surveys by vendors to law firms and law departments fail to accumulate more than 400 participants.  Worse, quite a few survey reports say nothing about how many participants they obtained, even if they provide demographic data about them.  Their findings might be characterized as SWAGs (scientific wild-ass guesses), which might even then be giving them too much credit on the “scientific” side.  No one should base decisions on findings derived from a too-tiny group of survey respondents.

We leave for another post a further wrinkle that the Times highlights: if the data analysts weight the responses, they “don’t adjust their margins of error to account for the effect of weighting.”

Enrich client satisfaction data with weights by frequency of use

To make better decisions based on client-satisfaction survey results, break down client scores, such as by the frequency of their legal service use: low, medium, and high.  In other words, for the attribute “Knowledge of the business” you might report that infrequent users averaged 3.8 on a scale of 1 (poor) to 5 (good); that medium users (seeking legal advice once a quarter or more often, perhaps) averaged 3.9; and high-volume users (perhaps more than three times a month) averaged 4.1.  That would require an additional question on the survey offering three choices for frequency of calling the law department, but it lets you gauge more finely the scores of different tranches of your clients.

Heavy users could be thought to be your main clients, thus most deserving of your attention and resources, although some people might argue that infrequent users may be avoiding your department, under-using your expertise, and running unwanted legal risks.  This is a complex topic, since a heavy user may be lazy, offloading work to the law department, thick as a brick, or too cautious to make decisions.

To go beyond tabulations of satisfaction ratings by frequency of use, and to introduce another way to weight each individual’s score, you could use the level of the person.  A Grade 2 (SVP level, maybe) response would be weighted more than a Grade 3 (AVP level), and so on.  Then the calculations of average scores can take into account the position in the company of respondents in a single metric, rather than multiple metrics for senior, medium, and junior levels.
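One way such grade weighting might be computed, with hypothetical grade weights and satisfaction scores:

```python
# Hypothetical weights: Grade 2 (SVP) counts triple, Grade 3 (AVP) double,
# Grade 4 single. The choice of weights is a management judgment call.
grade_weight = {2: 3.0, 3: 2.0, 4: 1.0}

responses = [  # (grade, satisfaction score on a 1-to-5 scale)
    (2, 4.0),
    (3, 3.5),
    (4, 3.0),
    (4, 5.0),
]

total_weight = sum(grade_weight[g] for g, _ in responses)
weighted_avg = sum(grade_weight[g] * s for g, s in responses) / total_weight
print(round(weighted_avg, 2))
```

The single weighted figure leans toward the senior respondents’ scores, replacing separate averages for senior, medium, and junior levels.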