Canonical names to allow software to combine data on law schools

Whenever a data scientist decides to merge two sets of data, there must be a common field (variable) for the software to merge on. The code needs to instruct the computer: “Whenever the first data set has ‘Alaska’ in the State column, and the second data set has ‘Alaska’ in the Jurisdiction column, add to the first data set any additional columns of information from the second data set.” The code has to tell the software that the State variable and the Jurisdiction variable are the common field for purposes of matching, and then use the right function for merging.
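In Python’s pandas library, for example, a merge along those lines might look like the following sketch (the two tiny data sets and all their values are invented for illustration):

    import pandas as pd

    # First data set names the common field "State"; the second calls it "Jurisdiction".
    states = pd.DataFrame({"State": ["Alaska", "Maine"],
                           "Population": [733000, 1363000]})
    courts = pd.DataFrame({"Jurisdiction": ["Alaska", "Maine"],
                           "TrialCourts": [4, 3]})

    # Tell the software which columns form the common field for matching.
    merged = states.merge(courts, left_on="State", right_on="Jurisdiction", how="left")
    print(merged)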

With the Law School Data Set, when I found data on admission rates in one source and data on numbers of Supreme Court clerks in another, the common field was the name of the law school.  A human can match school names instantly even if the precise wording varies a little.

That sounds as if it should also be simple for a computer program, but to a computer “NYU Law” is completely different from “New York University Law”; “Columbia Law School” is not “Columbia School of Law”.  The multitudinous ways publications name law schools mean that the data scientist has to settle on one version of the school’s name – sometimes referred to as the “canonical version” – and then spend much time transforming the alternative names to the canonical name.  It’s slogging work, subject to errors, and adds no real value.  But only once it is done can a merge function in the software achieve what you hope.
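In code, that transformation step often amounts to a lookup table from each alternative name to the canonical version. A minimal sketch in Python with pandas (the canonical names chosen here are only assumptions for illustration):

    import pandas as pd

    # Hypothetical lookup from alternative names to one canonical version.
    canonical = {
        "NYU Law": "New York University School of Law",
        "New York University Law": "New York University School of Law",
        "Columbia Law School": "Columbia University School of Law",
        "Columbia School of Law": "Columbia University School of Law",
    }

    schools = pd.Series(["NYU Law", "Columbia School of Law", "Yale Law School"])
    # Replace each alternative with its canonical name; unmatched names pass through.
    print(schools.replace(canonical))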

Handling extreme values with Winsorized and trimmed means

Legal managers need to be sensitive to data that has extreme values.  Very high or very low numbers in a distribution (meaning, the set of numbers) can skew the representation of the average (the arithmetic mean, in statistical terminology).  Those who analyze data have many ways to handle extreme values, the best known being to calculate the median of the distribution.  But let’s consider two others: Winsorizing the distribution and trimming the distribution.

We can return to Corp. Counsel, Oct. 2016 at 44, and its table that shows counts of U.S. law firms that “turn up the most in court documents.”  We added the number of lawyers in each firm and found that the arithmetic mean is 896.7 lawyers.

To lessen the influence of outliers, the distribution could be “Winsorized.”  When you Winsorize data, values at the extremes are set equal to the values at a specified percentile of each tail.  For a 90 percent Winsorization, the bottom 5 percent of the values are set equal to the value at the 5th percentile, while the top 5 percent of the values are set equal to the value at the 95th percentile.  This adjustment is different from throwing some of the extreme data away, which happens with trimmed means.  Once you Winsorize your data, your median will not change but your average will.  The Winsorized mean of this data is 892.7.
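For those curious how that looks in code, here is a minimal sketch in Python using scipy’s winsorize function on ten invented lawyer counts (clipping 10 percent at each end so the effect shows in so small a sample):

    import numpy as np
    from scipy.stats.mstats import winsorize

    # Invented lawyer counts standing in for the firms in the table.
    lawyers = np.array([45, 120, 300, 450, 600, 750, 900, 1100, 1600, 4200])

    # Winsorize: set the bottom and top 10 percent equal to the nearest kept values.
    w = winsorize(lawyers, limits=(0.10, 0.10))  # 45 -> 120, 4200 -> 1600

    print(np.mean(lawyers), np.mean(w))      # 1006.5 vs. 754.0: the mean moves
    print(np.median(lawyers), np.median(w))  # 675.0 vs. 675.0: the median does not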

A trimmed mean calculation lops off the designated percentage of firms at the top and the same percentage at the bottom of the distribution of lawyer counts.  In short, trimming is done by equal amounts at both ends to minimize bias in the result.  The trimmed mean of this distribution, lopping off 5 percent at each end (rounding as necessary), is 880.7.
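A matching sketch for the trimmed mean, with the same invented counts and scipy’s trim_mean function (which drops, rather than clips, the tail values):

    import numpy as np
    from scipy.stats import trim_mean

    lawyers = np.array([45, 120, 300, 450, 600, 750, 900, 1100, 1600, 4200])  # invented

    # Drop 10 percent of the values at each end, then average what remains.
    print(np.mean(lawyers))          # 1006.5
    print(trim_mean(lawyers, 0.10))  # 727.5: the extremes no longer pull the mean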

Create a new variable and learn more (court mentions per 100 law-firm lawyers)

Let’s return to the court-document mentions from Corp. Counsel, Oct. 2016 at 44, where a table shows counts of 30 U.S. law firms that “turn up the most in court documents.”  Whereas previously JurisDators looked at mentions by number of lawyers in the firms, the next plot creates a new variable (a “synthetic variable”) that divides the number of mentions for each firm by the number of the firm’s lawyers and then multiplies that result by 100.
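The arithmetic behind that new variable is simple; here is a sketch in Python with pandas, using invented mention counts and head counts rather than the magazine’s actual figures:

    import pandas as pd

    firms = pd.DataFrame({
        "firm":     ["Littler Mendelson", "Ogletree Deakins", "K&L Gates"],
        "mentions": [3000, 2600, 1800],   # hypothetical counts of court-document mentions
        "lawyers":  [1200, 800, 1900],    # hypothetical numbers of lawyers
    })

    # The synthetic variable: mentions per 100 lawyers.
    firms["mentions_per_100"] = firms["mentions"] / firms["lawyers"] * 100
    print(firms.sort_values("mentions_per_100", ascending=False))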

[Plot: court-document mentions per 100 lawyers, by firm]

Hence, the graph shows how many mentions Corporate Counsel found in court documents for every 100 lawyers in the firm.

By this measure, Littler Mendelson and Ogletree Deakins again top the chart, but Ogletree punches well above its weight.  Shook Hardy and Fish & Richardson also look much more impressive from this perspective, whereas K&L Gates seems to owe its spot on the table mostly to its vast size.

The larger point here is that data sets yield more information when the data scientist carries out a calculation to create a new variable, such as the normalization by firm size done here.  Absolute numbers tell one story; normalized numbers almost always add very different insights.

Extract-Load-Transform (ELT) cleans data later in the process than Extract-Transform-Load (ETL)

A few days ago, I challenged a statement that David White, a director at AlixPartners, made in a column.  When I wrote to him to let him know, he responded right away and taught me something.

He suggested “looking into the differences between the concepts of Extract-Transform-Load (ETL) and Extract-Load-Transform (ELT), especially with regard to big data technology such as Hadoop, MapReduce, Splunk and the like.  The process you describe [in the previous blog post] of needing to normalize data before analysis, such as trimming currency symbols, refers to the ‘transform’ stage that traditional analytics systems require be done upfront.  Newer technologies no longer require this to be done until much later, as part of the analytics process.  And often, not at all.  Therefore, we don’t have to worry that some of our values have dollar signs and some do not, or that some dates are in one format and others in another.  We can extract and load all the data upfront, regardless, and deal with these issues when building queries, if needed at all.  This feature is what allows these systems to perform analysis on multiple diverse formats of data, including both structured and unstructured data, at the same time, and on real-time data feeds.  For example, Google doesn’t care about date formats or dollar signs when you run a query across millions of websites, yet it returns accurate results regardless of the format of the original content.”

Excellent, and I appreciate David taking the time to explain why my challenge was misguided!
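To make the ETL/ELT distinction concrete, here is a minimal sketch in Python with pandas, on invented values: everything is extracted and loaded as raw text, and the transformation (stripping dollar signs, parsing assorted date formats) happens only when a query calls for it:

    import pandas as pd
    from dateutil import parser

    # "Extract" and "load": keep every value exactly as it arrived, as text.
    raw = pd.DataFrame({
        "amount": ["$1,200.50", "950", "$87.00"],
        "date":   ["2016-10-01", "10/02/2016", "Oct 3, 2016"],
    })

    # "Transform" only at query time, and only the columns the query needs.
    amounts = raw["amount"].str.replace(r"[$,]", "", regex=True).astype(float)
    dates = raw["date"].map(parser.parse)  # each format parsed on the fly
    print(amounts.sum(), dates.min())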

Analytic software requires curated data before it can proceed reliably

A column in Met. Corp. Counsel, Sept. 2016 at 21 by David White, a director at AlixPartners, starts with three paragraphs on the severity of anti-corruption risks to U.S. companies that do business abroad and the associated huge variety and volume of documents and records that might have to be analyzed to respond to regulators.  Data analytics to the rescue!

White continues: “Unlike their traditional counterparts, newer analytic systems based on big data technologies, predictive analytics and artificial intelligence are not bound by upfront data transformations and normalization requirements …” [emphasis added].

In my experience, analytical software inevitably requires the data fed to it to be formatted cleanly in ways it can handle.  For example, dollar signs can’t be included, currencies need to be converted, dates need a consistent format, and missing numbers or problematic outliers need to be dealt with – these and other steps of “upfront data transformations and normalization” are often manual, require human judgment, and can take hours of thoughtful attention.  Moreover, the data doctor ought to keep track of all the steps in the cleanup.
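As one illustration of such upfront transformations, with the record-keeping the data doctor owes, here is a sketch in Python with pandas on invented fee data:

    import pandas as pd

    fees = pd.DataFrame({"fees": ["$1,000", "2,500", None, "$999,999"]})  # invented
    log = []  # the data doctor's record of every cleanup step

    # Strip dollar signs and commas, then cast the text to numbers.
    fees["fees"] = fees["fees"].str.replace(r"[$,]", "", regex=True).astype(float)
    log.append("stripped $ and , from fees; cast to float")

    # One judgment call for missing values: fill with the median.
    med = fees["fees"].median()
    fees["fees"] = fees["fees"].fillna(med)
    log.append(f"filled missing fees with median {med}")

    # Flag an implausible outlier for human review rather than analyze it blindly.
    fees.loc[fees["fees"] > 100_000, "fees"] = None
    log.append("set fees above 100,000 to missing, pending review")

    print(fees, log, sep="\n")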

Add new variables, e.g., a gap index for client satisfaction

To learn more from a set of data, you may want to calculate additional variables.  Here is an example from a client satisfaction survey.

If you are a general counsel and you ask your clients to assess your department, ask them not only to evaluate your group’s performance on a set of attributes but also to rank those attributes by importance.  The more important the attribute – such as timeliness, understanding of the law, responsiveness – the more your clients should expect good performance from the law department.   You want to focus on what your clients value.

From the survey data, create an “index of client satisfaction” that divides the reality (performance ratings) by the expectations of clients (importance ratings) on each attribute.  In short: reality divided by expectations equals client satisfaction.  Then you can calculate averages, medians, and so on.

With 1.0 being the absolute best, meaning the delivered performance fully met the expectations of all your clients, your index declines to the degree the performance of the law department fell short of what clients felt was important and expected.  By the way, low expectations (importance) fully met show up in the index as high satisfaction.  Focus on the gap between the highest-ranking attributes and their evaluation ratings.
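A sketch of the index calculation in Python with pandas, on hypothetical ratings for three attributes:

    import pandas as pd

    survey = pd.DataFrame({
        "attribute":   ["timeliness", "understanding of the law", "responsiveness"],
        "performance": [4.1, 3.9, 3.6],   # hypothetical average ratings, 1 to 5
        "importance":  [4.8, 4.0, 4.6],   # hypothetical average rankings, 1 to 5
    })

    # Reality divided by expectations: 1.0 means expectations fully met.
    survey["satisfaction_index"] = survey["performance"] / survey["importance"]
    print(survey.sort_values("satisfaction_index"))  # biggest gaps first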

Enrich client satisfaction data with weights by frequency of use

To make better decisions based on client-satisfaction survey results, break down client scores, such as by the frequency of their legal service use: low, medium, and high.  In other words, for the attribute “Knowledge of the business” you might report that infrequent users averaged 3.8 on a scale of 1 (poor) to 5 (good); that medium users (seeking legal advice once a quarter or more often, perhaps) averaged 3.9; and that high-volume users (perhaps more than three times a month) averaged 4.1.  That would require an additional survey question offering three choices for frequency of calling the law department, but it lets you gauge more finely the scores of different tranches of your clients.
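The tabulation itself is a one-line grouping once each response carries its frequency answer; a sketch in Python with pandas on invented responses:

    import pandas as pd

    # Invented scores for "Knowledge of the business" plus each client's usage tranche.
    responses = pd.DataFrame({
        "score":     [4, 3, 5, 4, 4, 3, 5, 4],
        "frequency": ["low", "medium", "high", "high",
                      "low", "medium", "high", "medium"],
    })

    # Average score within each usage tranche.
    print(responses.groupby("frequency")["score"].mean())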

Heavy users could be considered your main clients, and thus most deserving of your attention and resources, although some might argue that infrequent users may be avoiding your department, under-using your expertise, and running unwanted legal risks.  This is a complex topic, since a heavy user may be lazy, offloading work to the law department; thick as a brick; or too cautious to make decisions.

To go beyond tabulations of satisfaction ratings by frequency of use, you could introduce another way to weight each individual’s score: the level of the person.  A Grade 2 (SVP level, perhaps) response would be weighted more than a Grade 3 (AVP level) response, and so on.  Then the calculation of average scores can take into account respondents’ positions in the company in a single metric, rather than in separate metrics for senior, medium, and junior levels.
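One way to fold those weights into a single metric, sketched in Python with invented scores, grades, and weights:

    import numpy as np
    import pandas as pd

    # Hypothetical responses: a score plus the respondent's grade.
    responses = pd.DataFrame({
        "score": [4, 3, 5, 4, 2],
        "grade": [2, 3, 2, 4, 3],  # Grade 2 = SVP, Grade 3 = AVP, Grade 4 more junior
    })

    # Illustrative weights: the more senior the grade, the heavier the weight.
    weights = {2: 3.0, 3: 2.0, 4: 1.0}
    w = responses["grade"].map(weights)

    # A single average that takes respondents' positions into account.
    print(np.average(responses["score"], weights=w))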

For survey ranking questions, a technique to ensure that the scale was applied correctly

If you are collecting data with a survey, you might ask the invitees to rank various selections on a scale.  “Please rank the following five methods of knowledge management on their effectiveness using a scale of 1 (least) to 5 (most)” followed by a list of five methods.  Ranking yields more useful data than “Pick all that you believe are effective” since the latter does not differentiate between methods: each one picked appears equally effective.

But ranking spawns the risk that respondents will confuse which end of the scale is most effective and which least.  They might not read carefully and therefore put the number 1 for their most effective method – after all, being Number 1 is best, right? – and put the number 5 for their least effective method.

One method some surveys adopt to guard against respondents misreading the direction of the scale is to add a question after the ranking question.  The follow-on question asks them to check the single most effective method.  Software can then quickly confirm that the respondent understood and applied the scale correctly by verifying that the method rated 5 on the first question matches the method checked on the second.
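A sketch of that confirmation in Python, on one invented response (the method names and variable names are made up):

    # Hypothetical rankings of five knowledge-management methods, 1 (least) to 5 (most).
    rankings = {"wiki": 5, "mentoring": 4, "search tool": 3,
                "newsletters": 2, "binders": 1}
    checked_most_effective = "wiki"  # answer to the follow-on question

    # The method ranked 5 should match the method checked as most effective.
    top_ranked = max(rankings, key=rankings.get)
    print(top_ranked == checked_most_effective)  # True means the scale was applied correctly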