Number of lawyers in survey firms; merged names

We start with a couple of methodological decisions. First, what number shall we use for the count of practicing lawyers in the firm? To reconstruct the number of lawyers practicing at the firm back in the year of a survey would take much digging. Although that would let us analyze the data set more accurately wherever firm size matters, the effort to obtain matching historical data would be daunting.

A second, related issue is how to handle firms that merged after they conducted a survey. At least three of the firms in the data set have merged with another major firm during the past few years. These merged firms include BryanCaveBLP, CMS, HoganLovells, and Norton Rose Fulbright. How should we treat their sizes? Also, if we keep the pre-merger name of the firm, we have to figure out both the month and year its merger took effect and the month and year the survey was published. That game’s not worth the candle. If we use the name of the merged firm, we lose the correct name of the firm as of the year the survey was completed.

The convention I have tried to adopt uses the current lawyer headcount of an unmerged firm, the latest name of the merged firm, and the merged firm’s lawyer count. The first two names of the firm, without any punctuation, make up my “firm name”.
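As a rough illustration of the naming convention, here is a minimal Python sketch. The function name `firm_key` is hypothetical, and joining the two words without a space is my assumption, modeled on forms such as “HoganLovells” above; it is not necessarily how the data set was actually built.

```python
import re

def firm_key(full_name: str, n_words: int = 2) -> str:
    """Build a short firm key from the first `n_words` words of a firm name.

    Assumption: punctuation (commas, periods, ampersands, apostrophes) is
    dropped and the remaining words are joined without spaces.
    """
    # Delete every character that is not a letter, digit, underscore, or whitespace.
    cleaned = re.sub(r"[^\w\s]", "", full_name)
    # Split on whitespace and keep only the leading words.
    words = cleaned.split()
    return "".join(words[:n_words])

print(firm_key("Hogan Lovells LLP"))
print(firm_key("Norton Rose Fulbright"))
print(firm_key("O'Melveny & Myers"))
```

A key built this way is convenient for matching, but it can collide if two firms share their first two names, so the results still deserve a manual check.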

Under that convention, the average number of lawyers in the 77 law firms for which I have data is 1,047. The median is 753 lawyers. The conclusion is inescapable: very large law firms are the typical sponsors of research surveys.

The range of sizes is also illuminating: 6 lawyers to 4,607 lawyers. The set includes at least three firms with fewer than 200 lawyers along with ten with more than 2,000 lawyers. The takeaway? A firm of any size can launch a research survey.

The plot presents aggregate size data from 69 firms based in four “countries”: Canada (6 different law firms), the United Kingdom (20 firms), the United States (38), and “VereinCLG,” five firms that have a legal structure of either a Swiss verein or a “company limited by guarantee” (CLG).

Canonical names to allow software to combine data on law schools

Whenever a data scientist decides to merge two sets of data, there must be a common field (variable) for the software to merge on. The software needs to be able to instruct the computer: “Whenever the first data set has ‘Alaska’ in the State column, and the second data set has ‘Alaska’ in the Jurisdiction column, add on to the first data set any additional columns of information from the second data set.” The code has to tell the software that the State variable and the Jurisdiction variable are the common field for purposes of matching, and it has to use the right function for merging.
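To make the matching step concrete, here is a minimal sketch in plain Python of that kind of merge. The helper `left_join`, the column names `col_a` and `col_b`, and the sample values are all made up for illustration; a real analysis would more likely call the merge function of a data-analysis package.

```python
def left_join(left, right, left_key, right_key):
    """Add the extra columns of `right` to each row of `left` whose key matches."""
    # Index the second data set by its key column for fast lookup.
    lookup = {row[right_key]: row for row in right}
    merged = []
    for row in left:
        match = lookup.get(row[left_key], {})
        # Copy every column from the matching row except the key itself.
        extra = {k: v for k, v in match.items() if k != right_key}
        merged.append({**row, **extra})
    return merged

# Placeholder rows, not real data.
first = [{"State": "Alaska", "col_a": 1}, {"State": "Ohio", "col_a": 2}]
second = [{"Jurisdiction": "Alaska", "col_b": "x"}]

merged = left_join(first, second, left_key="State", right_key="Jurisdiction")
for row in merged:
    print(row)
```

In this sketch a row with no match in the second data set (Ohio here) is kept but gains no new columns, which is one of several reasonable choices for unmatched rows.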

With the Law School Data Set, when I found data on admission rates in one source and data on numbers of Supreme Court clerks in another, the common field was the name of the law school. A human can match school names instantly even if the precise names vary a little.

That sounds like it should also be simple for a computer program, but to a computer “NYU Law” is completely different from “New York University Law”; “Columbia Law School” is not “Columbia School of Law”. The multitudinous ways publications name law schools mean that the data scientist has to settle on one version of the school’s name (sometimes referred to as the “canonical version”) and then spend much time transforming the alternative names to the canonical name. It’s slogging work, subject to errors, and adds no real value. But only once it is done can a merge function in the software achieve what you hope.
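One common way to do that transformation is a hand-built lookup table from every known variant to the canonical name. The sketch below is illustrative only: the table `CANONICAL`, the function `canonical_name`, and my choice of which variant counts as canonical are all assumptions.

```python
# Hypothetical lookup table: lowercase variant -> canonical name.
CANONICAL = {
    "nyu law": "New York University Law",
    "new york university law": "New York University Law",
    "columbia law school": "Columbia Law School",
    "columbia school of law": "Columbia Law School",
}

def canonical_name(raw: str) -> str:
    """Return the canonical school name for a raw name from any source."""
    # Normalize case and collapse repeated whitespace before the lookup.
    key = " ".join(raw.lower().split())
    if key not in CANONICAL:
        # Fail loudly so that an unknown variant gets added to the table
        # instead of silently failing to match during the merge.
        raise KeyError(f"No canonical name recorded for {raw!r}")
    return CANONICAL[key]

print(canonical_name("NYU Law"))
print(canonical_name("Columbia  School of Law"))
```

The table still has to be built by hand, which is the slogging part, but once it exists every source can be passed through the same function before merging.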