A publicly-available data set on U.S. law schools

I have long wanted a hefty data set that would both interest lawyers and be available to share publicly.  It would let me show how to work with data, and readers could download the data and follow along.  While it is easy to make up data for what programmers call “toy data sets”, such sets are abstract and uninteresting.

Even more importantly, made-up data lacks patterns and characteristics that can demonstrate machine learning capabilities in real life.

My benchmark data from law departments could not be shared, because it was all proprietary.  My data collected during consulting projects for law departments and law firms also has to be kept strictly confidential.  And some data that are in the public domain or have leaked into it, such as older AMLAW100 compilations on law firms, do not have the range of variables needed to illustrate machine learning techniques.

So, I created a data set of information about U.S. law schools.  The first version started with the schools rated by U.S. News & World Report.  Thereafter I successively added more data for the schools from about six other sources.  I also added the population of each school’s city and state, the number of lawyers in private practice in its state, and some other variables about clerkships, etc.

The final step is a coming out party for this set of data about U.S. law schools!

All data for legal managers are shaped by subjective value judgments

All data appears because of underlying value judgments by someone.  A vendor who conducts a survey of law firms or law departments privileges certain numbers that it asks for over all the other numbers not asked about.  Just the wording, number, or order of questions reveals personal biases toward what is important to know and what isn’t.  (“Bias” is not a pejorative term but rather connotes the leanings or predilections or unexamined assumptions of someone.)  As Frank Bruni wrote in the NY Times, Oct. 30, 2016 at SR3 regarding the proliferation of college rankings, “all of them make subjective value judgments about what’s most important in higher education.”  Some look at selectiveness of colleges, others at student satisfaction; some rankings elevate diversity where others focus on earnings of graduates.  The decision of what data to emphasize in any survey is far from neutral.

In the legal industry, the client-law firm relationship stands higher than all other facets of the industry, as evidenced by the number and breadth of surveys.  The subjective judgments of surveyors signal strongly that how a law department deals with its law firms economically is its defining attribute, rather than quality of advice or professional growth on the buyer or seller side, or independence or many other conceivable attributes.  It is easier to collect data on a topic that has been promoted to the top and is suffused with money, power, and prestige.

Don’t read this as my saying that which law firms a law department pays how much for what kinds of services is unimportant.  It is indeed pragmatic and very important.  But I do want to highlight how easy it is to overlook that privileging certain sets of data automatically demotes other data.  Legal managers need to keep in mind the subjective value judgments made everywhere in the data value chain and that different value judgments would result in different data and possible managerial decisions.

Common problems in pre-processing data so software can work with it

Based on structured interviews with a small, convenience sample of seven visual analysts, the authors of an academic paper [Source: Victoria Lemieux et al., Meeting Big Data challenges with visual analytics, Records Mgt. J. 24(2), July 2014 at 127] identified these themes in difficulties with pre-processing data.  I have added a gloss to each one.

• Unavailability of data [you can’t locate data or it was never compiled in the first place]

• Fragmentation of data [locating relevant data distributed across multiple databases, database tables and/or files is very time-consuming]

• Data quality [whoever gathered the information made mistakes or recorded something that has to be converted into a zero or a missing value indicator, or included extra spaces, for example]

– Missing values [it is not clear whether there were in fact no expenses on the matter or whether the person who recorded the data did not know the amount of the expenses]

– Data format [dates are notorious for being May 16, 1962 in one record, 05/16/62 in another, 05/16/1962 in a third and all kinds of other variations that require being standardized]

– Need for standardization [for example, some numbers have decimals, some have leading zeros, some are left justified with spaces at the right, some have commas, and so on]

• Data shaping [for example, in the R programming language the most common package to create plots is called “ggplot2”.  When you use it, the data ideally is in what is called “long form,” so you might need to shape the data before you plot it]

– For technical compatibility [perhaps this means that data stored as comma-separated values (.csv), for example, might need to be in Access database structure for Access to work]

– For better analysis [it may be that the way the data was read into memory stored a variable as character strings whereas the data scientist wants that variable to be a factor that has a defined number of levels]

• Disconnect between creation/management and use [the general point could be that someone in the law firm tracks something, but it is not useful beyond a narrow purpose]

• Record-keeping [this may refer to the important step of keeping a record of each step in the data collection and cleaning, i.e., reproducibility of research]

– General expression of need for record-keeping [perhaps a firm-wide or law department-wide statement or policy that data has value and we need to shepherd it]

– Version control [keeping track of successive iterations of the software that works on the data]
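Several of the data quality and data shaping items above can be sketched in a few lines of code. What follows is a minimal illustration in Python using only the standard library, not a method taken from the paper. The function names, the list of date formats, and the cutoff year for two-digit years are all assumptions chosen for the example.

```python
from datetime import datetime

# Formats assumed to occur in the raw records; extend the list as others turn up.
DATE_FORMATS = ["%B %d, %Y", "%m/%d/%Y", "%m/%d/%y"]

# Assumption: no record is dated after this year, so a two-digit year that
# parses to a later date (Python reads "62" as 2062) belongs to the prior century.
LATEST_PLAUSIBLE_YEAR = 2030


def standardize_date(raw):
    """Return an ISO date string (YYYY-MM-DD), or None if no format matches."""
    text = raw.strip()
    for fmt in DATE_FORMATS:
        try:
            parsed = datetime.strptime(text, fmt).date()
        except ValueError:
            continue
        if fmt.endswith("%y") and parsed.year > LATEST_PLAUSIBLE_YEAR:
            parsed = parsed.replace(year=parsed.year - 100)
        return parsed.isoformat()
    return None


def standardize_amount(raw):
    """Drop padding spaces, commas and leading zeros; None marks a missing value."""
    text = raw.strip().replace(",", "")
    if text == "":
        return None  # kept distinct from a true zero
    return float(text)


def to_long(rows, id_field):
    """Reshape one row per entity into one row per (entity, variable, value)."""
    long_rows = []
    for row in rows:
        for name, value in row.items():
            if name != id_field:
                long_rows.append(
                    {id_field: row[id_field], "variable": name, "value": value}
                )
    return long_rows
```

Returning None for a blank amount, rather than zero, preserves the distinction drawn under “Missing values”: an unknown expense is not the same as no expense.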

Track and analyze the “surface area” of your lawyers’ contacts with individual clients

Legal managers look for available but overlooked data that can sharpen their business judgment.  One data set that might be new is “surface area”: how many individual clients interact with lawyers during a period of time, either within the organization for law departments or at organizational clients for law firms.  Surface area doesn’t just track senior clients; it tracks all clients.  The more clients who have dealings with a lawyer each quarter, the larger the contact surface area and presumably the better the law department or law firm both knows and responds to clients.  Widespread connections – a large surface area for the law department or law firm – assures that clients are finding the lawyers valuable.  It also keeps the lawyers more in touch with business realities, rather than lost in the myopia of purely legal developments.

True, the lawyers might need to tally a few individual clients on their own, but tools exist to capture much of the data.  What comes to mind is software that extracts names of clients from the lawyers’ emails.  For a partner in a firm, email traffic with [name]@[client].com would be fairly easy to pull out and keep track of; for an associate general counsel in a company, the same type of filter would make internal email traffic even easier to spot and count.  Another source could be invitation lists to meetings.
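As a rough sketch of the email filter described above, the Python below counts the distinct addresses at one client domain that appear in a lawyer’s email headers each quarter. The function names, the input shape (pairs of a sent date and header text), and the address pattern are assumptions made for illustration; a real mail system would need its own export step, and the pattern is deliberately simple.

```python
import re
from collections import defaultdict
from datetime import date

# Simplified address pattern; an assumption, not a full RFC 5322 parser.
ADDRESS = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")


def quarter_of(d):
    """Label a date with its calendar quarter, e.g. 2016-Q1."""
    return f"{d.year}-Q{(d.month - 1) // 3 + 1}"


def surface_area(messages, client_domain):
    """Count distinct addresses at client_domain seen in each quarter.

    messages: iterable of (sent_date, header_text) pairs.
    """
    contacts = defaultdict(set)
    suffix = "@" + client_domain.lower()
    for sent, header_text in messages:
        for address in ADDRESS.findall(header_text):
            if address.lower().endswith(suffix):
                contacts[quarter_of(sent)].add(address.lower())
    return {q: len(people) for q, people in sorted(contacts.items())}
```

Because each quarter holds a set, a client who exchanges fifty emails counts once, which matches the idea of surface area as breadth of contact rather than volume. Addresses at subdomains of the client would not be counted by this exact-domain match.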

Analyses of data on client contacts would focus on changes over time and distribution, and could also fuel social network insights.  For the network graphs, it would be useful to categorize clients by level or position.

Law firms and departments are not dealing with “Big Data”

“Big Data” has no accepted formal definition, as its parameters are difficult to pinpoint, according to Victoria Lemieux et al., “Meeting Big Data challenges with visual analytics,” Records Mgt. J. 24(2), July 2014 at 122 [citations omitted].  An extremely large amount of data is a prerequisite, but “At what volume data become big remains an open question, however, with some suggesting that it comprises data at the scale of exabytes” or larger.  An exabyte is a billion gigabytes.  Others look at volume in terms of manageability by standard software: “data ‘with sizes beyond the ability of commonly used software tools to capture, curate, manage, and process the data within a tolerable elapsed time’”.  The legal industry, aside perhaps from a rare and ginormous e-discovery mountain, faces modest volumes of data.

“Other definitions of Big Data emphasize not just the sheer volume of data, but also its velocity (speed of data in and out), and variety (range of data types and sources). Some writers also include veracity (the biases, noise and abnormality in data) as an additional defining characteristic.”  The legal industry’s data arrives at a relative snail’s pace, in traditional garb, and what pertains to management issues consists of text or numbers from relatively few sources.  Veracity is an add-on to the common parameters of volume, velocity, and variety.

The data challenges for the legal industry are still formidable, even if the loosely-defined and ubiquitous term “Big Data” does not apply.

Data tracking is a prerequisite to data analysis, but not the same thing

Your law firm or law department must track numbers before you can analyze them.  That’s obvious, but for some people the two activities, collecting and interpreting, get conflated.  Consider The General Counsel’s Technology Report: Top Trends Impacting Legal & Contract Management, released in 2016 by Apttus.

Of the respondents to its survey, 50% “admit to lacking the necessary insight into critical contract information, including cycle times, number of active contracts and term success rates.”  That says to me that half the departments don’t even track the numbers they might need to perform an analysis.  “Insight” here is a fancy word for not even keeping totals.

When the report also states that among its respondents “1 in 4 Legal Departments see analytics as the fastest growing technology trend, yet only 19% have any sort of analytics tool in place” the report blends counting and calculating.  Legal data analytics presupposes the availability of figures that have been compiled, such as the number of active contracts, but it goes beyond collection: it makes predictions, shows correlations, classifies or clusters so that legal managers learn something from that data.
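The difference between counting and calculating can be made concrete with a small sketch in Python. The contract records below are made up purely for illustration, and the function names are hypothetical; nothing here comes from the Apttus report. The first function only tracks (a count and an average), while the second analyzes, asking whether cycle time moves together with contract value.

```python
from math import sqrt

# Made-up records for illustration: (contract value in $ thousands, cycle time in days).
CONTRACTS = [(50, 12), (120, 20), (300, 41), (80, 15), (500, 60)]


def track(contracts):
    """Tracking: totals and averages that describe what was collected."""
    count = len(contracts)
    mean_cycle_days = sum(days for _, days in contracts) / count
    return count, mean_cycle_days


def pearson(pairs):
    """Analysis: Pearson correlation between the two values in each pair."""
    n = len(pairs)
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    spread_x = sqrt(sum((x - mean_x) ** 2 for x, _ in pairs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for _, y in pairs))
    return cov / (spread_x * spread_y)


active_count, mean_cycle_days = track(CONTRACTS)
value_vs_cycle = pearson(CONTRACTS)
```

A department that cannot produce the first pair of numbers has nothing to feed into the second calculation, which is the sense in which tracking is a prerequisite.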

Time records of in-house counsel are suspect sources of data for analysis

Data scientists in law departments may have time records to analyze, but it is questionable whether they should “take the time” (pun intended).  The practice is riddled with methodological impurities as well as managerial challenges.

A fair number of larger law departments have their lawyers track time; some estimates place that number in the 20-30 percent range.  Many of those departments, however, use the hours-worked information only for internal purposes such as to manage workloads, understand the kinds of matters they handle, and evaluate performance.

Some departments that track internal lawyer time, however, charge time back to clients.  A chargeback system has drawbacks: it’s a hassle for the lawyers, it requires someone to handle the “billing,” it might dissuade a needy but impecunious client from getting legal advice, it’s just an intra-company transfer, and it might spawn inaccurate charges.  Perversely, if you bill clients for internal lawyer time, you dissuade them sometimes from seeking legal counsel or you lure them into forum-shopping (taking work to a lawyer who under-records time).  [We note that some utilities are required by state regulatory agencies to allocate their lawyers’ time between regulated and non-regulated activities so that the rate base is not skewed.]

Defenders of charging time to clients have their arguments: it gives some market discipline to what otherwise is a free, and possibly abused, service (but only if clients can contest charges); it adds another overseer of lawyer productivity – the client charged; it allows a law department to negotiate levels of service (though hours worked is as pernicious as hourly billing by law firms); and it gives quantitative backup to a plea to hire more lawyers (although no confirmation is possible).   Worse, some law departments require everyone to record eight hours a day, which makes a mockery of any claim that the data matches reality.

Collect past data going back a few years, but do so carefully

The first time a law firm or law department decides to collect a certain kind of data, legal managers should also decide whether to go back in time for past data. We know the number of EEOC charges handled in 2016, but what about in 2015 and 2014?  Such retrospective data raises a set of concerns and challenges.

The farther back you go, the harder it is to collect accurate data and to feel comfortable that the data has been consistently collected over the period. For example, to figure out how many summer interns stayed more than six weeks becomes harder the more years you go back because full-time-equivalent information might not have been logged.

Maintaining a consistent definition back through time, where conditions possibly were changing, also limits the reach back. For example, it may be problematic to collect costs of e-discovery going back several years because the technology, staff skills, and procedural rules transformed so much during that time.

As to who should do the retrospective collection, it is a better practice to have one person in charge so that they develop a sense of similar treatment.

For a final step, once the older data has been assembled, it is good to graph it and see whether the visual trend line makes sense to a subject matter expert (SME). Another technique is to have an SME at the beginning of the project estimate what they think the numbers will be in the prior years. Yes, those are subjective estimates, but at least they give a basis for testing the numbers collected against someone’s a priori surmise. Obviously, too, the firm or department needs to evaluate whether the value of the data exceeds the cost of reconstructing it.
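Testing collected numbers against an SME’s prior estimates can be as simple as flagging the years where the two diverge sharply. The Python below is a minimal sketch under stated assumptions: the function name, the dictionary inputs keyed by year, and the 25 percent tolerance are all hypothetical choices, not an established standard.

```python
def flag_surprises(collected, sme_estimates, tolerance=0.25):
    """Return years where the collected figure differs from the SME's prior
    estimate by more than `tolerance`, expressed as a fraction of the estimate.

    Each result is (year, collected figure, estimate, relative gap).
    """
    flagged = []
    for year in sorted(collected):
        estimate = sme_estimates.get(year)
        if not estimate:
            continue  # no estimate, or an estimate of zero, to compare against
        gap = abs(collected[year] - estimate) / estimate
        if gap > tolerance:
            flagged.append((year, collected[year], estimate, round(gap, 2)))
    return flagged
```

A flagged year does not mean the collected number is wrong. It marks where someone should look again at how the figure was reconstructed, or at why the SME’s recollection differs.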

One final note: whatever the decisions made during collection and whatever the methods, someone needs to carefully keep track of them so that someone else can audit the process or improve it if that appears appropriate.