Common problems in pre-processing data so software can work with it

Based on structured interviews with a small, convenience sample of seven visual analysts, the authors of an academic paper [Source: Victoria Lemieux et al., Meeting Big Data challenges with visual analytics, Records Mgt. J. 24(2), July 2014 at 127] identified the following themes in the difficulties of pre-processing data.  I have added a gloss to each one.

• Unavailability of data [you can’t locate data or it was never compiled in the first place]

• Fragmentation of data [locating relevant data distributed across multiple databases, database tables and/or files is very time-consuming]

• Data quality [whoever gathered the information made mistakes or recorded something that has to be converted into a zero or a missing value indicator, or included extra spaces, for example]

– Missing values [it is not clear whether there were in fact no expenses on the matter or whether the person who recorded the data did not know the amount of the expenses]

– Data format [dates are notorious for being May 16, 1962 in one record, 05/16/62 in another, 05/16/1962 in a third and all kinds of other variations that require being standardized]

– Need for standardization [for example, some numbers have decimals, some have leading zeros, some are left justified with spaces at the right, some have commas, and so on]

• Data shaping [for example, in the R programming language the most common package to create plots is called “ggplot2”.  When you use it, the data ideally is in what is called “long form,” so you might need to shape the data before you plot it; the sketch after this list illustrates that reshaping, along with the date clean-up mentioned above]

– For technical compatibility [perhaps this means that data stored as comma-separated values (.csv), for example, might need to be imported into an Access database structure before Access can work with it]

– For better analysis [it may be that the way the data was read into memory stored a variable as character strings whereas the data scientist wants that variable to be a factor that has a defined number of levels]

• Disconnect between creation/management and use [the general point could be that someone in the law firm tracks something, but it is not useful beyond a narrow purpose]

• Record-keeping [this may refer to the important step of keeping a record of each step in the data collection and cleaning, i.e., reproducibility of research]

– General expression of need for record-keeping [perhaps a firm-wide or law department-wide statement or policy that data has value and we need to shepherd it]

– Version control [keeping track of successive iterations of the software that works on the data]
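To make a couple of these problems concrete, here is a minimal sketch in R (the language mentioned above in connection with ggplot2) that standardizes inconsistent date formats and reshapes a small table into long form.  The data frame, its column names, and the figures are invented for illustration, and the sketch assumes the tidyr package is installed.

```r
# A minimal sketch of two clean-up steps: standardizing dates and reshaping
# to "long form" for plotting. All data here is made up for illustration.
library(tidyr)    # for pivot_longer()

matters <- data.frame(
  matter   = c("M-001", "M-002", "M-003"),
  opened   = c("May 16, 1962", "05/16/62", "05/16/1962"),  # inconsistent dates
  fees     = c(12000, 8500, 4300),
  expenses = c(1500, NA, 300)                              # NA = missing value
)

# Standardize the dates: try each known format and keep whichever parses.
# Note that two-digit years are ambiguous (R maps "62" to 2062), one more
# reason dates are notorious and should be standardized at the source.
formats <- c("%B %d, %Y", "%m/%d/%y", "%m/%d/%Y")
parse_date <- function(x) {
  for (f in formats) {
    d <- as.Date(x, format = f)
    if (!is.na(d)) return(d)
  }
  NA
}
matters$opened <- as.Date(sapply(matters$opened, parse_date),
                          origin = "1970-01-01")

# Reshape to long form: one row per matter per type of spending
long <- pivot_longer(matters, cols = c(fees, expenses),
                     names_to = "type", values_to = "amount")
long
```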

Predicting a Court’s decision: the natural language processing steps

Earlier we introduced a machine learning study that predicted decisions of a European court regarding human rights violations.  The data preparation started with two kinds of text analysis.  My translation of their article borrows heavily from their writing.

Finding frequent words or combinations of words — N-gram features: “The Bag-of-Words (BOW) model is a popular semantic representation of text used in NLP.  In a BOW model, … text is represented as the bag … of its words (unigrams) or N-grams without taking into account grammar, syntax and word order.”

In short, as is common when using Natural Language Processing [NLP], they decomposed the text of each decision into a “bag” of single words and short runs of consecutive words.  Doing so ignores the actual meaning, order, and parts of speech of the words in the original.  A “unigram” is a single word, whereas an N-gram is a run of two, three, four, or more consecutive words (the “N”).  The researchers went as far as 4-grams.  So, you could go through a U.S. Supreme Court decision and find each word (unigram), each doublet of consecutive words, each triplet, and each four-in-a-row combination (2-grams, 3-grams, and 4-grams) and create a bag of words without noun designations, verb designations, and so forth.
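To make the idea concrete, here is a minimal sketch in base R of building such a bag from a single invented sentence.  It illustrates the general technique only; it is not the researchers’ own code.

```r
# A minimal sketch: build a "bag" of unigrams through 4-grams from one
# invented sentence (not text from an actual opinion)
text  <- "the court finds that the applicant's rights were violated"
words <- unlist(strsplit(tolower(text), "\\s+"))

# All runs of n consecutive words
ngrams <- function(words, n) {
  if (length(words) < n) return(character(0))
  sapply(seq_len(length(words) - n + 1),
         function(i) paste(words[i:(i + n - 1)], collapse = " "))
}

bag <- unlist(lapply(1:4, ngrams, words = words))
sort(table(bag), decreasing = TRUE)   # how often each 1-, 2-, 3-, and 4-gram appears
```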

Creating a vector space representation:  Then the researchers created what you might visualize as a row in a spreadsheet for each opinion and a column for each word, doublet, triplet, etc. in it.  If you think of a 0 in a cell where the particular opinion did NOT contain the N-gram and a 1 where it DID, you have what statisticians call a matrix (a rectangle of zeros and ones as tall as your number of texts (opinions) and as wide as your total number of N-grams).  It would likely be called a “sparse matrix” because most N-grams show up in only one or two opinions; lots and lots of cells would hold a 0, hence the “sparse” descriptor.  As they succinctly stated this step, “That results in a vector space representation where documents are represented as m-dimensional variables over a set of m N-grams.”  The term “vector space representation” describes a huge multi-dimensional space in which each axis corresponds to one N-gram and each opinion sits as a point, positioned by the N-grams it contains.  Most axes matter to only one or two opinions, because their N-gram appears rarely; a few axes, for words like articles and prepositions, matter to nearly every opinion because those words show up everywhere (incidentally, NLP researchers usually strip out those “stop words” since they add no information).

For machine learning, the case opinions (the points in that hyperspace) are usually called “observations” and the N-gram columns (the axes, or dimensions, of the space) are usually called “features.”  As the researchers wrote, “N-gram features have been shown to be effective in various supervised learning tasks,” which refers to the machine learning algorithm described later and its task.

“For each set of cases in our data set, we compute the top-2000 most frequent N-grams where N ∈ {1, 2, 3, 4}.”  Hence, they went only as far as combinations of four consecutive words and kept only the 2,000 N-grams used most often in the opinions.  [The symbol ∈, the trident facing right, is set notation for saying “The values of N come from the set of 1, 2, 3, and 4.”]  “Each feature represents the normalized frequency of a particular N-gram in a case or a section of a case.”  “Normalized” means they converted the number of times each N-gram showed up to a common scale, such as from 0 to 1.

“This can be considered as a feature matrix, C ∈ ℝ^(c×m), where c is the number of the cases and m = 2,000.”  The researchers extracted N-gram features for each of five sections in the opinions, since all opinions of the Court are written in the same format, as well as for the full text.  They refer to this huge array of normalized frequencies as their C feature matrix (case features from N-grams).
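As a rough illustration of turning bags of N-grams into such a feature matrix, here is a sketch in base R.  The three tiny “cases,” the N-grams in them, and the choice to retain only three features are all invented, and the normalization shown (dividing by the largest count) is just one simple possibility, not necessarily the researchers’ method.

```r
# A rough sketch, not the researchers' code: turn per-opinion N-gram counts
# into a small normalized feature matrix, keeping only the most frequent
# N-grams. 'bags' is a made-up list of N-gram vectors, one per opinion.
bags <- list(
  case1 = c("the court", "rights", "violated", "the court"),
  case2 = c("rights", "no violation", "the court"),
  case3 = c("violated", "rights", "rights")
)

top_m <- 3   # the study kept the top 2,000; 3 keeps this example readable
vocab <- names(sort(table(unlist(bags)), decreasing = TRUE))[1:top_m]

# One row per case, one column per retained N-gram, cell = raw count
C <- t(sapply(bags, function(b) table(factor(b, levels = vocab))))

# One simple way to normalize each cell to the 0-1 range
C <- C / max(C)
C
```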

An explanation of research on machine learning software predicting court decisions

Legal managers should be intrigued by the fledgling capabilities of software to predict decisions of courts.  Assessing the likelihood and power of such a capability in the real world, however, calls for a manager to understand the tools that data scientists might deploy in the prediction process.  Fortunately, the ABA Journal cited and discussed a study published in October 2016 that offers us a clear example.

Four researchers, including three computer scientists and a lawyer, used machine learning software to examine 584 cases before the European Court of Human Rights.  They found that the court’s judgments as to whether a plaintiff’s rights had been violated correlated more strongly with the facts of a case than with the legal arguments.  Given only the facts, their model predicted the court’s decisions with an average accuracy of 79 percent.

The article’s software-laden explanation of their method and tools makes for very heavy reading, so what follows attempts humbly and respectfully and certainly not infallibly to translate their work into plainer English.  The starting point is their abstract’s summary.  They “formulate a binary classification task where the input of our classifiers is the textual content extracted from a case and the target output is the actual judgment as to whether there has been a violation of an article of the convention of human rights. Textual information is represented using contiguous word sequences, i.e., N-grams, and topics.”  Additionally, “The text, N-grams and topics, trained Support Vector Machine (SVM) classifiers… [which] applied a linear kernel function that facilitates the interpretation of models in a straightforward manner.”
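The Support Vector Machine step comes later in their pipeline, but as a rough sketch of what fitting a linear-kernel SVM classifier looks like in R (assuming the e1071 package is installed), consider the following.  The feature matrix and labels below are random stand-ins, not the study’s data.

```r
# A minimal sketch of a linear-kernel SVM classifier; everything here is an
# invented stand-in for the researchers' feature matrix and outcomes
library(e1071)

set.seed(1)
X <- matrix(runif(100 * 20), nrow = 100)       # 100 "cases", 20 N-gram features
y <- factor(sample(c("violation", "no_violation"), 100, replace = TRUE))

model <- svm(x = X, y = y, kernel = "linear")  # linear kernel, as in the study
predict(model, X[1:5, ])                       # predicted outcomes for 5 cases
```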

First, let’s get a handle on how the prediction model dealt with the text of the opinions (Natural Language Processing), then we will look at the software that classified the text to predict a violation or not (Support Vector Machine).

Few legal-industry surveys sample from a representative population

To respect and rely on the findings of a legal industry survey, legal managers should be able to find in the survey report the number of people who answered the survey (the sample, respondent number or sometimes just “N”), the number of people who were invited to answer the survey (the population), and how the surveyor developed that population of invitees.

Focus on that last disclosure, which basically concerns the representativeness of the survey population.  If a company that sells time and billing software to law firms writes to its customers and asks them “Do you find software technology valuable for your firm?,” no one should be surprised if the headline of the vendor’s report boasts “Nine out of ten law firms find software technology valuable!”  Aside from the binary choice of the question and the vendor’s blatant self-interest in promoting sales of software, the crucial skew in the results arises from the fact that the people invited to complete the survey hardly mirror people in law firms generally.  They have licensed or at least know about time and billing software.  The deck was stacked, the election was rigged.

Unfortunately, all too often vendor-sponsored surveys go out to invitees who have some connection with the vendor and therefore are hardly representative of law firm lawyers and staff as a whole.  The invitees will almost certainly be on the vendor’s contact list or its newsletter recipients or those who visit the vendor’s website and register.  Only sometimes will a vendor develop or rent a much larger mailing list and reach out to its names.  Even if they do, respondents will likely be self-selected because they use that kind of software or service or have some level of awareness of it.

Track and analyze the “surface area” of your lawyers’ contacts with individual clients

Legal managers look for available but overlooked data that can sharpen their business judgment.  One data set that might be new is “surface area”: how many individual clients interact with lawyers during a period of time, either within the organization for law departments or at organizational clients for law firms.  Surface area doesn’t just track senior clients, it tracks all clients.  The more clients who have dealings with a lawyer each quarter, the larger the contact surface area and presumably the better the law department or law firm both knows and responds to clients.  Widespread connections – a large surface area for the law department or law firm – suggest that clients are finding the lawyers valuable.  They also keep the lawyers more in touch with business realities, rather than lost in the myopia of purely legal developments.

True, the lawyers might need to tally a few individual clients on their own, but tools exist to capture much of the data.  What comes to mind is software that extracts the names of clients from the lawyers’ emails.  For a partner in a firm, email traffic with [name]@[client].com would be fairly easy to pull out and keep track of, as in the sketch below; for an associate general counsel in a company, the same type of filter applied to internal email traffic would be even easier.  Another source could be invitation lists to meetings.
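Here is a minimal sketch of that email-filter idea in R; the addresses and the client.com domain are invented for illustration.

```r
# Count distinct individual client contacts from a vector of email addresses;
# the addresses and the "client.com" domain are made up for illustration
emails <- c("Jane.Doe@client.com", "bgreen@client.com",
            "jane.doe@client.com", "counsel@otherfirm.com")

client_mail  <- emails[grepl("@client\\.com$", emails, ignore.case = TRUE)]
surface_area <- length(unique(tolower(client_mail)))
surface_area   # 2 distinct individual client contacts in this period
```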

Analyses of data on client contacts would focus on changes over time and distribution, and could also fuel social network insights.  For the network graphs, it would be useful to categorize clients by level or position.

Handling extreme values with Winsorizing and trimming means

Legal managers need to be sensitive to data that has extreme values.  Such very high or very low numbers in a distribution of numbers (meaning, the set of numbers) can result in a skewed representation of the average (arithmetic mean, in statistical terminology).  Those who analyze data have many ways to handle extreme values, with the best known one being to calculate the median of the distribution.  But let’s consider two others: Winsorizing the distribution and trimming the distribution.

We can return to Corp. Counsel, Oct. 2016 at 44, and its table that shows counts of U.S. law firms that “turn up the most in court documents.”  We added the number of lawyers in each firm and found that the arithmetic mean is 896.7 lawyers.

To lessen the influence of outliers, the distribution could be “Winsorized.”  When you Winsorize data, the tail values at the extremes are set equal to the values at specified percentiles of the data.  For a 90 percent Winsorization, the bottom 5 percent of the values are set equal to the value at the 5th percentile while the top 5 percent of the values are set equal to the value at the 95th percentile.  This adjustment is different from throwing some of the extreme data away, which happens with trimmed means.  Once you Winsorize your data, your median will not change but your average will.  The Winsorized mean of this data is 892.7.
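Here is a minimal sketch of a 90 percent Winsorization in base R; the vector of firm sizes is invented, not the figures from the Corporate Counsel table.

```r
# A 90 percent Winsorization: clamp both tails to the 5th and 95th
# percentiles without discarding anything. The firm sizes are made up.
lawyers <- c(120, 250, 400, 560, 700, 880, 950, 1200, 1800, 3100)

lo <- quantile(lawyers, 0.05)
hi <- quantile(lawyers, 0.95)
winsorized <- pmin(pmax(lawyers, lo), hi)

mean(lawyers)      # ordinary mean
mean(winsorized)   # Winsorized mean
```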

A trimmed mean calculation lops off the designated percent of firms at the top (and the same percent of firms at the lowest end of the distribution of lawyer sizes).   In short, trimming is done by equal amounts at both ends to minimize the bias of the result.  The trimmed mean of this distribution, lopping off 5% at each end (rounding if necessary), is 880.7.
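For a trimmed mean, base R’s mean() function does the lopping itself through its trim argument; again the firm sizes below are invented.

```r
# The trim argument drops the stated fraction from each end before averaging
lawyers <- c(120, 250, 400, 560, 700, 880, 950, 1200, 1800, 3100)
mean(lawyers, trim = 0.05)   # 5 percent trimmed mean
```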

Try a snowball survey so you get more participants in a client satisfaction or legal industry survey

Most law departments, when inviting their clients to complete a satisfaction survey, select recipients at or above a certain level, such as all “Managers,” or “everyone above comp level 15.”  It would be interesting and enlightening for a department to try a “snowball survey.”

Send the questionnaire form (or an email with its online equivalent) to a relatively few, high-level clients. Ask them to complete the form and also to forward the blank form to three colleagues who have worked recently with the law department (or forward the email invitation to those colleagues).  Each recipient, in turn, is also invited to extend the survey’s reach, and thus the snowball grows.

A service provider in the legal industry could adopt the same tactic: invite everyone you can reach to take a survey, but urge them to send it on to others they know who would have something to say about the survey’s topic.  Now, some surveyors may reject the snowball approach because they want to control who is possibly in the participation group.  But a broader-minded and more objective aim would be to sample as many participants as possible and thereby gain a more accurate understanding of the entire population.

Watson requires huge quantities of text to formulate learnings

In February 2016, the accounting giant KPMG announced that it had been working with IBM Watson, one of the most advanced artificial intelligence technology platforms available.  An article describes Watson briefly:  “It works by using natural language processing and machine learning to reveal insights and information from huge quantities of unstructured data.” [emphasis added]   Notably, over a period of a few years Watson has digested hundreds of thousands of medical research papers on cancer and thereafter shown itself capable of matching the diagnoses of experts and suggesting new therapies.  According to the TV show 60 Minutes, eight thousand cancer papers are published every day!

A handful of law firms have announced that they are using Watson’s algorithms.  One firm (Baker & Hostetler) has, it sounds to me, directed Watson to parse thousands of cases, law review articles, and briefs in the bankruptcy area.  Whether that corpus of documents provides enough grist for Watson’s mill, since it is an order of magnitude or two smaller than the oncology set, remains to be seen.

My point is that the vast pools of text necessary for Watson to hone its skills to a proficient level may be rare in the legal industry.  And, related to that point, experienced lawyers may need to devote hours and hours to coding some of the textual material so that Watson can pick out which patterns are most likely to be present in the coded results.

Polls and surveys, what is the difference?

What is the difference between a “poll” and a “survey”?  One commentator said that polls tend to focus on single questions, like a referendum in politics that asks “yes” or “no” or a more elaborate question that offers a menu of possible answers, whereas surveys ask sets of questions that can increase the coverage and reliability of the results.  Another definition suggests that poll results appeal to a wider public – “Who should be the All-Star first baseman?” – whereas surveys better fit the needs of academicians (or in the legal industry, vendors) who want to emphasize the scientific or scholarly character of their work.  A survey regarding machine learning software used by law departments and law firms would be an example.

So in short, a poll is generally used to ask one simple question, while a survey is generally used to ask a wide range of questions.  JurisDatoris may poll its readers someday, asking a single question such as “What is the most common target of data analysis in your job?”  By contrast, this blog will constantly discuss findings and methodology of surveys that ask many questions.

Create a choropleth to display data by State, country, region

When legal managers want to present data by State or by country, they can make good use of what is called a “choropleth”.  Choropleths are maps that color their regions in proportion to the count or other statistic of the variable being displayed, such as the number of pending lawsuits per State or the amount spent on outside counsel by country.  Darker shades typically indicate higher values in a region and lighter shades indicate lower ones.

Below is an example of a choropleth that appears in Exterro’s 2016 Law Firm Benchmarking Report at page 8.  It shows how many of the 112 survey participants come from each state.

[Choropleth from Exterro’s 2016 Law Firm Benchmarking Report: survey participants by state]

California is the darkest with 21; the grey states had no participants.  The table below the map, which is truncated in this screen shot, gives the actual numbers by State, so someone could carp that the choropleth sweetens the eye but adds no nutritional information.  Still, it looks pretty good and it is an unusual example of an effective graphical tool.
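For managers who want to try a similar map themselves, here is a minimal sketch of a choropleth in R with ggplot2 (it assumes the maps package is installed).  The participant counts below are invented except California’s 21, which comes from the Exterro report.

```r
# A minimal choropleth sketch; requires ggplot2 and the 'maps' package.
# Counts are illustrative only (California's 21 is from the Exterro report).
library(ggplot2)

counts <- data.frame(
  region       = c("california", "texas", "new york"),
  participants = c(21, 9, 12)
)

states <- map_data("state")                      # state boundary polygons
choro  <- merge(states, counts, by = "region", all.x = TRUE)
choro  <- choro[order(choro$order), ]            # preserve polygon drawing order

ggplot(choro, aes(long, lat, group = group, fill = participants)) +
  geom_polygon(color = "grey70") +
  coord_quickmap() +
  scale_fill_continuous(na.value = "grey90")     # grey = no participants
```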