Turn Multi-Question Responses into a Dummy-Variable Matrix

One vital step in analyzing a multi-choice question is to create a variable for each potential selection. The dummy variable for each selection is coded “1” if the respondent checked it and “0” if not.

Think of a spreadsheet where each row holds a person’s answers to the question. If the only question they answered was the multi-choice question, they will have one column to the right of their name for each available selection, with a “0” in a column if they did not pick that selection and a “1” if they did. The sheet would have as many rows as respondents, and each row would show a pattern of “0”s and “1”s corresponding to the options not selected or selected. All those “0”s and “1”s form a matrix, a rectangular array of numbers.

For an example of a “check all that apply” question, a multi-choice question, the snippet below shows the results from respondents who could check any of six available selections. The percentage inside the top bar tells us that 62% of the respondents picked that selection, so a “1” shows up in its dummy variable; for the remaining 38% of the respondents, the column would hold a “0”.

It is entirely possible to have software count the number of times each selection was checked, but analysts often convert multi-choice responses into binary matrices, populated only with “0”s and “1”s, so that software can carry out more elaborate calculations. For a simple example, the binary matrix shown below has a “RowSum” column on the far right that adds up the “1”s in the columns to its left. The first respondent selected two roles, Role1 and Role3, so “1”s appear in those two cells and the “RowSum” equals 2.
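As a rough illustration of that conversion, here is a minimal sketch in R. The respondent names, role labels, and comma-separated answers are hypothetical, but the pattern of recoding and summing would be the same for a real survey export.

```r
# A minimal sketch in R (respondent names, role labels, and answers are hypothetical).
# Each respondent's multi-choice answer arrives as a comma-separated string; we
# recode it into a 0/1 dummy-variable matrix and add a RowSum column.
responses <- data.frame(
  Name   = c("Resp1", "Resp2", "Resp3"),
  Answer = c("Role1, Role3", "Role2", "Role1, Role2, Role3"),
  stringsAsFactors = FALSE
)

roles  <- c("Role1", "Role2", "Role3")          # the selections offered by the question
picked <- strsplit(responses$Answer, ",\\s*")   # split each answer into its selections

# Code a 1 if the respondent checked the role, a 0 if not
dummies <- t(sapply(picked, function(p) as.integer(roles %in% p)))
colnames(dummies) <- roles

binary <- cbind(responses["Name"], dummies)
binary$RowSum <- rowSums(dummies)               # number of selections each respondent made
binary
```

Run as written, the first row shows “1”s under Role1 and Role3 and a RowSum of 2, matching the example described above.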

Multi-Answer, Multiple-Choice Questions in Surveys

Research surveys by law firms ask multiple-choice questions much more frequently than any other style of question, because it is easier to analyze answers selected from a list or from a drop-down menu. Not only are they common, but multiple-choice questions often permit respondents to mark more than one selection. These multi-questions, as we will refer to them, have instructions such as “Choose all that apply” or “Pick the top three.” The image below, from page 11 of a 2015 survey report by King & Wood Mallesons, states in the footnote that “Survey participants were able to select multiple options.” Thus, participants could have chosen anywhere from a single selection up to 10 selections.

To get a sense of how many multi-questions show up, we picked four survey reports we recently found and counted how many multi-questions they asked, based on the plots their reports presented. The surveys are Kilpatrick Townsend CyberSec 2016, King Wood AustraliaDirs 2015, Littler Mendelson Employer 2018, and Morrison Foerster ConsumerProd 2018. In that order they have 7 multi-questions in 24 non-Appendix pages, 4 in 36 pages, 8 in 28 pages, and 4 in 16 pages. Accordingly, results from at least 23 multi-questions appeared in 104 pages. Bear in mind that each report has a cover and a back page with no plots, and almost always other pages without plots, so the total number of survey questions asked is always less than the number of report pages.

While multi-questions certainly allow more nuanced answers than “Pick the most important…” questions, for example, and create much more data, those more complicated pools of data challenge the survey analyst to decide how best to interpret and present them.

A number of analytic approaches enable an analyst to describe the results, to glean deeper insights from the selection patterns, and to depict them graphically. We will explore those techniques.

Average number of pages in reports by originating law firm’s geography

Over the period from 2013 to the present, we have found 154 research surveys where a law firm conducted or sponsored the survey and a PDF report was published. That group includes 55 different law firms.

We categorized the firms according to five geographical bases: United States firms, United Kingdom firms, vereins, combinations of U.S. and U.K. firms (“USUK”), and the rest of the world (“RoW” — Australia, Canada, Israel, New Zealand, and South Africa). We thought we would find that the largest firms, either the vereins or the USUK firms, would write the longest reports. Our reasoning was that they could reach more participants and could analyze the more voluminous data more extensively (and perhaps add more marketing pages about themselves).

Quite true! As can be seen in the table below, the average and median page counts for the five geographical groupings stand at roughly similar levels; nevertheless, the two large classes of firms do indeed produce somewhat longer reports. How many surveys are included in each category is shown in the column entitled “Number.”

GeoBase   Number   AvgPages   MedianPages
RoW           13       25.0          20.0
UK            41       24.1          20.0
US            78       22.5          19.0
USUK          17       30.2          22.0
Verein         5       27.6          28.0
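For readers who want to see the mechanics, here is a minimal sketch in R of how such group summaries can be computed. The tiny data frame is a hypothetical stand-in for our 154-report data set.

```r
# A minimal sketch in R; the data frame below is a hypothetical stand-in for
# the full set of 154 reports, each with its GeoBase category and page count.
reports <- data.frame(
  GeoBase = c("US", "US", "UK", "UK", "USUK", "RoW", "Verein"),
  Pages   = c(19, 26, 20, 28, 30, 25, 28)
)

table(reports$GeoBase)                                    # Number of reports per grouping
round(tapply(reports$Pages, reports$GeoBase, mean), 1)    # AvgPages per grouping
tapply(reports$Pages, reports$GeoBase, median)            # MedianPages per grouping
```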

We tested the difference between the average number of pages for the USUK reports and the average for the US reports. We selected those two groups because they showed the largest gap (30.2 versus 22.5 pages).

A statistical test called the t-test looks at two averages and the dispersion of the values that make up each average. It tells you whether the difference between those averages is statistically significant, meaning that if random samples of survey reports were drawn repeatedly from law firms in each geography and the true averages were in fact equal, a gap of that size or larger would show up less than 5% of the time. If that threshold is not met, you cannot say that the difference is due to anything other than chance. If the threshold is met, statisticians say that the difference can be relied on, in that it is statistically significant. On our data, the t-statistic was 1.2 and the p-value was 0.24, well above the 0.05 threshold for statistical significance. The gap between the USUK average and the US average may look material, but on the data available we cannot conclude that something other than random variation accounts for it.
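For readers who want to reproduce the calculation, here is a minimal sketch in R of the test. The page counts below are invented stand-ins, chosen only so that the groups average roughly 30.2 and 22.5; the actual test would run on the lengths of the 17 USUK reports and the 78 US reports in our data set.

```r
# A minimal sketch in R of the Welch two-sample t-test described above. The
# page counts are hypothetical stand-ins; the real test would use the actual
# lengths of the 17 USUK reports and the 78 US reports.
set.seed(42)
usuk_pages <- c(44, 22, 35, 18, 52, 24, 16, 30, 41, 20, 28, 36, 19, 45, 23, 31, 29)
us_pages   <- pmax(4, round(rnorm(78, mean = 22.5, sd = 10)))  # simulated US page counts

t.test(usuk_pages, us_pages)   # reports the t-statistic and the p-value
```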

Descriptive statistics and the step beyond to predictive statistics

A fundamental distinction between two kinds of data analytics appears in a report published by KPMG, “Through the looking glass, How corporate leaders view the General Counsel of today and tomorrow” (Sept. 2016). The report observes that “Companies are making greater use of data analytics and are increasingly moving from descriptive analytics (where technology is used to compress large tranches of data into more user-friendly statistics) to predictive analytics and prescriptive models that extrapolate future trends and behavior.” (page 14)

Law firms and law departments can avail themselves of many kinds of software to summarize aspects of a data set. Descriptive statistics, as some call them, include calculating averages, medians, quantiles, and standard deviations. These “summary statistics,” yet another term for the basic calculations, are themselves simplified models of the underlying data. [Note that a “statistic” is properly a number calculated from underlying data. So, we calculate the variance statistic of this year’s invoices, where the underlying “raw” data is the data set of all the year’s invoices.]
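To make those terms concrete, here is a minimal sketch in R of the summary statistics named above, run on a hypothetical vector of invoice amounts that echoes the variance example in the note.

```r
# A minimal sketch in R of the descriptive statistics named above, run on a
# hypothetical vector of this year's invoice amounts.
invoices <- c(1200, 850, 3400, 2100, 975, 1800, 2600, 1450)

mean(invoices)       # average
median(invoices)     # median
quantile(invoices)   # quartiles (0%, 25%, 50%, 75%, 100%)
sd(invoices)         # standard deviation
var(invoices)        # the variance statistic mentioned in the note above
```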

Predictive statistics go farther than descriptive statistics. Using programs like open-source R and its lm function, you can easily fit a regression model that predicts the number of billable hours likely to be recorded by associates based on their practice group, years with the law firm, gender, and previous year’s billings, for example. Predictive analytic models allow the user to forecast numbers, not just describe them.
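A rough sketch of that kind of model in R follows. The data frame, its column names, and the simulated values are all hypothetical, so the point is the shape of the call to lm rather than the numbers it produces.

```r
# A minimal sketch in R of the regression described above. The data frame, its
# column names, and the simulated values are all hypothetical; a real model
# would be fit to the firm's own associate data.
set.seed(1)
associates <- data.frame(
  practice_group = sample(c("Litigation", "Corporate", "IP"), 100, replace = TRUE),
  years_at_firm  = sample(1:8, 100, replace = TRUE),
  gender         = sample(c("F", "M"), 100, replace = TRUE),
  prior_billings = rnorm(100, mean = 1700, sd = 200),
  billable_hours = rnorm(100, mean = 1800, sd = 150)
)

# Fit a linear regression predicting billable hours from the other variables
fit <- lm(billable_hours ~ practice_group + years_at_firm + gender + prior_billings,
          data = associates)
summary(fit)   # coefficients, p-values, R-squared

# Predict the hours of a hypothetical new associate
predict(fit, newdata = data.frame(practice_group = "IP", years_at_firm = 3,
                                  gender = "F", prior_billings = 1650))
```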


Surveys with fewer than 400 participants produce “ballpark” results at best

Findings from surveys can enlighten legal managers and sharpen their decisions, but only if the data reported by the organization that conducted the survey is credible. Among the many imperfections that can mar survey results, an immediately obvious one is sample size and its inverse effect on the margin of error of the results. Put simply, the smaller the sample of respondents, the more the results might diverge from the actual figure that would emerge if the entire population could be polled; the margin of error balloons. Conversely, with lots of participants the margin of error shrinks and the results are more likely to be representative of the whole population.
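To see why 400 participants is a common cut-off, here is a minimal sketch in R of the standard margin-of-error formula for a proportion at a 95% confidence level, using the worst case p = 0.5. The function name is ours, invented for this illustration.

```r
# A minimal sketch in R of the standard margin-of-error formula for a proportion,
# at a 95% confidence level and with the worst case p = 0.5. The function name
# is ours, invented for this illustration.
margin_of_error <- function(n, p = 0.5, z = 1.96) {
  z * sqrt(p * (1 - p) / n)
}

margin_of_error(100)    # about 0.098, roughly plus or minus 10 percentage points
margin_of_error(400)    # about 0.049, roughly plus or minus 5 percentage points
margin_of_error(1600)   # about 0.025, roughly plus or minus 2.5 percentage points
```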

The New York Times (Oct. 15, 2016, at A15) refers to voter surveys, but the statistical caveat applies equally to legal-industry surveys: “If the sample is less than 400, the result should be considered no more than a ballpark estimate.”

Sadly, many surveys by vendors to law firms and law departments fail to accumulate more than 400 participants. Worse, quite a few survey reports say nothing about how many participants they obtained, even if they provide demographic data about them. Their findings might be characterized as SWAGs (scientific wild-ass guesses), and even that may give them too much credit on the “scientific” side. No one should base decisions on findings derived from a too-tiny group of survey respondents.

We leave for another post a further wrinkle that the Times highlights: if the data analysts weight the responses, they “don’t adjust their margins of error to account for the effect of weighting.”