Before regression, look at your data and check correlations

Let’s use the terms we just learned to predict the number of lawyers in a state, our dependent variable, by regressing it on a single predictor variable: the number of Fortune 500 companies headquartered in the state.

Importantly, however, it is good practice to examine your data before you plunge into a regression analysis. One method is to look at a scatter plot, with the predictor variable along the horizontal (x) axis and the dependent variable along the vertical (y) axis. On the plot, each state is represented by one point at those two coordinates.

From the top down, the right-most points are New York (54 headquarters and 96,000 lawyers), California (54 headquarters and 85,000 lawyers) and Texas (52 headquarters and 48,000 lawyers).
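For readers who want to try this themselves, here is a minimal sketch in Python with matplotlib. The three right-most points use the New York, California, and Texas values just quoted; the remaining points are hypothetical stand-ins, since the full 50-state data set is not reproduced here.

```python
# A minimal sketch of the scatter plot described above (matplotlib).
# NY, CA, and TX use the values quoted in the text; the other points
# are hypothetical stand-ins for the remaining states.
import matplotlib.pyplot as plt

hq_counts = [54, 54, 52, 20, 9, 3]                         # Fortune 500 headquarters
lawyers = [96_000, 85_000, 48_000, 28_000, 11_000, 4_000]  # private-practice lawyers

plt.scatter(hq_counts, lawyers)
plt.xlabel("Fortune 500 headquarters in state")  # predictor (x-axis)
plt.ylabel("Private-practice lawyers")           # dependent variable (y-axis)
plt.show()
```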

As states have more Fortune 500 headquarters (moving right on the plot), does it appear that they also have more private-practice lawyers (moving up the plot)?

Yes! Your eye tells you that the points on the scatter plot drift upward toward the right, roughly along a line: more headquarters, more lawyers. That conclusion makes intuitive sense: more big corporations probably generate more legal issues, and the outside lawyers who handle those issues are likely to live in the state.

We can get a more precise, quantitative sense of the relationship between the two variables. When a scatter plot suggests a linear relationship (we’ll explain this term later), we can supplement it with the correlation coefficient, which measures the strength and direction of a linear relationship between two quantitative variables (quantitative variables are numbers, as opposed to qualitative variables, called factors, such as state or region). Correlations range between -1 and 1. Values near -1 indicate a strong negative linear relationship, values near 0 indicate a weak linear relationship, and values near 1 indicate a strong positive linear relationship.

On our data the correlation is 0.904, which is formidable. And because the correlation is positive, it confirms what we said: more headquarters, more lawyers.
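For those curious how such a number is computed, here is a minimal sketch with NumPy. The arrays reuse the illustrative values from the scatter plot sketch above, so the result will only approximate the 0.904 computed on the full state data set.

```python
# Sketch: computing the Pearson correlation coefficient with NumPy.
# These are the same illustrative values as in the scatter plot sketch;
# the full 50-state data set yields roughly 0.904.
import numpy as np

hq_counts = np.array([54, 54, 52, 20, 9, 3])
lawyers = np.array([96_000, 85_000, 48_000, 28_000, 11_000, 4_000])

r = np.corrcoef(hq_counts, lawyers)[0, 1]  # off-diagonal entry of the 2x2 matrix
print(f"correlation: {r:.3f}")
```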

Correlation has meaning only for linear relationships, and it is sensitive to outliers (unusual, possibly erroneous values that might improperly skew the model; more on these later).

Regression explained for lawyers

Your law firm or law department might want to learn even more from the numbers you have collected. One analytic tool for doing so is called regression. How to understand and benefit from regression will be the topic of a series of posts. The goal will be to use everyday language and a lawyer’s framework to explain how to use regression responsibly, how to understand its results intuitively, and how to make it real and nonthreatening to managers of lawyers.

How might managers of lawyers apply regression? A general counsel might investigate whether the size of the law firms she retains is associated with their average effective billing rates. Or she might test whether assigning more matters to a firm is associated with lower effective billing rates. The managing partner of a law firm might look at several pieces of information about associates and use regression to estimate the likelihood of an associate making partner. Or the head of marketing might learn to what degree the number of participants in a survey influences how many times the report is downloaded. Countless examples exist of the ways regression can illuminate numbers in law firms and law departments.

To start, we need to settle on some terminology. All regression analyses need numbers, which statisticians call data. The example data in this series comes from the 2012 time-frame and consists of the number of private-practice lawyers in each state, the population of each state, the state’s “GDP”, and the number of Fortune 500 companies that have their headquarters in the state. The four numbers for each state are variables.

Here are some more regression terms you should feel comfortable with. We will create a regression model that predicts the number of lawyers in a state (the response variable) from the state’s population, GDP, and F500 number (the predictor variables, also called the independent variables).

Regression estimates a response variable from predictor variables, which you can rephrase as estimating the dependent variable from the independent variables: in our model, the number of lawyers in a state depends to an unknown degree on the number of people living in the state, the state’s economic productivity (its GDP), and how many huge companies call the state home. You should think of a regression model as a condensed description of a set of numbers by means of an equation. Creating a useful model underlies much of what data analysts do, including many forms of machine learning.
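To make the terminology concrete, here is a hedged sketch of fitting such a three-predictor model in Python with statsmodels. The small data frame is a hypothetical fragment, not the actual 2012 figures; a real analysis would use all 50 states.

```python
# Sketch: a three-predictor regression model with statsmodels.
# The figures below are hypothetical placeholders, not the 2012 data.
import pandas as pd
import statsmodels.formula.api as smf

states = pd.DataFrame({
    "lawyers":    [96_000, 85_000, 48_000, 30_000, 12_000, 5_000],
    "population": [19.6e6, 38.0e6, 26.1e6, 12.9e6, 5.8e6, 2.9e6],          # people
    "gdp":        [1.35e12, 2.00e12, 1.40e12, 0.72e12, 0.31e12, 0.17e12],  # dollars
    "f500":       [54, 54, 52, 32, 9, 2],
})

# Response on the left of the ~, predictors on the right.
model = smf.ols("lawyers ~ population + gdp + f500", data=states).fit()
print(model.params)  # one coefficient per predictor, plus an intercept
```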

Interaction between series and co-contributors

One analysis that every law firm can carry out on its survey data relies on a contingency table. A contingency table contains counts by levels of a factor (also known as a “categorical variable”), a nominal variable with a set of levels. For example, the variable “Position” might have the levels “GC,” “Direct Report,” “Other Lawyer,” and “Staff.”

As an example of a contingency table, a survey might ask respondents about their company’s headquarters country and whether their company is publicly traded or privately held. Country would be a factor with two or more levels (let’s say the U.S. is one level, with 55 respondents, and Canada is a second level, with 35 respondents); publicly-traded-or-private would also be a factor, but with only two levels (say 20 public and 70 private).

So, a contingency table of counts (frequencies) for these two factors and their levels has a total of four cells (Public/US, Private/US, Public/Canada, and Private/Canada) and would look something like this one:

Traded Stock    US   Canada
Public          12        8
Private         43       27
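As a sketch of how quickly such a table comes together, here is one way to build it with pandas; the individual respondent records are simulated so that they reproduce the four counts above.

```python
# Sketch: building the contingency table above with pandas.
# Respondent records are simulated to match the four counts.
import pandas as pd

records = ([("US", "Public")] * 12 + [("Canada", "Public")] * 8 +
           [("US", "Private")] * 43 + [("Canada", "Private")] * 27)
df = pd.DataFrame(records, columns=["Country", "TradedStock"])

print(pd.crosstab(df["TradedStock"], df["Country"]))
```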

Easy to create, yet law firms miss many opportunities to deepen their analyses by exploring such contingency tables.

Here is an illustration drawn from the surveys identified so far. When law firms survey, they often decide to team with another organization, referred to here as a co-contributor. Separately from that decision, law firms also frequently conduct surveys on a topic more than once, referred to as a series. A question is whether co-contributors are more commonly associated with series. In other words, given a contingency table of the four counts, is there a statistical association between series and co-contributors?

Contributor       NotSeries   Series
Alone                    94       23
Co-Contributor           60       23

The contingency table above derives from 200 different surveys (each series is treated as a single survey) for which I have located a published report. Reports are necessary to determine whether there was a co-contributor. Of that set, 23 are series in which a co-contributor took part (the bottom-right count in the table); the same number are series where the law firm proceeded alone (top right). Of the “NotSeries” surveys, 94 were done by the law firm alone while 60 had a co-contributor. We must stress that this data is preliminary, because we identified co-contributors at various times and may not have spotted all of them or fully matched them to series and non-series.

We can learn from the data in the contingency table whether there is a statistical association between series and co-contributors. One method available to us is called a chi-square test.
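Here is a minimal sketch of what running that test on the table above could look like in Python with scipy; the counts come straight from the table.

```python
# Sketch: chi-square test of independence on the series/co-contributor table.
import numpy as np
from scipy.stats import chi2_contingency

table = np.array([[94, 23],   # Alone:          NotSeries, Series
                  [60, 23]])  # Co-Contributor: NotSeries, Series

chi2, p, dof, expected = chi2_contingency(table)
print(f"chi-square = {chi2:.2f}, p-value = {p:.3f}")
# A small p-value (conventionally below 0.05) would suggest an association
# between being a series and having a co-contributor.
```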

Number of words and words per page in survey reports

The graph that follows shows an aspect of reports: the number of words per report page. The results might be considered a measure of text density: how much text information is included on an average page. The reports were chosen alphabetically from the data set. Note that the software that counted words includes words in graphs, headers, footers, tables, covers, and back pages, so it counts more words than would normally be considered “text.” Still, since nearly all survey reports have those constituent elements, the words-per-page calculation produces comparable figures.


The total words in the reports varied by an order of magnitude: from 610 words in the most terse to 6,983 in the most loquacious. This particular set of ten reports averaged 4,493 words, with a median of 5,067.

As can be seen, Baker McKenzie Cloud 2017 has approximately 50 words for each of its pages. At the other extreme, Ashurst GreekNPL 2017 weighs in at 420 words for each page.
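The density measure itself is simple arithmetic, total words divided by page count, as this small sketch with made-up numbers shows.

```python
# Sketch: text density as total words divided by pages.
# The word and page counts below are hypothetical, not from actual reports.
reports = {"Report A": (6_983, 17), "Report B": (610, 12)}

for name, (words, pages) in reports.items():
    print(f"{name}: {words / pages:.0f} words per page")
```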

Increases in participants over life of survey series by law firms

The plot below shows data on participants per year for lengthy series conducted by five law firms. A facet plot, it presents the data in a separate pane for each firm (alphabetically from Carlton Fields to White & Case). Within each pane, the left axis, the number of participants in each survey year, has its own scale. For example, DLA Piper (top right) ranges from below 100 participants to around 300, whereas Davies Ward (to its left) ranges from 500 to 1,200 participants. White & Case’s survey data is missing participants for 2013 and 2014, so its line breaks. The series in this group run from six years at the minimum to nine at the maximum.
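For readers who build their own charts, here is a minimal matplotlib sketch of a facet plot in which each pane keeps its own y-axis scale; the firm names and participant counts are hypothetical placeholders.

```python
# Sketch: a facet plot with one pane per firm and independent y-axes.
# Firms and counts are hypothetical placeholders.
import matplotlib.pyplot as plt

series = {
    "Firm A": {2012: 80, 2013: 160, 2014: 230, 2015: 290},
    "Firm B": {2012: 500, 2013: 900, 2014: 1100, 2015: 1200},
}

fig, axes = plt.subplots(1, len(series), sharey=False, figsize=(8, 3))
for ax, (firm, counts) in zip(axes, series.items()):
    ax.plot(list(counts.keys()), list(counts.values()))
    ax.set_title(firm)
    ax.set_ylabel("Participants")
plt.tight_layout()
plt.show()
```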

Generally speaking, the upward slope of the lines confirms that series gain participants as they continue over the years. The exception was Davies Ward, which declined after the initial burst of enthusiasm in 2005 but then began a recovery, until the firm ceased sponsoring the series after 2011.

If a few more series of at least six years’ duration had full information on participants, we could more confidently assert that brand recognition and appreciation for a series build over time. Certainly this initial view suggests that to be the case.

Years of survey series and numbers of participants

Does the longevity of a survey series affect the average number of participants in the series? This is likely to be too crude a question, because the target populations of series differ significantly. Then too, firms might modify their questions as the series goes along rather than repeating the same questions, which could affect participation. A series might bring on different co-contributors or change how it reaches out for participants. If we could control for factors such as these, which might swamp changes in participant numbers arising simply from annual invitations, content, and publicity, we could make some headway on the question, but the data for that level of analysis is not available. Also, averaging participant numbers over the years of a survey series may conceal material ups and downs.

Moreover, of greater usefulness to law firms would be knowing whether numbers of participants tend to increase over the life of a series as it becomes better known and more relied on.

We plunge ahead anyway. To start, consider the series that a law firm has sponsored for four years or more. We know of 21, which are presented in the plot below. The color coding in the legend at the bottom corresponds to how many surveys have been in the series (some of which are ongoing). It moves from midnight blue for the four-year series to the lightest color (yellow) for the longest-running series (13 years).

As we speculated above, a regression of how many years a survey has been conducted against average participants provides no insight. Factors other than the number of years a series has run influence the number of participants more.
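Here is a sketch of how such a regression could be run with scipy. The (years, average participants) pairs below are hypothetical placeholders, since the 21 underlying series are not reproduced in this post.

```python
# Sketch: regressing average participants on series length.
# The pairs below are hypothetical placeholders for the 21 series.
import numpy as np
from scipy import stats

years = np.array([4, 4, 5, 6, 7, 9, 13])
avg_participants = np.array([300, 900, 150, 420, 1100, 250, 500])

fit = stats.linregress(years, avg_participants)
print(f"slope = {fit.slope:.1f}, R-squared = {fit.rvalue ** 2:.3f}")
# A low R-squared matches the conclusion that series length alone
# explains little of the variation in participant counts.
```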

Long series of surveys by law firms and their meta-topics

Several law firms have conducted (and may still be conducting) two different series. These firms include Baker McKenzie (Brexit and cloud computing), Berwin Leighton (hotels, two different geographies), Clifford Chance (M&A and European debt), Freshfields Bruckhaus (corporate crises and whistle blowers), Herbert Smith (M&A and finance), Jackson Lewis (workplace and dress codes), Miller Chevalier (Latin American corruption and tax policy), and Morrison Foerster (legal industry and M&A).

A few firms have done (or may still be conducting) three surveys on different topics: CMS (the legal industry, Brexit, M&A in Europe), DLA Piper (compliance, debt in Europe, M&A), and Pinsent Masons (Brexit and two on construction in different geographies).

We can also look at the broad topics where one or more firms have coordinated a series of at least five years’ length. We have coded the particular topics into broader meta-topics. The next chart tells us, first, that three industry meta-topics appear in these long-running series: construction, real estate, and private equity. Second, firms have also run five-plus-year series on disputes (litigation, class actions, and arbitration). Finally, the most popular subject for research surveys has been mergers and acquisitions, with three different meta-topics.


Numbers of co-contributors on surveys conducted by law firms

If an organization helps on a law firm’s research survey, the report clearly acknowledges that contribution. For example, as in the snippet below, Burgess Salmon Infrastructure 2018 [pg. 8] gave a shout-out to its two co-contributors (Infrastructure Intelligence and YouGov).

At least 12 law firms have conducted surveys with two different co-contributors. Three firms have worked with four co-contributors (Dentons, Morrison & Foerster, and Reed Smith) and two firms have worked with six co-contributors (CMS and Pinsent Masons).

Interestingly, two law firms have teamed with one or more other law firms: Shakespeare Martineau Brexit 2017 with Becker Büttner Held and Miller Chevalier LatAmCorruption 2016 with 10 regional law firms.

For most co-contributor surveys, the pairing is one law firm and one co-contributor. However, Pinsent Masons Infratech 2017 and Clifford Chance Debt 2007 each sought the assistance of three co-contributors for a research survey.

At this point, at least eleven co-contributors have helped on more than one survey by a law firm: Acritas, Alix Partners, ALM Intelligence (4 surveys), Canadian Corporate Counsel Association (5), the Economist Intelligence Unit, FTI Consulting (3), Infrastructure Intelligence, IPSOS (5), Ponemon Institute, RSG Consulting (3), and YouGov.

Double surveys by law firms, with two meanings

Consider two different meanings of “double survey.” One meaning applies to a law firm sending out two surveys, each to a different target audience, and then combining the responses in a report. A second meaning applies to a firm conducting more than one survey in a year, but with the same target audience.

Burgess Salmon Infrastructure 2018 [pg. 8] explains that it simultaneously conducted two separate surveys, one by interviews and the other by an online questionnaire. The report juxtaposes the findings.

Minter Ellison Cybersecurity 2017 [pg. 6] also undertook a double survey. With separate instruments, it reached out to members of boards of directors and also to chief information officers and others. The report combines the data.

Turning to the second meaning of “double survey”, one example started in 2015. Haynes Boone has conducted its energy borrowing survey twice yearly since then, e.g., Haynes Boone Borrowing 2018 [pg. 2].

Other firms that have conducted surveys twice a year on a topic include Morrison Foerster, e.g., Morrison Foerster MA 2018, and Irwin Mitchell, e.g., Irwin Mitchell Occupiers 2014. We also found an instance of quarterly surveys: Brodies Firm Brexit 2017!

Use scale questions, but think about text labels

Quite often law firms ask respondents to answer a question with a value from a scale. Those values should represent balanced positions on the scale; that is, the conceptual distance from one point to the next should be equal. For example, researchers have shown that respondents perceive the strongly disagree / disagree / neutral / agree / strongly agree scale as balanced.

Most survey designers set the bottom point as the worst possible situation and the top point as the best possible, then evenly spread the scale points in-between.

The text selected for the spectrum of choices deserves an extended discussion. Sometimes survey questions add text only to the polar values of a scale, for example: “Choose from a scale of 1 to 6 where 1 indicates ‘Yes, definitely’ and 6 indicates ‘No, definitely not.’” Alternatively, the question could supply intermediate scale positions with text: 2 indicates “Yes, probably”, 3 indicates “Maybe”, and so on.
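When it comes time to analyze the answers, text labels have to be coded back to numbers. Here is a minimal sketch: the labels for 1, 2, 3, and 6 follow the example above, while the wording for 4 and 5 is our own assumption for illustration.

```python
# Sketch: coding scale labels back to numbers for analysis.
# Labels for 1, 2, 3, and 6 follow the example in the text; the wording
# for 4 and 5 is hypothetical.
label_to_code = {
    "Yes, definitely": 1,
    "Yes, probably": 2,
    "Maybe": 3,
    "Probably not": 4,      # hypothetical wording
    "No, probably not": 5,  # hypothetical wording
    "No, definitely not": 6,
}

responses = ["Yes, probably", "Maybe", "No, definitely not"]
codes = [label_to_code[r] for r in responses]
print(codes)  # [2, 3, 6]
```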

DLA Piper Compliance 2017 [pg. 6] used a 10-point scale with text at the extremes and at the middle position:

It is hard to create text descriptions of positions on a scale that respondents perceive as equally spaced. If you provide only numbers, respondents will unconsciously space the choices evenly, but you will not have as clear a way to know what was in their minds. On the other hand, words are inherently ambiguous and introduce all kinds of variability in how respondents interpret them.

Often the responses to a well-crafted scale question come back reasonably “normal,” as in the oft-seen bell-curve normal distribution: the midpoint gets the most responses, and the counts fall off fairly symmetrically on either side. Here is an example from a five-point scale.
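A quick way to see that shape is to plot the counts for each scale point, as in this sketch with hypothetical counts.

```python
# Sketch: response counts on a five-point scale, shaped roughly like a
# bell curve. The counts are hypothetical.
import matplotlib.pyplot as plt

points = [1, 2, 3, 4, 5]
counts = [6, 18, 34, 20, 8]  # midpoint gets the most responses

plt.bar(points, counts)
plt.xlabel("Scale point")
plt.ylabel("Number of respondents")
plt.show()
```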