Common problems in pre-processing data so software can work with it

Based on structured interviews with a small convenience sample of seven visual analysts, the authors of an academic paper [Source: Victoria Lemieux et al., Meeting Big Data challenges with visual analytics, Records Mgt. J. 24(2), July 2014 at 127] identified the following themes in the difficulties of pre-processing data.  I have added a gloss to each one.

• Unavailability of data [you can’t locate data or it was never compiled in the first place]

• Fragmentation of data [locating relevant data distributed across multiple databases, database tables and/or files is very time-consuming]

• Data quality [whoever gathered the information made mistakes, recorded something that has to be converted into a zero or a missing-value indicator, or included extra spaces, for example; the first R sketch after this list illustrates these clean-up steps]

– Missing values [a blank expense field, for example, does not tell you whether there were in fact no expenses on the matter or whether the person who recorded the data did not know the amount]

– Data format [dates are notorious for appearing as May 16, 1962 in one record, 05/16/62 in another, 05/16/1962 in a third, and all kinds of other variations that need to be standardized]

– Need for standardization [for example, some numbers have decimals, some have leading zeros, some are left-justified with trailing spaces, some have commas, and so on]

• Data shaping [for example, in the R programming language the most common package for creating plots is called “ggplot2”.  When you use it, the data ideally is in what is called “long form,” so you might need to reshape the data before you plot it; the second R sketch after this list shows that step]

– For technical compatibility [perhaps this means that data stored as comma-separated values (.csv), for example, might need to be imported into an Access database structure before Access can work with it]

– For better analysis [it may be that the software read a variable into memory as character strings, whereas the data scientist wants that variable to be a factor with a defined number of levels]

• Disconnect between creation/management and use [the general point could be that someone in the law firm tracks data for one purpose, but the data is not useful beyond that narrow purpose]

• Record-keeping [this may refer to the important step of keeping a record of each step in the data collection and cleaning, i.e., the reproducibility of the research]

– General expression of need for record-keeping [perhaps a firm-wide or law department-wide statement or policy that data has value and we need to shepherd it]

– Version control [keeping track of successive iterations of the software that works on the data]
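
To make the Data quality bullets concrete, here is a minimal sketch in base R, using a hypothetical matter-expense table invented for illustration, of the kinds of clean-up they describe: standardizing inconsistent date formats, stripping commas, leading zeros and stray spaces from numbers, and flagging missing values rather than silently treating them as zero.

raw <- data.frame(
  matter  = c("M-001", "M-002", "M-003", "M-004"),
  opened  = c("May 16, 1962", "05/16/62", "05/16/1962", "1962-05-16"),
  expense = c("1,250.00", "0075", "  300 ", NA),   # commas, leading zeros, stray spaces, a missing value
  stringsAsFactors = FALSE
)

# Standardize the dates by matching each record's pattern to a format.
parse_date <- function(x) {
  if (grepl("^[A-Za-z]", x))      as.Date(x, format = "%B %d, %Y")  # "May 16, 1962"
  else if (grepl("/\\d{4}$", x))  as.Date(x, format = "%m/%d/%Y")   # "05/16/1962"
  else if (grepl("/\\d{2}$", x))  as.Date(x, format = "%m/%d/%y")   # "05/16/62"; which century it means is a human judgment call
  else                            as.Date(x)                        # ISO "1962-05-16"
}
raw$opened <- do.call(c, lapply(raw$opened, parse_date))

# Standardize the numbers: strip commas and spaces, then convert to numeric.
raw$expense <- as.numeric(gsub("[ ,]", "", raw$expense))

# Flag missing values explicitly rather than guessing that they mean zero.
raw$expense_missing <- is.na(raw$expense)

raw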
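
And here is a second short sketch of the Data shaping bullets, again with made-up figures: it reshapes a wide table of hypothetical fees into the long form that ggplot2 prefers and converts a character column into a factor with a defined set of levels. It assumes the tidyr package is installed; other reshaping tools would work as well.

library(tidyr)

wide <- data.frame(
  office    = c("New York", "Chicago", "London"),
  fees_2014 = c(4.2, 3.1, 2.8),    # hypothetical fees in $ millions
  fees_2015 = c(4.6, 3.0, 3.3)
)

# Reshape so there is one row per office-year pair instead of one column per year.
long <- pivot_longer(wide,
                     cols      = c(fees_2014, fees_2015),
                     names_to  = "year",
                     values_to = "fees_millions")
long$year <- sub("fees_", "", long$year)

# Convert a character variable into a factor with defined levels ("for better analysis").
long$office <- factor(long$office, levels = c("New York", "Chicago", "London"))

str(long)
# ggplot2 can now plot the long data directly, for example:
# library(ggplot2)
# ggplot(long, aes(year, fees_millions, fill = office)) + geom_col(position = "dodge")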

Analytic software requires curated data before it can proceed reliably

A column in Met. Corp. Counsel, Sept. 2016 at 21 by David White, a director at AlixPartners, starts with three paragraphs on the severity of anti-corruption risks to U.S. companies that do business abroad and the associated huge variety and volume of documents and records that might have to be analyzed to respond to regulators.  Data analytics to the rescue!

White continues: “Unlike their traditional counterparts, newer analytic systems based on big data technologies, predictive analytics and artificial intelligence are not bound by upfront data transformations and normalization requirements …” [emphasis added].

In my experience, analytical software inevitably requires the data fed to it to be formatted cleanly in ways that it can handle.  For example, dollar signs can't be included, currencies need to be converted, dates need a consistent format, and missing numbers or problematic outliers need to be dealt with.  These and other steps of “upfront data transformations and normalization” are often manual, require human judgment, and can take hours of thoughtful attention.  Moreover, the data doctor ought to keep track of every step in the clean-up.
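
As a rough illustration of what those manual steps look like in practice, here is a minimal sketch in base R. The invoice amounts, the pound-to-dollar rate of 1.30, and the outlier rule are all invented for the example; they are not from White's column.

invoices <- data.frame(
  amount = c("$12,500", "£9,800", "$7,250", NA, "$1,400,000"),
  billed = c("03/01/2016", "03/15/2016", "04/02/2016", "04/20/2016", "05/05/2016"),
  stringsAsFactors = FALSE
)

# Keep a running log of every clean-up step, so the work is reproducible.
cleaning_log <- character(0)
log_step <- function(msg) cleaning_log <<- c(cleaning_log, msg)

# 1. Strip currency symbols and thousands separators; convert pounds at an assumed rate.
gbp <- !is.na(invoices$amount) & grepl("£", invoices$amount)
invoices$amount_usd <- as.numeric(gsub("[$£,]", "", invoices$amount))
invoices$amount_usd[gbp] <- invoices$amount_usd[gbp] * 1.30   # hypothetical GBP/USD rate
log_step("Removed currency symbols; converted GBP amounts to USD at an assumed 1.30 rate")

# 2. Put the dates into one consistent format.
invoices$billed <- as.Date(invoices$billed, format = "%m/%d/%Y")
log_step("Converted billed dates from mm/dd/yyyy text to Date objects")

# 3. Missing values and outliers still require human judgment; here they are only flagged.
invoices$amount_missing <- is.na(invoices$amount_usd)
outlier <- !is.na(invoices$amount_usd) &
  invoices$amount_usd > 10 * median(invoices$amount_usd, na.rm = TRUE)
log_step(sprintf("Flagged %d missing amount(s) and %d possible outlier(s) for review",
                 sum(invoices$amount_missing), sum(outlier)))

cleaning_log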