Once someone releases a number, such as a count of environmental cases in 2016 where settlements exceeded $250,000, that number becomes reified. It takes on a life of its own, taken for granted as an accurate statement of fact. Few who later rely on the number bother to look under the hood (and quite possibly could do so only with difficulty) to understand the decisions and methods that went into it.
All numbers have methodological issues: someone made judgment calls along the way about how to handle hard cases. To stay with the example, what if a settlement was for $200,000 plus a one-year agreement not to do something? What if another case settled for $500,000 payable in two installments, with the second installment contingent on the other party doing something? Or what if a settlement was paid in a foreign currency and someone had to choose an exchange rate?
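To make the point concrete, here is a hypothetical sketch: the settlement records, exchange rate, and coding rules below are all invented for illustration, but they show how two equally defensible rule sets produce two different counts of "settlements over $250,000."

```python
# Hypothetical settlement records; amounts in USD unless noted.
# All figures and coding rules are invented for illustration.
settlements = [
    {"cash": 200_000, "noncash_terms": True},          # cash plus a conduct agreement
    {"cash": 500_000, "contingent_portion": 250_000},  # second installment contingent
    {"cash": 240_000, "currency": "EUR"},              # needs an exchange-rate decision
]

EUR_TO_USD = 1.10  # one of several defensible rate choices

def usd_value(s, count_contingent=True):
    """Convert a settlement to a single USD figure under one set of coding rules."""
    amount = s["cash"]
    if not count_contingent:
        amount -= s.get("contingent_portion", 0)
    if s.get("currency") == "EUR":
        amount *= EUR_TO_USD
    return amount

# Two analysts, two defensible rule sets, two different counts:
count_a = sum(usd_value(s, count_contingent=True) > 250_000 for s in settlements)
count_b = sum(usd_value(s, count_contingent=False) > 250_000 for s in settlements)
print(count_a, count_b)  # the same cases, counted differently
```

Neither count is wrong; each reflects a choice that the published headline number silently embeds.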
All the numbers a law firm or law department might use in its data science efforts harbor unexamined birth pangs like these. At some point a data scientist has to treat her numbers as if they are accurate, but she should always stress-test them for reasonableness, look for outliers, and probe for hidden assumptions.
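One simple form such a stress test can take is a standard interquartile-range outlier screen; the amounts and thresholds below are illustrative, not a prescribed methodology, and a flagged value is a prompt to check the underlying record, not to delete it.

```python
import statistics

# Hypothetical settlement amounts in USD, invented for illustration.
amounts = [180_000, 220_000, 260_000, 255_000, 240_000, 2_600_000]

# First and third quartiles, and the interquartile range between them.
q1, _, q3 = statistics.quantiles(amounts, n=4)
iqr = q3 - q1

# Flag values far outside the interquartile range for manual review.
suspects = [a for a in amounts if a < q1 - 1.5 * iqr or a > q3 + 1.5 * iqr]
print(suspects)
```

Here the $2.6 million entry gets flagged; on inspection it might be a data-entry error, a currency mix-up, or a genuine mega-settlement that deserves a footnote.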
This is a foundational concept of data science: trust but verify the numbers. Moreover, in the back of our minds we should treat all numbers as probabilistic. The actual number, we hope, is the stated one, but realistically it sits somewhere in a cloud of possibly true numbers around the Platonic ideal.
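One way to operationalize this probabilistic view is a quick simulation; the 5% relative-error assumption and the reported total below are invented for illustration, not drawn from any real dataset.

```python
import random

random.seed(0)

# A reported total, treated as the center of a cloud of possibly true
# values. The 5% relative-error assumption is illustrative only.
reported_total = 1_250_000
relative_error = 0.05

# Simulate the cloud and ask how often the "true" value would still
# clear a threshold that the headline number comfortably exceeds.
draws = [random.gauss(reported_total, reported_total * relative_error)
         for _ in range(10_000)]
share_above = sum(d > 1_200_000 for d in draws) / len(draws)
print(round(share_above, 2))
```

Even a modest error assumption leaves a meaningful chance that the "true" figure falls below the threshold, which is exactly the humility the probabilistic framing asks of us.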