no data does not mean no data quality
Consider these important traits as they relate to data quality: Completeness, Consistency, Integrity, Reasonability, Timeliness, Uniqueness/Deduplication, Validity, and Accessibility.
The intent behind the use of a particular set of data weighs heavily in any data quality regimen you might consider implementing. Intent weighs even more heavily when you consider completeness.
There have been many occasions when, working with customer data, I have been able to identify quickly and easily that the data contains null values. At face value, you might say that null data is a bad situation. After all, didn't you define a data schema with the intent that all data would be provided? The answer, of course, lies somewhere in between.
I may have a name, a date of birth and an address, but do I have a phone number and email address? Are phone numbers and email addresses, in fact, important enough as data attributes that their absence makes the data I have any less valuable? These aspects all feed into the completeness dimension of data quality.
It might be easy to think that all data attributes should be present and populated. The reality, though, is that some classes of data are considered optional or discretionary, and this is often decided at the time of data schema definition.
Looking at the contact data that I have, it is perhaps reasonable to assume that children and the very elderly do not have email addresses. I might find that a particular demographic of my contacts does not have phone numbers, or chooses to withhold them because they are not pertinent to my relationship with them.
Every record should have a date of birth if that is a key attribute we will rely on to make decisions. Every record should have a name, at the very least an initial and a last name perhaps, unless of course I am simply doing statistical analysis of, say, clinical trial data. Is an initial sufficient? Is it even necessary? Only you can be the judge of what is needed, contingent on your domain knowledge of the purpose and intent for the data.
When examining a large set of data that purports to contain names, addresses and methods of contact, I might establish that some columns of data are very sparsely populated. That might suggest a data quality problem. The complete absence of data is perhaps more telling. A column that contains no data for any of the records suggests that the column either has no method or means for data capture, is redundant, or is empty as a consequence of some upstream malfunction.
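To make that kind of profiling concrete, here is a minimal pandas sketch, with column names and sample values that are purely illustrative assumptions, that surfaces sparsely populated and completely empty columns:

```python
import pandas as pd

# Hypothetical contact extract; the column names and values are assumptions.
contacts = pd.DataFrame({
    "contact_id": [1, 2, 3, 4],
    "name": ["A. Smith", "B. Jones", None, "D. Patel"],
    "date_of_birth": ["1980-01-01", None, None, "1975-06-30"],
    "phone": [None, None, None, None],   # a completely empty column
    "email": ["a@example.com", None, "c@example.com", None],
})

# Percentage of populated values per column, sparsest first.
completeness = contacts.notna().mean().sort_values() * 100
print(completeness.round(1))

# Columns with no data at all warrant a closer look: no capture mechanism,
# redundancy, or an upstream extraction fault.
empty_columns = completeness[completeness == 0].index.tolist()
print("Completely empty columns:", empty_columns)
```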
The null question
When importing data from, say, an ERP or CRM system using file-based methods, the surfacing of row identifiers with no associated data for the defined schema almost certainly speaks to a data extraction or creation problem.
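A quick check along those lines might look like the sketch below, assuming a hypothetical crm_extract.csv file and a contact_id identifier column:

```python
import pandas as pd

# Hypothetical file-based extract from an ERP/CRM system; the file name and
# the identifier column name are assumptions for illustration.
rows = pd.read_csv("crm_extract.csv")

id_column = "contact_id"
attribute_columns = [c for c in rows.columns if c != id_column]

# Rows where the identifier is present but every other attribute is null
# usually point to an extraction or record-creation fault upstream.
orphan_rows = rows[rows[id_column].notna() & rows[attribute_columns].isna().all(axis=1)]
print(f"{len(orphan_rows)} rows contain an identifier and nothing else")
```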
So, when we consider completeness, we consider it in the context of the whole. We might say that it is acceptable for 50% of our records to have no telephone number. When we set up our data quality rules, we should set a threshold stating that if more than 50% of the records have no telephone number, then we potentially have a data problem.
Of course, where we expect 100% completeness, our business rules should demand it, and we would want to highlight records that are missing critical data we believe should always be there.
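One lightweight way to express such rules is as per-column thresholds. The sketch below is illustrative only; the column names and tolerance levels are assumptions, not a prescribed rule set:

```python
import pandas as pd

# Illustrative completeness rules: each entry maps a column to the minimum
# acceptable populated percentage. Columns and thresholds are assumptions.
RULES = {
    "name": 100.0,           # critical: must always be present
    "date_of_birth": 100.0,  # critical for decision-making
    "phone": 50.0,           # discretionary: tolerate up to half missing
    "email": 50.0,
}

def check_completeness(df: pd.DataFrame, rules: dict) -> list:
    """Return findings for columns that breach their completeness threshold."""
    findings = []
    for column, minimum_pct in rules.items():
        populated_pct = df[column].notna().mean() * 100
        if populated_pct < minimum_pct:
            findings.append(
                f"{column}: {populated_pct:.1f}% populated, expected at least {minimum_pct:.0f}%"
            )
    return findings

# Toy data to show the rules firing.
contacts = pd.DataFrame({
    "name": ["A. Smith", None],
    "date_of_birth": ["1980-01-01", "1990-02-02"],
    "phone": [None, None],
    "email": ["a@example.com", None],
})
for finding in check_completeness(contacts, RULES):
    print(finding)
```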
In the absence of proper data, we should also consider what the alternatives might look like. It is a well-known issue that many SQL export tools, by default, place the literal string NULL in the column placeholders where null attributes are identified for data rows, but is this actually correct?
There are arguments in favour of being more explicit, but there are also schools of thought suggesting that this ‘helpful’ indicator does nothing more than interfere with the proper handling of the data elsewhere, requiring preprocessing before you can work with it in some other tool or system.
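If you do receive an export that carries the literal string NULL (or some other placeholder) where values are missing, most analysis tools let you declare those markers up front. Here is a small pandas sketch, using a made-up inline extract, that does exactly that:

```python
import io
import pandas as pd

# A toy export where nulls were written out as the literal string "NULL".
raw = io.StringIO(
    "contact_id,name,phone\n"
    "1,A. Smith,NULL\n"
    "2,NULL,555-0100\n"
)

# pandas already treats a handful of markers (including "NULL") as missing by
# default; declaring them explicitly via na_values documents the convention
# and lets you add non-standard placeholders that would otherwise slip through.
contacts = pd.read_csv(raw, na_values=["NULL"])
print(contacts.isna().sum())
```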
Another aspect of completeness is how critical the value being provided is. For example, if I know that half of my records may not have phone numbers or email addresses, what do I do next? If the purpose I have in mind for that data is to run a telephone or email outbound marketing campaign, then the absence of those two key data attributes compromises my ability to use those records. I can hardly call or email someone with details of a special offer or promotional deal if I have no hope of contacting them.
Stale data
If phone numbers or email addresses are out of date or no longer valid, the reality is that I have them, but I cannot use them. Perhaps the format they are provided in is inappropriate or incorrect. These aspects are covered under accuracy and validity, but validity, in particular, is a different topic that I will speak to separately. Nonetheless, it is important to keep this in mind also.
When examining completeness, just as for accuracy, we typically look at the whole population of data at our disposal and then determine, on a percentage basis, what level of incompleteness we can tolerate.
We also need to consider the options at our disposal for remediating incompleteness issues. We may choose, for example, to filter out all records identified as incomplete in order to target with more precision, or to ensure a higher yield rate on potential marketing outreach campaigns.
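As a sketch of that filtering step, where the column names and the reachability rule are assumptions for illustration:

```python
import pandas as pd

# Hypothetical contact list; column names are assumptions for illustration.
contacts = pd.DataFrame({
    "contact_id": [1, 2, 3],
    "name": ["A. Smith", "B. Jones", "C. Chen"],
    "phone": ["555-0100", None, None],
    "email": [None, "b@example.com", None],
})

# For an outbound campaign, keep only records we can actually reach:
# at least one contact channel must be populated.
reachable = contacts[contacts[["phone", "email"]].notna().any(axis=1)]
unreachable = contacts.drop(reachable.index)

print(f"{len(reachable)} reachable, {len(unreachable)} routed to remediation")
```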
Records that we identify as missing key attributes may be pushed into a secondary clean-up, remediation and resolution cycle, where we leverage the data attributes that we do have and use them to fill in the blanks or assist with record completion.
An example I often use here is inferring vehicle brands from models.
If my vehicle is a Corolla or a Golf or an F150, there is a good chance that I can infer the make from these attributes alone. As long as I recognize and accept that there is a risk in making inferences about missing data attributes, perhaps this is a good enough approach to data quality resolution where completeness is identified as a big enough problem area.
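A minimal sketch of that inference, using a hand-rolled lookup table and a flag that keeps the inferred values visible downstream, might look like this:

```python
import pandas as pd

# A small, hand-maintained lookup; in practice this would come from a reference
# dataset. The mappings are well-known examples, not a complete list.
MODEL_TO_MAKE = {
    "Corolla": "Toyota",
    "Golf": "Volkswagen",
    "F150": "Ford",
}

vehicles = pd.DataFrame({
    "model": ["Corolla", "Golf", "F150", "UnknownModel"],
    "make": [None, "Volkswagen", None, None],
})

# Fill missing makes from the model lookup, and flag the inference so the risk
# of being wrong stays visible downstream.
inferred = vehicles["model"].map(MODEL_TO_MAKE)
vehicles["make_inferred"] = vehicles["make"].isna() & inferred.notna()
vehicles["make"] = vehicles["make"].fillna(inferred)
print(vehicles)
```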