While attending a function recently, I overheard a conversation between a couple of developers that reminded me of the implications design choices have for data quality. The conversation went something like this:
“Hey, we need to increase the size of the data field so that our file output to [a partner] works correctly. Right now, data is getting truncated.”
After a pause, his counterpart said: “Yes, that sounds like a good idea, but the reason the field is truncated is that [a different partner] only allows 30 characters in that field, and we decided at the time to only allow that many.”
Thinking about it for a moment, they both slumped back, realizing the challenge they now faced. Let’s face it, we’ve all made design choices with long-term implications that we “thought” were acceptable, and that’s just life as an engineer. That said, it highlights the importance of data quality.
No matter how you look at data, it’s important that it is as high quality as you can get, which raises the question: how do you measure data quality? Without making it a long, drawn-out definition, here are the characteristics I look for, based on what I’ve learned from others (with a small sketch of what these checks might look like after the list).
- Completeness: Is the data complete? For example, do you have an address that’s missing the postal code?
- Validity: Is the data collected in the proper format for its intended purpose?
- Consistency: Is the data collected consistent, or do you have different information for the same entity in different locations?
- Timeliness: Is the data still accurate, or is it sufficiently stale that it is no longer accurate?
- Accuracy: In my view, this is the culmination of all the other characteristics.
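
To make those characteristics a little more concrete, here’s a minimal sketch of what checking them might look like in code. The record, field names, and thresholds are all hypothetical, just something I made up for illustration; a real system would lean on proper validation libraries:

```python
from datetime import datetime, timedelta

# A hypothetical customer record; the field names and values are
# illustrative only, chosen to trip each check below.
record = {
    "name": "Ada Lovelace",
    "postal_code": "",                   # empty -> completeness problem
    "email": "ada@example",              # malformed -> validity problem
    "updated_at": datetime(2020, 1, 1),  # old -> timeliness problem
}

issues = []

# Completeness: every expected field should carry a value.
for field in ("name", "postal_code", "email"):
    if not record.get(field):
        issues.append(f"incomplete: '{field}' is missing or empty")

# Validity: is the data in the proper format for its intended purpose?
# (Deliberately crude; use a real validator in practice.)
domain = record["email"].partition("@")[2]
if "." not in domain:
    issues.append("invalid: email domain looks malformed")

# Consistency: the same entity should not disagree across locations.
crm_email = "ada@example.com"  # the "other system's" copy of the email
if record["email"] != crm_email:
    issues.append("inconsistent: email differs between systems")

# Timeliness: data past some freshness threshold may no longer be accurate.
if datetime.now() - record["updated_at"] > timedelta(days=365):
    issues.append("stale: record not updated in over a year")

# Accuracy is the culmination of the rest: fewer issues, fewer known
# inaccuracies.
for issue in issues:
    print(issue)
```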
While not comprehensive, these bullet points are a good starting point. In reality, data quality is part of a much larger science, one that has caused many enterprises to develop data governance teams whose sole purpose is to improve the quality of data, which makes a lot of sense when you consider the value of data.
All that said, I urge anyone developing a distributed system to collect the best quality data possible and to perform validation on the incoming data. Then you can deal with formatting the data on the outbound side. As the old adage goes: Garbage In, Garbage Out.
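
Here’s a sketch of that separation, echoing the 30-character limit from the conversation above. The limit, `ingest()`, and `format_for_partner()` are names and numbers I’ve made up for illustration, not anyone’s real system:

```python
# Hypothetical 30-character partner limit, as in the anecdote above.
MAX_PARTNER_FIELD = 30

def ingest(raw_name: str) -> str:
    """Validate on the way in; reject garbage instead of storing it."""
    name = raw_name.strip()
    if not name:
        raise ValueError("name must not be empty")
    return name  # store the full, untruncated value

def format_for_partner(name: str) -> str:
    """Apply partner-specific constraints only on the way out."""
    return name[:MAX_PARTNER_FIELD]

stored = ingest("  A Very Long Company Name That Exceeds Thirty Characters  ")
print(format_for_partner(stored))  # truncated only in this partner's file
print(stored)                      # the system of record keeps it all
```

Keeping the truncation on the outbound side means each partner’s limit stays a formatting concern, and the stored data never loses information it can’t get back.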