While many of us would assume that all data is actual fact, the truth is that data can be skewed; it may not always be a direct reflection of reality. Hence, before we deduce insights and information from a set of data, we should always ask ourselves about the quality of the data set before us and how trustworthy it is.
What is Data?
Data is a collection of numbers, facts, measurements or descriptions of something. Before any form of reasoning or conditioning is applied, raw data essentially has no meaning. It is only when we process raw data with conditions, constraints and reasoning that it becomes information. And information provides insights and a way to understand the data.
For example,
Data: “Data shows 30 Baht with a frequency of 3 times.” This type of raw data provides little to no insight.
Information: “A customer bought 30 THB worth of items 3 times in the past week.” Processed data provides valuable insights.
Here’s another example,
Data: “Data shows -30% from the average number.” Presents no real value.
Information: “A website’s average number of organic visitors dropped by 30% last month, which coincides with the holiday season.” Presents valuable insight.
There’s valuable data and there’s worthless data. There’s also high-quality data vs low-quality data. Most importantly, there’s the trustworthiness of data, which rests on 6 dimensions.
The 6 Dimensions of Data
1 – Accuracy: Refers to how closely the data represents the real-world scenario. This involves considerations such as sample size, margin of error, and the method by which the data was collected.
2 – Consistency: Refers to how consistent the data is over time. For example, if you were to run the same data sampling 3 times over a period of 3 months and the numbers came out with less than a 3% difference each time, you could deduce a high level of consistency. However, if the numbers show a 40% difference each time, consistency is very low, which raises the question of how trustworthy that set of data is.
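The repeated-sampling idea above can be sketched as a simple check: take the same metric sampled several times and flag the set if the spread exceeds a threshold. The sample values and the 3% threshold below are hypothetical, matching the example.

```python
# Minimal sketch of a consistency check over repeated samples of one metric.

def max_relative_difference(samples):
    """Return the largest relative difference between any two samples."""
    lo, hi = min(samples), max(samples)
    return (hi - lo) / lo

monthly_samples = [1020, 1005, 998]  # same metric, sampled over 3 months
diff = max_relative_difference(monthly_samples)
if diff < 0.03:
    print(f"Consistent: max difference {diff:.1%}")
else:
    print(f"Inconsistent: max difference {diff:.1%}")
```

Here the three samples differ by about 2%, so the set passes; a 40% spread would fail the same check.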
3 – Validity: Refers to the correctness of the data collected. For example, if the data should consist only of numbers, such as kilograms or meters, but some records contain letters a–z, it taints the result. Hence the data is invalid.
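A validity check like the one described can be sketched as follows: a weight field should hold only positive numbers, so anything that fails to parse is flagged. The field and sample values are hypothetical.

```python
# Sketch of a validity check: weights must parse as positive numbers.

def is_valid_weight(value: str) -> bool:
    """Accept strings that parse as a positive number; reject the rest."""
    try:
        return float(value) > 0
    except ValueError:
        return False

readings = ["72.5", "68", "seventy", "81kg"]
invalid = [v for v in readings if not is_valid_weight(v)]
print(invalid)  # → ['seventy', '81kg']
```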
4 – Completeness: Refers to missing parts of the data that compromise the validity of any computation done on it. For example, if you were to send out 100 surveys but only 30 were returned, even if 90% of respondents favored your new product, you could not accurately conclude that this holds for all 100 recipients.
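The survey example can be made concrete: with 27 of 30 respondents (90%) in favor, the only honest population-level statement covers a wide range, because the 70 non-respondents are unknown. The numbers below come straight from the example.

```python
# Sketch: why a 30% response rate undermines a 90% favorability claim.

surveys_sent = 100
surveys_returned = 30
favorable = 27  # 90% of the 30 respondents

response_rate = surveys_returned / surveys_sent
# Population-level bounds if non-respondents are all unfavorable / all favorable:
min_possible = favorable / surveys_sent
max_possible = (favorable + (surveys_sent - surveys_returned)) / surveys_sent

print(f"Response rate: {response_rate:.0%}")
print(f"True favorability could be anywhere from "
      f"{min_possible:.0%} to {max_possible:.0%}")
```

The missing 70 responses leave the true figure anywhere between 27% and 97%, which is why incomplete data cannot support the headline claim.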
5 – Timeliness: Refers to the time dimension of a data set, such as the period selected for the data, or whether the data is real-time versus a snapshot (frozen) of how things looked a week ago that is being displayed today. Are you comparing a number against its average over the past 3 months or the past year?
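One way to guard against the frozen-snapshot problem is a freshness check: flag a data set whose latest timestamp is older than an acceptable staleness window. The one-day window below is a hypothetical threshold.

```python
# Sketch of a freshness check for a data set's last-updated timestamp.
from datetime import datetime, timedelta, timezone

def is_fresh(last_updated: datetime, max_age: timedelta) -> bool:
    """True if the data was updated within the allowed staleness window."""
    return datetime.now(timezone.utc) - last_updated <= max_age

# A week-old snapshot fails a one-day freshness requirement.
snapshot_time = datetime.now(timezone.utc) - timedelta(days=7)
print(is_fresh(snapshot_time, timedelta(days=1)))  # → False
```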
6 – Uniqueness: Refers to the percentage or ratio of overlapping and duplicate data. For example, if you are running a pivot based on email address, but a certain customer has registered twice with different email addresses at @gmail and @hotmail, then the information deduced from the raw data may very well be skewed.
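The duplicate-customer scenario can be sketched by deduplicating on two different keys. The records and field names are hypothetical; the same person appears twice under different email addresses, which only shows up when keyed on a more stable field such as phone number.

```python
# Sketch of a uniqueness check: counting distinct customers by two keys.

records = [
    {"name": "A. Customer", "phone": "081-000-0000", "email": "a@gmail.com"},
    {"name": "A. Customer", "phone": "081-000-0000", "email": "a@hotmail.com"},
    {"name": "B. Customer", "phone": "082-111-1111", "email": "b@gmail.com"},
]

by_email = {r["email"] for r in records}
by_phone = {r["phone"] for r in records}

print(len(by_email))  # → 3: the duplicate is invisible when keyed on email
print(len(by_phone))  # → 2: the overlap becomes visible
```

Any pivot keyed on email would treat this customer as two people, inflating counts and skewing per-customer averages.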
At the end of the day, data is simply a set of numbers and records as-is, given meaning by the person who explores and analyzes it. Hence, when referring to the trustworthiness of data and information, there is an important extra dimension that extends beyond the data itself: integrity.