In the last article, Data in Production: Data Collection, we discussed the difference between good data and bad data, and concluded that solid data is essential to the success of any machine learning project. In this article, we'll expand on that idea by exploring the factors that affect data quality. You can catch up on the previous article here if you missed it.
P.S. This is, by far, the shortest article I have written on this site, as I tried to keep it as simple as possible.
So, what is "good" data? How do we judge the "good" or "bad" in something like data? As with most things, the judgement is subjective: depending on the context, data that was previously useful can easily become useless. Hence, every project needs its own set of evaluation metrics and instructions to determine whether the collected data is useful or useless.
Even so, these instructions must be grounded in concrete metrics, ones that can evaluate both the contextual relevance of the data and its quality. A consolidated report based on these metrics then lets us decide how to process the data further.
Generally, data quality is assessed along four major dimensions:
- Data Accuracy
- Data Completeness
- Data Consistency
- Data Timeliness
Let's look at each of them in a bit more detail.
The first and most basic evaluation of any dataset is a check of its accuracy. No process is error-free, least of all data collection. Since we usually collect from many different sites and sources, various errors can creep into the data. These errors are usually treated immediately, and doing so is one of the longer stages in a project's data life-cycle.
Let's ponder for a few minutes. What could go so wrong that our dataset is deemed unusable?
- It can have mismatched labels.
- Maybe the values that we collected are incorrect. Did we collect any text in the columns that should be purely numerical?
- Maybe the data we collected has too many outliers, making the model unusable.
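The first and last of these failure modes can often be caught programmatically. Here is a minimal sketch with pandas, using a small hypothetical dataset (the column names and values are made up for illustration):

```python
import pandas as pd

# Hypothetical product data with a couple of deliberate collection errors.
df = pd.DataFrame({
    "price": ["199", "249", "oops", "189"],  # free text crept into a numeric column
    "label": ["shirt", "shirt", "shoe", "shirt"],
})

# 1. Flag non-numeric values in a column that should be purely numerical.
numeric = pd.to_numeric(df["price"], errors="coerce")
bad_rows = df[numeric.isna()]
print(bad_rows)  # the row containing "oops"

# 2. Flag outliers with a simple z-score rule (|z| > 3).
clean = numeric.dropna()
z = (clean - clean.mean()) / clean.std()
outliers = clean[z.abs() > 3]
```

A z-score cutoff of 3 is only one of many possible rules; pick whatever suits your data's distribution.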
Well, all of these, and much more, can go wrong in the data collection process. Here are a few examples,
One of the easiest mistakes to make during data evaluation is overlooking that our data might not represent all segments of the population. A scenario makes this clearer.
Suppose you are working on a model that recommends morning routines for healthier living. To make your recommendations better, you should consider the audience who will be using the model. If your customers are professional athletes, then you can train your model based on the behaviours of peak athletes. However, you cannot use the same model to recommend morning routines to your next-door grandpa. He will not be able to handle 100 pushups if he has arthritis. So, we must change our dataset and the behaviours we collect.
Similarly, the completeness of the data should be re-checked repeatedly against the use case. A tactic borrowed from marketing is useful here: build a consumer persona and anchor the data collection around it. Data incompleteness can usually be traced to three major mistakes:
- Missing values: These might go unnoticed without EDA, as not all data sources carry every field we are collecting. For example, when collecting merchant information from Amazon, Flipkart, Myntra and other sites, some sites mention the gender of the merchant while others don't. In the process, we tend to leave the unavailable fields empty so we can decide on them later.
- Broken timelines: Time-series data often arrives with a few days missing. For example, you might find a trend in last year's stock data, but you cannot use a record full of gaps to guess what today's prices will be.
- Shallow samples: This is the category where the example of trying to suggest morning routines based on athletes' lifestyles falls under. These can be avoided by sorting out and defining the customers the model targets.
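The first two of these can be surfaced with a quick pandas check. A minimal sketch, assuming a made-up merchant table and a daily timeline (all names and dates here are illustrative):

```python
import pandas as pd

# Hypothetical merchant rows scraped from different sites; one omits gender.
df = pd.DataFrame({
    "merchant": ["A", "B", "C"],
    "gender":   ["M", None, "F"],
})

# 1. Missing-value count per column.
missing = df.isna().sum()
print(missing)  # gender: 1

# 2. Gaps in a supposedly continuous daily timeline.
dates = pd.to_datetime(["2023-07-01", "2023-07-02", "2023-07-05"])
full = pd.date_range(dates.min(), dates.max(), freq="D")
gaps = full.difference(dates)
print(gaps)  # 2023-07-03 and 2023-07-04 are missing
```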
When you have time, take a trip to Kickstarter's website and browse through a few campaigns. One thing you will notice right away is the sheer variety of currencies in which funds are being raised. If we blindly collect the amounts from Kickstarter without their currencies, we will only collect noisy data. This form of inconsistency is called Measurement inconsistency.
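One common fix is to record the unit alongside the value and normalize everything to a single unit before modelling. A minimal sketch, assuming illustrative (frozen) exchange rates and made-up column names:

```python
import pandas as pd

# Hypothetical pledge amounts collected together with their currency codes.
df = pd.DataFrame({
    "amount":   [1000.0, 500.0, 2000.0],
    "currency": ["USD", "EUR", "INR"],
})

# Illustrative rates to USD; a real pipeline would fetch these
# from a rate service at collection time, not hard-code them.
usd_rate = {"USD": 1.0, "EUR": 1.08, "INR": 0.012}

# Normalize every amount to a single currency.
df["amount_usd"] = df["amount"] * df["currency"].map(usd_rate)
```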
Now, take a look at this: 09/07/23. This is the DD/MM/YY form of the date. But depending on where you are from, you might read it in multiple ways: 9th July 2023, 7th September 2023, or even 23rd July 2009. This is a simple example of a single entity appearing in multiple forms, which can confuse any model built on the data. Such inconsistencies are generally called Format inconsistencies.
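The same string really does parse to different dates under different conventions, which is why pinning an explicit format (and storing dates in an unambiguous form such as ISO 8601) matters. A small sketch with pandas:

```python
import pandas as pd

raw = "09/07/23"

# The same string parsed under two different conventions:
ddmmyy = pd.to_datetime(raw, format="%d/%m/%y")  # 9 July 2023
mmddyy = pd.to_datetime(raw, format="%m/%d/%y")  # 7 September 2023

# Storing everything in ISO 8601 (YYYY-MM-DD) removes the ambiguity.
iso = ddmmyy.strftime("%Y-%m-%d")
print(iso)  # 2023-07-09
```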
Other than Measurement and Format, we also have two other commonly found inconsistencies: Categorical and Convention inconsistencies. Categorical inconsistencies arise when similar items are placed under different categories. For example, consider Fashion and Clothing: the same shirt can be found under Fashion on Amazon and under Men's Shirts on Myntra. Such rows of data can't simply be removed with the pandas drop_duplicates() method and need someone to verify these sections manually. (Of course, there are other things we can do, too! But that's a topic of discussion for another day.)
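To see why a naive dedup misses these rows, consider this sketch: the category labels differ across sites, so pandas sees two distinct rows until we map them onto one shared taxonomy. The taxonomy mapping here is a made-up example:

```python
import pandas as pd

# Hypothetical listings of the same shirt from two sites,
# filed under different category names.
df = pd.DataFrame({
    "product":  ["blue shirt", "blue shirt"],
    "category": ["Fashion", "Men's Shirts"],
    "source":   ["Amazon", "Myntra"],
})

# A naive dedup sees two distinct rows because the categories differ.
naive = df.drop_duplicates(subset=["product", "category"])
print(len(naive))  # 2

# After mapping site-specific categories to one shared taxonomy,
# the duplicate becomes visible and can be removed.
taxonomy = {"Fashion": "Apparel", "Men's Shirts": "Apparel"}
df["category"] = df["category"].map(taxonomy)
deduped = df.drop_duplicates(subset=["product", "category"])
print(len(deduped))  # 1
```

Building that taxonomy mapping is exactly the part that tends to need manual verification.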
Similarly, the easiest example of convention inconsistencies is our different names for a footpath. Some prefer to call it a pavement, while others call it a sidewalk. These different naming conventions can easily creep in during the data collection and must be filtered after the collection.
If you want to predict tomorrow's stock values, then the data that you need should be as recent as possible. If you try to predict based on the data from the last decade, you can bid farewell to your hard-earned money.
Similarly, weather predictions should be based on the latest data rather than on the storms of the last century. This relevance of the data's time period is known as data timeliness. When it comes to time, all kinds of things can go wrong: we might have outdated information, or a lag in the collection process, which is fatal for stock predictions.
At the same time, mechanical failures might cause data synchronization errors. These issues will ultimately distort the trends we are trying to predict. Hence, we must keep them in mind during the data validation process.
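A simple guard against stale data is to record a collection timestamp with every row and filter against a freshness cutoff during validation. A minimal sketch with made-up observations; how strict the cutoff should be depends entirely on the use case (stock models need hours, not years):

```python
import pandas as pd

# Hypothetical price observations with their collection timestamps.
df = pd.DataFrame({
    "price": [101.0, 102.5, 99.8],
    "collected_at": pd.to_datetime(
        ["2015-06-01", "2023-08-01", "2023-08-02"]
    ),
})

# Keep only observations newer than the freshness cutoff.
cutoff = pd.Timestamp("2023-01-01")
fresh = df[df["collected_at"] >= cutoff]
print(len(fresh))  # the 2015 row is dropped
```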
With this, we have finished our short discussion regarding the factors that can negatively affect the data quality during the collection. Next, we will discuss how we validate the data during a project using the TFDV framework. Though this article will be released around August 11th, you can check out the different articles on the site while waiting.
In the meantime, you can check what the LinkedIn community recommends as a machine learning roadmap here. Or, you can check out the new NLP series, which started with regex, here. Make sure to subscribe to this page to get notified of the upcoming articles in NLP and MLOps, along with the newsletter that will start soon!
Thanks for reading till the end! Have a nice day!
- Data Collection + Evaluation, People + AI Guidebook, Google
- Best Practices for ML Engineering
- Machine Learning Design Patterns, O'Reilly Media