Recently, I have been working on an MLOps series in which I briefly discussed the different steps involved in the lifecycle of a machine learning project. We started with a deployment overview and worked backwards until we reached the process of scoping a project. Those articles were written as notes for the Introduction to Machine Learning in Production course by Andrew Ng.
When I progressed to the following courses in the specialization, I discovered that it depends on the TensorFlow and TensorFlow Extended frameworks. I could have pushed ahead mindlessly and sorted things out later; instead, I took a step back. I re-evaluated what I wanted and reassessed the direction in which I wanted my journey to progress. In the process, I realized that I could no longer focus solely on the course content; I had to analyze what I was being taught.
Hence, I took my sweet time before taking the next step, which led to this article. So, without wasting more time, let us get into the topic.
When you recall the machine learning lifecycle, the first step after defining the project is to ask what data would help us achieve the intended goals. In this phase, our only goal is to identify good and useful data, which we then use to build machine learning models.
Wait a minute... how do we determine the criteria for "good" and "useful"? We can either ignore these remarks and move on, or examine them carefully. In production, it helps to have a firm grasp of what these adjectives, which pop up from time to time, actually mean.
Good Data and Bad Data
Suppose we are trying to predict the economic fluctuations of a country. In this example, what would you think if I insisted on adding a feature: the number of times I sneeze in a day? You would start seeing me as crazy. Why? Because my sneezing is irrelevant to a country's economic fluctuations.
When we are trying to predict economic fluctuations, the number of times I sneeze is a garbage value. But the value of the same feature changes when I try to analyze the state of my cold. The feature remains the same in the two cases, yet in one you see me as a madman, whereas in the other you agree with me completely. This is essentially how we determine whether a feature is good or bad.
A good feature or good data is relevant to the project we are working on. Bad data is the exact opposite.
Imagine if I insisted on adding multiple such features to our project. Our model would no longer predict economic fluctuations; it would predict garbage. This is how the phrase "Garbage In, Garbage Out" came about.
What is useful, then?
"Useful" was the other keyword applied to the data we need to collect. But how do we know whether a feature is useful for the project?
Whatever the case, once we start working on a project in a production setting, it is aimed at a group of customers who will benefit from the product. Because of this, we need to think about the user when we collect data or develop services. What can the user provide? What type of service are we offering? Is it merely augmenting a process, or is it automating the task?
When we consider these user aspects, the data we need changes. For example, consider an application that recommends a health routine to the user. If the user asks for daily workout recommendations, we would collect their gym goals, workout frequency, the time they plan to spend in the gym, any health conditions, and so on.
We would not need their daily wake-up time to recommend workouts. However, if the user's goal is to maintain a healthy lifestyle, we would not need the time they plan to spend in the gym; instead, we would want to know how much sleep they get, their drinking and smoking habits, their target sleep time, their diet, etc. Depending on the user's needs and criteria, the data to be collected should change.
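The idea of goal-dependent data collection can be sketched in a few lines of Python. This is a minimal illustration, not a real intake form; the goal names and field names below are all hypothetical.

```python
# Hypothetical mapping from a user's stated goal to the features worth
# collecting for it. The same app collects very different data
# depending on what the user actually wants.
FEATURES_BY_GOAL = {
    "daily_workout": [
        "gym_goals", "workout_frequency",
        "planned_gym_time", "health_conditions",
    ],
    "healthy_lifestyle": [
        "sleep_hours", "target_sleep_time",
        "drinking_habits", "smoking_habits", "diet",
    ],
}

def features_to_collect(user_goal: str) -> list[str]:
    """Return the features worth collecting for a given user goal."""
    return FEATURES_BY_GOAL.get(user_goal, [])

print(features_to_collect("daily_workout"))
```

Note that "wake-up time" appears in neither list for the workout goal, while sleep-related fields only matter for the lifestyle goal: relevance is decided by the user's need, not by the feature itself.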
Now that we have discussed what data to collect, it is time to discuss the "how" of the data collection process. There are many questions here, but for now, let's focus on the most basic ones you need to answer to start your machine learning project.
- How to acquire the data that we need?
- How much data should we collect?
The sources for data acquisition fall broadly into two categories: user-generated and system-generated. At the root level, user-generated data is obtained from surveys, polls, case studies, Q&A forums, images, videos, etc. This data is highly diverse and can have multiple inconsistencies hidden in it. There is a saying widely accepted among data professionals: "If it is even remotely possible for users to input the wrong data, they are going to do it." Hence, user-input data requires heavy-duty checking and processing.
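To make the "users will input the wrong data" point concrete, here is a minimal sketch of validating one user-submitted survey row before it enters the dataset. The field names and plausibility ranges are assumptions for illustration only.

```python
# Hypothetical checks for a user-submitted survey row. Without them,
# typo-ridden entries flow straight into the training data.
def validate_survey_row(row: dict) -> list[str]:
    """Return a list of problems found in one user-submitted row."""
    errors = []
    age = row.get("age")
    if not isinstance(age, int) or not (0 < age < 120):
        errors.append(f"implausible age: {age!r}")
    if "@" not in str(row.get("email", "")):
        errors.append("malformed email")
    if str(row.get("country", "")).strip() == "":
        errors.append("missing country")
    return errors

# A row like this would be silently accepted if we skipped the checks.
print(validate_survey_row({"age": 250, "email": "not-an-email", "country": ""}))
```

Real pipelines push this much further (schemas, type coercion, deduplication), but the principle is the same: assume every user-entered field can be wrong.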
System-generated data, on the other hand, is data automatically produced or stored by the systems that users interact with. This includes various types of logs, system outputs, online sales records, website traffic, etc. Because systems can store everything they capture, this kind of data is generally huge, and the required data can be difficult to find within all the collected noise. The rise of internet usage over the last two decades has made it easy for organizations to collect system-generated data.
Beyond these two types, there is a third source: third-party data. This is where most of the free online data sources on Kaggle and elsewhere come into play. Third-party data is data collected by other companies or organizations from users who are not direct clients or users of the company and its services.
How much data should we collect?
This is a very interesting question, as no clear guidelines exist on how much data one should collect. In practice, it is done iteratively; this process is described in one of the previous articles in the ML lifecycle series, which you can read here. We tune the amount of data based on the results we get from modelling it. Most projects begin with a small amount of data and slowly expand it depending on the project's needs. However, an initial benchmark can be estimated from how the model is intended to be used.
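The iterative loop described above can be sketched as: train on the current dataset, evaluate, and collect more data only while the metric keeps improving. The `train_and_evaluate` function below is a hypothetical stand-in for a real training pipeline; its saturating-accuracy curve and the doubling strategy are illustrative assumptions, not a rule.

```python
# A sketch of iterative data collection: grow the dataset until extra
# data stops paying off in evaluation metric.
def train_and_evaluate(n_samples: int) -> float:
    """Hypothetical accuracy that improves with data but saturates."""
    return 0.9 * n_samples / (n_samples + 500)

def grow_dataset(start: int = 100, min_gain: float = 0.01) -> int:
    """Double the dataset each round while the metric gain is worth it."""
    n, score = start, train_and_evaluate(start)
    while True:
        candidate = n * 2              # collect twice as much data
        new_score = train_and_evaluate(candidate)
        if new_score - score < min_gain:
            return n                   # more data no longer pays off
        n, score = candidate, new_score

print(grow_dataset())
```

In a real project, each round means actually gathering and labelling more data, so the stopping criterion also weighs collection cost, not just the metric.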
For example, if the sole purpose of the model is to recommend online courses, you need to collect data related to the person's job and career goals. Suppose it concerns a broader task, like recommending videos on YouTube or another video streaming platform. In that case, your data should include everything that helps you determine the users' interests, rather than their background, which is probably not very useful here.
How can things go wrong?
In the discussion above, I have covered what data you should collect and how to collect it. However, these are not the only considerations. Especially in a professional setting, you should ensure that the data you collect is unbiased and respects the user's privacy and security by complying with applicable privacy principles and regulations.
These ethical guidelines differ slightly from company to company, but most share the same essence. The main points are as follows:
- The data should not incorporate any bias against people, especially concerning sensitive characteristics like race, gender, nationality, or political or religious beliefs.
- The data collected should avoid incorporating sensitive personally identifiable information.
- The data collected should guarantee the security of whatever sensitive information it contains.
More detailed guidelines regarding the principles can be found in each company's ethical guidelines. For example, you can check out Google's AI principles here.
You can start your data collection process based on the information in this article. In general, data collection is a very important aspect of any machine learning or data science project: the quality of the collected data affects the speed of the remaining phases in the project lifecycle. Hence, it is important to ensure that the collected data stays consistent in the formats and units of its features, among other basic principles.
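Consistency of formats and units is easiest to see with a tiny example. The sketch below normalizes every record to one unit (kilograms) and one date format (ISO) before the data moves on; the field names, supported units, and input date format are assumptions for illustration.

```python
# A minimal normalization pass: mixed units and date formats in the
# raw data are converted to a single convention up front, so later
# phases never have to guess what a value means.
from datetime import datetime

KG_PER_UNIT = {"kg": 1.0, "lb": 0.453592, "g": 0.001}

def normalize_record(record: dict) -> dict:
    """Convert weight to kilograms and a DD/MM/YYYY date to ISO format."""
    weight_kg = record["weight"] * KG_PER_UNIT[record["weight_unit"]]
    date_iso = datetime.strptime(record["date"], "%d/%m/%Y").date().isoformat()
    return {"weight_kg": round(weight_kg, 3), "date": date_iso}

print(normalize_record({"weight": 150, "weight_unit": "lb", "date": "05/01/2024"}))
```

Catching a pounds-vs-kilograms mix-up at collection time is far cheaper than discovering it after a model has been trained on the inconsistent values.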
However, expecting a perfect outcome right after collecting the data is a dream. Hence, data collection is followed by another critical phase of the data lifecycle: data validation. The data validation process will be described in detail, along with an example using the TFDV (TensorFlow Data Validation) framework, in the following article of the Data Lifecycle in Production series.
If you haven't checked out the Machine Learning project lifecycle series, you can start reading from here. Subscribe to Neuronuts if you want to stay updated on the upcoming articles in the series, or follow us on Instagram or LinkedIn for other information related to the articles! Thanks for reading till the end.