I Have Some Massive Amazing Idea. Ok, Now How Do I Make It Happen With Data Science?
Many people talk about data science projects and products, but only a handful understand the steps involved in actually building one. One thing I’ve come to discover so far in school is that few organizations have the appropriate infrastructure for data science.
In the remaining blabber, I’ll discuss the different steps you need to build a data science model or product. Everybody can benefit from this process and methodology.
What the Heck Am I trying to Do?
The first thing you have to do before you start anything is to define and understand the problem you’re trying to solve. You need to be able to translate that idea into a data science question and a process that ends in a solution.
Start by engaging with the right people — the ones whose business or process you want to improve — and asking the appropriate questions.
Data science products or models shouldn’t exist in a vacuum. Instead, they should help your employer (or yourself) transform business operations, improve processes, or identify solutions to issues impeding operations. That’s why you need to truly understand the business problem and evaluate whether you can solve it, because not all business problems can be solved with data science. A better understanding of the business problem increases your chances of building great data-driven products that can positively impact your organization.
What the Heck Do I Need?
The next step after problem understanding is to collect the right set of data. Data collection is essential: it’s somewhere between useless and impossible to build a good model without the correct data or a mechanism to parse it. Many organizations collect unreliable, incomplete data, and everything done afterwards is compromised. Even worse, some don’t even know what data to collect or where to find it.
With modern technologies like web scraping, cloud data collection tools, web APIs, and database systems, plus tools like SQL, Python, R, Beautiful Soup, and various Apache projects, you can collect very valuable data from almost anywhere. The data to collect depends on what problem you’re trying to solve and how.
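As a taste of what "collecting data from almost anywhere" can look like, here is a minimal scraping sketch using only Python's standard library. The HTML snippet, the class name `price`, and the parser are all invented for illustration; in practice you'd fetch real pages with an HTTP client or reach for a library like Beautiful Soup.

```python
# Hypothetical example: pull every <span class="price"> value out of raw HTML
# using only the standard library's html.parser.
from html.parser import HTMLParser

class PriceParser(HTMLParser):
    """Collects the text of every <span class="price"> element."""
    def __init__(self):
        super().__init__()
        self.in_price = False
        self.prices = []

    def handle_starttag(self, tag, attrs):
        if tag == "span" and ("class", "price") in attrs:
            self.in_price = True

    def handle_data(self, data):
        if self.in_price:
            self.prices.append(data.strip())
            self.in_price = False

html = '<div><span class="price">19.99</span><span class="price">5.50</span></div>'
parser = PriceParser()
parser.feed(html)
print(parser.prices)  # ['19.99', '5.50']
```

The same pattern — request a page, parse out the fields you care about — scales up to real scraping jobs, though for anything serious you'll want a dedicated library and a close read of the site's terms of use.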
Why the Heck Isn’t This Helpful?
It’s not enough to get all this data without processing it. Just like crude oil, freshly collected data is raw and useless in its pure form. After collecting the appropriate dataset, you need to adequately clean and process the data before proceeding to the next step.
If you just build an ML model without processing or cleaning, your final product will definitely not make any sense. Bad data makes bad models, no matter how much you tune and optimize parameters. Even the effectiveness of your analysis depends on the quality of your data.
There are also plenty of data problems — duplicate and null values, inconsistent types, invalid entries, improper formatting, or data that’s missing altogether — that you’ll have to resolve before proceeding.
In fact, it’s normal for most of your time to be spent on these first stages; collection and cleaning commonly eat up around 80% of the time dedicated to a data project. So if you want to build great data science models, you need to find and resolve flaws in the dataset first. Although data cleaning is cumbersome, you’ll benefit from it as long as you stay focused on the final goal.
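To make the cleaning steps above concrete, here is a small sketch with pandas covering three of the problems listed: duplicate rows, invalid entries, and missing values. The column names, the toy data, and the median-imputation choice are all hypothetical — real cleaning rules depend entirely on your dataset.

```python
# Hypothetical cleaning pass: deduplicate, coerce bad entries, impute missing.
import pandas as pd

raw = pd.DataFrame({
    "age":  ["34", "34", None, "29", "not available"],
    "city": ["Lagos", "Lagos", "Accra", "Accra", "Nairobi"],
})

df = raw.drop_duplicates()                             # remove exact duplicate rows
df["age"] = pd.to_numeric(df["age"], errors="coerce")  # invalid entries become NaN
df["age"] = df["age"].fillna(df["age"].median())       # impute missing values
print(df)
```

Median imputation is just one option; depending on the feature you might instead drop the rows, impute with a group-level statistic, or flag missingness as its own feature.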
What Is This Stuff Trying to Tell Me?
At this point, you have a plethora of data and you’ve managed to make it as organized as possible. It’s time to deeply inspect all the features, trust the numbers, gain intuition about the data, and figure out how to handle each feature. This entire process is called exploratory data analysis (EDA) — one of the common words in data science.
EDA involves several forms of analysis, like missing-value treatment, outlier identification, variable transformation, feature engineering, and correlation analysis. An effective strategy provides the best foundation for building stable predictors. It’ll help you ask the data the right questions, and explore and visualize different datasets to identify patterns and uncover the insights useful in constructing a good model. It also pushes you to think innovatively and analytically.
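A quick EDA pass might look like the sketch below: summary statistics, a correlation check, and simple IQR-based outlier flagging with pandas. The toy data is invented, and the 1.5×IQR rule is just one common convention for spotting outliers.

```python
# Hypothetical EDA pass on a tiny invented dataset.
import pandas as pd

df = pd.DataFrame({
    "income": [30, 32, 35, 31, 29, 300],   # 300 is an obvious outlier
    "spend":  [10, 11, 12, 10, 9, 90],
})

print(df.describe())   # distributions at a glance
print(df.corr())       # pairwise correlations

# Flag outliers using the common 1.5 * IQR rule
q1, q3 = df["income"].quantile([0.25, 0.75])
iqr = q3 - q1
outliers = df[(df["income"] < q1 - 1.5 * iqr) | (df["income"] > q3 + 1.5 * iqr)]
print(outliers)
```

On a real dataset you'd pair these numbers with plots (histograms, scatter plots, box plots) — the eyeball is still one of the best outlier detectors around.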
How Do I Actually Build the Thing?
In the model construction phase, you do the actual modelling of the data and better explain the insights. The first thing to do at this stage is split the dataset you’ve cleaned into a set to train on and a set to test on. The typical magic formula is 70% train and 30% test, but don’t worry — there are handy packages to do this for you.
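One of those handy packages is scikit-learn; its `train_test_split` helper does the 70/30 split in one line. The feature matrix here is synthetic, just to show the mechanics.

```python
# 70/30 split with scikit-learn; the data is synthetic.
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(100).reshape(-1, 1)   # 100 samples, 1 feature
y = (X.ravel() > 50).astype(int)    # toy target

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)  # fixed seed for reproducibility

print(len(X_train), len(X_test))  # 70 30
```

Fixing `random_state` makes the split reproducible, which matters when you want evaluation numbers you can compare across runs.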
You should use the training set to build predictive models and evaluate your model’s performance on the unseen data points (the test set). ML problems are generally classified as either supervised or unsupervised. Supervised learning involves building a model that can accurately predict the target variable using a set of features known as predictors, while unsupervised learning is a self-learning approach where the model has to find unknown patterns and relationships among the predictors.
You can use several evaluation metrics to examine how well your model works, and the choice of metrics to use depends on the kind of problem you’re trying to solve. Then based on the evaluation results, you may need to tweak the model’s parameters to ensure that it generalizes well and can work well when exposed to previously unseen data.
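Putting the last two paragraphs together, here is a hedged end-to-end sketch: train a supervised classifier on the training split and score it on the held-out test set. The dataset (scikit-learn's bundled iris data) and the metric (accuracy) are illustrative choices; for an imbalanced problem you'd reach for precision, recall, or ROC AUC instead.

```python
# Supervised learning sketch: fit on the training split, score on the test split.
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
acc = accuracy_score(y_test, model.predict(X_test))
print(f"test accuracy: {acc:.2f}")
```

If the test-set score is noticeably worse than the training score, that's the cue to revisit the model's parameters — the tweaking step described above.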
How Can I Explain This To Someone Who Knows No Data Science?
After building and evaluating the model, you need to communicate the model results and present your findings to stakeholders.
No one is interested in the fancy algorithm you used to build the model or the number of hyperparameters you tuned. They’re interested in understanding what they can do with it and how it will make things better.
As such, every data scientist needs good presentation and data storytelling skills to show how a model helps address the business problems identified in the first phase of the life cycle. Using sophisticated wording and putting complex formulas on your presentation slides won’t get you far. But by showing your model’s real value in a precise and concise way, you make it easier for executives to adopt the model.
Model Deployment and Maintenance
Communication is usually not the last phase in the data science project lifecycle. Once the stakeholders are pleased with your model’s results, the next step is to deploy the model.
A machine learning model isn’t built to reside on a local machine forever. It needs to generate value for organizations, and the only way to use the model to make practical, data-driven decisions is by delivering it to end-users. I like the way Luigi Patruno puts it in this article: “no machine learning model is useful unless it’s deployed to production”.
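Getting a model off your laptop usually starts with serializing it so a serving process can load it. The sketch below uses the standard library's pickle for illustration (joblib is a popular alternative for scikit-learn models); the tiny model and data are invented.

```python
# Hypothetical deployment step: serialize a trained model, then reload it
# the way a serving process would, and confirm the predictions match.
import pickle

import numpy as np
from sklearn.linear_model import LogisticRegression

X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([0, 0, 1, 1])
model = LogisticRegression().fit(X, y)

blob = pickle.dumps(model)      # in a real pipeline, write this to a file or store
restored = pickle.loads(blob)   # ...and load it inside the serving process

assert (restored.predict(X) == model.predict(X)).all()
print("restored model gives identical predictions")
```

In production you'd wrap the restored model in an API or batch job, but save-and-reload is the seed of every deployment pattern.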
The last thing I’ll mention in this section is model maintenance. When I first got into data science, I always thought that a data scientist can just build a model, deploy it and then relax while the model keeps working forever. But, it wasn’t long before I discovered that just like machines need maintenance, machine learning models need to be maintained as well.
The “deploy once and run forever” practice is harmful because many things can affect a model’s predictive power over time. The pandemic is a perfect example because of all the unpredictable factors it introduced. It’s safe to say tons of organizations have updated, or are planning to update, the ML models they built before the pandemic to capture the new customer patterns and behaviours it exposed. Generally, every organization should have a model upgrade strategy and reconfigure their models periodically.
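One simple shape such an upgrade strategy can take is a monitoring check: track the model's recent accuracy against the accuracy measured at deployment time, and flag a retrain when it degrades too far. The function, the numbers, and the five-point tolerance below are all made up for illustration.

```python
# Toy maintenance check: flag retraining when recent accuracy drifts
# too far below the accuracy recorded at deployment time.
def needs_retraining(baseline_acc, recent_acc, max_drop=0.05):
    """Return True when recent accuracy has fallen more than max_drop below baseline."""
    return (baseline_acc - recent_acc) > max_drop

print(needs_retraining(0.92, 0.90))  # False: within tolerance
print(needs_retraining(0.92, 0.80))  # True: time to retrain
```

Real monitoring setups also watch the input data itself (feature distributions drifting away from the training data), since labels often arrive too late to rely on accuracy alone.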
Well that happened…
In this blabber, I walked through the different stages of a data science project from concept to model: problem understanding, data collection, cleaning and processing, exploratory analysis, model building and evaluation, communicating results, and finally model deployment and maintenance.
I assume that you are now a complete subject matter expert and get the steps you need to build a data science model from an idea. If you have questions or need more terrible instruction, send a carrier pigeon with your questions (or comment on this here post) and clap as many times as you can. In my subsequent blabbers, I’ll continue to document my enormous struggle up the learning curve toward becoming a top-notch data scientist.
Thanks for reading and tell all your friends.