A couple of weeks ago I came across this article in the Harvard Business Review on data quality and its impact on machine learning models. It made me think about the factors that put a new data science project on the path to success.
A note on planning
Data science projects are very much research projects, and as in any other science, breakthroughs don’t arrive on schedule. Businesses are used to a process of planning, tracking progress and checking intermediate results, but building a model requires a trial-and-error approach that is hard to quantify and evaluate at intermediate stages. This should always be taken into account in the planning stage. While a handful of core steps are easy to identify, the path to completion isn’t smooth and often takes a few attempts to get right. Project managers should plan for quick iterations and embrace failure as a possible outcome of each of them.
Goal setting in Data Science
Even before planning starts, however, the way goals and expectations are set has a great impact on the outcome, from both a technical and a business point of view.
A tricky part of this process is reconciling business goals with mathematical objectives. On one hand, business decisions are still driven by vision and business acumen; on the other, data scientists make decisions based on rigorous performance metrics that don’t always translate directly into business value. A common mistake is to set a quantitative goal, e.g. 95% accuracy for a classification model, without any prior notion of whether this is a reasonable measure: is accuracy the right performance metric? Is 95% too easy or too hard in this specific domain? Will achieving that level of accuracy translate into a competitive advantage for the business?
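To make the first of those questions concrete, here is a minimal sketch, using scikit-learn and a synthetic dataset invented purely for illustration, of why raw accuracy can be a misleading target: on an imbalanced problem, a baseline that never predicts the rare class still scores around 95% accuracy while being useless in practice.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced dataset: roughly 95% negatives, 5% positives.
rng = np.random.default_rng(0)
X = rng.normal(size=(10_000, 5))
y = (rng.random(10_000) < 0.05).astype(int)

X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)

# A "model" that always predicts the majority class, i.e. learns nothing.
baseline = DummyClassifier(strategy="most_frequent").fit(X_train, y_train)
pred = baseline.predict(X_test)

print(f"accuracy: {accuracy_score(y_test, pred):.2f}")                    # ~0.95
print(f"F1 (rare class): {f1_score(y_test, pred, zero_division=0):.2f}")  # 0.00
```

A 95% accuracy goal would be met here on day one without creating any value, which is exactly why the metric has to be chosen with the domain in mind.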
The only solution to this is to spend time understanding the domain of the problem before defining goals and setting expectations. The positive effects of having clearly defined goals trickle down to all subsequent stages of the process, setting guidelines for both project managers and the technical team.
In terms of data quality, for example, clear goals help ensure the data collected and used in the project is aligned with its scope. This is the main point of the HBR article: if you don’t have the right data, no amount of it will improve your results, no matter how clever the algorithm or how much time you spend tuning it.
Data processing
The next step in a successful data science project is data processing. Real data comes with noise, inconsistencies and missing values, so once you have the right dataset (or datasets), you must process it appropriately. Data cleaning is widely recognized as the least enjoyable and most time-consuming task in a data scientist’s work life (see here). As the story goes, preparing data for modeling takes about 60 to 80 percent of the total time spent on a project. From a business perspective this might look like an inefficiency, or a waste of resources, but it’s not. Before processing, data is really just a hot mess, which makes this a crucial stage in the building of a model: it’s at this point that messy data starts turning into actionable insight. The tasks performed here range from replacing missing values based on observations and statistics to dimensionality reduction and feature extraction. Some of them, such as correcting inconsistencies, may sound trivial, but the choices made at this stage can make the difference between injecting bias into a model and enabling an effective, accurate machine learning workflow.
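As a concrete illustration, here is a minimal processing sketch using pandas and scikit-learn. The dataset and column names are invented for the example, and the specific choices (median imputation, an explicit “unknown” category, PCA for dimensionality reduction) are just one defensible configuration among many, not the one right way to do it.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.decomposition import PCA
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Invented raw data with the usual problems: missing values and
# inconsistent category labels.
df = pd.DataFrame({
    "age": [34, None, 52, 41],
    "income": [48_000, 61_000, None, 53_000],
    "segment": ["retail", "Retail", "wholesale", None],
})

# Correct a simple inconsistency before modeling: unify label casing.
df["segment"] = df["segment"].str.lower()

numeric = ["age", "income"]
categorical = ["segment"]

preprocess = ColumnTransformer([
    # Median imputation is one defensible choice; means are skewed by outliers.
    ("num", Pipeline([
        ("impute", SimpleImputer(strategy="median")),
        ("scale", StandardScaler()),
    ]), numeric),
    # Treat missing categories as their own level rather than guessing one.
    ("cat", Pipeline([
        ("impute", SimpleImputer(strategy="constant", fill_value="unknown")),
        ("encode", OneHotEncoder(handle_unknown="ignore", sparse_output=False)),
    ]), categorical),
])

pipeline = Pipeline([
    ("preprocess", preprocess),
    ("reduce", PCA(n_components=2)),  # dimensionality reduction
])

features = pipeline.fit_transform(df)
print(features.shape)  # (4, 2)
```

Even in this toy example, each step encodes a judgment call: imputing the median rather than the mean, or encoding missing categories explicitly instead of dropping rows, are exactly the kinds of choices that can inject or avoid bias downstream.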
Wrapping it all up
Data science projects are far more similar to research projects than businesses are used to, and it’s important to keep that in mind when getting started. They require a more flexible style of project management and particular attention to goal setting, so there’s enough space for creative innovation (the magic element) but also enough focus to ensure the creation of business value. Ultimately, the performance of any model is determined more by the quality of its data than by its parameter tuning, so extending the early phases and spending more time on data quality, in terms of both sources and processing, will be a determining factor in the success of any project.