Data science rarely fails to draw interest from IT and business leaders alike these days. But it does fail.
In fact, data science initiatives, which leverage scientific methods, processes, algorithms, and technology systems to extract a range of insights from structured and unstructured data, can fail in any number of ways, leading to wasted time, money, and other resources. Flawed projects can do an enterprise more harm than good by leading decision-makers astray.
Here are some of the most common reasons why data science projects do not pan out as expected.
Poor data quality
Bad data makes for bad data science, so it’s of vital importance to take the time to ensure data is of high quality. That’s true for any analytics undertaking and it’s certainly the case with data science.
“Bad or dirty data makes data science initiatives impossible,” says Neal Riley, CIO at Adaptavist, a digital transformation consultancy. “You have to make sure that your data is clean and ready for data analysts. If not, it’s just a complete waste of time.”
When enterprises use unclean data for data science projects, they will end up “looking at models that come out with weird outputs [and] seeing it doesn’t represent reality or the process in a way that makes things better,” Riley says.
Sometimes the quality of data is poor because of bias or discrepancies in the data sets.
“For some organizations there are multiple systems used to run the business,” says Brandon Jones, CIO at insurer Worldwide Assurance for Employees of Public Agencies (WAEPA). “For seasoned businesses you may even have legacy systems that are still accessed [for] reference or validation. In many cases, the business changed with each system, therefore leading to different processes and/or ways to count a metric within the business.”
This can be a leading cause of failure for data science, Jones says. Findings might be inflated due to double counting based on a modified business process. “To resolve this issue, organizations must level-set their data analytics program,” he says. “This means outlining a specific date where data can be validated and everyone understands and has buy-in that this is the common standard that the organization will be working from.”
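The double counting Jones describes can be illustrated with a minimal Python sketch. It assumes two hypothetical systems, a legacy one and a current one, that both hold a copy of the same policy; the record layout, field names, and figures are invented for illustration and do not come from WAEPA.

```python
# Hypothetical records from two systems. The field names ("policy_id",
# "premium") and all values are assumptions made for this example.
legacy_system = [
    {"policy_id": "P-001", "premium": 1200.0},
    {"policy_id": "P-002", "premium": 800.0},
]
current_system = [
    {"policy_id": "P-002", "premium": 800.0},  # same policy, migrated copy
    {"policy_id": "P-003", "premium": 950.0},
]


def naive_total(*systems):
    """Sum premiums across systems without checking for overlap."""
    return sum(record["premium"] for system in systems for record in system)


def deduplicated_total(*systems):
    """Sum premiums, counting each policy_id once (first occurrence wins)."""
    seen = {}
    for system in systems:
        for record in system:
            seen.setdefault(record["policy_id"], record)
    return sum(record["premium"] for record in seen.values())


if __name__ == "__main__":
    print("naive total:", naive_total(legacy_system, current_system))
    print("deduplicated total:", deduplicated_total(legacy_system, current_system))
```

In this made-up data the naive total is 3,750 while the deduplicated total is 2,950, because policy P-002 lives in both systems. Deduplicating on a shared key is only one possible safeguard; it is a stand-in for, not a description of, the level-setting exercise Jones recommends.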
No clear definition of the problem to be solved
How can a data science initiative succeed if team members don’t understand the business problem they are trying to solve? And yet, data science teams sometimes encounter this shortcoming when asked to take on projects.
“Defining a problem is often left to the data scientists, when in fact the definition of a problem [comprises] business cases that both scope the work and define the potential return on investment,” says Michael Roytman, chief data scientist at cyber security company Kenna Security.
Business users looking to leverage data science need to ask probing questions about the problem they are trying to solve, says Marc Johnson, a senior advisor and virtual CIO at healthcare consulting firm Impact Advisors.
“Just as with any project, spend the time to lock down the scope of the problem to identify the correct sources for the data,” Johnson says. “I was asked to produce an analytics product for a 20-year-old company a few years ago. There was no research with the customer base to see if there was a market for it. There was no identification of the metrics for which the customer wanted to view the analytics. It was all based upon the competition claiming it had an analytics product and hearsay that customers wanted it.”
The project lingered for two years with no direction “because of the fuzzy definition of what the problem was that we were attempting to solve,” Johnson says.
Lack of relevant data
Another sure-fire way to fail with data science is to not provide the specific kinds of data needed to address a particular issue.
Throwing an enormous volume of data at a problem is not the answer.
“There is an assumption that large data will lead to insights, which is actually rarely the case,” Roytman says. “Smart, tailored, and often smaller datasets are more often the ones that provide robust generalizable models.”
To get value out of data science, there should be an ongoing effort to continue the collection of data from the most relevant sources, Johnson says. “Creation [is] not a one-time event,” he says.
As data is being collected or purchased from various sources, teams need to make sure any modifications in the data do not skew the results and sacrifice the quality of the entire data set, Johnson says. They also must make sure there are no privacy, legal, or ethical issues with the data set.
Lack of data transparency
Teams need to be transparent about the data they used to build any given model.
“Data science projects fail when people don’t trust the model or understand the solution,” says Jack McCarthy, CIO of the State of New Jersey–Judiciary. “The way to combat this is that you must be able to ‘show the math’ and communicate it to stakeholders who might not have the technical or statistical skills.”
Data scientists need to explain where the data comes from and what they did to calculate models, and provide access to all relevant data. “Transparency can be key to a successful project,” McCarthy says.
An example of this is the risk assessment algorithm used in New Jersey. “We provide all stakeholders with a report that shows which cases in a defendant’s history fall into which category, and how each is scored,” McCarthy says. “This is provided to all adversaries so they have an opportunity to look at each case and challenge its inclusion. It is all done transparently.”
Unwillingness to acknowledge that findings are uncertain
Sometimes the business group requesting insights or the data science team itself is simply not willing to conclude that findings were uncertain, unclear, or not strong enough for a business application, Roytman says.
“It is an equally acceptable and valuable answer to say, ‘The model is not good enough to generate ROI [return on investment] to the business,’” Roytman says.
The data science team at Kenna Security spent two months building a vulnerability classification model that would generate a common weakness enumeration automatically for a vulnerability, Roytman says. “The model worked; it was a sound answer to a graduate-level course problem,” he says. “But it didn’t work well enough to be valuable for our customers. [The] precision was too low. So we scrapped the project, even though we had invested time and had a result.”
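For readers unfamiliar with the metric Roytman mentions, precision is the share of a model's positive predictions that turn out to be correct. The Python sketch below shows the calculation on invented labels; the sample data, the `precision` helper, and the 0.9 acceptance threshold are all assumptions for illustration and say nothing about Kenna Security's actual model or cutoff.

```python
def precision(y_true, y_pred, positive):
    """Return TP / (TP + FP) for the given positive label."""
    pairs = list(zip(y_true, y_pred))
    true_pos = sum(1 for t, p in pairs if p == positive and t == positive)
    false_pos = sum(1 for t, p in pairs if p == positive and t != positive)
    if true_pos + false_pos == 0:
        return 0.0  # the model never predicted the positive label
    return true_pos / (true_pos + false_pos)


# Invented ground truth and predictions for five vulnerabilities.
actual = ["CWE-79", "CWE-89", "CWE-79", "CWE-20", "CWE-79"]
predicted = ["CWE-79", "CWE-79", "CWE-79", "CWE-79", "CWE-89"]

# Hypothetical bar a business might set before shipping the model.
REQUIRED_PRECISION = 0.9

if __name__ == "__main__":
    score = precision(actual, predicted, positive="CWE-79")
    print("precision for CWE-79:", score)
    print("good enough to ship:", score >= REQUIRED_PRECISION)
```

Here the model labels four vulnerabilities as CWE-79 and is right about two of them, so precision is 0.5, well short of the assumed threshold. A result like that is a legitimate finding in its own right.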
Absence of an executive champion
Data science efforts need a champion in the C-suite, to ensure that projects get sufficient resources and support.
“It helps if it’s the CIO,” Riley says. “We view data science as an integral part of our operation, and I’ve made sure to be a champion for our efforts.” Even if CIOs are not the internal champions for data science, they should be responsible for keeping all the data involved secure, he says. But involvement should go way beyond security.
“Getting the most out of the information you capture is what I would call a modern CIO’s responsibility,” Riley says. “With all this data on hand, you have the means to learn from it and use it intelligently, and that’s something that CIOs can utilize to help their organizations cross-functionally.”
Adaptavist has gained the most from its data science work in determining new tactics and modifications it can make to the sales process, Riley says. “It has had nothing to do with our product or IT infrastructure, marketing, none of that,” he says. “It has helped us the most from a business process optimization standpoint, for handling and managing leads better from inside sales.”
Shortage of talent
The skills gap is plaguing many aspects of IT, and data science is no exception. Many organizations simply don’t have the skill sets in place to maintain projects or get the maximum value.
“Bona fide data scientists are high in demand, hard to come by, and expensive,” says Tracy Huitika, CIO of engineering and data at Beanworks, a cloud-based accounts payable automation provider. “The position usually requires a PhD in physics or sciences, as well as the ability to write code in R and Python.”
One of the biggest reasons data science projects fail, even when they get to deployment, is the lack of operational talent to continue managing the project, Johnson says. “Taking a brilliant data scientist to create the model without a plan for running the operations of continued improvement with adjustments for market and data changes is like engineering a car and handing the keys to a 10-year-old,” he says.
Companies need to get the right skill sets in place to maintain the model after it has gone into production, either by hiring or by tapping outside experts such as consultants who are well versed in data science.
Data science wasn’t the right solution
What if a particular problem didn’t require data science as a solution in the first place? This misguided use of the discipline can lead to failure, so it’s worth giving a lot of thought to when and when not to apply data science methods, processes, and tools.
“One of the biggest things that will cause data science projects to fail is if data science, algorithms, and machine learning aren’t even the right solution,” Riley says.
“You may not need a machine learning model at all; you might need simple regression, and you can spend quite a lot of time and effort going through all the different permutations without use for data science,” Riley says. “We got caught in one of those situations where we were looking at financial data science modeling for visualizing predictors for future financial success for lines of our business. It turned out the best thing to use was just statistical regression.”
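To give a sense of how little machinery a simple regression needs, here is a minimal ordinary least squares fit in plain Python. The quarterly revenue figures and the `fit_line` helper are invented for illustration; this is not Adaptavist's model or data.

```python
def fit_line(xs, ys):
    """Fit y = slope * x + intercept by ordinary least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical revenue (in millions) for five consecutive quarters.
quarters = [1, 2, 3, 4, 5]
revenue = [10.0, 12.0, 14.0, 16.0, 18.0]

if __name__ == "__main__":
    slope, intercept = fit_line(quarters, revenue)
    forecast = slope * 6 + intercept
    print("slope:", slope, "intercept:", intercept)
    print("forecast for quarter 6:", forecast)
```

A baseline like this takes minutes to build and explain. If it answers the business question, a heavier machine learning pipeline may add cost without adding insight.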