Businesses today are dealing with huge amounts of data and it’s arriving faster than ever before. At the same time, the competitive landscape is changing rapidly and it’s critical to be able to make decisions fast.
As Jason Jennings and Laurence Haughton put it, “It’s not the big that eat the small… It’s the fast that eat the slow.”
Business success comes from making fast decisions using the best possible information.
Machine learning (ML) is powering that evolution. Whether a business is trying to make recommendations to customers, hone its manufacturing processes or anticipate changes to a market, ML can assist by processing large volumes of data to better support companies as they seek a competitive advantage.
However, while machine learning offers great opportunities, there are some challenges. ML systems rely on lots of data and the ability to execute complex computations. External factors, such as shifting customer expectations or unexpected market fluctuations, mean ML models need to be monitored and maintained.
In addition, there are several practical issues in machine learning that need to be solved. Here we will take a close look at five of the key practical issues and their business implications.
1. Data quality
Machine learning systems rely on data. That data can be broadly classified into two groups: features and labels.
Features are the inputs to the ML model. For example, this could be data from sensors, customer questionnaires, website cookies or historical information.
The quality of these features can be variable. For example, customers may fill in questionnaires incorrectly or omit responses. Sensors can malfunction and deliver erroneous data, and website cookies may give incomplete information about a user’s precise actions on a website. High-quality datasets are essential if models are to be trained correctly.
Data can also be noisy, filled with unwanted information that can mislead a machine learning model into making incorrect predictions.
The outputs of ML models are labels. Label sparsity, where we know the inputs to a system but are unsure of what outputs have occurred, is also an issue. In such cases, it can be extremely challenging to detect the relationships between the features and the labels of a model. Filling in the missing labels is often labour intensive, as it requires human intervention to associate labels with inputs.
Without accurate mapping of inputs to outputs, the model might not be able to learn the correct relationship between the inputs and outputs.
Machine learning relies on the relationships between input and output data to create generalisations that can be used to make predictions and provide recommendations for future actions. When the input data is noisy, incomplete or erroneous, it can be extremely difficult to understand why a particular output, or label, occurred.
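The checks described above can be sketched in a few lines. This is a minimal illustration, not a prescribed method: the records, the feature names (`age`, `temperature_c`) and the valid ranges are all invented for the example. It counts missing values, values outside a plausible range (such as a sensor error code) and the fraction of rows that have no label.

```python
# Hypothetical records: two features and an optional label per row.
ROWS = [
    {"age": 34, "temperature_c": 21.5, "label": "buy"},
    {"age": None, "temperature_c": 22.0, "label": None},    # omitted response
    {"age": 29, "temperature_c": -999.0, "label": "skip"},  # sensor error code
    {"age": 41, "temperature_c": 20.8, "label": None},      # unlabelled row
]

# Assumed plausible range for each feature (illustrative values only).
VALID_RANGES = {"age": (0, 120), "temperature_c": (-50.0, 60.0)}


def quality_report(rows, valid_ranges, label_key="label"):
    """Count missing and out-of-range feature values and unlabelled rows."""
    report = {name: {"missing": 0, "out_of_range": 0} for name in valid_ranges}
    unlabelled = 0
    for row in rows:
        for name, (low, high) in valid_ranges.items():
            value = row.get(name)
            if value is None:
                report[name]["missing"] += 1
            elif not low <= value <= high:
                report[name]["out_of_range"] += 1
        if row.get(label_key) is None:
            unlabelled += 1
    # Share of rows where the output is unknown (label sparsity).
    report["unlabelled_fraction"] = unlabelled / len(rows) if rows else 0.0
    return report


report = quality_report(ROWS, VALID_RANGES)
print(report)
```

A report like this does not fix the data, but it shows where human attention is needed before a model is trained.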
2. The complexity and quality trade-off
Building robust machine learning models requires substantial computational resources to process the features and labels. Coding a complex model requires significant effort from data scientists and software engineers. Complex models can require substantial computing power to execute and can take longer to derive a usable result.
This represents a trade-off for businesses. They can choose a faster response but a potentially less accurate outcome. Or they can accept a slower response but receive a more accurate result from the model. But these compromises aren’t all bad news. The decision of whether to go for a higher cost and more accurate model over a faster response comes down to the use case.
For example, making recommendations to shoppers on a retail shopping site requires real-time responses, but can accept some unpredictability in the result. On the other hand, a stock trading system requires a more robust result. So, a model that uses more data and performs more computations is likely to deliver a better outcome when a real-time result is not needed.
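One way to make this decision explicit is to state the response time the use case can tolerate and then pick the most accurate model that fits within it. The sketch below assumes three hypothetical candidate models whose latency and accuracy figures are invented purely for illustration; in practice these numbers would come from your own measurements.

```python
# Hypothetical candidates; latency and accuracy values are made up.
CANDIDATES = [
    {"name": "small", "latency_ms": 5, "accuracy": 0.80},
    {"name": "medium", "latency_ms": 40, "accuracy": 0.88},
    {"name": "large", "latency_ms": 400, "accuracy": 0.93},
]


def pick_model(candidates, latency_budget_ms):
    """Return the most accurate candidate within the latency budget, or None."""
    eligible = [c for c in candidates if c["latency_ms"] <= latency_budget_ms]
    if not eligible:
        return None
    return max(eligible, key=lambda c: c["accuracy"])


# A real-time use case with a tight budget versus one that can wait.
realtime_choice = pick_model(CANDIDATES, latency_budget_ms=50)
offline_choice = pick_model(CANDIDATES, latency_budget_ms=1000)
print(realtime_choice["name"], offline_choice["name"])
```

The point of the sketch is only that the budget, not the model, is the business decision: once it is set, the choice of model follows from it.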
As Machine Learning as a Service (MLaaS) offerings enter the market, the trade-off between complexity and quality will get greater attention. Researchers from the University of Chicago looked at the effectiveness of MLaaS and found that “they can achieve results comparable to standalone classifiers if they have sufficient insight into key decisions like classifiers and feature selection”.
3. Sampling bias in data
Many companies use machine learning algorithms to assist them in recruitment. For example, Amazon discovered that the algorithm they used to assist with selecting candidates to work in the business was biased. Also, researchers from Princeton found that European names were favoured by other systems, mimicking some human biases.
The problem here isn’t the model specifically. The problem is that the data used to train the model comes with its own biases. However, when we know the data is biased, there are ways to debias it or to reduce the weighting given to it.
The first challenge is determining if there is inherent bias in the data. That means conducting some pre-processing. And while it may not be possible to remove all bias from the data, its impact can be minimised by injecting human knowledge.
In some cases, it may also be necessary to limit the number of features in the data. For example, omitting traits such as race or gender can help limit the impact of biased data on the results from a model.
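As an illustration of adjusting the weighting given to skewed data, the sketch below uses inverse-frequency reweighting, one common technique among several and not necessarily the one any particular company uses. It assumes each training example carries a group tag (the tags `"a"` and `"b"` are placeholders) and assigns weights so that every group contributes the same total weight during training, however many examples it has.

```python
from collections import Counter


def balancing_weights(groups):
    """Return one weight per example so every group has equal total weight."""
    counts = Counter(groups)
    n_groups = len(counts)
    total = len(groups)
    # Examples from rare groups get larger weights, common groups smaller ones.
    return [total / (n_groups * counts[g]) for g in groups]


# Hypothetical sample in which group "a" is over-represented three to one.
groups = ["a", "a", "a", "b"]
weights = balancing_weights(groups)
print(list(zip(groups, weights)))
```

Many training libraries accept per-example weights of this kind. Reweighting only corrects for how often each group appears; it does not remove bias that is already present in the labels themselves.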
4. Changing expectations and concept drift
Machine learning models operate within specific contexts. For example, ML models that power recommendation engines for retailers operate at a specific time when customers are looking at certain products. However, customer needs change over time, and that means the ML model can drift away from what it was designed to deliver.
Models can decay for a number of reasons. Drift can occur when the new data fed to the model differs from the data it was trained on. This is called data drift. It can also occur when our interpretation of the data changes, so that the same inputs no longer correspond to the same outputs. This is concept drift.
To accommodate this drift, you need a model that continuously updates and improves itself using data that comes in. That means you need to keep checking the model.
That requires collecting features and labels and reacting to changes so the model can be updated and retrained. While some aspects of the retraining can be conducted automatically, some human intervention is needed. It’s critical to recognise that the deployment of a machine learning tool is not a one-off activity.
Machine learning tools require regular review and update to remain relevant and continue to deliver value.
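A regular review can start with a simple automated check. The sketch below compares the values of one feature at training time with recent values using the two-sample Kolmogorov-Smirnov statistic, which is the largest gap between the two empirical distributions. The sample values and the threshold of 0.3 are assumptions chosen for illustration, and the check only covers data drift; spotting concept drift also requires fresh labels so that accuracy can be tracked over time.

```python
from bisect import bisect_right

# Assumed alert threshold; a sensible value depends on the feature and use case.
DRIFT_THRESHOLD = 0.3


def ks_statistic(reference, recent):
    """Largest gap between the two empirical CDFs (0 = identical, 1 = disjoint)."""
    ref = sorted(reference)
    rec = sorted(recent)
    gap = 0.0
    # The gap can only change at observed values, so checking those is enough.
    for x in ref + rec:
        cdf_ref = bisect_right(ref, x) / len(ref)
        cdf_rec = bisect_right(rec, x) / len(rec)
        gap = max(gap, abs(cdf_ref - cdf_rec))
    return gap


def needs_review(reference, recent, threshold=DRIFT_THRESHOLD):
    """Flag the feature for human review when the distributions have diverged."""
    return ks_statistic(reference, recent) > threshold


# Hypothetical basket values seen at training time and in two later periods.
training_values = [10, 12, 11, 13, 12, 11]
stable_values = [11, 12, 10, 13, 12, 11]
shifted_values = [20, 22, 21, 23]

print(needs_review(training_values, stable_values))
print(needs_review(training_values, shifted_values))
```

A flag from a check like this is a prompt for a person to look at the data, not a decision to retrain in itself.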
5. Monitoring and maintenance
Creating a model is easy. Building a model can be automatic. However, maintaining and updating the models requires a plan and resources.
Machine learning models are part of a longer pipeline that starts with the features that are used to train the model. Then there is the model itself, which is a piece of software that can require modification and updates. That model requires labels so that the results of an input can be recognised and used by the model. And there may be a disconnect between the model and the final signal in a system.
In many cases when an unexpected outcome is delivered, it’s not the machine learning that has broken down but some other part of the chain. For example, a recommendation engine may have offered a product to a customer, but sometimes the connection between the sales system and the recommendation could be broken, and it takes time to find the bug. In this case, it would be hard to tell the model if the recommendation was successful. Troubleshooting issues like this can be quite labour intensive.
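A broken link of this kind can be caught earlier by monitoring the pipeline as well as the model. The sketch below assumes that the recommendation engine and the sales system each log a shared recommendation ID (the IDs and the 0.8 alert level are hypothetical) and measures how many recommendations ever received an outcome record. A sudden fall in that figure points to the connection between systems rather than to the model.

```python
# Assumed minimum share of recommendations that should have a recorded outcome.
MIN_COVERAGE = 0.8


def feedback_coverage(recommendation_ids, outcome_ids):
    """Fraction of logged recommendations that have a matching outcome record."""
    recommended = set(recommendation_ids)
    if not recommended:
        return 1.0  # nothing was recommended, so nothing can be missing
    matched = recommended & set(outcome_ids)
    return len(matched) / len(recommended)


def feedback_link_healthy(recommendation_ids, outcome_ids, minimum=MIN_COVERAGE):
    """Return False when too few recommendations have outcomes recorded."""
    return feedback_coverage(recommendation_ids, outcome_ids) >= minimum


# Hypothetical logs: four recommendations made, outcomes received for only two.
recommendations = ["r1", "r2", "r3", "r4"]
outcomes = ["r1", "r3"]

print(feedback_coverage(recommendations, outcomes))
print(feedback_link_healthy(recommendations, outcomes))
```

Running a check like this on a schedule turns a labour-intensive investigation into an alert that names the part of the chain to inspect first.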
Machine learning offers significant benefits to businesses. The ability to predict future outcomes, to anticipate and influence customer behaviour, and to support business operations is a substantial advantage. However, ML also brings challenges to businesses. By recognising these challenges and developing strategies to address them, companies can ensure they are prepared and equipped to handle them and get the most out of machine learning technology.
Dr. Shou-De Lin, Chief Machine Learning Scientist, Appier