Data in, intelligence out: Machine learning pipelines demystified
It’s tempting to think of machine learning as a magic black box. In goes the data; out come predictions. But there’s no magic in there—just data and algorithms, and models created by processing the data through the algorithms.
If you’re in the business of deriving actionable insights from data through machine learning, it helps for the process not to be a black box. The more you know about what’s inside the box, the better you’ll understand each step of how data is transformed into predictions, and the more powerful those predictions can be.
Devops people speak of “build pipelines” to describe how software is taken from source code to deployment. There’s also a pipeline for data as it flows through machine learning solutions. Mastering how that pipeline comes together is a powerful way to know machine learning itself from the inside out.
The machine learning pipeline consists of four phases, as described by Wikibon Research analyst George Gilbert:
- ingesting data
- preparing data (including data exploration and governance)
- training models
- serving predictions
A machine learning pipeline needs to start with two things: data to be trained on, and algorithms to perform the training. Odds are the data will come in one of two forms:
- Live data you’re already collecting and aggregating somewhere, which you plan to use to make regularly updated predictions.
- A “frozen” dataset, something you’re downloading and using as is, or deriving from an existing data source by way of an ETL operation.
With frozen data, you generally perform only one kind of processing: you train a model with it, deploy the model, and, depending on your needs, update the model periodically, if at all.
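As a concrete sketch of that frozen-data workflow, the snippet below trains a model once on a static dataset, persists it, and reloads it for later use. The file names, column names, and the choice of scikit-learn and joblib are illustrative assumptions, not prescriptions from any particular tool described here.

```python
# A minimal sketch of the "frozen data" workflow: train once on a static,
# already-ETLed dataset, persist the model, and reload it later for serving.
# File names and column names are placeholders.
import pandas as pd
from joblib import dump, load
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

df = pd.read_csv("frozen_dataset.csv")            # a static snapshot of the data
X, y = df.drop(columns=["label"]), df["label"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

model = RandomForestClassifier(n_estimators=100)
model.fit(X_train, y_train)
print("holdout accuracy:", model.score(X_test, y_test))

dump(model, "model.joblib")                       # the artifact you deploy
# ...later, in the serving process, or whenever you decide to refresh it:
model = load("model.joblib")
```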
But with streamed data, you have two choices for how to produce models and results from the data. The first option is to save the data somewhere, such as a database or a data lake, and perform analytics on it later. The second option is to train models on the streamed data as it comes in.
Training on streamed data also takes two forms, as described by Charles Parker of machine learning solution provider BigML. In one scenario, you apply a regular flow of recent data to the model to make predictions, but you don’t adjust the underlying model much. In the other, the incoming data must be used to train entirely new models every so often, because older data is no longer as relevant.
This is why choosing your algorithms early on is important. Some algorithms support incremental retraining, while others have to be retrained from scratch with the new data. If you’re going to be retraining often because you’re streaming in fresh data all the time to feed your models, you want an algorithm that supports incremental retraining. Spark Streaming, for example, supports this kind of incremental retraining.
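For instance, Spark’s streaming MLlib API includes models that update themselves batch by batch. The sketch below uses StreamingLinearRegressionWithSGD; the directory paths, feature count, and comma-separated records are placeholder assumptions, and in practice the streams might come from Kafka or sockets rather than files landing in a directory.

```python
# A minimal sketch of incremental retraining on a stream with Spark
# Streaming's MLlib API. The model's weights are updated with every new
# batch of training data rather than being rebuilt from scratch.
from pyspark import SparkContext
from pyspark.streaming import StreamingContext
from pyspark.mllib.regression import LabeledPoint, StreamingLinearRegressionWithSGD

sc = SparkContext(appName="IncrementalTraining")
ssc = StreamingContext(sc, batchDuration=10)      # process new data every 10 seconds

def parse(line):
    # assumes "label,feat1,feat2,feat3" text records
    values = [float(x) for x in line.split(",")]
    return LabeledPoint(values[0], values[1:])

training_stream = ssc.textFileStream("/data/train").map(parse)
scoring_stream = ssc.textFileStream("/data/score").map(parse)

model = StreamingLinearRegressionWithSGD()
model.setInitialWeights([0.0, 0.0, 0.0])          # assumes three features

model.trainOn(training_stream)                    # model updates as each batch arrives
model.predictOnValues(
    scoring_stream.map(lambda lp: (lp.label, lp.features))
).pprint()

ssc.start()
ssc.awaitTermination()
```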
Data preparation for machine learning
Once you have a data source to train on, the next step is to ensure it can be used for training. The catchall term for ensuring consistency in the data to be used is normalization.
Real-world data can be noisy. If it’s drawn from a database, a certain amount of normalization happens there automatically. But many machine learning applications draw data directly from data lakes or other heterogeneous sources, where the data isn’t necessarily normalized for production use.
Sebastian Raschka, author of Python Machine Learning, has written in detail about normalization, and how to conduct it for some common types of datasets. The examples he uses are Python-centric, but the basic concepts can be applied universally.
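To make the idea concrete, here is a generic sketch (not Raschka’s code) of the two most common techniques, min-max scaling and standardization, using scikit-learn; the sample values are invented for illustration.

```python
# Two common normalization techniques: min-max scaling rescales each column
# to [0, 1]; standardization rescales each column to zero mean, unit variance.
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# e.g., income in dollars and years employed: wildly different scales
X = np.array([[50_000.0, 2], [120_000.0, 10], [80_000.0, 5]])

min_max = MinMaxScaler().fit_transform(X)       # each column mapped to [0, 1]
standard = StandardScaler().fit_transform(X)    # each column: mean 0, std 1

print(min_max)
print(standard)
```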
Is normalization always required? Not always, says Franck Dernoncourt, a PhD candidate in AI at MIT, in a detailed exploration of the subject on Stack Overflow. But as he puts it, “It rarely hurts.” The important thing to know, he points out, is the use case. For artificial neural networks, normalization isn’t needed but can be useful; for something like a K-means clustering algorithm, normalization is vital.
One area where normalization isn’t a good idea is when “the scale of the data has significance,” according to Malik Magdon-Ismail, co-author of Learning from Data. An example: “If income is twice as important as debt in credit approval, it is appropriate for income to have twice the size as debt.”
Something else to be conscious of during the data intake and preparation phase is how biases can be introduced into a model by way of the data, its normalization, or both. Biases in machine learning have real-world consequences; it helps to know how to find and defeat such bias where it might exist. Never assume that clean (readable, consistent) data is unbiased data.
Training machine learning models
Once you have your dataset established, next comes the training process, where the data is used to generate a model from which predictions can be made.
I mentioned earlier how the type of prediction job and the kinds of algorithms in use are important here, depending on whether you want models that are trained all at once on a batch of data or models that are retrained incrementally. But another key aspect to training models is how to tune the training to increase the precision of the resulting model—what’s called hyperparameterization.
A hyperparameter for a machine learning model is a setting that governs how the resulting model is produced from the algorithm. The K-means clustering algorithm, for example, organizes data into groups based on how similar the items are to one another. So one hyperparameter for a K-means algorithm would be the number of clusters to search for.
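The distinction is easy to see in code. In the sketch below (scikit-learn, with synthetic data as a stand-in), the number of clusters is set up front by you, while the cluster centers are learned from the data during training.

```python
# n_clusters is a hyperparameter: chosen before training.
# cluster_centers_ are parameters: learned from the data during training.
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=3, random_state=42)

kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)   # set by you, up front
kmeans.fit(X)
print(kmeans.cluster_centers_)                              # discovered by the algorithm
```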
Generally, the best choices for a hyperparameter come from having experience with the algorithm. Sometimes you need to try out a few variations and see which ones yield workable results for your problem set. That said, for some algorithm implementations, it’s becoming possible to automatically tune hyperparameters. The Ray framework for machine learning, for example, has a hyperparameter optimization feature.
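Even without an automated tuner, the “try a few variations” approach can be as simple as a loop. The sketch below is a hand-rolled stand-in rather than a Ray example: it sweeps candidate cluster counts and scores each with the silhouette metric.

```python
# A simple hyperparameter sweep: try several values of n_clusters and keep
# the one with the best silhouette score. Tools such as Ray's tuning
# component automate this kind of search at much larger scale.
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=300, centers=4, random_state=0)

scores = {}
for k in range(2, 8):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = silhouette_score(X, labels)

best_k = max(scores, key=scores.get)
print(scores)
print("best number of clusters:", best_k)
```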
Many of the libraries for model training can take advantage of parallelism, which speeds the training process by distributing it across multiple CPUs, GPUs, or nodes. If you’ve got access to the hardware to train in parallel, use it. The speedups are often near-linear for each additional computing device.
Parallel training can be supported by the machine learning framework you’re using to perform the training. The MXNet library, for example, lets you train models in parallel. MXNet also supports both of the key methodologies for parallelizing training, data parallelism and model parallelism.
Alex Krizhevsky, a member of Google’s Brain Team, explained the differences between data parallelism and model parallelism in a paper about parallelizing network training. With data parallelism, “different workers train [models] on different data examples … [but] must synchronize model parameters (or parameter gradients) to ensure they are training a consistent model.” In other words, you split the data to train across multiple devices, but you have to make sure the resulting models stay in sync with each other.
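As a conceptual illustration (a single-process toy, not how any particular framework implements it), the sketch below splits a dataset across four simulated workers, has each compute gradients for the same linear model on its own shard, and averages those gradients so every replica applies an identical update.

```python
# A toy illustration of data parallelism: every "worker" holds a full copy of
# the model parameters, computes gradients on its own shard of the data, and
# the gradients are averaged so all replicas stay in sync.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=1000)

n_workers = 4
shards = list(zip(np.array_split(X, n_workers), np.array_split(y, n_workers)))

w = np.zeros(3)                                   # identical replica on every worker
lr = 0.1
for step in range(100):
    grads = []
    for X_shard, y_shard in shards:               # in reality, runs on separate devices
        error = X_shard @ w - y_shard
        grads.append(X_shard.T @ error / len(y_shard))
    w -= lr * np.mean(grads, axis=0)              # synchronize: same averaged update everywhere

print(w)                                          # approaches [2.0, -1.0, 0.5]
```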
With model parallelism, “different workers train different parts of the model,” but workers have to stay in sync whenever “the model part … trained by one worker requires output from a model part trained by another worker.” This approach is typically used when the model being trained has multiple layers that feed into each other, such as a recurrent neural network.
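Model parallelism can be sketched in the same toy fashion: here, two simulated workers each hold one layer of a tiny network, and the second cannot proceed until it receives the first’s activations, which is the synchronization point described above. This is a conceptual sketch, not any framework’s actual API.

```python
# A toy illustration of model parallelism: each "worker" owns the weights of
# one layer, and worker 2 waits for worker 1's output before it can compute.
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(8, 16))            # a batch of 8 inputs with 16 features

# Worker 1: owns the first layer's weights
W1 = rng.normal(size=(16, 32))
def worker1_forward(inputs):
    return np.maximum(inputs @ W1, 0.0)  # ReLU activation

# Worker 2: owns the second layer's weights
W2 = rng.normal(size=(32, 4))
def worker2_forward(hidden):
    return hidden @ W2

hidden = worker1_forward(x)             # computed on worker 1
output = worker2_forward(hidden)        # worker 2 must wait for `hidden` to arrive
print(output.shape)                     # (8, 4)
```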
It’s worth learning how to assemble pipelines using both of these approaches, because many frameworks, such as the Torch framework, now support both.
Deploying machine learning models
The last phase in the pipeline is deploying the trained model, or the “predict and serve” phase, as Gilbert puts it in his paper “Machine Learning Pipeline: Chinese Menu of Building Blocks.” This is where the trained model is run against incoming data to generate a prediction. For a face-recognition system, for example, the incoming data could be a headshot or a selfie, with predictions made from a model derived from other face photos.
Where and how this prediction is served constitutes another part of the pipeline. The most common scenario is providing predictions from a cloud instance by way of a RESTful API. All the obvious advantages of serving from the cloud come into play here. You can spin up more instances to satisfy demand, for example.
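A minimal version of that serving setup might look like the Flask sketch below, which loads a persisted model (the model.joblib artifact from the earlier training sketch) and exposes a /predict endpoint. The route, payload shape, and choice of framework are assumptions for illustration; any web framework would do.

```python
# A minimal "predict and serve" sketch: load a trained model and serve
# predictions over a small REST API.
from flask import Flask, jsonify, request
from joblib import load

app = Flask(__name__)
model = load("model.joblib")            # the artifact produced during training

@app.route("/predict", methods=["POST"])
def predict():
    # expects a JSON body like {"features": [[5.1, 3.5, 1.4, 0.2]]}
    features = request.get_json()["features"]
    prediction = model.predict(features).tolist()
    return jsonify({"prediction": prediction})

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=8080)
```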
With a cloud-hosted model, you can also keep more of the pipeline in the same place—training data, trained models, and the prediction infrastructure. Data doesn’t have to be moved around as much, so everything is faster. Incremental retraining of a model can be done more quickly, because the model can be retrained and deployed in the same environment.
But sometimes it makes sense to deploy a model on a client and serve predictions from there. Good candidates for this approach are mobile apps, where bandwidth is at a premium, and any app where a network connection isn’t guaranteed or reliable.
One caveat is that the quality of predictions made on a local device may be lower. The deployed model may be smaller due to local storage constraints, and that might in turn affect prediction quality. That said, it’s becoming more feasible to deploy highly accurate models on modest devices like smartphones, mostly by way of a slight trade-off of accuracy for speed. It’s worth looking at the application in question and deciding whether it would be better to deploy a trained model on the client and refresh it periodically, rather than access it through a remote API.
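One common way to make that accuracy-for-speed trade is to quantize a model before shipping it with the app. The sketch below assumes you already have a TensorFlow SavedModel on disk and uses the TensorFlow Lite converter; other stacks (Core ML, ONNX Runtime, and so on) offer analogous conversion steps.

```python
# Convert a saved TensorFlow model to a quantized TensorFlow Lite model,
# producing a much smaller file suitable for bundling with a mobile app.
import tensorflow as tf

converter = tf.lite.TFLiteConverter.from_saved_model("saved_model_dir")
converter.optimizations = [tf.lite.Optimize.DEFAULT]   # enables weight quantization
tflite_model = converter.convert()

with open("model.tflite", "wb") as f:
    f.write(tflite_model)               # ship this artifact with the app
```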
There’s another possible stumbling block: Because you can deploy models in so many places, the deployment process can be complex. There’s no consistent path from any one trained model to any one target hardware, operating system, or application, except whatever you create on an app-by-app basis. This complexity is not likely to go away anytime soon, although there’s going to be pressure to find a consistent deployment pipeline thanks to the growing practice of developing apps using machine learning models of some kind.
The machine learning pipeline isn’t really a pipeline
The term pipeline implies a one-way flow of things from one end to another. In reality, the flow is more cyclical: Data comes in, it is used to train a model, the model’s accuracy is assessed, and the model is retrained as new data arrives and the character of that data evolves.
Right now, we don’t have much choice except to think of the machine learning pipeline as discrete pieces that need individual attention, not because each stage is functionally different, but because there’s little in the way of end-to-end integration for all these pieces. In other words, there’s no pipeline, just a series of activities we tend to think of as being in a pipeline.
But projects are coming together that attempt to fill this need for a real pipeline. Hadoop vendor MapR, for example, has its “Distributed Deep Learning Quick Start Solution”—a combination of a one-year, six-node license for the MapR Hadoop distribution, integrated neural network libraries with CPU/GPU support, and professional consulting services.
An ideal solution would be a complete open source design pattern that covers every phase of the pipeline and provides as seamless an experience as the continuous-delivery systems that now exist for software. In other words, something that constitutes, as Wikibon’s Gilbert put it, “devops tools for data scientists.” Baidu has announced it is looking into such a devops tool for data scientists, with Kubernetes as a chief element (something MapR also uses to coordinate work across nodes in its system), but nothing concrete has materialized yet.
Until that day comes, we’ll have to settle for learning every bit of the pipeline from the inside out.