8 Ways To Jump-Start Your Machine Learning

Do you need to classify data or predict outcomes? Are you having trouble getting your machine learning project off the ground? There are a number of techniques available to help you achieve lift-off.

Some of the eight methods discussed below will accelerate your machine learning process dramatically, while others will not only accelerate the process but also help you build a better model. Not all of these methods will be suitable for a given project, but the first one—exploratory data analysis—should never be left out.

Start with exploratory data analysis

Jumping to machine learning training without first examining your data in depth is like sex without foreplay. It’s a lot of work, and won’t be nearly as rewarding.

Exploratory data analysis combines graphical and statistical methods. Some of the more common techniques include histograms and box-and-whisker plots of individual variables, scatter charts of pairs of variables, and plots of descriptive statistics, such as a heatmap of the pairwise correlations among variables.
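As a rough illustration of the statistical side, here is a minimal sketch that computes the numbers behind a box-and-whisker plot and a correlation heatmap using only the Python standard library. The column names and values are invented for the example; in practice you would load your own data and hand the results to a plotting library.

```python
import statistics

def five_number_summary(xs):
    """Min, quartiles, and max: the numbers a box-and-whisker plot draws."""
    q1, median, q3 = statistics.quantiles(xs, n=4, method="inclusive")
    return {"min": min(xs), "q1": q1, "median": median, "q3": q3, "max": max(xs)}

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5

def correlation_matrix(columns):
    """Pairwise correlations, i.e. the values a correlation heatmap colours."""
    names = list(columns)
    return {a: {b: pearson(columns[a], columns[b]) for b in names} for a in names}

# Hypothetical toy dataset, made up for this sketch.
data = {
    "temperature": [12.0, 15.0, 18.0, 21.0, 24.0, 27.0, 30.0],
    "rentals": [110, 150, 170, 220, 260, 300, 310],
    "rainfall": [9.0, 7.0, 6.0, 4.0, 3.0, 1.0, 0.0],
}

summary = {name: five_number_summary(values) for name, values in data.items()}
corr = correlation_matrix(data)

for name in data:
    print(name, summary[name])
    print("  correlations:", {k: round(v, 2) for k, v in corr[name].items()})
```

Even on a table this small, the matrix shows at a glance which variables move together and which move in opposite directions, which is the kind of hint you want before choosing features or a model.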

Exploratory data analysis can also include dimensionality reduction techniques, such as principal component analysis (PCA) and nonlinear dimensionality reduction (NLDR). For time-based data you also want to plot line charts of your raw variables and statistics against time, which can, among other things, highlight seasonal and day-of-week variations and anomalous jumps from externalities such as storms and (cough, cough) epidemics.

Exploratory data analysis is more than just statistical graphics. It’s a philosophical approach to data analysis designed to help you keep an open mind instead of trying to force the data into a model. These days, many of the ideas of exploratory data analysis have been incorporated into data mining.

Build unsupervised clusters

Cluster analysis is an unsupervised learning problem that asks the model to find groups of similar data points. There are several clustering algorithms currently in use, which tend to have slightly different characteristics. In general, clustering algorithms look at the metrics or distance functions between the feature vectors of the data points, and then group the ones that are “near” each other. Clustering algorithms work best if the classes do not overlap.

One of the most common clustering methods is k-means, which attempts to divide n observations into k clusters using the Euclidean distance metric, with the objective of minimising the variance (sum of squares) within each cluster. It is a method of vector quantisation, and is useful for feature learning.

Lloyd’s algorithm (iterative cluster assignment with centroid updates) is the most common heuristic used to solve the problem, and is relatively efficient, but doesn’t guarantee convergence to the global optimum. To improve the odds of finding it, people often run the algorithm multiple times using random initial cluster centroids generated by the Forgy or Random Partition methods.
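To make the loop concrete, here is a small sketch of Lloyd’s algorithm with Forgy initialisation (picking k observations as the starting centroids) and several random restarts, keeping the run with the lowest within-cluster sum of squares. The function names, the restart count, and the toy points are choices made for this example; a production project would normally use a library implementation instead.

```python
import random

def squared_distance(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def lloyd(points, k, rng, max_iter=100):
    """One run of Lloyd's algorithm; returns (centroids, within-cluster sum of squares)."""
    # Forgy initialisation: use k randomly chosen observations as centroids.
    centroids = rng.sample(points, k)
    for _ in range(max_iter):
        # Assignment step: attach every point to its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: squared_distance(p, centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster.
        # An empty cluster keeps its previous centroid.
        new_centroids = [
            tuple(sum(dim) / len(c) for dim in zip(*c)) if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
        if new_centroids == centroids:
            break  # assignments are stable, so the run has converged
        centroids = new_centroids
    inertia = sum(min(squared_distance(p, c) for c in centroids) for p in points)
    return centroids, inertia

def kmeans(points, k, restarts=10, seed=0):
    """Run Lloyd's algorithm several times and keep the lowest-variance result."""
    rng = random.Random(seed)
    return min((lloyd(points, k, rng) for _ in range(restarts)), key=lambda r: r[1])

# Hypothetical toy data: two well-separated blobs.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
centroids, inertia = kmeans(points, k=2)
print(sorted(centroids), round(inertia, 3))
```

Each restart can settle in a different local optimum, which is why the sketch compares runs by their sum of squares rather than trusting the first one.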

K-means assumes spherical clusters that are separable so that the mean converges towards the cluster centre, and also assumes that the ordering of the data points does not matter. The clusters are expected to be of similar size, so that the assignment to the nearest cluster centre is the correct assignment.

If k-means clustering doesn’t work for you, consider hierarchical cluster analysis, mixture models, or DBSCAN. Also consider other kinds of unsupervised learning, such as autoencoders and the method of moments.

Tag your data with semi-supervised learning

Tagged data is the sine qua non of machine learning. If you have no tagged data, you can’t train a model to predict the target value.

The simple but expensive answer to that is to manually tag all your data. The “joke” about this in academia (among the professors) is that your grad students can do it. (That isn’t funny if you’re a grad student.)

The less expensive answer is to manually tag some of your data, and then try to predict the rest of the target values with one or more models; this is called semi-supervised learning. With self-training algorithms (one kind of semi-supervised learning) you accept any predicted values from a single model with a probability above some threshold, and use the now-larger training dataset to build a refined model. Then you use that model for another round of predictions, and iterate until there are no more predictions that are confident. Self-training sometimes works; other times, the model is corrupted by a bad prediction.
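The self-training loop can be sketched in a few lines. The example below is deliberately simplified and rests on several assumptions that are not from any particular library: the "model" is a one-feature nearest-centroid classifier, its confidence is a softmax over negative squared distances, and the 0.9 threshold and the data are invented.

```python
import math

def fit_centroids(labeled):
    """Train the (deliberately simple) model: one mean feature value per class."""
    by_class = {}
    for x, y in labeled:
        by_class.setdefault(y, []).append(x)
    return {y: sum(xs) / len(xs) for y, xs in by_class.items()}

def predict_proba(centroids, x):
    """Class probabilities from a softmax over negative squared distances."""
    scores = {y: math.exp(-(x - c) ** 2) for y, c in centroids.items()}
    total = sum(scores.values())
    return {y: s / total for y, s in scores.items()}

def self_train(labeled, unlabeled, threshold=0.9, max_rounds=20):
    """Iteratively accept confident predictions as new training labels."""
    labeled, unlabeled = list(labeled), list(unlabeled)
    for _ in range(max_rounds):
        model = fit_centroids(labeled)
        confident, remaining = [], []
        for x in unlabeled:
            probs = predict_proba(model, x)
            label = max(probs, key=probs.get)
            if probs[label] >= threshold:
                confident.append((x, label))
            else:
                remaining.append(x)
        if not confident:
            break  # no more confident predictions, so stop iterating
        labeled.extend(confident)
        unlabeled = remaining
    return fit_centroids(labeled), labeled, unlabeled

# Hypothetical data: two hand-tagged points and eight untagged ones.
seed_labels = [(1.0, "a"), (5.0, "b")]
untagged = [0.5, 1.5, 2.0, 2.9, 3.1, 4.0, 4.5, 5.5]

model, labeled, remaining = self_train(seed_labels, untagged)
print(model)      # refined class centroids
print(remaining)  # [2.9, 3.1]: too close to the boundary to tag confidently
```

The two points near the class boundary are never tagged, which is the desired behaviour. A lower threshold would tag more data, at the cost of a higher risk of corrupting the model with a wrong label.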

If you build multiple models and use them to check each other, you can come up with something more robust, such as tri-training. Another alternative is to combine semi-supervised learning with transfer learning from an existing model built from different data.

You can implement any of these schemes yourself. Alternatively, you can use a web service with trained labelers such as Amazon SageMaker Ground Truth, Hive Data, Labelbox, Dataloop, and Datasaur.

Add complementary datasets

Externalities can often cast light on anomalies in datasets, particularly time-series datasets. For example, if you add weather data to a bicycle-rental dataset, you’ll be able to explain many deviations that otherwise might have been mysteries, such as a sharp drop in rentals during rainstorms.
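Mechanically, adding a complementary dataset is usually a join on a shared key, such as the date. The sketch below uses invented rental counts and rainfall figures, with hypothetical field names, to show the idea with plain dictionaries; with real data you would more likely use a dataframe merge or a SQL join.

```python
from datetime import date
from statistics import fmean

# Hypothetical daily rental counts.
rentals = {
    date(2021, 6, 1): 412,
    date(2021, 6, 2): 398,
    date(2021, 6, 3): 95,
    date(2021, 6, 4): 405,
    date(2021, 6, 5): 120,
    date(2021, 6, 6): 430,
}

# Hypothetical weather observations; note that June 6 is missing.
weather = {
    date(2021, 6, 1): {"rain_mm": 0.0},
    date(2021, 6, 2): {"rain_mm": 0.0},
    date(2021, 6, 3): {"rain_mm": 22.5},
    date(2021, 6, 4): {"rain_mm": 0.0},
    date(2021, 6, 5): {"rain_mm": 14.0},
}

def add_weather(rentals, weather):
    """Left join: keep every rental day, with rain_mm=None when no observation exists."""
    rows = []
    for day, count in sorted(rentals.items()):
        observation = weather.get(day)
        rain = observation["rain_mm"] if observation else None
        rows.append({"date": day, "rentals": count, "rain_mm": rain})
    return rows

rows = add_weather(rentals, weather)
dry = [r["rentals"] for r in rows if r["rain_mm"] is not None and r["rain_mm"] == 0]
rainy = [r["rentals"] for r in rows if r["rain_mm"] is not None and r["rain_mm"] > 0]

print("mean rentals on dry days:", fmean(dry))      # 405.0
print("mean rentals on rainy days:", fmean(rainy))  # 107.5
```

Keeping the unmatched day in the joined table, rather than silently dropping it, makes gaps in the complementary data visible.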

Predicting retail sales offers other good examples. Sales promotions, competitive offerings, changes in advertising, economic events, and weather might all affect sales. The short summary: If the data doesn’t make sense, add some context, and perhaps all will become clearer.

Try automated machine learning

At one time, the only way to find the best model for your data was to train every possible model and see which one came out on top. For many kinds of data, especially tagged tabular data, you can point an AutoML (automated machine learning) tool at the dataset and come back later to get some good answers. Sometimes the best model will be an ensemble of other models, which can be costly to use for inference, but often the best simple model is nearly as good as the ensemble and much cheaper to run.

Under the hood, AutoML services often do more than blindly trying every appropriate model. For example, some automatically create normalised and engineered feature sets, impute missing values, drop correlated features, and add lagged columns for time-series forecasting. Another optional activity is performing hyperparameter optimisation for some of the best models to improve them further. To get the best possible result in the allotted time, some AutoML services quickly terminate the training of models that aren’t improving much, and devote more of their cycles to the models that look the most promising.
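Of the feature-engineering steps listed above, lagged columns are the easiest to show in isolation. The following sketch turns a univariate series into supervised-learning rows whose features are earlier values of the same series. The function name, the column naming scheme, and the numbers are illustrative assumptions, not how any specific AutoML service does it.

```python
def add_lags(series, lags):
    """Build one row per time step, with the values `lag` steps back as features."""
    rows = []
    # Start at the first index for which every requested lag exists.
    for t in range(max(lags), len(series)):
        row = {f"lag_{lag}": series[t - lag] for lag in lags}
        row["target"] = series[t]
        rows.append(row)
    return rows

# Hypothetical daily values.
series = [10, 12, 13, 15, 18, 21]
rows = add_lags(series, lags=(1, 2))

for row in rows:
    print(row)
# First row: {'lag_1': 12, 'lag_2': 10, 'target': 13}
```

The first max(lags) observations produce no rows because their history is incomplete, so longer lags cost you training examples.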

Customise a trained model with transfer learning

Training a large neural network from scratch typically requires a lot of data (millions of training items are not unusual) and significant time and computing resources (several weeks using multiple server GPUs). One powerful shortcut, called transfer learning, is to customise a trained neural network by training a few new layers on top of the network with new data, or extracting the features from the network and using those to train a simple linear classifier. This can be done using a cloud service, such as Azure Custom Vision or custom Language Understanding, or by taking advantage of libraries of trained neural networks created with, for example, TensorFlow or PyTorch. Transfer learning or fine-tuning can often be completed in minutes with a single GPU.

Try deep learning algorithms from a ‘model zoo’

Even if you can’t easily create the model you need with transfer learning using your preferred cloud service or deep learning framework, you still might be able to avoid the slog of designing and training a deep neural network model from scratch. Most major frameworks have a model zoo that’s more extensive than their model APIs. There are even some websites that maintain model zoos for multiple frameworks, or for any framework that can handle a specific representation, such as ONNX.

Many of the models you’ll find in model zoos are fully trained and ready to use. Some, however, are partially trained snapshots, whose weights are useful as starting points for training with your own datasets.

Optimise your model’s hyperparameters

Training a model the first time isn’t usually the end of the process. Machine learning models can often be improved by using different hyperparameters, and the best ones are found by hyperparameter optimisation or tuning. No, this isn’t really a jump-start, but it is a way to get from an early not-so-good model to a much better model.

Hyperparameters are parameters outside the model, which are used to control the learning process. Parameters inside the model, such as node weights, are learned during model training. Hyperparameter optimisation is essentially the process of finding the best set of hyperparameters for a given model. Each step in the optimisation involves training the model again and getting a loss function value back.

The hyperparameters that matter depend on the model and the optimiser used within the model. For example, learning rate is a common hyperparameter for neural networks, except when the optimiser takes control of the learning rate from epoch to epoch. For a Support Vector Machine classifier with an RBF (radial basis function) kernel, the hyperparameters might be a regularisation constant and a kernel constant.

Hyperparameter optimisers can use a number of search algorithms. Grid search is traditional. On the one hand, grid search requires many trainings to cover all the combinations of multiple hyperparameters, but on the other hand, all the trainings can run in parallel if you have enough compute resources. Random search is sometimes more efficient, and is also easy to parallelise. Other alternatives include Bayesian optimisation, gradient-based optimisation, evolutionary optimisation, and early-stopping algorithms.
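Grid search and random search are simple enough to sketch side by side. In the example below, train_and_get_loss is a hypothetical stand-in for "train the model and return its validation loss": it is a made-up formula whose minimum sits at C = 10 and gamma = 0.01, chosen so the sketch runs instantly. The hyperparameter names echo the SVM example, and the ranges and trial counts are arbitrary.

```python
import itertools
import math
import random

def train_and_get_loss(C, gamma):
    """Hypothetical stand-in for one training run; returns a validation loss.

    A real version would fit a model with these hyperparameters and score it
    on held-out data. This toy formula is smallest at C=10, gamma=0.01.
    """
    return (math.log10(C) - 1) ** 2 + (math.log10(gamma) + 2) ** 2

def grid_search(grid):
    """Try every combination of the listed values; one training per combination."""
    names = list(grid)
    candidates = (
        dict(zip(names, values)) for values in itertools.product(*grid.values())
    )
    return min(candidates, key=lambda params: train_and_get_loss(**params))

def random_search(bounds, n_trials, seed=0):
    """Sample each hyperparameter log-uniformly within its bounds."""
    rng = random.Random(seed)

    def sample():
        return {
            name: 10 ** rng.uniform(math.log10(low), math.log10(high))
            for name, (low, high) in bounds.items()
        }

    return min((sample() for _ in range(n_trials)),
               key=lambda params: train_and_get_loss(**params))

grid = {"C": [0.1, 1, 10, 100], "gamma": [0.001, 0.01, 0.1, 1]}
best_grid = grid_search(grid)  # 4 x 4 = 16 trainings
best_random = random_search({"C": (0.1, 100), "gamma": (0.001, 1)}, n_trials=16)

print("grid:  ", best_grid, train_and_get_loss(**best_grid))
print("random:", best_random, train_and_get_loss(**best_random))
```

Every trial in either search is independent of the others, which is what makes both easy to parallelise. The grid grows multiplicatively with each added hyperparameter, while the random-search budget is whatever number of trials you choose.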

To summarise, start your model building process with exploratory data analysis. Use unsupervised learning to understand more about your data and features. Try AutoML to test out many models quickly. If you need a deep neural network model, first try transfer learning or a model zoo before trying to design and train your own network from scratch. If you find a model you think looks pretty good, try improving it with hyperparameter tuning. Then you can try the model in production, and monitor it.

By the way, you’re not really done at that point. Over time, the data or the concept will drift because of real-world events, and you’ll need to refine and retrain your model. You may even discover that a different type of model will work better for the new data.

What fun would it be if you could build a model once and be done with it?