In this post, you will learn about the differences between online and batch learning, and how machine learning models in production can learn incrementally from a stream of incoming data. Choosing between the two is one of the most important decisions in designing machine learning systems, so data science architects need a good understanding of when to use online learning and when to use batch (offline) learning.
What is Batch Learning?
Batch learning is the training of machine learning models in a batch manner. Data is accumulated over a period of time, and models are then retrained on the accumulated data at regular intervals. In other words, the system is incapable of learning incrementally from a stream of data: it must be trained from scratch on the full accumulated dataset, which takes a lot of time and resources (CPU, memory, disk space, disk I/O, network I/O, etc.). Batch learning is also called offline learning. Models trained using batch learning are moved into production only at regular intervals, based on the performance of models trained with the new data.
If a model trained using batch learning needs to learn about new data, it must be retrained on the new dataset and then swapped in for the model already in production, based on criteria such as model performance. The whole batch learning process can be automated.
The main disadvantage of batch learning is that retraining the model takes a lot of time and resources.
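The accumulate-then-retrain loop can be sketched in a few lines. This is a minimal, illustrative example, not any particular library's API: a toy one-feature linear model is refit from scratch by full-batch gradient descent every time a new day's data arrives.

```python
# Minimal sketch of a batch (offline) learning loop: data piles up,
# and the model is retrained from scratch on ALL accumulated data.
# The function and variable names here are illustrative assumptions.

def train_batch(xs, ys, epochs=200, lr=0.05):
    """Fit y ~ w*x by full-batch gradient descent over the whole dataset."""
    w = 0.0
    n = len(xs)
    for _ in range(epochs):
        grad = sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / n
        w -= lr * grad
    return w

accumulated_x, accumulated_y = [], []
for day_x, day_y in [([1, 2], [2, 4]), ([3, 4], [6, 8])]:
    accumulated_x += day_x                         # data accumulates over time...
    accumulated_y += day_y
    w = train_batch(accumulated_x, accumulated_y)  # ...then retrain on everything
print(round(w, 2))  # approaches 2.0, the true slope of y = 2x
```

Note that each retraining pass touches every historical example, which is exactly why the cost of batch learning grows with the size of the accumulated dataset.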
What is Online Learning?
In online learning, training happens incrementally by continuously feeding in data as it arrives, either one instance at a time or in small groups called mini-batches. Each learning step is fast and cheap, so the system can learn about new data on the fly, as it arrives.
Online learning is great for machine learning systems that receive data as a continuous flow (e.g., stock prices) and need to adapt to change rapidly or autonomously. It is also a good option if you have limited computing resources: once an online learning system has learned from new data instances, it no longer needs them, so you can discard them (unless you want to be able to roll back to a previous state and “replay” the data). This can save a huge amount of space. The diagram given below represents online learning.
Online learning algorithms can also be used to train systems on huge datasets that cannot fit in one machine’s main memory (this is also called out-of-core learning). The algorithm loads part of the data, runs a training step on that data, and repeats the process until it has run on all of the data.
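The load-a-chunk, update, discard cycle can be sketched as follows. This is a hedged, minimal illustration in pure Python (the chunk generator stands in for reading a huge file; in practice, libraries such as scikit-learn expose incremental training through estimators with a `partial_fit` method). Here the "model" is just a running mean, updated one observation at a time so that only a constant amount of state is kept in memory.

```python
# Out-of-core sketch: stream chunks of a dataset that (in a real system)
# would not fit in main memory, updating a tiny model incrementally.
# The data_chunks generator is an illustrative stand-in for file reads.

def data_chunks():
    """Stand-in for reading a huge dataset chunk by chunk from disk."""
    yield [1.0, 2.0, 3.0]
    yield [4.0, 5.0]
    yield [6.0]

count, mean = 0, 0.0
for chunk in data_chunks():
    for x in chunk:                  # one incremental learning step per value
        count += 1
        mean += (x - mean) / count   # running-mean update; no history needed
    # the chunk can now be discarded; only (count, mean) are retained
print(mean)  # 3.5, the mean of 1..6, computed without holding all data
```

The key property is that memory use stays constant no matter how large the full dataset is, because each chunk is dropped as soon as the model has been updated with it.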
One of the key aspects of online learning is the learning rate: the rate at which the system adapts to new data. With a high learning rate, the system adapts rapidly to new data but also tends to quickly forget what it learned from older data. With a low learning rate, the system learns more slowly and behaves more like a batch learning system.
One big disadvantage of an online learning system is that if it is fed bad data, its performance degrades and users see the impact immediately. It is therefore very important to put appropriate filters in place to ensure that the incoming data is of high quality, and to monitor the performance of the machine learning system very closely.