An Introduction to Dropout and Ensemble Models in the History of Deep Learning

Tresor Koffi · Published in unpack · Feb 1, 2021

Dropout is a technique where randomly selected neurons are ignored during training. They are “dropped out” at random. This means that their contribution to the activation of downstream neurons is temporarily removed on the forward pass, and no weight updates are applied to those neurons on the backward pass.

The History of Dropout

Dropout was introduced in 2012 as a technique to avoid overfitting and was subsequently used in the winning submission of the 2012 Large Scale Visual Recognition Challenge, which revitalized deep neural network research. The original method omitted each neuron in the network with probability 0.5 during each training iteration, with all neurons included at test time. This technique was shown to significantly improve results on a variety of tasks.

In the years since, a wide range of stochastic techniques inspired by the original dropout method have been proposed for use with deep learning models; we use the term “dropout methods” to refer to them in general. They include DropConnect, standout, fast dropout, variational dropout, Monte Carlo dropout, and many others. Generally speaking, dropout methods involve randomly modifying neural network parameters or activations during training or inference, or approximating this process. Figure 1 illustrates research into dropout methods over time.

Another direction of research has been applying dropout methods to a wider range of neural network topologies, including convolutional layers as well as recurrent neural networks (RNNs). RNN dropout methods in particular have become widely used and have recently helped improve state-of-the-art results in natural language processing.

Figure 1: Some proposed methods and theoretical advances in dropout methods from 2012 to 2019

Dropout in Neural Networks

Neural networks are the building blocks of any machine-learning architecture. They consist of one input layer, one or more hidden layers, and an output layer. When we train our neural network (or model) by updating each of its weights, it can become too dependent on the dataset we are using. Then, when the model has to make a prediction or classification on new data, it will not give satisfactory results. This is known as overfitting.

“We might understand this problem through a real-world example: If a student of mathematics learns only one chapter of a book and then takes a test on the whole syllabus, he will probably fail.”

“The basic idea of this method is to, based on probability, temporarily “drop out” neurons from our original network. Doing this for every training example gives us different models for each one. Afterwards, when we want to test our model, we take the average of each model to get our answer/prediction”

Dropout in the Training Process

We assign ‘p’ to represent the probability of a neuron in a hidden layer being excluded from the network; this probability is usually set to 0.5. We do the same for the input layer, whose drop probability is usually lower (e.g. 0.2). Remember, when we drop a neuron we also delete the connections going into and out of it.

Dropout can be easily implemented by randomly disconnecting some neurons of the network, resulting in what is called a “thinned” network.
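
To make the “thinned” network concrete, here is a minimal NumPy sketch (my own illustration, not code from the article) that drops each hidden unit with probability p = 0.5 during a training forward pass; on the backward pass the same mask would also zero the corresponding gradients:

```python
import numpy as np

rng = np.random.default_rng(0)

def dropout_forward(h, p_drop=0.5):
    """Apply dropout to a layer's activations during training.

    Each unit is kept with probability (1 - p_drop); dropped units
    contribute nothing to the forward pass, giving a "thinned" network.
    """
    mask = rng.random(h.shape) >= p_drop   # True where the unit is kept
    return h * mask, mask

# Toy hidden-layer activations for one training example
hidden = np.array([0.7, 1.2, -0.3, 0.5, 2.0])
thinned, mask = dropout_forward(hidden, p_drop=0.5)
print(mask)     # e.g. [ True False  True  True False]
print(thinned)  # dropped units are exactly zero
```

Because a fresh mask is sampled for every training example, each forward pass effectively trains a different sub-network.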

Dropout in the Testing Process

An output from a model trained with dropout is obtained a bit differently: in principle, we could sample many dropped-out sub-models and combine their output neurons with a geometric mean, multiplying the n sampled outputs together and taking the n-th root of the product. However, since this is computationally expensive, we instead use the original network and simply halve all of the hidden units’ outgoing weights (for a drop probability of 0.5). This gives a good approximation of the average over the different dropped-out models.

During training time, dropout randomly sets node values to zero. In the original implementation, we have a “keep probability” pkeep, so dropout randomly zeroes node values with “dropout probability” 1 − pkeep. During inference time, dropout does not zero any node values; instead, all the weights in the layer are multiplied by pkeep. A major motivation for this is to make sure that the distribution of the values after the affine transformation during inference time is close to that during training time. Equivalently, this multiplier can be placed on the input values rather than on the weights.

Concretely, say we have a vector x = {1, 2, 3, 4, 5} as the input to a fully connected layer, and pkeep = 0.8. During training time, x could become {1, 0, 3, 4, 5} due to dropout. During inference time, x would be scaled to {0.8, 1.6, 2.4, 3.2, 4.0} while the weights remain unchanged.
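
Those numbers can be reproduced with a few lines of NumPy (again an illustrative sketch, not the article’s own code):

```python
import numpy as np

rng = np.random.default_rng(42)

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
p_keep = 0.8

# Training time: each value survives with probability p_keep,
# otherwise it is set to zero (one possible outcome is [1, 0, 3, 4, 5]).
train_mask = rng.random(x.shape) < p_keep
x_train = x * train_mask

# Inference time: nothing is dropped; instead the values are scaled
# by p_keep so their expected magnitude matches training time.
x_test = x * p_keep
print(x_test)   # -> [0.8, 1.6, 2.4, 3.2, 4.0]
```

As a side note, most modern frameworks implement the equivalent “inverted” dropout, which divides the kept values by pkeep at training time so that nothing has to be rescaled at inference.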

If a unit is retained with probability p during training, the outgoing weights of that unit are multiplied by p at test time

Ensemble Models

Ensemble methods are techniques that create multiple models and then combine them to produce improved results. Ensemble methods usually produce more accurate solutions than a single model would.

How does an ensemble model work?

Say you want to develop a machine learning model that predicts inventory stock orders for your company based on historical data you have gathered from previous years. You train four machine learning models using different algorithms: linear regression, a support vector machine, a regression decision tree, and a basic artificial neural network. But even after much tweaking and configuration, none of them achieves your desired 95 percent prediction accuracy. These machine learning models are called “weak learners” because they fail to converge to the desired level.

A single machine learning method does not provide the desired accuracy

But weak doesn’t mean useless. You can combine them into an ensemble. For each new prediction, you run your input data through all four models, and then compute the average of the results. When examining the new result, you see that the aggregate results provide 92 percent accuracy, which is more than acceptable.

The reason ensemble learning is effective is that your machine learning models work differently. Each model might perform well on some data and less accurately on other data. When you combine them, they cancel out each other’s weaknesses.
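
As a sketch of that averaging step, here is how the four weak learners from the example could be combined in scikit-learn. The dataset is synthetic and the hyperparameters are arbitrary; this only shows the mechanics of averaging, not the hypothetical accuracy figures from the story.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.svm import SVR
from sklearn.tree import DecisionTreeRegressor
from sklearn.neural_network import MLPRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score

# Synthetic stand-in for the historical inventory data described above.
X, y = make_regression(n_samples=500, n_features=8, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# The four "weak learners" from the example: linear regression, an SVM,
# a regression decision tree, and a small neural network.
models = [
    LinearRegression(),
    SVR(),
    DecisionTreeRegressor(max_depth=5, random_state=0),
    MLPRegressor(hidden_layer_sizes=(32,), max_iter=2000, random_state=0),
]

for m in models:
    m.fit(X_train, y_train)

# Ensemble prediction: run the input through every model and average.
ensemble_pred = np.mean([m.predict(X_test) for m in models], axis=0)
print("Ensemble R^2:", r2_score(y_test, ensemble_pred))
```

Whether the average actually beats each individual model depends on the data and on how different the models’ errors are.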

You can apply ensemble methods both to prediction problems, like the inventory example we just saw, and to classification problems, such as determining whether a picture contains a certain object.

Ensemble machine learning combines several models to improve the overall result.


Conclusion

Dropout in a neural network can be considered an ensemble technique, in which many sub-networks are trained together by randomly “dropping out” neurons and their connections.

