What should I do when my neural network doesn't generalize well?

I'm training a neural network and the training loss decreases, but the validation loss doesn't, or it decreases much less than what I would expect, based on references or experiments with very similar architectures and data. How can I fix this?




As with the question

What should I do when my neural network doesn't learn?

which inspired this one, this question is intentionally left general, so that other questions about how to reduce the generalization error of a neural network to a level that has been shown to be attainable can be closed as duplicates of it.



See also the dedicated thread on Meta:




Is there a generic question to which we can redirect questions of the type "why does my neural network not generalize well?"











Tags: neural-networks, deep-learning






asked Sep 7 at 9:12 by DeltaIV (edited Sep 7 at 10:17 by Jan Kukacka)
  • If you are planning to post your own comprehensive answer, then it might have been a good idea to post the Q and the A simultaneously (the user interface allows that). Otherwise you are encouraging other people to write answers and we might end up with several answers that partially duplicate each other... Anyway, looking forward to your answer.
    – amoeba, Sep 7 at 12:03

  • @amoeba Ah, I didn't know that: the UI opens a pop-up when I try to answer the question, so I thought Q & A could not be posted together. Well, if someone writes a better/more complete answer than what I was going to write, I'll just avoid adding a duplicate.
    – DeltaIV, Sep 7 at 12:49











2 Answers
Why is your model not generalizing properly?

The most important part is understanding why your network doesn't generalize well. High-capacity Machine Learning models have the ability to memorize the training set, which can lead to overfitting.

Overfitting is the state where an estimator has begun to learn the training set so well that it has started to model the noise in the training samples (in addition to all the useful relationships).

For example, in the figure below (omitted here) the blue line has clearly overfit.

But why is this bad?

When we evaluate the model on new, previously unseen data (i.e. the validation/test set), its performance will be much worse than we expect.
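As a concrete illustration (my own sketch, not part of the original answer, using a high-degree polynomial fit as a stand-in for an over-capacity network; scikit-learn is assumed and all numbers are arbitrary), increasing capacity drives the training error towards zero while the error on held-out data grows:

    # Illustrative sketch: a high-capacity model memorizes the noise in a tiny
    # training set and therefore does much worse on held-out data.
    import numpy as np
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import PolynomialFeatures
    from sklearn.linear_model import LinearRegression
    from sklearn.metrics import mean_squared_error

    rng = np.random.default_rng(0)
    x_train = rng.uniform(-1, 1, size=(15, 1))
    y_train = np.sin(3 * x_train).ravel() + rng.normal(scale=0.2, size=15)   # signal + noise
    x_val = rng.uniform(-1, 1, size=(200, 1))
    y_val = np.sin(3 * x_val).ravel() + rng.normal(scale=0.2, size=200)

    for degree in (3, 12):                # moderate vs. excessive capacity
        model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
        model.fit(x_train, y_train)
        train_mse = mean_squared_error(y_train, model.predict(x_train))
        val_mse = mean_squared_error(y_val, model.predict(x_val))
        print(f"degree {degree:2d}: train MSE {train_mse:.3f}, validation MSE {val_mse:.3f}")

The high-degree fit reaches a near-zero training error but a much larger validation error, which is exactly the gap described above.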



How to prevent overfitting?

At the beginning of the post I implied that your model's complexity is what is actually causing the overfitting: it allows the model to extract spurious relationships from the training set that merely map its inherent noise. The easiest way to reduce overfitting is therefore to limit the capacity of your model. Techniques that do this are called regularization techniques.

  • Parameter norm penalties. These add an extra term to the model's objective, and hence to each weight update, that depends on the norm of the parameters. This term's purpose is to counter part of the update (i.e. limit how much each weight can grow), which makes the model more robust to outliers and noise. Examples of such penalties are L1 and L2 regularization, which can be found in the Lasso, Ridge and Elastic Net regressors. A sketch combining this with early stopping and dropout follows this list.

  • Early stopping. This technique attempts to stop an estimator's training phase prematurely, at the point where it has learned to extract all meaningful relationships from the data, before beginning to model its noise. This is done by monitoring the validation loss (or a validation metric of your choosing) and terminating the training phase when this metric stops improving. This way we give the estimator enough time to learn the useful information but not enough to learn from the noise.

  • Neural-network-specific regularizations. Some examples are:

    • Dropout. Dropout is an interesting technique that works surprisingly well. It is applied between two successive layers in a network. At each iteration a specified percentage of the connections between the two layers, selected at random, are dropped. This prevents the subsequent layer from relying too heavily on any particular connections to the previous layer.

    • Transfer learning. This is especially common in Deep Learning. It is done by initializing the weights of your network to those of another network with the same architecture, pre-trained on a large, generic dataset.

    • Other things that may limit overfitting in deep neural networks are: Batch Normalization, which can act as a regularizer; relatively small batch sizes in SGD, which can also prevent overfitting; and adding small random noise to the hidden layers.
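To make a few of these concrete, here is the sketch referred to above (PyTorch is assumed; the network, the data loaders and hyperparameters such as `patience` are illustrative placeholders, not part of the original answer): it combines an L2 parameter norm penalty via the optimizer's weight decay, dropout between two layers, and early stopping on the validation loss.

    # Minimal sketch (PyTorch assumed): weight decay (L2 penalty), dropout and
    # early stopping in one training loop.  Network and loaders are placeholders.
    import copy
    import torch
    import torch.nn as nn

    class SmallNet(nn.Module):
        def __init__(self, n_in, n_hidden, n_out, p_drop=0.5):
            super().__init__()
            self.net = nn.Sequential(
                nn.Linear(n_in, n_hidden),
                nn.ReLU(),
                nn.Dropout(p_drop),      # dropout between two successive layers
                nn.Linear(n_hidden, n_out),
            )

        def forward(self, x):
            return self.net(x)

    def train(model, train_loader, val_loader, epochs=100, patience=10):
        loss_fn = nn.CrossEntropyLoss()
        # weight_decay adds an L2 penalty on the parameters (a parameter norm penalty)
        opt = torch.optim.SGD(model.parameters(), lr=1e-2, momentum=0.9, weight_decay=1e-4)

        best_val, best_state, bad_epochs = float("inf"), None, 0
        for epoch in range(epochs):
            model.train()
            for x, y in train_loader:
                opt.zero_grad()
                loss_fn(model(x), y).backward()
                opt.step()

            # Early stopping: monitor the validation loss after every epoch.
            model.eval()
            with torch.no_grad():
                val_loss = sum(loss_fn(model(x), y).item() for x, y in val_loader) / len(val_loader)
            if val_loss < best_val:
                best_val, best_state, bad_epochs = val_loss, copy.deepcopy(model.state_dict()), 0
            else:
                bad_epochs += 1
                if bad_epochs >= patience:
                    break                # stop before the model starts fitting noise

        model.load_state_dict(best_state)   # restore the weights with the best validation loss
        return model

The same early-stopping behaviour is available in Keras through the EarlyStopping callback (with restore_best_weights=True).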


Another way of preventing overfitting, besides limiting the model's capacity, is to improve the quality of your data. The most obvious choice would be outlier/noise removal, although in practice its usefulness is limited. A more common way (especially in image-related tasks) is data augmentation: we randomly transform the training examples so that, while they appear different to the model, they convey the same semantic information (e.g. left-right flipping of images).
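For the image case, a typical augmentation pipeline might look like the sketch below (torchvision is assumed; the particular transforms, the 32x32 crop size and the jitter strengths are illustrative choices, not prescriptions):

    # Sketch of a data-augmentation pipeline (torchvision assumed).
    from torchvision import transforms

    train_transform = transforms.Compose([
        transforms.RandomHorizontalFlip(),       # left-right flip keeps the label unchanged
        transforms.RandomCrop(32, padding=4),    # small random shifts (assumes 32x32 inputs)
        transforms.ColorJitter(brightness=0.2, contrast=0.2),
        transforms.ToTensor(),
    ])

    # Validation/test data should not be augmented -- only converted/normalized.
    eval_transform = transforms.Compose([transforms.ToTensor()])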







answered Sep 7 at 11:07 by Djib2011 (edited Sep 7 at 13:48)
A list of commonly used regularization techniques which I've seen in the literature:

1. Using batch normalization, which is a surprisingly effective regularizer, to the point where I rarely see dropout used anymore, because it is simply not necessary.

2. A small amount of weight decay.

3. Some more recent regularization techniques include Shake-Shake ("Shake-Shake regularization" by Xavier Gastaldi) and Cutout ("Improved Regularization of Convolutional Neural Networks with Cutout" by Terrance DeVries and Graham W. Taylor). In particular, the ease with which Cutout can be implemented makes it very attractive (a minimal sketch follows this list). I believe these work better than dropout -- but I'm not sure.

4. If possible, prefer fully convolutional architectures to architectures with fully connected layers. Compare VGG-16, which has 100 million parameters in a single fully connected layer, to ResNet-152, which has 10 times the number of layers and still fewer parameters.

5. Prefer SGD to other optimizers such as RMSProp and Adam. It has been shown to generalize better ("Improving Generalization Performance by Switching from Adam to SGD" by Nitish Shirish Keskar and Richard Socher).
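Since point 3 stresses how easy Cutout is to implement, here is a minimal Cutout-style transform (my own sketch, assuming PyTorch image tensors of shape C x H x W; the mask size is a tunable hyperparameter), together with the kind of plain SGD-with-weight-decay setup suggested by points 2 and 5:

    # Minimal Cutout-style transform (illustrative sketch, PyTorch assumed):
    # zero out one randomly placed square patch of the image per call.
    import torch

    class Cutout:
        def __init__(self, size=8):
            self.size = size                    # side length of the square mask

        def __call__(self, img):                # img: tensor of shape (C, H, W)
            _, h, w = img.shape
            cy = torch.randint(h, (1,)).item()  # random centre of the mask
            cx = torch.randint(w, (1,)).item()
            y1, y2 = max(cy - self.size // 2, 0), min(cy + self.size // 2, h)
            x1, x2 = max(cx - self.size // 2, 0), min(cx + self.size // 2, w)
            img = img.clone()
            img[:, y1:y2, x1:x2] = 0.0          # erase the patch
            return img

    # Points 2 and 5: plain SGD with momentum and a small amount of weight decay
    # (values are illustrative; `model` is whatever network you are training):
    # optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9, weight_decay=5e-4)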






answered Sep 9 at 1:52 by shimao (edited Sep 10 at 13:56 by Sycorax)