What should I do when my neural network doesn't generalize well?


I'm training a neural network and the training loss decreases, but the validation loss doesn't, or it decreases much less than what I would expect, based on references or experiments with very similar architectures and data. How can I fix this?




As with the question "What should I do when my neural network doesn't learn?", which inspired this one, the question is intentionally left general, so that other questions about how to reduce the generalization error of a neural network to a level that has been shown to be attainable can be closed as duplicates of this one.



See also the dedicated thread on Meta: "Is there a generic question to which we can redirect questions of the type 'why does my neural network not generalize well?'"











  • If you are planning to post your own comprehensive answer, then it might have been a good idea to post the Q and the A simultaneously (the user interface allows that). Otherwise you are encouraging other people to write answers and we might end up with several answers that partially duplicate each other... Anyway, looking forward to your answer.
    – amoeba
    Sep 7 at 12:03











  • @amoeba ah, I didn't know that: the UI opens a pop-up when I try to answer the question, so I thought Q & A could not be posted together. Well, if someone writes a better/more complete answer than what I was going to write, I'll just avoid adding a duplicate.
    – DeltaIV
    Sep 7 at 12:49

















neural-networks deep-learning






edited Sep 7 at 10:17 by Jan Kukacka
asked Sep 7 at 9:12 by DeltaIV


2 Answers

Why is your model not generalizing properly?



The most important part is understanding why your network doesn't generalize well. High-capacity Machine Learning models have the ability to memorize the training set, which can lead to overfitting.



Overfitting is the state where an estimator has begun to learn the training set so well that it has started to model the noise in the training samples (besides all useful relationships).



For example, in the image below we can see how the blue line has clearly overfit.





But why is this bad?



When attempting to evaluate our model on new, previously unseen data (i.e. validation/test set), the model's performance will be much worse than what we expect.
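
To make the gap between training and validation performance concrete, here is a minimal sketch. It does not use a neural network: as an assumption for illustration, a 1-nearest-neighbour predictor stands in for a model with enough capacity to memorize the training set, and the data (noisy samples of a sine curve) are made up.

```python
import math
import random


def make_data(n, rng, noise=0.3):
    """Noisy samples of y = sin(x) on [0, 2*pi] (synthetic, for illustration)."""
    xs = [rng.uniform(0.0, 2.0 * math.pi) for _ in range(n)]
    ys = [math.sin(x) + rng.gauss(0.0, noise) for x in xs]
    return xs, ys


def nearest_neighbour_predict(train_x, train_y, x):
    """Return the label of the closest training point: pure memorization."""
    i = min(range(len(train_x)), key=lambda j: abs(train_x[j] - x))
    return train_y[i]


def mse(pred, target):
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(target)


rng = random.Random(0)
train_x, train_y = make_data(50, rng)
val_x, val_y = make_data(50, rng)

train_pred = [nearest_neighbour_predict(train_x, train_y, x) for x in train_x]
val_pred = [nearest_neighbour_predict(train_x, train_y, x) for x in val_x]

train_mse = mse(train_pred, train_y)  # the memorizer reproduces every training label
val_mse = mse(val_pred, val_y)        # but it also reproduces the training noise
print(f"train MSE = {train_mse:.3f}, validation MSE = {val_mse:.3f}")
```

The training error is zero because every training point is its own nearest neighbour, while the validation error is not, since the memorized noise does not carry over to new samples.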



How to prevent overfitting?



At the beginning of the post I implied that the complexity of your model is what actually causes the overfitting: it allows the model to extract spurious relationships from the training set that merely reflect its inherent noise. The easiest way to reduce overfitting is to limit the capacity of your model. Techniques that do this are called regularization techniques.



  • Parameter norm penalties. These add an extra term to the loss function, and therefore to each weight update, that depends on the norm of the parameters. The term counteracts the data-driven update by pulling the weights toward zero, which limits how large they can grow. This makes the model more robust to outliers and noise. Examples are L1 and L2 regularization, which are used in the Lasso, Ridge and Elastic Net regressors.


  • Early stopping. This technique attempts to stop an estimator's training phase prematurely, at the point where it has learned to extract all meaningful relationships from the data, before beginning to model its noise. This is done by monitoring the validation loss (or a validation metric of your choosing) and terminating the training phase when this metric stops improving. This way we give the estimator enough time to learn the useful information but not enough to learn from the noise.





  • Neural Network specific regularizations. Some examples are:


    • Dropout. Dropout is an interesting technique that works surprisingly well. It is applied between two successive layers in a network. At each training iteration a specified fraction of the first layer's outputs (selected randomly) is set to zero, which drops the connections that carry them to the next layer. This forces the subsequent layer to rely on all of its connections to the previous layer rather than on just a few of them.


    • Transfer learning. This is especially used in Deep Learning. This is done by initializing the weights of your network to the ones of another network with the same architecture pre-trained on a large, generic dataset.

    • Other things that may limit overfitting in Deep Neural Networks are: Batch Normalization, which can act as a regularizer; relatively small batch sizes in SGD, which can also help prevent overfitting; and adding small random noise to hidden layers.
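
Three of the items above (an L2 norm penalty, early stopping and dropout) can be sketched in a few lines. This is only an illustration under stated assumptions: the model is a degree-9 polynomial trained by plain gradient descent rather than a real network, the data are synthetic, and all names and hyperparameter values (`weight_decay`, `patience`, `lr` and so on) are hypothetical choices. The `dropout` function is shown on its own because this toy model has no hidden layer to apply it to.

```python
import math
import random


def features(x, degree):
    """Polynomial features [1, x, x^2, ...]: a deliberately over-flexible model."""
    return [x ** k for k in range(degree + 1)]


def predict(w, x):
    return sum(wk * fk for wk, fk in zip(w, features(x, len(w) - 1)))


def mse(w, data):
    return sum((predict(w, x) - y) ** 2 for x, y in data) / len(data)


def gradient_step(w, data, lr, weight_decay):
    """One gradient-descent step on MSE + weight_decay * ||w||^2."""
    grad = [0.0] * len(w)
    for x, y in data:
        err = predict(w, x) - y
        for k, fk in enumerate(features(x, len(w) - 1)):
            grad[k] += 2.0 * err * fk / len(data)
    # The penalty adds 2 * weight_decay * w_k, pulling each weight toward zero.
    return [wk - lr * (gk + 2.0 * weight_decay * wk) for wk, gk in zip(w, grad)]


def dropout(activations, rate, rng, training=True):
    """Inverted dropout: zero each value with probability `rate`, rescale the rest."""
    if not training or rate == 0.0:
        return list(activations)
    keep = 1.0 - rate
    return [a / keep if rng.random() < keep else 0.0 for a in activations]


def train(train_data, val_data, degree=9, lr=0.05, weight_decay=1e-3,
          max_epochs=500, patience=50):
    """Train with an L2 penalty; stop once validation MSE has not improved
    for `patience` epochs, and return the best weights seen."""
    w = [0.0] * (degree + 1)
    best_w, best_val, best_epoch = list(w), mse(w, val_data), 0
    for epoch in range(1, max_epochs + 1):
        w = gradient_step(w, train_data, lr, weight_decay)
        val = mse(w, val_data)
        if val < best_val:
            best_w, best_val, best_epoch = list(w), val, epoch
        elif epoch - best_epoch >= patience:
            break  # early stopping
    return best_w, best_val, best_epoch


rng = random.Random(0)


def sample(n):
    """Synthetic data: noisy samples of sin(3x) on [-1, 1]."""
    xs = [rng.uniform(-1.0, 1.0) for _ in range(n)]
    return [(x, math.sin(3.0 * x) + rng.gauss(0.0, 0.2)) for x in xs]


train_data, val_data = sample(20), sample(20)
best_w, best_val, best_epoch = train(train_data, val_data)
print(f"best validation MSE {best_val:.3f} at epoch {best_epoch}")
```

Returning the weights from the best validation epoch, rather than the last ones, is a common design choice: the model that is kept is the one measured just before validation performance stopped improving.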


Another way of preventing overfitting, besides limiting the model's capacity, is to improve the quality of your data. The most obvious choice would be outlier/noise removal; in practice, however, its usefulness is limited. A more common way (especially in image-related tasks) is data augmentation. Here we randomly transform the training examples so that, while they appear different to the model, they convey the same semantic information (e.g. left-right flipping of images).
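
As a minimal sketch of the left-right flipping mentioned above, assuming an image is represented as a plain list of pixel rows (the function name and the flip probability `p` are hypothetical choices for illustration):

```python
import random


def random_horizontal_flip(image, rng, p=0.5):
    """Return a copy of a 2-D image, flipped left-to-right with probability p."""
    if rng.random() < p:
        return [row[::-1] for row in image]
    return [row[:] for row in image]


image = [[1, 2, 3],
         [4, 5, 6]]
flipped = random_horizontal_flip(image, random.Random(0), p=1.0)  # always flips

# During training, each example would be re-augmented every time it is drawn.
rng = random.Random(0)
augmented_batch = [random_horizontal_flip(image, rng) for _ in range(4)]
```

The label is left untouched, which is only valid when the transformation preserves the meaning of the example; flipping a photo of a cat is fine, flipping an image of text usually is not.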






    Commonly used regularization techniques that I've seen in the literature include:



    1. Using batch normalization, which is a surprisingly effective regularizer to the point where I rarely see dropout used anymore, because it is simply not necessary.

    2. A small amount of weight decay.

    3. Some more recent regularization techniques include Shake-Shake ("Shake-Shake regularization" by Xavier Gastaldi) and Cutout ("Improved Regularization of Convolutional Neural Networks with Cutout" by Terrance DeVries and Graham W. Taylor). In particular, the ease with which Cutout can be implemented makes it very attractive. I believe these work better than dropout, but I'm not sure.

    4. If possible, prefer fully convolutional architectures to architectures with fully connected layers. Compare VGG-16, which has about 100 million parameters in a single fully connected layer (its first one maps 7 × 7 × 512 = 25,088 inputs to 4,096 units), to ResNet-152, which has roughly 10 times the number of layers and still fewer parameters.

    5. Prefer SGD to adaptive optimizers such as RMSprop and Adam. It has been reported to generalize better ("Improving Generalization Performance by Switching from Adam to SGD" by Nitish Shirish Keskar and Richard Socher).
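
To illustrate why Cutout (item 3) is considered easy to implement, here is a rough sketch of the idea: mask one randomly placed square patch of each training image. It assumes a single-channel image stored as a list of rows; the function name, the `fill` value and the handling of patches that extend past the border (they are simply clipped) are choices made for this illustration rather than details taken from the paper.

```python
import random


def cutout(image, size, rng, fill=0):
    """Return a copy of a 2-D image with one random size x size patch masked."""
    height, width = len(image), len(image[0])
    # Pick the patch centre anywhere in the image; clip the patch at the borders.
    cy, cx = rng.randrange(height), rng.randrange(width)
    y0, x0 = cy - size // 2, cx - size // 2
    out = [row[:] for row in image]
    for r in range(max(0, y0), min(height, y0 + size)):
        for c in range(max(0, x0), min(width, x0 + size)):
            out[r][c] = fill
    return out


image = [[1] * 8 for _ in range(8)]
masked = cutout(image, size=3, rng=random.Random(0))
num_masked = sum(row.count(0) for row in masked)  # between 1 and size * size
```

Like other augmentations, this would be applied only to training images, with a fresh random patch each time an image is drawn.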





First answer ("Why is your model not generalizing properly?"): answered Sep 7 at 11:07 by Djib2011, edited Sep 7 at 13:48






















              up vote
              1
              down vote













              A list of commonly used regularization techniques which I've seen in the literature are:



              1. Using batch normalization, which is a surprisingly effective regularizer to the point where I rarely see dropout used anymore, because it is simply not necessary.

              2. A small amount of weight decay.

              3. Some more recent regularization techniques include Shake-shake ("Shake-Shake regularization" by Xavier Gastaldi) and Cutout (
                "Improved Regularization of Convolutional Neural Networks with Cutout" by
                Terrance DeVries and Graham W. Taylor). In particular, the ease with which Cutout can be implemented makes it very attractive. I believe these work better than dropout -- but I'm not sure.

              4. If possible, prefer fully convolutional architectures to architectures with fully connected layers. Compare VGG-16, which has 100 million parameters in a single fully connected layer, to Resnet-152, which has 10 times the number of layers and still fewer parameters.

              5. Prefer SGD to other optimizers such as Rmsprop and Adam. It has been shown to generalize better. ("Improving Generalization Performance by Switching from Adam to SGD" by Nitish Shirish Keskar and Richard Socher)













                  edited Sep 10 at 13:56









                  Sycorax

                  34.1k588158














                  answered Sep 9 at 1:52









                  shimao

                  6,11811125































                       













































































