


Why are second-order SGD convergence methods unpopular for deep learning?

















It seems that, especially for deep learning, very simple methods for improving SGD convergence, such as ADAM, dominate; nice overview: http://ruder.io/optimizing-gradient-descent/



These methods trace only a single direction, discarding the information about the remaining ones, and they do not try to estimate the distance to a nearby extremum, even though it is suggested by the evolution of the gradient ($\rightarrow 0$ at an extremum) and could help with the crucial choice of step size.
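
For concreteness, here is a minimal sketch (my own illustration, not an established optimizer) of the kind of distance estimate meant here: along a fixed direction, the directional derivative of a locally parabolic loss decays linearly to 0 at the extremum, so two consecutive gradients give a secant estimate of the curvature and hence of the remaining distance, which could then set the step size.

    import numpy as np

    # Secant estimate of the distance to the 1D extremum along direction d,
    # using two consecutive parameter vectors and their gradients.
    # Assumes a roughly parabolic loss along d and a nonzero displacement.
    def distance_to_extremum(theta_prev, theta, grad_prev, grad, d):
        d = d / np.linalg.norm(d)
        g_prev, g = grad_prev @ d, grad @ d    # directional derivatives
        ds = (theta - theta_prev) @ d          # displacement along d
        curvature = (g - g_prev) / ds          # secant estimate of the 2nd derivative
        return -g / curvature                  # signed distance to the modelled extremum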



Both of these missed opportunities could be exploited by second-order methods, which try to locally model the loss as a parabola in multiple directions simultaneously (not all, just a few), e.g. near a saddle: attracting in some directions, repulsing in the others. Here are some:




  • L-BFGS: http://aria42.com/blog/2014/12/understanding-lbfgs

  • TONGA: https://papers.nips.cc/paper/3234-topmoumoute-online-natural-gradient-algorithm

  • K-FAC: https://arxiv.org/pdf/1503.05671.pdf

  • my second-order local parametrization: https://arxiv.org/pdf/1901.11457


But first-order methods still dominate (?). I have heard opinions that second-order methods just don't work for deep learning (?).



There are mainly three challenges (any more?): inverting the Hessian, the stochasticity of gradients, and handling saddles. All of them should be resolved by the approach I would like to use: locally model the loss as parabolas in a few promising directions, update this parametrization based on the calculated gradients, and perform a proper step based on this parametrization. This way the extrema are directly part of the updated parameters, so no Hessian inversion is needed; the slow evolution of the parametrization allows statistical trends to be accumulated from the gradients; and we can model both signs of curvature near saddles, correspondingly attracting or repulsing, with strength depending on the modelled distance.
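
One possible reading of this scheme, sketched below purely as my own interpretation (orthonormal modelled directions, a secant curvature estimate per direction, the saddle-free |lambda| trick for attract/repulse, and plain SGD in the unmodelled subspace); it is not the parametrization of the linked paper.

    import numpy as np

    # Hypothetical sketch: model the loss as a parabola along a few directions.
    # theta, grad: 1-D numpy arrays; dirs: list of orthonormal 1-D arrays;
    # state: dict keeping the previous (position, slope) per direction.
    def few_direction_step(theta, grad, dirs, state, lr=0.01, eps=1e-8, max_step=0.5):
        step = np.zeros_like(theta)
        modelled = np.zeros_like(theta)
        for i, d in enumerate(dirs):
            s, g = theta @ d, grad @ d                 # position and slope along d
            if i in state:
                s_old, g_old = state[i]
                ds = s - s_old
                lam = (g - g_old) / ds if abs(ds) > eps else 0.0   # curvature estimate
            else:
                lam = 0.0
            state[i] = (s, g)
            if abs(lam) > eps:
                move = -g / abs(lam)   # |lam|: step toward minima, away from maxima (saddles)
            else:
                move = -lr * g         # no curvature information yet: plain SGD along d
            step += float(np.clip(move, -max_step, max_step)) * d
            modelled += g * d
        step += -lr * (grad - modelled)                # plain SGD in the remaining directions
        return theta + step

In practice the (position, slope) pairs would be slowly updated averages rather than single noisy samples, which is where the "accumulate statistical trends from gradients" part of the proposal comes in.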



Should we go toward second-order methods for deep learning?

Why is it so difficult to make them more successful than simple first-order methods? Can we identify these challenges and resolve them?

As there are many ways to realize second-order methods, which seems the most promising?










deep-learning optimization convergence gradient-descent sgd

asked by Jarek Duda

1 Answer




















Should we go toward second-order methods for deep learning?




TL;DR: No, especially not now, when the pace of innovation is slowing down and we're seeing fewer new architectural innovations and more ways to train what are basically just copies of existing architectures on larger datasets (see OpenAI's GPT-2).



First, without even getting to second order, it's worth mentioning that in Deep Learning you don't even use (mini-batch) gradient descent to the fullest of its potential (i.e., you don't perform a line search), because the line-search sub-problem at each step would be very costly.
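
To make that cost concrete, here is a minimal sketch (my illustration, with hypothetical loss_fn and params placeholders) of a standard backtracking (Armijo) line search. Every trial step requires an extra evaluation of the loss on the mini-batch, i.e., an extra forward pass, and on a stochastic loss even the accepted value is noisy; this is the overhead that fixed learning-rate schedules avoid.

    # Backtracking (Armijo) line search sketch. loss_fn(params) -> scalar loss on the
    # current mini-batch; params and grad are lists of numpy arrays (hypothetical setup).
    def backtracking_step(loss_fn, params, grad, lr0=1.0, shrink=0.5, c=1e-4, max_tries=10):
        f0 = loss_fn(params)                               # one forward pass
        g_sq = sum(float((g * g).sum()) for g in grad)     # squared gradient norm
        lr = lr0
        for _ in range(max_tries):
            trial = [p - lr * g for p, g in zip(params, grad)]
            if loss_fn(trial) <= f0 - c * lr * g_sq:       # one more forward pass per trial
                return trial, lr                           # sufficient decrease: accept
            lr *= shrink                                   # otherwise shrink and retry
        return params, 0.0                                 # give up: no step taken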



Second, second-order methods are:

• way more complex, i.e., harder to implement without bugs. DL systems are increasingly becoming a small part of huge data-processing pipelines. Introducing further complexity and brittleness into a complex system is only wise if the gains largely offset the risks. I'll argue below that they don't.

• harder to optimize for distributed computing on heterogeneous hardware, which is becoming more and more common. See how much work was required to make K-FAC work on distributed (non-heterogeneous) systems, and performance is still no better than that of the best first-order methods: https://arxiv.org/pdf/1811.12019.pdf. If just switching to distributed computing makes my first-order method as fast as, or faster than, second-order methods, I don't see a reason to use a more complicated optimization algorithm.

• way more expensive in terms of iteration cost (not count) and memory occupation, so they introduce considerable overhead. Current architectures (GPUs) are more memory-bound than compute-bound. As explained very well here, the increase in iteration cost and memory occupation is steeper the more high-dimensional the problem is. Optimization in Deep Learning is arguably one of the most high-dimensional optimization problems, so it's not clear that second-order methods would have a clear advantage in terms of computational time (not iteration count, which is not what we really care about) with respect to first-order methods.

• another issue with Deep Learning optimization is saddle points. It's becoming abundantly clear that "bad" local minima are not an issue in Deep Learning, but saddle points are. Newton's method has a tendency to be attracted to saddle points (see the toy sketch after this list). If I remember correctly, Hessian-approximating methods such as K-FAC don't have this issue, but I think the proof depends on the type of architecture, which makes the use of such methods brittle.

• they don't fix the problems which make practitioners waste most of their time. Dead or saturated units are not solved by K-FAC, but by better initialization schemes, so that's what we should focus on, e.g., Fixup: https://arxiv.org/abs/1901.09321

• finally, they have a lot of moving parts, which are difficult to tune well. Your own paper leaves a lot of details to be specified. Do we need even more hyperparameters, or do we need robust optimization methods? Keep in mind that in Deep Learning, as explained very well by Shai Shalev-Shwartz (https://www.youtube.com/watch?v=1nvf_DBnsxo), when something goes wrong it's very difficult to understand how to fix it, and more hyperparameters don't help in that respect.
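
As a toy illustration of the saddle-point item above (my own example, not from the original answer): on f(x, y) = x^2 - y^2 the only stationary point is the saddle at the origin, and a pure Newton step jumps onto it exactly from any starting point, because the Newton step solves grad f = 0 regardless of the sign of the curvature. Saddle-free variants address this by replacing the Hessian eigenvalues with their absolute values.

    import numpy as np

    # Toy demo: a pure Newton step on f(x, y) = x^2 - y^2 lands exactly on the
    # saddle point (0, 0), illustrating why plain Newton is attracted to saddles.
    def newton_step(p):
        grad = np.array([2.0 * p[0], -2.0 * p[1]])
        hess = np.array([[2.0, 0.0], [0.0, -2.0]])
        return p - np.linalg.solve(hess, grad)

    p = np.array([3.0, -1.5])
    print(newton_step(p))   # -> [0. 0.], the saddle, not a minimum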






answered by DeltaIV









You are wise... (+1) More seriously, the point you raise about memory bottlenecks on GPUs is substantial and often overlooked. NVIDIA realised this issue early, so it quickly shipped cuBLAS and then (as DL gained traction) cuDNN, precisely so that best practice for using GPU resources would be available. The early days of pure CUDA (and OpenCL 1.x) were marred with such problems; they have been partially hidden from the user's attention by the aforementioned tools, but their shadow remains.
– usεr11852










Thanks for the materials, I will have to read them. However, there is a large spectrum of second-order methods, starting from considering a single direction but additionally trying to estimate the distance to the extremum in that direction in order to choose the step size. Considering two directions simultaneously wouldn't add much cost, but could allow handling the problematic saddles you mentioned: attract in one direction, repulse in the other. K-FAC uses relatively huge blocks and sits at the opposite end of this spectrum, close to the full Hessian; maybe the low-cost end deserves some more exploration, with small steps up from first order?
– Jarek Duda










