Why is gradient the direction of steepest ascent?

The name of the pictureThe name of the pictureThe name of the pictureClash Royale CLAN TAG#URR8PPP











up vote
44
down vote

favorite
32












$$f(x_1,x_2,...x_n):mathbbR^n rightarrow mathbbR$$
The definition of the gradient is
$$ fracpartial fpartial x_1hate_1 + ... +fracpartial fpartial x_nhate_n$$



which is a vector.



Reading this definition makes me consider that each component of the gradient corresponds to the rate of change with respect to my objective function if I go along with the direction $hate_i$.



But I can't see why this vector (defined by the definition of the gradient) has anything to do with the steepest descent.



Why do I get maximal value again if I move along with the direction of gradient?







share|cite|improve this question


























    up vote
    44
    down vote

    favorite
    32












    $$f(x_1,x_2,...x_n):mathbbR^n rightarrow mathbbR$$
    The definition of the gradient is
    $$ fracpartial fpartial x_1hate_1 + ... +fracpartial fpartial x_nhate_n$$



    which is a vector.



    Reading this definition makes me consider that each component of the gradient corresponds to the rate of change with respect to my objective function if I go along with the direction $hate_i$.



    But I can't see why this vector (defined by the definition of the gradient) has anything to do with the steepest descent.



    Why do I get maximal value again if I move along with the direction of gradient?







    share|cite|improve this question
























      up vote
      44
      down vote

      favorite
      32









      up vote
      44
      down vote

      favorite
      32






      32





      $$f(x_1,x_2,...x_n):mathbbR^n rightarrow mathbbR$$
      The definition of the gradient is
      $$ fracpartial fpartial x_1hate_1 + ... +fracpartial fpartial x_nhate_n$$



      which is a vector.



      Reading this definition makes me consider that each component of the gradient corresponds to the rate of change with respect to my objective function if I go along with the direction $hate_i$.



      But I can't see why this vector (defined by the definition of the gradient) has anything to do with the steepest descent.



      Why do I get maximal value again if I move along with the direction of gradient?







      share|cite|improve this question














      $$f(x_1,x_2,...x_n):mathbbR^n rightarrow mathbbR$$
      The definition of the gradient is
      $$ fracpartial fpartial x_1hate_1 + ... +fracpartial fpartial x_nhate_n$$



      which is a vector.



      Reading this definition makes me consider that each component of the gradient corresponds to the rate of change with respect to my objective function if I go along with the direction $hate_i$.



      But I can't see why this vector (defined by the definition of the gradient) has anything to do with the steepest descent.



      Why do I get maximal value again if I move along with the direction of gradient?









      share|cite|improve this question













      share|cite|improve this question




      share|cite|improve this question








      edited Jul 18 '17 at 19:12









      SRS

      544322




      544322










      asked Oct 29 '12 at 3:55









      Jing

      5591820




      5591820




















          8 Answers
          8






          active

          oldest

          votes

















          up vote
          64
          down vote



          accepted










          Each component of the gradient tells you how fast your function is changing with respect to the standard basis. It's not too far-fetched then to wonder, how fast the function might be changing with respect to some arbitrary direction? Letting $vec v$ denote a unit vector, we can project along this direction in the natural way, namely via the dot product $textgrad( f(a))cdot vec v$. This is a fairly common definition of the directional derivative.



          We can then ask in what direction is this quantity maximal? You'll recall that $$textgrad( f(a))cdot vec v = |textgrad( f(a))|| vec v|textcos(theta)$$



          Since $vec v$ is unit, we have $|textgrad( f)|textcos(theta)$, which is maximal when $cos(theta)=1$, in particular when $vec v$ points in the same direction as $textgrad(f(a))$.






          share|cite|improve this answer


















          • 4




            I was wondering, but how to you that grad(f(a)) gives the steepest change? How do you know there is not other vector that moving in its direction might lead to a steeper change?
            – Pinocchio
            Feb 25 '14 at 1:51






          • 3




            so we multiply the gradient with an arbitrary vector and then say that the product is maximum when the vector points in the same direction as the gradient? How does that answer the question?
            – novice
            Sep 3 '16 at 0:30










          • @novice It answers the question since the scalar product EQUALS the rate of change of $f$ along the direction of the unit vector. So, in maximizing the product, the rate of increase in $f$ is likewise maximized.
            – Mark Viola
            Sep 3 '16 at 1:16







          • 1




            @Dr.MV - sorry for coming back, but isnt the quantity being maximized rate of change of f ALONG the unit vector and not the rate of increase of f?
            – novice
            Sep 3 '16 at 4:26










          • is this proof only true in 3d-dimension?
            – hqt
            Oct 10 '17 at 9:49


















          up vote
          19
          down vote













          Consider a Taylor expansion of this function,
          $$f(bf r+bfdelta r)=f(bf r)+(nabla f)cdotbfdelta r+ldots$$
          The linear correction term $(nabla f)cdotbfdelta r$ is maximized when $bfdelta r$ is in the direction of $nabla f$.






          share|cite|improve this answer
















          • 1




            This is the best answer to me. Thanks!
            – Dagang
            Apr 10 at 3:06

















          up vote
          17
          down vote













          Other answers are correct in using the directional derivative to show that the gradient is the direction of steepest ascent/descent. However, I think it is instructive to look at the definition of the directional derivative from first principles to understand why this is so (it is not arbitrarily defined to be the dot product of the gradient and the directional vector).



          Let $f(mathbfx):mathbbR^n rightarrow mathbbR$. The partial derivatives of $f$ are the rates of change along the basis vectors of $mathbfx$:



          $textrmrate of change along mathbfe_i = lim_hrightarrow 0 fracf(mathbfx + hmathbfe_i)- f(mathbfx)h = fracpartial fpartial x_i$



          Each partial derivative is a scalar. It is simply a rate of change.



          The gradient of $f$ is then defined as the vector:



          $nabla f = sum_i fracpartial fpartial x_i mathbfe_i$



          We can naturally extend the concept of the rate of change along a basis vector to a (unit) vector pointing in an arbitrary direction. Let $mathbfv$ be such a vector, i.e., $mathbfv = sum_i alpha_i mathbfe_i$ where $sum_i alpha_i^2 = 1$. Then:



          $textrmrate of change along mathbfv = lim_hrightarrow 0 fracf(mathbfx + hmathbfv) - f(mathbfx)h$



          Again, this quantity is a scalar.



          Now, it can be proven that if $f$ is differentiable at $mathbfx$, the limit above evaluates to: $(nabla f) cdot mathbfv$. This is a dot product of two vectors, which returns a scalar.



          We know from linear algebra that the dot product is maximized when the two vectors point in the same direction. This means that the rate of change along an arbitrary vector $mathbfv$ is maximized when $mathbfv$ points in the same direction as the gradient. In other words, the gradient corresponds to the rate of steepest ascent/descent.






          share|cite|improve this answer




















          • Are you missing a square root around $sum_i alpha_i^2$? The way I understand it, what you're trying to say is that the magnitude of $v$ should be 1, which only happens if the square root of the sum of alphas squared is 1 (I could easily be completely wrong, my apologies if that's the case). I like your explanation overall regardless.
            – jeremy radcliff
            Sep 24 '16 at 23:08







          • 2




            @jeremyradcliff Yes exactly, I'm saying the magnitude should be 1. I left out the square root precisely because $1^2 =1$.
            – MGA
            Sep 26 '16 at 16:23











          • Thank you for following up; that makes sense, I should've realized.
            – jeremy radcliff
            Sep 26 '16 at 23:47

















          up vote
          16
          down vote













          The question you're asking can be rephrased as "In which direction is the directional derivative $nabla_hatuf$ a maximum?".



          Assuming differentiability, $nabla_hatuf$ can be written as:



          $$nabla_hatuf = nabla f(textbfx) cdot hatu =|nabla f(textbfx)||hatu|cos theta = |nabla f(textbfx)|cos theta$$



          which is a maximum when $theta =0$: when $nabla f(textbfx)$ and $hatu$ are parallel.






          share|cite|improve this answer



























            up vote
            5
            down vote













            Each component of the derivative
            $$
            fracpartial fpartial x_1 ... fracpartial fpartial x_n$$
            tells you how fast your function is changing with respect to the standard basis.
            It's now possible to make a basetransformation to an orthogonal base with $ n-1 $ base Directions with $0$ ascent and the gradient direction. In such a base the gradient direction must be the steepest since any adding of other base directions adds length but no ascent.



            For a 3 dimensional Vector space the base could look like this
            $$ left( left( beginmatrix partial x_2 \ -partial x_1 \ 0 endmatrix right) left( beginmatrix partial x_1 \ partial x_2 \ -dfrac(partial x_1)²+(partial x_2)²partial x_3 endmatrix right) left( beginmatrix partial x_1 \ partial x_2 \ partial x_3 endmatrix right) right) $$ By complete induction it can now be shown that such a base is constructable for an n-Dimensional Vector space.
            $$ left( left( beginmatrix partial x_2 \ -partial x_1 \ 0 \ 0 endmatrix right) left( beginmatrix colorbluepartial x_1 \ partial x_2 \ -dfrac(partial x_1)²+(partial x_2)²partial x_3 \ 0 endmatrix right) left( beginmatrix colorbluepartial x_1 \ partial x_2 \ colorgreenpartial x_3 \ -dfrac(partial x_1)²+(partial x_2)²+(partial x_3)²partial x_4 endmatrix right) left(beginmatrix colorbluepartial x_1 \ partial x_2 \ colorgreenpartial x_3 \ colororangepartial x_4 endmatrix right) right) $$ One can see here that the first Basevector demands the first 2 Elements of the following Basevectors to be $partial x_1$ & $partial x_2$ because of the orthogonal condition,
            similarly the 2nd vector demands all the 3rd elements of the following vectors to be $partial x_3$
            as does the 3rd vector for the 4th element them being $partial x_4$.



            If another dimension is added the n+1 Element of the n$th$ Vector needs to be $$-dfrac(partial x_1)²+...+(partial x_n)²partial x_n+1$$ to meet the $0$ ascension condition which in turn forces the new n+1$th$ Vector to be of the form $$left(beginmatrixpartial x_1 \ ... \ partial x_n+1endmatrixright)$$ for it to be orthogonal to the rest.






            share|cite|improve this answer





























              up vote
              1
              down vote













              Let $vec v$ be an arbitrary unit vector. Then the change of $f$ by moving in the direction of $v$, starting in point $a$, is given by $grad( f(a)) cdot vec v$. We want to find a $vec v$ for which this inner product is maximal.
              For the inner product we have the Cauchy–Schwarz inequality $vec a cdot vec b leq |vec a||vec b|$.
              Now the equality holds when $vec v = lambda ; grad(f(a))$, for some $lambda in mathbbR$.






              share|cite|improve this answer



























                up vote
                1
                down vote













                Let $v=fracss$ be a unit vector and assume that $v$ is a descent direction, i.e. $v^Tnabla f(x) <0$. Then $f(x+lambda v)$ as a function of $lambda$, describes how this function changes along the direction $v$.



                The rate of descent at $x$ along $v$ is given by:
                $$ fracdd lambdaf(x+lambda v)|_lambda=0 = v^T nabla f(x) =fracs^Tsnabla f(x) equiv fracs^Tsg$$
                So we want to find the maximum of this quantity as a function of $s$.
                Differentiating the above wrt $s$ and setting it equal to zero, we get (noting that $nabla_s|s| =fracss$): $g=(g^T v)vequiv av$.



                Taking the Euclidean norm: $|g|=|a||v|=|a| Rightarrow a=pm|g|$.



                We choose the minus sign to satisfy that $v$ is descent. Hence the direction of the steepest descent is
                $$ v= dfrac1ag = -dfracgg$$






                share|cite|improve this answer



























                  up vote
                  1
                  down vote













                  Just want to further clarify why the gradient provides the steepest ascent (instead of descent) here. Any differentiable $f$ can be approximated by the linear tangent plane, i.e., $$f(mathbfx + h mathbfv) = f(mathbfx) + h , nabla f(mathbfx)^T mathbfv $$ as $h rightarrow 0$ for any unit-length direction $mathbfv$ with $parallel mathbfv parallel =1.$ As $h downarrow 0$, consider the amount of change
                  $$
                  f(mathbfx + h mathbfv) - f(mathbfx) = h , left , nabla f(mathbfx)^T mathbfv right
                  ~~in~~ left[ - h , parallel nabla f(mathbfx) parallel, ~ h , parallel nabla f(mathbfx) parallel right]
                  $$
                  by Cauchy-Swcharz inequality, which reaches its maximum (increase) $(h , parallel nabla f(mathbfx) parallel)$ when $mathbfv = nabla f(mathbfx) / parallel nabla f(mathbfx) parallel$ and its minimum (i.e., maximum decrease) $ (-h , parallel nabla f(mathbfx) parallel) $ if $ mathbfv= - nabla f(mathbfx)/parallel nabla f(mathbfx) parallel$ (the negative gradient direction).






                  share|cite|improve this answer






















                    Your Answer




                    StackExchange.ifUsing("editor", function ()
                    return StackExchange.using("mathjaxEditing", function ()
                    StackExchange.MarkdownEditor.creationCallbacks.add(function (editor, postfix)
                    StackExchange.mathjaxEditing.prepareWmdForMathJax(editor, postfix, [["$", "$"], ["\\(","\\)"]]);
                    );
                    );
                    , "mathjax-editing");

                    StackExchange.ready(function()
                    var channelOptions =
                    tags: "".split(" "),
                    id: "69"
                    ;
                    initTagRenderer("".split(" "), "".split(" "), channelOptions);

                    StackExchange.using("externalEditor", function()
                    // Have to fire editor after snippets, if snippets enabled
                    if (StackExchange.settings.snippets.snippetsEnabled)
                    StackExchange.using("snippets", function()
                    createEditor();
                    );

                    else
                    createEditor();

                    );

                    function createEditor()
                    StackExchange.prepareEditor(
                    heartbeatType: 'answer',
                    convertImagesToLinks: true,
                    noModals: false,
                    showLowRepImageUploadWarning: true,
                    reputationToPostImages: 10,
                    bindNavPrevention: true,
                    postfix: "",
                    noCode: true, onDemand: true,
                    discardSelector: ".discard-answer"
                    ,immediatelyShowMarkdownHelp:true
                    );



                    );








                     

                    draft saved


                    draft discarded


















                    StackExchange.ready(
                    function ()
                    StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fmath.stackexchange.com%2fquestions%2f223252%2fwhy-is-gradient-the-direction-of-steepest-ascent%23new-answer', 'question_page');

                    );

                    Post as a guest






























                    8 Answers
                    8






                    active

                    oldest

                    votes








                    8 Answers
                    8






                    active

                    oldest

                    votes









                    active

                    oldest

                    votes






                    active

                    oldest

                    votes








                    up vote
                    64
                    down vote



                    accepted










                    Each component of the gradient tells you how fast your function is changing with respect to the standard basis. It's not too far-fetched then to wonder, how fast the function might be changing with respect to some arbitrary direction? Letting $vec v$ denote a unit vector, we can project along this direction in the natural way, namely via the dot product $textgrad( f(a))cdot vec v$. This is a fairly common definition of the directional derivative.



                    We can then ask in what direction is this quantity maximal? You'll recall that $$textgrad( f(a))cdot vec v = |textgrad( f(a))|| vec v|textcos(theta)$$



                    Since $vec v$ is unit, we have $|textgrad( f)|textcos(theta)$, which is maximal when $cos(theta)=1$, in particular when $vec v$ points in the same direction as $textgrad(f(a))$.






                    share|cite|improve this answer


















                    • 4




                      I was wondering, but how to you that grad(f(a)) gives the steepest change? How do you know there is not other vector that moving in its direction might lead to a steeper change?
                      – Pinocchio
                      Feb 25 '14 at 1:51






                    • 3




                      so we multiply the gradient with an arbitrary vector and then say that the product is maximum when the vector points in the same direction as the gradient? How does that answer the question?
                      – novice
                      Sep 3 '16 at 0:30










                    • @novice It answers the question since the scalar product EQUALS the rate of change of $f$ along the direction of the unit vector. So, in maximizing the product, the rate of increase in $f$ is likewise maximized.
                      – Mark Viola
                      Sep 3 '16 at 1:16







                    • 1




                      @Dr.MV - sorry for coming back, but isnt the quantity being maximized rate of change of f ALONG the unit vector and not the rate of increase of f?
                      – novice
                      Sep 3 '16 at 4:26










                    • is this proof only true in 3d-dimension?
                      – hqt
                      Oct 10 '17 at 9:49















                    up vote
                    64
                    down vote



                    accepted










                    Each component of the gradient tells you how fast your function is changing with respect to the standard basis. It's not too far-fetched then to wonder, how fast the function might be changing with respect to some arbitrary direction? Letting $vec v$ denote a unit vector, we can project along this direction in the natural way, namely via the dot product $textgrad( f(a))cdot vec v$. This is a fairly common definition of the directional derivative.



                    We can then ask in what direction is this quantity maximal? You'll recall that $$textgrad( f(a))cdot vec v = |textgrad( f(a))|| vec v|textcos(theta)$$



                    Since $vec v$ is unit, we have $|textgrad( f)|textcos(theta)$, which is maximal when $cos(theta)=1$, in particular when $vec v$ points in the same direction as $textgrad(f(a))$.






                    share|cite|improve this answer


















                    • 4




                      I was wondering, but how to you that grad(f(a)) gives the steepest change? How do you know there is not other vector that moving in its direction might lead to a steeper change?
                      – Pinocchio
                      Feb 25 '14 at 1:51






                    • 3




                      so we multiply the gradient with an arbitrary vector and then say that the product is maximum when the vector points in the same direction as the gradient? How does that answer the question?
                      – novice
                      Sep 3 '16 at 0:30










                    • @novice It answers the question since the scalar product EQUALS the rate of change of $f$ along the direction of the unit vector. So, in maximizing the product, the rate of increase in $f$ is likewise maximized.
                      – Mark Viola
                      Sep 3 '16 at 1:16







                    • 1




                      @Dr.MV - sorry for coming back, but isnt the quantity being maximized rate of change of f ALONG the unit vector and not the rate of increase of f?
                      – novice
                      Sep 3 '16 at 4:26










                    • is this proof only true in 3d-dimension?
                      – hqt
                      Oct 10 '17 at 9:49













                    up vote
                    64
                    down vote



                    accepted







                    up vote
                    64
                    down vote



                    accepted






                    Each component of the gradient tells you how fast your function is changing with respect to the standard basis. It's not too far-fetched then to wonder, how fast the function might be changing with respect to some arbitrary direction? Letting $vec v$ denote a unit vector, we can project along this direction in the natural way, namely via the dot product $textgrad( f(a))cdot vec v$. This is a fairly common definition of the directional derivative.



                    We can then ask in what direction is this quantity maximal? You'll recall that $$textgrad( f(a))cdot vec v = |textgrad( f(a))|| vec v|textcos(theta)$$



                    Since $vec v$ is unit, we have $|textgrad( f)|textcos(theta)$, which is maximal when $cos(theta)=1$, in particular when $vec v$ points in the same direction as $textgrad(f(a))$.






                    share|cite|improve this answer














                    Each component of the gradient tells you how fast your function is changing with respect to the standard basis. It's not too far-fetched then to wonder, how fast the function might be changing with respect to some arbitrary direction? Letting $vec v$ denote a unit vector, we can project along this direction in the natural way, namely via the dot product $textgrad( f(a))cdot vec v$. This is a fairly common definition of the directional derivative.



                    We can then ask in what direction is this quantity maximal? You'll recall that $$textgrad( f(a))cdot vec v = |textgrad( f(a))|| vec v|textcos(theta)$$



                    Since $vec v$ is unit, we have $|textgrad( f)|textcos(theta)$, which is maximal when $cos(theta)=1$, in particular when $vec v$ points in the same direction as $textgrad(f(a))$.







                    share|cite|improve this answer














                    share|cite|improve this answer



                    share|cite|improve this answer








                    edited Dec 24 '17 at 2:45

























                    answered Oct 29 '12 at 4:16









                    AsinglePANCAKE

                    1,8561429




                    1,8561429







                    • 4




                      I was wondering, but how to you that grad(f(a)) gives the steepest change? How do you know there is not other vector that moving in its direction might lead to a steeper change?
                      – Pinocchio
                      Feb 25 '14 at 1:51






                    • 3




                      so we multiply the gradient with an arbitrary vector and then say that the product is maximum when the vector points in the same direction as the gradient? How does that answer the question?
                      – novice
                      Sep 3 '16 at 0:30










                    • @novice It answers the question since the scalar product EQUALS the rate of change of $f$ along the direction of the unit vector. So, in maximizing the product, the rate of increase in $f$ is likewise maximized.
                      – Mark Viola
                      Sep 3 '16 at 1:16







                    • 1




                      @Dr.MV - sorry for coming back, but isnt the quantity being maximized rate of change of f ALONG the unit vector and not the rate of increase of f?
                      – novice
                      Sep 3 '16 at 4:26










                    • is this proof only true in 3d-dimension?
                      – hqt
                      Oct 10 '17 at 9:49













                    • 4




                      I was wondering, but how to you that grad(f(a)) gives the steepest change? How do you know there is not other vector that moving in its direction might lead to a steeper change?
                      – Pinocchio
                      Feb 25 '14 at 1:51






                    • 3




                      so we multiply the gradient with an arbitrary vector and then say that the product is maximum when the vector points in the same direction as the gradient? How does that answer the question?
                      – novice
                      Sep 3 '16 at 0:30










                    • @novice It answers the question since the scalar product EQUALS the rate of change of $f$ along the direction of the unit vector. So, in maximizing the product, the rate of increase in $f$ is likewise maximized.
                      – Mark Viola
                      Sep 3 '16 at 1:16







                    • 1




                      @Dr.MV - sorry for coming back, but isnt the quantity being maximized rate of change of f ALONG the unit vector and not the rate of increase of f?
                      – novice
                      Sep 3 '16 at 4:26










                    • is this proof only true in 3d-dimension?
                      – hqt
                      Oct 10 '17 at 9:49








                    4




                    4




                    I was wondering, but how to you that grad(f(a)) gives the steepest change? How do you know there is not other vector that moving in its direction might lead to a steeper change?
                    – Pinocchio
                    Feb 25 '14 at 1:51




                    I was wondering, but how to you that grad(f(a)) gives the steepest change? How do you know there is not other vector that moving in its direction might lead to a steeper change?
                    – Pinocchio
                    Feb 25 '14 at 1:51




                    3




                    3




                    so we multiply the gradient with an arbitrary vector and then say that the product is maximum when the vector points in the same direction as the gradient? How does that answer the question?
                    – novice
                    Sep 3 '16 at 0:30




                    so we multiply the gradient with an arbitrary vector and then say that the product is maximum when the vector points in the same direction as the gradient? How does that answer the question?
                    – novice
                    Sep 3 '16 at 0:30












                    @novice It answers the question since the scalar product EQUALS the rate of change of $f$ along the direction of the unit vector. So, in maximizing the product, the rate of increase in $f$ is likewise maximized.
                    – Mark Viola
                    Sep 3 '16 at 1:16





                    @novice It answers the question since the scalar product EQUALS the rate of change of $f$ along the direction of the unit vector. So, in maximizing the product, the rate of increase in $f$ is likewise maximized.
                    – Mark Viola
                    Sep 3 '16 at 1:16





                    1




                    1




                    @Dr.MV - sorry for coming back, but isnt the quantity being maximized rate of change of f ALONG the unit vector and not the rate of increase of f?
                    – novice
                    Sep 3 '16 at 4:26




                    @Dr.MV - sorry for coming back, but isnt the quantity being maximized rate of change of f ALONG the unit vector and not the rate of increase of f?
                    – novice
                    Sep 3 '16 at 4:26












                    is this proof only true in 3d-dimension?
                    – hqt
                    Oct 10 '17 at 9:49





                    is this proof only true in 3d-dimension?
                    – hqt
                    Oct 10 '17 at 9:49











                    up vote
                    19
                    down vote













                    Consider a Taylor expansion of this function,
                    $$f(bf r+bfdelta r)=f(bf r)+(nabla f)cdotbfdelta r+ldots$$
                    The linear correction term $(nabla f)cdotbfdelta r$ is maximized when $bfdelta r$ is in the direction of $nabla f$.






                    share|cite|improve this answer
















                    • 1




                      This is the best answer to me. Thanks!
                      – Dagang
                      Apr 10 at 3:06














                    up vote
                    19
                    down vote













                    Consider a Taylor expansion of this function,
                    $$f(bf r+bfdelta r)=f(bf r)+(nabla f)cdotbfdelta r+ldots$$
                    The linear correction term $(nabla f)cdotbfdelta r$ is maximized when $bfdelta r$ is in the direction of $nabla f$.






                    share|cite|improve this answer
















                    • 1




                      This is the best answer to me. Thanks!
                      – Dagang
                      Apr 10 at 3:06












                    up vote
                    19
                    down vote










                    up vote
                    19
                    down vote









                    Consider a Taylor expansion of this function,
                    $$f(bf r+bfdelta r)=f(bf r)+(nabla f)cdotbfdelta r+ldots$$
                    The linear correction term $(nabla f)cdotbfdelta r$ is maximized when $bfdelta r$ is in the direction of $nabla f$.






                    share|cite|improve this answer












                    Consider a Taylor expansion of this function,
                    $$f(bf r+bfdelta r)=f(bf r)+(nabla f)cdotbfdelta r+ldots$$
                    The linear correction term $(nabla f)cdotbfdelta r$ is maximized when $bfdelta r$ is in the direction of $nabla f$.







                    share|cite|improve this answer












                    share|cite|improve this answer



                    share|cite|improve this answer










                    answered Oct 29 '12 at 4:12









                    Jonathan

                    6,56711432




                    6,56711432







                    • 1




                      This is the best answer to me. Thanks!
                      – Dagang
                      Apr 10 at 3:06












                    • 1




                      This is the best answer to me. Thanks!
                      – Dagang
                      Apr 10 at 3:06







                    1




                    1




                    This is the best answer to me. Thanks!
                    – Dagang
                    Apr 10 at 3:06




                    This is the best answer to me. Thanks!
                    – Dagang
                    Apr 10 at 3:06










                    up vote
                    17
                    down vote













                    Other answers are correct in using the directional derivative to show that the gradient is the direction of steepest ascent/descent. However, I think it is instructive to look at the definition of the directional derivative from first principles to understand why this is so (it is not arbitrarily defined to be the dot product of the gradient and the directional vector).



                    Let $f(mathbfx):mathbbR^n rightarrow mathbbR$. The partial derivatives of $f$ are the rates of change along the basis vectors of $mathbfx$:



                    $textrmrate of change along mathbfe_i = lim_hrightarrow 0 fracf(mathbfx + hmathbfe_i)- f(mathbfx)h = fracpartial fpartial x_i$



                    Each partial derivative is a scalar. It is simply a rate of change.



                    The gradient of $f$ is then defined as the vector:



                    $nabla f = sum_i fracpartial fpartial x_i mathbfe_i$



                    We can naturally extend the concept of the rate of change along a basis vector to a (unit) vector pointing in an arbitrary direction. Let $mathbfv$ be such a vector, i.e., $mathbfv = sum_i alpha_i mathbfe_i$ where $sum_i alpha_i^2 = 1$. Then:



                    $textrmrate of change along mathbfv = lim_hrightarrow 0 fracf(mathbfx + hmathbfv) - f(mathbfx)h$



                    Again, this quantity is a scalar.



                    Now, it can be proven that if $f$ is differentiable at $mathbfx$, the limit above evaluates to: $(nabla f) cdot mathbfv$. This is a dot product of two vectors, which returns a scalar.



                    We know from linear algebra that the dot product is maximized when the two vectors point in the same direction. This means that the rate of change along an arbitrary vector $mathbfv$ is maximized when $mathbfv$ points in the same direction as the gradient. In other words, the gradient corresponds to the rate of steepest ascent/descent.






                    share|cite|improve this answer




















                    • Are you missing a square root around $sum_i alpha_i^2$? The way I understand it, what you're trying to say is that the magnitude of $v$ should be 1, which only happens if the square root of the sum of alphas squared is 1 (I could easily be completely wrong, my apologies if that's the case). I like your explanation overall regardless.
                      – jeremy radcliff
                      Sep 24 '16 at 23:08







                    • 2




                      @jeremyradcliff Yes exactly, I'm saying the magnitude should be 1. I left out the square root precisely because $1^2 =1$.
                      – MGA
                      Sep 26 '16 at 16:23











                    • Thank you for following up; that makes sense, I should've realized.
                      – jeremy radcliff
                      Sep 26 '16 at 23:47














                    up vote
                    17
                    down vote













                    Other answers are correct in using the directional derivative to show that the gradient is the direction of steepest ascent/descent. However, I think it is instructive to look at the definition of the directional derivative from first principles to understand why this is so (it is not arbitrarily defined to be the dot product of the gradient and the directional vector).



                    Let $f(mathbfx):mathbbR^n rightarrow mathbbR$. The partial derivatives of $f$ are the rates of change along the basis vectors of $mathbfx$:



                    $textrmrate of change along mathbfe_i = lim_hrightarrow 0 fracf(mathbfx + hmathbfe_i)- f(mathbfx)h = fracpartial fpartial x_i$



                    Each partial derivative is a scalar. It is simply a rate of change.



                    The gradient of $f$ is then defined as the vector:



                    $nabla f = sum_i fracpartial fpartial x_i mathbfe_i$



                    We can naturally extend the concept of the rate of change along a basis vector to a (unit) vector pointing in an arbitrary direction. Let $mathbfv$ be such a vector, i.e., $mathbfv = sum_i alpha_i mathbfe_i$ where $sum_i alpha_i^2 = 1$. Then:



                    $textrmrate of change along mathbfv = lim_hrightarrow 0 fracf(mathbfx + hmathbfv) - f(mathbfx)h$



                    Again, this quantity is a scalar.



                    Now, it can be proven that if $f$ is differentiable at $mathbfx$, the limit above evaluates to: $(nabla f) cdot mathbfv$. This is a dot product of two vectors, which returns a scalar.



                    We know from linear algebra that the dot product is maximized when the two vectors point in the same direction. This means that the rate of change along an arbitrary vector $mathbfv$ is maximized when $mathbfv$ points in the same direction as the gradient. In other words, the gradient corresponds to the rate of steepest ascent/descent.






                    share|cite|improve this answer




















                    • Are you missing a square root around $sum_i alpha_i^2$? The way I understand it, what you're trying to say is that the magnitude of $v$ should be 1, which only happens if the square root of the sum of alphas squared is 1 (I could easily be completely wrong, my apologies if that's the case). I like your explanation overall regardless.
                      – jeremy radcliff
                      Sep 24 '16 at 23:08







                    • 2




                      @jeremyradcliff Yes exactly, I'm saying the magnitude should be 1. I left out the square root precisely because $1^2 =1$.
                      – MGA
                      Sep 26 '16 at 16:23











                    • Thank you for following up; that makes sense, I should've realized.
                      – jeremy radcliff
                      Sep 26 '16 at 23:47












                    up vote
                    17
                    down vote










                    up vote
                    17
                    down vote









                    Other answers are correct in using the directional derivative to show that the gradient is the direction of steepest ascent/descent. However, I think it is instructive to look at the definition of the directional derivative from first principles to understand why this is so (it is not arbitrarily defined to be the dot product of the gradient and the directional vector).



                    Let $f(mathbfx):mathbbR^n rightarrow mathbbR$. The partial derivatives of $f$ are the rates of change along the basis vectors of $mathbfx$:



                    $textrmrate of change along mathbfe_i = lim_hrightarrow 0 fracf(mathbfx + hmathbfe_i)- f(mathbfx)h = fracpartial fpartial x_i$



                    Each partial derivative is a scalar. It is simply a rate of change.



                    The gradient of $f$ is then defined as the vector:



                    $nabla f = sum_i fracpartial fpartial x_i mathbfe_i$



                    We can naturally extend the concept of the rate of change along a basis vector to a (unit) vector pointing in an arbitrary direction. Let $mathbfv$ be such a vector, i.e., $mathbfv = sum_i alpha_i mathbfe_i$ where $sum_i alpha_i^2 = 1$. Then:



                    $textrmrate of change along mathbfv = lim_hrightarrow 0 fracf(mathbfx + hmathbfv) - f(mathbfx)h$



                    Again, this quantity is a scalar.



                    Now, it can be proven that if $f$ is differentiable at $mathbfx$, the limit above evaluates to: $(nabla f) cdot mathbfv$. This is a dot product of two vectors, which returns a scalar.



                    We know from linear algebra that the dot product is maximized when the two vectors point in the same direction. This means that the rate of change along an arbitrary vector $mathbfv$ is maximized when $mathbfv$ points in the same direction as the gradient. In other words, the gradient corresponds to the rate of steepest ascent/descent.






                    share|cite|improve this answer












                    Other answers are correct in using the directional derivative to show that the gradient is the direction of steepest ascent/descent. However, I think it is instructive to look at the definition of the directional derivative from first principles to understand why this is so (it is not arbitrarily defined to be the dot product of the gradient and the directional vector).



                    Let $f(mathbfx):mathbbR^n rightarrow mathbbR$. The partial derivatives of $f$ are the rates of change along the basis vectors of $mathbfx$:



                    $textrmrate of change along mathbfe_i = lim_hrightarrow 0 fracf(mathbfx + hmathbfe_i)- f(mathbfx)h = fracpartial fpartial x_i$



                    Each partial derivative is a scalar. It is simply a rate of change.



                    The gradient of $f$ is then defined as the vector:



                    $nabla f = sum_i fracpartial fpartial x_i mathbfe_i$



                    We can naturally extend the concept of the rate of change along a basis vector to a (unit) vector pointing in an arbitrary direction. Let $mathbfv$ be such a vector, i.e., $mathbfv = sum_i alpha_i mathbfe_i$ where $sum_i alpha_i^2 = 1$. Then:



                    $textrmrate of change along mathbfv = lim_hrightarrow 0 fracf(mathbfx + hmathbfv) - f(mathbfx)h$



                    Again, this quantity is a scalar.



                    Now, it can be proven that if $f$ is differentiable at $mathbfx$, the limit above evaluates to: $(nabla f) cdot mathbfv$. This is a dot product of two vectors, which returns a scalar.



                    We know from linear algebra that the dot product is maximized when the two vectors point in the same direction. This means that the rate of change along an arbitrary vector $mathbfv$ is maximized when $mathbfv$ points in the same direction as the gradient. In other words, the gradient corresponds to the rate of steepest ascent/descent.







                    share|cite|improve this answer












                    share|cite|improve this answer



                    share|cite|improve this answer










                    answered May 6 '15 at 15:46









                    MGA

                    4,85333050




                    4,85333050











                    • Are you missing a square root around $sum_i alpha_i^2$? The way I understand it, what you're trying to say is that the magnitude of $v$ should be 1, which only happens if the square root of the sum of alphas squared is 1 (I could easily be completely wrong, my apologies if that's the case). I like your explanation overall regardless.
                      – jeremy radcliff
                      Sep 24 '16 at 23:08







                    • 2




                      @jeremyradcliff Yes exactly, I'm saying the magnitude should be 1. I left out the square root precisely because $1^2 =1$.
                      – MGA
                      Sep 26 '16 at 16:23











                    • Thank you for following up; that makes sense, I should've realized.
                      – jeremy radcliff
                      Sep 26 '16 at 23:47
















                    • Are you missing a square root around $sum_i alpha_i^2$? The way I understand it, what you're trying to say is that the magnitude of $v$ should be 1, which only happens if the square root of the sum of alphas squared is 1 (I could easily be completely wrong, my apologies if that's the case). I like your explanation overall regardless.
                      – jeremy radcliff
                      Sep 24 '16 at 23:08







                    • 2




                      @jeremyradcliff Yes exactly, I'm saying the magnitude should be 1. I left out the square root precisely because $1^2 =1$.
                      – MGA
                      Sep 26 '16 at 16:23











                    • Thank you for following up; that makes sense, I should've realized.
                      – jeremy radcliff
                      Sep 26 '16 at 23:47















                    Are you missing a square root around $sum_i alpha_i^2$? The way I understand it, what you're trying to say is that the magnitude of $v$ should be 1, which only happens if the square root of the sum of alphas squared is 1 (I could easily be completely wrong, my apologies if that's the case). I like your explanation overall regardless.
                    – jeremy radcliff
                    Sep 24 '16 at 23:08





                    Are you missing a square root around $sum_i alpha_i^2$? The way I understand it, what you're trying to say is that the magnitude of $v$ should be 1, which only happens if the square root of the sum of alphas squared is 1 (I could easily be completely wrong, my apologies if that's the case). I like your explanation overall regardless.
                    – jeremy radcliff
                    Sep 24 '16 at 23:08





                    2




                    2




                    @jeremyradcliff Yes exactly, I'm saying the magnitude should be 1. I left out the square root precisely because $1^2 =1$.
                    – MGA
                    Sep 26 '16 at 16:23





                    @jeremyradcliff Yes exactly, I'm saying the magnitude should be 1. I left out the square root precisely because $1^2 =1$.
                    – MGA
                    Sep 26 '16 at 16:23













                    Thank you for following up; that makes sense, I should've realized.
                    – jeremy radcliff
                    Sep 26 '16 at 23:47




                    Thank you for following up; that makes sense, I should've realized.
                    – jeremy radcliff
                    Sep 26 '16 at 23:47










                    up vote
                    16
                    down vote













                    The question you're asking can be rephrased as "In which direction is the directional derivative $nabla_hatuf$ a maximum?".



                    Assuming differentiability, $nabla_hatuf$ can be written as:



                    $$nabla_hatuf = nabla f(textbfx) cdot hatu =|nabla f(textbfx)||hatu|cos theta = |nabla f(textbfx)|cos theta$$



                    which is a maximum when $theta =0$: when $nabla f(textbfx)$ and $hatu$ are parallel.






                    share|cite|improve this answer
























                      up vote
                      16
                      down vote













                      The question you're asking can be rephrased as "In which direction is the directional derivative $nabla_hatuf$ a maximum?".



                      Assuming differentiability, $nabla_hatuf$ can be written as:



                      $$nabla_hatuf = nabla f(textbfx) cdot hatu =|nabla f(textbfx)||hatu|cos theta = |nabla f(textbfx)|cos theta$$



                      which is a maximum when $theta =0$: when $nabla f(textbfx)$ and $hatu$ are parallel.






                      share|cite|improve this answer






















                        up vote
                        16
                        down vote










                        up vote
                        16
                        down vote









                        The question you're asking can be rephrased as "In which direction is the directional derivative $nabla_hatuf$ a maximum?".



                        Assuming differentiability, $nabla_hatuf$ can be written as:



                        $$nabla_hatuf = nabla f(textbfx) cdot hatu =|nabla f(textbfx)||hatu|cos theta = |nabla f(textbfx)|cos theta$$



                        which is a maximum when $theta =0$: when $nabla f(textbfx)$ and $hatu$ are parallel.






                        share|cite|improve this answer












                        The question you're asking can be rephrased as "In which direction is the directional derivative $nabla_hatuf$ a maximum?".



                        Assuming differentiability, $nabla_hatuf$ can be written as:



                        $$nabla_hatuf = nabla f(textbfx) cdot hatu =|nabla f(textbfx)||hatu|cos theta = |nabla f(textbfx)|cos theta$$



                        which is a maximum when $theta =0$: when $nabla f(textbfx)$ and $hatu$ are parallel.







                        share|cite|improve this answer












                        share|cite|improve this answer



                        share|cite|improve this answer










                        answered Oct 29 '12 at 4:22









                        BobaFret

                        2,4311717




                        2,4311717




















                            up vote
                            5
                            down vote













                            Each component of the derivative
                            $$
                            fracpartial fpartial x_1 ... fracpartial fpartial x_n$$
                            tells you how fast your function is changing with respect to the standard basis.
                            It's now possible to make a basetransformation to an orthogonal base with $ n-1 $ base Directions with $0$ ascent and the gradient direction. In such a base the gradient direction must be the steepest since any adding of other base directions adds length but no ascent.



                            For a 3 dimensional Vector space the base could look like this
                            $$ left( left( beginmatrix partial x_2 \ -partial x_1 \ 0 endmatrix right) left( beginmatrix partial x_1 \ partial x_2 \ -dfrac(partial x_1)²+(partial x_2)²partial x_3 endmatrix right) left( beginmatrix partial x_1 \ partial x_2 \ partial x_3 endmatrix right) right) $$ By complete induction it can now be shown that such a base is constructable for an n-Dimensional Vector space.
                            $$ left( left( beginmatrix partial x_2 \ -partial x_1 \ 0 \ 0 endmatrix right) left( beginmatrix colorbluepartial x_1 \ partial x_2 \ -dfrac(partial x_1)²+(partial x_2)²partial x_3 \ 0 endmatrix right) left( beginmatrix colorbluepartial x_1 \ partial x_2 \ colorgreenpartial x_3 \ -dfrac(partial x_1)²+(partial x_2)²+(partial x_3)²partial x_4 endmatrix right) left(beginmatrix colorbluepartial x_1 \ partial x_2 \ colorgreenpartial x_3 \ colororangepartial x_4 endmatrix right) right) $$ One can see here that the first Basevector demands the first 2 Elements of the following Basevectors to be $partial x_1$ & $partial x_2$ because of the orthogonal condition,
                            similarly the 2nd vector demands all the 3rd elements of the following vectors to be $partial x_3$
                            as does the 3rd vector for the 4th element them being $partial x_4$.



                            If another dimension is added the n+1 Element of the n$th$ Vector needs to be $$-dfrac(partial x_1)²+...+(partial x_n)²partial x_n+1$$ to meet the $0$ ascension condition which in turn forces the new n+1$th$ Vector to be of the form $$left(beginmatrixpartial x_1 \ ... \ partial x_n+1endmatrixright)$$ for it to be orthogonal to the rest.






                            share|cite|improve this answer


























                              up vote
                              5
                              down vote













                              Each component of the derivative
                              $$
                              fracpartial fpartial x_1 ... fracpartial fpartial x_n$$
                              tells you how fast your function is changing with respect to the standard basis.
                              It's now possible to make a basetransformation to an orthogonal base with $ n-1 $ base Directions with $0$ ascent and the gradient direction. In such a base the gradient direction must be the steepest since any adding of other base directions adds length but no ascent.



                              For a 3 dimensional Vector space the base could look like this
                              $$ left( left( beginmatrix partial x_2 \ -partial x_1 \ 0 endmatrix right) left( beginmatrix partial x_1 \ partial x_2 \ -dfrac(partial x_1)²+(partial x_2)²partial x_3 endmatrix right) left( beginmatrix partial x_1 \ partial x_2 \ partial x_3 endmatrix right) right) $$ By complete induction it can now be shown that such a base is constructable for an n-Dimensional Vector space.
                              $$ left( left( beginmatrix partial x_2 \ -partial x_1 \ 0 \ 0 endmatrix right) left( beginmatrix colorbluepartial x_1 \ partial x_2 \ -dfrac(partial x_1)²+(partial x_2)²partial x_3 \ 0 endmatrix right) left( beginmatrix colorbluepartial x_1 \ partial x_2 \ colorgreenpartial x_3 \ -dfrac(partial x_1)²+(partial x_2)²+(partial x_3)²partial x_4 endmatrix right) left(beginmatrix colorbluepartial x_1 \ partial x_2 \ colorgreenpartial x_3 \ colororangepartial x_4 endmatrix right) right) $$ One can see here that the first Basevector demands the first 2 Elements of the following Basevectors to be $partial x_1$ & $partial x_2$ because of the orthogonal condition,
                              similarly the 2nd vector demands all the 3rd elements of the following vectors to be $partial x_3$
                              as does the 3rd vector for the 4th element them being $partial x_4$.



                              If another dimension is added the n+1 Element of the n$th$ Vector needs to be $$-dfrac(partial x_1)²+...+(partial x_n)²partial x_n+1$$ to meet the $0$ ascension condition which in turn forces the new n+1$th$ Vector to be of the form $$left(beginmatrixpartial x_1 \ ... \ partial x_n+1endmatrixright)$$ for it to be orthogonal to the rest.






                              share|cite|improve this answer
























                                up vote
                                5
                                down vote










                                up vote
                                5
                                down vote









                                Each component of the derivative
                                $$
                                fracpartial fpartial x_1 ... fracpartial fpartial x_n$$
                                tells you how fast your function is changing with respect to the standard basis.
                                It's now possible to make a basetransformation to an orthogonal base with $ n-1 $ base Directions with $0$ ascent and the gradient direction. In such a base the gradient direction must be the steepest since any adding of other base directions adds length but no ascent.



                                For a 3 dimensional Vector space the base could look like this
                                $$ left( left( beginmatrix partial x_2 \ -partial x_1 \ 0 endmatrix right) left( beginmatrix partial x_1 \ partial x_2 \ -dfrac(partial x_1)²+(partial x_2)²partial x_3 endmatrix right) left( beginmatrix partial x_1 \ partial x_2 \ partial x_3 endmatrix right) right) $$ By complete induction it can now be shown that such a base is constructable for an n-Dimensional Vector space.
                                $$ left( left( beginmatrix partial x_2 \ -partial x_1 \ 0 \ 0 endmatrix right) left( beginmatrix colorbluepartial x_1 \ partial x_2 \ -dfrac(partial x_1)²+(partial x_2)²partial x_3 \ 0 endmatrix right) left( beginmatrix colorbluepartial x_1 \ partial x_2 \ colorgreenpartial x_3 \ -dfrac(partial x_1)²+(partial x_2)²+(partial x_3)²partial x_4 endmatrix right) left(beginmatrix colorbluepartial x_1 \ partial x_2 \ colorgreenpartial x_3 \ colororangepartial x_4 endmatrix right) right) $$ One can see here that the first Basevector demands the first 2 Elements of the following Basevectors to be $partial x_1$ & $partial x_2$ because of the orthogonal condition,
                                similarly the 2nd vector demands all the 3rd elements of the following vectors to be $partial x_3$
                                as does the 3rd vector for the 4th element them being $partial x_4$.



                                If another dimension is added the n+1 Element of the n$th$ Vector needs to be $$-dfrac(partial x_1)²+...+(partial x_n)²partial x_n+1$$ to meet the $0$ ascension condition which in turn forces the new n+1$th$ Vector to be of the form $$left(beginmatrixpartial x_1 \ ... \ partial x_n+1endmatrixright)$$ for it to be orthogonal to the rest.






                                share|cite|improve this answer














                                Each component of the derivative
                                $$
                                fracpartial fpartial x_1 ... fracpartial fpartial x_n$$
                                tells you how fast your function is changing with respect to the standard basis.
                                It's now possible to make a basetransformation to an orthogonal base with $ n-1 $ base Directions with $0$ ascent and the gradient direction. In such a base the gradient direction must be the steepest since any adding of other base directions adds length but no ascent.



                                For a 3 dimensional Vector space the base could look like this
                                $$ left( left( beginmatrix partial x_2 \ -partial x_1 \ 0 endmatrix right) left( beginmatrix partial x_1 \ partial x_2 \ -dfrac(partial x_1)²+(partial x_2)²partial x_3 endmatrix right) left( beginmatrix partial x_1 \ partial x_2 \ partial x_3 endmatrix right) right) $$ By complete induction it can now be shown that such a base is constructable for an n-Dimensional Vector space.
                                $$ left( left( beginmatrix partial x_2 \ -partial x_1 \ 0 \ 0 endmatrix right) left( beginmatrix colorbluepartial x_1 \ partial x_2 \ -dfrac(partial x_1)²+(partial x_2)²partial x_3 \ 0 endmatrix right) left( beginmatrix colorbluepartial x_1 \ partial x_2 \ colorgreenpartial x_3 \ -dfrac(partial x_1)²+(partial x_2)²+(partial x_3)²partial x_4 endmatrix right) left(beginmatrix colorbluepartial x_1 \ partial x_2 \ colorgreenpartial x_3 \ colororangepartial x_4 endmatrix right) right) $$ One can see here that the first Basevector demands the first 2 Elements of the following Basevectors to be $partial x_1$ & $partial x_2$ because of the orthogonal condition,
                                similarly the 2nd vector demands all the 3rd elements of the following vectors to be $partial x_3$
                                as does the 3rd vector for the 4th element them being $partial x_4$.



                                If another dimension is added the n+1 Element of the n$th$ Vector needs to be $$-dfrac(partial x_1)²+...+(partial x_n)²partial x_n+1$$ to meet the $0$ ascension condition which in turn forces the new n+1$th$ Vector to be of the form $$left(beginmatrixpartial x_1 \ ... \ partial x_n+1endmatrixright)$$ for it to be orthogonal to the rest.







                                share|cite|improve this answer














                                share|cite|improve this answer



                                share|cite|improve this answer








                                edited Mar 18 '13 at 14:43

























                                answered Mar 18 '13 at 13:03









                                whateverguy

                                6816




                                6816




















                                    up vote
                                    1
                                    down vote













                                    Let $vec v$ be an arbitrary unit vector. Then the change of $f$ by moving in the direction of $v$, starting in point $a$, is given by $grad( f(a)) cdot vec v$. We want to find a $vec v$ for which this inner product is maximal.
                                    For the inner product we have the Cauchy–Schwarz inequality $vec a cdot vec b leq |vec a||vec b|$.
                                    Now the equality holds when $vec v = lambda ; grad(f(a))$, for some $lambda in mathbbR$.






                                    share|cite|improve this answer
























                                      up vote
                                      1
                                      down vote













                                      Let $vec v$ be an arbitrary unit vector. Then the change of $f$ by moving in the direction of $v$, starting in point $a$, is given by $grad( f(a)) cdot vec v$. We want to find a $vec v$ for which this inner product is maximal.
                                      For the inner product we have the Cauchy–Schwarz inequality $vec a cdot vec b leq |vec a||vec b|$.
                                      Now the equality holds when $vec v = lambda ; grad(f(a))$, for some $lambda in mathbbR$.






                                      share|cite|improve this answer






















                                        up vote
                                        1
                                        down vote










                                        up vote
                                        1
                                        down vote









                                        Let $vec v$ be an arbitrary unit vector. Then the change of $f$ by moving in the direction of $v$, starting in point $a$, is given by $grad( f(a)) cdot vec v$. We want to find a $vec v$ for which this inner product is maximal.
                                        For the inner product we have the Cauchy–Schwarz inequality $vec a cdot vec b leq |vec a||vec b|$.
                                        Now the equality holds when $vec v = lambda ; grad(f(a))$, for some $lambda in mathbbR$.






                                        share|cite|improve this answer












                                        Let $vec v$ be an arbitrary unit vector. Then the change of $f$ by moving in the direction of $v$, starting in point $a$, is given by $grad( f(a)) cdot vec v$. We want to find a $vec v$ for which this inner product is maximal.
                                        For the inner product we have the Cauchy–Schwarz inequality $vec a cdot vec b leq |vec a||vec b|$.
                                        Now the equality holds when $vec v = lambda ; grad(f(a))$, for some $lambda in mathbbR$.







                                        share|cite|improve this answer












                                        share|cite|improve this answer



                                        share|cite|improve this answer










                                        answered Sep 22 '17 at 15:20









                                        Jens Wagemaker

                                        375211




                                        375211




















                                            up vote
                                            1
                                            down vote













                                            Let $v=fracss$ be a unit vector and assume that $v$ is a descent direction, i.e. $v^Tnabla f(x) <0$. Then $f(x+lambda v)$ as a function of $lambda$, describes how this function changes along the direction $v$.



                                            The rate of descent at $x$ along $v$ is given by:
                                            $$ fracdd lambdaf(x+lambda v)|_lambda=0 = v^T nabla f(x) =fracs^Tsnabla f(x) equiv fracs^Tsg$$
                                            So we want to find the maximum of this quantity as a function of $s$.
                                            Differentiating the above wrt $s$ and setting it equal to zero, we get (noting that $nabla_s|s| =fracss$): $g=(g^T v)vequiv av$.



                                            Taking the Euclidean norm: $|g|=|a||v|=|a| Rightarrow a=pm|g|$.



                                            We choose the minus sign to satisfy that $v$ is descent. Hence the direction of the steepest descent is
                                            $$ v= dfrac1ag = -dfracgg$$






                                            share|cite|improve this answer
























                                              up vote
                                              1
                                              down vote













                                              Let $v=fracss$ be a unit vector and assume that $v$ is a descent direction, i.e. $v^Tnabla f(x) <0$. Then $f(x+lambda v)$ as a function of $lambda$, describes how this function changes along the direction $v$.



                                              The rate of descent at $x$ along $v$ is given by:
                                              $$ fracdd lambdaf(x+lambda v)|_lambda=0 = v^T nabla f(x) =fracs^Tsnabla f(x) equiv fracs^Tsg$$
                                              So we want to find the maximum of this quantity as a function of $s$.
                                              Differentiating the above wrt $s$ and setting it equal to zero, we get (noting that $nabla_s|s| =fracss$): $g=(g^T v)vequiv av$.



                                              Taking the Euclidean norm: $|g|=|a||v|=|a| Rightarrow a=pm|g|$.



                                              We choose the minus sign to satisfy that $v$ is descent. Hence the direction of the steepest descent is
                                              $$ v= dfrac1ag = -dfracgg$$






                                              share|cite|improve this answer






















                                                up vote
                                                1
                                                down vote










                                                up vote
                                                1
                                                down vote









                                                Let $v=fracss$ be a unit vector and assume that $v$ is a descent direction, i.e. $v^Tnabla f(x) <0$. Then $f(x+lambda v)$ as a function of $lambda$, describes how this function changes along the direction $v$.



                                                The rate of descent at $x$ along $v$ is given by:
                                                $$ fracdd lambdaf(x+lambda v)|_lambda=0 = v^T nabla f(x) =fracs^Tsnabla f(x) equiv fracs^Tsg$$
                                                So we want to find the maximum of this quantity as a function of $s$.
                                                Differentiating the above wrt $s$ and setting it equal to zero, we get (noting that $nabla_s|s| =fracss$): $g=(g^T v)vequiv av$.



                                                Taking the Euclidean norm: $|g|=|a||v|=|a| Rightarrow a=pm|g|$.



                                                We choose the minus sign to satisfy that $v$ is descent. Hence the direction of the steepest descent is
                                                $$ v= dfrac1ag = -dfracgg$$






                                                share|cite|improve this answer












                                                Let $v=fracss$ be a unit vector and assume that $v$ is a descent direction, i.e. $v^Tnabla f(x) <0$. Then $f(x+lambda v)$ as a function of $lambda$, describes how this function changes along the direction $v$.



                                                The rate of descent at $x$ along $v$ is given by:
                                                $$ fracdd lambdaf(x+lambda v)|_lambda=0 = v^T nabla f(x) =fracs^Tsnabla f(x) equiv fracs^Tsg$$
                                                So we want to find the maximum of this quantity as a function of $s$.
                                                Differentiating the above wrt $s$ and setting it equal to zero, we get (noting that $nabla_s|s| =fracss$): $g=(g^T v)vequiv av$.



                                                Taking the Euclidean norm: $|g|=|a||v|=|a| Rightarrow a=pm|g|$.



                                                We choose the minus sign to satisfy that $v$ is descent. Hence the direction of the steepest descent is
                                                $$ v= dfrac1ag = -dfracgg$$







                                                share|cite|improve this answer












                                                share|cite|improve this answer



                                                share|cite|improve this answer










                                                answered Mar 28 at 16:26









                                                Isaac Lagaris

                                                111




                                                111




















                                                    up vote
                                                    1
                                                    down vote













                                                    Just want to further clarify why the gradient provides the steepest ascent (instead of descent) here. Any differentiable $f$ can be approximated by the linear tangent plane, i.e., $$f(mathbfx + h mathbfv) = f(mathbfx) + h , nabla f(mathbfx)^T mathbfv $$ as $h rightarrow 0$ for any unit-length direction $mathbfv$ with $parallel mathbfv parallel =1.$ As $h downarrow 0$, consider the amount of change
                                                    $$
                                                    f(mathbfx + h mathbfv) - f(mathbfx) = h , left , nabla f(mathbfx)^T mathbfv right
                                                    ~~in~~ left[ - h , parallel nabla f(mathbfx) parallel, ~ h , parallel nabla f(mathbfx) parallel right]
                                                    $$
                                                    by Cauchy-Swcharz inequality, which reaches its maximum (increase) $(h , parallel nabla f(mathbfx) parallel)$ when $mathbfv = nabla f(mathbfx) / parallel nabla f(mathbfx) parallel$ and its minimum (i.e., maximum decrease) $ (-h , parallel nabla f(mathbfx) parallel) $ if $ mathbfv= - nabla f(mathbfx)/parallel nabla f(mathbfx) parallel$ (the negative gradient direction).






                                                    share|cite|improve this answer


























                                                      up vote
                                                      1
                                                      down vote













                                                      Just want to further clarify why the gradient provides the steepest ascent (instead of descent) here. Any differentiable $f$ can be approximated by the linear tangent plane, i.e., $$f(mathbfx + h mathbfv) = f(mathbfx) + h , nabla f(mathbfx)^T mathbfv $$ as $h rightarrow 0$ for any unit-length direction $mathbfv$ with $parallel mathbfv parallel =1.$ As $h downarrow 0$, consider the amount of change
                                                      $$
                                                      f(mathbfx + h mathbfv) - f(mathbfx) = h , left , nabla f(mathbfx)^T mathbfv right
                                                      ~~in~~ left[ - h , parallel nabla f(mathbfx) parallel, ~ h , parallel nabla f(mathbfx) parallel right]
                                                      $$
                                                      by Cauchy-Swcharz inequality, which reaches its maximum (increase) $(h , parallel nabla f(mathbfx) parallel)$ when $mathbfv = nabla f(mathbfx) / parallel nabla f(mathbfx) parallel$ and its minimum (i.e., maximum decrease) $ (-h , parallel nabla f(mathbfx) parallel) $ if $ mathbfv= - nabla f(mathbfx)/parallel nabla f(mathbfx) parallel$ (the negative gradient direction).






                                                      share|cite|improve this answer
























                                                        up vote
                                                        1
                                                        down vote










                                                        up vote
                                                        1
                                                        down vote









                                                        Just want to further clarify why the gradient provides the steepest ascent (instead of descent) here. Any differentiable $f$ can be approximated by the linear tangent plane, i.e., $$f(mathbfx + h mathbfv) = f(mathbfx) + h , nabla f(mathbfx)^T mathbfv $$ as $h rightarrow 0$ for any unit-length direction $mathbfv$ with $parallel mathbfv parallel =1.$ As $h downarrow 0$, consider the amount of change
                                                        $$
                                                        f(mathbfx + h mathbfv) - f(mathbfx) = h , left , nabla f(mathbfx)^T mathbfv right
                                                        ~~in~~ left[ - h , parallel nabla f(mathbfx) parallel, ~ h , parallel nabla f(mathbfx) parallel right]
                                                        $$
                                                        by Cauchy-Swcharz inequality, which reaches its maximum (increase) $(h , parallel nabla f(mathbfx) parallel)$ when $mathbfv = nabla f(mathbfx) / parallel nabla f(mathbfx) parallel$ and its minimum (i.e., maximum decrease) $ (-h , parallel nabla f(mathbfx) parallel) $ if $ mathbfv= - nabla f(mathbfx)/parallel nabla f(mathbfx) parallel$ (the negative gradient direction).






                                                        share|cite|improve this answer














                                                        Just want to further clarify why the gradient provides the steepest ascent (instead of descent) here. Any differentiable $f$ can be approximated by the linear tangent plane, i.e., $$f(mathbfx + h mathbfv) = f(mathbfx) + h , nabla f(mathbfx)^T mathbfv $$ as $h rightarrow 0$ for any unit-length direction $mathbfv$ with $parallel mathbfv parallel =1.$ As $h downarrow 0$, consider the amount of change
                                                        $$
                                                        f(mathbfx + h mathbfv) - f(mathbfx) = h , left , nabla f(mathbfx)^T mathbfv right
                                                        ~~in~~ left[ - h , parallel nabla f(mathbfx) parallel, ~ h , parallel nabla f(mathbfx) parallel right]
                                                        $$
                                                        by Cauchy-Swcharz inequality, which reaches its maximum (increase) $(h , parallel nabla f(mathbfx) parallel)$ when $mathbfv = nabla f(mathbfx) / parallel nabla f(mathbfx) parallel$ and its minimum (i.e., maximum decrease) $ (-h , parallel nabla f(mathbfx) parallel) $ if $ mathbfv= - nabla f(mathbfx)/parallel nabla f(mathbfx) parallel$ (the negative gradient direction).







                                                        share|cite|improve this answer














                                                        share|cite|improve this answer



                                                        share|cite|improve this answer








                                                        edited Aug 10 at 17:59

























                                                        answered Aug 10 at 17:36









                                                        XGS

                                                        212




                                                        212






















                                                             

                                                            draft saved


                                                            draft discarded


























                                                             


                                                            draft saved


                                                            draft discarded














                                                            StackExchange.ready(
                                                            function ()
                                                            StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fmath.stackexchange.com%2fquestions%2f223252%2fwhy-is-gradient-the-direction-of-steepest-ascent%23new-answer', 'question_page');

                                                            );

                                                            Post as a guest