Foreword
This article discusses methods for solving mathematical optimization problems that are based on using the gradient of a function. The main goal is to collect in one article the most important ideas that are in one way or another connected with this method and its various modifications.
UPD. In the comments readers write that the formulas are not displayed in some browsers and in the mobile application. Unfortunately, I do not know how to deal with this. I can only say that I used the “inline” and “display” macros of the Habr editor. If you happen to know how to fix it, please write in the comments.
Note from the author
At the time of writing this article I was defending a thesis whose task required of me a deep understanding of the theoretical foundations of mathematical optimization methods. Nevertheless, my eyes still glazed over (like everyone's) at the terribly long formulas, so I spent a lot of time isolating the key ideas that characterize the different variations of gradient methods. My personal goal is to write an article containing the minimum amount of information necessary for a basic understanding of the subject. But be prepared: one cannot do without formulas anyway.
Formulation of the problem
Before describing the method, we must first describe the problem, namely: “Given a set $\mathcal{K}$ and a function $f : \mathcal{K} \rightarrow \mathbb{R}$, find a point $x^* \in \mathcal{K}$ such that $f(x) \geq f(x^*)$ for all $x \in \mathcal{K}$,” which is usually written like this:

$$f(x) \rightarrow \min_{x \in \mathcal{K}}.$$
In theory it is usually assumed that $f$ is a differentiable and convex function and $\mathcal{K}$ is a convex set (even better if simply $\mathcal{K} = \mathbb{R}^n$); this allows us to give some guarantees that gradient descent will succeed. In practice, gradient descent is successfully applied even when the problem has none of these properties (an example appears later in the article).
A bit of math
Suppose for now that we just need to find the minimum of a one-dimensional function

$$f(x) \rightarrow \min_{x \in \mathbb{R}}.$$
As early as the 17th century, Pierre Fermat came up with a criterion that allows simple optimization problems to be solved, namely: if $x^*$ is a minimum point of $f$, then

$$f'(x^*) = 0,$$

where $f'$ is the derivative of $f$. This criterion is based on the linear approximation

$$f(x) \approx f(x^*) + f'(x^*)(x - x^*).$$
The closer $x$ is to $x^*$, the more precise this approximation is. On the right-hand side stands an expression that, when $f'(x^*) \neq 0$, can be both greater and less than $f(x^*)$ — this is the main essence of the criterion. In the multidimensional case, similarly, from the linear approximation $f(x) \approx f(x^*) + \nabla f(x^*)^T(x - x^*)$ (here and below $x^T y = \sum_{i=1}^{n} x_i y_i$ is the standard scalar product; this form of writing reflects the fact that the scalar product is the same as the matrix product of a row vector and a column vector) we obtain

$$\nabla f(x^*) = 0.$$
The quantity $\nabla f(x^*)$ is the gradient of the function $f$ at the point $x^*$. The equality of the gradient to zero also means that all partial derivatives are zero, so in the multidimensional case this criterion can be obtained simply by applying the one-dimensional criterion to each variable separately.
It is worth noting that these conditions are necessary but not sufficient; the simplest example is the point $0$ for $f(x) = x^2$ and for $f(x) = x^3$.

This criterion is sufficient in the case of a convex function; largely because of this, so many results have been obtained specifically for convex functions.
Quadratic functions
Quadratic functions in $\mathbb{R}^n$ are functions of the form

$$f(x) = f(x_1, x_2, \ldots, x_n) = \frac{1}{2}\sum_{i,j=1}^{n} a_{ij} x_i x_j - \sum_{i=1}^{n} b_i x_i + c.$$
To save space (and to fuss less with indices), this function is usually written in matrix form:

$$f(x) = \frac{1}{2} x^T A x - b^T x + c,$$

where $x = (x_1, \ldots, x_n)^T$, $b = (b_1, \ldots, b_n)^T$, and $A$ is the matrix whose entry at the intersection of the $i$-th row and the $j$-th column is $\frac{1}{2}(a_{ij} + a_{ji})$ ($A$ thereby turns out to be symmetric — this is important). Further on, when mentioning a quadratic function, I will mean the function above.
Why am I talking about this? The fact is that quadratic functions are important in optimization for two reasons:
- They also occur in practice, for example, when constructing a linear regression using the least squares method
- The gradient of a quadratic function is a linear function; in particular, for the function above

$$\frac{\partial}{\partial x_i} f(x_1, x_2, \ldots, x_n) = a_{ii} x_i + \sum_{j \neq i} \frac{1}{2}(a_{ij} + a_{ji}) x_j - b_i,$$

or in matrix form

$$\nabla f(x) = Ax - b.$$
Thus the system $\nabla f(x) = 0$ is a linear system, and there simply is no system simpler than a linear one. The idea I have been trying to get to: optimization of a quadratic function is the simplest class of optimization problems. On the other hand, the fact that $\nabla f(x^*) = 0$ is a necessary minimum condition makes it possible to solve linear systems through optimization problems. A little later I will try to convince you that this makes sense.
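As a quick sanity check of the formula $\nabla f(x) = Ax - b$, here is a minimal NumPy sketch (the matrix, the vector and the test point below are my own illustrative choices, not taken from the article) that compares the analytic gradient of a small quadratic function with a finite-difference approximation:

```python
import numpy as np

# A small symmetric matrix and a right-hand side (illustrative values).
A = np.array([[3.0, 1.0],
              [1.0, 2.0]])
b = np.array([1.0, -1.0])

def f(x):
    # Quadratic function f(x) = 1/2 x^T A x - b^T x
    return 0.5 * x @ A @ x - b @ x

def grad_f(x):
    # Analytic gradient of the quadratic function: A x - b
    return A @ x - b

x = np.array([0.7, -0.3])
eps = 1e-6
# Central finite differences as an independent check of the analytic gradient.
num_grad = np.array([(f(x + eps * e) - f(x - eps * e)) / (2 * eps) for e in np.eye(2)])
print(grad_f(x), num_grad)  # the two vectors agree up to ~1e-9
```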
Useful gradient properties
Well, we seem to have figured out that if a function is differentiable (has derivatives with respect to all variables), then at a minimum point the gradient must be zero. But does the gradient carry any useful information when it is nonzero?
Let's try to solve a simpler problem for now: given a point $x$, find a point $\bar{x}$ such that $f(\bar{x}) < f(x)$. Take a point near $x$, again using the linear approximation $f(\bar{x}) \approx f(x) + \nabla f(x)^T(\bar{x} - x)$. If we take $\bar{x} = x - \alpha \nabla f(x)$ with $\alpha > 0$, we get

$$f(\bar{x}) \approx f(x) - \alpha \|\nabla f(x)\|^2 < f(x).$$
Similarly, if $\alpha < 0$, then $f(\bar{x})$ will be greater than $f(x)$ (here and below $\|x\| = \sqrt{x_1^2 + x_2^2 + \ldots + x_n^2}$). Again, since we used an approximation, these considerations are valid only for small $\alpha$. Summarizing the above: if $\nabla f(x) \neq 0$, the gradient indicates the direction of greatest local increase of the function.
Here are two examples for two-dimensional functions. Such pictures can often be seen in demonstrations of gradient descent. The colored lines are the so-called level lines — sets of points at which the function takes a fixed value; in my case these are circles and ellipses. I marked the level lines with smaller values in blue and those with larger values in red.


Note that for a surface given by an equation of the form $f(x) = c$, the gradient $\nabla f(x)$ gives the normal (colloquially, the perpendicular) to this surface. Also note that even though the gradient points in the direction of the greatest increase of the function, there is no guarantee that a minimum can be found in the direction opposite to the gradient (the left picture, for example).
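This property is easy to observe numerically. Below is a tiny sketch (the function and the step size are my own illustrative choices): a small step against the gradient decreases the value of the function, and a step along the gradient increases it:

```python
import numpy as np

def f(x):
    # An anisotropic quadratic "bowl"; its level lines are ellipses.
    return x[0]**2 + 5 * x[1]**2

def grad_f(x):
    return np.array([2 * x[0], 10 * x[1]])

x = np.array([1.0, 1.0])
alpha = 0.05
g = grad_f(x)
print(f(x - alpha * g), f(x), f(x + alpha * g))
# Expected ordering: f(x - alpha*g) < f(x) < f(x + alpha*g)
```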
Gradient descent
Only a small step remains before the basic gradient descent method: we have learned how to obtain from a point $x$ a point $\bar{x}$ with a smaller value of the function $f$. What prevents us from repeating this several times? That is, in essence, gradient descent: build the sequence

$$x_{k+1} = x_k - \alpha_k \nabla f(x_k).$$
The quantity $\alpha_k$ is called the step size (in machine learning, the learning rate). A few words about its choice: if $\alpha_k$ is very small, the sequence changes slowly, which makes the algorithm not very effective; if $\alpha_k$ is very large, the linear approximation becomes poor, and perhaps even invalid. In practice the step size is often chosen empirically; in theory a Lipschitz gradient is usually assumed, namely, if

$$\|\nabla f(x) - \nabla f(y)\| \leq L \|x - y\|$$

for all $x, y$, then $\alpha_k < \frac{2}{L}$ guarantees that $f(x_k)$ decreases.
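To make this concrete, here is a minimal gradient-descent loop on the quadratic example from above (a sketch with my own choices: for a quadratic, the Lipschitz constant of the gradient is $L = \lambda_{\max}(A)$, and the stopping tolerance is arbitrary):

```python
import numpy as np

A = np.array([[3.0, 1.0],
              [1.0, 2.0]])
b = np.array([1.0, -1.0])

def grad_f(x):
    # Gradient of f(x) = 1/2 x^T A x - b^T x
    return A @ x - b

L = np.linalg.eigvalsh(A).max()   # Lipschitz constant of the gradient
alpha = 1.0 / L                   # any fixed step below 2/L guarantees decrease

x = np.zeros(2)
for k in range(1000):
    g = grad_f(x)
    if np.linalg.norm(g) < 1e-10:  # stop once the gradient is (almost) zero
        break
    x = x - alpha * g

print(x, np.linalg.solve(A, b))   # gradient descent reproduces the solution of Ax = b
```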
Analysis for quadratic functions
If $A$ is a symmetric invertible matrix and $Ax^* = b$, then for the quadratic function $f(x) = \frac{1}{2} x^T A x - b^T x + c$ the point $x^*$ is a minimum point (UPD: provided that this minimum exists at all — $f$ does not take arbitrarily small values, going to $-\infty$, only if $A$ is positive definite), and for the gradient descent method we obtain the following:

$$x_{k+1} - x^* = x_k - \alpha_k \nabla f(x_k) - x^* = x_k - \alpha_k (A x_k - b) - x^* = (x_k - x^*) - \alpha_k A (x_k - x^*) = (I - \alpha_k A)(x_k - x^*),$$
where $I$ is the identity matrix, i.e. $Ix = x$ for all $x$. If $\alpha_k \equiv \alpha$, this gives

$$\|x_k - x^*\| = \|(I - \alpha A)^k (x_0 - x^*)\| \leq \|I - \alpha A\|^k \|x_0 - x^*\|.$$
The expression on the left is the distance from the approximation obtained at step $k$ of gradient descent to the minimum point; on the right stands an expression of the form $\lambda^k \beta$, which converges to zero if $|\lambda| < 1$ (the condition on $\alpha$ that I wrote in the previous paragraph guarantees exactly this). This basic estimate ensures that gradient descent converges.
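This contraction can be observed directly. Below is a small sketch (same illustrative $A$, $b$ as above, with an arbitrary step and starting point of my choosing) checking that the error shrinks at each step by at least the factor $\|I - \alpha A\|$:

```python
import numpy as np

A = np.array([[3.0, 1.0],
              [1.0, 2.0]])
b = np.array([1.0, -1.0])
x_star = np.linalg.solve(A, b)

alpha = 0.25                      # a step satisfying alpha < 2 / lambda_max(A)
contraction = np.linalg.norm(np.eye(2) - alpha * A, 2)   # spectral norm ||I - alpha*A||

x = np.array([5.0, -7.0])
for k in range(5):
    x_next = x - alpha * (A @ x - b)
    ratio = np.linalg.norm(x_next - x_star) / np.linalg.norm(x - x_star)
    print(ratio, "<=", contraction)   # the observed ratio never exceeds the bound
    x = x_next
```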
Gradient Descent Modifications
Now I would like to tell you a little about frequently used modifications of the gradient descent, first of all the so-called
Inertial or accelerated gradient methods
All methods of this class can be expressed in the following form:

$$x_{k+1} = x_k - \alpha_k \nabla f(x_k) + \beta_k (x_k - x_{k-1}).$$
The last term characterizes this very “inertia”: at each step the algorithm tries to move against the gradient, but at the same time, by inertia, it partially moves in the same direction as at the previous iteration. Such methods have two important properties:
- They practically do not complicate the usual gradient descent computationally.
- With a careful choice of $\alpha_k, \beta_k$, such methods are an order of magnitude faster than ordinary gradient descent, even with an optimally chosen step.
One of the first such methods appeared in the middle of the 20th century and was called the heavy ball method, a name that conveys the inertial nature of the method: in this method $\alpha_k, \beta_k$ do not depend on $k$ and are carefully chosen depending on the objective function. It is worth noting that $\alpha_k$ can be anything, while $\beta_k$ is usually only slightly less than one.
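A minimal sketch of the heavy-ball update (the coefficients $\alpha = 0.01$, $\beta = 0.9$ and the ill-conditioned quadratic below are my own illustrative choices, not the carefully tuned values mentioned above); note how small the change to the plain gradient step is:

```python
import numpy as np

def grad_f(x):
    # Gradient of the ill-conditioned quadratic f(x) = (x_0^2 + 50 * x_1^2) / 2 (illustrative).
    return np.array([1.0, 50.0]) * x

alpha, beta = 0.01, 0.9           # step size and inertia coefficient (hand-picked here)
x_prev = x = np.array([10.0, 1.0])
for k in range(500):
    x_next = x - alpha * grad_f(x) + beta * (x - x_prev)
    x_prev, x = x, x_next

print(x)   # close to the minimizer (0, 0)
```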
The heavy ball method is the simplest inertial method, but not the first one. At the same time, in my opinion, the very first method of this kind is quite important for understanding the essence of these methods.
Chebyshev method
Yes, yes: the first method of this type was invented by Chebyshev for solving systems of linear equations. At some point in the analysis of gradient descent the following equality was obtained:

$$x_{k+1} - x^* = (I - \alpha_k A)(x_k - x^*) = \ldots = (I - \alpha_k A)(I - \alpha_{k-1} A) \cdots (I - \alpha_1 A)(x_0 - x^*) = P_k(A)(x_0 - x^*),$$
where $P_k$ is some polynomial of degree $k$. Why not try to choose the $\alpha_k$ so that $P_k(A)(x_0 - x^*)$ is as small as possible? One universal family of polynomials that deviate least from zero is the Chebyshev polynomials. Chebyshev's method essentially consists in selecting the descent parameters so that $P_k$ is a Chebyshev polynomial. There is really one small problem: for ordinary gradient descent this is simply impossible. However, for inertial methods it is possible. This is mainly due to the fact that Chebyshev polynomials satisfy the second-order recurrence relation

$$T_{n+1}(x) = 2x T_n(x) - T_{n-1}(x),$$
and therefore cannot be constructed for gradient descent, which computes a new value from only one previous value; for inertial methods this becomes possible precisely because two previous values are used. It turns out that the complexity of computing $\alpha_k, \beta_k$ depends neither on $k$ nor on the dimension of the space $n$.
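Here is a tiny, purely illustrative check of that recurrence (not of the full Chebyshev iteration): Chebyshev polynomials satisfy $T_n(\cos\theta) = \cos(n\theta)$, and the three-term recurrence reproduces exactly that:

```python
import numpy as np

def chebyshev(n, x):
    # Evaluate T_n(x) via the recurrence T_{n+1}(x) = 2 x T_n(x) - T_{n-1}(x).
    t_prev, t = np.ones_like(x), x
    if n == 0:
        return t_prev
    for _ in range(n - 1):
        t_prev, t = t, 2 * x * t - t_prev
    return t

x = np.linspace(-1, 1, 5)
print(chebyshev(4, x))
print(np.cos(4 * np.arccos(x)))   # the two rows coincide
```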
Conjugate Gradient Method
Another very interesting and important fact (a consequence of the Cayley–Hamilton theorem): for any square matrix $A$ of size $n \times n$ there exists a polynomial $P$ of degree at most $n$ for which $P(A) = 0$. Why is this interesting? It's all about the equality

$$x_{k+1} - x^* = P_k(A)(x_0 - x^*).$$
If we could select the step sizes in gradient descent so as to obtain exactly this annihilating polynomial, then gradient descent would converge in a fixed number of iterations, no larger than the dimension of $A$. As we have found out, we cannot do this for gradient descent. Fortunately, for inertial methods we can. The description and justification of the method are quite technical, so I will confine myself to the essence: at each iteration, the parameters are selected that give the best polynomial that can be constructed from all the gradient measurements made before the current step. At the same time:
- One iteration of gradient descent (not counting the computation of the parameters) involves one matrix-vector multiplication and 2–3 vector additions.
- Computing the parameters also requires 1–2 matrix-vector multiplications, 2–3 scalar products of vectors, and several vector additions.
The most computationally expensive part is the matrix-vector multiplication, which is usually done in $\mathcal{O}(n^2)$ time; however, with a specialized implementation this can be done in $\mathcal{O}(m)$, where $m$ is the number of nonzero elements of $A$. Given that the conjugate gradient method converges in at most $n$ iterations, we obtain an overall complexity of $\mathcal{O}(nm)$, which is in all cases no worse than $\mathcal{O}(n^3)$ for the Gaussian or Cholesky method, and much better when $m \ll n^2$, which is not so rare.
The conjugate gradient method also works well when $f$ is not a quadratic function, but then it does not converge in a finite number of steps and often requires minor additional modifications.
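For reference, here is a compact textbook sketch of the conjugate gradient method for the quadratic case (a standard formulation, not taken from the article; the small test problem is my own). Note that each iteration performs exactly one matrix-vector product:

```python
import numpy as np

def conjugate_gradient(A, b, x0=None, tol=1e-10):
    # Minimizes 1/2 x^T A x - b^T x for a symmetric positive definite A,
    # i.e. solves A x = b, in at most n iterations (in exact arithmetic).
    x = np.zeros_like(b) if x0 is None else x0.copy()
    r = b - A @ x               # residual = minus the gradient
    p = r.copy()                # initial search direction
    rs = r @ r
    for _ in range(len(b)):
        Ap = A @ p                        # the only matrix-vector product per iteration
        alpha = rs / (p @ Ap)             # exact line search along p
        x += alpha * p
        r -= alpha * Ap
        rs_new = r @ r
        if np.sqrt(rs_new) < tol:
            break
        p = r + (rs_new / rs) * p         # new A-conjugate direction
        rs = rs_new
    return x

A = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([1.0, -1.0])
print(conjugate_gradient(A, b), np.linalg.solve(A, b))   # the two answers coincide
```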
Nesterov method
In the communities of mathematical optimization and machine learning, the surname “Nesterov” has long been a household name. In the 1980s, Yu. E. Nesterov came up with an interesting variant of the inertial method, which has the form

$$x_{k+1} = x_k - \alpha_k \nabla f(x_k + \beta_k (x_k - x_{k-1})) + \beta_k (x_k - x_{k-1}).$$

It does not require any complex computation of $\alpha_k, \beta_k$, as in the conjugate gradient method; in general, the behavior of the method is similar to the heavy ball method, but its convergence is usually much more reliable both in theory and in practice.
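A minimal sketch of this update (the constants $\alpha = 0.01$, $\beta = 0.9$ and the test function are my own illustrative choices): the only difference from the heavy-ball sketch above is that the gradient is evaluated at the “look-ahead” point:

```python
import numpy as np

def grad_f(x):
    # Same ill-conditioned quadratic as in the heavy-ball sketch (illustrative).
    return np.array([1.0, 50.0]) * x

alpha, beta = 0.01, 0.9
x_prev = x = np.array([10.0, 1.0])
for k in range(500):
    lookahead = x + beta * (x - x_prev)             # the point ahead of the current iterate
    x_next = lookahead - alpha * grad_f(lookahead)  # gradient step taken from the look-ahead point
    x_prev, x = x, x_next

print(x)   # close to the minimizer (0, 0)
```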
Stochastic Gradient Descent
The only formal difference from ordinary gradient descent is that instead of the gradient it uses a function $g(x, \theta)$ such that $\mathbb{E}_\theta\, g(x, \theta) = \nabla f(x)$ (here $\mathbb{E}_\theta$ is the expectation over the random $\theta$), so stochastic gradient descent has the form

$$x_{k+1} = x_k - \alpha_k g(x_k, \theta_k).$$
Here $\theta_k$ is some random parameter that we do not control, but on average we still move against the gradient. As an example, consider the functions

$$f(x) = \frac{1}{2m}\sum_{j=1}^{m} \|x - y_j\|^2, \qquad \nabla f(x) = \frac{1}{m}\sum_{j=1}^{m} (x - y_j)$$

and

$$g(x, i) = x - y_i.$$
If $i$ takes the values $1, \ldots, m$ with equal probability, then on average $g$ is exactly the gradient of $f$. This example is also indicative in the following sense: the cost of computing the gradient is $m$ times the cost of computing $g$. This allows stochastic gradient descent to do $m$ times more iterations in the same time. Even though stochastic gradient descent usually converges more slowly than the ordinary one, such a large increase in the number of iterations makes it possible to improve the rate of convergence per unit of time. As far as I know, at the moment stochastic gradient descent is the basic method for training most neural networks and is implemented in all the main ML libraries: tensorflow, torch, caffe, CNTK, etc.
It is worth noting that the ideas of inertial methods are also used for stochastic gradient descent and in practice often give an improvement; in theory it is usually assumed that the asymptotic rate of convergence does not change, because the main error in stochastic gradient descent is due to the variance of $g$.
Subgradient descent
This variation allows working with non-differentiable functions; I will describe it in more detail. We will have to recall the linear approximation again — the point is that there is a simple characterization of convexity in terms of the gradient: a differentiable function $f$ is convex if and only if $f(y) \geq f(x) + \nabla f(x)^T(y - x)$ holds for all $x, y$. It turns out that a convex function need not be differentiable, but for any point $x$ there always exists a vector $g$ such that $f(y) \geq f(x) + g^T(y - x)$ for all $y$. Such a vector $g$ is called a subgradient of $f$ at the point $x$; the set of all subgradients at the point $x$ is called the subdifferential at $x$ and is denoted $\partial f(x)$ (despite the notation, it has nothing to do with partial derivatives). In the one-dimensional case $g$ is a number, and the above property simply means that the graph of $f$ lies above the straight line passing through $(x, f(x))$ with slope $g$ (see the pictures below). Note that there may be several subgradients at one point, even infinitely many.


Computing at least one subgradient at a point is usually not very difficult; subgradient descent essentially uses a subgradient instead of the gradient. It turns out that this is sufficient. In theory the rate of convergence decreases, but, for example, in neural networks the non-differentiable function $ReLU(x) = \max(0, x)$ is popular precisely because training goes faster with it (this, by the way, is an example of a non-convex non-differentiable function for which (sub)gradient descent is successfully applied; the function $ReLU$ itself is convex, but a multilayer neural network containing $ReLU$ is non-convex and non-differentiable). As an example, for the function $f(x) = |x|$ the subdifferential is very simple:

$$\partial f(x) = \begin{cases} 1, & x > 0, \\ -1, & x < 0, \\ [-1, 1], & x = 0. \end{cases}$$
Perhaps the last important thing to know is that subgradient descent does not converge with a constant step size. This is easiest to see with the above function $f(x) = |x|$. Even the absence of a derivative at a single point breaks the convergence:
- Suppose we start from a point $x_0$.
- A subgradient descent step (with step size 1) is

  $$x_{k+1} = \begin{cases} x_k - 1, & x_k > 0, \\ x_k + 1, & x_k < 0, \\ ??? & x_k = 0. \end{cases}$$
- If $x_0 > 0$, then for the first few steps we subtract one; if $x_0 < 0$, we add one. One way or another, at some point we land in the interval $[0, 1)$, from which we get into $[-1, 0)$, and then we jump between two points of these intervals.
In theory, for subgradient descent it is recommended to take the sequence of steps

$$\alpha_k = \frac{1}{(k+1)^c},$$

where $c$ is usually $1$ or $\frac{1}{2}$. In practice I have often seen the successful use of steps $\alpha_k = e^{-ck}$, although for such steps there is, generally speaking, no convergence guarantee.
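A minimal sketch of both behaviors for $f(x) = |x|$ (the starting point and the step values are my own illustrative choices): with a constant step the iterates end up jumping around the minimum, while the decaying step $\alpha_k = \frac{1}{k+1}$ drives them toward it:

```python
def subgrad(x):
    # A subgradient of f(x) = |x|; at x = 0 any value in [-1, 1] works, we pick 0.
    return 1.0 if x > 0 else (-1.0 if x < 0 else 0.0)

# Constant step: the iterates get stuck oscillating around the minimum.
x = 3.7
for k in range(100):
    x -= 0.5 * subgrad(x)
print("constant step:", x)    # ends up jumping within ~0.5 of zero

# Decaying step alpha_k = 1 / (k + 1): approaches the minimum at 0.
x = 3.7
for k in range(100):
    x -= subgrad(x) / (k + 1)
print("decaying step:", x)    # much closer to zero
```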
Proximal methods
Unfortunately, I don't know a good translation for “proximal” in the context of optimization, so I will simply call the method that. Proximal methods appeared as a generalization of projected gradient methods. The idea is very simple: if the function $f$ can be represented as a sum $f(x) = \varphi(x) + h(x)$, where $\varphi$ is a differentiable convex function and $h(x)$ is a convex function for which there exists a special proximal operator $prox_h(x)$ (in this article I will confine myself to examples and will not describe it in general), then the convergence properties of gradient descent for $\varphi$ are preserved for gradient descent for $f$ if, after each iteration, this proximal operator is applied to the current point $x_k$. In other words, the general form of the proximal method looks like this:

$$x_{k+1} = prox_{\alpha_k h}\bigl(x_k - \alpha_k \nabla \varphi(x_k)\bigr).$$
I think that for now it is completely unclear why this might be needed, especially since I have not explained what a proximal operator is. Here are two examples:
- $h(x)$ is the indicator function of a convex set $\mathcal{K}$, i.e.

  $$h(x) = \begin{cases} 0, & x \in \mathcal{K}, \\ +\infty, & x \notin \mathcal{K}. \end{cases}$$

  In this case $prox_{\alpha_k h}(x)$ is the projection onto the set $\mathcal{K}$, that is, “the point of the set $\mathcal{K}$ closest to $x$”. Thus, we restrict gradient descent to the set $\mathcal{K}$, which makes it possible to solve problems with constraints. Unfortunately, computing the projection in the general case can be even harder, so this method is usually applied when the constraints are simple, for example the so-called box constraints: for each coordinate

  $$l_i \leq x_i \leq r_i.$$
- $h(x) = \lambda \|x\|_1 = \lambda \sum_{i=1}^{n} |x_i|$ is the $\ell_1$-regularization. Such a term is often added to optimization problems in machine learning to avoid overfitting. Regularization of this kind also tends to zero out the least significant components. For such a function the proximal operator has the form (the expression for one coordinate is written out below)

  $$[prox_{\alpha h}(x)]_i = \begin{cases} x_i - \alpha\lambda, & x_i > \alpha\lambda, \\ x_i + \alpha\lambda, & x_i < -\alpha\lambda, \\ 0, & x_i \in [-\alpha\lambda, \alpha\lambda], \end{cases}$$

  which is quite simple to compute (a small sketch follows this list).
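Here is a minimal sketch of that proximal step, known as soft-thresholding, applied after an ordinary gradient step (the tiny least-squares problem, $\lambda$ and the number of iterations below are my own illustrative choices):

```python
import numpy as np

def prox_l1(x, threshold):
    # Coordinate-wise soft-thresholding: the proximal operator of threshold * ||x||_1.
    return np.sign(x) * np.maximum(np.abs(x) - threshold, 0.0)

# Illustrative problem: min_x 1/2 ||M x - y||^2 + lam * ||x||_1 (a tiny "lasso").
rng = np.random.default_rng(0)
M = rng.normal(size=(20, 5))
y = M @ np.array([2.0, 0.0, 0.0, -1.0, 0.0]) + 0.01 * rng.normal(size=20)
lam = 1.0

alpha = 1.0 / np.linalg.norm(M, 2) ** 2          # step 1/L, where L = ||M||^2
x = np.zeros(5)
for k in range(500):
    grad = M.T @ (M @ x - y)                     # gradient of the smooth part
    x = prox_l1(x - alpha * grad, alpha * lam)   # gradient step, then the proximal step

print(x)   # the zero coefficients of the true vector are driven (almost) exactly to zero
```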
Conclusion
This concludes the list of the main variations of the gradient method known to me. Perhaps, in closing, I would note that all of these modifications (except perhaps the conjugate gradient method) can easily interact with one another. I deliberately did not include Newton's method and quasi-Newton methods (BFGS and others) in this list: although they use the gradient, they are more complex methods and require specific additional computations, which are usually more expensive than computing the gradient. Nevertheless, if this text is in demand, I will gladly make a similar review of them.
Used / recommended literature
- Boyd S., Vandenberghe L. Convex Optimization
- Shewchuk J. R. An Introduction to the Conjugate Gradient Method Without the Agonizing Pain
- Bertsekas D. P. Convex Optimization Theory
- Nesterov Yu. E. Convex optimization methods
- Gasnikov A. Universal Gradient Descent