Foreword
This article discusses methods for solving mathematical optimization problems that are based on using the gradient of a function. The main goal is to collect in one article the most important ideas that are in one way or another connected with this method and its various modifications.
UPD. In the comments readers write that the formulas are not displayed in some browsers and in the mobile application. Unfortunately, I do not know how to deal with this. I can only say that I used the “inline” and “display” macros of the Habr editor. If you happen to know how to fix it, please write in the comments.
Note from the author
At the time of writing this article I was defending a thesis whose task required of me a deep understanding of the theoretical foundations of mathematical optimization methods. Nevertheless, my eyes still glazed over (like everyone's) at the terribly long formulas, so I spent a lot of time isolating the key ideas that characterize the different variations of gradient methods. My personal goal is to write an article containing the minimum amount of information necessary for a basic understanding of the subject. But be prepared: one cannot do without formulas anyway.
Formulation of the problem
Before describing the method, we must first describe the problem, namely: “Given a set $\mathcal{K}$ and a function $f : \mathcal{K} \rightarrow \mathbb{R}$, find a point $x^* \in \mathcal{K}$ such that $f(x) \geq f(x^*)$ for all $x \in \mathcal{K}$,” which is usually written like this:

$$f(x) \rightarrow \min_{x \in \mathcal{K}}.$$
In theory it is usually assumed that $f$ is a differentiable and convex function and $\mathcal{K}$ is a convex set (even better if simply $\mathcal{K} = \mathbb{R}^n$); this allows us to give some guarantees that gradient descent will succeed. In practice, gradient descent is successfully applied even when the problem has none of these properties (an example appears later in the article).
A bit of math
Suppose for now that we just need to find the minimum of a one-dimensional function

$$f(x) \rightarrow \min_{x \in \mathbb{R}}.$$
As early as the 17th century, Pierre Fermat came up with a criterion that allows simple optimization problems to be solved, namely: if $x^*$ is a minimum point of $f$, then

$$f'(x^*) = 0,$$

where $f'$ is the derivative of $f$. This criterion is based on the linear approximation

$$f(x) \approx f(x^*) + f'(x^*)(x - x^*).$$
The closer $x$ is to $x^*$, the more precise this approximation is. On the right-hand side stands an expression that, when $f'(x^*) \neq 0$, can be both greater and less than $f(x^*)$ — this is the main essence of the criterion. In the multidimensional case, similarly, from the linear approximation $f(x) \approx f(x^*) + \nabla f(x^*)^T(x - x^*)$ (here and below $x^T y = \sum_{i=1}^{n} x_i y_i$ is the standard scalar product; this form of writing reflects the fact that the scalar product is the same as the matrix product of a row vector and a column vector) we obtain

$$\nabla f(x^*) = 0.$$
The quantity $\nabla f(x^*)$ is the gradient of the function $f$ at the point $x^*$. The equality of the gradient to zero also means that all partial derivatives are zero, so in the multidimensional case this criterion can be obtained simply by applying the one-dimensional criterion to each variable separately.
It is worth noting that these conditions are necessary but not sufficient; the simplest example is the point $0$ for $f(x) = x^2$ and for $f(x) = x^3$.

This criterion is sufficient in the case of a convex function; largely because of this, so many results have been obtained specifically for convex functions.
Quadratic functions
Quadratic functions in $\mathbb{R}^n$ are functions of the form

$$f(x) = f(x_1, x_2, \ldots, x_n) = \frac{1}{2}\sum_{i,j=1}^{n} a_{ij} x_i x_j - \sum_{i=1}^{n} b_i x_i + c.$$
To save space (and to fuss less with indices), this function is usually written in matrix form:

$$f(x) = \frac{1}{2} x^T A x - b^T x + c,$$

where $x = (x_1, \ldots, x_n)^T$, $b = (b_1, \ldots, b_n)^T$, and $A$ is the matrix whose entry at the intersection of the $i$-th row and the $j$-th column is $\frac{1}{2}(a_{ij} + a_{ji})$ ($A$ thereby turns out to be symmetric — this is important). Further on, when mentioning a quadratic function, I will mean the function above.
Why am I talking about this? The fact is that quadratic functions are important in optimization for two reasons:
- They also occur in practice, for example, when constructing a linear regression using the least squares method
- The gradient of a quadratic function is a linear function; in particular, for the function above

$$\frac{\partial}{\partial x_i} f(x_1, x_2, \ldots, x_n) = a_{ii} x_i + \sum_{j \neq i} \frac{1}{2}(a_{ij} + a_{ji}) x_j - b_i,$$

or in matrix form

$$\nabla f(x) = Ax - b.$$
Thus the system $\nabla f(x) = 0$ is a linear system, and there simply is no system simpler than a linear one. The idea I have been trying to get to: optimization of a quadratic function is the simplest class of optimization problems. On the other hand, the fact that $\nabla f(x^*) = 0$ is a necessary minimum condition makes it possible to solve linear systems through optimization problems. A little later I will try to convince you that this makes sense.
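As a quick sanity check of the formula $\nabla f(x) = Ax - b$, here is a minimal NumPy sketch (the matrix, the vector and the test point below are my own illustrative choices, not taken from the article) that compares the analytic gradient of a small quadratic function with a finite-difference approximation:

```python
import numpy as np

# A small symmetric matrix and a right-hand side (illustrative values).
A = np.array([[3.0, 1.0],
              [1.0, 2.0]])
b = np.array([1.0, -1.0])

def f(x):
    # Quadratic function f(x) = 1/2 x^T A x - b^T x
    return 0.5 * x @ A @ x - b @ x

def grad_f(x):
    # Analytic gradient of the quadratic function: A x - b
    return A @ x - b

x = np.array([0.7, -0.3])
eps = 1e-6
# Central finite differences as an independent check of the analytic gradient.
num_grad = np.array([(f(x + eps * e) - f(x - eps * e)) / (2 * eps) for e in np.eye(2)])
print(grad_f(x), num_grad)  # the two vectors agree up to ~1e-9
```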
Useful gradient properties
Well, we seem to have figured out that if a function is differentiable (has derivatives with respect to all variables), then at a minimum point the gradient must be zero. But does the gradient carry any useful information when it is nonzero?
Let's try to solve a simpler problem for now: given a point $x$, find a point $\bar{x}$ such that $f(\bar{x}) < f(x)$. Take a point near $x$, again using the linear approximation $f(\bar{x}) \approx f(x) + \nabla f(x)^T(\bar{x} - x)$. If we take $\bar{x} = x - \alpha \nabla f(x)$ with $\alpha > 0$, we get

$$f(\bar{x}) \approx f(x) - \alpha \|\nabla f(x)\|^2 < f(x).$$
Similarly, if $\alpha < 0$, then $f(\bar{x})$ will be greater than $f(x)$ (here and below $\|x\| = \sqrt{x_1^2 + x_2^2 + \ldots + x_n^2}$). Again, since we used an approximation, these considerations are valid only for small $\alpha$. Summarizing the above: if $\nabla f(x) \neq 0$, the gradient indicates the direction of greatest local increase of the function.
Here are two examples for two-dimensional functions. Such pictures can often be seen in demonstrations of gradient descent. The colored lines are the so-called level lines — sets of points at which the function takes a fixed value; in my case these are circles and ellipses. I marked the level lines with smaller values in blue and those with larger values in red.


Note that for a surface given by an equation of the form $f(x) = c$, the gradient $\nabla f(x)$ gives the normal (colloquially, the perpendicular) to this surface. Also note that even though the gradient points in the direction of the greatest increase of the function, there is no guarantee that a minimum can be found in the direction opposite to the gradient (the left picture, for example).
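This property is easy to observe numerically. Below is a tiny sketch (the function and the step size are my own illustrative choices): a small step against the gradient decreases the value of the function, and a step along the gradient increases it:

```python
import numpy as np

def f(x):
    # An anisotropic quadratic "bowl"; its level lines are ellipses.
    return x[0]**2 + 5 * x[1]**2

def grad_f(x):
    return np.array([2 * x[0], 10 * x[1]])

x = np.array([1.0, 1.0])
alpha = 0.05
g = grad_f(x)
print(f(x - alpha * g), f(x), f(x + alpha * g))
# Expected ordering: f(x - alpha*g) < f(x) < f(x + alpha*g)
```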
Gradient descent
Only a small step remains before the basic gradient descent method: we have learned how to obtain from a point $x$ a point $\bar{x}$ with a smaller value of the function $f$. What prevents us from repeating this several times? That is, in essence, gradient descent: build the sequence

$$x_{k+1} = x_k - \alpha_k \nabla f(x_k).$$
The quantity $\alpha_k$ is called the step size (in machine learning, the learning rate). A few words about its choice: if $\alpha_k$ is very small, the sequence changes slowly, which makes the algorithm not very effective; if $\alpha_k$ is very large, the linear approximation becomes poor, and perhaps even invalid. In practice the step size is often chosen empirically; in theory a Lipschitz gradient is usually assumed, namely, if

$$\|\nabla f(x) - \nabla f(y)\| \leq L \|x - y\|$$

for all $x, y$, then $\alpha_k < \frac{2}{L}$ guarantees that $f(x_k)$ decreases.
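To make this concrete, here is a minimal gradient-descent loop on the quadratic example from above (a sketch with my own choices: for a quadratic, the Lipschitz constant of the gradient is $L = \lambda_{\max}(A)$, and the stopping tolerance is arbitrary):

```python
import numpy as np

A = np.array([[3.0, 1.0],
              [1.0, 2.0]])
b = np.array([1.0, -1.0])

def grad_f(x):
    # Gradient of f(x) = 1/2 x^T A x - b^T x
    return A @ x - b

L = np.linalg.eigvalsh(A).max()   # Lipschitz constant of the gradient
alpha = 1.0 / L                   # any fixed step below 2/L guarantees decrease

x = np.zeros(2)
for k in range(1000):
    g = grad_f(x)
    if np.linalg.norm(g) < 1e-10:  # stop once the gradient is (almost) zero
        break
    x = x - alpha * g

print(x, np.linalg.solve(A, b))   # gradient descent reproduces the solution of Ax = b
```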
Analysis for quadratic functions
If $A$ is a symmetric invertible matrix and $Ax^* = b$, then for the quadratic function $f(x) = \frac{1}{2} x^T A x - b^T x + c$ the point $x^*$ is a minimum point (UPD: provided that this minimum exists at all — $f$ does not take arbitrarily small values, going to $-\infty$, only if $A$ is positive definite), and for the gradient descent method we obtain the following:

$$x_{k+1} - x^* = x_k - \alpha_k \nabla f(x_k) - x^* = x_k - \alpha_k (A x_k - b) - x^* = (x_k - x^*) - \alpha_k A (x_k - x^*) = (I - \alpha_k A)(x_k - x^*),$$
where $I$ is the identity matrix, i.e. $Ix = x$ for all $x$. If $\alpha_k \equiv \alpha$, this gives

$$\|x_k - x^*\| = \|(I - \alpha A)^k (x_0 - x^*)\| \leq \|I - \alpha A\|^k \|x_0 - x^*\|.$$
The expression on the left is the distance from the approximation obtained at step $k$ of gradient descent to the minimum point; on the right stands an expression of the form $\lambda^k \beta$, which converges to zero if $|\lambda| < 1$ (the condition on $\alpha$ that I wrote in the previous paragraph guarantees exactly this). This basic estimate ensures that gradient descent converges.
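This contraction can be observed directly. Below is a small sketch (same illustrative $A$, $b$ as above, with an arbitrary step and starting point of my choosing) checking that the error shrinks at each step by at least the factor $\|I - \alpha A\|$:

```python
import numpy as np

A = np.array([[3.0, 1.0],
              [1.0, 2.0]])
b = np.array([1.0, -1.0])
x_star = np.linalg.solve(A, b)

alpha = 0.25                      # a step satisfying alpha < 2 / lambda_max(A)
contraction = np.linalg.norm(np.eye(2) - alpha * A, 2)   # spectral norm ||I - alpha*A||

x = np.array([5.0, -7.0])
for k in range(5):
    x_next = x - alpha * (A @ x - b)
    ratio = np.linalg.norm(x_next - x_star) / np.linalg.norm(x - x_star)
    print(ratio, "<=", contraction)   # the observed ratio never exceeds the bound
    x = x_next
```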
Gradient Descent Modifications
Now I would like to tell you a little about frequently used modifications of the gradient descent, first of all the so-called
Inertial or accelerated gradient methods
All methods of this class can be expressed in the following form:

$$x_{k+1} = x_k - \alpha_k \nabla f(x_k) + \beta_k (x_k - x_{k-1}).$$
The last term characterizes this very “inertia”: at each step the algorithm tries to move against the gradient, but at the same time, by inertia, it partially moves in the same direction as at the previous iteration. Such methods have two important properties:
- They practically do not complicate the usual gradient descent computationally.
- With a careful choice of $\alpha_k, \beta_k$, such methods are an order of magnitude faster than ordinary gradient descent, even with an optimally chosen step.
One of the first such methods appeared in the middle of the 20th century and was called the heavy ball method, a name that conveys the inertial nature of the method: in this method $\alpha_k, \beta_k$ do not depend on $k$ and are carefully chosen depending on the objective function. It is worth noting that $\alpha_k$ can be anything, while $\beta_k$ is usually only slightly less than one.
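A minimal sketch of the heavy-ball update (the coefficients $\alpha = 0.01$, $\beta = 0.9$ and the ill-conditioned quadratic below are my own illustrative choices, not the carefully tuned values mentioned above); note how small the change to the plain gradient step is:

```python
import numpy as np

def grad_f(x):
    # Gradient of the ill-conditioned quadratic f(x) = (x_0^2 + 50 * x_1^2) / 2 (illustrative).
    return np.array([1.0, 50.0]) * x

alpha, beta = 0.01, 0.9           # step size and inertia coefficient (hand-picked here)
x_prev = x = np.array([10.0, 1.0])
for k in range(500):
    x_next = x - alpha * grad_f(x) + beta * (x - x_prev)
    x_prev, x = x, x_next

print(x)   # close to the minimizer (0, 0)
```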
The heavy ball method is the simplest inertial method, but not the first one. At the same time, in my opinion, the very first method of this kind is quite important for understanding the essence of these methods.
Chebyshev method
Yes, yes: the first method of this type was invented by Chebyshev for solving systems of linear equations. At some point in the analysis of gradient descent the following equality was obtained:

$$x_{k+1} - x^* = (I - \alpha_k A)(x_k - x^*) = \ldots = (I - \alpha_k A)(I - \alpha_{k-1} A) \cdots (I - \alpha_1 A)(x_0 - x^*) = P_k(A)(x_0 - x^*),$$
where $P_k$ is some polynomial of degree $k$. Why not try to choose the $\alpha_k$ so that $P_k(A)(x_0 - x^*)$ is as small as possible? One universal family of polynomials that deviate least from zero is the Chebyshev polynomials. Chebyshev's method essentially consists in selecting the descent parameters so that $P_k$ is a Chebyshev polynomial. There is really one small problem: for ordinary gradient descent this is simply impossible. However, for inertial methods it is possible. This is mainly due to the fact that Chebyshev polynomials satisfy the second-order recurrence relation

$$T_{n+1}(x) = 2x T_n(x) - T_{n-1}(x),$$
and therefore cannot be constructed for gradient descent, which computes a new value from only one previous value; for inertial methods this becomes possible precisely because two previous values are used. It turns out that the complexity of computing $\alpha_k, \beta_k$ depends neither on $k$ nor on the dimension of the space $n$.
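Here is a tiny, purely illustrative check of that recurrence (not of the full Chebyshev iteration): Chebyshev polynomials satisfy $T_n(\cos\theta) = \cos(n\theta)$, and the three-term recurrence reproduces exactly that:

```python
import numpy as np

def chebyshev(n, x):
    # Evaluate T_n(x) via the recurrence T_{n+1}(x) = 2 x T_n(x) - T_{n-1}(x).
    t_prev, t = np.ones_like(x), x
    if n == 0:
        return t_prev
    for _ in range(n - 1):
        t_prev, t = t, 2 * x * t - t_prev
    return t

x = np.linspace(-1, 1, 5)
print(chebyshev(4, x))
print(np.cos(4 * np.arccos(x)))   # the two rows coincide
```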
Conjugate Gradient Method
Another very interesting and important fact (a consequence of the Cayley–Hamilton theorem): for any square matrix $A$ of size $n \times n$ there exists a polynomial $P$ of degree at most $n$ for which $P(A) = 0$. Why is this interesting? It's all about the equality

$$x_{k+1} - x^* = P_k(A)(x_0 - x^*).$$
If we could select the step sizes in gradient descent so as to obtain exactly this annihilating polynomial, then gradient descent would converge in a fixed number of iterations, no larger than the dimension of $A$. As we have found out, we cannot do this for gradient descent. Fortunately, for inertial methods we can. The description and justification of the method are quite technical, so I will confine myself to the essence: at each iteration, the parameters are selected that give the best polynomial that can be constructed from all the gradient measurements made before the current step. At the same time:
- One iteration of gradient descent (not counting the computation of the parameters) involves one matrix-vector multiplication and 2–3 vector additions.
- Computing the parameters also requires 1–2 matrix-vector multiplications, 2–3 scalar products of vectors, and several vector additions.
The most computationally expensive part is the matrix-vector multiplication, which is usually done in $\mathcal{O}(n^2)$ time; however, with a specialized implementation this can be done in $\mathcal{O}(m)$, where $m$ is the number of nonzero elements of $A$. Given that the conjugate gradient method converges in at most $n$ iterations, we obtain an overall complexity of $\mathcal{O}(nm)$, which is in all cases no worse than $\mathcal{O}(n^3)$ for the Gaussian or Cholesky method, and much better when $m \ll n^2$, which is not so rare.
The conjugate gradient method also works well when $f$ is not a quadratic function, but then it does not converge in a finite number of steps and often requires minor additional modifications.
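For reference, here is a compact textbook sketch of the conjugate gradient method for the quadratic case (a standard formulation, not taken from the article; the small test problem is my own). Note that each iteration performs exactly one matrix-vector product:

```python
import numpy as np

def conjugate_gradient(A, b, x0=None, tol=1e-10):
    # Minimizes 1/2 x^T A x - b^T x for a symmetric positive definite A,
    # i.e. solves A x = b, in at most n iterations (in exact arithmetic).
    x = np.zeros_like(b) if x0 is None else x0.copy()
    r = b - A @ x               # residual = minus the gradient
    p = r.copy()                # initial search direction
    rs = r @ r
    for _ in range(len(b)):
        Ap = A @ p                        # the only matrix-vector product per iteration
        alpha = rs / (p @ Ap)             # exact line search along p
        x += alpha * p
        r -= alpha * Ap
        rs_new = r @ r
        if np.sqrt(rs_new) < tol:
            break
        p = r + (rs_new / rs) * p         # new A-conjugate direction
        rs = rs_new
    return x

A = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([1.0, -1.0])
print(conjugate_gradient(A, b), np.linalg.solve(A, b))   # the two answers coincide
```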
Nesterov method
In the communities of mathematical optimization and machine learning, the surname “Nesterov” has long been a household name. In the 1980s, Yu. E. Nesterov came up with an interesting variant of the inertial method, which has the form

$$x_{k+1} = x_k - \alpha_k \nabla f(x_k + \beta_k (x_k - x_{k-1})) + \beta_k (x_k - x_{k-1}).$$

It does not require any complex computation of $\alpha_k, \beta_k$, as in the conjugate gradient method; in general, the behavior of the method is similar to the heavy ball method, but its convergence is usually much more reliable both in theory and in practice.
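A minimal sketch of this update (the constants $\alpha = 0.01$, $\beta = 0.9$ and the test function are my own illustrative choices): the only difference from the heavy-ball sketch above is that the gradient is evaluated at the “look-ahead” point:

```python
import numpy as np

def grad_f(x):
    # Same ill-conditioned quadratic as in the heavy-ball sketch (illustrative).
    return np.array([1.0, 50.0]) * x

alpha, beta = 0.01, 0.9
x_prev = x = np.array([10.0, 1.0])
for k in range(500):
    lookahead = x + beta * (x - x_prev)             # the point ahead of the current iterate
    x_next = lookahead - alpha * grad_f(lookahead)  # gradient step taken from the look-ahead point
    x_prev, x = x, x_next

print(x)   # close to the minimizer (0, 0)
```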
Stochastic Gradient Descent
The only formal difference from ordinary gradient descent is that instead of the gradient it uses a function $g(x, \theta)$ such that $\mathbb{E}_\theta\, g(x, \theta) = \nabla f(x)$ (here $\mathbb{E}_\theta$ is the expectation over the random $\theta$), so stochastic gradient descent has the form

$$x_{k+1} = x_k - \alpha_k g(x_k, \theta_k).$$
Here $\theta_k$ is some random parameter that we do not control, but on average we still move against the gradient. As an example, consider the functions

$$f(x) = \frac{1}{2m}\sum_{j=1}^{m} \|x - y_j\|^2, \qquad \nabla f(x) = \frac{1}{m}\sum_{j=1}^{m} (x - y_j)$$

and

$$g(x, i) = x - y_i.$$
If $i$ takes the values $1, \ldots, m$ with equal probability, then on average $g$ is exactly the gradient of $f$. This example is also indicative in the following sense: the cost of computing the gradient is $m$ times the cost of computing $g$. This allows stochastic gradient descent to do $m$ times more iterations in the same time. Even though stochastic gradient descent usually converges more slowly than the ordinary one, such a large increase in the number of iterations makes it possible to improve the rate of convergence per unit of time. As far as I know, at the moment stochastic gradient descent is the basic method for training most neural networks and is implemented in all the main ML libraries: tensorflow, torch, caffe, CNTK, etc.
It is worth noting that the ideas of inertial methods are also used for stochastic gradient descent and in practice often give an improvement; in theory it is usually assumed that the asymptotic rate of convergence does not change, because the main error in stochastic gradient descent is due to the variance of $g$.
Subgradient descent
This variation allows working with non-differentiable functions; I will describe it in more detail. We will have to recall the linear approximation again — the point is that there is a simple characterization of convexity in terms of the gradient: a differentiable function $f$ is convex if and only if $f(y) \geq f(x) + \nabla f(x)^T(y - x)$ holds for all $x, y$. It turns out that a convex function need not be differentiable, but for any point $x$ there always exists a vector $g$ such that $f(y) \geq f(x) + g^T(y - x)$ for all $y$. Such a vector $g$ is called a subgradient of $f$ at the point $x$; the set of all subgradients at the point $x$ is called the subdifferential at $x$ and is denoted $\partial f(x)$ (despite the notation, it has nothing to do with partial derivatives). In the one-dimensional case $g$ is a number, and the above property simply means that the graph of $f$ lies above the straight line passing through $(x, f(x))$ with slope $g$ (see the pictures below). Note that there may be several subgradients at one point, even infinitely many.


Computing at least one subgradient at a point is usually not very difficult; subgradient descent essentially uses a subgradient instead of the gradient. It turns out that this is sufficient. In theory the rate of convergence decreases, but, for example, in neural networks the non-differentiable function $ReLU(x) = \max(0, x)$ is popular precisely because training goes faster with it (this, by the way, is an example of a non-convex non-differentiable function for which (sub)gradient descent is successfully applied; the function $ReLU$ itself is convex, but a multilayer neural network containing $ReLU$ is non-convex and non-differentiable). As an example, for the function $f(x) = |x|$ the subdifferential is very simple:

$$\partial f(x) = \begin{cases} 1, & x > 0, \\ -1, & x < 0, \\ [-1, 1], & x = 0. \end{cases}$$
Perhaps the last important thing to know is that subgradient descent does not converge with a constant step size. This is easiest to see with the above function $f(x) = |x|$. Even the absence of a derivative at a single point breaks the convergence:
- Suppose we start from a point $x_0$.
- A subgradient descent step (with step size 1) is

  $$x_{k+1} = \begin{cases} x_k - 1, & x_k > 0, \\ x_k + 1, & x_k < 0, \\ ??? & x_k = 0. \end{cases}$$
- If $x_0 > 0$, then for the first few steps we subtract one; if $x_0 < 0$, we add one. One way or another, at some point we land in the interval $[0, 1)$, from which we get into $[-1, 0)$, and then we jump between two points of these intervals.
In theory, for subgradient descent it is recommended to take the sequence of steps

$$\alpha_k = \frac{1}{(k+1)^c},$$

where $c$ is usually $1$ or $\frac{1}{2}$. In practice I have often seen the successful use of steps $\alpha_k = e^{-ck}$, although for such steps there is, generally speaking, no convergence guarantee.
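A minimal sketch of both behaviors for $f(x) = |x|$ (the starting point and the step values are my own illustrative choices): with a constant step the iterates end up jumping around the minimum, while the decaying step $\alpha_k = \frac{1}{k+1}$ drives them toward it:

```python
def subgrad(x):
    # A subgradient of f(x) = |x|; at x = 0 any value in [-1, 1] works, we pick 0.
    return 1.0 if x > 0 else (-1.0 if x < 0 else 0.0)

# Constant step: the iterates get stuck oscillating around the minimum.
x = 3.7
for k in range(100):
    x -= 0.5 * subgrad(x)
print("constant step:", x)    # ends up jumping within ~0.5 of zero

# Decaying step alpha_k = 1 / (k + 1): approaches the minimum at 0.
x = 3.7
for k in range(100):
    x -= subgrad(x) / (k + 1)
print("decaying step:", x)    # much closer to zero
```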
Proximal methods
Unfortunately, I don't know a good translation for “proximal” in the context of optimization, so I will simply call the method that. Proximal methods appeared as a generalization of projected gradient methods. The idea is very simple: if the function $f$ can be represented as a sum $f(x) = \varphi(x) + h(x)$, where $\varphi$ is a differentiable convex function and $h(x)$ is a convex function for which there exists a special proximal operator $prox_h(x)$ (in this article I will confine myself to examples and will not describe it in general), then the convergence properties of gradient descent for $\varphi$ are preserved for gradient descent for $f$ if, after each iteration, this proximal operator is applied to the current point $x_k$. In other words, the general form of the proximal method looks like this:

$$x_{k+1} = prox_{\alpha_k h}\bigl(x_k - \alpha_k \nabla \varphi(x_k)\bigr).$$
I think that for now it is completely unclear why this might be needed, especially since I have not explained what a proximal operator is. Here are two examples:
- $h(x)$ is the indicator function of a convex set $\mathcal{K}$, i.e.

  $$h(x) = \begin{cases} 0, & x \in \mathcal{K}, \\ +\infty, & x \notin \mathcal{K}. \end{cases}$$

  In this case $prox_{\alpha_k h}(x)$ is the projection onto the set $\mathcal{K}$, that is, “the point of the set $\mathcal{K}$ closest to $x$”. Thus, we restrict gradient descent to the set $\mathcal{K}$, which makes it possible to solve problems with constraints. Unfortunately, computing the projection in the general case can be even harder, so this method is usually applied when the constraints are simple, for example the so-called box constraints: for each coordinate

  $$l_i \leq x_i \leq r_i.$$
- $h(x) = \lambda \|x\|_1 = \lambda \sum_{i=1}^{n} |x_i|$ is the $\ell_1$-regularization. Such a term is often added to optimization problems in machine learning to avoid overfitting. Regularization of this kind also tends to zero out the least significant components. For such a function the proximal operator has the form (the expression for one coordinate is written out below)

  $$[prox_{\alpha h}(x)]_i = \begin{cases} x_i - \alpha\lambda, & x_i > \alpha\lambda, \\ x_i + \alpha\lambda, & x_i < -\alpha\lambda, \\ 0, & x_i \in [-\alpha\lambda, \alpha\lambda], \end{cases}$$

  which is quite simple to compute (a small sketch follows this list).
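Here is a minimal sketch of that proximal step, known as soft-thresholding, applied after an ordinary gradient step (the tiny least-squares problem, $\lambda$ and the number of iterations below are my own illustrative choices):

```python
import numpy as np

def prox_l1(x, threshold):
    # Coordinate-wise soft-thresholding: the proximal operator of threshold * ||x||_1.
    return np.sign(x) * np.maximum(np.abs(x) - threshold, 0.0)

# Illustrative problem: min_x 1/2 ||M x - y||^2 + lam * ||x||_1 (a tiny "lasso").
rng = np.random.default_rng(0)
M = rng.normal(size=(20, 5))
y = M @ np.array([2.0, 0.0, 0.0, -1.0, 0.0]) + 0.01 * rng.normal(size=20)
lam = 1.0

alpha = 1.0 / np.linalg.norm(M, 2) ** 2          # step 1/L, where L = ||M||^2
x = np.zeros(5)
for k in range(500):
    grad = M.T @ (M @ x - y)                     # gradient of the smooth part
    x = prox_l1(x - alpha * grad, alpha * lam)   # gradient step, then the proximal step

print(x)   # the zero coefficients of the true vector are driven (almost) exactly to zero
```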
Conclusion
This concludes the list of the main variations of the gradient method known to me. Perhaps, in closing, I would note that all of these modifications (except perhaps the conjugate gradient method) can easily interact with one another. I deliberately did not include Newton's method and quasi-Newton methods (BFGS and others) in this list: although they use the gradient, they are more complex methods and require specific additional computations, which are usually more expensive than computing the gradient. Nevertheless, if this text is in demand, I will gladly make a similar review of them.
Used / recommended literature
- Boyd S., Vandenberghe L. Convex Optimization
- Shewchuk J. R. An Introduction to the Conjugate Gradient Method Without the Agonizing Pain
- Bertsekas D. P. Convex Optimization Theory
- Nesterov Yu. E. Convex optimization methods
- Gasnikov A. Universal Gradient Descent