A x = b seen as an optimization problem¶
To solve $A {\bf x} = {\bf b}$ we will seek the minimum of the functional
$$ J({\bf x}) = \frac{1}{2} \, {\bf x}^T \, A \, {\bf x} - {\bf b}\, . {\bf x} $$
At the minimum the derivative vanishes, and in a specific case (identified below) this derivative is precisely $A {\bf x} - {\bf b}$.
Calculation of the derivative¶
The notion of derivative in dimension greater than 1 must be handled with care because there are different types of derivatives: partial derivatives and the total derivative.
What interests us is the derivative along a direction, which corresponds to the partial derivative with respect to $y$ if we move along the $y$ axis.
Definition¶
We say that $f : \Omega \subset X \rightarrow Y$ ($\Omega$ open) is differentiable at ${\bf p} \in \Omega$ if
$$ \exists \; f'({\bf p}) \in L(X,Y) \textrm{ such that} \\ f ({\bf p} + {\bf h}) = f({\bf p}) + f'({\bf p}) ( {\bf h} ) + {\bf h} \; \varepsilon({\bf h}) \quad \textrm{ with } \lim_{\bf h \rightarrow 0} \varepsilon({\bf h}) = {\bf 0} $$
where $L(X,Y)$ denotes the space of continuous linear maps from $X$ to $Y$. Be careful: it is $f'({\bf p})$ that belongs to $L(X,Y)$, not $f'$.
We note that if $f$ is differentiable at ${\bf p}$ then $\forall \; {\bf h} \in X$
$$ f'({\bf p}) ({\bf h})= \lim_{\theta \rightarrow 0} \frac{f({\bf p} + \theta \, {\bf h}) - f({\bf p})}{\theta} $$
Pay attention to the type of each term. To make things easier to visualize, I assume that $f$ is a scalar function, so $Y = ℝ$. On the other hand $X = ℝ^n$ with $n > 1$.
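As a quick sanity check of this definition, here is a minimal sketch. The function f and the helper directional_derivative are made up for illustration only; it approximates the limit above with a small $\theta$ and compares it with the value computed by hand:
import numpy as np

# a made-up scalar function f : R^2 -> R, used only to illustrate the definition
def f(x):
    return x[0]**2 + np.sin(x[1])

def directional_derivative(f, p, h, theta=1e-6):
    # finite-difference approximation of lim (f(p + theta h) - f(p)) / theta
    return (f(p + theta * h) - f(p)) / theta

p = np.array([1.0, 2.0])
h = np.array([0.5, -1.0])
print(directional_derivative(f, p, h))        # numerical approximation
print(2 * p[0] * h[0] + np.cos(p[1]) * h[1])  # hand-computed value, should match closely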
Another notation¶
I find the notation with the gradient clearer. It shows better which terms are vectors and which are scalars:
$$ f ({\bf p} + {\bf h}) = f({\bf p}) + \nabla \!f \, ({\bf p})^T \; {\bf h} + {\bf h}^T \; \varepsilon({\bf h}) \quad $$
As I took $Y = ℝ$, the map $f$ is scalar-valued. $\nabla \! f \, ({\bf p})$ is a vector whose dot product with ${\bf h}$ gives a real number.
Calculate the derivative of J along a direction¶
We calculate the derivative of $J({\bf x})$ at the point ${\bf p}$ along the direction given by the vector $\bf h$:
$$ \begin{align} J'({\bf p}) ({\bf h}) &= \lim_{\theta \rightarrow 0} \; \frac{J({\bf p} + \theta \, {\bf h}) - J({\bf p})}{\theta} \\ &= \lim_{\theta \rightarrow 0} \; \frac{1}{\theta} \; \left( \frac{1}{2} \; ({\bf p} + \theta {\bf h})^T \, A \; ({\bf p} + \theta {\bf h}) - {\bf b}^T \, ({\bf p} + \theta {\bf h}) - \frac{1}{2} \; {\bf p}^T \, A \; {\bf p} + {\bf b}^T \, {\bf p} \right) \\ &= \lim_{\theta \rightarrow 0} \; \frac{1}{\theta} \; \left( \frac{1}{2} \left( \theta \, {\bf p}^T \, A \; {\bf h} + \theta \, {\bf h}^T \, A \; {\bf p} + \theta^2 \, {\bf h}^T \, A \; {\bf h} \right) - \theta \, {\bf b}^T \, {\bf h} \right) \\ &= \frac{1}{2} \left( {\bf p}^T \, A \; {\bf h} +{\bf h}^T \, A \; {\bf p} \right) - {\bf b}^T \, {\bf h} \\ &= \frac{1}{2} \left( {\bf p}^T \, A \; {\bf h} +{\bf p}^T \, A^T \; {\bf h} \right) - {\bf b}^T \, {\bf h} \\ &= \left( \frac{1}{2} \, {\bf p}^T \, ( A + A^T ) - {\bf b}^T \right) \, {\bf h} \\ &= \left( \frac{1}{2} \, ( A + A^T ) \, {\bf p} - {\bf b} \right) \, . \, {\bf h} \end{align} $$
So
$$ \begin{align} J' : \; & {\bf x} \in \Omega \subset ℝ^n \longrightarrow \;L(ℝ^n, ℝ) \\ & {\bf x} \; \longmapsto \; \frac{1}{2} ( A + A^T ) \, {\bf x} - {\bf b} \end{align} $$
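This formula is easy to check numerically. The sketch below (names and the random, deliberately non-symmetric $A$ are mine, for illustration only) compares the finite-difference quotient with $\left( \frac{1}{2} ( A + A^T ) \, {\bf p} - {\bf b} \right) \cdot {\bf h}$:
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(4, 4))          # deliberately non-symmetric
b = rng.normal(size=4)
p = rng.normal(size=4)
h = rng.normal(size=4)

def J(x):
    return 0.5 * x @ A @ x - b @ x

theta = 1e-7
print((J(p + theta * h) - J(p)) / theta)   # finite-difference quotient
print((0.5 * (A + A.T) @ p - b) @ h)       # formula derived above, should match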
Symmetric A¶
If A is symmetric, we have
$$ J'({\bf x}) = A \; {\bf x} - {\bf b} $$
We note that if the derivative vanishes then we have solved our matrix system!
It is therefore necessary that
- A is symmetric
- There is a minimum (not every matrix gives one: we saw a case with a saddle point)
and in this case we can use a gradient method to find the minimum, which will be the solution of our matrix system.
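As an illustration, here is a minimal sketch of such a gradient method (steepest descent with the optimal step along the residual, assuming A is symmetric positive definite; the function name and parameters are mine):
import numpy as np

def gradient_descent(A, b, x0, steps=500, tol=1e-12):
    # minimize J(x) = 0.5 x.T A x - b.x ; its gradient is A x - b when A is symmetric
    x = x0.astype(float)
    for _ in range(steps):
        r = b - A @ x                    # residual = - grad J(x), the steepest descent direction
        if np.linalg.norm(r) < tol:
            break
        alpha = (r @ r) / (r @ (A @ r))  # step length that minimizes J along r
        x = x + alpha * r
    return x

A = np.array([[2., 1.], [1., 2.]])       # symmetric positive definite
b = np.array([-1., 0.5])
print(gradient_descent(A, b, np.zeros(2)))
print(np.linalg.solve(A, b))             # direct solution of A x = b, for comparison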
Property¶
If A is symmetric and positive definite then $J$ is strictly convex and coercive (i.e. $\lim_{||{\bf p}|| \rightarrow \infty} J({\bf p}) = + \infty$), which guarantees that it has a minimum.
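For the matrix A used in the plot further down, this is easy to verify numerically (a quick check: symmetry plus strictly positive eigenvalues):
import numpy as np

A = np.array([[2., 1.], [1., 2.]])
print(np.allclose(A, A.T))        # True: A is symmetric
print(np.linalg.eigvalsh(A))      # [1. 3.]: all eigenvalues > 0, so A is positive definite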
Gradient and derivative¶
The gradient is composed of the partial derivatives with respect to each direction of the basis. We can go through the calculations and compute $\nabla J$ easily (though it is lengthy). Let's do it for a $2 \times 2$ matrix, on the complicated part:
$$ \begin{align} \frac{1}{2} {\bf x}^T A {\bf x} &= \frac{1}{2} \begin{pmatrix} x_1 & x_2 \end{pmatrix} \begin{pmatrix} a_{11} & a_{12} \\ a_{21} & a_{22} \end{pmatrix} \begin{pmatrix} x_1 \\ x_2 \end{pmatrix} \\ &= \frac{1}{2} \left( x_1 a_{11} x_1 + x_1 a_{12} x_2 + x_2 a_{21} x_1 + x_2 a_{22} x_2 \right) \\ & = \frac{1}{2} \left( a_{11} x_1^2 + (a_{12} + a_{21}) x_1 x_2 + a_{22} x_2^2 \right) \end{align} $$
Let's evaluate the partial derivatives of $ \frac{1}{2} x^T A x $:
$$ \frac{\partial}{\partial x_1} \left( \frac{1}{2} (a_{11} x_1^2 + (a_{12} + a_{21}) x_1 x_2 + a_{22} x_2^2) \right) = \frac{1}{2} (2 a_{11} x_1 + (a_{12} + a_{21}) x_2) = a_{11} x_1 + \frac{a_{12} + a_{21}}{2} x_2 $$
$$ \frac{\partial}{\partial x_2} \left( \frac{1}{2} (a_{11} x_1^2 + (a_{12} + a_{21}) x_1 x_2 + a_{22} x_2^2) \right) = \frac{1}{2} ((a_{12} + a_{21}) x_1 + 2 a_{22} x_2) = \frac{a_{12} + a_{21}}{2} x_1 + a_{22} x_2 $$
Therefore the gradient of $\frac{1}{2} {\bf x}^T A {\bf x}$ is:
$$ \nabla \left( \frac{1}{2} x^T A x \right) = \begin{pmatrix} a_{11} x_1 + \frac{a_{12} + a_{21}}{2} x_2 \\ \frac{a_{12} + a_{21}}{2} x_1 + a_{22} x_2 \end{pmatrix} = \frac{1}{2} (A + A^T) \, {\bf x} $$
We would need to redo the computation for an $n \times n$ matrix, but it gives the same result, and so
$$ J'({\bf x}) = \nabla J({\bf x}) $$
We can thus also compute the derivative of $J$ in the direction $\bf h$ using its gradient:
$$ J'({\bf x})({\bf h}) = \nabla J({\bf x}) \, . \, {\bf h} $$
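A quick numerical check of this gradient, approximating each partial derivative by a central finite difference (the helper numerical_gradient is mine; the matrix and vector are the ones used in the plot below):
import numpy as np

A = np.array([[2., 1.], [1., 2.]])
b = np.array([-1., 0.5])

def J(x):
    return 0.5 * x @ A @ x - b @ x

def numerical_gradient(J, x, eps=1e-6):
    # one central finite difference per coordinate
    g = np.zeros_like(x)
    for i in range(len(x)):
        e = np.zeros_like(x)
        e[i] = eps
        g[i] = (J(x + e) - J(x - e)) / (2 * eps)
    return g

x = np.array([3., -2.])
print(numerical_gradient(J, x))
print(0.5 * (A + A.T) @ x - b)    # = A x - b here since A is symmetric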
import numpy as np
import plotly.offline as py
import plotly.graph_objects as go
%matplotlib inline
%config InlineBackend.figure_format = 'retina'
A = np.array([[2,1], [1,2]])
b = np.array([-1, 0.5])
# let's build the mesh
x = np.linspace(-20, 20, 100)
y = np.linspace(-20, 20, 100)
mx, my = np.meshgrid(x, y)
M = np.stack([mx, my], axis=-1)   # shape (100, 100, 2): the point (x, y) at each node
# and compute 0.5 x.T A x - b . x on each node of the mesh
MA = np.einsum("ijk, ka -> ija", M, A)                 # x.T A at each node
mz = 0.5 * np.einsum("ijk, ijk -> ij", MA, M) - M @ b  # J at each node
fig = go.Figure(data=[go.Surface(x=mx, y=my, z=mz)])
py.iplot(fig, show_link=False)
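The lowest point of this surface is reached at the solution of $A {\bf x} = {\bf b}$; a quick self-contained check (using numpy's direct solver here only for verification):
import numpy as np

A = np.array([[2., 1.], [1., 2.]])
b = np.array([-1., 0.5])

x_star = np.linalg.solve(A, b)                          # where the gradient A x - b vanishes
print(x_star, 0.5 * x_star @ A @ x_star - b @ x_star)   # location and value of the minimum of J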
These constraints may seem very restrictive, but many problems produce matrices A that satisfy them. For example, numerical simulation problems like that of the plane we saw earlier generate such matrices, which moreover are very sparse, with about ten non-zero values per row even for a 1 billion by 1 billion matrix.
It is for this type of matrix that an iterative method like the gradient offers an incredible performance gain, since we get the solution in roughly $O(n^2)$ operations instead of $O(n^3)$.
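To give an idea, here is a small sketch using scipy.sparse (an assumption: SciPy is not used elsewhere in this notebook) with a classic sparse symmetric positive definite matrix, the 1D Laplacian, solved by the conjugate gradient method, one iterative gradient-type method among others:
import numpy as np
import scipy.sparse as sp
import scipy.sparse.linalg as spla

# a classic sparse symmetric positive definite matrix: the 1D Laplacian
n = 10_000
A = sp.diags([-1, 2, -1], offsets=[-1, 0, 1], shape=(n, n), format="csr")
b = np.ones(n)

# conjugate gradient: each iteration costs only O(number of non-zeros of A)
x, info = spla.cg(A, b)
print(info, np.linalg.norm(A @ x - b))   # info == 0 means the method converged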