
Abstract Minimization of a functional can be considered as a reduction or simplification method for solving a linear system with a large or infinite number of degrees of freedom (DoFs). The minimum point can be found by computing the critical point of the functional, i.e. the point where the gradient of the functional is zero. In the complex valued case, evaluating the gradient requires Wirtinger derivatives, in which the complex variable \(z\) and its complex conjugate \(\overline {z}\) are treated as independent of each other.

1 General idea

According to Variational problems, a linear operator equation \(Au=f\) (a strong form problem) is equivalent to the variational equation \(\langle Au,v \rangle = \langle f,v \rangle \) for all admissible \(v\) (a weak form problem), due to the boundedness of the operator \(A\). Furthermore, the variational equation is equivalent to the minimization of the functional \(\mathcal {L}(v)=\frac {1}{2}\langle Av,v \rangle - \langle f,v \rangle \), if the linear operator \(A\) is also self-adjoint (in the sense of a normed space, which is also called self-dual; see Adjoint operators in functional analysis) and elliptic. The variational equation is based on the notion of measurement by projection, while the functional minimization problem is based on the notion of minimizing an energy, such as the principle of least action in physics.

Even though the above theory is usually introduced in books about PDEs, such as (Steinbach), it is actually a general theory that naturally holds in linear algebra, where finite dimensional spaces are involved. For example, to solve the linear system \(Ax=b\), where \(A\in \mathbb {R}^{n\times n}\) is symmetric positive definite (SPD), \(x\in \mathbb {R}^n\) and \(b\in \mathbb {R}^n\), we can search for the critical point \(x\) which minimizes the functional \(\varphi (x) = \frac {1}{2} ( Ax,x ) - ( b,x )\), where \(( \cdot ,\cdot )\) is the inner product in \(\mathbb {R}^{n}\). According to Understanding about the Lagrange multiplier method and its application in PDEs, the critical point of \(\varphi (x)\) must be a minimum point if the operator \(A\) is positive semi-definite. Of course, if \(A\) is an SPD matrix in \(\mathbb {R}^{n\times n}\), this condition is satisfied.
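As a concrete illustration (a sketch with arbitrary random example data, assuming NumPy), the following code builds a small SPD system, evaluates \(\varphi (x) = \frac {1}{2} ( Ax,x ) - ( b,x )\), and checks that the exact solution of \(Ax=b\) gives a smaller value of \(\varphi \) than nearby perturbed vectors.

```python
import numpy as np

rng = np.random.default_rng(0)

# Build a small SPD matrix A = M^T M + I and a right-hand side b (arbitrary example data).
n = 5
M = rng.standard_normal((n, n))
A = M.T @ M + np.eye(n)
b = rng.standard_normal(n)

def phi(x):
    """Functional phi(x) = 1/2 (Ax, x) - (b, x)."""
    return 0.5 * x @ A @ x - b @ x

x_star = np.linalg.solve(A, b)   # exact solution of Ax = b

# phi at the solution is smaller than at randomly perturbed points.
for _ in range(3):
    x_perturbed = x_star + 0.1 * rng.standard_normal(n)
    assert phi(x_star) < phi(x_perturbed)
print("phi attains its minimum at the solution of Ax = b")
```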

If we use an iterative algorithm, such as the steepest descent method, we need to construct a sequence \(\{ x^{(k)} \}_{k\geq 1}\) which minimizes the functional \(\varphi (x)\) incrementally. In each iteration step, starting from the previous vector \(x^{(k-1)}\), we search for the next vector \(x^{(k)}\) along the negative gradient direction \(-\nabla \varphi (x) \Big \vert _{x^{(k-1)}}\); the idea is the same as applying Newton’s method to a 1D search along this direction. The step size in this search direction is an unknown factor \(\alpha ^{(k)}\), which can be obtained by finding the critical point of the one-dimensional functional \begin{equation} \tilde {\varphi }(\alpha ^{(k)}) = \varphi \left (x^{(k-1)} + \alpha ^{(k)}r^{(k)} \right ), \end{equation} where \(r^{(k)}\) is the unit vector along \(-\nabla \varphi (x)\Big \vert _{x^{(k-1)}}\). Now, let’s see what the gradient of the functional actually is.
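Before that derivation, the following NumPy sketch illustrates the steepest descent iteration with exact line search. It is only an illustration under stated assumptions: it uses the gradient formula \(\nabla \varphi (x) = Ax - b\) derived in the next section, it keeps the search direction \(r = b - Ax\) unnormalized (which only rescales the step size \(\alpha ^{(k)}\)), and the function name and example system are arbitrary choices.

```python
import numpy as np

def steepest_descent(A, b, x0, tol=1e-10, max_iter=1000):
    """Minimize phi(x) = 1/2 (Ax, x) - (b, x) for SPD A by steepest descent.

    Each step moves along the negative gradient r = b - A x and uses the
    step size alpha that minimizes phi along that direction (exact line search).
    """
    x = x0.copy()
    for _ in range(max_iter):
        r = b - A @ x                      # negative gradient of phi at x
        if np.linalg.norm(r) < tol:
            break
        alpha = (r @ r) / (r @ (A @ r))    # critical point of phi(x + alpha * r)
        x = x + alpha * r
    return x

# Example usage with an arbitrary small SPD system.
rng = np.random.default_rng(1)
M = rng.standard_normal((4, 4))
A = M.T @ M + np.eye(4)
b = rng.standard_normal(4)
x = steepest_descent(A, b, np.zeros(4))
print(np.allclose(A @ x, b, atol=1e-8))    # True
```

For a quadratic functional the exact line search step has the closed form \(\alpha = (r,r)/(r,Ar)\), which is what the code uses instead of numerically minimizing \(\tilde {\varphi }(\alpha )\).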

2 Gradient of the functional \(\varphi (x)\) for \(Ax=b\) (real valued)

Theorem 1 The gradient of \(\varphi (x)\) is \(Ax-b\).

Proof \begin{equation} \begin{aligned} \varphi (x) &= \frac {1}{2} ( Ax,x ) - ( b,x ) = \frac {1}{2} x^{\mathrm {T}}Ax - x^{\mathrm {T}}b \\ &= \frac {1}{2} \sum _{i=1}^n \sum _{j=1}^n a_{ij}x_ix_j - \sum _{i=1}^n b_ix_i. \end{aligned} \end{equation} We need to enforce the conditions \begin{equation} \frac {\partial \varphi (x)}{\partial x_k} = 0, \quad k=1,\cdots ,n. \end{equation} For the first term \(\frac {1}{2} \sum _{i=1}^n \sum _{j=1}^n a_{ij}x_ix_j\) in \(\varphi (x)\), we consider the following cases for the index \(k\):

  1. When \(k\neq i\) and \(k\neq j\), \begin{equation} \frac {\partial a_{ij}x_ix_j}{\partial x_{k}} = 0. \end{equation}

  2. When \(k=i\) and \(k\neq j\), \begin{equation} \frac {\partial a_{ij}x_ix_j}{\partial x_{k}} = \frac {\partial a_{kj}x_kx_j}{\partial x_k} = a_{kj}x_j. \end{equation}

  3. When \(k\neq i\) and \(k=j\), \begin{equation} \frac {\partial a_{ij}x_ix_j}{\partial x_{k}} = \frac {\partial a_{ik}x_ix_k}{\partial x_k} = a_{ik}x_i. \end{equation}

  4. When \(k=i=j\), \begin{equation} \frac {\partial a_{ij}x_ix_j}{\partial x_{k}} = \frac {\partial a_{kk}x_k^2}{\partial x_k} = 2a_{kk}x_k. \end{equation}

Then the partial derivative of the first term with respect to \(x_k\) is \begin{equation} \frac {1}{2} \sum _{\substack {j=1 \\ j\neq k}}^n a_{kj}x_j + \frac {1}{2} \sum _{\substack {i=1 \\ i\neq k}}^n a_{ik}x_i + a_{kk}x_k. \end{equation} Because \(A\) is SPD, we have \(a_{ik} = a_{ki}\), so the first two sums in the above expression can be merged, and the expression becomes \begin{equation} \sum _{\substack {j=1 \\ j\neq k}}^n a_{kj}x_j + a_{kk}x_k = \sum _{j=1}^n a_{kj}x_j. \end{equation} This is just the \(k\)-th component of the vector \(Ax\).

The partial derivative of the second term in \(\varphi (x)\) with respect to \(x_k\) is simply \(-b_k\). Therefore, we have \(\nabla \varphi (x) = Ax-b\).

Comment 1 The gradient of \((Ax,x)\) with respect to \(x\) is \(2Ax\) and the gradient of \((b,x)\) is \(b\). This is a generalization of the differentiation rules for scalar values, i.e. \(\frac {\diff (ax^2)}{\diff x} = 2ax\) and \(\frac {\diff (bx)}{\diff x} = b\).
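Theorem 1 can also be sanity-checked numerically. The sketch below (assuming NumPy; the example data and the central-difference step \(10^{-6}\) are arbitrary) compares a finite-difference approximation of \(\nabla \varphi (x)\) with \(Ax-b\).

```python
import numpy as np

rng = np.random.default_rng(2)
n = 4
M = rng.standard_normal((n, n))
A = M.T @ M + np.eye(n)          # SPD example matrix
b = rng.standard_normal(n)
x = rng.standard_normal(n)       # arbitrary evaluation point

def phi(v):
    """Functional phi(v) = 1/2 (Av, v) - (b, v)."""
    return 0.5 * v @ A @ v - b @ v

# Approximate each component of the gradient by central differences.
h = 1e-6
grad_fd = np.zeros(n)
for k in range(n):
    e = np.zeros(n)
    e[k] = h
    grad_fd[k] = (phi(x + e) - phi(x - e)) / (2 * h)

print(np.allclose(grad_fd, A @ x - b, atol=1e-5))   # True
```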

3 Wirtinger derivatives for complex functions

Before we examine the gradient of the functional \(\varphi (x)\) in the complex valued case, we need to clarify the derivatives of a function with respect to a complex variable \(x\) and its complex conjugate \(\overline {x}\), both of which will appear in the functional \(\varphi (x)\). The basic idea is that even though \(x\) is related to \(\overline {x}\) via complex conjugation, these two variables can be treated as independent in differentiation. This determines how we compute the gradient of \(\varphi (x)\). The reason why \(x\) and \(\overline {x}\) are independent is given below.

Let \(f\) be a complex valued function of a complex variable \(z\). If \(f\) is differentiable at \(z_0\) with respect to the real variables \(x\) and \(y\), the variation of the function value at \(z_0 + \Delta z\) can be written as \begin{equation} \Delta f(z_0) = f(z_0 + \Delta z) - f(z_0) = \frac {\partial f}{\partial x} \Delta x + \frac {\partial f}{\partial y} \Delta y + o(\Delta z), \end{equation} where \(o(\Delta z)/\lvert \Delta z \rvert \rightarrow 0\) when \(\Delta z \rightarrow 0\). We should bear in mind that a complex function is just a complex valued function defined on a 2D plane. Therefore, its variation around the point \(z_0\) can be represented as a combination of the variations in the \(x\) direction and the \(y\) direction. If we define \(\Delta z = \Delta x + \rmi \Delta y\) and \(\Delta \overline {z} = \Delta x - \rmi \Delta y\), we can represent \((\Delta x, \Delta y)\) with \((\Delta z, \Delta \overline {z})\) as \begin{equation} \Delta x = \frac {1}{2} (\Delta z + \Delta \overline {z}), \quad \Delta y = \frac {1}{2\rmi } ( \Delta z - \Delta \overline {z} ). \end{equation} Then \begin{equation} \Delta f(z_0) = \frac {1}{2} \left ( \frac {\partial f}{\partial x} - \rmi \frac {\partial f}{\partial y} \right ) \Delta z + \frac {1}{2} \left ( \frac {\partial f}{\partial x} + \rmi \frac {\partial f}{\partial y}\right ) \Delta \overline {z} + o(\Delta z). \end{equation} Dividing it by \(\Delta z\), we have \begin{equation} \begin{aligned} \frac {\mathrm {d} f}{\mathrm {d} z} \Big \vert _{z_0} &= \lim _{\substack {\Delta z \rightarrow 0 \\ \Delta z \in \mathbb {C}}} \left [ \frac {1}{2} \left ( \frac {\partial f}{\partial x} - \rmi \frac {\partial f}{\partial y} \right ) + \frac {1}{2} \left (\frac {\partial f}{\partial x} + \rmi \frac {\partial f}{\partial y} \right ) \frac {\Delta \overline {z}}{\Delta z} + \frac {o(\Delta z)}{\Delta z} \right ]. \end{aligned} \end{equation} Because \(\frac {\Delta \overline {z}}{\Delta z}\) depends on the path along which \(\Delta z\) approaches \(0\), it does not have a limiting value. To make \(\frac {\diff f}{\diff z}\) meaningful, we should enforce \(\frac {\partial f}{\partial x} + \rmi \frac {\partial f}{\partial y} = 0\), which is just equivalent to the Cauchy-Riemann equations. Therefore, with the differentiability of \(f\) with respect to the real variables \(x\) and \(y\) as well as the Cauchy-Riemann equations, we have \begin{equation} \frac {\diff f}{\diff z} \Big \vert _{z_0} = \frac {1}{2} \left ( \frac {\partial f}{\partial x} - \rmi \frac {\partial f}{\partial y} \right ). \end{equation} Similarly, dividing \(\Delta f(z_0)\) by \(\Delta \overline {z}\), the derivative of \(f\) with respect to \(\overline {z}\) can be written as \begin{equation} \begin{aligned} \frac {\mathrm {d} f}{\mathrm {d} \overline {z}} \Big \vert _{z_0} &= \lim _{\substack {\Delta z \rightarrow 0 \\ \Delta z \in \mathbb {C}}} \left [ \frac {1}{2} \left ( \frac {\partial f}{\partial x} - \rmi \frac {\partial f}{\partial y} \right ) \frac {\Delta z}{\Delta \overline {z}} + \frac {1}{2} \left (\frac {\partial f}{\partial x} + \rmi \frac {\partial f}{\partial y} \right ) + \frac {o(\Delta z)}{\Delta \overline {z}} \right ]. \end{aligned} \end{equation} Enforcing \(\frac {\partial f}{\partial x} - \rmi \frac {\partial f}{\partial y} = 0\), which is the counterpart of the Cauchy-Riemann equations when \(f\) is a function of \(\overline {z}\) instead of \(z\), we have \begin{equation} \frac {\diff f}{\diff \overline {z}} \Big \vert _{z_0} = \frac {1}{2} \left ( \frac {\partial f}{\partial x} + \rmi \frac {\partial f}{\partial y} \right ). \end{equation} The above \(\frac {\diff f}{\diff z}\) and \(\frac {\diff f}{\diff \overline {z}}\) are called Wirtinger derivatives.
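The two Wirtinger formulas can be checked numerically on a simple non-holomorphic test function. In the sketch below (assuming NumPy), the test function \(f(z) = z\overline {z}\), the evaluation point and the step size are arbitrary choices; the expected results are \(\frac {\diff f}{\diff z} = \overline {z}\) and \(\frac {\diff f}{\diff \overline {z}} = z\).

```python
import numpy as np

def f(z):
    """Test function f(z) = z * conj(z) = |z|^2 (not holomorphic)."""
    return z * np.conj(z)

z0 = 1.3 + 0.7j     # arbitrary evaluation point
h = 1e-6            # finite-difference step

# Real partial derivatives of f with respect to x and y at z0.
df_dx = (f(z0 + h) - f(z0 - h)) / (2 * h)
df_dy = (f(z0 + 1j * h) - f(z0 - 1j * h)) / (2 * h)

# Wirtinger derivatives assembled from the real partial derivatives.
df_dz    = 0.5 * (df_dx - 1j * df_dy)   # expected: conj(z0)
df_dzbar = 0.5 * (df_dx + 1j * df_dy)   # expected: z0

print(np.isclose(df_dz, np.conj(z0)), np.isclose(df_dzbar, z0))   # True True
```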

Comment 2 Because a complex function \(f\) can be considered as a complex valued function on the \(xy\) plane, its variation in the neighborhood of a point \(z_0\) can be decomposed into real partial derivatives along the two orthogonal directions \(x\) and \(y\). On the other hand, the same variation can be represented with the derivatives with respect to \(z\) and \(\overline {z}\), whose increments \(\Delta z = \Delta x + \rmi \Delta y\) and \(\Delta \overline {z} = \Delta x - \rmi \Delta y\) are two linearly independent combinations of \(\Delta x\) and \(\Delta y\). Therefore, even though the two variables \(z\) and \(\overline {z}\) are linked via complex conjugation, they behave as two independent variables in differentiation.

4 Gradient of the functional \(\varphi (x)\) for \(Ax=b\) (complex valued)

When the matrix \(A\) and vectors \(x\) and \(b\) are complex valued 1, the definition of the inner product in \(\mathbb {C}^n\) involves complex conjugation. For any \(x\) and \(y\) in \(\mathbb {C}^n\), the following conditions should be satisfied:

  1. \((x,y) = \overline {(y,x)}\);

  2. \(( \alpha x,y ) = \alpha (x,y)\);

  3. \((x,\alpha y) = \overline {\alpha } (x,y)\).

Meanwhile, the SPD condition on \(A\) now becomes Hermitian symmetry (\(A = A^{\mathrm {H}}\)) together with positive definiteness.

In the complex valued case, the range of the functional \(\varphi (x)\) should still be \(\mathbb {R}\), since it is related to an energy. Therefore, the term \(( b,x )\) in the original \(\varphi (x)\), which is not necessarily real valued, should be changed to \(\real ( b,x )\), while \(\frac {1}{2} ( Ax,x )\) is already real valued because \(A\) is Hermitian. Then the functional is (Sauter and Schwab, page 354) \begin{equation} \begin{aligned} \varphi (x) &= \frac {1}{2} ( Ax,x )-\real (b,x) = \frac {1}{2} x^{\mathrm {H}}Ax - \real (x^{\mathrm {H}}b) \\ &= \frac {1}{2} \sum _{i=1}^n \sum _{j=1}^n a_{ij}\overline {x}_ix_j - \real \left ( \sum _{i=1}^n b_i \overline {x}_i \right ). \end{aligned} \end{equation} First, let’s see the gradient of \(\varphi (x)\) with respect to \(x\). Before proceeding, we notice that the second term above is written only in terms of \(\overline {x}_i\). If we treated it as a function of \(\overline {x}\) alone and evaluated \(\nabla _x \varphi (x)\) directly from this expression, the result would not depend on \(b\), which is obviously incorrect. Therefore, the functional should be rewritten as below, which has the same value as before, \begin{equation} \begin{aligned} \varphi (x) &= \frac {1}{2} ( Ax,x )-\real (x,b) = \frac {1}{2} x^{\mathrm {H}}Ax - \real (b^{\mathrm {H}}x) \\ &= \frac {1}{2} \sum _{i=1}^n \sum _{j=1}^n a_{ij}\overline {x}_ix_j - \real \left ( \sum _{i=1}^n \overline {b}_i x_i \right ). \end{aligned} \end{equation} The complex partial derivative of the first term in the above expression with respect to \(x_k\) is \begin{equation} \frac {\partial }{\partial x_k} \left ( \frac {1}{2} \sum _{i=1}^n \sum _{j=1}^n a_{ij} \overline {x}_ix_j \right ) = \frac {1}{2} \sum _{i=1}^{n} a_{ik} \overline {x}_i. \end{equation} For the second term, only the summand with \(i=k\) contributes to the complex partial derivative with respect to \(x_k\). Let \(b_k = b_{k1} + \rmi b_{k2}\) and \(x_k = x_{k1} + \rmi x_{k2}\), so that \begin{equation} \real (\overline {b}_k x_k) = \real (b_{k1}x_{k1} + b_{k2}x_{k2} + \rmi (b_{k1}x_{k2} - b_{k2}x_{k1})) = b_{k1}x_{k1} + b_{k2}x_{k2}. \end{equation} Using the Wirtinger derivative, the complex partial derivative of the second term with respect to \(x_k\) is \begin{equation} \begin{aligned} \frac {\partial }{\partial x_k} \real (\overline {b}_kx_k) &= \frac {\partial }{\partial x_k} (b_{k1}x_{k1} + b_{k2}x_{k2}) \\ &= \frac {1}{2} \left ( \frac {\partial }{\partial x_{k1}} - \rmi \frac {\partial }{\partial x_{k2}} \right ) (b_{k1}x_{k1} + b_{k2}x_{k2}) \\ &= \frac {1}{2} ( b_{k1} - \rmi b_{k2} ) = \frac {1}{2} \overline {b}_k. \end{aligned} \end{equation} Then the gradient \(\nabla _x \varphi (x)\) is \begin{equation} \nabla _x\varphi (x) = \frac {1}{2} \overline {A^{\mathrm {H}}x} - \frac {1}{2} \overline {b}. \end{equation} When \(\nabla _x\varphi (x) = 0\), we have \begin{equation} \overline {A^{\mathrm {H}}x} = \overline {b} \Leftrightarrow A^{\mathrm {H}}x = b. \end{equation} Because \(A\) is Hermitian, this is further equivalent to \(Ax = b\).

Second, we check the gradient of \(\varphi (x)\) with respect to \(\overline {x}\). Using the same procedure, the \(k\)-th component of \(\nabla _{\overline {x}}\varphi (x)\) is \(\frac {1}{2}\sum _{j=1}^n a_{kj}x_j - \frac {1}{2}b_k\), i.e. \(\nabla _{\overline {x}}\varphi (x) = \frac {1}{2}(Ax - b)\), so we still obtain \(Ax=b\) when \(\nabla _{\overline {x}}\varphi (x)=0\).
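As a numerical illustration (a sketch with arbitrary Hermitian positive definite example data, assuming NumPy), the code below approximates each component of \(\nabla _{\overline {x}}\varphi (x)\) with the Wirtinger formula \(\frac {1}{2}\left (\frac {\partial }{\partial x_{k1}} + \rmi \frac {\partial }{\partial x_{k2}}\right )\) applied through finite differences, and compares the result with \(\frac {1}{2}(Ax-b)\).

```python
import numpy as np

rng = np.random.default_rng(3)
n = 3
M = rng.standard_normal((n, n)) + 1j * rng.standard_normal((n, n))
A = M.conj().T @ M + np.eye(n)        # Hermitian positive definite example matrix
b = rng.standard_normal(n) + 1j * rng.standard_normal(n)
x = rng.standard_normal(n) + 1j * rng.standard_normal(n)   # arbitrary evaluation point

def phi(v):
    """phi(v) = 1/2 v^H A v - Re(b^H v), a real valued functional."""
    return (0.5 * v.conj() @ A @ v - (b.conj() @ v).real).real

# Wirtinger gradient with respect to conj(x): 1/2 (d/d(Re x_k) + i d/d(Im x_k)) of phi.
h = 1e-6
grad = np.zeros(n, dtype=complex)
for k in range(n):
    e = np.zeros(n, dtype=complex)
    e[k] = h
    d_re = (phi(x + e) - phi(x - e)) / (2 * h)
    d_im = (phi(x + 1j * e) - phi(x - 1j * e)) / (2 * h)
    grad[k] = 0.5 * (d_re + 1j * d_im)

print(np.allclose(grad, 0.5 * (A @ x - b), atol=1e-5))   # True
```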

5 Gradient of the residual norm in linear least squares problems

For a linear system \(Ax=b\), where \(A\in \mathbb {R}^{m\times n}\) with \(m\geq n\), if \(x\) minimizes the 2-norm of the residual vector \(\lVert b - Ax \rVert _2\), it is called the linear least squares solution. This is equivalent to solving the normal equation \(A^{\mathrm {T}}Ax=A^{\mathrm {T}}b\), or \(A^{\mathrm {H}}Ax=A^{\mathrm {H}}b\) in the complex valued case. We will now show this equivalence.

First, we consider the real valued case. Minimizing the 2-norm of the residual vector is equivalent to minimizing its square, which can be considered as a functional \begin{equation} \varphi (x) = \lVert b - Ax \rVert _2^2 = ( Ax-b,Ax-b ) = ( Ax,Ax ) + ( b,b ) - ( Ax,b ) - ( b,Ax ). \end{equation} The gradient of the first term in this functional is \begin{equation} \nabla _x ( Ax,Ax ) = \nabla _x ( A^{\mathrm {T}}Ax,x ) = 2A^{\mathrm {T}}Ax. \end{equation}

The gradient of the third term in \(\varphi (x)\) is \begin{equation} \nabla _x ( Ax,b ) = \nabla _x b^{\mathrm {T}}Ax = b^{\mathrm {T}}A, \end{equation} which, written as a column vector, is \begin{equation} \nabla _x ( Ax,b ) = A^{\mathrm {T}}b. \end{equation}

The gradient of the fourth term in \(\varphi (x)\) is \begin{equation} \nabla _x ( b,Ax ) = \nabla _x x^{\mathrm {T}}A^{\mathrm {T}}b = A^{\mathrm {T}}b. \end{equation}

Therefore, \begin{equation} \nabla _x \varphi (x) = 2A^{\mathrm {T}}Ax - 2A^{\mathrm {T}}b. \end{equation} When it is 0, we have \(A^{\mathrm {T}}Ax = A^{\mathrm {T}}b\), which is the normal equation.
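The real valued normal equation can be checked against a library least squares solver. The sketch below uses arbitrary example data and numpy.linalg.lstsq as the reference; it also verifies that the gradient \(2A^{\mathrm {T}}(Ax - b)\) vanishes at the computed solution.

```python
import numpy as np

rng = np.random.default_rng(4)
m, n = 8, 3
A = rng.standard_normal((m, n))           # tall matrix, m >= n
b = rng.standard_normal(m)

# Solve the normal equation A^T A x = A^T b directly.
x_normal = np.linalg.solve(A.T @ A, A.T @ b)

# Reference least squares solution.
x_lstsq, *_ = np.linalg.lstsq(A, b, rcond=None)

print(np.allclose(x_normal, x_lstsq))                   # True
print(np.allclose(2 * A.T @ (A @ x_normal - b), 0))     # gradient vanishes: True
```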

Second, we consider the complex valued case. The functional \(\varphi (x)\) still takes its original form, since \(( Ax,Ax )\) and \(( b,b )\) are clearly real valued, while \(( Ax,b ) + ( b,Ax ) = 2\real ( Ax,b )\) is also real valued.

For the gradient with respect to \(x\) (written as a row vector here), we have \begin{equation} \nabla _x ( Ax,Ax ) = \nabla _x x^{\mathrm {H}}A^{\mathrm {H}}Ax = x^{\mathrm {H}}A^{\mathrm {H}}A, \end{equation} \begin{equation} \nabla _x ( Ax,b ) = b^{\mathrm {H}}A \end{equation} and \begin{equation} \nabla _x ( b,Ax ) = 0. \end{equation} Hence \begin{equation} \nabla _x \varphi (x) = x^{\mathrm {H}}A^{\mathrm {H}}A - b^{\mathrm {H}}A. \end{equation} When the gradient with respect to \(x\) is 0, we obtain the normal equation \begin{equation} x^{\mathrm {H}}A^{\mathrm {H}}A - b^{\mathrm {H}}A = 0 \Leftrightarrow A^{\mathrm {H}}Ax = A^{\mathrm {H}}b. \end{equation}

For the gradient with respect to \(\overline {x}\), we have \begin{equation} \nabla _{\overline {x}} ( Ax,Ax ) = \nabla _{\overline {x}} x^{\mathrm {H}}A^{\mathrm {H}}Ax = A^{\mathrm {H}}Ax, \end{equation} \begin{equation} \nabla _{\overline {x}} (Ax,b ) = 0 \end{equation} and \begin{equation} \nabla _{\overline {x}} ( b,Ax ) = \nabla _{\overline {x}} x^{\mathrm {H}}A^{\mathrm {H}}b = A^{\mathrm {H}}b. \end{equation} Hence \begin{equation} \nabla _{\overline {x}} \varphi (x) = A^{\mathrm {H}}Ax - A^{\mathrm {H}}b. \end{equation} When this gradient is 0, we also obtain the normal equation \(A^{\mathrm {H}}Ax = A^{\mathrm {H}}b\).
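The same check works in the complex valued case (again a sketch with arbitrary example data): solving \(A^{\mathrm {H}}Ax = A^{\mathrm {H}}b\) reproduces the solution returned by numpy.linalg.lstsq, and \(A^{\mathrm {H}}(Ax - b)\) vanishes there.

```python
import numpy as np

rng = np.random.default_rng(5)
m, n = 8, 3
A = rng.standard_normal((m, n)) + 1j * rng.standard_normal((m, n))
b = rng.standard_normal(m) + 1j * rng.standard_normal(m)

# Solve the complex normal equation A^H A x = A^H b directly.
AH = A.conj().T
x_normal = np.linalg.solve(AH @ A, AH @ b)

# Reference least squares solution.
x_lstsq, *_ = np.linalg.lstsq(A, b, rcond=None)

print(np.allclose(x_normal, x_lstsq))                # True
print(np.allclose(AH @ (A @ x_normal - b), 0))       # A^H A x - A^H b = 0: True
```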

6 Summary

  • The solution of a linear system \(Ax=b\) is equivalent to finding the minimum point of a functional, if the linear operator \(A\) satisfies certain properties. When \(A\) is a partial differential operator, there are infinitely many DoFs to be solved. When \(A\) is a large matrix, as in FEM or BEM, where it is usually the discrete version of a corresponding partial differential operator, there is still a large number of unknowns to be solved. Once such a functional is found, the solution of many unknowns is converted to the minimization of a single objective function, which can be reckoned as a kind of simplification or reduction.

  • The critical point of the functional \(\varphi (x) = \frac {1}{2}( Ax,x ) - ( b,x )\) in the real valued case, or the critical point of the functional \(\varphi (x) = \frac {1}{2} (Ax,x) - \real ( b,x )\) in the complex valued case, is the solution of \(Ax=b\).

  • The critical point of the squared residual norm \(\lVert b - Ax \rVert _2^2\) in both real and complex valued cases is the solution of the normal equation \(A^{\mathrm {T}}Ax = A^{\mathrm {T}}b\) in the real valued case or \(A^{\mathrm {H}}Ax = A^{\mathrm {H}}b\) in the complex valued case, which is also the linear least squares solution of \(Ax=b\).

References

   Yousef Saad. Iterative Methods for Sparse Linear Systems. Other Titles in Applied Mathematics. Society for Industrial and Applied Mathematics. ISBN 978-0-89871-534-7. doi: 10.1137/1.9780898718003.

   Stefan Sauter and Christoph Schwab. Boundary Element Methods. Springer Science & Business Media. ISBN 978-3-540-68093-2.

   Olaf Steinbach. Numerical Approximation Methods for Elliptic Boundary Value Problems: Finite and Boundary Elements. Springer Science & Business Media. ISBN 978-0-387-31312-2.

1Complex valued problems arise when we solve time-harmonic acoustic or electromagnetic equations, i.e. Helmholtz equations.