Hi! The last two weeks in GSoC I spent developing algorithm drafts, testing and tuning them. It is an addictive process, and I have perhaps spent too much time on it already. The results are pretty good, but now I seriously need to stop and start writing production-quality code.
The first thing we decided I should do is provide a plain-English description of the methods I studied and implemented. To make it more understandable I decided to write this post as an introduction to nonlinear least-squares optimization. So here I start.
Nonlinear least-squares optimization problem
The objective function we seek to minimize has the form

$$F(x) = \frac{1}{2} \lVert f(x) \rVert^2 = \frac{1}{2} \sum_{i=1}^{m} f_i^2(x).$$

Here we introduced the residual vector $f(x) = (f_1(x), \ldots, f_m(x))^T$, and we want to minimize the square of its Euclidean norm. Each component $f_i$ is a smooth function from $\mathbb{R}^n$ to $\mathbb{R}$. We treat $f$ as a vector-valued function whose Jacobian matrix $J$ is defined as follows:

$$J(x) = \begin{pmatrix} \nabla f_1^T(x) \\ \vdots \\ \nabla f_m^T(x) \end{pmatrix}$$
In other words, the $i$-th row of $J$ contains the transposed gradient $\nabla f_i^T$. Now it is easy to verify that the gradient and Hessian (the matrix containing all second partial derivatives) of the objective function have the form:

$$\nabla F(x) = J^T(x) f(x), \qquad \nabla^2 F(x) = J^T(x) J(x) + \sum_{i=1}^{m} f_i(x) \nabla^2 f_i(x).$$
Notice that the second term in the Hessian will be small if (a) the residuals are small near the optimum, or (b) the residuals depend on $x$ approximately linearly (possibly only near the optimum). Both conditions are often satisfied in practice, which leads to the main idea of least-squares minimization: use an approximation of the Hessian of the form

$$\nabla^2 F(x) \approx J^T(x) J(x).$$
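As a quick sanity check of the gradient formula $\nabla F = J^T f$, here is a small numpy snippet comparing the analytic gradient with central finite differences (the residual function is an arbitrary toy example, not taken from any particular problem):

```python
import numpy as np

# Toy residual vector (an arbitrary example for illustration):
# f(x) = (x0**2 - x1, x0 - 1).
def residuals(x):
    return np.array([x[0]**2 - x[1], x[0] - 1.0])

def jacobian(x):
    # Rows are the transposed gradients of the residual components.
    return np.array([[2.0 * x[0], -1.0],
                     [1.0, 0.0]])

def F(x):
    r = residuals(x)
    return 0.5 * r.dot(r)

x = np.array([1.5, 0.7])
grad_analytic = jacobian(x).T @ residuals(x)   # grad F = J^T f

# Central finite differences for comparison.
eps = 1e-6
grad_fd = np.array([
    (F(x + eps * np.eye(2)[i]) - F(x - eps * np.eye(2)[i])) / (2 * eps)
    for i in range(2)
])
```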
So the distinctive feature of least-squares optimization is the availability of a good Hessian approximation using only first-order derivative information. Having noted that, we can apply any optimization method which employs Hessian information, with a good chance of satisfactory results.
Most local optimization algorithms are iterative: starting from an initial guess $x_0$, they generate a sequence $\{x_k\}$ which should converge to a local minimum. I will denote steps taken by an algorithm with the letter $p$, such that $x_{k+1} = x_k + p_k$. I will also denote quantities at a point $x_k$ with the index $k$, for example $f_k = f(x_k)$, $J_k = J(x_k)$ and so on.
Gauss-Newton method
This is just an adaptation of Newton's method: instead of computing the Newton step from $\nabla^2 F_k \, p = -\nabla F_k$ at each iteration, we compute the Gauss-Newton step using the aforementioned Hessian approximation:

$$J_k^T J_k \, p = -J_k^T f_k.$$
To make the algorithm globally convergent, we invoke a line search along the computed direction to satisfy a “sufficient decrease condition”; see the chapter on line search in “Numerical Optimization” by Nocedal and Wright.
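To make this concrete, here is a minimal Gauss-Newton sketch with a crude backtracking line search, applied to a toy exponential-fitting problem (the model and data are assumptions for illustration; a real implementation needs a proper sufficient-decrease condition and termination tests):

```python
import numpy as np

# Toy problem (an assumption for illustration): fit y = a * exp(b * t)
# to noise-free synthetic data, so the optimal residuals are zero.
t = np.linspace(0, 2, 10)
y = 2.0 * np.exp(-1.0 * t)

def residuals(x):
    a, b = x
    return a * np.exp(b * t) - y

def jacobian(x):
    a, b = x
    e = np.exp(b * t)
    # Columns are the partial derivatives w.r.t. a and b.
    return np.column_stack((e, a * t * e))

def gauss_newton(x0, max_iter=100, tol=1e-10):
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        f = residuals(x)
        J = jacobian(x)
        # Gauss-Newton step: solve min_p ||J p + f||^2,
        # whose normal equation is J^T J p = -J^T f.
        p, *_ = np.linalg.lstsq(J, -f, rcond=None)
        # Crude backtracking line search: halve the step until the
        # objective decreases (a stand-in for sufficient decrease).
        cost = 0.5 * f.dot(f)
        step = 1.0
        while 0.5 * np.sum(residuals(x + step * p)**2) > cost and step > 1e-10:
            step *= 0.5
        x = x + step * p
        if step * np.linalg.norm(p) < tol:
            break
    return x

x_opt = gauss_newton([1.0, 0.0])
```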
Note that the equation for $p$ is the normal equation of the following linear least-squares problem:

$$\min_p \lVert J_k \, p + f_k \rVert^2.$$
It means that $p$ can be found by whatever linear least-squares method you think is appropriate:
- Through the normal equation with Cholesky factorization. Pros: efficiency when $m \gg n$, since the $n \times n$ matrix $J^T J$ can be accumulated using $O(n^2)$ memory, and solving the resulting system is fast. Cons: potentially poor accuracy, since the condition number of $J^T J$ is the square of the condition number of $J$; Cholesky factorization can fail when $J$ is nearly rank deficient.
- Through QR factorization of $J$ with pivoting. Pros: more reliable for ill-conditioned $J$. Cons: slower than the previous approach; the rank-deficient case is still problematic.
- Through singular value decomposition (SVD) of $J$. Pros: the most robust approach; gives full information about the sensitivity of the solution to perturbations of the input; allows finding the least-norm solution in the rank-deficient case; allows zeroing of very small singular values (thus avoiding excessive sensitivity). Cons: slower than the previous two.
- Conjugate gradient and similar methods (LSQR, LSMR), which only require the ability to compute $J u$ and $J^T v$ for arbitrary vectors $u$ and $v$. Used in the sparse large-scale setting.
The third approach is used in numpy.linalg.lstsq. At the moment I don’t have a good idea of how much slower the SVD approach is than QR.
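A small numpy experiment illustrates the first and third approaches, and the condition-number squaring mentioned above (the random Jacobian here is just an assumption for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
J = rng.standard_normal((20, 3))   # an arbitrary well-conditioned Jacobian
f = rng.standard_normal(20)

# Approach 1: normal equation J^T J p = -J^T f.
p_normal = np.linalg.solve(J.T @ J, -J.T @ f)

# Approach 3: SVD-based solver (what numpy.linalg.lstsq uses).
p_svd, *_ = np.linalg.lstsq(J, -f, rcond=None)

same_solution = np.allclose(p_normal, p_svd)

# The condition number of J^T J is the square of that of J,
# which is what can hurt accuracy of the normal-equation route.
cond_squared = np.isclose(np.linalg.cond(J.T @ J), np.linalg.cond(J)**2)
```

On a well-conditioned problem both routes agree; the accuracy gap only shows up as $J$ becomes ill-conditioned.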
If the singular values of $J$ are uniformly bounded away from zero in the region of interest, then the Gauss-Newton method is globally convergent; the convergence rate is no worse than linear and approaches quadratic as the second term in the true Hessian becomes insignificant compared to $J^T J$. But note that the convergence rate of an infinite sequence is measured after some $k_0$, which can be arbitrarily large. It means that we can expect to observe such a rate of convergence only within some (perhaps small) neighborhood of the optimal point.
Levenberg-Marquardt method
The original idea of Levenberg was to use a “regularized” Gauss-Newton step, which satisfies the following equation:

$$(J_k^T J_k + \alpha I) \, p = -J_k^T f_k,$$
where $\alpha \geq 0$ is adjusted from iteration to iteration depending on some measure of the success of the previous step. A line search is not used in this algorithm. The rough explanation is as follows: when $\alpha$ is small we are taking full Gauss-Newton steps, which are known to be good near the optimum; when $\alpha$ is big we are taking steps along the anti-gradient, thus assuring global convergence. If I’m not mistaken, Marquardt’s contribution was the suggestion to use a more general term $\alpha D^T D$ instead of $\alpha I$, with $D$ a diagonal matrix. This is a question of variable scaling, and for simplicity I will ignore it here.
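A convenient way to compute the regularized step without forming $J^T J$ explicitly is to solve an equivalent augmented least-squares system; the sketch below (with arbitrary random data for illustration) also shows that a large $\alpha$ turns the step toward the anti-gradient:

```python
import numpy as np

rng = np.random.default_rng(1)
J = rng.standard_normal((15, 4))   # arbitrary data for illustration
f = rng.standard_normal(15)

def lm_step(J, f, alpha):
    # Solve (J^T J + alpha I) p = -J^T f through the equivalent
    # augmented problem min ||[J; sqrt(alpha) I] p + [f; 0]||^2,
    # which avoids forming J^T J explicitly.
    n = J.shape[1]
    A = np.vstack((J, np.sqrt(alpha) * np.eye(n)))
    b = -np.concatenate((f, np.zeros(n)))
    p, *_ = np.linalg.lstsq(A, b, rcond=None)
    return p

# alpha = 0 recovers the Gauss-Newton step ...
p_gn = lm_step(J, f, 0.0)
# ... while a huge alpha makes the step nearly parallel
# to the anti-gradient -J^T f.
p_big = lm_step(J, f, 1e8)
g = J.T @ f
cos_angle = -p_big @ g / (np.linalg.norm(p_big) * np.linalg.norm(g))
```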
Algorithms directly adjusting $\alpha$ are considered obsolete, but nevertheless such an algorithm is used in MATLAB, for instance, and it works well. The more recent view is to consider Levenberg-Marquardt as a trust-region type algorithm.
In a trust-region approach we obtain the step by solving the following constrained quadratic subproblem:

$$\min_p \; m(p) = \frac{1}{2} p^T B \, p + g^T p \quad \text{s.t.} \quad \lVert p \rVert \leq \Delta,$$
where $B$ is the approximation to the Hessian at the current point, $g$ is the gradient, and $\Delta$ is the radius of the trust region. The radius is adjusted by observing the ratio of actual to predicted change in the objective function (as a measure of the adequacy of the quadratic model within the trust region):

$$\rho = \frac{F(x_k) - F(x_k + p)}{m(0) - m(p)}.$$
The update rules for $\Delta$ are approximately as follows: if $\rho < 1/4$ then $\Delta_{k+1} = \Delta_k / 4$; if $\rho > 3/4$ and $\lVert p \rVert = \Delta_k$ then $\Delta_{k+1} = 2 \Delta_k$; otherwise $\Delta_{k+1} = \Delta_k$. If $\rho$ is negative (no actual decrease), then the computed step is not taken and it is recomputed with $\Delta_{k+1}$ from the current point again (the threshold for rejecting a step might be higher; for example, 0.25 is stated in “Numerical Optimization”).
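The radius update can be sketched in a few lines (the constants follow the common textbook variant; real implementations differ in details):

```python
def update_radius(ratio, step_norm, radius):
    # Constants follow the common variant in "Numerical Optimization";
    # implementations vary them.
    if ratio < 0.25:
        return 0.25 * radius   # poor model agreement: shrink
    if ratio > 0.75 and abs(step_norm - radius) < 1e-12 * radius:
        return 2.0 * radius    # good agreement on the boundary: expand
    return radius              # otherwise keep the radius

r_shrink = update_radius(0.1, 0.5, 1.0)
r_expand = update_radius(0.9, 1.0, 1.0)
r_keep = update_radius(0.5, 0.3, 1.0)
```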
In least-squares problems we have $B = J^T J$ and $g = J^T f$. The connection between the original Levenberg method and a trust-region method is given by the following theorem:
The solution $p^*$ to the trust-region problem satisfies the equation $(B + \alpha I) \, p^* = -g$ for some $\alpha \geq 0$, such that $B + \alpha I$ is positive semidefinite and $\alpha (\Delta - \lVert p^* \rVert) = 0$.
The last condition means that either $\alpha = 0$ or the optimal solution lies on the boundary. This theorem tells us that there is a one-to-one correspondence between $\alpha$ and $\Delta$, and suggests the following conceptual algorithm for solving the trust-region problem:
- If $B$ is positive definite, compute the Newton step $p = -B^{-1} g$; if it lies within the trust region, we have found the solution.
- Otherwise find $\alpha$ s. t. $\lVert p(\alpha) \rVert = \Delta$ using some root-finding algorithm, then compute $p$ from $(B + \alpha I) \, p = -g$.
Step 2 is not particularly easy; there is also an additional difficult case when $B + \alpha I$ is positive semidefinite but singular, and $p$ can’t be computed just from the equation. For a proper discussion of the problem refer to “Numerical Optimization”, Chapter 4.
A suitable and detailed algorithm for solving the trust-region subproblem arising in least squares is given in the paper “The Levenberg-Marquardt Algorithm: Implementation and Theory” by J. J. Moré (implemented in MINPACK, which scipy.optimize.leastsq wraps). The author analyzes the problem in terms of SVD, but then suggests an implementation using QR decomposition and Givens rotations (the approach generally chosen in MINPACK). I decided to stick with SVD: the only potential disadvantage of SVD is speed (and even that is questionable), while in all other respects this approach is great, including the simplicity of implementation. Let’s introduce the function we want to find a zero of (see step 2 of the trust-region solving algorithm):

$$\phi(\alpha) = \lVert p(\alpha) \rVert - \Delta.$$
If we have the SVD $J = U \Sigma V^T$ and define $z = U^T f$, then

$$\phi(\alpha) = \left( \sum_{i} \left( \frac{\sigma_i z_i}{\sigma_i^2 + \alpha} \right)^2 \right)^{1/2} - \Delta.$$
We have an explicit function of $\alpha$ and can easily compute its derivative too. In numpy/scipy this function can be implemented as follows, and it will work correctly even when $m < n$:
```python
import numpy as np
from scipy.linalg import svd


def phi_and_derivative(J, f, alpha, Delta):
    # Thin SVD is enough and also handles the case m < n.
    U, s, VT = svd(J, full_matrices=False)
    suf = s * U.T.dot(f)
    denom = s**2 + alpha
    p_norm = np.linalg.norm(suf / denom)
    phi = p_norm - Delta
    phi_prime = -np.sum(suf**2 / denom**3) / p_norm
    return phi, phi_prime
```
Then an iterative Newton-like zero-finding method is run (with some safeguarding described in the paper) until $|\phi(\alpha)|$ becomes sufficiently small relative to $\Delta$; it usually converges in 1-3 iterations. Of course, in a real implementation the SVD should be computed only once per trust-region subproblem (this is not true for the QR approach described in the paper); also note that we need only the “thin” SVD (full_matrices=False). So for almost the same amount of work as numpy.linalg.lstsq we accurately solve a trust-region least-squares subproblem.
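Here is a sketch of such an iteration on a small handmade example (the function is repeated so the snippet is self-contained; this is plain Newton with a simple positivity clamp, omitting the full safeguarding from Moré's paper):

```python
import numpy as np
from scipy.linalg import svd

def phi_and_derivative(J, f, alpha, Delta):
    U, s, VT = svd(J, full_matrices=False)
    suf = s * U.T.dot(f)
    denom = s**2 + alpha
    p_norm = np.linalg.norm(suf / denom)
    phi = p_norm - Delta
    phi_prime = -np.sum(suf**2 / denom**3) / p_norm
    return phi, phi_prime

# A small handmade problem where the Gauss-Newton step is longer
# than the trust-region radius, so a positive root exists.
J = np.array([[1.0, 0.0],
              [0.0, 2.0],
              [1.0, 1.0]])
f = np.array([1.0, 1.0, 1.0])
Delta = 0.5

alpha = 1.0
for _ in range(20):
    phi, phi_prime = phi_and_derivative(J, f, alpha, Delta)
    if abs(phi) < 1e-3 * Delta:
        break
    # Plain Newton update with a positivity clamp (the paper
    # describes proper safeguarding, omitted here).
    alpha = max(alpha - phi / phi_prime, 1e-12)

# Recover the step itself from (J^T J + alpha I) p = -J^T f.
p = np.linalg.solve(J.T @ J + alpha * np.eye(2), -J.T @ f)
```

The resulting step has norm equal to the radius, as expected for a boundary solution.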
This (nearly) exact procedure for solving a trust-region problem is important, but in the large-scale setting one of the approximate methods must be chosen; I will outline them in future posts. Perhaps the most efficient and accurate approximate method is to solve the problem in a subspace spanned by two properly selected vectors, in which case we apply the exact procedure to a small 2 × 2 matrix.
Gauss-Newton and Levenberg-Marquardt methods share the same convergence properties, but Levenberg-Marquardt can handle rank-deficient Jacobians and, as far as I understand, generally works better in practice (although I don’t have much experience with Gauss-Newton). So LM is usually the method of choice for unconstrained least-squares optimization.
In the next posts I will describe the algorithm I was studying and implementing for bound constrained least squares. Its unofficial title is “Trust Region Reflective”.