The problem of optimizing the neural network costfunction is usually a nonlinear static unconstrained optimization problem.
Nonlinear meaning that it is usually not possible to solve the problem directly, --- instead iterative schemes must be applied.
There is usually no temporal state involved in the feed-forward neural network. The patterns are treated as independent. Therefore is the problem static.
Unconstrained meaning that the variables to be optimized is not bounded, --- for example that they should be positive. This is in regard to the costfunction. Optimization of the errorfunction may be regarded as a constrained optimization where the constraints are imposed by the regularization. Fortunately the constraints are differentiable, so the constraints are easily added to the errorfunction giving the costfunction.
The optimization in connection with neural network usually called training or learning. Each of the iterative steps is called an epoch.
For an optimization methods 3 efficiency measures may be given [62]:
One may say that linear convergence is where the number of determined ciphers increase linearly with the epochs, and that quadratic convergence is where the number increase quadratic.
The speed of convergence and the complexity of computation usually work opposite --- it is just a question of where to put the epoch. If the 2 values are ''multiplied'' together we get the time consume of the algorithm. Often the robustness fight against the time consume. This is because a robust algorithm often works not so locally, uses more points, where the function is evaluated.
Some algorithms are only robust with a convex costfunction, that is where the Hessian is positive definite. The neural network costfunction --- especially a unregularized --- is not convex. To get both robustness and a fair time consume optimization methods can be mixed forming a hybrid optimization methods.
it will need
function
evaluations. These are the vertices in a simplex --- a irregular shaped sort
of prism. The vertex having the largest function value is mirror in the
hyperplane the other vertices spans. This creates a new simplex, where a new
vertex with the highest function value now can be mirror. The simplex ---
also called amoeba --- should gradually move downhill.
we arrive at the ordinary backprobagation:
The problem with backpropagation is that if the step
is too small,
the convergence is to slow, and if it is to large, it is not certain that
the costfunction will decrease: We might land on the opposite side of the
costfunction valley. By using soft line search, where we start off
with a large
a gradually decrease it until the costfunction is
decreased, we can get both improvement in the robustness and the speed of
convergence, but at the cost of more complexity of computation. If the line
search is hard, that is the optimal step size
is found, we
end up with steepest descent
:
This is usually not worth the effort, compared to soft line search.
Some curvature information can be obtained if the old step is used in backpropagation with momentum:
A problem with the gradient descent is that all directions share the same step size. This is a problem with stiff functions, that is functions with a long shallow valley in one direction and a short steep valley in an other. In mathematical terms this means that the ratio between the largest and the smallest Hessian eigenvalue is big. The step size has to be fitted to the short steep valley, not to bump into its valley side.
Neural network costfunction are often stiff because the inputs have different importance for the outputs. Consider the neural network input of principal components: The first one should be of importance, while the higher should be noise that should be ignored.
The gradient descent algorithms have linear convergence.
The pure newton is not robust, because the Hessian might not be positive definite, that is the cost function is not fully convex. If the costfunction is concave in one dimension it will take a step uphill. Using the gauss-newton approximation4.6.2 the Hessian automatically becomes positive definite (or at least semi-positive definite). If the pseudo approximation is used the matrix inversion of 4.67 is replaced by an element-by-element division:

If the costfunction surface is ''hyperparaboloidic'', one newton step will immediately bring the weights to their optimal sizes. The costfunction surface has to have axis-parallel cross sections of the hyperparaboloide for that to happen with the pseudo-newton approximation.
Usually the costfunction is certainly not hyperparaboloidic and we have to take several steps. The costfunction can be of such a type, that the newton methods take a too large step, crossing the valley, and getting a worser costfunction value. In parallel with the gradient descent methods, some of the cure for this behavior is soft line search. To make the newton method yet more robust the failed newton step can be followed by a gradient descent step.
The Levenberg-Marquardt method uses the gauss-newton approximation step and the usual gradient descent step in a combination:
The
is here the identity matrix. Thus the Levenberg-Marquardt
approximation is a interpolating between the gauss-newton step and a
infinitely small gradient step with the parameter
. The parameter
functions as a sort of line search parameter, and is stored from epoch to
epoch.
It seems to me that this method requires matrix inversions in the line search.
The pure newton has quadratic convergence for convex costfunctions [62].
This is actually just the pure newton. Correlation among the outputs (see the equation 4.36) is no further problem:

Neither is the augmentation of the square weight decay. In this case we end up with the ridge regression: