Next: Mean and Median Up: Neural Network Analysis Previous: Neural Network Analysis

Optimization

The heterogeneous errors in the data, due to interrun noise and intrarun-interscan noise, cause the crossvalidation error to behave rather irregularly under overfitting. Usually test sets 1 and 2 have higher errors than test sets 5 and 6, for two reasons: firstly, they are not in the basis and may therefore span differently; secondly, test sets 5 and 6 contain no interrun noise that is not already in the training set. This behavior can be studied when overfitting with the lower principal components (that is, up to 50). But if all principal components are fed to the neural network, the ordering of the crossvalidation errors reverses: test sets 5 and 6 get the highest errors. This might be explained in the following way. The highest principal components span the intrasession noise of the in-the-basis test sets. Test sets 5 and 6 have a significant contribution in these projections, while the out-of-basis test sets do not have a significant span in these components. As the pruning removes the principal components that contain little information, the errors of test sets 5 and 6 will drop below the errors of test sets 1 and 2.

  
Figure 7.12: Behaviour of the errors of the crossvalidation sets. Curves shown: learning set, test set 1, test set 2, test set 3 (rise flange), test set 4 (fall flange), test set 5, test set 6.

The crossvalidation test set at the fall flange (test set 4 in table 7.2) shows the reverse type of behavior. This test set lies right at the deactivation flange, within the hemodynamic response. As noted in section 7.1, I was not able to obtain any good results with training and test sets that contained the hemodynamic response, so these scans were masked out. Test set 4 shows the effect the hemodynamic response has on the neural network: the neural network will correctly(!) misclassify these first post-flange scans, and because the entropic error function penalizes total misclassifications hard, the error will be large. In fact, the best neural network (with respect to intrarun generalization) will be the one with the largest error on the flange crossvalidation test set.
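The hard penalty on total misclassifications can be illustrated with a minimal sketch. It assumes that the entropic error function is the usual two-class cross-entropy; the exact form used for the networks is not restated in this section, and the function names and output values below are hypothetical. For comparison, a quadratic error is shown, which stays bounded however confident the wrong answer is.

```python
import math

def cross_entropy_error(target, output, eps=1e-12):
    """Entropic error for one pattern (assumed form: two-class cross-entropy).

    target is 0 or 1, output is the network output in (0, 1).
    """
    output = min(max(output, eps), 1.0 - eps)  # guard against log(0)
    return -(target * math.log(output) + (1 - target) * math.log(1.0 - output))

def squared_error(target, output):
    """Quadratic error for one pattern, shown only for comparison."""
    return 0.5 * (target - output) ** 2

# Hypothetical outputs for a post-flange scan labelled "rest" (target 0)
# that the network increasingly takes for "activation".
for output in (0.1, 0.5, 0.9, 0.99):
    print(output, cross_entropy_error(0, output), squared_error(0, output))
```

The entropic error grows without bound as the output approaches the wrong label, whereas the quadratic error never exceeds 0.5. A network that confidently assigns the first post-flange scans to the activation state therefore gets a large error on this test set.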

The crossvalidation error can be studied in figure 7.12, which shows the error during pruning of a large, heavily regularized network. The two in-basis intrarun crossvalidation sets start out with the highest error and then, as the pruning proceeds, take a deep dive and cross below the out-of-basis test sets.

    
Figure 7.14: Final Prediction Error estimate


Figure 7.13: The effective number of parameters during pruning

The neural network is very large but also regularized very hard, which brings the number of effective parameters down below the number of patterns, as figures 7.13 and 7.14 show. The minimum of the generalization error estimates does not coincide with the minimum of test sets 1 and 2, but rather resembles the minimum of test sets 5 and 6.

Figures 7.13 and 7.14 also show a difference between the estimates of the generalization error and of the effective number of parameters when they are calculated with and without the Gauss approximation of the second-order derivative. The difference appears only early in the pruning session, where the estimate using the pseudo-Newton derivative is lower than the estimate using the pseudo-Gauss-Newton derivative. The size of the difference depends on the regularisation parameter. The minima of the two estimates lie at approximately the same place, and the estimates have the same value there. Generalization error estimates usually do not perform very well with a model that is much too large, so the difference between the pseudo-Gauss-Newton and the pseudo-Newton estimates might not be of any importance.
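How such estimates react to the choice of second-derivative approximation can be sketched as follows. The sketch is not the estimator used for the figures. It assumes a diagonal second-derivative approximation, an effective number of parameters of the form sum_i h_i / (h_i + alpha) with weight decay alpha, and Akaike's Final Prediction Error with the effective number of parameters inserted. The function names and the numbers are hypothetical.

```python
def effective_parameters(curvatures, weight_decay):
    """Effective number of parameters under a diagonal Hessian approximation.

    curvatures: second derivatives of the unregularized error, one per weight
    (pseudo-Newton or pseudo-Gauss-Newton diagonal). Assumed form:
    sum_i h_i / (h_i + weight_decay), with weight_decay > 0.
    """
    return sum(h / (h + weight_decay) for h in curvatures)

def final_prediction_error(train_error, n_patterns, p_eff):
    """Akaike's FPE with the effective number of parameters inserted."""
    if p_eff >= n_patterns:
        raise ValueError("FPE needs fewer effective parameters than patterns")
    return train_error * (n_patterns + p_eff) / (n_patterns - p_eff)

# Hypothetical curvatures: weights with curvature far below the weight decay
# contribute almost nothing, weights with curvature far above it count fully.
curvatures = [10.0, 1.0, 0.1, 0.0]
p_eff = effective_parameters(curvatures, weight_decay=1.0)
print(p_eff)
print(final_prediction_error(train_error=0.2, n_patterns=100, p_eff=p_eff))
```

Under these assumptions the two derivative approximations enter only through the curvatures h_i. A Gauss-Newton diagonal is non-negative by construction, while the diagonal of the full second derivative can be smaller or even negative away from a minimum, which would give a lower count.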

    
Figure 7.15: The error functions. Same legend as figure 7.12


Figure 7.16: Generalization estimates

The behaviour of the test sets was consistent throughout the weight decay space. Figures 7.15 and 7.16 show a neural network trained with a weight decay of 10.

The size of the neural network with the optimal architecture, chosen with the intrarun validation set, does not depend on the starting architecture or the regularization parameter. For the intrarun generalization the effective number of parameters is just under 20. For the intrarun out-of-basis generalization the architecture is 10 times greater: 200 effective parameters or less!

The generalization behavior is indeed different from what is usually observed. Training and validation patterns that are further apart in the input space usually require a smaller network to show good generalization. Here this is not the case. The validation sets have little span in the higher principal components, and the projection onto these axes is rather low.

Although the test sets have little span in the higher principal components (figure 7.6), it proved important not to leave too many principal components out. With 50 principal components I was not able to obtain generalization to the interrun out-of-basis test sets. The first 150 to 200 principal components were important for this generalization.

It was also important to normalize the principal components so that their ranges were of equal size. Neural networks that were just fed the raw principal component projections could not generalize to the test sets. There is, however, no indication that the range is the quantity that should be used for the normalization.
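The preprocessing described above can be sketched as follows. The sketch assumes that the projections are stored as one row per scan with the components ordered by decreasing variance, that each kept component is mapped to the interval from 0 to 1, and that the scaling constants are taken from the training set and reused for the test sets. None of these details is stated in the text, and the function names and numbers are hypothetical.

```python
def fit_range_normalization(train_rows, n_components):
    """Return per-component minimum and range for the first n_components columns."""
    columns = [[row[j] for row in train_rows] for j in range(n_components)]
    lows = [min(column) for column in columns]
    # A constant component has zero range; use 1.0 to avoid division by zero.
    ranges = [(max(column) - min(column)) or 1.0 for column in columns]
    return lows, ranges

def apply_range_normalization(rows, lows, ranges):
    """Keep the first len(lows) components of each row and rescale them."""
    return [[(row[j] - lows[j]) / ranges[j] for j in range(len(lows))]
            for row in rows]

# Hypothetical projections: the first component dominates in raw magnitude.
train = [[10.0, 0.2, 0.01],
         [-10.0, -0.2, 0.03],
         [0.0, 0.0, 0.02]]
lows, ranges = fit_range_normalization(train, n_components=2)
print(apply_range_normalization(train, lows, ranges))
```

After the rescaling the second component varies over the same interval as the first, so the network no longer sees it as a nearly constant input. Dividing by the standard deviation instead of the range would be an equally simple variant.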






Finn Nielsen
Sun Feb 25 19:22:55 PST 1996