Wrong check for positivity in n3fit
Created by: scarlehoff
Edit: I am editing the issue to have a more faithful title (and to quickly see what the issue was). The original issue is kept below for the record.
The issue was due to two separate bugs: one that affected only positivity, and another that was a bit more serious.

- The positivity check was performed on the logs returned by TensorFlow for the training model. These logs, however, correspond to the value of the loss *before* the weights are updated, whereas the weights are saved only *after* the update. This means the positivity that was checked was not the positivity of the saved model. The solution is to add the positivity observables to the validation model, since those are evaluated after the weights are updated.
- A more problematic issue was that the model was reloaded to its best state only when the stopping criterion ended the fit. If instead the model exhausted the training length, it was not reloaded, and the exported model corresponded to the last epoch. The solution is simply to add the reload to the `on_train_end` method of the stopping callback.
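The second fix can be sketched with a minimal, pure-Python stand-in for the stopping callback (the class and attribute names here are illustrative, not the actual n3fit code): restoring the best weights must happen unconditionally in `on_train_end`, not only on the early-stopping branch.

```python
class Stopping:
    """Tracks the best validation loss and the weights that produced it
    (hypothetical sketch of the stopping-callback pattern)."""

    def __init__(self, patience):
        self.patience = patience
        self.best_loss = float("inf")
        self.best_weights = None
        self.wait = 0
        self.stopped_early = False

    def on_epoch_end(self, epoch, loss, weights):
        if loss < self.best_loss:
            self.best_loss = loss
            self.best_weights = weights  # snapshot of the best state
            self.wait = 0
        else:
            self.wait += 1
            if self.wait >= self.patience:
                self.stopped_early = True
        return self.stopped_early  # True -> stop training

    def on_train_end(self, model):
        # The fix: reload the best state here, unconditionally, so a fit
        # that exhausts the training length is also rolled back.
        if self.best_weights is not None:
            model["weights"] = self.best_weights


# A fit that runs out of epochs without triggering early stopping:
model = {"weights": None}
stopping = Stopping(patience=10)
for epoch, (loss, weights) in enumerate([(3.0, "w0"), (1.0, "w1"), (2.0, "w2")]):
    if stopping.on_epoch_end(epoch, loss, weights):
        break
stopping.on_train_end(model)
assert model["weights"] == "w1"  # the best epoch is exported, not the last one
```

Before the fix, the `on_train_end` reload was missing, so `model["weights"]` would have stayed at the last epoch's state (`"w2"` in this toy example).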
Right now the `n3fit` fits will pass positivity almost by construction, in the sense that a replica is only marked as `POS_PASS` if it passes the positivity check during the fit.
This could generate several problems:

- In `n3fit` the positivity is multiplied by a Lagrange multiplier. This means it is not exactly the positivity prediction (it should be more negative!). We can of course also recompute the loss in `n3fit` without the multiplier, but some of the problems below still apply.
- To avoid some cases of exploding gradients that we found when we started with `n3fit`, the positivity is computed with the `elu` function, which means the loss can go "negative" by up to 10^-7. I don't think this can compensate for the -10^-4 we see in some plots, but it is not 0.
- In `n3fit` all quantities are float32. This means the fktables are also truncated to float32, so a loss of precision can happen here.
- Further precision-level differences could be introduced by the LHAPDF interpolation.
- Some other bugs??
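Two of the effects above (the slightly negative `elu` floor and the float32 truncation) are easy to demonstrate numerically. A small sketch, assuming an `elu` with alpha = 10^-7 as described above (the alpha value and the sample numbers are illustrative):

```python
import numpy as np

def elu(x, alpha=1e-7):
    # ELU with a small alpha: for very negative inputs it saturates at
    # -alpha instead of 0, so a positivity loss built on it can go
    # slightly negative (down to -alpha per point).
    return np.where(x > 0, x, alpha * (np.exp(x) - 1.0))

# For a comfortably "positive" point the penalty is not exactly 0:
print(elu(np.array([-5.0])))  # ~ -9.93e-08, not 0

# float32 truncation of an fktable-like value:
x64 = np.float64(0.123456789012345)
x32 = np.float32(x64)
print(float(x64) - float(x32))  # O(1e-9) precision loss
```

Neither effect is large, but both set a floor on how literally a "loss == 0" positivity check can be taken.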
So, given that positivity is well satisfied by most `n3fit` fits, I was wondering whether it would make sense to recompute the positivity observables at the postfit level instead of relying on the `POS_PASS` flag from `n3fit`. That way one could catch the one or two replicas that might have escaped the `n3fit` filter.
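The postfit-level recheck proposed above could look roughly like this (a hypothetical sketch: the function names, the threshold, and the exact penalty used — a plain negative-part sum, with no Lagrange multiplier and no `elu` — are all assumptions, not existing code):

```python
import numpy as np

def positivity_loss(predictions):
    """Exact positivity penalty: sum of the negative parts of the
    predictions, with no multiplier and no elu smoothing."""
    return float(np.sum(np.maximum(-np.asarray(predictions), 0.0)))

def recheck_replicas(replica_predictions, threshold=1e-6):
    """Return the indices of replicas whose recomputed positivity loss
    exceeds the threshold, i.e. the ones POS_PASS may have missed."""
    return [i for i, preds in enumerate(replica_predictions)
            if positivity_loss(preds) > threshold]

# Replica 1 has a clearly negative prediction and should be flagged:
replicas = [[0.1, 0.2, 0.05], [0.1, -0.3, 0.2], [0.0, 0.01, 0.3]]
print(recheck_replicas(replicas))  # [1]
```

The point is only that the check is cheap once the observable predictions are available, so running it at postfit would be an inexpensive safety net.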
What made me open this issue is the fact that if I re-read the `130920-nnpdf40_jcm_iterated_70k_epochs` fit into TensorFlow and try to compute the loss for this set (with a Lagrange multiplier of `1e4`) I get a loss of `10.247996285331814` for replica 99 for the `POSXSB` dataset. This is not only big enough that the filter should have caught it, but the fit should also have a bad training chi2!
I will investigate this further. I'll start by using the `relu` activation function to reduce the sources of trouble.