
Fit many replicas in parallel

Emanuele Roberto Nocera requested to merge multireplica_n3fit_mk2 into master

Created by: scarlehoff

One of the problems we have towards a future (better) hyperparameter scan is that we are limited by how many replicas we can use to inform the hyperparametrization algorithm, so I thought it would be nice to exploit the fact that many of the calculations done for different replicas are shared.

The basis of this PR (once it is finished; there are others to come*, but I've decided to do things step by step) is to create a model that concatenates all PDFs, so that the PDF output has shape (n_replicas, xgrid, flavours). One then continues doing everything in the normal way, and at the end n3fit computes:

Total_Loss = \sum_{i \in replicas} L_{i}

where each L_{i} depends only on one of the PDFs, so that gradient descent will try to minimize all of them at once. As the fit advances, some of the replicas will stop training (still to do), which means that towards the end n3fit may still be computing 50 gradients even though 49 of them have weight 0, but the performance gain outweighs that little inefficiency.
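To make the "one model for all replicas" idea concrete, here is a minimal, hypothetical sketch in plain Keras (none of these names or layer choices come from n3fit itself): per-replica networks share the x-grid input, their outputs are stacked into a (n_replicas, xgrid, flavours) tensor, and the training loss is the weighted sum of per-replica losses, with a non-trainable weight per replica that can later be set to zero.

```python
# Minimal sketch, not the actual n3fit layers: each replica keeps its own small
# network, all of them share the same x-grid input, the outputs are stacked into
# a (n_replicas, xgrid, flavours) tensor and the loss is the sum over replicas.
import tensorflow as tf

n_replicas, n_x, n_flavours = 3, 50, 8   # illustrative sizes

x_input = tf.keras.Input(shape=(n_x, 1))

replica_pdfs = []
for _ in range(n_replicas):
    hidden = tf.keras.layers.Dense(25, activation="tanh")(x_input)
    pdf = tf.keras.layers.Dense(n_flavours)(hidden)      # (batch, n_x, n_flavours)
    replica_pdfs.append(pdf)

# Stack along a new replica axis: (batch, n_replicas, n_x, n_flavours)
combined_pdf = tf.keras.layers.Lambda(lambda t: tf.stack(t, axis=1))(replica_pdfs)
parallel_model = tf.keras.Model(inputs=x_input, outputs=combined_pdf)

# One (non-trainable) weight per replica: set it to 0 to drop that replica
# from the total loss once it has stopped training.
replica_weights = tf.Variable(tf.ones(n_replicas), trainable=False)

def total_loss(y_true, y_pred):
    """Sum of per-replica MSE losses, Total_Loss = sum_i w_i L_i."""
    per_replica = tf.reduce_mean(tf.square(y_true - y_pred), axis=[0, 2, 3])
    return tf.reduce_sum(replica_weights * per_replica)
```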

Because of the way the optimizers in TensorFlow work, they will try to minimize Total_Loss, which means that any badly behaved replica will dominate. This is fixable, but I'll leave it for the future. The short-term goal (for whatever value of short) is to improve on the hyperoptimization, for which badly behaved replicas would signal bad architectures, so we don't want them anyway.

Note that the choice of optimizer is quite important here: any GA would be very bad, and only gradient descent with a learning rate per weight/layer can be expected to train.
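Purely as an illustration of that point, and reusing the hypothetical parallel_model and total_loss from the sketch above, the compilation step would use an optimizer with adaptive per-parameter learning rates (e.g. Adam) rather than a GA or plain fixed-rate SGD:

```python
# Sketch: an adaptive optimizer effectively gives each replica's weights their
# own learning-rate scaling, which is what makes the combined fit viable.
parallel_model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=1e-3),
    loss=total_loss,
)
```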

I'm also guessing this will be very useful for closure tests, or even for using closure tests as the reward of the hyperoptimization procedure.

My to do list for this PR is:

  • Fit many replicas in one go
  • Change the output of the model predict to be the loss, so that there's no need to evaluate the model
  • Keep track of all replicas separately in the stopping
  • Stop training when the stopping decides that it is time to stop (a rough sketch of this and the previous item is given right after this list)
  • Apply positivity separately per replica
  • Checks so that this feature is only used with options which are known to work
  • Test that a standard fit works
  • Test that non-standard fitting techniques work
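For the two stopping items marked above, here is a rough, hypothetical sketch of one way it could work (it reuses the replica_weights variable from the first sketch; all names are illustrative and the real stopping logic in n3fit is more involved): each replica's validation loss is monitored separately, a replica that stops improving gets its loss weight set to 0, and the fit ends once all replicas have stopped.

```python
# Hypothetical per-replica early stopping implemented through the loss weights.
import numpy as np
import tensorflow as tf

class PerReplicaStopping(tf.keras.callbacks.Callback):
    def __init__(self, replica_weights, validation_losses, patience=100):
        # replica_weights: the non-trainable tf.Variable from the first sketch
        # validation_losses: callable returning an array of per-replica val losses
        super().__init__()
        self.replica_weights = replica_weights
        self.validation_losses = validation_losses
        self.patience = patience
        n = int(replica_weights.shape[0])
        self.best = np.full(n, np.inf)
        self.counter = np.zeros(n, dtype=int)

    def on_epoch_end(self, epoch, logs=None):
        val = np.asarray(self.validation_losses())
        improved = val < self.best
        self.best = np.where(improved, val, self.best)
        self.counter = np.where(improved, 0, self.counter + 1)

        # Drop replicas that ran out of patience from the total loss.
        still_active = (self.counter < self.patience).astype("float32")
        self.replica_weights.assign(self.replica_weights.numpy() * still_active)

        # End the fit once every replica has stopped.
        if not np.any(self.replica_weights.numpy()):
            self.model.stop_training = True
```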

Please let me know if you see any issues with this or if you think anything should be added (so that I can add it either to this PR or to the list of things to be completed afterwards, below).

The usage is quite simple: it is enough to add a parallel_models flag to the runcard

parallel_models: 50

and then

n3fit runcard.yml 1

will fit from replica 1 to 50 in one go.

The code here is not that much better on a CPU (it is even very bad if you try to fit too many replicas; crashes have been seen), but I've managed to fit 50 replicas in less than 8 GPU hours on a discrete GPU. There's also a certain flexibility to be exploited: for instance, fitting many replicas at once might not be that useful in the end, but fitting many different architectures at once for a hyperparameter scan would be.

Note: this is pointing to tf2.4 because I now have it installed on my PC, so I rebased, but it works on tf 2.3 as well; an older version of this work, which was a quick test I did, is in here.

*the other PRs for which I have (not necessarily functional) prototypes are:

  • Fit to different data per replica
  • Allow the usage within hyperopt (multireplica/multimodel)
  • Good separation of minimization per partial loss
