Revert the default for parallel fits to a single model per replica
Created by: scarlehoff
This PR reverts back to a model-per-replica configuration (instead of the multidense layer), since the multidense layer is not able (at the moment) to fit models independently, which is necessary for a complete fit.
For sequential fits nothing should change; for parallel fits the difference is about a factor of 2 in time (but I'm guessing that the factor of 2 we were gaining might be why the fits are of worse quality).
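To make the distinction concrete, here is a toy sketch of the two strategies (made-up names, not the actual n3fit code): a list of independent Keras models, one per replica, versus a single layer that stacks the kernels of all replicas into one weight tensor.

```python
import tensorflow as tf

replicas, n_in, n_out = 3, 8, 4

# "dense": one independent Keras model per replica, later trained side by side.
per_replica_models = [
    tf.keras.Sequential(
        [tf.keras.Input(shape=(n_in,)), tf.keras.layers.Dense(n_out, activation="tanh")]
    )
    for _ in range(replicas)
]


class ToyMultiDense(tf.keras.layers.Layer):
    """Toy stand-in for a multidense layer: all replica kernels live in one tensor."""

    def __init__(self, units, replicas, **kwargs):
        super().__init__(**kwargs)
        self.units = units
        self.replicas = replicas

    def build(self, input_shape):
        # One kernel per replica, stored as a single (replicas, n_in, n_out) weight,
        # so every replica sits inside the same model and the same graph.
        self.kernel = self.add_weight(
            shape=(self.replicas, input_shape[-1], self.units),
            initializer="glorot_uniform",
        )

    def call(self, x):
        # (batch, n_in) -> (batch, replicas, n_out)
        return tf.tanh(tf.einsum("bi,rij->brj", x, self.kernel))


# "multidense": a single model producing the output of all replicas at once.
multidense_model = tf.keras.Sequential(
    [tf.keras.Input(shape=(n_in,)), ToyMultiDense(n_out, replicas)]
)
print(multidense_model(tf.random.normal((5, n_in))).shape)  # (5, 3, 4)
```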
I've introduced a `multidense` layer type if people want to play around with multidense.

Here are two reports; as you can see, `multidense` has a terrible chi2 while `dense` is statistically compatible with 4.0.

multidense: https://vp.nnpdf.science/UGSJ6OsPTdSR2s97OXIvDw==
dense: https://vp.nnpdf.science/WbBCvsjfQV-6ncIQ3GhCVw==

I haven't looked deep into the details of the error, but I don't think `multidense` was ever truly benchmarked in a real-life scenario? At least I could not find any examples.

Anyway, for reference, these were fits of 120 replicas on an RTX 2080: 80 minutes with `dense`, 40 minutes with `multidense`.

It would be ideal if `multidense` could be fixed, but I can live with 80-minute runs... for now `checks.py` will fail if you use `multidense`, to make sure it is only used as a development option.
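For illustration only, a minimal sketch of the kind of guard described above, using a hypothetical `check_layer_type` helper; the actual check in `checks.py` is likely written differently.

```python
def check_layer_type(layer_type: str) -> None:
    """Hypothetical guard: refuse `multidense` so it stays a development-only option."""
    if layer_type == "multidense":
        raise ValueError(
            "`multidense` is currently a development-only option and cannot be used "
            "for production fits; use `dense` (one model per replica) instead."
        )
```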
Fixes, in a way, #2118
TODO:
- Fix regression tests; note that most of the `.json` files have no changes (proving that, for one replica, `multidense` and `dense` were really the same).
- Check it works with TF 2.16/2.17 (if I'm able to get a computer with both a GPU and a recent-enough version of python/tf at the same time, which is not trivial).
RE the last point: I cannot check it myself, but the python tests are already getting TF 2.17, so the code does work.
Merge request reports
Activity
requested review from @enocera
```diff
         return layers

-    elif layer_type == "single_dense":
+    elif layer_type == "dense":

+        # The checks should've triggered, but better safe than sorry
+        if len(replica_seeds) > 1:
+            raise ValueError("`single_dense` only valid with one replica")
```
mentioned in merge request !1939 (merged)
mentioned in issue #2118
Created by: goord
I am suspicious about whether the multidense implementation results in a correct gradient-descent process. Comparing single-dense to multi-dense for parallel replicas, I see quite different behavior in the evolution of the PDF during training. The single-dense PDF jumps around a lot more than the multi-dense one, and I'm not sure we can attribute that to numerical noise...
Created by: Cmurilochem
> If one of you confirm it also works for you, we can merge this.
Hi @scarlehoff. I tested this branch as requested and ran a parallel fit using GPUs and python 3.11 + tf 2.15.1; see below.
python 3.11 + tf 2.15.1: https://vp.nnpdf.science/ICdkTBY0T2ebanL1lLNcMg==

The results match nearly perfectly the ones from the sequential `NNPDF40_nnlo_as_01180_qcd` fit. As of today, however, I am not able to run a parallel GPU fit with newer versions of TensorFlow:
- With python 3.11 + tf 2.16.2 or 2.17.0: I have strange problems associated with memory leaks (https://github.com/tensorflow/tensorflow/issues/64170).
- With python 3.12 + tf 2.16.2 or 2.17.0: TensorFlow never recognises the GPUs in the Snellius environment.

These issues require some more investigation, but they certainly do not have anything to do with our reproducibility issue.
Created by: scarlehoff
> I am suspicious whether the multidense implementation results in the correct gradient descent process.
My guess (which I haven't tested) is that when you have a single model per replica, each model is trained independently, so they all descend the gradient independently.
Instead, when you have one single model, they all have to descend the gradient together. If you test with a small number of models, chances are that, even though they are descending together, they all converge. The more models you add, the less likely it is that they will all converge at once.
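A toy sketch of that distinction (made-up two-replica models, not the n3fit training loop): with one model per replica, each replica gets its own optimizer and its own update, while a single stacked model gets one joint update driven by the combined loss, so the shared iteration counter, learning-rate schedule and stopping criterion tie the replicas together.

```python
import tensorflow as tf

x = tf.random.normal((32, 8))
y = tf.random.normal((32, 1))


def make_model():
    return tf.keras.Sequential([tf.keras.Input(shape=(8,)), tf.keras.layers.Dense(1)])


# One model per replica: each replica has its own optimizer state and takes its own
# gradient step, so every replica descends (and can stop) independently of the others.
models = [make_model() for _ in range(2)]
optimizers = [tf.keras.optimizers.Adam(1e-3) for _ in models]
for model, opt in zip(models, optimizers):
    with tf.GradientTape() as tape:
        loss = tf.reduce_mean((model(x) - y) ** 2)
    grads = tape.gradient(loss, model.trainable_variables)
    opt.apply_gradients(zip(grads, model.trainable_variables))

# Single stacked model (stand-in for a multidense layer): one optimizer, one combined
# loss, one joint update. The parameters are still disjoint, but the shared iteration
# counter, any learning-rate schedule and the common stopping all act on every replica
# at once.
stacked = [make_model() for _ in range(2)]
joint_vars = [v for m in stacked for v in m.trainable_variables]
joint_opt = tf.keras.optimizers.Adam(1e-3)
with tf.GradientTape() as tape:
    joint_loss = tf.add_n([tf.reduce_mean((m(x) - y) ** 2) for m in stacked])
grads = tape.gradient(joint_loss, joint_vars)
joint_opt.apply_gradients(zip(grads, joint_vars))
```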