
Revert the default for parallel fits to a single model per replica

Open: Emanuele Roberto Nocera requested to merge revert_multidense_default into master

Created by: scarlehoff

This PR reverts the default back to a model-per-replica configuration (instead of the multidense layer), since the multidense layer is (at the moment) not able to fit the models independently, which is necessary for a complete fit.

For sequential fits nothing should change; for parallel fits the difference is about a factor of 2 in time (but I'm guessing that the factor of 2 we were gaining might be why the fits are of worse quality).
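As a rough illustration of the two strategies being compared here (a toy sketch in plain Keras written for this discussion, not the n3fit implementation): the model-per-replica setup builds independent networks with disjoint weights, while a multidense-style layer keeps all replicas inside one model by giving the kernel an extra replica axis, so a single optimizer step updates every replica at once.

    import tensorflow as tf


    class ToyMultiDense(tf.keras.layers.Layer):
        """Toy stand-in for a multidense layer: one kernel of shape (replicas, in, out)."""

        def __init__(self, replicas, units, **kwargs):
            super().__init__(**kwargs)
            self.replicas = replicas
            self.units = units

        def build(self, input_shape):
            self.kernel = self.add_weight(
                name="kernel",
                shape=(self.replicas, int(input_shape[-1]), self.units),
                initializer="glorot_normal",
            )

        def call(self, inputs):
            # (batch, features) x (replicas, features, units) -> (batch, replicas, units)
            return tf.einsum("bf,rfu->bru", inputs, self.kernel)


    def per_replica_models(n_replicas, units=8):
        """Model-per-replica strategy: independent networks with disjoint weights."""
        return [
            tf.keras.Sequential([tf.keras.layers.Dense(units, activation="tanh")])
            for _ in range(n_replicas)
        ]

With the first strategy every replica is its own model with its own training step; with the second, all replica weights live in one model and share every optimizer call, which is where the factor-of-2 speedup (and, possibly, the coupling discussed further down) comes from.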

I've introduced a multidense layer type if people want to play around with multidense.

Here are two reports; as you can see, multidense has a terrible chi2 while dense is statistically compatible with 4.0.

multidense: https://vp.nnpdf.science/UGSJ6OsPTdSR2s97OXIvDw==
dense: https://vp.nnpdf.science/WbBCvsjfQV-6ncIQ3GhCVw==

I haven't looked deep into the details of the error, but I don't think multidense was ever truly benchmarked in a real-life scenario? At least I could not find any examples.

Anyway, for reference, these were fits of 120 replicas on an RTX 2080: 80 minutes with dense, 40 minutes with multidense. It would be ideal if multidense could be fixed, but I can live with 80-minute runs. For now, checks.py will fail if you use multidense, to make sure it is only used as a development option.
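For reference, a minimal sketch of what such a guard can look like, assuming the usual reportengine-style checks in n3fit's checks.py (the function name and exact message here are illustrative, not the actual code; part of the real change appears in the diff further down):

    from reportengine.checks import CheckError


    def check_layer_type_parallel(parameters, parallel_models):
        """Illustrative only: refuse parallel (multi-replica) fits unless the
        per-replica `dense` layer type is used, keeping `multidense` dev-only."""
        if not parallel_models:
            return
        if parameters.get("layer_type") != "dense":
            raise CheckError(
                "Parallel fits are currently only supported with layer_type='dense'; "
                "'multidense' is only meant for development for now."
            )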

Fixes, in a way, #2118

TODO:

  • Fix the regression tests; note that most of the .json files have no changes (proving that, for one replica, multidense and dense were really the same)
  • Check that it works with TF 2.16/2.17 (if I manage to get a computer with both a GPU and a recent-enough version of Python/TF at the same time, which is not trivial)

Re the last point: I cannot check it on GPU, but the Python tests are running TF 2.17, so the code does work.

Activity

Diff excerpt (line numbers omitted; "+" marks the added line):

        """
        if not parallel_models:
            return
    +   if parameters.get("layer_type") != "dense":
Diff excerpt ("-"/"+" mark removed/added lines):

            return layers

    -   elif layer_type == "single_dense":
    +   elif layer_type == "dense":

    +       # The checks should've triggered, but better safe than sorry
    +       if len(replica_seeds) > 1:
    +           raise ValueError("`single_dense` only valid with one replica")
  • mentioned in merge request !1939 (merged)

  • Created by: goord

    Hi @scarlehoff, I can do the TF 2.17 / Python 3.12 test (on a CPU though :))

  • mentioned in issue #2118

  • Created by: scarlehoff

    On CPU it works fine (some of the tests are running TF 2.17). But since there were big changes in TF 2.16, I wonder whether something changed on GPU.

  • Created by: goord

    Ok, I will ask @Cmurilochem to do the test; I don't have access to a GPU right now. I will start looking for the bug in the multidense layer type.

  • Created by: scarlehoff

    No, that one (at least according to the runcard) was run sequentially.

  • Created by: scarlehoff

    If one of you confirms it also works for you, we can merge this.

  • Created by: Radonirinaunimi

    > If one of you confirms it also works for you, we can merge this.

    I can test this over the weekend.

  • Created by: goord

    I am suspicious about whether the multidense implementation results in the correct gradient-descent process. Comparing single-dense to multi-dense for parallel replicas, I see quite different behaviour in the evolution of the PDF during training. The single-dense PDF jumps around a lot more than the multi-dense one; I'm not sure we can attribute that to numerical noise...

  • Created by: Cmurilochem

    > If one of you confirms it also works for you, we can merge this.

    Hi @scarlehoff. I tested this branch as requested and ran a parallel fit using GPUs with Python 3.11 + TF 2.15.1; see below.

    python 3.11 + tf 2.15.1
    https://vp.nnpdf.science/ICdkTBY0T2ebanL1lLNcMg==

    The results match almost perfectly those of the sequential NNPDF40_nnlo_as_01180_qcd fit.

    As of today, however, I am not able to run a parallel GPU fit with newer versions of TensorFlow:

    • With Python 3.11 + TF 2.16.2 or 2.17.0: I have strange problems associated with memory leaks (https://github.com/tensorflow/tensorflow/issues/64170).
    • With Python 3.12 + TF 2.16.2 or 2.17.0: TensorFlow never recognises the GPUs in the Snellius environment. These issues require some more investigation but certainly do not have anything to do with our reproducibility issue; a quick GPU-visibility check is sketched below.
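    For what it's worth, a quick way to see whether a given environment actually exposes the GPUs to TensorFlow (generic TF calls, not an n3fit-specific tool):

        import tensorflow as tf

        # Run inside the same job/environment that would launch the fit.
        print("TF version:", tf.__version__)
        print("Visible GPUs:", tf.config.list_physical_devices("GPU"))
        print("Built with CUDA:", tf.test.is_built_with_cuda())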
  • Created by: scarlehoff

    Nice. I face the second issue (my GPU is not recognized).

    Can you disable XLA and see whether it works?

    XLA_FLAGS=--xla_disabled_backends=cpu,gpu

    It will be slower, but we just want to check whether it works.
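    If it is more convenient to set this from a Python driver script than from the shell, the variable just has to be in the environment before TensorFlow (and hence XLA) is initialised; a minimal sketch (not an n3fit feature, just the standard environment-variable mechanism):

        import os

        # Must be set before TensorFlow is imported for XLA to pick it up.
        os.environ["XLA_FLAGS"] = "--xla_disabled_backends=cpu,gpu"

        import tensorflow as tf

        print("Visible GPUs:", tf.config.list_physical_devices("GPU"))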

  • Created by: scarlehoff

    > I am suspicious about whether the multidense implementation results in the correct gradient-descent process.

    My guess (which I haven't tested) is that when you have a single model per replica, each model is trained independently (so they all descend the gradient independently).

    Instead, when you have one single model, they all have to descend the gradient together. If you test with a small number of models, chances are that, even if they are descending together, they all converge. The more models you add, the less likely it is that they will all converge at once.
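    One concrete (if simplified) way to see how a single combined model can couple otherwise independent replicas: with a summed loss over disjoint per-replica weights and a plain optimizer, the updates are still independent, but anything computed across all weights at once, for example global-norm gradient clipping, makes each replica's step depend on the others. The toy example below only illustrates that mechanism; it is not a diagnosis of the multidense problem:

        import tensorflow as tf

        # Two "replicas" with disjoint weights and a combined loss L = l1(w1) + l2(w2).
        w1 = tf.Variable(1.0)
        w2 = tf.Variable(1.0)

        with tf.GradientTape() as tape:
            l1 = (w1 - 2.0) ** 2           # replica 1 term, depends only on w1
            l2 = 100.0 * (w2 + 3.0) ** 2   # replica 2 term, depends only on w2
            loss = l1 + l2

        g1, g2 = tape.gradient(loss, [w1, w2])  # dl1/dw1 and dl2/dw2: still independent

        # Per-variable clipping: each replica's step depends only on its own gradient.
        independent_opt = tf.keras.optimizers.SGD(learning_rate=0.1, clipnorm=1.0)
        # Global-norm clipping: the rescaling factor mixes the gradients of all replicas,
        # so replica 2's large gradient shrinks the step replica 1 is allowed to take.
        coupled_opt = tf.keras.optimizers.SGD(learning_rate=0.1, global_clipnorm=1.0)
        coupled_opt.apply_gradients(zip([g1, g2], [w1, w2]))

    The same kind of cross-replica mixing can also enter through a stopping criterion or learning-rate schedule driven by the total loss.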

  • Created by: goord

    Right, in my opinion the multidense approach is rather dangerous. Any operation that mixes the weights along the replica dimension will lead to errors. It means that you need to know exactly what goes on in the optimizer, for instance.
