
Fix multidense compatibility with tf 2.16 and keras 3

Merged Emanuele Roberto Nocera requested to merge fix-multidense-tf216 into master

Created by: APJansen

Merge request reports

Merged by Emanuele Roberto Nocera 1 year ago (Mar 5, 2024 11:12pm UTC)


Activity

  • mentioned in issue #1970 (closed)

  • Created by: APJansen

    @scarlehoff I'm not sure this fix works, and 3.12 isn't in the CI yet. Do you have a PR with the new test that I could push this to (or rebase this on)? I couldn't find it.

  • Created by: scarlehoff

    I have added 3.12 to the tests. It might fail because of other reasons of course (if it runs!), but let's see whether this one is fixed and then revert that commit.

  • Created by: scarlehoff

    When trying it on my computer I got this error, though:

    /NNPDF/src/nnpdf/n3fit/src/n3fit/model_gen.py", line 789, in generate_nn
        pdfs = layer(pdfs)
               ^^^^^^^^^^^
    /NNPDF/src/nnpdf/n3fit/src/n3fit/backends/keras_backend/multi_dense.py", line 128,
       output_shape = output_shape[:1] + [self.replicas] + output_shape[1:]
                       ~~~~~~~~~~~~~~~~~^~~~~~~~~~~~~~~~~
    TypeError: Exception encountered when calling MultiDense.call().
    can only concatenate tuple (not "list") to tuple
    
    Arguments received by MultiDense.call():
      • args=('<KerasTensor shape=(1, None, 2), dtype=float32, sparse=None, name=xgrids_processed>',)
      • kwargs=<class 'inspect._empty'>
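    A minimal sketch of what the traceback suggests (the shapes and replica count here are illustrative, not taken from the actual run): Keras 3 appears to hand `compute_output_shape`/`call` the shape as a tuple, so concatenating it with a Python list raises the `TypeError`, and casting to a list first sidesteps it.

    ```python
    # Illustrative reproduction: in Keras 3 the shape may arrive as a tuple,
    # while the multidense code builds the new shape by list concatenation.
    replicas = 4                 # illustrative value
    output_shape = (1, None, 2)  # tuple, as Keras 3 may provide it

    # output_shape[:1] + [replicas] + output_shape[1:] would raise:
    # TypeError: can only concatenate tuple (not "list") to tuple

    # Casting to a list first avoids the mismatch:
    output_shape = list(output_shape)
    fixed = output_shape[:1] + [replicas] + output_shape[1:]
    print(fixed)  # [1, 4, None, 2]
    ```
    
    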
    
  • Created by: scarlehoff

    Fixing that generates further errors, I'm afraid:

    ValueError: In a nested call() argument, you cannot mix tensors and non-tensors. Received invalid mixed argument: inputs={'pdf_x': <KerasTensor shape=(1, 1, None, 14), dtype=float32, sparse=False, name=keras_tensor_25>, 'pdf_xgrid_integration': <KerasTensor shape=(1, 1, None, 14), dtype=float32, sparse=False, name=keras_tensor_36>, 'xgrid_integration': <KerasTensor shape=(1, None, 1), dtype=float32, sparse=None, name=integration_grid>, 'photon_integral': array([[[0.]]])}

    so keras 3 might be breaking quite a lot of stuff...

  • Created by: APJansen

    Yes, I was just about to say that I would expect that... We are doing many non-standard things, particularly in MetaModel and MetaLayer, so I would expect a lot of things to break.

    I think transitioning to Keras 3 with multiple supported backends would be great, especially if all those backend wrappers could be removed (or at least moved down a level). But timing-wise perhaps this is not the best moment.

    One thing you could check is what happens with the dense-per-flavour layer, but I think the error you quote above would also arise there.

  • Created by: scarlehoff

    It is not that bad; the problem right now seems to be:

    NotImplementedError: numpy() is only available when eager execution is enabled

    I'm not sure where that's happening, since the traceback is just the training loop. At some point we are calling something we are not allowed to call (but it should've failed earlier, during compilation of the model... it's, at least partially, a tensorflow bug).

    Edit: with eager execution enabled everything works, other than the log broken down by experiment (they have changed the reporting), but that's minor.

  • Created by: APJansen

    numpy is used in op.scatter_to_one in msr_normalization.py, maybe that's it.

  • Created by: scarlehoff

    It seems to be inside tensorflow/keras. More specifically, it is treating the preprocessing factor variables as numpy arrays. It really seems like a bug in tensorflow, in that if there's anything wrong with them it should have been caught well before the training starts.

  • Created by: scarlehoff

    Indeed. The problem is there only when the preprocessing is trainable.

  • Created by: scarlehoff

    On my computer it works for 3.12 but now breaks for 3.11.

    Which I'd say is ok; now we just have to either add a conditional or fix it some other way, but it's good :)

    Let's see what happens with the rest of the tests.

  • Created by: scarlehoff

    The errors are in the np.allclose :)
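    For context, a sketch of how a sub-% difference shows up in `np.allclose` (the arrays and tolerances here are illustrative, not the actual test values): a ~1e-4 relative difference fails at the default `rtol=1e-5` but passes once the tolerance is loosened.

    ```python
    import numpy as np

    # Illustrative only: a ~1e-4 relative difference between two "replicas".
    a = np.array([1.0000, 2.0000])
    b = np.array([1.0001, 2.0002])

    print(np.allclose(a, b))             # False at the default rtol=1e-5
    print(np.allclose(a, b, rtol=1e-3))  # True with a looser tolerance
    ```
    
    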

  • Created by: scarlehoff

    (sorry for the spam of commits, I will squash all changes together; I've basically been using the CI as the test machine)

  • Created by: github-actions[bot]

    Greetings from your nice fit :robot: ! I have good news for you, I just finished my tasks:

    Check the report carefully, and please buy me a :coffee: , or better, a GPU :wink:!

  • Created by: scarlehoff

    Please @APJansen, when you have time, could you make sure that the changes I made are sensible?

    @Cmurilochem I had to change the scope of the hyperopt pickle test so that it checks that, when starting from a pickle, the next first trial does indeed give the same parameters as if the trials had been run sequentially... however, the results after the first trial change. It's a sub-% change, but that's enough to change the following trials. I've noticed that tf (or maybe keras) has changed the behaviour of the special case seed=0 (which now basically means "get some random seed" and broke some of the multidense tests); I wonder whether this is related.

    How important is it that, after a restart, the hyperopt continues on exactly the same path after the first trial? If it is important, it might be necessary to investigate where this difference is coming from before merging (or to restrict hyperopt runs to tensorflow <2.16 / python <3.12 / numpy < X.YY... not sure which library is at fault here, tbh).

  •         for i_replica in range(self.num_replicas):
                self.set_replica_weights(weights, i_replica)

        def save_weights(self, file, save_format="h5"):
            """
            Compatibility function for tf < 2.16
            """
            try:
                super().save_weights(file, save_format=save_format)
            except TypeError:
                new_file = file.with_suffix(".weights.h5")
                super().save_weights(new_file)
                shutil.move(new_file, file)
    • Created by: APJansen

      Maybe add a comment here noting which branch corresponds to which version. Also, I'm not sure what happens here: does the second branch require a particular suffix? Or does it not allow you to overwrite an existing file?

    • Created by: scarlehoff

      I'll add it as a comment, but the reason is that in keras 3 save_format doesn't exist anymore and instead the format is read directly from the extension (see the docs).

      So you are forced to save it to a .weights.h5 file.
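      The renaming in the except branch can be sketched like this (pathlib only; the filename is illustrative):

      ```python
      from pathlib import Path

      # Keras 3 infers the save format from the filename and expects weights
      # files to end in ".weights.h5", hence the rename before saving.
      file = Path("model_weights.h5")             # illustrative filename
      new_file = file.with_suffix(".weights.h5")  # suffix Keras 3 expects

      print(new_file.name)  # model_weights.weights.h5
      ```
      
      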
