
Avoid idle gpu

Merged Emanuele Roberto Nocera requested to merge avoid-idle-gpu into master

Created by: APJansen

The idea

We observed large gaps between training steps in the TensorBoard profile when running on the GPU. After a lot of fiddling, these were found to be (at least partially) due to a per-epoch overhead from TensorFlow. This overhead is reduced by redefining one actual training step as a single batch of size 1, rather than as a whole epoch as it used to be.

Implementation-wise, this is done by copying the input up to 100 times, creating a set of 100 identical training inputs. Existing callbacks simply implement on_step_end instead of on_epoch_end and inherit from CallbackStep, which takes care of the conversion between steps, batches and epochs. One extra issue is that Keras computes metrics cumulatively; they are converted back to per-step values in CallbackStep.correct_logs. This is the only source of slight numerical differences. These appear only in the logs and do not propagate: training results remain identical.
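To make the bookkeeping concrete, here is a minimal sketch in plain Python. It is not the actual CallbackStep code from this branch. It assumes that "cumulatively" means Keras reports a running mean over the batches seen so far in the current epoch, and the helper names `global_step` and `per_step_values` are hypothetical.

```python
# Hypothetical sketch; not the CallbackStep implementation from this branch.
STEPS_PER_EPOCH = 100  # number of identical input copies, as described above


def global_step(epoch, batch, steps_per_epoch=STEPS_PER_EPOCH):
    """Map a Keras (epoch, batch) pair to a single running training-step index."""
    return epoch * steps_per_epoch + batch


def per_step_values(running_means):
    """Recover each step's value from the running means within one epoch.

    Assumption: m_k is the mean of the first k per-step values, so
    v_k = k * m_k - (k - 1) * m_(k-1).
    """
    values = []
    previous_sum = 0.0
    for k, mean in enumerate(running_means, start=1):
        current_sum = k * mean
        values.append(current_sum - previous_sum)
        previous_sum = current_sum
    return values


# Example: per-step losses and the running means a cumulative log would show.
true_losses = [4.0, 2.0, 1.0, 0.5]
running = [sum(true_losses[: k + 1]) / (k + 1) for k in range(len(true_losses))]
recovered = per_step_values(running)
```

Undoing the mean involves a multiplication and a subtraction in floating point, so the recovered values can differ from the true ones by rounding. That would be consistent with the slight numerical differences in the logs mentioned above.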

Performance

Timings in seconds for 1000 epochs of the main runcard (NNPDF40_nnlo_as_01180_1000), on Snellius, with 100 replicas on the GPU or 1 replica on the CPU. The GPU memory used is given in brackets.

branch           commit hash   1 replica (s)   100 replicas (s)
master           0a5fc614      145             91
avoid-idle-gpu   bb366aa6da    145             67

This saves 24 seconds (about 26%) per 1k epochs for 100 replicas on the GPU.

Profile

Note: slightly outdated profiles

This branch: [image]

Before this single commit, for comparison: [image]

Merge request reports

Merged by Emanuele Roberto Nocera (Jul 17, 2024 1:34pm UTC)

Merge details

  • Changes merged into with 1637dba7.
  • Deleted the source branch.

Activity

  • Created by: scarlehoff

    Review: Approved

    I think this can be merged. #2188 can be dealt with in a separate PR.

  • added run-fit-bot label and removed redo-regressions label

  • Created by: github-actions[bot]

    Greetings from your nice fit :robot: ! I have good news for you, I just finished my tasks:

    Check the report carefully, and please buy me a :coffee: , or better, a GPU :wink:!

  • Created by: scarlehoff

    I'm going to merge this since the tests are passing and it is rebased on top of master (which means it is probably fixing something that has changed since the last merge, probably the TF / np version).

    Worst case scenario, it can be reverted. The report in #2127 https://vp.nnpdf.science/WbBCvsjfQV-6ncIQ3GhCVw== was made with a PR which is on top of this one, so we have a reproduction of 4.0 with these changes that seems to do OK.
