Avoid idle GPU
Created by: APJansen
The idea
We observed large gaps between training steps in the TensorBoard profile when running on the GPU. After a lot of fiddling, these gaps were found to be (at least partially) due to a per-epoch overhead from TensorFlow. This overhead is reduced by redefining one actual training step as a single batch of size 1, rather than as a whole epoch as it used to be.
Implementation-wise, this is done by copying the input up to 100 times, creating a set of 100 identical training inputs. Existing callbacks simply implement `on_step_end` instead of `on_epoch_end` and inherit from `CallbackStep`, which takes care of the conversion between steps, batches and epochs.
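The callback pattern described above can be sketched roughly as follows. This is a plain-Python illustration that mimics the Keras callback interface; the class bodies (and the `PrintEvery` subclass) are assumptions for illustration, not the actual n3fit code:

```python
class CallbackStep:
    """Illustrative base class mimicking a Keras callback (hypothetical body).

    One conceptual training step is a single batch; a Keras "epoch" is just a
    bundle of `steps_per_epoch` identical batches, so we translate Keras'
    per-batch events into per-step events for subclasses.
    """

    def __init__(self, steps_per_epoch=100):
        self.steps_per_epoch = steps_per_epoch
        self.step = 0  # global step counter across epochs

    def on_batch_end(self, batch, logs=None):
        # Keras calls this after every batch; each batch is one training step.
        self.step += 1
        self.on_step_end(self.step, logs or {})

    def on_step_end(self, step, logs):
        # Subclasses implement this instead of on_epoch_end.
        raise NotImplementedError


class PrintEvery(CallbackStep):
    """Hypothetical subclass: act only every `steps_per_epoch` steps."""

    def __init__(self):
        super().__init__()
        self.seen = []

    def on_step_end(self, step, logs):
        if step % self.steps_per_epoch == 0:
            self.seen.append(step)
```

A subclass only sees `on_step_end`, so code written against the old one-epoch-per-step scheme needs no change beyond the renamed hook.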
One extra issue is that Keras computes metrics cumulatively; they are converted back to per-step values in `CallbackStep.correct_logs`. This is the only source of slight numerical differences, which however appear only in the logs and do not propagate at all: training results remain identical.
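Keras reports each metric as a running mean over the batches of an epoch, so recovering the per-step value is a one-line inversion of that mean. A sketch of the idea behind the correction (the function name and signature here are hypothetical, not the actual `correct_logs` implementation):

```python
def per_step_metric(cum_avg, prev_cum_avg, k):
    """Invert Keras' running mean.

    Given the cumulative average after batch k (1-based) and after
    batch k-1, recover the metric value of batch k alone.
    """
    return k * cum_avg - (k - 1) * prev_cum_avg


# Example: per-batch losses 2.0, 4.0, 6.0 give running means 2.0, 3.0, 4.0;
# inverting the running mean recovers the per-batch values. The subtraction
# is where the tiny floating-point differences in the logs come from.
```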
Performance
Timings, in seconds, for 1000 epochs of the main runcard (NNPDF40_nnlo_as_01180_1000) on Snellius, with 100 replicas on the GPU or 1 replica on the CPU. The GPU memory used is shown in brackets.
| branch | commit hash | 1 replica (s) | 100 replicas (s) |
|---|---|---|---|
| master | 0a5fc614 | 145 | 91 |
| avoid-idle-gpu | bb366aa6da | 145 | 67 |
This saves 24 seconds per 1k epochs, roughly a 26% speedup for 100 replicas.
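The quoted saving follows directly from the 100-replica column of the table:

```python
# Seconds per 1000 epochs for 100 replicas on the GPU, from the table above.
master, avoid_idle = 91, 67

saving = master - avoid_idle       # absolute saving in seconds
fraction = saving / master         # relative speedup vs master

print(saving, round(fraction, 2))  # prints: 24 0.26
```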
Profile
Note: slightly outdated profiles
Merge request reports
Activity
added escience n3fit performance labels
assigned to @enocera
mentioned in merge request !1802 (closed)
mentioned in issue #1977
Created by: APJansen
The issue with the regression test was fixed by simply rebasing. (And that then broke another one, but I think it's just a fluke that will be fixed by rerunning.)
I've updated the timings as well. (Don't worry about the higher time on the CPU; that's just because the CPU nodes I was using before aren't available today. This PR only affects >1 replica.)
Side comment: I added a slight fix to the logging, where for multiple replicas it would print "Validation chi2s:" and then nothing. Incidentally, I don't think the comment `# The partial chi2 makes no sense for more than one replica at once` is true anymore?

assigned to @enocera
unassigned @enocera
requested review from @enocera
added run-fit-bot label
Created by: github-actions[bot]
Greetings from your nice fit! I have good news for you, I just finished my tasks:
- Fit Name: NNBOT-9cf1f0522-2024-06-07
- Fit Report wrt master: https://vp.nnpdf.science/XCW60m6ySVO1cqBCe6Zlig==
- Fit Report wrt latest stable reference: https://vp.nnpdf.science/72Hgjp4vQOaUBgBm6PNnnQ==
- Fit Data: https://data.nnpdf.science/fits/NNBOT-9cf1f0522-2024-06-07.tar.gz
Check the report carefully, and please buy me a coffee, or better, a GPU!

mentioned in issue #2118
removed run-fit-bot label
added redo-regressions label
added run-fit-bot label and removed redo-regressions label
Created by: github-actions[bot]
Greetings from your nice fit! I have good news for you, I just finished my tasks:
- Fit Name: NNBOT-c2dc50df8-2024-07-17
- Fit Report wrt master: https://vp.nnpdf.science/1cLR0izpTQqk193L6Ih_6w==
- Fit Report wrt latest stable reference: https://vp.nnpdf.science/fEvlOQ1QREyxenwccx4zIA==
- Fit Data: https://data.nnpdf.science/fits/NNBOT-c2dc50df8-2024-07-17.tar.gz
Check the report carefully, and please buy me a coffee, or better, a GPU!

Created by: scarlehoff
I'm going to merge this since the tests are passing and it is rebased on top of master (which means it is probably fixing something that has changed since the last merge, probably the TF / np version).
Worst case scenario, it can be reverted. The report in #2127 (https://vp.nnpdf.science/WbBCvsjfQV-6ncIQ3GhCVw==) was made with a PR which is on top of this one, so we have a reproduction of 4.0 with these changes that seems to do OK.