Realising a factor 20-30 speedup on GPU
Created by: APJansen
Last week @goord and I started looking at TensorBoard profiles of the code running on a GPU. We found and resolved several performance bottlenecks, resulting in a total speedup of a factor of 20-30 compared to the current state of the trvl-mask-layers branch of #1788. As a result, we are able to do a full 17k-epoch run of the NNPDF40_nnlo_as_01180_100 runcard with 100 replicas within half an hour.
We have this running, so the time quoted is the actual start-to-end wall time (to be precise, it took 19 minutes, 9 of which were spent loading the data, building the model, etc.). Most of it still requires a lot of cleanup to be integrated properly though. Currently it crashes just after the fit, simply because the appropriate changes haven't been made yet.
Factors contributing to speedup
In no particular order, the factors contributing to the speedup are:
- Rewriting the FK table contractions. Several things here:
- We have restructured the PDF so that the replicas are in the first rather than the last axis. I'm no expert, but as I understand it, the values in the last axis are contiguous in memory, and since it is the x and flavour axes that are contracted, it's beneficial to have those last.
- We've rewritten the masking from using `tf.boolean_mask`, where the output shape depends on the values (i.e. the number of `True`s), to a precomputed matrix multiplication. So the FK table for a DY experiment is now of shape (n, x, f, x, f). A rough sketch of both of these changes is given after this list.
- Having a single PDF with all the replicas inside, rather than 100 separate PDFs, as in #1782. We saw that the 30% speedup observed there (which becomes much more significant given all the other speedups) is mainly due to kernel loading: previously all PDFs were computed separately, so now not only is the computation done in parallel, you also avoid the overhead of starting a new computation on the GPU every time, which was quite significant.
- There was a huge gap between every step, almost as long as a step itself after the first two points were implemented, in which the GPU was idle. I eventually found out it was due to some TensorFlow overhead that happens every epoch, and currently every epoch corresponds to a single step. This can be avoided entirely by the stupid-looking trick of copying the input grids as many times as the number of steps you want to run, and then training for 1 "epoch" with a batch size of 1 (see the sketch after this list). This almost completely removes the gaps, while doing the same number of steps. (If the memory this takes is an issue, we can limit it to, say, 1k copies, at the cost of worrying about what to do when the desired number of epochs is not divisible by this.)
- The validation losses are computed from scratch every step, starting from x. This repeats the computation of the training model up to and including the computation of the observables. If this were rewritten to start from the observables instead, the cost would go to essentially 0 (from about 30% now). (This hasn't been started; in the timing above I cheated by skipping this computation.) I think this would also improve readability: one model with 3 losses, rather than 3 models.
- Even more is possible: for instance, even now ~20% of every step is spent on restructuring the FK tables, doing the same work every step. I haven't been able to fix this yet. Perhaps some more tinkering with index orders and contraction orders, along the lines of the first point, can fix this.
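To make the first point more concrete, below is a minimal, self-contained sketch of the two ideas: a single PDF tensor with the replica axis first, so that the contracted x and flavour axes are the trailing (contiguous) ones, and the boolean mask replaced by a precomputed 0/1 matrix so that output shapes are static. The shapes, names and einsum strings are illustrative assumptions, not the actual n3fit implementation.

```python
import tensorflow as tf

# Illustrative sizes, not the real n3fit ones
replicas, n_x, n_fl, n_dat = 10, 50, 14, 20

# PDF with the replica axis first: (replicas, x, flavour).
# The contracted x/flavour axes are now the trailing, contiguous ones.
pdf = tf.random.normal((replicas, n_x, n_fl))

# DIS-like FK table of shape (datapoints, x, flavour):
# contract over x and flavour, keep replicas in front -> (replicas, datapoints)
fk_dis = tf.random.normal((n_dat, n_x, n_fl))
obs_dis = tf.einsum('nxf,rxf->rn', fk_dis, pdf)

# DY-like FK table of shape (n, x, f, x, f), convoluted with two PDFs
fk_dy = tf.random.normal((n_dat, n_x, n_fl, n_x, n_fl))
obs_dy = tf.einsum('nxfyg,rxf,ryg->rn', fk_dy, pdf, pdf)

# Masking as a precomputed matrix multiplication instead of tf.boolean_mask:
# the boolean mask over datapoints becomes a fixed (n_kept, n_dat) 0/1 matrix,
# built once, so the output shape no longer depends on the mask values.
mask = tf.constant([True, False] * (n_dat // 2))
mask_matrix = tf.boolean_mask(tf.eye(n_dat), mask)  # precomputed, static shape
masked_obs = tf.einsum('mn,rn->rm', mask_matrix, obs_dis)
```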
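Here is also a toy illustration of the epoch-to-batch trick from the third point. The model and grids are stand-ins (not the n3fit objects); the only point is the `model.fit` call pattern: replicating the (identical) input and training for a single "epoch" performs the same number of gradient steps while paying the per-epoch TensorFlow overhead only once.

```python
import numpy as np
import tensorflow as tf

# Toy stand-ins for the real model and input grid (illustrative only)
x_grid = np.random.rand(1, 50, 1).astype("float32")
y_dummy = np.zeros((1, 1), dtype="float32")

model = tf.keras.Sequential([
    tf.keras.Input(shape=(50, 1)),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(1),
])
model.compile(optimizer="adam", loss="mse")

n_steps = 1000

# Old scheme: one step per epoch -> per-epoch overhead between every step
# model.fit(x_grid, y_dummy, epochs=n_steps, batch_size=1, verbose=0)

# New scheme: repeat the input n_steps times and run a single "epoch" with
# batch_size=1, so the same number of steps is taken without the gaps
x_rep = np.repeat(x_grid, n_steps, axis=0)
y_rep = np.repeat(y_dummy, n_steps, axis=0)
model.fit(x_rep, y_rep, epochs=1, batch_size=1, verbose=0)
```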
Steps remaining
Unfortunately I'll have very little time in the next month to work on this (holidays and other projects). Below I'll list the steps necessary to integrate it, and where help could be useful.
- Get the trvl-mask-layers branch to pass the tests.
- Some help here would be useful; @Radonirinaunimi would take a look.
- The changes from the first point above are in a branch gpu-go-brrr (;P), off of trvl-mask-layers. This can be merged into it.
- Once merged they should be tested and reviewed.
- What would be super useful here is a list of runcards to test, along with the actual results from master. Since this branch is already much faster, it's a lot easier if we only need to run them on this branch and already have something to compare against.
- The rewriting from epochs to batches is independent of all the other changes. If anyone wants to pick that up, that'd be great. I started it in #1802. UPDATE: this is done, it just needs testing.
- The rewriting of the 3 "models" into one with 3 losses (by which, btw, I don't necessarily mean that we put that part in an actual `keras.Loss` or something, not sure if that's efficient or not, just that we don't repeat the computations). I think this is also relatively independent of the rest. If anyone wants to do this that'd be great; it doesn't have the highest payout/effort ratio of all these, so I can also do it myself after the last point. A rough sketch of the idea is given after this list. UPDATE: WIP in #1855
- The multi-replica PDF: this is the most work, and the most specialized, so I think it's best if I focus on this. UPDATE: turned into its own issue #1880, see there for updated progress.
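For the "one model with 3 losses" item, here is a rough, self-contained sketch of the direction I mean. The shapes, layer names and masks are illustrative assumptions, not the actual n3fit code, and the real stopping logic is more involved; the only point is that the validation output can reuse the already computed observables instead of recomputing them from x.

```python
import tensorflow as tf

n_x, n_fl, n_dat = 50, 14, 20

# x -> PDF -> observables, computed only once (toy stand-in layers)
x_in = tf.keras.Input(shape=(n_x, 1))
pdf = tf.keras.layers.Dense(n_fl)(x_in)
observables = tf.keras.layers.Dense(n_dat, name="observables")(
    tf.keras.layers.Flatten()(pdf))

# Fixed 0/1 masks selecting the training and validation datapoints
tr_mask = tf.constant([1.0, 0.0] * (n_dat // 2))
vl_mask = 1.0 - tr_mask

tr_out = tf.keras.layers.Lambda(lambda o: o * tr_mask, name="tr")(observables)
vl_out = tf.keras.layers.Lambda(lambda o: o * vl_mask, name="vl")(observables)

model = tf.keras.Model(inputs=x_in, outputs=[tr_out, vl_out])

# Only the training head gets a loss; the validation head is just a cheap
# masked view of the same observables tensor, so evaluating it (e.g. in a
# callback for stopping) costs essentially nothing extra.
model.compile(optimizer="adam", loss={"tr": "mse"})
```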
Tensorboard profile
Here is the TensorBoard profile with all these improvements, which may be nice to see: