# Multi dense layer

Created by: APJansen

## MultiDense
Main idea: As discussed already in several places, the point of this PR is to merge the multiple replicas in the tightest way possible, which is at the level of the `Dense` layer, here implemented as a `MultiDense` layer.
The essence is this line and the lines around it: we extend the layer's weights from shape `(in_units, out_units)` to shape `(replicas, in_units, out_units)`, or `(r, f, g)` for short.
The initial input at the first layer does not have a replica axis; its shape is `(batch, n_gridpoints, features)`. In this case the linked lines become `einsum("bnf, rfg -> brng")`.
Every layer thereafter will have a replica axis. This simply adds an `"r"` to the first term in the einsum, giving `einsum("brnf, rfg -> brng")`. This applies the weights of replica i to the ith component of the input, that is, to the previous layer's output corresponding to the same replica i. So it acts identically to the previous case, just more optimized.
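As a rough sketch of the two einsum patterns (illustrative sizes, not the PR's actual implementation):

```python
import tensorflow as tf

# Illustrative sizes only.
replicas, features, out_units = 3, 8, 25
batch, n_gridpoints = 1, 50

# Multi-replica weights: one (in_units, out_units) kernel per replica.
w = tf.random.normal((replicas, features, out_units))            # (r, f, g)

# First layer: the input has no replica axis yet, so it is shared.
x0 = tf.random.normal((batch, n_gridpoints, features))           # (b, n, f)
y0 = tf.einsum("bnf,rfg->brng", x0, w)                           # (b, r, n, g)

# Later layers: the input already has a replica axis, and replica i of
# the weights only ever sees replica i of the input.
x1 = tf.random.normal((batch, replicas, n_gridpoints, features)) # (b, r, n, f)
y1 = tf.einsum("brnf,rfg->brng", x1, w)                          # (b, r, n, g)
```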
## MultiInitializer
Weight initialization: After all the preceding refactorings, it is quite simple to initialize the weights in the same manner as is done now. A list of seeds is given, one per replica, along with an initializer that has a seed of its own, to which the per-replica seeds are added (so that different layers can be differentiated). A custom `MultiInitializer` class takes care of resetting the initializer to a given replica's seed, creating that replica's weights, and stacking everything into a single weight tensor.
Note that many initializers' statistics depend on the shape of the input, so using a single initializer out of the box would not only give different results because it is seeded differently, it would actually be statistically different.
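For illustration, the per-replica seeding could look roughly like this (the function name and signature are made up, not the PR's `MultiInitializer` API):

```python
import tensorflow as tf

def stack_replica_weights(initializer_class, base_seed, replica_seeds, shape):
    """Build a (replicas, *shape) weight tensor, one seeded block per replica.

    Each replica's block is drawn with seed base_seed + replica_seed and with
    the single-replica `shape`, so its statistics match an ordinary Dense kernel.
    """
    blocks = []
    for replica_seed in replica_seeds:
        initializer = initializer_class(seed=base_seed + replica_seed)
        blocks.append(initializer(shape))      # shape = (in_units, out_units)
    return tf.stack(blocks, axis=0)            # (replicas, in_units, out_units)

# Example: 3 replicas of an (8, 25) kernel with Glorot initialization.
weights = stack_replica_weights(
    tf.keras.initializers.GlorotUniform,
    base_seed=42,
    replica_seeds=[0, 1, 2],
    shape=(8, 25),
)
```

Drawing each block with the single-replica shape is what keeps the statistics right: Glorot scaling, for instance, is computed from the kernel shape, so initializing the full `(r, f, g)` tensor in one call would change the variance.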
## Dropout
Naively applying dropout to the multi-replica output will not consistently mask an equal fraction of each replica. A simple and sufficient solution is to define the dropout mask without the replica axis and just broadcast it along the replica dimension. This is actually sort of supported already: you can subclass the `Dropout` layer and override the method `_get_noise_shape`, putting a `None` where you want it to broadcast. Note that while this turns off the same components in every replica, there is no meaning or relation to the order of the weights, so that should be completely fine.
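Such a replica-broadcast dropout could be sketched like this (hypothetical class name, assuming inputs of shape `(batch, replicas, n_gridpoints, features)`; not the PR's code):

```python
import tensorflow as tf

class ReplicaSharedDropout(tf.keras.layers.Dropout):
    """Dropout whose mask is broadcast over the replica axis (axis 1)."""

    def _get_noise_shape(self, inputs):
        input_shape = tf.shape(inputs)
        # A size-1 entry on the replica axis makes the mask broadcast over it,
        # so the same components are dropped in every replica.
        return tf.concat([input_shape[:1], [1], input_shape[2:]], axis=0)
```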
Update: Actually, this is not necessary at all. I thought dropout always sets a fixed fraction of the units to zero, but it actually works independently per element, so it is completely fine to use the standard dropout.
## Integration
I'm not sure what the best way of integrating this into the existing framework is. What I've done now is to create an additional `layer_type`, `"multi_dense"`, that has to be specified in the runcard to enable this. Previous behaviour with both `layer_type="dense"` and `layer_type="dense_per_flavour"` should be unaffected; the overhead of keeping it like that is manageable.
The upside of course is that if later changes become too complicated with this layer, you can always go back to the standard one. The downside though is that it creates yet another code path, and everything will have to be tested separately.
Alternatively, it could just replace the current `"dense"` layer type entirely; I'm not sure if there is a nice middle ground.
Update: After discussing briefly with Roy, we agreed it's not necessary to keep the old dense layer. Later I saw that it actually kind of is necessary, as it is also used under the hood in `"dense_per_flavour"`. So I have renamed the old layer to `"single_dense"`, and the new layer here is now just `"dense"`.
## Tests
I have two unit tests: one shows that weight initialization is identical to standard dense layers; the second shows that the output on a test input is the same, up to what I think are round-off errors.
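Schematically, the second check amounts to something like the following (simplified, with made-up sizes; not the actual test code):

```python
import numpy as np
import tensorflow as tf

replicas, features, out_units, n_grid = 3, 8, 25, 50
rng = np.random.default_rng(0)
x = rng.standard_normal((1, n_grid, features)).astype(np.float32)
kernels = [rng.standard_normal((features, out_units)).astype(np.float32)
           for _ in range(replicas)]

# Reference: one ordinary Dense layer per replica, outputs stacked afterwards.
reference = []
for kernel in kernels:
    dense = tf.keras.layers.Dense(out_units, use_bias=False)
    dense.build((None, features))
    dense.set_weights([kernel])
    reference.append(dense(x).numpy()[0])
reference = np.stack(reference, axis=0)                # (r, n_grid, out_units)

# Multi-replica version: a single einsum over the stacked kernels.
w = np.stack(kernels, axis=0)                          # (r, f, g)
multi = tf.einsum("bnf,rfg->brng", x, w).numpy()[0]    # (r, n_grid, out_units)

np.testing.assert_allclose(multi, reference, rtol=1e-5)
```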
Currently the CI is passing almost completely, the only exception being a single regression test in Python 3.11, where one of the elements has a relative difference of 0.015, which is bigger than the tolerance of 0.002. I assume this is just an accumulation of round-off differences; I have no idea what else it could be.
Comparison with new baseline: https://vp.nnpdf.science/FGwCRitnQhmqBYBjeQWS7Q==/
## Timings
I have done some timing tests on the main runcard (`NNPDF40_nnlo_as_01180_1000`), on Snellius, with 100 replicas on the GPU or 1 replica on the CPU. For completeness I'm also comparing to an earlier PR which made a big difference to performance, and to the state of master just before that PR was merged. I still need to run the current master on the GPU.
| branch | commit hash | 1 replica | 100 replicas | diff 1 | diff 100 |
|---|---|---|---|---|---|
| master | 5eebfbaf | 96 | 860 | 0 | 0 |
| replica-axis-first | 8cbe0cfd | 96 | 505 | -1 | 355 |
| current master | f40ddd9f | 116 | ?? | -19 | ?? |
| multi-dense-layer | d8f28ff88a | 112 | 304 | 3 | 201 |
## Status
I need to do full fit comparisons; apart from that, it's ready for review.