n3fit datasets as single layers
Created by: scarlehoff
This is the last of the "optimization" PRs. This doesn't mean n3fit can't be optimized any further (it can), but I had to give myself a deadline: 2 weeks to get the memory down to 4 GB was nice, 1 week to gain 5 minutes per hour, meh. I have some ideas on where we can look to reduce the times even more, and I will write a few pages in the docs or the wiki on the things I tested and the things I want to test.
In this PR I tried to make each experiment a different model so that it was much easier to work with them. Sadly, that introduces an overhead which was far more noticeable than I expected, so I reverted that and instead created a layer per dataset (previously each fktable of each dataset was being computed individually). This does seem to save some minutes of computation, but the content of this PR is mainly for organizational purposes rather than for performance.
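To make the "one layer per dataset" idea concrete, here is a toy sketch in plain Python. It is not the n3fit code: the `DatasetLayer` class, the `contract` and `concatenate` helpers, the array shapes and the rule used to combine the fktables (simple concatenation) are all assumptions made for illustration.

```python
from typing import Callable, List, Sequence

# Hypothetical shapes, for illustration only:
#   fktable[i][f][x] -> weight for data point i, flavour f, x-grid node x
#   pdf[f][x]        -> PDF value for flavour f at x-grid node x
FKTable = Sequence[Sequence[Sequence[float]]]
PDF = Sequence[Sequence[float]]


def contract(fktable: FKTable, pdf: PDF) -> List[float]:
    """Contract one fktable with the PDF: one prediction per data point."""
    return [
        sum(w * pdf[f][x] for f, row in enumerate(point) for x, w in enumerate(row))
        for point in fktable
    ]


def concatenate(partials: List[List[float]]) -> List[float]:
    """Assumed combination rule: put the partial predictions one after another."""
    return [value for partial in partials for value in partial]


class DatasetLayer:
    """One callable per dataset that owns all of that dataset's fktables."""

    def __init__(
        self,
        name: str,
        fktables: Sequence[FKTable],
        combine: Callable[[List[List[float]]], List[float]] = concatenate,
    ) -> None:
        self.name = name
        self.fktables = list(fktables)
        self.combine = combine

    def __call__(self, pdf: PDF) -> List[float]:
        # All fktables of the dataset are handled in a single call.
        return self.combine([contract(fk, pdf) for fk in self.fktables])


# Toy input: 2 flavours, 2 x-grid nodes
pdf = [[1.0, 2.0], [0.5, 0.25]]
fk_a = [[[1.0, 0.0], [0.0, 0.0]], [[0.0, 1.0], [2.0, 0.0]]]  # 2 data points
fk_b = [[[0.0, 0.0], [0.0, 4.0]]]  # 1 data point

layer = DatasetLayer("toy_dataset", [fk_a, fk_b])
predictions = layer(pdf)
```

The caller only sees one object per dataset, which is the organizational gain; whether it also saves time depends on how the backend schedules the work.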
This PR builds on top of #760 which builds on top of #745
I'll create a separate pull request to add documentation on how to run n3fit efficiently, with some examples and whatnot.
There will be another PR building on top of this one with some small changes to kfolding, as I was doing things which were unnecessarily convoluted because of the 16 GB of RAM that the fit was using. Those are not necessary anymore.
Since this is the last of the optimization PRs, here's a benchmark. Everything ran on an Intel(R) Core(TM) i7-4770 CPU @ 3.40GHz within a conda environment, and the computer was doing nothing but these fits. The benchmark corresponds to a global fit (the runcard in the repository) with no stopping and only 5000 epochs.
- With the version from #745 (bc03ddc6) and using tensorflow-eigen: walltime: 905s, cputime: 4708s
- With mkl (#760) (95c7945c): walltime: 896s, cputime: 3036s
- With this PR (3b9b31ef): walltime: 862s, cputime: 2946s
The memory never went above 4 GB (never above 3.5 GB with mkl).
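For reference, the relative savings implied by the numbers above can be worked out with a few lines of Python. The timings are copied from the benchmark; the `benchmarks` dictionary and the `relative_saving` helper are just names made up for this sketch.

```python
# (walltime, cputime) in seconds, copied from the benchmark above
benchmarks = {
    "#745 (bc03ddc6), tensorflow-eigen": (905, 4708),
    "#760 (95c7945c), mkl": (896, 3036),
    "this PR (3b9b31ef)": (862, 2946),
}


def relative_saving(baseline: float, value: float) -> float:
    """Percentage saved with respect to the baseline."""
    return 100.0 * (baseline - value) / baseline


base_wall, base_cpu = benchmarks["#745 (bc03ddc6), tensorflow-eigen"]
savings = {
    label: (relative_saving(base_wall, wall), relative_saving(base_cpu, cpu))
    for label, (wall, cpu) in benchmarks.items()
}

for label, (wall_pct, cpu_pct) in savings.items():
    print(f"{label}: walltime -{wall_pct:.1f}%, cputime -{cpu_pct:.1f}%")
```

Relative to #745, this PR comes out at roughly 5% less walltime and 37% less CPU time for a single replica.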
I am a bit disappointed* with the gain in timing for one single replica; I was hoping for more. These changes make a bigger difference, however, when running on a cluster where many things are running in parallel, as using less CPU time and memory has an effect on how long you have to wait for your fit to be ready.
*I should be very happy about the reduction in memory, which was the main issue of n3fit, but I hoped to see the same magnitude of reduction in timing and that of course did not happen.
Caveats:
- For mkl to be faster than eigen it has to be configured properly. #760 sets some default parameters that seem to work on the CPUs I've tested, but your particular ones might differ. For instance, if you have many cores available (say, you have a 38-core computer) the best values will change: on an Intel(R) Core(TM) i9-9980XE CPU @ 3.00GHz, setting the `maxcores` flag in the runcard to 18 seemed better than occupying the whole machine.
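
As a rough illustration of what capping the number of cores can look like outside the runcard, the sketch below sets the standard `OMP_NUM_THREADS` and `MKL_NUM_THREADS` environment variables. This is not what #760 does internally, and `limit_threads` is a hypothetical helper; the variables generally need to be set before the numerical library is imported to have an effect.

```python
import os


def limit_threads(maxcores: int) -> dict:
    """Cap the threads used by OpenMP/MKL-backed libraries (hypothetical helper)."""
    if maxcores < 1:
        raise ValueError("maxcores must be a positive integer")
    # os.cpu_count() reports logical CPUs and may return None on some platforms
    available = os.cpu_count() or maxcores
    threads = str(min(maxcores, available))
    settings = {"OMP_NUM_THREADS": threads, "MKL_NUM_THREADS": threads}
    os.environ.update(settings)
    return settings


# Ask for at most 18 threads, as in the i9-9980XE example above
settings = limit_threads(18)
print(settings)
```

The right value is machine dependent, so it is worth timing a short run with a couple of different settings before launching many replicas.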