Use mkl for n3fit
Created by: scarlehoff
These are small changes, but they seem to be important for performance. The times and memory usage below should be compared with those of #745:
- DIS, 4 threads, 2 GB: 27 ± 8 min
- Global, 8 cores, 4.8 GB: 3.7 ± 1.0 h
- Global, 8 cores, 8 GB: 3.3 ± 0.7 h
Note that every two threads correspond to a single core. Also note that to get similar times on master you need to allocate at least 12-14 GB of memory. I never saw the memory usage go beyond 3.8 GB in my tests; I gave it one extra GB in the cluster just in case.
The interesting one is the global fit: I gave it a much smaller memory allowance (which can have side effects such as more jobs landing on the same node, or swapping) with no speed penalty. I hope this opens the door to running several replicas in parallel. I also ran the test with 8 GB for a better comparison (the extra memory may not be useful for n3fit itself, but it helps to keep other people's jobs off the same node).
I also added two new tests: one that ensures n3fit doesn't suddenly take much longer than expected, and another that ensures the changes don't break the hyperoptimization.
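For illustration, a wall-clock regression guard of the first kind could look roughly like the sketch below. The runcard name, replica number and time budget are placeholders, and the actual tests in this PR may be structured differently; it only assumes the `n3fit` command line takes a runcard and a replica number.

```python
# Hypothetical sketch of a runtime regression test: run a small fit and fail
# if it takes much longer than a reference budget. The runcard, replica and
# threshold are illustrative, not the values used in the actual test.
import subprocess
import time

MAX_SECONDS = 600  # generous budget for a quick DIS-only runcard


def test_n3fit_does_not_take_ages():
    start = time.time()
    # Assumed invocation: n3fit <runcard> <replica>
    subprocess.run(["n3fit", "quick_runcard.yml", "1"], check=True)
    assert time.time() - start < MAX_SECONDS
```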
With regard to the usage of MKL, the key setting for good performance seems to be KMP_BLOCKTIME, which gives the best results when set to 0, see https://software.intel.com/en-us/articles/maximize-tensorflow-performance-on-cpu-considerations-and-recommendations-for-inference
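For reference, a minimal sketch of how these variables can be set before TensorFlow is first imported; apart from `KMP_BLOCKTIME=0`, the values (affinity string and thread counts) are illustrative, taken from the Intel recommendations rather than from this PR:

```python
# Environment knobs for the Intel OpenMP runtime used by MKL. They must be
# set before TensorFlow is imported for the first time.
import os

os.environ.setdefault("KMP_BLOCKTIME", "0")  # threads sleep immediately after a parallel region
os.environ.setdefault("KMP_AFFINITY", "granularity=fine,compact,1,0")  # pin OpenMP threads to cores
os.environ.setdefault("OMP_NUM_THREADS", "4")  # e.g. 4 threads = 2 physical cores (hypothetical value)

import tensorflow as tf  # noqa: E402  (imported after the env vars are in place)

# Optionally mirror the thread counts in TensorFlow's own thread pools
tf.config.threading.set_intra_op_parallelism_threads(int(os.environ["OMP_NUM_THREADS"]))
tf.config.threading.set_inter_op_parallelism_threads(1)
```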
I wonder whether it makes sense to install directly from the Intel conda channel instead of the defaults channel. I'll run some benchmarks to see whether I notice any difference.
This is ready for review, but given that the changes might be machine-dependent, I'll also run the fits on the other cluster I have access to in order to make sure that everything is ok.