Optimize pseudodata computation (!1503) · Merge requests · Emanuele Roberto Nocera / nnpdf

Emanuele Roberto Nocera requested to merge fitpseudodata into master Jan 25, 2022

Created by: Zaharid

The function computed_pseudoreplicas_chi2 in chi2grids.py is unusable in the current master, as are various pseudodata replica utilities.

One showstopper involves

 fitted_make_replicas = collect('make_replica', ('pdfreplicas',))

Reportengine is unfortunately unable to deal with this very elegantly and will try to resolve all the data anew for each of the replicas. It would be nice to fix that, but in the meantime, add the loop manually.
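To illustrate the idea of the manual loop, here is a minimal sketch. It is not the actual code in this MR: load_shared_inputs, make_one_replica and make_all_replicas are hypothetical names, and the Gaussian sampling is only a stand-in for what make_replica does. The point is that the replica-independent inputs are resolved once and the loop over replicas then reuses them.

```python
import numpy as np


def load_shared_inputs():
    """Hypothetical stand-in for the expensive, replica-independent data resolution."""
    load_shared_inputs.calls += 1  # count calls to show it runs only once
    central = np.array([1.0, 2.0, 3.0])
    covmat = np.diag([0.1, 0.2, 0.3])
    return central, covmat


load_shared_inputs.calls = 0


def make_one_replica(central, covmat, seed):
    """Toy replica generation: one Gaussian fluctuation around the central values."""
    rng = np.random.default_rng(seed)
    return rng.multivariate_normal(central, covmat)


def make_all_replicas(seeds):
    """Resolve the shared inputs once, then loop over the replicas by hand."""
    central, covmat = load_shared_inputs()
    return [make_one_replica(central, covmat, seed) for seed in seeds]


replicas = make_all_replicas(range(5))
```

With this layout the cost of resolving the data no longer scales with the number of replicas.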

Once that is fixed, it turns out that generating a replica with make_replica is unbearably slow now (in fact it didn't finish for me between launching it early this morning and looking at it now). I am not sure if this is an old phenomenon or has to do with some regression in pandas.

(Incidentally @enocera did you notice any problems in that phase while running the fits?)

So currently, on master, calling make_replica with an unlucky input can take well in excess of a few minutes per replica. The slow parts are all pandas operations, such as a trivial-looking concatenation like special_add_errors = pd.concat(special_add, axis=0, sort=True).fillna(0).to_numpy(), which takes half a second for some reason (which is why I think it may be a regression). To address that, move all these operations outside, to a separate provider, so they can be shared across various inputs. With that, computing a pseudodata replica takes a few seconds at worst on my laptop.
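A rough sketch of that restructuring, under stated assumptions: build_special_add_errors and make_replica_sketch are hypothetical names, the input frames are made up, and the sampling is a toy model rather than the real make_replica logic. Only the pd.concat(...).fillna(0).to_numpy() expression is taken from the code discussed above.

```python
import numpy as np
import pandas as pd


def build_special_add_errors(special_add):
    """Hypothetical separate provider: do the slow pandas work once, return a plain array."""
    return pd.concat(special_add, axis=0, sort=True).fillna(0).to_numpy()


def make_replica_sketch(central_values, special_add_errors, seed):
    """Toy replica generation that only touches NumPy arrays (no pandas in the hot path)."""
    rng = np.random.default_rng(seed)
    # One random number per systematic (column), shared by all data points.
    shifts = special_add_errors @ rng.standard_normal(special_add_errors.shape[1])
    return central_values + shifts


# Made-up inputs: two datasets with partially overlapping systematics.
special_add = [
    pd.DataFrame({"sysA": [0.1, 0.2]}, index=[0, 1]),
    pd.DataFrame({"sysB": [0.3], "sysA": [0.05]}, index=[2]),
]
central_values = np.array([1.0, 2.0, 3.0])

# Computed once, outside the replica loop, and shared by every replica.
special_add_errors = build_special_add_errors(special_add)
replicas = [make_replica_sketch(central_values, special_add_errors, seed) for seed in range(10)]
```

The concatenation cost is then paid once per set of inputs instead of once per replica.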
