Fkparser
Created by: Zaharid
This adds a pure Python parser for fktables and is a work in progress. At the moment it should give sensible results for both compressed and uncompressed hadronic fktables. The most frequent usage should be something like:
```python
from validphys.fkparser import load_fktable
from validphys.loader import Loader

l = Loader()
fk = l.check_fktable(setname="ATLASTTBARTOT", theoryID=52, cfac=())
f = load_fktable(fk)
```
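For illustration, this is roughly how I imagine poking at the result; the attribute names below (`hadronic`, `sigma`) are placeholders for whatever the final container ends up exposing, so don't take them literally.

```python
# Placeholder attribute names; the exact object returned by load_fktable is
# still in flux (see the notes below).
print(f.hadronic)      # whether the grid is hadronic
print(f.sigma.head())  # the fastkernel grid as a pandas DataFrame
```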
Some thoughts and considerations:
- I am trying to follow the spec https://github.com/NNPDF/nnpdf/blob/master/doc/data/data_layout.pdf and have not looked at the C++ code in a long time. Let's see how that goes.
- The fktable format is very irregular. For example, the fktable in the examples uses both (!) tabs and spaces to delimit numbers in different places of the fastkernel grid. This makes parsing slower than it has to be and means that about the only non-manual solution that has any hope of working is `pandas.read_csv` (see the read_csv sketch after this list). This shows why an opaque binary format that forces you to specify everything precisely is better in the long term.
- At the moment I am prioritizing good error messages and clear code over performance. I suspect that this will not matter much, as most of the time will go into parsing the sigma grids, which is done with an external function.
- The theory info field is a bit redundant unless somebody is doing something very wrong. I can't really be bothered to write the types for all the keys. Does anybody think those might be needed? More generally, I think for the fit we will only need the tensor and the boolean mask, possibly also with a flag to specify whether it is hadronic or not (see the container sketch after this list). The rest of the data might be useful for analysis purposes. There is also some redundancy (e.g. ndata and nx), which will be used to check the consistency of the fktable. In any case there will be some higher-level object than a dictionary at some point, which will check for things like required fields.
- I would very much like this to serve to make the parallel mode of validphys faster. For that it would be best to allocate the sigma grid on some shared memory store. I know how to do this manually on Linux (`/dev/shm`), but not so much in a cross-platform way. For that I am looking at https://arrow.apache.org/ and in particular https://arrow.apache.org/docs/python/plasma.html (see the Plasma sketch after this list). There is the question of whether we could load an Arrow table directly into a shared store and avoid copying, but that would mean patching their CSV parser (which is more limited but probably faster than the pandas one because it is multithreaded). At the moment I am leaning towards implementing all that as a layer on top of the parser, even though it is not the most efficient approach.
- The sigma is effectively encoded as a sparse tensor in the x combinations and a dense tensor in the flavour indexes. So we have all possible 14 flavour combinations, many of which are typically zero, but only some of the x-grid combinations are indexed (see the sparse-index sketch after this list). For example, the first few entries of the index of the table above look like `(0, 2, 21), (0, 2, 22), (0, 2, 23), (0, 2, 24), (0, 3, 15), (0, 3, 16), (0, 3, 17)`, where the first index is the data point and the other two are indexes in the xgrid. @scarrazza @scarlehoff do you want this, or would you prefer a full grid of zeros?
- Overall I am too old for writing parsers. I think that now that we can process these files into pure Python, we should just dump them to Parquet (https://parquet.apache.org/), because it seems to be the format with most support (and there are direct interfaces for both Arrow and pandas, so we really would never have to look inside the file). This would make things easier and faster (see the Parquet sketch below).
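A minimal sketch of the `pandas.read_csv` point: treating any run of whitespace as a single delimiter is what copes with the mixed tabs and spaces. The file name and the absence of a header line are assumptions for the example.

```python
import pandas as pd

# Any run of tabs and/or spaces acts as one delimiter, which handles the
# mixed whitespace in the fastkernel grid. "fastkernel_part.dat" is a
# made-up file name for illustration.
grid = pd.read_csv("fastkernel_part.dat", sep=r"\s+", header=None)
print(grid.shape)
```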
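A rough sketch of the kind of higher-level container I have in mind; the class name, fields and checks here are just a guess at this stage, not what the parser currently returns.

```python
import dataclasses

import numpy as np
import pandas as pd


@dataclasses.dataclass
class ParsedFKTable:
    """Provisional container: names and fields are illustrative only."""

    sigma: pd.DataFrame  # the fastkernel grid
    mask: np.ndarray     # boolean mask over data points
    hadronic: bool       # whether two PDFs enter the convolution
    metadata: dict       # everything else: theory info, ndata, nx, ...

    def __post_init__(self):
        # The redundant fields (e.g. ndata) are useful for consistency checks.
        ndata = self.metadata.get("ndata")
        if ndata is not None and len(self.mask) != ndata:
            raise ValueError("mask length does not match ndata")
```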
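On the shared-memory point, roughly what using the Plasma store could look like, assuming a `plasma_store` process is already running on `/tmp/plasma`; the socket path and array shape are placeholders, and the exact `pyarrow.plasma` API differs a bit between versions.

```python
import numpy as np
import pyarrow.plasma as plasma

# Assumes a store was started separately, e.g.
#   plasma_store -m 1000000000 -s /tmp/plasma
client = plasma.connect("/tmp/plasma")

# Put a placeholder sigma grid into shared memory; other validphys processes
# connected to the same store can fetch it by its ObjectID without copying
# it through pickle.
sigma = np.zeros((10, 30, 30, 14))
object_id = client.put(sigma)
shared_sigma = client.get(object_id)
```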
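To make the sparse-in-x, dense-in-flavour layout concrete, here is a toy version of the indexing described above; the numbers and grid size are invented.

```python
import numpy as np
import pandas as pd

# Only the (data, x1, x2) combinations that actually appear are indexed...
index = pd.MultiIndex.from_tuples(
    [(0, 2, 21), (0, 2, 22), (0, 2, 23), (0, 2, 24), (0, 3, 15)],
    names=["data", "x1", "x2"],
)
# ...while all 14 flavour combinations are stored densely as columns.
sigma = pd.DataFrame(np.random.rand(len(index), 14), index=index)

# The alternative "full grid of zeros" would reindex over every (x1, x2) pair.
nx = 30  # made-up number of points in the xgrid
full_index = pd.MultiIndex.from_product(
    [sigma.index.levels[0], range(nx), range(nx)], names=["data", "x1", "x2"]
)
dense = sigma.reindex(full_index, fill_value=0.0)
```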
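And the Parquet idea as a sketch: once the grid is in a pandas DataFrame the round trip is two calls (this needs pyarrow or fastparquet installed; the file name and toy data are arbitrary).

```python
import numpy as np
import pandas as pd

# Toy stand-in for a parsed fktable grid (Parquet wants string column names).
index = pd.MultiIndex.from_tuples([(0, 2, 21), (0, 2, 22)], names=["data", "x1", "x2"])
sigma = pd.DataFrame(
    np.random.rand(2, 14), index=index, columns=[f"fl{i}" for i in range(14)]
)

# Round-trip through Parquet; the MultiIndex survives, so we never have to
# look inside the file ourselves.
sigma.to_parquet("fktable.parquet")
roundtrip = pd.read_parquet("fktable.parquet")
assert roundtrip.equals(sigma)
```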
Additionally this needs better docs and tests.
Will close #377.