General theory covmat
Created by: RosalynLP
Supersedes #873 but with minimal changes wrt the new master.
mentioned in issue #752 (closed)
mentioned in issue #995 (closed)
Created by: wilsonmr
I've been thinking about "general theory covmats"
A while ago we (@RosalynLP and myself) discussed having "theory systematic"-like files: for the point prescriptions these would be files for each dataset containing the different shifts, and the construction function could load the shifts from file rather than requiring 9 theories at the point of running the fit. The difference with respect to constructing the experimental covmat is that the shifts have a slightly more complicated recipe for building the covmats (not just matrix @ matrix.T), but that information would live in the construction function (which already exists and takes the shifts as input as far as I know). These files would be identified by the generating PDF as well as the theories used to generate them, and possibly a date of creation or similar in case they needed to be updated (i.e. when we stop using apfel), maybe even other theory settings. I think this wouldn't be so difficult to work out.
I was wondering if the same was true for the nuclear and deuteron covmats. After looking at your code it seems even simpler, because basically you can just save your deltas in an N_data * N_replicas table, so each column is the difference between a replica and the "central" PDF, and constructing the covmat is then identical to treating each of those deltas as an additive correlated systematic (so I guess the "commondata hack" is just appending these into the commondata files themselves, @enocera?).
I think if these shifts (uncut) were saved in this PR instead of a large uncut matrix, then this would be more future proof. It's modular: if for any reason we wanted more datasets with this correction in the future, the shifts only have to be calculated for that dataset; the construction then simply takes all datasets (and cuts), sees for which ones these files exist (in some canonical location), and constructs the covmat, which is dumped and subsequently loaded by nnfit.
I think this addresses @Zaharid's concern if you wanted different cuts/datasets, and since this would be constructed at the point of running vp-setupfit it would satisfy the criteria of the point prescription matrix. Then I think your only objection would be whether it should be included in a fit, which is a different discussion.
Also, when we eventually have the Python covmats/replica generation used by n3fit, I think it would be relatively easy to integrate this pipeline.
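Concretely, the sort of thing being suggested might look like this; the file name, index layout and normalisation are illustrative rather than anything implemented here:

```python
import pandas as pd

# Hypothetical stored table: one row per data point, one column per replica,
# each entry being prediction(replica) - prediction(central PDF).
deltas = pd.read_csv("nuclear_deltas.csv", index_col=[0, 1])

# Treating each column as an additive correlated systematic gives the usual
# positive semidefinite outer-product construction, S = D D^T
# (up to a normalisation, e.g. 1/N_rep for MC replicas).
d = deltas.to_numpy()
covmat = d @ d.T / d.shape[1]

# Keeping the datapoint index means the matrix can later be cut and
# reordered consistently with the data it belongs to.
covmat = pd.DataFrame(covmat, index=deltas.index, columns=deltas.index)
```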
Created by: wilsonmr
I actually think that not saving the deltas when we did the theory covmat paper was a bit silly, because those were pretty stable and it would have saved everyone who ran those fits from needing something like 30 GB of theories. It would also have allowed for more theory covmat tests (which would be correct up to the calculation of the shifts).
I was wondering if the same was true for the nuclear and deuteron covmats. After looking at your code it seems even simpler, because basically you can just save your deltas in an N_data * N_replicas table, so each column is the difference between a replica and the "central" PDF, and constructing the covmat is then identical to treating each of those deltas as an additive correlated systematic (so I guess the "commondata hack" is just appending these into the commondata files themselves, @enocera?)
Yes. And making sure that these are properly correlated (e.g. across all deuteron data sets).
Created by: RosalynLP
@wilsonmr I remember this discussion and your suggestions, and I agree that this could work well for the nuclear and deuteron covmats. As far as cuts are concerned there is little difference for me between this and storing an uncut covariance matrix, but the major advantage, as you say, is that we don't also have to maintain separate code for calculating the covariance matrix given new experiments; we only need to compute the shifts for each experiment and make sure the correlations are correctly calculated. My only minor reservation is that in some cases saving all these shifts could be cumbersome: e.g. in the 9-point scale variation case we have only 28 deltas, but there could be some scenario in the future where we have hundreds (already for nuclear we have N_rep = 250). Also we might have to do some work to transform a given covariance matrix into these shifts.
Created by: wilsonmr
My only minor reservation is that in some cases saving all these shifts could be cumbersome: e.g. in the 9-point scale variation case we have only 28 deltas, but there could be some scenario in the future where we have hundreds (already for nuclear we have N_rep = 250).
This is true, although I think the flexibility outweighs the potential cost; at any rate we'd need lots of theory covariances with lots of "columns" to get anywhere near the current cost of just including the 9-point prescription (30 GB is rather a lot...), so at the moment we would be fine for quite some time.
Also we might have to do some work to transform a given covariance matrix into these shifts.
Well, the thing is we are always generating covmats which are positive semidefinite symmetric matrices, and this is normally achieved by defining the covmat as some matrix times its transpose (or a linear combination of these in the case of the point prescriptions). Do you have a concrete example where somebody wouldn't be doing this to construct a theory covmat?
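As an aside on transforming a given covariance matrix back into shifts: since any such covmat is (at least) positive semidefinite, a factorisation like the sketch below (assuming numpy and, for simplicity, a strictly positive definite input) recovers a set of columns that can be stored and used exactly like the deltas:

```python
import numpy as np

def covmat_to_shifts(covmat):
    """Factor a (positive definite) covariance matrix as C = L @ L.T.

    The columns of L can then be stored and treated like per-replica deltas:
    the outer-product construction L @ L.T recovers C. For a merely positive
    *semi*definite C an eigendecomposition could be used instead.
    """
    return np.linalg.cholesky(covmat)

# Round trip: the outer-product construction recovers the original covmat.
rng = np.random.default_rng(0)
a = rng.standard_normal((5, 20))
c = a @ a.T                      # toy positive definite covmat
shifts = covmat_to_shifts(c)
assert np.allclose(shifts @ shifts.T, c)
```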
Created by: RosalynLP
This is true, although I think the flexibility outweighs the potential cost; at any rate we'd need lots of theory covariances with lots of "columns" to get anywhere near the current cost of just including the 9-point prescription (30 GB is rather a lot...), so at the moment we would be fine for quite some time.
For the scale variation covmat I agree that having columns outweighs having 9 theories; I was more thinking that having a covariance matrix instead could be better, but I do agree that the way you are suggesting is more flexible, so it is to be preferred.
Well, the thing is we are always generating covmats which are positive semidefinite symmetric matrices, and this is normally achieved by defining the covmat as some matrix times its transpose (or a linear combination of these in the case of the point prescriptions). Do you have a concrete example where somebody wouldn't be doing this to construct a theory covmat?
Yes, I agree. I don't see a situation in which this wouldn't be the case; I was just trying to keep things as flexible as possible where I can, but we should be fine here.
Created by: RosalynLP
I guess I have no idea how the higher twist covmat is constructed, but surely at some point you are squaring something? Which I think means that with not too much thought we can define some "systematics".
Yes, the higher twist covmat is also based on nuisance parameters; in any case we are probably not going to end up using that.
requested review from @enocera
Created by: wilsonmr
Review: Commented
One thing that slightly concerns me is that here we have various groups_* things which, if left unchecked, group by experiment but could in theory group by anything, especially now that we have custom groups; and then we have the old runcards where you use one big experiment in order to allow correlations across datasets. If the ordering of the datasets is not exactly the same as the grouped data then the covmat will not be aligned with the correct data points in the C++ code: at the level of nnfit the index/columns of the dataframe never get checked or reindexed. In validphys this is less of a problem, because within the same namespace the groups should definitely be the same, so this will be a bit easier to manage with the newer fitting code. I think you could add a check that flattening group_dataset_inputs_by_metadata gives the same order as data_input, and add that check to fromfile_covmat.
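Something along these lines is presumably what is meant (a sketch only: it assumes reportengine's make_argcheck/CheckError machinery and a particular structure of the grouped namespaces, neither of which I have verified against this branch):

```python
from reportengine.checks import CheckError, make_argcheck

@make_argcheck
def check_grouping_matches_data_input(group_dataset_inputs_by_metadata, data_input):
    """Illustrative check: flatten the grouped dataset inputs and require the
    same order as data_input, so a covmat indexed by the grouped data stays
    aligned with the datapoints that nnfit actually sees."""
    flat = [
        dsinput.name
        for group in group_dataset_inputs_by_metadata
        for dsinput in group["data_input"]
    ]
    if flat != [dsinput.name for dsinput in data_input]:
        raise CheckError(
            "Flattened grouped datasets are not in the same order as "
            "data_input; the from-file covmat would be misaligned."
        )
```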
Looking at the loading of the covmat, I'm still convinced that you'd be better off saving theory covmat systematics and then:
- load just the files for datasets you want theory covmat for
- apply cuts to the theory systematics
- concatenate across all datasets
- construct covmat - in this case by simply multiplying by transpose
At the moment you are doing this work elsewhere for all data, dumping the total covmat, and then having to do a lot of work to get the correct cut version with the datasets in the order you want.
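For concreteness, that construction might look roughly like the sketch below; the file layout, loader and cuts interface are all assumptions for illustration rather than what this PR currently does.

```python
import pandas as pd

def fromfile_covmat_sketch(dataset_names, cuts, shifts_dir):
    """Hypothetical construction of the theory covmat from per-dataset shifts.

    dataset_names -- datasets in the order the fit expects
    cuts          -- mapping dataset name -> labels of the points kept after cuts
    shifts_dir    -- pathlib.Path holding one "<dataset>_shifts.csv" per dataset

    All files are assumed to share the same shift/replica columns, so that
    cross-dataset correlations survive the concatenation below.
    """
    blocks, kept = [], []
    for name in dataset_names:
        path = shifts_dir / f"{name}_shifts.csv"
        if not path.exists():
            continue  # no theory systematics stored for this dataset
        shifts = pd.read_csv(path, index_col=0)  # N_points x N_shifts
        blocks.append(shifts.loc[cuts[name]])    # apply the fit cuts per dataset
        kept.append(name)
    # Concatenate in fit order *before* the outer product so correlations
    # between points in different datasets are retained.
    deltas = pd.concat(blocks, keys=kept)
    covmat = deltas.to_numpy() @ deltas.to_numpy().T  # up to a normalisation
    return pd.DataFrame(covmat, index=deltas.index, columns=deltas.index)
```

The point of the sketch is only to show that the cutting and reordering happen on the small per-dataset shift tables, so the covmat comes out aligned with the requested data by construction rather than needing to be reindexed afterwards.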