Restart hyperopt
Created by: Cmurilochem
This PR addresses the issue of restarting a hyperparameter optimization with the Hyperopt library, as discussed in https://github.com/NNPDF/nnpdf/issues/1800.
Comments on the initial changes made
`hyper_optimization/filetrials.py`
1. To the `FileTrials` class I have added the `from_pkl` and `to_pkl` methods. The former is a `@classmethod` that is useful to create instances of the class when a `tries.pkl` file is available from a previous run. The `to_pkl` method saves the current state of `FileTrials` to a pickle file, although this is currently already done for every trial in `hyperopt.fmin` directly via the `trials_save_file` argument.
   - In this regard, I also added an attribute `self.pkl_file`, which is responsible for generating a `tries.pkl` file in the same directory as `tries.json`.
   - An additional attribute `self._rstate` is also added. It stores the last `numpy.random.Generator` of the hyperopt algorithm and is passed as `rstate` to the `hyperopt.fmin` function, so that a restarted run follows the same history as a single uninterrupted run. The initial fixed seed in `trials.rstate = np.random.default_rng(42)` can still be relaxed and provided as input later. A minimal sketch of these additions is given after this list.
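For concreteness, here is a minimal sketch of how such a pickle-based `Trials` subclass could look. This is not the PR's exact implementation: the constructor signature, the `replay_path` argument, and the property names are assumptions for illustration.

```python
# Illustrative sketch only; attribute and argument names are assumptions.
import pickle

from hyperopt import Trials


class FileTrials(Trials):
    """Trials object that can persist and recover its state via pickle."""

    def __init__(self, replay_path, **kwargs):
        # the pickle file lives in the same directory as tries.json
        self.pkl_file = f"{replay_path}/tries.pkl"
        self._rstate = None  # last numpy.random.Generator used by hyperopt
        super().__init__(**kwargs)

    @property
    def rstate(self):
        return self._rstate

    @rstate.setter
    def rstate(self, value):
        self._rstate = value

    def to_pkl(self):
        """Dump the whole FileTrials instance to its pickle file."""
        with open(self.pkl_file, "wb") as f:
            pickle.dump(self, f)

    @classmethod
    def from_pkl(cls, pkl_path):
        """Recreate a FileTrials instance from a previous run's pickle file."""
        with open(pkl_path, "rb") as f:
            return pickle.load(f)
```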
`hyper_optimization/hyper_scan.py`
2. In case of restarts, an extra boolean attribute is added to `HyperScanner`, named `self.restart_hyperopt`, which is set to `True` when the `--continue` option is passed on the `n3fit` command line (details to be discussed below).
   - I have adapted `hyper_scan_wrapper` so that it checks whether `hyperscanner.restart_hyperopt` is `True`. If so, it generates the initial `FileTrials` instance (`trials`) from `tries.pkl`, which by construction contains the history of the previous hyperopt run as well as the `trials.rstate` attribute holding the previous numpy random generator. A sketch of this logic follows the list.
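The restart branch could look roughly like the following. This is a hedged sketch, not the PR diff: the `replay_path` and `fitting_function` arguments, the `as_dict()` call, and the choice of `tpe.suggest` are assumptions standing in for the actual `hyper_scan_wrapper` internals.

```python
# Illustrative sketch of the restart branch; the names noted above are assumptions.
import hyperopt
import numpy as np


def hyper_scan_wrapper(replay_path, fitting_function, hyperscanner, max_evals):
    if hyperscanner.restart_hyperopt:
        # recover the previous history and random generator from tries.pkl
        trials = FileTrials.from_pkl(f"{replay_path}/tries.pkl")
    else:
        trials = FileTrials(replay_path)
        trials.rstate = np.random.default_rng(42)  # fixed seed, see point 1

    return hyperopt.fmin(
        fn=fitting_function,
        space=hyperscanner.as_dict(),
        algo=hyperopt.tpe.suggest,
        max_evals=max_evals,
        trials=trials,
        # reusing the stored generator makes the restart follow the same
        # sampling history as a single uninterrupted run
        rstate=trials.rstate,
        # hyperopt itself re-saves the trials after every evaluation
        trials_save_file=trials.pkl_file,
    )
```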
`scripts/n3fit_exec.py`
3. This is perhaps the most fragile of the changes and where I would need help to adapt it properly.
   - To `N3FitApp` I added a new parser option, `--continue`, which is the keyword that triggers hyperopt restarts.
   - To its `run` method I added a new attribute, `self.environment.restart = self.args["continue"]`.
   - The way I found to pass this keyword to `HyperScanner` later is to use it in connection with `produce_hyperscanner`. If it is `True`, I update `hyperscan_config` with `hyperscan_config.update({'restart': 'true'})`, and this will later be part of the `HyperScanner`'s `sampling_dict` argument. A sketch of the parser wiring is shown after this list.
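The command-line part could be wired up roughly as below. This is a simplified sketch assuming an argparse-based `App` base class as in reportengine; the parser hook and method names are assumptions, not the exact PR diff.

```python
# Illustrative sketch; the App base class and parser hook are simplified.
class N3FitApp(App):
    @property
    def argparser(self):
        parser = super().argparser
        # store_true makes the flag default to False when not given
        parser.add_argument(
            "--continue",
            action="store_true",
            help="restart a previous hyperopt run from its tries.pkl file",
        )
        return parser

    def run(self):
        # self.args behaves like a dict here, so the reserved word
        # "continue" poses no problem
        self.environment.restart = self.args["continue"]
        super().run()
```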
Questions and requested feedback
- It looks to me that the adaptations made in the `scripts/n3fit_exec.py` file to allow for `--continue` are not optimal. Maybe a more experienced developer could suggest a more convenient way to do so.
- Despite all our efforts to make sure that hyperopt restarts have the same history as direct runs, it turns out that, although restarts produce the same hyperparameter guesses, they still show differences in the final losses of the different k-folds. This might be due to the fact that the seeds for the initial `weights` of each k-fold are inherently different between runs (see below). For example, I have done a test in which I run a simple hyperoptimization with 2 trials and then restart it for another 2 trials (4 in total). I then run another experiment that calculates 4 trials directly (with the same runcard) and compare the results.
Restart 0
{'validation_losses': '[[ 8.973183]\n [12.285054]]', 'experimental_losses': '[[52.548443]\n [42.66381 ]]', 'hyper_losses': '[[52.548443]\n [42.66381 ]]'}
{'Adam_clipnorm': [], 'Adam_learning_rate': [], 'Nadam_clipnorm': [2.0117276880696024e-05], 'Nadam_learning_rate': [0.002011336799276246], 'nl2:-0/2': [41.0], 'nl2:-1/2': [32.0], 'nodes_per_layer': ['0'], 'optimizer': ['0']}
Direct 0
{'validation_losses': '[[ 8.973183]\n [12.285054]]', 'experimental_losses': '[[52.548443]\n [42.66381 ]]', 'hyper_losses': '[[52.548443]\n [42.66381 ]]'}
{'Adam_clipnorm': [], 'Adam_learning_rate': [], 'Nadam_clipnorm': [2.0117276880696024e-05], 'Nadam_learning_rate': [0.002011336799276246], 'nl2:-0/2': [41.0], 'nl2:-1/2': [32.0], 'nodes_per_layer': ['0'], 'optimizer': ['0']}
Restart 1
{'validation_losses': '[[14.459031]\n [29.935106]]', 'experimental_losses': '[[64.00286 ]\n [74.955086]]', 'hyper_losses': '[[64.00286 ]\n [74.955086]]'}
{'Adam_clipnorm': [4.9234053337502195e-06], 'Adam_learning_rate': [0.00013178694123594783], 'Nadam_clipnorm': [], 'Nadam_learning_rate': [], 'nl2:-0/2': [40.0], 'nl2:-1/2': [31.0], 'nodes_per_layer': ['0'], 'optimizer': ['1']}
Direct 1
{'validation_losses': '[[14.459031]\n [29.935106]]', 'experimental_losses': '[[64.00286 ]\n [74.955086]]', 'hyper_losses': '[[64.00286 ]\n [74.955086]]'}
{'Adam_clipnorm': [4.9234053337502195e-06], 'Adam_learning_rate': [0.00013178694123594783], 'Nadam_clipnorm': [], 'Nadam_learning_rate': [], 'nl2:-0/2': [40.0], 'nl2:-1/2': [31.0], 'nodes_per_layer': ['0'], 'optimizer': ['1']}
-- restarting and performing two more trials
Restart 2
{'validation_losses': '[[19.959248]\n [39.484573]]', 'experimental_losses': '[[55.306988]\n [92.67653 ]]', 'hyper_losses': '[[55.306988]\n [92.67653 ]]'}
{'Adam_clipnorm': [], 'Adam_learning_rate': [], 'Nadam_clipnorm': [5.226179920719625e-06], 'Nadam_learning_rate': [0.00020623358967934892], 'nl2:-0/2': [32.0], 'nl2:-1/2': [13.0], 'nodes_per_layer': ['0'], 'optimizer': ['0']}
Direct 2
{'validation_losses': '[[19.959248]\n [43.446335]]', 'experimental_losses': '[[ 55.306988]\n [104.45502 ]]', 'hyper_losses': '[[ 55.306988]\n [104.45502 ]]'}
{'Adam_clipnorm': [], 'Adam_learning_rate': [], 'Nadam_clipnorm': [5.226179920719625e-06], 'Nadam_learning_rate': [0.00020623358967934892], 'nl2:-0/2': [32.0], 'nl2:-1/2': [13.0], 'nodes_per_layer': ['0'], 'optimizer': ['0']}
Restart 3
{'validation_losses': '[[23.391615]\n [65.55123 ]]', 'experimental_losses': '[[ 69.19588]\n [137.06268]]', 'hyper_losses': '[[ 69.19588]\n [137.06268]]'}
{'Adam_clipnorm': [1.6662863474168997e-07], 'Adam_learning_rate': [0.0025640118767782183], 'Nadam_clipnorm': [], 'Nadam_learning_rate': [], 'nl2:-0/2': [31.0], 'nl2:-1/2': [17.0], 'nodes_per_layer': ['0'], 'optimizer': ['1']}
Direct 3
{'validation_losses': '[[23.391615]\n [44.021515]]', 'experimental_losses': '[[69.19588 ]\n [93.464554]]', 'hyper_losses': '[[69.19588 ]\n [93.464554]]'}
{'Adam_clipnorm': [1.6662863474168997e-07], 'Adam_learning_rate': [0.0025640118767782183], 'Nadam_clipnorm': [], 'Nadam_learning_rate': [], 'nl2:-0/2': [31.0], 'nl2:-1/2': [17.0], 'nodes_per_layer': ['0'], 'optimizer': ['1']}
Looking at the above results, we can see that Restart 2/3 have the same hyperparameters as Direct 2/3, yet the two folds have different losses: the first fold still reproduces the direct-run losses, but the second does not.
With the help of @goord and @APJansen, I investigated this issue and printed the random integers that are generated and passed as seeds to build the PDF models for each fold in `ModelTrainer.hyperparametrizable()`; see here. They are shown in the table below:
|  | Trial 0 |  | Trial 1 |  | Trial 2 |  | Trial 3 |  |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
|  | Fold 1 | Fold 2 | Fold 1 | Fold 2 | Fold 1 | Fold 2 | Fold 1 | Fold 2 |
| Restart Job - random integers | 1181867710 | 461027504 | 1181867710 | 1020231754 | 1181867710 | 461027504 | 1181867710 | 1020231754 |
| Direct Job - random integers | 1181867710 | 461027504 | 1181867710 | 1020231754 | 1181867710 | 1543757328 | 1181867710 | 1392765670 |
As foreseen, it is clear from the table that the seeds for the second fold are different every time we run a new calculation, despite the runs starting from the same hyperparameters. This is clearly reflected in the different losses shown above. I suspect that if we want to make hyperopt runs completely reproducible, we need an alternative to

```python
for k, partition in enumerate(self.kpartitions):
    # Each partition of the kfolding needs to have its own separate model
    # and the seed needs to be updated accordingly
    seeds = self._nn_seeds
    if k > 0:
        seeds = [np.random.randint(0, pow(2, 31)) for _ in seeds]
```

to initialise the seeds.
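To see why this is fragile: `np.random.randint` draws from numpy's global random stream, so the value obtained for fold 2 depends on everything that consumed that stream earlier in the process, which is presumably why the restarted and direct runs disagree. A tiny illustration:

```python
# Illustrative only: the "seed" drawn for a fold depends on how much of the
# global stream was consumed beforehand.
import numpy as np

np.random.seed(0)
print(np.random.randint(0, 2**31))  # fold seed if nothing ran before

np.random.seed(0)
np.random.random()                  # some earlier code draws once more...
print(np.random.randint(0, 2**31))  # ...and the fold seed changes
```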
Solution to the random integer issue described above
`model_trainer.py`
4. To ensure that these seeds are generated in a reproducible way, @RoyStegeman helped me to devise a new scheme that changes how they are generated:
```python
for k, partition in enumerate(self.kpartitions):
    # Each partition of the kfolding needs to have its own separate model
    # and the seed needs to be updated accordingly
    seeds = self._nn_seeds
    if k > 0:
        # generate random integers for each k-fold from the input `nnseeds`
        # we generate new seeds to avoid the integer overflow that may
        # occur when doing k*nnseeds
        rngs = [np.random.default_rng(seed=seed) for seed in seeds]
        seeds = [generator.integers(1, pow(2, 30)) * k for generator in rngs]
```
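Because each per-fold generator is now seeded from the fixed `self._nn_seeds` rather than from the global stream, the derived integers depend only on the input seeds and the fold index `k`. A small self-contained check (with made-up seed values) shows they come out identical on every run:

```python
# Illustrative check with hypothetical nn_seeds; not part of the PR.
import numpy as np

nn_seeds = [4, 7]  # stand-in for the fixed self._nn_seeds
for run in range(2):  # simulate two independent runs
    per_fold = []
    for k in range(3):  # three folds
        seeds = nn_seeds
        if k > 0:
            rngs = [np.random.default_rng(seed=s) for s in seeds]
            seeds = [int(g.integers(1, pow(2, 30))) * k for g in rngs]
        per_fold.append(list(seeds))
    print(run, per_fold)  # the same per-fold seeds on both "runs"
```

Note that this also means the fold-`k` seeds are identical across trials, since they no longer depend on any evolving state; this is discussed further in the Note at the end.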
With all the above modifications, I have repeated my previous 4-trial experiment. The results are shown below for both restart and direct runs:
Restart 0
{'validation_losses': ['2.2993183', '4.4195056'], 'experimental_losses': [10.660690008425245, 13.892794249487705], 'hyper_losses': [19.669736106403093, 21.73920023647384]}
{'Adadelta_clipnorm': [], 'Adadelta_learning_rate': [], 'RMSprop_learning_rate': [0.015380823956886622], 'activation_per_layer': ['0'], 'dropout': [0.15], 'epochs': [35.0], 'initializer': ['0'], 'multiplier': [1.074400261320179], 'nl2:-0/2': [], 'nl2:-1/2': [], 'nl3:-0/3': [], 'nl3:-1/3': [], 'nl3:-2/3': [], 'nl4:-0/4': [15.0], 'nl4:-1/4': [41.0], 'nl4:-2/4': [36.0], 'nl4:-3/4': [45.0], 'nl5:-0/5': [], 'nl5:-1/5': [], 'nl5:-2/5': [], 'nl5:-3/5': [], 'nl5:-4/5': [], 'nodes_per_layer': ['2'], 'optimizer': ['1'], 'stopping_patience': [0.3600000000000001]}
Direct 0
{'validation_losses': ['2.2993183', '4.4195056'], 'experimental_losses': [10.660690008425245, 13.892794249487705], 'hyper_losses': [19.669736106403093, 21.73920023647384]}
{'Adadelta_clipnorm': [], 'Adadelta_learning_rate': [], 'RMSprop_learning_rate': [0.015380823956886622], 'activation_per_layer': ['0'], 'dropout': [0.15], 'epochs': [35.0], 'initializer': ['0'], 'multiplier': [1.074400261320179], 'nl2:-0/2': [], 'nl2:-1/2': [], 'nl3:-0/3': [], 'nl3:-1/3': [], 'nl3:-2/3': [], 'nl4:-0/4': [15.0], 'nl4:-1/4': [41.0], 'nl4:-2/4': [36.0], 'nl4:-3/4': [45.0], 'nl5:-0/5': [], 'nl5:-1/5': [], 'nl5:-2/5': [], 'nl5:-3/5': [], 'nl5:-4/5': [], 'nodes_per_layer': ['2'], 'optimizer': ['1'], 'stopping_patience': [0.3600000000000001]}
Restart 1
{'validation_losses': ['10.667141', '18.144234'], 'experimental_losses': [14.569936714920344, 25.68137247054303], 'hyper_losses': [46.88904701966194, 52.881341569995435]}
{'Adadelta_clipnorm': [1.7558937825962389], 'Adadelta_learning_rate': [0.02971486397602543], 'RMSprop_learning_rate': [], 'activation_per_layer': ['0'], 'dropout': [0.03], 'epochs': [30.0], 'initializer': ['0'], 'multiplier': [1.0896393776712885], 'nl2:-0/2': [], 'nl2:-1/2': [], 'nl3:-0/3': [], 'nl3:-1/3': [], 'nl3:-2/3': [], 'nl4:-0/4': [13.0], 'nl4:-1/4': [33.0], 'nl4:-2/4': [12.0], 'nl4:-3/4': [44.0], 'nl5:-0/5': [], 'nl5:-1/5': [], 'nl5:-2/5': [], 'nl5:-3/5': [], 'nl5:-4/5': [], 'nodes_per_layer': ['2'], 'optimizer': ['0'], 'stopping_patience': [0.18000000000000005]}
Direct 1
{'validation_losses': ['10.667141', '18.144234'], 'experimental_losses': [14.569936714920344, 25.68137247054303], 'hyper_losses': [46.88904701966194, 52.881341569995435]}
{'Adadelta_clipnorm': [1.7558937825962389], 'Adadelta_learning_rate': [0.02971486397602543], 'RMSprop_learning_rate': [], 'activation_per_layer': ['0'], 'dropout': [0.03], 'epochs': [30.0], 'initializer': ['0'], 'multiplier': [1.0896393776712885], 'nl2:-0/2': [], 'nl2:-1/2': [], 'nl3:-0/3': [], 'nl3:-1/3': [], 'nl3:-2/3': [], 'nl4:-0/4': [13.0], 'nl4:-1/4': [33.0], 'nl4:-2/4': [12.0], 'nl4:-3/4': [44.0], 'nl5:-0/5': [], 'nl5:-1/5': [], 'nl5:-2/5': [], 'nl5:-3/5': [], 'nl5:-4/5': [], 'nodes_per_layer': ['2'], 'optimizer': ['0'], 'stopping_patience': [0.18000000000000005]}
-- restarting and performing two more trials
Restart 2
{'validation_losses': ['18.18834', '52.55721'], 'experimental_losses': [21.345310585171568, 49.5125512295082], 'hyper_losses': [144.60983298921894, 105.20777437819639]}
{'Adadelta_clipnorm': [0.8411342478713798], 'Adadelta_learning_rate': [0.04928810632634438], 'RMSprop_learning_rate': [], 'activation_per_layer': ['1'], 'dropout': [0.09], 'epochs': [47.0], 'initializer': ['1'], 'multiplier': [1.0615455307107098], 'nl2:-0/2': [16.0], 'nl2:-1/2': [35.0], 'nl3:-0/3': [], 'nl3:-1/3': [], 'nl3:-2/3': [], 'nl4:-0/4': [], 'nl4:-1/4': [], 'nl4:-2/4': [], 'nl4:-3/4': [], 'nl5:-0/5': [], 'nl5:-1/5': [], 'nl5:-2/5': [], 'nl5:-3/5': [], 'nl5:-4/5': [], 'nodes_per_layer': ['0'], 'optimizer': ['0'], 'stopping_patience': [0.12000000000000002]}
Direct 2
{'validation_losses': ['18.18834', '52.55721'], 'experimental_losses': [21.345310585171568, 49.5125512295082], 'hyper_losses': [144.60983298921894, 105.20777437819639]}
{'Adadelta_clipnorm': [0.8411342478713798], 'Adadelta_learning_rate': [0.04928810632634438], 'RMSprop_learning_rate': [], 'activation_per_layer': ['1'], 'dropout': [0.09], 'epochs': [47.0], 'initializer': ['1'], 'multiplier': [1.0615455307107098], 'nl2:-0/2': [16.0], 'nl2:-1/2': [35.0], 'nl3:-0/3': [], 'nl3:-1/3': [], 'nl3:-2/3': [], 'nl4:-0/4': [], 'nl4:-1/4': [], 'nl4:-2/4': [], 'nl4:-3/4': [], 'nl5:-0/5': [], 'nl5:-1/5': [], 'nl5:-2/5': [], 'nl5:-3/5': [], 'nl5:-4/5': [], 'nodes_per_layer': ['0'], 'optimizer': ['0'], 'stopping_patience': [0.12000000000000002]}
Restart 3
{'validation_losses': ['26.753922', '24.388603'], 'experimental_losses': [52.71014284620098, 35.8982934170082], 'hyper_losses': [82.31994112766945, 3697.219938467043]}
{'Adadelta_clipnorm': [0.44633727461389994], 'Adadelta_learning_rate': [0.023650226340698025], 'RMSprop_learning_rate': [], 'activation_per_layer': ['1'], 'dropout': [0.09], 'epochs': [26.0], 'initializer': ['1'], 'multiplier': [1.0166524890792967], 'nl2:-0/2': [38.0], 'nl2:-1/2': [34.0], 'nl3:-0/3': [], 'nl3:-1/3': [], 'nl3:-2/3': [], 'nl4:-0/4': [], 'nl4:-1/4': [], 'nl4:-2/4': [], 'nl4:-3/4': [], 'nl5:-0/5': [], 'nl5:-1/5': [], 'nl5:-2/5': [], 'nl5:-3/5': [], 'nl5:-4/5': [], 'nodes_per_layer': ['0'], 'optimizer': ['0'], 'stopping_patience': [0.24000000000000005]}
Direct 3
{'validation_losses': ['26.753922', '24.388603'], 'experimental_losses': [52.71014284620098, 35.8982934170082], 'hyper_losses': [82.31994112766945, 3697.219938467043]}
{'Adadelta_clipnorm': [0.44633727461389994], 'Adadelta_learning_rate': [0.023650226340698025], 'RMSprop_learning_rate': [], 'activation_per_layer': ['1'], 'dropout': [0.09], 'epochs': [26.0], 'initializer': ['1'], 'multiplier': [1.0166524890792967], 'nl2:-0/2': [38.0], 'nl2:-1/2': [34.0], 'nl3:-0/3': [], 'nl3:-1/3': [], 'nl3:-2/3': [], 'nl4:-0/4': [], 'nl4:-1/4': [], 'nl4:-2/4': [], 'nl4:-3/4': [], 'nl5:-0/5': [], 'nl5:-1/5': [], 'nl5:-2/5': [], 'nl5:-3/5': [], 'nl5:-4/5': [], 'nodes_per_layer': ['0'], 'optimizer': ['0'], 'stopping_patience': [0.24000000000000005]}
|  | Trial 0 |  | Trial 1 |  | Trial 2 |  | Trial 3 |  |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
|  | Fold 1 | Fold 2 | Fold 1 | Fold 2 | Fold 1 | Fold 2 | Fold 1 | Fold 2 |
| Restart Job - random integers | 1872583848 | 203138455 | 1872583848 | 203138455 | 1872583848 | 203138455 | 1872583848 | 203138455 |
| Direct Job - random integers | 1872583848 | 203138455 | 1872583848 | 203138455 | 1872583848 | 203138455 | 1872583848 | 203138455 |
As seen, we are now able to ensure that both the hyperparameter sampling and the initial `weights` for each k-fold are reproducible when restarting.
Note
As can be seen from the above (last) table, because the seeds used to generate the random integers for each k-fold are now derived from the fixed value `self._nn_seeds` (see here), the generated random integers will always be the same in every trial; see https://github.com/NNPDF/nnpdf/pull/1824#discussion_r1379037604. This is an important aspect to keep in mind.