Restart hyperopt
Created by: Cmurilochem
This PR addresses the issue of restarting a hyperparameter optimization with the Hyperopt library, as discussed in https://github.com/NNPDF/nnpdf/issues/1800.
Comments on the initial changes made
`hyper_optimization/filetrials.py`
1. To the `FileTrials` class I have added the `from_pkl` and `to_pkl` methods. The former is a `@classmethod` that is useful to create instances of the class when a `tries.pkl` file is available from a previous run. The `to_pkl` method saves the current state of `FileTrials` to a pickle file, although this is currently already done for every trial in `hyperopt.fmin` directly via the `trials_save_file` argument.
   - In this regard, I also added an attribute `self.pkl_file`, which is responsible for generating a `tries.pkl` file in the same directory as `tries.json`.
   - An additional attribute `self._rstate` is also added. It stores the last `numpy.random.Generator` of the hyperopt algorithm and is passed as `rstate` to the `hyperopt.fmin` function, so that a restarted run follows the same history as a single uninterrupted run. The initial fixed seed in `trials.rstate = np.random.default_rng(42)` can still be relaxed and provided as input later. A minimal sketch of these additions is given after this list.
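For concreteness, here is a minimal sketch of how such a pickle-based `Trials` subclass could look. This is not the PR's exact implementation: the constructor signature, the `replay_path` argument, and the property names are assumptions for illustration.

```python
# Illustrative sketch only; attribute and argument names are assumptions.
import pickle

from hyperopt import Trials


class FileTrials(Trials):
    """Trials object that can persist and recover its state via pickle."""

    def __init__(self, replay_path, **kwargs):
        # the pickle file lives in the same directory as tries.json
        self.pkl_file = f"{replay_path}/tries.pkl"
        self._rstate = None  # last numpy.random.Generator used by hyperopt
        super().__init__(**kwargs)

    @property
    def rstate(self):
        return self._rstate

    @rstate.setter
    def rstate(self, value):
        self._rstate = value

    def to_pkl(self):
        """Dump the whole FileTrials instance to its pickle file."""
        with open(self.pkl_file, "wb") as f:
            pickle.dump(self, f)

    @classmethod
    def from_pkl(cls, pkl_path):
        """Recreate a FileTrials instance from a previous run's pickle file."""
        with open(pkl_path, "rb") as f:
            return pickle.load(f)
```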
`hyper_optimization/hyper_scan.py`
2. In case of restarts, an extra boolean attribute is added to `HyperScanner`, named `self.restart_hyperopt`, which is set to `True` when the `--continue` option is passed on the `n3fit` command line (details to be discussed below).
   - I have adapted `hyper_scan_wrapper` so that it checks whether `hyperscanner.restart_hyperopt` is `True`. If so, it generates the initial `FileTrials` instance (`trials`) from `tries.pkl`, which by construction contains the history of the previous hyperopt run as well as the `trials.rstate` attribute holding the previous numpy random generator. A sketch of this logic follows the list.
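The restart branch could look roughly like the following. This is a hedged sketch, not the PR diff: the `replay_path` and `fitting_function` arguments, the `as_dict()` call, and the choice of `tpe.suggest` are assumptions standing in for the actual `hyper_scan_wrapper` internals.

```python
# Illustrative sketch of the restart branch; the names noted above are assumptions.
import hyperopt
import numpy as np


def hyper_scan_wrapper(replay_path, fitting_function, hyperscanner, max_evals):
    if hyperscanner.restart_hyperopt:
        # recover the previous history and random generator from tries.pkl
        trials = FileTrials.from_pkl(f"{replay_path}/tries.pkl")
    else:
        trials = FileTrials(replay_path)
        trials.rstate = np.random.default_rng(42)  # fixed seed, see point 1

    return hyperopt.fmin(
        fn=fitting_function,
        space=hyperscanner.as_dict(),
        algo=hyperopt.tpe.suggest,
        max_evals=max_evals,
        trials=trials,
        # reusing the stored generator makes the restart follow the same
        # sampling history as a single uninterrupted run
        rstate=trials.rstate,
        # hyperopt itself re-saves the trials after every evaluation
        trials_save_file=trials.pkl_file,
    )
```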
`scripts/n3fit_exec.py`
3. This is perhaps the most fragile of the changes and where I would need help to adapt it properly.
   - To `N3FitApp` I added a new parser option, `--continue`, which is the keyword that triggers hyperopt restarts.
   - To its `run` method I added a new attribute, `self.environment.restart = self.args["continue"]`.
   - The way I found to pass this keyword to `HyperScanner` later is to use it in connection with `produce_hyperscanner`. If it is `True`, I update `hyperscan_config` with `hyperscan_config.update({'restart': 'true'})`, and this will later be part of the `HyperScanner`'s `sampling_dict` argument. A sketch of the parser wiring is shown after this list.
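The command-line part could be wired up roughly as below. This is a simplified sketch assuming an argparse-based `App` base class as in reportengine; the parser hook and method names are assumptions, not the exact PR diff.

```python
# Illustrative sketch; the App base class and parser hook are simplified.
class N3FitApp(App):
    @property
    def argparser(self):
        parser = super().argparser
        # store_true makes the flag default to False when not given
        parser.add_argument(
            "--continue",
            action="store_true",
            help="restart a previous hyperopt run from its tries.pkl file",
        )
        return parser

    def run(self):
        # self.args behaves like a dict here, so the reserved word
        # "continue" poses no problem
        self.environment.restart = self.args["continue"]
        super().run()
```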
Questions and requested feedback
- It looks to me that the adaptations made in the `scripts/n3fit_exec.py` file to allow for `--continue` are not optimal. Maybe a more experienced developer could suggest a more convenient way to do so.
- Despite all our efforts to make sure that hyperopt restarts have the same history as direct runs, it turns out that, although restarts produce the same hyperparameter guesses, they still show differences in the final losses of the different k-folds. This might be due to the fact that the seeds for the initial `weights` of each k-fold are inherently different between runs (see below). For example, I have done a test in which I run a simple hyperoptimization with 2 trials and then restart it for another 2 trials (4 in total). I then run another experiment that calculates 4 trials directly (with the same runcard) and compare the results.
Restart 0
{'validation_losses': '[[ 8.973183]\n [12.285054]]', 'experimental_losses': '[[52.548443]\n [42.66381 ]]', 'hyper_losses': '[[52.548443]\n [42.66381 ]]'}
{'Adam_clipnorm': [], 'Adam_learning_rate': [], 'Nadam_clipnorm': [2.0117276880696024e-05], 'Nadam_learning_rate': [0.002011336799276246], 'nl2:-0/2': [41.0], 'nl2:-1/2': [32.0], 'nodes_per_layer': ['0'], 'optimizer': ['0']}
Direct 0
{'validation_losses': '[[ 8.973183]\n [12.285054]]', 'experimental_losses': '[[52.548443]\n [42.66381 ]]', 'hyper_losses': '[[52.548443]\n [42.66381 ]]'}
{'Adam_clipnorm': [], 'Adam_learning_rate': [], 'Nadam_clipnorm': [2.0117276880696024e-05], 'Nadam_learning_rate': [0.002011336799276246], 'nl2:-0/2': [41.0], 'nl2:-1/2': [32.0], 'nodes_per_layer': ['0'], 'optimizer': ['0']}
Restart 1
{'validation_losses': '[[14.459031]\n [29.935106]]', 'experimental_losses': '[[64.00286 ]\n [74.955086]]', 'hyper_losses': '[[64.00286 ]\n [74.955086]]'}
{'Adam_clipnorm': [4.9234053337502195e-06], 'Adam_learning_rate': [0.00013178694123594783], 'Nadam_clipnorm': [], 'Nadam_learning_rate': [], 'nl2:-0/2': [40.0], 'nl2:-1/2': [31.0], 'nodes_per_layer': ['0'], 'optimizer': ['1']}
Direct 1
{'validation_losses': '[[14.459031]\n [29.935106]]', 'experimental_losses': '[[64.00286 ]\n [74.955086]]', 'hyper_losses': '[[64.00286 ]\n [74.955086]]'}
{'Adam_clipnorm': [4.9234053337502195e-06], 'Adam_learning_rate': [0.00013178694123594783], 'Nadam_clipnorm': [], 'Nadam_learning_rate': [], 'nl2:-0/2': [40.0], 'nl2:-1/2': [31.0], 'nodes_per_layer': ['0'], 'optimizer': ['1']}
-- restarting and performing two more trials
Restart 2
{'validation_losses': '[[19.959248]\n [39.484573]]', 'experimental_losses': '[[55.306988]\n [92.67653 ]]', 'hyper_losses': '[[55.306988]\n [92.67653 ]]'}
{'Adam_clipnorm': [], 'Adam_learning_rate': [], 'Nadam_clipnorm': [5.226179920719625e-06], 'Nadam_learning_rate': [0.00020623358967934892], 'nl2:-0/2': [32.0], 'nl2:-1/2': [13.0], 'nodes_per_layer': ['0'], 'optimizer': ['0']}
Direct 2
{'validation_losses': '[[19.959248]\n [43.446335]]', 'experimental_losses': '[[ 55.306988]\n [104.45502 ]]', 'hyper_losses': '[[ 55.306988]\n [104.45502 ]]'}
{'Adam_clipnorm': [], 'Adam_learning_rate': [], 'Nadam_clipnorm': [5.226179920719625e-06], 'Nadam_learning_rate': [0.00020623358967934892], 'nl2:-0/2': [32.0], 'nl2:-1/2': [13.0], 'nodes_per_layer': ['0'], 'optimizer': ['0']}
Restart 3
{'validation_losses': '[[23.391615]\n [65.55123 ]]', 'experimental_losses': '[[ 69.19588]\n [137.06268]]', 'hyper_losses': '[[ 69.19588]\n [137.06268]]'}
{'Adam_clipnorm': [1.6662863474168997e-07], 'Adam_learning_rate': [0.0025640118767782183], 'Nadam_clipnorm': [], 'Nadam_learning_rate': [], 'nl2:-0/2': [31.0], 'nl2:-1/2': [17.0], 'nodes_per_layer': ['0'], 'optimizer': ['1']}
Direct 3
{'validation_losses': '[[23.391615]\n [44.021515]]', 'experimental_losses': '[[69.19588 ]\n [93.464554]]', 'hyper_losses': '[[69.19588 ]\n [93.464554]]'}
{'Adam_clipnorm': [1.6662863474168997e-07], 'Adam_learning_rate': [0.0025640118767782183], 'Nadam_clipnorm': [], 'Nadam_learning_rate': [], 'nl2:-0/2': [31.0], 'nl2:-1/2': [17.0], 'nodes_per_layer': ['0'], 'optimizer': ['1']}
Looking at the above results, we can see that Restart 2/3 have the same hyperparameters as Direct 2/3, yet the two folds have different losses: the first fold still reproduces the direct-run losses, but the second does not.
With the help of @goord and @APJansen, I investigated this issue and printed the random integers that are generated and passed as seeds to build the PDF models for each fold in `ModelTrainer.hyperparametrizable()`; see here. They are shown in the table below:
|  | Trial 0 |  | Trial 1 |  | Trial 2 |  | Trial 3 |  |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
|  | Fold 1 | Fold 2 | Fold 1 | Fold 2 | Fold 1 | Fold 2 | Fold 1 | Fold 2 |
| Restart Job - random integers | 1181867710 | 461027504 | 1181867710 | 1020231754 | 1181867710 | 461027504 | 1181867710 | 1020231754 |
| Direct Job - random integers | 1181867710 | 461027504 | 1181867710 | 1020231754 | 1181867710 | 1543757328 | 1181867710 | 1392765670 |
As foreseen, it is clear from the table that the seeds for the second fold are different every time we run a new calculation, despite the runs starting from the same hyperparameters. This is clearly reflected in the different losses shown above. I suspect that if we want to make hyperopt runs completely reproducible, we need an alternative to

```python
for k, partition in enumerate(self.kpartitions):
    # Each partition of the kfolding needs to have its own separate model
    # and the seed needs to be updated accordingly
    seeds = self._nn_seeds
    if k > 0:
        seeds = [np.random.randint(0, pow(2, 31)) for _ in seeds]
```

to initialise the seeds.
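To see why this is fragile: `np.random.randint` draws from numpy's global random stream, so the value obtained for fold 2 depends on everything that consumed that stream earlier in the process, which is presumably why the restarted and direct runs disagree. A tiny illustration:

```python
# Illustrative only: the "seed" drawn for a fold depends on how much of the
# global stream was consumed beforehand.
import numpy as np

np.random.seed(0)
print(np.random.randint(0, 2**31))  # fold seed if nothing ran before

np.random.seed(0)
np.random.random()                  # some earlier code draws once more...
print(np.random.randint(0, 2**31))  # ...and the fold seed changes
```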
Solution to the random integer issue described above
`model_trainer.py`
4. To ensure that these seeds are generated in a reproducible way, @RoyStegeman helped me to devise a new scheme that changes how they are generated:
```python
for k, partition in enumerate(self.kpartitions):
    # Each partition of the kfolding needs to have its own separate model
    # and the seed needs to be updated accordingly
    seeds = self._nn_seeds
    if k > 0:
        # generate random integers for each k-fold from the input `nnseeds`
        # we generate new seeds to avoid the integer overflow that may
        # occur when doing k*nnseeds
        rngs = [np.random.default_rng(seed=seed) for seed in seeds]
        seeds = [generator.integers(1, pow(2, 30)) * k for generator in rngs]
```
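Because each per-fold generator is now seeded from the fixed `self._nn_seeds` rather than from the global stream, the derived integers depend only on the input seeds and the fold index `k`. A small self-contained check (with made-up seed values) shows they come out identical on every run:

```python
# Illustrative check with hypothetical nn_seeds; not part of the PR.
import numpy as np

nn_seeds = [4, 7]  # stand-in for the fixed self._nn_seeds
for run in range(2):  # simulate two independent runs
    per_fold = []
    for k in range(3):  # three folds
        seeds = nn_seeds
        if k > 0:
            rngs = [np.random.default_rng(seed=s) for s in seeds]
            seeds = [int(g.integers(1, pow(2, 30))) * k for g in rngs]
        per_fold.append(list(seeds))
    print(run, per_fold)  # the same per-fold seeds on both "runs"
```

Note that this also means the fold-`k` seeds are identical across trials, since they no longer depend on any evolving state; this is discussed further in the Note at the end.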
With all the above modifications, I have repeated my previous 4-trial experiment. The results are shown below for both restart and direct runs:
Restart 0
{'validation_losses': ['2.2993183', '4.4195056'], 'experimental_losses': [10.660690008425245, 13.892794249487705], 'hyper_losses': [19.669736106403093, 21.73920023647384]}
{'Adadelta_clipnorm': [], 'Adadelta_learning_rate': [], 'RMSprop_learning_rate': [0.015380823956886622], 'activation_per_layer': ['0'], 'dropout': [0.15], 'epochs': [35.0], 'initializer': ['0'], 'multiplier': [1.074400261320179], 'nl2:-0/2': [], 'nl2:-1/2': [], 'nl3:-0/3': [], 'nl3:-1/3': [], 'nl3:-2/3': [], 'nl4:-0/4': [15.0], 'nl4:-1/4': [41.0], 'nl4:-2/4': [36.0], 'nl4:-3/4': [45.0], 'nl5:-0/5': [], 'nl5:-1/5': [], 'nl5:-2/5': [], 'nl5:-3/5': [], 'nl5:-4/5': [], 'nodes_per_layer': ['2'], 'optimizer': ['1'], 'stopping_patience': [0.3600000000000001]}
Direct 0
{'validation_losses': ['2.2993183', '4.4195056'], 'experimental_losses': [10.660690008425245, 13.892794249487705], 'hyper_losses': [19.669736106403093, 21.73920023647384]}
{'Adadelta_clipnorm': [], 'Adadelta_learning_rate': [], 'RMSprop_learning_rate': [0.015380823956886622], 'activation_per_layer': ['0'], 'dropout': [0.15], 'epochs': [35.0], 'initializer': ['0'], 'multiplier': [1.074400261320179], 'nl2:-0/2': [], 'nl2:-1/2': [], 'nl3:-0/3': [], 'nl3:-1/3': [], 'nl3:-2/3': [], 'nl4:-0/4': [15.0], 'nl4:-1/4': [41.0], 'nl4:-2/4': [36.0], 'nl4:-3/4': [45.0], 'nl5:-0/5': [], 'nl5:-1/5': [], 'nl5:-2/5': [], 'nl5:-3/5': [], 'nl5:-4/5': [], 'nodes_per_layer': ['2'], 'optimizer': ['1'], 'stopping_patience': [0.3600000000000001]}
Restart 1
{'validation_losses': ['10.667141', '18.144234'], 'experimental_losses': [14.569936714920344, 25.68137247054303], 'hyper_losses': [46.88904701966194, 52.881341569995435]}
{'Adadelta_clipnorm': [1.7558937825962389], 'Adadelta_learning_rate': [0.02971486397602543], 'RMSprop_learning_rate': [], 'activation_per_layer': ['0'], 'dropout': [0.03], 'epochs': [30.0], 'initializer': ['0'], 'multiplier': [1.0896393776712885], 'nl2:-0/2': [], 'nl2:-1/2': [], 'nl3:-0/3': [], 'nl3:-1/3': [], 'nl3:-2/3': [], 'nl4:-0/4': [13.0], 'nl4:-1/4': [33.0], 'nl4:-2/4': [12.0], 'nl4:-3/4': [44.0], 'nl5:-0/5': [], 'nl5:-1/5': [], 'nl5:-2/5': [], 'nl5:-3/5': [], 'nl5:-4/5': [], 'nodes_per_layer': ['2'], 'optimizer': ['0'], 'stopping_patience': [0.18000000000000005]}
Direct 1
{'validation_losses': ['10.667141', '18.144234'], 'experimental_losses': [14.569936714920344, 25.68137247054303], 'hyper_losses': [46.88904701966194, 52.881341569995435]}
{'Adadelta_clipnorm': [1.7558937825962389], 'Adadelta_learning_rate': [0.02971486397602543], 'RMSprop_learning_rate': [], 'activation_per_layer': ['0'], 'dropout': [0.03], 'epochs': [30.0], 'initializer': ['0'], 'multiplier': [1.0896393776712885], 'nl2:-0/2': [], 'nl2:-1/2': [], 'nl3:-0/3': [], 'nl3:-1/3': [], 'nl3:-2/3': [], 'nl4:-0/4': [13.0], 'nl4:-1/4': [33.0], 'nl4:-2/4': [12.0], 'nl4:-3/4': [44.0], 'nl5:-0/5': [], 'nl5:-1/5': [], 'nl5:-2/5': [], 'nl5:-3/5': [], 'nl5:-4/5': [], 'nodes_per_layer': ['2'], 'optimizer': ['0'], 'stopping_patience': [0.18000000000000005]}
-- restarting and performing two more trials
Restart 2
{'validation_losses': ['18.18834', '52.55721'], 'experimental_losses': [21.345310585171568, 49.5125512295082], 'hyper_losses': [144.60983298921894, 105.20777437819639]}
{'Adadelta_clipnorm': [0.8411342478713798], 'Adadelta_learning_rate': [0.04928810632634438], 'RMSprop_learning_rate': [], 'activation_per_layer': ['1'], 'dropout': [0.09], 'epochs': [47.0], 'initializer': ['1'], 'multiplier': [1.0615455307107098], 'nl2:-0/2': [16.0], 'nl2:-1/2': [35.0], 'nl3:-0/3': [], 'nl3:-1/3': [], 'nl3:-2/3': [], 'nl4:-0/4': [], 'nl4:-1/4': [], 'nl4:-2/4': [], 'nl4:-3/4': [], 'nl5:-0/5': [], 'nl5:-1/5': [], 'nl5:-2/5': [], 'nl5:-3/5': [], 'nl5:-4/5': [], 'nodes_per_layer': ['0'], 'optimizer': ['0'], 'stopping_patience': [0.12000000000000002]}
Direct 2
{'validation_losses': ['18.18834', '52.55721'], 'experimental_losses': [21.345310585171568, 49.5125512295082], 'hyper_losses': [144.60983298921894, 105.20777437819639]}
{'Adadelta_clipnorm': [0.8411342478713798], 'Adadelta_learning_rate': [0.04928810632634438], 'RMSprop_learning_rate': [], 'activation_per_layer': ['1'], 'dropout': [0.09], 'epochs': [47.0], 'initializer': ['1'], 'multiplier': [1.0615455307107098], 'nl2:-0/2': [16.0], 'nl2:-1/2': [35.0], 'nl3:-0/3': [], 'nl3:-1/3': [], 'nl3:-2/3': [], 'nl4:-0/4': [], 'nl4:-1/4': [], 'nl4:-2/4': [], 'nl4:-3/4': [], 'nl5:-0/5': [], 'nl5:-1/5': [], 'nl5:-2/5': [], 'nl5:-3/5': [], 'nl5:-4/5': [], 'nodes_per_layer': ['0'], 'optimizer': ['0'], 'stopping_patience': [0.12000000000000002]}
Restart 3
{'validation_losses': ['26.753922', '24.388603'], 'experimental_losses': [52.71014284620098, 35.8982934170082], 'hyper_losses': [82.31994112766945, 3697.219938467043]}
{'Adadelta_clipnorm': [0.44633727461389994], 'Adadelta_learning_rate': [0.023650226340698025], 'RMSprop_learning_rate': [], 'activation_per_layer': ['1'], 'dropout': [0.09], 'epochs': [26.0], 'initializer': ['1'], 'multiplier': [1.0166524890792967], 'nl2:-0/2': [38.0], 'nl2:-1/2': [34.0], 'nl3:-0/3': [], 'nl3:-1/3': [], 'nl3:-2/3': [], 'nl4:-0/4': [], 'nl4:-1/4': [], 'nl4:-2/4': [], 'nl4:-3/4': [], 'nl5:-0/5': [], 'nl5:-1/5': [], 'nl5:-2/5': [], 'nl5:-3/5': [], 'nl5:-4/5': [], 'nodes_per_layer': ['0'], 'optimizer': ['0'], 'stopping_patience': [0.24000000000000005]}
Direct 3
{'validation_losses': ['26.753922', '24.388603'], 'experimental_losses': [52.71014284620098, 35.8982934170082], 'hyper_losses': [82.31994112766945, 3697.219938467043]}
{'Adadelta_clipnorm': [0.44633727461389994], 'Adadelta_learning_rate': [0.023650226340698025], 'RMSprop_learning_rate': [], 'activation_per_layer': ['1'], 'dropout': [0.09], 'epochs': [26.0], 'initializer': ['1'], 'multiplier': [1.0166524890792967], 'nl2:-0/2': [38.0], 'nl2:-1/2': [34.0], 'nl3:-0/3': [], 'nl3:-1/3': [], 'nl3:-2/3': [], 'nl4:-0/4': [], 'nl4:-1/4': [], 'nl4:-2/4': [], 'nl4:-3/4': [], 'nl5:-0/5': [], 'nl5:-1/5': [], 'nl5:-2/5': [], 'nl5:-3/5': [], 'nl5:-4/5': [], 'nodes_per_layer': ['0'], 'optimizer': ['0'], 'stopping_patience': [0.24000000000000005]}
|  | Trial 0 |  | Trial 1 |  | Trial 2 |  | Trial 3 |  |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
|  | Fold 1 | Fold 2 | Fold 1 | Fold 2 | Fold 1 | Fold 2 | Fold 1 | Fold 2 |
| Restart Job - random integers | 1872583848 | 203138455 | 1872583848 | 203138455 | 1872583848 | 203138455 | 1872583848 | 203138455 |
| Direct Job - random integers | 1872583848 | 203138455 | 1872583848 | 203138455 | 1872583848 | 203138455 | 1872583848 | 203138455 |
As seen, we are now able to ensure that both the hyperparameter sampling and the initial `weights` for each k-fold are reproducible when restarting.
Note
As can be seen from the above (last) table, because the seeds used to generate the random integers for each k-fold are now derived from the fixed value `self._nn_seeds` (see here), the generated random integers will always be the same in every trial; see https://github.com/NNPDF/nnpdf/pull/1824#discussion_r1379037604. This is an important aspect to keep in mind.