Parallel hyperoptimization with MongoDB
Created by: Cmurilochem
Aim
This PR aims to implement parallel hyperoptimization using MongoDB databases and mongo workers. This will enable us to calculate several trials simultaneously.
Strategy
Similarly to `FileTrials`, the main idea is to implement a `MongoFileTrials` class that inherits from `MongoTrials`. This new `MongoFileTrials` class will then be the one we instantiate before calling hyperopt's `fmin`.
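To make the strategy concrete, here is a minimal, hypothetical sketch of what such a class and its use with `fmin` could look like (class constructor arguments and database names are illustrative assumptions, not the PR's actual API):

```python
# Illustrative sketch only: a MongoTrials subclass playing the role that
# FileTrials plays for serial runs. Constructor arguments are assumptions.
from hyperopt import fmin, hp, tpe
from hyperopt.mongoexp import MongoTrials


class MongoFileTrials(MongoTrials):
    """MongoTrials variant that n3fit could instantiate before calling fmin."""

    def __init__(self, db_host="localhost", db_port=27017, db_name="hyperopt_test", exp_key=None, **kwargs):
        url = f"mongo://{db_host}:{db_port}/{db_name}/jobs"
        super().__init__(url, exp_key=exp_key, **kwargs)


def objective(x):
    # In a real parallel run this function must live in a module that the
    # hyperopt-mongo-worker processes can import, since they unpickle it.
    return x**2


if __name__ == "__main__":
    trials = MongoFileTrials(db_name="hyperopt_test", exp_key="exp1")
    # fmin only schedules jobs and blocks until the mongo workers have
    # evaluated enough trials; the workers do the actual computation.
    best = fmin(fn=objective, space=hp.uniform("x", -1, 1),
                algo=tpe.suggest, max_evals=10, trials=trials)
    print(best)
```

With `MongoTrials`, `fmin` itself does not evaluate anything: the evaluations are carried out by `hyperopt-mongo-worker` processes connected to the same database, which is what enables parallelism.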
Tasks
- Implement `MongoFileTrials`
- Parse MongoDB option to the `n3fit` command and `HyperScanner`
- Adapt `hyper_scan_wrapper` to allow for parallel evaluation of `fmin` trials
- Add MongoDB and `pymongo` as dependencies
- Add unit/integration test
- Quantify performance improvement
- Run test on Snellius
- Add documentation
- Add restarting options
Usage
Local Machine (for simple tests only)
First, make sure that you have MongoDB installed, either via `conda` (not sure if it is available in the latest `conda` version) or via `apt-get`/`brew`. `pymongo` is also necessary, but it can be easily installed via `pip` (it has already been added as a dependency).
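As a quick, optional sanity check (purely illustrative, not part of the PR), one can verify from Python that the required pieces are reachable before launching a parallel run:

```python
# Hypothetical pre-flight check: confirm pymongo is importable and that the
# mongod and hyperopt-mongo-worker executables are on the PATH.
import shutil

import pymongo  # raises ImportError if pymongo is missing

for exe in ("mongod", "hyperopt-mongo-worker"):
    assert shutil.which(exe), f"{exe} not found on PATH"

print("pymongo and the MongoDB/hyperopt executables look available")
```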
In the latest version of the code in this PR, `n3fit` is adapted to run automatically (via internal subprocesses) both `mongod` (which creates the MongoDB databases) and `hyperopt-mongo-worker` (which launches the mongo workers).
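The snippet below is a hedged sketch of this idea only; the helper names and defaults are assumptions for illustration and not the PR's actual code:

```python
# Hedged sketch: launching mongod and N hyperopt-mongo-worker processes from
# Python as child processes of the n3fit run.
import os
import subprocess


def start_mongod(db_path="hyperopt_db", port=27017):
    """Start a MongoDB server storing its data under db_path."""
    os.makedirs(db_path, exist_ok=True)
    return subprocess.Popen(["mongod", "--dbpath", db_path, "--port", str(port)])


def start_mongo_workers(num_workers, db_name="hyperopt_test", host="localhost", port=27017):
    """Launch num_workers hyperopt-mongo-worker processes polling the same database."""
    return [
        subprocess.Popen(["hyperopt-mongo-worker", f"--mongo={host}:{port}/{db_name}"])
        for _ in range(num_workers)
    ]
```

Each worker repeatedly pulls a pending trial from the database, evaluates it, and writes the result back, which is what allows `fmin` to advance while several trials run at once.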
To run parallel hyperopts with `n3fit`, do:
```bash
n3fit hyper-quickcard.yml 1 -r N_replicas -o dir_output_name --hyperopt N_trials --parallel-hyperopt --num-mongo-workers N
```
where `N` defines the number of mongo workers you want to launch in parallel; `N` therefore also sets the number of trials calculated simultaneously. If you want to restart a job, make sure that `dir_output_name` is in your current path and do:
```bash
n3fit hyper-quickcard.yml 1 -r N_replicas -o dir_output_name --hyperopt N_trials --parallel-hyperopt --num-mongo-workers N --restart
```
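On the hyperopt side, the key point that makes restarts possible is that the optimization state lives in MongoDB. The snippet below is an illustrative sketch (not the PR's `--restart` implementation; database name, port and `exp_key` are assumptions): reconnecting with the same database and `exp_key` recovers the previously completed trials, which then count towards `max_evals`.

```python
# Illustrative only: reconnecting to an existing experiment stored in MongoDB.
from hyperopt.mongoexp import MongoTrials

trials = MongoTrials("mongo://localhost:27017/hyperopt_test/jobs", exp_key="exp1")
trials.refresh()  # pull already-completed trials from the database
print(f"Resuming with {len(trials.trials)} trials already stored")
```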
Snellius
Here is a complete Slurm script showing how we would run a hyperopt experiment in parallel on Snellius (including restarts if needed):
```bash
#!/bin/bash
#SBATCH --nodes=1
#SBATCH --partition gpu
#SBATCH --gpus-per-node=4
#SBATCH --time 24:00:00
#SBATCH --output=logs/parallel_slurm-%j.out
# Print job info
echo "Job started on $(hostname) at $(date)"
# conda env
ENVNAME=py_nnpdf-master-gpu
# calc details
RUNCARD="hyper-quickcard.yml"
REPLICAS=2
TRIALS=30
DIR_OUTPUT_NAME="test_hyperopt"
RESTART=false
# number of mongo workers to launch
N_MONGOWORKERS=4
# activate conda environment
source ~/.bashrc
anaconda
conda activate $ENVNAME
# set up cudnn to run on the gpu
CUDNN_PATH=$(dirname $(python -c "import nvidia.cudnn;print(nvidia.cudnn.__file__)"))
echo "CUDNN path: $CUDNN_PATH"
export LD_LIBRARY_PATH="$CONDA_PREFIX/lib/:$CUDNN_PATH/lib:$LD_LIBRARY_PATH"
echo "LD_LIBRARY_PATH: $LD_LIBRARY_PATH"
# Verify GPU usage
ngpus=$(python3 -c "import tensorflow as tf; print(len(tf.config.list_physical_devices('GPU')))")
ngpus_list=$(python3 -c "import tensorflow as tf; print(tf.config.list_physical_devices('GPU'))")
echo "List of physical devices '$ngpus_list'"
if [ ${ngpus} -eq 0 ]; then
echo "GPUs not being used!"
else
echo "Using GPUs!"
echo "Num GPUs Available: ${ngpus}"
fi
# Run n3fit
echo "Changing directory to $TMPDIR"
cp "runcards/$RUNCARD" $TMPDIR
if [ ${RESTART} == "true" ]; then
cp -r $DIR_OUTPUT_NAME $TMPDIR
fi
cd $TMPDIR
echo "Running n3fit..."
if [ ${RESTART} == "true" ]; then
echo "Restarting job...."
echo "n3fit '$TMPDIR/$RUNCARD' 1 -r $REPLICAS --hyperopt $TRIALS -o $DIR_OUTPUT_NAME --parallel-hyperopt --num-mongo-workers $N_MONGOWORKERS --restart"
n3fit "$TMPDIR/$RUNCARD" 1 -r $REPLICAS --hyperopt $TRIALS -o $DIR_OUTPUT_NAME --parallel-hyperopt --num-mongo-workers $N_MONGOWORKERS --restart
else
echo "n3fit '$TMPDIR/$RUNCARD' 1 -r $REPLICAS --hyperopt $TRIALS -o $DIR_OUTPUT_NAME --parallel-hyperopt --num-mongo-workers $N_MONGOWORKERS"
n3fit "$TMPDIR/$RUNCARD" 1 -r $REPLICAS --hyperopt $TRIALS -o $DIR_OUTPUT_NAME --parallel-hyperopt --num-mongo-workers $N_MONGOWORKERS
fi
echo "Copying outputs to $SLURM_SUBMIT_DIR ..."
cp -r "$TMPDIR/$DIR_OUTPUT_NAME" $SLURM_SUBMIT_DIR
echo "Returning to $SLURM_SUBMIT_DIR ..."
cd $SLURM_SUBMIT_DIR
echo "Job completed at $(date)"
This would be run with:
```bash
sbatch --exclusive minimal_parallel_hyperopt.slurm
```
Here, each of the selected mongo workers (4) sees and runs on a separate GPU, as implemented here. In this run we are therefore calculating 4 trials in parallel.
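One common way to achieve this one-worker-per-GPU mapping (a hedged sketch, not necessarily the PR's exact mechanism; the helper name and arguments are hypothetical) is to set `CUDA_VISIBLE_DEVICES` in each worker's environment before spawning it:

```python
# Hedged sketch: round-robin GPU assignment for the mongo workers via
# CUDA_VISIBLE_DEVICES.
import os
import subprocess


def launch_workers_on_gpus(num_workers, num_gpus, db="localhost:27017/hyperopt_test"):
    workers = []
    for i in range(num_workers):
        env = os.environ.copy()
        # Worker i only sees GPU (i % num_gpus); with 8 workers and 4 GPUs
        # this naturally places two workers on each GPU.
        env["CUDA_VISIBLE_DEVICES"] = str(i % num_gpus)
        workers.append(
            subprocess.Popen(["hyperopt-mongo-worker", f"--mongo={db}"], env=env)
        )
    return workers
```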
We could also set up the experiment to run 2 mongo workers on each GPU (8 trials in parallel), e.g., by using `N_MONGOWORKERS=8` in the script above; in that case each GPU would host two workers.
Performance assessment
Local Machine
I have made a quick test on my local PC to assess the possible performance improvement from parallel hyperopts. I used the `hyper-quickcard.yml` card from `n3fit/tests/regression` (with minor modifications) and ran it for 10 trials and 2 replicas, varying the number of simultaneously launched mongo workers. The results are summarised in the figure below:

The results look encouraging a priori.
Snellius
For the Snellius tests, I employed the Slurm script above as a template together with a more complete runcard.txt. I ran 10 trials with 2 replicas, varying the number of mongo workers. The final results (after several rounds of fine-tuning of the code) are plotted in the figure below:

The figure shows the variation of the total wall-clock run time of each job as a function of the number of launched mongo workers. The idea is that each mongo worker is responsible for one trial at a time, so the more mongo workers we launch, the more trials we calculate simultaneously.

I also tested launching more than 1 mongo worker per GPU; see the right (light grey) part of the figure. This is where we observe the best performance improvement: a job with 8 mongo workers (2 per GPU) is nearly 8x faster than a serial hyperopt.