If you have a cluster of GPU cards on a remote machine, you can use it to run Predict Engine. To enable cluster mode, open the Preferences from the menu PredictSuite/Preferences... and go to the Engine Cluster section.
Using a GPU cluster requires at least an Enterprise edition license.
The setup described in this section varies depending on which cluster is used. Ask your cluster administrator for a valid configuration file and the path to the predict-hpc.sh script.
After logging in to the cluster through ssh, start the server session by calling the predict-hpc.sh script with a valid configuration file (see the example configuration file below).
Once the job is accepted, wait until Server listening on... appears in the log. The session is then ready for a client connection. You must keep the ssh connection open for the remainder of the session.
To close the session, press Ctrl+C. This cancels the job, releases the licenses, and shuts down the asset synchronizer and proxy.
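The readiness check described above can be automated by scanning the log for the marker text. The following is a minimal sketch, assuming the log is available as a sequence of text lines; the function name and the sample lines are hypothetical, and only the marker string comes from this page.

```python
from typing import Iterable, Optional

# Marker text quoted in the documentation; everything after it is cluster-specific.
READY_MARKER = "Server listening on"


def find_ready_line(log_lines: Iterable[str], marker: str = READY_MARKER) -> Optional[str]:
    """Return the first log line containing the marker, or None if it never appears."""
    for line in log_lines:
        if marker in line:
            return line.rstrip("\n")
    return None


if __name__ == "__main__":
    # Placeholder lines for illustration only; the real log content will differ.
    sample = ["<earlier log output>\n", "Server listening on <address>\n"]
    ready = find_ready_line(sample)
    print("ready" if ready is not None else "not ready yet")
```

Because the function accepts any iterable of lines, it can be fed an open file object as well as a list.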
#######################################################################################
#SESSION
#######################################################################################
uvr_task_count= Number of nodes launched
uvr_gpus_per_task= Number of GPUs per node
uvr_cpus_per_task= Number of CPUs per node
uvr_slurm_partition= Partition to run on
uvr_slurm_output= Path of the output file
uvr_slurm_name= Name of the job
uvr_exclude_list= A list of nodes to be excluded from resource scheduling (optional)
uvr_server_node= The node used to host the server (optional, see below)
#######################################################################################
#CLUSTER-WIDE PARAMETERS
#######################################################################################
......
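Before starting a session, it can help to check that a configuration file defines every session parameter. The sketch below assumes the file uses plain key=value lines with # for comments, as the example above suggests, and treats every parameter not marked optional as required. Both points are assumptions, and the helper names are hypothetical.

```python
from typing import Dict, List

# Session keys from the example above; the two marked optional are left out.
REQUIRED_KEYS = [
    "uvr_task_count",
    "uvr_gpus_per_task",
    "uvr_cpus_per_task",
    "uvr_slurm_partition",
    "uvr_slurm_output",
    "uvr_slurm_name",
]


def parse_config(text: str) -> Dict[str, str]:
    """Parse key=value lines, skipping blank lines, comments, and lines without '='."""
    values: Dict[str, str] = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip()
    return values


def missing_keys(values: Dict[str, str]) -> List[str]:
    """List required keys that are absent or have an empty value."""
    return [key for key in REQUIRED_KEYS if not values.get(key)]


if __name__ == "__main__":
    # Hypothetical values for illustration only.
    example = "#SESSION\nuvr_task_count=2\nuvr_gpus_per_task=4\n"
    print(missing_keys(parse_config(example)))
```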
If your session requires more resources than are available on the cluster, the job will remain pending. You will have to wait until resources are freed, or reduce the number of nodes or GPUs requested for the session.
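A rough way to estimate what a session requests is to multiply the per-node values by the node count. The sketch below assumes that totals are simply uvr_task_count times the per-node values and that you already know how many resources are free. It does not query SLURM, and it does not say whether the server node is counted among the tasks, since this page does not state it. The function names are hypothetical.

```python
from typing import Dict


def requested_totals(task_count: int, gpus_per_task: int, cpus_per_task: int) -> Dict[str, int]:
    """Total nodes, GPUs and CPUs requested, assuming per-node values multiply by node count."""
    if min(task_count, gpus_per_task, cpus_per_task) < 0:
        raise ValueError("resource counts must not be negative")
    return {
        "nodes": task_count,
        "gpus": task_count * gpus_per_task,
        "cpus": task_count * cpus_per_task,
    }


def fits(requested: Dict[str, int], available: Dict[str, int]) -> bool:
    """True if every requested total is within what is reported as available."""
    return all(requested[name] <= available.get(name, 0) for name in requested)


if __name__ == "__main__":
    # Hypothetical numbers for illustration only.
    wanted = requested_totals(task_count=2, gpus_per_task=4, cpus_per_task=8)
    print(wanted, fits(wanted, {"nodes": 4, "gpus": 8, "cpus": 64}))
```

This only compares totals; the scheduler may still hold the job if no single node can satisfy the per-node request.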
When scheduling resources, SLURM may allocate the server node after the compute nodes (for example, Node 0 for a compute node and Node 1 for the server node). This does not work well with MPI and results in an error.
If this happens, explicitly specify the node used to host the server by setting the uvr_server_node parameter.
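Setting the parameter amounts to editing one line of the configuration file. The sketch below shows one way to do that programmatically, assuming the key=value format of the example configuration. The helper name is hypothetical, and the node name passed in must be a real node name from your cluster.

```python
def set_config_value(text: str, key: str, value: str) -> str:
    """Return the config text with key set to value, replacing the first existing entry."""
    lines = text.splitlines()
    new_line = f"{key}={value}"
    for index, raw in enumerate(lines):
        stripped = raw.strip()
        if stripped.startswith("#"):
            continue  # leave comment lines untouched
        name, separator, _ = stripped.partition("=")
        if separator and name.strip() == key:
            lines[index] = new_line
            break
    else:
        # Key not present: append it at the end of the file.
        lines.append(new_line)
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # "<node-name>" is a placeholder, not a real node.
    print(set_config_value("uvr_task_count=2\nuvr_server_node=\n", "uvr_server_node", "<node-name>"))
```

When the key is missing, this sketch appends it to the end of the file; you may prefer to place it by hand in the session block next to the other uvr_ parameters.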
Predict-HPC ships with two more scripts that can be used to monitor the job or clean up a crashed session:
clean-session.sh => Cancels the job and shuts down the asset synchronizer and proxy.
monitor-job.sh => Outputs information on the current session.
After launching the session on the cluster side, go to Preferences/UVR PredictSuite/Engine Cluster and fill in the information for the remote machine: the address and port, the username, the password, and the absolute path to where the remote assets are stored.
Clicking on connect to server enters predict-unity remote mode. All subsequent renderings within this session are executed remotely. To end a session, click on disconnect from server before closing the session on the cluster side.
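The connection fields listed above can be sanity-checked before you type them into the Preferences. The sketch below is an illustration, not part of PredictSuite: the class and field names are hypothetical, the password is deliberately left out, and the remote path is assumed to be a POSIX path because the cluster side runs shell scripts.

```python
from dataclasses import dataclass
from pathlib import PurePosixPath
from typing import List


@dataclass
class ClusterSettings:
    address: str
    port: int
    username: str
    remote_assets_path: str

    def problems(self) -> List[str]:
        """Return human-readable issues; an empty list means the values look plausible."""
        issues: List[str] = []
        if not self.address.strip():
            issues.append("address is empty")
        if not 1 <= self.port <= 65535:
            issues.append("port must be between 1 and 65535")
        if not self.username.strip():
            issues.append("username is empty")
        if not PurePosixPath(self.remote_assets_path).is_absolute():
            issues.append("remote assets path must be absolute")
        return issues


if __name__ == "__main__":
    # Placeholder values for illustration only.
    print(ClusterSettings("<address>", 8080, "<user>", "relative/assets").problems())
```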
Predict-hpc uses remote services to synchronize assets between the client and the remote cluster. Clicking on Synchronize uploads the assets of the current Unity project to the cluster, transferring only outdated or modified assets for efficiency. By default, in remote mode, the scene is automatically synchronized when you press play, but you can pre-synchronize most of the assets while working on the project to save time.
Clear Cache can be used to wipe the remote assets directory for a clean synchronization.
You can monitor asset synchronization from within the PredictSuite/Engine/File synchronizer monitor.
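To illustrate the idea of uploading only outdated or modified assets, the sketch below compares content hashes of local files against a record of what the remote side holds. This is a generic technique shown for illustration only. How Predict-hpc actually detects changes is not documented here, and the function names are hypothetical.

```python
import hashlib
from pathlib import Path
from typing import Dict, List


def fingerprint(root: Path) -> Dict[str, str]:
    """Map each file path (relative to root, POSIX style) to the SHA-256 of its contents."""
    result: Dict[str, str] = {}
    for path in sorted(root.rglob("*")):
        if path.is_file():
            digest = hashlib.sha256(path.read_bytes()).hexdigest()
            result[path.relative_to(root).as_posix()] = digest
    return result


def files_to_upload(local: Dict[str, str], remote: Dict[str, str]) -> List[str]:
    """Files that are missing remotely or whose content hash differs, in sorted order."""
    return sorted(name for name, digest in local.items() if remote.get(name) != digest)


if __name__ == "__main__":
    # With an empty remote record, every local file is scheduled for upload,
    # which mirrors a first synchronization or one that follows Clear Cache.
    local_state = fingerprint(Path("."))
    print(len(files_to_upload(local_state, {})), "file(s) would be uploaded")
```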
Some features are not supported when using a cluster:
Engine Window: to visualize a simulation in Unity, you can normally choose between the Game View Overlay and the Engine Window. When using the cluster, the Engine Window is not supported and you must use the Overlay mode.
Variants: variants are not supported when using the cluster. You need to edit the scene and reload it manually.
Environment rotation: usually, when the scene contains an environment light (HDRI/Skybox), you can rotate it without reloading the scene. When using the cluster, the scene must be reloaded manually for the rotation to be applied.
Optics: when using the cluster, you cannot change the camera lens radius and focus distance interactively. The scene must be reloaded manually.
Resolution: when using the cluster, you cannot change the resolution on the fly once the simulation has started. To change the resolution, you must stop the simulation, change the resolution, and restart it.
Device selection : in the Preferences, you can choose on which device the simulation will be run. This has no impact when using the cluster since the simulation will not be run locally.
Profiler: the Profiler only tracks local processes. It cannot give you information on a cluster process.