Large Model Training

Large Model Training can require running on more than 8 GPUs, which is the maximum number of GPUs available on one node. To do so, Toolkit allows users to run multiple copies (replicas) of a single job in parallel, on the same node or on different nodes. These tasks are referred to as replica jobs. To use them, you need to enable:

  • InfiniBand devices to accelerate the communication between the GPUs.

  • Replicas option and its requirements.

Toolkit will create one master job and one or more replicas, depending on your job specification. If you request 4 GPUs per job, 2 replicas can start on the same node since a DGX has 8 GPUs; if you request more than 4 GPUs per job, the replicas will be scheduled over multiple nodes.

To enable communication between the jobs, we will use another Toolkit feature: Internal DNS.

Training using replicas is only available on:

  • DGX V100

  • DGX A100

  • DGX H100 (Superpod only)

Specificities of replica jobs

  • They are scheduled together: if your job needs 2 DGXs to run, it will have to wait for 2 of them to be available before starting.

  • They may not all start at exactly the same time, which means the client might start before the server; a connection retry mechanism should therefore be implemented in the client (see the sketch after this list).

  • They have higher scheduling priority.

  • They are preemptable jobs and are preempted together.

  • If 1 job FAILS, the other jobs will be CANCELLED. You can use the logs of the FAILED job to debug the issue.

  • If 1 job is CANCELLED, then the other jobs are CANCELLED as well.
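
A minimal sketch of such a retry loop in Python, assuming the MASTER_ADDR and MASTER_PORT environment variables described in the Implementation example section below (everything else here is illustrative):

import os
import socket
import time

def connect_to_master(max_attempts=30, delay=5.0):
    """Connect to the master replica, retrying while it may still be starting."""
    addr = os.environ["MASTER_ADDR"]       # DNS entry of the rank 0 job
    port = int(os.environ["MASTER_PORT"])  # value of --replicas-master-port
    for _ in range(max_attempts):
        try:
            return socket.create_connection((addr, port), timeout=10)
        except OSError:
            # The master replica may not be up yet: wait and retry.
            time.sleep(delay)
    raise RuntimeError(f"could not reach {addr}:{port} after {max_attempts} attempts")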

Preemption of replica jobs

  • Replica jobs generally run with higher priority but may be preempted if a replica job is scheduled on a node reserved for another account.

Case 1: When incoming job is non-replica job

  1. Nodes n1 and n2 are reserved for account_A, node n3 is reserved for account_B, and n4 is unreserved.

  2. account_A starts a replica job that runs on 4 nodes (n1,n2,n3,n4).

  3. account_B starts a new non-replica job.

  4. The replica job of account_A is preempted (all replicas together). The account_B job will be scheduled on node n3.

  5. Any other remaining jobs that can be scheduled on n1, n2, and n4 will be scheduled.

  6. account_A's replica job will wait for 4 nodes to become available for scheduling. It has higher scheduling priority once nodes are free.

Case 2: When incoming job is replica job

  1. Nodes n1 and n2 are reserved for account_A, node n3 is reserved for account_B, and n4 is unreserved.

  2. account_A has successfully started a replica job that runs on 4 nodes (n1,n2,n3,n4).

  3. account_B starts a new replica job with replica size 2.

  4. account_A's replica job will not be preempted: the incoming replica job of account_B needs 2 nodes, but account_B has only one reserved node (n3).

  5. Any other remaining jobs will be scheduled on n3.

  6. account_B's replica job will wait for 2 nodes to become available for scheduling. It has higher scheduling priority once nodes are free.

Build images

Example repository: https://code.devsnc.com/ai-toolkit/example-job-multinode

This repository contains a simple client.py (worker node) and server.py (master node). The server echoes back the number sent by the client. You can build and push the Docker image with the commands below. Please also check the environment variables used in main.py.

export ACCOUNT_ID=$(eai account get --field id)
export IMAGE=registry.console.elementai.com/$ACCOUNT_ID/multinode-example
docker buildx build --platform linux/amd64 -t $IMAGE .
docker push $IMAGE

Implementation example

The command below runs 8 instances (--replicas=8) of $IMAGE. The port used for communication is set with --replicas-master-port=8080.

  • The value passed to --replicas-master-port is accessible as the MASTER_PORT environment variable inside the job.

  • The address of the master job (the rank zero job) is available as MASTER_ADDR inside the job. Internally, this is the DNS entry of the rank 0 job.

  • The rank of each job is available as the RANK environment variable inside the job.

  • Please refer to the code in the example repository to see how these variables are used (https://code.devsnc.com/ai-toolkit/example-job-multinode/blob/main/main.py).

  • Each replica will have a rank from 0 to 7, as --replicas=8 is used in the command below. The jobs can be scheduled on the same node or on different nodes, depending on the availability of cluster resources.

$ eai job new --replicas=8 --image=ubuntu --cpu 1 --replicas-master-port=8080 -- python main.py
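
As an illustration (this is not the repository's main.py, and it assumes torch is installed in the image), a job could consume these variables with PyTorch. With init_method="env://", torch.distributed reads MASTER_ADDR, MASTER_PORT, RANK, and WORLD_SIZE directly from the environment:

import torch.distributed as dist

# MASTER_ADDR, MASTER_PORT and RANK are injected by Toolkit; WORLD_SIZE is
# assumed to be set as well (see the hostfile workaround further below).
# With init_method="env://", torch.distributed picks them all up automatically.
dist.init_process_group(backend="gloo", init_method="env://")  # gloo also works on CPU-only jobs

rank = dist.get_rank()              # 0..7 for --replicas=8
world_size = dist.get_world_size()
print(f"replica {rank} of {world_size} is up")

dist.destroy_process_group()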

If users want to schedule jobs on different nodes (multi-node jobs), they must specify the exact resources along with --infiniband and --gpu-model-filter (A100, V100, or H100 (Superpod only)). Failing to specify these can result in unexpected behavior.

$ eai job new --replicas=8 --image=ubuntu --cpu 100 --gpu 8 --gpu-mem=80 --preemptable --gpu-model-filter=A100 --infiniband --replicas-master-port=8080 -- python main.py

Or using the CLI directly:

$ export IMAGE="registry.console-dev.elementai.com/66e42857-5dce-4405-9ea4-4601f958840a/python-multinode"
$ eai job new --replicas=4 --image=$IMAGE --gpu 8 --gpu-mem=80 --gpu-model-filter=A100 --replicas-master-port=8080 --infiniband -- python main.py

CLI examples

# View replica jobs
$ eai job ls --replica-jobs
# or
$ eai replica-jobs ls

# Kill replica jobs
$ eai job kill {ANY_OF_JOB_ID}
# or
$ eai replica-jobs kill {replicaGroupId}

The listing commands above return the replicaGroupId, which is used internally to identify the replica jobs belonging to a particular group. We can get more information about a replica job group by using:

$ eai replica-jobs get {replicaGroupId}

Jobspec example

In this case, 4 jobs will start, using 4 different nodes and a total of 32 GPUs.

image: registry.console.elementai.com/$ACCOUNT_ID/multinode-example
interactive: false
command:
- python
- main.py
resources:
  gpu: 8
replicas: 4
options:
  internal-dns:
    name: ""
    ports:
    - port: 8080
      target-port: 8080
      protocol: TCP

MPI frameworks like DeepSpeed

MPI tools like DeepSpeed require a hostfile that lists all jobs/nodes which are part of the replica jobs.

DeepSpeed and OpenMPI use a specific format for this hostfile, as described in the OpenMPI documentation:

node0 slots=2 max_slots=8
node1 slots=2 max_slots=8

Toolkit doesn't provide this file automatically, and variables like slots or max_slots may need to be customized on your side. As a workaround, this command will generate the file for you at /tmp/hostfile:

seq -f $(echo $MASTER_ADDR | grep -oE ".+-")'%g slots='$(nvidia-smi -L | wc -l) 0 $(( $WORLD_SIZE - 1 )) > /tmp/hostfile

Example of usage:

$ WORLD_SIZE=2
$ MASTER_ADDR=dns-ff1098a5-752f-4423-9bc0-45494a924c37-0
$ seq -f $(echo $MASTER_ADDR | grep -oE ".+-")'%g slots='$(nvidia-smi -L | wc -l) 0 $(( $WORLD_SIZE - 1 ))
dns-ff1098a5-752f-4423-9bc0-45494a924c37-0 slots=8
dns-ff1098a5-752f-4423-9bc0-45494a924c37-1 slots=8
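
Equivalently, a minimal Python sketch that writes the same hostfile, assuming the MASTER_ADDR and WORLD_SIZE variables shown above and that every replica sees the same number of GPUs:

import os
import subprocess

master = os.environ["MASTER_ADDR"]   # e.g. dns-<group-id>-0
prefix = master.rsplit("-", 1)[0]    # drop the trailing rank suffix
world_size = int(os.environ["WORLD_SIZE"])

# One slot per local GPU, counted the same way as `nvidia-smi -L | wc -l`.
gpus = len(subprocess.check_output(["nvidia-smi", "-L"], text=True).splitlines())

with open("/tmp/hostfile", "w") as f:
    for rank in range(world_size):
        f.write(f"{prefix}-{rank} slots={gpus}\n")

The resulting file can then be passed to DeepSpeed, for example via its --hostfile option.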

Old method using process-agent

Multinode jobs are now the recommended way to do multinode training. Previously, this was done with the process-agent; you can still find information about it here:

https://github.com/ServiceNow/toolkit-infiniband-example