Large Model Training

Training a large model can require more than 8 GPUs, which is the maximum number of GPUs available on one node. To support this, Toolkit allows users to run multiple copies (replicas) of a single job in parallel, on the same node or on different nodes. These are referred to as replica jobs. To use them, you need to enable:

  • InfiniBand devices to accelerate the communication between the GPUs.

  • Replicas option and its requirements.

Toolkit will create one master job and one or more replicas depending on your job specification. If you request 4 GPUs per job, 2 replicas can start on the same node, since a DGX has 8 GPUs. If you request more than 4 GPUs per job, the replicas will be scheduled across multiple nodes.
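
The packing rule above can be sketched as a small calculation. This is only an illustration of the arithmetic, not Toolkit's scheduler: the function names are hypothetical, and the node count is a lower bound that assumes replicas are packed as tightly as possible (the scheduler may still spread them over more nodes).

```python
GPUS_PER_NODE = 8  # one DGX node has 8 GPUs, as stated above


def replicas_per_node(gpus_per_job: int) -> int:
    """Return how many replicas of a job fit on a single 8-GPU node."""
    if not 1 <= gpus_per_job <= GPUS_PER_NODE:
        raise ValueError("a job must request between 1 and 8 GPUs")
    return GPUS_PER_NODE // gpus_per_job


def min_nodes(replicas: int, gpus_per_job: int) -> int:
    """Return the minimum number of nodes needed, assuming ideal packing."""
    per_node = replicas_per_node(gpus_per_job)
    return -(-replicas // per_node)  # ceiling division


print(replicas_per_node(4))  # 2 replicas share one node
print(min_nodes(4, 8))       # 4 nodes for 4 replicas of 8 GPUs each
```

With more than 4 GPUs per job, replicas_per_node returns 1, which is why such replicas end up on separate nodes.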

To enable communication between the jobs, we use another Toolkit feature, the Internal DNS.

Training using replicas is only available on:

  • DGX V100

  • DGX A100

  • DGX H100 (Superpod only)

Specifics of replica jobs

  • They are scheduled together: if your job needs 2 DGXs to run, it will have to wait until 2 of them are available before it starts.

  • They may not all start at exactly the same time, so the client might start before the server. The client should therefore implement a connection retry mechanism.

  • They have higher priority.

  • They are non-preemptable.

  • If 1 job FAILS, the others stay alive. Toolkit does not currently kill the other jobs automatically, so the user needs to take care of it.

  • If 1 job is CANCELLED, then the other jobs are CANCELLED as well.
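
Because replicas may start at slightly different times, a client can retry its connection until the server is reachable. The following is a minimal sketch using only the Python standard library; the function name, the number of attempts, and the delay are assumptions chosen for illustration and are not taken from the example repository.

```python
import socket
import time


def connect_with_retry(host, port, attempts=30, delay=2.0):
    """Open a TCP connection, retrying while the server is not reachable yet."""
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            # Raises OSError (refused connection, timeout, or a DNS name
            # that cannot be resolved yet) until the server is listening.
            return socket.create_connection((host, port), timeout=5.0)
        except OSError as exc:
            last_error = exc
            if attempt < attempts:
                time.sleep(delay)  # wait before the next attempt
    raise ConnectionError(
        f"could not reach {host}:{port} after {attempts} attempts"
    ) from last_error
```

A client would typically call this with the master address and port of the replica group, then use the returned socket as usual. If the server never comes up, the final ConnectionError lets the job exit instead of hanging forever.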

Build images

Example repository: https://code.devsnc.com/ai-toolkit/example-job-multinode

This repository contains a simple client.py (worker node) and server.py (master node). The server echoes back the number sent by the client. You can build the Docker image by running the commands below. Please check the environment variables used in main.py.

export ACCOUNT_ID=$(eai account get --field id)
export IMAGE=registry.console.elementai.com/$ACCOUNT_ID/multinode-example
docker buildx build --platform linux/amd64 -t $IMAGE .
docker push $IMAGE

Implementation example

The command below runs 8 instances (--replicas=8) of $IMAGE. The port used for communication is set with:

--replicas-master-port=8080
  • The value passed in --replicas-master-port is accessible as the “MASTER_PORT” environment variable inside the job.

  • The address of the master job (the rank 0 job) is available as “MASTER_ADDR” inside the job. Internally, this is the DNS entry of the rank 0 job.

  • The rank of each job is available as “RANK” inside the job.

  • Please refer to the code in the example repository to see how to use these variables (https://code.devsnc.com/ai-toolkit/example-job-multinode/blob/main/main.py).

  • Each replica has a rank from 0 to 7, since --replicas=8 is used in the command below. The jobs can be scheduled on the same node or on different nodes, depending on the availability of cluster resources.

$ eai job new --replicas=8 --image=$IMAGE --cpu 1 --replicas-master-port=8080 -- python main.py
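
Inside each job, the three environment variables described above can be read as follows. This is a hedged sketch and not the code of the example repository: the function name and the returned tuple are hypothetical, and the mapping of rank 0 to the server role simply follows the master/worker description given earlier.

```python
import os


def read_replica_env(environ=os.environ):
    """Read the replica settings that Toolkit exposes inside the job."""
    master_addr = environ["MASTER_ADDR"]       # DNS entry of the rank 0 job
    master_port = int(environ["MASTER_PORT"])  # value of --replicas-master-port
    rank = int(environ["RANK"])                # 0 for the master job
    # Assumption: rank 0 acts as the server, every other rank as a client.
    role = "server" if rank == 0 else "client"
    return master_addr, master_port, rank, role
```

Passing a plain dictionary instead of os.environ makes the function easy to try outside a job, for example read_replica_env({"MASTER_ADDR": "localhost", "MASTER_PORT": "8080", "RANK": "0"}).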

If users want to schedule jobs on different nodes (multi-node jobs), they must specify the exact resources along with --infiniband and --gpu-model-filter (A100, V100, or H100 (Superpod)). Failing to specify these can result in unexpected behavior.

$ eai job new --replicas=8 --image=ubuntu --cpu 100 --gpu 8 --gpu-mem=80 --preemptable --gpu-model-filter=A100 --infiniband --replicas-master-port=8080 -- python  main.py

Or using the CLI directly:

$ export IMAGE="registry.console-dev.elementai.com/66e42857-5dce-4405-9ea4-4601f958840a/python-multinode"
$ eai job new --replicas=4 --image=$IMAGE --gpu 8 --gpu-mem=80 --gpu-model-filter=A100 --replicas-master-port=8080 --infiniband -- python main.py

CLI examples

# View replica jobs
$ eai job ls --replica-jobs
# or
$ eai replica-jobs ls

# Kill replica jobs
$ eai job kill {ANY_OF_JOB_ID}
# or
$ eai replica-jobs kill {replicaGroupId}

The view commands above return the replicaGroupId, which is used internally to identify the replica jobs belonging to a particular group. To get more information about a replica group, use:

$ eai job replica-jobs get {replicaGroupId}

Jobspec example

In this case, 4 jobs will start, using 4 different nodes and a total of 32 GPUs.

image: registry.console.elementai.com/$ACCOUNT_ID/multinode-example
interactive: false
command:
- python
- main.py
resources:
  gpu: 8
replicas: 4
options:
  internal-dns:
    name: ""
    ports:
    - port: 8080
      target-port: 8080
      protocol: TCP

Old method using process-agent

Multinode jobs are now the recommended approach for multinode training. Previously we used the process-agent, and you can still find information about it here:

https://github.com/ServiceNow/toolkit-infiniband-example