Jobs

The job is the workload unit that is submitted to and tracked by the Toolkit. It runs as a Docker container in a Kubernetes cluster. This section describes how jobs are submitted to the Toolkit, the impact the various flags have on the way they are handled, and how to track them.

A job can be divided into two parts:

  • the specification: values provided to describe it

  • the state information: info that is updated as the job progresses

Job Specification

The job specification is the start of it all. It is the set of parameters that describe the job. These parameters can be conceptually divided into four groups:

Execution Context

What has to be done by the job, in which context: Command, Environment Variables, Image, Data, Workdir and Network Isolation

Resource Request

What resources are required by the job: GPU, CPU, Memory, GPU Memory, GPU Tensor Cores, CUDA Version, GPU Model Filter, CPU Model Filter, InfiniBand, Replicas, X11 and Internal DNS

Scheduling Options

How should the scheduler handle the job: Restartable, Preemptable, Interactive, Bid and Maximum Run Time

Identification

How should the job be named or labeled: Name and Tags

The simplest way to submit a new job spec is through the CLI command eai job new. Individual attributes of the job specification can be set using specific eai job new parameters. The -f/--file parameter allows submitting a job spec as a YAML file, for which individual attributes can be overridden (e.g. eai job new -f file.yml -e A=B).

A YAML file can be generated by using the --dry-run parameter (e.g. eai job new --dry-run -- /bin/bash -c "env; ls; echo done" > file.yml).
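The generated file can then be edited and resubmitted with -f. As a rough sketch of what such a spec can look like, assembled from the fields documented below (the exact fields emitted by --dry-run may differ, and all values here are placeholders):

# illustrative job spec sketch; values are placeholders
image: ubuntu:18.04
command:
- /bin/bash
- -c
- env; ls; echo done
environmentVars:
- VAR=value
cpu: 1
mem: 1
gpu: 0
preemptable: false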

Execution Context

The following fields are the ones required to describe what a job has to do, and how to do it.

Command

The command is what will be executed in the container. If left empty, the image command (or build-time entrypoint + args) is executed.

Note

Unlike docker run, specifying a command will override the image entrypoint.

new: remaining args
  eai job new -- /bin/bash -c "env; ls; echo done"

yaml: command, list of strings
  command:
  - /bin/bash
  - -c
  - env; ls; echo done

Environment Variables

Environment variables can be given to a job.

The default is empty, although some environment variables are injected automatically in the container.

If the same variable is set more than once, the last assignment wins.

new: -e <string> / --env <string>
  eai job new -e VAR=value
  eai job new --env VAR=value
  eai job new --env VAR="value with whitespace"
  eai job new -e VAR1=xxx -e VAR2=yyy

yaml: environmentVars, list of strings
  environmentVars:
  - VAR1=value
  - VAR2="value with whitespace"
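As noted above, when the same variable is assigned more than once, the last assignment wins. For instance, in the following sketch the container sees VAR=zzz (single quotes keep $VAR from being expanded by the submitting shell):

eai job new -e VAR=xxx -e VAR=zzz -- /bin/bash -c 'echo $VAR'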

Image

This option specifies the image that will be used. The default is ubuntu:18.04.

Images can be any publicly available image (e.g. from dockerhub) or one from our internal registries to which you can push your own images.

new: -i <string> / --image <string>
  eai job new -i ubuntu:18.04
  eai job new --image ubuntu:18.04

yaml: image, string
  image: ubuntu:18.04

Data

Data is a primitive to manage any type of data. The set of available operations is described in detail in the Data reference.

A data object can be mounted in the job container. The syntax is {DATA_UUID}/path/to/src:/path/in/container. Note that the data's short name can be used instead of DATA_UUID.

Requirements:

  • the first and second parts of the string (split on :) both need to start with a / (i.e. both must be absolute paths)

  • the source of the volume needs to exist and be a directory. This differs from the default Docker behavior, where Docker would create whatever is missing in the source path.

The default is empty.

new: -d <string> / --data <string>
  eai job new -d {DATA_UUID}/path/to/src:/path/in/container
  eai job new --data {DATA_UUID}/path/to/src:/path/in/container

yaml: data, list of strings
  data:
  - {DATA_UUID}/path/to/src:/path/in/container
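As a sketch, a job can mount one data object for its inputs and another for its outputs. The data UUIDs and paths below are placeholders, and it is assumed here that -d can be repeated, mirroring the list form of the YAML field:

# DATASET_UUID and OUTPUT_UUID are hypothetical data objects
eai job new \
  -d {DATASET_UUID}/images:/inputs \
  -d {OUTPUT_UUID}/runs:/outputs \
  -- /bin/bash -c 'ls /inputs && cp -r /inputs /outputs/copy'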

Workdir

This option sets the directory from which the command will be executed.

The default is empty, which will cause the container to use the workdir defined in the image if any.

Note

Since jobs run as the user who submitted the job (as opposed to root), if the image defines a workdir that is inaccessible to that user (e.g. /root), the job will fail.

new: -w <string> / --workdir <string>
  eai job new -w /path/x
  eai job new --workdir /path/x

yaml: workdir, string
  workdir: /path/x

Network Isolation

Toolkit can block incoming or outgoing traffic for a job. This is useful when you want to run untrusted code.

To enable this feature, you can specify in the YAML job specification file:

in: blocks incoming traffic. Example: access to https://jobID.job.console.elementai.com, or access to the job from another job, is blocked.
  options:
    alphabits:
      network-isolation: in

out: blocks outgoing traffic. Example: any request made from the job is blocked.
  options:
    alphabits:
      network-isolation: out

all: blocks both incoming and outgoing traffic.
  options:
    alphabits:
      network-isolation: all

Note

It is still possible to execute commands inside the job with eai job exec.
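Putting it together, a minimal spec sketch for running untrusted code with all traffic blocked (the script name is a placeholder; the other fields are documented in this section):

# sketch only: ./run_untrusted.sh is a hypothetical script
image: ubuntu:18.04
command:
- /bin/bash
- -c
- ./run_untrusted.sh
options:
  alphabits:
    network-isolation: all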

Resource Request

Resource requests are one of the core concepts of Toolkit. Each job must declare how many resources it will need when it runs, and the cluster will allocate exactly what is requested when the job runs.

Note

The Toolkit scheduler strategy is resource-based fair scheduling. Read more about it here.

Tip

Looking at the resource usage of your jobs may help you better determine how many resources they need. This is why Toolkit provides a per-job Grafana dashboard where you can look at it. See How to Monitor a Job Resources Usage.

See below for more information about what happens when a job goes over its requested resource limits.

The resource-related parameters are described below.

CPU

The amount of CPU cores needed by the job. It is a floating point value.

The job will not be allowed to use more than the declared amount of CPU cores (it will be throttled if it tries to go above).

The default is 1.

new: --cpu <float>
  eai job new --cpu 2

yaml: cpu, float
  cpu: 2

Memory

The amount of memory needed by the job, as an integer number of gigabytes.

If the job tries to allocate more memory than the requested amount, allocation will fail in most cases.

If a program ignores those failed allocations, it will be terminated by the Out Of Memory killer via a SIGKILL (OOMKilled).

The default is 1.

new: --mem <int>
  eai job new --mem 2

yaml: mem, int
  mem: 2
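To see the OOMKilled behaviour described above, here is a sketch that requests 1GB and then deliberately exhausts memory inside the container (tail /dev/zero buffers an endless stream and grows until the limit is hit):

eai job new --mem 1 -- /bin/bash -c 'tail /dev/zero'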

GPU

The number of GPUs needed by the job. The GPU ids assigned to the job are numbered from 0 to n - 1, where n is the number of requested GPUs.

A minimum amount of GPU memory that must be available on the GPUs where the job will run can also be requested (see GPU Memory below). The current GPUs have between 12GB and 32GB of RAM depending on their model.

By default it is unset, meaning that GPU jobs will be allowed to run on GPUs with any amount of GPU RAM (with a minimum of 12GB currently).

new: --gpu <int>
  eai job new --gpu 2

yaml: gpu, int
  gpu: 2
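To check which GPUs a job was assigned from inside the container, one option is to list them. This assumes nvidia-smi is available in the job container, which is typically the case for GPU jobs but is not guaranteed by this page:

eai job new --gpu 2 -- nvidia-smi -L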

GPUs can be required to have some specific features. These are documented below.

GPU Memory

The memory available on GPU boards currently varies between 12 and 32GB. It is possible to request a minimum amount of GPU memory. Keep in mind that adding restrictions to a job will make it harder to schedule.

The default is not to care.

new: --gpu-mem <int>
  eai job new --gpu 1 --gpu-mem 16

yaml: resources.gpuMem, int, >= 0, <= 32
  resources:
    gpuMem: 16

GPU Tensor Cores

The Volta architecture has what NVIDIA calls tensor cores, which perform FP16 computations much more efficiently. It is possible to require the availability of this feature.

The default is not to care.

new: --gpu-tensor-cores
  eai job new --gpu 1 --gpu-tensor-cores

yaml: resources.gpuTensorCores, boolean
  resources:
    gpuTensorCores: true

CUDA Version

Some images require a specific CUDA version, and some GPUs/drivers support only a subset of CUDA versions. This parameter indicates a specific CUDA version requirement.

The default is not to care.

new: --cuda-version <float>
  eai job new --cuda-version 9.2

yaml: resources.gpuCudaVersion, float
  resources:
    gpuCudaVersion: 9.2

CPU Model Filter

It is possible to restrict the CPU model on which a job will run by specifying one or more CPU model filters. The filtering is done by case insensitive substring match. If multiple CPU model filters are specified, they must all be satisfied. Matching can also be negated by starting the match string with a !. The default is no restriction on CPU model (i.e. an empty filter list).

As an example, a filter of 6154 would select any CPU with “6154” in their model name (like “intel(r) xeon(r) gold 6154 cpu @ 3.00ghz”), while !6154 would avoid all CPUs with “6154” in their model name.

Available CPU models can be found on the cluster overview dashboard.

new: --cpu-model-filter <string>
  eai job new --cpu-model-filter 6154
  eai job new --cpu-model-filter '!6154'

yaml: resources.cpuModel, string
  resources:
    cpuModel: 6154
  resources:
    cpuModel: '!6154'

Important

The ! is a reserved character in both the shell and YAML. Wrapping strings that start with ! in single quotes avoids problems.

GPU Model Filter

It is possible to restrict the GPU model on which a job will run by specifying one or more GPU model filters. The filtering is done by case insensitive substring match. If multiple GPU model filters are specified, they must all be satisfied. Matching can also be negated by starting the match string with a !. The default is no restriction on GPU model (i.e. an empty filter list).

As an example, a filter of V100 would select any GPU with “V100” in their model name (like “tesla v100-sxm2-32gb”), while !V100 would avoid all GPUs with “V100” in their model name.

Available GPU models can be found on the cluster overview dashboard.

new: --gpu-model-filter <string>
  eai job new --gpu 1 --gpu-model-filter V100
  eai job new --gpu 1 --gpu-model-filter '!V100'

yaml: resources.gpuModel, string
  resources:
    gpuModel: V100
  resources:
    gpuModel: '!V100'

Important

The ! is a reserved character in both the shell and YAML. Wrapping strings that start with ! in single quotes avoids problems.

InfiniBand

Toolkit can provide InfiniBand devices to a job using the flag --infiniband. The devices will be mounted in the job under /dev/infiniband/ and the job will be granted the IPC_LOCK capability.

It is important to request the full node resources (8 GPUs) when you want to use this feature, because only one job per node can use the InfiniBand devices.

new: --infiniband
  eai job new --infiniband

yaml: options.infiniband, boolean
  options:
    infiniband: true/false
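Since only one job per node can use the InfiniBand devices, the flag is normally combined with a full-node GPU request, for example:

eai job new --gpu 8 --infiniband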

Replicas

Toolkit allows running replicas of your main job; more information can be found on the Large Model Training page.

  • --replicas= is the number of replicas you want for your job

  • --replica-master-port= is a shortcut for Internal DNS port.

Other mandatory options for replica jobs:

  • --infiniband

  • --gpu

  • --gpu-mem=

  • --gpu-model-filter= (only V100 (32GB), A100 (80GB) and H100 (80GB) are supported)

More information on Internal DNS.

new:
  --replicas=
  --replica-master-port
  eai job new --infiniband --gpu 8 --gpu-mem=80 --gpu-model-filter=A100 --replicas=4 --replica-master-port=8080

yaml:
  resources.replicas, int
  options.internal-dns, structure

  resources:
    replicas: 4
  options:
    internal-dns:
      name: ""
      ports:
        - port: 8080
          target-port: 8080
          protocol: TCP
  environmentVars:
  - MASTER_PORT=8080

X11

Toolkit can start a sidecar container that provides an X11 display to the main job container. It can be GPU-accelerated or not. The DISPLAY environment variable will be set automatically in the main job container to point to the X Server running in the sidecar.

Note

GPU Accelerated Xorg Server

There are implicit requirements and side-effects when running a GPU accelerated X11 sidecar:

  • At least 1 GPU must be requested

  • For OpenGL rendering to work, the container image must have the NVIDIA OpenGL Vendor Neutral Dispatch library installed. The easiest way to do so is by basing the container image on nvidia/opengl:*-glvnd-*. A list of available images is available here. More information regarding this setup is available on the project's GitHub repo: https://github.com/NVIDIA/libglvnd

  • The X Server runs on GPU 0.

  • The screen resolution is currently fixed to 1280x960.

In the YAML job specification, setting options.sidecars.x11.gpu to false will set up the CPU version, while true enables the GPU-accelerated X11 sidecar.

new: --x11-cpu / --x11-gpu
  eai job new --x11-cpu
  eai job new --gpu 1 --x11-gpu

yaml: options.sidecars.x11.gpu, boolean
  options:
    sidecars:
      x11:
        gpu: true/false

Internal DNS

Toolkit can create a custom DNS name for your job, which allows other jobs in your organization to easily make requests/connections to your job.

options.internal-dns specifies the chosen DNS name and a list of ports to forward to your job:

  • the name of the DNS is required and needs to follow RFC 1035, as enforced by Kubernetes

  • at least one port is required

  • two entries in ports cannot use the same port number, but the same target-port can be used multiple times

new: (no dedicated flag; --replica-master-port under Replicas is a shortcut for the Internal DNS port)

yaml: options.internal-dns, structure
  options:
    internal-dns:
      name: mydns
      ports:
      - port: 80
        target-port: 8080
      - port: 5432
        target-port: 5432
        protocol: TCP

Using the previous example, another job in the same organization can reach it with:

curl http://mydns

The request will be forwarded to the job on port 8080. Similarly, a connection to mydns:5432 will be forwarded to the job on port 5432.
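The port rules above also allow several exposed ports to forward to the same target port on the job; here is a sketch (the name and port numbers are illustrative):

options:
  internal-dns:
    name: mydns
    ports:
    - port: 80
      target-port: 8080
    - port: 8080
      target-port: 8080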

Scheduling Options

Options in the job specification affect the way the scheduler will handle the job.

Bid

The bid value is an integer greater or equal to 0.

This parameter allows users to control the order in which the jobs are scheduled within an account: jobs with a higher bid go first amongst all jobs for that account, and for jobs with equal bid values within the account, the oldest are considered first. (FIFO)

Note: This parameter only affects the ordering of jobs within the account's job queue. In other words, setting a high bid on a job doesn't grant it any priority over another account's jobs. (See The Job vs. the other Jobs: The Scheduler for details.)

The default is None. (which is handled as 0)

new: -b <int> / --bid <int>
  eai job new --bid 42
  eai job new -b 42

yaml: bid, int, >= 0
  bid: 42

Interactive

Interactive jobs have the highest priority and will preempt as many jobs as needed, no matter what the current user share of the cluster is (See The Job vs. the other Jobs: The Scheduler for details)

The default is false.

new: -I / --interactive
  eai job new -I
  eai job new --interactive

yaml: interactive, boolean
  interactive: true

Maximum Run Time

A new job can be submitted with a maximum run time in seconds, after which it will be forcefully terminated.

The default is 172800 (48h) for non-preemptable jobs and 0, which means no maximum, for preemptable and interactive ones.

new: --max-run-time <int>
  eai job new --max-run-time 3600

yaml: maxRunTime, int
  maxRunTime: 3600

Preemptable

A preemptable job is a job that is allowed to be interrupted by the scheduler when resources are required for other jobs.

Submitting preemptable jobs allows the scheduler to enforce a fairness policy with regards to how the cluster resources are shared between all users. For this reason, there are many more resources available to run such jobs as opposed to non-preemptable ones. (See The Job vs. the other Jobs: The Scheduler for details.)

The default is false.

new: --preemptable
  eai job new --preemptable

yaml: preemptable, boolean
  preemptable: true

Restartable

Restartable jobs are implicitly preemptable, but when they are preempted, the scheduler adds them back to the queue to be started again when conditions allow it.

See discussion about checkpointing and idempotence at Jobs Development Pro Tips.

The default is false.

new: --restartable
  eai job new --restartable

yaml: restartable, boolean
  restartable: true

Identification

There are two different kinds of information that can be assigned to jobs: a name and tags. These can then be used to ask for matching job(s).

Name

The name is a single string value with a maximum length of 255 characters.

The default is empty.

new: --name
  eai job new --name "exp_1_attempt_39"

yaml: name, string
  name: exp_1_attempt_39

Tags

A job can have a list of tags.

The default is the empty array [].

new: --tag [stringArray]
  eai job new --tag tag_1=value_1 --tag tag_2=value_2

yaml: tags, list of key/value pairs
  tags:
  - key: key_1
    value: value_1

State Information

As soon as a job spec is submitted successfully, it becomes a job. A Job has all the job spec attributes and more. eai job info is the simplest way to get the job and display it.

A job moves through states according to the following lifecycle. Each state is described below.

[Figure: job lifecycle state diagram (Job_Lifecycle.svg)]

QUEUING: The job is in the Toolkit. If the cluster is full, the job will stay in this state. Because of "fair scheduling", even though you may see room for a particular job, it may stay in this state because some other jobs have higher priority. (See The Job vs. the other Jobs: The Scheduler for more details.)

QUEUED: Toolkit has pushed the job to the Kubernetes part of the cluster, which means that it will be scheduled soon. When a job is in this state, consult the stateInfo for more details on what is happening on the Kubernetes side.

RUNNING: The job is running on the cluster.

SUCCEEDED: The job finished running normally. (Its exit code is 0.)

FAILED: The job failed. The failure is either due to a bad job spec (in which case there is no exit code in the job info) or to the command failing with a non-0 exit code.

INTERRUPTED: The job finished running due to either a problem related to the cluster or to scheduling preemption by the Toolkit.

CANCELLING: The job is pending cancellation.

CANCELLED: The job finished running due to cancellation.

As a job advances through the lifecycle, attributes are updated. Some are set once and for all:

name (assigned when): description

alive (QUEUING): indicates if the job is still active or not. It depends on the current state (see lifecycle above).

id (QUEUING): unique id assigned to the job at creation time.

createdBy (QUEUING): user who created the job.

createdOn (QUEUING): time at which the job was created.

runs (QUEUING): list of state information. There is an entry for each incarnation of the job in the cluster, in chronological order. (See below.)

state (QUEUING): one of QUEUING, QUEUED, RUNNING, SUCCEEDED, CANCELLING, CANCELLED, INTERRUPTED or FAILED.

stateInfo (QUEUING, QUEUED): extra state information for the current state, e.g. in the QUEUED state, the image might be being pulled or the container created.

A job can have multiple “incarnations” in the cluster, each of these being tracked by a Run.

  • there is at least one run in a job

  • only the last run can be alive

  • a new run is added every time a job enters the QUEUING state

Each Run in the runs list above can have the following attributes:

name (assigned when): description

cancelledOn (CANCELLED): time at which the job was cancelled.

cancelRequestOn (CANCELLING): time at which cancellation was requested.

createdOn (QUEUING): time at which the run was created.

endedOn (SUCCEEDED, FAILED, INTERRUPTED, CANCELLED): time at which the job stopped being alive.

id (QUEUING): unique id assigned to the run when it is created.

info (RUNNING): if the job spec required gpu(s), their indices will be found here in an array under the gpus key and their GPU uuids in an array under the gpu_uuids key.

ip (RUNNING): ip address the job was assigned.

jobId (QUEUING): id of the job to which this run belongs.

queuedOn (QUEUED): time at which the job got to the QUEUED state.

startedOn (RUNNING): time at which the job started RUNNING.

Launch Delay

Whenever a job requesting an image is scheduled on a node which does not already have that image, the job will take longer to start because the node needs to pull the image first. The stateInfo field will be updated accordingly while the job is in the QUEUED state.

The Job vs. the other Jobs: The Scheduler

The goal of the Toolkit is to assign available resources to requesting jobs. The resource assignment process and logic is the subject of this section.

Job Priority

Interactive Jobs

All interactive jobs are handled first, in decreasing bid order and then in creation order (FIFO). Users each have a quota of one (1) interactive job. If a user already has a running interactive job, further ones will fail with a stateInfo to that effect. Preemptable jobs will be preempted to make room for interactive jobs, since they are meant to be started as fast as possible.

There is no specified limit to the time an interactive job can run and to help with the cluster management, interactive jobs are kept together on the same nodes as much as possible.

Non-Interactive Jobs

The job ordering is first based on the account occupancy. The occupancy is computed as the current share of the cluster an account is actually using across all its jobs.

All QUEUING jobs are considered in increasing order of their account occupancy. Accounts with no jobs in the cluster at all are thus considered first, followed by jobs submitted by accounts in order of increasing occupancy.

Amongst the jobs for accounts of equal occupancy, the jobs with the highest bid are considered first, and amongst those with equal bid, the oldest jobs are considered first (FIFO).

Job Queues

Jobs are classified into distinct queues depending on the resources they ask for. If a CPU-only job cannot fit, no more CPU-only jobs are considered. The same goes for single-GPU jobs and multi-GPU jobs. Job preemptability is also factored in.

Preemption

Jobs that are marked as being preemptable can be interrupted at any time in order to make room for jobs from other users when the occupancy based criterion is respected: jobs from a user with a higher occupancy will be preempted in order to make room for jobs from a user with lower occupancy. (Restartable jobs are implicitly considered to be preemptable).

A preempted restartable job will return to the QUEUING state automatically when it is interrupted and be rescheduled when cluster usage allows it. This means that a single job's processing might span multiple runs, and so will its logs.

A preempted NON-restartable job will end up in INTERRUPTED state.

The amount of resources available for non-preemptable jobs is much smaller than for the preemptable ones since it is much easier to share resources between preemptable jobs.

Making your jobs restartable/preemptable will make them run earlier, although they might be interrupted (and maybe restarted) if their resources are required to run jobs for users with lower occupancy.

Jobs Development Pro Tips

Checkpointing

Because there is a wide range of reasons why jobs can be interrupted at any given time, your jobs should create checkpoints from which they can resume their task whenever they get rescheduled. Once done, you should set the restartable flag on those jobs.
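Here is a sketch of such a job: checkpoints are kept on a mounted data object so that a later run can resume from them. The script name, its flags and the data object are hypothetical:

# train.py and its --checkpoint-dir/--resume flags are hypothetical
eai job new --restartable --gpu 1 \
  --data {DATA_UUID}/checkpoints:/checkpoints \
  -- /bin/bash -c 'python train.py --checkpoint-dir /checkpoints --resume'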

Idempotence

An important quality of a well defined job is idempotence. Basically, it means that given the same input, a task will produce the same output. Implicitly, it means that an idempotent job will:

  • Always produce the same output from the same input

  • Put all temporary output in a private place

  • Not delete files that it didn’t create itself

  • Not overwrite any files (or atomically overwrite files with exactly the same content)

  • Not depend upon an unseeded random number

  • Not depend upon the system time

  • Not depend upon the value of Toolkit environment variables

  • Not depend upon external APIs, data or processes which aren’t versioned and idempotent themselves

Most machine learning training tasks can be trivially made idempotent, and there are tricks that can be used to ensure that even tasks that use a checkpoint are effectively idempotent.
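As an illustration, here is a sketch of a submission that pins the random seed through an environment variable and writes its output to a run-specific, private path; the script, the SEED variable and the paths are hypothetical:

# train.py, SEED and the output layout are hypothetical
eai job new --restartable \
  -e SEED=1234 \
  --data {DATA_UUID}/results:/results \
  -- /bin/bash -c 'python train.py --seed "$SEED" --output /results/exp_seed_1234'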