Jobs
The job is the workload unit that is submitted to and tracked by the Toolkit. It runs as a Docker container in a Kubernetes cluster. This section describes how jobs are submitted to the Toolkit, the impact the various flags have on the way they are handled, and how to track them.
A job can be divided into two parts:
- the specification: the values provided to describe it
- the state information: information that is updated as the job progresses
Job Specification
The job specification is the start of it all. It is the set of parameters that describe the job. Parameters can be conceptually divided into four groups:
- Execution Context
What has to be done by the job, in which context: Command, Environment Variables, Image, Data, Workdir and Network Isolation
- Resource Request
What resources are required by the job: GPU, CPU, Memory, GPU Memory, GPU Tensor Cores, CUDA Version, GPU Model Filter, CPU Model Filter, InfiniBand, Replicas and Internal DNS
- Scheduling Options
How should the scheduler handle the job: Restartable, Preemptable, Interactive, Bid and Maximum Run Time
- Identification
How to identify the job afterwards: Name and Tags
The simplest way to submit a new job spec is through the CLI command eai job new. Individual attributes of the job specification can be set using specific eai job new parameters. The -f/--file parameter allows submitting a job spec as a YAML file whose individual attributes can be overridden (e.g. eai job new -f file.yml -e A=B).
A YAML file can be generated by using the --dry-run parameter (e.g. eai job new --dry-run -- /bin/bash -c "env; ls; echo done" > file.yml).
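The generated file can then be edited and resubmitted with -f/--file. The exact fields and defaults produced by --dry-run may differ; the sketch below only uses field names and default values mentioned elsewhere in this section:

```yaml
# Hypothetical sketch of a generated job spec -- actual --dry-run output may differ.
image: ubuntu:18.04
command:
- /bin/bash
- -c
- env; ls; echo done
cpu: 1
mem: 1
preemptable: false
restartable: false
```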
Execution Context
The following fields describe what a job has to do and how to do it.
Command
The command is what will be executed in the container. If left empty, the image command (or build-time entrypoint + args) is executed.
Note
Unlike docker run, specifying a command will override the image entrypoint.
| method | description | examples |
|---|---|---|
| new | remaining args | |
| yaml | | `command:`<br>`- /bin/bash`<br>`- -c`<br>`- env; ls; echo done` |
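For example, the same command shown in the YAML above can be passed as the remaining arguments to eai job new, as in the --dry-run example earlier:

```bash
eai job new -- /bin/bash -c "env; ls; echo done"
```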
Environment Variables
Environment variables can be given to a job.
The default is empty, although some environment variables are injected automatically in the container.
If the same variable is set more than once, the last assignment wins.
| method | description | examples |
|---|---|---|
| new | | |
| yaml | | `environmentVars:`<br>`- VAR1=value`<br>`- VAR2="value with whitespace"` |
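To illustrate the last-assignment-wins rule, assuming the -e flag shown earlier is used to set variables from the CLI, repeating a variable keeps only the final value (a hypothetical example):

```bash
# VAR ends up as "second" inside the container; the first assignment is discarded.
eai job new -e VAR=first -e VAR=second -- /bin/bash -c 'echo "$VAR"'
```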
Image
This option specifies the image that will be used. The default is ubuntu:18.04.
Images can be any publicly available image (e.g. from Docker Hub) or one from our internal registries, to which you can push your own images.
| method | description | examples |
|---|---|---|
| new | | |
| yaml | | `image: ubuntu:18.04` |
Data
Data is a primitive to manage any type of data. There is a set of operations on data, described in detail in the Data reference.
A data object can be mounted in the job container.
The syntax is: {DATA_UUID}/path/to/src:/path/in/container. Note that the short name for DATA_UUID can be used.
Requirements:
- The first and second parts of the string split on : need to start with a / (both need to be absolute paths).
- The source of the volume needs to exist and be a directory. This is not the default behavior in Docker, which would create whatever is missing in the source path.
The default is empty.
| method | description | examples |
|---|---|---|
| new | | |
| yaml | | `data:`<br>`- {DATA_UUID}/path/to/src:/path/in/container` |
Workdir
This option sets the directory from which the command will be executed.
The default is empty, which will cause the container to use the workdir defined in the image, if any.
Note
Since jobs run as the user who submitted the job (as opposed to root), if the image defines a workdir that is inaccessible to the user (e.g. /root), the job will fail.
| method | description | examples |
|---|---|---|
| new | | |
| yaml | | `workdir: /path/x` |
Network Isolation
Toolkit can block incoming or outgoing traffic for a job. This is useful when you want to run untrusted code.
To enable this feature, specify one of the following values in the YAML job specification file:
| value | description | yaml |
|---|---|---|
| in | Blocking IN traffic. Example: access to https://jobID.job.console.elementai.com or to the job from another job is blocked. | `options:`<br>`  alphabits:`<br>`    network-isolation: in` |
| out | Blocking OUT traffic. Example: any request from the job is blocked. | `options:`<br>`  alphabits:`<br>`    network-isolation: out` |
| all | Blocking IN and OUT traffic. | `options:`<br>`  alphabits:`<br>`    network-isolation: all` |
Note
Even with network isolation enabled, it is still possible to get inside the job with eai job exec.
Resource Request
Resource requests are one of the core concepts of Toolkit. Each job must declare how many resources it will need, and the cluster will allocate exactly what is requested when the job runs.
Note
The Toolkit scheduler strategy is resource-based fair scheduling. Read more about it here.
Tip
Looking at the resource usage of your jobs may help you better determine how many resources they need. This is why Toolkit provides a per-job Grafana dashboard where you can look at it. See How to Monitor a Job Resources Usage.
See below for more information about what happens when a job goes over its requested resource limit.
The resource-related parameters are described below.
CPU
The number of CPUs needed by the job. It is a floating-point value.
The job will not be allowed to use more than the declared number of CPU cores (it will be throttled if it tries to go above).
The default is 1.
| method | description | examples |
|---|---|---|
| new | | |
| yaml | | `cpu: 2` |
Memory
The amount of memory needed by the job, in integer gigabytes.
If the job tries to allocate more memory than the requested amount, the allocation will fail in most cases.
If a program ignores those failed allocations, it will be terminated by the Out Of Memory Killer via a SIGKILL (OOMKilled).
The default is 1.
| method | description | examples |
|---|---|---|
| new | | |
| yaml | | `mem: 2` |
GPU
The number of GPUs needed by the job. The GPU ids assigned to the job are numbered from 0 to n - 1, where n is the number of requested GPUs.
A minimum amount of GPU memory that must be available on the GPUs where the job will run can also be requested (see GPU Memory below). The current GPUs have between 12GB and 32GB of RAM depending on their model. By default it is unset, meaning that GPU jobs will be allowed to run on GPUs with any amount of GPU RAM (with a minimum of 12GB currently).
| method | description | examples |
|---|---|---|
| new | | |
| yaml | | `gpu: 2` |
GPUs can be required to have some specific features. These are documented below.
GPU Memory
The memory available on GPU boards currently varies between 12 and 32GB. It is possible to request a minimum amount of GPU memory. Keep in mind that adding restrictions on a job will make it harder to schedule.
The default is not to care.
| method | description | examples |
|---|---|---|
| new | | |
| yaml | | `resources:`<br>`  gpuMem: 16` |
GPU Tensor Cores
The Volta architecture has what NVIDIA has named Tensor Cores to perform FP16 computations much more efficiently. It is possible to request the availability of this feature.
The default is not to care.
| method | description | examples |
|---|---|---|
| new | | |
| yaml | | `resources:`<br>`  gpuTensorCores: true` |
CUDA Version
Some images require a specific CUDA version, and some GPUs/drivers support only a subset of CUDA versions. This option indicates a specific CUDA version requirement.
The default is not to care.
| method | description | examples |
|---|---|---|
| new | | |
| yaml | | `resources:`<br>`  gpuCudaVersion: 9.2` |
CPU Model Filter
It is possible to restrict the CPU model on which a job will run by specifying one or more CPU model filters. The filtering is done by case-insensitive substring match. If multiple CPU model filters are specified, they must all be satisfied. Matching can also be negated by starting the match string with a !. The default is no restriction on CPU model (i.e. an empty filter list).
As an example, a filter of 6154 would select any CPU with "6154" in its model name (like "intel(r) xeon(r) gold 6154 cpu @ 3.00ghz"), while !6154 would avoid all CPUs with "6154" in their model name.
Available CPU models can be found on the cluster overview dashboard.
| method | description | examples |
|---|---|---|
| new | | |
| yaml | | `resources:`<br>`  cpuModel: 6154`<br><br>`resources:`<br>`  cpuModel: '!6154'` |
Important
The ! is a reserved character in both the shell and YAML.
Wrapping strings that start with ! in single quotes avoids problems.
GPU Model Filter
It is possible to restrict the GPU model on which a job will run by specifying one or more GPU model filters. The filtering is done by case-insensitive substring match. If multiple GPU model filters are specified, they must all be satisfied. Matching can also be negated by starting the match string with a !. The default is no restriction on GPU model (i.e. an empty filter list).
As an example, a filter of V100 would select any GPU with "V100" in its model name (like "tesla v100-sxm2-32gb"), while !V100 would avoid all GPUs with "V100" in their model name.
Available GPU models can be found on the cluster overview dashboard.
| method | description | examples |
|---|---|---|
| new | | |
| yaml | | `resources:`<br>`  gpuModel: V100`<br><br>`resources:`<br>`  gpuModel: '!V100'` |
Important
The ! is a reserved character in both the shell and YAML.
Wrapping strings that start with ! in single quotes avoids problems.
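The same quoting applies on the command line. A hypothetical invocation using the --gpu-model-filter flag mentioned under Replicas below:

```bash
# Avoid all V100 GPUs; the single quotes keep the shell from interpreting the leading '!'.
eai job new --gpu-model-filter='!V100' -- nvidia-smi
```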
InfiniBand
Toolkit can provide InfiniBand devices to a job using the flag --infiniband.
The job will be mounted with the devices in the directory /dev/infiniband/ and with the capability IPC_LOCK.
It is important to use the full node resources (8 GPUs) when you want to use this feature, because only one job per node can use the InfiniBand devices.
| method | description | examples |
|---|---|---|
| new | | |
| yaml | | `options:`<br>`  infiniband: true/false` |
Replicas
Toolkit allows running replicas of your main job; more information can be found on the Large Model Training page.
- --replicas= is the number of replicas you want for your job
- --replica-master-port= is a shortcut for the Internal DNS port
Other mandatory options for replica jobs:
- --infiniband
- --gpu
- --gpu-mem=
- --gpu-model-filter= (only V100 (32GB), A100 (80GB) and H100 (80GB))
More information on Internal DNS.
| method | description | examples |
|---|---|---|
| new | | |
| yaml | | `resources:`<br>`  replicas: 4`<br>`options:`<br>`  internal-dns:`<br>`    name: ""`<br>`    ports:`<br>`    - port: 8080`<br>`      target-port: 8080`<br>`      protocol: TCP`<br>`environmentVars:`<br>`- MASTER_PORT=8080` |
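Putting the flags above together, a replica job submission might look like the sketch below. The flag value syntax and the training command are illustrative only; adapt them to your own setup:

```bash
# Hypothetical 4-replica job on full nodes with InfiniBand and V100 (32GB) GPUs.
eai job new \
  --replicas=4 \
  --replica-master-port=8080 \
  --infiniband \
  --gpu 8 \
  --gpu-mem=32 \
  --gpu-model-filter=V100 \
  -- python train.py
```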
X11
Toolkit can start a sidecar container that provides an X11 display to the main job container. It can be GPU-accelerated or not. The DISPLAY environment variable will be set automatically in the main job container to point to the X server running in the sidecar.
Note
GPU Accelerated Xorg Server
There are implicit requirements and side-effects when running a GPU-accelerated X11 sidecar:
- At least 1 GPU must be requested.
- For OpenGL rendering to work, the container image must have the NVIDIA OpenGL Vendor Neutral Dispatch library installed. The easiest way to do so is by basing the container image on nvidia/opengl:*-glvnd-*. A list of available images is available here. More information regarding this setup is available on the project's GitHub repo: https://github.com/NVIDIA/libglvnd
- The X server runs on GPU 0.
- The screen resolution is currently fixed to 1280x960.
In the YAML job specification, setting options.sidecars.x11.gpu to false will set up the CPU version, while true will enable the GPU-accelerated X11 sidecar.
| method | description | examples |
|---|---|---|
| new | | |
| yaml | | `options:`<br>`  sidecars:`<br>`    x11:`<br>`      gpu: true/false` |
Internal DNS
Toolkit can create a custom DNS name for your job, which allows other jobs in your organization to easily make requests/connections to your job.
options.sidecars.internal-dns is a list of ports to forward from the chosen DNS name to your job:
- the name of the DNS is required and needs to follow RFC 1035, as used by Kubernetes
- at least one port is required
- two elements of ports cannot use the same port number, but the same target-port can be used multiple times
| method | description | examples |
|---|---|---|
| new | | |
| yaml | | `options:`<br>`  internal-dns:`<br>`    name: dns-<accountID>-mydns`<br>`    ports:`<br>`    - port: 80`<br>`      target-port: 8080`<br>`    - port: 5432`<br>`      target-port: 5432`<br>`      protocol: TCP` |
Using the previous example, another job in the same organization will be able to call it with:
curl http://dns-<accountID>-mydns
The request will be forwarded to the job on port 8080.
Note: If dns-<accountID>- is not present in the name, it will be automatically prepended.
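For the forwarding to do anything useful, the job command must listen on the declared target-port. A minimal sketch, assuming an image with Python 3 available:

```yaml
# Hypothetical spec: the job serves HTTP on 8080, which port 80 of the DNS name forwards to.
command:
- python3
- -m
- http.server
- "8080"
options:
  internal-dns:
    name: dns-<accountID>-mydns
    ports:
    - port: 80
      target-port: 8080
```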
Scheduling Options
Options in the job specification affect the way the scheduler will handle the job.
Bid
The bid value is an integer greater than or equal to 0.
This parameter allows users to control the order in which jobs are scheduled within an account: jobs with a higher bid go first amongst all jobs for that account, and for jobs with equal bid values within the account, the oldest are considered first (FIFO).
Note: This parameter only affects the ordering of jobs within the account's job queue. In other words, setting a high bid on a job doesn't grant this job any priority over another account's jobs. (See The Job vs. the other Jobs: The Scheduler for details)
The default is None (which is handled as 0).
| method | description | examples |
|---|---|---|
| new | | |
| yaml | | `bid: 42` |
Interactive
Interactive jobs have the highest priority and will preempt as many jobs as needed, no matter what the current user's share of the cluster is. (See The Job vs. the other Jobs: The Scheduler for details)
The default is false.
| method | description | examples |
|---|---|---|
| new | | |
| yaml | | `interactive: true` |
Maximum Run Time
A new job can be submitted with a maximum run time in seconds, after which it will be forcefully terminated.
The default is 172800 (48h) for non-preemptable jobs and 0, which means no maximum, for preemptable and interactive ones.
| method | description | examples |
|---|---|---|
| new | | |
| yaml | | `maxRunTime: 3600` |
Preemptable
A preemptable job is a job that is allowed to be interrupted by the scheduler when resources are required for other jobs.
Submitting preemptable jobs allows the scheduler to enforce a fairness policy with regards to how the cluster resources are shared between all users. For this reason, there are many more resources available to run such jobs as opposed to non-preemptable ones. (See The Job vs. the other Jobs: The Scheduler for details)
The default is false.
| method | description | examples |
|---|---|---|
| new | | |
| yaml | | `preemptable: true` |
Restartable
Restartable jobs are implicitly preemptable, but when preempted, the scheduler will add them back to the queue to be started again when conditions allow.
See discussion about checkpointing and idempotence at Jobs Development Pro Tips.
The default is false.
| method | description | examples |
|---|---|---|
| new | | |
| yaml | | `restartable: true` |
Identification
There are two different kinds of information that can be assigned to jobs: a name and tags. These can then be used to ask for matching job(s).
Name
The name is a single string value with a maximum length of 255 characters.
The default is empty.
| method | description | examples |
|---|---|---|
| new | | |
| yaml | | `name: exp_1_attempt_39` |
State Information
As soon as a job spec is submitted successfully, it becomes a job. A job has all the job spec attributes and more. eai job info is the simplest way to get the job and display it.
A job goes through the following states as events occur. Each state is described below.
| state | description |
|---|---|
| QUEUING | The job is queued in the Toolkit. If the cluster is full, the job will stay in that state. Because of "fair scheduling", even though you may see room for a particular job, it may stay in that state because some other jobs have higher priority. (See The Job vs. the other Jobs: The Scheduler for more details) |
| QUEUED | Toolkit has pushed the job to the Kubernetes part of the cluster, which means that it will be scheduled soon. When a job is in this state, consult the stateInfo field. |
| | The job is running on the cluster. |
| | The job finished running normally. (Its exit code is 0.) |
| | The job failed. The failure is either due to a bad job spec, in which case there is no exit code in the job info, or the command failed with a non-0 exit code. |
| INTERRUPTED | The job finished running due to either a problem related to the cluster or to scheduling preemption by the Toolkit. |
| | The job is pending cancellation. |
| | The job finished running due to cancellation. |
As a job advances through the lifecycle, attributes are updated. Some are set once and for all:
| name | assigned when | description |
|---|---|---|
| | | indicates if the job is still active or not. It depends on the current state (see lifecycle above) |
| | | unique id assigned to the job at creation time |
| | | user who created the job |
| | | time at which the job was created |
| runs | | list of state information. There is an entry for each incarnation of the job in the cluster, in chronological order. (See below) |
| | | one of the lifecycle states listed above |
| stateInfo | | extra state information for the current state (see Launch Delay below for an example) |
A job can have multiple “incarnations” in the cluster, each of these being tracked by a Run.
- there is at least one run in a job
- only the last run can be alive
- a new run is added every time a job enters the QUEUING state
Each Run in the runs list above can have the following attributes:
| name | assigned when | description |
|---|---|---|
| | | time at which the job was cancelled |
| | | time at which cancellation was requested |
| | | time at which the run was created |
| | | time at which the job stopped being |
| | | unique id assigned to the run when it is created |
| | | if the job spec required gpu(s), their indices will be found here in an array under the |
| | | ip address the job was assigned |
| | | id of the job to which this run belongs |
| | | time at which the job gets to the |
| | | time at which the job started |
Launch Delay
Whenever a job requesting an image is scheduled on a node which does not have the image, the job will take longer to start because the node needs to pull the image first. The stateInfo field will be updated accordingly while the job is in the QUEUED state.
The Job vs. the other Jobs: The Scheduler
The goal of the Toolkit is to assign available resources to requesting jobs. The resource assignment process and logic are the subject of this section.
Job Priority
Interactive Jobs
All interactive jobs are handled first, in decreasing bid order and then in creation order (FIFO). Users each have a quota of one (1) interactive job. If a user already has a running interactive job, further ones will be failed with a stateInfo to that effect. Preemptable jobs will be preempted to make room for interactive jobs, which are to be started as fast as possible.
There is no specified limit to the time an interactive job can run, and to help with cluster management, interactive jobs are kept together on the same nodes as much as possible.
Non-Interactive Jobs
The job ordering is first based on the account occupancy. The occupancy is computed as the current share of the cluster an account is actually using across all its jobs.
All QUEUING jobs are considered in increasing order of their account occupancy. Accounts with no jobs in the cluster at all are thus considered first, followed by jobs submitted by accounts in order of increasing occupancy.
Amongst the jobs for accounts of equal occupancy, the jobs with the highest bid are considered first, and amongst those with equal bid, the oldest jobs are considered first (FIFO).
Job Queues
Jobs are classified in distinct queues depending on the resources they ask for. If a CPU-only job cannot fit, no more CPU-only jobs are considered. The same goes for single-GPU jobs and multi-GPU jobs. Job preemptability is also factored in.
Preemption
Jobs that are marked as preemptable can be interrupted at any time in order to make room for jobs from other users when the occupancy-based criterion is respected: jobs from a user with a higher occupancy will be preempted in order to make room for jobs from a user with lower occupancy. (Restartable jobs are implicitly considered to be preemptable.)
A preempted restartable job will return to the QUEUING state automatically when it is interrupted and be rescheduled when cluster usage allows it. This means that a single job's processing might span multiple runs, as will the logs.
A preempted NON-restartable job will end up in the INTERRUPTED state.
The amount of resources available for non-preemptable jobs is much smaller than for preemptable ones, since it is much easier to share resources between preemptable jobs.
Making your jobs restartable/preemptable will make them run earlier, although they might be interrupted (and maybe restarted) if their resources are required to run jobs for users with lower occupancy.
Jobs Development Pro Tips
Checkpointing
Because there is a wide range of reasons why jobs can be interrupted at any given time, your jobs should create checkpoints from which they can resume their task whenever they get rescheduled. Once done, you should set the restartable flag on those jobs.
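A minimal sketch of a checkpoint-aware job command, assuming a hypothetical training script that accepts --resume and --checkpoint-dir flags and writes its checkpoints to a mounted path:

```bash
#!/bin/bash
# Hypothetical resume logic -- adapt paths and flags to your own training script.
CKPT_DIR=/checkpoints            # e.g. a mounted data object
if [ -f "$CKPT_DIR/latest.pt" ]; then
    # A previous run was interrupted: resume from its last checkpoint.
    python train.py --resume "$CKPT_DIR/latest.pt" --checkpoint-dir "$CKPT_DIR"
else
    # First run of the job: start from scratch.
    python train.py --checkpoint-dir "$CKPT_DIR"
fi
```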
Idempotence
An important quality of a well-defined job is idempotence. Basically, it means that given the same input, a task will produce the same output. Implicitly, it means that an idempotent job will:
- Always produce the same output from the same input
- Put all temporary output in a private place
- Not delete files that it didn't create itself
- Not overwrite any files (or atomically overwrite files with exactly the same content)
- Not depend upon an unseeded random number
- Not depend upon the system time
- Not depend upon the value of Toolkit environment variables
- Not depend upon external APIs, data or processes which aren't versioned and idempotent themselves
Most machine learning training tasks can be trivially made idempotent, and there are tricks that can be used to ensure that even tasks that use a checkpoint are effectively idempotent.
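One such trick, for the "atomically overwrite" point above, is to write the output to a temporary file and rename it into place; a rename within the same filesystem is atomic, so re-running the job never exposes a partial file (a sketch, with evaluate.py standing in for your own task):

```bash
# Write the full output first, then atomically move it into place.
python evaluate.py > results.json.tmp
mv results.json.tmp results.json
```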