Jobs
The job is the workload unit that is submitted to and tracked by the Toolkit. It runs as a Docker container in a Kubernetes cluster. This section describes how jobs are submitted to the Toolkit, the impact the various flags have on the way they are handled, and how to track them.
A job can be divided into two parts:
- the specification: the values provided to describe it
- the state information: information that is updated as the job progresses
Job Specification
The job specification is the start of it all. It is the set of parameters that describe the job. Parameters can be conceptually divided into four groups:
- Execution Context
What has to be done by the job, in which context: Command, Environment Variables, Image, Data, Workdir and Network Isolation
- Resource Request
What resources are required by the job: GPU, CPU, Memory, GPU Memory, GPU Tensor Cores, CUDA Version, GPU Model Filter, CPU Model Filter, InfiniBand, Replicas and Internal DNS
- Scheduling Options
How should the scheduler handle the job: Restartable, Preemptable, Interactive, Bid and Maximum Run Time
- Identification
How to identify the job afterwards: Name and Tags
The simplest way to submit a new job spec is through the CLI command eai job new. Individual attributes of the job specification can be set using specific eai job new parameters. The -f/--file parameter allows submitting a job spec as a YAML file whose individual attributes can be overridden (e.g. eai job new -f file.yml -e A=B).
A YAML file can be generated by using the --dry-run parameter (e.g. eai job new --dry-run -- /bin/bash -c "env; ls; echo done" > file.yml).
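The generated file can then be edited and resubmitted with -f/--file. The exact fields and defaults produced by --dry-run may differ; the sketch below only uses field names and default values mentioned elsewhere in this section:

```yaml
# Hypothetical sketch of a generated job spec -- actual --dry-run output may differ.
image: ubuntu:18.04
command:
- /bin/bash
- -c
- env; ls; echo done
cpu: 1
mem: 1
preemptable: false
restartable: false
```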
Execution Context
The following fields describe what a job has to do and how to do it.
Command
The command is what will be executed in the container. If left empty, the image command (or build-time entrypoint + args) is executed.
Note
Unlike docker run, specifying a command will override the image entrypoint.
| method | description | examples |
|---|---|---|
| new | remaining args | |
| yaml | | `command:`<br>`- /bin/bash`<br>`- -c`<br>`- env; ls; echo done` |
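For example, the same command shown in the YAML above can be passed as the remaining arguments to eai job new, as in the --dry-run example earlier:

```bash
eai job new -- /bin/bash -c "env; ls; echo done"
```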
Environment Variables
Environment variables can be given to a job.
The default is empty, although some environment variables are injected automatically in the container.
If the same variable is set more than once, the last assignment wins.
| method | description | examples |
|---|---|---|
| new | | |
| yaml | | `environmentVars:`<br>`- VAR1=value`<br>`- VAR2="value with whitespace"` |
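To illustrate the last-assignment-wins rule, assuming the -e flag shown earlier is used to set variables from the CLI, repeating a variable keeps only the final value (a hypothetical example):

```bash
# VAR ends up as "second" inside the container; the first assignment is discarded.
eai job new -e VAR=first -e VAR=second -- /bin/bash -c 'echo "$VAR"'
```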
Image
This option specifies the image that will be used. The default is ubuntu:18.04.
Images can be any publicly available image (e.g. from Docker Hub) or one from our internal registries, to which you can push your own images.
| method | description | examples |
|---|---|---|
| new | | |
| yaml | | `image: ubuntu:18.04` |
Data
Data is a primitive to manage any type of data. There is a set of operations on data, described in detail in the Data reference.
A data object can be mounted in the job container.
The syntax is: {DATA_UUID}/path/to/src:/path/in/container. Note that the short name for DATA_UUID can be used.
Requirements:
- The first and second parts of the string split on : need to start with a / (both need to be absolute paths).
- The source of the volume needs to exist and be a directory. This is not the default behavior in Docker, which would create whatever is missing in the source path.
The default is empty.
| method | description | examples |
|---|---|---|
| new | | |
| yaml | | `data:`<br>`- {DATA_UUID}/path/to/src:/path/in/container` |
Workdir
This option sets the directory from which the command will be executed.
The default is empty, which will cause the container to use the workdir defined in the image, if any.
Note
Since jobs run as the user who submitted the job (as opposed to root), if the image defines a workdir that is inaccessible to the user (e.g. /root), the job will fail.
| method | description | examples |
|---|---|---|
| new | | |
| yaml | | `workdir: /path/x` |
Network Isolation
Toolkit can block incoming or outgoing traffic for a job. This is useful when you want to run untrusted code.
To enable this feature, specify one of the following values in the YAML job specification file:
| value | description | yaml |
|---|---|---|
| in | Blocking IN traffic. Example: access to https://jobID.job.console.elementai.com or to the job from another job is blocked. | `options:`<br>`  alphabits:`<br>`    network-isolation: in` |
| out | Blocking OUT traffic. Example: any request from the job is blocked. | `options:`<br>`  alphabits:`<br>`    network-isolation: out` |
| all | Blocking IN and OUT traffic. | `options:`<br>`  alphabits:`<br>`    network-isolation: all` |
Note
Even with network isolation enabled, it is still possible to get inside the job with eai job exec.
Resource Request
Resource requests are one of the core concepts of Toolkit. Each job must declare how many resources it will need, and the cluster will allocate exactly what is requested when the job runs.
Note
The Toolkit scheduler strategy is resource-based fair scheduling. Read more about it here.
Tip
Looking at the resource usage of your jobs may help you better determine how many resources they need. This is why Toolkit provides a per-job Grafana dashboard where you can look at it. See How to Monitor a Job Resources Usage.
See below for more information about what happens when a job goes over its requested resource limit.
The resource-related parameters are described below.
CPU
The number of CPUs needed by the job. It is a floating-point value.
The job will not be allowed to use more than the declared number of CPU cores (it will be throttled if it tries to go above).
The default is 1.
| method | description | examples |
|---|---|---|
| new | | |
| yaml | | `cpu: 2` |
Memory
The amount of memory needed by the job, in integer gigabytes.
If the job tries to allocate more memory than the requested amount, the allocation will fail in most cases.
If a program ignores those failed allocations, it will be terminated by the Out Of Memory Killer via a SIGKILL (OOMKilled).
The default is 1.
| method | description | examples |
|---|---|---|
| new | | |
| yaml | | `mem: 2` |
GPU
The number of GPUs needed by the job. The GPU ids assigned to the job are numbered from 0 to n - 1, where n is the number of requested GPUs.
A minimum amount of GPU memory that must be available on the GPUs where the job will run can also be requested (see GPU Memory below). The current GPUs have between 12GB and 32GB of RAM depending on their model. By default it is unset, meaning that GPU jobs will be allowed to run on GPUs with any amount of GPU RAM (with a minimum of 12GB currently).
| method | description | examples |
|---|---|---|
| new | | |
| yaml | | `gpu: 2` |
GPUs can be required to have some specific features. These are documented below.
GPU Memory
The memory available on GPU boards currently varies between 12 and 32GB. It is possible to request a minimum amount of GPU memory. Keep in mind that adding restrictions on a job will make it harder to schedule.
The default is not to care.
| method | description | examples |
|---|---|---|
| new | | |
| yaml | | `resources:`<br>`  gpuMem: 16` |
GPU Tensor Cores
The Volta architecture has what NVIDIA has named Tensor Cores to perform FP16 computations much more efficiently. It is possible to request the availability of this feature.
The default is not to care.
| method | description | examples |
|---|---|---|
| new | | |
| yaml | | `resources:`<br>`  gpuTensorCores: true` |
CUDA Version
Some images require a specific CUDA version, and some GPUs/drivers support only a subset of CUDA versions. This option indicates a specific CUDA version requirement.
The default is not to care.
| method | description | examples |
|---|---|---|
| new | | |
| yaml | | `resources:`<br>`  gpuCudaVersion: 9.2` |
CPU Model Filter
It is possible to restrict the CPU model on which a job will run by specifying one or more CPU model filters. The filtering is done by case-insensitive substring match. If multiple CPU model filters are specified, they must all be satisfied. Matching can also be negated by starting the match string with a !. The default is no restriction on CPU model (i.e. an empty filter list).
As an example, a filter of 6154 would select any CPU with "6154" in its model name (like "intel(r) xeon(r) gold 6154 cpu @ 3.00ghz"), while !6154 would avoid all CPUs with "6154" in their model name.
Available CPU models can be found on the cluster overview dashboard.
| method | description | examples |
|---|---|---|
| new | | |
| yaml | | `resources:`<br>`  cpuModel: 6154`<br><br>`resources:`<br>`  cpuModel: '!6154'` |
Important
The ! is a reserved character in both the shell and YAML.
Wrapping strings that start with ! in single quotes avoids problems.
GPU Model Filter
It is possible to restrict the GPU model on which a job will run by specifying one or more GPU model filters. The filtering is done by case-insensitive substring match. If multiple GPU model filters are specified, they must all be satisfied. Matching can also be negated by starting the match string with a !. The default is no restriction on GPU model (i.e. an empty filter list).
As an example, a filter of V100 would select any GPU with "V100" in its model name (like "tesla v100-sxm2-32gb"), while !V100 would avoid all GPUs with "V100" in their model name.
Available GPU models can be found on the cluster overview dashboard.
| method | description | examples |
|---|---|---|
| new | | |
| yaml | | `resources:`<br>`  gpuModel: V100`<br><br>`resources:`<br>`  gpuModel: '!V100'` |
Important
The ! is a reserved character in both the shell and YAML.
Wrapping strings that start with ! in single quotes avoids problems.
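The same quoting applies on the command line. A hypothetical invocation using the --gpu-model-filter flag mentioned under Replicas below:

```bash
# Avoid all V100 GPUs; the single quotes keep the shell from interpreting the leading '!'.
eai job new --gpu-model-filter='!V100' -- nvidia-smi
```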
InfiniBand
Toolkit can provide InfiniBand devices to a job using the flag --infiniband.
The job will be mounted with the devices in the directory /dev/infiniband/ and with the capability IPC_LOCK.
It is important to use the full node resources (8 GPUs) when you want to use this feature, because only one job per node can use the InfiniBand devices.
| method | description | examples |
|---|---|---|
| new | | |
| yaml | | `options:`<br>`  infiniband: true/false` |
Replicas
Toolkit allows running replicas of your main job; more information can be found on the Large Model Training page.
- --replicas= is the number of replicas you want for your job
- --replica-master-port= is a shortcut for the Internal DNS port
Other mandatory options for replica jobs:
- --infiniband
- --gpu
- --gpu-mem=
- --gpu-model-filter= (only V100 (32GB), A100 (80GB) and H100 (80GB))
More information on Internal DNS.
| method | description | examples |
|---|---|---|
| new | | |
| yaml | | `resources:`<br>`  replicas: 4`<br>`options:`<br>`  internal-dns:`<br>`    name: ""`<br>`    ports:`<br>`    - port: 8080`<br>`      target-port: 8080`<br>`      protocol: TCP`<br>`environmentVars:`<br>`- MASTER_PORT=8080` |
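Putting the flags above together, a replica job submission might look like the sketch below. The flag value syntax and the training command are illustrative only; adapt them to your own setup:

```bash
# Hypothetical 4-replica job on full nodes with InfiniBand and V100 (32GB) GPUs.
eai job new \
  --replicas=4 \
  --replica-master-port=8080 \
  --infiniband \
  --gpu 8 \
  --gpu-mem=32 \
  --gpu-model-filter=V100 \
  -- python train.py
```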
X11
Toolkit can start a sidecar container that provides an X11 display to the main job container. It can be GPU-accelerated or not. The DISPLAY environment variable will be set automatically in the main job container to point to the X server running in the sidecar.
Note
GPU Accelerated Xorg Server
There are implicit requirements and side-effects when running a GPU-accelerated X11 sidecar:
- At least 1 GPU must be requested.
- For OpenGL rendering to work, the container image must have the NVIDIA OpenGL Vendor Neutral Dispatch library installed. The easiest way to do so is by basing the container image on nvidia/opengl:*-glvnd-*. A list of available images is available here. More information regarding this setup is available on the project's GitHub repo: https://github.com/NVIDIA/libglvnd
- The X server runs on GPU 0.
- The screen resolution is currently fixed to 1280x960.
In the YAML job specification, setting options.sidecars.x11.gpu to false will set up the CPU version, while true will enable the GPU-accelerated X11 sidecar.
| method | description | examples |
|---|---|---|
| new | | |
| yaml | | `options:`<br>`  sidecars:`<br>`    x11:`<br>`      gpu: true/false` |
Internal DNS
Toolkit can create a custom DNS name for your job, which allows other jobs in your organization to easily make requests/connections to your job.
options.sidecars.internal-dns is a list of ports to forward from the chosen DNS name to your job:
- the name of the DNS is required and needs to follow RFC 1035, as used by Kubernetes
- at least one port is required
- two elements of ports cannot use the same port number, but the same target-port can be used multiple times
| method | description | examples |
|---|---|---|
| new | | |
| yaml | | `options:`<br>`  internal-dns:`<br>`    name: dns-<accountID>-mydns`<br>`    ports:`<br>`    - port: 80`<br>`      target-port: 8080`<br>`    - port: 5432`<br>`      target-port: 5432`<br>`      protocol: TCP` |
Using the previous example, another job in the same organization will be able to call it with:
curl http://dns-<accountID>-mydns
The request will be forwarded to the job on port 8080.
Note: If dns-<accountID>- is not present in the name, it will be automatically prepended.
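For the forwarding to do anything useful, the job command must listen on the declared target-port. A minimal sketch, assuming an image with Python 3 available:

```yaml
# Hypothetical spec: the job serves HTTP on 8080, which port 80 of the DNS name forwards to.
command:
- python3
- -m
- http.server
- "8080"
options:
  internal-dns:
    name: dns-<accountID>-mydns
    ports:
    - port: 80
      target-port: 8080
```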
Scheduling Options
Options in the job specification affect the way the scheduler will handle the job.
Bid
The bid value is an integer greater than or equal to 0.
This parameter allows users to control the order in which jobs are scheduled within an account: jobs with a higher bid go first amongst all jobs for that account, and for jobs with equal bid values within the account, the oldest are considered first (FIFO).
Note: This parameter only affects the ordering of jobs within the account's job queue. In other words, setting a high bid on a job doesn't grant this job any priority over another account's jobs. (See The Job vs. the other Jobs: The Scheduler for details)
The default is None (which is handled as 0).
| method | description | examples |
|---|---|---|
| new | | |
| yaml | | `bid: 42` |
Interactive
Interactive jobs have the highest priority and will preempt as many jobs as needed, no matter what the current user's share of the cluster is. (See The Job vs. the other Jobs: The Scheduler for details)
The default is false.
| method | description | examples |
|---|---|---|
| new | | |
| yaml | | `interactive: true` |
Maximum Run Time
A new job can be submitted with a maximum run time in seconds, after which it will be forcefully terminated.
The default is 172800 (48h) for non-preemptable jobs and 0, which means no maximum, for preemptable and interactive ones.
| method | description | examples |
|---|---|---|
| new | | |
| yaml | | `maxRunTime: 3600` |
Preemptable
A preemptable job is a job that is allowed to be interrupted by the scheduler when resources are required for other jobs.
Submitting preemptable jobs allows the scheduler to enforce a fairness policy with regards to how the cluster resources are shared between all users. For this reason, there are many more resources available to run such jobs as opposed to non-preemptable ones. (See The Job vs. the other Jobs: The Scheduler for details)
The default is false.
| method | description | examples |
|---|---|---|
| new | | |
| yaml | | `preemptable: true` |
Restartable
Restartable jobs are implicitly preemptable, but when preempted, the scheduler will add them back to the queue to be started again when conditions allow.
See discussion about checkpointing and idempotence at Jobs Development Pro Tips.
The default is false.
| method | description | examples |
|---|---|---|
| new | | |
| yaml | | `restartable: true` |
Identification
There are two different kinds of information that can be assigned to jobs: a name and tags. These can then be used to ask for matching job(s).
Name
The name is a single string value with a maximum length of 255 characters.
The default is empty.
| method | description | examples |
|---|---|---|
| new | | |
| yaml | | `name: exp_1_attempt_39` |
State Information
As soon as a job spec is submitted successfully, it becomes a job. A job has all the job spec attributes and more. eai job info is the simplest way to get the job and display it.
A job goes through the following states as events occur. Each state is described below.
| state | description |
|---|---|
| QUEUING | The job is queued in the Toolkit. If the cluster is full, the job will stay in that state. Because of "fair scheduling", even though you may see room for a particular job, it may stay in that state because some other jobs have higher priority. (See The Job vs. the other Jobs: The Scheduler for more details) |
| QUEUED | Toolkit has pushed the job to the Kubernetes part of the cluster, which means that it will be scheduled soon. When a job is in this state, consult the stateInfo field. |
| | The job is running on the cluster. |
| | The job finished running normally. (Its exit code is 0.) |
| | The job failed. The failure is either due to a bad job spec, in which case there is no exit code in the job info, or the command failed with a non-0 exit code. |
| INTERRUPTED | The job finished running due to either a problem related to the cluster or to scheduling preemption by the Toolkit. |
| | The job is pending cancellation. |
| | The job finished running due to cancellation. |
As a job advances through the lifecycle, attributes are updated. Some are set once and for all:
| name | assigned when | description |
|---|---|---|
| | | indicates if the job is still active or not. It depends on the current state (see lifecycle above) |
| | | unique id assigned to the job at creation time |
| | | user who created the job |
| | | time at which the job was created |
| runs | | list of state information. There is an entry for each incarnation of the job in the cluster, in chronological order. (See below) |
| | | one of the lifecycle states listed above |
| stateInfo | | extra state information for the current state (see Launch Delay below for an example) |
A job can have multiple “incarnations” in the cluster, each of these being tracked by a Run.
- there is at least one run in a job
- only the last run can be alive
- a new run is added every time a job enters the QUEUING state
Each Run in the runs list above can have the following attributes:
| name | assigned when | description |
|---|---|---|
| | | time at which the job was cancelled |
| | | time at which cancellation was requested |
| | | time at which the run was created |
| | | time at which the job stopped being |
| | | unique id assigned to the run when it is created |
| | | if the job spec required gpu(s), their indices will be found here in an array under the |
| | | ip address the job was assigned |
| | | id of the job to which this run belongs |
| | | time at which the job gets to the |
| | | time at which the job started |
Launch Delay
Whenever a job requesting an image is scheduled on a node which does not have the image, the job will take longer to start because the node needs to pull the image first. The stateInfo field will be updated accordingly while the job is in the QUEUED state.
The Job vs. the other Jobs: The Scheduler
The goal of the Toolkit is to assign available resources to requesting jobs. The resource assignment process and logic are the subject of this section.
Job Priority
Interactive Jobs
All interactive jobs are handled first, in decreasing bid order and then in creation order (FIFO). Users each have a quota of one (1) interactive job. If a user already has a running interactive job, further ones will be failed with a stateInfo to that effect. Preemptable jobs will be preempted to make room for interactive jobs, which are to be started as fast as possible.
There is no specified limit to the time an interactive job can run, and to help with cluster management, interactive jobs are kept together on the same nodes as much as possible.
Non-Interactive Jobs
The job ordering is first based on the account occupancy. The occupancy is computed as the current share of the cluster an account is actually using across all its jobs.
All QUEUING jobs are considered in increasing order of their account occupancy. Accounts with no jobs in the cluster at all are thus considered first, followed by jobs submitted by accounts in order of increasing occupancy.
Amongst the jobs for accounts of equal occupancy, the jobs with the highest bid are considered first, and amongst those with equal bid, the oldest jobs are considered first (FIFO).
Job Queues
Jobs are classified in distinct queues depending on the resources they ask for. If a CPU-only job cannot fit, no more CPU-only jobs are considered. The same goes for single-GPU jobs and multi-GPU jobs. Job preemptability is also factored in.
Preemption
Jobs that are marked as preemptable can be interrupted at any time in order to make room for jobs from other users when the occupancy-based criterion is respected: jobs from a user with a higher occupancy will be preempted in order to make room for jobs from a user with lower occupancy. (Restartable jobs are implicitly considered to be preemptable.)
A preempted restartable job will return to the QUEUING state automatically when it is interrupted and be rescheduled when cluster usage allows it. This means that a single job's processing might span multiple runs, as will the logs.
A preempted NON-restartable job will end up in the INTERRUPTED state.
The amount of resources available for non-preemptable jobs is much smaller than for preemptable ones, since it is much easier to share resources between preemptable jobs.
Making your jobs restartable/preemptable will make them run earlier, although they might be interrupted (and maybe restarted) if their resources are required to run jobs for users with lower occupancy.
Jobs Development Pro Tips
Checkpointing
Because there is a wide range of reasons why jobs can be interrupted at any given time, your jobs should create checkpoints from which they can resume their task whenever they get rescheduled. Once done, you should set the restartable flag on those jobs.
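A minimal sketch of a checkpoint-aware job command, assuming a hypothetical training script that accepts --resume and --checkpoint-dir flags and writes its checkpoints to a mounted path:

```bash
#!/bin/bash
# Hypothetical resume logic -- adapt paths and flags to your own training script.
CKPT_DIR=/checkpoints            # e.g. a mounted data object
if [ -f "$CKPT_DIR/latest.pt" ]; then
    # A previous run was interrupted: resume from its last checkpoint.
    python train.py --resume "$CKPT_DIR/latest.pt" --checkpoint-dir "$CKPT_DIR"
else
    # First run of the job: start from scratch.
    python train.py --checkpoint-dir "$CKPT_DIR"
fi
```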
Idempotence
An important quality of a well-defined job is idempotence. Basically, it means that given the same input, a task will produce the same output. Implicitly, it means that an idempotent job will:
- Always produce the same output from the same input
- Put all temporary output in a private place
- Not delete files that it didn't create itself
- Not overwrite any files (or atomically overwrite files with exactly the same content)
- Not depend upon an unseeded random number
- Not depend upon the system time
- Not depend upon the value of Toolkit environment variables
- Not depend upon external APIs, data or processes which aren't versioned and idempotent themselves
Most machine learning training tasks can be trivially made idempotent, and there are tricks that can be used to ensure that even tasks that use a checkpoint are effectively idempotent.
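One such trick, for the "atomically overwrite" point above, is to write the output to a temporary file and rename it into place; a rename within the same filesystem is atomic, so re-running the job never exposes a partial file (a sketch, with evaluate.py standing in for your own task):

```bash
# Write the full output first, then atomically move it into place.
python evaluate.py > results.json.tmp
mv results.json.tmp results.json
```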