Tutorial: Toolkit in Action
The following walkthrough is an excellent way to get a glimpse of what the Toolkit can do.
Toolkit heavily relies on Docker. If you are unfamiliar with that technology, we advise you to read the Walkthrough - Docker page.
Install the CLI
To work with the Toolkit, the first thing we need to do is download the CLI. The Toolkit CLI is an executable, available for two platforms: Linux and Mac (aka Darwin).
Download the binary, make it executable, and move it into a directory that is on your PATH. Then upgrade to the latest version.
# download latest Toolkit binary
curl https://console.elementai.com/cli/install | sh
# make it executable
chmod +x ./eai
# add eai executable to your PATH
mv eai [directory which is in your PATH]
# upgrade to the latest version
eai upgrade
# version information
eai --version
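Before moving the binary, it can help to confirm that the target directory really is on your PATH. The following is a minimal sketch in plain POSIX shell; the helper name dir_on_path and the $HOME/bin target are illustrative assumptions, not part of the Toolkit.

```shell
# Return 0 if the given directory is listed in $PATH, 1 otherwise.
# dir_on_path is a hypothetical helper, not a Toolkit command.
dir_on_path() {
    case ":$PATH:" in
        *":$1:"*) return 0 ;;
        *) return 1 ;;
    esac
}

# "$HOME/bin" is only an example target; substitute your own directory.
target="$HOME/bin"
if dir_on_path "$target"; then
    echo "$target is on PATH"
else
    echo "$target is not on PATH"
fi
```

Wrapping $PATH in colons lets a single pattern match the directory whether it appears first, last, or in the middle of the list.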
Authentication & Account
To use the Toolkit you must be authenticated. Run the login
command and follow the instructions to create your user.
You will have to follow a link to obtain your authentication token
and then paste it into the CLI.
eai login
Once logged in you can get some basic information about your user, including the all-important account ID.
Every command is executed in the context of an account, which you can specify.
If you don’t specify it, your command will be run in the context of your sandbox account.
Note
A user and an account are two different concepts. Your user is your login that only you can access. An account is a context with access to resources. Because your sandbox account is only accessible to you by default, you might not immediately understand the difference. However, you can create multiple accounts, and each of these accounts may be shared among multiple users. Accounts are how Toolkit manages access control.
# get information about your user (id, name, organization, mail, account name or ID if no name)
eai user get
# get information about your account(s). (id, account name, organization, parent)
# when no account is specified, it defaults to your sandbox account
eai account get
A good practice is to assign a human-readable name to your user and account to make them easier to work with.
eai user set {USER_ID} --name {USER_NAME}
eai account set {ACCOUNT_ID} --name {ACCOUNT_NAME}
Note
The names you set will take effect globally, not just locally. Keep in mind names for users should be unique across the whole organization and account names should be unique for their parent (either the organization or a parent account).
Setting your default account
Each user can set a default account, which makes it easier to manage data and jobs on that account.
Once you have been granted access to a project account, you can set it as your default account:
eai user set --account <projectAccountID>
Once it is set, commands such as eai data ls, eai job ls, and eai job new use the default account without you having to specify it explicitly.
Docker Images
Toolkit runs Docker containers in Kubernetes. It can run public images (e.g. tensorflow/tensorflow) or private images, which must be pushed to our internal registry residing at registry.console.elementai.com.
Note
Toolkit also offers another internal registry, volatile-registry.console.elementai.com, where you can push temporary images. After 24 hours the images in this registry are automatically deleted to save storage space.
It is also possible to extend the storage duration by appending -<number>days or -<number>months (like -3months) to the tag used when pushing the image.
The maximum duration is 90 days or 3 months.
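The retention suffix rule can be sketched as a small shell helper that builds the image reference and rejects durations beyond the stated maximum. This is an illustration only: the function name volatile_ref and the example account, image, and tag are hypothetical, and the <account id>/<image name> path layout is assumed to mirror the main registry.

```shell
# Print a volatile-registry image reference with a retention suffix.
# Usage: volatile_ref ACCOUNT_ID IMAGE_NAME TAG DURATION
# DURATION is "<number>days" or "<number>months" (max 90 days or 3 months).
# Assumption: the path layout <account id>/<image name> matches the main registry.
volatile_ref() {
    account=$1 image=$2 tag=$3 duration=$4
    case "$duration" in
        *days)   n=${duration%days};   max=90 ;;
        *months) n=${duration%months}; max=3 ;;
        *) echo "duration must end in 'days' or 'months'" >&2; return 1 ;;
    esac
    # Reject empty or non-numeric counts.
    case "$n" in
        ''|*[!0-9]*) echo "invalid duration: $duration" >&2; return 1 ;;
    esac
    if [ "$n" -lt 1 ] || [ "$n" -gt "$max" ]; then
        echo "duration out of range: $duration" >&2
        return 1
    fi
    echo "volatile-registry.console.elementai.com/$account/$image:$tag-$duration"
}

# Hypothetical account, image, and tag names.
volatile_ref my-account job-example dev 3months
# prints volatile-registry.console.elementai.com/my-account/job-example:dev-3months
```

The printed reference can then be passed to docker build -t and docker push in the usual way.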
To take advantage of the Toolkit Docker registry, you first need to configure Docker login. You will also need to have Docker installed locally.
eai docker configure
This command will set up a credential helper for the Toolkit registries in ~/.docker/config.
In order to use this configuration, a symlink named docker-credential-eai must be created and be located in $PATH.
The command will display the recommended ln -s directive based on where the eai binary is located. For example, if the eai binary is located in ~/bin, then the output of eai docker configure would suggest:
ln -s ~/bin/eai ~/bin/docker-credential-eai
You may run the suggested command as is, or adapt it to your particular needs if you know what you are doing.
For more detail please go to: https://docs.docker.com/engine/reference/commandline/login/#credential-helpers
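To illustrate how the suggested directive relates to the binary's location, here is a minimal sketch that derives the same kind of ln -s line from a path. The helper name suggest_symlink is hypothetical, and the fallback path is a placeholder used only so the sketch runs when eai is not installed; the authoritative suggestion remains the output of eai docker configure.

```shell
# Print an ln -s command that places docker-credential-eai next to a binary.
# Usage: suggest_symlink /path/to/eai
# suggest_symlink is a hypothetical helper, not a Toolkit command.
suggest_symlink() {
    bin_dir=$(dirname "$1")
    echo "ln -s $1 $bin_dir/docker-credential-eai"
}

# Use the real location of eai when it is installed; otherwise fall back to
# a placeholder path so the sketch still runs.
eai_path=$(command -v eai || echo "$HOME/bin/eai")
suggest_symlink "$eai_path"
```

Placing the symlink in the same directory as eai guarantees it is on PATH whenever eai itself is.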
Once done, you can tag your own images following the pattern registry.console.elementai.com/<account id>/<image name> so that you can push them to the Docker registry and use them in the Toolkit.
Assuming you know enough Docker to build your own image, the following example pushes it to your default account and gives it the name job-example.
# Get your account ID.
export ACCOUNT_ID=$(eai account get --field id)
export IMAGE=registry.console.elementai.com/$ACCOUNT_ID/job-example
docker build -t $IMAGE .
docker push $IMAGE
# You can use a name instead of the ID if you named your account previously
# Get organization name
export ORG_NAME=$(eai organization get --field name)
# Get account name
export ACCOUNT_NAME=$(eai account get --field name)
export ACCOUNT_ID=$ORG_NAME.$ACCOUNT_NAME
export IMAGE=registry.console.elementai.com/$ACCOUNT_ID/job-example
Hint
For tips consult the Toolkit Docker Tips page.
Submitting a Simple Job
The way to submit a new job is to use the eai job new command with the appropriate parameters to describe the resources to be assigned to the job, the environment, the data, the image, the work directory, etc.
Let’s try to submit a new job on the cluster requesting two GPUs.
We can check that we have access to them using the nvidia-smi command.
A job that needs GPU(s) must specify the number using the --gpu option. There are extra options that further refine the type of GPU (--gpu-mem, --gpu-tensor-cores).
# -i specifies the image that will be used. The default is ubuntu:18.04 if not specified.
eai job new -i ubuntu:18.04 --gpu 2 -- nvidia-smi
We’re expecting output from the previous command. Let’s take a look at the job’s logs.
# Get output of the latest job
eai job logs
# Now, for a specific job
eai job logs {JOB_ID}
To see it in action, first start a job that runs for some time and prints output, then follow its logs.
# run a long running job with output
eai job new -- bash -c 'cnt=0; while (( cnt < 300 )); do (( cnt++ )); echo $cnt; sleep 1; done'
# follow the most recently submitted job logs
eai job logs -f
It’s also possible to list jobs that are currently running or that have finished in the last 24h.
eai job ls
Get all your active jobs in the cluster.
eai job ls --state alive
Get all information about a single job.
# information about the latest job
eai job info
# ... about a specific job
eai job info {JOB_ID}
You’ll have a lot of options to choose from when submitting a new job. You can peruse them with the -h flag.
eai job new -h
Connecting to a Running Job
To connect to a job, you can use the eai job exec command. Try it with the already running job above:
eai job exec {job id} bash
Then try running top and type c to see the full command that is currently running.
Killing a Job
Time to kill that job, even though it will only run for 5 minutes.
eai job kill {job id}
To kill all your live jobs:
eai job kill $(eai job ls --state alive --field id)
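When no jobs are alive, the command substitution above expands to nothing and eai job kill is called without any ID; how the CLI reacts in that case is not described here. A cautious sketch wraps the same two commands in a function that skips the kill when the list is empty. The function name kill_alive_jobs is hypothetical, and the sketch assumes --field id prints whitespace-separated IDs.

```shell
# Kill every alive job, but do nothing when none are running.
# Uses only the two eai commands shown above; kill_alive_jobs is a
# hypothetical wrapper name.
kill_alive_jobs() {
    # Assumption: --field id prints IDs separated by whitespace or newlines.
    ids=$(eai job ls --state alive --field id)
    if [ -z "$ids" ]; then
        echo "no alive jobs"
        return 0
    fi
    # Intentionally unquoted: each ID becomes a separate argument.
    eai job kill $ids
}
```

Call kill_alive_jobs with no arguments once you are logged in.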
Types of Jobs
There are three options to call out here since they have a critical impact on job scheduling and resource allocation:
--preemptable
tells the scheduler that your job is volunteering to be cut off if necessary. This sounds counterintuitive, but since it makes life easier for the scheduler, it gives your job access to more resources. Once such a job has been preempted, it is possible to reschedule it using eai job retry {JOB_ID}.
--restartable
tells the scheduler that your job can be preempted, like the above option, but also that it can be rescheduled automatically. Setting either this flag or the --preemptable one allows the cluster to be used more efficiently, since it gives the scheduler the most flexibility in its decision-making. Writing jobs that can be comfortably restarted in the middle of their execution is easy and is helpful in general, just in case unexpected crashes occur.
--interactive
tells the scheduler that this job should be scheduled as soon as possible and should never be halted. Every user can create no more than one of these.
A job that does not set any of these three flags will be allowed to run without halting for up to 48 hours, at which point it will be killed and not restarted. It is a non-preemptable job.
# Same as the previous example but with the preemptable option
eai job new -i ubuntu:18.04 --gpu 2 --preemptable -- nvidia-smi
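As an illustration of a job that can be restarted in the middle of its execution, the counting loop from earlier can save its progress to a file and resume from it. This is a minimal sketch, not a Toolkit feature: the function name resumable_count and the state file path are hypothetical, and in a real job the state file would need to live somewhere that survives a restart, such as a mounted data resource.

```shell
# Count up to LIMIT, saving progress so a restarted run resumes where it stopped.
# Usage: resumable_count STATE_FILE LIMIT
resumable_count() {
    state_file=$1 limit=$2
    cnt=0
    # Resume from the last saved value if a previous run left one behind.
    if [ -s "$state_file" ]; then
        cnt=$(cat "$state_file")
    fi
    while [ "$cnt" -lt "$limit" ]; do
        cnt=$((cnt + 1))
        echo "$cnt"
        # Write to a temp file and rename, so an interruption mid-write
        # never leaves a truncated state file.
        echo "$cnt" > "$state_file.tmp"
        mv "$state_file.tmp" "$state_file"
    done
}

# Demo: the second call picks up after the first one's last saved value.
demo_dir=$(mktemp -d)
resumable_count "$demo_dir/progress" 3   # prints 1, 2, 3
resumable_count "$demo_dir/progress" 5   # resumes and prints 4, 5
rm -r "$demo_dir"
```

The same idea applies to training code: persist a checkpoint at regular intervals and load the latest one at startup.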
Data
A data resource is a primitive for managing any type of data associated with your job.
# Create a data resource with a name, pushing the content of a folder into it
eai data new {DATA_NAME} {FOLDER_NAME}
# Data resources listing
eai data ls
# Upload data into a data resource based on its ID
eai data push {DATA_ID} {DATA_FOLDER}
# Upload data into a data resource based on its human-readable name
eai data push {ORG_NAME}.{ACCOUNT_NAME}.{DATA_NAME} {DATA_FOLDER}
# More options
eai data -h
Job Using Data Resources
We will build a ResNet-50 pretrained on ImageNet using torchvision and PyTorch, and train it on a dataset stored in a data resource.
Code and Dataset to Download
Download the tutorial code here.
Unzip the code into your working directory.
Download the data at https://www.kaggle.com/prasunroy/natural-images. Note that if you do not wish to download the data, it is already available on the Toolkit as shared.dataset.natural_images.
Unzipping the file will create two directories, data/ and natural_images/.
You can safely delete data/ as it is a duplicate.
Upload the Dataset into a Data Resource
eai data new nat_imgs ./natural_images
# The dataset can now be referred to as {ORG_NAME}.{ACCOUNT_NAME}.nat_imgs.
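Since the upload takes its content from a local folder, a quick check that the folder exists and is not empty can catch a wrong working directory before anything is pushed. The sketch below is plain shell; require_nonempty_dir is a hypothetical helper name, and the upload line is left as a comment because it needs the CLI and a login.

```shell
# Succeed only if DIR exists and contains at least one entry.
# require_nonempty_dir is a hypothetical helper, not a Toolkit command.
require_nonempty_dir() {
    if [ ! -d "$1" ]; then
        echo "missing directory: $1" >&2
        return 1
    fi
    if [ -z "$(ls -A "$1")" ]; then
        echo "empty directory: $1" >&2
        return 1
    fi
}

# Upload only when the check passes (commented out: requires the eai CLI):
# require_nonempty_dir ./natural_images && eai data new nat_imgs ./natural_images
```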
Checkpointing
We will save checkpoints throughout the training. Create a new data resource to hold them.
eai data new checkpoint
# The folder can now be referred to as {ORG_NAME}.{ACCOUNT_NAME}.checkpoint
Model & Training Code
Read model.py to understand how we build a ResNet-50 pretrained on ImageNet using torchvision and PyTorch Lightning.
Write a training script (see train.py). Here we train a Resnet18 on the dataset using the PyTorch Lightning library. This automatically logs the experiment using Tensorboard. See the PyTorch Lightning documentation for details.
Build and Push your Image and Start your Training
export ORG_NAME=$(eai organization get --field name)
export ACCOUNT_NAME=$(eai account get --field name)
docker build -t registry.console.elementai.com/$ORG_NAME.$ACCOUNT_NAME/tutorial .
docker push registry.console.elementai.com/$ORG_NAME.$ACCOUNT_NAME/tutorial
Submit a new job to the cluster using 2 GPUs and 8 CPUs.
eai job new \
--data $ORG_NAME.$ACCOUNT_NAME.nat_imgs:/app/data/natural_images \
--data $ORG_NAME.$ACCOUNT_NAME.checkpoint:/app/checkpoint \
--image registry.console.elementai.com/$ORG_NAME.$ACCOUNT_NAME/tutorial \
--gpu 2 --cpu 8 --mem 12 --name tutorial_job
# The job is now named $ORG_NAME.$ACCOUNT_NAME.tutorial_job.
# The default port for Tensorboard is 6006. Print the URL where it can be accessed online:
echo https://$ORG_NAME-$ACCOUNT_NAME-tutorial_job-6006.job.console.elementai.com
You can follow your job’s logs to monitor training.
eai job logs -f
Once the training is complete, you can download your checkpoint.
eai data pull $ORG_NAME.$ACCOUNT_NAME.checkpoint
You can also access the tensorboard through a new job.
eai job new \
--data $ORG_NAME.$ACCOUNT_NAME.checkpoint:/app/checkpoint \
--image tensorflow/tensorflow \
-- tensorboard --bind_all --logdir /app/checkpoint
# You can access your job at https://[job-id]-6006.job.console.elementai.com
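The two URL forms that appear in the comments of this section, one for a named job and one for a job ID, can be captured in a small helper. This is a sketch based only on those two patterns; the function name job_url and the example organization and account names are hypothetical.

```shell
# Print the URL of a port exposed by a job.
# Usage: job_url JOB_ID PORT             (job addressed by ID)
#        job_url ORG ACCOUNT NAME PORT   (job addressed by name)
# job_url is a hypothetical helper built from the URL patterns in this tutorial.
job_url() {
    case $# in
        2) echo "https://$1-$2.job.console.elementai.com" ;;
        4) echo "https://$1-$2-$3-$4.job.console.elementai.com" ;;
        *) echo "usage: job_url JOB_ID PORT | job_url ORG ACCOUNT NAME PORT" >&2
           return 1 ;;
    esac
}

# Hypothetical organization and account names.
job_url my-org my-account tutorial_job 6006
# prints https://my-org-my-account-tutorial_job-6006.job.console.elementai.com
```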
Dashboard
You can see the resources available and what is running on them on the overview dashboard.