Tutorial: Toolkit in Action

The following walkthrough is an excellent way to get a glimpse of what the Toolkit can do.

The Toolkit relies heavily on Docker. If you are unfamiliar with that technology, we advise you to read the Walkthrough - Docker page.

Install the CLI

To work with the Toolkit, the first thing to do is download the CLI. The Toolkit CLI is an executable available for two platforms: Linux and Mac (aka Darwin).

Download the binary, make it executable, and move it to a directory in your PATH. Make sure to upgrade to the latest version.

# download latest Toolkit binary
curl https://console.elementai.com/cli/install | sh

# make it executable
chmod +x ./eai

# add eai executable to your PATH
mv eai [directory which is in your PATH]

# upgrade to the latest version
eai upgrade

# version information
eai --version

Sign up

If you are new to Toolkit, you can follow the How to sign up guide.

CLI Authentication

To use the Toolkit you must be authenticated. Run the eai login command and follow the instructions. You will have to follow a link to obtain your authentication code and then paste it into the CLI.

If you don’t know how to log in, follow this guide: How to authenticate on the Webapp

eai login

Once logged in you can get some basic information about your user.

# get information about your user: id, name, organization, mail, account name (or ID if no name)
eai user get

Getting an account

It is not possible to create an account by yourself anymore.

If you’re uncertain about which account to use, please consult your teammates. If this is a completely new project, feel free to submit a request to create a new account in the support channel.

If your team already has an account, you can ask the admin of the account to call the following command to grant you access:

eai account role member add snow.account_a.user <yourMail>

Setting your default account

Each user can set a default account as a preference, making it easier to manage data and jobs on an account. If you’re uncertain about which account to use, consult your teammates; if this is a completely new project, see Getting an account.

Once you have been granted access to a project account, you can set it as your default account:

eai user set --account <projectAccountID>

When it is defined, commands like eai data ls, eai job ls, or eai job new use the default account without you having to ask for it explicitly.

Note

A user and an account are two different concepts. Your user is your login, which only you can access. An account is a context with access to resources. Accounts are how Toolkit manages access control.

Using your Toolkit data home

Toolkit provides each user with a data home directory located at <organization>.home.<name>.

You can find the name or ID of your own data home in the dataHome field of eai user get (be sure to update your CLI first).

$ eai user get --fields fullName,dataHome

fullName             dataHome
snow.guillaume_smaha snow.home.guillaume_smaha
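The data home name in the output above follows the <organization>.home.<name> pattern. As a minimal sketch composing it in plain shell (ORG and USER_NAME are illustrative values, not read from the CLI):

```shell
# Build a data home name from its parts (values are illustrative).
ORG=snow
USER_NAME=guillaume_smaha
DATA_HOME="$ORG.home.$USER_NAME"
echo "$DATA_HOME"   # snow.home.guillaume_smaha
```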

You can use it for a job by calling:

$ eai job new --data <organization>.home.<name>:/home/toolkit

Note

This space is intended for storing private information only.

Please do not store models, datasets, or other big files here. Keep it under 10GB.

Warning

This data home will be erased immediately when your user is deactivated.

See Data home directory for the best practices on the data home directory.

Docker Images

Toolkit runs docker containers in Kubernetes. It can run public images (e.g. tensorflow/tensorflow) or private images which must be pushed to our internal registry residing at registry.console.elementai.com.

To take advantage of the Toolkit Docker registry, you will need to have Docker installed locally. You will also need to complete the login configuration; see How to login with docker.

Once logged in, you can tag your own images following the pattern registry.console.elementai.com/<account id>/<image name> in order to push them to the Docker registry and use them in the Toolkit. Assuming you know how to build your own Docker image, the following example pushes it to your default account and gives it the name job-example.

# Get your account ID.
export ACCOUNT_ID=$(eai account get --field id)
export IMAGE=registry.console.elementai.com/$ACCOUNT_ID/job-example
docker build -t $IMAGE .
docker push $IMAGE
# You can use a name instead of the ID if you named your account previously
# Get organization name
export ORG_NAME=$(eai organization get --field name)
# Get account name
export ACCOUNT_NAME=$(eai account get --field name)
export ACCOUNT_ID=$ORG_NAME.$ACCOUNT_NAME
export IMAGE=registry.console.elementai.com/$ACCOUNT_ID/job-example

Hint

For tips consult the Toolkit Docker Tips page.

Submitting a Simple Job

The way to submit a new job is to use the eai job new command with the appropriate parameters to describe the resources to be assigned to the job, the environment, the data, the image, the work directory, etc.

Let’s try to submit a new job on the cluster requesting one GPU. We can check that we have access to it using the nvidia-smi command.

A job that needs GPU(s) must specify the number using the --gpu option. There are extra options that further refine the type of GPU (--gpu-mem, --gpu-tensor-cores).

# -i specifies the image to use; the default is ubuntu:18.04 if not specified.
eai job new -i ubuntu:18.04 --gpu 1 -- nvidia-smi

We’re expecting output from the previous command. Let’s take a look at the job’s logs.

# Get output of the latest job
eai job logs

# Now, for a specific job
eai job logs {JOB_ID}

To see it in action, first start a job that runs for a while and prints output, then follow its logs.

# run a long running job with output
eai job new -- bash -c 'cnt=0; while (( cnt < 300 )); do (( cnt++ )); echo $cnt; sleep 1; done'

# follow the most recently submitted job logs
eai job logs -f

It’s also possible to list jobs that are currently running or that have finished in the last 24h.

eai job ls

Get all your active jobs in the cluster.

eai job ls --state alive

Get all information about a single job.

# information about the latest job
eai job info

# ... about a specific job
eai job info {JOB_ID}

There are many options to choose from when submitting a new job. You can peruse them with the -h flag.

eai job new -h

Connecting to a Running Job

To connect to a job, you can use the eai job exec command. Try it with the already running job above:

eai job exec {JOB_ID} bash

Then run top and press c to see the full command that is currently running.

Killing a Job

Time to kill that job, even though it will only run for 5 minutes.

eai job kill {JOB_ID}

To kill all your live jobs:

eai job kill $(eai job ls --state alive --field id)

Types of Jobs

There are three options to call out here since they have a critical impact on job scheduling and resource allocation:

  • --preemptable tells the scheduler that your job is volunteering to be cut off if necessary. This sounds counterintuitive, but since it makes life easier for the scheduler this will give your job access to more resources. Once such a job has been preempted, it is possible to reschedule it using eai job retry {JOB_ID}

  • --restartable tells the scheduler that your job can be preempted - like the above option - but also that it can be rescheduled automatically. Setting either this flag or --preemptable allows the cluster to be used more efficiently, since it gives the scheduler the most flexibility in its decision-making. Writing jobs that can be comfortably restarted in the middle of their execution is easy and helpful in general, just in case unexpected crashes occur.

  • --interactive tells the scheduler that this job should be scheduled as soon as possible and should never be halted. Every user can create no more than one of these.

  • A job that does not set any of these three flags will be allowed to run without halting for up to 48 hours, at which point it will be killed and not restarted. It is a non-preemptable job.
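The --restartable option above implies a pattern worth sketching: persist progress so that a rescheduled run resumes where the preempted one stopped. A minimal plain-shell sketch (the path and step count are illustrative, not part of the Toolkit API; a real job would checkpoint to a mounted data resource rather than /tmp):

```shell
# Resume from a checkpoint file if one exists, then persist progress
# after each step so a restarted job picks up where this one stopped.
CKPT=/tmp/demo_checkpoint
rm -f "$CKPT"                      # start fresh for this demo
start=0
[ -f "$CKPT" ] && start=$(cat "$CKPT")
i=$start
while [ "$i" -lt 5 ]; do
  i=$((i + 1))
  echo "step $i"
  echo "$i" > "$CKPT"              # checkpoint survives preemption
done
```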

# Same as the previous example, but preemptable and requesting two GPUs
eai job new -i ubuntu:18.04 --gpu 2 --preemptable -- nvidia-smi

Data

A data resource is a primitive for managing any type of data associated with your jobs.

# Create a data resource with a name, pushing the content of a folder into it
eai data new {DATA_NAME} {FOLDER_NAME}

# Data resources listing
eai data ls

# Upload data into a data resource based on its ID
eai data push {DATA_ID} {DATA_FOLDER}

# Upload data into a data resource based on its human-readable name
eai data push {ORG_NAME}.{ACCOUNT_NAME}.{DATA_NAME} {DATA_FOLDER}

# More options
eai data -h
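As the commands above show, a data resource's human-readable name follows the {ORG_NAME}.{ACCOUNT_NAME}.{DATA_NAME} pattern. A small sketch composing such a name with hypothetical values:

```shell
# Compose a fully qualified data name (values are illustrative).
ORG_NAME=snow
ACCOUNT_NAME=account_a
DATA_NAME=nat_imgs
FULL_NAME="$ORG_NAME.$ACCOUNT_NAME.$DATA_NAME"
echo "$FULL_NAME"   # snow.account_a.nat_imgs
```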

Job Using Data Resources

We will use a ResNet-50 pretrained on ImageNet, built with torchvision and PyTorch.

Code and Dataset to Download

  • Download the tutorial code here

  • Unzip the code into your working directory.

  • Download the data at https://www.kaggle.com/prasunroy/natural-images. If you do not wish to download it, the data is already available on the Toolkit as shared.dataset.natural_images.

  • Unzipping the file will create two directories, data/ and natural_images/.

  • You can safely delete data/ as it is a duplicate.

Upload the Dataset into a Data Resource

eai data new nat_imgs ./natural_images
# The dataset can now be referenced as {ORG_NAME}.{ACCOUNT_NAME}.nat_imgs.

Checkpointing

We will save checkpoints throughout the training. Create a new data resource to hold them.

eai data new checkpoint
# The folder can now be referenced as {ORG_NAME}.{ACCOUNT_NAME}.checkpoint

Model & Training Code

Read model.py to understand how we build a ResNet-50 pretrained on ImageNet using torchvision and PyTorch Lightning.

Make a training script (see train.py). Here we train a ResNet-18 on the dataset using the PyTorch Lightning library. This will automatically log the experiment using Tensorboard. See the PyTorch Lightning documentation for details.

Build and Push your Image and Start your Training

export ORG_NAME=$(eai organization get --field name)
export ACCOUNT_NAME=$(eai account get --field name)
docker build -t registry.console.elementai.com/$ORG_NAME.$ACCOUNT_NAME/tutorial .
docker push registry.console.elementai.com/$ORG_NAME.$ACCOUNT_NAME/tutorial

Submit a new job to the cluster using 2 GPUs and 8 CPUs.

eai job new \
    --data $ORG_NAME.$ACCOUNT_NAME.nat_imgs:/app/data/natural_images \
    --data $ORG_NAME.$ACCOUNT_NAME.checkpoint:/app/checkpoint \
    --image registry.console.elementai.com/$ORG_NAME.$ACCOUNT_NAME/tutorial \
    --gpu 2 --cpu 8 --mem 12 --name tutorial_job

# The job is now named $ORG_NAME.$ACCOUNT_NAME.tutorial_job.
# The default port for Tensorboard is 6006; the job can be reached online at
# https://$ORG_NAME-$ACCOUNT_NAME-tutorial_job-6006.job.console.elementai.com
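The job URL in the comment above is composed from the org name, account name, job name, and port. A sketch building it with hypothetical values:

```shell
# Compose the job's Tensorboard URL (values are illustrative).
ORG_NAME=snow
ACCOUNT_NAME=account_a
JOB_NAME=tutorial_job
PORT=6006
URL="https://$ORG_NAME-$ACCOUNT_NAME-$JOB_NAME-$PORT.job.console.elementai.com"
echo "$URL"
```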

You can follow your job’s logs to monitor training.

eai job logs -f

Once the training is complete, you can download your checkpoint.

eai data pull $ORG_NAME.$ACCOUNT_NAME.checkpoint

You can also access Tensorboard through a new job.

eai job new \
    --data $ORG_NAME.$ACCOUNT_NAME.checkpoint:/app/checkpoint \
    --image tensorflow/tensorflow \
    -- tensorboard --bind_all --logdir /app/checkpoint

# You can access your job at https://[job-id]-6006.job.console.elementai.com

Dashboard

You can see the resources available and what is running on them on the Overview Dashboard.

Exploring job parameters

You can now discover all the job parameters by going to Job Specification.