Tutorial: Toolkit in Action

The following walkthrough is an excellent way to get a glimpse of what the Toolkit can do.

The Toolkit relies heavily on Docker. If you are unfamiliar with that technology, we advise you to read the Walkthrough - Docker page.

Install the CLI

To work with the Toolkit, the first thing we need to do is download the CLI. The Toolkit CLI is an executable, available for two platforms: Linux and Mac (aka Darwin).

Download the binary, make it executable, and move it to a directory that is in your PATH. Make sure to upgrade to the latest version.

# download latest Toolkit binary
curl https://console.elementai.com/cli/install | sh

# make it executable
chmod +x ./eai

# add eai executable to your PATH
mv eai [directory which is in your PATH]

# upgrade to the latest version
eai upgrade

# version information
eai --version

Authentication & Account

To use the Toolkit you must be authenticated. Run the login command and follow the instructions to create your user. You will have to follow a link to obtain your authentication token and then paste it into the CLI.

eai login

Once logged in you can get some basic information about your user, including the all-important account ID. Every command is executed in the context of an account, which you can specify. If you don’t specify it, your command will be run in the context of your sandbox account.

Note

A user and an account are two different concepts. Your user is your login that only you can access. An account is a context with access to resources. Because your sandbox account is only accessible to you by default, you might not immediately understand the difference. However, you can create multiple accounts, and each of these accounts may be shared among multiple users. Accounts are how Toolkit manages access control.

# get information about your user (ID, name, organization, email, account name or ID if unnamed)
eai user get
# get information about your account(s) (ID, account name, organization, parent)
# when no account is specified, it defaults to your sandbox account
eai account get

A good practice is to assign a human-readable name to your user and account to make them easier to work with.

eai user set {USER_ID} --name {USER_NAME}
eai account set {ACCOUNT_ID} --name {ACCOUNT_NAME}

Note

The names you set will take effect globally, not just locally. Keep in mind that user names must be unique across the whole organization, and account names must be unique within their parent (either the organization or a parent account).

Docker Images

The Toolkit runs Docker containers on Kubernetes. It can run public images (e.g. tensorflow/tensorflow) or private images, which must be pushed to our internal registry residing at registry.console.elementai.com.

Note

Toolkit also offers another internal registry, volatile-registry.console.elementai.com, where you can push temporary images. After 24 hours the images in this registry are automatically deleted to save storage space.

It is also possible to extend the storage duration by appending -<number>days or -<number>months (e.g. -3months) to the tag used when pushing the image. The maximum duration is 90 days (3 months).
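
For example, assuming an image already built locally and tagged job-example (a placeholder name), pushing it to the volatile registry with a three-month retention could look like this:

# ACCOUNT_ID is your account ID, obtained as shown later in this tutorial
docker tag job-example volatile-registry.console.elementai.com/$ACCOUNT_ID/job-example:v1-3months
docker push volatile-registry.console.elementai.com/$ACCOUNT_ID/job-example:v1-3months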

To take advantage of the Toolkit Docker registry, you need to configure Docker to log in with your Toolkit credentials. You will also need to have Docker installed locally.

eai docker configure
  • This command will set up a credential helper for the Toolkit registries in ~/.docker/config.json.

  • To use this configuration, a symlink named docker-credential-eai must be created in a directory that is in your $PATH.

  • The command will display the recommended ln -s directive based on where the eai binary is located.

  • e.g.: if the eai binary is located in ~/bin, then eai docker configure would suggest: ln -s ~/bin/eai ~/bin/docker-credential-eai

  • You may run the suggested command as-is, or adapt it to your particular needs if you know what you are doing.

For more details, see: https://docs.docker.com/engine/reference/commandline/login/#credential-helpers
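
Putting these steps together, a typical sequence looks like this (assuming the eai binary was placed in ~/bin; adjust the paths to your own setup):

# configure the credential helper for the Toolkit registries
eai docker configure
# create the symlink suggested by the command above
ln -s ~/bin/eai ~/bin/docker-credential-eai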

Once done, you can tag your own images following the pattern registry.console.elementai.com/<account id>/<image name> to be able to push them to the Docker registry and use them in the Toolkit. Assuming you know how to build your own Docker image, the following example pushes it to your personal account under the name job-example.

# Get your account ID.
export ACCOUNT_ID=$(eai account get --field id)
export IMAGE=registry.console.elementai.com/$ACCOUNT_ID/job-example
docker build -t $IMAGE .
docker push $IMAGE
# You can use a name instead of the ID if you named your account previously
# Get organization name
export ORG_NAME=$(eai organization get --field name)
# Get account name
export ACCOUNT_NAME=$(eai account get --field name)
export ACCOUNT_ID=$ORG_NAME.$ACCOUNT_NAME
export IMAGE=registry.console.elementai.com/$ACCOUNT_ID/job-example

Hint

For tips consult the Toolkit Docker Tips page.

Submitting a Simple Job

To submit a new job, use the eai job new command with the appropriate parameters describing the resources to be assigned to the job, the environment, the data, the image, the working directory, etc.

Let’s try to submit a new job on the cluster requesting two GPUs. We can check that we have access to them using the nvidia-smi command.

A job that needs GPU(s) must specify the number using the --gpu option. There are extra options that further refine the type of GPU (--gpu-mem, --gpu-tensor-cores).

# -i specifies the image that will be used. The default is ubuntu:18.04.
eai job new -i ubuntu:18.04 --gpu 2 -- nvidia-smi
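
The GPU refinement options can be added in the same way. For example, to also constrain the GPU memory (the value format below is an assumption; check eai job new -h for the exact syntax):

# request 2 GPUs with a minimum amount of GPU memory (assumed to be expressed in GB)
eai job new -i ubuntu:18.04 --gpu 2 --gpu-mem 16 -- nvidia-smi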

We’re expecting output from the previous command. Let’s take a look at the job’s logs.

# Get output of the latest job
eai job logs

# Now, for a specific job
eai job logs {JOB_ID}

To see this in action, first start a job that runs for a while and prints output, then follow its logs.

# run a long running job with output
eai job new -- bash -c 'cnt=0; while (( cnt < 300 )); do (( cnt++ )); echo $cnt; sleep 1; done'

# follow the most recently submitted job logs
eai job logs -f

It’s also possible to list jobs that are currently running or that have finished in the last 24h.

eai job ls

Get all your active jobs in the cluster.

eai job ls --state alive

Get all information about a single job.

# information about the latest job
eai job info

# ... about a specific job
eai job info {JOB_ID}

You’ll have many options to choose from when submitting a new job. You can browse them with the -h flag.

eai job new -h

Connecting to a Running Job

To connect to a job, you can use the eai job exec command. Try it with the long-running job started above:

eai job exec {JOB_ID} bash

Then run top and press c to see the full command that is currently running.

Killing a Job

Time to kill that job, even though it will only run for 5 minutes.

eai job kill {JOB_ID}

To kill all your live jobs:

eai job kill $(eai job ls --state alive --field id)

Types of Jobs

There are three options to call out here since they have a critical impact on job scheduling and resource allocation:

  • --preemptable tells the scheduler that your job is volunteering to be cut off if necessary. This sounds counterintuitive, but since it makes life easier for the scheduler, it gives your job access to more resources. Once such a job has been preempted, it can be rescheduled using eai job retry {JOB_ID}.

  • --restartable tells the scheduler that your job can be preempted - like the above option - but also that it can be rescheduled automatically. Setting either this flag or --preemptable allows the cluster to be used more efficiently, since it gives the scheduler the most flexibility in its decision-making. Writing jobs that can comfortably restart in the middle of their execution is easy and helpful in general, in case unexpected crashes occur (see the example below).

  • --interactive tells the scheduler that this job should be scheduled as soon as possible and should never be halted. Each user can have at most one such job.

  • A job that does not set any of these three flags will be allowed to run without halting for up to 48 hours, at which point it will be killed and not restarted. It is a non-preemptable job.

# Same as the previous example but with the preemptable option
eai job new -i ubuntu:18.04 --gpu 2 --preemptable -- nvidia-smi
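
The same job can also be submitted with the restartable option:

# Same job, but the scheduler may preempt it and reschedule it automatically
eai job new -i ubuntu:18.04 --gpu 2 --restartable -- nvidia-smi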

Data

A data resource is a primitive for managing any type of data associated with your jobs.

# Create a data resource with a name, pushing the content of a folder into it
eai data new {DATA_NAME} {FOLDER_NAME}

# Data resources listing
eai data ls

# Upload data into a data resource based on its ID
eai data push {DATA_ID} {DATA_FOLDER}

# Upload data into a data resource based on its human-readable name
eai data push {ORG_NAME}.{ACCOUNT_NAME}.{DATA_NAME} {DATA_FOLDER}

# More options
eai data -h
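
The counterpart of push is pull, which downloads the contents of a data resource to your machine; it is used later in this tutorial to retrieve checkpoints.

# Download the contents of a data resource locally
eai data pull {ORG_NAME}.{ACCOUNT_NAME}.{DATA_NAME}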

Job Using Data Resources

We will train a ResNet-50 pretrained on ImageNet using torchvision and PyTorch.

Code and Dataset to Download

  • Download the tutorial code here

  • Unzip the code into your working directory.

  • Download the data at https://www.kaggle.com/prasunroy/natural-images. If you do not wish to download the data yourself, it is already available on the Toolkit as shared.dataset.natural_images.

  • Unzipping the file will create two directories, data/ and natural_images/.

  • You can safely delete data/ as it is a duplicate.
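
A possible shell sequence for these steps, assuming the downloaded archive is named natural-images.zip (the actual file name may differ):

# unzip the dataset into your working directory
unzip natural-images.zip
# data/ is a duplicate and can be removed
rm -rf data/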

Upload the Dataset into a Data Resource

eai data new nat_imgs ./natural_images
# The dataset can now be referenced as {ORG_NAME}.{ACCOUNT_NAME}.nat_imgs.

Checkpointing

We will save checkpoints throughout the training. Make a new data resource that will hold the checkpoints.

eai data new checkpoint
# The data resource can now be referenced as {ORG_NAME}.{ACCOUNT_NAME}.checkpoint

Model & Training Code

Read model.py to understand how we build a ResNet-50 pretrained on ImageNet using torchvision and PyTorch Lightning.

Make a training script (see train.py). Here we train a ResNet-18 on the dataset using the PyTorch Lightning library. This will automatically log the experiment to TensorBoard. See the PyTorch Lightning documentation for details.

Build and Push your Image and Start your Training

export ORG_NAME=$(eai organization get --field name)
export ACCOUNT_NAME=$(eai account get --field name)
docker build -t registry.console.elementai.com/$ORG_NAME.$ACCOUNT_NAME/tutorial .
docker push registry.console.elementai.com/$ORG_NAME.$ACCOUNT_NAME/tutorial

Submit a new job to the cluster using 2 GPUs and 8 CPUs.

eai job new \
    --data $ORG_NAME.$ACCOUNT_NAME.nat_imgs:/app/data/natural_images \
    --data $ORG_NAME.$ACCOUNT_NAME.checkpoint:/app/checkpoint \
    --image registry.console.elementai.com/$ORG_NAME.$ACCOUNT_NAME/tutorial \
    --gpu 2 --cpu 8 --mem 12 --name tutorial_job

# The job is now named $ORG_NAME.$ACCOUNT_NAME.tutorial_job.
# The default port for TensorBoard is 6006; the job can be accessed online at
# https://$ORG_NAME-$ACCOUNT_NAME-tutorial_job-6006.job.console.elementai.com

You can follow your job’s logs to monitor training.

eai job logs -f

Once the training is complete, you can download your checkpoint.

eai data pull $ORG_NAME.$ACCOUNT_NAME.checkpoint

You can also access TensorBoard through a new job.

eai job new \
    --data $ORG_NAME.$ACCOUNT_NAME.checkpoint:/app/checkpoint \
    --image tensorflow/tensorflow \
    -- tensorboard --bind_all --logdir /app/checkpoint

# You can access your job at https://{JOB_ID}-6006.job.console.elementai.com

Dashboard

You can see the resources available and what is running on them on the overview dashboard.