Tutorial: Toolkit in Action
The following walkthrough is an excellent way to get a glimpse of what the Toolkit can do.
The Toolkit relies heavily on Docker. If you are unfamiliar with that technology, we advise you to read the Walkthrough - Docker page.
Install the CLI
To work with the Toolkit, the first thing we need to do is download the CLI. The Toolkit CLI is an executable available for two platforms: Linux and Mac (aka Darwin).
Download the binary, make it executable, and move it to a directory on your PATH. Make sure to upgrade to the latest version.
# download latest Toolkit binary
curl https://console.elementai.com/cli/install | sh
# make it executable
chmod +x ./eai
# add eai executable to your PATH
mv eai [directory which is in your PATH]
# upgrade to the latest version
eai upgrade
# version information
eai --version
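For example, assuming /usr/local/bin is on your PATH, the move step could look like this (the destination directory is only an illustration; any directory on your PATH works):
# move the binary to a directory on your PATH (here /usr/local/bin, as an example)
sudo mv eai /usr/local/bin/
# confirm the shell can find the CLI
which eai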
Sign up
If you are new to the Toolkit, you can follow the How to sign up guide.
CLI Authentication
To use the Toolkit you must be authenticated. Run the eai login command and follow the instructions. You will have to follow a link to obtain your authentication code and then paste it into the CLI.
If you don't know how to log in, see the guide How to authenticate on the Webapp.
eai login
Once logged in you can get some basic information about your user.
# get information about your user (ID, name, organization, mail, account name or ID)
eai user get
Getting an account
It is no longer possible to create an account by yourself.
If you're uncertain about which account to use, please consult with your teammates. If this is a completely new project, feel free to submit a request to create a new account in the support channel.
If your team already has an account, you can ask the admin of the account to call the following command to grant you access:
eai account role member add snow.account_a.user <yourMail>
Setting your default account
Each user can set a default account as a preference, which provides an easier way to manage data and jobs on that account. If you're uncertain about which account to use, please consult with your teammates, or if this is a completely new project, see Getting an account.
Once you have been granted access to a project account, you can set it as your default account:
eai user set --account <projectAccountID>
When it is defined, commands like eai data ls, eai job ls or eai job new will use the default account without you having to specify it explicitly.
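For example, once your default account is set, you can confirm it in your user information and then run account-scoped commands without naming the account (a quick illustration using commands shown elsewhere in this tutorial):
# confirm your user information, including the account set as default
eai user get
# list data and jobs in the default account without specifying it
eai data ls
eai job ls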
Note
A user and an account are two different concepts. Your user is your login, which only you can access. An account is a context with access to resources. Accounts are how the Toolkit manages access control.
Using your Toolkit data home
The Toolkit provides each user with a data home directory located at <organization>.home.<name>.
You can find the name or ID of your own data home in the dataHome field of eai user get (be sure to update your CLI first).
$ eai user get --fields fullName,dataHome
fullName dataHome
snow.guillaume_smaha snow.home.guillaume_smaha
You can use it for a job by calling:
$ eai job new --data <organization>.home.<name>:/home/toolkit
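You can also push small files into your data home with the eai data push command described in the Data section below; for instance, assuming the data home shown in the example above (the folder name here is illustrative):
# push a small local folder into your data home (data home name taken from the example above)
eai data push snow.home.guillaume_smaha ./notes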
Note
This space is intended for storing private information only.
Please do not store models, datasets or any big files. Keep it under 10GB.
Warning
This data home will be erased immediately when your user is deactivated.
See Data home directory for the best practices on the data home directory.
Docker Images
The Toolkit runs Docker containers in Kubernetes. It can run public images (e.g. tensorflow/tensorflow) or private images, which must be pushed to our internal registry at registry.console.elementai.com.
To take advantage of the Toolkit Docker registry, you will need to have Docker installed locally. You will also need to complete the login configuration; see How to login with docker.
Once logged in, you can tag your own images following the pattern registry.console.elementai.com/<account id>/<image name> so that they can be pushed to the Docker registry and used in the Toolkit.
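As a rough sketch, the login step boils down to pointing Docker at the Toolkit registry (the exact credentials or token to use are covered in How to login with docker):
# authenticate Docker against the Toolkit registry
# (see "How to login with docker" for the credentials/token to provide)
docker login registry.console.elementai.com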
Assuming you have enough knowledge of Docker to build your own image, the following example pushes it to your default account and gives it the name job-example.
# Get your account ID.
export ACCOUNT_ID=$(eai account get --field id)
export IMAGE=registry.console.elementai.com/$ACCOUNT_ID/job-example
docker build -t $IMAGE .
docker push $IMAGE
# You can use a name instead of the ID if you named your account previously
# Get organization name
export ORG_NAME=$(eai organization get --field name)
# Get account name
export ACCOUNT_NAME=$(eai account get --field name)
export ACCOUNT_ID=$ORG_NAME.$ACCOUNT_NAME
export IMAGE=registry.console.elementai.com/$ACCOUNT_ID/job-example
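If you do not already have a Dockerfile to build from, a minimal one could look like the following sketch (the base image and command are assumptions, purely for illustration):
# write a minimal, illustrative Dockerfile in the current directory
cat > Dockerfile <<'EOF'
FROM ubuntu:18.04
WORKDIR /app
COPY . /app
CMD ["echo", "hello from job-example"]
EOF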
Hint
For tips consult the Toolkit Docker Tips page.
Submitting a Simple Job
To submit a new job, use the eai job new command with the appropriate parameters to describe the resources to be assigned to the job, the environment, the data, the image, the work directory, etc.
Let's try to submit a new job on the cluster requesting a GPU. We can check that we have access to it using the nvidia-smi command.
A job that needs GPU(s) must specify the number using the --gpu option. There are extra options that further refine the type of GPU (--gpu-mem, --gpu-tensor-cores).
# -i specifies the image that will be used. The default is ubuntu:18.04 if not specified.
eai job new -i ubuntu:18.04 --gpu 1 -- nvidia-smi
We’re expecting output from the previous command. Let’s take a look at the job’s logs.
# Get output of the latest job
eai job logs
# Now, for a specific job
eai job logs {JOB_ID}
To see this in action, first start a job that runs for some time and prints output, then follow its logs.
# run a long running job with output
eai job new -- bash -c 'cnt=0; while (( cnt < 300 )); do (( cnt++ )); echo $cnt; sleep 1; done'
# follow the most recently submitted job logs
eai job logs -f
It’s also possible to list jobs that are currently running or that have finished in the last 24h.
eai job ls
Get all your active jobs in the cluster.
eai job ls --state alive
Get all information about a single job.
# information about the latest job
eai job info
# ... about a specific job
eai job info {JOB_ID}
You'll have a lot of options to choose from when submitting a new job. You can peruse them with the -h flag.
eai job new -h
Connecting to a Running Job
To connect to a job, you can use the eai job exec command. Try it with the job already running above:
eai job exec {JOB_ID} bash
Then try running top and type c to see the full command that is currently running.
Killing a Job
Time to kill that job, even though it will only run for 5 minutes.
eai job kill {JOB_ID}
To kill all your live jobs:
eai job kill $(eai job ls --state alive --field id)
Types of Jobs
There are three options to call out here, since they have a critical impact on job scheduling and resource allocation:

--preemptable tells the scheduler that your job is volunteering to be cut off if necessary. This sounds counterintuitive, but since it makes life easier for the scheduler, it will give your job access to more resources. Once such a job has been preempted, it is possible to reschedule it using eai job retry {JOB_ID}.

--restartable tells the scheduler that your job can be preempted, like the option above, but also that it can be rescheduled automatically. Setting either this flag or the --preemptable one allows the cluster to be used more efficiently, since it gives the scheduler the most flexibility in its decision-making. Writing jobs that can be comfortably restarted in the middle of their execution is easy and is helpful in general, just in case unexpected crashes occur (see the sketch after the example below).

--interactive tells the scheduler that this job should be scheduled as soon as possible and should never be halted. Every user can create no more than one of these.

A job that does not set any of these three flags is a non-preemptable job: it will be allowed to run without halting for up to 48 hours, at which point it will be killed and not restarted.
# Same as the earlier nvidia-smi example, but preemptable and requesting two GPUs
eai job new -i ubuntu:18.04 --gpu 2 --preemptable -- nvidia-smi
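As a sketch of what a restartable job can look like, the following illustrative example persists a counter to a file on a mounted data resource, so the job can resume where it left off if it is preempted and rescheduled (the data resource name and the /work mount point are placeholders; creating a data resource is covered in the Data section below):
# illustrative restartable job: the counter is resumed from a file on a mounted data resource
# ({ORG_NAME}.{ACCOUNT_NAME}.{DATA_NAME} and /work are placeholder names)
eai job new --restartable \
  --data {ORG_NAME}.{ACCOUNT_NAME}.{DATA_NAME}:/work \
  -- bash -c 'cnt=$(cat /work/cnt 2>/dev/null || echo 0); while (( cnt < 300 )); do (( cnt++ )); echo $cnt > /work/cnt; echo $cnt; sleep 1; done'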
Data
A data is the primitive used to manage any type of data associated with your job.
# Create a data resource with a name, pushing the content of a folder into it
eai data new {DATA_NAME} {FOLDER_NAME}
# Data resources listing
eai data ls
# Upload data into a data resource based on its ID
eai data push {DATA_ID} {DATA_FOLDER}
# Upload data into a data resource based on its human-readable name
eai data push {ORG_NAME}.{ACCOUNT_NAME}.{DATA_NAME} {DATA_FOLDER}
# More options
eai data -h
Job Using Data Resources
We will build a Resnet-50 pretrained on ImageNet using torchvision and PyTorch.
Code and Dataset to Download
Download the tutorial code here and unzip it into your working directory.
Download the data at https://www.kaggle.com/prasunroy/natural-images. If you do not wish to download the data, we have already uploaded it to the Toolkit as shared.dataset.natural_images. Unzipping the file will create two directories, data/ and natural_images/. You can safely delete data/ as it is a duplicate.
Upload the Dataset into a Data Resource
eai data new nat_imgs ./natural_images
# The dataset can now be referenced as {ORG_NAME}.{ACCOUNT_NAME}.nat_imgs.
Checkpointing
We will save checkpoints throughout the training. Make a new data that will hold the checkpoints.
eai data new checkpoint
# The data can now be referenced as {ORG_NAME}.{ACCOUNT_NAME}.checkpoint.
Model & Training Code
Read model.py to understand how we build a Resnet-50 pretrained on ImageNet using torchvision and PyTorch Lightning.
Make a training script (see train.py). Here we train the model on the dataset using the PyTorch Lightning library. This will automatically log the experiment using TensorBoard. See the PyTorch Lightning documentation for details.
Build and Push your Image and Start your Training
export ORG_NAME=$(eai organization get --field name)
export ACCOUNT_NAME=$(eai account get --field name)
docker build -t registry.console.elementai.com/$ORG_NAME.$ACCOUNT_NAME/tutorial .
docker push registry.console.elementai.com/$ORG_NAME.$ACCOUNT_NAME/tutorial
Submit a new job to the cluster using 2 GPUs and 8 CPUs.
eai job new \
--data $ORG_NAME.$ACCOUNT_NAME.nat_imgs:/app/data/natural_images \
--data $ORG_NAME.$ACCOUNT_NAME.checkpoint:/app/checkpoint \
--image registry.console.elementai.com/$ORG_NAME.$ACCOUNT_NAME/tutorial \
--gpu 2 --cpu 8 --mem 12 --name tutorial_job
# The job is now named $ORG_NAME.$ACCOUNT_NAME.tutorial_job.
# The default port for TensorBoard is 6006; it can be accessed online at https://$ORG_NAME-$ACCOUNT_NAME-tutorial_job-6006.job.console.elementai.com
You can follow your job’s logs to monitor training.
eai job logs -f
Once the training is complete, you can download your checkpoint.
eai data pull $ORG_NAME.$ACCOUNT_NAME.checkpoint
You can also access TensorBoard through a new job.
eai job new \
--data $ORG_NAME.$ACCOUNT_NAME.checkpoint:/app/checkpoint \
--image tensorflow/tensorflow \
-- tensorboard --bind_all --logdir /app/checkpoint
# You can access your job at https://[job-id]-6006.job.console.elementai.com
Dashboard
You can see the resources available and what is running on them on the Overview Dashboard.
Exploring job parameters
You can now discover all the job parameters by going to Job Specification.