Data
A Data resource describes any kind of artefact in the EAI Toolkit: a dataset, a function, a model, etc. The Toolkit manages data in a similar manner to a version-control system: it tracks changes in a file or a set of files, and it creates immutable versions that can be recalled later. Data resources are designed for coordinating work among users and for sharing data easily between users, resources and accounts.
We will describe the different concepts and types of operations through a series of examples. The premise is that our initial local dataset is a directory of octopus images. This simple dataset is composed of two images.
tree octopus_dataset
octopus_dataset
├── 1.png
└── 2.png
0 directories, 2 files
Data Creation
The initial step is to create the data resource that will contain our octopus dataset.
eai data new octopus ./octopus_dataset --verbose
Process file : 1.png...29 B
Process file : 2.png...29 B
Uploaded 77 B successfully
id name
70d0d3e2-aec1-48d5-b9ff-f842cbe1cc96 octopus
That command created the data resource and pushed the content of the specified directory to it. The same result can be achieved in two steps: create the resource first, then push the content to it.
# init
eai data new octopus
id name
70d0d3e2-aec1-48d5-b9ff-f842cbe1cc96 octopus
# push command using data id.
eai data push 70d0d3e2-aec1-48d5-b9ff-f842cbe1cc96 ./octopus_dataset --verbose
Process file : 1.png... 29 B
Process file : 2.png... 29 B
Uploaded 58 B successfully
# push command using data name.
export ORG_NAME=$(eai organization get --field name)
export ACCOUNT_NAME=$(eai account get --field name)
eai data push $ORG_NAME.$ACCOUNT_NAME.octopus ./octopus_dataset/ --verbose
Warning
To use the data short name instead of the data ID, you must have set a name for your account, know your organization name, and have named your data resource.
# To give a name to your account
export ACCOUNT_ID=$(eai account get --field id)
eai account set $ACCOUNT_ID --name bob
Warning
From here on, we will use the data short name instead of the data ID. A short name is composed of the organization name (acme in our example), an account name (bob) and a data name (octopus).
acme.bob.octopus
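The composition rule above can be sketched as a small shell helper. The function name `data_short_name` is hypothetical and is not part of the eai CLI; it only joins the three parts with dots, as in the acme.bob.octopus example.

```shell
# Hypothetical helper (not an eai command): build a data short name
# from an organization name, an account name and a data name.
data_short_name() {
    if [ "$#" -ne 3 ]; then
        echo "usage: data_short_name ORG ACCOUNT DATA" >&2
        return 1
    fi
    printf '%s.%s.%s\n' "$1" "$2" "$3"
}

data_short_name acme bob octopus   # prints acme.bob.octopus
```

The result can then be passed wherever the examples in this page use a data short name.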
# As a reminder these two commands are equivalent:
eai data ls 1100d30c-aec1-48d5-b9ff-f842cbe1cc96
eai data ls acme.bob
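Both commands list all the data resources of an account. To list the branches of a data resource, pass its ID or its short name to eai data branch ls: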
# {DATA_ID}
eai data branch ls 70d0d3e2-aec1-48d5-b9ff-f842cbe1cc96
# Data short name
eai data branch ls acme.bob.octopus
branch version state
latest empty
There is a version empty with the branch latest.
It is possible to see the content of this resource by using the command eai data content ls (with the Data ID/Name as argument).
eai data content ls acme.bob.octopus@latest
name status
1.png (new)
2.png (new)
If no version is specified, the latest version will be used.
eai data content ls acme.bob.octopus
name status
1.png (new)
2.png (new)
Note that the keyword new is used to mark newly added files. When a file has been modified, the keyword modified is used.
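The name@version form and the default to latest can be illustrated with two small shell helpers. Both function names are hypothetical and not part of the eai CLI; this is only a sketch of how a script might split a reference before calling commands such as eai data content ls.

```shell
# Hypothetical helpers (not eai commands).

# Print the data name: everything before the first "@".
data_ref_name() {
    printf '%s\n' "${1%%@*}"
}

# Print the version: everything after the first "@".
# When no version is given, fall back to "latest", as described above.
data_ref_version() {
    case "$1" in
        *@?*) printf '%s\n' "${1#*@}" ;;
        *)    printf 'latest\n' ;;
    esac
}

data_ref_name acme.bob.octopus@v1      # prints acme.bob.octopus
data_ref_version acme.bob.octopus@v1   # prints v1
data_ref_version acme.bob.octopus      # prints latest
```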
Data Branch
# Create a branch from empty as a good practice
eai data branch add acme.bob.octopus@empty v1
branch version
latest empty
v1 empty
# push new data on v1. That tag will become dirty
eai data push acme.bob.octopus@v1 1.png
# Data listing of that version, v1 or latest can be used.
eai data content ls acme.bob.octopus@v1
name status
1.png (new)
# add a branch name
eai data branch add acme.bob.octopus@empty v2
# (v1 != v2)
eai data content ls acme.bob.octopus@v2
eai data content ls acme.bob.octopus@v1
# push new data on v2. That tag will become dirty
eai data push acme.bob.octopus@v2 3.png
# (v1 != v2)
eai data content ls acme.bob.octopus@v2
eai data content ls acme.bob.octopus@v1
If you have to replace a file already present in the dataset with another file and want to keep the same name, you can do it during the push operation. Give the source file/directory followed by “:” and the new name at the destination. The same syntax can be used to add a new file and rename it for data consistency. If “:” is not specified, the file/directory keeps the same name as the source.
eai data push acme.bob.octopus@v2 ./newoctopus/octopus_1.png:1.png ./newoctopus/octopus_4.png:4.png
eai data content ls acme.bob.octopus@v2
name status
1.png (new)
3.png (new)
4.png (new)
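The source:destination rule described above can be sketched as follows. The helper `push_dest_name` is hypothetical, not an eai command, and it assumes the argument is a single file (directories are not handled): it prints the name the file would have at the destination.

```shell
# Hypothetical helper (not an eai command): destination name for one
# file argument of "eai data push".
push_dest_name() {
    case "$1" in
        # "source:dest" form: the destination is the part after the last ":".
        *:*) printf '%s\n' "${1##*:}" ;;
        # No ":" given: the file keeps its own name (path stripped).
        *)   printf '%s\n' "${1##*/}" ;;
    esac
}

push_dest_name ./newoctopus/octopus_1.png:1.png   # prints 1.png
push_dest_name 3.png                              # prints 3.png
```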
Let’s assume this data resource is under a shared account with many collaborators.
Another collaborator can pull the mutable (dirty) version of that data:
# mutable version
eai data pull acme.bob.octopus@v2 ./dirty
Download 3.0 MB successfully
ls -l dirty
-rw-r--r-- 1 bob staff 1046572 Mar 12 14:47 1.png
-rw-r--r-- 1 bob staff 1034442 Mar 12 14:29 3.png
-rw-r--r-- 1 bob staff 981392 Mar 12 14:47 4.png
Using data resources in jobs
To use data resources in a job, they must be declared as a data mount. Multiple data can be mounted under a job.
# Mutable, v0 version
eai job new --data acme.bob.octopus@v0:/dataset:rw -i ubuntu:18.04 \
    -- /bin/bash -c "touch /dataset/new_file && ls /dataset"
eai job logs -f --last
1.png new_file
eai data content ls acme.bob.octopus@v0
.
./1.png
./new_file (new)
Data versioning (Alpha)
Warning
All of the following is available as an alpha feature. Since the ServiceNow acquisition, priorities have changed and this partially developed feature is unmaintained and not under development. Use at your own risk.
To enable data versioning, activate it under your user settings.
# Activate the beta feature for your user
eai user set --feature '{"fusionflex_beta":"v1"}'
To disable data versioning, disable it under your user settings.
# Disable the beta feature for your user
eai user set --feature '{"fusionflex_beta":null}'
To make this version immutable we have to commit our changes.
# Creating a branch from a version is optional but recommended as a good practice
eai data commit acme.bob.octopus --branch v1
version tags
0e02417491b79662a88da3b689a90da6 [latest v1]
# Data listing of that version, v1 or latest can be used.
eai data content ls acme.bob.octopus@v1
name status
1.png
2.png
The version is a commit hash, which is the immutable part of our data.
The latest tag is a moving tag, which means that referring to latest does not mean referring to this particular commit hash.
To identify a version it is possible to use the version hash or branch name.
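Because a branch name such as latest can move while a version hash cannot, a script may want to check which kind of identifier it was given. The sketch below is an assumption-heavy illustration: the helper `is_version_hash` is hypothetical, and it assumes that every version hash looks like the one shown above (32 lowercase hexadecimal characters), which this page does not guarantee.

```shell
# Hypothetical helper (not an eai command).
# Assumption: a version hash is exactly 32 lowercase hex characters,
# like 0e02417491b79662a88da3b689a90da6 in the example above.
is_version_hash() {
    case "$1" in
        # Any character outside 0-9a-f means it is not a hash.
        *[!0-9a-f]*) return 1 ;;
    esac
    [ "${#1}" -eq 32 ]
}

for v in latest v1 0e02417491b79662a88da3b689a90da6; do
    if is_version_hash "$v"; then
        echo "$v: version hash (fixed)"
    else
        echo "$v: branch name (may move)"
    fi
done
```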
To demonstrate this concept we will add a branch v2 on the version that branch v1 points to. After this operation the branches v1 and v2 point to the same data version. Then we push new data to the v2 branch. If we compare v1 and v2 afterwards, their contents differ, and the v2 branch is now dirty. To make it immutable we have to commit our changes. That commit will create a new version under the branch v3.
# add a branch name
eai data branch add acme.bob.octopus@0e02417491b79662a88da3b689a90da6 v2
# (v1 == v2)
eai data content ls acme.bob.octopus@v1
eai data content ls acme.bob.octopus@v2
# push new data on v2. That tag will become dirty
eai data push acme.bob.octopus@v2 3.png
# (v1 != v2)
eai data content ls acme.bob.octopus@v2
eai data content ls acme.bob.octopus@v1
# Commit dirty v2 with tag v3.
eai data commit acme.bob.octopus@v2 --branch v3
If you have to replace a file already present in the dataset with another file and want to keep the same name, you can do it during the push operation. Give the source file/directory followed by “:” and the new name at the destination. The same syntax can be used to add a new file and rename it for data consistency. If “:” is not specified, the file/directory keeps the same name as the source.
eai data push acme.bob.octopus@v3 ./newoctopus/octopus_1.png:1.png ./newoctopus/octopus_4.png:4.png
eai data content ls acme.bob.octopus@v3
name status
1.png (modified)
2.png
3.png
4.png (new)
At this point, 2.png and 3.png are at their committed version, while the changes to 1.png and 4.png are not committed.
Let’s assume this data resource is under a shared account with many collaborators.
Another collaborator can pull the immutable or the mutable (dirty) version of that data:
# immutable version
# To get the immutable version, the version must be prefixed by the keyword ``origin/``
eai data pull acme.bob.octopus@origin/v3 ./clean
Download 2.0 MB successfully
ls -l clean
-rw-r--r-- 1 bob staff 1034442 Mar 12 14:29 1.png
-rw-r--r-- 1 bob staff 981392 Mar 12 14:29 2.png
-rw-r--r-- 1 bob staff 981392 Mar 12 14:29 3.png
# mutable version, note that the 1.png file is not the same one as above.
eai data pull acme.bob.octopus@v3 ./dirty
Download 4.0 MB successfully
ls -l dirty
-rw-r--r-- 1 bob staff 1046572 Mar 12 14:47 1.png
-rw-r--r-- 1 bob staff 981392 Mar 12 14:29 2.png
-rw-r--r-- 1 bob staff 1034442 Mar 12 14:29 3.png
-rw-r--r-- 1 bob staff 981392 Mar 12 14:47 4.png
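The ``origin/`` prefix rule used above can be sketched as a helper that turns a reference such as acme.bob.octopus@v3 into its immutable form. The function name `committed_ref` is hypothetical and not part of the eai CLI. Since this page does not say what happens when no version is given, the sketch refuses that case rather than guessing.

```shell
# Hypothetical helper (not an eai command): prefix the version of a
# data reference with "origin/" to designate the immutable version.
committed_ref() {
    case "$1" in
        # Already prefixed: return it unchanged.
        *@origin/*) printf '%s\n' "$1" ;;
        # name@version -> name@origin/version
        *@?*) printf '%s@origin/%s\n' "${1%%@*}" "${1#*@}" ;;
        # No version given: behaviour is not documented here, so fail.
        *) echo "committed_ref: no version in '$1'" >&2; return 1 ;;
    esac
}

committed_ref acme.bob.octopus@v3   # prints acme.bob.octopus@origin/v3
```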
Using versioned data resource in jobs
To use data resources in a job, they must be declared as data mounts. In addition, it is recommended to state whether each mount is immutable (read-only, ``:ro``) or mutable (read-write, ``:rw``).
By default, mounts are mutable. Multiple data resources can be mounted in a job.
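The value given to --data in the examples of this page has the form reference:mount path:mode. As a minimal sketch, the hypothetical helper `data_mount` below (not an eai command) assembles that string, defaults the mode to rw to mirror the default stated above, and rejects any other mode.

```shell
# Hypothetical helper (not an eai command): build the value of a
# --data option from a data reference, a mount path and a mode.
data_mount() {
    if [ "$#" -lt 2 ]; then
        echo "usage: data_mount REF PATH [ro|rw]" >&2
        return 1
    fi
    case "${3:-rw}" in
        ro|rw) ;;
        *) echo "data_mount: mode must be ro or rw" >&2; return 1 ;;
    esac
    # Mode defaults to rw (mutable) when omitted.
    printf '%s:%s:%s\n' "$1" "$2" "${3:-rw}"
}

data_mount acme.bob.octopus@v0 /dataset ro   # prints acme.bob.octopus@v0:/dataset:ro
data_mount acme.bob.octopus /dataset         # prints acme.bob.octopus:/dataset:rw
```

The output can be substituted into a job command, for example --data "$(data_mount acme.bob.octopus@v0 /dataset ro)".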
# Immutable, latest version
# Note the :ro after the working directory
eai job new --data acme.bob.octopus:/dataset:ro -i ubuntu:18.04 \
    -- /bin/bash -c 'ls /dataset'
# logs
eai job logs -f --last
1.png 2.png 3.png 4.png
# Immutable, v0 version
# Note the @v0 after the data ID
# Note the :ro after the working directory
eai job new --data acme.bob.octopus@v0:/dataset:ro -i ubuntu:18.04 \
    -- /bin/bash -c 'ls /dataset'
eai job logs -f --last
1.png 2.png
# Mutable, v0 version
# Note the :rw after the working directory
eai job new --data acme.bob.octopus@v0:/dataset:rw -i ubuntu:18.04 \
    -- /bin/bash -c "touch /dataset/new_file && ls /dataset"
eai job logs -f --last
1.png 2.png new_file
eai data content ls acme.bob.octopus@v0
name status
1.png
2.png
new_file (new)