Data

A data resource describes any kind of artefact in the EAI Toolkit: a dataset, a function, a model, etc. The Toolkit manages data much as a version-control system would: it tracks changes to a file or a set of files and creates immutable versions that can be recalled later. Data resources are designed for coordinating work among users and for sharing data easily between users, resources and accounts.

We will describe the different concepts and types of operations through a series of examples. Our starting point is a local directory of octopus images that serves as the initial dataset. This simple dataset is composed of two images.

tree octopus_dataset
octopus_dataset
├── 1.png
└── 2.png

0 directories, 2 files

Data Creation

The initial step is to create the data resource that will contain our octopus dataset.

eai data new octopus ./octopus_dataset --verbose

Process file        : 1.png...29 B
Process file        : 2.png...29 B
Uploaded 77 B successfully

id                                   name
70d0d3e2-aec1-48d5-b9ff-f842cbe1cc96 octopus

This command creates the data resource and pushes the content of the specified directory to it. The same result can be obtained in two steps: create the resource with eai data new, then upload the content with eai data push.

# init
eai data new octopus

id                                   name
70d0d3e2-aec1-48d5-b9ff-f842cbe1cc96 octopus

# push command using data id.
eai data push 70d0d3e2-aec1-48d5-b9ff-f842cbe1cc96 ./octopus_dataset --verbose

Process file        : 1.png... 29 B
Process file        : 2.png... 29 B
Uploaded 58 B successfully

# push command using data name.
export ORG_NAME=$(eai organization get --field name)
export ACCOUNT_NAME=$(eai account get --field name)
eai data push $ORG_NAME.$ACCOUNT_NAME.octopus ./octopus_dataset/ --verbose
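The two-step flow above needs the data ID printed by eai data new. As a minimal sketch, assuming the two-line "id name" table shown above is what the command prints, the ID can be picked out with a small filter. The function name extract_data_id is hypothetical and not part of the eai CLI.

```shell
# Hypothetical filter (not an eai command): print the first field of the
# first input line whose first field looks like a UUID, as in the
# "id name" table shown above. Other lines (header, progress) are skipped.
extract_data_id() {
    awk '$1 ~ /^[0-9a-f]+-[0-9a-f]+-[0-9a-f]+-[0-9a-f]+-[0-9a-f]+$/ { print $1; exit }'
}

# Demo on the sample output copied from above.
extract_data_id <<'EOF'
id                                   name
70d0d3e2-aec1-48d5-b9ff-f842cbe1cc96 octopus
EOF
```

If the table is written to standard output, something like DATA_ID=$(eai data new octopus | extract_data_id) would then capture the ID for the push step; this depends on the output layout staying as shown.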

Warning

To use a data short name instead of a data ID, you must have set a name for your account, know your organization name, and have named your data resource.

# To give a name to your account
export ACCOUNT_ID=$(eai account get --field id)
eai account set $ACCOUNT_ID --name bob

Warning

From here on, we will use the data short name instead of the data ID. A short name is composed of the organization name (acme in our example), an account name (bob) and a data name (octopus).

acme.bob.octopus
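The composition rule above can be sketched as a small shell helper. The function short_name is hypothetical (it is not an eai command), and the check that components contain no dot is an assumption made here so that the three parts stay unambiguous.

```shell
# Hypothetical helper (not an eai command): build a data short name from
# the organization, account and data names.
# Assumption: a component is neither empty nor contains a dot.
short_name() (
    for part in "$1" "$2" "$3"; do
        case "$part" in
            ''|*.*) echo "invalid component: '$part'" >&2; exit 1 ;;
        esac
    done
    printf '%s.%s.%s\n' "$1" "$2" "$3"
)

short_name acme bob octopus   # prints acme.bob.octopus
```

The function body runs in a subshell, so a rejected component makes the helper return a non-zero status without ending the calling shell.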

# As a reminder these two commands are equivalent:
eai data ls 1100d30c-aec1-48d5-b9ff-f842cbe1cc96
eai data ls acme.bob

The commands above list all the data resources of an account. To list the branches of a single data resource, use eai data branch ls:

# {DATA_ID}
eai data branch ls 70d0d3e2-aec1-48d5-b9ff-f842cbe1cc96

# Data short name
eai data branch ls acme.bob.octopus

branch version state
latest empty

The resource has one branch, latest, which points to the version empty. To see the content of the resource, use the command eai data content ls with the data ID or name as argument.

eai data content ls acme.bob.octopus@latest

name    status
1.png   (new)
2.png   (new)

If no version is specified, the latest version is used.

eai data content ls acme.bob.octopus

name    status
1.png   (new)
2.png   (new)

Note that the status new marks newly added files. When an existing file has been modified, the status modified is used instead.

Data Branch

# Create a branch from empty as a good practice
eai data branch add acme.bob.octopus@empty v1

branch      version
latest      empty
v1          empty

# push new data on v1. That tag will become dirty
eai data push acme.bob.octopus@v1 1.png

# Data listing of that version, v1 or latest can be used.
eai data content ls acme.bob.octopus@v1

name    status
1.png   (new)

# add a branch name
eai data branch add acme.bob.octopus@empty v2

# (v1 != v2)
eai data content ls acme.bob.octopus@v2
eai data content ls acme.bob.octopus@v1

# push new data on v2. That tag will become dirty
eai data push acme.bob.octopus@v2 3.png

# (v1 != v2)
eai data content ls acme.bob.octopus@v2
eai data content ls acme.bob.octopus@v1

To replace a file already present in the dataset with another file while keeping the same name, rename it during the push operation: give the source file or directory, followed by “:” and the name to use at the destination. The same syntax can be used to add a new file under a different name, for data consistency. If “:” is not specified, the file or directory keeps the same name as the source.

eai data push acme.bob.octopus@v2 ./newoctopus/octopus_1.png:1.png  ./newoctopus/octopus_4.png:4.png

eai data content ls acme.bob.octopus@v2

name    status
1.png   (new)
3.png   (new)
4.png   (new)
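When many files need renaming, the source:destination pairs can be generated rather than typed. The sketch below assumes the local files follow the octopus_N.png pattern used in the example above; rename_args is a hypothetical helper, and the final line only prints the push command instead of running it.

```shell
# Hypothetical helper (not an eai command): for every octopus_N.png in a
# directory, print a source:destination pair that renames it to N.png.
rename_args() (
    dir=$1
    for src in "$dir"/octopus_*.png; do
        [ -e "$src" ] || continue        # the pattern matched no file
        base=${src##*/}                  # e.g. octopus_4.png
        printf '%s:%s\n' "$src" "${base#octopus_}"
    done
)

# Dry run: print the command that would be executed.
# Note: the unquoted substitution splits on whitespace, so this sketch
# does not handle paths containing spaces.
echo eai data push acme.bob.octopus@v2 $(rename_args ./newoctopus)
```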

Let’s assume this data resource is under a shared account with many collaborators. Another collaborator can pull the mutable (dirty) version of that data:

# mutable version

eai data pull acme.bob.octopus@v2 ./dirty
Download 3.0 MB successfully

ls -l dirty
-rw-r--r--  1 bob  staff  1046572 Mar 12 14:47 1.png
-rw-r--r--  1 bob  staff  1034442 Mar 12 14:29 3.png
-rw-r--r--  1 bob  staff   981392 Mar 12 14:47 4.png

Using data resources in jobs

To use data resources in a job, they must be declared as data mounts. Multiple data resources can be mounted in a job.

# Mutable, v1 version
eai job new --data acme.bob.octopus@v1:/dataset:rw -i ubuntu:18.04 \
    -- /bin/bash -c "touch /dataset/new_file && ls /dataset"

eai job logs -f --last
1.png     new_file

eai data content ls acme.bob.octopus@v1

.
./1.png
./new_file (new)

Data versioning (Alpha)

Warning

All of the following is available as an alpha feature. Since the ServiceNow acquisition, priorities have changed: this partially developed feature is unmaintained and no longer under development. Use at your own risk.

To enable data versioning, activate it under your user settings.

# Activate the beta feature for your user
eai user set --feature '{"fusionflex_beta":"v1"}'

To disable data versioning, disable it under your user settings.

# Disable the beta feature for your user
eai user set --feature '{"fusionflex_beta":null}'

To make this version immutable we have to commit our changes.

# Creating a branch from a version is optional but recommended as good practice
eai data commit acme.bob.octopus --branch v1

version                          tags
0e02417491b79662a88da3b689a90da6 [latest v1]

# Data listing of that version, v1 or latest can be used.
eai data content ls acme.bob.octopus@v1
name    status
1.png
2.png

The version is a commit hash, which identifies the immutable part of our data. The latest tag is a moving tag: referring to latest does not necessarily mean referring to this particular commit hash.

A version can be identified either by its version hash or by a branch name.

To demonstrate this concept, we add a branch v2 on the version that branch v1 points to. After this operation, the branches v1 and v2 point to the same data version. We then push new data to the v2 branch. Comparing v1 and v2 now shows different content, and the v2 branch is dirty. To make it immutable we have to commit our changes; that commit creates version v3.

# add a branch name
eai data branch add acme.bob.octopus@0e02417491b79662a88da3b689a90da6 v2

# (v1 == v2)
eai data content ls acme.bob.octopus@v1
eai data content ls acme.bob.octopus@v2

# push new data on v2. That tag will become dirty
eai data push acme.bob.octopus@v2 3.png

# (v1 != v2)
eai data content ls acme.bob.octopus@v2
eai data content ls acme.bob.octopus@v1

# Commit dirty v2 with tag v3.
eai data commit acme.bob.octopus@v2 --branch v3

To replace a file already present in the dataset with another file while keeping the same name, rename it during the push operation: give the source file or directory, followed by “:” and the name to use at the destination. The same syntax can be used to add a new file under a different name, for data consistency. If “:” is not specified, the file or directory keeps the same name as the source.

eai data push acme.bob.octopus@v3 ./newoctopus/octopus_1.png:1.png  ./newoctopus/octopus_4.png:4.png

eai data content ls acme.bob.octopus@v3
name    status
1.png   (modified)
2.png
3.png
4.png   (new)

At this point, 2.png and 3.png are at their committed version, while the changes to 1.png and 4.png are not committed.
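The uncommitted files can also be read off the listing mechanically. The sketch below assumes the two-column "name status" layout shown above and file names without spaces; uncommitted_files is a hypothetical filter, not an eai command.

```shell
# Hypothetical filter (not an eai command): read a content listing on
# stdin and print the names of files that carry a parenthesized status
# such as (new) or (modified). The header line is skipped.
uncommitted_files() {
    awk 'NR > 1 && NF >= 2 && $2 ~ /^\(.*\)$/ { print $1 }'
}

# Demo on the listing copied from above.
uncommitted_files <<'EOF'
name    status
1.png   (modified)
2.png
3.png
4.png   (new)
EOF
```

For this listing the filter prints 1.png and 4.png, one per line.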

Let’s assume this data resource is under a shared account with many collaborators. Another collaborator can pull either the immutable or the mutable (dirty) version of that data:

# immutable version

# To get the immutable version, the version must be prefixed by the keyword ``origin/``
eai data pull acme.bob.octopus@origin/v3 ./clean

Download 2.0 MB successfully

ls -l clean
-rw-r--r--  1 bob  staff  1034442 Mar 12 14:29 1.png
-rw-r--r--  1 bob  staff   981392 Mar 12 14:29 2.png
-rw-r--r--  1 bob  staff   981392 Mar 12 14:29 3.png

# mutable version, note that the 1.png file is not the same one as above.

eai data pull acme.bob.octopus@v3 ./dirty
Download 4.0 MB successfully

ls -l dirty
-rw-r--r--  1 bob  staff  1046572 Mar 12 14:47 1.png
-rw-r--r--  1 bob  staff   981392 Mar 12 14:29 2.png
-rw-r--r--  1 bob  staff  1034442 Mar 12 14:29 3.png
-rw-r--r--  1 bob  staff   981392 Mar 12 14:47 4.png
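The origin/ convention used above can be sketched as a string rewrite on the data reference. committed_ref is a hypothetical helper, not an eai command. It refuses references without an explicit version, because the text above does not say whether origin/ can be combined with the implicit latest.

```shell
# Hypothetical helper (not an eai command): rewrite NAME@VERSION as
# NAME@origin/VERSION, the form used above to pull the immutable version.
committed_ref() (
    ref=$1
    case "$ref" in
        *@origin/*) printf '%s\n' "$ref" ;;    # already prefixed
        *@*)        printf '%s@origin/%s\n' "${ref%%@*}" "${ref#*@}" ;;
        *)          echo "no version in '$ref'" >&2; exit 1 ;;
    esac
)

committed_ref acme.bob.octopus@v3   # prints acme.bob.octopus@origin/v3
```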

Using versioned data resources in jobs

To use data resources in a job, they must be declared as data mounts. It is also recommended to specify whether each mount is immutable (read-only, :ro) or mutable (read-write, :rw).

By default, mounts are mutable. Multiple data resources can be mounted in a job.
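Since an omitted mode silently means read-write, it can help to always spell the mode out. The following sketch builds the value passed to --data from its parts. data_mount is a hypothetical helper, not an eai command, and its rw default simply mirrors the default described above.

```shell
# Hypothetical helper (not an eai command): build a --data value of the
# form REF:PATH:MODE. MODE defaults to rw, mirroring the documented
# default, and anything other than ro or rw is rejected.
data_mount() (
    ref=$1
    path=$2
    mode=${3:-rw}
    case "$mode" in
        ro|rw) ;;
        *) echo "mode must be ro or rw, got '$mode'" >&2; exit 1 ;;
    esac
    printf '%s:%s:%s\n' "$ref" "$path" "$mode"
)

data_mount acme.bob.octopus@v0 /dataset ro   # prints acme.bob.octopus@v0:/dataset:ro
data_mount acme.bob.octopus /dataset         # prints acme.bob.octopus:/dataset:rw
```

The result can be passed to the job command shown in this section, for example as --data "$(data_mount acme.bob.octopus /dataset ro)".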

# Immutable, latest version
# Note the :ro after the working directory
eai job new --data acme.bob.octopus:/dataset:ro -i ubuntu:18.04 \
    -- /bin/bash -c 'ls /dataset'

# logs
eai job logs -f --last
1.png     2.png    3.png    4.png

# Immutable, v0 version
# Note the @v0 after the data ID
# Note the :ro after the working directory
eai job new --data acme.bob.octopus@v0:/dataset:ro -i ubuntu:18.04 \
    -- /bin/bash -c 'ls /dataset'

eai job logs -f --last
1.png     2.png

# Mutable, v0 version
# Note the :rw after the working directory
eai job new --data acme.bob.octopus@v0:/dataset:rw -i ubuntu:18.04 \
    -- /bin/bash -c "touch /dataset/new_file && ls /dataset"

eai job logs -f --last
1.png     2.png     new_file

eai data content ls acme.bob.octopus@v0

name        status
1.png
2.png
new_file    (new)