Guide to High Level Interface
There are two library modules, CommandLib and MessageLib, that provide high- and low-level application programming interfaces (APIs), respectively, which can be used for Python scripting or custom application development.
The high-level API is provided through an API class with methods that are very similar to the commands available in the DataFed command-line interface (CLI). Unlike the CLI, the API methods are functions that accept Python parameters and return reply messages (as Python objects) instead of text or JSON.
The low-level API, as the module name implies, exposes the binary message-passing interface used by DataFed and is intended for more complex applications. A user guide to the low-level API currently does not exist but will be provided in the future.
Note
While not recommended for general use, there is a CLI library module in the DataFed client package that implements the DataFed CLI but also provides an accessible “command” function that allows text-based CLI commands to be executed directly from a Python script (without requiring a system call).
This is a brief user guide that illustrates the usage of the high-level CommandLib Python API. It is not meant to be an exhaustive tutorial on using CommandLib. Instead, we cover the functions in CommandLib that would be used in most data orchestration scripts and custom software based on DataFed. Users are encouraged to refer to the extensive documentation of DataFed’s CommandLib.CLI class for comprehensive information on all of its functions.
Getting Started
Users are recommended to follow the:
getting-started guide to get accounts and allocations on DataFed
installation instructions to install and configure the DataFed Python package on the machine(s) where they intend to use DataFed
Note
Ensure that the Globus endpoint associated with the machine where you use DataFed is active.
Caution
Ensure that all DataFed Get and Put operations are within a directory that Globus has write access to.
Otherwise, you will notice a Permission Denied
error in your data transfer task messages.
Import package
We start by importing just the API class within datafed.CommandLib as shown below. We also import json to simplify the process of communicating metadata with DataFed, along with a few standard-library modules used later in this guide.
import json # For dealing with metadata
import os # For file level operations
import time # For timing demonstrations
import datetime # To demonstrate conversion between date and time formats
from datafed.CommandLib import API
Create instance
Finally, we create an instance of the DataFed API class via:
df_api = API()
Assuming that the DataFed client has been installed and set up with local user credentials, and our default Globus ID has been configured correctly, we can now use df_api to communicate with DataFed as an authenticated user. If not, refer back to the installation instructions.
Note
In addition to supporting local user credentials for automatic log-in, the DataFed high-level interface also provides functions for checking user authentication status and for logging users in or out using password credentials.
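For instance, a script could verify that it is authenticated before proceeding. This is a minimal sketch that assumes the method names getAuthUser() and loginByPassword() are the helpers referenced above; the credentials shown are placeholders:
# Sketch only: getAuthUser() and loginByPassword() are assumed to be the
# authentication helpers mentioned in the note above; credentials are placeholders.
if not df_api.getAuthUser():
    df_api.loginByPassword('my_datafed_username', 'my_password')
print('Authenticated as:', df_api.getAuthUser())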
DataFed Responses & Projects
DataFed functions and responses
Typically, users would be working in the context of a DataFed Project, which would have been created by the project’s principal investigator(s) or other administrators, rather than in the user’s own personal root collection.
First, let’s try to find the projects we are part of using the projectList() function in DataFed:
pl_resp = df_api.projectList()
print(pl_resp)
(item {
id: "p/abc123"
title: "ABC123: Important Project"
owner: "u/breetju"
}
item {
id: "p/sns.dvs.1"
title: "SNS BL-11A"
owner: "u/stansberrydv"
}
item {
id: "p/trn001"
title: "TRN001 : DataFed Training"
owner: "u/somnaths"
}
offset: 0
count: 20
total: 3
, 'ListingReply')
DataFed typically responds to functions with messages.
It is important to get comfortable with these messages and extracting information from them if one is interested in using this interface to automate data orchestration.
Let’s dig into this object layer-by-layer:
The first layer is typically a tuple of size 2:
type(pl_resp), len(pl_resp)
(tuple, 2)
This tuple usually contains two key objects:
a message containing the information requested from DataFed
the type of that message, which allows us to interpret the reply and parse its fields correctly – in this case, our message is in the form of a
'ListingReply'
.
A simple check of the object type will confirm the type of our core Google Protocol Buffer message:
type(pl_resp[0])
google.protobuf.internal.python_message.ListingReply
ListingReply
is one of a handful of different message types that DataFed replies with across all its many functions.
We will be encountering most of the different types of messages in this user guide.
Interested users are encouraged to read official documentation and examples about Google Protobuf.
Protobuf messages are powerful objects that not only allow quick access to the information stored in their defined fields, but are also nominally subscriptable and iterable in Python.
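As a convenience (this is standard protobuf tooling, not part of DataFed itself), a reply message can also be converted into a plain Python dictionary for easier inspection:
from google.protobuf.json_format import MessageToDict

# Convert the ListingReply message into a regular Python dictionary
pl_dict = MessageToDict(pl_resp[0])
print(pl_dict['item'][0]['title'])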
Subscripting message objects
Besides the main information about the different projects, this ListingReply
also provides some contextual information
such as the:
count - Maximum number of items that could be listed in this message (i.e., the page size)
total - Total number of items in the listing, across all pages
offset - Number of items skipped from earlier listings; stepping this value provides the concept of page numbers
Though we won’t be needing the information in this case, here is how we might get the offset
:
pl_resp[0].offset
0
Accessing the item
component produces the actual listing of projects in the message:
len(pl_resp[0].item)
3
Now, if we wanted to get the title
field of the third project in the listing, we would access it as:
pl_resp[0].item[2].title
"TRN001 : DataFed Training"
Note
We will be accessing many fields in messages going forward. Users are recommended to revisit this section to remind themselves how to peel each layer of the message to get to the desired field since we will jump straight into using a single line of code to access the desired information henceforth in the interest of brevity.
Iterating through message items
Let’s say we wanted to print out the ID and owner of each of the projects in the listing; we could iterate through the items as:
for proj in pl_resp[0].item:
print(proj.id, '\t', proj.owner)
p/abc123 u/breetju
p/sns.dvs.1 u/stansberrydv
p/trn001 u/somnaths
Exploring projects
We can take a look at basic information about a project using the projectView()
function:
df_api.projectView('p/trn001')
(proj {
id: "p/trn001"
title: "TRN001 : DataFed Training"
desc: "DataFed Training project"
owner: "u/somnaths"
ct: 1610905375
ut: 1610912585
admin: "u/stansberrydv"
admin: "u/breetju"
alloc {
repo: "cades-cnms"
data_limit: 1073741824
data_size: 0
rec_limit: 1000
rec_count: 0
path: "/data10t/cades-cnms/project/trn001/"
}
}
, 'ProjectDataReply')
Note that we got a different kind of reply from DataFed - a ProjectDataReply object. The methodology for accessing information in these objects is identical to that described above. This response provides additional details such as the administrators, creation date, and data allocation, which can be handy for members or administrators of several projects.
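For example, a script might check how much of the project’s allocation has been used. The field indexing below is an assumption that proj and alloc are repeated fields, following the same pattern as item above:
pv_resp = df_api.projectView('p/trn001')
# Assumption: 'proj' and 'alloc' are repeated fields, like 'item' in the ListingReply
alloc = pv_resp[0].proj[0].alloc[0]
print('{} of {} bytes used on {}'.format(alloc.data_size, alloc.data_limit, alloc.repo))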
Contexts, aliases & IDs
Just as people have various facets to their lives, such as the personal and the professional, DataFed offers a similar capability via contexts. Users in DataFed have their own Personal Data context as well as other contexts in the form of Projects, as we have seen above.
Default context
We can always ask DataFed what context
it is using via the getContext()
function:
print(df_api.getContext())
'u/somnaths'
As mentioned earlier, DataFed typically replies with a Google Protobuf message object.
However, getContext()
is among the few functions where DataFed returns a simple string.
The return value from getContext()
reveals that DataFed is assuming that we intend to work
within the User’s Personal Data
.
Note
DataFed starts with its context set by default to the User’s Personal Data rather than to any Project.
Caution
Though the CommandLib interface of DataFed sets the default context to the User’s
Personal Data
, it is necessary that the user have a valid data allocation
to create and store data in their Personal Data
context.
There are two ways to set the context: one can either set the context only within the scope of a single function call or change the default context.
Context per function
Every space in DataFed, regardless of whether it is a Project or the user’s own Personal Data, contains a Collection called root, which holds all other Data Records and Collections within this space.
Let us take a look at the root
Collection in the Training project.
In order to look at the Collection, we will be using the collectionView()
function.
We will be going over this specific function later in greater detail,
but will use it here to illustrate another concept.
Since we are interested in the root Collection within the context of the Training Project, and not the user’s Personal Data (the current, default context), we can specify the context for this function call using the context keyword argument as:
print(df_api.collectionView('root', context='p/trn001'))
(coll {
id: "c/p_trn001_root"
title: "Root Collection"
alias: "root"
desc: "Root collection for project trn001"
owner: "p/trn001"
notes: 0
}, 'CollDataReply')
This function returns a different, yet somewhat similar response to that from the projectView()
function - a CollDataReply
object.
The desc field in the above response confirms that we did in fact get information regarding the root Collection belonging to the Training project and not the one in the user’s Personal Data space.
Let’s see what would have happened if we did not specify the context
via the keyword argument:
print(df_api.collectionView('root'))
(coll {
id: "c/u_somnaths_root"
title: "root"
desc: "Root collection for user Suhas Somnath (somnaths)"
owner: "u/somnaths"
notes: 0
}, 'CollDataReply')
From the desc field in the above output, we observe that simply asking for the root Collection returns information about the user’s Personal Data rather than the root Collection in the Training project.
Contents of contexts
Now that we know how to get to the correct root
Collection,
we can take a look at the contents of the project by listing everything in the project’s
root
collection using the collectionItemsList()
function as shown below:
ls_resp = df_api.collectionItemsList('root', context='p/trn001')
print(ls_resp)
(item {
id: "c/34559341"
title: "breetju"
alias: "breetju"
owner: "p/trn001"
notes: 0
}
item {
id: "c/34559108"
title: "PROJSHARE"
alias: "projshare"
owner: "p/trn001"
notes: 0
}
item {
id: "c/34558900"
title: "somnaths"
alias: "somnaths"
owner: "p/trn001"
notes: 0
}
item {
id: "c/34559268"
title: "stansberrydv"
alias: "stansberrydv"
owner: "p/trn001"
notes: 0
}
offset: 0
count: 20
total: 4, 'ListingReply')
Just as in the projectList()
function, this function too returns a ListingReply
message.
Here, we see that the administrator of the project has created some Collections for the private use of project members, as well as a collaborative space called PROJSHARE.
Note
Not all projects would be structured in this manner.
Alias vs ID
So far, we have been addressing Collections via their alias - a human-readable unique identifier.
Though aliases are indeed a convenient way to address items in DataFed, there are a few things to keep in mind:
Note
The alias
for a Data Record or Collection is unique only within a user’s Personal Data
or Project
context.
One would need to supply the context when addressing a Record or Collection via its alias. Not supplying the context when addressing an item via its alias results in an error:
df_api.collectionItemsList('somnaths')
---------------------------------------------------------------------------
Exception Traceback (most recent call last)
<ipython-input-20-acb948617f34> in <module>
----> 1 df_api.collectionItemsList('somnaths')
//anaconda/lib/python3.5/site-packages/datafed/CommandLib.py in collectionItemsList(self, coll_id, offset, count, context)
757 msg.id = self._resolve_id( coll_id, context )
758
--> 759 return self._mapi.sendRecv( msg )
760
761
//anaconda/lib/python3.5/site-packages/datafed/MessageLib.py in sendRecv(self, msg, timeout, nack_except)
299 self.send( msg )
300 _timeout = (timeout if timeout != None else self._timeout)
--> 301 reply, mt, ctxt = self.recv( _timeout, nack_except )
302 if reply == None:
303 return None, None
//anaconda/lib/python3.5/site-packages/datafed/MessageLib.py in recv(self, timeout, nack_except)
343 if msg_type == "NackReply" and _nack_except:
344 if reply.err_msg:
--> 345 raise Exception(reply.err_msg)
346 else:
347 raise Exception("Server error {}".format( reply.err_code ))
Exception: Alias 'somnaths' does not exist
(source: dbGet:126 code:1)
Note
All Data Records and Collections always have a unique alphanumeric identifier or ID, even if the user did not specify a human-friendly alias.
An alternate way to address a Data Record or Collection is via its ID
:
df_api.collectionView('c/34558900')
(coll {
id: "c/34558900"
title: "somnaths"
alias: "somnaths"
owner: "p/trn001"
ct: 1610905632
ut: 1610905667
notes: 0
}, 'CollDataReply')
We observe that we can successfully get information about an entity in DataFed using its ID.
Note
IDs for Records, Collections, etc. are unique across all of DataFed, and not just within a narrow scope such as that of a Project or a user’s space. It is therefore unnecessary to provide the context when addressing an item via its unique ID.
However, one would need to carefully extract the (automatically generated) ID of the Collection or Data Record of interest from the DataFed response in order to use it in subsequent code within a script.
Caution
When working within the context
of a Project with several collaborators,
there is a possibility that two users may use the same alias
for a Record or a Collection.
Managing aliases within Projects:
There is no single solution to this problem. However, here are some suggestions:
Team members of the project should coordinate and collaboratively assign aliases
Individual members elect to avoid using aliases within the context of their personal Collections
Individual members manually prefix aliases for items within their personal Collections with their initials (hopefully unique within the Project)
Manual context management
In this user guide, we will work within the context of the Training project. In order to continue working within this context - creating Data Records, Collections, etc. within this space - we need to minimize ambiguity about the context.
A naive approach is to simply define a Python variable and pass it to every function call instead of typing the context out manually as we have done above:
context = 'p/trn001' # DataFed ID for the training project
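Every subsequent call would then pass this variable explicitly, for example by repeating the earlier collectionView() call (output omitted):
# Explicitly pass the context variable to every call
print(df_api.collectionView('root', context=context))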
Note
Please change the context
variable to suit your own project.
If you want to work within your own Personal Data
space,
set context
to None
.
Caution
Accidentally forgetting to specify the context
keyword argument in functions could
result in incorrect data management operations.
Set default context
Keeping track of and remembering to specify the context keyword argument for every function call can be tedious if one is certain to be working within a single context.
In such cases, DataFed provides the setContext()
function that allows the user to
specify the default context going forward:
df_api.setContext('p/trn001')
Note
setContext() is valid only within the scope of a single Python process. The user would need to call the function each time they instantiate the DataFed CommandLib.API class.
Now, one could operate on items within the project without having to specify the context keyword argument. For example, looking up the somnaths Collection by its alias, which failed earlier, now works:
df_api.collectionView('somnaths')
(coll {
id: "c/34558900"
title: "somnaths"
alias: "somnaths"
owner: "p/trn001"
ct: 1610905632
ut: 1610905667
notes: 0
}, 'CollDataReply')
If we wanted to temporarily operate on a different context such as the user’s Personal Data
,
we would need to specify the context
keyword argument explicitly for those function calls.
Set working collection
In this specific case, the Project has been organized to provide each user with their own private Collection. We can use a Python variable to help ensure that any Data Records or Collections we want to create in our private space are created within our own Collection (somnaths in this case) rather than creating clutter in the root Collection of the project:
dest_collection = 'somnaths' # Destination collection
Note
Please change the dest_collection
variable to suit your own project.
If you want to work within the project’s root
collection, set dest_collection
to root
.
Data Records
Prepare (scientific) metadata
DataFed accepts scientific metadata as a JSON string (typically prepared as a Python dictionary and then serialized) or as a JSON file. Here, we simply create a dictionary with fake metadata in place of real metadata:
parameters = {
'a': 4,
'b': [1, 2, -4, 7.123],
'c': 'Something important',
'd': {'x': 14, 'y': -19} # Can use nested dictionaries
}
Create Data Record
Until a future version of DataFed can accept a Python dictionary directly (instead of a JSON file or a JSON string) for the metadata, we will need to use the json.dumps() function to turn our metadata dictionary parameters into a JSON string, or write the dictionary to a JSON file:
dc_resp = df_api.dataCreate('my important data',
metadata=json.dumps(parameters),
parent_id=dest_collection, # parent collection
)
Here, the parent_id
was set to the dest_collection
variable, as this variable contains the alias of our
personal collection within the project, in which our data record will be created.
Leaving this unspecified is equivalent to the default value of root, which means that the Data Record would be created within the root Collection of the project.
Extract Record ID
Let’s look at the response we got for the dataCreate()
function call:
print(dc_resp)
(data {
id: "d/34682319"
title: "my important data"
metadata: "{\"a\":4,\"b\":[1,2,-4,7.123],\"c\":\"Something important\",\"d\":{\"x\":14,\"y\":-19}}"
repo_id: "repo/cades-cnms"
size: 0.0
ext_auto: true
ct: 1611077217
ut: 1611077217
owner: "p/trn001"
creator: "u/somnaths"
parent_id: "c/34558900"
}, 'RecordDataReply')
DataFed returned a RecordDataReply
object, which contains crucial pieces of information regarding the record.
Note
In the future, the dataCreate() function will, by default, return only the ID of the record (instead of such a verbose response) when it successfully creates the Data Record. We expect to be able to continue to get the verbose response through an optional argument.
Such detailed information regarding the record can always be obtained via the dataView()
function.
Similar to getting the title from the project information, if we wanted to get the record ID to be used for later operations, here’s how we could go about it:
record_id = dc_resp[0].data[0].id
print(record_id)
'd/34682319'
Edit Record information
All information about Data Records, besides the unique ID, can be edited using the dataUpdate() command. For example, if we wanted to change the title, add a human-readable unique alias, and add to the scientific metadata, we would do so as follows:
du_resp = df_api.dataUpdate(record_id,
title='Some new title for the data',
alias='my_first_dataset',
metadata=json.dumps({'appended_metadata': True})
)
print(du_resp)
(data {
id: "d/34682319"
title: "Some new title for the data"
alias: "my_first_dataset"
repo_id: "repo/cades-cnms"
size: 0.0
ext_auto: true
ct: 1611077217
ut: 1611077220
owner: "p/trn001"
creator: "u/somnaths"
notes: 0
}
update {
id: "d/34682319"
title: "Some new title for the data"
alias: "my_first_dataset"
owner: "p/trn001"
creator: "u/somnaths"
size: 0.0
notes: 0
deps_avail: true
}
, 'RecordDataReply')
Note
In the future, the dataUpdate() command will return only an acknowledgement of the successful execution of the update.
View Record information
Since the response from the dataUpdate() function does not include the metadata, we can always get the most comprehensive and up-to-date information about a Data Record via the dataView() function:
dv_resp = df_api.dataView(record_id)
print(dv_resp)
(data {
id: "d/34682319"
title: "Some new title for the data"
alias: "my_first_dataset"
metadata: "{\"a\":4,\"appended_metadata\":true,\"b\":[1,2,-4,7.123],\"c\":\"Something important\",\"d\":{\"x\":14,\"y\":-19}}"
repo_id: "repo/cades-cnms"
size: 0.0
ext_auto: true
ct: 1611077217
ut: 1611077220
owner: "p/trn001"
creator: "u/somnaths"
notes: 0
}, 'RecordDataReply')
The dates and times in Data Records are encoded in the Unix time format and can be converted to familiar Python datetime objects via fromtimestamp():
datetime.datetime.fromtimestamp(dv_resp[0].data[0].ct)
datetime.datetime(2021, 1, 19, 12, 26, 57)
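The same conversion applies to the other timestamps in the record, for example the last-update time:
# Convert both the creation (ct) and last-update (ut) timestamps
created = datetime.datetime.fromtimestamp(dv_resp[0].data[0].ct)
updated = datetime.datetime.fromtimestamp(dv_resp[0].data[0].ut)
print('created: {}, last updated: {}'.format(created, updated))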
Extract metadata
As the response above shows, the metadata is also part of the response we got from dataView()
.
By default, the metadata in the response is formatted as a JSON string:
print(dv_resp[0].data[0].metadata)
"{\"a\":4,\"appended_metadata\":true,\"b\":[1,2,-4,7.123],\"c\":\"Something important\",\"d\":{\"x\":14,\"y\":-19}}"
In order to get back a Python dictionary, use json.loads():
print(json.loads(dv_resp[0].data[0].metadata))
{'a': 4,
'appended_metadata': True,
'b': [1, 2, -4, 7.123],
'c': 'Something important',
'd': {'x': 14, 'y': -19}}
We can clearly observe that both the original and the new metadata are present in the record.
Replace metadata
In the example above, we appended metadata to existing metadata, which is the default manner in which dataUpdate()
operates.
If desired, we could completely replace the metadata by setting metadata_set
to True
as in:
du_resp = df_api.dataUpdate(record_id,
metadata=json.dumps({'p': 14, 'q': 'Hello', 'r': [1, 2, 3]}),
metadata_set=True,
)
dv_resp = df_api.dataView(record_id)
print(json.loads(dv_resp[0].data[0].metadata))
{'p': 14, 'q': 'Hello', 'r': [1, 2, 3]}
The previous metadata keys such as a
, b
, c
, etc. have all been replaced by the new metadata fields.
Relationships and provenance
Let’s say that this first dataset went through some processing step which resulted in one or more new datasets. This processing step could be something as simple as a data cleaning operation or as complex as a multi-institutional, cross-facility workflow. We can track not only the resultant new datasets as Data Records in DataFed but also the relationships between the datasets.
Note
We will cover topics related to associating raw data with Data Records in the next section.
First, we create Data Records as we have done earlier for the new datasets using the dataCreate()
function:
dc2_resp = df_api.dataCreate('cleaned data',
metadata=json.dumps({'cleaning_algorithm': 'gaussian_blur', 'size': 20}),
parent_id=dest_collection, # parent collection
)
clean_rec_id = dc2_resp[0].data[0].id
print(clean_rec_id)
'd/34682715'
We can establish a relationship or dependency
between the original / source Data Record and the subsequent Data Record
via several methods such as within the dataCreate()
function call or via a subsequent dataUpdate()
call.
Dependencies in DataFed are specified as a list
of relationships, themselves specified as list
objects,
wherein the first item in the list is the relationship type and the second item is the identifier of the related Data Record.
As of this writing, DataFed supports the following relationships:
der - Is derived from
comp - Is comprised of
ver - Is new version of
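For completeness, the same relationship could in principle be declared at creation time rather than in a separate update. The sketch below assumes that dataCreate() accepts a deps keyword argument in the same list-of-lists form; verify the signature of your installed version before relying on it:
# Sketch only: the 'deps' keyword argument is an assumption about dataCreate()
dc_alt_resp = df_api.dataCreate('cleaned data (declared at creation)',
                                metadata=json.dumps({'cleaning_algorithm': 'gaussian_blur'}),
                                deps=[["der", record_id]],  # [relationship type, related Record ID]
                                parent_id=dest_collection)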
For our example, we will say that our new Record is derived from our original record via the dataUpdate()
function:
dep_resp = df_api.dataUpdate(clean_rec_id, deps_add=[["der", record_id]])
print(dep_resp)
(data {
id: "d/34682715"
title: "cleaned data"
repo_id: "repo/cades-cnms"
size: 0.0
ext_auto: true
ct: 1611077405
ut: 1611078386
owner: "p/trn001"
creator: "u/somnaths"
deps {
id: "d/34682319"
alias: "my_first_dataset"
type: DEP_IS_DERIVED_FROM
dir: DEP_OUT
}
notes: 0
}
update {
id: "d/34682715"
title: "cleaned data"
owner: "p/trn001"
creator: "u/somnaths"
size: 0.0
notes: 0
deps_avail: true
dep {
id: "d/34682319"
alias: "my_first_dataset"
type: DEP_IS_DERIVED_FROM
dir: DEP_OUT
}
}, 'RecordDataReply')
The response shows that we did in fact manage to establish the DEP_IS_DERIVED_FROM
relationship.
In the DataFed web interface, selecting either the original or the derived Record and clicking on the Provenance view shows an arrow originating from the original Data Record and terminating at the newly created Data Record.
Batch operations
DataFed has the dataBatchCreate()
and dataBatchUpdate()
functions to facilitate
the creation and editing of multiple Data Records in one shot.
Other functions
DataFed also offers the dataDelete() function for the deletion of one or more Data Records.
Data Transfer
Upload raw data
So far, the Data Record created above only contains simple text information along with the scientific metadata. It does not have the raw data that we colloquially refer to as “data” in science.
For the sake of demonstration, we will just use the metadata as the data itself:
with open('parameters.json', mode='w') as file_handle:
json.dump(parameters, file_handle)
With the data file created, we are ready to put this raw data into the record we created above.
Note
The raw data file must be located such that it is visible to the (default) Globus endpoint. To configure the default endpoint, follow the steps detailed towards the end of the installation instructions.
Note
Ensure that the Globus endpoint that will be used for uploading data is active.
put_resp = df_api.dataPut(record_id,
'./parameters.json',
wait=True, # Waits until transfer completes.
)
print(put_resp)
(item {
id: "d/34682319"
title: "Some new title for the data"
size: 0.0
owner: "p/trn001"
}
task {
id: "task/34702491"
type: TT_DATA_PUT
status: TS_SUCCEEDED
client: "u/somnaths"
step: 3
steps: 4
msg: "Finished"
ct: 1611102437
ut: 1611102444
source: "olcf#dtn/gpfs/alpine/stf011/scratch/somnaths/DataFed_Tutorial/parameters.json"
dest: "d/34682319"
}, 'DataPutReply')
The dataPut()
method initiates a Globus transfer on our behalf
from the machine where the command was entered to wherever the default data repository is located.
Note
The above data file was specified by its relative local path, so DataFed used our pre-configured default Globus endpoint to find the data file. As long as we have the ID for any active Globus endpoint that we have authenticated access to, we can transfer data from that endpoint with its full absolute file path – even if the file system is not attached to the local machine. Look for more information on this in later examples.
In addition, the dataPut()
method prints out the status of the Globus transfer as shown under the task
section of the response.
The task
msg
shows that the Globus transfer succeeded. The transfer succeeded before the message was returned because
the wait
keyword argument in the dataPut()
method was set to True
, meaning that we requested that DataFed not proceed
until the Globus transfer was completed.
This is not the default behavior of dataPut()
or dataGet()
.
In a later section, we will go over an example use case wherein asynchronous transfers may be preferred.
Let’s view the Data Record we have been working on so far:
dv_resp = df_api.dataView(record_id)
print(dv_resp)
(data {
id: "d/34682319"
title: "Some new title for the data"
alias: "my_first_dataset"
metadata: "{\"p\":14,\"q\":\"Hello\",\"r\":[1,2,3]}"
repo_id: "repo/cades-cnms"
size: 86.0
source: "olcf#dtn/gpfs/alpine/stf011/scratch/somnaths/DataFed_Tutorial/parameters.json"
ext: ".json"
ext_auto: true
ct: 1611077217
ut: 1611077286
dt: 1611077286
owner: "p/trn001"
creator: "u/somnaths"
notes: 0
}, 'RecordDataReply')
Comparing this response against the response we got from the last dataView()
call,
you will notice the source
and file extension
have been updated.
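These fields can, of course, be read directly from the reply:
print(dv_resp[0].data[0].source)
print(dv_resp[0].data[0].ext)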
Download raw data
DataFed is also capable of getting data stored in a DataFed repository and placing it in the
local or other Globus-visible filesystem via the dataGet()
function.
For demonstration purposes, we will simply download the raw data (.JSON file) that was placed into the first Data Record.
In order to avoid clashes in file-naming, dataGet()
names the downloaded file by the unique ID of the Data Record
that contains the raw data. We already have a parameters.json
file in our local folder and setting the orig_fname
keyword argument to True
would result in a clash in the file name.
Just to prove that the file download is indeed taking place, let’s check to make sure that there is no other JSON file whose name matches that of the record ID.
expected_file_name = os.path.join('.', record_id.split('d/')[-1]) + '.json'
print(expected_file_name)
./34682319.json
print(os.path.exists(expected_file_name))
False
Now that we know that we will not be having a file name clash, let us proceed with the dataGet()
function call.
Note
The current version of DataFed has a bug where dataGet()
only accepts a list
of Data Record or Collection IDs.
Until the next version, users are recommended to put their singular ID into a list
for dataGet()
.
get_resp = df_api.dataGet([record_id], # currently only accepts a list of IDs / aliases
'.', # directory where data should be downloaded
orig_fname=False, # do not name file by its original name
wait=True, # Wait until Globus transfer completes
)
print(get_resp)
(task {
id: "task/34682556"
type: TT_DATA_GET
status: TS_SUCCEEDED
client: "u/somnaths"
step: 2
steps: 3
msg: "Finished"
ct: 1611077310
ut: 1611077320
source: "d/34682319"
dest: "olcf#dtn/gpfs/alpine/stf011/scratch/somnaths/DataFed_Tutorial"
}
, 'TaskDataReply')
The response shows that the Globus file transfer to the local file system did indeed complete successfully. Now, let us verify that the file does indeed exist as it should:
print(os.path.exists(expected_file_name))
True
At this point, we are free to rename the downloaded file to whatever name we want using familiar python functions:
os.rename(expected_file_name, 'duplicate_parameters.json')
Tasks
DataFed makes it possible to check on the status of transfer tasks in an easy and programmatic manner.
From the earlier dataGet()
function call’s response, we can extract the task id
as:
task_id = get_resp[0].task[0].id
print(task_id)
task/34682556
Using the task ID, we can check on the status of the task
via the taskView()
function:
task_resp = df_api.taskView(task_id)
print(task_resp)
(task {
id: "task/34682556"
type: TT_DATA_GET
status: TS_SUCCEEDED
client: "u/somnaths"
step: 2
steps: 3
msg: "Finished"
ct: 1611077310
ut: 1611077320
source: "d/34682319"
dest: "olcf#dtn/gpfs/alpine/stf011/scratch/somnaths/DataFed_Tutorial"
}
, 'TaskDataReply')
The TaskDataReply
shows that the status
is indeed a success and the msg
is "Finished"
.
This specific example by itself was trivial since we had set the wait
keyword argument to True
in the dataGet()
function
call, which meant that DataFed would not proceed until the transfer was complete.
Furthermore, the nature of the transfer was also trivial in that it was a single file located in a single DataFed
repository being delivered to a single destination.
Note
A DataFed task
may itself contain / be responsible for several Globus file transfers.
As the structure of the dataGet()
function call suggests, one could request that several Data Records or
Data Collections (themselves containing thousands of Data Records or even Collections) be downloaded,
regardless of their location (several DataFed data repositories spread across the world in multiple institutions / continents).
In this case, the task
would be a composite of several Globus data transfers.
We can also extract the status of the task
as:
task_resp[0].task[0].status
3
Note that though the status was marked as TS_SUCCEEDED
in the Google Protobuf object,
we got an integer value for the status.
For now, we will use the numeric value of 3
to denote the successful completion of a file transfer task.
Note
A future version of DataFed may change the nature of the output / type for the status
property. In general, the exact return object types and nomenclature may evolve with DataFed.
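Putting this together, a small polling helper can block until a task reaches a terminal state. This is a sketch that relies only on taskView() and assumes, per the outputs above, that status values below 3 indicate a task that is still pending or running:
def wait_for_task(task_id, poll_seconds=5):
    # Poll taskView() until the task leaves the pending / running states.
    # Assumption based on the outputs above: status values below 3 are not terminal,
    # 3 means TS_SUCCEEDED, and any other terminal value should be inspected by the caller.
    while True:
        resp = df_api.taskView(task_id)
        status = resp[0].task[0].status
        if status >= 3:
            return status
        time.sleep(poll_seconds)

print(wait_for_task(task_id))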
Asynchronous transfers
So far we have been requesting that all transfers be completed before the next line of python code is executed. This is certainly acceptable for small data files but is perhaps not ideal for large files.
Here are some scenarios:
We are performing an array of simulations and want data transfers for a completed simulation to take place in the background while the subsequent simulation is being computed.
We may want to get multiple Data Records or Collections which may actually be spread over multiple DataFed data repositories or Projects, etc.
One could conceivably need to launch a child process to perform some operations while transfers took place asynchronously.
Before we demonstrate a simple example, let us define some handy functions:
The first is our fake, computationally expensive simulation denoted by expensive_simulation()
that just sleeps for 3 seconds.
It pretends to generate results that are written to a .dat file and returns the path to this results data file (here, a fixed path on a read-only ESNet test endpoint). Though comically oversimplified, it is sufficiently accurate for demonstration purposes.
def expensive_simulation():
time.sleep(3)
# Yes, this simulation is deterministic and always results in the same result:
path_to_results = 'esnet#cern-diskpt1/data1/5MB-in-tiny-files/a/a/a-a-1KB.dat'
return path_to_results
The next handy function is check_xfer_status()
that looks up the instantaneous status of the transfer
of each task it is provided and returns only the statuses:
def check_xfer_status(task_ids):
# Create a list to hold all statuses
statuses = list()
# iterate over each of the task IDs in the input argument
for this_task_id in task_ids:
# First ask DataFed for information about this task
task_resp = df_api.taskView(this_task_id)
# Extract the status field from the response
# Add just the status to the list
statuses.append(task_resp[0].task[0].status)
return statuses
In the following demonstration, we perform a series of “computationally expensive” simulations.
Following our aim to mimic realistic scenarios, we also create a DataFed collection to hold all the simulation results:
coll_resp = df_api.collectionCreate('Simulations', parent_id=dest_collection)
sim_coll_id = coll_resp[0].coll[0].id
Knowing that the simulations take a while to complete, we create a Data Record to hold each simulation’s resulting data file and then call dataPut() to asynchronously upload the data in the background without impeding the following simulation or, importantly, wasting precious wall time on the supercomputer.
xfer_tasks = list()
for ind in range(3):
print('Starting simulation #{}'.format(ind))
# Run the simulation and make sure to get the path to the results
results_file = expensive_simulation()
# Create a unique Data Record for this simulation
rec_resp = df_api.dataCreate('Simulation_' + str(ind),
metadata=json.dumps({'parameter_1': ind}),
parent_id=sim_coll_id)
# Extract the ID for this record from the response
this_rec_id = rec_resp[0].data[0].id
print('Uploading data from simulation #{}'.format(ind))
# Put the raw data into this record
put_resp = df_api.dataPut(this_rec_id, results_file, wait=False)
# Extract the task ID from the put response as we have done before
# Add that task ID to the list of tasks we need to track
xfer_tasks.append(put_resp[0].task.id)
# Print instantaneous transfer statuses of all data put tasks so far
print('Transfer status(es): {}'.format(check_xfer_status(xfer_tasks)))
print('')
print('Simulations complete')
Starting simulation #0
Uploading data from simulation #0
Transfer status(es): [2]
Starting simulation #1
Uploading data from simulation #1
Transfer status(es): [3, 2]
Starting simulation #2
Uploading data from simulation #2
Transfer status(es): [3, 3, 2]
Simulations complete
What we observe is that the data upload tasks for all previous simulations are complete while the current simulation is in progress. Of course, the sequence and competing speeds of the simulations and the data transfer tasks will vary from one workload to another, and this is just an illustration. However, it does illustrate a popular use case for asynchronous file transfers.
Note
Users are recommended to perform data orchestration (especially large data movement - upload / download) operations outside the scope of heavy / parallel computation operations in order to avoid wasting precious wall time on compute clusters.
Task list
DataFed also provides the taskList() function, which displays a list of all data upload or download tasks in descending order of time since creation. This may be useful for those developing applications where one needs to check on and manage tasks initiated, for example, from different Python sessions (either in the past or running elsewhere).
Collections
Collections are a great tool for organizing Data Records and other Collections within DataFed. Besides organization, they have other benefits such as facilitating the download of vast numbers of Data Records they may contain, regardless of where (DataFed data repositories, various projects, etc.) the individual Data Records are physically located.
Create collection
The process to create a Collection is very similar to that for the Data Record.
We would use the collectionCreate()
function as:
coll_alias = 'cat_dog_train'
coll_resp = df_api.collectionCreate('Image classification training data',
alias=coll_alias,
parent_id=dest_collection)
print(coll_resp)
(coll {
id: "c/34683877"
title: "Image classification training data"
alias: "cat_dog_train"
owner: "p/trn001"
ct: 1611078472
ut: 1611078472
parent_id: "c/34558900"
}
, 'CollDataReply')
Much like Data Records, Collections can be addressed using aliases instead of IDs.
However, as mentioned earlier, aliases are unique only within a context, so the context must be kept in mind (or set as the default) when addressing an item by its alias.
What we get in response to the collectionCreate()
function is a CollDataReply
object.
It contains some high-level identification information such as the id
, alias
, parent_id
, etc.
It does not contain other information such as the number of Data Records within the collection itself.
We could peel the id
of this newly created Collection out of the message reply if we wanted to,
just as we did for the Data Record. However, we will just use the alias
for now.
Note
Collections have IDs starting with c/
just like Data Record IDs start with d/
and Project IDs start with p/
.
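Since these prefixes are consistent, a small helper can classify any identifier encountered in a script:
def id_kind(datafed_id):
    # Classify a DataFed identifier by its prefix (see the note above)
    prefixes = {'d/': 'Data Record', 'c/': 'Collection', 'p/': 'Project',
                'q/': 'Saved Query', 'u/': 'User', 'task/': 'Task'}
    for prefix, kind in prefixes.items():
        if datafed_id.startswith(prefix):
            return kind
    return 'unknown'

print(id_kind(record_id), id_kind('p/trn001'))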
Populate with Records
Let’s say that we wanted to put training data for a machine learning application into this collection.
We could go ahead and populate the Collection with Data Records by using the dataCreate()
function
for each Data Record in the Collection.
In our example, we are interested in gathering examples of cats and dogs to train a machine learning model.
For simplicity, we will use the same tiny dataset for both cats and dogs.
The Data Records would be distinguishable via the animal
key or field in the metadata
.
Since we need to create several Data Records for dogs and cats, we will define a quick function:
import random
def generate_animal_data(is_dog=True):
this_animal = 'cat'
if is_dog:
this_animal = 'dog'
# To mimic a real-life scenario, we append a number to the animal type to denote
# the N-th example of a cat or dog. In this case, we use a random integer.
rec_resp = df_api.dataCreate(this_animal + '_' + str(random.randint(1, 100)),
metadata=json.dumps({'animal': this_animal}),
parent_id=coll_alias)
# Parse the dataCreate response to tease out the ID of the Record
this_rec_id = rec_resp[0].data[0].id
# path to the file containing the raw data
raw_data_path = 'esnet#newy-dtn/data1/5MB-in-tiny-files/a/a/a-a-1KB.dat'
# Putting the raw data into the record
put_resp = df_api.dataPut(this_rec_id, raw_data_path)
# Only returning the ID of the Data Record we created:
return this_rec_id
In the above function, we use a tiny dataset from ESNet’s read-only Globus endpoint: esnet#newy-dtn
.
The actual data itself is of little relevance to this example and will not really be used.
Tip
So far, we have only been providing the relative path to data when we use dataPut(). dataPut() automatically resolves the absolute path on the local file system and uses the UUID / legacy name of the Globus endpoint we set as the default for this local file system. However, we can also provide the name of a Globus endpoint followed by the absolute path of the desired file (or directory) on that endpoint.
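For example, outside of the helper above, the same pattern looks like this (a sketch with output omitted; running it would replace the raw data already attached to record_id):
# Sketch: transfer from a remote Globus endpoint by prefixing the absolute
# path with the endpoint name
put_resp = df_api.dataPut(record_id,
                          'esnet#newy-dtn/data1/5MB-in-tiny-files/a/a/a-a-1KB.dat',
                          wait=True)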
Now, we simply call the generate_animal_data()
function to generate data.
We will generate 5 examples each of cats and dogs:
cat_records = list()
dog_records = list()
for _ in range(5):
dog_records.append(generate_animal_data(is_dog=True))
for _ in range(5):
cat_records.append(generate_animal_data(is_dog=False))
print(cat_records)
['d/34684011', 'd/34684035', 'd/34684059', 'd/34684083', 'd/34684107']
print(dog_records)
['d/34683891', 'd/34683915', 'd/34683939', 'd/34683963', 'd/34683987']
List items in Collection
Now that we have generated the data in our Collection, we can list the contents of the Collection simply via collectionItemsList() as shown below. Since we set the context earlier in the guide, we do not need to specify the context keyword argument even though we are using the alias as the identifier:
coll_list_resp = df_api.collectionItemsList(coll_alias)
print(coll_list_resp)
(item {
id: "d/34684107"
title: "cat_22"
owner: "p/trn001"
creator: "u/somnaths"
size: 0.0
notes: 0
}
item {
id: "d/34684011"
title: "cat_32"
owner: "p/trn001"
creator: "u/somnaths"
size: 0.0
notes: 0
}
item {
id: "d/34684035"
title: "cat_6"
owner: "p/trn001"
creator: "u/somnaths"
size: 0.0
notes: 0
}
item {
id: "d/34684083"
title: "cat_93"
owner: "p/trn001"
creator: "u/somnaths"
size: 0.0
notes: 0
}
item {
id: "d/34684059"
title: "cat_96"
owner: "p/trn001"
creator: "u/somnaths"
size: 0.0
notes: 0
}
item {
id: "d/34683939"
title: "dog_3"
owner: "p/trn001"
creator: "u/somnaths"
size: 0.0
notes: 0
}
item {
id: "d/34683915"
title: "dog_63"
owner: "p/trn001"
creator: "u/somnaths"
size: 0.0
notes: 0
}
item {
id: "d/34683891"
title: "dog_70"
owner: "p/trn001"
creator: "u/somnaths"
size: 0.0
notes: 0
}
item {
id: "d/34683987"
title: "dog_71"
owner: "p/trn001"
creator: "u/somnaths"
size: 0.0
notes: 0
}
item {
id: "d/34683963"
title: "dog_8"
owner: "p/trn001"
creator: "u/somnaths"
size: 0.0
notes: 0
}
offset: 0
count: 20
total: 10
, 'ListingReply')
From the above response, it is clear that we have 5 examples each for dogs and cats and that this Collection does not contain any other Collections or Data Records.
Note
If we had several dozens, hundreds, or even thousands of items in a Collection,
we would need to call collectionItemsList()
multiple times
by stepping up the offset
keyword argument each time to get the next “page” of results.
Queries
Let’s say that we want to segregate the cat data from the dog data and that
we did not already have the record IDs separated in the dog_records
and cat_records
variables.
One way to do this with the tools we have demonstrated so far might be to
use collectionItemsList()
to enumerate all the records, extract the title
of each of the Records
and then parse the information to separate cats from dogs.
If we did not have meaningful titles, we would have had to call dataView()
to get the metadata
of each of the Records to separate cats from dogs.
Obviously, these are highly sub-optimal solutions to the problem. The ideal solution is to use the search capability in DataFed.
Create query
While it is technically possible to construct queries using the queryCreate() function in CommandLib, we will construct the query via the web interface since, as of this writing, the query language is expected to change soon.
Note
The query language is likely to change in a future version of DataFed.
In order to create the query, we will follow the steps below; the screenshot of the interface should help guide you through this process:
Visit https://datafed.ornl.gov
Click on the Data Search tab in the bottom left of the page to expand the search tab.
Uncheck all boxes in the Scope section and check only Select. This should reveal checkboxes in the left navigation panel.
Now select the Image classification training data Collection.
Finally, enter animal == "cat" in the Metadata field in the Data Search tab at the bottom of the window.
Your window should look something like this:
Now when we click the yellow colored right arrow / “play” button in the bottom right of the Data Search
tab,
we are taken to the search results page as shown below:
Click on the Save
button that looks like a floppy drive in the bottom right of the Data Search
tab.
This should reveal a pop up window that will let you name and save this search query as shown below:
We can give this search a title such as find_all_cats
and click on the Save
button now.
Note
Saved queries are visible at the very bottom of the navigation / left pane below Project Data
and Shared Data
.
List saved queries
Much like listing the Projects this user is part of or the contents of a Collection, one can also list the
saved queries via the queryList()
function as:
ql_resp = df_api.queryList()
print(ql_resp)
(item {
id: "q/34684970"
title: "find_all_cats"
}
offset: 0
count: 20
total: 1, 'ListingReply')
We again get a ListingReply
object which can be parsed if need be.
Importantly, we see our newly created query listed here.
We can extract the query ID as:
query_id = ql_resp[0].item[0].id
print(query_id)
'q/34684970'
View query
Just like dataView(), we can use queryView() to view this query as well:
df_api.queryView(query_id)
(query {
id: "q/34684970"
title: "find_all_cats"
query: "{\"meta\":\"animal == \\\"cat\\\"\",\"scopes\":[{\"scope\":4,\"id\":\"c/34683877\",\"recurse\":true}]}"
owner: "u/somnaths"
ct: 1611078781
ut: 1611078781
}, 'QueryDataReply')
The query
string in the response reveals that:
we did search for data whose metadata lists the animal as cat
we limited our scope to just one Collection
(by default) the query recursively searches all Collections inside the Collection we pointed to
Execute query
Finally, we can run the desired query using queryExec()
as shown below:
query_resp = df_api.queryExec(query_id)
print(query_resp)
(item {
id: "d/34684011"
title: "cat_32"
owner: "p/trn001"
creator: "u/somnaths"
size: 1000.0
notes: 0
}
item {
id: "d/34684035"
title: "cat_6"
owner: "p/trn001"
creator: "u/somnaths"
size: 1000.0
notes: 0
}
item {
id: "d/34684059"
title: "cat_96"
owner: "p/trn001"
creator: "u/somnaths"
size: 1000.0
notes: 0
}
item {
id: "d/34684083"
title: "cat_93"
owner: "p/trn001"
creator: "u/somnaths"
size: 1000.0
notes: 0
}
item {
id: "d/34684107"
title: "cat_22"
owner: "p/trn001"
creator: "u/somnaths"
size: 1000.0
notes: 0
}
, 'ListingReply')
The response to this function call is also a ListingReply
object.
Note
In the current version of DataFed, search queries limit the number of results returned to 50. This behavior will be changed in a subsequent version of DataFed.
Let’s verify that the results from the query match our expectation (the list of cat IDs we collected when the records were created):
# First get IDs from query result
cat_rec_ids = [record.id for record in query_resp[0].item]
print(set(cat_rec_ids) == set(cat_records))
True
Collections continued
Let us continue with our original aim of segregating the cats from the dogs. We now know the IDs of all the cats from the response to a saved query.
Now, we will demonstrate ways in which we can organize data in DataFed.
Organize with Collections
The simplest and most powerful way to organize information is using Collections.
We could segregate all cat data into a new, separate collection just for cats via the collectionCreate()
function:
coll_resp = df_api.collectionCreate('Cats', alias='cats', parent_id=coll_alias)
cat_coll_id = coll_resp[0].coll[0].id
print(cat_coll_id)
'c/34685092'
Collection Parents
If we wanted to get an idea about where the newly created Cats
Collection is
with respect to the root
Collection of the current context
(the Training project),
we could use the collectionGetParents()
function as:
path_resp = df_api.collectionGetParents(cat_coll_id)
print(path_resp)
(path {
item {
id: "c/34683877"
title: "Image classification training data"
alias: "cat_dog_train"
}
item {
id: "c/34558900"
title: "somnaths"
alias: "somnaths"
}
item {
id: "c/p_trn001_root"
title: "Root Collection"
alias: "root"
}
}, 'CollPathReply')
What we get in return is a CollPathReply
message which essentially shows a
path
illustrating that the Cats
Collection is within the cat_dog_train
Collection,
which itself is within the user’s private collection - somnaths
, which in turn
is within the root
Collection of the Training Project.
Add and remove from Collections
Unlike before, when we created the cat and dog Records directly in a specific Collection, we now have the cat Records in the wrong Collection.
The first step towards organization is to add these existing records into the newly created
Cats
Collection via the collectionItemsUpdate()
function as shown below.
This function accepts a list of IDs to add via the add_ids
keyword argument:
cup_resp = df_api.collectionItemsUpdate(cat_coll_id, add_ids=cat_rec_ids)
print(cup_resp)
(, 'ListingReply')
Unlike most other functions, collectionItemsUpdate()
does not return much that we can work with.
However, this is acceptable since we knew the IDs being added into the Collection.
We can verify that the cat Records do indeed exist in the Cats
Collection using
the familiar collectionItemsList()
function as shown below.
In the interest of brevity, we capture the response and only print out ID and title of the items in the collection:
ls_resp = df_api.collectionItemsList(cat_coll_id)
print([(obj.id, obj.title) for obj in ls_resp[0].item])
[('d/34684107', 'cat_22'),
('d/34684011', 'cat_32'),
('d/34684035', 'cat_6'),
('d/34684083', 'cat_93'),
('d/34684059', 'cat_96')]
We have indeed ensured that the cat Records are part of the Cats
Collection.
However, let us list the contents of the original / outer collection:
ls_resp = df_api.collectionItemsList(coll_alias)
print([(obj.id, obj.title) for obj in ls_resp[0].item])
[('c/34685092', 'Cats'),
('d/34684107', 'cat_22'),
('d/34684011', 'cat_32'),
('d/34684035', 'cat_6'),
('d/34684083', 'cat_93'),
('d/34684059', 'cat_96'),
('d/34683939', 'dog_3'),
('d/34683915', 'dog_63'),
('d/34683891', 'dog_70'),
('d/34683987', 'dog_71'),
('d/34683963', 'dog_8')]
We observe that the original collection continues to contain the cat Records, as well as the newly
created Cats
collection, and all the dog Records.
To complete the move, we need to de-link the cat Records from the original Collection.
We do this again via the collectionItemsUpdate() function.
However, this time, we would need to pass the same cat Record IDs with the rem_ids
keyword argument
rather than the add_ids
keyword argument:
cup_resp = df_api.collectionItemsUpdate(coll_alias, rem_ids=cat_rec_ids)
print(cup_resp)
(, 'ListingReply')
Let us verify that the original / outer Collection no longer contains cat Records:
ls_resp = df_api.collectionItemsList(coll_alias)
print([(obj.id, obj.title) for obj in ls_resp[0].item])
[('c/34685092', 'Cats'),
('d/34683939', 'dog_3'),
('d/34683915', 'dog_63'),
('d/34683891', 'dog_70'),
('d/34683987', 'dog_71'),
('d/34683963', 'dog_8')]
Download Collection
Finally, let us assume that we are interested in only downloading the data from all
cat Records.
A naive and suboptimal way to accomplish this is to perform 5 separate dataGet()
function calls - one per cat Record.
Fortunately, the dataGet()
function allows multiple Records or entire Collections to be downloaded with a single function call
as shown below.
Though we could provide the list of cat Record IDs, we will provide only the Cats Collection ID instead.
We will ask dataGet()
to create a new directory called cat_data
and put all the data within this directory:
df_api.dataGet([cat_coll_id], './cat_data')
(item {
id: "d/34684011"
title: "cat_32"
owner: "p/trn001"
size: 1000.0
}
item {
id: "d/34684035"
title: "cat_6"
owner: "p/trn001"
size: 1000.0
}
item {
id: "d/34684059"
title: "cat_96"
owner: "p/trn001"
size: 1000.0
}
item {
id: "d/34684083"
title: "cat_93"
owner: "p/trn001"
size: 1000.0
}
item {
id: "d/34684107"
title: "cat_22"
owner: "p/trn001"
size: 1000.0
}
task {
id: "task/34685359"
type: TT_DATA_GET
status: TS_READY
client: "u/somnaths"
step: 0
steps: 2
msg: "Pending"
ct: 1611079028
ut: 1611079028
source: "d/34684011, d/34684035, d/34684059, d/34684083, d/34684107, ..."
dest: "olcf#dtn/gpfs/alpine/stf011/scratch/somnaths/DataFed_Tutorial/cat_data"
}, 'DataGetReply')
Note
Recall that dataGet() can download an arbitrarily large number of Records regardless of the physical locations of the DataFed repositories containing the data.
Now, let us verify that all the data does in fact exist in this newly created directory in the local file system:
os.listdir('./cat_data')
['34684107.dat',
'34684059.dat',
'34684011.dat',
'34684035.dat',
'34684083.dat']
Other functions
Besides the above functions, DataFed offers the collectionDelete() function, which, as the name suggests, deletes one or more Collections and all objects within them (so long as those items do not also belong to other Collections elsewhere).
Closing remarks
This user guide only provides an overview of the most commonly used functions in DataFed. The interested user is encouraged to go over the complete documentation of all the functions in CommandLib.CLI here.