Record

Contains relevant classes for records, objects that track a particular state through some set of stages.

Classes:

`ArtifactRepresentation`(record, name, artifact)	This is a shorthand string representation for an artifact stored in a record as well as output cache info.
`CacheAwareDict`(*args, **kwargs)	A normal dictionary that will return resolved versions of Lazy objects when accessed.
`Record`(manager, args[, hide])	A single persistent state that’s passed between stages in a single “experiment line”.

class curifactory.record.ArtifactRepresentation(record, name, artifact, metadata=None)

This is a shorthand string representation for an artifact stored in a record as well as output cache info. This is what gets displayed in the detailed experiment map in reports.

This will try to include helpful information in the string representation, such as pandas/numpy shapes, or lengths where applicable.

Parameters

record (Record) – The record this artifact is stored in.
name (str) – The name of the artifact.
artifact – The artifact itself.
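As a rough illustration of the "shapes, or lengths where applicable" idea described for this class, the sketch below builds a one-line description of an arbitrary object. The function `describe_artifact`, the class `FakeArray`, and the output format are all hypothetical and are not taken from curifactory; the real class may format its strings differently.

```python
def describe_artifact(name, artifact):
    """Build a short one-line description of an artifact (illustrative only)."""
    type_name = type(artifact).__name__
    # pandas DataFrames and numpy arrays both expose a `.shape` attribute.
    shape = getattr(artifact, "shape", None)
    if shape is not None:
        return f"{name}: ({type_name}) shape: {tuple(shape)}"
    # Fall back to a length for sized containers such as lists and dicts.
    try:
        return f"{name}: ({type_name}) len: {len(artifact)}"
    except TypeError:
        # Scalars and other unsized objects get only their type name.
        return f"{name}: ({type_name})"


class FakeArray:
    """Hypothetical array-like stand-in so the sketch needs no dependencies."""

    shape = (3, 2)


print(describe_artifact("features", FakeArray()))
print(describe_artifact("labels", [0, 1, 1]))
print(describe_artifact("threshold", 0.5))
```

Checking for `.shape` first and `len()` second keeps the sketch free of any pandas or numpy import while still covering both kinds of object.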

Methods:

html_safe()

Removes special characters that’d break html, except it doesn’t actually display correctly.

(I think this is an issue with how graphviz renders outputs?)

class curifactory.record.CacheAwareDict(*args, **kwargs)

A normal dictionary that will return resolved versions of Lazy objects when accessed.

Attributes:

resolve

A flag for enabling/disabling auto-resolve.

resolve: A flag for enabling/disabling auto-resolve. If this is False, this works just like a normal dictionary. This is necessary in the input checking part of a normal stage, as we make state access before determining if stage execution is required or not.

class curifactory.record.Record(manager, args, hide=False)

A single persistent state that’s passed between stages in a single “experiment line”.

Parameters

manager (ArtifactManager) – The artifact manager this record is associated with.
args – The ExperimentArgs instance to apply to any stages this record is run through.
hide (bool) – If True, don’t add this record to the artifact manager.

Attributes:

`args`	The `ExperimentArgs` to apply to any stages this record is passed through.
`combo_hash`	This gets set on records that run an aggregate stage.
`input_records`	A list of any records used as input to this one.
`is_aggregate`	If this record runs an aggregate stage, we flip this flag to true to know we need to use the combo hash rather than the individual args hash.
`manager`	The `ArtifactManager` associated with this record.
`output`	The returned value from the last stage that was run with this record.
`stage_inputs`	A list of lists per stage with the state inputs that stage requested.
`stage_outputs`	A list of lists per stage with the state outputs that stage produced.
`stages`	The list of stage names that this record has run through so far.
`state`	The dictionary of all variables created by stages this record is passed through.
`state_artifact_reps`	Dictionary mimicking state that keeps an `ArtifactRepresentation` associated with each variable stored in `self.state`.
`stored_paths`	A list of paths that have been copied into a full store folder.
`unstored_tracked_paths`	Paths obtained with get_path/get_dir that should be copied to a full store folder. The last executed stage should manage copying anything listed here and then clearing it. This is a list of dicts that would be passed to the artifact manager’s `get_artifact_path` function (obj_name, subdir, prefix, and path).

Methods:

`get_dir`(dir_name_suffix[, subdir, prefix, …])	Returns an args-appropriate cache path with the passed name (similar to get_path) and creates it as a directory.
`get_hash`()	Returns either the hash of the args, or the combo hash if this record is an aggregate.
`get_path`(obj_name[, subdir, prefix, …])	Return an args-appropriate cache path with passed object name.
`get_reference_name`()	This returns a name describing the record, in the format ‘Record [index on manager] (paramset name)’.
`make_copy`([args, add_to_manager])	Make a new record that has a deep-copied version of the current state.
`report`(reportable)	Add a reportable associated with this record; it will be added to the experiment run output report.
`set_aggregate`(aggregate_records)	Mark this record as starting with an aggregate stage, meaning the hashes of all cached outputs produced within this record need to reflect the combo hash of all records going into it.
`set_hash`()	Establish the hash for the current args (and set it on the args instance).
`store_tracked_paths`()	Copy all of the recent relevant files generated (likely from the recently executing stage) into a store-full run.

args: The ExperimentArgs to apply to any stages this record is passed through.

combo_hash: This gets set on records that run an aggregate stage. This is set from utils.add_args_combo_hash.

get_dir(dir_name_suffix: str, subdir: Optional[str] = None, prefix: Optional[str] = None, stage_name: Optional[str] = None, track: bool = True) → str

Returns an args-appropriate cache path with the passed name (similar to get_path) and creates it as a directory.

Parameters

dir_name_suffix (str) – the name to add as a suffix to the created directory name.
subdir (str) – An optional string of one or more nested subdirectories to prepend to the artifact filepath. This can be used if you want to subdivide cache and run artifacts into logical subsets, e.g. similar to https://towardsdatascience.com/the-importance-of-layered-thinking-in-data-engineering-a09f685edc71.
prefix (str) – An optional alternative prefix to the experiment-wide prefix (either the experiment name or custom-specified experiment prefix). This can be used if you want a cached object to work more easily across multiple experiments, rather than being experiment-specific. WARNING: use with caution, cross-experiment caching can mess with provenance.
stage_name (str) – The associated stage for a path. If not provided, the currently executing stage name is used.
track (bool) – Whether to include the returned path in a store-full copy or not. This will only work if the returned path is not altered by a stage before saving something to it.

get_hash() → str: Returns either the hash of the args, or the combo hash if this record is an aggregate.
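The choice between the two hashes can be sketched as below. `HashedRecordSketch` is a hypothetical stand-in, and the way the combo hash is computed here (SHA-256 over the sorted input hashes) is purely an assumption for illustration; the source only states that the real value is set from utils.add_args_combo_hash.

```python
import hashlib


class HashedRecordSketch:
    """Hypothetical stand-in for a record that only models hash selection."""

    def __init__(self, args_hash):
        self.args_hash = args_hash
        self.is_aggregate = False
        self.combo_hash = None

    def set_aggregate(self, aggregate_records):
        self.is_aggregate = True
        # Assumption: combine input hashes by hashing their sorted, joined values.
        joined = ",".join(sorted(r.get_hash() for r in aggregate_records))
        self.combo_hash = hashlib.sha256(joined.encode()).hexdigest()

    def get_hash(self):
        # Aggregate records report the combo hash; all others report the args hash.
        return self.combo_hash if self.is_aggregate else self.args_hash


first = HashedRecordSketch("aaa111")
second = HashedRecordSketch("bbb222")
summary = HashedRecordSketch("ccc333")
summary.set_aggregate([first, second])
```

In this sketch `first.get_hash()` still returns its own args hash, while `summary.get_hash()` depends on the records that were aggregated into it, so its cached outputs would change whenever the set of input records changes.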

get_path(obj_name: str, subdir: Optional[str] = None, prefix: Optional[str] = None, stage_name: Optional[str] = None, track: bool = True) → str

Return an args-appropriate cache path with passed object name.

This should be equivalent to what a cacher for a stage should get. Note that this is calling the manager’s get_path, which will include the stage name. If calling this outside of a stage, it will include whatever stage was last run.

Parameters

obj_name (str) – the name to associate with the object as the last part of the filename.
subdir (str) – An optional string of one or more nested subdirectories to prepend to the artifact filepath. This can be used if you want to subdivide cache and run artifacts into logical subsets, e.g. similar to https://towardsdatascience.com/the-importance-of-layered-thinking-in-data-engineering-a09f685edc71.
prefix (str) – An optional alternative prefix to the experiment-wide prefix (either the experiment name or custom-specified experiment prefix). This can be used if you want a cached object to work more easily across multiple experiments, rather than being experiment-specific. WARNING: use with caution, cross-experiment caching can mess with provenance.
stage_name (str) – The associated stage for a path. If not provided, the currently executing stage name is used.
track (bool) – Whether to include the returned path in a store-full copy or not. This will only work if the returned path is not altered by a stage before saving something to it.

get_reference_name() → str

This returns a name describing the record, in the format ‘Record [index on manager] (paramset name)’.

This should be the same as what’s shown in the stage map in the output report.

input_records: A list of any records used as input to this one. This mostly only occurs when aggregate stages are run.

is_aggregate: If this record runs an aggregate stage, we flip this flag to true to know we need to use the combo hash rather than the individual args hash.

make_copy(args=None, add_to_manager=True)

Make a new record that has a deep-copied version of the current state.

This is useful for a long running procedure that creates a common dataset for many other stages, so that it can be replicated across multiple argsets without having to recompute for each argset.

Note that state is really the only thing transferred to the new record; the stage and inputs/outputs lists will be empty.

Also note that the current record will be added to the input_records of the new record, since it may draw on data in its state.

Parameters

args – The new ExperimentArgs argset to apply to the new record. Leave as None to retain the same args as the current record.
add_to_manager – Whether to automatically add this record to the current manager or not.
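The copy semantics described for make_copy (deep-copied state, empty stage lists, and the original record listed as an input) can be sketched with a small stand-in class. `RecordSketch` is hypothetical: it omits the manager and the add_to_manager handling entirely, and it uses plain dicts in place of ExperimentArgs.

```python
import copy


class RecordSketch:
    """Hypothetical stand-in for a record; no manager or caching involved."""

    def __init__(self, args):
        self.args = args
        self.state = {}
        self.stages = []
        self.stage_inputs = []
        self.stage_outputs = []
        self.input_records = []

    def make_copy(self, args=None):
        # Keep the current args unless a replacement argset is supplied.
        new_record = RecordSketch(self.args if args is None else args)
        # Deep copy so later changes in one record never leak into the other.
        new_record.state = copy.deepcopy(self.state)
        # The new record draws on this one's data, so note it as an input.
        new_record.input_records = [self]
        return new_record


original = RecordSketch(args={"lr": 0.01})
original.state["dataset"] = [1, 2, 3]
original.stages.append("load_data")

clone = original.make_copy(args={"lr": 0.1})
clone.state["dataset"].append(4)
```

After this runs, `original.state["dataset"]` is still `[1, 2, 3]` and `clone.stages` is empty, which matches the note that only state carries over to the new record.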

manager: The ArtifactManager associated with this record.

output: The returned value from the last stage that was run with this record.

report(reportable: curifactory.reporting.Reportable)

Add a reportable associated with this record; it will be added to the experiment run output report.

Parameters: reportable (Reportable) – The reportable to render on the final experiment report.

set_aggregate(aggregate_records): Mark this record as starting with an aggregate stage, meaning the hashes of all cached outputs produced within this record need to reflect the combo hash of all records going into it.

set_hash(): Establish the hash for the current args (and set it on the args instance).

stage_inputs: A list of lists per stage with the state inputs that stage requested.

stage_outputs: A list of lists per stage with the state outputs that stage produced.

stages: The list of stage names that this record has run through so far.

state: The dictionary of all variables created by stages this record is passed through. (AKA ‘Artifacts’) All inputs from stage decorators are pulled from this dictionary, and all outputs are stored here.

state_artifact_reps: Dictionary mimicking state that keeps an ArtifactRepresentation associated with each variable stored in self.state.

store_tracked_paths(): Copy all of the recent relevant files generated (likely from the recently executing stage) into a store-full run. This is run automatically at the end of a stage.

stored_paths: list: A list of paths that have been copied into a full store folder. These are the source paths, not the destination paths.

unstored_tracked_paths: list: Paths obtained with get_path/get_dir that should be copied to a full store folder. The last executed stage should manage copying anything listed here and then clearing it. This is a list of dicts that would be passed to the artifact manager’s get_artifact_path function (obj_name, subdir, prefix, and path).
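The interplay between tracked paths, store_tracked_paths, and stored_paths can be sketched as follows. `TrackingRecordSketch` is a hypothetical stand-in: its path layout and constructor are assumptions, and its tracked entries carry only obj_name and path, leaving out the subdir and prefix keys listed above.

```python
import os
import shutil
import tempfile


class TrackingRecordSketch:
    """Hypothetical stand-in that models only path tracking and copying."""

    def __init__(self, cache_dir, store_dir):
        self.cache_dir = cache_dir
        self.store_dir = store_dir
        self.unstored_tracked_paths = []
        self.stored_paths = []

    def get_path(self, obj_name, track=True):
        path = os.path.join(self.cache_dir, obj_name)
        if track:
            # Remember the path so it can be copied once the stage finishes.
            self.unstored_tracked_paths.append({"obj_name": obj_name, "path": path})
        return path

    def store_tracked_paths(self):
        for entry in self.unstored_tracked_paths:
            source = entry["path"]
            if os.path.exists(source):
                shutil.copy(source, os.path.join(self.store_dir, entry["obj_name"]))
                # Record the source path, not the destination path.
                self.stored_paths.append(source)
        # Clear the pending list so the next stage starts fresh.
        self.unstored_tracked_paths = []


with tempfile.TemporaryDirectory() as cache_dir, tempfile.TemporaryDirectory() as store_dir:
    record = TrackingRecordSketch(cache_dir, store_dir)
    model_path = record.get_path("model.txt")
    with open(model_path, "w") as handle:
        handle.write("weights")
    scratch_path = record.get_path("scratch.txt", track=False)
    record.store_tracked_paths()
    copied = os.listdir(store_dir)
```

Only the tracked file that was actually written ends up in the store folder; the untracked scratch path is never queued for copying.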