Record

Contains relevant classes for records, objects that track a persistent state through some set of stages for a given parameter set.

Classes:

`ArtifactRepresentation`(record, name, artifact)	This is a shorthand string representation for an artifact stored in a record as well as output cache info.
`CacheAwareDict`(*args, **kwargs)	A normal dictionary that will return resolved versions of Lazy objects when accessed.
`Record`(manager, param_set[, hide])	A single persistent state that’s passed between stages in a single “experiment line”.

class curifactory.record.ArtifactRepresentation(record, name, artifact, metadata=None, cacher=None)

This is a shorthand string representation for an artifact stored in a record as well as output cache info. This is what gets displayed in the detailed experiment map in reports.

This will try to include helpful information in the string representation, such as pandas/numpy shapes, or lengths where applicable.

Parameters

record (Record) – The record this artifact is stored in.
name (str) – The name of the artifact.
artifact – The artifact itself.
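
As a rough illustration of the kind of shorthand described above, the following sketch builds a one-line description that reports a shape when the object has one and a length otherwise. The helper `describe_artifact` and its output format are hypothetical and are not curifactory's actual implementation.

```python
def describe_artifact(name, artifact):
    """Build a short one-line description of an artifact (hypothetical helper)."""
    type_name = type(artifact).__name__
    shape = getattr(artifact, "shape", None)
    if shape is not None:
        # pandas DataFrames and numpy arrays both expose a .shape attribute
        return f"{name}: ({type_name}) shape: {shape}"
    try:
        return f"{name}: ({type_name}) len: {len(artifact)}"
    except TypeError:
        # Objects without a length fall back to name and type only
        return f"{name}: ({type_name})"


tokens_description = describe_artifact("tokens", ["a", "b", "c"])
threshold_description = describe_artifact("threshold", 0.5)
```

Checking for `shape` first means array-like objects report their full dimensions rather than just the length of their first axis.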

Methods:

`html_safe`()	Removes special characters that’d break html, except it doesn’t actually display correctly.

html_safe()

Removes special characters that’d break html, except it doesn’t actually display correctly.

(I think this is an issue with how graphviz renders outputs?)

class curifactory.record.CacheAwareDict(*args, **kwargs)

A normal dictionary that will return resolved versions of Lazy objects when accessed.

Attributes:

`resolve`	A flag for enabling/disabling auto-resolve.

resolve: A flag for enabling/disabling auto-resolve. If this is False, this works just like a normal dictionary. This is necessary in the input checking part of a normal stage, as we make state access before determining if stage execution is required or not.

class curifactory.record.Record(manager, param_set, hide=False)

A single persistent state that’s passed between stages in a single “experiment line”.

Parameters

manager (ArtifactManager) – The artifact manager this record is associated with.
param_set – The parameter set (subclass of ExperimentParameters) to apply to any stages this record is run through.
hide (bool) – If True, don’t add this record to the artifact manager.

Attributes:

`combo_hash`	This gets set on records that run an aggregate stage.
`input_records`	A list of any records used as input to this one.
`is_aggregate`	If this record runs an aggregate stage, we flip this flag to true to know we need to use the combo hash rather than the individual args hash.
`manager`	The `ArtifactManager` associated with this record.
`output`	The returned value from the last stage that was run with this record.
`params`	The parameter set (subclass of `ExperimentParameters`) to apply to any stages this record is passed through.
`stage_cachers`	A list of the initialized cachers set for the current stage, if any.
`stage_inputs`	A list of lists per stage with the state inputs that stage requested.
`stage_inputs_names`	The names of the stage inputs as specified in the decorator.
`stage_kwargs_keys`	The keys of any kwargs passed into each stage from the experiment, with list indices corresponding to those of `self.stages`.
`stage_outputs`	A list of lists per stage with the state outputs that stage produced.
`stage_suppress_missing`	The list of whether the stage with the associated index in `self.stages` specified `suppress_missing_inputs` or not.
`stages`	The list of stage names that this record has run through so far.
`state`	The dictionary of all variables created by stages this record is passed through.
`state_artifact_reps`	Dictionary mimicking state that keeps an index to the associated artifact representation in manager’s artifact representation list.
`stored_paths`	A list of paths that have been copied into a full store folder.
`unstored_tracked_paths`	Paths obtained with get_path/get_dir that should be copied to a full store folder.

Methods:

`get_dir`(dir_name_suffix[, subdir, prefix, …])	Returns a cache path with the passed name and appropriate hash (similar to `get_path`) and creates it as a directory.
`get_hash`()	Returns either the hash of the parameter set, or the combo hash if this record is an aggregate.
`get_path`(obj_name[, subdir, prefix, …])	Return a cache path with passed object name and the correct hash based on the parameter set.
`get_reference_name`([map])	This returns a name describing the record, in the format ‘Record [index on manager] (paramset name)’.
`make_copy`([param_set, add_to_manager])	Make a new record that has a deep-copied version of the current state.
`report`(reportable)	Add a reportable associated with this record; it will be added to the experiment run output report.
`set_aggregate`(aggregate_records)	Mark this record as starting with an aggregate stage, meaning the hash of all cached outputs produced within this record needs to reflect the combo hash of all records going into it.
`set_hash`()	Establish the hash for the current parameter set (and set it on the parameter set instance).
`store_tracked_paths`()	Copy all of the recent relevant files generated (likely from the recently executing stage) into a store-full run.

combo_hash: This gets set on records that run an aggregate stage. This is set from utils.add_params_combo_hash.

get_dir(dir_name_suffix: str, subdir: Optional[str] = None, prefix: Optional[str] = None, stage_name: Optional[str] = None, track: bool = True) → str

Returns a cache path with the passed name and appropriate hash (similar to get_path) and creates it as a directory.

Parameters

dir_name_suffix (str) – the name to add as a suffix to the created directory name.
subdir (str) – An optional string of one or more nested subdirectories to prepend to the artifact filepath. This can be used if you want to subdivide cache and run artifacts into logical subsets, e.g. similar to https://towardsdatascience.com/the-importance-of-layered-thinking-in-data-engineering-a09f685edc71.
prefix (str) – An optional alternative prefix to the experiment-wide prefix (either the experiment name or custom-specified experiment prefix). This can be used if you want a cached object to work more easily across multiple experiments, rather than being experiment specific. WARNING: use with caution, cross-experiment caching can mess with provenance.
stage_name (str) – The associated stage for a path. If not provided, the currently executing stage name is used.
track (bool) – Whether to include the returned path in a store-full copy. This will only work if the returned path is not altered by a stage before saving something to it.

get_hash() → str: Returns either the hash of the parameter set, or the combo hash if this record is an aggregate.
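
The selection between the two hashes can be pictured with the following sketch. `HashRecordSketch` and its `params_hash` field are hypothetical names used only for illustration; the real `Record` stores and computes these values itself.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class HashRecordSketch:
    """Hypothetical stand-in holding only the fields relevant to hash selection."""

    params_hash: Optional[str] = None  # assumed name for the parameter set hash
    combo_hash: Optional[str] = None
    is_aggregate: bool = False

    def get_hash(self):
        # Aggregate records use the combo hash of the records feeding into them
        if self.is_aggregate:
            return self.combo_hash
        return self.params_hash


single = HashRecordSketch(params_hash="abc123")
aggregate = HashRecordSketch(params_hash="abc123", combo_hash="def456", is_aggregate=True)
single_hash = single.get_hash()
aggregate_hash = aggregate.get_hash()
```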

get_path(obj_name: str, subdir: Optional[str] = None, prefix: Optional[str] = None, stage_name: Optional[str] = None, track: bool = True) → str

Return a cache path with passed object name and the correct hash based on the parameter set.

This should be equivalent to what a cacher for a stage should get. Note that this is calling the manager’s get_path, which will include the stage name. If calling this outside of a stage, it will include whatever stage was last run.

Parameters

obj_name (str) – the name to associate with the object as the last part of the filename.
subdir (str) – An optional string of one or more nested subdirectories to prepend to the artifact filepath. This can be used if you want to subdivide cache and run artifacts into logical subsets, e.g. similar to https://towardsdatascience.com/the-importance-of-layered-thinking-in-data-engineering-a09f685edc71.
prefix (str) – An optional alternative prefix to the experiment-wide prefix (either the experiment name or custom-specified experiment prefix). This can be used if you want a cached object to work more easily across multiple experiments, rather than being experiment specific. WARNING: use with caution, cross-experiment caching can mess with provenance.
stage_name (str) – The associated stage for a path. If not provided, the currently executing stage name is used.
track (bool) – Whether to include the returned path in a store-full copy. This will only work if the returned path is not altered by a stage before saving something to it.
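
To show how the `subdir` and `prefix` parameters might affect the resulting path, here is a minimal sketch. The function `sketch_cache_path` and the filename layout it uses (prefix, hash, stage name, and object name joined by underscores) are assumptions made for illustration; the artifact manager determines the actual layout.

```python
import os


def sketch_cache_path(cache_dir, experiment_prefix, record_hash, stage_name,
                      obj_name, subdir=None, prefix=None):
    """Assemble a cache path from its parts (hypothetical layout)."""
    # A custom prefix replaces the experiment-wide one
    used_prefix = prefix if prefix is not None else experiment_prefix
    filename = f"{used_prefix}_{record_hash}_{stage_name}_{obj_name}"
    # An optional subdir is nested under the cache directory
    directory = os.path.join(cache_dir, subdir) if subdir else cache_dir
    return os.path.join(directory, filename)


default_path = sketch_cache_path("cache", "myexp", "abc123", "train", "model.pkl")
shared_path = sketch_cache_path(
    "cache", "myexp", "abc123", "train", "model.pkl",
    subdir="models", prefix="shared",
)
```

Under this assumed layout, two experiments that pass the same `prefix` and have the same parameter hash would resolve to the same file, which is why cross-experiment prefixes call for caution.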

get_reference_name(map=False) → str

This returns a name describing the record, in the format ‘Record [index on manager] (paramset name)’.

This should be the same as what’s shown in the stage map in the output report.

input_records: list: A list of any records used as input to this one. This mostly occurs when aggregate stages are run.

is_aggregate: If this record runs an aggregate stage, we flip this flag to true to know we need to use the combo hash rather than the individual args hash.

make_copy(param_set=None, add_to_manager=True)

Make a new record that has a deep-copied version of the current state.

This is useful for a long running procedure that creates a common dataset for many other stages, so that it can be replicated across multiple parameter sets without having to recompute for each individual parameter set.

Note that state is really the only thing transferred to the new record; the stage and inputs/outputs lists will be empty.

Also note that the current record will be added to the input_records of the new record, since it may draw on data in its state.

Parameters

param_set – The new ExperimentParameters to apply to the new record. Leave as None to retain the same parameter set as the current record.
add_to_manager – Whether to automatically add this record to the current manager or not.

Returns

A new record with the same state as this one, but under a different parameter set.
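
The copy semantics described above (deep-copied state, empty stage list, and the original record listed as an input) can be sketched as follows. `CopyRecordSketch` is a hypothetical stand-in that omits the manager and most attributes of the real `Record`.

```python
import copy


class CopyRecordSketch:
    """Hypothetical stand-in with only the attributes relevant to copying."""

    def __init__(self, params):
        self.params = params
        self.state = {}
        self.stages = []
        self.input_records = []

    def make_copy(self, param_set=None):
        # Passing None retains the current record's parameter set
        new_record = CopyRecordSketch(self.params if param_set is None else param_set)
        # Deep copy so later changes to either state do not affect the other
        new_record.state = copy.deepcopy(self.state)
        # The new record may draw on data from this one, so track it as an input
        new_record.input_records = [self]
        return new_record


base = CopyRecordSketch(params="shared_data_params")
base.state["dataset"] = [1, 2, 3]
base.stages.append("load_data")

variant = base.make_copy(param_set="model_a_params")
variant.state["dataset"].append(4)  # does not touch base.state
```

In this sketch the expensive dataset is computed once on `base`, and each variant starts from its own independent copy.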

manager: The ArtifactManager associated with this record.

output: The returned value from the last stage that was run with this record.

params: The parameter set (subclass of ExperimentParameters) to apply to any stages this record is passed through.

report(reportable: curifactory.reporting.Reportable)

Add a reportable associated with this record; it will be added to the experiment run output report.

Parameters: reportable (Reportable) – The reportable to render on the final experiment report.

set_aggregate(aggregate_records): Mark this record as starting with an aggregate stage, meaning the hash of all cached outputs produced within this record needs to reflect the combo hash of all records going into it.

set_hash(): Establish the hash for the current parameter set (and set it on the parameter set instance).

stage_cachers: list: A list of the initialized cachers set for the current stage, if any. This is so that a stage can get access to output path information if it needs.

stage_inputs: list: A list of lists per stage with the state inputs that stage requested. These are lists of indices into state_artifact_reps.

stage_inputs_names: list: The names of the stage inputs as specified in the decorator. This is primarily used to help check correctness during the DAG mapping phase.

stage_kwargs_keys: list: The keys of any kwargs passed into each stage from the experiment, with list indices corresponding to those of self.stages. We don’t store the full dictionary because it could contain large values, and the user may not expect a reference to them to stick around. We really only need this for checking the correctness of an experiment setup during the DAG mapping phase.

stage_outputs: list: A list of lists per stage with the state outputs that stage produced. These are lists of indices into state_artifact_reps.

stage_suppress_missing: list: The list of whether the stage with the associated index in self.stages specified suppress_missing_inputs or not. This is primarily needed for the DAG mapping phase.

stages: The list of stage names that this record has run through so far.

state: The dictionary of all variables created by stages this record is passed through. (AKA ‘Artifacts’) All inputs from stage decorators are pulled from this dictionary, and all outputs are stored here.

state_artifact_reps: dict: Dictionary mimicking state that keeps an index to the associated artifact representation in manager’s artifact representation list.

store_tracked_paths(): Copy all of the recent relevant files generated (likely from the recently executing stage) into a store-full run. This is run automatically at the end of a stage.

stored_paths: list: A list of paths that have been copied into a full store folder. These are the source paths, not the destination paths.

unstored_tracked_paths: list: Paths obtained with get_path/get_dir that should be copied to a full store folder. The last executed stage should manage copying anything listed here and then clearing it. This is a list of dicts that would be passed to the artifact manager’s get_artifact_path function: (obj_name, subdir, prefix, and path)
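
The bookkeeping between `unstored_tracked_paths` and `stored_paths` can be sketched as below. The function `store_tracked_paths_sketch` is hypothetical: it copies plain files into a flat destination folder and ignores directories, whereas the real method resolves destinations through the artifact manager.

```python
import os
import shutil
import tempfile


def store_tracked_paths_sketch(unstored, stored, store_dir):
    """Copy tracked files into store_dir, then clear the pending list."""
    for entry in unstored:
        source = entry["path"]
        if os.path.isfile(source):
            shutil.copy(source, os.path.join(store_dir, os.path.basename(source)))
            stored.append(source)  # remember the source path, not the destination
    unstored.clear()


with tempfile.TemporaryDirectory() as cache_dir, tempfile.TemporaryDirectory() as store_dir:
    source = os.path.join(cache_dir, "model.bin")
    with open(source, "wb") as f:
        f.write(b"weights")

    # Each pending entry carries the same keys the text lists
    unstored = [{"obj_name": "model.bin", "subdir": None, "prefix": None, "path": source}]
    stored = []
    store_tracked_paths_sketch(unstored, stored, store_dir)
    copied_names = os.listdir(store_dir)
```

Clearing the pending list after each pass keeps a later stage from copying the same files again.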