Caching
Classes for various caching strategies, known as "cachers". These are handled through a base Cacheable class, and each "strategy" cacher class extends it.

Note that there are effectively two ways to use cacheables: pass the cacher class itself in a stage's cachers list (curifactory instantiates it and constructs its path automatically), or pass an already-constructed instance (an "inline cacher"), e.g. to specify a path_override.
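As a sketch of both styles (the stage and output names here are hypothetical):

```python
from curifactory import stage
from curifactory.caching import JsonCacher

# 1) pass the class itself; curifactory instantiates it and
#    constructs the cache path automatically
@stage(inputs=None, outputs=["metrics"], cachers=[JsonCacher])
def compute_metrics(record):
    return {"accuracy": 0.9}

# 2) pass a pre-constructed instance (an "inline cacher"),
#    e.g. to control the path with a path_override
@stage(inputs=None, outputs=["metrics_fixed"],
       cachers=[JsonCacher(path_override="./data/metrics.json")])
def compute_metrics_fixed(record):
    return {"accuracy": 0.9}
```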
Classes:

- Cacheable – The base caching class; any caching strategy should extend this.
- FileReferenceCacher – Saves a file path or list of file paths generated from a stage as a JSON file.
- JsonCacher – Dumps an object to indented JSON.
- Lazy – Indicates a stage output as a lazy-cache object: curifactory will attempt to keep it out of memory as much as possible, immediately caching and deleting it, and loading it back into memory only when needed.
- PandasCacher – Saves a pandas dataframe to a selectable IO format.
- PandasCsvCacher – Saves a pandas dataframe to CSV.
- PandasJsonCacher – Saves a pandas dataframe to JSON.
- PathRef – A special type of cacher that doesn't directly save or load anything; it just tracks a file path for reference.
- PickleCacher – Dumps an object to a pickle file.
- RawJupyterNotebookCacher – Takes a list of code cells (where each cell is a list of strings containing Python code) and turns it into a Jupyter notebook.
class curifactory.caching.Cacheable(path_override: Optional[str] = None, name: Optional[str] = None, subdir: Optional[str] = None, prefix: Optional[str] = None, extension: Optional[str] = None, record=None, track: bool = True)

The base caching class; any caching strategy should extend this.
Parameters:

- path_override (str) – Use a specific path for the cacheable, rather than automatically setting it based on name etc. You can specify string formatting keywords to allow more control over caching when not specifying this as an inline cacher. (Please note that currently, if path_override is specified, the resulting path is not included in a full store.)
  Possible keyword replacement fields:
  - {hash} – the hash of the record.
  - {cache} – the path to the manager's cache directory (does not include the final '/').
  - {stage} – the name of the current stage.
  - {name} – the name of this output object.
  - {params.X} – parameter X of the parameters associated with the record.
  - {experiment} – the name of the current experiment.
  - {artifact_filename} – the normal filename for this cacher (doesn't include the directory path etc.).
- name (str) – The object name to use in automatically constructing a path. If a cacheable is used in a stage header, this is automatically provided as the output string name from the stage outputs list.
- subdir (str) – An optional string of one or more nested subdirectories to prepend to the artifact filepath. This can be used if you want to subdivide cache and run artifacts into logical subsets, e.g. similar to https://towardsdatascience.com/the-importance-of-layered-thinking-in-data-engineering-a09f685edc71.
- prefix (str) – An optional alternative to the experiment-wide prefix (either the experiment name or a custom-specified experiment prefix). This can be used if you want a cached object to work more easily across multiple experiments, rather than being experiment-specific. WARNING: use with caution, cross-experiment caching can mess with provenance.
- extension (str) – The filetype extension to add at the end of the path.
- record (Record) – The current record this cacheable is caching under. This can be used to get a copy of the current args instance, and is also how artifact metadata is collected.
- track (bool) – Whether to include the returned path in a store-full copy or not.
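For instance, a sketch of an inline cacher whose path_override uses replacement fields (the parameter name seed is hypothetical):

```python
from curifactory import stage
from curifactory.caching import JsonCacher

@stage(
    inputs=None,
    outputs=["results"],
    cachers=[JsonCacher(path_override="{cache}/{experiment}_{name}_{params.seed}.json")],
)
def compute_results(record):
    # the formatted path resolves per-record, e.g. substituting params.seed
    return {"score": 1.0}
```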
Note

It is strongly recommended that any subclasses of Cacheable take **kwargs in __init__ and pass them along to super():

```python
class CustomCacher(cf.Cacheable):
    def __init__(self, path_override: str = None, custom_attribute: Any = None, **kwargs):
        super().__init__(path_override, **kwargs)
        self.some_custom_attribute = custom_attribute
```

This allows consistent handling of paths in the parent get_path() and check() functions. If no custom attributes are needed, also pass in *args, so path_override can be specified without a kwarg:

```python
class CustomCacher(cf.Cacheable):
    def __init__(self, *args, **kwargs):
        super().__init__(*args, extension=".custom", **kwargs)
```
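Putting the pieces together, a minimal custom cacher sketch (this DictCacher is hypothetical, not part of curifactory) implements the required save() and load() via get_path():

```python
import json

import curifactory as cf

class DictCacher(cf.Cacheable):
    """Hypothetical cacher that round-trips a dict through a JSON file."""

    def __init__(self, *args, **kwargs):
        super().__init__(*args, extension=".json", **kwargs)

    def save(self, obj: dict):
        # get_path() resolves the full filepath, honoring any path_override
        with open(self.get_path(), "w") as outfile:
            json.dump(obj, outfile, indent=4)

    def load(self) -> dict:
        with open(self.get_path(), "r") as infile:
            return json.load(infile)
```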
Attributes:

- cache_paths – The running list of paths this cacher is using, as appended by get_path.
- extension – The filetype extension to add at the end of the path.
- extra_metadata – collect_metadata uses but does not overwrite this, placing it into the extra key in the actual metadata.
- metadata – Metadata about the artifact cached with this cacheable.
- name – The object name to use in automatically constructing a path.
- path_override – Use a specific path for the cacheable, rather than automatically setting it based on name etc.
- prefix – An optional alternative to the experiment-wide prefix (either the experiment name or a custom-specified experiment prefix).
- record – The current record this cacheable is caching under.
- stage – The stage associated with this cacher, if applicable.
- subdir – An optional string of one or more nested subdirectories to prepend to the artifact filepath.
- track – Whether to store the artifact this cacher is used with in the run folder on store-full runs or not.
Methods:

- check() – Check to see if this cacheable needs to be written or not.
- get_dir([suffix]) – Returns a path for a directory with the given suffix (if provided), appropriate for use in a save and load function.
- get_path([suffix]) – Retrieve the full filepath to use for saving and loading.
- load() – Load the cacheable from disk.
- save(obj) – Save the passed object to disk.
cache_paths: list

The running list of paths this cacher is using, as appended by get_path.

check() → bool

Check to see if this cacheable needs to be written or not.

Note: This function will always return False if the args are None.

Returns: True if we find the cached file and the current parameter set doesn't specify to overwrite, otherwise False.
extension

The filetype extension to add at the end of the path. (Optional; automatically used as the suffix in get_path if provided.)

extra_metadata: dict

collect_metadata uses but does not overwrite this, placing it into the extra key in the actual metadata. This can be used by the cacher's save function to store additional information that would then be available if the load function calls load_metadata().

get_dir(suffix=None) → str

Returns a path for a directory with the given suffix (if provided), appropriate for use in a save and load function. This will create any subdirectories in the path if they don't exist.
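For a cacher that writes multiple files, a hedged sketch of save() and load() built on get_dir() (this ShardedTextCacher and its filenames are hypothetical):

```python
import os

import curifactory as cf

class ShardedTextCacher(cf.Cacheable):
    """Hypothetical cacher that writes a list of strings as one file per item."""

    def save(self, obj: list):
        out_dir = self.get_dir()  # directory is created if it doesn't exist
        for i, text in enumerate(obj):
            with open(os.path.join(out_dir, f"shard_{i}.txt"), "w") as outfile:
                outfile.write(text)

    def load(self) -> list:
        out_dir = self.get_dir()
        texts = []
        for filename in sorted(os.listdir(out_dir)):
            with open(os.path.join(out_dir, filename)) as infile:
                texts.append(infile.read())
        return texts
```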
get_path(suffix=None) → str

Retrieve the full filepath to use for saving and loading. This should be called in the save() and load() implementations.

Parameters: suffix (str) – The suffix to append to the path. If not set, this will default to the cacheable's extension.

Note: If path_override is set, this cacher does not handle storing in the full store directory. The assumption is that either you're referring to a static external path (which doesn't make sense to copy), or you're manually passing in a record.get_path, in which case the record has already dealt with any logic necessary to add the path to the record's unstored tracked paths, which get copied over. Note also that this can be problematic for cachers that store multiple files, since anything that isn't the path_override won't get copied. For multiple-file cachers you should use name/subdir/prefix instead of setting a path_override.
load()

Load the cacheable from disk.

Note: Any subclass is required to implement this.

metadata: dict

Metadata about the artifact cached with this cacheable.

name

The object name to use in automatically constructing a path. If a cacheable is used in a stage header, this is automatically provided as the output string name from the stage outputs list.

path_override

Use a specific path for the cacheable, rather than automatically setting it based on name etc.

prefix

An optional alternative to the experiment-wide prefix (either the experiment name or a custom-specified experiment prefix). This can be used if you want a cached object to work more easily across multiple experiments, rather than being experiment-specific. WARNING: use with caution, cross-experiment caching can mess with provenance.

record

The current record this cacheable is caching under. This can be used to get a copy of the current args instance, and is also how artifact metadata is collected.

save(obj)

Save the passed object to disk.

Note: Any subclass is required to implement this.

stage: str

The stage associated with this cacher, if applicable.

subdir

An optional string of one or more nested subdirectories to prepend to the artifact filepath. This can be used if you want to subdivide cache and run artifacts into logical subsets, e.g. similar to https://towardsdatascience.com/the-importance-of-layered-thinking-in-data-engineering-a09f685edc71.

track

Whether to store the artifact this cacher is used with in the run folder on store-full runs or not.
class curifactory.caching.FileReferenceCacher(*args, **kwargs)

Saves a file path or list of file paths generated from a stage as a JSON file. The check function will check the existence of all file paths.

This is useful for instances where there may be a large number of files stored or generated to disk, as it would be unwieldy to return them all (or infeasible to keep them in memory) directly from the stage. When this cacher checks for a pre-existing cache, it tries to load the JSON file storing the filenames, and then checks for the existence of each path in that file. If all of them exist, it will short-circuit computation.

Using this cacher does mean the user is in charge of loading/saving the file paths correctly, but in some cases that may be desirable.

This can also be used for storing a reference to a single file outside the normal cache.

When combined with the get_dir call on the record, you can create a cached directory of files similarly to a regular cacher and simply keep a reference to them as part of the actual cacher process.

Example

```python
@stage(inputs=None, outputs=["many_text_files"], cachers=[FileReferenceCacher])
def output_text_files(record):
    file_path = record.get_dir("my_files")
    my_file_list = [os.path.join(file_path, f"my_file_{num}") for num in range(20)]

    for file in my_file_list:
        with open(file, 'w') as outfile:
            outfile.write("test")

    return my_file_list
```
Methods:

- check() – Check to see if this cacheable needs to be written or not.
- load() – Load the cacheable from disk.
- save(files) – Save the passed object to disk.

check() → bool

Check to see if this cacheable needs to be written or not.

Note: This function will always return False if the args are None.

Returns: True if we find the cached file and the current parameter set doesn't specify to overwrite, otherwise False.

load() → Union[list, str]

Load the cacheable from disk.

save(files: Union[list, str]) → str

Save the passed object to disk.
class curifactory.caching.JsonCacher(*args, **kwargs)

Dumps an object to indented JSON.

Methods:

- load() – Load the cacheable from disk.
- save(obj) – Save the passed object to disk.

load()

Load the cacheable from disk.

save(obj) → str

Save the passed object to disk.
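Cachers like this can also be constructed directly (outside a stage) by giving a path_override and calling save/load by hand; a minimal sketch, assuming the ./data directory already exists (the path and values here are hypothetical):

```python
from curifactory.caching import JsonCacher

# construct with an explicit path rather than a record-derived one
cacher = JsonCacher(path_override="./data/example_metrics.json")
cacher.save({"accuracy": 0.93, "loss": 0.12})

metrics = cacher.load()
print(metrics["accuracy"])
```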
class curifactory.caching.Lazy(name: str, resolve: bool = True)

A class to indicate a stage output as a lazy-cache object: curifactory will attempt to keep this out of memory as much as possible, immediately caching and deleting it, and loading it back into memory only when needed.

This object is used by "wrapping" a stage output string with the class.

Example

```python
@stage(inputs=None, outputs=["small_output", Lazy("large_output")], cachers=[PickleCacher] * 2)
def some_stage(record: Record):
    ...
```

Parameters:

- name (str) – the name of the output to put into state.
- resolve (bool) – Whether this lazy object should automatically reload the initial object when accessed from state. By default this is True: when a stage specifies the string name as an input and this object is requested from the record state, it loads and passes in the originally stored object. If set to False, the stage input will instead be populated with the lazy object itself, giving the inner stage code direct access to the cacher. This is useful if you need to keep objects out of memory and just want to refer to the cacher path (e.g. to send this path along to an external CLI/script).
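As a sketch of the resolve flag (the stage and output names here are hypothetical):

```python
from curifactory import Record, stage
from curifactory.caching import Lazy, PickleCacher

@stage(inputs=None, outputs=[Lazy("big_array", resolve=False)], cachers=[PickleCacher])
def make_big_array(record: Record):
    ...  # the returned object is cached and then dropped from memory

@stage(inputs=["big_array"], outputs=["report"], cachers=[PickleCacher])
def use_big_array(record: Record, big_array):
    # with resolve=False, big_array here is the Lazy wrapper itself rather
    # than the reloaded object, so the stage can refer to the cacher/path
    # without pulling the artifact into memory
    ...
```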
class curifactory.caching.PandasCacher(path_override: Optional[str] = None, format: Literal['csv', 'json', 'parquet', 'pickle', 'orc', 'hdf5', 'excel', 'xml'] = 'pickle', to_args: Optional[dict] = None, read_args: Optional[dict] = None, **kwargs)

Saves a pandas dataframe to a selectable IO format.

Parameters:

- format (str) – Selected pandas IO format. Choices are: ("csv", "json", "parquet", "pickle", "orc", "hdf5", "excel", "xml").
- to_args (Dict) – Dictionary of arguments to use in the pandas to_*() call.
- read_args (Dict) – Dictionary of arguments to use in the pandas read_*() call.
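For example, a sketch caching a dataframe as parquet with a custom write argument (the stage name and compression choice are illustrative):

```python
import pandas as pd

from curifactory import stage
from curifactory.caching import PandasCacher

@stage(
    inputs=None,
    outputs=["clean_df"],
    cachers=[PandasCacher(format="parquet", to_args={"compression": "snappy"})],
)
def clean_data(record):
    # on save, the cacher passes to_args to the pandas to_parquet() call
    return pd.DataFrame({"a": [1, 2, 3]})
```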
Methods:

- load() – Load the cacheable from disk.
- save(obj) – Save the passed object to disk.

load()

Load the cacheable from disk.

save(obj: pandas.core.frame.DataFrame) → str

Save the passed object to disk.
class curifactory.caching.PandasCsvCacher(path_override: Optional[str] = None, to_csv_args: dict = {}, read_csv_args: dict = {'index_col': 0}, **kwargs)

Saves a pandas dataframe to CSV.

Parameters:

- to_csv_args (Dict) – Dictionary of arguments to use in the pandas to_csv() call.
- read_csv_args (Dict) – Dictionary of arguments to use in the pandas read_csv() call.

Note: This is equivalent to using PandasCacher(format='csv').
class curifactory.caching.PandasJsonCacher(path_override: Optional[str] = None, to_json_args: dict = {'double_precision': 15}, read_json_args: dict = {}, **kwargs)

Saves a pandas dataframe to JSON.

Warning: Using this cacher is inadvisable for floating point data, as precision will be lost, creating the potential for different results when using cached values from this cacher as opposed to the first non-cached run.

Parameters:

- to_json_args (Dict) – Dictionary of arguments to use in the pandas to_json() call.
- read_json_args (Dict) – Dictionary of arguments to use in the pandas read_json() call.

Note: This is equivalent to using PandasCacher(format='json').
class curifactory.caching.PathRef(*args, **kwargs)

A special type of cacher that doesn't directly save or load anything; it just tracks a file path for reference.

This is primarily useful for stages that never keep a particular object in memory and just want to directly pass around paths. The PathRef cacher still allows short-circuiting if the referenced path already exists, rather than needing to do that check manually in the stage.

Note that when using a PathRef cacher you still need to return a value from the stage for the cacher to "save". This cacher expects that return value to be the path that was written to, and internally runs an assert returned_path == self.get_path() to double-check that the stage wrote to the correct place. This also means that the value stored "in memory" is just the path, and that path string is what gets "loaded".

This cacher is distinct from the FileReferenceCacher in that the path of this cacher is the referenced path, rather than saving a file that contains the referenced path. (In the case of the latter, a new record/hash etc. that refers to the same target filepath would still trigger stage execution, and still requires the stage to do its own check of whether the original file already exists before saving.)

Example

```python
@stage([], ["large_dataset_path"], [PathRef("./data/raw/big_data_{params.dataset}.csv")])
def make_big_data(record):
    # you can use record's stage_cachers to get the expected path
    output_path = record.stage_cachers[0].get_path()
    ...  # make big data without keeping it in memory
    return output_path


@stage(["large_dataset_path"], ["model_path"], [PathRef])
def train_model(record, large_dataset_path):
    # the other way to get a path that should be correct is through
    # record's get_path(). The assert inside PathRef's save will help
    # double check that it's correct.
    output_path = record.get_path(obj_name="model_path")
    ...  # train the model using large_dataset_path, the string path we need
    return output_path
```
Methods:

- load() – Load the cacheable from disk.
- save(obj) – This is effectively a no-op; this cacher is just a reference to its own path.

load() → str

Load the cacheable from disk.

save(obj: str)

This is effectively a no-op; this cacher is just a reference to its own path. obj is expected to be that same path, and we assert this to help alert the user if something got misaligned and the path they wrote to wasn't this cacher's path.
class curifactory.caching.PickleCacher(*args, **kwargs)

Dumps an object to a pickle file.

Methods:

- load() – Load the cacheable from disk.
- save(obj) – Save the passed object to disk.

load()

Load the cacheable from disk.

save(obj) → str

Save the passed object to disk.
class curifactory.caching.RawJupyterNotebookCacher(*args, **kwargs)

Takes a list of code cells (where each cell is a list of strings containing Python code) and turns it into a Jupyter notebook. This is useful in situations where you want each experiment to have some form of automatically populated analysis that a reportable wouldn't sufficiently cover, e.g. an interactive set of widgets or a dashboard.

Example

```python
@stage(inputs=["results_path"], outputs=["exploration_notebook"], cachers=[RawJupyterNotebookCacher])
def make_exploration_notebook(record, results_path):
    def convert_path(path):
        '''A function to translate paths to the local folder path.'''
        p = Path(path)
        p = Path(*p.parts[2:])
        return str(p)

    cells = [
        [
            "# imports",
            "from curifactory.caching import JsonCacher",
        ],
        [
            "# load things",
            f"cacher = JsonCacher('./{convert_path(results_path)}')",
            "results = cacher.load()",
            "results_metadata = cacher.metadata",
        ],
        [
            "# analysis",
            "...",
        ],
    ]
    return cells
```
Methods:

- load() – Load the cacheable from disk.
- save(obj) – Saves the raw cell strings to a _cells.json, and then uses ipynb-py-convert to change the output Python script into notebook format.

load()

Load the cacheable from disk.

save(obj: list)

Saves the raw cell strings to a _cells.json, and then uses ipynb-py-convert to change the output Python script into notebook format.