Caching

Classes for various caching strategies, known as “cachers”.

This is handled through a base Cacheable class, and each “strategy” cacher class extends it.

Note that there are effectively two ways to use cacheables: by passing them in a stage’s cachers list, where curifactory automatically calls save and load and short-circuits the stage when a cached copy exists, or by instantiating one directly and calling its save and load functions yourself.
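
For illustration, a rough sketch of both approaches (the stage, output names, and cache path here are hypothetical):

import curifactory as cf
from curifactory.caching import JsonCacher

# 1. In a stage header: curifactory calls the cacher's save()/load() for you
# and short-circuits the stage when a valid cached copy already exists.
@cf.stage(inputs=None, outputs=["metrics"], cachers=[JsonCacher])
def compute_metrics(record):
    return {"accuracy": 0.9}

# 2. Manually: instantiate a cacher yourself and call save()/load() directly,
# e.g. from analysis code outside of an experiment run.
cacher = JsonCacher(path_override="data/cache/metrics.json")
cacher.save({"accuracy": 0.9})
metrics = cacher.load()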

Classes:

Cacheable(path_override, name, subdir, …)

The base caching class; any caching strategy should extend this.

FileReferenceCacher(*args, **kwargs)

Saves a file path or list of file paths generated from a stage as a json file.

JsonCacher(*args, **kwargs)

Dumps an object to indented JSON.

Lazy(name, resolve)

A class to indicate a stage output as a lazy-cache object - curifactory will attempt to keep this out of memory as much as possible, immediately caching and deleting it, and loading it back into memory only when needed.

PandasCsvCacher(path_override, to_csv_args, …)

Saves a pandas dataframe to CSV.

PandasJsonCacher(path_override, …)

Saves a pandas dataframe to JSON.

PickleCacher(*args, **kwargs)

Dumps an object to a pickle file.

RawJupyterNotebookCacher(*args, **kwargs)

Takes a list of code cells (where each cell is a list of strings containing Python code) and turns it into a Jupyter notebook.

class curifactory.caching.Cacheable(path_override: Optional[str] = None, name: Optional[str] = None, subdir: Optional[str] = None, prefix: Optional[str] = None, extension: Optional[str] = None, record=None, track: bool = True)

The base caching class; any caching strategy should extend this.

Parameters
  • path_override (str) – Use a specific path for the cacheable, rather than automatically setting it based on name etc.

  • name (str) – The obj name to use in automatically constructing a path. If a cacheable is used in a stage header, this is automatically provided as the output string name from the stage outputs list.

  • subdir (str) – An optional string of one or more nested subdirectories to prepend to the artifact filepath. This can be used if you want to subdivide cache and run artifacts into logical subsets, e.g. similar to https://towardsdatascience.com/the-importance-of-layered-thinking-in-data-engineering-a09f685edc71.

  • prefix (str) – An optional alternative prefix to the experiment-wide prefix (either the experiment name or custom-specified experiment prefix). This can be used if you want a cached object to work more easily across multiple experiments rather than being experiment-specific. WARNING: use with caution, cross-experiment caching can mess with provenance.

  • extension (str) – The filetype extension to add at the end of the path.

  • record (Record) – The current record this cacheable is caching under. This can be used to get a copy of the current args instance and is also how artifact metadata is collected.

  • track (bool) – Whether to include the returned path in a store-full copy or not.
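
For example, a stage can pass a pre-configured cacher instance rather than the bare class in order to set these (the stage and subdirectory names here are hypothetical):

@stage(inputs=None, outputs=["intermediate_table"], cachers=[JsonCacher(subdir="intermediate", track=False)])
def build_intermediate_table(record):
    # cached under an "intermediate" subdirectory and excluded from
    # store-full copies via track=False
    return {"rows": 100}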

Note

It is strongly recommended that any subclasses of Cacheable take **kwargs in __init__ and pass them along to super():

class CustomCacher(cf.Cacheable):
    def __init__(self, path_override: str = None, custom_attribute: Any = None, **kwargs):
        super().__init__(path_override, **kwargs)
        self.some_custom_attribute = custom_attribute

This allows consistent handling of paths in the parent get_path() and check() functions.

If no custom attributes are needed, also pass in *args, so path_override can be specified without a kwarg:

class CustomCacher(cf.Cacheable):
    def __init__(self, *args, **kwargs):
        super().__init__(*args, extension=".custom", **kwargs)
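
Beyond the constructor, a subclass then only needs save() and load() implementations that write to and read from get_path(). A minimal sketch (this TextCacher is purely illustrative, not part of curifactory):

class TextCacher(cf.Cacheable):
    def __init__(self, *args, **kwargs):
        super().__init__(*args, extension=".txt", **kwargs)

    def save(self, obj: str):
        # get_path() resolves the full artifact path from name/subdir/prefix/extension
        path = self.get_path()
        with open(path, "w") as outfile:
            outfile.write(obj)
        return path  # the built-in cachers' save() -> str signatures suggest returning the written path

    def load(self) -> str:
        with open(self.get_path(), "r") as infile:
            return infile.read()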

Attributes:

cache_paths

The running list of paths this cacher is using, as appended by get_path.

extension

The filetype extension to add at the end of the path.

extra_metadata

collect_metadata uses but does not overwrite this, placing it into the extra key of the actual metadata.

metadata

Metadata about the artifact cached with this cacheable.

name

The obj name to use in automatically constructing a path.

path_override

Use a specific path for the cacheable, rather than automatically setting it based on name etc.

prefix

An optional alternative prefix to the experiment-wide prefix (either the experiment name or custom-specified experiment prefix).

record

The current record this cacheable is caching under.

stage

The stage associated with this cacher, if applicable.

subdir

An optional string of one or more nested subdirectories to prepend to the artifact filepath.

track

Whether to store the artifact this cacher is used with in the run folder on store-full runs or not.

Methods:

check()

Check to see if this cacheable needs to be written or not.

get_dir([suffix])

Returns a path for a directory with the given suffix (if provided), appropriate for use in a save and load function.

get_path([suffix])

Retrieve the full filepath to use for saving and loading.

load()

Load the cacheable from disk.

save(obj)

Save the passed object to disk.

cache_paths: list

The running list of paths this cacher is using, as appended by get_path.

check() → bool

Check to see if this cacheable needs to be written or not.

Note

This function will always return False if the args are None.

Returns

True if we find the cached file and the current Args don’t specify to overwrite, otherwise False.

extension

The filetype extension to add at the end of the path. (Optional, automatically used as suffix in get_path if provided)

extra_metadata: dict

collect_metadata uses but does not overwrite this, placing it into the extra key of the actual metadata. This can be used by the cacher’s save function to store additional information that would then be available if the ‘load’ function calls load_metadata().
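
As an illustrative sketch of that flow (the subclass and the exact metadata layout after load_metadata() are assumptions, not curifactory specifics):

class TableCacher(PickleCacher):
    def save(self, obj):
        # stash extra info at save time; collect_metadata places this under
        # the "extra" key of the artifact's metadata
        self.extra_metadata["row_count"] = len(obj)
        return super().save(obj)

    def load(self):
        # pull the stored metadata back in before loading the artifact
        self.load_metadata()
        row_count = self.metadata["extra"]["row_count"]  # assumed layout
        return super().load()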

get_dir(suffix=None) → str

Returns a path for a directory with the given suffix (if provided), appropriate for use in a save and load function. This will create any subdirectories in the path if they don’t exist.
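
For instance, a hypothetical cacher that splits an artifact across several files could write them into the directory returned by get_dir():

class ShardedTextCacher(cf.Cacheable):
    def save(self, obj: list):
        target_dir = self.get_dir()  # subdirectories are created as needed
        for index, text in enumerate(obj):
            with open(os.path.join(target_dir, f"part_{index}.txt"), "w") as outfile:
                outfile.write(text)

    def load(self) -> list:
        contents = []
        target_dir = self.get_dir()
        for filename in sorted(os.listdir(target_dir)):
            with open(os.path.join(target_dir, filename), "r") as infile:
                contents.append(infile.read())
        return contents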

get_path(suffix=None) → str

Retrieve the full filepath to use for saving and loading. This should be called in the save() and load() implementations.

Parameters

suffix (str) – The suffix to append to the path. If not set, this will default to the cacheable’s extension.

Note

If path_override is set on this cacher, this cacher does not handle storing in the full store directory. The assumption is that either you’re referring to a static external path (which doesn’t make sense to copy), or you’re manually passing in a record.get_path, in which case the record has already dealt with any logic necessary to add the path to the record’s unstored_tracked paths, which get copied over. Note also that this can be problematic for cachers that store multiple files, since anything that isn’t the path_override won’t get copied. For multiple-file cachers you should use name/subdir/prefix instead of setting a path_override.

load()

Load the cacheable from disk.

Note

Any subclass is required to implement this.

metadata: dict

Metadata about the artifact cached with this cacheable.

name

The obj name to use in automatically constructing a path. If a cacheable is used in a stage header, this is automatically provided as the output string name from the stage outputs list.

path_override

Use a specific path for the cacheable, rather than automatically setting it based on name etc.

prefix

An optional alternative prefix to the experiment-wide prefix (either the experiment name or custom-specified experiment prefix). This can be used if you want a cached object to work more easily across multiple experiments rather than being experiment-specific. WARNING: use with caution, cross-experiment caching can mess with provenance.

record

The current record this cacheable is caching under. This can be used to get a copy of the current args instance and is also how artifact metadata is collected.

save(obj)

Save the passed object to disk.

Note

Any subclass is required to implement this.

stage: str

The stage associated with this cacher, if applicable.

subdir

An optional string of one or more nested subdirectories to prepend to the artifact filepath. This can be used if you want to subdivide cache and run artifacts into logical subsets, e.g. similar to https://towardsdatascience.com/the-importance-of-layered-thinking-in-data-engineering-a09f685edc71.

track

Whether to store the artifact this cacher is used with in the run folder on store-full runs or not.

class curifactory.caching.FileReferenceCacher(*args, **kwargs)

Saves a file path or list of file paths generated from a stage as a json file. The check function will check existence of all file paths.

This is useful for instances where there may be a large number of files stored or generated to disk, as it would be unwieldy to return them all (or infeasible to keep them in memory) directly from the stage. When this cacher’s check runs, it tries to load the json file storing the filenames, and then checks for the existence of each path in that json file. If all of them exist, it will short-circuit computation.

Using this cacher does mean the user is in charge of correctly saving and loading the referenced files, but in some cases that may be desirable.

This can also be used for storing a reference to a single file outside the normal cache.

When combined with the record’s get_dir call, you can create a cached directory of files similar to a regular cacher and simply keep a reference to them as part of the actual caching process.

Example

@stage(inputs=None, outputs=["many_text_files"], cachers=[FileReferenceCacher])
def output_text_files(record):
    file_path = record.get_dir("my_files")
    my_file_list = [os.path.join(file_path, f"my_file_{num}") for num in range(20)]

    for file in my_file_list:
        with open(file, 'w') as outfile:
            outfile.write("test")

    return my_file_list
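
A downstream stage then receives the list of paths (not the file contents) as its input and is responsible for opening the files itself; a sketch continuing the example above (the stage name is hypothetical):

@stage(inputs=["many_text_files"], outputs=["combined_text"], cachers=[PickleCacher])
def combine_text_files(record, many_text_files):
    contents = []
    for path in many_text_files:
        with open(path, 'r') as infile:
            contents.append(infile.read())
    return "".join(contents)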

Methods:

check()

Check to see if this cacheable needs to be written or not.

load()

Load the cacheable from disk.

save(files)

Save the passed object to disk.

check() → bool

Check to see if this cacheable needs to be written or not.

Note

This function will always return False if the args are None.

Returns

True if we find the cached file and the current Args don’t specify to overwrite, otherwise False.

load() → Union[list, str]

Load the cacheable from disk.

Note

Any subclass is required to implement this.

save(files: Union[list, str]) → str

Save the passed object to disk.

Note

Any subclass is required to implement this.

class curifactory.caching.JsonCacher(*args, **kwargs)

Dumps an object to indented JSON.

Methods:

load()

Load the cacheable from disk.

save(obj)

Save the passed object to disk.

load()

Load the cacheable from disk.

Note

Any subclass is required to implement this.

save(obj) → str

Save the passed object to disk.

Note

Any subclass is required to implement this.

class curifactory.caching.Lazy(name: str, resolve: bool = True)

A class to indicate a stage output as a lazy-cache object - curifactory will attempt to keep this out of memory as much as possible, immediately caching and deleting it, and loading it back into memory only when needed.

This object is used by “wrapping” a stage output string with the class.

Example

@stage(inputs=None, outputs=["small_output", cf.Lazy("large_output")], cachers=[PickleCacher]*2)
def some_stage(record: Record):
    ...

Parameters
  • name (str) – The name of the output to put into state.

  • resolve (bool) – Whether this lazy object should automatically reload the initial object when accessed from state. By default this is True - when a stage specifies the string name as an input and this object is requested from the record state, it loads and passes in the originally stored object. If set to False, the stage input will instead be populated with the lazy object itself, giving the inner stage code direct access to the cacher. This is useful if you need to keep objects out of memory and just want to refer to the cacher path (e.g. to send this path along to an external CLI/script.)
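
A sketch of resolve=False (the stage names here are hypothetical):

@stage(inputs=None, outputs=[cf.Lazy("big_table", resolve=False)], cachers=[PickleCacher])
def make_big_table(record):
    ...

@stage(inputs=["big_table"], outputs=["table_summary"], cachers=[JsonCacher])
def summarize_big_table(record, big_table):
    # With resolve=False, big_table is the lazy object itself rather than the
    # loaded artifact, so the data never has to be pulled into memory here;
    # its cacher's path could instead be handed off to an external CLI/script.
    ...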

class curifactory.caching.PandasCsvCacher(path_override: Optional[str] = None, to_csv_args: dict = {}, read_csv_args: dict = {'index_col': 0}, **kwargs)

Saves a pandas dataframe to CSV.

Parameters
  • to_csv_args (Dict) – Dictionary of arguments to use in the pandas to_csv() call.

  • read_csv_args (Dict) – Dictionary of arguments to use in the pandas read_csv() call.
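
For example, a pre-configured instance can be passed in a stage’s cachers list (the stage itself is hypothetical). Note that if to_csv_args drops the index, read_csv_args should not assume an index column:

import pandas as pd

@stage(
    inputs=None,
    outputs=["scores_table"],
    cachers=[PandasCsvCacher(to_csv_args={"index": False}, read_csv_args={})],
)
def build_scores_table(record):
    return pd.DataFrame({"model": ["a", "b"], "score": [0.8, 0.9]})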

Methods:

load()

Load the cacheable from disk.

save(obj)

Save the passed object to disk.

load()

Load the cacheable from disk.

Note

Any subclass is required to implement this.

save(obj) → str

Save the passed object to disk.

Note

Any subclass is required to implement this.

class curifactory.caching.PandasJsonCacher(path_override: Optional[str] = None, to_json_args: dict = {'double_precision': 15}, read_json_args: dict = {}, **kwargs)

Saves a pandas dataframe to JSON.

Warning

Using this cacher is inadvisable for floating point data, as precision will be lost, creating the potential for different results when using cached values with this cacher as opposed to the first non-cached run.

Parameters
  • to_json_args (Dict) – Dictionary of arguments to use in the pandas to_json() call.

  • read_json_args (Dict) – Dictionary of arguments to use in the pandas read_json() call.

Methods:

load()

Load the cacheable from disk.

save(obj)

Save the passed object to disk.

load()

Load the cacheable from disk.

Note

Any subclass is required to implement this.

save(obj) → str

Save the passed object to disk.

Note

Any subclass is required to implement this.

class curifactory.caching.PickleCacher(*args, **kwargs)

Dumps an object to a pickle file.

Methods:

load()

Load the cacheable from disk.

save(obj)

Save the passed object to disk.

load()

Load the cacheable from disk.

Note

Any subclass is required to implement this.

save(obj) → str

Save the passed object to disk.

Note

Any subclass is required to implement this.

class curifactory.caching.RawJupyterNotebookCacher(*args, **kwargs)

Takes a list of code cells (where each cell is a list of strings containing Python code) and turns it into a Jupyter notebook. This is useful in situations where you want each experiment to have some form of automatically populated analysis that a reportable wouldn’t sufficiently cover, e.g. an interactive set of widgets or a dashboard.

Example

@stage(inputs=["results_path"], outputs=["exploration_notebook"], cachers=[RawJupyterNotebookCacher])
def make_exploration_notebook(record, results_path):
    def convert_path(path):
        '''A function to translate paths to local folder path.'''
        p = Path(path)
        p = Path(*p.parts[2:])
        return str(p)

    cells = [
        [
            "# imports",
            "from curifactory.caching import JsonCacher",
        ],
        [
            "# load things",
            f"cacher = JsonCacher('./{convert_path(results_path)}')",
            "results = cacher.load()",
            "results_metadata = cacher.metadata",
        ],
        [
            "# analysis",
            "...",
        ],
    ]

    return cells

Methods:

load()

Load the cacheable from disk.

save(obj)

This saves the raw cell strings to a _cells.json file, and then uses ipynb-py-convert to change the output Python script into notebook format.

load()

Load the cacheable from disk.

Note

Any subclass is required to implement this.

save(obj: list)

This saves the raw cell strings to a _cells.json file, and then uses ipynb-py-convert to change the output Python script into notebook format.