Caching
Classes for various caching strategies, known as "cachers". These are handled through a base Cacheable class, and each "strategy" cacher class extends it.

Note that there are effectively two ways to use cacheables: pass the cacher class itself in a stage's cachers list (curifactory instantiates it and constructs its path automatically), or pass an already-constructed instance (an "inline cacher"), e.g. to specify a path_override.
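As a sketch of both styles (the stage and output names here are hypothetical):

```python
from curifactory import stage
from curifactory.caching import JsonCacher

# 1) pass the class itself; curifactory instantiates it and
#    constructs the cache path automatically
@stage(inputs=None, outputs=["metrics"], cachers=[JsonCacher])
def compute_metrics(record):
    return {"accuracy": 0.9}

# 2) pass a pre-constructed instance (an "inline cacher"),
#    e.g. to control the path with a path_override
@stage(inputs=None, outputs=["metrics_fixed"],
       cachers=[JsonCacher(path_override="./data/metrics.json")])
def compute_metrics_fixed(record):
    return {"accuracy": 0.9}
```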
Classes:

- Cacheable – The base caching class; any caching strategy should extend this.
- FileReferenceCacher – Saves a file path or list of file paths generated from a stage as a JSON file.
- JsonCacher – Dumps an object to indented JSON.
- Lazy – Indicates a stage output as a lazy-cache object: curifactory will attempt to keep it out of memory as much as possible, immediately caching and deleting it, and loading it back into memory only when needed.
- PandasCacher – Saves a pandas dataframe to a selectable IO format.
- PandasCsvCacher – Saves a pandas dataframe to CSV.
- PandasJsonCacher – Saves a pandas dataframe to JSON.
- PathRef – A special type of cacher that doesn't directly save or load anything; it just tracks a file path for reference.
- PickleCacher – Dumps an object to a pickle file.
- RawJupyterNotebookCacher – Takes a list of code cells (where each cell is a list of strings containing Python code) and turns it into a Jupyter notebook.
class curifactory.caching.Cacheable(path_override: Optional[str] = None, name: Optional[str] = None, subdir: Optional[str] = None, prefix: Optional[str] = None, extension: Optional[str] = None, record=None, track: bool = True)

The base caching class; any caching strategy should extend this.
Parameters:

- path_override (str) – Use a specific path for the cacheable, rather than automatically setting it based on name etc. You can specify string formatting keywords to allow more control over caching when not specifying this as an inline cacher. (Please note that currently, if path_override is specified, the resulting path is not included in a full store.)
  Possible keyword replacement fields:
  - {hash} – the hash of the record.
  - {cache} – the path to the manager's cache directory (does not include the final '/').
  - {stage} – the name of the current stage.
  - {name} – the name of this output object.
  - {params.X} – parameter X of the parameters associated with the record.
  - {experiment} – the name of the current experiment.
  - {artifact_filename} – the normal filename for this cacher (doesn't include the directory path etc.).
- name (str) – The object name to use in automatically constructing a path. If a cacheable is used in a stage header, this is automatically provided as the output string name from the stage outputs list.
- subdir (str) – An optional string of one or more nested subdirectories to prepend to the artifact filepath. This can be used if you want to subdivide cache and run artifacts into logical subsets, e.g. similar to https://towardsdatascience.com/the-importance-of-layered-thinking-in-data-engineering-a09f685edc71.
- prefix (str) – An optional alternative to the experiment-wide prefix (either the experiment name or a custom-specified experiment prefix). This can be used if you want a cached object to work more easily across multiple experiments, rather than being experiment-specific. WARNING: use with caution, cross-experiment caching can mess with provenance.
- extension (str) – The filetype extension to add at the end of the path.
- record (Record) – The current record this cacheable is caching under. This can be used to get a copy of the current args instance, and is also how artifact metadata is collected.
- track (bool) – Whether to include the returned path in a store-full copy or not.
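For instance, a sketch of an inline cacher whose path_override uses replacement fields (the parameter name seed is hypothetical):

```python
from curifactory import stage
from curifactory.caching import JsonCacher

@stage(
    inputs=None,
    outputs=["results"],
    cachers=[JsonCacher(path_override="{cache}/{experiment}_{name}_{params.seed}.json")],
)
def compute_results(record):
    # the formatted path resolves per-record, e.g. substituting params.seed
    return {"score": 1.0}
```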
Note

It is strongly recommended that any subclasses of Cacheable take **kwargs in __init__ and pass them along to super():

```python
class CustomCacher(cf.Cacheable):
    def __init__(self, path_override: str = None, custom_attribute: Any = None, **kwargs):
        super().__init__(path_override, **kwargs)
        self.some_custom_attribute = custom_attribute
```

This allows consistent handling of paths in the parent get_path() and check() functions. If no custom attributes are needed, also pass in *args, so path_override can be specified without a kwarg:

```python
class CustomCacher(cf.Cacheable):
    def __init__(self, *args, **kwargs):
        super().__init__(*args, extension=".custom", **kwargs)
```
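Putting the pieces together, a minimal custom cacher sketch (this DictCacher is hypothetical, not part of curifactory) implements the required save() and load() via get_path():

```python
import json

import curifactory as cf

class DictCacher(cf.Cacheable):
    """Hypothetical cacher that round-trips a dict through a JSON file."""

    def __init__(self, *args, **kwargs):
        super().__init__(*args, extension=".json", **kwargs)

    def save(self, obj: dict):
        # get_path() resolves the full filepath, honoring any path_override
        with open(self.get_path(), "w") as outfile:
            json.dump(obj, outfile, indent=4)

    def load(self) -> dict:
        with open(self.get_path(), "r") as infile:
            return json.load(infile)
```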
Attributes:

- cache_paths – The running list of paths this cacher is using, as appended by get_path.
- extension – The filetype extension to add at the end of the path.
- extra_metadata – collect_metadata uses but does not overwrite this, placing it into the extra key in the actual metadata.
- metadata – Metadata about the artifact cached with this cacheable.
- name – The object name to use in automatically constructing a path.
- path_override – Use a specific path for the cacheable, rather than automatically setting it based on name etc.
- prefix – An optional alternative to the experiment-wide prefix (either the experiment name or a custom-specified experiment prefix).
- record – The current record this cacheable is caching under.
- stage – The stage associated with this cacher, if applicable.
- subdir – An optional string of one or more nested subdirectories to prepend to the artifact filepath.
- track – Whether to store the artifact this cacher is used with in the run folder on store-full runs or not.
Methods:

- check() – Check to see if this cacheable needs to be written or not.
- get_dir([suffix]) – Returns a path for a directory with the given suffix (if provided), appropriate for use in a save and load function.
- get_path([suffix]) – Retrieve the full filepath to use for saving and loading.
- load() – Load the cacheable from disk.
- save(obj) – Save the passed object to disk.
cache_paths: list

The running list of paths this cacher is using, as appended by get_path.

check() → bool

Check to see if this cacheable needs to be written or not.

Note: This function will always return False if the args are None.

Returns: True if we find the cached file and the current parameter set doesn't specify to overwrite, otherwise False.
extension

The filetype extension to add at the end of the path. (Optional; automatically used as the suffix in get_path if provided.)

extra_metadata: dict

collect_metadata uses but does not overwrite this, placing it into the extra key in the actual metadata. This can be used by the cacher's save function to store additional information that would then be available if the load function calls load_metadata().

get_dir(suffix=None) → str

Returns a path for a directory with the given suffix (if provided), appropriate for use in a save and load function. This will create any subdirectories in the path if they don't exist.
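For a cacher that writes multiple files, a hedged sketch of save() and load() built on get_dir() (this ShardedTextCacher and its filenames are hypothetical):

```python
import os

import curifactory as cf

class ShardedTextCacher(cf.Cacheable):
    """Hypothetical cacher that writes a list of strings as one file per item."""

    def save(self, obj: list):
        out_dir = self.get_dir()  # directory is created if it doesn't exist
        for i, text in enumerate(obj):
            with open(os.path.join(out_dir, f"shard_{i}.txt"), "w") as outfile:
                outfile.write(text)

    def load(self) -> list:
        out_dir = self.get_dir()
        texts = []
        for filename in sorted(os.listdir(out_dir)):
            with open(os.path.join(out_dir, filename)) as infile:
                texts.append(infile.read())
        return texts
```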
get_path(suffix=None) → str

Retrieve the full filepath to use for saving and loading. This should be called in the save() and load() implementations.

Parameters: suffix (str) – The suffix to append to the path. If not set, this will default to the cacheable's extension.

Note: If path_override is set, this cacher does not handle storing in the full store directory. The assumption is that either you're referring to a static external path (which doesn't make sense to copy), or you're manually passing in a record.get_path, in which case the record has already dealt with any logic necessary to add the path to the record's unstored tracked paths, which get copied over. Note also that this can be problematic for cachers that store multiple files, since anything that isn't the path_override won't get copied. For multiple-file cachers you should use name/subdir/prefix instead of setting a path_override.
load()

Load the cacheable from disk.

Note: Any subclass is required to implement this.

metadata: dict

Metadata about the artifact cached with this cacheable.

name

The object name to use in automatically constructing a path. If a cacheable is used in a stage header, this is automatically provided as the output string name from the stage outputs list.

path_override

Use a specific path for the cacheable, rather than automatically setting it based on name etc.

prefix

An optional alternative to the experiment-wide prefix (either the experiment name or a custom-specified experiment prefix). This can be used if you want a cached object to work more easily across multiple experiments, rather than being experiment-specific. WARNING: use with caution, cross-experiment caching can mess with provenance.

record

The current record this cacheable is caching under. This can be used to get a copy of the current args instance, and is also how artifact metadata is collected.

save(obj)

Save the passed object to disk.

Note: Any subclass is required to implement this.

stage: str

The stage associated with this cacher, if applicable.

subdir

An optional string of one or more nested subdirectories to prepend to the artifact filepath. This can be used if you want to subdivide cache and run artifacts into logical subsets, e.g. similar to https://towardsdatascience.com/the-importance-of-layered-thinking-in-data-engineering-a09f685edc71.

track

Whether to store the artifact this cacher is used with in the run folder on store-full runs or not.
class curifactory.caching.FileReferenceCacher(*args, **kwargs)

Saves a file path or list of file paths generated from a stage as a JSON file. The check function will check the existence of all file paths.

This is useful for instances where there may be a large number of files stored or generated to disk, as it would be unwieldy to return them all (or infeasible to keep them in memory) directly from the stage. When this cacher checks for a pre-existing cache, it tries to load the JSON file storing the filenames, and then checks for the existence of each path in that file. If all of them exist, it will short-circuit computation.

Using this cacher does mean the user is in charge of loading/saving the file paths correctly, but in some cases that may be desirable.

This can also be used for storing a reference to a single file outside the normal cache.

When combined with the get_dir call on the record, you can create a cached directory of files similarly to a regular cacher and simply keep a reference to them as part of the actual cacher process.

Example

```python
@stage(inputs=None, outputs=["many_text_files"], cachers=[FileReferenceCacher])
def output_text_files(record):
    file_path = record.get_dir("my_files")
    my_file_list = [os.path.join(file_path, f"my_file_{num}") for num in range(20)]

    for file in my_file_list:
        with open(file, 'w') as outfile:
            outfile.write("test")

    return my_file_list
```
Methods:

- check() – Check to see if this cacheable needs to be written or not.
- load() – Load the cacheable from disk.
- save(files) – Save the passed object to disk.

check() → bool

Check to see if this cacheable needs to be written or not.

Note: This function will always return False if the args are None.

Returns: True if we find the cached file and the current parameter set doesn't specify to overwrite, otherwise False.

load() → Union[list, str]

Load the cacheable from disk.

save(files: Union[list, str]) → str

Save the passed object to disk.
class curifactory.caching.JsonCacher(*args, **kwargs)

Dumps an object to indented JSON.

Methods:

- load() – Load the cacheable from disk.
- save(obj) – Save the passed object to disk.

load()

Load the cacheable from disk.

save(obj) → str

Save the passed object to disk.
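Cachers like this can also be constructed directly (outside a stage) by giving a path_override and calling save/load by hand; a minimal sketch, assuming the ./data directory already exists (the path and values here are hypothetical):

```python
from curifactory.caching import JsonCacher

# construct with an explicit path rather than a record-derived one
cacher = JsonCacher(path_override="./data/example_metrics.json")
cacher.save({"accuracy": 0.93, "loss": 0.12})

metrics = cacher.load()
print(metrics["accuracy"])
```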
class curifactory.caching.Lazy(name: str, resolve: bool = True)

A class to indicate a stage output as a lazy-cache object: curifactory will attempt to keep this out of memory as much as possible, immediately caching and deleting it, and loading it back into memory only when needed.

This object is used by "wrapping" a stage output string with the class.

Example

```python
@stage(inputs=None, outputs=["small_output", Lazy("large_output")], cachers=[PickleCacher] * 2)
def some_stage(record: Record):
    ...
```

Parameters:

- name (str) – the name of the output to put into state.
- resolve (bool) – Whether this lazy object should automatically reload the initial object when accessed from state. By default this is True: when a stage specifies the string name as an input and this object is requested from the record state, it loads and passes in the originally stored object. If set to False, the stage input will instead be populated with the lazy object itself, giving the inner stage code direct access to the cacher. This is useful if you need to keep objects out of memory and just want to refer to the cacher path (e.g. to send this path along to an external CLI/script).
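As a sketch of the resolve flag (the stage and output names here are hypothetical):

```python
from curifactory import Record, stage
from curifactory.caching import Lazy, PickleCacher

@stage(inputs=None, outputs=[Lazy("big_array", resolve=False)], cachers=[PickleCacher])
def make_big_array(record: Record):
    ...  # the returned object is cached and then dropped from memory

@stage(inputs=["big_array"], outputs=["report"], cachers=[PickleCacher])
def use_big_array(record: Record, big_array):
    # with resolve=False, big_array here is the Lazy wrapper itself rather
    # than the reloaded object, so the stage can refer to the cacher/path
    # without pulling the artifact into memory
    ...
```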
class curifactory.caching.PandasCacher(path_override: Optional[str] = None, format: Literal['csv', 'json', 'parquet', 'pickle', 'orc', 'hdf5', 'excel', 'xml'] = 'pickle', to_args: Optional[dict] = None, read_args: Optional[dict] = None, **kwargs)

Saves a pandas dataframe to a selectable IO format.

Parameters:

- format (str) – Selected pandas IO format. Choices are: ("csv", "json", "parquet", "pickle", "orc", "hdf5", "excel", "xml").
- to_args (Dict) – Dictionary of arguments to use in the pandas to_*() call.
- read_args (Dict) – Dictionary of arguments to use in the pandas read_*() call.
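For example, a sketch caching a dataframe as parquet with a custom write argument (the stage name and compression choice are illustrative):

```python
import pandas as pd

from curifactory import stage
from curifactory.caching import PandasCacher

@stage(
    inputs=None,
    outputs=["clean_df"],
    cachers=[PandasCacher(format="parquet", to_args={"compression": "snappy"})],
)
def clean_data(record):
    # on save, the cacher passes to_args to the pandas to_parquet() call
    return pd.DataFrame({"a": [1, 2, 3]})
```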
Methods:

- load() – Load the cacheable from disk.
- save(obj) – Save the passed object to disk.

load()

Load the cacheable from disk.

save(obj: pandas.core.frame.DataFrame) → str

Save the passed object to disk.
class curifactory.caching.PandasCsvCacher(path_override: Optional[str] = None, to_csv_args: dict = {}, read_csv_args: dict = {'index_col': 0}, **kwargs)

Saves a pandas dataframe to CSV.

Parameters:

- to_csv_args (Dict) – Dictionary of arguments to use in the pandas to_csv() call.
- read_csv_args (Dict) – Dictionary of arguments to use in the pandas read_csv() call.

Note: This is equivalent to using PandasCacher(format='csv').
class curifactory.caching.PandasJsonCacher(path_override: Optional[str] = None, to_json_args: dict = {'double_precision': 15}, read_json_args: dict = {}, **kwargs)

Saves a pandas dataframe to JSON.

Warning: Using this cacher is inadvisable for floating point data, as precision will be lost, creating the potential for different results when using cached values from this cacher as opposed to the first non-cached run.

Parameters:

- to_json_args (Dict) – Dictionary of arguments to use in the pandas to_json() call.
- read_json_args (Dict) – Dictionary of arguments to use in the pandas read_json() call.

Note: This is equivalent to using PandasCacher(format='json').
class curifactory.caching.PathRef(*args, **kwargs)

A special type of cacher that doesn't directly save or load anything; it just tracks a file path for reference.

This is primarily useful for stages that never keep a particular object in memory and just want to directly pass around paths. The PathRef cacher still allows short-circuiting if the referenced path already exists, rather than needing to do that check manually in the stage.

Note that when using a PathRef cacher you still need to return a value from the stage for the cacher to "save". This cacher expects that return value to be the path that was written to, and internally runs an assert returned_path == self.get_path() to double-check that the stage wrote to the correct place. This also means that the value stored "in memory" is just the path, and that path string is what gets "loaded".

This cacher is distinct from the FileReferenceCacher in that the path of this cacher is the referenced path, rather than saving a file that contains the referenced path. (In the case of the latter, a new record/hash etc. that refers to the same target filepath would still trigger stage execution, and still requires the stage to do its own check of whether the original file already exists before saving.)

Example

```python
@stage([], ["large_dataset_path"], [PathRef("./data/raw/big_data_{params.dataset}.csv")])
def make_big_data(record):
    # you can use record's stage_cachers to get the expected path
    output_path = record.stage_cachers[0].get_path()
    ...  # make big data without keeping it in memory
    return output_path


@stage(["large_dataset_path"], ["model_path"], [PathRef])
def train_model(record, large_dataset_path):
    # the other way to get a path that should be correct is through
    # record's get_path(). The assert inside PathRef's save will help
    # double check that it's correct.
    output_path = record.get_path(obj_name="model_path")
    ...  # train the model using large_dataset_path, the string path we need
    return output_path
```
Methods:

- load() – Load the cacheable from disk.
- save(obj) – This is effectively a no-op; this cacher is just a reference to its own path.

load() → str

Load the cacheable from disk.

save(obj: str)

This is effectively a no-op; this cacher is just a reference to its own path. obj is expected to be that same path, and we assert this to help alert the user if something got misaligned and the path they wrote to wasn't this cacher's path.
class curifactory.caching.PickleCacher(*args, **kwargs)

Dumps an object to a pickle file.

Methods:

- load() – Load the cacheable from disk.
- save(obj) – Save the passed object to disk.

load()

Load the cacheable from disk.

save(obj) → str

Save the passed object to disk.
class curifactory.caching.RawJupyterNotebookCacher(*args, **kwargs)

Takes a list of code cells (where each cell is a list of strings containing Python code) and turns it into a Jupyter notebook. This is useful in situations where you want each experiment to have some form of automatically populated analysis that a reportable wouldn't sufficiently cover, e.g. an interactive set of widgets or a dashboard.

Example

```python
@stage(inputs=["results_path"], outputs=["exploration_notebook"], cachers=[RawJupyterNotebookCacher])
def make_exploration_notebook(record, results_path):
    def convert_path(path):
        '''A function to translate paths to the local folder path.'''
        p = Path(path)
        p = Path(*p.parts[2:])
        return str(p)

    cells = [
        [
            "# imports",
            "from curifactory.caching import JsonCacher",
        ],
        [
            "# load things",
            f"cacher = JsonCacher('./{convert_path(results_path)}')",
            "results = cacher.load()",
            "results_metadata = cacher.metadata",
        ],
        [
            "# analysis",
            "...",
        ],
    ]
    return cells
```
Methods:

- load() – Load the cacheable from disk.
- save(obj) – Saves the raw cell strings to a _cells.json, and then uses ipynb-py-convert to change the output Python script into notebook format.

load()

Load the cacheable from disk.

save(obj: list)

Saves the raw cell strings to a _cells.json, and then uses ipynb-py-convert to change the output Python script into notebook format.