Cache ===== Curifactory makes it straightforward to store and re-use intermediate artifacts generated throughout an experiment with its caching mechanisms. During an experiment run, user-specified caching strategies dump parameter-set-versioned instances of stage outputs in a common cache folder, and when running a stage that already has the appropriate artifacts in the cache for the current parameter set, it uses the caching strategy to reload the artifact from cache instead of executing the stages. Storing artifacts in cache both helps re-execute the experiment faster as well as creates a "paper trail" for manual exploration of the artifacts. Caching strategies are ``Cacher`` classes that extend curifactory's base ``Cacheable`` class. Using these cachers is as easy as listing them in your stage decorator for each output the stage generates: .. code-block:: python from curifactory import stage from curifactory.caching import PandasJsonCacher, JsonCacher, PickleCacher @stage(outputs=["dataset", "metrics_dictionary", "model"], cachers=[PandasJsonCacher, JsonCacher, PickleCacher]) def return_all_the_things(record): ... return dataset, metrics, model There are several pre-implemented cachers that come with Curifactory in the :ref:`Caching` module that should cover many basic needs: * ``JsonCacher`` * ``PandasCacher`` - store a dataframe using a specified format * ``PandasCsvCacher`` - shortcut for ``PandasCacher(format='csv')`` * ``PandasJsonCacher`` - shortcut for ``PandasCacher(format='json')``, stores a dataframe as a json file (array of dictionaries, the keys as column names.) * ``PickleCacher`` * ``FileReferenceCacher`` - a json file that stores references to one or more file paths. * ``RawJupyterNotebookCacher`` - turns a list of list of strings of python code into a jupyter notebook As a last resort, most things should be cacheable through the ``PickleCacher``, but the advantage of using the ``JsonCacher`` where applicable allows you to manually browse through the cache easier, instead of needing to write a script to load a piece of cached data before viewing it. Some things may not cache correctly even with a ``PickleCacher``, such as pytorch models or similarly complex objects. For these, you can write your own "cacheable" and plug it into a decorator in the same way as the pre-made cachers. Implementing a custom cacheable requires extending the :class:`caching.Cacheable ` class, and the new class must have a ``save(obj)`` and ``load() -> obj`` function, which respectively should handle saving the passed artifact to disk, and loading and returning a reconstructed artifact. The base ``Cacheable`` has a ``get_path()`` function which the cacher implementation can assume correctly returns a full filepath including the correct versioned filename for the current artifact. In the case that a cacher needs to save more than one file or wants to provide a different suffix for the filename, this can be passed to ``get_path``. .. code-block:: python import pickle from curifactory.caching import Cacheable class TorchModelCacher(Cacheable): def __init__(self, *args, **kwargs): # NOTE: it is recommended to always include and pass *args and **kwargs # in custom cachers to allow functionality specified in the Cacher arguments section super().__init__(*args, extension=".model_obj" **kwargs) def save(self, obj): torch.save(obj.model.state_dict(), self.get_path("_model")) with open(self.get_path(), 'wb') as outfile: pickle.dump(obj, outfile) def load(self): with open(self.get_path(), 'rb') as infile: obj = pickle.load(infile) obj.model.load_state_dict(torch.load(self.get_path("_model"), map_location="cpu")) return obj .. note:: It is recommended to always include and pass ``*args`` and ``**kwargs`` in custom cachers to allow consistent functionality as specified in :ref:`Cacher arguments`. .. warning:: The returns from ``get_path()`` calls should be used exactly for the paths written to and read from - Curifactory internally tracks the ``get_path()`` outputs for determining what to copy to a full store folder, so if you write to ``get_path() + "something.json"``, it won't correctly track that path. Instead, use the suffix capability: ``get_path("something.json")``. If you have a lot of files to save, or need to do complicated path manipulation, instead use ``self.get_dir()`` as the base path, and curifactory will track the entire subfolder. In this example, we've defined a custom cacher for some python class that contains a torch model inside of it, in the ``.model`` attribute. Using pickle for the torch model itself is discouraged, but we still want to store the whole class as well. The custom cacher therefore saves to two separate files - first we save the model state dict with a ``_model`` suffix, then pickle the whole class. On load we reverse this process, by unpickling the whole class and then replacing the model attribute with the more appropriate ``load_state_dict`` results. You can then pass this class name in a cachers list in the stage decorator as if it were one of the premade cacheables: .. code-block:: python @stage(inputs=..., outputs=["trained_model"], cachers=[TorchModelCacher]) def train_model(record, ...): # ... Using cachers ------------- Cacher arguments ................ As specified above, you can use a cacher in a stage simply by providing the class name in the cachers list. You can also initialize the cacher in the list, and there are several parameters that provide additional control over the path that's used by the cacher. * **overwrite_path**: specifying this completely overrides all other path computation functionality and uses the provided path exactly. If using this in a stage decorator, that means it won't use any form of parameter set hash versioning. This is useful in situations where a stage is effectively a static transform that isn't affected by any parameters. * **subdir**: if specified, uses this subdirectory in front of the filename, both within the cache directory and within a full store run's artifacts directory. * **prefix**: By default, the experiment name is used as the prefix for every cached filepath. If there are specific artifacts that are safe to use across all experiments that call the stage this cacher is used from, you can specify the prefix here. * **track**: Tracked filepaths are paths that get copied into a full store run. This is always true by default, but there can be situations (especially when dealing with very large artifacts such as datasets) where it's not desirable to keep a copy of every single artifact. Setting this to ``False`` does **not** disable caching it normally into the cache directory, but it will not transfer that file to the full store run artifacts directory. Inline cachers .............. While the primary purpose of cachers is to use them as a "strategy" to specify to a stage, cachers can also be used inline, either directly in a stage or in any normal code. This is useful in cases where you need to manually load an artifact, and you have the path for it already. .. code-block:: python some_metrics_path = ... metrics = JsonCacher(some_metrics_path).load() You can also get the metadata associated with the artifact: .. code-block:: python some_metrics_path = ... cacher = JsonCacher(some_metrics_path) metrics = cacher.load() metadata = cahcer.load_metadata() Metadata -------- Every cached artifact saves an associated metadata json file that tracks information about the cacher, the current record, and the experiment run. This metadata file is copied along with the artifact in full store runs, and is kept when an artifact is re-used in a later run. This metadata dictionary is available on every cacher object through ``.metadata``. In addition, every ``Cacheable`` object has an ``.extra_metadata`` dictionary that custom cachers can use to store additional information either for provenance/informational use, or to help direct loading code. This extra metadata gets added to the cacher's ``metadata`` when saving, and is populated from a ``.load_metadata()`` call. An example might look like: .. code-block:: python class UsesExtraMetadataCacher(Cacheable): def save(self, obj): self.extra_metadata["the_best_number"] = 13 JsonCacher(self.geet_path()).save(obj) def load(self): assert self.extra_metadata["best_number"] == 13 return JsonCacher(self.get_path()).load() The curifactory stage decorator automatically handles calling ``save_metadata()`` and ``load_metadata()`` at the appropriate times for the above cacher to work. However, if you're using this custom cacher inline, these functions are never explicitly called. If you want to enable this cacher to work inline, you need to add in explicit save/load metadata calls in the save/load functions: .. code-block:: python class UsesExtraMetadataCacher(Cacheable): def save(self, obj): self.extra_metadata["the_best_number"] = 13 self.save_metadata() JsonCacher(self.get_path()).save(obj) def load(self): self.load_metadata() assert self.extra_metadata["best_number"] == 13 return JsonCacher(self.get_path()).load() Lazy cache objects ------------------ While caching by itself helps reduce overall computation time when re-running experiments over and over, if running sizable experiments with a lot of large data in state at once, memory can be a problem. Many times, when stages are appropriately caching everything, some objects don't need to be in memory at all because they're never used in a stage that actually runs. To address this, curifactory has a :code:`Lazy` class. This class is used by wrapping it around the string name in the outputs array: .. code-block:: python @stage(inputs=..., outputs=["small_object", Lazy("large-object")], cachers=...) When an output is specified as lazy, as soon as the stage computes, the output object is cached and removed from memory. The :code:`Lazy` instance is then inserted into the state. Whenever the :code:`large-object` key is accessed on the state, it uses the cacher to reload the object back into memory (but maintains the Lazy object in state, so as long as no references persist beyond the stage, it will stay out of memory. Because lazy objects rely on a cacher, cachers should always be specified for these stages. If no cachers are given, curifactory will automatically use a :code:`PickleCacher`. When a stage with a Lazy object is computed the second time, the cachers check for their appropriate files as normal, and if they are found the lazy output again keeps only a :code:`Lazy` instance in the record state rather than reloading the actual file. Lazy resolve ............ By default, every time a ``Lazy`` instance is passed into a stage wrapped function, it resolves to the object itself, meaning it calls the load function on the associated cacher. If a ``Lazy`` instance is specified with ``resolve=False``, then every time that artifact is used as input, the input that gets passed is the actual ``Lazy`` instance itself. The primary value in this is to be able to access the associated cacher from within a stage in order to get its path (this is useful when doing external calls.)