Cache
Curifactory makes it straightforward to store and re-use intermediate artifacts generated throughout an experiment via its caching mechanisms. During an experiment run, user-specified caching strategies dump parameter-set-versioned instances of stage outputs into a common cache folder. When a stage runs and the appropriate artifacts for the current parameter set are already in the cache, the caching strategy reloads the artifacts instead of executing the stage. Storing artifacts in cache both speeds up re-executing the experiment and creates a “paper trail” for manual exploration of the artifacts.
Caching strategies are Cacher classes that extend curifactory’s base Cacheable class. Using these cachers is as easy as listing them in your stage decorator for each output the stage generates:
from curifactory import stage
from curifactory.caching import PandasJsonCacher, JsonCacher, PickleCacher
@stage(outputs=["dataset", "metrics_dictionary", "model"], cachers=[PandasJsonCacher, JsonCacher, PickleCacher])
def return_all_the_things(record):
    ...
    return dataset, metrics, model
There are several pre-implemented cachers that come with Curifactory in the Caching module that should cover many basic needs:
- JsonCacher
- PandasCacher - stores a dataframe using a specified format
- PandasCsvCacher - shortcut for PandasCacher(format='csv')
- PandasJsonCacher - shortcut for PandasCacher(format='json'), stores a dataframe as a JSON file (an array of dictionaries, with the keys as column names)
- PickleCacher
- FileReferenceCacher - a JSON file that stores references to one or more file paths
- RawJupyterNotebookCacher - turns a list of lists of strings of Python code into a Jupyter notebook
As a last resort, most things should be cacheable through the PickleCacher, but the advantage of using the JsonCacher where applicable is that it lets you manually browse through the cache more easily, instead of needing to write a script to load a piece of cached data before viewing it.
Some things may not cache correctly even with a PickleCacher, such as pytorch models or similarly complex objects. For these, you can write your own “cacheable” and plug it into a decorator in the same way as the pre-made cachers.
Implementing a custom cacheable requires extending the caching.Cacheable class, and the new class must have a save(obj) and a load() -> obj function, which respectively handle saving the passed artifact to disk, and loading and returning a reconstructed artifact. The base Cacheable has a get_path() function which the cacher implementation can assume correctly returns a full filepath, including the correct versioned filename for the current artifact. If a cacher needs to save more than one file or wants to use a different suffix for the filename, the suffix can be passed to get_path().
import pickle

import torch

from curifactory.caching import Cacheable

class TorchModelCacher(Cacheable):
    def __init__(self, *args, **kwargs):
        # NOTE: it is recommended to always include and pass *args and **kwargs
        # in custom cachers to allow functionality specified in the Cacher arguments section
        super().__init__(*args, extension=".model_obj", **kwargs)

    def save(self, obj):
        torch.save(obj.model.state_dict(), self.get_path("_model"))
        with open(self.get_path(), 'wb') as outfile:
            pickle.dump(obj, outfile)

    def load(self):
        with open(self.get_path(), 'rb') as infile:
            obj = pickle.load(infile)
        obj.model.load_state_dict(torch.load(self.get_path("_model"), map_location="cpu"))
        return obj
Note
It is recommended to always include and pass *args and **kwargs in custom cachers to allow consistent functionality as specified in Cacher arguments.
Warning
The returns from get_path() calls should be used exactly as the paths written to and read from: Curifactory internally tracks the get_path() outputs to determine what to copy to a full store folder, so if you write to get_path() + "something.json", it won’t correctly track that path. Instead, use the suffix capability: get_path("something.json"). If you have a lot of files to save, or need to do complicated path manipulation, use self.get_dir() as the base path instead, and curifactory will track the entire subfolder.
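To make the tracking concern above concrete, here is a small self-contained sketch. This is a stand-in, not curifactory’s actual implementation: it assumes (per the warning) that every get_path() result is recorded for later copying, while plain string concatenation on that result is invisible to the tracker.

```python
# Stand-in for the tracking behavior described above (NOT curifactory itself).
class SketchCacher:
    def __init__(self, base_path):
        self.base_path = base_path
        self.tracked = []  # paths that would be copied on a full store run

    def get_path(self, suffix=""):
        path = self.base_path + suffix
        self.tracked.append(path)  # every get_path() result is recorded
        return path

cacher = SketchCacher("/cache/experiment_hash_artifact")

good = cacher.get_path("_extra.json")    # tracked: the copier knows about it
bad = cacher.get_path() + "_other.json"  # NOT tracked: only the bare path was recorded

print(good in cacher.tracked)  # True
print(bad in cacher.tracked)   # False
```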
In this example, we’ve defined a custom cacher for some Python class that contains a torch model inside of it, in the .model attribute. Using pickle for the torch model itself is discouraged, but we still want to store the whole class as well. The custom cacher therefore saves to two separate files: first we save the model state dict with a _model suffix, then pickle the whole class. On load we reverse this process, unpickling the whole class and then replacing the model attribute with the more appropriate load_state_dict results.
You can then pass this class name in a cachers list in the stage decorator as if it were one of the premade cacheables:
@stage(inputs=..., outputs=["trained_model"], cachers=[TorchModelCacher])
def train_model(record, ...):
    # ...
Using cachers
Cacher arguments
As specified above, you can use a cacher in a stage simply by providing the class name in the cachers list. You can also initialize the cacher in the list, and there are several parameters that provide additional control over the path that’s used by the cacher.
overwrite_path: specifying this completely overrides all other path computation functionality and uses the provided path exactly. If using this in a stage decorator, that means it won’t use any form of parameter set hash versioning. This is useful in situations where a stage is effectively a static transform that isn’t affected by any parameters.
subdir: if specified, uses this subdirectory in front of the filename, both within the cache directory and within a full store run’s artifacts directory.
prefix: By default, the experiment name is used as the prefix for every cached filepath. If there are specific artifacts that are safe to use across all experiments that call the stage this cacher is used from, you can specify the prefix here.
track: Tracked filepaths are paths that get copied into a full store run. This is True by default, but there can be situations (especially when dealing with very large artifacts such as datasets) where it’s not desirable to keep a copy of every single artifact. Setting this to False does not disable normal caching into the cache directory, but the file will not be transferred to the full store run artifacts directory.
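As an illustration only (a sketch of the documented behavior, not curifactory’s real path logic — the function and its defaults here are hypothetical), the arguments above might compose into a cache path roughly like this:

```python
# Hypothetical sketch of how overwrite_path, subdir, and prefix could combine
# into a cache filepath, following the descriptions above.
def sketch_cache_path(name, param_hash, prefix="my_experiment",
                      subdir=None, overwrite_path=None):
    # overwrite_path bypasses all other path computation, including hash versioning
    if overwrite_path is not None:
        return overwrite_path
    # default: experiment-name prefix + parameter set hash + output name
    filename = f"{prefix}_{param_hash}_{name}.json"
    if subdir is not None:
        return f"cache/{subdir}/{filename}"
    return f"cache/{filename}"

print(sketch_cache_path("metrics", "abc123"))
# cache/my_experiment_abc123_metrics.json
print(sketch_cache_path("metrics", "abc123", subdir="analysis"))
# cache/analysis/my_experiment_abc123_metrics.json
print(sketch_cache_path("metrics", "abc123", overwrite_path="static/metrics.json"))
# static/metrics.json
```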
Inline cachers
While the primary purpose of cachers is to use them as a “strategy” to specify to a stage, cachers can also be used inline, either directly in a stage or in any normal code. This is useful in cases where you need to manually load an artifact, and you have the path for it already.
some_metrics_path = ...
metrics = JsonCacher(some_metrics_path).load()
You can also get the metadata associated with the artifact:
some_metrics_path = ...
cacher = JsonCacher(some_metrics_path)
metrics = cacher.load()
metadata = cacher.load_metadata()
Metadata
Every cached artifact saves an associated metadata json file that tracks information about the cacher, the current record, and the experiment run. This metadata file is copied along with the artifact in full store runs, and is kept when an artifact is re-used in a later run.
This metadata dictionary is available on every cacher object through .metadata. In addition, every Cacheable object has an .extra_metadata dictionary that custom cachers can use to store additional information, either for provenance/informational use or to help direct loading code. This extra metadata gets added to the cacher’s metadata when saving, and is populated from a .load_metadata() call.
An example might look like:
class UsesExtraMetadataCacher(Cacheable):
    def save(self, obj):
        self.extra_metadata["the_best_number"] = 13
        JsonCacher(self.get_path()).save(obj)

    def load(self):
        assert self.extra_metadata["the_best_number"] == 13
        return JsonCacher(self.get_path()).load()
The curifactory stage decorator automatically handles calling save_metadata() and load_metadata() at the appropriate times for the above cacher to work. However, if you’re using this custom cacher inline, these functions are never explicitly called. If you want to enable this cacher to work inline, you need to add explicit save/load metadata calls in the save/load functions:
class UsesExtraMetadataCacher(Cacheable):
    def save(self, obj):
        self.extra_metadata["the_best_number"] = 13
        self.save_metadata()
        JsonCacher(self.get_path()).save(obj)

    def load(self):
        self.load_metadata()
        assert self.extra_metadata["the_best_number"] == 13
        return JsonCacher(self.get_path()).load()
Lazy cache objects
While caching by itself helps reduce overall computation time when re-running
experiments over and over, if running sizable experiments with a lot of large data
in state at once, memory can be a problem. Many times, when stages are
appropriately caching everything, some objects don’t need to be in
memory at all because they’re never used in a stage that actually runs. To
address this, curifactory has a Lazy
class. This class is used by
wrapping it around the string name in the outputs array:
@stage(inputs=..., outputs=["small_object", Lazy("large-object")], cachers=...)
When an output is specified as lazy, as soon as the stage computes, the output object is cached and removed from memory. The Lazy instance is then inserted into the state. Whenever the large-object key is accessed on the state, it uses the cacher to reload the object back into memory (but maintains the Lazy object in state, so as long as no references persist beyond the stage, it will stay out of memory).
Because lazy objects rely on a cacher, cachers should always be specified for these stages. If no cachers are given, curifactory will automatically use a PickleCacher.
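The cache-then-evict idea can be sketched in a self-contained way. This stand-in is not curifactory’s Lazy class (LazyHandle and cache_and_release are hypothetical names), but it shows the mechanism: after computation, the artifact lives on disk and only a small handle stays in memory until the value is actually accessed.

```python
import os
import pickle
import tempfile

# Stand-in for the lazy-output mechanism (NOT curifactory's actual Lazy class).
class LazyHandle:
    def __init__(self, path):
        self.path = path  # all that stays in memory is the cache path

    def load(self):
        # reload the artifact from cache only when it's actually accessed
        with open(self.path, "rb") as f:
            return pickle.load(f)

def cache_and_release(obj, cache_dir, name):
    path = os.path.join(cache_dir, f"{name}.pkl")
    with open(path, "wb") as f:
        pickle.dump(obj, f)
    # returning only the handle lets the large object be garbage-collected
    return LazyHandle(path)

cache_dir = tempfile.mkdtemp()
handle = cache_and_release(list(range(1000)), cache_dir, "large-object")
print(handle.load()[:3])  # [0, 1, 2]
```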
When a stage with a Lazy object is computed the second time, the cachers check
for their appropriate files as normal, and if they are found the lazy output
again keeps only a Lazy
instance in the record state rather than
reloading the actual file.
Lazy resolve
By default, every time a Lazy instance is passed into a stage-wrapped function, it resolves to the object itself, meaning it calls the load function on the associated cacher. If a Lazy instance is specified with resolve=False, then every time that artifact is used as input, the input that gets passed is the actual Lazy instance itself.
The primary value in this is being able to access the associated cacher from within a stage in order to get its path (this is useful when doing external calls).
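The resolve semantics can be sketched as follows. These classes are stand-ins (not curifactory’s Lazy or cachers): with the default resolve behavior the stage receives the loaded object, while resolve=False hands it the wrapper, whose cacher path can then be inspected.

```python
# Stand-ins illustrating resolve semantics (NOT curifactory's actual classes).
class SketchCacher:
    def __init__(self, path, value):
        self.path = path
        self._value = value

    def get_path(self):
        return self.path

    def load(self):
        return self._value

class SketchLazy:
    def __init__(self, name, cacher, resolve=True):
        self.name = name
        self.cacher = cacher
        self.resolve = resolve

    def as_stage_input(self):
        # resolve=True hands the stage the loaded object;
        # resolve=False hands it the Lazy wrapper itself
        return self.cacher.load() if self.resolve else self

cacher = SketchCacher("/cache/large-object.pkl", value=[1, 2, 3])
print(SketchLazy("large-object", cacher).as_stage_input())  # [1, 2, 3]
lazy_input = SketchLazy("large-object", cacher, resolve=False).as_stage_input()
print(lazy_input.cacher.get_path())  # /cache/large-object.pkl
```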