Cache ===== Including a list of cachers in your stage decorators helps store intermediate results both for easier exploration as well as faster re-execution. There are four pre-implemented cachers that come with Curifactory in the :ref:`Caching` module that should cover most basic needs: * :code:`JsonCacher` * :code:`PandasCsvCacher` * :code:`PandasJsonCacher` - stores a dataframe as a json file (array of dictionaries, the keys as column names.) * :code:`PickleCacher` As a last resort, most things should be cacheable through the :code:`PickleCacher`, but the advantage of using the :code:`JsonCacher` where applicable allows you to manually browse through the cache easier, instead of needing to write a script to load a piece of cached data before viewing it. Some things may not cache correctly even with a :code:`PickleCacher`, such as pytorch models or similarly complex objects. For these, you can write your own "cacheable" and plug it into a decorator in the same way as the pre-made cachers. Implementing a custom cacheable requires extending the :class:`caching.Cacheable ` class, and the new class must have a :code:`load()` and :code:`save()` function. The base class has a :code:`path` attribute that both functions can assume is set correctly to a base path where it is appropriate to write any necessary files. Following is an example: .. code-block:: python import pickle from curifactory.caching import Cacheable class TorchModelCacher(Cacheable): def __init__(self): super().__init__("") # you would normally pass a string extension here if desired def save(self, obj): torch.save(obj.model.state_dict(), self.path + "_model") with open(self.path, 'wb') as outfile: pickle.dump(obj, outfile) def load(self): with open(self.path, 'rb') as infile: obj = pickle.load(infile) obj.model.load_state_dict(torch.load(self.path + "_model", map_location="cpu")) return obj In this example, we've defined a custom cacher for some python class that contains a torch model inside of it, in the :code:`.model` attribute. Using pickle for the torch model itself is discouraged, but we still want to store the whole class as well. The custom cacher therefore saves to two separate files - first we save the model state dict with a :code:`_model` suffix, then pickle the whole class. On load we reverse this process, by unpickling the whole class and then replacing the model attribute with the more appropriate :code:`load_state_dict` results. You can then pass this class name in a cachers list in the stage decorator as if it were one of the premade cacheables: .. code-block:: python @stage(inputs=..., outputs=["trained_model"], cachers=[TorchModelCacher]) def train_model(record, ...): # ... Lazy cache objects ------------------ While caching by itself helps reduce overall computation time when re-running experiments over and over, if running sizable experiments with a lot of large data in state at once, memory can be a problem. Many times, when stages are appropriately caching everything, some objects don't need to be in memory at all because they're never used in a stage that actually runs. To address this, curifactory has a :code:`Lazy` class. This class is used by wrapping it around the string name in the outputs array: .. code-block:: python @stage(inputs=..., outputs=["small_object", Lazy("large-object")], cachers=...) When an output is specified as lazy, as soon as the stage computes, the output object is cached and removed from memory. The :code:`Lazy` instance is then inserted into the state. Whenever the :code:`large-object` key is accessed on the state, it uses the cacher to reload the object back into memory (but maintains the Lazy object in state, so as long as no references persist beyond the stage, it will stay out of memory. Because lazy objects rely on a cacher, cachers should always be specified for these stages. If no cachers are given, curifactory will automatically use a :code:`PickleCacher`. When a stage with a Lazy object is computed the second time, the cachers check for their appropriate files as normal, and if they are found the lazy output again keeps only a :code:`Lazy` instance in the record state rather than reloading the actual file.