Cache
Including a list of cachers in your stage decorators helps store intermediate results, both for easier exploration and for faster re-execution.
There are four pre-implemented cachers that come with Curifactory in the `caching` module that should cover most basic needs:

- `JsonCacher` - stores an object (such as a dictionary) as a JSON file.
- `PandasCsvCacher` - stores a dataframe as a CSV file.
- `PandasJsonCacher` - stores a dataframe as a JSON file (an array of dictionaries, the keys as column names).
- `PickleCacher` - pickles an arbitrary object.
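As a minimal sketch of how these are attached (the stage, outputs, and data here are hypothetical, and this assumes `stage` and the cachers are importable as shown), each entry in `cachers` pairs with the output at the same position in `outputs`:

```python
import pandas as pd

from curifactory import stage
from curifactory.caching import JsonCacher, PandasCsvCacher

# A hypothetical stage: each cacher pairs with the output at the same
# index, so "metrics" is stored as JSON and "results_df" as CSV.
@stage(inputs=None, outputs=["metrics", "results_df"], cachers=[JsonCacher, PandasCsvCacher])
def compute_results(record):
    metrics = {"accuracy": 0.92}
    results_df = pd.DataFrame([{"run": 0, "accuracy": 0.92}])
    return metrics, results_df
```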
As a last resort, most things should be cacheable with the `PickleCacher`, but the advantage of the `JsonCacher`, where applicable, is that you can manually browse through the cache more easily, instead of needing to write a script to load a piece of cached data before viewing it.
Some things may not cache correctly even with a `PickleCacher`, such as PyTorch models or similarly complex objects. For these, you can write your own "cacheable" and plug it into a decorator in the same way as the pre-made cachers.
Implementing a custom cacheable requires extending the `caching.Cacheable` class, and the new class must have a `load()` and `save()` function. The base class has a `path` attribute that both functions can assume is set correctly to a base path where it is appropriate to write any necessary files. Following is an example:
```python
import pickle

import torch

from curifactory.caching import Cacheable


class TorchModelCacher(Cacheable):
    def __init__(self):
        # you would normally pass a string extension here if desired
        super().__init__("")

    def save(self, obj):
        # save the torch model's weights separately from the rest of the object
        torch.save(obj.model.state_dict(), self.path + "_model")
        with open(self.path, "wb") as outfile:
            pickle.dump(obj, outfile)

    def load(self):
        # unpickle the containing object, then restore the model weights
        with open(self.path, "rb") as infile:
            obj = pickle.load(infile)
        obj.model.load_state_dict(torch.load(self.path + "_model", map_location="cpu"))
        return obj
```
In this example, we've defined a custom cacher for some Python class that contains a torch model inside of it, in the `.model` attribute. Using pickle for the torch model itself is discouraged, but we still want to store the whole class as well. The custom cacher therefore saves to two separate files: first we save the model state dict with a `_model` suffix, then we pickle the whole class. On load we reverse this process, unpickling the whole class and then restoring the model's weights via the more appropriate `load_state_dict` call.
You can then pass this class name in a `cachers` list in the stage decorator as if it were one of the premade cacheables:

```python
@stage(inputs=..., outputs=["trained_model"], cachers=[TorchModelCacher])
def train_model(record, ...):
    # ...
```
Lazy cache objects
While caching by itself helps reduce overall computation time when re-running experiments over and over, memory can become a problem when running sizable experiments with a lot of large data in state at once. Often, when stages are appropriately caching everything, some objects don't need to be in memory at all because they're never used in a stage that actually runs. To address this, Curifactory has a `Lazy` class. This class is used by wrapping it around the string name in the outputs array:
```python
@stage(inputs=..., outputs=["small_object", Lazy("large-object")], cachers=...)
```
When an output is specified as lazy, the output object is cached and removed from memory as soon as the stage computes, and the `Lazy` instance is inserted into the state in its place. Whenever the `large-object` key is accessed on the state, the cacher reloads the object back into memory (but the `Lazy` object is kept in state, so as long as no references persist beyond the stage, the object will stay out of memory).
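As a rough sketch of what that looks like in practice (the stage and key names here are hypothetical, and this assumes lazy values are resolved when pulled from state as stage inputs), accessing the key is what triggers the reload:

```python
from curifactory import stage
from curifactory.caching import Lazy, PickleCacher

@stage(inputs=None, outputs=[Lazy("large_object")], cachers=[PickleCacher])
def make_large_object(record):
    # Cached to disk immediately after the stage computes, then dropped from memory.
    return list(range(10_000_000))

@stage(inputs=["large_object"], outputs=["summary"])
def summarize(record, large_object):
    # Pulling "large_object" out of state here reloads it via its cacher;
    # after this stage finishes, only the Lazy instance remains in state.
    return {"length": len(large_object)}
```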
Because lazy objects rely on a cacher, cachers should always be specified for these stages. If no cachers are given, Curifactory will automatically use a `PickleCacher`.
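For example (a hypothetical stage relying on the fallback just described), omitting `cachers` here still results in the lazy output being pickled to disk so it can be dropped from memory:

```python
from curifactory import stage
from curifactory.caching import Lazy

# No cachers specified: the lazy output falls back to a PickleCacher.
@stage(inputs=None, outputs=[Lazy("embeddings")])
def compute_embeddings(record):
    return [[0.0] * 768 for _ in range(100_000)]
```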
When a stage with a `Lazy` output is computed a second time, the cachers check for their appropriate files as normal, and if they are found, the lazy output again keeps only a `Lazy` instance in the record state rather than reloading the actual file.