Caching
Classes:
AggregateArtifactCacher: Just meant to help avoid unwanted implicit _aggregate stages continually being added to the database when nothing is _really_ running.
Cacheable
DBCacher: A cacher that loads a duckdb database connection.
DBTableCacher
FileReferenceCacher: Saves a file path or list of file paths generated from a stage as a json file.
JsonCacher
MetadataOnlyCacher: Only writes out a metadata file, can be used for checking that a stage was completed/based on parameters.
ParquetCacher
PathRef: Special type of cacher that doesn't directly save or load anything, it just tracks a file path for reference.
PickleCacher
ReportablesCacher
TrackingDBTableCacher
- class curifactory.experimental.caching.AggregateArtifactCacher
Just meant to help avoid unwanted implicit _aggregate stages continually being added to the database when nothing is _really_ running.
NOTE: Can’t be used as an inline cacher? (depends on an initialized ArtifactList artifact)
Methods:
check([silent])
load_obj()
- cache_paths: list
- check(silent=True)
- extra_metadata: dict
- load_obj()
- params = []
- paths_only: bool
When True, after the object is saved it is replaced with the string path it was saved at. Similarly, when load is called, the path gets returned.
- class curifactory.experimental.caching.Cacheable(path_override=None, extension='', paths_only=False)
Methods:
check([silent])
clear(): Remove self from the cache.
get_from_db_metadata(cacher_module, ...)
get_path([suffix, dry]): When dry is True, don't add the resulting path to the cacher's list of tracked paths (e.g. if this is just being used in a print statement).
load()
load_artifact(path)
load_obj()
resolve_template_string(path): Intended for when path_override is specified, this returns the path with formatting applied/resolved.
save(obj)
save_obj(obj)
Attributes:
paths_only: When True, after the object is saved it is replaced with the string path it was saved at.
- Parameters:
path_override (str)
extension (str)
paths_only (bool)
- cache_paths: list
- check(silent=False)
- Parameters:
silent (bool)
- clear()
Remove self from the cache.
- clear_metadata()
- clear_obj()
- extra_metadata: dict
- static get_from_db_metadata(cacher_module, cacher_type, cacher_params)
- Parameters:
cacher_module (str)
cacher_type (str)
cacher_params (dict[str, Any])
- get_params()
- Return type:
dict[str, Any]
- get_path(suffix=None, dry=False)
When dry is True, don’t add the resulting path to the cacher’s list of tracked paths (e.g. if this is just being used in a print statement).
- Return type:
str
- load()
- load_metadata()
- load_obj()
- load_paths()
- params = ['path_override', 'extension', 'paths_only']
- paths_only: bool
When True, after the object is saved it is replaced with the string path it was saved at. Similarly, when load is called, the path gets returned.
- resolve_template_string(path)
Intended for when path_override is specified, this returns the path with formatting applied/resolved.
Note that this does _not_ handle suffixing; suffix resolution needs to occur wherever this is used.
Possible keyword replacement fields:
* {hash} - the hash of the stage.
* {stage_name} - the name of the stage that produces this artifact.
* {[STAGE_ARG_NAME]} - any argument name from the stage itself.
* {artifact_name} - the name of this output object.
* {artifact_filename} - the normal filename for this cacher (doesn’t include dir path etc.).
- Parameters:
path (str)
- Return type:
str
- save(obj)
- save_metadata()
- save_obj(obj)
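The replacement fields listed under resolve_template_string read like ordinary Python format fields. The following is a minimal sketch of that idea using plain str.format. It does not call curifactory: the helper name resolve_template and every value passed to it are hypothetical, and the real method may resolve fields differently.

```python
def resolve_template(path: str, **fields: str) -> str:
    """Fill {name} placeholders in a path template (hypothetical stand-in)."""
    return path.format(**fields)


# Hypothetical values; in practice these would come from the stage and record.
resolved = resolve_template(
    "./data/{stage_name}/{hash}_{artifact_name}.json",
    stage_name="make_big_data",
    hash="abc123",
    artifact_name="large_dataset_path",
)
print(resolved)
```

As with the documented method, this sketch applies no suffix; any suffix would have to be added by the caller.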
- class curifactory.experimental.caching.DBCacher(connection_str=None, extensions=None, **kwargs)
A cacher that loads a duckdb database connection.
Methods:
check([silent])
load()
load_obj()
- Parameters:
connection_str (str)
extensions (list[str])
- cache_paths: list
- check(silent=False)
- Parameters:
silent (bool)
- extra_metadata: dict
- load()
- load_obj()
- params = ['connection_str', 'kwargs', 'extensions']
- paths_only: bool
When True, after the object is saved it is replaced with the string path it was saved at. Similarly, when load is called, the path gets returned.
- class curifactory.experimental.caching.DBTableCacher(path_override=None, use_db_arg=-1, table_name=None, db=None)
Methods:
check([silent])
get_db()
load_obj()
save_obj(relation_object)
upsert_relation(relation_object)
- Parameters:
path_override (str)
use_db_arg (int | str)
table_name (str)
db (DuckDBPyConnection)
- cache_paths: list
- check(silent=True)
- extra_metadata: dict
- get_db()
- get_table_name()
- load_obj()
- params = ['path_override', 'use_db_arg', 'table_name']
- paths_only: bool
When True, after the object is saved it is replaced with the string path it was saved at. Similarly, when load is called, the path gets returned.
- save_obj(relation_object)
- Parameters:
relation_object (DuckDBPyRelation)
- upsert_relation(relation_object)
- Parameters:
relation_object (DuckDBPyRelation)
- class curifactory.experimental.caching.FileReferenceCacher(*args, **kwargs)
Saves a file path or list of file paths generated from a stage as a json file. The check function will check existence of all file paths.
This is useful for instances where there may be a large number of files stored or generated to disk, as it would be unwieldy to return them all (or infeasible to keep them in memory) directly from the stage. When this cacher is checked for a pre-existing cache, it tries to load the json file storing the filenames, and then checks for the existence of each path in that json file. If all of them exist, it will short-circuit computation.
Using this cacher does mean the user is in charge of loading/saving the file paths correctly, but in some cases that may be desirable.
This can also be used for storing a reference to a single file outside the normal cache.
When combined with the get_dir call on the record, you can create a cached directory of files similarly to a regular cacher and simply keep a reference to them as part of the actual cacher process.
Example
    import os

    @stage(inputs=None, outputs=["many_text_files"], cachers=[FileReferenceCacher])
    def output_text_files(record):
        # directory inside the cache reserved for this stage's files
        file_path = record.get_dir("my_files")
        my_file_list = [os.path.join(file_path, f"my_file_{num}") for num in range(20)]

        for file in my_file_list:
            with open(file, 'w') as outfile:
                outfile.write("test")

        # only the list of paths is returned; the cacher stores it as json
        return my_file_list
Methods:
check([silent])
load_obj()
save_obj(files)
- cache_paths: list
- check(silent=False)
- Parameters:
silent (bool)
- Return type:
bool
- extra_metadata: dict
- load_obj()
- Return type:
list[str] | str
- params = ['path_override']
- paths_only: bool
When True, after the object is saved it is replaced with the string path it was saved at. Similarly, when load is called, the path gets returned.
- save_obj(files)
- Parameters:
files (list[str] | str)
- Return type:
str
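The check behavior described for FileReferenceCacher (load the json file of paths, then confirm each listed path exists) can be sketched with the standard library alone. This is an illustrative stand-in under stated assumptions, not curifactory's implementation: the function names save_file_references and check_file_references and the file name refs.json are hypothetical.

```python
import json
import os
import tempfile
from typing import List, Union


def save_file_references(reference_path: str, files: Union[List[str], str]) -> str:
    """Write the tracked path(s) to a json file and return that file's path."""
    with open(reference_path, "w") as outfile:
        json.dump(files, outfile)
    return reference_path


def check_file_references(reference_path: str) -> bool:
    """Return True only if the json file and every path listed in it exist."""
    if not os.path.exists(reference_path):
        return False
    with open(reference_path) as infile:
        files = json.load(infile)
    if isinstance(files, str):  # a single path was stored
        files = [files]
    return all(os.path.exists(path) for path in files)


with tempfile.TemporaryDirectory() as tmp_dir:
    data_file = os.path.join(tmp_dir, "my_file_0")
    with open(data_file, "w") as outfile:
        outfile.write("test")

    reference = save_file_references(os.path.join(tmp_dir, "refs.json"), [data_file])
    exists_before = check_file_references(reference)  # every referenced file is present

    os.remove(data_file)
    exists_after = check_file_references(reference)  # one referenced file is now missing
```

In this sketch a single missing file is enough to make the check fail, which corresponds to the stage being re-run rather than short-circuited.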
- class curifactory.experimental.caching.JsonCacher(path_override=None)
Methods:
load_obj()
save_obj(obj)
- Parameters:
path_override (str)
- cache_paths: list
- extra_metadata: dict
- load_obj()
- params = ['path_override']
- paths_only: bool
When True, after the object is saved it is replaced with the string path it was saved at. Similarly, when load is called, the path gets returned.
- save_obj(obj)
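To make the paths_only description concrete, here is a small self-contained sketch of a json-backed cacher. The class SketchJsonCacher is hypothetical and is not the JsonCacher API (its constructor arguments in particular are assumptions); it only mirrors the documented behavior that, with paths_only set, the saved object is replaced by its path and loading returns the path.

```python
import json
import os
import tempfile


class SketchJsonCacher:
    """Hypothetical stand-in illustrating the documented paths_only behavior."""

    def __init__(self, path: str, paths_only: bool = False):
        self.path = path
        self.paths_only = paths_only

    def save(self, obj):
        with open(self.path, "w") as outfile:
            json.dump(obj, outfile)
        # With paths_only, the in-memory value is replaced by the saved path.
        return self.path if self.paths_only else obj

    def load(self):
        if self.paths_only:
            return self.path  # the path is returned instead of the contents
        with open(self.path) as infile:
            return json.load(infile)


with tempfile.TemporaryDirectory() as tmp_dir:
    target_path = os.path.join(tmp_dir, "result.json")

    regular = SketchJsonCacher(target_path)
    regular.save({"accuracy": 0.9})
    loaded_obj = regular.load()  # the deserialized dictionary

    by_path = SketchJsonCacher(target_path, paths_only=True)
    kept_value = by_path.save({"accuracy": 0.9})  # the path, not the dictionary
    loaded_path = by_path.load()  # also the path
```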
- class curifactory.experimental.caching.MetadataOnlyCacher(path_override=None)
Only writes out a metadata file; it can be used for checking that a stage was completed, based on its parameters. The underlying assumption is that the stage only mutated something in some way and has no specific object to retrieve.
Methods:
check([silent])
- Parameters:
path_override (str)
- cache_paths: list
- check(silent=False)
- extra_metadata: dict
- params = ['path_override']
- paths_only: bool
When True, after the object is saved it is replaced with the string path it was saved at. Similarly, when load is called, the path gets returned.
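The idea behind MetadataOnlyCacher, treating the presence of a metadata file as the record that a mutating stage already ran, can be sketched as follows. The function names, the file name, and the metadata layout are all hypothetical; curifactory's actual metadata format is not shown in this reference.

```python
import json
import os
import tempfile


def save_metadata(metadata_path: str, params: dict) -> None:
    """Record that the stage ran, along with the parameters it ran with."""
    with open(metadata_path, "w") as outfile:
        json.dump({"params": params}, outfile)


def check(metadata_path: str) -> bool:
    """In this sketch, the stage counts as completed if its metadata file exists."""
    return os.path.exists(metadata_path)


with tempfile.TemporaryDirectory() as tmp_dir:
    # Hypothetical file name, for illustration only.
    metadata_path = os.path.join(tmp_dir, "stage_metadata.json")

    done_before = check(metadata_path)  # nothing written yet
    save_metadata(metadata_path, {"table": "events"})
    done_after = check(metadata_path)  # metadata file now exists
```

No object is stored or loaded here, matching the assumption that the stage has nothing specific to retrieve.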
- class curifactory.experimental.caching.ParquetCacher(path_override=None, paths_only=False, use_db_arg=-1)
Methods:
load_obj()
save_obj(obj)
- Parameters:
path_override (str)
paths_only (bool)
use_db_arg (int | str)
- cache_paths: list
- extra_metadata: dict
- load_obj()
- params = ['path_override', 'paths_only', 'use_db_arg']
- paths_only: bool
When True, after the object is saved it is replaced with the string path it was saved at. Similarly, when load is called, the path gets returned.
- save_obj(obj)
- Parameters:
obj (DuckDBPyRelation | DataFrame | dict[str, list] | list[dict[str, Any]])
- class curifactory.experimental.caching.PathRef(*args, **kwargs)
Special type of cacher that doesn’t directly save or load anything; it just tracks a file path for reference.
This is primarily useful for stages that never keep a particular object in memory and just want to directly pass around paths. The PathRef cacher still allows short-circuiting if the referenced path already exists, rather than needing to do it manually in the stage.
Note that when using a PathRef cacher you still need to return a value from the stage for the cacher to “save”. This cacher expects that return value to be the path that was written to, and internally runs an assert returned_path == self.get_path() to double check that the stage wrote to the correct place. This also means that the value stored “in memory” is just the path, and that path string is what gets “loaded”.
This cacher is distinct from the FileReferenceCacher in that the path of this cacher _is the referenced path_, rather than saving a file that contains the referenced path. (In the case of the latter, a new record/hash etc. that refers to the same target filepath would still trigger stage execution, and still requires the stage to do its own check of whether the original file already exists before saving.)
Example
    @stage([], ["large_dataset_path"], [PathRef("./data/raw/big_data_{params.dataset}.csv")])
    def make_big_data(record):
        # you can use record's ``stage_cachers`` to get the expected path
        output_path = record.stage_cachers[0].get_path()
        ...
        # make big data without keeping it in memory
        ...
        return output_path

    @stage(["large_dataset_path"], ["model_path"], [PathRef])
    def train_model(record, large_dataset_path):
        # the other way you can get a path that should be correct
        # is through record's ``get_path()``. The assert inside
        # PathRef's save will help us double check that it's correct.
        output_path = record.get_path(obj_name="model_path")
        ...
        # train model using large_dataset_path, the string path we need.
        ...
        return output_path
Methods:
load()
save(obj): This is effectively a no-op, this cacher is just a reference to its own path.
- load()
- Return type:
str
- save(obj)
This is effectively a no-op, this cacher is just a reference to its own path.
obj is expected to be the same, and we assert that to help alert the user if something got mis-aligned and the path they wrote to wasn’t this cacher’s path.
- Parameters:
obj (str)
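The save/load contract described for PathRef can be illustrated without curifactory. The class SketchPathRef below is a hypothetical stand-in written only from the description above: saving asserts that the returned path matches the cacher's own path, loading returns the path string, and an existence check is assumed as the short-circuit condition.

```python
import os
import tempfile


class SketchPathRef:
    """Hypothetical stand-in for the behavior described for PathRef."""

    def __init__(self, path: str):
        self.path = path

    def get_path(self) -> str:
        return self.path

    def check(self) -> bool:
        # Assumed short-circuit condition: the referenced path already exists.
        return os.path.exists(self.path)

    def save(self, returned_path: str) -> None:
        # Nothing is written here; just confirm the stage wrote to the right place.
        assert returned_path == self.get_path()

    def load(self) -> str:
        return self.path  # the "loaded" value is the path string itself


with tempfile.TemporaryDirectory() as tmp_dir:
    cacher = SketchPathRef(os.path.join(tmp_dir, "big_data.csv"))
    found_before = cacher.check()

    # The "stage" writes directly to the cacher's path and returns that path.
    output_path = cacher.get_path()
    with open(output_path, "w") as outfile:
        outfile.write("a,b\n1,2\n")
    cacher.save(output_path)

    found_after = cacher.check()
    loaded_value = cacher.load()
```

If the stage returned any other path, the assert in save would fail, which is the misalignment warning the description refers to.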
- class curifactory.experimental.caching.PickleCacher(path_override=None)
Methods:
load_obj()
save_obj(obj)
- Parameters:
path_override (str)
- cache_paths: list
- extra_metadata: dict
- load_obj()
- params = ['path_override']
- paths_only: bool
When True, after the object is saved it is replaced with the string path it was saved at. Similarly, when load is called, the path gets returned.
- save_obj(obj)
- class curifactory.experimental.caching.ReportablesCacher(*args, **kwargs)
Methods:
load_obj()
save_obj(reportables_list)
- load_obj()
- Return type:
list[Reportable]
- save_obj(reportables_list)
- Parameters:
reportables_list (list[Reportable])
- class curifactory.experimental.caching.TrackingDBTableCacher(path_override=None, use_db_arg=-1, table_name=None, db=None, id_cols=None)
Methods:
equals_condition(prefix)
load_obj()
save_obj(relation_object)
- Parameters:
path_override (str)
use_db_arg (int | str)
table_name (str)
db (DuckDBPyConnection)
id_cols (dict[str, str])
- cache_paths: list
- clear_obj()
- equals_condition(prefix)
- extra_metadata: dict
- join_condition()
- load_obj()
- params = ['path_override', 'use_db_arg', 'table_name', 'id_cols']
- paths_only: bool
When True, after the object is saved it is replaced with the string path it was saved at. Similarly, when load is called, the path gets returned.
- save_obj(relation_object)
- Parameters:
relation_object (DuckDBPyRelation)