Caching

Classes:

`AggregateArtifactCacher`()	Just meant to help avoid unwanted implicit _aggregate stages continually being added to the database when nothing is _really_ running.
`Cacheable`([path_override, extension, paths_only])
`DBCacher`([connection_str, extensions])	A cacher that loads a duckdb database connection.
`DBTableCacher`([path_override, use_db_arg, ...])
`FileReferenceCacher`(*args, **kwargs)	Saves a file path or list of file paths generated from a stage as a JSON file.
`JsonCacher`([path_override])
`MetadataOnlyCacher`([path_override])	Only writes out a metadata file, which can be used to check that a stage was completed for a given set of parameters.
`ParquetCacher`([path_override, paths_only, ...])
`PathRef`(*args, **kwargs)	Special type of cacher that doesn't directly save or load anything; it just tracks a file path for reference.
`PickleCacher`([path_override])
`ReportablesCacher`(*args, **kwargs)
`TrackingDBTableCacher`([path_override, ...])

class curifactory.experimental.caching.AggregateArtifactCacher

Just meant to help avoid unwanted implicit _aggregate stages continually being added to the database when nothing is _really_ running.

NOTE: Can’t be used as an inline cacher? (depends on an initialized ArtifactList artifact)

Methods:

`check`([silent])
`load_obj`()

Attributes:

params

artifact: Artifact

cache_paths: list

check(silent=True)

extra_metadata: dict

load_obj()

params = []

paths_only: bool: When True, after the object is saved it is replaced with the string path it was saved at. Similarly, when load is called, the path gets returned.

class curifactory.experimental.caching.Cacheable(path_override=None, extension='', paths_only=False)

Methods:

`check`([silent])
`clear`()	Remove self from the cache.
`clear_metadata`()
`clear_obj`()
`get_from_db_metadata`(cacher_module, ...)
`get_params`()
`get_path`([suffix, dry])	When dry is True, don't add the resulting path to the cacher's list of tracked paths (e.g. if this is just being used in a print statement).
`load`()
`load_artifact`(path)
`load_metadata`()
`load_obj`()
`load_paths`()
`resolve_template_string`(path)	Intended for when path_override is specified, this returns the path with formatting applied/resolved.
`save`(obj)
`save_metadata`()
`save_obj`(obj)

Attributes:

`params`
`paths_only`	When True, after the object is saved it is replaced with the string path it was saved at.

Parameters:

path_override (str)
extension (str)
paths_only (bool)

artifact: Artifact

cache_paths: list

check(silent=False)

Parameters: silent (bool)

clear(): Remove self from the cache.

clear_metadata()

clear_obj()

extra_metadata: dict

static get_from_db_metadata(cacher_module, cacher_type, cacher_params)

Parameters:

cacher_module (str)
cacher_type (str)
cacher_params (dict[str, Any])
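
The signature suggests that a cacher can be rebuilt from a stored module name, type name, and parameter dictionary. The following is only a sketch of that general pattern, not curifactory's implementation: `build_from_metadata` is a hypothetical helper, and a standard-library class stands in for a real cacher.

```python
import importlib
from typing import Any


def build_from_metadata(module_name: str, type_name: str, params: dict[str, Any]) -> Any:
    """Hypothetical helper: import a module by name, look up a class on it,
    and construct that class with the stored keyword parameters."""
    module = importlib.import_module(module_name)
    cls = getattr(module, type_name)
    return cls(**params)


# Stand-in example using a standard-library class instead of a real cacher.
half = build_from_metadata("fractions", "Fraction", {"numerator": 1, "denominator": 2})
print(half)
```

Storing the module and type as strings keeps the metadata serializable, at the cost of only resolving them when the object is rebuilt.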

get_params()

Return type: dict[str, Any]

get_path(suffix=None, dry=False)

When dry is True, don’t add the resulting path to the cacher’s list of tracked paths (e.g. if this is just being used in a print statement).

Return type: str
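
As a rough illustration of path tracking, the sketch below uses a hypothetical `TrackedPaths` class (not part of curifactory). It assumes that a dry lookup computes the path without recording it in `cache_paths`, and that a normal lookup records it.

```python
import os
from typing import Optional


class TrackedPaths:
    """Hypothetical stand-in for a cacher that remembers the paths it hands out."""

    def __init__(self, base_dir: str, filename: str):
        self.base_dir = base_dir
        self.filename = filename
        self.cache_paths = []

    def get_path(self, suffix: Optional[str] = None, dry: bool = False) -> str:
        path = os.path.join(self.base_dir, self.filename + (suffix or ""))
        # Assumption: only record the path when this is not a dry lookup.
        if not dry and path not in self.cache_paths:
            self.cache_paths.append(path)
        return path


cacher = TrackedPaths("cache", "model")
cacher.get_path(suffix=".log", dry=True)  # e.g. only needed for a print statement
cacher.get_path(suffix=".pkl")            # a path that will actually be written
print(cacher.cache_paths)
```

Only the second call leaves an entry in `cache_paths`, so anything that later iterates over tracked paths ignores the path that was merely displayed.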

load()

load_artifact(path)

Parameters: path (str)
Return type: Artifact

load_metadata()

load_obj()

load_paths()

params = ['path_override', 'extension', 'paths_only']

paths_only: bool: When True, after the object is saved it is replaced with the string path it was saved at. Similarly, when load is called, the path gets returned.

resolve_template_string(path)

Intended for when path_override is specified, this returns the path with formatting applied/resolved.

Note that this does _not_ handle suffixing; suffix resolution needs to occur wherever this is used.

Possible keyword replacement fields:

* {hash} - the hash of the stage.
* {stage_name} - the name of the stage that produces this artifact.
* {[STAGE_ARG_NAME]} - any argument name from the stage itself.
* {artifact_name} - the name of this output object.
* {artifact_filename} - the normal filename for this cacher (doesn’t include dir path etc.).

Parameters: path (str)
Return type: str
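
A minimal sketch of this kind of template resolution, assuming it behaves like Python's `str.format`. `resolve_template` and all of the field values are hypothetical, and `params` is assumed here to be a stage argument with a `dataset` attribute.

```python
from types import SimpleNamespace


def resolve_template(path: str, **fields) -> str:
    """Hypothetical helper: fill ``{name}`` fields in a path template."""
    return path.format(**fields)


# Made-up values for the replacement fields listed above.
fields = {
    "hash": "abc123",
    "stage_name": "make_big_data",
    "artifact_name": "large_dataset_path",
    "artifact_filename": "large_dataset_path.csv",
    "params": SimpleNamespace(dataset="mnist"),
}

resolved = resolve_template("./data/{stage_name}/{hash}_{artifact_name}.csv", **fields)
by_attribute = resolve_template("./data/raw/big_data_{params.dataset}.csv", **fields)
print(resolved)
print(by_attribute)
```

As noted above, any suffix would still need to be appended by the caller after the template is resolved.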

save(obj)

save_metadata()

save_obj(obj)

class curifactory.experimental.caching.DBCacher(connection_str=None, extensions=None, **kwargs)

A cacher that loads a duckdb database connection.

Methods:

`check`([silent])
`load`()
`load_obj`()

Attributes:

params

Parameters:

connection_str (str)
extensions (list[str])

artifact: Artifact

cache_paths: list

check(silent=False)

Parameters: silent (bool)

extra_metadata: dict

load()

load_obj()

params = ['connection_str', 'kwargs', 'extensions']

paths_only: bool: When True, after the object is saved it is replaced with the string path it was saved at. Similarly, when load is called, the path gets returned.

class curifactory.experimental.caching.DBTableCacher(path_override=None, use_db_arg=-1, table_name=None, db=None)

Methods:

`check`([silent])
`get_db`()
`get_table_name`()
`load_obj`()
`save_obj`(relation_object)
`upsert_relation`(relation_object)

Attributes:

params

Parameters:

path_override (str)
use_db_arg (int | str)
table_name (str)
db (DuckDBPyConnection)

artifact: Artifact

cache_paths: list

check(silent=True)

extra_metadata: dict

get_db()

get_table_name()

load_obj()

params = ['path_override', 'use_db_arg', 'table_name']

paths_only: bool: When True, after the object is saved it is replaced with the string path it was saved at. Similarly, when load is called, the path gets returned.

save_obj(relation_object)

Parameters: relation_object (DuckDBPyRelation)

upsert_relation(relation_object)

Parameters: relation_object (DuckDBPyRelation)

class curifactory.experimental.caching.FileReferenceCacher(*args, **kwargs)

Saves a file path or list of file paths generated from a stage as a json file. The check function will check existence of all file paths.

This is useful when a stage stores or generates a large number of files on disk, as it would be unwieldy to return them all directly from the stage (or infeasible to keep them in memory). When this cacher is checked for pre-existing output, it tries to load the JSON file storing the filenames and then checks for the existence of each path listed in it. If all of them exist, it will short-circuit computation.

Using this cacher does mean the user is in charge of loading/saving the file paths correctly, but in some cases that may be desirable.

This can also be used for storing a reference to a single file outside the normal cache.

When combined with the get_dir call on the record, you can create a cached directory of files similarly to a regular cacher and simply keep a reference to them as part of the actual cacher process.

Example

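import os

# Assumes ``stage`` and ``FileReferenceCacher`` have already been imported.
@stage(inputs=None, outputs=["many_text_files"], cachers=[FileReferenceCacher])
def output_text_files(record):
    # directory in the cache to hold the generated files
    file_path = record.get_dir("my_files")
    my_file_list = [os.path.join(file_path, f"my_file_{num}") for num in range(20)]

    for file in my_file_list:
        with open(file, 'w') as outfile:
            outfile.write("test")

    # only the list of paths is returned; the cacher stores it as JSON
    return my_file_list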
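
The check described above (load the stored list of paths, then verify that each one exists) can be sketched with the standard library alone. The two helper functions are hypothetical and only approximate the described behavior.

```python
import json
import os
import tempfile
from typing import List, Union


def save_file_references(reference_path: str, files: Union[List[str], str]) -> str:
    """Write the path (or list of paths) to a JSON file and return its location."""
    with open(reference_path, "w") as outfile:
        json.dump(files, outfile)
    return reference_path


def check_file_references(reference_path: str) -> bool:
    """True only if the JSON file exists and every path listed in it exists."""
    if not os.path.exists(reference_path):
        return False
    with open(reference_path) as infile:
        files = json.load(infile)
    if isinstance(files, str):  # a single referenced file
        files = [files]
    return all(os.path.exists(path) for path in files)


work_dir = tempfile.mkdtemp()
files = [os.path.join(work_dir, f"my_file_{num}") for num in range(3)]
for file in files:
    with open(file, "w") as outfile:
        outfile.write("test")

reference = save_file_references(os.path.join(work_dir, "refs.json"), files)
all_present = check_file_references(reference)

os.remove(files[0])  # simulate one generated file going missing
after_removal = check_file_references(reference)
print(all_present, after_removal)
```

Because the check fails as soon as any referenced file is missing, deleting a single generated file is enough to make the stage run again.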

Methods:

`check`([silent])
`load_obj`()
`save_obj`(files)

Attributes:

params

artifact: Artifact

cache_paths: list

check(silent=False)

Parameters: silent (bool)
Return type: bool

extra_metadata: dict

load_obj()

Return type: list[str] | str

params = ['path_override']

paths_only: bool: When True, after the object is saved it is replaced with the string path it was saved at. Similarly, when load is called, the path gets returned.

save_obj(files)

Parameters: files (list[str] | str)
Return type: str

class curifactory.experimental.caching.JsonCacher(path_override=None)

Methods:

`load_obj`()
`save_obj`(obj)

Attributes:

params

Parameters: path_override (str)

artifact: Artifact

cache_paths: list

extra_metadata: dict

load_obj()

params = ['path_override']

paths_only: bool: When True, after the object is saved it is replaced with the string path it was saved at. Similarly, when load is called, the path gets returned.

save_obj(obj)

class curifactory.experimental.caching.MetadataOnlyCacher(path_override=None)

Only writes out a metadata file, which can be used to check that a stage was completed for a given set of parameters. The underlying assumption is that the stage only mutated something in some way and has no specific object to retrieve.
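
One way to picture this is a marker file that records what ran and with which parameters, with nothing else saved. The helpers and the metadata layout below are hypothetical, not the format curifactory actually writes.

```python
import json
import os
import tempfile


def save_metadata_only(metadata_path: str, stage_name: str, params: dict) -> None:
    """Record that a stage ran, and with which parameters, without saving any object."""
    with open(metadata_path, "w") as outfile:
        json.dump({"stage": stage_name, "params": params}, outfile)


def already_completed(metadata_path: str) -> bool:
    """The 'check' here is just whether the metadata file exists."""
    return os.path.exists(metadata_path)


work_dir = tempfile.mkdtemp()
metadata_path = os.path.join(work_dir, "cleanup_stage_metadata.json")

first_check = already_completed(metadata_path)
# ... a stage would mutate some external state here (e.g. update a table) ...
save_metadata_only(metadata_path, "cleanup_stage", {"threshold": 0.5})
second_check = already_completed(metadata_path)
print(first_check, second_check)
```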

Methods:

`check`([silent])

Attributes:

params

Parameters: path_override (str)

artifact: Artifact

cache_paths: list

check(silent=False)

extra_metadata: dict

params = ['path_override']

paths_only: bool: When True, after the object is saved it is replaced with the string path it was saved at. Similarly, when load is called, the path gets returned.

class curifactory.experimental.caching.ParquetCacher(path_override=None, paths_only=False, use_db_arg=-1)

Methods:

`load_obj`()
`save_obj`(obj)

Attributes:

params

Parameters:

path_override (str)
paths_only (bool)
use_db_arg (int | str)

artifact: Artifact

cache_paths: list

extra_metadata: dict

load_obj()

params = ['path_override', 'paths_only', 'use_db_arg']

paths_only: bool: When True, after the object is saved it is replaced with the string path it was saved at. Similarly, when load is called, the path gets returned.

save_obj(obj)

Parameters: obj (DuckDBPyRelation | DataFrame | dict[str, list] | list[dict[str | Any]])
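
Two of the accepted input shapes are plain Python containers: a column-oriented `dict[str, list]` and a row-oriented list of dictionaries. The sketch below only shows how the two shapes relate. `rows_to_columns` is a hypothetical helper that assumes every row has the same keys, and it says nothing about how `ParquetCacher` handles these inputs internally.

```python
from typing import Any, Dict, List


def rows_to_columns(rows: List[Dict[str, Any]]) -> Dict[str, list]:
    """Convert row-oriented records into a column-oriented mapping.

    Assumes every row has the same keys.
    """
    columns: Dict[str, list] = {}
    for row in rows:
        for key, value in row.items():
            columns.setdefault(key, []).append(value)
    return columns


rows = [{"id": 1, "label": "a"}, {"id": 2, "label": "b"}]
columns = rows_to_columns(rows)
print(columns)
```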

class curifactory.experimental.caching.PathRef(*args, **kwargs)

Special type of cacher that doesn’t directly save or load anything, it just tracks a file path for reference.

This is primarily useful for stages that never keep a particular object in memory and just want to directly pass around paths. The PathRef cacher allows still short-circuiting if the referenced path already exists, rather than needing to do it manually in the stage.

Note that when using a PathRef cacher you still need to return a value from the stage for the cacher to “save”. This cacher expects that return value to be the path that was written to, and internally runs an assert returned_path == self.get_path() to double check that the stage wrote to the correct place. This also means that the value stored “in memory” is just the path, and that path string is what gets “loaded”.

This cacher is distinct from the FileReferenceCacher in that the path of this cacher _is the referenced path_, rather than the path of a saved file that contains the referenced path. (With the FileReferenceCacher, a new record/hash etc. that refers to the same target filepath would still trigger stage execution, and the stage would still need to do its own check of whether the original file already exists before saving.)

Example

@stage([], ["large_dataset_path"], [PathRef("./data/raw/big_data_{params.dataset}.csv")])
def make_big_data(record):
    # you can use record's ``stage_cachers`` to get the expected path
    output_path = record.stage_cachers[0].get_path()
    ...
    # make big data without keeping it in memory
    ...
    return output_path

@stage(["large_dataset_path"], ["model_path"], [PathRef])
def train_model(record, large_dataset_path):
    # the other way you can get a path that should be correct
    # is through record's ``get_path()``. The assert inside
    # PathRef's save will help us double check that it's correct.
    output_path = record.get_path(obj_name="model_path")
    ...
    # train model using large_dataset_path, the string path we need.
    ...
    return output_path
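
The save, load, and check behavior described above can be approximated with a small stand-in. `TinyPathRef` is a hypothetical class, not the real `PathRef`; it assumes the check is simply whether the referenced path exists.

```python
import os
import tempfile


class TinyPathRef:
    """Hypothetical stand-in: the cacher's own path *is* the referenced path."""

    def __init__(self, path: str):
        self.path = path

    def get_path(self) -> str:
        return self.path

    def check(self) -> bool:
        # Short-circuit is possible whenever the referenced file already exists.
        return os.path.exists(self.path)

    def save(self, obj: str) -> str:
        # Nothing is written here; only verify the stage used the expected path.
        assert obj == self.get_path(), "stage wrote to an unexpected path"
        return obj

    def load(self) -> str:
        return self.get_path()


work_dir = tempfile.mkdtemp()
ref = TinyPathRef(os.path.join(work_dir, "big_data.csv"))
exists_before = ref.check()

with open(ref.get_path(), "w") as outfile:  # the "stage" writes the file itself
    outfile.write("a,b\n1,2\n")

stored = ref.save(ref.get_path())
exists_after = ref.check()

try:
    ref.save(os.path.join(work_dir, "somewhere_else.csv"))
    mismatch_caught = False
except AssertionError:
    mismatch_caught = True
print(exists_before, exists_after, mismatch_caught)
```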

Methods:

`load`()
`save`(obj)	This is effectively a no-op; this cacher is just a reference to its own path.

load()

Return type: str

save(obj)

This is effectively a no-op; this cacher is just a reference to its own path. obj is expected to be that same path, and we assert this to help alert the user if something got misaligned and the path they wrote to wasn’t this cacher’s path.

Parameters: obj (str)

class curifactory.experimental.caching.PickleCacher(path_override=None)

Methods:

`load_obj`()
`save_obj`(obj)

Attributes:

params

Parameters: path_override (str)

artifact: Artifact

cache_paths: list

extra_metadata: dict

load_obj()

params = ['path_override']

paths_only: bool: When True, after the object is saved it is replaced with the string path it was saved at. Similarly, when load is called, the path gets returned.

save_obj(obj)

class curifactory.experimental.caching.ReportablesCacher(*args, **kwargs)

Methods:

`load_obj`()
`save_obj`(reportables_list)

load_obj()

Return type: list[Reportable]

save_obj(reportables_list)

Parameters: reportables_list (list[Reportable])

class curifactory.experimental.caching.TrackingDBTableCacher(path_override=None, use_db_arg=-1, table_name=None, db=None, id_cols=None)

Methods:

`clear_obj`()
`equals_condition`(prefix)
`join_condition`()
`load_obj`()
`save_obj`(relation_object)

Attributes:

params

Parameters:

path_override (str)
use_db_arg (int | str)
table_name (str)
db (DuckDBPyConnection)
id_cols (dict[str, str])

artifact: Artifact

cache_paths: list

clear_obj()

equals_condition(prefix)

extra_metadata: dict

join_condition()

load_obj()

params = ['path_override', 'use_db_arg', 'table_name', 'id_cols']

paths_only: bool: When True, after the object is saved it is replaced with the string path it was saved at. Similarly, when load is called, the path gets returned.

save_obj(relation_object)

Parameters: relation_object (DuckDBPyRelation)