Tips and tricks

Minimum code to run stages

Directly/manually running stages can be useful for a variety of reasons, whether for debugging them, using the same configuration values as from an experiment, or as a basis for exploration in a jupyter notebook.

The smallest set of components needed to do this is an Params instance, an ArtifactManager, and a Record:

# required components
from curifactory import ArtifactManager, Record
from params import Params  # a custom dataclass extending cf.ExperimentParameters

# import whatever stages you need to run
from stages.data import get_data, clean_data

# initialize instances of the required components
params = Params() # create/initialize the params. Note that you could also directly
                  # load and manually call the get_params() of a parameter file
mngr = ArtifactManager("some_cache_ref_name") # the caching prefix name (normally
                                              # the experiment name)
record = Record(mngr, params)

# run desired stages
clean_data(get_data(record))

# get the return from the last run stage (you could also get it from record.state)
print(record.output)

Note that calling any stage, e.g. get_data(record) returns the modified record that was passed in, and since calling a stage always only takes exactly one parameter, (the record to use) you can directly chain the stages together with clean_data(get_data(record))

Common dataset for multiple different argsets

For some experiments, you may have a single dataset processing procedure that all other parameter sets share. By default, caching only applies to a single parameter set, so the data would have to be recomputed for every parameter set, if you defined your entire experiment in one procedure. However, it’s possible to write your experiment to “share” the state of a data processing record with the remainder of your argsets.

def run(param_sets: List[Params], mngr: ArtifactManager):

    # define the procedures
    common_data_proc = Procedure([get_data, clean_data], mngr) # the resulting state of this procedure
                                                               # should be the same for all our
                                                               # parameter sets, so we only want to run
                                                               # once it
    model_proc = Procedure([train_model, test_model], mngr) # this procedure is where we expect our
                                                            # argsets to differ, so run it per param set

    # run our common procedure once (assuming all of the relevant parameter sets are _actually_ the
    # same, we can just pull from the first one)
    data_record = common_data_proc.run(param_sets[0])

    # run each argset through model procedure
    for param_set in param_sets:
        record = data_record.make_copy(param_set) # duplicate the data record's state but with new param sets
        model_proc.run(param_set, record) # we can manually specify the record to use in a procedure if
                                          # we already have one

The above will result in an experiment graph that looks something like the following, assuming two argsets:

_images/common_proc.png

The make_copy() function on Record instances creates a new record with a deepcopy of the state attribute, meaning you can define procedures that only have later stages, and pass them the copied records.

This can be extended beyond just common datasets - any baseline procedure that will not change state across different parameter sets in a given experiment can follow this same pattern.

Branching in experiments

When running experiments that compare many different approaches, it is likely you’ll need to run different procedures for different parameter sets (as different stages may be necessary, for example in training an sklearn algorithm versus a model implemented in pytorch.) Since experiment scripts are arbitrary, you can create parameters intended solely for use in the experiment itself, and simply branch based on those parameters for each given parameter set:

def run(param_sets: List[Params], mngr: ArtifactManager):
    proc1 = Procedure([get_data, clean_data, train_sklearn_alg, test_model], mngr)
    proc2 = Procedure([get_data, clean_data, train_pytorch, test_model], mngr)

    for param_set in param_sets:
        # we assume "model_type" is one of the arguments
        if param_set.model_type == "sklearn":
            proc1.run(param_set)
        elif param_set.model_type == "pytorch":
            proc2.run(param_set)

    Procedure([compile_results], mngr).run(None) # an appropriately constructed
                                                 # aggregate stage can still run
                                                 # to compare outputs across multiple
                                                 # procedures.

Using the git commit hash from reports

Every time an experiment is run, the experiment store keeps track of the current git commit hash. If you need to be able to exactly reproduce an experiment, ensure that all code is committed before the run, and run it with the --full-store flag (see the Full stores (--store-full, --dry-cache) section.) The output report from your run will contain the git commit hash in the top metadata fields, as well as the command to reproduce it with the correct cache.

To re-run, checkout that hash in git, and enter the reproduce command given in the report:

git commit
experiment some_experiment -p some_params --store-full

# review the report to get the appropriate run reproduce command and the git commit hash

# to exactly reproduce:
git checkout [COMMIT_HASH]
experiment some_experiment -p some_params --cache data/runs/[RUN_REF_NAME] --dry-cache

Softlinking data directories

If dealing with very large data, in order to not have to mess with cache paths it can sometimes be useful to softlink the data/cache and data/runs. If you have one or more starting datasets that you want your experiments to have access to, you could softlink it to data/raw

In order to make these easy to work with, it’s recommended to make a shell script or makefile that runs something like:

# link_data_dirs.sh

ln -s [some_cache_path] ./data/cache
ln -s [some_runs_folder] ./data/runs
ln -s [some_datasets_path] ./data/raw