Parameter files and parameter sets

Another goal of Curifactory is to allow effective parameterization of experiments. Where this might normally be done with a JSON or YAML file, Curifactory uses Python files directly for experiment parameterization/configuration. This has a few advantages:

  1. Parameters can be any python object, rather than simply a primitive type or dictionary.

  2. Parameter files can reference/use other parameter files, allowing modularity and composition.

  3. The resulting parameter sets that are passed into an experiment can be algorithmically generated or modified inside a parameter file, with the full power of the Python language! For example, a single parameter file can generate 10 very similar parameter sets for comparison, rather than requiring 10 individually defined parameter configuration files. This makes things like custom grid searches possible, as in the sketch below.
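
As a quick, hypothetical illustration of point 3, a single parameter file can generate a small grid search. (The Params class and the get_params() convention are explained in the sections below; the learning_rate and batch_size attributes here are made up for this sketch.)

from itertools import product

from params import Params

def get_params():
    # generate one parameter set per combination of the (hypothetical)
    # learning_rate and batch_size hyperparameters
    param_sets = []
    for lr, bs in product([0.1, 0.01, 0.001], [32, 64]):
        param_sets.append(Params(name=f"grid_lr{lr}_bs{bs}", learning_rate=lr, batch_size=bs))
    return param_sets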

Note

Throughout this documentation we use specific language to refer to different parts of parameterization:

  • Parameter - a single parameter is a single attribute of a parameter class.

  • Parameter class - a dataclass that extends curifactory’s ExperimentParameters and defines the possible hyperparameters.

  • Parameter set - an instance of a parameter class; every stage operates on a record with a single specific parameter set.

  • Parameter file - a python script as defined below that creates one or more parameter sets.

The ExperimentParameters class

As discussed on the Getting Started page, to define the possible parameters there should be a class that inherits curifactory.ExperimentParameters, and for ease of use it should have the @dataclass decorator. The possible parameters your experiment stages can use are the attributes of this class, and by defining default values for each one, a parameter set constructor call need only specify the parameters that differ from the defaults.

An example Params class is shown below:

from dataclasses import dataclass, field

from curifactory import ExperimentParameters


@dataclass
class Params(ExperimentParameters):
    example_param: str = ""
    example_number_of_epochs: int = 10

    # due to how dataclasses handle initialization, default lists and dictionaries need to
    # be handled with field factory from the dataclasses package.
    example_data: list[int] = field(default_factory=lambda: [1,2,3,4])

The actual parameter files (which by default go in the params/ folder) are then each expected to define a get_params() function that returns a list of Params instances. A very simple example based on the above Params class might look like:

from params import Params

def get_params():
    return [Params(name='test_params', example_number_of_epochs=15)]

Note

As Params is a completely user-defined class, you can technically name this class whatever you choose. The rest of this documentation is written under the assumption it is named Params.

Warning

While the parameters in your dataclass can be arbitrary types, weird issues can sometimes arise if you include non-serializable objects. We’ve run into problems with things like including a PyTorch distributed strategy object as a parameter, as it can end up in a recursive serialization loop when Curifactory tries to get a serialized JSON string representation of the corresponding parameter set.
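
One way to sidestep this is to store a simple, safely serializable identifier in the parameter set and construct the complex object inside the stage that needs it. A minimal sketch, where distributed_strategy is a hypothetical parameter name:

from dataclasses import dataclass

from curifactory import ExperimentParameters

@dataclass
class Params(ExperimentParameters):
    # a plain string is always safely serializable; the stage that needs the
    # strategy object constructs it from this identifier at runtime
    distributed_strategy: str = "ddp"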

Programmatic definition

The get_params() function can contain arbitrary code, and this is where advantages 2 and 3 listed above can be exploited. For instance, if we wanted to define sets for testing several different numbers of epochs, we could return a list of multiple Params instances, each with a different number of epochs:

from params import Params

def get_params():
    param_sets = []
    for i in range(5, 15):
        param_sets.append(Params(name=f"epochs_run_{i}", example_number_of_epochs=i))
    return param_sets

If we want to make parameter sets compositional, we can import another parameter file and reference its get_params() in the new one:

from params import base, Params

def get_params():
    param_sets = base.get_params()
    param_sets[0].name = 'modified' # assuming we know there's only one Params instance (otherwise we'd do this in a loop)
    param_sets[0].example_data = [0, 2, 4, 6]
    return param_sets

In the above example, there’s another parameter file named base; we get its parameter sets with base.get_params(), apply our modifications, and return the modified sets. In this way, any changes made to the base parameters also influence this file, allowing for a form of parameter set hierarchy.
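
For reference, a minimal sketch of what the base parameter file referenced above might contain (this params/base.py is hypothetical and shown only to make the composition example concrete):

params/base.py
from params import Params

def get_params():
    return [Params(name='base', example_number_of_epochs=10)]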

We can also create common helper functions for building up large numbers of parameter sets. As an example, we may frequently wish to create “seeded” parameter sets, where we repeat the same parameters several times but with a different seed for sklearn models or similar. Rather than manually defining these, or reimplementing the logic in every relevant get_params() function, we can extract it as in this example:

params/common.py
from copy import deepcopy
from params import Params

def seed_set(params: Params, seed_count: int = 5):
    seed_params = []
    for i in range(seed_count):
        # Make a copy of the passed params and apply a different seed
        new_params = deepcopy(params)
        new_params.name += f"_seed{i}"
        new_params.seed = i
        seed_params.append(new_params)
    return seed_params
params/seeded_models.py
from params import Params
from params.common import seed_set

def get_params():
    knn_params = Params(name="test_knn", model_type="knn")
    svm_params = Params(name="test_svm", model_type="svm")

    all_params = []
    all_params.extend(seed_set(knn_params))
    all_params.extend(seed_set(svm_params, 3))

    return all_params

Calling get_params() in the params/seeded_models.py parameter file would return:

[
    Params(name='test_knn_seed0', model_type='knn', seed=0),
    Params(name='test_knn_seed1', model_type='knn', seed=1),
    Params(name='test_knn_seed2', model_type='knn', seed=2),
    Params(name='test_knn_seed3', model_type='knn', seed=3),
    Params(name='test_knn_seed4', model_type='knn', seed=4),
    Params(name='test_svm_seed0', model_type='svm', seed=0),
    Params(name='test_svm_seed1', model_type='svm', seed=1),
    Params(name='test_svm_seed2', model_type='svm', seed=2),
]

Using parameters

Every stage automatically has access to the currently relevant Params instance, as it is part of the passed record.

from curifactory import Record, stage

from params import Params
import src

@stage(['training_data'], ['model'])
def train_model(record: Record, training_data):
    params: Params = record.params # use the type hinting to get good autocomplete in IDEs

    if params.model_type == "knn":
        # pass relevant parameters into the codebase functions
        src.train_knn(params.seed)
        # ...

Parameter set hashes and operational parameters

Curifactory automatically versions cached artifacts based on the parameter set used. It does this by computing a hash (the full details of which can be found on the Hashing Mechanics page), which involves taking a string representation of the value of every attribute in a parameter set and computing a combined MD5 hash.
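
To build intuition, here is a rough sketch of the idea. This is only an illustration, not Curifactory’s actual implementation (which also handles the hash_representations overrides described below; see the Hashing Mechanics page):

import hashlib
from dataclasses import fields

def rough_params_hash(params) -> str:
    # concatenate a string representation of every attribute in the parameter set...
    combined = ""
    for f in fields(params):
        combined += f.name + repr(getattr(params, f.name))
    # ...and take the md5 of the result
    return hashlib.md5(combined.encode()).hexdigest()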

There are a few cases where we may want to modify how that hash is computed:

  1. Some parameters may be “operational”: they influence how an experiment runs but shouldn’t change the results.

  2. By default, the repr of some types of objects may not return a value that uniquely and consistently represents what we want it to.

Say we have the following dataclass:

from dataclasses import dataclass

import curifactory as cf

@dataclass
class Params(cf.ExperimentParameters):
    model_size: int = 9000
    num_gpus: int = 1

If we create two parameter sets with different gpu counts, we get two different hashes:

p1 = Params(name="one_gpu", num_gpus=1)
p2 = Params(name="two_gpus", num_gpus=2)

p1.params_hash()
#> '1ae3169d21cc23f1665561f7e91fe266e'
p2.params_hash()
#> 'f1b00c12e820963221b1f60501d3822e'

This means any stages we run these two parameter sets through will compute and cache two separate sets of outputs. However, we may want to change the number of GPUs we use (e.g. when moving between machines) while still using the same cached values, since we wouldn’t expect the results to change.

Curifactory looks for a special hash_representations dictionary on any ExperimentParameters class (or on any dataclass composed within an ExperimentParameters subclass instance). This dictionary can optionally contain string keys for one or more of the attributes on the parameter class, each associated with a function that is passed both the entire parameter set instance and the value of that specific parameter. By setting that function to None, we tell Curifactory to ignore the corresponding parameter as part of the hash.

Since setting a default dictionary on a dataclass requires an annoying amount of syntax, Curifactory provides a set_hash_functions function to initialize it correctly.

If we want to ignore num_gpus, it might look like this:

@dataclass
class Params(cf.ExperimentParameters):
    model_size: int = 9000
    num_gpus: int = 1

    hash_representations: dict = cf.set_hash_functions(num_gpus=None)

If we now run the same code as above:

p1 = Params(name="one_gpu", num_gpus=1)
p2 = Params(name="two_gpus", num_gpus=2)

p1.params_hash()
#> 'b50ba553739feea66c8aab97787c22e0'
p2.params_hash()
#> 'b50ba553739feea66c8aab97787c22e0'

If we specify an actual function, that function is passed both the whole parameter set and the value of the specified parameter, meaning we can condition the hash representation of one parameter on the others. (This is primarily useful if a parameter is a complex object whose repr doesn’t include some of the values it was initialized with.)

As a simplistic and somewhat silly example, we can condition our model_size hash representation on num_gpus:

@dataclass
class Params(cf.ExperimentParameters):
    model_size: int = 9000
    num_gpus: int = 1

    hash_representations: dict = cf.set_hash_functions(
        num_gpus=None,
        model_size=lambda self, obj: str(obj/self.num_gpus)
    )

p1 = Params(name="one_gpu", model_size=4500, num_gpus=1)
p2 = Params(name="two_gpus", model_size=9000, num_gpus=2)
p3 = Params(name="big_one_gpu", model_size=9000, num_gpus=1)

p1.params_hash()
#> 'a04bd13c314c694d8f1cff76cc34d2b'
p2.params_hash()
#> 'a04bd13c314c694d8f1cff76cc34d2b'
p3.params_hash()
#> 'ff1275fb121412c666259c7baefbf4e9'

Subparameter classes

It is possible to have parameter dataclasses composed of other dataclasses, which can help keep parameters organized and syntactically convenient:

@dataclass
class DataParams:
    path: str = ""

@dataclass
class Params(cf.ExperimentParameters):
    a: int = 6
    data: DataParams = DataParams()

my_params = Params()
my_params.data.path

Note that in Python 3.11 the mutability requirements for dataclass fields were tightened, and the above code may not work (since a dataclass is mutable by default). There are two ways around this. The first is easier but not recommended: placing unsafe_hash=True on the sub-dataclass:

@dataclass(unsafe_hash=True)
class DataParams:
    path: str = ""

The more correct way is to use dataclasses’ default_factory:

from dataclasses import dataclass, field

@dataclass
class DataParams:
    path: str = ""

@dataclass
class Params(cf.ExperimentParameters):
    a: int = 6
    data: DataParams = field(default_factory=lambda: DataParams())
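
As mentioned in the hashing section above, Curifactory also checks for a hash_representations dictionary on composed dataclasses. A minimal sketch combining the two features, assuming set_hash_functions works the same way on a plain sub-dataclass (download_dir here is a hypothetical operational parameter we exclude from the hash):

from dataclasses import dataclass, field

import curifactory as cf

@dataclass
class DataParams:
    path: str = ""
    # hypothetical operational parameter: where data gets downloaded shouldn't
    # change any results, so exclude it from the hash
    download_dir: str = "/tmp/data"

    hash_representations: dict = cf.set_hash_functions(download_dir=None)

@dataclass
class Params(cf.ExperimentParameters):
    a: int = 6
    data: DataParams = field(default_factory=lambda: DataParams())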