Parameter files and parameter sets
==================================

Another goal of Curifactory is to allow effective parameterization of experiments. Where this might normally be
done with a json or yaml file, Curifactory directly uses python files for experiment parameterization/configuration.
This has a few advantages:

1. Parameters can be any python object, rather than simply a primitive type or dictionary.
2. Parameter files can reference/use other parameter files, allowing modularity and composition.
3. The resulting parameter sets that are passed into an experiment can be algorithmically generated or modified inside a
   parameter file, with the full power of the python language! For example, a single parameter file can generate
   10 very similar parameter sets for comparison, rather than requiring 10 individually defined
   parameter configuration files. This makes things like custom gridsearches possible.

.. note::

    Throughout this documentation we use specific language to refer to different parts of parameterization:

    * **Parameter** - a single attribute of a parameter class.
    * **Parameter class** - a dataclass that extends curifactory's ``ExperimentParameters`` and defines the possible hyperparameters.
    * **Parameter set** - an instance of a parameter class. Every stage operates on a record with a single specific parameter set.
    * **Parameter file** - a python script, as defined below, that creates one or more parameter sets.

The :code:`ExperimentParameters` class
--------------------------------------

As discussed on the :ref:`Getting Started` page, to define possible parameters, there should be a
class that inherits :code:`curifactory.ExperimentParameters`, and for ease of use it should have the
:code:`@dataclass` decorator. The parameters your experiment stages can use are the attributes
of this class, and by defining default values for each one, a parameter set constructor need only
specify the parameters that differ from the defaults.

An example :code:`Params` class is shown below:

.. code-block:: python

    from dataclasses import dataclass, field

    from curifactory import ExperimentParameters


    @dataclass
    class Params(ExperimentParameters):
        example_param: str = ""
        example_number_of_epochs: int = 10

        # dataclasses do not allow mutable defaults such as lists and dictionaries,
        # so these need to be supplied through field(default_factory=...).
        example_data: list[int] = field(default_factory=lambda: [1, 2, 3, 4])
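
The need for ``default_factory`` can be seen with plain dataclasses, independent of Curifactory.
The following is a minimal sketch (``BrokenParams`` and ``WorkingParams`` are hypothetical names
used only for illustration): a bare list default is rejected when the class is defined, while a
factory gives every instance its own list.

.. code-block:: python

    from dataclasses import dataclass, field

    # A mutable default is rejected as soon as the class is defined.
    try:

        @dataclass
        class BrokenParams:
            example_data: list = [1, 2, 3, 4]

    except ValueError as error:
        message = str(error)  # the message suggests using default_factory


    @dataclass
    class WorkingParams:
        example_data: list = field(default_factory=lambda: [1, 2, 3, 4])


    first = WorkingParams()
    second = WorkingParams()
    first.example_data.append(5)  # only changes the list owned by `first`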


The actual parameter files (which by default go in the ``params/`` folder) are then each expected to define a
``get_params()`` function, which should return a list of ``Params`` instances. A very simple example based on
the above ``Params`` class might look like:

.. code-block:: python

    from params import Params

    def get_params():
        return [Params(name='test_params', example_number_of_epochs=15)]

.. note::

    As ``Params`` is a completely user-defined class, you can technically name this class whatever you
    choose. The rest of this documentation is written under the assumption it is named ``Params``.


.. warning::

   While the parameters in your dataclass can be arbitrary types, weird issues
   can sometimes arise if you include non-serializable objects. We've run into
   problems with things like including a pytorch distributed strategy object as
   an argument, as it can end up in a weird recursive serialization loop when
   curifactory tries to get a serialized JSON string representation of the
   corresponding arguments.


Programmatic definition
-----------------------

The ``get_params()`` function can contain arbitrary code, and this is where advantages 2 and 3 listed above can be
exploited. For instance, if we wanted to define parameter sets for testing several different numbers of epochs,
we could return a list of multiple ``Params`` instances, each with a different number of epochs:

.. code-block:: python

    from params import Params

    def get_params():
        param_sets = []
        for i in range(5, 15):
            param_sets.append(Params(name=f"epochs_run_{i}", example_number_of_epochs=i))
        return param_sets
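
The same approach extends to a small grid search. The sketch below is self-contained, so it uses a
plain dataclass as a stand-in for a real ``Params`` class (which would inherit from
``ExperimentParameters``). The ``learning_rate`` field and the chosen values are hypothetical:

.. code-block:: python

    from dataclasses import dataclass
    from itertools import product


    @dataclass
    class Params:  # stand-in for an ExperimentParameters subclass
        name: str = "unnamed"
        example_number_of_epochs: int = 10
        learning_rate: float = 0.01  # hypothetical parameter


    def get_params():
        epoch_options = [5, 10, 15]
        learning_rate_options = [0.1, 0.01]

        # one parameter set per combination of the two option lists
        return [
            Params(
                name=f"epochs_{epochs}_lr_{lr}",
                example_number_of_epochs=epochs,
                learning_rate=lr,
            )
            for epochs, lr in product(epoch_options, learning_rate_options)
        ]

Giving each combination its own name keeps the resulting parameter sets distinguishable.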

If we want to make parameter sets compositional, we can import one of the other parameter files and
call its ``get_params()`` function in the new one:

.. code-block:: python

    from params import base, Params

    def get_params():
        param_sets = base.get_params()
        # assuming we know there's only one Params instance (otherwise we do this in a loop)
        param_sets[0].name = 'modified'
        param_sets[0].starting_data = [0, 2, 4, 6]
        return param_sets

In the above example, there's another parameter file named ``base``. We get its parameter sets with ``base.get_params()``,
apply our modifications, and return the modified parameter sets. In this way, any changes made to the base parameters also
carry over to this file, allowing for a form of parameter set hierarchy.
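
If the base parameter file returns more than one parameter set, the same modification can be applied
in a loop. The following self-contained sketch uses a plain dataclass and a local ``base_get_params()``
function as hypothetical stand-ins for ``Params`` and ``base.get_params()``:

.. code-block:: python

    from dataclasses import dataclass, field


    @dataclass
    class Params:  # stand-in for an ExperimentParameters subclass
        name: str = "unnamed"
        starting_data: list = field(default_factory=lambda: [1, 2, 3, 4])


    def base_get_params():  # stand-in for base.get_params()
        return [Params(name="base_a"), Params(name="base_b")]


    def get_params():
        param_sets = base_get_params()
        for param_set in param_sets:
            # derive a new name so the modified sets stay distinguishable
            param_set.name = f"{param_set.name}_modified"
            param_set.starting_data = [0, 2, 4, 6]
        return param_sets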

We can also create common functions to help build up large numbers of parameter sets. As an example, we may frequently
wish to create "seeded" parameter sets, where we have the same parameters several times but with a different seed for sklearn
models or similar. Rather than defining these manually, or reimplementing the logic in every relevant ``get_params()`` function,
we could extract it as in this example:

.. code-block:: python
    :caption: params/common.py

    from copy import deepcopy
    from params import Params

    def seed_set(params: Params, seed_count: int = 5):
        seed_params = []
        for i in range(seed_count):
            # Make a copy of the passed params and apply a different seed
            new_params = deepcopy(params)
            new_params.name += f"_seed{i}"
            new_params.seed = i
            seed_params.append(new_params)
        return seed_params


.. code-block:: python
    :caption: params/seeded_models.py

    from params import Params
    from params.common import seed_set

    def get_params():
        knn_params = Params(name="test_knn", model_type="knn")
        svm_params = Params(name="test_svm", model_type="svm")

        all_params = []
        all_params.extend(seed_set(knn_params))
        all_params.extend(seed_set(svm_params, 3))

        return all_params

Calling the ``get_params()`` in the ``params/seeded_models.py`` parameter file would return:

.. code-block:: python

    [
        Params(name='test_knn_seed0', model_type='knn', seed=0),
        Params(name='test_knn_seed1', model_type='knn', seed=1),
        Params(name='test_knn_seed2', model_type='knn', seed=2),
        Params(name='test_knn_seed3', model_type='knn', seed=3),
        Params(name='test_knn_seed4', model_type='knn', seed=4),
        Params(name='test_svm_seed0', model_type='svm', seed=0),
        Params(name='test_svm_seed1', model_type='svm', seed=1),
        Params(name='test_svm_seed2', model_type='svm', seed=2),
    ]


Using parameters
----------------

Every stage automatically has access to the currently relevant ``Params`` instance, as it is part of
the passed record.

.. code-block:: python

    from curifactory import Record, stage

    from params import Params
    import src

    @stage(['training_data'], ['model'])
    def train_model(record: Record, training_data):
        params: Params = record.params # use the type hinting to get good autocomplete in IDEs

        if params.model_type == "knn":
            # pass relevant parameters into the codebase functions
            src.train_knn(params.seed)
            # ...

Parameter set hashes and operational parameters
-----------------------------------------------

Curifactory automatically versions cached artifacts based on the parameter set used. It does this
by computing a hash (the full details of which can be found on the :ref:`Hashing Mechanics` page), which
involves taking a form of string representation of the value of every attribute in a parameter
set and computing the combined md5 hash.
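
As a rough illustration of the idea only (this is not Curifactory's actual implementation, which is
described on the :ref:`Hashing Mechanics` page), a hash of this kind could be sketched as follows.
The ``sketch_params_hash`` function and the stand-in ``Params`` dataclass are hypothetical:

.. code-block:: python

    import hashlib
    from dataclasses import dataclass, fields


    @dataclass
    class Params:  # stand-in, not a real ExperimentParameters subclass
        model_size: int = 9000
        num_gpus: int = 1


    def sketch_params_hash(param_set) -> str:
        # string representation of every attribute, in field order
        representations = [
            f"{f.name}={getattr(param_set, f.name)!r}" for f in fields(param_set)
        ]
        combined = ",".join(representations)
        return hashlib.md5(combined.encode("utf-8")).hexdigest()

With this sketch, two parameter sets with equal values produce the same hash, and changing any
single value produces a different one.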

There are a few cases where we may want to modify how that hash is computed:

1. Some parameters may be "operational": they influence how an experiment runs but shouldn't change the results.
2. By default, the ``repr`` of some types of objects may not return a value that uniquely and consistently represents what we want it to.

Say we have the following dataclass:

.. code-block:: python

    from dataclasses import dataclass
    import curifactory as cf

    @dataclass
    class Params(cf.ExperimentParameters):
        model_size: int = 9000
        num_gpus: int = 1

If we create two parameter sets with different gpu counts, we get two different hashes:

.. code-block:: python

    p1 = Params(name="one_gpu", num_gpus=1)
    p2 = Params(name="two_gpus", num_gpus=2)

    p1.params_hash()
    #> '1ae3169d21cc23f1665561f7e91fe266e'
    p2.params_hash()
    #> 'f1b00c12e820963221b1f60501d3822e'

This would mean any stages we run these two parameter sets through would compute and cache two
sets of outputs. However, we may only be changing the number of gpus because we are moving between
machines, in which case we want the same cached values to be used, since we wouldn't expect the
results to change.

Curifactory looks for a special ``hash_representations`` dictionary on any ``ExperimentParameters``
class, as well as on any dataclass composed into an ``ExperimentParameters`` subclass instance. This
dictionary can optionally map the names of one or more attributes of the parameter class to a function,
which is passed the entire parameter set instance as well as the value of that specific parameter. By
setting an entry to ``None`` instead of a function, we can tell Curifactory to ignore that parameter
when computing the hash.

Since setting default dictionaries on dataclasses requires an annoying amount of syntax, Curifactory
provides a ``set_hash_functions`` function to initialize it correctly.

If we want to ignore ``num_gpus``, it might look like this:

.. code-block:: python

    @dataclass
    class Params(cf.ExperimentParameters):
        model_size: int = 9000
        num_gpus: int = 1

        hash_representations: dict = cf.set_hash_functions(num_gpus=None)

If we now run the same code as above:

.. code-block:: python

    p1 = Params(name="one_gpu", num_gpus=1)
    p2 = Params(name="two_gpus", num_gpus=2)

    p1.params_hash()
    #> 'b50ba553739feea66c8aab97787c22e0'
    p2.params_hash()
    #> 'b50ba553739feea66c8aab97787c22e0'


If we specify an actual function, that function takes both the whole parameter set
as well as the specified parameter, meaning we can condition the hash representation
for a specific parameter based on the others. (This is primarily useful if a parameter
is a complex object and the ``repr`` doesn't include some of the parameters it was
initialized with.)

As a simplistic and somewhat silly example, we can condition our ``model_size`` hash
representation on ``num_gpus``:

.. code-block:: python

    @dataclass
    class Params(cf.ExperimentParameters):
        model_size: int = 9000
        num_gpus: int = 1

        hash_representations: dict = cf.set_hash_functions(
            num_gpus=None,
            model_size=lambda self, obj: str(obj / self.num_gpus)
        )

.. code-block:: python

    p1 = Params(name="one_gpu", model_size=4500, num_gpus=1)
    p2 = Params(name="two_gpus", model_size=9000, num_gpus=2)
    p3 = Params(name="big_one_gpu", model_size=9000, num_gpus=1)

    p1.params_hash()
    #> 'a04bd13c314c694d8f1cff76cc34d2b'
    p2.params_hash()
    #> 'a04bd13c314c694d8f1cff76cc34d2b'
    p3.params_hash()
    #> 'ff1275fb121412c666259c7baefbf4e9'
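
The reason ``p1`` and ``p2`` share a hash can be checked by evaluating the ``model_size``
representation function by hand. In this sketch, ``SimpleNamespace`` objects are hypothetical
stand-ins for the parameter sets, since the function only reads ``num_gpus``:

.. code-block:: python

    from types import SimpleNamespace


    def model_size_repr(self, obj):
        # same logic as the lambda passed to set_hash_functions above
        return str(obj / self.num_gpus)


    p1 = SimpleNamespace(model_size=4500, num_gpus=1)
    p2 = SimpleNamespace(model_size=9000, num_gpus=2)
    p3 = SimpleNamespace(model_size=9000, num_gpus=1)

    repr_1 = model_size_repr(p1, p1.model_size)  # "4500.0"
    repr_2 = model_size_repr(p2, p2.model_size)  # "4500.0"
    repr_3 = model_size_repr(p3, p3.model_size)  # "9000.0"

Because ``num_gpus`` itself is ignored, the first two parameter sets contribute identical strings
to the hash, while the third does not.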

Subparameter classes
--------------------

It is possible to have parameter dataclasses composed of other dataclasses, which can
help keep parameters organized and syntactically convenient:

.. code-block:: python

    @dataclass
    class DataParams:
        path: str = ""

    @dataclass
    class Params(cf.ExperimentParameters):
        a: int = 6
        data: DataParams = DataParams()

    my_params = Params()
    my_params.data.path

Note that in python 3.11, the mutability requirements for dataclass fields were tightened,
and the above code may not work (since a dataclass is mutable by default). There are two
ways around this. The first is easier but not recommended: passing ``unsafe_hash=True``
to the ``@dataclass`` decorator of the sub dataclass:

.. code-block:: python

    @dataclass(unsafe_hash=True)
    class DataParams:
        path: str = ""

The more correct way is to use dataclasses' ``default_factory``:

.. code-block:: python

    from dataclasses import dataclass, field

    @dataclass
    class DataParams:
        path: str = ""

    @dataclass
    class Params(cf.ExperimentParameters):
        a: int = 6
        data: DataParams = field(default_factory=DataParams)
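
One practical consequence of ``default_factory`` is that every parameter set gets its own
sub dataclass instance rather than sharing one. The following self-contained sketch shows this
with plain dataclasses; ``Params`` here is a hypothetical stand-in that does not inherit from
``ExperimentParameters``, and the path is made up:

.. code-block:: python

    from dataclasses import dataclass, field


    @dataclass
    class DataParams:
        path: str = ""


    @dataclass
    class Params:  # stand-in for an ExperimentParameters subclass
        a: int = 6
        data: DataParams = field(default_factory=DataParams)


    first = Params()
    second = Params()

    # modifying one parameter set's sub dataclass leaves the other untouched
    first.data.path = "data/train.csv"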