Hashing

Utility functions for generating hashes of parameter sets.

The hash of a parameter set is crucial to curifactory, as this hash gets prefixed to filenames stored in the cache and so is used to determine whether a particular artifact has already been computed for the given parameter set.

The basic idea of the hash computation is that a representation is retrieved for every parameter in a parameter set and turned into a string, and the md5 hash of that string is computed. The integer values of the resulting per-parameter md5 hashes are added up, and the final integer is turned back into a hex string “hash”.

An important concept is the ability to modify the representation of any given parameter that is used for the md5 hash, and whether it’s included in the overall hash at all. By default, some types of objects in Python return only a memory address when repr is called (repr being the default mechanism we use for getting a string representation), which means that every time an experiment is run, even if the parameters _should_ be exactly the same, the hash will be different. By setting a dictionary of hash_representations on the parameter class, we can individually control how the representation is computed for each parameter. We can also set a parameter’s representation to None, which means it will be ignored for the purposes of the hash. This is useful for “operational parameters”, meaning configuration of an experiment that wouldn’t actually modify the artifacts (e.g. the number of GPUs to train an ML model on).
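The address problem can be illustrated with the standard library alone. This is a minimal sketch, not curifactory code: the Optimizer class and both helper functions are hypothetical names introduced only for this illustration.

```python
import hashlib


class Optimizer:
    """Hypothetical parameter value that defines no custom __repr__."""

    def __init__(self, lr):
        self.lr = lr


def md5_of_repr(value):
    # Hash the default string representation of the object.
    return hashlib.md5(repr(value).encode()).hexdigest()


def md5_of_stable(value):
    # Hash a representation built only from the values that matter.
    return hashlib.md5(f"Optimizer(lr={value.lr})".encode()).hexdigest()


first, second = Optimizer(0.01), Optimizer(0.01)

# The default repr embeds the memory address, so equal configurations differ.
unstable_differs = md5_of_repr(first) != md5_of_repr(second)

# A value-based representation is identical for equal configurations.
stable_matches = md5_of_stable(first) == md5_of_stable(second)
```

A custom entry in hash_representations plays the role of md5_of_stable here: it replaces the address-bearing default with something that depends only on the parameter's contents.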

TODO: examples? (prob put this in the non-python-file docs)

Data:

PARAMETERS_BLACKLIST

The default parameters on the ExperimentParameters class that we always ignore as part of the hash.

Functions:

add_params_combo_hash(active_record, …[, …])

Returns a hex string representing the combined parameter set hashes from the passed records list.

compute_hash(hash_representations)

Returns a combined order-independent md5 hash of the passed representations.

get_param_set_hash_values(param_set)

Collect the hash representations from every parameter in the passed parameter set.

get_parameter_hash_value(param_set, param_name)

Determines which hashing representation mechanism to use for the specified parameter, computes the result of the mechanism, and returns both.

hash_param_set(param_set[, …])

Returns a hex string representing the passed arguments, optionally recording the parameters and hash in the params registry.

param_set_string_hash_representations(param_set)

Get the hash representation of a parameter set as a json-dumpable dictionary.

set_hash_functions(*args, **kwargs)

Convenience function for easily setting the hash_representations dictionary with the appropriate dataclass field.

curifactory.hashing.PARAMETERS_BLACKLIST = ['name', 'hash', 'overwrite', 'hash_representations']

The default parameters on the ExperimentParameters class that we always ignore as part of the hash.

curifactory.hashing.add_params_combo_hash(active_record, records_list, registry_path: str, store_in_registry: bool = False)

Returns a hex string representing the combined parameter set hashes from the passed records list. This is mainly used for getting a hash for an aggregate stage, which may not have a meaningful argument set of its own.

Parameters
  • active_record (Record) – The currently in-use record (likely owned by the aggregate stage.)

  • records_list (List[Record]) – The list of records to include as part of the resulting hash.

  • registry_path (str) – The location to keep the params_registry.json.

  • store_in_registry (bool) – Whether to update the params registry with the passed records or not.

Returns

The hash string computed from the combined record arguments.

curifactory.hashing.compute_hash(hash_representations: dict) → str

Returns a combined order-independent md5 hash of the passed representations.

We do this by individually computing a hash for each item and adding the integer values up, then turning the final number into a hash string. This ensures that the order in which things are hashed won’t change the hash as long as the values themselves are the same.
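The summation approach can be sketched as follows. This is an illustration under assumptions rather than the library's implementation: the function name order_independent_hash and the way each item is turned into a string (name followed by the repr of its value) are choices made for this example.

```python
import hashlib


def order_independent_hash(representations: dict) -> str:
    """Sum the integer md5 values of each item and return the total as hex."""
    total = 0
    for name, representation in representations.items():
        # Assumption: each item is stringified as its name plus the repr of its value.
        item_string = f"{name}{representation!r}"
        digest = hashlib.md5(item_string.encode("utf-8")).hexdigest()
        total += int(digest, 16)
    # Addition is commutative, so iteration order cannot change the total.
    return f"{total:x}"


forward = order_independent_hash({"a": 1, "b": 2})
backward = order_independent_hash({"b": 2, "a": 1})
swapped = order_independent_hash({"a": 2, "b": 1})
```

Here forward and backward are equal because only the order changed, while swapped differs because the values attached to each name changed. Note that a sum of several 128-bit digests can need more than 32 hex characters; this sketch does not truncate it.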

curifactory.hashing.get_param_set_hash_values(param_set) → dict

Collect the hash representations from every parameter in the passed parameter set.

This essentially just calls get_parameter_hash_value on every parameter.

Returns

A dictionary keyed by the string parameter names, where each value is the tuple returned by get_parameter_hash_value.

curifactory.hashing.get_parameter_hash_value(param_set, param_name: str) → tuple

Determines which hashing representation mechanism to use for the specified parameter, computes the result of the mechanism, and returns both.

This function takes any overriding hash_representations into account. The mechanisms it attempts, in order, to get a hashable representation of the parameter are:

  1. Skip any blacklisted internal curifactory parameters that shouldn’t affect the hash. This includes name, hash, overwrite, and the hash_representations attribute itself.

  2. If the value of the parameter is None, skip it. This allows default-ignoring new parameters.

  3. If there’s an associated hashing function in hash_representations, call that, passing in the entire parameter set and the current value of the parameter to be hashed.

  4. If a parameter is another dataclass, recursively call get_param_set_hash_values on it.

    Note that if this is unintended functionality, and you need the default dataclass repr for any reason, you can override it with the following:

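    from dataclasses import dataclass

    import curifactory as cf

    @dataclass
    class Params(cf.ExperimentParameters):
        some_other_dataclass: OtherDataclass = None

        # Use the nested dataclass's own repr instead of recursing into it.
        # The field needs a type annotation for @dataclass to accept it.
        hash_representations: dict = cf.set_hash_functions(
            some_other_dataclass = lambda self, obj: repr(obj)
        )
        ...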
    
  5. If a parameter is a callable, by default it might turn up a pointer address (we found this occurs with torch modules), so use the __qualname__ instead.

  6. Otherwise just use the normal repr.

Parameters
  • param_set – The parameter set (dataclass instance) to get the requested parameter from.

  • param_name (str) – The name of the parameter to get the hashable representation of.

Returns

A tuple where the first element is the strategy used to compute the hashable representation, and the second element is that computed representation.
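The ordering of mechanisms listed above can be sketched in plain Python. This is a hedged illustration, not the library's code: pick_representation, the strategy label strings, and the Params and Inner classes are all hypothetical, and the fallback to the type's __qualname__ for callable instances is an assumption added so the sketch never raises.

```python
from dataclasses import dataclass, field, fields, is_dataclass

BLACKLIST = ["name", "hash", "overwrite", "hash_representations"]


def pick_representation(param_set, param_name):
    """Return a (strategy, representation) tuple for one parameter."""
    value = getattr(param_set, param_name)
    overrides = getattr(param_set, "hash_representations", {}) or {}

    if param_name in BLACKLIST:                      # mechanism 1
        return ("SKIPPED: blacklist", None)
    if value is None:                                # mechanism 2
        return ("SKIPPED: value is None", None)
    if param_name in overrides:                      # mechanism 3
        function = overrides[param_name]
        if function is None:
            return ("SKIPPED: set to None in hash_representations", None)
        return ("hash_representations function", function(param_set, value))
    if is_dataclass(value):                          # mechanism 4
        nested = {f.name: pick_representation(value, f.name) for f in fields(value)}
        return ("nested dataclass", nested)
    if callable(value):                              # mechanism 5
        # Assumption: fall back to the type's name for callable instances.
        name = getattr(value, "__qualname__", type(value).__qualname__)
        return ("__qualname__", name)
    return ("repr", repr(value))                     # mechanism 6


@dataclass
class Inner:
    depth: int = 2


@dataclass
class Params:
    name: str = "demo"
    lr: float = 0.1
    gpus: int = 4
    unused: object = None
    inner: Inner = field(default_factory=Inner)
    activation: object = abs
    hash_representations: dict = field(default_factory=lambda: {"gpus": None})


params = Params()
report = {f.name: pick_representation(params, f.name) for f in fields(params)}
```

In this sketch, report maps every parameter name to the tuple of strategy and representation, which is the same shape described for the return value above.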

curifactory.hashing.hash_param_set(param_set, store_in_registry: bool = False, registry_path: Optional[str] = None, dry: bool = False) → Union[str, dict]

Returns a hex string representing the passed arguments, optionally recording the parameters and hash in the params registry.

Note that this hash is computed once and then stored on the parameter set. If values on the parameter set are changed and hash_param_set is called again, the changes won’t be reflected in the hash.
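The consequence of storing the hash can be illustrated with a small stand-in. This is a sketch under assumptions: CachedParams and hash_once are hypothetical and only mimic the compute-once behaviour described in the note; they say nothing about how curifactory stores or resets its hash internally.

```python
import hashlib
from dataclasses import dataclass
from typing import Optional


@dataclass
class CachedParams:
    """Hypothetical parameter set that remembers its first computed hash."""

    lr: float = 0.1
    hash: Optional[str] = None


def hash_once(param_set: CachedParams) -> str:
    # Reuse the stored hash if one exists, even if values changed since.
    if param_set.hash is not None:
        return param_set.hash
    param_set.hash = hashlib.md5(repr(param_set.lr).encode()).hexdigest()
    return param_set.hash


params = CachedParams(lr=0.1)
original = hash_once(params)

params.lr = 0.5            # change a value after the hash was stored
stale = hash_once(params)  # still the hash that was computed for lr=0.1

params.hash = None         # in this sketch only, clearing forces a recompute
fresh = hash_once(params)
```

The practical takeaway is to finish setting parameter values before the parameter set is first hashed.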

Parameters
  • param_set (ExperimentParameters) – The argument set to hash.

  • registry_path (str) – The location to keep the params_registry.json. If this is None, ignore store_in_registry.

  • store_in_registry (bool) – Whether to update the params registry with the passed arguments or not.

  • dry (bool) – If True, don’t store and instead return a dictionary with each value as the tuple that contains the strategy used to compute the values to be hashed as well as the output from that hashing function code. Useful for debugging custom hashing functions.

Returns

The hash string computed from the arguments, or, if dry is True, the dictionary of strategies and computed representations (the output from get_param_set_hash_values).

curifactory.hashing.param_set_string_hash_representations(param_set) → dict

Get the hash representation of a parameter set as a json-dumpable dictionary.

This is used both in the output report as well as in the params registry.

curifactory.hashing.set_hash_functions(*args, **kwargs)

Convenience function for easily setting the hash_representations dictionary with the appropriate dataclass field. The names passed to this function should match the parameter names in the parameters class itself.

You can either call this function and pass in a dictionary with the hashing functions, or pass each hashing function as a kwarg. If you pass in both a dictionary as the first positional arg and specify kwargs, the kwarg hashing functions will be added to the dictionary.

Example

from dataclasses import dataclass
from curifactory import ExperimentParameters
from curifactory.hashing import set_hash_functions

@dataclass
class Params(ExperimentParameters):
    a: int = 0
    b: int = 0

    hash_representations: dict = set_hash_functions(
        a = lambda self, obj: str(obj),  # hash a by the string form of its value
        b = None  # this means that b will _not be included in the hash_.
    )
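The way a positional dictionary and keyword arguments combine can be sketched on its own. This is a minimal illustration, not the library's code: merge_hash_functions is a hypothetical helper that returns a plain dictionary rather than a dataclass field, and letting a keyword argument win over a dictionary entry with the same name is an assumption, since the text above does not say how collisions are resolved.

```python
def merge_hash_functions(*args, **kwargs) -> dict:
    """Combine an optional positional dictionary with keyword hashing functions."""
    functions = {}
    if args and isinstance(args[0], dict):
        # Copy so the caller's dictionary is not modified.
        functions.update(args[0])
    # Keyword arguments are added to the entries from the dictionary.
    # Assumption: on a name collision, the keyword argument wins.
    functions.update(kwargs)
    return functions


merged = merge_hash_functions(
    {"a": lambda self, obj: str(obj)},
    b=None,  # None marks b as excluded from the hash
)
a_representation = merged["a"](None, 3)
```

After this call, merged holds an entry for both a and b, matching the description that keyword hashing functions are added to the passed dictionary.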