Wrapper

The wrapper class around a transformer and its functionality.

class tx2.wrapper.Wrapper(train_texts: Union[numpy.ndarray, pandas.core.series.Series], train_labels: Union[numpy.ndarray, pandas.core.series.Series], test_texts: Union[numpy.ndarray, pandas.core.series.Series], test_labels: Union[numpy.ndarray, pandas.core.series.Series], encodings: Dict[str, int], classifier=None, language_model=None, tokenizer=None, cuda_device=None, cache_path='data', overwrite=False)

A wrapper or interface class between a transformer and the dashboard.

This class handles running all of the calculations for the data needed by the front-end visualizations.

Methods:

__init__(train_texts, train_labels, …[, …])

Constructor.

classify(texts)

Predict the category of each passed entry text.

embed(texts)

Get a sequence embedding from the language model for each passed text entry.

encode(text)

Encode/tokenize passed text into a format expected by transformer.

prepare([umap_args, clustering_alg, …])

Run all necessary precompute steps to support the dashboard.

project(texts)

Use the wrapper’s UMAP model to project passed texts into two dimensions.

recompute_projections([umap_args, …])

Re-run both projection training and clustering algorithms.

recompute_visual_clusterings([…])

Re-run the clustering algorithm.

search_test_df(search)

Get a list of test dataframe indices that have any of the listed terms in the passed string.

soft_classify(texts)

Get the non-argmaxed final prediction layer outputs of the classification head.

Attributes:

batch_size

The batch size to use in backend dataloader creation.

cache_path

The directory path to cache pre-calculated values.

classification_function

A function to take a single set of inputs and return the index of the predicted class.

classifier

A class containing the entire network, which can be called as a function taking the encoded input and returning the output classification.

cluster_class_word_sets

A dictionary keyed by cluster. This is a further-divided version of cluster_word_freqs that splits each word count into the number of entries of each category containing that word, as calculated by tx2.calc.frequent_words_by_class_in_cluster().

cluster_profiles

A dictionary of aggregate sorted salience maps for each cluster as calculated by tx2.calc.aggregate_cluster_salience_maps().

cluster_word_freqs

A dictionary of clusters and sorted top word frequencies for each, as calculated by tx2.calc.frequent_words_in_cluster().

clusters

A dictionary of cluster names, each associated with a list of indices of points in that cluster, as calculated by tx2.calc.cluster_projections().

cuda_device

The device for pytorch to place tensors on; pass either “cpu” or “cuda”.

embedding_function

A function to take a single set of inputs and return embedded versions - a sequence representation from the language model.

embeddings_testing

Precomputed embeddings for each entry in test_texts, as returned by tx2.wrapper.Wrapper.embed().

embeddings_training

Precomputed embeddings for each entry in train_texts, as returned by tx2.wrapper.Wrapper.embed().

encode_function

A function to take a single text entry and return an encoded version of it.

encoder_options

The default options to pass to the tokenizer’s encode_plus() function.

encodings

A dictionary associating class label names with integer values.

language_model

A variable containing only the huggingface language model portion of the network.

max_clusters

Maximum number of clusters to retain.

max_len

The maximum length of each text entry, based on the expected input size of the transformer.

overwrite

Whether to ignore cached calculations and overwrite them or not.

predictions

The predicted class for each entry in test_texts, as returned by tx2.wrapper.Wrapper.classify().

projections_testing

The two dimensional projections of embeddings_testing, for each entry in test_texts.

projections_training

The two dimensional projections of embeddings_training, for each entry in train_texts.

projector

The trained UMAP projector.

salience_maps

The salience map for each entry in test_texts as calculated by tx2.calc.salience_map().

soft_classification_function

A function to take a single set of inputs and return the (non arg-maxed) output layer of the network.

test_labels

Collection of all class labels used during the model’s testing process.

test_texts

Collection of all text entries used during the model’s testing process.

tokenizer

The huggingface tokenizer to use for encoding text input.

train_labels

Collection of all class labels used during the model’s training process.

train_texts

Collection of all text entries used during the model’s training process.

__init__(train_texts: Union[numpy.ndarray, pandas.core.series.Series], train_labels: Union[numpy.ndarray, pandas.core.series.Series], test_texts: Union[numpy.ndarray, pandas.core.series.Series], test_labels: Union[numpy.ndarray, pandas.core.series.Series], encodings: Dict[str, int], classifier=None, language_model=None, tokenizer=None, cuda_device=None, cache_path='data', overwrite=False)

Constructor.

Parameters
  • train_texts – A set of text entries that were used during the model’s training process.

  • train_labels – The set of class labels for train_texts.

  • test_texts – The set of text entries that the model hadn’t seen during training.

  • test_labels – The set of class labels for test_texts.

  • encodings – A dictionary associating class label names with integer values.

  • classifier – A class/network containing a language model and classification head. Calling this variable as a function should, by default, send the passed inputs through the entire network and return the argmaxed classification index (which can be mapped back to a label name through encodings). This argument is not required if the user intends to manually specify classification functions.

  • language_model – A huggingface transformer model. If a custom network class is being used and has a layer representing the output of just the language model, pass it here. This argument is not required if the user intends to manually specify classification functions.

  • tokenizer – A huggingface tokenizer. This argument is not required if the user intends to manually specify encode and classification functions.

  • cuda_device – The device for pytorch to place tensors on; pass either “cpu” or “cuda”. This variable is used by the default embedding function. If unspecified and a GPU is found, “cuda” will be used; otherwise it defaults to “cpu”.

  • cache_path – The directory path to cache intermediate outputs from the tx2.wrapper.Wrapper.prepare() function. This allows the wrapper to precompute needed values for the dashboard to reduce render time and allow rerunning all wrapper code without needing to recompute. Note that every wrapper/dashboard instance is expected to have a unique cache path, otherwise filenames will conflict. You will need to set this if you intend to use more than one dashboard.

  • overwrite – Whether to ignore the cache and overwrite previous results or not.
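
Because every wrapper/dashboard instance needs its own cache_path, one simple convention is to derive the directory from a short name per wrapper. The following is a minimal sketch; cache_path_for and the names "sentiment" and "topics" are hypothetical and not part of tx2.

```python
import os


def cache_path_for(name, root="data"):
    """Return a per-wrapper cache directory such as data/<name>.

    Hypothetical helper, not part of tx2: it only builds a path string.
    """
    return os.path.join(root, name)


# Two wrappers get two distinct cache directories, so their cached
# filenames cannot collide.
sentiment_cache = cache_path_for("sentiment")
topics_cache = cache_path_for("topics")
assert sentiment_cache != topics_cache

# Each path would then be passed to its own wrapper, for example:
# wrapper = Wrapper(..., cache_path=sentiment_cache)
```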

batch_size

The batch size to use in backend dataloader creation.

cache_path

The directory path to cache pre-calculated values.

classification_function

A function to take a single set of inputs and return the index of the predicted class.

classifier

A class containing the entire network, which can be called as a function taking the encoded input and returning the output classification.

classify(texts: List[str]) → List[int]

Predict the category of each passed entry text.

Parameters

texts – An array of texts to predict on.

Returns

An array of predicted classes, whose labels can be reverse looked up through encodings.
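
The reverse lookup mentioned above can be done by inverting the encodings dictionary. This is a minimal sketch; the predicted list stands in for the output of classify() and is made up for illustration.

```python
# Same shape as the encodings attribute: label name -> integer value.
encodings = {"label1": 0, "label2": 1}

# Invert the mapping so that integer values map back to label names.
label_names = {index: name for name, index in encodings.items()}

# Stand-in for wrapper.classify(texts); these values are illustrative.
predicted = [1, 0, 1]

predicted_labels = [label_names[index] for index in predicted]
print(predicted_labels)  # ['label2', 'label1', 'label2']
```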

cluster_class_word_sets

A dictionary keyed by cluster. This is a further-divided version of cluster_word_freqs that splits each word count into the number of entries of each category containing that word, as calculated by tx2.calc.frequent_words_by_class_in_cluster().

cluster_profiles

A dictionary of aggregate sorted salience maps for each cluster as calculated by tx2.calc.aggregate_cluster_salience_maps().

cluster_word_freqs

A dictionary of clusters and sorted top word frequencies for each, as calculated by tx2.calc.frequent_words_in_cluster().

clusters

A dictionary of cluster names, each associated with a list of indices of points in that cluster, as calculated by tx2.calc.cluster_projections().
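
As an illustration of this structure, the sketch below groups text entries by cluster. It assumes the indices are positions into a list of texts (most likely test_texts, although that is an assumption here); the cluster names and texts are made up, and texts_by_cluster is a hypothetical helper.

```python
def texts_by_cluster(clusters, texts):
    """Map each cluster name to the text entries at its listed indices."""
    return {
        name: [texts[index] for index in indices]
        for name, indices in clusters.items()
    }


# Illustrative data in the same shape as the clusters attribute.
clusters = {"0": [0, 2], "1": [1, 3]}
texts = ["first entry", "second entry", "third entry", "fourth entry"]

grouped = texts_by_cluster(clusters, texts)
print(grouped["0"])  # ['first entry', 'third entry']
```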

cuda_device

The device for pytorch to place tensors on; pass either “cpu” or “cuda”. This variable is used by the default embedding function. If unspecified and a GPU is found, “cuda” will be used; otherwise it defaults to “cpu”.

embed(texts: List[str]) → List[List[float]]

Get a sequence embedding from the language model for each passed text entry.

Parameters

texts – An array of texts to embed.

Returns

An array of sequence embeddings.

embedding_function

A function to take a single set of inputs and return embedded versions - a sequence representation from the language model. This variable points to a sensible default function based on a language model layer being specified in the constructor. If classifier or language model were not specified to the constructor, this variable must be assigned to a custom function definition.

Example

Below is a simplified example of creating a customized embed function. my_custom_embedding_function will be used by the wrapper, and will be called with an array of pre-encoded inputs for a single entry, and is expected to return an array. (TODO: 1d or 2d?)

import numpy as np

def my_custom_embedding_function(inputs):
    # inputs is the encoded dictionary for a single entry (see encode())
    return np.mean(my_transformer(inputs['input_ids'], inputs['attention_mask'])[0])

wrapper = Wrapper(...)
wrapper.embedding_function = my_custom_embedding_function

embeddings_testing

Precomputed embeddings for each entry in test_texts, as returned by tx2.wrapper.Wrapper.embed().

embeddings_training

Precomputed embeddings for each entry in train_texts, as returned by tx2.wrapper.Wrapper.embed().

encode(text: str)

Encode/tokenize passed text into a format expected by transformer.

Parameters

text – The text entry to tokenize.

Returns

A tokenized version of the text. By default, this calls encode_plus() with the options specified in encoder_options and returns a dictionary:

{
     "input_ids": [],
     "attention_mask": []
}

encode_function

A function to take a single text entry and return an encoded version of it. The default function will utilize the tokenizer given in the constructor if available.
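
When no tokenizer is given, a custom function can be assigned instead. The sketch below is a toy whitespace encoder that returns the same dictionary keys as encode(); the vocabulary, the padding and unknown ids, and make_whitespace_encode_function are all hypothetical. A real replacement would likely need to return whatever the classifier expects (for example tensors rather than lists).

```python
def make_whitespace_encode_function(vocab, max_len, pad_id=0, unk_id=1):
    """Build a toy encode function (hypothetical, for illustration only)."""

    def encode(text):
        # Lowercase, split on whitespace, and truncate to max_len tokens.
        tokens = text.lower().split()[:max_len]
        ids = [vocab.get(token, unk_id) for token in tokens]
        padding = max_len - len(ids)
        return {
            # Pad the ids up to max_len.
            "input_ids": ids + [pad_id] * padding,
            # 1 marks real tokens, 0 marks padding.
            "attention_mask": [1] * len(ids) + [0] * padding,
        }

    return encode


encode = make_whitespace_encode_function({"hello": 2, "world": 3}, max_len=5)
print(encode("Hello there world"))

# The function would then be assigned to the wrapper:
# wrapper.encode_function = encode
```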

encoder_options

The default options to pass to the tokenizer’s encode_plus() function. See huggingface documentation.

encodings

A dictionary associating class label names with integer values.

Example:

{
    "label1": 0,
    "label2": 1,
}

language_model

A variable containing only the huggingface language model portion of the network.

max_clusters

Maximum number of clusters to retain. Note that this cannot exceed the number of colors in the dashboard.

max_len

The maximum length of each text entry, based on the expected input size of the transformer.

overwrite

Whether to ignore cached calculations and overwrite them or not.

predictions

The predicted class for each entry in test_texts, as returned by tx2.wrapper.Wrapper.classify().

prepare(umap_args={}, clustering_alg='DBSCAN', clustering_args={})

Run all necessary precompute steps to support the dashboard. This function must be called before the wrapper is used in a dashboard instance.

Parameters
  • umap_args – Dictionary of arguments to pass into the UMAP model on instantiation.

  • clustering_alg – The name of the clustering algorithm to use, a class name from sklearn.cluster, see sklearn’s documentation. (“DBSCAN”, “KMeans”, “AffinityPropagation”, “Birch”, “OPTICS”, “AgglomerativeClustering”, “SpectralClustering”, “SpectralBiclustering”, “SpectralCoclustering”, “MiniBatchKMeans”, “FeatureAgglomeration”, “MeanShift”)

  • clustering_args – Dictionary of arguments to pass into clustering algorithm on instantiation.
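
A call using the parameters above might look like the following sketch. It takes an already constructed wrapper as an argument; prepare_with_kmeans is a hypothetical helper, and the specific values for n_neighbors, min_dist, and n_clusters are illustrative choices rather than recommended settings.

```python
def prepare_with_kmeans(wrapper, n_clusters=8, n_neighbors=15):
    """Run prepare() with KMeans clustering instead of the default DBSCAN.

    Hypothetical helper: `wrapper` is assumed to be a constructed Wrapper.
    """
    wrapper.prepare(
        # Passed to the UMAP model on instantiation.
        umap_args={"n_neighbors": n_neighbors, "min_dist": 0.1},
        # A class name from sklearn.cluster.
        clustering_alg="KMeans",
        # Passed to the clustering algorithm on instantiation.
        clustering_args={"n_clusters": n_clusters},
    )
```

Keeping n_clusters at or below max_clusters avoids asking for more clusters than the wrapper retains.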

project(texts: List[str]) → numpy.ndarray

Use the wrapper’s UMAP model to project passed texts into two dimensions.

Parameters

texts – An array of texts to project.

Returns

An Nx2 numpy array containing a size-2 array of coordinates for each of the N input text entries.

projections_testing

The two dimensional projections of embeddings_testing, for each entry in test_texts.

projections_training

The two dimensional projections of embeddings_training, for each entry in train_texts.

projector

The trained UMAP projector. See umap-learn documentation.

recompute_projections(umap_args={}, clustering_alg='DBSCAN', clustering_args={})

Re-run both projection training and clustering algorithms. Note that this automatically overwrites both previously saved projections and clustering data.

Parameters
  • umap_args – Dictionary of arguments to pass into the UMAP model on instantiation.

  • clustering_alg – The name of the clustering algorithm to use, a class name from sklearn.cluster, see sklearn’s documentation. (“DBSCAN”, “KMeans”, “AffinityPropagation”, “Birch”, “OPTICS”, “AgglomerativeClustering”, “SpectralClustering”, “SpectralBiclustering”, “SpectralCoclustering”, “MiniBatchKMeans”, “FeatureAgglomeration”, “MeanShift”)

  • clustering_args – Dictionary of arguments to pass into clustering algorithm on instantiation.

recompute_visual_clusterings(clustering_alg='DBSCAN', clustering_args={})

Re-run the clustering algorithm. Note that this automatically overwrites any previously cached data for clusters.

Parameters
  • clustering_alg – The name of the clustering algorithm to use, a class name from sklearn.cluster, see sklearn’s documentation. (“DBSCAN”, “KMeans”, “AffinityPropagation”, “Birch”, “OPTICS”, “AgglomerativeClustering”, “SpectralClustering”, “SpectralBiclustering”, “SpectralCoclustering”, “MiniBatchKMeans”, “FeatureAgglomeration”, “MeanShift”)

  • clustering_args – Dictionary of arguments to pass into the clustering algorithm on instantiation.
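
Since this method re-runs only the clustering, it can be used to try several clustering settings without retraining the projection. The sketch below sweeps the DBSCAN eps parameter and records how many clusters result. sweep_dbscan_eps is a hypothetical helper, and it assumes the clusters attribute is refreshed after each call.

```python
def sweep_dbscan_eps(wrapper, eps_values, min_samples=5):
    """Return {eps: number of clusters} for each eps value tried.

    Hypothetical helper: `wrapper` is assumed to be a prepared Wrapper whose
    `clusters` attribute is updated by recompute_visual_clusterings().
    """
    counts = {}
    for eps in eps_values:
        wrapper.recompute_visual_clusterings(
            clustering_alg="DBSCAN",
            clustering_args={"eps": eps, "min_samples": min_samples},
        )
        counts[eps] = len(wrapper.clusters)
    return counts
```

Each call overwrites the cached cluster data, so the wrapper is left with the clustering from the last eps value in the list.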

salience_maps

The salience map for each entry in test_texts as calculated by tx2.calc.salience_map().

search_test_df(search: str) → List[int]

Get a list of test dataframe indices that have any of the listed terms in the passed string.

Parameters

search – The search string, can contain multiple terms delimited with ‘&’ to search for entries that have all of the terms.

Returns

A list of the indices for the test_df.
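
To illustrate the ‘&’ syntax, the sketch below implements the all-terms matching described for the search parameter over a plain list of strings. This is not the tx2 implementation: search_all_terms is hypothetical, case-insensitive substring matching is an assumption, and it returns list positions rather than dataframe indices.

```python
def search_all_terms(texts, search):
    """Return positions of texts containing every '&'-delimited term.

    Hypothetical stand-in for illustration; matching is assumed to be
    case-insensitive substring matching.
    """
    terms = [term.strip().lower() for term in search.split("&") if term.strip()]
    return [
        position
        for position, text in enumerate(texts)
        if all(term in text.lower() for term in terms)
    ]


texts = ["the cat sat", "a dog barked", "cat and dog"]
print(search_all_terms(texts, "cat"))        # [0, 2]
print(search_all_terms(texts, "cat & dog"))  # [2]
```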

soft_classification_function

A function to take a single set of inputs and return the (non arg-maxed) output layer of the network.

soft_classify(texts: List[str]) → List[List[float]]

Get the non-argmaxed final prediction layer outputs of the classification head.

Parameters

texts – An array of texts to predict on.

Returns

An Nxd array of arrays, N the number of entries to predict on and d the number of categories.
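
Taking the argmax of each row of this output gives a predicted class index per entry, which is presumably what classify() returns. A minimal sketch with made-up values (argmax_rows is a hypothetical helper):

```python
def argmax_rows(soft_outputs):
    """Return the index of the largest value in each row."""
    return [max(range(len(row)), key=row.__getitem__) for row in soft_outputs]


# Illustrative stand-in for wrapper.soft_classify(texts): N=2 entries, d=3 categories.
soft_outputs = [[0.1, 2.3, -0.4], [1.5, 0.2, 0.9]]

print(argmax_rows(soft_outputs))  # [1, 0]
```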

test_labels

Collection of all class labels used during the model’s testing process.

test_texts

Collection of all text entries used during the model’s testing process.

tokenizer

The huggingface tokenizer to use for encoding text input.

train_labels

Collection of all class labels used during the model’s training process.

train_texts

Collection of all text entries used during the model’s training process.