Wrapper¶

The wrapper class around a transformer and its functionality.

class tx2.wrapper.Wrapper(train_texts: Union[numpy.ndarray, pandas.core.series.Series], train_labels: Union[numpy.ndarray, pandas.core.series.Series], test_texts: Union[numpy.ndarray, pandas.core.series.Series], test_labels: Union[numpy.ndarray, pandas.core.series.Series], encodings: Dict[str, int], classifier=None, language_model=None, tokenizer=None, cuda_device=None, cache_path='data', overwrite=False)¶

A wrapper or interface class between a transformer and the dashboard.

This class handles running all of the calculations for the data needed by the front-end visualizations.

Methods:

`__init__`(train_texts, train_labels, …[, …])	Constructor.
`classify`(texts)	Predict the category of each passed entry text.
`embed`(texts)	Get a sequence embedding from the language model for each passed text entry.
`encode`(text)	Encode/tokenize passed text into a format expected by transformer.
`prepare`([umap_args, clustering_alg, …])	Run all necessary precompute steps to support the dashboard.
`project`(texts)	Use the wrapper’s UMAP model to project passed texts into two dimensions.
`recompute_projections`([umap_args, …])	Re-run both projection training and clustering algorithms.
`recompute_visual_clusterings`([…])	Re-run the clustering algorithm.
`search_test_df`(search)	Get a list of test dataframe indices that have any of the listed terms in the passed string.
`soft_classify`(texts)	Get the non-argmaxed final prediction layer outputs of the classification head.

Attributes:

`batch_size`	The batch size to use in backend dataloader creation.
`cache_path`	The directory path to cache pre-calculated values.
`classification_function`	A function to take a single set of inputs and return the index of the predicted class.
`classifier`	A class containing the entire network, which can be called as a function taking the encoded input and returning the output classification.
`cluster_class_word_sets`	A dictionary of clusters: a further divided version of `cluster_word_freqs` that splits each word count into the number of entries of each category containing that word, as calculated by `tx2.calc.frequent_words_by_class_in_cluster()`.
`cluster_profiles`	A dictionary of aggregate sorted salience maps for each cluster as calculated by `tx2.calc.aggregate_cluster_salience_maps()`.
`cluster_word_freqs`	A dictionary of clusters and sorted top word frequencies for each, as calculated by `tx2.calc.frequent_words_in_cluster()`.
`clusters`	A dictionary of cluster names, each associated with a list of indices of points in that cluster, as calculated by `tx2.calc.cluster_projections()`.
`cuda_device`	The device for PyTorch to place tensors on; pass either “cpu” or “cuda”.
`embedding_function`	A function to take a single set of inputs and return embedded versions - a sequence representation from the language model.
`embeddings_testing`	Precomputed embeddings for each entry in `test_texts`, as returned by `tx2.wrapper.Wrapper.embed()`.
`embeddings_training`	Precomputed embeddings for each entry in `train_texts`, as returned by `tx2.wrapper.Wrapper.embed()`.
`encode_function`	A function to take a single text entry and return an encoded version of it.
`encoder_options`	The default options to pass to the tokenizer’s `encode_plus()` function.
`encodings`	A dictionary associating class label names with integer values.
`language_model`	A variable containing only the huggingface language model portion of the network.
`max_clusters`	Maximum number of clusters to retain.
`max_len`	The maximum length of each text entry, based on the expected input size of the transformer.
`overwrite`	Whether to ignore cached calculations and overwrite them or not.
`predictions`	The predicted class for each entry in `test_texts`, as returned by `tx2.wrapper.Wrapper.classify()`.
`projections_testing`	The two dimensional projections of `embeddings_testing`, for each entry in `test_texts`.
`projections_training`	The two dimensional projections of `embeddings_training`, for each entry in `train_texts`.
`projector`	The trained UMAP projector.
`salience_maps`	The salience map for each entry in `test_texts` as calculated by `tx2.calc.salience_map()`.
`soft_classification_function`	A function to take a single set of inputs and return the (non arg-maxed) output layer of the network.
`test_labels`	Collection of all class labels used during the model’s testing process.
`test_texts`	Collection of all text entries used during the model’s testing process.
`tokenizer`	The huggingface tokenizer to use for encoding text input.
`train_labels`	Collection of all class labels used during the model’s training process.
`train_texts`	Collection of all text entries used during the model’s training process.

__init__(train_texts: Union[numpy.ndarray, pandas.core.series.Series], train_labels: Union[numpy.ndarray, pandas.core.series.Series], test_texts: Union[numpy.ndarray, pandas.core.series.Series], test_labels: Union[numpy.ndarray, pandas.core.series.Series], encodings: Dict[str, int], classifier=None, language_model=None, tokenizer=None, cuda_device=None, cache_path='data', overwrite=False)¶

Constructor.

Parameters

train_texts – A set of text entries that were used during the model’s training process.
train_labels – The set of class labels for train_texts.
test_texts – The set of text entries that the model hadn’t seen during training.
test_labels – The set of class labels for test_texts.
encodings – A dictionary associating class label names with integer values.
classifier – A class/network containing a language model and classification head. Calling this variable as a function should, by default, send the passed inputs through the entire network and return the argmaxed classification index (reverse encoding). This argument is not required if the user intends to manually specify classification functions.
language_model – A huggingface transformer model. If a custom network class is being used and has a layer representing the output of just the language model, pass it here. This argument is not required if the user intends to manually specify classification functions.
tokenizer – A huggingface tokenizer. This argument is not required if the user intends to manually specify encode and classification functions.
cuda_device – The device for PyTorch to place tensors on; pass either “cpu” or “cuda”. This variable is used by the default embedding function. If unspecified and a GPU is found, “cuda” will be used; otherwise it defaults to “cpu”.
cache_path – The directory path to cache intermediate outputs from the tx2.wrapper.Wrapper.prepare() function. This allows the wrapper to precompute needed values for the dashboard to reduce render time and allows rerunning all wrapper code without needing to recompute. Note that every wrapper/dashboard instance is expected to have a unique cache path; otherwise filenames will conflict. You will need to set this if you intend to use more than one dashboard.
overwrite – Whether to ignore the cache and overwrite previous results or not.

batch_size¶: The batch size to use in backend dataloader creation.

cache_path¶: The directory path to cache pre-calculated values.

classification_function¶: A function to take a single set of inputs and return the index of the predicted class.

classifier¶: A class containing the entire network, which can be called as a function taking the encoded input and returning the output classification.

classify(texts: List[str]) → List[int]¶

Predict the category of each passed entry text.

Parameters: texts – An array of texts to predict on.
Returns: An array of predicted classes, whose labels can be reverse looked up through encodings.
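The reverse lookup mentioned above can be sketched as follows. This is a minimal illustration rather than part of the library: `decode_predictions` is a hypothetical helper, and the `encodings` and `predicted` values are made-up stand-ins for `wrapper.encodings` and the output of `wrapper.classify(...)`.

```python
from typing import Dict, List


def decode_predictions(predicted: List[int], encodings: Dict[str, int]) -> List[str]:
    """Map integer class indices back to their label names (hypothetical helper)."""
    # Invert the name -> index mapping into index -> name.
    index_to_label = {index: label for label, index in encodings.items()}
    return [index_to_label[index] for index in predicted]


# Made-up values standing in for wrapper.encodings and wrapper.classify(texts).
encodings = {"negative": 0, "positive": 1}
predicted = [1, 0, 1]

labels = decode_predictions(predicted, encodings)
print(labels)
```

The inversion assumes every integer value in `encodings` is unique, which a label-to-index mapping normally guarantees.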

cluster_class_word_sets¶: A dictionary of clusters: a further divided version of cluster_word_freqs that splits each word count into the number of entries of each category containing that word, as calculated by tx2.calc.frequent_words_by_class_in_cluster().

cluster_profiles¶: A dictionary of aggregate sorted salience maps for each cluster as calculated by tx2.calc.aggregate_cluster_salience_maps().

cluster_word_freqs¶: A dictionary of clusters and sorted top word frequencies for each, as calculated by tx2.calc.frequent_words_in_cluster().

clusters¶: A dictionary of cluster names, each associated with a list of indices of points in that cluster, as calculated by tx2.calc.cluster_projections().
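As a rough sketch of how such a dictionary might be consumed, the snippet below groups text entries by cluster. It assumes (this is not confirmed by the description above) that the indices refer to positions in `test_texts` and that cluster names are strings; `texts_by_cluster` and the sample data are hypothetical.

```python
from typing import Dict, List


def texts_by_cluster(clusters: Dict[str, List[int]], texts: List[str]) -> Dict[str, List[str]]:
    """Replace each cluster's index list with the corresponding text entries."""
    return {name: [texts[i] for i in indices] for name, indices in clusters.items()}


# Made-up stand-ins for wrapper.clusters and wrapper.test_texts.
clusters = {"0": [0, 2], "1": [1]}
test_texts = ["good film", "terrible plot", "great cast"]

grouped = texts_by_cluster(clusters, test_texts)
print(grouped)
```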

cuda_device¶: The device for PyTorch to place tensors on; pass either “cpu” or “cuda”. This variable is used by the default embedding function. If unspecified and a GPU is found, “cuda” will be used; otherwise it defaults to “cpu”.

embed(texts: List[str]) → List[List[float]]¶

Get a sequence embedding from the language model for each passed text entry.

Parameters: texts – An array of texts to embed.
Returns: An array of sequence embeddings.

embedding_function¶

A function to take a single set of inputs and return embedded versions: a sequence representation from the language model. This variable points to a sensible default function based on the language model layer specified in the constructor. If classifier or language_model was not passed to the constructor, this variable must be assigned a custom function definition.

Example

Below is a simplified example of creating a customized embed function. my_custom_embedding_function will be used by the wrapper; it is called with the pre-encoded inputs for a single entry and is expected to return an array. (TODO: 1d or 2d?)

import numpy as np

# my_transformer is a placeholder for the user's own language model.
def my_custom_embedding_function(inputs):
    return np.mean(my_transformer(inputs['input_ids'], inputs['attention_mask'])[0])

wrapper = Wrapper(...)
wrapper.embedding_function = my_custom_embedding_function

embeddings_testing¶: Precomputed embeddings for each entry in test_texts, as returned by tx2.wrapper.Wrapper.embed().

embeddings_training¶: Precomputed embeddings for each entry in train_texts, as returned by tx2.wrapper.Wrapper.embed().

encode(text: str)¶

Encode/tokenize passed text into a format expected by transformer.

Parameters: text – The text entry to tokenize.
Returns: A tokenized version of the text. By default this calls encode_plus() with the options specified in encoder_options and returns a dictionary:

{
     "input_ids": [],
     "attention_mask": []
}

encode_function¶: A function to take a single text entry and return an encoded version of it. The default function will utilize the tokenizer given in the constructor if available.
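When no tokenizer is given, a custom function has to produce the same kind of dictionary that encode() returns. The sketch below is a toy stand-in, not the library's default: it uses whitespace splitting and a growing vocabulary instead of a real huggingface tokenizer, and `MAX_LEN`, `PAD_ID`, `VOCAB`, and `my_custom_encode_function` are all hypothetical names. A real model would likely expect tensors rather than plain lists.

```python
from typing import Dict, List

MAX_LEN = 8   # hypothetical fixed input length
PAD_ID = 0    # hypothetical id reserved for padding
VOCAB: Dict[str, int] = {}  # toy vocabulary, filled in as tokens are seen


def my_custom_encode_function(text: str) -> Dict[str, List[int]]:
    """Toy encoder: whitespace tokens -> ids, truncated and padded to MAX_LEN."""
    tokens = text.lower().split()[:MAX_LEN]
    # New tokens get the next unused id (starting at 1); known tokens reuse theirs.
    ids = [VOCAB.setdefault(token, len(VOCAB) + 1) for token in tokens]
    padding = MAX_LEN - len(ids)
    return {
        "input_ids": ids + [PAD_ID] * padding,
        # 1 marks a real token, 0 marks padding.
        "attention_mask": [1] * len(ids) + [0] * padding,
    }


encoded = my_custom_encode_function("the cat sat on the mat")
print(encoded)

# With a real wrapper instance, the function would be attached like this:
# wrapper.encode_function = my_custom_encode_function
```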

encoder_options¶: The default options to pass to the tokenizer’s encode_plus() function. See huggingface documentation.

encodings¶

A dictionary associating class label names with integer values.

example:

{
    "label1": 0,
    "label2": 1,
}
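One way to build such a dictionary from raw string labels is sketched below. `build_encodings` is a hypothetical helper and the label names are made up; sorting the unique labels is just one choice that keeps the mapping deterministic between runs.

```python
from typing import Dict, List


def build_encodings(labels: List[str]) -> Dict[str, int]:
    """Assign each distinct label name an integer, in sorted order."""
    return {label: index for index, label in enumerate(sorted(set(labels)))}


# Made-up label column.
raw_labels = ["sports", "politics", "sports", "tech"]

encodings = build_encodings(raw_labels)
encoded_labels = [encodings[label] for label in raw_labels]
print(encodings)
print(encoded_labels)
```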

language_model¶: A variable containing only the huggingface language model portion of the network.

max_clusters¶: Maximum number of clusters to retain. Note that this cannot exceed the number of colors in the dashboard.

max_len¶: The maximum length of each text entry, based on the expected input size of the transformer.

overwrite¶: Whether to ignore cached calculations and overwrite them or not.

predictions¶: The predicted class for each entry in test_texts, as returned by tx2.wrapper.Wrapper.classify().

prepare(umap_args={}, clustering_alg='DBSCAN', clustering_args={})¶

Run all necessary precompute steps to support the dashboard. This function must be called before the wrapper is used in a dashboard instance.

Parameters

umap_args – Dictionary of arguments to pass into the UMAP model on instantiation.
clustering_alg – The name of the clustering algorithm to use, given as a class name from sklearn.cluster (see sklearn’s documentation). (“DBSCAN”, “KMeans”, “AffinityPropagation”, “Birch”, “OPTICS”, “AgglomerativeClustering”, “SpectralClustering”, “SpectralBiclustering”, “SpectralCoclustering”, “MiniBatchKMeans”, “FeatureAgglomeration”, “MeanShift”)
clustering_args – Dictionary of arguments to pass into clustering algorithm on instantiation.
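The three parameters above might be combined as in the following sketch. The values are illustrative only: `n_neighbors` and `min_dist` are UMAP constructor parameters and `n_clusters` is a KMeans parameter, while `prepare_with_kmeans` is a hypothetical helper that expects an already constructed wrapper.

```python
from typing import Any, Dict

# Passed to the UMAP model on instantiation (illustrative values).
umap_args: Dict[str, Any] = {"n_neighbors": 15, "min_dist": 0.1}

# Passed to the chosen sklearn.cluster class on instantiation (illustrative value).
clustering_args: Dict[str, Any] = {"n_clusters": 8}


def prepare_with_kmeans(wrapper: Any) -> None:
    """Run the precompute steps using KMeans instead of the default DBSCAN."""
    wrapper.prepare(
        umap_args=umap_args,
        clustering_alg="KMeans",
        clustering_args=clustering_args,
    )
```

The keys of `clustering_args` have to match the chosen algorithm; for example, `n_clusters` is not a DBSCAN parameter.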

project(texts: List[str]) → numpy.ndarray¶

Use the wrapper’s UMAP model to project passed texts into two dimensions.

Parameters: texts – An array of texts to embed.
Returns: An Nx2 numpy array, containing a size 2 array of coordinates for each of the N input text entries.

projections_testing¶: The two dimensional projections of embeddings_testing, for each entry in test_texts.

projections_training¶: The two dimensional projections of embeddings_training, for each entry in train_texts.

projector¶: The trained UMAP projector. See umap-learn documentation.

recompute_projections(umap_args={}, clustering_alg='DBSCAN', clustering_args={})¶

Re-run both projection training and clustering algorithms. Note that this automatically overwrites both the previously saved projections and the clustering data.

Parameters

umap_args – Dictionary of arguments to pass into the UMAP model on instantiation.
clustering_alg – The name of the clustering algorithm to use, given as a class name from sklearn.cluster (see sklearn’s documentation). (“DBSCAN”, “KMeans”, “AffinityPropagation”, “Birch”, “OPTICS”, “AgglomerativeClustering”, “SpectralClustering”, “SpectralBiclustering”, “SpectralCoclustering”, “MiniBatchKMeans”, “FeatureAgglomeration”, “MeanShift”)
clustering_args – Dictionary of arguments to pass into clustering algorithm on instantiation.

recompute_visual_clusterings(clustering_alg='DBSCAN', clustering_args={})¶

Re-run the clustering algorithm. Note that this automatically overwrites any previously cached data for clusters.

Parameters

clustering_alg – The name of the clustering algorithm to use, given as a class name from sklearn.cluster (see sklearn’s documentation). (“DBSCAN”, “KMeans”, “AffinityPropagation”, “Birch”, “OPTICS”, “AgglomerativeClustering”, “SpectralClustering”, “SpectralBiclustering”, “SpectralCoclustering”, “MiniBatchKMeans”, “FeatureAgglomeration”, “MeanShift”)
clustering_args – Dictionary of arguments to pass into the clustering algorithm on instantiation.

salience_maps¶: The salience map for each entry in test_texts as calculated by tx2.calc.salience_map().

search_test_df(search: str) → List[int]¶

Get a list of test dataframe indices that have any of the listed terms in the passed string.

Parameters: search – The search string. It can contain multiple terms delimited with ‘&’ to search for entries that have all of the terms.
Returns: A list of the indices for the test_df.

soft_classification_function¶: A function to take a single set of inputs and return the (non arg-maxed) output layer of the network.

soft_classify(texts: List[str]) → List[List[float]]¶

Get the non-argmaxed final prediction layer outputs of the classification head.

Parameters: texts – An array of texts to predict on.
Returns: An Nxd array of arrays, where N is the number of entries to predict on and d is the number of categories.
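To relate these soft outputs to the hard predictions from classify(), each row can be argmaxed. The sketch below also shows a softmax, which is only meaningful if the outputs are raw logits; the description above does not say whether they are already normalized. `argmax_rows`, `softmax`, and the sample values are hypothetical.

```python
import math
from typing import List


def argmax_rows(rows: List[List[float]]) -> List[int]:
    """Index of the largest value in each row (first index wins ties)."""
    return [max(range(len(row)), key=lambda i: row[i]) for row in rows]


def softmax(row: List[float]) -> List[float]:
    """Numerically stable softmax over one row of raw scores."""
    peak = max(row)
    exps = [math.exp(value - peak) for value in row]
    total = sum(exps)
    return [e / total for e in exps]


# Made-up stand-in for wrapper.soft_classify(texts): N=2 entries, d=2 categories.
soft_outputs = [[2.0, 0.5], [-1.0, 3.0]]

predicted = argmax_rows(soft_outputs)
probabilities = [softmax(row) for row in soft_outputs]
print(predicted)
print(probabilities)
```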

test_labels¶: Collection of all class labels used during the model’s testing process.

test_texts¶: Collection of all text entries used during the model’s testing process.

tokenizer¶: The huggingface tokenizer to use for encoding text input.

train_labels¶: Collection of all class labels used during the model’s training process.

train_texts¶: Collection of all text entries used during the model’s training process.