Wrapper
The wrapper class around a transformer and its functionality.
-
class tx2.wrapper.Wrapper(train_texts: Union[numpy.ndarray, pandas.core.series.Series], train_labels: Union[numpy.ndarray, pandas.core.series.Series], test_texts: Union[numpy.ndarray, pandas.core.series.Series], test_labels: Union[numpy.ndarray, pandas.core.series.Series], encodings: Dict[str, int], classifier=None, language_model=None, tokenizer=None, cuda_device=None, cache_path='data', overwrite=False)
A wrapper or interface class between a transformer and the dashboard.
This class handles running all of the calculations for the data needed by the front-end visualizations.
Methods:
__init__(train_texts, train_labels, …[, …]) – Constructor.
classify(texts) – Predict the category of each passed entry text.
embed(texts) – Get a sequence embedding from the language model for each passed text entry.
encode(text) – Encode/tokenize passed text into a format expected by transformer.
prepare([umap_args, clustering_alg, …]) – Run all necessary precompute steps to support the dashboard.
project(texts) – Use the wrapper’s UMAP model to project passed texts into two dimensions.
recompute_projections([umap_args, …]) – Re-run both projection training and clustering algorithms.
recompute_visual_clusterings([clustering_alg, clustering_args]) – Re-run the clustering algorithm.
search_test_df(search) – Get a list of test dataframe indices that have any of the listed terms in the passed string.
soft_classify(texts) – Get the non-argmaxed final prediction layer outputs of the classification head.
Attributes:
batch_size – The batch size to use in backend dataloader creation.
cache_path – The directory path to cache pre-calculated values.
classification_function – A function to take a single set of inputs and return the index of the predicted class.
classifier – A class containing the entire network, which can be called as a function taking the encoded input and returning the output classification.
cluster_class_word_sets – A dictionary of clusters, a further divided version of cluster_word_freqs that divides each word count up into the number of entries of each category containing that word, as calculated by tx2.calc.frequent_words_by_class_in_cluster().
cluster_profiles – A dictionary of aggregate sorted salience maps for each cluster, as calculated by tx2.calc.aggregate_cluster_salience_maps().
cluster_word_freqs – A dictionary of clusters and sorted top word frequencies for each, as calculated by tx2.calc.frequent_words_in_cluster().
clusters – A dictionary of cluster names, each associated with a list of indices of points in that cluster, as calculated by tx2.calc.cluster_projections().
cuda_device – The device for pytorch to place tensors on; pass either “cpu” or “cuda”.
embedding_function – A function to take a single set of inputs and return embedded versions: a sequence representation from the language model.
embeddings_testing – Precomputed embeddings for each entry in test_texts, as returned by tx2.wrapper.Wrapper.embed().
embeddings_training – Precomputed embeddings for each entry in train_texts, as returned by tx2.wrapper.Wrapper.embed().
encode_function – A function to take a single text entry and return an encoded version of it.
encoder_options – The default options to pass to the tokenizer’s encode_plus() function.
encodings – A dictionary associating class label names with integer values.
language_model – A variable containing only the huggingface language model portion of the network.
max_clusters – Maximum number of clusters to retain.
max_len – The maximum length of each text entry, based on the expected input size of the transformer.
overwrite – Whether to ignore cached calculations and overwrite them or not.
predictions – The predicted class for each entry in test_texts, as returned by tx2.wrapper.Wrapper.classify().
projections_testing – The two dimensional projections of embeddings_testing, for each entry in test_texts.
projections_training – The two dimensional projections of embeddings_training, for each entry in train_texts.
projector – The trained UMAP projector.
salience_maps – The salience map for each entry in test_texts, as calculated by tx2.calc.salience_map().
soft_classification_function – A function to take a single set of inputs and return the (non arg-maxed) output layer of the network.
test_labels – Collection of all class labels used during the model’s testing process.
test_texts – Collection of all text entries used during the model’s testing process.
tokenizer – The huggingface tokenizer to use for encoding text input.
train_labels – Collection of all class labels used during the model’s training process.
train_texts – Collection of all text entries used during the model’s training process.
-
__init__(train_texts: Union[numpy.ndarray, pandas.core.series.Series], train_labels: Union[numpy.ndarray, pandas.core.series.Series], test_texts: Union[numpy.ndarray, pandas.core.series.Series], test_labels: Union[numpy.ndarray, pandas.core.series.Series], encodings: Dict[str, int], classifier=None, language_model=None, tokenizer=None, cuda_device=None, cache_path='data', overwrite=False)
Constructor.
- Parameters
train_texts – A set of text entries that were used during the model’s training process.
train_labels – The set of class labels for train_texts.
test_texts – The set of text entries that the model hadn’t seen during training.
test_labels – The set of class labels for test_texts.
encodings – A dictionary associating class label names with integer values.
classifier – A class/network containing a language model and classification head. Calling this variable as a function should, by default, send the passed inputs through the entire network and return the argmaxed classification index (reverse encoding). This argument is not required if the user intends to manually specify classification functions.
language_model – A huggingface transformer model. If a custom network class is being used and has a layer representing the output of just the language model, pass it here. This argument is not required if the user intends to manually specify classification functions.
tokenizer – A huggingface tokenizer. This argument is not required if the user intends to manually specify encode and classification functions.
cuda_device – The device for pytorch to place tensors on; pass either “cpu” or “cuda”. This variable is used by the default embedding function. If unspecified and a GPU is found, “cuda” will be used; otherwise it defaults to “cpu”.
cache_path – The directory path to cache intermediate outputs from the tx2.wrapper.Wrapper.prepare() function. This allows the wrapper to precompute needed values for the dashboard to reduce render time and allow rerunning all wrapper code without needing to recompute. Note that every wrapper/dashboard instance is expected to have a unique cache path, otherwise filenames will conflict. You will need to set this if you intend to use more than one dashboard.
overwrite – Whether to ignore the cache and overwrite previous results or not.
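Since every wrapper/dashboard instance is expected to have its own cache path, one simple way to keep them apart is to derive each path from a per-dashboard name. The following is only a sketch: cache_path_for and the dashboard names are hypothetical and not part of tx2, and the Wrapper(...) calls are left as comments because the remaining constructor arguments depend on your model.

```python
from pathlib import Path


def cache_path_for(dashboard_name: str, root: str = "data") -> str:
    """Return a cache directory path unique to one dashboard (hypothetical helper)."""
    return str(Path(root) / dashboard_name)


sentiment_cache = cache_path_for("sentiment")
topics_cache = cache_path_for("topics")

# Each wrapper would then be given its own directory, for example:
# wrapper_a = Wrapper(..., cache_path=sentiment_cache)
# wrapper_b = Wrapper(..., cache_path=topics_cache)
```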
-
batch_size
The batch size to use in backend dataloader creation.
-
cache_path
The directory path to cache pre-calculated values.
-
classification_function
A function to take a single set of inputs and return the index of the predicted class.
-
classifier
A class containing the entire network, which can be called as a function taking the encoded input and returning the output classification.
-
classify(texts: List[str]) → List[int]
Predict the category of each passed entry text.
- Parameters
texts – An array of texts to predict on.
- Returns
An array of predicted classes, whose labels can be reverse looked up through encodings.
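The reverse lookup mentioned above can be sketched by inverting the encodings dictionary. The values below are stand-ins: predicted_indices plays the role of the list returned by classify(), and the label names follow the example given for the encodings attribute.

```python
# Example encodings, in the same shape as the wrapper's `encodings` attribute.
encodings = {"label1": 0, "label2": 1}

# Stand-in for the list returned by wrapper.classify(texts).
predicted_indices = [1, 0, 1]

# Invert the mapping so integer predictions can be turned back into label names.
index_to_label = {index: label for label, index in encodings.items()}
predicted_labels = [index_to_label[index] for index in predicted_indices]
print(predicted_labels)
```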
-
cluster_class_word_sets
A dictionary of clusters, a further divided version of cluster_word_freqs that divides each word count up into the number of entries of each category containing that word, as calculated by tx2.calc.frequent_words_by_class_in_cluster().
-
cluster_profiles
A dictionary of aggregate sorted salience maps for each cluster, as calculated by tx2.calc.aggregate_cluster_salience_maps().
-
cluster_word_freqs
A dictionary of clusters and sorted top word frequencies for each, as calculated by tx2.calc.frequent_words_in_cluster().
-
clusters
A dictionary of cluster names, each associated with a list of indices of points in that cluster, as calculated by tx2.calc.cluster_projections().
-
cuda_device
The device for pytorch to place tensors on; pass either “cpu” or “cuda”. This variable is used by the default embedding function. If unspecified and a GPU is found, “cuda” will be used; otherwise it defaults to “cpu”.
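The documented fallback (use “cuda” when a GPU is found, otherwise “cpu”) can be sketched as follows. pick_device is a hypothetical helper written for illustration and is not taken from the tx2 source; the PyTorch import is guarded only so the sketch also runs where PyTorch is not installed.

```python
def pick_device(requested=None):
    """Return the requested device, else "cuda" when a GPU is visible, else "cpu"."""
    if requested is not None:
        return requested
    try:
        import torch  # optional here so the sketch also runs without PyTorch
    except ImportError:
        return "cpu"
    return "cuda" if torch.cuda.is_available() else "cpu"


device = pick_device()
print(device)
```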
-
embed(texts: List[str]) → List[List[float]]
Get a sequence embedding from the language model for each passed text entry.
- Parameters
texts – An array of texts to embed.
- Returns
An array of sequence embeddings.
-
embedding_function
A function to take a single set of inputs and return embedded versions: a sequence representation from the language model. This variable points to a sensible default function based on a language model layer being specified in the constructor. If neither the classifier nor the language model was passed to the constructor, this variable must be assigned a custom function definition.
Example
Below is a simplified example of creating a customized embed function. my_custom_embedding_function will be used by the wrapper, will be called with an array of pre-encoded inputs for a single entry, and is expected to return an array. (TODO: 1d or 2d?)
    import numpy as np

    def my_custom_embedding_function(inputs):
        # my_transformer is the user's own language model
        return np.mean(my_transformer(inputs['input_ids'], inputs['attention_mask'])[0])

    wrapper = Wrapper(...)
    wrapper.embedding_function = my_custom_embedding_function
-
embeddings_testing
Precomputed embeddings for each entry in test_texts, as returned by tx2.wrapper.Wrapper.embed().
-
embeddings_training
Precomputed embeddings for each entry in train_texts, as returned by tx2.wrapper.Wrapper.embed().
-
encode(text: str)
Encode/tokenize passed text into a format expected by transformer.
- Parameters
text – The text entry to tokenize.
- Returns
A tokenized version of the text. By default this calls encode_plus() with the options specified in encoder_options, and returns a dictionary:
{ "input_ids": [], "attention_mask": [] }
-
encode_function
A function to take a single text entry and return an encoded version of it. The default function will utilize the tokenizer given in the constructor if available.
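When no tokenizer is passed to the constructor, a custom function can be assigned here instead. The sketch below uses a toy whitespace tokenizer purely to show the dictionary shape documented for encode(); MAX_LEN, VOCAB, and my_custom_encode_function are hypothetical names, and a real model may expect tensors rather than plain lists.

```python
MAX_LEN = 8  # hypothetical; the wrapper exposes a max_len attribute for this
VOCAB = {"[PAD]": 0, "[UNK]": 1, "the": 2, "cat": 3, "sat": 4}  # toy vocabulary


def my_custom_encode_function(text):
    """Toy whitespace tokenizer returning the dictionary shape shown for encode()."""
    tokens = text.lower().split()[:MAX_LEN]
    input_ids = [VOCAB.get(token, VOCAB["[UNK]"]) for token in tokens]
    attention_mask = [1] * len(input_ids)
    padding = MAX_LEN - len(input_ids)
    return {
        "input_ids": input_ids + [VOCAB["[PAD]"]] * padding,
        "attention_mask": attention_mask + [0] * padding,
    }


encoded = my_custom_encode_function("The cat sat down")
print(encoded)

# Assigning it to an existing wrapper would look like:
# wrapper.encode_function = my_custom_encode_function
```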
-
encoder_options
The default options to pass to the tokenizer’s encode_plus() function. See huggingface documentation.
-
encodings
A dictionary associating class label names with integer values.
Example:
{ "label1": 0, "label2": 1, }
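One way to build such a dictionary from raw label names is to enumerate the sorted unique labels. This is only a sketch with made-up label names; the integer assigned to each label must match whatever the classifier was trained with, so an alphabetical assignment like this one is an assumption, not a tx2 requirement.

```python
# Made-up label names standing in for the label column of a training set.
train_label_names = ["sports", "politics", "sports", "science"]

# Assign integers in alphabetical order of the unique label names.
encodings = {
    label: index for index, label in enumerate(sorted(set(train_label_names)))
}

# Integer labels in the same order as the original entries.
encoded_labels = [encodings[label] for label in train_label_names]
print(encodings, encoded_labels)
```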
-
language_model
A variable containing only the huggingface language model portion of the network.
-
max_clusters
Maximum number of clusters to retain. Note that this cannot exceed the number of colors in the dashboard.
-
max_len
The maximum length of each text entry, based on the expected input size of the transformer.
-
overwrite
Whether to ignore cached calculations and overwrite them or not.
-
predictions
The predicted class for each entry in test_texts, as returned by tx2.wrapper.Wrapper.classify().
-
prepare(umap_args={}, clustering_alg='DBSCAN', clustering_args={})
Run all necessary precompute steps to support the dashboard. This function must be called before the wrapper is used in a dashboard instance.
- Parameters
umap_args – Dictionary of arguments to pass into the UMAP model on instantiation.
clustering_alg – The name of the clustering algorithm to use, a class name from sklearn.cluster, see sklearn’s documentation. (“DBSCAN”, “KMeans”, “AffinityPropagation”, “Birch”, “OPTICS”, “AgglomerativeClustering”, “SpectralClustering”, “SpectralBiclustering”, “SpectralCoclustering”, “MiniBatchKMeans”, “FeatureAgglomeration”, “MeanShift”)
clustering_args – Dictionary of arguments to pass into clustering algorithm on instantiation.
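The three arguments can be assembled and sanity-checked before the call. The sketch below is an illustration only: build_prepare_kwargs is a hypothetical helper, the name check simply mirrors the list of sklearn.cluster class names given above, and the example values (n_neighbors and min_dist for UMAP, n_clusters for KMeans) are ordinary parameters of those libraries chosen arbitrarily.

```python
# Clustering class names listed for the clustering_alg parameter.
SKLEARN_CLUSTERING_NAMES = {
    "DBSCAN", "KMeans", "AffinityPropagation", "Birch", "OPTICS",
    "AgglomerativeClustering", "SpectralClustering", "SpectralBiclustering",
    "SpectralCoclustering", "MiniBatchKMeans", "FeatureAgglomeration", "MeanShift",
}


def build_prepare_kwargs(umap_args=None, clustering_alg="DBSCAN", clustering_args=None):
    """Validate the algorithm name and return keyword arguments for prepare()."""
    if clustering_alg not in SKLEARN_CLUSTERING_NAMES:
        raise ValueError(f"Unknown clustering algorithm: {clustering_alg!r}")
    return {
        "umap_args": dict(umap_args or {}),
        "clustering_alg": clustering_alg,
        "clustering_args": dict(clustering_args or {}),
    }


kwargs = build_prepare_kwargs(
    umap_args={"n_neighbors": 15, "min_dist": 0.1},
    clustering_alg="KMeans",
    clustering_args={"n_clusters": 8},
)

# With an existing wrapper, the call would then be:
# wrapper.prepare(**kwargs)
```

Building fresh dictionaries inside the helper avoids sharing one mutable default between calls.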
-
project(texts: List[str]) → numpy.ndarray
Use the wrapper’s UMAP model to project passed texts into two dimensions.
- Parameters
texts – An array of texts to embed and project.
- Returns
An N×2 numpy array, containing a size 2 array of coordinates for each of the N input text entries.
-
projections_testing
The two dimensional projections of embeddings_testing, for each entry in test_texts.
-
projections_training
The two dimensional projections of embeddings_training, for each entry in train_texts.
-
projector
The trained UMAP projector. See umap-learn documentation.
-
recompute_projections(umap_args={}, clustering_alg='DBSCAN', clustering_args={})
Re-run both projection training and clustering algorithms. Note that this automatically overwrites both the previously saved projections and the clustering data.
- Parameters
umap_args – Dictionary of arguments to pass into the UMAP model on instantiation.
clustering_alg – The name of the clustering algorithm to use, a class name from sklearn.cluster, see sklearn’s documentation. (“DBSCAN”, “KMeans”, “AffinityPropagation”, “Birch”, “OPTICS”, “AgglomerativeClustering”, “SpectralClustering”, “SpectralBiclustering”, “SpectralCoclustering”, “MiniBatchKMeans”, “FeatureAgglomeration”, “MeanShift”)
clustering_args – Dictionary of arguments to pass into the clustering algorithm on instantiation.
-
recompute_visual_clusterings(clustering_alg='DBSCAN', clustering_args={})
Re-run the clustering algorithm. Note that this automatically overwrites any previously cached data for clusters.
- Parameters
clustering_alg – The name of the clustering algorithm to use, a class name from sklearn.cluster, see sklearn’s documentation. (“DBSCAN”, “KMeans”, “AffinityPropagation”, “Birch”, “OPTICS”, “AgglomerativeClustering”, “SpectralClustering”, “SpectralBiclustering”, “SpectralCoclustering”, “MiniBatchKMeans”, “FeatureAgglomeration”, “MeanShift”)
clustering_args – Dictionary of arguments to pass into the clustering algorithm on instantiation.
-
salience_maps
The salience map for each entry in test_texts, as calculated by tx2.calc.salience_map().
-
search_test_df(search: str) → List[int]
Get a list of test dataframe indices that have any of the listed terms in the passed string.
- Parameters
search – The search string, can contain multiple terms delimited with ‘&’ to search for entries that have all of the terms.
- Returns
A list of the indices for the test_df.
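The ‘&’ behavior described for the search parameter can be sketched over a plain list of texts. This is not the tx2 implementation: search_texts is a hypothetical stand-in, and case-insensitive substring matching, plus returning an empty list for an empty search, are assumptions made for the illustration.

```python
def search_texts(texts, search):
    """Return indices of texts containing every '&'-delimited term (assumed semantics)."""
    terms = [term.strip().lower() for term in search.split("&") if term.strip()]
    if not terms:
        return []
    return [
        index
        for index, text in enumerate(texts)
        if all(term in text.lower() for term in terms)
    ]


example_texts = [
    "The striker scored a late goal",
    "Parliament passed the budget",
    "A late budget vote delayed the match",
]
print(search_texts(example_texts, "late"))
print(search_texts(example_texts, "late & budget"))
```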
-
soft_classification_function
A function to take a single set of inputs and return the (non arg-maxed) output layer of the network.
-
soft_classify(texts: List[str]) → List[List[float]]
Get the non-argmaxed final prediction layer outputs of the classification head.
- Parameters
texts – An array of texts to predict on.
- Returns
An N×d array of arrays, where N is the number of entries to predict on and d is the number of categories.
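The relationship between these outputs and the class indices returned by classify() can be sketched by taking the argmax of each row. The numbers below are made up, and argmax_rows is a hypothetical helper rather than part of tx2.

```python
def argmax_rows(soft_outputs):
    """Return, for each row, the index of its largest value."""
    return [max(range(len(row)), key=row.__getitem__) for row in soft_outputs]


# Stand-in for the N×d result of wrapper.soft_classify(texts), with N=2 and d=3.
soft_outputs = [[0.2, 1.5, -0.3], [2.1, 0.4, 0.0]]
predicted_indices = argmax_rows(soft_outputs)
print(predicted_indices)
```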
-
test_labels
Collection of all class labels used during the model’s testing process.
-
test_texts
Collection of all text entries used during the model’s testing process.
-
tokenizer
The huggingface tokenizer to use for encoding text input.
-
train_labels
Collection of all class labels used during the model’s training process.
-
train_texts
Collection of all text entries used during the model’s training process.
-