Calc
Helper calculation functions for the wrapper and dashboard.
Functions:
- aggregate_cluster_salience_maps – Create an “aggregate” salience map for each cluster.
- cluster_projections – Runs a clustering algorithm (currently only dbscan supported) on the provided embedded or projected points, and provides a dictionary of data point indices.
- frequent_words_by_class_in_cluster – Takes the frequent words of a cluster and splits the counts up based on the classification of the entry they fall under.
- frequent_words_in_cluster – Finds the most frequently occurring words in a given cluster.
- normalize_salience_map – Get a salience map with the total scores as computed in sort_salience_map() and normalize those scores.
- salience_map – Calculates the total change in output classification probabilities when each individual word in the text is removed, a proxy for each word’s “importance” in the model prediction.
- sort_salience_map – Sort the passed salience map tuples by a total “delta” for each word, computed from the sum of absolute values of the prediction diffs.
-
tx2.calc.aggregate_cluster_salience_maps(clusters: Dict[str, List[int]], salience_maps) → Dict[str, List[Tuple[str, float]]]
Create an “aggregate” salience map for each cluster. This function combines the impact of each word from every instance where it appears in that cluster, and then sorts the results. This gives you how much overall impact the removal of a word has on all entries in that cluster - a proxy for the importance of that word to that cluster.
- Parameters
  - clusters – A dictionary of cluster names and arrays of associated point indices. This can directly take the output from tx2.calc.cluster_projections().
  - salience_maps – A list of salience maps as computed in tx2.calc.salience_map().
- Returns
  A dictionary with an array for each cluster, where each array consists of tuples - the first element is the word and the second is the aggregated impact: ('WORD', AGGREGATE_DELTA)
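As an illustration, the aggregation step can be sketched as below. This is a minimal reimplementation, not the library's actual code, and it assumes each salience map has already been reduced to (word, delta) tuples as by tx2.calc.sort_salience_map():

```python
from collections import defaultdict
from typing import Dict, List, Tuple

def aggregate_cluster_salience_maps(
    clusters: Dict[str, List[int]],
    salience_maps: List[List[Tuple[str, float]]],
) -> Dict[str, List[Tuple[str, float]]]:
    """Sum each word's delta over every entry in a cluster, then sort by impact."""
    result = {}
    for name, indices in clusters.items():
        totals: Dict[str, float] = defaultdict(float)
        for i in indices:
            for word, delta in salience_maps[i]:
                totals[word] += delta
        # largest aggregate impact first
        result[name] = sorted(totals.items(), key=lambda t: t[1], reverse=True)
    return result

maps = [[("good", 0.25), ("movie", 0.1)], [("good", 0.25)], [("bad", 0.5)]]
print(aggregate_cluster_salience_maps({"0": [0, 1], "1": [2]}, maps))
# {'0': [('good', 0.5), ('movie', 0.1)], '1': [('bad', 0.5)]}
```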
-
tx2.calc.cluster_projections(projections, clustering_alg, **clustering_args) → Dict[str, List[int]]
Runs a clustering algorithm (currently only dbscan supported) on the provided embedded or projected points, and provides a dictionary of data point indices.
- Parameters
  - projections – The data points to fit - should be a numpy array of testing data points. The intended use is 2D UMAP projections, but any shape[1] size should be supported.
  - clustering_alg – The name of the clustering algorithm to use, a class name from sklearn.cluster (see sklearn’s documentation): "DBSCAN", "KMeans", "AffinityPropagation", "Birch", "OPTICS", "AgglomerativeClustering", "SpectralClustering", "SpectralBiclustering", "SpectralCoclustering", "MiniBatchKMeans", "FeatureAgglomeration", or "MeanShift".
  - clustering_args – Any options to pass to the clustering algorithm.
- Returns
  A dictionary where each key is a cluster label and each value is an array of the indices from the projections array that are in that cluster.
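The shape of the returned dictionary can be illustrated by the label-grouping step alone. The sketch below (with a hypothetical helper name) assumes the clustering step has already produced one label per point, e.g. via sklearn's DBSCAN(**clustering_args).fit_predict(projections); the string keys mirror this function's declared Dict[str, List[int]] return type:

```python
from collections import defaultdict
from typing import Dict, List, Sequence

def group_by_cluster_label(labels: Sequence[int]) -> Dict[str, List[int]]:
    """Map each cluster label (as a string key) to the point indices assigned to it."""
    clusters: Dict[str, List[int]] = defaultdict(list)
    for index, label in enumerate(labels):
        clusters[str(label)].append(index)
    return dict(clusters)

# e.g. labels as produced by sklearn.cluster.DBSCAN(...).fit_predict(projections);
# -1 marks noise points in sklearn's DBSCAN
print(group_by_cluster_label([0, 0, 1, -1, 1]))
# {'0': [0, 1], '1': [2, 4], '-1': [3]}
```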
-
tx2.calc.frequent_words_by_class_in_cluster(freq_words: List[Tuple[str, int]], encodings: Dict[str, int], cluster_texts: Union[numpy.ndarray, pandas.core.series.Series], cluster_text_labels: Union[numpy.ndarray, pandas.core.series.Series]) → Dict[str, Dict[Any, int]]
Takes the frequent words of a cluster and splits the counts up based on the classification of the entry they fall under. (This gives a rough distribution of which categories the words fall under within the cluster.)
- Parameters
  - freq_words – An array of tuples of the words and their number of occurrences, see tx2.calc.frequent_words_in_cluster().
  - encodings – The dictionary of class/category encodings.
  - cluster_texts – The set of texts from the desired cluster entries.
  - cluster_text_labels – The set of classification labels for the texts from the desired cluster entries.
- Returns
  A dictionary with each word as a key. The value for each is a dictionary with a “total” key and a key for each encoded class, whose value is the number of entries with that class containing the word.
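The per-class splitting can be sketched as below. This is a simplified reimplementation under stated assumptions: containment is checked by whole-word membership after a whitespace split (the library may tokenize differently), and the class names "pets"/"strays" in the example are purely hypothetical:

```python
from typing import Any, Dict, List, Sequence, Tuple

def frequent_words_by_class_in_cluster(
    freq_words: List[Tuple[str, int]],
    encodings: Dict[str, int],
    cluster_texts: Sequence[str],
    cluster_text_labels: Sequence[int],
) -> Dict[str, Dict[Any, int]]:
    """For each frequent word, count the entries containing it, split by class label."""
    result: Dict[str, Dict[Any, int]] = {}
    for word, _count in freq_words:
        per_class: Dict[Any, int] = {"total": 0}
        for encoded in encodings.values():
            per_class[encoded] = 0
        for text, label in zip(cluster_texts, cluster_text_labels):
            if word in text.split():  # whole-word containment check (an assumption)
                per_class[label] += 1
                per_class["total"] += 1
        result[word] = per_class
    return result

texts = ["the cat sat", "the dog ran", "a cat ran"]
labels = [0, 1, 0]
encodings = {"pets": 0, "strays": 1}
print(frequent_words_by_class_in_cluster([("cat", 2), ("ran", 2)], encodings, texts, labels))
# {'cat': {'total': 2, 0: 2, 1: 0}, 'ran': {'total': 2, 0: 1, 1: 1}}
```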
-
tx2.calc.frequent_words_in_cluster(texts: Union[numpy.ndarray, pandas.core.series.Series]) → List[Tuple[str, int]]
Finds the most frequently occurring words in the given cluster.
- Parameters
  - texts – The filtered list of texts to find the frequent words for.
- Returns
  A list of tuples, each containing a word and the number of times it appears in that cluster.
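A minimal sketch of this counting, assuming lowercased whitespace tokenization and a top_n cutoff parameter that the actual signature does not expose (both are illustrative assumptions):

```python
from collections import Counter
from typing import List, Sequence, Tuple

def frequent_words_in_cluster(texts: Sequence[str], top_n: int = 10) -> List[Tuple[str, int]]:
    """Count word occurrences across all of a cluster's texts, most common first."""
    counts = Counter(word for text in texts for word in text.lower().split())
    return counts.most_common(top_n)

print(frequent_words_in_cluster(["the cat sat", "the dog sat down"]))
# [('the', 2), ('sat', 2), ('cat', 1), ('dog', 1), ('down', 1)]
```

Counter.most_common orders ties by first encounter, so equally frequent words keep their order of appearance.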
-
tx2.calc.normalize_salience_map(salience) → Dict[str, float]
Get a salience map with the total scores as computed in tx2.calc.sort_salience_map() and normalize those scores.
- Parameters
  - salience – A salience map as returned from tx2.calc.salience_map().
- Returns
  A dictionary with each removed word as a key, and each value the normalized diff caused by removing that word.
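The exact normalization scheme is not specified here; one plausible sketch scales each word's total delta by the largest absolute delta, so scores land in [-1, 1]. This is an assumption, not the library's confirmed behavior, and it takes the already-totaled (word, delta) tuples from sort_salience_map() as input:

```python
from typing import Dict, List, Tuple

def normalize_salience_map(sorted_salience: List[Tuple[str, float]]) -> Dict[str, float]:
    """Scale each word's total delta by the largest absolute delta (an assumed scheme)."""
    if not sorted_salience:
        return {}
    max_delta = max(abs(delta) for _, delta in sorted_salience) or 1.0
    return {word: delta / max_delta for word, delta in sorted_salience}

print(normalize_salience_map([("bad", 0.8), ("film", 0.4), ("the", 0.2)]))
# {'bad': 1.0, 'film': 0.5, 'the': 0.25}
```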
-
tx2.calc.salience_map(soft_classify_function, text: str, encodings: Dict[str, int], length: int = 256) → List[Tuple[str, numpy.ndarray, numpy.ndarray, str]]
Calculates the total change in output classification probabilities when each individual word in the text is removed, a proxy for each word’s “importance” in the model prediction.
- Parameters
  - soft_classify_function – A function that takes an array of texts as input and returns an array of output values for each category.
  - text – The text to compute the salience map for.
  - encodings – The dictionary of class/category encodings.
  - length – The maximum number of words to stop at. Since transformer tokens are unlikely to always be full words, this won’t directly correspond to what the model actually uses, but it helps at least marginally cut down on processing time. (Running this function on each text in an entire dataframe can take a while.)
- Returns
  An array of tuples. Each tuple corresponds to one word being removed, and includes the word that was removed, the output prediction values, the diff between the unaltered-text and altered-text output prediction values, and the new predicted category with the word removed: ("WORD", PRED_VALUES, PRED_VALUES - ORIGINAL_PRED_VALUES, PRED_CATEGORY). Note that the first entry in the map contains the outputs for the original, unaltered text.
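The leave-one-word-out procedure can be sketched as below. This is a simplified reimplementation: it returns the predicted class index rather than mapping it back to a category name through the encodings dictionary, uses plain lists instead of numpy arrays, and marks the first (original-text) entry with an empty-string word, all of which are illustrative assumptions:

```python
from typing import Callable, List, Sequence, Tuple

def salience_map(
    soft_classify_function: Callable[[List[str]], List[Sequence[float]]],
    text: str,
    length: int = 256,
) -> List[Tuple[str, Sequence[float], List[float], int]]:
    """Remove each word in turn and record how the output probabilities shift."""
    words = text.split()[:length]
    # first variant is the unaltered text; then one variant per removed word
    variants = [text] + [" ".join(words[:i] + words[i + 1:]) for i in range(len(words))]
    preds = soft_classify_function(variants)
    original = preds[0]

    def argmax(vals: Sequence[float]) -> int:
        return max(range(len(vals)), key=lambda c: vals[c])

    results = [("", original, [0.0] * len(original), argmax(original))]
    for word, pred in zip(words, preds[1:]):
        diff = [p - o for p, o in zip(pred, original)]
        results.append((word, pred, diff, argmax(pred)))
    return results

def toy_classifier(texts: List[str]) -> List[List[float]]:
    # stand-in for a model: class 0 iff "good" is present
    return [[1.0, 0.0] if "good" in t.split() else [0.0, 1.0] for t in texts]

result = salience_map(toy_classifier, "good movie")
print(result[1])
# ('good', [0.0, 1.0], [-1.0, 1.0], 1)
```

Removing "good" flips the toy prediction, so its diff is large; removing "movie" changes nothing, so its diff is zero.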
-
tx2.calc.sort_salience_map(salience) → List[Tuple[str, float]]
Sort the passed salience map tuples by computing a total “delta” for each word, taken as the sum of the absolute values of the prediction diffs.
- Parameters
  - salience – A salience map as returned from tx2.calc.salience_map().
- Returns
  A new list of sorted tuples, where each tuple consists of ("WORD", TOTAL_DELTA).
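The sorting step can be sketched as below (a minimal reimplementation under the assumption that each entry is a 4-tuple shaped like salience_map's output, with the per-class diffs in the third position):

```python
from typing import Iterable, List, Sequence, Tuple

def sort_salience_map(
    salience: Iterable[Tuple[str, Sequence[float], Sequence[float], object]],
) -> List[Tuple[str, float]]:
    """Total delta per word = sum of |diff| across classes; sort largest first."""
    totals = [(word, sum(abs(d) for d in diff)) for word, _pred, diff, _cat in salience]
    return sorted(totals, key=lambda t: t[1], reverse=True)

sal = [
    ("movie", [0.75, 0.25], [-0.25, 0.25], 0),
    ("good", [0.0, 1.0], [-0.5, 0.5], 1),
]
print(sort_salience_map(sal))
# [('good', 1.0), ('movie', 0.5)]
```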