icat.model.Model#

class icat.model.Model(data, text_col, anchor_types=None, default_sample_size=100)#

Bases: object

The interactive machine learning model - a basic binary classifier with tools for viewing and interacting with the data and features.

Parameters:
  • data (pd.DataFrame) – The data to explore with.

  • text_col (str) – The name of the text column in the passed data.

  • anchor_types (list[type | dict[str, any]]) – The list of class types of anchors to initially include in the interface. (This can be modified after initialization through the anchor_list.)

  • default_sample_size (int) – The initial number of points to sample for the visualizations.
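
A minimal construction sketch; the small dataframe here is purely illustrative:

import pandas as pd
import icat

df = pd.DataFrame(
    {
        "text": [
            "the cat sat on the mat",
            "stocks rallied after the earnings report",
            "a recipe for sourdough bread",
        ]
    }
)

model = icat.Model(df, text_col="text", default_sample_size=100)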

Methods

__init__(data, text_col[, anchor_types, ...])

add_anchor(anchor)

Add the passed anchor to this model's anchor list.

compute_coverage()

Calculate the coverage of the current anchors on the current active data.

feature_names([in_model_only])

Provides a list of the feature column names in use in the data manager.

featurize([data, normalize, normalize_reference])

Run the anchors: calculate the output features for each anchor and add the corresponding "weights" column to the dataframe.

fit()

Featurize the current data and fit the model to it.

is_seeded()

Determine if there are enough labels in the training data to train the model with.

is_trained()

load(path)

Reload the model with all of the data and anchors from the specified location.

predict([data, inplace])

Run model's classifier predictions on either the passed data or training data.

save(path)

Save the model and all associated data at the specified location.

Attributes

training_data

The rows (and only those rows) of the original data explicitly used for training.

text_col

The column in the dataframe with the text to explore.

classifier

The underlying machine learning algorithm that learns based on the training data.

anchor_list

The AnchorList instance that manages all features/featurizing necessary for the classifier.

data

The DataManager instance that handles all labeling tasks and data filtering/sampling.

view

The InteractiveView, or dashboard widget, that glues together the various visual components.

add_anchor(anchor)#

Add the passed anchor to this model’s anchor list.

Parameters:

anchor (Anchor) – The Anchor to add to the list.

Note

See AnchorList.add_anchor for more details.
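
A hedged sketch of adding a keyword anchor; DictionaryAnchor and its anchor_name/keywords arguments are assumed from icat's anchor classes and may differ in your version:

import pandas as pd
import icat
from icat.anchors import DictionaryAnchor  # assumed import path

df = pd.DataFrame({"text": ["cheap pills online now", "minutes from tuesday's meeting"]})
model = icat.Model(df, text_col="text")

# Build a keyword-based anchor and register it with the model's anchor list.
spam_anchor = DictionaryAnchor(anchor_name="spam", keywords=["cheap", "pills"])
model.add_anchor(spam_anchor)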

anchor_list: AnchorList#

The AnchorList instance that manages all features/featurizing necessary for the classifier.

classifier: LogisticRegression#

The underlying machine learning algorithm that learns based on the training data.

compute_coverage()#

Calculate the coverage of the current anchors on the current active data.

Returns:

A dictionary where each key is the panel id of the anchor, and the value is a dictionary with the statistics: 'total', 'pos', 'neg', 'total_pct', 'pos_pct', and 'neg_pct'

Return type:

dict[str, dict[str, float | int]]
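
For instance, continuing the model from the sketches above, the returned dictionary can be walked like this (key names as listed in the description):

coverage = model.compute_coverage()
for panel_id, stats in coverage.items():
    # Each entry reports how much of the active data the anchor covers,
    # split into positively and negatively labeled counts and percentages.
    print(panel_id, stats["total"], stats["pos"], stats["neg"], stats["total_pct"])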

data: DataManager#

The DataManager instance that handles all labeling tasks and data filtering/sampling.

feature_names(in_model_only=False)#

Provides a list of the feature column names in use in the data manager.

Parameters:

in_model_only (bool) – Only include anchors whose in_model value is True.

Return type:

list[str]
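
Continuing the model from the sketches above:

# All feature columns known to the data manager...
all_features = model.feature_names()
# ...versus only those from anchors currently included in the model.
model_features = model.feature_names(in_model_only=True)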

featurize(data=None, normalize=False, normalize_reference=None)#

Run the anchors: calculate the output features for each anchor and add the corresponding "weights" column to the dataframe. These are the values that the classifier uses to make its predictions.

Parameters:
  • data (pd.DataFrame) – The data to apply the anchors to. Uses the exploration data if not specified.

  • normalize (bool) – Whether to apply L1 normalization to the output values.

  • normalize_reference (Optional[pd.DataFrame]) – A different dataframe whose features are summed for the L1 norm. This is used when featurizing the model's separate training data as opposed to the full dataset, since values normalized over just the training data would differ vastly from values normalized over the full set.

Returns:

The passed data with the feature columns added to it.

Return type:

DataFrame
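
A sketch of the two normalization modes described above; passing the already-featurized full dataframe as normalize_reference is an assumption about how the reference is meant to be supplied:

# Featurize the exploration data with L1 normalization.
featured = model.featurize(normalize=True)

# Featurize only the training rows, normalizing against the full featurized
# dataframe so the values stay on a comparable scale.
train_featured = model.featurize(
    data=model.training_data,
    normalize=True,
    normalize_reference=featured,
)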

fit()#

Featurize the current data and fit the model to it.

is_seeded()#

Determine if there are enough labels in the training data to train the model with.

Returns:

False if the label column doesn't exist, there are fewer than 10 labeled points, or only one class of label is present.

Return type:

bool
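
A small guard sketch, continuing the model from earlier, that fits only once the seeding criteria are met:

# fit() is only useful once enough labels exist in the training data.
if model.is_seeded():
    model.fit()
    print("trained:", model.is_trained())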

is_trained()#
Return type:

bool

static load(path)#

Reload the model with all of the data and anchors from the specified location.

Example

import icat

m1 = icat.Model(my_data, text_col="text")
m1.save("~/tmp/my_model")

m2 = icat.Model.load("~/tmp/my_model")

Parameters:

path (str) – The location to load the model, data, and anchors from.

Return type:

Model

predict(data=None, inplace=True)#

Run model’s classifier predictions on either the passed data or training data.

Note

This function, like sklearn's, assumes the model has already been fit. (We have no strict check for this, as this is for interactive machine learning (IML) and the classifier is assumed to be re-fit multiple times.)

Parameters:
  • data (Optional[pd.DataFrame]) – If not specified, predict on the previously set training data; otherwise, predict on this data.

  • inplace (bool) – Whether to operate directly on the passed data or create a copy of it.

Returns:

The predictions for either the active data, or the passed data if provided.

Return type:

ndarray
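
A sketch of both call styles, continuing the model from earlier; whether predict featurizes brand-new rows itself is an assumption here rather than something stated above:

import pandas as pd

# Predictions over the training data (the default), operating in place.
train_preds = model.predict()

# Predictions over new rows, working on a copy rather than the passed frame.
new_df = pd.DataFrame({"text": ["an unlabeled document to score"]})
new_preds = model.predict(data=new_df, inplace=False)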

save(path)#

Save the model and all associated data at the specified location.

Parameters:

path (str) – The location to save the model and all associated data to.

text_col: str#

The column in the dataframe with the text to explore.

training_data: DataFrame#

The rows (and only those rows) of the original data explicitly used for training.

view: InteractiveView#

The InteractiveView, or dashboard widget, that glues together the various visual components.