icat.model.Model#

class icat.model.Model(data, text_col, anchor_types=None, default_sample_size=100)#

Bases: object

The interactive machine learning model - a basic binary classifier with tools for viewing and interacting with the data and features.

Parameters:
  • data (pd.DataFrame) – The data to explore with.

  • text_col (str) – The name of the text column in the passed data.

  • anchor_types (list[type | dict[str, any]]) – The list of class types of anchors to initially include in the interface. (This can be modified after initialization through the anchor_list.)

  • default_sample_size (int) – The initial number of points to sample for the visualizations.
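
A minimal construction sketch; the small dataframe here is purely illustrative:

import pandas as pd
import icat

df = pd.DataFrame(
    {
        "text": [
            "the cat sat on the mat",
            "stocks rallied after the earnings report",
            "a recipe for sourdough bread",
        ]
    }
)

model = icat.Model(df, text_col="text", default_sample_size=100)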

Methods

__init__(data, text_col[, anchor_types, ...])

add_anchor(anchor)

Add the passed anchor to this model's anchor list.

compute_coverage()

Calculate the coverage of the current anchors on the current active data.

feature_names([in_model_only])

Provides a list of the feature column names in use in the data manager.

featurize([data, normalize, normalize_reference])

Run the anchors: calculate the output features for each anchor and add the corresponding "weights" column to the dataframe.

fit()

Featurize the current data and fit the model to it.

is_seeded()

Determine if there are enough labels in the training data to train the model with.

is_trained()

load(path)

Reload the model with all of the data and anchors from the specified location.

predict([data, inplace])

Run model's classifier predictions on either the passed data or training data.

save(path)

Save the model and all associated data at the specified location.

Attributes

training_data

The rows (and only those rows) of the original data explicitly used for training.

text_col

The column in the dataframe with the text to explore.

classifier

The underlying machine learning algorithm that learns based on the training data.

anchor_list

The AnchorList instance that manages all features/featurizing necessary for the classifier.

data

The DataManager instance that handles all labeling tasks and data filtering/sampling.

view

The InteractiveView, or dashboard widget, that glues together the various visual components.

add_anchor(anchor)#

Add the passed anchor to this model’s anchor list.

Parameters:

anchor (Anchor) – The Anchor to add to the list.

Note

See AnchorList.add_anchor for more details.
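
A hedged sketch of adding a keyword anchor; DictionaryAnchor and its anchor_name/keywords arguments are assumed from icat's anchor classes and may differ in your version:

import pandas as pd
import icat
from icat.anchors import DictionaryAnchor  # assumed import path

df = pd.DataFrame({"text": ["cheap pills online now", "minutes from tuesday's meeting"]})
model = icat.Model(df, text_col="text")

# Build a keyword-based anchor and register it with the model's anchor list.
spam_anchor = DictionaryAnchor(anchor_name="spam", keywords=["cheap", "pills"])
model.add_anchor(spam_anchor)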

anchor_list: AnchorList#

The AnchorList instance that manages all features/featurizing necessary for the classifier.

classifier: LogisticRegression#

The underlying machine learning algorithm that learns based on the training data.

compute_coverage()#

Calculate the coverage of the current anchors on the current active data.

Returns:

A dictionary where each key is the panel id of the anchor, and the value is a dictionary with the statistics: 'total', 'pos', 'neg', 'total_pct', 'pos_pct', and 'neg_pct'

Return type:

dict[str, dict[str, float | int]]
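
For instance, continuing the model from the sketches above, the returned dictionary can be walked like this (key names as listed in the description):

coverage = model.compute_coverage()
for panel_id, stats in coverage.items():
    # Each entry reports how much of the active data the anchor covers,
    # split into positively and negatively labeled counts and percentages.
    print(panel_id, stats["total"], stats["pos"], stats["neg"], stats["total_pct"])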

data: DataManager#

The DataManager instance that handles all labeling tasks and data filtering/sampling.

feature_names(in_model_only=False)#

Provides a list of the feature column names in use in the data manager.

Parameters:

in_model_only (bool) – Only include anchors whose in_model value is True.

Return type:

list[str]
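
Continuing the model from the sketches above:

# All feature columns known to the data manager...
all_features = model.feature_names()
# ...versus only those from anchors currently included in the model.
model_features = model.feature_names(in_model_only=True)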

featurize(data=None, normalize=False, normalize_reference=None)#

Run the anchors: calculate the output features for each anchor and add the corresponding "weights" column to the dataframe. These are the values that the classifier uses to make its predictions.

Parameters:
  • data (pd.DataFrame) – The data to apply the anchors to. Uses the exploration data if not specified.

  • normalize (bool) – Whether to apply L1 normalization to the output values.

  • normalize_reference (Optional[pd.DataFrame]) – A different dataframe whose features are summed for the L1 norm. This is used when featurizing the model's separate training data as opposed to the full dataset, since values normalized over just the training data would differ vastly from values normalized over the full set.

Returns:

The passed data with the feature columns added to it.

Return type:

DataFrame
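
A sketch of the two normalization modes described above; passing the already-featurized full dataframe as normalize_reference is an assumption about how the reference is meant to be supplied:

# Featurize the exploration data with L1 normalization.
featured = model.featurize(normalize=True)

# Featurize only the training rows, normalizing against the full featurized
# dataframe so the values stay on a comparable scale.
train_featured = model.featurize(
    data=model.training_data,
    normalize=True,
    normalize_reference=featured,
)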

fit()#

Featurize the current data and fit the model to it.

is_seeded()#

Determine if there are enough labels in the training data to train the model with.

Returns:

False if the label column doesn't exist, there are fewer than 10 labeled points, or only one class of label is present.

Return type:

bool
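
A small guard sketch, continuing the model from earlier, that fits only once the seeding criteria are met:

# fit() is only useful once enough labels exist in the training data.
if model.is_seeded():
    model.fit()
    print("trained:", model.is_trained())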

is_trained()#
Return type:

bool

static load(path)#

Reload the model with all of the data and anchors from the specified location.

Example

import icat

m1 = icat.Model(my_data, text_col="text")
m1.save("~/tmp/my_model")

m2 = icat.Model.load("~/tmp/my_model")

Parameters:

path (str) – The location to load the model, data, and anchors from.

Return type:

Model

predict(data=None, inplace=True)#

Run model’s classifier predictions on either the passed data or training data.

Note

This function, like sklearn's, assumes the model has already been fit. (We have no strict check for this, as this is for interactive machine learning (IML) and the classifier is assumed to be re-fit multiple times.)

Parameters:
  • data (Optional[pd.DataFrame]) – If not specified, predict on the previously set training data; otherwise, predict on this data.

  • inplace (bool) – Whether to operate directly on the passed data or create a copy of it.

Returns:

The predictions for either the active data, or the passed data if provided.

Return type:

ndarray
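
A sketch of both call styles, continuing the model from earlier; whether predict featurizes brand-new rows itself is an assumption here rather than something stated above:

import pandas as pd

# Predictions over the training data (the default), operating in place.
train_preds = model.predict()

# Predictions over new rows, working on a copy rather than the passed frame.
new_df = pd.DataFrame({"text": ["an unlabeled document to score"]})
new_preds = model.predict(data=new_df, inplace=False)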

save(path)#

Save the model and all associated data at the specified location.

Parameters:

path (str) – The location to save the model and all associated data to.

text_col: str#

The column in the dataframe with the text to explore.

training_data: DataFrame#

The rows (and only those rows) of the original data explicitly used for training.

view: InteractiveView#

The InteractiveView, or dashboard widget, that glues together the various visual components.