icat.model.Model
- class icat.model.Model(data, text_col, anchor_types=None, default_sample_size=100)
Bases: object
The interactive machine learning model - a basic binary classifier with tools for viewing and interacting with the data and features.
- Parameters:
data (pd.DataFrame) – The data to explore with.
text_col (str) – The name of the text column in the passed data.
anchor_types (list[type | dict[str, any]]) – The list of class types of anchors to initially include in the interface. (This can be modified after initialization through the anchor_list.)
default_sample_size (int) – The initial number of points to sample for the visualizations.
Methods
__init__(data, text_col[, anchor_types, ...])
add_anchor(anchor) – Add the passed anchor to this model's anchor list.
compute_coverage() – Calculate the coverage of the current anchors on the current active data.
feature_names([in_model_only]) – Provides a list of the feature column names in use in the data manager.
featurize([data, normalize, normalize_reference]) – Run the anchors - calculates the output features for each anchor and adds the corresponding "weights" column to the dataframe.
fit() – Featurize the current data and fit the model to it.
is_seeded() – Determine if there are enough labels in the training data to train the model with.
is_trained()
load(path) – Reload the model with all of the data and anchors from the specified location.
predict([data, inplace]) – Run model's classifier predictions on either the passed data or training data.
save(path) – Save the model and all associated data at the specified location.
Attributes
training_data – The rows (and only those rows) of the original data explicitly used for training.
text_col – The column in the dataframe with the text to explore.
classifier – The underlying machine learning algorithm that learns based on the training data.
anchor_list – The AnchorList instance that manages all features/featuring necessary for the classifier.
data – The DataManager instance that handles all labeling tasks and data filtering/sampling.
view – The InteractiveView or dashboard widget that glues together the various visual components.
- add_anchor(anchor)
Add the passed anchor to this model’s anchor list.
- Parameters:
anchor (Anchor) – The Anchor to add to the list.
Note
See AnchorList.add_anchor for more details.
- anchor_list: AnchorList
The AnchorList instance that manages all features/featuring necessary for the classifier.
- classifier: LogisticRegression
The underlying machine learning algorithm that learns based on the training data.
- compute_coverage()
Calculate the coverage of the current anchors on the current active data.
- Returns:
A dictionary where each key is the panel id of an anchor and each value is a dictionary with the statistics 'total', 'pos', 'neg', 'total_pct', 'pos_pct', and 'neg_pct'.
- Return type:
dict[str, dict[str, float | int]]
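The shape of this return value can be sketched in plain Python. This is only an illustration, not icat's implementation: the helper `coverage_stats`, the panel id `"panel_1"`, the notion that an anchor "covers" a row, and the choice of denominators for the percentage fields are all assumptions made for the example.

```python
from typing import Dict, List, Union

Stats = Dict[str, Union[int, float]]


def coverage_stats(covered: List[bool], positive: List[bool]) -> Stats:
    """Summarize one anchor's coverage over the active data.

    covered[i]  -- assumed: True when the anchor fires on row i
    positive[i] -- assumed: True when row i is in the positive class
    """
    n = len(covered)
    total = sum(covered)
    pos = sum(c and p for c, p in zip(covered, positive))
    neg = total - pos
    return {
        "total": total,
        "pos": pos,
        "neg": neg,
        # Assumed denominators: all active rows for total_pct,
        # covered rows for pos_pct and neg_pct.
        "total_pct": total / n if n else 0.0,
        "pos_pct": pos / total if total else 0.0,
        "neg_pct": neg / total if total else 0.0,
    }


# Hypothetical panel id mapped to its statistics, mirroring the documented shape.
coverage = {
    "panel_1": coverage_stats(
        covered=[True, True, False, True],
        positive=[True, False, False, True],
    )
}
```

Each inner dictionary carries the six documented keys, so code that consumes the result can rely on the same layout for every anchor.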
- data: DataManager
The DataManager instance that handles all labeling tasks and data filtering/sampling.
- feature_names(in_model_only=False)
Provides a list of the feature column names in use in the data manager.
- Parameters:
in_model_only (bool) – Only include anchors whose in_model value is True.
- Return type:
list[str]
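The effect of in_model_only can be pictured as a simple filter over the anchors. The sketch below uses a hypothetical `AnchorStub` class and assumes each anchor contributes one feature named after itself; the real feature column names may be derived differently.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class AnchorStub:
    """Hypothetical stand-in for an anchor; only the fields this sketch needs."""
    name: str
    in_model: bool = True


def feature_names_sketch(anchors: List[AnchorStub], in_model_only: bool = False) -> List[str]:
    # Keep every anchor unless in_model_only is set, in which case
    # anchors with in_model == False are skipped.
    return [a.name for a in anchors if a.in_model or not in_model_only]


anchors = [AnchorStub("sports"), AnchorStub("politics", in_model=False)]
all_names = feature_names_sketch(anchors)
model_names = feature_names_sketch(anchors, in_model_only=True)
```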
- featurize(data=None, normalize=False, normalize_reference=None)
Run the anchors - calculates the output features for each anchor and adds the corresponding “weights” column to the dataframe. These are the values that the classifier uses to make its predictions.
- Parameters:
data (pd.DataFrame) – The data to apply the anchors to. Uses the exploration data if not specified.
normalize (bool) – Whether to apply L1 normalization to the output values.
normalize_reference (Optional[pd.DataFrame]) – A different dataframe whose features to sum for the L1 norm. This is used when the model’s training data is featurized separately from the full dataset, since values normalized over just the training data would differ greatly from values normalized over the full set.
- Returns:
The passed data with the feature columns on it.
- Return type:
DataFrame
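To see why a separate normalization reference matters, here is a minimal sketch of column-wise L1 normalization in plain Python. It is not icat's code: the helper `l1_normalize`, the column name `anchor_a`, and the dict-of-lists data layout are assumptions chosen to keep the example self-contained.

```python
from typing import Dict, List, Optional

Columns = Dict[str, List[float]]


def l1_normalize(features: Columns, reference: Optional[Columns] = None) -> Columns:
    """Divide each feature column by the L1 norm of the matching reference column."""
    reference = features if reference is None else reference
    normalized = {}
    for name, values in features.items():
        norm = sum(abs(v) for v in reference[name])
        # Leave all-zero columns untouched to avoid dividing by zero.
        normalized[name] = [v / norm for v in values] if norm else list(values)
    return normalized


full = {"anchor_a": [1.0, 3.0, 0.0, 4.0]}  # hypothetical full dataset
train = {"anchor_a": [1.0, 3.0]}           # hypothetical training subset

own_norm = l1_normalize(train)                  # norm taken from train: 4.0
ref_norm = l1_normalize(train, reference=full)  # norm taken from full: 8.0
```

Normalizing the training rows against their own sum gives values twice as large as normalizing against the full set, so a classifier fit on the former would see differently scaled features when predicting on the full data.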
- fit()
Featurize the current data and fit the model to it.
- is_seeded()
Determine if there are enough labels in the training data to train the model with.
- Returns:
False if the label column doesn’t exist, there are fewer than 10 labeled points, or there is only one class of label.
- Return type:
bool
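The three conditions above can be sketched as a standalone check. This is an illustration under assumptions, not the library's implementation: the function name `is_seeded_sketch`, the column name `"label"`, the list-of-dicts data layout, and the use of None for unlabeled rows are all hypothetical. Only the threshold of 10 comes from the description.

```python
from typing import Dict, List

Row = Dict[str, object]


def is_seeded_sketch(rows: List[Row], label_col: str = "label", min_labels: int = 10) -> bool:
    # Condition 1: the label column does not exist on any row.
    if not any(label_col in row for row in rows):
        return False
    # Collect only rows that actually carry a label.
    labels = [row[label_col] for row in rows if row.get(label_col) is not None]
    # Condition 2: fewer than the minimum number of labeled points.
    if len(labels) < min_labels:
        return False
    # Condition 3: only one class of label is present.
    return len(set(labels)) > 1


unlabeled = [{"text": "a"} for _ in range(12)]
one_class = [{"text": "a", "label": 1} for _ in range(12)]
mixed = [{"text": "a", "label": i % 2} for i in range(12)]
```

Here `mixed` passes all three checks, while `unlabeled`, `one_class`, and any slice of `mixed` shorter than 10 rows each fail one of them.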
- is_trained()
- Return type:
bool
- static load(path)
Reload the model with all of the data and anchors from the specified location.
Example
    import icat
    m1 = icat.Model(my_data, text_col="text")
    m1.save("~/tmp/my_model")
    m2 = icat.Model.load("~/tmp/my_model")
- Parameters:
path (str) –
- Return type:
Model
- predict(data=None, inplace=True)
Run model’s classifier predictions on either the passed data or training data.
Note
This function, like sklearn, assumes the model has already been fit. (We have no strict check for this, as this is for IML and the classifier is assumed to be re-fit multiple times.)
- Parameters:
data (Optional[pd.DataFrame]) – The data to predict on. If not specified, the previously set training data is used.
inplace (bool) – Whether to operate directly on the passed data or create a copy of it.
- Returns:
The predictions for the passed data if provided, otherwise for the active data.
- Return type:
ndarray
- save(path)
Save the model and all associated data at the specified location.
- Parameters:
path (str) –
- text_col
The column in the dataframe with the text to explore.
- training_data: DataFrame
The rows (and only those rows) of the original data explicitly used for training.
- view: InteractiveView
The InteractiveView or dashboard widget that glues together the various visual components.