Introduction

The Interactive Corpus Analysis Tool (ICAT) is a Python library for creating dashboards to explore textual datasets and to build simple binary classification models that help filter those datasets and focus on entries of interest. The tool uses a form of interactive machine learning (IML), a paradigm of “machine teaching” [1] that sits at the intersection of human-computer interaction (HCI), visual analytics, and machine learning. The intent of ICAT is to allow subject matter experts (SMEs) with little or no experience in machine learning to benefit from an iterative human-in-the-loop (HITL) approach to building their own model, without needing to understand the details of the underlying algorithm. This interactivity is achieved by letting the user create features, label data points, and visually manipulate a representation of the features to manually cluster and investigate data, while a model is trained on the fly based on these actions. ICAT is built on top of the Panel [2] library, using a combination of Vega, a custom IPyWidget using D3, and ipyvuetify, and is intended to be used inside a Jupyter environment.
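As a rough illustration of the loop described above, the following minimal sketch shows user-defined keyword features and user-supplied labels driving a small binary classifier that is retrained after each action. It uses only the Python standard library and does not use ICAT's API: the names `featurize`, `train`, `predict`, and `retrain`, the example corpus, and the choice of a hand-written logistic regression are all assumptions made for illustration.

```python
import math

def featurize(text, features):
    """Return one count per feature: how often its keywords occur in the text."""
    tokens = text.lower().split()
    return [sum(tokens.count(kw) for kw in keywords) for keywords in features.values()]

def train(rows, labels, epochs=200, lr=0.5):
    """Fit a tiny logistic regression with plain stochastic gradient descent."""
    weights = [0.0] * len(rows[0])
    bias = 0.0
    for _ in range(epochs):
        for x, y in zip(rows, labels):
            z = bias + sum(w * xi for w, xi in zip(weights, x))
            err = 1.0 / (1.0 + math.exp(-z)) - y
            weights = [w - lr * err * xi for w, xi in zip(weights, x)]
            bias -= lr * err
    return weights, bias

def predict(x, weights, bias):
    """Return the predicted probability that an entry is 'interesting'."""
    z = bias + sum(w * xi for w, xi in zip(weights, x))
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical corpus and a single user-created keyword feature.
corpus = [
    "the battery drains quickly and the charger overheats",
    "great recipe for lemon cake",
    "battery life improved after the firmware update",
    "my cat loves this scratching post",
]
features = {"power": ["battery", "charger"]}
labels = {}  # entry index -> 1 (interesting) or 0 (uninteresting)

def retrain():
    """Refit the model on whatever the user has labeled so far."""
    idx = sorted(labels)
    rows = [featurize(corpus[i], features) for i in idx]
    return train(rows, [labels[i] for i in idx])

# Simulated user actions: label two entries, then retrain and rescore everything.
labels[0] = 1
labels[1] = 0
weights, bias = retrain()
scores = [predict(featurize(text, features), weights, bias) for text in corpus]
```

In this sketch, each new label or edited keyword list would be followed by another call to `retrain()`, so the scores for the unlabeled entries update as the user works. A real dashboard would wire these steps to interface events rather than to direct assignments.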

Statement of Need

Machine teaching promises to democratize machine learning algorithms and give people who are not machine learning experts the ability to train, manipulate, and work with models themselves [1]. Traditionally, an SME obtains a model to aid their data analysis through a time-consuming iterative loop: the SME first communicates the problem space and data to a machine learning (ML) expert, who experiments and trains a model; the SME then tests it and identifies any issues or insufficiently learned concepts, which must be communicated back to the ML expert, and the cycle repeats. Ideally, an effective HITL training process involves the SME more directly, dramatically shortening this loop and benefiting from the SME’s implicit knowledge and experience. IML seeks to provide this process through mechanisms such as feature selection (interactive featuring) and model steering (interactive labeling) [3].

This is a challenging space for several reasons. The efficacy of an IML system depends heavily on the design of the interface itself, in addition to the underlying machine learning models and the many considerations they entail. Incorporating effective user experience design principles and understanding the mental models of users as they explore and use the interface are therefore crucial. Both quantitative and qualitative metrics must include the human element, so any research seeking to demonstrate the measured value or efficacy of an IML interface must incorporate user studies [4]. A positive user experience also constrains algorithmic design in terms of speed and efficiency: an underlying model that takes minutes to train is frustrating to interact with [5]. Care must also be taken not to treat the user like a Mechanical Turk or a mindless oracle for the model to query endlessly [6] [7].

Despite these challenges, there is tremendous potential for IML to empower SMEs and allow them to benefit from machine learning in their work. For the field to grow and realize this potential, a great deal more research and work is required. Our work draws heavily on the IML interface concepts proposed by Suh and colleagues in 2019 [8], and as of this writing no other open-source package implements their visuals or overall interface. ICAT seeks to fill this gap so that other researchers can explore, build on, and compare against the concepts discussed below to further the state of the field.