Continuous Active Learning (CAL) in VenioOne is a feature that automatically classifies documents using machine learning techniques. This article provides details on the underlying technology and a high-level overview of its process.
Technology Used
VenioOne's CAL implementation is built on a clean .NET conversion of LIBSVM version 3.18, enhanced with an improved stemming component. LIBSVM is a widely used library for Support Vector Machines (SVMs), and it serves as the core engine for document classification in CAL.
CAL transforms text into numeric features, trains an SVM model, and generates predictions along with confidence (probability) scores.
High-Level Flow of CAL
The CAL process involves the following steps:
- Inputs: A set of labeled training documents, a set of unlabeled documents to classify, and configuration settings (such as tokenizer rules, stop words, feature limits, and confidence threshold).
- Preprocessing: The tokenizer splits text into tokens, removes stop words and numbers, and can optionally apply stemming.
- Feature Selection: Calculates document frequency (DF), removes terms that are too rare or too common, and limits the total number of features to a configured cap.
- Vectorization: Builds sparse TF-IDF vectors for each document (term frequency weighted by inverse document frequency).
- Training: Runs a parallel grid search to find the best SVM hyperparameters (the penalty parameter C and the kernel parameter gamma), then trains the final SVM with probability scoring enabled.
- Prediction: Outputs the predicted class plus per-class probability scores. If the model’s confidence is below the threshold, the result is labeled “unknown.”
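The preprocessing, feature selection, vectorization, and thresholding steps above can be illustrated with a minimal Python sketch. This is not VenioOne's code: every name, setting, and sample value below is hypothetical, and the TF-IDF variant shown (raw term count times ln(N / DF)) is an assumption, since this article does not specify the exact weighting. SVM training and the grid search are left out because they depend on the LIBSVM engine itself, and stemming is omitted for brevity.

```python
import math
import re
from collections import Counter

# Hypothetical settings; names and values are illustrative only,
# not actual VenioOne configuration keys.
STOP_WORDS = {"the", "a", "is", "of", "and", "to"}
MIN_DF = 2                  # drop terms found in fewer documents than this
MAX_DF_RATIO = 0.9          # drop terms found in more than this share of documents
MAX_FEATURES = 1000         # cap on the number of features
CONFIDENCE_THRESHOLD = 0.6  # below this, a prediction becomes "unknown"


def tokenize(text):
    """Lowercase, keep runs of letters only (so digits are dropped), remove stop words."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [t for t in tokens if t not in STOP_WORDS]


def select_features(token_lists):
    """Return {term: document frequency} for terms that pass the DF filters."""
    n_docs = len(token_lists)
    df = Counter(term for tokens in token_lists for term in set(tokens))
    kept = {t: c for t, c in df.items()
            if c >= MIN_DF and c / n_docs <= MAX_DF_RATIO}
    # Assumed ranking rule: keep the highest-DF terms up to the cap.
    top = sorted(kept.items(), key=lambda kv: (-kv[1], kv[0]))[:MAX_FEATURES]
    return dict(top)


def tfidf_vector(tokens, df, n_docs):
    """Sparse TF-IDF vector as {term: weight}, with idf = ln(n_docs / df)."""
    tf = Counter(t for t in tokens if t in df)
    return {t: count * math.log(n_docs / df[t]) for t, count in tf.items()}


def apply_threshold(probabilities):
    """Return the most probable class, or "unknown" if it is below the threshold."""
    label, p = max(probabilities.items(), key=lambda kv: kv[1])
    return label if p >= CONFIDENCE_THRESHOLD else "unknown"


# Made-up sample documents.
docs = [
    "The contract terms and the payment schedule",
    "Payment of the invoice is overdue",
    "Lunch menu for the 2024 office party",
    "Contract renewal and invoice dispute",
]
token_lists = [tokenize(d) for d in docs]
df = select_features(token_lists)
vectors = [tfidf_vector(tokens, df, len(docs)) for tokens in token_lists]

print(sorted(df))   # terms that survived feature selection
print(vectors[0])   # sparse vector for the first document
# Made-up class probabilities standing in for SVM output.
print(apply_threshold({"responsive": 0.55, "not_responsive": 0.45}))
```

In this sketch, a document whose terms are all filtered out (the third sample) ends up with an empty vector, and a prediction whose top probability is 0.55 is reported as "unknown" because it falls below the assumed 0.6 threshold.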
For documentation beyond the user guides, or for specific questions about CAL, please contact Venio Systems support.