When it comes to detecting toxic content, e.g., for content moderation, the challenge isn't as straightforward as it might seem. Real-world data is notoriously imbalanced—most online content is non-toxic, with only a small percentage of toxic comments. In fact, only 8% of the human chat messages in the dataset used for this post are considered toxic. This imbalance amplifies a classic machine learning problem: How do you build a model that is good at finding comparatively rare toxic content without a lot of false positives?
Toxic content-detection models often need to strike a delicate balance. False positives (wrongly labeling non-toxic content as toxic) are frustrating and can disrupt healthy conversations. False negatives (missing toxic content), on the other hand, are even more serious, especially in scenarios like online communities where toxic comments can harm the user experience and tarnish your brand.
In this post, we show that Orca Retrieval Augmented Classifiers offer a robust solution to this challenging modeling scenario.
Benchmark Setup
For this comparison, we leveraged the Jigsaw Unintended Bias in Toxicity Classification dataset from Kaggle, which has been used widely in the community for evaluating toxic content detection models. As previously mentioned, this dataset is heavily imbalanced—92% of the data is non-toxic, and only 8% is toxic.
To benchmark the models, we evaluated two approaches:
- Conventional Classifier: A simple single-layer classification head placed on top of a pre-trained embedding model (sketched in code below).
- Orca Retrieval Augmented Classifier (RAC): This approach combines a pre-trained embedding model with a classification-specific retrieval-augmentation technique to improve model performance. Essentially, this model has a memory of previously seen content (usually with human labels) that it refers back to when making decisions about new content. For simplicity, we used the original training set as the model’s memories.
Both models use the same underlying pre-trained embedding model: GTE-Base from Hugging Face, a highly efficient model with under 200M parameters (i.e., this is not a RAG’ed LLM).
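To make the first approach concrete, the sketch below shows what a conventional baseline of this kind roughly looks like: a frozen embedding model with a single trainable linear head on top. This is a simplified illustration rather than our exact benchmark code; the model ID, hyperparameters, and toy data are assumptions.

```python
# Simplified sketch of the conventional baseline: a single linear classification
# head trained on top of frozen GTE-Base embeddings (illustrative, not our exact code).
import torch
import torch.nn as nn
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("thenlper/gte-base")  # GTE-Base from Hugging Face
head = nn.Linear(768, 2)                             # GTE-Base embeddings are 768-dim

optimizer = torch.optim.AdamW(head.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

def train_step(texts, labels):
    """One gradient step on the head; the embedding model itself stays frozen."""
    with torch.no_grad():
        vecs = torch.tensor(embedder.encode(texts))  # (batch, 768)
    logits = head(vecs)                              # (batch, 2)
    loss = loss_fn(logits, torch.tensor(labels))     # labels: list of 0/1 ints
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Toy usage:
print(train_step(["great point, thanks!", "you are an idiot"], [0, 1]))
```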
Benchmark Results
Here’s a quick look at how the models performed:
| Model | Accuracy (Untrained) | ROC AUC (Untrained) | Accuracy (Finetuned) | ROC AUC (Finetuned) |
| --- | --- | --- | --- | --- |
| Orca RAC | 93.8% | 0.86 | 93.8% | 0.90 |
| Conventional Classifier | N/A | N/A | 91.8% | 0.42 |
Note that the accuracy reported is over both classes (toxic vs. non-toxic), not the recall of toxic content alone.
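For reference, accuracy is threshold-dependent while ROC AUC measures how well the model ranks toxic above non-toxic content regardless of threshold; both can be computed with scikit-learn as sketched below. The values in this toy example are made up, not benchmark outputs.

```python
# Toy illustration of the two metrics (values are made up, not benchmark outputs).
import numpy as np
from sklearn.metrics import accuracy_score, roc_auc_score

y_true = np.array([0, 0, 0, 0, 1])             # ground truth: mostly non-toxic
y_score = np.array([0.1, 0.4, 0.2, 0.6, 0.8])  # predicted probability of "toxic"

y_pred = (y_score >= 0.5).astype(int)          # threshold into hard labels
print(accuracy_score(y_true, y_pred))          # accuracy over both classes
print(roc_auc_score(y_true, y_score))          # threshold-free ranking quality
```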
Right out of the box, the pre-trained Orca RAC offers strong performance with no finetuning required. Its 93.8% accuracy and 0.86 ROC AUC without any training on this specific scenario show that even a default setup can deliver reliable results. After finetuning, the ROC AUC jumps to 0.90, indicating it balances false positives and false negatives better than the conventional classifier.
Contrast this with the baseline model: after training, its accuracy is a respectable 91.8%, but its ROC AUC lags far behind at just 0.42, below the 0.5 expected from random guessing, suggesting it struggles with the imbalanced data and can barely differentiate toxic from non-toxic content at all.
One thing that makes the Orca RAC powerful is its robustness to imbalanced data: little or no hyperparameter tuning (e.g., decision thresholds) or dataset pre-processing (e.g., resampling) is needed to get good performance.
Essentially, this is achieved by the model referencing its memories only in a local sense, i.e., consulting only the memories it considers relevant to the current input. Because each prediction depends on a small, relevant neighborhood rather than the global class distribution, a global imbalance has much less impact.
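To make the intuition concrete, here is a deliberately simplified nearest-neighbor analogy. It is not Orca's actual implementation, and the memory texts are made up; the point is only that the prediction for a new input is driven by the handful of labeled memories most similar to it, not by the overall toxic/non-toxic ratio.

```python
# Simplified nearest-neighbor analogy for "local" memory lookup
# (illustrative only, not Orca's actual implementation).
import numpy as np
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("thenlper/gte-base")

# Hypothetical memory bank: previously seen content with human labels (1 = toxic).
memory_texts = [
    "thanks, that was a really helpful answer",
    "you are an idiot and everyone knows it",
    "great point, I hadn't thought of that",
    "go away, nobody wants you here",
]
memory_labels = np.array([0, 1, 0, 1])
memory_vecs = embedder.encode(memory_texts, normalize_embeddings=True)

def toxicity_score(text, k=2):
    """Score a new input using only its k most similar memories."""
    vec = embedder.encode([text], normalize_embeddings=True)[0]
    sims = memory_vecs @ vec              # cosine similarity (embeddings are normalized)
    nearest = np.argsort(-sims)[:k]       # only the most relevant memories are consulted
    return memory_labels[nearest].mean()  # fraction of toxic neighbors

print(toxicity_score("what a stupid, worthless comment"))
```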
Tackling Data Drift
Toxicity isn't a static problem. What’s considered toxic evolves over time, and models need to be robust to this kind of data drift. For example, the original Jigsaw dataset skews towards political discourse. But what happens when the model is exposed to a different kind of content?
We tested the trained models by mixing into the test set samples from a different data source: largely apolitical Wikipedia page comments, which have a different distribution of toxicity.
Specifically, the new test set consists of roughly 80% examples from the original dataset the models were finetuned on and 20% samples from the new Wikipedia-derived data. It contains about 18% toxic content (vs. 8% for the original test set).
The RAC model’s memories were updated with examples from this new source, with no overlap with the test data; neither model received any additional finetuning.
The results were telling:
| Model | Accuracy | ROC AUC |
| --- | --- | --- |
| Orca RAC | 92.8% | 0.94 |
| Conventional Classifier | 50% | 0.48 |
The Orca RAC model proves significantly more resilient to changes in data distribution, maintaining an impressive 0.94 ROC AUC despite no training on the new dataset. Meanwhile, the conventional classifier, which was finetuned on the original data, stayed stuck in the past, producing essentially random guesses with only 50% accuracy and a 0.48 ROC AUC.
This resilience to drift is a huge advantage in real-world deployments, where behaviors and patterns shift constantly, not just around toxicity, but in many applications.
Conventional models often break down when confronted with data they weren’t trained on, requiring costly retraining and recalibration. With Orca, however, the model remains stable and effective.
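The mechanism behind that stability is simple to picture. As noted above, adapting the RAC to the Wikipedia-derived content only required updating its memories; in the nearest-neighbor analogy from the earlier sketch, such an update is just an append to the memory bank, with no gradient updates anywhere. The self-contained, hypothetical sketch below illustrates the idea.

```python
# Illustrative only: adapting to new content by appending labeled memories,
# with no retraining of any model weights.
import numpy as np
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("thenlper/gte-base")

memory_vecs = np.empty((0, 768), dtype=np.float32)  # existing memory bank (empty here for brevity)
memory_labels = np.empty((0,), dtype=int)

def add_memories(texts, labels):
    """Embed new labeled examples and append them to the memory bank."""
    global memory_vecs, memory_labels
    new_vecs = embedder.encode(texts, normalize_embeddings=True)
    memory_vecs = np.vstack([memory_vecs, new_vecs])
    memory_labels = np.concatenate([memory_labels, np.asarray(labels)])

# Hypothetical examples from the new, Wikipedia-like distribution
# (kept separate from the test split, as in the benchmark).
add_memories(
    ["please add a citation for this claim", "this whole article is garbage and so are you"],
    [0, 1],
)
print(memory_vecs.shape, memory_labels)
```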
Implications for the Future
The Orca Retrieval Augmented Classifier opens up exciting new possibilities in toxic content detection:
- Less human oversight: By continuously learning from new data, Orca can reduce the need for constant human review. For instance, once the model learns to flag a specific kind of toxic comment, moderators don’t have to repeatedly review similar cases.
- Easier setup, data-centric approach: Orca simplifies deployment. Instead of spending time tweaking models, you can focus on gathering high-quality data, knowing that the model will adapt and improve as it sees more examples.
- Better explainability and transparency: One standout feature of Orca is its ability to show which reference data points it used to make a classification. This helps with explainability: users can see why the model made a certain decision, which is critical for building trust in AI systems (a rough sketch of this idea follows the list).
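Continuing the simplified nearest-neighbor analogy from earlier (again, this is an illustration, not Orca's actual API), this kind of explainability falls out naturally: the retrieved memories themselves are the evidence behind a prediction.

```python
# Illustrative only: the retrieved memories double as the explanation.
import numpy as np
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("thenlper/gte-base")

# Same hypothetical memory bank as in the earlier sketch.
memory_texts = [
    "thanks, that was a really helpful answer",
    "you are an idiot and everyone knows it",
    "great point, I hadn't thought of that",
    "go away, nobody wants you here",
]
memory_labels = np.array([0, 1, 0, 1])
memory_vecs = embedder.encode(memory_texts, normalize_embeddings=True)

def score_with_evidence(text, k=2):
    """Return a toxicity score plus the memories that produced it."""
    vec = embedder.encode([text], normalize_embeddings=True)[0]
    sims = memory_vecs @ vec
    nearest = np.argsort(-sims)[:k]
    score = float(memory_labels[nearest].mean())
    evidence = [(memory_texts[i], int(memory_labels[i]), float(sims[i])) for i in nearest]
    return score, evidence

score, evidence = score_with_evidence("what a stupid, worthless comment")
print(score)
for text, label, sim in evidence:
    print(f"  label={label} sim={sim:.2f} text={text!r}")
```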
While more advanced techniques for a “conventional” (i.e., non-augmented) classifier can sometimes yield better results than our comparison model, they come with significant complexity and fragility. The Orca RAC model, on the other hand, offers high performance straight out of the box (finetuning is one line of code, with no parameter tuning needed), along with greater robustness and tunability, especially in the face of shifting data patterns.
As the battle against toxic content continues to evolve, Orca offers a scalable, adaptable approach to the underlying modeling problem that requires minimal setup and delivers consistent results—without breaking the bank on infrastructure or retraining efforts.