Understanding real-time model adaptability
Orca is powered by contextual embeddings in an external database, ensuring that your model uses the most appropriate data during inference.
Step 1: Teach the model to use context
Traditional neural networks learn to make predictions by memorizing the distribution of training data. However, this training data can fall out of date, compromising the model's effectiveness.
When you set up a classifier or any other deep-learning model using Orca, the model learns to adapt its outputs based on context you choose to supply.
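How that context gets wired into the network depends on your implementation and Orca's tooling; as a rough, hypothetical illustration of the general idea (not Orca's actual API), here is a minimal PyTorch sketch of a classifier whose forward pass takes both an input embedding and a set of retrieved memory embeddings, so the network learns to condition its predictions on supplied context:

```python
import torch
import torch.nn as nn

class ContextAwareClassifier(nn.Module):
    """Toy classifier that learns to condition predictions on retrieved context."""

    def __init__(self, embed_dim: int, num_classes: int):
        super().__init__()
        # Attend over the supplied memory embeddings, then classify.
        self.attn = nn.MultiheadAttention(embed_dim, num_heads=4, batch_first=True)
        self.head = nn.Linear(embed_dim * 2, num_classes)

    def forward(self, x: torch.Tensor, memories: torch.Tensor) -> torch.Tensor:
        # x: (batch, embed_dim); memories: (batch, k, embed_dim) retrieved per example
        query = x.unsqueeze(1)
        context, _ = self.attn(query, memories, memories)      # (batch, 1, embed_dim)
        features = torch.cat([x, context.squeeze(1)], dim=-1)  # input + context
        return self.head(features)

# During training, every example is paired with context retrieved from an external
# store, so the network learns to use that context rather than memorize it.
model = ContextAwareClassifier(embed_dim=128, num_classes=3)
logits = model(torch.randn(8, 128), torch.randn(8, 5, 128))    # (8, 3)
```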
Step 2: Access data during inference
Because the behavior is now encoded in the neural network, an Orca-trained model leverages that context during inference, just like a generative model using retrieval augmentation.
Now, any inference run reacts almost instantly to new information you supply to the model.
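Conceptually, inference then becomes a similarity lookup against the external store followed by a normal forward pass. The snippet below is a simplified stand-in using a brute-force cosine search over random embeddings; Orca's actual retrieval and storage layer are not shown:

```python
import numpy as np

def retrieve_memories(query_emb: np.ndarray, memory_embs: np.ndarray, k: int = 5) -> np.ndarray:
    """Return the k memory embeddings most similar to the query (cosine similarity)."""
    q = query_emb / np.linalg.norm(query_emb)
    m = memory_embs / np.linalg.norm(memory_embs, axis=1, keepdims=True)
    top_k = np.argsort(m @ q)[-k:][::-1]
    return memory_embs[top_k]

memory_embs = np.random.randn(1000, 128)   # stand-in for the external memory store
query_emb = np.random.randn(128)
context = retrieve_memories(query_emb, memory_embs)
# The retrieved context is fed to the trained model alongside the input, so adding
# or editing memories changes behavior immediately, with no retraining.
```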
Step 3: Monitor. Update. Swap
Once your model learns to use Orca, you can proactively manage memories to address data drift and other dynamic changes. You can even do complete swaps of context for customized or personalized variants of a model.
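As a rough sketch (reusing the hypothetical retrieve_memories helper from the previous snippet; the tenant names and memory sets are invented for illustration), swapping context can be as simple as pointing the same trained model at a different memory set:

```python
import numpy as np

# Hypothetical per-tenant memory sets; the trained model's weights never change.
memory_sets = {
    "default":    np.random.randn(1000, 128),   # stand-in for general-purpose memories
    "customer_a": np.random.randn(200, 128),    # stand-in for a personalized variant
}

def context_for(query_emb: np.ndarray, tenant: str = "default") -> np.ndarray:
    # Choosing a different tenant swaps the context the model sees -- no retraining.
    return retrieve_memories(query_emb, memory_sets[tenant], k=5)
```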
FAQs
Let's dive into the details.
Installation and Set-up
Orca works with discriminative models, including:
- Image classifiers
- Ranking models (including within recommendation systems)
- Text classifiers
- More model types coming soon!
Implementing an Orca model typically requires about the same amount of engineering effort as accessing a foundational LLM via an API call. However, just like training or fine-tuning any model, or setting up a high-performing RAG pipeline, a model's accuracy is highly dependent on the available data. As a result, the overall time needed to get started with Orca depends on the quality of the data available for your use case. Retrieval-augmented models are also comparatively faster to iterate on when data wrangling is required, because you don't need to wait for a training cycle to complete to see whether new or different data improved your model's accuracy.
The Orca team will work closely with you during both data gathering and preparation phases to optimize your model’s performance. Activities may include:
- Developing Synthetic Data: Creating synthetic data to fill gaps in your existing data. Teams often use this synthetic data initially and gradually replace it with real data as it becomes available, refining the model's performance over time.
- Labeling Unlabeled Data: Accelerating data labeling through a combination of (1) auto-labeling new data points and (2) flagging lower-confidence outliers for human review and labeling by your team (or a third party); a sketch of this routing logic follows this list. We frequently use examples you provide as memories, then use our own Orca-enabled classifier to create labels. This approach makes auto-labeling more effective and accurate as we start with, or begin to acquire, more real-world examples.
- Testing and Curating Memory Data: Evaluating and refining memory data to ensure accurate and unbiased distributions.
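The split between auto-labeled points and points routed for human review is typically driven by prediction confidence. The routing logic below is a generic, hypothetical sketch (the threshold and function name are illustrative, not Orca's tooling):

```python
import numpy as np

def route_for_labeling(probs: np.ndarray, threshold: float = 0.9):
    """Auto-accept high-confidence predictions; queue the rest for human review.

    probs: (n_samples, n_classes) predicted class probabilities.
    """
    confidence = probs.max(axis=1)
    auto_labeled = np.where(confidence >= threshold)[0]   # accept the model's label
    needs_review = np.where(confidence < threshold)[0]    # outliers / low confidence
    return auto_labeled, needs_review

probs = np.array([[0.97, 0.03], [0.55, 0.45], [0.92, 0.08]])
auto, review = route_for_labeling(probs)   # -> indices [0, 2] auto, [1] for review
```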
Technically, you can build an Orca model by populating its initial memories entirely with synthetic data. However, if that data deviates in distribution from the real world, you may find that production accuracy is lower than initial test results suggest. To manage this risk, you could:
- Expand a limited dataset by supplementing the existing distribution with synthetic data.
- Proactively re-tune the memories as you gather new data. This approach is especially helpful if you can't collect real-world data until your app is in production.
In situations where you have limited data, we recommend that you prioritize that data for your evaluation dataset and leverage synthetic data for your model's first memories. Taking this approach ensures you can evaluate and optimize model performance against ground truth, enabling a higher level of accuracy. It also helps you avoid accidental biases in a synthetic evaluation set.
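In code terms, this recommendation amounts to reserving your scarce real examples for evaluation and seeding the first memories from synthetic data. A minimal sketch, assuming a placeholder generator for whatever synthetic-data process you use:

```python
import random

def make_synthetic_examples(n: int):
    """Placeholder for whatever synthetic-data process you and Orca agree on."""
    return [(f"synthetic example {i}", "label") for i in range(n)]

# Hypothetical: a small set of labeled real-world examples.
real_examples = [(f"real example {i}", "label") for i in range(200)]
random.shuffle(real_examples)

# Reserve the scarce real data for evaluation so metrics reflect ground truth...
eval_set = real_examples

# ...and seed the model's first memories with synthetic data instead.
initial_memories = make_synthetic_examples(5_000)
```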
Yes. Orca does not require you to change your current tech stack. This includes your existing data storage solution along with the tooling you use for monitoring or evaluating your ML models. Orca’s embedding database and memory-tuning tools sit alongside your existing solutions.
Depending on the exact use case, Orca may increase your model's accuracy against a static benchmark. Orca performs better in static evaluations when you have very dense data (such as images) or challenges with skewed datasets (e.g., toxicity detection). In other cases, a model using retrieval augmentation initially reaches parity with a traditional classifier or ranking model, and the augmented model then maintains that performance across changes to inputs and desired outputs.
It's important to be realistic, however. A toy model using retrieval augmentation with very limited data in its memory simply won't be able to compete with a state-of-the-art model trained on a large, rich dataset. When comparing comparably sized models, retrieval-augmented models achieve the same initial accuracy and then maintain that accuracy even as the data drifts over time.
Model Performance
In more dynamic scenarios, retrieval-augmented models maintain better performance over time than a traditional deep learning model. You can also optimize a retrieval-augmented model to perform well across multiple test datasets (for example, when you have different preferences on how to classify certain inputs).
In a static scenario, where models have access to well-curated training data, retrieval-augmented models and traditional deep learning models have comparable levels of accuracy. You'll notice this when measuring these two types of models against benchmark datasets.
A retrieval-augmented classifier built with Orca utilizes the same concepts as injecting context into an LLM through RAG. Both approaches ensure that the model can access the latest information and provide customized responses based on data injected during inference.
Both types of models make sense in specific use cases. LLMs customized using RAG are most effective in generative use cases such as content creation or expanding and structuring documents from text inputs. In these applications, an LLM's ability to generate text merits the increased compute and architectural complexity of these massive models.
Retrieval-augmented classifiers and rankers solve the discriminative use cases described in their names (classification and ranking) that require the ability to customize or quickly update the model. Using a retrieval-augmented classifier has several significant advantages over leveraging a generative model masquerading as a classifier:
- Scope can be broader than natural language processing, so you can support use cases like image classification.
- Purpose-built models typically achieve similar (or better) accuracy with lower latency and at lower cost than a large language model.
The single best way to improve the performance of a retrieval-augmented model is by continuing to optimize the data that populates the model’s memory. By actively collecting more data and updating both your evaluation data and the data stored in your model’s memory, your model becomes more effective at matching your goals and reflecting ground-truth.
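Operationally, this can be a simple append-and-re-evaluate loop: as newly labeled examples arrive, add most of them to the memory store, hold some back to refresh your evaluation set, and re-run your existing evaluation. A hypothetical sketch of that loop:

```python
def fold_in_new_data(memories: list, eval_set: list, new_examples: list, holdout: float = 0.2):
    """Add newly labeled examples to memory, holding a slice back to refresh evaluation."""
    cut = int(len(new_examples) * holdout)
    eval_set.extend(new_examples[:cut])    # keep evaluation data current as well
    memories.extend(new_examples[cut:])    # new memories take effect at the next inference
    return memories, eval_set

# After each update, re-run your existing evaluation; no retraining cycle is required.
```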
Managing a Production Instance
Orca does offer the option of full hosting, including inference, for retrieval-augmented models that you build with us. However, Orca can also run in your dedicated cloud or on-premises setup if you prefer that for security reasons.
If you're willing to set up an Orca instance within your own cloud environment or on-premises, absolutely. Otherwise, we can help you set up proxy datasets for your model's memories, but that is operationally more complex.
Orca does not train any models with your data, unless we have contractually agreed to build a custom embedding model as part of your Enterprise engagement with us.
Right now, the data you store as memories in Orca acts as an external reference that your model uses to boost performance.
Yes, Orca has scaling characteristics very similar to a conventional database, meaning it can scale to nearly arbitrary read and write volumes, although at some point you will need to employ standard storage and database scalability techniques to get there (e.g., sharding, partitioning, or read replicas).
Yes, accessing external data with Orca introduces some latency compared to a model of similar size that relies solely on encoded training data. However, this latency penalty, typically a few milliseconds, is generally imperceptible to end users.
In some cases, Orca can actually reduce overall application latency compared to injecting context through a re-purposed generative model. By shifting to a smaller, more efficient model that maintains accuracy and is context-aware, Orca may require less compute per input and enable a simpler, faster architecture.
Find out if Orca is right for you
Speak to our ML engineers to see if we can help you create more consistent models.