LVM Microservice¶
Visual Question Answering (VQA) is one of the multimodal tasks empowered by LVMs (Large Vision Models). This microservice supports visual Q&A using LLaVA as the default base model. It accepts two inputs, a prompt and an image, and returns the answer to the prompt about the image.
Overview¶
Users can configure and deploy LVM-related services based on their specific requirements. This microservice supports a variety of backend implementations, each tailored to different performance, hardware, and model needs, allowing for flexible integration into diverse GenAI workflows.
Key Features¶
- **Multimodal Interaction**: Natively supports question answering over various visual inputs, including images and videos.
- **Flexible Backends**: Integrates with multiple state-of-the-art LVM implementations, such as LLaVA, LLaMA-Vision, Video-LLaMA, and more.
- **Scalable Deployment**: Ready for deployment with Docker, Docker Compose, and Kubernetes, ensuring scalability from local development to production environments.
- **Standardized API**: Provides a consistent and simple API endpoint that abstracts the complexities of the different underlying models (see the client sketch after this list).
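To illustrate the request/response shape, here is a minimal Python client sketch. The host, port (9399), route (/v1/lvm), and payload/response field names follow common OPEA deployment examples but are assumptions here; adjust them to match the backend you deploy.

```python
# Hypothetical client sketch: send a prompt plus a base64-encoded image
# to an LVM microservice and print the answer.
# The port (9399), route (/v1/lvm), and field names ("image", "prompt")
# are assumptions -- check the documentation of your chosen backend.
import base64

import requests

with open("example.jpg", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode("utf-8")

response = requests.post(
    "http://localhost:9399/v1/lvm",
    json={"image": image_b64, "prompt": "What is shown in this picture?"},
    timeout=60,
)
response.raise_for_status()
print(response.json())  # e.g. {"text": "..."}, depending on the backend
```

Video-oriented backends such as Video-LLaMA typically expect different input fields for video content; consult the documentation of the specific implementation.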
Supported Implementations¶
The LVM Microservice supports multiple implementation options. Select the one that best fits your use case and follow the linked documentation for detailed setup instructions.
| Implementation | Description | Documentation |
| --- | --- | --- |
| With LLaVA | A general-purpose VQA service using the LLaVA model. | |
| With TGI LLaVA | LLaVA service accelerated by TGI, optimized for Intel Gaudi HPUs. | |
| With LLaMA-Vision | VQA service leveraging the LLaMA-Vision model. | |
| With Video-LLaMA | A specialized service for performing VQA on video inputs. | |
| With vLLM | High-throughput LVM serving accelerated by vLLM on Intel Gaudi HPUs. | |
| With PredictionGuard | LVM service using Prediction Guard with built-in safety features. | |