LVM Microservice¶
Visual Question Answering (VQA) is one of the multimodal tasks empowered by LVMs (Large Vision Models). This microservice supports visual Q&A using LLaVA as the default base model. It accepts two inputs, a prompt and an image, and returns the answer to the prompt about the image.
Overview¶
Users can configure and deploy LVM-related services based on their specific requirements. This microservice supports a variety of backend implementations, each tailored to different performance, hardware, and model needs, allowing for flexible integration into diverse GenAI workflows.
Key Features¶
- **Multimodal Interaction**: Natively supports question answering over various visual inputs, including images and videos.
- **Flexible Backends**: Integrates with multiple state-of-the-art LVM implementations, such as LLaVA, LLaMA-Vision, Video-LLaMA, and more.
- **Scalable Deployment**: Ready for deployment with Docker, Docker Compose, and Kubernetes, ensuring scalability from local development to production environments.
- **Standardized API**: Provides a consistent and simple API endpoint that abstracts the complexities of the different underlying models (see the client sketch after this list).
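To illustrate the request/response shape, here is a minimal Python client sketch. The host, port (9399), route (/v1/lvm), and payload/response field names follow common OPEA deployment examples but are assumptions here; adjust them to match the backend you deploy.

```python
# Hypothetical client sketch: send a prompt plus a base64-encoded image
# to an LVM microservice and print the answer.
# The port (9399), route (/v1/lvm), and field names ("image", "prompt")
# are assumptions -- check the documentation of your chosen backend.
import base64

import requests

with open("example.jpg", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode("utf-8")

response = requests.post(
    "http://localhost:9399/v1/lvm",
    json={"image": image_b64, "prompt": "What is shown in this picture?"},
    timeout=60,
)
response.raise_for_status()
print(response.json())  # e.g. {"text": "..."}, depending on the backend
```

Video-oriented backends such as Video-LLaMA typically expect different input fields for video content; consult the documentation of the specific implementation.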
Supported Implementations¶
The LVM Microservice supports multiple implementation options. Select the one that best fits your use case and follow the linked documentation for detailed setup instructions.
| Implementation | Description | Documentation |
| --- | --- | --- |
| With LLaVA | A general-purpose VQA service using the LLaVA model. | |
| With TGI LLaVA | LLaVA service accelerated by TGI, optimized for Intel Gaudi HPUs. | |
| With LLaMA-Vision | VQA service leveraging the LLaMA-Vision model. | |
| With Video-LLaMA | A specialized service for performing VQA on video inputs. | |
| With vLLM | High-throughput LVM serving accelerated by vLLM on Intel Gaudi HPUs. | |
| With PredictionGuard | LVM service using Prediction Guard with built-in safety features. | |