**Author**

[ftian1](https://github.com/ftian1), [lvliang-intel](https://github.com/lvliang-intel), [hshen14](https://github.com/hshen14), [mkbhanda](https://github.com/mkbhanda), [irisdingbj](https://github.com/irisdingbj), [KfreeZ](https://github.com/kfreez), [zhlsunshine](https://github.com/zhlsunshine) **Edit Here to add your id**

**Status**

Under Review

**Objective**

Provide a clear design for users to deploy their own GenAI applications in Docker or Kubernetes environments.

**Motivation**

This RFC presents the OPEA deployment-related design for community discussion.

**Design Proposal**

Refer to the [OPEA overall architecture design document](24-05-16-OPEA-001-Overall-Design.md). The proposed OPEA deployment workflow is shown below.

*(Figure: Deployment workflow)*

We provide two interfaces for deploying GenAI applications:

1. Docker deployment via Python

Here is a Python example for constructing a RAG (Retrieval-Augmented Generation) application:

```python
from comps import MicroService, ServiceOrchestrator


class ChatQnAService:
    def __init__(self, port=8080):
        self.service_builder = ServiceOrchestrator(port=port, endpoint="/v1/chatqna")

    def add_remote_service(self):
        # Declare the four remote microservices that make up the RAG pipeline.
        embedding = MicroService(
            name="embedding",
            port=6000,
            expose_endpoint="/v1/embeddings",
            use_remote_service=True,
        )
        retriever = MicroService(
            name="retriever",
            port=7000,
            expose_endpoint="/v1/retrieval",
            use_remote_service=True,
        )
        rerank = MicroService(
            name="rerank",
            port=8000,
            expose_endpoint="/v1/reranking",
            use_remote_service=True,
        )
        llm = MicroService(
            name="llm",
            port=9000,
            expose_endpoint="/v1/chat/completions",
            use_remote_service=True,
        )
        # Register the services and wire them into an
        # embedding -> retrieval -> reranking -> llm flow.
        self.service_builder.add(embedding).add(retriever).add(rerank).add(llm)
        self.service_builder.flow_to(embedding, retriever)
        self.service_builder.flow_to(retriever, rerank)
        self.service_builder.flow_to(rerank, llm)
```

2. Kubernetes deployment using YAML

Here is a YAML example for constructing a RAG (Retrieval-Augmented Generation) application:

```yaml
opea_micro_services:
  embedding:
    endpoint: /v1/embeddings
    port: 6000
  retrieval:
    endpoint: /v1/retrieval
    port: 7000
  reranking:
    endpoint: /v1/reranking
    port: 8000
  llm:
    endpoint: /v1/chat/completions
    port: 9000

opea_mega_service:
  port: 8080
  mega_flow:
    - embedding >> retrieval >> reranking >> llm
```

This YAML acts as a unified language interface for end users to define their GenAI applications. When deploying a GenAI application, the YAML configuration file should be converted to an appropriate [Docker Compose](https://docs.docker.com/compose/) file or [GenAI Microservice Connector (GMC)](https://github.com/opea-project/GenAIInfra/tree/main/microservices-connector) custom resource file.

Note: OPEA will provide a conversion tool that translates the unified language interface into a Docker Compose file or a GMC custom resource.
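For illustration only, below is a minimal sketch of the Docker Compose file such a conversion tool might emit for the unified YAML above. The image names and the mega-service entry are assumptions made for this sketch, not part of the proposal:

```yaml
# Hypothetical converter output for the unified YAML above.
# Image names are placeholders; the real conversion tool may differ.
services:
  embedding:
    image: opea/embedding:latest   # placeholder image name
    ports:
      - "6000:6000"
  retrieval:
    image: opea/retriever:latest   # placeholder image name
    ports:
      - "7000:7000"
  reranking:
    image: opea/reranking:latest   # placeholder image name
    ports:
      - "8000:8000"
  llm:
    image: opea/llm:latest         # placeholder image name
    ports:
      - "9000:9000"
  chatqna-megaservice:
    image: opea/chatqna:latest     # placeholder; serves the mega flow on port 8080
    ports:
      - "8080:8080"
    depends_on:
      - embedding
      - retrieval
      - reranking
      - llm
```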
A sample GMC [Custom Resource](https://kubernetes.io/docs/concepts/extend-kubernetes/api-extension/custom-resources/) is shown below:

```yaml
apiVersion: gmc.opea.io/v1alpha3
kind: GMConnector
metadata:
  labels:
    app.kubernetes.io/name: gmconnector
  name: chatqna
  namespace: gmcsample
spec:
  routerConfig:
    name: router
    serviceName: router-service
  nodes:
    root:
      routerType: Sequence
      steps:
      - name: Embedding
        internalService:
          serviceName: embedding-service
          config:
            endpoint: /v1/embeddings
      - name: TeiEmbedding
        internalService:
          serviceName: tei-embedding-service
          config:
            gmcTokenSecret: gmc-tokens
            hostPath: /root/GMC/data/tei
            modelId: BAAI/bge-base-en-v1.5
            endpoint: /embed
          isDownstreamService: true
      - name: Retriever
        data: $response
        internalService:
          serviceName: retriever-redis-server
          config:
            RedisUrl: redis-vector-db
            IndexName: rag-redis
            tei_endpoint: tei-embedding-service
            endpoint: /v1/retrieval
      - name: VectorDB
        internalService:
          serviceName: redis-vector-db
          isDownstreamService: true
      - name: Reranking
        data: $response
        internalService:
          serviceName: reranking-service
          config:
            tei_reranking_endpoint: tei-reranking-service
            gmcTokenSecret: gmc-tokens
            endpoint: /v1/reranking
      - name: TeiReranking
        internalService:
          serviceName: tei-reranking-service
          config:
            gmcTokenSecret: gmc-tokens
            hostPath: /root/GMC/data/rerank
            modelId: BAAI/bge-reranker-large
            endpoint: /rerank
          isDownstreamService: true
      - name: Llm
        data: $response
        internalService:
          serviceName: llm-service
          config:
            tgi_endpoint: tgi-service
            gmcTokenSecret: gmc-tokens
            endpoint: /v1/chat/completions
      - name: Tgi
        internalService:
          serviceName: tgi-service
          config:
            gmcTokenSecret: gmc-tokens
            hostPath: /root/GMC/data/tgi
            modelId: Intel/neural-chat-7b-v3-3
            endpoint: /generate
          isDownstreamService: true
```

After deployment, a `gmconnectors.gmc.opea.io` CR named `chatqna` should be available in the `gmcsample` namespace, as shown below:

```bash
$ kubectl get gmconnectors.gmc.opea.io -n gmcsample
NAME      URL                                                       READY     AGE
chatqna   http://router-service.gmcsample.svc.cluster.local:8080    Success   3m
```

The user can then access the application pipeline via the value of the `URL` field above (a hypothetical request sketch is given at the end of this document).

The whole deployment process is illustrated by the diagram below.

*(Figure: Deployment Process)*

**Alternatives Considered**

[KServe](https://github.com/kserve/kserve) provides [InferenceGraph](https://kserve.github.io/website/0.9/modelserving/inference_graph/); however, it only supports inference services and lacks deployment support.

**Compatibility**

n/a

**Miscs**

- TODO List:

  - [ ] one-click deployment on AWS, GCP, and Azure cloud
  - [ ] static cloud resource allocator vs. dynamic cloud resource allocator
  - [ ] k8s GMC with Istio
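For reference, a hypothetical invocation of the deployed ChatQnA pipeline through the GMC router `URL` shown earlier; the request payload schema here is an assumption for illustration and may differ from the actual pipeline API:

```bash
# Hypothetical request; the real payload schema is defined by the deployed pipeline.
curl http://router-service.gmcsample.svc.cluster.local:8080 \
  -X POST \
  -H 'Content-Type: application/json' \
  -d '{"text": "What is OPEA?", "parameters": {"max_new_tokens": 128}}'
```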