24-06-21-OPEA-001-Guardrails-Gateway

Guardrails Gateway

Author

zhxie, Forrest-zhao, ruijin-intel

Status

Under Review

Objective

Deploy opt-in guardrails in the gateway of the deployment environment.

Motivation

  • Reduce latency in network transmission and protocol encoding/decoding.

  • Support stateful guardrails.

  • Enhance Observability.

  • Leverage OpenVINO for AI acceleration instructions, including AVX, AVX-512, and AMX.

Design Proposal

Inference In Place

The LangChain-like workflow is presented below.

graph LR
    Entry(Entry)-->Gateway
    Gateway-->Embedding
    Embedding-->Gateway
    Gateway-->Retrieve
    Retrieve-->Gateway
    Gateway-->Rerank
    Rerank-->Gateway
    Gateway-->LLM
    LLM-->Guardrails
    Guardrails-->LLM
    LLM-->Gateway

All services communicate through RESTful API calls, which adds overhead in network transmission and protocol encoding/decoding. Early studies have shown that each hop adds about 3 ms of latency, and even more when mTLS is enabled for security in inter-node deployments.

The opt-in guardrails in the gateway work in the architecture given below.

graph LR Entry(Entry)-->Gateway["Gateway\nGuardrails"] Gateway-->Embedding Embedding-->Gateway Gateway-->Retrieve Retrieve-->Gateway Gateway-->Rerank Rerank-->Gateway Gateway-->LLM LLM-->Gateway

The gateway can host multiple guardrails without extra network transmission or protocol encoding/decoding. In a real-world deployment, there may be many guardrails covering different aspects, and the gateway is the best place to provide them for the system.

The gateway consists of two basic components: the inference runtime and the guardrails.

graph TD
    Gateway---Runtime[Inference Runtime API]
    Runtime---OpenVINO
    Runtime---PyTorch
    Runtime---Others[...]
    Gateway---Guardrails
    Guardrails---Load[Load Model]
    Guardrails---Inference
    Guardrails---Access[Access Control]

A unified inference runtime API provides a general interface for inference runtimes; any inference runtime, including OpenVINO, can be integrated into the system. The guardrails leverage the inference runtime to decide whether a request/response is valid.
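To make the split between the inference runtime API and the guardrails concrete, here is a minimal Python sketch. The class and method names (InferenceRuntime, OpenVINORuntime, Guardrail, load_model, infer) and the single-score output convention are illustrative assumptions, not the final API; the OpenVINO calls follow its public Python API but may need adjustment per version.

# Illustrative sketch only; class and method names are assumptions, not the final API.
from abc import ABC, abstractmethod
from typing import Any, Dict

import numpy as np


class InferenceRuntime(ABC):
    """General interface that any inference runtime can implement."""

    @abstractmethod
    def load_model(self, model_path: str) -> None:
        ...

    @abstractmethod
    def infer(self, inputs: Dict[str, Any]) -> np.ndarray:
        """Run one inference and return the primary output tensor."""


class OpenVINORuntime(InferenceRuntime):
    """Example backend built on the OpenVINO Python API (exact usage may vary by version)."""

    def load_model(self, model_path: str) -> None:
        import openvino as ov
        core = ov.Core()
        self.compiled = core.compile_model(core.read_model(model_path), "CPU")

    def infer(self, inputs: Dict[str, Any]) -> np.ndarray:
        results = self.compiled(inputs)
        return results[self.compiled.output(0)]


class Guardrail:
    """Uses an inference runtime to decide whether a request/response is valid."""

    def __init__(self, runtime: InferenceRuntime, threshold: float = 0.5):
        self.runtime = runtime
        self.threshold = threshold

    def is_valid(self, features: Dict[str, Any]) -> bool:
        # Assumed convention: the model emits a single "unsafe" probability.
        score = float(self.runtime.infer(features).flatten()[0])
        return score < self.threshold

Other backends (PyTorch, etc.) would plug in the same way by implementing the InferenceRuntime interface.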

Stateful Guardrails

The traditional workflow from ingress to egress is presented below.

flowchart LR
    Entry(Entry)-->GuardrailsA
    GuardrailsA["Guardrails\nAnti-Jailbreaking"]-->Embedding
    Embedding-->Retrieve
    Retrieve-->Rerank
    Rerank-->LLM
    LLM-->GuardrailsB["Guardrails\nAnti-Profanity"]

A guardrails service provides protection for the LLM, such as anti-jailbreaking and anti-poisoning on the input side, anti-toxicity and factuality checking on the output side, and PII detection on both sides.

Guardrails can also be split into two types: stateless and stateful. Guards such as anti-jailbreaking, anti-toxicity, and PII detection are considered stateless, since they do not rely on both the prompt input and the response output, while anti-hallucination is regarded as a stateful guard because it needs both the input and the output to evaluate the relation between them.
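The distinction can be expressed as two guard interfaces. The sketch below assumes a Python-style API with illustrative names only: a stateless guard judges one message in isolation, while a stateful guard needs the prompt/response pair.

# Sketch of the two guard types; interface names here are illustrative.
from abc import ABC, abstractmethod


class StatelessGuard(ABC):
    """Judges a single message in isolation (e.g. anti-jailbreaking, PII detection)."""

    @abstractmethod
    def check(self, text: str) -> bool:
        """Return True if the message passes the guard."""


class StatefulGuard(ABC):
    """Needs both sides of the exchange (e.g. anti-hallucination)."""

    @abstractmethod
    def check(self, prompt: str, response: str) -> bool:
        """Return True if the response is consistent with the prompt and its context."""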

The Guardrails Microservice provides certain guardrails as a microservice, but due to the limitations of the microservice model, it is not able to correlate requests with their responses, making it difficult to provide stateful guard ability.

The opt-in guardrails in the gateway work in the architecture given below.

flowchart LR
    Entry(Entry)-->GuardrailsA
    subgraph Gateway
        GuardrailsA["Guardrails\nAnti-Jailbreaking"]-->GuardrailsC
        GuardrailsB-->GuardrailsC
    end
    GuardrailsC["Guardrails\nAnti-Hallucination"]-->Embedding
    Embedding-->Retrieve
    Retrieve-->Rerank
    Rerank-->LLM
    LLM-->GuardrailsB["Guardrails\nAnti-Profanity"]

As an alternative, the gateway also provides guardrails ability, whether stateful or stateless.
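A hypothetical sketch of how the gateway could provide stateful guards, reusing the guard interfaces sketched above: the prompt is recorded at ingress and matched with the response at egress by request ID. This is illustrative only; the actual gateway implementation (e.g. the Envoy HTTP filter in the TODO list below) may differ.

# Hypothetical sketch only: the gateway remembers the prompt at ingress and
# matches it with the response at egress, so stateful guards see both sides.
from typing import Dict, List


class GatewayGuardrails:
    def __init__(self, input_guards: List["StatelessGuard"],
                 output_guards: List["StatelessGuard"],
                 stateful_guards: List["StatefulGuard"]):
        self.input_guards = input_guards
        self.output_guards = output_guards
        self.stateful_guards = stateful_guards
        self._pending: Dict[str, str] = {}  # request_id -> prompt

    def on_request(self, request_id: str, prompt: str) -> bool:
        # Stateless input guards (e.g. anti-jailbreaking) run immediately.
        if not all(g.check(prompt) for g in self.input_guards):
            return False
        # Keep the prompt so stateful guards can use it when the response arrives.
        self._pending[request_id] = prompt
        return True

    def on_response(self, request_id: str, response: str) -> bool:
        prompt = self._pending.pop(request_id, "")
        # Stateless output guards (e.g. anti-profanity) plus stateful guards
        # (e.g. anti-hallucination) that also need the original prompt.
        return (all(g.check(response) for g in self.output_guards)
                and all(g.check(prompt, response) for g in self.stateful_guards))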

Observability

Envoy is the most popular proxy in the cloud-native ecosystem. It provides out-of-the-box access logs, stats, and metrics, and integrates naturally with observability platforms such as OpenTelemetry and Prometheus.

Guardrails in the gateway will leverage these observability capabilities to meet potential regulatory and compliance needs.
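For illustration, the sketch below shows how guardrail verdicts and model latency could be exported to Prometheus from a standalone Python prototype using the prometheus_client package; when the guardrails run inside Envoy, its built-in stats and access logs would be used instead. The metric names and labels here are assumptions, not a defined schema.

# Illustrative only: exporting guardrail verdicts and model latency to Prometheus
# from a standalone Python prototype. Metric names and labels are assumptions.
from prometheus_client import Counter, Histogram, start_http_server

GUARD_DECISIONS = Counter(
    "guardrails_decisions_total",
    "Guardrail verdicts made by the gateway",
    ["guard", "verdict"],  # e.g. guard="anti_jailbreaking", verdict="blocked"
)
GUARD_LATENCY = Histogram(
    "guardrails_inference_seconds",
    "Time spent running a guardrail model",
    ["guard"],
)


def record(guard_name: str, passed: bool, seconds: float) -> None:
    verdict = "passed" if passed else "blocked"
    GUARD_DECISIONS.labels(guard=guard_name, verdict=verdict).inc()
    GUARD_LATENCY.labels(guard=guard_name).observe(seconds)


if __name__ == "__main__":
    start_http_server(9102)  # endpoint for Prometheus to scrape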

Multi-Services Deployment

Let’s say the embedding and LLM services are AI-powered and require guardrails protection.

The opt-in gateway can be deployed as a gateway or as sidecar services.

graph LR
    Entry(Entry)-->Embedding
    subgraph SidecarA[Sidecar]
        Embedding
    end
    Embedding-->Retrieve
    Retrieve-->Rerank
    Rerank-->LLM
    subgraph SidecarB[Sidecar]
        LLM
    end

The gateway can also work with guardrails microservices.

graph LR
    Entry(Entry)-->GuardrailsC["Guardrails\nAnti-Hallucination"]
    GuardrailsC-->GuardrailsA["Guardrails\nAnti-Jailbreaking"]
    GuardrailsA-->Embedding
    Embedding-->Retrieve
    Retrieve-->Rerank
    Rerank-->GuardrailsB["Guardrails\nAnti-Jailbreaking"]
    GuardrailsB-->LLM
    LLM-->GuardrailsD["Guardrails\nAnti-Profanity"]
    subgraph Gateway
        GuardrailsD-->GuardrailsC
    end

Alternatives Considered

Guardrails Microservice: provides certain guardrails; however, it only supports stateless guardrails.

Compatibility

N/A

Miscs

  • TODO

    • [ ] API definitions for meta service deployment and Kubernetes deployment

    • [ ] Envoy inference framework and guardrails HTTP filter