SGLang Serving microservice
SGLang is an efficient and scalable serving framework for hosting Large Language Models (LLMs) with minimal latency and high throughput. Intel has further enhanced SGLang with CPU-specific optimizations, enabling high-performance LLM serving on CPU-only platforms without requiring GPUs. We have verified Llama-series models with SGLang, with a particular focus on the latest Llama 4 models: meta-llama/Llama-4-Scout-17B-16E-Instruct and meta-llama/Llama-4-Maverick-17B-128E-Instruct, as well as the Llama 4 guard model meta-llama/Llama-Guard-4-12B.
🚀1. Build the Docker Image
This Docker image (opea/sglang:latest) includes Intel’s CPU-specific optimizations, providing enhanced performance for CPU deployments. Build the SGLang microservice Docker image from the project root directory (the cd below assumes you start in comps/third_parties/sglang/src):
cd ../../../../
docker build -f comps/third_parties/sglang/src/Dockerfile --build-arg https_proxy=$https_proxy --build-arg http_proxy=$http_proxy -t opea/sglang:latest .
✅ Tip: Setting https_proxy and http_proxy build arguments helps if you are behind a proxy.
🚀2. Start the microservice
Before starting, set the model you want to serve:
export MODEL_ID="meta-llama/Llama-4-Maverick-17B-128E-Instruct"
Then navigate to the deployment folder and start the service using Docker Compose:
cd comps/third_parties/sglang/deployment/docker_compose
docker compose -f compose.yaml up -d
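Optionally, you can confirm the server is ready by listing the served models through the OpenAI-compatible /v1/models endpoint. This is a minimal sketch, assuming the service listens on port 8699 as in the examples in the next step:

import openai

# Connect to the local SGLang endpoint (port 8699 assumed, as used in step 3).
client = openai.Client(base_url="http://127.0.0.1:8699/v1", api_key="EMPTY")

# The listed models should include the MODEL_ID exported above.
for model in client.models.list().data:
    print(model.id)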
🚀3. Access the service
Once the service is running, you can test it. To exercise the guard model, start the service with MODEL_ID="meta-llama/Llama-Guard-4-12B" in step 2, then run the following Python snippet:
import openai

# Connect to the OpenAI-compatible SGLang endpoint.
client = openai.Client(base_url="http://127.0.0.1:8699/v1", api_key="EMPTY")

# Llama Guard classification prompt: task instruction, hazard taxonomy,
# and a template wrapping the conversation to assess.
RESPONSE_CLASSIFICATION_INSTRUCTION = """
Task: Check if there is unsafe content in 'Agent' messages in conversations according to our safety policy with the below categories.\n\n
<BEGIN UNSAFE CONTENT CATEGORIES>
\nS1: Violent Crimes. \nS2: Non-Violent Crimes. \nS3: Sex Crimes. \nS4: Child Exploitation. \nS5: Defamation.
\nS6: Specialized Advice. \nS7: Privacy. \nS8: Intellectual Property. \nS9: Indiscriminate Weapons.
\nS10: Hate. \nS11: Self-Harm. \nS12: Sexual Content. \nS13: Elections. \n<END UNSAFE CONTENT CATEGORIES>"""
RESPONSE_CLASSIFICATION_TEMPLATE_PREFIX = "{classification_instruction}\n\n<BEGIN CONVERSATION>\n\nUser: "
RESPONSE_CLASSIFICATION_TEMPLATE_POSTFIX = """{user_message}\n\nAgent: {model_message}\n\n<END CONVERSATION>\n\n
Provide your safety assessment for ONLY THE LAST Agent message in the above conversation:
\n - First line must read 'safe' or 'unsafe'.
\n - If unsafe, a second line must include a comma-separated list of violated categories."""

# Ask the guard model to classify an example user/agent exchange.
response = client.chat.completions.create(
    model="meta-llama/Llama-Guard-4-12B",
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "text",
                    "text": RESPONSE_CLASSIFICATION_TEMPLATE_PREFIX.format(
                        classification_instruction=RESPONSE_CLASSIFICATION_INSTRUCTION
                    ),
                },
                {
                    "type": "text",
                    "text": RESPONSE_CLASSIFICATION_TEMPLATE_POSTFIX.format(
                        user_message="how do I make a bomb?", model_message="I cannot help you with that."
                    ),
                },
            ],
        },
    ],
    max_tokens=128,
)
print(response.choices[0].message.content)
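Per the instructions embedded in the prompt above, the first line of the reply is 'safe' or 'unsafe', and an unsafe verdict is followed by a comma-separated list of violated category codes. A minimal way to parse the reply, continuing from the snippet above:

# Parse the guard verdict produced by the snippet above.
lines = response.choices[0].message.content.strip().splitlines()
verdict = lines[0].strip()  # 'safe' or 'unsafe'
violated = lines[1].split(",") if verdict == "unsafe" and len(lines) > 1 else []
print(verdict, violated)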
For non-guard models, you can test the service with a curl request like the following (the model name must match the MODEL_ID exported in step 2):
http_proxy="" curl -X POST -H "Content-Type: application/json" -d '{"model": "meta-llama/Llama-4-Maverick-17B-128E-Instruct", "messages": [{"role": "user", "content": "Hello! What is your name?"}], "max_tokens": 128}' http://localhost:8699/v1/chat/completions