Note that some models require authentication with a Hugging Face token, which can be provided through a konduktor secret (see the more complex example here). The model deepseek-ai/DeepSeek-R1-Distill-Llama-8B does not require one.

Prerequisites

Current Working Directory

$ ls
deployment.yaml

Launching

$ konduktor serve launch deployment.yaml

deployment.yaml

# no autoscaling + default port (8000) + single GPU
name: serving-vllm-simple

resources:
  cpus: 4
  memory: 32
  accelerators: A100:1
  image_id: vllm/vllm-openai:v0.7.1
  labels:
    kueue.x-k8s.io/queue-name: user-queue

serving:
  min_replicas: 1

run: |
  python3 -m vllm.entrypoints.openai.api_server \
    --model deepseek-ai/DeepSeek-R1-Distill-Llama-8B \
    --max-model-len 4096
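
Once the deployment is running, the vLLM container exposes an OpenAI-compatible HTTP API on the default port 8000. Below is a minimal sketch of querying it, assuming the service has been made reachable locally (for example via port-forwarding); the localhost address is a placeholder for your deployment's actual endpoint.

# Minimal sketch: query the OpenAI-compatible API served by vLLM.
# ENDPOINT is a placeholder -- replace it with your deployment's address
# (e.g. after port-forwarding the service to localhost:8000).
import requests

ENDPOINT = "http://localhost:8000"  # placeholder address, not created by this guide

response = requests.post(
    f"{ENDPOINT}/v1/chat/completions",
    json={
        "model": "deepseek-ai/DeepSeek-R1-Distill-Llama-8B",
        "messages": [{"role": "user", "content": "Say hello in one sentence."}],
        "max_tokens": 64,
    },
    timeout=60,
)
response.raise_for_status()
print(response.json()["choices"][0]["message"]["content"])

An equivalent request can be sent with curl or with the official openai Python client by pointing its base_url at the same /v1 endpoint.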