Konduktor Serve Launch Deployment Yaml

Schema

name: <string>                    # required

envs:                             # optional
  key: value
workdir: <string>                 # optional

resources:
  cpus: <float>                   # required
  memory: <float>                 # required
  # must be vllm/vllm-openai:v0.7.1 for VLLM DEPLOYMENTS
  image_id: <string>              # required
  accelerators: <string>          # optional
  labels:                           
    kueue.x-k8s.io/queue-name: <string>  # required

serving:
  # if min_replicas != max_replicas, autoscaling is enabled automatically
  min_replicas: <int>             # required
  max_replicas: <int>             # optional; defaults to min_replicas
  ports: <int>                    # optional; defaults to 8000
  # GENERAL DEPLOYMENTS ONLY
  probe: <string>                 # optional for general deployments; defaults to None (no health probing)
                                  # EXCLUDE COMPLETELY for VLLM DEPLOYMENTS

file_mounts:                      # optional
  /remote/path: ./local/path

run: |                            # required; VLLM DEPLOYMENTS ONLY
  python3 -m vllm.entrypoints.openai.api_server \
    --model <string> \
    --max-model-len <int> \
    --tensor-parallel-size <int>  # required when GPUs > 1; otherwise exclude
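
Putting the schema together, a minimal vLLM deployment might look like the following. The model name, resource sizes, accelerator string, and queue name are all illustrative assumptions, not required values:

```yaml
# Illustrative example -- model, sizes, and queue name are assumptions.
name: llama31-8b-serve

resources:
  cpus: 8
  memory: 32
  image_id: vllm/vllm-openai:v0.7.1   # required image for vLLM deployments
  accelerators: H100:1                # assumed accelerator string
  labels:
    kueue.x-k8s.io/queue-name: user-queue

serving:
  min_replicas: 1
  max_replicas: 3    # != min_replicas, so autoscaling is enabled
  # probe is excluded for vLLM deployments

run: |
  python3 -m vllm.entrypoints.openai.api_server \
    --model meta-llama/Llama-3.1-8B-Instruct \
    --max-model-len 8192
  # --tensor-parallel-size omitted since only 1 GPU is requested
```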

Details

General
  min_replicas (required), max_replicas (optional)
    • if min_replicas != max_replicas, autoscaling is enabled automatically
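
For contrast with the vLLM case below, a minimal general deployment sketch. The name, image, port, probe path, and sizes are illustrative assumptions:

```yaml
# Illustrative general deployment -- names, image, and sizes are assumptions.
name: my-web-app

resources:
  cpus: 2
  memory: 4
  image_id: python:3.11-slim
  labels:
    kueue.x-k8s.io/queue-name: user-queue

serving:
  min_replicas: 2
  max_replicas: 2    # equal to min_replicas, so no autoscaling
  ports: 8080        # overrides the default of 8000
  probe: /healthz    # general deployments may set a health probe path

run: |
  python3 -m http.server 8080
```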
vLLM (Aibrix)
  resources: image_id (required)
    • only vllm/vllm-openai:v0.7.1 (or another version of the vllm/vllm-openai image) is supported by the OpenAI API
  min_replicas (required), max_replicas (optional)
    • if min_replicas != max_replicas, autoscaling is enabled automatically
  probe (exclude)
    • only /health is supported by the OpenAI API, so exclude this field for simplicity; it will default to /health
run (required)
  • python3 -m vllm.entrypoints.openai.api_server (required)
    • --model (required)
      • some models, like Llama 3.1, require authentication through a Hugging Face token, which can be passed into the deployment using konduktor secrets
      • ex. konduktor secret create --kind=env --inline HUGGING_FACE_HUB_TOKEN=hf_ABC123 my-hf-token
    • --max-model-len (required)
    • --tensor-parallel-size (required when GPUs > 1; otherwise optional)
  • See the vLLM documentation for more info on vllm.entrypoints.openai.api_server
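
Once a vLLM deployment is running, it serves the standard OpenAI-compatible HTTP API on the configured port. A sketch of querying it, assuming the service is reachable at localhost:8000 and serves the model shown (both are assumptions about your deployment, not fixed values):

```shell
# Assumes the deployment is reachable at localhost:8000 and serves this model.
curl http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "meta-llama/Llama-3.1-8B-Instruct",
        "prompt": "Hello",
        "max_tokens": 16
      }'
```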