Experimental features are new and their interface and implementation may change at any time. Expect sharp edges.
Konduktor Serve is a powerful feature that simplifies deploying and managing ML models and general applications on Kubernetes. It provides two main deployment types:

  • vLLM (Aibrix) Deployments: Optimized for serving large language models with vLLM, featuring automatic horizontal scaling, tensor parallelism, and OpenAI-compatible API endpoints. For now, only single-node inference is supported. Accessible at <COMPANY>.trainy.us.
  • General Deployments: Deploy any containerized application with automatic horizontal scaling and health checks. Accessible at <COMPANY>2.trainy.us.

Launch a deployment

To launch a deployment, use the konduktor serve launch command shown below.
konduktor serve launch my_deployment.yaml
With this single command, Konduktor automatically creates the following resources:

VLLM

  • Deployment:
    • App Deployment
  • Service:
    • App Service
  • PodAutoscaler: (optional)
    • KPA (Knative-based Pod Autoscaler)

GENERAL

  • Deployment:
    • App Deployment
  • Service:
    • App Service
  • PodAutoscaler: (optional)
    • HPA (Horizontal Pod Autoscaler)
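If you have kubectl access to the cluster, you can inspect these resources directly. The commands below are a rough sketch; the namespace and the Aibrix PodAutoscaler resource name are assumptions, not something this page specifies.

# Deployment and Service created for the app (namespace is an assumption)
kubectl get deployments,services -n default

# Autoscalers, if enabled
kubectl get podautoscalers -n default   # vLLM (Aibrix) deployments; CRD name is an assumption
kubectl get hpa -n default              # General deployments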
Below is a basic, but incomplete, deployment YAML that shows the general idea of how to get started. The format is the same as the task.yamls used with konduktor launch for jobs, except serving adds an extra section for replicas, ports, and health endpoint probing. For full, detailed examples of deployment.yaml, see the bottom of this page.
name: incomplete-deployment-example

resources:
  cpus: 4
  memory: 32
  accelerators: H100:1
  ...

# specific to konduktor serve
serving:
  min_replicas: 1   # lower bound for autoscaling
  max_replicas: 4   # upper bound for autoscaling
  ports: 9000       # container port your app listens on (default is 8000)
  probe: /health    # path probed for health checks

run: |
  ...

Check status

To view your deployments, use the konduktor serve status command. Include --all-users or -u to see all deployments from all users in the cluster.
konduktor serve status
konduktor serve status --all-users
[Screenshot: Konduktor Serve Status]

Optionally, use --direct to display direct IP endpoints instead of trainy.us endpoints.
konduktor serve status --direct
Instead of passing --direct every time, you can set a permanent toggle in ~/.konduktor/config.yaml:
serving:                          # optional
  endpoint: {trainy, direct}      # defaults to trainy
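For example, to always show direct IP endpoints in konduktor serve status:
serving:
  endpoint: direct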
[Screenshot: Konduktor Serve Status (direct)]

Down a deployment

To delete a deployment, use the konduktor serve down command. Include --all or -a to down all deployments from all users in the cluster.
konduktor serve down <DEPLOYMENT_NAME>
konduktor serve down --all

Accessing Deployments

trainy.us endpoints use https, while direct IP endpoints use http. Requests through trainy.us time out after 60 seconds.

vLLM (Aibrix)

Completion API:
# For trainy.us endpoint access:
curl https://<COMPANY>.trainy.us/v1/completions \
-H "Content-Type: application/json" \
-d '{
    "model": "<DEPLOYMENT_NAME>",
    "prompt": "San Francisco is a",
    "max_tokens": 128,
    "temperature": 0
}'

# For direct IP endpoint access:
curl http://<DIRECT_IP>/v1/completions \
-H "Content-Type: application/json" \
-d '{
    "model": "<DEPLOYMENT_NAME>",
    "prompt": "San Francisco is a",
    "max_tokens": 128,
    "temperature": 0
}'
Output: top destination for tech companies, but it's also a hub for innovation and creativity. So, it's no surprise that the city has a vibrant food scene. From the iconic Golden Gate Bridge to the bustling streets of the Financial District, San Francisco offers a unique blend of culture, history, and modernity. When it comes to food, the city is known for its diverse cuisine, which reflects ...

Chat Completion API:
# For trainy.us endpoint access:
curl https://<COMPANY>.trainy.us/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
    "model": "<DEPLOYMENT_NAME>",
    "messages": [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Help me write a random number generator function in python"}
    ],
    "max_tokens": 128
}'

# For direct IP endpoint access:
curl http://<DIRECT_IP>/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
    "model": "<DEPLOYMENT_NAME>",
    "messages": [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Help me write a random number generator function in python"}
    ],
    "max_tokens": 128
}'
Output: Okay, so I need to help write a random number generator function in Python. Hmm, where do I start? I remember that Python has a module called random which provides functions for generating random numbers. So maybe I should use that. Let me think about what functions are available there.\n\nFirst, there's random.randint(a, b), which returns a random integer N between a and b, inclusive. That's useful. Then...
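Because the endpoints are OpenAI-compatible, listing the models behind a deployment should also be possible. vLLM's OpenAI-compatible server exposes /v1/models, but whether the gateway forwards this route is an assumption here:
# For trainy.us endpoint access:
curl https://<COMPANY>.trainy.us/v1/models

# For direct IP endpoint access:
curl http://<DIRECT_IP>/v1/models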

General

Basic API:
# For trainy.us endpoint access:
curl https://<COMPANY>.trainy.us/<DEPLOYMENT_NAME>

# For direct IP endpoint access:
curl -H "Host: <DEPLOYMENT_NAME>" http://<DIRECT_IP>
Output: Hello from Konduktor Serve!

Health Probe API:
# For trainy.us endpoint access:
curl https://<COMPANY>.trainy.us/<DEPLOYMENT_NAME>/health

# For direct IP endpoint access:
curl -H "Host: <DEPLOYMENT_NAME>" http://<DIRECT_IP>/health
Output: Hello from health probe!

Autoscaling

Use konduktor serve status to monitor the autoscaling process. Autoscaling can take a few minutes, especially on a cold start from 0 replicas.

Scale-Up Behavior

  • 0 —> 1: PA (Pod Autoscaler) triggers scale up immediately after the first request to the deployment endpoint.
  • 1 —> N: PA triggers scale up based on average request rate metrics collected over a 30-second window.

Scale-Down Behavior

vLLM (Aibrix) Deployments (stair-step scale-down):
  • N —> N-1: PA triggers scale down based on average request rate metrics collected over a 30-second window. Grace period of 30 mins per pod.
  • 1 —> 0: PA triggers scale down to zero replicas after 30 minutes of no requests to the model.
General Deployments (fast scale-down):
  • N —> 0: PA triggers a direct scale down to zero replicas after 20 minutes of no requests to the deployment.
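To take advantage of scale-to-zero, the deployment's lower replica bound presumably needs to be zero. A minimal sketch, assuming min_replicas: 0 is accepted (the fields come from the serving section shown earlier; the values are illustrative):
serving:
  min_replicas: 0   # assumption: allows the autoscaler to reach zero replicas
  max_replicas: 4
  ports: 9000
  probe: /health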

GPU Scheduling Behavior

Observed GKE Behavior:
  • GKE’s GPU scheduling can be inefficient and may not always utilize nodes optimally.
  • GKE spins up new nodes even when existing nodes have sufficient GPU capacity.

Example YAMLs

Schema

General

  • Simple (no autoscaling + default port (8000) + no health probing)
  • Complex (autoscaling + custom port + custom health probing endpoint)
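The full example files are not reproduced on this page. As a hedged sketch of the Complex General case listed above, using only fields shown earlier on this page (the port, probe path, resource sizes, and run command are illustrative assumptions):
name: general-complex-example

resources:
  cpus: 4
  memory: 32

serving:
  min_replicas: 1
  max_replicas: 4
  ports: 9000       # custom port instead of the default 8000
  probe: /healthz   # custom health probing endpoint (illustrative)

run: |
  # hypothetical app that listens on port 9000 and answers GET /healthz
  python start_server.py --port 9000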

vLLM (Aibrix)

  • Simple (no autoscaling + default port (8000) + single GPU)
  • Complex (autoscaling + custom port + multi GPU)
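Likewise, a hedged sketch of the Complex vLLM (Aibrix) case listed above (autoscaling + custom port + multi GPU). The model name and resource sizes are illustrative; vllm serve, --port, and --tensor-parallel-size are standard vLLM CLI options, but the exact run command is an assumption, not taken from this page:
name: vllm-complex-example

resources:
  cpus: 16
  memory: 128
  accelerators: H100:8   # multi GPU on a single node (tensor parallelism)

serving:
  min_replicas: 1
  max_replicas: 4
  ports: 9000            # custom port instead of the default 8000
  probe: /health         # vLLM's OpenAI-compatible server serves /health

run: |
  # model name is illustrative; tensor parallel size matches the 8 GPUs requested
  vllm serve meta-llama/Llama-3.1-8B-Instruct \
    --port 9000 \
    --tensor-parallel-size 8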