Experimental features are new and their interface and implementation may change at any time. Expect sharp edges.
Konduktor Serve is a powerful feature that simplifies deploying and managing ML models and general applications on Kubernetes. It provides two main deployment types:

  • General Deployments: Deploy any containerized application with automatic scaling (based on CPU usage) and health checks.
  • vLLM (AIBrix) Deployments: Optimized for serving large language models with vLLM, featuring automatic scaling (based on GPU usage), tensor parallelism, and OpenAI-compatible API endpoints. For now, only single-node inference is supported.

By default, all deployments are accessible at your own custom endpoint, such as <COMPANY>.trainy.us. By modifying ~/.konduktor/config.yaml, you can also opt out of .trainy.us endpoints and configure Konduktor Serve to use direct IP endpoints only:
serving:
  endpoint: direct

Launch a deployment

To launch a deployment, use the konduktor serve launch command shown below.
konduktor serve launch my_deployment.yaml
In this single command, Konduktor automatically creates the following Kubernetes objects:
  1. A deployment
  2. A Service for external access (General: a LoadBalancer Service; vLLM: a ClusterIP Service plus the AIBrix Gateway)
  3. An autoscaler (optional)
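Assuming you have kubectl access to the cluster, you can verify these objects with standard kubectl commands (the resource names below are illustrative):
kubectl get deployments,services,hpa
kubectl describe deployment <DEPLOYMENT_NAME>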
Below is a basic but incomplete deployment YAML that shows the general idea of how to get started. The format is the same as the task.yaml files used with konduktor launch for jobs, except that a deployment includes an extra serving section for replicas, ports, and health-endpoint probing. For full, detailed examples of deployment.yaml files, see the bottom of this page.
name: incomplete-deployment-example

resources:
  cpus: 4
  memory: 32
  accelerators: H100:1
  ...

# specific to konduktor serve
serving: 
  min_replicas: 1
  max_replicas: 4
  ports: 9000
  probe: /health

run: |
  ...

Check status

To view your deployments, use the konduktor serve status command. Include --all-users or -u to see all deployments from all users in the cluster.
konduktor serve status
konduktor serve status --all-users
If a .trainy.us endpoint is unavailable, or you opted into direct IP endpoints in ~/.konduktor/config.yaml, konduktor serve status will display direct IP endpoints instead. Optionally, use --direct to display direct IP endpoints instead of .trainy.us endpoints.
konduktor serve status --direct

Down a deployment

To delete a deployment, use the konduktor serve down command. Include --all or -a to down all deployments from all users in the cluster.
konduktor serve down <DEPLOYMENT_NAME>
konduktor serve down --all

Accessing Deployments

Note: .trainy.us endpoints use https while direct IP endpoints use http.

General

Basic API:
curl https://<ENDPOINT>
Output: Hello from Konduktor Serve!
Health Probe API:
curl https://<ENDPOINT>/health
Output: Hello from health probe!

vLLM (AIBrix)

Completion API:
curl -v https://<ENDPOINT>/v1/completions \
-H "Content-Type: application/json" \
-d '{
    "model": "<MODEL_NAME>",
    "prompt": "San Francisco is a",
    "max_tokens": 128,
    "temperature": 0
}'
Output: top destination for tech companies, but it's also a hub for innovation and creativity. So, it's no surprise that the city has a vibrant food scene. From the iconic Golden Gate Bridge to the bustling streets of the Financial District, San Francisco offers a unique blend of culture, history, and modernity. When it comes to food, the city is known for its diverse cuisine, which reflects ...
Chat Completion API:
curl https://<ENDPOINT>/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
    "model": "<MODEL_NAME>",
    "messages": [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Help me write a random number generator function in python"}
    ]
}'
Output: Okay, so I need to help write a random number generator function in Python. Hmm, where do I start? I remember that Python has a module called random which provides functions for generating random numbers. So maybe I should use that. Let me think about what functions are available there.\n\nFirst, there's random.randint(a, b), which returns a random integer N between a and b, inclusive. That's useful. Then...
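Because the deployment exposes an OpenAI-compatible API, you should also be able to list the served models, assuming the standard /v1/models route is passed through by the gateway:
curl https://<ENDPOINT>/v1/models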

Example YAMLs

Schema

General

  • Simple (no autoscaling + default port (8000) + no health probing)
  • Complex (autoscaling + custom port + custom health probing endpoint); a sketch follows below
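As a rough sketch of the complex case, the YAML below reuses only the fields shown earlier on this page; the server command, port, and probe path are illustrative assumptions rather than a definitive template:
name: general-complex-example

resources:
  cpus: 4
  memory: 16

# specific to konduktor serve
serving:
  min_replicas: 1
  max_replicas: 4
  ports: 9000
  probe: /health

run: |
  # hypothetical app: an HTTP server listening on port 9000 with a /health route
  python3 server.py --port 9000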

vLLM (AIBrix)

  • Simple (no autoscaling + default port (8000) + single GPU)
  • Complex (autoscaling + custom port + multi GPU); a sketch follows below
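As a rough sketch of the multi-GPU case, the YAML below assumes the run section launches a standard vLLM OpenAI-compatible server; the model name, port, and vLLM flags are illustrative assumptions, so adapt them to your model and cluster:
name: vllm-complex-example

resources:
  cpus: 8
  memory: 64
  accelerators: H100:2

# specific to konduktor serve
serving:
  min_replicas: 1
  max_replicas: 4
  ports: 9000
  probe: /health

run: |
  # hypothetical launch of a vLLM OpenAI-compatible server across 2 GPUs
  vllm serve <MODEL_NAME> --port 9000 --tensor-parallel-size 2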