Experimental features are new and their interface and implementation may change at any time. Expect sharp edges.
<COMPANY>.trainy.us
General Deployments: Deploy any containerized application with automatic horizontal scaling and health checks. Accessible at <COMPANY>2.trainy.us
Launch a deployment
To launch a deployment, use thekonduktor serve launch
command shown below.
VLLM
- Deployment:
- App Deployment
- Service:
- App Service
- PodAutoscaler: (optional)
- KPA (Knative-based Pod Autoscaler)
GENERAL
- Deployment:
- App Deployment
- Service:
- App Service
- PodAutoscaler: (optional)
- HPA (Horizontal Pod Autoscaler)
konduktor launch
task.yamls for jobs, except serving includes an extra section for replicas, ports, and health endpoint probing. For full, detailed examples of deployment.yaml
, check out the bottom of this page.
Check status
To view your deployments, use thekonduktor serve status
command.
Include --all-users
or -u
to see all deployments from all users in the cluster.

--direct
to display direct IP endpoints instead of trainy.us
endpoints.
--direct
every time, you can modify ~/.konduktor/config.yaml
as a permanent toggle for konduktor serve status --direct
with:

Down a deployment
To delete a deployment, use thekonduktor serve down
command.
Include --all
or -a
to down all deployments from all users in the cluster.
Accessing Deployments
trainy.us
endpoints use https
while direct IP endpoints use http
. Requests through trainy.us
timeout after 60 seconds.
vLLM (Aibrix)
Completion API:top destination for tech companies, but it's also a hub for innovation and creativity. So, it's no surprise that the city has a vibrant food scene. From the iconic Golden Gate Bridge to the bustling streets of the Financial District, San Francisco offers a unique blend of culture, history, and modernity. When it comes to food, the city is known for its diverse cuisine, which reflects ...
Chat Completion API
Okay, so I need to help write a random number generator function in Python. Hmm, where do I start? I remember that Python has a module called random which provides functions for generating random numbers. So maybe I should use that. Let me think about what functions are available there.\n\nFirst, there's random.randint(a, b), which returns a random integer N between a and b, inclusive. That's useful. Then...
General
Basic API:Hello from Konduktor Serve!
Health Probe API
Hello from health probe!
Autoscaling
Usekonduktor serve status
to monitor the autoscaling process. The autoscaling process could take a few minutes, especially with a cold start from 0.
Scale-Up Behavior
- 0 —> 1: PA (Pod Autoscaler) triggers scale up immediately after the first request to the deployment endpoint.
- 1 —> N: PA triggers scale up based on average request rate metrics collected over a 30-second window.
Scale-Down Behavior
vLLM (Aibrix) Deployments: - stair-step scale-down- N —> N-1: PA triggers scale down based on average request rate metrics collected over a 30-second window. Grace period of 30 mins per pod.
- 1 —> 0: PA triggers scale down to zero replicas after 30 minutes of no requests to the model.
- N —> 0: PA triggers a direct scale down to zero replicas after 20 minutes of no requests to the deployment.
GPU Scheduling Behavior
Observed GKE Behavior:- GKE’s GPU scheduling can be inefficient and may not always utilize nodes optimally.
- GKE spins up new nodes even when existing nodes have sufficient GPU capacity.
Example YAMLs
Schema
- Deployment Schema - Complete reference for deployment.yaml fields
General
- Simple (no autoscaling + default port (8000) + no health probing)
- Complex (autoscaling + custom port + custom health probing endpoint)