FAQ/Troubleshooting
Cluster auth/info
To get access to your Trainy-managed GKE cluster, you first need to generate a kubeconfig at ~/.kube/config from GCP. Use the following commands to list your GKE clusters and set your kubeconfig. Make sure to substitute the values for <CLUSTER_NAME> and <COMPUTE_LOCATION>. In the example below, these would be trainy-cluster and us-central1 respectively.
# setup `gcloud` cli and install
$ conda install -c conda-forge google-cloud-sdk
$ gcloud init
# list clusters and get credentials for one
$ gcloud container clusters list
NAME            LOCATION     MASTER_VERSION      MASTER_IP    MACHINE_TYPE  NODE_VERSION        NUM_NODES  STATUS
trainy-cluster  us-central1  1.30.5-gke.1443001  34.56.5.168  e2-medium     1.30.5-gke.1443001             RUNNING
$ gcloud container clusters get-credentials <CLUSTER_NAME> --location=<COMPUTE_LOCATION>
# confirm that you can connect
$ kubectl get nodes
Cluster Resources
Every Trainy cluster consists of 3-6 small CPU VMs that are always running to host things like observability services, as well as our in-cluster controller and health monitors. In addition, there are autoscaling pools for the following GPU types (commands to inspect the pools on your cluster follow the list):
A100:8 ✅
A100-80GB:8 ✅
H100:8 - TCPXO enabled ✅ (1.6Tbps)
H200:8 - RDMA enabled 🚧 (in beta)
B200:8 - RDMA enabled 🚧 (in beta)
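To see the node pools actually configured on your cluster, you can list them directly; a minimal sketch, reusing the <CLUSTER_NAME> and <COMPUTE_LOCATION> placeholders from above (the accelerator label query only shows nodes that are currently scaled up):
# list the CPU and autoscaling GPU node pools in your cluster
$ gcloud container node-pools list --cluster=<CLUSTER_NAME> --location=<COMPUTE_LOCATION>
# show which nodes currently have GPUs attached, via the GKE accelerator node label
$ kubectl get nodes -L cloud.google.com/gke-accelerator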
While it's possible to run a task without a GPU, the autoscaling pools are configured to only accept requests for workloads that require a GPU, leaving only the small CPU instances for running CPU-only tasks. As such, we recommend lowering the cpu and memory requests of CPU-only tasks so that they fit on the small CPU instances, as in the sketch below.
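For example, a CPU-only task can request just a couple of cores so it lands on the always-on CPU VMs instead of waiting for a GPU pool to scale up. This is a minimal sketch, assuming konduktor's task YAML with cpus and memory fields under resources; the file name, task name, and run command are illustrative:
# cpu-task.yaml -- illustrative CPU-only task with small resource requests
name: cpu-preprocess
resources:
  cpus: 2       # keep small so the task fits on the always-on CPU VMs
  memory: 8     # GB; likewise kept small
run: |
  python preprocess.py

$ konduktor launch cpu-task.yaml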
Submission Flow
First, submit a GPU job to your cluster via konduktor launch and view its status with konduktor status.
(konduktor) root@se-h1-18-gpu:~/andrew# konduktor launch nccl-test.yaml
Log file: /root/.konduktor/logs/konduktor-logs-20250807-055043.log
konduktor.Task from YAML spec: nccl-test.yaml
Considered resources (8 nodes):
---------------------------------
 CPUs   Mem (GB)   GPUs
---------------------------------
 180    1000       {'H200': 8}
---------------------------------
Launching a new job torch-nccl-allreduce. Proceed? [Y/n]: Y
(konduktor) root@se-h1-18-gpu:~/andrew# konduktor status
Log file: /root/.konduktor/logs/konduktor-logs-20250807-055112.log
User: root-6f00
Jobs
NAME                       STATUS   RESOURCES                          SUBMITTED
torch-nccl-allreduce-78e9  PENDING  8x(180CPU, memory 1000Gi, H200)
For Trainy on GKE, a few things happen after submission that allow the task to be executed by scaling up the GPU pools:
- Task definition is created by the user
- Cluster admits the Task into Kueue as a workload (check with kubectl get workloads)
$ kubectl get workload
NAME                     QUEUE        RESERVED IN         ADMITTED   FINISHED   AGE
mrms-test-debug-2-aa75   user-queue   dws-cluster-queue   True                  82m
- ProvisioningRequest is sent to GCP and enqueued (check with kubectl get provreq)
$ kubectl get provreq
NAME                           ACCEPTED   PROVISIONED   FAILED   AGE
mrms-test-debug-2-dws-prov-1   True       False                  23h
- ProvisioningRequest is fulfilled and new GPU nodes are added to serve the submitted user Task; see the commands after this list to confirm the nodes and pods came up
$ kubectl get provreq
NAME                           ACCEPTED   PROVISIONED   FAILED   AGE
mrms-test-debug-2-dws-prov-1   True       True                   23h
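Once the ProvisioningRequest reports Provisioned: True, the new GPU nodes join the cluster and the job's pods can schedule. A minimal sketch of commands to confirm this (output omitted):
# the new GPU nodes should now appear alongside the small CPU VMs
$ kubectl get nodes
# the job's pods should move from Pending to Running
$ kubectl get pods
# konduktor should now report the job as running rather than pending
$ konduktor status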
While waiting for a request to be fulfilled, it is quite common to see the following ResourcePoolExhausted status message on the ProvisioningRequest, especially for large requests of highly in-demand SKUs, and it can persist for long periods of time.
$ kubectl describe provreq
Name:         jobset-a100-2nodes-64b5-6feac-dws-prov-1
Namespace:    default
Labels:       kueue.x-k8s.io/managed=true
Annotations:  <none>
API Version:  autoscaling.x-k8s.io/v1
Kind:         ProvisioningRequest
Metadata:
  Creation Timestamp:  2025-04-25T19:03:16Z
  Generation:          1
  Owner References:
    API Version:           kueue.x-k8s.io/v1beta1
    Block Owner Deletion:  true
    Controller:            true
    Kind:                  Workload
    Name:                  jobset-a100-2nodes-64b5-6feac
    UID:                   33870f33-bac9-40e9-a8ab-6120cd7c5012
  Resource Version:        4530453
  UID:                     bfb3404d-084f-4ab0-89b0-e2de10f38829
Spec:
  Parameters:
  Pod Sets:
    Count:  2
    Pod Template Ref:
      Name:  ppt-jobset-a100-2nodes-64b5-6feac-dws-prov-1-workers
  Provisioning Class Name:  queued-provisioning.gke.io
Status:
  Conditions:
    Last Transition Time:  2025-04-25T19:03:26Z
    Message:               Provisioning Request was successfully queued.
    Observed Generation:   1
    Reason:                SuccessfullyQueued
    Status:                True
    Type:                  Accepted
    Last Transition Time:  2025-04-25T19:03:26Z
    Message:               Waiting for resources. Currently there are not enough resources available to fulfill the request.
    Observed Generation:   1
    Reason:                ResourcePoolExhausted
    Status:                False
    Type:                  Provisioned
  Provisioning Class Details:
    Accelerator Type:            nvidia-a100-80gb
    Node Group Name:             gke-ryandev-55-a100-80gb-pool-70cc24ce-grp
    Node Pool Auto Provisioned:  false
    Node Pool Name:              a100-80gb-pool
    Resize Request Name:         gke-default-jobset-a100-2nodes-6-1eaa01b2de1b28bb
    Selected Zone:               us-central1-a
Events:  <none>
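To check just the provisioning condition without the full describe output, you can filter with a jsonpath query; a minimal sketch, using a <PROVREQ_NAME> placeholder for the name shown by kubectl get provreq:
# print the message of the Provisioned condition for a single ProvisioningRequest
$ kubectl get provreq <PROVREQ_NAME> -o jsonpath='{.status.conditions[?(@.type=="Provisioned")].message}'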