FAQ/Troubleshooting

Cluster auth/info

In order to get access to your Trainy managed GKE cluster, you’ll first need to generate a kubeconfig at ~/.kube/config from GCP. Use the following commands to list your GKE clusters and set your kubeconfig. Make sure to substitute the values for <CLUSTER_NAME> and <COMPUTE_LOCATION>. In the example below, these would be trainy-cluster and us-central1 respectively.
# install and set up the `gcloud` CLI
$ conda install -c conda-forge google-cloud-sdk
$ gcloud init

# list clusters and get credentials for one
$ gcloud container clusters list
NAME            LOCATION       MASTER_VERSION      MASTER_IP      MACHINE_TYPE  NODE_VERSION        NUM_NODES  STATUS
trainy-cluster  us-central1  1.30.5-gke.1443001  34.56.5.168    e2-medium     1.30.5-gke.1443001             RUNNING

$ gcloud container clusters get-credentials <CLUSTER_NAME> --location=<COMPUTE_LOCATION>

# confirm that you can connect
$ kubectl get nodes

Cluster Resources

Every Trainy cluster consists of 3-6 small CPU VMs that are always running; these host observability services as well as our in-cluster controller and health monitors. In addition, there are autoscaling pools for the following GPU types:
  • A100:8 ✅
  • A100-80GB:8 ✅
  • H100:8 - TCPXO enabled ✅ (1.6Tbps)
  • H200:8 - RDMA enabled 🚧 (in beta)
  • B200:8 - RDMA enabled 🚧 (in beta)
While it’s possible to run a task without a GPU, the autoscaling pools are configured to only accept requests for workloads that require a GPU, leaving only the small CPU instances for running CPU-only tasks. As such, we recommend lowering the CPU and memory requested by CPU-only tasks so they can be scheduled on the always-on CPU instances.
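For example, a CPU-only task can request modest resources so that it lands on the always-on CPU VMs instead of waiting on a GPU pool. The sketch below is illustrative only: the file name, script, and resource values are hypothetical, and the field names assume konduktor’s SkyPilot-style task spec, so check an existing GPU task YAML (such as nccl-test.yaml) for the exact fields used on your cluster.

# cpu-task.yaml: hypothetical CPU-only task spec (field names are an assumption)
name: data-prep
resources:
  cpus: 2       # keep this small so the task fits on the always-on CPU VMs
  memory: 4     # in GB (assumption); raise only if the job really needs it
run: |
  python prepare_dataset.py

It is then launched the same way as a GPU task, e.g. konduktor launch cpu-task.yaml.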

Submission Flow

First, submit a GPU job to your cluster via konduktor launch and view its status with konduktor status.
(konduktor) root@se-h1-18-gpu:~/andrew# konduktor launch  nccl-test.yaml 
Log file: /root/.konduktor/logs/konduktor-logs-20250807-055043.log
konduktor.Task from YAML spec: nccl-test.yaml
Considered resources (8 nodes):
---------------------------------
 CPUs   Mem (GB)   GPUs          
---------------------------------
 180    1000       {'H200': 8}   
---------------------------------
Launching a new job torch-nccl-allreduce. Proceed? [Y/n]: Y

(konduktor) root@se-h1-18-gpu:~/andrew# konduktor status
Log file: /root/.konduktor/logs/konduktor-logs-20250807-055112.log
User: root-6f00
Jobs
NAME                       STATUS   RESOURCES                        SUBMITTED  
torch-nccl-allreduce-78e9  PENDING  8x(180CPU, memory 1000Gi, H200) 
For Trainy on GKE, a few things occur behind the scenes that allow the task to be executed by scaling up the GPU pools:
  1. Task definition is created by the user
  2. Cluster admits Task into Kueue as workload (check with kubectl get workloads)
    $ kubectl get workload
    NAME                                                 QUEUE        RESERVED IN         ADMITTED   FINISHED   AGE
    mrms-test-debug-2-aa75                               user-queue   dws-cluster-queue   True                  82m
    
  3. ProvisioningRequest is sent to GCP and enqueued (check with kubectl get provreq)
    $ kubectl get provreq
    NAME                           ACCEPTED   PROVISIONED   FAILED   AGE
    mrms-test-debug-2-dws-prov-1   True       False                   23h
    
  4. The ProvisioningRequest is fulfilled and new GPU nodes are added to satisfy the submitted Task’s request (see the node-check sketch after this list)
    $ kubectl get provreq
    NAME                           ACCEPTED   PROVISIONED   FAILED   AGE
    mrms-test-debug-2-dws-prov-1   True       True                   23h
    
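Once PROVISIONED flips to True, you can confirm that the new GPU nodes have actually joined the cluster. This is just a convenience check rather than part of the konduktor flow: GKE labels accelerator nodes with cloud.google.com/gke-accelerator, so a label selector is enough.

# list only GPU nodes; GKE sets this label on accelerator node pools
$ kubectl get nodes -l cloud.google.com/gke-accelerator

# show which accelerator type each node carries as an extra column
$ kubectl get nodes -L cloud.google.com/gke-accelerator
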
While waiting for a request to be fulfilled, it’s quite common to see the following ResourcePoolExhausted status message on the ProvisioningRequest, especially for large requests of highly in-demand SKUs; for such requests it can persist for long periods of time.
$ kubectl describe provreq
Name:         jobset-a100-2nodes-64b5-6feac-dws-prov-1
Namespace:    default
Labels:       kueue.x-k8s.io/managed=true
Annotations:  <none>
API Version:  autoscaling.x-k8s.io/v1
Kind:         ProvisioningRequest
Metadata:
  Creation Timestamp:  2025-04-25T19:03:16Z
  Generation:          1
  Owner References:
    API Version:           kueue.x-k8s.io/v1beta1
    Block Owner Deletion:  true
    Controller:            true
    Kind:                  Workload
    Name:                  jobset-a100-2nodes-64b5-6feac
    UID:                   33870f33-bac9-40e9-a8ab-6120cd7c5012
  Resource Version:        4530453
  UID:                     bfb3404d-084f-4ab0-89b0-e2de10f38829
Spec:
  Parameters:
  Pod Sets:
    Count:  2
    Pod Template Ref:
      Name:                 ppt-jobset-a100-2nodes-64b5-6feac-dws-prov-1-workers
  Provisioning Class Name:  queued-provisioning.gke.io
Status:
  Conditions:
    Last Transition Time:  2025-04-25T19:03:26Z
    Message:               Provisioning Request was successfully queued.
    Observed Generation:   1
    Reason:                SuccessfullyQueued
    Status:                True
    Type:                  Accepted
    Last Transition Time:  2025-04-25T19:03:26Z
    Message:               Waiting for resources. Currently there are not enough resources available to fulfill the request.
    Observed Generation:   1
    Reason:                ResourcePoolExhausted
    Status:                False
    Type:                  Provisioned
  Provisioning Class Details:
    Accelerator Type:            nvidia-a100-80gb
    Node Group Name:             gke-ryandev-55-a100-80gb-pool-70cc24ce-grp
    Node Pool Auto Provisioned:  false
    Node Pool Name:              a100-80gb-pool
    Resize Request Name:         gke-default-jobset-a100-2nodes-6-1eaa01b2de1b28bb
    Selected Zone:               us-central1-a
Events:                          <none>
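If you only care about the Provisioned condition rather than the full describe output, a standard kubectl JSONPath query can poll it. The request name below is the one from the example above; substitute your own.

# print the reason of the Provisioned condition (ResourcePoolExhausted while waiting)
$ kubectl get provreq jobset-a100-2nodes-64b5-6feac-dws-prov-1 \
    -o jsonpath='{.status.conditions[?(@.type=="Provisioned")].reason}{"\n"}'

# or watch all ProvisioningRequests until the PROVISIONED column flips to True
$ kubectl get provreq -w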