Jobs deployed via Konduktor can be scaled up to run on multiple nodes. Multi-node jobs can use the following environment variables:

Environment variable | Description |
---|---|
MASTER_ADDR | The FQDN of the rank 0 worker, e.g. `test-1234-workers-0-0.test-1234` |
NODE_HOST_IPS | A comma-separated list of worker FQDNs, e.g. `test-1234-workers-0-0.test-1234,test-1234-workers-0-1.test-1234` |
RANK | The global rank within a job. |
NUM_NODES | The total number of nodes/workers. |
NUM_GPUS_PER_NODE | The number of GPUs per node. |
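
As an illustration of how a training script might consume these variables, here is a minimal Python sketch. The `ClusterEnv` dataclass and the `read_cluster_env` helper are hypothetical names introduced for this example and are not part of Konduktor; the sketch only assumes that the variables in the table above are set as strings in the process environment.

```python
import os
from dataclasses import dataclass
from typing import List, Mapping


@dataclass
class ClusterEnv:
    """Hypothetical container for the variables listed in the table above."""

    master_addr: str       # MASTER_ADDR: FQDN of the rank 0 worker
    node_hosts: List[str]  # NODE_HOST_IPS, split on commas
    rank: int              # RANK
    num_nodes: int         # NUM_NODES
    gpus_per_node: int     # NUM_GPUS_PER_NODE

    @property
    def total_gpus(self) -> int:
        # Total GPUs across the job: nodes times GPUs per node.
        return self.num_nodes * self.gpus_per_node


def read_cluster_env(env: Mapping[str, str] = os.environ) -> ClusterEnv:
    """Parse the multi-node variables from a mapping (os.environ by default)."""
    # Split the comma-separated host list and drop empty entries.
    hosts = [h.strip() for h in env["NODE_HOST_IPS"].split(",") if h.strip()]
    return ClusterEnv(
        master_addr=env["MASTER_ADDR"],
        node_hosts=hosts,
        rank=int(env["RANK"]),
        num_nodes=int(env["NUM_NODES"]),
        gpus_per_node=int(env["NUM_GPUS_PER_NODE"]),
    )
```

Inside a running job, `read_cluster_env()` with no arguments reads from `os.environ`. Passing a plain dictionary instead makes the parsing easy to check locally without a cluster.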