Distributed Multi-Node Jobs
Jobs deployed via Konduktor can be scaled up to run on multiple nodes.
The following example shows the environment variables that make it possible to run PyTorch distributed on two nodes.
Environment Variables
Konduktor passes the following environment variables to each worker so that distributed frameworks such as PyTorch can be configured easily.
Environment variable | Description |
---|---|
MASTER_ADDR | The fully qualified domain name (FQDN) of the rank 0 worker. Example: test-1234-workers-0-0.test-1234 |
NODE_HOST_IPS | A comma-separated list of worker FQDNs. Example: test-1234-workers-0-0.test-1234,test-1234-workers-0-1.test-1234 |
RANK | The global rank within a job. |
NUM_NODES | The total number of nodes/workers. |
NUM_GPUS_PER_NODE | The number of GPUs per node. |
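
The variables in the table map fairly directly onto the arguments that `torchrun` expects. The sketch below is one possible way to read them and assemble a launch command; it is not taken from Konduktor itself. It assumes that `RANK` identifies the node, so it is passed as `--node_rank`. The script name `train.py`, the port `29500`, the GPU count of 8, and the helper names `ClusterEnv`, `read_cluster_env`, and `torchrun_args` are all hypothetical and chosen only for illustration.

```python
import os
from dataclasses import dataclass
from typing import List, Mapping, Optional


@dataclass(frozen=True)
class ClusterEnv:
    """Values parsed from the environment variables listed in the table."""
    master_addr: str
    node_hosts: List[str]
    rank: int
    num_nodes: int
    gpus_per_node: int

    @property
    def world_size(self) -> int:
        # Total number of processes if one process is started per GPU.
        return self.num_nodes * self.gpus_per_node


def read_cluster_env(env: Optional[Mapping[str, str]] = None) -> ClusterEnv:
    """Parse the variables from `env` (defaults to os.environ)."""
    env = os.environ if env is None else env
    hosts = [h.strip() for h in env["NODE_HOST_IPS"].split(",") if h.strip()]
    return ClusterEnv(
        master_addr=env["MASTER_ADDR"],
        node_hosts=hosts,
        rank=int(env["RANK"]),
        num_nodes=int(env["NUM_NODES"]),
        gpus_per_node=int(env["NUM_GPUS_PER_NODE"]),
    )


def torchrun_args(cluster: ClusterEnv,
                  script: str = "train.py",      # hypothetical script name
                  master_port: int = 29500) -> List[str]:  # assumed port
    """Build a torchrun command line; assumes RANK is the node rank."""
    return [
        "torchrun",
        f"--nnodes={cluster.num_nodes}",
        f"--nproc_per_node={cluster.gpus_per_node}",
        f"--node_rank={cluster.rank}",
        f"--master_addr={cluster.master_addr}",
        f"--master_port={master_port}",
        script,
    ]


if __name__ == "__main__":
    # Example values modeled on the table; the GPU count is an assumption.
    example = {
        "MASTER_ADDR": "test-1234-workers-0-0.test-1234",
        "NODE_HOST_IPS": "test-1234-workers-0-0.test-1234,"
                         "test-1234-workers-0-1.test-1234",
        "RANK": "1",
        "NUM_NODES": "2",
        "NUM_GPUS_PER_NODE": "8",
    }
    print(" ".join(torchrun_args(read_cluster_env(example))))
```

Inside a real job, calling `read_cluster_env()` with no argument reads the values from the worker's environment. Keeping the parsing separate from the command construction makes it easy to test with a plain dictionary, as the usage example does.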