Distributed Multi-Node Jobs
Jobs deployed via Konduktor can be scaled up to run on multiple nodes.
The following example shows how we set the environment variables that make it possible to run PyTorch distributed on two nodes.
Environment Variables
Konduktor sets the following environment variables for the job, which makes it straightforward to launch distributed frameworks such as PyTorch.
Variable | Description |
---|---|
MASTER_ADDR | The FQDN of the rank 0 node. |
RANK | The global rank. |
NUM_NODES | The total number of nodes for this job. |
NUM_GPUS_PER_NODE | The number of GPUs per node for this job. |
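
As a rough illustration of how these variables might be consumed, the sketch below assembles a `torchrun` command line from them. It rests on several assumptions that are not stated above: that `RANK` can be used as the per-node rank expected by `--node_rank`, that port 29500 is free on the rank 0 node (Konduktor does not list a port variable here, so `master_port` is a hypothetical parameter), and that the training script is called `train.py`. The helper name `build_torchrun_args` is likewise made up for this example.

```python
import os
import shlex


def build_torchrun_args(env=None, script="train.py", master_port=29500):
    """Assemble a torchrun command line from the job's environment variables.

    `script` and `master_port` are illustrative defaults, not values
    provided by Konduktor.
    """
    env = os.environ if env is None else env
    return [
        "torchrun",
        f"--nnodes={int(env['NUM_NODES'])}",
        f"--nproc_per_node={int(env['NUM_GPUS_PER_NODE'])}",
        # Assumption: RANK is treated as the rank of this node.
        f"--node_rank={int(env['RANK'])}",
        f"--master_addr={env['MASTER_ADDR']}",
        # Assumption: the port is chosen by the user, not by Konduktor.
        f"--master_port={master_port}",
        script,
    ]


if __name__ == "__main__":
    # Made-up values standing in for what the second node of a
    # two-node job might see; the hostname is a placeholder.
    example_env = {
        "MASTER_ADDR": "node-0.example.internal",
        "RANK": "1",
        "NUM_NODES": "2",
        "NUM_GPUS_PER_NODE": "8",
    }
    print(shlex.join(build_torchrun_args(example_env)))
```

Inside a real job, calling `build_torchrun_args()` with no arguments reads the values from `os.environ` instead of the example dictionary. Keeping the command construction in a small function makes it easy to check the mapping from variables to flags before launching anything.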