Jobs deployed via Konduktor can be scaled up to run on multiple nodes.
The following example shows the environment variables that make it possible to run PyTorch distributed on two nodes.
Konduktor passes the following environment variables to the job so that distributed frameworks such as PyTorch can be configured easily.
| Environment variable | Description |
|---|---|
| `MASTER_ADDR` | The FQDN of the rank 0 worker, e.g. `test-1234-workers-0-0.test-1234` |
| `NODE_HOST_IPS` | A comma-separated list of worker FQDNs, e.g. `test-1234-workers-0-0.test-1234,test-1234-workers-0-1.test-1234` |
| `RANK` | The global rank within a job |
| `NUM_NODES` | The total number of nodes/workers |
| `NUM_GPUS_PER_NODE` | The number of GPUs per node |
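
As a rough illustration of how these variables might be consumed, the sketch below maps them onto `torchrun` arguments. It is not taken from Konduktor itself: the helper names `parse_hosts` and `build_torchrun_args`, the script name `train.py`, and the fallback port `29500` are all assumptions made for the example. It also assumes that `RANK` identifies the node (worker) index, which is what `--node_rank` expects; if `RANK` means something else in your setup, adjust the mapping.

```python
# Hypothetical helpers: a sketch, not part of Konduktor's API.
DEFAULT_MASTER_PORT = "29500"  # assumption: the table above does not list a port variable


def parse_hosts(env):
    """Split NODE_HOST_IPS into a list of worker FQDNs."""
    return [h.strip() for h in env["NODE_HOST_IPS"].split(",") if h.strip()]


def build_torchrun_args(env, script="train.py"):
    """Map the Konduktor variables onto torchrun command-line arguments.

    Assumes RANK is the index of this node among NUM_NODES workers.
    """
    return [
        "torchrun",
        f"--nnodes={env['NUM_NODES']}",
        f"--nproc_per_node={env['NUM_GPUS_PER_NODE']}",
        f"--node_rank={env['RANK']}",
        f"--master_addr={env['MASTER_ADDR']}",
        f"--master_port={env.get('MASTER_PORT', DEFAULT_MASTER_PORT)}",
        script,
    ]


# Example values mirroring the table above (two nodes; 8 GPUs per node is made up).
example_env = {
    "MASTER_ADDR": "test-1234-workers-0-0.test-1234",
    "NODE_HOST_IPS": "test-1234-workers-0-0.test-1234,test-1234-workers-0-1.test-1234",
    "RANK": "1",
    "NUM_NODES": "2",
    "NUM_GPUS_PER_NODE": "8",
}

print(parse_hosts(example_env))
print(" ".join(build_torchrun_args(example_env)))
```

Inside a running job, the same helper could be called with `os.environ` instead of `example_env`, so that each worker builds its own launch command from the values Konduktor provides.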