Konduktor’s file sync on launch ensures your code and static assets are available inside each container before it starts. However, once your workload is running, you may also need to persist runtime outputs such as checkpoints, logs, or evaluation results back to cloud storage. When a Konduktor job runs, its containers execute in ephemeral pods. Local storage (/tmp, /root, or your workdir) is ephemeral and wiped when the pod restarts or completes, so we recommend uploading to cloud storage any data that must survive restarts or be analyzed later. Unlike the automatic file sync that happens on launch, runtime sync works through direct writes to your configured object store (S3, GCS, etc.).
Common Use Cases
- Model checkpoints - Periodically save model state so you can resume after failure.
- Metrics or logs - Write evaluation summaries or artifacts for later inspection.
- Intermediate outputs - Store temporary data for multi-stage workflows.
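For instance, a training loop can write each checkpoint directly to the object store as it goes. Below is a minimal sketch; the bucket name, key prefix, and the assumption that the AWS CLI is available in the image are all illustrative, not part of Konduktor itself:

```python
import subprocess

# Hypothetical bucket and prefix -- replace with your own.
BUCKET = "my-training-bucket"
PREFIX = "runs/exp-1/checkpoints"

def checkpoint_key(step: int) -> str:
    """Object key for the checkpoint at a given training step.

    Zero-padding makes lexicographic order match numeric order,
    so the newest checkpoint is simply the max key in a listing.
    """
    return f"{PREFIX}/step-{step:08d}.pt"

def upload_checkpoint(local_path: str, step: int) -> None:
    """Copy a saved checkpoint to S3 using the AWS CLI (assumed installed)."""
    key = checkpoint_key(step)
    subprocess.run(
        ["aws", "s3", "cp", local_path, f"s3://{BUCKET}/{key}"],
        check=True,
    )
```

Because the pod's filesystem is wiped on restart, the upload should happen immediately after each save rather than once at the end of the job.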
Setup
Full setup for file sync requires cloud storage configuration, which can be found here. At startup, Konduktor mounts your cloud credentials into the job containers, placing them in ~/.aws (S3) or ~/.config/gcloud (GCS). If you plan to use command-line tools like aws s3, gsutil, or gcloud, ensure your image includes those CLIs or install them in your run: block.
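As a sketch of what that looks like in a task file, the run: block below installs the AWS CLI before training and uploads results at the end. The bucket, paths, and the assumption of a pip-capable image are illustrative:

```yaml
# Sketch of a task file; the bucket and paths are hypothetical.
run: |
  pip install --quiet awscli
  python train.py
  aws s3 cp ./results/metrics.json s3://my-training-bucket/runs/exp-1/metrics.json
```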
Cloud service account credentials for the Trainy cluster are checked against this file:
~/.konduktor/config.yaml
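If a CLI call fails inside the container, a quick first check is whether the credential directories were actually mounted. A small stdlib-only sketch (the paths match the mount locations described above):

```python
from pathlib import Path

def mounted_credentials(home: Path = Path.home()) -> dict:
    """Report which cloud credential paths are present in the container.

    Konduktor mounts credentials into ~/.aws (S3) and
    ~/.config/gcloud (GCS) at startup; a missing directory
    suggests the corresponding cloud was never configured.
    """
    return {
        "aws": (home / ".aws").exists(),
        "gcloud": (home / ".config" / "gcloud").exists(),
    }
```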
Usage
Checkpoint Resumption
You can combine cloud storage bucket writes using aws s3 cp or gcloud storage cp with Konduktor environment variables like RESTART_ATTEMPT and JOB_COMPLETION_INDEX to resume training automatically after a crash or restart. Each pod can detect whether it is a retry and download its latest checkpoint from S3 or GCS before continuing.
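The retry-detection step can be sketched as follows. This assumes RESTART_ATTEMPT is 0 on the first attempt and that the AWS CLI is present in the image; the bucket and prefix are placeholders:

```python
import os
import subprocess

def restart_attempt() -> int:
    """Attempt counter exposed by Konduktor.

    Assumes the value is 0 on the first attempt; check the
    environment variables reference for the exact semantics.
    """
    return int(os.environ.get("RESTART_ATTEMPT", "0"))

def maybe_restore(bucket: str, prefix: str, dest: str) -> bool:
    """On a retry, pull checkpoints back down before training resumes."""
    if restart_attempt() == 0:
        return False  # fresh run, nothing to restore
    subprocess.run(
        ["aws", "s3", "sync", f"s3://{bucket}/{prefix}", dest],
        check=True,
    )
    return True
```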
Below is a complete example that trains a tiny PyTorch model, saves checkpoints to S3, crashes on purpose on the first attempt, and then automatically resumes from the latest checkpoint in S3.