$1.85 and 832 steps, gone to a pod eviction
A fine-tuning run on a rented A100 died at step 832 when the host evicted the pod. Three lessons: /workspace is ephemeral, nohup does not survive an SSH disconnect, and pkill -9 python takes down more than your run.
The run was at step 832 of 1500. The checkpoint at step 500 was sitting in
/workspace/output. The host reclaimed the pod. Everything on that box was
gone: the checkpoint, the partial run, $1.85. I rebuilt from scratch.
Three wrong assumptions.
/workspace is not storage
On a SECURE GPU pool, /workspace is ephemeral. Even declaring volume_in_gb
in the pod spec does not protect you. The host can reclaim the node and take
the volume with it. The checkpoint at step 500 was not a backup. It was a file
on someone else’s disk.
The fix: an scp loop on the local machine, running in parallel with the
training job. Every 60 seconds it polls the remote output directory and copies
any new .safetensors file it has not seen. Set --save_every_n_steps 250
instead of the default 500, so the worst-case loss on eviction drops to one
checkpoint window. The scp connection needs
-o ConnectTimeout=30 -o ServerAliveInterval=15, or it drops silently when the
remote is idle.
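A minimal sketch of that loop, meant to run on the local machine. The host, port, and local directory are placeholders, and it assumes key-based SSH so nothing prompts for a password. Files are downloaded under a .part name and renamed on success, so a dropped transfer is retried on the next poll instead of being counted as seen.

```shell
#!/bin/sh
# Placeholder connection details (assumptions): replace with your pod's values.
REMOTE="${REMOTE:-root@pod.example.com}"
REMOTE_PORT="${REMOTE_PORT:-22}"
REMOTE_DIR="${REMOTE_DIR:-/workspace/output}"
LOCAL_DIR="${LOCAL_DIR:-./checkpoints}"
SSH_OPTS="-o ConnectTimeout=30 -o ServerAliveInterval=15"

# Print remote .safetensors paths, one per line.
list_remote() {
  # $SSH_OPTS is unquoted on purpose so it splits into separate options.
  ssh -p "$REMOTE_PORT" $SSH_OPTS "$REMOTE" \
    "ls -1 $REMOTE_DIR/*.safetensors 2>/dev/null" < /dev/null
}

# Copy one remote path ($1) to a local destination ($2).
# stdin comes from /dev/null so the copy cannot eat the file list being read.
fetch_one() {
  scp -P "$REMOTE_PORT" $SSH_OPTS "$REMOTE:$1" "$2" < /dev/null
}

# One polling pass: fetch every listed file that is not already local.
sync_once() {
  mkdir -p "$LOCAL_DIR"
  list_remote | while IFS= read -r path; do
    [ -n "$path" ] || continue
    dest="$LOCAL_DIR/$(basename "$path")"
    [ -e "$dest" ] && continue              # already copied on an earlier pass
    if fetch_one "$path" "$dest.part"; then
      mv "$dest.part" "$dest"               # only complete copies get the real name
    else
      rm -f "$dest.part"
      echo "copy failed, will retry: $path" >&2
    fi
  done
}

# Poll until interrupted; 60 seconds matches the interval described above.
poll_forever() {
  while true; do
    sync_once
    sleep "${POLL_SECONDS:-60}"
  done
}
```

Call poll_forever at the end of the script to start polling. The sketch does not detect a checkpoint that is still being written on the remote; comparing file sizes across two polls would be one way to handle that.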
nohup does not survive the shell
On the RunPod ComfyUI image, nohup ... & disown does not keep the process
alive when the SSH session closes. The process dies with the shell.
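setsid starts the command in a new session, fully decoupled from the terminal. The same pattern works for the training command and for a watchdog that kills it after 6000 seconds:
# Watchdog: detached from the terminal, kills the training process after 6000 s.
setsid bash -c "sleep 6000 && pkill -9 -f flux_train_network" \
< /dev/null > /dev/null 2>&1 &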
The run survives the shell dying. The watchdog timer fires on schedule regardless of whether the connection is still open.
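The same launch pattern can be wrapped in a small helper so the training command gets a log file instead of /dev/null. This is a sketch, not the exact command from the run: run_detached is a hypothetical name, and it assumes the setsid binary from util-linux is on the PATH.

```shell
# run_detached LOGFILE COMMAND [ARGS...]
# Starts COMMAND in its own session with no tie to the terminal:
# stdin comes from /dev/null, stdout and stderr are appended to LOGFILE.
run_detached() {
  log="$1"
  shift
  setsid "$@" < /dev/null >> "$log" 2>&1 &
}
```

Usage would look something like run_detached train.log python flux_train_network.py --save_every_n_steps 250, where the script filename and arguments are assumptions. Running pgrep -af flux_train_network from a fresh SSH session afterwards is a quick way to confirm the process outlived the shell that started it.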
pkill -9 python kills more than your run
On that image, sshd is a Python wrapper. A global pkill -9 python takes it
down too, and the connection drops mid-run. Kill scripts by specific name with
pkill -f <script_name>, never by runtime.
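One way to make the targeted kill a habit is a helper that lists the matches first, sends SIGTERM, and only escalates to SIGKILL if the process is still there after a grace period. stop_run is a hypothetical name, and the sketch assumes the procps-ng versions of pgrep and pkill (the -a flag prints the full command line).

```shell
# stop_run PATTERN [GRACE_SECONDS]
# PATTERN is matched against the full command line, as with pkill -f.
stop_run() {
  pattern="$1"
  grace="${2:-10}"
  # Show what will be hit before sending any signal; bail out if nothing matches.
  pgrep -af "$pattern" || { echo "no process matches: $pattern" >&2; return 1; }
  pkill -TERM -f "$pattern"
  waited=0
  while [ "$waited" -lt "$grace" ]; do
    pgrep -f "$pattern" > /dev/null || return 0   # exited cleanly
    sleep 1
    waited=$((waited + 1))
  done
  # Still alive after the grace period: force it.
  pkill -KILL -f "$pattern" || true
}
```

For example, stop_run flux_train_network 30 would touch only processes whose command line contains that name, leaving sshd and every other Python process alone. The printed match list is the safety check: if it shows anything unexpected, the pattern is too broad.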
Takeaway
Rented infrastructure is ephemeral by default. Engineer the run to survive the machine disappearing rather than assuming it stays up.