$1.85 and 832 steps, gone to a pod eviction

The run was at step 832 of 1500. The checkpoint at step 500 was sitting in /workspace/output. The host reclaimed the pod. Everything on that box was gone: the checkpoint, the partial run, $1.85. I rebuilt from scratch.

Three wrong assumptions.

/workspace is not storage

On a SECURE GPU pool, /workspace is ephemeral. Even declaring volume_in_gb in the pod spec does not protect you. The host can reclaim the node and take the volume with it. The checkpoint at step 500 was not a backup. It was a file on someone else’s disk.

The fix: a local scp loop running in parallel with the training job. Every 60 seconds it polls the remote output directory and copies any new .safetensors file it has not seen. Set --save_every_n_steps 250 instead of the default 500, and worst-case loss on eviction drops to one 250-step checkpoint window, plus whatever the last poll had not yet copied. The scp connection needs -o ConnectTimeout=30 -o ServerAliveInterval=15: the first stops a connection attempt to a dead pod from hanging, the second sends keepalives so an idle connection does not drop silently.
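A minimal sketch of that loop, meant to run on the local machine. REMOTE, PORT, REMOTE_DIR and LOCAL_DIR are placeholders I am assuming here, not values from the actual run, and the function names are made up for illustration. It also assumes key-based SSH login, so nothing stops to ask for a password.

```shell
#!/usr/bin/env bash
# Sketch: pull new checkpoints from the pod to the local machine.
# REMOTE, PORT, REMOTE_DIR and LOCAL_DIR are placeholders, not values from the run.
set -u

REMOTE_DIR="${REMOTE_DIR:-/workspace/output}"
LOCAL_DIR="${LOCAL_DIR:-./checkpoints}"
PORT="${PORT:-22}"
SSH_OPTS=(-o ConnectTimeout=30 -o ServerAliveInterval=15)

# Read file names on stdin, print the ones that do not exist yet in directory $1.
unseen() {
    local dir="$1" name
    while IFS= read -r name; do
        [[ -n "$name" && ! -e "$dir/$name" ]] && printf '%s\n' "$name"
    done
    return 0
}

# One polling pass: list remote .safetensors files, copy the ones we lack.
sync_once() {
    mkdir -p "$LOCAL_DIR"
    ssh "${SSH_OPTS[@]}" -p "$PORT" "$REMOTE" \
        "cd '$REMOTE_DIR' && ls -1 *.safetensors 2>/dev/null" |
    unseen "$LOCAL_DIR" |
    while IFS= read -r name; do
        # Copy to a .part name first so a dropped transfer never looks complete.
        scp "${SSH_OPTS[@]}" -P "$PORT" \
            "$REMOTE:$REMOTE_DIR/$name" "$LOCAL_DIR/$name.part" < /dev/null &&
            mv "$LOCAL_DIR/$name.part" "$LOCAL_DIR/$name"
    done
}

# Only start polling when REMOTE (for example root@<pod-ip>) is set.
if [[ -n "${REMOTE:-}" ]]; then
    while true; do
        sync_once
        sleep 60
    done
fi
```

Started as REMOTE=root@<pod-ip> PORT=<ssh-port> bash sync.sh (sync.sh being whatever you name the file), it keeps polling until you stop it. One gap this sketch does not cover: a checkpoint that is still being written on the pod when the poll hits could be copied incomplete, so a size check across two polls would be a sensible addition.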

nohup does not survive the shell

On the RunPod ComfyUI image, nohup ... & disown does not keep the process alive when the SSH session closes. The process dies with the shell.

setsid starts the command in a new session, fully decoupled from the terminal. Here it launches a watchdog that kills the run after 6000 seconds:

setsid bash -c "sleep 6000 && pkill -9 -f flux_train_network" \
    < /dev/null > /dev/null 2>&1 &

Launched the same way, the run survives the shell dying. The watchdog timer fires on schedule whether or not the connection is still open.
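For the run itself, the same setsid pattern can be wrapped in a small helper. This is a sketch, not the exact command I used: launch_detached, the log path and the PID-file path are hypothetical names. The inner bash writes its own PID to a file before exec replaces it with the real command, because $! is not reliable here (setsid may fork, depending on how it is invoked).

```shell
#!/usr/bin/env bash
# Sketch: start a command in its own session so it outlives the SSH shell.
# launch_detached, the log path and the PID-file path are hypothetical names.

launch_detached() {
    local log="$1" pidfile="$2"
    shift 2
    # The inner bash records its own PID, then exec replaces it with the
    # command, so the PID in the file is the PID of the command itself.
    setsid bash -c 'echo $$ > "$1"; shift; exec "$@"' _ "$pidfile" "$@" \
        < /dev/null > "$log" 2>&1 &
}

# Example (arguments are placeholders):
#   launch_detached train.log train.pid python <train_script> <args>
#   kill "$(cat train.pid)"    # stops exactly that process later
```

Having the PID on disk also gives a more precise way to stop the run later than matching on a name.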

pkill -9 python kills more than your run

On that image, sshd is a Python wrapper, so a global pkill -9 python takes it down too and the connection drops mid-run. Kill by specific script name with pkill -f <script_name>, never by runtime.
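A sketch of a safer kill, assuming the procps-ng versions of pgrep and pkill (where -a lists the full command line). stop_by_pattern and its grace-period argument are hypothetical. It prints what the pattern matches before touching anything, sends SIGTERM first, and only then SIGKILL. The polite-first order is my choice here, not something the original run did.

```shell
#!/usr/bin/env bash
# Sketch: stop processes by command-line pattern, previewing the matches first.
# stop_by_pattern and its grace-period argument are hypothetical names.

stop_by_pattern() {
    local pattern="$1" grace="${2:-5}"
    # pgrep -a prints PID and full command line; return if nothing matches.
    pgrep -af "$pattern" || return 0
    pkill -TERM -f "$pattern"            # polite stop first
    sleep "$grace"
    pkill -KILL -f "$pattern" || true    # force only what is still alive
}

# Example: stop_by_pattern flux_train_network 10
```

Note that -f matches against the whole command line, so it will also hit any shell whose own command line contains the pattern. Check the preview output before trusting a pattern.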

Takeaway

Rented infrastructure is ephemeral by default. Engineer the run to survive the machine disappearing, not to assume it stays up.