Enabling Fault Tolerance for GPU Accelerated AI Workloads in Kubernetes – Arpit Singh & Abhijit Paithankar, NVIDIA

In Kubernetes-based ML platforms, job failures caused by hardware problems such as GPU malfunctions, network disruptions, ECC errors, and OOM events pose significant challenges. These failures lead to underutilized resources, wasted engineering time, and high operational costs, and they often force users to resubmit jobs. Current AI/ML frameworks lack adequate fault-tolerance strategies, typically requiring manual intervention and delaying job resumption. This talk explores fault-tolerance strategies including naive job restarts on failure, job restarts with hot spares, and job restarts after replacing faulty nodes. We discuss how to propagate faults by leveraging node and pod conditions, and we address gaps in fault discovery and error propagation in the existing Kubernetes ecosystem. The talk also covers ways to enhance components such as the node-problem-detector and to introduce new components that close the gaps in fault detection, propagation, reaction, and remediation.
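As a rough illustration of the restart strategies mentioned above, the sketch below maps observed node conditions to one of the three remediation paths (naive restart, restart on a hot spare, or node replacement before restart). The condition types (e.g. "GpuXidError") and the severity grouping are hypothetical examples of what an enhanced node-problem-detector might surface, not part of the talk or of the Kubernetes API.

```python
# Sketch (assumptions labeled): choosing a job-restart strategy from node
# conditions, mirroring the three strategies discussed in the abstract.
from dataclasses import dataclass


@dataclass
class NodeCondition:
    type: str    # e.g. "Ready", "MemoryPressure", "GpuXidError" (hypothetical)
    status: str  # "True", "False", or "Unknown", as in Kubernetes conditions


# Hypothetical severity grouping of condition types.
FATAL_HW_CONDITIONS = {"GpuXidError", "NetworkFault"}            # node unusable
TRANSIENT_CONDITIONS = {"MemoryPressure", "EccSingleBitError"}   # retry may work


def choose_remediation(conditions, hot_spares_available: int) -> str:
    """Pick a restart strategy from the conditions currently set to True."""
    active = {c.type for c in conditions if c.status == "True"}
    if active & FATAL_HW_CONDITIONS:
        # Faulty hardware: reschedule the job onto a hot spare if one exists,
        # otherwise replace the faulty node before restarting.
        if hot_spares_available > 0:
            return "restart-on-hot-spare"
        return "replace-node-then-restart"
    if active & TRANSIENT_CONDITIONS:
        # Transient fault: a naive restart on the same node may succeed.
        return "naive-restart"
    return "no-action"


conds = [NodeCondition("GpuXidError", "True"), NodeCondition("Ready", "False")]
print(choose_remediation(conds, hot_spares_available=1))  # restart-on-hot-spare
```

In a real controller this decision would be driven by conditions reported on the Node object (e.g. by node-problem-detector) and acted on by cordoning, draining, and rescheduling; the sketch only captures the decision logic.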