
Amazon SageMaker HyperPod announces health monitoring agent support for Slurm clusters




Amazon SageMaker HyperPod has announced the general availability of a health monitoring agent for Slurm clusters, designed to enhance ML workload resilience.

  • Performs passive, background health checks on GPU and Trainium-based nodes
  • Automatically detects and flags hardware issues like unresponsive GPUs
  • Marks unhealthy nodes and reboots or replaces them without manual intervention
  • Works with Slurm's job auto-resume functionality to continue training from the last checkpoint
  • Available in all AWS Regions where HyperPod is generally available
  • Enabled automatically on new Slurm clusters; can be added to existing clusters via an AMI upgrade

The agent helps ML teams train large models continuously and minimize disruptions caused by hardware failures.
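
Auto-resume only helps if the training job itself can pick up from a saved checkpoint. The following is a minimal sketch of that pattern, not HyperPod's or Slurm's own mechanism: the function names, the JSON checkpoint file, and the `fail_at` parameter (used to simulate a hardware failure) are all hypothetical and chosen for illustration.

```python
import json
import os
import tempfile


def load_checkpoint(path):
    """Return the saved state, or a fresh state if no checkpoint exists."""
    if os.path.exists(path):
        with open(path) as f:
            return json.load(f)
    return {"step": 0, "loss": None}


def save_checkpoint(path, state):
    """Write to a temp file, then rename, so a crash mid-write
    cannot leave a half-written checkpoint behind."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as f:
        json.dump(state, f)
    os.replace(tmp, path)


def train(path, total_steps, checkpoint_every=10, fail_at=None):
    """Toy training loop that resumes from the last checkpoint at `path`.

    `fail_at` is a hypothetical knob that simulates a node failure
    at the given step; a real job would simply be killed.
    """
    state = load_checkpoint(path)
    for step in range(state["step"], total_steps):
        if fail_at is not None and step == fail_at:
            raise RuntimeError("simulated hardware failure")
        # Placeholder for one real optimizer step.
        state = {"step": step + 1, "loss": 1.0 / (step + 1)}
        if state["step"] % checkpoint_every == 0:
            save_checkpoint(path, state)
    save_checkpoint(path, state)
    return state
```

In this sketch, a run that fails partway through loses only the steps since the most recent checkpoint; calling `train` again with the same path continues from that point. In a real job the state would hold model and optimizer weights, and the checkpoint would need to live on storage that survives a node reboot or replacement.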





