
Amazon SageMaker HyperPod now supports on-demand deep health checks




Amazon SageMaker HyperPod now supports on-demand deep health checks, which let you proactively verify the health of GPU accelerators on instances that are already running in a cluster.

  • Supported on clusters orchestrated by Amazon EKS or Slurm; the checks run comprehensive hardware stress tests
  • Slurm clusters can also run deep health checks during node provisioning at cluster creation
  • Checks can target entire instance groups or specific instances, verifying connectivity and GPU health
  • Progress and results are visible through the SageMaker console and APIs at both the instance group and instance levels
  • Instances undergoing checks are automatically isolated from workload scheduling
  • Instances that fail a check are automatically rebooted or replaced through the automatic node recovery capability
  • Available in all AWS Regions where SageMaker HyperPod is available
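
The instance-level and group-level visibility described above could be consumed from a script roughly as follows. This is a minimal sketch, not the announced workflow: it does not start a health check, it only summarizes node statuses. It assumes the existing boto3 SageMaker call `list_cluster_nodes` and the shape of its `ClusterNodeSummaries` entries; the status string `DeepHealthCheckInProgress`, the helper names, and the sample data are illustrative assumptions.

```python
from collections import Counter, defaultdict


def summarize_by_group(node_summaries):
    """Count node statuses per instance group.

    `node_summaries` is assumed to be a list of dicts shaped like the
    `ClusterNodeSummaries` entries returned by boto3's `list_cluster_nodes`.
    """
    summary = defaultdict(Counter)
    for node in node_summaries:
        group = node.get("InstanceGroupName", "unknown")
        status = node.get("InstanceStatus", {}).get("Status", "Unknown")
        summary[group][status] += 1
    return {group: dict(counts) for group, counts in summary.items()}


def fetch_node_summaries(cluster_name):
    """Page through list_cluster_nodes (needs boto3 and AWS credentials)."""
    import boto3  # imported lazily so summarize_by_group works offline

    client = boto3.client("sagemaker")
    nodes, token = [], None
    while True:
        kwargs = {"ClusterName": cluster_name}
        if token:
            kwargs["NextToken"] = token
        page = client.list_cluster_nodes(**kwargs)
        nodes.extend(page.get("ClusterNodeSummaries", []))
        token = page.get("NextToken")
        if not token:
            return nodes


# Hypothetical sample data for illustration only; the IDs, group names,
# and the status values are assumptions, not output from a real cluster.
SAMPLE_NODES = [
    {"InstanceGroupName": "gpu-workers", "InstanceId": "i-0001",
     "InstanceStatus": {"Status": "Running"}},
    {"InstanceGroupName": "gpu-workers", "InstanceId": "i-0002",
     "InstanceStatus": {"Status": "DeepHealthCheckInProgress"}},
    {"InstanceGroupName": "controller", "InstanceId": "i-0003",
     "InstanceStatus": {"Status": "Running"}},
]

if __name__ == "__main__":
    print(summarize_by_group(SAMPLE_NODES))
```

Against a real cluster, `summarize_by_group(fetch_node_summaries("my-cluster"))` would produce the same kind of per-group tally, where `my-cluster` is a placeholder name. Keeping the summary logic separate from the API call makes it easy to test without AWS access.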

On-demand deep health checks help prevent compute resource waste by identifying unhealthy nodes before workload execution, improving cluster reliability and job performance.





