Amazon SageMaker HyperPod now supports programmatic node reboot and replacement
This article announces the general availability of new APIs for Amazon SageMaker HyperPod that enable programmatic node management for ML workload clusters.
- BatchRebootClusterNodes and BatchReplaceClusterNodes APIs now generally available
- Enables programmatic rebooting and replacement of unresponsive or degraded cluster nodes
- Supports both Slurm and EKS orchestrated clusters
- Batch operations support up to 25 instances for efficient large-scale recovery
- Available in US East (Ohio), Asia Pacific (Mumbai), and Asia Pacific (Tokyo)
- Accessible via AWS CLI, SDK, or API calls
These new APIs provide consistent, orchestrator-agnostic node recovery operations for SageMaker HyperPod clusters running time-sensitive ML workloads.
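Because each batch call accepts at most 25 instances, recovering a larger set of nodes means splitting the ID list across several requests. The sketch below shows one way that might look. The helper names `chunked` and `run_in_batches` are hypothetical, and the keyword arguments `ClusterName` and `NodeIds` are assumptions about the request shape that should be checked against the SDK documentation. The API call itself is passed in as a callable, so the same helper could wrap either the reboot or the replace operation.

```python
from typing import Any, Callable, Dict, Iterator, List, Sequence

# Per-request instance limit stated in the announcement.
MAX_BATCH_SIZE = 25


def chunked(items: Sequence[str], size: int = MAX_BATCH_SIZE) -> Iterator[List[str]]:
    """Yield consecutive slices of `items`, each at most `size` long."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(items), size):
        yield list(items[start:start + size])


def run_in_batches(
    cluster_name: str,
    node_ids: Sequence[str],
    api_call: Callable[..., Dict[str, Any]],
) -> List[Dict[str, Any]]:
    """Invoke `api_call` once per batch of node IDs and collect the responses.

    `api_call` is expected to accept ClusterName and NodeIds keyword
    arguments (an assumption, not confirmed by the announcement).
    """
    responses = []
    for batch in chunked(node_ids):
        responses.append(api_call(ClusterName=cluster_name, NodeIds=batch))
    return responses


# Hypothetical usage with boto3 (method name assumed from the API name):
#   import boto3
#   client = boto3.client("sagemaker")
#   run_in_batches("my-cluster", degraded_ids, client.batch_reboot_cluster_nodes)
```

Injecting the call keeps the batching logic testable without AWS credentials. A production version would probably also inspect each response for per-node failures and retry them.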
Related articles
- Feb 10, 2026: Amazon SageMaker HyperPod now supports node actions from the console
- Aug 11, 2025: Amazon SageMaker HyperPod now provides a new cluster setup experience
- Sep 8, 2025: Announcing Managed Tiered Checkpointing for Amazon SageMaker HyperPod
- Aug 8, 2025: Amazon SageMaker HyperPod now supports continuous provisioning for enhanced cluster operations