Announcing Managed Tiered Checkpointing for Amazon SageMaker HyperPod
AWS has announced managed tiered checkpointing for Amazon SageMaker HyperPod, a new feature designed to improve AI model training reliability and recovery.
- Uses CPU memory for frequent, rapid checkpoints and Amazon S3 for long-term data persistence
- Reduces training recovery time and minimizes progress loss during infrastructure failures
- Allows customers to configure checkpoint frequency and retention policies
- Integrates with PyTorch Distributed Checkpoint (DCP) for straightforward adoption
- Currently available for SageMaker HyperPod clusters that use the Amazon EKS orchestrator
The solution enables organizations to train large-scale AI models more reliably and efficiently, with minimal code changes required.
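The two-tier idea described above can be illustrated with a small, self-contained sketch. This is not the SageMaker HyperPod or PyTorch DCP API: the class names `TieredCheckpointPolicy` and `TieredStore`, the default intervals, and the plain dictionaries standing in for CPU memory and Amazon S3 are all hypothetical, chosen only to show how frequent fast checkpoints, infrequent durable ones, and a retention limit might interact.

```python
from collections import OrderedDict
from dataclasses import dataclass, field


@dataclass
class TieredCheckpointPolicy:
    """Hypothetical policy object; not part of any AWS or PyTorch API."""
    memory_every: int = 10   # steps between fast in-memory checkpoints (assumed value)
    s3_every: int = 100      # steps between durable checkpoints (assumed value)
    memory_keep: int = 2     # retention: number of newest in-memory checkpoints kept

    def tiers_for(self, step: int) -> list[str]:
        """Return the tiers that should receive a checkpoint at this step."""
        tiers = []
        if step % self.memory_every == 0:
            tiers.append("memory")
        if step % self.s3_every == 0:
            tiers.append("s3")
        return tiers


@dataclass
class TieredStore:
    """Toy store: dictionaries stand in for CPU memory and Amazon S3."""
    policy: TieredCheckpointPolicy
    memory: OrderedDict = field(default_factory=OrderedDict)
    durable: dict = field(default_factory=dict)

    def save(self, step: int, state: dict) -> list[str]:
        tiers = self.policy.tiers_for(step)
        if "memory" in tiers:
            self.memory[step] = dict(state)
            # Enforce the retention limit by evicting the oldest entries.
            while len(self.memory) > self.policy.memory_keep:
                self.memory.popitem(last=False)
        if "s3" in tiers:
            self.durable[step] = dict(state)
        return tiers

    def latest(self, memory_lost: bool = False):
        """Pick the newest usable checkpoint, falling back to the durable tier."""
        candidates = dict(self.durable)
        if not memory_lost:
            candidates.update(self.memory)
        if not candidates:
            return None
        step = max(candidates)
        return step, candidates[step]


store = TieredStore(TieredCheckpointPolicy())
for step in range(1, 251):
    store.save(step, {"step": step})
```

With these assumed defaults, a 250-step run keeps in-memory checkpoints for steps 240 and 250 and durable ones for steps 100 and 200. `store.latest()` resumes from step 250, while `store.latest(memory_lost=True)` falls back to step 200, which illustrates why frequent in-memory checkpoints reduce the amount of progress lost after a failure.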
Related articles
- Sep 9, 2025: Accelerate your model training with managed tiered checkpointing on Amazon SageMaker HyperPod
- Dec 3, 2025: Amazon SageMaker HyperPod now supports checkpointless training
- Nov 27, 2025: Managed Tiered KV Cache and Intelligent Routing for Amazon SageMaker HyperPod
- Nov 26, 2025: SageMaker HyperPod now supports Managed tiered KV cache and intelligent routing