Accelerate your model training with managed tiered checkpointing on Amazon SageMaker HyperPod
Machine Learning Blog
AWS has announced managed tiered checkpointing in Amazon SageMaker HyperPod. The new feature targets large-scale AI model training, where it optimizes how checkpoints are stored and how quickly training recovers from them.
- Addresses the trade-off between training time, recovery speed, and storage costs
- Uses CPU memory for high-performance checkpoint storage with automatic data replication
- Supports checkpointing for large models like Meta Llama 3 (70B) and DeepSeek-R1 (671B)
- Integrates with PyTorch Distributed Checkpointing (DCP)
- Can save checkpoints within seconds on clusters ranging from hundreds to 15,000 GPUs
The feature speeds up recovery from training interruptions by storing checkpoints in multiple tiers: most checkpoints go to fast-access CPU RAM, and periodic backups go to Amazon S3. It does this without significant performance overhead.
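The tiering idea can be illustrated with a minimal, self-contained sketch. This is not the HyperPod API or the PyTorch DCP API: the class `TieredCheckpointer`, its method names, and the `memory_every` / `durable_every` intervals are all hypothetical, and a local directory stands in for Amazon S3. It only shows the policy described above, assuming frequent saves to a fast in-memory tier, occasional saves to a durable tier, and recovery that prefers the fast tier.

```python
import pickle
import tempfile
from pathlib import Path


class TieredCheckpointer:
    """Toy two-tier checkpointer (hypothetical, for illustration only).

    Tier 1: process memory, written often (stands in for CPU RAM).
    Tier 2: a directory, written rarely (stands in for Amazon S3).
    """

    def __init__(self, durable_dir, memory_every=10, durable_every=100):
        self.durable_dir = Path(durable_dir)
        self.durable_dir.mkdir(parents=True, exist_ok=True)
        self.memory_every = memory_every
        self.durable_every = durable_every
        self._memory = None  # (step, serialized state) or None

    def maybe_save(self, step, state):
        """Save to whichever tiers are due at this step."""
        blob = None
        if step % self.memory_every == 0:
            # Serialize so later mutation of `state` cannot alter the snapshot.
            blob = pickle.dumps(state)
            self._memory = (step, blob)
        if step % self.durable_every == 0:
            if blob is None:
                blob = pickle.dumps(state)
            path = self.durable_dir / f"step_{step:08d}.pkl"
            path.write_bytes(blob)

    def drop_memory_tier(self):
        """Simulate losing the in-memory tier, e.g. after a node failure."""
        self._memory = None

    def load_latest(self):
        """Return (step, state) from the fastest tier that has a checkpoint."""
        if self._memory is not None:
            step, blob = self._memory
            return step, pickle.loads(blob)
        files = sorted(self.durable_dir.glob("step_*.pkl"))
        if not files:
            return None
        latest = files[-1]  # zero-padded names sort in step order
        step = int(latest.stem.split("_")[1])
        return step, pickle.loads(latest.read_bytes())


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        ckpt = TieredCheckpointer(tmp)
        for step in range(1, 251):
            ckpt.maybe_save(step, {"step": step, "weights": [0.1 * step]})
        print("fast-tier recovery:", ckpt.load_latest()[0])
        ckpt.drop_memory_tier()
        print("durable-tier recovery:", ckpt.load_latest()[0])
```

The trade-off shows up in the two intervals: a short `memory_every` limits the work lost on a typical restart, while a longer `durable_every` keeps writes to the slower, persistent tier infrequent. The sketch omits the automatic replication of in-memory checkpoints across nodes that the feature provides.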