Accelerate your model training with managed tiered checkpointing on Amazon SageMaker HyperPod

Machine Learning Blog

AWS has announced managed tiered checkpointing in Amazon SageMaker HyperPod, a new feature designed to solve challenges in large-scale AI model training by optimizing checkpoint storage and recovery.

Addresses the trade-off between training time, recovery speed, and storage costs
Uses CPU memory for high-performance checkpoint storage with automatic data replication
Supports checkpointing for large models like Meta Llama 3 (70B) and DeepSeek-R1 (671B)
Integrates with PyTorch Distributed Checkpointing (DCP)
Can save checkpoints within seconds on clusters ranging from hundreds of GPUs to 15,000 GPUs

The feature speeds up recovery from training failures by storing checkpoints in multiple tiers: primarily in fast-access CPU RAM, with periodic backups to Amazon S3. It does this without significant performance overhead.
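The tiering idea can be illustrated with a small, self-contained sketch. It does not use the HyperPod feature or PyTorch DCP, and it is not how AWS implements either. The class `TieredCheckpointer`, its parameters `memory_every` and `durable_every`, and the helper `run_demo` are hypothetical names chosen for illustration. A local temporary directory stands in for Amazon S3, and replication of the in-memory copy across nodes is left out.

```python
import copy
import json
import os
import tempfile


class TieredCheckpointer:
    """Toy two-tier checkpointer (hypothetical, not the HyperPod API).

    Fast tier: a copy of the state kept in process memory, saved often.
    Durable tier: JSON files in a directory (a stand-in for S3), saved rarely.
    """

    def __init__(self, durable_dir, memory_every=10, durable_every=100):
        self.durable_dir = durable_dir
        self.memory_every = memory_every
        self.durable_every = durable_every
        self.memory_tier = None  # (step, state) held in RAM

    def maybe_save(self, step, state):
        """Save to whichever tiers are due at this step; return their names."""
        saved = []
        if step % self.memory_every == 0:
            self.memory_tier = (step, copy.deepcopy(state))
            saved.append("memory")
        if step % self.durable_every == 0:
            path = os.path.join(self.durable_dir, f"step_{step}.json")
            with open(path, "w") as f:
                json.dump(state, f)
            saved.append("durable")
        return saved

    def restore(self, memory_lost=False):
        """Prefer the in-memory copy; fall back to the newest durable file."""
        if self.memory_tier is not None and not memory_lost:
            step, state = self.memory_tier
            return step, copy.deepcopy(state)
        steps = [
            int(name[len("step_"):-len(".json")])
            for name in os.listdir(self.durable_dir)
            if name.startswith("step_") and name.endswith(".json")
        ]
        if not steps:
            return None
        latest = max(steps)
        with open(os.path.join(self.durable_dir, f"step_{latest}.json")) as f:
            return latest, json.load(f)


def run_demo():
    """Simulate 250 training steps, then recover from each tier."""
    with tempfile.TemporaryDirectory() as tmp:
        ckpt = TieredCheckpointer(tmp, memory_every=10, durable_every=100)
        state = {"step": 0, "weights": [0.0, 0.0]}
        for step in range(1, 251):
            state["step"] = step
            state["weights"] = [w + 0.1 for w in state["weights"]]
            ckpt.maybe_save(step, state)
        fast_step, _ = ckpt.restore()
        slow_step, _ = ckpt.restore(memory_lost=True)
        return fast_step, slow_step
```

In this sketch, recovering from the in-memory tier resumes from the most recent frequent checkpoint, while losing that tier falls back to the older durable copy. That gap is the trade-off between recovery speed and storage cost that the summary describes.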



Sep 8, 2025: Announcing Managed Tiered Checkpointing for Amazon SageMaker HyperPod

Dec 3, 2025: Introducing checkpointless and elastic training on Amazon SageMaker HyperPod

Dec 3, 2025: Amazon SageMaker HyperPod now supports checkpointless training

Dec 15, 2025: Checkpointless training on Amazon SageMaker HyperPod: Production-scale training with faster fault recovery



Related articles