
Accelerate your model training with managed tiered checkpointing on Amazon SageMaker HyperPod

Machine Learning Blog



AWS has announced managed tiered checkpointing in Amazon SageMaker HyperPod, a new feature designed to solve challenges in large-scale AI model training by optimizing checkpoint storage and recovery.

  • Addresses the trade-off between training time, recovery speed, and storage costs
  • Uses CPU memory for high-performance checkpoint storage with automatic data replication
  • Supports checkpointing for large models like Meta Llama 3 (70B) and DeepSeek-R1 (671B)
  • Integrates with PyTorch Distributed Checkpointing (DCP)
  • Can save checkpoints within seconds on clusters ranging from hundreds of GPUs up to 15,000 GPUs
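To give a feel for why checkpoint storage is a trade-off at this scale, the following rough sketch estimates checkpoint sizes from the parameter counts named above. The bytes-per-parameter figures are assumptions for illustration (2 bytes for 16-bit weights only; 14 bytes as one common accounting for 16-bit weights plus 32-bit master weights and two Adam moments), not numbers from the article, and real checkpoints vary with the training setup.

```python
# Back-of-the-envelope checkpoint size estimate.
# Parameter counts come from the model names in the list above;
# the bytes-per-parameter values are illustrative assumptions.

MODELS = {
    "Meta Llama 3 (70B)": 70_000_000_000,
    "DeepSeek-R1 (671B)": 671_000_000_000,
}

# Assumed storage cost per parameter, in bytes (hypothetical scenarios).
SCENARIOS = {
    "16-bit weights only": 2,
    "weights + fp32 master + Adam moments": 14,
}


def checkpoint_size_gb(num_params: int, bytes_per_param: int) -> float:
    """Return the estimated checkpoint size in decimal gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bytes_per_param / 1e9


if __name__ == "__main__":
    for model, params in MODELS.items():
        for scenario, nbytes in SCENARIOS.items():
            size = checkpoint_size_gb(params, nbytes)
            print(f"{model:22s} {scenario:38s} ~{size:,.0f} GB")
```

Under these assumptions even the weights-only case runs to hundreds of gigabytes, which is why how often and where checkpoints are written has a visible effect on training time and storage cost.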

The feature speeds up recovery after training failures by storing checkpoints in multiple tiers: primarily in fast-access CPU RAM, with periodic backups to Amazon S3, without significant performance overhead.
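The tiering idea can be illustrated with a small, self-contained sketch. It is not the SageMaker HyperPod or PyTorch DCP API: the `TieredCheckpointer` class, its method names, and the save intervals are hypothetical, a local directory stands in for Amazon S3, and cross-node replication of the in-memory tier is not modeled. It only shows the general pattern of frequent saves to RAM, occasional saves to durable storage, and recovery that prefers the faster tier.

```python
import os
import pickle
import tempfile
from typing import Any, Dict, Optional, Tuple


class TieredCheckpointer:
    """Toy two-tier checkpointer (hypothetical; not an AWS or PyTorch API).

    Tier 1: process memory, written often.
    Tier 2: a local directory standing in for durable storage such as S3.
    """

    def __init__(self, durable_dir: str, memory_every: int, durable_every: int) -> None:
        self.durable_dir = durable_dir
        self.memory_every = memory_every
        self.durable_every = durable_every
        self._memory: Optional[Tuple[int, bytes]] = None  # (step, serialized state)

    def _durable_path(self) -> str:
        return os.path.join(self.durable_dir, "latest.ckpt")

    def maybe_save(self, step: int, state: Dict[str, Any]) -> None:
        # Frequent, cheap save to the in-memory tier.
        if step % self.memory_every == 0:
            self._memory = (step, pickle.dumps(state))
        # Infrequent save to the durable tier; write to a temp file, then
        # rename, so a crash mid-write never leaves a truncated checkpoint.
        if step % self.durable_every == 0:
            tmp_path = self._durable_path() + ".tmp"
            with open(tmp_path, "wb") as f:
                pickle.dump((step, state), f)
            os.replace(tmp_path, self._durable_path())

    def drop_memory_tier(self) -> None:
        # Simulates losing the in-memory copy (for example, a node restart).
        self._memory = None

    def load_latest(self) -> Optional[Tuple[int, Dict[str, Any]]]:
        # Prefer the fast tier; fall back to the durable tier.
        if self._memory is not None:
            step, blob = self._memory
            return step, pickle.loads(blob)
        if os.path.exists(self._durable_path()):
            with open(self._durable_path(), "rb") as f:
                return pickle.load(f)
        return None


def demo() -> Tuple[int, int]:
    """Run 250 fake training steps and recover from each tier in turn."""
    with tempfile.TemporaryDirectory() as durable_dir:
        ckpt = TieredCheckpointer(durable_dir, memory_every=10, durable_every=100)
        for step in range(1, 251):
            ckpt.maybe_save(step, {"step": step, "weights": [0.1 * step]})
        fast_step, _ = ckpt.load_latest()   # served from the memory tier
        ckpt.drop_memory_tier()
        slow_step, _ = ckpt.load_latest()   # served from the durable tier
        return fast_step, slow_step


if __name__ == "__main__":
    print(demo())
```

In this toy run, recovery from the memory tier resumes at step 250, while recovery from the durable tier alone falls back to step 200. The gap between the two is the work that would have to be redone, which is the cost the in-memory tier is meant to reduce.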




Related articles

  • Sep 8, 2025: Announcing Managed Tiered Checkpointing for Amazon SageMaker HyperPod
  • Dec 3, 2025: Introducing checkpointless and elastic training on Amazon SageMaker HyperPod
  • Dec 3, 2025: Amazon SageMaker HyperPod now supports checkpointless training
  • Dec 15, 2025: Checkpointless training on Amazon SageMaker HyperPod: Production-scale training with faster fault recovery
