Amazon SageMaker HyperPod now supports checkpointless training


This article announces checkpointless training support in Amazon SageMaker HyperPod, a new capability for fault-tolerant foundation model training.

Reduces failure recovery time from hours to minutes without checkpoint-based restarts
Preserves model training state across distributed clusters automatically
Swaps faulty nodes on the fly using peer-to-peer state transfer from healthy accelerators
Achieves 95% training goodput even on clusters with thousands of AI accelerators
Eliminates expensive AI accelerator idle time and associated compute costs
Available in all AWS Regions where SageMaker HyperPod operates
Zero code changes needed for popular models like Llama and GPT OSS
Minimal modifications required for custom PyTorch-based workflows

Checkpointless training speeds up fault recovery in large-scale AI model training, which reduces compute costs and shortens training timelines.


Related articles

Dec 3, 2025: Introducing checkpointless and elastic training on Amazon SageMaker HyperPod
Dec 15, 2025: Checkpointless training on Amazon SageMaker HyperPod: Production-scale training with faster fault recovery
Dec 4, 2024: Amazon SageMaker HyperPod now provides flexible training plans
Sep 9, 2025: Accelerate your model training with managed tiered checkpointing on Amazon SageMaker HyperPod