
Amazon SageMaker HyperPod now supports checkpointless training




This article announces checkpointless training support in Amazon SageMaker HyperPod, a new capability for fault-tolerant foundation model training.

  • Reduces failure recovery time from hours to minutes without checkpoint-based restarts
  • Preserves model training state across distributed clusters automatically
  • Swaps faulty nodes on the fly using peer-to-peer state transfer from healthy accelerators
  • Achieves 95% training goodput even on clusters with thousands of AI accelerators
  • Eliminates expensive AI accelerator idle time and associated compute costs
  • Available in all AWS Regions where SageMaker HyperPod operates
  • Zero code changes needed for popular models like Llama and GPT OSS
  • Minimal modifications required for custom PyTorch-based workflows
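
For readers unfamiliar with the term, training goodput is commonly understood as the share of wall-clock time a cluster spends on useful training rather than on recovering from failures. The sketch below shows that arithmetic only. The function name, its parameters, and every number in the usage lines are hypothetical and chosen for illustration; they are not AWS measurements and not part of any SageMaker HyperPod API.

```python
def goodput(total_hours: float,
            failures: int,
            recovery_hours: float,
            lost_work_hours: float = 0.0) -> float:
    """Return the fraction of wall-clock time spent on useful training.

    Assumed (simplified) model: every failure costs a fixed recovery time
    plus a fixed amount of work that has to be redone.
    """
    if total_hours <= 0:
        raise ValueError("total_hours must be positive")
    wasted = failures * (recovery_hours + lost_work_hours)
    if wasted > total_hours:
        raise ValueError("wasted time exceeds total time")
    return (total_hours - wasted) / total_hours


# Hypothetical 30-day (720-hour) run with 20 failures.
# Checkpoint-based restart: assume 2 h to recover plus 1 h of redone work.
checkpoint_based = goodput(720, 20, recovery_hours=2.0, lost_work_hours=1.0)

# Faster recovery with no redone work: assume 0.1 h per failure.
fast_recovery = goodput(720, 20, recovery_hours=0.1)

print(f"checkpoint-based: {checkpoint_based:.1%}")
print(f"fast recovery:    {fast_recovery:.1%}")
```

Under this simplified model, goodput depends only on the number of failures and the time lost per failure, which is why shortening recovery from hours to minutes has a large effect on clusters where failures are frequent.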

Checkpointless training makes fault recovery in large-scale AI model training more efficient, which lowers compute costs and shortens training timelines.





Related articles

Dec 3, 2025: Introducing checkpointless and elastic training on Amazon SageMaker HyperPod
Dec 15, 2025: Checkpointless training on Amazon SageMaker HyperPod: Production-scale training with faster fault recovery
Dec 4, 2024: Amazon SageMaker HyperPod now provides flexible training plans
Sep 9, 2025: Accelerate your model training with managed tiered checkpointing on Amazon SageMaker HyperPod
