Home icon

Introducing checkpointless and elastic training on Amazon SageMaker HyperPod

AWS News Blog



This article announces two new AI model training features for Amazon SageMaker HyperPod: checkpointless training and elastic training.

  • Checkpointless training eliminates checkpoint-restart cycles using peer-to-peer state recovery
  • Reduces fault recovery time from hours to minutes, improving training momentum
  • Built on four core components: collective communications optimization, memory-mapped data loading, in-process recovery, peer-to-peer state replication
  • Elastic training automatically scales training workloads based on available cluster resources
  • Training jobs expand to use idle capacity and contract when higher-priority workloads need resources
  • Preserves global batch size and adapts learning rates during scaling operations
  • Amazon Nova models trained using checkpointless technology on tens of thousands of accelerators
  • Internal testing showed over 80% reduction in downtime compared to traditional checkpoint recovery
  • Both features available at no additional cost in select AWS regions

These features enable faster AI model training by eliminating manual infrastructure management and maximizing cluster utilization.



Go to article

The AWS News Feed is currently looking for gold sponsors. If you want to support the AWS community and reach a large audience of AWS professionals, consider sponsoring the AWS News Feed.

Related articles

Dec 3
2025
Amazon SageMaker HyperPod now supports checkpointless training
Dec 3
2025
Introducing elastic training on Amazon SageMaker HyperPod
Dec 15
2025
Checkpointless training on Amazon SageMaker HyperPod: Production-scale training with faster fault recovery
Sep 9
2025
Accelerate your model training with managed tiered checkpointing on Amazon SageMaker HyperPod

The AWS News Feed is currently looking for silver sponsors. If you want to support the AWS community and reach a large audience of AWS professionals, consider sponsoring the AWS News Feed.