
Checkpointless training on Amazon SageMaker HyperPod: Production-scale training with faster fault recovery

Machine Learning Blog



This article introduces checkpointless training on Amazon SageMaker HyperPod, a new approach to foundation model training that replaces traditional checkpoint-based recovery with peer-to-peer state recovery.

  • Reduces fault recovery time by 80–93% (from 15–30 minutes to under 2 minutes)
  • Enables over 95% training goodput on clusters with thousands of AI accelerators
  • Uses five key components: rootless NCCL initialization, memory-mapped data loading, in-process recovery, peer-to-peer state replication, and the SageMaker HyperPod training operator
  • Failed processes recover their state directly from healthy peers over the high-speed Elastic Fabric Adapter (EFA) network instead of loading it from storage
  • Eliminates cluster-wide restarts: only the failed processes go through recovery, while healthy processes continue training
  • Supports incremental adoption through four integration tiers, starting with NCCL optimization and progressing to full checkpointless recovery
  • Validated on production-scale clusters of up to 2,304 GPUs, with no impact on model training accuracy
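
The recovery-time and goodput figures above can be sanity-checked with simple arithmetic. The sketch below is illustrative only: the function names, the 24-hour window, and the count of four faults per day are assumptions made for this example, not figures from the article, and goodput is simplified here to count only recovery downtime as lost time.

```python
def recovery_reduction(before_min: float, after_min: float) -> float:
    """Fractional reduction in recovery time when going from before_min to after_min."""
    return (before_min - after_min) / before_min


def goodput(total_hours: float, faults: int, recovery_min: float) -> float:
    """Fraction of wall-clock time spent training.

    Simplifying assumption: the only lost time is recovery downtime,
    and every fault costs the same recovery_min minutes.
    """
    lost_hours = faults * recovery_min / 60.0
    return (total_hours - lost_hours) / total_hours


if __name__ == "__main__":
    # Quoted recovery times: 15-30 minutes before, under 2 minutes after.
    # Using 2 minutes as a conservative upper bound for the "after" case.
    for before in (15, 30):
        print(f"{before} min -> 2 min: {recovery_reduction(before, 2):.1%} reduction")

    # Hypothetical workload: 4 faults in a 24-hour window (assumed, not from the article).
    print(f"goodput at 30 min per recovery: {goodput(24, 4, 30):.1%}")
    print(f"goodput at 2 min per recovery:  {goodput(24, 4, 2):.1%}")
```

Under these assumptions, cutting recovery from 30 minutes to 2 minutes is a reduction of roughly 93%, matching the upper end of the quoted range, and the same four faults cost about 8 minutes of training time instead of 2 hours.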

Checkpointless training fundamentally shifts fault recovery from cluster-wide restarts to localized process-level recovery, dramatically improving efficiency and reducing costs for large-scale AI model training.
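
To make the idea of localized, peer-to-peer recovery concrete, here is a toy in-memory simulation. It is not the SageMaker HyperPod implementation or API: the `Rank` class, the ring-shaped replica layout, and the helper names are all hypothetical, chosen only to show how a failed process could restore its state from a healthy peer without touching storage or restarting the other processes.

```python
import copy


class Rank:
    """Hypothetical stand-in for one training process."""

    def __init__(self, rank_id, state):
        self.rank_id = rank_id
        self.state = state      # this rank's own training state
        self.replica = None     # in-memory copy of a peer's state
        self.alive = True


def replicate(ranks):
    """Each rank keeps a copy of its left neighbor's state (ring layout, an assumption)."""
    n = len(ranks)
    for i, rank in enumerate(ranks):
        rank.replica = copy.deepcopy(ranks[(i - 1) % n].state)


def fail(ranks, i):
    """Simulate a process failure: the rank loses everything it held in memory."""
    ranks[i].alive = False
    ranks[i].state = None
    ranks[i].replica = None


def recover(ranks, i):
    """Restore rank i from the peer holding its replica; other ranks are untouched."""
    n = len(ranks)
    holder = ranks[(i + 1) % n]  # right neighbor holds rank i's replica
    if not holder.alive or holder.replica is None:
        raise RuntimeError("no healthy peer holds a replica for this rank")
    ranks[i].state = copy.deepcopy(holder.replica)
    ranks[i].alive = True
    # The recovered rank resumes its own replication duty for its left neighbor.
    ranks[i].replica = copy.deepcopy(ranks[(i - 1) % n].state)


if __name__ == "__main__":
    ranks = [Rank(i, {"step": 100, "weights": [i, i + 1]}) for i in range(4)]
    replicate(ranks)
    fail(ranks, 2)
    recover(ranks, 2)
    print(ranks[2].state)
```

In this sketch a single replica per rank is enough to survive one failure at a time; if a rank and the peer holding its replica fail together, `recover` raises an error, which is the situation where a real system would need additional replicas or a fallback to stored checkpoints.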





Related articles

  • Dec 3, 2025: Introducing checkpointless and elastic training on Amazon SageMaker HyperPod
  • Dec 3, 2025: Amazon SageMaker HyperPod now supports checkpointless training
  • Dec 3, 2025: Introducing elastic training on Amazon SageMaker HyperPod
  • Sep 9, 2025: Accelerate your model training with managed tiered checkpointing on Amazon SageMaker HyperPod
