
Adaptive infrastructure for foundation model training with elastic training on SageMaker HyperPod

Machine Learning Blog



This article introduces elastic training on Amazon SageMaker HyperPod, which lets ML training workloads scale automatically with available cluster resources while maintaining training quality.

  • Training jobs scale up or down dynamically, without manual intervention or infrastructure reconfiguration
  • Keeps the global batch size and learning rate constant across different data-parallel configurations
  • Gracefully handles resource preemption for higher-priority workloads without terminating entire jobs
  • Uses PyTorch Distributed Checkpoint (DCP) for automatic model/optimizer state resharding
  • Integrates with Kubernetes, task governance, and SageMaker HyperPod observability
  • A Llama-3 70B fine-tuning benchmark showed throughput improving from 2K to 14K tokens/second
  • Pre-built recipes are available for Llama and GPT-OSS models and require only YAML configuration
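
The constant global batch size mentioned above can be illustrated with simple arithmetic. The sketch below shows one common way to hold the global batch size fixed when the number of data-parallel replicas changes: adjust the gradient accumulation steps. This is an assumption for illustration, not a description of how SageMaker HyperPod implements it; `BatchPlan`, `plan_for_replicas`, and the numbers used are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BatchPlan:
    """Hypothetical per-configuration batch settings (not a HyperPod API)."""
    data_parallel_size: int  # data-parallel replicas currently running
    micro_batch_size: int    # samples per replica per forward/backward pass
    grad_accum_steps: int    # micro-batches accumulated per optimizer step

    @property
    def global_batch_size(self) -> int:
        return (self.data_parallel_size
                * self.micro_batch_size
                * self.grad_accum_steps)


def plan_for_replicas(global_batch_size: int,
                      micro_batch_size: int,
                      data_parallel_size: int) -> BatchPlan:
    """Pick accumulation steps so the global batch size stays unchanged."""
    if data_parallel_size < 1 or micro_batch_size < 1:
        raise ValueError("replica count and micro-batch size must be >= 1")
    samples_per_pass = data_parallel_size * micro_batch_size
    if global_batch_size % samples_per_pass != 0:
        raise ValueError(
            f"global batch {global_batch_size} is not divisible by "
            f"{samples_per_pass} samples per pass"
        )
    return BatchPlan(data_parallel_size, micro_batch_size,
                     global_batch_size // samples_per_pass)


if __name__ == "__main__":
    # Assumed example values: global batch of 256, micro-batch of 4.
    for replicas in (2, 4, 8, 16):
        plan = plan_for_replicas(256, 4, replicas)
        print(replicas, plan.grad_accum_steps, plan.global_batch_size)
```

Because the global batch size is unchanged in every configuration, the learning rate does not need to be rescaled when replicas are added or removed. A replica count that does not divide the global batch evenly is rejected here; a real system would need its own policy for that case.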

Elastic training reduces GPU idle hours, cuts infrastructure costs, and accelerates model development by using available capacity automatically and removing manual scaling overhead.



Related articles

  • Dec 3, 2025: Introducing elastic training on Amazon SageMaker HyperPod
  • Jun 19, 2025: Accelerate foundation model training and inference with Amazon SageMaker HyperPod and Amazon SageMaker Studio
  • Dec 4, 2024: Accelerate foundation model training and fine-tuning with new Amazon SageMaker HyperPod recipes
  • Apr 2, 2026: Scaling seismic foundation models on AWS: Distributed training with Amazon SageMaker HyperPod and expanding context windows
