Adaptive infrastructure for foundation model training with elastic training on SageMaker HyperPod
Machine Learning Blog
This article introduces elastic training on Amazon SageMaker HyperPod, enabling ML workloads to automatically scale based on available cluster resources while maintaining training quality.
- Training jobs dynamically scale up/down without manual intervention or infrastructure reconfiguration
- Keeps the global batch size and learning rate constant as the data-parallel configuration changes
- Gracefully handles resource preemption for higher-priority workloads without terminating entire jobs
- Uses PyTorch Distributed Checkpoint (DCP) for automatic model/optimizer state resharding
- Integrates with Kubernetes, task governance, and SageMaker HyperPod observability
- Llama-3 70B fine-tuning benchmark showed throughput rising from 2K to 14K tokens/second, roughly a 7x increase
- Pre-built recipes for Llama and GPT-OSS models are available and require only YAML configuration
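
The point about holding the global batch size constant can be illustrated with a small calculation. This is a minimal sketch, not HyperPod's implementation: it assumes gradient accumulation is what absorbs a change in the number of data-parallel replicas, which the summary above does not confirm. The function name `accumulation_steps` and the example numbers (global batch of 512, micro-batch of 4) are hypothetical.

```python
def accumulation_steps(global_batch_size: int, micro_batch_size: int, dp_replicas: int) -> int:
    """Return the gradient-accumulation steps needed so that
    micro_batch_size * dp_replicas * steps == global_batch_size.

    Assumption: gradient accumulation is the knob used to keep the global
    batch size fixed when the replica count changes.
    """
    if min(global_batch_size, micro_batch_size, dp_replicas) <= 0:
        raise ValueError("all arguments must be positive integers")

    # Samples processed across all replicas in one forward/backward pass.
    samples_per_pass = micro_batch_size * dp_replicas

    if global_batch_size % samples_per_pass != 0:
        raise ValueError(
            f"global batch {global_batch_size} is not divisible by "
            f"{samples_per_pass} samples per pass"
        )
    return global_batch_size // samples_per_pass


if __name__ == "__main__":
    # Hypothetical job: the replica count changes as capacity comes and goes.
    for replicas in (2, 4, 8, 16):
        steps = accumulation_steps(global_batch_size=512, micro_batch_size=4, dp_replicas=replicas)
        print(f"{replicas:>2} replicas: {steps} accumulation steps per optimizer update")
```

Because each optimizer update still sees the same number of samples, the learning rate schedule does not need to be retuned when the job grows or shrinks. A replica count that does not divide the global batch evenly is rejected here; a real system would need its own policy for that case.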
Elastic training reduces GPU idle hours, cuts infrastructure costs, and accelerates model development by automatically utilizing available capacity while eliminating manual scaling overhead.