Adaptive infrastructure for foundation model training with elastic training on SageMaker HyperPod

Machine Learning Blog

This article introduces elastic training on Amazon SageMaker HyperPod, which lets ML workloads scale automatically with available cluster resources while maintaining training quality.

Training jobs scale up or down dynamically, without manual intervention or infrastructure reconfiguration
Global batch size and learning rate stay constant across different data-parallel configurations
Resource preemption for higher-priority workloads is handled gracefully, without terminating entire jobs
PyTorch Distributed Checkpoint (DCP) automatically reshards model and optimizer state
Elastic training integrates with Kubernetes, task governance, and SageMaker HyperPod observability
A Llama-3 70B fine-tuning benchmark showed throughput improving from 2K to 14K tokens/second
Pre-built recipes for Llama and GPT-OSS models require only YAML configuration
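
The summary does not say how the global batch size is held constant when the number of data-parallel replicas changes. One common way to do it is to adjust gradient accumulation so that replicas x micro-batch x accumulation steps stays fixed. The sketch below illustrates that arithmetic only; it is an assumption, not the HyperPod implementation, and the names BatchPlan and plan_for_world_size are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BatchPlan:
    """Hypothetical container for one data-parallel configuration."""
    data_parallel_size: int
    micro_batch_size: int
    grad_accum_steps: int

    @property
    def global_batch_size(self) -> int:
        # Samples contributing to a single optimizer step.
        return (self.data_parallel_size
                * self.micro_batch_size
                * self.grad_accum_steps)


def plan_for_world_size(global_batch_size: int,
                        micro_batch_size: int,
                        data_parallel_size: int) -> BatchPlan:
    """Pick gradient accumulation steps that preserve the global batch size.

    Assumption: the per-replica micro-batch size is fixed and only the
    accumulation count changes when replicas are added or removed.
    """
    if min(global_batch_size, micro_batch_size, data_parallel_size) <= 0:
        raise ValueError("all sizes must be positive")
    samples_per_micro_step = micro_batch_size * data_parallel_size
    if global_batch_size % samples_per_micro_step != 0:
        raise ValueError(
            f"global batch {global_batch_size} is not divisible by "
            f"{samples_per_micro_step} samples per micro-step"
        )
    return BatchPlan(
        data_parallel_size=data_parallel_size,
        micro_batch_size=micro_batch_size,
        grad_accum_steps=global_batch_size // samples_per_micro_step,
    )


if __name__ == "__main__":
    # Scaling from 2 to 16 replicas: accumulation shrinks, global batch does not.
    for replicas in (2, 4, 8, 16):
        plan = plan_for_world_size(256, 4, replicas)
        print(replicas, plan.grad_accum_steps, plan.global_batch_size)
```

Because the number of samples per optimizer step is unchanged, the learning rate needs no rescaling when replicas join or leave. Replica counts that do not divide the global batch evenly are rejected here; a real system would have to choose how to handle them.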

Elastic training reduces GPU idle hours, cuts infrastructure costs, and accelerates model development by automatically utilizing available capacity while eliminating manual scaling overhead.

Related articles

Dec 3, 2025
Introducing elastic training on Amazon SageMaker HyperPod

Jun 19, 2025
Accelerate foundation model training and inference with Amazon SageMaker HyperPod and Amazon SageMaker Studio

Dec 4, 2024
Accelerate foundation model training and fine-tuning with new Amazon SageMaker HyperPod recipes

Apr 2, 2026
Scaling seismic foundation models on AWS: Distributed training with Amazon SageMaker HyperPod and expanding context windows