Introducing elastic training on Amazon SageMaker HyperPod

This article announces elastic training support for Amazon SageMaker HyperPod, enabling automatic scaling of foundation model training workloads based on resource availability.

Automatically scales training jobs to utilize idle AI accelerators without manual reconfiguration
Eliminates the need to halt, reconfigure, and restart training when compute availability changes
Reduces infrastructure management overhead and maximizes cluster utilization
Training starts with minimal resources and grows opportunistically as capacity becomes available
No code changes are needed for public models such as Llama and GPT OSS when using HyperPod recipes
Custom models require lightweight configuration updates and minimal code modifications
Available in all AWS Regions where SageMaker HyperPod currently operates

Elastic training eliminates manual reconfiguration overhead, reduces costs through better resource utilization, and accelerates time-to-market for foundation model training.

Related articles

Dec 3, 2025: Introducing checkpointless and elastic training on Amazon SageMaker HyperPod
Dec 3, 2025: Amazon SageMaker HyperPod now supports checkpointless training
Jun 30, 2025: Announcing Amazon SageMaker HyperPod training operator
Dec 4, 2024: Amazon SageMaker HyperPod now provides flexible training plans