
Introducing elastic training on Amazon SageMaker HyperPod


This article announces elastic training support for Amazon SageMaker HyperPod, enabling automatic scaling of foundation model training workloads based on resource availability.

  • Automatically scales training jobs to utilize idle AI accelerators without manual reconfiguration
  • Eliminates the need to halt, reconfigure, and restart training when compute availability changes
  • Reduces infrastructure management overhead and maximizes cluster utilization
  • Training starts with minimal resources and grows opportunistically as capacity becomes available
  • Zero code changes needed for public models like Llama and GPT OSS using HyperPod recipes
  • Custom models require lightweight configuration updates and minimal code modifications
  • Available in all AWS Regions where SageMaker HyperPod is currently available
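
To make the "start small, grow opportunistically" behavior concrete, here is a minimal conceptual sketch in Python. It is not the HyperPod API and not how the service is implemented: the `ElasticJob` class, its `min_workers` and `max_workers` fields, and the `rescale` and `simulate` functions are all hypothetical names invented for illustration. The sketch only models the scheduling idea that a job runs with whatever accelerators are available, bounded by a configured minimum and maximum.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ElasticJob:
    """Hypothetical model of an elastic training job (illustration only)."""
    min_workers: int   # smallest accelerator count the job can run on
    max_workers: int   # largest accelerator count the job can make use of
    workers: int = 0   # accelerators currently assigned to the job

    def rescale(self, available: int) -> int:
        """Resize the job to fit the accelerators available right now."""
        if available < self.min_workers:
            # Below the minimum the job cannot run; it waits for capacity.
            self.workers = 0
        else:
            # Otherwise take as much as is available, up to the maximum.
            self.workers = min(available, self.max_workers)
        return self.workers


def simulate(job: ElasticJob, availability: List[int]) -> List[int]:
    """Apply a sequence of availability readings and record each job size."""
    return [job.rescale(available) for available in availability]


if __name__ == "__main__":
    job = ElasticJob(min_workers=2, max_workers=8)
    # Capacity starts scarce, grows, then is partly reclaimed.
    print(simulate(job, [2, 2, 6, 16, 4]))  # prints [2, 2, 6, 8, 4]
```

In this toy model the job never stops to be reconfigured by hand: each change in availability simply yields a new worker count within the configured bounds.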

Elastic training eliminates manual reconfiguration overhead, reduces costs through better resource utilization, and accelerates time-to-market for foundation model training.




Related articles

  • Dec 3, 2025: Introducing checkpointless and elastic training on Amazon SageMaker HyperPod
  • Dec 3, 2025: Amazon SageMaker HyperPod now supports checkpointless training
  • Jun 30, 2025: Announcing Amazon SageMaker HyperPod training operator
  • Dec 4, 2024: Amazon SageMaker HyperPod now provides flexible training plans
