
SageMaker HyperPod now supports gang scheduling for distributed training workloads




This article announces gang scheduling support for Amazon SageMaker HyperPod, which ensures all pods required for distributed training jobs are ready before execution begins.

  • Gang scheduling prevents wasted compute from partial job runs and resource deadlocks
  • Monitors all pods in a workload and requeues the workload if not all pods are ready within a configured time
  • Administrators can configure pod readiness wait times, node failure handling, and retry scheduling
  • Automatically requeues pulled-back workloads to prevent job stalling
  • Available in 15 AWS Regions across the US, Asia Pacific, Europe, and South America

Gang scheduling improves resource efficiency and prevents distributed training jobs from blocking other workloads on SageMaker HyperPod clusters.
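
The all-or-nothing behavior summarized above can be illustrated with a small sketch. This is not HyperPod's API or implementation: the `Workload` class, the `gang_decision` and `apply_requeue` functions, and the `ready_timeout_s` and `max_requeues` parameters are hypothetical names chosen for illustration, and the idea of a fixed retry limit is an assumption about how retry scheduling might be bounded.

```python
from dataclasses import dataclass


@dataclass
class Workload:
    """Hypothetical record of one distributed training workload."""
    name: str
    pods_required: int      # size of the gang: every pod must be ready
    pods_ready: int = 0     # pods currently reporting ready
    waited_s: float = 0.0   # seconds spent waiting for the full gang
    requeues: int = 0       # how many times the workload was requeued


def gang_decision(w: Workload, ready_timeout_s: float, max_requeues: int) -> str:
    """Decide what to do with a workload on one scheduling check.

    Returns "run" only when the whole gang is ready, "wait" while the
    readiness timeout has not expired, "requeue" when it has expired and
    retries remain, and "fail" otherwise (assumed retry limit).
    """
    if w.pods_ready >= w.pods_required:
        return "run"
    if w.waited_s < ready_timeout_s:
        return "wait"
    if w.requeues < max_requeues:
        return "requeue"
    return "fail"


def apply_requeue(w: Workload) -> None:
    """Release partially started pods and put the workload back in the queue."""
    w.pods_ready = 0
    w.waited_s = 0.0
    w.requeues += 1


if __name__ == "__main__":
    job = Workload("llm-pretrain", pods_required=4, pods_ready=3, waited_s=300.0)
    decision = gang_decision(job, ready_timeout_s=300.0, max_requeues=3)
    if decision == "requeue":
        apply_requeue(job)
    print(job.name, decision, job.requeues)
```

The point of the sketch is that a partially ready workload never runs: it either keeps waiting, or gives its resources back so that other workloads are not blocked.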




