SageMaker HyperPod now supports gang scheduling for distributed training workloads


Amazon SageMaker HyperPod now supports gang scheduling, which ensures that all pods required by a distributed training job are ready before the job starts running.

Gang scheduling prevents wasted compute from partial job runs and resource deadlocks
Monitors all pods in a workload and requeues the workload if not all pods are ready within a set time
Administrators can configure pod readiness wait times, node failure handling, and retry scheduling
Automatically requeues pulled-back workloads to prevent job stalling
Available across 15 AWS regions including US, Asia Pacific, Europe, and South America

Gang scheduling improves resource efficiency and prevents distributed training jobs from blocking other workloads on SageMaker HyperPod clusters.
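
The all-or-nothing behavior described above can be illustrated with a small, generic sketch. It is not HyperPod's implementation or API: the function name `gang_schedule`, the `ready_counts` sequence, and the `max_polls` limit are hypothetical stand-ins for the pod readiness check and the configurable wait time.

```python
from typing import Iterable


def gang_schedule(pods_required: int, ready_counts: Iterable[int], max_polls: int) -> str:
    """Decide whether a workload runs or is requeued (illustrative only).

    ready_counts yields the number of ready pods seen at each poll;
    max_polls stands in for the configured readiness wait time.
    """
    for polls, ready in enumerate(ready_counts, start=1):
        if ready >= pods_required:
            # Every pod in the gang is ready, so the job may start.
            return "run"
        if polls >= max_polls:
            # Wait time exhausted with only a partial gang ready.
            break
    # Release the partial allocation and put the workload back in the queue.
    return "requeue"


# All 4 pods become ready on the third poll, within the limit of 5.
print(gang_schedule(4, [1, 3, 4], max_polls=5))
# Only 3 of 4 pods are ready when the limit of 3 polls is reached.
print(gang_schedule(4, [1, 2, 3, 4], max_polls=3))
```

In this sketch a job either starts with its full set of pods or holds none of them, which is the property that avoids partial runs and deadlocks between competing workloads.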


Aug 14, 2025

SageMaker HyperPod now supports Topology Aware Scheduling of LLM tasks

Dec 4, 2024

Amazon SageMaker HyperPod now provides flexible training plans

Mar 4, 2025

SageMaker HyperPod Flexible Training Plans now supports instant start times and multiple offers

Dec 3, 2025

Amazon SageMaker HyperPod now supports checkpointless training



Related articles