Reduce ML training costs with Amazon SageMaker HyperPod

Machine Learning Blog

This article discusses Amazon SageMaker HyperPod, a resilient infrastructure solution designed to reduce machine learning training costs and minimize downtime during large-scale model training. Key insights include:

- Training frontier models is highly compute-intensive, with potential hardware failure rates of 0.02%–0.06% per instance hour
- As cluster sizes increase, the mean time between failures (MTBF) decreases dramatically
- SageMaker HyperPod automatically detects, replaces, and resumes training after hardware failures
- For a 256-instance cluster with a 0.05% failure rate, SageMaker HyperPod can:
  - Reduce total training time by 32%
  - Save approximately $25.6 million in training costs
  - Reduce downtime from 280 to 40 minutes per failure
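
The link between cluster size and MTBF can be illustrated with a short calculation. The sketch below is not taken from the article: it assumes failures are independent and occur at a constant rate per instance hour, and the function names are hypothetical. It does not try to reproduce the 32% or $25.6 million figures, which depend on inputs not given in this summary.

```python
def cluster_mtbf_hours(instances: int, failure_rate_per_instance_hour: float) -> float:
    """Cluster-wide mean time between failures, in hours.

    Assumption: independent failures at a constant per-instance rate,
    so the cluster failure rate is simply instances * rate.
    """
    return 1.0 / (instances * failure_rate_per_instance_hour)


def downtime_saved_hours(
    instances: int,
    failure_rate_per_instance_hour: float,
    training_hours: float,
    manual_recovery_min: float = 280.0,  # per-failure downtime quoted in the summary
    auto_recovery_min: float = 40.0,     # per-failure downtime quoted in the summary
) -> float:
    """Expected downtime avoided over a run of `training_hours` of compute time."""
    expected_failures = instances * failure_rate_per_instance_hour * training_hours
    saved_minutes = expected_failures * (manual_recovery_min - auto_recovery_min)
    return saved_minutes / 60.0


if __name__ == "__main__":
    # 0.05% per instance hour, written as a fraction
    rate = 0.0005
    for n in (16, 64, 256):
        print(f"{n:>4} instances: MTBF = {cluster_mtbf_hours(n, rate):.1f} h")
```

Under these assumptions, a 256-instance cluster at a 0.05% failure rate has an MTBF of roughly 7.8 hours, so a multi-week run should expect many interruptions, and the per-failure recovery time dominates the lost time.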

The solution enables ML teams to focus on model innovation by automatically managing infrastructure reliability during long training runs.

