Amazon SageMaker HyperPod introduces Amazon EKS support
AWS News Blog
This article introduces Amazon Elastic Kubernetes Service (Amazon EKS) support for Amazon SageMaker HyperPod. Customers can now orchestrate HyperPod clusters with Amazon EKS, combining Kubernetes with HyperPod's resilient environment for training large models.
Specifically, the article covers:
- The ability to manage HyperPod clusters through a Kubernetes-based interface, with the option to switch between Slurm and Amazon EKS to suit different workloads.
- Enhanced observability through the CloudWatch Observability EKS add-on, providing comprehensive monitoring capabilities for CPU, network, disk, and other low-level node metrics.
- Integration benefits, including resilience through deep health checks, automated node recovery, and job auto-resume capabilities.
- Steps for getting started with Amazon EKS support in Amazon SageMaker HyperPod, including creating a cluster, running jobs, and monitoring cluster performance.
- Key features and benefits, such as a resilient training environment, enhanced GPU observability, scientist-friendly tools, and flexible resource utilization.
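The cluster-creation step can be sketched as a request payload for the SageMaker `CreateCluster` API. This is a minimal illustration, not the article's own walkthrough: the ARNs, S3 URI, subnet and security group IDs, instance type, and group name below are placeholder assumptions, and the helper function `build_create_cluster_request` is hypothetical. The field names follow the `CreateCluster` request shape as exposed through boto3.

```python
from typing import Any


def build_create_cluster_request(
    cluster_name: str,
    eks_cluster_arn: str,
    execution_role_arn: str,
    lifecycle_s3_uri: str,
    subnet_ids: list[str],
    security_group_ids: list[str],
    instance_type: str = "ml.g5.8xlarge",  # assumed instance type, adjust as needed
    instance_count: int = 2,
) -> dict[str, Any]:
    """Build keyword arguments for a HyperPod cluster orchestrated by Amazon EKS."""
    return {
        "ClusterName": cluster_name,
        # Point HyperPod at an existing EKS cluster that acts as the orchestrator.
        "Orchestrator": {"Eks": {"ClusterArn": eks_cluster_arn}},
        "InstanceGroups": [
            {
                "InstanceGroupName": "worker-group-1",  # hypothetical group name
                "InstanceType": instance_type,
                "InstanceCount": instance_count,
                "ExecutionRole": execution_role_arn,
                "LifeCycleConfig": {
                    "SourceS3Uri": lifecycle_s3_uri,
                    "OnCreate": "on_create.sh",  # assumed lifecycle script name
                },
                "ThreadsPerCore": 1,
                # Deep health checks run when instances join the cluster.
                "OnStartDeepHealthChecks": ["InstanceStress", "InstanceConnectivity"],
            }
        ],
        "VpcConfig": {
            "Subnets": subnet_ids,
            "SecurityGroupIds": security_group_ids,
        },
        # Ask HyperPod to replace or reboot faulty nodes automatically.
        "NodeRecovery": "Automatic",
    }


# Placeholder values for illustration only; none of these resources exist.
request = build_create_cluster_request(
    cluster_name="demo-hyperpod-eks",
    eks_cluster_arn="arn:aws:eks:us-west-2:111122223333:cluster/demo-eks",
    execution_role_arn="arn:aws:iam::111122223333:role/demo-hyperpod-role",
    lifecycle_s3_uri="s3://demo-bucket/lifecycle-scripts/",
    subnet_ids=["subnet-0123456789abcdef0"],
    security_group_ids=["sg-0123456789abcdef0"],
)
```

With valid resources and credentials, the payload could then be passed to `boto3.client("sagemaker").create_cluster(**request)`. Building the dictionary separately keeps the configuration easy to review before any AWS call is made.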
Related articles
- Sep 12, 2024: Introducing Amazon EKS support in Amazon SageMaker HyperPod
- Sep 10, 2024: Amazon EKS support in Amazon SageMaker HyperPod to scale foundation model development
- Sep 10, 2024: Container Insights now announces SageMaker HyperPod node health observability on EKS
- Aug 11, 2025: Amazon SageMaker HyperPod now provides a new cluster setup experience