
Introducing Amazon EKS support in Amazon SageMaker HyperPod

Machine Learning Blog



This article introduces Amazon EKS support in Amazon SageMaker HyperPod, purpose-built infrastructure designed for resilient, scalable training of foundation models (FMs).

Specifically, the article covers:

  • An overview of Amazon EKS support in SageMaker HyperPod, including the architecture, the managed resiliency features (deep health checks, automated node recovery, job auto resume), and the user experience for admins and scientists
  • A detailed guide to setting up HyperPod compute as worker nodes in an EKS cluster, with emphasis on built-in resiliency features such as deep health checks and automated node recovery
  • A demo of training job resiliency using job auto resume with the Kubeflow Training Operator for PyTorch, so that jobs recover and continue automatically after interruptions or failures
  • A conclusion highlighting the benefits of EKS support in SageMaker HyperPod for resilient, scalable FM training, along with resources for further learning




Related articles

  • Sep 10, 2024: Amazon SageMaker HyperPod introduces Amazon EKS support
  • Sep 10, 2024: Amazon EKS support in Amazon SageMaker HyperPod to scale foundation model development
  • Sep 10, 2024: Container Insights now announces SageMaker HyperPod node health observability on EKS
  • Nov 20, 2025: Amazon SageMaker Unified Studio adds EMR on EKS support with SSO capabilities
