
Container Insights now offers SageMaker HyperPod node health observability on EKS

The article announces that Amazon CloudWatch Container Insights now provides observability into the health status of Amazon SageMaker HyperPod nodes running on Amazon Elastic Kubernetes Service (EKS). This lets teams monitor node availability so that unhealthy nodes do not needlessly extend training time.

Specifically, the article covers:

  • Container Insights auto-discovers the health status of SageMaker HyperPod nodes on EKS and visualizes it in curated dashboards.
  • It collects deep health check test results for HyperPod nodes and displays them in preset dashboards.
  • It helps identify unhealthy nodes, classifies failing nodes as "pending reboot" or "pending replacement", and offers guidance on maintaining node health.
  • It provides visibility into node mutations, delays in training jobs, and how tasks resume from the last checkpoint.
  • To get started, install the CloudWatch Observability EKS add-on or the latest CloudWatch agent.
  • SageMaker HyperPod node health observability is available in all commercial regions where SageMaker HyperPod is offered, and is priced per observation collected.
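
As a rough illustration of the getting-started step, the add-on can be installed through the EKS `CreateAddon` API. The sketch below uses boto3 and assumes the add-on name `amazon-cloudwatch-observability`; the cluster name, region, and helper function names are hypothetical placeholders, and the IAM permissions the CloudWatch agent needs are assumed to be configured separately.

```python
from typing import Any, Dict, Optional

# Name of the CloudWatch Observability EKS add-on (assumed here; verify
# against the add-on catalog for your cluster version).
ADDON_NAME = "amazon-cloudwatch-observability"


def build_addon_request(cluster_name: str,
                        addon_version: Optional[str] = None) -> Dict[str, Any]:
    """Build the keyword arguments for the EKS CreateAddon call."""
    if not cluster_name:
        raise ValueError("cluster_name must be non-empty")
    request: Dict[str, Any] = {
        "clusterName": cluster_name,
        "addonName": ADDON_NAME,
    }
    # Pin a version only when one is given; otherwise EKS picks a default.
    if addon_version:
        request["addonVersion"] = addon_version
    return request


def install_addon(cluster_name: str, region: str) -> str:
    """Install the add-on and return its initial status (e.g. CREATING)."""
    import boto3  # imported lazily so the request builder works without it

    eks = boto3.client("eks", region_name=region)
    response = eks.create_addon(**build_addon_request(cluster_name))
    return response["addon"]["status"]


if __name__ == "__main__":
    # Hypothetical cluster name; prints the request without calling AWS.
    print(build_addon_request("my-hyperpod-cluster"))
```

Calling `install_addon("my-hyperpod-cluster", "us-west-2")` would issue the real request, so it requires valid AWS credentials and an existing EKS cluster.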




Related articles

  • Sep 10, 2024: Amazon SageMaker HyperPod introduces Amazon EKS support
  • Sep 12, 2024: Introducing Amazon EKS support in Amazon SageMaker HyperPod
  • Jul 10, 2025: Amazon SageMaker HyperPod announces new observability capability
  • Apr 22, 2026: Amazon SageMaker HyperPod now supports on-demand deep health checks
