Container Insights announces SageMaker HyperPod node health observability on EKS
The article announces that Amazon CloudWatch Container Insights now provides observability for the health status of SageMaker HyperPod nodes running on Amazon Elastic Kubernetes Service (EKS). Operators can monitor node availability so that training jobs run on healthy nodes and finish without avoidable delays.
Specifically, the article covers:
- Container Insights auto-discovers the health status of SageMaker HyperPod nodes on EKS and visualizes it in curated dashboards.
- It collects deep health check test results for HyperPod nodes and displays them in preset dashboards.
- It helps identify unhealthy nodes, classifies failing nodes as "pending reboot" or "pending replacement", and offers guidance on maintaining node health.
- It provides visibility into node mutations, delays in training jobs, and how tasks resume from the last checkpoint.
- To get started, install the CloudWatch Observability EKS Add-on or the latest CloudWatch agent.
- SageMaker HyperPod node health observability is available in all commercial AWS Regions where SageMaker HyperPod is available, and pricing is based on observations.
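The add-on installation step mentioned above could be scripted roughly as follows. This is a minimal sketch, not taken from the article: it assumes the add-on is registered under the name amazon-cloudwatch-observability and that the EKS client is created with boto3 (`boto3.client("eks")`), whose `create_addon` call takes `clusterName` and `addonName`. The helper names and the cluster name are hypothetical, and real clusters typically also need IAM permissions for the CloudWatch agent, which are not shown here.

```python
from typing import Any, Dict

# Assumption: name under which the CloudWatch Observability add-on is published.
ADDON_NAME = "amazon-cloudwatch-observability"


def build_addon_request(cluster_name: str) -> Dict[str, Any]:
    """Build the keyword arguments for an EKS create_addon call (hypothetical helper)."""
    if not cluster_name:
        raise ValueError("cluster_name must be a non-empty string")
    return {"clusterName": cluster_name, "addonName": ADDON_NAME}


def install_observability_addon(eks_client: Any, cluster_name: str) -> Dict[str, Any]:
    """Request installation of the add-on on the given cluster.

    eks_client is expected to behave like boto3.client("eks"); passing it in
    keeps this sketch importable without AWS credentials.
    """
    return eks_client.create_addon(**build_addon_request(cluster_name))


# Example usage (requires boto3 and AWS credentials; cluster name is made up):
#   import boto3
#   install_observability_addon(boto3.client("eks"), "my-hyperpod-cluster")
```

Passing the client in as a parameter is a design choice made for this sketch so the request can be inspected or tested before anything is sent to AWS.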