
Amazon SageMaker HyperPod now provides comprehensive observability for Restricted Instance Groups




This article announces comprehensive observability capabilities for Amazon SageMaker HyperPod Restricted Instance Groups, enabling teams to monitor foundation model training with unified visibility across infrastructure and workloads.

  • Monitor GPU utilization, NVLink bandwidth, CPU pressure, and FSx for Lustre usage from a single dashboard
  • Pre-configured Amazon Managed Grafana dashboard backed by Amazon Managed Service for Prometheus
  • Metrics collected across four exporters covering GPU, system health, network, and Kubernetes state
  • Curated logs automatically available for epoch progress, training logs, errors, and tracebacks
  • Automatically enabled for new clusters; can be enabled for existing clusters with a few clicks
  • Available in all AWS Regions that support SageMaker HyperPod Restricted Instance Groups (RIG)
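
Because the metrics land in Amazon Managed Service for Prometheus, they can in principle also be queried outside the dashboard. The following is a minimal sketch of building an instant-query URL for average GPU utilization per node. It rests on assumptions that the announcement does not confirm: that the GPU exporter is the NVIDIA DCGM exporter (metric `DCGM_FI_DEV_GPU_UTIL`), that series carry the standard Prometheus `instance` label, and that `ws-EXAMPLE` stands in for a real workspace ID. It only constructs the URL; an actual request would also need AWS SigV4 signing.

```python
from urllib.parse import urlencode

# Assumption: the GPU exporter is the NVIDIA DCGM exporter, which publishes
# DCGM_FI_DEV_GPU_UTIL (percent utilization per GPU). The announcement does
# not name the exporters, so adjust the metric name to match your cluster.
GPU_UTIL_METRIC = "DCGM_FI_DEV_GPU_UTIL"


def avg_gpu_util_by_node(metric: str = GPU_UTIL_METRIC) -> str:
    """Return a PromQL expression averaging GPU utilization per scrape target."""
    # "instance" is the standard Prometheus target label (assumed present here).
    return f"avg by (instance) ({metric})"


def build_query_url(region: str, workspace_id: str, promql: str) -> str:
    """Return the instant-query URL for a managed Prometheus workspace."""
    base = f"https://aps-workspaces.{region}.amazonaws.com/workspaces/{workspace_id}"
    # urlencode percent-encodes the PromQL expression for use in a query string.
    return f"{base}/api/v1/query?{urlencode({'query': promql})}"


# Hypothetical region and workspace ID, for illustration only.
url = build_query_url("us-east-1", "ws-EXAMPLE", avg_gpu_util_by_node())
print(url)
```

The query is kept separate from the URL builder so that the metric name, which is the least certain part of this sketch, can be swapped without touching the endpoint logic.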

SageMaker HyperPod now eliminates manual metric collection, providing unified observability for foundation model training infrastructure and workloads.





Related articles

Jul 10, 2025: Amazon SageMaker HyperPod announces new observability capability
Apr 17, 2026: Amazon SageMaker HyperPod now supports flexible instance groups
Apr 27, 2026: Amazon SageMaker HyperPod now supports G7e and r5d.16xlarge instances
Nov 25, 2025: Amazon SageMaker HyperPod now supports Spot Instances
