
Accelerate foundation model development with one-click observability in Amazon SageMaker HyperPod

Machine Learning Blog



Amazon SageMaker HyperPod introduces a one-click observability feature for foundation model development. It collects cluster metrics automatically and presents them in pre-built dashboards.

  • Automatically publishes metrics to Amazon Managed Service for Prometheus and visualizes them in Amazon Managed Grafana
  • Consolidates health and performance data from multiple sources, such as NVIDIA DCGM, Kubernetes, and hardware-level metrics
  • Offers pre-built dashboards for Cluster, Tasks, Inference, Training, and File System monitoring
  • Enables data scientists to monitor resource utilization at the per-GPU level
  • Allows cluster administrators to configure custom alerts and notification channels

The feature simplifies cluster telemetry management, helping teams accelerate foundation model development and reduce operational complexity.
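
As a rough illustration of what per-GPU monitoring can look like once metrics are in a Prometheus-compatible workspace, the sketch below builds a PromQL query and parses a response in the standard Prometheus HTTP API format. It is not taken from the article. It assumes the GPU metrics come from NVIDIA's dcgm-exporter, which names GPU utilization `DCGM_FI_DEV_GPU_UTIL` and labels each series with `Hostname` and `gpu`; the metric and label names in a given HyperPod cluster may differ. The workspace URL and the sample response are hypothetical placeholders.

```python
from urllib.parse import urlencode

# Assumption: dcgm-exporter naming; verify against the metrics in your workspace.
GPU_UTIL_METRIC = "DCGM_FI_DEV_GPU_UTIL"


def build_gpu_util_query(metric: str = GPU_UTIL_METRIC, window: str = "5m") -> str:
    """Return PromQL for average utilization per (Hostname, gpu) over a window."""
    return f"avg by (Hostname, gpu) (avg_over_time({metric}[{window}]))"


def build_query_url(base_url: str, promql: str) -> str:
    """Build an instant-query URL for the Prometheus HTTP API (/api/v1/query)."""
    return f"{base_url.rstrip('/')}/api/v1/query?{urlencode({'query': promql})}"


def parse_instant_vector(payload: dict) -> dict:
    """Map (hostname, gpu index) to a float value from an instant-query response."""
    if payload.get("status") != "success":
        raise ValueError(f"query failed: {payload.get('error', 'unknown error')}")
    utilization = {}
    for series in payload["data"]["result"]:
        labels = series["metric"]
        _timestamp, value = series["value"]  # Prometheus encodes the value as a string
        key = (labels.get("Hostname", "unknown"), labels.get("gpu", "unknown"))
        utilization[key] = float(value)
    return utilization


# Hypothetical response, shaped like the Prometheus instant-query JSON format.
SAMPLE_RESPONSE = {
    "status": "success",
    "data": {
        "resultType": "vector",
        "result": [
            {"metric": {"Hostname": "node-a", "gpu": "0"}, "value": [1700000000, "97.5"]},
            {"metric": {"Hostname": "node-a", "gpu": "1"}, "value": [1700000000, "12"]},
        ],
    },
}

if __name__ == "__main__":
    # Hypothetical workspace URL; replace with your own query endpoint.
    url = build_query_url("https://example.invalid/workspaces/ws-EXAMPLE", build_gpu_util_query())
    print(url)
    for (host, gpu), util in sorted(parse_instant_vector(SAMPLE_RESPONSE).items()):
        print(f"{host} gpu={gpu} util={util:.1f}%")
```

The sketch deliberately makes no network call. Requests to an Amazon Managed Service for Prometheus workspace must be signed with AWS Signature Version 4, so a real client would add signing before sending the URL built here.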




Related articles

  • Jun 19, 2025: Accelerate foundation model training and inference with Amazon SageMaker HyperPod and Amazon SageMaker Studio
  • Jul 10, 2025: Amazon SageMaker HyperPod announces new observability capability
  • Dec 4, 2024: Accelerate foundation model training and fine-tuning with new Amazon SageMaker HyperPod recipes
  • Sep 10, 2024: Amazon EKS support in Amazon SageMaker HyperPod to scale foundation model development
