Gain operational insights for NVIDIA GPU workloads using Amazon CloudWatch Container Insights
AWS Cloud Operations Blog
The article explains how to use Amazon CloudWatch Container Insights to gain operational insights into NVIDIA GPU workloads. Container Insights now collects metrics and logs for GPU and Elastic Fabric Adapter (EFA) performance on Amazon EKS clusters.
Specifically, the article covers:
- Solution overview and prerequisites for setting up an EKS cluster with GPU nodes and EFA support
- Step-by-step instructions for deploying a demo cluster, enabling the CloudWatch Container Insights add-on, and running GPU and EFA test workloads
- Details on the out-of-the-box CloudWatch Container Insights dashboards to visualize and analyze GPU and EFA metrics at different levels (cluster, node, pod, container, GPU device)
- Benefits of enhanced GPU and EFA observability for optimizing machine learning workloads and resource utilization
- Cleanup steps to tear down the demo environment
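As a rough illustration of the add-on and metrics steps listed above, the following Python sketch builds the request parameters for enabling the add-on and for reading one GPU metric back from CloudWatch. It is not taken from the article. The add-on name `amazon-cloudwatch-observability`, the `ContainerInsights` namespace, the `node_gpu_utilization` metric name, and the `ClusterName` dimension are assumptions based on the public Container Insights documentation, and the cluster name `demo-cluster` is hypothetical. The `main` function needs the `boto3` package, AWS credentials, and suitable IAM permissions.

```python
from datetime import datetime, timedelta, timezone

# Assumed values, not taken from the article.
NAMESPACE = "ContainerInsights"
ADDON_NAME = "amazon-cloudwatch-observability"


def addon_request(cluster_name):
    """Build keyword arguments for the EKS CreateAddon call."""
    return {"clusterName": cluster_name, "addonName": ADDON_NAME}


def gpu_utilization_query(cluster_name, metric_name="node_gpu_utilization",
                          minutes=60, period=60, now=None):
    """Build keyword arguments for the CloudWatch GetMetricData call.

    The query asks for the average of one metric over the last
    `minutes` minutes, aggregated at cluster level.
    """
    end = now or datetime.now(timezone.utc)
    return {
        "MetricDataQueries": [{
            "Id": "gpu",
            "MetricStat": {
                "Metric": {
                    "Namespace": NAMESPACE,
                    "MetricName": metric_name,
                    # Assumed dimension set: cluster-level aggregation only.
                    "Dimensions": [{"Name": "ClusterName", "Value": cluster_name}],
                },
                "Period": period,
                "Stat": "Average",
            },
        }],
        "StartTime": end - timedelta(minutes=minutes),
        "EndTime": end,
    }


def main(cluster_name):
    # Imported lazily so the builders above work without the AWS SDK installed.
    import boto3

    # Enable the add-on on an existing cluster.
    boto3.client("eks").create_addon(**addon_request(cluster_name))

    # Read back recent GPU utilization (empty until the agent has reported data).
    result = boto3.client("cloudwatch").get_metric_data(
        **gpu_utilization_query(cluster_name)
    )
    for series in result["MetricDataResults"]:
        print(series["Id"], list(zip(series["Timestamps"], series["Values"])))
```

Calling `main("demo-cluster")` would issue both requests. In practice the add-on takes some time to become active, so the metric query is more useful when run later, once GPU workloads are reporting data.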
Related articles
- Nov 21, 2025: Amazon CloudWatch Container Insights adds Sub-Minute GPU Metrics for Amazon EKS
- Apr 10, 2024: Announcing Amazon CloudWatch Container Insights for Amazon EKS Windows Workloads Monitoring
- Apr 30, 2026: Amazon ECS Managed Instances now supports NVIDIA GPU metrics
- Jun 9, 2025: Maximizing GPU Utilization using NVIDIA Run:ai in Amazon EKS