
GPU Cost Attribution in Amazon EKS Using Amazon Managed Service for Prometheus, Amazon Managed Grafana, and OpenTelemetry

AWS Cloud Operations Blog



This article explains how to implement GPU cost attribution in Amazon EKS using observability tools to track and allocate GPU expenses across teams.

  • DCGM Exporter provides detailed GPU metrics including utilization and memory usage
  • Kube-state-metrics adds Kubernetes context for mapping GPU usage to applications
  • OpenTelemetry Collector enriches metrics and forwards them to Amazon Managed Service for Prometheus
  • Three cost types are calculated: allocated (based on requested GPUs), effective (based on actual usage), and waste (the difference between the two)
  • Grafana dashboards visualize cost allocation by business unit and identify optimization opportunities
  • NVIDIA Multi-Instance GPU (MIG) partitions GPUs for better multi-tenancy and cost tracking
  • Organizations can implement request-based, utilization-based, or hybrid allocation methods
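The three cost types in the list above can be illustrated with a small calculation. This is a minimal sketch, not the article's implementation: the `GpuUsage` shape, the `gpu_costs` helper, and the hourly rate are hypothetical, and it assumes effective cost is simply allocated cost scaled by average GPU utilization.

```python
from dataclasses import dataclass


@dataclass
class GpuUsage:
    """One workload's GPU footprint over a billing window (hypothetical shape)."""
    gpus_requested: float   # GPUs requested in the pod spec
    avg_utilization: float  # mean GPU utilization over the window, in [0.0, 1.0]
    hours: float            # length of the window in hours


def gpu_costs(usage: GpuUsage, hourly_rate_per_gpu: float) -> dict[str, float]:
    """Return allocated, effective, and waste cost for one workload."""
    # Allocated: what the workload reserved, regardless of how busy the GPU was.
    allocated = usage.gpus_requested * usage.hours * hourly_rate_per_gpu
    # Effective: the part of the reservation that was actually used (assumed model).
    effective = allocated * usage.avg_utilization
    # Waste: reserved but idle capacity.
    waste = allocated - effective
    return {"allocated": allocated, "effective": effective, "waste": waste}


# Illustrative numbers only: 2 GPUs for 10 hours at 25% utilization, 4.0 per GPU-hour.
costs = gpu_costs(GpuUsage(gpus_requested=2, avg_utilization=0.25, hours=10),
                  hourly_rate_per_gpu=4.0)
print(costs)  # {'allocated': 80.0, 'effective': 20.0, 'waste': 60.0}
```

In a real deployment the requested GPU count would come from Kubernetes metadata and the utilization from GPU metrics; here both are hard-coded to keep the arithmetic visible.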

This observability-first approach enables accurate GPU cost chargeback, waste identification, and informed capacity planning for AI/ML workloads on EKS.
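For chargeback, the request-based, utilization-based, and hybrid allocation methods can be sketched as a single weighted split. This is an assumption made for illustration: the summary does not define the hybrid method, so the blend below (a hypothetical `weight_requested` parameter mixing the two shares) is one plausible reading, and the team names and figures are invented.

```python
def allocate_cost(total_cost: float,
                  requested: dict[str, float],
                  utilized: dict[str, float],
                  weight_requested: float = 0.5) -> dict[str, float]:
    """Split total_cost across teams.

    weight_requested = 1.0 gives a purely request-based split,
    0.0 gives a purely utilization-based split, and values in
    between give a hybrid of the two (assumed definition).
    """
    total_requested = sum(requested.values())
    total_utilized = sum(utilized.values())
    result = {}
    for team, req in requested.items():
        req_share = req / total_requested
        # Fall back to the request share if nothing was utilized at all.
        if total_utilized:
            used_share = utilized.get(team, 0.0) / total_utilized
        else:
            used_share = req_share
        blended = weight_requested * req_share + (1 - weight_requested) * used_share
        result[team] = total_cost * blended
    return result


# Hypothetical inputs: GPU-hours requested and GPU-hours actually used per team.
requested = {"research": 6.0, "platform": 2.0}
utilized = {"research": 2.0, "platform": 2.0}

request_based = allocate_cost(800.0, requested, utilized, weight_requested=1.0)
utilization_based = allocate_cost(800.0, requested, utilized, weight_requested=0.0)
hybrid = allocate_cost(800.0, requested, utilized, weight_requested=0.5)
print(request_based, utilization_based, hybrid)
```

With these numbers the request-based split charges research 600 and platform 200, the utilization-based split charges each 400, and the evenly weighted hybrid lands at 500 and 300.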




Related articles

  • Sep 2, 2025: Split Cost Allocation Data for Amazon EKS supports NVIDIA & AMD GPU, Trainium, and Inferentia-powered EC2 instances
  • Apr 30, 2026: Amazon ECS Managed Instances now supports NVIDIA GPU metrics
  • Jun 9, 2025: Maximizing GPU Utilization using NVIDIA Run:ai in Amazon EKS
  • Sep 2, 2025: Improve cost visibility of Machine Learning workloads on Amazon EKS with AWS Split Cost Allocation Data