Introducing GPU Health Monitoring and Auto Repair for Amazon ECS Managed Instances


This article announces GPU health monitoring and auto repair for Amazon ECS Managed Instances, which automatically detects critical NVIDIA GPU hardware failures and replaces the impaired instances.

Automatically detects critical NVIDIA GPU hardware failures and replaces impaired instances
Uses NVIDIA Data Center GPU Manager (DCGM) for continuous GPU health monitoring
Exposes GPU health through the DescribeContainerInstances API and EventBridge notifications
Lets you opt out at the capacity provider level to manage the instance lifecycle manually
Enabled by default on supported NVIDIA GPU instance types at no additional cost
Available in all AWS Commercial Regions
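
As a rough illustration of the DescribeContainerInstances route mentioned above, the following Python sketch lists container instances whose overall health is reported as IMPAIRED. It relies on the general healthStatus field that the ECS API returns when CONTAINER_INSTANCE_HEALTH is requested. The announcement does not say exactly how GPU health appears in that response, so treating an IMPAIRED overall status as the signal is an assumption, and the function names are hypothetical.

```python
from typing import Any


def impaired_instances(response: dict[str, Any]) -> list[str]:
    """Return ARNs of container instances whose overall health is IMPAIRED.

    Assumption: GPU failures surface through the general healthStatus
    field; the announcement does not spell out the response shape.
    """
    impaired = []
    for instance in response.get("containerInstances", []):
        health = instance.get("healthStatus", {})
        if health.get("overallStatus") == "IMPAIRED":
            impaired.append(instance["containerInstanceArn"])
    return impaired


def fetch_impaired(cluster: str, instance_ids: list[str]) -> list[str]:
    """Query ECS for the given instances (needs boto3 and AWS credentials)."""
    import boto3  # imported lazily so impaired_instances works without it

    ecs = boto3.client("ecs")
    response = ecs.describe_container_instances(
        cluster=cluster,
        containerInstances=instance_ids,
        include=["CONTAINER_INSTANCE_HEALTH"],  # ask ECS to include health data
    )
    return impaired_instances(response)
```

Calling fetch_impaired with a cluster name and a list of container instance IDs or ARNs would return the impaired ones. For continuous monitoring, the EventBridge notifications are likely a better fit than polling.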

This feature improves the availability and reliability of GPU-accelerated containerized workloads, such as GenAI inference, by automating GPU failure detection and instance replacement.


