Introducing GPU Health Monitoring and Auto Repair for Amazon ECS Managed Instances
This article announces GPU health monitoring and auto repair for Amazon ECS Managed Instances, which automatically detects critical NVIDIA GPU hardware failures and replaces the impaired instances.
- Automatically detects critical NVIDIA GPU hardware failures and replaces impaired instances
- Uses NVIDIA Data Center GPU Manager (DCGM) for continuous GPU health monitoring
- Exposes GPU health status through the DescribeContainerInstances API and EventBridge notifications
- Can be turned off at the capacity provider level if you prefer to manage instance lifecycle manually
- Enabled by default on supported NVIDIA GPU instance types at no additional cost
- Available in all AWS Commercial Regions
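As a rough illustration of the DescribeContainerInstances point above, the following Python sketch uses boto3 to request container instance health and list the instances reported as impaired. The announcement does not say which response field carries GPU health, so reading it from the existing `healthStatus` structure (requested with `include=["CONTAINER_INSTANCE_HEALTH"]`) is an assumption here, and the helper names `fetch_container_instances` and `impaired_instances` are hypothetical.

```python
from typing import Any


def fetch_container_instances(cluster: str, instance_arns: list[str]) -> dict[str, Any]:
    """Call DescribeContainerInstances and ask for health details."""
    import boto3  # imported lazily so the filter below can be used offline

    ecs = boto3.client("ecs")
    return ecs.describe_container_instances(
        cluster=cluster,
        containerInstances=instance_arns,
        include=["CONTAINER_INSTANCE_HEALTH"],
    )


def impaired_instances(response: dict[str, Any]) -> list[str]:
    """Return ARNs of container instances whose overall health is IMPAIRED.

    Assumption: GPU failures are reflected in healthStatus.overallStatus.
    Instances without a healthStatus entry are skipped.
    """
    return [
        ci["containerInstanceArn"]
        for ci in response.get("containerInstances", [])
        if ci.get("healthStatus", {}).get("overallStatus") == "IMPAIRED"
    ]
```

Keeping the filter separate from the API call makes it possible to check the logic against a saved response before pointing it at a live cluster.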
This feature improves the availability and reliability of GPU-accelerated containerized workloads, such as generative AI inference, by automating GPU failure detection and instance replacement.