Capacity-aware inference: Automatic instance fallback for SageMaker AI endpoints
Machine Learning Blog
This article announces capacity-aware instance pools for Amazon SageMaker AI inference endpoints, enabling automatic fallback to alternative instance types when preferred hardware is unavailable.
- Define prioritized instance type lists; SageMaker automatically provisions on available capacity
- Eliminates manual retries during endpoint creation, scale-out, and scale-in operations
- Supports Single Model Endpoints, Inference Component-based endpoints, and Asynchronous Inference
- The fleet trends back toward preferred hardware across scale-in and subsequent scale-out events
- CloudWatch metrics now include an InstanceType dimension for per-type monitoring and observability
- Two optimization approaches: bring your own optimized models or use SageMaker inference recommendations
- Weighted scaling metrics handle mixed fleets with different throughput capacities
- Blue/green and rolling deployments supported with automatic rollback on health check failures
- Available in all commercial AWS Regions at no additional cost
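
The weighted scaling metric mentioned in the list can be illustrated with a little arithmetic. The following is a minimal sketch, not the SageMaker implementation: the `InstanceGroup` class, the function names, and the weights (1.0 and 0.5) are hypothetical values chosen for illustration. It shows why a mixed fleet should be measured in capacity units rather than in raw instance counts.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class InstanceGroup:
    instance_type: str
    count: int
    # Hypothetical relative throughput versus the preferred type (1.0).
    weight: float


def weighted_capacity(fleet):
    """Sum of count * weight across all instance groups in the fleet."""
    return sum(group.count * group.weight for group in fleet)


def load_per_capacity_unit(invocations_per_minute, fleet):
    """Normalize load by weighted capacity instead of raw instance count."""
    capacity = weighted_capacity(fleet)
    if capacity <= 0:
        raise ValueError("fleet has no capacity")
    return invocations_per_minute / capacity


# Assumed mixed fleet: two preferred instances plus two fallback instances
# that are assumed to deliver half the throughput each.
fleet = [
    InstanceGroup("ml.g5.2xlarge", count=2, weight=1.0),
    InstanceGroup("ml.g4dn.2xlarge", count=2, weight=0.5),
]

naive = 600 / sum(group.count for group in fleet)  # 150.0 per instance
weighted = load_per_capacity_unit(600, fleet)      # 200.0 per capacity unit
```

Under these assumed weights, a per-instance average understates how loaded the fleet is (150 versus 200), so a scaling policy keyed to the unweighted figure would scale out later than intended.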
Instance pools reduce operational overhead for ML inference by automating capacity resolution and providing flexible fallback strategies without manual intervention.
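
The fallback behavior summarized above can be modeled as a simple priority walk. This is a conceptual simulation under stated assumptions, not the SageMaker API: the `provision` and `scale_in` functions, the capacity numbers, and the choice to fill one request across several instance types are all hypothetical.

```python
def provision(priority_list, available, needed):
    """Place `needed` instances, filling from the most preferred type down.

    `available` maps instance type to free capacity (an assumed input; the
    real service resolves capacity internally).
    """
    remaining = needed
    placed = {}
    for instance_type in priority_list:
        if remaining == 0:
            break
        take = min(remaining, available.get(instance_type, 0))
        if take > 0:
            placed[instance_type] = take
            remaining -= take
    if remaining > 0:
        raise RuntimeError(f"insufficient capacity: {remaining} unplaced")
    return placed


def scale_in(priority_list, fleet, remove):
    """Remove instances from the least preferred type first (assumption),
    so the remaining fleet trends toward preferred hardware."""
    fleet = dict(fleet)  # work on a copy; do not mutate the caller's dict
    for instance_type in reversed(priority_list):
        if remove == 0:
            break
        drop = min(remove, fleet.get(instance_type, 0))
        if drop > 0:
            fleet[instance_type] -= drop
            remove -= drop
            if fleet[instance_type] == 0:
                del fleet[instance_type]
    return fleet


priority = ["ml.g5.2xlarge", "ml.g4dn.2xlarge"]
capacity = {"ml.g5.2xlarge": 1, "ml.g4dn.2xlarge": 5}

fleet = provision(priority, capacity, needed=3)
# fleet == {"ml.g5.2xlarge": 1, "ml.g4dn.2xlarge": 2}

fleet = scale_in(priority, fleet, remove=2)
# fleet == {"ml.g5.2xlarge": 1}
```

In this model, only one preferred instance is available, so the remaining two land on the fallback type; the later scale-in removes the fallback instances first and leaves the preferred one running.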