Amazon SageMaker AI in 2025, a year in review part 1: Flexible Training Plans and improvements to price performance for inference workloads
Machine Learning Blog
This article reviews Amazon SageMaker AI's 2025 improvements across capacity, price performance, observability, and usability, focusing on training and inference enhancements.
- Flexible Training Plans now support inference endpoints with transparent upfront pricing for GPU capacity reservations
- Inference components add Multi-AZ high availability for fault tolerance across Availability Zones
- Parallel scaling deploys multiple model copies simultaneously, shortening the time needed to respond to traffic surges
- NVMe caching accelerates model scaling and reduces inference latency during traffic spikes
- EAGLE-3 speculative decoding drafts upcoming tokens from the model's hidden-layer features, improving throughput without quality loss
- Dynamic multi-adapter inference loads LoRA adapters on demand, optimizing resource utilization
- Intelligent memory management automatically evicts the least popular adapters when capacity is reached
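The on-demand loading and eviction behavior described in the last two bullets can be pictured with a small sketch. This is not SageMaker AI's implementation: the `AdapterCache` class, the `load_fn` loader, and the use of request counts as the measure of "popularity" are all assumptions made for illustration.

```python
from collections import Counter


class AdapterCache:
    """Toy registry that keeps at most `capacity` adapters loaded (hypothetical)."""

    def __init__(self, capacity, load_fn):
        if capacity < 1:
            raise ValueError("capacity must be at least 1")
        self.capacity = capacity
        self.load_fn = load_fn   # hypothetical loader: adapter name -> weights
        self.loaded = {}         # adapter name -> loaded weights
        self.hits = Counter()    # adapter name -> request count (assumed popularity metric)

    def get(self, name):
        """Return the adapter, loading it on demand and evicting one if full."""
        self.hits[name] += 1
        if name not in self.loaded:
            if len(self.loaded) >= self.capacity:
                # Evict the loaded adapter with the fewest requests so far.
                victim = min(self.loaded, key=lambda n: self.hits[n])
                del self.loaded[victim]
            self.loaded[name] = self.load_fn(name)
        return self.loaded[name]


cache = AdapterCache(capacity=2, load_fn=lambda name: f"weights:{name}")
for request in ["legal", "legal", "medical", "finance"]:
    cache.get(request)
print(sorted(cache.loaded))  # ['finance', 'legal']
```

In this run the "medical" adapter is evicted because it has fewer requests than "legal" when "finance" arrives and both slots are taken. A production system would likely also weigh recency and adapter size, which this sketch ignores.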
These enhancements make generative AI inference more accessible, reliable, and cost-effective for production workloads by addressing GPU availability, low-latency scaling, and multi-model deployment complexity.
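To see why speculative decoding can raise throughput without changing output quality, consider the following minimal sketch of generic greedy speculative decoding. It is not EAGLE-3 itself, which drafts from the target model's hidden-layer features. Here `target_next` and `draft_next` are hypothetical stand-ins for a large model and a cheap draft model, each mapping a token list to its next token.

```python
def greedy_decode(target_next, prompt, num_tokens):
    """Baseline: ask the target model for one token at a time."""
    out = list(prompt)
    for _ in range(num_tokens):
        out.append(target_next(out))
    return out


def speculative_decode(target_next, draft_next, prompt, num_tokens, k=4):
    """Greedy speculative decoding: draft k tokens, keep those the target confirms."""
    out = list(prompt)
    goal = len(prompt) + num_tokens
    while len(out) < goal:
        # 1. Draft k tokens with the cheap model.
        draft = []
        for _ in range(k):
            draft.append(draft_next(out + draft))
        # 2. Verify drafted tokens against the target model.
        #    (A real system would score all k positions in one batched pass.)
        accepted = []
        for tok in draft:
            expected = target_next(out + accepted)
            if tok == expected:
                accepted.append(tok)
            else:
                accepted.append(expected)  # the target's token replaces the miss
                break
        else:
            # Every draft token matched, so the target contributes one extra token.
            accepted.append(target_next(out + accepted))
        out.extend(accepted)
    return out[:goal]


# Toy "models" over integer tokens; the draft model is deliberately wrong at times.
def target(seq):
    return (seq[-1] + 1) % 10


def draft(seq):
    return (seq[-1] + 1) % 10 if len(seq) % 3 else 0


baseline = greedy_decode(target, [1], 8)
speculated = speculative_decode(target, draft, [1], 8)
print(speculated == baseline)  # True
```

Every token that reaches the output is one the target model would have chosen itself, so the result matches plain greedy decoding. The speedup in practice comes from verifying several drafted tokens in a single forward pass of the large model.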