Introducing container caching in Amazon SageMaker AI for faster model scaling

Machine Learning Blog

Amazon SageMaker AI announces container image caching for inference, which makes end-to-end scale-out up to 2x faster for generative AI models.

Container caching removes the image pull bottleneck when launching new instances, complementing the existing instance-store caching for reused instances
Demonstrated a 51% latency reduction for the Qwen3-8B model: startup time dropped from 525 seconds to 258 seconds
Early-access customers saw P50 improvements ranging from 38% to 65%, depending on instance type and model size
Works automatically on supported accelerator instance types; no modifications to container images required
Maintains strict tenant isolation with per-endpoint caches automatically purged on endpoint deletion
Combines with sub-minute metrics (6x faster detection) and data caching for comprehensive scaling optimization
Available in all commercial AWS Regions where SageMaker inference is supported
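
The Qwen3-8B figures above can be checked with a few lines of arithmetic. The sketch below is illustrative only: the helper name `latency_reduction` is an assumption made for this example and is not part of any SageMaker API. It shows how the 51% reduction and the "up to 2x" headline describe the same measurement.

```python
def latency_reduction(before_s: float, after_s: float) -> tuple[float, float]:
    """Return (percent reduction, speedup factor) for two latencies in seconds.

    Hypothetical helper for illustration; not part of any AWS SDK.
    """
    if before_s <= 0 or after_s <= 0:
        raise ValueError("latencies must be positive")
    reduction_pct = (before_s - after_s) / before_s * 100
    speedup = before_s / after_s
    return reduction_pct, speedup


# Startup times quoted for Qwen3-8B: 525 s without caching, 258 s with it.
reduction_pct, speedup = latency_reduction(525, 258)
print(f"{reduction_pct:.1f}% reduction, {speedup:.2f}x speedup")
# prints "50.9% reduction, 2.03x speedup"
```

A reduction of about 51% means the cached startup takes roughly half the time, which is the same as a speedup of about 2x.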

Container caching enables generative AI applications to handle traffic spikes with rapid, predictable responses and low latency by eliminating major sources of scale-out delay.


Related articles

Dec 3, 2024: Supercharge your auto scaling for generative AI inference – Introducing Container Caching in SageMaker Inference
Jul 25, 2024: Amazon SageMaker launches faster auto-scaling for Generative AI models
Dec 3, 2025: New serverless customization in Amazon SageMaker AI accelerates model fine-tuning
Sep 18, 2025: Use AWS Deep Learning Containers with Amazon SageMaker AI managed MLflow