Introducing container caching in Amazon SageMaker AI for faster model scaling

Machine Learning Blog



Amazon SageMaker AI announces container image caching for inference, which makes scale-out events for generative AI models up to 2x faster end to end.

  • Container caching removes the image-pull bottleneck when launching new instances, complementing existing instance-store caching for reused instances
  • Demonstrated 51% latency reduction for the Qwen3-8B model: startup time drops from 525 seconds to 258 seconds
  • Early access customers saw P50 improvements ranging from 38% to 65% depending on instance type and model size
  • Works automatically on supported accelerator instance types; no modifications to container images required
  • Maintains strict tenant isolation with per-endpoint caches automatically purged on endpoint deletion
  • Combines with sub-minute metrics (6x faster detection) and data caching for comprehensive scaling optimization
  • Available in all commercial AWS Regions where SageMaker inference is supported
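
The "51% reduction" and "up to 2x" figures describe the same Qwen3-8B measurement in two ways. The short sketch below reproduces both from the 525-second and 258-second startup times quoted above; the function names are illustrative and not part of any SageMaker API.

```python
def percent_reduction(before_s: float, after_s: float) -> float:
    """Return the latency drop as a percentage of the original latency."""
    return (before_s - after_s) / before_s * 100


def speedup_factor(before_s: float, after_s: float) -> float:
    """Return how many times faster the new latency is than the original."""
    return before_s / after_s


# Startup times quoted in the summary for the Qwen3-8B model.
baseline_s = 525  # without container caching
cached_s = 258    # with container caching

print(f"{percent_reduction(baseline_s, cached_s):.0f}% reduction")  # 51% reduction
print(f"{speedup_factor(baseline_s, cached_s):.2f}x faster")        # 2.03x faster
```

A reduction of roughly half always corresponds to a speedup of roughly 2x, which is why the headline and the benchmark bullet agree.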

Container caching enables generative AI applications to handle traffic spikes with rapid, predictable responses and low latency by eliminating major sources of scale-out delay.




Related articles

  • Dec 3, 2024: Supercharge your auto scaling for generative AI inference – Introducing Container Caching in SageMaker Inference
  • Jul 25, 2024: Amazon SageMaker launches faster auto-scaling for Generative AI models
  • Dec 3, 2025: New serverless customization in Amazon SageMaker AI accelerates model fine-tuning
  • Sep 18, 2025: Use AWS Deep Learning Containers with Amazon SageMaker AI managed MLflow
