
SageMaker HyperPod now supports Managed tiered KV cache and intelligent routing




This article announces new capabilities for Amazon SageMaker HyperPod that optimize LLM inference performance for long-context prompts and multi-turn conversations.

  • Managed Tiered KV Cache intelligently caches and reuses computed attention values
  • Intelligent Routing directs requests to instances with optimal cached data
  • Delivers up to 40% latency reduction, 25% throughput improvement, and 25% cost savings
  • Two-tier architecture combines local CPU memory with disaggregated cluster-wide storage
  • Three routing strategies: prefix-aware, KV-aware, and round-robin for different workloads
  • Built-in observability integration with Amazon Managed Grafana for performance monitoring
  • Available in all AWS Regions where SageMaker HyperPod is available
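To make the prefix-aware strategy from the list above concrete, here is a minimal sketch of a router that sends a request to the instance that has already processed the longest matching prompt prefix, and falls back to round-robin when no instance has seen any of it. This is an illustration of the general idea, not the HyperPod implementation: the class name `PrefixAwareRouter`, the `block_size` parameter, and the in-memory bookkeeping are all hypothetical.

```python
from itertools import cycle


class PrefixAwareRouter:
    """Toy router (hypothetical): prefer the instance with the longest cached prefix."""

    def __init__(self, instances, block_size=4):
        self.instances = list(instances)
        self.block_size = block_size  # tokens per cached block (assumed value)
        # instance name -> set of block-aligned token prefixes it has processed
        self.seen = {name: set() for name in self.instances}
        self._round_robin = cycle(self.instances)

    def _prefixes(self, tokens):
        # Block-aligned prefixes of the prompt, shortest first.
        return [tuple(tokens[:end])
                for end in range(self.block_size, len(tokens) + 1, self.block_size)]

    def _match_length(self, instance, prefixes):
        # Count how many leading blocks this instance has already processed.
        matched = 0
        for prefix in prefixes:
            if prefix not in self.seen[instance]:
                break
            matched += 1
        return matched

    def route(self, tokens):
        prefixes = self._prefixes(tokens)
        best = max(self.instances, key=lambda i: self._match_length(i, prefixes))
        if self._match_length(best, prefixes) == 0:
            # No instance has any of this prefix cached: fall back to round-robin.
            best = next(self._round_robin)
        # Record that the chosen instance will now hold these prefixes.
        self.seen[best].update(prefixes)
        return best


router = PrefixAwareRouter(["a", "b"], block_size=2)
system_prompt = [1, 2, 3, 4]
first = router.route(system_prompt + [5, 6])   # nothing cached yet, round-robin pick
second = router.route(system_prompt + [7, 8])  # shares the system prompt with `first`
```

In this sketch, two requests that share a system prompt land on the same instance, which is the situation where reusing cached attention values pays off. A KV-aware strategy would presumably consult actual cache state rather than a routing history, but that detail is not described in this summary.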

These features reduce computational overhead for LLM inference by efficiently reusing cached key-value pairs, improving performance and cost-effectiveness for production applications.
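The two-tier lookup described above can be sketched as a small cache that checks a local tier first and a shared tier second. This is a simplified model under stated assumptions, not the managed service's design: `TieredKVCache`, its LRU eviction, the write-through policy, and promotion on a shared-tier hit are all hypothetical choices made for illustration, and plain dictionaries stand in for CPU memory and cluster-wide storage.

```python
from collections import OrderedDict


class TieredKVCache:
    """Toy two-tier cache (hypothetical): local LRU tier backed by a shared tier."""

    def __init__(self, local_capacity):
        self.local = OrderedDict()  # tier 1: stands in for local CPU memory
        self.shared = {}            # tier 2: stands in for cluster-wide storage
        self.local_capacity = local_capacity

    def put(self, prefix_key, kv_blocks):
        self.local[prefix_key] = kv_blocks
        self.local.move_to_end(prefix_key)
        self.shared[prefix_key] = kv_blocks  # write-through (an assumption)
        while len(self.local) > self.local_capacity:
            self.local.popitem(last=False)   # evict the least recently used entry

    def get(self, prefix_key):
        if prefix_key in self.local:
            self.local.move_to_end(prefix_key)
            return self.local[prefix_key], "local"
        if prefix_key in self.shared:
            kv_blocks = self.shared[prefix_key]
            self.put(prefix_key, kv_blocks)  # promote to the local tier
            return kv_blocks, "shared"
        return None, "miss"                  # caller must recompute attention values


cache = TieredKVCache(local_capacity=1)
cache.put("prompt-1", "kv-1")
cache.put("prompt-2", "kv-2")  # evicts prompt-1 from the local tier only
```

After these two writes, `cache.get("prompt-2")` is served from the local tier, while `cache.get("prompt-1")` is served from the shared tier and promoted, so a repeated request avoids recomputation even after local eviction.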




