SageMaker HyperPod now supports Managed tiered KV cache and intelligent routing


This article announces new capabilities for Amazon SageMaker HyperPod that optimize LLM inference performance for long-context prompts and multi-turn conversations.

Managed Tiered KV Cache intelligently caches and reuses computed attention values
Intelligent Routing directs requests to instances with optimal cached data
Delivers up to 40% latency reduction, 25% throughput improvement, and 25% cost savings
Two-tier architecture combines local CPU memory with disaggregated cluster-wide storage
Three routing strategies: prefix-aware, KV-aware, and round-robin for different workloads
Built-in observability integration with Amazon Managed Grafana for performance monitoring
Available in all AWS Regions where SageMaker HyperPod is available
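To make the routing strategies listed above more concrete, here is a minimal Python sketch of prefix-aware routing with a round-robin fallback. It is an illustration only and not HyperPod's implementation: the `PrefixAwareRouter` class, the `block_size` parameter, and the instance names are all hypothetical, and a real router would also weigh load and cache eviction.

```python
from itertools import cycle


class PrefixAwareRouter:
    """Toy router (hypothetical): prefer the instance that already
    served the longest block-aligned prefix of the prompt."""

    def __init__(self, instances, block_size=4):
        self.instances = list(instances)
        self.block_size = block_size          # assumed cache block granularity
        self._round_robin = cycle(self.instances)
        # Maps a tuple of leading tokens to the instance that last processed it.
        self._prefix_owner = {}

    def _prefixes(self, tokens):
        # Yield block-aligned prefixes of the prompt, longest first.
        full = len(tokens) - len(tokens) % self.block_size
        for end in range(full, 0, -self.block_size):
            yield tuple(tokens[:end])

    def route(self, tokens):
        target = None
        for prefix in self._prefixes(tokens):
            if prefix in self._prefix_owner:
                target = self._prefix_owner[prefix]   # likely cache hit
                break
        if target is None:
            target = next(self._round_robin)          # no known prefix: round-robin
        # Record that this instance now holds every prefix of the prompt.
        for prefix in self._prefixes(tokens):
            self._prefix_owner[prefix] = target
        return target


router = PrefixAwareRouter(["instance-a", "instance-b"], block_size=2)
first = router.route([1, 2, 3, 4])     # new prompt, falls back to round-robin
second = router.route([9, 9, 9, 9])    # unrelated prompt, next instance in rotation
followup = router.route([1, 2, 7, 8])  # shares the prefix (1, 2) with the first prompt
```

In this sketch the follow-up request lands on the same instance as the first one because they share a leading block, which is the situation where cached key-value pairs can be reused, as in multi-turn conversations.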

These features reduce computational overhead for LLM inference by efficiently reusing cached key-value pairs, improving performance and cost-effectiveness for production applications.
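The reuse of cached key-value pairs across two tiers can be sketched roughly as follows. This is a simplified model under stated assumptions, not the managed service's design: the `TieredKVCache` class is hypothetical, a plain dictionary stands in for the disaggregated cluster-wide store, and a small LRU map stands in for local CPU memory.

```python
from collections import OrderedDict


class TieredKVCache:
    """Hypothetical two-tier lookup: local LRU first, then a shared store."""

    def __init__(self, local_capacity, shared_store):
        self.local = OrderedDict()        # tier 1: bounded, stands in for local CPU memory
        self.local_capacity = local_capacity
        self.shared = shared_store        # tier 2: stands in for cluster-wide storage

    def _put_local(self, key, value):
        self.local[key] = value
        self.local.move_to_end(key)
        # Evict least recently used entries once the local tier is full.
        while len(self.local) > self.local_capacity:
            self.local.popitem(last=False)

    def get_or_compute(self, key, compute):
        if key in self.local:
            self.local.move_to_end(key)
            return self.local[key], "local"
        if key in self.shared:
            value = self.shared[key]
            self._put_local(key, value)   # promote to the local tier
            return value, "shared"
        value = compute()                 # miss in both tiers: pay the full cost once
        self.shared[key] = value
        self._put_local(key, value)
        return value, "computed"


shared = {}
node_a = TieredKVCache(local_capacity=1, shared_store=shared)
node_b = TieredKVCache(local_capacity=1, shared_store=shared)

_, source_1 = node_a.get_or_compute("prompt-1", lambda: "kv-1")  # computed
_, source_2 = node_b.get_or_compute("prompt-1", lambda: "kv-1")  # found in shared tier
```

The point of the second tier in this sketch is that a node which never saw the prompt can still skip recomputation when another node has already stored the result.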


Nov 27, 2025: Managed Tiered KV Cache and Intelligent Routing for Amazon SageMaker HyperPod

Sep 8, 2025: Announcing Managed Tiered Checkpointing for Amazon SageMaker HyperPod

Mar 16, 2026: SageMaker HyperPod now supports idle resource sharing for dynamic cluster utilization

Aug 11, 2025: Amazon SageMaker HyperPod now provides a new cluster setup experience


Related articles