Managed Tiered KV Cache and Intelligent Routing for Amazon SageMaker HyperPod

Machine Learning Blog

This article announces Managed Tiered KV Cache and Intelligent Routing capabilities for Amazon SageMaker HyperPod, designed to optimize LLM inference performance and reduce costs.

Reduces time-to-first-token (TTFT) by up to 40% for long context workloads
Increases throughput by up to 38% and reduces compute costs by up to 28%
Two-tier KV cache: L1 (local CPU memory) and L2 (distributed cluster-wide storage)
L2 cache supports AWS-managed tiered storage or Redis backends
Four intelligent routing strategies: prefix-aware, KV-aware, round-robin, and default
Automatic cache management eliminates manual configuration overhead
Particularly beneficial for long documents, multi-turn conversations, and high-throughput inference
Built-in observability integration with Amazon Managed Grafana
Available in all AWS regions where SageMaker HyperPod is supported
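
The two-tier lookup described above can be pictured with a minimal sketch. This is not HyperPod's implementation: the class name `TwoTierCache`, the LRU eviction policy, the write-through behaviour, and the plain `dict` standing in for the cluster-wide L2 store are all assumptions made for illustration.

```python
from collections import OrderedDict
from typing import Optional


class TwoTierCache:
    """Toy two-tier cache: a small local L1 in front of a shared L2.

    Hypothetical illustration only; the real service manages tiers itself.
    """

    def __init__(self, l1_capacity: int, l2_store: dict):
        self.l1_capacity = l1_capacity
        self.l1 = OrderedDict()  # local tier, kept in LRU order
        self.l2 = l2_store       # stand-in for cluster-wide storage

    def get(self, key: str) -> Optional[bytes]:
        # Fast path: the entry is already in local memory.
        if key in self.l1:
            self.l1.move_to_end(key)
            return self.l1[key]
        # Slow path: fall back to the shared tier and promote on a hit.
        value = self.l2.get(key)
        if value is not None:
            self._put_l1(key, value)
        return value

    def put(self, key: str, value: bytes) -> None:
        # Assumed write-through: every entry is visible to other nodes via L2.
        self.l2[key] = value
        self._put_l1(key, value)

    def _put_l1(self, key: str, value: bytes) -> None:
        self.l1[key] = value
        self.l1.move_to_end(key)
        # Evict least recently used entries once L1 is over capacity.
        while len(self.l1) > self.l1_capacity:
            self.l1.popitem(last=False)
```

Two instances sharing one `l2_store` behave like two nodes: an entry written by one node can be read by the other without being recomputed, which is the effect the L2 tier is meant to provide.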

These features enable enterprise-scale LLM deployments with significantly improved performance and cost efficiency, especially for applications processing long contexts or maintaining multi-turn conversations.
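
To show why prefix-aware routing helps multi-turn conversations and shared long contexts, here is a rough sketch of the idea. It is an assumption-laden toy rather than the HyperPod router: `block_hashes`, `PrefixAwareRouter`, the block size of 4 tokens, and the round-robin fallback are hypothetical choices.

```python
import hashlib
from itertools import cycle


def block_hashes(tokens, block_size=4):
    """Return chained hashes of full token blocks.

    Hash i covers tokens[0:(i + 1) * block_size], so two prompts share
    hash i only if they share that entire prefix.
    """
    hashes = []
    running = hashlib.sha256()
    for start in range(0, len(tokens) - block_size + 1, block_size):
        block = tokens[start:start + block_size]
        running.update((",".join(map(str, block)) + ";").encode())
        hashes.append(running.copy().hexdigest())
    return hashes


class PrefixAwareRouter:
    """Send each request to the replica that has seen its longest prefix."""

    def __init__(self, replicas):
        self.replicas = list(replicas)
        self.seen = {r: set() for r in self.replicas}  # prefix hashes per replica
        self._fallback = cycle(self.replicas)          # round-robin when nothing matches

    def route(self, tokens):
        hashes = block_hashes(tokens)
        best, best_len = None, 0
        for replica in self.replicas:
            matched = 0
            for h in hashes:
                if h not in self.seen[replica]:
                    break
                matched += 1
            if matched > best_len:
                best, best_len = replica, matched
        if best is None:
            best = next(self._fallback)
        # Record the prefixes this replica will now hold in its cache.
        self.seen[best].update(hashes)
        return best
```

Requests that begin with the same system prompt or conversation history land on the same replica, so its cached KV entries for that prefix can be reused instead of being recomputed during prefill.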

Related articles

Nov 26, 2025: SageMaker HyperPod now supports Managed tiered KV cache and intelligent routing
Sep 8, 2025: Announcing Managed Tiered Checkpointing for Amazon SageMaker HyperPod
Sep 15, 2025: Schedule topology-aware workloads using Amazon SageMaker HyperPod task governance
Aug 8, 2025: Amazon SageMaker HyperPod now supports continuous provisioning for enhanced cluster operations