Introducing Disaggregated Inference on AWS powered by llm-d

Machine Learning Blog



This article announces AWS's collaboration with the llm-d team to bring disaggregated inference capabilities to AWS, enabling optimized large language model serving at scale.

  • llm-d separates LLM inference prefill and decode phases across distributed GPU resources for better optimization
  • Intelligent scheduling routes requests based on KV cache locality without requiring full cache state visibility
  • Prefill-decode disaggregation allows independent scaling of compute-intensive and memory-intensive phases
  • Wide expert parallelism optimizes Mixture-of-Experts models like DeepSeek-R1 and Qwen3.5
  • Tiered prefix caching offloads KV cache entries to CPU memory or disk beyond GPU limits
  • Integration with AWS Elastic Fabric Adapter (EFA) and NIXL enables high-performance point-to-point transfers
  • Benchmarks show up to 70% throughput improvement with prefill-decode disaggregation versus standard vLLM
  • Deployable on Amazon SageMaker HyperPod and Amazon EKS using Kubernetes-native architecture
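
To make the cache-locality routing idea in the list above concrete, here is a minimal sketch, not llm-d's actual scheduler. It assumes the router only remembers which prompt-prefix blocks it has previously sent to each replica (an approximation of cache state, with no view into the replicas themselves) and prefers the replica with the longest matching prefix, breaking ties by lower load. `PrefixAwareRouter`, `block_hashes`, `BLOCK_TOKENS`, and the pod names are hypothetical names chosen for illustration.

```python
import hashlib

BLOCK_TOKENS = 4  # hypothetical block size, kept tiny for the example


def block_hashes(tokens, block=BLOCK_TOKENS):
    """Chained hashes of full token blocks; each hash identifies a whole prefix."""
    hashes, prev = [], b""
    usable = len(tokens) - len(tokens) % block  # ignore a trailing partial block
    for i in range(0, usable, block):
        chunk = ",".join(map(str, tokens[i:i + block])).encode()
        prev = hashlib.sha256(prev + chunk).digest()
        hashes.append(prev)
    return hashes


class PrefixAwareRouter:
    """Routes by prefix overlap inferred from routing history only."""

    def __init__(self, replicas):
        self.seen = {r: set() for r in replicas}  # block hashes sent to each replica
        self.load = {r: 0 for r in replicas}      # requests routed so far

    def route(self, tokens):
        hashes = block_hashes(tokens)

        def matched(replica):
            # Count leading blocks this replica has probably cached.
            n = 0
            for h in hashes:
                if h not in self.seen[replica]:
                    break
                n += 1
            return n

        # Longest matching prefix wins; lower load breaks ties.
        best = max(self.seen, key=lambda r: (matched(r), -self.load[r]))
        self.seen[best].update(hashes)
        self.load[best] += 1
        return best


router = PrefixAwareRouter(["pod-a", "pod-b"])
first = router.route(list(range(8)))                      # no history yet
second = router.route(list(range(100, 108)))              # unrelated prompt
third = router.route(list(range(8)) + [99, 98, 97, 96])   # shares a prefix with the first
```

In this toy run the third request returns to the replica that served the first one, because two of its three blocks match what the router already sent there. A production scheduler would also have to account for eviction, queue depth, and replica health, which this sketch ignores.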

llm-d provides production-grade orchestration and scheduling for distributed LLM inference, improving throughput and resource utilization for large-scale AI workloads on AWS.
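
The tiered prefix caching item in the summary can be illustrated with a toy model. The sketch below is an assumption-laden simplification, not the llm-d or vLLM implementation: each tier is an in-memory LRU map, entries evicted from a faster tier spill into the next slower one, and a hit in a slower tier promotes the entry back to the fastest tier. `TieredKVCache` and the tier names are hypothetical.

```python
from collections import OrderedDict


class TieredKVCache:
    """Toy LRU cache whose tiers are ordered fastest to slowest."""

    def __init__(self, capacities):
        # capacities maps tier name -> max entries, e.g. {"gpu": 2, "cpu": 4};
        # dict insertion order defines the tier order.
        self.capacities = dict(capacities)
        self.order = list(capacities)
        self.tiers = {name: OrderedDict() for name in self.order}

    def put(self, key, value, tier_index=0):
        if tier_index >= len(self.order):
            return  # evicted from the slowest tier: the entry is dropped
        if tier_index == 0:
            # A fresh write supersedes any stale copy in a slower tier.
            for tier in self.tiers.values():
                tier.pop(key, None)
        name = self.order[tier_index]
        tier = self.tiers[name]
        tier[key] = value
        tier.move_to_end(key)
        while len(tier) > self.capacities[name]:
            # Spill the least recently used entry to the next slower tier.
            old_key, old_value = tier.popitem(last=False)
            self.put(old_key, old_value, tier_index + 1)

    def get(self, key):
        """Return (value, tier_name_where_found), or (None, None) on a miss."""
        for name in self.order:
            tier = self.tiers[name]
            if key in tier:
                value = tier.pop(key)
                self.put(key, value)  # promote to the fastest tier
                return value, name
        return None, None


cache = TieredKVCache({"gpu": 1, "cpu": 1})
cache.put("prefix-a", "kv-a")
cache.put("prefix-b", "kv-b")            # pushes prefix-a down to the cpu tier
hit_from_cpu = cache.get("prefix-a")     # found in cpu, then promoted
hit_from_gpu = cache.get("prefix-a")     # now served from gpu
cache.put("prefix-c", "kv-c")            # prefix-b falls off the last tier
miss = cache.get("prefix-b")
```

The point of the design is that a miss in GPU memory does not have to mean recomputing the prefill: a slower tier can still return the entry, at the cost of a transfer.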



Related articles

Mar 19, 2026: AWS adds support for NIXL with EFA to accelerate LLM inference at scale
Feb 24, 2026: Announcing AWS Elemental Inference
Apr 15, 2026: Accelerating decode-heavy LLM inference with speculative decoding on AWS Trainium and vLLM
Jun 15, 2026: How Public AI delivers sovereign LLM inference on AWS and Intel