Introducing Disaggregated Inference on AWS powered by llm-d
Machine Learning Blog
This article announces AWS's collaboration with the llm-d team to bring disaggregated inference capabilities to AWS, enabling optimized large language model serving at scale.
- llm-d separates the prefill and decode phases of LLM inference across distributed GPU resources so that each phase can be optimized on its own
- Intelligent scheduling routes requests based on KV cache locality without requiring full cache state visibility
- Prefill-decode disaggregation allows the compute-intensive prefill phase and the memory-intensive decode phase to scale independently
- Wide expert parallelism optimizes Mixture-of-Experts models like DeepSeek-R1 and Qwen3.5
- Tiered prefix caching offloads KV cache entries to CPU memory or disk beyond GPU limits
- Integration with AWS Elastic Fabric Adapter (EFA) and NIXL enables high-performance point-to-point transfers
- Benchmarks show up to 70% throughput improvement with prefill-decode disaggregation versus standard vLLM
- Deployable on Amazon SageMaker HyperPod and Amazon EKS using Kubernetes-native architecture
llm-d provides production-grade orchestration and scheduling for distributed LLM inference, significantly improving performance and resource utilization for large-scale AI workloads on AWS.
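
To make the cache-locality scheduling idea more concrete, here is a minimal Python sketch of a router that scores replicas by how much of a request's prefix they have probably cached, minus a load penalty. It is not llm-d's scheduler or API: `PrefixAwareRouter`, `block_hashes`, `BLOCK_SIZE`, `load_weight`, and the pod names are hypothetical, and the router only remembers what it has routed itself, as a stand-in for scheduling without full visibility into each replica's cache state.

```python
import hashlib
from collections import defaultdict

BLOCK_SIZE = 4  # tokens per block; an illustrative value, not a real default


def block_hashes(tokens, block_size=BLOCK_SIZE):
    """Chain-hash full token blocks so each hash identifies an entire prefix."""
    hashes, prev = [], ""
    full = len(tokens) - len(tokens) % block_size  # ignore the partial tail block
    for i in range(0, full, block_size):
        block = ",".join(map(str, tokens[i:i + block_size]))
        prev = hashlib.sha256((prev + "|" + block).encode()).hexdigest()
        hashes.append(prev)
    return hashes


class PrefixAwareRouter:
    """Hypothetical router: approximate prefix locality minus a load penalty."""

    def __init__(self, pods, load_weight=0.5):
        self.pods = list(pods)
        self.load_weight = load_weight
        # Router-side guess of which prefix blocks each pod holds. It is built
        # only from past routing decisions, not from the pods' real cache state.
        self.seen = defaultdict(set)
        self.in_flight = {pod: 0 for pod in self.pods}

    def _matched_blocks(self, pod, hashes):
        """Count leading blocks of the request that the pod has likely cached."""
        matched = 0
        for h in hashes:
            if h not in self.seen[pod]:
                break
            matched += 1
        return matched

    def route(self, tokens):
        """Pick the pod with the best locality-vs-load score and record the choice."""
        hashes = block_hashes(tokens)

        def score(pod):
            return self._matched_blocks(pod, hashes) - self.load_weight * self.in_flight[pod]

        best = max(self.pods, key=score)  # ties go to the earliest pod in the list
        self.seen[best].update(hashes)
        self.in_flight[best] += 1
        return best

    def complete(self, pod):
        """Mark one request on the pod as finished."""
        self.in_flight[pod] -= 1
```

With two replicas, requests that share a system prompt tend to land on the same replica, while an unrelated request is pushed toward the less loaded one. A production scheduler would also need cache eviction, replica health, and the prefill versus decode role of each replica, all of which this sketch leaves out.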