
Efficiently serve dozens of fine-tuned models with vLLM on Amazon SageMaker AI and Amazon Bedrock

Machine Learning Blog



This article describes AWS and vLLM's implementation of efficient multi-LoRA serving for Mixture of Experts (MoE) models, enabling multiple fine-tuned models to share GPU resources.

  • Multi-LoRA allows multiple custom models to share one GPU by swapping adapters per request
  • A new fused_moe_lora kernel integrates LoRA operations into MoE inference
  • Execution optimizations: eliminated kernel recompilation, added early exit logic, implemented Programmatic Dependent Launch
  • Kernel optimizations: applied Split-K for load balancing, CTA swizzling for cache reuse, removed unnecessary masking
  • Custom tuning for SageMaker AI and Bedrock delivers 19% higher throughput and 8% lower latency
  • Supports GPT-OSS, Qwen3-MoE, DeepSeek, and Llama MoE models
  • Available in vLLM 0.15.0 and later; output tokens per second (OTPS) improved 454% over the initial implementation
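
To make the per-request adapter idea concrete, here is a minimal pure-Python sketch of a single linear layer whose base weights are shared while each request selects its own LoRA adapter. It is an illustration only, not the fused_moe_lora kernel or any vLLM API: the names `LoraAdapter`, `forward`, and `serve_batch`, the adapter names, and the tiny matrices are all hypothetical, and MoE routing is not modeled.

```python
from dataclasses import dataclass
from typing import Optional

Vector = list[float]
Matrix = list[list[float]]  # row-major: matrix[row][col]


def vec_mat(x: Vector, w: Matrix) -> Vector:
    """Return x @ w for a row vector x (length k) and a k-by-n matrix w."""
    return [sum(x[i] * w[i][j] for i in range(len(x))) for j in range(len(w[0]))]


@dataclass
class LoraAdapter:
    """Hypothetical container for one fine-tuned model's low-rank update."""
    a: Matrix     # down-projection, shape (hidden, rank)
    b: Matrix     # up-projection, shape (rank, out)
    scale: float  # commonly alpha / rank


def forward(x: Vector, base_w: Matrix, adapter: Optional[LoraAdapter]) -> Vector:
    """Compute x @ W, plus scale * (x @ A) @ B when an adapter is given."""
    y = vec_mat(x, base_w)
    if adapter is None:
        return y  # base-model request: no LoRA work at all
    delta = vec_mat(vec_mat(x, adapter.a), adapter.b)
    return [yi + adapter.scale * di for yi, di in zip(y, delta)]


def serve_batch(
    requests: list[tuple[Optional[str], Vector]],
    base_w: Matrix,
    adapters: dict[str, LoraAdapter],
) -> list[Vector]:
    """Serve a mixed batch: one shared base weight, one adapter per request."""
    results = []
    for name, x in requests:
        adapter = adapters[name] if name is not None else None
        results.append(forward(x, base_w, adapter))
    return results


# Shared base weights (identity, so the adapter's effect is easy to see).
base_w = [[1.0, 0.0], [0.0, 1.0]]

# Two rank-1 adapters standing in for two fine-tuned models.
adapters = {
    "support": LoraAdapter(a=[[1.0], [0.0]], b=[[0.0, 1.0]], scale=2.0),
    "legal": LoraAdapter(a=[[0.0], [1.0]], b=[[1.0, 0.0]], scale=0.5),
}

# The same input sent to each adapter and to the plain base model.
requests = [("support", [1.0, 2.0]), ("legal", [1.0, 2.0]), (None, [1.0, 2.0])]
outputs = serve_batch(requests, base_w, adapters)
print(outputs)
```

The base matrix is stored once and every request reuses it; only the small `a` and `b` matrices differ per fine-tuned model, which is why many adapters can share one GPU. A production kernel would batch and fuse these steps rather than loop over requests in Python.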

This work makes it cost-effective to serve multiple fine-tuned models: letting adapters share a GPU efficiently eliminates wasted idle GPU capacity.




