Efficiently serve dozens of fine-tuned models with vLLM on Amazon SageMaker AI and Amazon Bedrock

Machine Learning Blog

This article describes AWS and vLLM's implementation of efficient multi-LoRA serving for Mixture of Experts (MoE) models, enabling multiple fine-tuned models to share GPU resources.
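To make the idea concrete, here is a minimal pure-Python sketch of multi-LoRA routing: one shared base weight is loaded once, and each request adds a small low-rank update chosen by an adapter ID. It is a toy illustration and not vLLM code. The names `BASE_W`, `ADAPTERS`, and `forward`, the adapter IDs, and the 2x2 shapes are all assumptions made for the example.

```python
# Toy model of multi-LoRA serving. Every name here is hypothetical;
# none of this is a vLLM API.

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]

def add(a, b):
    """Element-wise sum of two equally shaped matrices."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]

# Shared base weight (2 inputs -> 2 outputs), held in memory once.
BASE_W = [[1.0, 0.0],
          [0.0, 1.0]]

# Each adapter is a rank-1 pair: A is 2x1 (shrink), B is 1x2 (expand).
ADAPTERS = {
    "customer-a": ([[1.0], [0.0]], [[0.0, 2.0]]),
    "customer-b": ([[0.0], [1.0]], [[3.0, 0.0]]),
}

def forward(x, adapter_id=None):
    """Return x @ W, plus (x @ A) @ B when an adapter is requested."""
    out = matmul(x, BASE_W)
    if adapter_id is None:
        return out  # plain base-model request
    a, b = ADAPTERS[adapter_id]
    return add(out, matmul(matmul(x, a), b))

x = [[1.0, 1.0]]
base_out = forward(x)            # [[1.0, 1.0]]
out_a = forward(x, "customer-a")  # [[1.0, 3.0]]
out_b = forward(x, "customer-b")  # [[4.0, 1.0]]
```

The same input produces different outputs per adapter while `BASE_W` is never copied, which is why many fine-tuned variants can share one GPU.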

Multi-LoRA allows multiple custom models to share one GPU by swapping adapters per request
A new fused_moe_lora kernel integrates LoRA operations into MoE inference
Execution optimizations: eliminated kernel recompilation, added early-exit logic, and implemented Programmatic Dependent Launch
Kernel optimizations: applied Split-K for load balancing and CTA swizzling for cache reuse, and removed unnecessary masking
Custom tuning for SageMaker AI and Bedrock delivers 19% higher throughput and 8% lower latency
Supports GPT-OSS, Qwen3-MoE, DeepSeek, and Llama MoE models
Available in vLLM 0.15.0 and later, with a 454% improvement in output tokens per second (OTPS) over the initial implementation
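Of the kernel optimizations listed above, Split-K is the easiest to show in a few lines. The sketch below is a pure-Python illustration of the general technique and not the fused_moe_lora kernel itself. The function names and the `num_splits` parameter are assumptions made for the example. The reduction dimension K of a matrix multiply is cut into slices, each slice produces a partial result, and the partial results are summed. On a GPU each slice would typically go to a separate thread block, which helps when the output is too small to keep the device busy.

```python
# Illustrative Split-K matrix multiply. Names are hypothetical.

def matmul_reference(a, b):
    """Plain matrix multiply used as the correctness baseline."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]

def matmul_split_k(a, b, num_splits):
    """Multiply a (m x k) by b (k x n), reducing over k in slices."""
    m, k, n = len(a), len(b), len(b[0])
    chunk = (k + num_splits - 1) // num_splits  # ceiling division
    out = [[0.0] * n for _ in range(m)]
    for s in range(num_splits):
        lo, hi = s * chunk, min((s + 1) * chunk, k)
        # On a GPU, this loop body would be one independent unit of work;
        # here the slices simply run one after another.
        for i in range(m):
            for j in range(n):
                out[i][j] += sum(a[i][p] * b[p][j] for p in range(lo, hi))
    return out

# A "skinny" problem: large K, tiny output (1 x 2), as in a low-rank projection.
a = [[float(p + 1) for p in range(8)]]        # 1 x 8
b = [[float(p), 1.0] for p in range(8)]       # 8 x 2

expected = matmul_reference(a, b)
split_results = [matmul_split_k(a, b, s) for s in (1, 2, 3, 4)]
```

Every entry of `split_results` matches `expected`, because splitting only changes how the sum over K is grouped and not what is summed.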

This work makes serving multiple fine-tuned models cost-effective: adapters share a single GPU, so capacity that would otherwise sit idle is put to use.


Mar 25
2026

Amazon SageMaker AI now supports serverless reinforcement fine-tuning for 12 additional models

May 20
2026

Build real-time voice applications with Amazon SageMaker AI and vLLM

May 29
2024

Fine-tune large multimodal models using Amazon SageMaker

Dec 3
2025

New serverless customization in Amazon SageMaker AI accelerates model fine-tuning



Related articles