Supercharge your LLM performance with Amazon SageMaker Large Model Inference container v15

Machine Learning Blog

AWS has announced Amazon SageMaker Large Model Inference (LMI) container v15, which offers significant improvements for deploying large language models (LLMs) with enhanced performance and expanded model support.

Introduces async mode with vLLM's AsyncLLMEngine for improved request handling
Supports vLLM V1 engine, delivering up to 111% higher throughput
Expanded API schema support with OpenAI Chat Completions, Completions, and TGI formats
Added multimodal support and function calling capabilities
Supports latest models like Llama 4, Gemma 3, Qwen, Mistral AI, and DeepSeek-R

The new container provides improved performance, flexibility, and ease of deployment for generative AI models across various use cases and model types.

Go to article

The AWS News Feed is currently looking for gold sponsors. If you want to support the AWS community and reach a large audience of AWS professionals, consider sponsoring the AWS News Feed.

Feb 12
2025

Achieve ~2x speed-up in LLM inference with Medusa-1 on Amazon SageMaker AI

Dec 24
2025

Optimizing LLM inference on Amazon SageMaker AI with BentoML’s LLM- Optimizer

May 29
2026

Comprehensive observability for Amazon SageMaker AI LLM inference: From GPU utilization to LLM quality

Mar 18
2024

Optimize price-performance of LLM inference on NVIDIA GPUs using the Amazon SageMaker integration with NVIDIA NIM Microservices

The AWS News Feed is currently looking for silver sponsors. If you want to support the AWS community and reach a large audience of AWS professionals, consider sponsoring the AWS News Feed.

Supercharge your LLM performance with Amazon SageMaker Large Model Inference container v15

Related articles