Build real-time voice applications with Amazon SageMaker AI and vLLM

Machine Learning Blog

This article demonstrates how to deploy Mistral AI's Voxtral-Mini-4B-Realtime-2602 speech model on Amazon SageMaker AI with vLLM for real-time speech-to-text transcription over bidirectional streaming.

SageMaker bidirectional streaming keeps a persistent HTTP/2 connection open so that audio input and transcription output flow simultaneously.
vLLM's Realtime API provides WebSocket-based streaming transcription with low per-token latency through CUDA graph optimization.
A custom Docker container transparently bridges SageMaker HTTP/2 streams to vLLM WebSocket endpoints.
Audio must be 16 kHz mono PCM16 and base64-encoded before transmission.
File-based and live-microphone Gradio clients are included for testing and interactive use.
Target use cases include voice agents, live captioning, contact center analytics, and accessibility applications.
The deployment requires an ml.g6.4xlarge instance; tune the audio chunk size and send pacing to balance latency against throughput.
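The summary does not reproduce the article's client code, so the following is only a minimal Python sketch, under stated assumptions, of the client-side audio preparation described above: converting float samples to 16 kHz mono PCM16, splitting the bytes into fixed-duration chunks, base64-encoding each chunk, and sending the chunks at real-time pace. The function names (`float_to_pcm16`, `chunk_bytes`, `encode_chunks`, `send_paced`), the 100 ms default chunk duration, and the `send` callback are hypothetical. The message format actually expected by the vLLM Realtime API or the SageMaker bridge container is not shown here and is deliberately left out.

```python
import base64
import math
import struct
import time

SAMPLE_RATE_HZ = 16_000  # 16 kHz mono, as the article requires
BYTES_PER_SAMPLE = 2     # PCM16: signed 16-bit little-endian samples


def float_to_pcm16(samples):
    """Convert floats in [-1.0, 1.0] to little-endian signed 16-bit PCM bytes."""
    clipped = (max(-1.0, min(1.0, s)) for s in samples)  # avoid integer overflow
    ints = [int(round(s * 32767)) for s in clipped]
    return struct.pack(f"<{len(ints)}h", *ints)


def chunk_bytes(chunk_ms):
    """Number of PCM16 bytes in one chunk of chunk_ms milliseconds (mono, 16 kHz)."""
    return SAMPLE_RATE_HZ * chunk_ms // 1000 * BYTES_PER_SAMPLE


def encode_chunks(pcm, chunk_ms=100):
    """Split PCM16 bytes into fixed-duration chunks and base64-encode each one.

    The 100 ms default is a hypothetical starting point, not a value from the article.
    """
    size = chunk_bytes(chunk_ms)
    return [
        base64.b64encode(pcm[i:i + size]).decode("ascii")
        for i in range(0, len(pcm), size)
    ]


def send_paced(chunks, chunk_ms, send):
    """Pass each chunk to a caller-supplied send() (hypothetical transport hook),
    sleeping one chunk duration between sends to mimic live microphone input."""
    for chunk in chunks:
        send(chunk)
        time.sleep(chunk_ms / 1000)


# Usage: one second of a 440 Hz test tone.
tone = [0.5 * math.sin(2 * math.pi * 440 * n / SAMPLE_RATE_HZ) for n in range(SAMPLE_RATE_HZ)]
pcm = float_to_pcm16(tone)
chunks = encode_chunks(pcm, chunk_ms=100)
print(len(pcm), chunk_bytes(100), len(chunks))  # 32000 3200 10
```

In this sketch, smaller chunks let the model see new audio sooner but add per-message overhead, while larger chunks do the opposite. Pacing sends at the chunk duration approximates a live microphone; sending a file's chunks back to back instead favors throughput over latency.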

By combining SageMaker's managed infrastructure with vLLM's efficient model serving, this solution supports production-ready real-time voice applications without the need to build custom streaming infrastructure.
