
Build real-time voice applications with Amazon SageMaker AI and vLLM

Machine Learning Blog



This article shows how to deploy Mistral AI's Voxtral-Mini-4B-Realtime-2602 speech model on Amazon SageMaker AI with vLLM for real-time speech-to-text transcription over bidirectional streaming.

  • SageMaker bidirectional streaming enables persistent HTTP/2 connections for simultaneous audio input and transcription output
  • vLLM's Realtime API provides WebSocket-based streaming transcription with low per-token latency via CUDA graph optimization
  • A custom Docker container transparently bridges SageMaker HTTP/2 streams to vLLM WebSocket endpoints
  • Audio must be base64-encoded PCM16 at 16 kHz mono before transmission
  • The article includes file-based and live microphone Gradio clients for testing and interactive use
  • Supports voice agents, live captioning, contact center analytics, and accessibility applications
  • Requires an ml.g6.4xlarge instance; chunk size and pacing can be tuned to trade latency against throughput
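
The audio format and chunking points above can be illustrated with a minimal sketch. It assumes the client holds audio as floats in [-1.0, 1.0] already sampled at 16 kHz mono; the function names and the 100 ms default chunk size are hypothetical choices for illustration, not taken from the article, and the message envelope that wraps each chunk is left out because the summary does not specify it.

```python
import base64
import struct

SAMPLE_RATE_HZ = 16_000  # from the summary: 16 kHz mono
BYTES_PER_SAMPLE = 2     # PCM16 = signed 16-bit samples


def float_to_pcm16(samples):
    """Convert floats in [-1.0, 1.0] to little-endian signed 16-bit PCM bytes.

    Out-of-range values are clipped rather than wrapped.
    """
    ints = [int(round(max(-1.0, min(1.0, s)) * 32767)) for s in samples]
    return struct.pack(f"<{len(ints)}h", *ints)


def chunk_pcm16(pcm, chunk_ms):
    """Yield successive slices of `pcm`, each covering `chunk_ms` of audio."""
    chunk_bytes = SAMPLE_RATE_HZ * chunk_ms // 1000 * BYTES_PER_SAMPLE
    for start in range(0, len(pcm), chunk_bytes):
        yield pcm[start:start + chunk_bytes]


def encode_chunks(samples, chunk_ms=100):
    """Return base64 strings, one per chunk (chunk_ms is an assumed default)."""
    pcm = float_to_pcm16(samples)
    return [
        base64.b64encode(chunk).decode("ascii")
        for chunk in chunk_pcm16(pcm, chunk_ms)
    ]
```

Smaller chunks generally reduce the delay before the first transcript tokens arrive but add more messages per second of audio; a file-based client would typically also wait roughly `chunk_ms` between sends to approximate real-time pacing.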

This solution combines SageMaker's managed infrastructure with vLLM's efficient model serving to support production-ready real-time voice applications, removing the need to build custom streaming infrastructure.





Related articles

  • May 13, 2026: Build real-time voice streaming applications with Amazon Nova Sonic and WebRTC
  • Feb 25, 2026: Efficiently serve dozens of fine-tuned models with vLLM on Amazon SageMaker AI and Amazon Bedrock
  • Dec 4, 2024: Building Generative AI and ML solutions faster with AI apps from AWS partners using Amazon SageMaker
  • Oct 28, 2025: Hosting NVIDIA speech NIM models on Amazon SageMaker AI: Parakeet ASR
