Build real-time voice applications with Amazon SageMaker AI and vLLM
Machine Learning Blog
This article demonstrates how to deploy Mistral AI's Voxtral-Mini-4B-Realtime-2602 speech model on Amazon SageMaker AI with vLLM for real-time speech-to-text transcription over bidirectional streaming.
- SageMaker bidirectional streaming enables persistent HTTP/2 connections for simultaneous audio input and transcription output
- vLLM's Realtime API provides WebSocket-based streaming transcription with low per-token latency via CUDA graph optimization
- Custom Docker container bridges SageMaker HTTP/2 streams to vLLM WebSocket endpoints transparently
- Audio must be base64-encoded PCM16 at 16 kHz mono before transmission
- Includes file-based and live microphone Gradio clients for testing and interactive use
- Supports voice agents, live captioning, contact center analytics, and accessibility applications
- Requires an ml.g6.4xlarge instance; tune chunk size and pacing to trade off latency against throughput
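The audio format requirement above (base64-encoded PCM16 at 16 kHz mono) can be sketched in a few lines of standard-library Python. This is a minimal illustration, not the article's client code: the helper names `float_to_pcm16` and `chunk_and_encode` are hypothetical, the 100 ms chunk duration is an assumed starting point for tuning, and the message envelope that carries each chunk to the endpoint is deliberately left out because it is not described here.

```python
import base64
import math
import struct

SAMPLE_RATE = 16_000   # 16 kHz mono, per the format requirement above
BYTES_PER_SAMPLE = 2   # PCM16 = signed 16-bit samples


def float_to_pcm16(samples):
    """Convert floats in [-1.0, 1.0] to little-endian signed 16-bit PCM bytes.

    Hypothetical helper: out-of-range values are clipped rather than wrapped.
    """
    ints = [int(round(max(-1.0, min(1.0, s)) * 32767)) for s in samples]
    return struct.pack(f"<{len(ints)}h", *ints)


def chunk_and_encode(pcm_bytes, chunk_ms=100):
    """Yield base64 strings, each covering chunk_ms of audio (assumed default)."""
    bytes_per_chunk = SAMPLE_RATE * BYTES_PER_SAMPLE * chunk_ms // 1000
    for start in range(0, len(pcm_bytes), bytes_per_chunk):
        chunk = pcm_bytes[start:start + bytes_per_chunk]
        yield base64.b64encode(chunk).decode("ascii")


# Example input: one second of a 440 Hz tone standing in for microphone audio.
tone = [0.5 * math.sin(2 * math.pi * 440 * n / SAMPLE_RATE)
        for n in range(SAMPLE_RATE)]
pcm = float_to_pcm16(tone)
chunks = list(chunk_and_encode(pcm, chunk_ms=100))
```

Smaller values of `chunk_ms` send audio more often and can reduce perceived latency at the cost of more messages; larger values do the opposite. A live client would typically also pace its sends to roughly match real time.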
This solution enables production-ready real-time voice applications by combining SageMaker's managed infrastructure with vLLM's efficient model serving, eliminating custom streaming infrastructure development.