Run ML inference on unplanned and spiky traffic using Amazon SageMaker multi-model endpoints

Machine Learning Blog

This article discusses a solution that allows Amazon SageMaker multi-model endpoints (MMEs) to dynamically adjust the compute power assigned to each model based on that model's traffic pattern. It uses DJL Serving as the model server, which scales the number of workers per model up or down according to the traffic load.

Specifically, the article covers:

Overview of MME architecture and challenges with static resource allocation
Introduction to DJL Serving and its auto-scaling capabilities
Solution overview for using DJL Serving with MMEs
Steps to create a model artifact, pull the DJL Docker image, and create the model file
Creating a SageMaker model and endpoint with DJL Serving
Load testing with different traffic patterns using Locust
Results demonstrating dynamic scaling to handle varying traffic patterns
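
As a rough illustration of how a client addresses one specific model on a multi-model endpoint, the sketch below builds a SageMaker Runtime invoke_endpoint request that sets TargetModel. It is a minimal sketch under stated assumptions, not the article's code: the endpoint name, artifact name, and payload are hypothetical, and the JSON request and response format is assumed.

```python
import json


def build_invoke_request(endpoint_name, target_model, payload):
    """Build keyword arguments for SageMaker Runtime invoke_endpoint on an MME."""
    return {
        "EndpointName": endpoint_name,
        # On an MME, TargetModel names the model artifact to route the request to.
        "TargetModel": target_model,
        "ContentType": "application/json",  # assumed payload format
        "Body": json.dumps(payload),
    }


def invoke_model(client, endpoint_name, target_model, payload):
    """Send one request to a single model behind the endpoint and decode the reply.

    `client` is expected to behave like boto3.client("sagemaker-runtime").
    The response is assumed to be JSON; adjust if the model returns another format.
    """
    request = build_invoke_request(endpoint_name, target_model, payload)
    response = client.invoke_endpoint(**request)
    return json.loads(response["Body"].read())


# Hypothetical usage (requires AWS credentials and a deployed endpoint):
# import boto3
# client = boto3.client("sagemaker-runtime")
# invoke_model(client, "my-djl-mme", "model-a.tar.gz", {"inputs": [1.0, 2.0]})
```

Passing the client in as a parameter keeps the request-building logic easy to check without a live endpoint, and a load-testing tool can reuse the same function while varying target_model to imitate spiky per-model traffic.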



