
Run ML inference on unplanned and spiky traffic using Amazon SageMaker multi-model endpoints

Machine Learning Blog



This article describes a solution that lets Amazon SageMaker multi-model endpoints (MMEs) dynamically adjust the compute resources assigned to each model based on that model's traffic pattern. The solution uses DJL Serving as the model server, whose auto-scaling raises or lowers the number of workers per model as the traffic load changes.
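To make the idea of per-model worker scaling concrete, here is a minimal sketch in Python. It is a toy model only, not DJL Serving's actual algorithm: the class name `ModelWorkerPool`, the "one worker per pending request" rule, and the bounds of 1 and 4 workers are all assumptions chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class ModelWorkerPool:
    """Toy model of per-model worker scaling (illustrative, not DJL Serving code)."""

    min_workers: int = 1  # assumed lower bound, kept warm even when idle
    max_workers: int = 4  # assumed upper bound, caps resource use per model
    workers: int = 1

    def rescale(self, pending_requests: int) -> int:
        # Assumed rule: aim for one worker per pending request,
        # clamped to the [min_workers, max_workers] range.
        target = max(self.min_workers, min(self.max_workers, pending_requests))
        self.workers = target
        return self.workers


pool = ModelWorkerPool(min_workers=1, max_workers=4)
# Simulated load for one model: idle, light, spike, moderate, idle.
history = [pool.rescale(load) for load in [0, 2, 10, 3, 0]]
print(history)  # [1, 2, 4, 3, 1]
```

The worker count follows the load up to the cap during the spike and falls back to the minimum when the model goes idle, which frees capacity for other models on the same endpoint.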

Specifically, the article covers:

  • Overview of MME architecture and challenges with static resource allocation
  • Introduction to DJL Serving and its auto-scaling capabilities
  • Solution overview for using DJL Serving with MMEs
  • Steps to create a model artifact, pull the DJL Docker image, and create the model file
  • Creating a SageMaker model and endpoint with DJL Serving
  • Load testing with different traffic patterns using Locust
  • Results demonstrating dynamic scaling to handle varying traffic patterns
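
As a rough illustration of what "different traffic patterns" can mean in a load test, the sketch below defines a spiky profile as a list of stages and looks up the target number of concurrent users at a given time. The stage values and the `users_at` helper are hypothetical and are not taken from the article; a function like this could plausibly back a custom load shape in Locust, but that wiring is left out here.

```python
# Hypothetical load stages: (end_time_seconds, concurrent_users).
# Quiet baseline, a short spike, back to baseline, a larger spike, then cool-down.
STAGES = [(60, 5), (90, 50), (150, 5), (180, 80), (240, 5)]


def users_at(elapsed_seconds, stages=STAGES):
    """Return the target user count at a given time, or None once the test is over."""
    for end_time, users in stages:
        if elapsed_seconds < end_time:
            return users
    return None  # past the last stage: stop the test


# Sample the profile every 30 seconds.
profile = [users_at(t) for t in range(0, 270, 30)]
print(profile)  # [5, 5, 50, 5, 5, 80, 5, 5, None]
```

Keeping the profile as plain data makes it easy to swap in other shapes (steady, ramping, bursty) and rerun the same test against the endpoint.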




