Run ML inference on unplanned and spiky traffic using Amazon SageMaker multi-model endpoints
Machine Learning Blog
This article describes a solution that lets Amazon SageMaker multi-model endpoints (MMEs) dynamically adjust the compute assigned to each model based on that model's traffic pattern. The solution uses DJL Serving as the model server, which can scale the number of workers per model up or down as the traffic load changes.
Specifically, the article covers:
- Overview of MME architecture and challenges with static resource allocation
- Introduction to DJL Serving and its auto-scaling capabilities
- Solution overview for using DJL Serving with MMEs
- Steps to create a model artifact, pull the DJL Docker image, and create the model file
- Creating a SageMaker model and endpoint with DJL Serving
- Load testing with different traffic patterns using Locust
- Results demonstrating dynamic scaling to handle varying traffic patterns
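
To make the idea of per-model worker scaling concrete, here is a minimal, self-contained sketch in Python. It is not DJL Serving's actual scaling algorithm or API. The class `ModelWorkerPool`, its fields, and the `requests_per_worker` capacity figure are all hypothetical names and values chosen for illustration. The sketch only assumes that each model has a lower and an upper bound on its worker count and that the count follows the pending load between those bounds.

```python
import math
from dataclasses import dataclass


@dataclass
class ModelWorkerPool:
    """Toy stand-in for one model's worker pool on a shared endpoint (hypothetical)."""

    min_workers: int = 1           # assumed floor kept warm for the model
    max_workers: int = 4           # assumed ceiling for the model
    requests_per_worker: int = 10  # assumed capacity of one worker per interval
    workers: int = 1

    def rescale(self, pending_requests: int) -> int:
        """Set the worker count to cover the pending load, clamped to the bounds."""
        needed = math.ceil(pending_requests / self.requests_per_worker)
        self.workers = max(self.min_workers, min(self.max_workers, needed))
        return self.workers


def simulate(traffic, pool):
    """Return the worker count chosen at each interval of a traffic trace."""
    return [pool.rescale(load) for load in traffic]


# A spiky trace: quiet, sudden burst, then back to idle.
spiky_traffic = [2, 5, 25, 120, 45, 8, 0]
history = simulate(spiky_traffic, ModelWorkerPool())
print(history)  # [1, 1, 3, 4, 4, 1, 1]
```

In this toy trace the worker count rises with the burst, saturates at the assumed ceiling, and falls back to the floor once traffic subsides. That is the behavior a static per-model allocation cannot provide.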