Amazon SageMaker launches faster auto-scaling for Generative AI models
This article announces a new capability in Amazon SageMaker Inference that speeds up auto-scaling for generative AI models. It shortens the time models take to scale automatically, so customers can keep their applications responsive as demand fluctuates.
Specifically, the article covers:
- Two new high-resolution CloudWatch metrics (ConcurrentRequestsPerModel and ConcurrentRequestsPerModelCopy) that track actual concurrency, meaning the number of in-flight inference requests the model is processing
- The ability to create auto-scaling policies using these metrics to scale models deployed on SageMaker endpoints, with new instances or model copies added in under a minute
- Availability on accelerator instance families in all AWS regions where Amazon SageMaker Inference is available, except China and the AWS GovCloud (US) Regions
- Links to the AWS ML blog and documentation for more information
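As an illustration of the auto-scaling policy mentioned above, the following is a minimal sketch of a target-tracking policy keyed on concurrent requests, written with boto3 and Application Auto Scaling. It is not taken from the announcement: the helper names, the policy name, the cooldown values, and the target concurrency are hypothetical, and the predefined metric type string is an assumption that should be checked against the current Application Auto Scaling documentation before use.

```python
from typing import Any, Dict

# Assumed predefined metric type for the high-resolution per-model concurrency
# metric; verify the exact string in the Application Auto Scaling docs.
METRIC_TYPE = "SageMakerVariantConcurrentRequestsPerModelHighResolution"


def build_concurrency_policy(
    target_concurrency: float,
    scale_out_cooldown: int = 60,    # illustrative value, in seconds
    scale_in_cooldown: int = 300,    # illustrative value, in seconds
) -> Dict[str, Any]:
    """Build a target-tracking configuration based on in-flight requests per model."""
    if target_concurrency <= 0:
        raise ValueError("target_concurrency must be positive")
    return {
        "TargetValue": float(target_concurrency),
        "PredefinedMetricSpecification": {"PredefinedMetricType": METRIC_TYPE},
        "ScaleOutCooldown": scale_out_cooldown,
        "ScaleInCooldown": scale_in_cooldown,
    }


def apply_policy(
    endpoint_name: str,
    variant_name: str,
    min_instances: int,
    max_instances: int,
    target_concurrency: float,
) -> None:
    """Register the endpoint variant as a scalable target and attach the policy."""
    import boto3  # imported here so the config helper works without AWS access

    client = boto3.client("application-autoscaling")
    resource_id = f"endpoint/{endpoint_name}/variant/{variant_name}"
    dimension = "sagemaker:variant:DesiredInstanceCount"

    client.register_scalable_target(
        ServiceNamespace="sagemaker",
        ResourceId=resource_id,
        ScalableDimension=dimension,
        MinCapacity=min_instances,
        MaxCapacity=max_instances,
    )
    client.put_scaling_policy(
        PolicyName="concurrency-target-tracking",  # hypothetical policy name
        ServiceNamespace="sagemaker",
        ResourceId=resource_id,
        ScalableDimension=dimension,
        PolicyType="TargetTrackingScaling",
        TargetTrackingScalingPolicyConfiguration=build_concurrency_policy(
            target_concurrency
        ),
    )
```

Calling `apply_policy("my-endpoint", "AllTraffic", 1, 4, 5)` (placeholder names) would ask Application Auto Scaling to hold roughly five in-flight requests per model by adjusting the instance count between one and four. A suitable target value depends on the model and instance type and is best determined by load testing.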
Related articles
- Jul 25, 2024: Amazon SageMaker inference launches faster auto scaling for generative AI models
- Dec 6, 2024: Amazon SageMaker introduces new capabilities to accelerate scaling of Generative AI Inference
- Jul 9, 2024: Amazon SageMaker introduces a new generative AI inference optimization capability
- Dec 3, 2024: Amazon SageMaker launches the updated inference optimization toolkit for generative AI