Amazon SageMaker launches faster auto-scaling for Generative AI models

This article announces a new Amazon SageMaker Inference capability that enables faster auto-scaling for generative AI models. It reduces the time models take to scale automatically, helping customers keep their applications responsive as demand fluctuates.

Specifically, the article covers:

Two new high-resolution CloudWatch metrics (ConcurrentRequestsPerModel and ConcurrentRequestsPerModelCopy) that track actual concurrency, that is, the number of in-flight inference requests the model is processing
The ability to create auto-scaling policies using these metrics to scale models deployed on SageMaker endpoints, with new instances or model copies added in under a minute
Availability on accelerator instance families in all AWS regions where Amazon SageMaker Inference is available, except China and the AWS GovCloud (US) Regions
Links to the AWS ML blog and documentation for more information
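
As a rough illustration of the second point, a target-tracking policy on the new concurrency metric might be set up along the following lines. This is a minimal sketch, not the article's own code. It assumes the boto3 Application Auto Scaling client and the predefined metric type `SageMakerVariantConcurrentRequestsPerModelHighResolution`, which should be checked against the current AWS documentation. The endpoint name, variant name, capacity limits, and target value are hypothetical placeholders.

```python
from typing import Any, Dict, Tuple


def build_scaling_requests(
    endpoint_name: str,
    variant_name: str,
    target_concurrency: float,
    min_instances: int = 1,
    max_instances: int = 4,
) -> Tuple[Dict[str, Any], Dict[str, Any]]:
    """Build request parameters for a concurrency-based target-tracking policy.

    Returns (scalable_target_kwargs, scaling_policy_kwargs). All names and
    numeric values passed in are illustrative, not taken from the article.
    """
    resource_id = f"endpoint/{endpoint_name}/variant/{variant_name}"
    dimension = "sagemaker:variant:DesiredInstanceCount"

    # Parameters for registering the endpoint variant as a scalable target.
    target = {
        "ServiceNamespace": "sagemaker",
        "ResourceId": resource_id,
        "ScalableDimension": dimension,
        "MinCapacity": min_instances,
        "MaxCapacity": max_instances,
    }

    # Parameters for a target-tracking policy on in-flight requests per model.
    policy = {
        "PolicyName": f"{endpoint_name}-concurrency-target-tracking",
        "ServiceNamespace": "sagemaker",
        "ResourceId": resource_id,
        "ScalableDimension": dimension,
        "PolicyType": "TargetTrackingScaling",
        "TargetTrackingScalingPolicyConfiguration": {
            "TargetValue": target_concurrency,
            "PredefinedMetricSpecification": {
                # Assumed predefined metric name; verify before use.
                "PredefinedMetricType": (
                    "SageMakerVariantConcurrentRequestsPerModelHighResolution"
                ),
            },
        },
    }
    return target, policy


def apply_scaling(target: Dict[str, Any], policy: Dict[str, Any]) -> None:
    """Send both requests to AWS (requires boto3 and valid credentials)."""
    import boto3  # imported lazily so the builder above works without it

    client = boto3.client("application-autoscaling")
    client.register_scalable_target(**target)
    client.put_scaling_policy(**policy)
```

Calling `build_scaling_requests("my-endpoint", "AllTraffic", 5.0)` and passing the result to `apply_scaling` would ask Application Auto Scaling to hold roughly five concurrent requests per model by adding or removing instances. Keeping the request construction separate from the AWS call makes the parameters easy to inspect before anything is applied.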

Related articles

Jul 25, 2024: Amazon SageMaker inference launches faster auto scaling for generative AI models

Dec 6, 2024: Amazon SageMaker introduces new capabilities to accelerate scaling of Generative AI Inference

Jul 9, 2024: Amazon SageMaker introduces a new generative AI inference optimization capability

Jun 30, 2026: Amazon SageMaker AI cuts generative AI inference scale-out time by up to half with automatic container image caching