Amazon SageMaker adds new inference capabilities that let you deploy multiple foundation models on the same endpoint, allocate specific compute and memory resources to each model, and apply a separate scaling policy to each model. These capabilities can reduce model deployment costs by 50% on average and lower inference latency by 20% on average.

<div>
<p>The article introduces new Amazon SageMaker inference capabilities that help reduce foundation model deployment costs and latency.</p>
<p>Specifically, the article covers:</p>
<ul>
<li>Key components of the new inference capabilities, including the ability to deploy multiple foundation models on the same SageMaker endpoint and control resource allocation for each model</li>
<li>How to use the new capabilities from SageMaker Studio, Python SDK, AWS SDKs, AWS CLI, and CloudFormation</li>
<li>A demo showing how to deploy two large language models (Dolly v2 7B and FLAN-T5 XXL) on a SageMaker endpoint using the new inference capabilities</li>
<li>Benefits such as improved resource utilization, deployment costs reduced by 50% on average, and inference latency lowered by 20% on average</li>
<li>Availability and pricing details for the new capabilities</li>
</ul>
</div>
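
To make the per-model resource allocation concrete, here is a minimal Python sketch that builds one request per model for a shared endpoint. It assumes the field names of the SageMaker `CreateInferenceComponent` API as I understand them. The helper `inference_component_request`, the endpoint name `llm-endpoint`, the model names, and the accelerator and memory figures are all hypothetical illustrations, not values from the article.

```python
from typing import Any, Dict


def inference_component_request(
    endpoint_name: str,
    model_name: str,
    accelerators: int,
    min_memory_mb: int,
    copies: int = 1,
    variant_name: str = "AllTraffic",
) -> Dict[str, Any]:
    """Build keyword arguments describing one model hosted on a shared endpoint.

    Hypothetical helper: the field names follow the CreateInferenceComponent
    request shape as assumed here; verify them against the API reference.
    """
    if accelerators < 0 or min_memory_mb <= 0 or copies < 0:
        raise ValueError("resource values must be non-negative and memory positive")
    return {
        "InferenceComponentName": f"{model_name}-component",
        "EndpointName": endpoint_name,
        "VariantName": variant_name,
        "Specification": {
            "ModelName": model_name,
            # Resources reserved for each copy of this model.
            "ComputeResourceRequirements": {
                "NumberOfAcceleratorDevicesRequired": accelerators,
                "MinMemoryRequiredInMb": min_memory_mb,
            },
        },
        # Number of copies of the model to run; scaled independently per model.
        "RuntimeConfig": {"CopyCount": copies},
    }


# Two models sharing one endpoint, each with its own allocation (illustrative numbers).
requests = [
    inference_component_request("llm-endpoint", "dolly-v2-7b", accelerators=2, min_memory_mb=1024),
    inference_component_request("llm-endpoint", "flan-t5-xxl", accelerators=2, min_memory_mb=1024, copies=2),
]

for req in requests:
    print(req["InferenceComponentName"], req["RuntimeConfig"]["CopyCount"])
```

With AWS credentials configured and the endpoint and models already created, each dictionary could presumably be passed as `boto3.client("sagemaker").create_inference_component(**req)`. Treat that call as an assumption to check against the current boto3 documentation.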


Amazon SageMaker adds new inference capabilities to help reduce foundation model deployment costs and latency
