Boost inference performance for Mixtral and Llama 2 models with new Amazon SageMaker containers
Machine Learning Blog
This article discusses the new version 0.26.0 of Amazon SageMaker Large Model Inference (LMI) Deep Learning Containers, which offers improved performance and new features for efficient inference with large language models.
Specifically, the article covers:
- New models supported, including Mixtral, Llama2-70B, and more
- Context window extension using rotary position embeddings (RoPE)
- Additional generation details like token log probabilities and finish reasons
- Performance improvements and new features for various LMI backends (LMI-Distributed, TensorRT-LLM, vLLM, NeuronX)
- A detailed guide to deploying the Mixtral model with LMI and using the new generation details
- Conclusion highlighting the benefits of the new LMI capabilities
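RoPE-based context extension works by changing how token positions map to rotation angles. The sketch below assumes linear position interpolation, one common scaling scheme; the article summary does not say which scheme or parameters LMI uses. It also uses the interleaved-pair formulation of RoPE. The function names `rope_angles` and `apply_rope` and the `scale` parameter are hypothetical and illustrative only, not part of the LMI API.

```python
import math


def rope_angles(position: int, dim: int, base: float = 10000.0, scale: float = 1.0) -> list[float]:
    """Rotation angles for one token position (hypothetical helper).

    scale > 1 applies linear position interpolation: positions are divided by
    `scale`, so a sequence `scale` times longer than the trained context is
    mapped back into the position range the model saw during training.
    """
    scaled_pos = position / scale
    # One frequency per pair of dimensions: base^(-2i/dim)
    return [scaled_pos * base ** (-2 * i / dim) for i in range(dim // 2)]


def apply_rope(vec: list[float], position: int, base: float = 10000.0, scale: float = 1.0) -> list[float]:
    """Rotate consecutive pairs (x[2i], x[2i+1]) of an even-length vector."""
    if len(vec) % 2:
        raise ValueError("vector length must be even")
    angles = rope_angles(position, len(vec), base, scale)
    out: list[float] = []
    for i, theta in enumerate(angles):
        x0, x1 = vec[2 * i], vec[2 * i + 1]
        c, s = math.cos(theta), math.sin(theta)
        # 2-D rotation of the pair by theta; vector length is preserved
        out.extend([x0 * c - x1 * s, x0 * s + x1 * c])
    return out
```

With `scale=2.0`, position 8192 receives exactly the same angles as position 4096 without scaling. This is the sense in which interpolation lets a model trained on a shorter window address a longer one: the rotation angles stay in the trained range, but neighboring positions end up closer together.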