Scaling your LLM inference workloads: multi-node deployment with TensorRT-LLM and Triton on Amazon EKS

HPC Blog

This AWS HPC Blog article walks through scaling Large Language Model (LLM) inference across multiple nodes with TensorRT-LLM and Triton on Amazon EKS, using the deployment of the Llama 3.1 405B model as its worked example.

Key technologies used:
- Amazon EKS for Kubernetes cluster management
- NVIDIA Triton Inference Server
- NVIDIA TensorRT-LLM for model optimization
- Elastic Fabric Adapter (EFA) for low-latency networking
- Amazon EFS for shared storage

Deployment architecture highlights:
- Uses two p5.48xlarge instances, each with 8 NVIDIA H100 GPUs
- Implements tensor parallelism (8-way) and pipeline parallelism (2-way)
- Utilizes LeaderWorkerSet for multi-node model deployment
- Includes autoscaling with Horizontal Pod Autoscaler and Cluster Autoscaler

Key benefits:
- Enables serving of massive LLMs across multiple nodes
- Provides scalable and efficient inference infrastructure
- Supports dynamic resource allocation and scaling

The article provides a detailed, step-by-step guide for setting up the infrastructure, configuring the deployment, and running inference on large language models.
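The parallelism layout in the highlights above can be made concrete with a little arithmetic. The sketch below is illustrative only and is not taken from the article: it assumes 16-bit weights (2 bytes per parameter), 80 GB of memory per H100, and a rank ordering in which each 8-way tensor-parallel group occupies consecutive ranks. The names `Placement` and `layout` are hypothetical helpers, not part of TensorRT-LLM or Triton.

```python
from dataclasses import dataclass

# Values stated in the summary above.
TP_SIZE = 8          # tensor parallelism (8-way)
PP_SIZE = 2          # pipeline parallelism (2-way)
GPUS_PER_NODE = 8    # H100 GPUs per p5.48xlarge instance
PARAMS = 405 * 10**9 # Llama 3.1 405B

# Assumptions for illustration (not stated in the summary).
BYTES_PER_PARAM = 2  # assumes 16-bit weights
GPU_MEM_GB = 80      # assumes 80 GB of memory per H100


@dataclass(frozen=True)
class Placement:
    rank: int       # global rank of the GPU process
    node: int       # which instance hosts the rank
    local_gpu: int  # GPU index within that instance
    pp_stage: int   # pipeline stage the rank belongs to
    tp_rank: int    # position inside its tensor-parallel group


def layout(tp_size: int, pp_size: int, gpus_per_node: int) -> list[Placement]:
    """Map each global rank to a node, pipeline stage, and tensor-parallel rank.

    Assumes tensor-parallel groups occupy consecutive ranks, so that with
    tp_size == gpus_per_node each pipeline stage stays within one node.
    """
    world_size = tp_size * pp_size
    if world_size % gpus_per_node != 0:
        raise ValueError("world size must be a multiple of GPUs per node")
    return [
        Placement(
            rank=rank,
            node=rank // gpus_per_node,
            local_gpu=rank % gpus_per_node,
            pp_stage=rank // tp_size,
            tp_rank=rank % tp_size,
        )
        for rank in range(world_size)
    ]


placements = layout(TP_SIZE, PP_SIZE, GPUS_PER_NODE)
world_size = len(placements)
num_nodes = world_size // GPUS_PER_NODE

# Weight footprint versus memory available on one node and on the cluster.
weights_gb = PARAMS * BYTES_PER_PARAM // 10**9
node_mem_gb = GPUS_PER_NODE * GPU_MEM_GB
cluster_mem_gb = num_nodes * node_mem_gb

print(f"GPUs: {world_size} across {num_nodes} nodes")
print(f"weights: {weights_gb} GB, one node: {node_mem_gb} GB, cluster: {cluster_mem_gb} GB")
for p in placements:
    print(p)
```

Under these assumptions the weights alone exceed the memory of a single instance, which is why the model is spread over two. Keeping each tensor-parallel group on one node confines the most communication-heavy traffic to that node, so only pipeline-stage handoffs need to cross the EFA network.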

Mar 18, 2024

Optimize price-performance of LLM inference on NVIDIA GPUs using the Amazon SageMaker integration with NVIDIA NIM Microservices

Jan 9, 2026

Accelerating LLM inference with post-training weight and activation using AWQ and GPTQ on Amazon SageMaker AI

Apr 22, 2025

Supercharge your LLM performance with Amazon SageMaker Large Model Inference container v15

Apr 15, 2026

Accelerating decode-heavy LLM inference with speculative decoding on AWS Trainium and vLLM

Related articles