Accelerating LLM inference with post-training weight and activation using AWQ and GPTQ on Amazon SageMaker AI

Machine Learning Blog

This article explains how to accelerate large language model (LLM) inference using post-training quantization (PTQ) techniques AWQ and GPTQ on Amazon SageMaker AI.

PTQ reduces model size 2-8x by converting weights and activations to lower-bit integers without retraining
W4A16 asymmetric quantization stores weights in 4 bits while keeping 16-bit activations, with minimal accuracy loss
W8A8 quantizes both weights and activations to 8 bits, enabling full integer inference for maximum hardware utilization and speed
W8A16 weight-only quantization provides a safe baseline with 2-4x memory reduction
AWQ uses activation-aware scaling to preserve critical weight channels at 4-bit precision
GPTQ applies layer-by-layer error compensation using Hessian approximations to limit the accuracy lost to compression
Quantized models show 30-70% GPU memory reduction across Llama and Qwen models tested
End-to-end latency improves 2-3x; throughput increases significantly at high concurrency
SageMaker training jobs with the llm-compressor library simplify the quantization workflow
Quantized models deploy on smaller GPU instances, reducing infrastructure costs substantially
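
To make the asymmetric quantization mentioned above more concrete, here is a minimal pure-Python sketch of the basic round trip: floats are mapped to unsigned 4-bit codes with a scale and zero point, then mapped back. The function names and the sample weight values are hypothetical and chosen only for illustration. AWQ and GPTQ build on this basic step with activation-aware scaling and error compensation, which this sketch does not attempt to reproduce.

```python
def quantize_asymmetric(weights, bits=4):
    """Map floats to integer codes in [0, 2**bits - 1] using a scale and zero point."""
    qmax = (1 << bits) - 1
    w_min, w_max = min(weights), max(weights)
    value_range = w_max - w_min
    # Guard against a constant group, where the range would be zero.
    scale = value_range / qmax if value_range > 0 else 1.0
    zero_point = round(-w_min / scale)
    # Round to the nearest code and clamp to the representable range.
    codes = [min(qmax, max(0, round(w / scale) + zero_point)) for w in weights]
    return codes, scale, zero_point


def dequantize(codes, scale, zero_point):
    """Recover approximate float values from integer codes."""
    return [(q - zero_point) * scale for q in codes]


# Hypothetical group of weights (illustrative values only).
group = [-0.42, -0.10, 0.03, 0.27, 0.55, 0.81]
codes, scale, zero_point = quantize_asymmetric(group, bits=4)
restored = dequantize(codes, scale, zero_point)
max_err = max(abs(a - b) for a, b in zip(group, restored))

print("codes:", codes)
print("max reconstruction error:", max_err)
```

When no value is clamped, the reconstruction error for each weight is at most half the scale, which is why narrower value ranges (for example, small per-group ranges) tend to quantize more accurately.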

PTQ enables cost-effective, scalable LLM deployment by dramatically reducing memory requirements and inference latency while maintaining model quality, making large models practical for production environments.
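
As a rough back-of-the-envelope illustration of the memory claim, the sketch below estimates weight storage at different bit widths. The 8-billion parameter count is a hypothetical example, not a figure from the article, and the estimate deliberately ignores activations, the KV cache, and the per-group scale and zero-point metadata that real quantized checkpoints carry.

```python
def weight_memory_gb(num_params, bits_per_weight):
    """Approximate weight storage in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bits_per_weight / 8 / 1e9


# Hypothetical 8B-parameter model, used only for illustration.
params = 8_000_000_000

fp16_gb = weight_memory_gb(params, 16)
int8_gb = weight_memory_gb(params, 8)
int4_gb = weight_memory_gb(params, 4)

for label, size in [("FP16", fp16_gb), ("INT8", int8_gb), ("INT4", int4_gb)]:
    print(f"{label}: {size:.1f} GB, {fp16_gb / size:.0f}x smaller than FP16")
```

Under these assumptions, going from 16-bit to 4-bit weights is a 4x reduction in weight storage; measured GPU memory savings are typically smaller because the ignored components do not shrink in the same proportion.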


Dec 24
2025

Optimizing LLM inference on Amazon SageMaker AI with BentoML’s LLM-Optimizer

Feb 12
2025

Achieve ~2x speed-up in LLM inference with Medusa-1 on Amazon SageMaker AI

Apr 22
2025

Supercharge your LLM performance with Amazon SageMaker Large Model Inference container v15

Jun 24
2025

Power Your LLM Training and Evaluation with the New SageMaker AI Generative AI Tools


Related articles