Accelerate CPU-based AI inference workloads using Intel AMX on Amazon EC2
Compute Blog
This article demonstrates how to accelerate CPU-based AI inference on Amazon EC2 using Intel Advanced Matrix Extensions (AMX), reporting improvements of up to 76% from hardware acceleration combined with BF16 precision.
- Intel AMX accelerates matrix operations directly on CPU cores using specialized hardware
- BF16 precision with AMX delivers 21-72% latency improvements at batch sizes 8+
- EC2 m8i instances provide 9-14% better performance than m7i across tested models
- Optimal batch sizes of 4-16 maximize AMX benefits for different model architectures
- Combined m8i + BF16 AMX optimization achieves up to 76% improvement vs m7i FP32
- CPU inference is cost-effective for batch processing, small-to-medium models, and variable workloads
- PyTorch automatically leverages AMX with minimal code changes via environment variables
- Benchmarked across six models: BigBird, DialoGPT, Gemma, DeepSeek, Llama, YOLOv5
- m8i delivers up to 13% better price-performance than m7i for inference workloads
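As a rough illustration of the AMX and BF16 points above, the following minimal sketch checks whether a Linux host advertises AMX and runs a forward pass under PyTorch's CPU BF16 autocast. The helper names `detect_amx` and `bf16_inference` are hypothetical and not from the article. The summary does not say which environment variables the article uses, so `ONEDNN_VERBOSE` (a oneDNN logging switch) is an assumption included only to show how kernel dispatch could be inspected.

```python
import os
from pathlib import Path

# Feature names the Linux kernel reports in /proc/cpuinfo for AMX-capable CPUs.
AMX_FLAGS = {"amx_tile", "amx_bf16", "amx_int8"}


def detect_amx(cpuinfo_text: str) -> set[str]:
    """Return the AMX feature flags present in /proc/cpuinfo-style text."""
    found: set[str] = set()
    for line in cpuinfo_text.splitlines():
        if line.startswith("flags"):
            _, _, value = line.partition(":")
            found |= AMX_FLAGS & set(value.split())
    return found


def bf16_inference(model, inputs):
    """Run one forward pass under CPU BF16 autocast (requires PyTorch)."""
    import torch  # imported lazily so detect_amx works without PyTorch

    model.eval()
    with torch.inference_mode(), torch.autocast(device_type="cpu", dtype=torch.bfloat16):
        return model(inputs)


if __name__ == "__main__":
    # Assumption: oneDNN reads this variable at start-up, so set it before
    # PyTorch is imported if you want to see which kernels are dispatched.
    os.environ.setdefault("ONEDNN_VERBOSE", "1")

    cpuinfo = Path("/proc/cpuinfo")
    flags = detect_amx(cpuinfo.read_text()) if cpuinfo.exists() else set()
    print("AMX flags:", sorted(flags) or "none found")
```

On an instance without AMX the autocast call should still run, but it would fall back to other instruction sets, so the reported BF16 gains should not be expected there.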
This guide enables organizations to optimize CPU-based AI inference costs while maintaining performance through Intel AMX acceleration on modern EC2 instances.
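To check figures like these against your own models, a small timing harness is usually enough. The sketch below assumes that "improvement" means the relative reduction in median latency against a baseline; the summary does not state how the article computes its percentages, and the function names are hypothetical.

```python
import statistics
import time
from typing import Callable


def median_latency(fn: Callable[[], object], warmup: int = 3, runs: int = 10) -> float:
    """Median wall-clock seconds per call of fn, after a few warm-up calls."""
    for _ in range(warmup):
        fn()
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        samples.append(time.perf_counter() - start)
    return statistics.median(samples)


def improvement_pct(baseline_s: float, optimized_s: float) -> float:
    """Latency reduction relative to the baseline, in percent."""
    return (baseline_s - optimized_s) / baseline_s * 100.0
```

In practice `fn` would wrap a forward pass at a fixed batch size, once in FP32 and once in BF16, and the two medians would be passed to `improvement_pct`. Repeating this over several batch sizes shows where the gains appear for a given model.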