Accelerate NLP inference with ONNX Runtime on AWS Graviton processors
Machine Learning Blog
The article explains how ONNX Runtime accelerates natural language processing (NLP) inference on AWS Graviton3 processors. The gains come from optimized GEMM kernels: bfloat16 fast math kernels for fp32 models and int8 kernels for quantized models. Together they speed up inference for transformer-based language models such as BERT, RoBERTa, and GPT2.
Specifically, the article covers:
- Overview of ONNX Runtime and its support for optimized GEMM kernels on AWS Graviton3 processors
- How to enable the optimizations for bfloat16 fast math kernels in ONNX Runtime
- Benchmark results showing up to 65% improvement in throughput for fp32 models and up to 30% improvement for int8 quantized models
- Step-by-step instructions for running benchmarks on an AWS Graviton3-based EC2 instance
- Conclusion highlighting the performance benefits and encouraging readers to try the optimizations
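As an illustration of the enablement step listed above, here is a minimal sketch of turning on the bfloat16 fast math kernels through an ONNX Runtime session option. It assumes the `onnxruntime` Python package is installed on a Graviton3 instance. The helper names `enable_bf16_fastmath` and `create_session` are hypothetical and exist only for this example; the config key is the one ONNX Runtime defines for this feature, and it is expected to have no effect on hardware without bfloat16 support.

```python
from typing import Any

# Session config key that switches fp32 GEMM to the bfloat16 fast math kernels.
BF16_FASTMATH_KEY = "mlas.enable_gemm_fastmath_arm64_bfloat16"


def enable_bf16_fastmath(sess_options: Any, enabled: bool = True) -> Any:
    """Set the fast math flag on an onnxruntime.SessionOptions-like object.

    Hypothetical helper: it only needs an object that exposes
    add_session_config_entry(key, value), which SessionOptions does.
    """
    sess_options.add_session_config_entry(BF16_FASTMATH_KEY, "1" if enabled else "0")
    return sess_options


def create_session(model_path: str):
    """Create a CPU inference session with the fast math kernels requested."""
    # Imported lazily so enable_bf16_fastmath has no hard dependency on onnxruntime.
    import onnxruntime as ort

    opts = enable_bf16_fastmath(ort.SessionOptions())
    return ort.InferenceSession(
        model_path, sess_options=opts, providers=["CPUExecutionProvider"]
    )
```

Calling `create_session("model.onnx")` (the path is a placeholder) would return a session whose fp32 matrix multiplications may use the bfloat16 kernels. Because bfloat16 has less precision than fp32, it is worth checking model accuracy before relying on this setting.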