Accelerate NLP inference with ONNX Runtime on AWS Graviton processors

Machine Learning Blog

The article explains how ONNX Runtime accelerates natural language processing (NLP) inference on AWS Graviton3 processors. The gains come from optimized GEMM kernels: bfloat16 fast math kernels for fp32 models and optimized kernels for int8 quantized models. Together they speed up inference for transformer-based language models such as BERT, RoBERTa, and GPT2.

Specifically, the article covers:

Overview of ONNX Runtime and its support for optimized GEMM kernels on AWS Graviton3 processors
How to enable the optimizations for bfloat16 fast math kernels in ONNX Runtime
Benchmark results showing up to 65% improvement in throughput for fp32 models and up to 30% improvement for int8 quantized models
Step-by-step instructions for running benchmarks on an AWS Graviton3-based EC2 instance
Conclusion highlighting the performance benefits and encouraging readers to try the optimizations
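As a rough illustration of the enablement and benchmarking steps listed above, the sketch below shows one way to switch on the bfloat16 fast math kernels through a session config entry and to compare throughput with and without it. It is not taken from the article: the config key mlas.enable_gemm_fastmath_arm64_bfloat16 is assumed from ONNX Runtime's session option config keys and should be checked against the installed version, and the helper names (fastmath_config, make_session, throughput, gain_percent) are hypothetical.

```python
import time

# Assumed session config key for the bfloat16 fast math GEMM kernels on arm64.
# Verify it against the ONNX Runtime version you have installed.
FASTMATH_KEY = "mlas.enable_gemm_fastmath_arm64_bfloat16"


def fastmath_config(enabled=True):
    """Return the session config entries that toggle bfloat16 fast math."""
    return {FASTMATH_KEY: "1" if enabled else "0"}


def make_session(model_path, enabled=True):
    """Create a CPU InferenceSession with the fast math entry applied."""
    # Imported lazily so the pure-Python helpers work without onnxruntime.
    import onnxruntime as ort

    options = ort.SessionOptions()
    for key, value in fastmath_config(enabled).items():
        options.add_session_config_entry(key, value)
    return ort.InferenceSession(
        model_path, sess_options=options, providers=["CPUExecutionProvider"]
    )


def throughput(run_once, iterations=100):
    """Return calls per second for a zero-argument callable."""
    start = time.perf_counter()
    for _ in range(iterations):
        run_once()
    elapsed = time.perf_counter() - start
    return iterations / elapsed


def gain_percent(baseline, optimized):
    """Percentage throughput improvement of optimized over baseline."""
    return (optimized - baseline) / baseline * 100.0
```

On a Graviton3-based instance, one would build two sessions with make_session(path, enabled=False) and make_session(path, enabled=True), wrap each session's run call in a zero-argument callable, and pass the two throughput values to gain_percent. The model path and input feed are left to the reader, since the summary does not specify them.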

Jul 2, 2024

Accelerated PyTorch inference with torch.compile on AWS Graviton processors

Jun 11, 2024

Sprinklr improves performance by 20% and reduces cost by 25% for machine learning inference on AWS Graviton3

Feb 29, 2024

Accelerating large-scale neural network training on CPUs with ThirdAI and AWS Graviton

Jun 5, 2025

Run small language models cost-efficiently with AWS Graviton and Amazon SageMaker AI

Related articles