
Accelerate NLP inference with ONNX Runtime on AWS Graviton processors

Machine Learning Blog



The article explains how optimized GEMM kernels in ONNX Runtime accelerate natural language processing (NLP) inference on AWS Graviton3 processors. Kernels tuned for bfloat16 and int8 quantized models speed up inference for transformer-based language models such as BERT, RoBERTa, and GPT2.

Specifically, the article covers:

  • Overview of ONNX Runtime and its support for optimized GEMM kernels on AWS Graviton3 processors
  • How to enable the optimizations for bfloat16 fast math kernels in ONNX Runtime
  • Benchmark results showing up to 65% improvement in throughput for fp32 models and up to 30% improvement for int8 quantized models
  • Step-by-step instructions for running benchmarks on an AWS Graviton3-based EC2 instance
  • Conclusion highlighting the performance benefits and encouraging readers to try the optimizations
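
As a rough illustration of the second and third points, the sketch below shows one way the bfloat16 fast math kernels might be switched on from Python, plus a helper for expressing a latency change as a throughput gain. The session config key `mlas.enable_gemm_fastmath_arm64_bfloat16` is quoted from memory of the ONNX Runtime documentation and should be checked against the installed version. The helper names and the `model.onnx` path are hypothetical.

```python
# Config key for the bfloat16 fast math GEMM kernels. Assumption: verify this
# key against the ONNX Runtime release you install.
FASTMATH_BF16_KEY = "mlas.enable_gemm_fastmath_arm64_bfloat16"


def fastmath_config_entries(enable=True):
    """Return the session config entries to apply (empty when disabled)."""
    return {FASTMATH_BF16_KEY: "1"} if enable else {}


def create_session(model_path, enable_fastmath=True):
    """Build a CPU inference session, optionally with bf16 fast math enabled."""
    # Imported lazily so the pure helpers above work without onnxruntime.
    import onnxruntime as ort

    sess_options = ort.SessionOptions()
    for key, value in fastmath_config_entries(enable_fastmath).items():
        sess_options.add_session_config_entry(key, value)
    return ort.InferenceSession(
        model_path, sess_options, providers=["CPUExecutionProvider"]
    )


def throughput_gain_percent(baseline_latency_s, optimized_latency_s):
    """Throughput is 1/latency, so the gain is the latency ratio minus one."""
    return (baseline_latency_s / optimized_latency_s - 1.0) * 100.0
```

A call such as `create_session("model.onnx")` would then load a (hypothetical) fp32 model with the flag set. Measuring the same model with `enable_fastmath=False` gives the baseline latency to pass to `throughput_gain_percent`. The flag is only expected to have an effect on Arm64 CPUs with bfloat16 support, such as Graviton3.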




Related articles

  • Jul 2, 2024: Accelerated PyTorch inference with torch.compile on AWS Graviton processors
  • Jun 11, 2024: Sprinklr improves performance by 20% and reduces cost by 25% for machine learning inference on AWS Graviton3
  • Feb 29, 2024: Accelerating large-scale neural network training on CPUs with ThirdAI and AWS Graviton
  • Jun 5, 2025: Run small language models cost-efficiently with AWS Graviton and Amazon SageMaker AI
