How Patsnap used GPT-2 inference on Amazon SageMaker with low latency and cost




This article describes how Patsnap optimized GPT-2 inference for patent search autocomplete using NVIDIA TensorRT and Amazon SageMaker.

  • Patsnap needed GPT-2 inference latency under 600ms for real-time patent search suggestions
  • NVIDIA TensorRT optimization reduced average latency from 1,172ms to 531ms (55% improvement)
  • Queries per second increased from 3.4 to 7.5 at maximum concurrency (120% improvement)
  • PyTorch model converted to TensorRT via the ONNX intermediate format with no accuracy loss
  • Model deployed on SageMaker using bring-your-own-container with a custom Docker image
  • Achieved 2.9x acceleration on P3.2xlarge GPU instances (NVIDIA V100)
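
The percentages quoted above follow directly from the before/after figures. As a quick check, the sketch below recomputes them in Python; the helper names (`pct_reduction`, `pct_increase`) are illustrative and not from the article, and the 600 ms target is taken from the first bullet. The 2.9x figure is not derivable from these averages, so the sketch does not attempt to reproduce it.

```python
# Recompute the improvements quoted in the summary from its own numbers.
# Helper names are hypothetical; only the input figures come from the text.

def pct_reduction(before: float, after: float) -> float:
    """Percentage by which a value dropped, relative to the starting value."""
    return (before - after) / before * 100

def pct_increase(before: float, after: float) -> float:
    """Percentage by which a value grew, relative to the starting value."""
    return (after - before) / before * 100

latency_target_ms = 600                              # latency requirement
latency_before_ms, latency_after_ms = 1172, 531      # average latency
qps_before, qps_after = 3.4, 7.5                     # at maximum concurrency

latency_cut = pct_reduction(latency_before_ms, latency_after_ms)
qps_gain = pct_increase(qps_before, qps_after)
meets_target = latency_after_ms < latency_target_ms

print(f"Latency reduction: {latency_cut:.1f}%")
print(f"QPS increase:      {qps_gain:.1f}%")
print(f"Under {latency_target_ms} ms target: {meets_target}")
```

The rounded results match the summary's "55%" and "120%", and only the optimized latency falls under the 600 ms requirement.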

The solution demonstrates how TensorRT optimization enables cost-effective, low-latency deployment of large language models in production environments using SageMaker.



