Retain original PDF formatting to view translated documents with Amazon Textract, Amazon Translate, and PDFBox

This article demonstrates how to automatically extract text from scanned PDFs and translate it while preserving the original formatting, using Amazon Textract, Amazon Translate, and Apache PDFBox.

  • Amazon Textract extracts text and geometry data from scanned PDF documents
  • Amazon Translate provides neural machine translation across 2,970+ language pairs
  • Solution uses a geometry-based approach to overlay translated text while maintaining the original layout
  • Amazon Textract supports text extraction in English, Spanish, Italian, Portuguese, French, and German
  • Open-source PDF Translate library available on AWS Samples GitHub
  • Font size automatically calculated to fit translated text within original bounding boxes
  • Processing time: ~10 seconds for an employment application, under 1 minute for text-heavy documents
  • Pay-as-you-go pricing based on pages processed and characters translated
  • Optional human review workflows available via Amazon A2I for validation
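
The geometry-based overlay and the font-size fitting mentioned in the list can be sketched as follows. This is a minimal illustration, not the code from the PDF Translate library: the `Box` class, the function names, and the `avg_char_width` factor are assumptions made for the example. It relies on two facts: Amazon Textract reports bounding boxes as ratios of the page size with the origin at the top-left corner, while PDF coordinates are in points with the origin at the bottom-left corner.

```python
from dataclasses import dataclass


@dataclass
class Box:
    """Textract-style bounding box: ratios of page size, origin at top-left."""
    left: float
    top: float
    width: float
    height: float


def to_pdf_rect(box: Box, page_width: float, page_height: float):
    """Convert a normalized box to PDF points (origin at bottom-left).

    Returns (x, y, width, height), where (x, y) is the lower-left corner.
    """
    x = box.left * page_width
    w = box.width * page_width
    h = box.height * page_height
    # Flip the vertical axis: Textract measures from the top, PDF from the bottom.
    y = page_height - (box.top * page_height) - h
    return x, y, w, h


def fit_font_size(text: str, box_width: float, box_height: float,
                  avg_char_width: float = 0.5, min_size: float = 4.0) -> float:
    """Estimate the largest font size at which `text` fits inside the box.

    avg_char_width is an ASSUMED glyph width as a fraction of the font size;
    a real implementation would query the font's metrics instead.
    """
    if not text:
        return box_height
    # Width-limited size: text width is roughly len(text) * avg_char_width * size.
    width_limited = box_width / (len(text) * avg_char_width)
    # Height-limited size: a single line cannot be taller than the box.
    size = min(width_limited, box_height)
    # Clamp to a readable minimum; very long text may then overflow the box.
    return max(size, min_size)


# Example: a box on a US Letter page (612 x 792 points).
x, y, w, h = to_pdf_rect(Box(0.1, 0.2, 0.5, 0.05), 612, 792)
size = fit_font_size("Nombre completo", box_width=90, box_height=12)
```

Because translated text is often longer than the source text, the width-limited size usually ends up being the binding constraint, which is why the font shrinks rather than the box growing.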

The solution enables scalable, cost-effective multilingual document translation while maintaining document structure and formatting for regulatory compliance and human review.
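
As a rough sketch of the extract-then-translate step, the function below pairs each detected line with its translation and its bounding box. It assumes line-level translation and a single page passed as bytes, neither of which the summary confirms; `translate_lines` and its parameter names are hypothetical. The two clients are passed in and would normally be created with `boto3.client("textract")` and `boto3.client("translate")`.

```python
def translate_lines(textract, translate, page_bytes, source_lang, target_lang):
    """Translate every LINE block on one page, keeping its bounding box.

    `textract` and `translate` are boto3-style clients supplied by the caller.
    Returns a list of {"text": ..., "box": ...} dictionaries.
    """
    response = textract.detect_document_text(Document={"Bytes": page_bytes})
    results = []
    for block in response["Blocks"]:
        # Textract also returns PAGE and WORD blocks; only lines are used here.
        if block["BlockType"] != "LINE":
            continue
        translated = translate.translate_text(
            Text=block["Text"],
            SourceLanguageCode=source_lang,
            TargetLanguageCode=target_lang,
        )
        results.append({
            "text": translated["TranslatedText"],
            # Normalized Left/Top/Width/Height, later mapped to PDF coordinates.
            "box": block["Geometry"]["BoundingBox"],
        })
    return results
```

Passing the clients in keeps the function easy to exercise with stubs, and each returned entry carries everything needed to draw the translated text at the original position.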


