Build a RAG data ingestion pipeline for large-scale ML workloads

Big Data Blog



This article discusses how to build a Retrieval Augmented Generation (RAG) data ingestion pipeline for large-scale machine learning workloads using Amazon OpenSearch Service and Amazon RDS for PostgreSQL with the pgvector extension as vector datastores.

Specifically, the article covers:

  • Overview of OpenSearch Service and Amazon RDS with pgvector for vector similarity search
  • Solution architecture using a Ray cluster for parallelizing data ingestion and creating vector embeddings
  • Dataset used (OSCAR corpus and SQuAD questions) and preprocessing steps
  • Infrastructure setup details for OpenSearch Service and Amazon RDS
  • Step-by-step instructions to deploy the solution using AWS CDK and run the data ingestion pipelines
  • Setting up the Ray dashboard to monitor cluster metrics
  • Conclusion on the extensibility of the solution to integrate other vector datastores
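
The ingestion flow outlined above (split documents into chunks, embed the chunks in parallel, store the vectors, then query by similarity) can be illustrated with a minimal, standard-library-only sketch. This is not the article's code. The article parallelizes with a Ray cluster and stores vectors in OpenSearch Service or Amazon RDS with pgvector. Here a `ThreadPoolExecutor` stands in for Ray, an in-memory list stands in for the vector datastore, and `toy_embed` is a hypothetical hash-based placeholder for a real embedding model. All function names and the `DIM` and `max_words` values are assumptions made for illustration.

```python
import hashlib
import math
from concurrent.futures import ThreadPoolExecutor

DIM = 8  # toy embedding size (assumption); real models use hundreds of dimensions


def chunk_text(text, max_words=5):
    """Split a document into chunks of at most max_words words."""
    words = text.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]


def toy_embed(chunk):
    """Hypothetical stand-in for an embedding model: hashed bag of words, L2-normalized."""
    vec = [0.0] * DIM
    for word in chunk.lower().split():
        digest = hashlib.sha256(word.encode("utf-8")).digest()
        vec[digest[0] % DIM] += 1.0  # bucket each word by the first byte of its hash
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0  # avoid dividing by zero
    return [v / norm for v in vec]


def ingest(documents, workers=4):
    """Chunk every document, then embed the chunks in parallel.

    Returns a list of (chunk, vector) rows, standing in for a vector datastore.
    """
    chunks = [c for doc in documents for c in chunk_text(doc)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        vectors = list(pool.map(toy_embed, chunks))  # map preserves input order
    return list(zip(chunks, vectors))


def search(index, query, k=1):
    """Return the k chunks with the highest cosine similarity to the query.

    Vectors are unit length, so the dot product equals the cosine similarity.
    """
    q = toy_embed(query)
    scored = [(sum(a * b for a, b in zip(q, vec)), chunk) for chunk, vec in index]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [chunk for _, chunk in scored[:k]]
```

Calling `ingest(["some long document text ..."])` yields the rows that a real pipeline would write to the datastore, and `search(index, "a question", k=3)` mimics a top-k similarity query. In a production setup the brute-force scan in `search` would be replaced by the datastore's own approximate nearest-neighbor index.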




Related articles

  • Nov 22, 2024: Building RAG-based applications with AWS Amplify AI Kit and Neon Postgres
  • Jul 17, 2025: Building enterprise-scale RAG applications with Amazon S3 Vectors and DeepSeek R1 on Amazon SageMaker AI
  • Apr 22, 2025: High-level architecture and components for a generative AI-based RAG solution
  • Feb 19, 2025: Well-rounded technical architecture for a RAG implementation on AWS
