Build reliable Agentic AI solution with Amazon Bedrock: Learn from Pushpay’s journey on GenAI evaluation
Machine Learning Blog
This article details how Pushpay built a production-ready agentic AI search feature using Amazon Bedrock, focusing on their custom evaluation framework for continuous quality assurance.
- Pushpay created AI search enabling ministry staff to query community data using natural language
- Initial solution reached only 60-70% accuracy, and manual evaluation was a bottleneck
- Implemented generative AI evaluation framework with golden dataset of 300+ representative queries
- Domain-category based evaluation revealed performance variations masked by aggregate metrics
- Strategic domain-level rollout achieved 95% accuracy on high-priority categories
- Dynamic prompt constructor tailors prompts based on query content and user context
- Reduced time-to-insight from 120 seconds to under 4 seconds for users
- Used Amazon Bedrock prompt caching to reduce latency and token costs
- Key lesson: Build evaluation frameworks early; think beyond aggregate accuracy scores
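The dynamic prompt constructor in the list above can be pictured with a minimal sketch. The article does not publish Pushpay's implementation, so everything here is an assumption made for illustration: the domain names, the keyword triggers, the prompt snippets, and the `UserContext` fields are all hypothetical. The idea shown is only that the prompt is assembled from a small base plus the pieces relevant to the current query and user.

```python
from dataclasses import dataclass

# Hypothetical base instruction; not taken from the article.
BASE_PROMPT = "You translate ministry staff questions into community search filters."

# Hypothetical per-domain guidance, included only when the query needs it.
DOMAIN_SNIPPETS = {
    "giving": "Giving filters: amount, fund, and date range.",
    "attendance": "Attendance filters: event check-ins and frequency.",
    "groups": "Group filters: membership and role.",
}

# Hypothetical keyword triggers; a real system might use a classifier instead.
DOMAIN_KEYWORDS = {
    "giving": ("gave", "donat", "gift", "tithe"),
    "attendance": ("attend", "check-in", "checked in"),
    "groups": ("group", "volunteer", "team"),
}


@dataclass
class UserContext:
    """Hypothetical user context passed along with each query."""
    church_name: str
    timezone: str


def detect_domains(query: str) -> list[str]:
    """Return the domains whose keywords appear in the query."""
    lowered = query.lower()
    return [
        domain
        for domain, keywords in DOMAIN_KEYWORDS.items()
        if any(keyword in lowered for keyword in keywords)
    ]


def build_prompt(query: str, ctx: UserContext) -> str:
    """Assemble a prompt from the base text, matching domain snippets, and context."""
    parts = [BASE_PROMPT]
    parts += [DOMAIN_SNIPPETS[d] for d in detect_domains(query)]
    parts.append(f"Church: {ctx.church_name}. Timezone: {ctx.timezone}.")
    parts.append(f"Question: {query}")
    return "\n".join(parts)


example_ctx = UserContext(church_name="Example Church", timezone="Pacific/Auckland")
example_prompt = build_prompt("Who donated last month?", example_ctx)
```

Keeping the base prompt fixed and appending only the relevant snippets keeps each request short. A stable prefix of this kind is also what a prompt cache would typically reuse, though how Pushpay combined the two is not described here.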
Pushpay's systematic, data-driven approach to AI agent optimization demonstrates that production readiness requires robust evaluation infrastructure, not just sophisticated prompts.
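To make the point about aggregate metrics concrete, here is a small sketch of per-category scoring over a golden dataset. It is not Pushpay's framework: the category names and sample results are invented, and the 0.95 default threshold simply borrows the 95% figure from the summary as an illustrative rollout bar.

```python
from collections import defaultdict


def accuracy_by_category(results):
    """Compute accuracy per category from (category, passed) pairs."""
    totals = defaultdict(int)
    passed = defaultdict(int)
    for category, ok in results:
        totals[category] += 1
        passed[category] += int(ok)
    return {category: passed[category] / totals[category] for category in totals}


def overall_accuracy(results):
    """Compute the single aggregate accuracy across all categories."""
    return sum(int(ok) for _, ok in results) / len(results)


def rollout_ready(per_category, threshold=0.95):
    """List categories at or above the (assumed) rollout threshold."""
    return sorted(c for c, acc in per_category.items() if acc >= threshold)


# Invented golden-dataset outcomes: one strong category, one weak one.
sample_results = (
    [("giving", True)] * 39
    + [("giving", False)] * 1
    + [("attendance", True)] * 8
    + [("attendance", False)] * 12
)

per_category = accuracy_by_category(sample_results)
aggregate = overall_accuracy(sample_results)
ready = rollout_ready(per_category)
```

In this invented sample the aggregate score sits between the two categories and describes neither of them well. Scoring by category is what allows one domain to ship while another is still being improved.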