Lower cost and latency for AI using Amazon ElastiCache as a semantic cache with Amazon Bedrock
Database Blog
This article explains how to implement semantic caching using Amazon ElastiCache for Valkey with Amazon Bedrock to reduce AI application costs and latency.
- Semantic caching reuses LLM responses for identical or semantically similar queries using vector embeddings
- Experiments showed up to 86% LLM cost reduction and 88% latency improvement with semantic caching
- ElastiCache for Valkey provides microsecond-latency vector search with 95%+ recall, which the article presents as the best price-performance
- The solution uses Amazon Titan embeddings, the Amazon Nova Premier LLM, and LangGraph for orchestration
- Cache hits return responses in milliseconds without LLM inference; cache misses invoke the LLM and store the result
- Evaluation on 63,796 real chatbot queries achieved 91% accuracy at a similarity threshold of 0.75
- Best practices include caching stable responses, handling multi-turn conversations, setting TTLs, and personalizing outputs
Semantic caching with ElastiCache enables significant cost and performance improvements for generative AI applications by intelligently reusing cached responses for similar queries.
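The hit/miss flow summarized above can be sketched in a few lines of Python. This is a minimal in-memory illustration, not the article's implementation: it does not call ElastiCache, Valkey, or Amazon Bedrock. The names `SemanticCache`, `toy_embed`, and `answer` are hypothetical, `toy_embed` is a crude stand-in for a real embedding model such as Amazon Titan, and the linear scan stands in for a vector index. Only the 0.75 threshold and the idea of a TTL come from the summary.

```python
import math
import time
import zlib
from dataclasses import dataclass, field
from typing import Callable, List, Optional, Tuple

Vector = List[float]


def cosine_similarity(a: Vector, b: Vector) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def toy_embed(text: str, dims: int = 256) -> Vector:
    """Hypothetical stand-in for an embedding model: hashed bag of words."""
    vec = [0.0] * dims
    for word in text.lower().split():
        vec[zlib.crc32(word.encode("utf-8")) % dims] += 1.0
    return vec


@dataclass
class SemanticCache:
    embed: Callable[[str], Vector]
    threshold: float = 0.75      # similarity threshold cited in the summary
    ttl_seconds: float = 3600.0  # assumed TTL; tune per workload
    # Each entry is (embedding, response, expires_at).
    _entries: List[Tuple[Vector, str, float]] = field(default_factory=list, init=False)

    def lookup(self, query: str, now: Optional[float] = None) -> Optional[str]:
        """Return the most similar unexpired response at or above the threshold."""
        now = time.time() if now is None else now
        self._entries = [e for e in self._entries if e[2] > now]  # drop expired
        query_vec = self.embed(query)
        best, best_score = None, self.threshold
        for vec, response, _ in self._entries:
            score = cosine_similarity(query_vec, vec)
            if score >= best_score:
                best, best_score = response, score
        return best

    def store(self, query: str, response: str, now: Optional[float] = None) -> None:
        """Cache a response under the query's embedding with an expiry time."""
        now = time.time() if now is None else now
        self._entries.append((self.embed(query), response, now + self.ttl_seconds))


def answer(query: str, cache: SemanticCache, llm: Callable[[str], str]) -> Tuple[str, bool]:
    """Return (response, was_cache_hit); call the LLM only on a cache miss."""
    cached = cache.lookup(query)
    if cached is not None:
        return cached, True
    response = llm(query)
    cache.store(query, response)
    return response, False


if __name__ == "__main__":
    cache = SemanticCache(embed=toy_embed)
    fake_llm = lambda q: f"(LLM answer to: {q})"  # placeholder for a model call
    print(answer("how do I reset my password", cache, fake_llm))
    print(answer("how do I reset my password please", cache, fake_llm))
```

In a deployment along the lines the article describes, the list scan would presumably be replaced by a vector similarity query against the cache service, and the threshold would be tuned against the accuracy and hit-rate trade-off that the evaluation reports.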
Related articles
- Nov 18, 2025: New Amazon Bedrock service tiers help you match AI workload performance with cost
- Aug 16, 2024: Improve speed and reduce cost for generative AI workloads with a persistent semantic cache in Amazon MemoryDB
- Nov 26, 2024: Build a read-through semantic cache with Amazon OpenSearch Serverless and Amazon Bedrock
- Apr 7, 2026: Manage AI costs with Amazon Bedrock Projects