Optimize LLM response costs and latency with effective caching

Database Blog

This article explains how to optimize LLM costs and latency through effective caching strategies, potentially cutting costs by up to 90% and bringing response times down to milliseconds.

Caching stores and reuses previous embeddings, tokens, outputs, or prompts to reduce inference costs and latency
Prompt caching can reduce latency by up to 85% and input token costs by up to 90% for repeated prompt prefixes
Request-response caching stores identical request-response pairs for quick retrieval without reprocessing
In-memory caches like Amazon MemoryDB provide persistent semantic caching with vector search capabilities
External database caches (Redis, OpenSearch, DynamoDB) support distributed applications with high concurrent writes
TTL-based invalidation automatically removes stale cache entries after specified periods
Proactive invalidation allows manual deletion of specific cache entries when data updates occur
Implement guardrails to prevent caching PII or protected data; maintain context-specific cache segregation
Implement caching only if it would apply to at least 60% of system calls; below that, the benefit may not justify the added complexity
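
To make several of the points above concrete, here is a minimal sketch of an exact-match request-response cache with TTL-based expiry, proactive invalidation, context-segregated keys, and hit-rate tracking. It is not taken from the article: the names ResponseCache, cached_completion, and call_llm are hypothetical, and a plain in-process dictionary stands in for an external store such as Redis or DynamoDB.

```python
import hashlib
import json
import time
from typing import Callable, Dict, Optional, Tuple


class ResponseCache:
    """Exact-match request-response cache (illustrative, in-process only)."""

    def __init__(self, ttl_seconds: float,
                 clock: Callable[[], float] = time.monotonic):
        self.ttl_seconds = ttl_seconds
        self.clock = clock  # injectable so expiry can be tested without sleeping
        self._store: Dict[str, Tuple[float, str]] = {}
        self.hits = 0
        self.misses = 0

    @staticmethod
    def make_key(model: str, prompt: str, context: str = "default") -> str:
        # Including a context label (assumed here to be a tenant or user scope)
        # keeps entries from different contexts segregated.
        payload = json.dumps(
            {"context": context, "model": model, "prompt": prompt},
            sort_keys=True,
        )
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()

    def get(self, key: str) -> Optional[str]:
        entry = self._store.get(key)
        if entry is not None:
            expires_at, response = entry
            if self.clock() < expires_at:
                self.hits += 1
                return response
            del self._store[key]  # TTL-based invalidation of a stale entry
        self.misses += 1
        return None

    def put(self, key: str, response: str) -> None:
        self._store[key] = (self.clock() + self.ttl_seconds, response)

    def invalidate(self, key: str) -> bool:
        # Proactive invalidation: delete one entry when its source data changes.
        return self._store.pop(key, None) is not None

    def hit_rate(self) -> float:
        total = self.hits + self.misses
        return self.hits / total if total else 0.0


def cached_completion(cache: ResponseCache, model: str, prompt: str,
                      call_llm: Callable[[str], str],
                      context: str = "default") -> str:
    """Return a cached response if present; otherwise call the model and store it."""
    key = cache.make_key(model, prompt, context)
    cached = cache.get(key)
    if cached is not None:
        return cached
    response = call_llm(prompt)  # call_llm is a placeholder for a real model client
    cache.put(key, response)
    return response
```

Tracking hit_rate() on real traffic is one way to check the observed share of cacheable calls against the 60% guidance above before committing to a production cache. A guardrail that screens prompts and responses for PII would sit in front of put(); it is omitted here.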

Effective caching transforms LLM deployments by dramatically reducing costs, improving response times, enabling greater scale, and ensuring consistency for production applications.

