Improve operational visibility for inference workloads on Amazon Bedrock with new CloudWatch metrics for TTFT and Estimated Quota Consumption
Machine Learning Blog
This article announces two new CloudWatch metrics for Amazon Bedrock: TimeToFirstToken and EstimatedTPMQuotaUsage, providing server-side visibility into streaming latency and quota consumption for inference workloads.
- TimeToFirstToken measures latency from request receipt to first response token generation for streaming APIs
- EstimatedTPMQuotaUsage tracks tokens-per-minute quota consumed, accounting for burndown multipliers and cache tokens
- Both metrics are emitted automatically at no cost, with no API changes or opt-in required
- Both are available in the AWS/Bedrock CloudWatch namespace and can be filtered by the ModelId dimension
- Supports cross-Region inference profiles for geographic and global configurations
- Enable proactive alarms, SLA baselines, and capacity planning without client-side instrumentation
- Quota formula varies by throughput type: on-demand applies output token burndown; provisioned throughput applies cache weighting
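To make the on-demand burndown idea concrete, here is a minimal arithmetic sketch. It is not the formula from the article: the function name, the `output_burndown` parameter, and the multiplier value are hypothetical, and cache-token handling is left out because this summary does not state the rule.

```python
def estimate_on_demand_tpm_usage(input_tokens: int,
                                 output_tokens: int,
                                 output_burndown: float = 1.0) -> float:
    """Rough illustration of output-token burndown for on-demand throughput.

    Assumption: quota usage is input tokens plus output tokens scaled by a
    model-specific burndown multiplier. The real multiplier and the treatment
    of cache tokens are defined by Amazon Bedrock, not by this sketch.
    """
    if input_tokens < 0 or output_tokens < 0:
        raise ValueError("token counts must be non-negative")
    return input_tokens + output_tokens * output_burndown


# Hypothetical request: 1,000 input tokens, 200 output tokens, multiplier of 5.
usage = estimate_on_demand_tpm_usage(1000, 200, output_burndown=5.0)
print(usage)
```

Under these assumptions, a request with a small output can still consume a noticeably larger share of quota than its raw token count suggests, which is why a server-side estimate is more useful than counting tokens on the client.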
These metrics eliminate the need for custom instrumentation and help teams prevent throttling and performance degradation in production AI workloads.
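As an illustration of the proactive alarms mentioned above, the following sketch builds the parameters for a CloudWatch alarm on TimeToFirstToken. The namespace, metric name, and ModelId dimension come from the summary; the helper name, the alarm name, the p95 statistic, the evaluation window, the model ID, and the assumption that the threshold is in milliseconds are all illustrative choices, not values taken from the article.

```python
def ttft_alarm_params(model_id: str,
                      threshold: float,
                      period_seconds: int = 60,
                      evaluation_periods: int = 5) -> dict:
    """Build keyword arguments for CloudWatch's PutMetricAlarm operation.

    Assumption: the threshold is expressed in the metric's own unit
    (believed to be milliseconds; verify in the CloudWatch console).
    """
    return {
        "AlarmName": f"bedrock-ttft-p95-{model_id}",   # hypothetical naming scheme
        "Namespace": "AWS/Bedrock",
        "MetricName": "TimeToFirstToken",
        "Dimensions": [{"Name": "ModelId", "Value": model_id}],
        "ExtendedStatistic": "p95",                    # percentile chosen for illustration
        "Period": period_seconds,
        "EvaluationPeriods": evaluation_periods,
        "Threshold": float(threshold),
        "ComparisonOperator": "GreaterThanThreshold",
        "TreatMissingData": "notBreaching",            # idle periods do not trigger the alarm
    }


# "example.model-v1:0" is a placeholder, not a real model ID.
params = ttft_alarm_params("example.model-v1:0", threshold=1500)
print(params["MetricName"], params["Threshold"])
```

With boto3 installed and credentials configured, the dictionary could be passed as `boto3.client("cloudwatch").put_metric_alarm(**params)`. A similar alarm on EstimatedTPMQuotaUsage, with a threshold set below the account's quota, would give early warning before throttling.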