Evaluate AI agents systematically with Agent-EvalKit

Machine Learning Blog

This article introduces Agent-EvalKit, an open-source toolkit for systematically evaluating AI agents beyond surface-level output testing.

Agent-EvalKit integrates with AI coding assistants to evaluate agents across six phases: Plan, Data, Trace, Run agent, Eval, and Report
Addresses hidden failures like hallucination and incorrect tool usage that output-level testing cannot detect
Combines code-based and LLM-as-judge evaluators to assess faithfulness, tool accuracy, and response quality
Generates targeted test cases, captures execution traces, and produces code-level improvement recommendations
A travel agent case study revealed a 32.3% faithfulness score despite 83.9% response quality, because the agent hallucinated when tools returned empty results
Supports Strands Agents SDK, LangGraph, and CrewAI frameworks with OpenTelemetry tracing
Integrates into CI/CD pipelines for continuous evaluation after code changes
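
To make the gap between response quality and faithfulness concrete, here is a minimal sketch of a code-based faithfulness check over captured traces. It is not Agent-EvalKit's API: the `ToolCall` and `Trace` shapes, the `is_faithful` and `faithfulness_score` helpers, and the flight identifiers are all hypothetical, chosen only to illustrate how an answer can read well while naming something no tool ever returned.

```python
from dataclasses import dataclass


@dataclass
class ToolCall:
    """One tool invocation recorded in a trace (hypothetical shape)."""
    name: str
    result: list  # items the tool returned; an empty list means "nothing found"


@dataclass
class Trace:
    """A single agent run: the tool calls it made and its final answer."""
    tool_calls: list
    answer: str


def is_faithful(trace: Trace, known_entities: set) -> bool:
    """Return False if the answer names an entity that no tool call returned."""
    returned = {
        str(item).lower()
        for call in trace.tool_calls
        for item in call.result
    }
    answer = trace.answer.lower()
    mentioned = {e.lower() for e in known_entities if e.lower() in answer}
    # Every entity the answer mentions must be grounded in some tool result.
    return mentioned <= returned


def faithfulness_score(traces: list, known_entities: set) -> float:
    """Fraction of traces whose answer is grounded in tool output."""
    if not traces:
        return 0.0
    return sum(is_faithful(t, known_entities) for t in traces) / len(traces)


# Hypothetical flight identifiers the checker knows how to look for.
KNOWN = {"AB123", "XY789"}

traces = [
    # The tool returned nothing, yet the answer invents a flight.
    Trace([ToolCall("search_flights", [])], "I found flight AB123 to Paris."),
    # The answer cites a flight that the tool actually returned.
    Trace([ToolCall("search_flights", ["XY789"])], "Flight XY789 is available."),
    # The tool returned nothing and the answer says so.
    Trace([ToolCall("search_flights", [])], "No flights were found."),
]

score = faithfulness_score(traces, KNOWN)
print(f"faithfulness: {score:.1%}")
```

In a CI/CD pipeline, a score like this could be compared against a team-chosen threshold to fail a build after a regression. An LLM-as-judge evaluator would complement it for qualities that string matching cannot capture, such as tone or completeness.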

Agent-EvalKit brings systematic evaluation into the development workflow, transforming vague quality concerns into specific, actionable code fixes with measurable impact.

Jul 23, 2026

Evaluating AI Agents: A production blueprint with Strands and AgentCore

Mar 18, 2026

Evaluating AI agents for production: A practical guide to Strands Evals

Mar 31, 2026

Build reliable AI agents with Amazon Bedrock AgentCore Evaluations

Apr 2, 2026

Simulate realistic users to evaluate multi-turn AI agents in Strands Evals

Related articles