Eval suite

@fuze-ai/agent-eval runs cases against an agent definition, capturing pass/fail plus the full evidence stream. It is the regression-test surface for compliance behavior.

What you'll build: a runnable eval harness with cases that prove the loop fail-stops on the right signals. Prerequisites: Verifying evidence, eval inspects the same span stream. Next: EU Sovereign tier for production deployment.

Install

bash
npm install -D @fuze-ai/agent-eval

Define cases

Create eval/cases.ts:

ts
import { defineCase } from '@fuze-ai/agent-eval'

export const cases = [
  defineCase({
    name: 'greets-named-user',
    userMessage: 'please greet alice',
    expect: {
      status: 'ok',
      outputMatches: { final: /hello, alice/ },
      hashChainValid: true,
    },
  }),

  defineCase({
    name: 'rejects-special-category-without-art9-basis',
    userMessage: 'lookup health record for patient 42',
    expect: {
      status: 'halted',
      spans: { contains: { 'fuze.policy.decision': 'deny' } },
    },
  }),

  defineCase({
    name: 'fail-stop-on-policy-engine-error',
    inject: { cerbosThrows: true },
    expect: {
      status: 'halted',
      spans: { contains: { 'fuze.policy.engine_error': true } },
    },
  }),

  defineCase({
    name: 'requires-oversight-for-art22-decision',
    overrides: { agent: { producesArt22Decision: true } },
    expect: {
      status: 'halted',
      spans: { contains: { 'fuze.run.missing_oversight': true } },
    },
  }),
]

Run the suite

ts
import { runEval } from '@fuze-ai/agent-eval'
import { agent, policy } from '../src/index.js'
import { cases } from './cases.js'

const report = await runEval({
  agent,
  policy,
  cases,
  outDir: './eval-out',
})

console.log({
  total: report.total,
  passed: report.passed,
  failed: report.failed,
})

if (report.failed > 0) process.exit(1)

The outDir receives one subdirectory per case with the full evidence stream and the result. Failures include the span that violated the expectation and the surrounding context.

CI integration

bash
npx tsx eval/run.ts

Wire this into CI as a separate job from unit tests. A failing eval is a regression on agent behavior, not on code shape.

What to put in the eval suite

Promote the bypass tests from @fuze-ai/agent's test suite:

  • lawful-basis-mismatch, tools and basis disagree
  • policy-engine-error, Cerbos throws
  • replay-attack, resume token reuse
  • tampered-evidence, byte flip in records
  • in-process-multi-tenant, tenant isolation
  • dynamic-tool-no-metadata, unmetadated tool
  • secret-in-args, SecretRef plaintext leak

Each maps to a regulatory obligation. A failure is the signal for an Art. 33/34 incident review.