Criterion-Based Evaluation

Layer 3: Evaluation

Purpose

Provides systematic framework for evaluating outputs against predefined success criteria
Ensures objective, traceable, evidence-based assessment
Enables consistent scoring across multiple evaluators
Creates audit trail mapping outputs to requirements

When to Use

Evaluation personas assessing domain or orchestration outputs
Quality assurance reviews requiring measurable judgments
Acceptance testing against stated objectives
Any assessment that requires traceability to original requirements
When evaluators need to provide scores, not just subjective opinions

Structure / Template

[LOAD CRITERIA]
- Import success criteria from orchestration plan or requirements
- List each criterion with its scoring rubric (if provided)
- Note any ambiguous criteria that need clarification

[SYSTEMATIC EVALUATION]
For each criterion:
  [Criterion #X]: [Description]

  [Evidence Review]
  - What evidence in the output addresses this criterion?
  - Direct quotes, references, or specific elements

  [Assessment]
  - Does it meet, partially meet, or not meet the criterion?
  - Scoring (if rubric provided): X/Y points
  - Justification with specific reasoning

  [Gaps Identified]
  - What's missing or incomplete?
  - What would be needed to fully satisfy this criterion?

[OVERALL EVALUATION]
- Summary score or rating (Pass/Fail, X/10, or per rubric)
- Criteria met: X of Y total
- Critical failures vs. minor gaps
- Overall recommendation (Approve / Revise / Reject)

[TRACEABILITY MATRIX]
- Map outputs back to original objectives
- Identify orphaned outputs (not tied to any criterion)
- Identify unaddressed criteria

[ACTIONABLE FEEDBACK]
- Prioritized list of gaps to address
- Specific, concrete improvement recommendations
- Do NOT prescribe how to fix (domain persona's job)
- Focus on WHAT is missing, not HOW to add it

Example

Technical (code review against requirements)

[LOAD CRITERIA]
[C1] API must support authentication via OAuth2
[C2] Response time <200ms for 95th percentile
[C3] All endpoints documented with OpenAPI spec
[C4] Unit test coverage ≥80%

[SYSTEMATIC EVALUATION]

[C1]: OAuth2 authentication
  Evidence: auth.py implements OAuth2 flow, token validation present
  Assessment: ✅ MEETS (10/10)
  Gaps: None

[C2]: Response time <200ms
  Evidence: Load test results show 95th percentile = 180ms
  Assessment: ✅ MEETS (10/10)
  Gaps: None

[C3]: OpenAPI documentation
  Evidence: openapi.yaml exists but missing 3 of 12 endpoints
  Assessment: ⚠️ PARTIAL (6/10)
  Gaps: POST /users, DELETE /sessions, GET /admin/logs undocumented

[C4]: Unit test coverage
  Evidence: Coverage report shows 72%
  Assessment: ❌ NOT MET (7/10)
  Gaps: 8% below requirement, missing tests for error handling paths

[OVERALL EVALUATION]
Score: 33/40 (82.5%)
Criteria met: 2 of 4 fully, 2 partially
Recommendation: REVISE - Address C3 and C4 before approval

[ACTIONABLE FEEDBACK]
Priority 1: Increase unit test coverage to ≥80% (currently 72%)
Priority 2: Document 3 missing API endpoints in OpenAPI spec

Combination Guidance

Pair With	When
Rule-Based Reasoning	Criteria are stated as explicit rules
Recursive Self-Eval	Evaluators check their own assessments
Chain of Thought	Multi-stage evaluation reasoning
Meta Rules	Set evaluation tone (strict vs. generous)

Criterion-Based Evaluation

Purpose

When to Use

Structure / Template

Example

Technical (code review against requirements)

Combination Guidance

Failure Modes to Avoid