Skip to content

Criterion-Based Evaluation

Layer 3: Evaluation
  • Provides systematic framework for evaluating outputs against predefined success criteria
  • Ensures objective, traceable, evidence-based assessment
  • Enables consistent scoring across multiple evaluators
  • Creates audit trail mapping outputs to requirements
  • Evaluation personas assessing domain or orchestration outputs
  • Quality assurance reviews requiring measurable judgments
  • Acceptance testing against stated objectives
  • Any assessment that requires traceability to original requirements
  • When evaluators need to provide scores, not just subjective opinions
[LOAD CRITERIA]
- Import success criteria from orchestration plan or requirements
- List each criterion with its scoring rubric (if provided)
- Note any ambiguous criteria that need clarification
[SYSTEMATIC EVALUATION]
For each criterion:
[Criterion #X]: [Description]
[Evidence Review]
- What evidence in the output addresses this criterion?
- Direct quotes, references, or specific elements
[Assessment]
- Does it meet, partially meet, or not meet the criterion?
- Scoring (if rubric provided): X/Y points
- Justification with specific reasoning
[Gaps Identified]
- What's missing or incomplete?
- What would be needed to fully satisfy this criterion?
[OVERALL EVALUATION]
- Summary score or rating (Pass/Fail, X/10, or per rubric)
- Criteria met: X of Y total
- Critical failures vs. minor gaps
- Overall recommendation (Approve / Revise / Reject)
[TRACEABILITY MATRIX]
- Map outputs back to original objectives
- Identify orphaned outputs (not tied to any criterion)
- Identify unaddressed criteria
[ACTIONABLE FEEDBACK]
- Prioritized list of gaps to address
- Specific, concrete improvement recommendations
- Do NOT prescribe how to fix (domain persona's job)
- Focus on WHAT is missing, not HOW to add it

Technical (code review against requirements)

Section titled “Technical (code review against requirements)”
[LOAD CRITERIA]
[C1] API must support authentication via OAuth2
[C2] Response time <200ms for 95th percentile
[C3] All endpoints documented with OpenAPI spec
[C4] Unit test coverage ≥80%
[SYSTEMATIC EVALUATION]
[C1]: OAuth2 authentication
Evidence: auth.py implements OAuth2 flow, token validation present
Assessment: ✅ MEETS (10/10)
Gaps: None
[C2]: Response time <200ms
Evidence: Load test results show 95th percentile = 180ms
Assessment: ✅ MEETS (10/10)
Gaps: None
[C3]: OpenAPI documentation
Evidence: openapi.yaml exists but missing 3 of 12 endpoints
Assessment: ⚠️ PARTIAL (6/10)
Gaps: POST /users, DELETE /sessions, GET /admin/logs undocumented
[C4]: Unit test coverage
Evidence: Coverage report shows 72%
Assessment: ❌ NOT MET (7/10)
Gaps: 8% below requirement, missing tests for error handling paths
[OVERALL EVALUATION]
Score: 33/40 (82.5%)
Criteria met: 2 of 4 fully, 2 partially
Recommendation: REVISE - Address C3 and C4 before approval
[ACTIONABLE FEEDBACK]
Priority 1: Increase unit test coverage to ≥80% (currently 72%)
Priority 2: Document 3 missing API endpoints in OpenAPI spec
Pair WithWhen
Rule-Based ReasoningCriteria are stated as explicit rules
Recursive Self-EvalEvaluators check their own assessments
Chain of ThoughtMulti-stage evaluation reasoning
Meta RulesSet evaluation tone (strict vs. generous)