## 1. Why Evaluation Matters
You can't improve what you don't measure.
**Without evaluation:**

- "This prompt seems to work"
- No way to compare versions
- Bugs slip through
**With evaluation:**

- "This prompt scores 87% accuracy"
- A/B test improvements
- Catch regressions
**Types of evaluation:**

1. **Human evaluation**: Gold standard, expensive
2. **Automated metrics**: Fast, scalable
3. **LLM-as-judge**: Middle ground
## 2. Defining Success Criteria
Before evaluating, define what "good" means.
**For a summarization prompt:**

```yaml
criteria:
  - name: accuracy
    description: Summary contains only facts from source
    weight: 30%
  - name: completeness
    description: Covers all key points
    weight: 25%
  - name: conciseness
    description: No unnecessary words
    weight: 20%
  - name: readability
    description: Clear, well-structured
    weight: 15%
  - name: format
    description: Follows requested format
    weight: 10%
```
**Good criteria are:**

- Specific and measurable
- Independent (don't overlap)
- Weighted by importance
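Once weights are defined, per-criterion results can be rolled up into a single number per output. Below is a minimal TypeScript sketch of that roll-up; the `Criterion` type, the `weightedScore` helper, and the 0-1 score range are illustrative assumptions, not part of any specific framework.

```typescript
// Minimal sketch: rolling weighted criterion scores into one number.
// The Criterion type and the 0-1 score range are assumptions for illustration.
interface Criterion {
  name: string;
  weight: number; // fraction of the total, e.g. 0.3 for 30%
}

const criteria: Criterion[] = [
  { name: "accuracy", weight: 0.3 },
  { name: "completeness", weight: 0.25 },
  { name: "conciseness", weight: 0.2 },
  { name: "readability", weight: 0.15 },
  { name: "format", weight: 0.1 },
];

// scores: per-criterion results normalized to the 0-1 range
function weightedScore(scores: Record<string, number>): number {
  return criteria.reduce(
    (total, c) => total + c.weight * (scores[c.name] ?? 0),
    0
  );
}

// Example:
// weightedScore({ accuracy: 1, completeness: 0.8, conciseness: 1, readability: 0.6, format: 1 })
// => roughly 0.89, a single number you can compare across prompt versions.
```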
## 3. Building a Scoring Rubric
A rubric converts criteria into scores.
**5-point rubric example:**

```
Criterion: Accuracy

5 - Perfect: All statements are factually correct
4 - Good: Minor inaccuracies that don't change meaning
3 - Acceptable: Some inaccuracies but core message correct
2 - Poor: Significant errors or misrepresentations
1 - Failing: Mostly incorrect or fabricated information
```
**Binary rubric (simpler):**

```
Criterion: Format

1 - Follows JSON schema exactly
0 - Any deviation from schema
```
- **Use 5-point for:** subjective quality
- **Use binary for:** objective requirements
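If you encode rubrics in code, the two styles map onto different shapes. The TypeScript sketch below is one possible encoding; the type names are illustrative, and the binary format check is simplified to a JSON-validity test rather than full schema validation.

```typescript
// Sketch: one way to encode both rubric styles (names are illustrative).
type FivePointScore = 1 | 2 | 3 | 4 | 5;

interface RubricLevel {
  score: FivePointScore;
  label: string;
  description: string;
}

const accuracyRubric: RubricLevel[] = [
  { score: 5, label: "Perfect", description: "All statements are factually correct" },
  { score: 4, label: "Good", description: "Minor inaccuracies that don't change meaning" },
  { score: 3, label: "Acceptable", description: "Some inaccuracies but core message correct" },
  { score: 2, label: "Poor", description: "Significant errors or misrepresentations" },
  { score: 1, label: "Failing", description: "Mostly incorrect or fabricated information" },
];

// Binary criteria reduce to a pass/fail predicate. Here "follows JSON schema"
// is simplified to "parses as JSON" purely for illustration.
function formatPasses(output: string): boolean {
  try {
    JSON.parse(output);
    return true;
  } catch {
    return false;
  }
}
```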
## 4. LLM-as-Judge Pattern
Use an LLM to evaluate another LLM's output.
**Evaluation prompt:**

```
You are an expert evaluator. Score this AI response on a 1-5 scale.

TASK: {{original_task}}

AI RESPONSE: {{ai_response}}

RUBRIC:
- 5: Perfectly addresses the task
- 4: Good response with minor issues
- 3: Acceptable but missing elements
- 2: Partially addresses task
- 1: Fails to address task

Provide:
1. Score (1-5)
2. Brief justification (1-2 sentences)

Format: {"score": N, "reason": "..."}
```
**Tips:**

- Use a different model than the one being evaluated
- Include the original task for context
- Keep the rubric simple and clear
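Wiring this prompt into code is mostly string templating plus JSON parsing. The sketch below assumes a placeholder `callModel` function standing in for whatever client you use to reach the judge model; the template is abbreviated here (use the full prompt shown above).

```typescript
// Sketch of an LLM-as-judge call. `callModel` is a placeholder for whatever
// client reaches your judge model (ideally not the model being evaluated).
declare function callModel(prompt: string): Promise<string>;

interface JudgeResult {
  score: number; // 1-5
  reason: string;
}

// Abbreviated template; in practice, include the full rubric shown above.
const JUDGE_TEMPLATE = `You are an expert evaluator. Score this AI response on a 1-5 scale.

TASK: {{original_task}}

AI RESPONSE: {{ai_response}}

Format: {"score": N, "reason": "..."}`;

async function judge(task: string, aiResponse: string): Promise<JudgeResult> {
  const prompt = JUDGE_TEMPLATE
    .replace("{{original_task}}", task)
    .replace("{{ai_response}}", aiResponse);
  const raw = await callModel(prompt);
  const result = JSON.parse(raw) as JudgeResult; // assumes the judge returned clean JSON
  if (result.score < 1 || result.score > 5) {
    throw new Error(`Judge returned out-of-range score: ${result.score}`);
  }
  return result;
}
```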
## 5. Automated Test Suites
Build a test suite for your prompts.
**Structure:**

```typescript
interface PromptTest {
  name: string;
  input: Record<string, string>;
  expectedOutput?: string;
  validators: Validator[];
}

const tests: PromptTest[] = [
  {
    name: "Basic extraction",
    input: { text: "John is 25 years old" },
    validators: [
      { type: "json_valid" },
      { type: "contains_field", field: "name" },
      { type: "field_equals", field: "age", value: 25 }
    ]
  },
  {
    name: "Missing data handling",
    input: { text: "The product costs $50" },
    validators: [
      { type: "json_valid" },
      { type: "field_is_null", field: "name" }
    ]
  }
];
```
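A simple runner applies each test's validators to the prompt's output. The sketch below assumes a `Validator` union matching the validator types used above and a placeholder `runPrompt` function for executing the prompt under test; adapt both to your own setup.

```typescript
// Sketch of a test runner. The Validator union mirrors the validator types
// used above; `runPrompt` is a placeholder for executing the prompt under test.
type Validator =
  | { type: "json_valid" }
  | { type: "contains_field"; field: string }
  | { type: "field_equals"; field: string; value: unknown }
  | { type: "field_is_null"; field: string };

declare function runPrompt(input: Record<string, string>): Promise<string>;

function check(validator: Validator, output: string): boolean {
  let parsed: Record<string, unknown>;
  try {
    parsed = JSON.parse(output);
  } catch {
    return false; // every validator here requires parseable JSON
  }
  switch (validator.type) {
    case "json_valid":
      return true;
    case "contains_field":
      return validator.field in parsed;
    case "field_equals":
      return parsed[validator.field] === validator.value;
    case "field_is_null":
      return parsed[validator.field] === null;
  }
}

async function runTests(suite: PromptTest[]): Promise<void> {
  for (const test of suite) {
    const output = await runPrompt(test.input);
    const failures = test.validators.filter((v) => !check(v, output));
    console.log(`${failures.length === 0 ? "PASS" : "FAIL"}: ${test.name}`);
  }
}
```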
Run tests on every prompt change to catch regressions.