## 1. Why Evaluation Matters
You can't improve what you don't measure.
**Without evaluation:**

- "This prompt seems to work"
- No way to compare versions
- Bugs slip through
**With evaluation:**

- "This prompt scores 87% accuracy"
- A/B test improvements
- Catch regressions
**Types of evaluation:**

1. **Human evaluation**: Gold standard, expensive
2. **Automated metrics**: Fast, scalable
3. **LLM-as-judge**: Middle ground
## 2. Defining Success Criteria
Before evaluating, define what "good" means.
**For a summarization prompt:**

```yaml
criteria:
  - name: accuracy
    description: Summary contains only facts from source
    weight: 30%
  - name: completeness
    description: Covers all key points
    weight: 25%
  - name: conciseness
    description: No unnecessary words
    weight: 20%
  - name: readability
    description: Clear, well-structured
    weight: 15%
  - name: format
    description: Follows requested format
    weight: 10%
```
**Good criteria are:**

- Specific and measurable
- Independent (don't overlap)
- Weighted by importance
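Once weights are defined, per-criterion results can be rolled up into a single number per output. Below is a minimal TypeScript sketch of that roll-up; the `Criterion` type, the `weightedScore` helper, and the 0-1 score range are illustrative assumptions, not part of any specific framework.

```typescript
// Minimal sketch: rolling weighted criterion scores into one number.
// The Criterion type and the 0-1 score range are assumptions for illustration.
interface Criterion {
  name: string;
  weight: number; // fraction of the total, e.g. 0.3 for 30%
}

const criteria: Criterion[] = [
  { name: "accuracy", weight: 0.3 },
  { name: "completeness", weight: 0.25 },
  { name: "conciseness", weight: 0.2 },
  { name: "readability", weight: 0.15 },
  { name: "format", weight: 0.1 },
];

// scores: per-criterion results normalized to the 0-1 range
function weightedScore(scores: Record<string, number>): number {
  return criteria.reduce(
    (total, c) => total + c.weight * (scores[c.name] ?? 0),
    0
  );
}

// Example:
// weightedScore({ accuracy: 1, completeness: 0.8, conciseness: 1, readability: 0.6, format: 1 })
// => roughly 0.89, a single number you can compare across prompt versions.
```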
## 3. Building a Scoring Rubric
A rubric converts criteria into scores.
**5-point rubric example:**

```
Criterion: Accuracy

5 - Perfect: All statements are factually correct
4 - Good: Minor inaccuracies that don't change meaning
3 - Acceptable: Some inaccuracies but core message correct
2 - Poor: Significant errors or misrepresentations
1 - Failing: Mostly incorrect or fabricated information
```
**Binary rubric (simpler):**

```
Criterion: Format

1 - Follows JSON schema exactly
0 - Any deviation from schema
```
- **Use 5-point for:** subjective quality
- **Use binary for:** objective requirements
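If you encode rubrics in code, the two styles map onto different shapes. The TypeScript sketch below is one possible encoding; the type names are illustrative, and the binary format check is simplified to a JSON-validity test rather than full schema validation.

```typescript
// Sketch: one way to encode both rubric styles (names are illustrative).
type FivePointScore = 1 | 2 | 3 | 4 | 5;

interface RubricLevel {
  score: FivePointScore;
  label: string;
  description: string;
}

const accuracyRubric: RubricLevel[] = [
  { score: 5, label: "Perfect", description: "All statements are factually correct" },
  { score: 4, label: "Good", description: "Minor inaccuracies that don't change meaning" },
  { score: 3, label: "Acceptable", description: "Some inaccuracies but core message correct" },
  { score: 2, label: "Poor", description: "Significant errors or misrepresentations" },
  { score: 1, label: "Failing", description: "Mostly incorrect or fabricated information" },
];

// Binary criteria reduce to a pass/fail predicate. Here "follows JSON schema"
// is simplified to "parses as JSON" purely for illustration.
function formatPasses(output: string): boolean {
  try {
    JSON.parse(output);
    return true;
  } catch {
    return false;
  }
}
```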
## 4. LLM-as-Judge Pattern
Use an LLM to evaluate another LLM's output.
**Evaluation prompt:**

```
You are an expert evaluator. Score this AI response on a 1-5 scale.

TASK: {{original_task}}

AI RESPONSE: {{ai_response}}

RUBRIC:
- 5: Perfectly addresses the task
- 4: Good response with minor issues
- 3: Acceptable but missing elements
- 2: Partially addresses task
- 1: Fails to address task

Provide:
1. Score (1-5)
2. Brief justification (1-2 sentences)

Format: {"score": N, "reason": "..."}
```
**Tips:**

- Use a different model than the one being evaluated
- Include the original task for context
- Keep the rubric simple and clear
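Wiring this prompt into code is mostly string templating plus JSON parsing. The sketch below assumes a placeholder `callModel` function standing in for whatever client you use to reach the judge model; the template is abbreviated here (use the full prompt shown above).

```typescript
// Sketch of an LLM-as-judge call. `callModel` is a placeholder for whatever
// client reaches your judge model (ideally not the model being evaluated).
declare function callModel(prompt: string): Promise<string>;

interface JudgeResult {
  score: number; // 1-5
  reason: string;
}

// Abbreviated template; in practice, include the full rubric shown above.
const JUDGE_TEMPLATE = `You are an expert evaluator. Score this AI response on a 1-5 scale.

TASK: {{original_task}}

AI RESPONSE: {{ai_response}}

Format: {"score": N, "reason": "..."}`;

async function judge(task: string, aiResponse: string): Promise<JudgeResult> {
  const prompt = JUDGE_TEMPLATE
    .replace("{{original_task}}", task)
    .replace("{{ai_response}}", aiResponse);
  const raw = await callModel(prompt);
  const result = JSON.parse(raw) as JudgeResult; // assumes the judge returned clean JSON
  if (result.score < 1 || result.score > 5) {
    throw new Error(`Judge returned out-of-range score: ${result.score}`);
  }
  return result;
}
```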
## 5. Automated Test Suites
Build a test suite for your prompts.
**Structure:**

```typescript
interface PromptTest {
  name: string;
  input: Record<string, string>;
  expectedOutput?: string;
  validators: Validator[];
}

const tests: PromptTest[] = [
  {
    name: "Basic extraction",
    input: { text: "John is 25 years old" },
    validators: [
      { type: "json_valid" },
      { type: "contains_field", field: "name" },
      { type: "field_equals", field: "age", value: 25 }
    ]
  },
  {
    name: "Missing data handling",
    input: { text: "The product costs $50" },
    validators: [
      { type: "json_valid" },
      { type: "field_is_null", field: "name" }
    ]
  }
];
```
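A simple runner applies each test's validators to the prompt's output. The sketch below assumes a `Validator` union matching the validator types used above and a placeholder `runPrompt` function for executing the prompt under test; adapt both to your own setup.

```typescript
// Sketch of a test runner. The Validator union mirrors the validator types
// used above; `runPrompt` is a placeholder for executing the prompt under test.
type Validator =
  | { type: "json_valid" }
  | { type: "contains_field"; field: string }
  | { type: "field_equals"; field: string; value: unknown }
  | { type: "field_is_null"; field: string };

declare function runPrompt(input: Record<string, string>): Promise<string>;

function check(validator: Validator, output: string): boolean {
  let parsed: Record<string, unknown>;
  try {
    parsed = JSON.parse(output);
  } catch {
    return false; // every validator here requires parseable JSON
  }
  switch (validator.type) {
    case "json_valid":
      return true;
    case "contains_field":
      return validator.field in parsed;
    case "field_equals":
      return parsed[validator.field] === validator.value;
    case "field_is_null":
      return parsed[validator.field] === null;
  }
}

async function runTests(suite: PromptTest[]): Promise<void> {
  for (const test of suite) {
    const output = await runPrompt(test.input);
    const failures = test.validators.filter((v) => !check(v, output));
    console.log(`${failures.length === 0 ? "PASS" : "FAIL"}: ${test.name}`);
  }
}
```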
Run tests on every prompt change to catch regressions.