
Evaluation and Rubrics

Build scoring systems to measure prompt effectiveness and automate quality control.


1. Why Evaluation Matters

You can't improve what you don't measure.

**Without evaluation:**
- "This prompt seems to work"
- No way to compare versions
- Bugs slip through

**With evaluation:**
- "This prompt scores 87% accuracy"
- A/B test improvements
- Catch regressions

**Types of evaluation:**
1. **Human evaluation**: Gold standard, expensive
2. **Automated metrics**: Fast, scalable
3. **LLM-as-judge**: Middle ground

2. Defining Success Criteria

Before evaluating, define what "good" means.

**For a summarization prompt:**

```yaml
criteria:
  - name: accuracy
    description: Summary contains only facts from source
    weight: 30%
  - name: completeness
    description: Covers all key points
    weight: 25%
  - name: conciseness
    description: No unnecessary words
    weight: 20%
  - name: readability
    description: Clear, well-structured
    weight: 15%
  - name: format
    description: Follows requested format
    weight: 10%
```

**Good criteria are:**
- Specific and measurable
- Independent (don't overlap)
- Weighted by importance
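
Weighted criteria combine into a single number by multiplying each criterion's score by its weight and summing. A minimal TypeScript sketch using the weights from the YAML above (the `overallScore` helper is illustrative, not from any library):

```typescript
// Weights from the criteria definition above (must sum to 1.0).
const weights: Record<string, number> = {
  accuracy: 0.30,
  completeness: 0.25,
  conciseness: 0.20,
  readability: 0.15,
  format: 0.10,
};

// Combine per-criterion scores (each 1-5) into one weighted score.
function overallScore(scores: Record<string, number>): number {
  return Object.entries(weights).reduce(
    (total, [criterion, weight]) => total + weight * (scores[criterion] ?? 0),
    0
  );
}

// Example: an accurate but slightly wordy summary.
console.log(
  overallScore({ accuracy: 5, completeness: 4, conciseness: 3, readability: 4, format: 5 })
); // 4.2
```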

3. Building a Scoring Rubric

A rubric converts criteria into scores.

**5-point rubric example:**

```
Criterion: Accuracy

5 - Perfect: All statements are factually correct
4 - Good: Minor inaccuracies that don't change meaning
3 - Acceptable: Some inaccuracies but core message correct
2 - Poor: Significant errors or misrepresentations
1 - Failing: Mostly incorrect or fabricated information
```

**Binary rubric (simpler):**

```
Criterion: Format

1 - Follows JSON schema exactly
0 - Any deviation from schema
```

**Use 5-point for:** subjective quality

**Use binary for:** objective requirements
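
Binary criteria are the easiest to automate, since they reduce to a pass/fail check. Here is a sketch of the format criterion above, assuming the output should be a flat JSON object with a known set of top-level keys (the key list is illustrative, not a full schema validator):

```typescript
// Pass/fail check for the "format" criterion: output must parse as JSON
// and contain exactly the expected top-level keys.
function followsSchema(output: string, expectedKeys: string[]): 0 | 1 {
  try {
    const keys = Object.keys(JSON.parse(output)).sort();
    const expected = [...expectedKeys].sort();
    return keys.length === expected.length &&
      keys.every((k, i) => k === expected[i]) ? 1 : 0;
  } catch {
    return 0; // not valid JSON: any deviation scores 0
  }
}

console.log(followsSchema('{"name":"John","age":25}', ["name", "age"])); // 1
console.log(followsSchema('{"name":"John"}', ["name", "age"]));          // 0
```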

4. LLM-as-Judge Pattern

Use an LLM to evaluate another LLM's output.

**Evaluation prompt:**

```
You are an expert evaluator. Score this AI response on a 1-5 scale.

TASK: {{original_task}}

AI RESPONSE: {{ai_response}}

RUBRIC:
- 5: Perfectly addresses the task
- 4: Good response with minor issues
- 3: Acceptable but missing elements
- 2: Partially addresses task
- 1: Fails to address task

Provide:
1. Score (1-5)
2. Brief justification (1-2 sentences)

Format: {"score": N, "reason": "..."}
```

**Tips:**
- Use a different model than the one being evaluated
- Include the original task for context
- Keep the rubric simple and clear
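
Wiring this up is mostly string templating plus parsing the judge's JSON reply. A minimal sketch with the model call abstracted behind a `callModel` function you supply yourself (no specific provider API is assumed):

```typescript
// Result shape the evaluation prompt asks the judge to return.
interface JudgeResult {
  score: number;   // 1-5 per the rubric
  reason: string;  // brief justification
}

// The evaluation prompt from above, as a template.
const JUDGE_TEMPLATE = `You are an expert evaluator. Score this AI response on a 1-5 scale.

TASK: {{original_task}}

AI RESPONSE: {{ai_response}}

RUBRIC:
- 5: Perfectly addresses the task
- 4: Good response with minor issues
- 3: Acceptable but missing elements
- 2: Partially addresses task
- 1: Fails to address task

Provide:
1. Score (1-5)
2. Brief justification (1-2 sentences)

Format: {"score": N, "reason": "..."}`;

// callModel is whatever LLM client you use -- ideally pointed at a
// different model than the one that produced the response.
async function judge(
  task: string,
  aiResponse: string,
  callModel: (prompt: string) => Promise<string>
): Promise<JudgeResult> {
  const prompt = JUDGE_TEMPLATE
    .replace("{{original_task}}", task)
    .replace("{{ai_response}}", aiResponse);
  const raw = await callModel(prompt);
  return JSON.parse(raw) as JudgeResult; // assumes the judge returned valid JSON
}
```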

5. Automated Test Suites

Build a test suite for your prompts.

**Structure:**

```typescript
// A declarative check applied to the prompt's output
// (shape implied by the tests below).
interface Validator {
  type: string;
  field?: string;
  value?: unknown;
}

interface PromptTest {
  name: string;
  input: Record<string, string>;
  expectedOutput?: string;
  validators: Validator[];
}

const tests: PromptTest[] = [
  {
    name: "Basic extraction",
    input: { text: "John is 25 years old" },
    validators: [
      { type: "json_valid" },
      { type: "contains_field", field: "name" },
      { type: "field_equals", field: "age", value: 25 }
    ]
  },
  {
    name: "Missing data handling",
    input: { text: "The product costs $50" },
    validators: [
      { type: "json_valid" },
      { type: "field_is_null", field: "name" }
    ]
  }
];
```

Run tests on every prompt change to catch regressions.
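
One way to execute such a suite is a small dispatcher that maps each validator type to a check on the parsed output. A sketch under the assumption that the prompt under test returns a JSON string; `runPrompt` is a placeholder for however you call the model:

```typescript
// Map each validator type onto a concrete check of the raw output.
function check(v: Validator, raw: string): boolean {
  let parsed: any;
  try { parsed = JSON.parse(raw); } catch { parsed = undefined; }
  const isObject = typeof parsed === "object" && parsed !== null;
  switch (v.type) {
    case "json_valid":     return parsed !== undefined;
    case "contains_field": return isObject && v.field! in parsed;
    case "field_equals":   return isObject && parsed[v.field!] === v.value;
    case "field_is_null":  return isObject && parsed[v.field!] === null;
    default:               return false; // unknown validator types fail loudly
  }
}

// Run the suite against any prompt-execution function and report results.
async function runSuite(
  runPrompt: (input: Record<string, string>) => Promise<string>
): Promise<void> {
  for (const t of tests) {
    const output = await runPrompt(t.input);
    const failures = t.validators.filter(v => !check(v, output));
    console.log(`${failures.length === 0 ? "PASS" : "FAIL"} ${t.name}`);
  }
}
```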
