# A/B Eval Kit

Use this lightweight eval to test whether a Domain Context Brief improves AI-assisted product work compared with a conventional prompt.

## Goal

Answer one practical question:

> Does structured domain context produce better product architecture than a normal AI product prompt?

Also ask whether the output shows care for the world it serves, not just structural correctness.

For very early ideas, ask a slightly different question first:

> Does Discovery Mode produce more coherent, domain-specific product directions than a generic ideation prompt?

If the idea is easiest to understand through a concrete situation, test Story Mode before Discovery Mode:

> Does a Story Brief reveal actors, objects, stakes, handoffs, rules, sources of truth, and outcomes that a generic ideation prompt would miss?

Run the Discovery Mode eval before the product spec eval (the Build-Ready Prompt Pair). It is useful when the builder does not yet know whether the idea should become a workspace, assistant, review layer, marketplace, memory system, planning tool, or something else.

## Setup

Choose one product concept with enough domain complexity to expose misunderstandings.

Good test domains have:

- Multiple roles
- Multiple core objects
- Familiar concepts that may need to be adapted or recombined
- Human values, craft, risk, pride, frustration, or delight that shape good product judgment
- Future enhancement or innovation opportunities that could be missed by generic product thinking
- Real-world metaphors or models that may or may not carry over into the product medium
- User, organizational, business, trust, quality, or emotional outcomes that need to be balanced
- Audience differences in language, authority, or expertise
- Important states
- Rules or constraints
- External sources of truth
- AI actions that need boundaries

Avoid domains connected to private work, client information, or proprietary processes.

## Early-Idea Prompt Pair

Use this pair when the concept is still rough.

### Prompt A: Generic Ideation

```text
I have a rough idea for an AI-assisted product:

[Describe the hunch, audience, situation, or problem]

Brainstorm five product ideas I could build.
```

### Prompt B: Discovery Mode

```text
Given this rough idea:

[Describe the hunch, audience, situation, or problem]

Map the domain terrain, surface non-obvious concepts, identify rituals, tensions, sources of truth, and important decisions, then propose five coherent product directions that could only emerge from understanding this domain.

For each direction, identify the core object, main user, main workflow, AI advantage, source of truth, main risk, and why it is stronger than a generic product idea.
```

Score each output on the criteria below, using the 1-to-5 scale defined under Rubric:

| Criterion | Score | Notes |
| --- | --- | --- |
| Domain specificity |  | Does the idea come from the actual product world rather than generic AI features? |
| Concept discovery |  | Does it surface important nouns, rituals, handoffs, tensions, or decisions? |
| Direction diversity |  | Does it generate meaningfully different product shapes? |
| Coherence |  | Does each direction have a clear core object, user, workflow, and source of truth? |
| AI fit |  | Is the AI advantage specific and bounded? |
| Buildability |  | Can a promising direction move into a Core Brief? |

## Story Mode Prompt Pair

Use this pair when the product hunch needs a concrete multi-actor situation before the domain can be modeled well.

### Prompt A: Generic Scenario

```text
I have a rough idea for an AI-assisted product:

[Describe the hunch, audience, situation, or problem]

Write a short user scenario that explains the problem and possible solution.
```

### Prompt B: Story Mode

```text
I have a rough idea for an AI-assisted product:

[Describe the hunch, audience, situation, or problem]

Write a one-page Story Brief with cast, situation, wants, stakes, complication, and desired resolution.

Then extract the domain signals: actors, objects, records, episodes, handoffs, rules, sources of truth, lifecycle states, natural domain language, user outcome, business or organizational outcome, system or agent outcome, and riskiest assumptions.
```

Score each output on the criteria below, using the 1-to-5 scale defined under Rubric:

| Criterion | Score | Notes |
| --- | --- | --- |
| Actor specificity |  | Does it reveal multiple actors with distinct wants, stakes, and authority? |
| Domain friction |  | Does it find the real complication beyond generic tool friction? |
| Object discovery |  | Does it surface important records, artifacts, events, tasks, decisions, or resources? |
| Handoffs and episodes |  | Does it expose moments of transfer, review, exception, approval, or escalation? |
| Rules and evidence |  | Does it identify rules, sources of truth, and facts that require evidence? |
| Outcome clarity |  | Does it separate user, business or organizational, and system or agent outcomes? |
| Brief readiness |  | Can the extraction feed Discovery Mode or the Core Brief? |

## Build-Ready Prompt Pair

Use this pair when the product concept is clear enough to generate a spec, prototype, object model, or agent behavior.

### Prompt A: Conventional Prompt

Use a normal prompt that a busy builder might write.

```text
Help me design an AI-assisted [product concept].

Create a product spec including key features, user flows, object model, and AI assistant behavior.
```

### Prompt B: Domain Context Prompt

Use the same task, but include a completed Domain Context Brief.

```text
Help me design an AI-assisted [product concept].

Use this domain context:

[Paste completed Domain Context Brief]

Create a product spec including key features, user flows, object model, and AI assistant behavior.
```
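For repeated runs, it can help to generate both prompts from one function so the only difference between them is the pasted brief. The following is a minimal sketch, not part of the kit itself: the names `TASK` and `build_prompt_pair` are hypothetical, and the wording is copied from Prompt A and Prompt B above.

```python
# Hypothetical helper: the prompt wording comes from Prompt A and Prompt B;
# the function and constant names are illustrative assumptions.
TASK = (
    "Create a product spec including key features, user flows, "
    "object model, and AI assistant behavior."
)


def build_prompt_pair(concept: str, brief: str) -> dict[str, str]:
    """Return Prompt A (conventional) and Prompt B (domain context)."""
    concept = concept.strip()
    brief = brief.strip()
    if not concept or not brief:
        raise ValueError("Both a product concept and a completed brief are required.")
    opening = f"Help me design an AI-assisted {concept}."
    # Prompt A: the task only, as a busy builder might write it.
    prompt_a = f"{opening}\n\n{TASK}"
    # Prompt B: the same task, with the completed brief inserted before it.
    prompt_b = f"{opening}\n\nUse this domain context:\n\n{brief}\n\n{TASK}"
    return {"A": prompt_a, "B": prompt_b}


# Placeholders are left as-is here; replace them with a real concept and brief.
pair = build_prompt_pair("[product concept]", "[Paste completed Domain Context Brief]")
print(pair["B"])
```

Keeping the task sentence in a single constant guards against accidental wording drift between the two prompts, which would otherwise confound the comparison.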

For front-end prototypes, add one more instruction so the model translates the domain context into a product experience instead of exposing the framework itself:

```text
Use the domain context to shape the interface, product feel, future opportunity thinking, system behavior, and AI boundaries. Do not surface framework labels like "Concept Map," "Sources of Truth," or "Agent Boundaries" unless those labels would make sense to the actual users. Prefer plain-language task flows, progressive disclosure, inline evidence, review states, and exception handling.
```

## Rubric

Score each output from 1 to 5.

1 = poor or missing  
3 = usable but incomplete  
5 = strong, specific, and build-relevant

Use 2 or 4 when an output falls between these anchors.

| Criterion | Score | Notes |
| --- | --- | --- |
| Concept clarity |  | Are the core objects distinct and well-defined? |
| Relationship integrity |  | Are relationships between objects correct and consistent? |
| State awareness |  | Does the output respect lifecycle stages? |
| Action safety |  | Does it distinguish user actions from AI actions? |
| Rule compliance |  | Does it preserve important domain rules? |
| Terminology consistency |  | Does it avoid ambiguity and drift? |
| Innovation fit |  | Does it distinguish inherited, adapted, combined, novel, and internal concepts? |
| Opportunity awareness |  | Does it reveal plausible future enhancements or innovation paths grounded in the domain? |
| Outcome alignment |  | Does it connect product decisions to user, organizational, business, trust, quality, or emotional outcomes? |
| Metaphor/medium fit |  | Does it distinguish what should carry over, adapt, or be left behind when translating the domain into the product medium? |
| Domain empathy |  | Does the output show care for what users value, fear, protect, enjoy, or take pride in? |
| Audience fit |  | Does it expose the right language, detail, and controls for each user role? |
| Flow quality |  | Are flows grounded in domain concepts and states? |
| AI behavior quality |  | Are responsibilities, limits, and verification needs clear? |
| Spec usefulness |  | Could a designer, PM, or engineer build from it? |
| Risk awareness |  | Does it catch likely failure modes? |
| Experience translation |  | Does it hide framework scaffolding while preserving domain rigor in the user experience? |
| Brand/experience fit |  | Does the product feel appropriate to the world it serves rather than generic software? |

Total score: `/90`
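The maximum of 90 follows from 18 criteria scored up to 5 each. If scores are recorded in a script rather than by hand, a tally might look like the sketch below. The function names and the dictionary layout are assumptions for illustration; only the criterion names and the 1-to-5 scale come from the rubric.

```python
# Criterion names copied from the rubric table; everything else is illustrative.
RUBRIC_CRITERIA = [
    "Concept clarity", "Relationship integrity", "State awareness",
    "Action safety", "Rule compliance", "Terminology consistency",
    "Innovation fit", "Opportunity awareness", "Outcome alignment",
    "Metaphor/medium fit", "Domain empathy", "Audience fit",
    "Flow quality", "AI behavior quality", "Spec usefulness",
    "Risk awareness", "Experience translation", "Brand/experience fit",
]
MAX_TOTAL = 5 * len(RUBRIC_CRITERIA)  # 18 criteria x 5 points = 90


def total_score(scores: dict[str, int]) -> int:
    """Sum one output's scores after checking every criterion is scored 1-5."""
    missing = [c for c in RUBRIC_CRITERIA if c not in scores]
    if missing:
        raise ValueError(f"Unscored criteria: {missing}")
    for criterion in RUBRIC_CRITERIA:
        if not 1 <= scores[criterion] <= 5:
            raise ValueError(f"{criterion}: score must be between 1 and 5")
    return sum(scores[c] for c in RUBRIC_CRITERIA)


def score_deltas(scores_a: dict[str, int], scores_b: dict[str, int]) -> dict[str, int]:
    """Per-criterion difference, Prompt B minus Prompt A."""
    total_score(scores_a)  # validate both score sheets before comparing
    total_score(scores_b)
    return {c: scores_b[c] - scores_a[c] for c in RUBRIC_CRITERIA}
```

The per-criterion deltas are usually more informative than the totals, because they show where the brief helped and where it made no difference.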

## Review Questions

After scoring, answer:

```text
Which output is more build-ready?
Which output contains more concept confusion?
Which output gives AI too much or too little authority?
Which output better handles sources of truth?
Which output would create fewer downstream product debates?
Which output better handles what is inherited, adapted, combined, or novel?
Which output reveals more useful future product opportunities grounded in the domain?
Which output better distinguishes real-world metaphors that should carry over from those that should change or be left behind?
Which output better connects product direction to user, business, trust, quality, or emotional outcomes?
Which output better understands why the domain matters to users?
Which output better fits the target audience's language, authority, and expertise?
Which output feels like a product for the target user rather than a visible domain model?
Which output has a brand and experience direction that fits the product world?
What did the Domain Context Brief improve?
What did it fail to improve?
```

## What To Look For

The conventional prompt may produce fluent but generic output.

Watch for:

- Objects that collapse into each other
- Missing states
- Vague agent permissions
- Generic feature lists
- Invented rules
- Weak sources of truth
- Product flows that ignore real-world constraints

The Domain Context prompt should produce:

- Cleaner object models
- More consistent language
- Clearer distinction between inherited domain reality and new product reality
- More care for the domain's values, craft, risks, and emotional stakes
- More grounded opportunity thinking based on domain concepts, relationships, rituals, and medium changes
- Clearer outcome tradeoffs and success criteria
- Better audience-specific translation of concepts, controls, and explanations
- Safer agent behavior
- Better user flows
- Clearer rules and source verification
- More useful implementation context
- User-facing screens that translate domain rigor into task language, review moments, blockers, confirmations, and evidence rather than exposing the framework as UI
- Brand, tone, hierarchy, and interaction choices that feel specific to the product world rather than generic SaaS

If the Domain Context output does not show these improvements, revise the brief. The framework should earn its keep.

## Human-Centered Translation Check

After reviewing the domain-context output, mark each surfaced concept as one of:

- **User-facing:** Users naturally think in this term and need to act on it directly.
- **Workflow-facing:** The concept should shape navigation, grouping, status, or task order, but does not need a tutorial-like explanation.
- **System-facing:** The product or agent must track it, but users only need the effect, such as a disabled action, a blocker, a warning, or an evidence link.
- **Exception-facing:** Reveal it only when there is ambiguity, conflict, risk, missing evidence, or a decision requiring human authority.

If a prototype contains sections named after the framework itself, treat that as a design smell unless the intended users are explicitly modeling the domain.
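The check can be made partly mechanical. The sketch below models the four exposure categories and flags prototype section titles that reuse framework labels. It is a rough illustration under stated assumptions: the `Exposure` enum, the `find_framework_labels` function, and the idea of passing section titles as a list are all hypothetical, and the label list contains only the three examples named in the front-end prototype instruction.

```python
from enum import Enum


class Exposure(Enum):
    """The four ways a surfaced concept can appear in the product."""
    USER_FACING = "user-facing"
    WORKFLOW_FACING = "workflow-facing"
    SYSTEM_FACING = "system-facing"
    EXCEPTION_FACING = "exception-facing"


# Only the labels named earlier in this document; extend for your own framework.
FRAMEWORK_LABELS = ("Concept Map", "Sources of Truth", "Agent Boundaries")


def find_framework_labels(
    section_titles: list[str], users_model_domain: bool = False
) -> list[str]:
    """Return section titles that expose framework labels (a design smell).

    If the intended users are explicitly modeling the domain, the labels are
    legitimate, so nothing is flagged.
    """
    if users_model_domain:
        return []
    lowered = [label.lower() for label in FRAMEWORK_LABELS]
    return [
        title for title in section_titles
        if any(label in title.lower() for label in lowered)
    ]


flagged = find_framework_labels(["Today's deliveries", "Sources of Truth", "Review queue"])
print(flagged)  # ['Sources of Truth']
```

A string match only catches literal labels. Deciding whether a concept is user-facing or system-facing still requires human judgment about how the intended users actually talk.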