Vision-Truthful E2E — the test that actually looks

A robot sales rep clicks through the live dashboard and takes photos; an AI then judges each photo against what should have happened.

Master summary — the gist in 30 seconds

TL;DR: Built a test framework that drives the LIVE app like a rep, screenshots every step, and has Gemini vision JUDGE each screenshot. All 18 of the framework's self-checks pass.

Input: the live ClientsFlow board + a written description of what each screen SHOULD look like. Output: a verdict for every screenshot (PASS / BUG / EXPECTATION-WRONG) in a shareable report.

Why this matters: Every previous test said 'all green' while the real screen was broken, because those tests checked the code and data, never the pixels. This one is structurally unable to pass a screen it never looked at.

flowchart LR
  Spec[Written expectation] --> Cap[Robot clicks + screenshots]
  Cap --> Judge[AI looks at the photo]
  Judge --> Rep[Report: PASS / BUG]
  style Judge fill:#e9dcc3,stroke:#8a6d3b

1 · What shipped today

TL;DR: The whole 4-stage framework is built, and every one of its 18 self-checks passes against the live app.

Input: a checklist of 18 things the framework must do. Output: 18/18 green, each proven by failing first (RED) then passing (GREEN) — never ticked on a hunch.

Why it matters: The framework can be trusted because each piece was watched failing for the right reason before it was accepted, so a 'pass' is earned, not assumed.

flowchart TD
  A[18 checklist boxes] --> B{Run the check}
  B -->|see it FAIL red| C[Fix / build the piece]
  C --> D[Re-run]
  D -->|PASS green| E[Tick the box]
  E --> F[18/18 + full suite green]
  style F fill:#cfe3cf,stroke:#3c6e47
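
The RED-then-GREEN rule above can be sketched as a small ledger that refuses to tick a box unless a failing run was recorded before the passing one. This is only an illustration: the `Checklist` class and the check names are hypothetical, not the framework's actual code.

```python
from typing import Dict, Iterable, List


class Checklist:
    """Hypothetical ledger: a box is ticked only after RED, then GREEN."""

    def __init__(self, names: Iterable[str]) -> None:
        # One run history per checklist box, oldest run first.
        self._history: Dict[str, List[bool]] = {name: [] for name in names}

    def record(self, name: str, passed: bool) -> None:
        # Append the outcome of one run of the named check.
        self._history[name].append(passed)

    def is_ticked(self, name: str) -> bool:
        runs = self._history[name]
        # The latest run must pass AND an earlier run must have failed.
        return bool(runs) and runs[-1] and (False in runs[:-1])

    def summary(self) -> str:
        ticked = sum(1 for name in self._history if self.is_ticked(name))
        return f"{ticked}/{len(self._history)}"
```

A check that passes on its first run stays unticked under this rule, because nobody has yet seen it fail for the right reason.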

2 · The four stages

TL;DR: Spec → Capture → Judge → Report, deliberately kept as four separate steps.

Input: a frozen 'expected behaviour' spec. Capture only takes photos; a SEPARATE Judge step must score every photo; Report turns scores into a styled webpage.

Why it matters: Splitting them is the cure for the old disease: when capture and judging were one blob, the test could save a photo and call it a pass without ever scoring it. Now the run literally cannot finish until every photo has a verdict.

flowchart LR
  S[0 - Spec\nwhat to expect] --> C[1 - Capture\nphotos + manifest]
  C --> J[2 - Judge\nGemini scores pixels]
  J --> R[3 - Report\nhouse-style HTML]
  R --> CF[Cloudflare URL]
  style J fill:#e9dcc3,stroke:#8a6d3b
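
One way to picture the "cannot finish until every photo has a verdict" rule is a Report step that checks the capture manifest against the verdicts and raises if anything is unscored. The sketch below assumes a manifest of frames keyed by a step id; `Frame`, `judge_all`, `build_report` and `UnjudgedFramesError` are illustrative names, not the framework's real API.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable, Dict, List


class Verdict(Enum):
    PASS = "PASS"
    BUG = "BUG"
    EXPECTATION_WRONG = "EXPECTATION-WRONG"


@dataclass(frozen=True)
class Frame:
    step_id: str      # which step of the flow this screenshot belongs to
    png_path: str     # where Capture saved the screenshot
    expectation: str  # frozen text from the Spec stage


class UnjudgedFramesError(RuntimeError):
    """Raised when Report runs before every captured frame has a verdict."""


def judge_all(manifest: List[Frame],
              judge: Callable[[Frame], Verdict]) -> Dict[str, Verdict]:
    # Stage 2: score every frame listed in the capture manifest.
    return {frame.step_id: judge(frame) for frame in manifest}


def build_report(manifest: List[Frame],
                 verdicts: Dict[str, Verdict]) -> List[str]:
    # Stage 3: refuse to report if any captured frame lacks a verdict.
    missing = [f.step_id for f in manifest if f.step_id not in verdicts]
    if missing:
        raise UnjudgedFramesError(f"frames without a verdict: {missing}")
    return [f"{f.step_id}: {verdicts[f.step_id].value}" for f in manifest]
```

Because Capture only produces the manifest and never a verdict, a saved photo cannot count as a pass by itself in this design.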

3 · Two gates — data truth AND picture truth

TL;DR: A step is only green if the backend data is right AND the screenshot looks right.

Input: one step of a flow. Gate A asks the database 'is the deal in the right state?'; Gate B (the AI) asks 'does the screen actually show it correctly?'. Both must pass.

Why it matters: The two gates catch different lies. A screen can look perfect while the data is wrong, or the data can be right while the screen is visually broken. Checking only one is how bugs slipped through for months.

flowchart TD
  Step --> A[Gate A: backend data right?]
  Step --> B[Gate B: AI says pixels right?]
  A --> G{Both pass?}
  B --> G
  G -->|yes| Green[Step GREEN]
  G -->|no| Bug[Flagged]
  style B fill:#e9dcc3,stroke:#8a6d3b
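
The two-gate rule reduces to a logical AND over two independent results. A minimal sketch follows, assuming each gate reports a pass flag plus a short detail string; `GateResult`, `step_status` and `failure_reasons` are hypothetical names.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class GateResult:
    passed: bool
    detail: str  # short human-readable note on what the gate observed


def step_status(gate_a: GateResult, gate_b: GateResult) -> str:
    # A step is GREEN only when the data gate AND the pixel gate both pass.
    return "GREEN" if gate_a.passed and gate_b.passed else "FLAGGED"


def failure_reasons(gate_a: GateResult, gate_b: GateResult) -> List[str]:
    # Name every gate that failed, so a flagged step says which lie it caught.
    reasons = []
    if not gate_a.passed:
        reasons.append(f"Gate A (data): {gate_a.detail}")
    if not gate_b.passed:
        reasons.append(f"Gate B (pixels): {gate_b.detail}")
    return reasons
```

Keeping the failure reasons separate per gate preserves the distinction between "data wrong, screen fine" and "data fine, screen broken".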

4 · The AI judge looks at the actual photo

TL;DR: Gemini 2.5 Pro reads the screenshot's pixels off disk, never a text description of it.

Input: the PNG bytes + the written expectation. Output: PASS/BUG/EXPECTATION-WRONG, with what it literally saw and the difference. It knows Hungarian copy is correct and that 'ZZ' test names aren't bugs.

Why it matters: The old failure was an agent 'trusting captions', that is, describing a screen it never opened. Judging the raw image makes that impossible. Proof: the judge spontaneously caught a card whose buttons were cut off, and correctly flagged a planted-broken frame.

flowchart LR
  PNG[Screenshot bytes] --> AI[Gemini 2.5 Pro]
  Exp[Expected text] --> AI
  AI --> V[verdict + what-I-saw + diff]
  style AI fill:#e9dcc3,stroke:#8a6d3b
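
A sketch of the "bytes, not captions" contract: the judge wrapper loads the file from disk, checks that it really is a PNG, and only then hands the bytes and the expectation to the model. The model call is passed in as a plain function (`call_model`) because the real Gemini client code is not shown here. The JSON reply shape with `verdict`, `saw` and `diff` keys is an assumption based on the outputs listed above.

```python
import json
from pathlib import Path
from typing import Callable, Dict

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"  # first 8 bytes of every PNG file
ALLOWED_VERDICTS = {"PASS", "BUG", "EXPECTATION-WRONG"}


def load_png(path: str) -> bytes:
    # Read the raw screenshot bytes and reject anything that is not a PNG.
    data = Path(path).read_bytes()
    if not data.startswith(PNG_SIGNATURE):
        raise ValueError(f"{path} is not a PNG file")
    return data


def judge_frame(png_path: str, expectation: str,
                call_model: Callable[[bytes, str], str]) -> Dict[str, str]:
    # The model only ever receives image bytes plus the expectation text.
    png = load_png(png_path)
    result = json.loads(call_model(png, expectation))
    if result.get("verdict") not in ALLOWED_VERDICTS:
        raise ValueError(f"unexpected verdict: {result.get('verdict')!r}")
    for key in ("saw", "diff"):
        if key not in result:
            raise ValueError(f"judge reply is missing {key!r}")
    return result
```

Injecting `call_model` also makes it easy to test the wrapper with a planted-broken frame and a stub model before spending real API calls.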

5 · Safety — it can never touch a real client

TL;DR: Four layers guarantee zero real emails: ZZ-only fixtures, fake @example.com addresses, a send-button blocklist, and a post-run sweep.

Input: the live production board (auto-send is ON). Output: tests run only on throwaway 'ZZ' deals with non-deliverable addresses, and any armed sequence is cancelled before the 10-minute cron could fire.

Why it matters: The test drives the REAL app, so a slip could email a real lead. The four layers are there to make that impossible; the final sweep confirmed zero leftovers and that the 7 real deals were untouched.

flowchart TD
  T[Test action] --> L1[ZZ name only]
  L1 --> L2[@example.com address]
  L2 --> L3[send buttons blocked]
  L3 --> L4[sweep cancels armed seqs]
  L4 --> Safe[0 real sends]
  style Safe fill:#cfe3cf,stroke:#3c6e47
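
The four layers can be read as three guards that run before any action plus one cleanup that runs after. The sketch below assumes test deals are recognised by a name starting with "ZZ" and that blocked buttons are matched by label. The button labels, `Sequence` and the function names are all hypothetical.

```python
from dataclasses import dataclass
from typing import Iterable

# Hypothetical labels; the real blocklist would use the app's own button text.
BLOCKED_BUTTON_LABELS = frozenset({"Send", "Send now"})


@dataclass
class Sequence:
    deal_name: str
    armed: bool = True


def assert_safe_fixture(deal_name: str, email: str) -> None:
    # Layers 1 and 2: only ZZ deals with non-deliverable addresses.
    if not deal_name.startswith("ZZ"):
        raise PermissionError(f"not a ZZ fixture: {deal_name!r}")
    if not email.lower().endswith("@example.com"):
        raise PermissionError(f"deliverable address: {email!r}")


def assert_click_allowed(button_label: str) -> None:
    # Layer 3: the robot may never press a send button.
    if button_label.strip() in BLOCKED_BUTTON_LABELS:
        raise PermissionError(f"blocked button: {button_label!r}")


def sweep(sequences: Iterable[Sequence]) -> int:
    # Layer 4: cancel every armed sequence left on a ZZ deal; skip real deals.
    cancelled = 0
    for seq in sequences:
        if seq.armed and seq.deal_name.startswith("ZZ"):
            seq.armed = False
            cancelled += 1
    return cancelled
```

In this sketch the sweep only touches ZZ deals, so it cannot cancel a sequence that belongs to a real client.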

6 · What's next

TL;DR: The machinery is done; the first 'scored' real-flow run needs three human-gated prep steps.

Input: today's working framework. To run it for real on F4→F5→F14: Matt freezes the 3 expectation specs, we seed lasting ZZ fixtures, and we calibrate the judge on 3 sample frames.

Why it matters: Freezing the specs is what makes 'PASS' meaningful, because you can't grade against a moving target. These are deliberately human gates, not automation.

timeline
  title From here to a real scored run
  Done : Framework built : 18/18 green : Deployed v056
  Next : Freeze 3 specs : Seed ZZ fixtures : Calibrate judge
  Then : First scored run : Real vision report
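
One possible way to enforce a frozen spec is to record a content hash when the spec is signed off and refuse to score a run whose spec text no longer matches it. This is a sketch of that idea, not a description of how the freeze is actually done; `spec_fingerprint` and `assert_spec_frozen` are hypothetical helpers, and the flow ids are the ones named above.

```python
import hashlib
from typing import Dict


def spec_fingerprint(spec_text: str) -> str:
    # Ignore trailing whitespace so cosmetic edits do not break the freeze.
    lines = [line.rstrip() for line in spec_text.strip().splitlines()]
    return hashlib.sha256("\n".join(lines).encode("utf-8")).hexdigest()


def assert_spec_frozen(flow_id: str, spec_text: str,
                       frozen: Dict[str, str]) -> None:
    # Refuse to score a flow whose spec was never frozen or has since changed.
    if flow_id not in frozen:
        raise LookupError(f"no frozen spec recorded for {flow_id}")
    if spec_fingerprint(spec_text) != frozen[flow_id]:
        raise ValueError(f"spec for {flow_id} changed after it was frozen")
```

The human gate stays human in this design: a person decides when to record the fingerprint, and the code only detects drift afterwards.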

Checklist + RED to GREEN run log → · Master plan (HANDOFF) → · Sample vision report (Cloudflare) →