visual-qa-ultra — built, proven, wired for the 23-journey retest

A reusable robot QA tester that looks at real pixels and tells the truth

Master summary — the gist in 30 seconds

TL;DR: This session turned a one-off, hand-driven UI test into a reusable global skill, then wired it so the big 23-scenario ClientsFlow retest runs cleanly and safely.

Input: a proven-but-throwaway live test + a handoff spec. Output: a global skill (engine + judge + live report), a smoke proof, a drop-in ClientsFlow config, and a ready-to-paste retest prompt.

Why this matters: The old test worked, but only because a careful human drove it once. Bottling that discipline into a tool means any web app can be tested the same truthful way, repeatably. The scariest part (sending real emails) is now blocked by code, not by hoping the agent is careful.

flowchart LR
  A[Proven one-off test] --> B[Generalize into a skill]
  B --> C[Smoke-prove on a demo page]
  C --> D[Wire for ClientsFlow retest]
  D --> E[Paste the prompt - run all 23]

1 · What the skill actually is

TL;DR: A QA robot that drives any web app, screenshots each step, and judges the real pixels live, never trusting a caption.

Input: a web app + what each screen SHOULD look like. Output: a live HTML report (watchable on localhost) where every screenshot gets a Gemini read + a Claude double-check, tagged honestly.

Why it matters: Old QA tools quietly marked things 'green' by checking code or text, not the actual picture, so broken screens slipped through. This one reads the pixels from the screenshot on disk, like a person would, and only a screen that was really driven can be called green.

flowchart TD
  S[Take screenshot] --> G[Gemini: 5-sentence read + confidence]
  G --> Q{PASS and confident?}
  Q -- yes --> T[Trust it - move on]
  Q -- no/unsure --> C[Claude reads the pixels + decides]
  C --> V[Verdict: OK / fix / blocked]
  V --> N[Next step]
  N --> S
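
The escalation rule in the flowchart above can be sketched in a few lines of Python. This is a minimal illustration, not the skill's actual code: `FirstRead`, `judge_frame`, the injected judge callables, and the 0.9 confidence floor are all hypothetical names and values chosen for the example. The real model calls (Gemini for the first read, Claude for the second look) are passed in as plain functions so the sketch stays runnable offline.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class FirstRead:
    """Result of the cheap first judge (hypothetical shape)."""
    verdict: str       # "PASS" or "BUG"
    confidence: float  # 0.0 to 1.0
    notes: str         # the short written read of the frame


# Assumed threshold; the source only says "PASS and confident".
CONFIDENCE_FLOOR = 0.9


def judge_frame(
    png_path: str,
    expectation: str,
    first_judge: Callable[[str, str], FirstRead],
    second_judge: Callable[[str, str, FirstRead], str],
    floor: float = CONFIDENCE_FLOOR,
) -> str:
    """Return a final verdict for one screenshot.

    A confident PASS from the first judge is trusted as-is. Anything
    else (a BUG, or a PASS below the floor) goes to the second judge,
    which reads the pixels itself and decides.
    """
    read = first_judge(png_path, expectation)
    if read.verdict == "PASS" and read.confidence >= floor:
        return "OK"  # trusted, no second look
    return second_judge(png_path, expectation, read)
```

Keeping the judges as injected callables also makes the rule easy to test: a fake first judge returning a confident PASS should never trigger the second one.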

2 · How we know it works (the smoke proof)

TL;DR: On a public demo page it caught its own wrong expectation, and it skipped the expensive second look on a confident pass, exactly as designed.

Input: example.com + a deliberately-flawed expectation. Output: Gemini cried 'BUG', Claude looked at the actual pixels and corrected it to 'expectation was wrong' — and a confident clean frame was trusted without a second look.

Why it matters: This is the whole point: two judges that disagree productively. The robot doesn't blindly trust the first opinion, and it doesn't waste the expensive second opinion on the obvious passes. That balance is what keeps a long run both honest and affordable.

flowchart LR
  subgraph Frame1[Frame 1 - my expectation was off]
    A[Gemini: BUG conf 1.0] --> B[Claude reads pixels] --> C[EXPECTATION_WRONG]
  end
  subgraph Frame2[Frame 2 - clearly fine]
    D[Gemini: PASS conf 1.0] --> E[Trusted - no second look]
  end

3 · Wiring it safely for the ClientsFlow retest

TL;DR: A drop-in config + a code-enforced 'only ZZ test addresses' gate so the live run can actually send emails without ever hitting a real person.

Input: the generic skill (no live-app safety) + the ClientsFlow specifics. Output: a config file (URL, judge rules, selectors) + a hard send-gate that raises an error if a recipient isn't a ZZ test address.

Why it matters: Real autosend is ON in production. 'Be careful' is not a safety system. Now the code itself refuses any send to a non-test address, turning the biggest risk of the whole run into a wall, not a guideline.

flowchart TD
  X[Run wants to send an email] --> Y{Recipient is a ZZ test address?}
  Y -- yes --> Z[Send + verify it landed]
  Y -- no --> W[SendBlocked - STOP]
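
A hard send-gate of this kind might look like the sketch below. Only the `SendBlocked` name comes from the diagram above. Everything else is an assumption made for illustration: the report does not define what a "ZZ test address" is, so the example treats it as any address whose local part starts with `zz`, and `is_test_address` and `guarded_send` are hypothetical helper names.

```python
from typing import Callable, TypeVar

T = TypeVar("T")


class SendBlocked(Exception):
    """Raised when a recipient is not an allowed test address."""


# ASSUMPTION: a "ZZ test address" is one whose local part starts with "zz".
# The real rule in the ClientsFlow config may differ.
TEST_PREFIX = "zz"


def is_test_address(recipient: str) -> bool:
    """True only for well-formed addresses with the test prefix."""
    local, sep, domain = recipient.strip().lower().partition("@")
    return bool(sep) and bool(domain) and local.startswith(TEST_PREFIX)


def guarded_send(recipient: str, send: Callable[[str], T]) -> T:
    """Call `send` only if the recipient passes the gate.

    The check runs before any send logic, so a non-test recipient
    stops the run with an exception instead of relying on the
    agent to notice.
    """
    if not is_test_address(recipient):
        raise SendBlocked(f"refusing to send to non-test address: {recipient!r}")
    return send(recipient)
```

Raising an exception rather than returning a flag is the point of the design: a caller that forgets to check a return value still cannot send.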

4 · Keeping a 23-scenario marathon from drowning

TL;DR: The report on disk is the memory, so the agent can clear its head at each scenario and resume, and it only loads a screenshot into context when it really needs to look.

Input: hundreds of frames over 23 scenarios. Output: a lean run — disk holds the truth, only uncertain frames are 'looked at', and context is compacted at scenario boundaries, never mid-task.

Why it matters: A long agent run normally chokes on its own screenshots. Treating the on-disk report as the source of truth means the agent's working memory stays small and the run survives restarts and overloads instead of dying halfway.

flowchart LR
  R[(log.json + frames on disk = truth)] --> A[Read a PNG only if uncertain]
  A --> B[Compact at scenario boundary]
  B --> C[Resume from first un-judged step]
  C --> R
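
The resume step in the loop above can be illustrated with a short sketch. The schema is assumed, not taken from the skill: here `log.json` is a JSON list of step records, each with a `verdict` field that is missing or null until the step has been judged. `load_steps` and `first_unjudged` are hypothetical helper names.

```python
import json
from pathlib import Path
from typing import Optional


def load_steps(log_path: Path) -> list:
    """Read the on-disk log; the file, not the chat context, is the truth."""
    return json.loads(log_path.read_text(encoding="utf-8"))


def first_unjudged(steps: list) -> Optional[dict]:
    """Return the first step without a verdict, or None if all are judged.

    ASSUMPTION: each record is a dict and an un-judged step has no
    "verdict" key or has it set to null.
    """
    for step in steps:
        if step.get("verdict") is None:
            return step
    return None
```

After a compaction or a restart, the agent would call `first_unjudged(load_steps(path))` and continue from that record, so nothing about progress needs to live in its working memory.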

5 · What's next + the honest risks

TL;DR: Paste the prompt into a fresh chat to run all 23 scenarios; expect that the live Google Meet step may come back BLOCKED rather than fake-green.

Input: the finalized retest prompt + config. Output (when you run it): a master report of all 23 scenarios with honest tags, plus a separate list of any new product findings.

Why it matters: An honest BLOCKED beats a narrated fake-green. The external bits (Meet + Fireflies) may genuinely be undrivable; flagging that truthfully is the feature, not a failure. The only other caveat: the web-research helpers got rate-limited, so the context rules lean on documented behavior.

timeline
  title From here
  Run it : paste prompt in a new chat : walk Scenario 1 to 23
  Watch for : live-call may be BLOCKED : not a bug, an honest tag
  Then : review the master report : triage new product findings

Deliverables: the final retest prompt · the ClientsFlow run config · the skill (global)