Human overview · for understanding
A robot sales-rep clicks the live dashboard, takes photos, and an AI judges the photos against what should have happened · 2026-06-21
Master summary — the gist in 30 seconds
Input: the live ClientsFlow board + a written description of what each screen SHOULD look like. Output: a verdict for every screenshot (PASS / BUG / EXPECTATION-WRONG) in a shareable report.
flowchart LR
Spec[Written expectation] --> Cap[Robot clicks + screenshots]
Cap --> Judge[AI looks at the photo]
Judge --> Rep[Report: PASS / BUG]
style Judge fill:#e9dcc3,stroke:#8a6d3b
Input: a checklist of 18 things the framework must do. Output: 18/18 green, each proven by failing first (RED) then passing (GREEN) — never ticked on a hunch.
flowchart TD
A[18 checklist boxes] --> B{Run the check}
B -->|see it FAIL red| C[Fix / build the piece]
C --> D[Re-run]
D -->|PASS green| E[Tick the box]
E --> F[18/18 + full suite green]
style F fill:#cfe3cf,stroke:#3c6e47
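The red-then-green rule in the flowchart above can be sketched in a few lines of Python. This is a minimal illustration, not the framework's actual code: `ChecklistBox`, `record`, `can_tick`, and `all_green` are hypothetical names, and the run history is assumed to be a simple list of pass/fail outcomes.

```python
from dataclasses import dataclass, field


@dataclass
class ChecklistBox:
    """One checklist item; `history` records each run's outcome in order."""
    name: str
    history: list[bool] = field(default_factory=list)  # True means the check passed

    def record(self, passed: bool) -> None:
        """Append the outcome of one run of the check."""
        self.history.append(passed)

    def can_tick(self) -> bool:
        # Tickable only if the latest run passed AND some earlier run failed,
        # i.e. the check was seen RED before it went GREEN.
        if not self.history or not self.history[-1]:
            return False
        return False in self.history[:-1]


def all_green(boxes: list[ChecklistBox]) -> bool:
    """True when every box has earned its tick."""
    return all(box.can_tick() for box in boxes)
```

A box whose check passes on the very first run stays unticked under this rule, because nothing proves the check is able to fail.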
Input: a frozen 'expected behaviour' spec. Capture only takes photos; a SEPARATE Judge step must score every photo; Report turns scores into a styled webpage.
flowchart LR
S[0 - Spec\nwhat to expect] --> C[1 - Capture\nphotos + manifest]
C --> J[2 - Judge\nGemini scores pixels]
J --> R[3 - Report\nhouse-style HTML]
R --> CF[Cloudflare URL]
style J fill:#e9dcc3,stroke:#8a6d3b
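The separation between Capture, Judge, and Report can be expressed as a guard: the report step refuses to run while any captured frame is unscored. The sketch below is an assumption-laden illustration. The manifest is modelled as a list of file names, verdicts as a dict keyed by file name, and `unscored_frames` and `build_report` are hypothetical helpers. It emits plain text rather than the house-style HTML.

```python
def unscored_frames(manifest: list[str], verdicts: dict[str, str]) -> list[str]:
    """Return the manifest frames that the Judge step has not scored yet."""
    return [frame for frame in manifest if frame not in verdicts]


def build_report(manifest: list[str], verdicts: dict[str, str]) -> str:
    """Render one line per frame, but only once every frame has a verdict."""
    missing = unscored_frames(manifest, verdicts)
    if missing:
        # Capture alone is not evidence: every photo needs a Judge verdict.
        raise ValueError(f"unscored frames: {missing}")
    return "\n".join(f"{frame}: {verdicts[frame]}" for frame in manifest)
```

Raising instead of silently skipping unscored frames keeps a capture-only run from ever looking like a passing run.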
Input: one step of a flow. Gate A asks the database 'is the deal in the right state?'; Gate B (the AI) asks 'does the screen actually show it correctly?'. Both must pass.
flowchart TD
Step --> A[Gate A: backend data right?]
Step --> B[Gate B: AI says pixels right?]
A --> G{Both pass?}
B --> G
G -->|yes| Green[Step GREEN]
G -->|no| Bug[Flagged]
style B fill:#e9dcc3,stroke:#8a6d3b
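The two-gate rule reduces to a logical AND over independent checks. A minimal sketch, assuming each gate reports a plain boolean; `StepResult` and its field names are hypothetical:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StepResult:
    """Outcome of one flow step after both gates have run."""
    gate_a_backend_ok: bool  # Gate A: the database shows the deal in the right state
    gate_b_pixels_ok: bool   # Gate B: the AI judge says the screen shows it correctly

    @property
    def status(self) -> str:
        # A step is GREEN only when BOTH gates pass; anything else is flagged.
        if self.gate_a_backend_ok and self.gate_b_pixels_ok:
            return "GREEN"
        return "FLAGGED"

    @property
    def failed_gates(self) -> list[str]:
        """Name the gates that failed, so a flagged step says why."""
        failed = []
        if not self.gate_a_backend_ok:
            failed.append("A")
        if not self.gate_b_pixels_ok:
            failed.append("B")
        return failed
```

Recording which gate failed separates "the data is wrong" from "the data is right but the screen shows it wrongly", which are different bugs.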
Input: the PNG bytes + the written expectation. Output: a verdict (PASS / BUG / EXPECTATION-WRONG), plus what the judge literally saw and how that differs from the expectation. The judge treats Hungarian copy as correct and does not flag 'ZZ' test names as bugs.
flowchart LR
PNG[Screenshot bytes] --> AI[Gemini 2.5 Pro]
Exp[Expected text] --> AI
AI --> V[verdict + what-I-saw + diff]
style AI fill:#e9dcc3,stroke:#8a6d3b
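The judge's three-part output (verdict, what it saw, the difference) is easier to trust if it is validated before it reaches the report. The sketch below assumes the judge replies with a JSON object using the keys `verdict`, `saw`, and `diff`. That shape, and the names `Verdict`, `Judgement`, and `parse_judgement`, are illustrative assumptions, and no model API call is shown.

```python
import json
from dataclasses import dataclass
from enum import Enum


class Verdict(Enum):
    PASS = "PASS"
    BUG = "BUG"
    EXPECTATION_WRONG = "EXPECTATION-WRONG"


@dataclass(frozen=True)
class Judgement:
    verdict: Verdict
    saw: str   # what the judge literally saw in the screenshot
    diff: str  # how that differs from the written expectation


def parse_judgement(raw: str) -> Judgement:
    """Parse the judge's reply (assumed JSON); reject unknown verdict labels."""
    data = json.loads(raw)
    return Judgement(
        verdict=Verdict(data["verdict"]),  # raises ValueError for an unknown label
        saw=data["saw"],
        diff=data["diff"],
    )
```

Rejecting any label outside the three allowed values means a vague or malformed reply fails loudly instead of being counted as a pass.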
Input: the live production board (auto-send is ON). Output: tests run only on throwaway 'ZZ' deals with non-deliverable addresses, and any armed sequence is cancelled before the 10-minute cron could fire.
flowchart TD
T[Test action] --> L1[ZZ name only]
L1 --> L2[@example.com address]
L2 --> L3[send buttons blocked]
L3 --> L4[sweep cancels armed seqs]
L4 --> Safe[0 real sends]
style Safe fill:#cfe3cf,stroke:#3c6e47
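Three of the safety layers above (ZZ name, non-deliverable address, sweep) can be sketched as simple checks. Everything here is hypothetical: the `Deal` shape, the assumption that test deals are marked by a `ZZ` name prefix, and the in-memory `sweep` standing in for whatever actually cancels sequences. Blocking the send buttons is not modelled.

```python
from dataclasses import dataclass


@dataclass
class Deal:
    name: str
    email: str
    armed_sequence: bool = False  # True when an auto-send sequence is scheduled


def is_safe_target(deal: Deal) -> bool:
    # Layers 1 and 2: a throwaway ZZ name (prefix assumed) and an
    # @example.com address, which cannot deliver real mail.
    return deal.name.startswith("ZZ") and deal.email.lower().endswith("@example.com")


def sweep(deals: list[Deal]) -> int:
    """Disarm every armed sequence on the given deals; return how many were cancelled."""
    cancelled = 0
    for deal in deals:
        if deal.armed_sequence:
            deal.armed_sequence = False
            cancelled += 1
    return cancelled
```

Running the sweep after every test action, not only at the end of a run, keeps the window before the 10-minute cron as short as possible.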
Input: today's working framework. To run it for real on F4→F5→F14: Matt freezes the 3 expectation specs, we seed lasting ZZ fixtures, and we calibrate the judge on 3 sample frames.
timeline
title From here to a real scored run
Done : Framework built : 18/18 green : Deployed v056
Next : Freeze 3 specs : Seed ZZ fixtures : Calibrate judge
Then : First scored run : Real vision report