Session 2: Automated QA — WCAG, Contracts, and Impact Tracing

Series: What If a Design System Could Self QA?
Author: C. Maldonado
Date: February 2026

Where We Picked Up

Between sessions, I'd done some homework. Spacing and typography tokens got created in Tokens Studio and pushed to the repo — so the pipeline was no longer colors-only. The repo now had color.json, spacing.json, and typography.json all in DTCG format, all flowing through Style Dictionary.

But the pipeline had a gap. It could build tokens, but it couldn't tell you if a change was good or bad. Someone could delete a token that 30 components depend on, or change a brand color to something that fails WCAG contrast — and the pipeline would happily build it. We needed something between "tokens changed" and "tokens shipped."

That's what this session built: three standalone QA scripts that run on every PR.

(A note on naming: I called these "agents" throughout the session because that's CI terminology — like a Jenkins build agent or a monitoring collection agent. They're deterministic scripts: same input, same output, no AI. "AI agent" in this project means Phase 3+, when Claude Code actually enters the pipeline as intelligent middleware. The distinction matters and I want to be precise about it going forward.)

Part 1: The Three Scripts

I went into this session with one goal: ship working code, not documentation. After Session 1 produced a lot of planning artifacts, I wanted the inverse — build the thing, write about it after.

Script 1: WCAG Compliance

The first script reads color.json, resolves all the reference aliases (like color.text.primary → {color.primitive.gray.900} → #111827), and then tests every foreground color against every background color for WCAG 2.1 contrast compliance.
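The alias-resolution step can be sketched roughly like this. It is a minimal illustration, assuming DTCG-style `$value` fields and `{dot.path}` aliases; the helper names `getToken` and `resolveValue` are hypothetical and not taken from the actual script.

```javascript
// Minimal DTCG-style token tree containing only the two tokens named in the text.
const tokens = {
  color: {
    primitive: { gray: { 900: { $type: "color", $value: "#111827" } } },
    text: { primary: { $type: "color", $value: "{color.primitive.gray.900}" } },
  },
};

// Look up a token node by its dot path, e.g. "color.text.primary".
function getToken(tree, path) {
  return path.split(".").reduce((node, key) => (node ? node[key] : undefined), tree);
}

// Follow "{a.b.c}" aliases until a literal value is reached.
// `seen` guards against circular references.
function resolveValue(tree, path, seen = new Set()) {
  if (seen.has(path)) throw new Error(`Circular alias at ${path}`);
  seen.add(path);
  const token = getToken(tree, path);
  if (!token || token.$value === undefined) throw new Error(`Unknown token: ${path}`);
  const match = /^\{(.+)\}$/.exec(String(token.$value));
  return match ? resolveValue(tree, match[1], seen) : token.$value;
}

console.log(resolveValue(tokens, "color.text.primary")); // prints #111827
```

The cycle guard is there because a single bad alias would otherwise recurse until the stack overflows.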

This surfaced real issues immediately. color.text.tertiary (#9CA3AF, gray-400) on a white background has a contrast ratio of 2.54:1 — well below the 4.5:1 AA threshold. Our feedback colors were also problematic: color.feedback.success (Tailwind's green-500, #22C55E) on white is only 2.28:1. These are inherited from the Tailwind palette and would fail any accessibility audit.
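For reference, here is a small sketch of the contrast math behind those numbers, following the WCAG 2.x relative luminance and contrast ratio definitions. The function names are my own for illustration, and it assumes six-digit `#RRGGBB` input.

```javascript
// Convert "#RRGGBB" to WCAG 2.x relative luminance (0 = black, 1 = white).
function relativeLuminance(hex) {
  const channels = [1, 3, 5].map((i) => parseInt(hex.slice(i, i + 2), 16) / 255);
  const [r, g, b] = channels.map((c) =>
    c <= 0.03928 ? c / 12.92 : ((c + 0.055) / 1.055) ** 2.4
  );
  return 0.2126 * r + 0.7152 * g + 0.0722 * b;
}

// Contrast ratio: (lighter + 0.05) / (darker + 0.05), ranging from 1 to 21.
function contrastRatio(hexA, hexB) {
  const la = relativeLuminance(hexA);
  const lb = relativeLuminance(hexB);
  return (Math.max(la, lb) + 0.05) / (Math.min(la, lb) + 0.05);
}

console.log(contrastRatio("#9CA3AF", "#FFFFFF").toFixed(2)); // 2.54
console.log(contrastRatio("#22C55E", "#FFFFFF").toFixed(2)); // 2.28
```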

The script tests 98 pairings total. 50 pass, 8 pass for large text only, and 40 fail. Many of the "failures" are intentional design pairings: inverse text is meant for dark surfaces, so it is expected to fail when tested against a white background. But the script surfaces them all, and the team gets to decide what's intentional versus what needs fixing.
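The "every foreground against every background" audit and the three result buckets could look something like the sketch below. The thresholds are the WCAG 2.1 AA values (4.5:1 for normal text, 3:1 for large text). The function names, the record shape, and the stubbed `ratioOf` lookup are assumptions for illustration; the two ratios in the stub are the ones quoted above.

```javascript
// WCAG 2.1 AA thresholds: 4.5:1 for normal text, 3:1 for large text.
function gradeRatio(ratio) {
  if (ratio >= 4.5) return "pass";
  if (ratio >= 3) return "large-text-only";
  return "fail";
}

// Cross every foreground with every background.
// `ratioOf` is any function returning the contrast ratio for two hex colors.
function auditPairings(foregrounds, backgrounds, ratioOf) {
  return Object.entries(foregrounds).flatMap(([fgName, fgHex]) =>
    Object.entries(backgrounds).map(([bgName, bgHex]) => {
      const ratio = ratioOf(fgHex, bgHex);
      return { foreground: fgName, background: bgName, ratio, grade: gradeRatio(ratio) };
    })
  );
}

// Count results per grade for the report header.
function tally(results) {
  const counts = { pass: 0, "large-text-only": 0, fail: 0 };
  for (const { grade } of results) counts[grade] += 1;
  return counts;
}

// Stub lookup standing in for the real contrast calculation.
const knownRatios = { "#9CA3AF|#FFFFFF": 2.54, "#22C55E|#FFFFFF": 2.28 };
const ratioOf = (fg, bg) => knownRatios[`${fg}|${bg}`];

const results = auditPairings(
  { "color.text.tertiary": "#9CA3AF", "color.feedback.success": "#22C55E" },
  { "color.bg.primary": "#FFFFFF" },
  ratioOf
);
console.log(tally(results)); // { pass: 0, 'large-text-only': 0, fail: 2 }
```

Note that grading never filters anything out. Intentional failures land in the same `fail` bucket as real ones, which matches the "surface everything" approach.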

The output: a JSON report for tooling and a Markdown report for humans.

Script 2: Contract Testing

The second script is simpler but arguably more important for dev confidence. It takes a snapshot of every token's name, type, and value, saves it as a baseline, and then compares future changes against that baseline.

If someone removes color.text.primary, the contract script flags it as BREAKING — any component referencing that token will break. If someone renames a token (detected by matching type + value on a removed + added pair), it flags the probable rename. Value changes get flagged as WARNING (visual impact, not a crash).
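A rough sketch of that comparison, assuming each snapshot is a flat map of token name to `{ type, value }`. The function name, the finding shape, and the choice to keep a probable rename at BREAKING severity (the old name still disappears) are my assumptions, not confirmed behavior of the real script. Type-change detection is left out to keep it short.

```javascript
// Compare two snapshots shaped like { "token.name": { type, value } }.
function diffContract(baseline, current) {
  const findings = [];
  const removed = Object.keys(baseline).filter((name) => !(name in current));
  const added = Object.keys(current).filter((name) => !(name in baseline));
  const claimed = new Set(); // added tokens already matched to a removed one

  for (const name of removed) {
    const old = baseline[name];
    // Probable rename: an added token with the same type and value.
    const match = added.find(
      (candidate) =>
        !claimed.has(candidate) &&
        current[candidate].type === old.type &&
        current[candidate].value === old.value
    );
    if (match) claimed.add(match);
    findings.push({
      severity: "BREAKING",
      kind: match ? "probable-rename" : "removed",
      token: name,
      ...(match ? { renamedTo: match } : {}),
    });
  }

  // Same name, different value: visual impact only.
  for (const name of Object.keys(baseline)) {
    if (name in current && baseline[name].value !== current[name].value) {
      findings.push({ severity: "WARNING", kind: "value-changed", token: name });
    }
  }
  return findings;
}

const baseline = {
  "color.text.primary": { type: "color", value: "#111827" },
  "color.primitive.blue.600": { type: "color", value: "#2563EB" },
};
const current = {
  "color.primitive.blue.600": { type: "color", value: "#2563EC" },
};
console.log(diffContract(baseline, current));
```

Here the removed `color.text.primary` comes back as BREAKING and the edited `color.primitive.blue.600` as WARNING.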

The critical thing: this runs before the code ships. A developer reviewing the PR sees "BREAKING: 1 token removed" and can ask "was that intentional?" before merging.

Script 3: Token Impact Analyzer

The third script is the orchestrator. It builds a reference graph — which semantic tokens point to which primitives — and can trace the blast radius of any change. "If I change blue.600, what breaks?" Answer: color.brand.primary and color.text.link. Two tokens. That's your impact.
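The blast-radius trace can be sketched as a reverse reference graph plus a breadth-first walk. This assumes a flat map of token name to raw value with `{token.name}` aliases; `buildDependents` and `blastRadius` are hypothetical names, and the token set is trimmed to the ones mentioned in this post.

```javascript
// Flat map of token name -> raw value; aliases use the "{token.name}" form.
const tokens = {
  "color.primitive.blue.600": "#2563EB",
  "color.brand.primary": "{color.primitive.blue.600}",
  "color.text.link": "{color.primitive.blue.600}",
  "color.primitive.gray.900": "#111827",
  "color.text.primary": "{color.primitive.gray.900}",
};

// Reverse graph: for each token, which tokens reference it directly?
function buildDependents(flatTokens) {
  const dependents = new Map();
  for (const [name, value] of Object.entries(flatTokens)) {
    const match = /^\{(.+)\}$/.exec(String(value));
    if (!match) continue;
    if (!dependents.has(match[1])) dependents.set(match[1], []);
    dependents.get(match[1]).push(name);
  }
  return dependents;
}

// Breadth-first walk collecting direct and transitive dependents.
function blastRadius(dependents, changed) {
  const affected = new Set();
  const queue = [changed];
  while (queue.length > 0) {
    for (const name of dependents.get(queue.shift()) ?? []) {
      if (!affected.has(name)) {
        affected.add(name);
        queue.push(name);
      }
    }
  }
  return [...affected].sort();
}

console.log(blastRadius(buildDependents(tokens), "color.primitive.blue.600"));
// [ 'color.brand.primary', 'color.text.link' ]
```

The walk is transitive, so a semantic token that aliases another semantic token would still be counted.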

It also merges the WCAG and Contract reports into a single unified view with a confidence score: Full QA (all scripts ran), Partial QA (some scripts failed), or None (infrastructure issue). This is how the team knows whether to trust the pipeline's assessment.
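One way the confidence label could be derived from the upstream results is sketched below. The status strings and the rule that "no script completed" maps to None are assumptions on my part; the post only names the three levels.

```javascript
// statuses: one entry per upstream script, e.g. { wcag: "ok", contract: "failed" }.
function qaConfidence(statuses) {
  const results = Object.values(statuses);
  const completed = results.filter((status) => status === "ok").length;
  if (results.length > 0 && completed === results.length) return "Full QA";
  if (completed > 0) return "Partial QA";
  return "None"; // nothing completed: treat as an infrastructure issue
}

console.log(qaConfidence({ wcag: "ok", contract: "ok" })); // Full QA
console.log(qaConfidence({ wcag: "ok", contract: "failed" })); // Partial QA
console.log(qaConfidence({})); // None
```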

Part 2: The GitHub Actions Workflow

The scripts ran locally, but the point was never "run these manually." I built a GitHub Actions workflow that triggers on any PR touching tokens/:

1. Check out the PR branch and the base branch
2. Generate baselines from the base branch tokens (this is what "before" looks like)
3. Run all three scripts against the PR branch tokens (this is "after")
4. Build a summary comment and post it directly on the PR
5. Upload the full reports as downloadable artifacts
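The "build a summary comment" step could be as simple as the sketch below. The `summary` shape, the field names, and the exact wording of the comment are hypothetical; only the "Changed: N" and "BREAKING: N token removed" phrasings echo what the post describes. The sample numbers are the ones from the live test in Part 3.

```javascript
// Build the text body for a PR comment from the merged QA results.
// The `summary` shape here is an assumption for illustration.
function buildSummaryComment({ confidence, wcag, contract }) {
  const lines = [
    "Token QA Summary",
    "",
    `Confidence: ${confidence}`,
    `WCAG: ${wcag.regressions} contrast regressions, ${wcag.improvements} improvements`,
    `Contract: Removed: ${contract.removed} | Changed: ${contract.changed}`,
  ];
  if (contract.removed > 0) {
    const plural = contract.removed === 1 ? "" : "s";
    lines.push("", `BREAKING: ${contract.removed} token${plural} removed`);
  }
  return lines.join("\n");
}

console.log(
  buildSummaryComment({
    confidence: "Full QA",
    wcag: { regressions: 9, improvements: 2 },
    contract: { removed: 0, changed: 1 },
  })
);
```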

The first run had a bug — the baseline was being generated from the PR branch instead of main, so the contract script compared the PR against itself and saw no changes. A file copy command was putting the base branch tokens in the wrong directory. Fixed it, re-ran, and the comment correctly showed "Changed: 1" for the test token I'd modified.

The whole pipeline runs in about 8 to 12 seconds, measured from "PR opened" to "bot comment posted."

Part 3: Testing It Live

The real test was opening a PR with a one-character token change: #2563EB → #2563EC on color.primitive.blue.600. One hex digit.

The pipeline caught:

WCAG: 9 contrast regressions and 2 improvements from that single digit change
Contract: 1 value change detected (WARNING, not BREAKING)
Impact: Full QA confidence, both upstream scripts completed
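Why would one hex digit produce both regressions and improvements? A slightly lighter blue loses contrast against light backgrounds and gains it against dark ones. The sketch below shows that with the WCAG 2.x contrast formula. Treating any drop in ratio as a "regression" is my assumption about how the script classifies changes, and `classifyChange` is a hypothetical name.

```javascript
// WCAG 2.x relative luminance for a "#RRGGBB" color.
function luminance(hex) {
  const [r, g, b] = [1, 3, 5]
    .map((i) => parseInt(hex.slice(i, i + 2), 16) / 255)
    .map((c) => (c <= 0.03928 ? c / 12.92 : ((c + 0.055) / 1.055) ** 2.4));
  return 0.2126 * r + 0.7152 * g + 0.0722 * b;
}

// WCAG 2.x contrast ratio between two colors.
function contrast(a, b) {
  const la = luminance(a);
  const lb = luminance(b);
  return (Math.max(la, lb) + 0.05) / (Math.min(la, lb) + 0.05);
}

// Classify each background by whether contrast dropped or rose after the change.
function classifyChange(beforeHex, afterHex, backgrounds) {
  const result = { regressions: [], improvements: [] };
  for (const bg of backgrounds) {
    const delta = contrast(afterHex, bg) - contrast(beforeHex, bg);
    if (delta < 0) result.regressions.push(bg);
    else if (delta > 0) result.improvements.push(bg);
  }
  return result;
}

console.log(classifyChange("#2563EB", "#2563EC", ["#FFFFFF", "#111827"]));
// { regressions: [ '#FFFFFF' ], improvements: [ '#111827' ] }
```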

The bot posted a formatted comment on the PR with all three findings. A reviewer could look at that comment and know exactly what the change did, what it affected, and whether it introduced any accessibility problems — without opening a single file.

That's the POC moment for Phase 2. Same energy as Run #5 in Session 1, but now it's not just "does the build pass" — it's "is the change good."

Part 4: Teaching Claude to Remember

With the scripts shipped and the CI pipeline running, I hit a different kind of problem: context loss. Every new Claude session started with 15–20 minutes of re-explaining the repo structure, the three scripts, the naming conventions, the baseline swap pattern. Three sessions in and I was spending more time onboarding my AI collaborator than building.

So I built a Claude skill — a directive file (.claude/skills/design-system-ops/SKILL.md) that lives in the repo and loads automatically at the start of every session. It's the "cheat sheet" version of the playbook: repo structure, token format rules, what each script does, what not to touch, debugging table. No portability language, no setup guides — just the directives for working on this system.

The skill also self-checks for staleness. At the start of each session, Claude compares the phase table in the skill against the actual state of the repo. If something doesn't match — a new script was added, work is happening on a phase marked "Not started" — it flags the discrepancy before proceeding. Updates happen explicitly at phase boundaries, not mid-session. The directive stays stable while work is in progress.

The skill solved the session-to-session problem, but it raised a bigger question: where does the full context live? Not just "what files exist" — but the architectural vision, the phase roadmap, the decisions that shaped the system, the dead ends that informed what we didn't build.

That's where the playbook became something more than a reference doc. It became a master spec — a living document that anchors the entire project. It captures the original intent, the architecture, every iteration, and why things were built the way they were. The skill tells Claude how to work on this system. The playbook tells Claude what we're building and why it matters. One is a cheat sheet, the other is the source of truth.

The playbook lives in the repo's working directory but is listed in .gitignore, so it is never committed; it's proprietary. It's the kind of document that could anchor any design system pipeline, not just this one. The skill loads automatically and handles the operational context. The playbook gets fed manually when the session needs the full picture.

Three layers of context, each with a different job: the skill (operational directives, loads automatically), the playbook (master spec, fed when needed), and the diary (narrative record, fed when it's time to document). Together, they mean a new Claude session can go from zero to productive in under a minute instead of 15 to 20.

What I Learned

Build first, document second. Session 1 produced a lot of planning docs. This session produced working code. Both are necessary, but if I had to pick one to do first, it's the code. Documentation about a thing that doesn't exist yet is just speculation. Documentation about a thing you just watched work is a record.

The contract script matters more than I expected. Going in, I thought WCAG would be the star. It's the one with the most dramatic output — red failures, contrast ratios, accessibility implications. But the contract script is the one that will save a developer's weekend. Catching a deleted token before it hits production is worth more than a hundred contrast warnings.

Intentional failures aren't bugs. 40 of 98 WCAG pairings "fail." But color.text.inverse (white) on color.bg.primary (also white) should fail — inverse text isn't meant for light backgrounds. The script's job isn't to judge intent. It surfaces everything, and the team decides. That's the "advisory mode" philosophy.

File operations are the hardest part of CI. The actual script logic — contrast math, reference resolution, graph traversal — was straightforward. What took the most debugging was getting the GitHub Actions workflow to correctly swap token files between branches for baseline comparison. Same lesson as Session 1: the automation plumbing is harder than the logic.

A 12-second feedback loop changes behavior. When the pipeline takes 12 seconds to tell you what your change did, you start making smaller, more deliberate changes. You test before you push. The tool shapes the workflow.

Context loss is a design problem too. Building the pipeline is one kind of work. Making sure the next session can pick up where you left off without re-explaining everything is a different kind of work — and just as important. The skill file is infrastructure for the workflow, not the product.

Artifacts

scripts/wcag-agent.mjs — WCAG 2.1 contrast compliance auditor (98 pairings, AA/AAA grading). Deterministic — no AI.
scripts/contract-agent.mjs — Token API surface contract tester (removal, rename, type change detection). Deterministic — no AI.
scripts/token-analyzer.mjs — Reference graph builder, impact tracer, and upstream report merger. Deterministic — no AI.
pipeline.config.json — Per-script execution mode configuration (advisory/enforced/report-only/off)
.github/workflows/token-pipeline.yml — GitHub Actions workflow with PR commenting and artifact upload
.claude/skills/design-system-ops/SKILL.md — Claude skill directive for persistent session context

Three QA scripts built, tested, and deployed to CI. Pipeline runs on every PR that touches tokens. Bot posts summary comments. Claude skill shipped so future sessions start at full speed. No AI in the pipeline — these are deterministic scripts. AI enters in Phase 3.

Built with: Figma, Tokens Studio, GitHub Actions, Style Dictionary v4, Claude Code, and an unreasonable amount of patience for YAML indentation.
