
I joined a project with an existing Playwright E2E test suite, 38 spec files, ~165 tests, around 14,000 lines of test infrastructure. My first step was simple: run the tests locally.
8 out of 130 non-skipped tests passed. A 6% pass rate.
The confusing part? CI was green. It turned out CI ran everything with `workers: 1`; locally, multiple workers hitting the shared dev environment meant the tests just couldn't pass.
I had zero domain knowledge of this codebase. No context on why tests were written a certain way, what the custom wrappers did, or where the real problems were. So I started asking AI to analyze everything: the Playwright configs, the page objects, the spec files, the CI workflows. I asked questions to help me understand the codebase and to figure out what we could do to get tests running locally.
Over a few days, this produced 18 analysis documents covering architecture, root causes, anti-patterns, silent bugs, and test isolation.
The analysis phase was about building a map of a codebase I didn't understand. Every document was a question answered.
With the analysis done, I had a clear picture of what needed to change. But the question was: in what order, and how do you avoid a big refactor that breaks everything?
The answer was tracer bullets, a concept from The Pragmatic Programmer. The idea is to build a thin end-to-end slice through all the layers to prove the architecture works, then expand from there.
I created 8 tracer bullets, each targeting a specific slice:
The bullets covered slices like wiring projects with `dependencies: ['Setup']`, the `ownerOrg` → `ownerProject` setup chain, and replacing `globalSetup` with a proper setup project.

The key insight: the dependency graph told me which bullets could run in parallel. Bullets 1 and 2 were independent. Bullet 4 was independent. Bullet 3 depended on 1. This became important later when running multiple AI sessions.
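The batching logic is simple enough to sketch as a small helper. This is an illustration, not code from the project; the bullet names and `parallelBatches` function are hypothetical:

```typescript
// Group tracer bullets into batches: everything in a batch has all of its
// dependencies satisfied by earlier batches, so a batch can run in parallel.
type DepGraph = Record<string, string[]>;

function parallelBatches(deps: DepGraph): string[][] {
  const done = new Set<string>();
  const batches: string[][] = [];
  let remaining = Object.keys(deps);

  while (remaining.length > 0) {
    // Ready = every dependency already completed in an earlier batch.
    const ready = remaining.filter((b) => deps[b].every((d) => done.has(d)));
    if (ready.length === 0) throw new Error('Cycle in tracer-bullet dependencies');
    batches.push(ready);
    ready.forEach((b) => done.add(b));
    remaining = remaining.filter((b) => !done.has(b));
  }
  return batches;
}

// Bullets 1, 2 and 4 are independent; bullet 3 depends on bullet 1.
const batches = parallelBatches({
  'bullet-1': [],
  'bullet-2': [],
  'bullet-3': ['bullet-1'],
  'bullet-4': [],
});
// batches[0] holds the independent bullets; batches[1] holds bullet 3.
```

Each batch maps to a set of AI sessions that can safely run at the same time.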
Bullet 1 targeted a single file with 5 tests. The steps:
- Build the fixture chain (`currentUser` → `sharedOrg` → `project`)
- Migrate `projects-settings-general.spec.ts` to use the fixtures

Once I had a plan with all 33 tasks organized into phases, I needed something to work through them consistently: same process every time, same quality bar, same benchmarking. So I built a skill: `pw-test-improvement`.
The skill enforced a strict 7-step process for every change.
The skill had built-in knowledge: Playwright's locator priority (getByRole > getByLabel > getByText > ...), a list of anti-patterns to avoid (waitForTimeout, no-op assertions, CSS class selectors, forced clicks without justification), and migration patterns for replacing the Actions wrapper with direct Playwright calls.
It used the Playwright CLI to run tests directly and capture results.
The biggest change was moving from repeated beforeAll/afterAll blocks to Playwright fixtures. Before: each of 5 test files independently called getUser(), createOrg(), createProject() — 15 API calls total. After: worker-scoped fixtures shared across files — 7 calls total (53% reduction).
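A minimal sketch of the worker-scoped pattern. The helper names (`createOrg`, `createProject`, `deleteOrg`, `deleteProject`) and the `Org`/`Project` shapes are assumptions standing in for the project's real API client:

```typescript
// fixtures.ts: worker-scoped fixtures shared across spec files.
import { test as base } from '@playwright/test';

type Org = { id: string };
type Project = { id: string };

// Assumed API-client helpers (not part of Playwright).
declare function createOrg(): Promise<Org>;
declare function createProject(orgId: string): Promise<Project>;
declare function deleteOrg(id: string): Promise<void>;
declare function deleteProject(id: string): Promise<void>;

type WorkerFixtures = { sharedOrg: Org; sharedProject: Project };

export const test = base.extend<{}, WorkerFixtures>({
  // Created once per worker and shared by every spec file that worker
  // runs: one createOrg() instead of one per file's beforeAll.
  sharedOrg: [
    async ({}, use) => {
      const org = await createOrg();
      await use(org);           // all tests in this worker run here
      await deleteOrg(org.id);  // teardown replaces manual try/finally
    },
    { scope: 'worker' },
  ],
  sharedProject: [
    async ({ sharedOrg }, use) => {
      const project = await createProject(sharedOrg.id);
      await use(project);
      await deleteProject(project.id);
    },
    { scope: 'worker' },
  ],
});
```

Tests then import this `test` instead of Playwright's, and `sharedOrg`/`sharedProject` arrive as ordinary fixture arguments; cleanup after `use()` runs automatically when the worker shuts down.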
The key distinction was worker-scoped vs test-scoped:
- Worker-scoped (`{ scope: 'worker' }`): created once, shared across all tests in that worker. Good for expensive setup like orgs and projects.
- Test-scoped (the default): created fresh for every test. Good for anything a test mutates.

The Playwright config went from one project running all 38 spec files to 7 projects, each pointing to its MFE folder:
```ts
{ name: 'Applications',  testDir: 'apps/ui/applications/e2e',  dependencies: ['Setup'] },
{ name: 'Organizations', testDir: 'apps/ui/organizations/e2e', dependencies: ['Setup'] },
{ name: 'Projects',      testDir: 'apps/ui/projects/e2e',      dependencies: ['Setup'] },
// ... Subscriptions, Host, User Profile
```
This meant you could run `--project=Applications` to test just what you need, HTML reports were grouped by area, and heavy specs got their own parallelism settings.
4 actual test failures looked like 57. Application tests used serial mode, so when the first test failed, all subsequent tests in that describe block were marked "did not run." The fix: split heavy specs into a dedicated project, increase timeouts (30s → 60s for beforeAll), cap workers to prevent API overload, and use worker-scoped fixtures to share expensive setup.
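The config side of that fix can be sketched as follows; the paths, the heavy-spec split, and the worker cap of 4 are assumptions for illustration:

```typescript
// playwright.config.ts excerpt: isolate heavy specs, raise their timeout,
// and cap workers so the shared API isn't overloaded.
import { defineConfig } from '@playwright/test';

export default defineConfig({
  workers: 4, // cap to prevent API overload (number assumed)
  projects: [
    { name: 'Setup', testMatch: /global\.setup\.ts/ },
    {
      name: 'Applications-heavy',                 // heavy specs isolated here
      testDir: 'apps/ui/applications/e2e/heavy',  // assumed path
      dependencies: ['Setup'],
      timeout: 60_000,                            // 30s -> 60s
    },
    {
      name: 'Applications',
      testDir: 'apps/ui/applications/e2e',
      testIgnore: /heavy/,                        // keep heavy specs out
      dependencies: ['Setup'],
    },
  ],
});
```

With the split in place, a failure in a heavy spec no longer cascades into the rest of the Applications suite.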
Not everything worked the first time.
The cleanup project broke CI. We added a teardown project with Playwright's project dependencies to clean up test data after runs. It worked locally. In CI, it caused failures — the cleanup ran against a shared environment and interfered with other pipelines. Had to revert it.
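For reference, the pattern we tried looked roughly like this, using Playwright's project `teardown` option; the project names and file patterns are illustrative:

```typescript
// playwright.config.ts excerpt: a teardown project that runs after all
// projects depending on 'Setup' finish. Worked locally; reverted in CI
// because it cleaned up data in an environment shared with other pipelines.
import { defineConfig } from '@playwright/test';

export default defineConfig({
  projects: [
    { name: 'Setup', testMatch: /global\.setup\.ts/, teardown: 'Cleanup' },
    { name: 'Cleanup', testMatch: /global\.teardown\.ts/ },
    { name: 'Projects', testDir: 'apps/ui/projects/e2e', dependencies: ['Setup'] },
  ],
});
```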
Not everything should be a fixture. We tried converting everything to fixtures. After reviewing the Playwright docs, we rejected one conversion before making it: worker-scoped fixtures are shared across files, which would have polluted serial tests that need per-file isolation and different options.
This wasn't "tell AI to fix it." It was a collaboration process:
Ask questions relentlessly — "What does this method do?" "Why is this test flaky?" "According to the Playwright docs we can do X; can you verify your suggestion against the docs?" I asked hundreds of questions during the analysis phase, which lasted a few days.
Challenge every suggestion — "Are you sure? What about edge case X?" If the AI suggested a pattern, I'd ask it to explain why and whether it was confident it was a good approach.
Use docs as ground truth — I'd link to the Playwright docs and ask "does this align with what's in the docs?" The AI's training data can be outdated; the docs are current.
Validate with multiple tools — I used Goose, Claude Code, and GitHub Copilot. Different tools catch different blind spots and have different opinions, just as different teammates do.
Check confidence explicitly — "What's your confidence level on this? Why only a 7? How can we get to a 10?" This surfaces uncertainty the AI might not volunteer, and digs into what we haven't thought about and how we can improve.
I ran up to 4 AI sessions in parallel — based on which tracer bullets were independent of each other. The dependency graph from the implementation plan told me what could safely run at the same time.
I'd switch between sessions to check progress, read through what was being changed, and step in when something needed verifying. The AI did the mechanical work: applying patterns, running tests, capturing benchmarks. I did the oversight: deciding what to fix next, catching when a suggestion didn't look right, and verifying against the actual Playwright docs.
Never more than 4 at a time. I wanted to read and understand everything that was happening.
| Metric | Before | After | Change |
|---|---|---|---|
| API calls per file | 15 | 7 | 53% reduction |
| UI test setup lines | 8 | 3 | 62% reduction |
| API setup/cleanup lines | 15 | 3 | 80% reduction |
| Files with manual try/finally | 15 | 0 | Fixtures handle it |
| Boilerplate removed | — | — | ~1,000 lines |
About testing:
- Web-first assertions (`expect(locator)`) catch timing issues that manual checks miss
- Worker-scoped fixtures beat repeated `beforeAll`/`afterAll` setup
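A minimal illustration of the web-first difference; the URL and heading are placeholders:

```typescript
import { test, expect } from '@playwright/test';

test('web-first assertion waits for the UI', async ({ page }) => {
  await page.goto('https://example.com'); // placeholder URL

  // Manual check (anti-pattern): reads the DOM once, so it races the UI
  // and fails if the heading renders a moment later.
  // const text = await page.locator('h1').textContent();
  // expect(text).toContain('Example');

  // Web-first: expect(locator) re-checks until it passes or times out.
  await expect(page.locator('h1')).toHaveText(/Example/);
});
```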
About working with AI: