Why Your Test Suite Lies to You at Scale

May 16, 2026

New to Playwright reliability? Start with the fundamentals: Flaky Tests You Can’t Fix With Better Selectors — the same concepts with more explanation and simpler examples.

Green tests and broken production is a specific failure mode that gets more common as test suites grow. The locators are right, the assertions are correct, the mocks return the expected data — and none of it reflects what the system actually does under load, with real network conditions, against a real database.

This article covers three architectural problems that cause this: API non-idempotency, mock drift, and data accumulation. Each is invisible at small scale. Each becomes expensive at large scale.

Code examples are intentionally simplified — focus on the architectural pattern.

The Failure Mode Nobody Talks About

Most flakiness guides focus on selectors and timing. That’s the visible layer. The invisible layer is data and integration:

A POST request succeeds on the server, the response is lost in transit, Playwright retries, the server creates a second record. Your test now has two orders instead of one, and the assertion that checks order count fails — not because the feature is broken, but because the network hiccuped.
Your mock returns { order_id: "123" }. The backend deployed last Tuesday and now returns { orderId: "123" }. Tests are green. The field your frontend reads is undefined. Production is broken.
Tests create 100 users per minute. Nobody cleans up reliably. Two weeks later, unique constraint violations start appearing in unrelated tests. The database that was supposed to be isolated is shared state in disguise.

These aren’t test bugs. They’re architectural gaps. And they require architectural solutions.

Idempotency: Making POST Requests Safe to Retry

The standard mental model of HTTP: a request either succeeds or fails. The reality: a request can succeed on the server and fail to deliver the response. The client sees a timeout and retries. The server sees a new request.

For GET requests this is harmless. For POST requests that create or modify state, it creates duplicates.

The solution: idempotency keys

An idempotency key is a client-generated identifier that the server uses to detect duplicate requests. If the server has processed a request with this key before, it returns the cached result instead of processing again.

The key design question is how to generate the key. A static key per test fails when a test makes multiple POST requests — the server treats the second request as a duplicate of the first. A random UUID per request defeats the purpose — retries get new keys and bypass the deduplication.

The correct approach: derive the key deterministically from the request context.

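import { createHash } from 'crypto';
import type { APIRequestContext } from '@playwright/test';

// Same method + URL + body always yields the same key, so the server can recognise a retry
export function generateIdempotencyKey(method: string, url: string, data: unknown): string {
  const payload = `${method}:${url}:${JSON.stringify(data)}`;
  return createHash('sha256').update(payload).digest('hex').slice(0, 16);
}

export abstract class BaseApiClient {
  // The request context is injected, typically from a Playwright fixture
  constructor(protected readonly request: APIRequestContext) {}

  protected async post(url: string, data?: unknown) {
    const key = generateIdempotencyKey('POST', url, data);
    return await this.request.post(url, {
      data,
      headers: { 'X-Idempotency-Key': key },
    });
  }
}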

Two calls to createUser with identical data get identical keys — the server deduplicates. Two calls with different data (create user, then create order) get different keys — both process correctly.

Important nuance: if your test legitimately needs two identical records (same method, URL, and body), they’ll get the same key — and the server will return the cached result for the second call. This is correct behaviour for retries, but it means this approach assumes each unique operation has unique data. If you genuinely need two identical resources, add a distinguishing field (like a requestId or timestamp) to the body.
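One more assumption hides in JSON.stringify: it serializes keys in insertion order, so two bodies with the same fields in a different order would hash to different keys. The sketch below is one possible way to guard against that by sorting keys before hashing. The names canonicalize and stableIdempotencyKey are illustrative, not part of the original client, and the helper assumes plain JSON data (no Date or Map values).

```typescript
import { createHash } from 'node:crypto';

// Recursively sort object keys so logically equal bodies serialize identically.
// Assumes plain JSON values: objects, arrays, strings, numbers, booleans, null.
function canonicalize(value: unknown): unknown {
  if (Array.isArray(value)) return value.map(canonicalize);
  if (value !== null && typeof value === 'object') {
    const source = value as Record<string, unknown>;
    const sorted: Record<string, unknown> = {};
    for (const key of Object.keys(source).sort()) {
      sorted[key] = canonicalize(source[key]);
    }
    return sorted;
  }
  return value;
}

// Hypothetical variant of generateIdempotencyKey that is insensitive to key order.
export function stableIdempotencyKey(method: string, url: string, data: unknown): string {
  const payload = `${method}:${url}:${JSON.stringify(canonicalize(data))}`;
  return createHash('sha256').update(payload).digest('hex').slice(0, 16);
}

const keyA = stableIdempotencyKey('POST', '/orders', { sku: 'A1', qty: 2 });
const keyB = stableIdempotencyKey('POST', '/orders', { qty: 2, sku: 'A1' }); // same data, different order
const keyC = stableIdempotencyKey('POST', '/orders', { sku: 'A1', qty: 3 }); // different data
```

Here keyA and keyB are equal while keyC differs, which is the behaviour a retry-safe key needs when request bodies are assembled by different code paths.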

The backend requirement: this only works if the server implements idempotency key handling. Most payment APIs support this natively, though header names differ (Stripe uses Idempotency-Key, PayPal uses PayPal-Request-Id). If your payment provider doesn’t, that’s their problem to solve, not yours: use WireMock to mock them, or find their sandbox/test mode. If it’s your own internal backend that’s missing support, that’s a tech-debt conversation with your backend team. The pattern is well-documented and the database cost is minimal: store the key together with the response it produced (plus a hash of the request body, so a reused key with a different payload can be rejected), and expire entries after 24 hours.

The network failure scenario:

Client → POST /orders (key: abc123) → Server processes, creates order
Server → Response lost in transit
Client → Timeout, retry POST /orders (key: abc123) → Server returns cached response
Result: One order, correct state

Without idempotency keys, the retry creates a second order. Your test’s assertion that checks order count fails, and you spend an hour investigating a “bug” that is actually a network reliability issue.
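To make the scenario concrete, here is a minimal in-memory sketch of the server side, assuming a simple key-to-response store with a 24-hour expiry. FakeOrderServer and handlePost are hypothetical names used only for illustration; a real backend would keep the store in a database and make the check-and-insert atomic so that concurrent duplicates are also caught.

```typescript
interface OrderResponse {
  status: number;
  body: { orderId: number };
}

interface StoredResponse extends OrderResponse {
  expiresAt: number;
}

// Hypothetical stand-in for a backend that honours idempotency keys.
class FakeOrderServer {
  private orders: number[] = [];
  private seen = new Map<string, StoredResponse>();
  private ttlMs = 24 * 60 * 60 * 1000; // assumed 24-hour retention

  handlePost(idempotencyKey: string, now: number = Date.now()): OrderResponse {
    const cached = this.seen.get(idempotencyKey);
    if (cached && cached.expiresAt > now) {
      // Duplicate request: return the stored response, create nothing
      return { status: cached.status, body: cached.body };
    }
    const orderId = this.orders.length + 1;
    this.orders.push(orderId);
    const response: OrderResponse = { status: 201, body: { orderId } };
    this.seen.set(idempotencyKey, { ...response, expiresAt: now + this.ttlMs });
    return response;
  }

  get orderCount(): number {
    return this.orders.length;
  }
}

const server = new FakeOrderServer();
server.handlePost('abc123'); // first attempt: processed, response "lost in transit"
const retried = server.handlePost('abc123'); // retry with the same key
const unrelated = server.handlePost('def456'); // a genuinely different request
```

After these three calls the fake server holds two orders, and the retry received the same orderId as the first attempt.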

Mock Architecture: Three Levels, Three Use Cases

The mistake teams make is treating mocking as a single tool. page.route for everything. Then wondering why server-side failures aren’t caught.

Level 1: Native mocks (page.route)

page.route intercepts requests made from inside the browser context. It’s the right tool for testing UI behavior in isolation.

// Testing error state UI
await page.route('**/api/orders', (route) => {
  route.fulfill({ status: 503, body: JSON.stringify({ error: 'Service unavailable' }) });
});

await page.goto('/orders');
await expect(page.getByRole('alert')).toContainText('Service unavailable');

The architectural boundary: page.route cannot intercept requests made via Playwright’s request fixture, or any server-to-server calls your backend makes. Those requests originate outside the browser context.

How the request is made	Intercepted by `page.route`?
`page.goto()`, `page.click()` — browser navigation	✅ Yes
`page.evaluate(() => fetch('/api/...'))` — fetch inside browser	✅ Yes
`page.request.get('/api/...')` — APIRequestContext tied to the page (runs in Node.js, shares cookies with the browser context)	❌ No
`request.get('/api/...')` — standalone `request` fixture (Node.js)	❌ No
Backend server-to-server calls (Stripe, etc.)	❌ No

The distinction is browser context vs Node.js context — not UI vs API.

Why route.fulfill() instead of route.abort()? abort() causes the request to fail with a network error. Well-written apps handle this gracefully, but many enter an infinite retry loop waiting for a response that never comes. fulfill() returns a proper HTTP response — even a synthetic one — so the app moves on cleanly. Use abort() only when you specifically want to test network error handling.

Level 2: Infrastructure mocks (WireMock)

Server-to-server integrations — payment processors, SMS gateways, shipping APIs — need to be mocked at the network level, not the browser level.

services:
  wiremock:
    image: wiremock/wiremock:3.3.1
    ports:
      - '8080:8080'
    volumes:
      - ./wiremock/mappings:/home/wiremock/mappings
    command: ['--global-response-templating', '--verbose']

{
  "request": {
    "method": "POST",
    "urlPattern": "/v1/payment_intents"
  },
  "response": {
    "status": 200,
    "jsonBody": {
      "id": "pi_{{randomValue length=24 type='ALPHANUMERIC'}}",
      "status": "succeeded",
      "amount": "{{jsonPath request.body '$.amount'}}"
    },
    "transformers": ["response-template"]
  }
}

Response templating lets WireMock echo back request values, making mocks feel more realistic without hardcoding specific values. Point your backend’s external API base URLs to localhost:8080 via environment variables, and the backend never makes real external calls in tests.

One prerequisite: your backend needs to use configurable base URLs for external services — not hardcoded production endpoints. In well-structured backends this is already the case. If it’s not, that’s a refactor worth doing regardless of testing — hardcoded external URLs are a deployment problem too.

Level 3: Contract testing (Pact)

WireMock solves availability. It doesn’t solve drift. Your WireMock mapping can become outdated the moment the real API changes. This is the Lying Mock problem — and it requires a different solution.

Consumer-Driven Contract Testing (CDC) creates a formal, verifiable link between your test expectations and the provider’s actual implementation.

import { PactV3, MatchersV3 } from '@pact-foundation/pact';

const { like, string, integer } = MatchersV3;

const provider = new PactV3({
  consumer: 'test-suite',
  provider: 'order-service',
  dir: './pacts',
  logLevel: 'warn',
});

describe('Order Service contract', () => {
  it('returns order details', async () => {
    await provider
      .given('order ord_123 exists')
      .uponReceiving('GET /orders/ord_123')
      .withRequest({
        method: 'GET',
        path: '/orders/ord_123',
        headers: { Authorization: like('Bearer token') },
      })
      .willRespondWith({
        status: 200,
        body: {
          order_id: string('ord_123'), // field name is part of the contract
          status: string('CONFIRMED'),
          total: integer(4999),
        },
      })
      .executeTest(async (mockServer) => {
        const order = await fetchOrder(mockServer.url, 'ord_123');
        expect(order.status).toBe('CONFIRMED');
      });
  });
});

This test runs against a local mock server and generates a ./pacts/test-suite-order-service.json contract file. The consumer pipeline publishes this contract to a Pact Broker, and the backend team runs verification against their actual code:

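# On the provider side, in their CI pipeline
# GIT_COMMIT is an assumed CI variable; a provider version is required when publishing results
pact-provider-verifier \
  --provider-base-url http://localhost:8080 \
  --pact-broker-base-url https://your-pact-broker \
  --provider order-service \
  --provider-app-version "$GIT_COMMIT" \
  --publish-verification-results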

If the backend renames order_id to orderId, verification fails in their pipeline before the change merges. The contract breaks at the source, not in production.
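Conceptually, verification compares the shape the consumer recorded with what the provider actually returns. The toy check below illustrates that idea only; it is not how Pact implements verification, and Shape and findContractViolations are made-up names for this sketch.

```typescript
// Expected field names mapped to their expected primitive types.
type Shape = Record<string, 'string' | 'number' | 'boolean'>;

// Returns a human-readable list of mismatches between the expectation and a real response.
function findContractViolations(expected: Shape, actual: Record<string, unknown>): string[] {
  const violations: string[] = [];
  for (const [field, type] of Object.entries(expected)) {
    if (!(field in actual)) {
      violations.push(`missing field "${field}"`);
    } else if (typeof actual[field] !== type) {
      violations.push(`field "${field}" should be ${type}, got ${typeof actual[field]}`);
    }
  }
  return violations;
}

const expectedOrder: Shape = { order_id: 'string', status: 'string', total: 'number' };

// Provider response before the rename: matches the contract
const beforeRename = findContractViolations(expectedOrder, {
  order_id: 'ord_123',
  status: 'CONFIRMED',
  total: 4999,
});

// Provider response after order_id became orderId: the contract breaks
const afterRename = findContractViolations(expectedOrder, {
  orderId: 'ord_123',
  status: 'CONFIRMED',
  total: 4999,
});
```

The first call returns no violations and the second reports the missing order_id field, which is the signal that fails the provider's pipeline.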

The Pact Broker is optional but valuable — it stores contract versions, tracks which consumer-provider pairs are compatible, and enables the can-i-deploy check that blocks deployments when contracts are broken. For smaller teams, storing contract files in a shared repository works as a simpler alternative.

Where to start with contracts: don’t try to contract-test everything. Start with the API calls that have caused the most incidents, or the ones that change most frequently. One contract on your critical payment or order flow is immediately valuable. Expand from there.

The organizational reality: contract testing requires the backend team to run verification in their pipeline. This is a commitment from both sides, not just a technical decision. For small teams or teams without strong cross-team coordination, a simpler starting point is storing contract JSON files in the backend repo and running verification manually — no Pact Broker required. Also worth being explicit: contracts verify response structure and field names. They don’t catch business logic bugs, side effects, or behaviour changes that preserve the schema.

Data Hygiene: The Infrastructure Approach

afterEach(() => api.deleteUser(userId)) is the standard cleanup pattern. It has two failure modes that make it unreliable at scale:

If the test fails before userId is assigned, the hook runs with an undefined ID and deletes nothing
If the test runner itself crashes or is killed, afterAll and afterEach hooks don’t execute

The result: orphaned test data accumulates. Unique constraints start failing on unrelated tests. Query performance degrades. The “isolated” test database becomes shared state.

Approach 1: TTL at the database level

Add expires_at to all test-created entities and set it to a short window:

// In your base API client or fixture
// data is typed as an object so it can be spread safely
protected async createTestEntity(url: string, data: Record<string, unknown>) {
  return this.post(url, {
    ...data,
    expires_at: new Date(Date.now() + 24 * 60 * 60 * 1000).toISOString(),
    is_test: true,
  });
}

The database handles cleanup automatically. In PostgreSQL with pg_cron:

-- Install pg_cron extension once
-- Note: pg_cron may not be available on all managed PostgreSQL services (e.g. some cloud providers).
-- If unavailable, use a server-level cron job or a background worker instead.
CREATE EXTENSION IF NOT EXISTS pg_cron;

-- Schedule cleanup every hour
-- Child tables first (assuming payment_intents reference orders, and orders reference users)
SELECT cron.schedule('cleanup-test-entities', '0 * * * *', $$
  DELETE FROM payment_intents
  WHERE expires_at < NOW() AND is_test = true;

  DELETE FROM orders
  WHERE expires_at < NOW() AND is_test = true;

  DELETE FROM users
  WHERE expires_at < NOW() AND is_test = true;
$$);

In MongoDB, a TTL index handles this natively:

db.users.createIndex(
  { expires_at: 1 },
  { expireAfterSeconds: 0 }, // documents deleted at expires_at time
);

Approach 2: Cleanup queue with global teardown

For cases where TTL isn’t practical — databases that don’t support it, or entities that need ordered cleanup (delete orders before users, not after):

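import type { APIRequestContext } from '@playwright/test';

interface CleanupItem {
  url: string;
  id: string;
  priority: number; // higher priority = deleted first
}

class CleanupQueue {
  private items: CleanupItem[] = [];

  push(item: CleanupItem) {
    this.items.push(item);
  }

  async flush(request: APIRequestContext) {
    // Copy before sorting so the queue itself is not reordered in place
    const sorted = [...this.items].sort((a, b) => b.priority - a.priority);
    for (const item of sorted) {
      try {
        const response = await request.delete(`${item.url}/${item.id}`);
        // Playwright does not throw on 4xx/5xx responses, so check the status explicitly
        if (!response.ok()) {
          console.warn(`Cleanup returned ${response.status()} for ${item.url}/${item.id}`);
        }
      } catch {
        // Log but don't throw: cleanup failures shouldn't fail the suite
        console.warn(`Cleanup failed for ${item.url}/${item.id}`);
      }
    }
    this.items = [];
  }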
}

export const cleanupQueue = new CleanupQueue();

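import { request } from '@playwright/test';
import { cleanupQueue } from './cleanup/queue';

export default async function globalTeardown() {
  // API_BASE_URL is an assumed environment variable; use whatever your config provides
  const api = await request.newContext({ baseURL: process.env.API_BASE_URL });
  await cleanupQueue.flush(api);
  await api.dispose();
}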

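The cleanup queue survives individual test failures. Two things prevent it from running. The first is a full runner crash (SIGKILL, power loss). The second is process boundaries: Playwright runs tests in worker processes and globalTeardown in the main runner process, so an in-memory singleton filled inside a worker is empty by the time teardown imports it. Persist the queue somewhere both sides can reach (a file or a database table), or flush it from a worker-scoped fixture instead. In either failure case, the TTL approach serves as a second line of defense. This is why TTL should be your default: it operates at the database level, independently of your test process, and survives any kind of crash. The cleanup queue is a complement for ordered cleanup, not a replacement.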
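One way to let a queue cross process boundaries is to back it with an append-only file. The sketch below is a minimal illustration under that assumption: recordForCleanup and readCleanupItems are hypothetical helpers, and the JSON Lines file location is arbitrary. Workers would append as they create entities, and the teardown process would read, sort, and delete.

```typescript
import { appendFileSync, existsSync, mkdtempSync, readFileSync, rmSync } from 'node:fs';
import { tmpdir } from 'node:os';
import { join } from 'node:path';

interface CleanupItem {
  url: string;
  id: string;
  priority: number; // higher priority = deleted first
}

// Called from worker processes: one JSON document per line, appended to a shared file.
export function recordForCleanup(file: string, item: CleanupItem): void {
  appendFileSync(file, JSON.stringify(item) + '\n');
}

// Called from the teardown process: returns items ordered for deletion.
export function readCleanupItems(file: string): CleanupItem[] {
  if (!existsSync(file)) return [];
  return readFileSync(file, 'utf8')
    .split('\n')
    .filter((line) => line.trim() !== '')
    .map((line) => JSON.parse(line) as CleanupItem)
    .sort((a, b) => b.priority - a.priority);
}

// Demo with a temporary file standing in for the shared queue location
const queueFile = join(mkdtempSync(join(tmpdir(), 'cleanup-')), 'queue.jsonl');
recordForCleanup(queueFile, { url: '/users', id: 'u1', priority: 1 });
recordForCleanup(queueFile, { url: '/orders', id: 'o1', priority: 2 });
const pending = readCleanupItems(queueFile);
rmSync(queueFile); // teardown removes the file once everything is deleted
const afterCleanup = readCleanupItems(queueFile);
```

In this demo the order comes back before the user, matching the "delete orders before users" requirement, and a missing file is treated as an empty queue.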

Approach 3: Table partitioning for high-volume environments

When tests run continuously and create thousands of entities per hour, even scheduled deletes can become expensive. Deleting a million rows from a PostgreSQL table is a slow, lock-intensive operation.

Partitioning by date makes cleanup instantaneous — you drop a partition rather than deleting rows:

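-- Create partitioned table
CREATE TABLE orders_test (
  id UUID NOT NULL,
  created_at TIMESTAMPTZ NOT NULL,
  expires_at TIMESTAMPTZ,
  -- other fields
  -- A primary key on a partitioned table must include the partition key column
  PRIMARY KEY (id, created_at)
) PARTITION BY RANGE (created_at);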

-- Create monthly partitions
CREATE TABLE orders_test_2024_12
  PARTITION OF orders_test
  FOR VALUES FROM ('2024-12-01') TO ('2025-01-01');

CREATE TABLE orders_test_2025_01
  PARTITION OF orders_test
  FOR VALUES FROM ('2025-01-01') TO ('2025-02-01');

Dropping last month’s partition:

-- Near-instant: no rows are scanned. DROP TABLE does briefly lock the parent table.
DROP TABLE orders_test_2024_12;

This is worth the setup complexity when your test suite creates more than ~10K entities per day. Below that threshold, the TTL approach is simpler and sufficient.

Limitations worth knowing: declarative partitioning is native to PostgreSQL and well-supported, but MySQL’s implementation has more restrictions, and some ORMs handle partitioned tables poorly. More importantly, partitioning complicates migrations: schema changes on the parent table cascade to every partition, and primary keys and unique constraints must include the partition key. You also need to create future partitions in advance, either manually or via a scheduled job, because an insert that matches no partition fails unless a default partition exists. Don’t reach for this pattern unless you’re genuinely hitting performance problems with TTL-based cleanup.
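Creating future partitions from a scheduled job can be as simple as generating the DDL ahead of time. The following is a rough sketch, assuming monthly partitions named like the examples above; monthlyPartitionDdl and upcomingPartitions are hypothetical helpers, and the table name is assumed to come from trusted configuration since it is interpolated directly into SQL.

```typescript
function pad(n: number): string {
  return String(n).padStart(2, '0');
}

// Builds the DDL for one monthly partition. month is 1-12.
export function monthlyPartitionDdl(table: string, year: number, month: number): string {
  // Date.UTC handles the December -> January rollover for the upper bound
  const start = new Date(Date.UTC(year, month - 1, 1));
  const end = new Date(Date.UTC(year, month, 1));
  const fmt = (d: Date) => `${d.getUTCFullYear()}-${pad(d.getUTCMonth() + 1)}-01`;
  const name = `${table}_${start.getUTCFullYear()}_${pad(start.getUTCMonth() + 1)}`;
  return `CREATE TABLE IF NOT EXISTS ${name} PARTITION OF ${table} FOR VALUES FROM ('${fmt(start)}') TO ('${fmt(end)}');`;
}

// Returns DDL for `count` consecutive months, starting with the month containing `from`.
export function upcomingPartitions(table: string, from: Date, count: number): string[] {
  const statements: string[] = [];
  for (let i = 0; i < count; i++) {
    const d = new Date(Date.UTC(from.getUTCFullYear(), from.getUTCMonth() + i, 1));
    statements.push(monthlyPartitionDdl(table, d.getUTCFullYear(), d.getUTCMonth() + 1));
  }
  return statements;
}

const ddl = upcomingPartitions('orders_test', new Date(Date.UTC(2024, 11, 15)), 2);
```

For a date in December 2024 and a count of 2, this yields statements for orders_test_2024_12 and orders_test_2025_01, which a nightly job could run against the database.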

The Decision Framework

Situation	Right tool
Testing UI error states in isolation	`page.route`
Backend calls external payment/SMS API	WireMock
Backend API changes cause test failures	Contract tests (Pact)
Test creates < 10K entities/day	TTL + pg_cron
Cleanup order matters	Cleanup queue
Test creates > 10K entities/day	Table partitioning
POST request creates duplicates on retry	Idempotency keys

What This Solves

The patterns here don’t make individual tests faster or more readable. They make the test suite trustworthy at scale — which is a different problem.

A suite that’s trustworthy means: when tests are green, you can deploy with confidence. When tests fail, the failure points to a real problem, not a network hiccup or a stale mock. When a test fails in CI, you can reproduce it locally with the same data.

That’s the gap between a test suite that’s a liability and one that’s an asset.

Reference implementation: Playwright BDR Template

Flaky Tests You Can't Fix With Better Selectors

May 15, 2026

Dmitry

QA Automation Engineer

You’ve fixed your locators. You’ve switched to web-first assertions. Your tests still fail intermittently — but now the failures look different. Duplicate records in the database. Tests that pass alone but fail in parallel. Mocks that say everything is fine while production is broken.

This is the next layer of flakiness. It lives in your API calls, your test doubles, and your database. Better selectors won’t help here.

Code examples are simplified for clarity — focus on the idea, not the boilerplate.

TL;DR

Use idempotency keys on POST requests — one network glitch shouldn’t create two orders
page.route mocks the browser, not your server — know the difference
Use WireMock for server-to-server integrations you can’t control
Contract tests catch API drift before it reaches production
Never rely on afterEach for database cleanup — use TTL or a cleanup queue instead

The Problem: Flakiness That Looks Like Application Bugs

When a selector fails, the error is obvious. When a test creates a duplicate order because a network request was retried, the error looks like a business logic bug. You spend an hour investigating something that has nothing to do with your application code.

Three categories cause this:

API flakiness — a request succeeds on the server but the response never arrives. Playwright retries. Now you have two orders.

Lying mocks — your mocks say the API returns { order_id: "123" }. The backend deployed last week and now returns { orderId: "123" }. Tests are green. Production is broken.

Data pollution — tests create users, orders, and transactions but don’t clean up reliably. After a week, the test database is a graveyard that slows down queries and causes unique constraint violations.

Rule #1: Idempotency Keys — One Request, One Result

Networks are unreliable. A POST request can reach the server, create a record, and then the response gets lost in transit. Playwright sees a timeout and retries. The server sees a new request and creates another record.

The fix is an idempotency key — a unique header that tells the server “if you’ve seen this request before, return the same result instead of processing it again.”

import { createHash } from 'crypto';
import type { APIRequestContext } from '@playwright/test';

// Same method + URL + body always yields the same key
export function generateIdempotencyKey(method: string, url: string, data: unknown): string {
  const payload = `${method}:${url}:${JSON.stringify(data)}`;
  return createHash('sha256').update(payload).digest('hex').slice(0, 16);
}

export abstract class BaseApiClient {
  // The request context is injected, typically from a Playwright fixture
  constructor(protected readonly request: APIRequestContext) {}

  protected async post(url: string, data?: unknown) {
    const key = generateIdempotencyKey('POST', url, data);
    return await this.request.post(url, {
      data,
      headers: { 'X-Idempotency-Key': key },
    });
  }
}

The key is generated from the request method, URL, and body — so two identical requests get the same key, but two different requests (create user, then create order) get different keys. One network glitch no longer creates two records.

Important nuance: if your test legitimately creates two identical orders (same body, same URL), they’ll get the same key — and the server will return the first result for both. This is intentional behaviour for retries, but it means this approach assumes each unique operation has unique data. If you need two genuinely identical records, add a unique field (like requestId) to the body.
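A small sketch of the requestId idea, assuming the ID is generated once per logical operation and the same body is reused for every retry of that operation. keyFor mirrors the key derivation above, and newOrderBody is a hypothetical helper.

```typescript
import { createHash, randomUUID } from 'node:crypto';

// Same derivation as generateIdempotencyKey, repeated here to keep the sketch self-contained
function keyFor(method: string, url: string, data: unknown): string {
  return createHash('sha256')
    .update(`${method}:${url}:${JSON.stringify(data)}`)
    .digest('hex')
    .slice(0, 16);
}

// Generate the requestId once per logical operation, not once per attempt.
function newOrderBody(sku: string): { sku: string; requestId: string } {
  return { sku, requestId: randomUUID() };
}

const firstOrder = newOrderBody('A1');
const secondOrder = newOrderBody('A1'); // intentionally identical apart from requestId

const firstKey = keyFor('POST', '/api/orders', firstOrder);
const firstRetryKey = keyFor('POST', '/api/orders', firstOrder); // retry reuses the same body
const secondKey = keyFor('POST', '/api/orders', secondOrder);
```

The retry produces the same key as the first attempt, while the second order gets its own key. One caveat: if a retry re-runs the whole test, randomUUID produces a new value, so in that setup the requestId would need to come from something stable, such as the test's ID plus a counter.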

Note: This only works if your backend handles the X-Idempotency-Key header. Check with your backend team; many payment APIs support idempotency keys out of the box, sometimes under a different header name. If your payment provider doesn’t support it, that’s their problem to solve, not yours. Look for their sandbox or test mode, or use WireMock to mock them entirely. If it’s your own backend that’s missing support, that’s a tech-debt conversation with your backend team, not something to work around in tests.

Rule #2: Know What Your Mocks Actually Cover

page.route is Playwright’s built-in way to intercept requests. It’s great for testing UI behavior in isolation — how does the page look when the API returns an error?

// ✅ Good use of page.route — testing UI error state
await page.route('**/api/orders', (route) => {
  route.fulfill({
    status: 500,
    body: JSON.stringify({ error: 'Internal Server Error' }),
  });
});

await page.goto('/orders');
await expect(page.getByText('Something went wrong')).toBeVisible();

The catch: page.route only intercepts requests made from inside the browser. If your test makes API calls directly through Playwright’s request fixture, they are sent from the Node.js test process rather than the browser, and page.route won’t see them.

// This request bypasses page.route entirely
const apiResponse = await request.post('/api/orders', { data: orderData });

// But this goes through the browser context and IS intercepted by page.route
const browserResponse = await page.evaluate(() =>
  fetch('/api/orders', { method: 'POST' }).then((r) => r.json()),
);

Why route.fulfill() instead of route.abort()? abort() causes the request to fail with a network error. Some apps handle this gracefully, but others enter an infinite retry loop waiting for a response that never comes. fulfill() returns a proper HTTP response (even a fake one) so the app moves on cleanly.

For direct API calls in tests, you need mocks at a different level — either a wrapper around request, or an infrastructure mock like WireMock.

Rule #3: WireMock for Integrations You Don’t Control

Your backend calls Stripe for payments. It calls Twilio for SMS. It calls a shipping provider to get rates. In tests, you don’t want any of that to actually happen.

page.route can’t help here: these are server-to-server calls that never touch the browser. The solution is WireMock, a mock HTTP server that runs alongside your test environment and stands in for the external APIs your backend calls.

services:
  wiremock:
    image: wiremock/wiremock:3.3.1
    ports:
      - '8080:8080'
    volumes:
      - ./wiremock/mappings:/home/wiremock/mappings

{
  "request": {
    "method": "POST",
    "url": "/v1/payment_intents"
  },
  "response": {
    "status": 200,
    "jsonBody": {
      "id": "pi_test_123",
      "status": "succeeded"
    }
  }
}

Now your backend hits localhost:8080 in tests instead of the real Stripe API. Tests are fast, isolated, and don’t depend on external uptime.

Point your backend’s base URLs to WireMock via environment variables in your test environment:

STRIPE_BASE_URL=http://localhost:8080
TWILIO_BASE_URL=http://localhost:8080

One thing to be aware of: this requires your backend to use configurable base URLs for external services. In most well-structured backends this is already the case. If it’s not — that’s a conversation with the backend team, not a reason to skip WireMock.

Rule #4: Contract Tests — Stop Trusting Your Mocks

Here’s the problem with all mocks: they can lie. Your WireMock returns { payment_id: "pay_123" }. The backend team renames the field to paymentId. Your tests stay green. Production breaks.

This is called a Lying Mock — a test double that no longer matches reality.

Contract testing fixes this. Instead of just mocking the response, you write a contract: “I expect this request to return this response.” The backend then verifies that contract against its actual code.

import { PactV3, MatchersV3 } from '@pact-foundation/pact';

const provider = new PactV3({
  consumer: 'frontend-tests',
  provider: 'payment-service',
  dir: './pacts',
});

describe('Payment API contract', () => {
  it('returns payment confirmation', async () => {
    await provider
      .given('a valid payment intent exists')
      .uponReceiving('POST /v1/payment_intents')
      .withRequest({ method: 'POST', path: '/v1/payment_intents' })
      .willRespondWith({
        status: 200,
        body: {
          id: MatchersV3.string('pi_test_123'),
          status: MatchersV3.string('succeeded'),
        },
      })
      .executeTest(async (mockServer) => {
        const result = await createPayment(mockServer.url);
        expect(result.status).toBe('succeeded');
      });
  });
});

After this test runs, it generates a JSON contract file in ./pacts. The backend team runs that contract against their actual API. If they rename id to paymentId — the contract verification fails in their pipeline, before the change is merged.

Where to start: You don’t need to contract-test everything. Start with the APIs that change most often or have caused the most incidents. One contract on your payment flow is worth more than ten contracts on stable read-only endpoints.

One important caveat: contract testing requires the backend team to actually run the verification in their pipeline. This is an organizational commitment, not just a technical one. For small teams, storing contract JSON files in the backend repo and running verification manually is a simpler starting point than a full Pact Broker setup. Also note that contracts verify response structure — they don’t catch business logic bugs or side effects.

Rule #5: Stop Relying on `afterEach` for Cleanup

The classic approach to test data cleanup:

// ❌ Unreliable
afterEach(async () => {
  await api.deleteUser(userId);
});

This fails silently when a test crashes before setting userId. It doesn’t run when the test runner itself crashes. After a CI failure mid-run, your database has orphaned records that affect the next run.

Three approaches that actually work:

TTL — let the database clean up automatically

Add an expires_at field to your test entities and set it when creating them:

// When creating test data
await api.createUser({
  email: `test_${Date.now()}@example.com`,
  expires_at: new Date(Date.now() + 24 * 60 * 60 * 1000).toISOString(),
  is_test: true, // the cleanup job below filters on this flag
});

In PostgreSQL, a scheduled job handles cleanup:

-- Runs every hour via pg_cron
SELECT cron.schedule('cleanup-test-data', '0 * * * *', $$
  DELETE FROM users WHERE expires_at < NOW() AND is_test = true;
  DELETE FROM orders WHERE expires_at < NOW() AND is_test = true;
$$);

MongoDB handles this natively with a TTL index, and Redis with per-key expiry (EXPIRE); neither needs a cron job.

Cleanup queue — collect IDs, delete in bulk

Track everything your tests create, then clean it all up in a global teardown:

// In your base API client
protected async post(url: string, data?: unknown) {
  const response = await this.request.post(url, { data });
  const body = await response.json();

  if (body.id) {
    cleanupQueue.push({ url, id: body.id });
  }
  return response;
}

// global-teardown.ts
export default async function globalTeardown() {
  for (const item of cleanupQueue) {
    await api.delete(`${item.url}/${item.id}`);
  }
}

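Even if individual tests fail, the global teardown still runs. One caveat: Playwright executes tests in worker processes and globalTeardown in the main runner process, so the queue has to live somewhere both can reach (a file or a database table); a plain in-memory array filled in a worker will be empty at teardown.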

Which approach to use: TTL is the more reliable default — it works even if the test runner is killed with SIGKILL, because cleanup happens at the database level independently of your test process. Use TTL as your first line of defence. The cleanup queue is a good complement when you need guaranteed cleanup order or when your database doesn’t support scheduled jobs — but it won’t run if the process is hard-killed.

Putting It Together: The Data Reliability Cheat Sheet

Symptom	Root cause	Fix
Duplicate records after CI failure	No idempotency on POST requests	Add `X-Idempotency-Key` header
Tests green, production broken	Mocks don’t match real API	Add contract tests for critical endpoints
`page.route` mock not working	Request bypasses browser	Use WireMock or request wrapper
Database full of test garbage	`afterEach` cleanup unreliable	TTL field + pg_cron or cleanup queue
External API causing flakiness	Real network calls in tests	WireMock for server-to-server calls

What’s Next?

You now have three layers covered: test infrastructure, object lifecycle, and data reliability. The next layer is observability — how do you measure test health, identify patterns in flakiness, and prove to your manager that stability work has business value?

Want to go deeper on any of these topics? Check out the advanced version: Why Your Test Suite Lies to You at Scale

All patterns in this article are implemented in the Playwright BDR Template on GitHub.

Playwright Fixtures as a Dependency Injection Container: The Architecture That Scales

May 14, 2026

Dmitry

QA Automation Engineer

New to Playwright architecture? Start with the fundamentals: Your Playwright Tests Will Need Refactoring. Here’s How to Make It Painless — the same concepts with more explanation.

Most Playwright codebases start the same way: Page Objects instantiated with new inside tests, fixtures as an afterthought, test data seeded with workerIndex. This works at 50 tests. At 500, the maintenance cost becomes visible. At 1000, it becomes the primary engineering problem.

This article is about the architectural decisions that prevent that progression — specifically, treating Playwright’s fixture system as a proper DI container, not just a convenience wrapper around beforeEach.

Code examples are intentionally simplified — focus on the architectural pattern.

TL;DR

Fixtures are a DI container with lifecycle management, not just a beforeEach wrapper
Getters enforce statelessness architecturally, not stylistically
Seed from testId + RUN_ID + repeatEachIndex; workerIndex breaks across shards
Domain-split fixtures with namespacing eliminate silent collisions
Use the Builder pattern when factory overrides get unwieldy
Use test.step() for business-intent reporting

Three-Layer Architecture: POM, Flow, and Tests

Before diving into fixtures, it’s worth establishing the architectural model this article assumes. Most Playwright codebases that scale well use three layers:

Page Object (POM) — responsible for interacting with elements on a specific page: locators, clicks, form fills. Knows nothing about business logic or test scenarios.
Flow — describes complete business scenarios: “checkout”, “user registration”, “password reset”. Orchestrates Page Objects in the right sequence. The test calls checkoutFlow.submitOrder() and Flow handles which pages to visit, in what order, and what data to fill.
Test — declares intent. Reads like a specification: given this user, when this action, then this result.

This separation matters because changes are isolated: UI changes only touch Page Objects, process changes only touch Flows, tests remain stable. Fixtures are what make this architecture work — they manage the lifecycle of all three layers.
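The three layers can be sketched without Playwright at all. In the example below, PageLike is a hypothetical stand-in for the small part of Playwright's Page these classes would use, and the URLs and selectors are invented for illustration.

```typescript
// Minimal stand-in for the parts of a browser page this sketch needs (hypothetical)
interface PageLike {
  goto(url: string): Promise<void>;
  click(selector: string): Promise<void>;
}

// Page Object layer: knows elements and interactions, nothing about scenarios
class CartPage {
  private readonly page: PageLike;
  constructor(page: PageLike) {
    this.page = page;
  }
  async open(): Promise<void> {
    await this.page.goto('/cart');
  }
  async proceedToCheckout(): Promise<void> {
    await this.page.click('#checkout');
  }
}

class CheckoutPage {
  private readonly page: PageLike;
  constructor(page: PageLike) {
    this.page = page;
  }
  async placeOrder(): Promise<void> {
    await this.page.click('#place-order');
  }
}

// Flow layer: orchestrates Page Objects into a business scenario
class CheckoutFlow {
  private readonly cart: CartPage;
  private readonly checkout: CheckoutPage;
  constructor(cart: CartPage, checkout: CheckoutPage) {
    this.cart = cart;
    this.checkout = checkout;
  }
  async submitOrder(): Promise<void> {
    await this.cart.open();
    await this.cart.proceedToCheckout();
    await this.checkout.placeOrder();
  }
}

// "Test" layer: declares intent against a recording fake instead of a real browser
const actions: string[] = [];
const fakePage: PageLike = {
  goto: async (url) => {
    actions.push(`goto ${url}`);
  },
  click: async (selector) => {
    actions.push(`click ${selector}`);
  },
};
const flowDone = new CheckoutFlow(new CartPage(fakePage), new CheckoutPage(fakePage)).submitOrder();
```

A selector change touches only the Page Object, and a change in the order of steps touches only the Flow; the final line, which plays the role of the test, stays the same in both cases.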

Why `new` Inside Tests Is a Scaling Problem

The naive approach looks like this:

test('checkout flow', async ({ page }) => {
  const cartPage = new CartPage(page);
  const checkoutPage = new CheckoutPage(page);
  const checkoutFlow = new CheckoutFlow(cartPage, checkoutPage);

  await checkoutFlow.submitOrder();
});

At first glance this is fine — explicit, readable, no magic. The problem surfaces when CartPage needs a new dependency. Now every test that constructs CartPage needs updating. In a 500-test suite, that’s a multi-day refactor with non-trivial regression risk.

The deeper issue: this pattern makes the test responsible for dependency resolution. That’s not the test’s job.

Fixtures as a DI Container

Playwright’s fixture system is, architecturally, a dependency injection container with lifecycle management. The key insight is that fixtures compose:

export const test = base.extend({
  cartPage: async ({ page }, use) => {
    await use(new CartPage(page));
  },

  checkoutPage: async ({ page }, use) => {
    await use(new CheckoutPage(page));
  },

  // Playwright resolves dependencies automatically
  checkoutFlow: async ({ cartPage, checkoutPage }, use) => {
    const flow = new CheckoutFlow(cartPage, checkoutPage);
    await use(flow);
    await flow.cleanup(); // teardown guaranteed regardless of test outcome
  },
});

Playwright builds the dependency graph, resolves it in the correct order, and handles teardown. If five tests depend on cartPage, each test gets its own fresh instance, and no instance is shared between tests. The isolation is automatic.

The caching behavior matters: when multiple fixtures in the same test depend on the same fixture (e.g., both checkoutFlow and analyticsFlow depend on cartPage), Playwright creates exactly one cartPage instance for that test. This isn’t just an optimization — it means the two flows share state correctly, as they would in a real user session.
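The resolution and caching behavior described above can be modeled in a few lines of plain TypeScript. This is a toy sketch, not Playwright's implementation: FixtureDefs, resolveForTest, and the object shapes are hypothetical names chosen for illustration.

```typescript
// Toy model of per-test fixture resolution. NOT Playwright's code;
// FixtureDefs and resolveForTest are hypothetical names.
type Factory = (deps: Record<string, unknown>) => unknown;
type FixtureDefs = Record<string, { needs: string[]; create: Factory }>;

let created = 0; // counts cartPage instances for the demo

const defs: FixtureDefs = {
  cartPage: { needs: [], create: () => ({ id: ++created }) },
  checkoutFlow: { needs: ['cartPage'], create: (d) => ({ cart: d['cartPage'] }) },
  analyticsFlow: { needs: ['cartPage'], create: (d) => ({ cart: d['cartPage'] }) },
};

// Resolve the requested fixtures for ONE test. Each instance is cached, so
// every dependent inside that test receives the same object.
function resolveForTest(requested: string[]): Record<string, unknown> {
  const cache: Record<string, unknown> = {};
  const resolve = (name: string): unknown => {
    if (!(name in cache)) {
      const def = defs[name];
      const deps: Record<string, unknown> = {};
      for (const dep of def.needs) deps[dep] = resolve(dep);
      cache[name] = def.create(deps);
    }
    return cache[name];
  };
  for (const name of requested) resolve(name);
  return cache;
}

const testA = resolveForTest(['checkoutFlow', 'analyticsFlow']);
const testB = resolveForTest(['checkoutFlow']);

// Within one test, both flows hold the same cartPage object.
const sharedWithinTest =
  (testA['checkoutFlow'] as { cart: unknown }).cart ===
  (testA['analyticsFlow'] as { cart: unknown }).cart;

// Across tests, each resolution starts from an empty cache.
const sharedAcrossTests = testA['cartPage'] === testB['cartPage'];
```

In this model, two tests produce two cartPage objects in total, while the two flows inside the first test see a single one.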

The Lifecycle Argument for Fixtures

Here’s the argument that matters for long-lived codebases: use fixtures even when the object seems stateless today.

CheckoutFlow might be a pure orchestrator right now — no state, no side effects, no external connections. But requirements change:

Next sprint: Flow needs to track an order ID for verification
Month after: Flow opens a WebSocket for real-time updates
Quarter later: Flow acquires a distributed lock that must be released

Each of these changes requires teardown. If CheckoutFlow is created with new in 300 tests, adding teardown means touching 300 files. If it’s in a fixture, you add the teardown once, after the use() call:

checkoutFlow: async ({ cartPage, checkoutPage }, use) => {
  const flow = new CheckoutFlow(cartPage, checkoutPage);
  await use(flow);
  await flow.releaseLock(); // added once, applies everywhere
  await flow.closeConnection();
},

The fixture system gives you lifecycle management for free. The upfront investment is real — a few hours to set up the pattern properly. The cost of retrofitting it later: proportional to the number of tests.
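Once a fixture has more than one teardown step, it may be worth making the steps independent, so that a failure in one does not skip the next. The following is a minimal sketch under stated assumptions: DemoFlow, Use, and checkoutFlowFixture are hypothetical stand-ins that mimic the shape of a fixture body without importing Playwright.

```typescript
// Hypothetical flow with two cleanup steps; the lock release fails in this demo.
class DemoFlow {
  log: string[] = [];

  async releaseLock(): Promise<void> {
    this.log.push('releaseLock');
    throw new Error('lock service unavailable');
  }

  async closeConnection(): Promise<void> {
    this.log.push('closeConnection');
  }
}

type Use<T> = (value: T) => Promise<void>;

// Same shape as a fixture body: hand the object over with use(), then tear down.
async function checkoutFlowFixture(flow: DemoFlow, use: Use<DemoFlow>): Promise<void> {
  await use(flow);
  try {
    await flow.releaseLock();
  } finally {
    await flow.closeConnection(); // still runs when releaseLock() throws
  }
}

async function runDemo(): Promise<string[]> {
  const flow = new DemoFlow();
  try {
    await checkoutFlowFixture(flow, async () => {
      /* the test body would run here */
    });
  } catch {
    // The releaseLock error still propagates; it is swallowed here only for the demo.
  }
  return flow.log;
}
```

With plain sequential awaits, the connection would stay open whenever the lock release throws. The try/finally keeps the second step unconditional while still surfacing the first error.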

The Pragmatic Rule: When Fixtures Are Overkill

Everything above is an argument for fixtures. Here’s the counterargument, because a good architecture isn’t about dogma.

Fixtures make sense when an object needs one or more of the following:

Lifecycle management — setup before the test, teardown after
Shared dependencies — the object depends on page, request, or another fixture
Potential for state — today stateless, but realistically might not be tomorrow

When none of these apply, a fixture is unnecessary indirection. A pure utility function — one that takes inputs and returns outputs with no side effects and no browser context — doesn’t belong in a fixture system. It belongs in a module:

// Just a function — no fixture needed
export function formatOrderId(id: string): string {
  return `ORD-${id.toUpperCase()}`;
}

// Factory function — pure, no browser context
export function createUser(overrides?: Partial<User>): User {
  return { role: 'customer', discount: 0, ...overrides };
}

// Fixture — depends on page, has implicit lifecycle
cartPage: async ({ page }, use) => {
  await use(new CartPage(page));
},

The decision rule: if an object touches the browser context or has any chance of needing teardown as the codebase evolves — fixture. If it’s a pure function or a data factory with no external dependencies — just export it and call it directly.

For borderline objects, the cost of using a fixture you don’t strictly need is near zero, while the cost of skipping one you should have used is proportional to the number of tests you need to update.

Putting everything into fixtures because “it might need lifecycle someday” is the same mistake as premature optimization. It adds indirection without value and makes the codebase harder to read. The goal is judgment, not consistency for its own sake.

Lazy POM: Why Getters Beat Constructor Assignments

The standard Page Object pattern assigns locators in the constructor:

// Technically safe, architecturally suboptimal
class CartPage {
  private readonly submitButton: Locator;

  constructor(private page: Page) {
    this.submitButton = page.locator('button#submit');
  }
}

Playwright locators are lazy — they don’t query the DOM at construction time, they query it at the moment of interaction. So a locator assigned in the constructor won’t go stale: even if the DOM re-renders between construction and use, Playwright finds the element fresh when you call .click() or .isVisible(). This is technically fine.

The problem is what this pattern enables: the temptation to compute actual state in the constructor.

// This is a race condition bomb
constructor(page: Page) {
  (async () => {
    this.initialItemCount = await page.locator('.item').count();
  })();
}

The IIFE fires and is forgotten. The test accesses initialItemCount before the promise resolves. In a fast local environment this usually works. Under CI load with multiple workers competing for resources, it fails intermittently and is nearly impossible to reproduce.
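The race can be reproduced without a browser. In this sketch, fakeCount is a hypothetical stand-in for an async DOM query such as a locator count, and both classes are illustrative only.

```typescript
// Stand-in for an async DOM query; resolves after a short delay.
function fakeCount(): Promise<number> {
  return new Promise((resolve) => setTimeout(() => resolve(3), 10));
}

class RacyCart {
  initialItemCount: number | undefined;

  constructor() {
    // Fire-and-forget: the constructor returns before this assignment happens.
    (async () => {
      this.initialItemCount = await fakeCount();
    })();
  }
}

class ExplicitCart {
  // State is read on demand, and the caller has to await it.
  async getItemCount(): Promise<number> {
    return fakeCount();
  }
}

async function demo(): Promise<{ racy: number | undefined; explicit: number }> {
  const racy = new RacyCart().initialItemCount; // read right after construction
  const explicit = await new ExplicitCart().getItemCount();
  return { racy, explicit };
}
```

Reading the field immediately after construction yields undefined here, because the assignment is still pending. A test only sees a number if enough time happens to pass before the read, which is exactly the timing dependence that shows up under CI load.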

The architectural fix: getters enforce statelessness

class CartPage {
  constructor(private page: Page) {}

  // Evaluated fresh on every access — no state, no race conditions
  get submitButton() {
    return this.page.getByRole('button', { name: 'Place order' });
  }

  get items() {
    return this.page.locator('.cart-item');
  }

  // For computed state, return a promise explicitly
  async getItemCount(): Promise<number> {
    return this.items.count();
  }
}

With an empty constructor and getter-only members, there is nowhere to cache state at construction time. The Page Object is forced to be stateless: it can only describe how to find elements and interact with them, not what their current state is. Reading state is always an explicit async operation.

This is the State Trap pattern in reverse: instead of accidentally capturing a DOM snapshot at construction time, you’re architecturally prevented from doing so.

Deterministic Test Data at Scale

Using workerIndex as a faker seed is a common data isolation mistake. The reasoning seems sound: each worker gets a unique number, so data is unique. The failure mode is subtle.

On 10 parallel CI shards, each shard has its own “Worker 0”, “Worker 1”, etc. The workerIndex namespace is shard-local. A seed derived from workerIndex therefore repeats across shards: whatever runs on Shard 1 Worker 0 and whatever runs on Shard 2 Worker 0 draw the same sequence of generated values. In a shared database, this means collisions, and the kind of intermittent failures that look like application bugs.
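The collision is easy to see with any seeded generator. The sketch below uses a simple linear congruential step as a stand-in for a seeded faker; it is not faker's algorithm, and seededEmails and WorkerSlot are hypothetical names.

```typescript
// Stand-in for a seeded data generator; NOT faker's algorithm.
function seededEmails(seed: number, count: number): string[] {
  let state = seed >>> 0;
  const out: string[] = [];
  for (let i = 0; i < count; i++) {
    state = (Math.imul(state, 1664525) + 1013904223) >>> 0; // LCG step
    out.push(`user${state}@example.test`);
  }
  return out;
}

interface WorkerSlot {
  shard: number;
  workerIndex: number;
}

const slots: WorkerSlot[] = [
  { shard: 1, workerIndex: 0 },
  { shard: 1, workerIndex: 1 },
  { shard: 2, workerIndex: 0 }, // same index as the first worker on shard 1
];

// Every slot seeds from workerIndex alone, as in the mistaken setup.
const generated = slots.map((slot) => ({
  slot,
  emails: seededEmails(slot.workerIndex, 2),
}));

// Count how many slots would try to insert each email into a shared database.
const seen = new Map<string, number>();
for (const g of generated) {
  for (const email of g.emails) seen.set(email, (seen.get(email) ?? 0) + 1);
}
const collisions = Array.from(seen.entries())
  .filter(([, n]) => n > 1)
  .map(([email]) => email);
```

Both emails produced on shard 2's worker 0 duplicate the ones from shard 1's worker 0, so collisions contains two entries even though every workerIndex is unique within its own shard.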

The correct seed: combine test identity with CI build ID

import { test as base, type TestInfo } from '@playwright/test';
import { faker } from '@faker-js/faker';

function hashCode(str: string): number {
  return str.split('').reduce((acc, char) => {
    return (Math.imul(31, acc) + char.charCodeAt(0)) | 0;
  }, 0);
  // Note: not cryptographically secure, but collision probability is negligible
  // for the number of tests in any realistic suite — fine for faker seeding
}

export function seedFaker(testInfo: TestInfo): typeof faker {
  const RUN_ID = process.env.RUN_ID ?? 'local';

  // Three components:
  // testId: hash of file path + test name — unique per test, stable across runs
  // RUN_ID: CI build ID — different builds get different data
  // repeatEachIndex: repetition index under --repeat-each, so repeats get distinct data
  // (testInfo.retry is left out on purpose: a retry reuses the same seed and data)
  const seed = hashCode(`${testInfo.testId}-${RUN_ID}-${testInfo.repeatEachIndex}`);

  faker.seed(seed);
  return faker;
}

export const test = base.extend({
  faker: async ({}, use, testInfo) => {
    await use(seedFaker(testInfo));
  },
});

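The repeatEachIndex component is worth explaining. It is the repetition index Playwright assigns when the suite runs with --repeat-each; without it, every repetition of a test within one build would generate identical data and collide in a shared database. Retries are tracked separately, in testInfo.retry, which is deliberately left out of the seed. A retry may land on a different worker, but it computes the same seed and generates the same data as the original attempt, so a data-dependent failure stays reproducible. This assumes the first attempt's records are cleaned up, or that re-creating them is safe.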

The debugging payoff: when a test fails in CI, take the RUN_ID from the pipeline logs and run the test locally with RUN_ID=<value> npx playwright test -g "<test title>". As long as the test draws from faker in the same order, you get the exact data that was generated in CI. This transforms “I can’t reproduce this” into a reproducible failure in under a minute.
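The properties the seed is meant to have can be checked without Playwright. In this sketch, SeedInput is a hypothetical structural subset of TestInfo, computeSeed mirrors the seed expression above, and the test ID and build IDs are made-up values.

```typescript
// Same polynomial hash as in seedFaker, written as an index loop.
function hashCode(str: string): number {
  let acc = 0;
  for (let i = 0; i < str.length; i++) {
    acc = (Math.imul(31, acc) + str.charCodeAt(i)) | 0;
  }
  return acc;
}

// Only the fields relevant here; a hypothetical subset of Playwright's TestInfo.
interface SeedInput {
  testId: string;
  repeatEachIndex: number;
  retry: number;
}

function computeSeed(info: SeedInput, runId: string): number {
  // retry is not part of the seed string
  return hashCode(`${info.testId}-${runId}-${info.repeatEachIndex}`);
}

const firstAttempt: SeedInput = { testId: 'a1b2c3-d4e5f6', repeatEachIndex: 0, retry: 0 };
const secondAttempt: SeedInput = { ...firstAttempt, retry: 1 };
const secondRepeat: SeedInput = { ...firstAttempt, repeatEachIndex: 1 };

const ciSeed = computeSeed(firstAttempt, 'build-1041');
const retrySeed = computeSeed(secondAttempt, 'build-1041'); // same as ciSeed
const repeatSeed = computeSeed(secondRepeat, 'build-1041'); // differs from ciSeed
const nextBuildSeed = computeSeed(firstAttempt, 'build-1042'); // differs from ciSeed
const localReplaySeed = computeSeed(firstAttempt, 'build-1041'); // same as ciSeed
```

A retry and a local replay with the same RUN_ID land on the CI seed, while a new build or a --repeat-each repetition moves to a different one.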

Factory Pattern: Separating Structure From Noise

Random data everywhere obscures test intent. If a field doesn’t affect the outcome, it shouldn’t be visible in the test.

export interface User {
  id: string;
  email: string;
  name: string;
  role: 'customer' | 'vip' | 'admin';
  discount: number;
}

export function createUser(overrides?: Partial<User>, f: typeof faker = faker): User {
  return {
    id: f.string.uuid(),
    email: f.internet.email(),
    name: f.person.fullName(),
    role: 'customer',
    discount: 0,
    ...overrides,
  };
}

The factory provides structure and defaults. Overrides express what the test actually cares about:

// Only the meaningful fields are visible
test('VIP discount applied at checkout', async ({ checkoutFlow, faker }) => {
  const user = createUser({ role: 'vip', discount: 0.15 }, faker);
  const order = await checkoutFlow.asUser(user).checkout();

  expect(order.total).toBeCloseTo(order.subtotal * 0.85, 2); // tolerant of floating-point rounding
});

For business scenarios that repeat across multiple tests, extract named datasets rather than duplicating overrides:

export const VIP_USER = {
  role: 'vip',
  discount: 0.15,
} as const satisfies Partial<User>;

export const ADMIN_USER = {
  role: 'admin',
  discount: 0,
} as const satisfies Partial<User>;

// In tests — intent is immediately clear
const user = createUser({ ...VIP_USER }, faker);

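The satisfies operator here is doing real work: it validates that the dataset fields match the User type without widening the type. If someone renames or removes a User field, or narrows the allowed role values, and forgets to update the dataset, TypeScript catches it at compile time. Because the target is Partial<User>, a newly added required field is not flagged on the dataset; the factory's User return type covers that case.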
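A short sketch of what the compiler checks on such a dataset. The User interface is repeated so the snippet stands alone, and the misspelled entries are deliberate; each one is marked with @ts-expect-error so the file still compiles. It assumes TypeScript 4.9 or later.

```typescript
interface User {
  id: string;
  email: string;
  name: string;
  role: 'customer' | 'vip' | 'admin';
  discount: number;
}

const VIP_USER = { role: 'vip', discount: 0.15 } as const satisfies Partial<User>;

// The literal type survives: this line only compiles because VIP_USER.role
// is typed as 'vip' rather than widened to string.
const vipRole: 'vip' = VIP_USER.role;

// @ts-expect-error 'vpi' is not one of the allowed role values
const TYPO_VALUE = { role: 'vpi' } satisfies Partial<User>;

// @ts-expect-error 'rol' is not a property of User
const TYPO_KEY = { rol: 'vip' } satisfies Partial<User>;
```

The two rejected datasets would pass silently if they were declared as plain untyped objects and spread into the factory through a cast, which is the gap satisfies closes.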

When to consider the Builder pattern instead

The factory + overrides approach works well when objects are simple and combinations are limited. When complexity grows — a user with a role, subscription tier, notification preferences, and order history — the override object becomes unwieldy:

// Hard to read at a glance
const user = createUser(
  {
    role: 'vip',
    subscription: 'premium',
    notifications: { email: true, sms: false },
    orderCount: 3,
  },
  faker,
);

A Builder makes the same intent readable:

// Builder — reads like a specification
const user = new UserBuilder(faker)
  .asVip()
  .withPremiumSubscription()
  .withNotifications({ email: true, sms: false })
  .withOrderHistory(3)
  .build();

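// Assumes User has been extended with optional subscription,
// notifications, and orderCount fields
class UserBuilder {
  private overrides: Partial<User> = {};

  constructor(private f: typeof faker) {}

  asVip() {
    this.overrides.role = 'vip';
    this.overrides.discount = 0.15;
    return this;
  }

  withPremiumSubscription() {
    this.overrides.subscription = 'premium';
    return this;
  }

  withNotifications(prefs: { email: boolean; sms: boolean }) {
    this.overrides.notifications = prefs;
    return this;
  }

  withOrderHistory(count: number) {
    this.overrides.orderCount = count;
    return this;
  }

  // Delegates to the factory so defaults live in one place
  build(): User {
    return createUser(this.overrides, this.f);
  }
}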

The Builder delegates to the factory at the end — so you keep one source of truth for defaults, and the Builder just provides a fluent API for complex combinations. Use it when you have more than 3–4 meaningful combinations that appear repeatedly across tests. For simpler cases, the factory with overrides is less code and just as clear.

Scaling Fixtures: `mergeTests` and Namespacing

A single fixtures.ts file works until it doesn’t. The inflection point is usually around 15–20 fixtures, when multiple engineers are editing the same file simultaneously and merge conflicts become routine.

Domain-driven fixture splitting:

// auth.fixtures.ts
import { test as base } from '@playwright/test';
import { LoginPage, AdminPage } from '../pages';

type AuthFixtures = { loginPage: LoginPage; adminPage: AdminPage };

export const authTest = base.extend<AuthFixtures>({
  loginPage: async ({ page }, use) => {
    await use(new LoginPage(page));
  },
  adminPage: async ({ page }, use) => {
    await use(new AdminPage(page));
  }
});

// cart.fixtures.ts
type CartFixtures = { cartPage: CartPage; checkoutPage: CheckoutPage };

export const cartTest = base.extend<CartFixtures>({ ... });

// fixtures.ts — composition point
import { mergeTests } from '@playwright/test';
import { authTest } from './auth.fixtures';
import { cartTest } from './cart.fixtures';

export const test = mergeTests(authTest, cartTest);
export { expect } from '@playwright/test';

Tests import from fixtures.ts and see nothing change. The split is organizational, not behavioral.

The silent collision problem:

mergeTests doesn’t check for fixture name conflicts. If auth.fixtures.ts and billing.fixtures.ts both export a user fixture, the last one registered wins — silently. Tests that worked before mergeTests may start using a different user object without any error.

Namespacing eliminates this class of bug:

type AuthFixtures = {
  auth: {
    admin: Admin;
    user: User;
    guest: Guest;
  };
};

export const authTest = base.extend<AuthFixtures>({
  auth: async ({ page }, use) => {
    await use({
      admin: new Admin(page),
      user: new User(page),
      guest: new Guest(page),
    });
  },
});

// Fixture names can no longer collide across domains, as long as the namespaces themselves are unique
// auth.user vs billing.user: different namespaces, different objects
test('admin manages billing', async ({ auth, billing }) => {
  await auth.admin.login();
  await billing.user.subscribe();
});

The namespace also makes test code self-documenting: auth.admin vs billing.user is unambiguous in a way that two separate admin and user fixtures are not.
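The difference between a flat and a namespaced layout can be modeled with plain objects. This sketch does not call mergeTests; it uses object spread, which has the same last-definition-wins behavior described above. The fixture values and the findDuplicateKeys helper are hypothetical.

```typescript
// Flat fixture maps from two domains, modeled as plain objects.
const authFixtures = { user: 'auth-user', loginPage: 'login-page' };
const billingFixtures = { user: 'billing-user', invoicePage: 'invoice-page' };

// Flat merge: the later definition replaces the earlier one without any warning.
const flat = { ...authFixtures, ...billingFixtures };

// Namespaced layout: both users survive under their own domain key.
const namespaced = {
  auth: { user: 'auth-user' },
  billing: { user: 'billing-user' },
};

// Hypothetical guard that lists names defined by more than one domain.
function findDuplicateKeys(...maps: Array<Record<string, unknown>>): string[] {
  const counts = new Map<string, number>();
  for (const map of maps) {
    for (const key of Object.keys(map)) counts.set(key, (counts.get(key) ?? 0) + 1);
  }
  return Array.from(counts.entries())
    .filter(([, n]) => n > 1)
    .map(([key]) => key);
}

const duplicates = findDuplicateKeys(authFixtures, billingFixtures);
```

Here flat.user ends up as the billing value, and duplicates contains only 'user'. A guard of this kind could run over the domain fixture names before they are merged, for teams that prefer to keep a flat namespace.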

Business Steps: test.step and BDR

The quality of step descriptions in your reports determines how useful they are for debugging. The native Playwright tool is test.step():

// Technical log — breaks when implementation changes
async login() {
  await test.step('Click the login button', async () => {
    await this.page.getByRole('button', { name: 'Login' }).click();
  });
}

// Business intent — survives refactoring
async loginAs(user: User) {
  await test.step(`Authenticate as "${user.username}"`, async () => {
    await this.loginPage.login(user.username, user.password);
  });
}

The second version remains valid even if the login mechanism changes from a form to SSO. The report reads like a scenario, not a sequence of DOM operations.

In BDR methodology, this pattern is formalized with a @Step decorator that wraps methods automatically — eliminating the manual test.step() wrapping. If you’re building at scale and want cleaner syntax, it’s worth exploring.

ESLint: Architectural Enforcement

The fixture architecture only works if objects aren’t created with new inside tests. Document the rule in code:

module.exports = {
  overrides: [
    {
      // Scoped to test files only, so app-code classes whose names end in Page or Flow are not flagged
      files: ['tests/**/*.ts', '**/*.spec.ts'],
      rules: {
        'no-restricted-syntax': [
          'error',
          {
            selector: 'NewExpression[callee.name=/.*Page$/]',
            message: 'Instantiate Page Objects via fixtures, not new. See fixtures.ts.',
          },
          {
            selector: 'NewExpression[callee.name=/.*Flow$/]',
            message: 'Instantiate Flow objects via fixtures, not new. See fixtures.ts.',
          },
        ],
      },
    },
  ],
};

When a genuine exception exists (a unit test that exercises Page Object behavior against a mock page, for instance), the escape hatch is // eslint-disable-next-line with a mandatory justification. The directive applies only to the line directly below it, so the justification goes inside the directive, after --, rather than on a separate comment line in between:

// eslint-disable-next-line no-restricted-syntax -- unit-testing POM behavior against a mock page
const page = new LoginPage(mockPage);

The comment makes the exception visible and reviewable. Blanket disables without explanation are a red flag in code review.

The Architecture in Summary

Decision	Wrong	Right	Why
Object creation	`new PageObject()` in tests	Fixtures	Single update point when constructor changes
Locator definition	Constructor assignments	Getters	Prevents state capture, enforces statelessness
Faker seed	`workerIndex`	`testId` + `RUN_ID` + `repeatEachIndex`	Stable across shards and retries
Fixture organization	One monolithic file	Domain files + `mergeTests`	Parallel editing, clear ownership
Fixture naming	Flat namespace	Domain namespacing	Eliminates silent collisions
Architecture enforcement	Code review comments	ESLint rules scoped to `tests/**`	Automated, consistent, zero overhead
Step reporting	Technical descriptions	`test.step()` with business intent	Report reads like a scenario, not a DOM log

What This Architecture Actually Solves

None of this is complex to implement. The fixture DI pattern is an afternoon. Seeded faker is 20 lines. Namespacing is a refactor you can do incrementally.

What it solves is the compounding cost of the alternative. Every new PageObject() in a test is a future refactoring touchpoint. Every workerIndex seed is a potential data collision waiting for sufficient parallelism to trigger. Every flat fixture namespace is a silent collision waiting for the second engineer to add a fixture with the same name.

The architecture described here doesn’t make tests faster or more readable in the short term. It makes the codebase cheaper to maintain as it grows — which is the only metric that matters at scale.

Reference implementation: Playwright BDR Template

Your Playwright Tests Will Need Refactoring. Here's How to Make It Painless

May 13, 2026

Dmitry

QA Automation Engineer

Your Playwright Tests Will Need Refactoring. Here’s How to Make It Painless CONCEPT

You write 50 tests. Everything works. Six months later the team grows, tests become 300, and someone changes a constructor — and you spend two days updating imports across the entire project. Sound familiar?

This isn’t a discipline problem. It’s an architecture problem. And it’s fixable before it happens.

Code examples are simplified for clarity — focus on the idea, not the boilerplate.

TL;DR

Never instantiate Page Objects with new inside tests — use fixtures
Use getters instead of constructor assignments in Page Objects
Seed your test data with a combination of testId + RUN_ID + repeatEachIndex for reproducibility
Split fixtures by domain when the file gets large — use mergeTests
Use Namespacing to avoid silent fixture name collisions

What Is a Flow? (Quick Explainer)

Before we dive in — this article uses the term Flow, which might be unfamiliar.

In a well-structured Playwright project, tests are built in three layers:

Page Object (POM) — knows how to interact with elements on a specific page: find a button, fill a field, click a link
Flow — knows how to complete a business scenario: “checkout”, “register a user”, “reset a password”. It orchestrates Page Objects in the right sequence so tests don’t have to
Test — just calls the Flow and checks the result

So when you see checkoutFlow.submitOrder() in a test, that one line is hiding a sequence of page navigations, form fills, and button clicks — all managed by the Flow. The test doesn’t need to know the details.

The Problem: Architecture That Fights You at Scale

At 50 tests, messy architecture is invisible. At 300 tests, it becomes expensive. Three problems compound each other:

Data isolation breaks in parallel runs. Two workers create a user named “Ivan”, one test reads the other’s data, both fail. You spend an hour debugging something that has nothing to do with your application. This is a data seeding problem — solved in Rule #3.

Refactoring takes days instead of hours. Someone changes a constructor signature. Now you have 150 files to update. Even with modern refactoring tools this is risky, because you might miss one. This is a dependency management problem, solved in Rule #1.

Tests are impossible to read. Ten lines of setup before the actual test logic. New team members can’t tell what’s being tested and what’s just noise. This too is a dependency management problem — when setup lives in fixtures, tests read like specifications.

Rule #1: Stop Using `new` Inside Tests

This is the most common pattern that makes refactoring painful:

// Every test manages its own dependencies
test('checkout', async ({ page }) => {
  const cartPage = new CartPage(page);
  const checkoutPage = new CheckoutPage(page);
  const checkoutFlow = new CheckoutFlow(cartPage, checkoutPage);

  await checkoutFlow.submitOrder();
});

If CartPage needs a new dependency tomorrow — a logger, a config object, an API client — you update every single test that creates it. That’s your two days of refactoring.

The fix: fixtures as a DI container

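// fixtures.ts: one place to manage all object creation
export const test = base.extend({
  cartPage: async ({ page }, use) => {
    await use(new CartPage(page));
  },

  // checkoutFlow depends on this fixture, so it has to be defined as well
  checkoutPage: async ({ page }, use) => {
    await use(new CheckoutPage(page));
  },

  checkoutFlow: async ({ cartPage, checkoutPage }, use) => {
    await use(new CheckoutFlow(cartPage, checkoutPage));
  },
});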

// The test reads like a specification
test('checkout', async ({ checkoutFlow }) => {
  await checkoutFlow.submitOrder();
});

When CartPage constructor changes — you update fixtures.ts. One file. Done.

Why fixtures even when Flow seems stateless today:

Your CheckoutFlow might be pure today — no state, no side effects. But requirements change. Tomorrow it needs to track an order ID. Next month it opens a WebSocket connection that needs to be closed after the test.

If Flow is created via new in every test, adding teardown means updating hundreds of files. If it’s in a fixture, you add after use cleanup in one place:

checkoutFlow: async ({ cartPage, checkoutPage }, use) => {
  const flow = new CheckoutFlow(cartPage, checkoutPage);
  await use(flow);
  await flow.cleanup(); // added in one place, applies everywhere
},

The upfront investment is real — a few hours to set up fixtures properly. The cost of refactoring later: days, proportional to how many tests you have.

A note on pragmatism: Fixtures are for managing state and lifecycle. If you have a stateless utility function — like formatDate or a math helper — don’t wrap it in a fixture. A simple ES6 import is faster and less complex. Use fixtures for things that hold a page context or require setup/teardown. Everything else is just a function.

Rule #2: Use Getters in Page Objects, Not Constructor Assignments

This is subtle but important. Most tutorials show this:

// Locator computed once at construction time
class CartPage {
  private submitButton: Locator;

  constructor(page: Page) {
    this.submitButton = page.locator('button#submit');
  }
}

This looks fine. Playwright locators are lazy — they don’t query the DOM at construction time, they query it when you interact with them. So assigning a locator in the constructor is technically safe.

The real danger is what this pattern enables — the temptation to capture actual state in the constructor:

// Never do this
constructor(page: Page) {
  (async () => {
    this.itemCount = await page.locator('.items').count(); // race condition bomb
  })();
}

This creates an unmanaged race condition. Your test might read itemCount before the async function inside the constructor has resolved. This causes random CI failures that are nearly impossible to reproduce locally.

The fix: lazy getters

Getters are the architectural solution — not because they prevent stale locators (Playwright handles that), but because they make it structurally impossible to capture state at construction time. A getter can’t be async, so you physically can’t write this.itemCount = await something inside one.

// Fresh locator on every access, stateless by design
class CartPage {
  constructor(private page: Page) {}

  get submitButton() {
    return this.page.getByRole('button', { name: 'Place order' });
  }

  // Named cartItems, not itemCount — this returns a locator, not a number
  get cartItems() {
    return this.page.locator('.cart-item');
  }

  // For actual count — explicit async method, not a getter
  async getItemCount(): Promise<number> {
    return this.cartItems.count();
  }
}

The Page Object stays stateless. Reading state is always an explicit async operation, never something that happens silently at construction time.

Rule #3: Isolate Test Data for Parallel Runs

When you run 1000 tests in parallel across multiple CI shards, data collisions are inevitable — unless you design against them.

The common mistake is using workerIndex as a seed for test data. It seems logical: each worker gets a unique number, so data should be unique. The problem is that workerIndex resets per shard. On 10 parallel CI agents, each has its own “Worker 0”. Collisions are guaranteed.

The fix: combine test identity with CI build ID — not worker index

import { TestInfo } from '@playwright/test';
import { faker } from '@faker-js/faker';

function hashCode(str: string): number {
  return str.split('').reduce((acc, char) => {
    return (Math.imul(31, acc) + char.charCodeAt(0)) | 0;
  }, 0);
}

export function seedFaker(testInfo: TestInfo) {
  const RUN_ID = process.env.RUN_ID || 'local';
  const seed = hashCode(`${testInfo.testId}-${RUN_ID}-${testInfo.repeatEachIndex}`);
  faker.seed(seed);
  return faker;
}

// fixtures.ts
export const test = base.extend({
  faker: async ({}, use, testInfo) => {
    await use(seedFaker(testInfo));
  },
});

Three components in the seed:

testId — unique hash of the test file path and test name
RUN_ID — the CI build ID (e.g. GITHUB_RUN_ID), so different builds get different data
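repeatEachIndex: the repetition index under --repeat-each, so repeated runs get distinct data (the retry number is not part of the seed, so a retry regenerates the same data)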

Note: RUN_ID is an environment variable provided by your CI system — for example, GITHUB_RUN_ID in GitHub Actions. If it’s missing, the code falls back to 'local', so everything works on your machine without any extra setup.

The payoff: when a test fails in CI, grab the RUN_ID from the pipeline logs, run the test locally with the same ID, and you get the exact same names, emails, and UUIDs that were generated in CI. Reproducible failures instead of “I can’t reproduce this locally.”

Rule #4: Structure Test Data With Factories and Overrides

Random data everywhere creates noise. If a field doesn’t affect the test outcome, it shouldn’t be visible in the test.

// user.factory.ts — sensible defaults
export function createUser(overrides?: Partial<User>, f = faker): User {
  return {
    id: f.string.uuid(),
    email: f.internet.email(),
    name: f.person.fullName(),
    role: 'customer',
    ...overrides,
  };
}

// In the test — only what matters
test('VIP discount applies at checkout', async ({ checkoutFlow, faker }) => {
  const user = createUser({ role: 'vip', discount: 0.15 }, faker);
  await checkoutFlow.asUser(user).applyPromo();
});

The test declares intent, not implementation. When you read it, you know exactly what’s being tested: VIP role and discount. Everything else — name, email, UUID — is noise that the factory handles.

For data that represents specific business cases and appears repeatedly, extract it as a named dataset:

export const VIP_USER = { role: 'vip', discount: 0.15 } as const;

// In tests
const user = createUser({ ...VIP_USER }, faker);

Pro tip: Use the satisfies operator (TypeScript 4.9+) instead of as const for datasets. It ensures your data matches the User type without losing the specific literal values — catching type errors before you even run the test:
export const VIP_USER = {
  role: 'vip',
  discount: 0.15,
} satisfies Partial<User>;
If someone renames or removes a User field and forgets to update the dataset, TypeScript will tell you at compile time, not at runtime.

Rule #5: Scale Fixtures With `mergeTests` and Namespacing

One fixtures.ts file is fine at the start. At 20+ fixtures it becomes a 400-line file that multiple people edit simultaneously.

Split by domain:

// auth.fixtures.ts
import { test as base } from '@playwright/test';
import { LoginPage } from '../pages/LoginPage';

export const authTest = base.extend<{ loginPage: LoginPage }>({
  loginPage: async ({ page }, use) => {
    await use(new LoginPage(page));
  },
});

// cart.fixtures.ts
import { test as base } from '@playwright/test';
import { CartPage } from '../pages/CartPage';

export const cartTest = base.extend<{ cartPage: CartPage }>({
  cartPage: async ({ page }, use) => {
    await use(new CartPage(page));
  },
});

// fixtures.ts — merge everything
import { mergeTests } from '@playwright/test';
import { authTest } from './auth.fixtures';
import { cartTest } from './cart.fixtures';

export const test = mergeTests(authTest, cartTest);

Tests don’t change at all — they still import from fixtures.ts. The split is purely organizational.

Watch out for name collisions:

If auth.fixtures.ts and cart.fixtures.ts both define a fixture called user, Playwright won’t warn you. The last one wins silently. This creates subtle bugs that are very hard to track down.

The fix is namespacing — group fixtures by domain:

// No collision possible
import { test as base } from '@playwright/test';
import { Admin } from '../pages/Admin';
import { User } from '../pages/User';

export const test = base.extend<{ auth: { admin: Admin; user: User } }>({
  auth: async ({ page }, use) => {
    await use({
      admin: new Admin(page),
      user: new User(page),
    });
  },
});

// In tests
test('admin can manage users', async ({ auth }) => {
  await auth.admin.login();
  await auth.user.register();
});

Rule #6: Write Business Steps, Not Technical Logs

If you use Allure or any step-based reporter, the quality of your step descriptions determines how useful the report is.

The native Playwright way is test.step():

// Technical log — describes implementation
async login() {
  await test.step('Click the login button', async () => {
    await this.page.getByRole('button', { name: 'Login' }).click();
  });
}

// Business intent — describes what happened
async loginAs(user: User) {
  await test.step(`Authenticate as "${user.username}"`, async () => {
    await this.loginPage.login(user.username, user.password);
  });
}

The first version breaks when you rename the button. The second version remains valid even if the entire login mechanism changes from a form to SSO. The report reads like a scenario, not a DOM manipulation log.

In BDR methodology we use a @Step decorator instead of wrapping every method manually — same result, cleaner syntax. If you’re interested in that approach, check it out.

ESLint: Enforce the Architecture Automatically

The best rule is one that doesn’t require a code review comment:

module.exports = {
  overrides: [
    {
      // Only applies inside test files — won't flag Page Object factories or helpers
      files: ['tests/**/*.ts', '**/*.spec.ts'],
      rules: {
        'no-restricted-syntax': [
          'error',
          {
            selector: 'NewExpression[callee.name=/.*Page$/]',
            message: 'Use fixtures instead of new for Page Objects. See fixtures.ts.',
          },
          {
            selector: 'NewExpression[callee.name=/.*Flow$/]',
            message: 'Use fixtures instead of new for Flow objects. See fixtures.ts.',
          },
        ],
      },
    },
  ],
};

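Scoping to test files prevents false positives: an application class whose name happens to end in Page or Flow won’t trigger the rule when it is constructed in app code. Only constructions such as new LoginPage() inside test files will. Names like Pagination never match, because the pattern requires the name to end in Page.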
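The selectors match on the constructor name alone, so it helps to see which names the two patterns catch. The sketch below applies the same regular expressions to a handful of class names; the names themselves are hypothetical.

```typescript
// The two patterns used in the no-restricted-syntax selectors.
const pagePattern = /.*Page$/;
const flowPattern = /.*Flow$/;

function isRestricted(className: string): boolean {
  return pagePattern.test(className) || flowPattern.test(className);
}

const names = [
  'LoginPage',
  'CheckoutFlow',
  'Pagination',
  'PageHeader',
  'UserBuilder',
  'LandingPage',
];

const restricted = names.filter(isRestricted);
// restricted: ['LoginPage', 'CheckoutFlow', 'LandingPage']
```

Anything ending in Page or Flow is caught regardless of what the class does, which is why the files scope carries the rest of the filtering.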

Architecture Cheat Sheet

Symptom	Root cause	Fix
Refactoring takes days	`new PageObject()` in every test	Move to fixtures
Parallel tests corrupt each other’s data	`workerIndex` as seed	Seed with `testId` + `RUN_ID`
Can’t reproduce CI failures locally	Non-deterministic test data	Seeded faker fixture
fixtures.ts is 400 lines	No domain separation	`mergeTests` + domain files
Fixture collision, wrong object used	Flat fixture namespace	Namespace by domain
Report is unreadable	Technical step descriptions	`test.step()` with business intent (or `@Step` in BDR)

What’s Next?

This architecture handles the object lifecycle and data isolation. The next layer is async reliability — expect.poll, idempotency keys for parallel API calls, and cleaning up test data without relying on afterEach.

Want to go deeper? Check out the advanced version: Playwright Architecture at Scale: What Senior Engineers Do Differently

All patterns in this article are implemented in the Playwright BDR Template on GitHub.

Playwright CI: What Senior Engineers Do Differently

May 12, 2026

Dmitry

QA Automation Engineer

New to Playwright architecture? Start with the fundamentals first: Why Your Playwright Tests Fail in CI (And Never Locally) — the same concepts with more explanation and simpler examples.

Most teams reach a point where their test suite becomes a liability. Green locally, red in CI. Passes on retry, fails on the next run. The usual response is to increase timeouts, add waitForTimeout, and move on. The problem compounds quietly until someone spends a full day debugging a test that was never actually broken.

This guide is about the architectural decisions that prevent that from happening. Not “use better selectors” — you already know that. The decisions that determine whether your test infrastructure scales or slowly collapses under its own weight.

Code examples are intentionally simplified — focus on the architectural pattern, not the implementation details.

Mental Model Shift: Leaving Legacy Baggage Behind

TL;DR

Dependency Projects over globalSetup: fail fast when the environment is down, not after 800 tests.
API auth in 50ms, not UI auth in 5 seconds.
getByRole queries the accessibility tree: the role survives refactoring, { name } doesn’t survive translation.
Web-first assertions poll until ready; isVisible() is a snapshot.
expect.poll for state that changes outside the UI: webhooks, background jobs, queues.
Trace Viewer’s Action/Before/After snapshots show you why a click failed, not just that it did.

Before getting into architecture, a quick audit. Senior engineers migrating from Selenium or Puppeteer often bring habits that fight Playwright instead of leveraging it. These aren’t stylistic preferences — they’re architectural differences that affect reliability at scale.

If any of these look familiar in your codebase, fix them before layering on anything else:

page.$() or page.$$() → getByRole(), getByLabel(), getByTestId(). Playwright locators are lazy and auto-retried on assertions. $() executes immediately against the current DOM state and cannot be polled.
waitForSelector() or waitForTimeout() → remove them. Playwright auto-waits for actionability before every interaction. Explicit waits are almost always either redundant or masking a real problem.
waitForNavigation() → await expect(page).toHaveURL('/dashboard'). waitForNavigation() is prone to race conditions: it can resolve before the page is actually ready. toHaveURL polls until the URL matches, which is what you actually want.
isVisible(), isEnabled() in assertions → expect(loc).toBeVisible(), expect(loc).toBeEnabled(). Snapshot methods return the state at a single instant. Web-first assertions retry until the condition is true or the timeout expires.
console.log('HERE') → Trace Viewer. Logs tell you that something happened. Traces show you the DOM, network, and console at the exact moment it happened, in CI, after the fact.

If your team is mid-migration, this is worth a dedicated refactor sprint. The patterns below assume you’re past this baseline.

The Problem With How Most Teams Structure Test Infrastructure

The typical Playwright setup looks like this: a globalSetup file that handles authentication, maybe some shared fixtures, and a flat list of test files. This works at 50 tests. At 500, the cracks appear.

globalSetup runs once, outside Playwright’s normal execution context. When it fails, you get dry Node.js logs. No trace, no network timeline, no DOM snapshots. You’re debugging blind.

More critically: there’s no built-in way to say “don’t run 800 tests if the environment is down.” You get 800 failures that all say the same thing and tell you nothing useful.

The Architecture: Dependency Projects as a Dependency Graph

The senior approach treats test infrastructure as a directed acyclic graph. Each node has prerequisites. If a prerequisite fails, dependent nodes don’t run.

import { defineConfig, devices } from '@playwright/test';

export default defineConfig({
  projects: [
    {
      name: 'auth-setup',
      testMatch: /.*\.auth\.setup\.ts/,
    },
    {
      name: 'healthcheck',
      testMatch: /.*\.health\.setup\.ts/,
      dependencies: ['auth-setup'],
    },
    {
      name: 'chromium',
      use: { ...devices['Desktop Chrome'] },
      dependencies: ['healthcheck'],
    },
    {
      name: 'firefox',
      use: { ...devices['Desktop Firefox'] },
      dependencies: ['healthcheck'],
    },
  ],
});

The order in the array doesn’t matter — Playwright builds the graph automatically. What matters is the dependencies field.

What this buys you:

When the staging environment goes down at 2am, your CI doesn’t burn 40 minutes running tests that will all fail for the same reason. The healthcheck fails, Playwright stops, you get one clear failure instead of eight hundred.

When auth breaks after a backend deploy, you know immediately — not after waiting for the full suite to time out.

And crucially: every node in this graph is a real Playwright test. That means full Trace Viewer support. When auth setup fails in CI, you open the trace and see exactly which API call returned 401, what the response body said, and what the DOM looked like if there was a redirect. Compare that to parsing a stack trace from globalSetup.
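The gating behaviour can be modelled in a few lines. This is a toy model, not Playwright's scheduler: planRun and the Project shape are hypothetical names, and the only rule carried over from the config above is that a project does not run when any project it depends on failed or was itself skipped.

```typescript
interface Project {
  name: string;
  dependencies?: string[];
}

type Status = 'passed' | 'failed' | 'skipped';

// Resolve each project's status given the set of projects whose own tests fail.
// Assumes the dependency graph is acyclic, as the section describes.
function planRun(projects: Project[], failing: Set<string>): Map<string, Status> {
  const byName = new Map(projects.map((p) => [p.name, p] as const));
  const result = new Map<string, Status>();

  const resolve = (name: string): Status => {
    const known = result.get(name);
    if (known) return known;
    const project = byName.get(name);
    if (!project) throw new Error(`Unknown project: ${name}`);
    // A project only runs if every prerequisite passed.
    const blocked = (project.dependencies ?? []).some((dep) => resolve(dep) !== 'passed');
    const status: Status = blocked ? 'skipped' : failing.has(name) ? 'failed' : 'passed';
    result.set(name, status);
    return status;
  };

  projects.forEach((p) => resolve(p.name));
  return result;
}

// Array order is deliberately scrambled: the dependencies field decides the order.
const projects: Project[] = [
  { name: 'firefox', dependencies: ['healthcheck'] },
  { name: 'chromium', dependencies: ['healthcheck'] },
  { name: 'healthcheck', dependencies: ['auth-setup'] },
  { name: 'auth-setup' },
];

// Simulate a staging outage: only the healthcheck itself fails.
const outcome = planRun(projects, new Set(['healthcheck']));
outcome.forEach((status, name) => console.log(`${name}: ${status}`));
```

With the healthcheck failing, auth-setup passes, healthcheck fails, and both browser projects are skipped: one red node instead of a wall of identical failures.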

Authentication: The 50ms vs 4 Second Decision

Every test that needs authentication has to pay the auth cost. The question is how much.

UI login on a realistic app with SSR, asset loading, and form rendering: 2–5 seconds. API login: 50–100ms. At 500 tests, that’s 2500 seconds vs 50 seconds of auth overhead — before you’ve even started testing anything.

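// user.auth.setup.ts (hypothetical name; any file matching the auth-setup testMatch works)
import { test, expect } from '@playwright/test';

test('authenticate', async ({ request }) => {
  // Relative URL: assumes use.baseURL is set in playwright.config.ts
  const response = await request.post('/api/auth/login', {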
    data: {
      email: process.env.TEST_USER_EMAIL,
      password: process.env.TEST_USER_PASSWORD,
    },
  });

  expect(response.status()).toBe(200);

  // Cookies are automatically captured from the request context
  await request.storageState({ path: '.auth/user.json' });
});

// playwright.config.ts, inside each project that should start authenticated
use: {
  storageState: '.auth/user.json',
}

The non-obvious part: you should have exactly one test that tests the login UI. Every other test that requires authentication just consumes the saved state. You’re not testing login 500 times — you’re testing it once and reusing the result.

This also means your login test is isolated. If the login flow changes, one test fails, clearly, with a good error message. Not 400 tests failing with “element not found” somewhere in the middle of an unrelated scenario.

Locator Strategy: Understanding the Model, Not Memorizing the Rules

The common framing — “use getByRole for actions, getByTestId for stable anchors” — is a simplification that leads engineers to make wrong choices in edge cases. The more useful mental model is understanding what each locator actually queries and what that means for test reliability.

What getByRole actually does

getByRole queries the accessibility tree, not the DOM. The accessibility tree is a parallel representation of the page that browsers expose to screen readers and assistive technology. It’s built from semantic HTML — <button>, <input>, <h1> — plus ARIA attributes.

This distinction matters: CSS classes, DOM structure, and visual styling don’t affect the accessibility tree. A <div class="btn-primary"> has no role. A <button> always has role button regardless of how it’s styled.

One important nuance: getByRole usually takes a { name: '...' } parameter to identify which element you mean. That name is resolved from the element’s text content, aria-label, or aria-labelledby. The role itself survives refactoring — but the name is tied to visible text, which means it breaks in multilingual apps when the locale changes. This is why getByTestId or a fixed aria-label are better choices when text is dynamic.

When getByRole fails to find an element, it usually means one of two things: the element genuinely doesn’t exist yet (timing issue), or the element has no semantic role (accessibility issue). The second case is a real bug in your application — your test is catching it.

// This finds the button by its role and accessible name
// Works regardless of CSS class, DOM nesting, or visual styling
await page.getByRole('button', { name: 'Place order' }).click();

// If this fails because there's no button role —
// that's an accessibility bug worth fixing, not a test bug

The accessible name in { name: '...' } can come from: the element’s text content, an aria-label attribute, or an aria-labelledby reference. Playwright checks all three automatically.

Why getByLabel is semantically stronger than getByTestId for forms

getByLabel finds form inputs by their associated label. The label is a contract: it tells users (and screen readers) what the field is for. If that contract changes, your test should know.

// If the label changes from 'Email address' to 'Work email'
// this test fails — correctly, because the UX changed
await page.getByLabel('Email address').fill('user@example.com');

getByTestId on the same field would pass silently. You might want that stability, or you might want the test to catch the label change. The choice depends on whether the label is a UX requirement or an implementation detail.

When getByTestId is the right choice — and why

getByTestId bypasses the accessibility tree entirely. It finds elements by a data-testid attribute you add to the DOM. This makes it stable in specific situations where semantic locators genuinely don’t work:

Complex component libraries (Ant Design, MUI): a single Select or Combobox can render several elements with the same role, such as a hidden native input, a trigger button, and a text field. getByRole('combobox') then matches more than one element, so actions fail with a strict mode violation, and narrowing with .first() picks whichever comes first in DOM order, which is deterministic but often wrong. That order can also change between library versions as the internal structure shifts
Multi-language applications — getByRole('button', { name: 'Submit' }) breaks when the locale changes to French. getByTestId('submit-button') doesn’t care about the label language
A/B tests and personalization — button text varies per user variant; getByTestId gives you a stable anchor
Icon-only buttons — SVG icons without aria-label have no accessible name; getByTestId is the fallback

The tradeoff is real: getByTestId passes even if the element is visually broken, hidden by styles, or completely inaccessible to screen readers. You’re opting out of semantic validation.

The decision algorithm

1. Does the element have a reliable semantic role?
   → Yes: use getByRole
   → No: continue

2. Is it a form field with a label?
   → Yes: use getByLabel
   → No: continue

3. Can you ask the developer to add aria-label?
   → Yes: add it, then use getByRole(..., { name: 'aria-label value' })
   → No: continue

4. Use getByTestId — consciously, not by default
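The four steps can be written down as a small function, which makes the order of the questions explicit. This is only a sketch: ElementFacts and chooseLocator are hypothetical names, and in practice the three booleans are judgment calls made while reading the markup, not values you can compute.

```typescript
// Answers to the questions the algorithm asks, in order.
interface ElementFacts {
  hasReliableRole: boolean; // step 1
  isLabelledFormField: boolean; // step 2
  canAddAriaLabel: boolean; // step 3
}

type LocatorChoice =
  | 'getByRole'
  | 'getByLabel'
  | 'getByRole after adding aria-label'
  | 'getByTestId';

function chooseLocator(facts: ElementFacts): LocatorChoice {
  if (facts.hasReliableRole) return 'getByRole';
  if (facts.isLabelledFormField) return 'getByLabel';
  if (facts.canAddAriaLabel) return 'getByRole after adding aria-label';
  return 'getByTestId'; // last resort, chosen consciously
}

// Example: an icon-only button inside a third-party widget you cannot modify.
const iconButton = chooseLocator({
  hasReliableRole: false,
  isLabelledFormField: false,
  canAddAriaLabel: false,
});
console.log(iconButton); // getByTestId
```

The early returns are the point: getByTestId is reached only after every semantic option has been ruled out.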

The correction to the “actions vs assertions” mental model

The framing “use getByTestId for clicks, getByRole for assertions” is wrong in both directions. The question is not what you’re doing with the element — it’s how stable the element’s semantics are.

// Both clicks — different locators because semantics differ
await page.getByRole('button', { name: 'Place order' }).click(); // stable role + name
await page.getByTestId('lang-switcher').click(); // dynamic text, no stable role

// Both assertions — different locators for the same reason
await expect(page.getByRole('heading', { level: 1 })).toHaveText('Order confirmed'); // content IS the requirement; level: 1 avoids matching every heading on the page
await expect(page.getByTestId('order-status')).toBeVisible(); // existence matters, not label

Use getByRole whenever the element has reliable semantics — for both clicks and assertions. Use getByTestId when semantics are unreliable — for both clicks and assertions.

Web-First Assertions: Why the Implementation Matters

The difference between isVisible() and expect(locator).toBeVisible() isn’t just syntax. It’s the difference between a point-in-time snapshot and a polling loop.

isVisible() makes one DOM query and returns immediately. If the element isn’t there at that exact millisecond, you get false. If your app is 10ms slower than usual in CI, the test fails.

expect(locator).toBeVisible() polls the DOM every ~100ms until the condition is true or the timeout expires. It’s designed for asynchronous UIs.

// Snapshot — fails if element isn't ready at this exact moment
const visible = await page.getByRole('dialog').isVisible();
expect(visible).toBe(true);

// Polling — waits for the element to appear
await expect(page.getByRole('dialog')).toBeVisible();

The more interesting case is expect.poll for non-UI state — and the contrast with waitForTimeout is worth making explicit.

The tempting pattern:

// Guessing — works until it doesn't
await page.getByText('Place order').click();
await page.waitForTimeout(5000);
const order = await api.getOrder(id);
expect(order.status).toBe('PAID');

This works in development where the backend is fast and the machine is unloaded. In CI under parallel execution, the backend takes 5001ms on a slow run. The test fails — not because the feature is broken, but because you guessed wrong about timing.

waitForTimeout loses in both directions: it fails when the system is slower than expected and wastes time when the system is faster. At 1000 tests, those wasted seconds add up to real CI cost.
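A toy model with a simulated clock makes both failure directions concrete. Nothing here calls Playwright or actually waits: fixedWait stands in for waitForTimeout followed by a single check, poll stands in for a retrying check, and the 1-second polling interval is an arbitrary assumption.

```typescript
interface Outcome {
  passed: boolean;
  elapsedMs: number;
}

// Sleep for a fixed time, then check once.
function fixedWait(readyAtMs: number, waitMs: number): Outcome {
  return { passed: readyAtMs <= waitMs, elapsedMs: waitMs };
}

// Check at every interval until the state is ready or the timeout is reached.
function poll(readyAtMs: number, intervalMs: number, timeoutMs: number): Outcome {
  for (let now = 0; now <= timeoutMs; now += intervalMs) {
    if (now >= readyAtMs) return { passed: true, elapsedMs: now };
  }
  return { passed: false, elapsedMs: timeoutMs };
}

// Slow run: the backend needs 5001ms.
console.log(fixedWait(5001, 5000)); // { passed: false, elapsedMs: 5000 }
console.log(poll(5001, 1000, 30000)); // { passed: true, elapsedMs: 6000 }

// Fast run: the backend needs 800ms.
console.log(fixedWait(800, 5000)); // { passed: true, elapsedMs: 5000 }
console.log(poll(800, 1000, 30000)); // { passed: true, elapsedMs: 1000 }
```

The fixed wait fails the slow run by one millisecond and spends four extra seconds on the fast one; the polling version passes both and stops as soon as the state is ready.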

The boundary that matters: web-first assertions (toBeVisible, toHaveURL, toHaveText) cover 95% of cases — they have built-in retry and should always be your first choice. expect.poll is for the remaining 5%: state that changes outside the UI with no visible indicator. A background job updating order status in the DB. A payment webhook from Stripe arriving and updating payment state. A message processed from Kafka by another service. The common pattern: you triggered something, the UI has nothing useful to show, and you can only verify the result via a direct API call.

// Background job updated order status — only verifiable via API
await expect
  .poll(
    async () => {
      const response = await request.get(`/api/orders/${orderId}`);
      const order = await response.json();
      return order.status;
    },
    {
      message: 'Order should reach CONFIRMED status',
      timeout: 30_000,
    },
  )
  .toBe('CONFIRMED');

This is the correct tool for Eventual Consistency scenarios — distributed systems where the UI updates before the database has committed, or where background jobs need to complete before the state is queryable.

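A common mistake: manually setting intervals: [1000, 2000, 5000] on every poll. Playwright’s default intervals are reasonable. If slow scenarios need more time, pass a longer timeout to that poll or raise the expect timeout once in the config (expect: { timeout: ... }) rather than tuning intervals on every individual poll. test.setTimeout() only extends the overall test timeout; it does not change how long a poll keeps retrying.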

`expect.toPass`: When You Need to Retry an Entire Interaction

expect.poll retries a single assertion. Sometimes you need to retry a whole sequence of actions — click a button, wait for a state change, verify the result. That’s expect.toPass:

await expect(async () => {
  await page.getByRole('button', { name: 'Sync' }).click();
  await expect(page.getByTestId('sync-status')).toHaveText('Complete');
}).toPass({
  intervals: [1_000, 2_000, 5_000],
  timeout: 15_000,
});

Here the intervals make sense — you’re controlling how often to repeat a user-visible action, not an internal polling check.

The decision boundary between poll and toPass:

Use expect.poll when you’re checking state without side effects — reading an API endpoint, querying a value. The polling itself is invisible to the system.

Use expect.toPass when the check requires triggering an action — clicking a refresh button, submitting a form, calling an endpoint that changes state. Here you want explicit control over retry frequency because each attempt has a visible effect.

Mixing them up creates subtle problems: using expect.toPass for a pure state check works but fires unnecessary user actions. Using expect.poll when you need to click something doesn’t work at all — poll only retries the assertion, not the preceding action.

Hydration: The Silent Test Killer in SSR Applications

If your application uses Next.js, Nuxt, or any other SSR framework, you’ve likely hit this: Playwright clicks a button, no error is thrown, but the application doesn’t respond. The test eventually times out waiting for a state change that never came.

The cause is hydration. The server sends fully-rendered HTML — the page looks complete, the button is in the DOM, Playwright’s actionability checks pass. But the JavaScript bundle hasn’t executed yet. There are no event listeners. The click lands on a dead element.

The solution is to wait for a signal that hydration is complete before starting meaningful interactions:

// Many frameworks add a class or attribute when hydration completes
await expect(page.locator('[data-hydrated="true"]')).toBeAttached();

// Or wait for a loading indicator to disappear
await expect(page.locator('#app-loading')).toBeHidden();

// Or wait for a specific element that only appears post-hydration
await expect(page.getByRole('navigation')).toBeVisible();

The right signal depends on your application. Work with your frontend team to add a reliable hydration marker if one doesn’t exist. It’s a small investment that eliminates an entire category of intermittent failures.
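On the application side, the marker can be as small as one attribute set after client-side code has attached its listeners. A minimal sketch, assuming a data-hydrated attribute on the root element: markHydrated and the AttributeTarget interface are hypothetical, and where to call the helper depends on your framework (for example, a top-level effect that only runs on the client).

```typescript
// The only capability the helper needs. A browser Element satisfies it,
// and so does a plain object in a unit test.
interface AttributeTarget {
  setAttribute(name: string, value: string): void;
}

// Call once, after event listeners are attached on the client.
function markHydrated(root: AttributeTarget): void {
  root.setAttribute('data-hydrated', 'true');
}

// Stand-in for document.documentElement so the sketch runs outside a browser.
const recorded: Record<string, string> = {};
const fakeRoot: AttributeTarget = {
  setAttribute: (name, value) => {
    recorded[name] = value;
  },
};

markHydrated(fakeRoot);
console.log(recorded); // { 'data-hydrated': 'true' }
```

In the browser the call would be markHydrated(document.documentElement), and the test side waits for [data-hydrated="true"] before the first interaction.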

A note on force: true:

When a click does nothing, force: true is tempting. Understand what you’re actually disabling. Playwright’s actionability checks verify four things before every interaction:

Visible: element has a non-empty bounding box and is not hidden with visibility: hidden (off-screen elements are scrolled into view, not rejected)
Stable: element is not moving (animations, transitions in progress)
Enabled: element is not disabled
Receiving events: element is not covered by another element

Bypassing these means your test no longer reflects what a real user can do. The test passes; the user is still stuck.

There is one legitimate kind of exception: controls whose native element is intentionally covered by a custom one, for example a styled checkbox where a decorative element sits on top of the real <input>. The receiving-events check can never pass there by design. Hidden file inputs (<input type="file">) are not such a case: setInputFiles() does not require the input to be visible and accepts no force option, so it works on a hidden input as is. When you genuinely need force: true, document it:

// force: true required: the native checkbox is covered by a styled element by design
// (the label text is illustrative)
await page.getByLabel('Accept terms').check({ force: true });

For everything else: find what’s blocking the element and wait for it to clear. force: true without a comment is a code smell that should fail review.

Network Hygiene: What’s Actually Slowing Your Tests

Third-party scripts are a common source of CI flakiness that’s easy to overlook. Analytics, support chat, session recording tools — these make network requests that can:

Prevent networkidle waits from ever settling (if a script sends requests every 400ms)
Add latency to page loads
Occasionally fail with 5xx errors that your application handles gracefully but that affect timing

The fix is straightforward:

// In a base fixture, applied to all tests
await page.route(/google-analytics\.com|segment\.com|intercom\.io|fullstory\.com/, async (route) => {
  // fulfill with 200 rather than abort, which keeps apps from retrying indefinitely
  await route.fulfill({ status: 200, body: '' });
});

One subtlety: don’t block web fonts unless you’ve confirmed your app handles them gracefully. Missing fonts cause layout shifts, which fail Playwright’s stability checks and can make elements appear to move right before you try to interact with them.

Trace Viewer: Making CI Failures Debuggable

The difference between a test suite that’s maintainable and one that isn’t often comes down to how debuggable failures are. A screenshot tells you what the page looked like. A trace tells you everything that happened.

use: {
  trace: 'retain-on-failure',
  screenshot: 'only-on-failure',
  video: 'retain-on-failure', // optional but useful for complex interactions
}

Navigating a trace effectively:

Metadata tab — check this first when a test fails in CI but passes locally. It shows the browser version, viewport size, and launch parameters. “Element not found” failures that only happen in CI are often caused by a different viewport — the element exists but is off-screen or hidden by a responsive breakpoint.

Snapshots: Action / Before / After — this is where most debugging happens. Each action in the trace has three states:

Before: DOM state before Playwright started the action
Action: The moment of interaction — you’ll see a red dot showing exactly where Playwright clicked
After: DOM state after the action completed

When a click does nothing, open the Action snapshot. If you see the red dot landing on a loading skeleton or an overlay div instead of your button, that’s your answer. The button was there, but something was on top of it.

Network tab — click any request to see headers, payload, and response body. When a test fails because a state change didn’t happen, check whether the API call was made, what it returned, and how long it took. A 200 response with an error in the body is a common cause of tests that fail without obvious reason.

Interactive DOM — snapshots aren’t screenshots. They’re live DOM captures you can inspect with DevTools. Open any snapshot, right-click an element, and you have full access to computed styles, attributes, and the element tree — at the exact moment in time when the action occurred. This is the feature that makes Trace Viewer genuinely different from video recording.

ESLint: Enforcing Architecture Automatically

The best architectural rules are the ones that don’t require human enforcement. Configure these once and they apply to every PR forever:

// .eslintrc.js (ESLint v8)
module.exports = {
  extends: ['plugin:playwright/recommended'],
  rules: {
    // Hard failures — these break things
    'playwright/no-wait-for-timeout': 'error',
    'playwright/no-focused-test': 'error',
    'playwright/no-page-pause': 'error',
    'playwright/missing-playwright-await': 'error',

    // Warnings — architectural debt worth addressing
    'playwright/prefer-web-first-assertions': 'warn',
    'playwright/no-force-option': 'warn',
    'playwright/no-skipped-test': 'warn',

    // Prevent bypassing seeded faker (if you use deterministic test data)
    'no-restricted-imports': [
      'error',
      {
        paths: [
          {
            name: '@faker-js/faker',
            message: 'Use the seeded faker fixture from test context for reproducible test data.',
          },
        ],
      },
    ],
  },
};

For ESLint v9+:

import playwright from 'eslint-plugin-playwright';

export default [
  {
    files: ['tests/**'],
    ...playwright.configs['flat/recommended'],
    rules: {
      ...playwright.configs['flat/recommended'].rules,
      'playwright/no-wait-for-timeout': 'error',
      'playwright/no-focused-test': 'error',
      'playwright/no-page-pause': 'error',
      'playwright/missing-playwright-await': 'error',
      'playwright/prefer-web-first-assertions': 'warn',
      'playwright/no-force-option': 'warn',
    },
  },
];

The error vs warn distinction matters. error means the CI pipeline fails. warn means the developer sees it in their IDE and in the PR, but it doesn’t block a merge. Use error for things that will definitely cause test failures or leave debug artifacts in CI. Use warn for patterns that indicate technical debt but may have legitimate exceptions.

On that note: rules exist to be broken consciously. If you’re working with a heavy component library — Ant Design, MUI with deeply nested generated selectors — sometimes // eslint-disable-next-line is the honest answer. The difference between a senior and a junior here isn’t that the senior never disables rules. It’s that they write a comment explaining why, and they don’t do it as a reflex.

The Flakiness Diagnostic Framework

When a test fails intermittently, the question isn’t “why did it fail this time?” It’s “what class of problem is this?”

Symptom	Root cause	Solution
Click lands, nothing happens	Hydration — JS not loaded yet	Wait for hydration signal
Passes locally, fails in CI consistently	Resource contention / network latency	Block third-party scripts, check `workers` config
Fails on 1 in 10 runs, no pattern	Race condition in assertion	Replace snapshot assertion with Web-first assertion
All tests fail simultaneously	Environment down / auth broken	Add healthcheck dependency project
Fails after deploy, selector not found	Fragile locator	Replace CSS with `getByTestId` or `getByRole`
Timeout waiting for state change	Eventual consistency	Replace `waitForTimeout` with `expect.poll`

The last row is where most teams go wrong. When a test times out waiting for a database state change, the instinct is to increase the timeout. The correct fix is to stop guessing how long the operation takes and start asking the system when it’s done.

Worker Configuration: The Resource Math

fullyParallel: true is one line. The consequences of getting the worker count wrong are dozens of intermittent failures that look like application bugs.

The math: each Playwright worker runs a browser instance. A Chromium instance needs roughly 200–300MB of RAM under load. On a CI agent with 4GB RAM, running 20 workers means 4–6GB just for browsers — before Node.js, your application server, and the OS.

import { defineConfig } from '@playwright/test';

export default defineConfig({
  fullyParallel: true,
  workers: process.env.CI ? '50%' : undefined,
});

50% of available cores leaves headroom for everything else. The tests run slightly slower than theoretical maximum, but they run reliably. The alternative — running at 100% and getting OOM kills that look like test failures — is worse in every way.
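The same arithmetic can be turned into a quick sanity check for a CI agent. A sketch under stated assumptions: the per-browser figures are the 200–300MB range quoted above, and the 1.5GB reserved for Node.js, the application server, and the OS is an illustrative guess, not a measurement.

```typescript
// All sizes are in megabytes to keep the arithmetic in integers.
function browserMemoryMb(workers: number, perBrowserMb: number): number {
  return workers * perBrowserMb;
}

// Largest worker count whose browsers fit into what is left after the reserve.
function maxWorkersForRam(totalMb: number, reservedMb: number, perBrowserMb: number): number {
  const availableMb = Math.max(0, totalMb - reservedMb);
  return Math.floor(availableMb / perBrowserMb);
}

// 20 workers at 200MB and 300MB per browser: the "4–6GB" from the text.
console.log(browserMemoryMb(20, 200)); // 4000
console.log(browserMemoryMb(20, 300)); // 6000

// A 4GB agent with 1.5GB reserved, at the pessimistic 300MB per browser.
console.log(maxWorkersForRam(4096, 1536, 300)); // 8
```

Note that workers: '50%' is derived from CPU cores, not memory. On agents with many cores and little RAM, the memory bound can be the smaller number, and that is the one to use.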

What This Architecture Actually Buys You

None of these patterns are difficult to implement. The dependency graph takes an afternoon. API auth is 20 lines. ESLint config is copy-paste.

The compounding value is that they change the economics of flakiness. Without them, every intermittent failure requires investigation — is this a real bug or noise? With them, most failures are deterministic and self-explanatory.

A healthcheck that fails clearly is better than 800 timeouts that might be anything. A trace that shows “button covered by loading overlay” is better than 40 minutes of local reproduction attempts. An ESLint error that prevents waitForTimeout from being committed is better than a code review comment that gets ignored.

The goal isn’t zero flakiness — distributed systems are inherently non-deterministic. The goal is failures that tell you something useful.

The patterns in this article are implemented in the Playwright BDR Template — a reference implementation you can clone and run.
