Flaky Tests You Can't Fix With Better Selectors

May 15, 2026


You’ve fixed your locators. You’ve switched to web-first assertions. Your tests still fail intermittently — but now the failures look different. Duplicate records in the database. Tests that pass alone but fail in parallel. Mocks that say everything is fine while production is broken.

This is the next layer of flakiness. It lives in your API calls, your test doubles, and your database. Better selectors won’t help here.

Code examples are simplified for clarity — focus on the idea, not the boilerplate.

TL;DR

Use idempotency keys on POST requests — one network glitch shouldn’t create two orders
page.route mocks the browser, not your server — know the difference
Use WireMock for server-to-server integrations you can’t control
Contract tests catch API drift before it reaches production
Never rely on afterEach for database cleanup — use TTL or a cleanup queue instead

The Problem: Flakiness That Looks Like Application Bugs

When a selector fails, the error is obvious. When a test creates a duplicate order because a network request was retried, the error looks like a business logic bug. You spend an hour investigating something that has nothing to do with your application code.

Three categories cause this:

API flakiness — a request succeeds on the server but the response never arrives. Playwright retries. Now you have two orders.

Lying mocks — your mocks say the API returns { order_id: "123" }. The backend deployed last week and now returns { orderId: "123" }. Tests are green. Production is broken.

Data pollution — tests create users, orders, and transactions but don’t clean up reliably. After a week, the test database is a graveyard that slows down queries and causes unique constraint violations.

Rule #1: Idempotency Keys — One Request, One Result

Networks are unreliable. A POST request can reach the server and create a record, and then the response is lost in transit. The test times out, and with retries enabled Playwright reruns it, sending the POST again. The server sees a new request and creates another record.

The fix is an idempotency key — a unique header that tells the server “if you’ve seen this request before, return the same result instead of processing it again.”

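import { createHash } from 'crypto';
import type { APIRequestContext } from '@playwright/test';

// Deterministic key: the same method, URL, and body always hash to the same value.
export function generateIdempotencyKey(method: string, url: string, data: unknown): string {
  const payload = `${method}:${url}:${JSON.stringify(data)}`;
  return createHash('sha256').update(payload).digest('hex').slice(0, 16);
}

export abstract class BaseApiClient {
  // Playwright's API request context, e.g. the `request` fixture
  constructor(protected readonly request: APIRequestContext) {}

  protected async post(url: string, data?: unknown) {
    const key = generateIdempotencyKey('POST', url, data);
    return this.request.post(url, {
      data,
      headers: { 'X-Idempotency-Key': key },
    });
  }
}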

The key is generated from the request method, URL, and body — so two identical requests get the same key, but two different requests (create user, then create order) get different keys. One network glitch no longer creates two records.

Important nuance: if your test legitimately creates two identical orders (same body, same URL), they’ll get the same key — and the server will return the first result for both. This is intentional behaviour for retries, but it means this approach assumes each unique operation has unique data. If you need two genuinely identical records, add a unique field (like requestId) to the body.
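
A minimal sketch of that nuance, reusing the same hashing scheme as above. The requestId field name and the order payload are assumptions for illustration; use whatever extra field your API tolerates.

```typescript
import { createHash, randomUUID } from 'node:crypto';

// Same hashing scheme as generateIdempotencyKey above.
function generateIdempotencyKey(method: string, url: string, data: unknown): string {
  const payload = `${method}:${url}:${JSON.stringify(data)}`;
  return createHash('sha256').update(payload).digest('hex').slice(0, 16);
}

// Hypothetical order payload.
const order = { sku: 'ABC-1', quantity: 1 };

// Identical bodies hash to the same key, so the server treats the second call as a retry.
const firstKey = generateIdempotencyKey('POST', '/api/orders', order);
const retryKey = generateIdempotencyKey('POST', '/api/orders', order);

// A hypothetical requestId field makes an intentional second order distinct.
const secondOrderKey = generateIdempotencyKey('POST', '/api/orders', {
  ...order,
  requestId: randomUUID(),
});

console.log(firstKey === retryKey); // true
console.log(firstKey === secondOrderKey); // false
```

Note that JSON.stringify is sensitive to property order, so the same data built with keys in a different order would hash to a different key.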

Note: This only works if your backend handles the X-Idempotency-Key header. Check with your backend team — many order APIs support this out of the box. If your payment provider doesn’t support it — that’s their problem to solve, not yours. Look for their sandbox or test mode, or use WireMock to mock them entirely. If it’s your own backend that’s missing support — that’s a tech-debt conversation with your backend team, not something to work around in tests.
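
To make the backend side of that conversation concrete, here is a rough sketch of what "handling the header" could look like. It is an in-memory illustration only: IdempotencyStore, StoredResult, and handleCreateOrder are hypothetical names, not part of any framework, and a real backend would need persistent storage and key expiry.

```typescript
// Hypothetical result type; a real backend would store the full HTTP response.
interface StoredResult {
  status: number;
  body: unknown;
}

class IdempotencyStore {
  // Holds finished results and in-flight work, so concurrent duplicates share one run.
  private readonly results = new Map<string, Promise<StoredResult>>();

  execute(key: string, handler: () => Promise<StoredResult>): Promise<StoredResult> {
    const existing = this.results.get(key);
    if (existing) return existing;

    const pending = handler().catch((error) => {
      // Failed attempts are forgotten so a later retry can run the handler again.
      this.results.delete(key);
      throw error;
    });
    this.results.set(key, pending);
    return pending;
  }
}

// Hypothetical route handler body: `key` would come from the X-Idempotency-Key header.
let ordersCreated = 0;
function handleCreateOrder(store: IdempotencyStore, key: string): Promise<StoredResult> {
  return store.execute(key, async () => {
    ordersCreated += 1; // stands in for the real database insert
    return { status: 201, body: { orderId: `order_${ordersCreated}` } };
  });
}
```

Calling handleCreateOrder twice with the same key runs the insert once and returns the same stored result both times, which is the behaviour the test client relies on.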

Rule #2: Know What Your Mocks Actually Cover

page.route is Playwright’s built-in way to intercept requests. It’s great for testing UI behavior in isolation — how does the page look when the API returns an error?

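// ✅ Good use of page.route — testing UI error state
await page.route('**/api/orders', async (route) => {
  // fulfill() returns a promise, so await it inside the handler
  await route.fulfill({
    status: 500,
    contentType: 'application/json',
    body: JSON.stringify({ error: 'Internal Server Error' }),
  });
});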

await page.goto('/orders');
await expect(page.getByText('Something went wrong')).toBeVisible();

The catch: page.route only intercepts requests made from inside the browser. If your test makes API calls directly through Playwright’s request fixture, those calls are sent from the Node.js test process rather than from the page, so page.route won’t see them.

// This request is sent from the test process and bypasses page.route entirely
const apiResponse = await request.post('/api/orders', { data: orderData });

// This one is issued by the page itself, so page.route DOES intercept it
const browserResult = await page.evaluate(() =>
  fetch('/api/orders', { method: 'POST' }).then((r) => r.json()),
);

Why route.fulfill() instead of route.abort()? abort() causes the request to fail with a network error. Some apps handle this gracefully, but others enter an infinite retry loop waiting for a response that never comes. fulfill() returns a proper HTTP response (even a fake one) so the app moves on cleanly.

For direct API calls in tests, you need mocks at a different level — either a wrapper around request, or an infrastructure mock like WireMock.
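
One possible reading of "a wrapper around request": test code depends on a small interface, and a canned implementation stands in when the network must not be touched. This is a sketch under assumptions; OrderApi, FakeOrderApi, and placeOrder are hypothetical names, and the real implementation that delegates to Playwright's request fixture is left out to keep the example self-contained.

```typescript
// Hypothetical shapes for illustration; adapt them to your own API.
interface NewOrder {
  sku: string;
  quantity: number;
}

interface Order {
  id: string;
  status: string;
}

interface OrderApi {
  createOrder(data: NewOrder): Promise<Order>;
}

// Canned implementation for tests that must not hit the network.
class FakeOrderApi implements OrderApi {
  readonly calls: NewOrder[] = [];

  async createOrder(data: NewOrder): Promise<Order> {
    this.calls.push(data); // recorded so the test can assert on what was sent
    return { id: `fake_${this.calls.length}`, status: 'created' };
  }
}

// Test helpers depend on the interface, not on a concrete HTTP client.
async function placeOrder(api: OrderApi, sku: string): Promise<string> {
  const order = await api.createOrder({ sku, quantity: 1 });
  return order.id;
}
```

A fake like this is still a mock, so it can drift from the real API just as a page.route stub can.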

Rule #3: WireMock for Integrations You Don’t Control

Your backend calls Stripe for payments. It calls Twilio for SMS. It calls a shipping provider to get rates. In tests, you don’t want any of that to actually happen.

page.route can’t help here, because these are server-to-server calls that never touch the browser. The solution is WireMock: a standalone mock HTTP server that runs alongside your test environment. Your backend is pointed at it instead of the real provider, and it answers with the responses you define.

services:
  wiremock:
    image: wiremock/wiremock:3.3.1
    ports:
      - '8080:8080'
    volumes:
      - ./wiremock/mappings:/home/wiremock/mappings

{
  "request": {
    "method": "POST",
    "url": "/v1/payment_intents"
  },
  "response": {
    "status": 200,
    "jsonBody": {
      "id": "pi_test_123",
      "status": "succeeded"
    }
  }
}

Now your backend hits localhost:8080 in tests instead of the real Stripe API. Tests are fast, isolated, and don’t depend on external uptime.

Point your backend’s base URLs to WireMock via environment variables in your test environment:

STRIPE_BASE_URL=http://localhost:8080
TWILIO_BASE_URL=http://localhost:8080

One thing to be aware of: this requires your backend to use configurable base URLs for external services. In most well-structured backends this is already the case. If it’s not — that’s a conversation with the backend team, not a reason to skip WireMock.
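
If the backend does not have this yet, the change is usually small. A minimal sketch, assuming a Node.js backend; resolveBaseUrl is a hypothetical helper, and the production URL is passed in as the fallback.

```typescript
// Hypothetical helper: read an external service's base URL from the environment,
// falling back to the production URL when the variable is unset or empty.
function resolveBaseUrl(
  envVar: string,
  productionUrl: string,
  env: Record<string, string | undefined> = process.env,
): string {
  const override = env[envVar]?.trim();
  // Strip trailing slashes so `${baseUrl}/v1/...` never produces a double slash.
  return (override || productionUrl).replace(/\/+$/, '');
}

// In the test environment the variable points at WireMock.
const stripeBaseUrl = resolveBaseUrl('STRIPE_BASE_URL', 'https://api.stripe.com', {
  STRIPE_BASE_URL: 'http://localhost:8080/',
});

console.log(`${stripeBaseUrl}/v1/payment_intents`);
// prints http://localhost:8080/v1/payment_intents
```

Passing the environment as a parameter keeps the helper easy to test without mutating process.env.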

Rule #4: Contract Tests — Stop Trusting Your Mocks

Here’s the problem with all mocks: they can lie. Your WireMock returns { payment_id: "pay_123" }. The backend team renames the field to paymentId. Your tests stay green. Production breaks.

This is called a Lying Mock — a test double that no longer matches reality.

Contract testing fixes this. Instead of just mocking the response, you write a contract: “I expect this request to return this response.” The backend then verifies that contract against its actual code.

import { PactV3, MatchersV3 } from '@pact-foundation/pact';

const provider = new PactV3({
  consumer: 'frontend-tests',
  provider: 'payment-service',
  dir: './pacts',
});

describe('Payment API contract', () => {
  it('returns payment confirmation', async () => {
    await provider
      .given('a valid payment intent exists')
      .uponReceiving('POST /v1/payment_intents')
      .withRequest({ method: 'POST', path: '/v1/payment_intents' })
      .willRespondWith({
        status: 200,
        body: {
          id: MatchersV3.string('pi_test_123'),
          status: MatchersV3.string('succeeded'),
        },
      })
      .executeTest(async (mockServer) => {
        // createPayment is your own client code, pointed at Pact's mock server
        const result = await createPayment(mockServer.url);
        expect(result.status).toBe('succeeded');
      });
  });
});

After this test runs, it generates a JSON contract file in ./pacts. The backend team runs that contract against their actual API. If they rename id to paymentId — the contract verification fails in their pipeline, before the change is merged.

Where to start: You don’t need to contract-test everything. Start with the APIs that change most often or have caused the most incidents. One contract on your payment flow is worth more than ten contracts on stable read-only endpoints.

One important caveat: contract testing requires the backend team to actually run the verification in their pipeline. This is an organizational commitment, not just a technical one. For small teams, storing contract JSON files in the backend repo and running verification manually is a simpler starting point than a full Pact Broker setup. Also note that contracts verify response structure — they don’t catch business logic bugs or side effects.
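
To see what "verifying structure" means in the rename scenario, here is a deliberately simplified illustration. It is not how Pact works internally; findContractViolations and describeType are hypothetical helpers that only compare field names and value types, which is roughly what type-based matchers such as MatchersV3.string express.

```typescript
function describeType(value: unknown): string {
  if (value === null) return 'null';
  if (Array.isArray(value)) return 'array';
  return typeof value;
}

// Lists the paths where `actual` is missing a field, or has a different type,
// compared with the `expected` example. Extra fields in `actual` are allowed.
function findContractViolations(expected: unknown, actual: unknown, path = '$'): string[] {
  const expectedType = describeType(expected);
  const actualType = describeType(actual);
  if (expectedType !== actualType) {
    return [`${path}: expected ${expectedType}, got ${actualType}`];
  }
  if (expectedType !== 'object') return []; // primitives and arrays: a type match is enough here

  const expectedFields = expected as Record<string, unknown>;
  const actualFields = actual as Record<string, unknown>;
  const violations: string[] = [];
  for (const key of Object.keys(expectedFields)) {
    if (!(key in actualFields)) {
      violations.push(`${path}.${key}: missing`);
    } else {
      violations.push(
        ...findContractViolations(expectedFields[key], actualFields[key], `${path}.${key}`),
      );
    }
  }
  return violations;
}

const whatTheMockReturns = { id: 'pi_test_123', status: 'succeeded' };
const whatTheBackendReturns = { paymentId: 'pi_live_987', status: 'succeeded' };

console.log(findContractViolations(whatTheMockReturns, whatTheBackendReturns));
// logs a single violation: "$.id: missing"
```

The values differ between the two objects and that is fine; only the renamed field is reported, which is the kind of drift a lying mock hides.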

Rule #5: Stop Relying on `afterEach` for Cleanup

The classic approach to test data cleanup:

// ❌ Unreliable
afterEach(async () => {
  await api.deleteUser(userId);
});

This fails silently when a test crashes before setting userId. It doesn’t run when the test runner itself crashes. After a CI failure mid-run, your database has orphaned records that affect the next run.

Two approaches that actually work:

TTL — let the database clean up automatically

Add an expires_at field to your test entities and set it when creating them:

// When creating test data
await api.createUser({
  email: `test_${Date.now()}@example.com`,
  is_test: true, // the cleanup job below only deletes rows flagged as test data
  expires_at: new Date(Date.now() + 24 * 60 * 60 * 1000).toISOString(),
});

In PostgreSQL, a scheduled job handles cleanup:

-- Runs every hour via pg_cron
SELECT cron.schedule('cleanup-test-data', '0 * * * *', $$
  DELETE FROM users WHERE expires_at < NOW() AND is_test = true;
  DELETE FROM orders WHERE expires_at < NOW() AND is_test = true;
$$);

MongoDB handles this natively with TTL indexes, and Redis with per-key expiry, so neither needs a cron job.
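
Because the cleanup query filters on both expires_at and is_test, it helps to stamp both fields in one place instead of repeating them in every test. A small sketch; withTestTtl is a hypothetical helper, and the field names simply mirror the SQL above.

```typescript
// Fields the cleanup job looks for.
interface TestDataMarkers {
  is_test: boolean;
  expires_at: string; // ISO 8601 timestamp
}

// Hypothetical helper: returns a copy of `data` flagged as expiring test data.
function withTestTtl<T extends object>(
  data: T,
  ttlHours = 24,
  now: Date = new Date(),
): T & TestDataMarkers {
  const expiresAt = new Date(now.getTime() + ttlHours * 60 * 60 * 1000);
  return { ...data, is_test: true, expires_at: expiresAt.toISOString() };
}

// A fixed clock makes the result predictable in this example.
const user = withTestTtl(
  { email: 'test_user@example.com' },
  24,
  new Date('2026-05-15T00:00:00.000Z'),
);

console.log(user.expires_at); // 2026-05-16T00:00:00.000Z
```

In a suite, the payload passed to calls such as api.createUser would go through this helper first.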

Cleanup queue — collect IDs, delete in bulk

Track everything your tests create, then clean it all up in a global teardown:

// In your base API client
protected async post(url: string, data?: unknown) {
  const response = await this.request.post(url, { data });
  const body = await response.json();

  if (body.id) {
    cleanupQueue.push({ url, id: body.id });
  }
  return response;
}

// global-teardown.ts
export default async function globalTeardown() {
  for (const item of cleanupQueue) {
    await api.delete(`${item.url}/${item.id}`);
  }
}

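Even if individual tests fail, the global teardown still runs and drains the queue. One caveat: Playwright runs tests in worker processes and global teardown in the main runner process, so an in-memory array is not shared between them. Persist the queue somewhere both can reach, such as a file or a database table.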
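
Since the workers that create data and the global teardown that deletes it run in separate processes, the queue needs to live outside process memory. A minimal sketch, assuming a shared JSON Lines file; the file location and the enqueueCleanup and drainCleanupQueue names are hypothetical.

```typescript
import { appendFileSync, existsSync, readFileSync, rmSync } from 'node:fs';
import { tmpdir } from 'node:os';
import { join } from 'node:path';

interface CleanupItem {
  url: string;
  id: string;
}

// Hypothetical location; any path visible to both the workers and global teardown works.
const QUEUE_FILE = join(tmpdir(), 'e2e-cleanup-queue.jsonl');

// Called from test workers. One JSON document per line keeps each append independent.
function enqueueCleanup(item: CleanupItem, file: string = QUEUE_FILE): void {
  appendFileSync(file, `${JSON.stringify(item)}\n`);
}

// Called from global teardown. Reads every queued item, then removes the file.
function drainCleanupQueue(file: string = QUEUE_FILE): CleanupItem[] {
  if (!existsSync(file)) return [];
  const items = readFileSync(file, 'utf8')
    .split('\n')
    .filter((line) => line.trim() !== '')
    .map((line) => JSON.parse(line) as CleanupItem);
  rmSync(file);
  return items;
}
```

In this arrangement the base client would call enqueueCleanup instead of pushing to an array, and global teardown would loop over drainCleanupQueue(). Reversing the drained list first is a reasonable choice when child records (orders) must be deleted before their parents (users).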

Which approach to use: TTL is the more reliable default — it works even if the test runner is killed with SIGKILL, because cleanup happens at the database level independently of your test process. Use TTL as your first line of defence. The cleanup queue is a good complement when you need guaranteed cleanup order or when your database doesn’t support scheduled jobs — but it won’t run if the process is hard-killed.

Putting It Together: The Data Reliability Cheat Sheet

Symptom	Root cause	Fix
Duplicate records after CI failure	No idempotency on POST requests	Add `X-Idempotency-Key` header
Tests green, production broken	Mocks don’t match real API	Add contract tests for critical endpoints
`page.route` mock not working	Request bypasses browser	Use WireMock or request wrapper
Database full of test garbage	`afterEach` cleanup unreliable	TTL field + pg_cron or cleanup queue
External API causing flakiness	Real network calls in tests	WireMock for server-to-server calls

What’s Next?

You now have three layers covered: test infrastructure, object lifecycle, and data reliability. The next layer is observability — how do you measure test health, identify patterns in flakiness, and prove to your manager that stability work has business value?

Want to go deeper on any of these topics? Check out the advanced version: Why Your Test Suite Lies to You at Scale

All patterns in this article are implemented in the Playwright BDR Template on GitHub.