Flaky Tests You Can't Fix With Better Selectors

You’ve fixed your locators. You’ve switched to web-first assertions. Your tests still fail intermittently — but now the failures look different. Duplicate records in the database. Tests that pass alone but fail in parallel. Mocks that say everything is fine while production is broken.

This is the next layer of flakiness. It lives in your API calls, your test doubles, and your database. Better selectors won’t help here.

Code examples are simplified for clarity — focus on the idea, not the boilerplate.


  1. Use idempotency keys on POST requests — one network glitch shouldn’t create two orders
  2. page.route mocks the browser, not your server — know the difference
  3. Use WireMock for server-to-server integrations you can’t control
  4. Contract tests catch API drift before it reaches production
  5. Never rely on afterEach for database cleanup — use TTL or a cleanup queue instead

The Problem: Flakiness That Looks Like Application Bugs

When a selector fails, the error is obvious. When a test creates a duplicate order because a network request was retried, the error looks like a business logic bug. You spend an hour investigating something that has nothing to do with your application code.

Three categories cause this:

API flakiness — a request succeeds on the server but the response never arrives. The request is sent again, either by retry logic in your client or because Playwright retries the failed test. Now you have two orders.

Lying mocks — your mocks say the API returns { order_id: "123" }. The backend deployed last week and now returns { orderId: "123" }. Tests are green. Production is broken.

Data pollution — tests create users, orders, and transactions but don’t clean up reliably. After a week, the test database is a graveyard that slows down queries and causes unique constraint violations.


Rule #1: Idempotency Keys — One Request, One Result

Networks are unreliable. A POST request can reach the server and create a record, and then the response gets lost in transit. The client sees a timeout and the request is sent again, either by retry logic in your API client or because Playwright retries the failed test. The server sees a new request and creates another record.

The fix is an idempotency key — a unique header that tells the server “if you’ve seen this request before, return the same result instead of processing it again.”

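api/infrastructure/idempotency.ts
import { createHash } from 'crypto';

// Derives a deterministic key: identical method + URL + body always produce the same value.
export function generateIdempotencyKey(method: string, url: string, data: unknown): string {
  const payload = `${method}:${url}:${JSON.stringify(data)}`;
  return createHash('sha256').update(payload).digest('hex').slice(0, 16);
}

api/clients/BaseApiClient.ts
import type { APIRequestContext } from '@playwright/test';
import { generateIdempotencyKey } from '../infrastructure/idempotency';

export abstract class BaseApiClient {
  // Playwright's request context, e.g. the `request` fixture passed in by the test
  constructor(protected readonly request: APIRequestContext) {}

  protected async post(url: string, data?: unknown) {
    const key = generateIdempotencyKey('POST', url, data);
    return this.request.post(url, {
      data,
      headers: { 'X-Idempotency-Key': key },
    });
  }
}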

The key is generated from the request method, URL, and body — so two identical requests get the same key, but two different requests (create user, then create order) get different keys. One network glitch no longer creates two records.

Important nuance: if your test legitimately creates two identical orders (same body, same URL), they’ll get the same key — and the server will return the first result for both. This is intentional behaviour for retries, but it means this approach assumes each unique operation has unique data. If you need two genuinely identical records, add a unique field (like requestId) to the body.
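The behaviour described above can be checked directly. The sketch below repeats the key function from this section and adds a hypothetical `requestId` field (the name is an assumption; use whatever your API accepts) to make two otherwise identical orders distinct.

```typescript
import { createHash } from 'crypto';

// Same derivation as generateIdempotencyKey above: method + URL + serialized body.
function generateIdempotencyKey(method: string, url: string, data: unknown): string {
  const payload = `${method}:${url}:${JSON.stringify(data)}`;
  return createHash('sha256').update(payload).digest('hex').slice(0, 16);
}

const order = { sku: 'book-1', quantity: 1 };

// A retry of the same operation: identical input, identical key.
const firstAttempt = generateIdempotencyKey('POST', '/api/orders', order);
const retry = generateIdempotencyKey('POST', '/api/orders', order);

// Two intentional, separate orders: a unique requestId (hypothetical field) changes the key.
const orderA = generateIdempotencyKey('POST', '/api/orders', { ...order, requestId: 'req-a' });
const orderB = generateIdempotencyKey('POST', '/api/orders', { ...order, requestId: 'req-b' });

console.log(firstAttempt === retry); // true
console.log(orderA === orderB); // false
```

One caveat with this derivation: JSON.stringify is sensitive to property order, so the body should be built the same way on every attempt.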

Note: This only works if your backend handles the X-Idempotency-Key header. Check with your backend team; many payment and order APIs support idempotency keys out of the box, although the header name varies (Stripe, for example, uses Idempotency-Key). If your payment provider doesn’t support it, that’s their problem to solve, not yours: look for their sandbox or test mode, or use WireMock to mock them entirely. If it’s your own backend that’s missing support, that’s a tech-debt conversation with your backend team, not something to work around in tests.
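For context, this is roughly what "handling the header" means on the server. The sketch below is an in-memory illustration only, not any particular framework's API; `IdempotencyStore` and `createOrder` are hypothetical names, and a real backend would persist keys in a database or cache with an expiry.

```typescript
type Order = { id: number; sku: string };

// Hypothetical in-memory store: maps an idempotency key to the result already produced for it.
class IdempotencyStore<T> {
  private readonly seen = new Map<string, T>();

  // Runs `operation` only the first time a key is seen; later calls replay the stored result.
  execute(key: string, operation: () => T): T {
    const cached = this.seen.get(key);
    if (cached !== undefined) {
      return cached;
    }
    const result = operation();
    this.seen.set(key, result);
    return result;
  }
}

const orders: Order[] = [];
const store = new IdempotencyStore<Order>();

// Hypothetical handler: creates an order unless this key has been processed before.
function createOrder(idempotencyKey: string, sku: string): Order {
  return store.execute(idempotencyKey, () => {
    const order: Order = { id: orders.length + 1, sku };
    orders.push(order);
    return order;
  });
}

const first = createOrder('key-1', 'book-1');
const replayed = createOrder('key-1', 'book-1'); // simulated retry after a lost response
console.log(orders.length); // 1
console.log(first.id === replayed.id); // true
```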


Rule #2: Know What Your Mocks Actually Cover

page.route is Playwright’s built-in way to intercept requests. It’s great for testing UI behavior in isolation — how does the page look when the API returns an error?

// ✅ Good use of page.route: testing UI error state
await page.route('**/api/orders', async (route) => {
  await route.fulfill({
    status: 500,
    contentType: 'application/json',
    body: JSON.stringify({ error: 'Internal Server Error' }),
  });
});
await page.goto('/orders');
await expect(page.getByText('Something went wrong')).toBeVisible();

The catch: page.route only intercepts requests that originate in the page. If your test calls the API directly through Playwright’s request fixture, the call is sent from the Node.js test process, not from the page, so page.route never sees it.

// This request is sent from the test process and bypasses page.route entirely
const apiResponse = await request.post('/api/orders', { data: orderData });
// This one is issued by the page itself, so page.route DOES intercept it
const pageResponse = await page.evaluate(() =>
  fetch('/api/orders', { method: 'POST' }).then((r) => r.json()),
);

Why route.fulfill() instead of route.abort()? abort() causes the request to fail with a network error. Some apps handle this gracefully, but others enter an infinite retry loop waiting for a response that never comes. fulfill() returns a proper HTTP response (even a fake one) so the app moves on cleanly.

For direct API calls in tests, you need mocks at a different level — either a wrapper around request, or an infrastructure mock like WireMock.
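One way to read "a wrapper around request" is a small interface that test helpers depend on, with a real implementation backed by Playwright and a fake one for isolation. The sketch below shows only the fake side and uses hypothetical names (`OrdersApi`, `FakeOrdersApi`, `placeOrder`); it does not call any Playwright API.

```typescript
type OrderResult = { status: number; body: { id?: string; error?: string } };

// Hypothetical seam: test helpers call this interface instead of `request` directly.
interface OrdersApi {
  createOrder(data: { sku: string }): Promise<OrderResult>;
}

// Fake implementation: returns a canned response and records calls for assertions.
class FakeOrdersApi implements OrdersApi {
  readonly calls: Array<{ sku: string }> = [];

  constructor(private readonly canned: OrderResult) {}

  async createOrder(data: { sku: string }): Promise<OrderResult> {
    this.calls.push(data);
    return this.canned;
  }
}

// Code under test depends on the interface, so either implementation can be passed in.
async function placeOrder(api: OrdersApi, sku: string): Promise<string> {
  const result = await api.createOrder({ sku });
  if (result.status >= 400) {
    throw new Error(result.body.error ?? 'Order failed');
  }
  return result.body.id ?? '';
}
```

A real implementation would wrap `request.post` behind the same interface. The fake carries the same risk as any other mock: it can drift from the real API unless something verifies it.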


Rule #3: WireMock for Integrations You Don’t Control

Your backend calls Stripe for payments. It calls Twilio for SMS. It calls a shipping provider to get rates. In tests, you don’t want any of that to actually happen.

page.route can’t help here: these are server-to-server calls that never touch the browser. The solution is WireMock, a standalone mock HTTP server that runs alongside your test environment. You point your backend at it instead of at the real provider, and it answers with the stubbed responses you define.

docker-compose.yml
services:
  wiremock:
    image: wiremock/wiremock:3.3.1
    ports:
      - '8080:8080'
    volumes:
      - ./wiremock/mappings:/home/wiremock/mappings

wiremock/mappings/stripe-payment.json
{
  "request": {
    "method": "POST",
    "url": "/v1/payment_intents"
  },
  "response": {
    "status": 200,
    "jsonBody": {
      "id": "pi_test_123",
      "status": "succeeded"
    }
  }
}

Now your backend hits localhost:8080 in tests instead of the real Stripe API. Tests are fast, isolated, and don’t depend on external uptime.

Point your backend’s base URLs to WireMock via environment variables in your test environment:

STRIPE_BASE_URL=http://localhost:8080
TWILIO_BASE_URL=http://localhost:8080

One thing to be aware of: this requires your backend to use configurable base URLs for external services. In most well-structured backends this is already the case. If it’s not — that’s a conversation with the backend team, not a reason to skip WireMock.
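On the backend side, "configurable base URLs" can be as simple as reading the variable and falling back to the production URL. The sketch below uses TypeScript for consistency with the rest of the article; your backend may be written in another language, and `resolveBaseUrl` is a hypothetical helper.

```typescript
// Hypothetical helper: prefers the environment override, falls back to the production URL.
function resolveBaseUrl(
  env: Record<string, string | undefined>,
  name: string,
  productionUrl: string,
): string {
  const override = env[name]?.trim();
  // Strip trailing slashes so callers can append paths like `/v1/payment_intents` safely.
  return (override && override.length > 0 ? override : productionUrl).replace(/\/+$/, '');
}

// In the test environment the variable points at WireMock.
const testEnv = { STRIPE_BASE_URL: 'http://localhost:8080' };
console.log(resolveBaseUrl(testEnv, 'STRIPE_BASE_URL', 'https://api.stripe.com'));
// http://localhost:8080

// Without the override, the production URL is used.
console.log(resolveBaseUrl({}, 'STRIPE_BASE_URL', 'https://api.stripe.com'));
// https://api.stripe.com
```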


Rule #4: Contract Tests — Stop Trusting Your Mocks

Here’s the problem with all mocks: they can lie. Your WireMock returns { payment_id: "pay_123" }. The backend team renames the field to paymentId. Your tests stay green. Production breaks.

This is called a Lying Mock — a test double that no longer matches reality.

Contract testing fixes this. Instead of just mocking the response, you write a contract: “I expect this request to return this response.” The backend then verifies that contract against its actual code.

tests/contracts/payment.pact.spec.ts
import { PactV3, MatchersV3 } from '@pact-foundation/pact';
// createPayment is your own client function for the payment API (import omitted here)

const provider = new PactV3({
  consumer: 'frontend-tests',
  provider: 'payment-service',
  dir: './pacts', // generated contract files are written here
});

describe('Payment API contract', () => {
  it('returns payment confirmation', async () => {
    await provider
      .given('a valid payment intent exists')
      .uponReceiving('POST /v1/payment_intents')
      .withRequest({ method: 'POST', path: '/v1/payment_intents' })
      .willRespondWith({
        status: 200,
        body: {
          // Matchers check the type, not the exact example value
          id: MatchersV3.string('pi_test_123'),
          status: MatchersV3.string('succeeded'),
        },
      })
      .executeTest(async (mockServer) => {
        // Run the real client code against Pact's mock server
        const result = await createPayment(mockServer.url);
        expect(result.status).toBe('succeeded');
      });
  });
});

After this test runs, it generates a JSON contract file in ./pacts. The backend team runs that contract against their actual API. If they rename id to paymentId — the contract verification fails in their pipeline, before the change is merged.

Where to start: You don’t need to contract-test everything. Start with the APIs that change most often or have caused the most incidents. One contract on your payment flow is worth more than ten contracts on stable read-only endpoints.

One important caveat: contract testing requires the backend team to actually run the verification in their pipeline. This is an organizational commitment, not just a technical one. For small teams, storing contract JSON files in the backend repo and running verification manually is a simpler starting point than a full Pact Broker setup. Also note that contracts verify response structure — they don’t catch business logic bugs or side effects.
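To make the idea of a lying mock concrete, the sketch below compares a response against the fields a consumer relies on. It is a simplified illustration of the kind of mismatch that contract verification surfaces, not how Pact works internally; `findContractViolations` and the contract shape are hypothetical.

```typescript
type FieldType = 'string' | 'number' | 'boolean';
type Contract = Record<string, FieldType>;

// Returns a readable violation for every contract field that is missing or has the wrong type.
function findContractViolations(contract: Contract, response: Record<string, unknown>): string[] {
  const violations: string[] = [];
  for (const [field, expected] of Object.entries(contract)) {
    if (!(field in response)) {
      violations.push(`missing field "${field}"`);
    } else if (typeof response[field] !== expected) {
      violations.push(`field "${field}" should be ${expected}, got ${typeof response[field]}`);
    }
  }
  return violations;
}

// What the consumer (and its mock) expects.
const paymentContract: Contract = { id: 'string', status: 'string' };

// The provider's response after a hypothetical rename of `id` to `paymentId`.
const renamedResponse = { paymentId: 'pi_test_123', status: 'succeeded' };

// Reports a single violation: the `id` field is missing.
console.log(findContractViolations(paymentContract, renamedResponse));
```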


Rule #5: Stop Relying on afterEach for Cleanup

The classic approach to test data cleanup:

// ❌ Unreliable
afterEach(async () => {
await api.deleteUser(userId);
});

This fails silently when a test crashes before setting userId. It doesn’t run when the test runner itself crashes. After a CI failure mid-run, your database has orphaned records that affect the next run.

Two approaches that actually work:

TTL — let the database clean up automatically

Add an expires_at field to your test entities, plus an is_test flag so the cleanup job can never touch real data, and set both when creating them:

// When creating test data
await api.createUser({
  email: `test_${Date.now()}@example.com`,
  is_test: true,
  expires_at: new Date(Date.now() + 24 * 60 * 60 * 1000).toISOString(),
});

In PostgreSQL, a scheduled job handles cleanup:

-- Runs every hour via pg_cron
-- Orders are deleted first in case they reference users through a foreign key
SELECT cron.schedule('cleanup-test-data', '0 * * * *', $$
  DELETE FROM orders WHERE expires_at < NOW() AND is_test = true;
  DELETE FROM users WHERE expires_at < NOW() AND is_test = true;
$$);

MongoDB (TTL indexes) and Redis (key expiry) handle this natively, so no cron job is needed.
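The TTL fields can also be attached in one place so that no test forgets them. A minimal sketch, assuming the `expires_at` and `is_test` column names used above; `withTestTtl` is a hypothetical helper and the 24-hour default mirrors the example.

```typescript
const HOUR_MS = 60 * 60 * 1000;

// Hypothetical helper: stamps any test payload with the fields the cleanup job filters on.
function withTestTtl<T extends object>(
  payload: T,
  ttlHours = 24,
  now: Date = new Date(),
): T & { is_test: true; expires_at: string } {
  return {
    ...payload,
    is_test: true as const,
    expires_at: new Date(now.getTime() + ttlHours * HOUR_MS).toISOString(),
  };
}

// Passing a fixed `now` makes the result deterministic.
const user = withTestTtl(
  { email: 'test_1@example.com' },
  24,
  new Date('2024-01-01T00:00:00.000Z'),
);
console.log(user.expires_at); // 2024-01-02T00:00:00.000Z
```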

Cleanup queue — collect IDs, delete in bulk

Track everything your tests create, then clean it all up in a global teardown:

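// In your base API client (runs inside a Playwright worker process).
// Workers and global teardown are separate processes, so an in-memory array is not shared.
// Persist the queue to a file instead (the file name here is only an example).
import { appendFileSync, existsSync, readFileSync, rmSync } from 'fs';
const CLEANUP_FILE = '.cleanup-queue.jsonl';

protected async post(url: string, data?: unknown) {
  const response = await this.request.post(url, { data });
  // Tolerate non-JSON responses instead of throwing inside the client
  const body = await response.json().catch(() => null);
  if (body?.id) {
    appendFileSync(CLEANUP_FILE, JSON.stringify({ url, id: body.id }) + '\n');
  }
  return response;
}
// global-teardown.ts (runs once in the main process after all workers finish)
export default async function globalTeardown() {
  if (!existsSync(CLEANUP_FILE)) return;
  const lines = readFileSync(CLEANUP_FILE, 'utf8').split('\n').filter(Boolean);
  for (const line of lines) {
    const item = JSON.parse(line);
    await api.delete(`${item.url}/${item.id}`);
  }
  rmSync(CLEANUP_FILE);
}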

Even if individual tests fail, the global teardown runs and cleans up the queue.

Which approach to use: TTL is the more reliable default — it works even if the test runner is killed with SIGKILL, because cleanup happens at the database level independently of your test process. Use TTL as your first line of defence. The cleanup queue is a good complement when you need guaranteed cleanup order or when your database doesn’t support scheduled jobs — but it won’t run if the process is hard-killed.
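When cleanup order matters, the queue is usually drained newest-first so that child records (orders) are deleted before their parents (users), and one failed delete should not stop the rest. A sketch under those assumptions; `drainCleanupQueue` and the `deleteFn` callback are hypothetical stand-ins for your API client.

```typescript
type CleanupItem = { url: string; id: string };

// Deletes items in reverse creation order and collects failures instead of aborting.
async function drainCleanupQueue(
  queue: CleanupItem[],
  deleteFn: (path: string) => Promise<void>,
): Promise<{ deleted: string[]; failed: string[] }> {
  const deleted: string[] = [];
  const failed: string[] = [];
  // Copy before reversing so the caller's array is left untouched.
  for (const item of [...queue].reverse()) {
    const path = `${item.url}/${item.id}`;
    try {
      await deleteFn(path);
      deleted.push(path);
    } catch {
      // Leave the record for the TTL job rather than failing the whole teardown.
      failed.push(path);
    }
  }
  return { deleted, failed };
}
```

Anything that lands in `failed` is exactly the case the TTL job exists for, which is why the two approaches work well together.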


Putting It Together: The Data Reliability Cheat Sheet

| Symptom | Root cause | Fix |
| --- | --- | --- |
| Duplicate records after CI failure | No idempotency on POST requests | Add X-Idempotency-Key header |
| Tests green, production broken | Mocks don’t match real API | Add contract tests for critical endpoints |
| page.route mock not working | Request bypasses browser | Use WireMock or request wrapper |
| Database full of test garbage | afterEach cleanup unreliable | TTL field + pg_cron or cleanup queue |
| External API causing flakiness | Real network calls in tests | WireMock for server-to-server calls |

You now have three layers covered: test infrastructure, object lifecycle, and data reliability. The next layer is observability — how do you measure test health, identify patterns in flakiness, and prove to your manager that stability work has business value?

Want to go deeper on any of these topics? Check out the advanced version: Why Your Test Suite Lies to You at Scale


All patterns in this article are implemented in the Playwright BDR Template on GitHub.